Files
ma/nextflow/specs/260410-fusion-gpu-metrics/research.md
2026-04-29 23:01:54 +02:00

50 lines
3.6 KiB
Markdown

# Research: Fusion GPU Metrics Collection
## R1: How to detect Fusion at trace collection time
**Decision**: Use `task.processor.executor.isFusionEnabled()` in `TaskHandler.getTraceRecord()`.
**Rationale**: TaskHandler already accesses the executor at line 222 (`task.processor.executor.getName()`), so this is a proven access path. The base `Executor.isFusionEnabled()` returns `false` by default, and Fusion-capable executors override it via `FusionHelper.isFusionEnabled(session)`. This works for all handler subclasses without requiring `instanceof` checks.
**Alternatives considered**:
- Checking `this instanceof FusionAwareTask`: Would miss custom executors that support Fusion but don't implement the trait. Also, `FusionAwareTask` is a trait on handler subclasses, not on the base `TaskHandler` where `getTraceRecord()` lives.
- Adding a Fusion flag to TaskRun/TaskConfig: Unnecessary complexity — Fusion is an executor-level property, not a per-task property.
## R2: Where to read `.fusion/trace.json`
**Decision**: Read it in `TaskHandler.getTraceRecord()`, right after the existing `.command.trace` parsing block (lines 244-253), gated by `task.processor.executor.isFusionEnabled()`.
**Rationale**: This is the single place where all task trace data is assembled, regardless of executor type. The existing `.command.trace` parsing already demonstrates the pattern: resolve a file in the work dir, parse it, handle `NoSuchFileException` and `IOException` gracefully.
**Alternatives considered**:
- Reading in each TaskHandler subclass: Would require changes across 7 handler subclasses in both core and plugins. Much higher blast radius.
- Reading in `TowerObserver`: Would couple Platform-specific code with file I/O. The observer should only transform data, not collect it.
## R3: Transient field pattern on TraceRecord
**Decision**: Add `transient private Map<String,Object> gpuMetrics` with getter/setter, following the exact `resourceAllocation` pattern.
**Rationale**: Transient fields on TraceRecord are the established mechanism for carrying executor-specific data to TowerObserver without persisting it in serialization (Kryo). The `resourceAllocation` field is the closest precedent — it's also a `Map<String,Object>` set during trace collection and consumed in `TowerObserver.makeTaskMap0()`.
**Implementation details**:
- Field: `transient private Map<String,Object> gpuMetrics`
- Getter: `Map<String,Object> getGpuMetrics()`
- Setter: `void setGpuMetrics(Map<String,Object> value)`
- In `makeTaskMap0()`: `record.gpuMetrics = trace.getGpuMetrics()`
## R4: JSON parsing approach
**Decision**: Use Groovy's `JsonSlurper` to parse `.fusion/trace.json` and extract the `gpu` key.
**Rationale**: `JsonSlurper` is already used throughout the Nextflow codebase (e.g., in tests and utilities). It parses JSON into native Groovy maps/lists, which is exactly what we need for the `Map<String,Object>` transient field. No additional dependencies required.
## R5: Test strategy
**Decision**: Three test locations following existing patterns.
1. **TraceRecordTest**: Verify `gpuMetrics` transient field is not persisted across serialization (follows `numSpotInterruptions` test pattern).
2. **TraceRecordTest**: Verify `parseFusionTraceFile()` correctly extracts GPU block from valid JSON, handles missing file, handles malformed JSON, handles missing GPU block.
3. **TowerClientTest**: Verify `gpuMetrics` is included in task map output (follows `resourceAllocation` test at lines 684-711).
**Rationale**: These three test locations mirror exactly how `resourceAllocation` and `numSpotInterruptions` are tested, ensuring consistency with project conventions.