5.5 KiB
Tasks: Fusion GPU Metrics Collection
Input: Design documents from /specs/260410-fusion-gpu-metrics/
Prerequisites: plan.md, spec.md, research.md
Tests: Included — the spec requires unit tests for both TraceRecord and TowerClient.
Organization: Tasks grouped by user story for independent implementation and testing.
Format: [ID] [P?] [Story] Description
- [P]: Can run in parallel (different files, no dependencies)
- [Story]: Which user story this task belongs to (e.g., US1, US2)
- Exact file paths included in descriptions
Phase 1: Foundational (TraceRecord transient field)
Purpose: Add the gpuMetrics transient field to TraceRecord — all subsequent tasks depend on this.
- T001 Add transient
gpuMetricsfield with getter/setter tomodules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy(afterresourceAllocationfield at line 128, getter/setter after line 649) - T002 Add static
parseFusionTraceFile(Path)method tomodules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy— parse.fusion/trace.jsonand return thegpublock asMap<String,Object>
Checkpoint: TraceRecord can hold and parse GPU metrics. No behavior change yet.
Phase 2: User Story 1 - GPU metrics sent to Platform (Priority: P1)
Goal: Read .fusion/trace.json on task completion, extract GPU block, send to Platform via TowerObserver.
Independent Test: Run a Fusion-enabled task with .fusion/trace.json containing a gpu block, verify GPU data appears in the Platform task payload.
Implementation
- T003 [US1] Read
.fusion/trace.jsoninTaskHandler.getTraceRecord()atmodules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy— add after.command.traceparsing block (after line 253), gated bytask.processor.executor.isFusionEnabled() - T004 [US1] Include
gpuMetricsin task payload inTowerObserver.makeTaskMap0()atplugins/nf-tower/src/main/io/seqera/tower/plugin/TowerObserver.groovy— addrecord.gpuMetrics = trace.getGpuMetrics()afterresourceAllocationline (line 476)
Tests
- T005 [P] [US1] Test
parseFusionTraceFilewith valid GPU block inmodules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy— create temp file with full trace.json, verify returned map has all GPU fields - T006 [P] [US1] Test
gpuMetricstransient field is not persisted across serialization inmodules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy— set field, serialize/deserialize, verify null - T007 [US1] Test
gpuMetricsincluded in task map inplugins/nf-tower/src/test/io/seqera/tower/plugin/TowerClientTest.groovy— create TraceRecord with gpuMetrics set, callmakeTasksReq(), verify output contains GPU data
Checkpoint: GPU metrics flow end-to-end from .fusion/trace.json to Platform payload. Run:
./gradlew :nextflow:test --tests "TraceRecordTest"
./gradlew :nf-tower:test --tests "TowerClientTest"
Phase 3: User Story 2 - Graceful error handling (Priority: P2)
Goal: Ensure missing, malformed, or GPU-less trace files don't break task reporting.
Independent Test: Simulate tasks with missing/malformed .fusion/trace.json, verify task trace is sent without GPU data and no errors.
Tests
- T008 [P] [US2] Test
parseFusionTraceFilewithout GPU block inmodules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy— create temp file with valid JSON but nogpukey, verify null returned - T009 [P] [US2] Test
parseFusionTraceFilewith malformed JSON inmodules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy— create temp file with invalid JSON, verify exception is thrown
Checkpoint: Error handling verified. The implementation in T003 already handles these cases via try/catch — these tests confirm the behavior.
Phase 4: Verification
Purpose: End-to-end validation across both modules.
- T010 Run smoke tests to verify no regressions:
make smoke
Dependencies & Execution Order
Phase Dependencies
- Phase 1 (T001, T002): No dependencies — start immediately
- Phase 2 (T003-T007): Depends on Phase 1 completion
- Phase 3 (T008-T009): Depends on Phase 1 (T002 specifically)
- Phase 4 (T010): Depends on all previous phases
Parallel Opportunities
- T001 and T002 modify the same file but different sections — execute sequentially
- T005, T006 are [P] — can run in parallel (same file but independent test methods)
- T008, T009 are [P] — can run in parallel
- T004 and T005/T006 are in different modules — can run in parallel after T001
Within Each Phase
Phase 1: T001 → T002
Phase 2: T003 → T004 (sequential: different modules but T004 depends on field from T001)
T005, T006 (parallel, after T002)
T007 (after T004)
Phase 3: T008, T009 (parallel, after T002)
Phase 4: T010 (after all)
Implementation Strategy
MVP (User Story 1 Only)
- Complete Phase 1: TraceRecord field + parser (T001-T002)
- Complete Phase 2: TaskHandler + TowerObserver + tests (T003-T007)
- STOP and VALIDATE: Run unit tests for both modules
- GPU metrics flow to Platform
Full Feature
- MVP above
- Add Phase 3: Error handling tests (T008-T009)
- Phase 4: Smoke tests (T010)
Summary
| Metric | Value |
|---|---|
| Total tasks | 10 |
| US1 tasks | 5 (T003-T007) |
| US2 tasks | 2 (T008-T009) |
| Foundational | 2 (T001-T002) |
| Verification | 1 (T010) |
| Files modified | 5 |
| Parallel opportunities | T005+T006, T008+T009 |