Files
ma/nextflow/specs/260410-fusion-gpu-metrics/tasks.md
2026-04-29 23:01:54 +02:00

132 lines
5.5 KiB
Markdown

# Tasks: Fusion GPU Metrics Collection
**Input**: Design documents from `/specs/260410-fusion-gpu-metrics/`
**Prerequisites**: plan.md, spec.md, research.md
**Tests**: Included — the spec requires unit tests for both TraceRecord and TowerClient.
**Organization**: Tasks grouped by user story for independent implementation and testing.
## Format: `[ID] [P?] [Story] Description`
- **[P]**: Can run in parallel (different files, no dependencies)
- **[Story]**: Which user story this task belongs to (e.g., US1, US2)
- Exact file paths included in descriptions
## Phase 1: Foundational (TraceRecord transient field)
**Purpose**: Add the `gpuMetrics` transient field to TraceRecord — all subsequent tasks depend on this.
- [ ] T001 Add transient `gpuMetrics` field with getter/setter to `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy` (after `resourceAllocation` field at line 128, getter/setter after line 649)
- [ ] T002 Add static `parseFusionTraceFile(Path)` method to `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy` — parse `.fusion/trace.json` and return the `gpu` block as `Map<String,Object>`
**Checkpoint**: TraceRecord can hold and parse GPU metrics. No behavior change yet.
---
## Phase 2: User Story 1 - GPU metrics sent to Platform (Priority: P1)
**Goal**: Read `.fusion/trace.json` on task completion, extract GPU block, send to Platform via TowerObserver.
**Independent Test**: Run a Fusion-enabled task with `.fusion/trace.json` containing a `gpu` block, verify GPU data appears in the Platform task payload.
### Implementation
- [ ] T003 [US1] Read `.fusion/trace.json` in `TaskHandler.getTraceRecord()` at `modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy` — add after `.command.trace` parsing block (after line 253), gated by `task.processor.executor.isFusionEnabled()`
- [ ] T004 [US1] Include `gpuMetrics` in task payload in `TowerObserver.makeTaskMap0()` at `plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerObserver.groovy` — add `record.gpuMetrics = trace.getGpuMetrics()` after `resourceAllocation` line (line 476)
### Tests
- [ ] T005 [P] [US1] Test `parseFusionTraceFile` with valid GPU block in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with full trace.json, verify returned map has all GPU fields
- [ ] T006 [P] [US1] Test `gpuMetrics` transient field is not persisted across serialization in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — set field, serialize/deserialize, verify null
- [ ] T007 [US1] Test `gpuMetrics` included in task map in `plugins/nf-tower/src/test/io/seqera/tower/plugin/TowerClientTest.groovy` — create TraceRecord with gpuMetrics set, call `makeTasksReq()`, verify output contains GPU data
**Checkpoint**: GPU metrics flow end-to-end from `.fusion/trace.json` to Platform payload. Run:
```bash
./gradlew :nextflow:test --tests "TraceRecordTest"
./gradlew :nf-tower:test --tests "TowerClientTest"
```
---
## Phase 3: User Story 2 - Graceful error handling (Priority: P2)
**Goal**: Ensure missing, malformed, or GPU-less trace files don't break task reporting.
**Independent Test**: Simulate tasks with missing/malformed `.fusion/trace.json`, verify task trace is sent without GPU data and no errors.
### Tests
- [ ] T008 [P] [US2] Test `parseFusionTraceFile` without GPU block in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with valid JSON but no `gpu` key, verify null returned
- [ ] T009 [P] [US2] Test `parseFusionTraceFile` with malformed JSON in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with invalid JSON, verify exception is thrown
**Checkpoint**: Error handling verified. The implementation in T003 already handles these cases via try/catch — these tests confirm the behavior.
---
## Phase 4: Verification
**Purpose**: End-to-end validation across both modules.
- [ ] T010 Run smoke tests to verify no regressions: `make smoke`
---
## Dependencies & Execution Order
### Phase Dependencies
- **Phase 1** (T001, T002): No dependencies — start immediately
- **Phase 2** (T003-T007): Depends on Phase 1 completion
- **Phase 3** (T008-T009): Depends on Phase 1 (T002 specifically)
- **Phase 4** (T010): Depends on all previous phases
### Parallel Opportunities
- T001 and T002 modify the same file but different sections — execute sequentially
- T005, T006 are [P] — can run in parallel (same file but independent test methods)
- T008, T009 are [P] — can run in parallel
- T004 and T005/T006 are in different modules — can run in parallel after T001
### Within Each Phase
```
Phase 1: T001 → T002
Phase 2: T003 → T004 (sequential: different modules but T004 depends on field from T001)
T005, T006 (parallel, after T002)
T007 (after T004)
Phase 3: T008, T009 (parallel, after T002)
Phase 4: T010 (after all)
```
---
## Implementation Strategy
### MVP (User Story 1 Only)
1. Complete Phase 1: TraceRecord field + parser (T001-T002)
2. Complete Phase 2: TaskHandler + TowerObserver + tests (T003-T007)
3. **STOP and VALIDATE**: Run unit tests for both modules
4. GPU metrics flow to Platform
### Full Feature
1. MVP above
2. Add Phase 3: Error handling tests (T008-T009)
3. Phase 4: Smoke tests (T010)
---
## Summary
| Metric | Value |
|--------|-------|
| Total tasks | 10 |
| US1 tasks | 5 (T003-T007) |
| US2 tasks | 2 (T008-T009) |
| Foundational | 2 (T001-T002) |
| Verification | 1 (T010) |
| Files modified | 5 |
| Parallel opportunities | T005+T006, T008+T009 |