Implementation Plan: Fusion GPU Metrics Collection
Branch: 260410-fusion-gpu-metrics-v2 | Date: 2026-04-10 | Spec: spec.md
Input: Feature specification from /specs/260410-fusion-gpu-metrics/spec.md
Summary
Collect GPU metrics from Fusion's .fusion/trace.json file on task completion and send them to Seqera Platform. The GPU block is carried as a transient Map<String,Object> field on TraceRecord (same pattern as resourceAllocation) and included in the task payload via TowerObserver.makeTaskMap0().
Technical Context
Language/Version: Groovy 4.0.29 / Java 17 target (Java 21 toolchain)
Primary Dependencies: Nextflow core (modules/nextflow), nf-tower plugin (plugins/nf-tower)
Storage: N/A (read-only file access to .fusion/trace.json)
Testing: Spock Framework (unit tests in both modules)
Target Platform: All Fusion-enabled executors (AWS Batch, Google Batch, Azure Batch, K8s, Seqera, SLURM)
Project Type: Multi-module Gradle project
Performance Goals: Negligible overhead — one small JSON file read per task completion
Constraints: Must not break existing trace pipeline; must be forward-compatible with evolving GPU block schema
Scale/Scope: 5 files modified (3 production, 2 test), ~80 lines of production code, ~120 lines of test code
Constitution Check
GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.
| Principle | Status | Notes |
|---|---|---|
| I. Modular Architecture | PASS | Core trace logic in modules/nextflow, Platform integration in plugins/nf-tower — correct placement |
| II. Test-Driven Quality | PASS | Unit tests planned for both TraceRecord and TowerClient |
| III. Dataflow Programming | N/A | No changes to dataflow model |
| IV. Apache 2.0 License | PASS | All modified files already have headers |
| V. DCO Sign-off | PASS | Will use git commit -s |
| VI. Semantic Versioning | PASS | No version bump needed — feature addition within existing release cycle |
| VII. Groovy Idioms | PASS | Uses JsonSlurper, follows existing getter/setter patterns |
Project Structure
Files to Modify
modules/nextflow/
├── src/main/groovy/nextflow/trace/TraceRecord.groovy # Add transient field + parsing method
├── src/main/groovy/nextflow/processor/TaskHandler.groovy # Read .fusion/trace.json on completion
└── src/test/groovy/nextflow/trace/TraceRecordTest.groovy # Test transient field + parsing
plugins/nf-tower/
├── src/main/io/seqera/tower/plugin/TowerObserver.groovy # Include gpuMetrics in task map
└── src/test/io/seqera/tower/plugin/TowerClientTest.groovy # Test gpuMetrics in task map
Implementation Tasks
Task 1: Add transient gpuMetrics field to TraceRecord
File: modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy
Changes:
- Add field after resourceAllocation (line 128): transient private Map<String,Object> gpuMetrics
- Add getter/setter after the existing resourceAllocation getter/setter (after line 649):
    Map<String,Object> getGpuMetrics() { return gpuMetrics }
    void setGpuMetrics(Map<String,Object> value) { this.gpuMetrics = value }
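To illustrate why the field is declared transient, here is a minimal sketch using a hypothetical stand-in class (DemoRecord is not part of Nextflow, and the map contents are made up). It assumes the record is persisted with standard Java serialization, which skips transient fields, so the GPU block is available in memory but is not written out.

```groovy
// Hypothetical stand-in for TraceRecord; only the transient-field behaviour is shown.
class DemoRecord implements Serializable {
    private static final long serialVersionUID = 1L

    Map<String,Object> store = [:]                     // serialized as usual
    transient private Map<String,Object> gpuMetrics    // skipped by Java serialization

    Map<String,Object> getGpuMetrics() { return gpuMetrics }
    void setGpuMetrics(Map<String,Object> value) { this.gpuMetrics = value }
}

def record = new DemoRecord()
record.store.name = 'task-1'
record.gpuMetrics = [demo_field: 1]                    // illustrative content only

// Serialize to a byte buffer
def buffer = new ByteArrayOutputStream()
buffer.withObjectOutputStream { it.writeObject(record) }

// Deserialize, resolving classes through the loader that defined DemoRecord
def copy = new ByteArrayInputStream(buffer.toByteArray())
        .newObjectInputStream(DemoRecord.class.classLoader)
        .readObject() as DemoRecord

assert copy.store.name == 'task-1'     // regular field survives the round trip
assert copy.gpuMetrics == null         // transient field does not
```

The same round trip is what the transient-field serialization test in Task 5 is meant to check on the real TraceRecord.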
Task 2: Add Fusion trace file parsing method to TraceRecord
File: modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy
Changes:
Add a static method to parse .fusion/trace.json and extract the gpu block:
    static Map<String,Object> parseFusionTraceFile(Path file) {
        final text = file.text
        final json = (Map) new JsonSlurper().parseText(text)
        return (Map<String,Object>) json.get('gpu')
    }
This keeps parsing logic on TraceRecord (consistent with parseTraceFile() for .command.trace).
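A standalone sketch of how the method behaves, assuming the trace file is a JSON object with an optional top-level gpu key. The field names inside the gpu block (demo_name, demo_count) are placeholders, since the plan does not fix the GPU schema. The method is copied at script level so the example runs without Nextflow on the classpath.

```groovy
import groovy.json.JsonSlurper
import java.nio.file.Files
import java.nio.file.Path

// Same body as the planned TraceRecord.parseFusionTraceFile, copied here to run standalone
Map<String,Object> parseFusionTraceFile(Path file) {
    final json = (Map) new JsonSlurper().parseText(file.text)
    return (Map<String,Object>) json.get('gpu')
}

final dir = Files.createTempDirectory('fusion-demo')

// Trace file with a gpu block (placeholder field names)
final withGpu = dir.resolve('with-gpu.json')
withGpu.text = '{"gpu": {"demo_name": "example-gpu", "demo_count": 2}, "other": {"x": 1}}'

// Trace file without a gpu block
final withoutGpu = dir.resolve('without-gpu.json')
withoutGpu.text = '{"other": {"x": 1}}'

final gpu = parseFusionTraceFile(withGpu)
assert gpu.demo_name == 'example-gpu'
assert gpu.demo_count == 2
assert parseFusionTraceFile(withoutGpu) == null   // missing key yields null

dir.deleteDir()
```

Because the whole block is returned as an untyped map, fields added to the gpu block later pass through without code changes, which is how the forward-compatibility constraint is met.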
Task 3: Read .fusion/trace.json in TaskHandler.getTraceRecord()
File: modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy
Changes:
After the existing .command.trace parsing block (after line 253), add:
    // collect Fusion GPU metrics
    if( task.processor.executor.isFusionEnabled() ) {
        final fusionTrace = task.workDir?.resolve('.fusion/trace.json')
        try {
            if( fusionTrace ) {
                final gpu = TraceRecord.parseFusionTraceFile(fusionTrace)
                if( gpu )
                    record.gpuMetrics = gpu
            }
        }
        catch( NoSuchFileException e ) {
            // ignore - Fusion trace may not exist
        }
        catch( Exception e ) {
            log.debug "[WARN] Cannot read Fusion trace file: $fusionTrace -- Cause: ${e.message}"
        }
    }
Key design decisions:
- Gated by task.processor.executor.isFusionEnabled(): no file access when Fusion is not enabled (FR-007)
- Placed inside the isCompleted() block but NOT gated by task status: runs for both success and failure (FR-005)
- Same error handling pattern as the .command.trace parsing above it (FR-006)
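The gating and error handling described above can be exercised in isolation. The sketch below uses a hypothetical helper, readGpuMetrics, that mirrors the Task 3 block. In the actual change the logic is inline in TaskHandler.getTraceRecord(), the flag comes from the executor, and the message goes to log.debug rather than println.

```groovy
import groovy.json.JsonSlurper
import java.nio.file.Files
import java.nio.file.NoSuchFileException
import java.nio.file.Path

// Hypothetical helper mirroring the Task 3 block; not part of Nextflow
Map<String,Object> readGpuMetrics(Path workDir, boolean fusionEnabled) {
    if( !fusionEnabled || workDir == null )
        return null                                   // gate: the file is never opened
    final fusionTrace = workDir.resolve('.fusion/trace.json')
    try {
        final json = (Map) new JsonSlurper().parseText(fusionTrace.text)
        return (Map<String,Object>) json.get('gpu')
    }
    catch( NoSuchFileException e ) {
        // expected when no Fusion trace file was written
    }
    catch( Exception e ) {
        println "[WARN] Cannot read Fusion trace file: $fusionTrace -- Cause: ${e.message}"
    }
    return null
}

final workDir = Files.createTempDirectory('task-work')
final missing = readGpuMetrics(workDir, true)         // no .fusion/trace.json yet

Files.createDirectories(workDir.resolve('.fusion'))
final traceFile = workDir.resolve('.fusion/trace.json')

traceFile.text = '{not json'
final malformed = readGpuMetrics(workDir, true)       // parse error is reported, not rethrown

traceFile.text = '{"gpu": {"demo_count": 1}}'         // placeholder field name
final present = readGpuMetrics(workDir, true)
final disabled = readGpuMetrics(workDir, false)       // Fusion off: file ignored

assert missing == null
assert malformed == null
assert present.demo_count == 1
assert disabled == null

workDir.deleteDir()
```

In all failure cases the result is null, so a missing or corrupt Fusion trace cannot fail the task or break the existing trace pipeline.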
Task 4: Include gpuMetrics in TowerObserver task payload
File: plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerObserver.groovy
Changes:
In makeTaskMap0() method, add after record.resourceAllocation = trace.getResourceAllocation() (after line 476):
record.gpuMetrics = trace.getGpuMetrics()
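As a rough illustration of the resulting payload shape, the sketch below builds a hypothetical task map and renders it with groovy.json.JsonOutput. The helper makeDemoTaskMap and the keys other than gpuMetrics are assumptions for the example. How the plugin actually serializes the request (including whether null values are kept) is not specified in this plan.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Hypothetical stand-in for the map built by makeTaskMap0(); taskId and status are illustrative
Map<String,Object> makeDemoTaskMap(Map<String,Object> gpuMetrics) {
    final record = new LinkedHashMap<String,Object>()
    record.taskId = 1
    record.status = 'COMPLETED'
    record.gpuMetrics = gpuMetrics      // null when no GPU block was collected
    return record
}

final slurper = new JsonSlurper()

// GPU block present: it is nested under the gpuMetrics key
final withGpu = slurper.parseText(JsonOutput.toJson(makeDemoTaskMap([demo_count: 2]))) as Map
assert withGpu.gpuMetrics.demo_count == 2

// GPU block absent: JsonOutput keeps the key with a null value
final noGpu = slurper.parseText(JsonOutput.toJson(makeDemoTaskMap(null))) as Map
assert noGpu.containsKey('gpuMetrics')
assert noGpu.gpuMetrics == null
```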
Task 5: Unit tests for TraceRecord
File: modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy
Tests to add:
- Transient field serialization test (follows numSpotInterruptions pattern):
  - Set gpuMetrics on a TraceRecord
  - Serialize and deserialize
  - Verify deserialized record has null for gpuMetrics
- parseFusionTraceFile with GPU block:
  - Create a temp file with valid trace.json content including a gpu block
  - Verify the returned map contains all GPU fields with correct values
- parseFusionTraceFile without GPU block:
  - Create a temp file with valid trace.json content without a gpu key
  - Verify null is returned
- parseFusionTraceFile with malformed JSON:
  - Create a temp file with invalid JSON
  - Verify an exception is thrown (caller handles it)
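The malformed-JSON case can be sketched with plain Groovy assertions; the real test would be a Spock feature method in TraceRecordTest. The parsing method is copied at script level so the example runs standalone. With Groovy's default JsonSlurper parser the failure is expected to surface as a groovy.json.JsonException, but the sketch only asserts that some exception reaches the caller.

```groovy
import groovy.json.JsonSlurper
import java.nio.file.Files
import java.nio.file.Path

// Same body as the planned TraceRecord.parseFusionTraceFile, copied here to run standalone
Map<String,Object> parseFusionTraceFile(Path file) {
    final json = (Map) new JsonSlurper().parseText(file.text)
    return (Map<String,Object>) json.get('gpu')
}

final file = Files.createTempFile('trace', '.json')
file.text = '{not json'                  // deliberately invalid JSON

Exception caught = null
try {
    parseFusionTraceFile(file)
}
catch( Exception e ) {
    caught = e                           // the parser does not swallow the error
}

assert caught != null
Files.delete(file)
```

Letting the exception propagate keeps the parser simple and leaves the decision to log and continue with the caller in TaskHandler.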
Task 6: Unit tests for TowerClient/TowerObserver
File: plugins/nf-tower/src/test/io/seqera/tower/plugin/TowerClientTest.groovy
Test to add (follows resourceAllocation test at lines 684-711):
- Create a TraceRecord with gpuMetrics set to a GPU metrics map
- Call makeTasksReq([trace])
- Verify req.tasks[0].gpuMetrics contains the GPU data
Implementation Order
- Task 1 + Task 2 (TraceRecord changes) — no dependencies
- Task 3 (TaskHandler) — depends on Task 1+2
- Task 4 (TowerObserver) — depends on Task 1
- Task 5 (TraceRecord tests) — depends on Task 1+2
- Task 6 (TowerClient tests) — depends on Task 4
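Task 4 needs the getter from Task 1, so Task 1 comes first. After that, the core-module track (Tasks 2, 3 and 5) can proceed in parallel with the plugin track (Tasks 4 and 6).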
Verification
After implementation, run:
./gradlew :nextflow:test --tests "TraceRecordTest"
./gradlew :nf-tower:test --tests "TowerClientTest"
make smoke # verify no regressions