add nextflow d30e48d
@@ -0,0 +1,35 @@
# Specification Quality Checklist: Fusion GPU Metrics Collection

**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-04-10
**Feature**: [spec.md](../spec.md)

## Content Quality

- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed

## Requirement Completeness

- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified

## Feature Readiness

- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification

## Notes

- All items pass. Spec references internal Nextflow concepts (TraceRecord, TowerObserver) by necessity since this is an internal infrastructure feature, but avoids prescribing implementation approach.
- The `resourceAllocation` pattern reference in FR-003 is a design constraint from the user, not an implementation detail leak.
nextflow/specs/260410-fusion-gpu-metrics/plan.md
@@ -0,0 +1,179 @@
# Implementation Plan: Fusion GPU Metrics Collection

**Branch**: `260410-fusion-gpu-metrics-v2` | **Date**: 2026-04-10 | **Spec**: [spec.md](spec.md)
**Input**: Feature specification from `/specs/260410-fusion-gpu-metrics/spec.md`

## Summary

Collect GPU metrics from Fusion's `.fusion/trace.json` file on task completion and send them to Seqera Platform. The GPU block is carried as a transient `Map<String,Object>` field on `TraceRecord` (same pattern as `resourceAllocation`) and included in the task payload via `TowerObserver.makeTaskMap0()`.

## Technical Context

**Language/Version**: Groovy 4.0.29 / Java 17 target (Java 21 toolchain)
**Primary Dependencies**: Nextflow core (`modules/nextflow`), nf-tower plugin (`plugins/nf-tower`)
**Storage**: N/A (read-only file access to `.fusion/trace.json`)
**Testing**: Spock Framework (unit tests in both modules)
**Target Platform**: All Fusion-enabled executors (AWS Batch, Google Batch, Azure Batch, K8s, Seqera, SLURM)
**Project Type**: Multi-module Gradle project
**Performance Goals**: Negligible overhead — one small JSON file read per task completion
**Constraints**: Must not break existing trace pipeline; must be forward-compatible with evolving GPU block schema
**Scale/Scope**: 5 files modified (3 production, 2 test), ~80 lines of production code, ~120 lines of test code

## Constitution Check

*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*

| Principle | Status | Notes |
|-----------|--------|-------|
| I. Modular Architecture | PASS | Core trace logic in `modules/nextflow`, Platform integration in `plugins/nf-tower` — correct placement |
| II. Test-Driven Quality | PASS | Unit tests planned for both TraceRecord and TowerClient |
| III. Dataflow Programming | N/A | No changes to dataflow model |
| IV. Apache 2.0 License | PASS | All modified files already have headers |
| V. DCO Sign-off | PASS | Will use `git commit -s` |
| VI. Semantic Versioning | PASS | No version bump needed — feature addition within existing release cycle |
| VII. Groovy Idioms | PASS | Uses JsonSlurper, follows existing getter/setter patterns |

## Project Structure

### Files to Modify

```text
modules/nextflow/
├── src/main/groovy/nextflow/trace/TraceRecord.groovy        # Add transient field + parsing method
├── src/main/groovy/nextflow/processor/TaskHandler.groovy    # Read .fusion/trace.json on completion
└── src/test/groovy/nextflow/trace/TraceRecordTest.groovy    # Test transient field + parsing

plugins/nf-tower/
├── src/main/io/seqera/tower/plugin/TowerObserver.groovy     # Include gpuMetrics in task map
└── src/test/io/seqera/tower/plugin/TowerClientTest.groovy   # Test gpuMetrics in task map
```

## Implementation Tasks

### Task 1: Add transient `gpuMetrics` field to TraceRecord

**File**: `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy`

**Changes**:
1. Add field after `resourceAllocation` (line 128):
   ```groovy
   transient private Map<String,Object> gpuMetrics
   ```
2. Add getter/setter after existing `resourceAllocation` getter/setter (after line 649):
   ```groovy
   Map<String,Object> getGpuMetrics() {
       return gpuMetrics
   }

   void setGpuMetrics(Map<String,Object> value) {
       this.gpuMetrics = value
   }
   ```

### Task 2: Add Fusion trace file parsing method to TraceRecord

**File**: `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy`

**Changes**:
Add a static method to parse `.fusion/trace.json` and extract the `gpu` block:
```groovy
// Requires `groovy.json.JsonSlurper` to be imported in TraceRecord.groovy.
// Returns the `gpu` block, or null when the key is absent.
// I/O and JSON parsing errors propagate to the caller.
static Map<String,Object> parseFusionTraceFile(Path file) {
    final text = file.text
    final json = (Map) new JsonSlurper().parseText(text)
    return (Map<String,Object>) json.get('gpu')
}
```

This keeps parsing logic on TraceRecord (consistent with `parseTraceFile()` for `.command.trace`).
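A minimal usage sketch of the parser, written as a standalone Groovy script so it can run outside Nextflow. It assumes the method body shown in Task 2; the temp directory name and the sample values are illustrative only.

```groovy
import groovy.json.JsonSlurper
import java.nio.file.Files
import java.nio.file.Path

// Standalone copy of the Task 2 method so the script runs outside Nextflow
Map<String,Object> parseFusionTraceFile(Path file) {
    final json = (Map) new JsonSlurper().parseText(file.text)
    return (Map<String,Object>) json.get('gpu')
}

// Write a tiny trace file into a throwaway directory (illustrative values)
final dir = Files.createTempDirectory('fusion-trace-demo')
final file = dir.resolve('trace.json')
file.text = '{"proc": {"realtime": 1000}, "gpu": {"name": "Tesla T4", "pct": 75}}'

final gpu = parseFusionTraceFile(file)
assert gpu.name == 'Tesla T4'
assert gpu.pct == 75

// A CPU-only trace has no `gpu` key, so the parser returns null
file.text = '{"proc": {"realtime": 1000}}'
assert parseFusionTraceFile(file) == null
```

Returning `null` for a missing `gpu` key lets the caller use a plain truthiness check before setting the field on the record.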

### Task 3: Read `.fusion/trace.json` in TaskHandler.getTraceRecord()

**File**: `modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy`

**Changes**:
After the existing `.command.trace` parsing block (after line 253), add:
```groovy
// collect Fusion GPU metrics
if( task.processor.executor.isFusionEnabled() ) {
    final fusionTrace = task.workDir?.resolve('.fusion/trace.json')
    try {
        if( fusionTrace ) {
            final gpu = TraceRecord.parseFusionTraceFile(fusionTrace)
            if( gpu )
                record.gpuMetrics = gpu
        }
    }
    catch( NoSuchFileException e ) {
        // ignore - Fusion trace may not exist
    }
    catch( Exception e ) {
        log.debug "[WARN] Cannot read Fusion trace file: $fusionTrace -- Cause: ${e.message}"
    }
}
```

**Key design decisions**:
- Gated by `task.processor.executor.isFusionEnabled()` — no file access when Fusion is not enabled (FR-007)
- Placed inside `isCompleted()` block but NOT gated by task status — runs for both success and failure (FR-005)
- Same error handling pattern as `.command.trace` parsing above it (FR-006)
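The error-handling behavior can be exercised in isolation with a sketch like the following. `readGpuMetricsOrNull` is a hypothetical helper, not part of Nextflow, and `println` stands in for the debug logger; the three cases correspond to a missing file, malformed JSON, and a valid file.

```groovy
import groovy.json.JsonSlurper
import java.nio.file.Files
import java.nio.file.NoSuchFileException
import java.nio.file.Path

// Hypothetical helper: returns the `gpu` block, or null when it cannot be read
Map<String,Object> readGpuMetricsOrNull(Path workDir) {
    final fusionTrace = workDir.resolve('.fusion/trace.json')
    try {
        final json = (Map) new JsonSlurper().parseText(fusionTrace.text)
        return (Map<String,Object>) json.get('gpu')
    }
    catch( NoSuchFileException e ) {
        return null     // expected when Fusion never wrote the file
    }
    catch( Exception e ) {
        // the plan logs at debug level here; println stands in for the logger
        println "Cannot read Fusion trace file: $fusionTrace -- Cause: ${e.message}"
        return null
    }
}

final workDir = Files.createTempDirectory('task-workdir')

// 1. Missing file: no metrics, no error
assert readGpuMetricsOrNull(workDir) == null

// 2. Malformed JSON: message printed, no metrics, no error
Files.createDirectories(workDir.resolve('.fusion'))
workDir.resolve('.fusion/trace.json').text = '{ not valid json'
assert readGpuMetricsOrNull(workDir) == null

// 3. Valid JSON with a gpu block: metrics returned
workDir.resolve('.fusion/trace.json').text = '{"gpu": {"name": "Tesla T4"}}'
assert readGpuMetricsOrNull(workDir).name == 'Tesla T4'
```

Catching `NoSuchFileException` separately keeps the common "file was never written" case silent, while anything else still leaves a trace in the log.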

### Task 4: Include `gpuMetrics` in TowerObserver task payload

**File**: `plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerObserver.groovy`

**Changes**:
In `makeTaskMap0()` method, add after `record.resourceAllocation = trace.getResourceAllocation()` (after line 476):
```groovy
record.gpuMetrics = trace.getGpuMetrics()
```

### Task 5: Unit tests for TraceRecord

**File**: `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy`

**Tests to add**:

1. **Transient field serialization test** (follows `numSpotInterruptions` pattern):
   - Set `gpuMetrics` on a TraceRecord
   - Serialize and deserialize
   - Verify deserialized record has `null` for `gpuMetrics`

2. **parseFusionTraceFile with GPU block**:
   - Create a temp file with valid trace.json content including a `gpu` block
   - Verify the returned map contains all GPU fields with correct values

3. **parseFusionTraceFile without GPU block**:
   - Create a temp file with valid trace.json content without a `gpu` key
   - Verify `null` is returned

4. **parseFusionTraceFile with malformed JSON**:
   - Create a temp file with invalid JSON
   - Verify an exception is thrown (caller handles it)
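The idea behind the transient-field test can be sketched without Nextflow or Spock. `DemoRecord` is a hypothetical stand-in for `TraceRecord`, and plain Java serialization is used here as an assumption; the real test would follow whatever mechanism the existing `numSpotInterruptions` test uses.

```groovy
// Hypothetical stand-in for TraceRecord: one persisted property, one transient field
class DemoRecord implements Serializable {
    String name
    transient private Map<String,Object> gpuMetrics

    Map<String,Object> getGpuMetrics() { return gpuMetrics }
    void setGpuMetrics(Map<String,Object> value) { this.gpuMetrics = value }
}

final original = new DemoRecord(name: 'task-1')
original.gpuMetrics = [name: 'Tesla T4', pct: 75]

// Serialize to an in-memory buffer
final buffer = new ByteArrayOutputStream()
buffer.withObjectOutputStream { it.writeObject(original) }

// Deserialize, passing the class loader that knows the script-defined class
final copy = (DemoRecord) new ByteArrayInputStream(buffer.toByteArray())
        .withObjectInputStream(DemoRecord.classLoader) { it.readObject() }

assert copy.name == 'task-1'          // regular state survives the round trip
assert copy.gpuMetrics == null        // transient state is dropped
```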

### Task 6: Unit tests for TowerClient/TowerObserver

**File**: `plugins/nf-tower/src/test/io/seqera/tower/plugin/TowerClientTest.groovy`

**Test to add** (follows `resourceAllocation` test at lines 684-711):
- Create a TraceRecord with `gpuMetrics` set to a GPU metrics map
- Call `makeTasksReq([trace])`
- Verify `req.tasks[0].gpuMetrics` contains the GPU data
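As a rough illustration of what the test checks, the sketch below builds a task map and verifies that the GPU block survives JSON encoding. `makeTaskMap` is a hypothetical, heavily reduced stand-in for `makeTaskMap0()`, and the field names other than `gpuMetrics` and `resourceAllocation` are assumptions.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Hypothetical, reduced stand-in for makeTaskMap0(): copies selected fields into the task map
Map<String,Object> makeTaskMap(Map<String,Object> trace) {
    final record = new LinkedHashMap<String,Object>()
    record.taskId = trace.taskId
    record.resourceAllocation = trace.resourceAllocation
    record.gpuMetrics = trace.gpuMetrics     // forwarded as-is, may be null
    return record
}

final trace = [taskId: 1, gpuMetrics: [name: 'Tesla T4', pct: 75, peak: 100]]
final payload = JsonOutput.toJson([tasks: [makeTaskMap(trace)]])

// Parse the payload back to check that the GPU block survived intact
final req = (Map) new JsonSlurper().parseText(payload)
assert req.tasks[0].gpuMetrics.name == 'Tesla T4'
assert req.tasks[0].gpuMetrics.pct == 75
```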

## Implementation Order

1. **Task 1 + Task 2** (TraceRecord changes) — no dependencies
2. **Task 3** (TaskHandler) — depends on Task 1+2
3. **Task 4** (TowerObserver) — depends on Task 1
4. **Task 5** (TraceRecord tests) — depends on Task 1+2
5. **Task 6** (TowerClient tests) — depends on Task 4

Task 4 depends only on Task 1, so once Task 1 is done, Tasks 4 and 6 (nf-tower plugin) can proceed in parallel with Tasks 2, 3, and 5 (core module).

## Verification

After implementation, run:
```bash
./gradlew :nextflow:test --tests "TraceRecordTest"
./gradlew :nf-tower:test --tests "TowerClientTest"
make smoke  # verify no regressions
```
nextflow/specs/260410-fusion-gpu-metrics/research.md
@@ -0,0 +1,49 @@
# Research: Fusion GPU Metrics Collection

## R1: How to detect Fusion at trace collection time

**Decision**: Use `task.processor.executor.isFusionEnabled()` in `TaskHandler.getTraceRecord()`.

**Rationale**: TaskHandler already accesses the executor at line 222 (`task.processor.executor.getName()`), so this is a proven access path. The base `Executor.isFusionEnabled()` returns `false` by default, and Fusion-capable executors override it via `FusionHelper.isFusionEnabled(session)`. This works for all handler subclasses without requiring `instanceof` checks.

**Alternatives considered**:
- Checking `this instanceof FusionAwareTask`: Would miss custom executors that support Fusion but don't implement the trait. Also, `FusionAwareTask` is a trait on handler subclasses, not on the base `TaskHandler` where `getTraceRecord()` lives.
- Adding a Fusion flag to TaskRun/TaskConfig: Unnecessary complexity — Fusion is an executor-level property, not a per-task property.
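The default-plus-override pattern described in the rationale can be sketched with simplified classes. `BaseExecutor`, `FusionCapableExecutor`, and `shouldReadFusionTrace` are hypothetical names for illustration; the real `Executor` hierarchy has many more members and derives the flag from the session.

```groovy
// Hypothetical, simplified classes; not the real Nextflow Executor hierarchy
class BaseExecutor {
    boolean isFusionEnabled() { return false }   // default: Fusion off
}

class FusionCapableExecutor extends BaseExecutor {
    private final boolean enabledInConfig

    FusionCapableExecutor(boolean enabledInConfig) { this.enabledInConfig = enabledInConfig }

    @Override
    boolean isFusionEnabled() { return enabledInConfig }
}

// The caller only needs the base type, so no instanceof check is required
boolean shouldReadFusionTrace(BaseExecutor executor) {
    return executor.isFusionEnabled()
}

assert !shouldReadFusionTrace(new BaseExecutor())
assert shouldReadFusionTrace(new FusionCapableExecutor(true))
assert !shouldReadFusionTrace(new FusionCapableExecutor(false))
```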

## R2: Where to read `.fusion/trace.json`

**Decision**: Read it in `TaskHandler.getTraceRecord()`, right after the existing `.command.trace` parsing block (lines 244-253), gated by `task.processor.executor.isFusionEnabled()`.

**Rationale**: This is the single place where all task trace data is assembled, regardless of executor type. The existing `.command.trace` parsing already demonstrates the pattern: resolve a file in the work dir, parse it, handle `NoSuchFileException` and `IOException` gracefully.

**Alternatives considered**:
- Reading in each TaskHandler subclass: Would require changes across 7 handler subclasses in both core and plugins. Much higher blast radius.
- Reading in `TowerObserver`: Would couple Platform-specific code with file I/O. The observer should only transform data, not collect it.

## R3: Transient field pattern on TraceRecord

**Decision**: Add `transient private Map<String,Object> gpuMetrics` with getter/setter, following the exact `resourceAllocation` pattern.

**Rationale**: Transient fields on TraceRecord are the established mechanism for carrying executor-specific data to TowerObserver without persisting it in serialization (Kryo). The `resourceAllocation` field is the closest precedent — it's also a `Map<String,Object>` set during trace collection and consumed in `TowerObserver.makeTaskMap0()`.

**Implementation details**:
- Field: `transient private Map<String,Object> gpuMetrics`
- Getter: `Map<String,Object> getGpuMetrics()`
- Setter: `void setGpuMetrics(Map<String,Object> value)`
- In `makeTaskMap0()`: `record.gpuMetrics = trace.getGpuMetrics()`

## R4: JSON parsing approach

**Decision**: Use Groovy's `JsonSlurper` to parse `.fusion/trace.json` and extract the `gpu` key.

**Rationale**: `JsonSlurper` is already used throughout the Nextflow codebase (e.g., in tests and utilities). It parses JSON into native Groovy maps/lists, which is exactly what we need for the `Map<String,Object>` transient field. No additional dependencies required.
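A small sketch of what `JsonSlurper` hands back for a `gpu` block. The `extra` key is invented for illustration; it shows that fields outside the known schema are kept without any mapping code, which is what the forward-compatibility constraint relies on.

```groovy
import groovy.json.JsonSlurper

// Illustrative input; `extra` is a made-up field that is not part of the known schema
final text = '{"gpu": {"name": "Tesla T4", "pct": 75, "peak_mem": 74.140625, "extra": {"k": [1, 2]}}}'
final json = new JsonSlurper().parseText(text)

assert json instanceof Map
final gpu = (Map<String,Object>) json.gpu

assert gpu.name instanceof String
assert gpu.pct instanceof Number
assert gpu.peak_mem instanceof Number
assert gpu.extra.k instanceof List      // nested values keep their structure
assert gpu.keySet().containsAll(['name', 'pct', 'peak_mem', 'extra'])
```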

## R5: Test strategy

**Decision**: Three groups of tests across two test classes, following existing patterns.

1. **TraceRecordTest**: Verify `gpuMetrics` transient field is not persisted across serialization (follows `numSpotInterruptions` test pattern).
2. **TraceRecordTest**: Verify `parseFusionTraceFile()` correctly extracts GPU block from valid JSON, handles missing file, handles malformed JSON, handles missing GPU block.
3. **TowerClientTest**: Verify `gpuMetrics` is included in task map output (follows `resourceAllocation` test at lines 684-711).

**Rationale**: These three test groups mirror how `resourceAllocation` and `numSpotInterruptions` are tested, ensuring consistency with project conventions.
nextflow/specs/260410-fusion-gpu-metrics/spec.md
@@ -0,0 +1,147 @@
# Feature Specification: Fusion GPU Metrics Collection

**Feature Branch**: `260410-fusion-gpu-metrics`
**Created**: 2026-04-10
**Status**: Draft
**Input**: User description: "Collect GPU metrics from Fusion trace.json and send to Seqera Platform via TowerClient"

## User Scenarios & Testing *(mandatory)*

### User Story 1 - GPU metrics sent to Platform on task completion (Priority: P1)

A user runs a Nextflow pipeline with Fusion enabled on a GPU-equipped executor (e.g., AWS Batch, Google Batch, Kubernetes). When each task completes, Nextflow reads the Fusion-generated `.fusion/trace.json` file from the task work directory, extracts the `gpu` block, and includes it in the task trace data sent to Seqera Platform. The user can then view GPU utilization metrics (compute %, memory %, active time, etc.) for each task in the Platform UI.

**Why this priority**: This is the core feature. Without it, GPU usage is invisible to Platform users running Fusion-enabled pipelines.

**Independent Test**: Can be tested by running a Fusion-enabled task that produces a `.fusion/trace.json` with a `gpu` block, then verifying the GPU data appears in the task payload sent to Platform.

**Acceptance Scenarios**:

1. **Given** a completed task with Fusion enabled and a valid `.fusion/trace.json` containing a `gpu` block, **When** the task trace is collected, **Then** all GPU metrics from the `gpu` block are included in the task data sent to Platform.
2. **Given** a completed task with Fusion enabled and a valid `.fusion/trace.json` without a `gpu` block (CPU-only task), **When** the task trace is collected, **Then** no GPU metrics are sent and no error occurs.
3. **Given** a failed task with Fusion enabled and a valid `.fusion/trace.json` containing a `gpu` block, **When** the task trace is collected, **Then** GPU metrics are still sent (metrics are collected irrespective of task status).

---

### User Story 2 - Graceful handling when trace.json is missing or malformed (Priority: P2)

When Fusion's `.fusion/trace.json` file is missing (e.g., task was killed before Fusion wrote it) or contains invalid JSON, the system logs a debug-level warning and proceeds without GPU metrics. The task trace is still sent to Platform with all other fields intact.

**Why this priority**: Robustness is essential — GPU metrics are supplementary data and must never cause task reporting to fail.

**Independent Test**: Can be tested by simulating a completed task where `.fusion/trace.json` is absent or contains malformed JSON, and verifying the task trace is still sent successfully without GPU data.

**Acceptance Scenarios**:

1. **Given** a completed Fusion-enabled task where `.fusion/trace.json` does not exist, **When** the task trace is collected, **Then** no GPU metrics are included and no error is raised.
2. **Given** a completed Fusion-enabled task where `.fusion/trace.json` contains invalid JSON, **When** the task trace is collected, **Then** the file is skipped with a debug log message and the task trace is sent without GPU data.
3. **Given** a completed Fusion-enabled task where `.fusion/trace.json` exists but the `gpu` block is null/absent, **When** the task trace is collected, **Then** no GPU metrics are included and no error is raised.

---

### Edge Cases

- What happens when the `gpu` block contains unexpected or extra fields not in the known schema? They are included as-is (forward compatibility).
- What happens when Fusion is not enabled for a task? No attempt is made to read `.fusion/trace.json`.
- What happens when the task work directory is inaccessible at trace collection time (e.g., remote storage timeout)? The same error handling as existing `.command.trace` parsing applies — log and continue.

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: System MUST read the file `.fusion/trace.json` from the task work directory on task completion when the executor has Fusion enabled.
- **FR-002**: System MUST extract the entire `gpu` block from the parsed `trace.json` as a map.
- **FR-003**: System MUST store the GPU metrics as a transient field on `TraceRecord` (following the same pattern as `resourceAllocation`).
- **FR-004**: System MUST include the GPU metrics map in the task payload sent to Seqera Platform via the Tower observer.
- **FR-005**: System MUST collect GPU metrics irrespective of task completion status (success or failure).
- **FR-006**: System MUST NOT fail or disrupt task trace reporting if `.fusion/trace.json` is missing, unreadable, or malformed.
- **FR-007**: System MUST only attempt to read `.fusion/trace.json` when Fusion is enabled for the executor.

### Key Entities

- **Fusion Trace File**: JSON file at `.fusion/trace.json` in the task work directory, produced by the Fusion client. Contains `proc`, `gpu`, and `cgroup` blocks with runtime metrics.
- **GPU Metrics Block**: The `gpu` object within `trace.json`, containing fields: `name`, `mem`, `driver`, `active_time`, `pct`, `peak`, `pct_mem`, `peak_mem`, `avg_mem`, `peak_mem_used`, `avg_mem_bw_util`, `peak_mem_bw_util`.

#### Example `.fusion/trace.json`

```json
{
  "proc": {
    "realtime": 660541,
    "pct_cpu": 1045,
    "cpu_name": "Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz",
    "arch": "linux/amd64",
    "rchar": 14112539262,
    "wchar": 12668821375,
    "syscr": 1823378,
    "syscw": 169293,
    "read_bytes": 8011776,
    "write_bytes": 102400,
    "pct_mem": 56,
    "vmem": 39015152,
    "rss": 14826068,
    "peak_vmem": 39047920,
    "peak_rss": 15775480,
    "vol_ctxt": 413015,
    "inv_ctxt": 1540
  },
  "gpu": {
    "name": "Tesla T4",
    "mem": 15360,
    "driver": "580.126.09",
    "active_time": 651030,
    "pct": 75,
    "peak": 100,
    "pct_mem": 40.11115345483025,
    "peak_mem": 74.140625,
    "avg_mem": 6161,
    "peak_mem_used": 11388,
    "avg_mem_bw_util": 43,
    "peak_mem_bw_util": 83
  },
  "cgroup": {
    "version": "v2",
    "memory_current": 25469927424,
    "memory_peak": 41178980352,
    "memory_rss": 67919872,
    "memory_peak_rss": 14783070208,
    "cpu_usage_usec": 785302059,
    "cpu_user_usec": 549732867,
    "cpu_system_usec": 235569192,
    "io_read_bytes": 8503296,
    "io_write_bytes": 12671918080,
    "io_read_ops": 98,
    "io_write_ops": 97975,
    "memory_limit": 77309411328,
    "cpu_quota": 0,
    "cpu_period": 0,
    "memory_oom_kills": 0,
    "cpu_nr_throttled": 0,
    "cpu_throttled_usec": 0,
    "cpu_psi_some": 582969,
    "cpu_psi_full": 582860,
    "memory_psi_some": 0,
    "memory_psi_full": 0,
    "io_psi_some": 1038270,
    "io_psi_full": 1037514
  }
}
```
- **TraceRecord GPU field**: New transient field on `TraceRecord` that carries the GPU metrics map through the existing trace pipeline to the Tower observer, following the `resourceAllocation` pattern.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: GPU metrics from Fusion trace files are visible in Seqera Platform for all Fusion-enabled tasks that ran on GPU hardware.
- **SC-002**: Tasks without GPU usage or without Fusion enabled report successfully with no GPU data and no errors.
- **SC-003**: A missing or malformed `.fusion/trace.json` does not cause any task to fail reporting — 100% of tasks still have their standard metrics delivered.
- **SC-004**: GPU metrics collection adds negligible overhead — reading and parsing a single small JSON file per task completion.

## Assumptions

- The Fusion client is responsible for creating `.fusion/trace.json` in the task work directory. Nextflow only reads it.
- The `gpu` block schema may evolve over time. The implementation forwards the entire block as a map rather than mapping to fixed fields, ensuring forward compatibility.
- Seqera Platform API already accepts or will be updated to accept the GPU metrics payload alongside existing task trace data.
- The file path `.fusion/trace.json` is stable and defined by the Fusion client contract.
- All executors that support Fusion (AWS Batch, Google Batch, Azure Batch, Kubernetes, Seqera, SLURM) benefit from this feature without executor-specific code — the detection is based on whether Fusion is enabled, not on the executor type.
nextflow/specs/260410-fusion-gpu-metrics/tasks.md
@@ -0,0 +1,131 @@
# Tasks: Fusion GPU Metrics Collection

**Input**: Design documents from `/specs/260410-fusion-gpu-metrics/`
**Prerequisites**: plan.md, spec.md, research.md

**Tests**: Included — the spec requires unit tests for both TraceRecord and TowerClient.

**Organization**: Tasks grouped by user story for independent implementation and testing.

## Format: `[ID] [P?] [Story] Description`

- **[P]**: Can run in parallel (different files, no dependencies)
- **[Story]**: Which user story this task belongs to (e.g., US1, US2)
- Exact file paths included in descriptions

## Phase 1: Foundational (TraceRecord transient field)

**Purpose**: Add the `gpuMetrics` transient field to TraceRecord — all subsequent tasks depend on this.

- [ ] T001 Add transient `gpuMetrics` field with getter/setter to `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy` (after `resourceAllocation` field at line 128, getter/setter after line 649)
- [ ] T002 Add static `parseFusionTraceFile(Path)` method to `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy` — parse `.fusion/trace.json` and return the `gpu` block as `Map<String,Object>`

**Checkpoint**: TraceRecord can hold and parse GPU metrics. No behavior change yet.

---

## Phase 2: User Story 1 - GPU metrics sent to Platform (Priority: P1)

**Goal**: Read `.fusion/trace.json` on task completion, extract GPU block, send to Platform via TowerObserver.

**Independent Test**: Run a Fusion-enabled task with `.fusion/trace.json` containing a `gpu` block, verify GPU data appears in the Platform task payload.

### Implementation

- [ ] T003 [US1] Read `.fusion/trace.json` in `TaskHandler.getTraceRecord()` at `modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy` — add after `.command.trace` parsing block (after line 253), gated by `task.processor.executor.isFusionEnabled()`
- [ ] T004 [US1] Include `gpuMetrics` in task payload in `TowerObserver.makeTaskMap0()` at `plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerObserver.groovy` — add `record.gpuMetrics = trace.getGpuMetrics()` after `resourceAllocation` line (line 476)

### Tests

- [ ] T005 [P] [US1] Test `parseFusionTraceFile` with valid GPU block in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with full trace.json, verify returned map has all GPU fields
- [ ] T006 [P] [US1] Test `gpuMetrics` transient field is not persisted across serialization in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — set field, serialize/deserialize, verify null
- [ ] T007 [US1] Test `gpuMetrics` included in task map in `plugins/nf-tower/src/test/io/seqera/tower/plugin/TowerClientTest.groovy` — create TraceRecord with gpuMetrics set, call `makeTasksReq()`, verify output contains GPU data

**Checkpoint**: GPU metrics flow end-to-end from `.fusion/trace.json` to Platform payload. Run:
```bash
./gradlew :nextflow:test --tests "TraceRecordTest"
./gradlew :nf-tower:test --tests "TowerClientTest"
```

---

## Phase 3: User Story 2 - Graceful error handling (Priority: P2)

**Goal**: Ensure missing, malformed, or GPU-less trace files don't break task reporting.

**Independent Test**: Simulate tasks with missing/malformed `.fusion/trace.json`, verify task trace is sent without GPU data and no errors.

### Tests

- [ ] T008 [P] [US2] Test `parseFusionTraceFile` without GPU block in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with valid JSON but no `gpu` key, verify null returned
- [ ] T009 [P] [US2] Test `parseFusionTraceFile` with malformed JSON in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with invalid JSON, verify exception is thrown

**Checkpoint**: Error handling verified. The implementation in T003 already handles these cases via try/catch — these tests confirm the behavior.

---

## Phase 4: Verification

**Purpose**: End-to-end validation across both modules.

- [ ] T010 Run smoke tests to verify no regressions: `make smoke`

---

## Dependencies & Execution Order

### Phase Dependencies

- **Phase 1** (T001, T002): No dependencies — start immediately
- **Phase 2** (T003-T007): Depends on Phase 1 completion
- **Phase 3** (T008-T009): Depends on Phase 1 (T002 specifically)
- **Phase 4** (T010): Depends on all previous phases

### Parallel Opportunities

- T001 and T002 modify the same file but different sections — execute sequentially
- T005, T006 are [P] — can run in parallel (same file but independent test methods)
- T008, T009 are [P] — can run in parallel
- T004 and T005/T006 are in different modules — can run in parallel once Phase 1 (T001-T002) is complete

### Within Each Phase

```
Phase 1: T001 → T002
Phase 2: T003 → T004 (sequential: different modules but T004 depends on field from T001)
         T005, T006 (parallel, after T002)
         T007 (after T004)
Phase 3: T008, T009 (parallel, after T002)
Phase 4: T010 (after all)
```
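The ordering above can be checked mechanically. The following sketch encodes the dependencies as read from the phase list and groups tasks into waves that may run together; the `waves` helper and the exact edge list are an interpretation for illustration, not part of the project tooling.

```groovy
// Dependencies as read from the phase list: task -> tasks it must wait for
final Map<String,List<String>> deps = [
    T001: [],
    T002: ['T001'],
    T003: ['T002'],
    T004: ['T003'],
    T005: ['T002'],
    T006: ['T002'],
    T007: ['T004'],
    T008: ['T002'],
    T009: ['T002'],
    T010: ['T003', 'T004', 'T005', 'T006', 'T007', 'T008', 'T009'],
]

// Group tasks into waves: every task in a wave has all of its dependencies in earlier waves
List<List<String>> waves(Map<String,List<String>> graph) {
    final done = new LinkedHashSet<String>()
    final result = new ArrayList<List<String>>()
    while( done.size() < graph.size() ) {
        final ready = graph.findAll { task, needs -> !(task in done) && done.containsAll(needs) }.keySet().toList()
        if( !ready )
            throw new IllegalStateException('Cyclic dependency')
        result << ready
        done.addAll(ready)
    }
    return result
}

final order = waves(deps)
assert order[0] == ['T001']
assert order[1] == ['T002']
assert order[2] == ['T003', 'T005', 'T006', 'T008', 'T009']
assert order.last() == ['T010']
```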

---

## Implementation Strategy

### MVP (User Story 1 Only)

1. Complete Phase 1: TraceRecord field + parser (T001-T002)
2. Complete Phase 2: TaskHandler + TowerObserver + tests (T003-T007)
3. **STOP and VALIDATE**: Run unit tests for both modules
4. GPU metrics flow to Platform

### Full Feature

1. MVP above
2. Add Phase 3: Error handling tests (T008-T009)
3. Phase 4: Smoke tests (T010)

---

## Summary

| Metric | Value |
|--------|-------|
| Total tasks | 10 |
| US1 tasks | 5 (T003-T007) |
| US2 tasks | 2 (T008-T009) |
| Foundational | 2 (T001-T002) |
| Verification | 1 (T010) |
| Files modified | 5 |
| Parallel opportunities | T005+T006, T008+T009 |