add nextflow d30e48d

2026-04-29 23:01:54 +02:00
parent d0b12d668d
commit 97cc9058d3
2840 changed files with 730250 additions and 0 deletions

@@ -0,0 +1,35 @@
# Specification Quality Checklist: Fusion GPU Metrics Collection
**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-04-10
**Feature**: [spec.md](../spec.md)
## Content Quality
- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed
## Requirement Completeness
- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified
## Feature Readiness
- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification
## Notes
- All items pass. Spec references internal Nextflow concepts (TraceRecord, TowerObserver) by necessity since this is an internal infrastructure feature, but avoids prescribing an implementation approach.
- The `resourceAllocation` pattern reference in FR-003 is a design constraint from the user, not an implementation detail leak.

@@ -0,0 +1,179 @@
# Implementation Plan: Fusion GPU Metrics Collection
**Branch**: `260410-fusion-gpu-metrics-v2` | **Date**: 2026-04-10 | **Spec**: [spec.md](spec.md)
**Input**: Feature specification from `/specs/260410-fusion-gpu-metrics/spec.md`
## Summary
Collect GPU metrics from Fusion's `.fusion/trace.json` file on task completion and send them to Seqera Platform. The GPU block is carried as a transient `Map<String,Object>` field on `TraceRecord` (same pattern as `resourceAllocation`) and included in the task payload via `TowerObserver.makeTaskMap0()`.
## Technical Context
**Language/Version**: Groovy 4.0.29 / Java 17 target (Java 21 toolchain)
**Primary Dependencies**: Nextflow core (`modules/nextflow`), nf-tower plugin (`plugins/nf-tower`)
**Storage**: N/A (read-only file access to `.fusion/trace.json`)
**Testing**: Spock Framework (unit tests in both modules)
**Target Platform**: All Fusion-enabled executors (AWS Batch, Google Batch, Azure Batch, K8s, Seqera, SLURM)
**Project Type**: Multi-module Gradle project
**Performance Goals**: Negligible overhead — one small JSON file read per task completion
**Constraints**: Must not break existing trace pipeline; must be forward-compatible with evolving GPU block schema
**Scale/Scope**: 5 files modified (3 production, 2 test), ~80 lines of production code, ~120 lines of test code
## Constitution Check
*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
| Principle | Status | Notes |
|-----------|--------|-------|
| I. Modular Architecture | PASS | Core trace logic in `modules/nextflow`, Platform integration in `plugins/nf-tower` — correct placement |
| II. Test-Driven Quality | PASS | Unit tests planned for both TraceRecord and TowerClient |
| III. Dataflow Programming | N/A | No changes to dataflow model |
| IV. Apache 2.0 License | PASS | All modified files already have headers |
| V. DCO Sign-off | PASS | Will use `git commit -s` |
| VI. Semantic Versioning | PASS | No version bump needed — feature addition within existing release cycle |
| VII. Groovy Idioms | PASS | Uses JsonSlurper, follows existing getter/setter patterns |
## Project Structure
### Files to Modify
```text
modules/nextflow/
├── src/main/groovy/nextflow/trace/TraceRecord.groovy # Add transient field + parsing method
├── src/main/groovy/nextflow/processor/TaskHandler.groovy # Read .fusion/trace.json on completion
└── src/test/groovy/nextflow/trace/TraceRecordTest.groovy # Test transient field + parsing
plugins/nf-tower/
├── src/main/io/seqera/tower/plugin/TowerObserver.groovy # Include gpuMetrics in task map
└── src/test/io/seqera/tower/plugin/TowerClientTest.groovy # Test gpuMetrics in task map
```
## Implementation Tasks
### Task 1: Add transient `gpuMetrics` field to TraceRecord
**File**: `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy`
**Changes**:
1. Add field after `resourceAllocation` (line 128):
```groovy
transient private Map<String,Object> gpuMetrics
```
2. Add getter/setter after existing `resourceAllocation` getter/setter (after line 649):
```groovy
Map<String,Object> getGpuMetrics() {
    return gpuMetrics
}

void setGpuMetrics(Map<String,Object> value) {
    this.gpuMetrics = value
}
```
### Task 2: Add Fusion trace file parsing method to TraceRecord
**File**: `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy`
**Changes**:
Add a static method to parse `.fusion/trace.json` and extract the `gpu` block:
```groovy
static Map<String,Object> parseFusionTraceFile(Path file) {
    final text = file.text
    final json = (Map) new JsonSlurper().parseText(text)
    return (Map<String,Object>) json.get('gpu')
}
```
This keeps parsing logic on TraceRecord (consistent with `parseTraceFile()` for `.command.trace`).
### Task 3: Read `.fusion/trace.json` in TaskHandler.getTraceRecord()
**File**: `modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy`
**Changes**:
After the existing `.command.trace` parsing block (after line 253), add:
```groovy
// collect Fusion GPU metrics
if( task.processor.executor.isFusionEnabled() ) {
    final fusionTrace = task.workDir?.resolve('.fusion/trace.json')
    try {
        if( fusionTrace ) {
            final gpu = TraceRecord.parseFusionTraceFile(fusionTrace)
            if( gpu )
                record.gpuMetrics = gpu
        }
    }
    catch( NoSuchFileException e ) {
        // ignore - Fusion trace may not exist
    }
    catch( Exception e ) {
        log.debug "[WARN] Cannot read Fusion trace file: $fusionTrace -- Cause: ${e.message}"
    }
}
```
**Key design decisions**:
- Gated by `task.processor.executor.isFusionEnabled()` — no file access when Fusion is not enabled (FR-007)
- Placed inside `isCompleted()` block but NOT gated by task status — runs for both success and failure (FR-005)
- Same error handling pattern as `.command.trace` parsing above it (FR-006)
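The guard and the two catch clauses can be tried outside Nextflow. The sketch below is a standalone approximation, not the `TaskHandler` code itself: `readGpuBlock` and its `fusionEnabled` parameter are hypothetical stand-ins for the block above and the `isFusionEnabled()` check, and the debug logging is replaced by a silent `null` return. It assumes the Groovy NIO extensions (which provide `Path.text`) are on the classpath, as in a standard Groovy distribution.

```groovy
import groovy.json.JsonSlurper
import java.nio.file.Files
import java.nio.file.NoSuchFileException
import java.nio.file.Path

// Hypothetical stand-in for the block added to TaskHandler.getTraceRecord().
// Returns the `gpu` map, or null when nothing usable is found.
Map<String,Object> readGpuBlock(Path workDir, boolean fusionEnabled) {
    if( !fusionEnabled )
        return null                 // FR-007: no file access without Fusion
    final file = workDir.resolve('.fusion/trace.json')
    try {
        final json = (Map) new JsonSlurper().parseText(file.text)
        return (Map<String,Object>) json.get('gpu')
    }
    catch( NoSuchFileException e ) {
        return null                 // the trace file may legitimately not exist
    }
    catch( Exception e ) {
        return null                 // malformed or unreadable; the plan logs at debug level here
    }
}

// Exercise the cases against a throwaway work directory
final workDir = Files.createTempDirectory('fusion-demo')
assert readGpuBlock(workDir, true) == null          // file missing

Files.createDirectories(workDir.resolve('.fusion'))
final traceFile = workDir.resolve('.fusion/trace.json')

traceFile.text = '{"gpu": {"name": "Tesla T4", "pct": 75}}'
final gpu = readGpuBlock(workDir, true)
assert gpu.name == 'Tesla T4'
assert gpu.pct == 75
assert readGpuBlock(workDir, false) == null         // Fusion disabled: file is never read

traceFile.text = '{"gpu" {"pct": 75}}'              // malformed: missing colon
assert readGpuBlock(workDir, true) == null
```

Catching `NoSuchFileException` separately keeps the expected "no trace file" case quiet, while every other failure is still contained so that trace reporting continues.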
### Task 4: Include `gpuMetrics` in TowerObserver task payload
**File**: `plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerObserver.groovy`
**Changes**:
In `makeTaskMap0()` method, add after `record.resourceAllocation = trace.getResourceAllocation()` (after line 476):
```groovy
record.gpuMetrics = trace.getGpuMetrics()
```
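How the nested map travels in a request body can be illustrated with Groovy's own JSON utilities. This is only a sketch: the `taskId` and `status` keys are placeholders rather than the actual payload schema, and `JsonOutput` stands in for whichever serializer the plugin uses.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// GPU block as it would arrive from TraceRecord.getGpuMetrics()
final Map<String,Object> gpuMetrics = [name: 'Tesla T4', pct: 75, peak_mem_used: 11388]

// Placeholder task map; the real one is assembled by makeTaskMap0()
final Map<String,Object> record = [taskId: 1, status: 'COMPLETED']
record.gpuMetrics = gpuMetrics

// Round-trip through JSON to confirm the block is carried as a nested object
final payload = JsonOutput.toJson(record)
final decoded = (Map) new JsonSlurper().parseText(payload)
assert decoded.gpuMetrics.name == 'Tesla T4'
assert decoded.gpuMetrics.peak_mem_used == 11388
```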
### Task 5: Unit tests for TraceRecord
**File**: `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy`
**Tests to add**:
1. **Transient field serialization test** (follows `numSpotInterruptions` pattern):
- Set `gpuMetrics` on a TraceRecord
- Serialize and deserialize
- Verify deserialized record has `null` for `gpuMetrics`
2. **parseFusionTraceFile with GPU block**:
- Create a temp file with valid trace.json content including a `gpu` block
- Verify the returned map contains all GPU fields with correct values
3. **parseFusionTraceFile without GPU block**:
- Create a temp file with valid trace.json content without a `gpu` key
- Verify `null` is returned
4. **parseFusionTraceFile with malformed JSON**:
- Create a temp file with invalid JSON
- Verify an exception is thrown (caller handles it)
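The three parser cases can be tried in a plain Groovy script before writing the Spock specifications. The sketch copies `parseFusionTraceFile` from Task 2 as a script-level method; `writeTemp` is a hypothetical helper, and the exact exception type raised for malformed input is deliberately not asserted.

```groovy
import groovy.json.JsonSlurper
import java.nio.file.Files
import java.nio.file.Path

// Copy of the Task 2 parser, declared at script level for experimentation
Map<String,Object> parseFusionTraceFile(Path file) {
    final json = (Map) new JsonSlurper().parseText(file.text)
    return (Map<String,Object>) json.get('gpu')
}

// Hypothetical helper: write content to a fresh temp file
Path writeTemp(String content) {
    final file = Files.createTempFile('trace', '.json')
    file.text = content
    return file
}

// Case 2: a gpu block is present
final withGpu = parseFusionTraceFile(writeTemp('{"proc": {"pct_cpu": 1045}, "gpu": {"name": "Tesla T4", "mem": 15360}}'))
assert withGpu.name == 'Tesla T4'
assert withGpu.mem == 15360

// Case 3: valid JSON without a gpu key yields null
assert parseFusionTraceFile(writeTemp('{"proc": {"pct_cpu": 1045}}')) == null

// Case 4: malformed JSON raises an exception that the caller must handle
Exception failure = null
try {
    parseFusionTraceFile(writeTemp('{"gpu" "missing colon"}'))
}
catch( Exception e ) {
    failure = e
}
assert failure != null
```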
### Task 6: Unit tests for TowerClient/TowerObserver
**File**: `plugins/nf-tower/src/test/io/seqera/tower/plugin/TowerClientTest.groovy`
**Test to add** (follows `resourceAllocation` test at lines 684-711):
- Create a TraceRecord with `gpuMetrics` set to a GPU metrics map
- Call `makeTasksReq([trace])`
- Verify `req.tasks[0].gpuMetrics` contains the GPU data
## Implementation Order
1. **Task 1 + Task 2** (TraceRecord changes) — no dependencies
2. **Task 3** (TaskHandler) — depends on Task 1+2
3. **Task 4** (TowerObserver) — depends on Task 1
4. **Task 5** (TraceRecord tests) — depends on Task 1+2
5. **Task 6** (TowerClient tests) — depends on Task 4
Once Task 1 is done, the core track (Tasks 2, 3 and 5) and the plugin track (Tasks 4 and 6) can proceed in parallel.
## Verification
After implementation, run:
```bash
./gradlew :nextflow:test --tests "TraceRecordTest"
./gradlew :nf-tower:test --tests "TowerClientTest"
make smoke # verify no regressions
```

@@ -0,0 +1,49 @@
# Research: Fusion GPU Metrics Collection
## R1: How to detect Fusion at trace collection time
**Decision**: Use `task.processor.executor.isFusionEnabled()` in `TaskHandler.getTraceRecord()`.
**Rationale**: TaskHandler already accesses the executor at line 222 (`task.processor.executor.getName()`), so this is a proven access path. The base `Executor.isFusionEnabled()` returns `false` by default, and Fusion-capable executors override it via `FusionHelper.isFusionEnabled(session)`. This works for all handler subclasses without requiring `instanceof` checks.
**Alternatives considered**:
- Checking `this instanceof FusionAwareTask`: Would miss custom executors that support Fusion but don't implement the trait. Also, `FusionAwareTask` is a trait on handler subclasses, not on the base `TaskHandler` where `getTraceRecord()` lives.
- Adding a Fusion flag to TaskRun/TaskConfig: Unnecessary complexity — Fusion is an executor-level property, not a per-task property.
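The override-based detection can be pictured with a few stand-in classes. Everything named here (`BaseExecutor`, `FusionCapableExecutor`, `LocalLikeExecutor`, `fusionConfigured`, `shouldReadFusionTrace`) is hypothetical and only mimics the shape of the real `Executor` hierarchy; in Nextflow the override delegates to `FusionHelper.isFusionEnabled(session)`.

```groovy
// Hypothetical stand-ins for Executor and its subclasses
class BaseExecutor {
    boolean isFusionEnabled() { return false }       // default: Fusion off
}

class FusionCapableExecutor extends BaseExecutor {
    boolean fusionConfigured                         // stand-in for the session-level setting
    @Override
    boolean isFusionEnabled() { return fusionConfigured }
}

class LocalLikeExecutor extends BaseExecutor { }     // inherits the default

// The caller needs no instanceof check: it asks the executor directly
boolean shouldReadFusionTrace(BaseExecutor executor) {
    return executor.isFusionEnabled()
}

assert !shouldReadFusionTrace(new LocalLikeExecutor())
assert shouldReadFusionTrace(new FusionCapableExecutor(fusionConfigured: true))
assert !shouldReadFusionTrace(new FusionCapableExecutor(fusionConfigured: false))
```

Because the decision sits behind a virtual method, a custom executor gains the behaviour by overriding one method, which is the case the `instanceof FusionAwareTask` alternative would miss.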
## R2: Where to read `.fusion/trace.json`
**Decision**: Read it in `TaskHandler.getTraceRecord()`, right after the existing `.command.trace` parsing block (lines 244-253), gated by `task.processor.executor.isFusionEnabled()`.
**Rationale**: This is the single place where all task trace data is assembled, regardless of executor type. The existing `.command.trace` parsing already demonstrates the pattern: resolve a file in the work dir, parse it, handle `NoSuchFileException` and `IOException` gracefully.
**Alternatives considered**:
- Reading in each TaskHandler subclass: Would require changes across 7 handler subclasses in both core and plugins. Much higher blast radius.
- Reading in `TowerObserver`: Would couple Platform-specific code with file I/O. The observer should only transform data, not collect it.
## R3: Transient field pattern on TraceRecord
**Decision**: Add `transient private Map<String,Object> gpuMetrics` with getter/setter, following the exact `resourceAllocation` pattern.
**Rationale**: Transient fields on TraceRecord are the established mechanism for carrying executor-specific data to TowerObserver without persisting it in serialization (Kryo). The `resourceAllocation` field is the closest precedent — it's also a `Map<String,Object>` set during trace collection and consumed in `TowerObserver.makeTaskMap0()`.
**Implementation details**:
- Field: `transient private Map<String,Object> gpuMetrics`
- Getter: `Map<String,Object> getGpuMetrics()`
- Setter: `void setGpuMetrics(Map<String,Object> value)`
- In `makeTaskMap0()`: `record.gpuMetrics = trace.getGpuMetrics()`
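What `transient` buys can be seen with a small stand-in class. Note the assumptions: `DemoRecord` is hypothetical, and the sketch uses standard Java serialization rather than Kryo, so it illustrates the general behaviour of a transient field and not Nextflow's actual serializer configuration.

```groovy
class DemoRecord implements Serializable {
    String name
    transient Map<String,Object> gpuMetrics    // excluded from the serialized form
}

final original = new DemoRecord(name: 'task-1', gpuMetrics: [pct: 75])

// Serialize to an in-memory buffer
final buffer = new ByteArrayOutputStream()
buffer.withObjectOutputStream { out -> out.writeObject(original) }

// Deserialize, resolving DemoRecord through the script's own class loader
DemoRecord copy = null
new ByteArrayInputStream(buffer.toByteArray()).withObjectInputStream(DemoRecord.class.classLoader) { input ->
    copy = (DemoRecord) input.readObject()
}

assert copy.name == 'task-1'          // regular field survives
assert copy.gpuMetrics == null        // transient field is dropped
assert original.gpuMetrics.pct == 75  // the source object is untouched
```

This mirrors the check planned for `TraceRecordTest`: set the field, round-trip the record, and expect `null` afterwards.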
## R4: JSON parsing approach
**Decision**: Use Groovy's `JsonSlurper` to parse `.fusion/trace.json` and extract the `gpu` key.
**Rationale**: `JsonSlurper` is already used throughout the Nextflow codebase (e.g., in tests and utilities). It parses JSON into native Groovy maps/lists, which is exactly what we need for the `Map<String,Object>` transient field. No additional dependencies required.
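A brief look at what `JsonSlurper` hands back for values like those in the `gpu` block. The sketch asserts only what the transient field relies on (a `Map` with string keys and `String` or `Number` values), since the concrete numeric classes depend on the parser.

```groovy
import groovy.json.JsonSlurper

final text = '{"gpu": {"name": "Tesla T4", "pct": 75, "pct_mem": 40.11115345483025}}'
final json = (Map) new JsonSlurper().parseText(text)
final gpu = (Map<String,Object>) json.get('gpu')

assert gpu instanceof Map
assert gpu.size() == 3
assert gpu.name instanceof String
assert gpu.pct instanceof Number        // whole numbers stay numeric
assert gpu.pct_mem instanceof Number    // fractional values too
assert gpu.pct == 75
```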
## R5: Test strategy
**Decision**: Three groups of tests across two test classes, following existing patterns.
1. **TraceRecordTest**: Verify `gpuMetrics` transient field is not persisted across serialization (follows `numSpotInterruptions` test pattern).
2. **TraceRecordTest**: Verify `parseFusionTraceFile()` correctly extracts the GPU block from valid JSON, returns `null` when the GPU block is missing, and throws on malformed JSON. A missing file is handled by the caller in `TaskHandler`, which catches `NoSuchFileException`.
3. **TowerClientTest**: Verify `gpuMetrics` is included in task map output (follows `resourceAllocation` test at lines 684-711).
**Rationale**: These three test groups mirror how `resourceAllocation` and `numSpotInterruptions` are tested, ensuring consistency with project conventions.

@@ -0,0 +1,147 @@
# Feature Specification: Fusion GPU Metrics Collection
**Feature Branch**: `260410-fusion-gpu-metrics`
**Created**: 2026-04-10
**Status**: Draft
**Input**: User description: "Collect GPU metrics from Fusion trace.json and send to Seqera Platform via TowerClient"
## User Scenarios & Testing *(mandatory)*
### User Story 1 - GPU metrics sent to Platform on task completion (Priority: P1)
A user runs a Nextflow pipeline with Fusion enabled on a GPU-equipped executor (e.g., AWS Batch, Google Batch, Kubernetes). When each task completes, Nextflow reads the Fusion-generated `.fusion/trace.json` file from the task work directory, extracts the `gpu` block, and includes it in the task trace data sent to Seqera Platform. The user can then view GPU utilization metrics (compute %, memory %, active time, etc.) for each task in the Platform UI.
**Why this priority**: This is the core feature. Without it, GPU usage is invisible to Platform users running Fusion-enabled pipelines.
**Independent Test**: Can be tested by running a Fusion-enabled task that produces a `.fusion/trace.json` with a `gpu` block, then verifying the GPU data appears in the task payload sent to Platform.
**Acceptance Scenarios**:
1. **Given** a completed task with Fusion enabled and a valid `.fusion/trace.json` containing a `gpu` block, **When** the task trace is collected, **Then** all GPU metrics from the `gpu` block are included in the task data sent to Platform.
2. **Given** a completed task with Fusion enabled and a valid `.fusion/trace.json` without a `gpu` block (CPU-only task), **When** the task trace is collected, **Then** no GPU metrics are sent and no error occurs.
3. **Given** a failed task with Fusion enabled and a valid `.fusion/trace.json` containing a `gpu` block, **When** the task trace is collected, **Then** GPU metrics are still sent (metrics are collected irrespective of task status).
---
### User Story 2 - Graceful handling when trace.json is missing or malformed (Priority: P2)
When Fusion's `.fusion/trace.json` file is missing (e.g., task was killed before Fusion wrote it) or contains invalid JSON, the system logs a debug-level warning and proceeds without GPU metrics. The task trace is still sent to Platform with all other fields intact.
**Why this priority**: Robustness is essential — GPU metrics are supplementary data and must never cause task reporting to fail.
**Independent Test**: Can be tested by simulating a completed task where `.fusion/trace.json` is absent or contains malformed JSON, and verifying the task trace is still sent successfully without GPU data.
**Acceptance Scenarios**:
1. **Given** a completed Fusion-enabled task where `.fusion/trace.json` does not exist, **When** the task trace is collected, **Then** no GPU metrics are included and no error is raised.
2. **Given** a completed Fusion-enabled task where `.fusion/trace.json` contains invalid JSON, **When** the task trace is collected, **Then** the file is skipped with a debug log message and the task trace is sent without GPU data.
3. **Given** a completed Fusion-enabled task where `.fusion/trace.json` exists but the `gpu` block is null/absent, **When** the task trace is collected, **Then** no GPU metrics are included and no error is raised.
---
### Edge Cases
- What happens when the `gpu` block contains unexpected or extra fields not in the known schema? They are included as-is (forward compatibility).
- What happens when Fusion is not enabled for a task? No attempt is made to read `.fusion/trace.json`.
- What happens when the task work directory is inaccessible at trace collection time (e.g., remote storage timeout)? The same error handling as existing `.command.trace` parsing applies — log and continue.
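The first edge case follows directly from forwarding the block as a map. In the sketch below, `future_field` is a made-up key standing in for a later schema addition.

```groovy
import groovy.json.JsonSlurper

// `future_field` is a hypothetical key, not part of the known schema
final text = '{"gpu": {"name": "Tesla T4", "pct": 75, "future_field": {"x": 1}}}'
final gpu = (Map<String,Object>) ((Map) new JsonSlurper().parseText(text)).get('gpu')

// Nothing is filtered or renamed: known and unknown keys arrive together
assert gpu.containsKey('pct')
assert gpu.containsKey('future_field')
assert gpu.future_field.x == 1
```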
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: System MUST read the file `.fusion/trace.json` from the task work directory on task completion when the executor has Fusion enabled.
- **FR-002**: System MUST extract the entire `gpu` block from the parsed `trace.json` as a map.
- **FR-003**: System MUST store the GPU metrics as a transient field on `TraceRecord` (following the same pattern as `resourceAllocation`).
- **FR-004**: System MUST include the GPU metrics map in the task payload sent to Seqera Platform via the Tower observer.
- **FR-005**: System MUST collect GPU metrics irrespective of task completion status (success or failure).
- **FR-006**: System MUST NOT fail or disrupt task trace reporting if `.fusion/trace.json` is missing, unreadable, or malformed.
- **FR-007**: System MUST only attempt to read `.fusion/trace.json` when Fusion is enabled for the executor.
### Key Entities
- **Fusion Trace File**: JSON file at `.fusion/trace.json` in the task work directory, produced by the Fusion client. Contains `proc`, `gpu`, and `cgroup` blocks with runtime metrics.
- **GPU Metrics Block**: The `gpu` object within `trace.json`, containing fields: `name`, `mem`, `driver`, `active_time`, `pct`, `peak`, `pct_mem`, `peak_mem`, `avg_mem`, `peak_mem_used`, `avg_mem_bw_util`, `peak_mem_bw_util`.
#### Example `.fusion/trace.json`
```json
{
  "proc": {
    "realtime": 660541,
    "pct_cpu": 1045,
    "cpu_name": "Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz",
    "arch": "linux/amd64",
    "rchar": 14112539262,
    "wchar": 12668821375,
    "syscr": 1823378,
    "syscw": 169293,
    "read_bytes": 8011776,
    "write_bytes": 102400,
    "pct_mem": 56,
    "vmem": 39015152,
    "rss": 14826068,
    "peak_vmem": 39047920,
    "peak_rss": 15775480,
    "vol_ctxt": 413015,
    "inv_ctxt": 1540
  },
  "gpu": {
    "name": "Tesla T4",
    "mem": 15360,
    "driver": "580.126.09",
    "active_time": 651030,
    "pct": 75,
    "peak": 100,
    "pct_mem": 40.11115345483025,
    "peak_mem": 74.140625,
    "avg_mem": 6161,
    "peak_mem_used": 11388,
    "avg_mem_bw_util": 43,
    "peak_mem_bw_util": 83
  },
  "cgroup": {
    "version": "v2",
    "memory_current": 25469927424,
    "memory_peak": 41178980352,
    "memory_rss": 67919872,
    "memory_peak_rss": 14783070208,
    "cpu_usage_usec": 785302059,
    "cpu_user_usec": 549732867,
    "cpu_system_usec": 235569192,
    "io_read_bytes": 8503296,
    "io_write_bytes": 12671918080,
    "io_read_ops": 98,
    "io_write_ops": 97975,
    "memory_limit": 77309411328,
    "cpu_quota": 0,
    "cpu_period": 0,
    "memory_oom_kills": 0,
    "cpu_nr_throttled": 0,
    "cpu_throttled_usec": 0,
    "cpu_psi_some": 582969,
    "cpu_psi_full": 582860,
    "memory_psi_some": 0,
    "memory_psi_full": 0,
    "io_psi_some": 1038270,
    "io_psi_full": 1037514
  }
}
```
- **TraceRecord GPU field**: New transient field on `TraceRecord` that carries the GPU metrics map through the existing trace pipeline to the Tower observer, following the `resourceAllocation` pattern.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: GPU metrics from Fusion trace files are visible in Seqera Platform for all Fusion-enabled tasks that ran on GPU hardware.
- **SC-002**: Tasks without GPU usage or without Fusion enabled report successfully with no GPU data and no errors.
- **SC-003**: A missing or malformed `.fusion/trace.json` does not cause any task to fail reporting — 100% of tasks still have their standard metrics delivered.
- **SC-004**: GPU metrics collection adds negligible overhead — reading and parsing a single small JSON file per task completion.
## Assumptions
- The Fusion client is responsible for creating `.fusion/trace.json` in the task work directory. Nextflow only reads it.
- The `gpu` block schema may evolve over time. The implementation forwards the entire block as a map rather than mapping to fixed fields, ensuring forward compatibility.
- Seqera Platform API already accepts or will be updated to accept the GPU metrics payload alongside existing task trace data.
- The file path `.fusion/trace.json` is stable and defined by the Fusion client contract.
- All executors that support Fusion (AWS Batch, Google Batch, Azure Batch, Kubernetes, Seqera, SLURM) benefit from this feature without executor-specific code — the detection is based on whether Fusion is enabled, not on the executor type.

@@ -0,0 +1,131 @@
# Tasks: Fusion GPU Metrics Collection
**Input**: Design documents from `/specs/260410-fusion-gpu-metrics/`
**Prerequisites**: plan.md, spec.md, research.md
**Tests**: Included, because the plan calls for unit tests for both TraceRecord and TowerClient.
**Organization**: Tasks grouped by user story for independent implementation and testing.
## Format: `[ID] [P?] [Story] Description`
- **[P]**: Can run in parallel (different files, no dependencies)
- **[Story]**: Which user story this task belongs to (e.g., US1, US2)
- Exact file paths included in descriptions
## Phase 1: Foundational (TraceRecord transient field)
**Purpose**: Add the `gpuMetrics` transient field to TraceRecord — all subsequent tasks depend on this.
- [ ] T001 Add transient `gpuMetrics` field with getter/setter to `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy` (after `resourceAllocation` field at line 128, getter/setter after line 649)
- [ ] T002 Add static `parseFusionTraceFile(Path)` method to `modules/nextflow/src/main/groovy/nextflow/trace/TraceRecord.groovy` — parse `.fusion/trace.json` and return the `gpu` block as `Map<String,Object>`
**Checkpoint**: TraceRecord can hold and parse GPU metrics. No behavior change yet.
---
## Phase 2: User Story 1 - GPU metrics sent to Platform (Priority: P1)
**Goal**: Read `.fusion/trace.json` on task completion, extract GPU block, send to Platform via TowerObserver.
**Independent Test**: Run a Fusion-enabled task with `.fusion/trace.json` containing a `gpu` block, verify GPU data appears in the Platform task payload.
### Implementation
- [ ] T003 [US1] Read `.fusion/trace.json` in `TaskHandler.getTraceRecord()` at `modules/nextflow/src/main/groovy/nextflow/processor/TaskHandler.groovy` — add after `.command.trace` parsing block (after line 253), gated by `task.processor.executor.isFusionEnabled()`
- [ ] T004 [US1] Include `gpuMetrics` in task payload in `TowerObserver.makeTaskMap0()` at `plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerObserver.groovy` — add `record.gpuMetrics = trace.getGpuMetrics()` after `resourceAllocation` line (line 476)
### Tests
- [ ] T005 [P] [US1] Test `parseFusionTraceFile` with valid GPU block in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with full trace.json, verify returned map has all GPU fields
- [ ] T006 [P] [US1] Test `gpuMetrics` transient field is not persisted across serialization in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — set field, serialize/deserialize, verify null
- [ ] T007 [US1] Test `gpuMetrics` included in task map in `plugins/nf-tower/src/test/io/seqera/tower/plugin/TowerClientTest.groovy` — create TraceRecord with gpuMetrics set, call `makeTasksReq()`, verify output contains GPU data
**Checkpoint**: GPU metrics flow end-to-end from `.fusion/trace.json` to Platform payload. Run:
```bash
./gradlew :nextflow:test --tests "TraceRecordTest"
./gradlew :nf-tower:test --tests "TowerClientTest"
```
---
## Phase 3: User Story 2 - Graceful error handling (Priority: P2)
**Goal**: Ensure missing, malformed, or GPU-less trace files don't break task reporting.
**Independent Test**: Simulate tasks with missing/malformed `.fusion/trace.json`, verify task trace is sent without GPU data and no errors.
### Tests
- [ ] T008 [P] [US2] Test `parseFusionTraceFile` without GPU block in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with valid JSON but no `gpu` key, verify null returned
- [ ] T009 [P] [US2] Test `parseFusionTraceFile` with malformed JSON in `modules/nextflow/src/test/groovy/nextflow/trace/TraceRecordTest.groovy` — create temp file with invalid JSON, verify exception is thrown
**Checkpoint**: Error handling verified. The implementation in T003 already handles these cases via try/catch — these tests confirm the behavior.
---
## Phase 4: Verification
**Purpose**: End-to-end validation across both modules.
- [ ] T010 Run smoke tests to verify no regressions: `make smoke`
---
## Dependencies & Execution Order
### Phase Dependencies
- **Phase 1** (T001, T002): No dependencies — start immediately
- **Phase 2** (T003-T007): Depends on Phase 1 completion
- **Phase 3** (T008-T009): Depends on Phase 1 (T002 specifically)
- **Phase 4** (T010): Depends on all previous phases
### Parallel Opportunities
- T001 and T002 modify the same file but different sections — execute sequentially
- T005, T006 are [P] — can run in parallel (same file but independent test methods)
- T008, T009 are [P] — can run in parallel
- T004 and T005/T006 are in different modules, so they can run in parallel once Phase 1 is complete (T004 needs the field from T001; T005 needs the parser from T002)
### Within Each Phase
```
Phase 1: T001 → T002
Phase 2: T003 → T004 (sequential: different modules but T004 depends on field from T001)
T005, T006 (parallel, after T002)
T007 (after T004)
Phase 3: T008, T009 (parallel, after T002)
Phase 4: T010 (after all)
```
---
## Implementation Strategy
### MVP (User Story 1 Only)
1. Complete Phase 1: TraceRecord field + parser (T001-T002)
2. Complete Phase 2: TaskHandler + TowerObserver + tests (T003-T007)
3. **STOP and VALIDATE**: Run unit tests for both modules
4. GPU metrics flow to Platform
### Full Feature
1. MVP above
2. Add Phase 3: Error handling tests (T008-T009)
3. Phase 4: Smoke tests (T010)
---
## Summary
| Metric | Value |
|--------|-------|
| Total tasks | 10 |
| US1 tasks | 5 (T003-T007) |
| US2 tasks | 2 (T008-T009) |
| Foundational | 2 (T001-T002) |
| Verification | 1 (T010) |
| Files modified | 5 |
| Parallel opportunities | T005+T006, T008+T009 |