2026-04-29 23:01:54 +02:00

Feature Specification: Fusion GPU Metrics Collection

Feature Branch: 260410-fusion-gpu-metrics
Created: 2026-04-10
Status: Draft
Input: User description: "Collect GPU metrics from Fusion trace.json and send to Seqera Platform via TowerClient"

User Scenarios & Testing (mandatory)

User Story 1 - GPU metrics sent to Platform on task completion (Priority: P1)

A user runs a Nextflow pipeline with Fusion enabled on a GPU-equipped executor (e.g., AWS Batch, Google Batch, Kubernetes). When each task completes, Nextflow reads the Fusion-generated .fusion/trace.json file from the task work directory, extracts the gpu block, and includes it in the task trace data sent to Seqera Platform. The user can then view GPU utilization metrics (compute %, memory %, active time, etc.) for each task in the Platform UI.

Why this priority: This is the core feature. Without it, GPU usage is invisible to Platform users running Fusion-enabled pipelines.

Independent Test: Can be tested by running a Fusion-enabled task that produces a .fusion/trace.json with a gpu block, then verifying the GPU data appears in the task payload sent to Platform.

Acceptance Scenarios:

  1. Given a completed task with Fusion enabled and a valid .fusion/trace.json containing a gpu block, When the task trace is collected, Then all GPU metrics from the gpu block are included in the task data sent to Platform.
  2. Given a completed task with Fusion enabled and a valid .fusion/trace.json without a gpu block (CPU-only task), When the task trace is collected, Then no GPU metrics are sent and no error occurs.
  3. Given a failed task with Fusion enabled and a valid .fusion/trace.json containing a gpu block, When the task trace is collected, Then GPU metrics are still sent (metrics are collected irrespective of task status).
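
The extraction step in these scenarios can be sketched as follows. This is an illustrative Python sketch only, not Nextflow's implementation; the function name extract_gpu_block is hypothetical, and the sample inputs are cut-down versions of a Fusion trace file.

```python
import json
from typing import Any, Optional


def extract_gpu_block(trace_text: str) -> Optional[dict[str, Any]]:
    """Return the `gpu` object from Fusion trace JSON text, or None if absent.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    trace = json.loads(trace_text)
    if not isinstance(trace, dict):
        return None
    gpu = trace.get("gpu")
    # A CPU-only task has no gpu block (or a null one): nothing to report.
    return gpu if isinstance(gpu, dict) else None


# Task status is deliberately not an input: the same call serves
# successful and failed tasks alike (scenario 3).
gpu_task = '{"proc": {"realtime": 660541}, "gpu": {"name": "Tesla T4", "pct": 75}}'
cpu_task = '{"proc": {"realtime": 660541}}'

print(extract_gpu_block(gpu_task))
print(extract_gpu_block(cpu_task))  # prints None
```

Returning None rather than an empty map lets the caller distinguish "no GPU data" from "GPU block present but empty", which keeps scenario 2 free of special cases.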

User Story 2 - Graceful handling when trace.json is missing or malformed (Priority: P2)

When Fusion's .fusion/trace.json file is missing (e.g., the task was killed before Fusion wrote it) or contains invalid JSON, the system logs a debug-level message and proceeds without GPU metrics. The task trace is still sent to Platform with all other fields intact.

Why this priority: Robustness is essential — GPU metrics are supplementary data and must never cause task reporting to fail.

Independent Test: Can be tested by simulating a completed task where .fusion/trace.json is absent or contains malformed JSON, and verifying the task trace is still sent successfully without GPU data.

Acceptance Scenarios:

  1. Given a completed Fusion-enabled task where .fusion/trace.json does not exist, When the task trace is collected, Then no GPU metrics are included and no error is raised.
  2. Given a completed Fusion-enabled task where .fusion/trace.json contains invalid JSON, When the task trace is collected, Then the file is skipped with a debug log message and the task trace is sent without GPU data.
  3. Given a completed Fusion-enabled task where .fusion/trace.json exists but the gpu block is null/absent, When the task trace is collected, Then no GPU metrics are included and no error is raised.
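
A minimal sketch of the tolerant read described by these scenarios, written in Python for illustration. The function name read_gpu_metrics and the logger name are assumptions; only the relative path .fusion/trace.json comes from this specification.

```python
import json
import logging
from pathlib import Path
from typing import Any, Optional

log = logging.getLogger("fusion.trace")  # hypothetical logger name

# Relative path defined by the Fusion client contract (see Assumptions).
TRACE_RELATIVE_PATH = Path(".fusion") / "trace.json"


def read_gpu_metrics(work_dir: Path) -> Optional[dict[str, Any]]:
    """Read the gpu block from <work_dir>/.fusion/trace.json, never raising."""
    trace_file = work_dir / TRACE_RELATIVE_PATH
    try:
        trace = json.loads(trace_file.read_text(encoding="utf-8"))
    except OSError as err:
        # Missing or unreadable file (FileNotFoundError is an OSError).
        log.debug("Cannot read %s: %s", trace_file, err)
        return None
    except ValueError as err:
        # Malformed JSON (json.JSONDecodeError is a ValueError) or bad encoding.
        log.debug("Invalid JSON in %s: %s", trace_file, err)
        return None
    gpu = trace.get("gpu") if isinstance(trace, dict) else None
    # A null or absent gpu block yields None without logging an error.
    return gpu if isinstance(gpu, dict) else None
```

All three scenarios collapse to the same outcome for the caller, a None result, so the task trace can be sent unchanged whenever GPU data is unavailable.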

Edge Cases

  • What happens when the gpu block contains unexpected or extra fields not in the known schema? They are included as-is (forward compatibility).
  • What happens when Fusion is not enabled for a task? No attempt is made to read .fusion/trace.json.
  • What happens when the task work directory is inaccessible at trace collection time (e.g., remote storage timeout)? The same error handling as existing .command.trace parsing applies — log and continue.
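
The edge cases above can be illustrated with a small Python sketch. The names collect_gpu_metrics and read_trace are hypothetical, and the reader is passed in as a callable so the example stays self-contained; it is not a description of Nextflow's actual code.

```python
import logging
from typing import Any, Callable, Optional

log = logging.getLogger("fusion.trace")  # hypothetical logger name

GpuMetrics = dict[str, Any]


def collect_gpu_metrics(
    fusion_enabled: bool,
    read_trace: Callable[[], Optional[GpuMetrics]],
) -> Optional[GpuMetrics]:
    """Gate the trace read on Fusion and absorb any read failure."""
    if not fusion_enabled:
        # Fusion disabled: the work directory is never touched.
        return None
    try:
        # The block is returned as-is, so unknown fields pass through untouched.
        return read_trace()
    except Exception as err:  # e.g. a remote storage timeout
        log.debug("Skipping GPU metrics: %s", err)
        return None


calls = []


def fake_reader() -> GpuMetrics:
    # Stand-in for reading .fusion/trace.json; records that it was invoked.
    calls.append("read")
    return {"name": "Tesla T4", "pct": 75, "some_future_field": 1}


def failing_reader() -> GpuMetrics:
    raise TimeoutError("simulated remote storage timeout")
```

Because the block is forwarded as an opaque map, a field such as the made-up some_future_field above reaches the caller without any schema change on the reading side.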

Requirements (mandatory)

Functional Requirements

  • FR-001: System MUST read the file .fusion/trace.json from the task work directory on task completion when the executor has Fusion enabled.
  • FR-002: System MUST extract the entire gpu block from the parsed trace.json as a map.
  • FR-003: System MUST store the GPU metrics as a transient field on TraceRecord (following the same pattern as resourceAllocation).
  • FR-004: System MUST include the GPU metrics map in the task payload sent to Seqera Platform via the Tower observer.
  • FR-005: System MUST collect GPU metrics irrespective of task completion status (success or failure).
  • FR-006: System MUST NOT fail or disrupt task trace reporting if .fusion/trace.json is missing, unreadable, or malformed.
  • FR-007: System MUST only attempt to read .fusion/trace.json when Fusion is enabled for the executor.
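
To make FR-003 and FR-004 concrete, the following Python sketch models a trace record that carries the GPU map outside its regular fields and a payload builder that attaches it only when present. TraceRecordSketch, build_task_payload, and the payload key "gpuMetrics" are all assumptions for illustration; the real TraceRecord class and the Platform payload schema are not defined in this document.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TraceRecordSketch:
    """Simplified stand-in for TraceRecord (hypothetical shape)."""

    # Regular trace fields (task id, status, realtime, ...).
    store: dict[str, Any] = field(default_factory=dict)
    # Transient GPU metrics: carried alongside the record, not stored in it.
    gpu_metrics: Optional[dict[str, Any]] = None


def build_task_payload(record: TraceRecordSketch) -> dict[str, Any]:
    """Build the task payload; the key name 'gpuMetrics' is an assumption."""
    payload = dict(record.store)  # copy so the record itself is not mutated
    if record.gpu_metrics is not None:
        payload["gpuMetrics"] = dict(record.gpu_metrics)
    return payload


# FR-005: a failed task still carries its GPU metrics.
failed = TraceRecordSketch(
    store={"taskId": 7, "status": "FAILED"},
    gpu_metrics={"name": "Tesla T4", "pct": 75},
)
cpu_only = TraceRecordSketch(store={"taskId": 8, "status": "COMPLETED"})

print(build_task_payload(failed))
print(build_task_payload(cpu_only))
```

Keeping the GPU map out of the regular field store mirrors the intent of a transient field: existing consumers of the record see no new column, and only the Platform-bound payload gains the extra entry.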

Key Entities

  • Fusion Trace File: JSON file at .fusion/trace.json in the task work directory, produced by the Fusion client. Contains proc, gpu, and cgroup blocks with runtime metrics.
  • GPU Metrics Block: The gpu object within trace.json, containing fields: name, mem, driver, active_time, pct, peak, pct_mem, peak_mem, avg_mem, peak_mem_used, avg_mem_bw_util, peak_mem_bw_util.

Example .fusion/trace.json

{
  "proc": {
    "realtime": 660541,
    "pct_cpu": 1045,
    "cpu_name": "Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz",
    "arch": "linux/amd64",
    "rchar": 14112539262,
    "wchar": 12668821375,
    "syscr": 1823378,
    "syscw": 169293,
    "read_bytes": 8011776,
    "write_bytes": 102400,
    "pct_mem": 56,
    "vmem": 39015152,
    "rss": 14826068,
    "peak_vmem": 39047920,
    "peak_rss": 15775480,
    "vol_ctxt": 413015,
    "inv_ctxt": 1540
  },
  "gpu": {
    "name": "Tesla T4",
    "mem": 15360,
    "driver": "580.126.09",
    "active_time": 651030,
    "pct": 75,
    "peak": 100,
    "pct_mem": 40.11115345483025,
    "peak_mem": 74.140625,
    "avg_mem": 6161,
    "peak_mem_used": 11388,
    "avg_mem_bw_util": 43,
    "peak_mem_bw_util": 83
  },
  "cgroup": {
    "version": "v2",
    "memory_current": 25469927424,
    "memory_peak": 41178980352,
    "memory_rss": 67919872,
    "memory_peak_rss": 14783070208,
    "cpu_usage_usec": 785302059,
    "cpu_user_usec": 549732867,
    "cpu_system_usec": 235569192,
    "io_read_bytes": 8503296,
    "io_write_bytes": 12671918080,
    "io_read_ops": 98,
    "io_write_ops": 97975,
    "memory_limit": 77309411328,
    "cpu_quota": 0,
    "cpu_period": 0,
    "memory_oom_kills": 0,
    "cpu_nr_throttled": 0,
    "cpu_throttled_usec": 0,
    "cpu_psi_some": 582969,
    "cpu_psi_full": 582860,
    "memory_psi_some": 0,
    "memory_psi_full": 0,
    "io_psi_some": 1038270,
    "io_psi_full": 1037514
  }
}
  • TraceRecord GPU field: New transient field on TraceRecord that carries the GPU metrics map through the existing trace pipeline to the Tower observer, following the resourceAllocation pattern.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: GPU metrics from Fusion trace files are visible in Seqera Platform for all Fusion-enabled tasks that ran on GPU hardware.
  • SC-002: Tasks without GPU usage or without Fusion enabled report successfully with no GPU data and no errors.
  • SC-003: A missing or malformed .fusion/trace.json never causes task reporting to fail; 100% of tasks still have their standard metrics delivered.
  • SC-004: GPU metrics collection adds negligible overhead: at most one read and parse of a single small JSON file per task completion.

Assumptions

  • The Fusion client is responsible for creating .fusion/trace.json in the task work directory. Nextflow only reads it.
  • The gpu block schema may evolve over time. The implementation forwards the entire block as a map rather than mapping to fixed fields, ensuring forward compatibility.
  • Seqera Platform API already accepts or will be updated to accept the GPU metrics payload alongside existing task trace data.
  • The file path .fusion/trace.json is stable and defined by the Fusion client contract.
  • All executors that support Fusion (AWS Batch, Google Batch, Azure Batch, Kubernetes, Seqera, SLURM) benefit from this feature without executor-specific code — the detection is based on whether Fusion is enabled, not on the executor type.