Files
ma/nextflow/adr/20251020-workflow-outputs.md
2026-04-29 23:01:54 +02:00

8.1 KiB

Workflow outputs

  • Authors: Ben Sherman
  • Status: accepted
  • Date: 2025-10-20
  • Tags: lang, workflows

Summary

Introduce a unified, dataflow-centric way to declare the top-level outputs of a workflow.

Problem Statement

In Nextflow DSL1, each process used publishDir to copy output files from the work directory to an external location. Nextflow DSL2 inherited this approach but it became increasingly problematic as pipelines grew larger and more modular:

  • Mismatch with reusable modules: Publishing rules often depend on how a process is used in a given pipeline. Setting publishDir inside a module process makes the module less reusable, since the publish path and mode are baked into the process definition. Using process selectors in configuration is verbose and fragile.

  • Fragmented outputs: Publishing logic is scattered across many module files. There is no single place to see what a pipeline produces or to reason about the output structure.

  • Redundant configuration: Common settings like the base output directory and publish mode must be repeated in every publishDir declaration, leading to duplication.

  • Mismatch with channels: Channels carry both files and structured metadata (e.g., sample IDs, quality flags). The publishDir directive matches files with glob patterns and cannot capture metadata unless it happens to be written to a file. This mismatch makes it difficult to produce structured, self-describing outputs.

Goals

  • Declare all pipeline outputs in a single location alongside the entry workflow.

  • Assign outputs from channels rather than from individual process definitions, decoupling pipeline-specific publishing rules from reusable modules.

  • Support dynamic and fine-grained file publishing to match common publishing patterns (e.g. directory per sample, directory per pipeline step).

  • Support structured index files (CSV, JSON, YAML) that preserve output files with associated metadata.

  • Define publishing behavior (mode, overwrite, storage class, etc.) globally in the config.

  • Support type annotations on output declarations for documentation and compile-time validation.

Non-goals

  • Removing support for publishDir immediately -- publishDir should continue to work without modification, although it may eventually be phased out as users migrate away from it.

  • Publishing outputs from processes or named workflows -- only the entry workflow has a publish: section.

  • Defining a JSON schema for workflow outputs -- schema/spec generation will be explored in the future.

Decision

Introduce the output block for declaring workflow outputs. Each output defines how files are published to the output directory, and the format of the index file (if defined).

Introduce the output directory as a first-class concept in Nextflow, as well as the workflow.output config scope for controlling publishing behavior.

Core Capabilities

Output definition

Workflow outputs consist of an output block, which declares each output, and a publish: section in the entry workflow, which assigns a dataflow source (channel or value) to each output:

workflow {
    main:
    ch_fastqc = FASTQC(ch_reads)
    ch_report = MULTIQC(ch_fastqc.collect())

    publish:
    fastqc = ch_fastqc
    report = ch_report
}

output {
    fastqc: Channel<Path> {
        path 'fastqc'
    }
    report: Path {
        path '.'
    }
}

Every output assigned in publish: must be declared in the output block, and vice versa. A mismatch is a compile-time error.

Each output declaration can specify a type annotation for documentation and type checking support. Type annotations are optional and do not change runtime behavior. They are used by the type checker to validate the publish: section and the path directive.

Output directory

The top-level output directory defaults to results in the launch directory. It can be overridden from the command line or config file:

nextflow run main.nf -output-dir my-results
// nextflow.config
outputDir = 'my-results'

All publish paths declared in the output block are relative to this directory. Absolute paths are not allowed.

Static and dynamic publish paths

The path directive accepts a string for a fixed path, or a closure for per-value paths:

output {
    // static: all files go to results/fastq/
    reads {
        path 'fastq'
    }

    // dynamic: results are organized by sample id
    samples {
        path { sample -> "${sample.id}" }
    }
}

Nextflow recursively scans channel values for files, including files nested inside lists, maps, records, and tuples. Files that did not originate from the work directory are not published.

Fine-grained file publishing with >>

Within a path closure, individual files can be published to different locations using the >> operator. Only files explicitly captured with >> are published; other files in the value are ignored.

output {
    samples {
        path { sample ->
            sample.fastqc       >> "fastqc/"
            sample.bam          >> (params.save_bams ? "align/" : null)
            sample.bam_index    >> (params.save_bams ? "align/" : null)
        }
    }
}

The publish source (left-hand side) should be a file or collection of files. The publish target (right-hand side) should be a relative path. If the target has a trailing slash, then the source is published into the target directory; otherwise the source is published as the target name.

A null target suppresses publishing for that file, and a null source is also a no-op. This way, publishing of individual files can be disabled by either setting the record field to null in workflow logic or using a param in the publish statement.

Index files

Each output can generate a structured index file that records each published channel value along with its metadata. Supported formats are CSV, JSON, and YAML.

output {
    samples {
        path 'fastq'
        index {
            path 'samples.csv'
            header true
        }
    }
}

The index file is essentially a samplesheet -- it preserves the structure of files and metadata in the published channel, and can be easily passed as input to downstream pipelines. Metadata fields (sample IDs, quality flags, etc.) do not need to be written to a separate metadata file or encoded into file paths.

Files that did not originate from the work directory are not published, but are still included in the index.

Global defaults via configuration

Common publish settings can be set globally under the workflow.output config scope:

// nextflow.config
workflow {
    output {
        mode = 'copy'
        overwrite = 'lenient'
    }
}

These defaults can be overridden per-output in the output block:

// main.nf
output {
    fastqc {
        mode = 'symlink'
        overwrite = true
    }
}

Alternatives

Publishing from processes and subworkflows

Earlier iterations allowed for workflow outputs to be published from subworkflows or processes, instead of requiring all workflow outputs to be propagated up to the entry workflow.

While this approach is less verbose, it breaks the modularity of processes and subworkflows. Publishing behavior is inherent to the pipeline, not the individual subcomponents which could be shared across many pipelines. The process or subworkflow should expose all of its outputs as channels, and the calling pipeline should decide whether and how to publish these outputs.

On the other hand, propagating all workflow outputs to the top will make pipelines more verbose, especially when using "skinny tuple" channels. This issue will be alleviated by migrating from tuples to records -- for this reason, it is recommended that large pipelines be migrated to records before being migrated to workflow outputs.