add nextflow d30e48d

2026-04-29 23:01:54 +02:00
parent d0b12d668d
commit 97cc9058d3
2840 changed files with 730250 additions and 0 deletions


@@ -0,0 +1,379 @@
(data-lineage-page)=
# Getting started with data lineage
Data lineage in Nextflow provides comprehensive tracking of workflow runs, task executions, and output files. This feature helps you verify the integrity and reproducibility of your pipeline results by maintaining a complete history of computations and intermediate data.
:::{warning}
Data lineage is an experimental feature added in Nextflow 25.04. The functionality may change in future releases.
:::
## Overview
Data lineage tracks the complete history of your Nextflow runs, including:
- Workflow runs and their configurations
- Task executions and their inputs/outputs
- File outputs and their provenance
Each lineage record has a unique identifier called a *lineage ID* (LID) that you can use to access and query the data.
:::{note}
The data model for every lineage record is defined in the Nextflow [source code](https://github.com/nextflow-io/nextflow/tree/master/modules/nf-lineage/src/main/nextflow/lineage/model).
:::
## Enable data lineage
To enable data lineage tracking, add the following to your Nextflow configuration:
```groovy
lineage.enabled = true
```
By default, lineage data is stored in the `.lineage` directory in your current working directory. You can customize this location:
```groovy
lineage.store.location = '<PATH_TO_STORAGE>'
```
:::{tip}
For global configuration, add these settings to `$HOME/.nextflow/config`.
:::
See the {ref}`config-lineage` configuration scope for details.
## Generate lineage metadata
Run a Nextflow pipeline to generate some lineage metadata. For example:
```console
$ nextflow run rnaseq-nf -profile conda
```
Nextflow will automatically record the workflow run, task executions, and output files in the lineage store.
## Explore lineage
Now that you have generated some lineage metadata, you can explore it from the command line using the {ref}`cli-lineage` command.
First, use the `list` subcommand to list the workflow runs in the lineage store:
```console
$ nextflow lineage list
TIMESTAMP RUN NAME SESSION ID LINEAGE ID
2025-05-09 13:28:30 CDT peaceful_blackwell 065bdc6b-89b4-42ee-92c1-2a5af37f2c50 lid://16b31030474f2e96c55f4940bca3ab64
```
The *lineage ID* (LID) is the unique identifier for the workflow run and the entrypoint for exploring the lineage.
Use the `view` subcommand to view the lineage record for the workflow run:
```console
$ nextflow lineage view lid://16b31030474f2e96c55f4940bca3ab64
{
"version": "lineage/v1beta1",
"type": "WorkflowRun",
"workflow": {
"scriptFiles": [
...
],
"repository": "https://github.com/nextflow-io/rnaseq-nf",
"commitId": "86165b8c81d43a1f57363964431395152e353e56"
},
"sessionId": "065bdc6b-89b4-42ee-92c1-2a5af37f2c50",
"name": "peaceful_blackwell",
"params": [
...
],
"config": {
...
}
}
```
Every workflow run is represented in the lineage store as a `WorkflowRun` record. It includes information such as the pipeline repository, revision, run name, parameters, and resolved config.
The output files of a workflow run can be accessed as `lid://<WORKFLOW_RUN_HASH>/<PATH>`, where `<PATH>` is the file path relative to the workflow output directory.
:::{note}
Files must be published to the workflow output directory as defined by the `outputDir` config option (or `-output-dir` command line option) in order to be recorded as workflow outputs in the lineage store.
:::
List the output directory to see the available files:
```console
$ find results
results
results/fastqc_ggal_gut_logs
results/fastqc_ggal_gut_logs/ggal_gut_1_fastqc.html
results/fastqc_ggal_gut_logs/ggal_gut_1_fastqc.zip
results/fastqc_ggal_gut_logs/ggal_gut_2_fastqc.html
results/fastqc_ggal_gut_logs/ggal_gut_2_fastqc.zip
results/multiqc_report.html
```
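The mapping from a published file to its LID can be sketched in Python. This is only an illustration of the `lid://<WORKFLOW_RUN_HASH>/<PATH>` convention described above: the helper name `output_lid` is hypothetical and not part of Nextflow, and the sketch assumes the output directory is `results`, as in the listing.

```python
from pathlib import PurePosixPath


def output_lid(run_hash: str, file_path: str, output_dir: str = "results") -> str:
    """Build lid://<WORKFLOW_RUN_HASH>/<PATH> for a published file (hypothetical helper)."""
    # <PATH> is the file path relative to the workflow output directory
    relative = PurePosixPath(file_path).relative_to(output_dir)
    return f"lid://{run_hash}/{relative}"


run_hash = "16b31030474f2e96c55f4940bca3ab64"  # workflow run hash from `nextflow lineage list`
print(output_lid(run_hash, "results/multiqc_report.html"))
print(output_lid(run_hash, "results/fastqc_ggal_gut_logs/ggal_gut_1_fastqc.html"))
```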
Now, use the workflow LID and relative path to view the lineage record for an output file:
```console
$ nextflow lineage view lid://16b31030474f2e96c55f4940bca3ab64/multiqc_report.html
{
"version": "lineage/v1beta1",
"type": "FileOutput",
"path": "/results/multiqc_report.html",
"checksum": {
"value": "03fd5ed150c7862e1fad5efd4f574a47",
"algorithm": "nextflow",
"mode": "standard"
},
"source": "lid://862df53160e07cd823c0c3960545e747/multiqc_report.html",
"workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64",
"taskRun": null,
"size": 5079806,
"createdAt": "2025-05-09T13:27:34.576590545-05:00",
"modifiedAt": "2025-05-09T13:27:34.586590551-05:00",
"labels": null
}
```
Every output file is represented in the lineage store as a `FileOutput` record. It includes basic file information, such as the real path, checksum, file size and created/modified timestamps, as well as lineage information, such as the workflow run and task run that produced it.
As this record is a workflow output, it is not linked directly to a task run. Instead, it is linked to the original task output.
:::{note}
The `labels` field is `null` because no labels were assigned to this file. Labels are set using the `label` directive in the `output` block. See {ref}`data-lineage-workflow-outputs` for more information.
:::
Any LID in a lineage record can be viewed, allowing you to traverse the lineage metadata interactively. Use the value of `source` to view the original task output:
```console
$ nextflow lineage view lid://862df53160e07cd823c0c3960545e747/multiqc_report.html
{
"version": "lineage/v1beta1",
"type": "FileOutput",
"path": "/work/86/2df53160e07cd823c0c3960545e747/multiqc_report.html",
"checksum": {
"value": "b14f5171a48ce5c22ea27d7b8e57b6c4",
"algorithm": "nextflow",
"mode": "standard"
},
"source": "lid://862df53160e07cd823c0c3960545e747",
"workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64",
"taskRun": "lid://862df53160e07cd823c0c3960545e747",
"size": 5079806,
"createdAt": "2025-05-09T13:27:34.236590379-05:00",
"modifiedAt": "2025-05-09T13:27:34.246590383-05:00",
"labels": null
}
```
This record is the task output for the same file: its `taskRun` field is set, and it has the same value as its `source`.
View the lineage record for the task that produced this file:
```console
$ nextflow lineage view lid://862df53160e07cd823c0c3960545e747
{
"version": "lineage/v1beta1",
"type": "TaskRun",
"sessionId": "065bdc6b-89b4-42ee-92c1-2a5af37f2c50",
"name": "MULTIQC",
"codeChecksum": {
"value": "edf2e9f84cd3a18ee9259012b660f2dd",
"algorithm": "nextflow",
"mode": "standard"
},
"script": "\n cp multiqc/* .\n echo \"custom_logo: $PWD/nextflow_logo.png\" \u003e\u003e multiqc_config.yaml\n multiqc -n multiqc_report.html .\n ",
"input": [
{
"type": "path",
"name": "*",
"value": [
"lid://eff8846883b46c5a76f11e7e4480a6c8/ggal_gut",
"lid://2d8bd92c69f732605bc99941e60d5319/fastqc_ggal_gut_logs"
]
},
{
"type": "path",
"name": "config",
"value": [
{
"path": "https://github.com/nextflow-io/rnaseq-nf/tree/86165b8c81d43a1f57363964431395152e353e56/multiqc",
"checksum": {
"value": "2aac500cdfb292e961e678433e7dc3d8",
"algorithm": "nextflow",
"mode": "standard"
}
}
]
}
],
"container": null,
"conda": "file:///conda/env-4a436c230263dfdbbf4dddd0623505d1",
"spack": null,
"architecture": null,
"globalVars": {},
"binEntries": [],
"workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64"
}
```
Every task run is represented in the lineage store as a `TaskRun`, which includes information such as the name, script, inputs, and software dependencies. From here, you can continue traversing through the file inputs to view upstream tasks.
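The traversal walked through above (workflow output, then task output, then task run) can be sketched as a loop over `source` fields. The Python below is a minimal illustration that uses abbreviated copies of the three records shown in this section, held in a dictionary; in practice each record would come from `nextflow lineage view`. The function name `trace_sources` is hypothetical.

```python
# Abbreviated copies of the records shown above, keyed by LID.
records = {
    "lid://16b31030474f2e96c55f4940bca3ab64/multiqc_report.html": {
        "type": "FileOutput",
        "source": "lid://862df53160e07cd823c0c3960545e747/multiqc_report.html",
    },
    "lid://862df53160e07cd823c0c3960545e747/multiqc_report.html": {
        "type": "FileOutput",
        "source": "lid://862df53160e07cd823c0c3960545e747",
    },
    "lid://862df53160e07cd823c0c3960545e747": {
        "type": "TaskRun",
        "name": "MULTIQC",
    },
}


def trace_sources(lid, records):
    """Follow `source` links until reaching a record that has none."""
    chain = [lid]
    while records[lid].get("source"):
        lid = records[lid]["source"]
        chain.append(lid)
    return chain


start = "lid://16b31030474f2e96c55f4940bca3ab64/multiqc_report.html"
for step in trace_sources(start, records):
    print(records[step]["type"], step)
```

The chain ends at the `TaskRun` record because task runs carry `input` entries rather than a `source` field; continuing upstream would mean following the LIDs listed in `input`.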
Finally, use the `render` subcommand to render the entire lineage of the MULTIQC report as an HTML report:
```console
$ nextflow lineage render lid://16b31030474f2e96c55f4940bca3ab64/multiqc_report.html
Rendered lineage graph for lid://16b31030474f2e96c55f4940bca3ab64/multiqc_report.html to lineage.html
```
Open the HTML report in a web browser to view the lineage graph.
## Query lineage records
To find a lineage record, you normally have to know the LID of the record or a downstream record (such as a workflow run) from which you can traverse to the desired record. However, you can also query the entire lineage store by fields to quickly find relevant records and aggregate records from different runs.
Use the `find` subcommand to find all tasks executed by a workflow run:
```console
$ nextflow lineage find type=TaskRun workflowRun=lid://16b31030474f2e96c55f4940bca3ab64
[
"lid://2d8bd92c69f732605bc99941e60d5319",
"lid://eff8846883b46c5a76f11e7e4480a6c8",
"lid://862df53160e07cd823c0c3960545e747",
"lid://6d3bff36bf2c3c14c2d383384621e8ca"
]
```
You can use any field defined in the [lineage data model](https://github.com/nextflow-io/nextflow/tree/master/modules/nf-lineage/src/main/nextflow/lineage/model).
:::{tip}
Since the `find` and `view` subcommands always output JSON, you can use JSON processing tools such as [jq](https://jqlang.org/) to further query and transform results.
:::
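As an alternative to jq, the JSON output can be post-processed in a short script. The Python sketch below parses the `find` output from the example above and derives the short `xx/yyyyyy` prefix that appears in the run log (for example `[2d/8bd92c]`). The function names are hypothetical, and the prefix rule is inferred from the examples on this page rather than taken from the Nextflow source.

```python
import json

# Output of `nextflow lineage find` copied from the example above.
find_output = """
[
  "lid://2d8bd92c69f732605bc99941e60d5319",
  "lid://eff8846883b46c5a76f11e7e4480a6c8",
  "lid://862df53160e07cd823c0c3960545e747",
  "lid://6d3bff36bf2c3c14c2d383384621e8ca"
]
"""


def task_hashes(find_json):
    """Strip the lid:// scheme from each LID returned by `find`."""
    return [lid.removeprefix("lid://") for lid in json.loads(find_json)]


def log_prefix(task_hash):
    """Short form shown in the run log, e.g. 2d/8bd92c (inferred from the examples)."""
    return f"{task_hash[:2]}/{task_hash[2:8]}"


for task_hash in task_hashes(find_output):
    print(log_prefix(task_hash), task_hash)
```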
## Compare task runs
Task run LIDs are based on the standard {ref}`task hash <cache-resume-task-hash>`, which makes it easy to compare two task runs in the lineage metadata. For example, if a task is unexpectedly re-executed during a resumed run, as long as lineage is enabled for both the initial and resumed runs, the two tasks can be compared without any additional runs.
This section builds on the above [`rnaseq-nf` example](#generate-lineage-metadata) to demonstrate how to compare two task runs in the event of a cache invalidation.
First, modify the pipeline in a way that invalidates the cache for the `MULTIQC` process. For example, modify the process script.
Resume the pipeline. It will re-execute the `MULTIQC` process:
```console
$ nextflow run rnaseq-nf -profile conda -resume
...
[6d/3bff36] process > RNASEQ:INDEX (ggal_1_48850000_49020000) [100%] 1 of 1, cached: 1 ✔
[2d/8bd92c] process > RNASEQ:FASTQC (FASTQC on ggal_gut) [100%] 1 of 1, cached: 1 ✔
[ef/f88468] process > RNASEQ:QUANT (ggal_gut) [100%] 1 of 1, cached: 1 ✔
[94/33dda7] process > MULTIQC [100%] 1 of 1 ✔
```
Retrieve the hash of the `MULTIQC` run from the log file or work directory. Compare it to the task hash of the initial run:
```console
$ nextflow lineage diff lid://862df53160e07cd823c0c3960545e747 lid://9433dda73f2193491f9a26e3e23cd8a1
diff --git 862df53160e07cd823c0c3960545e747 9433dda73f2193491f9a26e3e23cd8a1
--- 862df53160e07cd823c0c3960545e747
+++ 9433dda73f2193491f9a26e3e23cd8a1
@@ -3,11 +3,11 @@
"sessionId": "065bdc6b-89b4-42ee-92c1-2a5af37f2c50",
"name": "MULTIQC",
"codeChecksum": {
- "value": "edf2e9f84cd3a18ee9259012b660f2dd",
+ "value": "9615a8da3a3f9e935cfc8e4042cdf5e0",
"algorithm": "nextflow",
"mode": "standard"
},
- "script": "\n cp multiqc/* .\n echo \"custom_logo: $PWD/nextflow_logo.png\" \u003e\u003e multiqc_config.yaml\n multiqc -n multiqc_report.html .\n ",
+ "script": "\n cp multiqc/* . # hello!\n echo \"custom_logo: $PWD/nextflow_logo.png\" \u003e\u003e multiqc_config.yaml\n multiqc -n multiqc_report.html .\n ",
"input": [
{
"type": "path",
@@ -38,5 +38,5 @@
"architecture": null,
"globalVars": {},
"binEntries": [],
- "workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64"
+ "workflowRun": "lid://65044872aad36f97e42336b9ba0dee57"
}
```
Note the difference between the two task scripts. It shows the change that caused the task to be re-executed.
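The idea behind the comparison can be reproduced outside Nextflow with a generic text diff of the two JSON records. The Python sketch below is an illustration under stated assumptions, not how `nextflow lineage diff` is implemented: it uses the standard `difflib` module on abbreviated copies of the two `MULTIQC` records, and the function name `diff_records` is hypothetical.

```python
import difflib
import json


def diff_records(old, new, old_name, new_name):
    """Return a unified diff of two lineage records serialized as JSON."""
    old_lines = json.dumps(old, indent=2).splitlines()
    new_lines = json.dumps(new, indent=2).splitlines()
    diff = difflib.unified_diff(old_lines, new_lines, old_name, new_name, lineterm="")
    return "\n".join(diff)


# Abbreviated fields copied from the two MULTIQC task records above.
initial = {
    "name": "MULTIQC",
    "codeChecksum": {"value": "edf2e9f84cd3a18ee9259012b660f2dd"},
    "workflowRun": "lid://16b31030474f2e96c55f4940bca3ab64",
}
resumed = {
    "name": "MULTIQC",
    "codeChecksum": {"value": "9615a8da3a3f9e935cfc8e4042cdf5e0"},
    "workflowRun": "lid://65044872aad36f97e42336b9ba0dee57",
}

print(diff_records(
    initial,
    resumed,
    "862df53160e07cd823c0c3960545e747",
    "9433dda73f2193491f9a26e3e23cd8a1",
))
```

Fields that are identical in both records, such as `name`, appear only as context lines, so the changed checksum and workflow run stand out.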
(data-lineage-workflow-outputs)=
## Use lineage with workflow outputs
Workflow outputs declared in the `output` block are also recorded in the lineage store. The output of a workflow run is accessible as `lid://<WORKFLOW_RUN_HASH>#output`.
For example, run the `rnaseq-nf` pipeline with the `preview-25-04` branch, which uses the `output` block to publish outputs:
```console
$ nextflow run rnaseq-nf -r preview-25-04 -profile conda
```
View the workflow output in the lineage metadata:
```console
$ nextflow lineage view lid://9410d13abeec617640b5fe9735ba12fc#output
[
{
"type": "Collection",
"name": "samples",
"value": "lid://9410d13abeec617640b5fe9735ba12fc/samples.json"
},
{
"type": "Path",
"name": "summary",
"value": "lid://9410d13abeec617640b5fe9735ba12fc/multiqc_report.html"
}
]
```
This view can be used to traverse output files directly instead of inferring LIDs from the workflow output directory.
The following types are used in workflow output lineage records:
| Type | JSON representation | Nextflow type |
| ---- | ------------------- | ------------- |
| `Collection` | array (or index file path) | `Bag`, `List`, `Set` |
| `Map` | object | `Map`, `Record` |
| `Path` | string with `lid://` prefix | `Path` |
When labels are assigned to a workflow output with the `label` directive, they appear in the `labels` field of each corresponding `FileOutput` record:
```console
$ nextflow lineage view lid://9410d13abeec617640b5fe9735ba12fc/multiqc_report.html
{
"version": "lineage/v1beta1",
"type": "FileOutput",
"path": "/results/multiqc_report.html",
...
"labels": ["qc", "summary"]
}
```
Labels can be used to filter files when querying lineage records with the `nextflow lineage find` command. See {ref}`workflow-output-labels` for details on assigning labels to workflow outputs.
## Use lineage in a Nextflow script
Since lineage IDs are valid URIs, output files in the lineage store can be accessed by their LID in a Nextflow script, like any other path. The LID path returns the *real* path as defined by the `path` field in the `FileOutput` record.
The following script uses the `samples.json` from the previous example as an input samplesheet:
```nextflow
channel.fromPath('lid://9410d13abeec617640b5fe9735ba12fc/samples.json')
.splitJson()
.view()
```
It should produce the following output:
```console
[id:gut, quant:/results/gut/quant, fastqc:/results/gut/fastqc]
```
The `fromLineage` channel factory can also be used to query lineage records in a similar manner as the `find` subcommand. See {ref}`channel-from-lineage` for details.


@@ -0,0 +1,147 @@
(flux-page)=
# Using Nextflow with Flux
:::{versionadded} 22.11.0-edge
:::
## Overview
The [Flux Framework](https://flux-framework.org/) is a modern resource manager that can span the space between cloud and HPC. If your center does not provide Flux, you can [build Flux yourself](https://flux-framework.readthedocs.io/en/latest/quickstart.html#building-the-code) and launch it as a job using your resource manager of choice (e.g. SLURM or a cloud provider).
In the [`docker/flux`](https://github.com/nextflow-io/nextflow/tree/master/docker/flux) directory we provide a [Dockerfile for interacting with Flux](https://github.com/nextflow-io/nextflow/tree/master/docker/flux/.devcontainer/Dockerfile) along with a [VSCode Developer Container](https://code.visualstudio.com/docs/devcontainers/containers) environment that you can put at the root of the project to be provided with a Flux agent and the dependencies needed to build Nextflow. There are two ways to use this:
- Build a container from scratch and bind your code to it (e.g. for development or testing)
- Use VSCode and DevContainers to create a more seamless environment
Both strategies are described below. For this tutorial, you will generally want to prepare a pipeline to use the `flux` executor, create an environment with Flux, start a Flux instance, and interact with it.
## Prepare your pipeline
To run your pipeline with Flux, you'll want to specify it in your config. Here is an example `nextflow.config`:
```groovy
manifest {
    mainScript = 'demo.nf'
    homePage = 'https://github.com/nextflow-io/nextflow/tree/master/docker/flux'
    description = 'Demo using Nextflow with Flux'
}

process {
    executor = 'flux'
}
```
For additional Flux settings, see the {ref}`flux-executor` section.
Here is an example pipeline that we will use:
```nextflow
workflow {
    breakfast = channel.of '🥞️', '🥑️', '🥧️', '🍵️', '🍞️'
    haveMeal(breakfast)
}

process haveMeal {
    debug true

    input:
    val food

    script:
    """
    printf '$food for breakfast!'
    """
}
```
## Prepare your environment
You can build the Docker image from the root of the Nextflow repository:
```console
$ docker build -f docker/flux/.devcontainer/Dockerfile --platform linux/amd64 -o type=docker -t nextflow-flux .
```
Then open a shell in the container to get a development environment. Bind the present working directory to `/code` to see your local changes in the container:
```console
$ docker run -it -v $PWD:/code nextflow-flux
```
You can also move the `.devcontainer` directory to the root of your repository, and open it in VSCode:
```console
$ cp -R docker/flux/.devcontainer .devcontainer
```
Then open in VSCode, and select **Re-open in container**:
```console
$ code .
```
Then you should be able to open a terminal (**Terminal** -> **New Terminal**) to interact with the command line. Try running `make` to build Nextflow. Whichever of these two approaches you take, you should be in a container environment with the `flux` command available.
## Start a Flux instance
Once in your container, you can start an interactive Flux instance (from which you can submit jobs on the command line to test with Nextflow) as follows:
```console
$ flux start --test-size=4
```
### Getting familiar with Flux
Here is an example of submitting a job and getting the log for it.
First submit the job:
```console
$ flux submit echo "HELLO MOTO"
ƒEzWqspb
```
Then get the log for it:
```console
$ flux job attach ƒEzWqspb
HELLO MOTO
```
Try submitting a longer job:
```console
$ flux submit sleep 60
```
Then view it in the jobs listing:
```console
$ flux jobs
JOBID USER NAME ST NTASKS NNODES TIME INFO
ƒ4tkMUAAT root sleep R 1 1 2.546s ab6634a491bb
```
## Submitting with Nextflow
Prepare your `nextflow.config` and `demo.nf` in the same directory.
```console
$ ls .
demo.nf nextflow.config
```
Finally, run the pipeline with Flux:
```console
$ nextflow -c nextflow.config run demo.nf
N E X T F L O W ~ version 22.10.0
Launching `demo.nf` [clever_blackwell] DSL2 - revision: f8cda838cb
executor > flux (5)
[4c/f162db] process > haveMeal (3) [100%] 5 of 5 ✔
🥞️ for breakfast!
🍞️ for breakfast!
🍵️ for breakfast!
🥑️ for breakfast!
🥧️ for breakfast!
```


@@ -0,0 +1,315 @@
(metrics-page)=
# Understanding task resource metrics
This tutorial explains how resource usage metrics are computed from execution reports. See {ref}`Execution report <execution-report>` for more information about how to enable them.
## CPU Usage
CPU Usage plots report how CPU resources are used by each process.
```{image} ../_static/report-resource-cpu-noheader.png
```
**Raw Usage** tabs are expected to show 100% core usage if processes perform one task of pure computation. If tasks are distributed over 2, 3, or 4 CPUs, the raw usage will be 200%, 300%, or 400%, respectively. **% Allocated** tabs rescale raw usage values relative to the number of CPUs that are set with the `cpus` directive. If the `cpus` directive is not set, CPUs are set to `1` and **% Allocated** tabs will show the same values as **Raw Usage** tabs.
For example, using the [stress](https://people.seas.harvard.edu/~apw/stress/) program, the following script would report 100% CPU usage in the **Raw Usage** tab and 50% CPU usage in the **% Allocated** tab as the process requested double the number of CPUs that are required:
```nextflow
process cpu_usage_ex1 {
    cpus 2

    script:
    """
    stress -c 1 -t 10 # compute square-root of random numbers during 10s using 1 CPU
    """
}

workflow {
    cpu_usage_ex1() // Stress using 1 CPU
}
```
:::{tip}
See [Linux stress command with examples](https://www.geeksforgeeks.org/linux-stress-command-with-examples/) for more information about the stress command.
:::
CPU usage decreases if processes spend some time performing pure computation and some time waiting for CPUs. For example, using the `stress` and `sleep` commands, the following script would report 75% CPU usage in the **Raw Usage** tab:
```nextflow
process cpu_usage_ex2 {
    cpus 1

    script:
    """
    stress -c 1 -t 10 # compute square-root of random numbers during 10s using 1 CPU
    stress -c 1 -t 5 # compute square-root of random numbers during 5s using 1 CPU
    sleep 5 # use no CPU during 5s
    """
}

workflow {
    cpu_usage_ex2() // Stress using 1 CPU and sleep
}
```
In the above example, CPU usage is a weighted average that accounts for the percentage of the CPU used and duration of each individual program over the job duration:
$$
\frac{ 100\% \times 10s + 100\% \times 5s + 0\% \times 5s }{10s+5s+5s} = 75\%
$$
CPU usage increases if a single step is forked on multiple CPUs:
```nextflow
process cpu_usage_ex3 {
    cpus 2

    script:
    """
    stress -c 2 -t 10 # compute square-root of random numbers during 10 s using 2 CPUs
    sleep 10 # use no CPU during 10s
    """
}

workflow {
    cpu_usage_ex3() // Stress using 2 CPUs and sleep
}
```
In the above example, the **Raw Usage** tab would report 100%:
$$
\frac{ 200\% \times 10s + 0\% \times 10s }{10s+10s} = 100\%
$$
However, the **% Allocated** tab would report 50%. Lowering the `cpus` directive from `2` to `1` would not be appropriate, because the process uses 2 CPUs at its peak load.
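The arithmetic behind the three examples in this section can be checked with a short Python sketch. It assumes idealized steps, each described by a CPU percentage and a duration in seconds; the function name `cpu_usage` is hypothetical, and real reports are measured from the running task, so they will only approximate these values.

```python
def cpu_usage(steps, cpus=1):
    """Return (raw usage %, % allocated) for a list of (usage_percent, seconds) steps."""
    total_time = sum(seconds for _, seconds in steps)
    # Time-weighted average of the CPU percentage over the whole job
    raw = sum(percent * seconds for percent, seconds in steps) / total_time
    # % Allocated rescales raw usage by the number of requested CPUs
    return raw, raw / cpus


print(cpu_usage([(100, 10)], cpus=2))                    # cpu_usage_ex1 -> (100.0, 50.0)
print(cpu_usage([(100, 10), (100, 5), (0, 5)], cpus=1))  # cpu_usage_ex2 -> (75.0, 75.0)
print(cpu_usage([(200, 10), (0, 10)], cpus=2))           # cpu_usage_ex3 -> (100.0, 50.0)
```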
## Memory Usage
Memory Usage plots report how memory was used by each process. They have three tabs, **Physical (RAM)**, **Virtual (RAM + Disk swap)**, and **% RAM Allocated**, showing the usage of the physical memory (RAM), the virtual memory (vmem), and the percentage of RAM used by the process relative to the memory that the `memory` directive set, respectively.
Peak usage during process executions is reported for both physical and virtual memories. The total amount of memory used by a process is the `virtual memory (vmem)`. The `vmem` contains all memory areas, including in the physical memory (RAM), in the swap space, on the disk, or shared with other processes. The `resident set size (RSS)` is the amount of `physical memory (RAM)` held by a process.
The relationship is:
$$
vmem \geq RSS + Swap
$$
The behavior of **Memory Usage** plots can be examined using two programs written in C. The first program allocates a variable of 1 GiB:
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
#include <time.h>

/* Get vmem and rss usage (in bytes) from /proc/<pid>/statm */
static int mem_used(pid_t pid, unsigned long* vmem, unsigned long* rss) {
    FILE* file;
    char path[40];
    unsigned int page_size;

    snprintf(path, sizeof(path), "/proc/%ld/statm", (long) pid);
    file = fopen(path, "r");
    if (file == NULL) {
        perror(path);
        return -1;
    }
    // vmem and rss are the first values in the file
    if (fscanf(file, "%lu %lu", vmem, rss) != 2) {
        fclose(file);
        return -1;
    }
    fclose(file);
    // values in statm are in pages so to get bytes we need to know page size
    page_size = (unsigned) getpagesize();
    *vmem = *vmem * page_size;
    *rss = *rss * page_size;
    return 0;
}

int main(int argc, char **argv) {
    unsigned char *address;
    size_t size = 1024*1024*1024; // 1 GiB
    unsigned long vmem = 0;
    unsigned long rss = 0;
    pid_t pid;

    pid = getpid();
    printf("Pid: %ld\n", (long) pid);
    mem_used(pid, &vmem, &rss);
    printf("VMEM: %lu RSS: %lu\n", vmem, rss);

    // Allocate the memory but never touch it, so RSS stays low
    address = malloc(size);
    if (address == NULL) {
        perror("malloc");
        return 1;
    }
    printf("Allocated %zu Bytes of memory\n", size);
    mem_used(pid, &vmem, &rss);
    printf("VMEM: %lu RSS: %lu\n", vmem, rss);

    // Leave time for nextflow to get information
    sleep(15);
    free(address);
    return 0;
}
```
The second program allocates a variable of 1 GiB and fills it with data:
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
#include <time.h>

/* Get vmem and rss usage (in bytes) from /proc/<pid>/statm */
static int mem_used(pid_t pid, unsigned long* vmem, unsigned long* rss) {
    FILE* file;
    char path[40];
    unsigned int page_size;

    snprintf(path, sizeof(path), "/proc/%ld/statm", (long) pid);
    file = fopen(path, "r");
    if (file == NULL) {
        perror(path);
        return -1;
    }
    // vmem and rss are the first values in the file
    if (fscanf(file, "%lu %lu", vmem, rss) != 2) {
        fclose(file);
        return -1;
    }
    fclose(file);
    // values in statm are in pages so to get bytes we need to know page size
    page_size = (unsigned) getpagesize();
    *vmem = *vmem * page_size;
    *rss = *rss * page_size;
    return 0;
}

int main(int argc, char **argv) {
    unsigned char *address;
    size_t size = 1024*1024*1024; // 1 GiB
    size_t i;
    unsigned long vmem = 0;
    unsigned long rss = 0;
    pid_t pid;

    pid = getpid();
    printf("Pid: %ld\n", (long) pid);
    mem_used(pid, &vmem, &rss);
    printf("VMEM: %lu RSS: %lu\n", vmem, rss);

    address = malloc(size);
    if (address == NULL) {
        perror("malloc");
        return 1;
    }
    printf("Allocated %zu Bytes of memory\n", size);
    mem_used(pid, &vmem, &rss);
    printf("VMEM: %lu RSS: %lu\n", vmem, rss);

    // Write to every byte so the pages become resident in RAM
    printf("Filling memory with data...");
    fflush(stdout);
    for (i = 0; i < size; i++) {
        *(address + i) = 123;
    }
    mem_used(pid, &vmem, &rss);
    printf("\nVMEM: %lu RSS: %lu\n", vmem, rss);

    // Leave time for nextflow to get information
    sleep(15);
    free(address);
    return 0;
}
```
The first and second programs are executed as `malloc` and `malloc_fill`, respectively, in the following script:
```nextflow
process malloc {
    memory '1.5 GB'

    script:
    """
    memory_vmem_1GiB_ram_0Gib
    """
}

process malloc_fill {
    memory '1.5 GB'

    script:
    """
    memory_vmem_1GiB_ram_1Gib
    """
}

workflow {
    malloc()      // Allocates a variable of 1 GiB
    malloc_fill() // Allocates a variable of 1 GiB and fills it with data
}
```
The **Virtual (RAM + Disk swap)** tab shows that both `malloc` and `malloc_fill` use the same amount of virtual memory (~1 GiB):
```{image} ../_static/report-resource-memory-vmem.png
```
However, the **Physical (RAM)** tab shows that `malloc_fill` uses ~1 GiB of RAM while `malloc` uses ~0 GiB of RAM:
```{image} ../_static/report-resource-memory-ram.png
```
The **% RAM Allocated** tab shows that `malloc` and `malloc_fill` used 0% and 67% of resources set in the `memory` directive, respectively:
```{image} ../_static/report-resource-memory-pctram.png
```
:::{warning}
Memory and storage metrics are reported in bytes. For example, 1 KB = $1024$ bytes, 1 MB = $1024^2$ bytes, and 1 GB = $1024^3$ bytes.
:::
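The 0% and 67% figures follow from dividing peak resident memory by the requested memory. The Python sketch below reproduces that arithmetic under idealized assumptions: peak RSS is taken as exactly 0 and 1 GiB, whereas measured values will differ slightly. The function name `pct_ram_allocated` is hypothetical.

```python
GIB = 1024 ** 3  # binary units, as stated in the warning above: 1 GB = 1024^3 bytes


def pct_ram_allocated(peak_rss_bytes, requested_bytes):
    """Peak physical memory as a percentage of the `memory` directive."""
    return 100 * peak_rss_bytes / requested_bytes


requested = 1.5 * GIB  # memory '1.5 GB'
print(round(pct_ram_allocated(0, requested)))        # malloc      -> 0
print(round(pct_ram_allocated(1 * GIB, requested)))  # malloc_fill -> 67
```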
## Job Duration
**Job Duration** plots report how long each process took to run. They have two tabs. The **Raw Usage** tab shows the job duration and the **% Allocated** tab shows the time that was used relative to what was requested using the `time` directive. Job duration is sometimes known as elapsed real time, real time, or wall time.
```{image} ../_static/report-resource-job-duration.png
```
## I/O Usage
I/O Usage plots show how much data was read and written by processes. The amount of data that was read by a process (`rchar` in trace files) is the number of bytes the process read using read-like system calls. The amount of data that was written by a process (`wchar` in trace files) is the number of bytes the process wrote using write-like system calls. Both values are taken from the file `/proc/$pid/io`.
**Read** tabs show how much data was read and **Write** tabs show how much data was written by each process. For example, the following script reads and writes different data volumes:
```nextflow
process io_read_write_1G {
    script:
    """
    dd if=/dev/zero of=/dev/null bs=1G count=1
    """
}

process io_read_write_256M {
    script:
    """
    dd if=/dev/zero of=/dev/null bs=256M count=1
    """
}

workflow {
    io_read_write_1G()   // Read and write 1 GiB
    io_read_write_256M() // Read and write 256 MiB
}
```
The **Read** tab shows that ~1 GiB and ~256 MiB are read:
```{image} ../_static/report-resource-io-read.png
```
The **Write** tab shows that ~1 GiB and ~256 MiB are written:
```{image} ../_static/report-resource-io-write.png
```
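To make the source of these numbers concrete, the Python sketch below parses text in the `key: value` layout that Linux uses for `/proc/<pid>/io` and extracts `rchar` and `wchar`. It is an illustration only, not the code Nextflow uses to collect metrics; the function name `parse_proc_io` is hypothetical, and the sample contents are made up to match the 1 GiB example.

```python
def parse_proc_io(text):
    """Parse `key: value` lines from /proc/<pid>/io into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            counters[key.strip()] = int(value)
    return counters


# Hypothetical contents for a process that read and wrote 1 GiB
sample = f"rchar: {1024 ** 3}\nwchar: {1024 ** 3}\nsyscr: 1\nsyscw: 1\n"

io = parse_proc_io(sample)
print(io["rchar"] / 1024 ** 3, "GiB read")
print(io["wchar"] / 1024 ** 3, "GiB written")
```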


@@ -0,0 +1,226 @@
(rnaseq-nf-page)=
# Getting started with rnaseq-nf
[`rnaseq-nf`](https://github.com/nextflow-io/rnaseq-nf) is a basic Nextflow pipeline for RNA-Seq analysis that performs quality control, transcript quantification, and result aggregation. The pipeline processes paired-end FASTQ files, generates quality control reports with [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), quantifies transcripts with [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html), and produces a unified report with [MultiQC](https://seqera.io/multiqc/).
This tutorial describes the architecture of the [`rnaseq-nf`](https://github.com/nextflow-io/rnaseq-nf) pipeline and provides instructions on how to run it.
## Pipeline architecture
The pipeline is organized into modular workflows and processes that coordinate data flow from input files through analysis steps to final outputs.
### Entry workflow
The [entry workflow](https://github.com/nextflow-io/rnaseq-nf/blob/master/main.nf) orchestrates the entire pipeline by coordinating input parameters and data flow:
```{mermaid}
flowchart TB
subgraph " "
subgraph params
v0["transcriptome"]
v1["reads"]
v5["multiqc"]
v2["outdir"]
end
v4([RNASEQ])
v6([MULTIQC])
v0 --> v4
v1 --> v4
v4 --> v6
v5 --> v6
end
```
Data flow:
- The `transcriptome` and `reads` parameters are passed to the `RNASEQ` subworkflow, which performs indexing, quality control, and quantification.
- The outputs from `RNASEQ`, along with the MultiQC configuration (`multiqc`), are passed to the `MULTIQC` module, which aggregates results into a unified HTML report.
- The `outdir` parameter defines where all results are published.
### `RNASEQ`
The [`RNASEQ`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/rnaseq.nf) subworkflow coordinates three processes, some of which run in parallel and some in sequence:
```{mermaid}
flowchart TB
subgraph RNASEQ
subgraph take
v0["read_pairs_ch"]
v1["transcriptome"]
end
v2([INDEX])
v4([FASTQC])
v6([QUANT])
subgraph emit
v8["fastqc"]
v9["quant"]
end
v1 --> v2
v0 --> v4
v0 --> v6
v2 --> v6
v4 --> v8
v6 --> v9
end
```
Inputs (`take:`):
- `read_pairs_ch`: A channel of paired-end read files
- `transcriptome`: A reference transcriptome file
Data flow (`main:`):
- [`INDEX`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/index/main.nf) creates a Salmon index from the `transcriptome` input (runs once).
- [`FASTQC`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/fastqc/main.nf) analyzes the samples in the `read_pairs_ch` channel in parallel (runs independently for each sample).
- [`QUANT`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/quant/main.nf) quantifies transcripts using the index from `INDEX` and the samples in the `read_pairs_ch` channel (runs for each sample after `INDEX` completes).
Outputs (`emit:`):
- `fastqc`: The results from `FASTQC`
- `quant`: The results from `QUANT`
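The dependency structure described above can be sketched with Python's standard `graphlib` module. This is only a model of the diagram: Nextflow schedules tasks through channels as data becomes available, not in discrete waves, and the function name `execution_waves` is hypothetical.

```python
from graphlib import TopologicalSorter

# Each process maps to the processes whose output it needs (from the diagram above)
dependencies = {
    "INDEX": set(),
    "FASTQC": set(),
    "QUANT": {"INDEX"},
}


def execution_waves(dependencies):
    """Group processes into waves where each wave depends only on earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(dependencies), start=1):
    print(f"wave {number}: {wave}")
```

`INDEX` and `FASTQC` land in the first wave because neither depends on another process, while `QUANT` waits for `INDEX`.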
### `MULTIQC`
The [`MULTIQC`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/multiqc/main.nf) process aggregates all quality control and quantification outputs into a comprehensive HTML report.
Inputs:
- Input files: All collected outputs from the `RNASEQ` subworkflow (FastQC reports and Salmon quantification files).
- `config`: MultiQC configuration files and branding (logo, styling).
Process execution:
- `MULTIQC` scans all input files, extracts metrics and statistics, and generates a unified report.
Outputs:
- `multiqc_report.html`: A single consolidated HTML report providing an overview of:
- General stats
- Salmon fragment length distribution
- FastQC quality control
- Software versions
## Pipeline parameters
The pipeline behavior can be customized using command-line parameters to specify input data, output locations, and configuration files.
The pipeline accepts the following command-line parameters:
- `--reads`: Path to paired-end FASTQ files (default: `data/ggal/ggal_gut_{1,2}.fq`).
- `--transcriptome`: Path to reference transcriptome FASTA (default: `data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa`).
- `--outdir`: Output directory for results (default: `results`).
- `--multiqc`: Path to MultiQC configuration directory (default: `multiqc`).
## Configuration profiles
Configuration profiles allow you to customize how and where the pipeline runs by specifying the `-profile` flag. Multiple profiles can be specified as a comma-separated list. Profiles are defined in the [`nextflow.config`](https://github.com/nextflow-io/rnaseq-nf/blob/master/nextflow.config) file in the base directory.
<h3>Software profiles</h3>
Software profiles specify how software dependencies for processes should be provisioned:
- `conda`: Provision a Conda environment for each process based on its required Conda packages
- `docker`: Use a Docker container which contains all required dependencies
- `singularity`: Use a Singularity container which contains all required dependencies
- `wave`: Provision a Wave container for each process based on its required Conda packages
:::{note}
The respective container runtime or package manager must be installed to use these profiles.
:::
<h3>Execution profiles</h3>
Execution profiles specify the compute and storage environment used by the pipeline:
- `slurm`: Run on a SLURM HPC cluster
- `batch`: Run on AWS Batch
- `google-batch`: Run on Google Cloud Batch
- `azure-batch`: Run on Azure Batch
:::{note}
Depending on your environment, you may need to configure underlying infrastructure such as resource pools, storage, and credentials.
:::
## Test data
The pipeline includes test data in the [`data/ggal/`](https://github.com/nextflow-io/rnaseq-nf/tree/master/data/ggal) directory for demonstration and validation purposes:
- Paired-end FASTQ files from four tissue samples (gut, liver, lung, spleen):
- `ggal_gut_{1,2}.fq`
- `ggal_liver_{1,2}.fq`
- `ggal_lung_{1,2}.fq`
- `ggal_spleen_{1,2}.fq`
- Reference transcriptome:
- `ggal_1_48850000_49020000.Ggal71.500bpflank.fa`
By default, only the `gut` sample is processed. You can use the `all-reads` profile to process all four tissue samples.
## Quick start
The [`rnaseq-nf`](https://github.com/nextflow-io/rnaseq-nf) pipeline is executable out-of-the-box. This section provides examples for running the pipeline with different configurations.
### Basic execution
Run the pipeline with default parameters using Docker:
```bash
nextflow run nextflow-io/rnaseq-nf -profile docker
```
### Configuring individual parameters
Override default parameters to use custom input files and output locations:
```bash
nextflow run nextflow-io/rnaseq-nf \
--reads '/path/to/reads/*_{1,2}.fastq.gz' \
--transcriptome '/path/to/transcriptome.fa' \
--outdir 'my_results' \
-profile docker
```
### Using profiles
Specify configuration profiles to customize runtime environments and data sources:
```bash
# Use Conda to provision software dependencies
nextflow run nextflow-io/rnaseq-nf -profile conda
# Run on a SLURM cluster
nextflow run nextflow-io/rnaseq-nf -profile slurm
# Combine multiple profiles: process all reads using Docker
nextflow run nextflow-io/rnaseq-nf -profile all-reads,docker
```
:::{tip}
See [Configuration profiles](#configuration-profiles) for more information about profiles.
:::
## Expected outputs
The [`rnaseq-nf`](https://github.com/nextflow-io/rnaseq-nf) pipeline produces the following outputs in the `results` directory:
```
results/
├── fastqc_<SAMPLE_ID>_logs/ # FastQC quality reports per sample
│ ├── <SAMPLE_ID>_1_fastqc.html
│ ├── <SAMPLE_ID>_1_fastqc.zip
│ ├── <SAMPLE_ID>_2_fastqc.html
│ └── <SAMPLE_ID>_2_fastqc.zip
└── multiqc_report.html # Aggregated QC and Salmon report
```
The MultiQC report (`multiqc_report.html`) can be viewed in a web browser.

@@ -0,0 +1,513 @@
(migrating-static-types-operators)=
# Using operators with static typing
Nextflow 26.04 brings updates to the operator library in order to support static typing and records. This page provides best practices for using operators with static typing.
See {ref}`migrating-static-types` for more information about migrating pipelines to static typing.
## Overview
All operators can be used with or without static typing (i.e. {ref}`typed workflows <workflow-typed-page>`). However, only a core subset of operators are recommended for use with static typing, while the rest are discouraged. They are distinguished here as *core operators* and *legacy operators*.
## Core operators
The {ref}`core operators <operator-typed-page>` are recommended for use with static typing. When static typing is enabled (via `nextflow.enable.types`), some of these operators have stricter semantics which may require minor changes to pipeline code. These cases are described below.
### collect
When using `collect` with static typing, it has the same semantics as `toList`. Collected values are not flattened, and when the source channel is empty, an empty list is emitted.
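As a minimal sketch of this behavior, the following script collects a channel of lists. The values are arbitrary and chosen only for illustration:

```nextflow
nextflow.enable.types = true

workflow {
    // each source value is a two-element list (illustrative values)
    pairs_ch = channel.of( [1, 2], [3, 4] )

    // with static typing, the lists are collected as-is, like `toList`
    pairs_ch.collect().view()
}
```

Based on the semantics described above, this should emit a single list containing the two source lists, whereas `collect` without static typing flattens them into `[1, 2, 3, 4]` by default.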
### combine
When using `combine` with static typing, the right operand should be a channel, dataflow value, or named arguments corresponding to record fields.
When `combine` is called with named arguments, they are appended to each source record as additional fields:
```nextflow
sample = channel.value( record(id: 1, fastq: file('1.fq')) )
index = channel.value( file('index.fa') )
sample.combine(strandedness: 'auto', index: index)
// [id:1, fastq:1.fq, strandedness:auto, index:index.fa]
```
For uses of `combine` with the `by` option, use `join` instead:
```nextflow
// before
left = channel.of( [1, 'alpha'], [2, 'beta'] )
right = channel.of( [1, 'x'], [1, 'y'], [2, 'p'] )
left.combine(right, by: 0).view()
// [1, alpha, x]
// [1, alpha, y]
// [2, beta, p]
// after (static typing enabled)
left = channel.of(
record(id: 1, name: 'alpha'),
record(id: 2, name: 'beta')
)
right = channel.of(
record(id: 1, code: 'x'),
record(id: 1, code: 'y'),
record(id: 2, code: 'p')
)
left.join(right, by: 'id').view()
// [id:1, name:alpha, code:x]
// [id:1, name:alpha, code:y]
// [id:2, name:beta, code:p]
```
### filter
When using `filter` with static typing, the predicate should be a closure.
```nextflow
// before
ch.filter( ~/^a.*/ )
// after (static typing enabled)
ch.filter { v -> v ==~ /^a.*/ }
```
### flatMap
When using `flatMap` with static typing, the mapping closure should always return a collection. Maps and tuples are not automatically flattened because they are not collection types.
```nextflow
// before
channel.of( 1, 2, 3 )
.flatMap { n -> [ number: n, square: n*n, cube: n*n*n ] }
.view { entry -> "${entry.key}: ${entry.value}" }
// after (static typing enabled)
channel.of( 1, 2, 3 )
.flatMap { n -> [ tuple('number', n), tuple('square', n*n), tuple('cube', n*n*n) ] }
.view { key, value -> "${key}: ${value}" }
```
### groupBy
The `groupBy` operator is a replacement for `groupTuple` that is statically typed.
While `groupTuple` accepts tuples of arbitrary length, `groupBy` accepts either a 2-tuple of `(<key>, <value>)` or a 3-tuple of `(<key>, <size>, <value>)`. Specifying the group size with each input tuple provides the same behavior as using the `size` option (or wrapping each key with `groupKey()`) does with `groupTuple`.
While `groupTuple` can collect several parallel lists for each key (one for each non-key tuple element), `groupBy` always emits 2-tuples of the form `(<key>, <values>)`, where `<values>` is an unordered collection (`Bag`). This approach avoids a pitfall with `groupTuple` where the grouped lists can be ordered inconsistently.
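The following sketch shows both input forms described above. The sample names and file names are hypothetical:

```nextflow
workflow {
    // 2-tuples of (key, value)
    channel.of(
            tuple('gut', 'gut_1.fq'),
            tuple('liver', 'liver_1.fq'),
            tuple('gut', 'gut_2.fq'),
            tuple('liver', 'liver_2.fq')
        )
        .groupBy()
        .view()

    // 3-tuples of (key, size, value): each tuple declares
    // how many values are expected for its key
    channel.of(
            tuple('gut', 2, 'gut_1.fq'),
            tuple('liver', 2, 'liver_1.fq'),
            tuple('gut', 2, 'gut_2.fq'),
            tuple('liver', 2, 'liver_2.fq')
        )
        .groupBy()
        .view()
}
```

In both cases, each emitted value is a 2-tuple of the key and its grouped values. Because the values are a `Bag`, do not rely on their order within a group.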
### join
When using `join` with static typing, the `by` option is required. It should be either an integer (for joining tuples by index) or a string (for joining records by field name).
When using `join` with records, the `failOnDuplicate` and `failOnMismatch` options are not supported. Duplicate matches are handled by emitting each matching combination (like a relational join). Unmatched records are either emitted or discarded depending on whether the `remainder` option is set. To fail on mismatches, use the `remainder` option and check for unmatched records in downstream logic.
```nextflow
// tuples
left = channel.of( ['X', 1], ['Y', 2], ['Z', 3], ['P', 7] )
right = channel.of( ['Z', 6], ['Y', 5], ['X', 4] )
left.join(right, by: 0).view()
// [X, 1, 4]
// [Y, 2, 5]
// [Z, 3, 6]

// records
left = channel.of(
    record(id: 'X', a: 1),
    record(id: 'Y', a: 2),
    record(id: 'Z', a: 3),
    record(id: 'P', a: 7)
)
right = channel.of(
    record(id: 'Z', b: 6),
    record(id: 'Y', b: 5),
    record(id: 'X', b: 4)
)
left.join(right, by: 'id').view()
// [id:X, a:1, b:4]
// [id:Y, a:2, b:5]
// [id:Z, a:3, b:6]
```
### map
When using `map` with static typing, `null` values are not automatically discarded. Use `filter` to discard `null` values explicitly.
```nextflow
// before
ch.map { r -> r.id }
// after (static typing enabled)
ch.map { r -> r.id }.filter { id -> id != null }
```
### mix
When using `mix` with static typing, only one argument should be supplied for each `mix` call.
```nextflow
// before
ch1.mix(ch2, ch3)
// after (static typing enabled)
ch1.mix(ch2).mix(ch3)
```
## Legacy operators
Using the {ref}`legacy operators <operator-page>` with static typing is discouraged. They still work, but the type checker cannot validate them.
This section describes how to rewrite each legacy operator with core operators.
### branch
Use `filter` and `map` for each branch instead. Using records instead of tuples can eliminate much of the need for `branch`.
Example requiring only `filter`:
```nextflow
// before
ch_gvcf_branch = ch_gvcf.branch { meta, gvcf, tbi ->
no_tbi: !tbi
return tuple(meta, gvcf)
tbi: tbi
return tuple(meta, gvcf, tbi)
}
ch_gvcf_branch.no_tbi.view()
ch_gvcf_branch.tbi.view()
// after
ch_gvcf_no_tbi = ch_gvcf.filter { s -> !s.tbi }
ch_gvcf_tbi = ch_gvcf.filter { s -> s.tbi }
```
Example requiring `filter` and `map`:
```nextflow
// before
ch_input_by_type = ch_input.branch { meta, platform, fastq_1, fastq_2 ->
fastq: meta.single_end || fastq_2
return tuple(meta + [type: "short"], fastq_2 ? [fastq_1, fastq_2] : [fastq_1])
nanopore: platform == 'OXFORD_NANOPORE'
meta.single_end = true
return tuple(meta + [type: "long"], [fastq_1])
pacbio: platform == 'PACBIO_SMRT'
meta.single_end = true
return tuple(meta + [type: "long"], [fastq_1])
}
ch_input_by_type.fastq.view()
ch_input_by_type.nanopore.view()
ch_input_by_type.pacbio.view()
// after -- no more fastq_1/fastq_2 wrangling
ch_input_fastq = ch_input
.filter { s -> s.single_end || s.fastq_2 }
.map { s -> s + record(type: 'short') }
ch_input_nanopore = ch_input
.filter { s -> s.platform == 'OXFORD_NANOPORE' }
.map { s -> s + record(single_end: true, type: 'long') }
ch_input_pacbio = ch_input
.filter { s -> s.platform == 'PACBIO_SMRT' }
.map { s -> s + record(single_end: true, type: 'long') }
ch_input_fastq.view()
ch_input_nanopore.view()
ch_input_pacbio.view()
```
### buffer, collate
These operators are {ref}`non-deterministic <cache-nondeterministic-inputs>`. Use `groupBy` or `List::collate()` instead.
```nextflow
// before
channel.of(1..9)
.collate(3)
.view()
// [1, 2, 3]
// [4, 5, 6]
// [7, 8, 9]
// after
channel.of(1..9)
.map { i -> tuple((i - 1).intdiv(3) + 1, i) }
.groupBy()
.view()
// [1, [1, 2, 3]]
// [2, [4, 5, 6]]
// [3, [7, 8, 9]]
```
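Alternatively, when the values fit in a single list, you can collect and sort them, then split the list with `List::collate()`. The following is a sketch of that approach, assuming the values should be chunked in sorted order:

```nextflow
workflow {
    channel.of(1..9)
        .collect()
        // sort first so that the chunks do not depend on arrival order
        .flatMap { vals -> vals.toSorted().collate(3) }
        .view()
}
```

Each chunk should be emitted as a separate list of up to three values.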
### collectFile
The `collectFile` operator is useful for collecting intermediate results into a final output file, or writing a samplesheet. In many cases, `collectFile` can be replaced by a {ref}`workflow output <workflow-output-def>`, which can generate an index file for a published channel.
For other cases, consider the following alternatives:
- Use the `collect` and `groupBy` operators to collect and group items
- Use `Iterable::toSorted` to sort collected items
- Use an `exec` process to write files (see {ref}`working-with-files`)
You can compose these functions and operators as needed to achieve the desired functionality.
For example:
```nextflow
nextflow.enable.types = true
process COLLECT_FILE {
input:
name: String
items: List<String>
output:
file(name)
exec:
def path = task.workDir.resolve(name)
items.each { item ->
path << item
path << '\n'
}
}
workflow {
val_names = channel.of('alpha', 'beta', 'gamma')
.collect()
.map { names -> names.toSorted() }
COLLECT_FILE('sample.txt', val_names)
.view { result -> result.text }
}
```
Prints:
```
alpha
beta
gamma
```
### concat
Use `mix` instead.
### count, max, min, sum
Use `collect` and the corresponding {ref}`Iterable <stdlib-types-iterable>` methods instead.
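For example, the following sketch collects a channel of illustrative numbers into a dataflow value and derives each statistic with `map`:

```nextflow
workflow {
    // collect the values once and reuse the resulting dataflow value
    nums = channel.of(3, 1, 4, 1, 5).collect()

    nums.map { vals -> vals.size() }.view { n -> "count: ${n}" }
    nums.map { vals -> vals.max() }.view { n -> "max: ${n}" }
    nums.map { vals -> vals.min() }.view { n -> "min: ${n}" }
    nums.map { vals -> vals.sum() }.view { n -> "sum: ${n}" }
}
```

For these inputs, the count is 5, the maximum is 5, the minimum is 1, and the sum is 14. The four lines may be printed in any order.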
### cross
Use `join` with records instead.
```nextflow
// before
left = channel.of( [1, 'alpha'], [2, 'beta'] )
right = channel.of( [1, 'x'], [1, 'y'], [2, 'p'] )
left.cross(right).view()
// [[1, alpha], [1, x]]
// [[1, alpha], [1, y]]
// [[2, beta], [2, p]]
// after
left = channel.of(
record(id: 1, name: 'alpha'),
record(id: 2, name: 'beta')
)
right = channel.of(
record(id: 1, code: 'x'),
record(id: 1, code: 'y'),
record(id: 2, code: 'p')
)
left.join(right, by: 'id').view()
// [id:1, name:alpha, code:x]
// [id:1, name:alpha, code:y]
// [id:2, name:beta, code:p]
```
### distinct
This operator is {ref}`non-deterministic <cache-nondeterministic-inputs>`. Use `unique` instead.
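A minimal sketch of the replacement, using illustrative values:

```nextflow
workflow {
    channel.of(1, 1, 2, 2, 1, 3)
        // drop every repeated value, not only consecutive repeats
        .unique()
        .view()
}
```

Note that the two operators are not strictly equivalent: `distinct` only removes consecutive duplicates, so it would still emit the second run of `1` in this example, whereas `unique` emits each value once.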
### dump
Use `view` instead. The `view` operator now supports the `tag` option, allowing it to be used like `dump`.
### first, last, take
These operators are {ref}`non-deterministic <cache-nondeterministic-inputs>`. Use the corresponding {ref}`List <stdlib-types-list>` methods instead.
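For example, the following sketch collects and sorts the channel values, then applies the list methods. Sorting is an assumption made here so that the result does not depend on the order in which values arrive:

```nextflow
workflow {
    // collect all values and sort them into a list
    sorted = channel.of(3, 1, 2)
        .collect()
        .map { vals -> vals.toSorted() }

    sorted.map { vals -> vals.first() }.view { v -> "first: ${v}" }
    sorted.map { vals -> vals.last() }.view { v -> "last: ${v}" }
    sorted.map { vals -> vals.take(2) }.view { v -> "take(2): ${v}" }
}
```

This selects the smallest value, the largest value, and the two smallest values, rather than the first or last values to be emitted.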
### flatten
Use `flatMap` instead.
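A minimal sketch, using illustrative values:

```nextflow
workflow {
    // before
    channel.of( [1, 2], [3, 4] )
        .flatten()
        .view()

    // after: emit each element of each source list
    channel.of( [1, 2], [3, 4] )
        .flatMap { vals -> vals }
        .view()
}
```

Unlike `flatten`, `flatMap` only flattens one level, so deeply nested lists need additional handling in the mapping closure.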
### ifEmpty
The `ifEmpty` operator is typically used to either (1) raise an error if a channel is empty or (2) provide a fallback for a null dataflow value.
With static typing, both cases can be implemented without `ifEmpty`:
```nextflow
// (1) fail if channel is empty
files_ch = channel.fromPath('*.txt')
files_ch.collect().subscribe { files ->
if( files.isEmpty() )
error 'no input files were found'
}
files_ch.view()
// (2) provide a fallback for dataflow value
index_file = FETCH_INDEX().map { index ->
index ?: file('index_default.txt')
}
index_file.view()
```
The example for (2) assumes that `FETCH_INDEX` is a typed process. Typed processes emit `null` when an optional output is missing, whereas legacy processes emit nothing.
### merge
This operator is {ref}`non-deterministic <cache-nondeterministic-inputs>`. Use `join` instead.
### multiMap
Use `map` for each branch instead. Using records instead of tuples can eliminate much of the need for `multiMap`.
For example:
```nextflow
// before
ch_input_by_type = ch_input.multiMap { families, meta, cram, crai, gvcf, tbi, roi ->
def new_meta = meta + [
family_count: families[meta.family].size(),
type: gvcf && cram ? "gvcf_cram" : gvcf ? "gvcf" : "cram"
]
gvcf: tuple(new_meta, gvcf, tbi)
cram: tuple(new_meta, cram, crai)
roi: tuple(new_meta, roi)
}
ch_input_by_type.gvcf.view()
ch_input_by_type.cram.view()
ch_input_by_type.roi.view()
// after -- just keep everything in a single record
ch_input = ch_input.map { s ->
s + record(
family_count: s.families[s.family].size(),
type: s.gvcf && s.cram ? "gvcf_cram" : s.gvcf ? "gvcf" : "cram"
)
}
```
### randomSample
This operator is {ref}`non-deterministic <cache-nondeterministic-inputs>`. It should not be used.
If needed, it is possible to implement a function that samples a collection (e.g., using `Math.random()` from the Java standard library).
### set
Use standard assignments instead:
```nextflow
// before
channel.of(10, 20, 30).set { my_channel }
// after
my_channel = channel.of(10, 20, 30)
```
### tap
Use standard assignments instead:
```nextflow
// before
channel.of(10, 20, 30)
.tap { log1 }
.map { v -> v * 2 }
.tap { log2 }
// after
log1 = channel.of(10, 20, 30)
log2 = log1.map { v -> v * 2 }
```
### splitCsv, splitFasta, splitFastq, splitJson, splitText
Use the equivalent {ref}`stdlib-types-path` method with `flatMap` instead:
```nextflow
// before
channel.fromPath('samplesheet.csv')
.splitCsv(sep: ',')
.view()
// after
channel.fromPath('samplesheet.csv')
.flatMap { csv -> csv.splitCsv(sep: ',') }
.view()
```
### toList
Use `collect` instead.
### toSortedList
Use `collect` and `Iterable::toSorted` instead:
```nextflow
// before
channel.of(3, 2, 1, 4)
.toSortedList()
.view()
// after
channel.of(3, 2, 1, 4)
.collect()
.map { vals -> vals.toSorted() }
.view()
```
### transpose
Use `flatMap` instead:
```nextflow
// before
channel.of(
tuple(1, ['A', 'B', 'C']),
tuple(2, ['C', 'A']),
tuple(3, ['B', 'D']),
)
.transpose()
.view()
// after
channel.of(
tuple(1, ['A', 'B', 'C']),
tuple(2, ['C', 'A']),
tuple(3, ['B', 'D']),
)
.flatMap { key, values ->
values.collect { value -> tuple(key, value) }
}
.view()
```

@@ -0,0 +1,718 @@
(migrating-static-types)=
# Migrating to static typing
Nextflow 26.04 brings full support for *static typing* in Nextflow code. This tutorial demonstrates how to migrate to static typing using the [rnaseq-nf](https://github.com/nextflow-io/rnaseq-nf) pipeline as an example.
:::{note}
Static typing is optional. All existing code will continue to work.
:::
## Overview
Static typing allows you to precisely model and validate the structure of your data as it flows through your pipeline. It consists of several new language features:
- **Type annotations** can be added to inputs and outputs at every level of a pipeline, from pipeline parameters to process inputs and outputs, using the {ref}`standard Nextflow types <stdlib-types>`. These annotations make your code easier to understand and are used by the Nextflow language server to identify type-related errors during development.
- **Records** are a new data structure for modeling composite data. They serve as an alternative to tuples -- whereas tuple elements must be accessed by index, record fields are accessed by name. This allows you to model data with meaningful names instead of keeping track of how tuple elements are ordered.
- **Record types** are custom type definitions that can be used to guarantee a minimum set of requirements for a record in a particular context. Records are *duck-typed*, which means that a record can be used as an input as long as it meets the minimum requirements of that input (given by a record type).
## Developer tooling
Static typing works best with the [Nextflow language server](https://github.com/nextflow-io/language-server) and [Nextflow VS Code extension](https://marketplace.visualstudio.com/items?itemName=nextflow.nextflow).
:::{tip}
See {ref}`devenv-page` for instructions on how to set up VS Code and the Nextflow extension.
:::
### Type checking
When using static typing, the language server can check your code for type-related errors. For example, it can validate that a channel of records has all the required fields when it is passed as input to a process.
The language server performs type checking on every script that enables the `nextflow.enable.types` feature flag.
### Automatic migration
The Nextflow VS Code extension provides a command for automatically migrating Nextflow pipelines to static types. To migrate a script, open the Command Palette, search for **Convert script to static types**, and select it.
:::{note}
Automatic migration is an experimental feature and may not be able to convert an entire pipeline to static types. Always review generated code for correctness.
:::
## Example: rnaseq-nf
This section demonstrates how to migrate a pipeline to static typing using [rnaseq-nf](https://github.com/nextflow-io/rnaseq-nf) as an example. See {ref}`rnaseq-nf-page` for an introduction to the pipeline.
The approach is as follows:
1. Convert legacy parameters to a `params` block
2. Convert the primary input (`params.reads`) from a glob pattern to a samplesheet
3. Convert each process to static typing
4. Convert each workflow to static typing
The completed migration is available in the [preview-26-04](https://github.com/nextflow-io/rnaseq-nf/tree/preview-26-04) branch.
### Migrating pipeline parameters
The pipeline defines the following parameters in the main script using the legacy syntax:
```nextflow
params.reads = "$baseDir/data/ggal/ggal_gut_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
params.outdir = "results"
params.multiqc = "$baseDir/multiqc"
```
The pipeline also has a `nextflow_schema.json` schema with the following properties:
```json
"reads": {
"type": "string",
"description": "The input read-pair files",
"default": "${projectDir}/data/ggal/ggal_gut_{1,2}.fq"
},
"transcriptome": {
"type": "string",
"format": "file-path",
"description": "The input transcriptome file",
"default": "${projectDir}/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
},
"outdir": {
"type": "string",
"format": "directory-path",
"description": "The output directory where the results will be saved",
"default": "results"
},
"multiqc": {
"type": "string",
"format": "directory-path",
"description": "Directory containing the configuration for MultiQC",
"default": "${projectDir}/multiqc"
}
```
To migrate the pipeline parameters, use the schema and legacy parameters to define the equivalent `params` block:
```nextflow
params {
// The input read-pair files
reads: String = "${projectDir}/data/ggal/ggal_gut_{1,2}.fq"
// The input transcriptome file
transcriptome: Path = "${projectDir}/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
// The output directory where the results will be saved
outdir: Path = 'results'
// Directory containing the configuration for MultiQC
multiqc: Path = "${projectDir}/multiqc"
}
```
See {ref}`workflow-typed-params` for more information about the `params` block.
:::{note}
Parameters used only in the config file should be declared in the config, not in the script. Since rnaseq-nf has no such parameters, all parameters are declared in the script. See {ref}`config-params` for more information.
:::
:::{tip}
The rnaseq-nf pipeline initializes the `reads` and `transcriptome` parameters to a test dataset by default, as it is designed as a toy example. In practice, defaults for test data should be defined in a config profile (e.g., `test`).
:::
(static-types-samplesheet)=
### Loading a samplesheet input
The rnaseq-nf pipeline takes a glob pattern of FASTQ pairs (e.g., `data/ggal/ggal_gut_{1,2}.fq`) and uses the `channel.fromFilePairs()` factory to load the files as a channel of tuples:
```nextflow
read_pairs_ch = channel.fromFilePairs(params.reads, checkIfExists: true, flat: true)
```
Each tuple has three elements -- the sample ID (inferred from the file names) and the two FASTQ files.
This approach will not work with static typing because `fromFilePairs()` does not have a well-defined return type. A more robust way to model a collection of samples is with a *samplesheet*, such as a CSV file specifying samples as rows and sample fields as columns.
Create the following samplesheet to represent the test data:
**`data/allreads.csv`**
```
id,fastq_1,fastq_2
gut,https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/refs/heads/master/data/ggal/ggal_gut_1.fq,https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/refs/heads/master/data/ggal/ggal_gut_2.fq
liver,https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/refs/heads/master/data/ggal/ggal_liver_1.fq,https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/refs/heads/master/data/ggal/ggal_liver_2.fq
lung,https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/refs/heads/master/data/ggal/ggal_lung_1.fq,https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/refs/heads/master/data/ggal/ggal_lung_2.fq
spleen,https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/refs/heads/master/data/ggal/ggal_spleen_1.fq,https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/refs/heads/master/data/ggal/ggal_spleen_2.fq
```
Refactor `params.reads` to refer to the samplesheet file path instead of a glob pattern:
```nextflow
params {
// The input samplesheet of paired-end reads
reads: Path = "${projectDir}/data/allreads.csv"
// ...
}
```
Refactor the `read_pairs_ch` to load the samplesheet as a channel of records:
```nextflow
read_pairs_ch = channel.of(params.reads)
    // use the first row of the samplesheet as column names
    .flatMap { csv -> csv.splitCsv(header: true) }
    .map { row ->
        record(id: row.id, fastq_1: file(row.fastq_1), fastq_2: file(row.fastq_2))
    }
```
### Migrating processes
See {ref}`process-typed-page` for an overview of typed processes.
:::{note}
You must enable the `nextflow.enable.types` feature flag in each script that uses typed processes.
:::
<h4>FASTQC</h4>
The `FASTQC` process is defined as follows:
```nextflow
process FASTQC {
tag id
conda 'bioconda::fastqc=0.12.1'
input:
tuple val(id), path(fastq_1), path(fastq_2)
output:
path "fastqc_${id}_logs"
script:
"""
fastqc.sh "${id}" "${fastq_1} ${fastq_2}"
"""
}
```
To migrate the `FASTQC` process, rewrite the inputs and outputs as follows:
```nextflow
nextflow.enable.types = true
process FASTQC {
tag id
conda 'bioconda::fastqc=0.12.1'
input:
record(
id: String,
fastq_1: Path,
fastq_2: Path
)
output:
record(
id: id,
fastqc: file("fastqc_${id}_logs")
)
script:
"""
fastqc.sh "${id}" "${fastq_1} ${fastq_2}"
"""
}
```
The tuple input is converted to a record input using the `record()` destructor. The field types are specified alongside the field names. The `path` input qualifier is replaced by the `Path` type.
Whereas tuple elements must be specified in a particular order, record fields can be specified in any order. The records supplied by the calling workflow must provide at least the fields declared by the process, with matching names and types.
The output is converted to a record using the `record()` function, which specifies a name for each record field. The sample `id` is emitted alongside the output directory so that downstream channels can be matched by sample ID. The `path` output qualifier is replaced by the `file()` function (or `files()` if multiple files are expected). See {ref}`process outputs <process-reference-typed>` for the list of special functions that can be used in the `output:` section to retrieve task outputs.
<h4>QUANT</h4>
The `QUANT` process is defined as follows:
```nextflow
process QUANT {
tag id
conda 'bioconda::salmon=1.10.3'
input:
tuple val(id), path(fastq_1), path(fastq_2)
path index
output:
path "quant_${id}"
script:
"""
salmon quant \
--threads ${task.cpus} \
--libType=U \
-i ${index} \
-1 ${fastq_1} \
-2 ${fastq_2} \
-o quant_${id}
"""
}
```
To migrate the `QUANT` process, rewrite the inputs and outputs as follows:
```nextflow
nextflow.enable.types = true
process QUANT {
tag id
conda 'bioconda::salmon=1.10.3'
input:
record(
id: String,
fastq_1: Path,
fastq_2: Path
)
index: Path
output:
record(
id: id,
quant: file("quant_${id}")
)
script:
"""
salmon quant \
--threads ${task.cpus} \
--libType=U \
-i ${index} \
-1 ${fastq_1} \
-2 ${fastq_2} \
-o quant_${id}
"""
}
```
<h4>MULTIQC</h4>
The `MULTIQC` process is defined as follows:
```nextflow
process MULTIQC {
conda 'bioconda::multiqc=1.27.1'
input:
path '*'
path config
output:
path 'multiqc_report.html'
script:
"""
cp ${config}/* .
echo "custom_logo: \$PWD/nextflow_logo.png" >> multiqc_config.yaml
multiqc -n multiqc_report.html .
"""
}
```
To migrate this process, rewrite the inputs and outputs as follows:
```nextflow
nextflow.enable.types = true
process MULTIQC {
// ...
input:
logs: Set<Path>
config: Path
// stage:
// stageAs logs, '*'
output:
file('multiqc_report.html')
// ...
}
```
In a typed process, file patterns for `path` inputs must be declared using a *stage directive*. In this example, the first input uses the variable name `logs`, and the `stageAs` directive stages the input using the glob pattern `*`.
In this case, you can omit the stage directive because `*` matches Nextflow's default staging behavior. Inputs of type `Path` or a `Path` collection (e.g., `Set<Path>`) are staged by default using the pattern `'*'`.
:::{note}
In a legacy process, you can use the `arity` option to specify whether a `path` qualifier expects a single file or collection of files. When using typed inputs and outputs, the type determines this behavior, i.e., `Path` vs `Set<Path>`.
:::
:::{note}
While `List<Path>` and `Bag<Path>` are also valid path collection types, `Set<Path>` is preferred in this case because it represents an unordered collection of files. You should only use `List<Path>` when you want the collection to be ordered.
:::
<h4>INDEX</h4>
Apply the same migration principles from the previous processes to migrate `INDEX`.
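As a sketch, assuming that the legacy `INDEX` process takes the transcriptome as a single `path` input and produces a Salmon index directory named `index`, the typed version might look like the following. The `tag`, the `conda` package, and the `salmon index` command are assumptions modeled on the `QUANT` process above, so check them against the actual module:

```nextflow
nextflow.enable.types = true

process INDEX {
    // assumed directives, mirroring QUANT
    tag transcriptome.simpleName
    conda 'bioconda::salmon=1.10.3'

    input:
    transcriptome: Path

    output:
    // assumed output name: the index directory written by the script
    file('index')

    script:
    """
    salmon index \
        --threads ${task.cpus} \
        -t ${transcriptome} \
        -i index
    """
}
```

Because the process has a single file input and a single file output, no record is needed: the input is declared with the `Path` type and the output is retrieved with `file()`.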
### Migrating workflows
Once you migrate every process called by a workflow to static typing, you can migrate the workflow itself.
See {ref}`workflow-typed-page` for an overview of typed workflows.
:::{note}
You must enable the `nextflow.enable.types` feature flag in each script that uses typed workflows.
:::
<h4>RNASEQ</h4>
The `RNASEQ` workflow is defined as follows:
```nextflow
workflow RNASEQ {
take:
read_pairs_ch
transcriptome
main:
index = INDEX(transcriptome)
fastqc_ch = FASTQC(read_pairs_ch)
quant_ch = QUANT(read_pairs_ch, index)
emit:
fastqc = fastqc_ch
quant = quant_ch
}
```
You can infer the type of each workflow input by examining how the workflow is called. In this case, `RNASEQ` is called by the entry workflow with the following arguments:
```nextflow
workflow {
    read_pairs_ch = channel.of(params.reads)
        // use the first row of the samplesheet as column names
        .flatMap { csv -> csv.splitCsv(header: true) }
        .map { row ->
            record(id: row.id, fastq_1: file(row.fastq_1), fastq_2: file(row.fastq_2))
        }
    RNASEQ(read_pairs_ch, params.transcriptome)
    // ...
}
```
You can determine the type of each input as follows:
- The channel `read_pairs_ch` has type `Channel<Record>`, where each record contains the fields `id`, `fastq_1`, `fastq_2`.
- The parameter `params.transcriptome` has type `Path` as defined in the `params` block.
Specify the workflow input types as follows:
```nextflow
nextflow.enable.types = true
workflow RNASEQ {
take:
read_pairs_ch: Channel<Sample>
transcriptome: Path
// ...
}
record Sample {
id: String
fastq_1: Path
fastq_2: Path
}
```
The `read_pairs_ch` channel also needs to provide all of the record fields required by downstream processes. It is used by `FASTQC` and `QUANT`, which both declare the following record input:
```nextflow
input:
record(
id: String,
fastq_1: Path,
fastq_2: Path
)
```
The `Sample` record type contains all of the required fields.
:::{note}
In this case, the records in `read_pairs_ch` are identical to the record inputs of `FASTQC` and `QUANT`. However, `read_pairs_ch` would still be compatible if it contained additional record fields, as long as it contains the fields required by the two processes.
:::
The `FASTQC` and `QUANT` processes produce the channels `fastqc_ch` and `quant_ch`, both of which have type `Channel<Record>`:
- `fastqc_ch` contains records with the fields `id` and `fastqc`
- `quant_ch` contains records with the fields `id` and `quant`
You can infer this type information from the respective process outputs, as shown in the previous section.
These channels are emitted as the outputs of `RNASEQ`. However, with records it is usually simpler to join related channels into a single channel (e.g., to publish the channel as a {ref}`workflow output <migrating-workflow-outputs>`).
Use the `join` operator to join `fastqc_ch` and `quant_ch` by sample ID:
```nextflow
nextflow.enable.types = true
workflow RNASEQ {
take:
read_pairs_ch: Channel<Sample>
transcriptome: Path
main:
index = INDEX(transcriptome)
fastqc_ch = FASTQC(read_pairs_ch)
quant_ch = QUANT(read_pairs_ch, index)
samples_ch = fastqc_ch.join(quant_ch, by: 'id')
// ...
}
```
Finally, the workflow needs to be updated to only emit the `samples_ch` channel. Type annotations are not required for emits, but they are still useful as documentation and as a sanity check -- if the declared output type doesn't match the assigned value's type, the language server will report it.
While `samples_ch` could be emitted as type `Channel<Record>`, the best practice is to use an explicit record type so that downstream workflows know which record fields are available.
Define a new record type based on the available fields in `samples_ch`:
```nextflow
record AlignedSample {
id: String
fastqc: Path
quant: Path
}
```
Update the workflow to emit `samples_ch` with the new record type:
```nextflow
nextflow.enable.types = true
workflow RNASEQ {
take:
read_pairs_ch: Channel<Sample>
transcriptome: Path
main:
index = INDEX(transcriptome)
fastqc_ch = FASTQC(read_pairs_ch)
quant_ch = QUANT(read_pairs_ch, index)
samples_ch = fastqc_ch.join(quant_ch, by: 'id')
emit:
samples: Channel<AlignedSample> = samples_ch
}
```
<h4>Entry workflow</h4>
The entry workflow is defined as follows:
```nextflow
workflow {
    read_pairs_ch = channel.of(params.reads)
        // use the first row of the samplesheet as column names
        .flatMap { csv -> csv.splitCsv(header: true) }
        .map { row ->
            record(id: row.id, fastq_1: file(row.fastq_1), fastq_2: file(row.fastq_2))
        }
    (fastqc_ch, quant_ch) = RNASEQ(read_pairs_ch, params.transcriptome)
    multiqc_files_ch = fastqc_ch.mix(quant_ch).collect()
    MULTIQC(multiqc_files_ch, params.multiqc)
}
```
Rewrite this workflow based on the updated params, processes, and subworkflows:
```nextflow
nextflow.enable.types = true
workflow {
read_pairs_ch = channel.of(params.reads)
.flatMap { csv -> csv.splitCsv() }
.map { row ->
record(id: row[0], fastq_1: file(row[1]), fastq_2: file(row[2]))
}
samples_ch = RNASEQ(read_pairs_ch, params.transcriptome)
multiqc_files_ch = samples_ch
.flatMap { sample -> [sample.fastqc, sample.quant] }
.collect()
MULTIQC(multiqc_files_ch, params.multiqc)
}
```
The `reads` param was refactored as a `Path`, so it is loaded into a channel of records using `splitCsv`. The resulting records are compatible with the `Sample` records expected by `RNASEQ`.
The `RNASEQ` workflow now returns a single combined channel, so the `mix` operation is no longer needed. The `flatMap` operator is used to extract the files from each record in `samples_ch`.
(preparing-static-types)=
## Preparing for static typing
While static typing can be adopted progressively with existing code, many coding patterns are not compatible with static typing. Following best practices and avoiding anti-patterns beforehand will make it easier to adopt static typing.
### Use the strict syntax
The {ref}`strict syntax <strict-syntax-page>` is required to use static typing. It is enabled by default in Nextflow 26.04.
Before you migrate to static typing, ensure your code adheres to the strict syntax using `nextflow lint` or the language server.
### Avoid deprecated patterns
When preparing for the strict syntax, try to address {ref}`deprecation warnings <strict-syntax-deprecated>` as much as possible. For example:
```nextflow
Channel.from(1, 2, 3).map { it * 2 } // deprecated
channel.of(1, 2, 3).map { it -> it * 2 } // best practice
```
The above example shows how to avoid three deprecated patterns:
1. Using `Channel` to access channel factories (use `channel` instead)
2. Using the deprecated `channel.from` factory (use `channel.of` or `channel.fromList` instead)
3. Using the implicit `it` closure parameter (declare the parameter explicitly instead)
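When the values already exist as a list, `channel.fromList` is the alternative mentioned in the second item. A minimal sketch, in which the `values` variable is illustrative and not part of the original example:

```nextflow
workflow {
    values = [1, 2, 3]  // illustrative list, assumed for this sketch

    // emit each list element as a separate channel value
    channel.fromList(values)
        .map { v -> v * 2 }
        .view()
}
```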
### Avoid `set` and `tap` operators
Nextflow provides three ways to assign a channel: a standard assignment, the `set` operator, and the `tap` operator:
```nextflow
ch = channel.of(1, 2, 3) // standard assignment
channel.of(10, 20, 30).set { ch } // set
channel.of(10, 20, 30).tap { ch } // tap
```
However, `set` and `tap` are not supported in typed workflows. Use standard assignments instead.
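The `tap` operator is typically used to capture a channel in the middle of an operator chain. The same effect can be sketched with two standard assignments, since a channel can be consumed more than once. The names `ch_numbers` and `ch_doubled` are illustrative:

```nextflow
workflow {
    // legacy form: channel.of(10, 20, 30).tap { ch_numbers }.map { v -> v * 2 }.view()
    ch_numbers = channel.of(10, 20, 30)
    ch_doubled = ch_numbers.map { v -> v * 2 }

    // both channels remain available for downstream use
    ch_numbers.view { v -> "original: ${v}" }
    ch_doubled.view { v -> "doubled: ${v}" }
}
```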
### Avoid `|` and `&` dataflow operators
The {ref}`special operators <workflow-special-operators>` `|` and `&` provide shorthands for writing dataflow logic:
```nextflow
channel.of('Hello', 'Hola', 'Ciao')
| greet
| map { v -> v.toUpperCase() }
| view
```
However, these special operators are not supported in typed workflows. Use standard assignments and method calls instead:
```nextflow
ch_input = channel.of('Hello', 'Hola', 'Ciao')
ch_greet = greet(ch_input)
ch_greet
.map { v -> v.toUpperCase() }
.view()
```
### Avoid `.out` for process and workflow outputs
The `.out` property can be used to access process and workflow outputs in legacy workflows:
```nextflow
MY_WORKFLOW()
MY_WORKFLOW.out.foo.view()
MY_WORKFLOW.out.bar.view()
```
However, this pattern is not supported in typed workflows. Use standard assignments instead:
```nextflow
my_out = MY_WORKFLOW()
my_out.foo.view()
my_out.bar.view()
```
### Avoid `each` input qualifier
The {ref}`each <process-input-each>` input qualifier is not supported in typed processes. Use the {ref}`operator-combine` operator to create a single tuple channel instead.
For example:
```nextflow
process align {
input:
path seq
each mode
script:
"""
t_coffee -in $seq -mode $mode > result
"""
}
workflow {
sequences = channel.fromPath('*.fa')
methods = ['regular', 'espresso', 'psicoffee']
align(sequences, methods)
}
```
Rewrite the script to use the `combine` operator, so that the process receives each sequence and mode as a single tuple input:
```nextflow
process align {
input:
tuple path(seq), val(mode)
script:
"""
t_coffee -in $seq -mode $mode > result
"""
}
workflow {
sequences = channel.fromPath('*.fa')
methods = ['regular', 'espresso', 'psicoffee']
align(sequences.combine(methods))
}
```
:::{tip}
The `each` qualifier is discouraged in modern Nextflow code. While it provides a convenient shorthand for combining multiple inputs, it couples the process definition with external workflow logic. Since the introduction of DSL2, Nextflow aims to treat processes as standalone modules that are decoupled from workflow logic.
:::
### Avoid legacy operators
Many {ref}`operators <operator-page>` are not statically typed. While you can still use them in typed workflows, the type checker will not be able to fully validate your code. These operators can usually be replaced by another operator and/or a standard library function.
For example, the `splitCsv` operator is not statically typed. Use `flatMap` and the equivalent {ref}`stdlib-types-path` method instead:
```nextflow
// before
channel.fromPath('samplesheet.csv')
.splitCsv(sep: ',')
.view()
// after
channel.fromPath('samplesheet.csv')
.flatMap { csv -> csv.splitCsv(sep: ',') }
.view()
```
See {ref}`migrating-static-types-operators` for more information.
## Additional resources
See the following links to learn more about static typing:
- {ref}`process-typed-page`
- {ref}`workflow-typed-page`
- {ref}`stdlib-types`
- {ref}`script-records`
- {ref}`syntax-record-type`

@@ -0,0 +1,299 @@
(migrating-workflow-outputs)=
# Migrating to workflow outputs
The {ref}`workflow output definition <workflow-output-def>` is a new way to define the top-level outputs of a workflow. It is a replacement for the {ref}`publishDir <process-publishdir>` directive. This tutorial demonstrates how to migrate from `publishDir` to workflow outputs using the [rnaseq-nf](https://github.com/nextflow-io/rnaseq-nf) pipeline as an example.
:::{note}
Workflow outputs are stable in Nextflow 25.10. However, for large pipelines, it is recommended that you migrate to static typing and records first. See {ref}`migrating-static-types` for more information.
:::
## Overview
In Nextflow DSL1, pipelines were defined in a single script and there was no concept of workflows. Each process used the `publishDir` directive to publish task outputs, which captured output files with glob patterns and copied them from the work directory to an external location.
Nextflow DSL2 introduced workflows and modules, making it easier to develop large and complex pipelines. However, DSL2 retained the same process-based publishing approach, which became unwieldy for several reasons:
- **Mismatch with reusable modules**: Publishing rules often depend on how a process is used in a pipeline. This made it impractical to set `publishDir` in a reusable way for processes that are shared across many pipelines. Publishing rules could be defined in the configuration, but this approach requires extensive use of {ref}`process selectors <config-process-selectors>`, which are difficult to use for large pipelines.
- **Fragmented outputs**: It is difficult to get a concise view of a workflow's outputs when publishing rules are separated across many different modules.
- **Redundant configuration**: Certain settings, such as the base output directory and publish mode, must be repeated for each `publishDir` declaration, leading to duplicated code.
- **Mismatch with channels**: Channels, the primary data structure in Nextflow, contain files and associated metadata that can be accessed by name. However, `publishDir` uses glob patterns to match files, and cannot publish metadata unless it happens to be in a file. This mismatch makes it difficult to translate channels into pipeline outputs.
Workflow outputs were introduced to address these problems by providing a unified, structured, and flexible way to publish outputs:
- **Unified output definition**: Workflow outputs are declared in an `output` block alongside the entry workflow, ensuring that there is a single comprehensive view of what a pipeline produces.
- **Channel-based publishing**: Instead of publishing files from individual processes, workflow outputs are assigned from channels in the entry workflow. The channel itself can be saved as an *index file*, such as a CSV or JSON file, which provides a structured view of the output directory and can be ingested by downstream pipelines.
- **Flexible file selection**: By default, all files in a published channel are included. However, the published channel can be configured to publish specific files by name, instead of using glob patterns. This approach to publishing files is a natural extension of workflows and channels.
- **Simple configuration**: The base output directory is defined as a global configuration setting, and all files are published into this directory. Publish settings such as the mode are also defined as configuration settings under the `workflow.output` scope, reducing code duplication.
## Timeline
Workflow outputs are stable in Nextflow {ref}`25.10 <workflow-outputs-final>`.
This feature is also available as a preview in Nextflow {ref}`24.04 <workflow-outputs-first-preview>`, {ref}`24.10 <workflow-outputs-second-preview>`, and {ref}`25.04 <workflow-outputs-third-preview>`. If you are upgrading from a preview version of workflow outputs, you must remove the `nextflow.preview.output` feature flag, as it is not supported in Nextflow 25.10.
The `publishDir` directive is still supported for now, but it will be deprecated and may be removed in a future release.
## Example: rnaseq-nf
This section describes how to migrate from `publishDir` to workflow outputs using the [rnaseq-nf](https://github.com/nextflow-io/rnaseq-nf) pipeline as an example. To view the completed migration, see the [`preview-25-10`](https://github.com/nextflow-io/rnaseq-nf/tree/preview-25-10) branch of the rnaseq-nf repository.
See {ref}`rnaseq-nf-page` for an introduction to the rnaseq-nf pipeline.
### Replacing `publishDir` with workflow outputs
Start by removing each `publishDir` directive and publishing the corresponding process output channel in the entry workflow.
Declare an output for each channel in the `output` block and publish the corresponding channel in the `publish:` section of the entry workflow:
```nextflow
workflow {
main:
read_pairs_ch = channel.fromFilePairs(params.reads, checkIfExists: true, flat: true)
(fastqc_ch, quant_ch) = RNASEQ(read_pairs_ch, params.transcriptome)
multiqc_files_ch = fastqc_ch.mix(quant_ch).collect()
multiqc_report = MULTIQC(multiqc_files_ch, params.multiqc)
publish:
fastqc_logs = fastqc_ch
multiqc_report = multiqc_report
}
output {
fastqc_logs {
}
multiqc_report {
}
}
```
:::{note}
Each output assigned in the `publish:` section must be declared in the `output` block, and vice versa.
:::
Nextflow publishes all files in the published channels into the output directory, which is `results` by default. You can set the output directory using the `outputDir` config setting or the `-output-dir` command-line option.
You can set the publish mode in the config. For example:
```groovy
workflow.output.mode = 'copy'
```
Run the pipeline with the `all-reads` profile to verify the published outputs:
```console
$ nextflow run . -profile conda,all-reads
```
### Customizing the publish paths
The pipeline runs `FASTQC` and `QUANT` for each input sample. However, the workflow publishes only the `FASTQC` results. The workflow passes the `QUANT` results to `MULTIQC` but doesn't publish them directly.
Improve the workflow outputs by also publishing the outputs of `QUANT`:
```nextflow
workflow {
main:
// ...
publish:
fastqc_logs = fastqc_ch
quant = quant_ch
multiqc_report = multiqc_report
}
output {
fastqc_logs {
}
quant {
}
multiqc_report {
}
}
```
Running the pipeline with the `all-reads` profile will produce the following output directory:
```console
results
├── fastqc_gut_logs
├── fastqc_liver_logs
├── fastqc_lung_logs
├── fastqc_spleen_logs
├── multiqc_report.html
├── quant_gut
├── quant_liver
├── quant_lung
└── quant_spleen
```
This directory will quickly become cluttered as you process more samples. It would be better to group the `FASTQC` and `QUANT` results into separate subdirectories:
```console
results
├── fastqc
│   ├── gut
│   ├── liver
│   ├── lung
│   └── spleen
├── multiqc_report.html
└── quant
    ├── gut
    ├── liver
    ├── lung
    └── spleen
```
Achieve this directory structure by customizing the `output` block.
First, update the `FASTQC` and `QUANT` processes to also emit the sample ID alongside the output files:
```nextflow
process FASTQC {
// ...
output:
tuple val(id), path("fastqc_${id}_logs")
// ...
}
process QUANT {
// ...
output:
tuple val(id), path("quant_${id}")
// ...
}
```
Configure the `fastqc_logs` and `quant` outputs in the `output` block to use dynamic publish paths:
```nextflow
output {
fastqc_logs {
path { id, fastqc -> "fastqc/${id}" }
}
quant {
path { id, quant -> "quant/${id}" }
}
multiqc_report {
}
}
```
Preserving the sample ID in each output channel allows you to customize the publish path without trying to parse the file name. The dynamic path is applied to each channel value to determine the target name for the given file.
:::{note}
The closure parameters for the dynamic publish path must match the structure of the published channel.
:::
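As a sketch of this rule, suppose a published channel emitted three-element tuples instead of two. The closure would then need three parameters. The third element, `logs`, is hypothetical and does not exist in the rnaseq-nf pipeline:

```nextflow
output {
    quant {
        // assumes each channel value is [id, quant, logs]; `logs` is hypothetical
        path { id, quant, logs -> "quant/${id}" }
    }
}
```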
### Generating an index file
An *index file* is a manifest, or *index*, of the published files and their metadata for a workflow output. Nextflow can create an index file for each workflow output by saving the channel as a CSV, JSON, or YAML file.
For example, if you enable the index file for `fastqc_logs`:
```nextflow
output {
fastqc_logs {
path { id, fastqc -> "fastqc/${id}" }
index {
path 'fastqc.csv'
header true
}
}
}
```
The workflow produces the following index file:
```console
$ cat results/fastqc.csv
"id","fastqc"
"lung","results/fastqc/lung"
"spleen","results/fastqc/spleen"
"gut","results/fastqc/gut"
"liver","results/fastqc/liver"
```
The index file mirrors the structure of the published channel, and it provides a structured view of the output directory. Index files are equivalent to samplesheets, and can be used as inputs to downstream pipelines.
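For example, a downstream pipeline could load the index file in the same way as a samplesheet. The following is a minimal sketch that assumes the index file from the previous run is available at `results/fastqc.csv`. It sets the `quote` option explicitly because the index file quotes every field:

```nextflow
workflow {
    // assumed location of the index file from the previous run
    channel.fromPath('results/fastqc.csv')
        .splitCsv(header: true, quote: '"')
        // each row is a map keyed by the column names in the header
        .map { row -> [row.id, file(row.fastqc)] }
        .view()
}
```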
You could define two index files for `fastqc_logs` and `quant`. However, since these outputs essentially provide different *slices* of data for the same set of samples, you can also combine them into a single output with one index file.
Use the `join` operator to combine the `FASTQC` and `QUANT` results into a single channel:
```nextflow
workflow {
main:
// ...
samples_ch = fastqc_ch
.join(quant_ch)
.map { id, fastqc, quant ->
[id: id, fastqc: fastqc, quant: quant]
}
multiqc_files_ch = samples_ch
.flatMap { sample -> [sample.fastqc, sample.quant] }
.collect()
multiqc_report = MULTIQC( multiqc_files_ch, params.multiqc )
publish:
samples = samples_ch
multiqc_report = multiqc_report
}
```
This example uses maps instead of tuples so that you can access fields by name, and so that the index file can use the map keys as column names.
Declare the `samples` output with an index file:
```nextflow
output {
samples {
path { sample ->
sample.fastqc >> "fastqc/${sample.id}"
sample.quant >> "quant/${sample.id}"
}
index {
path 'samples.csv'
header true
}
}
multiqc_report {
}
}
```
Since each channel value now contains multiple files that go to different subdirectories, you must use *publish statements* in the `path` directive to route each file to the appropriate location.
Run the pipeline, then verify the index file:
```console
$ nextflow run . -profile conda,all-reads -resume
$ cat results/samples.csv
"id","fastqc","quant"
"lung","results/fastqc/lung","results/quant/lung"
"gut","results/fastqc/gut","results/quant/gut"
"liver","results/fastqc/liver","results/quant/liver"
"spleen","results/fastqc/spleen","results/quant/spleen"
```
In the future, if you add a tool with per-sample outputs, you only need to join the tool output into the `samples_ch` channel and update the output `path` directive accordingly. This approach keeps the output definition concise as you add more tools to the pipeline. Additionally, a single unified index file for all per-sample outputs is easier for downstream pipelines to consume than multiple related index files that must be cross-referenced.
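As a rough sketch of such an extension, suppose a new process emits a channel of `[id, path]` tuples. The names `newtool_ch` and `newtool` are hypothetical, and the elided parts follow the earlier examples:

```nextflow
workflow {
    main:
    // ...
    // `newtool_ch` is a hypothetical channel of [id, path] tuples
    samples_ch = fastqc_ch
        .join(quant_ch)
        .join(newtool_ch)
        .map { id, fastqc, quant, newtool ->
            [id: id, fastqc: fastqc, quant: quant, newtool: newtool]
        }

    publish:
    samples = samples_ch
    // ...
}

output {
    samples {
        path { sample ->
            sample.fastqc >> "fastqc/${sample.id}"
            sample.quant >> "quant/${sample.id}"
            // route the hypothetical new output to its own subdirectory
            sample.newtool >> "newtool/${sample.id}"
        }
        index {
            path 'samples.csv'
            header true
        }
    }
    // ...
}
```

Under these assumptions, the index file would gain one additional `newtool` column, while the rest of the output definition stays unchanged.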