(process-page)=

# Processes

In Nextflow, a **process** is a specialized function for executing scripts in a scalable and portable manner.

Here is an example process definition:

```nextflow
process hello {
    output:
    path 'hello.txt'

    script:
    """
    echo 'Hello world!' > hello.txt
    """
}
```

See {ref}`syntax-process` for a full description of the process syntax.

(process-script)=

## Script

The `script` section defines, as a string expression, the script that is executed by the process.

A process may contain only one script, and if the `script` guard is not explicitly declared, the script must be the final statement in the process definition.

The script string is executed as a [Bash]() script in the host environment. It can be any command or script that you would normally execute on the command line or in a Bash script. Naturally, the script may only use commands that are available in the host environment.

The script section can be a simple string or a multi-line string. The latter approach makes it easier to write scripts with multiple commands spanning multiple lines. For example:

```nextflow
process blast {
    """
    blastp -db $db -query query.fa -outfmt 6 > blast_result
    cat blast_result | head -n 10 | cut -f 2 > top_hits
    blastdbcmd -db $db -entry_batch top_hits > sequences
    """
}
```

As explained in the script tutorial section, strings can be defined using single-quotes or double-quotes, and multi-line strings are defined by three single-quote or three double-quote characters.

There is a subtle but important difference between them. As in Bash, strings delimited by a `"` character support variable substitutions, while strings delimited by `'` do not.

In the above code fragment, the `$db` variable is replaced by the actual value defined elsewhere in the pipeline script.

:::{warning}
Since Nextflow uses the same Bash syntax for variable substitutions in strings, you must manage them carefully depending on whether you want to evaluate a *Nextflow* variable or a *Bash* variable.
:::

When you need to access a system environment variable in your script, you have two options.

If you don't need to access any Nextflow variables, you can define your script section with single-quotes:

```nextflow
process echo_path {
    '''
    echo "The path is: $PATH"
    '''
}
```

Otherwise, you can define your script with double-quotes and escape the system environment variables by prefixing them with a backslash `\` character, as shown in the following example:

```nextflow
process blast {
    """
    blastp -db \$DB -query query.fa -outfmt 6 > blast_result
    cat blast_result | head -n $MAX | cut -f 2 > top_hits
    blastdbcmd -db \$DB -entry_batch top_hits > sequences
    """
}
```

In this example, `$MAX` is a Nextflow variable that must be defined elsewhere in the pipeline script. Nextflow replaces it with the actual value before executing the script. Meanwhile, `$DB` is a Bash variable that must exist in the execution environment, and Bash will replace it with the actual value during execution.

### Scripts *à la carte*

The process script is interpreted by Nextflow as a Bash script by default, but you are not limited to Bash.

You can use your favourite scripting language (Perl, Python, R, etc.), or even mix them in the same pipeline.

A pipeline may be composed of processes that execute very different tasks. With Nextflow, you can choose the scripting language that best fits the task performed by a given process. For example, for some processes R might be more useful than Perl, whereas for others you may need to use Python because it provides better access to a library or an API.

To use a language other than Bash, simply start your process script with the corresponding [shebang](). For example:

```nextflow
process perl_task {
    """
    #!/usr/bin/perl

    print 'Hi there!' . '\n';
    """
}

process python_task {
    """
    #!/usr/bin/python

    x = 'Hello'
    y = 'world!'
    # the function-call form of print works in both Python 2 and Python 3
    print("%s - %s" % (x, y))
    """
}

workflow {
    perl_task()
    python_task()
}
```

:::{tip}
Since the actual location of the interpreter binary file can differ across platforms, it is wise to use the `env` command followed by the interpreter name, e.g. `#!/usr/bin/env perl`, instead of the absolute path, in order to make your script more portable.
:::

### Conditional scripts

The `script` section is like a function that returns a string. This means that you can write arbitrary code to determine the script, as long as the final statement is a string.

If-else statements based on task inputs can be used to produce a different script. For example:

```nextflow
mode = 'tcoffee'

process align {
    input:
    path sequences

    script:
    if( mode == 'tcoffee' )
        """
        t_coffee -in $sequences > out_file
        """

    else if( mode == 'mafft' )
        """
        mafft --anysymbol --parttree --quiet $sequences > out_file
        """

    else if( mode == 'clustalo' )
        """
        clustalo -i $sequences -o out_file
        """

    else
        error "Invalid alignment mode: ${mode}"
}
```

In the above example, the process will execute one of several scripts depending on the value of the `mode` variable. Because `mode` is set to `'tcoffee'` here, it will execute the `t_coffee` command.

(process-template)=

### Template

Process scripts can be externalized to **template** files, which allows them to be reused across different processes and tested independently from the pipeline execution.

A template can be used in place of an embedded script using the `template` function in the script section:

```nextflow
process hello {
    input:
    val STR

    output:
    stdout

    script:
    template 'hello.sh'
}

workflow {
    hello('Hello!').view()
}
```

By default, Nextflow looks for the template script in the `templates` directory located alongside the Nextflow script in which the process is defined. An absolute path can be used to specify a different location. However, this practice is discouraged because it hinders pipeline portability.
An example template script is provided below:

```bash
#!/bin/bash
echo "process started at `date`"
echo $STR
echo "process completed"
```

Variables prefixed with the dollar character (`$`) are interpreted as Nextflow variables when the template script is executed by Nextflow and Bash variables when executed directly. For example, the above script can be executed from the command line by providing each input as an environment variable:

```bash
STR='Hello!' bash templates/hello.sh
```

The following caveats should be considered:

- Template scripts are recommended only for Bash scripts. Languages that do not prefix variables with `$` (e.g. Python and R) can't be executed directly as a template script.

- Variables escaped with `\$` will be interpreted as Bash variables when executed by Nextflow, but will not be interpreted as variables when executed from the command line. This practice should be avoided to ensure that the template script behaves consistently.

- Template variables are evaluated even if they are commented out in the template script. If a template variable is missing, it will cause the pipeline to fail regardless of where it occurs in the template.

:::{tip}
Template scripts are generally discouraged due to the caveats described above. The best practice for using a custom script is to embed it in the process definition at first and move it to a separate file with its own command line interface once the code matures.
:::

(process-shell)=

### Shell

:::{deprecated} 24.11.0-edge
Use the `script` section instead. Consider using the {ref}`strict syntax `, which provides error checking to help distinguish between Nextflow variables and Bash variables in the process script.
:::

The `shell` section is a string expression that defines the script that is executed by the process. It is an alternative to the {ref}`process-script` definition with one important difference: it uses the exclamation mark `!` character, instead of the usual dollar `$` character, to denote Nextflow variables.

This way, it is possible to use both Nextflow and Bash variables in the same script without having to escape the latter, which makes process scripts easier to read and maintain. For example:

```nextflow
process hello {
    input:
    val str

    shell:
    '''
    echo "User $USER says !{str}"
    '''
}

workflow {
    channel.of('Hello', 'Hola', 'Bonjour') | hello
}
```

In the above example, `$USER` is treated as a Bash variable, while `!{str}` is treated as a Nextflow variable.

:::{note}
- Shell script definitions require the use of single-quote `'` delimited strings. When using double-quote `"` delimited strings, dollar variables are interpreted as Nextflow variables as usual. See {ref}`string-interpolation`.
- Variables prefixed with `!` must always be enclosed in curly brackets, i.e. `!{str}` is a valid variable whereas `!str` is ignored.
- Shell scripts support the use of the {ref}`process-template` mechanism. The same rules are applied to the variables defined in the template script.
:::

(process-native)=

### Native execution

The `exec` section executes the given code without launching a job.

For example:

```nextflow
process hello {
    input:
    val name

    exec:
    println "Hello Mr. $name"
}

workflow {
    channel.of('a', 'b', 'c') | hello
}
```

will display:

```
Hello Mr. b
Hello Mr. a
Hello Mr. c
```

A native process is very similar to a {ref}`function `. However, it provides additional capabilities such as parallelism, caching, and progress logging.
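
Since a native process declares inputs and outputs like any other process, its result can be passed to downstream steps. The following is a minimal sketch, assuming that a variable assigned in the `exec` block without `def` can be emitted through a `val` output. The process name `to_upper` and the variable names `word` and `result` are hypothetical and chosen only for illustration:

```nextflow
// Hypothetical native process: no job is launched, the code runs in the Nextflow runtime
process to_upper {
    input:
    val word

    output:
    val result

    exec:
    // assumption: assigning without `def` makes `result` visible to the output section
    result = word.toUpperCase()
}

workflow {
    // each value becomes a separate task; the order of emitted results is not guaranteed
    channel.of('a', 'b', 'c') | to_upper | view
}
```

Compared with calling `toUpperCase()` inside a `map` closure, this form is only worth considering when the per-task caching and progress logging mentioned above are actually useful.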
(process-stub)=

## Stub

:::{versionadded} 20.11.0-edge
:::

You can define a command *stub*, which replaces the actual process command when the `-stub-run` or `-stub` command-line option is enabled:

```nextflow
process salmon_index {
    input:
    path transcriptome

    output:
    path 'index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """

    stub:
    """
    mkdir index
    touch index/seq.bin
    touch index/info.json
    touch index/refseq.bin
    """
}
```

The `stub` section can be defined before or after the `script` section. When the pipeline is executed with the `-stub-run` option and a process's `stub` is not defined, the `script` section is executed.

This feature makes it easier to quickly prototype the workflow logic without using the real commands. The developer can use it to provide a dummy script that mimics the execution of the real one in a quicker manner. In other words, it is a way to perform a dry-run.

(process-input)=

## Inputs

The `input` section defines the input of a process, similar to function arguments. A process may have at most one input section, which must contain at least one input declaration.

The input section follows the syntax shown below:

```
input:
```

An input declaration consists of a *qualifier* and a *name*. The input qualifier defines the type of data to be received. This information is used by Nextflow to apply the semantic rules associated with each qualifier, and handle it properly depending on the target execution platform (grid, cloud, etc.).

When a process is invoked in a workflow, it must be provided a channel or dataflow value for each input in the process input section, similar to calling a function with specific arguments. The examples provided in the following sections demonstrate how a process is invoked.

The following input qualifiers are available:

- `val`: Access the input value by name in the process script.
- `path`: Handle the input value as a path, staging the file properly in the execution context.
- `env`: Use the input value to set an environment variable in the process script.
- `stdin`: Forward the input value to the process `stdin` special file.
- `tuple`: Handle a group of input values having any of the above qualifiers.
- `each`: Execute the process for each element in the input collection.

See {ref}`process reference ` for the full list of input methods and options.

### Input variables (`val`)

The `val` qualifier accepts any data type. It can be accessed in the process script by using the specified input name, as shown in the following example:

```nextflow
process echo {
    input:
    val x

    script:
    """
    echo "process job $x"
    """
}

workflow {
    def num = channel.of(1,2,3)
    echo(num)
}
```

In the above example, the process is executed three times: once for each value emitted by the `num` channel. The resulting output is similar to the one shown below:

```
process job 3
process job 1
process job 2
```

:::{note}
Processes do not necessarily process items in the order that they are received. In the above example, the value `3` was processed before the others.
:::

:::{note}
When the process declares exactly one input, the pipe `|` operator can be used to provide inputs to the process, instead of passing it as a parameter. Both methods have identical semantics:

```nextflow
process echo {
    input:
    val x

    script:
    """
    echo "process job $x"
    """
}

workflow {
    channel.of(1,2,3) | echo
}
```
:::

(process-input-path)=

### Input files (`path`)

The `path` qualifier allows you to provide input files to the process execution context. Nextflow will stage the files into the process execution directory, and they can be accessed in the script by using the specified input name. For example:

```nextflow
process blast {
    input:
    path query_file

    script:
    """
    blastp -query ${query_file} -db nr
    """
}

workflow {
    def proteins = channel.fromPath( '/some/path/*.fa' )
    blast(proteins)
}
```

In the above example, all the files ending with the suffix `.fa` are sent over the channel `proteins`. These files are received by the process, which executes a BLAST query on each of them.

It's worth noting that in the above example, the name of the file in the file system is not used. You can access the file without even knowing its name, because you can reference it in the process script by the input name.

There may be cases where your task needs to use a file whose name is fixed, i.e. it does not have to change along with the actual provided file. In this case, you can specify a fixed name with the `name` attribute in the input file parameter definition, as shown in the following example:

```nextflow
input:
path query_file, name: 'query.fa'
```

or, using a shorter syntax:

```nextflow
input:
path 'query.fa'
```

The previous example can be rewritten as shown below:

```nextflow
process blast {
    input:
    path 'query.fa'

    script:
    """
    blastp -query query.fa -db nr
    """
}

workflow {
    def proteins = channel.fromPath( '/some/path/*.fa' )
    blast(proteins)
}
```

In this example, each file received by the process is staged with the name `query.fa` in a different execution context (i.e. the folder where a task is executed).

:::{tip}
This feature allows you to execute the process command multiple times without worrying about the file names changing. In other words, Nextflow helps you write pipeline tasks that are self-contained and decoupled from the execution environment. As a best practice, you should avoid referencing files in your process script other than those defined in your input section.
:::

Channel factories like `channel.fromPath` produce file objects, but a `path` input can also accept a string literal path. The string value should be an absolute path, i.e. it must be prefixed with a `/` character or a supported URI protocol (`file://`, `http://`, `s3://`, etc.), and it cannot contain special characters (`\n`, etc.).
```nextflow
process cat {
    input:
    path x

    script:
    """
    cat $x
    """
}

workflow {
    cat('/some/data/file.txt')
}
```

:::{note}
Process `path` inputs have nearly the same interface as described in {ref}`stdlib-types-path`, with one difference which is relevant when files are staged into a subdirectory. Given the following input:

```nextflow
path x, name: 'my-dir/file.txt'
```

In this case, `x.name` returns the file name with the parent directory (e.g. `my-dir/file.txt`), whereas normally it would return the file name (e.g. `file.txt`). You can use `x.fileName.name` to get the file name.
:::

(process-multiple-input-files)=

### Multiple input files

A `path` input can also accept a collection of files instead of a single value. In this case, the input variable will be a list.

When the input has a fixed file name and a collection of files is received by the process, the file name will be appended with a numerical suffix representing its ordinal position in the list. For example:

```nextflow
process blast {
    input:
    path 'seq'

    script:
    """
    echo seq*
    """
}

workflow {
    def fasta = channel.fromPath( "/some/path/*.fa" ).buffer(size: 3)
    blast(fasta)
}
```

will output:

```
seq1 seq2 seq3
seq1 seq2 seq3
...
```

The target input file name may contain the {index}`*` and {index}`?` wildcards, which can be used to control the name of staged files. The following table shows how the {index}`wildcards` are replaced depending on the cardinality of the received input collection.

| Arity | Name pattern | Staged file names                                                                                       |
| ----- | ------------ | ------------------------------------------------------------------------------------------------------- |
| any   | `*`          | named as the source file                                                                                |
| one   | `file*.ext`  | `file.ext`                                                                                              |
| one   | `file?.ext`  | `file1.ext`                                                                                             |
| one   | `file??.ext` | `file01.ext`                                                                                            |
| many  | `file*.ext`  | `file1.ext`, `file2.ext`, `file3.ext`, ..                                                               |
| many  | `file?.ext`  | `file1.ext`, `file2.ext`, `file3.ext`, ..                                                               |
| many  | `file??.ext` | `file01.ext`, `file02.ext`, `file03.ext`, ..                                                            |
| many  | `dir/*`      | named as the source file, created in `dir` subdirectory                                                 |
| many  | `dir??/*`    | named as the source file, created in a progressively indexed subdirectory e.g. `dir01/`, `dir02/`, etc. |
| many  | `dir*/*`     | (as above)                                                                                              |

The following example shows how a wildcard can be used in the input file definition:

```nextflow
process blast {
    input:
    path 'seq?.fa'

    script:
    """
    cat seq1.fa seq2.fa seq3.fa
    """
}

workflow {
    def fasta = channel.fromPath( "/some/path/*.fa" ).buffer(size: 3)
    blast(fasta)
}
```

:::{note}
Rewriting input file names according to a name pattern is optional. The normal file input syntax introduced in the {ref}`process-input-path` section is valid for collections of multiple files as well. To handle multiple input files while preserving the original file names, use a variable identifier or the `*` wildcard.
:::

:::{versionadded} 23.09.0-edge
:::

The `arity` option can be used to enforce the expected number of files, either as a number or a range. For example:

```nextflow
input:
    path('one.txt', arity: '1')         // exactly one file is expected
    path('pair_*.txt', arity: '2')      // exactly two files are expected
    path('many_*.txt', arity: '1..*')   // one or more files are expected
```

When a task is executed, Nextflow will check whether the received files for each path input match the declared arity, and fail if they do not. When the arity is `'1'`, the corresponding input variable will be a single file; otherwise, it will be a list of files.

### Dynamic input file names

When the input file name is specified by using the `name` option or a string literal, you can also use other input values as variables in the file name string. For example:

```nextflow
process grep {
    input:
    val x
    path "${x}.fa"

    script:
    """
    cat ${x}.fa | grep '>'
    """
}
```

In the above example, the input file name is determined by the current value of the `x` input value.

This approach allows input files to be staged in the task directory with a name that is coherent with the current execution context.

:::{tip}
In most cases, you won't need to use dynamic file names, because each task is executed in its own directory, and input files are automatically staged into this directory by Nextflow. This behavior guarantees that input files with the same name won't overwrite each other. The above example is useful specifically when there are potential file name conflicts within a single task.
:::

### Input environment variables (`env`)

The `env` qualifier allows you to define an environment variable in the process execution context based on the input value. For example:

```nextflow
process echo_env {
    input:
    env 'HELLO'

    script:
    '''
    echo "$HELLO world!"
    '''
}

workflow {
    channel.of('hello', 'hola', 'bonjour', 'ciao') | echo_env
}
```

will output:

```
hello world!
ciao world!
bonjour world!
hola world!
```

### Standard input (`stdin`)

The `stdin` qualifier allows you to forward the input value to the [standard input](http://en.wikipedia.org/wiki/Standard_streams#Standard_input_.28stdin.29) of the process script. For example:

```nextflow
process cat {
    input:
    stdin

    script:
    """
    cat -
    """
}

workflow {
    channel.of('hello', 'hola', 'bonjour', 'ciao')
        | map { v -> v + '\n' }
        | cat
}
```

will output:

```
hola
bonjour
ciao
hello
```

(process-input-tuple)=

### Input tuples (`tuple`)

The `tuple` qualifier groups multiple values into a single input definition. Each element in the tuple is associated with a corresponding element in the `tuple` definition. For example:

```nextflow
process cat {
    input:
    tuple val(id), path('input.txt')

    script:
    """
    echo "Processing $id"
    cat input.txt > copy
    """
}

workflow {
    channel.of( [1, 'alpha.txt'], [2, 'beta.txt'], [3, 'delta.txt'] ) | cat
}
```

In the above example, the `tuple` input consists of the value `id` and the file `input.txt`.
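
Tuples are commonly used to keep an identifier together with its file as both travel through a pipeline. The following is a minimal sketch of that pattern, not a prescribed approach: the process name `count_lines`, the input name `reads`, and the glob `/some/path/*.txt` are hypothetical placeholders. It builds each tuple from a file object produced by `channel.fromPath`, using the file's `baseName` as the identifier:

```nextflow
// Hypothetical process: receives an (id, file) pair and emits (id, line count)
process count_lines {
    input:
    tuple val(id), path(reads)

    output:
    tuple val(id), stdout

    script:
    """
    wc -l < ${reads}
    """
}

workflow {
    // pair each file with an identifier derived from its name
    channel.fromPath('/some/path/*.txt')
        | map { f -> [f.baseName, f] }
        | count_lines
        | view
}
```

Emitting the identifier again in the output tuple lets downstream processes match each result to its source file, even though tasks may finish in any order.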

A `tuple` definition may contain any of the following qualifiers, as previously described: `val`, `env`, `path` and `stdin`. Files specified with the `path` qualifier are treated exactly the same as standalone `path` inputs.

(process-input-each)=

### Input repeaters (`each`)

The `each` qualifier allows you to repeat the execution of a process for each item in a collection, each time a new value is received. For example:

```nextflow
process align {
    input:
    path seq
    each mode

    script:
    """
    t_coffee -in $seq -mode $mode > result
    """
}

workflow {
    sequences = channel.fromPath('*.fa')
    methods = ['regular', 'espresso', 'psicoffee']

    align(sequences, methods)
}
```

In the above example, each time a file of sequences is emitted from the `sequences` channel, the process executes *three* tasks, each running a T-coffee alignment with a different value for the `mode` parameter. This behavior is useful when you need to repeat the same task over a given set of parameters.

Input repeaters can be applied to files as well. For example:

```nextflow
process align {
    input:
    path seq
    each mode
    each path(lib)

    script:
    """
    t_coffee -in $seq -mode $mode -lib $lib > result
    """
}

workflow {
    sequences = channel.fromPath('*.fa')
    methods = ['regular', 'espresso']
    libraries = [ file('PQ001.lib'), file('PQ002.lib'), file('PQ003.lib') ]

    align(sequences, methods, libraries)
}
```

In the above example, each sequence input file emitted by the `sequences` channel triggers six alignment tasks, three with the `regular` method against each library file, and three with the `espresso` method.

:::{note}
When multiple repeaters are defined, the process is executed for each *combination* of them.
:::

:::{note}
Input repeaters do not support tuples. Use the {ref}`operator-combine` operator to combine the repeated input with the other inputs to produce all of the desired input combinations.
:::

(process-multiple-inputs)=

### Multiple inputs

A process can declare multiple inputs, which allows it to accept inputs from multiple dataflow sources.

:::{warning}
Do not supply more than one channel when calling a process with multiple inputs. Invoking a process with multiple channels can lead to {ref}`non-deterministic behavior `. All additional inputs should be dataflow values.
:::

When a process is defined with multiple inputs, it waits for a value from each input and launches a new task with the combined values. When one of the inputs is a channel, the process repeats until all values in the channel are consumed. If the channel is empty, the process will not launch any tasks.

For example:

```nextflow
process echo {
    input:
    val x
    val y

    script:
    """
    echo $x and $y
    """
}

workflow {
    x = channel.value(1)
    y = channel.of('a', 'b', 'c')
    echo(x, y)
}
```

The above example executes the `echo` process three times. The dataflow value `x` is reused for each value in `y`. It outputs:

```
1 and a
1 and b
1 and c
```

See also: {ref}`process-out-singleton`.

(process-output)=

## Outputs

The `output` section defines the outputs of a process, similar to a function return. A process may have at most one output section, which must contain at least one output declaration.

The output section follows the syntax shown below:

```
output:
  [,