add nextflow d30e48d
229
nextflow/adr/20250825-workflow-params.md
Normal file
@@ -0,0 +1,229 @@
# Workflow params

- Authors: Ben Sherman
- Status: accepted
- Date: 2025-08-25
- Tags: lang, static-types, params

## Summary

Introduce a unified, statically typed way to declare the top-level inputs (i.e. parameters) of a workflow.

## Problem Statement

Pipeline parameters in Nextflow are currently declared using property assignments:

```groovy
params.reads = "$baseDir/data/ggal/ggal_gut_{1,2}.fq"
params.transcriptome = "$baseDir/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
params.multiqc = "$baseDir/multiqc"
```

This approach has several limitations:

- **No type annotations**: Parameter types cannot be expressed in the script. The type of a parameter can only be inferred from its default value, which may be ambiguous (e.g., a default value of `null`, or a `String` that should be treated as a `Path`).

- **Heuristic type coercion**: When a parameter is supplied on the command line, Nextflow attempts to coerce the string value to the appropriate type using heuristics (e.g., `'true'` → boolean `true`, `'42'` → integer `42`). These heuristics are not always correct and can lead to unexpected behavior.

- **No built-in validation**: There is no built-in way to validate that a parameter is required, or that a parameter value has the correct type. Validation must be done manually in the script, or through an external JSON Schema file (`nextflow_schema.json`).

- **Scattered declarations and usage**: Parameters may be declared anywhere in the script or across multiple scripts, making it difficult to get a single view of the pipeline parameters. Parameters can be used anywhere in the pipeline, even outside the script where they are declared, making it impossible to validate params usage at compile-time.

## Goals

- Declare all parameters in one place in the script, with documentation.

- Provide explicit type annotations for parameters, enabling compile-time validation and IDE support.

- Clearly distinguish between required and optional parameters.

- Coerce CLI parameter values based on declared types, rather than relying on heuristics.

## Non-goals

- Removing the legacy `params.foo = bar` syntax -- legacy parameters must continue to work without modification.

- Changing the `params` config scope -- params can still be declared in the config file, although some best practices apply.

- Replacing `nextflow_schema.json` -- while the `params` block addresses many of the same needs, existing pipelines that use a JSON Schema should not be required to migrate. A native integration with `nextflow_schema.json` can be explored in the future.

- Supporting nested params -- the `params` block only supports a flat list of params. Nested params can still be used in the config, but they do not have first-class support at this time.

## Decision

Introduce the `params` block for declaring pipeline parameters. Each parameter is declared with a name, a type, and an optional default value:

```groovy
params {
    // Path to the input samplesheet
    input: Path

    // Whether to save intermediate files
    save_intermeds: Boolean = false
}
```

Typed parameters are used to validate parameter usage in the script, and to coerce CLI parameter values at runtime.

## Core Capabilities

### Parameter declaration

The `params` block consists of parameter *declarations*. Each parameter is declared as `name: Type` (required) or `name: Type = default` (optional with default):

```groovy
params {
    input: Path                 // required
    extra_file: Path?           // optional (defaults to null)
    db_file: Path = 'db.json'   // optional with default
    flag: Boolean               // boolean params default to false
}
```

All standard Nextflow types except `Channel` and `Value` can be used for parameter type annotations.

### Required and optional parameters

A parameter without a default value is *required*. If a required parameter is not supplied at runtime (via the command line, a params file, or the config), the run fails immediately with an informative error.

A parameter with the `?` suffix on its type is *optional* and will be `null` if not supplied. Boolean parameters without a default value implicitly default to `false`.

### Type-based CLI coercion

When a parameter is supplied on the command line, Nextflow converts the string value to the declared type:

| Declared type | String input | Resolved value          |
|---------------|--------------|-------------------------|
| `Boolean`     | `'true'`     | `true`                  |
| `Integer`     | `'42'`       | `42`                    |
| `Float`       | `'3.14'`     | `3.14`                  |
| `Duration`    | `'1h'`       | `Duration.of('1h')`     |
| `MemoryUnit`  | `'8 GB'`     | `MemoryUnit.of('8 GB')` |
| `Path`        | `'/data'`    | `Path.of('/data')`      |

This replaces the heuristic type detection used for legacy parameters.
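
To make the difference from heuristic detection concrete, the following Python sketch looks up a converter by the *declared* type instead of guessing from the shape of the string. It is an illustration only, not Nextflow's implementation: the `CONVERTERS` table, `to_bool`, and `coerce` are hypothetical names, and `Duration` and `MemoryUnit` are left out because their parsing rules are not described here.

```python
from pathlib import Path


def to_bool(text: str) -> bool:
    """Accept only 'true' or 'false' (case-insensitive); reject anything else."""
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"not a boolean: {text!r}")
    return lowered == "true"


# Hypothetical converter table: declared type name -> conversion function.
CONVERTERS = {
    "Boolean": to_bool,
    "Integer": int,
    "Float": float,
    "Path": Path,
    "String": str,
}


def coerce(declared_type: str, raw: str):
    """Convert a raw CLI string according to the declared type only."""
    try:
        convert = CONVERTERS[declared_type]
    except KeyError:
        raise TypeError(f"unsupported type: {declared_type}") from None
    return convert(raw)


# The same string resolves differently depending on the declaration.
print(repr(coerce("Integer", "42")))  # 42
print(repr(coerce("String", "42")))   # '42'
```

With a declared type of `String`, the value `'42'` stays a string, which is the case a heuristic would get wrong.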

### Compile-time validation

Legacy parameters can be accessed globally by all scripts in the pipeline. While this approach is flexible, it prevents compile-time validation and breaks modularity.

When a module references a param, it implicitly assumes that the param will always be defined by the workflow that uses it. This assumption cannot be validated at compile-time, so if the param is missing, an error will occur only at runtime.

The `params` block solves this problem by defining all params in one place. It serves as the inputs for the entry workflow, similar to the `take:` section in named workflows. Parameters should be passed to processes and workflows as explicit inputs, so that every variable reference can be validated against local declarations.

For example, the following workflow:

```groovy
// main.nf
params.input = '...'

workflow {
    HELLO()
}

// hello.nf
workflow HELLO {
    println "input = ${params.input}"
}
```

can be rewritten as follows:

```groovy
// main.nf
params {
    input: String
}

workflow {
    HELLO(params.input)
}

// hello.nf
workflow HELLO {
    take:
    input: String

    main:
    println "input = ${input}"
}
```

Typed parameters can still be used globally by all scripts for backwards compatibility. However, the type checker will only validate params used in the entry workflow and `output` block. Users should eventually migrate their pipelines as shown above for effective type checking.

### Script and config params

Parameters can also be defined in config files:

```groovy
params {
    outdir = 'results'
    publish_dir_mode = 'copy'
}
```

Config params continue to work as before. As a best practice, they should be used only to "configure the configuration."

Some config params can be replaced with native functionality, e.g., `outputDir` and `workflow.output.mode` for the above. The nf-core [institutional configs](https://github.com/nf-core/configs), which enable users to run a pipeline with their institutional config entirely from the command line, cannot be easily replaced and provide a clear use case for config params.

Config params are also propagated to the script since the config file can overwrite script params (e.g. in a profile). However, since the script `params` block only allows params that were explicitly declared, it needs to be able to distinguish between config params and invalid params (e.g. a command-line param with a typo).

To prevent a circular dependency between the script execution and config resolution, parameters are resolved as follows:

1. Load *CLI params* from the command line and the params file

2. Load config files
    - Params declared in the `params` scope are *config params*
    - If a config setting references an undeclared param, report an error
    - Params assigned in a profile are also marked as config params -- they should be used to overwrite existing params or potential script params
    - CLI params override config params

3. Execute script, resolve `params` block
    - CLI params and config params override default values in `params` block
    - If a required script param is undefined, report an error
    - If a CLI param is not declared in the `params` block and is not a config param, report an error

In other words, params are applied in the following order (lowest to highest precedence):

1. Default value in the `params` block
2. Config file (`params { param = value }`)
3. Params file (`-params-file params.json`)
4. Command-line arguments (`--param value`)

Any parameter supplied via command line or params file must be declared in the script or config. Supplying an undeclared parameter is an error.
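
The precedence and error rules above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Nextflow runtime: `resolve_params` and its argument shapes are hypothetical, each declared param is modelled as a `(required, default)` pair, and only script-declared params are returned.

```python
def resolve_params(declared, config_params, file_params, cli_params):
    """Resolve script params from the four sources described above.

    declared:      dict of name -> (required, default) from the params block
    config_params: dict of params assigned in config files
    file_params:   dict of params loaded from a params file
    cli_params:    dict of params given on the command line
    """
    # Later sources win: config file < params file < command line.
    supplied = {**config_params, **file_params, **cli_params}

    # A param from the params file or command line must be declared
    # in the script or in the config.
    for name in {**file_params, **cli_params}:
        if name not in declared and name not in config_params:
            raise KeyError(f"undeclared parameter: {name}")

    # Supplied values override defaults; required params must be supplied.
    resolved = {}
    for name, (required, default) in declared.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif required:
            raise ValueError(f"missing required parameter: {name}")
        else:
            resolved[name] = default
    return resolved


declared = {
    "input": (True, None),
    "save_intermeds": (False, False),
    "outdir": (False, "out"),
}
print(resolve_params(
    declared,
    config_params={"outdir": "results"},
    file_params={"input": "a.csv"},
    cli_params={"input": "b.csv"},
))
```

In this run the command line wins for `input`, the config file wins over the default for `outdir`, and `save_intermeds` keeps its declared default. A misspelled `--inptu` would raise the "undeclared parameter" error instead of being silently ignored.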

## Links

- Community issue: [#4669](https://github.com/nextflow-io/nextflow/issues/4669)
- [Workflow outputs ADR](./20251020-workflow-outputs.md)
- [Record types ADR](./20260306-record-types.md)

## Appendix

### Runtime type analysis via reflection

Validating and converting params against declared types requires the type annotations to be fully available at runtime. Parameterized types such as `List<String>` must provide both the type (`List`) and the generic type arguments (`[String]`).

During compilation, type annotations are modeled using `ClassNode`, which provides the "raw" type and type arguments via `getTypeClass() -> Class` and `getGenericsTypes() -> GenericsType[]`.

At runtime, type annotations are modeled using `Type`, for which there are two primary cases:

- If the type is parameterized, it is a `ParameterizedType`, which provides the "raw" type and type arguments via `getRawType() -> Class` and `getActualTypeArguments() -> Type[]`.

- Otherwise, the type is a `Class` corresponding to the raw type.

This type information can be obtained at runtime from the following entities:

- Class fields via `Field::getGenericType() -> Type`
- Method parameters via `Parameter::getParameterizedType() -> Type`

For this reason, the `params` block is compiled as a class, so that each parameter declaration is a field which can model a parameterized type.

Type annotations can be marked as nullable using the `?` suffix. This marker is compiled as a custom `@Nullable` annotation on the corresponding field, so that the runtime can use this information.

For example, when loading a JSON file as a collection of records, Nextflow uses the given record type to validate each JSON object in the collection:

- String values that map to a record field with type `Path` are converted to Path values
- If a JSON object is missing a record field that is marked as nullable, it is considered valid

While type annotations are used only at compile-time in all other contexts, they are needed at runtime for pipeline parameters in order to validate and convert external input data to the expected type.
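
As a rough analogy (not the JVM mechanism itself), Python's `typing` module exposes the same split between a raw type and its type arguments for annotated class fields. The sketch below uses a hypothetical `Params` class to stand in for a compiled `params` block, with `Optional[...]` playing the role of the `?` suffix; `describe` is an illustrative helper, not part of any Nextflow API.

```python
from pathlib import Path
from typing import Optional, Union, get_args, get_origin, get_type_hints


class Params:
    """Stand-in for a params block compiled as a class: one annotated field per param."""
    input: Path
    samples: list[str]
    extra_file: Optional[Path]


def describe(annotation):
    """Return (raw type, type arguments, nullable) for a field annotation."""
    nullable = False
    # Optional[X] is Union[X, None]; unwrap it and remember the nullability.
    if get_origin(annotation) is Union and type(None) in get_args(annotation):
        nullable = True
        annotation = next(a for a in get_args(annotation) if a is not type(None))
    # Parameterized types report an origin; plain classes are their own raw type.
    raw = get_origin(annotation) or annotation
    return raw, get_args(annotation), nullable


for name, annotation in get_type_hints(Params).items():
    print(name, describe(annotation))
```

Here `samples` reports `list` as its raw type with `(str,)` as type arguments, mirroring what `getRawType()` and `getActualTypeArguments()` provide for `List<String>`, while `input` is a plain class with no type arguments.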
154
nextflow/adr/20250922-plugin-spec.md
Normal file
@@ -0,0 +1,154 @@
# Plugin Spec

- Authors: Ben Sherman
- Status: accepted
- Deciders: Ben Sherman, Paolo Di Tommaso
- Date: 2025-09-22
- Tags: plugins

## Summary

Provide a way for external systems to understand key information about third-party plugins.

## Problem Statement

Nextflow plugins need a way to statically declare extensions to the Nextflow language so that external systems can extract information about a plugin without loading it in the JVM.

Primary use cases:

- The Nextflow language server needs to know about any config scopes, custom functions, etc., defined by a plugin, in order to recognize them in Nextflow scripts and config files.

- The Nextflow plugin registry (or other user interfaces) can use this information to provide API documentation.

## Goals or Decision Drivers

- External systems (e.g. language server) need to be able to understand plugins without having to load them in the JVM.

## Non-goals

- Defining specs for the core runtime and core plugins: these definitions are handled separately, although they may share some functionality with plugin specs.

## Considered Options

### Nextflow plugin system

Require external systems to use Nextflow's plugin system to load plugins at runtime in order to extract information about them.

- **Pro:** Allows any information to be extracted since the entire plugin is loaded.

- **Con:** Requires the entire Nextflow plugin system to be reused or reimplemented. This is not ideal for Java applications, since the plugin system is implemented in Groovy, and it is incompatible with non-JVM applications.

- **Con:** Requires plugins to be downloaded, cached, and loaded in the JVM, even though there is no need to use the plugin.

### Plugin spec

Define a plugin spec for every plugin release which is stored and served by the plugin registry as JSON.

- **Pro:** Allows any system to inspect plugin definitions through a standard JSON document, instead of downloading plugins and loading them into a JVM.

- **Con:** Requires the plugin spec to be generated at build-time and stored in the plugin registry.

- **Con:** Requires a standard format to ensure interoperability across different versions of Nextflow, the language server, and third-party plugins.

## Solution

Define a plugin spec for every plugin release which is stored and served by the plugin registry as JSON.

- Plugin developers only need to define [extension points](https://nextflow.io/docs/latest/plugins/developing-plugins.html#extension-points) as usual, and the Gradle plugin will extract the plugin spec and store it in the plugin registry as part of each plugin release.

- The language server can infer which third-party plugins are required from the `plugins` block in a config file. It will retrieve the appropriate plugin specs from the plugin registry.

A plugin spec consists of a list of *definitions*. Each definition has a *type* and a *spec*.

Example:

```json
{
    "$schema": "https://raw.githubusercontent.com/nextflow-io/schemas/main/plugin/v1/schema.json",
    "definitions": [
        {
            "type": "ConfigScope",
            "spec": {
                // ...
            }
        },
        {
            "type": "Function",
            "spec": {
                // ...
            }
        }
    ]
}
```

The following types of definitions are allowed:

**ConfigScope**

Defines a top-level config scope. The spec consists of a *name*, an optional *description*, and *children*.

The children should be a list of definitions corresponding to nested config scopes and options. The following definitions are allowed:

- **ConfigOption**: Defines a config option. The spec consists of a *name*, a *description*, and a *type*.

- **ConfigScope**: Defines a nested config scope, using the same spec as for top-level scopes.

Example:

```json
{
    "type": "ConfigScope",
    "spec": {
        "name": "hello",
        "description": "The `hello` scope controls the behavior of the `nf-hello` plugin.",
        "children": [
            {
                "type": "ConfigOption",
                "spec": {
                    "name": "message",
                    "description": "Message to print to standard output when the plugin is enabled.",
                    "type": "String"
                }
            }
        ]
    }
}
```

**Factory**

Defines a channel factory that can be included in Nextflow scripts. The spec is the same as for functions.

**Function**

Defines a function that can be included in Nextflow scripts. The spec consists of a *name*, an optional *description*, a *return type*, and a list of *parameters*. Each parameter consists of a *name* and a *type*.

Example:

```json
{
    "type": "Function",
    "spec": {
        "name": "sayHello",
        "description": "Say hello to the given target",
        "returnType": "void",
        "parameters": [
            {
                "name": "target",
                "type": "String"
            }
        ]
    }
}
```

**Operator**

Defines a channel operator that can be included in Nextflow scripts. The spec is the same as for functions.
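
To illustrate how an external tool might consume such a document without a JVM, the following Python sketch indexes the config options and function signatures from a small spec assembled from the examples above. The helper names `config_options` and `index_spec` are hypothetical, and the sketch assumes only the fields shown in this document; the published schema may define more.

```python
import json

# Minimal spec combining the ConfigScope and Function examples (descriptions omitted).
SPEC_JSON = """
{
  "definitions": [
    {"type": "ConfigScope", "spec": {"name": "hello", "children": [
      {"type": "ConfigOption", "spec": {"name": "message", "type": "String"}}
    ]}},
    {"type": "Function", "spec": {"name": "sayHello", "returnType": "void",
      "parameters": [{"name": "target", "type": "String"}]}}
  ]
}
"""


def config_options(definition, prefix=""):
    """Yield (dotted option name, type) pairs, recursing into nested scopes."""
    spec = definition["spec"]
    name = prefix + spec["name"]
    if definition["type"] == "ConfigOption":
        yield name, spec["type"]
    elif definition["type"] == "ConfigScope":
        for child in spec.get("children", []):
            yield from config_options(child, name + ".")


def index_spec(spec):
    """Collect config options and callable signatures from a plugin spec."""
    options, functions = {}, {}
    for definition in spec["definitions"]:
        kind = definition["type"]
        if kind == "ConfigScope":
            options.update(config_options(definition))
        elif kind in ("Function", "Factory", "Operator"):
            s = definition["spec"]
            param_types = [p["type"] for p in s["parameters"]]
            functions[s["name"]] = (param_types, s["returnType"])
    return options, functions


options, functions = index_spec(json.loads(SPEC_JSON))
print(options)    # {'hello.message': 'String'}
print(functions)  # {'sayHello': (['String'], 'void')}
```

Because factories and operators share the function spec, one branch handles all three definition types.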

## Rationale & discussion

Now that there is a Gradle plugin for building Nextflow plugins and a registry to publish and retrieve plugins, it is possible to generate, publish, and retrieve plugin specs in a way that is transparent to plugin developers.

Plugin specs adhere to a pre-defined [schema](https://raw.githubusercontent.com/nextflow-io/schemas/main/plugin/v1/schema.json) to ensure consistency across different versions of Nextflow. In the future, new versions of the schema can be defined as needed to support new behaviors or requirements.
84
nextflow/adr/20250929-repository-directory-traversal.md
Normal file
@@ -0,0 +1,84 @@
# ADR: Repository Directory Traversal API

**Date**: 2025-09-29
**Status**: Accepted
**Context**: Need for standardized directory listing across Git hosting providers

## Decision

Introduce a `listDirectory(String path, int depth)` method to the `RepositoryProvider` abstraction to enable unified directory traversal across different Git hosting platforms.

## Context

Nextflow requires the ability to explore repository directory structures across multiple Git hosting providers (GitHub, GitLab, Bitbucket, Azure DevOps, Gitea) without full repository clones. Each provider has different API capabilities and constraints for directory listing operations.

## Technical Implementation

### Core Algorithm

All providers follow a consistent pattern:
1. **Path Resolution**: Normalize path to provider API format
2. **Strategy Selection**: Choose recursive vs iterative approach based on API capabilities
3. **HTTP Request**: Execute provider-specific API calls
4. **Response Processing**: Parse to standardized `RepositoryEntry` objects
5. **Depth Filtering**: Apply client-side limits when APIs lack precise depth control

### API Strategy Classification

**Strategy A: Native Recursive (GitHub, GitLab, Azure)**
- Single HTTP request with recursive parameters
- Server-side tree traversal
- Performance: O(1) API calls

**Strategy B: Iterative Traversal (Bitbucket Server, Gitea)**
- Multiple HTTP requests per directory level
- Client-side recursion management
- Performance: O(n) API calls where n = number of directories

**Strategy C: Limited Support (Bitbucket Cloud)**
- Single-level listing only
- Throws exceptions for depth > 1
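
Strategy B can be sketched as a client-side loop that issues one listing call per directory and stops descending at the requested depth. The Python below is a hedged illustration, not the provider code: `TREE` is a hypothetical in-memory repository standing in for HTTP responses, `list_directory` is an illustrative helper, and it assumes that `depth = 1` means "immediate children only", which this document does not state explicitly.

```python
# Hypothetical in-memory repository: directory path -> list of (name, is_dir).
TREE = {
    "": [("main.nf", False), ("modules", True)],
    "modules": [("fastqc", True), ("utils.nf", False)],
    "modules/fastqc": [("main.nf", False)],
}


def list_directory(list_children, path, depth):
    """Iteratively list entries under `path`, descending at most `depth` levels.

    Returns (sorted entry paths, number of listing calls made).
    """
    entries, calls = [], 0
    pending = [(path, 1)]  # (directory to list, level of its children)
    while pending:
        directory, level = pending.pop()
        calls += 1  # one simulated API request per directory visited
        for name, is_dir in list_children(directory):
            full = f"{directory}/{name}" if directory else name
            entries.append(full)
            # Client-side depth limit: only descend while below the limit.
            if is_dir and level < depth:
                pending.append((full, level + 1))
    return sorted(entries), calls


print(list_directory(TREE.__getitem__, "", 1))  # (['main.nf', 'modules'], 1)
print(list_directory(TREE.__getitem__, "", 3))
```

The call count grows with the number of directories visited, which is the O(n) cost noted for Strategy B, whereas a native recursive endpoint would return the same entries from a single request.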

### Provider Implementation Details

| Provider         | Endpoint                          | Recursive Support | Performance    |
|------------------|-----------------------------------|-------------------|----------------|
| GitHub           | `/git/trees/{sha}?recursive=1`    | Native            | Optimal        |
| GitLab           | `/repository/tree?recursive=true` | Native            | Optimal        |
| Azure            | `/items?recursionLevel=Full`      | Native            | Optimal        |
| Bitbucket Server | `/browse/{path}`                  | Manual iteration  | Multiple calls |
| Gitea            | `/contents/{path}`                | Manual iteration  | Multiple calls |
| Bitbucket Cloud  | `/src/{commit}/{path}`            | None              | Unsupported    |

### HTTP API Constraints

- **Rate Limiting**: 60-5000 requests/hour depending on provider and authentication
- **Response Size**: Controlled by `NXF_GIT_RESPONSE_MAX_LENGTH` environment variable
- **Timeouts**: 60-second connect timeout across all providers
- **Authentication**: Required for private repositories and higher rate limits

## Consequences

### Positive
- **Unified Interface**: Consistent API across all Git hosting providers
- **Performance Optimization**: Uses native recursive APIs where available
- **Graceful Degradation**: Falls back to iterative traversal when needed
- **Error Resilience**: Handles partial failures and API limitations

### Negative
- **Provider Inconsistency**: Performance varies significantly between providers
- **API Rate Limits**: Multiple calls required for some providers may hit limits faster
- **Memory Usage**: Large directory structures loaded entirely into memory

### Neutral
- **Complexity**: Abstraction layer adds code complexity but improves maintainability
- **Testing**: Comprehensive test coverage required for each provider implementation

## Implementation Notes

- Local Git repositories use JGit TreeWalk for optimal performance
- Client-side depth filtering ensures consistent behavior across providers
- Error handling varies by provider: some return empty lists, others throw exceptions
- Future enhancements could include caching based on commit SHA and pagination support

This decision enables Nextflow to efficiently explore repository structures regardless of the underlying Git hosting platform, with automatic optimization based on each provider's API capabilities.
404
nextflow/adr/20251017-typed-processes.md
Normal file
@@ -0,0 +1,404 @@
# Typed processes

- Authors: Ben Sherman
- Status: accepted
- Deciders: Ben Sherman, Paolo Di Tommaso
- Date: 2025-10-17
- Tags: lang, static-types, processes

## Updates

### Version 1.1 (2026-03-23)

- Changed the method signature for `stageAs` from `(filePattern, value)` to `(value, filePattern)` to mirror commands like `cp`, `mv`, etc.

- Replaced annotation-based tuple syntax (`(...): Tuple<...>`) with destructuring syntax (`tuple(...)`) for better continuity with legacy syntax and record input syntax.

## Summary

Support static typing for process inputs and outputs.

## Problem Statement

The legacy process syntax uses *qualifiers* to describe both the type and staging behavior of each input and output:

```groovy
process FASTQC {
    input:
    tuple val(id), path(fastq_1), path(fastq_2)
    path index

    output:
    tuple val(id), path("fastqc_${id}_logs")

    script:
    // ...
}
```

This syntax has several limitations:

- **No static typing**: The `val` qualifier cannot specify a type, so there is no way to validate input values. The `path` qualifier cannot distinguish between a file and a file collection. The `arity` option was introduced to address this ambiguity, but it is cumbersome and rarely used.

- **Type and staging behavior are coupled**: Qualifiers like `path` describe both the type *and* the staging behavior (link into task directory). There is no way to specify staging behavior separately, such as staging a tuple element or record field as an environment variable.

- **No nullability**: There is no way to declare that an input may be `null`. The `path` qualifier raises a runtime error if a null value is received. Outputs can be marked optional, but optional outputs are handled by emitting nothing rather than emitting `null`. A tuple output can be optional, but a tuple element cannot.

- **Limited output expressiveness**: Outputs must be expressed in terms of qualifiers that mirror the input qualifiers. It is difficult to express many kinds of output values, and it is unclear to the user whether a given expression is valid or not.

## Goals

- Provide a way to model process inputs and outputs with types from the Nextflow standard library.

- Separate the *type* of an input from its *staging behavior*.

- Provide first-class support for nullable inputs and outputs.

- Allow outputs to be arbitrary expressions, ensuring consistency with the rest of the language.

- Enable compile-time type checking for processes.

## Non-goals

- Removing the legacy qualifier syntax -- legacy processes must continue to work without modification.

- Enforcing type checking -- static type checking will be introduced progressively as an opt-in feature.

## Decision

Introduce **typed processes**, which use a new syntax for inputs and outputs based on type annotations instead of qualifiers.

All other process sections (directives, script, stub, etc.) are supported by typed processes without changes. Only the `input:` and `output:` sections are changed.

## Core Capabilities

### Typed inputs

Each input is declared as `name: Type`:

```groovy
process FASTQC {
    input:
    id: String
    fastq_1: Path
    fastq_2: Path

    script:
    """
    fastqc -o fastqc_${id}_logs ${fastq_1} ${fastq_2}
    """
}
```

All standard library types except `Channel` and `Value` are valid input types. Inputs of type `Path` (or `Path` collections such as `Set<Path>`) are automatically staged into the task directory.

### Nullable inputs

Appending `?` to a type annotation allows the input to be `null`:

```groovy
process CAT_OPT {
    input:
    input: Path?

    stage:
    stageAs input, 'input.txt'

    output:
    stdout()

    script:
    '''
    [[ -f input.txt ]] && cat input.txt || echo 'empty input'
    '''
}
```

By default, a task fails if any input receives `null`.

### Stage directives

Staging behavior is moved to a dedicated `stage:` section that appears after `input:`. This replaces the staging aspects of legacy qualifiers:

| Legacy qualifier  | Stage directive           |
|-------------------|---------------------------|
| `env('NAME')`     | `env 'NAME', value`       |
| `stdin`           | `stdin value`             |
| `path('name.fa')` | `stageAs file, 'name.fa'` |

For example:

```groovy
process BLAST {
    input:
    fasta: Path

    stage:
    stageAs fasta, 'query.fa'

    script:
    """
    blastp -query query.fa -db nr
    """
}
```

Separating staging from type declaration keeps the inputs clean and makes it easier to specify staging behavior independently of the input type.

### Tuple inputs

Tuples are declared inline using `tuple(name: Type, ...)`:

```groovy
process FASTQC {
    input:
    tuple(id: String, fastq_1: Path, fastq_2: Path)

    script:
    // ...
}
```

Each component is destructured into a local variable. This mirrors the `tuple()` constructor used in the output section and in workflow logic, making the syntax consistent.

### Typed outputs

Each output declaration consists of an optional name and type, and a value expression:

```groovy
process ECHO {
    input:
    message: String

    output:
    out_file: Path = file('message.txt')
    out_std: String = stdout()

    script:
    """
    echo '${message}' | tee message.txt
    """
}
```

When there is only one output, the name and type can be omitted:

```groovy
process ECHO {
    input:
    message: String

    output:
    file('message.txt')

    script:
    """
    echo '${message}' > message.txt
    """
}
```

Outputs can be arbitrary expressions, rather than being restricted to specific qualifiers such as `tuple` and `val`. Special functions such as `file()`, `files()`, `env()`, and `stdout()` can be composed into the desired output structure.

### Nullable outputs

By default, the `file()` and `files()` functions raise an error if the given file is missing. These functions can be called with `optional: true` to allow missing files. This way, it is possible to declare a tuple output that contains nullable values:

```groovy
process MAYBE {
    input:
    id: String

    output:
    tuple(id, file('result.txt', optional: true))

    script:
    """
    [[ '$id' == 42 ]] && touch result.txt || true
    """
}
```

### Topic emissions

A `topic:` section emits values to topic channels using the `>>` operator:

```groovy
process CAT {
    input:
    message: Path

    output:
    stdout()

    topic:
    tuple('cat', eval('cat --version')) >> 'versions'

    script:
    """
    cat ${message}
    """
}
```

Moving topic emissions to a dedicated section allows them to be defined without having to include them in the process outputs.

## Distinguishing between typed and legacy processes

Typed processes are gated behind the `nextflow.enable.types` feature flag, in order to distinguish between typed and legacy processes in the language.

When a script enables this feature flag, its processes are treated as typed processes; otherwise, its processes are treated as legacy processes. This way, typed and legacy processes cannot be mixed in the same script, but they can be used together as long as they are declared in different scripts.

While typed and legacy processes are syntactically distinct and could theoretically be allowed in the same script, the feature flag helps distinguish typed vs legacy to the reader (whether human or agent).
|
||||
|
||||
## Alternatives
|
||||
|
||||
### Implicit tuple input
|
||||
|
||||
The syntax for typed process inputs aims to be consistent with typed inputs throughout the rest of the language, such as the `params` block and workflow inputs, which use the pattern of `<name>: <type>`. The `tuple` input qualifier does not fit neatly into this pattern, since it specifies multiple tuple *components*:
|
||||
|
||||
```groovy
process QUANT {
    input:
    tuple(id: String, fastq_1: Path, fastq_2: Path)
    index: Path

    // ...
}

workflow {
    ch_samples = channel.of( tuple('1', file('1_1.fq'), file('1_2.fq')) )
    index = file('index.fa')
    QUANT(ch_samples, index)
}
```
|
||||
|
||||
One alternative is to remove tuple inputs altogether and treat the entire `input:` section as an implicit tuple input:
|
||||
|
||||
```groovy
process QUANT {
    input:
    id: String
    fastq_1: Path
    fastq_2: Path
    index: Path

    // ...
}

workflow {
    ch_samples = channel.of( tuple('1', file('1_1.fq'), file('1_2.fq')) )
    index = file('index.fa')
    QUANT( ch_samples.combine(index) )
}
```
|
||||
|
||||
With this approach, a process would always be called with a single input, and multiple sources (e.g. `ch_samples` and `index`) would need to be combined into a single input. This could be done explicitly with the `combine` operator or implicitly by the runtime.
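
To make the "single combined input" idea concrete, here is a minimal Python sketch of what an explicit combine step does conceptually: it pairs every sample tuple with the index and flattens the result into one tuple per task. The `combine` helper and the list-based "channels" are illustrative stand-ins, not Nextflow APIs.

```python
from itertools import product

def combine(left, right):
    """Cross product of two sequences of values, flattening tuples in the result."""
    def as_tuple(value):
        return value if isinstance(value, tuple) else (value,)
    return [as_tuple(a) + as_tuple(b) for a, b in product(left, right)]

# Hypothetical inputs: two sample tuples and a single index file.
samples = [("1", "1_1.fq", "1_2.fq"), ("2", "2_1.fq", "2_2.fq")]
index = ["index.fa"]

# Each element is now one flat tuple: (id, fastq_1, fastq_2, index).
combined = combine(samples, index)
```

Under this model, every process call receives exactly one such tuple, which is why sources with different shapes would have to be merged before the call.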
|
||||
|
||||
However, this approach would be a significant change to process call semantics, even if only applied to typed processes. It would likely be difficult to validate for processes with many inputs.
|
||||
|
||||
The tuple destructuring syntax makes it possible to migrate legacy processes to typed processes without changing workflow logic or call semantics. While the `tuple(...)` syntax is a deviation from the typed input syntax used by the rest of the language, such deviations can be appropriate and even advantageous when used judiciously in a custom language.
|
||||
|
||||
### Type annotation syntax for tuple inputs
|
||||
|
||||
Another alternative for tuple inputs is to use a type annotation:
|
||||
|
||||
```groovy
process FASTQC {
    input:
    (id, fastq_1, fastq_2): Tuple<String,Path,Path>
    index: Path

    // ...
}
```
|
||||
|
||||
This approach attempts to bring the syntax closer to the `<name>: <type>` pattern while maintaining support for tuple destructuring. This syntax was used in the first preview of typed processes in Nextflow 25.10.
|
||||
|
||||
However, this syntax needlessly separates the component name from its corresponding type, making it harder to read and validate. Although it is semantically equivalent to the legacy syntax, it looks and feels very different, which can be jarring for users.
|
||||
|
||||
With the introduction of records, the `tuple(...)` destructuring syntax emerged as a clear pattern to follow for both records and tuples:
|
||||
|
||||
**Legacy process:**
```groovy
process FASTQC {
    input:
    tuple val(id), path(fastq_1), path(fastq_2)

    output:
    tuple val(id), path("fastqc_${id}_logs")

    // ...
}
```

**Typed process (tuple):**
```groovy
process FASTQC {
    input:
    tuple(id: String, fastq_1: Path, fastq_2: Path)

    output:
    tuple(id, file("fastqc_${id}_logs"))

    // ...
}
```

**Typed process (record):**
```groovy
process FASTQC {
    input:
    record(
        id: String,
        fastq_1: Path,
        fastq_2: Path
    )

    output:
    record(
        id: id,
        fastqc: file("fastqc_${id}_logs")
    )

    // ...
}
```
|
||||
|
||||
This pattern provides the best balance of continuity with the old way and consistency with static typing:
|
||||
|
||||
- A legacy process can be migrated to a typed process by replacing the `tuple` input/output qualifier with the `tuple` destructor/constructor.
|
||||
- A typed process can be migrated from tuples to records by replacing `tuple` with `record` and adding fields to the record output.
|
||||
- The `tuple` and `record` destructors use the same `<name>: <type>` pattern used by the rest of the language.
|
||||
- At each stage, the inputs and outputs mirror each other without creating syntactic confusion.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
|
||||
- Type annotations make processes self-documenting and provide the information needed to perform static type checking.
|
||||
|
||||
- Separating type from staging behavior (the `stage:` section) makes each concern independently clear.
|
||||
|
||||
- Nullable types (`?`) provide first-class support for nullable input files.
|
||||
|
||||
- Outputs can be structured arbitrarily and can contain nullable files.
|
||||
|
||||
**Negative:**
|
||||
|
||||
- The `each` qualifier is not supported; pipelines using it must be refactored to use the `combine` operator before migrating to typed processes.
|
||||
|
||||
- The typed syntax must be maintained alongside the legacy syntax, which makes the codebase more complex and may cause confusion as the community transitions to the new syntax.
|
||||
|
||||
**Neutral:**
|
||||
|
||||
- Typed processes use the same standard types as the rest of the language, so no additional type vocabulary is introduced.
|
||||
|
||||
- Typed processes are enabled by a feature flag, which introduces new functionality without breaking existing code and helps distinguish between typed and legacy code.
|
||||
|
||||
## Links
|
||||
|
||||
- [Nextflow standard types](https://nextflow.io/docs/latest/reference/stdlib-types.html)
|
||||
- Community issues: #1694, #2678
|
||||
- Related nf-core discussion: https://github.com/nf-core/modules/issues/4311
|
||||
- Original implementation: #4553
|
||||
198
nextflow/adr/20251020-workflow-outputs.md
Normal file
@@ -0,0 +1,198 @@
|
||||
# Workflow outputs
|
||||
|
||||
- Authors: Ben Sherman
|
||||
- Status: accepted
|
||||
- Date: 2025-10-20
|
||||
- Tags: lang, workflows
|
||||
|
||||
## Summary
|
||||
|
||||
Introduce a unified, dataflow-centric way to declare the top-level outputs of a workflow.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
In Nextflow DSL1, each process used `publishDir` to copy output files from the work directory to an external location. Nextflow DSL2 inherited this approach, but it became increasingly problematic as pipelines grew larger and more modular:
|
||||
|
||||
- **Mismatch with reusable modules**: Publishing rules often depend on how a process is used in a given pipeline. Setting `publishDir` inside a module process makes the module less reusable, since the publish path and mode are baked into the process definition. Using process selectors in configuration is verbose and fragile.
|
||||
|
||||
- **Fragmented outputs**: Publishing logic is scattered across many module files. There is no single place to see what a pipeline produces or to reason about the output structure.
|
||||
|
||||
- **Redundant configuration**: Common settings like the base output directory and publish mode must be repeated in every `publishDir` declaration, leading to duplication.
|
||||
|
||||
- **Mismatch with channels**: Channels carry both files and structured metadata (e.g., sample IDs, quality flags). The `publishDir` directive matches files with glob patterns and cannot capture metadata unless it happens to be written to a file. This mismatch makes it difficult to produce structured, self-describing outputs.
|
||||
|
||||
## Goals
|
||||
|
||||
- Declare all pipeline outputs in a single location alongside the entry workflow.
|
||||
|
||||
- Assign outputs from channels rather than from individual process definitions, decoupling pipeline-specific publishing rules from reusable modules.
|
||||
|
||||
- Support dynamic and fine-grained file publishing to match common publishing patterns (e.g. directory per sample, directory per pipeline step).
|
||||
|
||||
- Support structured index files (CSV, JSON, YAML) that preserve output files with associated metadata.
|
||||
|
||||
- Define publishing behavior (mode, overwrite, storage class, etc.) globally in the config.
|
||||
|
||||
- Support type annotations on output declarations for documentation and compile-time validation.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Removing support for `publishDir` immediately -- `publishDir` should continue to work without modification, although it may eventually be phased out as users migrate away from it.
|
||||
|
||||
- Publishing outputs from processes or named workflows -- only the entry workflow has a `publish:` section.
|
||||
|
||||
- Defining a JSON schema for workflow outputs -- schema/spec generation will be explored in the future.
|
||||
|
||||
## Decision
|
||||
|
||||
Introduce the `output` block for declaring workflow outputs. Each output defines how files are published to the output directory, and the format of the index file (if defined).
|
||||
|
||||
Introduce the output directory as a first-class concept in Nextflow, as well as the `workflow.output` config scope for controlling publishing behavior.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### Output definition
|
||||
|
||||
Workflow outputs consist of an `output` block, which declares each output, and a `publish:` section in the entry workflow, which assigns a dataflow source (channel or value) to each output:
|
||||
|
||||
```groovy
workflow {
    main:
    ch_fastqc = FASTQC(ch_reads)
    ch_report = MULTIQC(ch_fastqc.collect())

    publish:
    fastqc = ch_fastqc
    report = ch_report
}

output {
    fastqc: Channel<Path> {
        path 'fastqc'
    }
    report: Path {
        path '.'
    }
}
```
|
||||
|
||||
Every output assigned in `publish:` must be declared in the `output` block, and vice versa. A mismatch is a compile-time error.
|
||||
|
||||
Each output declaration can specify a type annotation for documentation and type checking support. Type annotations are optional and do not change runtime behavior. They are used by the type checker to validate the `publish:` section and the `path` directive.
|
||||
|
||||
### Output directory
|
||||
|
||||
The top-level output directory defaults to `results` in the launch directory. It can be overridden from the command line or config file:
|
||||
|
||||
```bash
nextflow run main.nf -output-dir my-results
```

```groovy
// nextflow.config
outputDir = 'my-results'
```
|
||||
|
||||
All publish paths declared in the `output` block are relative to this directory. Absolute paths are not allowed.
|
||||
|
||||
### Static and dynamic publish paths
|
||||
|
||||
The `path` directive accepts a string for a fixed path, or a closure for per-value paths:
|
||||
|
||||
```groovy
output {
    // static: all files go to results/fastq/
    reads {
        path 'fastq'
    }

    // dynamic: results are organized by sample id
    samples {
        path { sample -> "${sample.id}" }
    }
}
```
|
||||
|
||||
Nextflow recursively scans channel values for files, including files nested inside lists, maps, records, and tuples. Files that did not originate from the work directory are not published.
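
The recursive scan can be pictured with a small Python sketch. It is only a model of the behavior described above: records are represented as dicts, files as `PurePosixPath` objects, and the `/work` directory and helper names are assumptions made for illustration.

```python
from pathlib import PurePosixPath

def collect_files(value):
    """Recursively yield every path found in a nested value."""
    if isinstance(value, PurePosixPath):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from collect_files(item)
    elif isinstance(value, (list, tuple, set)):
        for item in value:
            yield from collect_files(item)
    # Anything else (strings, numbers, None) is treated as metadata and skipped.

def publishable(value, work_dir):
    """Keep only the files that live under the work directory."""
    return [p for p in collect_files(value) if p.is_relative_to(work_dir)]

# Hypothetical channel value mixing metadata, task outputs, and an external file.
work_dir = PurePosixPath("/work")
sample = {
    "id": "s1",
    "reads": (PurePosixPath("/work/ab/12/s1_1.fq"), PurePosixPath("/work/ab/12/s1_2.fq")),
    "reference": PurePosixPath("/data/genome.fa"),
    "reports": [{"qc": PurePosixPath("/work/cd/34/s1_fastqc.html")}],
}
files = publishable(sample, work_dir)
```

Here `reference` is found by the scan but filtered out, because it did not originate from the work directory.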
|
||||
|
||||
### Fine-grained file publishing with `>>`
|
||||
|
||||
Within a `path` closure, individual files can be published to different locations using the `>>` operator. Only files explicitly captured with `>>` are published; other files in the value are ignored.
|
||||
|
||||
```groovy
output {
    samples {
        path { sample ->
            sample.fastqc >> "fastqc/"
            sample.bam >> (params.save_bams ? "align/" : null)
            sample.bam_index >> (params.save_bams ? "align/" : null)
        }
    }
}
```
|
||||
|
||||
The *publish source* (left-hand side) should be a file or collection of files. The *publish target* (right-hand side) should be a relative path. If the target has a trailing slash, then the source is published *into* the target directory; otherwise the source is published *as* the target name.
|
||||
|
||||
A `null` target suppresses publishing for that file, and a `null` source is also a no-op. This way, publishing of individual files can be disabled by either setting the record field to `null` in workflow logic or using a param in the publish statement.
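
The source and target rules can be summarized in a short Python sketch. The function name `publish_destination` and the default output directory argument are illustrative assumptions; the sketch only models how a destination path could be derived, not how Nextflow implements publishing.

```python
from pathlib import PurePosixPath

def publish_destination(source, target, output_dir="results"):
    """Return where `source` would be published, or None if publishing is skipped."""
    if source is None or target is None:
        return None  # a null source or a null target is a no-op
    if PurePosixPath(target).is_absolute():
        raise ValueError("publish targets must be relative to the output directory")
    base = PurePosixPath(output_dir)
    if target.endswith("/"):
        # Trailing slash: publish *into* the target directory, keeping the file name.
        return base / target / PurePosixPath(source).name
    # No trailing slash: publish *as* the target name.
    return base / target

into_dir = publish_destination("/work/ab/12/s1.bam", "align/")
renamed = publish_destination("/work/ab/12/s1.bam", "align/sample1.bam")
skipped = publish_destination("/work/ab/12/s1.bam", None)
```

The `None` cases correspond to the two ways of disabling publishing for a file: a null record field or a null target.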
|
||||
|
||||
### Index files
|
||||
|
||||
Each output can generate a structured index file that records each published channel value along with its metadata. Supported formats are CSV, JSON, and YAML.
|
||||
|
||||
```groovy
output {
    samples {
        path 'fastq'
        index {
            path 'samples.csv'
            header true
        }
    }
}
```
|
||||
|
||||
The index file is essentially a *samplesheet* -- it preserves the structure of files and metadata in the published channel, and can be easily passed as input to downstream pipelines. Metadata fields (sample IDs, quality flags, etc.) do not need to be written to a separate metadata file or encoded into file paths.
|
||||
|
||||
Files that did not originate from the work directory are not published, but are still included in the index.
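
As a rough illustration of the samplesheet idea, the Python sketch below serializes a list of flat records to CSV with an optional header row. The column names, the sample values, and the way file paths are written are assumptions for the example; the ADR does not specify the exact layout of the generated index.

```python
import csv
import io

def write_index(records, header=True):
    """Serialize a list of flat records (dicts) to CSV text, one row per channel value."""
    buffer = io.StringIO()
    fields = list(records[0].keys())
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    if header:
        writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Hypothetical published channel values: metadata plus published file paths.
samples = [
    {"id": "s1", "fastq_1": "results/fastq/s1_1.fq", "fastq_2": "results/fastq/s1_2.fq"},
    {"id": "s2", "fastq_1": "results/fastq/s2_1.fq", "fastq_2": "results/fastq/s2_2.fq"},
]
index_text = write_index(samples)
```

Because metadata such as `id` travels in the same row as the files, a downstream pipeline can read the index back as its input samplesheet.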
|
||||
|
||||
### Global defaults via configuration
|
||||
|
||||
Common publish settings can be set globally under the `workflow.output` config scope:
|
||||
|
||||
```groovy
// nextflow.config
workflow {
    output {
        mode = 'copy'
        overwrite = 'lenient'
    }
}
```
|
||||
|
||||
These defaults can be overridden per-output in the `output` block:
|
||||
|
||||
```groovy
// main.nf
output {
    fastqc {
        mode = 'symlink'
        overwrite = true
    }
}
```
|
||||
|
||||
## Alternatives
|
||||
|
||||
### Publishing from processes and subworkflows
|
||||
|
||||
Earlier iterations allowed for workflow outputs to be published from subworkflows or processes, instead of requiring all workflow outputs to be propagated up to the entry workflow.
|
||||
|
||||
While this approach is less verbose, it breaks the modularity of processes and subworkflows. Publishing behavior is inherent to the pipeline, not the individual subcomponents which could be shared across many pipelines. The process or subworkflow should expose all of its outputs as channels, and the calling pipeline should decide whether and how to publish these outputs.
|
||||
|
||||
On the other hand, propagating all workflow outputs to the top will make pipelines more verbose, especially when using "skinny tuple" channels. This issue will be alleviated by migrating from tuples to records -- for this reason, it is recommended that large pipelines be migrated to records before being migrated to workflow outputs.
|
||||
|
||||
## Links
|
||||
|
||||
- Community issues: [#4042](https://github.com/nextflow-io/nextflow/issues/4042), [#4661](https://github.com/nextflow-io/nextflow/issues/4661), [#4670](https://github.com/nextflow-io/nextflow/issues/4670)
|
||||
- [Workflow params ADR](./20250825-workflow-params.md)
|
||||
- [Record types ADR](./20260306-record-types.md)
|
||||
930
nextflow/adr/20251114-module-system.md
Normal file
@@ -0,0 +1,930 @@
|
||||
# Module System for Nextflow
|
||||
|
||||
- Authors: Paolo Di Tommaso
|
||||
- Status: approved
|
||||
- Date: 2025-01-06
|
||||
- Tags: modules, dsl, registry, versioning, architecture
|
||||
- Version: 2.7
|
||||
|
||||
## Updates
|
||||
|
||||
### Version 2.7 (2026-03-09)
|
||||
- **Renamed `.checksum` to `.module-info`**: Leaves room for additional properties in the future
|
||||
- **Removed `@` prefix from module scopes**: Local modules are distinguished from remote modules by presence/absence of `./` prefix
|
||||
- **Removed version pinning from config**: Installed module versions are now inferred from the `meta.yml` of each module in the `modules/` directory instead of being declared in `nextflow.config`
|
||||
|
||||
### Version 2.6 (2026-01-28)
|
||||
- **Removed module parameters**: Module parameters specification moved to separate spec document.
|
||||
|
||||
### Version 2.5 (2026-01-23)
|
||||
- **Module parameters**: Replaced structured tool arguments with general module parameters defined in `meta.yml`
|
||||
- **Simplified tools section**: Removed `args` property from tools; tool arguments now configured via module parameters
|
||||
- **Simplified `requires` block**: Removed `plugins`, `modules`, and `subworkflows` sub-properties; `requires` now only contains `nextflow` version constraint
|
||||
- **Process modules focus**: Removed sub-workflow references; spec is now focused on process modules only
|
||||
|
||||
### Version 2.4 (2026-01-15)
|
||||
- **Removed transitive dependency resolution**: Module dependencies are explicit only; no automatic transitive resolution
|
||||
- **Removed `freeze` command**: No longer needed without transitive dependency management
|
||||
- **Simplified model**: Each module explicitly declares its dependencies in `nextflow.config`
|
||||
|
||||
### Version 2.3 (2026-01-15)
|
||||
- **Resolution Rules table**: Added clear table specifying behavior for each combination of local state and declared version
|
||||
- **Local modification protection**: Locally modified modules (checksum mismatch) are NOT overridden unless `-force` flag is used
|
||||
- **Simplified storage model**: Single version per module locally (`modules/@scope/name/` without version in path)
|
||||
- **`.checksum` file**: Registry checksum cached locally for fast integrity verification without network calls
|
||||
|
||||
### Version 2.2 (2025-01-06) — *Superseded by v2.5*
|
||||
- **Structured tool arguments**: Added `args` property to `tools` section for type-safe argument configuration
|
||||
- **New implicit variables**: `tools.<toolname>.args.<argname>` returns formatted flag+value; `tools.<toolname>.args` returns all args concatenated
|
||||
- **Deprecation**: All `ext.*` custom directives (e.g., `ext.args`, `ext.args2`, `ext.args3`, `ext.prefix`, `ext.suffix`) deprecated in favor of structured tool arguments
|
||||
- *Note: Tool arguments replaced by module parameters in v2.5*
|
||||
|
||||
### Version 2.1 (2024-12-11)
|
||||
- **Unified dependencies**: Consolidated `components`, `dependencies`, and `requires` into single `requires` field
|
||||
- **Unified version syntax**: `[scope/]name[@constraint]` format across plugins and modules
|
||||
- **Deprecation**: `components` field deprecated (use top-level `modules` instead)
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
Nextflow supports local script inclusion via the `include` directive but lacks standardized mechanisms for package management, versioning, and distribution of reusable process definitions. This limits code reuse and reproducibility across the ecosystem.

Requests for this feature go back to at least 2019; see GitHub issues [#1376](https://github.com/nextflow-io/nextflow/issues/1376), [#1463](https://github.com/nextflow-io/nextflow/issues/1463) and [#4122](https://github.com/nextflow-io/nextflow/issues/4112).
|
||||
|
||||
## Decision
|
||||
|
||||
Implement a module system with four core capabilities:
|
||||
|
||||
1. **Remote module inclusion** via registry
|
||||
2. **Semantic versioning** with dependency resolution
|
||||
3. **Unified Nextflow Registry** (rebrand existing Nextflow registry)
|
||||
4. **First-class CLI support** (install, publish, search, list, remove, run)
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### 1. Remote Module Inclusion
|
||||
|
||||
**DSL Syntax**:
|
||||
```groovy
// Include from registry (scoped module name without `./` prefix)
include { BWA_ALIGN } from 'nf-core/bwa-align'

// Existing file-based includes remain supported
include { MY_PROCESS } from './modules/my-process.nf'
```
|
||||
|
||||
**Module Naming**: Modules use scoped names of the form `scope/name` (e.g., `nf-core/salmon`, `myorg/custom`). Local paths remain supported for backwards compatibility. Nested paths within a module are not allowed; each module must have a `main.nf` as its entry point.
|
||||
|
||||
**Version Resolution**: Installed module versions are inferred from the `meta.yml` of each module in the `modules/` directory. If a module is not present locally, the latest available version is downloaded from the registry.
|
||||
|
||||
**Resolution Order**:
|
||||
1. Check local `modules/scope/name/` exists
|
||||
2. Verify integrity against `.module-info` file
|
||||
3. Apply resolution rules (see below)
|
||||
|
||||
**Resolution Rules**:
|
||||
|
||||
| Local State | Action |
|-------------|--------|
| Missing | Download latest from registry |
| Exists, checksum valid | Use local module (version from `meta.yml`) |
| Exists, checksum mismatch | **Warn**: locally modified, will NOT replace unless `-force` is used |
|
||||
|
||||
**Key Behaviors**:
|
||||
- **Local modification**: When the local module content was manually changed (checksum mismatch with `.module-info`), Nextflow warns and does NOT override to prevent accidental loss of local changes
|
||||
- **Force flag**: Use `-force` with `nextflow module install` to override locally modified modules
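
The resolution rules above amount to a small decision function. The Python sketch below encodes the table directly; the `Action` names and the boolean inputs are illustrative, and how the checksum is actually computed and compared against `.module-info` is left out because the ADR does not specify it here.

```python
from enum import Enum

class Action(Enum):
    DOWNLOAD_LATEST = "download latest from registry"
    USE_LOCAL = "use local module"
    WARN_KEEP_LOCAL = "warn and keep locally modified module"
    REPLACE = "replace locally modified module"

def resolve(local_exists, checksum_ok=False, force=False):
    """Map the local module state to an action, following the resolution rules table."""
    if not local_exists:
        return Action.DOWNLOAD_LATEST
    if checksum_ok:
        return Action.USE_LOCAL
    # Checksum mismatch: local edits are preserved unless -force was given.
    return Action.REPLACE if force else Action.WARN_KEEP_LOCAL
```

Keeping the default branch non-destructive reflects the stated goal of preventing accidental loss of local changes.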
|
||||
|
||||
**Resolution Timing**: Modules are resolved at workflow parse time (after plugin resolution at startup).

**Local Storage**: Downloaded modules are stored in the `modules/scope/name/` directory in the project root (not a global cache). Each module must contain a `main.nf` file as the required entry point. Module source code is intended to be committed to the pipeline's Git repository.
|
||||
|
||||
### 2. Semantic Versioning and Configuration
|
||||
|
||||
**Version Format**: MAJOR.MINOR.PATCH
|
||||
- **MAJOR**: Breaking changes to process signatures, inputs, or outputs
|
||||
- **MINOR**: New processes, backward-compatible enhancements
|
||||
- **PATCH**: Bug fixes, documentation updates
|
||||
|
||||
**Registry Configuration** (`nextflow.config`):
|
||||
```groovy
registry {
    url = 'https://registry.nextflow.io' // Default registry

    // Multiple registry URLs can be listed to resolve modules
    // across custom registries, e.g.
    // url = [ 'https://custom.registry.com', 'https://registry.nextflow.io' ]

    auth {
        'registry.nextflow.io' = '${NXF_REGISTRY_TOKEN}'
        'npm.myorg.com' = '${MYORG_TOKEN}'
    }
}
```
|
||||
|
||||
**Module Spec** (`meta.yml`):
|
||||
```yaml
name: nf-core/bwa-align
version: 1.2.4 # This module's version

requires:
  nextflow: ">=24.04.0"
```
|
||||
|
||||
**Version Constraints** (unified `name@constraint` syntax):
|
||||
- `name`: Any version (latest)
|
||||
- `name@1.2.3`: Exact version
|
||||
- `name@>=1.2.3`: Greater or equal
|
||||
- `name@>=1.2.3,<2.0.0`: Range (comma-separated)
|
||||
|
||||
**Version Notation Consistency**:
|
||||
|
||||
Modules use the same version constraint syntax already supported by both `nextflowVersion` and plugins:
|
||||
|
||||
| Notation | Meaning | nextflowVersion | Plugins | Modules |
| :---- | :---- | :---- | :---- | :---- |
| 1.2.3 | Exact version | ✓ | ✓ | ✓ |
| >=1.2.3 | Greater or equal | ✓ | ✓ | ✓ |
| <=1.2.3 | Less or equal | ✓ | ✓ | ✓ |
| >1.2.3 | Greater than | ✓ | ✓ | ✓ |
| <1.2.3 | Less than | ✓ | ✓ | ✓ |
| >=1.2, <2.0 | Range (comma) | ✓ | ✓ | ✓ |
| !=1.2.3 | Not equal | ✓ | - | - |
| 1.2+ | >=1.2.x <2.0 | ✓ | - | - |
| 1.2.+ | >=1.2.0 <1.3.0 | ✓ | - | - |
| ~1.2.3 | >=1.2.3 <1.3.0 | - | ✓ | - |
|
||||
|
||||
Using comparison operators (`>=`, `<`) with comma-separated ranges provides the same expressive power as npm-style `^` and `~` notation while maintaining consistency with existing Nextflow version constraint syntax. This avoids introducing new notation that would require additional parser support.
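
To illustrate how little parsing the comparison-operator notation needs, here is a minimal Python sketch that evaluates the module-supported forms (exact version, `>=`, `<=`, `>`, `<`, and comma-separated ranges). It is not the Nextflow implementation: the helper names are made up for the example, short versions such as `1.2` are padded to `1.2.0` as an assumption, and pre-release suffixes are not handled.

```python
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt, "": operator.eq}
CLAUSE = re.compile(r"^(>=|<=|>|<)?\s*(\d+(?:\.\d+)*)$")

def parse_version(text):
    """Turn '1.2.3' into (1, 2, 3); shorter versions are padded with zeros."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))

def satisfies(version, constraint):
    """True if `version` meets every comma-separated clause of `constraint`."""
    for clause in constraint.split(","):
        match = CLAUSE.match(clause.strip())
        if match is None:
            raise ValueError(f"unsupported constraint: {clause!r}")
        op, bound = match.group(1) or "", match.group(2)
        if not OPS[op](parse_version(version), parse_version(bound)):
            return False
    return True

def split_spec(spec):
    """Split `[scope/]name[@constraint]` into (name, constraint or None)."""
    name, _, constraint = spec.partition("@")
    return name, (constraint or None)
```

For example, `satisfies("1.2.4", ">=1.2.3,<2.0.0")` holds, while version `2.0.0` falls outside the same range.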
|
||||
|
||||
**Module Resolution**:
|
||||
|
||||
Installed module versions are inferred from the `meta.yml` file for each module in the `modules/` directory.
|
||||
|
||||
### 3. Unified Nextflow Registry
|
||||
|
||||
**Architecture Decision**: Extend existing Nextflow registry at `registry.nextflow.io` to host both plugins and modules.
|
||||
|
||||
**Current Plugin API** (reference: https://registry.nextflow.io/openapi/):
|
||||
```
GET  /api/v1/plugins                                            # List/search plugins
GET  /api/v1/plugins/{pluginId}                                 # Get plugin + all releases
GET  /api/v1/plugins/{pluginId}/{version}                       # Get specific release
GET  /api/v1/plugins/{pluginId}/{version}/download/{fileName}   # Download artifact
POST /api/v1/plugins/release                                    # Create draft release
POST /api/v1/plugins/release/{releaseId}/upload                 # Upload artifact
```
|
||||
|
||||
**Module API** (reference: https://github.com/seqeralabs/plugin-registry/pull/266):
|
||||
```
GET  /api/modules?query=<text>                  # Search modules (semantic search)
GET  /api/modules/{name}                        # Get module + latest release
GET  /api/modules/{name}/releases               # List all releases
GET  /api/modules/{name}/{version}              # Get specific release
GET  /api/modules/{name}/{version}/download     # Download module bundle
POST /api/modules/{name}                        # Publish module version (authenticated)
```
|
||||
|
||||
Note: The `{name}` parameter includes the namespace prefix (e.g., "nf-core/fastqc").
|
||||
|
||||
**Registry URL**: `registry.nextflow.io`
|
||||
|
||||
**Artifact Types**:
|
||||
- **Plugins**: JAR files with JSON metadata, resolved at startup
|
||||
- **Modules**: Source archives (.nf + meta.yml), resolved at parse time
|
||||
|
||||
**Benefits**:
|
||||
- Reuses existing infrastructure (HTTP service, S3 storage, authentication)
|
||||
- Consistent API patterns for both artifact types
|
||||
- Operational simplicity (one service vs. two)
|
||||
- Internal module API already partially implemented
|
||||
|
||||
### 4. First-Class CLI Support
|
||||
|
||||
**Commands**:
|
||||
```bash
nextflow module run scope/name       # Run a module directly without a wrapper script
nextflow module search <query>       # Search registry
nextflow module install scope/name   # Install a module
nextflow module list                 # Show installed vs configured
nextflow module remove scope/name    # Remove from config + local cache
nextflow module publish scope/name   # Publish to registry (requires api key)
```
|
||||
|
||||
**General Notes**:
|
||||
- All commands respect the `registry.url` configuration for custom registries
|
||||
|
||||
#### `nextflow module run scope/name`
|
||||
|
||||
Run a module directly without requiring a wrapper workflow script. This command enables standalone execution of any module by automatically mapping command-line arguments to the module's process inputs. If the module is not available locally, it is automatically installed before execution.
|
||||
|
||||
**Arguments**:
|
||||
- `scope/name`: Module identifier to run (required)
|
||||
|
||||
**Options**:
|
||||
- `-version <ver>`: Run a specific version (default: latest or configured version)
|
||||
- `--<input_name> <value>`: Map value to the corresponding module process input channel
|
||||
- All standard `nextflow run` options (e.g., `-profile`, `-work-dir`, `-resume`, etc.)
|
||||
|
||||
**Behavior**:
|
||||
1. Checks if module is installed locally; if not, downloads from registry
|
||||
2. Parses the module's `main.nf` to identify the main process and its input declarations
|
||||
3. Validates command-line arguments against the process input declarations
|
||||
4. Generates an implicit workflow that wires CLI arguments to process inputs
|
||||
5. Executes the workflow using standard Nextflow runtime
|
||||
|
||||
**Input Mapping**:
|
||||
- Named arguments (`--reads`, `--reference`) are mapped to corresponding process inputs
|
||||
- File paths are automatically converted to files for process file inputs
|
||||
- Multiple values can be provided for inputs expecting collections
|
||||
- Required inputs without defaults must be provided; optional inputs use declared defaults
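
The input-mapping rules can be sketched in Python as follows. This is a simplified model, not the actual CLI parser: it assumes that standard Nextflow options (such as `-profile`) have already been stripped from the argument list, it does not convert strings to files, and `REQUIRED`, `map_inputs`, and the `declared` mapping are names invented for the example.

```python
REQUIRED = object()  # sentinel for inputs that declare no default

def map_inputs(argv, declared):
    """Map `--name value [value ...]` arguments onto declared process inputs.

    `declared` maps each input name to its default, or to REQUIRED if it has none.
    """
    supplied = {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if not arg.startswith("--"):
            raise ValueError(f"unexpected argument: {arg}")
        name = arg[2:]
        if name not in declared:
            raise ValueError(f"unknown input: {name}")
        # Consume every following token up to the next `--option`.
        values = []
        i += 1
        while i < len(argv) and not argv[i].startswith("--"):
            values.append(argv[i])
            i += 1
        if not values:
            raise ValueError(f"missing value for input: {name}")
        # A single value stays scalar; several values become a collection.
        supplied[name] = values[0] if len(values) == 1 else values

    resolved = {}
    for name, default in declared.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif default is REQUIRED:
            raise ValueError(f"missing required input: {name}")
        else:
            resolved[name] = default
    return resolved
```

Validating against the declared inputs before building the implicit workflow lets unknown or missing arguments fail early, before any task is submitted.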
|
||||
|
||||
**Example**:
|
||||
```bash
# Run BWA alignment module with input files
nextflow module run nf-core/bwa-align \
    --reads 'samples/*_{1,2}.fastq.gz' \
    --reference genome.fa

# Run a specific version with Nextflow options
nextflow module run nf-core/fastqc -version 1.0.0 \
    --input 'data/*.fastq.gz' \
    -profile docker \
    -resume

# Run with work directory and output specification
nextflow module run nf-core/salmon \
    --reads reads.fq \
    --index salmon_index \
    -work-dir /tmp/work \
    -output-dir results/
```
|
||||
|
||||
---
|
||||
|
||||
#### `nextflow module search <query>`
|
||||
|
||||
Search the Nextflow registry for available modules matching the specified query. The search operates against module names, descriptions, tags, and author information. Results are displayed with module name, latest version, description, and download statistics.
|
||||
|
||||
**Arguments**:
|
||||
- `<query>`: Search term (required) - matches against module metadata
|
||||
|
||||
**Options**:
|
||||
- `-limit <n>`: Maximum number of results to return (default: 10)
|
||||
- `-json`: Output results in JSON format for programmatic use
|
||||
|
||||
**Example**:
|
||||
```bash
nextflow module search bwa
nextflow module search "alignment" -limit 50
```
|
||||
|
||||
---
|
||||
|
||||
#### `nextflow module install <scope/name>`
|
||||
|
||||
Download and install a module to the local `modules/` directory.
|
||||
|
||||
**Arguments**:
|
||||
- `<scope/name>`: Module identifier.
|
||||
|
||||
**Options**:
|
||||
- `-version <ver>`: Install a specific version (default: latest)
|
||||
- `-force`: Overwrite any local changes
|
||||
|
||||
**Behavior**:
|
||||
1. If `-version` not specified, queries registry for the latest available version
|
||||
2. Checks if local module exists and verifies integrity against `.module-info` file
|
||||
3. If local module is unmodified and version differs: replaces with requested version
|
||||
4. If local module was modified (checksum mismatch): warns and aborts unless `-force` is used
|
||||
5. Downloads the module archive from the registry
|
||||
6. Extracts to `modules/scope/name/` directory
|
||||
7. Stores `.module-info` file from registry's X-Checksum response header
|
||||
|
||||
**Example**:
|
||||
```bash
nextflow module install nf-core/bwa-align    # Install specific module (latest)
nextflow module install nf-core/salmon -version 1.2.0
```
|
||||
|
||||
---
|
||||
|
||||
#### `nextflow module list`
|
||||
|
||||
Display the status of all modules, comparing what is configured in `nextflow.config` against what is actually installed in the `modules/` directory.
|
||||
|
||||
**Options**:
|
||||
- `-json`: Output in JSON format
|
||||
- `-outdated`: Only show modules with available updates
|
||||
|
||||
**Output columns**:
|
||||
- Module name (`scope/name`)
|
||||
- Installed version (from `modules/scope/name/meta.yml`)
|
||||
- Latest available version (from registry)
|
||||
- Status indicator (up-to-date, outdated, missing)
|
||||
|
||||
**Example**:
|
||||
```bash
nextflow module list
nextflow module list -outdated
```
|
||||
|
||||
---
|
||||
|
||||
#### `nextflow module remove scope/name`
|
||||
|
||||
Remove a module from the local `modules/` directory.
|
||||
|
||||
**Arguments**:
|
||||
- `scope/name`: Module identifier to remove (required)
|
||||
|
||||
**Options**:
|
||||
- `-keep-files`: Remove `.module-info` file but keep local files
|
||||
|
||||
**Behavior**:
|
||||
1. Removes the module directory from `modules/scope/name/`
|
||||
|
||||
**Example**:
|
||||
```bash
nextflow module remove nf-core/bwa-align
nextflow module remove myorg/custom -keep-files
```
|
||||
|
||||
---
|
||||
|
||||
#### `nextflow module publish scope/name`
|
||||
|
||||
Publish a module to the Nextflow registry, making it available for others to install. Requires authentication via API key and appropriate permissions for the target scope.
|
||||
|
||||
**Arguments**:
|
||||
- `scope/name`: Module identifier to publish (required)
|
||||
|
||||
**Options**:
|
||||
- `-registry <url>`: Target registry URL (default: `registry.nextflow.io`)
|
||||
- `-tag <tag>`: Additional tags for discoverability
|
||||
- `-dry-run`: Validate without publishing
|
||||
|
||||
**Behavior**:
|
||||
1. Validates `meta.yml` schema and required fields (name, version, description)
|
||||
2. Verifies that `main.nf` exists and is valid Nextflow syntax
|
||||
3. Verifies that `README.md` documentation is present
|
||||
4. Authenticates with registry using configured credentials
|
||||
5. Creates a release draft and uploads the module archive
|
||||
6. Publishes the release, making it available for installation
|
||||
|
||||
**Requirements**:
|
||||
- Valid `meta.yml` with name, version, and description
|
||||
- `main.nf` entry point file
|
||||
- `README.md` documentation
|
||||
- Authentication token configured in `registry.auth` or `NXF_REGISTRY_TOKEN`
|
||||
- Write permission for the target scope
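
A pre-publish check covering the file, metadata, and token requirements listed above might look like the sketch below. The function name `check_publishable` and its arguments are hypothetical; scope permissions can only be verified by the registry and are not modelled here.

```python
from typing import Iterable, List, Mapping, Optional

REQUIRED_FILES = ("main.nf", "meta.yml", "README.md")
REQUIRED_FIELDS = ("name", "version", "description")


def check_publishable(files: Iterable[str],
                      meta: Mapping[str, object],
                      token: Optional[str]) -> List[str]:
    """Return a list of problems; an empty list means the local checks pass."""
    present = set(files)
    problems = [f"missing file: {name}" for name in REQUIRED_FILES if name not in present]
    problems += [f"missing meta.yml field: {key}" for key in REQUIRED_FIELDS if not meta.get(key)]
    if not token:
        # The token would come from `registry.auth` or NXF_REGISTRY_TOKEN.
        problems.append("no registry token configured")
    return problems
```

A `-dry-run` could, under these assumptions, print the returned problems and stop before any upload.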
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
nextflow module publish myorg/my-process
|
||||
nextflow module publish myorg/my-process -dry-run
|
||||
```
|
||||
|
||||
## Module Structure
|
||||
|
||||
**Directory Layout**:
|
||||
Everything within the module directory is included in the uploaded bundle, which should not exceed 1 MB (uncompressed). A typical layout looks like this:
|
||||
```
my-module/
├── main.nf       # Required: entry point for module
├── meta.yml      # Required: Module spec (version, metadata, I/O specs)
├── README.md     # Required: Module description
└── tests/        # Optional tests
```
|
||||
|
||||
**Module Spec extension** (`meta.yml`):
|
||||
```yaml
name: nf-core/bwa-align
version: 1.2.4          # This module's version
description: Align reads using BWA-MEM
authors:
  - nf-core community
license: MIT

requires:
  nextflow: ">=24.04.0"
```
|
||||
|
||||
**Local Storage Structure**:
|
||||
```
project-root/
├── nextflow.config
├── main.nf
└── modules/                        # Local module cache
    ├── nf-core/
    │   ├── bwa-align/
    │   │   ├── .module-info        # Cached registry checksum
    │   │   ├── meta.yml
    │   │   └── main.nf             # Required entry point
    │   └── samtools/view/
    │       ├── .module-info
    │       ├── meta.yml
    │       └── main.nf             # Required entry point
    └── myorg/
        └── custom-process/
            ├── .module-info
            ├── meta.yml
            └── main.nf             # Required entry point
```
|
||||
|
||||
**Module Integrity Verification**:
|
||||
- On install: `.module-info` file created from registry's X-Checksum response header
- On run: Local module checksum compared against `.module-info` file
  - If match: Proceed without network call
  - If mismatch: Report warning (module may have been locally modified)
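
The run-time comparison can be illustrated with a short sketch. It assumes, purely for illustration, that `.module-info` holds a single hex SHA-256 digest and that the digest covers the relative path and content of every other file in the module directory; the ADR does not define either detail, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path

INFO_FILE = ".module-info"


def module_checksum(module_dir: Path) -> str:
    """SHA-256 over relative paths and contents of all files except .module-info."""
    digest = hashlib.sha256()
    files = sorted(p for p in module_dir.rglob("*") if p.is_file())
    for path in files:
        if path.name == INFO_FILE:
            continue  # the stored checksum must not influence itself
        digest.update(path.relative_to(module_dir).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def is_unmodified(module_dir: Path) -> bool:
    """Compare the local checksum against the value recorded at install time."""
    recorded = (module_dir / INFO_FILE).read_text().strip()
    return module_checksum(module_dir) == recorded
```

Sorting the file list keeps the digest independent of filesystem iteration order, which matters if the same module is checked on different machines.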
|
||||
|
||||
## Implementation Strategy
|
||||
|
||||
**Phase 1**: Module schema, local module loading, validation tools
|
||||
|
||||
**Phase 2**: Extend Nextflow registry for modules, implement caching, add `install` and `search` commands
|
||||
|
||||
**Phase 3**: Extend DSL parser for `from module` syntax
|
||||
|
||||
**Phase 4**: Implement `publish` command with authentication and `run` command
|
||||
|
||||
**Phase 5**: Advanced features (search UI, language server integration, ontology validation)
|
||||
|
||||
## Technical Details
|
||||
|
||||
**Module Resolution Flow**:
|
||||
1. Parse `include` statements → extract module names (e.g., `nf-core/bwa-align`)
2. For each module:
   a. Check whether local `modules/scope/name/` exists
      - If exists → read installed version from `modules/scope/name/meta.yml`
      - If missing → download latest version from registry
   b. Verify local module integrity against `.module-info` file
      - Checksum mismatch → warn and do NOT override (local changes detected)
3. On download: store module to `modules/scope/name/` with `.module-info` file
4. Read `meta.yml` file → validate the Nextflow version requirement; fail if it is not fulfilled
5. Parse module's `main.nf` file → make processes available
|
||||
|
||||
**Security**:
|
||||
- SHA-256 checksum verification on download (stored in `.module-info` file)
|
||||
- Integrity verification on run (local checksum vs `.module-info` file)
|
||||
- Authentication required for publishing
|
||||
- Support for private registries
|
||||
|
||||
**Integration with Plugin System**:
|
||||
- Both plugins and modules query same registry
|
||||
- Single authentication system
|
||||
- Separate cache locations: `$NXF_HOME/plugins/` (global) vs `modules/` (per-project)
|
||||
|
||||
## Comparison: Plugins vs. Modules
|
||||
|
||||
| Aspect | Plugins | Modules |
|--------|---------|---------|
| Purpose | Extend runtime | Reusable processes |
| Format | JAR files | Source code (.nf) |
| Resolution | Startup | Parse time |
| Metadata | JSON spec | YAML spec |
| Naming | `nf-amazon` | `nf-core/salmon` |
| Cache Location | `$NXF_HOME/plugins/` | `modules/scope/name/` |
| Version Config | `plugins {}` in config | `meta.yml` in `modules/` directory |
| Registry Path | `/api/v1/plugins/` | `/api/modules/{name}` |
|
||||
|
||||
## Rationale
|
||||
|
||||
**Why unified registry?**
|
||||
- Reuses battle-tested infrastructure (HTTP API, S3, auth)
|
||||
- Single discovery experience for ecosystem
|
||||
- Lower operational overhead
|
||||
- Type-specific handling maintains separation of concerns
|
||||
|
||||
**Why infer versions from `meta.yml` instead of pinning in a separate file?**
|
||||
- Simple: install a version once and it is captured in the module files
|
||||
- Reproducibility via committing the `modules/` directory (including `meta.yml`) to the project git repository
|
||||
- Reduces configuration burden: no need to keep config in sync with installed state
|
||||
|
||||
**Why parse-time resolution?**
|
||||
- Modules are source code, not compiled artifacts
|
||||
- Allows inspection/modification for reproducibility
|
||||
- Enables dependency analysis before execution
|
||||
|
||||
**Why scoped modules?**
|
||||
- Organization namespacing prevents name collisions (`nf-core/salmon` vs `myorg/salmon`)
|
||||
- Clear ownership and provenance of modules
|
||||
- Supports private registries per scope
|
||||
- Industry-standard pattern (NPM, Terraform, others)
|
||||
- Enables ecosystem organization by maintainer/organization
|
||||
|
||||
**Why semantic versioning?**
|
||||
- Clear compatibility guarantees
|
||||
- Industry standard (npm, cargo, Go modules)
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Enables ecosystem-wide code reuse
|
||||
- Reproducible workflows via committing the `modules/` directory (including `meta.yml`) to the project git repository
|
||||
- Centralized discovery and distribution via unified registry
|
||||
- Minimal operational overhead (single registry for both plugins and modules)
|
||||
- Module scoping enables organization namespaces and private registries
|
||||
- Local `modules/` directory provides project isolation
|
||||
- No version duplication: installed `meta.yml` is the single source of truth
|
||||
- Simple module structure: each module has single `main.nf` entry point
|
||||
|
||||
**Negative**:
|
||||
- Registry becomes critical infrastructure (requires HA setup)
|
||||
- Type-specific handling adds registry complexity
|
||||
- Parse-time resolution adds latency to workflow startup
|
||||
- Local `modules/` directory duplicates storage across projects (unlike global cache)
|
||||
|
||||
**Neutral**:
|
||||
- Modules and plugins conceptually distinct but share infrastructure
|
||||
- Different resolution timing supported by same API
|
||||
|
||||
## Links
|
||||
|
||||
- Related: [Plugin Spec ADR](20250922-plugin-spec.md)
|
||||
- Inspired by: [Go Modules](https://go.dev/ref/mod), [npm](https://docs.npmjs.com), [Cargo](https://doc.rust-lang.org/cargo/)
|
||||
- Related: [nf-core modules](https://nf-co.re/modules)
|
||||
|
||||
---
|
||||
|
||||
## Appendix A: Module Schema Specification
|
||||
|
||||
This appendix defines the JSON schema for module `meta.yml` files. The schema maintains backward compatibility with existing nf-core module metadata patterns while supporting the new Nextflow module system features.
|
||||
|
||||
**Schema File:** [module-spec-schema.json](module-spec-schema.json)
|
||||
**Published URL:** `https://registry.nextflow.io/schemas/module-spec/v1.0.0`
|
||||
|
||||
### Field Reference
|
||||
|
||||
#### Core Fields (Existing nf-core Pattern)
|
||||
|
||||
These fields are already widely adopted in the nf-core community and remain fully supported:
|
||||
|
||||
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | Yes | Module identifier |
| `description` | string | Yes | Brief description of module functionality |
| `keywords` | array[string] | Recommended | Discovery and categorization keywords |
| `authors` | array[string] | Recommended | Original authors (GitHub handles) |
| `maintainers` | array[string] | Recommended | Current maintainers |
| `tools` | array[object] | Conditional | Software tools wrapped by the module |
| `input` | array/object | Recommended | Input channel specifications |
| `output` | object/array | Recommended | Output channel specifications |
|
||||
|
||||
#### Extension Fields (Nextflow Module System)
|
||||
|
||||
These fields extend the schema to support the new Nextflow module system:
|
||||
|
||||
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `version` | string | Registry | Semantic version (MAJOR.MINOR.PATCH) |
| `license` | string | Registry | SPDX license identifier for module code |
| `requires` | object | Optional | Runtime requirements |
| `requires.nextflow` | string | Optional | Nextflow version constraint |
|
||||
|
||||
### Detailed Field Specifications
|
||||
|
||||
#### `name`
|
||||
|
||||
The module name must be a fully qualified scoped identifier in `scope/name` format:
|
||||
|
||||
```yaml
|
||||
name: nf-core/fastqc
|
||||
name: nf-core/bwa-mem
|
||||
name: myorg/custom-aligner
|
||||
```
|
||||
|
||||
**Naming Rules:**
|
||||
- Format: `scope/name` (e.g., `nf-core/salmon`, `myorg/custom`)
|
||||
- Scope: lowercase alphanumeric with hyphens (organization/owner identifier)
|
||||
- Name: lowercase alphanumeric with underscores/hyphens (module identifier)
|
||||
- Pattern: `^[a-z0-9][a-z0-9-]*/[a-z][a-z0-9_-]*$`
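
The pattern above can be applied directly with a regular-expression engine. A minimal sketch, where `is_valid_module_name` is a hypothetical helper name:

```python
import re

# Pattern copied verbatim from the naming rules.
MODULE_NAME = re.compile(r"^[a-z0-9][a-z0-9-]*/[a-z][a-z0-9_-]*$")


def is_valid_module_name(name: str) -> bool:
    """True if `name` is a scoped identifier in `scope/name` format."""
    return MODULE_NAME.fullmatch(name) is not None
```

Under this pattern an unscoped name such as `bwa_mem` is rejected, as is any scope containing uppercase letters or any module name that starts with a digit.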
|
||||
|
||||
#### `version`
|
||||
|
||||
Semantic version following [SemVer 2.0.0](https://semver.org/):
|
||||
|
||||
```yaml
|
||||
version: "1.0.0"
|
||||
version: "2.3.1"
|
||||
version: "1.0.0-beta.1"
|
||||
```
|
||||
|
||||
**Version Semantics:**
|
||||
- **MAJOR:** Breaking changes to process signatures, inputs, or outputs
|
||||
- **MINOR:** New processes, backward-compatible enhancements
|
||||
- **PATCH:** Bug fixes, documentation updates
|
||||
|
||||
**Requirement:** Mandatory for registry-published modules (scoped names in `scope/name` format).
|
||||
|
||||
#### `requires`
|
||||
|
||||
Specifies runtime requirements for the module.
|
||||
|
||||
```yaml
requires:
  nextflow: ">=24.04.0"
```
|
||||
|
||||
**`requires.nextflow`** - Nextflow version constraint:
|
||||
```yaml
requires:
  nextflow: ">=24.04.0"             # minimum version
  # nextflow: ">=24.04.0,<25.0.0"   # alternative: version range
```
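
How such a constraint could be evaluated is sketched below. This is an assumption-laden illustration rather than the Nextflow implementation: only the `>=` and `<` operators and the comma separator appear in this document, so the other operators, the helper names, and the purely numeric version parsing are all hypothetical.

```python
import operator
import re
from typing import Tuple

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}
_CLAUSE = re.compile(r"\s*(>=|<=|>|<|==)?\s*([0-9]+(?:\.[0-9]+)*)\s*")


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '24.04.0' into (24, 4, 0); numeric components only."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, constraint: str) -> bool:
    """True if `version` meets every comma-separated clause of `constraint`."""
    current = parse_version(version)
    for clause in constraint.split(","):
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"unsupported constraint clause: {clause!r}")
        op = match.group(1) or "=="  # a bare version means exact match
        if not _OPS[op](current, parse_version(match.group(2))):
            return False
    return True
```

Version strings with suffixes (for example pre-release tags) are not handled by this sketch and would need a fuller version parser.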
|
||||
|
||||
#### `tools`
|
||||
|
||||
Documents the software tools wrapped by the module:
|
||||
|
||||
```yaml
tools:
  - bwa:
      description: BWA aligner
      homepage: http://bio-bwa.sourceforge.net/
      license: ["GPL-3.0-or-later"]
      identifier: biotools:bwa
```
|
||||
|
||||
**Tool Properties:**
|
||||
|
||||
| Property | Required | Description |
|----------|----------|-------------|
| `description` | Yes | Tool description |
| `homepage` | One of these | Tool homepage URL |
| `documentation` | One of these | Documentation URL |
| `tool_dev_url` | One of these | Development/source URL |
| `doi` | One of these | Publication DOI |
| `arxiv` | No | arXiv identifier |
| `license` | Recommended | SPDX license(s) |
| `identifier` | Recommended | bio.tools identifier |
| `manual` | No | User manual URL |
|
||||
|
||||
#### `input` and `output`
|
||||
|
||||
The schema supports a simplified syntax for inputs and outputs. For backwards compatibility, tooling such as the `module` CLI and the module registry supports both the module spec syntax and the nf-core syntax:
|
||||
|
||||
**Module Spec:**
|
||||
```yaml
input:
  - - name: meta
      type: map
      description: Sample metadata
    - name: reads
      type: file
      description: Input FastQ files
      ontologies:
        - edam: "http://edamontology.org/format_1930"
  - name: index
    type: directory
    description: Reference index

output:
  - - name: meta
      type: map
      description: Sample metadata
    - name: bam
      type: file
      description: Aligned BAM file
      pattern: "*.bam"
  - name: versions
    type: file
    description: Software versions
```
|
||||
|
||||
**nf-core module (legacy):**
|
||||
```yaml
input:
  - - meta:
        type: map
        description: Sample metadata
    - reads:
        type: file
        description: Input FastQ files
        ontologies:
          - edam: "http://edamontology.org/format_1930"
  - - index:
        type: directory
        description: Reference index

output:
  bam:
    - - meta:
          type: map
          description: Sample metadata
      - "*.bam":
          type: file
          description: Aligned BAM file
          pattern: "*.bam"
  versions:
    - versions.yml:
        type: file
        description: Software versions
```
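
The difference between the two input syntaxes is mostly mechanical: the legacy form keys each entry by its name, while the module spec form stores the name in a `name` property. The sketch below converts already-parsed legacy `input` data into the module spec shape. The function name `to_module_spec` is hypothetical, the nesting of channels is preserved as-is, and the legacy `output` section (keyed by channel name) would need additional handling that is not shown.

```python
from typing import Any


def to_module_spec(entry: Any) -> Any:
    """Convert legacy `{name: {props}}` entries into `{name: ..., **props}`."""
    if isinstance(entry, list):
        # A channel (or the whole input list): convert each element in place.
        return [to_module_spec(item) for item in entry]
    if not isinstance(entry, dict) or len(entry) != 1:
        raise ValueError(f"expected a single-key mapping, got: {entry!r}")
    name, props = next(iter(entry.items()))
    return {"name": name, **(props or {})}
```

Applied to the legacy `input` above, the `meta`, `reads`, and `index` entries each become a flat mapping whose first key is `name`.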
|
||||
|
||||
**Module Input/Output Properties:**
|
||||
|
||||
| Property | Type | Description |
|----------|------|-------------|
| `type` | string | Data type: `map`, `file`, `directory`, `string`, `integer`, `float`, `boolean`, `list`, `val` |
| `description` | string | Human-readable description |
| `pattern` | string | File glob pattern or value pattern |
| `optional` | boolean | Whether input is optional (default: false) |
| `default` | any | Default value if not provided |
| `enum` | array | List of allowed values |
| `ontologies` | array | EDAM or other ontology annotations |
|
||||
|
||||
### Migration Guide
|
||||
|
||||
#### From nf-core Module to Registry Module
|
||||
|
||||
**Before (nf-core local):**
|
||||
```yaml
|
||||
name: bwa_mem
|
||||
description: Align reads using BWA-MEM
|
||||
keywords:
|
||||
- alignment
|
||||
- bwa
|
||||
tools:
|
||||
- bwa:
|
||||
description: BWA software
|
||||
homepage: http://bio-bwa.sourceforge.net/
|
||||
license: ["GPL-3.0-or-later"]
|
||||
identifier: biotools:bwa
|
||||
authors:
|
||||
- "@drpatelh"
|
||||
maintainers:
|
||||
- "@drpatelh"
|
||||
input:
|
||||
# ... existing input spec
|
||||
output:
|
||||
# ... existing output spec
|
||||
```
|
||||
|
||||
**After (Registry-ready):**
|
||||
```yaml
|
||||
name: nf-core/bwa-mem # Added scope prefix
|
||||
version: "1.0.0" # Added version
|
||||
description: Align reads using BWA-MEM
|
||||
keywords:
|
||||
- alignment
|
||||
- bwa
|
||||
license: MIT # Added module license
|
||||
requires: # Added requirements
|
||||
nextflow: ">=24.04.0"
|
||||
tools:
|
||||
- bwa:
|
||||
description: BWA software
|
||||
homepage: http://bio-bwa.sourceforge.net/
|
||||
license: ["GPL-3.0-or-later"]
|
||||
identifier: biotools:bwa
|
||||
authors:
|
||||
- "@drpatelh"
|
||||
maintainers:
|
||||
- "@drpatelh"
|
||||
input:
|
||||
# ... modified as shown above
|
||||
output:
|
||||
# ... modified as shown above
|
||||
```
|
||||
|
||||
#### Schema Validation
|
||||
|
||||
Use the schema reference in your `meta.yml`:
|
||||
|
||||
```yaml
|
||||
# yaml-language-server: $schema=https://registry.nextflow.io/schemas/module-spec/v1.0.0
|
||||
|
||||
name: nf-core/my-module
|
||||
version: "1.0.0"
|
||||
# ...
|
||||
```
|
||||
|
||||
### Compatibility Matrix
|
||||
|
||||
| Feature | nf-core Current | Nextflow Module System |
|---------|-----------------|------------------------|
| Simple names | Yes | Yes (local only) |
| Scoped names | No | Yes (registry) |
| Version field | No | Yes (required for registry) |
| `tools` section | Yes | Yes |
| `components` | Yes | Deprecated |
| `requires` | No | Yes (Nextflow version constraint) |
| I/O specifications | Yes | Yes |
| Ontologies | Yes | Yes |
|
||||
|
||||
### Unsupported nf-core Attributes
|
||||
|
||||
The following attributes from the nf-core meta schema are **not supported** in the Nextflow module system:
|
||||
|
||||
| Attribute | Reason | Alternative |
|-----------|--------|-------------|
| `extra_args` | Not adopted in practice by nf-core modules | To be defined |
| `components` | No longer supported | Module dependencies are managed via `nextflow.config` |
|
||||
|
||||
### Complete Examples
|
||||
|
||||
#### Minimal nf-core Module
|
||||
|
||||
```yaml
|
||||
name: fastqc
|
||||
description: Run FastQC on sequenced reads
|
||||
keywords:
|
||||
- quality control
|
||||
- qc
|
||||
- fastq
|
||||
tools:
|
||||
- fastqc:
|
||||
description: FastQC quality metrics
|
||||
homepage: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
|
||||
license: ["GPL-2.0-only"]
|
||||
identifier: biotools:fastqc
|
||||
authors:
|
||||
- "@drpatelh"
|
||||
maintainers:
|
||||
- "@drpatelh"
|
||||
input:
|
||||
- - meta:
|
||||
type: map
|
||||
description: Metadata map
|
||||
- reads:
|
||||
type: file
|
||||
description: List of input FastQ files
|
||||
output:
|
||||
html:
|
||||
- "*.html":
|
||||
type: file
|
||||
description: FastQC HTML report
|
||||
versions:
|
||||
- versions.yml:
|
||||
type: file
|
||||
description: Software versions
|
||||
```
|
||||
|
||||
#### Full Registry Module
|
||||
|
||||
```yaml
|
||||
name: nf-core/bwa-align
|
||||
version: "1.2.4"
|
||||
description: Align reads to reference genome using BWA-MEM algorithm
|
||||
keywords:
|
||||
- alignment
|
||||
- mapping
|
||||
- bwa
|
||||
- bam
|
||||
- fastq
|
||||
license: MIT
|
||||
|
||||
requires:
|
||||
nextflow: ">=24.04.0"
|
||||
|
||||
tools:
|
||||
- bwa:
|
||||
description: |
|
||||
BWA is a software package for mapping DNA sequences
|
||||
against a large reference genome.
|
||||
homepage: http://bio-bwa.sourceforge.net/
|
||||
documentation: https://bio-bwa.sourceforge.net/bwa.shtml
|
||||
doi: 10.1093/bioinformatics/btp324
|
||||
license: ["GPL-3.0-or-later"]
|
||||
identifier: biotools:bwa
|
||||
|
||||
authors:
|
||||
- "@nf-core"
|
||||
maintainers:
|
||||
- "@drpatelh"
|
||||
- "@maxulysse"
|
||||
|
||||
input:
|
||||
- - name: meta
|
||||
type: map
|
||||
description: Sample metadata map (e.g., [ id:'sample1', single_end:false ])
|
||||
- name: reads
|
||||
type: file
|
||||
description: Input FastQ files
|
||||
ontologies:
|
||||
- edam: "http://edamontology.org/format_1930"
|
||||
- - name: meta2
|
||||
type: map
|
||||
description: Reference metadata
|
||||
- name: index
|
||||
type: directory
|
||||
description: BWA index directory
|
||||
ontologies:
|
||||
- edam: "http://edamontology.org/data_3210"
|
||||
|
||||
output:
|
||||
- - name: meta
|
||||
type: map
|
||||
description: Sample metadata
|
||||
- name: bam
|
||||
type: file
|
||||
description: Aligned BAM file
|
||||
pattern: "*.bam"
|
||||
ontologies:
|
||||
- edam: "http://edamontology.org/format_2572"
|
||||
- name: versions
|
||||
type: file
|
||||
description: Software versions
|
||||
pattern: "versions.yml"
|
||||
```
|
||||
|
||||
252
nextflow/adr/20251205-git-multi-revision-asset-management.md
Normal file
@@ -0,0 +1,252 @@
|
||||
# Git Multi-Revision Asset Management with Strategy Pattern
|
||||
|
||||
- Authors: Jorge Ejarque
|
||||
- Status: Approved
|
||||
- Deciders: Jorge Ejarque, Ben Sherman, Paolo Di Tommaso
|
||||
- Date: 2025-12-05
|
||||
- Tags: scm, asset-management, multi-revision
|
||||
|
||||
## Summary
|
||||
|
||||
Nextflow's asset management system has been refactored to support multiple revisions of the same pipeline concurrently through a bare repository approach with shared object storage, while maintaining backward compatibility with legacy direct-clone repositories using the Strategy design pattern.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The original asset management system (`AssetManager`) cloned each pipeline directly to `~/.nextflow/assets/<org>/<project>/.git`, creating several limitations:
|
||||
|
||||
1. **No concurrent Git multi-revision support**: Only one revision of a pipeline could be checked out at a time, preventing concurrent execution of different versions
|
||||
2. **Update conflicts**: Pulling updates while a pipeline was running could cause conflicts or corruption
|
||||
3. **Testing limitations**: Users couldn't easily test different versions of a pipeline side-by-side
|
||||
|
||||
The goal was to enable running multiple revisions of the same pipeline concurrently (e.g., production on v1.0, testing on v2.0-dev) while maintaining efficient disk usage through object sharing.
|
||||
|
||||
## Goals or Decision Drivers
|
||||
|
||||
- **Concurrent multi-revision execution**: Must support running different revisions of the same pipeline simultaneously
|
||||
- **Efficient disk usage**: Share Git objects between revisions to minimize storage overhead
|
||||
- **Backward compatibility**: Must not break existing pipelines using the legacy direct-clone approach
|
||||
- **API stability**: Maintain the existing `AssetManager` API for external consumers (K8s plugin, CLI commands, etc.)
|
||||
- **Minimal migration impact**: Existing repositories should continue to work without user intervention
|
||||
- **JGit compatibility**: Solution must work within JGit's capabilities to avoid relying on Git client installations
|
||||
- **Atomic updates**: Downloading new revisions should not interfere with running pipelines
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Migration of existing legacy repositories**: Legacy repos continue to work as-is; no forced migration
|
||||
- **Native Git worktree support**: Due to JGit limitations, not using Git's worktree feature
|
||||
- **Revision garbage collection**: No automatic cleanup of old revisions (users can manually drop)
|
||||
- **Multi-hub support**: Still tied to a single repository provider per pipeline
|
||||
|
||||
## Considered Options
|
||||
|
||||
### Option 1: Bare Repository with Git Worktrees
|
||||
|
||||
Use Git's worktree feature to create multiple working directories from a single bare repository.
|
||||
|
||||
**Implementation**:
|
||||
- One bare repository at `~/.nextflow/assets/<org>/<project>/.git`
|
||||
- Multiple worktrees at `~/.nextflow/assets/<org>/<project>/<revision>/`
|
||||
|
||||
- Good, because it's the native Git solution for multiple checkouts
|
||||
- Good, because worktrees are space-efficient
|
||||
- Good, because Git handles all the complexity
|
||||
- **Bad, because JGit doesn't support worktrees** (deal-breaker)
|
||||
- Bad, because requires native Git installation
|
||||
|
||||
**Decision**: Rejected due to JGit incompatibility
|
||||
|
||||
### Option 2: Bare Repository + Clones per Commit + Revision Map File
|
||||
|
||||
Use a bare repository for storage and create clones for each commit, tracking them in a separate file.
|
||||
|
||||
**Implementation**:
|
||||
- Bare repository at `~/.nextflow/assets/<org>/<project>/.nextflow/bare_repo/`
|
||||
- Clones at `~/.nextflow/assets/<org>/<project>/.nextflow/commits/<commit-sha>/`
|
||||
- Revision map file at `~/.nextflow/assets/<org>/<project>/.nextflow/revisions.json` mapping revision names to commit SHAs
|
||||
|
||||
- Good, because it works with JGit
|
||||
- Good, because commits are checked out from the local bare repo, reducing interactions with the remote repository
|
||||
- Good, because explicit revision tracking
|
||||
- Bad, because Git objects are replicated in every clone, increasing disk usage
|
||||
- Bad, because revision map file can become stale
|
||||
- Bad, because requires file I/O for every revision lookup
|
||||
- Bad, because potential race conditions on map file updates
|
||||
- Bad, because adds complexity of maintaining external state
|
||||
|
||||
**Decision**: Initially implemented but later refined
|
||||
|
||||
### Option 3: Bare Repository + Shared Clones with Strategy Pattern
|
||||
|
||||
Similar to Option 2 but eliminate the separate revision map file by using the bare repository itself as the source of truth. Additionally, use the Strategy pattern to maintain backward compatibility with existing legacy repositories without requiring migration.
|
||||
|
||||
**Implementation**:
|
||||
- Bare repository at `~/.nextflow/assets/.repos/<org>/<project>/bare/`
|
||||
- Shared clones at `~/.nextflow/assets/.repos/<org>/<project>/clones/<commit-sha>/`
|
||||
- Use bare repository refs to resolve revisions to commit SHAs dynamically
|
||||
- JGit alternates mechanism for object sharing
|
||||
- `AssetManager` as facade with unchanged public API
|
||||
- `RepositoryStrategy` interface defining repository operations
|
||||
- `LegacyRepositoryStrategy` for existing direct-clone behavior
|
||||
- `MultiRevisionRepositoryStrategy` for new bare-repo approach
|
||||
- Strategy selection based on environment variable or repository state detection
|
||||
|
||||
- Good, because no external state file to maintain
|
||||
- Good, because bare repository is always in sync (fetched on updates)
|
||||
- Good, because simpler and more reliable
|
||||
- Good, because atomic updates (Git operations are atomic)
|
||||
- Good, because works entirely within JGit
|
||||
- Good, because zero migration needed for existing repositories
|
||||
- Good, because maintains API compatibility
|
||||
- Good, because allows gradual adoption
|
||||
- Good, because isolates legacy code
|
||||
- Good, because makes future strategies easy to add
|
||||
- Neutral, because adds abstraction layer
|
||||
- Bad, because requires resolution on every access (minimal overhead)
|
||||
- Bad, because increases codebase size initially
|
||||
|
||||
**Decision**: Selected
|
||||
|
||||
## Solution or decision outcome
|
||||
|
||||
Implemented **Option 3 (Bare Repository + Shared Clones with Strategy Pattern)** for multi-revision support with backward compatibility. Multi-revision is the default for new repositories, while legacy mode remains available via the `NXF_SCM_LEGACY` environment variable.
|
||||
|
||||
## Rationale & discussion
|
||||
|
||||
### Git Multi-Revision Implementation
|
||||
|
||||
The bare repository approach provides efficient multi-revision support:
|
||||
|
||||
```
~/.nextflow/assets/.repos/nextflow-io/hello/
├── bare/                          # Bare repository (shared objects)
│   ├── objects/                   # All Git objects stored here
│   ├── refs/
│   │   ├── heads/
│   │   └── tags/
│   └── config
│
└── clones/                        # Revision-specific clones
    ├── abc123.../                 # Clone for commit abc123
    │   └── .git/
    │       ├── objects/           # (uses alternates → bare/objects)
    │       └── info/
    │           └── alternates     # Points to bare/objects
    │
    └── def456.../                 # Clone for commit def456
        └── .git/

~/.nextflow/assets/nextflow-io/hello/
└── .git/                          # Legacy repo location (HYBRID state)
```
|
||||
|
||||
**Key mechanisms:**
|
||||
|
||||
1. **Bare repository as source of truth**: The bare repo is fetched/updated from the remote, keeping refs current
|
||||
2. **Dynamic resolution**: Revisions (branch/tag names) are resolved to commit SHAs using the bare repo's refs
|
||||
3. **Object sharing**: Clones use Git alternates to reference the bare repo's objects, avoiding duplication
|
||||
4. **Atomic operations**: Each clone is independent; downloading a new revision doesn't affect existing ones
|
||||
5. **Lazy creation**: Clones are created on-demand when a specific revision is requested
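
The object-sharing mechanism (item 3) relies on Git's alternates file, a plain-text list of extra object directories. In stock Git this file lives at `objects/info/alternates` inside the Git directory. The sketch below only writes that file with the standard library; it does not show how the JGit-based implementation sets it up, and `link_to_bare` is a hypothetical helper name.

```python
from pathlib import Path


def link_to_bare(clone_git_dir: Path, bare_repo_dir: Path) -> Path:
    """Point a clone's object store at the bare repository's objects directory."""
    alternates = clone_git_dir / "objects" / "info" / "alternates"
    alternates.parent.mkdir(parents=True, exist_ok=True)
    # One absolute path per line; Git searches these directories for objects
    # that are missing from the clone's own object store.
    alternates.write_text(str((bare_repo_dir / "objects").resolve()) + "\n")
    return alternates
```

Because the clone only references the shared objects, deleting or corrupting the bare repository would break every clone that points at it, which is one reason to treat the bare repository as the source of truth.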
|
||||
|
||||
**Advantages over revision map file:**
|
||||
- No external state to maintain or keep in sync
|
||||
- Bare repo fetch automatically updates all refs
|
||||
- Resolution is simple: `bareRepo.resolve(revision)` returns commit SHA
|
||||
- No race conditions on file updates
|
||||
- Simpler code with fewer failure modes
|
||||
|
||||
### Strategy Pattern for Backward Compatibility
|
||||
|
||||
The Strategy pattern provides clean separation and backward compatibility:
|
||||
|
||||
```
┌─────────────────────────┐
│     AssetManager        │  ← Public API (unchanged)
│       (Facade)          │
└───────────┬─────────────┘
            │
            │ delegates to
            ▼
┌─────────────────────────┐
│   RepositoryStrategy    │  ← Interface
└───────────┬─────────────┘
            △
            │ implements
    ┌───────┴────────┐
    │                │
┌───────────┐  ┌─────────────────┐
│  Legacy   │  │ MultiRevision   │  ← Concrete strategies
│ Strategy  │  │   Strategy      │
└───────────┘  └─────────────────┘
```
|
||||
|
||||
**Strategy selection logic:**
|
||||
|
||||
1. Check `NXF_SCM_LEGACY` environment variable → Use legacy if set
|
||||
2. Check if there is only the legacy asset of the repository (`isOnlyLegacy` method) → Use legacy (preserve existing)
|
||||
3. Otherwise → Use multi-revision
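
The three selection rules can be written as a small function. This is a sketch, not the `AssetManager` code: `StrategyType`, `select_strategy`, and the two boolean flags are hypothetical names, and treating the value `true` as the switch for `NXF_SCM_LEGACY` is an assumption based on the opt-out example in this document.

```python
from enum import Enum
from typing import Mapping


class StrategyType(Enum):
    LEGACY = "legacy"
    MULTI_REVISION = "multi-revision"


def select_strategy(env: Mapping[str, str],
                    has_legacy: bool,
                    has_multi_revision: bool) -> StrategyType:
    """Pick a repository strategy following the three rules in order."""
    # Rule 1: explicit opt-out via the environment variable.
    if env.get("NXF_SCM_LEGACY", "").lower() == "true":
        return StrategyType.LEGACY
    # Rule 2: only the legacy asset exists, so preserve it.
    if has_legacy and not has_multi_revision:
        return StrategyType.LEGACY
    # Rule 3: new repositories and hybrid states use multi-revision.
    return StrategyType.MULTI_REVISION
```

With both layouts present (the hybrid state), the function returns `StrategyType.MULTI_REVISION` unless the environment variable forces legacy mode.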
|
||||
|
||||
|
||||
**Backward compatibility guarantees:**
|
||||
|
||||
- Existing repositories continue to work without changes
|
||||
- `AssetManager` API remains identical
|
||||
- CLI commands work with both strategies transparently
|
||||
- Tests pass with minimal modifications
|
||||
- No forced migration; users opt-in naturally when creating new repos
|
||||
|
||||
### Hybrid State Handling
|
||||
|
||||
The system gracefully handles hybrid states where both legacy and multi-revision repositories coexist:
|
||||
|
||||
- **Detection**: In hybrid states, the multi-revision strategy is selected by default
|
||||
- **Fallback logic**: Multi-revision strategy can fall back to legacy repo for operations if needed
|
||||
- **No conflicts**: Strategies are designed to coexist; operations target different directories
|
||||
- **Explicit control**: Users can force a specific strategy via `setStrategyType()` or `NXF_SCM_LEGACY` environment variable
|
||||
|
||||
### Migration Path
|
||||
|
||||
Users naturally migrate as they pull new revisions:
|
||||
|
||||
1. **Existing users**: Can continue with legacy repos (the legacy-only state is detected and the legacy strategy is kept)
|
||||
2. **New users**: Get multi-revision by default
|
||||
3. **Opt-in migration**: Delete the project directory to switch to multi-revision, or pull with `--migrate`
|
||||
4. **Opt-out**: Set `NXF_SCM_LEGACY=true` to force legacy mode
|
||||
|
||||
### Implementation Details
|
||||
|
||||
**Key classes:**
|
||||
|
||||
- `RepositoryStrategy`: Interface defining repository operations
|
||||
- `AbstractRepositoryStrategy`: Base class with shared helper methods
|
||||
- `LegacyRepositoryStrategy`: Direct clone implementation (original behavior)
|
||||
- `MultiRevisionRepositoryStrategy`: Bare repo + shared clones implementation
|
||||
|
||||
**Critical methods:**
|
||||
|
||||
- `download()`: Equivalent for both strategies (legacy pulls, multi-revision creates shared clone)
|
||||
- `getLocalPath()`: Returns appropriate working directory based on strategy
|
||||
- `getGit()`: Returns appropriate Git instance (legacy git, bare git, or commit git)
|
||||
|
||||
### Performance Characteristics
|
||||
|
||||
**Disk usage:**
|
||||
- Legacy: ~100% per repository (full clone with all git objects) + Worktree
|
||||
- Multi-revision: ~100% for bare + ~100K (.git with alternates) per revision + Worktree per revision
|
||||
|
||||
**Operation speed:**
|
||||
- First download: Similar (both clone from remote)
|
||||
- Additional revisions: Multi-revision faster (only fetches new objects once, creates cheap clones)
|
||||
- Switching revisions: Multi-revision instant (different directories), legacy requires checkout
|
||||
|
||||
### Known Limitations
|
||||
|
||||
- No automatic migration of legacy repositories
|
||||
- Bare repository overhead even for users who only need one revision
|
||||
- JGit alternates slightly more complex than worktrees
|
||||
- Manual cleanup required for old revision clones
|
||||
|
||||
## Links
|
||||
- [GitHub Issue #2870 - Multiple revisions of the same pipeline for concurrent execution](https://github.com/nextflow-io/nextflow/issues/2870)
|
||||
- [PR #6620 - Implementation of multiple revisions without revisions map](https://github.com/nextflow-io/nextflow/pull/6620)
|
||||
- Related PRs implementing the multi-revision approach (linked in #6620)
|
||||
|
||||
484
nextflow/adr/20260306-record-types.md
Normal file
@@ -0,0 +1,484 @@
|
||||
# Record types
|
||||
|
||||
- Authors: Ben Sherman
|
||||
- Status: accepted
|
||||
- Deciders: Ben Sherman, Paolo Di Tommaso
|
||||
- Date: 2026-03-06
|
||||
- Tags: lang, static-types
|
||||
|
||||
## Updates
|
||||
|
||||
### Version 1.1 (2026-03-23)
|
||||
|
||||
- Replaced inline record type syntax (`Record { ... }`) with destructuring syntax (`record(...)`) for better continuity with legacy syntax and record output syntax.
|
||||
|
||||
## Summary
|
||||
|
||||
Provide a way to model composite data types in the Nextflow language.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Nextflow pipelines typically need a way to model *aggregates*, or "data that travels together", such as a paired-end read consisting of a sample ID and two FASTQ files.
|
||||
|
||||
Primary use cases:
|
||||
|
||||
- Model complex pipeline inputs and outputs (e.g. samplesheets as collections of records)
|
||||
|
||||
- Generate JSON schemas for pipeline inputs and outputs from source code (e.g. to facilitate pipeline chaining with external validation)
|
||||
|
||||
- Model directory outputs as records, enabling more fine-grained validation
|
||||
|
||||
## Goals
|
||||
|
||||
- Introduce an alternative data structure to tuples that allows users to model their data domain more precisely.
|
||||
|
||||
- Introduce a way to validate custom data types at compile-time while keeping pipeline code concise and readable.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Removing support for tuples -- tuples should continue to work as before
|
||||
|
||||
- Introducing new dataflow operators or changing existing ones -- any changes to operators will be addressed in future efforts as needed
|
||||
|
||||
- Introducing type inheritance -- the Nextflow type system avoids type inheritance as much as possible in order to not introduce unnecessary complexity
|
||||
|
||||
- Introducing object methods -- this can be handled by standalone functions for now, and may be improved in a future effort (e.g. namespaces)
|
||||
|
||||
## Decision
|
||||
|
||||
Add support for **records**, an alternative data structure to tuples, and **record types**, a way to declare custom types that can be applied to records a la carte.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### Records
|
||||
|
||||
Records are an attempt to take the best qualities of maps and custom classes while avoiding the downsides.
|
||||
|
||||
A record can be created using the `record()` function:
|
||||
|
||||
```groovy
|
||||
sample = record(
|
||||
id: '1',
|
||||
fastq_1: file('1_1.fastq'),
|
||||
fastq_2: file('1_2.fastq')
|
||||
)
|
||||
|
||||
sample.id = '2' // error: record cannot be modified
|
||||
sample += record(id: '2') // ok
|
||||
|
||||
println sample.id
|
||||
```
|
||||
|
||||
This function effectively creates an immutable map (`Map<String,?>`):
|
||||
|
||||
- The keys are just field names
|
||||
- The values can have any type
|
||||
- The record can't be modified -- use the `+` operator instead
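
For readers more familiar with the JVM, the semantics listed above can be approximated with an unmodifiable map. This is a rough Java sketch, not the actual implementation: the `RecordSketch` class and its `record` and `plus` helpers are hypothetical names, and `plus` models the `+` operator under the assumption that fields from the right-hand operand win on conflict (as the `sample += record(id: '2')` example suggests).

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of record semantics using plain JDK collections.
final class RecordSketch {
    /** Creates an immutable record from a defensive copy of the given fields. */
    static Map<String, Object> record(Map<String, ?> fields) {
        Map<String, Object> copy = new LinkedHashMap<>(fields);
        return Collections.unmodifiableMap(copy);
    }

    /** Models `a + b`: a new record with the fields of both; `b` wins on conflicts. */
    static Map<String, Object> plus(Map<String, ?> a, Map<String, ?> b) {
        Map<String, Object> merged = new LinkedHashMap<>(a);
        merged.putAll(b);
        return Collections.unmodifiableMap(merged);
    }

    public static void main(String[] args) {
        Map<String, Object> sample = record(Map.of("id", "1"));
        Map<String, Object> updated = plus(sample, Map.of("id", "2"));
        System.out.println(sample.get("id") + " " + updated.get("id"));
        // sample.put("id", "3") would throw UnsupportedOperationException
    }
}
```

The original record is never changed; every "update" yields a new value, which is what makes records safe to share across concurrent dataflow logic.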
|
||||
|
||||
Records can have arbitrary fields, unlike custom classes, which makes them easy to use with dataflow operators.
|
||||
|
||||
For example, a future version of the `join` operator could join records as follows:
|
||||
|
||||
```groovy
|
||||
ch_bam = channel.of( record(id: '1', bam: file('1.bam')) )
|
||||
ch_bai = channel.of( record(id: '1', bai: file('1.bai')) )
|
||||
|
||||
ch_bam.join(ch_bai, by: 'id').view()
|
||||
|
||||
// -> record(id: '1', bam: file('1.bam'), bai: file('1.bai'))
|
||||
```
|
||||
|
||||
Whereas tuples are joined by a matching index, records would be joined by a matching key.
|
||||
|
||||
### Record types
|
||||
|
||||
A record type is a user-defined type that consists of a name and a set of fields:
|
||||
|
||||
```groovy
|
||||
record Sample {
|
||||
id: String
|
||||
fastq_1: Path
|
||||
fastq_2: Path?
|
||||
}
|
||||
```
|
||||
|
||||
Fields in a record type are declared the same way as [typed parameters](https://nextflow.io/docs/latest/workflow.html#typed-parameters). All [standard types](https://nextflow.io/docs/latest/reference/stdlib-types.html) can be used. Fields can be marked as optional by appending a `?` to the field type.
|
||||
|
||||
The purpose of a record type is to specify a *minimum set of requirements* for a record *in a particular context*. A record created with the `record()` function simply has the type `Record` -- it makes no guarantees about which fields it provides. A *record type* can be used (e.g. in a workflow input) to make a stronger guarantee.
|
||||
|
||||
For example:
|
||||
|
||||
```groovy
|
||||
workflow RNASEQ {
|
||||
take:
|
||||
samples: Channel<Sample>
|
||||
|
||||
main:
|
||||
// ...
|
||||
}
|
||||
|
||||
workflow {
|
||||
ch_samples = channel.of( record(id: '1', fastq_2: file('1_2.fastq')) )
|
||||
RNASEQ(ch_samples) // error: `ch_samples` is missing `fastq_1` field required by Sample
|
||||
}
|
||||
```
|
||||
|
||||
This workflow definition specifies that the `samples` input should be a channel of records, where each record has at minimum the fields specified by the `Sample` record type. The records can still have additional fields, but only the fields in `Sample` are guaranteed to be present.
|
||||
|
||||
In other words, records are *duck-typed*. Duck-typing semantics are used whenever a record is validated against a record type:
|
||||
|
||||
- Supplying a record argument to a workflow, process, or function with a record type input (as shown above)
|
||||
|
||||
- Casting a record to a record type (e.g. `record(...) as Sample`)
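
The duck-typing rule described above (required fields must be present with a compatible type, optional fields may be absent, extra fields are ignored) can be sketched in Java as a structural check. This is only an illustration of the rule: `DuckTyping`, `Field`, `sampleType`, and `conforms` are hypothetical names, and the check runs at runtime here, whereas the ADR targets compile-time validation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical structural check of a record against a record type.
final class DuckTyping {
    /** One field of a record type: the expected class and whether it is optional (`?`). */
    static final class Field {
        final Class<?> type;
        final boolean optional;
        Field(Class<?> type, boolean optional) { this.type = type; this.optional = optional; }
    }

    /** Mirrors the `Sample` record type: id: String, fastq_1: Path, fastq_2: Path? */
    static Map<String, Field> sampleType() {
        Map<String, Field> fields = new LinkedHashMap<>();
        fields.put("id", new Field(String.class, false));
        fields.put("fastq_1", new Field(Path.class, false));
        fields.put("fastq_2", new Field(Path.class, true));
        return fields;
    }

    /** True if every required field is present with a compatible value; extra fields are ignored. */
    static boolean conforms(Map<String, ?> rec, Map<String, Field> recordType) {
        for (Map.Entry<String, Field> e : recordType.entrySet()) {
            Object value = rec.get(e.getKey());
            Field field = e.getValue();
            if (value == null) {
                if (!field.optional) return false;      // missing required field
            } else if (!field.type.isInstance(value)) {
                return false;                           // wrong field type
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> rec = new LinkedHashMap<>();
        rec.put("id", "1");
        rec.put("fastq_2", Paths.get("1_2.fastq"));
        System.out.println(conforms(rec, sampleType())); // false: fastq_1 is missing
        rec.put("fastq_1", Paths.get("1_1.fastq"));
        rec.put("extra", 42);
        System.out.println(conforms(rec, sampleType())); // true: extra fields are allowed
    }
}
```

Note that the check never looks at the name of the record type, only at its fields, which is why two record types with identical fields are interchangeable.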
|
||||
|
||||
Record types can be included across modules:
|
||||
|
||||
```groovy
|
||||
include { Sample } from './module'
|
||||
```
|
||||
|
||||
Because of duck-typing, two record types with the same fields and field types are effectively equivalent, even if they have different names:
|
||||
|
||||
```groovy
|
||||
record Sample {
|
||||
id: String
|
||||
fastq_1: Path
|
||||
fastq_2: Path?
|
||||
}
|
||||
|
||||
record FastqPair {
|
||||
id: String
|
||||
fastq_1: Path
|
||||
fastq_2: Path?
|
||||
}
|
||||
```
|
||||
|
||||
This makes it easier to compose modules and workflows that use their own record types.
|
||||
|
||||
### Process inputs
|
||||
|
||||
When a record is supplied as input to a process, the process needs to know how to stage input files from the record, like it does with the `path` qualifier.
|
||||
|
||||
Typed processes can stage inputs using the `stage:` section, but ideally the files in a record should be automatically detected and staged.
|
||||
|
||||
A typed process can declare a record input using a record type:
|
||||
|
||||
```groovy
|
||||
process FASTQC {
|
||||
input:
|
||||
sample: FastqPair
|
||||
|
||||
// ...
|
||||
}
|
||||
|
||||
record FastqPair {
|
||||
id: String
|
||||
fastq_1: Path
|
||||
fastq_2: Path
|
||||
}
|
||||
```
|
||||
|
||||
All record fields that are a `Path` or `Path` collection (e.g. `Set<Path>`) are automatically staged. The record itself is declared in the process body as `sample`, like any other input, and record fields are accessed as `sample.id`, `sample.fastq_1`, and so on.
|
||||
|
||||
Alternatively, a typed process can declare a *destructured* record input:
|
||||
|
||||
```groovy
|
||||
process FASTQC {
|
||||
input:
|
||||
record(
|
||||
id: String,
|
||||
fastq_1: Path,
|
||||
fastq_2: Path
|
||||
)
|
||||
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
This approach allows record inputs to be declared without the need for external record types. Each record field is accessed directly as `id`, `fastq_1`, and so on.
|
||||
|
||||
### Process outputs
|
||||
|
||||
Typed processes can declare outputs with arbitrary expressions, so no new syntax is required to support record outputs. Simply use the `record()` function to create a record:
|
||||
|
||||
```groovy
|
||||
process FASTQC {
|
||||
// ...
|
||||
|
||||
output:
|
||||
record(
|
||||
id: id,
|
||||
fastqc: file('fastqc_logs')
|
||||
)
|
||||
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
The type of this output is an *implicit* record type that is inferred from the code: `Record { id: String ; fastqc: Path }`.
|
||||
|
||||
## Alternatives
|
||||
|
||||
### Custom classes
|
||||
|
||||
Define Groovy-style classes (see #2085) and use them to model composite data:
|
||||
|
||||
```groovy
|
||||
@nextflow.io.ValueObject
|
||||
class Sample {
|
||||
String id
|
||||
Path fastq_1
|
||||
Path fastq_2
|
||||
}
|
||||
|
||||
workflow {
|
||||
sample = new Sample('1', file('1_1.fastq'), file('1_2.fastq'))
|
||||
|
||||
println sample.id
|
||||
}
|
||||
```
|
||||
|
||||
This approach can be used, but in practice it requires a lot of extra dataflow logic around process calls to convert between custom types and tuples, because processes don’t know how to stage input files from custom types.
|
||||
|
||||
Custom classes are not very flexible. For example, joining two channels of custom classes would be more complicated than joining two tuples by a matching key, because you would need to define an additional class for the “joined” type and explicitly construct it from the two joining classes.
|
||||
|
||||
### Maps
|
||||
|
||||
Use maps to create composite data structures dynamically (see #2127):
|
||||
|
||||
```groovy
|
||||
sample = [
|
||||
id: '1',
|
||||
fastq_1: file('1_1.fastq'),
|
||||
fastq_2: file('1_2.fastq')
|
||||
]
|
||||
|
||||
println sample.id
|
||||
```
|
||||
|
||||
Maps are flexible because you can store arbitrary fields rather than being restricted to a fixed set of fields. However, maps are meant to be used for a single value type (e.g. `Map<String,Integer>`).
|
||||
|
||||
Unlike tuples, maps are mutable (i.e. they can be modified). Modifying maps can lead to race conditions if done improperly. As a best practice, maps should be modified by adding another map, which creates a copy:
|
||||
|
||||
```groovy
|
||||
sample2 = sample + [id: '2']
|
||||
|
||||
println sample.id // -> '1'
|
||||
println sample2.id // -> '2'
|
||||
```
|
||||
|
||||
### Inline record input type
|
||||
|
||||
A process can declare a destructured record input as shown above:
|
||||
|
||||
```groovy
|
||||
process FASTQC {
|
||||
input:
|
||||
record(
|
||||
id: String,
|
||||
fastq_1: Path,
|
||||
fastq_2: Path
|
||||
)
|
||||
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
One alternative is to declare an *inline record type*:
|
||||
|
||||
```groovy
|
||||
process FASTQC {
|
||||
input:
|
||||
sample: Record {
|
||||
id: String
|
||||
fastq_1: Path
|
||||
fastq_2: Path
|
||||
}
|
||||
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
This approach was considered because it uses the same syntax as a `record` definition, making it easy to switch between inline and external record types. The block syntax is also slightly better suited for a type definition since it doesn't require commas.
|
||||
|
||||
However, this approach creates an asymmetry between record inputs and outputs (`Record { ... }` vs `record(...)`). It also removes the ability to destructure a record input.
|
||||
|
||||
Declaring a record input with `record()` can be understood as a reverse constructor, mirroring the `record()` function used to construct a record output in the `output:` section.
|
||||
|
||||
While both approaches have pros and cons, the `record()` approach was ultimately chosen for its continuity with the existing tuple syntax and its similarity with the record output syntax.
|
||||
|
||||
### Implicit process record output
|
||||
|
||||
A process record output can be defined using the `record()` function as shown above:
|
||||
|
||||
```groovy
|
||||
process FASTQC {
|
||||
// ...
|
||||
|
||||
output:
|
||||
record(
|
||||
id: id,
|
||||
fastqc: file('fastqc_logs')
|
||||
)
|
||||
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
One alternative is to re-interpret the existing typed output syntax as an implicit record, treating each line as a record field:
|
||||
|
||||
```groovy
|
||||
process FASTQC {
|
||||
// ...
|
||||
|
||||
output:
|
||||
id: String = id
|
||||
fastqc: Path = file('fastqc_logs')
|
||||
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
This approach is syntactically more concise, and it re-uses the typed output syntax that was introduced in Nextflow 25.10.
|
||||
|
||||
However, with this approach, the same syntax can have different meanings depending on the surrounding context (e.g. presence/absence of the `nextflow.enable.types` feature flag), which can be confusing for both users and agents.
|
||||
|
||||
The `record()` approach works "out of the box", and it isn't much more verbose, so we decided that it is sufficient for now.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
|
||||
- Records replace positional elements with named fields, making data structures self-documenting and less error-prone (e.g. no more accidentally swapping `fastq_1` and `fastq_2` by position).
|
||||
|
||||
- Duck-typing makes module composition easier: downstream processes declare only the fields they need, and record types from different modules are interchangeable if they share the same fields.
|
||||
|
||||
- Immutability by default eliminates the race conditions that can occur when mutable maps are improperly used in workflow logic.
|
||||
|
||||
- A single record output replaces multiple per-file tuple output channels (as shown in the prokka example), reducing the total number of channels in a workflow.
|
||||
|
||||
- Record types provide a foundation for generating JSON schemas for pipeline inputs and outputs directly from source code, enabling external validation and pipeline chaining.
|
||||
|
||||
- Backward compatibility is fully preserved: existing tuple-based pipelines continue to work without modification.
|
||||
|
||||
**Negative:**
|
||||
|
||||
- Since records must match based on field name rather than element index, users must be careful to use consistent naming conventions or write additional adaptor logic in their workflow. Type checking for records will be essential to streamline the developer experience as much as possible.
|
||||
|
||||
- Tuples and records coexist as parallel data model options, which may cause confusion about which to use for a given situation. Guidelines will be needed to help users make the right choice.
|
||||
|
||||
- Dataflow operators such as `cross`, `groupTuple`, and `join` need to be updated to support records natively.
|
||||
|
||||
**Neutral:**
|
||||
|
||||
- Record types use structural (duck) typing rather than nominal typing. Two record types with identical fields are interchangeable regardless of their names. This is intentional and enables flexible module composition, but it differs from the nominal typing that most users encounter in other languages, so it may be surprising at first.
|
||||
|
||||
- Records have no methods; behavior must be expressed via standalone functions. This is consistent with the functional style of Nextflow pipelines, and may be improved in the future with namespaces.
|
||||
|
||||
## Links
|
||||
|
||||
- Community issues: #2085, #2127
|
||||
- Related nf-core discussion: https://github.com/nf-core/modules/issues/4311
|
||||
- Original implementation: #4553
|
||||
- nf-core/fetchngs POC: https://github.com/nf-core/fetchngs/pull/309
|
||||
- Inspired by: [Simple Made Easy](https://github.com/matthiasn/talk-transcripts/blob/master/Hickey_Rich/SimpleMadeEasy.md)
|
||||
- Type systems: [Nominal typing](https://en.wikipedia.org/wiki/Nominal_type_system) vs [Structural typing](https://en.wikipedia.org/wiki/Structural_type_system) vs [Duck typing](https://en.wikipedia.org/wiki/Duck_typing)
|
||||
|
||||
## Appendix
|
||||
|
||||
### Example: nf-core/prokka
|
||||
|
||||
The [nf-core/prokka](https://github.com/nf-core/modules/blob/master/modules/nf-core/prokka/main.nf) module produces several output files, emitting a tuple channel for each output file. If a downstream process requires multiple outputs, the individual output channels must be joined as needed to match the process input tuple.
|
||||
|
||||
Here is how the process might look using records:
|
||||
|
||||
```groovy
|
||||
process PROKKA {
|
||||
// ...
|
||||
|
||||
input:
|
||||
record(
|
||||
meta: Map,
|
||||
fasta: Path
|
||||
)
|
||||
proteins: Path
|
||||
prodigal_tf: Path
|
||||
|
||||
output:
|
||||
record(
|
||||
meta: meta,
|
||||
gff: file("${prefix}/*.gff"),
|
||||
gbk: file("${prefix}/*.gbk"),
|
||||
fna: file("${prefix}/*.fna"),
|
||||
faa: file("${prefix}/*.faa"),
|
||||
ffn: file("${prefix}/*.ffn"),
|
||||
sqn: file("${prefix}/*.sqn"),
|
||||
fsa: file("${prefix}/*.fsa"),
|
||||
tbl: file("${prefix}/*.tbl"),
|
||||
err: file("${prefix}/*.err"),
|
||||
log: file("${prefix}/*.log"),
|
||||
txt: file("${prefix}/*.txt"),
|
||||
tsv: file("${prefix}/*.tsv")
|
||||
)
|
||||
|
||||
topic:
|
||||
file("versions.yml") >> 'versions'
|
||||
|
||||
script:
|
||||
prefix = meta.id
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
The tuple input is refactored as a destructured record input. The tuple outputs are combined into a single record output. No external record types are needed, although they could be used if desired.
|
||||
|
||||
*NOTE:* The `meta` map has not been changed in this example for brevity. However, it could be modeled with a record type instead of the generic `Map` type, or it could even be replaced with explicit fields such as `id: String`.
|
||||
|
||||
Now, suppose there are two downstream processes that want to use the outputs of `PROKKA`:
|
||||
|
||||
1. Process `FOO` only needs the `gff` file
|
||||
2. Process `BAR` only needs the `fna`, `faa`, and `tbl` files
|
||||
|
||||
These processes would be defined as follows:
|
||||
|
||||
```groovy
|
||||
process FOO {
|
||||
|
||||
input:
|
||||
record(
|
||||
meta: Map,
|
||||
gff: Path
|
||||
)
|
||||
|
||||
// ...
|
||||
}
|
||||
|
||||
process BAR {
|
||||
|
||||
input:
|
||||
record(
|
||||
meta: Map,
|
||||
fna: Path,
|
||||
faa: Path,
|
||||
tbl: Path
|
||||
)
|
||||
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
And the calling workflow would be written as follows:
|
||||
|
||||
```groovy
|
||||
workflow {
|
||||
ch_inputs = channel.of( /* ... */ )
|
||||
proteins = // ...
|
||||
prodigal_tf = // ...
|
||||
ch_prokka = PROKKA( ch_inputs, proteins, prodigal_tf )
|
||||
|
||||
FOO(ch_prokka)
|
||||
BAR(ch_prokka)
|
||||
}
|
||||
```
|
||||
|
||||
Each process declares a record input containing only the fields that it needs. When the output of `PROKKA` is passed to `FOO` and `BAR`, each process stages only the files that it declared in the record input.
|
||||
136
nextflow/adr/20260310-seqera-dataset-filesystem.md
Normal file
@@ -0,0 +1,136 @@
|
||||
# NIO Filesystem for Seqera Platform Datasets
|
||||
|
||||
- Authors: Jorge Ejarque
|
||||
- Status: draft
|
||||
- Date: 2026-03-10
|
||||
- Tags: nio, filesystem, seqera, datasets, nf-tower
|
||||
|
||||
Technical Story: Enable Nextflow pipelines to read Seqera Platform datasets as ordinary file paths using `seqera://` URIs.
|
||||
|
||||
## Summary
|
||||
|
||||
Add a Java NIO `FileSystemProvider` to the `nf-tower` plugin that registers the `seqera://` scheme, allowing pipelines to reference Seqera Platform datasets (CSV/TSV) as standard file paths without manual download steps. The implementation reuses the existing `TowerClient` for all HTTP communication, inheriting authentication and retry behaviour.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Nextflow users managing datasets on the Seqera Platform must currently download dataset files manually or through custom scripts before referencing them in pipelines. There is no native integration between Nextflow's file abstraction and the Seqera Platform dataset API. This creates friction in workflows where datasets are the primary input and forces users to handle authentication, versioning, and file staging outside the pipeline definition.
|
||||
|
||||
## Goals or Decision Drivers
|
||||
|
||||
- Transparent access to Seqera Platform datasets using standard Nextflow file path syntax
|
||||
- Reuse of existing nf-tower plugin infrastructure (authentication, HTTP client, retry/backoff)
|
||||
- Hierarchical path browsing matching the platform's org/workspace/dataset structure
|
||||
- Extensible architecture that can support future Seqera-managed resource types (e.g. data-links)
|
||||
- No new plugin or module — feature lives within nf-tower
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Streaming large datasets — the Platform API does not support streaming; content is fully buffered on download
|
||||
- Implementing resource types beyond `datasets` — only the extensible architecture is required
|
||||
- Local caching across pipeline runs — Nextflow's standard task staging handles caching
|
||||
- Dataset management operations (delete, rename) — the filesystem is read-only in the initial implementation
|
||||
|
||||
## Considered Options
|
||||
|
||||
### Option 1: Standalone plugin with dedicated HTTP client
|
||||
|
||||
A new `nf-seqera-fs` plugin with its own HTTP client configuration and authentication setup.
|
||||
|
||||
- Good, because it isolates the filesystem code from the nf-tower plugin
|
||||
- Bad, because it duplicates authentication configuration and HTTP client setup
|
||||
- Bad, because two separate HTTP clients sharing a refresh token would corrupt each other's auth state
|
||||
|
||||
### Option 2: NIO filesystem within nf-tower using TowerClient delegation
|
||||
|
||||
Add the filesystem to nf-tower, delegating all HTTP through the existing `TowerClient` singleton via a typed `SeqeraDatasetClient` wrapper.
|
||||
|
||||
- Good, because it shares authentication and token refresh with TowerClient
|
||||
- Good, because it reuses existing retry/backoff configuration
|
||||
- Good, because no new dependencies are needed
|
||||
|
||||
### Option 3: Direct HxClient usage within nf-tower
|
||||
|
||||
Add the filesystem to nf-tower but use `HxClient` directly rather than going through TowerClient.
|
||||
|
||||
- Good, because it gives full control over request construction
|
||||
- Bad, because exposing HxClient internals couples the filesystem to implementation details
|
||||
- Bad, because token refresh coordination with TowerClient becomes manual
|
||||
|
||||
## Solution or decision outcome
|
||||
|
||||
Option 2 — NIO filesystem within nf-tower using TowerClient delegation. All HTTP calls go through `TowerClient.sendApiRequest()`, ensuring a single point of authentication and retry logic.
|
||||
|
||||
## Rationale & discussion
|
||||
|
||||
### Path Hierarchy
|
||||
|
||||
The `seqera://` path encodes the Platform's organizational structure directly:
|
||||
|
||||
```
|
||||
seqera:// → ROOT (directory, depth 0)
|
||||
└── <org>/ → ORGANIZATION (directory, depth 1)
|
||||
└── <workspace>/ → WORKSPACE (directory, depth 2)
|
||||
└── datasets/ → RESOURCE TYPE (directory, depth 3)
|
||||
└── <name>[@<version>] → DATASET (file, depth 4)
|
||||
```
|
||||
|
||||
Each level is a directory except the leaf dataset, which is a file. Version pinning uses an `@version` suffix on the dataset name segment (e.g. `seqera://acme/research/datasets/samples@2`). Without it, the latest non-disabled version is resolved.
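
To make the depth and `@version` rules concrete, here is a minimal Java sketch of how such a URI could be split into segments. It is not the actual `SeqeraPath` implementation: the `SeqeraPathSketch` class and its members are hypothetical, and splitting on the last `@` of the leaf segment is an assumption about how dataset names and versions are separated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for seqera:// URIs; illustrates depth and version pinning only.
final class SeqeraPathSketch {
    static final String PREFIX = "seqera://";

    final List<String> segments; // org, workspace, resource type, dataset name
    final String version;        // null when no @version suffix is given

    private SeqeraPathSketch(List<String> segments, String version) {
        this.segments = segments;
        this.version = version;
    }

    /** Depth 0 is the root; depth 4 is a dataset file. */
    int depth() { return segments.size(); }

    static SeqeraPathSketch parse(String uri) {
        if (!uri.startsWith(PREFIX))
            throw new IllegalArgumentException("Not a seqera:// URI: " + uri);
        List<String> segments = new ArrayList<>();
        for (String part : uri.substring(PREFIX.length()).split("/")) {
            if (!part.isEmpty()) segments.add(part);
        }
        if (segments.size() > 4)
            throw new IllegalArgumentException("Path too deep: " + uri);
        String version = null;
        if (segments.size() == 4) {
            String leaf = segments.get(3);
            int at = leaf.lastIndexOf('@');
            // Only treat '@' as a version separator when both sides are non-empty.
            if (at > 0 && at < leaf.length() - 1) {
                version = leaf.substring(at + 1);
                segments.set(3, leaf.substring(0, at));
            }
        }
        return new SeqeraPathSketch(segments, version);
    }

    public static void main(String[] args) {
        SeqeraPathSketch p = parse("seqera://acme/research/datasets/samples@2");
        System.out.println(p.segments + " version=" + p.version);
    }
}
```

A `null` version in this sketch corresponds to the "latest non-disabled version" case, which would be resolved later against the Platform API.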
|
||||
|
||||
### Name-to-ID Resolution
|
||||
|
||||
The path uses human-readable names but the Platform API requires numeric IDs. Resolution is built from two API calls at filesystem initialization:
|
||||
|
||||
1. `GET /user-info` → obtain `userId`
|
||||
2. `GET /user/{userId}/workspaces` → returns all accessible org/workspace pairs
|
||||
|
||||
This single source provides both directory listing content and name→ID mapping. Results are cached in `SeqeraFileSystem` with invalidation on write operations. `GET /orgs` is intentionally not used as it returns all platform orgs, not scoped to user membership.
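
As an illustration of how one workspace listing can serve both directory listing and name-to-ID lookup, consider the following Java sketch. The `WorkspaceIndex` class and its `Entry` type are hypothetical stand-ins for the cached data, not the plugin's actual DTOs, and the field names are assumptions.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory index built from one org/workspace listing.
final class WorkspaceIndex {
    /** One org/workspace pair as it might be returned by the workspace listing. */
    static final class Entry {
        final String orgName; final long orgId;
        final String workspaceName; final long workspaceId;
        Entry(String orgName, long orgId, String workspaceName, long workspaceId) {
            this.orgName = orgName; this.orgId = orgId;
            this.workspaceName = workspaceName; this.workspaceId = workspaceId;
        }
    }

    private final Map<String, Long> orgIds = new LinkedHashMap<>();
    private final Map<String, Map<String, Long>> workspaceIds = new LinkedHashMap<>();

    WorkspaceIndex(List<Entry> entries) {
        for (Entry e : entries) {
            orgIds.put(e.orgName, e.orgId);
            workspaceIds.computeIfAbsent(e.orgName, k -> new LinkedHashMap<>())
                        .put(e.workspaceName, e.workspaceId);
        }
    }

    /** Directory listing at depth 0: organization names. */
    Set<String> listOrgs() { return orgIds.keySet(); }

    /** Directory listing at depth 1: workspace names within an organization. */
    Set<String> listWorkspaces(String org) {
        return workspaceIds.getOrDefault(org, Collections.emptyMap()).keySet();
    }

    /** Name-to-ID lookup; returns null when the pair is unknown. */
    Long workspaceId(String org, String workspace) {
        return workspaceIds.getOrDefault(org, Collections.emptyMap()).get(workspace);
    }
}
```

Both the listing methods and the ID lookup read from the same maps, so a single cached response is enough to answer either kind of query.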
|
||||
|
||||
### Component Structure
|
||||
|
||||
```
|
||||
plugins/nf-tower/src/main/io/seqera/tower/plugin/
|
||||
├── fs/ ← NIO layer
|
||||
│ ├── SeqeraFileSystemProvider ← FileSystemProvider (scheme: "seqera")
|
||||
│ ├── SeqeraFileSystem ← FileSystem with org/workspace/dataset caches
|
||||
│ ├── SeqeraPath ← Path implementation (depth 0–4)
|
||||
│ ├── SeqeraFileAttributes ← BasicFileAttributes
|
||||
│ ├── SeqeraPathFactory ← PF4J FileSystemPathFactory extension
|
||||
│ └── DatasetInputStream ← SeekableByteChannel over InputStream
|
||||
├── dataset/ ← API client layer
|
||||
│ ├── SeqeraDatasetClient ← Typed HTTP client wrapping TowerClient
|
||||
│ ├── DatasetDto ← Dataset API response model
|
||||
│ ├── DatasetVersionDto ← Version API response model
|
||||
│ ├── OrgAndWorkspaceDto ← Org/workspace list model
|
||||
│ └── WorkspaceOrgDto ← Workspace/org mapping model
|
||||
└── resources/META-INF/services/
|
||||
└── java.nio.file.spi.FileSystemProvider
|
||||
```
|
||||
|
||||
### Key Design Decisions
|
||||
|
||||
1. **TowerClient delegation**: `SeqeraDatasetClient` delegates all HTTP through `TowerFactory.client()` → `TowerClient.sendApiRequest()`. This ensures shared authentication state and avoids the token refresh corruption that would occur with separate HTTP client instances.
|
||||
|
||||
2. **One filesystem per JVM**: `SeqeraFileSystemProvider` maintains a single `SeqeraFileSystem` keyed by scheme. This matches the `TowerClient` singleton-per-session pattern.
|
||||
|
||||
3. **Read-only initial scope**: The filesystem reports `isReadOnly()=true`. Write support (dataset upload via multipart POST) is deferred to a future iteration.
|
||||
|
||||
4. **Download filename constraint**: The Platform API's download endpoint (`GET /datasets/{id}/v/{version}/n/{fileName}`) requires the exact filename from upload time. The implementation always resolves `DatasetVersionDto.fileName` from `GET /datasets/{id}/versions` before constructing the download URL.
|
||||
|
||||
5. **Extensible resource types**: The path hierarchy reserves depth 3 for a resource type segment (currently only `datasets`). Adding support for data-links or other resource types requires only a new handler at the directory listing and I/O layers, with no changes to path resolution or authentication.
|
||||
|
||||
6. **Thread safety**: `SeqeraFileSystem` cache methods and `SeqeraFileSystemProvider` lifecycle methods are `synchronized`. The filesystem map uses `LinkedHashMap` with external synchronization rather than `ConcurrentHashMap`, matching the low-contention access pattern.
|
||||
|
||||
### Limitations
|
||||
|
||||
- **No size metadata**: `SeqeraFileAttributes.size()` returns 0 for all paths because the Platform API does not expose content length in dataset metadata.
|
||||
- **Single endpoint per JVM**: The filesystem key is scheme-only; concurrent access to different Platform endpoints in the same JVM is not supported.
|
||||
|
||||
### Streaming Downloads
|
||||
|
||||
Dataset downloads use `TowerClient.sendStreamingRequest()` which calls `HxClient.sendAsStream()` — the response body is returned as an `InputStream` streamed directly from the HTTP connection. This avoids the triple-buffering problem (`String` → `getBytes()` → `ByteArrayInputStream`) that would otherwise consume ~40 MB heap per 10 MB dataset. The `HxClient.sendAsStream()` method goes through the same `sendWithRetry()` path as `sendAsString()`, so retry logic and token refresh are preserved.
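
The difference between the buffered and streaming shapes can be sketched with plain JDK I/O. This Java sketch does not use `TowerClient` or `HxClient`; the `DownloadSketch` class and its methods are hypothetical and only contrast the two data paths.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical contrast between a fully buffered body and a chunked copy.
final class DownloadSketch {
    /** Buffered shape: the whole body exists as a String and again as a byte[] before reading starts. */
    static InputStream buffered(String body) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(bytes);
    }

    /** Streaming shape: bytes are copied in fixed-size chunks, so heap use does not grow with the body. */
    static long copy(InputStream in, OutputStream out) {
        byte[] chunk = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

In the buffered shape, memory use scales with the dataset size several times over; in the streaming shape, only the 8 KiB chunk is held at any moment, regardless of how large the response is.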
|
||||
|
||||
## Links
|
||||
|
||||
- [Spec](../specs/260310-seqera-dataset-fs/spec.md)
|
||||
- [Implementation plan](../specs/260310-seqera-dataset-fs/plan.md)
|
||||
- [Data model](../specs/260310-seqera-dataset-fs/data-model.md)
|
||||
555
nextflow/adr/20260310-typed-workflows.md
Normal file
@@ -0,0 +1,555 @@
|
||||
# Typed workflows
|
||||
|
||||
- Authors: Ben Sherman
|
||||
- Status: accepted
|
||||
- Date: 2026-03-10
|
||||
- Tags: lang, static-types, workflows
|
||||
|
||||
## Summary
|
||||
|
||||
Extend workflows and dataflow logic to provide first-class support for static typing and records.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Workflow logic in Nextflow consists of composing processes, channels, and *dataflow operators* (or just *operators*). Operators are essential for transforming, filtering, and combining channels to control the flow of data through a pipeline.

However, workflows were not originally designed with static typing in mind. The introduction of static typing throughout the rest of the language has revealed several gaps in the design of workflow logic.

### Operators

Many operators cannot be statically type-checked because they do not have well-defined argument types and return types. Operators such as `combine` and `join` are designed to work with tuples, but do not support records.

A broader issue is that the operator library is very large, which makes it difficult to find the right operator for a given situation. Several operators deal with additional concerns such as reading/writing specific data formats, which blurs the distinction between dataflow logic and the domain-specific aspects of a workflow. Several operators rely on the ordering of values in a channel, which can cause non-deterministic behavior and hinder reproducibility.

The need for static typing is also an opportunity to address these issues by encouraging the use of a core subset of operators that provide all necessary functionality and support static typing.

### Dataflow syntax

There are many different ways to express the same dataflow logic. Consider the following example:

```groovy
ch_input = channel.of('Hello', 'Hola', 'Ciao')

// alt 1
ch_input
    | GREET
    | map { v -> v.toUpperCase() }
    | view
    | set { ch_upper }

// alt 2
GREET(ch_input)
GREET.out
    .map { v -> v.toUpperCase() }
    .tap { ch_upper }
    .view()

// alt 3
ch_greet = GREET(ch_input)
ch_upper = ch_greet
    .map { v -> v.toUpperCase() }
    .view()
```

Here we see several syntax variants:

- Processes and operators can be composed with pipes (alt 1) or with method calls (alt 2, alt 3).

- Channels can be assigned using `set`, `tap`, or a regular assignment.

- Process outputs can be accessed using the `.out` property on the process name (alt 2) or by assignment (alt 3). The `.out` property can refer to a single output or a record of outputs, depending on the process definition.

Every syntax variant has a cost -- it makes code look less familiar to new users, it can cause counterproductive debates over which variant is "better", and it makes Nextflow code less consistent overall. Even if you stick to your preferred syntax, you still have to learn the other variants because you might encounter them when reading someone else's code.

Therefore, syntax sugar should be used judiciously -- it should provide some value that makes adding it worth the aforementioned cost. The variants shown in alt 1 and alt 2 do not add much value relative to their cost.

Even the pipe (`|`), which is loved by many users, can rarely be used in its ideal form because processes usually have additional arguments that can't be specified in a pipe chain.

## Goals

- Introduce first-class support for static typing and records with dataflow operators

- Encourage the use of a core set of operators (`map`, `filter`, `join`, etc)

- Discourage the use of non-deterministic operators (`buffer`, `distinct`, `first`, etc)

- Discourage the use of syntax variants that do not provide sufficient value to the language

## Non-goals

- Remove support for existing workflow syntax and semantics -- static typing should be opt-in

- Change the way that processes are called -- processes are still called directly with channels, preserving the common mental model of "processes connected by channels"

## Solution

Introduce **typed workflows**, which provide a streamlined syntax for workflows that supports static typing.

Typed workflows can be used with the `nextflow.enable.types` feature flag:

```groovy
// typed workflow
nextflow.enable.types = true

workflow HELLO {
    take:
    ch_names: Channel<String>

    main:
    ch_names.subscribe { name ->
        println "Hello, $name!"
    }
}
```

```groovy
// legacy workflow
workflow HELLO {
    take:
    ch_names

    main:
    ch_names.subscribe { name ->
        println "Hello, $name!"
    }
}
```

This flag behaves the same way for typed processes and typed workflows:

- The flag must be specified in every script that uses typed processes/workflows
- Typed processes/workflows cannot be mixed with legacy processes/workflows in the same script
- Typed and non-typed scripts can be used in the same pipeline

### Operators

The operator library is extended to support static typing and records:

- The `combine` and `join` operators are extended to support both tuples and records.

- The `groupBy` operator is introduced as a statically-typed replacement for `groupTuple`

All operators can be used with or without static typing, with some caveats:

- Some operators have stricter semantics when static typing is enabled via `nextflow.enable.types`. These changes are necessary in order to support static typing effectively. They should not affect the majority of existing code.

- Some operators are discouraged from use with static typing. While they can still be used, the type checker will not be able to validate them. Users should be encouraged to migrate away from them in favor of the *core operators* that are statically typed.

The accompanying reference documentation and best practices guide explain these updates in detail. Here we highlight the most important changes.

The *core operators* are:

- `collect`: collect the channel values into a collection (dataflow value)
- `combine`: emit the combinations of two channels
- `filter`: emit only the channel values that satisfy a condition
- `flatMap`: emit multiple values for each channel value with a closure
- `groupBy`: group channel values by a grouping key
- `join`: relational join of two channels based on a matching key
- `map`: transform each channel value with a closure
- `mix`: concatenate two channels
- `reduce`: reduce channel values into a single value with an accumulator
- `subscribe`: perform an action for each channel value
- `unique`: emit unique values
- `until`: emit each channel value until a stopping condition is satisfied
- `view`: print each channel value

The core operators provide a minimal subset (13 out of ~50) that covers practically all use cases and supports static typing. Encouraging this subset as best practice makes it easier to find the right operator for a given situation, while preserving existing code that uses other operators.

The *legacy operators* are:

| Operator | Problem | Migration strategy |
|---|---|---|
| `branch` | Redundant | Use `filter` and `map` for each branch instead |
| `buffer`, `collate` | Non-deterministic | Use `List::collate()` instead |
| `collectFile` | Not statically typed | Use `collect`, `groupBy`, and `Iterable::toSorted()` instead |
| `concat` | Redundant | Use `mix` instead |
| `count`, `max`, `min`, `sum` | Redundant, rarely used | Use `collect` and the corresponding `Iterable` method instead |
| `cross` | Redundant | Use `join` instead |
| `distinct` | Non-deterministic | Use `unique` instead |
| `dump` | Redundant | Use `view` with `tag` option instead |
| `first`, `last`, `take` | Non-deterministic | Use a list instead |
| `flatten` | Not statically typed | Use `flatMap` instead |
| `groupTuple` | Not statically typed | Use `groupBy` instead |
| `ifEmpty` | Not statically typed | Use `map` with `?:` instead |
| `merge` | Non-deterministic | Use `join` instead |
| `multiMap` | Redundant | Use `map` instead |
| `randomSample` | Non-deterministic | - |
| `set`, `tap` | Redundant | Use a regular assignment instead |
| `splitCsv`, `splitFasta`, `splitFastq`, `splitJson`, `splitText` | Not statically typed | Use `flatMap` with the equivalent `Path` method instead |
| `countCsv`, `countFasta`, `countFastq`, `countJson`, `countLines` | Not statically typed | Use `flatMap` with the equivalent `Path` method instead |
| `toDouble`, `toFloat`, `toInteger`, `toLong` | Redundant, rarely used | Use `map` and the corresponding `String` method instead |
| `toList` | Redundant | Use `collect` instead |
| `toSortedList` | Redundant | Use `collect` and `Iterable::toSorted()` instead |
| `transpose` | Not statically typed | Use `flatMap` instead |

In most cases, a legacy operator can be rewritten in terms of core operators and standard library functions. The accompanying best practices guide provides detailed examples for each operator. Since legacy operators can still be used in typed workflows, users can migrate away from legacy operators at their own pace.

### Fewer syntax variants

Typed workflows do not support the following syntax variants:

- Implicit `it` closure parameter → declare an explicit parameter instead
  - `it` can still be used as a variable name as long as it is explicitly declared

- Using `Channel` to access channel factories → use `channel` instead
  - `Channel` should be used only in type annotations

- Using `set` or `tap` to assign channels → use assignments instead

- Special dataflow operators `|` and `&` → use assignments and method calls instead
  - The equivalent bitwise operators are still allowed

- Using the `.out` property to access process and workflow outputs → use assignments instead

These restrictions are designed to make Nextflow code more consistent across the board and more familiar to users from other programming languages. Things like variable assignments and method calls in Nextflow look and feel the same as most other languages, whereas things like `set` assignments and the `.out` property make Nextflow code feel more unfamiliar without adding much value.

This aspect of the language is becoming more salient as code is increasingly read and written by AI agents. Agents need many examples of a programming language in order to use it effectively, so when a niche language has many syntax variants or syntax that deviates heavily from the common patterns used by other languages, it hurts the agent's ability to read and write code in that language.

## Distinguishing between typed and legacy workflows

Static typing has been introduced as multiple independent features:

- Typed parameters (`params` block)
- Typed outputs (`output` block)
- Typed processes
- Record types
- Typed workflows (this proposal)

This incremental approach was chosen in contrast to DSL2, which was a monolithic change that required an entire pipeline to be updated at once. With static typing, each new feature can be adopted independently of the others, rather than requiring all new features to be adopted at once (e.g. "DSL3").

Most of the features for static typing are new concepts that can be used alongside existing code. However, typed processes and typed workflows modify existing concepts (`process` and `workflow` definitions), so they require a feature flag.

The `nextflow.enable.types` feature flag will be used to distinguish between typed and legacy code, indefinitely. It would only be removed if the support for legacy syntax was removed, which is unlikely since DSL2 has been the standard Nextflow syntax for many years.

To help distinguish between typed and legacy workflows, the use of type annotations should be allowed only for typed workflows:

```groovy
// legacy workflow
workflow greet {
    take:
    greetings

    main:
    messages = greetings.map { v -> "$v world!" }

    emit:
    messages
}
```

```groovy
// typed workflow
nextflow.enable.types = true

workflow greet {
    take:
    greetings: Channel<String>

    main:
    messages = greetings.map { v -> "$v world!" }

    emit:
    messages: Channel<String>
}
```

## Interoperability between typed and legacy workflows

Typed and legacy workflows use different underlying dataflow types:

- **Legacy workflows (v1)** use raw GPars types: `DataflowBroadcast` (queue channel) and `DataflowVariable` (value channel).

- **Typed workflows (v2)** use wrapper types: `ChannelImpl` (wraps a `DataflowBroadcast`) and `ValueImpl` (wraps a `DataflowVariable`). These wrappers implement the new operators and integrate with the type system.

While a given script must be entirely typed or entirely legacy (controlled by the `nextflow.enable.types` flag), **typed and legacy workflows can call each other across different scripts**. This interoperability enables incremental migration -- individual scripts can be migrated to static typing without having to update the entire pipeline at once.

### Normalization at call sites

When a workflow calls another workflow, the Nextflow runtime automatically converts dataflow arguments and return values to the appropriate type for each side of the call site.

Normalization can occur in either direction:

- **v2 → v1 (unwrap)**: when passing typed channels to a legacy component, `ChannelImpl` / `ValueImpl` are unwrapped to the underlying `DataflowBroadcast` / `DataflowVariable`.

- **v1 → v2 (wrap)**: when passing legacy channels to a typed component, `DataflowBroadcast` / `DataflowVariable` are wrapped as `ChannelImpl` / `ValueImpl`.

The normalization is applied twice per call: once to the arguments (converted to match the *callee's* semantics), and once to the return value (converted to match the *caller's* semantics).
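
The wrap/unwrap rule can be pictured with a small Java model. This is only a sketch under stated assumptions: `RawChannel`, `TypedChannel`, and `normalize` are hypothetical stand-ins for `DataflowBroadcast`, `ChannelImpl`, and the runtime's conversion step. They are not the actual Nextflow classes or API.

```java
// Hypothetical model of call-site normalization; not the Nextflow runtime.
final class NormalizeSketch {

    /** Stand-in for a raw GPars channel such as DataflowBroadcast (v1). */
    static final class RawChannel {
        final String name;
        RawChannel(String name) { this.name = name; }
    }

    /** Stand-in for a typed wrapper such as ChannelImpl (v2). */
    static final class TypedChannel {
        final RawChannel target;
        TypedChannel(RawChannel target) { this.target = target; }
    }

    /** Converts a dataflow object to the form expected by one side of a call. */
    static Object normalize(Object value, boolean typedSide) {
        if (typedSide && value instanceof RawChannel)
            return new TypedChannel((RawChannel) value);   // v1 -> v2 (wrap)
        if (!typedSide && value instanceof TypedChannel)
            return ((TypedChannel) value).target;          // v2 -> v1 (unwrap)
        return value;                                      // already in the right form
    }

    public static void main(String[] args) {
        RawChannel reads = new RawChannel("reads");
        // Legacy caller -> typed callee: the argument is wrapped.
        Object arg = normalize(reads, true);
        // Typed callee -> legacy caller: the return value is unwrapped.
        Object ret = normalize(arg, false);
        System.out.println(arg.getClass().getSimpleName());
        System.out.println(ret == reads);
    }
}
```

In this model, wrapping never copies the channel: the wrapper only holds a reference, so unwrapping returns the same underlying object that was passed in.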

### Example: typed workflow calling a legacy workflow

**`legacy.nf`**
```groovy
workflow LEGACY_ALIGN {
    take:
    reads                   // DataflowBroadcast

    main:
    ALIGN(reads)

    emit:
    bam = ALIGN.out         // DataflowBroadcast
}
```

**`typed.nf`**
```groovy
nextflow.enable.types = true

include { LEGACY_ALIGN } from './legacy'

workflow {
    reads = channel.fromPath('*.fastq')     // ChannelImpl

    // `reads` is unwrapped to DataflowBroadcast when passed to LEGACY_ALIGN
    // The return value (DataflowBroadcast) is wrapped to ChannelImpl
    bam = LEGACY_ALIGN(reads)

    bam.view()                              // ChannelImpl
}
```

### Example: legacy workflow calling a typed workflow

**`typed.nf`**
```groovy
nextflow.enable.types = true

workflow TYPED_TRIM {
    take:
    reads: Channel<Record>

    main:
    ch_trimmed = TRIM(reads)

    emit:
    trimmed = ch_trimmed
}
```

**`legacy.nf`**
```groovy
include { TYPED_TRIM } from './typed'

workflow {
    reads = Channel.fromPath('*.fastq')     // DataflowBroadcast

    // `reads` is wrapped as ChannelImpl when passed to TYPED_TRIM
    // The return value (ChannelImpl) is unwrapped to DataflowBroadcast
    trimmed = TYPED_TRIM(reads)

    trimmed.view()                          // DataflowBroadcast
}
```

### Process and workflow outputs (`ChannelOut`)

Processes and workflows -- regardless of whether they are legacy or typed -- always return a `ChannelOut`, a specialized class that can contain one or more named outputs (`DataflowBroadcast` / `DataflowVariable`).

When a `ChannelOut` is returned to a typed workflow, it is normalized as follows:

- If the `ChannelOut` contains only one output, it is unwrapped to the underlying `DataflowBroadcast` / `DataflowVariable` and then wrapped as a `ChannelImpl` / `ValueImpl`.

- If the `ChannelOut` contains multiple outputs, it is converted to a record (`RecordMap`), where each named output is wrapped as a `ChannelImpl` / `ValueImpl`.

For example:

**`legacy.nf`**
```groovy
workflow LEGACY_QC {
    take:
    reads

    main:
    FASTQC(reads)
    MULTIQC(FASTQC.out)

    emit:
    fastqc = FASTQC.out     // DataflowBroadcast
    multiqc = MULTIQC.out   // DataflowBroadcast
}
```

**`typed.nf`**
```groovy
nextflow.enable.types = true

include { LEGACY_QC } from './legacy'

workflow {
    reads = channel.fromPath('*.fastq')

    // LEGACY_QC returns a ChannelOut with two outputs
    // which is converted to a record:
    // Record { fastqc: ChannelImpl ; multiqc: ChannelImpl }
    qc = LEGACY_QC(reads)

    // RecordMap provides same semantics as ChannelOut
    qc.fastqc.view()
    qc.multiqc.view()
}
```
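
The one-output versus many-outputs rule for `ChannelOut` can also be modelled in a few lines of Java. This is a rough sketch, not the runtime code: `Wrapped` and `normalizeOutputs` are hypothetical names, a plain `Map` stands in for `RecordMap`, and the case of zero outputs is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of how a result is presented to a typed caller.
final class ChannelOutSketch {

    /** Stand-in for ChannelImpl / ValueImpl: a typed wrapper around a raw output. */
    static final class Wrapped {
        final Object raw;
        Wrapped(Object raw) { this.raw = raw; }
    }

    /**
     * One output   -> a single wrapped channel.
     * Many outputs -> a record (modelled as a Map) of wrapped channels keyed by output name.
     */
    static Object normalizeOutputs(Map<String, Object> outputs) {
        if (outputs.size() == 1)
            return new Wrapped(outputs.values().iterator().next());
        Map<String, Wrapped> record = new LinkedHashMap<>();
        outputs.forEach((name, raw) -> record.put(name, new Wrapped(raw)));
        return record;
    }
}
```

Under this model, the caller of a two-output workflow receives a record with one entry per named output, which mirrors the `qc.fastqc` and `qc.multiqc` accesses in the Nextflow example.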

## Alternatives

### Processes in operator closures

A process call is essentially a task function wrapped in a `map` operation. But processes are called directly on channels, which has a few implications:

- The true structure of process calls is somewhat obscured
- Process calls can have a different return type (`Channel` or `Value`) depending on how they are called
- Processes can only be called as a `map` operation, not with other operators like `reduce`
- Processes cannot be chained like operator calls (without additional syntax like `|`)

These limitations could be addressed by calling processes in operator closures instead of calling them directly with channels:

```groovy
ch_samples = channel.of(...)
fasta = file(...)

// before
SALMON(ch_samples, fasta)

// after
ch_samples.map { sample -> SALMON(sample, fasta) }
```

Where `SALMON` is defined as follows:

```groovy
process SALMON {
    input:
    record(
        id: String,
        fastq: Path
    )
    fasta: Path

    // ...
}
```

This syntax brings a number of benefits:

- The `map` operation is explicit, making process calls consistent with other operator logic
- The process call matches the process definition -- it accepts and returns regular values, not channels or dataflow values
- Processes can be chained without needing a pipe syntax (e.g. `ch.map(FOO).map(BAR).map(BAZ) ...`)
- Processes could theoretically be composed with other operators (e.g. an iterative process with the `reduce` operator)
- The closure around the process call can be used to handle process inputs and outputs without additional operator calls

Decoupling the process lifecycle from an implicit `map` operation, however, breaks a key assumption of the Nextflow runtime:

- Handling a process call in an arbitrary closure instead of as an implicit `map` operation is significantly more complex, and would likely require new language semantics and compiler transformations to implement.
- Alternatively, the compiler could restrict such closures to specific patterns (e.g. a single process call with some statements before and after), but this would add complexity for developer experience (i.e. having to remember which patterns are allowed in which cases).

Additionally, some of the problems that motivated this approach have been addressed by type checking and records:

- The type checker can infer the return type of a direct process call from the call arguments (e.g. `Channel` vs `Value`)
- Records and record types provide additional flexibility that eliminates much of the adaptor logic that was required between tuple channels and processes

Ultimately, this change would mostly be a cosmetic syntax improvement that would do little to improve the developer experience, but would introduce a great deal of complexity to the compiler and runtime. It would also be a significant break from the way that Nextflow workflows have been written since the introduction of DSL2.

### Processes as operator closures

A moderated version of calling processes in operator closures is to call them *as* operator closures:

```groovy
ch_samples = channel.of(...)
fasta = file(...)

// before
SALMON(ch_samples, fasta)

// after
ch_samples.map(SALMON, index: fasta)
```

The process name takes the place of the `map` closure. The channel calling `map` is supplied as the first process input, and any additional inputs are supplied as named arguments to `map`.

This approach avoids much of the aforementioned complexity risk while retaining many of the benefits.

For example, process calls can be chained with other operator calls:

```groovy
ch_input
    .map(GREET, greeting: "Hello")
    .map { v -> v.toUpperCase() }
    .view()
```

And processes can be called with other operators such as `reduce`:

```groovy
process ACCUMULATE {
    input:
    result: Path
    input: Path

    script:
    """
    cat ${input} >> ${result}
    """

    output:
    file('result.txt')
}

workflow {
    channel.fromPath("*.txt").reduce(ACCUMULATE).view()
}
```

This particular pattern was proposed as a cleaner alternative to the experimental [recursion](https://nextflow.io/docs/latest/workflow.html#process-and-workflow-recursion) feature. As long as the process matches the signature of the accumulator closure (two inputs and one output), the process can be executed iteratively.

While this approach avoids most of the potential complexity that would be required to call processes in operator closures, it is still a significant syntax change with dubious relative benefit.

Investigating these approaches revealed an important trade-off -- Nextflow sacrifices a small amount of syntactic precision in order to make process calls prominent in the workflow logic. While calling processes in an operator would be more correct and provide some additional flexibility (e.g. using processes with other operators), it would make workflows feel much more like "operators that call processes in closures" instead of "processes connected by channels".

The reality is that most Nextflow users think of their pipelines as "processes connected by channels", and operator logic is a minor detail at best and a confusing distraction at worst. While we can and should make channel operators as simple and pleasant to use as possible, it should be in service of making them less prominent in the language, not more.

### Implicit dataflow values

Dataflow values (a.k.a. *value channels*) are analogous to Futures or Promises in other languages. For example, given a `CompletableFuture` in Java, you can either call `get()` to await the value or `thenAccept()` / `thenApply()` to invoke a callback when the value is ready.
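
For readers less familiar with the Java side of this analogy, the two styles look roughly as follows. This is a minimal sketch: `FutureSketch` and its method names are made up for illustration, and the asynchronously produced list only loosely plays the role of a `collect()` result.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

final class FutureSketch {

    /** Produces a list asynchronously, loosely analogous to a dataflow value. */
    static CompletableFuture<List<Integer>> collected() {
        return CompletableFuture.supplyAsync(() -> List.of(1, 2, 3));
    }

    /** "Await" style: block the calling thread until the value is available. */
    static boolean moreThanTwoBlocking() {
        try {
            return collected().get().size() > 2;
        }
        catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Callback style: describe what to do with the value once it is ready. */
    static CompletableFuture<Boolean> moreThanTwoAsync() {
        return collected().thenApply(vals -> vals.size() > 2);
    }
}
```

The blocking variant returns a plain `boolean` that can be used in an `if` statement, whereas the callback variant returns another future. Dataflow values only offer the second style.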

Dataflow values can similarly call `subscribe` or `map`, but it is not possible to "await" a dataflow value directly. For example, it is not possible to use a dataflow value in an `if` statement:

```groovy
vals = channel.of(1..10).collect()
if( vals.size() > 2 )
    println 'More than two!'
```

Instead, you must use `subscribe` to act on the value asynchronously:

```groovy
vals = channel.of(1..10).collect()
vals.subscribe { _vals ->
    if( _vals.size() > 2 )
        println 'More than two!'
}
```

This is a common frustration for many users, that dataflow values don't quite work like regular values, even though it seems like they should.

A solution could be to make dataflow values *implicit* -- users would use them like regular values (i.e. the first example above) and the compiler would translate the user's code into explicit dataflow logic (i.e. the second example).

To do this, the compiler would need to:

1. distinguish implicit dataflow values from regular values via type inference (e.g. the result of a `collect` operator),

2. wrap downstream code in `map` and/or `subscribe` operators as needed to produce the desired dataflow logic.

In the end, however, this change does not seem worthwhile:

- It makes type inference an essential part of the compilation process rather than an optional enhancement.

- The above example is simple to understand, but it is easy to construct more complicated examples that quickly cast doubt on whether the compiler could solve this problem in general.

- Even if there is a general solution, any mistake or edge case would likely lead to unexpected behavior that would be extremely difficult to debug (e.g. a low-level compiler error, compiled code that is silently incorrect).

Additionally, most of the problems that motivated this idea have been effectively solved by type checking:

- There is now an explicit `Value` type which allows both developers and the type checker to distinguish between channels and dataflow values.

- While users still can't use a dataflow value in an `if` statement, they can get clear and early feedback on whether their code is valid, which is what ultimately matters.

- Being transparent about regular values vs dataflow values in the language may be for the best anyway -- it provides a clear picture of how things are working "under the hood", and it is still far simpler than the async programming models employed by most languages.
181
nextflow/adr/20260322-platform-oidc-pkce-login.md
Normal file
@@ -0,0 +1,181 @@

# Replace Auth0 with Platform OIDC PKCE for Nextflow CLI login

- Authors: Paolo Di Tommaso
- Status: draft
- Date: 2026-03-22
- Tags: auth, oidc, pkce, nextflow

## Summary

Replace the Auth0 Device Authorization Grant used by `nextflow auth login` with an OAuth2 Authorization Code + PKCE flow against Seqera Platform itself acting as the OIDC identity provider. Implement the reusable OIDC PKCE client as `lib-platform-oidc` in libseqera.

## Problem Statement

The `nextflow auth login` command authenticates via Auth0, an external identity provider, using the OAuth2 Device Authorization Grant. This requires hardcoded Auth0 domain and client ID mappings per environment (dev, stage, prod) and creates an artificial distinction between "cloud" and "enterprise" endpoints: cloud uses Auth0 while enterprise falls back to manual PAT entry.

Seqera Platform now exposes a standards-compliant OIDC provider at `/.well-known/openid-configuration`, making it possible for CLI clients to authenticate directly against Platform using Authorization Code + PKCE, eliminating the Auth0 dependency and unifying the login flow for all Platform instances.

## Goals

- Single login flow for all Platform endpoints (cloud and enterprise)
- Eliminate Auth0 dependency and hardcoded domain/clientId mappings
- Deprecate `TOWER_AUTH_DOMAIN` and `TOWER_AUTH_CLIENT_ID` env vars
- Reusable OIDC PKCE library in libseqera for other CLI tools
- No changes to PAT generation, storage, runtime usage, refresh, or logout

## Non-goals

- Changing how PATs are generated or stored after login
- Modifying runtime token refresh (`TowerXAuth`, `TowerClient`, `WaveClient`)
- Removing backward compatibility with existing PATs in config

Note: The logout flow was simplified as a consequence of removing the cloud-vs-enterprise distinction: PAT deletion via Platform API is now always attempted for all endpoints.

## Login Flow

```
Nextflow CLI    Browser                                         Platform
│                                                                   │
│ GET /.well-known/openid-configuration                             │
│──────────────────────────────────────────────────────────────────>│
│<──────────── { authorization_endpoint, token_endpoint } ──────────│
│                                                                   │
│ [generate code_verifier, code_challenge, state]                   │
│ [start local HTTP server on 127.0.0.1:PORT]                       │
│                                                                   │
│ open browser ───>│                                                │
│                  │ GET /authorize?                                │
│                  │   client_id=nextflow_cli                       │
│                  │   &response_type=code                          │
│                  │   &scope=openid+profile+email+offline_access   │
│                  │   &redirect_uri=http://127.0.0.1:PORT/callback │
│                  │   &state=<random>                              │
│                  │   &code_challenge=<S256(verifier)>             │
│                  │   &code_challenge_method=S256                  │
│                  │───────────────────────────────────────────────>│
│                  │                                                │
│                  │ (user authenticates on Platform)               │
│                  │                                                │
│                  │<── redirect to 127.0.0.1:PORT/callback ────────│
│                  │    ?code=<auth_code>&state=<state>             │
│                  │                                                │
│ [callback server receives code, validates state]                  │
│ [returns HTML "Login successful" to browser]                      │
│                                                                   │
│ POST /token                                                       │
│   grant_type=authorization_code                                   │
│   &client_id=nextflow_cli                                         │
│   &code=<auth_code>                                               │
│   &code_verifier=<verifier>                                       │
│   &redirect_uri=http://127.0.0.1:PORT/callback                    │
│──────────────────────────────────────────────────────────────────>│
│<──────────── { access_token, refresh_token } ─────────────────────│
│                                                                   │
│ GET /user-info (Authorization: Bearer <access_token>)             │
│──────────────────────────────────────────────────────────────────>│
│<──────────────────── { user info } ───────────────────────────────│
│                                                                   │
│ POST /tokens (generate PAT, same as current flow)                 │
│──────────────────────────────────────────────────────────────────>│
│<──────────────────── { accessKey: <PAT> } ────────────────────────│
│                                                                   │
│ [save PAT to ~/.nextflow/seqera-auth.config]                      │
```
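
The authorization request in the diagram could be assembled along these lines. This is a sketch only: `AuthorizeUrlSketch` and `build` are hypothetical names, and the authorization endpoint passed in is assumed to come from the discovery document rather than being hardcoded.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class AuthorizeUrlSketch {

    /** Builds the browser URL for the authorization request shown in the diagram. */
    static String build(String authorizationEndpoint, String clientId, int port,
                        String state, String codeChallenge) {
        // LinkedHashMap keeps the parameters in the order they are added.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("client_id", clientId);
        params.put("response_type", "code");
        params.put("scope", "openid profile email offline_access");
        params.put("redirect_uri", "http://127.0.0.1:" + port + "/callback");
        params.put("state", state);
        params.put("code_challenge", codeChallenge);
        params.put("code_challenge_method", "S256");

        // URLEncoder produces form encoding, so spaces in the scope become '+'.
        String query = params.entrySet().stream()
            .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
        return authorizationEndpoint + "?" + query;
    }
}
```

The `redirect_uri` must carry the actual port of the local callback server, so the URL can only be built after that server has started.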

## Solution

### 1. Register `nextflow_cli` client in Platform

**File:** `platform/tower-config/src/main/resources/application-oauth-client.yml`

```yaml
- client-id: "nextflow_cli"
  client-name: "Nextflow CLI"
  client-type: NATIVE
  client-secret: null
  application-type: "native"
  token-endpoint-auth-method: "none"
  id-token-signed-response-alg: "RS256"
  redirect-uris:
    - "http://127.0.0.1"
  allowed-flows:
    - "authorization_code_pkce"
  allowed-scopes:
    - "openid"
    - "profile"
    - "email"
    - "offline_access"
  third-party: false
```

### 2. New module `lib-platform-oidc` in libseqera

Plain Java production code, Groovy/Spock tests. No external dependencies: the module uses only JDK classes (`java.net.http.HttpClient`, `com.sun.net.httpserver.HttpServer`, `java.security.*`).

```
lib-platform-oidc/
  build.gradle
  VERSION
  changelog.txt
  src/
    main/java/io/seqera/platform/auth/oidc/
      OidcConfig.java             # authorization_endpoint + token_endpoint
      PkceChallenge.java          # code_verifier + code_challenge
      PkceUtil.java               # PKCE generation helpers
      OidcDiscovery.java          # GET /.well-known/openid-configuration
      OidcCallbackServer.java     # Local HTTP server on 127.0.0.1:0
      OidcTokenExchange.java      # POST token endpoint for code exchange
      OidcLoginFlow.java          # Orchestrator
    test/groovy/io/seqera/platform/auth/oidc/
      PkceUtilTest.groovy
      OidcDiscoveryTest.groovy
      OidcCallbackServerTest.groovy
      OidcTokenExchangeTest.groovy
      OidcLoginFlowTest.groovy
```
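
To illustrate what a JDK-only callback server along the lines of `OidcCallbackServer` might involve, here is a hedged sketch. `CallbackServerSketch` and its methods are hypothetical and are not the module's actual code. It binds to `127.0.0.1` on an ephemeral port, checks the echoed `state`, and hands the authorization code back through a `CompletableFuture`.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

final class CallbackServerSketch {

    /** Parses "a=1&b=2" into a map, URL-decoding each value. */
    static Map<String, String> parseQuery(String rawQuery) {
        Map<String, String> result = new HashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) return result;
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            result.put(key, URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return result;
    }

    /** Returns the authorization code only if the echoed state matches the one sent. */
    static String extractCode(String rawQuery, String expectedState) {
        Map<String, String> params = parseQuery(rawQuery);
        if (!expectedState.equals(params.get("state")))
            throw new IllegalStateException("State mismatch in OIDC callback");
        String code = params.get("code");
        if (code == null)
            throw new IllegalStateException("Missing authorization code in OIDC callback");
        return code;
    }

    /** Starts a loopback server on an ephemeral port; `code` completes on the first callback. */
    static HttpServer start(String expectedState, CompletableFuture<String> code) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/callback", exchange -> {
            String body;
            int status;
            try {
                code.complete(extractCode(exchange.getRequestURI().getRawQuery(), expectedState));
                body = "<html><body>Login successful</body></html>";
                status = 200;
            }
            catch (IllegalStateException e) {
                code.completeExceptionally(e);
                body = "<html><body>Login failed</body></html>";
                status = 400;
            }
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, bytes.length);
            exchange.getResponseBody().write(bytes);
            exchange.close();
        });
        server.start();
        return server;  // server.getAddress().getPort() gives the port for redirect_uri
    }
}
```

Binding to port `0` lets the operating system pick a free port, which is why the registered redirect URI above lists only `http://127.0.0.1` without a port.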

**`OidcLoginFlow`** public API:
```java
public class OidcLoginFlow {
    public OidcLoginFlow(String endpoint, String clientId) { ... }
    public String login(Consumer<String> browserLauncher) throws Exception { ... }
}
```

Flow: OIDC discovery → generate PKCE → start callback server (ephemeral port) → invoke `browserLauncher` with authorization URL → wait for callback → exchange code for tokens → return `access_token`.
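
The "generate PKCE" step follows RFC 7636: the `code_verifier` is a random URL-safe string, and the S256 `code_challenge` is the base64url-encoded SHA-256 digest of the verifier, without padding. A minimal sketch of such helpers is shown below. `PkceSketch` is a hypothetical name and not necessarily how `PkceUtil` is written.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Base64;

final class PkceSketch {

    private static final Base64.Encoder URL_ENCODER = Base64.getUrlEncoder().withoutPadding();

    /** Random code_verifier: 32 random bytes encode to 43 base64url characters. */
    static String newVerifier() {
        byte[] bytes = new byte[32];
        new SecureRandom().nextBytes(bytes);
        return URL_ENCODER.encodeToString(bytes);
    }

    /** S256 code_challenge = BASE64URL(SHA-256(ASCII(code_verifier))). */
    static String challengeFor(String verifier) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(verifier.getBytes(StandardCharsets.US_ASCII));
            return URL_ENCODER.encodeToString(digest);
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The verifier stays on the client and is only sent in the token request. The authorization request carries the challenge, so an intercepted authorization code cannot be redeemed without the verifier.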
|
||||
|
||||
The `browserLauncher` callback delegates browser-opening to the caller since it's platform-specific.
|
||||
|
||||
### 3. Modify `AuthCommandImpl.login()` in Nextflow
|
||||
|
||||
**File:** `nextflow/plugins/nf-tower/src/main/io/seqera/tower/plugin/auth/AuthCommandImpl.groovy`
|
||||
|
||||
Replace the cloud-vs-enterprise branching with a single OIDC flow:
|
||||
|
||||
```groovy
// Was: getCloudEndpointInfo → performAuth0Login / handleEnterpriseAuth
// Now:
performOidcLogin(apiUrl)
```
|
||||
|
||||
`performOidcLogin()` creates an `OidcLoginFlow`, gets the OAuth access token, then follows the existing post-auth steps unchanged: `getUserInfo()` → `generatePAT()` → `saveAuthToConfig()` → `config()`.
|
||||
|
||||
Remove: `performAuth0Login()`, `requestDeviceAuthorization()`, `pollForDeviceToken()`, `performAuth0Request()`, `handleEnterpriseAuth()`, `promptPAT()`, `getCloudEndpointInfo()`.
|
||||
|
||||
### 4. Deprecate Auth0 mappings in PlatformHelper
|
||||
|
||||
**File:** `nextflow/modules/nextflow/src/main/groovy/nextflow/platform/PlatformHelper.groovy`
|
||||
|
||||
Add `@Deprecated` to `getAuthDomain()` and `getAuthClientId()`.
|
||||
|
||||
## Verification
|
||||
|
||||
1. `./gradlew :lib-platform-oidc:test` in libseqera
|
||||
2. `./gradlew :plugins:nf-tower:test` in Nextflow
|
||||
3. Manual: `./launch.sh auth login -url https://api.cloud.dev-seqera.io` — browser opens, PKCE flow completes, PAT stored in config
|
||||
4. Backward compat: existing PAT in `seqera-auth.config` continues to work
|
||||
|
||||
## Links
|
||||
|
||||
- [Platform OIDC provider PR](https://github.com/seqeralabs/platform/pull/10336)
|
||||
- [OAuth client examples PR](https://github.com/seqeralabs/platform/pull/10473)
|
||||
180
nextflow/adr/20260323-hints-process-directive.md
Normal file
@@ -0,0 +1,180 @@
|
||||
# `hints` process directive for executor-specific scheduling hints
|
||||
|
||||
- Authors: Rob Syme
|
||||
- Status: accepted
|
||||
- Deciders: Paolo Di Tommaso, Ben Sherman, Rob Syme
|
||||
- Date: 2026-03-23
|
||||
- Tags: directive, executor, scheduling
|
||||
|
||||
## Summary
|
||||
|
||||
Introduce a `hints` process directive for executor-specific scheduling hints that don't map to existing directives.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Many executors can be configured in various ways on a per-task basis. For example:
|
||||
|
||||
- AWS Batch jobs can use *consumable resources* to limit concurrent job execution based on non-standard resources such as software license seats.
|
||||
|
||||
- Google Batch jobs can specify a *provisioning model* to control the use of spot vs on-demand VMs on a per-task basis.
|
||||
|
||||
- Seqera Scheduler supports a variety of resource and scheduling settings, including spot/on-demand provisioning.
|
||||
|
||||
These settings can be exposed by Nextflow as executor-specific config options, such as `google.batch.spot`, but config options are applied globally. In order to apply a setting to specific processes or tasks, it must be exposed as a process directive.
|
||||
|
||||
Process directives in Nextflow aim to provide a common vocabulary for executing tasks in many different environments. Directives such as `cpus`, `memory`, and `time` have broadly the same meaning across most executors, making it easier for users to write portable pipelines.
|
||||
|
||||
At the same time, many executors have custom settings not shared by other executors, and it is not practical to create a new process directive for every new setting. There are over 40 [process directives](https://docs.seqera.io/nextflow/reference/process#directives) at the time of writing, and every new directive adds cognitive load when a user is trying to find the right directive for a given situation.
|
||||
|
||||
There exist a few generic process directives already:
|
||||
|
||||
- The `clusterOptions` directive can be used to specify command-line arguments, primarily for HPC schedulers
|
||||
- The `ext` directive supports arbitrary key-values, but is designed primarily to customize the task script (e.g. tool arguments), not executor behavior
|
||||
- The `resourceLabels` directive also supports arbitrary key-values, but is intended for tagging and tracking resources, not controlling them
|
||||
|
||||
A new directive is needed to support executor-specific settings at a per-task level in a structured manner, without bloating the process directives for every new custom setting.
|
||||
|
||||
## Goals
|
||||
|
||||
- Provide a way to apply executor-specific settings to individual processes or tasks
|
||||
|
||||
- Avoid the proliferation of narrow, executor-specific directives (e.g. `consumableResources`, `schedulingPolicy`, etc.)
|
||||
|
||||
- Provide a single extension point that executors can consume selectively
|
||||
|
||||
- Allow settings to be specified as key-values, providing validation where possible
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Replacing existing directives (`cpus`, `memory`, `accelerator`, `queue`) — those remain the right place for standard resources
|
||||
|
||||
## Decision
|
||||
|
||||
Introduce a `hints` process directive with namespaced keys. Executors consume the hints they understand and silently ignore the rest.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### Syntax
|
||||
|
||||
The `hints` directive accepts a map of key-value pairs:
|
||||
|
||||
```groovy
// process definition
process runDragen {
    cpus 4
    memory '16 GB'
    hints consumableResources: ['my-dragen-license': 1, 'other-license': 2]

    script:
    """
    dragen --ref-dir /ref ...
    """
}
```
|
||||
|
||||
```groovy
// process config
process {
    withName: 'runDragen' {
        hints = [
            consumableResources: ['my-dragen-license': 1, 'other-license': 2]
        ]
    }
}
```
|
||||
|
||||
Keys are strings. Values may be any raw data type: strings, numbers, booleans, lists, or maps. Executors are responsible for defining which hints they recognize and what value type each hint expects.
|
||||
|
||||
In the above example, the `consumableResources` hint is given as a map of resource name to quantity. The AWS Batch executor supplies it to each job request using `ConsumableResourceProperties`.
|
||||
|
||||
### Namespacing
|
||||
|
||||
Keys can use dot-separated scopes to namespace settings as needed:
|
||||
|
||||
```groovy
hints consumableResources: ['my-dragen-license': 1]
hints 'scheduling.priority': 10
hints 'scheduling.provisioningModel': 'spot'
```
|
||||
|
||||
Keys can be routed to specific executors by prefixing with the executor name and a slash (`/`):
|
||||
|
||||
```groovy
hints 'awsbatch/consumableResources': ['my-dragen-license': 1]
hints 'seqera/scheduling.provisioningModel': 'spot'
hints 'k8s/nodeSelector': 'gpu=true'
```
|
||||
|
||||
The executor prefix gives pipeline developers the ability to target specific executors and have assurance that it won't accidentally apply to other executors (e.g. if another executor adds support for the same hint in the future).
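
The routing rule can be sketched as a small filter that each executor applies to the hints map. This is an illustrative sketch only: the class `HintRouting` and the method `hintsFor` are hypothetical names, and letting a prefixed hint take precedence over an unprefixed hint with the same key is an assumption that this ADR does not specify.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of executor-prefix routing; not Nextflow's actual implementation.
public class HintRouting {

    /** Returns the hints visible to the given executor, with any "executor/" prefix stripped. */
    public static Map<String, Object> hintsFor(String executor, Map<String, Object> hints) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : hints.entrySet()) {
            String key = entry.getKey();
            int slash = key.indexOf('/');
            if (slash < 0) {
                // Unprefixed hint: offered to every executor.
                // Assumption: it never overrides a prefixed hint with the same name.
                result.putIfAbsent(key, entry.getValue());
            } else if (key.substring(0, slash).equals(executor)) {
                // Prefixed hint addressed to this executor.
                result.put(key.substring(slash + 1), entry.getValue());
            }
            // Hints prefixed for other executors are skipped.
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> hints = new LinkedHashMap<>();
        hints.put("awsbatch/consumableResources", Map.of("my-dragen-license", 1));
        hints.put("seqera/scheduling.provisioningModel", "spot");
        hints.put("scheduling.priority", 10);
        System.out.println(hintsFor("awsbatch", hints).keySet());
    }
}
```

An executor would still ignore any unprefixed key it does not recognize, as stated in the Decision section; the filter only decides which keys it gets to see.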
|
||||
|
||||
### Validation
|
||||
|
||||
Nextflow should validate hints to the best of its ability, to catch errors such as typos:
|
||||
|
||||
- **Prefixed hints** can be validated against the set of hints declared by the corresponding executor. Unrecognized hints should be reported as errors.
|
||||
|
||||
- **Unprefixed hints** can be validated against the union of hints declared by all executors. Since unprefixed hints might be supported by executors that aren't currently loaded, unrecognized hints should be reported as warnings.
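
The two validation rules can be sketched as follows. This is a minimal sketch under stated assumptions: `HintValidation`, `Report`, and `validate` are hypothetical names, executors are assumed to declare their supported hints as a set of key names, and a prefixed hint whose executor is not loaded is left unvalidated (a case this ADR does not address).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of hint validation; not Nextflow's actual implementation.
public class HintValidation {

    /** Collects validation diagnostics. */
    public static final class Report {
        public final List<String> errors = new ArrayList<>();
        public final List<String> warnings = new ArrayList<>();
    }

    /**
     * @param hintKeys           hint keys as written by the user (with or without an "executor/" prefix)
     * @param declaredByExecutor hint names declared by each loaded executor
     */
    public static Report validate(List<String> hintKeys, Map<String, Set<String>> declaredByExecutor) {
        Report report = new Report();
        // Union of the hints declared by all loaded executors, used for unprefixed keys.
        Set<String> union = new HashSet<>();
        for (Set<String> declared : declaredByExecutor.values()) {
            union.addAll(declared);
        }
        for (String key : hintKeys) {
            int slash = key.indexOf('/');
            if (slash >= 0) {
                Set<String> declared = declaredByExecutor.get(key.substring(0, slash));
                // Assumption: if the executor is not loaded, the hint cannot be checked and is skipped.
                if (declared != null && !declared.contains(key.substring(slash + 1))) {
                    report.errors.add("Unrecognized hint: " + key);
                }
            } else if (!union.contains(key)) {
                // An unloaded executor might still support this hint, so only warn.
                report.warnings.add("Unrecognized hint: " + key);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> declared = Map.of("awsbatch", Set.of("consumableResources"));
        Report report = validate(List.of("awsbatch/consumableResorces", "provisioningModle"), declared);
        System.out.println("errors: " + report.errors);
        System.out.println("warnings: " + report.warnings);
    }
}
```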
|
||||
|
||||
### Multiple hint resolution
|
||||
|
||||
In the process config, the `hints` setting uses *replacement semantics*: when it is specified multiple times (e.g. in the generic process scope and in a selector), each setting completely replaces any previous one instead of being merged with it:
|
||||
|
||||
```groovy
process {
    // generic hint
    hints = [provisioningModel: 'spot']

    // specific hint replaces generic hint
    withLabel: 'dragen' {
        hints = [consumableResources: ['my-dragen-license': 1]]
    }
}
```
|
||||
|
||||
Within a process definition, the `hints` directive uses *accumulation semantics*, meaning that subsequent `hints` directives are accumulated:
|
||||
|
||||
```groovy
process runDragen {
    // multiple separate hints
    hints provisioningModel: 'spot'
    hints consumableResources: ['my-dragen-license': 1, 'other-license': 2]

    // equivalent to...
    hints (
        provisioningModel: 'spot',
        consumableResources: ['my-dragen-license': 1, 'other-license': 2]
    )

    // ...
}
```
|
||||
|
||||
This behavior is consistent with other directives such as `pod` and `resourceLabels`. In practice, because config settings replace one another, a given `hints` setting in the config should specify all relevant hints for its context.
|
||||
|
||||
For example, the `withLabel` selector above should also specify the `provisioningModel` hint if the intention is to preserve that hint for the selected processes:
|
||||
|
||||
```groovy
process {
    hints = [provisioningModel: 'spot']

    withLabel: 'dragen' {
        hints = [provisioningModel: 'spot', consumableResources: ['my-dragen-license': 1]]
    }
}
```
|
||||
|
||||
While this approach may lead to duplication, it gives users and developers more control over which hints are applied in a given context.
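
The difference between the two semantics can be sketched as two ways of combining a sequence of hint maps. This is an illustrative sketch: `HintResolution` and its methods are hypothetical names, the config case assumes the list is ordered from least to most specific matching setting, and letting a later directive overwrite an earlier one with the same key is an assumption that this ADR does not state.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of hint resolution; not Nextflow's actual implementation.
public class HintResolution {

    /** Config (replacement semantics): the last matching setting wins outright. */
    public static Map<String, Object> resolveConfig(List<Map<String, Object>> settings) {
        if (settings.isEmpty()) {
            return new LinkedHashMap<>();
        }
        return new LinkedHashMap<>(settings.get(settings.size() - 1));
    }

    /** Process definition (accumulation semantics): successive directives are merged. */
    public static Map<String, Object> resolveProcess(List<Map<String, Object>> directives) {
        Map<String, Object> merged = new LinkedHashMap<>();
        for (Map<String, Object> directive : directives) {
            // Assumption: a later directive overwrites an earlier one that uses the same key.
            merged.putAll(directive);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> generic = Map.of("provisioningModel", "spot");
        Map<String, Object> dragen = Map.of("consumableResources", Map.of("my-dragen-license", 1));
        System.out.println("config:  " + resolveConfig(List.of(generic, dragen)).keySet());
        System.out.println("process: " + resolveProcess(List.of(generic, dragen)).keySet());
    }
}
```

With the same two inputs, the config case keeps only `consumableResources`, while the process-definition case keeps both hints.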
|
||||
|
||||
### Initial hint catalog
|
||||
|
||||
The following hints should be supported initially:
|
||||
|
||||
| Hint name | Value type | Executors | Use case |
|--|--|--|--|
| `consumableResources` | `Map<String, Integer>` | AWS Batch | License-aware scheduling ([#5917](https://github.com/nextflow-io/nextflow/issues/5917)) |
| `scheduling.priority` | `Integer` | AWS Batch | Job scheduling priority ([#6998](https://github.com/nextflow-io/nextflow/issues/6998)) |
| `scheduling.provisioningModel` | `String` | Google Batch | Spot VM scheduling ([#3530](https://github.com/nextflow-io/nextflow/issues/3530)) |
|
||||
|
||||
## Links
|
||||
|
||||
- [Community issue](https://github.com/nextflow-io/nextflow/issues/5917)
|
||||
76
nextflow/adr/YYYYMMDD-template-name.md
Normal file
@@ -0,0 +1,76 @@
|
||||
# [short title of solved problem and solution]
|
||||
|
||||
- Authors: [who wrote the ADR]
|
||||
- Status: [draft | proposed | rejected | accepted | deprecated | … | superseded by [xxx](xxx.md)]
|
||||
- Deciders: [list everyone involved in the decision] <!-- optional - to be formalised -->
|
||||
- Date: [YYYY-MM-DD when the decision was last updated]
|
||||
- Tags: [space and/or comma separated list of tags]
|
||||
|
||||
Technical Story: [description | ticket/issue URL] <!-- optional -->
|
||||
|
||||
## Summary
|
||||
|
||||
Quick description of the problem and the context. Should not take more than 2-3 lines.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Description of the technical problem to solve or the decision to make. This should be concise but provide all required details and the context related to the technical decision to be taken.
|
||||
|
||||
## Goals or Decision Drivers
|
||||
|
||||
Depending on the context, clearly define the goals or the most important decision drivers.
|
||||
|
||||
- [driver 1, e.g., a force, facing concern, …]
|
||||
- [driver 2, e.g., a force, facing concern, …]
|
||||
- … <!-- numbers of drivers can vary -->
|
||||
|
||||
## Non-goals
|
||||
|
||||
Define what's out of the scope of this ADR.
|
||||
|
||||
## Considered Options <!-- optional -->
|
||||
|
||||
- [option 1]
|
||||
- [option 2]
|
||||
- [option 3]
|
||||
- … <!-- numbers of options can vary -->
|
||||
|
||||
|
||||
## Pros and Cons of the Options <!-- optional -->
|
||||
|
||||
### [option 1]
|
||||
|
||||
[example | description | pointer to more information | …] <!-- optional -->
|
||||
|
||||
- Good, because [argument a]
|
||||
- Good, because [argument b]
|
||||
- Bad, because [argument c]
|
||||
- … <!-- numbers of pros and cons can vary -->
|
||||
|
||||
### [option 2]
|
||||
|
||||
[example | description | pointer to more information | …] <!-- optional -->
|
||||
|
||||
- Good, because [argument a]
|
||||
- Good, because [argument b]
|
||||
- Bad, because [argument c]
|
||||
- … <!-- numbers of pros and cons can vary -->
|
||||
|
||||
|
||||
## Solution or decision outcome
|
||||
|
||||
Summarize the solution or decision outcome in one or two lines.
|
||||
|
||||
## Rationale & discussion
|
||||
|
||||
Describe the solution or decision outcome, discussing how the decision drivers were applied and how the outcome matches the declared goals. This section should be concise while still providing a comprehensive description of the technical solution and covering all uncertain or ambiguous points.
|
||||
|
||||
## Links <!-- optional -->
|
||||
|
||||
- [Link type](link to adr) <!-- example: Refined by [xxx](yyyymmdd-xxx.md) -->
|
||||
- … <!-- numbers of links can vary -->
|
||||
|
||||
## More information
|
||||
|
||||
- [What is an ADR and why should you use them](https://github.com/thomvaill/log4brains/tree/master#-what-is-an-adr-and-why-should-you-use-them)
|
||||
- [ADR GitHub organization](https://adr.github.io/)
|
||||
416
nextflow/adr/module-spec-schema.json
Normal file
@@ -0,0 +1,416 @@
|
||||
{
|
||||
"$schema": "https://json-schema.org/draft/2020-12/schema",
|
||||
"$id": "https://raw.githubusercontent.com/nextflow-io/schemas/main/module/v1/schema.json",
|
||||
"title": "Nextflow Module Schema",
|
||||
"description": "Schema for Nextflow module meta.yml files, supporting both nf-core community patterns and the Nextflow module system",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"name": {
|
||||
"type": "string",
|
||||
"description": "Module name. Can be a simple identifier (e.g., 'fastqc', 'bwa_mem') for local/nf-core modules, or a fully qualified scoped name (e.g., 'nf-core/fastqc', 'myorg/custom') for registry modules.",
|
||||
"examples": ["fastqc", "bwa_mem", "nf-core/fastqc", "myorg/salmon-quant"],
|
||||
"pattern": "^([a-z0-9][a-z0-9-]*/)?[a-z][a-z0-9_-]*$"
|
||||
},
|
||||
"version": {
|
||||
"type": "string",
|
||||
"description": "Semantic version of the module (MAJOR.MINOR.PATCH). Required for registry publication",
|
||||
"pattern": "^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)(-[0-9A-Za-z-]+(\\.[0-9A-Za-z-]+)*)?(\\+[0-9A-Za-z-]+(\\.[0-9A-Za-z-]+)*)?$",
|
||||
"examples": ["1.0.0", "2.1.3", "1.0.0-beta.1"]
|
||||
},
|
||||
"description": {
|
||||
"type": "string",
|
||||
"description": "Brief description of what the module does",
|
||||
"minLength": 10,
|
||||
"maxLength": 500
|
||||
},
|
||||
"keywords": {
|
||||
"type": "array",
|
||||
"description": "Keywords for discovery and categorization",
|
||||
"items": {
|
||||
"type": "string",
|
||||
"minLength": 2
|
||||
},
|
||||
"minItems": 1,
|
||||
"uniqueItems": true
|
||||
},
|
||||
"license": {
|
||||
"type": "string",
|
||||
"description": "SPDX license identifier for the module code itself",
|
||||
"examples": ["MIT", "Apache-2.0", "GPL-3.0-or-later"]
|
||||
},
|
||||
"authors": {
|
||||
"type": "array",
|
||||
"description": "Original authors of the module (GitHub handles preferred)",
|
||||
"items": {
|
||||
"type": "string",
|
||||
"pattern": "^@?[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"
|
||||
},
|
||||
"minItems": 1
|
||||
},
|
||||
"maintainers": {
|
||||
"type": "array",
|
||||
"description": "Current maintainers of the module (GitHub handles preferred)",
|
||||
"items": {
|
||||
"type": "string",
|
||||
"pattern": "^@?[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"
|
||||
}
|
||||
},
|
||||
"requires": {
|
||||
"type": "object",
|
||||
"description": "Runtime requirements for the module",
|
||||
"properties": {
|
||||
"nextflow": {
|
||||
"type": "string",
|
||||
"description": "Nextflow version constraint using comparison operators",
|
||||
"examples": [">=24.04.0", ">=24.04.0,<25.0.0"],
|
||||
"pattern": "^[<>=!]+[0-9]+\\.[0-9]+\\.[0-9]+(-[a-zA-Z0-9]+)?(,\\s*[<>=!]+[0-9]+\\.[0-9]+\\.[0-9]+(-[a-zA-Z0-9]+)?)*$"
|
||||
}
|
||||
},
|
||||
"additionalProperties": false
|
||||
},
|
||||
"input": {
|
||||
"type": "array",
|
||||
"description": "Inputs of the module",
|
||||
"items": {
|
||||
"$ref": "#/$defs/structuredParameter"
|
||||
}
|
||||
},
|
||||
"output": {
|
||||
"type": "array",
|
||||
"description": "Outputs of the module",
|
||||
"items": {
|
||||
"$ref": "#/$defs/structuredParameter"
|
||||
}
|
||||
},
|
||||
"topics": {
|
||||
"type": "array",
|
||||
"description": "Topics of the module",
|
||||
"items": {
|
||||
"$ref": "#/$defs/structuredParameter"
|
||||
}
|
||||
},
|
||||
"tools": {
|
||||
"type": "array",
|
||||
"description": "Software tools wrapped by this module with their metadata",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"minProperties": 1,
|
||||
"maxProperties": 1,
|
||||
"patternProperties": {
|
||||
"^[a-zA-Z][a-zA-Z0-9_-]*$": {
|
||||
"$ref": "#/$defs/toolSpec"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["name", "description"],
|
||||
"$defs": {
|
||||
"toolSpec": {
|
||||
"type": "object",
|
||||
"description": "Specification for a software tool used by the module",
|
||||
"properties": {
|
||||
"description": {
|
||||
"type": "string",
|
||||
"description": "Description of the tool and its purpose"
|
||||
},
|
||||
"homepage": {
|
||||
"type": "string",
|
||||
"format": "uri",
|
||||
"description": "Tool's homepage URL",
|
||||
"pattern": "^https?://.*$"
|
||||
},
|
||||
"documentation": {
|
||||
"type": "string",
|
||||
"format": "uri",
|
||||
"description": "Documentation URL",
|
||||
"pattern": "^(https?|ftp)://.*$"
|
||||
},
|
||||
"tool_dev_url": {
|
||||
"type": "string",
|
||||
"format": "uri",
|
||||
"description": "Development/source code URL",
|
||||
"pattern": "^https?://.*$"
|
||||
},
|
||||
"doi": {
|
||||
"description": "Digital Object Identifier for the tool's publication",
|
||||
"oneOf": [
|
||||
{
|
||||
"type": "string",
|
||||
"pattern": "^10\\.\\d{4,9}/[^,]+$"
|
||||
},
|
||||
{
|
||||
"type": "string",
|
||||
"const": "no DOI available"
|
||||
}
|
||||
]
|
||||
},
|
||||
"arxiv": {
|
||||
"type": "string",
|
||||
"description": "arXiv identifier",
|
||||
"pattern": "^arXiv:\\d{4}\\.\\d{4,5}(v\\d+)?$"
|
||||
},
|
||||
"licence": {
|
||||
"type": "array",
|
||||
"description": "SPDX license identifier(s) for the tool",
|
||||
"items": {
|
||||
"type": "string"
|
||||
},
|
||||
"minItems": 1,
|
||||
"uniqueItems": true
|
||||
},
|
||||
"identifier": {
|
||||
"description": "bio.tools identifier or empty string",
|
||||
"oneOf": [
|
||||
{
|
||||
"type": "string",
|
||||
"pattern": "^biotools:[a-zA-Z0-9_-]+$"
|
||||
},
|
||||
{
|
||||
"type": "string",
|
||||
"maxLength": 0
|
||||
}
|
||||
]
|
||||
},
|
||||
"manual": {
|
||||
"type": "string",
|
||||
"format": "uri",
|
||||
"description": "Manual/user guide URL"
|
||||
}
|
||||
},
|
||||
"required": ["description"],
|
||||
"anyOf": [
|
||||
{
|
||||
"required": ["homepage"]
|
||||
},
|
||||
{
|
||||
"required": ["documentation"]
|
||||
},
|
||||
{
|
||||
"required": ["tool_dev_url"]
|
||||
},
|
||||
{
|
||||
"required": ["doi"]
|
||||
}
|
||||
]
|
||||
},
|
||||
"structuredParameter": {
|
||||
"oneOf": [
|
||||
{
|
||||
"$ref": "#/$defs/paramSpec"
|
||||
},
|
||||
{
|
||||
"type": "array",
|
||||
"items": {
|
||||
"$ref": "#/$defs/paramSpec"
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
"paramSpec": {
|
||||
"type": "object",
|
||||
"description": "Specification for a module parameter",
|
||||
"properties": {
|
||||
"name": {
|
||||
"type": "string",
|
||||
"description": "Parameter identifier",
|
||||
"pattern": "^[a-zA-Z_][a-zA-Z0-9_]*$"
|
||||
},
|
||||
"type": {
|
||||
"type": "string",
|
||||
"description": "Data type of the parameter value",
|
||||
"enum": [
|
||||
"boolean",
|
||||
"float",
|
||||
"integer",
|
||||
"string",
|
||||
"list",
|
||||
"map",
|
||||
"file",
|
||||
"directory"
|
||||
]
|
||||
},
|
||||
"description": {
|
||||
"type": "string",
|
||||
"description": "Human-readable description of the parameter"
|
||||
},
|
||||
"pattern": {
|
||||
"type": "string",
|
||||
"description": "Glob pattern for file/directory parameters"
|
||||
},
|
||||
"optional": {
|
||||
"type": "boolean",
|
||||
"description": "Whether this parameter is optional",
|
||||
"default": false
|
||||
},
|
||||
"enum": {
|
||||
"type": "array",
|
||||
"description": "List of allowed values",
|
||||
"uniqueItems": true
|
||||
},
|
||||
"ontologies": {
|
||||
"type": "array",
|
||||
"description": "Ontology annotations (e.g., EDAM)",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"patternProperties": {
|
||||
"^[a-zA-Z]+$": {
|
||||
"type": "string",
|
||||
"format": "uri",
|
||||
"description": "Ontology URI"
|
||||
}
|
||||
}
|
||||
},
|
||||
"uniqueItems": true
|
||||
}
|
||||
},
|
||||
"required": ["type", "description"]
|
||||
}
|
||||
},
|
||||
"allOf": [
|
||||
{
|
||||
"if": {
|
||||
"properties": {
|
||||
"name": {
|
||||
"pattern": "^[a-z0-9][a-z0-9-]*/"
|
||||
}
|
||||
},
|
||||
"required": ["name"]
|
||||
},
|
||||
"then": {
|
||||
"required": ["name", "description", "version"],
|
||||
"properties": {
|
||||
"version": {
|
||||
"description": "Version is required for scoped/registry modules (scope/name format)"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
],
|
||||
"examples": [
|
||||
{
|
||||
"name": "fastqc",
|
||||
"description": "Run FastQC on sequenced reads",
|
||||
"keywords": ["quality control", "qc", "adapters", "fastq"],
|
||||
"tools": [
|
||||
{
|
||||
"fastqc": {
|
||||
"description": "FastQC gives general quality metrics about your reads.",
|
||||
"homepage": "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/",
|
||||
"documentation": "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/",
|
||||
"licence": ["GPL-2.0-only"],
|
||||
"identifier": "biotools:fastqc"
|
||||
}
|
||||
}
|
||||
],
|
||||
"input": [
|
||||
[
|
||||
{
|
||||
"name": "meta",
|
||||
"type": "map",
|
||||
"description": "Groovy Map containing sample information"
|
||||
},
|
||||
{
|
||||
"name": "reads",
|
||||
"type": "file",
|
||||
"description": "Input FastQ files",
|
||||
"ontologies": []
|
||||
}
|
||||
]
|
||||
],
|
||||
"output": [
|
||||
[
|
||||
{
|
||||
"type": "map",
|
||||
"description": "Sample information"
|
||||
},
|
||||
{
|
||||
"type": "file",
|
||||
"description": "FastQC report",
|
||||
"pattern": "*_{fastqc.html}",
|
||||
"ontologies": []
|
||||
}
|
||||
],
|
||||
{
|
||||
"name": "versions",
|
||||
"type": "file",
|
||||
"description": "File containing software versions",
|
||||
"pattern": "versions.yml"
|
||||
}
|
||||
],
|
||||
"authors": ["@drpatelh", "@ewels"],
|
||||
"maintainers": ["@drpatelh", "@ewels"]
|
||||
},
|
||||
{
|
||||
"name": "nf-core/bwa-align",
|
||||
"version": "1.2.4",
|
||||
"description": "Align reads using BWA-MEM algorithm",
|
||||
"keywords": ["alignment", "bwa", "mapping", "fastq", "bam"],
|
||||
"license": "MIT",
|
||||
"authors": ["@nf-core"],
|
||||
"maintainers": ["@nf-core"],
|
||||
"requires": {
|
||||
"nextflow": ">=24.04.0"
|
||||
},
|
||||
"tools": [
|
||||
{
|
||||
"bwa": {
|
||||
"description": "BWA aligner",
|
||||
"homepage": "http://bio-bwa.sourceforge.net/",
|
||||
"licence": ["GPL-3.0-or-later"]
|
||||
}
|
||||
},
|
||||
{
|
||||
"samtools": {
|
||||
"description": "SAMtools",
|
||||
"homepage": "http://www.htslib.org/",
|
||||
"licence": ["MIT"]
|
||||
}
|
||||
}
|
||||
],
|
||||
"input": [
|
||||
[
|
||||
{
|
||||
"name": "meta",
|
||||
"type": "map",
|
||||
"description": "Sample metadata map"
|
||||
},
|
||||
{
|
||||
"name": "reads",
|
||||
"type": "file",
|
||||
"description": "Input FastQ files",
|
||||
"ontologies": [
|
||||
{ "edam": "http://edamontology.org/format_1930" }
|
||||
]
|
||||
}
|
||||
],
|
||||
{
|
||||
"name": "index",
|
||||
"type": "directory",
|
||||
"description": "BWA index directory"
|
||||
}
|
||||
],
|
||||
"output": [
|
||||
[
|
||||
{
|
||||
"type": "map",
|
||||
"description": "Sample metadata"
|
||||
},
|
||||
{
|
||||
"type": "file",
|
||||
"description": "Aligned BAM file",
|
||||
"pattern": "*.bam",
|
||||
"ontologies": [
|
||||
{ "edam": "http://edamontology.org/format_2572" }
|
||||
]
|
||||
}
|
||||
],
|
||||
{
|
||||
"name": "versions",
|
||||
"type": "file",
|
||||
"description": "Software versions"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||