Nextflow
Overview
Nextflow is a reactive workflow framework built on the dataflow programming model. In its current DSL2 syntax, pipelines are composed of modular processes connected through channels. Each process runs as an independent task in its own work directory and can be executed inside a Docker or Singularity container or a Conda environment, either locally, on an HPC scheduler such as SLURM, or on cloud executors (AWS Batch, Google Cloud Batch, Azure Batch). Nextflow handles job submission, retries failed tasks when configured to do so, and resumes incomplete runs from its task cache with the -resume option. The nf-core community maintains a large collection of community-reviewed pipelines for common bioinformatics analyses.
Pipeline Steps
A typical Nextflow QC pipeline follows these steps:
Create a channel of paired-end read files using fromFilePairs.
Run FastQC on each sample to generate quality reports.
Trim adapters and low-quality bases with fastp.
Collect all QC outputs and aggregate them into a MultiQC report.
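The first step can be checked on its own before any process is wired in. The sketch below assumes the data/*_{R1,R2}.fastq.gz layout used elsewhere on this page (sampleA is a hypothetical sample name) and simply prints what fromFilePairs emits: one tuple per sample, holding the shared filename prefix and the list of matching files.

```nextflow
// Minimal sketch: inspect the channel produced by fromFilePairs.
// Assumes files such as data/sampleA_R1.fastq.gz and data/sampleA_R2.fastq.gz exist.
workflow {
    Channel
        .fromFilePairs("data/*_{R1,R2}.fastq.gz", checkIfExists: true)
        .view()   // prints each [sample_id, [R1, R2]] tuple as it is emitted
}
```

Each printed item is the same tuple shape that the FastQC and fastp processes later declare as tuple val(sample_id), path(reads).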
QC Pipeline
The pipeline below implements the complete quality-control workflow in Nextflow DSL2. It reads paired FASTQ files from a glob pattern, runs FastQC and fastp in parallel per sample, then aggregates all reports with MultiQC.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.reads = "data/*_{R1,R2}.fastq.gz"
params.outdir = "results"
params.min_qual = 20
params.min_len = 50
process FASTQC {
    tag "${sample_id}"
    publishDir "${params.outdir}/fastqc", mode: 'copy'
    conda 'bioconda::fastqc=0.12.1'
    cpus 2

    input:
    tuple val(sample_id), path(reads)

    output:
    path("*.html"), emit: html
    path("*.zip"), emit: zip

    script:
    """
    fastqc -t ${task.cpus} ${reads}
    """
}
process FASTP {
    tag "${sample_id}"
    publishDir "${params.outdir}/fastp", mode: 'copy'
    conda 'bioconda::fastp=0.23.4'
    cpus 4

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("*_trimmed.fastq.gz"), emit: trimmed
    path("*.html"), emit: html
    path("*.json"), emit: json

    script:
    // reads is a two-element list: [R1, R2]
    def (r1, r2) = reads
    """
    fastp -i ${r1} -I ${r2} \
        -o ${sample_id}_R1_trimmed.fastq.gz \
        -O ${sample_id}_R2_trimmed.fastq.gz \
        -h ${sample_id}_fastp.html \
        -j ${sample_id}_fastp.json \
        --qualified_quality_phred ${params.min_qual} \
        --length_required ${params.min_len} \
        --detect_adapter_for_pe \
        --thread ${task.cpus}
    """
}
process MULTIQC {
    publishDir "${params.outdir}/multiqc", mode: 'copy'
    conda 'bioconda::multiqc=1.21'

    input:
    path('*')

    output:
    path("multiqc_report.html")

    script:
    """
    multiqc . -o . --force
    """
}
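workflow {
    // Emits one tuple per sample: [sample_id, [R1, R2]]
    read_pairs_ch = Channel
        .fromFilePairs(params.reads, checkIfExists: true)

    // FastQC and fastp both consume the raw reads, so they run independently
    FASTQC(read_pairs_ch)
    FASTP(read_pairs_ch)

    // Gather every FastQC archive and fastp JSON into a single list for MultiQC
    ch_multiqc = FASTQC.out.zip
        .mix(FASTP.out.json)
        .collect()

    MULTIQC(ch_multiqc)
}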
Configuration
Nextflow reads its settings from a nextflow.config file, which can define named execution profiles. The
configuration below sets up local, SLURM, Docker, and Singularity profiles and retries each failed task up to two times:
profiles {
    // Used when no -profile option is given
    standard {
        process.executor = 'local'
        process.cpus = 4
        process.memory = '8 GB'
    }
    slurm {
        process.executor = 'slurm'
        process.queue = 'standard'
        process.cpus = 4
        process.memory = '8 GB'
        process.time = '2h'
    }
    docker {
        docker.enabled = true
    }
    singularity {
        singularity.enabled = true
        singularity.autoMounts = true
    }
}

// Applies to every process, whichever profile is selected
process {
    errorStrategy = 'retry'
    maxRetries = 2
}
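The profiles above give every process the same CPU and memory defaults. Where one tool needs more, the process scope also accepts withName selectors. The fragment below is a sketch, not a tuned configuration: the 16 GB and 1 h starting values are assumptions, and it relies on task.attempt so that each retry of a failed task requests more resources. It could stand in for the plain process block shown above.

```nextflow
// Sketch of a per-process override for nextflow.config (values are illustrative).
process {
    errorStrategy = 'retry'
    maxRetries    = 2

    // Applies only to the FASTP process defined in main.nf
    withName: 'FASTP' {
        // Closures are re-evaluated on each attempt: 16 GB, then 32 GB, then 48 GB
        memory = { 16.GB * task.attempt }
        time   = { 1.h * task.attempt }
    }
}
```

With maxRetries = 2 a task is attempted at most three times, so the largest request in this sketch is 48 GB and 3 h.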
Running the Pipeline
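# Run locally with an explicit read glob (quoted so the shell does not expand it)
nextflow run main.nf --reads "data/*_{R1,R2}.fastq.gz"
# Submit tasks through SLURM
nextflow run main.nf -profile slurm
# Enable Docker
nextflow run main.nf -profile docker
# Resume the previous run, reusing cached results for tasks that have not changed
nextflow run main.nf -resume
# Run a released nf-core pipeline with Singularity (nf-core pipelines require --outdir)
nextflow run nf-core/rnaseq -r 3.14.0 \
    --input samplesheet.csv --outdir results --genome GRCh38 -profile singularity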
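The processes in main.nf declare their software through conda directives, yet none of the profiles in the configuration switches Conda on. In recent Nextflow releases the conda directive only takes effect when Conda support is enabled, so a profile along these lines may be needed. The profile name conda is an assumption and is not part of the original configuration.

```nextflow
// Hypothetical extra profile for nextflow.config; merge it into the existing profiles block.
profiles {
    conda {
        // Let Nextflow build and cache the environment named in each process's conda directive
        conda.enabled = true
    }
}
```

Profiles can be combined with a comma, for example nextflow run main.nf -profile slurm,conda. The docker and singularity profiles, by contrast, only select a container engine; because the processes declare no container directive, those profiles would additionally need an image assigned to each process.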
See Also
Snakemake – Snakemake workflow manager
WGS Variant Calling – WGS variant calling pipeline
RNA-seq Differential Expression – RNA-seq differential expression pipeline