Nextflow

Overview

Nextflow is a reactive workflow framework built on the dataflow programming model. In its current DSL2 syntax, pipelines are composed of modular processes connected through channels. Each task runs in its own work directory and can be isolated further with Docker, Singularity, or Conda. The same pipeline can run locally, on HPC schedulers such as SLURM, or on cloud executors (AWS Batch, Google Cloud Batch, Azure Batch). Nextflow handles job scheduling, fault tolerance through configurable retries, and resumption of incomplete runs via its task cache (the -resume option). The nf-core community maintains a large collection of community-reviewed, production-ready pipelines for common bioinformatics analyses.

Pipeline Steps

A typical Nextflow QC pipeline follows these steps:

1. Create a channel of paired-end read files using fromFilePairs.
2. Run FastQC on each sample to generate quality reports.
3. Trim adapters and low-quality bases with fastp.
4. Collect all QC outputs and aggregate them into a MultiQC report.
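
As a minimal sketch of the first step, the snippet below only builds the channel and prints what it emits, which can help when checking a glob pattern before any processes are wired in. The glob is assumed to match the layout data/<sample>_R1.fastq.gz and data/<sample>_R2.fastq.gz.

```nextflow
// Sketch: inspect the tuples emitted by fromFilePairs.
// Each item has the shape [ sample_id, [ read1, read2 ] ].
workflow {
    Channel
        .fromFilePairs("data/*_{R1,R2}.fastq.gz", checkIfExists: true)
        .view { sample_id, reads -> "${sample_id} -> ${reads*.name}" }
}
```

For a hypothetical pair data/sampleA_R1.fastq.gz and data/sampleA_R2.fastq.gz, sample_id should be sampleA, because the grouping key is derived from the part of the file name matched by the * wildcard.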

QC Pipeline

The pipeline below implements the complete quality-control workflow in Nextflow DSL2. It reads paired FASTQ files from a glob pattern, runs FastQC and fastp in parallel per sample, then aggregates all reports with MultiQC.

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

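params.reads    = "data/*_{R1,R2}.fastq.gz"  // glob matching paired-end FASTQ files
params.outdir   = "results"                  // base directory for published outputs
params.min_qual = 20                         // fastp: minimum Phred score for a base to count as qualified
params.min_len  = 50                         // fastp: discard reads shorter than this after trimming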

process FASTQC {
    tag "${sample_id}"
    publishDir "${params.outdir}/fastqc", mode: 'copy'
    conda 'bioconda::fastqc=0.12.1'
    cpus 2

    input:
    tuple val(sample_id), path(reads)

    output:
    path("*.html"), emit: html
    path("*.zip"),  emit: zip

    script:
    """
    fastqc -t ${task.cpus} ${reads}
    """
}

process FASTP {
    tag "${sample_id}"
    publishDir "${params.outdir}/fastp", mode: 'copy'
    conda 'bioconda::fastp=0.23.4'
    cpus 4

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("*_trimmed.fastq.gz"), emit: trimmed
    path("*.html"), emit: html
    path("*.json"), emit: json

    script:
    def (r1, r2) = reads
    """
    fastp -i ${r1} -I ${r2} \
      -o ${sample_id}_R1_trimmed.fastq.gz \
      -O ${sample_id}_R2_trimmed.fastq.gz \
      -h ${sample_id}_fastp.html \
      -j ${sample_id}_fastp.json \
      --qualified_quality_phred ${params.min_qual} \
      --length_required ${params.min_len} \
      --detect_adapter_for_pe \
      --thread ${task.cpus}
    """
}

process MULTIQC {
    publishDir "${params.outdir}/multiqc", mode: 'copy'
    conda 'bioconda::multiqc=1.21'

    input:
    path('*')

    output:
    path("multiqc_report.html")

    script:
    """
    multiqc . -o . --force
    """
}

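workflow {
    // Emit one [ sample_id, [ R1, R2 ] ] tuple per read pair
    read_pairs_ch = Channel
        .fromFilePairs(params.reads, checkIfExists: true)

    // Both processes consume the raw reads, so they run independently per sample
    FASTQC(read_pairs_ch)
    FASTP(read_pairs_ch)

    // Gather every FastQC zip and fastp JSON into one list so MULTIQC runs once
    ch_multiqc = FASTQC.out.zip
        .mix(FASTP.out.json)
        .collect()

    MULTIQC(ch_multiqc)
}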
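
To make the final aggregation step easier to follow, here is a toy sketch of what mix and collect do, using plain strings in place of real report files. The file names are made up for illustration; only the operator behaviour carries over to the pipeline.

```nextflow
// Sketch: mix merges two channels into one; collect gathers every item
// into a single list, so a downstream process runs exactly once.
workflow {
    zips  = Channel.of('a_fastqc.zip', 'b_fastqc.zip')   // stand-in for FASTQC.out.zip
    jsons = Channel.of('a_fastp.json', 'b_fastp.json')   // stand-in for FASTP.out.json

    zips
        .mix(jsons)
        .collect()
        .view()   // one emission: a list of all four names (order not guaranteed)
}
```

Because collect turns the stream into a single list, MULTIQC receives all reports at once and runs as a single task. In the real pipeline each FASTQC task emits two zip files (one per read file); collect flattens these into the same list by default.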

Configuration

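Nextflow reads its settings from a nextflow.config file, which can define named execution profiles selected at run time with -profile. The configuration below defines local, SLURM, Docker, and Singularity profiles, followed by a process-wide retry policy that reruns a failed task up to two times: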

profiles {
    standard {
        process.executor = 'local'
        process.cpus = 4
        process.memory = '8 GB'
    }
    slurm {
        process.executor = 'slurm'
        process.queue = 'standard'
        process.cpus = 4
        process.memory = '8 GB'
        process.time = '2h'
    }
    docker {
        docker.enabled = true
    }
    singularity {
        singularity.enabled = true
        singularity.autoMounts = true
    }
}

process {
    errorStrategy = 'retry'
    maxRetries = 2
}
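
The main.nf shown earlier declares only conda directives, so two gaps may be worth closing in the configuration: recent Nextflow releases apply conda directives only when conda.enabled is set, and the docker and singularity profiles need a container image for each process. The fragment below is a sketch of one way to do that. The profile name conda and the specific image tags are assumptions, not part of the original setup.

```nextflow
// Sketch of additions to nextflow.config; merge into the existing blocks.

profiles {
    conda {
        // Recent Nextflow releases ignore `conda` directives unless this is set.
        conda.enabled = true
    }
}

process {
    // Container images used when the docker or singularity profile is active.
    // ASSUMPTION: these BioContainers tags are illustrative; verify them
    // against the registry before use.
    withName: 'FASTQC' {
        container = 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'
    }
    withName: 'FASTP' {
        container = 'quay.io/biocontainers/fastp:0.23.4--h5f740d0_0'
    }
    withName: 'MULTIQC' {
        container = 'quay.io/biocontainers/multiqc:1.21--pyhdfd78af_0'
    }
}
```

With these settings, -profile conda would let Nextflow build the Conda environments named in each process, while -profile docker or -profile singularity would run each task in the image selected by its withName block.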

Running the Pipeline

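# Run locally with the default (standard) profile, overriding the reads glob
nextflow run main.nf --reads "data/*_{R1,R2}.fastq.gz"
# Submit each task as a SLURM job
nextflow run main.nf -profile slurm
# Enable Docker; each process also needs a container image set in main.nf or nextflow.config
nextflow run main.nf -profile docker
# Resume the previous run, reusing cached results for unchanged tasks
nextflow run main.nf -resume
# Run a released nf-core pipeline straight from its repository, pinned to a version
nextflow run nf-core/rnaseq -r 3.14.0 \
  --input samplesheet.csv --outdir results --genome GRCh38 -profile singularity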