Snakemake

Overview

Snakemake is a Python-based workflow management system that uses a declarative syntax to define data analysis pipelines. Workflows are specified as a set of rules, each describing how to produce output files from input files. Snakemake automatically resolves dependencies between rules, determines execution order via a directed acyclic graph (DAG), and supports parallel execution across local cores or cluster nodes. Its integration with Conda, containers, and cluster schedulers (SLURM, PBS, LSF) makes it a standard choice for reproducible bioinformatics pipelines.

Pipeline Steps

A typical Snakemake QC pipeline follows these steps:

Define sample names and parameters in a configuration file (config.yaml).
Run FastQC on raw reads to generate per-sample quality reports.
Trim adapters and low-quality bases with fastp.
Aggregate all QC outputs into a single MultiQC report.

QC Pipeline

The Snakefile below implements the complete quality-control pipeline described in the textbook. It reads sample names from a YAML configuration file, runs FastQC on the raw paired-end reads, trims with fastp, and produces a combined MultiQC report.

# Snakefile for basic QC pipeline
configfile: "config.yaml"

SAMPLES = config["samples"]

rule all:
    input:
        expand("results/fastqc/{sample}_{read}_fastqc.html",
               sample=SAMPLES, read=["R1", "R2"]),
        expand("results/fastp/{sample}_R1.trimmed.fastq.gz",
               sample=SAMPLES),
        "results/multiqc/multiqc_report.html"

rule fastqc_raw:
    input:
        "data/{sample}_{read}.fastq.gz"
    output:
        html="results/fastqc/{sample}_{read}_fastqc.html",
        zip="results/fastqc/{sample}_{read}_fastqc.zip"
    threads: 2
    conda:
        "envs/qc.yaml"
    shell:
        "fastqc -t {threads} -o results/fastqc/ {input}"

rule fastp_trim:
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz"
    output:
        r1="results/fastp/{sample}_R1.trimmed.fastq.gz",
        r2="results/fastp/{sample}_R2.trimmed.fastq.gz",
        html="results/fastp/{sample}_fastp.html",
        json="results/fastp/{sample}_fastp.json"
    threads: 4
    conda:
        "envs/qc.yaml"
    params:
        qual=config.get("min_quality", 20),
        len=config.get("min_length", 50)
    shell:
        """
        fastp -i {input.r1} -I {input.r2} \
          -o {output.r1} -O {output.r2} \
          -h {output.html} -j {output.json} \
          --qualified_quality_phred {params.qual} \
          --length_required {params.len} \
          --detect_adapter_for_pe \
          --thread {threads}
        """

rule multiqc:
    input:
        expand("results/fastqc/{sample}_{read}_fastqc.zip",
               sample=SAMPLES, read=["R1", "R2"]),
        expand("results/fastp/{sample}_fastp.json",
               sample=SAMPLES)
    output:
        "results/multiqc/multiqc_report.html"
    conda:
        "envs/qc.yaml"
    shell:
        "multiqc results/ -o results/multiqc/ --force"

Running the Pipeline

# Dry run
snakemake -n --cores 8

# Execute locally
snakemake --cores 8 --use-conda

# Execute on SLURM cluster
snakemake --cores 100 --use-conda --slurm \
  --default-resources slurm_partition=standard mem_mb=8000

# Generate DAG
snakemake --dag | dot -Tpdf > dag.pdf

Configuration

Create a config.yaml file in the same directory as the Snakefile:

samples:
  - sample1
  - sample2
  - sample3
min_quality: 20
min_length: 50