Snakemake
Overview
Snakemake is a Python-based workflow management system that uses a declarative syntax to define data analysis pipelines. Workflows are specified as a set of rules, each describing how to produce output files from input files. Snakemake automatically resolves dependencies between rules, determines execution order via a directed acyclic graph (DAG), and supports parallel execution across local cores or cluster nodes. Its integration with Conda, containers, and cluster schedulers (SLURM, PBS, LSF) makes it a standard choice for reproducible bioinformatics pipelines.
Pipeline Steps
A typical Snakemake QC pipeline follows these steps:
Define sample names and parameters in a configuration file (
config.yaml).Run FastQC on raw reads to generate per-sample quality reports.
Trim adapters and low-quality bases with fastp.
Aggregate all QC outputs into a single MultiQC report.
QC Pipeline
The Snakefile below implements the complete quality-control pipeline described in the textbook. It reads sample names from a YAML configuration file, runs FastQC on the raw paired-end reads, trims with fastp, and produces a combined MultiQC report.
# Snakefile for basic QC pipeline
configfile: "config.yaml"
SAMPLES = config["samples"]
rule all:
input:
expand("results/fastqc/{sample}_{read}_fastqc.html",
sample=SAMPLES, read=["R1", "R2"]),
expand("results/fastp/{sample}_R1.trimmed.fastq.gz",
sample=SAMPLES),
"results/multiqc/multiqc_report.html"
rule fastqc_raw:
input:
"data/{sample}_{read}.fastq.gz"
output:
html="results/fastqc/{sample}_{read}_fastqc.html",
zip="results/fastqc/{sample}_{read}_fastqc.zip"
threads: 2
conda:
"envs/qc.yaml"
shell:
"fastqc -t {threads} -o results/fastqc/ {input}"
rule fastp_trim:
input:
r1="data/{sample}_R1.fastq.gz",
r2="data/{sample}_R2.fastq.gz"
output:
r1="results/fastp/{sample}_R1.trimmed.fastq.gz",
r2="results/fastp/{sample}_R2.trimmed.fastq.gz",
html="results/fastp/{sample}_fastp.html",
json="results/fastp/{sample}_fastp.json"
threads: 4
conda:
"envs/qc.yaml"
params:
qual=config.get("min_quality", 20),
len=config.get("min_length", 50)
shell:
"""
fastp -i {input.r1} -I {input.r2} \
-o {output.r1} -O {output.r2} \
-h {output.html} -j {output.json} \
--qualified_quality_phred {params.qual} \
--length_required {params.len} \
--detect_adapter_for_pe \
--thread {threads}
"""
rule multiqc:
input:
expand("results/fastqc/{sample}_{read}_fastqc.zip",
sample=SAMPLES, read=["R1", "R2"]),
expand("results/fastp/{sample}_fastp.json",
sample=SAMPLES)
output:
"results/multiqc/multiqc_report.html"
conda:
"envs/qc.yaml"
shell:
"multiqc results/ -o results/multiqc/ --force"
Running the Pipeline
# Dry run
snakemake -n --cores 8
# Execute locally
snakemake --cores 8 --use-conda
# Execute on SLURM cluster
snakemake --cores 100 --use-conda --slurm \
--default-resources slurm_partition=standard mem_mb=8000
# Generate DAG
snakemake --dag | dot -Tpdf > dag.pdf
Configuration
Create a config.yaml file in the same directory as the Snakefile:
samples:
- sample1
- sample2
- sample3
min_quality: 20
min_length: 50
See Also
Nextflow – Nextflow DSL2 workflow manager
WGS Variant Calling – WGS variant calling pipeline
RNA-seq Differential Expression – RNA-seq differential expression pipeline