Snakemake ========= Overview -------- Snakemake is a Python-based workflow management system that uses a declarative syntax to define data analysis pipelines. Workflows are specified as a set of rules, each describing how to produce output files from input files. Snakemake automatically resolves dependencies between rules, determines execution order via a directed acyclic graph (DAG), and supports parallel execution across local cores or cluster nodes. Its integration with Conda, containers, and cluster schedulers (SLURM, PBS, LSF) makes it a standard choice for reproducible bioinformatics pipelines. Pipeline Steps -------------- A typical Snakemake QC pipeline follows these steps: 1. Define sample names and parameters in a configuration file (``config.yaml``). 2. Run FastQC on raw reads to generate per-sample quality reports. 3. Trim adapters and low-quality bases with fastp. 4. Aggregate all QC outputs into a single MultiQC report. QC Pipeline ----------- The Snakefile below implements the complete quality-control pipeline described in the textbook. It reads sample names from a YAML configuration file, runs FastQC on the raw paired-end reads, trims with fastp, and produces a combined MultiQC report. .. code-block:: python # Snakefile for basic QC pipeline configfile: "config.yaml" SAMPLES = config["samples"] rule all: input: expand("results/fastqc/{sample}_{read}_fastqc.html", sample=SAMPLES, read=["R1", "R2"]), expand("results/fastp/{sample}_R1.trimmed.fastq.gz", sample=SAMPLES), "results/multiqc/multiqc_report.html" rule fastqc_raw: input: "data/{sample}_{read}.fastq.gz" output: html="results/fastqc/{sample}_{read}_fastqc.html", zip="results/fastqc/{sample}_{read}_fastqc.zip" threads: 2 conda: "envs/qc.yaml" shell: "fastqc -t {threads} -o results/fastqc/ {input}" rule fastp_trim: input: r1="data/{sample}_R1.fastq.gz", r2="data/{sample}_R2.fastq.gz" output: r1="results/fastp/{sample}_R1.trimmed.fastq.gz", r2="results/fastp/{sample}_R2.trimmed.fastq.gz", html="results/fastp/{sample}_fastp.html", json="results/fastp/{sample}_fastp.json" threads: 4 conda: "envs/qc.yaml" params: qual=config.get("min_quality", 20), len=config.get("min_length", 50) shell: """ fastp -i {input.r1} -I {input.r2} \ -o {output.r1} -O {output.r2} \ -h {output.html} -j {output.json} \ --qualified_quality_phred {params.qual} \ --length_required {params.len} \ --detect_adapter_for_pe \ --thread {threads} """ rule multiqc: input: expand("results/fastqc/{sample}_{read}_fastqc.zip", sample=SAMPLES, read=["R1", "R2"]), expand("results/fastp/{sample}_fastp.json", sample=SAMPLES) output: "results/multiqc/multiqc_report.html" conda: "envs/qc.yaml" shell: "multiqc results/ -o results/multiqc/ --force" Running the Pipeline -------------------- .. code-block:: bash # Dry run snakemake -n --cores 8 # Execute locally snakemake --cores 8 --use-conda # Execute on SLURM cluster snakemake --cores 100 --use-conda --slurm \ --default-resources slurm_partition=standard mem_mb=8000 # Generate DAG snakemake --dag | dot -Tpdf > dag.pdf Configuration ------------- Create a ``config.yaml`` file in the same directory as the Snakefile: .. code-block:: yaml samples: - sample1 - sample2 - sample3 min_quality: 20 min_length: 50 See Also -------- * :doc:`nextflow` -- Nextflow DSL2 workflow manager * :doc:`wgs-variant-calling` -- WGS variant calling pipeline * :doc:`rnaseq-differential-expression` -- RNA-seq differential expression pipeline