Bisulfite Sequencing (WGBS)

Overview

Whole-genome bisulfite sequencing (WGBS) provides single-nucleotide resolution maps of DNA methylation across the entire genome. Sodium bisulfite treatment converts unmethylated cytosines to uracils while leaving methylated cytosines intact. Sequencing reads are then aligned to a bisulfite-converted reference genome, and the methylation status of each CpG (or non-CpG) site is determined from the ratio of converted to unconverted bases. This pipeline implements a standard WGBS analysis: adapter trimming with Trim Galore, alignment and deduplication with Bismark, methylation extraction, and aggregated quality reporting with MultiQC.

Pipeline Steps

  1. Trim Galore – Remove adapter sequences and low-quality bases from paired-end reads. Trim Galore automatically detects Illumina adapter contamination and applies quality trimming.

  2. Bismark align – Align trimmed reads to a bisulfite-converted reference genome using Bismark (wrapping Bowtie 2). Bismark handles the four possible bisulfite-converted strands transparently.

  3. Bismark dedup – Remove PCR duplicates from the aligned BAM file. Deduplication is critical in WGBS because duplicate reads inflate apparent coverage and bias methylation estimates.

  4. Methylation extraction – Extract per-cytosine methylation calls from deduplicated alignments, producing bedGraph and coverage files.

  5. MultiQC – Aggregate Trim Galore, Bismark alignment, deduplication, and methylation extraction reports into a single HTML summary.

Implementation

# Snakefile -- WGBS Bisulfite Sequencing Pipeline
configfile: "config.yaml"

SAMPLES     = config["samples"]
BISMARK_IDX = config["bismark_index"]

rule all:
    input:
        expand("results/methylation/{sample}.bismark.cov.gz", sample=SAMPLES),
        "results/multiqc/multiqc_report.html"

rule trim_galore:
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz"
    output:
        r1="results/trimmed/{sample}_R1_val_1.fq.gz",
        r2="results/trimmed/{sample}_R2_val_2.fq.gz",
        report_r1="results/trimmed/{sample}_R1.fastq.gz_trimming_report.txt",
        report_r2="results/trimmed/{sample}_R2.fastq.gz_trimming_report.txt"
    threads: 4
    conda: "envs/trimgalore.yaml"
    shell:
        """
        trim_galore --paired --cores {threads} \
          -o results/trimmed/ \
          {input.r1} {input.r2}
        """

rule bismark_align:
    input:
        r1="results/trimmed/{sample}_R1_val_1.fq.gz",
        r2="results/trimmed/{sample}_R2_val_2.fq.gz"
    output:
        bam="results/bismark/{sample}_R1_val_1_bismark_bt2_pe.bam",
        report="results/bismark/{sample}_R1_val_1_bismark_bt2_PE_report.txt"
    threads: 8
    params:
        idx=BISMARK_IDX
    conda: "envs/bismark.yaml"
    shell:
        """
        bismark --genome {params.idx} \
          -1 {input.r1} -2 {input.r2} \
          --parallel {threads} \
          -o results/bismark/
        """

rule bismark_dedup:
    input:
        "results/bismark/{sample}_R1_val_1_bismark_bt2_pe.bam"
    output:
        bam="results/dedup/{sample}.deduplicated.bam",
        report="results/dedup/{sample}.deduplication_report.txt"
    conda: "envs/bismark.yaml"
    shell:
        """
        deduplicate_bismark --paired --bam \
          --output_dir results/dedup/ \
          {input}
        """

rule methylation_extract:
    input:
        "results/dedup/{sample}.deduplicated.bam"
    output:
        "results/methylation/{sample}.bismark.cov.gz"
    params:
        idx=BISMARK_IDX
    threads: 8
    conda: "envs/bismark.yaml"
    shell:
        """
        bismark_methylation_extractor --paired-end \
          --comprehensive --merge_non_CpG \
          --parallel {threads} \
          --genome_folder {params.idx} \
          --bedGraph --counts \
          -o results/methylation/ \
          {input}
        """

rule multiqc:
    input:
        expand("results/methylation/{sample}.bismark.cov.gz", sample=SAMPLES)
    output:
        "results/multiqc/multiqc_report.html"
    conda: "envs/multiqc.yaml"
    shell:
        "multiqc results/ -o results/multiqc/ --force"

Expected Output

After a successful run the output directory will contain:

  • results/trimmed/ – Adapter-trimmed FASTQ files and Trim Galore trimming reports.

  • results/bismark/ – Bismark-aligned BAM files and alignment reports showing mapping efficiency, C-to-T conversion rates, and strand information.

  • results/dedup/ – Deduplicated BAM files and deduplication reports indicating the fraction of duplicate reads removed.

  • results/methylation/ – Per-cytosine methylation coverage files (*.bismark.cov.gz), bedGraph files, and M-bias reports.

  • results/multiqc/multiqc_report.html – Aggregated quality report summarizing all pipeline stages.

See Also

  • Snakemake – Snakemake workflow manager overview

  • ChIP-seq – ChIP-seq pipeline for histone modification profiling