Bisulfite Sequencing (WGBS)

Overview

Whole-genome bisulfite sequencing (WGBS) provides single-nucleotide resolution maps of DNA methylation across the entire genome. Sodium bisulfite treatment converts unmethylated cytosines to uracils while leaving methylated cytosines intact. Sequencing reads are then aligned to a bisulfite-converted reference genome, and the methylation status of each CpG (or non-CpG) site is determined from the ratio of converted to unconverted bases. This pipeline implements a standard WGBS analysis: adapter trimming with Trim Galore, alignment and deduplication with Bismark, methylation extraction, and aggregated quality reporting with MultiQC.

Pipeline Steps

Trim Galore – Remove adapter sequences and low-quality bases from paired-end reads. Trim Galore automatically detects Illumina adapter contamination and applies quality trimming.
Bismark align – Align trimmed reads to a bisulfite-converted reference genome using Bismark (wrapping Bowtie 2). Bismark handles the four possible bisulfite-converted strands transparently.
Bismark dedup – Remove PCR duplicates from the aligned BAM file. Deduplication is critical in WGBS because duplicate reads inflate apparent coverage and bias methylation estimates.
Methylation extraction – Extract per-cytosine methylation calls from deduplicated alignments, producing bedGraph and coverage files.
MultiQC – Aggregate Trim Galore, Bismark alignment, deduplication, and methylation extraction reports into a single HTML summary.

Implementation

# Snakefile -- WGBS Bisulfite Sequencing Pipeline
configfile: "config.yaml"

SAMPLES     = config["samples"]
BISMARK_IDX = config["bismark_index"]

rule all:
    input:
        expand("results/methylation/{sample}.bismark.cov.gz", sample=SAMPLES),
        "results/multiqc/multiqc_report.html"

rule trim_galore:
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz"
    output:
        r1="results/trimmed/{sample}_R1_val_1.fq.gz",
        r2="results/trimmed/{sample}_R2_val_2.fq.gz",
        report_r1="results/trimmed/{sample}_R1.fastq.gz_trimming_report.txt",
        report_r2="results/trimmed/{sample}_R2.fastq.gz_trimming_report.txt"
    threads: 4
    conda: "envs/trimgalore.yaml"
    shell:
        """
        trim_galore --paired --cores {threads} \
          -o results/trimmed/ \
          {input.r1} {input.r2}
        """

rule bismark_align:
    input:
        r1="results/trimmed/{sample}_R1_val_1.fq.gz",
        r2="results/trimmed/{sample}_R2_val_2.fq.gz"
    output:
        bam="results/bismark/{sample}_R1_val_1_bismark_bt2_pe.bam",
        report="results/bismark/{sample}_R1_val_1_bismark_bt2_PE_report.txt"
    threads: 8
    params:
        idx=BISMARK_IDX
    conda: "envs/bismark.yaml"
    shell:
        """
        bismark --genome {params.idx} \
          -1 {input.r1} -2 {input.r2} \
          --parallel {threads} \
          -o results/bismark/
        """

rule bismark_dedup:
    input:
        "results/bismark/{sample}_R1_val_1_bismark_bt2_pe.bam"
    output:
        bam="results/dedup/{sample}.deduplicated.bam",
        report="results/dedup/{sample}.deduplication_report.txt"
    conda: "envs/bismark.yaml"
    shell:
        """
        deduplicate_bismark --paired --bam \
          --output_dir results/dedup/ \
          {input}
        """

rule methylation_extract:
    input:
        "results/dedup/{sample}.deduplicated.bam"
    output:
        "results/methylation/{sample}.bismark.cov.gz"
    params:
        idx=BISMARK_IDX
    threads: 8
    conda: "envs/bismark.yaml"
    shell:
        """
        bismark_methylation_extractor --paired-end \
          --comprehensive --merge_non_CpG \
          --parallel {threads} \
          --genome_folder {params.idx} \
          --bedGraph --counts \
          -o results/methylation/ \
          {input}
        """

rule multiqc:
    input:
        expand("results/methylation/{sample}.bismark.cov.gz", sample=SAMPLES)
    output:
        "results/multiqc/multiqc_report.html"
    conda: "envs/multiqc.yaml"
    shell:
        "multiqc results/ -o results/multiqc/ --force"

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.samplesheet  = "samplesheet.csv"
params.bismark_index = "ref/bismark_index"
params.outdir        = "results"

process TRIM_GALORE {
    tag "${sample_id}"
    publishDir "${params.outdir}/trimmed", mode: 'copy'
    cpus 4
    input:
        tuple val(sample_id), path(r1), path(r2)
    output:
        tuple val(sample_id),
              path("*_val_1.fq.gz"),
              path("*_val_2.fq.gz"), emit: reads
        path("*_trimming_report.txt"), emit: reports
    script:
    """
    trim_galore --paired --cores ${task.cpus} ${r1} ${r2}
    """
}

process BISMARK_ALIGN {
    tag "${sample_id}"
    publishDir "${params.outdir}/bismark", mode: 'copy'
    cpus 8
    memory '32 GB'
    input:
        tuple val(sample_id), path(r1), path(r2)
        path(bismark_index)
    output:
        tuple val(sample_id), path("*_bismark_bt2_pe.bam"), emit: bam
        path("*_PE_report.txt"), emit: report
    script:
    """
    bismark --genome ${bismark_index} \
      -1 ${r1} -2 ${r2} \
      --parallel ${task.cpus}
    """
}

process BISMARK_DEDUP {
    tag "${sample_id}"
    publishDir "${params.outdir}/dedup", mode: 'copy'
    input:
        tuple val(sample_id), path(bam)
    output:
        tuple val(sample_id), path("*.deduplicated.bam"), emit: bam
        path("*.deduplication_report.txt"), emit: report
    script:
    """
    deduplicate_bismark --paired --bam ${bam}
    """
}

process METHYL_EXTRACT {
    tag "${sample_id}"
    publishDir "${params.outdir}/methylation", mode: 'copy'
    cpus 8
    input:
        tuple val(sample_id), path(bam)
        path(bismark_index)
    output:
        tuple val(sample_id), path("*.bismark.cov.gz"), emit: coverage
        path("*.txt"), emit: reports
    script:
    """
    bismark_methylation_extractor --paired-end \
      --comprehensive --merge_non_CpG \
      --parallel ${task.cpus} \
      --genome_folder ${bismark_index} \
      --bedGraph --counts \
      ${bam}
    """
}

workflow {
    ch_input = Channel.fromPath(params.samplesheet)
        .splitCsv(header: true)
        .map { row -> tuple(row.sample_id, file(row.fastq_r1), file(row.fastq_r2)) }

    bismark_idx = Channel.value(file(params.bismark_index))

    TRIM_GALORE(ch_input)
    BISMARK_ALIGN(TRIM_GALORE.out.reads, bismark_idx)
    BISMARK_DEDUP(BISMARK_ALIGN.out.bam)
    METHYL_EXTRACT(BISMARK_DEDUP.out.bam, bismark_idx)
}

Expected Output

After a successful run the output directory will contain:

results/trimmed/ – Adapter-trimmed FASTQ files and Trim Galore trimming reports.
results/bismark/ – Bismark-aligned BAM files and alignment reports showing mapping efficiency, C-to-T conversion rates, and strand information.
results/dedup/ – Deduplicated BAM files and deduplication reports indicating the fraction of duplicate reads removed.
results/methylation/ – Per-cytosine methylation coverage files (*.bismark.cov.gz), bedGraph files, and M-bias reports.
results/multiqc/multiqc_report.html – Aggregated quality report summarizing all pipeline stages.

Bisulfite Sequencing (WGBS)

Overview

Pipeline Steps

Implementation

Expected Output

See Also