Bisulfite Sequencing (WGBS) =========================== Overview -------- Whole-genome bisulfite sequencing (WGBS) provides single-nucleotide resolution maps of DNA methylation across the entire genome. Sodium bisulfite treatment converts unmethylated cytosines to uracils while leaving methylated cytosines intact. Sequencing reads are then aligned to a bisulfite-converted reference genome, and the methylation status of each CpG (or non-CpG) site is determined from the ratio of converted to unconverted bases. This pipeline implements a standard WGBS analysis: adapter trimming with Trim Galore, alignment and deduplication with Bismark, methylation extraction, and aggregated quality reporting with MultiQC. Pipeline Steps -------------- 1. **Trim Galore** -- Remove adapter sequences and low-quality bases from paired-end reads. Trim Galore automatically detects Illumina adapter contamination and applies quality trimming. 2. **Bismark align** -- Align trimmed reads to a bisulfite-converted reference genome using Bismark (wrapping Bowtie 2). Bismark handles the four possible bisulfite-converted strands transparently. 3. **Bismark dedup** -- Remove PCR duplicates from the aligned BAM file. Deduplication is critical in WGBS because duplicate reads inflate apparent coverage and bias methylation estimates. 4. **Methylation extraction** -- Extract per-cytosine methylation calls from deduplicated alignments, producing bedGraph and coverage files. 5. **MultiQC** -- Aggregate Trim Galore, Bismark alignment, deduplication, and methylation extraction reports into a single HTML summary. Implementation -------------- .. tabs:: .. tab:: Snakemake .. code-block:: python # Snakefile -- WGBS Bisulfite Sequencing Pipeline configfile: "config.yaml" SAMPLES = config["samples"] BISMARK_IDX = config["bismark_index"] rule all: input: expand("results/methylation/{sample}.bismark.cov.gz", sample=SAMPLES), "results/multiqc/multiqc_report.html" rule trim_galore: input: r1="data/{sample}_R1.fastq.gz", r2="data/{sample}_R2.fastq.gz" output: r1="results/trimmed/{sample}_R1_val_1.fq.gz", r2="results/trimmed/{sample}_R2_val_2.fq.gz", report_r1="results/trimmed/{sample}_R1.fastq.gz_trimming_report.txt", report_r2="results/trimmed/{sample}_R2.fastq.gz_trimming_report.txt" threads: 4 conda: "envs/trimgalore.yaml" shell: """ trim_galore --paired --cores {threads} \ -o results/trimmed/ \ {input.r1} {input.r2} """ rule bismark_align: input: r1="results/trimmed/{sample}_R1_val_1.fq.gz", r2="results/trimmed/{sample}_R2_val_2.fq.gz" output: bam="results/bismark/{sample}_R1_val_1_bismark_bt2_pe.bam", report="results/bismark/{sample}_R1_val_1_bismark_bt2_PE_report.txt" threads: 8 params: idx=BISMARK_IDX conda: "envs/bismark.yaml" shell: """ bismark --genome {params.idx} \ -1 {input.r1} -2 {input.r2} \ --parallel {threads} \ -o results/bismark/ """ rule bismark_dedup: input: "results/bismark/{sample}_R1_val_1_bismark_bt2_pe.bam" output: bam="results/dedup/{sample}.deduplicated.bam", report="results/dedup/{sample}.deduplication_report.txt" conda: "envs/bismark.yaml" shell: """ deduplicate_bismark --paired --bam \ --output_dir results/dedup/ \ {input} """ rule methylation_extract: input: "results/dedup/{sample}.deduplicated.bam" output: "results/methylation/{sample}.bismark.cov.gz" params: idx=BISMARK_IDX threads: 8 conda: "envs/bismark.yaml" shell: """ bismark_methylation_extractor --paired-end \ --comprehensive --merge_non_CpG \ --parallel {threads} \ --genome_folder {params.idx} \ --bedGraph --counts \ -o results/methylation/ \ {input} """ rule multiqc: input: expand("results/methylation/{sample}.bismark.cov.gz", sample=SAMPLES) output: "results/multiqc/multiqc_report.html" conda: "envs/multiqc.yaml" shell: "multiqc results/ -o results/multiqc/ --force" .. tab:: Nextflow .. code-block:: groovy #!/usr/bin/env nextflow nextflow.enable.dsl = 2 params.samplesheet = "samplesheet.csv" params.bismark_index = "ref/bismark_index" params.outdir = "results" process TRIM_GALORE { tag "${sample_id}" publishDir "${params.outdir}/trimmed", mode: 'copy' cpus 4 input: tuple val(sample_id), path(r1), path(r2) output: tuple val(sample_id), path("*_val_1.fq.gz"), path("*_val_2.fq.gz"), emit: reads path("*_trimming_report.txt"), emit: reports script: """ trim_galore --paired --cores ${task.cpus} ${r1} ${r2} """ } process BISMARK_ALIGN { tag "${sample_id}" publishDir "${params.outdir}/bismark", mode: 'copy' cpus 8 memory '32 GB' input: tuple val(sample_id), path(r1), path(r2) path(bismark_index) output: tuple val(sample_id), path("*_bismark_bt2_pe.bam"), emit: bam path("*_PE_report.txt"), emit: report script: """ bismark --genome ${bismark_index} \ -1 ${r1} -2 ${r2} \ --parallel ${task.cpus} """ } process BISMARK_DEDUP { tag "${sample_id}" publishDir "${params.outdir}/dedup", mode: 'copy' input: tuple val(sample_id), path(bam) output: tuple val(sample_id), path("*.deduplicated.bam"), emit: bam path("*.deduplication_report.txt"), emit: report script: """ deduplicate_bismark --paired --bam ${bam} """ } process METHYL_EXTRACT { tag "${sample_id}" publishDir "${params.outdir}/methylation", mode: 'copy' cpus 8 input: tuple val(sample_id), path(bam) path(bismark_index) output: tuple val(sample_id), path("*.bismark.cov.gz"), emit: coverage path("*.txt"), emit: reports script: """ bismark_methylation_extractor --paired-end \ --comprehensive --merge_non_CpG \ --parallel ${task.cpus} \ --genome_folder ${bismark_index} \ --bedGraph --counts \ ${bam} """ } workflow { ch_input = Channel.fromPath(params.samplesheet) .splitCsv(header: true) .map { row -> tuple(row.sample_id, file(row.fastq_r1), file(row.fastq_r2)) } bismark_idx = Channel.value(file(params.bismark_index)) TRIM_GALORE(ch_input) BISMARK_ALIGN(TRIM_GALORE.out.reads, bismark_idx) BISMARK_DEDUP(BISMARK_ALIGN.out.bam) METHYL_EXTRACT(BISMARK_DEDUP.out.bam, bismark_idx) } Expected Output --------------- After a successful run the output directory will contain: * ``results/trimmed/`` -- Adapter-trimmed FASTQ files and Trim Galore trimming reports. * ``results/bismark/`` -- Bismark-aligned BAM files and alignment reports showing mapping efficiency, C-to-T conversion rates, and strand information. * ``results/dedup/`` -- Deduplicated BAM files and deduplication reports indicating the fraction of duplicate reads removed. * ``results/methylation/`` -- Per-cytosine methylation coverage files (``*.bismark.cov.gz``), bedGraph files, and M-bias reports. * ``results/multiqc/multiqc_report.html`` -- Aggregated quality report summarizing all pipeline stages. See Also -------- * :doc:`snakemake` -- Snakemake workflow manager overview * :doc:`chipseq` -- ChIP-seq pipeline for histone modification profiling