ChIP-seq

Overview

Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) identifies genome-wide binding sites of transcription factors, histone modifications, and other chromatin-associated proteins. After cross-linking, fragmentation, and immunoprecipitation with a target-specific antibody, enriched DNA fragments are sequenced. The analysis pipeline trims reads, aligns them to the reference genome, removes duplicates, calls peaks relative to an input control using MACS2, and generates normalised signal tracks for genome browser visualisation.

Pipeline Steps

  1. Trim adapters and low-quality bases with fastp.

  2. Align reads to the reference genome using Bowtie2.

  3. Filter low-quality alignments and remove PCR duplicates.

  4. Call peaks with MACS2 using matched input controls.

  5. Filter peaks against a blacklist of problematic regions.

  6. Generate normalised signal tracks (bigWig) with deepTools.

  7. Compute quality metrics including FRiP (Fraction of Reads in Peaks).

  8. Aggregate QC metrics into a MultiQC report.

Implementation

# Snakefile -- ChIP-seq Peak Calling Pipeline
configfile: "config.yaml"

SAMPLES   = config["samples"]     # ChIP samples
INPUTS    = config["inputs"]      # input controls per sample
REF       = config["bowtie2_index"]
BLACKLIST = config["blacklist"]
GENOME    = config.get("genome_size", "hs")

rule all:
    input:
        expand("results/peaks/{sample}_peaks.narrowPeak",
               sample=SAMPLES),
        expand("results/bigwig/{sample}.normalized.bw",
               sample=SAMPLES),
        "results/multiqc/multiqc_report.html"

rule fastp_trim:
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz"
    output:
        r1="results/trimmed/{sample}_R1.trimmed.fastq.gz",
        r2="results/trimmed/{sample}_R2.trimmed.fastq.gz",
        json="results/trimmed/{sample}_fastp.json"
    threads: 4
    conda: "envs/qc.yaml"
    shell:
        """
        fastp -i {input.r1} -I {input.r2} \
          -o {output.r1} -O {output.r2} \
          -j {output.json} \
          --detect_adapter_for_pe --thread {threads}
        """

rule bowtie2_align:
    input:
        r1="results/trimmed/{sample}_R1.trimmed.fastq.gz",
        r2="results/trimmed/{sample}_R2.trimmed.fastq.gz",
        index=REF
    output:
        bam="results/aligned/{sample}.sorted.bam"
    threads: 16
    conda: "envs/align.yaml"
    shell:
        """
        bowtie2 -x {input.index} \
          -1 {input.r1} -2 {input.r2} \
          --very-sensitive -X 1000 \
          --threads {threads} \
          | samtools sort -@ 4 -o {output.bam} -
        samtools index {output.bam}
        """

rule filter_and_dedup:
    input:
        "results/aligned/{sample}.sorted.bam"
    output:
        bam="results/filtered/{sample}.filtered.bam",
        metrics="results/filtered/{sample}.dup_metrics.txt"
    threads: 8
    conda: "envs/align.yaml"
    shell:
        """
        samtools view -@ {threads} -h -q 30 -f 2 -F 1804 \
          {input} \
          | samtools sort -@ 4 -o results/filtered/{wildcards.sample}.tmp.bam -
        sambamba markdup -t {threads} --remove-duplicates \
          results/filtered/{wildcards.sample}.tmp.bam {output.bam} \
          2> {output.metrics}
        samtools index {output.bam}
        rm -f results/filtered/{wildcards.sample}.tmp.bam
        """

rule macs2_callpeak:
    input:
        treatment="results/filtered/{sample}.filtered.bam",
        control=lambda wc: f"results/filtered/{INPUTS[wc.sample]}.filtered.bam"
    output:
        peaks="results/peaks/{sample}_peaks.narrowPeak",
        xls="results/peaks/{sample}_peaks.xls"
    params:
        genome=GENOME,
        name="{sample}"
    conda: "envs/peaks.yaml"
    shell:
        """
        macs2 callpeak \
          -t {input.treatment} -c {input.control} \
          -f BAMPE -g {params.genome} \
          -n {params.name} --outdir results/peaks/ \
          -q 0.05
        """

rule filter_blacklist:
    input:
        peaks="results/peaks/{sample}_peaks.narrowPeak",
        blacklist=BLACKLIST
    output:
        "results/peaks/{sample}_peaks.filtered.narrowPeak"
    conda: "envs/peaks.yaml"
    shell:
        """
        bedtools intersect -v \
          -a {input.peaks} -b {input.blacklist} \
          > {output}
        """

rule bigwig:
    input:
        "results/filtered/{sample}.filtered.bam"
    output:
        "results/bigwig/{sample}.normalized.bw"
    threads: 8
    conda: "envs/peaks.yaml"
    shell:
        """
        bamCoverage -b {input} -o {output} \
          --normalizeUsing RPGC \
          --effectiveGenomeSize 2913022398 \
          --binSize 10 --numberOfProcessors {threads} \
          --extendReads 200
        """

rule frip_score:
    input:
        bam="results/filtered/{sample}.filtered.bam",
        peaks="results/peaks/{sample}_peaks.filtered.narrowPeak"
    output:
        "results/qc/{sample}_frip.txt"
    conda: "envs/peaks.yaml"
    shell:
        """
        total=$(samtools view -c {input.bam})
        in_peaks=$(bedtools intersect -a {input.bam} -b {input.peaks} \
          -u -ubam | samtools view -c)
        echo -e "{wildcards.sample}\t${{in_peaks}}\t${{total}}" > {output}
        """

rule multiqc:
    input:
        expand("results/trimmed/{sample}_fastp.json",
               sample=list(SAMPLES) + list(INPUTS.values())),
        expand("results/filtered/{sample}.dup_metrics.txt",
               sample=list(SAMPLES) + list(INPUTS.values())),
        expand("results/peaks/{sample}_peaks.xls", sample=SAMPLES)
    output:
        "results/multiqc/multiqc_report.html"
    conda: "envs/qc.yaml"
    shell:
        "multiqc results/ -o results/multiqc/ --force"

See Also

  • Snakemake – Snakemake workflow manager overview

  • Nextflow – Nextflow DSL2 workflow manager overview

  • ATAC-seq – ATAC-seq workflow (related chromatin profiling assay)

  • Bowtie2 – Bowtie2 aligner reference

  • Epigenomics – Epigenomics tools