ATAC-seq

Overview

Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) maps genome-wide chromatin accessibility. Tn5 transposase inserts sequencing adapters into open chromatin regions, and the resulting fragments are sequenced. The analysis pipeline trims reads, aligns to the reference genome, removes duplicates and mitochondrial reads, shifts alignments to account for the Tn5 insertion offset, calls peaks representing accessible regions, and produces signal tracks for visualisation in a genome browser.

Pipeline Steps

  1. Trim adapters and low-quality bases with fastp.

  2. Align reads to the reference genome using Bowtie2.

  3. Filter out mitochondrial reads, low-quality alignments, and duplicates.

  4. Apply the Tn5 offset shift (+4 bp/–5 bp) to aligned reads.

  5. Call peaks with MACS2 in narrow-peak mode.

  6. Generate normalised signal tracks (bigWig) with deepTools.

  7. Produce a fragment-size distribution plot as a QC metric.

  8. Aggregate QC metrics into a MultiQC report.

Implementation

# Snakefile -- ATAC-seq Peak Calling Pipeline
configfile: "config.yaml"

SAMPLES   = config["samples"]
REF       = config["bowtie2_index"]
BLACKLIST = config["blacklist"]
GENOME    = config.get("genome_size", "hs")

rule all:
    input:
        expand("results/peaks/{sample}_peaks.narrowPeak",
               sample=SAMPLES),
        expand("results/bigwig/{sample}.normalized.bw",
               sample=SAMPLES),
        "results/multiqc/multiqc_report.html"

rule fastp_trim:
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz"
    output:
        r1="results/trimmed/{sample}_R1.trimmed.fastq.gz",
        r2="results/trimmed/{sample}_R2.trimmed.fastq.gz",
        json="results/trimmed/{sample}_fastp.json"
    threads: 4
    conda: "envs/qc.yaml"
    shell:
        """
        fastp -i {input.r1} -I {input.r2} \
          -o {output.r1} -O {output.r2} \
          -j {output.json} \
          --detect_adapter_for_pe --thread {threads}
        """

rule bowtie2_align:
    input:
        r1="results/trimmed/{sample}_R1.trimmed.fastq.gz",
        r2="results/trimmed/{sample}_R2.trimmed.fastq.gz",
        index=REF
    output:
        bam="results/aligned/{sample}.sorted.bam"
    threads: 16
    conda: "envs/align.yaml"
    shell:
        """
        bowtie2 -x {input.index} \
          -1 {input.r1} -2 {input.r2} \
          --very-sensitive -X 2000 --no-mixed --no-discordant \
          --threads {threads} \
          | samtools sort -@ 4 -o {output.bam} -
        samtools index {output.bam}
        """

rule filter_and_dedup:
    input:
        "results/aligned/{sample}.sorted.bam"
    output:
        bam="results/filtered/{sample}.filtered.bam",
        metrics="results/filtered/{sample}.dup_metrics.txt"
    threads: 8
    conda: "envs/align.yaml"
    shell:
        """
        # Remove mitochondrial, low-quality, and unmapped reads
        samtools view -@ {threads} -h -q 30 -f 2 -F 1804 \
          {input} | grep -v chrM \
          | samtools sort -@ 4 -o results/filtered/{wildcards.sample}.tmp.bam -
        # Mark and remove duplicates
        sambamba markdup -t {threads} --remove-duplicates \
          results/filtered/{wildcards.sample}.tmp.bam {output.bam} \
          2> {output.metrics}
        samtools index {output.bam}
        rm -f results/filtered/{wildcards.sample}.tmp.bam
        """

rule tn5_shift:
    input:
        "results/filtered/{sample}.filtered.bam"
    output:
        "results/shifted/{sample}.shifted.bam"
    threads: 4
    conda: "envs/align.yaml"
    shell:
        """
        alignmentSieve -b {input} -o {output} \
          --ATACshift --numberOfProcessors {threads}
        samtools sort -@ {threads} -o {output} {output}
        samtools index {output}
        """

rule macs2_callpeak:
    input:
        "results/shifted/{sample}.shifted.bam"
    output:
        peaks="results/peaks/{sample}_peaks.narrowPeak",
        xls="results/peaks/{sample}_peaks.xls"
    params:
        genome=GENOME,
        name="{sample}"
    conda: "envs/peaks.yaml"
    shell:
        """
        macs2 callpeak -t {input} \
          -f BAMPE -g {params.genome} \
          -n {params.name} --outdir results/peaks/ \
          --nomodel --shift -100 --extsize 200 \
          --keep-dup all -q 0.05
        """

rule bigwig:
    input:
        "results/shifted/{sample}.shifted.bam"
    output:
        "results/bigwig/{sample}.normalized.bw"
    threads: 8
    conda: "envs/peaks.yaml"
    shell:
        """
        bamCoverage -b {input} -o {output} \
          --normalizeUsing RPGC \
          --effectiveGenomeSize 2913022398 \
          --binSize 10 --numberOfProcessors {threads} \
          --blackListFileName {BLACKLIST}
        """

rule multiqc:
    input:
        expand("results/trimmed/{sample}_fastp.json", sample=SAMPLES),
        expand("results/filtered/{sample}.dup_metrics.txt",
               sample=SAMPLES),
        expand("results/peaks/{sample}_peaks.xls", sample=SAMPLES)
    output:
        "results/multiqc/multiqc_report.html"
    conda: "envs/qc.yaml"
    shell:
        "multiqc results/ -o results/multiqc/ --force"

See Also

  • Snakemake – Snakemake workflow manager overview

  • Nextflow – Nextflow DSL2 workflow manager overview

  • ChIP-seq – ChIP-seq workflow (related chromatin profiling assay)

  • Bowtie2 – Bowtie2 aligner reference

  • Epigenomics – Epigenomics tools