16S Amplicon Sequencing

Overview

16S ribosomal RNA gene amplicon sequencing is the standard approach for profiling bacterial and archaeal community composition. The 16S rRNA gene contains hypervariable regions (V1–V9) flanked by conserved primer-binding sites, enabling PCR amplification from mixed microbial DNA. This pipeline implements a DADA2-based workflow: primers are removed with Cutadapt, reads are denoised into amplicon sequence variants (ASVs) with DADA2, taxonomy is assigned against a reference database (SILVA or Greengenes2), and downstream alpha/beta diversity analyses are performed. ASVs provide single-nucleotide resolution compared to traditional OTU clustering, improving sensitivity for detecting closely related taxa.

Pipeline Steps

  1. Cutadapt – Remove PCR primer sequences from the 5’ ends of forward and reverse reads. Reads that do not contain the expected primer are discarded.

  2. DADA2 – Learn error profiles, denoise reads, merge paired ends, remove chimeras, and assign taxonomy. DADA2 runs as a single R script that produces an ASV count table and taxonomy assignments.

  3. Diversity analysis – Compute alpha diversity metrics (Shannon, observed ASVs, Faith PD), beta diversity ordinations (Bray-Curtis PCoA, weighted UniFrac), and generate summary visualizations.

Implementation

# Snakefile -- 16S Amplicon Pipeline (DADA2)
configfile: "config.yaml"

SAMPLES   = config["samples"]
FWD_PRIMER = config["forward_primer"]   # e.g., "CCTACGGGNGGCWGCAG"
REV_PRIMER = config["reverse_primer"]   # e.g., "GACTACHVGGGTATCTAATCC"
SILVA_DB   = config["silva_db"]

rule all:
    input:
        "results/dada2/asv_table.tsv",
        "results/dada2/taxonomy.tsv",
        "results/diversity/alpha_diversity.tsv",
        "results/diversity/beta_pcoa.pdf"

rule cutadapt_primers:
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz"
    output:
        r1="results/cutadapt/{sample}_R1.trimmed.fastq.gz",
        r2="results/cutadapt/{sample}_R2.trimmed.fastq.gz"
    params:
        fwd=FWD_PRIMER,
        rev=REV_PRIMER
    threads: 4
    conda: "envs/cutadapt.yaml"
    shell:
        """
        cutadapt -g {params.fwd} -G {params.rev} \
          --discard-untrimmed \
          --cores {threads} \
          -o {output.r1} -p {output.r2} \
          {input.r1} {input.r2}
        """

rule dada2_pipeline:
    input:
        r1=expand("results/cutadapt/{sample}_R1.trimmed.fastq.gz",
                  sample=SAMPLES),
        r2=expand("results/cutadapt/{sample}_R2.trimmed.fastq.gz",
                  sample=SAMPLES)
    output:
        asv="results/dada2/asv_table.tsv",
        tax="results/dada2/taxonomy.tsv",
        seqs="results/dada2/rep_seqs.fasta"
    params:
        silva=SILVA_DB
    conda: "envs/dada2.yaml"
    script:
        "scripts/run_dada2.R"

rule diversity_analysis:
    input:
        asv="results/dada2/asv_table.tsv",
        tax="results/dada2/taxonomy.tsv",
        meta=config["metadata"]
    output:
        alpha="results/diversity/alpha_diversity.tsv",
        pcoa="results/diversity/beta_pcoa.pdf"
    conda: "envs/phyloseq.yaml"
    script:
        "scripts/run_diversity.R"

Expected Output

After a successful run the output directory will contain:

  • results/cutadapt/ – Primer-trimmed FASTQ files per sample.

  • results/dada2/asv_table.tsv – ASV count matrix (samples x ASVs).

  • results/dada2/taxonomy.tsv – Taxonomic assignments for each ASV from kingdom through species.

  • results/dada2/rep_seqs.fasta – Representative sequences for each ASV.

  • results/diversity/alpha_diversity.tsv – Per-sample Shannon index, observed ASVs, and Faith phylogenetic diversity.

  • results/diversity/beta_pcoa.pdf – Principal coordinates analysis plot based on Bray-Curtis dissimilarity.

See Also