16S Amplicon Sequencing
Overview
16S ribosomal RNA gene amplicon sequencing is the standard approach for profiling bacterial and archaeal community composition. The 16S rRNA gene contains hypervariable regions (V1–V9) flanked by conserved primer-binding sites, enabling PCR amplification from mixed microbial DNA. This pipeline implements a DADA2-based workflow: primers are removed with Cutadapt, reads are denoised into amplicon sequence variants (ASVs) with DADA2, taxonomy is assigned against a reference database (SILVA or Greengenes2), and downstream alpha/beta diversity analyses are performed. ASVs provide single-nucleotide resolution compared to traditional OTU clustering, improving sensitivity for detecting closely related taxa.
Pipeline Steps
Cutadapt – Remove PCR primer sequences from the 5’ ends of forward and reverse reads. Reads that do not contain the expected primer are discarded.
DADA2 – Learn error profiles, denoise reads, merge paired ends, remove chimeras, and assign taxonomy. DADA2 runs as a single R script that produces an ASV count table and taxonomy assignments.
Diversity analysis – Compute alpha diversity metrics (Shannon, observed ASVs, Faith PD), beta diversity ordinations (Bray-Curtis PCoA, weighted UniFrac), and generate summary visualizations.
Implementation
# Snakefile -- 16S Amplicon Pipeline (DADA2)
configfile: "config.yaml"
SAMPLES = config["samples"]
FWD_PRIMER = config["forward_primer"] # e.g., "CCTACGGGNGGCWGCAG"
REV_PRIMER = config["reverse_primer"] # e.g., "GACTACHVGGGTATCTAATCC"
SILVA_DB = config["silva_db"]
rule all:
input:
"results/dada2/asv_table.tsv",
"results/dada2/taxonomy.tsv",
"results/diversity/alpha_diversity.tsv",
"results/diversity/beta_pcoa.pdf"
rule cutadapt_primers:
input:
r1="data/{sample}_R1.fastq.gz",
r2="data/{sample}_R2.fastq.gz"
output:
r1="results/cutadapt/{sample}_R1.trimmed.fastq.gz",
r2="results/cutadapt/{sample}_R2.trimmed.fastq.gz"
params:
fwd=FWD_PRIMER,
rev=REV_PRIMER
threads: 4
conda: "envs/cutadapt.yaml"
shell:
"""
cutadapt -g {params.fwd} -G {params.rev} \
--discard-untrimmed \
--cores {threads} \
-o {output.r1} -p {output.r2} \
{input.r1} {input.r2}
"""
rule dada2_pipeline:
input:
r1=expand("results/cutadapt/{sample}_R1.trimmed.fastq.gz",
sample=SAMPLES),
r2=expand("results/cutadapt/{sample}_R2.trimmed.fastq.gz",
sample=SAMPLES)
output:
asv="results/dada2/asv_table.tsv",
tax="results/dada2/taxonomy.tsv",
seqs="results/dada2/rep_seqs.fasta"
params:
silva=SILVA_DB
conda: "envs/dada2.yaml"
script:
"scripts/run_dada2.R"
rule diversity_analysis:
input:
asv="results/dada2/asv_table.tsv",
tax="results/dada2/taxonomy.tsv",
meta=config["metadata"]
output:
alpha="results/diversity/alpha_diversity.tsv",
pcoa="results/diversity/beta_pcoa.pdf"
conda: "envs/phyloseq.yaml"
script:
"scripts/run_diversity.R"
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.samplesheet = "samplesheet.csv"
params.forward_primer = "CCTACGGGNGGCWGCAG"
params.reverse_primer = "GACTACHVGGGTATCTAATCC"
params.silva_db = "ref/silva_nr99_v138.1_train_set.fa.gz"
params.metadata = "metadata.tsv"
params.outdir = "results"
process CUTADAPT {
tag "${sample_id}"
publishDir "${params.outdir}/cutadapt", mode: 'copy'
cpus 4
input:
tuple val(sample_id), path(r1), path(r2)
output:
tuple val(sample_id),
path("${sample_id}_R1.trimmed.fastq.gz"),
path("${sample_id}_R2.trimmed.fastq.gz"), emit: reads
script:
"""
cutadapt -g ${params.forward_primer} -G ${params.reverse_primer} \
--discard-untrimmed \
--cores ${task.cpus} \
-o ${sample_id}_R1.trimmed.fastq.gz \
-p ${sample_id}_R2.trimmed.fastq.gz \
${r1} ${r2}
"""
}
process DADA2 {
publishDir "${params.outdir}/dada2", mode: 'copy'
cpus 8
memory '16 GB'
input:
path(trimmed_reads)
path(silva_db)
output:
path("asv_table.tsv"), emit: asv
path("taxonomy.tsv"), emit: tax
path("rep_seqs.fasta"), emit: seqs
script:
"""
Rscript ${projectDir}/scripts/run_dada2.R \
--input_dir . \
--silva_db ${silva_db} \
--output_dir .
"""
}
process DIVERSITY {
publishDir "${params.outdir}/diversity", mode: 'copy'
input:
path(asv_table)
path(taxonomy)
path(metadata)
output:
path("alpha_diversity.tsv"), emit: alpha
path("beta_pcoa.pdf"), emit: pcoa
script:
"""
Rscript ${projectDir}/scripts/run_diversity.R \
--asv_table ${asv_table} \
--taxonomy ${taxonomy} \
--metadata ${metadata} \
--output_dir .
"""
}
workflow {
ch_input = Channel.fromPath(params.samplesheet)
.splitCsv(header: true)
.map { row -> tuple(row.sample_id, file(row.fastq_r1), file(row.fastq_r2)) }
silva = Channel.value(file(params.silva_db))
meta = Channel.value(file(params.metadata))
CUTADAPT(ch_input)
// Collect all trimmed FASTQs into a single directory for DADA2
trimmed_all = CUTADAPT.out.reads
.map { sample_id, r1, r2 -> [r1, r2] }
.flatten()
.collect()
DADA2(trimmed_all, silva)
DIVERSITY(DADA2.out.asv, DADA2.out.tax, meta)
}
Expected Output
After a successful run the output directory will contain:
results/cutadapt/– Primer-trimmed FASTQ files per sample.results/dada2/asv_table.tsv– ASV count matrix (samples x ASVs).results/dada2/taxonomy.tsv– Taxonomic assignments for each ASV from kingdom through species.results/dada2/rep_seqs.fasta– Representative sequences for each ASV.results/diversity/alpha_diversity.tsv– Per-sample Shannon index, observed ASVs, and Faith phylogenetic diversity.results/diversity/beta_pcoa.pdf– Principal coordinates analysis plot based on Bray-Curtis dissimilarity.
See Also
Metagenomic Classification – Metagenomic classification with Kraken2
Snakemake – Snakemake workflow manager overview