Metagenomic Classification
Overview
Metagenomic classification assigns taxonomic labels to sequencing reads without requiring assembly or marker-gene amplification, providing a culture-independent view of microbial community composition directly from shotgun or long-read sequencing data. This pipeline uses Kraken2 for fast k-mer-based taxonomic classification, Bracken for Bayesian re-estimation of species-level abundances, and Krona for interactive hierarchical visualization. An initial quality filtering step with Chopper removes low-quality and short reads to reduce spurious classifications.
Pipeline Steps
Chopper – Filter reads by minimum quality (Q10) and minimum length (1 kb) to remove low-confidence basecalls and adapter artifacts.
Kraken2 – Classify each read against a pre-built k-mer database, assigning taxonomic labels based on exact k-mer matches. The
--minimum-hit-groups 3parameter requires hits across multiple distinct minimizer groups to reduce false-positive classifications.Bracken – Re-estimate species-level abundances from the Kraken2 report using a Bayesian model that redistributes reads assigned to higher taxonomic levels. Reads are re-distributed based on read-length-specific expected k-mer distributions.
Krona – Convert the Kraken2 report into an interactive HTML visualization showing the hierarchical taxonomic composition of the sample.
Implementation
# Snakefile -- Metagenomic Classification Pipeline
configfile: "config.yaml"
SAMPLES = config["samples"]
KRAKEN2_DB = config["kraken2_db"]
rule all:
input:
expand("results/bracken/{sample}_bracken.txt", sample=SAMPLES),
expand("results/krona/{sample}_krona.html", sample=SAMPLES)
rule chopper_filter:
input:
"data/{sample}.fastq.gz"
output:
"results/filtered/{sample}.filtered.fastq.gz"
shell:
"zcat {input} | chopper --quality 10 --minlength 1000 | gzip > {output}"
rule kraken2:
input:
"results/filtered/{sample}.filtered.fastq.gz"
output:
classifications="results/kraken2/{sample}_classifications.txt",
report="results/kraken2/{sample}_report.txt"
params:
db=KRAKEN2_DB
threads: 8
shell:
"""
kraken2 --db {params.db} \
--output {output.classifications} \
--report {output.report} \
--minimum-hit-groups 3 \
--threads {threads} \
{input}
"""
rule bracken:
input:
report="results/kraken2/{sample}_report.txt"
output:
"results/bracken/{sample}_bracken.txt"
params:
db=KRAKEN2_DB
shell:
"""
bracken -d {params.db} \
-i {input.report} \
-o {output} \
-r 1000 -l S -t 10
"""
rule krona:
input:
report="results/kraken2/{sample}_report.txt"
output:
"results/krona/{sample}_krona.html"
shell:
"""
kreport2krona.py -r {input.report} -o results/krona/{wildcards.sample}_krona_input.txt
ktImportText results/krona/{wildcards.sample}_krona_input.txt \
-o {output}
"""
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.samplesheet = "samplesheet.csv"
params.kraken2_db = "ref/kraken2_db"
params.outdir = "results"
process CHOPPER_FILTER {
tag "${sample_id}"
publishDir "${params.outdir}/filtered", mode: 'copy'
input:
tuple val(sample_id), path(fastq)
output:
tuple val(sample_id), path("${sample_id}.filtered.fastq.gz"), emit: fastq
script:
"""
zcat ${fastq} | chopper --quality 10 --minlength 1000 \
| gzip > ${sample_id}.filtered.fastq.gz
"""
}
process KRAKEN2 {
tag "${sample_id}"
publishDir "${params.outdir}/kraken2", mode: 'copy'
cpus 8
memory '64 GB'
input:
tuple val(sample_id), path(fastq)
path(kraken2_db)
output:
tuple val(sample_id),
path("${sample_id}_classifications.txt"),
path("${sample_id}_report.txt"), emit: results
tuple val(sample_id),
path("${sample_id}_report.txt"), emit: report
script:
"""
kraken2 --db ${kraken2_db} \
--output ${sample_id}_classifications.txt \
--report ${sample_id}_report.txt \
--minimum-hit-groups 3 \
--threads ${task.cpus} \
${fastq}
"""
}
process BRACKEN {
tag "${sample_id}"
publishDir "${params.outdir}/bracken", mode: 'copy'
input:
tuple val(sample_id), path(report)
path(kraken2_db)
output:
tuple val(sample_id), path("${sample_id}_bracken.txt"), emit: abundances
script:
"""
bracken -d ${kraken2_db} \
-i ${report} \
-o ${sample_id}_bracken.txt \
-r 1000 -l S -t 10
"""
}
process KRONA {
tag "${sample_id}"
publishDir "${params.outdir}/krona", mode: 'copy'
input:
tuple val(sample_id), path(report)
output:
path("${sample_id}_krona.html"), emit: html
script:
"""
kreport2krona.py -r ${report} -o ${sample_id}_krona_input.txt
ktImportText ${sample_id}_krona_input.txt \
-o ${sample_id}_krona.html
"""
}
workflow {
ch_input = Channel.fromPath(params.samplesheet)
.splitCsv(header: true)
.map { row -> tuple(row.sample_id, file(row.fastq)) }
kraken2_db = Channel.value(file(params.kraken2_db))
CHOPPER_FILTER(ch_input)
KRAKEN2(CHOPPER_FILTER.out.fastq, kraken2_db)
BRACKEN(KRAKEN2.out.report, kraken2_db)
KRONA(KRAKEN2.out.report)
}
Expected Output
After a successful run the output directory will contain:
results/filtered/– Quality- and length-filtered FASTQ files.results/kraken2/<sample>_classifications.txt– Per-read taxonomic assignments from Kraken2.results/kraken2/<sample>_report.txt– Kraken2 summary report with read counts and percentages at each taxonomic level.results/bracken/<sample>_bracken.txt– Bracken species-level abundance estimates with re-distributed read counts and fractions.results/krona/<sample>_krona.html– Interactive Krona pie chart showing hierarchical taxonomic composition (viewable in any web browser).
See Also
16S Amplicon Sequencing – 16S amplicon sequencing pipeline with DADA2
Bacterial Genome Assembly (Nanopore) – Nanopore bacterial genome assembly