Metagenomic Classification ========================== Overview -------- Metagenomic classification assigns taxonomic labels to sequencing reads without requiring assembly or marker-gene amplification, providing a culture-independent view of microbial community composition directly from shotgun or long-read sequencing data. This pipeline uses Kraken2 for fast k-mer-based taxonomic classification, Bracken for Bayesian re-estimation of species-level abundances, and Krona for interactive hierarchical visualization. An initial quality filtering step with Chopper removes low-quality and short reads to reduce spurious classifications. Pipeline Steps -------------- 1. **Chopper** -- Filter reads by minimum quality (Q10) and minimum length (1 kb) to remove low-confidence basecalls and adapter artifacts. 2. **Kraken2** -- Classify each read against a pre-built k-mer database, assigning taxonomic labels based on exact k-mer matches. The ``--minimum-hit-groups 3`` parameter requires hits across multiple distinct minimizer groups to reduce false-positive classifications. 3. **Bracken** -- Re-estimate species-level abundances from the Kraken2 report using a Bayesian model that redistributes reads assigned to higher taxonomic levels. Reads are re-distributed based on read-length-specific expected k-mer distributions. 4. **Krona** -- Convert the Kraken2 report into an interactive HTML visualization showing the hierarchical taxonomic composition of the sample. Implementation -------------- .. tabs:: .. tab:: Snakemake .. code-block:: python # Snakefile -- Metagenomic Classification Pipeline configfile: "config.yaml" SAMPLES = config["samples"] KRAKEN2_DB = config["kraken2_db"] rule all: input: expand("results/bracken/{sample}_bracken.txt", sample=SAMPLES), expand("results/krona/{sample}_krona.html", sample=SAMPLES) rule chopper_filter: input: "data/{sample}.fastq.gz" output: "results/filtered/{sample}.filtered.fastq.gz" shell: "zcat {input} | chopper --quality 10 --minlength 1000 | gzip > {output}" rule kraken2: input: "results/filtered/{sample}.filtered.fastq.gz" output: classifications="results/kraken2/{sample}_classifications.txt", report="results/kraken2/{sample}_report.txt" params: db=KRAKEN2_DB threads: 8 shell: """ kraken2 --db {params.db} \ --output {output.classifications} \ --report {output.report} \ --minimum-hit-groups 3 \ --threads {threads} \ {input} """ rule bracken: input: report="results/kraken2/{sample}_report.txt" output: "results/bracken/{sample}_bracken.txt" params: db=KRAKEN2_DB shell: """ bracken -d {params.db} \ -i {input.report} \ -o {output} \ -r 1000 -l S -t 10 """ rule krona: input: report="results/kraken2/{sample}_report.txt" output: "results/krona/{sample}_krona.html" shell: """ kreport2krona.py -r {input.report} -o results/krona/{wildcards.sample}_krona_input.txt ktImportText results/krona/{wildcards.sample}_krona_input.txt \ -o {output} """ .. tab:: Nextflow .. code-block:: groovy #!/usr/bin/env nextflow nextflow.enable.dsl = 2 params.samplesheet = "samplesheet.csv" params.kraken2_db = "ref/kraken2_db" params.outdir = "results" process CHOPPER_FILTER { tag "${sample_id}" publishDir "${params.outdir}/filtered", mode: 'copy' input: tuple val(sample_id), path(fastq) output: tuple val(sample_id), path("${sample_id}.filtered.fastq.gz"), emit: fastq script: """ zcat ${fastq} | chopper --quality 10 --minlength 1000 \ | gzip > ${sample_id}.filtered.fastq.gz """ } process KRAKEN2 { tag "${sample_id}" publishDir "${params.outdir}/kraken2", mode: 'copy' cpus 8 memory '64 GB' input: tuple val(sample_id), path(fastq) path(kraken2_db) output: tuple val(sample_id), path("${sample_id}_classifications.txt"), path("${sample_id}_report.txt"), emit: results tuple val(sample_id), path("${sample_id}_report.txt"), emit: report script: """ kraken2 --db ${kraken2_db} \ --output ${sample_id}_classifications.txt \ --report ${sample_id}_report.txt \ --minimum-hit-groups 3 \ --threads ${task.cpus} \ ${fastq} """ } process BRACKEN { tag "${sample_id}" publishDir "${params.outdir}/bracken", mode: 'copy' input: tuple val(sample_id), path(report) path(kraken2_db) output: tuple val(sample_id), path("${sample_id}_bracken.txt"), emit: abundances script: """ bracken -d ${kraken2_db} \ -i ${report} \ -o ${sample_id}_bracken.txt \ -r 1000 -l S -t 10 """ } process KRONA { tag "${sample_id}" publishDir "${params.outdir}/krona", mode: 'copy' input: tuple val(sample_id), path(report) output: path("${sample_id}_krona.html"), emit: html script: """ kreport2krona.py -r ${report} -o ${sample_id}_krona_input.txt ktImportText ${sample_id}_krona_input.txt \ -o ${sample_id}_krona.html """ } workflow { ch_input = Channel.fromPath(params.samplesheet) .splitCsv(header: true) .map { row -> tuple(row.sample_id, file(row.fastq)) } kraken2_db = Channel.value(file(params.kraken2_db)) CHOPPER_FILTER(ch_input) KRAKEN2(CHOPPER_FILTER.out.fastq, kraken2_db) BRACKEN(KRAKEN2.out.report, kraken2_db) KRONA(KRAKEN2.out.report) } Expected Output --------------- After a successful run the output directory will contain: * ``results/filtered/`` -- Quality- and length-filtered FASTQ files. * ``results/kraken2/_classifications.txt`` -- Per-read taxonomic assignments from Kraken2. * ``results/kraken2/_report.txt`` -- Kraken2 summary report with read counts and percentages at each taxonomic level. * ``results/bracken/_bracken.txt`` -- Bracken species-level abundance estimates with re-distributed read counts and fractions. * ``results/krona/_krona.html`` -- Interactive Krona pie chart showing hierarchical taxonomic composition (viewable in any web browser). See Also -------- * :doc:`16s-amplicon` -- 16S amplicon sequencing pipeline with DADA2 * :doc:`bacterial-genome-assembly` -- Nanopore bacterial genome assembly