Metagenomic Classification

Overview

Metagenomic classification assigns taxonomic labels to sequencing reads without requiring assembly or marker-gene amplification, providing a culture-independent view of microbial community composition directly from shotgun or long-read sequencing data. This pipeline uses Kraken2 for fast k-mer-based taxonomic classification, Bracken for Bayesian re-estimation of species-level abundances, and Krona for interactive hierarchical visualization. An initial quality filtering step with Chopper removes low-quality and short reads to reduce spurious classifications.

Pipeline Steps

  1. Chopper – Filter reads by minimum quality (Q10) and minimum length (1 kb) to remove low-confidence basecalls and adapter artifacts.

  2. Kraken2 – Classify each read against a pre-built k-mer database, assigning taxonomic labels based on exact k-mer matches. The --minimum-hit-groups 3 parameter requires hits across multiple distinct minimizer groups to reduce false-positive classifications.

  3. Bracken – Re-estimate species-level abundances from the Kraken2 report using a Bayesian model that redistributes reads assigned to higher taxonomic levels. Reads are re-distributed based on read-length-specific expected k-mer distributions.

  4. Krona – Convert the Kraken2 report into an interactive HTML visualization showing the hierarchical taxonomic composition of the sample.

Implementation

# Snakefile -- Metagenomic Classification Pipeline
configfile: "config.yaml"

SAMPLES    = config["samples"]
KRAKEN2_DB = config["kraken2_db"]

rule all:
    input:
        expand("results/bracken/{sample}_bracken.txt", sample=SAMPLES),
        expand("results/krona/{sample}_krona.html", sample=SAMPLES)

rule chopper_filter:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/filtered/{sample}.filtered.fastq.gz"
    shell:
        "zcat {input} | chopper --quality 10 --minlength 1000 | gzip > {output}"

rule kraken2:
    input:
        "results/filtered/{sample}.filtered.fastq.gz"
    output:
        classifications="results/kraken2/{sample}_classifications.txt",
        report="results/kraken2/{sample}_report.txt"
    params:
        db=KRAKEN2_DB
    threads: 8
    shell:
        """
        kraken2 --db {params.db} \
          --output {output.classifications} \
          --report {output.report} \
          --minimum-hit-groups 3 \
          --threads {threads} \
          {input}
        """

rule bracken:
    input:
        report="results/kraken2/{sample}_report.txt"
    output:
        "results/bracken/{sample}_bracken.txt"
    params:
        db=KRAKEN2_DB
    shell:
        """
        bracken -d {params.db} \
          -i {input.report} \
          -o {output} \
          -r 1000 -l S -t 10
        """

rule krona:
    input:
        report="results/kraken2/{sample}_report.txt"
    output:
        "results/krona/{sample}_krona.html"
    shell:
        """
        kreport2krona.py -r {input.report} -o results/krona/{wildcards.sample}_krona_input.txt
        ktImportText results/krona/{wildcards.sample}_krona_input.txt \
          -o {output}
        """

Expected Output

After a successful run the output directory will contain:

  • results/filtered/ – Quality- and length-filtered FASTQ files.

  • results/kraken2/<sample>_classifications.txt – Per-read taxonomic assignments from Kraken2.

  • results/kraken2/<sample>_report.txt – Kraken2 summary report with read counts and percentages at each taxonomic level.

  • results/bracken/<sample>_bracken.txt – Bracken species-level abundance estimates with re-distributed read counts and fractions.

  • results/krona/<sample>_krona.html – Interactive Krona pie chart showing hierarchical taxonomic composition (viewable in any web browser).

See Also