Bacterial Genome Assembly (Nanopore)

Overview

Oxford Nanopore long-read sequencing often allows a bacterial chromosome to be assembled into a single contig. Long reads span the repetitive elements that fragment short-read assemblies, which makes it possible to resolve plasmids, prophages, and genomic islands. This pipeline converts raw Nanopore BAM output to FASTQ with samtools, filters reads by quality and length with Chopper, assembles them with Flye, polishes the consensus with Medaka, evaluates assembly quality with QUAST and BUSCO, and annotates genes with Bakta.

Pipeline Steps

  1. BAM to FASTQ – Convert the raw Nanopore BAM file (produced by the basecaller) into FASTQ format using samtools.

  2. Chopper filter – Remove reads with a mean quality below Q10 or a length below 1 kb, discarding low-quality and very short reads before assembly.

  3. Flye assembly – Assemble filtered long reads into contigs using the Flye assembler in --nano-hq mode, which is intended for high-accuracy Nanopore reads (Guppy 5+ SUP or Q20 chemistry, under roughly 5% error).

  4. Medaka polish – Correct residual errors in the assembly consensus using a neural-network model that must match the Nanopore chemistry and basecaller used (set through medaka_model in config.yaml).

  5. QUAST – Compute assembly statistics including total length, number of contigs, N50, and GC content.

  6. BUSCO – Assess assembly completeness by searching for lineage-specific single-copy orthologs.

  7. Bakta annotate – Predict and annotate coding sequences, rRNAs, tRNAs, and other genomic features.
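
The filtering criterion in step 2 can be sketched in a few lines of Python. This is an illustration only, not Chopper's code: it assumes that mean read quality is obtained by averaging per-base error probabilities and converting back to a Phred score, which is a common convention for Nanopore read filters, and the function names are hypothetical.

```python
import math


def mean_read_quality(quality_string, phred_offset=33):
    """Mean Phred quality of a read, averaged in error-probability space.

    Assumption: qualities are Phred+33 encoded, as in standard FASTQ.
    """
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def passes_filter(sequence, quality_string, min_quality=10.0, min_length=1000):
    """Return True if the read meets both the length and the quality threshold."""
    if len(sequence) < min_length:
        return False
    return mean_read_quality(quality_string) >= min_quality


# A read with one Q20 base ('5') and one Q40 base ('I') has a mean quality
# well below the arithmetic mean of 30, because the Q20 base dominates the error.
mixed_quality = mean_read_quality("5I")
```

Averaging in probability space means a few very poor bases pull the read's score down more strongly than a plain average of Phred values would suggest.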

Implementation

# Snakefile -- Nanopore Bacterial Genome Assembly
configfile: "config.yaml"

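rule all:
    input:
        # QUAST and BUSCO reports are requested explicitly; no other target depends on them.
        "results/quast/report.txt",
        "results/busco/short_summary.txt",
        "results/bakta/annotation.gff3"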

rule bam_to_fastq:
    input: config["raw_bam"]
    output: "results/fastq/reads.fastq.gz"
    shell: "samtools fastq {input} | gzip > {output}"

rule chopper_filter:
    input: "results/fastq/reads.fastq.gz"
    output: "results/filtered/reads.filtered.fastq.gz"
    shell:
        "zcat {input} | chopper --quality 10 --minlength 1000 | gzip > {output}"

rule flye_assembly:
    input: "results/filtered/reads.filtered.fastq.gz"
    output: "results/flye/assembly.fasta"
    params: size=config["genome_size"]
    threads: 8
    shell:
        "flye --nano-hq {input} --out-dir results/flye/ --genome-size {params.size} --threads {threads}"

rule medaka_polish:
    input:
        reads="results/filtered/reads.filtered.fastq.gz",
        asm="results/flye/assembly.fasta"
    output: "results/medaka/consensus.fasta"
    params: model=config["medaka_model"]
    threads: 8
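    shell:
        # medaka_consensus is Medaka's polishing script; its thread flag is -t.
        # --bacteria needs a recent Medaka release; check how it interacts with -m in your version.
        "medaka_consensus -i {input.reads} -d {input.asm} -o results/medaka/ -m {params.model} --bacteria -t {threads}"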

rule quast:
    input: "results/medaka/consensus.fasta"
    output: "results/quast/report.txt"
    shell: "quast {input} -o results/quast/"

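rule busco:
    input: "results/medaka/consensus.fasta"
    output: "results/busco/short_summary.txt"
    params: lineage=config["busco_lineage"]
    threads: 8
    shell:
        # BUSCO names its summary after the lineage and run name, so copy it to a stable path.
        # -f lets BUSCO write into the directory Snakemake pre-creates for the declared output.
        "busco -i {input} --out_path results -o busco -m genome -l {params.lineage} --cpu {threads} -f && "
        "cp results/busco/short_summary.specific.*.txt {output}"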

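rule bakta_annotate:
    input: "results/medaka/consensus.fasta"
    output: "results/bakta/annotation.gff3"
    params: db=config["bakta_db"]
    threads: 8
    shell:
        # --prefix names the outputs annotation.*; by default Bakta uses the input file's basename.
        # --force allows writing into the directory Snakemake pre-creates.
        # --complete declares every contig a closed replicon; drop it for fragmented assemblies.
        "bakta {input} --db {params.db} --output results/bakta/ --prefix annotation --complete --force --threads {threads}"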

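The Snakefile above reads five keys from config.yaml: raw_bam, genome_size, medaka_model, busco_lineage, and bakta_db. A missing key otherwise surfaces as a bare KeyError when the workflow is parsed, so a small check along the following lines could report all missing keys at once. This is a sketch, not part of the pipeline; the helper name and every value in the example dictionary are hypothetical placeholders.

```python
# Keys the Snakefile looks up in config.yaml.
REQUIRED_KEYS = ("raw_bam", "genome_size", "medaka_model", "busco_lineage", "bakta_db")


def missing_config_keys(config):
    """Return the required keys that are absent or empty in the config mapping."""
    return [
        key for key in REQUIRED_KEYS
        if key not in config or config[key] in (None, "")
    ]


# Hypothetical example values; replace each with paths and names for your own run.
example_config = {
    "raw_bam": "data/sample.bam",
    "genome_size": "5m",
    "medaka_model": "MODEL_NAME_FOR_YOUR_CHEMISTRY",
    "busco_lineage": "bacteria_odb10",
    "bakta_db": "db/bakta",
}

problems = missing_config_keys(example_config)
if problems:
    raise ValueError("config.yaml is missing: " + ", ".join(problems))
```

Because a Snakefile is Python, the same function could be called on the config object directly after the configfile line.
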
Expected Output

After a successful run the output directory will contain:

  • results/fastq/reads.fastq.gz – Converted FASTQ from the raw BAM.

  • results/filtered/reads.filtered.fastq.gz – Quality- and length-filtered reads.

  • results/flye/assembly.fasta – Draft genome assembly from Flye.

  • results/medaka/consensus.fasta – Polished consensus assembly.

  • results/quast/report.txt – Assembly statistics (total length, contigs, N50, GC content).

  • results/busco/short_summary.txt – Completeness assessment showing complete, fragmented, and missing orthologs.

  • results/bakta/annotation.gff3 – Gene annotations in GFF3 format including CDS, rRNA, tRNA, and other features.
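
To make the QUAST statistics listed above concrete, the sketch below computes total length, contig count, N50, and GC content from FASTA-formatted lines. It is an independent illustration rather than QUAST's implementation, and the function names are hypothetical. N50 is taken here as the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the assembly.

```python
def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(lines):
    """Summarize an assembly given as FASTA lines."""
    sequences = [seq.upper() for _, seq in fasta_records(lines)]
    lengths = [len(seq) for seq in sequences]
    total = sum(lengths)
    gc = sum(seq.count("G") + seq.count("C") for seq in sequences)
    return {
        "total_length": total,
        "num_contigs": len(lengths),
        "n50": n50(lengths),
        "gc_percent": 100.0 * gc / total if total else 0.0,
    }


# Usage on a toy two-contig assembly; a real run would pass an open file handle,
# e.g. open("results/medaka/consensus.fasta").
toy_stats = assembly_stats([">contig_1", "GGCC", "AATT", ">contig_2", "ATAT"])
```

For a closed bacterial genome the N50 typically equals the chromosome length, since the chromosome alone accounts for more than half of the assembly.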

See Also