Bakta

Overview

Bakta is a rapid and comprehensive annotation tool for bacterial genomes. It identifies coding sequences, rRNAs, tRNAs, ncRNAs, CRISPR arrays, and other genomic features using a combination of database searches and predictive models. Bakta relies on a pre-built, regularly updated database that combines UniProt, AMRFinderPlus, Pfam, and other resources for functional annotation. It produces standard output formats (GFF3, GenBank, EMBL) that are compatible with downstream analysis tools and public database submissions.

Installation

mamba install -c conda-forge -c bioconda bakta
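A quick check that the install worked can be sketched as follows. It assumes only that `bakta` should be on the PATH of the active environment; the function name `check_bakta` is made up for this example.

```shell
# Hypothetical helper: report whether the bakta executable is reachable.
# Prints the installed version if it is, or a hint if it is not.
check_bakta() {
  if command -v bakta >/dev/null 2>&1; then
    bakta --version
  else
    echo "bakta not found on PATH; activate the environment it was installed into"
    return 1
  fi
}

# '|| true' keeps a calling script running even when bakta is missing.
check_bakta || true
```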

Basic Usage

Download the Bakta database once, then annotate a polished bacterial assembly (here, a Medaka consensus).

# Download database
bakta_db download --output /databases/bakta --type full

# Annotate the polished assembly
bakta medaka_output/consensus.fasta \
  --db /databases/bakta/db \
  --output bakta_output/ \
  --genus Escherichia --species "coli" \
  --complete \
  --threads 8
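
When several assemblies need the same treatment, the command above can be wrapped in a small helper. The sketch below is a dry run: it only prints the command it would execute, so each line can be inspected first. The function name `bakta_cmd`, the `assemblies/` directory, and the sample names are hypothetical; the flags are the ones described in this section.

```shell
# Hypothetical helper: print the bakta command for one assembly (dry run).
# Arguments: FASTA path, database path, output root, optional thread count.
bakta_cmd() {
  local fasta="$1" db="$2" outroot="$3" threads="${4:-8}"
  local sample
  sample="$(basename "${fasta%.*}")"   # assemblies/iso1.fasta -> iso1
  printf 'bakta %s --db %s --output %s/%s --prefix %s --threads %s\n' \
    "$fasta" "$db" "$outroot" "$sample" "$sample" "$threads"
}

# Example paths only; nothing is read from disk in a dry run.
for fasta in assemblies/iso1.fasta assemblies/iso2.fasta; do
  bakta_cmd "$fasta" /databases/bakta/db bakta_output
done
```

Giving each sample its own output directory and prefix keeps results from overwriting one another. The printed lines are not quoted, so paths containing spaces would need extra care before being executed.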

Key Parameters

Flag / option           Description
----------------------  ------------------------------------------------------------
(positional)            Input genome assembly in FASTA format.
--db                    Path to the Bakta database directory.
--output                Output directory for annotation files.
--genus / --species     Taxonomic classification for the organism being annotated.
--complete              Flag indicating the assembly represents a complete genome (affects feature coordinate handling for circular replicons).
--threads               Number of CPU threads to use.
--prefix                Prefix for output file names (default derived from input file).
--locus-tag             Locus tag prefix for gene identifiers.
--min-contig-length     Minimum contig length to annotate (default 1).
--keep-contig-headers   Preserve original contig names from the input FASTA.
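
Before raising `--min-contig-length`, it can help to see how many contigs would survive a given threshold. The sketch below counts them with `awk`. The FASTA content is made up for illustration, the function name `count_contigs_at_least` is hypothetical, and the threshold is assumed to be inclusive.

```shell
# Made-up three-contig FASTA, for illustration only.
cat > demo_contigs.fasta <<'EOF'
>contig_1
ACGTACGTACGT
ACGTACGT
>contig_2
ACGT
>contig_3
ACGTACGTAC
EOF

# Count contigs whose total length is >= the threshold.
# Sequences may span several lines, so lengths are summed per record.
count_contigs_at_least() {
  awk -v min="$2" '
    /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
         { len += length($0) }
    END  { if (seen && len >= min) n++; print n + 0 }
  ' "$1"
}

count_contigs_at_least demo_contigs.fasta 10   # prints 2 for this demo file
```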

Expected Output

Bakta writes multiple annotation files to the output directory:

  • .gff3 – genome annotations in GFF3 format, the most common format for downstream tools.

  • .gbff – GenBank flat file format, suitable for submission to NCBI and viewing in genome browsers.

  • .faa – predicted protein sequences in FASTA format.

  • .ffn – nucleotide sequences of predicted features in FASTA format.

  • .embl – annotations in EMBL format for submission to ENA.

  • .tsv – a tab-separated summary table of all annotated features.

  • .json – machine-readable annotation results in JSON format.
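
As a quick sanity check on the GFF3 output, feature types can be tallied from column 3, which the GFF3 format reserves for the feature type. The sketch below builds a tiny made-up GFF3 file so it runs on its own. The file name `demo.gff3`, the IDs, and the function name `count_feature_types` are hypothetical, and the `##FASTA` handling assumes the file may carry an embedded sequence section.

```shell
# Made-up GFF3 fragment for illustration; printf is used so the tabs are explicit.
{
  printf '##gff-version 3\n'
  printf 'contig_1\texample\tCDS\t1\t300\t.\t+\t0\tID=DEMO_00001\n'
  printf 'contig_1\texample\tCDS\t400\t900\t.\t-\t0\tID=DEMO_00002\n'
  printf 'contig_1\texample\ttRNA\t1000\t1075\t.\t+\t.\tID=DEMO_00003\n'
  printf '##FASTA\n>contig_1\nACGT\n'
} > demo.gff3

# Tally features by type (column 3), ignoring comments and any sequence section.
count_feature_types() {
  awk -F '\t' '
    /^##FASTA/ { exit }      # stop before embedded sequences; END still runs
    /^#/       { next }      # skip header and comment lines
    NF >= 9    { n[$3]++ }   # only count well-formed 9-column records
    END        { for (t in n) print t "\t" n[t] }
  ' "$1" | sort
}

count_feature_types demo.gff3
```

On a real annotation, replace `demo.gff3` with the `.gff3` file in the Bakta output directory.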

See Also

  • Prokka – alternative prokaryotic annotation pipeline

  • Medaka – polishing step commonly run before annotation

  • BUSCO – assess gene-level completeness before or after annotation