Bakta
=====

Overview
--------

Bakta is a rapid and comprehensive annotation tool for bacterial genomes. It identifies coding sequences, rRNAs, tRNAs, ncRNAs, CRISPR arrays, and other genomic features using a combination of database searches and predictive models.

Bakta relies on a pre-built, regularly updated database that combines UniProt, AMRFinderPlus, Pfam, and other resources for functional annotation. It produces standard output formats (GFF3, GenBank, EMBL) that are compatible with downstream analysis tools and public database submissions.

Installation
------------

.. code-block:: bash

   mamba install -c bioconda bakta

Basic Usage
-----------

Download the Bakta database and annotate a polished bacterial assembly.

.. code-block:: bash

   # Download database
   bakta_db download --output /databases/bakta --type full

   bakta medaka_output/consensus.fasta \
       --db /databases/bakta/db \
       --output bakta_output/ \
       --genus Escherichia --species "coli" \
       --complete \
       --threads 8

Key Parameters
--------------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Flag / option
     - Description
   * - (positional)
     - Input genome assembly in FASTA format.
   * - ``--db``
     - Path to the Bakta database directory.
   * - ``--output``
     - Output directory for annotation files.
   * - ``--genus`` / ``--species``
     - Taxonomic classification for the organism being annotated.
   * - ``--complete``
     - Flag indicating the assembly represents a complete genome (affects feature coordinate handling for circular replicons).
   * - ``--threads``
     - Number of CPU threads to use.
   * - ``--prefix``
     - Prefix for output file names (default derived from input file).
   * - ``--locus-tag``
     - Locus tag prefix for gene identifiers.
   * - ``--min-contig-length``
     - Minimum contig length to annotate (default 1).
   * - ``--keep-contig-headers``
     - Preserve original contig names from the input FASTA.

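
The optional flags in the table above can be combined for a draft (non-complete) assembly. The following is a minimal sketch, not a recommended configuration: the input path, output directory, prefix, locus tag, and the 200 bp length cutoff are hypothetical values chosen for illustration. The command is only printed, so that it can be reviewed before it is run.

.. code-block:: bash

   #!/usr/bin/env bash
   set -euo pipefail

   # Hypothetical paths, chosen for illustration only.
   assembly="draft_assembly.fasta"
   db_dir="/databases/bakta/db"
   out_dir="bakta_draft_output"

   # Collect the arguments in an array so that each value stays a single word.
   bakta_args=(
     "$assembly"
     --db "$db_dir"
     --output "$out_dir"
     --prefix sample01            # hypothetical output file prefix
     --locus-tag SAMPLE01         # hypothetical locus tag prefix
     --min-contig-length 200      # assumed cutoff; skips shorter contigs
     --keep-contig-headers
     --threads 8
   )

   # Dry run: print the command for review instead of executing it.
   echo bakta "${bakta_args[@]}"

Removing ``echo`` from the last line runs the annotation. ``--complete`` is left out because the example input is assumed to be a draft assembly.
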
Expected Output
---------------

Bakta writes multiple annotation files to the output directory:

* ``.gff3`` -- genome annotations in GFF3 format, the most common format for downstream tools.
* ``.gbff`` -- GenBank flat file format, suitable for submission to NCBI and viewing in genome browsers.
* ``.faa`` -- predicted protein sequences in FASTA format.
* ``.ffn`` -- nucleotide sequences of predicted features in FASTA format.
* ``.embl`` -- annotations in EMBL format for submission to ENA.
* ``.tsv`` -- a tab-separated summary table of all annotated features.
* ``.json`` -- machine-readable annotation results in JSON format.

See Also
--------

* :doc:`prokka` -- alternative prokaryotic annotation pipeline
* :doc:`/tools/assembly/medaka` -- polishing step commonly run before annotation
* :doc:`/tools/assembly-qc/busco` -- assess gene-level completeness before or after annotation