FASTA
=====

Overview
--------

FASTA is a simple, widely used plain-text format for representing nucleotide
or protein sequences. Unlike FASTQ, it carries **no quality information**:
each record is only a header line and the sequence itself. FASTA is the
standard format for:

* **Reference genomes** (e.g. ``GRCh38.fa``)
* **Transcript sequences** (e.g. ``gencode.v44.transcripts.fa``)
* **Protein databases** (e.g. UniProt FASTA downloads)
* **De novo assembly contigs** and scaffolds
* **Consensus sequences** from multiple-sequence alignment

FASTA files commonly use the extensions ``.fa``, ``.fasta``, ``.fna``
(nucleotide), or ``.faa`` (amino acid). They are plain text and are often
compressed with ``gzip``. Block compression with ``bgzip`` additionally allows
indexed random access.

Structure
---------

A FASTA file consists of one or more records. Each record starts with a
``>`` header line followed by one or more lines of sequence:

.. code-block:: text

   >chr1 Homo sapiens chromosome 1
   NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATAATCTACTACTACTCAG
   TGGCCAACACCTGATCCTGACAGCTGGAGTAAGGAACCTGAAGTCCCTA
   AAACTCATCAATGTTCTTTAGAGACTTACCAGGACCACTTCGTGAGGGA
   >chr2 Homo sapiens chromosome 2
   NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   AGAGATCAGCTCAGGAGAGTCTCTTGAAGAAATCTGATTCACTGTATGG

**Header line (starts with** ``>`` **)**
   Everything after ``>`` up to the first whitespace is the **sequence
   identifier** (e.g. ``chr1``). The remainder of the line is a free-text
   description (e.g. ``Homo sapiens chromosome 1``).

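**Sequence lines**
   Contain the nucleotide (A, C, G, T, N, plus other IUPAC ambiguity codes)
   or amino acid characters. Lines are typically wrapped at 60 or 80
   characters, although single-line sequences are also valid. For indexing
   with ``samtools faidx``, every line within a record except the last must
   have the same length.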
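
The record layout described above can be illustrated with a short ``awk``
sketch that prints each identifier with its sequence length. The function name
``fasta_lengths`` is hypothetical, and the sketch assumes Unix line endings and
an identifier that follows ``>`` directly:

.. code-block:: bash

   # Hypothetical helper: print "<identifier><TAB><length>" for each record.
   # Usage: fasta_lengths reference.fa
   fasta_lengths() {
     awk '
       /^>/ {                      # a header line starts a new record
         if (id != "") print id "\t" len
         id = substr($1, 2)        # text after ">" up to the first whitespace
         len = 0
         next
       }
       { len += length($0) }       # sequence line: add its characters
       END { if (id != "") print id "\t" len }
     ' "$1"
   }

Because the default ``awk`` field separator is whitespace, ``$1`` holds only
the identifier part of the header, so the free-text description is ignored.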

FASTA index (``.fai``)
^^^^^^^^^^^^^^^^^^^^^^

Tools such as ``samtools`` and ``bedtools`` require a FASTA index to perform
random access into a reference genome. The index is a tab-separated file with
one row per sequence:

.. code-block:: text

   chr1  248956422  112  80  81
   chr2  242193529  253404903  80  81

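Columns: sequence name, sequence length in bases, byte offset of the first
base, bases per line, and bytes per line (including the line terminator). The
offsets shown here are illustrative.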
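
These five columns are enough to locate any base without scanning the file. A
minimal sketch of the arithmetic, using a hypothetical helper
``fai_byte_offset`` and assuming a 1-based position and uniformly wrapped
lines:

.. code-block:: bash

   # Hypothetical helper: byte position of a 1-based base within the FASTA file.
   # Arguments: offset, bases per line, bytes per line, position.
   fai_byte_offset() {
     local offset=$1 linebases=$2 linewidth=$3 pos=$4
     local zero=$(( pos - 1 ))     # convert to a 0-based position
     echo $(( offset + (zero / linebases) * linewidth + zero % linebases ))
   }

   # With the chr1 row above (offset 112, 80 bases and 81 bytes per line):
   fai_byte_offset 112 80 81 1     # prints 112
   fai_byte_offset 112 80 81 81    # prints 193 (first base of the second line)

The integer division counts the full lines before the target base, each of
which costs one extra byte for its line terminator.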

Working With
------------

Indexing a reference
^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

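   # Create a .fai index (used by many variant callers and viewers for
   # random access; aligners build their own indices, see below)
   samtools faidx reference.fa   # produces reference.fa.fai

   # Block-compress and index for fast random access
   bgzip reference.fa            # replaces reference.fa with reference.fa.gz
   samtools faidx reference.fa.gz  # produces reference.fa.gz.fai and .gzi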

Extracting a region
^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Extract chr1:10000-20000 from an indexed FASTA
   # (coordinates are 1-based and inclusive)
   samtools faidx reference.fa chr1:10000-20000

Counting sequences and total bases
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Count the number of sequences
   grep -c '^>' reference.fa

   # Count total bases (excluding headers and newlines)
   grep -v '^>' reference.fa | tr -d '\n' | wc -c
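
Because ``gzip`` and ``bgzip`` output can both be read with ``gzip -cd``, the
same counts can also be taken from a compressed file in a single pass. A
minimal sketch: the helper name ``fasta_stats`` is hypothetical, and the
``tr -d '\r'`` step is an added precaution for files with Windows line
endings:

.. code-block:: bash

   # Hypothetical helper: print "<sequence count><TAB><total bases>".
   # Usage: fasta_stats reference.fa.gz
   fasta_stats() {
     case "$1" in
       *.gz) gzip -cd "$1" ;;      # handles both gzip and bgzip output
       *)    cat "$1" ;;
     esac | tr -d '\r' | awk '
       /^>/ { n++; next }          # count header lines
       { bases += length($0) }     # sum sequence characters
       END { printf "%d\t%.0f\n", n, bases }
     '
   }

The base total is printed with ``%.0f`` so that genome-scale counts are not
truncated by ``awk`` implementations with a 32-bit ``%d``.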

Creating aligner indices
^^^^^^^^^^^^^^^^^^^^^^^^

Most aligners build their own index from the FASTA reference:

.. code-block:: bash

   # BWA-MEM2 index
   bwa-mem2 index reference.fa

   # STAR genome generate for RNA-seq
   STAR --runMode genomeGenerate \
     --genomeDir star_index/ \
     --genomeFastaFiles reference.fa \
     --sjdbGTFfile genes.gtf \
     --runThreadN 8

   # HISAT2 index (create the output directory first)
   mkdir -p hisat2_index
   hisat2-build reference.fa hisat2_index/genome

Getting genome information
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Create a genome file (chromosome sizes) from the index
   cut -f1,2 reference.fa.fai > chrom.sizes

   # Generate a sequence dictionary (required by GATK / Picard)
   samtools dict reference.fa -o reference.dict

See Also
--------

* :doc:`fastq` -- sequence format that includes per-base quality scores
* :doc:`sam-bam-cram` -- alignment format that maps reads against a FASTA
  reference
* :doc:`/tools/alignment/bwa-mem2` -- short-read aligner that requires a
  FASTA reference index
* :doc:`/tools/sam-bam-processing/samtools` -- ``samtools faidx`` for indexing
  and region extraction
* :doc:`/tools/assembly/spades` -- de novo assembler that produces FASTA
  contigs