FASTA

Overview

FASTA is one of the simplest and most widely used text formats for nucleotide and protein sequences. Unlike FASTQ, it carries no quality information: each record is only a header line and the sequence itself. FASTA files serve as the standard format for:

Reference genomes (e.g. GRCh38.fa)
Transcript sequences (e.g. gencode.v44.transcripts.fa)
Protein databases (e.g. UniProt FASTA downloads)
De novo assembly contigs and scaffolds
Consensus sequences from multiple sequence alignments

FASTA files use the extensions .fa, .fasta, .fna (nucleotide), or .faa (amino acid). They are plain text and are frequently compressed with gzip, or block-compressed with bgzip when indexed random access is needed (ordinary gzip files cannot be indexed this way).

Structure

A FASTA file consists of one or more records. Each record starts with a > header line followed by one or more lines of sequence:

>chr1 Homo sapiens chromosome 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATAATCTACTACTACTCAG
TGGCCAACACCTGATCCTGACAGCTGGAGTAAGGAACCTGAAGTCCCTA
AAACTCATCAATGTTCTTTAGAGACTTACCAGGACCACTTCGTGAGGGA
>chr2 Homo sapiens chromosome 2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
AGAGATCAGCTCAGGAGAGTCTCTTGAAGAAATCTGATTCACTGTATGG

Header line (starts with >): Everything after > up to the first whitespace is the sequence identifier (e.g. chr1), which should be unique within the file. The remainder of the line is a free-text description (e.g. Homo sapiens chromosome 1).
Sequence lines: Contain the nucleotide characters (A, C, G, T, N, and other IUPAC ambiguity codes) or amino acid characters. Lowercase letters are commonly used to mark soft-masked (repetitive) regions. Lines are typically wrapped at 60 or 80 characters, although single-line sequences are also valid.
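
The header and sequence rules above can be sketched with a short awk script. This is a minimal illustration, not a full parser: the file name demo.fa, its made-up sequences, and the helper name fasta_summary are all assumptions introduced for the example.

```shell
# Write a tiny two-record FASTA file (made-up sequences, for illustration only).
cat > demo.fa <<'EOF'
>seq1 first example record
ACGTACGTAC
GTAC
>seq2
NNNNACGT
EOF

# Hypothetical helper: print identifier, description, and sequence length
# (tab-separated) for each record in the FASTA file given as $1.
fasta_summary() {
  awk '
    { sub(/\r$/, "") }                 # tolerate Windows line endings
    /^>/ {
      if (id != "") print id "\t" desc "\t" len
      header = substr($0, 2)           # drop the leading ">"
      id = header; desc = ""
      i = match(header, /[ \t]/)       # first whitespace splits id from description
      if (i > 0) { id = substr(header, 1, i - 1); desc = substr(header, i + 1) }
      len = 0
      next
    }
    { len += length($0) }              # sequence lines: add up the bases
    END { if (id != "") print id "\t" desc "\t" len }
  ' "$1"
}

fasta_summary demo.fa
```

For the demo file this prints one line per record: seq1 with its description and a length of 14, and seq2 with an empty description and a length of 8.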

FASTA index (`.fai`)

Tools such as samtools and bedtools use a FASTA index to perform random access into a reference genome without reading the whole file. The index (reference.fa.fai, created by samtools faidx) is a tab-separated file with one row per sequence:

chr1  248956422  112  80  81
chr2  242193529  253404903  80  81

Columns: sequence name, sequence length in bases, byte offset of the first base in the file, bases per line, and bytes per line (including the line terminator).
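
To see how these columns enable random access, the sketch below computes the byte offset of a given base from the offset, bases-per-line, and bytes-per-line fields. It assumes every sequence line in a record except the last has the same length, which is what an index with fixed line widths relies on. The helper name fai_byte_offset and the file offset_demo.fa are hypothetical.

```shell
# Hypothetical helper: byte offset of the 1-based position POS in a sequence,
# given the OFFSET, LINEBASES, and LINEWIDTH columns of its .fai row.
fai_byte_offset() {
  offset=$1 linebases=$2 linewidth=$3 pos=$4
  zero=$((pos - 1))                      # convert to a 0-based position
  echo $((offset + (zero / linebases) * linewidth + zero % linebases))
}

# Tiny FASTA wrapped at 10 bases per line (made-up sequence).
printf '>demo\nACGTACGTAC\nGGGGCCCCTT\nAT\n' > offset_demo.fa

# The matching index row would be: demo  22  6  10  11
# (the header ">demo" plus its newline is 6 bytes, so the first base is at byte 6).
byte=$(fai_byte_offset 6 10 11 15)

# Read the single byte at that offset; position 15 is the fifth base of line two.
base=$(dd if=offset_demo.fa bs=1 skip="$byte" count=1 2>/dev/null)
echo "byte $byte holds base $base"
```

Applied to the chr1 row shown above (offset 112, 80 bases per line, 81 bytes per line), position 10000 maps to byte 112 + 124 × 81 + 79 = 10235.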

Working With

Indexing a reference

# Create reference.fa.fai (required by samtools, GATK, and many variant callers)
samtools faidx reference.fa

# Block-compress and index for fast random access
bgzip reference.fa              # replaces reference.fa with reference.fa.gz
samtools faidx reference.fa.gz  # writes reference.fa.gz.fai and reference.fa.gz.gzi

Extracting a region

# Extract chr1:10000-20000 (1-based, inclusive coordinates) from an indexed FASTA
samtools faidx reference.fa chr1:10000-20000
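
Region strings use 1-based, inclusive coordinates, whereas BED files use 0-based, half-open intervals, so the start must be shifted by one when converting. The sketch below shows that conversion; the helper name bed_to_regions and the file regions_demo.bed are assumptions made for the example.

```shell
# Hypothetical helper: turn BED intervals (0-based, half-open) into
# region strings of the form name:start-end (1-based, inclusive).
bed_to_regions() {
  awk -F'\t' '$1 !~ /^#/ && NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }' "$1"
}

# Two made-up intervals for illustration.
printf 'chr1\t9999\t20000\nchr2\t0\t50\n' > regions_demo.bed

bed_to_regions regions_demo.bed
```

The first interval becomes chr1:10000-20000, the same region as in the command above. The resulting list can be saved to a file and passed to samtools faidx with its -r (region file) option.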

Counting sequences and total bases

# Count the number of sequences
grep -c '^>' reference.fa

# Count total bases (excluding headers and line endings, including any \r)
grep -v '^>' reference.fa | tr -d '\r\n' | wc -c
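
Both counts can also be gathered in a single pass with awk, which makes it easy to add further tallies such as the number of N bases. This is a sketch under the assumption that the file fits the simple structure described earlier; fasta_stats and stats_demo.fa are hypothetical names.

```shell
# Hypothetical helper: count records, total bases, and N/n bases in one pass.
fasta_stats() {
  awk '
    /^>/ { nseq++; next }
    {
      sub(/\r$/, "")             # ignore Windows line endings
      total += length($0)
      n += gsub(/[Nn]/, "")      # gsub returns the number of replacements
    }
    # %.0f is used instead of %d because some awk builds overflow %d
    # on genome-sized counts.
    END { printf "sequences=%.0f bases=%.0f N=%.0f\n", nseq, total, n }
  ' "$1"
}

# Made-up demo input.
printf '>a\nACGTNN\nNNAC\n>b\nacgtn\n' > stats_demo.fa

fasta_stats stats_demo.fa
```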

Creating aligner indices

Most aligners build their own index from the FASTA reference; these indices are separate from the .fai index and do not replace it:

# BWA-MEM2 index
bwa-mem2 index reference.fa

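# STAR genome generation for RNA-seq (create the output directory first)
mkdir -p star_index
STAR --runMode genomeGenerate \
  --genomeDir star_index/ \
  --genomeFastaFiles reference.fa \
  --sjdbGTFfile genes.gtf \
  --runThreadN 8

# HISAT2 index (the output directory must already exist)
mkdir -p hisat2_index
hisat2-build reference.fa hisat2_index/genome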

Getting genome information

# Create a genome file (chromosome sizes) from the index
cut -f1,2 reference.fa.fai > chrom.sizes

# Generate a sequence dictionary (required by GATK / Picard)
samtools dict -o reference.dict reference.fa
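
Because chrom.sizes and other derived files inherit whatever is in the index, it can be worth checking that the index still matches the FASTA file, for example after the file has been edited. The sketch below compares the summed lengths in the index with a direct base count. The demo files and the hand-written index rows are assumptions for illustration; a real index would come from samtools faidx.

```shell
# Made-up FASTA and a hand-written matching index (illustration only).
printf '>a\nACGTAC\nGT\n>b\nGGCC\n' > check_demo.fa
printf 'a\t8\t3\t6\t7\nb\t4\t16\t4\t5\n' > check_demo.fa.fai

# Sum of the length column in the index.
index_total=$(awk -F'\t' '{ s += $2 } END { printf "%.0f\n", s }' check_demo.fa.fai)

# Bases counted directly from the FASTA file.
file_total=$(grep -v '^>' check_demo.fa | tr -d '\r\n' | wc -c | tr -d ' ')

if [ "$index_total" = "$file_total" ]; then
  echo "index matches FASTA ($index_total bases)"
else
  echo "index is stale: $index_total in index, $file_total in file" >&2
fi
```

A mismatch usually means the index should be regenerated before deriving chrom.sizes or a sequence dictionary from it.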