FASTA
Overview
FASTA is the simplest and most widely used format for representing nucleotide or protein sequences. Unlike FASTQ, it carries no quality information – only a header line and the sequence itself. FASTA files serve as the standard format for:
Reference genomes (e.g. GRCh38.fa)
Transcript sequences (e.g. gencode.v44.transcripts.fa)
Protein databases (e.g. UniProt FASTA downloads)
De novo assembly contigs and scaffolds
Consensus sequences from multiple-sequence alignment
FASTA files use the extensions .fa, .fasta, .fna (nucleotide),
or .faa (amino acid). They are plain text and are frequently compressed
with gzip, or block-compressed with bgzip when indexed random access is needed.
Structure
A FASTA file consists of one or more records. Each record starts with a
> header line followed by one or more lines of sequence:
>chr1 Homo sapiens chromosome 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATAATCTACTACTACTCAG
TGGCCAACACCTGATCCTGACAGCTGGAGTAAGGAACCTGAAGTCCCTA
AAACTCATCAATGTTCTTTAGAGACTTACCAGGACCACTTCGTGAGGGA
>chr2 Homo sapiens chromosome 2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
AGAGATCAGCTCAGGAGAGTCTCTTGAAGAAATCTGATTCACTGTATGG
- Header line (starts with >)
Everything after > up to the first whitespace is the sequence identifier (e.g. chr1). The remainder of the line is a free-text description (e.g. Homo sapiens chromosome 1).
- Sequence lines
Contain the actual nucleotide (A, C, G, T, N) or amino acid characters. Lines are typically wrapped at 60 or 80 characters, although single-line sequences are also valid.
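The record structure described above can be illustrated with a minimal parsing sketch. The function name `parse_fasta` and the shortened example sequences are hypothetical, chosen only for illustration; real pipelines would more likely use an established library.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    seq_id, description, chunks = None, "", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # Identifier: text after '>' up to the first whitespace.
            # Description: whatever remains on the header line.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif seq_id is not None and line:
            # Wrapped sequence lines are joined into one string.
            chunks.append(line.strip())
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


# Shortened, made-up sequences in the same layout as the example above.
example = """>chr1 Homo sapiens chromosome 1
NNNNGATAATCT
ACTACTCAG
>chr2 Homo sapiens chromosome 2
AGAGATCAGC
"""

records = list(parse_fasta(example.splitlines()))
for seq_id, description, sequence in records:
    print(seq_id, description, len(sequence), sep="\t")
```

Because wrapped lines are concatenated, a record wrapped at 60 characters and the same record on a single line produce the same sequence string.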
FASTA index (.fai)
Tools such as samtools and bedtools require a FASTA index to perform
random access into a reference genome. The index is a tab-separated file with
one row per sequence:
chr1 248956422 112 80 81
chr2 242193529 253404903 80 81
Columns: sequence name, length in bases, byte offset of the first base, bases per line, and bytes per line (including the line terminator).
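To make the five columns concrete, here is a rough sketch of how they could be derived from the raw bytes of a small FASTA file. The function `build_fai` and the two-record input are hypothetical; unlike samtools faidx, this sketch does not check that every line within a record has the same width.

```python
from typing import List, Tuple

# (name, length, offset, bases per line, bytes per line)
FaiRow = Tuple[str, int, int, int, int]


def build_fai(data: bytes) -> List[FaiRow]:
    """Compute one index row per record from raw FASTA bytes."""
    rows: List[FaiRow] = []
    pos = 0  # byte position at which the current line starts
    name = None
    length = offset = linebases = linewidth = 0
    for line in data.splitlines(keepends=True):
        if line.startswith(b">"):
            if name is not None:
                rows.append((name, length, offset, linebases, linewidth))
            name = line[1:].split()[0].decode("ascii")
            length = linebases = linewidth = 0
            # The first base sits immediately after the header line.
            offset = pos + len(line)
        elif name is not None:
            bases = len(line.rstrip(b"\r\n"))
            if linebases == 0:
                # The first sequence line defines the line layout.
                linebases, linewidth = bases, len(line)
            length += bases
        pos += len(line)
    if name is not None:
        rows.append((name, length, offset, linebases, linewidth))
    return rows


# Made-up input: seq1 is wrapped at 10 bases, seq2 fits on one line.
data = b">seq1 demo\nACGTACGTAC\nGTACGTACGT\nACG\n>seq2\nTTTT\n"
rows = build_fai(data)
for row in rows:
    print("\t".join(str(field) for field in row))
```

With Unix line endings the bytes-per-line value is one more than the bases-per-line value; with Windows line endings it would be two more.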
Working With
Indexing a reference
# Create a .fai index (needed by samtools, GATK, and many other downstream tools)
samtools faidx reference.fa
# Block-compress and index for fast random access
bgzip -c reference.fa > reference.fa.gz # -c writes to stdout, so reference.fa is kept
samtools faidx reference.fa.gz # writes reference.fa.gz.fai and reference.fa.gz.gzi
Extracting a region
# Extract chr1:10000-20000 (coordinates are 1-based and inclusive) from an indexed FASTA
samtools faidx reference.fa chr1:10000-20000
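The way an index turns a region into a file seek can be sketched as follows. This is an illustration of the arithmetic implied by the index columns, not the samtools implementation; the function `fetch`, the in-memory file, and the hard-coded index values are assumptions made for the example, and no bounds checking is done.

```python
import io


def fetch(handle, offset, linebases, linewidth, start, end):
    """Return bases start..end (1-based, inclusive) of one sequence.

    offset, linebases and linewidth are columns 3-5 of the sequence's
    .fai row; handle is the uncompressed FASTA opened in binary mode.
    """
    def byte_pos(i):
        # i is a 0-based base index: skip whole lines, then move within one.
        return offset + (i // linebases) * linewidth + (i % linebases)

    first = byte_pos(start - 1)
    last = byte_pos(end - 1)
    handle.seek(first)
    raw = handle.read(last - first + 1)
    # The byte range may span line breaks, so drop them.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Made-up FASTA held in memory; seq1 is wrapped at 10 bases per line.
fasta = io.BytesIO(b">seq1 demo\nACGTACGTAC\nGTACGTACGT\nACG\n>seq2\nTTTT\n")

# Index values for this input: seq1 -> offset 11, 10 bases, 11 bytes per line;
# seq2 -> offset 43, 4 bases, 5 bytes per line.
region_a = fetch(fasta, 11, 10, 11, 9, 12)   # spans a line break
region_b = fetch(fasta, 43, 4, 5, 2, 3)
print(region_a, region_b)
```

Only the requested bytes are read, which is why an indexed lookup stays fast even on a multi-gigabyte reference.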
Counting sequences and total bases
# Count the number of sequences
grep -c '^>' reference.fa
# Count total bases (excluding headers and newlines)
grep -v '^>' reference.fa | tr -d '\n' | wc -c
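For gzip- or bgzip-compressed files, or files with Windows line endings, the same counts can be obtained in one pass with a short script. This is a sketch; `count_fasta` is a hypothetical helper, and the compressed input is built in memory purely for illustration.

```python
import gzip
import io
from typing import Iterable, Tuple


def count_fasta(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total bases) in a single pass."""
    n_records = 0
    n_bases = 0
    for line in lines:
        if line.startswith(">"):
            n_records += 1
        else:
            # strip() removes "\n" and "\r\n", so line endings are not counted.
            n_bases += len(line.strip())
    return n_records, n_bases


# Made-up compressed input; for a real file, pass its path to gzip.open.
compressed = gzip.compress(b">a\nACGT\nAC\n>b\nGG\n")
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    counts = count_fasta(handle)
print(counts)
```

Python's gzip module reads multi-member gzip streams, so bgzip output can be opened the same way as ordinary gzip output.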
Creating aligner indices
Most aligners build their own index from the FASTA reference:
# BWA-MEM2 index
bwa-mem2 index reference.fa
# STAR genome generate for RNA-seq
STAR --runMode genomeGenerate \
--genomeDir star_index/ \
--genomeFastaFiles reference.fa \
--sjdbGTFfile genes.gtf \
--runThreadN 8
# HISAT2 index (the output directory must already exist)
mkdir -p hisat2_index
hisat2-build reference.fa hisat2_index/genome
Getting genome information
# Create a genome file (chromosome sizes) from the index
cut -f1,2 reference.fa.fai > chrom.sizes
# Generate a sequence dictionary (required by GATK / Picard)
samtools dict -o reference.dict reference.fa
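A sequence dictionary is a SAM-style header with one @SQ line per sequence, carrying its name (SN), length (LN), and an MD5 checksum (M5) of the sequence. The sketch below builds such a line under a simplifying assumption: the checksum is taken over the uppercased sequence with whitespace removed. The function `sq_line` is hypothetical, and the output of samtools dict also includes a header line and additional fields that are omitted here.

```python
import hashlib


def sq_line(name: str, sequence: str) -> str:
    """Build a simplified @SQ line (SN, LN, M5) for one sequence."""
    # Assumption: remove all whitespace (including line breaks) and
    # uppercase before hashing, so soft-masked and wrapped input
    # gives the same checksum as the unwrapped uppercase sequence.
    cleaned = "".join(sequence.split()).upper()
    digest = hashlib.md5(cleaned.encode("ascii")).hexdigest()
    return f"@SQ\tSN:{name}\tLN:{len(cleaned)}\tM5:{digest}"


# Made-up sequence, wrapped and partly lowercase (soft-masked).
line = sq_line("seq1", "ACGTacgtAC\nGTAC\n")
print(line)
```

Because the checksum ignores case and line wrapping, it identifies the sequence content rather than a particular file layout.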
See Also
FASTQ – sequence format that includes per-base quality scores
SAM / BAM / CRAM – alignment formats that store reads mapped against a FASTA reference
BWA-MEM2 – short-read aligner that requires a FASTA reference index
SAMtools – samtools faidx for indexing and region extraction
SPAdes – de novo assembler that produces FASTA contigs