GFF / GTF
Overview
GFF (General Feature Format) and GTF (Gene Transfer Format) are tab-delimited formats for describing genomic features – genes, transcripts, exons, CDS regions, UTRs, and other annotations. They are the standard output of genome annotation pipelines and the required input for RNA-seq quantification, spliced alignment, and functional analysis.
The two formats share the same nine-column layout but differ in how the attributes column (column 9) is structured:
| Format | Also known as | Attribute style |
|---|---|---|
| GTF | GFF2 / GFF2.5 | Key-value pairs: `gene_id "ENSG00000223972";` |
| GFF3 | GFF version 3 | Key=value pairs: `ID=gene-DDX11L1;Name=DDX11L1` |
GTF is used predominantly by GENCODE, Ensembl, and tools in the RNA-seq ecosystem (STAR, featureCounts, HTSeq, StringTie). GFF3 is specified by the Sequence Ontology project and is preferred by NCBI RefSeq and many non-model organism databases.
Important
Both GFF and GTF use 1-based, fully closed coordinates. The interval
covering bases 1 through 100 is written as start=1 end=100. This
differs from BED, which uses 0-based, half-open coordinates.
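The coordinate rule above can be sketched as two small shell helpers. The function names `gff_to_bed_interval` and `gff_length` are hypothetical and exist only for illustration; the arithmetic follows directly from the two conventions.

```shell
# Hypothetical helper: convert a 1-based, fully closed interval (GFF/GTF)
# to a 0-based, half-open interval (BED).
# $1 = GFF start, $2 = GFF end
gff_to_bed_interval() {
    echo "$(( $1 - 1 )) $2"
}

# Hypothetical helper: number of bases covered by a 1-based, fully closed interval.
# $1 = GFF start, $2 = GFF end
gff_length() {
    echo "$(( $2 - $1 + 1 ))"
}

gff_to_bed_interval 1 100   # prints "0 100"
gff_length 1 100            # prints "100"
```

Only the start shifts by one during conversion; the end stays the same because the BED end is exclusive.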
Structure
Both formats use nine tab-separated columns:
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
Column definitions
| Col | Field | Description |
|---|---|---|
| 1 | seqname | Chromosome or contig name (e.g. `chr1`). |
| 2 | source | Annotation source or program (e.g. `HAVANA`). |
| 3 | feature | Feature type: `gene`, `transcript`, `exon`, `CDS`, and so on. |
| 4 | start | Start position (1-based, inclusive). |
| 5 | end | End position (1-based, inclusive). |
| 6 | score | Numeric score or `.` if not applicable. |
| 7 | strand | `+`, `-`, or `.` (unstranded). |
| 8 | frame | Reading frame for CDS features: `0`, `1`, or `2`; `.` for other features. |
| 9 | attributes | Semicolon-separated key-value pairs (format differs between GTF and GFF3). |
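As a quick sanity check of the nine-column layout, the following is a minimal sketch of a validator. The function name `check_gff_columns` is hypothetical, and the checks are deliberately shallow: they cover only the column count and the start/end order, not the full GTF or GFF3 specifications.

```shell
# Hypothetical helper: report data lines that do not have exactly nine
# tab-separated fields, or whose start exceeds their end.
# Reads the files given as arguments, or standard input if none are given.
# Exits non-zero if any problem was reported.
check_gff_columns() {
    awk -F'\t' '
        /^#/ || /^$/ { next }    # skip comment/directive lines and blank lines
        NF != 9 {
            printf "line %d: expected 9 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $4 + 0 > $5 + 0 {
            printf "line %d: start %s > end %s\n", NR, $4, $5
            bad = 1
        }
        END { exit bad + 0 }
    ' "$@"
}
```

A typical call would be `check_gff_columns annotation.gtf`, where `annotation.gtf` stands for any local annotation file.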
GTF vs GFF3 attributes
GTF (GFF2) style:
gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_biotype "lncRNA";
Keys and values separated by a space.
Values enclosed in double quotes.
Entries terminated by a semicolon.
gene_id and transcript_id are mandatory (gene-level lines in GENCODE and Ensembl files carry only gene_id).
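The rules above are enough to pull a single attribute out of column 9. The following is a minimal sketch; the function name `gtf_attr` is hypothetical, and it assumes that attribute values do not themselves contain semicolons.

```shell
# Hypothetical helper: print the value of one GTF attribute for every
# line on standard input that carries it.
# $1 = attribute key, e.g. gene_id
# Assumption: values contain no ";" characters.
gtf_attr() {
    awk -F'\t' -v key="$1" '
        /^#/ { next }                            # skip header comments
        {
            n = split($9, parts, ";")            # entries end with a semicolon
            for (i = 1; i <= n; i++) {
                entry = parts[i]
                sub(/^[ \t]+/, "", entry)        # trim leading whitespace
                sp = index(entry, " ")           # key and value are separated by a space
                if (sp == 0) continue
                if (substr(entry, 1, sp - 1) == key) {
                    val = substr(entry, sp + 1)
                    sub(/^"/, "", val)           # strip the surrounding double quotes
                    sub(/"[ \t]*$/, "", val)
                    print val
                    break                        # report only the first occurrence per line
                }
            }
        }
    '
}
```

For example, `gtf_attr gene_id < annotation.gtf | sort -u` would list the distinct gene identifiers in a file.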
GFF3 style:
ID=gene-DDX11L1;Name=DDX11L1;gene_id=ENSG00000223972;gene_biotype=lncRNA
Key=value pairs separated by semicolons (no trailing semicolon).
Values are not quoted.
ID provides a unique identifier for the feature.
Parent links child features (exons, CDS) to their parent (transcript, gene).
Hierarchical relationships are explicit through ID/Parent links.
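To make the ID/Parent links visible, the following is a minimal sketch that lists each feature with its ID and Parent. The function name `gff3_id_parent` is hypothetical, and the sketch does not decode percent-encoded characters or split multi-valued Parent attributes.

```shell
# Hypothetical helper: for every GFF3 feature on standard input, print
# feature type, ID, and Parent as three tab-separated fields.
# A "." marks an attribute that is absent.
gff3_id_parent() {
    awk -F'\t' '
        /^#/ { next }                          # skip directives and comments
        {
            id = "."; parent = "."
            n = split($9, parts, ";")          # key=value pairs separated by semicolons
            for (i = 1; i <= n; i++) {
                eq = index(parts[i], "=")
                if (eq == 0) continue
                k = substr(parts[i], 1, eq - 1)
                v = substr(parts[i], eq + 1)
                if (k == "ID") id = v
                else if (k == "Parent") parent = v
            }
            print $3 "\t" id "\t" parent
        }
    '
}
```

Top-level features such as genes show `.` in the third field, while exons and CDS rows show the ID of the transcript they belong to.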
Feature hierarchy
Annotations follow a nested hierarchy:
gene
  transcript (mRNA)
    exon
    CDS
    five_prime_UTR
    three_prime_UTR
In GTF, the hierarchy is implicit – features are grouped by shared
gene_id and transcript_id values. In GFF3, the hierarchy is explicit
through ID and Parent attributes.
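Because GTF grouping is implicit, reconstructing the hierarchy means collecting rows that share an identifier. The following is a minimal sketch that counts exons per transcript by grouping on transcript_id; the function name `exons_per_transcript` is hypothetical.

```shell
# Hypothetical helper: count exon rows per transcript_id in a GTF.
# Reads the files given as arguments, or standard input if none are given.
# Prints "transcript_id<TAB>count", sorted by transcript_id.
exons_per_transcript() {
    awk -F'\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # The match spans: transcript_id "VALUE"
            # 15 characters precede VALUE and one closing quote follows it.
            tid = substr($9, RSTART + 15, RLENGTH - 16)
            count[tid]++
        }
        END { for (t in count) print t "\t" count[t] }
    ' "$@" | sort
}
```

The same pattern works for grouping by gene_id; only the matched key and the offset into the match change.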
Working With
Downloading annotations
# GENCODE human GTF (widely used for RNA-seq)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
# NCBI RefSeq GFF3
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
Extracting specific features
# Extract only gene-level features from a GTF
awk '$3 == "gene"' annotation.gtf > genes_only.gtf
# Extract protein-coding genes (GENCODE names the attribute gene_type, Ensembl gene_biotype)
awk '$3 == "gene" && /gene_(bio)?type "protein_coding"/' annotation.gtf > protein_coding.gtf
# Extract exon coordinates as a BED file (converting 1-based to 0-based)
awk 'BEGIN{OFS="\t"} $3=="exon" {print $1, $4-1, $5, ".", ".", $7}' \
annotation.gtf > exons.bed
Using with RNA-seq aligners
# STAR genome generation with GTF
STAR --runMode genomeGenerate \
--genomeDir star_index/ \
--genomeFastaFiles reference.fa \
--sjdbGTFfile annotation.gtf \
--runThreadN 8
# HISAT2 with splice sites extracted from GTF
hisat2_extract_splice_sites.py annotation.gtf > splice_sites.txt
hisat2_extract_exons.py annotation.gtf > exons.txt
Counting reads per gene
# featureCounts (from Subread package)
featureCounts -a annotation.gtf -o counts.txt \
-T 8 -p --countReadPairs aligned.sorted.bam
# HTSeq (-s reverse assumes a reverse-stranded library; use -s yes or -s no for other protocols)
htseq-count -f bam -r pos -s reverse \
aligned.sorted.bam annotation.gtf > counts.txt
Converting between formats
# GFF3 to GTF with gffread
gffread annotation.gff3 -T -o annotation.gtf
# GTF to GFF3
gffread annotation.gtf -o annotation.gff3
# Convert GTF to BED12 (transcript models)
gtfToGenePred annotation.gtf /dev/stdout \
| genePredToBed /dev/stdin > transcripts.bed
Sorting GTF/GFF files
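# Sort by chromosome and start position, keeping header comment lines at the top
# (the $'\t' tab literal requires bash, zsh, or ksh)
(grep '^#' annotation.gtf; grep -v '^#' annotation.gtf | sort -t$'\t' -k1,1 -k4,4n) > sorted.gtf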
See Also
featureCounts – count reads using GTF annotations
HTSeq – alternative read counting tool
Prokka – prokaryotic annotation producing GFF3
BED – simpler interval format (0-based, half-open coordinates)
FASTA – the reference genome that annotations describe
SAM / BAM / CRAM – alignment format used together with GTF for quantification