GFF / GTF

Overview

GFF (General Feature Format) and GTF (Gene Transfer Format) are tab-delimited formats for describing genomic features – genes, transcripts, exons, CDS regions, UTRs, and other annotations. They are the standard output of genome annotation pipelines and the required input for RNA-seq quantification, spliced alignment, and functional analysis.

The two formats share the same nine-column layout but differ in how the attributes column (column 9) is structured:

Format  Also known as   Attribute style
GTF     GFF2 / GFF2.5   Key-value pairs: gene_id "ENSG..."; gene_name "DDX11L1";
GFF3    GFF version 3   Key=value pairs: ID=gene0;Name=DDX11L1;biotype=lncRNA

GTF is used predominantly by GENCODE, Ensembl, and tools in the RNA-seq ecosystem (STAR, featureCounts, HTSeq, StringTie). GFF3 is the specification maintained by the Sequence Ontology project and is preferred by NCBI RefSeq and many non-model organism databases.

Important

Both GFF and GTF use 1-based, fully closed coordinates. The interval covering bases 1 through 100 is written as start=1 end=100. BED uses 0-based, half-open coordinates and writes the same interval as start=0 end=100.
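
The conversion between the two conventions is simple arithmetic: subtract 1 from the start and leave the end unchanged. A minimal sketch follows; the helper names gff_to_bed_interval and gff_interval_length are illustrative and do not belong to any existing tool.

```shell
# Convert a 1-based, fully closed GFF/GTF interval to 0-based, half-open BED.
# Arguments: $1 = GFF start, $2 = GFF end. (Hypothetical helper names.)
gff_to_bed_interval() {
  echo "$(( $1 - 1 )) $2"
}

# Number of bases covered by a closed interval: end - start + 1.
gff_interval_length() {
  echo "$(( $2 - $1 + 1 ))"
}

gff_to_bed_interval 1 100    # prints: 0 100
gff_interval_length 1 100    # prints: 100
```

Because BED is half-open, the length of the converted interval is simply end minus start, which gives the same 100 bases.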

Structure

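Both formats use nine tab-separated columns. The GTF example below is shown with aligned spaces for readability; real files use a single tab between columns: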

chr1  HAVANA  gene        11869  14409  .  +  .  gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1  HAVANA  transcript  11869  14409  .  +  .  gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1  HAVANA  exon        11869  12227  .  +  .  gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1  HAVANA  exon        12613  12721  .  +  .  gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1  HAVANA  exon        13221  14409  .  +  .  gene_id "ENSG00000223972"; transcript_id "ENST00000456328";

Column definitions

Col  Field       Description
1    seqname     Chromosome or contig name (e.g. chr1).
2    source      Annotation source or program (e.g. HAVANA, ENSEMBL).
3    feature     Feature type: gene, transcript, exon, CDS, UTR, start_codon, stop_codon, etc.
4    start       Start position (1-based, inclusive).
5    end         End position (1-based, inclusive).
6    score       Numeric score, or . if not applicable.
7    strand      + (forward), - (reverse), or . (unstranded).
8    frame       Reading frame for CDS features: 0, 1, 2, or . if not applicable.
9    attributes  Semicolon-separated key-value pairs (format differs between GTF and GFF3).
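
The column rules above can be checked mechanically before a file is handed to downstream tools. The following is a rough sketch rather than a full validator: the function name validate_gff_columns is hypothetical, and the check covers only the column count, the coordinates, and the strand. It also accepts ? as a strand value, which the GFF3 specification allows for features whose strand is relevant but unknown.

```shell
# Report lines that break basic nine-column rules; exit status 1 if any are found.
# Argument: $1 = path to an uncompressed GTF or GFF3 file. (Hypothetical helper.)
validate_gff_columns() {
  awk -F'\t' '
    /^#/ || NF == 0 { next }                      # skip comments and blank lines
    NF != 9 {
      printf "line %d: expected 9 columns, found %d\n", NR, NF
      bad = 1; next
    }
    ($4 !~ /^[0-9]+$/) || ($5 !~ /^[0-9]+$/) || ($4 + 0 > $5 + 0) {
      printf "line %d: invalid coordinates %s-%s\n", NR, $4, $5
      bad = 1
    }
    $7 !~ /^[-+.?]$/ {
      printf "line %d: invalid strand %s\n", NR, $7
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

A typical call would be validate_gff_columns annotation.gtf, with any reported lines inspected by hand.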

GTF vs GFF3 attributes

GTF (GFF2) style:

gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_biotype "lncRNA";
  • Keys and values separated by a space.

  • Values enclosed in double quotes.

  • Entries terminated by a semicolon.

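  • gene_id and transcript_id are mandatory in the GTF specification, although gene-level lines in GENCODE and Ensembl files carry only gene_id.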
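
The rules above are enough to pull a single attribute out of column 9. The sketch below uses a hypothetical helper, gtf_attr, and assumes that no quoted value contains a semicolon; files that break that assumption need a real parser.

```shell
# Print the value of one GTF attribute for every line that has it.
# Arguments: $1 = attribute key, $2 = GTF file. (Hypothetical helper.)
gtf_attr() {
  awk -F'\t' -v key="$1" '
    /^#/ { next }                                 # skip header/comment lines
    {
      n = split($9, parts, ";")                   # entries end with a semicolon
      for (i = 1; i <= n; i++) {
        entry = parts[i]
        sub(/^[ ]+/, "", entry)                   # drop the space after each ";"
        sp = index(entry, " ")                    # key and value split on a space
        if (sp == 0) continue
        k = substr(entry, 1, sp - 1)
        v = substr(entry, sp + 1)
        if (k == key) {
          sub(/[ ]+$/, "", v)
          gsub(/^"|"$/, "", v)                    # remove the surrounding quotes
          print v
          break
        }
      }
    }
  ' "$2"
}
```

For example, gtf_attr gene_name annotation.gtf | sort -u would list the distinct gene names in a file.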

GFF3 style:

ID=gene-DDX11L1;Name=DDX11L1;gene_id=ENSG00000223972;gene_biotype=lncRNA
  • Key=value pairs separated by semicolons (no trailing semicolon).

  • Values are not quoted; reserved characters such as ; = & , and tab are percent-encoded (for example %3B for a semicolon).

  • ID provides a unique identifier for the feature.

  • Parent links child features (exons, CDS) to their parent (transcript, gene).

  • Hierarchical relationships are explicit through ID/Parent links.
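
GFF3 attributes can be read the same way, splitting on semicolons and then on the first equals sign. This is a minimal sketch with a hypothetical helper name, gff3_attr; it does not decode percent-encoded characters and returns multi-valued attributes (such as a comma-separated Parent list) as a single string.

```shell
# Print the value of one GFF3 attribute for every line that has it.
# Arguments: $1 = attribute key, $2 = GFF3 file. (Hypothetical helper.)
gff3_attr() {
  awk -F'\t' -v key="$1" '
    /^#/ { next }                                 # skip directives and comments
    {
      n = split($9, parts, ";")                   # key=value pairs
      for (i = 1; i <= n; i++) {
        eq = index(parts[i], "=")                 # split on the first "=" only
        if (eq == 0) continue
        if (substr(parts[i], 1, eq - 1) == key) {
          print substr(parts[i], eq + 1)
          break
        }
      }
    }
  ' "$2"
}
```

Calling gff3_attr Parent annotation.gff3 lists the parent identifier of every child feature, which is a quick way to inspect the ID/Parent links.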

Feature hierarchy

Annotations follow a nested hierarchy:

gene
  transcript (mRNA)
    exon
    CDS
    five_prime_UTR
    three_prime_UTR

In GTF, the hierarchy is implicit – features are grouped by shared gene_id and transcript_id values. In GFF3, the hierarchy is explicit through ID and Parent attributes.
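
Because GTF grouping is implicit, rebuilding the hierarchy means collecting lines that share an identifier. As an illustration, the sketch below counts exons per transcript by grouping exon lines on transcript_id; the function name exons_per_transcript is hypothetical.

```shell
# Count exon lines per transcript_id in a GTF file.
# Argument: $1 = GTF file. Output: transcript_id <TAB> exon count, sorted by ID.
exons_per_transcript() {
  awk -F'\t' '
    $3 == "exon" {
      if (match($9, /transcript_id "[^"]+"/)) {
        # The key, space, and opening quote take 15 characters;
        # the closing quote takes one more.
        tid = substr($9, RSTART + 15, RLENGTH - 16)
        count[tid]++
      }
    }
    END { for (t in count) print t "\t" count[t] }
  ' "$1" | sort
}
```

Run on the five example lines shown under Structure, this would report ENST00000456328 with a count of 3.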

Working With

Downloading annotations

# GENCODE human GTF (widely used for RNA-seq)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz

# NCBI RefSeq GFF3
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz

Extracting specific features

# Extract only gene-level features from a GTF
awk -F'\t' '$3 == "gene"' annotation.gtf > genes_only.gtf

# Extract protein-coding genes
# (GENCODE names the attribute gene_type; Ensembl names it gene_biotype)
awk -F'\t' '$3 == "gene" && /gene_(bio)?type "protein_coding"/' annotation.gtf > protein_coding.gtf

# Extract exon coordinates as a BED file (converting 1-based to 0-based)
awk 'BEGIN{OFS="\t"} $3=="exon" {print $1, $4-1, $5, ".", ".", $7}' \
  annotation.gtf > exons.bed

Using with RNA-seq aligners

# STAR genome generation with GTF
STAR --runMode genomeGenerate \
  --genomeDir star_index/ \
  --genomeFastaFiles reference.fa \
  --sjdbGTFfile annotation.gtf \
  --runThreadN 8

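# HISAT2: extract splice sites and exons from the GTF, then pass them to hisat2-build
hisat2_extract_splice_sites.py annotation.gtf > splice_sites.txt
hisat2_extract_exons.py annotation.gtf > exons.txt
hisat2-build --ss splice_sites.txt --exon exons.txt reference.fa hisat2_index  # index basename is illustrative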

Counting reads per gene

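# featureCounts (from Subread package); -p --countReadPairs counts fragments for paired-end data
featureCounts -a annotation.gtf -o featurecounts.txt \
  -T 8 -p --countReadPairs aligned.sorted.bam

# HTSeq (-s reverse assumes a reverse-stranded library; use -s yes or -s no otherwise)
htseq-count -f bam -r pos -s reverse \
  aligned.sorted.bam annotation.gtf > htseq_counts.txt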

Converting between formats

# GFF3 to GTF with gffread
gffread annotation.gff3 -T -o annotation.gtf

# GTF to GFF3
gffread annotation.gtf -o annotation.gff3

# Convert GTF to BED12 (transcript models) with the UCSC utilities
# genePredToBed takes the output file as its second argument
gtfToGenePred annotation.gtf /dev/stdout \
  | genePredToBed /dev/stdin transcripts.bed

Sorting GTF/GFF files

# Sort by chromosome and start position, keeping header/comment lines at the top
(grep '^#' annotation.gtf; grep -v '^#' annotation.gtf \
  | sort -t $'\t' -k1,1 -k4,4n) > sorted.gtf

See Also

  • featureCounts – count reads using GTF annotations

  • HTSeq – alternative read counting tool

  • Prokka – prokaryotic annotation producing GFF3

  • BED – simpler interval format (0-based, half-open coordinates)

  • FASTA – the reference genome that annotations describe

  • SAM / BAM / CRAM – alignment format used together with GTF for quantification