GFF / GTF
=========

Overview
--------

GFF (General Feature Format) and GTF (Gene Transfer Format) are tab-delimited
formats for describing genomic features -- genes, transcripts, exons, CDS
regions, UTRs, and other annotations. They are the usual output of genome
annotation pipelines and a common input for RNA-seq quantification,
splice-aware alignment, and functional analysis.

The two formats share the same nine-column layout but differ in how the
**attributes column** (column 9) is structured:

.. list-table::
   :header-rows: 1
   :widths: 15 20 65

   * - Format
     - Also known as
     - Attribute style
   * - **GTF**
     - GFF2 / GFF2.5
     - Key-value pairs: ``gene_id "ENSG..."; gene_name "DDX11L1";``
   * - **GFF3**
     - GFF version 3
     - Key=value pairs: ``ID=gene0;Name=DDX11L1;biotype=lncRNA``

GTF is the format most tools in the RNA-seq ecosystem expect (STAR,
featureCounts, HTSeq, StringTie) and is distributed by **GENCODE** and
**Ensembl**. GFF3 is specified by the Sequence Ontology project and is
preferred by **NCBI RefSeq** and many non-model organism databases.

.. important::

   Both GFF and GTF use **1-based, fully closed** coordinates. The interval
   covering bases 1 through 100 is written as ``start=1  end=100``. This
   differs from BED, which uses 0-based, half-open coordinates.
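
The off-by-one between the two conventions is easy to get wrong, so a small
sketch of the arithmetic may help. The shell functions below (``gff_to_bed``,
``bed_to_gff``, and ``gff_length`` are hypothetical names chosen for this
illustration, not part of any tool) follow the definitions given above:

.. code-block:: bash

   # Hypothetical helpers illustrating the coordinate arithmetic only.

   # 1-based closed (GFF/GTF) -> 0-based half-open (BED): shift the start down.
   gff_to_bed() { echo "$(( $1 - 1 )) $2"; }

   # 0-based half-open (BED) -> 1-based closed (GFF/GTF): shift the start up.
   bed_to_gff() { echo "$(( $1 + 1 )) $2"; }

   # The length of a closed interval counts both endpoints.
   gff_length() { echo "$(( $2 - $1 + 1 ))"; }

   gff_to_bed 1 100      # prints: 0 100
   gff_length 1 100      # prints: 100

Only the start changes in either direction; the end value is the same number
in both conventions.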

Structure
---------

Both formats use nine tab-separated columns. The GTF records below are shown
with spaces instead of tabs for readability:

.. code-block:: text

   chr1  HAVANA  gene        11869  14409  .  +  .  gene_id "ENSG00000223972"; gene_name "DDX11L1";
   chr1  HAVANA  transcript  11869  14409  .  +  .  gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
   chr1  HAVANA  exon        11869  12227  .  +  .  gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
   chr1  HAVANA  exon        12613  12721  .  +  .  gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
   chr1  HAVANA  exon        13221  14409  .  +  .  gene_id "ENSG00000223972"; transcript_id "ENST00000456328";

Column definitions
^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 8 18 74

   * - Col
     - Field
     - Description
   * - 1
     - seqname
     - Chromosome or contig name (e.g. ``chr1``).
   * - 2
     - source
     - Annotation source or program (e.g. ``HAVANA``, ``ENSEMBL``).
   * - 3
     - feature
     - Feature type: ``gene``, ``transcript``, ``exon``, ``CDS``, ``UTR``,
       ``start_codon``, ``stop_codon``, etc.
   * - 4
     - start
     - Start position (1-based, inclusive).
   * - 5
     - end
     - End position (1-based, inclusive).
   * - 6
     - score
     - Numeric score or ``.`` if not applicable.
   * - 7
     - strand
     - ``+`` (forward), ``-`` (reverse), or ``.`` (unstranded).
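   * - 8
     - frame
     - For CDS features, the number of bases (``0``, ``1``, or ``2``) to skip
       at the start of the feature to reach the first complete codon; ``.``
       for all other features. GFF3 calls this column *phase*.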
   * - 9
     - attributes
     - Semicolon-separated key-value pairs (format differs between GTF and
       GFF3).
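
The column rules in the table can be turned into a quick sanity check. The
following is a minimal sketch, not a full validator: ``check_gff`` is a
hypothetical function name, and the check covers only the column count, the
start/end order, and the strand symbol. It also accepts ``?``, which GFF3
allows for features whose strand is unknown.

.. code-block:: bash

   # Hypothetical helper: report records that break the basic column rules.
   # Usage: check_gff annotation.gtf   (exit status 0 if nothing was reported)
   check_gff() {
     awk -F'\t' '
       BEGIN { bad = 0 }
       /^#/ || NF == 0 { next }              # skip comments and blank lines
       NF != 9 {
         printf "line %d: expected 9 columns, found %d\n", NR, NF
         bad = 1
         next
       }
       ($4 + 0) > ($5 + 0) {                 # compare coordinates numerically
         printf "line %d: start %s is greater than end %s\n", NR, $4, $5
         bad = 1
       }
       $7 !~ /^[-+.?]$/ {
         printf "line %d: unexpected strand \"%s\"\n", NR, $7
         bad = 1
       }
       END { exit bad }
     ' "$1"
   }

A record that was pasted with spaces instead of tabs is reported as having one
column, which is a common cause of parse failures in downstream tools.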

GTF vs GFF3 attributes
^^^^^^^^^^^^^^^^^^^^^^^

**GTF (GFF2) style:**

.. code-block:: text

   gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_biotype "lncRNA";

* Keys and values separated by a space.
* Values enclosed in double quotes.
* Entries terminated by a semicolon.
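* ``gene_id`` is expected on every record and ``transcript_id`` on every
  transcript-level record (transcripts, exons, CDS); ``gene`` records, as in
  the example above, carry no ``transcript_id``.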

**GFF3 style:**

.. code-block:: text

   ID=gene-DDX11L1;Name=DDX11L1;gene_id=ENSG00000223972;gene_biotype=lncRNA

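* Key=value pairs separated by semicolons (no trailing semicolon).
* Values are **not** quoted; reserved characters such as ``;``, ``=``, ``&``,
  and ``,`` are percent-encoded (for example ``%3B`` for a semicolon).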
* ``ID`` provides a unique identifier for the feature.
* ``Parent`` links child features (exons, CDS) to their parent (transcript,
  gene).
* Hierarchical relationships are explicit through ``ID``/``Parent`` links.
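
The difference between the two styles shows up as soon as a single attribute
has to be pulled out of column 9. The sketch below uses two hypothetical
helper functions, ``gtf_attr`` and ``gff3_attr``, that read records on
standard input and print the value of one key. It assumes that values contain
no literal semicolons, and it does not decode percent-encoded characters in
GFF3 values.

.. code-block:: bash

   # Hypothetical helper: print the value of one GTF attribute per record.
   # Usage: gtf_attr gene_name < annotation.gtf
   gtf_attr() {
     awk -F'\t' -v key="$1" '
       !/^#/ {
         n = split($9, parts, ";")
         for (i = 1; i <= n; i++) {
           entry = parts[i]
           sub(/^ +/, "", entry)               # drop leading spaces
           if (index(entry, key " ") == 1) {   # entry starts with "key "
             value = substr(entry, length(key) + 2)
             gsub(/"/, "", value)              # strip the double quotes
             print value
           }
         }
       }'
   }

   # Hypothetical helper: the same lookup for GFF3 key=value attributes.
   # Usage: gff3_attr Name < annotation.gff3
   gff3_attr() {
     awk -F'\t' -v key="$1" '
       !/^#/ {
         n = split($9, parts, ";")
         for (i = 1; i <= n; i++)
           if (index(parts[i], key "=") == 1)  # entry starts with "key="
             print substr(parts[i], length(key) + 2)
       }'
   }

Matching on ``key`` followed by its separator keeps ``gene_id`` from also
matching a longer key that merely starts with the same letters.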

Feature hierarchy
^^^^^^^^^^^^^^^^^

Annotations follow a nested hierarchy:

.. code-block:: text

   gene
     transcript (mRNA)
       exon
       CDS
       five_prime_UTR
       three_prime_UTR

In GTF, the hierarchy is implicit -- features are grouped by shared
``gene_id`` and ``transcript_id`` values. In GFF3, the hierarchy is explicit
through ``ID`` and ``Parent`` attributes.
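
Because the GTF hierarchy is implicit, rebuilding it means grouping records by
their shared identifiers. As a minimal sketch, the hypothetical function
``exons_per_transcript`` below counts the exon records that share each
``transcript_id``. It assumes a well-formed GTF in which every exon record
carries a quoted ``transcript_id``.

.. code-block:: bash

   # Hypothetical helper: count exon records per transcript_id in a GTF file.
   # Usage: exons_per_transcript annotation.gtf
   exons_per_transcript() {
     awk -F'\t' '
       $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
         # The match starts with 15 characters (transcript_id, a space, and
         # the opening quote) and ends with the closing quote.
         tid = substr($9, RSTART + 15, RLENGTH - 16)
         count[tid]++
       }
       END { for (tid in count) printf "%s\t%d\n", tid, count[tid] }
     ' "$1" | sort
   }

For a GFF3 file, the same grouping would follow the ``Parent`` attribute of
each exon instead of ``transcript_id``.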

Working With
------------

Downloading annotations
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # GENCODE human GTF (widely used for RNA-seq)
   wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz

   # NCBI RefSeq GFF3
   wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz

Extracting specific features
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Extract only gene-level features from a GTF
   awk -F'\t' '$3 == "gene"' annotation.gtf > genes_only.gtf

   # Extract protein-coding genes
   # (Ensembl GTFs use gene_biotype; GENCODE GTFs use gene_type)
   awk -F'\t' '$3 == "gene" && $9 ~ /gene_(bio)?type "protein_coding"/' \
     annotation.gtf > protein_coding.gtf

   # Extract exon coordinates as a BED file (converting 1-based to 0-based);
   # the name column is "." and the score column is 0
   awk 'BEGIN{FS=OFS="\t"} $3=="exon" {print $1, $4-1, $5, ".", 0, $7}' \
     annotation.gtf > exons.bed

Using with RNA-seq aligners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # STAR genome generation with GTF
   STAR --runMode genomeGenerate \
     --genomeDir star_index/ \
     --genomeFastaFiles reference.fa \
     --sjdbGTFfile annotation.gtf \
     --runThreadN 8

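   # HISAT2: extract splice sites and exons from the GTF, then pass them
   # to the index build
   hisat2_extract_splice_sites.py annotation.gtf > splice_sites.txt
   hisat2_extract_exons.py annotation.gtf > exons.txt
   hisat2-build -p 8 --ss splice_sites.txt --exon exons.txt \
     reference.fa hisat2_index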

Counting reads per gene
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

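   # featureCounts (from the Subread package);
   # -p --countReadPairs counts fragments for paired-end data
   featureCounts -a annotation.gtf -o featurecounts.txt \
     -T 8 -p --countReadPairs aligned.sorted.bam

   # HTSeq; -s reverse assumes a reverse-stranded library
   htseq-count -f bam -r pos -s reverse \
     aligned.sorted.bam annotation.gtf > htseq_counts.txt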

Converting between formats
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # GFF3 to GTF with gffread
   gffread annotation.gff3 -T -o annotation.gtf

   # GTF to GFF3
   gffread annotation.gtf -o annotation.gff3

   # Convert GTF to BED12 (transcript models) with the UCSC utilities;
   # genePredToBed takes an input and an output argument
   gtfToGenePred annotation.gtf /dev/stdout \
     | genePredToBed /dev/stdin transcripts.bed

Sorting GTF/GFF files
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Sort by chromosome and start position, keeping '#' header lines on top
   (grep '^#' annotation.gtf
    grep -v '^#' annotation.gtf | sort -t$'\t' -k1,1 -k4,4n) > sorted.gtf

See Also
--------

* :doc:`/tools/quantification/featurecounts` -- count reads using GTF
  annotations
* :doc:`/tools/quantification/htseq` -- alternative read counting tool
* :doc:`/tools/annotation/prokka` -- prokaryotic annotation producing GFF3
* :doc:`bed` -- simpler interval format (0-based, half-open coordinates)
* :doc:`fasta` -- the reference genome that annotations describe
* :doc:`sam-bam-cram` -- alignment format used together with GTF for
  quantification