GATK

Overview

The Genome Analysis Toolkit (GATK), developed by the Broad Institute, is a widely used software suite for variant discovery in high-throughput sequencing data. Its best-practices germline workflow centres on HaplotypeCaller, run per sample in GVCF mode, followed by joint genotyping across the cohort with GenotypeGVCFs. For filtering, GATK provides Variant Quality Score Recalibration (VQSR), which fits a Gaussian mixture model to variant annotations, using known variant resources such as HapMap and dbSNP, to separate likely true variants from artefacts. The toolkit calls both SNPs and indels. GATK4 bundles the Picard tools, so preprocessing steps such as duplicate marking (MarkDuplicates) run through the same gatk launcher; base quality score recalibration is performed by GATK's own BaseRecalibrator and ApplyBQSR tools rather than by Picard.

Installation

mamba install -c bioconda gatk4

Basic Usage

Step 1 – Call variants per sample in GVCF mode

gatk HaplotypeCaller \
  -R reference.fa \
  -I sample.dedup.bam \
  -O sample.g.vcf.gz \
  -ERC GVCF
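
For a cohort, Step 1 is repeated once per BAM. The following is a minimal sketch of such a loop rather than part of the documented workflow. It assumes inputs are named <sample>.dedup.bam as in the command above; the helper names (gvcf_name, call_all) and the GATK override variable are illustrative only.

```shell
GATK=${GATK:-gatk}   # launcher; set GATK=echo to print the commands instead of running them

# Derive the GVCF name from a BAM path: dir/sample1.dedup.bam -> sample1.g.vcf.gz
gvcf_name() {
  local base
  base=$(basename "$1")
  printf '%s.g.vcf.gz\n' "${base%.dedup.bam}"
}

# call_all REF BAM... : run HaplotypeCaller in GVCF mode once per BAM,
# stopping at the first failure.
call_all() {
  local ref="$1" bam
  shift
  for bam in "$@"; do
    "$GATK" HaplotypeCaller -R "$ref" -I "$bam" -O "$(gvcf_name "$bam")" -ERC GVCF || return 1
  done
}
```

For example, call_all reference.fa sample1.dedup.bam sample2.dedup.bam would write sample1.g.vcf.gz and sample2.g.vcf.gz to the current directory.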

Step 2 – Combine GVCFs from multiple samples

gatk CombineGVCFs \
  -R reference.fa \
  -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
  -O combined.g.vcf.gz
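
With more than a handful of samples, typing one -V per GVCF becomes error-prone. The sketch below builds the repeated -V arguments from a list of files. The function name combine_gvcfs and the GATK override variable are assumptions made for illustration, not part of GATK.

```shell
GATK=${GATK:-gatk}   # launcher; set GATK=echo to print the command instead of running it

# combine_gvcfs REF OUT GVCF... : pass every GVCF to CombineGVCFs with its own -V flag.
combine_gvcfs() {
  local ref="$1" out="$2" g
  shift 2
  # Rewrite the argument list in place: each GVCF is dropped from the front
  # and re-appended as the pair "-V <gvcf>", which keeps odd file names intact.
  for g in "$@"; do
    shift
    set -- "$@" -V "$g"
  done
  "$GATK" CombineGVCFs -R "$ref" "$@" -O "$out"
}
```

A call such as combine_gvcfs reference.fa combined.g.vcf.gz sample*.g.vcf.gz expands to the same command as above. A broader glob like *.g.vcf.gz would also match combined.g.vcf.gz itself if it is left over from an earlier run.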

Step 3 – Joint genotyping

gatk GenotypeGVCFs \
  -R reference.fa \
  -V combined.g.vcf.gz \
  -O genotyped.vcf.gz

Step 4 – Variant Quality Score Recalibration (SNPs)

gatk VariantRecalibrator \
  -R reference.fa -V genotyped.vcf.gz \
  --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode SNP -O snp_recal --tranches-file snp_tranches

Step 5 – Apply VQSR filtering

gatk ApplyVQSR \
  -R reference.fa -V genotyped.vcf.gz \
  --recal-file snp_recal --tranches-file snp_tranches \
  --truth-sensitivity-filter-level 99.5 -mode SNP \
  -O filtered.vcf.gz
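
Steps 4 and 5 recalibrate SNPs only, so indels in filtered.vcf.gz are still unfiltered. A second pass in INDEL mode might look like the sketch below. The resource file mills.vcf.gz, its prior, the reduced annotation list, the --max-gaussians value, the 99.0 sensitivity level and the output names are illustrative assumptions, not values taken from this document; they should be checked against the GATK documentation for your data.

```shell
GATK=${GATK:-gatk}   # launcher; set GATK=echo to print the commands instead of running them

# indel_vqsr REF RAW_VCF SNP_FILTERED_VCF OUT_VCF
# Builds the INDEL model on the raw joint-genotyped VCF, then applies it to the
# SNP-filtered VCF so the output carries both SNP and INDEL filters.
# mills.vcf.gz, dbsnp.vcf.gz, indel_recal and indel_tranches are placeholder names.
indel_vqsr() {
  local ref="$1" raw="$2" snp_filtered="$3" out="$4"
  "$GATK" VariantRecalibrator \
    -R "$ref" -V "$raw" \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    --max-gaussians 4 \
    -mode INDEL -O indel_recal --tranches-file indel_tranches || return 1
  "$GATK" ApplyVQSR \
    -R "$ref" -V "$snp_filtered" \
    --recal-file indel_recal --tranches-file indel_tranches \
    --truth-sensitivity-filter-level 99.0 -mode INDEL \
    -O "$out"
}
```

With the file names used in this document, the call would be indel_vqsr reference.fa genotyped.vcf.gz filtered.vcf.gz final.vcf.gz, where final.vcf.gz is a hypothetical output name.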

Key Parameters

Flag / option                      Description
-R                                 Path to the reference genome FASTA (must be indexed with .fai and .dict files).
-I                                 Input BAM or CRAM file (should be sorted, indexed, and duplicate-marked).
-O                                 Output file path for variants.
-ERC GVCF                          Emit a GVCF with per-site confidence for both variant and reference positions, enabling joint genotyping later.
-V                                 Input VCF or GVCF file(s); repeat for multiple samples.
--resource                         Known variant resource with truth/training/prior annotations for VQSR.
-an                                Annotation feature to use in the VQSR Gaussian model (e.g. QD, MQ, FS).
-mode                              Recalibration mode: SNP or INDEL.
--truth-sensitivity-filter-level   Sensitivity threshold for the VQSR tranche filter (e.g. 99.5).
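
The -R requirement can be checked before launching a long run. The sketch below assumes an uncompressed FASTA whose dictionary sits next to it with the extension replaced by .dict (reference.fa, reference.fa.fai, reference.dict). The function name check_reference is illustrative, and the suggested samtools command assumes samtools is installed separately.

```shell
# check_reference FASTA : succeed only if the .fai index and .dict dictionary exist.
# Assumes an uncompressed FASTA, e.g. reference.fa -> reference.fa.fai + reference.dict
check_reference() {
  local fasta="$1" status=0
  local dict="${fasta%.*}.dict"
  if [ ! -s "$fasta.fai" ]; then
    echo "missing $fasta.fai (create with: samtools faidx $fasta)" >&2
    status=1
  fi
  if [ ! -s "$dict" ]; then
    echo "missing $dict (create with: gatk CreateSequenceDictionary -R $fasta)" >&2
    status=1
  fi
  return "$status"
}
```

Running check_reference reference.fa prints nothing and returns 0 when both companion files are present; otherwise it names each missing file on standard error and returns 1.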

Expected Output

  • sample.g.vcf.gz – per-sample GVCF containing variant calls and reference confidence blocks.

  • combined.g.vcf.gz – multi-sample GVCF merging all individual GVCFs.

  • genotyped.vcf.gz – joint-genotyped VCF with genotypes and genotype likelihoods for every sample at every variant site.

  • snp_recal / snp_tranches – VQSR model and tranche files used by ApplyVQSR.

  • filtered.vcf.gz – final VCF in which each SNP record's FILTER column holds either PASS or the label of the sensitivity tranche it fell into (e.g. VQSRTrancheSNP99.90to100.00).
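
A quick sanity check on the final file is to tally records per FILTER value. The sketch below uses only gzip, awk and sort, and assumes the VCF is bgzip- or gzip-compressed; the function name filter_counts is illustrative.

```shell
# filter_counts VCF_GZ : print "<FILTER value><TAB><record count>", sorted by FILTER value.
filter_counts() {
  gzip -cd "$1" |
    awk -F '\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' |
    LC_ALL=C sort
}
```

Running filter_counts filtered.vcf.gz lists PASS and each tranche label with its count. A PASS fraction far from what is expected for the chosen sensitivity level is a hint to revisit the VQSR inputs.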

See Also