GATK
Overview
The Genome Analysis Toolkit (GATK) is a widely used software suite for variant discovery in high-throughput sequencing data, developed by the Broad Institute. Its best-practices workflow for germline short variants centres on HaplotypeCaller for per-sample variant calling in GVCF mode, followed by joint genotyping across a cohort with GenotypeGVCFs. GATK provides Variant Quality Score Recalibration (VQSR), a machine-learning filter that uses known variant resources such as HapMap and dbSNP to separate true variants from artefacts. The toolkit handles both SNPs and indels. GATK4 bundles the Picard tools used for preprocessing steps such as duplicate marking (MarkDuplicates), and provides its own tools for base quality score recalibration (BaseRecalibrator and ApplyBQSR).
Installation
mamba install -c conda-forge -c bioconda gatk4
Basic Usage
Step 1 – Call variants per sample in GVCF mode
gatk HaplotypeCaller \
-R reference.fa \
-I sample.dedup.bam \
-O sample.g.vcf.gz \
-ERC GVCF
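For a cohort, the same command is repeated once per sample. The sketch below only prints the commands so they can be reviewed first; it assumes hypothetical inputs named `<sample>.dedup.bam` in the working directory, and `gvcf_command` is an illustrative helper, not part of GATK.

```shell
#!/bin/sh
# Print (rather than run) one HaplotypeCaller command for a given sample.
# Assumption: each sample has a BAM named <sample>.dedup.bam.
gvcf_command() {
    sample=$1
    printf 'gatk HaplotypeCaller -R reference.fa -I %s.dedup.bam -O %s.g.vcf.gz -ERC GVCF\n' \
        "$sample" "$sample"
}

# Hypothetical sample names; replace with the real cohort.
for s in sample1 sample2; do
    gvcf_command "$s"
done
```

Once the printed commands look right, the loop output can be piped to `sh` to execute them one after another.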
Step 2 – Combine GVCFs from multiple samples
gatk CombineGVCFs \
-R reference.fa \
-V sample1.g.vcf.gz -V sample2.g.vcf.gz \
-O combined.g.vcf.gz
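With more than a handful of samples, typing each `-V` by hand becomes error-prone. A minimal sketch of building the repeated `-V` arguments from a list of GVCF paths follows; `variant_args` is a hypothetical helper, and it assumes the paths contain no spaces.

```shell
#!/bin/sh
# Turn a list of GVCF paths into repeated "-V <path>" arguments.
# Assumption: paths contain no whitespace.
variant_args() {
    args=""
    for gvcf in "$@"; do
        args="$args -V $gvcf"
    done
    # Drop the leading space left by the first iteration.
    printf '%s\n' "${args# }"
}

# Print the resulting CombineGVCFs command for review (sample names are placeholders).
echo "gatk CombineGVCFs -R reference.fa $(variant_args sample1.g.vcf.gz sample2.g.vcf.gz sample3.g.vcf.gz) -O combined.g.vcf.gz"
```

In practice the argument list could come from a glob such as `variant_args *.g.vcf.gz`, provided `combined.g.vcf.gz` itself is not in the same directory.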
Step 3 – Joint genotyping
gatk GenotypeGVCFs \
-R reference.fa \
-V combined.g.vcf.gz \
-O genotyped.vcf.gz
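After joint genotyping it is worth confirming that every cohort sample appears in the output. The sketch below reads the sample names from the `#CHROM` header row of a VCF on standard input; `vcf_samples` is an illustrative helper and the header shown is made up for demonstration.

```shell
#!/bin/sh
# Print one sample name per line from the #CHROM header row of a VCF on stdin.
# In a VCF, columns 1-9 are fixed fields; sample columns start at column 10.
vcf_samples() {
    awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Demonstration on a made-up two-sample header.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\n' | vcf_samples
```

On the real output the equivalent call would be `gzip -dc genotyped.vcf.gz | vcf_samples`.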
Step 4 – Variant Quality Score Recalibration (SNPs)
gatk VariantRecalibrator \
-R reference.fa -V genotyped.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP -O snp_recal --tranches-file snp_tranches
Step 5 – Apply VQSR filtering
gatk ApplyVQSR \
-R reference.fa -V genotyped.vcf.gz \
--recal-file snp_recal --tranches-file snp_tranches \
--truth-sensitivity-filter-level 99.5 -mode SNP \
-O filtered.vcf.gz
Key Parameters
| Flag / option | Description |
|---|---|
| `-R` | Path to the reference genome FASTA (must be indexed; GATK expects a `.fai` index and a `.dict` sequence dictionary alongside it). |
| `-I` | Input BAM or CRAM file (should be sorted, indexed, and duplicate-marked). |
| `-O` | Output file path for variants. |
| `-ERC GVCF` | Emit a GVCF with per-site confidence for both variant and reference positions, enabling joint genotyping later. |
| `-V` | Input VCF or GVCF file(s); repeat for multiple samples. |
| `--resource` | Known variant resource with truth/training/prior annotations for VQSR. |
| `-an` | Annotation feature to use in the VQSR Gaussian model (e.g. `QD`, `MQ`, `FS`). |
| `-mode` | Recalibration mode: `SNP` or `INDEL`. |
| `--truth-sensitivity-filter-level` | Sensitivity threshold for the VQSR tranche filter (e.g. 99.5). |
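Because the reference passed to `-R` has to be indexed, a quick pre-flight check can save a failed run. This is a sketch under the assumption that the FASTA ends in a single extension such as `.fa` or `.fasta`; `missing_reference_indexes` is a hypothetical helper, not a GATK command.

```shell
#!/bin/sh
# List the companion files that are missing next to a reference FASTA.
# GATK looks for <ref>.fai and a .dict named after the FASTA without its extension.
# Assumption: the FASTA has a single extension (.fa, .fasta), not .fa.gz.
missing_reference_indexes() {
    ref=$1
    for f in "$ref.fai" "${ref%.*}.dict"; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done
}

missing_reference_indexes reference.fa
# A missing .fai can be created with:   samtools faidx reference.fa
# A missing .dict can be created with:  gatk CreateSequenceDictionary -R reference.fa
```

An empty result means both companion files are already in place.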
Expected Output
sample.g.vcf.gz – per-sample GVCF containing variant calls and reference confidence blocks.
combined.g.vcf.gz – multi-sample GVCF merging all individual GVCFs.
genotyped.vcf.gz – joint-genotyped VCF with genotype likelihoods for every sample at every variant site.
snp_recal / snp_tranches – VQSR model and tranche files used by ApplyVQSR.
filtered.vcf.gz – final VCF with VQSR filter annotations in the FILTER column (PASS or sensitivity tranche labels).
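To see how many records passed VQSR and how many fell into each tranche label, the FILTER column (column 7) can be tallied. The following is a minimal sketch; `filter_counts` is an illustrative helper, and the three records, including the tranche label, are made up for demonstration.

```shell
#!/bin/sh
# Count records per FILTER value (column 7) in a VCF read from stdin.
# Header lines starting with '#' are skipped.
filter_counts() {
    awk -F '\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' | sort
}

# Demonstration on a tiny made-up VCF body.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t50\tPASS\t.\n1\t200\t.\tC\tT\t20\tVQSRTrancheSNP99.50to99.90\t.\n1\t300\t.\tG\tA\t60\tPASS\t.\n' | filter_counts
```

On the real output the equivalent call would be `gzip -dc filtered.vcf.gz | filter_counts`.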
See Also
FreeBayes – haplotype-based variant caller with a simpler single-step workflow
DeepVariant – deep-learning variant caller from Google
VEP (Variant Effect Predictor) – annotate called variants with functional consequences
BCFtools – filter and manipulate VCF files post-calling