GATK
====

Overview
--------

The Genome Analysis Toolkit (GATK), developed by the Broad Institute, is a widely used software suite for variant discovery in high-throughput sequencing data. Its best-practices workflow for germline short variants centres on HaplotypeCaller, which calls variants per sample in GVCF mode, followed by joint genotyping across the cohort with GenotypeGVCFs. Variant Quality Score Recalibration (VQSR) then filters the calls with a Gaussian mixture model that uses known variant resources such as HapMap and dbSNP to separate true variants from artefacts. The toolkit handles both SNPs and indels. GATK4 bundles the Picard tools used for preprocessing steps such as duplicate marking, and provides its own tools (BaseRecalibrator and ApplyBQSR) for base quality score recalibration.

Installation
------------

.. code-block:: bash

   mamba install -c bioconda gatk4

Basic Usage
-----------

**Step 1 -- Call variants per sample in GVCF mode**

.. code-block:: bash

   gatk HaplotypeCaller \
       -R reference.fa \
       -I sample.dedup.bam \
       -O sample.g.vcf.gz \
       -ERC GVCF

**Step 2 -- Combine GVCFs from multiple samples**

.. code-block:: bash

   gatk CombineGVCFs \
       -R reference.fa \
       -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
       -O combined.g.vcf.gz

**Step 3 -- Joint genotyping**

.. code-block:: bash

   gatk GenotypeGVCFs \
       -R reference.fa \
       -V combined.g.vcf.gz \
       -O genotyped.vcf.gz

**Step 4 -- Variant Quality Score Recalibration (SNPs)**

.. code-block:: bash

   gatk VariantRecalibrator \
       -R reference.fa -V genotyped.vcf.gz \
       --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
       --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
       -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
       -mode SNP -O snp_recal --tranches-file snp_tranches

**Step 5 -- Apply VQSR filtering**

.. code-block:: bash

   gatk ApplyVQSR \
       -R reference.fa -V genotyped.vcf.gz \
       --recal-file snp_recal --tranches-file snp_tranches \
       --truth-sensitivity-filter-level 99.5 -mode SNP \
       -O filtered.vcf.gz

Key Parameters
--------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Flag / option
     - Description
   * - ``-R``
     - Path to the reference genome FASTA (must be accompanied by a ``.fai`` index and a ``.dict`` sequence dictionary).
   * - ``-I``
     - Input BAM or CRAM file (should be sorted, indexed, and duplicate-marked).
   * - ``-O``
     - Output file path (a VCF or GVCF for the calling steps, the recalibration file for VariantRecalibrator).
   * - ``-ERC GVCF``
     - Emit a GVCF with per-site confidence for both variant and reference positions, enabling joint genotyping later.
   * - ``-V``
     - Input VCF or GVCF file. CombineGVCFs accepts the flag repeatedly, once per sample.
   * - ``--resource``
     - Known variant resource with ``known``/``training``/``truth`` flags and a ``prior`` for VQSR.
   * - ``-an``
     - Annotation to use as a feature in the VQSR Gaussian mixture model (e.g. ``QD``, ``MQ``, ``FS``); repeat for each annotation.
   * - ``-mode``
     - Recalibration mode: ``SNP`` or ``INDEL``.
   * - ``--truth-sensitivity-filter-level``
     - Truth sensitivity, in percent, at which the VQSR tranche filter starts filtering (e.g. 99.5).

Expected Output
---------------

* ``sample.g.vcf.gz`` -- per-sample GVCF containing variant calls and reference confidence blocks.
* ``combined.g.vcf.gz`` -- multi-sample GVCF merging all individual GVCFs.
* ``genotyped.vcf.gz`` -- joint-genotyped VCF with genotype likelihoods for every sample at every variant site.
* ``snp_recal`` / ``snp_tranches`` -- VQSR recalibration and tranche files used by ApplyVQSR.
* ``filtered.vcf.gz`` -- final VCF with VQSR filter annotations in the FILTER column (PASS or sensitivity tranche labels). Steps 4 and 5 recalibrate SNPs only; indels need a separate pass with ``-mode INDEL``.

See Also
--------

* :doc:`freebayes` -- haplotype-based variant caller with a simpler single-step workflow
* :doc:`deepvariant` -- deep-learning variant caller from Google
* :doc:`/tools/variant-annotation/vep` -- annotate called variants with functional consequences
* :doc:`/tools/variant-processing/bcftools` -- filter and manipulate VCF files post-calling
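
Checking Inputs
---------------

Key Parameters notes that the reference needs ``.fai`` and ``.dict`` companions and that the BAM must be indexed. The sketch below is one possible pre-flight check to run before Step 1, not part of GATK itself. The function name ``check_gatk_inputs`` is hypothetical, the file naming (``reference.fa.fai``, ``reference.dict``, ``sample.dedup.bam.bai`` or ``sample.dedup.bai``) follows the usual conventions and is an assumption here, and the ``samtools`` hints assume samtools is installed.

.. code-block:: bash

   # Hypothetical helper (not a GATK command): report missing companion files.
   # Returns 0 when everything is present, 1 otherwise.
   check_gatk_inputs() {
       local ref="$1" bam="$2" missing=0
       # Assumed convention: reference.fa -> reference.dict
       local dict="${ref%.*}.dict"

       [ -s "$ref" ] || { echo "missing: $ref" >&2; missing=1; }
       [ -s "${ref}.fai" ] || {
           echo "missing: ${ref}.fai (try: samtools faidx $ref)" >&2
           missing=1
       }
       [ -s "$dict" ] || {
           echo "missing: $dict (try: gatk CreateSequenceDictionary -R $ref)" >&2
           missing=1
       }
       [ -s "$bam" ] || { echo "missing: $bam" >&2; missing=1; }
       # Accept either sample.bam.bai or sample.bai as the BAM index.
       if [ ! -s "${bam}.bai" ] && [ ! -s "${bam%.bam}.bai" ]; then
           echo "missing: BAM index (try: samtools index $bam)" >&2
           missing=1
       fi
       return "$missing"
   }

A typical call is ``check_gatk_inputs reference.fa sample.dedup.bam && gatk HaplotypeCaller ...``, so that HaplotypeCaller only starts once all companion files are in place.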