VCF / BCF
Overview
VCF (Variant Call Format) is the standard file format for describing genetic variants – SNPs, insertions, deletions, and structural variants – relative to a reference genome. It is produced by every major variant caller (GATK HaplotypeCaller, DeepVariant, bcftools, FreeBayes) and consumed by annotation tools, filtering pipelines, and population-genetic analyses.
BCF is the binary, BGZF-compressed equivalent of VCF. It is significantly faster to parse and is the recommended format for large cohorts. The relationship between VCF and BCF mirrors that of SAM and BAM.
VCF files use the extension .vcf (plain text) or .vcf.gz (block-compressed with bgzip). BCF files use .bcf. Both compressed formats can be indexed for fast regional queries: .vcf.gz with a tabix (.tbi) or CSI (.csi) index, .bcf with a CSI index.
Structure
A VCF file has three parts: meta-information lines, a header line, and data lines. Columns in the header and data lines are tab-delimited; the example here is rendered with spaces.
##fileformat=VCFv4.3
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr1 12345 rs123456 A G 99 PASS DP=50;AF=0.5 GT:DP:GQ 0/1:50:99
chr1 67890 . CT C 85 PASS DP=40;AF=0.3 GT:DP:GQ 0/1:40:85
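The three-part layout can be illustrated with a short Python sketch. It is not a full parser: the function name `split_vcf` and the inline `VCF_TEXT` sample are made up for illustration, and the input is assumed to be uncompressed, tab-delimited text.

```python
# Hypothetical sample text; real VCF columns are separated by tabs.
VCF_TEXT = "\n".join([
    "##fileformat=VCFv4.3",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "chr1\t12345\trs123456\tA\tG\t99\tPASS\tDP=50;AF=0.5\tGT:DP:GQ\t0/1:50:99",
    "chr1\t67890\t.\tCT\tC\t85\tPASS\tDP=40;AF=0.3\tGT:DP:GQ\t0/1:40:85",
])


def split_vcf(text):
    """Split VCF text into (meta_lines, header_columns, data_rows)."""
    meta, header, rows = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)                       # meta-information line
        elif line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")   # the single header line
        elif line:
            rows.append(line.split("\t"))           # one variant per data line
    return meta, header, rows


meta, header, rows = split_vcf(VCF_TEXT)
print(len(meta), header[-1], len(rows))  # prints: 2 SAMPLE1 2
```

For real files, an established library is usually the better choice; this sketch only makes the layout concrete.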
Meta-information lines
Lines starting with ## define the schema for INFO, FORMAT, FILTER, and
contig fields. They are machine-readable and describe every annotation that
appears in the data section.
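Because structured meta-information lines follow a regular `##KEY=<...>` pattern, they can be turned into dictionaries with little code. The sketch below assumes well-formed input and uses a hypothetical helper name, `parse_meta_line`; it handles quoted values that contain commas but does not attempt full specification coverage.

```python
import re


def parse_meta_line(line):
    """Parse a structured '##KEY=<...>' line into (key, attribute dict).

    Returns None for unstructured lines such as ##fileformat=VCFv4.3.
    """
    match = re.fullmatch(r"##(\w+)=<(.*)>", line.strip())
    if match is None:
        return None
    key, body = match.groups()
    # A value is either a double-quoted string (may contain commas)
    # or a bare token that runs up to the next comma.
    pairs = re.findall(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)', body)
    return key, {name: value.strip('"') for name, value in pairs}


key, attrs = parse_meta_line(
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
)
print(key, attrs["ID"], attrs["Type"])  # prints: INFO AF Float
```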
Header line
The single line starting with #CHROM names the eight mandatory columns (CHROM through INFO). When genotype data is present, these are followed by a FORMAT column and one column per sample.
Data columns
| Col | Field | Description |
|---|---|---|
| 1 | CHROM | Chromosome or contig name. |
| 2 | POS | 1-based position of the variant on the reference. |
| 3 | ID | Variant identifier (e.g. dbSNP rsID) or `.` when missing. |
| 4 | REF | Reference allele(s). |
| 5 | ALT | Alternate allele(s), comma-separated for multi-allelic sites. |
| 6 | QUAL | Phred-scaled quality score for the variant call. |
| 7 | FILTER | `PASS` if the site passed all filters, otherwise a semicolon-separated list of the filters it failed (`.` if no filters were applied). |
| 8 | INFO | Semicolon-separated key-value pairs with site-level annotations. |
| 9 | FORMAT | Colon-separated keys defining the per-sample fields. |
| 10+ | SAMPLE | Per-sample genotype and annotations matching the FORMAT keys. |
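To make the column layout concrete, here is a minimal Python sketch that breaks one data line into its parts. The function name `parse_record` and the `sample_names` argument are illustrative assumptions. INFO and per-sample values are left as strings, since converting them properly requires the Type declared in the header.

```python
def parse_record(line, sample_names):
    """Parse one tab-delimited VCF data line into a nested dict."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),                              # 1-based
        "ID": None if vid == "." else vid,
        "REF": ref,
        "ALT": [] if alt == "." else alt.split(","),  # multi-allelic aware
        "QUAL": None if qual == "." else float(qual),
        "FILTER": [] if flt == "." else flt.split(";"),
        "INFO": {},
        "SAMPLES": {},
    }
    for item in info.split(";"):
        if item and item != ".":
            key, sep, value = item.partition("=")
            # Flag-type INFO keys carry no '=value' part.
            record["INFO"][key] = value if sep else True
    if len(cols) > 9:
        keys = cols[8].split(":")                     # FORMAT keys
        for name, field in zip(sample_names, cols[9:]):
            record["SAMPLES"][name] = dict(zip(keys, field.split(":")))
    return record


rec = parse_record(
    "chr1\t12345\trs123456\tA\tG\t99\tPASS\tDP=50;AF=0.5\tGT:DP:GQ\t0/1:50:99",
    ["SAMPLE1"],
)
print(rec["POS"], rec["INFO"]["AF"], rec["SAMPLES"]["SAMPLE1"]["GT"])
# prints: 12345 0.5 0/1
```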
Genotype encoding
The GT (genotype) field uses indices into the REF/ALT alleles: 0 is the REF allele, 1 the first ALT allele, 2 the second ALT allele, and so on.
| Genotype | Meaning |
|---|---|
| `0/0` | Homozygous reference |
| `0/1` | Heterozygous (one REF, one ALT) |
| `1/1` | Homozygous alternate |
| `1/2` | Heterozygous with two different ALT alleles |
| `./.` | Missing genotype (no call) |
| `0\|1` | Phased heterozygous (`\|` instead of `/`) |
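The index-based encoding can be resolved back to allele strings with a few lines of Python. This is a sketch under simple assumptions: `decode_gt` is a hypothetical helper, and the REF and ALT alleles are passed in already split.

```python
import re


def decode_gt(gt, ref, alts):
    """Translate a GT string such as '0/1' into (called_alleles, phased)."""
    alleles = [ref] + list(alts)   # index 0 = REF, 1.. = ALT alleles
    phased = "|" in gt             # '|' marks phased, '/' unphased
    called = []
    for index in re.split(r"[/|]", gt):
        # '.' denotes a missing allele call.
        called.append(None if index == "." else alleles[int(index)])
    return called, phased


print(decode_gt("0/1", "A", ["G"]))       # prints: (['A', 'G'], False)
print(decode_gt("1/2", "A", ["G", "T"]))  # prints: (['G', 'T'], False)
print(decode_gt("0|1", "A", ["G"]))       # prints: (['A', 'G'], True)
```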
Variant types
| Type | REF | ALT | Description |
|---|---|---|---|
| SNP | `A` | `G` | Single nucleotide change |
| Insertion | `A` | `ATG` | Two bases inserted after position |
| Deletion | `CT` | `C` | One base deleted |
| MNP | `AT` | `GC` | Multi-nucleotide polymorphism |
| Complex | `ACG` | `TA` | Simultaneous insertion and deletion |
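The variant type follows from comparing the REF and ALT strings. The Python sketch below is a simplified heuristic, not the classification rules of bcftools or any other tool: `classify_variant` is a hypothetical name, and symbolic alleles such as `<DEL>`, breakends, and the `*` allele are out of scope.

```python
def classify_variant(ref, alt):
    """Classify a single REF/ALT pair by allele length and shared prefix."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "Insertion"   # ALT extends REF with extra bases
    if len(ref) > len(alt) and ref.startswith(alt):
        return "Deletion"    # ALT is REF with trailing bases removed
    return "Complex"         # lengths differ with no simple anchor


for ref, alt in [("A", "G"), ("A", "ATG"), ("CT", "C"), ("AT", "GC"), ("ACG", "TA")]:
    print(ref, alt, classify_variant(ref, alt))
```

For multi-allelic sites, each ALT allele would be classified separately against REF.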
Working With
Compressing and indexing
# Block-compress with bgzip (NOT gzip)
bgzip variants.vcf
# Create a tabix index
tabix -p vcf variants.vcf.gz
Viewing and querying
# View VCF header
bcftools view -h variants.vcf.gz
# Query a specific region
bcftools view variants.vcf.gz chr1:10000-50000
# View only PASS variants
bcftools view -f PASS variants.vcf.gz
Filtering variants
# Keep only biallelic SNPs with QUAL >= 30 and depth >= 10
bcftools view -m2 -M2 -v snps variants.vcf.gz \
| bcftools filter -i 'QUAL>=30 && INFO/DP>=10' -o filtered.vcf.gz -Oz
# Exclude variants that failed filters
bcftools view -f PASS variants.vcf.gz -o pass_only.vcf.gz -Oz
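The logic of the first filter (biallelic SNP, QUAL >= 30, INFO/DP >= 10) can be spelled out in plain Python to show what is being tested per record. This is an approximation for illustration and not a substitute for bcftools: `keep_record` is a hypothetical function, and records with missing QUAL or DP are simply dropped.

```python
BASES = {"A", "C", "G", "T"}


def keep_record(line, min_qual=30.0, min_depth=10):
    """Return True for biallelic SNPs meeting the QUAL and INFO/DP cutoffs."""
    cols = line.rstrip("\n").split("\t")
    ref, alt, qual, info = cols[3].upper(), cols[4].upper(), cols[5], cols[7]
    # A single-base REF and a single-base ALT (no comma) = biallelic SNP.
    if ref not in BASES or alt not in BASES:
        return False
    if qual == "." or float(qual) < min_qual:
        return False
    # Build {key: value} from INFO; flag keys map to an empty string.
    info_map = dict(item.partition("=")[::2] for item in info.split(";"))
    depth = info_map.get("DP", "")
    return depth.isdigit() and int(depth) >= min_depth


print(keep_record("chr1\t12345\trs123456\tA\tG\t99\tPASS\tDP=50;AF=0.5"))  # prints: True
print(keep_record("chr1\t67890\t.\tCT\tC\t85\tPASS\tDP=40;AF=0.3"))        # prints: False
```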
Statistics and summary
# Generate variant statistics
bcftools stats variants.vcf.gz > stats.txt
# Count variants by type
bcftools stats variants.vcf.gz | grep '^SN'
Converting VCF to BCF and back
# VCF to BCF
bcftools view variants.vcf.gz -Ob -o variants.bcf
bcftools index variants.bcf
# BCF to VCF
bcftools view variants.bcf -Oz -o variants.vcf.gz
tabix -p vcf variants.vcf.gz
Merging and concatenating
# Merge VCFs from different samples into one multi-sample file (inputs must be bgzipped and indexed)
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
# Concatenate VCFs from different regions (same samples)
bcftools concat chr1.vcf.gz chr2.vcf.gz -Oz -o combined.vcf.gz
Extracting fields
# Extract specific fields as a tab-separated table (one GT column per sample)
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL[\t%GT]\n' \
variants.vcf.gz > variants.tsv
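For readers who want to see what this extraction does per record, here is a rough Python equivalent that pulls the same columns from one data line. It is a sketch for illustration only: `to_tsv_row` is a hypothetical name, and the input is assumed to be a well-formed, tab-delimited data line.

```python
def to_tsv_row(line):
    """Return CHROM, POS, REF, ALT, QUAL and each sample's GT, tab-joined."""
    cols = line.rstrip("\n").split("\t")
    fields = [cols[0], cols[1], cols[3], cols[4], cols[5]]
    if len(cols) > 9:
        keys = cols[8].split(":")        # FORMAT keys, e.g. GT:DP:GQ
        if "GT" in keys:
            gt_index = keys.index("GT")
            # Take the GT sub-field from every sample column.
            fields += [sample.split(":")[gt_index] for sample in cols[9:]]
    return "\t".join(fields)


row = to_tsv_row(
    "chr1\t12345\trs123456\tA\tG\t99\tPASS\tDP=50;AF=0.5\tGT:DP:GQ\t0/1:50:99"
)
print(row.split("\t"))  # prints: ['chr1', '12345', 'A', 'G', '99', '0/1']
```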
See Also
GATK – includes HaplotypeCaller for germline variant calling
DeepVariant – deep-learning variant caller
BCFtools – the primary VCF/BCF manipulation toolkit
SnpEff – functional annotation of VCF variants
VEP (Variant Effect Predictor) – Ensembl's tool for predicting the functional consequences of variants
SAM / BAM / CRAM – the alignment format from which variants are called
BED – interval format often used for filtering VCF by region