VEP (Variant Effect Predictor)
Overview
The Ensembl Variant Effect Predictor (VEP) annotates genetic variants with their predicted functional consequences on genes, transcripts, and protein sequences. For each variant it can report the affected gene and transcript, protein position, amino acid change, SIFT and PolyPhen predictions of the effect of amino acid substitutions, regulatory region overlaps, and allele frequencies from population databases. VEP accepts VCF input and can write annotated output in VCF, tab-delimited, or JSON format. It can use a local cache for fast offline annotation and can be extended with plugins for additional data sources such as CADD scores, dbNSFP, and LOFTEE.
Installation
mamba install -c bioconda ensembl-vep
After installation, download the annotation cache for your reference assembly:
vep_install -a cf -s homo_sapiens -y GRCh38 -c /databases/vep/
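Before running an annotation it can help to confirm that the cache actually landed where expected. The following is a minimal Python sketch, not part of VEP itself: the helper name `find_vep_caches` is hypothetical, and it assumes the usual cache layout of `<cache root>/<species>/<release>_<assembly>/`.

```python
from pathlib import Path


def find_vep_caches(cache_root, species="homo_sapiens"):
    """Return (release, assembly) pairs found under cache_root/species."""
    species_dir = Path(cache_root) / species
    if not species_dir.is_dir():
        return []
    found = []
    for entry in sorted(species_dir.iterdir()):
        # Assumption: cache directories are named "<release>_<assembly>",
        # for example "110_GRCh38". Anything else is ignored.
        release, sep, assembly = entry.name.partition("_")
        if entry.is_dir() and sep and release.isdigit():
            found.append((int(release), assembly))
    return found


if __name__ == "__main__":
    for release, assembly in find_vep_caches("/databases/vep/"):
        print(f"release {release}, assembly {assembly}")
```

An empty result suggests the cache path or species name does not match what was passed to vep_install.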
Basic Usage
Annotate a filtered VCF with comprehensive functional annotations.
vep --input_file filtered.vcf.gz \
--output_file annotated.vcf.gz \
--vcf --compress_output bgzip \
--cache --dir_cache /databases/vep/ \
--assembly GRCh38 \
--everything \
--fork 4
The --everything flag is a shortcut that switches on a broad set of annotation options at once, including SIFT, PolyPhen, allele frequencies, and regulatory annotations. It does not load plugins; those are added separately with --plugin.
Key Parameters
| Flag / option | Description |
|---|---|
| --input_file | Path to the input VCF or variant file. |
| --output_file | Path for the annotated output file. |
| --vcf | Write output in VCF format (annotations added to the INFO field as CSQ). |
| --compress_output bgzip | Compress the output file with bgzip. |
| --cache | Use the local annotation cache instead of querying the Ensembl API. |
| --dir_cache | Path to the directory containing the VEP cache files. |
| --assembly | Genome assembly version (e.g. GRCh38). |
| --everything | Enable a broad set of annotation fields: SIFT, PolyPhen, allele frequencies, regulatory features, and more. |
| --fork | Number of parallel forked processes for faster annotation. |
| --most_severe | Report only the most severe consequence per variant rather than all affected transcripts. |
| --plugin | Load a VEP plugin for additional annotations (e.g. CADD, LOFTEE). |
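When VEP is called from a pipeline script, the options above can be assembled programmatically. This is a rough Python sketch that only builds the argument list for the command shown in Basic Usage; `build_vep_command` and its parameters are hypothetical names chosen for illustration, and plugin strings are passed through unchanged, so each plugin's own argument syntax still has to be taken from its documentation.

```python
import shlex


def build_vep_command(input_vcf, output_vcf, cache_dir,
                      assembly="GRCh38", forks=4, plugins=()):
    """Assemble the vep argument list; nothing is executed here."""
    cmd = [
        "vep",
        "--input_file", str(input_vcf),
        "--output_file", str(output_vcf),
        "--vcf", "--compress_output", "bgzip",
        "--cache", "--dir_cache", str(cache_dir),
        "--assembly", assembly,
        "--everything",
        "--fork", str(forks),
    ]
    for plugin in plugins:
        # Each plugin gets its own --plugin option.
        cmd += ["--plugin", plugin]
    return cmd


if __name__ == "__main__":
    cmd = build_vep_command("filtered.vcf.gz", "annotated.vcf.gz",
                            "/databases/vep/")
    # Print a shell-quoted form for inspection.
    # To actually run it: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.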
Expected Output
- annotated.vcf.gz: a VCF file with a CSQ field in the INFO column containing pipe-delimited annotation values for each variant-transcript pair. Annotations include: gene symbol, transcript ID, consequence type (e.g. missense_variant, synonymous_variant, frameshift_variant), protein position, amino acid change, SIFT score, PolyPhen score, and population allele frequencies.
- annotated.vcf.gz_summary.html: an HTML summary report with variant class distributions, consequence type counts, and annotation statistics.
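The CSQ field can be unpacked with a few lines of Python. The sketch below is one possible approach under stated assumptions, not an official parser: the function names are hypothetical, and it assumes the CSQ header line lists the sub-field names after the text "Format: " and that entries for different transcripts are separated by commas.

```python
import gzip


def csq_fields_from_header(header_line):
    """Extract CSQ sub-field names from the ##INFO=<ID=CSQ,...> header line."""
    description = header_line.split("Format: ", 1)[1]
    # Drop the trailing newline and the closing '">' of the header.
    return description.rstrip().rstrip('">').split("|")


def parse_csq(info, fields):
    """Return one dict per comma-separated CSQ entry in an INFO string."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            return [dict(zip(fields, entry.split("|")))
                    for entry in item[len("CSQ="):].split(",")]
    return []


def iter_annotations(vcf_path):
    """Yield (chrom, pos, ref, alt, annotations) for each VCF record."""
    fields = []
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("##INFO=<ID=CSQ"):
                fields = csq_fields_from_header(line)
            elif not line.startswith("#"):
                cols = line.rstrip("\n").split("\t")
                # Column 8 (index 7) of a VCF record is INFO.
                yield (cols[0], int(cols[1]), cols[3], cols[4],
                       parse_csq(cols[7], fields))
```

For example, looping over `iter_annotations("annotated.vcf.gz")` and checking whether "missense_variant" appears in an entry's Consequence value would list missense calls. A substring check is used here because a single entry may carry several consequence terms joined with "&".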