Glossary

ASV

Amplicon Sequence Variant. Exact biological sequence inferred from amplicon reads by denoising tools such as DADA2, resolving differences down to a single nucleotide. Replaces the older OTU approach in 16S amplicon analysis.

BAM

Binary Alignment Map. Compressed binary form of the SAM format for storing aligned sequencing reads.

BED

Browser Extensible Data. Tab-delimited format for genomic intervals using 0-based, half-open coordinates.
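
The coordinate convention is easiest to see in a small conversion. The sketch below is illustrative only; the function names are hypothetical and not part of any BED library.

```python
def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based, half-open BED interval to 1-based, closed coordinates."""
    # Only the start shifts: the half-open end already equals the closed 1-based end.
    return start + 1, end


def bed_length(start: int, end: int) -> int:
    """Length of a BED interval; with half-open coordinates it is simply end - start."""
    return end - start


# The first 100 bases of a chromosome are 0..100 in BED and 1..100 in 1-based formats.
print(bed_to_one_based(0, 100))
print(bed_length(0, 100))
```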

BUSCO

Benchmarking Universal Single-Copy Orthologs. Tool for assessing genome assembly completeness using conserved gene sets.

BWA

Burrows-Wheeler Aligner. Short-read alignment tool; BWA-MEM2 is the faster SIMD-optimized successor.

CIGAR

Compact Idiosyncratic Gapped Alignment Report. String in SAM format describing how a read aligns to the reference (e.g., 151M, 75M2I74M).
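
As a minimal sketch of how the two example strings can be read, the code below splits a CIGAR string into (length, operation) pairs and sums the bases consumed on the read and on the reference. The function names are hypothetical, and the unavailable-CIGAR placeholder `*` is deliberately rejected.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def query_length(cigar: str) -> int:
    """Number of read bases covered by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


print(parse_cigar("75M2I74M"))
print(query_length("75M2I74M"), reference_length("75M2I74M"))
```

Both example strings describe a 151-base read; the two inserted bases in 75M2I74M are present in the read but not in the reference, so that alignment spans 149 reference bases.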

CpG

Cytosine-phosphate-Guanine dinucleotide. Primary target of DNA methylation in mammalian genomes.

CRAM

Reference-based compressed alignment format. More compact than BAM but requires the reference genome for decoding.

DE

Differential Expression. Statistical identification of genes with significantly different expression levels between conditions.

DMR

Differentially Methylated Region. Genomic region with statistically significant methylation differences between conditions.

DSL2

Domain Specific Language version 2. Nextflow’s modular workflow syntax enabling process reuse and sub-workflows.

eDNA

Environmental DNA. DNA extracted directly from environmental samples (water, soil) without isolating organisms.

FASTA

Text-based format for nucleotide or protein sequences. Uses > header lines followed by sequence data.
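
A minimal reader makes the layout concrete. This is a sketch, not a replacement for an established parser; it assumes the record name is the first whitespace-delimited token of the header line, and the function name is hypothetical.

```python
def parse_fasta(lines):
    """Return {record_name: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # assumption: first token is the name
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may wrap over several lines
    return {rec: "".join(parts) for rec, parts in records.items()}


example = [">seq1 example record", "ACGT", "TTGA", ">seq2", "GG"]
print(parse_fasta(example))
```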

FASTQ

Text-based format storing nucleotide sequences together with per-base Phred quality scores.
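
The quality line stores one character per base. The sketch below decodes it under the assumption of the Phred+33 ASCII offset used by Sanger and current Illumina output; older files may use a different offset, and the function name is hypothetical.

```python
def decode_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to a list of Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(char) - offset for char in quality_string]


# '!' is the lowest possible score (0) and 'I' corresponds to Q40 under Phred+33.
print(decode_qualities("!5?I"))
```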

FLAG

Bitwise flag in SAM format encoding read properties (paired, mapped, duplicate, etc.). E.g., 99 = paired, proper pair, mate reverse, first in pair.
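
The example value can be unpacked bit by bit. The sketch below uses the bit values from the SAM specification; the short labels and the function name are informal choices made for this illustration.

```python
# Bit values from the SAM specification; the labels are informal summaries.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list[str]:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
```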

GFF

General Feature Format. Tab-delimited format for genomic annotations (genes, exons, CDS). GFF3 is the current version.

GTF

Gene Transfer Format. Derived from GFF version 2, with a stricter attribute convention (gene_id and transcript_id). Widely used for gene annotations, especially from GENCODE and Ensembl.

GVCF

Genomic VCF. GATK’s extended VCF that records confidence at every position, enabling efficient joint genotyping.

h5ad

HDF5-based file format used by AnnData/Scanpy for single-cell data. Stores count matrices, cell metadata, and embeddings.

HPC

High-Performance Computing. Cluster computing environment managed by job schedulers like SLURM.

HVG

Highly Variable Genes. Genes with the most variable expression across cells, selected as informative features for dimensionality reduction.

IGV

Integrative Genomics Viewer. Desktop application for interactive visualization of genomic data (BAM, VCF, BED, BigWig).

indel

Insertion or deletion variant relative to the reference genome.

MAPQ

Mapping Quality. Phred-scaled probability that an alignment is wrong. MAPQ 30 means a 1 in 1,000 chance that the read is mapped incorrectly.

MEX

Matrix Market Exchange format. Sparse matrix format (matrix.mtx, barcodes.tsv, features.tsv) output by Cell Ranger and STARsolo.

NGS

Next-Generation Sequencing. High-throughput DNA/RNA sequencing technologies producing millions of reads per run.

ONT

Oxford Nanopore Technologies. Long-read sequencing platform that reads DNA by measuring ionic current changes through nanopores.

OTU

Operational Taxonomic Unit. Cluster of similar sequences (typically 97% identity) used in older 16S analyses. Superseded by ASVs.

PCA

Principal Component Analysis. Dimensionality reduction method used to identify major axes of variation in gene expression data.

Phred

Quality score encoding where Q = -10 log10(P_error). Q30 means 99.9% base call accuracy.
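
The formula can be applied in both directions, as in the sketch below (function names are hypothetical). The same scale is used for per-base quality scores and for mapping qualities.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred score: P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P_error)."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10 * math.log10(p_error)


# Q20 is a 1 in 100 error rate, Q30 is 1 in 1,000, Q40 is 1 in 10,000.
for q in (20, 30, 40):
    print(q, phred_to_error_prob(q))
```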

RPKM

Reads Per Kilobase per Million mapped reads. Normalization method for coverage tracks accounting for library size and region length.
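
A worked calculation shows how the two scaling factors combine. This is a minimal sketch with a hypothetical function name and made-up input numbers.

```python
def rpkm(reads_in_region: int, region_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads.

    Dividing by (length / 1e3) and by (total / 1e6) is the same as
    multiplying the read count by 1e9 and dividing by length * total.
    """
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    return reads_in_region * 1e9 / (region_length_bp * total_mapped_reads)


# 500 reads in a 2 kb region, from a library of 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))
```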

SAM

Sequence Alignment/Map. Tab-delimited text format for storing aligned sequencing reads against a reference genome.

SLURM

Simple Linux Utility for Resource Management. Widely used HPC job scheduling system.

SNV

Single Nucleotide Variant. Single base-pair change relative to the reference genome.

SRA

Sequence Read Archive. NCBI’s primary repository for raw sequencing data from high-throughput platforms.

TSS

Transcription Start Site. Genomic position where RNA polymerase begins transcription of a gene.

UMAP

Uniform Manifold Approximation and Projection. Non-linear dimensionality reduction used for visualizing single-cell data.

UMI

Unique Molecular Identifier. Random barcode attached during library preparation to tag individual molecules, enabling PCR duplicate removal in single-cell and other protocols.
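
The idea behind UMI-based duplicate removal can be sketched as counting distinct UMIs per cell and gene. This simplified version assumes exact UMI matching and hypothetical input tuples; production tools typically also merge UMIs that differ only by sequencing errors.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples. Reads that
    share all three values are treated as PCR duplicates of one molecule.
    """
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


reads = [
    ("AAAC", "TTGCA", "GAPDH"),
    ("AAAC", "TTGCA", "GAPDH"),  # same cell, UMI, and gene: a PCR duplicate
    ("AAAC", "GGTCA", "GAPDH"),
    ("CCGT", "TTGCA", "ACTB"),
]
print(count_molecules(reads))
```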

VCF

Variant Call Format. Standard format for storing genetic variants (SNVs, indels, structural variants) with genotype information.

VQSR

Variant Quality Score Recalibration. GATK’s machine-learning approach to variant filtering using known variant databases as training data.

WGBS

Whole Genome Bisulfite Sequencing. Technique for measuring DNA methylation at single-base resolution across the entire genome.

WGS

Whole Genome Sequencing. Sequencing of an organism’s entire genome, ideally at roughly uniform coverage.