Glossary

ASV: Amplicon Sequence Variant. Exact biological sequences resolved by DADA2, replacing the older OTU approach in 16S amplicon analysis.
BAM: Binary Alignment Map. Compressed binary form of the SAM format for storing aligned sequencing reads.
BED: Browser Extensible Data. Tab-delimited format for genomic intervals using 0-based, half-open coordinates.
BUSCO: Benchmarking Universal Single-Copy Orthologs. Tool for assessing genome assembly completeness using conserved gene sets.
BWA: Burrows-Wheeler Aligner. Short-read alignment tool; BWA-MEM2 is the faster SIMD-optimized successor.
CIGAR: Compact Idiosyncratic Gapped Alignment Report. String in SAM format describing how a read aligns to the reference (e.g., 151M, 75M2I74M).
CpG: Cytosine-phosphate-Guanine dinucleotide. Primary target of DNA methylation in mammalian genomes.
CRAM: Reference-based compressed alignment format. More compact than BAM but requires the reference genome for decoding.
DE: Differential Expression. Statistical identification of genes with significantly different expression levels between conditions.
DMR: Differentially Methylated Region. Genomic region with statistically significant methylation differences between conditions.
DSL2: Domain Specific Language version 2. Nextflow’s modular workflow syntax enabling process reuse and sub-workflows.
eDNA: Environmental DNA. DNA extracted directly from environmental samples (water, soil) without isolating organisms.
FASTA: Text-based format for nucleotide or protein sequences. Uses > header lines followed by sequence data.
FASTQ: Text-based format storing nucleotide sequences together with per-base Phred quality scores.
FLAG: Bitwise flag in SAM format encoding read properties (paired, mapped, duplicate, etc.). E.g., 99 = paired, proper pair, mate reverse, first in pair.
GFF: General Feature Format. Tab-delimited format for genomic annotations (genes, exons, CDS). GFF3 is the current version.
GTF: Gene Transfer Format (GFF version 2). Widely used for gene annotations, especially from GENCODE and Ensembl.
GVCF: Genomic VCF. GATK’s extended VCF that records confidence at every position, enabling efficient joint genotyping.
h5ad: HDF5-based file format used by AnnData/Scanpy for single-cell data. Stores count matrices, cell metadata, and embeddings.
HPC: High-Performance Computing. Cluster computing environment managed by job schedulers like SLURM.
HVG: Highly Variable Genes. Genes with the most variable expression across cells, selected as informative features for dimensionality reduction.
IGV: Integrative Genomics Viewer. Desktop application for interactive visualization of genomic data (BAM, VCF, BED, BigWig).
indel: Insertion or deletion variant relative to the reference genome.
MAPQ: Mapping Quality. Phred-scaled probability that an alignment is wrong. MAPQ 30 means 1 in 1000 chance of incorrect mapping.
MEX: Market Exchange format. Sparse matrix format (matrix.mtx, barcodes.tsv, features.tsv) output by Cell Ranger and STARsolo.
NGS: Next-Generation Sequencing. High-throughput DNA/RNA sequencing technologies producing millions of reads per run.
ONT: Oxford Nanopore Technologies. Long-read sequencing platform that reads DNA by measuring ionic current changes through nanopores.
OTU: Operational Taxonomic Unit. Cluster of similar sequences (typically 97% identity) used in older 16S analyses. Superseded by ASVs.
PCA: Principal Component Analysis. Dimensionality reduction method used to identify major axes of variation in gene expression data.
Phred: Quality score encoding where Q = -10 log10(P_error). Q30 means 99.9% base call accuracy.
RPKM: Reads Per Kilobase per Million mapped reads. Normalization method for coverage tracks accounting for library size and region length.
SAM: Sequence Alignment/Map. Tab-delimited text format for storing aligned sequencing reads against a reference genome.
SLURM: Simple Linux Utility for Resource Management. Widely used HPC job scheduling system.
SNV: Single Nucleotide Variant. Single base-pair change relative to the reference genome.
SRA: Sequence Read Archive. NCBI’s primary repository for raw sequencing data from high-throughput platforms.
TSS: Transcription Start Site. Genomic position where RNA polymerase begins transcription of a gene.
UMAP: Uniform Manifold Approximation and Projection. Non-linear dimensionality reduction used for visualizing single-cell data.
UMI: Unique Molecular Identifier. Random barcode attached during library preparation to tag individual molecules, enabling PCR duplicate removal in single-cell and other protocols.
VCF: Variant Call Format. Standard format for storing genetic variants (SNVs, indels, structural variants) with genotype information.
VQSR: Variant Quality Score Recalibration. GATK’s machine-learning approach to variant filtering using known variant databases as training data.
WGBS: Whole Genome Bisulfite Sequencing. Technique for measuring DNA methylation at single-base resolution across the entire genome.
WGS: Whole Genome Sequencing. Sequencing of an organism’s entire genome at uniform coverage.