Glossary
- ASV
Amplicon Sequence Variant. Exact biological sequence inferred by denoising tools such as DADA2; ASVs replace the older OTU approach in 16S amplicon analysis.
- BAM
Binary Alignment Map. Compressed binary form of the SAM format for storing aligned sequencing reads.
- BED
Browser Extensible Data. Tab-delimited format for genomic intervals using 0-based, half-open coordinates.
- BUSCO
Benchmarking Universal Single-Copy Orthologs. Tool for assessing genome assembly completeness using conserved gene sets.
- BWA
Burrows-Wheeler Aligner. Short-read alignment tool; BWA-MEM2 is the faster SIMD-optimized successor.
- CIGAR
Compact Idiosyncratic Gapped Alignment Report. String in SAM format describing how a read aligns to the reference (e.g., 151M, 75M2I74M).
- CpG
Cytosine-phosphate-Guanine dinucleotide. Primary target of DNA methylation in mammalian genomes.
- CRAM
Reference-based compressed alignment format. More compact than BAM but requires the reference genome for decoding.
- DE
Differential Expression. Statistical identification of genes with significantly different expression levels between conditions.
- DMR
Differentially Methylated Region. Genomic region with statistically significant methylation differences between conditions.
- DSL2
Domain Specific Language version 2. Nextflow’s modular workflow syntax enabling process reuse and sub-workflows.
- eDNA
Environmental DNA. DNA extracted directly from environmental samples (water, soil) without isolating organisms.
- FASTA
Text-based format for nucleotide or protein sequences. Uses >header lines followed by sequence data.
- FASTQ
Text-based format storing nucleotide sequences together with per-base Phred quality scores.
- FLAG
Bitwise flag in SAM format encoding read properties (paired, mapped, duplicate, etc.). E.g., 99 = paired, proper pair, mate reverse, first in pair.
- GFF
General Feature Format. Tab-delimited format for genomic annotations (genes, exons, CDS). GFF3 is the current version.
- GTF
Gene Transfer Format (GFF version 2). Widely used for gene annotations, especially from GENCODE and Ensembl.
- GVCF
Genomic VCF. GATK’s extended VCF that records confidence at every position, enabling efficient joint genotyping.
- h5ad
HDF5-based file format used by AnnData/Scanpy for single-cell data. Stores count matrices, cell metadata, and embeddings.
- HPC
High-Performance Computing. Cluster computing environment managed by job schedulers like SLURM.
- HVG
Highly Variable Genes. Genes with the most variable expression across cells, selected as informative features for dimensionality reduction.
- IGV
Integrative Genomics Viewer. Desktop application for interactive visualization of genomic data (BAM, VCF, BED, BigWig).
- indel
Insertion or deletion variant relative to the reference genome.
- MAPQ
Mapping Quality. Phred-scaled probability that an alignment is wrong. MAPQ 30 means a 1 in 1,000 chance that the read is mapped incorrectly.
- MEX
Market Exchange format, based on the Matrix Market sparse matrix format. Stored as three files (matrix.mtx, barcodes.tsv, features.tsv) output by Cell Ranger and STARsolo.
- NGS
Next-Generation Sequencing. High-throughput DNA/RNA sequencing technologies producing millions of reads per run.
- ONT
Oxford Nanopore Technologies. Long-read sequencing platform that reads DNA by measuring ionic current changes through nanopores.
- OTU
Operational Taxonomic Unit. Cluster of similar sequences (typically 97% identity) used in older 16S analyses. Superseded by ASVs.
- PCA
Principal Component Analysis. Dimensionality reduction method used to identify major axes of variation in gene expression data.
- Phred
Quality score encoding where Q = -10 log10(P_error). Q30 means 99.9% base call accuracy.
- RPKM
Reads Per Kilobase per Million mapped reads. Normalization method for coverage tracks accounting for library size and region length.
- SAM
Sequence Alignment/Map. Tab-delimited text format for storing aligned sequencing reads against a reference genome.
- SLURM
Simple Linux Utility for Resource Management. Widely used HPC job scheduling system.
- SNV
Single Nucleotide Variant. Single base-pair change relative to the reference genome.
- SRA
Sequence Read Archive. NCBI’s primary repository for raw sequencing data from high-throughput platforms.
- TSS
Transcription Start Site. Genomic position where RNA polymerase begins transcription of a gene.
- UMAP
Uniform Manifold Approximation and Projection. Non-linear dimensionality reduction used for visualizing single-cell data.
- UMI
Unique Molecular Identifier. Random barcode attached during library preparation to tag individual molecules, enabling PCR duplicate removal in single-cell and other protocols.
- VCF
Variant Call Format. Standard format for storing genetic variants (SNVs, indels, structural variants) with genotype information.
- VQSR
Variant Quality Score Recalibration. GATK’s machine-learning approach to variant filtering using known variant databases as training data.
- WGBS
Whole Genome Bisulfite Sequencing. Technique for measuring DNA methylation at single-base resolution across the entire genome.
- WGS
Whole Genome Sequencing. Sequencing of an organism’s entire genome, ideally at near-uniform coverage.
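
Several entries above (FLAG, CIGAR, Phred, MAPQ, BED) describe small encodings that are easier to follow with a worked example. The following is a minimal Python sketch, not a replacement for tools such as samtools or pysam. The function names (decode_flag, phred_to_error, parse_cigar, read_length, bed_to_one_based) are hypothetical and chosen only for illustration; the bit values and CIGAR operators follow the SAM specification.

```python
import re

# SAM FLAG bits and their meanings (values from the SAM specification).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary",
    0x200: "QC fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

# One CIGAR operation: a length followed by a single operator letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def phred_to_error(q):
    """Convert a Phred-scaled score (base quality or MAPQ) to an error probability."""
    return 10 ** (-q / 10)


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def read_length(cigar):
    """Sum the operations that consume read bases (M, I, S, =, X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def bed_to_one_based(start, end):
    """Convert a 0-based, half-open BED interval to 1-based, closed coordinates."""
    return start + 1, end


if __name__ == "__main__":
    print(decode_flag(99))
    print(phred_to_error(30))
    print(parse_cigar("75M2I74M"), read_length("75M2I74M"))
    print(bed_to_one_based(0, 100))
```

Under these assumptions, FLAG 99 decodes to the four properties listed in the FLAG entry, and both CIGAR examples from the CIGAR entry (151M and 75M2I74M) describe a 151-base read. The same Phred conversion applies to base qualities and to MAPQ.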