Glossary
========

.. glossary::
   :sorted:

   ASV
      Amplicon Sequence Variant. Exact biological sequence inferred by
      denoising tools such as DADA2. ASVs have largely replaced the older
      OTU approach in 16S amplicon analysis.

   BAM
      Binary Alignment Map. Compressed binary form of the SAM format for
      storing aligned sequencing reads.

   BED
      Browser Extensible Data. Tab-delimited format for genomic intervals
      using 0-based, half-open coordinates.
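
      As a minimal sketch of what 0-based, half-open means in practice, the
      snippet below converts a BED interval to the 1-based, closed
      coordinates used by formats such as GFF and VCF. The function names
      are illustrative only and do not come from any particular library.

      .. code-block:: python

         def bed_to_one_based(start, end):
             """Convert a 0-based, half-open interval to 1-based, closed."""
             # Only the start shifts; the exclusive end equals the inclusive end.
             return start + 1, end


         def bed_length(start, end):
             """Length of a half-open interval is simply end minus start."""
             return end - start


         # The first 100 bases of a chromosome are written as 0..100 in BED.
         print(bed_to_one_based(0, 100))  # (1, 100)
         print(bed_length(0, 100))        # 100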

   BUSCO
      Benchmarking Universal Single-Copy Orthologs. Tool for assessing
      genome assembly completeness using conserved gene sets.

   BWA
      Burrows-Wheeler Aligner. Short-read alignment tool; BWA-MEM2 is the
      faster SIMD-optimized successor.

   CIGAR
      Compact Idiosyncratic Gapped Alignment Report. String in SAM format
      describing how a read aligns to the reference (e.g., ``151M``,
      ``75M2I74M``).
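
      The two example strings can be unpacked with a short parser. This is
      a sketch that assumes a well-formed CIGAR string using the operation
      codes from the SAM specification; ``cigar_lengths`` is a hypothetical
      helper name.

      .. code-block:: python

         import re

         CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
         QUERY_OPS = set("MIS=X")  # operations that consume read bases
         REF_OPS = set("MDN=X")    # operations that consume reference bases


         def cigar_lengths(cigar):
             """Return (read length, reference span) implied by a CIGAR string."""
             query_len = 0
             ref_len = 0
             for count, op in CIGAR_OP.findall(cigar):
                 n = int(count)
                 if op in QUERY_OPS:
                     query_len += n
                 if op in REF_OPS:
                     ref_len += n
             return query_len, ref_len


         print(cigar_lengths("151M"))      # (151, 151)
         print(cigar_lengths("75M2I74M"))  # (151, 149)

      Both reads are 151 bases long, but the 2-base insertion in the second
      one does not consume reference positions, so it spans only 149
      reference bases.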

   CpG
      Cytosine-phosphate-Guanine dinucleotide. Primary target of DNA
      methylation in mammalian genomes.

   CRAM
      Compressed Reference-oriented Alignment Map. Reference-based
      compressed alignment format. More compact than BAM, but decoding
      normally requires the reference genome.

   DE
      Differential Expression. Statistical identification of genes with
      significantly different expression levels between conditions.

   DMR
      Differentially Methylated Region. Genomic region with statistically
      significant methylation differences between conditions.

   DSL2
      Domain Specific Language version 2. Nextflow's modular workflow
      syntax enabling process reuse and sub-workflows.

   eDNA
      Environmental DNA. DNA extracted directly from environmental samples
      (water, soil) without isolating organisms.

   FASTQ
      Text-based format storing nucleotide sequences together with
      per-base Phred quality scores.
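
      A minimal sketch of how the quality line is decoded, assuming the
      Phred+33 ASCII offset used by Sanger and current Illumina output
      (older files may use a different offset). The record and the
      ``decode_qualities`` helper are made up for illustration.

      .. code-block:: python

         def decode_qualities(quality_string, offset=33):
             """Map each quality character to its integer Phred score."""
             return [ord(ch) - offset for ch in quality_string]


         # A four-line FASTQ record: header, sequence, separator, qualities.
         record = ["@read1", "ACGT", "+", "II5#"]
         sequence, qualities = record[1], decode_qualities(record[3])

         print(list(zip(sequence, qualities)))
         # [('A', 40), ('C', 40), ('G', 20), ('T', 2)]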

   FASTA
      Text-based format for nucleotide or protein sequences. Uses ``>``
      header lines followed by sequence data.

   FLAG
      Bitwise flag in SAM format encoding read properties (paired, mapped,
      duplicate, etc.). For example, ``99`` (1 + 2 + 32 + 64) means paired,
      proper pair, mate on reverse strand, first in pair.
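
      The decoding can be sketched as a bitwise test against each flag bit.
      The bit values follow the SAM specification; the short descriptions
      and the ``decode_flag`` helper are illustrative.

      .. code-block:: python

         FLAG_BITS = {
             0x1: "paired",
             0x2: "proper pair",
             0x4: "unmapped",
             0x8: "mate unmapped",
             0x10: "reverse strand",
             0x20: "mate reverse strand",
             0x40: "first in pair",
             0x80: "second in pair",
             0x100: "secondary alignment",
             0x200: "fails quality checks",
             0x400: "duplicate",
             0x800: "supplementary alignment",
         }


         def decode_flag(flag):
             """List the properties whose bits are set in a SAM FLAG value."""
             return [name for bit, name in FLAG_BITS.items() if flag & bit]


         print(decode_flag(99))
         # ['paired', 'proper pair', 'mate reverse strand', 'first in pair']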

   GFF
      General Feature Format. Tab-delimited format for genomic annotations
      (genes, exons, CDS). GFF3 is the current version.

   GTF
      Gene Transfer Format, a refinement of GFF version 2. Widely used for
      gene annotations, especially from GENCODE and Ensembl.

   GVCF
      Genomic VCF. GATK's extended VCF that records confidence at every
      position, enabling efficient joint genotyping.

   h5ad
      HDF5-based file format used by AnnData/Scanpy for single-cell data.
      Stores count matrices, cell metadata, and embeddings.

   HPC
      High-Performance Computing. Cluster computing environment managed
      by job schedulers like SLURM.

   HVG
      Highly Variable Genes. Genes with the most variable expression across
      cells, selected as informative features for dimensionality reduction.

   IGV
      Integrative Genomics Viewer. Desktop application for interactive
      visualization of genomic data (BAM, VCF, BED, BigWig).

   indel
      Insertion or deletion variant relative to the reference genome.

   MAPQ
      Mapping Quality. Phred-scaled probability that an alignment is wrong.
      A MAPQ of 30 corresponds to a 1 in 1000 chance that the read is
      mapped incorrectly.

   MEX
      Matrix Market Exchange format. Sparse matrix format (``matrix.mtx``,
      ``barcodes.tsv``, ``features.tsv``) output by Cell Ranger and STARsolo.

   NGS
      Next-Generation Sequencing. High-throughput DNA/RNA sequencing
      technologies producing millions of reads per run.

   ONT
      Oxford Nanopore Technologies. Long-read sequencing platform that
      reads DNA by measuring ionic current changes through nanopores.

   OTU
      Operational Taxonomic Unit. Cluster of similar sequences (typically
      at 97% identity) used in older 16S analyses. Largely superseded by
      ASVs.

   PCA
      Principal Component Analysis. Dimensionality reduction method used
      to identify major axes of variation in gene expression data.

   Phred
      Quality score encoding where ``Q = -10 * log10(P_error)``. Q30
      corresponds to an error probability of 1 in 1000, i.e. 99.9% base
      call accuracy.
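
      The relationship can be checked numerically. The sketch below is a
      direct transcription of the formula in both directions; the function
      names are illustrative. The same scale applies to MAPQ.

      .. code-block:: python

         import math


         def phred_to_error(q):
             """Error probability implied by a Phred score."""
             return 10 ** (-q / 10)


         def error_to_phred(p_error):
             """Phred score implied by an error probability."""
             return -10 * math.log10(p_error)


         for q in (10, 20, 30, 40):
             p = phred_to_error(q)
             print(f"Q{q}: P_error = {p:.4f}, accuracy = {100 * (1 - p):.2f}%")
         # Q10: P_error = 0.1000, accuracy = 90.00%
         # Q20: P_error = 0.0100, accuracy = 99.00%
         # Q30: P_error = 0.0010, accuracy = 99.90%
         # Q40: P_error = 0.0001, accuracy = 99.99%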

   RPKM
      Reads Per Kilobase per Million mapped reads. Normalization method
      for coverage tracks accounting for library size and region length.
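
      As a worked example of the normalization, the sketch below computes
      RPKM for a single region. The input numbers are invented for
      illustration, and ``rpkm`` is a hypothetical helper rather than the
      API of any specific tool.

      .. code-block:: python

         def rpkm(read_count, region_length_bp, total_mapped_reads):
             """Reads per kilobase of region per million mapped reads."""
             kilobases = region_length_bp / 1_000
             millions = total_mapped_reads / 1_000_000
             return read_count / (kilobases * millions)


         # 500 reads in a 2 kb region, from a library of 20 million mapped reads.
         print(rpkm(500, 2_000, 20_000_000))  # 12.5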

   SAM
      Sequence Alignment/Map. Tab-delimited text format for storing
      aligned sequencing reads against a reference genome.

   SLURM
      Simple Linux Utility for Resource Management. Widely used HPC job
      scheduling system.

   SNV
      Single Nucleotide Variant. Single base-pair change relative to the
      reference genome.

   SRA
      Sequence Read Archive. NCBI's primary repository for raw sequencing
      data from high-throughput platforms.

   TSS
      Transcription Start Site. Genomic position where RNA polymerase
      begins transcription of a gene.

   UMAP
      Uniform Manifold Approximation and Projection. Non-linear
      dimensionality reduction used for visualizing single-cell data.

   UMI
      Unique Molecular Identifier. Random barcode attached during library
      preparation to tag individual molecules, enabling PCR duplicate
      removal in single-cell and other protocols.

   VCF
      Variant Call Format. Standard format for storing genetic variants
      (SNVs, indels, structural variants) with genotype information.

   VQSR
      Variant Quality Score Recalibration. GATK's machine-learning approach
      to variant filtering using known variant databases as training data.

   WGS
      Whole Genome Sequencing. Sequencing of an organism's entire genome
      at approximately uniform coverage, as opposed to targeted approaches
      such as exome sequencing.

   WGBS
      Whole Genome Bisulfite Sequencing. Technique for measuring DNA
      methylation at single-base resolution across the entire genome.