edgeR
=====

Overview
--------

edgeR is an R/Bioconductor package for differential expression analysis of
count-based data from RNA-seq and other high-throughput sequencing assays. It
models counts with the negative binomial distribution and estimates gene-wise
dispersions with an empirical Bayes approach that shares information across
genes, which stabilises the estimates when only a few replicates are
available. edgeR supports several testing frameworks: exact tests for pairwise
comparisons, generalised linear models (GLMs) with likelihood ratio tests, and
quasi-likelihood F-tests. It uses TMM (trimmed mean of M-values) normalisation
to account for compositional differences between libraries.
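
As a brief sketch of the model named above (the notation is chosen here for
illustration and does not come from this page): for gene :math:`g` in sample
:math:`i`, the negative binomial model ties the variance of a count
:math:`y_{gi}` to its mean :math:`\mu_{gi}` through a gene-wise dispersion
:math:`\phi_g`:

.. math::

   \operatorname{E}(y_{gi}) = \mu_{gi}, \qquad
   \operatorname{Var}(y_{gi}) = \mu_{gi} + \phi_g \mu_{gi}^2

Setting :math:`\phi_g = 0` recovers the Poisson variance, and
:math:`\sqrt{\phi_g}` is the quantity edgeR refers to as the biological
coefficient of variation (BCV).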

Installation
------------

Install edgeR from Bioconductor within an R session:

.. code-block:: r

   if (!requireNamespace("BiocManager", quietly = TRUE))
       install.packages("BiocManager")
   BiocManager::install("edgeR")

Basic Usage
-----------

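Run a quasi-likelihood differential expression analysis from a count matrix.
The example assumes that ``counts.csv`` holds genes in rows and four samples in
columns, ordered to match ``group``.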

.. code-block:: r

   library(edgeR)

   counts <- read.csv("counts.csv", row.names = 1)
   group <- factor(c("control", "control", "treated", "treated"))

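   y <- DGEList(counts = counts, group = group)

   # Drop genes with too few reads, then recompute library sizes
   keep <- filterByExpr(y)
   y <- y[keep, , keep.lib.sizes = FALSE]

   # TMM normalisation and dispersion estimation
   y <- calcNormFactors(y)
   design <- model.matrix(~ group)
   y <- estimateDisp(y, design)

   # Quasi-likelihood fit; coef = 2 tests treated relative to control
   fit <- glmQLFit(y, design)
   qlf <- glmQLFTest(fit, coef = 2)
   topTags(qlf, n = 20)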
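
For a plain two-group comparison, the exact-test route mentioned in the
overview can be sketched as follows. This is a minimal illustration, not a
recommended pipeline: the counts are simulated with ``rnbinom()``, and the gene
count, sample names, and simulation settings are assumptions made only so that
the snippet runs on its own.

.. code-block:: r

   library(edgeR)

   # Simulated counts (assumption: 500 genes, 3 control + 3 treated samples)
   set.seed(1)
   counts <- matrix(rnbinom(500 * 6, mu = 100, size = 10), nrow = 500,
                    dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))
   group <- factor(rep(c("control", "treated"), each = 3))

   y <- DGEList(counts = counts, group = group)
   keep <- filterByExpr(y)
   y <- y[keep, , keep.lib.sizes = FALSE]
   y <- calcNormFactors(y)

   # Without a design matrix, dispersions are estimated from the group labels
   y <- estimateDisp(y)

   # The first group in `pair` is the baseline, so logFC is treated vs control
   et <- exactTest(y, pair = c("control", "treated"))
   topTags(et, n = 5)

Because the simulated groups share the same mean, few or no genes should come
out as significant; the point is only to show the sequence of calls.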

Key Parameters
--------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Function / parameter
     - Description
   * - ``DGEList()``
     - Create a DGEList object from a count matrix, with sample grouping and
       optional library size information.
   * - ``filterByExpr()``
     - Determine which genes have sufficiently large counts to be retained for
       statistical analysis.
   * - ``calcNormFactors()``
     - Calculate TMM normalisation factors to account for compositional
       differences between libraries.
   * - ``estimateDisp()``
     - Estimate common, trended, and tagwise dispersions using the empirical
       Bayes method and a design matrix.
   * - ``model.matrix()``
     - Base R function that builds the design matrix describing the
       experimental design for the GLM (e.g. ``~ group`` or
       ``~ batch + group``).
   * - ``glmQLFit()``
     - Fit a quasi-likelihood negative binomial GLM to the data.
   * - ``glmQLFTest()``
     - Perform a quasi-likelihood F-test on specified model coefficients or
       contrasts.
   * - ``topTags()``
     - Extract a table of the top differentially expressed genes, ranked by
       p-value.
   * - ``exactTest()``
     - Perform an exact test for differences between two groups (alternative
       to the GLM approach).
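
To make the design formulas in the table more concrete, the sketch below
builds a ``~ batch + group`` design in base R and checks which column carries
the treatment effect. The batch labels and the four-sample layout are
hypothetical and chosen only for illustration.

.. code-block:: r

   # Hypothetical layout: two batches crossed with two conditions
   batch <- factor(c("b1", "b2", "b1", "b2"))
   group <- factor(c("control", "control", "treated", "treated"))

   # Batch enters as an additive term; the first level of each factor is the reference
   design <- model.matrix(~ batch + group)
   colnames(design)
   # "(Intercept)" "batchb2" "grouptreated"

   # Locate the treatment column so it can be selected by position or by name
   treat_coef <- which(colnames(design) == "grouptreated")

   # The design must have full column rank for the coefficients to be estimable
   stopifnot(qr(design)$rank == ncol(design))

Under these assumptions, passing ``coef = "grouptreated"`` (or ``coef = 3``)
to ``glmQLFTest()`` would test the treatment effect adjusted for batch.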

Expected Output
---------------

``topTags()`` returns a ``TopTags`` object whose ``table`` component is a data
frame with one row per gene and the following columns:

* ``logFC`` -- log2 fold change for the tested coefficient or contrast (in the
  example above, treated relative to control).
* ``logCPM`` -- average log2 counts per million across all samples.
* ``F`` -- quasi-likelihood F-statistic. Likelihood ratio tests report ``LR``
  instead, and exact tests report no test statistic column.
* ``PValue`` -- raw p-value.
* ``FDR`` -- false discovery rate (Benjamini-Hochberg adjusted p-value).
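
The relationship between ``PValue`` and ``FDR``, and a typical way of
filtering on these columns, can be sketched in base R. The table below is
made up to stand in for ``topTags(...)$table``, and the cut-offs (FDR below
0.05, absolute log2 fold change above 1) are illustrative assumptions, not
edgeR defaults.

.. code-block:: r

   # Made-up values standing in for topTags(...)$table; not real results
   tab <- data.frame(
     logFC  = c(2.1, -1.4, 0.3, 1.2, -0.1),
     PValue = c(0.001, 0.01, 0.02, 0.2, 0.5),
     row.names = paste0("gene", 1:5)
   )

   # Benjamini-Hochberg adjustment, the method behind the FDR column
   tab$FDR <- p.adjust(tab$PValue, method = "BH")

   # Keep genes passing both the FDR and the absolute log2 fold-change cut-off
   sig <- tab[tab$FDR < 0.05 & abs(tab$logFC) > 1, ]
   rownames(sig)

Filtering on ``FDR`` rather than ``PValue`` controls the expected proportion of
false positives among the genes that are reported.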

The full results table can be exported:

.. code-block:: r

   results <- topTags(qlf, n = Inf)
   write.csv(results$table, file = "edger_results.csv")

See Also
--------

* :doc:`deseq2` -- alternative Bioconductor package for differential
  expression using a negative binomial model with shrinkage estimators
* :doc:`/tools/quantification/featurecounts` -- generate the count matrix from
  aligned BAM files
* :doc:`/tools/quantification/salmon` -- alignment-free transcript
  quantification compatible with edgeR via tximport
* :doc:`/tools/quantification/htseq` -- read counting tool whose output can
  be used directly with edgeR