MEX / 10x Format
================

Overview
--------

The MEX (Matrix Market Exchange) format is a sparse matrix representation
used by **10x Genomics Cell Ranger** and **STARsolo** to store single-cell
gene expression count data. Because scRNA-seq count matrices are extremely
sparse -- typically more than 90% of entries are zero -- MEX stores only the
non-zero values, which makes the files much smaller than a dense
representation of the same matrix.

The 10x MEX output consists of **three files** in a directory (commonly named
``filtered_feature_bc_matrix/`` or ``raw_feature_bc_matrix/``):

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - File
     - Contents
   * - ``matrix.mtx.gz``
     - Sparse count matrix in Matrix Market coordinate format.
   * - ``barcodes.tsv.gz``
     - Cell barcodes (one per line, corresponding to matrix columns).
   * - ``features.tsv.gz``
     - Gene/feature information (corresponding to matrix rows).

This three-file bundle is a widely used interchange format between upstream
processing (Cell Ranger, STARsolo, Alevin, Kallisto-BUStools) and downstream
analysis frameworks (Scanpy, Seurat, Bioconductor/SingleCellExperiment).

Structure
---------

matrix.mtx
^^^^^^^^^^

The matrix file follows the `Matrix Market coordinate format
<https://math.nist.gov/MatrixMarket/formats.html>`_:

.. code-block:: text

   %%MatrixMarket matrix coordinate integer general
   %
   33538 7374 12890567
   1 1 3
   32 1 1
   51 1 5
   134 1 2
   245 2 1
   ...

* **Line 1** -- Header declaring the format (coordinate, integer, general).
* **Comment lines** -- Zero or more lines starting with ``%``.
* **First non-comment line** -- Dimensions:
  ``n_features  n_barcodes  n_nonzero_entries``.
* **Remaining lines** -- One triplet per non-zero entry:
  ``feature_index  barcode_index  count``.

Indices are **1-based** (the first feature and first barcode are numbered 1).
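
The layout above can be read with a few lines of standard-library Python. The
following is a minimal sketch, not the parser that Scanpy or Seurat use:
``parse_mtx_text`` is a hypothetical helper name and the sample text is
invented for illustration. It skips the header and comment lines, reads the
dimension line, and shifts the 1-based indices to 0-based ones.

.. code-block:: python

   def parse_mtx_text(text):
       """Parse Matrix Market coordinate text into (shape, entries).

       entries maps (feature_index, barcode_index), both 0-based, to a count.
       """
       lines = [ln for ln in text.splitlines() if ln.strip()]
       if not lines[0].startswith("%%MatrixMarket"):
           raise ValueError("missing %%MatrixMarket header")
       # Drop the header and any comment lines, which all start with '%'.
       body = [ln for ln in lines if not ln.startswith("%")]
       n_features, n_barcodes, n_nonzero = (int(x) for x in body[0].split())
       entries = {}
       for ln in body[1:]:
           row, col, count = ln.split()
           # MEX indices are 1-based; shift to 0-based for Python containers.
           entries[(int(row) - 1, int(col) - 1)] = int(count)
       if len(entries) != n_nonzero:
           raise ValueError("entry count does not match the dimension line")
       return (n_features, n_barcodes), entries


   # Invented 4-feature x 2-barcode example with three non-zero entries.
   sample = """%%MatrixMarket matrix coordinate integer general
   %
   4 2 3
   1 1 3
   3 1 1
   2 2 5
   """
   shape, entries = parse_mtx_text(sample)
   print(shape)            # (4, 2)
   print(entries[(0, 0)])  # 3

For real data sets a dictionary of triplets is far too slow; the loaders shown
under "Working With" build a proper sparse matrix instead.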

barcodes.tsv
^^^^^^^^^^^^^

One cell barcode per line:

.. code-block:: text

   AAACCCAAGAAACACT-1
   AAACCCAAGAAACCAT-1
   AAACCCAAGAAACTGT-1
   AAACCCAAGAAAGCGA-1
   ...

The ``-1`` suffix is a GEM well identifier appended by Cell Ranger. The
number of lines equals the number of columns in the matrix.

features.tsv
^^^^^^^^^^^^^

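Tab-separated file with gene/feature metadata, one feature per line (the
columns below are aligned with spaces for readability; the file itself uses
single tab characters):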

.. code-block:: text

   ENSG00000243485  MIR1302-2HG  Gene Expression
   ENSG00000237613  FAM138A      Gene Expression
   ENSG00000186092  OR4F5        Gene Expression
   ENSG00000238009  AL627309.1   Gene Expression
   ...

Columns:

.. list-table::
   :header-rows: 1
   :widths: 10 25 65

   * - Col
     - Field
     - Description
   * - 1
     - Feature ID
     - Ensembl gene ID or feature identifier.
   * - 2
     - Feature name
     - Gene symbol or feature name.
   * - 3
     - Feature type
     - ``Gene Expression``, ``Antibody Capture`` (CITE-seq),
       ``CRISPR Guide Capture`` (Perturb-seq), ``Multiplexing Capture``
       (cell hashing), etc.

The number of lines equals the number of rows in the matrix.
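
Because the three files are only linked by position, it can be worth checking
that their sizes agree before loading. The sketch below is one possible check
using only the standard library; ``check_mex_dir`` and ``read_dims`` are
hypothetical helper names, and the tiny demo directory is invented so that the
example runs on its own.

.. code-block:: python

   import gzip
   import os
   import tempfile


   def count_lines(path):
       """Count the non-empty lines of a gzip-compressed text file."""
       with gzip.open(path, "rt") as handle:
           return sum(1 for ln in handle if ln.strip())


   def read_dims(path):
       """Return (n_features, n_barcodes, n_nonzero) from matrix.mtx.gz."""
       with gzip.open(path, "rt") as handle:
           for ln in handle:
               # The dimension line is the first line that is not a comment.
               if not ln.startswith("%"):
                   return tuple(int(x) for x in ln.split())
       raise ValueError("no dimension line found")


   def check_mex_dir(mex_dir):
       """Return (n_features, n_barcodes) if the three files are consistent."""
       n_features, n_barcodes, _ = read_dims(os.path.join(mex_dir, "matrix.mtx.gz"))
       if count_lines(os.path.join(mex_dir, "features.tsv.gz")) != n_features:
           raise ValueError("features.tsv.gz does not match the matrix rows")
       if count_lines(os.path.join(mex_dir, "barcodes.tsv.gz")) != n_barcodes:
           raise ValueError("barcodes.tsv.gz does not match the matrix columns")
       return n_features, n_barcodes


   # Build a tiny, invented MEX directory to exercise the check.
   demo_dir = tempfile.mkdtemp()
   demo_files = {
       "barcodes.tsv.gz": "AAAC-1\nAAAG-1\n",
       "features.tsv.gz": "G1\tgeneA\tGene Expression\nG2\tgeneB\tGene Expression\n",
       "matrix.mtx.gz": "%%MatrixMarket matrix coordinate integer general\n%\n2 2 1\n1 1 3\n",
   }
   for name, content in demo_files.items():
       with gzip.open(os.path.join(demo_dir, name), "wt") as handle:
           handle.write(content)

   dims = check_mex_dir(demo_dir)
   print(dims)  # (2, 2)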

Filtered vs raw matrices
^^^^^^^^^^^^^^^^^^^^^^^^^

Cell Ranger outputs **two** versions of the MEX directory:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Directory
     - Description
   * - ``raw_feature_bc_matrix/``
     - Contains **all** barcodes detected (including empty droplets). May
       have hundreds of thousands of columns.
   * - ``filtered_feature_bc_matrix/``
     - Contains only barcodes that Cell Ranger classified as real cells.
       This is the starting point for most analyses.
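
To get a feel for how many barcodes in a raw matrix carry almost no counts,
the triplets can be summed per barcode while streaming through the file. The
sketch below is illustrative only and is not Cell Ranger's cell-calling
procedure; ``barcode_totals`` is a hypothetical helper name and the input
lines are invented.

.. code-block:: python

   from collections import Counter


   def barcode_totals(mtx_lines):
       """Sum counts per 1-based barcode index from Matrix Market lines."""
       totals = Counter()
       seen_dims = False
       for ln in mtx_lines:
           if ln.startswith("%") or not ln.strip():
               continue  # header, comment, or blank line
           if not seen_dims:
               seen_dims = True  # first non-comment line holds the dimensions
               continue
           _feature, barcode, count = ln.split()
           totals[int(barcode)] += int(count)
       return totals


   # Invented raw matrix: barcodes 2 and 4 look like near-empty droplets.
   raw_lines = [
       "%%MatrixMarket matrix coordinate integer general",
       "%",
       "3 4 5",
       "1 1 40",
       "2 1 25",
       "3 2 1",
       "1 3 30",
       "2 4 1",
   ]
   totals = barcode_totals(raw_lines)
   print(sorted(totals.items()))  # [(1, 65), (2, 1), (3, 30), (4, 1)]

The same function accepts an open text handle, for example one returned by
``gzip.open(path, "rt")``, so the matrix never has to be held in memory.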

Working With
------------

Loading into Scanpy (Python)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   python3 -c "
   import scanpy as sc

   # Load from the three-file MEX directory
   adata = sc.read_10x_mtx(
       'filtered_feature_bc_matrix/',
       var_names='gene_symbols',    # use gene symbols as variable names
       cache=True                    # cache for faster re-loading
   )
   print(adata)
   # AnnData object with n_obs x n_vars = 7374 x 33538

   # Save as h5ad for faster future access
   adata.write('counts.h5ad')
   "

Loading into Seurat (R)
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   Rscript -e '
   library(Seurat)

   # Load from the three-file MEX directory
   counts <- Read10X(data.dir = "filtered_feature_bc_matrix/")
   seurat_obj <- CreateSeuratObject(counts = counts, project = "my_project")
   print(seurat_obj)
   '

Loading into Bioconductor (R)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   Rscript -e '
   library(DropletUtils)

   # Load as SingleCellExperiment
   sce <- read10xCounts("filtered_feature_bc_matrix/")
   print(sce)
   '

Inspecting MEX files from the command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Count the number of barcodes (cells)
   zcat filtered_feature_bc_matrix/barcodes.tsv.gz | wc -l

   # Count the number of features (genes)
   zcat filtered_feature_bc_matrix/features.tsv.gz | wc -l

   # View the matrix header (dimensions and non-zero count)
   zcat filtered_feature_bc_matrix/matrix.mtx.gz | head -3

   # View the first few features
   zcat filtered_feature_bc_matrix/features.tsv.gz | head -5

Loading STARsolo output
^^^^^^^^^^^^^^^^^^^^^^^

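STARsolo writes the same three files, but uncompressed by default
(``matrix.mtx``, ``barcodes.tsv``, ``features.tsv``). Depending on the Scanpy
version, ``sc.read_10x_mtx`` may expect the gzip-compressed file names, in
which case the files need to be compressed with ``gzip`` before running the
example below: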

.. code-block:: bash

   python3 -c "
   import scanpy as sc

   # STARsolo output directory structure
   adata = sc.read_10x_mtx(
       'Solo.out/Gene/filtered/',
       var_names='gene_symbols'
   )
   adata.write('starsolo_counts.h5ad')
   "

Loading multi-modal data (CITE-seq)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

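When ``features.tsv.gz`` lists more than one feature type (e.g.
``Gene Expression`` and ``Antibody Capture``), each modality is usually
analysed separately. Note that ``sc.read_10x_mtx`` keeps only the
``Gene Expression`` rows by default (``gex_only=True``). One option is muon,
which reads the directory into one modality per feature type: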

.. code-block:: bash

   python3 -c "
   import muon as mu

   # Read as a MuData object (multi-modal)
   mdata = mu.read_10x_mtx('filtered_feature_bc_matrix/')
   print(mdata)
   # MuData object with 'rna' and 'prot' modalities
   "
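
The separation itself rests on the third column of the features file. As a
rough sketch of the idea, assuming tab-separated lines with at least three
columns, the matrix rows can be grouped by feature type as follows.
``rows_by_feature_type`` is a hypothetical helper name, and the antibody entry
in the sample is invented for illustration.

.. code-block:: python

   from collections import defaultdict


   def rows_by_feature_type(feature_lines):
       """Map each feature type to the 0-based matrix rows that carry it."""
       groups = defaultdict(list)
       for row, ln in enumerate(feature_lines):
           _feature_id, _name, feature_type = ln.rstrip("\n").split("\t")[:3]
           groups[feature_type].append(row)
       return dict(groups)


   feature_lines = [
       "ENSG00000243485\tMIR1302-2HG\tGene Expression",
       "ENSG00000186092\tOR4F5\tGene Expression",
       "CD3_example\tCD3\tAntibody Capture",  # invented antibody feature
   ]
   groups = rows_by_feature_type(feature_lines)
   print(groups)
   # {'Gene Expression': [0, 1], 'Antibody Capture': [2]}

The resulting row indices can then be used to slice a loaded sparse matrix
into one block per modality.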

See Also
--------

* :doc:`/tools/single-cell/cellranger` -- the upstream pipeline that produces
  MEX output
* :doc:`/tools/single-cell/starsolo` -- alternative single-cell aligner with
  MEX output
* :doc:`h5ad-anndata` -- the h5ad format that MEX data is typically converted
  into
* :doc:`/tools/single-cell/scanpy` -- Python framework for analysing
  single-cell data