h5ad / AnnData
==============

Overview
--------

The **h5ad** format is an HDF5-based file format for storing annotated data
matrices, particularly **single-cell RNA-seq** (scRNA-seq) and other
single-cell omics data. It is the native on-disk format of the ``anndata``
Python library, whose ``AnnData`` class is the core data structure of the
**Scanpy** ecosystem -- the most widely used Python framework for single-cell
analysis.

h5ad files store not just the count matrix, but all associated metadata,
dimensionality reductions, clustering results, and analysis outputs in a
single, self-contained file. A typical scRNA-seq experiment with 10 000 cells
and 30 000 genes produces an h5ad file of approximately 200--500 MB; the exact
size depends on matrix sparsity, compression, and how many layers and
embeddings are stored.

The format is used in virtually every step of single-cell analysis: loading
raw counts, quality control, normalisation, dimensionality reduction,
clustering, differential expression, and trajectory inference.

Structure
---------

An AnnData object (and its h5ad file on disk) has a well-defined structure
with several interconnected components:

.. code-block:: text

   AnnData object
   +-- .X        # Main data matrix (cells x genes)
   +-- .obs      # Observation (cell) metadata   [DataFrame]
   +-- .var      # Variable (gene) metadata      [DataFrame]
   +-- .uns      # Unstructured annotations      [dict]
   +-- .obsm     # Observation matrices          [dict of arrays]
   +-- .varm     # Variable matrices             [dict of arrays]
   +-- .obsp     # Observation pairwise          [dict of sparse matrices]
   +-- .layers   # Alternative matrix layers     [dict of matrices]
   +-- .raw      # Raw (unprocessed) data        [Raw]

Components
^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 15 20 65

   * - Attribute
     - Shape / type
     - Description
   * - ``.X``
     - ``(n_obs, n_vars)``
     - The main data matrix. Typically raw counts or normalised expression
       values. Can be dense (NumPy array) or sparse (SciPy CSR/CSC matrix).
   * - ``.obs``
     - DataFrame with ``n_obs`` rows
     - Cell-level metadata: barcodes, sample IDs, cluster labels, QC metrics
       (``n_genes``, ``total_counts``, ``pct_counts_mt``), cell-type
       annotations.
   * - ``.var``
     - DataFrame with ``n_vars`` rows
     - Gene-level metadata: gene symbols, Ensembl IDs, ``highly_variable``
       flags, mean expression, dispersion.
   * - ``.uns``
     - dict
     - Unstructured data: colour palettes, UMAP parameters, marker gene
       rankings, analysis logs, sample-level information.
   * - ``.obsm``
     - dict of arrays
     - Cell embeddings: ``X_pca``, ``X_umap``, ``X_tsne``, ``X_diffmap``.
       Each array has shape ``(n_obs, n_components)``.
   * - ``.varm``
     - dict of arrays
     - Gene-level embeddings: PCA loadings (``PCs``).
   * - ``.obsp``
     - dict of sparse matrices
     - Cell-cell pairwise data: ``connectivities`` and ``distances`` graphs
       used for clustering (Leiden, Louvain).
   * - ``.layers``
     - dict of matrices
     - Alternative representations of ``.X`` with the same shape: ``counts``
       (raw), ``log1p`` (log-normalised), ``spliced``, ``unspliced``
       (RNA velocity).
   * - ``.raw``
     - ``Raw`` (AnnData-like)
     - Snapshot of the data before gene filtering, used for differential
       expression on the full gene set.

HDF5 on-disk layout
^^^^^^^^^^^^^^^^^^^^

The h5ad file maps directly to HDF5 groups and datasets:

.. code-block:: text

   /                        # Root
   /X                       # Main matrix (dense or sparse)
   /obs                     # Cell metadata (stored as HDF5 group with columns)
   /var                     # Gene metadata
   /uns                     # Nested dictionaries
   /obsm/X_pca              # PCA embedding
   /obsm/X_umap             # UMAP embedding
   /obsp/connectivities     # Neighbour graph
   /obsp/distances          # Distance matrix
   /layers/counts           # Raw count layer
   /raw                     # Pre-filtering snapshot

Sparse matrices are stored in CSR or CSC format as a group containing
``data``, ``indices``, and ``indptr`` datasets.

Working With
------------

Reading and writing h5ad files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   python3 -c "
   import scanpy as sc

   # Read an h5ad file fully into memory
   adata = sc.read_h5ad('pbmc3k.h5ad')

   # Inspect the object: summary, then cell and gene metadata
   print(adata)
   print(adata.obs.head())
   print(adata.var.head())

   # Write to h5ad
   adata.write('pbmc3k_processed.h5ad')

   # Write with compression (smaller file, slower I/O)
   adata.write('pbmc3k_compressed.h5ad', compression='gzip')
   "

Basic single-cell workflow
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   python3 -c "
   import scanpy as sc

   adata = sc.read_h5ad('raw_counts.h5ad')

   # Quality control
   sc.pp.filter_cells(adata, min_genes=200)
   sc.pp.filter_genes(adata, min_cells=3)
   # 'MT-' is the mitochondrial prefix for human gene symbols
   adata.var['mt'] = adata.var_names.str.startswith('MT-')
   sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
   # .copy() turns the filtered view into an independent object
   adata = adata[adata.obs.pct_counts_mt < 20, :].copy()

   # Keep the raw counts in a layer, then normalise and log-transform .X
   adata.layers['counts'] = adata.X.copy()
   sc.pp.normalize_total(adata, target_sum=1e4)
   sc.pp.log1p(adata)

   # Feature selection and dimensionality reduction
   sc.pp.highly_variable_genes(adata, n_top_genes=2000)
   sc.pp.pca(adata, n_comps=50)
   sc.pp.neighbors(adata, n_pcs=30)
   sc.tl.umap(adata)
   sc.tl.leiden(adata, resolution=0.5)

   adata.write('processed.h5ad')
   print(adata)
   "

Inspecting h5ad from the command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # List all groups and datasets recursively
   h5ls -r pbmc3k.h5ad

   # Report storage statistics (object counts, sizes, filters)
   h5stat pbmc3k.h5ad

Converting from other formats
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   python3 -c "
   import scanpy as sc

   # From 10x MEX format (Cell Ranger output)
   adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')
   adata.write('from_10x.h5ad')

   # From 10x HDF5 format
   adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
   adata.var_names_make_unique()  # gene symbols may be duplicated
   adata.write('from_10x_h5.h5ad')

   # From a CSV/TSV count matrix stored as genes x cells;
   # transpose to the cells x genes orientation AnnData expects
   adata = sc.read_csv('counts.csv').T
   adata.write('from_csv.h5ad')
   "

Converting to/from Seurat (R)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Using the SeuratDisk package in R (anndataR is an alternative)
   Rscript -e '
   library(Seurat)
   library(SeuratDisk)

   # h5ad to h5Seurat to Seurat
   # Convert() writes processed.h5seurat next to the input file
   Convert("processed.h5ad", dest="h5seurat", overwrite=TRUE)
   seurat_obj <- LoadH5Seurat("processed.h5seurat")
   '

Backed mode for large datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   python3 -c "
   import anndata as ad

   # Open in backed mode (reads from disk on demand, low memory)
   adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
   print(adata.obs.head())

   # Access a subset without loading the full matrix
   # (assumes .obs has a 'cell_type' column)
   subset = adata[adata.obs['cell_type'] == 'T cell', :].to_memory()

   # Release the underlying HDF5 file handle
   adata.file.close()
   "

See Also
--------

* :doc:`/tools/single-cell/scanpy` -- the primary analysis framework for h5ad data
* :doc:`/tools/single-cell/cellranger` -- upstream tool that produces the raw count matrices
* :doc:`mex-10x` -- the sparse matrix format output by Cell Ranger (converted to h5ad for analysis)
* :doc:`/tools/single-cell/seurat` -- R-based single-cell analysis framework with interoperability via SeuratDisk
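
Example: CSR sparse layout
--------------------------

The on-disk layout section notes that sparse matrices are stored as ``data``, ``indices``, and ``indptr`` datasets. The sketch below illustrates what those three arrays hold for a CSR matrix, using plain Python lists on a toy 3 x 4 count matrix. The helper names ``dense_to_csr`` and ``csr_row`` are hypothetical and exist only for this illustration; this is not how ``anndata`` or SciPy implement the encoding.

```python
# Illustrative sketch of the CSR layout; helper names are hypothetical.

def dense_to_csr(rows):
    """Encode a dense cells x genes matrix as the three CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero value
                indices.append(col)   # its column (gene) index
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the three CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


# Toy matrix: 3 cells x 4 genes, mostly zeros
counts = [
    [0, 3, 0, 1],
    [0, 0, 0, 0],
    [2, 0, 0, 5],
]

data, indices, indptr = dense_to_csr(counts)
print(data)     # [3, 1, 2, 5]
print(indices)  # [1, 3, 0, 3]
print(indptr)   # [0, 2, 2, 4]
print(csr_row(data, indices, indptr, 2, n_cols=4))  # [2, 0, 0, 5]
```

Row ``i`` occupies the slice ``indptr[i]:indptr[i + 1]`` of ``data`` and ``indices``, so an all-zero cell costs only one repeated entry in ``indptr``. CSC is the same idea with the roles of rows and columns swapped.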