h5ad / AnnData

Overview

The h5ad format is an HDF5-based file format designed for storing annotated data matrices, particularly single-cell RNA-seq (scRNA-seq) and other single-cell omics data. It is the native file format of the AnnData Python library, which is the core data structure for the Scanpy ecosystem – the most widely used Python framework for single-cell analysis.

h5ad files store not just the count matrix, but all associated metadata, dimensionality reductions, clustering results, and analysis outputs in a single, self-contained file. A typical scRNA-seq experiment with 10 000 cells and 30 000 genes produces an h5ad file of roughly 200–500 MB, although the actual size depends on matrix sparsity, compression, and how many layers and embeddings are stored.

The format is used in virtually every step of single-cell analysis: from loading raw counts through quality control, normalisation, dimensionality reduction, clustering, differential expression, and trajectory inference.

Structure

An AnnData object (and its h5ad file on disk) has a well-defined structure with several interconnected components:

AnnData object
+-- .X             # Main data matrix (cells x genes)
+-- .obs           # Observation (cell) metadata  [DataFrame]
+-- .var           # Variable (gene) metadata      [DataFrame]
+-- .uns           # Unstructured annotations       [dict]
+-- .obsm          # Observation matrices           [dict of arrays]
+-- .varm          # Variable matrices              [dict of arrays]
+-- .obsp          # Observation pairwise           [dict of sparse matrices]
+-- .layers        # Alternative matrix layers      [dict of matrices]
+-- .raw           # Raw (unprocessed) data         [Raw]

Components

Attribute	Shape / type	Description
`.X`	`(n_obs, n_vars)`	The main data matrix. Typically raw counts or normalised expression values. Can be dense (NumPy array) or sparse (scipy CSR/CSC matrix).
`.obs`	`(n_obs,)` DataFrame	Cell-level metadata: barcodes, sample IDs, cluster labels, QC metrics (`n_genes`, `total_counts`, `pct_counts_mt`), cell-type annotations.
`.var`	`(n_vars,)` DataFrame	Gene-level metadata: gene symbols, Ensembl IDs, `highly_variable` flags, mean expression, dispersion.
`.uns`	dict	Unstructured data: colour palettes, UMAP parameters, marker gene rankings, analysis logs, sample-level information.
`.obsm`	dict of arrays	Cell embeddings: `X_pca`, `X_umap`, `X_tsne`, `X_diffmap`. Each array has shape `(n_obs, n_components)`.
`.varm`	dict of arrays	Gene-level embeddings: PCA loadings (`PCs`).
`.obsp`	dict of sparse matrices	Cell-cell pairwise data: `connectivities` and `distances` graphs used for clustering (Leiden, Louvain).
`.layers`	dict of matrices	Alternative representations of `.X`: `counts` (raw), `log1p` (log-normalised), `spliced`, `unspliced` (RNA velocity).
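`.raw`	`Raw` object	Snapshot of the data before gene filtering, used for differential expression on the full gene set. It keeps its own `X`, `var`, and `varm` while sharing the observation axis with the parent object.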

HDF5 on-disk layout

The h5ad file maps directly to HDF5 groups and datasets:

/                          # Root
/X                         # Main matrix (dense or sparse)
/obs                       # Cell metadata (stored as HDF5 group with columns)
/var                       # Gene metadata
/uns                       # Nested dictionaries
/obsm/X_pca                # PCA embedding
/obsm/X_umap               # UMAP embedding
/obsp/connectivities       # Neighbour graph
/obsp/distances            # Distance matrix
/layers/counts             # Raw count layer
/raw                       # Pre-filtering snapshot

Sparse matrices are stored in CSR or CSC format with data, indices, and indptr datasets.

Working With

Reading and writing h5ad files

python3 -c "
import scanpy as sc

# Read an h5ad file
adata = sc.read_h5ad('pbmc3k.h5ad')

# Inspect the object
print(adata)
print(adata.obs.head())
print(adata.var.head())

# Write to h5ad
adata.write('pbmc3k_processed.h5ad')

# Write with compression
adata.write('pbmc3k_compressed.h5ad', compression='gzip')
"

Basic single-cell workflow

python3 -c "
import scanpy as sc

adata = sc.read_h5ad('raw_counts.h5ad')

# Quality control
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
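# 'MT-' matches human mitochondrial gene symbols (mouse symbols use 'mt-')
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
# Copy the filtered view so later in-place steps act on a real object
adata = adata[adata.obs.pct_counts_mt < 20, :].copy()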

# Normalise and log-transform
adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Feature selection and dimensionality reduction
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

adata.write('processed.h5ad')
print(adata)
"

Inspecting h5ad from the command line

# Inspect HDF5 structure
h5ls -r pbmc3k.h5ad

# Summarise storage statistics (object counts, dataset sizes, filters)
h5stat pbmc3k.h5ad

Converting from other formats

python3 -c "
import scanpy as sc

# From 10x MEX format (Cell Ranger output)
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')
adata.write('from_10x.h5ad')

# From 10x HDF5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
adata.write('from_10x_h5.h5ad')

# From CSV/TSV count matrix
adata = sc.read_csv('counts.csv').T  # genes x cells -> cells x genes
adata.write('from_csv.h5ad')
"

Converting to/from Seurat (R)

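# Using the SeuratDisk package in R (anndataR is an alternative)
Rscript -e '
library(Seurat)
library(SeuratDisk)

# h5ad to h5Seurat to Seurat
# Convert() writes processed.h5seurat (lowercase extension) next to the input
Convert("processed.h5ad", dest = "h5seurat", overwrite = TRUE)
seurat_obj <- LoadH5Seurat("processed.h5seurat")

# Seurat to h5Seurat to h5ad
SaveH5Seurat(seurat_obj, filename = "from_seurat.h5Seurat", overwrite = TRUE)
Convert("from_seurat.h5Seurat", dest = "h5ad", overwrite = TRUE)
'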

Backed mode for large datasets

python3 -c "
import anndata as ad

# Open in backed mode (reads from disk on demand, low memory)
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
print(adata.obs.head())
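# Access a subset without loading the full matrix
# (assumes .obs has a cell_type column with a T cell category)
subset = adata[adata.obs['cell_type'] == 'T cell', :].to_memory()
adata.file.close()  # release the HDF5 file handle when done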
"