h5ad / AnnData

Overview

The h5ad format is an HDF5-based file format designed for storing annotated data matrices, particularly single-cell RNA-seq (scRNA-seq) and other single-cell omics data. It is the native file format of the AnnData Python library, which is the core data structure for the Scanpy ecosystem – the most widely used Python framework for single-cell analysis.

h5ad files store not just the count matrix but also all associated metadata, dimensionality reductions, clustering results, and analysis outputs in a single, self-contained file. A typical scRNA-seq experiment with 10 000 cells and 30 000 genes produces an h5ad file of approximately 200–500 MB, although the actual size depends on the sparsity of the matrix, the number of stored layers, and whether compression is enabled.

The format is used in virtually every step of single-cell analysis: from loading raw counts, through quality control, normalisation, dimensionality reduction, clustering, and differential expression, to trajectory inference.

Structure

An AnnData object (and its h5ad file on disk) has a well-defined structure with several interconnected components:

AnnData object
+-- .X             # Main data matrix (cells x genes)
+-- .obs           # Observation (cell) metadata  [DataFrame]
+-- .var           # Variable (gene) metadata      [DataFrame]
+-- .uns           # Unstructured annotations       [dict]
+-- .obsm          # Observation matrices           [dict of arrays]
+-- .varm          # Variable matrices              [dict of arrays]
+-- .obsp          # Observation pairwise           [dict of sparse matrices]
+-- .layers        # Alternative matrix layers      [dict of matrices]
+-- .raw           # Raw (unprocessed) data         [AnnData]

Components

Attribute | Shape / type            | Description
----------+-------------------------+------------
.X        | (n_obs, n_vars)         | The main data matrix. Typically raw counts or normalised expression values. Can be dense (NumPy array) or sparse (scipy CSR/CSC matrix).
.obs      | (n_obs,) DataFrame      | Cell-level metadata: barcodes, sample IDs, cluster labels, QC metrics (n_genes, total_counts, pct_mito), cell-type annotations.
.var      | (n_vars,) DataFrame     | Gene-level metadata: gene symbols, Ensembl IDs, highly_variable flags, mean expression, dispersion.
.uns      | dict                    | Unstructured data: colour palettes, UMAP parameters, marker gene rankings, analysis logs, sample-level information.
.obsm     | dict of arrays          | Cell embeddings: X_pca, X_umap, X_tsne, X_diffmap. Each array has shape (n_obs, n_components).
.varm     | dict of arrays          | Gene-level embeddings: PCA loadings (PCs).
.obsp     | dict of sparse matrices | Cell-cell pairwise data: connectivities and distances graphs used for clustering (Leiden, Louvain).
.layers   | dict of matrices        | Alternative representations of .X: counts (raw), log1p (log-normalised), spliced, unspliced (RNA velocity).
.raw      | AnnData                 | Snapshot of the data before gene filtering, used for differential expression on the full gene set.

HDF5 on-disk layout

The h5ad file maps directly to HDF5 groups and datasets:

/                          # Root
/X                         # Main matrix (dense or sparse)
/obs                       # Cell metadata (stored as HDF5 group with columns)
/var                       # Gene metadata
/uns                       # Nested dictionaries
/obsm/X_pca                # PCA embedding
/obsm/X_umap               # UMAP embedding
/obsp/connectivities       # Neighbour graph
/obsp/distances            # Distance matrix
/layers/counts             # Raw count layer
/raw                       # Pre-filtering snapshot

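Sparse matrices are stored in CSR or CSC format. Each sparse matrix is an HDF5 group containing three datasets: data (the non-zero values), indices (their column or row positions), and indptr (the offsets at which each row or column starts).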
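
To make the roles of the three arrays concrete, here is a small pure-Python sketch of how a CSR matrix is decoded. The helper name csr_to_dense and the example values are illustrative only; in practice scipy.sparse performs this reconstruction.

```python
def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays into a list of dense rows."""
    dense = []
    for row in range(len(indptr) - 1):
        values = [0] * n_cols
        # Entries for this row live in data[indptr[row]:indptr[row + 1]]
        for k in range(indptr[row], indptr[row + 1]):
            values[indices[k]] = data[k]  # indices[k] is the column of data[k]
        dense.append(values)
    return dense


# Toy matrix: 3 cells x 4 genes with 4 non-zero counts
data = [5, 1, 2, 3]      # non-zero values, row by row
indices = [0, 3, 1, 2]   # column index of each value
indptr = [0, 2, 2, 4]    # row i spans data[indptr[i]:indptr[i + 1]]

dense = csr_to_dense(data, indices, indptr, n_cols=4)
for row in dense:
    print(row)
```

The second cell has no counts at all, which is why indptr repeats the value 2: an empty row costs one offset and no stored values. CSC works the same way with the roles of rows and columns swapped.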

Working With

Reading and writing h5ad files

python3 -c "
import scanpy as sc

# Read an h5ad file
adata = sc.read_h5ad('pbmc3k.h5ad')

# Inspect the object
print(adata)
print(adata.obs.head())
print(adata.var.head())

# Write to h5ad
adata.write('pbmc3k_processed.h5ad')

# Write with compression
adata.write('pbmc3k_compressed.h5ad', compression='gzip')
"

Basic single-cell workflow

python3 -c "
import scanpy as sc

adata = sc.read_h5ad('raw_counts.h5ad')

# Quality control
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
# .copy() turns the filtered view into a standalone object that can be modified
adata = adata[adata.obs.pct_counts_mt < 20, :].copy()

# Normalise and log-transform
adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Feature selection and dimensionality reduction
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

adata.write('processed.h5ad')
print(adata)
"
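
The arithmetic behind the mitochondrial QC metric and the normalisation step can be sketched with plain NumPy. This is an illustration of the calculations under simplifying assumptions (dense toy counts, default behaviour), not Scanpy's implementation; the gene names are hypothetical.

```python
import numpy as np

# Toy counts: 2 cells x 3 genes; the last gene is treated as mitochondrial
gene_names = np.array(["GENE1", "GENE2", "MT-DEMO"])  # hypothetical names
counts = np.array([[1.0, 3.0, 0.0],
                   [10.0, 0.0, 10.0]])

# QC: percentage of each cell's counts that come from mitochondrial genes
is_mt = np.char.startswith(gene_names, "MT-")
totals = counts.sum(axis=1, keepdims=True)
pct_counts_mt = 100.0 * counts[:, is_mt].sum(axis=1) / totals[:, 0]

# Normalisation: scale every cell to the same total, then log(1 + x)
target_sum = 1e4
normalised = counts / totals * target_sum
logged = np.log1p(normalised)

print(pct_counts_mt)
print(normalised.sum(axis=1))
print(logged.round(3))
```

After scaling, every cell sums to target_sum, so differences in sequencing depth no longer dominate comparisons between cells; the log transform then compresses the range of highly expressed genes.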

Inspecting h5ad from the command line

# Inspect HDF5 structure
h5ls -r pbmc3k.h5ad

# Summarise storage statistics (object counts, dataset sizes, filters)
h5stat pbmc3k.h5ad

Converting from other formats

python3 -c "
import scanpy as sc

# From 10x MEX format (Cell Ranger output)
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')
adata.write('from_10x.h5ad')

# From 10x HDF5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
adata.write('from_10x_h5.h5ad')

# From CSV/TSV count matrix
adata = sc.read_csv('counts.csv').T  # transpose only if the file is genes x cells
adata.write('from_csv.h5ad')
"

Converting to/from Seurat (R)

# Using the SeuratDisk package in R (anndataR is an alternative)
Rscript -e '
library(Seurat)
library(SeuratDisk)

# h5ad to h5Seurat to Seurat
Convert("processed.h5ad", dest="h5seurat", overwrite=TRUE)
seurat_obj <- LoadH5Seurat("processed.h5seurat")  # Convert() writes a lowercase .h5seurat extension
'

Backed mode for large datasets

python3 -c "
import anndata as ad

# Open in backed mode (reads from disk on demand, low memory)
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
print(adata.obs.head())
# Load only a subset into memory (assumes .obs has a cell_type column)
subset = adata[adata.obs['cell_type'] == 'T cell', :].to_memory()
# Release the handle on the backing file when finished
adata.file.close()
"

See Also

  • Scanpy – the primary analysis framework for h5ad data

  • Cell Ranger – upstream tool that produces the raw count matrices

  • MEX / 10x Format – the sparse matrix format output by Cell Ranger (converted to h5ad for analysis)

  • Seurat – R-based single-cell analysis framework with interoperability via SeuratDisk