h5ad / AnnData
Overview
The h5ad format is an HDF5-based file format designed for storing annotated data matrices, particularly single-cell RNA-seq (scRNA-seq) and other single-cell omics data. It is the native file format of the AnnData Python library, which is the core data structure for the Scanpy ecosystem – the most widely used Python framework for single-cell analysis.
h5ad files store not just the count matrix but also the associated metadata, dimensionality reductions, clustering results, and analysis outputs in a single, self-contained file. As a rough guide, an scRNA-seq experiment with 10 000 cells and 30 000 genes produces an h5ad file of approximately 200–500 MB, depending on matrix sparsity, the number of stored layers, and whether compression is enabled.
The format is used in virtually every step of single-cell analysis: from loading raw counts through quality control, normalisation, dimensionality reduction, clustering, differential expression, and trajectory inference.
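The file-size figure quoted for a 10 000 x 30 000 experiment can be sanity-checked with back-of-envelope arithmetic. The sketch below is an estimate, not a measurement: the 7% non-zero fraction, the 4-byte values and column indices, and the 8-byte row pointers are all assumptions chosen for illustration, and a real file also holds metadata, embeddings, extra layers, and possibly compressed datasets.

```python
# Rough, uncompressed size of a cells x genes count matrix.
# ASSUMPTIONS (not from the source): 7% of entries are non-zero,
# values and column indices take 4 bytes each, row pointers 8 bytes.
n_cells, n_genes = 10_000, 30_000
density = 0.07
value_bytes, index_bytes, pointer_bytes = 4, 4, 8

# Dense storage keeps every entry, zero or not.
dense_bytes = n_cells * n_genes * value_bytes

# CSR storage keeps one value and one column index per non-zero entry,
# plus one row pointer per row (and a final end marker).
nnz = round(n_cells * n_genes * density)
csr_bytes = nnz * (value_bytes + index_bytes) + (n_cells + 1) * pointer_bytes

print(f"dense: {dense_bytes / 1e6:.0f} MB")  # dense: 1200 MB
print(f"CSR:   {csr_bytes / 1e6:.0f} MB")    # CSR:   168 MB
```

Under these assumptions a single sparse matrix takes well under a fifth of the dense size, which is why sparse storage is the usual choice for count data. Each additional layer or a .raw snapshot adds a comparable amount.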
Structure
An AnnData object (and its h5ad file on disk) has a well-defined structure with several interconnected components:
AnnData object
+-- .X # Main data matrix (cells x genes)
+-- .obs # Observation (cell) metadata [DataFrame]
+-- .var # Variable (gene) metadata [DataFrame]
+-- .uns # Unstructured annotations [dict]
+-- .obsm # Observation matrices [dict of arrays]
+-- .varm # Variable matrices [dict of arrays]
+-- .obsp # Observation pairwise [dict of sparse matrices]
+-- .layers # Alternative matrix layers [dict of matrices]
+-- .raw # Raw (unprocessed) data [AnnData]
Components
| Attribute | Shape / type | Description |
|---|---|---|
| .X | cells x genes matrix | The main data matrix. Typically raw counts or normalised expression values. Can be dense (NumPy array) or sparse (SciPy CSR/CSC matrix). |
| .obs | DataFrame | Cell-level metadata: barcodes, sample IDs, cluster labels, QC metrics (e.g. pct_counts_mt). |
| .var | DataFrame | Gene-level metadata: gene symbols, Ensembl IDs, and per-gene flags (e.g. mt). |
| .uns | dict | Unstructured data: colour palettes, UMAP parameters, marker gene rankings, analysis logs, sample-level information. |
| .obsm | dict of arrays | Cell embeddings, e.g. X_pca, X_umap. |
| .varm | dict of arrays | Gene-level embeddings, e.g. PCA loadings. |
| .obsp | dict of sparse matrices | Cell-cell pairwise data, e.g. connectivities, distances. |
| .layers | dict of matrices | Alternative representations of .X, e.g. a counts layer holding the raw counts. |
| .raw | AnnData | Snapshot of the data before gene filtering, used for differential expression on the full gene set. |
HDF5 on-disk layout
The h5ad file maps directly to HDF5 groups and datasets:
/ # Root
/X # Main matrix (dense or sparse)
/obs # Cell metadata (stored as HDF5 group with columns)
/var # Gene metadata
/uns # Nested dictionaries
/obsm/X_pca # PCA embedding
/obsm/X_umap # UMAP embedding
/obsp/connectivities # Neighbour graph
/obsp/distances # Distance matrix
/layers/counts # Raw count layer
/raw # Pre-filtering snapshot
Sparse matrices are stored in CSR or CSC format with data, indices, and indptr datasets.
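The roles of data, indices, and indptr are easiest to see on a small matrix. The pure-Python sketch below converts a toy cells x genes matrix to the three CSR arrays and rebuilds one row from them. The helper names to_csr and csr_row are hypothetical and are not part of AnnData, SciPy, or HDF5; in practice a library such as SciPy performs this conversion.

```python
def to_csr(dense):
    """Convert a list-of-rows matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it belongs to
        indptr.append(len(data))      # offset where the next row starts
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the three CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


# Toy matrix: 3 cells x 4 genes, with one all-zero cell.
dense = [
    [0, 2, 0, 1],
    [0, 0, 0, 0],
    [3, 0, 0, 4],
]
data, indices, indptr = to_csr(dense)
print(data)     # [2, 1, 3, 4]
print(indices)  # [1, 3, 0, 3]
print(indptr)   # [0, 2, 2, 4]
print(csr_row(data, indices, indptr, 2, n_cols=4))  # [3, 0, 0, 4]
```

CSC uses the same three arrays with the roles of rows and columns swapped, so indptr marks column boundaries and indices holds row numbers. CSR makes selecting cells cheap, while CSC favours selecting genes.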
Working With
Reading and writing h5ad files
python3 -c "
import scanpy as sc
# Read an h5ad file
adata = sc.read_h5ad('pbmc3k.h5ad')
# Inspect the object
print(adata)
print(adata.obs.head())
print(adata.var.head())
# Write to h5ad
adata.write('pbmc3k_processed.h5ad')
# Write with compression
adata.write('pbmc3k_compressed.h5ad', compression='gzip')
"
Basic single-cell workflow
python3 -c "
import scanpy as sc
adata = sc.read_h5ad('raw_counts.h5ad')
# Quality control
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs.pct_counts_mt < 20, :].copy() # copy so later steps modify a real object, not a view
# Normalise and log-transform
adata.layers['counts'] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Feature selection and dimensionality reduction
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
adata.write('processed.h5ad')
print(adata)
"
Inspecting h5ad from the command line
# Inspect HDF5 structure
h5ls -r pbmc3k.h5ad
# Summarise storage statistics (object counts, dataset sizes, filters)
h5stat pbmc3k.h5ad
Converting from other formats
python3 -c "
import scanpy as sc
# From 10x MEX format (Cell Ranger output)
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')
adata.write('from_10x.h5ad')
# From 10x HDF5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
adata.write('from_10x_h5.h5ad')
# From CSV/TSV count matrix
adata = sc.read_csv('counts.csv').T # genes x cells -> cells x genes
adata.write('from_csv.h5ad')
"
Converting to/from Seurat (R)
# Using the SeuratDisk package in R (anndataR is an alternative)
Rscript -e '
library(Seurat)
library(SeuratDisk)
# h5ad to h5Seurat to Seurat
Convert("processed.h5ad", dest="h5seurat", overwrite=TRUE)
seurat_obj <- LoadH5Seurat("processed.h5seurat")
'
Backed mode for large datasets
python3 -c "
import anndata as ad
# Open in backed mode (reads from disk on demand, low memory)
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')
print(adata.obs.head())
# Access a subset without loading the full matrix (assumes .obs has a cell_type column)
subset = adata[adata.obs['cell_type'] == 'T cell', :].to_memory()
"
See Also
Scanpy – the primary analysis framework for h5ad data
Cell Ranger – upstream tool that produces the raw count matrices
MEX / 10x Format – the sparse matrix format output by Cell Ranger (converted to h5ad for analysis)
Seurat – R-based single-cell analysis framework with interoperability via SeuratDisk