Scanpy
Overview
Scanpy is a scalable Python framework for analysing single-cell gene expression data, developed at the Theis Lab. It provides a complete workflow covering quality control, normalisation, highly variable gene selection, dimensionality reduction (PCA, UMAP, t-SNE), graph-based clustering (Leiden, Louvain), differential expression, and trajectory inference. Scanpy stores all data in the AnnData format, which efficiently handles large datasets and integrates tightly with the broader scverse ecosystem including scvi-tools, squidpy, and muon.
Installation
pip install scanpy
For Leiden clustering support (recommended):
pip install scanpy leidenalg
Basic Usage
Load a 10x Cell Ranger output matrix, filter low-quality cells, and run the standard processing and clustering pipeline.
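import scanpy as sc
# Read the Cell Ranger output directory into an AnnData object
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")
# QC: drop near-empty cells and rarely detected genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Flag mitochondrial genes; the "MT-" prefix matches human gene symbols (mouse symbols use "mt-")
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
# Keep cells with less than 20% mitochondrial counts
adata = adata[adata.obs.pct_counts_mt < 20].copy()
# Processing: scale each cell to 10,000 total counts, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Flag the 2,000 most variable genes; PCA uses them by default when the flag is present
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata)
# Neighbour graph on the first 20 principal components, then embedding and clustering
sc.pp.neighbors(adata, n_pcs=20)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
# Plot the clusters and save the processed object
sc.pl.umap(adata, color="leiden")
adata.write("processed.h5ad")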
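The QC step filters on the number of detected genes and on pct_counts_mt. As a rough illustration of the arithmetic behind those two metrics, the NumPy sketch below recomputes them on a toy matrix. The gene names, counts, and the min_genes threshold of 2 are made up for the example, and this is not how Scanpy implements the calculation internally.

```python
import numpy as np

# Toy counts: 3 cells x 4 genes (hypothetical values for illustration only).
gene_names = np.array(["MT-CO1", "MT-ND1", "ACTB", "GAPDH"])
counts = np.array([
    [5, 5, 40, 50],    # 10 of 100 counts are mitochondrial
    [30, 20, 25, 25],  # 50 of 100 counts are mitochondrial
    [0, 0, 10, 0],     # no mitochondrial counts, only one gene detected
])

# Same idea as adata.var_names.str.startswith("MT-") in the pipeline.
is_mt = np.char.startswith(gene_names, "MT-")

total_counts = counts.sum(axis=1)        # counts per cell
n_genes = (counts > 0).sum(axis=1)       # genes detected per cell
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

# Keep cells with at least 2 detected genes (toy threshold) and < 20% mitochondrial counts.
keep = (n_genes >= 2) & (pct_counts_mt < 20)
print(n_genes, pct_counts_mt, keep)
```

Only the first cell passes both filters here: the second has too high a mitochondrial fraction and the third has too few detected genes.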
Key Parameters
| Function / parameter | Description |
|---|---|
| sc.pp.filter_cells (min_genes) | Remove cells with fewer than the specified number of detected genes. |
| sc.pp.filter_genes (min_cells) | Remove genes detected in fewer than the specified number of cells. |
| sc.pp.calculate_qc_metrics (qc_vars) | Compute per-cell and per-gene quality metrics (total counts, gene counts, mitochondrial fraction). |
| sc.pp.normalize_total (target_sum) | Normalise each cell so its total counts equal target_sum. |
| sc.pp.log1p | Apply the log1p transformation (natural logarithm of 1 + x). |
| sc.pp.highly_variable_genes (n_top_genes) | Select the top N highly variable genes for downstream analysis. |
| sc.tl.pca | Perform PCA; by default it uses the highly variable genes when adata.var["highly_variable"] is present. |
| sc.pp.neighbors (n_pcs) | Build a k-nearest-neighbours graph using the specified number of principal components. |
| sc.tl.leiden (resolution) | Cluster cells using the Leiden algorithm. Higher resolution yields more clusters. |
| sc.tl.umap | Compute a UMAP embedding for visualisation. |
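To make the normalisation and log-transform rows concrete, here is a minimal NumPy sketch of the arithmetic on a toy count matrix. The helper name normalize_total_log1p is hypothetical, and the handling of all-zero cells is an assumption made for the sketch. It illustrates the formula only and is not Scanpy's implementation.

```python
import numpy as np

def normalize_total_log1p(counts, target_sum=1e4):
    """Scale each row (cell) to target_sum total counts, then apply log1p.

    Hypothetical helper for illustration; not part of Scanpy.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption for this sketch: leave all-zero cells as zeros instead of dividing by 0.
    safe_totals = np.where(totals == 0, 1.0, totals)
    scaled = counts / safe_totals * target_sum
    return scaled, np.log1p(scaled)

# Two toy cells with different sequencing depth (4 and 20 total counts).
counts = np.array([[1, 3, 0],
                   [10, 0, 10]])
scaled, logged = normalize_total_log1p(counts)

print(scaled)  # rows become 2500, 7500, 0 and 5000, 0, 5000
print(logged)  # natural log of (1 + scaled value); zeros stay zero
```

After scaling, both cells sum to 10,000 even though their raw depths differ by a factor of five, which is what makes expression values comparable across cells.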
Expected Output
The processing pipeline produces an AnnData object (saved as
processed.h5ad) containing:
adata.X – normalised and log-transformed expression matrix.
adata.raw – a snapshot of the data before filtering on highly variable genes (only if adata.raw is set; the pipeline above does not set it).
adata.obs["leiden"] – cluster assignments for every cell.
adata.obsm["X_pca"] – PCA coordinates.
adata.obsm["X_umap"] – UMAP coordinates.
adata.var["highly_variable"] – boolean mask marking selected genes.
The sc.pl.umap() call produces a UMAP scatter plot coloured by Leiden
cluster identity. Plots can be saved to files with the save parameter
(e.g., sc.pl.umap(adata, color="leiden", save="_clusters.png"), which
with default settings writes figures/umap_clusters.png).
See Also
Seurat – R equivalent for single-cell analysis with a similar workflow
Cell Ranger – upstream pipeline that generates count matrices loaded by Scanpy
STARsolo – open-source alternative for generating count matrices