Scanpy

Overview

Scanpy is a scalable Python framework for analysing single-cell gene expression data, developed at the Theis Lab. It provides a complete workflow covering quality control, normalisation, highly variable gene selection, dimensionality reduction (PCA, UMAP, t-SNE), graph-based clustering (Leiden, Louvain), differential expression, and trajectory inference. Scanpy stores all data in the AnnData format, which keeps the expression matrix together with per-cell and per-gene annotations, supports sparse matrices for large datasets, and integrates with the broader scverse ecosystem, including scvi-tools, squidpy, and muon.

Installation

pip install scanpy

For Leiden clustering support (recommended):

pip install scanpy leidenalg
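
To confirm that both packages ended up in the active environment, a small check along these lines can help. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of Scanpy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages installed above.
for name in ("scanpy", "leidenalg"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports "not installed", the pip command most likely ran in a different environment from the one running Python.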

Basic Usage

Load a 10x Cell Ranger output matrix, filter low-quality cells, and run the standard processing and clustering pipeline.

import scanpy as sc

adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")

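# QC: drop cells with fewer than 200 detected genes and genes seen in fewer than 3 cells
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Flag mitochondrial genes (the "MT-" prefix matches human gene symbols; mouse uses "mt-")
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
# Keep cells with less than 20% of their counts in mitochondrial genes
adata = adata[adata.obs.pct_counts_mt < 20].copy()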

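# Processing: scale each cell to 10,000 total counts, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Flag (but do not remove) the 2,000 most variable genes in adata.var["highly_variable"]
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata)
# Build the neighbour graph on the first 20 PCs; UMAP and Leiden both use this graph
sc.pp.neighbors(adata, n_pcs=20)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)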

sc.pl.umap(adata, color="leiden")
adata.write("processed.h5ad")
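
The QC thresholds in the pipeline reduce to simple per-cell arithmetic. The sketch below reproduces that arithmetic in plain Python on invented toy counts; it is not how Scanpy implements it. The gene names, the values, and the lowered min_genes threshold of 2 are assumptions chosen to keep the example small. The dictionary keys mirror the column names Scanpy writes to adata.obs.

```python
# Toy counts for three cells; names and values are invented for illustration.
cells = {
    "cell_a": {"MT-CO1": 5, "ACTB": 60, "GAPDH": 35},
    "cell_b": {"MT-CO1": 30, "ACTB": 50, "GAPDH": 20},
    "cell_c": {"MT-CO1": 0, "ACTB": 10, "GAPDH": 0},
}


def qc_metrics(counts):
    """Compute the per-cell quantities that the QC filters act on."""
    total = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mt = sum(c for gene, c in counts.items() if gene.startswith("MT-"))
    return {
        "n_genes": n_genes,
        "total_counts": total,
        "pct_counts_mt": 100 * mt / total,
    }


metrics = {name: qc_metrics(counts) for name, counts in cells.items()}

# Keep cells with at least 2 detected genes (toy threshold) and under 20% mitochondrial counts.
kept = [
    name
    for name, m in metrics.items()
    if m["n_genes"] >= 2 and m["pct_counts_mt"] < 20
]
print(kept)  # ['cell_a']
```

Here cell_b is dropped for its 30% mitochondrial fraction and cell_c for having a single detected gene.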

Key Parameters

Function / parameter	Description
`sc.pp.filter_cells(min_genes)`	Remove cells with fewer than the specified number of detected genes.
`sc.pp.filter_genes(min_cells)`	Remove genes detected in fewer than the specified number of cells.
`sc.pp.calculate_qc_metrics`	Compute per-cell and per-gene quality metrics (total counts, number of detected genes, percentage of counts from mitochondrial genes).
`sc.pp.normalize_total(target_sum)`	Normalise each cell so total counts equal `target_sum` (typically 10,000).
`sc.pp.log1p`	Apply log1p (natural logarithm of 1 + x) transformation.
`sc.pp.highly_variable_genes(n_top_genes)`	Select the top N highly variable genes for downstream analysis.
`sc.tl.pca`	Perform PCA; restricted to the highly variable genes when `adata.var["highly_variable"]` is present.
`sc.pp.neighbors(n_pcs)`	Build a k-nearest-neighbours graph using the specified number of principal components.
`sc.tl.leiden(resolution)`	Cluster cells using the Leiden algorithm. Higher resolution yields more clusters.
`sc.tl.umap`	Compute a UMAP embedding for visualisation.
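
To make the normalisation rows concrete, the following sketch works through the arithmetic for one cell in plain Python. The function names normalize_total_toy and log1p_toy are hypothetical stand-ins for the Scanpy calls, and the example assumes the cell has a non-zero total count.

```python
import math


def normalize_total_toy(counts, target_sum=1e4):
    """Scale one cell's counts so that they sum to target_sum."""
    total = sum(counts)
    return [c * target_sum / total for c in counts]


def log1p_toy(values):
    """Apply the natural logarithm of (1 + x) to each value."""
    return [math.log1p(v) for v in values]


cell = [2, 0, 6]  # raw counts for three genes, total of 8
normalized = normalize_total_toy(cell)
logged = log1p_toy(normalized)

print(normalized)       # [2500.0, 0.0, 7500.0]
print(sum(normalized))  # 10000.0
print(logged[1])        # 0.0, because log1p maps zero counts to zero
```

Because log1p leaves zeros at zero, a sparse count matrix stays sparse after the transformation.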

Expected Output

The processing pipeline produces an AnnData object (saved as processed.h5ad) containing:

adata.X – normalised and log-transformed expression matrix.
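adata.raw – a frozen copy of the data with the full gene set, present only if you assign it (for example adata.raw = adata) before subsetting to highly variable genes; the pipeline above does not set it.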
adata.obs["leiden"] – cluster assignments for every cell.
adata.obsm["X_pca"] – PCA coordinates.
adata.obsm["X_umap"] – UMAP coordinates.
adata.var["highly_variable"] – boolean mask marking selected genes.

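The sc.pl.umap() call produces a UMAP scatter plot coloured by Leiden cluster identity. Plots can be saved to files with the save parameter (e.g., sc.pl.umap(adata, color="leiden", save="_clusters.png")). The string is appended to the plot type, so with the default figure directory this example writes figures/umap_clusters.png.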