Scanpy

Overview

Scanpy is a scalable Python framework for analysing single-cell gene expression data, developed at the Theis Lab. It provides a complete workflow covering quality control, normalisation, highly variable gene selection, dimensionality reduction (PCA, UMAP, t-SNE), graph-based clustering (Leiden, Louvain), differential expression, and trajectory inference. Scanpy stores all data in the AnnData format, which efficiently handles large datasets and integrates tightly with the broader scverse ecosystem including scvi-tools, squidpy, and muon.

Installation

pip install scanpy

For Leiden clustering support (recommended):

pip install scanpy leidenalg
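
After installing, it can help to confirm which versions actually ended up in the active environment. The snippet below is a minimal sketch that uses only the Python standard library; the helper name installed_versions is illustrative and is not part of Scanpy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


for name, ver in installed_versions(["scanpy", "leidenalg"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If leidenalg is reported as not installed, the Leiden clustering step in the pipeline is likely to fail with an import error.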

Basic Usage

Load a 10x Cell Ranger output matrix, filter low-quality cells, and run the standard processing and clustering pipeline.

import scanpy as sc

adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")

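# QC: drop near-empty cells and rarely detected genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Flag mitochondrial genes (the "MT-" prefix applies to human gene symbols; mouse uses "mt-")
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
# Keep cells with less than 20% mitochondrial counts
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()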

# Processing
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=20)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

sc.pl.umap(adata, color="leiden")
adata.write("processed.h5ad")
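
To make the QC thresholds more concrete, the following NumPy-only sketch reproduces the two quantities the filters rely on (genes detected per cell and percentage of mitochondrial counts) on a made-up toy matrix. It is an illustration of the arithmetic, not Scanpy's implementation, and the thresholds are scaled down to suit the toy data.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns); all values are invented.
gene_names = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH", "CD3E"]
counts = np.array([
    [1, 0, 5, 3, 1],  # ordinary cell
    [8, 7, 1, 0, 0],  # dominated by mitochondrial counts
    [0, 0, 2, 0, 0],  # very few genes detected
    [0, 1, 4, 6, 2],  # ordinary cell
])

# Genes with at least one count in each cell (the quantity min_genes is compared to).
n_genes_per_cell = (counts > 0).sum(axis=1)

# Percentage of each cell's counts that come from "MT-" genes (analogous to pct_counts_mt).
is_mt = np.array([name.startswith("MT-") for name in gene_names])
total_counts = counts.sum(axis=1)
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

# Toy thresholds: at least 3 detected genes and under 20% mitochondrial counts.
keep = (n_genes_per_cell >= 3) & (pct_counts_mt < 20)
filtered = counts[keep]
print(n_genes_per_cell, pct_counts_mt.round(1), keep)
```

In this toy example the second cell is removed for its high mitochondrial percentage and the third for having too few detected genes, which mirrors what the min_genes and pct_counts_mt filters do on real data.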

Key Parameters

Function / parameter                      Description
----------------------------------------  ------------------------------------------------------------
sc.pp.filter_cells(min_genes)             Remove cells with fewer than the specified number of detected genes.
sc.pp.filter_genes(min_cells)             Remove genes detected in fewer than the specified number of cells.
sc.pp.calculate_qc_metrics                Compute per-cell and per-gene quality metrics (total counts, number of detected genes, percentage of mitochondrial counts).
sc.pp.normalize_total(target_sum)         Normalise each cell so total counts equal target_sum (typically 10,000).
sc.pp.log1p                               Apply log1p (natural logarithm of 1 + x) transformation.
sc.pp.highly_variable_genes(n_top_genes)  Select the top N highly variable genes for downstream analysis.
sc.tl.pca                                 Perform PCA; by default uses the highly variable genes when they have been flagged in adata.var["highly_variable"].
sc.pp.neighbors(n_pcs)                    Build a k-nearest-neighbours graph using the specified number of principal components.
sc.tl.umap                                Compute a UMAP embedding for visualisation.
sc.tl.leiden(resolution)                  Cluster cells using the Leiden algorithm. Higher resolution yields more clusters.
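
As a rough illustration of what total-count normalisation followed by log1p does to the numbers, here is a NumPy-only sketch on an invented two-cell matrix. It mirrors the arithmetic described for sc.pp.normalize_total and sc.pp.log1p but is not Scanpy's code.

```python
import numpy as np

# Invented counts: 2 cells (rows) x 3 genes (columns).
counts = np.array([
    [10.0, 30.0, 60.0],    # cell with 100 total counts
    [200.0, 0.0, 800.0],   # cell with 1,000 total counts
])
target_sum = 1e4

# Scale each cell so that its counts sum to target_sum.
totals = counts.sum(axis=1, keepdims=True)
normalised = counts / totals * target_sum

# Natural logarithm of (1 + x); zero counts stay exactly zero.
logged = np.log1p(normalised)

print(normalised)
print(logged.round(3))
```

After scaling, both cells sum to 10,000 even though their raw sequencing depths differ tenfold, so the log-transformed values are comparable across cells.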

Expected Output

The processing pipeline produces an AnnData object (saved as processed.h5ad) containing:

  • adata.X – normalised and log-transformed expression matrix.

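  • adata.raw – not populated by the pipeline above; it holds a snapshot of the full data only if you assign it yourself (e.g. adata.raw = adata after log-transformation and before subsetting to highly variable genes).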

  • adata.obs["leiden"] – cluster assignments for every cell.

  • adata.obsm["X_pca"] – PCA coordinates.

  • adata.obsm["X_umap"] – UMAP coordinates.

  • adata.var["highly_variable"] – boolean mask marking selected genes.
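
To give a feel for what the PCA coordinates in adata.obsm["X_pca"] represent, the sketch below derives an equivalent cells-by-components array from a random stand-in matrix using NumPy's SVD. This is an assumption-laden illustration: Scanpy's actual solver, default number of components, and sign conventions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_comps = 50, 20, 5

# Random stand-in for a log-normalised expression matrix (cells x genes).
X = rng.normal(size=(n_cells, n_genes))

# Centre each gene, then decompose the centred matrix.
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Project cells onto the leading components: one row per cell, one column per PC.
X_pca = X_centred @ Vt[:n_comps].T

# Variance captured by each retained component.
explained_variance = S[:n_comps] ** 2 / (n_cells - 1)

print(X_pca.shape)
```

The resulting array has one row per cell, which is why it lives in obsm (per-observation, multi-dimensional) rather than in obs or var.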

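The sc.pl.umap() call produces a UMAP scatter plot coloured by Leiden cluster identity. To save a plot to a file, pass the save parameter, e.g. sc.pl.umap(adata, color="leiden", save="_clusters.png"); with default settings Scanpy prefixes the plot type to the given suffix and writes the file to figures/umap_clusters.png.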

See Also

  • Seurat – R equivalent for single-cell analysis with a similar workflow

  • Cell Ranger – upstream pipeline that generates count matrices loaded by Scanpy

  • STARsolo – open-source alternative for generating count matrices