Scanpy
======

Overview
--------

Scanpy is a scalable Python framework for analysing single-cell gene expression data, developed at the Theis Lab. It provides a complete workflow covering quality control, normalisation, highly variable gene selection, dimensionality reduction (PCA, UMAP, t-SNE), graph-based clustering (Leiden, Louvain), differential expression, and trajectory inference. Scanpy stores all data in the AnnData format, which efficiently handles large datasets and integrates tightly with the broader scverse ecosystem including scvi-tools, squidpy, and muon.

Installation
------------

.. code-block:: bash

    pip install scanpy

For Leiden clustering support (recommended):

.. code-block:: bash

    pip install scanpy leidenalg

Basic Usage
-----------

Load a 10x Cell Ranger output matrix, filter low-quality cells, and run the standard processing and clustering pipeline.

.. code-block:: python

    import scanpy as sc

    adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")

    # QC
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
    adata = adata[adata.obs.pct_counts_mt < 20].copy()

    # Processing
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.tl.pca(adata)
    sc.pp.neighbors(adata, n_pcs=20)
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=0.5)

    sc.pl.umap(adata, color="leiden")
    adata.write("processed.h5ad")

Key Parameters
--------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Function / parameter
     - Description
   * - ``sc.pp.filter_cells(min_genes)``
     - Remove cells with fewer than the specified number of detected genes.
   * - ``sc.pp.filter_genes(min_cells)``
     - Remove genes detected in fewer than the specified number of cells.
   * - ``sc.pp.calculate_qc_metrics``
     - Compute per-cell and per-gene quality metrics (total counts, gene counts, mitochondrial fraction).
   * - ``sc.pp.normalize_total(target_sum)``
     - Normalise each cell so total counts equal ``target_sum`` (typically 10,000).
   * - ``sc.pp.log1p``
     - Apply log1p (natural logarithm of 1 + x) transformation.
   * - ``sc.pp.highly_variable_genes(n_top_genes)``
     - Select the top N highly variable genes for downstream analysis.
   * - ``sc.tl.pca``
     - Perform PCA; by default uses the highly variable genes.
   * - ``sc.pp.neighbors(n_pcs)``
     - Build a k-nearest-neighbours graph using the specified number of principal components.
   * - ``sc.tl.leiden(resolution)``
     - Cluster cells using the Leiden algorithm. Higher resolution yields more clusters.
   * - ``sc.tl.umap``
     - Compute a UMAP embedding for visualisation.

Expected Output
---------------

The processing pipeline produces an ``AnnData`` object (saved as ``processed.h5ad``) containing:

* ``adata.X`` -- normalised and log-transformed expression matrix.
* ``adata.raw`` -- a snapshot of the data before filtering on highly variable genes (if ``adata.raw`` is set).
* ``adata.obs["leiden"]`` -- cluster assignments for every cell.
* ``adata.obsm["X_pca"]`` -- PCA coordinates.
* ``adata.obsm["X_umap"]`` -- UMAP coordinates.
* ``adata.var["highly_variable"]`` -- boolean mask marking selected genes.

The ``sc.pl.umap()`` call produces a UMAP scatter plot coloured by Leiden cluster identity. Plots can be saved to files with the ``save`` parameter (e.g., ``sc.pl.umap(adata, color="leiden", save="_clusters.png")``).

See Also
--------

* :doc:`seurat` -- R equivalent for single-cell analysis with a similar workflow
* :doc:`cellranger` -- upstream pipeline that generates count matrices loaded by Scanpy
* :doc:`starsolo` -- open-source alternative for generating count matrices
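
Worked Example: Normalisation Arithmetic
----------------------------------------

The descriptions of ``sc.pp.normalize_total`` and ``sc.pp.log1p`` under Key Parameters can be illustrated with a small pure-Python sketch of the arithmetic for a single cell. This is not Scanpy's implementation. The helper names ``normalize_total_row`` and ``log1p_row``, the toy counts, and the handling of an all-zero cell are assumptions made only for illustration.

.. code-block:: python

    import math


    def normalize_total_row(counts, target_sum=1e4):
        """Scale one cell's counts so that they sum to target_sum (hypothetical helper)."""
        total = sum(counts)
        if total == 0:
            # Assumption for this sketch: an all-zero cell is returned as zeros.
            return [0.0 for _ in counts]
        return [c * target_sum / total for c in counts]


    def log1p_row(values):
        """Apply the natural logarithm of 1 + x to every value (hypothetical helper)."""
        return [math.log1p(v) for v in values]


    # Toy cell with raw counts for four genes (50 counts in total).
    cell = [5, 0, 15, 30]

    normalised = normalize_total_row(cell)   # each count is multiplied by 10,000 / 50
    transformed = log1p_row(normalised)      # zero counts stay at zero after log1p

    print(normalised)
    print(transformed)

After scaling, the toy cell's values sum to ``target_sum``, so cells sequenced at different depths become comparable. The log1p step then compresses the range of values while leaving zero counts at zero.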