scRNA-seq
Overview
Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, enabling the identification of cell types, states, and trajectories within heterogeneous tissues. This workflow covers the complete analysis from raw FASTQ files to an integrated multi-sample object. STARsolo performs barcode demultiplexing and UMI counting directly during alignment, producing a cell-by-gene count matrix. Scanpy then handles quality filtering, normalization, dimensionality reduction, clustering, and marker gene detection. When multiple samples are profiled, batch-aware integration (e.g., scVI, Harmony, or scanorama) corrects for technical variation while preserving biological signal.
Pipeline Steps
STARsolo – Align reads, demultiplex cell barcodes, collapse UMIs, and produce a filtered count matrix per sample. Empty droplets are removed with the
EmptyDrops_CRcell-calling algorithm.Scanpy analysis – For each sample, load the count matrix, filter cells and genes, normalize, identify highly variable genes, compute PCA, build a neighbor graph, cluster (Leiden), and run UMAP embedding.
Integration – Concatenate per-sample AnnData objects and apply a batch-correction method to produce a unified embedding across all samples.
Implementation
# Snakefile -- scRNA-seq Pipeline
configfile: "config.yaml"
SAMPLES = config["samples"]
STAR_IDX = config["star_index"]
WHITELIST = config["barcode_whitelist"]
rule all:
input:
expand("results/scanpy/{sample}_processed.h5ad", sample=SAMPLES),
"results/scanpy/integrated.h5ad"
rule starsolo:
input:
r1="data/{sample}_R1.fastq.gz",
r2="data/{sample}_R2.fastq.gz"
output:
bam="results/starsolo/{sample}/Aligned.sortedByCoord.out.bam",
mtx="results/starsolo/{sample}/Solo.out/Gene/filtered/matrix.mtx"
threads: 16
params:
idx=STAR_IDX,
wl=WHITELIST,
outdir="results/starsolo/{sample}/"
conda: "envs/starsolo.yaml"
shell:
"""
STAR --runThreadN {threads} \
--genomeDir {params.idx} \
--readFilesIn {input.r1} {input.r2} \
--readFilesCommand zcat \
--soloType CB_UMI_Simple \
--soloCBwhitelist {params.wl} \
--soloCBstart 1 --soloCBlen 16 \
--soloUMIstart 17 --soloUMIlen 12 \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix {params.outdir} \
--soloFeatures Gene \
--soloCellFilter EmptyDrops_CR
"""
rule scanpy_analysis:
input:
mtx="results/starsolo/{sample}/Solo.out/Gene/filtered/matrix.mtx"
output:
h5ad="results/scanpy/{sample}_processed.h5ad"
conda: "envs/scanpy.yaml"
script:
"scripts/run_scanpy.py"
rule integrate_samples:
input:
expand("results/scanpy/{sample}_processed.h5ad", sample=SAMPLES)
output:
"results/scanpy/integrated.h5ad"
conda: "envs/scanpy.yaml"
script:
"scripts/integrate_scanpy.py"
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.samplesheet = "samplesheet.csv"
params.star_index = "ref/star_index"
params.whitelist = "ref/3M-february-2018.txt"
params.outdir = "results"
process STARSOLO {
tag "${sample_id}"
publishDir "${params.outdir}/starsolo/${sample_id}", mode: 'copy'
cpus 16
memory '64 GB'
input:
tuple val(sample_id), path(r1), path(r2)
path(star_index)
path(whitelist)
output:
tuple val(sample_id), path("Solo.out/Gene/filtered/"), emit: counts
path("Log.final.out"), emit: log
script:
"""
STAR --runThreadN ${task.cpus} \
--genomeDir ${star_index} \
--readFilesIn ${r1} ${r2} \
--readFilesCommand zcat \
--soloType CB_UMI_Simple \
--soloCBwhitelist ${whitelist} \
--soloCBstart 1 --soloCBlen 16 \
--soloUMIstart 17 --soloUMIlen 12 \
--outSAMtype BAM SortedByCoordinate \
--soloFeatures Gene \
--soloCellFilter EmptyDrops_CR
"""
}
process SCANPY_PROCESS {
tag "${sample_id}"
publishDir "${params.outdir}/scanpy", mode: 'copy'
input: tuple val(sample_id), path(counts_dir)
output: tuple val(sample_id), path("${sample_id}_processed.h5ad"), emit: h5ad
script:
"""
python ${projectDir}/scripts/run_scanpy.py \
--input ${counts_dir} --sample ${sample_id} \
--output ${sample_id}_processed.h5ad
"""
}
process INTEGRATE {
publishDir "${params.outdir}/scanpy", mode: 'copy'
input: path(h5ad_files)
output: path("integrated.h5ad")
script:
"""
python ${projectDir}/scripts/integrate_scanpy.py \
--input_files ${h5ad_files} --output integrated.h5ad
"""
}
workflow {
ch_input = Channel.fromPath(params.samplesheet)
.splitCsv(header: true)
.map { row -> tuple(row.sample_id, file(row.fastq_r1), file(row.fastq_r2)) }
star_idx = Channel.value(file(params.star_index))
wl = Channel.value(file(params.whitelist))
STARSOLO(ch_input, star_idx, wl)
SCANPY_PROCESS(STARSOLO.out.counts)
INTEGRATE(SCANPY_PROCESS.out.h5ad.map { it[1] }.collect())
}
Expected Output
After a successful run the output directory will contain:
results/starsolo/<sample>/Solo.out/Gene/filtered/– Per-sample filtered count matrices (matrix.mtx,barcodes.tsv,features.tsv).results/scanpy/<sample>_processed.h5ad– AnnData objects with QC metrics, clusters, UMAP coordinates, and marker genes per sample.results/scanpy/integrated.h5ad– Combined AnnData with batch-corrected embedding, joint clustering, and cell-type annotations.
See Also
RNA-seq Differential Expression – Bulk RNA-seq differential expression pipeline
Snakemake – Snakemake workflow manager overview