scRNA-seq

Overview

Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, enabling the identification of cell types, states, and trajectories within heterogeneous tissues. This workflow covers the complete analysis from raw FASTQ files to an integrated multi-sample object. STARsolo performs barcode demultiplexing and UMI counting directly during alignment, producing a cell-by-gene count matrix. Scanpy then handles quality filtering, normalization, dimensionality reduction, clustering, and marker gene detection. When multiple samples are profiled, batch-aware integration (e.g., scVI, Harmony, or scanorama) corrects for technical variation while preserving biological signal.

Pipeline Steps

STARsolo – Align reads, demultiplex cell barcodes, collapse UMIs, and produce a filtered count matrix per sample. Empty droplets are removed with the EmptyDrops_CR cell-calling algorithm.
Scanpy analysis – For each sample, load the count matrix, filter cells and genes, normalize, identify highly variable genes, compute PCA, build a neighbor graph, cluster (Leiden), and run UMAP embedding.
Integration – Concatenate per-sample AnnData objects and apply a batch-correction method to produce a unified embedding across all samples.

Implementation

# Snakefile -- scRNA-seq Pipeline
configfile: "config.yaml"

SAMPLES    = config["samples"]
STAR_IDX   = config["star_index"]
WHITELIST  = config["barcode_whitelist"]

rule all:
    input:
        expand("results/scanpy/{sample}_processed.h5ad", sample=SAMPLES),
        "results/scanpy/integrated.h5ad"

rule starsolo:
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz"
    output:
        bam="results/starsolo/{sample}/Aligned.sortedByCoord.out.bam",
        mtx="results/starsolo/{sample}/Solo.out/Gene/filtered/matrix.mtx"
    threads: 16
    params:
        idx=STAR_IDX,
        wl=WHITELIST,
        outdir="results/starsolo/{sample}/"
    conda: "envs/starsolo.yaml"
    shell:
        """
        STAR --runThreadN {threads} \
          --genomeDir {params.idx} \
          --readFilesIn {input.r1} {input.r2} \
          --readFilesCommand zcat \
          --soloType CB_UMI_Simple \
          --soloCBwhitelist {params.wl} \
          --soloCBstart 1 --soloCBlen 16 \
          --soloUMIstart 17 --soloUMIlen 12 \
          --outSAMtype BAM SortedByCoordinate \
          --outFileNamePrefix {params.outdir} \
          --soloFeatures Gene \
          --soloCellFilter EmptyDrops_CR
        """

rule scanpy_analysis:
    input:
        mtx="results/starsolo/{sample}/Solo.out/Gene/filtered/matrix.mtx"
    output:
        h5ad="results/scanpy/{sample}_processed.h5ad"
    conda: "envs/scanpy.yaml"
    script:
        "scripts/run_scanpy.py"

rule integrate_samples:
    input:
        expand("results/scanpy/{sample}_processed.h5ad", sample=SAMPLES)
    output:
        "results/scanpy/integrated.h5ad"
    conda: "envs/scanpy.yaml"
    script:
        "scripts/integrate_scanpy.py"

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.samplesheet = "samplesheet.csv"
params.star_index  = "ref/star_index"
params.whitelist   = "ref/3M-february-2018.txt"
params.outdir      = "results"

process STARSOLO {
    tag "${sample_id}"
    publishDir "${params.outdir}/starsolo/${sample_id}", mode: 'copy'
    cpus 16
    memory '64 GB'
    input:
        tuple val(sample_id), path(r1), path(r2)
        path(star_index)
        path(whitelist)
    output:
        tuple val(sample_id), path("Solo.out/Gene/filtered/"), emit: counts
        path("Log.final.out"), emit: log
    script:
    """
    STAR --runThreadN ${task.cpus} \
      --genomeDir ${star_index} \
      --readFilesIn ${r1} ${r2} \
      --readFilesCommand zcat \
      --soloType CB_UMI_Simple \
      --soloCBwhitelist ${whitelist} \
      --soloCBstart 1 --soloCBlen 16 \
      --soloUMIstart 17 --soloUMIlen 12 \
      --outSAMtype BAM SortedByCoordinate \
      --soloFeatures Gene \
      --soloCellFilter EmptyDrops_CR
    """
}

process SCANPY_PROCESS {
    tag "${sample_id}"
    publishDir "${params.outdir}/scanpy", mode: 'copy'
    input:  tuple val(sample_id), path(counts_dir)
    output: tuple val(sample_id), path("${sample_id}_processed.h5ad"), emit: h5ad
    script:
    """
    python ${projectDir}/scripts/run_scanpy.py \
      --input ${counts_dir} --sample ${sample_id} \
      --output ${sample_id}_processed.h5ad
    """
}

process INTEGRATE {
    publishDir "${params.outdir}/scanpy", mode: 'copy'
    input:  path(h5ad_files)
    output: path("integrated.h5ad")
    script:
    """
    python ${projectDir}/scripts/integrate_scanpy.py \
      --input_files ${h5ad_files} --output integrated.h5ad
    """
}

workflow {
    ch_input = Channel.fromPath(params.samplesheet)
        .splitCsv(header: true)
        .map { row -> tuple(row.sample_id, file(row.fastq_r1), file(row.fastq_r2)) }

    star_idx = Channel.value(file(params.star_index))
    wl       = Channel.value(file(params.whitelist))

    STARSOLO(ch_input, star_idx, wl)
    SCANPY_PROCESS(STARSOLO.out.counts)
    INTEGRATE(SCANPY_PROCESS.out.h5ad.map { it[1] }.collect())
}

Expected Output

After a successful run the output directory will contain:

results/starsolo/<sample>/Solo.out/Gene/filtered/ – Per-sample filtered count matrices (matrix.mtx, barcodes.tsv, features.tsv).
results/scanpy/<sample>_processed.h5ad – AnnData objects with QC metrics, clusters, UMAP coordinates, and marker genes per sample.
results/scanpy/integrated.h5ad – Combined AnnData with batch-corrected embedding, joint clustering, and cell-type annotations.

scRNA-seq

Overview

Pipeline Steps

Implementation

Expected Output

See Also