RNA-seq Differential Expression
Overview
RNA-seq differential expression analysis quantifies gene expression levels across experimental conditions and identifies genes with statistically significant changes. The Snakemake implementation uses STAR for splice-aware alignment followed by featureCounts for read summarisation, while the Nextflow implementation uses Salmon for lightweight, alignment-free quantification via quasi-mapping. Both approaches feed count matrices into DESeq2 for normalisation, dispersion estimation, and differential testing.
Pipeline Steps
Trim adapters and low-quality bases with fastp.
Align reads to the reference genome (STAR) or quantify transcripts directly (Salmon).
Generate a gene-level count matrix (featureCounts or Salmon
quant.sf).Run DESeq2 for normalisation and differential expression testing.
Aggregate QC metrics into a MultiQC report.
Implementation
The Snakemake approach uses STAR two-pass alignment and featureCounts for read counting, then runs DESeq2 via an R script.
# Snakefile -- RNA-seq Differential Expression (STAR + featureCounts)
configfile: "config.yaml"
SAMPLES = config["samples"]
REF = config["star_index"]
GTF = config["gtf"]
rule all:
input:
"results/deseq2/differential_expression.csv",
"results/multiqc/multiqc_report.html"
rule fastp_trim:
input:
r1="data/{sample}_R1.fastq.gz",
r2="data/{sample}_R2.fastq.gz"
output:
r1="results/trimmed/{sample}_R1.trimmed.fastq.gz",
r2="results/trimmed/{sample}_R2.trimmed.fastq.gz",
json="results/trimmed/{sample}_fastp.json",
html="results/trimmed/{sample}_fastp.html"
threads: 4
conda: "envs/qc.yaml"
shell:
"""
fastp -i {input.r1} -I {input.r2} \
-o {output.r1} -O {output.r2} \
-j {output.json} -h {output.html} \
--detect_adapter_for_pe --thread {threads}
"""
rule star_align:
input:
r1="results/trimmed/{sample}_R1.trimmed.fastq.gz",
r2="results/trimmed/{sample}_R2.trimmed.fastq.gz",
index=REF
output:
bam="results/star/{sample}/{sample}.Aligned.sortedByCoord.out.bam",
log="results/star/{sample}/{sample}.Log.final.out"
threads: 12
conda: "envs/align.yaml"
shell:
"""
STAR --runThreadN {threads} \
--genomeDir {input.index} \
--readFilesIn {input.r1} {input.r2} \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/star/{wildcards.sample}/{wildcards.sample}. \
--quantMode GeneCounts \
--twopassMode Basic
samtools index {output.bam}
"""
rule featurecounts:
input:
bams=expand("results/star/{sample}/{sample}.Aligned.sortedByCoord.out.bam",
sample=SAMPLES),
gtf=GTF
output:
counts="results/featurecounts/gene_counts.txt",
summary="results/featurecounts/gene_counts.txt.summary"
threads: 8
conda: "envs/quantification.yaml"
shell:
"""
featureCounts -T {threads} -p --countReadPairs \
-a {input.gtf} -o {output.counts} \
-s 2 --primary \
{input.bams}
"""
rule deseq2:
input:
counts="results/featurecounts/gene_counts.txt",
metadata=config["sample_metadata"]
output:
results="results/deseq2/differential_expression.csv",
norm_counts="results/deseq2/normalised_counts.csv",
pca="results/deseq2/pca_plot.pdf",
ma="results/deseq2/ma_plot.pdf",
volcano="results/deseq2/volcano_plot.pdf"
conda: "envs/deseq2.yaml"
script:
"scripts/run_deseq2.R"
rule multiqc:
input:
expand("results/trimmed/{sample}_fastp.json", sample=SAMPLES),
expand("results/star/{sample}/{sample}.Log.final.out",
sample=SAMPLES),
"results/featurecounts/gene_counts.txt.summary"
output:
"results/multiqc/multiqc_report.html"
conda: "envs/qc.yaml"
shell:
"multiqc results/ -o results/multiqc/ --force"
The Nextflow approach uses Salmon for alignment-free transcript quantification and passes the results to DESeq2 via tximport.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.reads = "data/*_{R1,R2}.fastq.gz"
params.salmon_index = "ref/salmon_index"
params.gtf = "ref/genes.gtf"
params.outdir = "results"
params.metadata = "samplesheet.csv"
process FASTP {
tag "${sample_id}"
publishDir "${params.outdir}/trimmed", mode: 'copy'
cpus 4
input: tuple val(sample_id), path(reads)
output:
tuple val(sample_id), path("*_trimmed.fastq.gz"), emit: trimmed
path("*.json"), emit: json
script:
def (r1, r2) = reads
"""
fastp -i ${r1} -I ${r2} \
-o ${sample_id}_R1_trimmed.fastq.gz \
-O ${sample_id}_R2_trimmed.fastq.gz \
-j ${sample_id}_fastp.json \
--detect_adapter_for_pe --thread ${task.cpus}
"""
}
process SALMON_QUANT {
tag "${sample_id}"
publishDir "${params.outdir}/salmon", mode: 'copy'
cpus 8
input:
tuple val(sample_id), path(reads)
path(index)
output:
path("${sample_id}"), emit: quant_dir
script:
def (r1, r2) = reads
"""
salmon quant -i ${index} -l A \
-1 ${r1} -2 ${r2} \
-o ${sample_id} \
--threads ${task.cpus} \
--validateMappings \
--gcBias --seqBias
"""
}
process DESEQ2 {
publishDir "${params.outdir}/deseq2", mode: 'copy'
input:
path(quant_dirs)
path(metadata)
output:
path("differential_expression.csv"), emit: results
path("normalised_counts.csv")
path("*.pdf")
script:
"""
run_deseq2_tximport.R \
--quant_dirs . \
--metadata ${metadata} \
--output_prefix .
"""
}
process MULTIQC {
publishDir "${params.outdir}/multiqc", mode: 'copy'
input: path('*')
output: path("multiqc_report.html")
script:
"""
multiqc . -o . --force
"""
}
workflow {
reads_ch = Channel.fromFilePairs(params.reads, checkIfExists: true)
index_ch = Channel.value(file(params.salmon_index))
FASTP(reads_ch)
SALMON_QUANT(FASTP.out.trimmed, index_ch)
quant_dirs = SALMON_QUANT.out.quant_dir.collect()
DESEQ2(quant_dirs, file(params.metadata))
ch_multiqc = FASTP.out.json
.mix(SALMON_QUANT.out.quant_dir)
.collect()
MULTIQC(ch_multiqc)
}
See Also
Snakemake – Snakemake workflow manager overview
Nextflow – Nextflow DSL2 workflow manager overview
STAR – STAR aligner reference
Quantification – Quantification tools
Differential Expression – Differential expression tools