STAR
Overview
STAR (Spliced Transcripts Alignment to a Reference) is a fast RNA-seq aligner that detects splice junctions during alignment, which makes it a widely used choice for mapping RNA-seq reads to a reference genome. STAR uses an uncompressed suffix array index for rapid seed finding and supports two-pass alignment, which improves detection of novel junctions. It can also output gene-level read counts directly, which is convenient for differential expression workflows.
Installation
mamba install -c bioconda star
Basic Usage
STAR requires a genome index to be generated before alignment. The index incorporates known splice junctions from a GTF annotation file.
# Generate genome index
STAR --runMode genomeGenerate \
--genomeDir star_index/ \
--genomeFastaFiles reference.fa \
--sjdbGTFfile genes.gtf \
--runThreadN 8
# Align RNA-seq reads
STAR --runMode alignReads \
--genomeDir star_index/ \
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFileNamePrefix sample_ \
--runThreadN 8
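When several samples share the same layout, the alignment command can be generated per sample rather than typed by hand. The sketch below is a dry run that only prints commands. The helper name star_align_cmd and the fastq/ directory are hypothetical, and it assumes mates are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz with no spaces in their paths.

```shell
# Hypothetical helper: print the STAR alignment command for one sample,
# given the path to its read 1 file. Nothing is executed here.
star_align_cmd() {
    r1=$1
    # Derive the read 2 path and the output prefix from the read 1 name.
    r2=${r1%_R1.fastq.gz}_R2.fastq.gz
    prefix=$(basename "$r1" _R1.fastq.gz)_
    echo "STAR --runMode alignReads --genomeDir star_index/" \
         "--readFilesIn $r1 $r2 --readFilesCommand zcat" \
         "--outSAMtype BAM SortedByCoordinate --quantMode GeneCounts" \
         "--outFileNamePrefix $prefix --runThreadN 8"
}

# Dry run over a hypothetical fastq/ directory, one command per sample:
# for r1 in fastq/*_R1.fastq.gz; do star_align_cmd "$r1"; done
```

Reviewing the printed commands before running them (for example by piping the loop output to sh) is a cheap way to catch naming mismatches between mates.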
Note
Genome index generation requires substantial memory. For the human genome,
allocate at least 32 GB of RAM. For smaller genomes, lower
--genomeSAindexNbases from its default of 14, as described in the STAR manual.
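For small genomes, the STAR manual gives the scaling rule min(14, log2(GenomeLength)/2 - 1) for --genomeSAindexNbases. The sketch below computes that value from a FASTA file. The helper names genome_length and sa_index_nbases are hypothetical, and rounding down is an assumption made here as the conservative choice.

```shell
# Hypothetical helper: total number of bases in a FASTA file
# (header lines starting with ">" are skipped).
genome_length() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# Hypothetical helper: min(14, log2(L)/2 - 1) for genome length L,
# rounded down to an integer.
sa_index_nbases() {
    awk -v glen="$1" 'BEGIN {
        v = int(log(glen) / log(2) / 2 - 1)
        if (v > 14) v = 14
        print v
    }'
}

# Example usage (assumes reference.fa exists):
# nbases=$(sa_index_nbases "$(genome_length reference.fa)")
# STAR --runMode genomeGenerate ... --genomeSAindexNbases "$nbases"
```

For a genome the size of human, the result is capped at the default of 14, so the option only matters for much smaller references.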
Key Parameters
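| Flag / option | Description |
|---|---|
| --runMode | Operation mode, e.g., genomeGenerate (build the index) or alignReads (map reads). |
| --genomeDir | Path to the genome index directory. |
| --genomeFastaFiles | Reference genome FASTA file(s) (for index generation). |
| --sjdbGTFfile | Gene annotation in GTF format (provides known splice junctions). |
| --readFilesIn | Input FASTQ file(s). For paired-end, supply read 1 and read 2 separated by a space. |
| --readFilesCommand | Command to decompress input files (e.g., zcat for gzipped FASTQ). |
| --outSAMtype | Output format. BAM SortedByCoordinate writes a coordinate-sorted BAM file. |
| --quantMode | GeneCounts writes gene-level read counts to ReadsPerGene.out.tab. |
| --outFileNamePrefix | Prefix for all output file names. |
| --runThreadN | Number of threads. |
| --twopassMode Basic | Enable STAR’s two-pass mode for improved novel splice junction detection. |
| --sjdbOverhang | Length of sequence flanking annotated junctions used when building the index, ideally read length minus 1 (default 100). Set to match your read length for optimal sensitivity. |
| --outSAMattributes | SAM attributes to include in the output. |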
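The "read length minus 1" guidance for --sjdbOverhang can be checked against the data instead of assumed. The sketch below reports the longest read among the first 10,000 records of a gzipped FASTQ, minus 1. The helper name sjdb_overhang is hypothetical, and it assumes standard four-line FASTQ records.

```shell
# Hypothetical helper: longest read in the first 10,000 FASTQ records, minus 1.
# Sequence lines are every 4th line, starting at line 2.
sjdb_overhang() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 { if (length($0) > max) max = length($0) }
        NR >= 40000 { exit }   # stop after 10,000 records; END still runs
        END         { print max - 1 }'
}

# Example usage (assumes sample_R1.fastq.gz exists):
# STAR --runMode genomeGenerate ... --sjdbOverhang "$(sjdb_overhang sample_R1.fastq.gz)"
```

Taking the maximum rather than the first read's length is a deliberate choice here, since trimmed libraries often contain reads of varying length.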
Expected Output
With the parameters above, STAR produces the following files (all prefixed
with sample_):
sample_Aligned.sortedByCoord.out.bam – coordinate-sorted BAM file of aligned reads.
sample_ReadsPerGene.out.tab – gene-level read counts (when --quantMode GeneCounts is set). Columns correspond to unstranded, sense-strand, and antisense-strand counts.
sample_Log.final.out – alignment summary statistics including total reads, uniquely mapped reads, multi-mapped reads, and splice junction counts.
sample_Log.out – detailed run log.
sample_SJ.out.tab – splice junctions detected during alignment.
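Downstream tools usually need a single count column from ReadsPerGene.out.tab. In that file, column 1 is the gene ID, columns 2 to 4 hold the unstranded, sense-strand, and antisense-strand counts, and the first four rows are summary counters (N_unmapped, N_multimapping, N_noFeature, N_ambiguous). The sketch below extracts one column and skips those rows. The helper name gene_counts is hypothetical, and which column is correct depends on the library's strandedness.

```shell
# Hypothetical helper: print "gene_id<TAB>count" for one count column.
# $1 = ReadsPerGene.out.tab file
# $2 = count column: 2 (unstranded, default), 3 (sense), or 4 (antisense)
gene_counts() {
    awk -F '\t' -v col="${2:-2}" 'NR > 4 { print $1 "\t" $col }' "$1"
}

# Example usage (assumes the alignment above has been run):
# gene_counts sample_ReadsPerGene.out.tab 4 > sample_counts.tsv
```

Comparing the N_noFeature values across columns 3 and 4 is a common heuristic for guessing strandedness: the column with the lower value is more likely to match the library.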
Index the BAM file for downstream use:
samtools index sample_Aligned.sortedByCoord.out.bam
See Also
FastQC – quality control before alignment
MultiQC – aggregate STAR log files across samples
Quantification – transcript-level and gene-level quantification tools
Differential Expression – differential expression analysis tools
FASTQ – reference for the FASTQ file format
SAM / BAM / CRAM – reference for the SAM/BAM/CRAM alignment format
GFF / GTF – reference for the GTF annotation format