STAR
====

Overview
--------

STAR (Spliced Transcripts Alignment to a Reference) is a fast RNA-seq aligner
that discovers splice junctions during alignment, which makes it a widely used
choice for mapping RNA-seq reads to a reference genome. STAR uses an
uncompressed suffix array index for rapid seed finding and supports two-pass
alignment for improved novel junction detection. It can also output gene-level
read counts directly, which is convenient for differential expression
workflows.

Installation
------------

.. code-block:: bash

   mamba install -c bioconda star

Basic Usage
-----------

STAR requires a genome index to be generated before alignment. When a GTF
annotation file is supplied, the index incorporates its known splice junctions.

.. code-block:: bash

   # Generate genome index
   STAR --runMode genomeGenerate \
        --genomeDir star_index/ \
        --genomeFastaFiles reference.fa \
        --sjdbGTFfile genes.gtf \
        --runThreadN 8

   # Align RNA-seq reads
   STAR --runMode alignReads \
        --genomeDir star_index/ \
        --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --quantMode GeneCounts \
        --outFileNamePrefix sample_ \
        --runThreadN 8

.. note::

   Genome index generation requires substantial memory. For the human genome,
   allocate at least 32 GB of RAM. For small genomes, scale
   ``--genomeSAindexNbases`` down from its default of 14; the STAR manual
   recommends ``min(14, log2(GenomeLength)/2 - 1)``. Smaller values also
   reduce memory use.

Key Parameters
--------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Flag / option
     - Description
   * - ``--runMode``
     - Operation mode: ``genomeGenerate`` for indexing, ``alignReads`` for
       alignment (default).
   * - ``--genomeDir``
     - Path to the genome index directory.
   * - ``--genomeFastaFiles``
     - Reference genome FASTA file(s) (for index generation).
   * - ``--sjdbGTFfile``
     - Gene annotation in GTF format (provides known splice junctions).
   * - ``--readFilesIn``
     - Input FASTQ file(s). For paired-end data, supply read 1 and read 2
       separated by a space.
   * - ``--readFilesCommand``
     - Command used to decompress input files (e.g., ``zcat`` for ``.gz``).
   * - ``--outSAMtype``
     - Output format. ``BAM SortedByCoordinate`` produces a sorted BAM
       directly.
   * - ``--quantMode``
     - ``GeneCounts`` outputs a gene-level count table alongside the BAM.
   * - ``--outFileNamePrefix``
     - Prefix for all output file names.
   * - ``--runThreadN``
     - Number of threads.
   * - ``--twopassMode Basic``
     - Enable STAR's two-pass mode for improved novel splice junction
       detection.
   * - ``--sjdbOverhang``
     - Length of the sequence flanking annotated junctions, used when the
       index is built. The ideal value is read length minus 1 (default 100).
   * - ``--outSAMattributes``
     - SAM attributes to include (e.g., ``NH HI AS NM MD``).

Expected Output
---------------

With the parameters above, STAR produces the following files (all prefixed
with ``sample_``):

* ``sample_Aligned.sortedByCoord.out.bam`` -- coordinate-sorted BAM file of
  aligned reads.
* ``sample_ReadsPerGene.out.tab`` -- gene-level read counts (when
  ``--quantMode GeneCounts`` is set). Column 1 is the gene ID; columns 2 to 4
  hold unstranded, sense-strand, and antisense-strand counts. The first four
  rows (``N_unmapped``, ``N_multimapping``, ``N_noFeature``, ``N_ambiguous``)
  are summary counters rather than genes.
* ``sample_Log.final.out`` -- alignment summary statistics including total
  reads, uniquely mapped reads, multi-mapped reads, and splice junction
  counts.
* ``sample_Log.out`` -- detailed run log.
* ``sample_SJ.out.tab`` -- splice junctions detected during alignment.

Index the BAM file for downstream use:

.. code-block:: bash

   samtools index sample_Aligned.sortedByCoord.out.bam

See Also
--------

* :doc:`/tools/quality-control/fastqc` -- quality control before alignment
* :doc:`/tools/quality-control/multiqc` -- aggregate STAR log files across
  samples
* :doc:`/tools/quantification/index` -- transcript-level and gene-level
  quantification tools
* :doc:`/tools/differential-expression/index` -- differential expression
  analysis tools
* :doc:`/data-formats/fastq` -- reference for the FASTQ file format
* :doc:`/data-formats/sam-bam-cram` -- reference for the SAM/BAM/CRAM
  alignment format
* :doc:`/data-formats/gff-gtf` -- reference for the GTF annotation format
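
Example: Summing Gene Count Columns
-----------------------------------

As a rough illustration of how the three count columns in
``sample_ReadsPerGene.out.tab`` can be compared, the sketch below totals each
column across genes. The helper name ``sum_gene_counts`` is hypothetical and
not part of STAR. The sketch assumes the tab-separated layout written by
``--quantMode GeneCounts``, where the first four rows are summary counters and
are therefore skipped.

.. code-block:: bash

   # Sum per-gene counts in each column of a STAR ReadsPerGene.out.tab file.
   # sum_gene_counts is a hypothetical helper name, not a STAR command.
   sum_gene_counts() {
       awk -F '\t' '
           # Rows 1-4 are N_unmapped, N_multimapping, N_noFeature and
           # N_ambiguous; only rows after them describe genes.
           NR > 4 { unstranded += $2; forward += $3; reverse += $4 }
           END {
               printf "unstranded\t%d\n", unstranded
               printf "forward\t%d\n", forward
               printf "reverse\t%d\n", reverse
           }
       ' "$1"
   }

   # Usage: sum_gene_counts sample_ReadsPerGene.out.tab

If one of the stranded totals is close to the unstranded total while the other
is near zero, the library is probably stranded in that orientation, and that
column is the one to pass to downstream tools. This is a heuristic check, not
something STAR reports itself.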