HTSeq
=====

Overview
--------

HTSeq is a Python framework for working with high-throughput sequencing data that includes htseq-count, a widely used tool for counting reads mapped to genomic features. Given a sorted BAM file and a GTF annotation, htseq-count assigns each read (or read pair) to a gene based on its overlap with annotated exons. It provides multiple overlap resolution modes to handle reads that span feature boundaries or overlap multiple genes. HTSeq produces a simple gene-by-count table that serves as direct input for differential expression tools such as DESeq2 and edgeR.

Installation
------------

.. code-block:: bash

   mamba install -c bioconda htseq

Basic Usage
-----------

Count reads per gene from a coordinate-sorted BAM file.

.. code-block:: bash

   htseq-count -f bam -r pos -s reverse \
       -t exon -i gene_id \
       sample.sorted.bam genes.gtf > counts.txt

For multiple samples, run htseq-count separately on each BAM file and merge the results into a count matrix.

Key Parameters
--------------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Flag / option
     - Description
   * - ``-f``
     - Input format: ``bam`` or ``sam``.
   * - ``-r``
     - Sort order of the input file: ``pos`` for coordinate-sorted or ``name`` for name-sorted.
   * - ``-s``
     - Strand-specificity: ``yes`` for forward stranded, ``reverse`` for reverse stranded, or ``no`` for unstranded.
   * - ``-t``
     - Feature type to use from the GTF (default ``exon``).
   * - ``-i``
     - GTF attribute to use as the feature ID (default ``gene_id``).
   * - ``-m`` / ``--mode``
     - Overlap resolution mode: ``union`` (default), ``intersection-strict``, or ``intersection-nonempty``.
   * - ``--nonunique``
     - How to handle reads mapping to multiple features: ``none`` (discard) or ``all`` (count for each feature).
   * - ``-a``
     - Minimum alignment quality threshold (default 10).
   * - ``--additional-attr``
     - Include additional GTF attributes in the output (e.g. ``gene_name``).

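
Merging Per-Sample Counts
-------------------------

The merge step mentioned under Basic Usage is not spelled out on this page, so the following is only a minimal sketch of one way to do it with the Python standard library. The helper names ``read_counts`` and ``merge_counts`` are hypothetical and are not part of HTSeq, and the tiny inline samples are made-up stand-ins for real ``counts.txt`` files. It assumes the default output layout, with the feature ID in the first column and the count in the last, and it treats every ID starting with ``__`` as a special counter rather than a gene.

```python
import csv
import io


def read_counts(handle):
    """Parse one htseq-count output stream into (gene_counts, special_counters)."""
    genes, special = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        feature_id, count = row[0], int(row[-1])
        # htseq-count prefixes its summary counters with a double underscore.
        target = special if feature_id.startswith("__") else genes
        target[feature_id] = count
    return genes, special


def merge_counts(per_sample):
    """Combine {sample_name: gene_counts} into a header and sorted matrix rows.

    Genes missing from a sample are filled with 0.
    """
    samples = list(per_sample)
    all_genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    header = ["gene_id"] + samples
    rows = [[g] + [per_sample[s].get(g, 0) for s in samples] for g in all_genes]
    return header, rows


if __name__ == "__main__":
    # Hypothetical miniature outputs standing in for per-sample counts.txt files.
    raw = {
        "sample_a": "geneA\t10\ngeneB\t0\n__no_feature\t5\n__ambiguous\t1\n",
        "sample_b": "geneA\t7\ngeneB\t3\n__no_feature\t2\n__ambiguous\t0\n",
    }
    parsed = {name: read_counts(io.StringIO(text))[0] for name, text in raw.items()}
    header, rows = merge_counts(parsed)
    for line in [header] + rows:
        print("\t".join(map(str, line)))
```

For real data, replace the ``io.StringIO`` objects with files opened via ``open(path, newline="")``. Keeping the special counters out of the matrix matters because differential expression tools expect one row per gene.
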
Expected Output
---------------

* Standard output (redirected to ``counts.txt``) -- a two-column tab-delimited file with the gene identifier in the first column and the raw read count in the second column. The last five lines contain special counters:

  - ``__no_feature`` -- reads not overlapping any feature.
  - ``__ambiguous`` -- reads overlapping multiple features.
  - ``__too_low_aQual`` -- reads below the alignment quality threshold.
  - ``__not_aligned`` -- unmapped reads.
  - ``__alignment_not_unique`` -- reads with multiple alignments.

See Also
--------

* :doc:`featurecounts` -- faster multi-threaded alternative for read counting with built-in multi-BAM support
* :doc:`salmon` -- alignment-free transcript-level quantification
* :doc:`kallisto` -- pseudoalignment-based transcript quantification
* :doc:`/tools/differential-expression/deseq2` -- differential expression analysis using htseq-count output
* :doc:`/tools/differential-expression/edger` -- alternative differential expression framework