Sambamba

Overview

Sambamba is a high-performance, multi-threaded tool for processing BAM files (older releases also handled CRAM). It reimplements a subset of SAMtools functionality with stronger multi-threading support, so it is typically used alongside SAMtools rather than as a full replacement. Its most common uses are duplicate marking, sorting, and filtering. Because it distributes work across CPU cores, Sambamba can reduce wall-clock time for these routine steps, which makes it particularly attractive for large whole-genome sequencing datasets.

Installation

mamba install -c conda-forge -c bioconda sambamba

Basic Usage

Mark PCR / optical duplicates

sambamba markdup -t 8 sample.sorted.bam sample.dedup.bam

Sort a BAM file by coordinate

sambamba sort -t 8 -o sample.sorted.bam sample.bam
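
Duplicate marking is normally run on coordinate-sorted input, so the two commands above are often chained. The following is a minimal sketch of that chain, not a definitive pipeline. The names THREADS, DRY_RUN, run, and sort_markdup_index are hypothetical helpers introduced here, and the final sambamba index step is an addition that the examples above do not include.

```shell
#!/usr/bin/env bash
# Sketch only: THREADS, DRY_RUN, run and sort_markdup_index are illustrative
# names introduced here; they are not part of Sambamba.

THREADS="${THREADS:-8}"

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# sort_markdup_index INPUT.bam PREFIX
# Writes PREFIX.sorted.bam and PREFIX.dedup.bam, then indexes the latter.
# Each step runs only if the previous one succeeded.
sort_markdup_index() {
  local input="$1" prefix="$2"
  run sambamba sort -t "$THREADS" -o "${prefix}.sorted.bam" "$input" &&
  run sambamba markdup -t "$THREADS" "${prefix}.sorted.bam" "${prefix}.dedup.bam" &&
  run sambamba index -t "$THREADS" "${prefix}.dedup.bam"
}
```

Calling sort_markdup_index sample.bam sample with DRY_RUN=1 prints the three commands without running them, which is a convenient way to check file names before committing to a long job.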

Filter reads using an expression

sambamba view -t 8 -f bam -F "mapping_quality >= 30 and not duplicate" \
  sample.bam > filtered.bam
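
Filter expressions combine conditions with and, or, not, and parentheses, so longer filters can be assembled from short clauses. The sketch below shows one way to do that in the shell. join_and and FILTER are hypothetical names introduced for illustration, and proper_pair is a flag from Sambamba's filter syntax that does not appear in the example above.

```shell
# Sketch only: join_and and FILTER are illustrative names, not part of Sambamba.

# join_and CLAUSE...
# Wraps each clause in parentheses and joins the clauses with "and".
join_and() {
  local expr="" clause
  for clause in "$@"; do
    if [ -z "$expr" ]; then
      expr="($clause)"
    else
      expr="$expr and ($clause)"
    fi
  done
  printf '%s\n' "$expr"
}

FILTER="$(join_and 'mapping_quality >= 30' 'not duplicate' 'proper_pair')"
printf '%s\n' "$FILTER"
# prints: (mapping_quality >= 30) and (not duplicate) and (proper_pair)
```

The resulting string can be passed as a single quoted argument, for example sambamba view -t 8 -f bam -F "$FILTER" -o filtered.bam sample.bam. Wrapping every clause in parentheses keeps operator precedence explicit when a clause itself contains or.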

Key Parameters

Flag / option             Description
-t THREADS                Number of threads to use for processing.
-o FILE                   Write output to FILE (sort and view; markdup takes its output path as the second positional argument).
-f FORMAT                 Output format for view: sam (the default), bam, or json.
-F FILTER                 A filter expression for selecting reads (e.g., "mapping_quality >= 30 and not duplicate").
--tmpdir DIR              Directory for temporary files during sorting and duplicate marking (defaults to the system temporary directory).
--overflow-list-size INT  Size of the overflow list that holds reads evicted from the hash table during duplicate marking; increasing it for very large files reduces the number of temporary files created.
-l INT                    Compression level for BAM output (0–9).
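
The options above are usually combined on one command line. The following sketch wraps sort and markdup with explicit temporary-directory, compression, and overflow settings. THREADS, DRY_RUN, run, sort_tuned, and markdup_tuned are hypothetical names, and the values in the usage comment (/scratch/tmp, 6, 600000) are placeholders rather than recommendations.

```shell
# Sketch only: THREADS, DRY_RUN, run, sort_tuned and markdup_tuned are
# illustrative names introduced here; they are not part of Sambamba.

THREADS="${THREADS:-8}"

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# sort_tuned INPUT OUTPUT TMPDIR LEVEL
# Coordinate-sorts INPUT, rejecting compression levels outside 0-9.
sort_tuned() {
  local input="$1" output="$2" tmpdir="$3" level="$4"
  case "$level" in
    [0-9]) ;;
    *) echo "compression level must be 0-9, got: $level" >&2; return 1 ;;
  esac
  run sambamba sort -t "$THREADS" -l "$level" --tmpdir "$tmpdir" -o "$output" "$input"
}

# markdup_tuned INPUT OUTPUT TMPDIR OVERFLOW_LIST_SIZE
markdup_tuned() {
  local input="$1" output="$2" tmpdir="$3" overflow="$4"
  run sambamba markdup -t "$THREADS" --tmpdir "$tmpdir" \
    --overflow-list-size "$overflow" "$input" "$output"
}

# Example (placeholder values):
#   sort_tuned sample.bam sample.sorted.bam /scratch/tmp 6
#   markdup_tuned sample.sorted.bam sample.dedup.bam /scratch/tmp 600000
```

Pointing --tmpdir at fast local scratch space matters most for large inputs, since both sorting and duplicate marking can spill intermediate data to disk.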

Expected Output

  • sambamba markdup – a BAM file with the duplicate FLAG bit (0x400) set on identified PCR or optical duplicates. Duplicates are retained by default; pass -r to remove them instead, or filter them out later with sambamba view.

  • sambamba sort – a coordinate-sorted (or name-sorted with -n) BAM file.

  • sambamba view – filtered reads in SAM (the default), BAM, or JSON format depending on the -f option, written to standard output unless -o is given.
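
One way to check the markdup output described above is to count the records that carry the duplicate flag and compare that with the total. This is a rough sketch under stated assumptions: count_records and dup_percent are hypothetical helper names, and the counts are alignment records (including any secondary or supplementary alignments), not distinct reads. It relies on the -c option of sambamba view, which prints only the number of matching records.

```shell
# Sketch only: count_records and dup_percent are illustrative helper names.

# count_records BAM [FILTER]
# Prints the number of records in BAM matching FILTER (all records if omitted).
count_records() {
  local bam="$1" filter="${2:-}"
  if [ -n "$filter" ]; then
    sambamba view -c -F "$filter" "$bam"
  else
    sambamba view -c "$bam"
  fi
}

# dup_percent DUPLICATES TOTAL
# Prints the duplicate share as a percentage with two decimals, or NA if TOTAL is 0.
dup_percent() {
  awk -v d="$1" -v t="$2" \
    'BEGIN { if (t == 0) { print "NA" } else { printf "%.2f\n", 100 * d / t } }'
}

# Example:
#   dups="$(count_records sample.dedup.bam 'duplicate')"
#   total="$(count_records sample.dedup.bam)"
#   dup_percent "$dups" "$total"
```

For detailed per-library duplicate metrics, the Picard entry under See Also remains the better fit; this check only confirms that the flag was set and gives a rough rate.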

See Also

  • SAMtools – the reference toolkit for SAM/BAM manipulation

  • Picard – alternative duplicate marking with detailed duplicate metrics output

  • SAM / BAM / CRAM – reference for the SAM/BAM/CRAM file formats