SAM / BAM / CRAM

Overview

SAM (Sequence Alignment/Map), BAM, and CRAM are the standard formats for storing sequencing reads that have been aligned to a reference genome. They are produced by every major aligner (BWA-MEM2, STAR, Minimap2, HISAT2) and consumed by virtually every downstream tool – from variant callers to coverage calculators.

| Format | Extension | Description |
| --- | --- | --- |
| SAM | .sam | Human-readable plain text. Useful for inspection but too large for routine storage. |
| BAM | .bam | Binary, BGZF-compressed SAM. The de facto working format – fast to read and supports indexing. |
| CRAM | .cram | Reference-based compression of the same alignment data. Typically 40–60 % smaller than BAM because it encodes only the differences from the reference. |

Random access by genomic coordinate requires a coordinate-sorted BAM or CRAM file plus a companion index (.bai or .csi for BAM, .crai for CRAM).
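As a rough illustration of how the three formats differ on disk, the Python sketch below guesses a format from the first bytes of a file. It relies on two facts from the format specifications: a CRAM file begins with the ASCII bytes CRAM, and a BAM file is a BGZF (gzip-compatible) stream whose decompressed data begins with BAM followed by byte 0x01. The function name sniff_alignment_format is hypothetical, and treating any input that starts with @ as SAM is a simplifying assumption (a headerless SAM file would be reported as unknown).

```python
import zlib


def sniff_alignment_format(data: bytes) -> str:
    """Guess SAM/BAM/CRAM from leading bytes (hypothetical helper, not a library API)."""
    if data.startswith(b"CRAM"):
        return "CRAM"
    if data.startswith(b"\x1f\x8b"):  # gzip magic number, shared by BGZF
        try:
            # wbits=31 tells zlib to expect a gzip header; decompress at most 4 bytes.
            magic = zlib.decompressobj(wbits=31).decompress(data, 4)
        except zlib.error:
            return "unknown"
        return "BAM" if magic == b"BAM\x01" else "gzip (not BAM)"
    if data.startswith(b"@"):
        return "SAM"  # assumption: the file carries a header
    return "unknown"
```

In practice one would pass the first few kilobytes of a file, for example `sniff_alignment_format(open(path, "rb").read(4096))`, where `path` is whatever file is being inspected.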

Structure

A SAM file has two sections: a header (lines starting with @) and alignment records (one tab-separated line per alignment; a read has more than one line when it has secondary or supplementary alignments).

@HD VN:1.6  SO:coordinate
@SQ SN:chr1 LN:248956422
@RG ID:sample1 SM:sample1 PL:ILLUMINA
read001  99  chr1  10050  60  151M  =  10250  351  ACGTACGT...  JJJJJJJJ...
read001  147 chr1  10250  60  151M  =  10050  -351 TGCATGCA...  JJJJJJJJ...

Header section

| Tag | Purpose |
| --- | --- |
| @HD | File-level metadata: format version (VN), sort order (SO). |
| @SQ | Reference sequence dictionary: name (SN) and length (LN) of each chromosome. |
| @RG | Read group: sample name (SM), platform (PL), library (LB). Critical for multi-sample calling. |
| @PG | Program record: which tool produced this file and with what arguments. |
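To make the header layout concrete, here is a minimal Python sketch that groups header lines by record type and splits each TAG:VALUE field. It assumes tab-separated header lines, as the SAM specification requires; parse_sam_header is a hypothetical helper rather than part of any library.

```python
from collections import defaultdict


def parse_sam_header(lines):
    """Group SAM header lines by record type (@HD, @SQ, @RG, @PG, @CO)."""
    header = defaultdict(list)
    for line in lines:
        if not line.startswith("@"):
            break  # the header ends at the first alignment record
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type == "@CO":
            # Comment lines hold free text, not TAG:VALUE pairs.
            header[record_type].append("\t".join(fields))
            continue
        # Split on the first colon only: values may themselves contain colons.
        header[record_type].append(dict(f.split(":", 1) for f in fields))
    return dict(header)


example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA",
]
header = parse_sam_header(example_header)
print(header["@HD"][0]["SO"])  # coordinate
print(header["@RG"][0]["SM"])  # sample1
```

Each record type maps to a list because @SQ, @RG, and @PG can appear many times in one file.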

Alignment records

Each alignment line has 11 mandatory fields:

| Col | Field | Description |
| --- | --- | --- |
| 1 | QNAME | Read name. |
| 2 | FLAG | Bitwise flag encoding read properties (see below). |
| 3 | RNAME | Reference sequence name (e.g. chr1). |
| 4 | POS | 1-based leftmost mapping position. |
| 5 | MAPQ | Mapping quality (Phred-scaled probability the position is wrong). |
| 6 | CIGAR | Alignment description string (see below). |
| 7 | RNEXT | Reference for the mate (= if same chromosome). |
| 8 | PNEXT | Mate’s mapping position. |
| 9 | TLEN | Template/insert length. |
| 10 | SEQ | Read sequence. |
| 11 | QUAL | Base quality string (same Phred+33 encoding as FASTQ). |
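The eleven columns map naturally onto a small record type. The sketch below splits one alignment line into typed fields and decodes the Phred+33 quality string. The names SamRecord, parse_sam_line, and phred33_to_scores are hypothetical, and the example line is a made-up 8 bp read, not real data.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple   # optional fields that follow column 11


def parse_sam_line(line: str) -> SamRecord:
    """Split a tab-separated alignment line into its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(cols)}")
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], tuple(cols[11:]),
    )


def phred33_to_scores(qual: str) -> list:
    """Convert a Phred+33 quality string to integer scores ('*' means not stored)."""
    if qual == "*":
        return []
    return [ord(ch) - 33 for ch in qual]


# Made-up record for illustration only.
record = parse_sam_line("read001\t99\tchr1\t10050\t60\t8M\t=\t10250\t208\tACGTACGT\tJJJJJJJJ")
print(record.rname, record.pos, record.mapq)  # chr1 10050 60
print(phred33_to_scores(record.qual)[:3])     # [41, 41, 41]
```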

FLAG field

The FLAG is the sum (bitwise OR) of the bit values that apply to the alignment. Common flags:

| Bit | Decimal | Meaning |
| --- | --- | --- |
| 0x1 | 1 | Read is paired |
| 0x2 | 2 | Both reads mapped in a proper pair |
| 0x4 | 4 | Read is unmapped |
| 0x8 | 8 | Mate is unmapped |
| 0x10 | 16 | Read mapped to reverse strand |
| 0x20 | 32 | Mate mapped to reverse strand |
| 0x40 | 64 | First in pair (R1) |
| 0x80 | 128 | Second in pair (R2) |
| 0x100 | 256 | Secondary alignment |
| 0x400 | 1024 | PCR or optical duplicate |
| 0x800 | 2048 | Supplementary alignment |

In the example above, 99 = 1 + 2 + 32 + 64 means: paired, proper pair, mate on reverse strand, first in pair. 147 = 1 + 2 + 16 + 128 means: paired, proper pair, this read on reverse strand, second in pair.

Tip

Use the Picard explain flags web tool or samtools flags 99 to decode any FLAG value.
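The decoding of 99 and 147 in the example above can be reproduced with a few lines of Python. This is a minimal sketch: decode_flag and passes_filter are hypothetical helpers, the short labels are my own wording, and bit 0x200 (read fails quality checks) comes from the SAM specification rather than from the table of common flags.

```python
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary",
    0x200: "QC fail",  # in the SAM spec, not in the table of common flags
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the labels of every bit that is set in FLAG."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def passes_filter(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Mimic the logic of `samtools view -f require -F exclude`."""
    return (flag & require) == require and (flag & exclude) == 0


print(decode_flag(99))   # ['paired', 'proper pair', 'mate reverse strand', 'first in pair']
print(decode_flag(147))  # ['paired', 'proper pair', 'reverse strand', 'second in pair']
```

For instance, `passes_filter(flag, require=0x2, exclude=0x400 | 0x100)` keeps properly paired primary alignments that are not marked as duplicates.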

CIGAR strings

The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string describes how the read aligns to the reference:

| Op | Name | Meaning |
| --- | --- | --- |
| M | Match / mismatch | Aligned (consumed from both read and reference) |
| I | Insertion | Bases in read not in reference |
| D | Deletion | Bases in reference not in read |
| N | Skipped region | Intron in RNA-seq spliced alignment |
| S | Soft clip | Bases present in read but not aligned |
| H | Hard clip | Bases removed from read entirely |

Examples:

  • 151M – simple end-to-end alignment of 151 bp

  • 50M2I99M – 50 bp match, 2 bp insertion, 99 bp match

  • 75M5000N76M – spliced RNA-seq read spanning a 5 kb intron

  • 5S146M – 5 soft-clipped bases at the start
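The examples above can be checked mechanically. The sketch below parses a CIGAR string and adds up how many bases it consumes from the read and from the reference. The helper names are hypothetical; the operators =, X, and P are included because the SAM specification defines them, even though they are not in the table above.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operators that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operators that advance along the reference


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operator) tuples."""
    if cigar == "*":
        return []  # CIGAR unavailable
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR (should equal len(SEQ))."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar: str) -> int:
    """Number of reference bases covered, from POS to the alignment end."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


for cigar in ("151M", "50M2I99M", "75M5000N76M", "5S146M"):
    print(cigar, query_length(cigar), reference_span(cigar))
```

The reference span also explains the TLEN in the Structure example: the mate at position 10250 with CIGAR 151M ends at 10250 + 151 - 1 = 10400, and 10400 - 10050 + 1 = 351.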

MAPQ scores

Mapping quality (MAPQ) is a Phred-scaled estimate of the probability that the alignment position is incorrect:

| MAPQ | Error rate | Interpretation |
| --- | --- | --- |
| 0 | 1 in 1 | Multi-mapped; position unreliable |
| 20 | 1 in 100 | Moderately confident |
| 30 | 1 in 1 000 | High confidence |
| 60 | 1 in 1 000 000 | Unique alignment (BWA-MEM2 max) |
| 255 | n/a | Not available (used by some tools) |

Many variant-calling workflows discard reads below a MAPQ threshold; 20 is a common minimum.
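The error rates in the table follow from the Phred relation MAPQ = -10 · log10(P), that is, P = 10^(-MAPQ/10). A minimal Python sketch of the conversion in both directions is shown below; the function names are hypothetical, and treating 255 as "no value" follows the table above.

```python
import math


def mapq_to_error_probability(mapq: int):
    """Probability that the mapping position is wrong, or None if MAPQ is 255."""
    if mapq == 255:
        return None  # mapping quality not available
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p: float) -> int:
    """Phred-scale a probability (0 < p <= 1) and round to the nearest integer."""
    return round(-10 * math.log10(p))


for mapq in (0, 20, 30, 60):
    print(mapq, mapq_to_error_probability(mapq))
```

Note that aligners differ in how they compute and cap MAPQ, so the same value from two tools does not necessarily reflect the same confidence.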

Working With

Converting and sorting

# Convert SAM to sorted BAM
samtools sort -@ 8 -o aligned.sorted.bam aligned.sam

# Index the BAM
samtools index aligned.sorted.bam

Viewing alignments

# View header
samtools view -H aligned.sorted.bam

# View alignments in a specific region
samtools view aligned.sorted.bam chr1:10000-20000

# View only properly paired reads with MAPQ >= 30
samtools view -f 2 -q 30 aligned.sorted.bam

Flagstat and idxstats

# Summary alignment statistics
samtools flagstat aligned.sorted.bam

# Per-chromosome read counts
samtools idxstats aligned.sorted.bam
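The idxstats output is tab-separated with four columns: reference name, reference length, mapped read count, and unmapped read count, with a final * row for reads that have no coordinate. The Python sketch below turns that text into per-reference fractions of mapped reads. The helper names are hypothetical, and the counts in the sample text are invented for illustration.

```python
def parse_idxstats(text: str) -> list:
    """Parse `samtools idxstats` output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def mapped_fraction_by_reference(rows: list) -> dict:
    """Share of all mapped reads that landed on each reference sequence."""
    total = sum(mapped for _, _, mapped, _ in rows)
    if total == 0:
        return {}
    return {name: mapped / total for name, _, mapped, _ in rows if name != "*"}


# Invented counts, for illustration only.
sample_output = "chr1\t248956422\t900\t0\nchrM\t16569\t100\t0\n*\t0\t0\t25\n"
fractions = mapped_fraction_by_reference(parse_idxstats(sample_output))
print(fractions)  # {'chr1': 0.9, 'chrM': 0.1}
```

An unusually high mitochondrial fraction is one of the things this kind of summary makes easy to spot.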

Marking duplicates

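# Mark PCR duplicates with samtools
# (markdup needs the mate tags that fixmate -m adds to name-sorted input)
samtools sort -n -o namesorted.bam aligned.sam
samtools fixmate -m namesorted.bam fixmate.bam
samtools sort -o fixmate.sorted.bam fixmate.bam
samtools markdup fixmate.sorted.bam dedup.bam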

# Or with Picard
picard MarkDuplicates \
  I=aligned.sorted.bam \
  O=dedup.bam \
  M=dup_metrics.txt

Converting BAM to CRAM

# Compress BAM into CRAM (requires reference FASTA)
samtools view -C -T reference.fa -o aligned.cram aligned.sorted.bam
samtools index aligned.cram

Computing depth and coverage

# Per-base depth
samtools depth -a aligned.sorted.bam > depth.txt

# Coverage summary
samtools coverage aligned.sorted.bam
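For a single input file, samtools depth writes three tab-separated columns: reference name, 1-based position, and depth. The sketch below summarizes such a file as mean depth and the fraction of positions at or above a threshold. summarize_depth and its min_depth default of 10 are assumptions chosen for illustration, and the sample lines are invented.

```python
def summarize_depth(lines, min_depth: int = 10):
    """Summarize `samtools depth` output given as an iterable of text lines."""
    positions = total = covered = 0
    for line in lines:
        if not line.strip():
            continue
        _chrom, _pos, depth = line.rstrip("\n").split("\t")[:3]
        depth = int(depth)
        positions += 1
        total += depth
        covered += depth >= min_depth  # True counts as 1
    if positions == 0:
        return None
    return {
        "positions": positions,
        "mean_depth": total / positions,
        "fraction_at_min_depth": covered / positions,
    }


# Invented depths, for illustration only.
sample_lines = ["chr1\t1\t0", "chr1\t2\t10", "chr1\t3\t20", "chr1\t4\t30"]
summary = summarize_depth(sample_lines)
print(summary["mean_depth"], summary["fraction_at_min_depth"])  # 15.0 0.75
```

Because the function only iterates over lines, it can be pointed at the file written above with `summarize_depth(open("depth.txt"))` without loading it into memory. The mean is only a whole-genome figure when depth was run with -a, since positions with zero coverage are otherwise omitted.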

See Also

  • SAMtools – the primary toolkit for SAM/BAM/CRAM manipulation

  • Sambamba – fast BAM processing with multithreading

  • Picard – duplicate marking and BAM validation

  • deepTools – coverage tracks and QC plots from BAM files

  • BWA-MEM2 – short-read aligner producing SAM output

  • FASTQ – the raw read format that is aligned to produce SAM/BAM

  • VCF / BCF – variant calls derived from BAM alignments