FASTQ
Overview
FASTQ is the de facto standard text format for storing raw sequencing reads together with their per-base quality scores. Illumina, PacBio, and Oxford Nanopore runs all end in FASTQ files, either directly or after conversion from an intermediate format. It is the starting point for virtually every NGS analysis pipeline – from quality control and trimming through alignment, assembly, and quantification.
FASTQ files are plain text and are typically compressed with gzip (*.fastq.gz or *.fq.gz). A single human whole-genome sequencing experiment at 30x coverage produces roughly 60–90 GB of compressed FASTQ data.
Structure
Each read occupies exactly four lines:
@NB501234:45:H3YTNBGX3:1:11101:24563:1038 1:N:0:ACGTACGT
NTGCAAGCAGTTCAGGATCAGTCGAGACTTCAATGTCGATCTACGTAGC
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
- Line 1 – Header (starts with @). Contains the read identifier. In the Illumina convention the fields are colon-separated:

  | Field | Meaning |
  |---|---|
  | NB501234 | Instrument name |
  | 45 | Run ID |
  | H3YTNBGX3 | Flowcell ID |
  | 1 | Lane |
  | 11101 | Tile number |
  | 24563 | X coordinate on tile |
  | 1038 | Y coordinate on tile |

  After the space, 1:N:0:ACGTACGT encodes the read number (1 = R1, 2 = R2), the filter flag (N = not filtered, i.e. the read passed the quality filter; Y = filtered out), the control bits, and the index/barcode sequence.
- Line 2 – Sequence. The nucleotide sequence of the read. Characters are A, C, G, T, and N (undetermined base).
- Line 3 – Separator (+). A literal + character. It may optionally be followed by a repeat of the header, but this is rarely done in modern files.
- Line 4 – Quality string. One ASCII character per base, encoding the Phred quality score, so this line has the same length as line 2.
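
The four-line layout and the Illumina header fields described above can be checked mechanically. The following is a minimal sketch in shell and awk, assuming unwrapped four-line records and the Illumina header convention shown here; the function names fastq_check and fastq_header_fields are made up for this example and do not belong to any existing tool.

```shell
# fastq_check: read FASTQ text on stdin and stop at the first malformed record.
# Assumes every record is exactly four lines (no wrapped sequences).
fastq_check() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { msg = "header does not start with @"; exit }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" { msg = "separator does not start with +"; exit }
    NR % 4 == 0 && length($0) != seqlen    { msg = "quality and sequence lengths differ"; exit }
    END {
      if (msg == "" && NR % 4 != 0) msg = "file ends in the middle of a record"
      if (msg != "") { print "line " NR ": " msg; exit 1 }
      print "OK: " (NR / 4) " record(s)"
    }
  '
}

# fastq_header_fields: split one Illumina-style header line (stdin) into named fields.
fastq_header_fields() {
  awk '{
    sub(/^@/, "")          # drop the leading @
    split($1, id, ":")     # part before the space
    split($2, tag, ":")    # part after the space
    print "instrument=" id[1]
    print "run=" id[2]
    print "flowcell=" id[3]
    print "lane=" id[4]
    print "tile=" id[5]
    print "x=" id[6]
    print "y=" id[7]
    print "read=" tag[1]
    print "filtered=" tag[2]
    print "control=" tag[3]
    print "index=" tag[4]
  }'
}

echo '@NB501234:45:H3YTNBGX3:1:11101:24563:1038 1:N:0:ACGTACGT' | fastq_header_fields
```

For a real file, the check would be run as `zcat sample_R1.fastq.gz | fastq_check`; it exits with a non-zero status on the first problem it finds.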
Phred quality encoding
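Modern instruments use the Sanger / Phred+33 encoding. The quality score Q of a base is derived from P_error, the estimated probability that the base call is wrong:
Q = -10 * log10(P_error)
The score is stored as the character with ASCII code Q + 33; Q = 30, for example, is stored as ASCII 63, the character ?. Common values: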
| ASCII character | Q | Error rate | Interpretation |
|---|---|---|---|
| ! | 0 | 1 in 1 | Worst quality |
| # | 2 | 1 in 1.6 | Very poor |
| + | 10 | 1 in 10 | Low quality |
| 5 | 20 | 1 in 100 | Acceptable |
| ? | 30 | 1 in 1 000 | Good |
| I | 40 | 1 in 10 000 | Excellent |
| J | 41 | 1 in 12 589 | Typical maximum in Illumina 1.8+ raw reads |
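
The mapping between quality characters, scores, and error probabilities can be reproduced with a few lines of awk. This is a sketch that assumes Phred+33 encoding; phred_decode is a hypothetical helper name chosen for this example.

```shell
# phred_decode STRING: print each character of a Phred+33 quality string
# together with its score Q and the implied error probability 10^(-Q/10).
phred_decode() {
  printf '%s\n' "$1" | awk '
    # Build a character-to-ASCII-code lookup for the printable range.
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    {
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        q = ord[c] - 33                       # Phred+33 offset
        printf "%s\tQ%d\t%.6f\n", c, q, exp(-q / 10 * log(10))
      }
    }
  '
}

phred_decode '#AFJ'
```

The call at the end prints one tab-separated line per character; the four characters decode to Q2, Q32, Q37, and Q41.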
Paired-end conventions
Paired-end libraries produce two FASTQ files (or a single interleaved file in which each R1 record is immediately followed by its R2 mate). A common naming convention is:
sample_R1.fastq.gz # forward reads (read 1)
sample_R2.fastq.gz # reverse reads (read 2)
Reads at the same position in both files form a pair. Tools such as fastp and BWA-MEM2 take both files in a single invocation (fastp through its -i/-I options, BWA-MEM2 as positional arguments) and keep the pairs together:
fastp \
-i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.trimmed.fq.gz -O sample_R2.trimmed.fq.gz
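
Because pairing is purely positional, a file pair that has drifted out of sync (for example after filtering only one of the two files) may go unnoticed. A minimal sketch of a sanity check follows. It assumes gzip-compressed four-line FASTQ and read IDs that agree up to the first space, with an optional /1 or /2 suffix; check_pairs is a hypothetical helper name.

```shell
# check_pairs R1.gz R2.gz: compare read IDs record by record.
# The function body runs in a subshell so its variables stay local.
check_pairs() (
  ids1=$(mktemp)
  ids2=$(mktemp)
  # Keep only the header lines, cut at the first space.
  gzip -dc "$1" | awk 'NR % 4 == 1 { print $1 }' > "$ids1"
  gzip -dc "$2" | awk 'NR % 4 == 1 { print $1 }' > "$ids2"
  paste "$ids1" "$ids2" | awk -F '\t' '
    { sub(/\/[12]$/, "", $1); sub(/\/[12]$/, "", $2) }   # strip old-style /1 and /2
    $1 != $2 { print "mismatch at record " NR ": " $1 " vs " $2; bad = 1; exit }
    END { if (bad) exit 1; print NR " pairs in sync" }
  '
  status=$?
  rm -f "$ids1" "$ids2"
  exit "$status"
)

# Usage: check_pairs sample_R1.fastq.gz sample_R2.fastq.gz
```

A file that is shorter than its mate shows up as a mismatch at the first unpaired record, because paste leaves the missing ID empty.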
Working with FASTQ files
Counting reads
Because every read occupies exactly four lines, the total number of reads in a FASTQ file is the line count divided by four:
# Count reads in a gzipped FASTQ
echo $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))
Viewing the first reads
# Show the first two reads
zcat sample_R1.fastq.gz | head -8
Quality inspection
# Run FastQC on paired-end files (FastQC does not create the output directory itself)
mkdir -p qc_reports
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc_reports/
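
For a quick look alongside a full FastQC report, basic figures such as the read count, mean read length, and the share of bases at or above Q30 can be computed directly from the quality lines. This is a sketch only, assuming Phred+33 and four-line records; fastq_stats is a hypothetical helper name, and the per-character loop is slow on large files.

```shell
# fastq_stats: summarise FASTQ text read from stdin.
fastq_stats() {
  awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 0 {                       # every fourth line is a quality string
      reads++
      n = length($0)
      bases += n
      for (i = 1; i <= n; i++)
        if (ord[substr($0, i, 1)] - 33 >= 30) q30++
    }
    END {
      if (reads == 0 || bases == 0) { print "no bases found"; exit 1 }
      printf "reads=%d bases=%d mean_len=%.1f pct_q30=%.1f\n",
             reads, bases, bases / reads, 100 * q30 / bases
    }
  '
}

# Usage: zcat sample_R1.fastq.gz | fastq_stats
```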
Quality trimming and adapter removal
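# Remove adapters and filter low-quality reads with fastp:
#   --detect_adapter_for_pe       auto-detect adapter sequences for paired-end data
#   --qualified_quality_phred 20  treat a base as qualified only at Q >= 20
#   --length_required 50          discard reads shorter than 50 bp after processing
# Sliding-window quality trimming is a separate option (for example --cut_right).
fastp \
-i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.trimmed.fq.gz -O sample_R2.trimmed.fq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 50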
Subsetting reads
# Randomly subsample 1 million reads with seqtk (random seed 42).
# Reuse the same seed for the R2 file so that the pairs stay matched.
seqtk sample -s42 sample_R1.fastq.gz 1000000 \
| gzip > sample_R1.sub.fq.gz
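
Note that seqtk sample draws reads at random. When the literal first N reads are wanted instead, for a quick smoke test for instance, the four-line layout means that head is enough. A minimal sketch, with fastq_head as a hypothetical helper name:

```shell
# fastq_head N FILE.gz: print the first N reads of a gzipped FASTQ file.
# Relies on each read occupying exactly four lines.
fastq_head() {
  gzip -dc "$2" | head -n "$(( $1 * 4 ))"
}

# Usage: fastq_head 1000000 sample_R1.fastq.gz | gzip > sample_R1.head.fq.gz
```

Applying the same N to the R1 and R2 files keeps the pairs matched, since pairing is positional.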
Converting from BAM back to FASTQ
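# Extract paired reads from a BAM file.
# Collate by read name first so that mates are adjacent; -n leaves read names without /1 and /2 suffixes.
samtools collate -u -O input.bam \
| samtools fastq -1 sample_R1.fq.gz -2 sample_R2.fq.gz \
-0 /dev/null -s /dev/null -n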