FASTQ
=====

Overview
--------

FASTQ is the de facto standard file format for storing raw sequencing reads together with their per-base quality scores. Illumina, PacBio, and Oxford Nanopore sequencing runs all produce FASTQ files, either directly or by conversion from an intermediate format. It is the starting point for virtually every NGS analysis pipeline -- from quality control and trimming through alignment, assembly, and quantification.

FASTQ files are plain text and are typically compressed with **gzip** (``*.fastq.gz`` or ``*.fq.gz``). A single human whole-genome sequencing experiment at 30x coverage produces roughly 60--90 GB of compressed FASTQ data.

Structure
---------

In modern files each read occupies exactly **four lines**:

.. code-block:: text

   @NB501234:45:H3YTNBGX3:1:11101:24563:1038 1:N:0:ACGTACGT
   NTGCAAGCAGTTCAGGATCAGTCGAGACTTCAATGTCGATCTACGTAGC
   +
   #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

**Line 1 -- Header (starts with** ``@`` **)**

Contains the read identifier. In the Illumina convention the fields are colon-separated:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Field
     - Meaning
   * - ``NB501234``
     - Instrument name
   * - ``45``
     - Run ID
   * - ``H3YTNBGX3``
     - Flowcell ID
   * - ``1``
     - Lane
   * - ``11101``
     - Tile number
   * - ``24563``
     - X coordinate on tile
   * - ``1038``
     - Y coordinate on tile

After the space, ``1:N:0:ACGTACGT`` encodes, in order: the read number (``1`` = R1, ``2`` = R2), the filter flag (``N`` = the read passed filtering, ``Y`` = it was filtered out), the control bits (``0`` when none are set), and the index/barcode sequence.

**Line 2 -- Sequence**

The nucleotide sequence of the read. Characters are A, C, G, T, and N (undetermined base).

**Line 3 -- Separator (** ``+`` **)**

A literal ``+`` character. It may optionally be followed by a repeat of the read identifier, but this is rarely done in modern files.

**Line 4 -- Quality string**

One ASCII character per base, encoding the Phred quality score. It must be exactly as long as the sequence on line 2.

Phred quality encoding
^^^^^^^^^^^^^^^^^^^^^^

Modern instruments use the **Sanger / Phred+33** encoding. The quality score for a base is computed from its estimated error probability as:

.. code-block:: text

   Q = -10 * log10(P_error)

The score is stored as the character with ASCII code ``Q + 33``. Common values:

.. list-table::
   :header-rows: 1
   :widths: 15 15 30 40

   * - ASCII
     - Q
     - Error rate
     - Interpretation
   * - ``!``
     - 0
     - 1 in 1
     - Worst quality
   * - ``#``
     - 2
     - 1 in 1.6
     - Very poor
   * - ``+``
     - 10
     - 1 in 10
     - Low quality
   * - ``5``
     - 20
     - 1 in 100
     - Acceptable
   * - ``?``
     - 30
     - 1 in 1 000
     - Good
   * - ``I``
     - 40
     - 1 in 10 000
     - Excellent
   * - ``J``
     - 41
     - 1 in 12 589
     - Typical maximum in Illumina 1.8+ output

Paired-end conventions
^^^^^^^^^^^^^^^^^^^^^^

Paired-end libraries produce **two FASTQ files** (or a single interleaved file). The standard naming convention is:

.. code-block:: text

   sample_R1.fastq.gz   # forward reads (read 1)
   sample_R2.fastq.gz   # reverse reads (read 2)

The n-th record in each file belongs to the same pair, so both files must contain the same reads in the same order. Tools such as ``fastp`` and ``BWA-MEM2`` take the two files together (``fastp`` through its ``-i``/``-I`` options, ``BWA-MEM2`` as positional arguments) and preserve pairing automatically:

.. code-block:: bash

   fastp \
     -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
     -o sample_R1.trimmed.fq.gz -O sample_R2.trimmed.fq.gz

Working with FASTQ files
------------------------

Counting reads
^^^^^^^^^^^^^^

Because every read occupies exactly four lines, the total number of reads in a FASTQ file is the line count divided by four:

.. code-block:: bash

   # Count reads in a gzipped FASTQ
   echo $(( $(zcat sample_R1.fastq.gz | wc -l) / 4 ))

Viewing the first reads
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Show the first two reads (2 reads x 4 lines)
   zcat sample_R1.fastq.gz | head -8

Quality inspection
^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Run FastQC on paired-end files (the output directory must already exist)
   fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc_reports/

Quality trimming and adapter removal
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Remove adapters and drop low-quality or short reads with fastp.
   # Bases below Q20 count as unqualified; reads shorter than 50 bp are discarded.
   fastp \
     -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
     -o sample_R1.trimmed.fq.gz -O sample_R2.trimmed.fq.gz \
     --detect_adapter_for_pe \
     --qualified_quality_phred 20 \
     --length_required 50

Subsetting reads
^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Randomly sample 1 million reads with seqtk (-s sets the random seed).
   # Use the same seed for R2 so that the pairs stay in sync.
   seqtk sample -s42 sample_R1.fastq.gz 1000000 \
     | gzip > sample_R1.sub.fq.gz

Converting from BAM back to FASTQ
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Extract paired reads from a BAM file. samtools fastq expects mates to be
   # adjacent, so collate a coordinate-sorted BAM by read name first.
   samtools collate -u -O input.bam \
     | samtools fastq -1 sample_R1.fq.gz -2 sample_R2.fq.gz \
         -0 /dev/null -s /dev/null -n

See Also
--------

* :doc:`/tools/quality-control/fastqc` -- per-base quality visualisation
* :doc:`/tools/quality-control/fastp` -- all-in-one trimming and QC
* :doc:`/tools/quality-control/multiqc` -- aggregate QC reports
* :doc:`/tools/quality-control/nanoplot` -- quality plots for long-read FASTQ
* :doc:`fasta` -- related sequence-only format (no quality scores)
* :doc:`sam-bam-cram` -- the alignment formats produced by mapping FASTQ reads
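
Appendix: decoding a quality string
-----------------------------------

To make the Phred+33 encoding described earlier on this page concrete, the sketch below converts a quality string into integer scores using only Bash built-ins. The function name ``phred33_decode`` is hypothetical (it is not part of any tool mentioned on this page), and the sketch assumes the input really is Phred+33 encoded; it is meant as an illustration, not a replacement for a dedicated QC tool.

.. code-block:: bash

   #!/usr/bin/env bash

   # phred33_decode (hypothetical helper): print the Phred score of every
   # character in a quality string, assuming Phred+33 encoding.
   phred33_decode() {
       local qual=$1 code i
       local -a scores=()
       for (( i = 0; i < ${#qual}; i++ )); do
           # A leading single quote makes printf emit the character's ASCII code.
           printf -v code '%d' "'${qual:i:1}"
           # Subtract the Phred+33 offset to recover Q.
           scores+=( "$(( code - 33 ))" )
       done
       echo "${scores[*]}"
   }

   # The characters from the table of common values
   phred33_decode '!#+5?IJ'   # prints: 0 2 10 20 30 40 41

Looping character by character is slow in Bash, so this approach is only reasonable for inspecting a handful of reads, for example the output of ``zcat sample_R1.fastq.gz | head -8``.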