Test Datasets

Overview

Before running a full analysis you need small test datasets to validate your installation and learn each tool’s interface. The NCBI Sequence Read Archive (SRA) is the largest public repository of sequencing data. This page shows how to download test FASTQ files using fasterq-dump (from the SRA Toolkit) and the Python helper library pysradb.

Installation

# SRA Toolkit via Conda
mamba install -c bioconda sra-tools

# Python alternative
pip install pysradb

Basic Usage

Download a single accession

fasterq-dump converts an SRA run into FASTQ files on local disk. Pass --split-files so that paired-end reads are written to separate _1.fastq and _2.fastq files.

# Download using fasterq-dump (recommended)
fasterq-dump --split-files --threads 8 SRR12345678

Batch download

Loop over a list of accessions and compress each FASTQ file as soon as it is written. fasterq-dump produces uncompressed output, so this saves considerable disk space.

# Batch download (replace the placeholder accessions with real run IDs)
for acc in SRR001 SRR002 SRR003; do
  fasterq-dump --split-files --threads 4 "$acc"
  gzip "${acc}"*.fastq
done
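
For longer lists, the loop above can be driven from a text file. The following is a minimal sketch, not a definitive implementation: the function name download_batch, the list file layout (one accession per line, # for comments), and the FASTERQ_DUMP override variable are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: download every accession listed in a text file (one per line).
# FASTERQ_DUMP is a hypothetical override for the executable name;
# it defaults to fasterq-dump from the SRA Toolkit.

download_batch() {
  local list_file="$1"
  local out_dir="${2:-.}"
  local tool="${FASTERQ_DUMP:-fasterq-dump}"
  local acc f

  mkdir -p "$out_dir"
  while IFS= read -r acc || [ -n "$acc" ]; do
    acc="${acc%%#*}"                                  # drop comments
    acc="$(printf '%s' "$acc" | tr -d '[:space:]')"   # drop whitespace and CR
    if [ -z "$acc" ]; then continue; fi

    # Skip runs that were already downloaded and compressed
    if [ -f "$out_dir/${acc}_1.fastq.gz" ] || [ -f "$out_dir/${acc}.fastq.gz" ]; then
      echo "skipping $acc (already present)"
      continue
    fi

    if ! "$tool" --split-files --threads 4 --outdir "$out_dir" "$acc" < /dev/null; then
      echo "download failed for $acc" >&2
      continue
    fi

    # Compress only this run's files, so SRR1 never matches SRR10
    for f in "$out_dir/${acc}.fastq" "$out_dir/${acc}_"*.fastq; do
      if [ -f "$f" ]; then gzip -f "$f"; fi
    done
  done < "$list_file"
}
```

Calling download_batch accessions.txt ./data would process each listed run in turn. Because finished runs are skipped, the same command can be re-run after an interrupted session.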

pysradb (Python alternative)

pysradb can download all runs belonging to a study (SRP) accession.

# Download every run in a study; -p takes the SRP accession
pysradb download -y -t 8 --out-dir ./data -p SRP123456

Nanopore test data

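Download a long-read Oxford Nanopore dataset for testing long-read alignment and other long-read tools. The FASTQ from SRA contains reads that are already basecalled, so testing a basecaller itself requires raw signal data instead.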

# Fetch the run archive first, then convert it to FASTQ
prefetch SRR28655382
fasterq-dump SRR28655382
# Quick sanity check: a FASTQ record spans four lines
wc -l SRR28655382.fastq | awk '{print $1/4}'

Key Parameters

fasterq-dump

Flag	Description
`--split-files`	Write paired-end reads into separate `_1.fastq` / `_2.fastq` files.
`--split-3`	Write paired reads to `_1.fastq` / `_2.fastq` and unpaired reads to a third file. This is the default mode.
`--threads`	Number of threads used for the conversion (default 6).
`--outdir`	Write output files to this directory.
`--temp`	Temporary directory for intermediate files (set to fast local storage).
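
The flags above can be combined in a small wrapper. This is a sketch under stated assumptions: the function name fetch_run and the FASTERQ_DUMP and THREADS variables are hypothetical, and the scratch directory is created under TMPDIR, which you would point at fast local storage.

```shell
#!/usr/bin/env bash
# Sketch: run fasterq-dump with an explicit output directory and a
# throwaway scratch directory for intermediate files.
# FASTERQ_DUMP and THREADS are hypothetical override variables.

fetch_run() {
  local acc="$1"
  local out_dir="${2:-./data}"
  local tool="${FASTERQ_DUMP:-fasterq-dump}"
  local scratch status=0

  mkdir -p "$out_dir"
  # Create a private scratch directory for intermediate files
  scratch="$(mktemp -d "${TMPDIR:-/tmp}/fasterq.XXXXXX")" || return 1

  "$tool" --split-3 --threads "${THREADS:-6}" \
          --outdir "$out_dir" --temp "$scratch" "$acc" || status=$?

  # Remove the scratch directory whether or not the conversion succeeded
  rm -rf "$scratch"
  return "$status"
}
```

For example, TMPDIR=/scratch fetch_run SRR12345678 ./data would keep intermediate files on /scratch (a placeholder path) and write the FASTQ files to ./data.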

pysradb

Flag	Description
`-y`	Skip confirmation prompts.
`-t`	Number of parallel download threads.
`--out-dir`	Output directory for downloaded files.
`-p`	SRP study accession(s) whose runs should be downloaded.

Expected Output

After downloading a paired-end Illumina run you will see:

SRR12345678_1.fastq   # Forward reads
SRR12345678_2.fastq   # Reverse reads

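As a quick sanity check, count the reads (a FASTQ record spans four lines). The _1 and _2 files of a paired-end run should report the same number: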

wc -l SRR12345678_1.fastq | awk '{print $1/4}'
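
The same count can be used to compare both mates of a paired-end run. The sketch below assumes uncompressed, four-line-per-record FASTQ files that end with a newline; the function names count_reads and check_pair are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: compare read counts between the _1 and _2 files of a run.

count_reads() {
  # A FASTQ record spans four lines, so reads = lines / 4
  local lines
  lines="$(wc -l < "$1")" || return 1
  lines=$((lines))   # normalise any padding in the wc output
  if [ $((lines % 4)) -ne 0 ]; then
    echo "error: $1 has $lines lines, not a multiple of 4" >&2
    return 1
  fi
  echo $((lines / 4))
}

check_pair() {
  # $1 is the file prefix, for example SRR12345678
  local prefix="$1"
  local n1 n2
  n1="$(count_reads "${prefix}_1.fastq")" || return 1
  n2="$(count_reads "${prefix}_2.fastq")" || return 1
  if [ "$n1" -ne "$n2" ]; then
    echo "error: ${prefix}_1.fastq has $n1 reads, ${prefix}_2.fastq has $n2" >&2
    return 1
  fi
  echo "$prefix: $n1 read pairs"
}
```

Running check_pair SRR12345678 in the download directory exits non-zero if either file is truncated mid-record or the two counts differ, which usually points to an interrupted download.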

For the Nanopore test data:

SRR28655382.fastq     # Long reads in single-end FASTQ
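
Long-read files can also be checked for structural problems before they are passed to an aligner. The following awk-based sketch is one possible approach, assuming uncompressed FASTQ with exactly four lines per record; validate_fastq is a hypothetical helper name, not part of the SRA Toolkit.

```shell
#!/usr/bin/env bash
# Sketch: structural check of a four-line-per-record FASTQ file.
# Prints a short summary on success and exits non-zero on the first problem.

validate_fastq() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" {
      printf "line %d: header does not start with @\n", NR; bad = 1; exit
    }
    NR % 4 == 2 { seq_len = length($0); total += seq_len }
    NR % 4 == 3 && substr($0, 1, 1) != "+" {
      printf "line %d: separator does not start with +\n", NR; bad = 1; exit
    }
    NR % 4 == 0 && length($0) != seq_len {
      printf "line %d: quality length differs from sequence length\n", NR; bad = 1; exit
    }
    END {
      if (bad) exit 1
      if (NR % 4 != 0) { print "truncated record at end of file"; exit 1 }
      n = NR / 4
      printf "%d reads, mean length %.1f\n", n, (n ? total / n : 0)
    }
  ' "$1"
}
```

For the Nanopore run above, validate_fastq SRR28655382.fastq would report the read count and mean read length, or the first line at which the file stops looking like FASTQ.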