SRA Toolkit

Overview

The SRA Toolkit is a collection of command-line utilities from NCBI for downloading, converting, and validating data from the Sequence Read Archive (SRA). Its primary tool, fasterq-dump, extracts FASTQ files from SRA accessions using multiple threads and replaces the older fastq-dump. The toolkit also provides prefetch for downloading SRA files in advance and vdb-validate for verifying data integrity. Uploading data to NCBI is not part of the toolkit itself; submissions typically go through the separate Aspera ascp client, as in the "Submit data via Aspera" example under Basic Usage.

Installation

mamba install -c bioconda sra-tools

After installation, run the interactive configurator once to set options such as the cache directory:

vdb-config --interactive

Basic Usage

Download a single run

# Download using fasterq-dump (recommended)
fasterq-dump --split-files --threads 8 SRR12345678

Batch download multiple runs

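for acc in SRR001 SRR002 SRR003; do
  # Quote the variable so unusual accession strings are passed through intact
  fasterq-dump --split-files --threads 4 "$acc"
  # Compress every FASTQ file produced for this accession
  gzip "${acc}"*.fastq
done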
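
When the list of runs is long, keeping it in a file can be easier than editing the loop. The following is a minimal sketch, assuming a hypothetical accessions.txt with one accession per line; the helper names list_accessions and batch_download are illustrative and are not part of the SRA Toolkit.

```shell
# Sketch: process every accession listed in a text file, one per line.
# The file name (accessions.txt) and both function names are hypothetical.

# Print the usable accessions from a list file: strips surrounding
# whitespace (including stray carriage returns) and skips blank lines
# and lines starting with '#'.
list_accessions() {
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$1" | grep -v -e '^$' -e '^#'
}

# Extract and compress each accession. The loop stops at the first
# failure so that a partial extraction is not silently compressed.
batch_download() {
  list_accessions "$1" | while IFS= read -r acc; do
    fasterq-dump --split-files --threads 4 "$acc" < /dev/null || exit 1
    gzip "${acc}"*.fastq || exit 1
  done
}

# Usage (commented out so the block has no side effects when sourced):
# batch_download accessions.txt
```

Because the loop runs at the end of a pipeline, the exit inside it ends only the loop, and batch_download returns a non-zero status that the caller can check.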

Prefetch before extracting (recommended for large files)

prefetch SRR12345678
fasterq-dump --split-files --threads 8 SRR12345678
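
Since the toolkit includes vdb-validate for integrity checks, one option is to run it between the two steps above. This is a sketch only: fetch_verified is a hypothetical helper name, and it assumes that vdb-validate returns a non-zero exit status when validation fails and that all three commands are run from the same working directory.

```shell
# Sketch: download, verify, then extract a single run.
# fetch_verified is a hypothetical helper, not an SRA Toolkit command.
fetch_verified() {
  acc=$1

  # Step 1: download the run ahead of extraction
  prefetch "$acc" || return 1

  # Step 2: check the download (assumes a non-zero exit status on failure)
  vdb-validate "$acc" || {
    echo "validation failed for $acc" >&2
    return 1
  }

  # Step 3: extract FASTQ only after the download has passed validation
  fasterq-dump --split-files --threads 8 "$acc"
}

# Usage (commented out so the block has no side effects when sourced):
# fetch_verified SRR12345678
```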

Submit data via Aspera

ascp -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh \
  -QT -l 1000m -k 1 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  subasp@upload.ncbi.nlm.nih.gov:uploads/your_folder/

Key Parameters

Flag / option       Description

--split-files       Write paired-end reads to separate files (_1.fastq and _2.fastq).
--split-3           Like --split-files but also writes unpaired reads to a third file.
--threads           Number of threads for fasterq-dump (default: 6).
--outdir            Output directory for extracted FASTQ files.
--temp              Temporary directory for intermediate files (requires substantial disk space).
--progress          Show a progress bar during extraction.
--skip-technical    Skip technical reads (e.g., barcodes) and output only biological reads.
-X (prefetch)       Maximum file size to download in KB (default: 20 GB).
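
Because the --temp directory needs substantial disk space, it can help to check free space before starting an extraction. The sketch below is one way to do that with standard POSIX tools; has_free_kb is a hypothetical helper, /scratch/tmp and fastq are placeholder paths, and the 100 GiB threshold in the usage line is an arbitrary assumption rather than a documented requirement.

```shell
# Sketch: succeed only if directory $1 has at least $2 KB of free space.
# has_free_kb is a hypothetical helper, not an SRA Toolkit command.
has_free_kb() {
  # df -P prints one line per filesystem and -k reports 1024-byte blocks;
  # column 4 of the second line is the available space in KB.
  avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail" -ge "$2" ]
}

# Usage (commented out so the block has no side effects when sourced).
# 104857600 KB is 100 GiB, an arbitrary threshold chosen for illustration:
# has_free_kb /scratch/tmp 104857600 &&
#   fasterq-dump --split-3 --threads 8 --temp /scratch/tmp --outdir fastq --progress SRR12345678
```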

Expected Output

For a paired-end run (e.g., SRR12345678):

  • SRR12345678_1.fastq – Read 1 FASTQ file.

  • SRR12345678_2.fastq – Read 2 FASTQ file.

After compression:

  • SRR12345678_1.fastq.gz / SRR12345678_2.fastq.gz

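For single-end runs, a single SRR12345678.fastq file is produced. The prefetch command downloads an .sra file, which fasterq-dump then converts to FASTQ. Where the .sra file is stored depends on the toolkit version and configuration: older releases use a cache under ~/ncbi/, while recent releases default to a directory named after the accession in the current working directory.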
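
The file naming described above can be used to confirm what an extraction produced. The following is a small sketch that checks uncompressed output only; output_layout is a hypothetical helper name, not an SRA Toolkit command.

```shell
# Sketch: report which FASTQ files exist for accession $1 in directory $2
# (defaults to the current directory). Prints "paired", "single", or
# "missing". output_layout is a hypothetical helper name.
output_layout() {
  acc=$1
  dir=${2:-.}
  if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
    echo paired
  elif [ -f "$dir/${acc}.fastq" ]; then
    echo single
  else
    echo missing
  fi
}

# Usage (commented out so the block has no side effects when sourced):
# output_layout SRR12345678 .
```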

See Also

  • Entrez Direct – search NCBI databases (SRA, GEO) to discover accession numbers before downloading

  • FastQC – quality-check downloaded FASTQ files

  • fastp – trim and filter downloaded reads before alignment