SRA Toolkit
===========

Overview
--------

The SRA Toolkit is a collection of command-line utilities from NCBI for downloading, converting, and validating data from the Sequence Read Archive (SRA). Its primary tool, ``fasterq-dump``, extracts FASTQ files from SRA accessions using multiple threads and replaces the older ``fastq-dump``. The toolkit also provides ``prefetch`` for downloading SRA files in advance and ``vdb-validate`` for verifying data integrity. Data submission to NCBI, shown under Basic Usage, relies on the separately installed Aspera ``ascp`` client rather than on the toolkit itself.

Installation
------------

.. code-block:: bash

   mamba install -c bioconda sra-tools

After installation, configure the toolkit (sets the cache directory and accepts NCBI terms):

.. code-block:: bash

   vdb-config --interactive

Basic Usage
-----------

**Download a single run**

.. code-block:: bash

   # Download using fasterq-dump (recommended)
   fasterq-dump --split-files --threads 8 SRR12345678

**Batch download multiple runs**

.. code-block:: bash

   # SRR001..SRR003 are placeholders; substitute real run accessions
   for acc in SRR001 SRR002 SRR003; do
       fasterq-dump --split-files --threads 4 "$acc"
       # fasterq-dump writes uncompressed FASTQ, so compress afterwards
       gzip "${acc}"*.fastq
   done

**Prefetch before extracting (recommended for large files)**

.. code-block:: bash

   prefetch SRR12345678
   fasterq-dump --split-files --threads 8 SRR12345678

**Submit data via Aspera**

.. code-block:: bash

   ascp -i "$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" \
       -QT -l 1000m -k 1 \
       sample_R1.fastq.gz sample_R2.fastq.gz \
       subasp@upload.ncbi.nlm.nih.gov:uploads/your_folder/

Key Parameters
--------------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Flag / option
     - Description
   * - ``--split-files``
     - Write paired-end reads to separate files (``_1.fastq`` and ``_2.fastq``).
   * - ``--split-3``
     - Like ``--split-files`` but also writes unpaired reads to a third file. This is the mode ``fasterq-dump`` uses when no split option is given.
   * - ``--threads``
     - Number of threads for ``fasterq-dump`` (default: 6).
   * - ``--outdir``
     - Output directory for extracted FASTQ files.
   * - ``--temp``
     - Temporary directory for intermediate files (requires substantial disk space).
   * - ``--progress``
     - Show a progress bar during extraction.
   * - ``--skip-technical``
     - Skip technical reads (e.g., barcodes) and output only biological reads.
   * - ``-X`` (prefetch)
     - Maximum size of a file that ``prefetch`` will download, given in KB (default limit: 20 GB).

Expected Output
---------------

For a paired-end run (e.g., ``SRR12345678``):

* ``SRR12345678_1.fastq`` -- Read 1 FASTQ file.
* ``SRR12345678_2.fastq`` -- Read 2 FASTQ file.

After compression:

* ``SRR12345678_1.fastq.gz`` / ``SRR12345678_2.fastq.gz``

For single-end runs, a single ``SRR12345678.fastq`` file is produced.

The ``prefetch`` command downloads an ``.sra`` file to the configured download location, which ``fasterq-dump`` then converts to FASTQ. Older default configurations use the local cache under ``~/ncbi/``; recent releases may instead write to a directory named after the accession in the current working directory.

See Also
--------

* :doc:`entrez-direct` -- search NCBI databases (SRA, GEO) to discover accession numbers before downloading
* :doc:`/tools/quality-control/fastqc` -- quality-check downloaded FASTQ files
* :doc:`/tools/quality-control/fastp` -- trim and filter downloaded reads before alignment
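
Worked Example
--------------

The overview names ``vdb-validate`` for integrity checks, but none of the usage snippets call it. The following is a minimal sketch, not a definitive pipeline, of how ``prefetch``, ``vdb-validate``, and ``fasterq-dump`` might be chained for each accession. The helper names ``run`` and ``fetch_run`` and the ``DRY_RUN`` variable are hypothetical and exist only for illustration. The sketch also assumes that ``vdb-validate`` can resolve the accession downloaded by ``prefetch`` and exits with a non-zero status when validation fails.

.. code-block:: bash

   #!/usr/bin/env bash
   set -euo pipefail

   # Hypothetical helper: with DRY_RUN=1, print the command instead of running it.
   run() {
       if [[ "${DRY_RUN:-0}" == "1" ]]; then
           printf '%s\n' "$*"
       else
           "$@"
       fi
   }

   # Hypothetical helper: download, verify, and extract one run.
   # $1 = accession, $2 = thread count (falls back to 6, the fasterq-dump default).
   fetch_run() {
       local acc="$1" threads="${2:-6}"
       run prefetch "$acc"
       # With set -e, a non-zero exit from vdb-validate stops the script here.
       run vdb-validate "$acc"
       run fasterq-dump --split-files --threads "$threads" "$acc"
   }

   # Process every accession given on the command line; do nothing if none are given.
   for acc in "$@"; do
       fetch_run "$acc"
   done

Saved under a name such as ``fetch_runs.sh`` (hypothetical), the script could be previewed without touching the network by running ``DRY_RUN=1 bash fetch_runs.sh SRR12345678``, which prints the three commands rather than executing them. Validating before extraction means a truncated download is caught before ``fasterq-dump`` spends time and temporary disk space on it.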