Test Datasets
=============

Overview
--------

Before running a full analysis, you need small test datasets to validate your
installation and learn each tool's interface. The NCBI Sequence Read Archive
(SRA) is the largest public repository of sequencing data. This page shows how
to download test FASTQ files using **fasterq-dump** (from the SRA Toolkit) and
the Python helper library **pysradb**.

Installation
------------

.. code-block:: bash

   # SRA Toolkit via Conda
   mamba install -c bioconda sra-tools

   # Python alternative
   pip install pysradb

Basic Usage
-----------

Download a single accession
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``fasterq-dump`` converts SRA archives into FASTQ files locally. Use
``--split-files`` so that the two mates of a paired-end run are written to
separate ``_1.fastq`` (R1) and ``_2.fastq`` (R2) files.

.. code-block:: bash

   # Download using fasterq-dump (recommended)
   fasterq-dump --split-files --threads 8 SRR12345678

Batch download
^^^^^^^^^^^^^^

Loop over a list of accessions and compress each FASTQ immediately to save
disk space.

.. code-block:: bash

   # Batch download: convert each run, then gzip its FASTQ files
   for acc in SRR001 SRR002 SRR003; do
       fasterq-dump --split-files --threads 4 "$acc"
       gzip "${acc}"*.fastq
   done

pysradb (Python alternative)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``pysradb`` can download all runs belonging to a study (SRP) accession. Pass
the accession with ``-p``.

.. code-block:: bash

   # Python alternative (requires pysradb, see Installation)
   pysradb download -y -t 8 --out-dir ./data -p SRP123456

Nanopore test data
^^^^^^^^^^^^^^^^^^

Download a long-read Oxford Nanopore dataset for testing long-read alignment
tools. The FASTQ reads are already basecalled, so this dataset does not
exercise a basecaller itself.

.. code-block:: bash

   # Fetch the .sra archive first, then convert it to FASTQ.
   # The run is single-end, so the default split mode writes one file.
   prefetch SRR28655382
   fasterq-dump SRR28655382

   # Sanity check: count reads (4 lines per FASTQ record)
   wc -l SRR28655382.fastq | awk '{print $1/4}'

Key Parameters
--------------

fasterq-dump
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Flag
     - Description
   * - ``--split-files``
     - Write paired-end reads into separate ``_1.fastq`` / ``_2.fastq`` files.
   * - ``--split-3``
     - Like ``--split-files`` but also writes unpaired reads to a third file.
   * - ``--threads``
     - Number of threads to use (default 6).
   * - ``--outdir``
     - Write output files to this directory.
   * - ``--temp``
     - Temporary directory for intermediate files (set to fast local storage).

pysradb
^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Flag
     - Description
   * - ``-y``
     - Skip confirmation prompts.
   * - ``-t``
     - Number of parallel download threads.
   * - ``-p``
     - Study (SRP) accession to download.
   * - ``--out-dir``
     - Output directory for downloaded files.

Expected Output
---------------

After downloading a paired-end Illumina run you will see:

.. code-block:: text

   SRR12345678_1.fastq   # Forward reads
   SRR12345678_2.fastq   # Reverse reads

Sanity-check the download with a quick read count (a FASTQ record spans four
lines):

.. code-block:: bash

   wc -l SRR12345678_1.fastq | awk '{print $1/4}'

For the Nanopore test data:

.. code-block:: text

   SRR28655382.fastq     # Long reads in single-end FASTQ

See Also
--------

* :doc:`installation` -- set up Conda environments before downloading data
* :doc:`computing-environment` -- run downloads inside SLURM jobs on HPC
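
Checking paired read counts
---------------------------

The read-count check under Expected Output looks at one file at a time. For a
paired-end run, it can also help to confirm that the ``_1.fastq`` and
``_2.fastq`` files hold the same number of reads. The following is a minimal
sketch, not part of the SRA Toolkit: the helper names ``count_reads`` and
``check_pair`` are hypothetical, and it assumes uncompressed FASTQ with exactly
four lines per record.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helpers (not provided by sra-tools or pysradb).

   # count_reads FILE: print the number of FASTQ records in FILE,
   # assuming exactly four lines per record.
   count_reads() {
       local lines
       lines=$(wc -l < "$1")
       echo $(( lines / 4 ))
   }

   # check_pair PREFIX: succeed only if PREFIX_1.fastq and PREFIX_2.fastq
   # contain the same number of reads; report a mismatch on stderr.
   check_pair() {
       local r1 r2
       r1=$(count_reads "${1}_1.fastq")
       r2=$(count_reads "${1}_2.fastq")
       if [ "$r1" -eq "$r2" ]; then
           echo "OK: ${1} has ${r1} read pairs"
       else
           echo "MISMATCH: ${1}_1=${r1} ${1}_2=${r2}" >&2
           return 1
       fi
   }

After sourcing these functions, ``check_pair SRR12345678`` would report the
number of read pairs, or return a non-zero status if the two files disagree.
Matching counts do not prove the files are intact, but a mismatch is a cheap
signal that a download or conversion was interrupted.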