Entrez Direct
=============

Overview
--------

Entrez Direct (EDirect) is a set of NCBI command-line utilities for searching
and retrieving records from all Entrez databases, including SRA, GEO, PubMed,
Gene, and Nucleotide. The core commands -- ``esearch``, ``efetch``,
``elink``, and ``efilter`` -- can be piped together in Unix fashion to build
complex queries without writing API code. In bioinformatics workflows, EDirect
is commonly used for programmatic discovery of public sequencing datasets,
metadata extraction, and automated literature searches.

Installation
------------

.. code-block:: bash

   mamba install -c bioconda entrez-direct

Optionally set an NCBI API key for higher request rates (10 requests/second
instead of 3):

.. code-block:: bash

   export NCBI_API_KEY=your_api_key_here
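
When a script issues many EDirect calls in a loop, the two limits above can be
encoded in a small helper. The following is a minimal sketch:
``ncbi_rate_limit`` is a hypothetical function name, not part of EDirect, and
it only checks whether the variable is set, not whether the key is valid.

.. code-block:: bash

   # Hypothetical helper (not part of EDirect): print the request rate,
   # in requests per second, that applies to the current environment.
   ncbi_rate_limit() {
       if [ -n "${NCBI_API_KEY:-}" ]; then
           echo 10   # API key present
       else
           echo 3    # no API key
       fi
   }

   echo "Allowed requests per second: $(ncbi_rate_limit)"

A wrapper script could use the printed value to decide how long to pause
between consecutive requests.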

Basic Usage
-----------

**Search SRA for ATAC-seq experiments**

.. code-block:: bash

   esearch -db sra -query "ATAC-seq[Strategy] AND Homo sapiens[Organism]" | \
     efetch -format runinfo | head -5

**Search GEO for scRNA-seq datasets**

.. code-block:: bash

   esearch -db gds -query "scRNA-seq AND Homo sapiens AND 2024[PDAT]" | \
     efetch -format summary | head -20

**Retrieve FASTA sequences from Nucleotide**

.. code-block:: bash

   esearch -db nucleotide -query "BRCA1[Gene] AND Homo sapiens[Organism] AND biomol_mrna[PROP]" | \
     efetch -format fasta | head -20

**Link from a GEO dataset to SRA runs**

.. code-block:: bash

   esearch -db gds -query "GSE123456" | \
     elink -target sra | \
     efetch -format runinfo
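
**Extract run accessions from runinfo output**

The runinfo pipelines above can feed a list of run accessions to a download
tool. The following is a minimal sketch: ``extract_runs`` is a hypothetical
helper, not an EDirect command, and it assumes that the run accession is the
first comma-separated column and starts with ``SRR``, ``ERR``, or ``DRR``.

.. code-block:: bash

   # Hypothetical helper (not part of EDirect): read runinfo CSV on stdin
   # and print the unique run accessions, one per line.
   extract_runs() {
       # Take the first column, then keep only lines that look like run
       # accessions; this drops header rows and blank lines.
       cut -d',' -f1 | grep -E '^[SED]RR[0-9]+$' | sort -u
   }

   # Usage (requires network access and an EDirect installation):
   #   esearch -db sra -query "ATAC-seq[Strategy] AND Homo sapiens[Organism]" | \
   #     efetch -format runinfo | extract_runs > accessions.txt

Filtering by pattern rather than skipping only the first line keeps the output
clean even if a header row appears more than once in the stream.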

Key Parameters
--------------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Flag / option
     - Description
   * - ``-db`` (esearch)
     - Entrez database to search (``sra``, ``gds``, ``pubmed``, ``nucleotide``,
       ``gene``, etc.).
   * - ``-query`` (esearch)
     - Search query using Entrez query syntax. Supports field tags (e.g.,
       ``[Organism]``, ``[Strategy]``, ``[PDAT]``).
   * - ``-format`` (efetch)
     - Output format (``runinfo``, ``summary``, ``fasta``, ``xml``,
       ``docsum``, ``abstract``, etc.).
   * - ``-target`` (elink)
     - Target database for cross-database linking.
   * - ``-batch``
     - Process records in batch mode for large result sets.
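   * - ``-retmax`` (esearch)
     - Maximum number of records to return. The underlying E-utilities
       ``retmax`` parameter defaults to 20 for searches; run ``esearch -help``
       to confirm that your EDirect version accepts this flag.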

Expected Output
---------------

Output varies by database and format:

* **SRA runinfo** -- comma-separated table with one row per run. Columns
  include Run, ReleaseDate, LoadDate, spots, bases, avgLength, size_MB,
  download_path, Experiment, LibraryStrategy, LibrarySource, LibraryLayout,
  Platform, Model, SRAStudy, BioProject, BioSample, SampleName, and
  ScientificName (the organism), among others.
* **GEO summary** -- text summaries of datasets including title, description,
  platform, and sample counts.
* **FASTA** -- nucleotide or protein sequences in standard FASTA format.
* **XML / DocSum** -- structured metadata records suitable for parsing.

All output is written to standard output, making it easy to pipe to downstream
tools or redirect to files.
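
As an illustration of piping runinfo output into standard Unix tools, the
sketch below tallies the values in one named column. ``count_by_column`` is a
hypothetical helper, not part of EDirect. It splits on every comma, so it
assumes that no field contains a quoted comma; a real CSV parser is safer when
that assumption does not hold.

.. code-block:: bash

   # Hypothetical helper (not part of EDirect): count how often each value
   # occurs in the named column of a runinfo CSV read from stdin.
   count_by_column() {
       awk -F',' -v col="$1" '
           # First line: find the index of the requested column.
           NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
           # Data lines: skip empty values and any repeated header rows.
           idx && $idx != "" && $idx != col { count[$idx]++ }
           END { for (k in count) print k "\t" count[k] }
       ' | sort
   }

   # Usage (requires network access and an EDirect installation):
   #   esearch -db sra -query "ATAC-seq[Strategy] AND Homo sapiens[Organism]" | \
   #     efetch -format runinfo | count_by_column LibraryLayout

Looking the column up by header name rather than by position keeps the helper
working if the column order changes. If the named column is absent, the helper
prints nothing.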

See Also
--------

* :doc:`sra-toolkit` -- download FASTQ files from SRA using accessions
  discovered with Entrez Direct