Entrez Direct

Overview

Entrez Direct (EDirect) is a set of NCBI command-line utilities for searching and retrieving records from the Entrez databases, including SRA, GEO, PubMed, Gene, and Nucleotide. The core commands (esearch, efetch, elink, and efilter) read from and write to standard input and output, so they can be chained with Unix pipes to build complex queries without writing API code. In bioinformatics workflows, EDirect is well suited to programmatic discovery of public sequencing datasets, metadata extraction, and automated literature searches.

Installation

mamba install -c bioconda entrez-direct

Optionally set an NCBI API key for higher request rates (10 requests/second instead of 3):

export NCBI_API_KEY=your_api_key_here
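
When a script issues many separate EDirect calls in a loop, it can help to pause between them so the loop stays under the rate limits quoted above. The following is a minimal sketch, not part of EDirect: `request_delay` is a hypothetical helper name, and the delays are simply 1/10 s and slightly more than 1/3 s. EDirect may already pace its own requests, so treat this as a conservative extra.

```shell
# Hypothetical helper (not an EDirect command): print the pause, in seconds,
# to insert between requests. 0.1 s keeps a loop at or under 10 requests/second
# (API key set); 0.34 s keeps it under 3 requests/second (no key).
request_delay() {
  if [ -n "${NCBI_API_KEY:-}" ]; then
    echo 0.1
  else
    echo 0.34
  fi
}

# Usage inside a loop that makes one EDirect call per iteration
# (fractional sleep is assumed to be supported, as in GNU and BSD sleep):
#   sleep "$(request_delay)"
```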

Basic Usage

Search SRA for ATAC-seq experiments

esearch -db sra -query "ATAC-seq[Strategy] AND Homo sapiens[Organism]" | \
  efetch -format runinfo | head -5

Search GEO for scRNA-seq datasets

esearch -db gds -query "scRNA-seq AND Homo sapiens AND 2024[PDAT]" | \
  efetch -format summary | head -20

Retrieve FASTA sequences from Nucleotide

esearch -db nucleotide -query "BRCA1 Homo sapiens[Organism] mRNA" | \
  efetch -format fasta | head -20

Link from a GEO dataset to SRA runs

esearch -db gds -query "GSE123456" | \
  elink -target sra | \
  efetch -format runinfo
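
The runinfo output produced by the SRA examples above is comma-separated, so a single column such as the run accession can be pulled out by header name. The sketch below assumes that no field contains a quoted comma; `runinfo_column` is a hypothetical helper name, not an EDirect command. It also skips blank lines and repeated header rows defensively.

```shell
# Hypothetical helper: print one named column from CSV read on stdin.
# Assumes plain comma-separated fields (no quoted commas).
runinfo_column() {
  awk -F',' -v col="$1" '
    # First line: locate the requested column in the header.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    # Later lines: print the value, skipping blanks and repeated headers.
    c && $c != "" && $c != col { print $c }
  '
}

# Usage (needs network access and EDirect installed):
#   esearch -db sra -query "ATAC-seq[Strategy] AND Homo sapiens[Organism]" | \
#     efetch -format runinfo | runinfo_column Run > run_accessions.txt
```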

Key Parameters

Flag / option	Description
`-db` (esearch)	Entrez database to search (`sra`, `gds`, `pubmed`, `nucleotide`, `gene`, etc.).
`-query` (esearch)	Search query using Entrez query syntax. Supports field tags (e.g., `[Organism]`, `[Strategy]`, `[PDAT]`).
`-format` (efetch)	Output format (`runinfo`, `summary`, `fasta`, `xml`, `docsum`, `abstract`, etc.).
`-target` (elink)	Target database for cross-database linking.
`-batch`	Process records in batch mode for large result sets.
`-retmax` (esearch)	Maximum number of records to return (the underlying ESearch E-utility returns 20 by default).
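
Since `-query` takes a single string, scripts that assemble searches from several field-tagged terms need to join them with `AND`. The following sketch shows one way to do that; `build_query` is a hypothetical helper name, and it only concatenates strings without validating the field tags.

```shell
# Hypothetical helper: join any number of terms with AND into one
# Entrez query string. Field tags are passed through unchanged.
build_query() {
  query=""
  for term in "$@"; do
    if [ -z "$query" ]; then
      query="$term"
    else
      query="$query AND $term"
    fi
  done
  printf '%s\n' "$query"
}

# Usage (needs network access and EDirect installed):
#   esearch -db sra \
#     -query "$(build_query 'ATAC-seq[Strategy]' 'Homo sapiens[Organism]')" | \
#     efetch -format runinfo
```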

Expected Output

Output varies by database and format:

SRA runinfo – comma-separated table with a header row; columns include Run, ReleaseDate, LoadDate, spots, bases, avgLength, size_MB, download_path, Experiment, LibraryStrategy, LibrarySource, LibraryLayout, Platform, Model, SRAStudy, BioProject, BioSample, SampleName, and ScientificName (the organism).
GEO summary – text summaries of datasets including title, description, platform, and sample counts.
FASTA – nucleotide or protein sequences in standard FASTA format.
XML / DocSum – structured metadata records suitable for parsing.

All output is written to standard output, making it easy to pipe to downstream tools or redirect to files.
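
As an illustration of piping into a downstream tool, the sketch below summarizes FASTA output by printing each record's identifier and sequence length. `fasta_lengths` is a hypothetical helper name; it assumes standard FASTA in which every record starts with a `>` header line, and it takes the identifier to be the first whitespace-delimited token of that header.

```shell
# Hypothetical helper: print "<id><TAB><length>" for each FASTA record on stdin.
fasta_lengths() {
  awk '
    # Header line: emit the previous record, then start a new one.
    /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
    # Sequence line: accumulate its length.
    { len += length($0) }
    # Emit the final record.
    END { if (id != "") print id "\t" len }
  '
}

# Usage (needs network access and EDirect installed):
#   esearch -db nucleotide -query "BRCA1 Homo sapiens[Organism] mRNA" | \
#     efetch -format fasta | fasta_lengths > lengths.tsv
```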