Entrez Direct
Overview
Entrez Direct (EDirect) is a set of NCBI command-line utilities for searching
and retrieving records from all Entrez databases, including SRA, GEO, PubMed,
Gene, and Nucleotide. The core commands – esearch, efetch,
elink, and efilter – can be piped together in Unix fashion to build
complex queries without writing API code. EDirect is well suited to
programmatic discovery of public sequencing datasets, metadata extraction, and
automated literature searches in bioinformatics workflows.
Installation
mamba install -c bioconda entrez-direct
Optionally set an NCBI API key for higher request rates (10 requests/second instead of 3):
export NCBI_API_KEY=your_api_key_here
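As a minimal sketch of how a wrapper script might adapt to the two rate limits mentioned above, the helper below reports which ceiling applies in the current environment. The function name `edirect_rate_limit` is hypothetical and not part of EDirect; it only inspects the `NCBI_API_KEY` variable and does not contact NCBI.

```shell
# Hypothetical helper, not part of EDirect: print the requests-per-second
# ceiling that applies, based only on whether NCBI_API_KEY is set.
edirect_rate_limit() {
    if [ -n "${NCBI_API_KEY:-}" ]; then
        echo 10   # API key present
    else
        echo 3    # no API key
    fi
}

# Example: report the limit before starting a batch of queries.
echo "Request limit: $(edirect_rate_limit) per second"
```

A script that issues many queries in a loop could use this value to choose a sleep interval between calls.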
Basic Usage
Search SRA for ATAC-seq experiments
esearch -db sra -query "ATAC-seq[Strategy] AND Homo sapiens[Organism]" | \
efetch -format runinfo | head -5
Search GEO for scRNA-seq datasets
esearch -db gds -query "scRNA-seq AND Homo sapiens AND 2024[PDAT]" | \
efetch -format summary | head -20
Retrieve FASTA sequences from Nucleotide
esearch -db nucleotide -query "BRCA1[Gene Name] AND Homo sapiens[Organism] AND biomol_mrna[PROP]" | \
efetch -format fasta | head -20
Link from a GEO dataset to SRA runs
esearch -db gds -query "GSE123456" | \
elink -target sra | \
efetch -format runinfo
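The runinfo table produced by the pipeline above can be reduced to a plain list of run accessions for downstream download tools. The following is a sketch under stated assumptions: `runinfo_to_accessions` is a hypothetical helper, it assumes the run accession is the first CSV column (headed `Run`), and it drops header rows and blank lines wherever they occur in the stream.

```shell
# Hypothetical helper: read runinfo CSV on stdin and print one unique run
# accession per line, taken from the first column ("Run").
runinfo_to_accessions() {
    awk -F',' '$1 != "" && $1 != "Run" { print $1 }' | sort -u
}

# Usage (requires network access and EDirect on PATH; GSE123456 is the
# placeholder accession from the example above):
#   esearch -db gds -query "GSE123456" | elink -target sra | \
#       efetch -format runinfo | runinfo_to_accessions > accessions.txt
```

Filtering on the literal header value rather than on the line number keeps the helper usable when several runinfo outputs are concatenated.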
Key Parameters
| Flag / option | Description |
|---|---|
| `-db` | Entrez database to search (e.g., `sra`, `gds`, `nucleotide`). |
| `-query` | Search query using Entrez query syntax. Supports field tags (e.g., `[Organism]`, `[Strategy]`, `[PDAT]`). |
| `-format` | Output format (e.g., `runinfo`, `fasta`, `docsum`). |
| `-target` | Target database for cross-database linking. |
|  | Process records in batch mode for large result sets. |
|  | Maximum number of records to retrieve (default: 20 for efetch). |
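Queries that combine several field-tagged terms are easier to maintain when the terms are kept separate and joined at the last moment. The sketch below shows one way to do that; `join_and` is a hypothetical helper (not an EDirect command) that simply joins its arguments with the Entrez `AND` operator, and it expects at least one argument.

```shell
# Hypothetical helper: join all arguments with " AND " into one Entrez query.
join_and() {
    _q=$1
    shift
    for _term in "$@"; do
        _q="$_q AND $_term"
    done
    printf '%s\n' "$_q"
}

# Build the SRA query used earlier from its two field-tagged terms.
query=$(join_and "ATAC-seq[Strategy]" "Homo sapiens[Organism]")
echo "$query"

# Usage (requires network access and EDirect on PATH):
#   esearch -db sra -query "$query" | efetch -format runinfo
```

Passing the result as a single double-quoted argument to `-query` keeps the shell from splitting multi-word terms such as `Homo sapiens[Organism]`.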
Expected Output
Output varies by database and format:
SRA runinfo – comma-separated table with a header row; columns include Run, ReleaseDate, LoadDate, spots, bases, avgLength, size_MB, download_path, Experiment, LibraryStrategy, LibrarySource, LibraryLayout, Platform, Model, SRAStudy, BioProject, BioSample, SampleName, and ScientificName (the organism).
GEO summary – text summaries of datasets including title, description, platform, and sample counts.
FASTA – nucleotide or protein sequences in standard FASTA format.
XML / DocSum – structured metadata records suitable for parsing.
All output is written to standard output, making it easy to pipe to downstream tools or redirect to files.
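Because runinfo output is plain CSV on standard output, a few columns can be pulled out by name with standard tools. This is a minimal sketch, assuming fields contain no quoted commas; `csv_cols` is a hypothetical helper that looks up each requested column in the header row, so it does not depend on column order.

```shell
# Hypothetical helper: print the named columns of a CSV read from stdin,
# tab-separated, in the order requested. Column positions are looked up in
# the header row. Assumes no field contains a quoted comma.
csv_cols() {
    awk -F',' -v want="$*" '
        BEGIN { n = split(want, names, " ") }
        NR == 1 {
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++)
                if (!(names[j] in pos)) {
                    print "csv_cols: no column named " names[j] > "/dev/stderr"
                    exit 1
                }
            next
        }
        {
            line = ""
            for (j = 1; j <= n; j++)
                line = line (j > 1 ? "\t" : "") $(pos[names[j]])
            print line
        }
    '
}

# Usage (requires network access and EDirect on PATH):
#   esearch -db sra -query "ATAC-seq[Strategy] AND Homo sapiens[Organism]" | \
#       efetch -format runinfo > runinfo.csv
#   csv_cols Run LibraryStrategy Platform < runinfo.csv
```

Saving the runinfo table to a file first, as in the usage comment, avoids repeating the NCBI request each time a different set of columns is needed.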
See Also
SRA Toolkit – download FASTQ files from SRA using accessions discovered with Entrez Direct