Nanopore Variant Calling

Overview

Oxford Nanopore long reads can be used for germline variant calling when aligned to a reference genome and processed with a deep-learning-based variant caller. This pipeline maps reads with minimap2, calls SNPs and indels with Clair3 (a pileup-and-full-alignment neural-network caller optimized for Nanopore data), and filters variants with bcftools. Clair3 leverages platform-specific models trained on Nanopore error profiles, achieving high accuracy on both SNPs and small indels from long-read data.

Pipeline Steps

minimap2 align – Map Nanopore reads to the reference genome using minimap2 in map-ont mode with MD tags for accurate variant calling. Output is piped through samtools sort to produce a coordinate-sorted BAM.
samtools index – Index the sorted BAM file for random access by downstream tools.
Clair3 – Call germline variants (SNPs and indels) using the Clair3 deep-learning model appropriate for the Nanopore chemistry. The --include_all_ctgs flag processes all contigs in the reference.
bcftools filter – Apply quality and depth filters to remove low- confidence variant calls (QUAL >= 20, DP >= 10).
bcftools stats – Generate summary statistics for the filtered variant set including counts of SNPs, indels, and ts/tv ratio.

Implementation

# Snakefile -- Nanopore Variant Calling Pipeline
configfile: "config.yaml"

SAMPLES   = config["samples"]
REFERENCE = config["reference"]
MODEL     = config["clair3_model"]

rule all:
    input:
        expand("results/variants/{sample}_snps.vcf.gz", sample=SAMPLES),
        expand("results/variants/{sample}_stats.txt", sample=SAMPLES)

rule minimap2_align:
    input:
        reads="data/{sample}.fastq.gz",
        ref=REFERENCE
    output:
        bam="results/aligned/{sample}.sorted.bam",
        bai="results/aligned/{sample}.sorted.bam.bai"
    threads: 8
    shell:
        """
        minimap2 -ax map-ont --MD \
          -R '@RG\\tID:{wildcards.sample}\\tSM:{wildcards.sample}\\tPL:ONT' \
          {input.ref} {input.reads} \
          | samtools sort -@ {threads} -o {output.bam} -
        samtools index {output.bam}
        """

rule clair3:
    input:
        bam="results/aligned/{sample}.sorted.bam",
        bai="results/aligned/{sample}.sorted.bam.bai",
        ref=REFERENCE
    output:
        vcf="results/clair3/{sample}/merge_output.vcf.gz"
    params:
        model=MODEL,
        outdir="results/clair3/{sample}"
    threads: 8
    shell:
        """
        run_clair3.sh \
          --bam_fn={input.bam} \
          --ref_fn={input.ref} \
          --output={params.outdir} \
          --threads={threads} \
          --platform="ont" \
          --model_path={params.model} \
          --sample_name={wildcards.sample} \
          --include_all_ctgs
        """

rule bcftools_filter:
    input:
        "results/clair3/{sample}/merge_output.vcf.gz"
    output:
        filtered="results/variants/{sample}_filtered.vcf.gz",
        snps="results/variants/{sample}_snps.vcf.gz",
        stats="results/variants/{sample}_stats.txt"
    shell:
        """
        bcftools filter -i 'QUAL>=20 && INFO/DP>=10' \
          {input} -Oz -o {output.filtered}
        bcftools view -v snps {output.filtered} -Oz -o {output.snps}
        bcftools stats {output.filtered} > {output.stats}
        """

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.samplesheet  = "samplesheet.csv"
params.reference    = "ref/reference.fna"
params.clair3_model = "\${CONDA_PREFIX}/bin/models/ont"
params.outdir       = "results"

process MINIMAP2_ALIGN {
    tag "${sample_id}"
    publishDir "${params.outdir}/aligned", mode: 'copy'
    cpus 8
    memory '16 GB'
    input:
        tuple val(sample_id), path(reads)
        path(reference)
    output:
        tuple val(sample_id),
              path("${sample_id}.sorted.bam"),
              path("${sample_id}.sorted.bam.bai"), emit: bam
    script:
    """
    minimap2 -ax map-ont --MD \
      -R '@RG\\tID:${sample_id}\\tSM:${sample_id}\\tPL:ONT' \
      ${reference} ${reads} \
      | samtools sort -@ ${task.cpus} -o ${sample_id}.sorted.bam -
    samtools index ${sample_id}.sorted.bam
    """
}

process CLAIR3 {
    tag "${sample_id}"
    publishDir "${params.outdir}/clair3/${sample_id}", mode: 'copy'
    cpus 8
    memory '32 GB'
    input:
        tuple val(sample_id), path(bam), path(bai)
        path(reference)
    output:
        tuple val(sample_id),
              path("merge_output.vcf.gz"), emit: vcf
    script:
    """
    run_clair3.sh \
      --bam_fn=${bam} \
      --ref_fn=${reference} \
      --output=. \
      --threads=${task.cpus} \
      --platform="ont" \
      --model_path=${params.clair3_model} \
      --sample_name=${sample_id} \
      --include_all_ctgs
    """
}

process BCFTOOLS_FILTER {
    tag "${sample_id}"
    publishDir "${params.outdir}/variants", mode: 'copy'
    input:
        tuple val(sample_id), path(vcf)
    output:
        tuple val(sample_id),
              path("${sample_id}_filtered.vcf.gz"), emit: filtered
        tuple val(sample_id),
              path("${sample_id}_snps.vcf.gz"), emit: snps
        path("${sample_id}_stats.txt"), emit: stats
    script:
    """
    bcftools filter -i 'QUAL>=20 && INFO/DP>=10' \
      ${vcf} -Oz -o ${sample_id}_filtered.vcf.gz
    bcftools view -v snps ${sample_id}_filtered.vcf.gz \
      -Oz -o ${sample_id}_snps.vcf.gz
    bcftools stats ${sample_id}_filtered.vcf.gz \
      > ${sample_id}_stats.txt
    """
}

workflow {
    ch_input = Channel.fromPath(params.samplesheet)
        .splitCsv(header: true)
        .map { row -> tuple(row.sample_id, file(row.fastq)) }

    reference = Channel.value(file(params.reference))

    MINIMAP2_ALIGN(ch_input, reference)
    CLAIR3(MINIMAP2_ALIGN.out.bam, reference)
    BCFTOOLS_FILTER(CLAIR3.out.vcf)
}

Expected Output

After a successful run the output directory will contain:

results/aligned/<sample>.sorted.bam – Coordinate-sorted BAM file with read group tags and MD tags for variant calling.
results/clair3/<sample>/merge_output.vcf.gz – Raw Clair3 variant calls including SNPs and indels with quality scores and genotype likelihoods.
results/variants/<sample>_filtered.vcf.gz – Quality-filtered variants (QUAL >= 20, DP >= 10).
results/variants/<sample>_snps.vcf.gz – SNP-only subset of filtered variants.
results/variants/<sample>_stats.txt – Summary statistics including total variants, SNP count, indel count, and transition/transversion ratio.

Nanopore Variant Calling

Overview

Pipeline Steps

Implementation

Expected Output

See Also