Skip to content

Real-Time Nanopore Pipeline

Real-time metagenomic analysis for Oxford Nanopore sequencing data. Processes reads as they stream from MinKNOW, providing live taxonomic classification, gene annotation, and functional profiling.

Quick Start

cd nanopore_live
./install.sh && ./install.sh --check

./run-realtime.sh --input /path/to/nanopore/run --outdir /path/to/output \
    --run_kraken --kraken_db /path/to/krakendb \
    --run_prokka --run_sketch --run_tetra

Pipeline Stages

Step Process Tool Description
1 VALIDATE_FASTQ BBMap Gzip integrity check + read repair
2 QC_BBDUK BBDuk Adapter and quality trimming
3 QC_FASTQ_FILTER fastq_filter (C) Length + quality filtering (streaming)
4 CONVERT_TO_FASTA awk Header cleanup and format conversion
5a KRAKEN2_CLASSIFY Kraken2 Taxonomic classification (batched per sample)
5b PROKKA_ANNOTATE Prokka ORF prediction + functional annotation
5c BAKTA_CDS / BAKTA_FULL Bakta CDS-only or full annotation (alternative to Prokka)
5d HMM_SEARCH HMMER3 Profile HMM search against user-supplied databases
5e SENDSKETCH BBTools Rapid taxonomic sketching via MinHash
5f TETRAMER_FREQ tetramer_freqs (C) Tetranucleotide composition profiles
6 DB_INTEGRATION R + DuckDB Load all results into DuckDB
7 DB_SYNC R + DuckDB Periodic sync during watch mode
8 CLEANUP bash Compress/delete source files after DB import

Parameters

Parameter Default Description
--input (required) Directory containing fastq_pass/
--outdir results Output directory
--run_kraken false Enable Kraken2 classification
--kraken_db (required if kraken) Path to Kraken2 database
--run_prokka false Enable Prokka annotation
--annotator bakta Gene annotator: prokka, bakta, or none
--bakta_db (required if bakta) Path to Bakta database
--bakta_full false Run full Bakta annotation (ncRNA/tRNA/CRISPR)
--hmm_databases (skip) Comma-delimited paths to HMM files
--run_sketch false Enable Sendsketch profiling
--run_tetra false Enable tetranucleotide frequency
--watch false Monitor for new files during live sequencing
--watch_glob */fastq_pass/barcode*/*.fastq.gz Glob pattern for watch mode
--db_sync_minutes 10 DuckDB sync interval in watch mode
--run_db_integration false Load results into DuckDB
--cleanup false Compress/delete files after DB import
--min_readlen 1500 Minimum read length after filtering
--keep_percent 80 Percent of reads to keep (by quality)
--min_file_size 1000000 Minimum FASTQ file size in bytes (1 MB)
--store_dir (none) Persistent cache directory (storeDir)

Outputs

results/
├── FLOWCELL/
│   ├── barcode01/
│   │   ├── fa/           FASTA (quality-filtered sequences)
│   │   ├── fq/           Intermediate FASTQ (BBDuk output)
│   │   ├── kraken/       *.tsv (per-read), *.report (summary)
│   │   ├── prokka/       SAMPLE/PROKKA_* (GFF, FAA, FFN, TSV)
│   │   ├── hmm/          *.DBNAME.tsv, *.DBNAME.tbl
│   │   ├── sketch/       Sendsketch profiles
│   │   ├── tetra/        Tetranucleotide frequencies
│   │   └── log.txt
│   └── ...
├── dana.duckdb           Integrated database
└── pipeline_info/        Nextflow reports (timeline, trace, DAG)

Profiles

Profile Use case
standard Local execution
test Small test files, reduced resources

Resource Requirements

Component CPUs RAM Notes
Kraken2 8 50-100 GB Depends on database size; serialized via maxForks=1
Prokka/Bakta 4 8 GB Per-file annotation
HMMER3 4 4 GB Per-file search
DuckDB integration 2 4 GB R-based import scripts

Watch Mode

Monitor a directory for new FASTQ files during active sequencing:

./run-realtime.sh --input /path/to/runs --outdir /path/to/output \
    --watch --db_sync_minutes 10 \
    --run_kraken --kraken_db /path/to/db \
    --run_prokka --run_db_integration

DB_SYNC runs as a long-lived process that periodically loads new results into DuckDB. R scripts are idempotent and track imports via import_log.

Post-DB Cleanup

The --cleanup flag compresses or deletes source files after DuckDB import:

Directory/Files Action
fa/*.fa Gzip in place (kept as compressed backup)
kraken/, sketch/, tetra/ Delete (data lives in DuckDB)
prokka/*/PROKKA_*.tsv Delete (loaded into DuckDB)
prokka/*/PROKKA_*.gff, .faa, .ffn Gzip in place
hmm/, dana.duckdb, log.txt Kept (not cleaned)

Safe for watch mode -- operates per-file and checks import_log before deleting.

DuckDB Integration

R scripts in bin/ load results into DuckDB:

Script Data
40_kraken_db.r Kraken2 classifications
41_krakenreport_db.r Kraken2 summary reports
42_prokka_db.r Prokka annotations
43_sketch_db.r Sendsketch profiles
44_tetra_db.r Tetranucleotide frequencies
45_stats_db.r Assembly statistics
46_log_db.r Processing logs
47_merge_db.r Merge per-run databases

Input

Oxford Nanopore directory structure with multiplexed barcodes:

input_dir/fastq_pass/
├── barcode01/*.fastq.gz
├── barcode02/*.fastq.gz
└── ...

Design Notes

  • Kraken2 batching. In batch mode, FASTAs are grouped per sample (groupTuple) so the DB loads once per barcode. In watch mode, files stream individually. Uses maxForks = 1 since Kraken2 loads 50-100 GB into RAM.
  • Compiled C tools. fastq_filter (QC) and tetramer_freqs (TNF) are compiled C binaries in bin/, replacing filtlong and a Python script.
  • Watch mode. Uses Channel.watchPath() with a flat glob pattern (Java WatchService limitation -- no recursive **).
  • Conda environments. Four isolated environments avoid dependency conflicts: dana-bbmap, dana-prokka, dana-bakta, dana-tools.