# Real-Time Nanopore Pipeline
Real-time metagenomic analysis for Oxford Nanopore sequencing data. Processes reads as they stream from MinKNOW, providing live taxonomic classification, gene annotation, and functional profiling.
## Quick Start

```shell
cd nanopore_live
./install.sh && ./install.sh --check
./run-realtime.sh --input /path/to/nanopore/run --outdir /path/to/output \
    --run_kraken --kraken_db /path/to/krakendb \
    --run_prokka --run_sketch --run_tetra
```
## Pipeline Stages

| Step | Process | Tool | Description |
|---|---|---|---|
| 1 | `VALIDATE_FASTQ` | BBMap | Gzip integrity check + read repair |
| 2 | `QC_BBDUK` | BBDuk | Adapter and quality trimming |
| 3 | `QC_FASTQ_FILTER` | `fastq_filter` (C) | Length + quality filtering (streaming) |
| 4 | `CONVERT_TO_FASTA` | awk | Header cleanup and format conversion |
| 5a | `KRAKEN2_CLASSIFY` | Kraken2 | Taxonomic classification (batched per sample) |
| 5b | `PROKKA_ANNOTATE` | Prokka | ORF prediction + functional annotation |
| 5c | `BAKTA_CDS` / `BAKTA_FULL` | Bakta | CDS-only or full annotation (alternative to Prokka) |
| 5d | `HMM_SEARCH` | HMMER3 | Profile HMM search against user-supplied databases |
| 5e | `SENDSKETCH` | BBTools | Rapid taxonomic sketching via MinHash |
| 5f | `TETRAMER_FREQ` | `tetramer_freqs` (C) | Tetranucleotide composition profiles |
| 6 | `DB_INTEGRATION` | R + DuckDB | Load all results into DuckDB |
| 7 | `DB_SYNC` | R + DuckDB | Periodic sync during watch mode |
| 8 | `CLEANUP` | bash | Compress/delete source files after DB import |
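Step 4's conversion is a thin awk pass. A minimal sketch of the idea (not the pipeline's exact script): rewrite each `@` header to `>`, keep only the read ID before the first whitespace, and emit the sequence line.

```shell
# Sketch of CONVERT_TO_FASTA (step 4); the pipeline's actual awk may differ.
printf '@read1 runid=abc ch=42\nACGTACGT\n+\n!!!!!!!!\n' \
  | awk 'NR % 4 == 1 { sub(/^@/, ">"); print $1 }  # header: "@" -> ">", drop metadata
         NR % 4 == 2 { print }'                    # sequence line passes through
# -> >read1
#    ACGTACGT
```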
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `--input` | (required) | Directory containing `fastq_pass/` |
| `--outdir` | `results` | Output directory |
| `--run_kraken` | `false` | Enable Kraken2 classification |
| `--kraken_db` | (required if kraken) | Path to Kraken2 database |
| `--run_prokka` | `false` | Enable Prokka annotation |
| `--annotator` | `bakta` | Gene annotator: `prokka`, `bakta`, or `none` |
| `--bakta_db` | (required if bakta) | Path to Bakta database |
| `--bakta_full` | `false` | Run full Bakta annotation (ncRNA/tRNA/CRISPR) |
| `--hmm_databases` | (skip) | Comma-delimited paths to HMM files |
| `--run_sketch` | `false` | Enable Sendsketch profiling |
| `--run_tetra` | `false` | Enable tetranucleotide frequency profiling |
| `--watch` | `false` | Monitor for new files during live sequencing |
| `--watch_glob` | `*/fastq_pass/barcode*/*.fastq.gz` | Glob pattern for watch mode |
| `--db_sync_minutes` | `10` | DuckDB sync interval (minutes) in watch mode |
| `--run_db_integration` | `false` | Load results into DuckDB |
| `--cleanup` | `false` | Compress/delete files after DB import |
| `--min_readlen` | `1500` | Minimum read length after filtering |
| `--keep_percent` | `80` | Percent of reads to keep (by quality) |
| `--min_file_size` | `1000000` | Minimum FASTQ file size in bytes (1 MB) |
| `--store_dir` | (none) | Persistent cache directory (`storeDir`) |
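The `--min_file_size` gate, expressed as a standalone check (illustrative only; `passes_size_gate` is not a pipeline function, and the pipeline applies this threshold internally):

```shell
# Skip FASTQ files smaller than --min_file_size (default 1000000 bytes = 1 MB).
min_file_size=1000000

passes_size_gate() {
  # wc -c is portable across GNU/BSD, unlike stat's platform-specific flags
  [ "$(wc -c < "$1")" -ge "$min_file_size" ]
}

tmp=$(mktemp)
printf 'x' > "$tmp"   # 1-byte file, far below the threshold
if passes_size_gate "$tmp"; then echo "process $tmp"; else echo "skip $tmp"; fi
rm -f "$tmp"
```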
## Outputs

```
results/
├── FLOWCELL/
│   ├── barcode01/
│   │   ├── fa/        FASTA (quality-filtered sequences)
│   │   ├── fq/        Intermediate FASTQ (BBDuk output)
│   │   ├── kraken/    *.tsv (per-read), *.report (summary)
│   │   ├── prokka/    SAMPLE/PROKKA_* (GFF, FAA, FFN, TSV)
│   │   ├── hmm/       *.DBNAME.tsv, *.DBNAME.tbl
│   │   ├── sketch/    Sendsketch profiles
│   │   ├── tetra/     Tetranucleotide frequencies
│   │   └── log.txt
│   └── ...
├── dana.duckdb        Integrated database
└── pipeline_info/     Nextflow reports (timeline, trace, DAG)
```
## Profiles

| Profile | Use case |
|---|---|
| `standard` | Local execution |
| `test` | Small test files, reduced resources |
## Resource Requirements

| Component | CPUs | RAM | Notes |
|---|---|---|---|
| Kraken2 | 8 | 50-100 GB | Depends on database size; serialized via `maxForks = 1` |
| Prokka/Bakta | 4 | 8 GB | Per-file annotation |
| HMMER3 | 4 | 4 GB | Per-file search |
| DuckDB integration | 2 | 4 GB | R-based import scripts |
## Watch Mode

Monitor a directory for new FASTQ files during active sequencing:

```shell
./run-realtime.sh --input /path/to/runs --outdir /path/to/output \
    --watch --db_sync_minutes 10 \
    --run_kraken --kraken_db /path/to/db \
    --run_prokka --run_db_integration
```
`DB_SYNC` runs as a long-lived process that periodically loads new results into DuckDB. The R scripts are idempotent and track imports via `import_log`.
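A simplified shell sketch of that idempotency contract (a flat file stands in here for the pipeline's `import_log`, which the R scripts actually maintain in DuckDB):

```shell
# Idempotent import sketch: each result file is loaded at most once.
import_log=$(mktemp)

maybe_import() {
  if grep -qxF "$1" "$import_log"; then
    echo "skip (already imported): $1"
  else
    echo "import: $1"        # real pipeline: Rscript bin/40_kraken_db.r, etc.
    echo "$1" >> "$import_log"
  fi
}

maybe_import kraken/barcode01.tsv   # first sync imports
maybe_import kraken/barcode01.tsv   # later syncs skip, so no duplicate rows
rm -f "$import_log"
```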
## Post-DB Cleanup

The `--cleanup` flag compresses or deletes source files after DuckDB import:

| Directory/Files | Action |
|---|---|
| `fa/*.fa` | Gzip in place (kept as compressed backup) |
| `kraken/`, `sketch/`, `tetra/` | Delete (data lives in DuckDB) |
| `prokka/*/PROKKA_*.tsv` | Delete (loaded into DuckDB) |
| `prokka/*/PROKKA_*.gff`, `.faa`, `.ffn` | Gzip in place |
| `hmm/`, `dana.duckdb`, `log.txt` | Kept (not cleaned) |

Safe for watch mode -- operates per-file and checks `import_log` before deleting.
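Put together, the sweep for one barcode directory looks roughly like this (an approximation; the real script also consults `import_log` before touching anything):

```shell
# Approximate per-barcode cleanup after DuckDB import (paths per the Outputs section).
cleanup_barcode() {
  dir=$1
  gzip -f "$dir"/fa/*.fa 2>/dev/null || true        # keep FASTA as compressed backup
  rm -rf "$dir/kraken" "$dir/sketch" "$dir/tetra"   # data now lives in DuckDB
  rm -f "$dir"/prokka/*/PROKKA_*.tsv                # loaded into DuckDB
  gzip -f "$dir"/prokka/*/PROKKA_*.gff "$dir"/prokka/*/PROKKA_*.faa \
          "$dir"/prokka/*/PROKKA_*.ffn 2>/dev/null || true  # keep compressed
  # hmm/ and log.txt are intentionally left untouched
}

# Demo on a throwaway directory:
d=$(mktemp -d)
mkdir -p "$d/fa" "$d/kraken"
printf '>r1\nACGT\n' > "$d/fa/s.fa"
cleanup_barcode "$d"
ls "$d"   # fa remains (s.fa is now s.fa.gz); kraken is gone
rm -rf "$d"
```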
## DuckDB Integration

R scripts in `bin/` load results into DuckDB:

| Script | Data |
|---|---|
| `40_kraken_db.r` | Kraken2 classifications |
| `41_krakenreport_db.r` | Kraken2 summary reports |
| `42_prokka_db.r` | Prokka annotations |
| `43_sketch_db.r` | Sendsketch profiles |
| `44_tetra_db.r` | Tetranucleotide frequencies |
| `45_stats_db.r` | Assembly statistics |
| `46_log_db.r` | Processing logs |
| `47_merge_db.r` | Merge per-run databases |
## Input

Oxford Nanopore directory structure with multiplexed barcodes (layout as implied by the default `--watch_glob` pattern):

```
RUN_DIR/
└── FLOWCELL/
    └── fastq_pass/
        ├── barcode01/
        │   ├── *.fastq.gz
        │   └── ...
        └── barcode02/
            └── ...
```
## Design Notes

- **Kraken2 batching.** In batch mode, FASTAs are grouped per sample (`groupTuple`) so the database loads once per barcode. In watch mode, files stream individually. Uses `maxForks = 1` since Kraken2 loads 50-100 GB into RAM.
- **Compiled C tools.** `fastq_filter` (QC) and `tetramer_freqs` (TNF) are compiled C binaries in `bin/`, replacing filtlong and a Python script.
- **Watch mode.** Uses `Channel.watchPath()` with a flat glob pattern (Java WatchService limitation -- no recursive `**`).
- **Conda environments.** Four isolated environments avoid dependency conflicts: `dana-bbmap`, `dana-prokka`, `dana-bakta`, `dana-tools`.
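The tetranucleotide profiling that `tetramer_freqs` performs in C can be illustrated with a short awk sketch: slide a 4 bp window over each sequence and count occurrences (raw counts only, no normalization; the real tool's output format differs).

```shell
# Count 4-mers in a FASTA record (the core idea behind TETRAMER_FREQ).
printf '>r1\nACGTACGTAC\n' \
  | awk '/^>/ { next }                              # skip FASTA headers
         { for (i = 1; i <= length($0) - 3; i++)    # slide a 4 bp window
             n[substr($0, i, 4)]++ }
         END { for (k in n) print k, n[k] }' \
  | sort
# -> ACGT 2
#    CGTA 2
#    GTAC 2
#    TACG 1
```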