Pipeline Stages
Stage A: Preprocessing and DADA2
| Step | Process | Tool | Description |
|---|---|---|---|
| 1 | DEMULTIPLEX | `Mr_Demuxy` | Optional inner-barcode demultiplexing |
| 2 | REMOVE_PRIMERS | `cutadapt` | Primer trimming (auto-selects by 16S/18S/ITS prefix) |
| 3 | DADA2_FILTER_TRIM | `papa2` | Per-sample quality filtering (maxEE, truncQ, PhiX) |
| 4 | DADA2_LEARN_ERRORS | `papa2` | Per-plate error model learning (LOESS) |
| 5 | DADA2_DENOISE | `papa2` | Denoising, pair merging, per-plate chimera removal |
| 6 | MERGE_SEQTABS | `papa2` | Merge per-plate tables (long-format) |
| 7 | REMOVE_CHIMERAS | `papa2` | Consensus chimera removal on merged data |
| 8 | FILTER_SEQTAB | `microscape` | Length, prevalence, abundance, and depth filtering |
Per-plate processing
Steps 3–5 run independently per plate (samples sharing PCR history). This allows per-plate error learning while scaling to hundreds of plates.
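
The per-plate grouping can be sketched as follows. This is a minimal illustration rather than the pipeline's own code: it assumes that sample IDs carry the plate name as a prefix before the first underscore (as in `plate1_A01`), and `group_by_plate` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_plate(sample_ids):
    """Group sample IDs by plate, assuming a '<plate>_<well>' naming scheme."""
    plates = defaultdict(list)
    for sample_id in sample_ids:
        # Assumption: everything before the first underscore names the plate.
        plate, _, _well = sample_id.partition("_")
        plates[plate].append(sample_id)
    return dict(plates)


samples = ["plate1_A01", "plate1_A02", "plate2_A01"]
for plate, members in group_by_plate(samples).items():
    # Each plate would get its own filter, error-model, and denoise run.
    print(plate, members)
```

Because each group is independent, the groups can be dispatched as separate jobs, which is what lets the number of plates grow without changing the per-plate work.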
Stage B: Taxonomy, Phylogeny, and Normalization
| Step | Process | Tool | Description |
|---|---|---|---|
| 9 | ASSIGN_TAXONOMY | `papa2` | Naive Bayesian classification (parallel per ref DB) |
| 10 | BUILD_PHYLOGENY | `microscape` + MAFFT | MSA + NJ tree (optional, `--run_phylogeny`) |
| 11 | RENORMALIZE | `microscape` | Group ASVs by taxonomy, normalize within groups |
Multiple databases
Supply multiple reference databases with `--ref_databases`; taxonomy is
assigned against each database in parallel, and the results are merged in the output.
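
As a rough sketch of what one database entry contains, the snippet below splits a single specification into its parts. The `name:path:levels` layout is inferred from the example invocation on this page, not from the pipeline's source; how several entries are delimited is not shown here, so the sketch handles one entry at a time. `RefDatabase` and `parse_ref_database` are hypothetical names.

```python
from typing import NamedTuple


class RefDatabase(NamedTuple):
    name: str
    path: str
    levels: list


def parse_ref_database(spec):
    """Parse one 'name:path:level1,level2,...' entry (layout assumed from the example)."""
    # The name is everything before the first colon.
    name, _, rest = spec.partition(":")
    # The taxonomic levels are everything after the last colon.
    path, _, levels = rest.rpartition(":")
    if not (name and path and levels):
        raise ValueError(f"expected 'name:path:levels', got {spec!r}")
    return RefDatabase(name, path, levels.split(","))


db = parse_ref_database("silva:/db/silva.fasta:Domain,Phylum,Class,Order,Family,Genus")
print(db.name, db.path, len(db.levels))
```

Splitting the name from the left and the levels from the right keeps the path intact even if it happens to contain a colon.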
Stage C: Ordination, Networks, and Visualization
| Step | Process | Tool | Description |
|---|---|---|---|
| 12 | LOAD_METADATA | `microscape` | Sample metadata integration (MIMARKS) |
| 13 | CLUSTER_TSNE | `microscape` | Bray-Curtis + t-SNE ordination (samples and ASVs) |
| 14 | NETWORK_ANALYSIS | `microscape` | SparCC-style CLR correlation networks |
| 15 | EXPORT_VIZ | `microscape` | JSON export for Svelte web viewer |
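
Steps 13 and 14 rest on two standard quantities, the Bray-Curtis dissimilarity and the centered log-ratio (CLR) transform. The sketch below shows the textbook definitions in plain Python; it is not the `microscape` implementation, and the pseudocount of 0.5 is an assumption chosen only to avoid taking the logarithm of zero.

```python
import math


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples given as {sequence: count} dicts."""
    sequences = set(counts_a) | set(counts_b)
    # Sum of absolute count differences over every sequence seen in either sample.
    diff = sum(abs(counts_a.get(s, 0) - counts_b.get(s, 0)) for s in sequences)
    total = sum(counts_a.values()) + sum(counts_b.values())
    return diff / total if total else 0.0


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform; the pseudocount (an assumption) avoids log(0)."""
    logs = {s: math.log(c + pseudocount) for s, c in counts.items()}
    mean_log = sum(logs.values()) / len(logs)
    # Subtracting the mean log centers the transformed values on zero.
    return {s: v - mean_log for s, v in logs.items()}


a = {"ASV1": 10, "ASV2": 0, "ASV3": 30}
b = {"ASV1": 20, "ASV2": 10, "ASV3": 10}
print(bray_curtis(a, b))  # 0.5
print(clr(a))
```

Both functions take sparse per-sample dictionaries, so sequences absent from a sample never need to be stored as explicit zeros.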
Data Flow
The pipeline uses long-format DataFrames as its canonical representation from the merge step onward:
```
sample       sequence            count
plate1_A01   TACGGAGGATGCGA...   1523
plate1_A01   TACGGAGGATCCGA...   847
plate1_A02   TACGGAGGATGCGA...   2041
```
This avoids materializing a dense matrix (samples × ASVs), which can exceed memory for large datasets (4K+ samples, 100K+ ASVs).
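
To make the trade-off concrete, here is a small sketch using only the standard library. It computes per-sample depth and per-ASV prevalence directly from long-format rows, then estimates the size of the equivalent dense matrix. The helper names are hypothetical, the sequence strings are the truncated placeholders from the example above, and 8 bytes per cell is an assumption (one 64-bit value per entry).

```python
from collections import defaultdict

# Long-format rows: (sample, sequence, count), one row per nonzero observation.
records = [
    ("plate1_A01", "TACGGAGGATGCGA...", 1523),
    ("plate1_A01", "TACGGAGGATCCGA...", 847),
    ("plate1_A02", "TACGGAGGATGCGA...", 2041),
]


def sample_depths(rows):
    """Total read count per sample."""
    depth = defaultdict(int)
    for sample, _sequence, count in rows:
        depth[sample] += count
    return dict(depth)


def asv_prevalence(rows):
    """Number of samples in which each sequence has a nonzero count."""
    samples_by_seq = defaultdict(set)
    for sample, sequence, count in rows:
        if count > 0:
            samples_by_seq[sequence].add(sample)
    return {seq: len(found) for seq, found in samples_by_seq.items()}


def dense_matrix_bytes(n_samples, n_asvs, bytes_per_cell=8):
    """Memory for a dense samples x ASVs matrix (8-byte cells assumed)."""
    return n_samples * n_asvs * bytes_per_cell


print(sample_depths(records))  # {'plate1_A01': 2370, 'plate1_A02': 2041}
print(asv_prevalence(records))
print(dense_matrix_bytes(4_000, 100_000) / 1e9, "GB")  # 3.2 GB
```

Under these assumptions a 4,000 × 100,000 dense matrix needs about 3.2 GB regardless of how many cells are zero, while the long format grows only with the number of nonzero observations.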
R Alternative
All stages support an R alternative via `--lang R`:
- DADA2 steps use `bioconductor-dada2`
- Downstream steps use `microscapeR`
- R scripts in `bin/*.R` mirror the Python functionality
```bash
nextflow run rec3141/microscape-nf \
  --input /path/to/reads \
  --lang R \
  --ref_databases "silva:/db/silva.fasta:Domain,Phylum,Class,Order,Family,Genus" \
  -profile conda -resume
```
Profiles
| Profile | Description |
|---|---|
| `conda` | Auto-create conda environments from `envs/*.yml` |
| `docker` | Use Docker container |
| `singularity` | Use Singularity/Apptainer container |
| `slurm` | Submit jobs to SLURM cluster |
| `test` | Reduced resources for CI/testing |