MAG Analysis Pipeline
Technology-agnostic downstream analysis for metagenome-assembled genomes (MAGs). Accepts a pre-computed assembly + depth table from any assembler and runs binning, annotation, taxonomy, metabolic profiling, mobile genetic element detection, eukaryotic analysis, ecosystem services mapping, phylogenetics, and interactive visualization.
Quick Start
cd mag_analysis
./install.sh && ./install.sh --check

# Basic run
./run-mag-analysis.sh \
  --assembly /path/to/assembly.fasta \
  --depths /path/to/depths.txt \
  --outdir /path/to/output \
  --annotator bakta --db_dir /path/to/databases

# All modules
./run-mag-analysis.sh \
  --assembly /path/to/assembly.fasta \
  --depths /path/to/depths.txt \
  --bam_dir /path/to/mapping/ \
  --outdir /path/to/output \
  --all --db_dir /path/to/databases

# Apptainer (HPC)
./run-mag-analysis.sh --apptainer \
  --assembly /path/to/assembly.fasta \
  --depths /path/to/depths.txt \
  --outdir /path/to/output \
  --all --db_dir /path/to/databases
Pipeline Stages
Binning
| Step | Process | Tool | Description |
|------|---------|------|-------------|
| 1a | BIN_SEMIBIN2 | SemiBin2 | Self-supervised contig binning (needs BAMs) |
| 1b | BIN_METABAT2 | MetaBAT2 | Depth + TNF binning (always runs) |
| 1c | BIN_MAXBIN2 | MaxBin2 | EM-based abundance binning |
| 1d | BIN_LORBIN | LorBin | Long-read-aware binning (needs BAMs) |
| 1e | BIN_COMEBIN | COMEBin | Contrastive learning binning (needs BAMs) |
| 1f | BIN_VAMB | VAMB | Variational autoencoder binning |
| 1g | BIN_VAMB_TAX | VAMB | Taxonomy-guided variational autoencoder binning |
| 2a | DASTOOL_CONSENSUS | DAS Tool | Score-based consensus of multiple binners |
| 2b | BINETTE_CONSENSUS | Binette | CheckM2-guided bin refinement |
| 2c | MAGSCOT_CONSENSUS | MAGScoT | Marker-gene-based consensus |
| 3 | CHECKM2 | CheckM2 | Completeness and contamination assessment |
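Consensus tools such as DAS Tool work from a contig-to-bin table per binner rather than from the bin FASTA files themselves. The sketch below shows one way such a table could be derived. It assumes each binner writes one FASTA file per bin into a single directory with a common extension; the function name `fasta_to_contig2bin` and that layout are illustrative assumptions, not the pipeline's actual implementation.

```shell
# Build a two-column table (contig<TAB>bin) from a directory that holds
# one FASTA file per bin. Directory layout and extension are assumptions.
fasta_to_contig2bin() {
    bin_dir=$1
    ext=${2:-fa}
    for fasta in "$bin_dir"/*."$ext"; do
        [ -e "$fasta" ] || continue      # glob matched nothing: skip the literal pattern
        bin_name=$(basename "$fasta" ".$ext")
        # Header lines start with '>'; the contig ID is the first word after it.
        awk -v bin="$bin_name" '/^>/ { sub(/^>/, "", $1); print $1 "\t" bin }' "$fasta"
    done
}
```

For example, `fasta_to_contig2bin /path/to/output/binning/metabat fa > metabat_contig2bin.tsv` would produce one row per contig, assuming the MetaBAT2 bins end in `.fa`.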
Annotation
| Step | Process | Tool | Description |
|------|---------|------|-------------|
| 4a | PROKKA_ANNOTATE | Prokka | ORF prediction + functional annotation |
| 4b | BAKTA_BASIC | Bakta | CDS annotation (lightweight) |
| 4c | BAKTA_EXTRA | Bakta | Full annotation (ncRNA, tRNA, CRISPR, sORFs) |
Taxonomy
| Step | Process | Tool | Description |
|------|---------|------|-------------|
| 5a | KAIJU_CONTIG_CLASSIFY | Kaiju | Protein-level contig classification |
| 5b | KAIJU_CLASSIFY | Kaiju | Protein-level MAG classification |
| 5c | KRAKEN2_CLASSIFY | Kraken2 | k-mer-based classification |
| 5d | SENDSKETCH_CLASSIFY | BBTools sendsketch | MinHash taxonomy (GTDB) |
| 5e | RNA_CLASSIFY | barrnap + BLAST | rRNA extraction + SILVA classification |
| 5f | GTDBTK_CLASSIFY | GTDB-Tk | Genome-based taxonomy (requires ~120 GB RAM) |
Metabolic Profiling
| Step | Process | Tool | Description |
|------|---------|------|-------------|
| 6a | KOFAMSCAN | KofamScan | KEGG ortholog assignment via HMM profiles |
| 6b | EMAPPER | eggNOG-mapper | Ortholog annotation (COG, GO, KEGG, CAZy) |
| 6c | DBCAN | dbCAN | CAZyme annotation via HMMER + DIAMOND searches |
| 6d | MERGE_ANNOTATIONS | R | Merge KofamScan + eggNOG + dbCAN per gene |
| 6e | MAP_TO_BINS | R | Assign annotations to MAGs |
| 6f | KEGG_MODULES | R | KEGG module completeness per MAG |
| 6g | MINPATH | MinPath | Pathway parsimony analysis |
| 6h | KEGG_DECODER | KEGG-Decoder | Metabolic pathway heatmaps |
| 6i | ANTISMASH | antiSMASH | Biosynthetic gene cluster detection |
| 6j | ECOSSDB_MAP | R | Ecosystem services mapping (CICES 5.2) |
| 6k | ECOSSDB_SCORE | R | Ecosystem service scoring per MAG |
| 6l | ECOSSDB_SDG | R | UN Sustainable Development Goal mapping |
| 6m | ECOSSDB_VIZ | R | Ecosystem services visualization |
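To give a feel for what "module completeness per MAG" means, here is a deliberately simplified sketch: it reports the fraction of a module's listed KOs that were found in one MAG. Real KEGG module definitions allow alternative enzymes and optional steps, and the pipeline's R implementation is not shown in this README, so treat the input formats and the function name `module_completeness` as assumptions made for illustration only.

```shell
# module_completeness MODULE_KO_TABLE MAG_KO_LIST
#   MODULE_KO_TABLE: module<TAB>KO, one listed KO per line (hypothetical format)
#   MAG_KO_LIST:     one KO identifier per line for a single MAG
# Prints: module<TAB>KOs found<TAB>KOs listed<TAB>percent found
module_completeness() {
    awk -F '\t' '
        FILENAME == ARGV[1] { present[$1] = 1; next }   # first file: KOs seen in the MAG
        { total[$1]++; if ($2 in present) hit[$1]++ }   # second file: module membership
        END {
            for (m in total)
                printf "%s\t%d\t%d\t%.1f\n", m, hit[m], total[m], 100 * hit[m] / total[m]
        }
    ' "$2" "$1" | sort
}
```

A flat fraction like this overestimates what is missing when a module offers alternative routes, which is one reason a dedicated module evaluator is used in practice.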
Mobile Genetic Elements
| Step | Process | Tool | Description |
|------|---------|------|-------------|
| 7a | GENOMAD_CLASSIFY | geNomad | Virus/plasmid identification |
| 7b | CHECKV_QUALITY | CheckV | Viral genome quality assessment |
| 7c | INTEGRONFINDER | IntegronFinder | Integron detection |
| 7d | ISLANDPATH_DIMOB | IslandPath-DIMOB | Genomic island prediction |
| 7e | MACSYFINDER | MacSyFinder | Secretion system detection |
| 7f | DEFENSEFINDER | DefenseFinder | Anti-phage defense system detection |
Eukaryotic Analysis
| Step | Process | Tool | Description |
|------|---------|------|-------------|
| 8a | TIARA_CLASSIFY | Tiara | Domain-level classification (eukaryote/prokaryote) |
| 8b | WHOKARYOTE_CLASSIFY | Whokaryote | Eukaryote/prokaryote classifier |
| 8c | METAEUK_PREDICT | MetaEuk | Eukaryotic gene prediction |
| 8d | MARFERRET_CLASSIFY | MarFERReT | Marine eukaryote taxonomy |
Other
| Step | Process | Tool | Description |
|------|---------|------|-------------|
| 9 | CALCULATE_GENE_DEPTHS | samtools + R | Per-gene coverage from BAMs |
| 10 | VIZ_PREPROCESS | R + Svelte | Interactive dashboard generation |
Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| --assembly | (required) | Assembly FASTA file |
| --depths | (required) | Depth matrix (MetaBAT2 format) |
| --bam_dir | (optional) | Directory with sorted BAMs for BAM-based binners |
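Because the depth table may come from any upstream assembler workflow, a quick format check before launching can save a failed run. The sketch below assumes the layout written by MetaBAT2's `jgi_summarize_bam_contig_depths`: a tab-separated header starting with `contigName`, `contigLen`, `totalAvgDepth`, followed by one depth column and one variance column per sample. The function name `check_depth_table` is hypothetical and not part of the pipeline.

```shell
# Return 0 if the file looks like a MetaBAT2-format depth table, 1 otherwise.
check_depth_table() {
    awk -F '\t' '
        NR == 1 {
            ncol = NF
            if ($1 != "contigName" || $2 != "contigLen" || $3 != "totalAvgDepth") {
                print "unexpected header columns" > "/dev/stderr"; bad = 1; exit
            }
            # After the three fixed columns, samples come as depth/variance pairs.
            if (NF < 5 || (NF - 3) % 2 != 0) {
                print "expected depth/variance column pairs" > "/dev/stderr"; bad = 1; exit
            }
            next
        }
        NF != ncol {
            print "line " NR ": expected " ncol " fields, found " NF > "/dev/stderr"
            bad = 1; exit
        }
        END { exit bad }
    ' "$1"
}
```

Usage could be as simple as `check_depth_table /path/to/depths.txt && ./run-mag-analysis.sh ...`.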
Binning
| Parameter | Default | Description |
|-----------|---------|-------------|
| --run_semibin | false | Include SemiBin2 (needs --bam_dir) |
| --run_maxbin | false | Include MaxBin2 |
| --run_lorbin | false | Include LorBin (needs --bam_dir) |
| --run_comebin | false | Include COMEBin (needs --bam_dir) |
| --run_vamb | false | Include VAMB |
| --run_vamb_tax | false | Include taxonomy-guided VAMB |
| --run_binette | false | Run Binette consensus (needs --checkm2_db) |
| --run_magscot | false | Run MAGScoT consensus |
| --metabat_min_cls | 50000 | MetaBAT2 minimum cluster size |
| --lorbin_min_length | 80000 | LorBin minimum contig length |
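Three of the binner flags above only make sense when `--bam_dir` is supplied. A small pre-flight check along the following lines could catch a missing directory before the run starts. Both function names are hypothetical, and the check only looks for `*.bam` files; it does not verify that they are sorted.

```shell
# Print the flags from "$@" that need --bam_dir, per the table above.
bam_dependent_flags() {
    for flag in "$@"; do
        case $flag in
            --run_semibin|--run_lorbin|--run_comebin) printf '%s\n' "$flag" ;;
        esac
    done
}

# preflight_bam_check BAM_DIR FLAG...
# Succeeds if no BAM-based binner is requested, or if BAM_DIR holds a .bam file.
preflight_bam_check() {
    bam_dir=$1
    shift
    needing=$(bam_dependent_flags "$@")
    [ -n "$needing" ] || return 0          # no BAM-based binner requested
    if [ -z "$bam_dir" ]; then
        echo "--bam_dir is required by: $(echo $needing)" >&2
        return 1
    fi
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] && return 0          # at least one BAM file is present
    done
    echo "no .bam files found in $bam_dir" >&2
    return 1
}
```

For instance, `preflight_bam_check /path/to/mapping --run_semibin --run_vamb` returns non-zero when the mapping directory contains no BAM files.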
Annotation
| Parameter | Default | Description |
|-----------|---------|-------------|
| --annotator | bakta | Gene annotator: prokka, bakta, or none |
| --bakta_db | (required if bakta) | Path to Bakta database |
| --bakta_extra | false | Run full Bakta annotation |
Taxonomy
| Parameter | Default | Description |
|-----------|---------|-------------|
| --run_kaiju | false | Run Kaiju classification |
| --kaiju_db | (required if kaiju) | Kaiju database path |
| --run_kraken2 | false | Run Kraken2 classification |
| --kraken2_db | (required if kraken2) | Kraken2 database path |
| --run_sendsketch | false | Run sendsketch GTDB taxonomy |
| --run_rrna | false | Run rRNA classification (SILVA) |
| --silva_ssu_db | (required if rrna) | SILVA SSU database path |
| --rrna_min_identity | 0.80 | Minimum identity for rRNA BLAST |
| --run_gtdbtk | false | Run GTDB-Tk classification |
| --gtdbtk_db | (required if gtdbtk) | GTDB-Tk database path |
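The `--rrna_min_identity` default of 0.80 is written as a fraction, whereas BLAST tabular output (`-outfmt 6`) reports percent identity on a 0 to 100 scale in its third column. The sketch below shows how such a threshold could be applied to a hits file. It assumes the pipeline treats the parameter as a fraction of the BLAST percent identity; the function name `filter_blast_identity` is illustrative only.

```shell
# filter_blast_identity MIN_FRACTION HITS_TSV
# Keep BLAST outfmt-6 rows whose percent identity (column 3, 0-100 scale)
# is at least MIN_FRACTION (0-1 scale).
filter_blast_identity() {
    awk -F '\t' -v min="$1" '$3 / 100 >= min' "$2"
}
```

Dividing the BLAST value by 100, rather than multiplying the threshold, keeps the comparison on the scale the parameter is expressed in.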
Metabolic Profiling
| Parameter | Default | Description |
|-----------|---------|-------------|
| --run_metabolism | false | Enable metabolic profiling |
| --kofam_db | (required if metabolism) | KofamScan database path |
| --eggnog_db | (required if metabolism) | eggNOG database path |
| --dbcan_db | (required if metabolism) | dbCAN database path |
| --emapper_batch_size | 50000 | eggNOG-mapper batch size (proteins) |
| --run_antismash | false | Run antiSMASH BGC detection |
| --antismash_db | (required if antismash) | antiSMASH database path |
| --run_ecossdb | true | Enable ecosystem services mapping |
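The `--emapper_batch_size` setting implies that the protein FASTA is cut into chunks of a fixed number of sequences before eggNOG-mapper runs. How the pipeline does this internally is not documented here; the following is only a sketch of the idea, with a hypothetical function name and output naming scheme.

```shell
# split_fasta INPUT_FAA BATCH_SIZE OUTPUT_PREFIX
# Write consecutive batches of BATCH_SIZE records to PREFIX0001.faa, PREFIX0002.faa, ...
split_fasta() {
    awk -v size="$2" -v prefix="$3" '
        /^>/ {
            if (n % size == 0) {                 # start a new batch file
                if (out) close(out)
                out = sprintf("%s%04d.faa", prefix, n / size + 1)
            }
            n++
        }
        out { print > (out) }                    # lines before the first header are dropped
    ' "$1"
}
```

With the default of 50000, `split_fasta proteins.faa 50000 batch_` would turn a 120,000-protein file into three batches, the last one holding 20,000 records.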
Mobile Genetic Elements
| Parameter | Default | Description |
|-----------|---------|-------------|
| --run_genomad | false | Run geNomad virus/plasmid detection |
| --genomad_db | (required if genomad) | geNomad database path |
| --run_checkv | false | Run CheckV quality assessment |
| --checkv_db | (required if checkv) | CheckV database path |
| --run_integronfinder | false | Run IntegronFinder |
| --run_islandpath | false | Run IslandPath-DIMOB |
| --run_macsyfinder | false | Run MacSyFinder |
| --run_defensefinder | false | Run DefenseFinder |
Eukaryotic Analysis
| Parameter | Default | Description |
|-----------|---------|-------------|
| --run_eukaryotic | false | Enable eukaryotic classification |
| --tiara_min_len | 3000 | Tiara minimum contig length |
| --whokaryote_min_len | 5000 | Whokaryote minimum contig length |
| --run_metaeuk | false | Run MetaEuk gene prediction |
| --metaeuk_db | (required if metaeuk) | MetaEuk database path |
| --run_marferret | false | Run MarFERReT taxonomy |
| --marferret_db | (required if marferret) | MarFERReT database path |
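The two minimum-length settings decide which contigs the eukaryotic classifiers ever see, so it can be useful to preview how much of an assembly survives each cutoff. The sketch below is a standalone length filter written for that purpose. It is not how the pipeline applies the thresholds (the classifiers handle that themselves), and the function name `filter_fasta_min_len` is hypothetical.

```shell
# filter_fasta_min_len INPUT_FASTA MIN_LEN
# Print records whose sequence is at least MIN_LEN bases long.
# Multi-line sequences are joined, so output sequences are unwrapped.
filter_fasta_min_len() {
    awk -v min="$2" '
        function emit() {
            if (header != "" && length(seq) >= min) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$1"
}
```

For example, `filter_fasta_min_len assembly.fasta 3000 | grep -c '^>'` counts the contigs that meet the Tiara default, and the same call with 5000 does so for the Whokaryote default.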
Convenience
| Parameter | Default | Description |
|-----------|---------|-------------|
| --all | false | Enable all analysis modules |
| --db_dir | (none) | Auto-resolve database paths from standard layout |
| --store_dir | (none) | Persistent cache directory (storeDir) |
| --run_viz | false | Build interactive Svelte dashboard |
| --viz_port | 5174 | Dashboard dev server port |
Outputs
results/
├── binning/
│ ├── semibin/ SemiBin2 bins
│ ├── metabat/ MetaBAT2 bins
│ ├── maxbin/ MaxBin2 bins
│ ├── lorbin/ LorBin bins
│ ├── comebin/ COMEBin bins
│ ├── vamb/ VAMB bins
│ ├── dastool/ DAS Tool consensus bins
│ ├── binette/ Binette refined bins
│ ├── magscot/ MAGScoT consensus bins
│ └── checkm2/ Quality assessment (completeness/contamination)
├── annotation/
│ ├── prokka/ or bakta/ Gene annotations (GFF, FAA, FFN)
│ └── bakta_extra/ Full Bakta output (if enabled)
├── taxonomy/
│ ├── kaiju/ Protein-level classification
│ ├── kraken2/ k-mer classification
│ ├── sendsketch/ GTDB MinHash taxonomy
│ ├── rrna/ rRNA SILVA classification
│ └── gtdbtk/ GTDB-Tk genome taxonomy
├── metabolism/
│ ├── kofamscan/ KEGG ortholog assignments
│ ├── emapper/ eggNOG annotations
│ ├── dbcan/ CAZyme annotations
│ ├── merged/ Merged annotation table
│ ├── per_mag/ Per-MAG annotation summaries
│ ├── modules/ KEGG module completeness
│ ├── minpath/ MinPath pathway analysis
│ ├── kegg_decoder/ Pathway heatmaps
│ ├── antismash/ Biosynthetic gene clusters
│ └── ecossdb/ Ecosystem services (CICES + SDG)
├── mge/
│ ├── genomad/ Virus/plasmid predictions
│ ├── checkv/ Viral quality assessment
│ ├── integrons/ Integron predictions
│ ├── islandpath/ Genomic islands
│ ├── macsyfinder/ Secretion systems
│ └── defensefinder/ Defense systems
├── eukaryotic/
│ ├── tiara/ Domain classification
│ ├── whokaryote/ Eukaryote/prokaryote predictions
│ ├── metaeuk/ Eukaryotic gene predictions
│ └── marferret/ Marine eukaryote taxonomy
├── viz/ Interactive Svelte dashboard
└── pipeline_info/ Nextflow reports (timeline, trace, DAG)
Profiles
| Profile | Use case |
|---------|----------|
| standard | Local execution (default) |
| test | Small test data, reduced resources (4 CPUs, 8 GB) |
Resource Requirements
| Component | CPUs | RAM | Notes |
|-----------|------|-----|-------|
| MetaBAT2 / MaxBin2 | 16 | 60 GB | Default process_high label |
| SemiBin2 | 8 | 16 GB | GPU optional |
| COMEBin | 8 | 16 GB | GPU optional |
| Prokka / Bakta | 16 | 60 GB | Per-bin annotation |
| Bakta (extra) | 16 | 24 GB | Full annotation mode |
| eggNOG-mapper | 16 | 48 GB | Batched protein annotation |
| MetaEuk | 16 | 32 GB | Eukaryotic gene prediction |
| GTDB-Tk | 16 | 120 GB | Most memory-intensive step |
| Kraken2 | 8 | 24 GB | Depends on database |
| CheckM2 | 16 | 60 GB | Quality assessment |
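Since GTDB-Tk alone needs about 120 GB, it may be worth checking the host's memory before enabling it on a workstation. The sketch below reads `MemTotal` from `/proc/meminfo`, so it is Linux-only, and it reports whole GiB, which differs slightly from decimal GB. Both function names are hypothetical.

```shell
# Print total system memory in whole GiB (reads /proc/meminfo unless a file is given).
mem_total_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# has_ram_for REQUIRED_GB [MEMINFO_FILE]
# Succeeds when the machine has at least REQUIRED_GB of memory.
has_ram_for() {
    [ "$(mem_total_gb "$2")" -ge "$1" ]
}
```

A run script could then do something like `has_ram_for 120 || echo "GTDB-Tk may not fit in memory on this host" >&2`. On a cluster, the scheduler's allocation matters more than the node total.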
Databases
Download all databases in one step:
./download-databases.sh --all --dir /path/to/databases
Or download individually:
./download-databases.sh --checkm2 # CheckM2 (~3 GB)
./download-databases.sh --genomad # geNomad (~3 GB)
./download-databases.sh --checkv # CheckV (~2 GB)
./download-databases.sh --kaiju # Kaiju nr_euk (~60 GB)
./download-databases.sh --gtdbtk # GTDB-Tk (~85 GB)
Use --db_dir to auto-resolve paths from a standard layout:
./run-mag-analysis.sh --db_dir /path/to/databases ...
Design Notes
- Seven-binner consensus. SemiBin2, MetaBAT2, MaxBin2, LorBin, COMEBin, VAMB, VAMB-tax feed into DAS Tool + Binette + MAGScoT for consensus refinement.
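- Modular activation. Most analysis modules are off by default; use --run_* flags or --all to enable them selectively. The exceptions are MetaBAT2 binning, which always runs, and --run_ecossdb, which defaults to true.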
- Technology-agnostic. Accepts assemblies from Nanopore, Illumina, HiFi, or any external assembler. Only requires FASTA + MetaBAT2-format depth table.
- Persistent caching. Use --store_dir to skip completed processes across runs, even after work/ cleanup.
- MIMAG quality standards. MAGs are graded from CheckM2 completeness and contamination estimates as high (>90% completeness, <5% contamination), medium (>50%, <10%), or low quality.
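
The quality tiers in the last note can be written as a small decision rule. The sketch below applies exactly the thresholds stated above to a completeness/contamination pair; the function name `classify_mimag` is hypothetical. Note that the full MIMAG definition of a high-quality draft also asks for rRNA and tRNA genes, which this sketch does not check.

```shell
# classify_mimag COMPLETENESS CONTAMINATION   (both in percent)
# Prints "high", "medium", or "low" using the thresholds from the note above.
classify_mimag() {
    awk -v comp="$1" -v cont="$2" 'BEGIN {
        if (comp > 90 && cont < 5)       print "high"
        else if (comp > 50 && cont < 10) print "medium"
        else                             print "low"
    }'
}
```

For example, `classify_mimag 95.2 1.3` prints `high`, while a bin at 95% completeness with 7% contamination drops to `medium`.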