Illumina Assembly Pipeline¶

Multi-assembler consensus pipeline for Illumina paired-end metagenomic reads. Produces deduplicated assemblies with depth tables and BAMs for downstream MAG analysis.

Quick Start¶

cd illumina_assembly
./install.sh && ./install.sh --check

# Local (conda)
./run-illumina-assembly.sh --input /path/to/reads --outdir /path/to/output

# Co-assembly mode
./run-illumina-assembly.sh --input /path/to/reads --outdir /path/to/output --coassembly

# Apptainer (HPC) -- auto-pulls SIF on first run
./run-illumina-assembly.sh --apptainer --pull --input /path/to/reads --outdir /path/to/output

# SLURM profile
./run-illumina-assembly.sh --input /path/to/reads --outdir /path/to/output \
    -profile slurm --slurm_account def-myaccount \
    --conda_path ~/scratch/miniforge3/bin

Pipeline Stages¶

Step	Process	Tool	Description
1	`CLUMPIFY`	BBTools clumpify	Optical deduplication
2	`FILTER_BY_TILE`	BBTools filterbytile	Remove low-quality tiles
3	`BBDUK_TRIM`	BBDuk	Adapter trimming (k=23, mink=11, minlen=70)
4	`BBDUK_FILTER`	BBDuk	Artifact + PhiX removal (k=31, entropy=0.95)
5	`REMOVE_HUMAN`	BBTools removehuman	Human read removal (optional)
6	`FASTQC`	FastQC	QC report on preprocessed reads (optional)
7	`ERROR_CORRECT_ECCO`	BBMerge ecco	Phase 1: overlap-based error correction
8	`ERROR_CORRECT_ECC`	BBTools clumpify ecc	Phase 2: clump-based error correction (4 passes)
9	`ERROR_CORRECT_TADPOLE`	Tadpole	Phase 3: k-mer-based error correction (k=62)
10	`NORMALIZE_READS`	bbnorm	Coverage normalization (target=100, mindepth=2)
11	`MERGE_READS`	BBMerge	Merge overlapping pairs (k=93)
12	`QUALITY_TRIM`	BBDuk	Quality-trim unmerged reads
13a	`ASSEMBLE_TADPOLE`	Tadpole	Assembly (k=124)
13b	`ASSEMBLE_MEGAHIT`	Megahit	Assembly (k=45-225)
13c	`ASSEMBLE_SPADES`	SPAdes	Assembly (k=25-125)
13d	`ASSEMBLE_METASPADES`	metaSPAdes	Assembly (k=25-77, --meta)
14	`DEDUPE_ASSEMBLIES`	BBTools dedupe	Cascade deduplication (100% -> 99% -> 98%)
15	`MAP_READS_BBMAP`	BBMap	Map QC'd reads to assembly (minid=90)
16	`CALCULATE_DEPTHS`	jgi_summarize_bam_contig_depths	Depth matrix in MetaBAT2 format

Parameters¶

Required¶

Parameter	Default	Description
`--input`	(required)	Directory containing paired-end `_R1_.fastq.gz` files
`--outdir`	`results`	Output directory

Mode¶

Parameter	Default	Description
`--coassembly`	`false`	Co-assemble all samples together

Preprocessing¶

Parameter	Default	Description
`--min_readlen`	`70`	Minimum read length after trimming
`--run_remove_human`	`true`	Remove human reads via removehuman.sh
`--human_ref`	`databases/human_ref`	Path to BBTools human reference index
`--run_fastqc`	`true`	Run FastQC on final preprocessed reads
`--run_normalize`	`true`	Enable bbnorm coverage normalization

Assembly¶

Parameter	Default	Description
`--run_tadpole`	`true`	Run Tadpole assembler
`--run_megahit`	`true`	Run Megahit assembler
`--run_spades`	`true`	Run SPAdes assembler
`--run_metaspades`	`true`	Run metaSPAdes assembler
`--dedupe_identity`	`98`	Final deduplication identity threshold
`--min_contig_len`	`500`	Minimum contig length after deduplication

Resources¶

Parameter	Default	Description
`--assembly_cpus`	`24`	CPUs for assembly processes
`--assembly_memory`	`250 GB`	Memory for assembly processes

SLURM¶

Parameter	Default	Description
`--slurm_account`	`def-rec3141`	SLURM `--account` for job submission
`--conda_path`	(none)	Path to conda/mamba `bin/` for SLURM jobs

Outputs¶

results/
├── preprocess/<sample>/        QC'd reads + FastQC reports
├── error_correct/<sample>/     Three-phase error-corrected reads
├── normalize/<sample>/         Coverage-normalized reads + k-mer histograms
├── merge/<sample>/             Merged + quality-trimmed reads
├── assembly/<sample>/
│   ├── <sample>.dedupe.fasta   Final deduplicated assembly
│   ├── <sample>.tadpole.fasta  Per-assembler outputs
│   ├── <sample>.megahit.fasta
│   ├── <sample>.spades.fasta
│   ├── <sample>.metaspades.fasta
│   └── <sample>.assembly_stats.txt
├── mapping/<sample>/
│   ├── <sample>.sorted.bam     Alignments
│   ├── <sample>.sorted.bam.bai BAM index
│   ├── <sample>.depths.txt     Depth matrix (MetaBAT2 format)
│   └── <sample>.covstats.txt   Per-contig coverage statistics
└── pipeline_info/              Nextflow reports (timeline, trace, DAG)

In co-assembly mode, <sample> is replaced with coassembly for assembly and mapping directories.

Feed these outputs into mag_analysis:

cd ../mag_analysis
./run-mag-analysis.sh \
    --assembly results/assembly/<sample>/<sample>.dedupe.fasta \
    --depths results/mapping/<sample>/<sample>.depths.txt \
    --bam_dir results/mapping/<sample>/ \
    --outdir /path/to/analysis --db_dir /path/to/databases

Profiles¶

Profile	Use case
`standard`	Local execution (default)
`test`	Small test data, reduced resources (4 CPUs, 8 GB)
`slurm`	SLURM cluster execution

Resource Requirements¶

Component	CPUs	RAM	Notes
Preprocessing (BBTools)	8	16 GB	Per-sample; parallelized
Error correction	8	16 GB	Three sequential phases per sample
SPAdes / metaSPAdes	24	250 GB	Most memory-intensive step
Megahit	24	250 GB	Lower peak memory than SPAdes
Tadpole	24	250 GB	Fast, single k-mer assembly
BBMap read mapping	8	16 GB	Per-sample mapping
Human decontamination	8	16 GB	BBTools removehuman.sh

Input¶

--input must point to a directory containing paired-end Illumina reads:

input_dir/
├── sampleA_S1_L001_R1_001.fastq.gz
├── sampleA_S1_L001_R2_001.fastq.gz
├── sampleB_S2_L001_R1_001.fastq.gz
├── sampleB_S2_L001_R2_001.fastq.gz
└── ...

R2 files are matched by replacing _R1_ with _R2_. The sample ID is extracted as the first field before _ in the filename.

Design Notes¶

Multi-assembler approach. Four assemblers capture different k-mer ranges. Cascade deduplication (100% -> 99% -> 98%) merges them into a single non-redundant contig set.
Three-phase error correction. Overlap-based (bbmerge ecco), clump-based (clumpify ecc, 4 passes), and k-mer-based (tadpole ecc k=62) correction each catch different error types.
Human decontamination. Enabled by default via BBTools' removehuman.sh. Disable with --run_remove_human false.
Interleaved read handling. BBTools operates on interleaved FASTQ throughout, simplifying channel plumbing.
Graceful assembler failure. All assemblers use set +e and produce an empty FASTA on error. The deduplication step skips empty files.
SRA-safe. Automatic fallbacks for SRA-stripped read headers that lack optical duplicate coordinates.