Big Data Workflow

:material-download: Download bigdata.py

This workflow is designed for large-scale amplicon studies — hundreds of samples, HiSeq-scale data (100M+ reads), or multi-run projects. It mirrors the DADA2 Big Data tutorial but uses papa2's Python API.

Key differences from the standard tutorial:

  • Samples are processed one at a time in a loop, keeping memory low
  • Error rates are learned from a random subset of bases
  • The workflow is split into independent stages that can be run as separate jobs
  • Multiple sequencing runs can be merged before chimera removal

Strategy

DADA2's sample inference is independent per sample once the error model is learned. This means the per-sample dereplicate → denoise loop can be streamed with only one sample in memory at a time, keeping RAM usage around 4–8 GB even for a full HiSeq lane (~750 samples).

Scaling behaviour:

  • Linear in number of samples
  • Roughly quadratic in per-sample read depth: a few deep samples take longer than many shallow samples containing the same total reads
  • Parallelisable across samples (set DADA2_WORKERS or farm out to multiple nodes)

Stage 1: Filter and Trim

Quality-filter and truncate reads with filter_and_trim(). For paired-end data, both reads in a pair must pass for either to be kept.

import os, glob
import papa2

# Point to your directory of demultiplexed FASTQs
fastq_dir = "data/run1"
fwd_files = sorted(glob.glob(os.path.join(fastq_dir, "*_R1_001.fastq.gz")))
rev_files = sorted(glob.glob(os.path.join(fastq_dir, "*_R2_001.fastq.gz")))

sample_names = [
    os.path.basename(f).replace("_R1_001.fastq.gz", "")
    for f in fwd_files
]
print(f"{len(sample_names)} samples found")

# Filter and trim
filt_dir = "data/run1/filtered"
filt_fwd = [os.path.join(filt_dir, os.path.basename(f)) for f in fwd_files]
filt_rev = [os.path.join(filt_dir, os.path.basename(f)) for f in rev_files]

out = papa2.filter_and_trim(
    fwd_files, filt_fwd,
    rev=rev_files, filt_rev=filt_rev,
    trunc_len=(240, 200),
    max_ee=(2, 2),
    trunc_q=2,
    rm_phix=True,
    multithread=True,
    verbose=True,
)
print(out)
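
Before moving on, check how many reads survived filtering. A quick sketch, assuming out mirrors DADA2's filterAndTrim() return value (a pandas-style table with reads.in and reads.out columns):

# Assumes `out` is a pandas-style table with reads.in / reads.out columns
retained = out["reads.out"] / out["reads.in"]
print(f"Median retention: {retained.median():.1%}")
low = retained[retained < 0.5]
if len(low) > 0:
    print(f"Warning: {len(low)} files lost more than half their reads")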

Stage 2: Learn Error Rates

Learn the error model from a subset of the data. 100 million bases is usually sufficient. By default papa2 reads files in order until the target is reached; with randomize=True (used below) it draws from files at random instead.

errF = papa2.learn_errors(filt_fwd, nbases=1e8, randomize=True, verbose=True)
errR = papa2.learn_errors(filt_rev, nbases=1e8, randomize=True, verbose=True)
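
If Stage 2 runs as its own job, persist the error models so Stage 3 can load them later. A minimal sketch, assuming the error-model objects are picklable:

import pickle

with open("run1_errors.pkl", "wb") as fh:
    pickle.dump({"errF": errF, "errR": errR}, fh)

# Later, at the start of the Stage 3 job:
with open("run1_errors.pkl", "rb") as fh:
    errs = pickle.load(fh)
errF, errR = errs["errF"], errs["errR"]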

Stage 3: Sample Inference (streaming loop)

Process each sample individually — only one dereplicated sample is in memory at a time. Results are collected into a dict for the sequence table.

Single-end

import papa2

dds = {}
for name, fwd in zip(sample_names, filt_fwd):  # filtered files from Stage 1
    print(f"Processing: {name}")
    derep = papa2.derep_fastq(fwd)
    dd = papa2.dada(derep, err=errF, verbose=False)
    dds[name] = dd["denoised"]

# Build the sequence table
seqtab = papa2.make_sequence_table(dds)
print(f"Sequence table: {seqtab['table'].shape[0]} samples x {seqtab['table'].shape[1]} ASVs")

Paired-end

import papa2

mergers_dict = {}
for name, fwd, rev in zip(sample_names, filt_fwd, filt_rev):  # filtered files from Stage 1
    print(f"Processing: {name}")
    drF = papa2.derep_fastq(fwd)
    drR = papa2.derep_fastq(rev)
    ddF = papa2.dada(drF, err=errF, verbose=False)
    ddR = papa2.dada(drR, err=errR, verbose=False)
    merger = papa2.merge_pairs(ddF, drF, ddR, drR, verbose=False)
    mergers_dict[name] = merger

# Build the sequence table from the merger results
seqtab = papa2.make_sequence_table(mergers_dict)
print(f"Sequence table: {seqtab['table'].shape[0]} samples x {seqtab['table'].shape[1]} ASVs")

Saving intermediate results

For very large studies, save the sequence table to disk between stages so each stage can be an independent job:

import numpy as np
np.savez("run1_seqtab.npz", table=seqtab["table"], seqs=seqtab["seqs"])

Stage 4: Merge Runs (if applicable)

If your study spans multiple sequencing runs, learn errors and infer sequences per-run, then merge the sequence tables:

import numpy as np
import papa2

# Load per-run tables
run1 = np.load("run1_seqtab.npz", allow_pickle=True)
run2 = np.load("run2_seqtab.npz", allow_pickle=True)

# Merge: rebuild a single table keyed by sequence, then collapse
# sequences that are identical up to length variation
seqtab = papa2.collapse_no_mismatch(
    papa2.make_sequence_table({
        # Row labels are positional ("run1_0", ...) because the .npz
        # files above do not store sample names
        **{f"run1_{i}": dict(zip(run1["seqs"], run1["table"][i]))
           for i in range(run1["table"].shape[0])},
        **{f"run2_{i}": dict(zip(run2["seqs"], run2["table"][i]))
           for i in range(run2["table"].shape[0])},
    })
)

Stage 5: Remove Chimeras

Use method="consensus" for large datasets: chimeras are flagged independently within each sample, and an ASV is removed only if it is flagged as chimeric in a sufficiently high fraction of the samples in which it occurs:

seqtab_nochim = papa2.remove_bimera_denovo(
    seqtab,
    method="consensus",
    verbose=True,
)

total_before = seqtab["table"].sum()
total_after = seqtab_nochim["table"].sum()
print(f"Reads retained: {total_after}/{total_before} "
      f"({100 * total_after / total_before:.1f}%)")

Stage 6: Assign Taxonomy

taxa = papa2.assign_taxonomy(
    seqtab_nochim["seqs"],
    "silva_nr99_v138.1_train_set.fa.gz",
    min_boot=50,
    verbose=True,
)

taxa_with_species = papa2.add_species(
    taxa,
    seqtab_nochim["seqs"],
    "silva_species_assignment_v138.1.fa.gz",
)

See the Tutorial for database download links.


Stage 7: Track Reads and Export

# Sankey diagram — tracks reads through every stage including taxonomy
track = papa2.track_reads(
    seqtab=seqtab,
    seqtab_nochim=seqtab_nochim,
    taxa=taxa_with_species,
)
papa2.plot_sankey(track, output="read_tracking.html")

# Export ASVs
papa2.write_fasta(seqtab_nochim["seqs"], "asvs.fasta")

# Save final tables
import numpy as np
np.savez(
    "results.npz",
    seqtab=seqtab_nochim["table"],
    seqs=seqtab_nochim["seqs"],
)
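
The taxonomy table is not included in results.npz above. Its exact structure depends on papa2, so pickling it separately is a safe option:

import pickle

with open("taxa.pkl", "wb") as fh:
    pickle.dump(taxa_with_species, fh)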

Performance Tips

  • Set DADA2_WORKERS=N to parallelise the dada() calls across CPU cores. Each worker processes one sample, so set this to the number of cores available (see the sketch after this list).
  • Set OMP_NUM_THREADS=1 to avoid contention between Python-level and OpenMP-level parallelism.
  • For cluster/cloud deployments: Stages 2–3 are independent for each run and can be submitted as separate jobs. Only Stages 4–6 require all runs to be complete.
  • Memory scales with the largest single sample, not the total dataset size. Typical usage is 4–8 GB for HiSeq-depth samples.
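
For example, on a 16-core node (assuming papa2 reads these variables at startup, set them before the import):

import os

# Assumed to be read by papa2 when it initialises its worker pool
os.environ["DADA2_WORKERS"] = "16"
os.environ["OMP_NUM_THREADS"] = "1"

import papa2  # import after setting the environment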