Tutorial¶
:material-download: Download tutorial.py
This tutorial walks through the standard DADA2 amplicon processing workflow using papa2's Python API. The workflow mirrors the canonical DADA2 R tutorial but uses Python syntax and papa2's interface throughout.
Expected input: paired-end Illumina reads from a 16S (or ITS) amplicon experiment, demultiplexed into per-sample FASTQ files.
0. Setup¶
import os, glob
import numpy as np
import papa2
# Point to your FASTQ directory
fastq_dir = "data/fastq"
fwd_files = sorted(glob.glob(os.path.join(fastq_dir, "*_R1_001.fastq.gz")))
rev_files = sorted(glob.glob(os.path.join(fastq_dir, "*_R2_001.fastq.gz")))
# Derive sample names from file names
sample_names = [
os.path.basename(f).replace("_R1_001.fastq.gz", "")
for f in fwd_files
]
print(sample_names[:5])
1. Inspect Read Quality¶
Before filtering, visualise per-cycle quality to choose appropriate truncation
lengths. plot_quality_profile() produces a heat-map of quality scores across
read positions — the same summary as R's plotQualityProfile.
# Plot quality for the first three samples (forward reads)
papa2.plot_quality_profile(fwd_files[:3])
papa2.plot_quality_profile(rev_files[:3])
Choose trunc_len values where median quality stays above Q30 across all samples.
A typical choice for 2×250 bp V4 16S reads is trunc_len=(240, 200).
2. Filter and Trim¶
filter_and_trim() applies quality truncation, expected-error filtering,
PhiX removal, and length enforcement — then writes the passing reads to new
FASTQ files. Both reads in a pair must pass for either to be kept.
filt_dir = "data/filtered"
filt_fwd = [os.path.join(filt_dir, os.path.basename(f)) for f in fwd_files]
filt_rev = [os.path.join(filt_dir, os.path.basename(f)) for f in rev_files]
out = papa2.filter_and_trim(
fwd_files, filt_fwd,
rev=rev_files, filt_rev=filt_rev,
trunc_len=(240, 200),
max_ee=(2, 2),
trunc_q=2,
min_len=50,
rm_phix=True,
compress=True,
multithread=True,
verbose=True,
)
# out is an (n_files, 2) array: [reads_in, reads_out] per file
print(out)
Tuning filter parameters
- If too few reads pass, increase
max_eeor reducetrunc_len. - If quality drops sharply at read ends, decrease
trunc_len. - Forward and reverse reads can have different truncation lengths.
3. Dereplicate¶
derep_fastq() collapses identical reads into unique sequences and computes
per-position average quality scores. This step is fast because the C backend
is used when libpapa2.so is present.
# Use the filtered files
derepFs = [papa2.derep_fastq(f, verbose=True) for f in filt_fwd]
derepRs = [papa2.derep_fastq(f, verbose=True) for f in filt_rev]
# Each element is a dict:
# seqs list[str] unique sequences, sorted descending by abundance
# abundances np.int32 array of read counts
# quals np.float64 (n_uniques x max_seqlen) average quality
# map np.int32 per-read index into seqs
4. Learn Error Rates¶
DADA2 builds a parametric error model that describes the probability of observing
each base transition (A→C, A→G, …) at every quality score. papa2 implements
the same LOESS regression as R's loessErrfun.
# Use up to 1e8 bases from the first samples (roughly 10 samples for 1e7 bp each)
errF = papa2.learn_errors(fwd_files, nbases=1e8, verbose=True)
errR = papa2.learn_errors(rev_files, nbases=1e8, verbose=True)
# errF / errR are numpy arrays of shape (16, nqual):
# - 16 rows for the 16 base transitions (A2A, A2C, …, T2T)
# - nqual columns, one per quality-score bin
print("Error matrix shape:", errF.shape)
Self-consistent learning
learn_errors() internally runs dada() with self_consist=True, cycling
between denoising and error estimation until convergence — exactly as R's
learnErrors() does.
5. Denoise with DADA¶
dada() applies the core DADA2 algorithm: it partitions the dereplicated reads
into ASVs, evaluating each unique sequence against all more-abundant ones using
the learned error model.
# Denoise forward reads for all samples
dadaFs = papa2.dada(derepFs, err=errF, verbose=False)
# Denoise reverse reads
dadaRs = papa2.dada(derepRs, err=errR, verbose=False)
# Each element of dadaFs / dadaRs is a dict:
# denoised dict {seq: abundance} final ASVs
# cluster_seqs list[str] ASV sequences
# cluster_abunds np.int32 ASV abundances
# map np.int32 unique -> cluster assignment
# trans np.int32 (16 x nqual) observed transition counts
# Report ASV counts per sample
for name, dF in zip(sample_names, dadaFs):
n_asvs = len(dF["denoised"])
print(f" {name}: {n_asvs} ASVs")
Tuning DADA options¶
All algorithmic parameters from R's DADA_OPTS are exposed:
# Tighten the statistical threshold (default is 1e-40)
papa2.set_dada_opt(OMEGA_A=1e-50)
# Or pass options directly to dada()
dadaFs = papa2.dada(derepFs, err=errF, OMEGA_A=1e-50, verbose=False)
See DADA_OPTS for the full list of tunable parameters.
6. Merge Paired-End Reads¶
merge_pairs() aligns the denoised forward and reverse ASVs to construct the
full amplicon sequence. Pairs are accepted only when the overlap satisfies both
min_overlap and max_mismatch.
mergers = []
for name, dF, drF, dR, drR in zip(
sample_names, dadaFs, derepFs, dadaRs, derepRs):
m = papa2.merge_pairs(dF, drF, dR, drR,
min_overlap=12,
max_mismatch=0,
verbose=True)
mergers.append(m)
n_accept = sum(1 for r in m if r["accept"])
n_reads = sum(r["abundance"] for r in m if r["accept"])
print(f" {name}: {n_accept} merged pairs, {n_reads} reads")
Each element of mergers is a list of dicts:
| Key | Type | Description |
|---|---|---|
sequence |
str |
Merged amplicon sequence |
abundance |
int |
Supporting read count |
forward |
int |
Forward cluster index |
reverse |
int |
Reverse cluster index |
nmatch |
int |
Overlap matches |
nmismatch |
int |
Overlap mismatches |
nindel |
int |
Overlap indels |
accept |
bool |
Passed acceptance criteria |
Just concatenate
For amplicons where forward and reverse reads do not overlap (e.g. long V1-V3
16S), use just_concatenate=True. The reads are joined with 10 Ns.
7. Build the Sequence Table¶
make_sequence_table() assembles a (samples × ASVs) abundance matrix from the
per-sample merger lists.
seqtab = papa2.make_sequence_table(mergers)
# seqtab is a dict:
# table np.int32 (n_samples x n_asvs)
# seqs list[str]
print("Sequence table:", seqtab["table"].shape)
# e.g. (20, 3842) for 20 samples and 3842 unique ASVs
# Inspect the distribution of ASV lengths
import collections
lengths = collections.Counter(len(s) for s in seqtab["seqs"])
print(sorted(lengths.items()))
8. Remove Chimeras¶
Chimeric sequences are artefacts formed by the spurious joining of two parental
molecules. remove_bimera_denovo() identifies and removes bimeras — chimeras with
exactly two parents.
seqtab_nochim = papa2.remove_bimera_denovo(
seqtab,
method="consensus", # flag ASVs flagged as chimeric in most samples
min_fold=1.5,
min_abund=2,
verbose=True
)
# Report read retention
total_before = seqtab["table"].sum()
total_after = seqtab_nochim["table"].sum()
print(f"Reads retained: {total_after}/{total_before} "
f"({100*total_after/total_before:.1f}%)")
print(f"ASVs retained: {len(seqtab_nochim['seqs'])}/{len(seqtab['seqs'])}")
Available methods:
| Method | Description |
|---|---|
"consensus" |
Flag per-sample; remove ASVs chimeric in most samples (default) |
"pooled" |
Pool all samples and flag globally |
"per-sample" |
Zero only chimeric cells per sample, then drop empty ASVs |
9. Assign Taxonomy¶
papa2 includes both a k-mer-based Bayesian classifier (assign_taxonomy, matching
R's assignTaxonomy) and an exact-match species assigner (assign_species).
Download a formatted reference database and point papa2 to it.
Download reference databases¶
DADA2-formatted training files are available from the DADA2 reference page. Common choices:
| Database | File | Link |
|---|---|---|
| SILVA v138.1 (genus) | silva_nr99_v138.1_train_set.fa.gz |
Download |
| SILVA v138.1 (species) | silva_species_assignment_v138.1.fa.gz |
Download |
| GTDB r220 (genus) | GTDB_bac120_arc53_ssu_r220_genus.fa.gz |
Download |
| GTDB r220 (species) | GTDB_bac120_arc53_ssu_r220_species.fa.gz |
Download |
# Example: download SILVA v138.1
wget https://zenodo.org/records/4587955/files/silva_nr99_v138.1_train_set.fa.gz
wget https://zenodo.org/records/4587955/files/silva_species_assignment_v138.1.fa.gz
Run the classifier¶
# Naive Bayesian classifier (genus-level)
taxa = papa2.assign_taxonomy(
seqtab_nochim["seqs"],
"silva_nr99_v138.1_train_set.fa.gz",
min_boot=50,
verbose=True,
)
# taxa is a DataFrame: rows = ASV sequences, columns = Kingdom..Species
# Exact-match species assignment using the SILVA species database
taxa_with_species = papa2.add_species(
taxa,
seqtab_nochim["seqs"],
"silva_species_assignment_v138.1.fa.gz",
)
Inspect error rates¶
After learning errors, visualise the fit to check for anomalies:
10. Track Reads Through the Pipeline¶
It is good practice to track how many reads pass each step. papa2 provides
track_reads() and plot_sankey() to summarise the entire pipeline —
from raw input through chimera removal and taxonomy — as an interactive
Sankey diagram:
track = papa2.track_reads(
dereps=derepFs,
dadas=dadaFs,
mergers=mergers,
seqtab=seqtab,
seqtab_nochim=seqtab_nochim,
taxa=taxa_with_species,
)
print(track)
# e.g. {'input': 50000, 'denoised': 42000, 'merged': 40000,
# 'tabled': 40000, 'non-chimeric': 38000,
# 'kingdom': 37800, 'phylum': 37200, 'class': 36500,
# 'order': 35800, 'family': 34900, 'genus': 33500, 'species': 28000}
# Save as interactive HTML (or omit output= to display in Jupyter)
papa2.plot_sankey(track, output="read_tracking.html")
You can also build a per-sample tracking table for more detail:
import pandas as pd
rows = []
for i, name in enumerate(sample_names):
n_input = int(derepFs[i]["abundances"].sum())
n_denoised = int(dadaFs[i]["cluster_abunds"].sum())
n_merged = int(sum(r["abundance"] for r in mergers[i] if r["accept"]))
rows.append({
"sample": name,
"input": n_input,
"denoised": n_denoised,
"merged": n_merged,
})
nochim_table = seqtab_nochim["table"]
for i, row in enumerate(rows):
row["nonchim"] = int(nochim_table[i].sum())
df = pd.DataFrame(rows)
print(df.to_string(index=False))
11. Export Results¶
# Write ASVs to a FASTA file
papa2.write_fasta(seqtab_nochim["seqs"], "asvs.fasta")
# Or with a header prefix
papa2.uniquesto_fasta(seqtab_nochim["seqs"], "asvs_with_headers.fasta",
prefix="ASV")
# The sequence table and taxonomy can be saved with numpy / pandas:
import numpy as np
np.save("seqtab_nochim.npy", seqtab_nochim["table"])
Summary of the Workflow¶
Raw FASTQ files (demultiplexed)
│
▼
filter_and_trim() # Quality filter, truncate, remove PhiX
│
▼
derep_fastq() # Collapse identical reads
│
▼
learn_errors() # Estimate sequencing error rates (LOESS)
│
▼
dada() # Infer true sequences (ASVs)
│
▼
merge_pairs() # Join forward + reverse ASVs
│
▼
make_sequence_table() # Assemble sample × ASV matrix
│
▼
remove_bimera_denovo() # Remove chimeric ASVs
│
▼
assign_taxonomy() # Bayesian taxonomic classification
│
▼
add_species() # Exact-match species assignment