Tutorial

:material-download: Download tutorial.py

This tutorial walks through the standard DADA2 amplicon processing workflow using papa2's Python API. The workflow mirrors the canonical DADA2 R tutorial but uses Python syntax and papa2's interface throughout.

Expected input: paired-end Illumina reads from a 16S (or ITS) amplicon experiment, demultiplexed into per-sample FASTQ files.


0. Setup

import os, glob
import numpy as np
import papa2

# Point to your FASTQ directory
fastq_dir = "data/fastq"

fwd_files = sorted(glob.glob(os.path.join(fastq_dir, "*_R1_001.fastq.gz")))
rev_files = sorted(glob.glob(os.path.join(fastq_dir, "*_R2_001.fastq.gz")))

# Derive sample names from file names
sample_names = [
    os.path.basename(f).replace("_R1_001.fastq.gz", "")
    for f in fwd_files
]
print(sample_names[:5])

1. Inspect Read Quality

Before filtering, visualise per-cycle quality to choose appropriate truncation lengths. plot_quality_profile() produces a heat-map of quality scores across read positions — the same summary as R's plotQualityProfile.

# Plot quality for the first three samples (forward reads)
papa2.plot_quality_profile(fwd_files[:3])
papa2.plot_quality_profile(rev_files[:3])

Choose trunc_len values where median quality stays above Q30 across all samples. A typical choice for 2×250 bp V4 16S reads is trunc_len=(240, 200).
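If you prefer a numeric summary to the plots, per-cycle medians can be computed directly from the quality strings. A minimal sketch in plain Python (Phred+33 encoding and equal-length reads assumed; this is an illustration, not part of papa2's API):

```python
import numpy as np

def percycle_median_quality(qual_strings):
    """Median Phred quality at each cycle, given Phred+33 quality strings."""
    quals = np.array([[ord(c) - 33 for c in q] for q in qual_strings])
    return np.median(quals, axis=0)

# Three toy 6-cycle quality strings:
# 'I' = Q40, 'H' = Q39, 'B' = Q33, '#' = Q2
medians = percycle_median_quality(["IIIIB#", "IIIII#", "IIHIB#"])
print(medians)  # quality collapses at the last two cycles
```

A read set whose medians dip below Q30 at a given cycle is a candidate for truncation at that cycle.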


2. Filter and Trim

filter_and_trim() applies quality truncation, expected-error filtering, PhiX removal, and length enforcement — then writes the passing reads to new FASTQ files. Both reads in a pair must pass for either to be kept.

filt_dir = "data/filtered"
filt_fwd = [os.path.join(filt_dir, os.path.basename(f)) for f in fwd_files]
filt_rev = [os.path.join(filt_dir, os.path.basename(f)) for f in rev_files]

out = papa2.filter_and_trim(
    fwd_files, filt_fwd,
    rev=rev_files, filt_rev=filt_rev,
    trunc_len=(240, 200),
    max_ee=(2, 2),
    trunc_q=2,
    min_len=50,
    rm_phix=True,
    compress=True,
    multithread=True,
    verbose=True,
)
# out is an (n_files, 2) array: [reads_in, reads_out] per file
print(out)

Tuning filter parameters

  • If too few reads pass, increase max_ee or reduce trunc_len.
  • If quality drops sharply at read ends, decrease trunc_len.
  • Forward and reverse reads can have different truncation lengths.
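The max_ee threshold filters on expected errors: EE = sum of 10^(-Q/10) over a read's quality scores. A quick sketch of that computation (plain Python, shown for intuition only, not papa2's implementation):

```python
def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

# A 4-base read at Q20 everywhere: each base has a 1% error chance,
# so the read contributes about 0.04 expected errors in total.
print(expected_errors([20, 20, 20, 20]))
```

With max_ee=(2, 2), a 240 bp forward read passes as long as its summed error probability stays at or below 2.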

3. Dereplicate

derep_fastq() collapses identical reads into unique sequences and computes per-position average quality scores. This step is fast because the C backend is used when libpapa2.so is present.

# Use the filtered files
derepFs = [papa2.derep_fastq(f, verbose=True) for f in filt_fwd]
derepRs = [papa2.derep_fastq(f, verbose=True) for f in filt_rev]

# Each element is a dict:
#   seqs        list[str]   unique sequences, sorted descending by abundance
#   abundances  np.int32    array of read counts
#   quals       np.float64  (n_uniques x max_seqlen) average quality
#   map         np.int32    per-read index into seqs
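Conceptually, dereplication is abundance-sorted counting of identical reads. A pure-Python sketch of the seqs/abundances part of the structure above (quality averaging and the read map omitted):

```python
from collections import Counter

def derep_sketch(reads):
    """Collapse identical reads; return uniques sorted by descending abundance."""
    counts = Counter(reads)
    seqs = sorted(counts, key=counts.get, reverse=True)
    return {"seqs": seqs, "abundances": [counts[s] for s in seqs]}

d = derep_sketch(["ACGT", "ACGT", "ACGT", "TTGA", "TTGA", "GGCC"])
print(d["seqs"])        # ['ACGT', 'TTGA', 'GGCC']
print(d["abundances"])  # [3, 2, 1]
```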

4. Learn Error Rates

DADA2 builds a parametric error model that describes the probability of observing each base transition (A→C, A→G, …) at every quality score. papa2 implements the same LOESS regression as R's loessErrfun.

# Learn the error model from the filtered reads, using up to 1e8 bases
errF = papa2.learn_errors(filt_fwd, nbases=1e8, verbose=True)
errR = papa2.learn_errors(filt_rev, nbases=1e8, verbose=True)

# errF / errR are numpy arrays of shape (16, nqual):
# - 16 rows for the 16 base transitions (A2A, A2C, …, T2T)
# - nqual columns, one per quality-score bin
print("Error matrix shape:", errF.shape)
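Assuming the rows follow the A2A, A2C, …, T2T order listed above (base order A, C, G, T), the row for any transition can be computed as a simple sketch; the helper below is illustrative, not part of papa2:

```python
BASES = "ACGT"

def transition_row(frm, to):
    """Row index in the 16 x nqual error matrix for a base transition,
    assuming rows are ordered A2A, A2C, ..., T2T."""
    return 4 * BASES.index(frm) + BASES.index(to)

print(transition_row("A", "A"))  # 0  -> A2A
print(transition_row("A", "C"))  # 1  -> A2C
print(transition_row("T", "T"))  # 15 -> T2T
# e.g. errF[transition_row("A", "G")] would then be the Q-dependent A->G curve
```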

Self-consistent learning

learn_errors() internally runs dada() with self_consist=True, cycling between denoising and error estimation until convergence — exactly as R's learnErrors() does.


5. Denoise with DADA

dada() applies the core DADA2 algorithm: it partitions the dereplicated reads into ASVs, evaluating each unique sequence against all more-abundant ones using the learned error model.

# Denoise forward reads for all samples
dadaFs = papa2.dada(derepFs, err=errF, verbose=False)

# Denoise reverse reads
dadaRs = papa2.dada(derepRs, err=errR, verbose=False)

# Each element of dadaFs / dadaRs is a dict:
#   denoised       dict {seq: abundance}  final ASVs
#   cluster_seqs   list[str]              ASV sequences
#   cluster_abunds np.int32               ASV abundances
#   map            np.int32               unique -> cluster assignment
#   trans          np.int32 (16 x nqual)  observed transition counts

# Report ASV counts per sample
for name, dF in zip(sample_names, dadaFs):
    n_asvs = len(dF["denoised"])
    print(f"  {name}: {n_asvs} ASVs")

Tuning DADA options

All algorithmic parameters from R's DADA_OPTS are exposed:

# Tighten the statistical threshold (default is 1e-40)
papa2.set_dada_opt(OMEGA_A=1e-50)

# Or pass options directly to dada()
dadaFs = papa2.dada(derepFs, err=errF, OMEGA_A=1e-50, verbose=False)

See DADA_OPTS for the full list of tunable parameters.


6. Merge Paired-End Reads

merge_pairs() aligns the denoised forward and reverse ASVs to construct the full amplicon sequence. Pairs are accepted only when the overlap satisfies both min_overlap and max_mismatch.

mergers = []
for name, dF, drF, dR, drR in zip(
        sample_names, dadaFs, derepFs, dadaRs, derepRs):
    m = papa2.merge_pairs(dF, drF, dR, drR,
                          min_overlap=12,
                          max_mismatch=0,
                          verbose=True)
    mergers.append(m)
    n_accept = sum(1 for r in m if r["accept"])
    n_reads  = sum(r["abundance"] for r in m if r["accept"])
    print(f"  {name}: {n_accept} merged pairs, {n_reads} reads")

Each element of mergers is a list of dicts:

| Key | Type | Description |
| --- | --- | --- |
| sequence | str | Merged amplicon sequence |
| abundance | int | Supporting read count |
| forward | int | Forward cluster index |
| reverse | int | Reverse cluster index |
| nmatch | int | Overlap matches |
| nmismatch | int | Overlap mismatches |
| nindel | int | Overlap indels |
| accept | bool | Passed acceptance criteria |
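The acceptance criteria can be sketched with a naive overlap search: slide the reverse-complemented reverse sequence along the forward sequence and accept the first overlap satisfying min_overlap and max_mismatch. This is an illustration of the rule only; papa2's aligner also models indels:

```python
def merge_sketch(fwd, rev_rc, min_overlap=12, max_mismatch=0):
    """Naive overlap merge; rev_rc must already be reverse-complemented."""
    for ov in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd[-ov:], rev_rc[:ov]))
        if mismatches <= max_mismatch:
            return fwd + rev_rc[ov:]
    return None  # no acceptable overlap -> pair rejected

fwd    = "AAAACCCCGGGGTTTT"
rev_rc = "GGGGTTTTACGTACGT"
print(merge_sketch(fwd, rev_rc, min_overlap=8))
# AAAACCCCGGGGTTTTACGTACGT (8-base exact overlap on GGGGTTTT)
```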

Just concatenate

For amplicons where forward and reverse reads do not overlap (e.g. long V1-V3 16S), use just_concatenate=True. The reads are joined with 10 Ns.
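The concatenation mode can be sketched as follows (the reverse read is reverse-complemented before joining; hedged illustration of the behaviour described above, not papa2 internals):

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def concat_sketch(fwd, rev):
    # Forward read + 10 Ns + reverse complement of the reverse read
    return fwd + "N" * 10 + revcomp(rev)

print(concat_sketch("ACGT", "AACC"))  # ACGTNNNNNNNNNNGGTT
```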


7. Build the Sequence Table

make_sequence_table() assembles a (samples × ASVs) abundance matrix from the per-sample merger lists.

seqtab = papa2.make_sequence_table(mergers)

# seqtab is a dict:
#   table  np.int32 (n_samples x n_asvs)
#   seqs   list[str]

print("Sequence table:", seqtab["table"].shape)
# e.g. (20, 3842) for 20 samples and 3842 unique ASVs

# Inspect the distribution of ASV lengths
import collections
lengths = collections.Counter(len(s) for s in seqtab["seqs"])
print(sorted(lengths.items()))
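If the distribution shows off-target lengths, the table can be restricted to the expected amplicon range, analogous to subsetting seqtab by nchar in the R tutorial. A numpy sketch against the dict layout above (not a papa2 function):

```python
import numpy as np

def filter_lengths(seqtab, min_len, max_len):
    """Keep only ASVs whose length falls in [min_len, max_len]."""
    keep = [i for i, s in enumerate(seqtab["seqs"])
            if min_len <= len(s) <= max_len]
    return {"table": seqtab["table"][:, keep],
            "seqs":  [seqtab["seqs"][i] for i in keep]}

toy = {"table": np.array([[5, 1, 9], [2, 0, 3]]),
       "seqs":  ["ACGTACGT", "ACG", "ACGTACGTA"]}
trimmed = filter_lengths(toy, 8, 9)
print(trimmed["seqs"])         # ['ACGTACGT', 'ACGTACGTA']
print(trimmed["table"].shape)  # (2, 2)
```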

8. Remove Chimeras

Chimeric sequences are artefacts formed by the spurious joining of two parental molecules. remove_bimera_denovo() identifies and removes bimeras — chimeras with exactly two parents.

seqtab_nochim = papa2.remove_bimera_denovo(
    seqtab,
    method="consensus",   # remove ASVs flagged as chimeric in most samples
    min_fold=1.5,
    min_abund=2,
    verbose=True
)

# Report read retention
total_before = seqtab["table"].sum()
total_after  = seqtab_nochim["table"].sum()
print(f"Reads retained: {total_after}/{total_before} "
      f"({100*total_after/total_before:.1f}%)")
print(f"ASVs retained: {len(seqtab_nochim['seqs'])}/{len(seqtab['seqs'])}")

Available methods:

| Method | Description |
| --- | --- |
| "consensus" | Flag per-sample; remove ASVs chimeric in most samples (default) |
| "pooled" | Pool all samples and flag globally |
| "per-sample" | Zero only chimeric cells per sample, then drop empty ASVs |
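The "consensus" idea can be sketched in a few lines: an ASV is removed when it is flagged as chimeric in a majority of the samples in which it occurs. This is a simplified illustration of the voting rule, not papa2's actual bimera test:

```python
import numpy as np

def consensus_remove(flags, present):
    """flags, present: (n_samples x n_asvs) boolean arrays.
    Returns True for ASVs flagged in a majority of the samples
    where they are present."""
    flagged  = flags.sum(axis=0)
    observed = present.sum(axis=0)
    return flagged / np.maximum(observed, 1) > 0.5  # True = remove

present = np.array([[1, 1, 1], [1, 1, 0], [1, 1, 1]], dtype=bool)
flags   = np.array([[0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=bool)
print(consensus_remove(flags, present))  # [False  True False]
```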

9. Assign Taxonomy

papa2 includes both a k-mer-based Bayesian classifier (assign_taxonomy, matching R's assignTaxonomy) and an exact-match species assigner (assign_species). Download a formatted reference database and point papa2 to it.

Download reference databases

DADA2-formatted training files are available from the DADA2 reference page. Common choices:

| Database | File |
| --- | --- |
| SILVA v138.1 (genus) | silva_nr99_v138.1_train_set.fa.gz |
| SILVA v138.1 (species) | silva_species_assignment_v138.1.fa.gz |
| GTDB r220 (genus) | GTDB_bac120_arc53_ssu_r220_genus.fa.gz |
| GTDB r220 (species) | GTDB_bac120_arc53_ssu_r220_species.fa.gz |
# Example: download SILVA v138.1
wget https://zenodo.org/records/4587955/files/silva_nr99_v138.1_train_set.fa.gz
wget https://zenodo.org/records/4587955/files/silva_species_assignment_v138.1.fa.gz

Run the classifier

# Naive Bayesian classifier (genus-level)
taxa = papa2.assign_taxonomy(
    seqtab_nochim["seqs"],
    "silva_nr99_v138.1_train_set.fa.gz",
    min_boot=50,
    verbose=True,
)
# taxa is a DataFrame: rows = ASV sequences, columns = Kingdom..Species

# Exact-match species assignment using the SILVA species database
taxa_with_species = papa2.add_species(
    taxa,
    seqtab_nochim["seqs"],
    "silva_species_assignment_v138.1.fa.gz",
)
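min_boot is a bootstrap-confidence cutoff: ranks whose support falls below it are left unassigned. The truncation rule can be sketched as follows (hypothetical support values; this is not papa2's classifier, just the thresholding logic):

```python
def truncate_at_min_boot(ranks, support, min_boot=50):
    """Keep assignments down to the last rank whose bootstrap
    support is >= min_boot; deeper ranks become None."""
    out = []
    for name, boot in zip(ranks, support):
        if boot < min_boot:
            break
        out.append(name)
    return out + [None] * (len(ranks) - len(out))

ranks   = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales"]
support = [100, 98, 72, 43]
print(truncate_at_min_boot(ranks, support))
# ['Bacteria', 'Firmicutes', 'Bacilli', None]
```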

Inspect error rates

After learning errors, visualise the fit to check for anomalies:

papa2.plot_errors(errF, output="error_rates_fwd.html")

10. Track Reads Through the Pipeline

It is good practice to track how many reads pass each step. papa2 provides track_reads() and plot_sankey() to summarise the entire pipeline — from raw input through chimera removal and taxonomy — as an interactive Sankey diagram:

track = papa2.track_reads(
    dereps=derepFs,
    dadas=dadaFs,
    mergers=mergers,
    seqtab=seqtab,
    seqtab_nochim=seqtab_nochim,
    taxa=taxa_with_species,
)
print(track)
# e.g. {'input': 50000, 'denoised': 42000, 'merged': 40000,
#        'tabled': 40000, 'non-chimeric': 38000,
#        'kingdom': 37800, 'phylum': 37200, 'class': 36500,
#        'order': 35800, 'family': 34900, 'genus': 33500, 'species': 28000}

# Save as interactive HTML (or omit output= to display in Jupyter)
papa2.plot_sankey(track, output="read_tracking.html")
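The track dict can also be turned into step-over-step retention percentages directly (a small helper sketched here for convenience; it is not part of papa2):

```python
def retention(track):
    """Percent of reads surviving each step, relative to the previous step."""
    steps = list(track)
    return {steps[i]: round(100 * track[steps[i]] / track[steps[i - 1]], 1)
            for i in range(1, len(steps))}

track = {"input": 50000, "denoised": 42000, "merged": 40000,
         "non-chimeric": 38000}
print(retention(track))
# {'denoised': 84.0, 'merged': 95.2, 'non-chimeric': 95.0}
```

A sharp drop at any single step (for example, at merging) usually points at one parameter to revisit, such as trunc_len leaving too little overlap.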

You can also build a per-sample tracking table for more detail:

import pandas as pd

rows = []
for i, name in enumerate(sample_names):
    n_input    = int(derepFs[i]["abundances"].sum())
    n_denoised = int(dadaFs[i]["cluster_abunds"].sum())
    n_merged   = int(sum(r["abundance"] for r in mergers[i] if r["accept"]))
    rows.append({
        "sample":   name,
        "input":    n_input,
        "denoised": n_denoised,
        "merged":   n_merged,
    })

nochim_table = seqtab_nochim["table"]
for i, row in enumerate(rows):
    row["nonchim"] = int(nochim_table[i].sum())

df = pd.DataFrame(rows)
print(df.to_string(index=False))

11. Export Results

# Write ASVs to a FASTA file
papa2.write_fasta(seqtab_nochim["seqs"], "asvs.fasta")

# Or with a header prefix
papa2.uniques_to_fasta(seqtab_nochim["seqs"], "asvs_with_headers.fasta",
                       prefix="ASV")

# The sequence table and taxonomy can be saved with numpy / pandas:
import numpy as np
np.save("seqtab_nochim.npy", seqtab_nochim["table"])

Summary of the Workflow

Raw FASTQ files (demultiplexed)
filter_and_trim()      # Quality filter, truncate, remove PhiX
derep_fastq()          # Collapse identical reads
learn_errors()         # Estimate sequencing error rates (LOESS)
dada()                 # Infer true sequences (ASVs)
merge_pairs()          # Join forward + reverse ASVs
make_sequence_table()  # Assemble sample × ASV matrix
remove_bimera_denovo() # Remove chimeric ASVs
assign_taxonomy()      # Bayesian taxonomic classification
add_species()          # Exact-match species assignment