Volume 14, Issue 2 p. 529-542
RESEARCH ARTICLE
Open Access

Inferring taxonomic placement from DNA barcoding aiding in discovery of new taxa

Alessandro Zito (Corresponding Author)

Department of Statistical Science, Duke University, Durham, North Carolina, USA

Tommaso Rigon

Department of Economics, Management and Statistics, University of Milano-Bicocca, Milan, Italy

David B. Dunson

Department of Statistical Science, Duke University, Durham, North Carolina, USA

First published: 30 November 2022
Citations: 2
Handling Editor: Daniele Silvestro

Abstract

  1. Predicting the taxonomic affiliation of DNA sequences collected from biological samples is a fundamental step in biodiversity assessment. This task is performed by leveraging existing databases containing reference DNA sequences endowed with a taxonomic identification. However, environmental sequences can be from organisms that are either unknown to science or for which there are no reference sequences available. Thus, taxonomic novelty of a sequence needs to be accounted for when doing classification.
  2. We propose Bayesian nonparametric taxonomic classifiers, BayesANT, which use species sampling model priors to allow unobserved taxa to be discovered at each taxonomic rank. Using a simple product multinomial likelihood with conjugate Dirichlet priors at the lowest rank, a highly flexible supervised algorithm is developed to provide a probabilistic prediction of the taxa placement of each sequence at each rank.
  3. As an illustration, we run our algorithm on a carefully annotated library of Finnish arthropods (FinBOL). To assess the ability of BayesANT to recognize novelty and to predict known taxonomic affiliations correctly, we test it on two training-test splitting scenarios, each with a different proportion of taxa unobserved in training. We show how our algorithm attains accurate predictions and reliably quantifies classification uncertainty, especially when many sequences in the test set are affiliated to taxa unknown in training.
  4. By enabling taxonomic predictions for DNA barcodes to identify unseen branches, we believe BayesANT will be of broad utility as a tool for DNA metabarcoding within bioinformatics pipelines.

1 INTRODUCTION

DNA barcoding refers to the practice of identifying the taxonomic affiliation of unknown specimens through short fragments of their DNA molecular sequence called barcoding genes (Hebert et al., 2003). Typically, this assessment is performed by comparing the DNA obtained from the high-throughput sequencing of a bulk sample to libraries of genes whose Linnaean taxonomy is well established. Examples of these collections are numerous, with the Barcode of Life project (BOLD; Sarkar & Trizna, 2011) and GenBank (Benson et al., 2012) being particularly notable cases. For the identification to be reliable, reference DNA sequences should be characterized by limited intra-species and high inter-species gene variation and should be sufficiently simple to align and compare (Hebert et al., 2003). In the animal kingdom and insects especially, these characteristics have been found in a region of approximately 650 base pairs near the 5′ end of the mitochondrial cytochrome c oxidase subunit I, or COI, gene (Janzen et al., 2005). This region has become routinely used in animal species identification. In particular, libraries in BOLD are formed by clustering similar COI sequences under a common Barcode Index Number, or BIN, which identifies a given species (Ratnasingham & Hebert, 2013).

The impact of DNA barcoding in biodiversity assessment has been dramatic. It took more than 200 years to describe approximately 1 million species of insects through morphological inspection, whereas nearly 400,000 BINs have been categorized within just 10 years (Wilson et al., 2017). DNA barcoding offers a way to categorize large quantities of specimens collected by modern automatic sampling methods. For example, flying insects are routinely captured with Malaise traps (Malaise, 1937), which collect the sampled insects together in a preservative within a storage cylinder. While this method often causes deterioration of the captured animals, making them morphologically unrecognizable, the biological material can be processed relatively cheaply (Shokralla et al., 2014) through a practice called DNA metabarcoding (Yu et al., 2012), which groups similar sequences detected in the samples into operational taxonomic units, or OTUs. These provide initial hypothesized species labels for the animals in the sample, and assessing their taxonomic placement is the final key stage of a bioinformatics pipeline.

Despite the advantages described above, taxonomic assessment of OTUs presents its own challenges, especially at lower-level ranks. While it is relatively easy to accurately place a DNA sequence to a phylum, a class, or an order (Yu et al., 2012), the information obtainable via high-throughput methods is limited by the short length of the sequences extracted. This makes the identification at the family, at the genus, and at the species level subject to higher uncertainty (Pentinsaari et al., 2020). Moreover, DNA metabarcoding may be prone to sequencing and clustering errors. Consequently, it can either split biological material from the same species into two different clusters or merge different species into a single cluster (Somervuo et al., 2017). Finally, reference sequence libraries can be subject to mislabelling errors (Somervuo et al., 2016) and can be incomplete (Virgilio et al., 2012; Weigand et al., 2019; Wilkinson et al., 2017). This leads to the necessity of developing classification methods that provide a reliable characterization of uncertainty when taxonomically annotating the collected OTUs, accounting for the potential lack of information and therefore barcode novelty in the library. Ultimately, such methodologies allow one to quantify and track the biodiversity of a given sampling region only if the classification probabilities are reliable and OTUs are obtained consistently across time and space.

Much software for taxonomic recognition has been developed, relying on different prediction methods. One approach labels a query DNA with the taxon of the reference sequence having the highest similarity (Huson et al., 2007; Nguyen et al., 2014). This requires applying local or global alignment procedures to the sequences in the library, such as the BLAST—Basic Local Alignment Search Tool—similarity score (Altschul et al., 1990). When alignment is undesirable due to computational costs, fast algorithms that exploit a κ-mer representation of the sequences can be adopted. Widely used examples are the naïve Bayes RDP classifier (Wang et al., 2007) and its non-Bayesian heuristic alternatives (e.g. SINTAX; Edgar, 2013). More recent methods use modern machine learning and deep learning techniques including tree-based classification algorithms (IDTAXA; Murali et al., 2018) and convolutional neural networks (Vu et al., 2020).

While these approaches can provide good classification results when the training data are sufficiently informative of the biodiversity of the environmental sample (Bazinet & Cummings, 2012), they can lead to unreliable matches when the reference sequence set is incomplete, as is often the case (Murali et al., 2018; Wilkinson et al., 2017). Thus, algorithms must coherently account for potential taxonomic novelty when doing classification (Somervuo et al., 2017). Specifically, sequences are regarded as “new” if their true taxonomic annotation is unobserved in the training library. This does not necessarily imply that the specimen from which DNA has been sequenced identifies a taxon new to science. Instead, novelty may be driven by a lack of reference sequencing data for known taxa, limited training libraries, low quality and gaps in barcodes, and sequencing errors in queries. All these factors can potentially lead to false positives, labelling a sequence as “new” when it is not, or false negatives, predicting a known taxon when a new one should be identified. This common issue has been addressed in the literature in various ways (Bokulich et al., 2018; Edgar, 2013; Lan et al., 2012; Lanzén et al., 2012). The widely adopted solution is to select a confidence probability cutoff and regard the classification as unreliable if the predicted taxon has a probability below that threshold (Wang et al., 2007). For example, the default RDP classifier does not report the predicted genus of a query if the most likely genus has a prediction probability lower than 0.8. This cutoff depends on the specific algorithm and often requires appropriate tuning (Lan et al., 2012). Moreover, confidence thresholds might be species-dependent due to differences in genetic variability between and within taxa.
A second possibility is to explicitly allow the algorithm to signal if the queries are likely from previously unobserved taxa, as is done by PROTAX—PRObabilistic TAXonomic placement (Somervuo et al., 2016). PROTAX classifies DNA sequences by training a multinomial regression model on a sub-sample of the reference library reflecting prior knowledge of the existing taxonomy. The algorithm can lead to over- or under-detection of new taxa at any rank if the training dataset is not representative. With this approach, novel nodes in the taxonomic tree are explicitly treated as separate classes to be modelled, and they are assigned a prediction probability when classifying queries.

In this paper, we follow the latter approach and develop an off-the-shelf Bayesian nonparametric model for DNA barcode data that explicitly accounts for novelty by modelling the potential undetected nodes at every unlabelled taxonomic level. As our application primarily focuses on insects, we name our method BayesANT, short for BAYESiAn Nonparametric Taxonomic classifier. BayesANT is a supervised prediction algorithm that is trained on a set of sequences whose taxonomic affiliation is known and later annotates unlabelled DNA barcoding sequences in a probabilistic manner. In particular, it computes taxon-assignment probabilities at all unlabelled ranks by combining a prior distribution for the taxonomic tree with a kernel-based approach to modelling the distribution of the nucleotide sequences conditioned on their full taxonomic affiliation. Taxon novelty is incorporated through a Pitman–Yor process prior (Pitman & Yor, 1997), which is a species sampling model urn scheme (Blackwell & MacQueen, 1973; Pitman, 1996) that automatically specifies probabilities for the appearance of undiscovered species (Favaro et al., 2009; Lijoi et al., 2007) in a coherent way. For aligned sequences, we use a Dirichlet-multinomial product kernel over nucleotides, while, for unaligned sequences, we use a multinomial kernel over κ-mer counts. The resulting model facilitates fast computation of a probabilistic classifier, which provides careful uncertainty assessments in taxonomic annotations. Unlike the other methods described above, our method avoids using an arbitrary threshold to annotate a sequence as being from a clade unobserved in training. In particular, taxonomic novelty in BayesANT can be aided through the choice of the Pitman–Yor prior hyperparameters, which can be either fixed ex-ante based on prior knowledge or estimated from the data. We test BayesANT on a library of arthropod DNA sequences collected in Finland (Roslin et al., 2022).

2 MATERIALS AND METHODS

BayesANT evaluates the probabilities that a given DNA query sequence belongs to each of the nodes of the observed taxonomy, allowing for unobserved nodes in the taxonomic tree to be discovered. These probabilities are derived via Bayes' rule, while taxonomic novelty arises through Pitman–Yor process priors. Let $$ {\mathbf{V}}_i=\left({V}_{i,1},\dots, {V}_{i,L}\right) $$ be the taxonomic labels of the $$ {i}^{\mathrm{th}} $$ sequence in a library of $$ L $$ ranks, and $$ {\mathbf{X}}_i $$ the associated nucleotide sequence from any barcoding gene, such as COI for insects or ITS2 for fungi. We denote a taxonomic library of $$ n $$ sequences as $$ {\mathcal{D}}_n={\left({\mathbf{V}}_i,{\mathbf{X}}_i\right)}_{i=1}^n $$. See Sections 3 and 5 for more details on how the data are structured. The goal of BayesANT is to predict $$ {\mathbf{V}}_{n+1} $$, the labels for the $$ {\left(n+1\right)}^{\mathrm{th}} $$ sequence, treating the DNA $$ {\mathbf{X}}_{n+1} $$ as a covariate. We do this by paralleling the construction behind naïve Bayes classifiers and linear discriminant analysis: the probability that the $$ {\left(n+1\right)}^{\mathrm{th}} $$ query belongs to the taxonomic branch $$ \mathbf{v}=\left({v}_1,\dots, {v}_L\right) $$, conditioned on library $$ {\mathcal{D}}_n $$ and sequence $$ {\mathbf{X}}_{n+1} $$, is
$$ p\left({\mathbf{V}}_{n+1}=\mathbf{v}\mid {\mathbf{X}}_{n+1},{\mathcal{D}}_n\right)\propto p\left({\mathbf{V}}_{n+1}=\mathbf{v}\mid {\mathbf{V}}^{(n)}\right)\times p\left({\mathbf{X}}_{n+1}\mid {\mathbf{V}}_{n+1}=\mathbf{v},{\mathcal{D}}_n\right), $$ (1)
where $$ {\mathbf{V}}^{(n)}={\left({\mathbf{V}}_i\right)}_{i=1}^n $$ are the observed taxonomic labels, $$ p\left({\mathbf{V}}_{n+1}=\mathbf{v}\mid {\mathbf{V}}^{(n)}\right) $$ is the prior probability of branch $$ \mathbf{v} $$ and $$ p\left({\mathbf{X}}_{n+1}\mid {\mathbf{V}}_{n+1}=\mathbf{v},{\mathcal{D}}_n\right) $$ is the distribution of the DNA sequence conditioned on $$ \mathbf{v} $$ being its assigned branch. Refer to the Supporting Information for a step-by-step derivation of Equation (1). In what follows, we carefully specify how each component is determined.
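The proportionality in Equation (1) becomes a probability once the products of prior and likelihood are normalized over all candidate branches. The sketch below shows one way to do this in log space; it is an illustration only, not the BayesANT implementation, and the function name, branch names and log-likelihood values are hypothetical.

```python
import math

def posterior_from_logs(log_prior, log_lik):
    """Normalize prior x likelihood over candidate branches, working in logs."""
    # Unnormalized log posterior of each branch: log prior + log likelihood.
    scores = {v: log_prior[v] + log_lik[v] for v in log_prior}
    # Log-sum-exp: subtract the largest score before exponentiating.
    top = max(scores.values())
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {v: math.exp(s - log_norm) for v, s in scores.items()}

# Hypothetical inputs: two branches with equal prior mass and very small likelihoods.
log_prior = {"branch_a": math.log(0.5), "branch_b": math.log(0.5)}
log_lik = {"branch_a": -1000.0, "branch_b": -1001.0}
posterior = posterior_from_logs(log_prior, log_lik)
```

Working with logarithms matters because a likelihood that multiplies hundreds of per-site terms is far too small for floating point: `math.exp(-1000.0)` evaluates to zero, so normalizing the raw products would divide zero by zero.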

2.1 Preliminaries: The Pitman–Yor process

The Pitman–Yor process (Pitman & Yor, 1997) is a sequential process for label assignment whose allocation probabilities depend on a strength parameter $$ \alpha $$, on a discount parameter $$ \sigma $$, and on the size of the clusters previously detected. The allocation rule works as follows. Suppose that $$ {V}_1,\dots, {V}_n $$ are the taxon assignments for the DNA sequences in our library of barcodes at a given rank (such as phylum or class). Specifically, these sequences identify $$ {K}_n=k $$ distinct taxa, named $$ {V}_1^{\ast },\dots, {V}_k^{\ast } $$, with frequencies $$ {n}_1,\dots, {n}_k $$ and $$ {\sum}_{j=1}^k{n}_j=n $$. Then, the probability that the $$ {\left(n+1\right)}^{\mathrm{th}} $$ sequence belongs to the $$ {j}^{\mathrm{th}} $$ of the known taxa is
$$ p\left({V}_{n+1}={V}_j^{\ast}\mid {V}_1,\dots, {V}_n\right)=\frac{n_j-\sigma }{\alpha +n}, $$ (2)
for $$ j=1,\dots, k $$, while the probability of observing a new taxon is
$$ p\left({V}_{n+1}=\text{"new"}\mid {V}_1,\dots, {V}_n\right)=\frac{\alpha +\sigma k}{\alpha +n}, $$ (3)
where $$ \alpha >-\sigma $$ and $$ \sigma \in [0,1) $$. Figure 1 sketches the mechanism when $$ n=19 $$ sequences and $$ k=4 $$ different groups are observed. High values of $$ \alpha $$ or values of $$ \sigma $$ close to 1 lead to a high probability of discovering a new taxon. The probability that a sequence is assigned to taxon label $$ {V}_j^{\ast } $$ increases with its abundance $$ {n}_j $$. This process allows barcodes to be clustered together a priori through being assigned to the same existing or newly detected taxa. Both parameters can be easily estimated from the data via empirical Bayes if taxonomic frequencies $$ {n}_1,\dots, {n}_k $$ are observed. Refer to the Supporting Information for details, and to Favaro et al. (2009) and De Blasi et al. (2015) for a general overview.
FIGURE 1. Example of a Pitman–Yor process with n = 19, α = 1, σ = 0.25 and Kn = 4. Taxon names are reported on top of the circles, and frequencies of appearance are written to the right of the blue DNA sequences. Fractions in black denote the taxon probabilities for the orange DNA sequence. For example, the probability of observing the butterfly-shaped taxon $$ {V}_1^{\ast } $$ is (n1 − σ)∕(α + n) = (10 − 0.25)∕(1 + 19) = 39∕80. The probability for the unknown question mark taxon is (α + σk)∕(α + n) = (1 + 4 × 0.25)∕(1 + 19) = 1∕10.
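The allocation rule in Equations (2) and (3) can be checked numerically. The following is a minimal sketch that uses exact fractions to reproduce the two probabilities quoted in the caption of Figure 1. Only n = 19, k = 4 and n1 = 10 come from the text; the other three frequencies are assumed for illustration, and the function name is hypothetical.

```python
from fractions import Fraction

def pitman_yor_probs(counts, alpha, sigma):
    """Return (probabilities of the k observed taxa, probability of a new taxon)."""
    alpha, sigma = Fraction(alpha), Fraction(sigma)
    if not (0 <= sigma < 1 and alpha > -sigma):
        raise ValueError("need 0 <= sigma < 1 and alpha > -sigma")
    n, k = sum(counts), len(counts)
    # Equation (2): (n_j - sigma) / (alpha + n) for each observed taxon.
    existing = [(Fraction(n_j) - sigma) / (alpha + n) for n_j in counts]
    # Equation (3): (alpha + sigma * k) / (alpha + n) for a new taxon.
    new = (alpha + sigma * k) / (alpha + n)
    return existing, new

# Assumed frequencies summing to n = 19 with k = 4 taxa and n_1 = 10.
counts = [10, 5, 3, 1]
existing, new = pitman_yor_probs(counts, alpha=1, sigma=Fraction(1, 4))
print(existing[0], new)  # 39/80 1/10
```

The k + 1 probabilities always sum to one, since the numerators add up to (n − σk) + (α + σk) = α + n.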

2.2 Notation and taxonomic structure

A taxonomic library can be represented as a tree with branches of length $$ L\ge 2 $$, where each DNA sequence is uniquely associated with one leaf. We denote such a library as $$ {\mathcal{D}}_n={\left({\mathbf{V}}_i,{\mathbf{X}}_i\right)}_{i=1}^n $$, where $$ n $$ is the number of sequences, $$ {\mathbf{V}}_i={\left({V}_{i,\ell}\right)}_{\ell =1}^L $$ indicates the taxonomic labels of the $$ {i}^{\mathrm{th}} $$ sequence and $$ {\mathbf{X}}_i $$ is a representation of the associated DNA. For example, the library we use in our application is fully annotated up to rank $$ L=7 $$, where $$ L $$ represents the species level. Figure 2 displays an example of a taxonomic tree where sequences are classified into order, family and genus. Blue circles indicate nodes associated with at least one DNA sequence, while undiscovered branches are coloured in grey. The labels at a given level $$ \ell $$, namely $$ {V}_{1,\ell},\dots, {V}_{n,\ell} $$, take values in the space $$ {\mathcal{V}}_{\ell} $$ of distinct taxa. Given their discrete nature, multiple $$ {V}_{i,\ell} $$ can be associated with the same taxon. These realizations, which we denote as $$ {V}_{1,\ell}^{\ast },\dots, {V}_{k_{\ell},\ell}^{\ast } $$, are the nodes in our hierarchical taxonomy, with $$ {k}_{\ell} $$ being their total observed number at level $$ \ell $$. For example, the 28 sequences in Figure 2 identify two taxa at the order level: one that has a butterfly-type morphological trait, $$ {V}_{1,1}^{\ast } $$, and one with a bee-type trait, $$ {V}_{2,1}^{\ast } $$. Thus, $$ {k}_1=2 $$. The beetle-shaped insect node instead represents a potential order unobserved in the library.

FIGURE 2. Example of a three-level taxonomic library under our model. On the bottom-left corner of every node, we report the number of DNA sequences linked to it. The total sample size of this example is n = 28. Circles in blue indicate nodes linked to leaves with observed DNA sequences, while grey circles show all the possible missing or undiscovered branches, labelled with a question mark on the top-right corner. Variation in insect colour along each branch and across branches indicates DNA and morphological similarities and differences, respectively.

Due to the tree structure of the taxonomy, each generic node $$ {v}_{\ell}\in {\mathcal{V}}_{\ell} $$ at level $$ \ell $$ has a unique parent at level $$ \ell -1 $$, denoted as $$ \mathrm{pa}\left({v}_{\ell}\right) $$. In Figure 2, for instance, $$ \mathrm{pa}\left({V}_{1,2}^{\ast}\right)={V}_{1,1}^{\ast } $$ and $$ \mathrm{pa}\left({V}_{1,3}^{\ast}\right)=\mathrm{pa}\left({V}_{2,3}^{\ast}\right)={V}_{1,2}^{\ast } $$. For coherence, assume that the tree is rooted, namely $$ \mathrm{pa}\left({v}_1\right)={v}_0 $$ for any $$ {v}_1\in {\mathcal{V}}_1 $$. Each node in the tree is linked to multiple taxa at lower ranks. Let $$ {\rho}_n\left({v}_{\ell}\right) $$ be the set of observed nodes $$ {v}_{\ell +1} $$ for which $$ \mathrm{pa}\left({v}_{\ell +1}\right)={v}_{\ell} $$ when $$ n $$ sequences are observed, $$ {K}_n\left({v}_{\ell}\right)=\left|{\rho}_n\left({v}_{\ell}\right)\right| $$ be its cardinality and $$ {N}_n\left({v}_{\ell}\right) $$ be the number of DNA sequences belonging to $$ {v}_{\ell} $$. In Figure 2, $$ {\rho}_n\left({V}_{1,2}^{\ast}\right)=\left\{{V}_{1,3}^{\ast },{V}_{2,3}^{\ast}\right\} $$ and $$ {K}_n\left({V}_{1,2}^{\ast}\right)=2 $$, while $$ {\rho}_n\left({v}_0\right)=\left\{{V}_{1,1}^{\ast },{V}_{2,1}^{\ast}\right\} $$ and $$ {K}_n\left({v}_0\right)=2 $$ for the order level. Finally, the size of a node in our representation is determined as the sum of the number of sequences associated with all leaves connected to it. For example, $$ {N}_n\left({V}_{1,2}^{\ast}\right)=8 $$, and $$ {N}_n\left({V}_{1,1}^{\ast}\right)=12 $$. The quantities $$ \mathrm{pa}\left(\cdot \right) $$, $$ {\rho}_n\left(\cdot \right) $$, $$ {K}_n\left(\cdot \right) $$ and $$ {N}_n\left(\cdot \right) $$ are the key ingredients upon which we build our taxonomic prior in Equation (1).
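The quantities ρn(·), Kn(·) and Nn(·) can all be computed in one pass over the label vectors of the library. The sketch below is one possible representation, not the authors' data structure. Nodes are stored as label prefixes so that each node has a unique parent, and the toy library is hypothetical, chosen only to match the sizes quoted for Figure 2 (28 sequences, an order of size 12 containing a family of size 8 with two genera).

```python
from collections import defaultdict

def tree_summaries(labels):
    """labels: one tuple (v_1, ..., v_L) per reference sequence.

    Returns (children, size), where children[node] is the set rho_n(node) of
    observed child nodes and size[node] is N_n(node). A node is identified by
    its label prefix; the root v_0 is the empty tuple.
    """
    children = defaultdict(set)
    size = defaultdict(int)
    for branch in labels:
        size[()] += 1
        for level in range(1, len(branch) + 1):
            node = branch[:level]
            children[node[:-1]].add(node)  # pa(node) is the prefix one level up
            size[node] += 1
    return children, size

# Hypothetical three-level library (order, family, genus) with 28 sequences.
labels = (
    [("A", "A1", "g1")] * 5
    + [("A", "A1", "g2")] * 3
    + [("A", "A2", "g3")] * 4
    + [("B", "B1", "g4")] * 16
)
children, size = tree_summaries(labels)
k_root = len(children[()])             # K_n(v_0)
k_family = len(children[("A", "A1")])  # K_n of family A1
```

Identifying a node by its full prefix, rather than by its name alone, keeps the parent unique even if the same label string were to appear under two different parents.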

2.3 Taxonomic prior

The first step in our analysis consists of specifying a flexible prior for the frequencies of occurrence of different types of organisms at each taxonomic rank $$ \ell $$, including organisms of “new” types. In particular, we incorporate the Pitman–Yor process allocation probabilities in Equations (2) and (3) into the tree structure. Let $$ {\alpha}_{\ell} $$ and $$ {\sigma}_{\ell} $$ denote the allocation parameters for level $$ \ell $$, with $$ {\alpha}_{\ell}>-{\sigma}_{\ell} $$ and $$ {\sigma}_{\ell}\in [0,1) $$. Write $$ {\mathbf{V}}_{\cdot, \ell}^{(n)}={\left({V}_{i,\ell}\right)}_{i=1}^n $$ as the sequence of taxonomic labels observed at level $$ \ell $$. Then, the taxon of sequence $$ n+1 $$ at level $$ \ell $$, conditioned on it being allocated to node $$ {v}_{\ell -1} $$ at level $$ \ell -1 $$, has probabilities
$$ p\left({V}_{n+1,\ell}={V}_{j,\ell}^{\ast}\mid {V}_{n+1,\ell -1}={v}_{\ell -1},{\mathbf{V}}_{\cdot, \ell}^{(n)}\right)=\frac{N_n\left({V}_{j,\ell}^{\ast}\right)-{\sigma}_{\ell}}{\alpha_{\ell}+{N}_n\left({v}_{\ell -1}\right)}, $$ (4)
if the node $$ {V}_{j,\ell}^{\ast } $$ is such that $$ \mathrm{pa}\left({V}_{j,\ell}^{\ast}\right)={v}_{\ell -1} $$, and
$$ p\left({V}_{n+1,\ell}=\text{"new"}\mid {V}_{n+1,\ell -1}={v}_{\ell -1},{\mathbf{V}}_{\cdot, \ell}^{(n)}\right)=\frac{\alpha_{\ell}+{\sigma}_{\ell}{K}_n\left({v}_{\ell -1}\right)}{\alpha_{\ell}+{N}_n\left({v}_{\ell -1}\right)}, $$ (5)
if the node is new and originates from $$ {v}_{\ell -1} $$. The structure of Equations (4) and (5) is the same as the one in Equations (2) and (3), with the only difference being that nodes at level $$ \ell $$ are generated from their parent-specific process. The level-specific parameters $$ {\alpha}_{\ell} $$ and $$ {\sigma}_{\ell} $$ are important in allowing diversity to vary with taxonomic rank. Similarly to the one-level case discussed in Section 2.1, these parameters will be estimated based on the data. See the Supporting Information.
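Equations (4) and (5) apply the Pitman–Yor rule separately under each parent node. A minimal sketch for a single parent is given below. The function name, the `"new"` dictionary key and the parameter values are illustrative assumptions, and the parameters are fixed by hand here rather than estimated.

```python
import math

def child_probs(child_sizes, alpha, sigma):
    """Conditional prior over the children of one parent node.

    child_sizes maps each observed child to N_n(child); alpha and sigma are the
    parameters of the children's level. The key "new" is a hypothetical sentinel
    for an unobserved child.
    """
    if not child_sizes:
        # A parent that is itself new has no observed children, so the
        # only possible child is new and receives probability one.
        return {"new": 1.0}
    parent_size = sum(child_sizes.values())  # N_n(v_{l-1})
    denom = alpha + parent_size
    # Equation (4) for each observed child.
    probs = {child: (size - sigma) / denom for child, size in child_sizes.items()}
    # Equation (5), with K_n(v_{l-1}) = number of observed children.
    probs["new"] = (alpha + sigma * len(child_sizes)) / denom
    return probs

# Hypothetical family of size 8 with two observed genera.
genus_probs = child_probs({"g1": 5, "g2": 3}, alpha=1.0, sigma=0.5)
```

The parent size is obtained by summing the child sizes, which is valid because every sequence sits at a leaf and is therefore counted in exactly one child.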

2.4 DNA sequence likelihood

The second step to build the predictive rule in Equation (1) is to specify a distribution for the DNA sequences. We do this by adopting a kernel-based approach that flexibly accommodates different DNA representations.

As depicted in Figure 2, a query sequence $$ {\mathbf{X}}_i $$ is uniquely associated with one leaf of the taxonomic tree. Recalling that $$ \mathbf{v}=\left({v}_1,\dots, {v}_L\right) $$ denotes a taxonomic branch whose leaf is $$ {v}_L\in {\mathcal{V}}_L $$, we let
$$ \left({\mathbf{X}}_i\mid {\mathbf{V}}_i=\mathbf{v},{\boldsymbol{\theta}}_{v_L}\right)\overset{\mathrm{ind}}{\sim}\mathcal{K}\left({\mathbf{X}}_i;{\boldsymbol{\theta}}_{v_L}\right), $$ (6)
for every sequence $$ i=1,\dots, n $$, where $$ \mathcal{K}\left({\mathbf{X}}_i;\boldsymbol{\theta} \right) $$ is a kernel depending on parameters $$ \boldsymbol{\theta} $$ representing the likelihood of sequence data $$ {\mathbf{X}}_i $$, and $$ {\boldsymbol{\theta}}_{v_L} $$ is a collection of leaf-specific parameters. Implicitly, we assume that all DNA sequences associated with leaf $$ {v}_L $$ are independent and identically distributed as $$ \mathcal{K}\left(\cdot; {\boldsymbol{\theta}}_{v_L}\right) $$. Table 1 provides three examples of multinomial-type kernels when sequences are aligned and when they are not. Here, alignment implies that all the sequences are pre-processed to have the same length $$ p $$ so that the nucleotides at each position $$ s=1,\dots, p $$ are meaningfully comparable. Then, $$ {X}_{i,s} $$ is the nucleotide in the $$ {s}^{\mathrm{th}} $$ position of the $$ {i}^{\mathrm{th}} $$ query sequence, and $$ {\theta}_{v_L,s,g} $$ is the probability that nucleotide $$ g\in {\mathcal{N}}_1=\left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\} $$ is seen at $$ s $$ for taxon $$ {v}_L $$. Assuming independence across locations $$ s $$, a simplification that improves computational efficiency in constructing a probabilistic classifier, the resulting kernel is a product of multinomials with location-specific parameters.
TABLE 1. Examples of multinomial kernels for the DNA sequences. The column "Sequences" specifies whether the sequences in the library are aligned or not. The column "Kernel type" is the type of kernel chosen to model the DNA. The columns "Likelihood" and "Prior" give the likelihood and the prior for $$ {\boldsymbol{\theta}}_{v_L} $$ in each model, with dir indicating the probability density function of the Dirichlet distribution. $$ {\mathcal{N}}_{\kappa } $$ is the set of all κ-mers into which the sequence is decomposed. In the aligned case, this is a set of monomers $$ {\mathcal{N}}_1=\left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\} $$. The quantity $$ \mathbf{1}\left\{{X}_{i,s}=g\right\} $$ is an indicator equal to one if $$ {X}_{i,s}=g $$ and zero otherwise.

Sequences | Kernel type | Likelihood | Prior for $$ {\boldsymbol{\theta}}_{v_L} $$
--- | --- | --- | ---
Not aligned | κ-mers | $$ {\prod}_{g\in {\mathcal{N}}_{\kappa }}{\theta}_{v_L,g}^{n_{i,g}} $$ | $$ \mathrm{dir}\left({\boldsymbol{\xi}}_{v_L}\right) $$
Aligned | Product | $$ {\prod}_{s=1}^p{\prod}_{g\in {\mathcal{N}}_1}{\theta}_{v_L,s,g}^{\mathbf{1}\left\{{X}_{i,s}=g\right\}} $$ | $$ {\prod}_s\mathrm{dir}\left({\boldsymbol{\xi}}_{v_L,s}\right) $$
Aligned | κ-Product | $$ {\prod}_{s=1}^p{\prod}_{g\in {\mathcal{N}}_{\kappa }}{\theta}_{v_L,s,g}^{\mathbf{1}\left\{{X}_{i,s}=g\right\}} $$ | $$ {\prod}_s\mathrm{dir}\left({\boldsymbol{\xi}}_{v_L,s}\right) $$

When sequences are not aligned, each has its own length $$ {p}_i $$. A viable option is to use a κ-mer decomposition. This amounts to counting the number of times each of the $$ {4}^{\kappa } $$ possible substrings of length κ appears within the sequence. We denote as $$ {\mathcal{N}}_{\kappa } $$ the set of all κ-mers. For instance, 3-mers live in $$ {\mathcal{N}}_3=\left\{\mathrm{AAA},\mathrm{ACG},\mathrm{AGT},\dots \right\} $$, with a total of $$ {4}^3=64 $$ substrings. In Table 1, $$ {n}_{i,g}={\sum}_{s=1}^{t_i}\mathbf{1}\left\{{X}_{i,s}=g\right\} $$ denotes the number of times a κ-mer $$ g\in {\mathcal{N}}_{\kappa } $$ appears in the $$ {i}^{\mathrm{th}} $$ sequence, with $$ {t}_i={p}_i-\kappa +1 $$ being the total number of κ-mers observed when the length is $$ {p}_i $$. We model these counts as the output of a multinomial distribution, where $$ {\theta}_{v_L,g} $$ is the probability of κ-mer $$ g $$ at taxon $$ {v}_L $$. The κ-mer length κ is a modelling choice made a priori and usually requires adequate tuning. Finally, if sequences are aligned, it is also possible to combine the two kernels by considering a κ-mer/location-specific multinomial distribution. For example, choosing a 2-product kernel for a sequence AATGTA means that the realizations of the multinomial are AA in the first location, AT in the second, TG in the third and so on. This approach captures site dependencies better, but bears heavy computational costs for values of κ greater than 2.
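The κ-mer decomposition amounts to sliding a window of width κ along the sequence one position at a time. A minimal sketch follows, with a hypothetical function name and the assumption that the sequence contains only the letters A, C, G and T. It is applied to the AATGTA example from the text.

```python
from collections import Counter

def kmer_counts(sequence, kappa):
    """Count overlapping kappa-mers; a sequence of length p yields p - kappa + 1."""
    if kappa < 1 or kappa > len(sequence):
        raise ValueError("kappa must be between 1 and the sequence length")
    windows = (sequence[s:s + kappa] for s in range(len(sequence) - kappa + 1))
    return Counter(windows)

counts = kmer_counts("AATGTA", 2)
# The 2-mers, in order of location, are AA, AT, TG, GT, TA.
total = sum(counts.values())  # t_i = 6 - 2 + 1 = 5
```

A `Counter` stores only the κ-mers that actually occur, which is convenient because the number of possible κ-mers grows as 4 to the power κ while any single barcode contains at most p − κ + 1 of them.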

The choice of kernel depends on the application and the data. For example, insect DNA sequences can be easily aligned via hidden Markov models (Eddy, 1995), while fungal sequences often come without alignment due to their higher intrinsic variability. Irrespective of the structure of the data, our proposed multinomial kernels have the advantage of simplicity in computation, with the posterior distribution for $$ {\boldsymbol{\theta}}_{v_L} $$ obtained in analytic form by adopting conjugate Dirichlet priors as in Table 1. Computational efficiency is a critical issue both in training and in classifying very large numbers of sequences, making it intractable to consider elaborate likelihoods derived from realistic generative models of nucleotide sequences.
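To illustrate why conjugacy keeps computation simple, consider the aligned product kernel. By standard Dirichlet-multinomial conjugacy, the posterior at each position is again a Dirichlet whose parameters are the prior hyperparameters plus the observed nucleotide counts, so the predictive probability of a nucleotide is a simple ratio. The sketch below assumes sequences of equal length over the letters A, C, G and T with no gaps. The function name and the uniform hyperparameters are hypothetical and chosen only for the arithmetic.

```python
import math

def log_predictive(query, leaf_sequences, xi):
    """Log predictive probability of an aligned query under the product kernel.

    xi[s][g] is the Dirichlet hyperparameter of nucleotide g at position s.
    With an empty leaf_sequences the result is the prior predictive.
    """
    total = 0.0
    for s, g in enumerate(query):
        count_g = sum(1 for seq in leaf_sequences if seq[s] == g)
        numer = xi[s][g] + count_g                        # posterior Dirichlet parameter
        denom = sum(xi[s].values()) + len(leaf_sequences)
        total += math.log(numer / denom)                  # positions are independent
    return total

# Hypothetical hyperparameters: 0.25 for each nucleotide at each of 4 positions.
xi = [{g: 0.25 for g in "ACGT"} for _ in range(4)]
seen = log_predictive("ACGT", ["ACGT", "ACGA"], xi)
unseen = log_predictive("ACGT", [], xi)
```

In this toy case the query has predictive probability 0.75 × 0.75 × 0.75 × (1.25∕3) under the leaf with two reference sequences, against 0.25 to the fourth power under the prior alone.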

2.5 Prediction rule

The prior on the tree and the DNA sequence likelihood defined so far allow us to predict the set of labels $$ {\mathbf{V}}_{n+1} $$ for the query sequence $$ {\mathbf{X}}_{n+1} $$. BayesANT does this in bottom-up and top-down steps. In the bottom-up step, we use Equations (4), (5) and (6) to determine the posterior probability that $$ {\mathbf{X}}_{n+1} $$ belongs to any leaf in the tree. These include both the observed and the new taxa at the lowest level, as illustrated in Figure 2. Then, probabilities of higher nodes are computed by aggregating upward. In the top-down step, instead, BayesANT predicts a branch by iteratively choosing the child node with the highest probability at each level, starting from the root.

Let π n + 1 v = p V n + 1 = v V n $$ {\pi}_{n+1}\left(\mathbf{v}\right)=p\left({\mathbf{V}}_{n+1}=\mathbf{v}\mid {\mathbf{V}}^{(n)}\right) $$ be the prior probability for branch v after having observed all labels V n $$ {\mathbf{V}}^{(n)} $$ . By the chain rule, this is equal to the product of the prior conditional probabilities in Equations (4) and (5) of all nodes in the branch, which is
π n + 1 v = p V n + 1 , 1 = v 1 V , 1 n = 2 L p V n + 1 , = v V n + 1 , 1 = v 1 , V , n . $$ {\pi}_{n+1}\left(\mathbf{v}\right)=p\left({V}_{n+1,1}={v}_1\mid {\mathbf{V}}_{\cdot, 1}^{(n)}\right)\prod \limits_{\mathrm{\ell}=2}^Lp\left({V}_{n+1,\mathrm{\ell}}={v}_{\mathrm{\ell}}\mid {V}_{n+1,\mathrm{\ell}-1}={v}_{\mathrm{\ell}-1},{\mathbf{V}}_{\cdot, \mathrm{\ell}}^{(n)}\right). $$ (7)
Equation (7) corresponds to the prior taxon probability in Equation (1). Notice that if v = " new " $$ {v}_{\mathrm{\ell}}{=}^{"}{\mathrm{new}}^{"} $$ at some , the conditional probabilities at lower nodes are equal to 1. But then, the probability that V n + 1 $$ {\mathbf{V}}_{n+1} $$ is associated to branch v conditioned on the DNA sequence X n + 1 $$ {\mathbf{X}}_{n+1} $$ and D n $$ {\mathcal{D}}_n $$ , namely Equation (1), is
p n + 1 v = p V n + 1 = v X n + 1 D n π n + 1 v K X n + 1 θ v L p θ v L D n d θ v L . $$ {p}_{n+1}\left(\mathbf{v}\right)=p\left({\mathbf{V}}_{n+1}=\mathbf{v}\mid {\mathbf{X}}_{n+1},{\mathcal{D}}_n\right)\propto {\pi}_{n+1}\left(\mathbf{v}\right)\int \mathcal{K}\left({\mathbf{X}}_{n+1};{\boldsymbol{\theta}}_{v_L}\right)p\left({\boldsymbol{\theta}}_{v_L}\mid {\mathcal{D}}_n\right)\mathrm{d}{\boldsymbol{\theta}}_{v_L}. $$ (8)
The integral in Equation (8) is the posterior predictive distribution of DNA sequence X n + 1 $$ {\mathbf{X}}_{n+1} $$ with respect to the posterior of θ v L $$ {\boldsymbol{\theta}}_{v_L} $$ . When v = " new " $$ {v}_{\mathrm{\ell}}{=}^{"}{\mathrm{new}}^{"} $$ , this posterior is equal to the prior, i.e. p θ v L D n = p θ v L $$ p\left({\boldsymbol{\theta}}_{v_L}\mid {\mathcal{D}}_n\right)=p\left({\boldsymbol{\theta}}_{v_L}\right) $$ , as no sequence for vL is observed. The convenient property of the models in Table 1 is that both the prior and the posterior predictive distribution have simple and easy-to-compute analytic forms. Once Equation (8) has been evaluated for all leaves, the probabilities of higher nodes in the taxonomy can be easily derived via upward aggregation. Then, we predict the taxa by starting from the root of the tree and recursively selecting the child node with the highest probability. Specifically, the predicted sequence of taxa v * = v * = 1 L $$ {\mathbf{v}}^{\ast }={\left({v}_{\mathrm{\ell}}^{\ast}\right)}_{\mathrm{\ell}=1}^L $$ for the DNA sequence at n + 1 satisfies
$$ v_{\ell}^{\ast} = \arg\max_{v_{\ell} \in \rho_n(v_{\ell-1}^{\ast})} \sum_{\mathbf{v}:\, v_L \in \mathcal{L}_n(v_{\ell})} p_{n+1}(\mathbf{v}), $$ (9)
where $\{\mathbf{v}: v_L \in \mathcal{L}_n(v_{\ell})\}$ is the set of all branches $\mathbf{v} = (v_1, \dots, v_L)$ whose leaves $v_L$ are linked to node $v_{\ell}$ in a library of $n$ DNA sequences.
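The prediction rule in Equation (9) can be illustrated with a minimal Python sketch. It assumes the leaf probabilities from Equation (8) are already available; branches are represented as tuples, nodes as branch prefixes, and the function name `predict_branch` is hypothetical rather than part of the BayesANT software.

```python
from collections import defaultdict

def predict_branch(leaf_probs):
    """Greedy top-down prediction from full-branch probabilities.

    leaf_probs maps a branch tuple (v_1, ..., v_L) to its probability.
    Returns the predicted branch and the aggregated probability of each chosen node.
    """
    depth = len(next(iter(leaf_probs)))
    # Upward aggregation: a node's probability is the sum over the branches below it.
    node_probs = defaultdict(float)
    for branch, p in leaf_probs.items():
        for level in range(1, depth + 1):
            node_probs[branch[:level]] += p
    # Top-down selection: at each level pick the most probable child of the current node.
    path, path_probs = (), []
    for level in range(1, depth + 1):
        children = [node for node in node_probs
                    if len(node) == level and node[:level - 1] == path]
        path = max(children, key=node_probs.get)
        path_probs.append(node_probs[path])
    return path, path_probs

# Toy example with three ranks (order, family, genus); the numbers are made up.
leaf_probs = {
    ("Diptera", "Tephritidae", "Acidia"): 0.30,
    ("Diptera", "Tephritidae", "Trypeta"): 0.25,
    ("Diptera", "new", "new"): 0.05,
    ("Coleoptera", "Carabidae", "Carabus"): 0.40,
}
branch, probs = predict_branch(leaf_probs)
```

In this toy example the single most probable leaf lies in Coleoptera, yet the recursive rule selects the Diptera branch because Diptera has the larger aggregated probability at the first rank.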

2.6 Hyperparameter tuning

The hyperparameters $\boldsymbol{\xi}_{v_L}$ of the multinomial kernel play a fundamental role in novel species recognition. As detailed above, when $v_L = \text{``new''}$, Equation (8) is a prior predictive probability, since no sequence is observed for $v_L$ and thus $p(\boldsymbol{\theta}_{v_L} \mid \mathcal{D}_n) = p(\boldsymbol{\theta}_{v_L})$. In such cases, prior hyperparameters should contain information regarding the taxonomic branch and level where novelty appears. Uniform priors may be unreasonably vague, leading to underestimation of the prior predictive probability of novel taxa relative to the true proportion. Thus, we tune each $\boldsymbol{\xi}_{v_L}$ as follows. Consider a taxon $v_{L-1}$ at level $L-1$. If $v_{L-1}$ is not “new”, the hyperparameters $\boldsymbol{\xi}_{v_L}$ of all the leaves $v_L \in \mathcal{L}_n(v_{L-1})$ linked to it, including the new one, are all equal, and they are obtained via the method of moments from the DNA sequences $\mathbf{X}_i$ with $V_{i,L-1} = v_{L-1}$. If instead $v_{L-1}$ is a “new” node and the last non-novel node in its branch is $v_{\ell}$ at level $\ell \le L-1$, the method of moments is applied to the set of sequences $\mathbf{X}_i$ such that $V_{i,\ell} = v_{\ell}$. This ensures borrowing of information between the branches when the novelty appears at higher levels in the taxonomy. Moreover, this approach tailors the prior predictive distribution of a node to the intrinsic location-specific nucleotide variability of the sequences linked to it. Thus, novelty probability is high in a node if the query is coherent with the observed variability at that node but is not sufficiently similar, in terms of the kernel, to any of the training sequences linked to the children nodes. For mathematical details on the method of moments applied to the multinomial kernels of Table 1, see the Supporting Information.
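To see why the hyperparameters matter for novel leaves, the following Python sketch evaluates the integral in Equation (8) in closed form. It assumes the single-location product-multinomial kernel with an independent Dirichlet prior at each locus; Table 1 and the moment-based tuning are not reproduced here, so the hyperparameters `xi` are simply taken as given, and the function name is hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def log_predictive(query, counts, xi):
    """Log predictive probability of `query` under a product-multinomial kernel.

    counts[s][g]: number of training sequences of the leaf with nucleotide g at locus s
                  (all zeros for a "new" leaf, which yields the prior predictive).
    xi[s][g]:     Dirichlet hyperparameter for nucleotide g at locus s.
    Characters outside ACGT (such as the gap "-") are treated as missing and skipped.
    """
    total = 0.0
    for s, base in enumerate(query):
        if base not in NUCLEOTIDES:
            continue
        g = NUCLEOTIDES.index(base)
        # Dirichlet-multinomial conjugacy: posterior mean of the nucleotide probability.
        num = xi[s][g] + counts[s][g]
        den = sum(xi[s]) + sum(counts[s])
        total += math.log(num / den)
    return total
```

With all counts equal to zero the expression reduces to `xi[s][g] / sum(xi[s])` at each locus, so the predictive probability of a query under a "new" leaf is driven entirely by the tuned hyperparameters.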

2.7 Calibration of prediction probabilities

Misspecification of a Bayesian model, due to inaccuracies in the prior and/or likelihood function, may lead to predictive probabilities that are not sufficiently well calibrated to accurately capture predictive uncertainties (Grünwald & van Ommen, 2017; Miller & Dunson, 2019). Given the complexity of the true data-generating likelihood underlying DNA barcoding data, and the necessity of using a simple likelihood for computational tractability, some degree of misspecification is inevitable. We apply a simple re-calibration approach to adjust the predictive probabilities used in Equation (9) for misspecification.

In particular, we post-process the prediction probabilities in Equation (8) by raising them to a power ρ ∈ (0,1] and then renormalizing. The new probabilities for the $(n+1)$th sequence are
$$ \tilde{p}_{n+1}(\mathbf{v}) = \frac{p_{n+1}(\mathbf{v})^{\rho}}{\sum_{\mathbf{v}'} p_{n+1}(\mathbf{v}')^{\rho}}, $$
and can be used in place of $p_{n+1}(\mathbf{v})$ in Equation (9). Such a strategy does not alter the ranking of the original probabilities since the transformation is monotonic. Moreover, if $p_{n+1}(\mathbf{v}) = 1$, then also $\tilde{p}_{n+1}(\mathbf{v}) = 1$. This implies that we do not substantially alter the prediction whenever BayesANT is certain about a taxon. Values for ρ can be chosen via cross-validation on a hold-out subset of the training library, following strategies such as the ones described in Guo et al. (2017). Specifically, prediction probabilities are calibrated if the average probability for the predicted nodes is equal to the classification accuracy (Somervuo et al., 2016). For example, if 90% of the sequences are correctly classified, ideally the average classification probability is approximately 0.9. Average values of 0.5 and 0.99 would instead mean that the algorithm is too conservative when right and too confident when wrong, respectively. In the application discussed in this paper, we select ρ = 0.1 and ρ = 0.06 depending on the testing scenario.
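The re-calibration step can be written in a few lines of Python. This is a minimal sketch of the transformation above applied to one vector of branch probabilities; the function name `temper` is hypothetical.

```python
def temper(probs, rho):
    """Raise each probability to the power rho in (0, 1] and renormalize."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    powered = [p ** rho for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# An over-confident vector is flattened, but the ordering of the branches is kept.
flattened = temper([0.90, 0.09, 0.01], rho=0.1)
# A vector that puts all mass on one branch is left unchanged.
unchanged = temper([1.0, 0.0, 0.0], rho=0.1)
```

Smaller values of ρ flatten the probabilities more strongly, which is why ρ is treated as a tuning parameter.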

3 RESULTS

3.1 The FinBOL library

The Finland Barcode of Life initiative (FinBOL) is a DNA barcoding library that contains reference sequences with highly reliable taxonomic annotations for the arthropod species of Finland. The data were constructed with substantial attention to barcode quality, thanks to the collective effort of about 150 taxonomists. Biological material was collected from previously identified specimens conserved in museums or private collections, and later processed via PCR sequencing. For a thorough description of how the library was assembled and later tested, refer to Roslin et al. (2022).

The version of the data we consider contains a total of 34,624 DNA sequences annotated across seven taxonomic levels, namely class, order, family, subfamily, tribe, genus and species. Reference annotations are based on the national checklist of Finnish species (FinBIF, 2020) with the inclusion of dummy taxa whenever subfamily and tribe were missing. The library has been globally aligned via Hidden Markov Models using the HMMER software (Eddy, 1995). As a result, each sequence has a length of 658 base pairs, consisting of nucleotides “A”, “C”, “G” and “T” and alignment gaps “-”. Other infrequent special characters are ignored and treated as missing values for simplicity. Taxonomic labels in the data comprise 3 classes, Arachnida, Insecta and Malacostraca, appearing 1842, 32,781 and 1 times, respectively. The sequences are further divided into 21 orders, 476 families, 896 subfamilies, 1355 tribes, 3855 genera and 10,985 species, 3025 of which have a single reference sequence associated with them.

Figure 3 depicts the pairwise raw DNA similarities, calculated as the fraction of locations with identical nucleotides, between 3000 sequences randomly sampled without replacement from the library. Each row/column represents the DNA similarity between one sequence and all the other sampled ones, with darker tones indicating higher similarities. Sequences are sorted alphabetically first by order and then by family to ensure cluster separation. In particular, boxes in dark blue along the main diagonal highlight the cross similarities within the orders, while boxes in light blue refer to the families. On the left side of the figure we report the names and the sizes of the 5 most frequent orders, namely Araneae, Diptera, Coleoptera, Hymenoptera and Lepidoptera. In an ideal setting, the within-taxon similarities along the main diagonal should be higher than the cross-taxa ones. However, this is only true for Lepidoptera and for the two largest families in Hymenoptera, Ichneumonidae and Tenthredinidae. Indeed, Diptera and Coleoptera are virtually indistinguishable, as they show a similar within- and between-order similarity. Moreover, these two taxa show a high cross-similarity with Lepidoptera, as indicated by the off-diagonal orange rectangles. Overall, the average DNA similarity in the library is around 0.81, with a standard deviation of 0.04, indicating that the sequences are highly homogeneous.

FIGURE 3. Pairwise DNA similarities between 3000 randomly sampled sequences from the FinBOL library. The blue and light blue boxes along the main diagonal identify the orders and the families, respectively. Numbers on the left side represent the frequencies for the five largest orders in the data. Darker tones of red indicate higher similarity.
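The raw similarity used in Figure 3 is the fraction of aligned locations with identical characters. A minimal Python sketch follows; how gaps and special characters enter the computation is not stated in the text, so here every aligned position is counted as is, which is an assumption, and the function name is hypothetical.

```python
def raw_similarity(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)

# Two toy aligned fragments that differ at one of six positions.
similarity = raw_similarity("ACGT-A", "ACGA-A")
```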

3.1.1 Testing scenarios

We aim to evaluate the performance of the predictive taxa classification probabilities produced by BayesANT. These probabilities should reflect whether the true taxa of a test sequence are observed in training or not. In the first case, the ideal output assigns a high or close-to-one probability to the true branch at every level, and a near-zero one to all the other branches in the tree. In the second case, instead, if the true affiliation of a sequence is observed for levels $1, \dots, \ell$ and unobserved for levels $\ell+1, \dots, L$, we would like BayesANT to output a high probability for the true nodes up to $\ell$ and the highest conditional probability to the “new” clade at level $\ell+1$. To test our algorithm, we train the classifiers on a random subset of 80% of the FinBOL data and predict the taxonomic affiliation for the remaining 20% of the sequences. By construction, this procedure makes some taxa present in the training set only, others in both the training and the test, and some solely in the test set. We refer to this last category as the “new”, the “novel” or the “unobserved” taxa, treating the three terms as interchangeable synonyms.

We consider two testing scenarios summarized in panels (a) and (b) in Figure 4. In the first, each sequence in the library has equal probability of being allocated to the test set. This makes the taxonomic composition of the training and test set similar. As a result, only a relatively small fraction of the taxa will be unobserved in training, as is evident from both plots at the top of panels (a) and (b). In the second scenario, we create the test set by stratified sampling: for each test observation, we first sample the family, and then draw one sequence within that family. This assigns each family an equal probability of being selected, irrespective of its frequency of appearance in the data. Such a procedure yields a different composition between training and test, resulting in many more test taxa unobserved in training. In total, the number of barcodes whose true branch has at least one node unobserved in training is 884 in scenario 1 and 2672 in scenario 2, while the total number of query test sequences is 6924 in each case. Furthermore, the proportion of test DNA sequences associated with the most frequent orders differs from their training counterpart. For example, 30% of the sequences in the training library in scenario 2 are Lepidoptera and only 2.5% pertain to Hemiptera; in the test set, however, these fractions become 20% and 5%, respectively, with a much larger proportion of unknowns than in scenario 1. See the bottom of Figure 4, panel (b).

FIGURE 4. Panel (a): Taxonomic composition of the training and the test libraries in the two splitting scenarios. Panel (b): Proportion of DNA barcoding sequences pertaining to the larger orders in the data in both scenarios. The fractions highlighted in dark blue refer to the barcodes which truly belong to the mentioned order, but whose true species is unobserved in training. The total number of sequences in each scenario is 27,699 in the training library and 6925 in the test.
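The two splitting schemes can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: records are dictionaries with a `"family"` key, test sequences are drawn without replacement, and a family is dropped from the draw once it has no sequences left. All function names are hypothetical.

```python
import random
from collections import defaultdict

def random_split(records, test_fraction, rng):
    """Scenario 1: every sequence is equally likely to enter the test set."""
    n_test = round(test_fraction * len(records))
    test_idx = set(rng.sample(range(len(records)), n_test))
    train = [r for i, r in enumerate(records) if i not in test_idx]
    test = [r for i, r in enumerate(records) if i in test_idx]
    return train, test

def family_stratified_split(records, test_fraction, rng):
    """Scenario 2: draw a family uniformly, then one remaining sequence within it."""
    n_test = round(test_fraction * len(records))
    by_family = defaultdict(list)
    for i, rec in enumerate(records):
        by_family[rec["family"]].append(i)
    families = sorted(by_family)
    test_idx = set()
    while len(test_idx) < n_test:
        fam = rng.choice(families)
        pool = by_family[fam]
        test_idx.add(pool.pop(rng.randrange(len(pool))))
        if not pool:  # assumption: exhausted families are no longer drawn
            families.remove(fam)
    train = [r for i, r in enumerate(records) if i not in test_idx]
    test = [r for i, r in enumerate(records) if i in test_idx]
    return train, test
```

Because the second scheme gives each family the same chance of selection, sequences from rare families are far more likely to end up in the test set, which is what produces the larger share of taxa unobserved in training.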

3.2 Test results

BayesANT computes the probability of every node in the taxonomic tree, including potential novel ones, for every test DNA sequence. The predicted annotation is the taxonomic branch with the highest probability at every rank. These probabilities express the uncertainty of the classification, and need to be well calibrated to be reliable: for instance, if 90% of the sequences are correctly classified, then the average probability with which they are classified should be around 0.9. Ideally, we would like to limit cases in which the algorithm is too confident when wrong and too conservative when right; see the Materials and Methods section. Moreover, evaluating the performance of BayesANT requires a clear definition of correctness of the classification under novel taxa. Suppose the true annotation of a test sequence shows a taxon that is unobserved in training. In that case, the prediction outcome may be the correct novel taxonomic leaf, or a new taxon but in an incorrect branch, or a taxon observed in training. We consider the classification correct in the first case and wrong otherwise. For example, if the true annotation of the test sequence is

  • Insecta ‐> Diptera ‐> Tephritidae ‐> Trypetinae ‐> Trypetini ‐> Acidia ‐> Acidia cognata

but Acidia is a genus not observed in the training set, then the correct classification up to the species rank is

  • Insecta ‐> Diptera ‐> Tephritidae ‐> Trypetinae ‐> Trypetini ‐> New Genus in Trypetini ‐> New Species in New Genus in Trypetini

since the novelty produces a new genus and automatically a new species linked to it. As Acidia is not observed, the species Acidia cognata is necessarily unseen as well, and the classification at the species level is correct only if BayesANT recognizes the novel genus. An outcome such as

  • Insecta ‐> Diptera ‐> Tephritidae ‐> Trypetinae ‐> Trypetini ‐> Trypeta ‐> New Species in Trypeta

is wrong but recognizes a novel leaf, while

  • Insecta ‐> Diptera ‐> Tephritidae ‐> Trypetinae ‐> Trypetini ‐> Trypeta ‐> Trypeta zoe

is wrong since it predicts an observed species. When computing accuracy, the first example is correct at the genus and species level, while the other two are not. Unlike other approaches (e.g. see Edgar, 2018), this over-penalizes the cases when the algorithm fails at predicting the correct novel clade.
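The correctness rule illustrated by these examples can be sketched in Python. The helper names are hypothetical, and the representation of the training library as a set of observed branch prefixes is an assumption made for the example.

```python
def target_branch(true_branch, observed, rank_names):
    """Relabel a true annotation so that taxa unseen in training become "new" nodes.

    true_branch: tuple of taxon names from the highest to the lowest rank.
    observed:    set of branch prefixes (tuples) present in the training library.
    rank_names:  names of the ranks, aligned with true_branch.
    """
    target = []
    for level, (taxon, rank) in enumerate(zip(true_branch, rank_names), start=1):
        if true_branch[:level] in observed:
            target.append(taxon)
        else:
            parent = target[-1] if target else "root"
            target.append(f"New {rank} in {parent}")
    return tuple(target)

def correct_by_rank(predicted, target):
    """True at each rank where the predicted node equals the target node."""
    return [p == t for p, t in zip(predicted, target)]

ranks = ("Class", "Order", "Family", "Subfamily", "Tribe", "Genus", "Species")
training = [("Insecta", "Diptera", "Tephritidae", "Trypetinae",
             "Trypetini", "Trypeta", "Trypeta zoe")]
observed = {b[:k] for b in training for k in range(1, len(b) + 1)}
truth = ("Insecta", "Diptera", "Tephritidae", "Trypetinae",
         "Trypetini", "Acidia", "Acidia cognata")
target = target_branch(truth, observed, ranks)
```

Here `target` ends with "New Genus in Trypetini" followed by "New Species in New Genus in Trypetini", so a prediction that places the query in Trypeta is counted as wrong at both the genus and the species rank, even when it flags a new species.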

Figure 5 displays the prediction probabilities of BayesANT in both FinBOL scenarios by plotting the relationship between the % cumulative probability and the % cumulative accuracy at the species level. As the library is globally aligned, we adopt a simple product-multinomial kernel in which the probabilities of nucleotides “A”, “C”, “G” and “T” vary by loci and species. We treat the alignment gap “-” as a missing value and ignore the likelihood contribution of the locations where it appears. For an assessment of how these missing values affect the classification, see the Supporting Information. The rank-specific parameters α and σ are estimated from the data and we report their values in Table 2. Operations were performed on an AMD Ryzen 3900-based dedicated server with 128GB of memory on Ubuntu 20.04, R version 4.1.1 linked to Intel MKL 2019.5–075. Training the algorithm on 27,699 sequences took 1.7 minutes in scenario 1 (10,422 species) and 1.4 minutes in scenario 2 (9490 species), while predicting the remaining 6924 test queries took 10.2 minutes on a single thread and 1.4 minutes on 24 separate threads in each scenario. See the Supporting Information for additional details on computational time. In Figure 5, the dashed diagonal indicates a perfectly calibrated output, while trajectories below and above it imply over- and under-confidence, respectively. The dark blue lines show that BayesANT produces well-calibrated predictive probabilities on the test data, with a prediction accuracy equal to 85.2% and 70.6% and an average prediction probability of 0.82 and 0.70 in Scenarios 1 and 2, respectively. Results in Scenarios 1 and 2 below are based on adjusting initial probabilities with a temperature parameter ρ = 0.1 and ρ = 0.06, respectively. Both values were chosen via standard cross-validation methods as follows. For a given training-test split, we first randomly assign 20% of the training sequences to a hold-out validation set. Then, we train the model on the remaining 80% of the training library and evaluate the prediction probabilities for the validation sequences against a set of pre-determined values for ρ. As a final step, we re-train the model on the full training set and predict the test sequences using the value of ρ that yielded the best calibration in the hold-out.
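The selection of ρ on the hold-out set can be sketched as a grid search. The text does not state the exact calibration criterion, so this Python example assumes one plausible choice: the absolute gap between the average probability of the predicted class and the accuracy. It also works on flat probability vectors rather than the full taxonomic tree, and the function names are hypothetical.

```python
def calibration_gap(prob_vectors, true_index, rho):
    """Absolute gap between mean top probability and accuracy after tempering."""
    top_probs, hits = [], []
    for probs, truth in zip(prob_vectors, true_index):
        powered = [p ** rho for p in probs]
        total = sum(powered)
        tempered = [p / total for p in powered]
        best = max(range(len(tempered)), key=tempered.__getitem__)
        top_probs.append(tempered[best])
        hits.append(best == truth)
    return abs(sum(top_probs) / len(top_probs) - sum(hits) / len(hits))

def select_rho(prob_vectors, true_index, grid):
    """Pick the grid value with the smallest calibration gap on the hold-out set."""
    return min(grid, key=lambda rho: calibration_gap(prob_vectors, true_index, rho))

# Toy hold-out set: the model is 99% confident but right only half of the time.
holdout_probs = [[0.99, 0.01]] * 4
holdout_truth = [0, 0, 1, 1]
best_rho = select_rho(holdout_probs, holdout_truth, grid=[1.0, 0.5, 0.1])
```

In the toy hold-out set the smallest grid value is selected, because it pulls the average confidence closest to the 50% accuracy.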

FIGURE 5. Calibration plot for the prediction of BayesANT at the species level under both scenarios. The dashed diagonal line indicates perfect calibration, while the percentages next to the points are the species accuracies in the test sets. Notice that “all data” includes all the 6925 query sequences in the test set, “new” refers to those whose true taxon is not observed in training at some rank (884 in scenario 1, 2642 in scenario 2), while “observed” restricts to the cases where the true taxonomy is fully observed.
TABLE 2. Estimated Pitman-Yor parameters for each level in the taxonomy

| Scenario | Param. | Class | Order | Family | Subfam. | Tribe | Genus | Species |
|---|---|---|---|---|---|---|---|---|
| 1 | α | 0.19 | 1.17 | 4.16 | 1.04 | 1.16 | 1.86 | 7.11 |
| 1 | σ | 0.00 | 0.01 | 0.12 | 0.00 | 0.00 | 0.05 | 0.00 |
| 2 | α | 0.19 | 0.76 | 2.66 | 1.08 | 1.12 | 1.85 | 6.74 |
| 2 | σ | 0.00 | 0.03 | 0.13 | 0.00 | 0.00 | 0.07 | 0.00 |

For the novel cases, the number of sequences predicted to belong to a “new” leaf in Scenario 1 is 958, while their true number is 884. Of these 884 queries, 77.9% are correctly recognized as novel, and 31.1% are classified correctly up to and including the species level, with average probability equal to 0.44, as depicted by the orange line in the left panel. This implies that, while the exact “new” leaf in the taxonomy is generally challenging to retrieve due to insufficient signal in the dataset, BayesANT recognizes the potential novelty of the taxon of a sequence fairly well. Similar results are obtained in Scenario 2. While accuracy is lower due to a higher number of sequences with unobserved taxa, the predicted novel leaves are 2736 against 2672 truly “new”. Here, 93.8% are recognized as novel, and 33.7% are placed in the correct novel clade in the taxonomic tree. Verifying the effective novelty of the predicted “new” branches requires care and further investigation, for example by morphological assessment and more comprehensive reference barcode sequencing of new samples collected at the same geographic location. For instance, training BayesANT on a library of insects collected in Finland and using it to predict queries collected from South Africa might lead to an overwhelming number of barcodes labelled as “new” simply due to structural differences between the data, even if the latter has a well-established taxonomy.

In considering these taxonomic classification results, it is important to keep in mind the limited information provided by the available nucleotide sequencing in the COI gene. This information can be insufficient to assign certain query sequences to the correct taxon accurately. As we described in Figure 3, for example, orders Coleoptera and Diptera show a high cross-similarity. Indeed, these orders appear to be harder to classify: in Scenario 1, 36.2% of the incorrectly classified sequences at the species level are Diptera, followed by Hymenoptera (23.5%), Coleoptera (16.7%) and Lepidoptera (12.2%). This is even more evident in Scenario 2, with 29.9% of wrong predictions for Diptera and 19.6% for Coleoptera. These orders are known to be the most prone to barcoding mislabelling (Meier et al., 2006). For a full breakdown of the accuracies across orders, refer to the Supporting Information.

To investigate whether this lack of information is a primary cause of the incorrect classifications produced by BayesANT, we measured, for each wrongly classified test query, its average similarity to the training sequences annotated with the predicted taxon. Figure 6 shows the distribution of these average similarities. Indeed, these are generally high, with an average of 0.983 under both Scenarios. This suggests that misclassification tends to be due to insufficient information to distinguish between the true taxon and an incorrect taxon that is extremely close in the COI region to the query sequence, which sometimes even leads to small discrepancies between barcode similarities and true taxonomic affiliation. An example can be seen in Figure 7, which reports the pairwise DNA similarity between the query test sequence FISYO1282-18 and the training barcodes whose species are labelled as Allantus calceatus and Allantus basalis. In FinBOL, the true species of the query sequence is Al. basalis, while BayesANT wrongly suggests that the most likely species is Al. calceatus with a prediction probability equal to 0.942. The resemblance between the orange picture and those referring to FISYO2086-18 and FISYO270-18 should be evident. However, the DNA barcodes suggest the opposite: the average similarity between the query and Al. calceatus is higher than the one with Al. basalis, thus explaining the incorrect prediction. Indeed, such discrepancies have led to the introduction of the Barcode Index Number system (BIN) to cluster similar COI barcodes into OTUs. For example, all the sequences in Figure 7 fall into the same BIN called BOLD:ABZ8200. For an extensive discussion on the topic, refer to Ratnasingham and Hebert (2013) and Phillips et al. (2019).

FIGURE 6. Average DNA similarity between the test query sequences and the predicted taxa when BayesANT incorrectly predicts a taxon observed in training.

FIGURE 7. Pairwise DNA similarity between the test sequence FISYO1282-18 and the training barcodes belonging to the species Allantus basalis (19 sequences) and Allantus calceatus (5 sequences) in scenario 1. Each dot represents a pairwise similarity, with stacked dots indicating equality. The pictures in the blue boxes depict the specimen from which the training barcodes associated with the blue points have been sequenced. The orange box is the specimen of the test query, which is annotated as Allantus basalis in FinBOL. The bottom-right corner reports the predicted species probabilities returned by BayesANT. All pictures are publicly available at https://www.boldsystems.org/ and licenced under CC BY-NC 3.0. Licence holder: Marko Mutanen, University of Oulu.

3.3 Benchmarking

As the last step in our analysis, we benchmark the performance of BayesANT on the FinBOL library against several alternatives in terms of accuracy. Table 3 reports the results under both Scenarios. m-1 refers to the single location multinomial kernel we adopted in our analysis above. While this is our method of reference due to its simplicity and flexibility, it treats loci as independent. Dependence can be introduced by adopting a 2-mer location kernel, m-2, where the support of the multinomial is in $\{\mathrm{AA}, \mathrm{AC}, \mathrm{AT}, \dots, \mathrm{TT}\}$ and 2-mers are overlapping. To assess the advantage of adopting a Pitman–Yor prior over the taxonomic tree, we also compare with an analysis that lets α = σ = 0 at every level $\ell$. This does not allow new species, and the prior is the proportion with which each taxon appears in the library at every rank. These methods are labelled as m-1, no new and m-2, no new in Table 3. Although sequences are aligned, we also test the performance of BayesANT under the multinomial kernel over the κ-mer decomposition. In particular, k-5 and k-6 report accuracies and average prediction probabilities when fixing κ = 5 and κ = 6, respectively. Finally, we benchmark all these alternatives against the popular RDP classifier (Wang et al., 2007, version 2.13, 2020). While the number of taxonomic classifiers is rather vast, we focus on RDP as it is a longstanding method that, similarly to ours, relies on Naïve Bayes classification strategies and provides a minimum standard for accuracy. In particular, we do not set a confidence cutoff for RDP, but we consider its full classifications up to the species rank. This allows us to benchmark its calibration under wrong predictions, which necessarily happen when the sequences are novel.
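The difference between the location-based 2-mer kernel and the κ-mer decomposition lies in the features extracted from each sequence. The Python sketch below shows both; the function names are hypothetical, and gaps are not given special treatment here because the text does not describe how they are handled inside κ-mers.

```python
from collections import Counter

def location_kmers(seq, k=2):
    """Overlapping k-mers indexed by their starting location (for aligned sequences)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_counts(seq, k):
    """Location-free decomposition: how often each overlapping k-mer occurs."""
    return Counter(location_kmers(seq, k))

pairs = location_kmers("ACGTA")   # one 2-mer per starting location
counts = kmer_counts("ACACA", 2)  # counts pooled over all locations
```

The first representation keeps the position of each 2-mer and therefore relies on the alignment, while the second discards positions and can be applied to unaligned sequences.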

TABLE 3. Overall predictive performances of DNA barcoding algorithms on the FinBOL library under the two testing scenarios. Values report the % of DNA sequences correctly labelled, while values in parentheses denote the average prediction probabilities in the whole test set. Underlined values indicate the best performances.

Scenario 1: pure random split

| Model | Class | Order | Family | Subfam. | Tribe | Genus | Species |
|---|---|---|---|---|---|---|---|
| m-1 | 100.0 (1) | 99.9 (1) | 98.6 (0.98) | 97.5 (0.96) | 96.0 (0.94) | 92.1 (0.91) | 85.2 (0.82) |
| m-2 | 100.0 (1) | 99.9 (1) | 98.4 (0.98) | 97.2 (0.97) | 95.8 (0.95) | 92.4 (0.93) | 85.4 (0.86) |
| m-1, no new | 100.0 (1) | 99.5 (1) | 98.0 (0.99) | 97.2 (0.98) | 96.7 (0.98) | 94.3 (0.98) | 83.3 (0.92) |
| m-2, no new | 100.0 (1) | 99.5 (1) | 97.5 (0.99) | 96.7 (0.98) | 96.2 (0.98) | 93.8 (0.98) | 83.2 (0.91) |
| k-5 | 99.5 (1) | 96.4 (0.95) | 92.8 (0.92) | 91.6 (0.91) | 91.1 (0.91) | 89.4 (0.9) | 79.8 (0.87) |
| k-6 | 99.4 (1) | 94.9 (0.96) | 92.0 (0.94) | 91.0 (0.94) | 90.7 (0.94) | 89.6 (0.93) | 80.3 (0.91) |
| rdp | 100.0 (1) | 99.6 (0.99) | 97.9 (0.97) | 97.1 (0.96) | 96.7 (0.95) | 94.2 (0.94) | 83.1 (0.92) |

Scenario 2: stratified random split

| Model | Class | Order | Family | Subfam. | Tribe | Genus | Species |
|---|---|---|---|---|---|---|---|
| m-1 | 99.8 (0.99) | 97.0 (0.97) | 82.6 (0.87) | 80.8 (0.83) | 79.7 (0.8) | 75.3 (0.77) | 70.6 (0.7) |
| m-2 | 99.8 (0.98) | 97.1 (0.96) | 82.1 (0.88) | 80.3 (0.84) | 79.1 (0.81) | 75.7 (0.8) | 69.8 (0.74) |
| m-1, no new | 98.7 (1) | 91.8 (0.98) | 75.8 (0.91) | 75.3 (0.89) | 74.6 (0.89) | 72.1 (0.88) | 59.4 (0.78) |
| m-2, no new | 96.8 (1) | 89.3 (0.98) | 73.8 (0.91) | 73.3 (0.89) | 72.8 (0.89) | 70.8 (0.88) | 59.1 (0.74) |
| k-5 | 96.1 (0.99) | 80.6 (0.8) | 66.3 (0.70) | 65.9 (0.69) | 65.7 (0.68) | 65.0 (0.67) | 57.3 (0.64) |
| k-6 | 95.9 (0.98) | 77.2 (0.80) | 66.8 (0.73) | 66.4 (0.72) | 66.2 (0.71) | 65.6 (0.71) | 57.5 (0.68) |
| rdp | 99.6 (0.99) | 95.1 (0.92) | 77.8 (0.79) | 76.9 (0.78) | 76.1 (0.77) | 72.9 (0.75) | 58.9 (0.73) |

We first notice that no method is uniformly better or worse than the others, except for the κ-mer kernels. This is likely because the library has been reliably aligned. Aside from k-5 and k-6, performances in Scenario 1 are approximately similar both in terms of prediction probabilities and accuracy. Minor differences are found at the species level, where the inclusion of novel taxa leads to higher accuracy for both m-1 and m-2. When new taxa are in the data, all methods other than BayesANT have a lower percentage of correctly identified sequences in both scenarios. Moreover, the algorithms show a similar behaviour in Scenario 2, which features a much higher proportion of unobserved taxa in training, except at the species level. Here, BayesANT shows its advantage, as its species accuracy is about 12 percentage points higher than that of the RDP classifier (70.6% against 58.9% in Table 3). When we restrict to the species observed in training, however, model m-1 shows an accuracy of 93.1% in Scenario 1 and 93.7% in Scenario 2, while RDP shows 95.3% and 95.9%, respectively. The better performance of RDP over BayesANT under observed species can be explained by the latter having to account for taxonomic novelty as well, which translates into evaluating probabilities for a larger taxonomic tree. As such, BayesANT pays a price in terms of accuracy under observed taxa in favour of a much higher gain overall. Indeed, if we neglect novelty in BayesANT by fixing α and σ to 0, the accuracies of m-1, no new on the observed species are 95.5% and 96.7%. See the Supporting Information for computational times and additional results on prediction accuracies, including a further benchmarking of the algorithms when the size of the training library is progressively lower.

4 DISCUSSION

This article has proposed a new probabilistic taxonomic classifier for DNA barcoding sequences, BayesANT, which has the key property of allowing one to build on an existing taxonomic library probabilistically. This is motivated by the fact that existing arthropod libraries are incomplete, containing reference DNA sequences for a subset of the nodes of the taxonomic tree. The potential reasons for this lack of reference barcodes are: insufficient sequencing, mislabelling and, in some extreme cases, novelty for science itself. For example, it is estimated that approximately 1.5 million, 5.5 million and 7 million species of beetles, insects and terrestrial arthropods, respectively, are either awaiting a proper description, do not have a reference sequence yet or are simply undiscovered (Stork, 2018), with estimates varying every year. BayesANT uses species sampling priors (Pitman, 1996) to allow for the discovery of previously unobserved branches of the tree. As such, it avoids arbitrary thresholds for novelty of other algorithms, thus characterizing uncertainty in all aspects of taxonomic classification. Probabilistic forecasts providing accurate characterizations of predictive uncertainty are said to be well calibrated (Somervuo et al., 2016). BayesANT guarantees well-calibrated predictions through a cross-validation approach.

Our method builds on a popular species sampling prior known as the Pitman-Yor process (Pitman & Yor, 1997). In its standard formulation, this prior does not take into account the taxonomic tree structure and instead treats all species as exchangeable. However, by specifying a Pitman-Yor at each level of the tree, with different parameters for each taxonomic rank, we obtain a highly flexible generative probabilistic process that can predict the probability of a query sequence to belong to different and potentially novel taxa at each level of the tree. By estimating the Pitman-Yor parameters based on the training data, we allow the process to adapt to existing knowledge about the level of diversity at each taxonomic rank.
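As an illustration of how a Pitman-Yor prior assigns mass to unseen taxa, the Python sketch below implements the standard Pitman-Yor urn scheme for the children of a single node. It assumes 0 ≤ σ < 1 and α > −σ; whether it matches the exact parametrization used in the paper's Methods section should be checked there, and the function name is hypothetical.

```python
def pitman_yor_predictive(counts, alpha, sigma):
    """Prior probabilities for the taxon of the next sequence below one node.

    counts: dict mapping each observed child taxon to its number of training sequences.
    Returns a dict with one entry per observed taxon plus the key "new".
    """
    n = sum(counts.values())
    k = len(counts)
    probs = {taxon: (m - sigma) / (alpha + n) for taxon, m in counts.items()}
    probs["new"] = (alpha + sigma * k) / (alpha + n)
    return probs

# With alpha = 1 and sigma = 0.5, two observed taxa leave 40% of the mass for novelty.
with_novelty = pitman_yor_predictive({"Trypeta": 3, "Acidia": 1}, alpha=1.0, sigma=0.5)
# With alpha = sigma = 0 the prior reduces to the observed proportions.
no_novelty = pitman_yor_predictive({"Trypeta": 3, "Acidia": 1}, alpha=0.0, sigma=0.0)
```

The second call reproduces the "no new" benchmark setting, in which novel taxa receive zero prior probability and each observed taxon is weighted by its frequency in the library.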

Since taxonomic classification in ecology studies typically relies on sequencing of a relatively short region of the genome, there is necessarily substantial uncertainty in classification (Pentinsaari et al., 2020). For instance, different species often have indistinguishable nucleotide sequences in the region being sequenced, making it impossible to reliably distinguish sequences from such species relying on DNA metabarcoding alone without supplemental morphological data. This can be seen in Figure 7, which shows an example of both morphological and genetic variability of OTU-based clusters. In this respect, the recent development of Amplicon Sequence Variants (ASVs; Callahan et al., 2017; Bokulich et al., 2018) appears to be a promising direction to resolve these issues. ASV methods avoid the clustering step typical of OTUs, and provide better characterization of the biological variations in a dataset. In turn, this leads to an increased number of unique sequences, which BayesANT can easily handle through adequate tuning of the parameters α and σ in the Pitman–Yor prior. For instance, a large number of singleton sequences at the lowest level would lead to large values for $\sigma_L$ and $\alpha_L$, thus favouring the prediction of novel clades. Exploring the performance of our method on ASV datasets is an interesting potential future direction. Another possibility would be to explore ways to incorporate priors derived from phylogenetic analysis into the proposed structure. This could better resolve the ambiguities in the data and add a further clustering layer to the method.

The modelling choices made in building BayesANT reflect a balance between flexibility and pragmatism in developing an efficient off-the-shelf algorithm that can easily handle the classification of a large number of sequences. This is needed in our motivating applications to biodiversity monitoring studies that routinely collect and metabarcode samples from many different sites and multiple time points for each site. In future research, it may be helpful to consider other modelling choices which modify the Pitman-Yor structure and/or choices of kernels considered here. For example, instead of the simple multinomial kernels, it may be useful to explore pairwise similarity and latent variable-based likelihoods, for example, using the projected κ-mer decomposition of a sequence into a lower dimensional feature space. Another alternative is to specify multinomial kernels that better account for nucleotide dependencies along the sequences without excessively burdening time and memory requirements. These include, for example, mixture models as in Dunson and Xing (2009).

Taxonomic novelty due to missing branches in the reference libraries is discussed in the literature (Edgar, 2013; Lan et al., 2012; Somervuo et al., 2017). Interpreting a detected “new” taxon, however, is fairly delicate and context-dependent, and it requires further analyses of the sequenced DNA, such as the investigation of potential sequencing errors. Moreover, novelty is inherently related to the tree structure of the annotations in the library, which sometimes does not reflect the genetic distances between the barcodes in the nodes at a given rank. The within- and cross-taxa similarities of Diptera, Lepidoptera and Coleoptera depicted in Figure 3 are an example. In BayesANT, these distances are indirectly taken into account by the choice of kernel, which, if sufficiently flexible, can correctly discriminate between taxa. However, the creation of new clades is still biased toward nodes that show a higher within-taxon genetic variability (e.g. Diptera) rather than those whose sequences are more similar (e.g. Lepidoptera). This issue is shared by all taxonomic classifiers because of the current taxonomic system, and adjusting for this bias would require additional information, e.g. from morphology. One potential solution in BayesANT is to specify node-specific Pitman–Yor prior parameters that counter low/high genetic variability with higher/lower prior probabilities for novel clades.

AUTHOR CONTRIBUTIONS

Alessandro Zito, Tommaso Rigon and David Dunson designed the method and wrote the paper. Alessandro Zito implemented the algorithm on the FinBOL data and developed the software.

ACKNOWLEDGEMENTS

This project has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement No 856506). The authors express their gratitude to Otso Ovaskainen, Panu Somervuo, Jesse Harrison, Markus Koskela, Tomas Roslin, Bianca Dumitrascu and Jennifer Kampe for their valuable suggestions, Elena Domenichini for graphical advice and Carolyn Quarterman for support with the writing.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

Endnotes

1. Under the assumption that a new node at level ℓ automatically creates a new node at all levels ℓ + 1, …, L below, the total number of potentially unobserved leaves is equal to the number of nodes up to level L − 1 plus 1.
2. https://en.finbol.org/

PEER REVIEW

The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.14009.

DATA AVAILABILITY STATEMENT

BayesANT is available as a citable open-source R package at Zito et al. (2022). For a tutorial on how to use the software, refer to https://alessandrozito.github.io/BayesANT/vignette.html. Data and code to replicate the tables and figures are made available in Zito (2022).