Using taxonomic consistency with semi‐automated data pre‐processing for high quality DNA barcodes

In recent years, large‐scale DNA barcoding campaigns have generated an enormous amount of COI barcodes, which are usually stored in NCBI's GenBank and the official Barcode of Life database (BOLD). BOLD data are generally associated with more detailed and better curated meta‐data, because a great proportion is based on expert‐verified and vouchered material, accessible in public collections. In the course of the initiative German Barcode of Life data were generated for the reference library of 2,846 species of Coleoptera from 13,516 individuals. Confronted with the high effort associated with the identification, verification and data validation, a bioinformatic pipeline, “TaxCI” was developed that (1) identifies taxonomic inconsistencies in a given tree topology (optionally including a reference dataset), (2) discriminates between different cases of incongruence in order to identify contamination or misidentified specimens, (3) graphically marks those cases in the tree, which finally can be checked again and, if needed, corrected or removed from the dataset. For this, “TaxCI” may use DNA‐based species delimitations from other approaches (e.g. mPTP) or may perform implemented threshold‐based clustering. The data‐processing pipeline was tested on a newly generated set of barcodes, using the available BOLD records as a reference. A data revision based on the first run of the TaxCI tool resulted in the second TaxCI analysis in a taxonomic match ratio very similar to the one recorded from the reference set (92% vs. 94%). The revised dataset improved by nearly 20% through this procedure compared to the original, uncorrected one. Overall, the new processing pipeline for DNA barcode data allows for the rapid and easy identification of inconsistencies in large datasets, which can be dealt with before submitting them to public data repositories like BOLD or GenBank. Ultimately, this will increase the quality of submitted data and the speed of data submission, while primarily avoiding the deterioration of the accuracy of the data repositories due to ambiguously identified or contaminated specimens.


| INTRODUCTION
DNA barcoding can provide an efficient tool for rapid biodiversity assessments, because it meets needs for rapid and reproducible specimen identification in the era of massive habitat destruction, biodiversity loss, and climate change (Hebert & Gregory, 2005;Valentini, Pompanon, & Taberlet, 2009). Despite some well-known shortcomings (Funk & Omland, 2003;Ballard & Whitlock, 2004;Hebert & Gregory, 2005;Dowton, Meiklejohn, Cameron, & Wallman, 2014;Ross, 2014), barcoding is currently possibly the fastest and most straightforward way to assess the vast diversity of organisms such as invertebrates, which play a crucial role in ecosystem function, but are often poorly known taxonomically (Hebert & Gregory, 2005). Using standardized genetic markers in DNA barcoding allows connecting the identities of different life stages such as eggs, larvae or adults-often a major difficulty in morphology-based taxonomy (e.g. Ahrens, Monaghan, & Vogler, 2007;Etzler, Wanner, Morales-Rodriguez, & Ivie, 2014;Freitag, 2013;García-Robledo, Kuprewicz, Staines, Kress, & Erwin, 2013;Šipek & Ahrens, 2011). Barcoding has been successfully applied to a vast number of taxa in many different geographic regions (www.boldsystems.org). It has become obvious that validated, comprehensive species libraries are the most fundamental basis for optimal barcode-based taxon identification (Kvist, 2013). Since the early days of DNA barcoding, barcodes have also been used for direct estimation of species boundaries (Carstens, Pelletier, Reid, & Satler, 2013;Meier, Shiyang, Vaidya, & Ng, 2006;Pons et al., 2006;Puillandre, Lambert, Brouillet, & Achaz, 2012;Ratnasingham & Hebert, 2013;Templeton, 2001;Zhang, Kapli, Pavlidis, & Stamatakis, 2013). Most of these methods attempt to infer species boundaries from the discontinuum between intraspecific and interspecific sequence variation, either visible as a "barcode gap," or as a shift in branching rates (GMYC) or number of substitutions per branch (PTP). Therefore, barcode reference libraries should cover the genetic variation within species as comprehensive as possible. Bergsten et al. (2012) revealed increasing difficulties in specimen determination using DNA barcodes when scaling up the geographic scope from local or regional to continental focus. They found that in diving beetles (Dytiscidae) a minimum of 70 specimens needs to be analysed per species to sample 95% of its intraspecific variation.
Consequently, building a proper barcode reference library that sufficiently reflects intraspecific variation and is thus able to correctly detect species boundaries is expected to be an elaborate procedure.
Completing existing barcode libraries and uploading larger datasets to the Barcode of Life Database (BOLD) can be challenging, particularly when a vast amount of barcode data already exists in public libraries. It is important to make sure that the newly submitted data match the existing data in order to avoid taxonomic confusion. Among different sources for inconsistencies, most likely are specimen misidentifications, altering nomenclature or misspellings, contaminations (with admixed alien DNA), sample confusion, or low-quality sequences. Especially in very diverse groups, checking big data (several thousand barcodes) requires major efforts.
Recently, the Barcode Index Number (BIN) System, a DNA-based registry for all animal species (Ratnasingham & Hebert, 2013), has been implemented within BOLD which helps to detect taxonomic inconsistencies. However, once the data are imported into BOLD, the effort to revise and correct them online is higher than having done this beforehand and on user's personal system. Depositing only high-quality sequences correctly annotated with correct species names in BOLD would be the "golden standard" and is crucial for keeping the global barcode library functional and reliable (Collins & Cruickshank, 2013).
A variety of tools for quality assessment of specimen's taxonomy already exist. Some approaches do their taxonomy assessment based on threshold clustering (Kõljalg et al., 2005(Kõljalg et al., , 2013Meier et al., 2006;Ratnasingham & Hebert, 2007. Others detect inconsistencies (of node or terminal labels) based on tree topology alone (Kozlov, Zhang, Yilmaz, Glöckner, & Stamatakis, 2016;McDonald et al., 2012). In both latter approaches the species boundaries are not inferred from the sequencing data, but are defined by the user in form of terminal node annotations, and used as a topological constraint. However, generally a consistent and straightforward automated flagging and filtering for problematic or inconsistent cases, which is useful for fast processing large amount of data, is missing yet.
Therefore, and based on the previously successful use of taxonomic consistency in different phylogenetic studies to uncover conflicts between classification and phylogenetic tree topology (Ahrens & Vogler, 2008;Hunt & Vogler, 2008;Quicke et al., 2012), an r-package (TaxCI; R Core Team, 2014) was developed that helps to check data for inconsistencies before submission to BOLD.
Optionally, the data check can be done with reference data from BOLD, in order to increase data quality and to detect problems with before submitting them to public data repositories like BOLD or GenBank.
Ultimately, this will increase the quality of submitted data and the speed of data submission, while primarily avoiding the deterioration of the accuracy of the data repositories due to ambiguously identified or contaminated specimens.

K E Y W O R D S
Coleoptera, data quality, DNA barcoding, German Barcode of Life, Germany, reference library, species identification Funding information Bundesministerium für Bildung und Forschung, Grant/Award Number: GBOL1: BMBF #01LI1101A and GBOL1: BMBF #01LI1501A Handling editor: Douglas Yu the new barcodes in a semi-automatic manner. This analysis pipeline was used to assist the publication of a second (see Hendrich et al., 2015) very large set of barcodes of beetles (Coleoptera) in Germany and central Europe that was created during the "German Barcode of Life (GBOL)" project, comprising nearly 2,900 species and more than 13,000 specimens ( Figure 1).

| Sampling, sequencing and deposition
Sampling organized with the help of the GBOL web portal (https://www.bolgermany.de; Geiger et al., 2016), voucher treatment, and the taxonomy used (see Data S5) were similar to Hendrich et al. (2015). DNA extraction, polymerase chain reaction and sequencing followed standard protocols  and are described together with the newly developed specific primers in detail in Data S5. All data are deposited in BOLD (https://doi.org/10.5883/gcolzfmk) and GenBank (accession numbers KU906118-KU919633) respectively.

| Data analysis
Our newly generated barcode data were counterchecked for taxonomic consistency against a reference dataset. As a reference a first larger dataset of German Coleoptera barcodes (3,514 species, 15,948 individuals) was used (Hendrich et al., 2015). Since the reference contained many short sequences (<500 bp), tree searches were performed using five different datasets (Table 1) with fasttree 2 (Price, Dehal, & Arkin, 2010): (1) newly generated data only, (2) all reference data, (3) reference data including only sequences longer than 500 bp, (4) combined new data and reference data including all sequences, and (5) combined new data and reference data including only sequences longer than 500 bp. Analyses were conducted on each full set of data (sets 1 to 5) and on separate families except for the monotypic Myxophagan family Sphaeriusidae (species poor families were combined with others with regard to their systematic position; Table S2). Tree searches used the GTR+CAT model. Midpoint rooting was applied before further processing using the r-packages phangorn (Schliep, 2011) and ape (Paradis, Claude, & Strimmer, 2004).
In order to pinpoint misidentifications, contaminations, mislabelling, but also potential cases of introgression, incomplete lineage sorting and instances of polyphyly, the phylogenetic trees obtained were used as phylogenetic backbone to investigate the consistency of the barcoded species, i.e. the monophyly of a priori assigned morphospecies. In a second step, a distance-based clustering was used as implemented in the r-package spider (v.1.3.0; Brown et al., 2012) to gain further evidence for conspecificity or even stronger genetic differentiation (i.e. cryptic diversity).
These procedures were implemented in a new r-package "taxCi" which is open-source and freely available at https://github.com/eberlejonas/TaxCI. This newly developed analysis tool had as input (1) a rooted dendrogram (fasttree tree), (2) the underlying DNA sequence alignment (PHYLIP format), and (3) a data sheet linking the dendrogram's tip-labels to the a priori given taxonomic information (i.e. assignment of family, genus and species). Based on the input tree topology and five consecutive evaluation steps, a penalty score for all specimens is returned that marks potentially misidentified or contaminated samples ( Figure 2).
Step 1 The topology of the dendrogram is analysed with reference to monophyly of the a priori identified species. In analogy to the examination of homoplasy in morphological characters, the membership to a certain a priori identified species was coded as trait state "1" instead of trait state "0." The consistency index (Kluge & Farris, 1969) explores the degree of homoplasy of a trait (or here, of an a priori defined species), and is thus a good expression of "how polyphyletic" a species is in a given tree. All individuals of a given species not grouped as monophylum are marked and assigned with a tci penalty score. They also get highlighted during this step (Figures 3 and 4; 1st box), also those residing in a taxonomically and homogeneously composed clade. Members of the respective species within a heterogeneous cluster get an additional mark in the second step (2nd box). An approach similar to this step was used recently to detect non-monophyly in phylogenetic trees F I G U R E 1 Map of Germany with the sample locations for the reference data from Hendrich et al. (2015; red circles) and newly added barcode data (blue triangles) (this study) and to ascertain the incidence of species non-monophyly in COI barcode sequences (Mutanen et al., 2016). Although species monophyly is not a stringent criterion of species validity (Zhang, Zhang, Zhu, & Yang, 2011), it has been widely shown that most species are monophyletic at least for the barcode gene COI (e.g. Funk & Omland, 2003;Mutanen et al., 2016;Ross, 2014).
Step 2 The TaxCI package uses the function localMinima from the r package spider (Brown et al., 2012) and derives a distance threshold (choice of evolutionary model for distance calculation selectable; here used the default option: K80) from the aligned sequence data for clustering. The actual value corresponds to the first dip in the density of pairwise distances and indicates the transition between intra-and inter-specific distances following the general concept of the barcoding gap. The newly derived clusters are then analysed for their a priori defined taxonomic composition, and all individuals within a cluster of mixed species are highlighted (2 nd box). The function also detects species not conspicuous during the first step, e.g. if they form a monophylum in the backbone tree, which is part of a heterogeneous distance-based cluster (Figures 3 and 4). As an alternative to barcode clustering with spider, users can supply their own, externally obtained clusters (see package documentation for details). A function for directly parsing output files of the species delimitation program mPTP (v0.2.1, Kapli et al., 2016) is implemented in TaxCI.
Step 3 All individuals within a heterospecific cluster are tagged ifand only if-there are other individuals of this species in at least one additional cluster (3rd box). All individuals marked during step two get also marked in the 3rd box, if this condition is fulfilled.
Step 4 Calculates the relative abundance of a nominal a priori identified species in a heterogeneous cluster as proportion of its individuals relative to the total number of specimens in this cluster.
If a particular species has a lower relative abundance than any of the other species in the mixed cluster, all its individuals are tagged in the 4th box ( Figure 3). A penalty score of 1-RA (RA-relative abundance of the species) is assigned. This step might be helpful to infer which of the specimens might be contaminants or wrongly identified.
Step 5 All individuals of a species in a homogeneous cluster with members in at least one other homogeneous cluster are marked in the 5th box. This step might indicate wrong identifications but also cryptic diversity.
All results for the specimens that were found to be conspicuous in any of the five steps are written into a text file table along with a penalty score based on the above five steps. For each analysis, we estimated additionally a match ratio (Ahrens et al., 2016): MR = 2 × N match / (N cluster + N morph ), where N match is the number of species with exact matches, that is all specimens of one morphospecies (and only these) belong to one cluster entity, and N cluster and N morph are the number of clusters and morphospecies). The initially loaded dendrogram is plotted as text-searchable PDF file (without length limitations), now enriched with the details of the taxonomic congruence checks (Boxes 1-5). Thus, the modified tree file can easily be inspected visually on T A B L E 1 Number of morphospecies, genetic barcode clusters ("OTUs"), tci positive species, homogeneous and heterogeneous cluster, species in multiple homogeneous cluster and the match ratio between genetic cluster and morphospecies from the initial and second TaxCI run on the reference data (Ref data), the newly generated barcodes (new data), and both combined (Comb) any computer and decoupled from the analysis platform. This is an important aspect for involving external collaborators or multiple project partners during the validation of taxon identifications. The output of the taxCi analysis was used to re-inspect the affected specimens, to correct the species ID data, or to exclude specimens (temporally for further revision of ID, or definitively, in case of contaminations).
Finally, the taxCi analysis was run again with the cleaned dataset to assess the level of improvement in congruence.
F I G U R E 4 Example of tree output from the empirical barcode data from the first pre-processing analysis F I G U R E 3 (a-n) Simulated tree topologies with hypothetical nominal taxa (terminals A-E) used to explore the test cases for the taxCi-based data pre-processing

| RESULTS
A full-length DNA barcode (658 bp) was obtained from 99.4% of the 14,330 specimens from a priori identified species representing 102 of the 103 beetle families known from Germany. The remaining 0.6% of sequences are between 657 and 557 bp. In the initial phylogenetic analysis and subsequent taxCi run all 14,330 specimens were included representing 2,970 a priori defined species of 102 families.
Based on poor taxCi results we removed 120 species in 814 specimens from the BOLD submission pipeline, of which six specimens proved to be contaminations, three specimens had typing errors in their names, while 804 specimens were preliminary excluded to verify once again carefully the identification by a taxonomic specialist.
In 32 cases taxonomic inconsistencies were found due to changes in nomenclature since 2012 (in these cases only the metadata file for taxCi analysis was corrected, but not the terminal name in the tree; see Data S1 for reference of taxon nomenclature in 2012 vs. 2015).
In 67 cases a taxonomic inconsistency was detected in the reference data (i.e. Hendrich et al., 2015). Five hundred and seventy-nine specimens were kept in the new dataset despite problems with taxonomic consistency as data do not allow for rejection of the sample as in the above cases (potential cases of introgression, incomplete lineage sorting, cryptic diversity). These cases require further research, ideally based on an integrative taxonomy toolset, in some cases possibly even resulting in comprehensive taxonomic revisions.
In total, 13,516 specimens in 2,846 species were added to the BOLD library, which now includes >4,000 species of beetles occurring in Germany.
The taxCi run 1 detected, among the newly generated data of 2,978 morphospecies, more than 23% of the species (n = 703) that were affected by taxonomic inconsistencies (Table 1) The number of resulting distance-based clusters is not comparable between the full and single family analyses, as the derived threshold for the different families differs strongly (Tables 1 and S2). Therefore, different total numbers of clusters between both approaches were found (4,621 vs. 4,494 for full and family analysis (2nd run) respectively). Nevertheless, match ratios were improved for nearly all families in the 2nd run as well (Table S2).
According to the BOLD BIN report, our dataset contained 337 globally new BINs (with 716 specimens), while 2,758 BINs were already present in the BOLD system (comprising 12,796 specimens from our new dataset and a total of 53,874 specimens) (see Table S1).
A detailed overview of the resulting BINs, trees and taxonomic inconsistencies is given in Table S1 and Data S1 and S2.

| DISCUSSION
According to the widely accepted generalized lineage concept (GLC; de Queiroz, 1998Queiroz, , 2005Queiroz, , 2007, which is now commonly applied in DNA-based and integrative species delimitation (Camargo, Morando, Avila, & Sites, 2012;Carstens et al., 2013;Edwards & Knowles, 2014;Ence & Carstens, 2011;Jones, 2015;Jones, Aydin, & Oxelman, 2015;Knowles & Carstens, 2007;O'Meara, Ane, Sanderson, & Wainwright, 2006;Yang & Rannala, 2010, monophyly is not an essential prerequisite for a valid species. However, since a monophyletic species is with a certain probability also more likely a valid species (e.g. Kizirian & Donnelly, 2004), this criterion was used to explore our data for inconsistencies. Therefore, the topology of the obtained gene tree was analysed with reference to the degree of monophyly of the a priori identified species (step 1 of our pipeline). Species detected during this step are non-monophyletic. This can be caused by several natural phenomena (e.g. hybridization, incomplete lineage sorting) but can arise-as an artefact-also from human errors within the barcoding pipeline (e.g. contamination or misidentification of a sample). What is not detected in step 1 are (1) contaminated or misidentified singletons, (2) entire monophyla that appear as sister group of another clade within a cluster (Figure 3j, taxon "E"), or (3) specimens that mirror cryptic diversity (Figure 3m).
The use of a simple distance-based clustering in the steps following the monophyly assessment is because this pipeline was designed for extremely large datasets with thousands of specimens.
Methodologically more appropriate methods like statistical parsimony analysis (Templeton, 2001), Poisson tree process modelling (Zhang et al., 2013), and General Mixed Yule Coalescent modelling (Pons et al., 2006) could potentially replace the distance-based clustering as they might be more accurate and also more appropriate since they follow a defined species concept (i.e. phylogenetic species concept).
However, they are computationally much more demanding and thus time-consuming. As our approach is intended for data pre-processing, we opted for speed rather than accuracy, and implemented a distance clustering approach as default option. However, alternative species delimitations obtained from external programs can easily be implemented as they get more scalable like the recently developed mptp (Kapli et al., 2016).
The indication of the presence of additional specimens (steps 2 and 3) and their abundance (step 4) in another cluster may help to decide whether there is a case of contamination and which of the specimens are affected by contamination, mislabelling, or misidentification. This latter step is especially helpful to find misidentified singletons. The final analysis step (step 5) indicates those potential misidentifications that have gone unnoticed through steps 1-4 or that might be used as indicators for cryptic diversity or strong mtDNA differentiation among populations. Thus, the output from step 5 will be especially valuable for future studies to explore additional evidence on these observed patterns of differentiation.
Data quality, i.e. the match ratio of morphospecies vs. barcode clusters, greatly improved with the fast screening and the subsequent revision of the data and IDs from roughly 73% and 74% to 94 and 92% for our new data and all German beetle barcodes respectively. This is a significant improvement considering that the taxonomic consistency of the reference data was also 94%. Using multi-rate poison tree processing as external method with taxCi resulting in a lower match ratio of 88% and 87% accompanied by lower amount of homogeneous clusters, but with more heterogeneous clusters than in our standard approach (Table 1). Mainly well separated singletons were lumped by the mptp clustering method (Data S1 and S5).
The inclusion of the reference dataset helped to find more taxonomic problems, which would remain undetected with the new dataset alone. This is especially true for wrongly identified or contaminated Tools of quality assessment of specimen's taxonomy do already exist in UNITE (Kõljalg et al., 2005(Kõljalg et al., , 2013; https://unite.ut.ee/index.php), in SpeciesIdentifier (Meier et al., 2006), and in the BOLD systems (Ratnasingham & Hebert, 2007;http://www.boldsystems.org).
Taxonomy assessments in these are exclusively based on threshold clusters, while monophyly and taxonomic consistency beyond clusters is not considered in detail. A flagging of problematic specimens is performed only in BOLD systems. Furthermore, other tools with similar tasks so far used for micro-organisms exist (Kozlov et al., 2016;McDonald et al., 2012). The tax2tree approach (McDonald et al., 2012) refers to topological constraints and does not implement the inference of species boundaries (e.g. by threshold clustering) as it was designed to transfer an existing taxonomy to newly generated sequences rather than critically checking species assignments of different sources of data critically against each other. SATIVA (Kozlov et al., 2016) detects inconsistencies (of node or terminal labels) based on the tree topology (i.e. taxon monophyly) considering various classification levels (except species) based on arbitrary threshold clustering, as species level often lacking in metadata annotations of Cyanobacteria. In both approaches the species boundary is taken as topological constraint given of terminals in relation to tree topology, rather than an inference of such a limit from the sequence data, and thus lacking the possibility to infer "cryptic diversity." In all cases an automated straightforward high-throughput filtering for problematic or inconsistent cases, which is useful for large amount of data, is not yet performed.
BIN assignments from the BOLD system, similarly able to help with data quality assessment (in terms of correct species name assignments), generally have a turnaround time of several weeks.
Furthermore, data curation-once uploaded to BOLD-requires additional work from submitters and data curators.
Another advantage of our taxCi analysis pipeline lies in its capacity to produce tree files in a PDF format that is text-searchable and without page limitation, a feature that is lacking or only fragmentarily available in most above mentioned approaches. Even most current stand-alone tree editing programs do not allow this or have strong limitations for tree length. In the form of such portable files, readable on any desktop computer and also by in terms of molecular systematics untrained taxonomists or citizen scientists, they represent the most valuable linking element for using barcodes. This increases the acceptance and audience for these approaches and issues linked with the data (e.g. taxonomical problems, inconsistencies). Finally, taxCi offers a more complex output for different ways of data control (numerical, graphical) which allows an either automatized and computer-based but also manual/visual data inspection.

| CONCLUSIONS
The presented analysis pipeline for a semi-automatic data quality assessment and check, with subsequent revision and re-check, may provide a helpful tool when processing large to very large datasets that can be handled manually only with great difficulty and considerable time investment. While several so far existing tools or web portals offer partly similar functions, our pipeline combines in efficient way the presentation of data (tree output) with useful flagging of problematic samples, high data throughput, and the complex aspects of Molecular Operational Taxonomic Unit (MOTU) capture relying on tree based species delimitation as well as on fast clustering with distance-based algorithms. taxCi will prove to be a useful tool that helps to generate at forehand high quality barcode data that keep away high correction workload from large reference databases, like BOLD systems, and to bring more people towards an integrative taxonomy where barcode data play a crucial role in biodiversity research, but also, where traditional taxonomy is deeply involved in biodiversity exploration.
H.-P. Reike, E. Rößner, and A. Schäfer. Finally, we are thankful to M. Balke, L. Hendrich, and J. Morinière who inspired us for this paper.
Our colleague J. Decher kindly checked and improved the English in this paper. We thank the two anonymous referees for their helpful comments on the manuscript. The 1KITE project (1KITE consortium: www.1kite.org) kindly provided access to mitogenome data for hundreds of samples that greatly helped in designing the new degenerate primers LCO1490-JJ3 and HCO2198-JJ3.