On the universality of target‐enrichment baits for phylogenomic research

Capturing conserved genomic elements to shed light on deep evolutionary history is becoming the new gold standard for phylogenomic research. Ultraconserved elements are shared among distantly related organisms, allowing the capture of unprecedented amounts of genomic data from non‐model taxa. An underappreciated consequence of hybrid enrichment methods is the potential of introducing undetected DNA sequences from organisms outside the lineage of interest, facilitated through the high degree of conservation of the target regions. In this in silico study, we quantify ultraconserved loci using a data set of 400 published genomes. We utilized six newly designed UCE bait sets, tailored to various arthropod groups, and screened for shared conserved elements in all 242 currently published arthropod genomes. Additionally, we included a diverse set of other potentially contaminating organisms, such as various species of fungi and bacteria. Our results show that specific UCE bait sets can capture genomic elements from vastly divergent lineages, including human DNA. Nonetheless, our in silico modeling demonstrates that sufficiently strict bioinformatic processing parameters effectively filter out unintentionally targeted DNA from taxa other than the focus group. Lastly, we characterize all of the 100 most widely shared UCE loci as highly conserved exonic regions. We give practical recommendations to address contamination in data sets generated through targeted‐enrichment.

of genome-scale data for non-model organisms.
We have only just begun to understand the nature of ultraconserved elements. The majority of the initially discovered UCEs from vertebrates were characterized as transcriptionally inactive elements, with a smaller proportion of conserved exonic regions (Bejerano et al., 2004). Among others, they were linked with diverse regulatory functions in splicing, transcription, enhancement and development (Bejerano et al., 2004; Dickel et al., 2014; Reneker et al., 2012; Woolfe et al., 2004). However, functional aspects of UCEs are almost exclusively studied in vertebrate systems, and their significance in invertebrate genomes is largely unknown.
Nonetheless, the application of UCEs for phylogenetic research in invertebrate systems is highly promising. The advantages of analysing hundreds to thousands of informative loci outcompete traditional Sanger sequencing (Blaimer et al., 2015), and open exciting avenues to overcome the persistent challenge of deep coalescences (McCormack, Harvey, et al., 2013; Meiklejohn, Faircloth, Glenn, Kimball, & Braun, 2016). A major advance is the capability to recover informative sequence data from degraded starting material. Recent studies showed successful UCE capture from century-old, dried museum specimens (Blaimer, Lloyd, Guillory, & Brady, 2016; McCormack, Tsai, & Faircloth, 2016), and even formalin-fixed material (Ruane & Austin, 2017). This renders the vast historical archives of dried, pinned, or ethanol-preserved material in natural history museums available for use in phylogenomic analysis, including long-extinct species and irreplaceable specimens. This is an appealing advantage over RNAseq, which relies on freshly collected specimens with intact mRNA (Yeates, Meusemann, Trautwein, Wiegmann, & Zwick, 2016).
A central aspect of hybrid capture techniques is the design of nucleotide bait sets that capture the targeted DNA in-solution. The principal UCE bait design workflow is based on genome-genome alignments of two or more representative genomes, and screens for 40-60 bp long identical anchor sequences (Faircloth, Branstetter, White, & Brady, 2015; Faircloth et al., 2012). A "hands-on" description of the workflow was recently published in this journal (Faircloth, 2017). The ability of the baits to hybridize in-solution is tested in silico by aligning them to sequenced genomes. Here, a minimum of 80% sequence overlap and identity is used to predict good in-solution capture success (Branstetter, Longino, Ward, & Faircloth, 2017; Faircloth, 2017; Faircloth et al., 2015). UCE bait sets are tailored to certain taxonomic groups, such as Hymenoptera (Branstetter, Longino, et al., 2017; Faircloth et al., 2015), arachnids (Starrett et al., 2016) or amniotes (Hosner, Faircloth, Glenn, Braun, & Kimball, 2016; Streicher & Wiens, 2016). However, we do not know to what extent these conserved sequences are shared among lineages outside the taxon of interest. More specifically, we do not know if these tailored UCE baits can capture significant amounts of contaminating sequences from organisms outside the target lineages. In this case study, we quantify these non-target UCE loci in a diverse set of organisms.
We employed genomes of 400 different taxa and screened them for sequences that match UCE baits of 6 newly designed bait sets, each targeting a separate arthropod lineage (Table 1). We conducted phylum-wide searches for these elements in all 242 sequenced, non-duplicate arthropod genomes that are publicly available on NCBI, as well as in 158 potentially contaminating organisms, such as humans, fungi, and bacteria. On the basis of our results, we evaluate the potential of non-target capture of UCEs in-solution. Lastly, we simulate the target capture of contaminating UCEs under varying bioinformatic parameters, and present cutoff values that exclude contamination while retaining a maximized amount of target UCEs.

MATERIALS AND METHODS
We conducted phylum-wide screens for ultraconserved elements of six major arthropod lineages (Table 1) in all 242 arthropod genomes that were publicly available on NCBI at the beginning of March, 2017.
These lineages likely diverged as early as c. 570 million years ago (Misof et al., 2014). To evaluate the potential of including contaminating DNA during laboratory work, we searched for matching sequences in the genomes of 10 domesticated mammal species, such as cat, dog, and rat, as well as two copies of the human genome. We further screened the genomes of 72 species of fungi that could be naturally associated with the extraction material, such as a wide range of different yeasts and moulds, as well as genera that have been identified as contaminants in ancient DNA studies (Austin, Ross, Smith, Fortey, & Thomas, 1997; Wang, Yan, & Jin, 1997). Lastly, we added 74 bacterial genomes that could in practice become part of the starting material of the UCE-enrichment process, such as insect gut symbionts, different strains of Wolbachia, as well as a range of human gut microbiota and typical bacterial contaminants in microbiological laboratories. In cases of multiple genomes of single taxa, we retained the genomes with the highest sequencing depth.

Faircloth (2017) to match each bait against each of the 400 genomes. We required a minimum sequence identity of 80% across an overlap region of 80% to score a positive hit. To explore the possibility of in silico filtering for contaminating UCEs, we then quantified the number of matches for each genome along a gradient from 65% to 95% sequence identity, while keeping the minimum overlap of 80%. Subsequently, we examined the relationship of genome size and capture success by plotting the amount of in silico matched UCEs as a function of the total sequence length of each genome. Lastly, we repeated the matching simulations with each of the other five bait sets (Table 1).
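The matching criterion described above can be illustrated with a minimal Python sketch. It is not the script used in this study (those scripts are in the repository listed under Data Accessibility); the `Hit` record, its field names, the function names, and the 5% step of the gradient are assumptions made for illustration. It scores a locus as matched when at least one bait aligns with the required identity and overlap, and then counts matched loci along the identity gradient at a fixed 80% overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One bait-to-genome alignment (hypothetical record layout)."""
    bait: str        # bait (probe) name
    locus: str       # UCE locus the bait was designed for
    bait_len: int    # bait length in bp
    aln_len: int     # number of bait positions covered by the alignment
    identity: float  # percent identity within the aligned region


def matched_loci(hits, min_identity=80.0, min_overlap=80.0):
    """Return the set of loci with at least one hit passing both thresholds."""
    loci = set()
    for h in hits:
        overlap = 100.0 * h.aln_len / h.bait_len  # percent of the bait aligned
        if h.identity >= min_identity and overlap >= min_overlap:
            loci.add(h.locus)
    return loci


def identity_gradient(hits, start=65, stop=95, step=5, min_overlap=80.0):
    """Count matched loci per identity threshold (step size is an assumption)."""
    return {
        t: len(matched_loci(hits, float(t), min_overlap))
        for t in range(start, stop + 1, step)
    }
```

Calling `identity_gradient(hits)` once per genome gives one curve per genome, which is the kind of per-threshold count that can be compared between target and non-target taxa.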
We tested for significant positive associations between genome size and the number of UCE loci captured using standard linear regression methods implemented in R (R Development Core Team, 2016).
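For readers who want to reproduce the idea outside of R, the following is a minimal Python sketch of an ordinary least-squares fit of matched UCE count on genome size. The function name and the returned keys are assumptions; it is not the analysis code of this study. It returns the slope, intercept, coefficient of determination, and the t statistic of the slope. Converting that t statistic into a p-value requires the t distribution with n - 2 degrees of freedom from a statistics library and is left out here.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares fit of y on x for paired observations."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - sse / syy
    # Standard error of the slope; a perfect fit gives an infinite t statistic.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    t_slope = slope / se_slope if se_slope > 0 else math.inf
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "t_slope": t_slope,
    }
```

Here `x` would hold the total sequence length of each genome and `y` the number of in silico matched UCE loci; a positive slope with a large t statistic corresponds to the positive association tested in the text.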
The included genomes as well as their GenBank assembly accession numbers are listed in the appendix (Table S1).

RESULTS AND DISCUSSION
Our results show that bait sets tailored to specific groups can effectively capture conserved genomic elements from organisms outside the targeted lineage. All six bait sets target hundreds of loci from all of the included insect orders, and consistently match several hundred elements in all other examined arthropod lineages (Figures 1 and 2). For three of the five tested sets, the number of matching insect UCEs outside the targeted group is strongly predicted by genome size rather than relatedness (Figure 3). However, this trend is not present for the bait sets designed for Diptera and Lepidoptera. As the bait sets for all target groups were designed in a comparable manner (Faircloth, 2017), these differing trends are unexpected and further data exploration is required to clarify the underlying mechanisms that drive the different patterns.
Our results demonstrate that single bait sets can capture putative orthologous sequences from the entirety of the most diverse extant phylum Arthropoda, encompassing over 1,200,000 described taxa and about 80% of all known species of animals (Zhang, 2011). Even if none of the bait sets were specifically tailored towards a breadth of arthropod lineages, we identified loci that are universally shared throughout the phylum, representing highly promising candidate loci for a universal arthropod bait set.
We characterize the 100 most widely shared UCEs as exclusively conserved exons or partially exonic regions (Table S2). This sheds new light on their evolutionary significance. The initial discoveries of UCEs from vertebrates characterize them as predominantly noncoding sequences (Bejerano et al., 2004; Woolfe et al., 2004). In con-
The three principal scenarios for contamination when using UCE-targeted enrichment methods are the following:
(1) The same UCE locus is captured from the target taxon and the contaminant. If the consensus assemblies produce two different contigs for the same locus, the UCE software pipeline PhyLuce (Faircloth, 2016) identifies these loci as duplicates and discards them. This does not introduce the contaminating locus into the final matrix, but it does lead to the loss of the entire UCE.
(2) The same locus is captured from both taxa, but the consensus assembly produces one chimeric contig. This contaminated UCE will be assigned to the target taxon.
(3) The captured locus stems only from the contaminant and not from the target taxon. PhyLuce will incorrectly assign the contaminating locus to the target.
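Scenario (1) can be made concrete with a small Python sketch. This is not PhyLuce code; the function name and the dictionary layout are hypothetical, and the sketch only mimics the behaviour described above, in which a locus represented by more than one contig is treated as a duplicate and dropped.

```python
def drop_duplicate_loci(contigs_by_locus):
    """Keep loci with exactly one contig; report loci with several contigs.

    contigs_by_locus maps a locus name to the list of assembled contigs
    that matched it (hypothetical layout, for illustration only).
    """
    kept = {
        locus: contigs[0]
        for locus, contigs in contigs_by_locus.items()
        if len(contigs) == 1
    }
    dropped = sorted(
        locus for locus, contigs in contigs_by_locus.items()
        if len(contigs) > 1
    )
    return kept, dropped
```

The sketch also shows why scenarios (2) and (3) are harder to catch: a chimeric contig or a contig that stems only from the contaminant still appears as a single contig per locus and therefore passes this kind of duplicate filter.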
Chimeric UCEs have been primarily identified as a problem of low-coverage sequencing, and can thus be addressed with greater sequencing efforts. In contrast, increased sequencing will not overcome the problems caused by unintended sequence associations. A solution lies in the in silico identification of UCEs that follows the actual wet-laboratory capture. Our simulation study shows that sufficiently strict matching parameters can effectively filter out loci from unintentionally included DNA (Figure 2).
Minimum sequence identities between 82% and 85% to the capture bait ensure the exclusion of contamination, while retaining sufficient capture rates for all included bait sets. As we employed full genomes of the tested contaminating organisms, we show the exclusion to be efficient even with the highest possible levels of contamination. However, the capture success within lineages can vary significantly, as exemplified by the Hymenoptera bait set. Specifically, the capture success for Hymenoptera decreases much more steeply for the parasitic aculeata ("Parasitica"), the paraphyletic Symphyta and the Vespidae (Figure 4).
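One way to read such a cutoff off per-threshold match counts is sketched below in Python. The function name and the input layout are assumptions, and "exclusion" is interpreted here as zero matched loci in every contaminant genome, which may be stricter than the criterion applied in this study. Because the number of matches can only decrease as the identity threshold rises, the lowest threshold that clears all contaminants is also the one that retains the most target loci.

```python
def lowest_clean_threshold(contaminant_counts):
    """Return the lowest identity threshold with no contaminant matches.

    contaminant_counts maps an identity threshold (percent) to a dict of
    {contaminant genome name: number of matched UCE loci}.
    Returns None if no tested threshold excludes all contaminants.
    """
    for threshold in sorted(contaminant_counts):
        counts = contaminant_counts[threshold].values()
        if all(n == 0 for n in counts):
            return threshold
    return None


# Invented numbers for illustration only; these are not results of the study.
example_counts = {
    80: {"contaminant_A": 3, "contaminant_B": 0},
    82: {"contaminant_A": 1, "contaminant_B": 0},
    85: {"contaminant_A": 0, "contaminant_B": 0},
    90: {"contaminant_A": 0, "contaminant_B": 0},
}
print(lowest_clean_threshold(example_counts))
```

In practice the chosen value would then be checked against the capture rates of the target genomes at the same threshold, since those can drop at different rates among lineages.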
Nonetheless, our data underline the need to minimize contamination in the first place. Careful lab practice for handling sensitive DNA samples requires thorough surface sterilization of dissecting tools, mounts, and tissues. This will help to overcome surface contamination of specimens handled by many standard collecting techniques used by entomologists, such as Malaise trapping, pan trapping, collecting multiple specimens in a single kill jar, or pinning specimens by hand. Conducting laboratory work under a laminar flow hood will decrease the potential for introducing environmental DNA. In addition, a careful selection of tissue types could significantly lower the potential of contamination. This is particularly relevant for organisms with predatory feeding behaviours and potential contamination on appendages that are used to capture prey (Faircloth, B.C., pers. comm.). Non-destructive DNA extraction protocols (e.g. Gilbert, Moore, Melchior, & Worobey, 2007; Kanda, Pflug, Sproul, Dasenko, & Maddison, 2016) have the potential to contribute DNA from internal insect parasitoids or ectoparasites.
the initial UCE capture obsolete. An alternative bioinformatic approach to detect contaminated libraries is provided by the assembly_extract_contigs_to_barcodes script, which is integrated in PhyLuce. It allows the extraction of assembled mitochondrial contigs that correspond to the COI barcode region. These sequences can then be used to check whether a sample contains a single species or multiple species. If multiple divergent DNA barcodes are detected, the entire sample can be excluded from further analyses.
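The last step, deciding whether the extracted barcodes are divergent, could look like the following Python sketch. It does not call PhyLuce and assumes that the COI sequences have already been extracted and aligned to equal length; the function names and the 3% distance cutoff are hypothetical choices for illustration, not values from this study.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)


def has_divergent_barcodes(barcodes, max_distance=0.03):
    """True if any pair of barcodes differs by more than max_distance.

    max_distance is a hypothetical cutoff and should be chosen per taxon.
    """
    return any(
        p_distance(a, b) > max_distance
        for a, b in combinations(barcodes, 2)
    )
```

A sample with a single barcode contig yields no pairs and is therefore never flagged; a sample for which `has_divergent_barcodes` returns True would be a candidate for exclusion.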
Lastly, we argue that the issue of contamination is relevant to both of the currently most widely used target-enrichment strategies: the UCE approach and Anchored Enrichment (AE; Lemmon et al., 2012). While we demonstrate the potential for contamination in the UCE method, we argue that this potential applies fully to the AE strategy. The approaches differ in certain aspects, e.g., in the probe design workflow, the tiling strategies, and the subsequent bioinformatic processing. However, central to both is the use of 120 bp long nucleotide baits for in-solution hybridization with highly conserved DNA fragments (e.g. Fragoso-Martínez et al., 2017; Lemmon et al., 2012; Prum et al., 2015 for AE-based studies, and Branstetter, Danforth, et al., 2017; Starrett et al., 2016 for the UCE approach). The AE baits primarily target coding sequences, and the most widely shared UCEs from this study were exclusively exonic or partially exonic.
We therefore argue that both approaches target comparable, conserved exonic DNA.

ACKNOWLEDGEMENTS
We thank Brant C. Faircloth for the constructive feedback on the manuscript. This study was funded by a U.S. National Science Foundation grant to B.N.D., S.G. Brady, J.P. Pitts, and R. Ross (DEB-1555905).

COMPETING INTERESTS
The authors declare no competing interest.

AUTHORS' CONTRIBUTIONS
S.B. designed and conducted the study. S.B. and B.N.D. wrote the manuscript.

DATA ACCESSIBILITY
The scripts which were used to simulate the matching success of the UCE probe sets are available in a GitHub repository of the first author (https://github.com/MacGyverScripting/matching_uce_simulations) and are archived via Zenodo (https://doi.org/10.5281/zenodo.1172056).