Long‐range PCR allows sequencing of mitochondrial genomes from environmental DNA

As environmental DNA (eDNA) from macro‐organisms is often assumed to be highly degraded, current eDNA assays target small DNA fragments to estimate species richness by metabarcoding. A limitation of this approach is the inherent lack of unique species‐specific single‐nucleotide polymorphisms available for unequivocal species identification. We designed a novel primer pair capable of amplifying whole mitochondrial genomes and evaluated it in silico for a wide range of ray‐finned fishes (Class: Actinopterygii). We tested the primer pair using long‐range PCR and Illumina sequencing in vitro on a mock community of fish species assembled from pooling genomic DNA extracted from tissues. In situ we utilized long‐range PCR and Illumina sequencing to generate fragments between 16 and 17 kb from eDNA extracted from filtered water samples. Water samples were sourced from a mesocosm experiment and from a natural stream. We validated our method in silico for 61 orders of Actinopterygii; we successfully sequenced mitogenomes in vitro from all six species in our mock community. In situ we recovered mitogenomes for all species present in our mesocosms. We additionally recovered mitogenomes from 10 of 12 species caught at the time of water sampling and two species previously only detected from eDNA metabarcoding of short DNA fragments from a natural stream. Successful amplification of large fragments (>16 kb) from eDNA demonstrates that not all eDNA is highly degraded. Sequencing whole mitogenomes from filtered water samples will alleviate many problems associated with identification of species from short‐fragment PCR amplicon‐based methods.


| INTRODUCTION
The use of DNA found in the environment (eDNA) to catalogue biodiversity is gaining momentum (Creer et al., 2016). From surveying the three domains of life in soils (Drummond et al., 2015) to whales in the ocean (Foote et al., 2012), biodiversity information is being produced on unprecedented scales.
Perhaps because the use of eDNA to detect species was partially inspired by the field of ancient DNA (Thomsen & Willerslev, 2015), researchers assumed that eDNA was highly degraded. Because of this assumption, and coupled with current sequence length limitations of next-generation sequencers (e.g. Illumina MiSeq), researchers have focused on producing small fragments (c. 50-400 base pairs in length) using a PCR amplicon approach to characterize macro-organismal species richness Valentini et al., 2016). However, reliance on short-fragment PCR amplicons from eDNA limits the current utility of the method because species-level assignments are often not possible for short reads (Deiner, Fronhofer, Mächler, Walser, & Altermatt, 2016;Port et al., 2016).
While it may be true that some of the eDNA in environmental samples is degraded, evidence that not all eDNA is degraded has emerged. In a recent study based on water from a fish pond, most of the eDNA detected was from particles that ranged in size from 1 to 10 μm, consistent with the presence of intact tissues or cells in aquatic environments (Turner et al., 2014). This result suggests that eDNA for species currently occupying a habitat is not primarily free DNA suspended in solution, but that it could be cellular or membrane bound DNA in a tightly coiled or circular state and comparatively safe from degradative processes (Torti, Lever, & Jørgensen, 2015). We therefore hypothesized that it should be possible to long-range PCR amplify and sequence whole macro-organismal mitochondrial genomes (mitogenomes) from DNA isolated from water samples.
To test this hypothesis, we designed a novel primer set in the 16S region of the mitochondrial genome that is nearly 95% conserved at the sequence level across the class of Actinopterygii (ray-finned fishes) and could be used for long-range PCR amplification of fish mitogenomes from water samples. Long-range PCR is a viable option for the enrichment of whole mitogenomes from environmental samples because in a single amplification it can produce a fragment that encompasses the entire mitogenome (Zhang, Cui, & Wong, 2012). The amplification of the entire mitogenome in a single PCR has inherent time and cost advantages over amplification of multiple fragments, reduces the potential complications of nuclear-encoded mitochondrial pseudogenes (NUMTs) and avoids the rearrangement of targeted priming sites in mitogenomes with altered gene order (Cameron, 2014). However, the success of long-range PCR amplifications depends on the presence of relatively high-quality and highmolecular-weight DNA.
Sequencing whole mitogenomes from eDNA samples could vastly improve species assignment capabilities because full-length barcodes, such as the cytochrome c oxidase I (COI) region for animals (Hebert, Ratnasingham, & de Waard, 2003), could be recovered and used for the identification of species in communities. Other mitochondrial genes typically used in phylogeography, systematics and conservation genetics could also be recovered in their entirety to provide a non-destructive method for sampling whole communities for studies related to community phylogenetics and conservation.
In this study, we validate a method to amplify and sequence entire mitochondrial fish genomes from water samples (Figure 1). In silico we tested newly designed mitochondrial primers and in vitro performed long-range PCR and next-generation sequencing on resulting amplifications using a mock community amassed from tissue extracted DNA.
In situ we validated the method using water samples collected from a mesocosm experiment with an assembled fish community of eight species, and from a natural stream known to have 12 species present at the time of sampling. DNA extractions from the mesocosm and stream samples were the same as those used for two previous eDNA studies of fish communities Olds et al., 2016), allowing us to compare the species richness estimated from a shortfragment PCR amplicon approach to that of using long-range PCR and whole mitogenome sequencing from water samples.

| Primer design and in silico evaluation
A batch download of Actinopterygii 16S sequences from GenBank was aligned using the PartTree algorithm in MAFFT version 7 (Katoh & Standley, 2013). The alignment was viewed in BioEdit (Hall, 1999) and conserved regions identified by eye. These regions were then evaluated in Primer3 (Untergasser et al ., 2012) for putative primer pairs. Primers Actinopterygii16SLRpcr_F (5′-CAGGACATCCTAATGGTGCAG-3′) and Actinopterygii16SLRpcr_R (5′-ATCCAACATCGAGGTCGTAAAC-3′) were designed to be immediately adjacent to one another to PCR amplify nearly the entire mitogenome in a single reaction. The only part of the mitogenome not amplified is the 43-bp area covered by the Actinopterygii16SLRpcr priming region.
To evaluate the potential taxonomic coverage of the primers, they were concatenated into a single sequence in the same reading frame (i.e. the reverse primer was reverse-complemented before it was concatenated) resulting in a 43-bp fragment. The fragment was then aligned with Actinopterygii fish mitogenomes available on MitoFish v3.02 ( Figure 1a) (Iwasaki et al., 2013). One thousand five hundred and twelve fish species in the class Actinopterygii, representing 62 orders and 310 families, were considered (Appendix S1). blastn (blastall 2.2.26) was used to align primer fragments to mitogenomes with output in tabular format, e value = 10 −4 , and without low-complexity filter (−m 8 −F F −e 1e−4) to ensure a single hit for each mitogenome (Altschul, Gish, Miller, Myers, & Lipman, 1990). Mismatches between each primer and the reference genomes from MitoFish are given in Appendix S1.

| In vitro evaluation using mock community
To test the performance of the primers, laboratory methods and bioinformatic pipeline, an in vitro test was performed using a mock community sample at a concentration of 0.6 ng/μl (0.1 ng/μl from each species) of tissue-derived DNA from six Indo-Pacific ma-

| In situ evaluation from water samples
Environmental DNA used for this study was extracted from filtered water samples that were utilized in two previous studies: two sam- Details of the collection, filtration and DNA extraction have been F I G U R E 1 Methods overview for laboratory workflow used to design and test long-range PCR (LR-PCR) for sequencing of whole mitochondrial genomes from environmental DNA. Each box (a-d) represents the steps used in silico, in vitro and in situ to validate the method. High-throughput sequencing on the Illumina MiSeq is abbreviated as HTS previously described in Evans et al. (2016) and Olds et al. (2016). DNA extracts were treated with Zymo OneStep™ PCR Inhibitor Removal (Zymo, Irvine, CA) kits prior to amplification of mitogenomes via longrange PCR. Amplifications were done exactly as described in the in vitro section with the exception that 20 μg bovine serum albumin (VWR, Radnor, PA) was added to the PCR reaction.
Based on Qubit readings for the cleaned PCR-amplified products, 1 ng (in a total volume of 5 μl) for each of the mesocosm samples was prepared for sequencing on the Illumina MiSeq following the manufacturer's suggested protocol as outlined for the Nextera XT DNA Library Preparation Kit. For each of the eight stream samples, libraries were prepared as described for the in vitro test on the mock community.

| Bioinformatic filtering, mapping and de novo assembly
Raw reads were quality filtered by removing Illumina sequencing adaptor, low-quality sequences with average quality less than Q20 in any 10-bp window and short sequences with length less than 50-bp using Trimmomatic v0.32 (Bolger, Lohse, & Usadel, 2014) with "ILLUMINACLIP: MiSeq.adapter.fas:3:30:6:1:true SLIDINGWINDOW:10:20: MINLEN:50." Reads were used for alignment and assembly only when both forward and reverse reads passed quality filtering. After quality filtering, reads were merged to avoiding counting the sequencing depth twice based on the same DNA fragment using USEARCH (Edgar, 2010) v8.1.1861_i86linux32 fastq_mergepairs command with minimum overlap length = 16 by default and maximum difference percentage = 1% (−fastq_maxdiffpct 0.01). If paired end reads could be merged, only the merged reads were used for mapping and de novo assembly analyses. If paired end reads could not be merged, both forward and reverse reads were mapped to mitogenomes and were used in the de novo assembly. Therefore, both merged and unmerged reads were mapped to the 6 mitogenomes represented in the mock community, the 8 mitogenomes from the two mesocosm densities  and the 14 mitogenomes from fish that were previously captured (or known to occur in the watershed) when water samples were taken from Juday Creek (Figure 1d) . BWA v0.7.15-r1140 ) with a maximum difference in the seed (−k) equal to 2 and seed length equal to 32 was used for mapping. The missing probability was set under a 0.02 error rate (−n) to be equal to 0.06 which is consistent with a 97% similarity of OTU clustering used for the taxonomic assignment of amplicons from previous studies Olds et al., 2016). SAI format files from both ends of the reads were combined with "bwa sampe" command with maximum insert size equal to 1,000 bp. Merged reads were aligned as single ended reads with "bwa samse" command. Reads with mapping quality (MAPQ) <20 were removed. Only unique aligned reads (i.e. reads with "XT:A:U" in sam file) were used for reference mapping. Additionally, all reads from Juday Creek were combined before mapping to reference sequences. Samtools v1.2 ) command "stats" was used to calculate number of reads mapped for each reference.
The mapping ratio was calculated as (number of mapped merged reads + number of mapped unmerged reads)/(the total number of merged and unmerged reads). Single nucleotide polymorphisms were determined from mapped reads as described in Data S1. BEDtools v2.25.0 (Quinlan & Hall, 2010) command "genomecov" was used for reference coverage and average sequencing depth calculation.
Reference coverage was visualized using Geneious v.9.1.5 and scaffolds from de novo assembly were mapped to the reference using default values in Geneious v.9.1.5. To check for cross contamination during sample handling, reference mapping was done with all species used in this study for all libraries. Accession numbers for reference sequences are reported in Appendix S1.
For the mesocosm samples (HE3 and HS3, Evans et al., 2016), reads after Trimmomatic filtering were used for de novo assembly with the metagenomics assembler IDBA-UD v1.1.1 (Peng, Leung, Yiu, & Chin, 2012). Command "fq2fa" was used to transfer fastq format to fasta format and remove any read with an unknown base pair "N." Command "idba_ud -pre_correction -min_support 20" was used to assemble mitogenomes. For the mock community and Juday Creek samples, reads after Trimmomatic filtering were normalized based on kmer frequency (Crusoe et al., 2015) because the sequencing depth from these samples was too high. A custom Perl script was made to remove reads with sequencing depth higher than 50× based on 17-mer. Normalized reads and Perl script used for de novo assembly are provided on Dryad (http://dx.doi.org/10.5061/dryad.q5gg0).
Normalized reads were assembled with the same parameters as the mesocosm samples with the metagenomics assembler IDBA-UD v1.1.1. Assembled scaffolds were then mapped to the reference genomes with blast+ (Camacho et al ., 2009) to estimate coverage for each gene based on the references using a 95-97% sequence similarity cutoff.

| Long-range PCR compared with amplicon sequencing approach
Reads produced from independently sequenced gene fragments (16S, 12S and CytB) in previous studies Olds et al., 2016) were mapped to the same reference mitogenomes used in the in situ evaluation with the same parameters as those used for long-range amplified mitogenomes. Only reads that passed quality control were mapped. Mitogenome coverage of the amplicons and sequence depth were evaluated and qualitatively compared to the values achieved from long-range amplified products for the entire de novo assembled gene. Differences in the estimated species richness between the two methods were also documented.

| In silico evaluation of primers
For the in silico test of the Actinopterygii16SLRpcr primer fragment (i.e. encompassing both the forward and reverse primers), only one fish order averaged greater than two mismatches, while the other 61 orders averaged less than or equal to two mismatches (Appendix S2).
The reverse primer binding site was more conserved than the forward primer binding site. Based on an average mismatch criterion of two or less mismatches, the Actinopterygii16SLRpcr primer region has the potential to long-range PCR-amplify entire mitogenomes for all orders except Osteoglossiformes. However, several of the species within many orders had mismatches of two or more base pairs in the first three base pairs of the 3' end that could lead to failed amplification of their DNA (Appendix S2). Primers were not assessed for similarity to other taxonomic groups outside of Actinopterygii.

| In vitro evaluation and de novo assembly of mitogenomes from mock community
For the mock community experiment, 77.6% of reads mapped to one of the six species included in the mixture of DNA (Appendix S3).
Nearly whole mitogenomes were recovered for all species in the mock community using the reference mapping approach (Table 1). SNPs were called only in one species (Data S1).
Using the de novo assembly approach, a single scaffold was produced that covered the full-length of the reference genome for three of the six mock community species. For the other three species, two scaffolds for each species were recovered such that when combined they covered between 97.9% and 100% of their respective reference mitogenomes (Table 1). Additionally, the full length of genes commonly used in fish eDNA metabarcoding were recovered (Appendix S4). Of the reads that mapped to species not included in the mock community, four species that were only in the mesocosms and not Creek showed moderate levels of mapping to their reference (0.04% of reads) even though they were not included in the mock community library (Appendix S5). However, the de novo assembly showed that no scaffolds for species from Juday Creek could be recovered above 1,900 bp in length (Appendix S5).

| In situ evaluation from water samples
For the high-density, even abundance (HE3) and skewed abundance (HS3) mesocosm communities, 65.2% and 75.1% of reads mapped to one of the eight species included in the mesocosm (Appendix S3).
SNPs were called in most species (Data S1).
Using the de novo assembly approach on the mesocosm samples, long scaffolds were assembled for most species and were near complete mitogenomes (e.g. 16,524 bp for Catostomus commersonii in HS3, Table 1). De novo results were more consistent when abundance was even (HE3). Pimephales promelas de novo assembled scaffolds were short in both mesocosm communities and several species had many scaffolds that could not be merged into a single assembly without guidance of a reference sequence (Table 1). Genes commonly used in eDNA metabarcoding studies of fish were de novo assembled for all species even when the whole mitogenome could not be and ranged in coverage from 89% to 100% (Appendix S4).
For Juday Creek, 46.2% of reads could be uniquely mapped to 10 of 12 species confirmed present when water samples were collected and two additional species previously detected only from eDNA ) (Appendix S3). Whole mitogenomes for 10 of the 12 species caught with electrofishing were successfully recovered from mapping reads to their references (94-100%, Table 1, Figure 2). The two species not recovered were Lepomis macrochirus and Salmo trutta.
We additionally recovered mitogenomes from two species previously only detected from eDNA, but are known to occur in the watershed (Cyprinus carpio and Micropterus salmoides) ( Table 1). SNPs were called for all species (Data S1).
The de novo assembly recovered nearly whole mitogenome scaffolds for about half the species (Table 1). The degree of coverage of these scaffolds to their reference sequences ranged from a single scaffold representing a nearly complete mitogenome to 29 smaller scaffolds covering nearly the entire mitogenome (Table 1, Figure 2). Genes commonly used in eDNA metabarcoding studies could be recovered from the de novo assemblies and the scaffolds overlapping these genes covered from 88% to 100% of the full gene length (Appendix S4). Some sequence reads from the Juday Creek sample mapped to the reference genomes of fish that were only used in the mock community and mesocosm experiment and were not present in Juday Creek (Appendix S5). In most cases the sequence depth and degree of coverage of these reads was low (0.6%-6.5%), with higher coverage of references from the mock community species (Appendix S5).
None of these species had de novo assembled scaffolds longer than 2,100 bp in length.
Comparisons between long-range PCR-amplified mitogenomes and the short-fragment PCR amplicons generated from the same DNA extracts revealed some species (S. trutta and L. macrochirus) whole mitogenomes were not detected even though their smaller fragments were (Appendix S3). Additionally, two species detected previously only by eDNA, M. salmoides and C. carpio , were also detected; mapping revealed that their whole mitogenomes could be recovered and that long fragments could be de novo assembled (Table 1).

| DISCUSSION
We demonstrate that it is possible to sequence whole mitogenomes of fish from DNA extracted from water samples by coupling T A B L E 1 Long-range PCR (LRP) amplicon read mapping statistics for species in the mock community, two mesocosm communities and Juday Creek. Reference coverage for each species' mitogenome was evaluated in one of the two ways: each read mapped to a reference or de novo assembled (see methods long-range PCR amplification and shotgun sequencing techniques. These results contradict the common assumption that eDNA from communities of living fishes in water is highly degraded (e.g. Bohmann et al., 2014). Our results show instead that some of the eDNA from macro-organisms currently inhabiting a water body remains intact at least at the mitochondrial genome size (c. 16 kb).
Sequencing whole mitogenomes from water samples is a pioneering way to non-invasively assess communities of living macro-organisms.
Additionally, it may make characterization of macro-organism eDNA relevant in studies of population and conservation genetics, systematics and phylogeography.
This methodological advance alleviates the burden of basing species identifications on short-fragment PCR amplicons. Most current eDNA studies from macro-organisms base their taxonomic assignments on DNA fragments of <150 base pairs (Port et al., 2016;Valentini et al., 2016). The amount of information in such a small fragment will always be limited, and in many cases assignments to the species level are not possible without additional information. With whole mitogenomes, entire barcode regions can be excised from a dataset and used for taxonomic assignment. For example, using the de novo assembly approach we recovered full-length mitochondrial genes (cytochrome oxidase I, cytochrome B, 12S and 16S) for the fish species in Juday Creek (e.g. Semotilus atromaculatus), even when their entire mitochondrial genome could not be assembled. Given the amount of missing diagnostic information in reference databases used for macroorganism taxonomic assignment (Shaw et al., 2016;Trebitz, Hoffman, Grant, Billehus, & Pilgrim, 2015), generating data with this method will allow use of any region of the mitogenome for which data exist to identify environmental sequences. This method will therefore be invaluable unless and until this void is filled with other methods such as genome skimming (Coissac, Hollingsworth, Lavergne, & Taberlet, 2016).
While the results from our in situ tests are promising, a number of questions remain about the particular environmental circumstances in which eDNA is left intact at the mitogenome level. In this study we used water sampled in September from a small third-order tributary to the St. Joseph River in Northwestern Indiana, USA. The mean annual discharge of Juday Creek is 0.44 m 3 /s samples .
The average temperature on the day of sampling was 22.6°C (Shirey, Brueseke, Kenny, & Lamberti, 2016). We did not collect other variables at the time of sampling as it was not our goal to test these attributes here, but we encourage future studies to investigate additional conditions (e.g. temperature, turbidity, pH), density of individuals, flow conditions, sample type (e.g. soil, sediment), etc., that may facilitate or inhibit the amplification of whole mitogenomes.
Consideration of the species composition is also important. We observed that the de novo assembler was not able to assemble some scaffolds into entire mitogenomes (Figure 2). One possible explanation for this observed pattern is that conserved regions with high sequence similarity among species are difficult to accurately assemble from a complex mixture. Therefore, we expect the de novo assembly method, when not guided by a reference mapping approach, will be challenging in fish communities when species pairs are closely related and sequence similarity is high. Additionally, for de novo assemblies we observed that in some cases high sequencing depth inhibited the concatenation of multiple shorter scaffolds (Figure 2). While down sampling did join smaller scaffolds together more frequently, we still observed the pattern that species with the highest read depth tended to have the largest number of unassembled scaffolds (Table 1, Figure 2). Applying long-read sequencing technology such as PacBio SMRT technology to long-range amplified products could potentially solve this problem by avoiding short reads altogether (Schloss, Jenior, Koumpouras, Westcott, & Highlander, 2016).

Species
From the laboratory perspective, there are many handling steps that could be optimized to improve detection of whole mitogenomes.
For example, the filters were extracted with a Chloroform-isoamyl DNA extraction, but the use of buffered phenol (pH 8.0) in addition to chloroform-isoamyl is known to increase the quantity of high molecular weight DNA during DNA extractions (Blin & Stafford, 1976). Therefore, testing of laboratory methods from extraction to library preparation may increase the yield of eDNA suitable for long-range PCR.
We detected a small amount of contamination between samples run on the same MiSeq run (i.e. Juday Creek and the mock community). The low level of contamination between libraries did not result in high coverage of the mitogenomes nor allow de novo assembly of complete mitogenomes in these samples. Additionally, because our study design intentionally used tropical marine fish for the mock community, they could easily be excluded from occurring in Juday Creek, a temperate fresh water habitat. We cannot determine from our study's design at which point the contamination happened because the libraries showing the greatest amount of cross contamination were also processed in the laboratory at the same time (i.e. mock community and Juday Creek samples). The mock community and Juday Creek libraries were also prepared using the TruSeq Nano LT Sample Prep Kit using a single index step. Duel indexing helps to circumvent problems associated with "tag jumping," and we cannot rule out this phenomenon as a problem for our study (Schnell, Bohmann, & Gilbert, 2015 While it is promising that we could identify SNPs and estimate allele frequencies, there remain many questions about the validity of these estimates from eDNA. Specifically, we could not confirm our SNPs from tissues of the actual species and such tests are needed before adoption of this method is warranted. Until now, most studies generate operational taxonomic units (OTUs) and summarize the data for a species at 97%-99% similarity (Hänfling et al., 2016;Olds et al., 2016), and thus have not utilized the intraspecific level variation present in eDNA samples. A recent study by Sigsgaard et al. (2016) demonstrated that haplotype diversity is attainable from water samples at least at the single-species level using primers that targeted a small amplicon (<500 bp) in whale sharks. We expect that future studies applying our method of whole mitochondrial genome sequencing from water samples will yield similar results, but at the community scale for entire mitogenomes.
The continued advancement of single molecule and long-read technologies, such as the Oxford Nanopore MinION (Laszlo et al., 2014), will improve our approach. Here, we used Illumina sequencing which required sheering the mitogenomes, using sonication or a transposase-mediated process used in the Nextera library preparation kit, to fragment them before sequencing and subsequently remapping these reads to a reference sequence or conducting de novo assembly. Coupling long-range PCR amplifications and sequencing without fragmentation would avoid many of the problems associated with short-fragment based assembly or reference mapping. We expect that long-read sequencing technologies, once they are cost effective F I G U R E 2 Linear visualization for long-range PCR-amplified fish mitogenomes sequenced from environmental DNA water sampled in Juday Creek, IN, USA. Species are depicted in alphabetical order from top to bottom excluding the two species (Lepomis macrochirus and Salmo trutta) that did not produce a full mitogenome (Table 1). The numbered orange bar with yellow background is the reference sequence. The blue graphic indicates the number of reads mapped at each site (i.e. sequencing depth in Table 1) and each of the black bars below the reference sequence are independent scaffolds that were de novo assembled and subsequently mapped to the reference. Scaffolds connected with a red line are a single scaffold and the line connecting them indicates a gap in the assembly. Gaps in the reference sequence (indicated by a red line) represent insertions compared to the de novo assembled scaffold and error rates are reduced, will become the method of choice for sequencing long-range PCR products and will allow population genetic analysis of eDNA samples.