Population genomic SNPs from epigenetic RADs: gaining genetic and epigenetic data from a single established next-generation sequencing approach

Epigenetics is increasingly recognised as an important molecular mechanism underlying phenotypic variation. To study DNA methylation in ecological and evolutionary contexts, epiRADseq is a cost-effective next-generation sequencing technique based on reduced representation sequencing of genomic regions surrounding non-/methylated sites. EpiRADseq for genome-wide methylation abundance and ddRADseq for genome-wide SNP genotyping follow very similar library and sequencing protocols, but to date these two types of dataset have been handled separately. Here we test the performance of using epiRADseq data to generate SNPs for population genomic analyses. We tested the robustness of using epiRADseq data for population genomics with two independent datasets: a newly generated single-end dataset for the European whitefish Coregonus lavaretus, and a re-analysis of publicly available, previously published paired-end data on corals. Using standard bioinformatic pipelines with a reference genome and without (i.e. de novo catalogue loci), we compared the number of SNPs retained, population genetic summary statistics, and population genetic structure between data drawn from ddRADseq and epiRADseq library preparations. We find that SNPs drawn from epiRADseq are similar in number to those drawn from ddRADseq, with a 55-83% of SNPs being identified by both methods. Genotyping error rate was <5% in both approaches. For summary statistics such as heterozygosity and nucleotide diversity, there is a strong correlation between methods (Spearman’s rho > 0.88). Furthermore, identical patterns of population genetic structure were recovered using SNPs from epiRADseq and ddRADseq approaches. We show that SNPs obtained from epiRADseq are highly similar to those from ddRADseq and are equivalent for estimating genetic diversity and population structure. This finding is particularly relevant to researchers interested in genetics and epigenetics on the same individuals because using a single epigenomic approach to generate two datasets greatly reduces the time and financial costs compared to using these techniques separately. It also efficiently enables correction of epigenetic estimates with population genetic data. Many studies will benefit from a combinatorial approach with genetic and epigenetic markers and this demonstrates a single, efficient method to do so.

Summary 21 1. Epigenetics is increasingly recognised as an important molecular mechanism underlying 22 phenotypic variation. To study DNA methylation in ecological and evolutionary contexts, epiRADseq 23 is a cost-effective next-generation sequencing technique based on reduced representation sequencing 24 of genomic regions surrounding non-/methylated sites. EpiRADseq for genome-wide methylation 25 abundance and ddRADseq for genome-wide SNP genotyping follow very similar library and 26 sequencing protocols, but to date these two types of dataset have been handled separately. Here we 27 test the performance of using epiRADseq data to generate SNPs for population genomic analyses.
28 29 2. We tested the robustness of using epiRADseq data for population genomics with two independent 30 datasets: a newly generated single-end dataset for the European whitefish Coregonus lavaretus, and a 31 re-analysis of publicly available, previously published paired-end data on corals. Using standard 32 bioinformatic pipelines with a reference genome and without (i.e. de novo catalogue loci), we 33 compared the number of SNPs retained, population genetic summary statistics, and population genetic 34 structure between data drawn from ddRADseq and epiRADseq library preparations. 35 36 3. We find that SNPs drawn from epiRADseq are similar in number to those drawn from ddRADseq, 37 with a 55-83% of SNPs being identified by both methods. Genotyping error rate was <5% in both 38 approaches. For summary statistics such as heterozygosity and nucleotide diversity, there is a strong 39 correlation between methods (Spearman's rho > 0.88). Furthermore, identical patterns of population 40 genetic structure were recovered using SNPs from epiRADseq and ddRADseq approaches. 41 42 4. We show that SNPs obtained from epiRADseq are highly similar to those from ddRADseq and are 43 equivalent for estimating genetic diversity and population structure. This finding is particularly 44 relevant to researchers interested in genetics and epigenetics on the same individuals because using a 45 single epigenomic approach to generate two datasets greatly reduces the time and financial costs 46 compared to using these techniques separately. It also efficiently enables correction of epigenetic 3 Introduction 54 55 The advent of Next Generation Sequencing (NGS) has facilitated a revolution in ecology and 56 evolution by enabling the integration of the two fields to better elucidate molecular patterns and 57 mechanisms (Ekblom & Galindo, 2011). Technologically, an advance of NGS is not just reduced, per 58 base pair costs of sequencing but that genomic techniques can be applied to so-called 'non-model' 59 species or those without reference genomes or other genomic resources (Ekblom & Galindo, 2011).

75
The study of epigenetic processes, which cause change in gene expression without nucleotide 76 mutation of the underlying genome sequence, is providing a new complexity in the genotype -77 phenotype map and in some cases a disconnect of genotype and phenotype (Feil & Fraga, 2012). The 78 best understood epigenetic mechanism is DNA methylation, which involves the addition of a methyl 79 group to cytosine, and in eukaryotes it occurs mainly in CpG dinucleotides ( (Peterson et al. 2012) and involves the digestion of the genome using two restriction enzymes, with 98 one enzyme being methylation-sensitive. Therefore, a methylated locus will not be cut by the 99 methylation-sensitive enzyme, will not be enriched by PCR nor sequenced, and thus no sequencing 100 read is obtained in the data. If a locus is unmethylated, it will be cut in the same way as ddRADseq 101 and therefore enriched by PCR and sequenced. Therefore, the number of overall reads for a locus is 102 proportional to the level of (non-)methylation and differences in the methylation level between groups

209
To assess the sensitivity of SNP calling to missing data for epiRADseq data, we created three

263
The population filtering generated datasets of 1,046 SNPs and 819 SNPs for ddRADseq and 264 epiRADseq respectively (Fig. 1a). The number of SNPs retained in our study is slightly lower to those 265 used by the original study (1,113 SNPs from ddRADseq, also assessed here). By mapping reads to a 266 reference assembly, we could calculate the number of SNPs that overlapped between the two datasets.

267
In total 676 SNPs overlapped, which corresponds to 83% of SNPs in the epiRADseq and 65% of 268 SNPs in the ddRADseq datasets.

12
DAPC analyses of the epiRADseq and ddRADseq datasets recovered the same three clusters 270 as were inferred from the original study from Dimond et al. using ddRADseq (Fig. 2). Our Fst 271 estimates between clusters ranged from 0.24 to 0.26, while the estimates of Dimond et al. were 0.19 to 272 0.21 (Fig 2a,b,c). The proportion of variation explained by the discriminant functions was similar in 273 all three datasets (Fig. 2). When comparing estimates of genetic diversity, we recovered strong 274 Spearman's σ correlation for all three summary statistics between the ddRADseq and the epiRADseq 275 datasets (Fig. 3).  The number of SNPs retained was very similar for those generated with the epiRADseq method and 299 the ddRADseq method and decreased with increasing filtering stringency (Fig. 1); for the epiRADseq 300 generated data we recovered 6971, 6686, and 5546 SNPs in the -r 67, -r 75, and -r 100 datasets 301 respectively, while for the ddRADseq generated data we recovered 7289, 6988, and 5277 SNPs in the 302 three datasets respectively. A total of 4518 SNPs were shared between the two -r 67 datasets, 4294 303 SNPs were shared between the two -r 75 datasets, and 2978 SNPs were shared between the -r 100 304 datasets.

305
The estimates of heterozygosity and nucleotide diversity inferred from ddRADseq and 306 epiRADseq derived SNPs were highly correlated, with Spearman's correlations of 88.5 to 92.8% 307 (Table 2). When looking at the genetic diversity estimates per individual, which would be impacted 308 by allele dropout, we observed no reduction in expected heterozygosity (V = 58, p-value = 0.45) or 309 nucleotide diversity (V = 82, p-value = 0.31) for the epiRADseq data.

310
The results of the population genetic structure analysis with DAPC were consistent across 311 filtering stringencies and datasets (Fig. 4), with the four populations being grouped into two genetic 312 clusters separating on axis 1 (and so displayed on one axis of variation instead of the two shown for 313 the corals). Fst divergence between the two clusters was identical between methods for the -r 67 and -314 r 75 datasets at Fst = 0.23, and it was negligibly higher for the ddRADseq in the -r 100 datasets at 315 0.24 and 0.25 (Fig. 4).

318
Here we used two independent natural animal population datasets to show that epiRADseq data can 319 be used to derive SNPs for population genomic analyses. We compared SNP number, estimates of 320 summary statistics, and inference of population structure between ddRADseq and epiRADseq 321 methods in a newly generated dataset of European whitefish and a previously published dataset on 322 14 corals. Overall, we find strong agreement for all of the above metrics between epiRADseq and 323 ddRADseq protocols, meaning that epiRADseq data give equivalent results to the well-established 324 method of ddRADseq-derived SNPs. The implication is that a single dataset can be used for 325 epigenetic analyses and for inference of population structure. This is not only efficient but also 326 valuable studies on the association between epigenetic and genetic diversity and their impact on 327 phenotype.

328
Here we used previously published data and new data when comparing the epiRADseq and 329 ddRADseq generated SNPs, which allows us to demonstrate the robustness of the molecular methods

353
We explored the effect of different filtering levels on the SNP retention of epiRADseq and 354 ddRADseq derived SNPs from the whitefish data. We did not explore this with the coral data as we 355 were more interested in comparing estimates of population structure between epiRADseq and the 356 previously published estimates derived from ddRADseq. As expected, the -r 67 and -r 75 ddRADseq 357 datasets had more SNPs than the respective epiRADseq datasets, but the epiRADseq -r 100 dataset 358 had more SNPs than the ddRADseq -r 100 dataset. This is probably due to the higher coverage of the

362
We find an agreement between ddRADseq and epiRADseq analyses of population structure 363 in the whitefish data, as both methods recover two clusters in our dataset of four sampled and closely 364 related populations. The -r filtering had some impact on the correlation of the summary statistics 365 between ddRADseq and epiRADseq, with the correlation increasing from as low as 88% up to 92% as 366 the filtering became more stringent. This is expected because of the -r parameter in Stacks, which 367 influences the number of individuals in a population a locus must be present to be retained in the 368 dataset. In the -r 67 and -r 75 datasets, it is not required for the locus to be present in the same set of 369 individuals (i.e. in two-thirds or three-quarters of all individuals in a population, respectively), while 370 in the -r 100 datasets this restriction is complete so all retained SNPs have to be shared across all 371 individuals. We did not explore further filtering in our analyses, but previous work (e.g. Paris,

376
Instead each of these stringencies and datasets would be valid. Overall, these results suggest that 377 allowing some missing data (i.e. -r of 67% or 75%) will not bias genetic analyses conducted with 16 SNPs from epiRADseq data, consistent with what has already been shown previously with ddRADseq 379 (Shafer et al. 2017).

380
We tested whether allele drop out (ADO) due to locus methylation (Schield et al., 2016) had 381 an effect when using epiRADseq derived SNPs for genetic analyses. It has been shown through 382 simulations (Gautier et al. 2013) and observed in empirical studies (Luca et al. 2011) that ADO leads 383 to an underestimation of expected heterozygosity and nucleotide diversity. This could be a concern 384 for epiRADseq derived SNPs because, by design, a methylated locus is not cut with epiRADseq and 385 therefore will be absent from the dataset. However, we found no difference between ddRADseq and 386 epiRADseq genetic diversity estimates per individual, suggesting ADO is not a particular concern in 387 epiRADseq data.  Here, we showed that the recently developed epiRADseq approach for the study of DNA 435 methylation variation can also be used for generating SNPs for population genetic analyses, using 436 both reference-based and de novo approaches. Sequencing only an epiRADseq library halves the cost 437 in time, consumables, and sequencing compared to sequencing ddRADseq for SNPs and epiRADseq 438 for methylation abundance. This combination provides informative biological data for population 439 genomics and differential methylation, which is a topic of growing interest in molecular ecology and 440 evolution for its heritable and non-heritable effects (Hu & Barrett, 2017