Volume 9, Issue 5
APPLICATION

genotypeR: An integrated R package for single nucleotide polymorphism genotype marker design and data analysis

Stephen A. Sefick

Corresponding Author

E-mail address: ssefick@auburn.edu

Department of Biological Sciences, Auburn University, Auburn, AL, USA

Magdalena A. Castronova

Department of Biological Sciences, Auburn University, Auburn, AL, USA

Laurie S. Stevison

Corresponding Author

E-mail address: lstevison@auburn.edu

Department of Biological Sciences, Auburn University, Auburn, AL, USA

First published: 04 January 2018

Abstract

  1. Single nucleotide polymorphism (SNP) genotyping is an important tool for addressing basic and applied questions involving, for example, genomic structure, recombination, introgression, parentage/pedigrees and the genetic basis of traits. Each of these applications shares a similar workflow: marker design, genotyping and data analysis.
  2. In this manuscript, we present genotypeR, a package that implements a common genotyping workflow with a standardized software interface. The genotypeR package is written in R with integration of a marker design pipeline written in Perl.
  3. genotypeR designs SNP genotyping markers from VCF files produced from variant calling of sequence data. These markers are processed before genotyping to ensure that they can be used in downstream analyses. After marker multiplexing suitability has been assessed on the genotyping platform, genotyping is conducted and raw output from the genotyping assay is processed by genotypeR. The primary post‐genotyping functionality includes commonly used QA/QC procedures, genotype conversions, recombination analysis and data export to a popular program that uses genotyping data (R/qtl).
  4. genotypeR provides a unified software environment for analysing SNP genotyping data, and will be useful for researchers investigating various research questions, removing the need for individual researchers to write custom software, and facilitating a common workflow.

1 INTRODUCTION

Single nucleotide polymorphism (SNP) genotyping is an important, widely used technology for answering basic and applied questions. As next‐generation sequencing (NGS) technology costs decrease, genome sequence data availability in (non‐)model taxa is increasing. While NGS genotyping approaches (e.g. genotyping by sequencing [GBS]) are increasingly applied, these methodologies are often cost prohibitive for studies genotyping large samples at few markers. Both the GoldenGate and Sequenom genotyping platforms offer a cost‐effective alternative to GBS when addressing problems not requiring many markers (Perkel, 2008). For example, ancestry informative markers can be designed from a few high‐coverage NGS samples and then used to screen many individuals to assign ancestry or admixture status (Li, Gown, et al., 2014; Li, Waldbieser, et al., 2014). In addition, genotyping can be used to map phenotypes to chromosomal locations with quantitative trait loci (QTL) analysis (Erickson, Fenster, Stenoien, & Price, 2004), and to study recombination rate variation with linkage analysis (Stevison, Hoehn, & Noor, 2011; Stevison & Noor, 2010). Beyond these traditional genetic approaches, genotyping has aided structural variant discovery (Eichler, 2012) and genome assembly (Hahn, Zhang, & Moyle, 2014; Kawakami et al., 2014). Because of these benefits, the development of NGS technology has increased the overall utility of traditional genotyping rather than replacing it.

In addition, SNP genotyping can facilitate non‐model organism conservation when whole genome re‐sequencing is cost prohibitive. Recently, genotyping was used to investigate natural/artificial introgression between Blue Catfish and bass populations (Li, Gown, et al., 2014; Li, Waldbieser, et al., 2014). Similarly, SNP genotyping correctly identified fathers in cooperative breeding bird populations more cheaply and more often than microsatellites (Weinman, Solomon, & Rubenstein, 2014). Furthermore, SNP genotyping of aquaculture shrimp populations helped to understand pedigrees and growth performance (Jung et al., 2013; Sellars et al., 2012).

These SNP genotyping applications have similarities in study design and analysis. They require initial genome analysis to discover high‐quality SNPs. Following variant identification, the user designs markers, conducts genotyping and analyses data. Because of these similarities, we present an open‐source package for the SNP genotyping workflow written in R (Ihaka & Gentleman, 1996; R Core Team, 2016), with Perl integration. genotypeR is designed to facilitate the entire genotyping workflow (i.e. raw VCF to processed genotypes) using a consistent software interface (summarized in Figure 1). We made our workflow compatible with two genotyping platforms, Sequenom (Gabriel, Ziaugra, & Tabbaa, 2009) and Illumina's GoldenGate (Fan et al., 2003; Shen et al., 2005). Finally, we provide export functions into the popular R/qtl package for QTL mapping and related software (e.g. xoi) (Broman & Kwak, 2015; Broman, Wu, Sen, & Churchill, 2003).

Figure 1. Analysis flow diagram for genotypeR. Panels (a–c) divide the workflow into steps that are inside or outside the software. Blue and black denote analyses conducted within genotypeR; grey and orange denote steps outside of genotypeR.

2 R PACKAGE

2.1 Software design

This package uses S4 object‐oriented programming to define the genotyping data structure, classes and methods, which standardizes the user experience.

2.2 Marker design (Perl)

Marker design can be run either from the Perl code available at https://github.com/StevisonLab/genotypeR/tree/master/inst/SequenomMarkers_v2, or from the r/SequenomMarkers_v2 directory upon installation (Figure 1a).

Marker design begins with a processed VCF file containing high‐quality variant and non‐variant sites, produced with GATK Haplotype Caller –allSites function or samtools mpileup (Li & Durbin, 2009; McKenna et al., 2010).

Pipeline input can be either a single multi‐individual population VCF file or two VCF files from different groups, which supports multiple applications. Marker design consists of identifying SNP genotyping markers that (1) are variable between the two input VCF files or within the population VCF file and (2) have 100 bp (Sequenom) or 50 bp (GoldenGate) of flanking reference sequence. This pipeline requires vcftools version ≥0.1.14 and Perl 5, and is compatible with standard VCF files (Danecek et al., 2011).
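
The two selection criteria can be illustrated with a minimal Python sketch. It is not the logic of the package's Perl scripts: the function name, the dictionary representation of each sample's calls and the example positions are all hypothetical, and only the flank sizes come from the text.

```python
# Illustrative sketch only; not the logic of the package's Perl scripts.
FLANK_BP = {"sequenom": 100, "goldengate": 50}  # flank sizes stated in the text


def candidate_sites(calls_a, calls_b, chrom_length, platform="sequenom"):
    """Return 1-based positions that differ between two samples and have full flanks.

    calls_a and calls_b map position -> called allele for each sample
    (a hypothetical simplification of two single-sample VCF files).
    """
    flank = FLANK_BP[platform]
    keep = []
    for pos in sorted(set(calls_a) & set(calls_b)):
        differs = calls_a[pos] != calls_b[pos]
        has_flank = pos - flank >= 1 and pos + flank <= chrom_length
        if differs and has_flank:
            keep.append(pos)
    return keep


# Made-up calls on a 1,000 bp chromosome.
sample_a = {40: "A", 150: "C", 400: "G", 940: "T"}
sample_b = {40: "G", 150: "C", 400: "T", 940: "C"}
print(candidate_sites(sample_a, sample_b, chrom_length=1000))                # [400]
print(candidate_sites(sample_a, sample_b, 1000, platform="goldengate"))      # [400, 940]
```

Position 40 is dropped on both platforms because it lies too close to the chromosome start, whereas position 940 only has enough flanking sequence for the shorter GoldenGate flank.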

Wrapper scripts that inform the user of progress are provided for both command line and R use. The arguments are the full paths to the VCF input file(s), the genotyping platform and an output directory (R version). If two VCF files are supplied, the unzipped VCF files are compared with vcftools diff‐sites to identify sample differences. The output is used in the Perl program Finding_SNPs_two_sample.pl. If one population VCF is provided, the raw VCF is input into Finding_SNPs_pop_sample.pl. Both “finding_SNPs” programs identify potential SNP genotyping markers satisfying the above criteria and output one or two bed files (UCSC, 2017b), depending on the number of VCF file(s). Next, the “finding_SNPs” output bed file(s) and the original VCF file(s) are used in the corresponding Grandmaster_SNPs_{two_sample||pop_sample}.pl to make an extended bed file that includes the designed SNP genotyping marker sequence as the 4th column. Finally, the bed file is sorted by chromosome and position (unsupported on Windows).

For genotypeR integration, we provide SequenomMarkers for running the marker design step within an R session or as an executable Rscript.

[Code listing: example use of SequenomMarkers]

2.3 Post‐design/pre‐genotyping

Designed markers need to be tested for genotyping platform multiplexing suitability. The output of Sequenom_Markers can be input into Sequenom software, or prepared for submission to Illumina with GoldenGate2iCOM_design. After marker development, the marker design file should be read into R with read_Master_SNPs_Data and filtered to include only the genotyping assay marker set, and marker names should be made with make_marker_names. These marker names are used as genotyping assay names and ensure that downstream functions have marker distances. Following this, genotyping must be conducted before the rest of genotypeR can be used (Figure 1b).
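
Why marker names can carry distance information is easiest to see in a small Python sketch. It assumes that names follow the chrom_start_end pattern visible in Table 1 and that start and end are the SNP position minus and plus the flank; both are assumptions, the SNP positions are made up, and the helper names are hypothetical rather than part of make_marker_names.

```python
def marker_name(chrom, snp_pos, flank=100):
    """Build a chrom_start_end name (assumed pattern) from a SNP position."""
    return f"{chrom}_{snp_pos - flank}_{snp_pos + flank}"


def marker_midpoint(name):
    """Recover the chromosome and the central position encoded in a name."""
    chrom, start, end = name.rsplit("_", 2)
    return chrom, (int(start) + int(end)) // 2


def marker_distance_bp(name_a, name_b):
    """Physical distance between two markers on the same chromosome."""
    chrom_a, pos_a = marker_midpoint(name_a)
    chrom_b, pos_b = marker_midpoint(name_b)
    if chrom_a != chrom_b:
        raise ValueError("markers are on different chromosomes")
    return abs(pos_b - pos_a)


first = marker_name("chr2", 1001813)   # hypothetical SNP position
second = marker_name("chr2", 2001441)  # hypothetical SNP position
print(first)                              # chr2_1001713_1001913
print(marker_distance_bp(first, second))  # 999628
```

Because the coordinates are embedded in the name, any function that receives only assay names can still order markers and compute the spacing between them.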

2.4 Post‐genotyping QA/QC

Post‐genotyping calls need to be exported to a csv file for analysis in genotypeR (Figure 1c). The csv should be read in with: (1) read_in_sequenom_data or (2) read_in_illumina_GoldenGate. These functions can be extended to other genotyping platforms using provided example data. The first pipeline step is initialize_genotypeR_data, which will create an object with class genotypeR that can be used in subsequent analysis. The genotype_table argument input is produced using Ref_Alt_Table with the output of read_Master_SNPs_Data.

In addition to general genotyping compatibility, genotypeR is compatible with multiple types of genetic data: (1) standard genetic crosses where F2 progeny have one of three possible genotypes (coded with zero_one_two_coding), (2) population samples investigating genetic diversity or source ancestry and (3) a backcross (BC) design with two possible progeny genotypes. To initialize the genotypeR data structure for these designs, we provide initialize_genotypeR_data. This function has the argument output="pass_through" for non‐BC (default) or BC genotype calls without error checking. For BC error QA/QC, we provide the argument output="warnings" in initialize_genotypeR_data to identify homozygous genotypes that are impossible in a BC design. Once the warnings have been manually inspected and rectified, any remaining ones can be converted to NAs with the argument output="warnings2NA". Similarly, Heterogametic_Genotype_Warnings can identify sex chromosome heterozygosity in the heterogametic sex, which is impossible in a BC design.
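
The BC check can be sketched in Python as follows. This is not the implementation of initialize_genotypeR_data: the function names and data layout are hypothetical, and only the three mode names mirror the output values described above. The sketch assumes that each marker has a known recurrent‐parent allele, so that a progeny homozygous for any other allele is impossible.

```python
def impossible_in_backcross(genotype, recurrent_allele):
    """True for a homozygote that does not carry the recurrent-parent allele."""
    if genotype is None:
        return False
    first, second = genotype
    return first == second and first != recurrent_allele


def screen_backcross(calls, recurrent_alleles, mode="warnings"):
    """Return (calls, flagged markers) under one of three hypothetical modes.

    calls maps marker -> two-letter genotype such as "AG", or None if missing.
    """
    flagged = [m for m, g in calls.items()
               if impossible_in_backcross(g, recurrent_alleles[m])]
    if mode == "pass_through":   # no error checking
        return dict(calls), []
    if mode == "warnings":       # report impossible calls, leave data unchanged
        return dict(calls), flagged
    if mode == "warnings2NA":    # replace impossible calls with missing values
        cleaned = {m: (None if m in flagged else g) for m, g in calls.items()}
        return cleaned, flagged
    raise ValueError(f"unknown mode: {mode}")


example_calls = {"m1": "AA", "m2": "AG", "m3": "TT", "m4": None}
recurrent = {"m1": "A", "m2": "A", "m3": "C", "m4": "G"}
print(screen_backcross(example_calls, recurrent, "warnings")[1])        # ['m3']
print(screen_backcross(example_calls, recurrent, "warnings2NA")[0]["m3"])  # None
```

Separating the reporting mode from the conversion mode reflects the workflow in the text: calls are inspected first and only set to missing once the user has decided they cannot be rectified.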

2.5 Analysis ready genotypes

For downstream analysis, the function binary_coding converts raw genotype homozygotes and heterozygotes to 0 and 1 respectively. To demonstrate analysis of binary coded data, we count crossovers (COs) with count_COs, which is only implemented for BC designs. This function summarizes counted crossovers for all individuals in each marker interval in two ways: (1) naively, with the last genotype before a run of missing genotypes carried forward to the next call, or (2) by distributing the CO based on marker distance. An example output of count_COs is provided in Table 1. We also provide subsetChromosome to split the data by chromosome.
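
The binary coding and the naive carry‐forward rule can be sketched in Python for a single individual. The helper names are hypothetical and this is not the code behind binary_coding or count_COs; the distance‐based alternative is not shown.

```python
def code_genotype(genotype):
    """Homozygote -> 0, heterozygote -> 1, missing stays missing (None)."""
    if genotype is None:
        return None
    return 0 if genotype[0] == genotype[1] else 1


def naive_crossover_intervals(coded):
    """Return indices of intervals (i, i + 1) to which a crossover is assigned.

    Markers must be in chromosomal order. The last observed state is carried
    forward across missing calls, so a switch detected after a gap is assigned
    to the interval that ends at the next called marker.
    """
    intervals = []
    last_state = None
    for index, state in enumerate(coded):
        if state is None:
            continue  # carry the previous state forward
        if last_state is not None and state != last_state:
            intervals.append(index - 1)
        last_state = state
    return intervals


raw = ["AA", "AG", None, None, "AA", "AA"]
coded = [code_genotype(g) for g in raw]
print(coded)                             # [0, 1, None, None, 0, 0]
print(naive_crossover_intervals(coded))  # [0, 3]
```

In the example, the second switch could have occurred anywhere between the second and fifth markers; the naive rule places it entirely in the last of those intervals, which is the bias that the distance‐based option is meant to reduce.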

Table 1. Sample output from count_COs. interval_start and interval_end are the bounding marker names for each interval.
Crossovers  interval_start        interval_end          num_ind  Start      End        percent_CO  cMperMb_CO  start_Mb  end_Mb  Kosambi_cM
7.00        chr2_1001713_1001913  chr2_2001341_2001541  95.00    1,001,713  2,001,541  7.37        7.37        1.00      2.00    7.42
2.00        chr2_2001341_2001541  chr2_3000870_3001070  95.00    2,001,341  3,001,070  2.11        2.11        2.00      3.00    2.11
5.00        chr2_3000870_3001070  chr2_4007269_4007469  95.00    3,000,870  4,007,469  5.26        5.23        3.00      4.01    5.28
  • num_ind is the number of individuals used to count COs. Start and End are the start and end positions of each recombination interval in bp, whereas start_Mb and end_Mb are the same positions in Mb. cMperMb_CO is the CO rate in cM/Mb and Kosambi_cM is the Kosambi distance in cM.
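
The derived columns of Table 1 can be reproduced from the crossover count, the number of individuals and the interval bounds. The Python sketch below uses the standard definitions: percent_CO as the recombinant fraction in percent, cMperMb_CO as that value divided by interval length in Mb, and the Kosambi mapping function. These formulas match the tabulated values, but whether count_COs computes them in exactly this way is an assumption.

```python
import math


def interval_summary(crossovers, num_ind, start_bp, end_bp):
    """Return (percent_CO, cMperMb_CO, Kosambi_cM), each rounded to 2 decimals."""
    r = crossovers / num_ind                  # recombinant fraction
    percent_co = 100 * r
    length_mb = (end_bp - start_bp) / 1e6
    # Kosambi mapping function: d = 1/4 * ln((1 + 2r) / (1 - 2r)) Morgans,
    # multiplied by 100 to express the distance in centimorgans.
    kosambi_cm = 25 * math.log((1 + 2 * r) / (1 - 2 * r))
    return (round(percent_co, 2),
            round(percent_co / length_mb, 2),
            round(kosambi_cm, 2))


print(interval_summary(7, 95, 1001713, 2001541))  # (7.37, 7.37, 7.42)
print(interval_summary(2, 95, 2001341, 3001070))  # (2.11, 2.11, 2.11)
print(interval_summary(5, 95, 3000870, 4007469))  # (5.26, 5.23, 5.28)
```

At these small recombinant fractions the Kosambi distance is only slightly larger than percent_CO, because the correction for undetected double crossovers is small over short intervals.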

In addition to counting COs, data can be exported for QTL mapping. The function convert2qtl_table exports the output of binary_coding to a csv that is compatible with R/qtl read.cross and can be used downstream (e.g. for QTL mapping, or for coincidence analysis with the xoi package).
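
For orientation, the general layout of an R/qtl "csv" cross file is a header row with phenotype and marker names, a second row giving the chromosome of each marker (blank under phenotype columns), and then one row per individual. The Python sketch below writes that layout for binary coded BC data. The mapping of 0 to "A", 1 to "H" and missing to "-" is an illustrative assumption, as are the function name and the example data; the exact output of convert2qtl_table may differ.

```python
import csv
import io


def to_rqtl_csv(ids, markers, chroms, coded_rows):
    """Write binary coded genotypes in the R/qtl "csv" layout and return the text."""
    codes = {0: "A", 1: "H", None: "-"}   # assumed genotype coding
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id"] + list(markers))   # phenotype column, then marker names
    writer.writerow([""] + list(chroms))      # blank under phenotype, then chromosomes
    for individual, row in zip(ids, coded_rows):
        writer.writerow([individual] + [codes[g] for g in row])
    return buffer.getvalue()


table = to_rqtl_csv(
    ids=["ind1", "ind2"],
    markers=["m1", "m2", "m3"],
    chroms=["2", "2", "2"],
    coded_rows=[[0, 1, None], [1, 1, 0]],
)
print(table)
```

An optional third row with marker positions can follow the chromosome row in this format; it is omitted here to keep the sketch short.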

2.6 Other functions

COs between markers for an individual are counted with CO, and can be used for QA/QC by identifying contamination or used for investigating CO assurance (Stevison, Sefick, Rushton, & Graze, 2017).

Internal functions likely never called by the user and not included in the examples are documented here for completeness. The function grep_df_subset outputs a dataframe with specified columns removed, and sort_sequenom sorts dataframe columns based on marker position.

3 CASE STUDY

We developed and validated this package with three datasets: (1) Saccharomyces cerevisiae (Burke, Liti, & Long, 2014), (2) Ictalurus furcatus (Li, Waldbieser, et al., 2014) and (3) Drosophila pseudoobscura (McGaugh & Noor, 2011; Stevison et al., 2017). We used S. cerevisiae for multi‐individual population VCF marker design. We used I. furcatus Sequenom output and marker design files to refine the workflow (Figure 1c). Finally, we used two strains of D. pseudoobscura to develop genotypeR, which we discuss in detail below. First, we downloaded raw reads from the NCBI Sequence Read Archive (Kodama, Shumway, & Leinonen, 2011) (Flagstaff14/16: SRR330100/SRR330102) (McGaugh & Noor, 2011). These were aligned to the dp4 reference (UCSC, 2017a) with bwa (Li & Durbin, 2009). Second, alignment files were analysed with the GATK best practices pipeline, producing a VCF file including reference bases (McKenna et al., 2010) and binary genotypes. Third, we inspected annotation distributions to apply hard filtering (Table S1), and used VCF files to design markers with SequenomMarkers. A very similar workflow was used to generate the multi‐sample yeast population VCF file. For studies without a reference genome, we recommend mapping raw reads to an assembled reference (Figure 1b), and using the resulting VCF file in our workflow. In this way, studies with low coverage data, such as GBS (Li, Waldbieser, et al., 2014), RAD‐seq, or sequence capture, could be used to design genotyping markers.

In our case study, the marker design step produced 8,992 possible markers from a 43.32 Mb region. We thinned these to a subset of markers c. 1 Mb apart. We then used the Sequenom proprietary software to determine multiplex suitability (Gabriel et al., 2009), and iterated until 35 markers remained. Having many candidate markers at similar locations facilitated iteration of the marker set for multiplexed marker optimization. Finally, genotyping was completed, data underwent QA/QC, and recombination rate was calculated with genotypeR as outlined above. Datasets 1 and 3 are provided as package example data.
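
One simple way to thin candidate markers to a target spacing is a greedy pass along the chromosome, sketched below in Python. The text does not describe how the thinning was actually done, so this procedure, the function name and the example positions are assumptions for illustration only.

```python
def thin_markers(positions, min_spacing=1_000_000):
    """Greedily keep markers that lie at least min_spacing bp after the last kept one."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_spacing:
            kept.append(pos)
    return kept


# Made-up candidate positions (bp) on one chromosome.
candidates = [100_000, 450_000, 1_150_000, 1_900_000, 2_200_000, 3_600_000]
print(thin_markers(candidates))  # [100000, 1150000, 2200000, 3600000]
```

Keeping the discarded candidates available is useful in practice: if a retained marker fails the multiplex suitability check, a nearby candidate can be substituted and the check repeated.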

3.1 Marker efficacy

We developed markers to distinguish the two strains in our BC design. To be validated, a marker must be heterozygous in F1 females; 89% of markers met this criterion. To investigate the remaining 11%, we examined filtering thresholds associated with GATK base calls. Our analysis failed to identify specific patterns related to filtering. Nonetheless, a stricter approach has resulted in 99% success elsewhere (Weinman et al., 2014). Therefore, we suggest stringent variant filtering prior to marker design.

4 SOFTWARE ORIGINALITY

When we started our research project, we searched for an integrated workflow and could find none. Many references describe custom workflows, but we wanted an integrated platform for developing, processing and analysing SNP genotyping data commonly used by our lab. Thus we developed genotypeR to help other researchers. Similar software includes the xoi package, which provides a function countxo, similar to CO here.

5 CONCLUSION

Our pipeline picks up where standard NGS workflows leave off, starting with the input of the standard VCF file. In addition, we provide import and export functionality to commonly used tools and genotyping platforms. The current stable version is released on CRAN (https://cran.r-project.org/package=genotypeR), has a vignette containing a tutorial, and can be installed with install.packages("genotypeR"). The development version is located on GitHub (https://github.com/StevisonLab/genotypeR), which contains installation instructions, and can be directly installed with devtools::install_github (Wickham & Chang, 2017). Users can extend genotypeR for particular needs via pull requests on GitHub.

ACKNOWLEDGEMENTS

We thank Matt Galaska, Tonia Schwartz, Rory Telemeco, Nathan Whelan and anonymous reviewers for comments improving earlier drafts. The first author would especially like to thank Joboo for 17 years of wagging tails, love, companionship and encouragement in all life and academic pursuits.

    AUTHORS’ CONTRIBUTIONS

    L.S.S. conceived ideas, designed and funded the experiment leading to software development; L.S.S. and M.A.C. developed Perl software; S.A.S. developed genotypeR, analysed data, and led manuscript writing. All authors contributed to and approve of publication.

    DATA ACCESSIBILITY

    All data and code, referred to in this manuscript, used to develop and test genotypeR are available upon installation from github (https://github.com/StevisonLab/genotypeR) and CRAN (https://cran.r-project.org/package=genotypeR).