metan: An R package for multi-environment trial analysis
Abstract
- Multi-environment trials (MET) are crucial steps in plant breeding programs that aim at increasing crop productivity to ensure global food security. The analysis of MET data requires the combination of several approaches including data manipulation, visualization and modelling. As new methods are proposed, analysing MET data correctly and completely remains a challenge, often intractable with existing tools.
- Here we describe the metan R package, a collection of functions that implement a workflow-based approach to (a) check, manipulate and summarize typical MET data; (b) analyse individual environments using both fixed and mixed-effect models; (c) compute parametric and nonparametric stability statistics; (d) implement biometrical models widely used in MET analysis and (e) plot typical MET data quickly.
- In this paper, we present a summary of the functions implemented in metan and how they integrate into a workflow to explore and analyse MET data. We guide the user along a gentle learning curve and show how, by adding only a few commands or options at a time, powerful analyses can be implemented.
- metan offers a flexible, intuitive and richly documented working environment with tools that will facilitate the implementation of a complete analysis of MET datasets.
1 INTRODUCTION
In 50 years (1967–2017) the world average of cereal yields has increased by 64%, from 1.68 to 2.76 tons/ha. In the same period, the total production of cereals has risen from 1.305 × 10⁹ to 3.6 × 10⁹ tons, an increase of 175%, while the cultivated area increased by only 7.9% (FAOSTAT, 2019). These unparalleled increases have been possible due to improved cultivation techniques in combination with superior cultivars. For maize, for example, 50% of the increase in yield was due to breeding (Duvick, 2005). Plant breeding programs have been developing new cultivars for adaptation to new locations, management practices or growing conditions, in a clear and crucial example of exploitation of genotype-versus-environment interaction (GEI).
The breeders' desire to model the GEI appropriately has led to the development of the so-called stability analyses, which include ANOVA-based methods (Annicchiarico, 1992; Shukla, 1972; Wricke, 1965; Yates & Cochran, 1938); regression-based methods (Eberhart & Russell, 1966); nonparametric methods (Fox, Skovmand, Thompson, Braun, & Cormier, 1990; Huehn, 1979; Lin & Binns, 1988; Thennarasu, 1995) and some methods that combine different statistical techniques, such as the additive main effect and multiplicative interaction (AMMI; Gauch, 2013) and genotype plus genotype-versus-environment interaction (GGE; Yan & Kang, 2003). It is therefore no surprise that scientific production related to multi-environment trial analysis has been growing fast in recent decades. A bibliometric survey in the SCOPUS database revealed that in the last half-century (1969–2019) 6,590 documents were published in 902 sources (journals, books, etc.) by 19,351 authors. In this period, the number of publications increased on average by 11.22% per year, and the largest share of the documents (~64%) was published in the last 10 years (see Appendix S1, item 1 for more details).
Linear mixed-effect models (LMM) have been used more frequently to analyse MET data. For example, between 2013 and 2015, the largest number of papers proposing methods to deal with GEI were related to the best linear unbiased prediction (BLUP) in LMMs (Eeuwijk, Bustos-Korts, & Malosetti, 2016). Recent advances in this field showed that BLUP is more predictively accurate than AMMI and that the main advantages of these methods can be combined to help researchers to select or recommend stable and highly productive genotypes (Olivoto, Lúcio, Silva, Marchioro, et al., 2019). Thus, the rapid spread of these methods to users around the world can be facilitated if these procedures are implemented in specific software.
In most cases, analysing MET data involves manual checking of the data subset(s) to identify possible outliers, using some biometrical model to explore the relationships between traits (or groups of traits), computing a within-environment ANOVA, computing a joint ANOVA and, in case of a significant GEI, applying some stability method to explore it. While a spreadsheet program (e.g. Microsoft Excel) may be used to perform a visual check for outliers, an integrated development environment (IDE, e.g. R, SAS or Matlab) is often required to process the complex matrix operations required in some stability methods. IDEs, however, require a certain degree of expertise and have steep learning curves, which sometimes prevents users without coding experience from implementing certain methods. In this sense, R (R Core Team, 2019) packages have made life easier for hundreds of thousands of researchers by providing freely available collections of functions developed by the community.
Some open-source R packages that are designed for, or suitable for, analysing MET data are available. The stability package (https://CRAN.R-project.org/package=stability) contains a collection of functions to perform stability analysis. The ammistability package (https://CRAN.R-project.org/package=ammistability) computes multiple AMMI-based stability parameters. The gge (https://CRAN.R-project.org/package=gge) and GGEBiplots (https://CRAN.R-project.org/package=GGEBiplots) packages may be used to produce a GGE biplot. The R packages agricolae (https://CRAN.R-project.org/package=agricolae) and plantbreeding (http://plantbreeding.r-forge.r-project.org/), while not specifically coded for MET analysis, provide useful functions for computing parametric and nonparametric stability statistics. Although useful, these packages do not offer options to perform a complete analysis of MET data, i.e. to provide tools for all steps of the analysis (check, manipulation, analysis and visualization of data). For example, GGEBiplots requires as input a two-way table containing genotype-by-environment means with genotypes in rows and environments in columns, but does not provide any function to quickly create such a table from data that are often in a ‘long’ format in R. In addition, several studies compare different stability methods (e.g. Bornhofen et al., 2017; Freiria et al., 2018; Scapim et al., 2010; Shahbazi, 2019; Teodoro et al., 2019; Woyann et al., 2018). This requires a range of different packages to be used, making the coding tedious and difficult to follow. Thus, it seems valuable to create an R package that presents an easy workflow and incorporates the most used stability statistics, recently proposed stability methods (Olivoto, Lúcio, Silva, Marchioro, et al., 2019; Olivoto, Lúcio, Silva, Sari, & Diel, 2019), options for cross-validation procedures (Piepho, 1994) and BLUP-based stability statistics (Colombari Filho et al., 2013). These features are frequently used but are not yet implemented in any other R package for MET analysis.
Here, we describe the metan (multi-environment trial analysis) package, an open-source R package designed to provide an efficient and reproducible workflow for the analysis of MET data. Our main aim in this paper was to describe the features of metan and how this collection of functions can be useful for an intuitive and complete analysis of MET data.
2 THE METAN PACKAGE
The conceptual focus of metan is centred on five components (Figure 1): (a) check, manipulate and summarize typical MET data; (b) perform within-environment analysis of variance; (c) compute parametric and nonparametric stability analysis; (d) compute biometrical models widely used in the analysis of plant breeding trials and (e) quickly create typical plots for two-way data considering any combination of qualitative and quantitative factors.
A stable version of metan is available on CRAN (https://CRAN.R-project.org/package=metan) and can be installed directly via the R console using install.packages("metan"). The development version of the package is available on GitHub (https://github.com/TiagoOlivoto/metan) and can be installed using devtools:
# install.packages("devtools")  # uncomment to run
devtools::install_github("TiagoOlivoto/metan")
library(metan)
To illustrate the main features of the package, six example datasets (data_alpha, data_g, data_ge, data_ge2, int.effects and meansGxE) are distributed with metan. Comprehensive details and examples of the functionality of metan are available in our online documentation (https://tiagoolivoto.github.io/metan/). Indeed, we strongly encourage readers to refer to the vignettes as the primary source for information on metan's functionality since they are updated with every package release.
The metan package is built on an object-oriented approach, which allows, among other things, the reliable use of S3 generic functions such as plot(), predict() and print(). These functions can be called at any time to inspect and visualize a model. All functions in metan use non-standard evaluation: expressions are evaluated in the specified data frame rather than in the current or global environment, thus avoiding ambiguity in input data. In practice, an argument can be passed as an expression (e.g. a bare column name) rather than as a value, which reduces the amount of typing.
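The idea of non-standard evaluation can be illustrated with a minimal base-R sketch. The helper col_mean() and the toy data frame are hypothetical and are not part of metan; the sketch only shows how a bare column name can be resolved inside a data frame rather than in the global environment.

```r
# Hypothetical helper (not a metan function): mean of a column given as a bare name
col_mean <- function(data, var) {
  # substitute() captures the expression typed by the caller without evaluating it
  expr <- substitute(var)
  # eval() looks names up in the data frame first, then in the caller's environment
  values <- eval(expr, envir = data, enclos = parent.frame())
  mean(values, na.rm = TRUE)
}

trial <- data.frame(GEN = c("G1", "G2", "G3"), GY = c(2.5, 3.0, 3.5))
GY <- c(100, 200)  # a global object with the same name does not interfere

m_col  <- col_mean(trial, GY)         # evaluated on trial$GY
m_expr <- col_mean(trial, GY * 1000)  # any expression on the columns also works
```

Because the column is looked up in the data frame first, the global vector GY is never used.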
In metan, all functions take the input data as their first argument, so all of them work naturally with the forward-pipe operator %>% (Bache & Wickham, 2014), which makes the code cleaner and more logical. Most METs evaluate more than one trait in each genotype. Thus, when possible, functions in metan analyse a vector of variables and return the results in a list, saving a lot of time and code when several variables need to be analysed. For example, to compute the WAAS index (Olivoto, Lúcio, Silva, Marchioro, et al., 2019) for all the numeric traits of a dataset, we can combine the functions performs_ammi(), AMMI_indexes() and get_model_data() with %>% to get a two-way table with the statistic for each genotype and trait (see an example in Appendix S1, item 8.5.4). To our current knowledge, no other package designed for MET analysis presents these features.
Sometimes in MET, a certain analysis needs to be run for each level of a factor, e.g. compute a path analysis or check outliers for each environment of the trial. The R base function subset() could be useful, but becomes tedious if a large number of levels need to be evaluated. Users of metan can use the function group_by(), which takes an existing object (e.g. data.frame, tbl) and converts it into a grouped object. This object can be passed on to several functions with %>%. If a function recognizes such a class of data, it will take care of the details and compute what is required ‘by group’.
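For readers unfamiliar with ‘by group’ computation, the following base-R sketch shows the underlying split-apply-combine idea on toy data. It is not how metan implements grouped analyses; the data values and object names are made up for illustration.

```r
# Toy data (illustration only): two environments, two genotypes, two plots each
trial <- data.frame(
  ENV = rep(c("E1", "E2"), each = 4),
  GEN = rep(c("G1", "G2"), times = 4),
  GY  = c(2.0, 3.0, 2.2, 3.2, 4.0, 3.5, 4.4, 3.7)
)

# Split the data by environment, then apply the same summary to every piece
by_env <- split(trial, trial$ENV)
env_summary <- lapply(by_env, function(d) {
  c(mean = mean(d$GY), sd = sd(d$GY), n = nrow(d))
})

# Combine the per-environment results into one table
summary_table <- do.call(rbind, env_summary)
```

The same pattern applies to any analysis that takes a data frame: only the function passed to lapply() changes.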
2.1 Checking data
It is assumed that MET data have the following structure (columns): ENV, a factor with e levels, e being the number of environments; GEN, a factor with g levels, g being the number of genotypes; REP, a factor with r levels, r being the number of replicates within each environment; and at least one numeric variable, e.g. grain yield. The expected number of rows in a typical MET dataset is then e × g × r.
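The expected structure can be checked by hand with a few lines of base R. The sketch below builds a balanced toy layout (3 environments, 4 genotypes, 2 replicates; the response values are placeholders) and verifies that the number of rows equals e × g × r and that every ENV × GEN × REP cell occurs exactly once.

```r
# Balanced toy layout: 3 environments x 4 genotypes x 2 replicates
trial <- expand.grid(
  REP = factor(paste0("R", 1:2)),
  GEN = factor(paste0("G", 1:4)),
  ENV = factor(paste0("E", 1:3))
)
trial$GY <- seq_len(nrow(trial)) / 10  # placeholder response values

e <- nlevels(trial$ENV)
g <- nlevels(trial$GEN)
r <- nlevels(trial$REP)
expected_rows <- e * g * r

# In balanced data every ENV x GEN x REP combination appears exactly once
cell_counts <- table(trial$ENV, trial$GEN, trial$REP)
is_balanced <- nrow(trial) == expected_rows && all(cell_counts == 1)

# Dropping one plot breaks the expectation
incomplete <- trial[-1, ]
is_incomplete <- nrow(incomplete) < expected_rows
```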
The function inspect() scans all columns of a data frame object for errors that may affect the use of functions in metan and returns a warning if (a) the data have fewer than three columns as factor; (b) the data have fewer than the expected number of rows based on the levels of factor variables; (c) any variable has missing values; (d) any possible outliers are detected. Running inspect() is an optional and exploratory step that flags potential issues before analysis. Error check results are summarized in the R console as warnings, while a plot (Figure 1a) can also be created by using the argument plot = TRUE in the function (see more details in Appendix S1, item 6.1).
Outliers may violate the assumption of identically distributed errors in ANOVA models. Outliers tend to increase the estimate of sample variance, thus lowering the chance of rejecting the null hypothesis. In this regard, we strongly recommend checking for outliers, especially if the function inspect() returned a warning about them. Users of metan can use the function find_outliers() to check for possible outliers in a numeric variable, returning a summary in the console (Appendix S1, item 6.2) and a plot (Figure 1b) if plots = TRUE is used.
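The effect of an outlier on the sample variance can be seen in a small base-R sketch. It uses the common boxplot rule (1.5 times the interquartile range beyond the hinges) as an assumption; the rule applied internally by find_outliers() is not described here and may differ. The values are toy data with one planted outlier.

```r
# Toy grain yield values; 9.5 is a planted outlier
gy <- c(2.1, 2.4, 2.6, 2.8, 3.0, 3.1, 3.3, 9.5)

# Boxplot rule: values beyond 1.5 x IQR from the hinges are flagged
flagged <- boxplot.stats(gy, coef = 1.5)$out

# The outlier inflates the sample variance
var_all   <- var(gy)
var_clean <- var(gy[!gy %in% flagged])
```

Comparing var_all with var_clean shows how a single extreme plot can inflate the error estimate used in the F-test.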
Descriptive statistics help researchers to describe and understand the structure of a MET dataset. The function desc_stat() computes a total of 28 statistics and, when combined with group_by(), can be used to implement a descriptive analysis for each level of a factor (see more details in Appendix S1, item 6.3.4).
Frequently in MET analysis two-way tables (e.g. genotypes in rows and environments in columns) need to be created to serve as data input in some procedure, for example, in the R package GGEBiplots. The function make_mat() can be used to create such a table. The user supplies the data frame in the ‘long’ format, the two variables to be mapped to rows and columns and one numeric variable whose values will fill the table, and make_mat() takes care of the details. Conversely, make_long() can be used to quickly convert a ‘wide’ table to a ‘long’ data frame (see an example in Appendix S1, item 6.4).
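To make the ‘long’ and ‘wide’ layouts concrete, the following base-R sketch performs both conversions on toy data. It is only an illustration of the data shapes involved and does not reproduce the code of make_mat() or make_long(); all values are made up.

```r
# Toy data in 'long' format: one row per genotype-environment mean
long <- data.frame(
  GEN = rep(c("G1", "G2", "G3"), times = 2),
  ENV = rep(c("E1", "E2"), each = 3),
  GY  = c(2.0, 2.5, 3.0, 3.5, 3.0, 4.5)
)

# long -> wide: genotypes in rows, environments in columns, cell means as values
wide <- tapply(long$GY, list(long$GEN, long$ENV), mean)

# wide -> long: one row per cell of the two-way table
back <- as.data.frame(as.table(wide), responseName = "GY")
names(back)[1:2] <- c("GEN", "ENV")
```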
2.2 Analysing individual environments
Individual analysis performed within each environment gives researchers important information regarding the performance of genotypes in such environments. Provided that a typical MET dataset is available, the function anova_ind() can be used to compute, for each environment, a fixed-effect ANOVA considering a randomized complete block design or an α-lattice design (Patterson & Williams, 1976). The function returns the significance of factors, coefficient of variation, heritability and accuracy of selection (see a numeric example in Appendix S1, item 7).
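A within-environment analysis for a randomized complete block design can be sketched in base R as follows. The data are simulated, the helper anova_one_env() is hypothetical, and the output covers only part of what anova_ind() is described as returning (heritability and accuracy are omitted).

```r
set.seed(1)  # simulated data, for illustration only
trial <- expand.grid(
  REP = factor(paste0("R", 1:3)),
  GEN = factor(paste0("G", 1:5)),
  ENV = factor(paste0("E", 1:2))
)
trial$GY <- 3 + 0.3 * as.integer(trial$GEN) + rnorm(nrow(trial), sd = 0.2)

# Hypothetical helper: fixed-effect RCBD ANOVA for the data of one environment
anova_one_env <- function(d) {
  tab <- anova(lm(GY ~ REP + GEN, data = d))
  data.frame(
    df_gen   = tab["GEN", "Df"],
    df_error = tab["Residuals", "Df"],
    p_gen    = tab["GEN", "Pr(>F)"],
    # coefficient of variation (%) from the residual mean square
    cv       = 100 * sqrt(tab["Residuals", "Mean Sq"]) / mean(d$GY)
  )
}

# Run the same model in every environment and bind the results
by_env <- do.call(rbind, lapply(split(trial, trial$ENV), anova_one_env))
```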
The function gamem() is used specifically to analyse genotypes using a mixed-effect model considering either a randomized complete block design or an α-lattice design (Patterson & Williams, 1976). The function get_model_data() can be used to extract model information such as variance components, genetic parameters and p-values for the likelihood ratio test for random effects. By using the function plot_blup() with an object of class gamem, the plot in Figure 1c is produced.
2.3 Stability analysis
After inspecting data, checking for outliers and possibly analysing individual environments, a visual inspection of the genotype–environment interaction can be made with the function ge_plot(), which will generate the plots in Figure 1m–n. The winning genotype within each environment can be found quickly using ge_winners(). Statistically, GEI can be checked in a joint analysis of variance performed with the function anova_joint() (Appendix S1, item 8). If GEI is significant, then it is reasonable to proceed with some stability analysis to explore such interaction. metan provides a collection of functions to implement widely used methods for stability analysis in the evaluation of multi-environment trials (Table 1).
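The joint analysis of variance can be sketched with base R for a balanced trial with blocks nested within environments. The data are simulated and the fixed-effect model below is one common formulation, assumed here for illustration; it is not necessarily the exact model fitted by anova_joint().

```r
set.seed(2)  # simulated data, for illustration only
trial <- expand.grid(
  REP = factor(paste0("R", 1:3)),
  GEN = factor(paste0("G", 1:4)),
  ENV = factor(paste0("E", 1:3))
)
# One simulated interaction effect per genotype-environment cell
cell <- interaction(trial$GEN, trial$ENV)
gei_effect <- rnorm(nlevels(cell), sd = 0.5)
trial$GY <- 3 + 0.2 * as.integer(trial$GEN) + 0.5 * as.integer(trial$ENV) +
  gei_effect[as.integer(cell)] + rnorm(nrow(trial), sd = 0.2)

# Environments, blocks within environments, genotypes and the GEI term
joint <- anova(lm(GY ~ ENV + ENV:REP + GEN + ENV:GEN, data = trial))

# Row of interest: the genotype-environment interaction, tested against the residual
gei_row <- joint["ENV:GEN", ]
```

Note that all F-tests in this table use the residual mean square as denominator; this is appropriate for the GEI term in a fixed-effect model, whereas the environment effect is usually tested against blocks within environments.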
Function | Method | References |
---|---|---|
Parametric | ||
Annicchiarico() | Genotypic confidence index | Annicchiarico (1992) |
ecovalence() | Wricke's ecovalence | Wricke (1965) |
gai() | Geometric adaptability index | Shahbazi (2019) |
ge_factanal() | Environment stratification | Murakami and Cruz (2004) |
ge_reg() | Joint regression analysis | Eberhart and Russell (1966) |
ge_stats() | Wrapper function | NA |
gge() | GGE biplot method | Yan and Kang (2003) |
mtsi() | Multi-trait stability index | Olivoto, Lúcio, Silva, Sari, et al. (2019) |
performs_ammi() | AMMI method | Gauch (2013) |
Resende_indexes() | BLUP-based stability statistics | Colombari Filho et al. (2013) |
Shukla() | Shukla's stability variance | Shukla (1972) |
waas(), waasb() | Weighted average of absolute scores | Olivoto, Lúcio, Silva, Marchioro, et al. (2019) |
wsmp() | Stability and mean performance | Olivoto, Lúcio, Silva, Marchioro, et al. (2019) |
Nonparametric | ||
Fox() | The ‘top third’ method | Fox et al. (1990) |
Huehn() | Huehn's stability statistics | Huehn (1979) |
Superiority() | Lin and Binns' superiority measure | Lin and Binns (1988) |
Thennarasu() | Thennarasu's stability statistics | Thennarasu (1995) |
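As a worked example of one of the parametric methods in Table 1, Wricke's ecovalence for genotype i is the sum over environments of the squared interaction effects, W_i = Σ_j (Y_ij − Ȳ_i. − Ȳ_.j + Ȳ..)². The base-R sketch below computes it from a toy two-way table of means (values made up for illustration); it does not call ecovalence().

```r
# Toy two-way table of means: genotypes in rows, environments in columns
means <- matrix(
  c(2.0, 3.0, 4.0,
    2.5, 3.5, 4.5,
    4.0, 3.0, 2.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(GEN = paste0("G", 1:3), ENV = paste0("E", 1:3))
)

# Interaction effects: remove genotype and environment means, add back the grand mean
gei <- sweep(sweep(means, 1, rowMeans(means)), 2, colMeans(means)) + mean(means)

# Ecovalence: sum of squared interaction effects per genotype (lower = more stable)
ecoval <- rowSums(gei^2)
least_stable <- names(which.max(ecoval))
```

In this toy table G1 and G2 follow the environmental trend and share the same ecovalence, while G3 reverses the trend and therefore has the largest value.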
After fitting a model, users can obtain custom plots to interpret the GEI. By invoking plot() on an object of class performs_ammi, residual plots (Figure 1d) can be obtained. In AMMI analysis, biplots (Figure 1f) are produced with the function plot_scores(), provided that an object of class performs_ammi, waas, waas_means or waasb is available in the Global Environment (see Appendix S1, item 8.5.3 for more details). In GGE models, fitted with the function gge(), 10 types of biplots (Yan & Kang, 2003) can be created. Figure 1g shows biplot type 8, used for ranking genotypes. All plots are produced with the package ggplot2 (Wickham, 2016), so users of metan can count on the high level of customization provided by ggplot2 to change any non-data elements of their plots.
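AMMI biplots are built from a singular value decomposition (SVD) of the matrix of interaction effects. The base-R sketch below shows this step on a toy table of means. The symmetric scaling of the scores (square root of the singular values) is an assumption made for illustration, since scaling conventions differ between implementations, and the code is not taken from performs_ammi() or plot_scores().

```r
# Toy two-way table of means: 4 genotypes x 3 environments (made-up values)
means <- matrix(
  c(2.0, 3.0, 4.0,
    2.5, 3.5, 4.5,
    4.0, 3.0, 2.0,
    3.0, 3.2, 3.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(GEN = paste0("G", 1:4), ENV = paste0("E", 1:3))
)

# Double-centred matrix of interaction effects
gei <- sweep(sweep(means, 1, rowMeans(means)), 2, colMeans(means)) + mean(means)

# SVD: each component is one interaction principal component axis
dec <- svd(gei)
explained <- dec$d^2 / sum(dec$d^2)  # share of the GEI sum of squares per axis

# Scores for a two-axis biplot (symmetric scaling, assumed here)
k <- 2
gen_scores <- dec$u[, 1:k] %*% diag(sqrt(dec$d[1:k]))
env_scores <- dec$v[, 1:k] %*% diag(sqrt(dec$d[1:k]))
dimnames(gen_scores) <- list(rownames(means), paste0("PC", 1:k))
dimnames(env_scores) <- list(colnames(means), paste0("PC", 1:k))
```

Because the interaction matrix is double-centred, at most min(g − 1, e − 1) axes carry information; here two axes reproduce the interaction effects in full.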
Users who research the associations between stability indexes (e.g. Bornhofen et al., 2017; Freiria et al., 2018; Shahbazi, 2019; Woyann et al., 2018) often find it difficult to compute the set of statistics and bind them into a ‘ready-to-read’ file. metan provides an efficient solution for doing that. The function ge_stats() is a wrapper function and can be used to compute all the stability methods shown in Table 1 at once. Then, users can use get_model_data() to extract either the statistics or the ranks related to each genotype in each index and variable (if multiple variables are used in ge_stats()), or corr_stab_ind() to compute a Spearman's rank correlation matrix between the computed stability indexes (see Appendix S1, item 8.10 for more details).
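The rank-correlation step can be sketched in base R once the indexes are bound into one table. The index values below are hypothetical numbers chosen for illustration, not results from any dataset, and the column names are only labels; the code does not call ge_stats() or corr_stab_ind().

```r
# Hypothetical index values for five genotypes (illustration only, not real results)
indexes <- data.frame(
  ecovalence  = c(0.9, 0.5, 3.6, 1.2, 2.0),
  shukla      = c(0.4, 0.2, 1.9, 0.7, 1.0),
  superiority = c(5.0, 1.0, 2.0, 4.0, 3.0),
  row.names   = paste0("G", 1:5)
)

# Rank of each genotype within each index
ranks <- sapply(indexes, rank)
rownames(ranks) <- rownames(indexes)

# Spearman's rank correlation matrix between the indexes
rho <- cor(indexes, method = "spearman")
```

Two indexes that rank the genotypes identically have a correlation of 1, regardless of their scales.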
2.4 Biometrical models
Multi-environment trials often generate data on several traits, and these data should be exploited. In breeding trials (as well as in many other areas), indirect selection helps geneticists and breeders to select superior genotypes (Ferrari et al., 2018; Fonseca, Lima, Dardengo, Silva, & Xavier, 2019; Gediya et al., 2019; Lopes Costa, Melo, & Oliveira Mano, 2019; Meira et al., 2017; Olivoto, de Souza, et al., 2017; Olivoto, Nardino, et al., 2017; Santos et al., 2018); thus, any tool that facilitates this work is welcome. metan provides useful functions for implementing biometrical models easily. This includes the functions corr_coef() for computing Pearson product-moment correlation with p-values, lpcor() for computing partial correlation coefficients, covcor_design() for computing phenotypic, genotypic and residual (co)variance/correlation matrices based on designed experiments, can_cor() for computing canonical correlation analysis, clustering() for clustering analysis, path_coeff() for computing path coefficients, corr_ss() for sample size planning, corr_plot() for a mixed (text and plot) visualization of a correlation matrix (Figure 1j), plot.corr_coef() for a correlation heat map (Figure 1k) and corr_ci() for computing nonparametric confidence intervals of Pearson's correlation (Figure 1l).
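As a point of reference for the simplest of these models, Pearson correlations with p-values for all pairs of traits can be obtained in base R as sketched below. The traits are simulated, the trait names and the helper cor_with_p() are hypothetical, and the sketch does not reproduce the output format of corr_coef().

```r
set.seed(3)  # simulated traits, for illustration only
traits <- data.frame(PH = rnorm(30, mean = 2.5, sd = 0.2))
traits$EH <- 0.5 * traits$PH + rnorm(30, sd = 0.05)
traits$GY <- rnorm(30, mean = 8, sd = 1)

# Hypothetical helper: correlation matrix plus a matrix of p-values
cor_with_p <- function(df) {
  vars <- names(df)
  r <- cor(df)
  p <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      # p-value of the t-test for zero correlation, off-diagonal cells only
      if (i != j) p[i, j] <- cor.test(df[[i]], df[[j]])$p.value
    }
  }
  list(r = r, p = p)
}

res <- cor_with_p(traits)
```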
Since metan was conceived for multi-environment trial analysis, the function group_by() can be used to pass grouped data, allowing, for example, a path analysis or a canonical correlation to be computed within each level of a factor, as shown in Santos et al. (2018). For more details, please refer to Appendix S1, item 9.
2.5 Data visualization
metan provides useful functions for quickly creating typical plots of two-way data, such as those observed in MET data. The function ge_plot() can be used for a visual inspection of the GEI (Figure 1m–n). The function plot_factbars() is used to create bar plots with two factors (Figure 1o). A plot like the one shown in Figure 1o requires only the data, the two factors and the response variable as mandatory arguments. Similarly, line plots with options for fitting different polynomial degrees can be made with the function plot_factlines(). In an experiment with two quantitative factors, the function resp_surf() can be used to fit a response surface model; then a surface plot (Figure 1p) can be created with plot() (see more details in Appendix S1, item 10).
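A response surface for two quantitative factors is commonly modelled with a second-order polynomial. The base-R sketch below assumes that form (the exact model used by resp_surf() is not stated here), fits it to noise-free toy data generated from a known surface, and locates the stationary point by setting the gradient to zero. The factor names x1 and x2 are placeholders.

```r
# 5 x 5 grid of two quantitative factors
design <- expand.grid(x1 = 0:4, x2 = 0:4)
# Noise-free response from a known quadratic surface (illustration only)
design$y <- with(design, 10 + 4 * x1 + 2 * x2 - x1^2 - 0.5 * x2^2 + 0.25 * x1 * x2)

# Second-order polynomial: linear, quadratic and interaction terms
fit <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = design)
b <- coef(fit)

# Gradient = 0  <=>  B %*% x = -(b1, b2), with B holding the second-order terms
B <- matrix(
  c(2 * b["I(x1^2)"], b["x1:x2"],
    b["x1:x2"],       2 * b["I(x2^2)"]),
  nrow = 2, byrow = TRUE
)
stationary <- solve(B, -c(b["x1"], b["x2"]))
names(stationary) <- c("x1", "x2")

# Both eigenvalues negative: the stationary point is a maximum
is_maximum <- all(eigen(B)$values < 0)
```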
3 CONCLUDING REMARKS AND FUTURE IMPROVEMENTS
The package metan was designed to facilitate the analysis of multi-environment trials, allowing for more effective and less time-consuming handling and processing of MET datasets, which have been growing rapidly in recent years. Users will find in metan a complete framework to implement the most used parametric and nonparametric stability statistics for MET analysis. The package implements stability methods not available in any other R package, including the estimation of BLUP-based stability statistics (Colombari Filho et al., 2013), newer stability methods such as the WAASB, which is the weighted average of absolute scores from the singular value decomposition of the GEI effects matrix obtained in a linear mixed model (Olivoto, Lúcio, Silva, Marchioro, et al., 2019), the multi-trait stability index (Olivoto, Lúcio, Silva, Sari, et al., 2019) and the implementation of cross-validation procedures for AMMI and BLUP models (Piepho, 1994). metan can also be useful to many other researchers since it provides options for implementing widely used multivariate statistics, e.g. path analysis and linear, partial and canonical correlations. The estimation of stability indexes for several variables at once and the estimation of biometrical models for each level of a factor are features that set metan apart from previously published R packages for MET analysis. These features will reduce the amount of coding and save researchers time when running their analyses. The metan package is (and will always be) extensively documented online, with transparent and fully reproducible examples. metan is currently under active development, so new functions will be implemented in the near future. Our next efforts will be focused on implementing cross-validation procedures for GGE models, allowing cross-validation to run in parallel, and increasing the number of stability methods available.
ACKNOWLEDGEMENTS
We thank the National Council for Scientific and Technological Development (CNPq) and Coordination for the Improvement of Higher Education Personnel (CAPES) for fellowships and grants to the authors. The authors have no conflict of interest to declare.
AUTHORS' CONTRIBUTIONS
T.O. conceived the ideas and authored the software and manuscript; A.D.L. assisted in the implementation of methods and critically revised the manuscript; both authors gave final approval for publication.
Open Research
DATA AVAILABILITY STATEMENT
Since metan is updated regularly, the source code used in this manuscript has been archived at https://doi.org/10.5281/zenodo.3548917 as metan version 1.1.0 (Olivoto, 2019). Please note that there is an updated version of metan on CRAN (https://CRAN.R-project.org/package=metan), so we strongly suggest that users download it. To explore metan's latest functionality, we invite you to download the development version from GitHub (https://github.com/TiagoOlivoto/metan). Package vignettes are also open-source, accessible at https://tiagoolivoto.github.io/metan/. Installing and loading metan will automatically load all example data used in this paper.