Volume 8, Issue 2
Research Article

Statistical Quantification of Individual Differences (SQuID): an educational and statistical tool for understanding multilevel phenotypic data in linear mixed models

Hassen Allegue

Marine Mammal Research Unit, Department of Zoology, Institute for the Oceans and Fisheries, University of British Columbia, 2202 Main Mall, Vancouver, BC, Canada, V6T1Z4

Yimen G. Araya‐Ajoy

Department of Biology, Center for Biodiversity Dynamics, Norwegian University of Science and Technology (NTNU), Trondheim, N‐7491 Norway

Research Group Evolutionary Ecology of Variation, Max Planck Institute for Ornithology, Eberhard‐Gwinner‐Straße, 82319 Seewiesen, Germany

Niels J. Dingemanse

Corresponding Author

E-mail address: n.dingemanse@lmu.de

Research Group Evolutionary Ecology of Variation, Max Planck Institute for Ornithology, Eberhard‐Gwinner‐Straße, 82319 Seewiesen, Germany

Behavioural Ecology, Department of Biology, Ludwig‐Maximilians University of Munich, Grosshadernerstrasse 2, 82152 Planegg‐Martinsried, Germany

Correspondence author. E‐mail: n.dingemanse@lmu.de
Ned A. Dochtermann

Department of Biological Sciences, North Dakota State University, Fargo, ND, 58108‐6050 USA

László Z. Garamszegi

Department of Evolutionary Ecology, Estación Biológica de Doñana–CSIC, c/ Americo Vespucio, s/n, 41092 Seville, Spain

Shinichi Nakagawa

Evolution & Ecology Research Centre, School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, NSW, 2052 Australia

Denis Réale

Département des Sciences Biologiques, Université du Québec A Montréal, CP 8888, Succursale centre‐ville, Montréal, QC, Canada, H3C3P8

Holger Schielzeth

Department of Evolutionary Biology, Bielefeld University, Morgenbreede 45, 33615 Bielefeld, Germany

Population Ecology, Institute of Ecology, Friedrich Schiller University, Dornburger Str. 159, 07743, Jena, Germany

David F. Westneat

Department of Biology, Center for Ecology, Evolution, and Behavior, University of Kentucky, Lexington, KY, 40506‐0225 USA

First published: 20 September 2016
All authors contributed equally and are listed in alphabetical order.

Summary

  1. Phenotypic variation exists in and at all levels of biological organization: variation exists among species, among individuals within populations and, in the case of labile traits, within individuals. Mixed‐effects models represent ideal tools to quantify multilevel measurements of traits and are being increasingly used in evolutionary ecology.
  2. Mixed‐effects models are relatively complex, and two main issues may be hampering their proper usage: (i) the relatively few educational resources available to teach new users how to implement and interpret them and (ii) the lack of tools to ensure that the statistical parameters of interest are correctly estimated.
  3. In this paper, we introduce Statistical Quantification of Individual Differences (SQuID), a simulation‐based tool that can be used for research and educational purposes. SQuID creates a virtual world inhabited by subjects whose phenotypes are generated by a user‐defined phenotypic equation, which allows easy translation of biological hypotheses into quantifiable parameters.
  4. Statistical Quantification of Individual Differences currently models normally distributed traits with linear predictors, but SQuID is subject to further development and will adapt to handle more complex scenarios in the future. The current framework is suitable for performing simulation studies, determining optimal sampling designs for user‐specific biological problems and making simulation‐based inferences to aid in the interpretation of empirical studies.
  5. Statistical Quantification of Individual Differences is also a teaching tool for biologists interested in learning, or teaching others, how to implement and interpret linear mixed‐effects models when studying the processes causing phenotypic variation. Interface‐based modules allow users to learn about these issues. As research on effects of sampling designs continues, new issues will be implemented in new modules, including nonlinear and non‐Gaussian data.

Introduction

Variation is the most striking feature of the natural world (Hallgrímsson & Hall 2005). However, we do not always appreciate that phenotypic variation is caused by processes occurring at multiple hierarchical levels (Wilson 1998; Nussey, Wilson & Brommer 2007; Williams 2008; Westneat, Wright & Dingemanse 2015). Phenotypes vary across species, across populations of the same species, across individuals of the same population and across repeated observations of the same individual. One of the most important biological levels to both ecological and evolutionary processes is the individual, with variation among individuals influencing social interactions (Dingemanse & Araya‐Ajoy 2015), population demography (Dochtermann & Gienger 2012), community structure (Bolnick et al. 2011) and evolutionary dynamics (Dall et al. 2012). Moreover, individuals express traits repeatedly and these expressions are known to vary distinctly within and among individuals (e.g. behaviour, life history, morphology, physiology).

By sampling traits repeatedly for a given set of individuals, we can estimate how changeable or stable individuals are. This assessment can be achieved by applying statistical approaches aimed at partitioning phenotypic variation into within‐ and among‐individual variance components (e.g. Nussey, Wilson & Brommer 2007; Nakagawa & Schielzeth 2010; Dingemanse & Dochtermann 2013). With this and other information in hand, we can start examining the genetic and environmental factors contributing to the variation observed at each hierarchical level (Wilson et al. 2010; Schielzeth & Nakagawa 2013). Doing so, however, requires specific sampling schemes and statistical tools to ensure that statistical parameters of interest are estimated accurately, precisely and with sufficient statistical power (Martin et al. 2011; van de Pol 2012; Dingemanse & Dochtermann 2013; Araya‐Ajoy, Mathot & Dingemanse 2015; Johnson et al. 2015; Kain, Bolker & McCoy 2015; Green & MacLeod 2016).

The mixed‐effects modelling framework has become a particularly popular statistical tool to achieve such aims, especially in the field of evolutionary ecology (Kruuk 2004; Bolker et al. 2009; O'Hara 2009; van de Pol & Wright 2009; Nakagawa & Schielzeth 2010, 2013; Wilson et al. 2010; Dingemanse & Dochtermann 2013). This is because mixed‐effects models explicitly model stratification in the data. One of the strengths of mixed‐effects models is that they evaluate the importance of fixed effects while simultaneously estimating the relative magnitudes of random effects. Mixed‐effects models, which include linear and generalized linear mixed models (Bolker et al. 2009), are often under‐utilized. One reason may be that they are not usually covered in introductory or mid‐level statistical courses despite being more challenging to learn and more difficult to use correctly compared to traditional approaches (Bolker et al. 2009; van de Pol & Wright 2009; Nakagawa & Schielzeth 2013). We therefore require tools to educate people how to appropriately perform the analysis of complex data inherent to mixed‐effects modelling.

With the complexity of mixed‐effects models comes the difficulty of assessing whether statistical parameters of interest are correctly estimated. Several recent papers have addressed this problem by using simulations to evaluate power and accuracy of parameter estimation, particularly for models estimating among‐ and within‐individual variance components (e.g. Martin et al. 2011; Garamszegi & Herczeg 2012; van de Pol 2012; Dingemanse & Dochtermann 2013; Araya‐Ajoy, Mathot & Dingemanse 2015; Kain, Bolker & McCoy 2015). Given the many possible biological questions that one may ask (e.g. Are individuals repeatable, do they vary in level of plasticity, are phenotypic traits correlated within or among individuals?), we require a flexible simulation environment that enables performance assessment (e.g. power and other sensitivity analyses) for all statistical parameters, both before and after data have been collected. As far as we are aware, there is no widely accessible simulation programme that targets ecologists and evolutionary biologists, and guides them in the best way to both sample and analyse data (as existing packages, detailed below, enable either one or the other rather than both).

In this paper, we introduce SQuID (Statistical Quantification of Individual Differences), an environment for simulating multilevel data. SQuID is an R (v3.3.0) package (R Core Team 2015) that, in addition to being usable like a traditional R package, can also be run through a user interface built with the shiny package (Chang et al. 2016). The latest released version of the SQuID package can be installed from CRAN (url: https://cran.r-project.org/package=squid) by running:

  • > install.packages("squid")

The latest development version can be installed from GitHub (url: https://github.com/hallegue/squid) by running:

  • > install.packages("devtools")

  • > devtools::install_github("hallegue/squid")

The following R code runs the SQuID application (browser‐based interface):

  • > library(squid)

  • > squidApp()

SQuID serves two main purposes. First, it provides an educational tool useful for students, teachers and researchers who want to learn to use mixed‐effects models. Users can learn how the mixed‐effects model framework can be used to understand distinct biological phenomena (e.g. the environmental factors generating variation within and among individuals, the hierarchical structure of phenotypes) by interactively exploring simulated multilevel data generated based on phenotypic equations. Secondly, SQuID offers research opportunities to those who are already familiar with mixed‐effects models, as SQuID enables the generation of data sets that users may use for a range of simulation‐based statistical analyses such as power and sensitivity analyses of highly realistic and complex multilevel data. With these two purposes, SQuID allows both educational and primary research opportunities.

We note that while several other R packages (e.g. clusterPower, manque, nlmeU, odprism, pamm, RLRsim, simr) are available to run simulations and to study specific aspects of mixed‐effects models (Scheipl, Greven & Kuchenhoff 2008; Martin et al. 2011; van de Pol 2012; Galecki & Burzykowski 2013; Reich & Obeng 2013; Wu 2014; Green & MacLeod 2016), most of them deal with issues related to statistical performance (i.e. power and bias). By separating data generation into two steps, one generating the world and the second generating the sampled data, SQuID provides more flexibility to manipulate the effects of data sampling on the results of a mixed‐effects model analysis. Furthermore, our platform offers a rigorous framework to evaluate some special aspects of the sampling design. Because the way data are sampled can have profound effects on the estimation of variance components in a mixed‐effects model, we conceived SQuID as an educational resource as well as a simulation tool.

The anatomy of SQuID

The core of SQuID is a phenotype generator. We introduce SQuID using individuals as the focal entities of interest, though other applications are possible. SQuID generates a virtual world inhabited by individuals whose phenotypes are generated by a user‐defined (uni‐ or bivariate) phenotypic equation (Nussey, Wilson & Brommer 2007) that is explicit on a temporal scale. Time is modelled in discrete steps, the number of which the user can define, but is typically very large to mimic continuous time. As a result, the world contains phenotypic values at all the time points and for all individuals. The SQuID framework therefore allows users to define the rules governing the SQuID world, simulate the environment and the phenotype of the individuals inhabiting it and then collect samples from this world (Fig. 1). Users can proceed to analyse the sampled data, make inferences about the data and try to reconstruct the world, while having the possibility to compare their inferences with the known rules that underlie the created world (Fig. 1).

Fig. 1. The SQuID world. For explanation, see the main text. SQuID, Statistical Quantification of Individual Differences.

The Phenotypic Equation

The phenotypic equation is the backbone for the generation of phenotypes, and it is used to simulate the phenotypes of individuals in each time step. It defines the causal components that govern data generation and translates a biological hypothesis into an estimable mathematical form. The equation is modular with multiple sources of variation that can be simulated (Table 1). Each source of variation can be switched off by setting the relevant parameter to zero. Indeed, most parameters are set to zero by default in order to allow efficient exploration from simple to more complex models. The phenotypic equation is constructed such that all causal components introduce variation in the form of deviations from the mean value while accounting for all other causal components. The basic phenotypic equation permits various univariate analyses of a single trait (Table 1). For example, a relatively simple version of the phenotypic equation is one where the phenotype is continuous, and individuals differ in their average phenotype (random intercepts for individuals) as well as in their linear response to a continuous environmental gradient (random slopes for individuals):
$$y_{hi} = (\beta_0 + I_i) + (\beta_1 + S_{1i})\,x_{1hi} + e_{hi}$$ (eqn 1a)
Table 1. Summary of the components that can be extracted from the mixed‐effects model, using a data set generated by SQuID in a situation where the phenotypic value $y_{hij}$ of the trait y is a function of two uncorrelated environmental variables ($x_1$ and $x_2$) and their interaction. The phenotype $y_{hij}$ is expressed for each individual i and at each instance of time h, in each high‐level group j.

Phenotypic equation:
$$y_{hij} = (\beta_0 + I_i) + (\beta_1 + S_{1i})\,x_{1hij} + (\beta_2 + S_{2i})\,x_{2hij} + (\beta_{12} + S_{12i})\,x_{1hij}\,x_{2hij} + G_j + e_{hij}$$

Summation of variance components (a):
$$V_P = V_{\beta_1} + V_{\beta_2} + V_{\beta_{12}} + V_I + V_{S_1} + V_{S_2} + V_{S_{12}} + V_G + V_R$$

| Component | Explanation | Variance component (b) | Remarks |
|---|---|---|---|
| Fixed effects | | | |
| $\beta_0$ | Population mean | | |
| $\beta_1$ | Population‐average response to an environmental effect $x_1$ (with variance $\mathrm{Var}(x_1)$) | $V_{\beta_1} = \beta_1^2\,\mathrm{Var}(x_1)$ | In SQuID $\mathrm{Var}(x_1) = 1$ |
| $\beta_2$ | Population‐average response to an environmental effect $x_2$ (with variance $\mathrm{Var}(x_2)$) | $V_{\beta_2} = \beta_2^2\,\mathrm{Var}(x_2)$ | In SQuID $\mathrm{Var}(x_2) = 1$ |
| $\beta_{12}$ | Population‐average interaction response to two environmental effects ($x_1$, $x_2$) | $V_{\beta_{12}} = \beta_{12}^2\,\mathrm{Var}(x_1 x_2)$ | Since in SQuID $\mathrm{Var}(x_1) = \mathrm{Var}(x_2) = 1$ and $x_1$ and $x_2$ are independent of each other, the expected variance of the product is $\mathrm{Var}(x_1 x_2) = 1$ (c) |
| Random effects | | | |
| $I$ | Individual‐specific deviations (random intercepts) | $V_I = \mathrm{Var}(I)$ | In the presence of random slope variation, $V_I$ expresses the variance at the point where all covariates are zero. Since all covariates are centred to zero in SQuID, this represents the variance at average values of the covariate(s) |
| $S_1$ | Individual‐specific response to an environmental effect $x_1$ (random slopes) | $V_{S_1} = \mathrm{Var}(S_1)\,\mathrm{Var}(x_1) + E(x_1)^2\,\mathrm{Var}(S_1)$ | In SQuID $\mathrm{Var}(x_1) = 1$ and $E(x_1) = 0$, which considerably simplifies the equation to $V_{S_1} = \mathrm{Var}(S_1)$ |
| $S_2$ | Individual‐specific response to an environmental effect $x_2$ (random slopes) | $V_{S_2} = \mathrm{Var}(S_2)\,\mathrm{Var}(x_2) + E(x_2)^2\,\mathrm{Var}(S_2)$ | In SQuID $\mathrm{Var}(x_2) = 1$ and $E(x_2) = 0$, which considerably simplifies the equation to $V_{S_2} = \mathrm{Var}(S_2)$ |
| $S_{12}$ | Individual‐specific response interaction to two environmental effects ($x_1$, $x_2$) (random slopes) | $V_{S_{12}} = \mathrm{Var}(S_{12})\,\mathrm{Var}(x_1 x_2) + E(x_1 x_2)^2\,\mathrm{Var}(S_{12})$ | In SQuID $\mathrm{Var}(x_1) = \mathrm{Var}(x_2) = 1$, $E(x_1) = E(x_2) = 0$ and $x_1$ and $x_2$ are independent, so the expected variance of the product is $\mathrm{Var}(x_1 x_2) = 1$ and the expected mean of the product is $E(x_1 x_2) = 0$, which considerably simplifies the equation to $V_{S_{12}} = \mathrm{Var}(S_{12})$ (c) |
| $I$ and $S_1$ | Covariance between random intercepts and random slopes in response to an environmental effect $x_1$ | $2E(x_1)\,\mathrm{Cov}(I, S_1)$ | In SQuID $E(x_1) = 0$ and hence the covariance does not contribute to total phenotypic variance (d) |
| $I$ and $S_2$ | Covariance between random intercepts and random slopes in response to an environmental effect $x_2$ | $2E(x_2)\,\mathrm{Cov}(I, S_2)$ | In SQuID $E(x_2) = 0$ and hence the covariance does not contribute to total phenotypic variance (d) |
| $I$ and $S_{12}$ | Covariance between random intercepts and individual‐specific response interaction to two environmental effects ($x_1$, $x_2$) (random slopes) | $2E(x_1 x_2)\,\mathrm{Cov}(I, S_{12})$ | In SQuID $E(x_1) = E(x_2) = 0$ and the expected mean of the product is $E(x_1 x_2) = 0$, and hence the covariance does not contribute to total phenotypic variance (d) |
| $S_1$ and $S_2$ | Covariance between random slopes in response to an environmental effect $x_1$ and random slopes in response to an environmental effect $x_2$ | $2E(x_1)E(x_2)\,\mathrm{Cov}(S_1, S_2)$ | In SQuID $E(x_1) = E(x_2) = 0$ and hence the covariance does not contribute to total phenotypic variance (d) |
| $S_1$ and $S_{12}$ | Covariance between random slopes in response to an environmental effect $x_1$ and individual‐specific response interaction to two environmental effects ($x_1$, $x_2$) | $2E(x_1^2 x_2)\,\mathrm{Cov}(S_1, S_{12})$ | In SQuID $E(x_1) = E(x_2) = 0$ and the expected mean of the product is $E(x_1 x_2) = 0$, and hence the covariance does not contribute to total phenotypic variance (d) |
| $S_2$ and $S_{12}$ | Covariance between random slopes in response to an environmental effect $x_2$ and individual‐specific response interaction to two environmental effects ($x_1$, $x_2$) | $2E(x_1 x_2^2)\,\mathrm{Cov}(S_2, S_{12})$ | In SQuID $E(x_1) = E(x_2) = 0$ and the expected mean of the product is $E(x_1 x_2) = 0$, and hence the covariance does not contribute to total phenotypic variance (d) |
| $G$ | Higher‐level grouping variance (clusters, groups, families, etc.) | $V_G = \mathrm{Var}(G)$ | |
| $e$ | Residual (e) | $V_R = \mathrm{Var}(e)$ | |
| $y$ | Total phenotypic variance | $V_P$ | |

  • a Covariance parameters exist but do not contribute to total phenotypic variance.
  • b Variances as they contribute to the total phenotypic variance. Note that we use $V_x$ and $\mathrm{Var}(x)$ as alternative notations for the variances, $\mathrm{COV}_{x,y}$ and $\mathrm{Cov}(x,y)$ as alternative notations for covariances and $E(x)$ for expectations.
  • c We anticipate that the covariance between $x_1$ and $x_2$ can be set by the user while SQuID evolves, which will affect $\mathrm{Var}(x_1 x_2)$ and $E(x_1 x_2)$.
  • d Note the distinction between $2E(x_1)\,\mathrm{Cov}(I, S_1)$ as a potential contributor to the variance and $\mathrm{Cov}(I, S_1)$ as a covariance between intercepts and slopes, and that it can be simulated and estimated. Mean centring of the environmental gradients has the advantage that we can interpret the intercept variance as the variance at an average environmental value and the intercept‐slope covariance as the location of the minimum of the between‐individual variance. With arbitrary scaling of the environmental gradients, the interpretation of the intercept variance will change and $2E(x_1)\,\mathrm{Cov}(I, S_1)$ will have to appear in the summation of variance components.
  • e We use $V_R$ to indicate $\mathrm{Var}(e)$ for two reasons. First, e is conventional notation for the deviation of an observation from the values predicted by a statistical model and $V_R$ is conventional notation for the residual variance. In the SQuID modules, we also introduce $V_E$, the variance in phenotype due to environment. $V_e$ and $V_E$ would mean very different things, so to avoid confusion we adopt $V_R$ to indicate residual variance.
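
The simplifications listed in the Remarks column of Table 1 can be checked with a short derivation. The sketch below assumes, beyond what the table states explicitly, that the individual deviations have zero mean and are independent of the environmental variable; it is offered as an illustration rather than as the authors' own derivation.

```latex
\begin{align*}
\mathrm{Var}(S_1 x_1) &= E(S_1^2 x_1^2) - [E(S_1 x_1)]^2 \\
 &= E(S_1^2)\,E(x_1^2) - [E(S_1)\,E(x_1)]^2 \\
 &= \mathrm{Var}(S_1)\,[\mathrm{Var}(x_1) + E(x_1)^2] \\
 &= \mathrm{Var}(S_1) \quad \text{when } \mathrm{Var}(x_1) = 1,\ E(x_1) = 0, \\[4pt]
\mathrm{Cov}(I, S_1 x_1) &= E(I S_1 x_1) - E(I)\,E(S_1 x_1) \\
 &= E(I S_1)\,E(x_1) = \mathrm{Cov}(I, S_1)\,E(x_1) \\
 &= 0 \quad \text{when } E(x_1) = 0 .
\end{align*}
```

The same steps apply to $S_2$ and, with $x_1 x_2$ in place of $x_1$, to $S_{12}$.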
Equation 1a can also be written as:
$$y_{hi} = \beta_0 + \beta_1 x_{1hi} + I_i + S_{1i}\,x_{1hi} + e_{hi}$$ (eqn 1b)

Here, a single phenotypic value ($y_{hi}$), exhibited by individual i at instance h, is modelled as a function of a (user‐defined) environmental gradient ($x_{1hi}$ being the measure of the environmental variable $x_1$ at instance h for individual i). Each phenotypic expression ($y_{hi}$) may be described by five distinct elements: (i) the population‐mean phenotype in the average environment ($\beta_0$), (ii) the population‐mean slope ($\beta_1$) with respect to the environmental gradient ($x_{1hi}$), (iii) the individual's deviation from the population‐mean phenotype ($I_i$), (iv) the individual's deviation from the population‐mean slope ($S_{1i}$) in response to environmental gradient $x_1$ (see next section) and (v) the instance's deviation from the individual's expected value due to unaccounted effects on the phenotype (residual; $e_{hi}$). Individual deviations from the population‐mean value (intercept, $I_i$) and phenotypic response to the environment (slope, $S_{1i}$) are determined by a (co)variance matrix, a standard tabulation that holds the among‐individual variance in intercepts and slopes (on the diagonals) and their covariances (on the lower off‐diagonals). Values of these individual deviations (for both $I_i$ and $S_{1i}$) are generated from a multivariate normal distribution (MVN) with zero mean and variance/covariance equal to $\Omega_{IS}$ [i.e. MVN(0, $\Omega_{IS}$)]. More details on the meaning of MVN(0, $\Omega_{IS}$) can be found in the step‐by‐step full tutorial module available in the SQuID application. Briefly, the covariance matrix holds all the (co)variance components necessary to generate the information associated with relative deviations of individual phenotypes. For instance, in the case of a single trait y with individual differences in intercepts and slopes, each individual deviation $I_i$ (from intercept $\beta_0$) is generated based on the specified variance $V_I$; similarly, each deviation $S_{1i}$ (from slope $\beta_1$) is generated based on the specified variance $V_S$ and covariance $\mathrm{Cov}_{I,S}$.

More complexity can be added by modelling a second environmental gradient ($x_2$), a second trait z that is defined by its own phenotypic equation or a higher order random effect (see variable G in Table 1) suitable for investigating genetic variance (if G indicates family groups, e.g. Dingemanse et al. 2012), among‐population variance (if G indicates populations, e.g. Westneat et al. 2014) or among‐taxon variation (if G represents species or above, e.g. Hadfield & Nakagawa 2010; Garamszegi, Marko & Herczeg 2013). Doing so necessitates the specification of trait covariance at each level of replication (detailed further in the different modules of SQuID). The current version of SQuID focuses on linear terms and Gaussian distributions of phenotype and environment, but we plan to develop it further.
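
To make the random‐intercept and random‐slope equation concrete, the following base‐R sketch generates phenotypes by hand. It does not call any SQuID function: every object name and parameter value is an assumption chosen for illustration, and the environment is simply an uncorrelated standardized normal variable.

```r
# Base-R sketch of eqn 1 (random intercepts and slopes).
# All object names and parameter values are illustrative assumptions,
# not SQuID functions or SQuID defaults.
set.seed(42)

n_ind  <- 500   # individuals i
n_time <- 50    # instances h per individual

beta0  <- 0     # population-mean phenotype in the average environment
beta1  <- 0.5   # population-mean slope
V_I    <- 0.5   # among-individual variance in intercepts
V_S    <- 0.2   # among-individual variance in slopes
Cov_IS <- 0.1   # intercept-slope covariance
V_R    <- 0.3   # residual variance

# (Co)variance matrix Omega_IS: variances on the diagonal,
# intercept-slope covariance on the off-diagonals
Omega_IS <- matrix(c(V_I,    Cov_IS,
                     Cov_IS, V_S), nrow = 2, byrow = TRUE)

# Draw (I_i, S_1i) ~ MVN(0, Omega_IS): standard normal draws multiplied by
# the Cholesky factor of Omega_IS
z   <- matrix(rnorm(n_ind * 2), ncol = 2)
dev <- z %*% chol(Omega_IS)   # column 1 = I_i, column 2 = S_1i

# One row per individual and time step
world <- expand.grid(time = seq_len(n_time), id = seq_len(n_ind))

# Environment x1, standardized to mean 0 and variance 1
world$x1 <- as.numeric(scale(rnorm(nrow(world))))

# Residuals e_hi ~ N(0, V_R)
e <- rnorm(nrow(world), mean = 0, sd = sqrt(V_R))

# eqn 1a: y_hi = (beta0 + I_i) + (beta1 + S_1i) * x1_hi + e_hi
world$y <- (beta0 + dev[world$id, 1]) +
  (beta1 + dev[world$id, 2]) * world$x1 + e

# Expected total variance when Var(x1) = 1 and E(x1) = 0 (see Table 1)
V_P_expected <- beta1^2 + V_I + V_S + V_R
V_P_observed <- var(world$y)
```

Comparing `V_P_observed` with `V_P_expected` gives a quick check that the variance components add up as summarized in Table 1. A data set of this shape could then be analysed with a random‐slope mixed model, for example `lme4::lmer(y ~ x1 + (x1 | id), data = world)` if the lme4 package is installed.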

The Environment

The SQuID world consists of a (uni‐ or bivariate) environment that is generated for each time step and can exhibit (i) random fluctuations, (ii) temporal autocorrelation, (iii) temporal trends (e.g. phenological trends), (iv) cyclic changes (e.g. seasonal or daily fluctuations) or (v) a combination of these four types of effect, generated for each environmental variable separately (Figs 2a–c and 3b). Environmental variables are mean and variance standardized to ease interpretation of parameters such as the intercept and variance in slopes. When two environmental variables are fitted, for simplicity, the current version of SQuID assumes that they are uncorrelated, though we appreciate that this assumption might often be invalid in real data. Environmental variables are either shared (‘general environmental effects’; Falconer & Mackay 1996) or non‐shared (‘specific environmental effects’; Falconer & Mackay 1996) across simulated individuals (Fig. 2d–f). During data analysis, environmental variables can be treated as measured or unmeasured, which we refer to as ‘known’ or ‘unknown’ environmental effects, respectively. Known environmental effects represent a situation where scientists are aware of the potential causal effect of an environmental gradient and are able to include it in the data analysis; in SQuID, these effects have their own explicitly defined variance component (Table 1). An unknown environmental effect contributes to the generation of phenotypic values but is then not used in analyses, thereby representing a situation where a researcher has less prior knowledge about a particular system or faces logistical constraints that prevent measuring all relevant environmental variables. Such effects typically end up in the residual variance, but they could affect some of the estimated components as well. The consequences of unmeasured environmental effects on the estimation of other fixed effects and the estimation of random effects can thus be explored in SQuID.
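
The four kinds of environmental effect described above can be illustrated with a few lines of base R. This is a minimal sketch under stated assumptions, not SQuID's internal code: the first‐order autoregressive recursion, the variable names and all parameter values are choices made here for illustration.

```r
# Base-R sketch of one environmental variable combining the effect types
# described in the text. Names and values are illustrative assumptions.
set.seed(1)

n_time <- 2000
h      <- seq_len(n_time)   # discrete time steps

# (i) + (ii) random fluctuations with temporal autocorrelation, here a
# first-order autoregressive process with unit marginal variance; the
# correlation between values k steps apart is rho^k
rho <- 0.75
stochastic    <- numeric(n_time)
stochastic[1] <- rnorm(1)
for (k in 2:n_time) {
  stochastic[k] <- rho * stochastic[k - 1] + rnorm(1, sd = sqrt(1 - rho^2))
}

# (iii) linear temporal trend
slope <- 0.001
trend <- slope * h

# (iv) cyclic change: amp * sin(b * h + c_shift) + v_shift
amp     <- 0.5
period  <- 50
b       <- 2 * pi / period   # so that the period is 2 * pi / |b|
c_shift <- 0                 # horizontal shift is -c_shift / b
v_shift <- 0                 # vertical shift
cyclic  <- amp * sin(b * h + c_shift) + v_shift

# (v) combination, then mean and variance standardization
x_raw <- stochastic + trend + cyclic
x1    <- as.numeric(scale(x_raw))

# Empirical lag-1 autocorrelation of the stochastic part
lag1 <- cor(stochastic[-1], stochastic[-n_time])
```

Setting `slope` or `amp` to zero switches the corresponding effect off, which mirrors the modular way in which sources of variation are combined in the text.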

Fig. 2. Effect of three different types of environmental covariates (a, b, c) on phenotypic responses of individuals (d, e, f) and their respective sampled values using two sampling designs (g, h, i vs. j, k, l). Environment (a) is generated from a normal distribution with an autocorrelation of 0·75 between two consecutive values (time steps) and a decrease in the correlation over time determined by the decay function $e^{-\alpha \cdot \Delta h}$, where α is the negative natural logarithm of the correlation (here −ln(0·75)) and Δh is the difference in time between instances. Environment (b) is generated similarly to environment (a) but changes linearly over time (here, Environment$_h$ = 0 + 0·05 × h). Environment (c) is generated similarly to environment (a) with a cyclic change added following the equation $a \times \sin(b \cdot h + c) + v$, where |a| is the amplitude (here 0·05), 2π/|b| is the period (here 50), −c/b is the horizontal shift (here 0) and v is the vertical shift (here 0). (d, e, f) Raw individual phenotypic responses to environments a, b and c, respectively, described by the model $y_{hi} = \beta_0 + I_i + \beta x_{hi} + e_{hi}$, where $y_{hi}$ is the phenotypic response for the trait y at the instance h (100 time steps) of individual i (five individuals), $\beta_0$ is the population‐mean value (here 0), $I_i \sim N(0, V_I)$ is the intrinsic value of individual i for trait y (here $V_I$ = 0·95), β is the mean slope of regression of trait y as a function of environment $x_{hi}$ (here 1), and $e_{hi} \sim N(0, V_R)$ is the residual for trait y at instance h and for individual i (here $V_R$ = 0·05; see Table 1). (g, h, i) Sampled individual phenotypic values (five records per individual) from the raw individual phenotypic values where the among‐individual variance in time of sampling is 0 (i.e. individuals are independently sampled with a uniform distribution within the simulated time). (j, k, l) are similar to (g, h, i) except that the among‐individual variance in time of sampling is 0·8 (i.e. individuals are independently sampled with a uniform distribution within a random period of time corresponding to 20% of the entire simulation time).

The Sampling Design

Within the SQuID world, users define a sampling design that is applied in order to ‘collect data’ (Figs 2g–l and 3d) and make inferences about the hypothetical world (Figs 2d–f and 3c), just as researchers collect samples to understand the real world. Time steps are fine enough to be regarded as continuous relative to the frequency at which the user samples the generated world. The decoupling of the creation of the virtual world from the sampling of this world is one of the core features of the SQuID environment. This decoupling allows users to work with the data‐generating rules (i.e. the true parameters) that govern the world, with the sampled data, or with all of the created data that could potentially be sampled. The SQuID environment allows for the simultaneous creation of multiple realizations of the world (‘replicates’) from the same parameter setting (Fig. 3a), which facilitates simulation studies (detailed below). In addition, various sampling designs can be applied to the same world or replicate (Fig. 3d), such as no (Fig. 2g–i) vs. substantial (Fig. 2j–l) among‐individual variance in timing of sampling, tailored to biological questions and practical constraints. Users can thus save different operational data sets from the same replicate. Note that we reserve the term ‘replicates’ for independent simulations, while the term ‘repeated measures’ is used to refer to assayed expressions of the phenotype by a focal simulated individual within a replicate.

Figure 3. General flow chart of how to use SQuID. The figure shows stylized screenshots of the interface to illustrate the different steps that researchers would follow to generate simulated data sets: Defining (a) the simulation design, (b) the temporal patterning in the environment, (c) the phenotypic equation and (d) the sampling scheme, after which (e) the generated data set is downloaded for post‐processing. SQuID, Statistical Quantification of Individual Differences.

A key component of the SQuID simulation environment is that it allows the traits of individuals to be sampled (repeatedly) and provides considerable flexibility as to how this sampling is done. One can, for example, determine how many individuals are sampled, how often individuals are sampled on average, and whether the number of repeated measures taken per individual is identical across individuals or varies following a Poisson process with a constant expectation. One can also vary the amount of among‐individual variance in the timing of sampling. At one extreme, users can generate scenarios where the repeated samples from the same individual are highly clustered (e.g. Fig. 2j–l), such that some individuals are sampled repeatedly when they are ‘young’ (or early in the season), whereas other individuals are instead sampled repeatedly when they are ‘older’ (or late in the season). At the other extreme, it is possible to generate little or no among‐individual variance in the timing of sampling (e.g. Fig. 2g–i), such that all individuals, and their traits, are sampled on average (or exactly) at the same time. Importantly, full recovery of all variance components will not always be possible since two components might be completely conflated by the sampling regime (e.g. one observation per individual precludes the separation of between‐ and within‐individual variances). Once the sampling parameters are set, SQuID provides useful visualizations of the true and sampled phenotypes and environments at each point in time (Fig. 2).
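The two extremes of sampling described above can be illustrated with a short Python sketch. It is not the SQuID code: `sample_times` and its arguments are hypothetical names, and it assumes, following the Figure 2 example, that a timing variance of 0·8 restricts each individual to a randomly placed window covering 20% of the simulated time.

```python
import random


def sample_times(n_individuals, n_steps, records_per_individual,
                 timing_variance=0.0, seed=None):
    """Draw sampling times per individual (hypothetical helper).

    Assumption: each individual is sampled uniformly within a random
    window whose length is (1 - timing_variance) of the simulated time.
    """
    rng = random.Random(seed)
    window = round((1.0 - timing_variance) * n_steps)
    # The window must hold all records and cannot exceed the simulation.
    window = min(n_steps, max(records_per_individual, window))
    design = {}
    for individual in range(n_individuals):
        start = rng.randint(0, n_steps - window)  # window position
        times = rng.sample(range(start, start + window),
                           records_per_individual)
        design[individual] = sorted(times)
    return design


# No among-individual variance in timing: samples spread over all 100 steps.
spread = sample_times(5, 100, 5, timing_variance=0.0, seed=3)
# High variance in timing: each individual clustered in its own window.
clustered = sample_times(5, 100, 5, timing_variance=0.8, seed=3)
```

Applying both designs to the same simulated world gives two operational data sets whose estimates can be compared against the known data‐generating parameters.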

How to cook and eat SQuID: applications

Primary Research

A major advantage of the SQuID environment is that it can be used efficiently and independently to conduct simulation studies on issues of general importance and publish stand‐alone papers without empirical data. A large range of questions may be addressed, including those asking how to trade off the number of individuals against the number of repeated measures per individual, and how fitting a model to data that do not correspond to the data‐generating model biases estimates of variance components.

A general workflow to perform simulation studies using SQuID goes as follows. First, the researcher determines the simulation design, which consists of choosing the time frame of the simulation and the number of replicate worlds (Fig. 3a). Secondly, the researcher determines the population characteristics: the number of individuals in each replicate and the number of traits to study (Fig. 3a). Thirdly, the researcher defines how the environment in the SQuID world varies over time (Fig. 3b). Fourthly, the researcher defines the phenotypic equation that will determine the phenotype of each individual at each instance (Fig. 3c). At this step, the user defines the effect of the environment on the phenotype, the amount of among‐individual variation in average phenotype and in the level of phenotypic plasticity, and the correlation between these two reaction norm components. As a final step, the researcher chooses a particular scheme to sample phenotypes from those generated in SQuID (Fig. 3d), ideally simulating potential protocols for use in the real world, and downloads the generated data sets for analysis (Fig. 3e). Note that the SQuID package can also be used without the user interface by running the function squidR(). This function can easily be included in existing R scripts and hence allows more advanced and efficient simulations.
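The workflow above can be mirrored in a stand‐alone sketch. The Python below is illustrative only and does not use or reproduce squidR(): the function names are hypothetical, the environment is assumed to be shared and purely random, and the phenotypic equation is the simple random‐intercept model from the Figure 2 example (population mean 0, slope 1, among‐individual variance 0·95, residual variance 0·05).

```python
import random


def simulate_world(n_individuals=5, n_steps=100, beta0=0.0, beta=1.0,
                   var_individual=0.95, var_residual=0.05, seed=None):
    """Generate the environment and every phenotype in one replicate world."""
    rng = random.Random(seed)
    # Environment: shared across individuals, random fluctuations only.
    environment = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
    world = {}
    for individual in range(n_individuals):
        # Phenotypic equation: y_hi = beta0 + I_i + beta * x_h + e_hi
        intercept_dev = rng.gauss(0.0, var_individual ** 0.5)
        world[individual] = [
            beta0 + intercept_dev + beta * environment[h]
            + rng.gauss(0.0, var_residual ** 0.5)
            for h in range(n_steps)
        ]
    return environment, world


def sample_world(world, records_per_individual=5, seed=None):
    """Sampling scheme: a few uniformly drawn time steps per individual."""
    rng = random.Random(seed)
    rows = []
    for individual, phenotypes in world.items():
        steps = rng.sample(range(len(phenotypes)), records_per_individual)
        for h in sorted(steps):
            rows.append((individual, h, phenotypes[h]))
    return rows


# Simulation design: three replicate worlds from the same parameter setting.
replicates = [simulate_world(seed=r) for r in range(3)]
# One sampled data set per replicate, ready for analysis.
datasets = [sample_world(world, seed=r)
            for r, (_, world) in enumerate(replicates)]
```

Because the full world is kept separately from the sampled rows, several sampling schemes can be applied to the same replicate without regenerating it.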

Simulation‐Based Inferences

The SQuID environment also offers alternative means of interpretation for researchers who already use mixed‐effects models in their statistical practice. The traditional approach to inference follows a linear process that starts from the design of sampling schemes and usually includes a single (or a very few) data collection step(s). During data collection, (random) samples are drawn from the base population, and statistical models are subsequently fitted to these data. Parameters of interest are then subject to biological interpretation. However, practical constraints on data collection at different hierarchical levels limit the performance of the mixed‐effects modelling approach (e.g. Maas & Hox 2005). Furthermore, variance components are hard to estimate precisely and may become biased near their boundaries (e.g. near‐zero variance), in particular if the sample size is small (Gelman & Hill 2007). Knowledge about such biases is essential for the interpretation of estimated parameters but, unfortunately, the sources of such biases are not always obvious in complex models, and simulations are essential in order to learn about them.
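The boundary problem mentioned above can be seen in a small simulation. The Python sketch below is an assumption‐laden illustration rather than a mixed‐model fit: it uses the classical one‐way ANOVA estimator of the among‐individual variance for balanced data, truncated at zero, and all function names are hypothetical.

```python
import random
import statistics


def among_individual_variance(groups):
    """ANOVA estimate of among-individual variance, truncated at zero.

    groups: one list of repeated measures per individual (balanced).
    """
    n, k = len(groups), len(groups[0])
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)
    ms_among = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g) / (n * (k - 1))
    return max(0.0, (ms_among - ms_within) / k)


def mean_estimate(true_var_i, n_individuals=10, n_repeats=3,
                  var_residual=1.0, n_replicates=500, seed=1):
    """Average the truncated estimate over many simulated replicates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replicates):
        groups = []
        for _ in range(n_individuals):
            ind = rng.gauss(0.0, true_var_i ** 0.5)
            groups.append([ind + rng.gauss(0.0, var_residual ** 0.5)
                           for _ in range(n_repeats)])
        estimates.append(among_individual_variance(groups))
    return statistics.fmean(estimates)


# With a true among-individual variance of zero, the estimate cannot go
# below the boundary, so its average across replicates is positive.
print(mean_estimate(0.0))
```

Comparing `mean_estimate(0.0)` with runs at larger true variances or larger sample sizes shows how the upward bias shrinks away from the boundary.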

Simulation‐based procedures can help avoid misinterpretation and can be used either a priori or a posteriori to data collection. In the a priori phase, the researcher can use the SQuID R package to flexibly explore the consequences of alternative sampling scenarios and/or consider different phenotypic equations in a set of simulation studies. A benefit of performing simulations before any data are collected is that simulations are not constrained by practical limitations experienced in either the field or laboratory: it is thus possible to create the best‐case scenario for sampling. Experience obtained during this exploration stage can be incorporated in an actual study design, and the target sample sizes for the empirical part of the study can be appropriately determined. The simulation‐conditioned sampling scheme can then be used during the collection of real data, which can be analysed with the pre‐defined mixed model (ideally the same as the one used in the simulation study).
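An a priori comparison of sampling scenarios might look like the following Python sketch. It is not based on the SQuID R package: the names are hypothetical, a balanced random‐intercept model without environmental effects is assumed, and the one‐way ANOVA estimator stands in for a fitted mixed model. Each design uses the same total of 100 records.

```python
import random
import statistics


def anova_components(groups):
    """Return (among-individual, within-individual) variance estimates."""
    n, k = len(groups), len(groups[0])
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)
    ms_among = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g) / (n * (k - 1))
    return max(0.0, (ms_among - ms_within) / k), ms_within


def repeatability_spread(n_individuals, n_repeats, var_i=0.5, var_r=0.5,
                         n_replicates=200, seed=1):
    """Mean and SD of repeatability estimates across replicate worlds."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replicates):
        groups = []
        for _ in range(n_individuals):
            ind = rng.gauss(0.0, var_i ** 0.5)
            groups.append([ind + rng.gauss(0.0, var_r ** 0.5)
                           for _ in range(n_repeats)])
        v_i, v_r = anova_components(groups)
        estimates.append(v_i / (v_i + v_r))
    return statistics.fmean(estimates), statistics.stdev(estimates)


# Same total effort (100 records) split in three different ways.
for n_individuals, n_repeats in [(50, 2), (20, 5), (10, 10)]:
    mean_r, sd_r = repeatability_spread(n_individuals, n_repeats)
    print(n_individuals, n_repeats, round(mean_r, 3), round(sd_r, 3))
```

The spread of estimates across replicates indicates the precision each design would deliver, which is the information needed to set target sample sizes before data collection.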

The SQuID environment can also be exploited in the a posteriori phase, wherein an analysis of collected data can feed back into previous simulations. This approach enables the researcher to re‐investigate the behaviour of the model by incorporating additional complexities not previously considered. By performing sets of parametric bootstrapping and sensitivity analyses, the precision and accuracy of the obtained parameter estimates will be determined relative to the distribution of parameter estimates from the model that is fitted to simulated data (and not relative to the error of parameter estimates from the same model). As a consequence, such a simulation‐supported inference can lead to more objective biological conclusions that appropriately take into account the constraints of sampling and consider potentially confounding factors.
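A parametric bootstrap of the kind mentioned above can be sketched as follows. This Python example is a simplified illustration under stated assumptions: a balanced random‐intercept model, the one‐way ANOVA estimator in place of a fitted mixed model, and hypothetical function names. Data are re‐simulated from the fitted parameters, and the estimate is judged against the resulting distribution.

```python
import random
import statistics


def estimate_components(groups):
    """Return (among-individual, within-individual) variance estimates."""
    n, k = len(groups), len(groups[0])
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)
    ms_among = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g) / (n * (k - 1))
    return max(0.0, (ms_among - ms_within) / k), ms_within


def parametric_bootstrap(groups, n_boot=500, seed=1):
    """Point estimate and 95% percentile interval for repeatability."""
    rng = random.Random(seed)
    n, k = len(groups), len(groups[0])
    v_i, v_r = estimate_components(groups)
    grand = statistics.fmean([v for g in groups for v in g])
    boot = []
    for _ in range(n_boot):
        # Simulate a new data set of the same design from the fitted values.
        simulated = []
        for _ in range(n):
            ind = rng.gauss(grand, v_i ** 0.5)
            simulated.append([ind + rng.gauss(0.0, v_r ** 0.5)
                              for _ in range(k)])
        b_i, b_r = estimate_components(simulated)
        boot.append(b_i / (b_i + b_r))
    boot.sort()
    lower = boot[int(0.025 * n_boot)]
    upper = boot[int(0.975 * n_boot) - 1]
    return v_i / (v_i + v_r), (lower, upper)
```

Here `groups` would hold the collected data as one list of repeated measures per individual; repeating the call with altered simulation settings gives a simple sensitivity analysis.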

Education

The SQuID application, which can be launched by running the function squidApp(), offers educational material for newcomers to the analysis of hierarchically structured data and for those who want to teach mixed‐effects model analysis to newcomers. Consequently, the application can be effectively implemented in both self‐training and teaching programmes on mixed‐effects modelling. The tutorials in the application are organized into modules and are loosely ordered by increasing complexity, from simple two‐level analysis towards models that rely on additional variance components, different environmental effects, hierarchical structures and interactions (Fig. 4). Going through the modules step‐by‐step permits the passive learning of the fundamentals of model building. The interface also allows the user to interactively investigate the consequences of alternative input options for different components of the generated and sampled world, facilitating active learning. The generated data can be immediately visualized in the browser, but can also be saved externally and imported into statistical packages by those who wish to pursue data explorations on their own.

Figure 4. Flow chart of modules. Modules boxed in black are part of contemporary SQuID; dashed modules are anticipated to be added as SQuID evolves. SQuID, Statistical Quantification of Individual Differences.

The exploitation of the educational material encourages researchers to appreciate the multilevel structure of phenotypes. By doing so, users will be able to formulate biological hypotheses in the form of phenotypic equations (Westneat, Wright & Dingemanse 2015) and translate these into statistical models to be analysed with actual data. Importantly, the use of the simulation interface can help understand how study design imposes constraints on the potential inferences that can be made about the world. By adopting simulation‐based statistical interpretations, users will become familiar with the concepts of statistical power, accuracy and precision of estimated parameters in mixed‐effects model analysis. We consider the interactive aspect of SQuID as embodying a substantially novel component in comparison with printed education materials, such as textbooks, on similar topics.

The evolution of SQuID

We expect SQuID to evolve. The SQuID environment has been created to allow learning and exploring various interesting aspects of mixed‐effects models and data sampling. Among others, the current version of SQuID has three notable limitations, which will be resolved in the near future. First, trait distributions are limited to Gaussian. Soon SQuID will allow simulations with non‐Gaussian trait distributions, specifically Poisson (with log and square‐root link functions) and Binomial (with logit and probit links) (Fig. 4). Secondly, environmental variables (i.e. x1 and x2) are assumed to be uncorrelated, mainly for convenience and simplicity. However, environmental variables are often, to some degree, correlated. Environmental variances are also constrained to unit variance, which we plan to retain through all modules (because otherwise the slope of the response to the environmental gradient and the variance in the environment conflate to influence the effect of the environment on phenotypes). Thirdly, when we view unsampled individuals as missing data, our basic sampling scheme is ‘missing completely at random’ (MCAR), as labelled in missing data theory (Little & Rubin 2002; Nakagawa & Freckleton 2008). This is because SQuID sampling does not depend on environmental variables (x1 and x2). However, it is entirely possible that real sampling would be affected by (measured) environmental variables (known as missing at random, MAR) or by unmeasured environmental variables and/or traits of interest themselves (known as missing not at random, MNAR). A personality trait, boldness/shyness, is a good example of MNAR because shy individuals might be less likely to be sampled or more likely to be missing in the data set (Biro & Dingemanse 2009).
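The difference between MCAR and MNAR sampling can be made concrete with a short Python sketch. It does not describe current or planned SQuID functionality: the function name, the logistic form of the capture probability and the `bias` parameter are all assumptions chosen for illustration, with larger trait values standing for bolder individuals.

```python
import math
import random
import statistics


def sample_population(boldness, mechanism, rng, rate=0.5, bias=2.0):
    """Return the trait values of the individuals that end up sampled."""
    sampled = []
    for value in boldness:
        if mechanism == "MCAR":
            # Every individual has the same chance of being sampled.
            probability = rate
        elif mechanism == "MNAR":
            # Assumed logistic rule: shy (low-value) individuals are
            # less likely to be sampled than bold ones.
            probability = 1.0 / (1.0 + math.exp(-bias * value))
        else:
            raise ValueError("mechanism must be 'MCAR' or 'MNAR'")
        if rng.random() < probability:
            sampled.append(value)
    return sampled


rng = random.Random(42)
population = [rng.gauss(0.0, 1.0) for _ in range(5000)]
mcar = sample_population(population, "MCAR", rng)
mnar = sample_population(population, "MNAR", rng)
# Under MCAR the sample mean stays close to the population mean;
# under MNAR it is shifted towards bold individuals.
print(statistics.fmean(mcar), statistics.fmean(mnar))
```

A MAR scheme could be sketched in the same way by letting the sampling probability depend on a measured environmental variable instead of the trait itself.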

Additionally, the functionality of SQuID will further evolve as we extend its reach to other difficult problems, including the generation of scenarios where within‐individual residual variances differ across environments or individuals (Westneat, Schofield & Wright 2013; Cleasby, Nakagawa & Schielzeth 2015; Westneat, Wright & Dingemanse 2015), the inclusion of more complex hierarchical levels modelled by considering correlation matrices (such as relatedness matrices defined by pedigrees; Kruuk 2004; Wilson et al. 2010), the inclusion of effects of phenotypes expressed by other individuals (i.e. social environments such as parental and indirect genetic effects; McAdam, Garant & Wilson 2014), the incorporation of nonlinear responses to environments, the generation of simulations of selective mortality (van de Pol & Verhulst 2006) or consideration of individual variation in the timing of birth and death. Finally, it is possible to extend the current individual‐based focus towards scenarios where other hierarchical levels (such as populations or species) are of interest. In such cases, additional sampling designs might be considered, for example when applied to situations where the number of repeats is determined by a biological predictor (Garamszegi & Møller 2011) or when correlation structures are determined by phylogeny or gene flow (Stone, Nee & Felsenstein 2011).

Acknowledgements

SQuID was conceived at the Symposium ‘Personality: Causes and Consequences of Consistent Behavioural Variation’ funded by the Volkswagen Foundation (2013), and born and potty‐trained during two follow‐up workshops at the Max Planck Institute for Ornithology (Seewiesen) funded by the Volkswagen Foundation (2014) and the International Max Planck Research School for Organismal Biology (2015). Y.G.A.A. and N.J.D. were supported by the Max Planck Society, L.Z.G. by the Plan Nacional Program (CGL2015‐70639‐P) and the National Research, Development and Innovation Office of Hungary (K‐115970), S.N. by an Australian Future Fellowship, D.R. by a Discovery Grant of the Natural Sciences and Engineering Research Council of Canada, H.S. by the German Research Foundation (SCHI 1188/1‐1) and D.F.W. by the National Science Foundation of the U.S.A. The authors gratefully acknowledge feedback on an earlier version of the manuscript from Jarrod Hadfield, Sandra Hamel, Julien Martin and an anonymous reviewer.

Data accessibility

This paper does not include any data.
