Volume 10, Issue 12
APPLICATION

gllvm: Fast analysis of multivariate abundance data with generalized linear latent variable models in R

Jenni Niku

Corresponding Author

E-mail address: jenni.m.e.niku@jyu.fi

Department of Mathematics and Statistics, University of Jyväskylä, Jyväskylä, Finland

Francis K. C. Hui

Research School of Finance, Actuarial Studies & Statistics, Australian National University, Canberra, Australia

Sara Taskinen

Department of Mathematics and Statistics, University of Jyväskylä, Jyväskylä, Finland

David I. Warton

School of Mathematics and Statistics and Evolution & Ecology Research Centre, UNSW Sydney, Sydney, Australia

First published: 21 September 2019

Abstract

  1. There has been rapid development in tools for multivariate analysis based on fully specified statistical models or ‘joint models’. One approach attracting a lot of attention is generalized linear latent variable models (GLLVMs). However, software for fitting these models is typically slow and not practical for large datasets.
  2. The R package gllvm offers relatively fast methods to fit GLLVMs via maximum likelihood, along with tools for model checking, visualization and inference.
  3. The main advantage of the package over other implementations is speed: in the examples considered here it is about two orders of magnitude faster, and it is capable of handling thousands of response variables. These advances come from using variational approximations to simplify the likelihood expression to be maximized, automatic differentiation software for model‐fitting (via the TMB package) and careful choice of initial values for parameters.
  4. Examples are used to illustrate the main features and functionality of the package, such as constrained or unconstrained ordination, including functional traits in ‘fourth corner’ models, and (if the number of environmental coefficients is not large) making inferences about environmental associations.

1 INTRODUCTION

Multivariate abundance data, consisting of observations of multiple interacting species (or other taxonomic groups) from a set of samples, are often collected in ecological studies to characterize a community or assemblage of organisms. The term ‘abundance’ is taken here to mean counts, presence–absence records, biomass data or any other measure of the extent to which a species may be present at a site. Common uses of such data include assessing whether a set of sites is similar in terms of species composition (Bjork, Hui, O'Hara, & Montoya, 2018), identifying between‐species interactions and visualizing correlation patterns across species (Royan et al., 2016), testing hypotheses about environmental effects (Lammel et al., 2018) and predicting abundances (Buisson, Thuiller, Lek, Lim, & Grenouillet, 2008).

In recent years, there has been a growing movement towards the specification of statistical models for multivariate analysis in ecology (Ovaskainen, Hottola, & Siitonen, 2010; Ovaskainen et al., 2017; Warton et al., 2015). Of particular interest are methods that use random effects to incorporate between species correlation in models predicting species abundance as a function of environmental variables, often termed joint species distribution models (Pollock et al., 2014). One exciting possibility offered by these methods is the potential to tease apart some of the causes of species co‐occurrence – joint response to known environmental gradients versus other sources, for example, biotic interaction.

A key approach for statistical modelling of multivariate abundance data is the generalized linear latent variable model (GLLVM, Skrondal & Rabe‐Hesketh, 2004). A GLLVM extends the basic generalized linear model to multivariate data using a factor analytic approach, that is, incorporating a small number of latent variables for each site accompanied by species‐specific factor loadings to model correlations between responses. These latent variables have a natural interpretation as ordination axes, but with additional capacity, for example, predicting new values, controlling for known environmental variables and using standard model selection tools to choose the number of ordination axes (Hui, Taskinen, Pledger, Foster, & Warton, 2015). One of the main advantages of GLLVMs is that they can handle situations where there are many species, because the number of parameters in the covariance model scales linearly with the number of responses (Warton et al., 2015). This addresses a key technical challenge: often there are more species being sampled than sites, for example, microbial data often have thousands of taxa (Kumar et al., 2017; Niku, Warton, Hui, & Taskinen, 2017).

Software for fitting GLLVMs in ecology is currently quite slow computationally and not practical for large datasets. In particular, packages have been developed in the freely available software R, for example, boral (Hui, 2016) and HMSC (Tikhonov, Opedal, Abrego, Lehikoinen, & Ovaskainen, 2019), but these use Bayesian MCMC for estimation, which is relatively slow and not practical for large microbial datasets. More recent technical advances provide the opportunity to reduce computation times on some problems from hours to minutes or minutes to seconds, using variational (Hui, Warton, Ormerod, Haapaniemi, & Taskinen, 2017) or Laplace (Niku et al., 2017) approximations to likelihoods, especially via automatic differentiation software such as Template Model Builder (Kristensen, Nielsen, Berg, Skaug, & Bell, 2016).

This paper presents the R package gllvm (Niku et al., 2019a), which has been developed for rapid fitting of GLLVMs to multivariate abundance data. The package offers a framework for model‐based ordination, as well as allowing us to study the effect of environmental covariates or environment–trait interactions on responses simultaneously with the analysis of correlation patterns across species. The package also contains tools for statistical inference, model selection and visualization. While other R packages have similar functionality (Hui, 2016; Tikhonov et al., 2019), the key point of distinction is that gllvm fits models much faster than its immediate competitors (e.g. see Table 3) and is capable of modelling larger datasets. Version 1.1.7 of the gllvm package is currently available on the Comprehensive R Archive Network (CRAN).

2 GENERALIZED LINEAR LATENT VARIABLE MODELS

A multivariate abundance dataset can be defined by a matrix of abundances, with n rows (usually sites) and m columns of responses (usually species). Denote the abundance of the jth species at the ith site as $y_{ij}$. A set of k environmental variables, or experimental treatments, may also be recorded at each site and stored in the vector $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{ik})^\top$. A GLLVM regresses the mean abundance $\mu_{ij}$ against environmental variables and a vector of $d \ll m$ latent variables, $\boldsymbol{u}_i = (u_{i1}, \ldots, u_{id})^\top$:
$$g(\mu_{ij}) = \eta_{ij} = \alpha_i + \beta_{0j} + \boldsymbol{x}_i^\top \boldsymbol{\beta}_j + \boldsymbol{u}_i^\top \boldsymbol{\theta}_j, \qquad (1)$$
where $g(\cdot)$ is a known link function, and $\boldsymbol{\beta}_j$ and $\boldsymbol{\theta}_j$ are vectors of species‐specific coefficients related to the covariates and latent variables, respectively. The latent variables $\boldsymbol{u}_i$ can be thought of as unmeasured environmental variables, or as ordination scores, capturing the main axes of covariation of abundance (after controlling for observed predictors $\boldsymbol{x}_i$). We assume that these latent variables are independent across sites and standard normally distributed. The parameters $\beta_{0j}$ are species‐specific intercepts, while $\alpha_i$ are optional site effects which can be chosen as either fixed or random effects ($\alpha_i \sim N(0, \sigma^2)$). The row effects $\alpha_i$ can be included for site total abundance standardization, that is, all other terms in the model can then be interpreted as modelling relative abundance or compositional effects (Hui et al., 2015). To ensure that the above model is identifiable, for $d > 1$, the upper triangular part of the loading matrix $\boldsymbol{\Theta} = [\boldsymbol{\theta}_1 \cdots \boldsymbol{\theta}_m]^\top$ needs to be set to zero and the diagonal elements constrained to be positive to avoid rotational invariance; see Hui et al. (2015) and Niku et al. (2017) for further information.
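
To make the structure of Equation 1 concrete, the following base R sketch simulates counts from a Poisson GLLVM with a log link. All dimensions, parameter values and object names (`B`, `Theta`, `U`, etc.) are illustrative assumptions, and row effects are omitted; it is not code from the gllvm package.

```r
set.seed(1)
n <- 30; m <- 8; k <- 2; d <- 2                  # sites, species, covariates, latent variables (assumed)

X     <- matrix(rnorm(n * k), n, k)              # observed covariates, one row per site
U     <- matrix(rnorm(n * d), n, d)              # latent variables, independent N(0, 1)
beta0 <- rnorm(m, mean = 1, sd = 0.5)            # species-specific intercepts
B     <- matrix(rnorm(m * k, sd = 0.5), m, k)    # row j holds the covariate coefficients of species j
Theta <- matrix(rnorm(m * d, sd = 0.5), m, d)    # row j holds the loadings of species j

# Identifiability constraints: zero upper triangle, positive diagonal
Theta[upper.tri(Theta)] <- 0
diag(Theta) <- abs(diag(Theta))

# Linear predictor (no row effects) and Poisson responses with log link
eta <- matrix(beta0, n, m, byrow = TRUE) + X %*% t(B) + U %*% t(Theta)
Y   <- matrix(rpois(n * m, lambda = exp(eta)), n, m)

dim(Y)
```

Here the two‐column loading matrix has a single constrained entry (row 1, column 2), which is what removes the rotational ambiguity in a two‐dimensional ordination.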

The residual covariance matrix, storing information on species co‐occurrence that is not explained by environmental variables, can be calculated as $\boldsymbol{\Sigma} = \boldsymbol{\Theta}\boldsymbol{\Theta}^\top$. This is the correct form of correlation when the responses are Poisson distributed. In the case of negative binomial distribution with dispersion parameters $\phi_j$, we adjust the diagonal elements by adding the term urn:x-wiley:2041210X:media:mee313303:mee313303-math-0016, which corresponds to the variance explained by the NB distribution. Analogously, for the binomial probit model, the residual covariance is $\boldsymbol{\Theta}\boldsymbol{\Theta}^\top + \boldsymbol{I}_m$ (Ovaskainen, Abrego, Halme, & Dunson, 2016).

If q trait covariates $\boldsymbol{t}_j = (t_{j1}, \ldots, t_{jq})^\top$ are also recorded, we can use them to help explain interspecific variation in environmental response. This leads to an extension of the so‐called ‘fourth corner model’ (Brown et al., 2014; Jamil & ter Braak, 2013) where multivariate abundance is regressed against a function of traits and environment, and the environment–trait interactions represent the fourth corner association between traits and environment. The associated fourth corner GLLVM then has mean model:
$$g(\mu_{ij}) = \alpha_i + \beta_{0j} + \boldsymbol{x}_i^\top \boldsymbol{\beta}_e + \boldsymbol{x}_i^\top \boldsymbol{B}_I \boldsymbol{t}_j + \boldsymbol{u}_i^\top \boldsymbol{\theta}_j, \qquad (2)$$
where $\boldsymbol{\beta}_e$ is a vector of main effects for environmental covariates, and $\boldsymbol{B}_I$ is the $k \times q$ matrix of fourth corner coefficients. A main effect for traits was not included, because main effects on abundance across species are absorbed by the intercept term $\beta_{0j}$. This model assumes that all interspecific variation in response to covariates is mediated by traits, which reduces the number of parameters related to covariates from $mk$ in Equation 1 to $k(q + 1)$ in Equation 2.

In both GLLVM formulations mentioned above, a key feature is that the number of parameters characterizing the residual correlation $\boldsymbol{u}_i^\top \boldsymbol{\theta}_j$ grows linearly with the number of responses m. This contrasts with the quadratic rate of growth when an unstructured residual covariance matrix is assumed across responses (Pollock et al., 2014). Thus the term $\boldsymbol{u}_i^\top \boldsymbol{\theta}_j$ is able to model residual correlation across response variables even when the number of species is relatively large.
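
The linear versus quadratic growth can be checked with a short base R calculation. The sketch below assumes m = 41 species and d = 2 latent variables (values chosen to match the ant example used later), builds a loading matrix with arbitrary simulated values, and forms the implied residual covariance as the loading matrix times its transpose (the Poisson case); the probit variant adds an identity matrix.

```r
m <- 41; d <- 2                                   # assumed numbers of species and latent variables

set.seed(2)
Theta <- matrix(rnorm(m * d), m, d)               # arbitrary illustrative loadings
Theta[upper.tri(Theta)] <- 0                      # identifiability constraint

Sigma        <- Theta %*% t(Theta)                # residual covariance implied by the loadings
Sigma_probit <- Sigma + diag(m)                   # probit case: add the identity matrix
R            <- cov2cor(Sigma)                    # corresponding residual correlations

# Free parameters: constrained loadings versus an unstructured covariance matrix
n_par_gllvm        <- m * d - d * (d - 1) / 2
n_par_unstructured <- m * (m + 1) / 2
c(gllvm = n_par_gllvm, unstructured = n_par_unstructured)   # 81 versus 861
```

With these values the factor‐analytic structure needs 81 loading parameters, against 861 for an unstructured covariance matrix.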

3 ESTIMATION

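A difficulty in fitting the GLLVM is that the $\boldsymbol{u}_i$'s are unobserved and we must integrate over their possible values. Specifically, the log‐likelihood function we wish to maximize has the form
$$\ell(\boldsymbol{\Psi}) = \sum_{i=1}^{n} \log \left( \int_{\mathbb{R}^d} \prod_{j=1}^{m} f(y_{ij} \mid \boldsymbol{u}_i, \boldsymbol{\Psi}) \, h(\boldsymbol{u}_i) \, d\boldsymbol{u}_i \right), \qquad (3)$$
where $f(y_{ij} \mid \boldsymbol{u}_i, \boldsymbol{\Psi})$ is the conditional distribution of the response, $h(\boldsymbol{u}_i)$ is the standard multivariate normal density of the latent variables and $\boldsymbol{\Psi}$ includes all model parameters. In this expression, we have assumed that abundances are independent across sites and that any correlation across responses is captured by the latent variables $\boldsymbol{u}_i$. Thus conditional on $\boldsymbol{u}_i$, the $y_{ij}$ are independent of each other within sites.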
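
To illustrate why the integral in Equation 3 is the computational bottleneck, the base R sketch below evaluates the marginal log‐likelihood contribution of a single site for a Poisson model with one latent variable, once by a Riemann sum on a fine grid and once by plain Monte Carlo. The parameter values and names are illustrative assumptions; gllvm itself uses variational or Laplace approximations instead of either approach.

```r
set.seed(3)
m     <- 5
beta0 <- rnorm(m, mean = 1, sd = 0.3)             # assumed intercepts
theta <- runif(m, 0.2, 0.8)                       # assumed loadings (one latent variable)
y_i   <- rpois(m, exp(beta0 + theta * rnorm(1)))  # simulated counts for one site

# Conditional likelihood of the site given the latent variable u (vectorized over u)
cond_lik <- function(u) {
  sapply(u, function(ui) prod(dpois(y_i, exp(beta0 + theta * ui))))
}

# (a) Riemann sum over a fine grid covering essentially all N(0, 1) mass
u_grid      <- seq(-8, 8, length.out = 2001)
du          <- u_grid[2] - u_grid[1]
loglik_grid <- log(sum(cond_lik(u_grid) * dnorm(u_grid)) * du)

# (b) Monte Carlo: average the conditional likelihood over draws of u
u_draws   <- rnorm(1e5)
loglik_mc <- log(mean(cond_lik(u_draws)))

c(grid = loglik_grid, monte_carlo = loglik_mc)
```

Both approaches scale poorly: a grid grows exponentially with the number of latent variables, and Monte Carlo needs many draws per site at every optimizer step.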

In the literature, several solutions have been proposed to the problem of integration (3), most notably adaptive quadrature (Rabe‐Hesketh, Skrondal, & Pickles, 2002), the Monte Carlo applications of the expectation maximization (EM) algorithm (Hui et al., 2015) and Bayesian MCMC (Hui, 2016; Tikhonov et al., 2019). For large datasets and multiple latent variables, these methods are, however, time‐consuming.

The gllvm package overcomes these computational problems using three key innovations:
  • Maximizing the log‐likelihood using (almost completely) closed form approximation. We provide two ways to do this – using Gaussian variational approximations (VA, Hui et al., 2017) for overdispersed counts, binary and ordinal responses, or using Laplace approximations (LA, Niku et al., 2017) for other exponential family distributions when a fully closed form variational approximation cannot be obtained, for example, biomass data can be modelled by the Tweedie distribution.
  • Parameter estimation makes use of automatic differentiation software in C++ to accelerate computation times, via the interface provided by the R package TMB (Kristensen et al., 2016).
  • Careful choice of starting values. In particular, we use a factor analysis on Dunn‐Smyth residuals (Niku et al., 2019b) to obtain starting values close to the anticipated solution, optionally, with jittering to check the sensitivity of the approach.

The end result is a package that provides more stable solutions, and is orders of magnitude faster than current competitors.

4 USING THE R PACKAGE GLLVM

The R package gllvm provides a flexible implementation for fitting GLLVMs to multivariate data. The main function of the package is gllvm(); its most important arguments are described below.

Data input can be specified using the ‘wide format’ matrices via y, X and TR arguments, or using the long format via data argument, and formula is used for model specification (which defaults to including linear terms for all variables from X and TR, and all interactions between variables in X and variables in TR). The number of latent variables can be defined using the argument num.lv, with zero latent variables corresponding to a simple multi‐response GLM that does not account for correlation across responses (Wang, Naumann, Wright, & Warton, 2012). The response distribution can be chosen using the argument family, and models can be fitted using either the VA (method = "VA", default) or with the LA (method = "LA") method. The currently available distributions, link functions and methods for different response types are listed in Table 1.

Table 1. Overview of available distributions with the mean, $E(y_{ij})$, and mean–variance, $V(\mu_{ij})$, functions, estimation methods and link functions for various response types in gllvm
Response | Distribution | Method | Link | Description
Counts | Poisson | VA/LA | log | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \mu_{ij}$
 | NB | VA/LA | log | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \mu_{ij} + \phi_j \mu_{ij}^2$, where $\phi_j > 0$ is a dispersion parameter
 | ZIP | LA | log | $P(y_{ij} = 0) = p + (1 - p)\exp(-\mu_{ij})$, $E(y_{ij}) = (1 - p)\mu_{ij}$, $V(\mu_{ij}) = \mu_{ij}(1 - p)(1 + \mu_{ij} p)$
Binary | Bernoulli | VA/LA | probit | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \mu_{ij}(1 - \mu_{ij})$
 | | LA | logit |
Biomass | Tweedie | LA | log | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \phi_j \mu_{ij}^{\nu}$, where $1 < \nu < 2$ is a power parameter and $\phi_j > 0$ is a dispersion parameter
Ordinal | Multinomial | VA | probit | Cumulative probit model
Normal | Gaussian | VA/LA | identity | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \phi_j^2$

Other important arguments in the gllvm call are row.eff for defining the type of row effects (none, fixed or random), offset for potential inclusion of offsets, Power for defining the power parameter of the Tweedie distribution (Niku et al., 2017) and starting.val for judicious choice of starting values for the latent variables (Niku et al., 2019b). For an overview of the available functions in gllvm, see Table 2.
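
As a hedged illustration of how these arguments fit together, the sketch below fits a Poisson GLLVM to a small simulated dataset. The data, dimensions and object names are assumptions made for this example only; the arguments used (`y`, `X`, `formula`, `family`, `num.lv`, `method`, `row.eff`, `seed`) are those of gllvm(), and the package is assumed to be installed from CRAN.

```r
library(gllvm)

# Simulated counts: 40 sites, 10 species, one covariate and one unobserved gradient
set.seed(42)
n <- 40; m <- 10
x1  <- rnorm(n)
u   <- rnorm(n)
eta <- outer(rep(1, n), rnorm(m, 1, 0.3)) +       # species intercepts
       outer(x1, rnorm(m, 0, 0.5)) +              # covariate effects
       outer(u, runif(m, 0.2, 0.8))               # latent variable effects
y <- matrix(rpois(n * m, exp(eta)), n, m,
            dimnames = list(NULL, paste0("sp", 1:m)))
X <- data.frame(x1 = x1)

# Wide-format input, one latent variable, variational approximation, no row effects
fit <- gllvm(y = y, X = X, formula = ~ x1,
             family = poisson(), num.lv = 1,
             method = "VA", row.eff = FALSE, seed = 1)
fit                                               # log-likelihood and information criteria
```

Switching `method` to `"LA"` or `family` to another entry of Table 1 changes the approximation or response distribution without altering the rest of the call.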

Table 2. Overview of functions available in gllvm
Function | Description
gllvm() | Fits a generalized linear latent variable model
anova.gllvm() | Analysis of deviance for ‘gllvm’ objects
coefplot.gllvm() | Plots covariate coefficients and confidence intervals
logLik.gllvm() | Log‐likelihood of an object of class ‘gllvm’
residuals.gllvm() | Dunn‐Smyth residuals for a ‘gllvm’ model
summary.gllvm() | Summarizes ‘gllvm’ model fits
ordiplot.gllvm() | Plots latent variables from a ‘gllvm’ model
plot.gllvm() | Plots diagnostics for a ‘gllvm’ object
confint.gllvm() | Confidence intervals for ‘gllvm’ model parameters
predict.gllvm() | Obtains predictions from a ‘gllvm’ model
getResidualCov.gllvm() | Calculates the residual covariance matrix for a ‘gllvm’ fit
getResidualCor.gllvm() | Calculates residual correlations for a ‘gllvm’ fit
getPredictErr.gllvm() | Prediction errors for predicted latent variables
simulate.gllvm() | Generates new data based on a ‘gllvm’ fit

Below, we demonstrate the main features of the gllvm package by example. In the examples, we consider the antTraits data, which are available in the R package mvabund (Wang et al., 2012) and consist of counts of 41 ant species measured at 30 sites across south‐east Australia, along with records of five environmental variables and five trait variables for each species. All examples assume that both packages have been loaded and that the data have been extracted into a response matrix, an environmental matrix and a trait data frame.
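
A minimal setup along these lines is sketched below. It assumes that gllvm and mvabund are installed from CRAN; the object names `y`, `X` and `TR` and the choice to standardize the environmental variables are conventions adopted here for illustration.

```r
library(mvabund)   # provides the antTraits data
library(gllvm)

data(antTraits)
y  <- as.matrix(antTraits$abund)        # 30 sites x 41 species, counts
X  <- scale(as.matrix(antTraits$env))   # five environmental variables, standardized
TR <- antTraits$traits                  # five traits per species (data frame)

dim(y); colnames(X); colnames(TR)
```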

5 MODEL‐BASED ORDINATION

GLLVMs can be used as a model‐based approach to unconstrained ordination by including (e.g.) two latent variables in the model but no predictors (Hui et al., 2015; Walker & Jackson, 2011). The corresponding ordination plot then provides a graphical representation of which sites are similar in terms of their species composition. Such a model can be fitted to the antTraits data using the function gllvm(). We will consider two count distributions for the data: the Poisson and negative binomial (NB).
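
The two ordination models could be fitted along the following lines. The object name `fit_ord` follows Table 3, while `fitp` and the fixed `seed` are assumptions made here for reproducibility.

```r
library(mvabund)
library(gllvm)

data(antTraits)
y <- as.matrix(antTraits$abund)

# Unconstrained ordination: two latent variables, no predictors
fitp    <- gllvm(y, family = poisson(), num.lv = 2, seed = 1)
fit_ord <- gllvm(y, family = "negative.binomial", num.lv = 2, seed = 1)

fitp      # printout includes the log-likelihood and information criteria
fit_ord

# Direct comparison of the two fits on AIC (smaller is better)
c(Poisson = AIC(fitp), NB = AIC(fit_ord))
```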

The default printout includes information criteria, which all suggest that the NB distribution is a better choice than the Poisson distribution for modelling the response. Residual plots for diagnosing model fit (Figure 1) can be obtained using the plot() function. For each model, two plots are shown: Dunn‐Smyth residuals, which are randomized quantile‐based residuals designed for discrete data (Dunn & Smyth, 1996), plotted against linear predictors, and a normal quantile–quantile plot with a simulated point‐wise 95% confidence interval envelope. The residual diagnostics for the Poisson model show some overdispersion, in particular a telltale fan shape in the plot of residuals against linear predictors. These issues are largely resolved in the NB model. Note that the latent variables in the model provide some capacity to account for overdispersion, so overdispersed counts do not always require us to move beyond the Poisson distribution, although there is clear evidence of such a need in this example.
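
A sketch of how such diagnostics might be produced for one of the models is given below. The selection `which = 1:2` (residuals against linear predictors, and the normal quantile–quantile plot) and the single plotting colour are assumptions about a convenient display, not settings taken from the paper.

```r
library(mvabund)
library(gllvm)

data(antTraits)
y <- as.matrix(antTraits$abund)
fit_ord <- gllvm(y, family = "negative.binomial", num.lv = 2, seed = 1)

# Residuals vs linear predictors and normal Q-Q plot with simulated envelope
plot(fit_ord, which = 1:2, var.colors = 1)

# The Dunn-Smyth residuals themselves can be extracted for custom checks
res <- residuals(fit_ord)
dim(res$residuals)          # one residual per site-by-species observation
```

Because Dunn‐Smyth residuals are randomized, repeated calls give slightly different values; patterns that persist across calls are the ones worth acting on.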

Figure 1. Residual plots for the Poisson GLLVM (top) and the NB‐GLLVM (bottom) applied for model‐based ordination. Specifically, Dunn‐Smyth residuals are plotted against linear predictors (left), while a simulated point‐wise 95% confidence interval envelope is added in the normal quantile–quantile plot (right). The fan shape and unusually large residuals for the Poisson GLLVM suggest data are slightly overdispersed compared to the Poisson distribution. The lack of pattern and smaller residuals for the NB‐GLLVM suggest a better model fit to the data

Once an appropriate model has been established for the data, we can construct an ordination as a scatter plot of the predicted latent variables via the ordiplot() function. The species with the largest factor loadings (largest norms, $\lVert \boldsymbol{\theta}_j \rVert$), and hence most strongly associated with ordination scores, can be added using the logical argument biplot, leading to a biplot for finding indicator species corresponding to specific sites. The ind.spp argument defines the number of species to be plotted.
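
For instance, a biplot with 15 indicator species might be drawn as sketched below; the number of species matches the caption of Figure 2, and the remaining settings are left at their defaults as an assumption. The last lines show which species have the largest loading norms, computed directly from the fitted loadings.

```r
library(mvabund)
library(gllvm)

data(antTraits)
y <- as.matrix(antTraits$abund)
fit_ord <- gllvm(y, family = "negative.binomial", num.lv = 2, seed = 1)

# Ordination of sites with the 15 species having the largest loadings overlaid
ordiplot(fit_ord, biplot = TRUE, ind.spp = 15)

# Loading norms by species, largest first
theta_norm <- sqrt(rowSums(fit_ord$params$theta^2))
head(sort(theta_norm, decreasing = TRUE), 15)
```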

The resulting biplot, based on the GLLVM fitted to the antTraits data, is shown in Figure 2. We can see one large cluster of sites at the top with many indicator species, and a few smaller clusters with only a few indicator species, for example, sites 12–15. In Appendix S3, we apply classical algorithm‐based ordination methods to the ant data and compare the results. While the results between GLLVMs and the algorithm‐based methods are quite similar, GLLVMs offer the advantage of standard tools for diagnosing model fit and performing model selection.

Figure 2. A biplot with 15 indicator species based on the NB‐GLLVM fitted to the ant data. The numbers correspond to the site indices

6 MODEL WITH ENVIRONMENTAL VARIABLES

Environmental variables can be included in the model, whether to study their effects on assemblages or to study patterns of species co‐occurrence after controlling for environmental variables.

A model with three latent variables was chosen based on the AICc value, and residual analysis indicates that a NB distribution offered the most suitable mean–variance relationship for the responses.
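
One way to carry out such a comparison is sketched below for one to three latent variables. The three covariates in the formula are an assumption made for illustration, and the AICc is computed by hand from the log‐likelihood using the standard small‐sample correction, with the sample size taken to be the number of entries in the response matrix (also an assumption).

```r
library(mvabund)
library(gllvm)

data(antTraits)
y <- as.matrix(antTraits$abund)
X <- scale(as.matrix(antTraits$env))

aicc <- sapply(1:3, function(d) {
  fit <- gllvm(y, X, family = "negative.binomial", num.lv = d,
               formula = ~ Bare.ground + Shrub.cover + Volume.lying.CWD,
               sd.errors = FALSE, seed = 1)     # standard errors not needed here
  ll <- logLik(fit)
  p  <- attr(ll, "df")                          # number of estimated parameters
  N  <- length(y)                               # sites x species
  -2 * as.numeric(ll) + 2 * p + 2 * p * (p + 1) / (N - p - 1)
})
names(aicc) <- paste0("num.lv=", 1:3)
aicc                                            # smaller values indicate a better trade-off
```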

The estimated coefficients for predictors and their confidence intervals can be plotted using the coefplot() function, in order to study the nature of effects of environmental variables on species.
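
A sketch of the model fit and coefficient plot follows. The object name `fit_env` follows Table 3; the covariates in the formula and the plotting arguments (`cex.ylab`, `mar`) are illustrative assumptions.

```r
library(mvabund)
library(gllvm)

data(antTraits)
y <- as.matrix(antTraits$abund)
X <- scale(as.matrix(antTraits$env))

# NB-GLLVM with three latent variables and species-specific environmental effects
fit_env <- gllvm(y, X, family = "negative.binomial", num.lv = 3,
                 formula = ~ Bare.ground + Shrub.cover + Volume.lying.CWD,
                 seed = 1)

# Point estimates and 95% confidence intervals, one panel per covariate
coefplot(fit_env, cex.ylab = 0.7, mar = c(4, 9, 2, 1))

# The estimated coefficients are also available as a species-by-covariate matrix
dim(fit_env$params$Xcoef)
```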

The resulting plot is given in Figure 3. Note that with a log link used, a unit change in covariate l equates to a multiplicative change of $\exp(\beta_{jl})$ in the predicted mean $\mu_{ij}$ for species j. Most of the 95% confidence intervals include zero, indicating that the majority of species do not exhibit evidence of a strong association between environment and abundance. This may be due to a lack of information in the data, as much as being due to a lack of environmental association after accounting for potential residual species covariation.

Figure 3. Plots of the point estimates (ticks) for coefficients of the environmental variables and their 95% confidence intervals (lines) for the NB‐GLLVM, with those coloured in grey (black) denoting intervals (not) containing zero. The x‐axis of the coefficient plot of the third variable is truncated due to a very wide confidence interval for one of the coefficients

7 STUDYING CO‐OCCURRENCE PATTERNS

Latent variables induce correlation across response variables, and so provide a means of estimating correlation patterns across species, and the extent to which they can be explained by environmental variables. As explained previously, information on correlation is stored in the factor loadings, and the getResidualCor() function can be used to estimate the correlation matrix of the linear predictor across species. This can be visualized using the corrplot package.
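
The following sketch shows one way to do this. It assumes the corrplot package is installed, refits the environmental model so that the block is self‐contained, and uses display settings (`type`, `method`, label sizes) chosen here for readability rather than taken from the paper.

```r
library(mvabund)
library(gllvm)
library(corrplot)

data(antTraits)
y <- as.matrix(antTraits$abund)
X <- scale(as.matrix(antTraits$env))

fit_env <- gllvm(y, X, family = "negative.binomial", num.lv = 3,
                 formula = ~ Bare.ground + Shrub.cover + Volume.lying.CWD,
                 seed = 1)

# Residual correlations between species, after accounting for the covariates
cr <- getResidualCor(fit_env)

corrplot(cr, diag = FALSE, type = "lower", method = "square",
         tl.cex = 0.6, tl.srt = 45, tl.col = "red")
```

Reordering the rows and columns of `cr` (for example by hierarchical clustering) before plotting can make blocks of positively correlated species easier to see.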

Regions coloured in dark blue on Figure 4 indicate clusters of species that are positively correlated with each other, after controlling for covariation in species explained by the environmental terms in fit_env. There are also two regions coloured in red, indicating negative correlation between pairs of species. The effect of the environmental variables on the between species correlations can be seen by comparing the correlation matrix in Figure 4 to the correlation matrix given by the model without environmental variables, see example in Appendix S1, where the correlation patterns are considerably different from one another. Correlations can also be visualized in a residual biplot (Appendix S1). The traces of residual covariances obtained via the getResidualCov() function can be used to quantify the amount of variation in the data explained by environmental variables (Warton et al., 2015), see Appendix S1.

Figure 4. Residual correlation matrix based on latent factor loadings for the NB‐GLLVM with environmental covariates

8 INCORPORATING FUNCTIONAL TRAITS INTO ‘FOURTH CORNER’ MODELS

In the previous section, environmental associations were studied by fitting separate terms for each species, without attempting to explain why different species respond differently to the environment. Adding functional traits to the model offers the potential to explain why species differ in environmental response. The fourth corner model in Equation 2 can be fitted by using the argument TR to include traits, and the argument formula is used to specify the model.
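
A fourth corner model of this kind could be specified as sketched below. The object name `fit_4th` follows Table 3, and the traits named in the text (Pilosity, Polymorphism, Webers.length) are used in the interaction terms; the particular environmental covariates are an assumption made for this example.

```r
library(mvabund)
library(gllvm)

data(antTraits)
y  <- as.matrix(antTraits$abund)
X  <- scale(as.matrix(antTraits$env))
TR <- antTraits$traits

# Environmental main effects plus environment-by-trait (fourth corner) interactions
fit_4th <- gllvm(y, X, TR, family = "negative.binomial", num.lv = 3,
                 formula = y ~ (Bare.ground + Shrub.cover + Volume.lying.CWD) +
                   (Bare.ground + Shrub.cover + Volume.lying.CWD):
                   (Pilosity + Polymorphism + Webers.length),
                 seed = 1)

coefplot(fit_4th, mar = c(4, 11, 1, 1), cex.ylab = 0.8)
```

The formula has no trait main effects, in line with Equation 2, where they are absorbed by the species‐specific intercepts.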

As previously, coefficients can be plotted using the function coefplot(). The environment–trait interaction terms, also known as the fourth corner terms, can also be visualized using the function levelplot() from the package lattice, see Appendix S1 for example code. The resulting plots in Figure 5 indicate that interactions of the trait variable Polymorphism with Bare.ground and Webers.length with Volume.lying.CWD have the strongest effects on ant abundances. Notice that Pilosity and Polymorphism are factors and gllvm() recognizes this.

Figure 5. A plot of the estimated coefficients (ticks) and their 95% confidence intervals (lines) for all terms in the fourth corner model (left), and a level plot for the fourth corner interaction terms (right) in the NB‐GLLVM. The colours offer an indication of the signs and magnitudes of the point estimates

By using a maximum likelihood framework, gllvm offers likelihood‐based machinery for model‐based inference. A particular example is likelihood ratio testing via the anova() function when comparing nested models. In Figure 5, for example, all the trait–environment interactions appear to be relatively small and most of the confidence intervals of the coefficients include zero. To formally test whether traits interact with the environment, we fitted a second model without the trait–environment interaction terms and performed a likelihood ratio test. Note that, in order to distinguish this second model from the one with species‐specific coefficients for environmental variables, the TR matrix is still included in the function call.
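
The comparison might look as follows. The reduced model's name `fit_4th2`, the covariates and the seed are assumptions made here; both models are refitted so that the block is self‐contained.

```r
library(mvabund)
library(gllvm)

data(antTraits)
y  <- as.matrix(antTraits$abund)
X  <- scale(as.matrix(antTraits$env))
TR <- antTraits$traits

# Full fourth corner model: main effects plus environment-by-trait interactions
fit_4th <- gllvm(y, X, TR, family = "negative.binomial", num.lv = 3,
                 formula = y ~ (Bare.ground + Shrub.cover + Volume.lying.CWD) +
                   (Bare.ground + Shrub.cover + Volume.lying.CWD):
                   (Pilosity + Polymorphism + Webers.length),
                 seed = 1)

# Reduced model: common environmental main effects only (TR still supplied)
fit_4th2 <- gllvm(y, X, TR, family = "negative.binomial", num.lv = 3,
                  formula = y ~ (Bare.ground + Shrub.cover + Volume.lying.CWD),
                  seed = 1)

# Likelihood ratio test of the interaction terms
anova(fit_4th, fit_4th2)
```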

Based on the output from applying the anova() function, the p‐value suggests that the simpler model where traits were not included is more appropriate, that is, there is no strong evidence of traits mediating the environmental response of species.

The validity of any model‐based inference procedure relies on the assumptions of its underlying model. Note that the above test is based on fit_4th, a model that made the strong assumption that all interspecific variation in environmental response is captured by the trait in the model. Tests based on such models can have inflated false‐positive rates when this assumption is violated, as can be shown using simulations with missing trait predictors (ter Braak, 2019). We are working on an extension of our model, using a random slope across species, to capture variation in environmental response not captured by the trait model. Tests based on such a model can be expected to have much‐improved robustness to missing predictors in the trait model.

9 SUMMARY

In this paper, we introduced the R package gllvm for the analysis of multivariate abundance data using GLLVMs. The package caters for the types of response variables most commonly seen in ecology, including presence–absence data, overdispersed counts, biomass and ordinal data. The main point of difference between gllvm and other packages for fitting GLLVMs (Hui, 2016; Tikhonov et al., 2019) is that our algorithm is much faster for model‐fitting, and thus capable of handling much larger datasets. Computational efficiency was achieved by avoiding Monte Carlo approaches to estimation, and instead making use of recent innovations for maximum likelihood estimation as discussed in Section 3. Table 3 illustrates this by comparing the computation time of gllvm to boral with default settings (40,000 total iterations, warm‐up at 10,000, thinning at 30), for the three example models of this paper. Computation times were over 140 times shorter when using gllvm, analysing the data in seconds rather than minutes. Note that this example dataset was relatively small, and differences in computation time become practically meaningful for larger datasets. For example, for the metagenomic dataset of Niku et al. (2017), with 56 rows and 985 responses, gllvm fitted a two latent variable model without predictors in 15 min, while boral (under default settings) took 10 hr, without achieving convergence. Even larger datasets can be handled by gllvm, for which analysis is otherwise infeasible with currently available packages.

Table 3. Computation times in seconds (on an Intel Core i7‐3770, 3.4 GHz) to fit the example GLLVM objects of this paper using gllvm and boral (with default settings). The gllvm package reduces computation times from minutes to seconds for each example
Package | fit_ord | fit_env | fit_4th
gllvm | 4.0 | 10.0 | 10.3
boral | 595.4 | 1,483.6 | 1,529.9

A second point of difference between gllvm and competing packages is that it uses a maximum likelihood framework, and thus can employ likelihood‐based tools for inference. Familiar generic R functions like AIC, BIC and anova can be applied to gllvm objects, although, as noted previously, anova results will only be reliable when testing hypotheses concerning a relatively small number of parameters. By comparison, packages that fit GLLVMs under a Bayesian framework return full posterior distributions for both parameters and latent variables (Hui, 2016; Tikhonov et al., 2019), while our likelihood‐based framework returns approximate confidence intervals for parameters, assuming estimators are normally distributed. On the other hand, Bayesian hypothesis testing presents a bigger challenge than the likelihood‐based hypothesis testing that the gllvm package implements.

The GLLVM framework is distinct from methods historically used for ordination in ecology, such as non‐metric multi‐dimensional scaling (nMDS, as in vegan, Oksanen et al., 2018) and duality diagrams (as in ade4, Dray & Dufour, 2007). A key point of distinction is that a GLLVM specifies a statistical model for the data intended to capture key data properties. In particular, multivariate abundance data typically have a strong mean–variance relationship, which if not accounted for, often introduces artefacts into analyses (Warton & Hui, 2017; Warton, Wright, & Wang, 2012). Specifying a statistical model that aims to capture this mean–variance relationship, and using diagnostic tools to check its adequacy (Figure 1), can avoid this issue.

In the future, we plan to broaden the scope of the gllvm package to handle spatial and temporal correlations that often characterize observational multivariate abundance data, by allowing the latent variables to be structured rather than assuming independence across observational units. We will also extend the fourth corner models by including species‐specific random slopes for the predictors, to account for interspecific variation in environmental response that is not explained by traits. The code repository for the package is available on GitHub at https://github.com/JenniNiku/gllvm.

ACKNOWLEDGEMENTS

The work of J.N. was supported by the Wihuri Foundation. The work of S.T. was supported by the CRoNoS COST Action IC1408. The work of F.K.C.H. and D.I.W. was supported by Australian Research Council Discovery Project grants (DP180100836 and DP150100823, respectively). F.K.C.H. was also supported by an ANU cross disciplinary grant.

AUTHORS’ CONTRIBUTIONS

J.N., F.K.C.H., S.T. and D.I.W. conceived the ideas and designed methodology; J.N. was mainly responsible for implementing the application; all authors contributed to the writing, reviewing and editing of the draft and gave final approval for publication.

DATA AVAILABILITY STATEMENT

The ant dataset used in our examples is publicly available from the R package mvabund (Wang et al., 2012) in the Comprehensive R Archive Network: https://cran.r-project.org/web/packages/mvabund/. The microbial data (Kumar et al., 2017) are published in the European Nucleotide Archive under the project number PRJEB17695, https://www.ebi.ac.uk/ena/data/view/PRJEB17695. A subset of these data used in Appendix S2, as well as all code used in this paper and supplementary materials, is publicly available in the R package gllvm (Niku et al., 2019a) on CRAN: https://cran.r-project.org/web/packages/gllvm/.
