gllvm: Fast analysis of multivariate abundance data with generalized linear latent variable models in r
Abstract
- There has been rapid development in tools for multivariate analysis based on fully specified statistical models or ‘joint models’. One approach attracting a lot of attention is generalized linear latent variable models (GLLVMs). However, software for fitting these models is typically slow and not practical for large datasets.
- The r package gllvm offers relatively fast methods to fit GLLVMs via maximum likelihood, along with tools for model checking, visualization and inference.
- The main advantage of the package over other implementations is speed, for example, being two orders of magnitude faster, and capable of handling thousands of response variables. These advances come from using variational approximations to simplify the likelihood expression to be maximized, automatic differentiation software for model‐fitting (via the TMB package) and careful choice of initial values for parameters.
- Examples are used to illustrate the main features and functionality of the package, such as constrained or unconstrained ordination, the inclusion of functional traits in ‘fourth corner’ models, and (if the number of environmental coefficients is not large) inference about environmental associations.
1 INTRODUCTION
Multivariate abundance data, consisting of observations of multiple interacting species (or other taxonomic groups) from a set of samples, are often collected in ecological studies to characterize a community or assemblage of organisms. The term ‘abundance’ is taken here to mean counts, presence–absence records, biomass data or any other measure of the extent to which a species may be present at a site. Such data are commonly used to assess whether a set of sites is similar in species composition (Bjork, Hui, O'Hara, & Montoya, 2018), to identify between-species interactions and visualize correlation patterns across species (Royan et al., 2016), to test hypotheses about environmental effects (Lammel et al., 2018) and to predict abundances (Buisson, Thuiller, Lek, Lim, & Grenouillet, 2008).
In recent years, there has been a growing movement towards the specification of statistical models for multivariate analysis in ecology (Ovaskainen, Hottola, & Siitonen, 2010; Ovaskainen et al., 2017; Warton et al., 2015). Of particular interest are methods that use random effects to incorporate between species correlation in models predicting species abundance as a function of environmental variables, often termed joint species distribution models (Pollock et al., 2014). One exciting possibility offered by these methods is the potential to tease apart some of the causes of species co‐occurrence – joint response to known environmental gradients versus other sources, for example, biotic interaction.
A key approach for statistical modelling of multivariate abundance data is the generalized linear latent variable model (GLLVM, Skrondal & Rabe‐Hesketh, 2004). A GLLVM extends the basic generalized linear model to multivariate data using a factor analytic approach, that is, incorporating a small number of latent variables for each site accompanied by species specific factor loadings to model correlations between responses. These latent variables have a natural interpretation as ordination axes, but with additional capacity, for example, predicting new values, controlling for known environmental variables, using standard model selection tools to choose number of ordination axes (Hui, Taskinen, Pledger, Foster, & Warton, 2015). One of the main advantages of GLLVMs is that they can handle situations where there are many species, because the number of parameters in the covariance model scales linearly with the number of responses (Warton et al., 2015). This is a key technical challenge – often there are more species being sampled than sites, for example, microbial data often have thousands of taxa (Kumar et al., 2017; Niku, Warton, Hui, & Taskinen, 2017).
Software for fitting GLLVMs in ecology is currently quite slow computationally and not practical for large datasets. In particular, packages have been developed in the freely available software r, for example, the boral (Hui, 2016) and HMSC (Tikhonov, Opedal, Abrego, Lehikoinen, & Ovaskainen, 2019) packages, but these use Bayesian MCMC for estimation, which is relatively slow and not practical for large microbial datasets. More recent technical advances provide the opportunity to reduce computation times on some problems from hours to minutes or minutes to seconds, using variational (Hui, Warton, Ormerod, Haapaniemi, & Taskinen, 2017) or Laplace (Niku et al., 2017) approximations to likelihoods, especially via automatic differentiation software such as Template Model Builder (Kristensen, Nielsen, Berg, Skaug, & Bell, 2016).
This paper presents the r package gllvm (Niku et al., 2017), which has been developed for rapid fitting of GLLVMs to multivariate abundance data. The package offers a framework for model‐based ordination, as well as allowing us to study the effect of environmental covariates or environment–trait interactions on responses simultaneously with the analysis of correlation patterns across species. The package also contains tools for statistical inference, model selection and visualization. While other r packages have similar functionality (Hui, 2016; Tikhonov et al., 2019), the key point of distinction is that gllvm fits models much faster than its immediate competitors (e.g. see Table 3) and is capable of modelling larger datasets. Version 1.1.7 of the gllvm package is currently available on the Comprehensive R Archive Network (CRAN).
2 GENERALIZED LINEAR LATENT VARIABLE MODELS
Let $y_{ij}$ denote the abundance of species $j = 1, \dots, m$ at site $i = 1, \dots, n$, and let $x_i = (x_{i1}, \dots, x_{ik})^\top$ be a vector of $k$ environmental variables observed at site $i$. A GLLVM regresses the mean abundance $\mu_{ij}$ against environmental variables and a vector of $d \ll m$ latent variables, $u_i = (u_{i1}, \dots, u_{id})^\top$:

$$g(\mu_{ij}) = \eta_{ij} = \alpha_i + \beta_{0j} + x_i^\top \beta_j + u_i^\top \theta_j, \qquad (1)$$

where $g(\cdot)$ is a known link function, and $\beta_j$ and $\theta_j$ are vectors of species-specific coefficients related to the covariates and latent variables, respectively. The latent variables $u_i$ can be thought of as unmeasured environmental variables, or as ordination scores, capturing the main axes of covariation of abundance (after controlling for observed predictors $x_i$). We assume that these latent variables are independent across sites and standard normally distributed. The parameters $\beta_{0j}$ are species-specific intercepts, while $\alpha_i$ are optional site effects which can be chosen as either fixed or random effects ($\alpha_i \sim N(0, \sigma^2)$). The row effects $\alpha_i$ can be included for site total abundance standardization, that is, all other terms in the model can then be interpreted as modelling relative abundance or compositional effects (Hui et al., 2015). To ensure that the above model is identifiable, the upper triangular part of the loading matrix $\Gamma = [\theta_1 \cdots \theta_m]^\top$ needs to be set to zero and its diagonal elements constrained to be positive to avoid rotational invariance; see Hui et al. (2015) and Niku et al. (2017) for further information.
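To make the structure of Equation 1 concrete, the following base r sketch simulates Poisson counts from a GLLVM with a log link and no row effects. All dimensions, parameter values and object names (`X`, `U`, `B`, `Gamma`) are illustrative assumptions, not values from this paper; the loading matrix is constrained as described above.

```r
set.seed(1)
n <- 30; m <- 8; k <- 2; d <- 2   # sites, species, covariates, latent variables

X     <- matrix(rnorm(n * k), n, k)             # environmental covariates x_i
U     <- matrix(rnorm(n * d), n, d)             # latent variables u_i ~ N(0, I_d)
beta0 <- rnorm(m, mean = 1, sd = 0.5)           # species intercepts beta_0j
B     <- matrix(rnorm(m * k, sd = 0.5), m, k)   # row j holds beta_j
Gamma <- matrix(rnorm(m * d, sd = 0.5), m, d)   # row j holds theta_j

# Identifiability constraints: zero upper triangle, positive diagonal
Gamma[upper.tri(Gamma)] <- 0
diag(Gamma) <- abs(diag(Gamma))

# Linear predictor eta_ij = beta_0j + x_i' beta_j + u_i' theta_j
eta <- matrix(beta0, n, m, byrow = TRUE) + X %*% t(B) + U %*% t(Gamma)

# Poisson responses with log link: mu_ij = exp(eta_ij)
y <- matrix(rpois(n * m, lambda = exp(eta)), n, m)
dim(y)
```

Because the same $u_i$ enters every species' linear predictor at site $i$, the simulated columns of `y` are correlated even after conditioning on `X`.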
The residual covariance matrix, storing information on species co-occurrence that is not explained by environmental variables, can be calculated as $\Sigma = \Gamma\Gamma^\top$. This is the correct form of the residual covariance when the responses are Poisson distributed. In the case of the negative binomial (NB) distribution with dispersion parameters $\phi_j$, we adjust the diagonal elements by adding a species-specific term that depends on $\phi_j$, which corresponds to the variance explained by the NB distribution. Analogously, for the binomial probit model, the residual covariance is $\Gamma\Gamma^\top + I_m$ (Ovaskainen, Abrego, Halme, & Dunson, 2016).
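As a small numerical illustration, the sketch below computes the residual covariance as the cross-product of a loading matrix (written `Gamma` here, with one row of loadings per species) and converts it to correlations. The loading values are made up for the example; the probit variant simply adds an identity matrix, as stated above.

```r
# Hypothetical loadings for m = 4 species and d = 2 latent variables
Gamma <- matrix(c( 0.9,  0.0,
                   0.5,  0.7,
                  -0.4,  0.2,
                   0.1, -0.6), ncol = 2, byrow = TRUE)

# Residual covariance on the linear predictor scale (Poisson case)
Sigma <- Gamma %*% t(Gamma)

# Corresponding residual correlation matrix
R <- cov2cor(Sigma)

# Binomial probit case: add the identity matrix to the covariance
Sigma_probit <- Sigma + diag(nrow(Gamma))

round(R, 2)
```

Note that `Sigma` has rank at most $d$, which is what keeps the number of covariance parameters small.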
When a set of $q$ species traits, $t_j = (t_{j1}, \dots, t_{jq})^\top$, are also recorded, we can use them to help explain interspecific variation in environmental response. This leads to an extension of the so-called ‘fourth corner model’ (Brown et al., 2014; Jamil & ter Braak, 2013) where multivariate abundance is regressed against a function of traits and environment, and the environment–trait interactions represent the fourth corner association between traits and environment. The associated fourth corner GLLVM then has mean model:

$$g(\mu_{ij}) = \alpha_i + \beta_{0j} + x_i^\top \beta_e + x_i^\top B_I t_j + u_i^\top \theta_j, \qquad (2)$$

where $\beta_e$ is a vector of main effects for environmental covariates, and $B_I$ is the $k \times q$ matrix of fourth corner coefficients. A main effect for traits was not included, because main effects on abundance across species are absorbed by the intercept term $\beta_{0j}$. This model assumes that all interspecific variation in response to covariates is mediated by traits, which reduces the number of parameters related to covariates from $mk$ in Equation 1 to $k(q + 1)$ in Equation 2.
In both GLLVM formulations mentioned above, a key feature is that the number of parameters characterizing the residual correlation, the loadings $\theta_j$, grows linearly with the number of responses $m$. This contrasts with the quadratic rate of growth when an unstructured residual covariance matrix is assumed across responses (Pollock et al., 2014). Thus the term $u_i^\top \theta_j$ is able to model residual correlation across response variables even when the number of species is relatively large.
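The difference between linear and quadratic growth is easy to tabulate. The sketch below counts free covariance parameters under the latent variable structure (an $m \times d$ loading matrix with its upper triangle fixed to zero) and under an unstructured covariance matrix; the function names are hypothetical helpers introduced only for this illustration.

```r
# Free loadings: m * d entries minus the d * (d - 1) / 2 zeros above the diagonal
n_par_lv <- function(m, d) m * d - d * (d - 1) / 2

# Unstructured symmetric m x m covariance matrix
n_par_unstructured <- function(m) m * (m + 1) / 2

m <- c(10, 100, 1000)
data.frame(species      = m,
           latent_d2    = n_par_lv(m, d = 2),
           unstructured = n_par_unstructured(m))
```

With a thousand species and two latent variables, the latent variable structure needs roughly two thousand parameters where an unstructured matrix would need about half a million.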
3 ESTIMATION
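Let $f(y_{ij} \mid u_i, \Psi)$ denote the distribution of the response $y_{ij}$ conditional on the latent variables, and let $h(u_i)$ denote the multivariate standard normal density of $u_i$. Integrating over the latent variables gives the marginal log-likelihood

$$\ell(\Psi) = \sum_{i=1}^{n} \log\left( \int \prod_{j=1}^{m} f(y_{ij} \mid u_i, \Psi)\, h(u_i)\, du_i \right), \qquad (3)$$

where $\Psi$ includes all model parameters. In this expression, we have assumed that abundances are independent across sites and that any correlation across responses is captured by the latent variables $u_i$. Thus, conditional on $u_i$, the $y_{ij}$ are independent of each other within sites.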
In the literature, several solutions have been proposed to the problem of evaluating the integral in Equation 3, most notably adaptive quadrature (Rabe‐Hesketh, Skrondal, & Pickles, 2002), Monte Carlo versions of the expectation maximization (EM) algorithm (Hui et al., 2015) and Bayesian MCMC (Hui, 2016; Tikhonov et al., 2019). For large datasets and multiple latent variables, however, these methods are time‐consuming. The gllvm package addresses this problem through three features:
- Maximizing the log‐likelihood using (almost completely) closed form approximation. We provide two ways to do this – using Gaussian variational approximations (VA, Hui et al., 2017) for overdispersed counts, binary and ordinal responses, or using Laplace approximations (LA, Niku et al., 2017) for other exponential family distributions when a fully closed form variational approximation cannot be obtained, for example, biomass data can be modelled by the Tweedie distribution.
- Parameter estimation makes use of automatic differentiation software in C++ to accelerate computation times, via the interface provided by the r package TMB (Kristensen et al., 2016).
- Careful choice of starting values. In particular, we use a factor analysis on Dunn‐Smyth residuals (Niku et al., 2019b) to obtain starting values close to the anticipated solution, optionally, with jittering to check the sensitivity of the approach.
The end result is a package that provides more stable solutions, and is orders of magnitude faster than current competitors.
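To give a feel for what a closed-form approximation to the integral in Equation 3 involves, the following base r toy example compares a Laplace approximation with numerical quadrature for a single site, three Poisson species and one latent variable. It is a sketch under made-up parameter values, not the implementation used in the package, which works through TMB.

```r
# Toy data for a single site: m = 3 Poisson species, one latent variable
y     <- c(2, 0, 5)
beta0 <- c(0.5, -0.2, 1.0)   # species intercepts (illustrative values)
theta <- c(0.8, -0.5, 0.3)   # loadings (illustrative values)

# Log of the integrand: sum_j log f(y_j | u) + log h(u)
log_integrand <- function(u) {
  sapply(u, function(ui) {
    sum(dpois(y, lambda = exp(beta0 + theta * ui), log = TRUE)) +
      dnorm(ui, log = TRUE)
  })
}

# Reference value by one-dimensional numerical quadrature
lik_quad <- integrate(function(u) exp(log_integrand(u)), -Inf, Inf)$value

# Laplace approximation: second-order expansion around the mode u_hat
opt   <- optimize(log_integrand, interval = c(-10, 10), maximum = TRUE)
u_hat <- opt$maximum

# Negative second derivative of the log-integrand at the mode
# (available analytically for Poisson responses with a log link)
H <- sum(theta^2 * exp(beta0 + theta * u_hat)) + 1
lik_laplace <- exp(opt$objective) * sqrt(2 * pi / H)

c(quadrature = log(lik_quad), laplace = log(lik_laplace))
```

Quadrature is feasible here only because the integral is one-dimensional; the appeal of Laplace and variational approximations is that their cost does not grow with a quadrature grid as the number of latent variables increases.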
4 USING THE R PACKAGE GLLVM
Data input can be specified using the ‘wide format’ matrices via y, X and TR arguments, or using the long format via data argument, and formula is used for model specification (which defaults to including linear terms for all variables from X and TR, and all interactions between variables in X and variables in TR). The number of latent variables can be defined using the argument num.lv, with zero latent variables corresponding to a simple multi‐response GLM that does not account for correlation across responses (Wang, Naumann, Wright, & Warton, 2012). The response distribution can be chosen using the argument family, and models can be fitted using either the VA (method = "VA", default) or with the LA (method = "LA") method. The currently available distributions, link functions and methods for different response types are listed in Table 1.
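A minimal sketch of the wide-format interface described above, assuming the ant data from the mvabund package (see the Data Availability Statement) and a single, arbitrarily chosen covariate. The object names `fit_va` and `fit_la` are illustrative.

```r
library(gllvm)

# Ant data shipped with mvabund: abundances, environment and traits
data(antTraits, package = "mvabund")
y <- as.matrix(antTraits$abund)        # sites x species count matrix
X <- scale(as.matrix(antTraits$env))   # sites x covariates, standardized

# Wide-format call: Poisson responses, two latent variables, one covariate,
# fitted with the (default) variational approximation
fit_va <- gllvm(y = y, X = X, formula = ~ Bare.ground,
                family = poisson(), num.lv = 2, method = "VA")

# The same model fitted with the Laplace approximation
fit_la <- gllvm(y = y, X = X, formula = ~ Bare.ground,
                family = poisson(), num.lv = 2, method = "LA")

logLik(fit_va)
logLik(fit_la)
```

The two log-likelihood values come from different approximations of the same marginal likelihood, so they need not agree exactly.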
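Table 1. Mean, $E(y_{ij})$, and mean–variance, $V(\mu_{ij})$, functions, estimation methods and link functions for various response types in gllvm

| Response | Distribution | Method | Link | Description |
|---|---|---|---|---|
| Counts | Poisson | VA/LA | log | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \mu_{ij}$ |
| | NB | VA/LA | log | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \mu_{ij} + \phi_j \mu_{ij}^2$, where $\phi_j > 0$ is a dispersion parameter |
| | ZIP | LA | log | $E(y_{ij}) = (1 - p)\mu_{ij}$, $V(\mu_{ij}) = \mu_{ij}(1 - p)(1 + \mu_{ij} p)$, with excess-zero probability $p$ |
| Binary | Bernoulli | VA/LA | probit | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \mu_{ij}(1 - \mu_{ij})$ |
| | | LA | logit | |
| Biomass | Tweedie | LA | log | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \phi_j \mu_{ij}^{\nu}$, where $1 < \nu < 2$ is a power parameter and $\phi_j > 0$ is a dispersion parameter |
| Ordinal | Multinomial | VA | probit | Cumulative probit model |
| Normal | Gaussian | VA/LA | identity | $E(y_{ij}) = \mu_{ij}$, $V(\mu_{ij}) = \phi_j^2$ |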
Other important arguments in the gllvm call are row.eff for defining the type of row effects (none, fixed or random), offset for potential inclusion of offsets, Power for defining the power parameter of the Tweedie distribution (Niku et al., 2017) and starting.val for judicious choice of starting values for the latent variables (Niku et al., 2019b). For an overview of the available functions in gllvm, see Table 2.
| Function | Description |
|---|---|
| gllvm() | Fits a generalized linear latent variable model |
| anova.gllvm() | Analysis of deviance for ‘gllvm’ objects |
| coefplot.gllvm() | Plots covariate coefficients and confidence intervals |
| logLik.gllvm() | Log‐likelihood of an object of class ‘gllvm’ |
| residuals.gllvm() | Dunn‐Smyth residuals for ‘gllvm’ model |
| summary.gllvm() | Summarizing ‘gllvm’ model fits |
| ordiplot.gllvm() | Plots latent variables from a ‘gllvm’ model |
| plot.gllvm() | Diagnostic plots for a ‘gllvm’ object |
| confint.gllvm() | Confidence intervals for ‘gllvm’ model parameters |
| predict.gllvm() | Obtains predictions from a ‘gllvm’ model |
| getResidualCov.gllvm() | Calculates residual covariance matrix for a ‘gllvm’ fit |
| getResidualCor.gllvm() | Calculates residual correlations for a ‘gllvm’ fit |
| getPredictErr.gllvm() | Prediction errors for predicted latent variables |
| simulate.gllvm() | Generate new data based on a ‘gllvm’ fit |
5 MODEL‐BASED ORDINATION
The default printout includes information criteria, all of which suggest that the NB distribution is a better choice than the Poisson distribution for modelling the response. Residual plots for diagnosing model fit (Figure 1) can be obtained using the plot() function. For each model, two plots are shown: Dunn‐Smyth residuals, which are randomized quantile‐based residuals designed for discrete data (Dunn & Smyth, 1996), plotted against linear predictors, and a normal quantile–quantile plot with a simulated point‐wise 95% confidence interval envelope. The residual diagnostics for the Poisson model show some overdispersion, in particular a telltale fan shape in the plot of residuals against fitted values. These issues are largely resolved in the NB model. Note that the latent variables in the model provide some capacity to account for overdispersion, so overdispersed counts do not always require us to move beyond the Poisson distribution, although there is clear evidence of such a need in this example.
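The comparison described above can be sketched as follows, assuming the ant abundance matrix from mvabund and two latent variables. The object names `fit_po` and `fit_nb` are illustrative, and the exact numbers printed depend on the package version and starting values.

```r
library(gllvm)

data(antTraits, package = "mvabund")
y <- as.matrix(antTraits$abund)

# Unconstrained ordination: two latent variables, no covariates
fit_po <- gllvm(y, family = poisson(), num.lv = 2)
fit_nb <- gllvm(y, family = "negative.binomial", num.lv = 2)

# Printing a fit shows the log-likelihood and information criteria
print(fit_po)
print(fit_nb)
c(Poisson = AIC(fit_po), NB = AIC(fit_nb))

# Dunn-Smyth residuals against linear predictors (1) and normal Q-Q plot (2)
plot(fit_po, which = 1:2)
plot(fit_nb, which = 1:2)
```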

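Ordination plots of the predicted latent variables can be drawn with the function ordiplot(). The species with the largest factor loadings (largest $\lVert\theta_j\rVert$), and hence most strongly associated with ordination scores, can be added using the logical argument biplot, leading to a biplot for finding indicator species corresponding to specific sites. The ind.spp argument defines the number of species to be plotted.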
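A biplot of this kind could be produced along the following lines. The fit is an NB model with two latent variables on the ant data, and the choice of 15 indicator species is an assumption made for illustration.

```r
library(gllvm)

data(antTraits, package = "mvabund")
y <- as.matrix(antTraits$abund)

# NB ordination model with two latent variables
fit_ord <- gllvm(y, family = "negative.binomial", num.lv = 2)

# Site scores, with the 15 species having the largest loadings overlaid
ordiplot(fit_ord, biplot = TRUE, ind.spp = 15)
```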
The above command creates the biplot shown in Figure 2, based on the GLLVM fitted to the antTraits data. We can see one large cluster of sites at the top with many indicator species, and a few smaller clusters with only a few indicator species, for example, sites 12–15. In Appendix S3, we apply classical algorithm‐based ordination methods to the ant data and compare the results. While the results from GLLVMs and the algorithm‐based methods are quite similar, GLLVMs offer the advantage of standard tools for diagnosing model fit and performing model selection.

6 MODEL WITH ENVIRONMENTAL VARIABLES
A model with three latent variables was chosen based on the AICc value, and residual analysis indicates that a NB distribution offered the most suitable mean–variance relationship for the responses.
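A sketch of such a fit and of the coefficient plot discussed next. The three covariates in the formula are an assumption made for this example (they are columns of the ant environmental data), and the plotting arguments only adjust label size and margins.

```r
library(gllvm)

data(antTraits, package = "mvabund")
y <- as.matrix(antTraits$abund)
X <- scale(as.matrix(antTraits$env))

# NB model with three latent variables and three environmental covariates
fit_env <- gllvm(y, X, family = "negative.binomial", num.lv = 3,
                 formula = ~ Bare.ground + Shrub.cover + Volume.lying.CWD)

# Species-specific coefficients with 95% confidence intervals
coefplot(fit_env, cex.ylab = 0.7, mar = c(4, 9, 2, 1))

# The same intervals in tabular form
ci <- confint(fit_env, level = 0.95)
head(ci)
```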
The resulting plot is given in Figure 3. Note that with a log link used, a unit change in covariate $l$ equates to a multiplicative change of $\exp(\beta_{jl})$ in the predicted mean $\mu_{ij}$ for species $j$. Most of the 95% confidence intervals include zero, indicating that the majority of species do not exhibit evidence of a strong association between environment and species abundance. This may be due to a lack of information in the data, as much as being due to a lack of environmental association after accounting for potential residual species covariation.
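The multiplicative interpretation follows directly from the log link. As a short derivation, write $\beta_{jl}$ for the coefficient of covariate $l$ for species $j$ and $\eta_{ij}$ for the linear predictor, and hold all other terms fixed while covariate $l$ increases by one unit:

$$\frac{\mu_{ij}(x_{il} + 1)}{\mu_{ij}(x_{il})} = \frac{\exp(\eta_{ij} + \beta_{jl})}{\exp(\eta_{ij})} = \exp(\beta_{jl}).$$

For instance, a coefficient of $\beta_{jl} = 0.5$ corresponds to the predicted mean being multiplied by $\exp(0.5) \approx 1.65$ per unit increase in the covariate, while a confidence interval for $\beta_{jl}$ that covers zero is compatible with a multiplier of 1, that is, no change.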

7 STUDYING CO‐OCCURRENCE PATTERNS
Regions coloured in dark blue on Figure 4 indicate clusters of species that are positively correlated with each other, after controlling for covariation in species explained by the environmental terms in fit_env. There are also two regions coloured in red, indicating negative correlation between pairs of species. The effect of the environmental variables on the between species correlations can be seen by comparing the correlation matrix in Figure 4 to the correlation matrix given by the model without environmental variables, see example in Appendix S1, where the correlation patterns are considerably different from one another. Correlations can also be visualized in a residual biplot (Appendix S1). The traces of residual covariances obtained via the getResidualCov() function can be used to quantify the amount of variation in the data explained by environmental variables (Warton et al., 2015), see Appendix S1.
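A correlation plot of this kind could be produced roughly as follows. This sketch refits the environmental model so that it is self-contained, and it assumes the separate corrplot package for drawing; the covariates and plotting options are illustrative choices.

```r
library(gllvm)

data(antTraits, package = "mvabund")
y <- as.matrix(antTraits$abund)
X <- scale(as.matrix(antTraits$env))

fit_env <- gllvm(y, X, family = "negative.binomial", num.lv = 3,
                 formula = ~ Bare.ground + Shrub.cover + Volume.lying.CWD)

# Residual correlations between species, after accounting for the covariates
cr <- getResidualCor(fit_env)

# Lower triangle only; by default blue marks positive and red negative values
corrplot::corrplot(cr, diag = FALSE, type = "lower", method = "square",
                   tl.cex = 0.8, tl.srt = 45, tl.col = "red")
```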

8 INCORPORATING FUNCTIONAL TRAITS INTO ‘FOURTH CORNER’ MODELS
As previously, coefficients can be plotted using the function coefplot(). The environment–trait interaction terms, also known as the fourth corner terms, can also be visualized using the function levelplot() from the package lattice, see Appendix S1 for example code. The resulting plots in Figure 5 indicate that interactions of the trait variable Polymorphism with Bare.ground and Webers.length with Volume.lying.CWD have the strongest effects on ant abundances. Notice that Pilosity and Polymorphism are factors and gllvm() recognizes this.

Based on the output from applying the anova() function, the p‐value suggests that the simpler model where traits were not included is more appropriate, that is, there is no strong evidence of traits mediating the environmental response of species.
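The test described above can be sketched as a comparison of two nested fits. The formulas below are an assumption about how such models might be specified (three environmental covariates and three traits, with no trait main effects); the object name `fit_env_only` is illustrative, while `fit_4th` follows the name used in the text.

```r
library(gllvm)

data(antTraits, package = "mvabund")
y  <- as.matrix(antTraits$abund)
X  <- scale(as.matrix(antTraits$env))
TR <- antTraits$traits

# Fourth corner model: environmental main effects plus environment-trait interactions
fit_4th <- gllvm(y, X, TR, family = "negative.binomial", num.lv = 3,
  formula = y ~ (Bare.ground + Shrub.cover + Volume.lying.CWD) +
    (Bare.ground + Shrub.cover + Volume.lying.CWD):
    (Pilosity + Polymorphism + Webers.length))

# Simpler model: the same environmental main effects, no trait interactions
fit_env_only <- gllvm(y, X, TR, family = "negative.binomial", num.lv = 3,
  formula = y ~ (Bare.ground + Shrub.cover + Volume.lying.CWD))

# Likelihood ratio test of the interaction terms
anova(fit_env_only, fit_4th)
```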
The validity of any model‐based inference procedure relies on the assumptions of its underlying model. Note that the above test is based on fit_4th, a model that made the strong assumption that all interspecific variation in environmental response is captured by the trait in the model. Tests based on such models can have inflated false‐positive rates when this assumption is violated, as can be shown using simulations with missing trait predictors (ter Braak, 2019). We are working on an extension of our model, using a random slope across species, to capture variation in environmental response not captured by the trait model. Tests based on such a model can be expected to have much‐improved robustness to missing predictors in the trait model.
9 SUMMARY
In this paper, we introduced the r package gllvm for the analysis of multivariate abundance data using GLLVMs. The package caters for the types of response variables most commonly seen in ecology, including presence–absence data, overdispersed counts, biomass and ordinal data. The main point of difference between gllvm and other packages for fitting GLLVMs (Hui, 2016; Tikhonov et al., 2019) is that our algorithm is much faster for model‐fitting, and thus capable of handling much larger datasets. Computational efficiency was achieved by avoiding MC approaches to estimation, and instead making use of recent innovations for maximum likelihood estimation as discussed in Estimation. Table 3 illustrates this by comparing the computation time of gllvm to boral with default settings (40,000 total iterations, warm‐up at 10,000, thinning at 30), for the three example models of this paper. Computation times were over 140 times shorter when using gllvm, analysing the data in seconds rather than minutes. Note that this example dataset was relatively small, and differences in computation time become practically meaningful for larger datasets. For example, for the metagenomic dataset of Niku et al. (2017), with 56 rows and 985 responses, gllvm fitted a two latent variable model without predictors in 15 min, while boral (under default settings) took 10 hr, without achieving convergence. Even larger datasets again can be handled by gllvm, for which analysis is otherwise infeasible with currently available packages.
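Table 3. Computation times (in seconds) of gllvm and boral for the three example models of this paper

| Package | fit_ord | fit_env | fit_4th |
|---|---|---|---|
| gllvm | 4.0 | 10.0 | 10.3 |
| boral | 595.4 | 1,483.6 | 1,529.9 |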
A second point of difference between gllvm and competing packages is that it uses a maximum likelihood framework, and thus can employ likelihood‐based tools for inference. Familiar generic r functions like AIC, BIC and anova can be applied to gllvm objects, although as previously we emphasize that anova results will only be reliable when testing hypotheses concerning a relatively small number of parameters. To compare, packages that fit GLLVMs under a Bayesian framework would return full posterior distributions for both parameters and latent variables (Hui, 2016; Tikhonov et al., 2019), while our likelihood‐based framework returns approximate confidence intervals for parameters, assuming estimators are normally distributed. On the other hand, performing Bayesian hypothesis testing presents a bigger challenge compared to using likelihood‐based hypothesis testing as the gllvm package implements.
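As an illustration of these likelihood-based tools, the following sketch compares ordination models with one to three latent variables using the generic AIC() and BIC() functions. The range of `num.lv` values and the data frame layout are choices made for this example only.

```r
library(gllvm)

data(antTraits, package = "mvabund")
y <- as.matrix(antTraits$abund)

# Fit NB ordination models with 1, 2 and 3 latent variables
fits <- lapply(1:3, function(d) {
  gllvm(y, family = "negative.binomial", num.lv = d)
})

# Collect information criteria; smaller values indicate a preferred model
ic <- data.frame(num.lv = 1:3,
                 AIC = sapply(fits, AIC),
                 BIC = sapply(fits, BIC))
ic
```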
The GLLVM framework is distinct from methods historically used for ordination in ecology, such as non‐metric multi‐dimensional scaling (nMDS, as in vegan, Oksanen et al., 2018) and duality diagrams (as in ade4, Dray & Dufour, 2007). A key point of distinction is that a GLLVM specifies a statistical model for the data intended to capture key data properties. In particular, multivariate abundance data typically have a strong mean–variance relationship, which if not accounted for, often introduces artefacts into analyses (Warton & Hui, 2017; Warton, Wright, & Wang, 2012). Specifying a statistical model that aims to capture this mean–variance relationship, and using diagnostic tools to check its adequacy (Figure 1), can avoid this issue.
In the future, we plan to broaden the scope of the gllvm package to handle spatial and temporal correlations that often characterize observational multivariate abundance data, by allowing the latent variables to be structured rather than assuming independence across observational units. We will also extend the fourth corner models by including species‐specific random slopes for the predictors, to account for interspecific variation in environmental response that is not explained by traits. The code repository for the package is available on GitHub at https://github.com/JenniNiku/gllvm.
ACKNOWLEDGEMENTS
The work of J.N. was supported by the Wihuri Foundation. The work of S.T. was supported by the CRoNoS COST Action IC1408. The work of F.K.C.H. and D.I.W. was supported by Australian Research Council Discovery Project grants (DP180100836 and DP150100823, respectively). F.K.C.H. was also supported by an ANU cross‐disciplinary grant.
AUTHORS’ CONTRIBUTIONS
J.N., F.K.C.H., S.T. and D.I.W. conceived the ideas and designed methodology; J.N. was mainly responsible for implementing the application; all authors contributed to the writing, reviewing and editing of the draft and gave final approval for publication.
Open Research
DATA AVAILABILITY STATEMENT
The ant dataset used in our examples is publicly available from the r package mvabund (Wang et al., 2012) in the Comprehensive R Archive Network: https://cran.r-project.org/web/packages/mvabund/. The microbial data (Kumar et al., 2017) are published in European Nucleotide Archive under the project number PRJEB17695, https://www.ebi.ac.uk/ena/data/view/PRJEB17695. A subset of these data used in Appendix S2, as well as all code used in this paper and supplementary materials is publicly available in the r package gllvm (Niku et al., 2019a) in the CRAN: https://cran.r-project.org/web/packages/gllvm/.
REFERENCES