Model-based ordination for species with unequal niche widths

It is common practice for ecologists to examine species niches in the study of community composition. The response curve of a species in the fundamental niche is usually assumed to be quadratic. The center of a quadratic curve represents a species’ optimal environmental conditions, and the width its ability to tolerate deviations from the optimum. Most multivariate methods assume species respond linearly to the environment of the niche, or with a quadratic curve that is of equal width and height for all species. However, it is widely understood that some species are generalists who tolerate deviations from their optimal environment better than others. Rare species often tolerate a smaller range of environments than more common species, corresponding to a narrow niche. We propose a new method, for ordination and fitting Joint Species Distribution Models, based on Generalized Linear Mixed-Effects Models, which relaxes the assumptions of equal tolerances and equal maxima. By explicitly estimating species optima, tolerances, and maxima, per ecological gradient, we can better predict change in species communities, and understand how species relate to each other.

where g{·} is a known link function (e.g. the log-link when the responses are assumed to be Poisson, negative-92 binomial, or gamma distributed, the probit-link when the responses are assumed to be Bernoulli or ordinal 93 distributed, and the identity-link for responses that are assumed to be Gaussian distributed).

94
To facilitate easier estimation, and for a closer comparison to the linear GLLVM, we formulate the 95 quadratic GLLVM in matrix notation: with a species-specific intercept β 0j that accounts for e.g. mean abundances, and a vector of coefficients per 97 species for the linear term γ j . We can see a third term is added here to the existing structure of the linear 98 GLLVM, which models tolerances and maxima per species and latent variable. Specifically, we introduce a 99 diagonal matrix D j of quadratic coefficients with each diagonal element being the quadratic effect for latent 100 variable q and species j. We require D j to be a positive-definite diagonal matrix, to ensure concave curves Here, ij accounts for any residual information that is not accounted for by fixed-effects in the model, such the elements of the residual covariance matrix are given by: For a a length p vector i , existing JSDM implementations assume i ∼ N (0, Σ), i.e. the residual term follows 136 a multivariate normal distribution. For the linear GLLVM, it is straightforward to show that ij = z i γ j , so 137 it follows that i ∼ N (0, γ j γ j ). In essence, GLLVMs peform a low rank approximation to the covariance  Turning to the quadratic GLLVM, where ij = z i γ j − z i D j z i , the elements of the residual covariance 141 matrix are: for which a proof is given in Appendix S1. This can be rewritten in terms of the species optima u j and 143 tolerances t j : required to switch ordination method as a consequence of gradient length.

168
To determine gradient length from the proposed quadratic GLLVM, we define the ecological gradients  Instead, we propose that species optima and tolerances can be plotted directly, so that species niches  Finally, based on the disscussion in the two subsections above, there are two ways of scaling the ordination 198 diagram: 1) by the residual variance per latent variable, or 2) by the mean or median tolerance. In the first 199 scaling, the diagram is scaled to draw attention to the latent variable that explains most variance in the model.

214
The marginal log-likelihood of a quadratic GLLVM is given by: where f (y ij |z i , Θ) is the distribution of the species responses given the latent variables. As mentioned  In VA, we construct a lower bound to equation (6), by assuming that the posterior distribution of the 221 latent variables can be approximated by a closed form distribution e.g., a multivariate normal distribution. 222 We then minimze the Kullback-Leibler divergence between this approximate closed-form distribution (also for GLLVMs with linear latent variables, the optimal variational distribution is multivariate normal z i ∼

225
In Appendix S1 we provide information on calculating approximate confidence intervals for the parame-227 ters. In Appendix S2 we provide derivations for the log-likelihood of common response types in community 228 ecology, such as count data (Poisson, and negative-binomial with quadratic mean-variance relationship, 229 and both assuming a log-link function), binary data and ordinal data (both with probit-link function), as 230 well as positive continuous data (gamma, with log-link function) and continuous data (Gaussian, with an 231 identity-link function).

232
Simulation study 233 To assess how well the proposed model retrieves the true latent variables z i , optima u j , tolerances t j , and  Second, to explore the sample size required to accurately estimate the species-specific parameters e.g., 245 species optima u j , tolerances t j , and maxima c j , we simulated datasets of n = 20 to 100 sites in increments of and fitting algorithm were used in the best models per distribution).

250
As a true model, we considered a quadratic GLLVM with d = 2 latent variables, which was constructed 251 as follows. First, the species-specific intercepts β 0j were simulated as Uniform(−1, 1), which corresponds 252 to species with low abundance or occurrence. Next, the true coefficients corresponding to the linear terms ordinal distribution we assumed six classes with the true cut-offs being 0, 1, 2, 3, 4, 5, meaning that species were most often absent (category 1), while they were rarely very abundant (category 6). 258 We measured performance of the quadratic GLLVM by the prediction of the latent variables z i and the (1 -3).

285
The MAE per distribution and for the different sized datasets is presented in Figure 1 (see Appendix S5:  lengths of the first two ecological gradients were 5.46 (4.28-6.64, 95% confidence interval), and 3.35 (3.14-  though the covariate effect of the second model is presented in appendix S5, Figure S3.  The similarity in responses of species to unobserved environments can be assed by examining tolerances, 409 by examining an ordination diagrams for overlap in species distributions, or by using the residual correlation 410 matrix. Determining if species exhibit fully quadratic curves in response to ecological gradients, whether tolerances are common for all species per ecological gradient, or if the equal tolerances assumption is suited for a dataset, comes down to a problem of model selection for the quadratic GLLVM. To that end, future 413 research can further investigate approaches such as regularization (e.g., possibly extending the approach of length is a result of a heuristic rescaling of the ecological gradient, here it is calculated from the quadratic 417 coefficients, which are estimated with approximate maximum likelihood.

418
For datasets with 50 species and 50 sites or more, the quadratic GLLVM accurately retrieved ecological 419 gradients and species-specific parameters, though for continuous responses or counts it is possible to accu-420 rately estimate parameters with fewer species or sites. In general, when fitting the quadratic GLLVM to 421 binary or ordinal responses, more information is required than for other data types (similarly as reported in 422 Yee 2004). However, this is conditional on the information content in a dataset, and the number of required 423 sites and species here should only be considered as a rough rule of thumb. 424 We studied the response of species to ecological gradients for hunting spiders in a Dutch dune ecosystem

432
Modelling rare species is often difficult in community ecology as few ordination methods have the ca-433 pability to explicitly do so. The quadratic GLLVM has great potential for community ecology, as it can 434 simultaneously accommodate common (large tolerance and maxima i.e. a wide and high niche) and rare 435 species (small tolerance and maxima i.e. a narrow and low niche) with the quadratic term. The quadratic 436 GLLVM predicts species with unobserved optima, narrow niches, and small maxima will have the fewest 437 observations. Since the quadratic GLLVM includes two species-specific parameters per latent variable, and 438 thus requires more information in the data for accurate estimation of parameters than the linear GLLVM, it 439 potentially requires a large dataset to include sufficient information on rare species. However, the example 440 in this paper using the dataset of counts for hunting spiders (van der Aart & Smeek-Enserink 1974) shows 441 that the quadratic GLLVM can be feasible to fit even to small datasets. An advantage of GLLVMs is their 442 ability to use information from common species to improve estimation of parameters to describe the niches 443 of rare species. However, without penalization or borrowing information for estimation from more abundant species, the parameters for species with few observations are not necessarily expected to be accurate.