Volume 7, Issue 4
Research Article

Uncovering hidden spatial structure in species communities with spatially explicit joint species distribution models

Otso Ovaskainen

Corresponding Author

Department of Biosciences, Metapopulation Research Centre, University of Helsinki, P.O. Box 65, FI‐00014 Helsinki, Finland

Department of Biology, Centre for Biodiversity Dynamics, Norwegian University of Science and Technology, N‐7491 Trondheim, Norway

Correspondence author. E‐mail: otso.ovaskainen@helsinki.fi

David B. Roy

Centre for Ecology and Hydrology, Wallingford, Oxfordshire, OX10 8BB UK

Richard Fox

Butterfly Conservation, East Lulworth, Wareham, Dorset, BH20 5QP UK

Barbara J. Anderson

Landcare Research, Private Bag 1930, Dunedin, 1954 New Zealand

First published: 07 November 2015

Summary

  1. Modern species distribution models account for spatial autocorrelation in order to obtain unbiased statistical inference on the effects of covariates, to improve the model's predictive ability through spatial interpolation and to gain insight into the spatial processes shaping the data. Somewhat analogously, hierarchical approaches to community‐level data have been developed to gain insights into community‐level processes and to improve species‐level inference by borrowing information from other species that are either ecologically or phylogenetically related to the focal species.
  2. We unify spatial and community‐level structures by developing spatially explicit joint species distribution models. The models utilize spatially structured latent factors to model missing covariates as well as species‐to‐species associations in a statistically and computationally effective manner.
  3. We illustrate that the inclusion of the spatial latent factors greatly increases the predictive performance of the modelling approach with a case study of 55 species of butterfly recorded on a 10 km × 10 km grid in Great Britain consisting of 2609 grid cells.

Introduction

Conceptual and theoretical research in community ecology has long emphasized that the dynamics and distributions of species communities are shaped by the interplay between (i) environmental filtering, (ii) species interactions, and (iii) spatial and stochastic processes (Leibold et al. 2004). One reason why metacommunity theories are still poorly linked with data is the lack of statistical frameworks that enable these three factors to be integrated and that would be applicable for data typically available in community ecological studies (Logue et al. 2011). As we briefly review below, the last decade has brought major statistical advances in species distribution modelling that help to bridge this gap between theory and data: joint species distribution modelling facilitates the assessment of environmental filtering and species interactions, whereas spatially and spatio‐temporally structured species distribution models enable one to incorporate the effects of spatial and stochastic processes. In this study, we bring these developments together by developing a statistical framework for spatially explicit joint species distribution modelling.

In their influential review, Ferrier & Guisan (2006) classified strategies for analysing community‐level species distribution data into the three categories of ‘assemble first, predict later’ (e.g. modelling species richness as the response variable), ‘predict first, assemble later’ (e.g. summing the predictions of single‐species models to predict species richness), and ‘assemble and predict together’. Since their review, a great amount of methodological progress has taken place in the category of ‘assemble and predict together’, that is joint species distribution models that include simultaneously both species‐ and community‐level components. Such models have been shown to have better predictive power than single‐species models, in particular for rare species for which model parameterization may not be feasible without borrowing information from other species (Ovaskainen & Soininen 2011; Bonthoux, Baselga & Balent 2013; Hui et al. 2013).

Joint species distribution models extend single‐species approaches in two principally different ways: by modelling environmental filtering at the community level and by accounting for statistical co‐occurrence among the species. In the context of regression‐based models, one approach for seeking community‐level patterns in environmental filtering is to treat the species‐specific regression coefficients (related to occurrence and/or detectability) as random effects and thus assuming that they follow either univariate or multivariate normal distributions across the species (Dorazio & Royle 2005; Dorazio et al. 2006, 2010; Kery et al. 2009; Russell et al. 2009; Zipkin et al. 2010; Ovaskainen & Soininen 2011; Jackson et al. 2012; Olden et al. 2014). Another approach that similarly allows sharing information among species is the use of mixture models, which use model‐based grouping of species into ‘species archetypes’ (Dunstan, Foster & Darnell 2011; Hui et al. 2013). Not only model construction, but also model selection can be conducted either at the species level or at the community level (Madon, Warton & Araujo 2013).

As reviewed by Kissling et al. (2012) and Wisz et al. (2013), statistical co‐occurrence among species (generated by species interactions or missing covariates) can be incorporated into joint species distribution models in several ways. The most straightforward alternative is to use some species as predictors for others. In communities with a large number of species, this, however, leads to the problem of multiple testing, which can be counteracted by including as predictors only the most abundant species (le Roux et al. 2014), or only those species that are part of the food web of the focal species (Pellissier et al. 2013). Another alternative is the use of multivariate regression models (Ovaskainen, Hottola & Siitonen 2010; Sebastian‐Gonzalez et al. 2010; Clark et al. 2014; Pollock et al. 2014) or neural network models (Harris 2015) in which the response variable is the vector of occurrences (Ovaskainen, Hottola & Siitonen 2010; Sebastian‐Gonzalez et al. 2010; Pollock et al. 2014; Harris 2015) or abundances (Clark et al. 2014) of all species. In this context, neural network models can be used to identify nonlinear relationships between species. With rich enough data, one may attempt to infer more refined aspects of species associations, for example the presence of so‐called competitive intransitivity (Ulrich et al. 2014). But as statistical co‐occurrence patterns can be created either by missing environmental covariates or by biotic interactions (Morales‐Castilla et al. 2015), the results of such multivariate regression models need to be interpreted with caution (Ovaskainen, Hottola & Siitonen 2010; Pollock et al. 2014).

Joint species distribution models can also be effective tools for bringing functional and phylogenetic perspectives to the analysis of species distribution data. Species traits can be used to model the responses of the species to environmental covariates (Pollock, Morris & Vesk 2012; Brown et al. 2014) and to facilitate the estimation of the species‐to‐species correlation matrices by considering them as functions of trait dissimilarity (Dorazio & Connor 2014). Accounting for phylogenetic constraints is necessary for obtaining unbiased inference in analyses that consider each species as a data point, and it can also be helpful for disentangling the effects of environmental filtering from those of biotic interactions (Helmus et al. 2007; Ives & Helmus 2011). Further, bringing the phylogenetic perspective to joint species distribution models shifts the emphasis from measures of community similarity based on species identity to corresponding measures based on phylogenetic similarity (Ives & Helmus 2010).

In parallel to the developments aiming to move from single‐species perspectives to multispecies perspectives, the need for using spatially explicit species distribution models has become increasingly acknowledged in ecological research, both due to interest in spatial processes per se and due to the need to account for non‐independent data points (Dormann et al. 2007). The estimation of spatially structured residuals has been facilitated by computational advances in Bayesian inference, both in Markov chain Monte Carlo (MCMC) sampling methods (Latimer et al. 2009; Chakraborty et al. 2010) and in methods based upon the integrated nested Laplace approximation (Blangiardo et al. 2013). Another increasingly popular approach for bringing spatial structure into species distribution models is the use of spatial eigenvectors derived from the distance matrix among the sampling sites (Borcard & Legendre 2002; Dray, Legendre & Peres‐Neto 2006; Dray et al. 2012).

The aim of this study was to integrate joint species distribution modelling and spatially explicit species distribution modelling. Such developments were pioneered by Latimer et al. (2009), who incorporated for each species a spatially structured residual, and estimated species‐to‐species correlation structure among the spatial effects. As the approach of Latimer et al. (2009) requires the estimation of species‐specific spatial effects, it is not suited for large species communities that are often dominated by rare species. Latimer et al. (2009) parameterized their model for four common species only. Here we overcome this limitation by modelling spatial effects at the community level. To do so, we utilize latent factor models, which have recently emerged in the ecological literature (Walker & Jackson 2011; Hui et al. 2015), and for which computationally efficient sampling algorithms are available (Bhattacharya & Dunson 2011). The use of spatial latent factors was recently introduced in the community context by Thorson et al. (2015). While our work is closely related to that developed independently by Thorson et al. (2015), it has the following differences: (i) our modelling approach is developed in the Bayesian framework, and it thus provides the full posterior distribution of parameter uncertainty, (ii) we combine spatial factors with fixed effects and thus partition variation between measured and unmeasured covariates, (iii) we apply the model to all species that make up the community, instead of restricting the analyses to common species only, and (iv) we demonstrate how the approach can be used to assess the geographic scaling of spatial covariance patterns. We demonstrate the predictive power of our modelling approach with data consisting of the occurrences of 55 species of butterflies sampled in Great Britain during 1995–1999 on 2609 grid cells at the resolution of 10 km × 10 km (Asher et al. 2001).

Joint species distribution modelling with spatially structured latent factors

We model the presence–absences or abundances of a set of species using the statistical framework of spatially explicit joint species distribution models. The main advantage of this modelling framework is the use of spatially structured latent factors, which makes it possible to capture the effects of missing covariates, the effects of biotic interactions or the combination of these two. The computational and statistical efficiency of the approach arises from there generally being far fewer latent factors than there are species. This is because all species are modelled with the help of a shared set of latent factors, each species having its own loading for each latent factor. If the latent factors were known covariates, the loadings would simply correspond to regression coefficients which could be estimated using standard techniques. However, it is often the case that species distributions are partly determined by unknown or unmeasurable covariates, or by biotic interactions. These ‘hidden covariates’ are here accounted for by the latent factors, and as they are not known a priori, they must be estimated. During the model fitting process, also the spatial scale at which each latent factor varies is estimated. For instance, if the latent factor corresponds to a large‐scale macroclimatic gradient, the corresponding spatial scale will be much larger than if the latent factor corresponds to a small‐scale microclimate gradient or small‐scale biotic interactions. Informally, the latent factors, and their spatial scales, are estimated so that they explain as much as possible of the variation in the distributions of all the species simultaneously. Also the number of latent factors is estimated, with the aim of including a sufficient number of latent factors to allow the model to capture as much of the biologically relevant variation as possible, but to avoid overfitting and thus the inclusion of latent factors that model noise rather than signal.

Before turning to the formal description of the model, we illustrate its main idea in Fig. 1. For simplicity, we do not include any measured covariates. Thus, the predictors of the model consist only of two latent factors η1 and η2, shown, respectively, in panels a and b of Fig. 1. The example is constructed to mimic the case of two competing species with overlapping resource use, for example two birds which are both restricted to coniferous forest but that compete for nesting locations within each stand. In such a case, one would expect to see negative co‐occurrence over short spatial scales, but positive co‐occurrence over large spatial scales (Araujo & Rozenfeld 2014). In Fig. 1, the latent factor η1 represents the shared resource, and it varies at the large characteristic spatial scale α1 = 10 spatial units (in Fig. 1, the spatial unit corresponds to the grid cell size). The latent factor η2 represents the influence of competition, and it varies at the smaller spatial scale of α2 = 2 spatial units. Each species j has its own loading λjh for each latent factor h, so that species‐specific occurrence probabilities are modelled as linear combinations of the latent factors. In the example of Fig. 1, the loadings of species 1 are λ11 = 1, λ12 = 1, so that the linear predictor for species 1 (illustrated in panels c and e) is L1 = η1 + η2. The loadings of species 2 are λ21 = 1, λ22 = −2, so that the linear predictor for species 2 (illustrated in panels d and f) is L2 = η1 − 2η2. As both species have a positive loading to the latent factor η1, they show positive co‐occurrence over large spatial scales. But as their loadings have opposite signs to the latent factor η2, their co‐occurrence pattern is negative over short spatial scales (Fig. 1g, h).

Fig. 1. Illustration of the joint species distribution modelling with spatially structured latent factors. The panels a and b show latent factors η1 and η2 which have exponentially decaying correlation structure at spatial scales α1 = 10 and α2 = 2. The panels c and d show, respectively, the linear predictors for species 1 (L1 = η1 + η2) and species 2 (L2 = η1 − 2η2), which combine the latent factors with the loading matrix Λ = (1 1; 1 −2). The panels e and f show the occurrence patterns of species 1 and 2 and panel g their co‐occurrence pattern, with red and blue denoting occurrences of species 1 and 2, and black denoting the co‐occurrence of both species. Panel h shows the spatial covariance functions (Eq. 2), with red and blue depicting the within species covariances ρ11 (d) and ρ22 (d), and black depicting the between species covariance ρ12 (d).
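The two‐species example of Fig. 1 can be mimicked with a short simulation. The following is a minimal sketch and not the authors' code: the grid size, the random seed, the jitter term and the function names `exp_corr` and `simulate_factor` are assumptions made for illustration, and the loadings are stored with species as rows.

```python
import numpy as np

def exp_corr(coords, alpha):
    """Exponential correlation exp(-d / alpha) between all pairs of points."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-d / alpha)

def simulate_factor(coords, alpha, rng):
    """Draw one zero-mean, unit-variance Gaussian field with spatial scale alpha."""
    c = exp_corr(coords, alpha) + 1e-8 * np.eye(len(coords))  # jitter (assumption)
    return np.linalg.cholesky(c) @ rng.standard_normal(len(coords))

rng = np.random.default_rng(1)          # arbitrary seed
side = 20                               # 20 x 20 grid, chosen only for speed
xx, yy = np.meshgrid(np.arange(side), np.arange(side))
coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

# Two latent factors with scales alpha1 = 10 and alpha2 = 2 grid cells
eta = np.column_stack([simulate_factor(coords, 10.0, rng),
                       simulate_factor(coords, 2.0, rng)])

# Loadings from the text: species 1 -> (1, 1), species 2 -> (1, -2)
lam = np.array([[1.0, 1.0],
                [1.0, -2.0]])           # rows: species, columns: factors

L = eta @ lam.T                         # linear predictors, one column per species
z = L + rng.standard_normal(L.shape)    # latent liability with N(0, 1) residual
y = (z > 0).astype(int)                 # presence (1) or absence (0)
```

Reshaping a column of `L` or `y` to `(side, side)` gives maps comparable in spirit to panels c to f.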

Let us then turn to a more formal definition of the model. We index by i = 1,…, ny the sampling units and by j = 1,…, ns the species. While we exemplified the modelling framework with two species, the model is equally well suited to communities consisting of a large number of species.

In case of presence–absence data, we model the presence (yij = 1) or absence (yij = 0, including the possibility of non‐detection) of species j on sampling unit i by probit regression, implemented as $y_{ij} = 1_{z_{ij} > 0}$, where the latent liability $z_{ij} = L_{ij} + \varepsilon_{ij}$ includes the linear predictor $L_{ij}$ and the residual $\varepsilon_{ij}$, which models the probit link function and is distributed $\varepsilon_{ij} \sim N(0,1)$ independently among the species and the sampling units. The linear predictors are further modelled as
$$L_{ij} = \sum_{k=1}^{n_c} x_{ik}\,\beta_{kj} + \sum_{h=1}^{n_f} \eta_{ih}\,\lambda_{jh} \qquad \text{(eqn 1)}$$

Here xik is the measured covariate k = 1,…, nc for sampling unit i, βkj is the regression coefficient measuring how species j responds to the covariate k, ηih is the (unmeasured) latent factor h = 1,…, nf, and the factor loading λjh measures how species j responds to the latent factor h. We included the intercept in the model by setting xi1 = 1.
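Because the residual is standard normal, the probit formulation implies that the occurrence probability is the standard normal cumulative distribution function evaluated at the linear predictor, Pr(yij = 1) = Φ(Lij). A minimal sketch of eqn 1 under that reading follows; the function name `probit_occurrence_prob` and all numerical values are hypothetical, chosen only so that the result can be checked by hand.

```python
import numpy as np
from math import erf, sqrt

def probit_occurrence_prob(X, beta, eta, lam):
    """Return the linear predictor L (eqn 1) and Pr(y = 1) = Phi(L).

    X    : n_y x n_c measured covariates (first column is the intercept)
    beta : n_c x n_s regression coefficients
    eta  : n_y x n_f latent factors
    lam  : n_s x n_f factor loadings (species as rows; a storage assumption)
    """
    L = X @ beta + eta @ lam.T
    phi = np.vectorize(lambda v: 0.5 * (1.0 + erf(v / sqrt(2.0))))
    return L, phi(L)

# Toy example: 3 sampling units, 2 covariates, 2 species, 1 latent factor
X = np.array([[1.0, -1.0],
              [1.0,  0.0],
              [1.0,  1.0]])
beta = np.array([[0.0,  0.5],
                 [1.0, -1.0]])
eta = np.array([[0.5], [0.0], [-0.5]])
lam = np.array([[1.0], [-1.0]])

L, p = probit_occurrence_prob(X, beta, eta, lam)
# L is [[-0.5, 1.0], [0.0, 0.5], [0.5, 0.0]]
```

A linear predictor of zero corresponds to an occurrence probability of 0.5, which is why `p[1, 0]` and `p[2, 1]` both equal 0.5 in this toy case.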

In the standard non‐spatial latent factor model (Bhattacharya & Dunson 2011), the latent factors are assumed to be normally distributed with zero mean and unit variance, ηih ~N(0,1). To bring spatial structure for the latent factors, we assumed a spatially homogeneous Gaussian process with $\mathrm{Cov}[\eta_{ih}, \eta_{i'h'}] = \delta_{hh'}\, f_h(d_{ii'})$, where $d_{ii'}$ is the spatial distance between the sampling units i and i′, and $\delta_{hh'}$ is Kronecker delta with value 1 for h = h′ and with value 0 for h ≠ h′. The function fh (d) is a spatial covariance function, normalized to fh (0) = 1 so that its unit is correlation and that the marginal distributions of the latent factors have zero mean and unit variance, similar to non‐spatial latent factor models. Here we assume the exponential function $f_h(d) = \exp(-d/\alpha_h)$, where αh is the spatial scale of the latent factor h. We note that the standard latent factor model with spatially independent factors induces the covariance structure ε~N(0, Ψ) with $\Psi = (\Lambda^{T}\Lambda) \otimes I$, where $I$ is the identity matrix over the sampling units and Λ is a matrix of the factor loadings (Bhattacharya & Dunson 2011). In addition to determining covariance at zero distance (Thorson et al. 2015), the spatial structure of the latent factors induces a spatial covariance $\rho_{jj'}(d)$ in the latent factors influencing the occurrences of species j and j′, given at distance d by
$$\rho_{jj'}(d) = \sum_{h=1}^{n_f} \lambda_{jh}\,\lambda_{j'h}\, f_h(d) \qquad \text{(eqn 2)}$$

This characterizes species‐to‐species associations not only at the local level (d = 0) but also their spatial decay, similar to Latimer et al. (2009). As illustrated by Fig. 1h, spatial covariance between species can be positive or negative, corresponding to positive or negative co‐occurrence.
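The spatial covariance function can be evaluated directly for the loadings and scales used in Fig. 1. The sketch below assumes the exponential form exp(−d/αh) stated in the text and stores the loadings with species as rows; the function name `spatial_cov` is hypothetical.

```python
import numpy as np

def spatial_cov(lam, alpha, d):
    """Matrix of rho_{jj'}(d) = sum_h lam[j, h] * lam[j', h] * exp(-d / alpha[h])."""
    f = np.exp(-d / np.asarray(alpha, dtype=float))  # f_h(d) for each factor h
    return (lam * f) @ lam.T

lam = np.array([[1.0, 1.0],     # species 1
                [1.0, -2.0]])   # species 2
alpha = [10.0, 2.0]             # spatial scales of factors 1 and 2

rho0 = spatial_cov(lam, alpha, 0.0)   # local covariance: [[2, -1], [-1, 5]]
rho5 = spatial_cov(lam, alpha, 5.0)   # between-species term is positive here

# Distance at which the between-species covariance changes sign:
# exp(-d/10) = 2 * exp(-d/2)  =>  d = ln(2) / (1/2 - 1/10)
d_zero = np.log(2.0) / (1.0 / 2.0 - 1.0 / 10.0)
```

For these values the between‐species covariance is negative at d = 0, crosses zero at roughly 1.7 spatial units and is positive beyond that, which is the short‐scale negative and large‐scale positive co‐occurrence described for the example.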

To ensure that no more factors are selected than are necessary to explain the data, we follow Bhattacharya & Dunson (2011) by defining a multiplicative gamma process shrinkage prior on the factor loadings. This variant of the Bayesian approach also avoids the need to use a pre‐specified structure for the loading matrix as assumed by Thorson et al. (2015). We extended the Gibbs sampling algorithm of Bhattacharya & Dunson (2011) to the present case by utilizing Bayesian multivariate regression for the fixed effects and by incorporating a discrete grid sampler for the spatial scale parameters αh. The technical details of the sampling algorithm are presented in Supplementary material, together with MATLAB code for model parameterization.
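To give a feel for how the shrinkage prior discourages superfluous factors, the sketch below draws loadings from a multiplicative gamma process prior in the form we understand it from Bhattacharya & Dunson (2011): the precision of factor h is the cumulative product of gamma‐distributed variables, so later factors tend to have loadings closer to zero. The hyperparameter values, the function name `draw_shrinkage_loadings` and the species‐by‐factor layout are assumptions for illustration; the Supplementary material remains the reference for the actual sampler.

```python
import numpy as np

def draw_shrinkage_loadings(n_species, n_factors, a1=2.0, a2=3.0, nu=3.0, rng=None):
    """Draw loadings from a multiplicative gamma process shrinkage prior.

    delta_1 ~ Gamma(a1, 1), delta_l ~ Gamma(a2, 1) for l >= 2,
    tau_h = prod_{l <= h} delta_l (global precision of factor h),
    phi_jh ~ Gamma(nu / 2, rate nu / 2) (local precision),
    lam_jh ~ N(0, 1 / (phi_jh * tau_h)).
    Hyperparameter defaults are illustrative, not the authors' settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = np.concatenate([rng.gamma(a1, 1.0, size=1),
                            rng.gamma(a2, 1.0, size=n_factors - 1)])
    tau = np.cumprod(delta)
    # NumPy parameterizes the gamma by shape and scale, so scale = 1 / rate
    phi = rng.gamma(nu / 2.0, 2.0 / nu, size=(n_species, n_factors))
    lam = rng.standard_normal((n_species, n_factors)) / np.sqrt(phi * tau)
    return lam, tau, delta

lam, tau, delta = draw_shrinkage_loadings(55, 10, rng=np.random.default_rng(0))
```

With a2 greater than 1 the precisions `tau` tend to grow with the factor index, which is what lets the sampler effectively switch off factors that are not needed.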

A case study with butterfly data for Great Britain

To illustrate the modelling approach, we consider a case study of 55 species of butterflies. We use the 1995–1999 atlas data (Asher et al. 2001) as presence–absence for each of the 10 × 10 km Ordnance Survey grid cells for Great Britain (n = 2,609 cells). Based on previous studies on butterflies in Great Britain (Hodgson et al. 2011; Bennie et al. 2013), we included as measured covariates (i) the number of growing degree days above 5°C and the percentage of the grid cell cover that consists of (ii) broadleaved woodland, (iii) coniferous woodland and (iv) calcareous substrates (Fig. 2; see Supplementary material for details on the data). To make the test case challenging, we randomly selected only 300 cells that were used as training data to parameterize the model (Fig. 2) and thus used the remaining 2309 cells for model validation. To assess the influence of spatially structured latent factors on model performance, we first fitted the model with just the four covariates. We then fitted the model with the four covariates and the spatially structured latent factors.
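The split into training and validation cells can be sketched as a random permutation of the cell indices. The seed and variable names below are arbitrary assumptions, so the selected cells will not match those shown in Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed (assumption)
n_cells, n_train = 2609, 300

perm = rng.permutation(n_cells)
train_idx = np.sort(perm[:n_train])  # cells used to parameterize the model
valid_idx = np.sort(perm[n_train:])  # cells held out for validation
```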

Fig. 2. Measured environmental covariates and model‐identified latent factors used to model the butterfly community. The upper panels show the measured covariates 1–4: the number of growing degree days above 5 degrees (a), and the fraction of each grid cell consisting of broadleaved woodland (b), coniferous woodland (c) and calcareous substrates (d). The lower line of panels shows the two most dominant latent factors (i.e. hidden environmental covariates) identified by the model: η1 (e) and η2 (f). The black squares in panel g show the 300 randomly selected 10 km × 10 km grid cells that were used to parameterize the model. The remaining 2309 (shown by grey) were used to test the predictive performance of the model.

While the focus in this study is on the spatial part of the model, we note that the model belongs to the standard framework of generalized linear mixed models, allowing one to incorporate various hierarchical layers and covariance structures. As an example, we model here the responses of the species (βkj) to the measured covariates (xik) as a function of their functional group. To do so, we classified the species as wider countryside species, specialist species and migratory species. Similarly to Brown et al. (2014) and Ovaskainen & Soininen (2011), we model the vector of regression coefficients for species j with the multivariate normal distribution $\beta_j \sim N(\mu_{g(j)}, V)$, where the expected response $\mu_{g(j)}$ is assumed to be specific to the functional group g(j) to which the species belongs.
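The group‐level structure can be illustrated by drawing species‐specific coefficient vectors around group‐specific means. Everything numerical below (group assignments, means, covariance matrix V, seed) is hypothetical and serves only to show the shape of the hierarchy.

```python
import numpy as np

rng = np.random.default_rng(0)                # arbitrary seed (assumption)

# Hypothetical assignment of 5 species to 3 functional groups
# (0 = wider countryside, 1 = specialist, 2 = migratory)
groups = np.array([0, 0, 1, 2, 1])

# Hypothetical group means for 2 coefficients (intercept and one covariate)
mu = np.array([[ 0.0,  1.0],
               [ 0.5, -1.0],
               [-0.5,  0.0]])

# Hypothetical covariance of species responses around their group mean
V = np.array([[0.20, 0.05],
              [0.05, 0.10]])

# One coefficient vector per species; columns index species (n_c x n_s)
beta = np.stack([rng.multivariate_normal(mu[g], V) for g in groups], axis=1)
```

Species in the same group share an expected response but still differ individually, with V controlling how far they scatter around the group mean.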

Failing to account for spatial autocorrelation (i.e. assuming independence among the data points) is expected to lead to biased estimates of fixed effects (known covariates) and overestimation of their statistical significance (Legendre et al. 2002). With the butterfly data, the estimates for the fixed effects were more pronounced and had tighter credibility intervals in the model without latent factors than in the model that also includes latent factors. In the model without the latent factors, the 95% credibility interval for the effect of covariates 1, 2, 3 and 4 did not cross zero, respectively, for 50, 42, 11 and 40 species. In contrast, when latent factors were included, the 95% credibility interval for the effect of covariates 1, 2, 3 and 4 did not cross zero for 28, 6, 8 and 10 species (see Supplementary material for the species‐specific results). The likely overestimation of fixed effects is visible both in species‐ and community‐level predictions, which reflect the covariate layers more pronouncedly than the data. For example, areas with calcareous substrate (Fig. 2d) differ in their species richness from the surrounding areas in a more pronounced way in the model prediction (Fig. 3e) than in the data (Fig. 3d). With the inclusion of the latent factors, this mismatch between the data and the model prediction (Fig. 3f) disappears.

Fig. 3. Visual comparison of model predictions and data. Panel a shows data (black corresponding to presence and grey to absence) for one of the 55 butterfly species (Green hairstreak; Callophrys rubi) and panel d the observed species richness. Panels b and c show the model predictions for the occurrence probability of Green hairstreak based on the model without (b) and with (c) spatial latent factors. Similarly, panels e and f show the model predictions for species richness based on the model without (e) and with (f) latent factors. All predictions are based on fitting the models to data on the 300 training sites shown in Fig. 2g.

The posterior median estimates (95% credibility intervals) for spatial scale parameters of the two most dominant spatial latent factors are α1 = 170 (120 − 260) km and α2 = 170 (120 − 250) km. The first latent factor identifies essentially a north–south gradient, whereas the second one recognizes that the south‐eastern part of Great Britain differs in terms of its butterfly community composition from the rest of the country and especially from the north‐western part (Fig. 2). The model with spatially structured latent factors appears to better predict the data than the model without the latent factors (Figs. 3 and 4). As expected, the predictive power is generally poorest for species with very low or very high prevalence, as the occurrence of these species varies little. The mean R2 value is 30% for the model without latent factors, whereas it is 42% for the model with latent factors. Across the 55 species, the model with latent factors had higher R2 and AUC (Fielding & Bell 1997) values for 54 and 51 species, respectively, when evaluated against the validation data. In addition to improving the average AUC for occurrence of individual species from 0·86 to 0·91, including the latent factors improved the root‐mean‐squared error of species richness, reducing it from 4·9 species to 3·2 species.

Fig. 4. The predictive performance of the community model. Panel a shows species‐specific Tjur (2009) R2 values as a function of the species prevalence, and panel b compares predicted species richness to observed species richness. Both panels are based on fitting the models without spatial latent factors (grey dots) and with spatial latent factors (black dots) to data on 300 training sites (Fig. 2g), and comparing model predictions to the data for the 2309 validation sites. Panel c shows the relative proportions of variance attributed to the measured covariates and to the spatial latent factors. The measured covariates 1–4 are ordered from bottom to top, and coloured grey (covariate 1, i.e. growing degree days), and three levels of red (covariates 2–4). The latent factors 1–2 are shown on top of the measured covariates and are coloured as light blue (latent factor 1) or dark blue (latent factor 2).
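The three performance measures reported above can be computed from validation data and predicted occurrence probabilities as sketched below. Tjur's R2 is the difference between the mean predicted probability at presences and at absences, and AUC is computed here through its pairwise definition. Taking predicted richness as the sum of occurrence probabilities is an assumption of this sketch, as are the function names and the toy numbers.

```python
import numpy as np

def tjur_r2(y, p):
    """Mean predicted probability at presences minus that at absences."""
    y = np.asarray(y, dtype=bool)
    p = np.asarray(p, dtype=float)
    return p[y].mean() - p[~y].mean()

def auc(y, p):
    """Probability that a random presence scores above a random absence (ties 0.5)."""
    y = np.asarray(y, dtype=bool)
    p = np.asarray(p, dtype=float)
    diff = p[y][:, None] - p[~y][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def richness_rmse(Y, P):
    """RMSE between observed richness and the sum of predicted probabilities."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return float(np.sqrt(np.mean((Y.sum(axis=1) - P.sum(axis=1)) ** 2)))

# Toy check for one species on four validation cells
y = [1, 1, 0, 0]
p = [0.9, 0.6, 0.7, 0.2]
r2 = tjur_r2(y, p)      # 0.75 - 0.45 = 0.30
a = auc(y, p)           # 3 of 4 presence/absence pairs ranked correctly = 0.75

# Toy check for richness on two cells and two species
rmse = richness_rmse([[1, 0], [1, 1]], [[0.5, 0.5], [0.5, 0.5]])  # sqrt(0.5)
```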

Taking the average over the species, in the model with latent factors, the proportions of variance (at the level of the linear predictor) attributed to covariates 1–4 were 28%, 3%, 1% and 3%, whereas they were 54% and 11% for the latent factors 1–2 (Fig. 4). Thus, the covariates contributed 35% to the explained variation and the latent factors the remaining 65%, reflecting the increase in the model's predictive power achieved by adding the latent factors. Taking the average over the species, the amount of variance (at the level of the linear predictor) attributed to covariates 1–4 in the model with latent factors was reduced to 71%, 51%, 102% and 55% of the corresponding values in the model without latent factors. Thus, the latent factors absorbed some of the variation attributed to fixed effects in the model without the latent factors.
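One simple way to obtain such proportions for a single species is to compare the variance each term contributes to the linear predictor. The sketch below assumes mutually uncorrelated covariates, so that covariate k contributes βkj² Var(xk), and uses the unit marginal variance of the latent factors, so that factor h contributes λjh². This is an illustrative simplification with hypothetical names and numbers, not necessarily the calculation behind Fig. 4.

```python
import numpy as np

def variance_partition(beta, x_var, lam):
    """Share of linear-predictor variance per covariate and per latent factor.

    beta  : coefficients of the non-intercept covariates for one species
    x_var : variances of those covariates across sampling units
    lam   : loadings of that species on the (unit-variance) latent factors
    Assumes covariates are uncorrelated with each other and with the factors.
    """
    cov_part = np.asarray(beta, float) ** 2 * np.asarray(x_var, float)
    fac_part = np.asarray(lam, float) ** 2
    total = cov_part.sum() + fac_part.sum()
    return cov_part / total, fac_part / total

# Toy values: contributions are [1, 1] for covariates and [4, 0] for factors
cov_share, fac_share = variance_partition([1.0, 0.5], [1.0, 4.0], [2.0, 0.0])
# cov_share is [1/6, 1/6] and fac_share is [2/3, 0]
```

Averaging such shares over species gives community‐level figures of the kind quoted above, where the covariate shares (28%, 3%, 1% and 3%) sum to 35% and the factor shares (54% and 11%) to 65%.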

Discussion

Statistical methods for joint species distribution modelling have become well‐established, but thus far they have lacked a spatially explicit perspective (with the exceptions of Latimer et al. 2009; Thorson et al. 2015). In this study, we have utilized recent progress in latent factor modelling to develop a general statistical framework for spatially explicit joint species distribution modelling. As illustrated by our results, the inclusion of spatial latent factors improves statistical inference of joint species distribution models in three ways. First, failing to account for spatial structure corresponds to the assumption of independent data points, which leads to biased estimates for the effects of measured covariates. The inclusion of spatially structured latent factors is analogous to the inclusion of a spatially structured residual in single‐species models, and thus it corrects the inference on the effects of measured covariates. Secondly, the incorporation of spatial latent factors enables spatial interpolation which can greatly improve the predictive power of the model. This was indeed the case in the butterfly case study, where most of the explained variation was attributed to the spatial latent factors. Thirdly, the spatial latent factors identified by the models can be informative, as they can be interpreted as the covariates that influence the species community but are missing from the model, or as the end result of biotic interactions. For example, while we included the growing degree days above 5 degrees as a covariate, the first and most important latent factor identified by the model corresponded to a north–south gradient. This suggests that the north–south gradient correlates with relevant covariates other than the number of growing degree days or that there is random species turnover along this gradient.

As community‐level data on species occurrence or abundance typically come from a spatial setting, the possibility to account for spatiality in the analysis phase enables many kinds of applications. Thus, we expect the method presented here to be generally useful for data typically collected in community ecological studies: presence–absence or abundance data acquired for a set of sampling sites, some environmental covariates describing the properties of those sites and the spatial coordinates of those sites. With such data, the modelling approach presented here can be used for assessing the geographic scaling of covariance patterns (Eq. 2; illustrated in Fig. 1h), which can provide information on the type of species interactions (Araujo & Rozenfeld 2014). More generally, the generalized linear modelling framework allows one to partition variation in any community metric (e.g. species richness, evenness, or community dissimilarity) to the influences of measured covariates, to the influences of spatially structured latent factors and to unexplained residual variation. We note that while we have utilized here atlas data that form a regular grid, the method applies directly also to any spatially irregular sampling design. Furthermore, by replacing the distance in two‐dimensional space with the one‐dimensional distance over time, the method applies as such also to time‐series data. In this case, the exponential correlation structure assumed here corresponds to the widely applied AR(1) autoregressive model.
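The correspondence with AR(1) can be made explicit for observations at equally spaced time points. The following short derivation assumes unit time steps and uses the symbols φ and ξ, which do not appear elsewhere in the text.

```latex
% Exponential correlation over time, for integer time points t and t'
\operatorname{Corr}(\eta_{t}, \eta_{t'})
  = \exp\!\left(-\frac{|t - t'|}{\alpha}\right)
  = \phi^{\,|t - t'|},
  \qquad \phi = \exp(-1/\alpha).

% A stationary AR(1) process with unit marginal variance has the same correlation
\eta_{t} = \phi\,\eta_{t-1} + \sqrt{1 - \phi^{2}}\;\xi_{t},
  \qquad \xi_{t} \sim N(0, 1).
```

A large temporal scale α thus corresponds to an autoregressive coefficient φ close to 1, and a small α to a coefficient close to 0.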

The model presented here adds the influence of measured covariates compared to the approach presented by Thorson et al. (2015), but it still ignores many aspects that other approaches developed in community ecology account for. However, as our modelling approach is based on the standard framework of hierarchical generalized linear mixed models, it is of a very general nature and easily extendable to components implemented in previous research on joint species distribution modelling. These include the influence of species traits (Pollock, Morris & Vesk 2012; Brown et al. 2014) and phylogenetic constraints (Helmus et al. 2007; Ives & Helmus 2011), the use of abundance data instead of presence–absence data (Clark et al. 2014) and accounting for detectability (Dorazio et al. 2006).

As illustrated here, spatially explicit community modelling is expected to be useful especially for problems that involve spatial interpolation, for example for predicting species distribution maps from a sparse set of observations. But much interest in community ecology relates also to extrapolation, that is predicting the occurrences of species under environmental conditions not present in the training data, for example after climate change, after habitat loss or in an unexplored region. While it is not expected that spatially explicit community modelling will be able to provide improved mean predictions for extrapolation, it can improve the assessment of uncertainty in such predictions. This is because the modelling framework is able to identify how much of the current species occurrences are influenced by such unknown variation that is structured by space and thus not likely to be just noise. Assuming that the same proportion of the variance will be attributed to unmodelled but spatially structured variation also in the extrapolated situation will enable constructing more realistic confidence intervals than just ignoring such variation.

Acknowledgements

We thank David Dunson, Chaozhi Zheng, Nerea Abrego, William Lee, two anonymous reviewers and members of the UKPopNet (NERC R8‐H12‐012 and English Nature) for insightful discussions and valuable comments. We thank the large number of volunteer recorders contributing GB butterfly occurrence records. These data sets are operated by Butterfly Conservation and the NERC Centre for Ecology & Hydrology and financially supported by a consortium of government agencies. OO was supported by the Academy of Finland (grant no. 250444) and BJA by NERC grant NE/FO18606/1 and a RSNZ Rutherford Discovery Fellowship.

    Data accessibility

    The data used in the case study are provided in the supplementary material. These data were obtained from the following sources:

    • British butterfly distribution data for the 1995‐1999 atlas period gathered by the Butterflies for the New Millennium recording scheme (Asher et al. 2001) are provided in the supplementary material. These data may be used, with appropriate acknowledgement, under a creative commons licence (https://creativecommons.org/licenses/by/4.0/). These data are held by Butterfly Conservation and the Centre for Ecology & Hydrology and are available through http://butterfly-conservation.org/111/butterflies-for-the-new-millennium.html and http://data.nbn.org.uk (contact: Richard Fox, rfox@butterfly-conservation.org).
    • UK climate data are provided in the supplementary material. We used the mean annual number of growing degree days above 5 degrees Celsius for 1995‐1999 for Britain at a 10 km Ordnance Survey grid resolution, derived from the CRU ts2.1 and CRU 61‐90 climate data sets (Barrow, Hulme & Jiang 1993). This involved the anomalies at 0·5 deg grid resolution being interpolated onto the UK Ordnance Survey 10 km grid and combined with the TIGER climate data (Hill 1995) from mean elevations within grid cells. These data may be used, with appropriate acknowledgement, under a creative commons licence (https://creativecommons.org/licenses/by/4.0/). Original data are available from http://www.alarmproject.net/climate/climate/
    • UK Land cover data (LCM2000) are provided in the supplementary material. We used per cent cover of broadleaved woodland and per cent cover of coniferous woodland (Fuller et al. 2002). Per cent cover was calculated as a percentage of the land area within each UK Ordnance Survey 10 km grid cell. These data are licensed and must be used in accordance with the open government licence (OGL; http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/). Original data are available through http://www.ceh.ac.uk/landcovermap2000.html.
    • UK geology data are provided in the supplementary material. We used the 1 km resolution Soil Parent Material Model detailing 6 basic parent material parameters (derived from the 1:50 000 scale version). We calculated the number of 1 km squares within each UK Ordnance Survey 10 km grid cell with a calcareous content value of ‘HIGH’. Original data were downloaded on 01 January 2015. Licensing restrictions, terms and conditions and the original data are available (with appropriate acknowledgement) from http://www.bgs.ac.uk/downloads/start.cfm?id=2899.
