Uncovering hidden spatial structure in species communities with spatially explicit joint species distribution models

Modern species distribution models account for spatial autocorrelation in order to obtain unbiased statistical inference on the effects of covariates, to improve the model's predictive ability through spatial interpolation and to gain insight in the spatial processes shaping the data. Somewhat analogously, hierarchical approaches to community‐level data have been developed to gain insights into community‐level processes and to improve species‐level inference by borrowing information from other species that are either ecologically or phylogenetically related to the focal species. We unify spatial and community‐level structures by developing spatially explicit joint species distribution models. The models utilize spatially structured latent factors to model missing covariates as well as species‐to‐species associations in a statistically and computationally effective manner. We illustrate that the inclusion of the spatial latent factors greatly increases the predictive performance of the modelling approach with a case study of 55 species of butterfly recorded on a 10 km × 10 km grid in Great Britain consisting of 2609 grid cells.


Introduction
Conceptual and theoretical research in community ecology has long emphasized that the dynamics and distributions of species communities are shaped by the interplay between (i) environmental filtering, (ii) species interactions, and (iii) spatial and stochastic processes (Leibold et al. 2004). One reason why metacommunity theories are still poorly linked with data is the lack of statistical frameworks that enable these three factors to be integrated and that would be applicable for data typically available in community ecological studies (Logue et al. 2011). As we briefly review below, the last decade has brought major statistical advances in species distribution modelling that helps to bridge this gap between theory and data: joint species distribution modelling facilitates the assessment of environmental filtering and species interactions, whereas spatially and spatiotemporally structured species distribution models enable one to incorporate the effects of spatial and stochastic processes. In this study, we bring these developments together by developing a statistical framework for spatially explicit joint species distribution modelling.
In their influential review, Ferrier & Guisan (2006) classified strategies for analysing community-level species distribution data into the three categories of 'assemble first, predict later' (e.g. modelling species richness as the response variable), 'predict first, assemble later' (e.g. summing the predictions of single-species models to predict species richness), and 'assemble and predict together'. Since their review, a great amount of methodological progress has taken place in the category of 'assemble and predict together', that is joint species distribution models that include simultaneously both species-and community-level components. Such models have been shown to have better predictive power than single-species models, in particular for rare species for which model parameterization may not be feasible without borrowing information from other species (Ovaskainen & Soininen 2011;Bonthoux, Baselga & Balent 2013;Hui et al. 2013).
Joint species distribution models extend single-species approaches in two principally different ways: by modelling environmental filtering at the community level and by accounting for statistical co-occurrence among the species. In the context of regression-based models, one approach for seeking community-level patterns in environmental filtering is to treat the species-specific regression coefficients (related to occurrence and/or detectability) as random effects and thus assuming that they follow either univariate or multivariate normal distributions across the species (Dorazio & Royle 2005;Dorazio et al. 2006Dorazio et al. , 2010Kery et al. 2009;Russell et al. 2009;Zipkin et al. 2010;Ovaskainen & Soininen 2011;Jackson et al. 2012;Olden et al. 2014). Another approach that similarly allows sharing information among species is the use of mixture models, which use model-based grouping of species into 'species archetypes' (Dunstan, Foster & Darnell 2011;Hui et al. 2013). Not only model construction, but also model selection can be conducted either at the species level or at the community level (Madon, Warton & Araujo 2013).
As reviewed by Kissling et al. (2012) and Wisz et al. (2013), statistical co-occurrence among species (generated by species interactions or missing covariates) can be incorporated into joint species distribution models in several ways. The most straightforward alternative is to use some species as predictors for others. In communities with a large number of species, this, however, leads to the problem of multiple testing, which can be counteracted by including as predictors only the most abundant species (le Roux et al. 2014), or only those species that are part of the food web of the focal species . Another alternative is the use of multivariate regression models (Ovaskainen, Hottola & Siitonen 2010;Sebastian-Gonzalez et al. 2010;Clark et al. 2014;Pollock et al. 2014) or neural network models (Harris 2015) in which the response variable is the vector of occurrences (Ovaskainen, Hottola & Siitonen 2010;Sebastian-Gonzalez et al. 2010;Pollock et al. 2014;Harris 2015) or abundances (Clark et al. 2014) of all species. In this context, neural network models can be used to identify nonlinear relationships between species. With rich enough data, one may attempt to infer more refined aspects of species associations, for example the presence of so-called competitive intransitivity (Ulrich et al. 2014). But as statistical co-occurrence patterns can be created either by missing environmental covariates or by biotic interactions (Morales-Castilla et al. 2015), the results of such multivariate regression models need to be interpreted with caution (Ovaskainen, Hottola & Siitonen 2010;Pollock et al. 2014).
Joint species distribution models can also be effective tools for bringing functional and phylogenetic perspectives to the analysis of species distribution data. Species traits can be used to model the responses of the species to environmental covariates (Pollock, Morris & Vesk 2012;Brown et al. 2014) and to facilitate the estimation of the species-to-species correlation matrices by considering them as functions of trait dissimilarity (Dorazio & Connor 2014). Accounting for phylogenetic constrains is necessary for obtaining unbiased inference in analyses that consider each species as a data point, and it can also be helpful for disentangling the effects of environmental filtering from those of biotic interactions (Helmus et al. 2007;Ives & Helmus 2011). Further, bringing the phylogenetic perspective to joint species distribution models shifts the emphasis from measures of community similarity based on species identity to corresponding measures based on phylogenetic similarity (Ives & Helmus 2010).
In parallel to the developments aiming to move from single-species perspectives to multispecies perspectives, the need for using spatially explicit species distribution models has become increasingly acknowledged in ecological research, both due to interest on spatial processes per se and due to the need to account for non-independent data points (Dormann et al. 2007). The estimation of spatially structured residuals has been facilitated by computational advances in Bayesian inference, both on Markov chain Monte Carlo (MCMC) sampling methods (Latimer et al. 2009;Chakraborty et al. 2010) and on methods based upon the integrated nested Laplace approximation (Blangiardo et al. 2013). Another increasingly popular approach for bringing spatial structure into species distribution models is the use of spatial eigenvectors derived from the distance matrix among the sampling sites (Borcard & Legendre 2002;Dray, Legendre & Peres-Neto 2006;Dray et al. 2012).
The aim of this study was to integrate joint species distribution modelling and spatially explicit species distribution modelling. Such developments were pioneered by Latimer et al. (2009), who incorporated for each species a spatially structured residual, and estimated species-to-species correlation structure among the spatial effects. As the approach of Latimer et al. (2009) requires the estimation of species-specific spatial effects, it is not suited for large species communities that are often dominated by rare species. Latimer et al. (2009) parameterized their model for four common species only. Here we overcome this limitation by modelling spatial effects at the community level. To do so, we utilize latent factor models, which have recently emerged in the ecological literature (Walker & Jackson 2011;Hui et al. 2015), and for which computationally efficient sampling algorithms are available (Bhattacharya & Dunson 2011). The use of spatial latent factors was recently introduced in the community context by Thorson et al. (2015). While our work is closely related to that developed independently by Thorson et al. (2015), it has the following differences: (i) our modelling approach is developed in the Bayesian framework, and it thus provides the full posterior distribution of parameter uncertainty, (ii) we combine spatial factors with fixed effects and thus partition variation between measured and unmeasured covariates, (iii) we apply the model to all species that make up the community, instead of restricting the analyses to common species only, and (iv) we demonstrate how the approach can be used to assess the geographic scaling of spatial covariance patterns. We demonstrate the predictive power of our modelling approach with data consisting of the occurrences of 55 species of butterflies sampled in Great Britain during 1995-1999 on 2841 grid cells at the resolution of 10 km 9 10 km (Asher et al. 2001).

J O I N T S P E C I E S D I S T R I B U T I O N M O D E L L I N G W I T H S P A T I A L L Y S T R U C T U R E D L A T E N T F A C T O R S
We model the presence-absences or abundances of a set of species using the statistical framework of spatially explicit joint species distribution models. The main advantage of this modelling framework is the use of spatially structured latent factors, which makes it possible to capture the effects of missing covariates, the effects of biotic interactions or the combination of these two. The computational and statistical efficiency of the approach arises from there generally being far fewer latent factors than there are species. This is because all species are modelled with the help of a shared set of latent factors, each species having its own loading for each latent factor. If the latent factors were known covariates, the loadings would simply correspond to regression coefficients which could be estimated using standard techniques. However, it is often the case that species distributions are partly determined by unknown or unmeasurable covariates, or by biotic interactions. These 'hidden covariates' are here accounted for by the latent factors, and as they are not known a priori, they must be estimated. During the model fitting process, also the spatial scale at which each latent factor varies is estimated. For instance, if the latent factor corresponds to a large-scale macroclimatic gradient, the corresponding spatial scale will be much larger than if the latent factor corresponds to a small-scale microclimate gradient or small-scale biotic interactions. Informally, the latent factors, and their spatial scales, are estimated so that they explain as much as possible of the variation in the distributions of all the species simultaneously. Also the number of latent factors is estimated, with the aim of including a sufficient number of latent factors to allow the model to capture as much of the biologically relevant variation as possible, but to avoid overfitting and thus the inclusion of latent factors that model noise rather than signal.
Before turning to the formal description of the model, we illustrate its main idea in Fig. 1. For simplicity, we do not include any measured covariates. Thus, the predictors of the model consist only of two latent factors g 1 and g 2 , shown, respectively, in panels a and b of Fig. 1. The example is constructed to mimic the case of two competing species with overlapping resource use, for example two birds which are both restricted to coniferous forest but that compete for nesting locations within each stand. In such a case, one would expect to see negative co-occurrence over short spatial scales, but positive co-occurrence over large spatial scales (Araujo & Rozenfeld 2014). In Fig. 1, the latent factor g 1 represents the shared resource, and it varies at the large characteristic spatial scale a 1 = 10 spatial units (in Fig. 1, the spatial unit corresponds to the grid cell size). The latent factor g 2 represents the influence of competition, and it varies at the smaller spatial scale of a 2 = 2 spatial units. Each species j has its own loading k jh for each latent factor h, so that species-specific occurrence probabilities are modelled as linear combinations of the latent factors. In the example of Fig. 1, the loadings of species 1 are k 11 = 1, k 12 = 1, so that the linear predictor for species 1 (illustrated in panels c and e) is L 1 = g 1 + g 2 . The loadings of species 2 are k 21 = 1, k 12 = À2, so that the linear predictor for species 2 (illustrated in panels d and f) is L 2 = g 1 À 2g 2 . As both species have a positive loading to the latent factor g 1 , they show positive co-occurrence over large spatial scales. But as their loadings have opposite signs to the latent factor g 2 , their co-occurrence pattern is negative over short spatial scales (Fig. 1g, h).
Let us then turn to a more formal definition of the model. We index by i = 1,. . ., n y the sampling units and by j = 1,. . ., n s the species. While we exemplified the modelling framework Fig. 1. Illustration of the joint species distribution modelling with spatially structured latent factors. The panels a and b show latent factors g 1 and g 2 which have exponentially decaying correlation structure at spatial scales a 1 = 10 and a 2 = 2. The panels c and d show, respectively, the linear predictors for species 1 (L 1 = g 1 + g 2 ) and species 2 (L 2 = g 1 À 2g 2 ), which combine the latent factors with the loading matrix Λ = (1 1; 1 À2). The panels e and f show the occurrence patterns of species 1 and 2 and panel g their co-occurrence pattern, with red and blue denoting occurrences of species 1 and 2, and black denoting the co-occurrence of both species. Panel g shows the spatial covariance functions (Eq. 2), with red and blue depicting the within species covariances q 11 (d) and q 22 (d), and black depicting the between species covariance q 12 (d).
with two species, the model is equally well suitable for communities consisting of a large number of species.
In case of presence-absence data, we model the presence (y ij = 1) or absence (y ij = 0, including the possibility of nondetection) of species j on sampling unit i by probit regression, implemented as y ij ¼ 1 zij [ 0 , where the latent liability z ij = L ij + ij includes the linear predictor L ij and the residual which models the probit link function and is distributed ij~N (0,1) independently among the species and the sampling units. The linear predictors are further modelled as Here x ik is the measured covariate k = 1,. . ., n c for sampling unit i, b kj is the regression coefficient measuring how species j responds to the covariate k, g ih is the (unmeasured) latent factor h = 1,. . ., n f , and the factor loading k jh measures how species j responds to the latent factor h. We included the intercept in the model by setting x i1 = 1.
In the standard non-spatial latent factor model (Bhattacharya & Dunson 2011), the latent factors are assumed to be normally distributed with zero mean and unit variance, g ih~N (0,1). To bring spatial structure for the latent factors, we assumed a spatially homogeneous Gaussian process with is the spatial distance between the sampling units i and i 0 , and d hh 0 is Kronecker delta with value 1 for h = h 0 and with value 0 for h 6 ¼ h 0 . The function f h (d) is a spatial covariance function, normalized to f h (0) = 1 so that its unit is correlation and that the marginal distributions of the latent factors have zero mean and unit variance, similar to non-spatial latent factor models. Here we assume the exponential function f h (d) = exp(Àd/a h ), where a h is the spatial scale of the latent factor h. We note that the standard latent factor model with spatially independent factors induces the covariance structure e~N(0, Ψ) with Ψ = (Λ T Λ) ⊗ I ny where Λ is a matrix of the factor loadings (Bhattacharya & Dunson 2011). In addition to determining covariance at zero distance (Thorson et al. 2015), the spatial structure of the latent factors induces a spatial covariance q jj 0 ðdÞ in the latent factors influencing the occurrences of species j and j 0 , given at distance d by This characterizes species-to-species associations not only at the local level (d = 0) but also their spatial decay, similar to Latimer et al. (2009). As illustrated by Fig. 1h, spatial covariance between species can be positive or negative, corresponding to positive or negative co-occurrence.
To ensure that not more factors are selected than is necessary to explain the data, we follow Bhattacharya & Dunson (2011) by defining a multiplicative gamma process shrinkage prior on the factor loadings. This variant of the Bayesian approach also avoids the need to use a pre-specified structure for the loading matrix as assumed by Thorson et al. (2015). We extended the Gibbs sampling algorithm of Bhattacharya & Dunson (2011) to the present case by utilizing Bayesian multivariate regression for the fixed effects and by incorporating a discrete grid sampler for the spatial scale parameters a h . The technical details of the sampling algorithm are presented in Supplementary material, as well as a MATLAB code for model parameterization.

A C A S E S T U D Y W I T H B U T T E R F L Y D A T A F O R G R E A T B R I T A I N
To illustrate the modelling approach, we consider a case study of 55 species of butterflies. We use the 1995-1999 atlas data (Asher et al. 2001) as presence-absence for each of the 10 9 10 km Ordnance Survey grid cells for Great Britain (n = 2,609 cells). Based on previous studies on butterflies in Great Britain (Hodgson et al. 2011;Bennie et al. 2013), we included as measured covariates (i) the number of growing degree days above 5°C and the percentage of the grid cell cover that consists of (ii) broadleaved woodland, (iii) coniferous woodland and (iv) calcareous substrates ( Fig. 2; see Supplementary material for details on the data). To make the test case challenging, we randomly selected only 300 cells that were used as training data to parameterize the model (Fig. 2) and thus used the remaining 2309 cells for model validation. To assess the influence of spatially structured latent factors on model performance, we first fitted the model with just the four covariates. We then fitted the model with the four covariates and the spatially structured latent factors.
While the focus in this study is on the spatial part of the model, we note that the model belongs to the standard framework of generalized linear mixed models, allowing one to incorporate various hierarchical layers and covariance structures. As an example, we model here the responses of the species (b kj ) to the measure covariates (x ik ) as a function of their functional group. To do so, we classified the species as wider countryside species, specialist species and migratory species. Similarly to Brown et al. (2014) and Ovaskainen & Soininen (2011), we model the vector of regression coefficients for species j with the multivariate normal distribution b j~N (l g(j) , V), where the expected response l g(j) is assumed to be specific to the functional group (g) to which the species belongs to.
Failing to account for spatial autocorrelation (i.e. assuming independence among the data points) is expected to lead to biased estimates of fixed effects (known covariates) and overestimation of their statistical significance . With the butterfly data, the estimates for the fixed effects were more pronounced and had tighter credibility intervals in the model without latent factors than in the model that also includes latent factors. In the model without the latent factors, the 95% credibility interval for the effect of covariates 1, 2, 3 and 4 did not cross zero, respectively, for 50, 42, 11 and 40 species. In contrast, when latent factors were included, the 95% credibility interval for the effect of covariates 1, 2, 3 and 4 did not cross zero for 28, 6, 8 and 10 species (see Supplementary material for the species-specific results). The likely overestimation of fixed effects is visible both in species-and communitylevel predictions, which reflect the covariate layers more pronouncedly than the data. For example, areas with calcareous substrate (Fig. 2d) differ in their species richness from the surrounding areas in a more pronounced way in the model prediction (Fig. 3e) than in the data (Fig. 3d). With the inclusion of the latent factors, this mismatch between the data and the model prediction (Fig. 3f) disappears.
The posterior median estimates (95% credibility intervals) for spatial scale parameters of the two most dominant spatial latent factors are a 1 = 170 (120 À 260) km and a 2 = 170 (120 À 250) km. The first latent factor identifies essentially a north-south gradient, whereas the second one recognizes that the south-eastern part of Great Britain differs in terms of its butterfly community composition from the rest of the country and especially from the north-western part (Fig. 2). The model with spatially structured latent factors appears to better predict the data than the model without the latent factors (Figs. 3 and  4). As expected, the predictive power is generally poorest for species with very low or very high prevalence, as the occurrence of these species varies little. The mean R 2 value is 30% for the model without latent factors, whereas it is 42% for the model with latent factors. Across the 55 species, the model with latent factors had higher R 2 and AUC (Fielding & Bell 1997) values for 54 and 51 species, respectively, when evaluated against the validation data. In addition to improving the average AUC for occurrence of individual species from 0Á86 to 0Á91, including the latent factors improved the root-mean-squared error of species richness, reducing it from 4Á9 species to 3Á2 species.
Taking the average over the species, in the model with latent factors, the proportions of variance (at the level of the linear predictor) attributed to covariates 1-4 were 28%, 3%, 1% and The lower line of panels shows the two most dominant latent factors (i.e. hidden environmental covariates) identified by the model: g 1 (e) and g 2 (f). The black squares in panel g show the 300 randomly selected 10 km 9 10 km grid cells that were used to parameterize the model. The remaining 2309 (shown by grey) were used to test the predictive performance of the model. 3%, whereas they were 54% and 11% for the latent factors 1-2 (Fig. 4). Thus, the covariates contributed 35% to the explained variation and the latent factors the remaining 65%, reflecting the increase in the model's predictive power achieved by adding the latent factors. Taking the average over the species, the amount of variance (at the level of the linear predictor) attributed to covariates 1-4 in the model with latent factors was reduced to 71%, 51%, 102% and 55% of the corresponding values in the model without latent factors. Thus, the latent factors absorbed some of the variation attributed to fixed effects in the model without the latent factors.

Discussion
Statistical methods for joint species distribution modelling have become well-established, but thus far they have lacked a spatially explicit perspective (with the exceptions of Latimer et al. 2009;Thorson et al. 2015). In this study, we have utilized recent progress in latent factor modelling to develop a general statistical framework for spatially explicit joint species distribution modelling. As illustrated by our results, the inclusion of spatial latent factors improves statistical inference of joint species distribution models in three ways. First, failing to account for spatial structure corresponds to the assumption of independent data points, which leads to biased estimates for the effects of measured covariates. The inclusion of spatially structured latent factors is analogous to the inclusion of a spatially structured residual in single-species models, and thus it corrects the inference on the effects of measured covariates. Secondly, the incorporation of spatial latent factors enables spatial interpolation which can greatly improve the predictive power of the model. This was indeed the case in the butterfly case study, where most of the explained variation was attributed to the spatial latent factors. Thirdly, the spatial latent factors identified by the models can be informative, as they can be interpreted as the covariates that influence the species community but are missing from the model, or as the end result of biotic interactions. For example, while we included the growing degree days above 5 degrees as a covariate, the first and most important latent factor identified by the model corresponded to a north-south gradient. This suggests that the north-south gradient correlates with relevant covariates other than the number of growing degree days or that there is random species turnover along this gradient.
As community-level data on species occurrence or abundance typically come from a spatial setting, the possibility to account for spatiality in the analysis phase enables many kinds of applications. Thus, we expect the method presented here to be generally useful for data typically collected in community ecological studies: presence-absence or abundance data acquired for a set of sampling sites, some environmental covariates describing the properties of those sites and the spatial coordinates of those sites. With such data, the modelling approach presented here can be used for assessing the geographic scaling of covariance patterns (Eq. 2; illustrated in Fig. 1h), which can provide information on the type of species interactions (Araujo & Rozenfeld 2014). More generally, the generalized linear modelling framework allows one to partition variation in any community metric (e.g. species richness, evenness, or community dissimilarity) to the influences of measured covariates, to the influences of spatially structured latent factors and to unexplained residual variation. We note that while we have utilized here atlas data that form a regular grid, the method applies directly also to any spatially irregular sampling design. Furthermore, by replacing the distance in two-dimensional space to the one-dimensional distance over time, the method applies as such also for time-series data. In this case, the exponential correlation structure assumed here corresponds to the widely applied AR(1) autoregressive model. The model presented here adds the influence of measured covariates compared to the approach presented by Thorson et al. (2015), but is still ignores many aspects that other approaches developed in community ecology account for. However, as our modelling approach is based on the standard framework of hierarchical generalized linear mixed models, it is of a very general nature and easily extendable to components implemented in previous research to joint species distribution modelling. These include the influence of species traits (Pollock, Morris & Vesk 2012;Brown et al. 2014) and phylogenetic constraints (Helmus et al. 2007;Ives & Helmus 2011), the use of abundance data instead of presence-absence data (Clark et al. 2014) and accounting for detectability (Dorazio et al. 2006).
As illustrated here, spatially explicit community modelling is expected to be useful especially for problems that involve spatial interpolation, for example for predicting species distribution maps from a sparse set of observations. But much interest in community ecology relates also to extrapolation, that is predicting the occurrences of species under environmental conditions not present in the training data, for example after climate change, after habitat loss or in an unexplored region. While it is not expected that spatially explicit community modelling will be able to provide improved mean predictions for extrapolation, it can improve the assessment of uncertainty in such predictions. This is because the modelling framework is able to identify how much of the current species occurrences are influenced by such unknown variation that is structured by space and thus not likely to be just noise. Assuming that the same proportion of the variance will be attributed to unmodelled but spatially structured variation also in the extrapolated situation will enable constructing more realistic confidence intervals than just ignoring such variation.  (2009) R 2 values as a function of the species prevalence, and panel b compares predicted species richness to observed species richness. Both panels are based on fitting the models without spatial latent factors (grey dots) and with spatial latent factors (black dots) to data on 300 training sites (Fig. 2g), and comparing model predictions to the data for the 2309 validation sites. Panel c shows the relative proportions of variance attributed to the measured covariates and to the spatial latent factors. The measured covariates 1-4 are ordered from bottom to top, and coloured grey (covariate 1, i.e. growing degree days), and three levels of red (covariates 2-4). The latent factors 1-2 are shown on top of the measured covariates and are coloured as light blue (latent factor 1) or dark blue (latent factor 2).

Data accessibility
The data used in the case study are provided in the supplementary material. These data were obtained from the following sources: • British butterfly distribution data for the 1995-1999 atlas period gathered by the Butterflies for the New Millennium recording scheme (Asher et al. 2001) are provided in the supplementary material. These data may be used, with appropriate acknowledgement, under a creative commons licence (https://creativecommons.org/licenses/by/4.0/). These data are held by Butterfly Conservation and the Centre for Ecology & Hydrology and are available through http://butterflyconservation.org/111/butterflies-for-the-new-millennium.html and http://data.nbn.org.uk (contact: Richard Fox, rfox@butterfly-conservation.org).
• UK climate data are provided in the supplementary material. We used mean annual number of growing degree days above 5 degrees Celsius for 1995-1999 for Britain at a 10 km Ordnance Survey grid resolution were derived from CRU ts2.1 and CRU 61-90 climate data sets (Barrow, Hulme & Jiang 1993). This involved the anomalies at 0Á5 deg grid resolution being interpolated onto the UK Ordnance Survey 10 km grid and combined with the TIGER climate data (Hill 1995) from mean elevations within grid cells. These data may be used, with appropriate acknowledgement, under a creative commons licence (https://creativecommons.org/licenses/by/4.0/). Original data are data are available from http:// www.alarmproject.net/climate/climate/ • UK Land cover data (LCM2000) are provided in the supplementary material.
We used per cent cover broadleaved woodland and per cent coniferous woodland (Fuller et al. 2002). Per cent cover was calculated as a percentage of the land area within UK Ordnance Survey 10 km grid cell. These data are licensed and must be used in accordance with the open government licence (OGL; http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/). Original data are available through http://www.ceh.ac.uk/landcovermap2000.html.
• UK geology data are provided in the supplementary material. We used the 1 kilometre resolution Soil Parent Material Model detailing 6 basic parent material parameters (derived from the 1:50 000 scale version). We calculated the sum of 1 km squares within each UK Ordnance Survey 10 km grid cell with a calcareous content value of 'HIGH'. Original data were downloaded on 01 January 2015, licensing restrictions, terms and conditions and original data are available (with appropriate acknowledgement) from http://www.bgs.ac.uk/downloads/ start.cfm?id=2899.