Volume 14, Issue 1 p. 146-161
RESEARCH ARTICLE
Open Access

Improving the predictability and interpretability of co-occurrence modelling through feature-based joint species distribution ensembles

Francisca Powell-Romero (Corresponding Author)

School of Veterinary Science, The University of Queensland, Gatton, Qld, Australia

Nicholas M. Fountain-Jones

School of Natural Sciences, University of Tasmania, Hobart, TAS, Australia

Anna Norberg

Centre for Biodiversity Dynamics, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway

Nicholas J. Clark

School of Veterinary Science, The University of Queensland, Gatton, Qld, Australia

First published: 08 June 2022
Citations: 2
Handling Editor: Laura Graham

Abstract

  1. Species Distribution Models (SDMs) are vital tools for predicting species occurrences and are used in many practical tasks including conservation and biodiversity management. However, the expanding minefield of SDM methodologies makes it difficult to select the most reliable method for large co-occurrence datasets, particularly when time constraints make designing a bespoke model challenging. To facilitate model selection for practical out-of-sample prediction, we consider three major challenges: (a) the difficulty of incorporating multiple functional forms for species associations; (b) the limited knowledge on how characteristics of co-occurrence data impact model performance; and (c) whether individual model predictions could be combined to obtain optimised community predictions without the need for bespoke models.
  2. To address these gaps, we propose an ensemble method that uses descriptive features of binary co-occurrence datasets to predict model weightings for a set of candidate SDMs. We demonstrate how this method may be applied through a simple case study that uses five independent Joint Species Distribution Models (JSDMs) and Stacked Species Distribution Models (SSDMs) to predict out-of-sample observations for a diversity of co-occurrence datasets. Moreover, we introduce a novel SSDM that offers the potential to include multiple functional forms for each species while delivering robust community predictions.
  3. Our case study highlights two major findings. First, the feature-based ensemble offered more robust species co-occurrence predictions than the other candidate SDMs while providing insights into the data features that impact model performance. Second, the novel SSDM method was competitive for forecasting species co-occurrences, even when using a simple univariate generalised linear model (GLM) as the base model prior to stacking.
  4. We conclude that feature-based ensembles can provide ecologists with a useful tool for generating species distribution predictions in a way that is reliable and informative. Moreover, the flexibility of the ensemble and the novel SSDM method both offer exciting prospects for incorporating a diversity of functional forms while prioritising out-of-sample prediction.

1 INTRODUCTION

Ecologists are increasingly faced with the task of fitting hundreds or thousands of species distribution models (SDMs) to large species community datasets for applied purposes such as biodiversity conservation and management (Palacio et al., 2021; Velásquez-Tibatá et al., 2019). Such tasks require methods that offer reliable predictions for unsampled areas under time constraints that inhibit the design of bespoke models, that is, models customised to a particular dataset. A major advance to aid in this task has been the development of multivariate approaches, which more realistically capture the co-occurrence of species and their possible interspecific biotic associations (Araújo & Luoto, 2007; Heikkinen et al., 2007; Leathwick et al., 2006; Ovaskainen et al., 2017). Multivariate models that estimate the occurrences of species jointly, that is, simultaneously for all species in a dataset, are referred to as joint species distribution models (JSDMs). JSDMs include nonparametric methods that learn from patterns in the data and utilise classification algorithms to predict species co-occurrence (Ingram et al., 2020), and parametric methods that model species' responses to environmental variables and account for co-occurrence patterns in the residuals (Norberg et al., 2019; Pollock et al., 2014; Wilkinson et al., 2019) or estimate them as joint responses to latent factors (Hui & Poisot, 2016; Ovaskainen et al., 2017; Warton et al., 2015). Alternatively, parametric and nonparametric methods that predict the occurrence of each species individually and aggregate the outcomes to enable multi-species predictions are described as stacked species distribution models (SSDMs; Algar et al., 2009; Calabrese et al., 2014; Distler et al., 2015; Harris et al., 2018; Zurell et al., 2020). 
Selecting the most appropriate method for predicting species distributions is no simple task, requiring the user to navigate an expanding field of alternative approaches whose advantages and disadvantages are not immediately clear. We consider three major gaps in the estimation and application of species distribution models for out-of-sample prediction. First, it is challenging to incorporate different functional forms for each species while producing coherent community predictions. Second, there is little consensus on which aspects of observed data impact model performance. Finally, few studies have described how to combine model predictions into ensemble forecasts, a practice that is widely known to reduce prediction bias in other fields.

Although JSDMs can offer reliable predictions of species co-occurrences in some ecological contexts (Franklin, 1998; Norberg et al., 2019; Thuiller et al., 2003), parametric methods often make assumptions about the functional form of species (Vayssières et al., 2000). This can be problematic when estimating the occurrence of multiple species simultaneously, as this is not necessarily a characteristic that is uniform across all species. Nonparametric methods offer an alternative modelling approach by utilising classification algorithms that require no assumptions about the distributions of the data or model residuals, and thus better cater for species with various functional forms (Vayssières et al., 2000). However, a common pitfall of standard SSDMs is overpredicting outcomes by not accounting for shared responses between species (Dubuis et al., 2011; Guisan & Rahbek, 2011; Calabrese et al., 2014; D'Amen, Dubuis, et al., 2015; Zurell et al., 2020). A recent development fills this gap by allowing each binary vector of species occurrences to be modelled independently, using whichever base univariate model is appropriate, after which the predictions are aggregated (‘stacked’). The stacking is done by learning possible nonlinear multivariate associations (Xing et al., 2020). The authors propose that a weak learner, that is, a method that performs better than random, is appropriate for modelling the associations between the fitted values of other vectors in the dataset and the residuals of the focal vector. Adjusted predictions are then stacked to obtain multivariate predictions (Xing et al., 2020).

There are several reasons why the method proposed by Xing et al. could be advantageous for species distribution modelling. First, it allows for different functional models for each species, meaning that users can freely incorporate relevant domain expertise without being restricted by a single set of assumptions. Incorporating different functional forms for each species simultaneously in JSDMs is challenging, yet this could easily be accommodated by the stacking approach by fitting univariate nonlinear models to each species prior to aggregating the outcomes. Second, the stacking algorithm can potentially estimate complex, nonlinear species' associations without the need for large variance covariance matrices or latent factors, both of which typically assume linearity and can be computationally demanding when modelling many species.

Another challenge in model selection is the limited understanding of which aspects of observed data impact model performance. Studies that have compared different types of models to aid in the model selection process have found inconsistent results concerning the predictive performance of SDMs, JSDMs and SSDMs (Baselga & Araújo, 2010; D'Amen, Pradervand, & Guisan, 2015; Harris et al., 2018; Leathwick et al., 2006; Maguire et al., 2016; Moisen & Frescino, 2002; Norberg et al., 2019; Zhang et al., 2018). In particular, a comparison between all three model types highlighted the need for researchers to undertake the computationally and time-demanding task of fitting subsets of data to various complementary models before undertaking analysis (Norberg et al., 2019). For tasks when time and data constraints are not an issue, bespoke modelling is a highly suitable approach. However, the applied ecologist working with continuously updated datasets that increase in size and complexity may require more feasible alternatives. In such cases, modelling large datasets without fitting customised models requires a deeper understanding of how variation in underlying data structures impacts model performance. While this has been done to understand the structural properties of the models themselves (Elith et al., 2006; Norberg et al., 2019; Wisz et al., 2008), few studies have delved into the structure of the observed data. Data structures can be quantified through features that measure species and community-level characteristics.
These may include species characteristics such as growth rate, elevational distribution range and maximum elevation (Guisan, Graham, et al., 2007; Guisan, Zimmermann, et al., 2007), or extrinsic parameters, such as location error and sample size (Guisan, Graham, et al., 2007; Guisan, Zimmermann, et al., 2007; Norberg et al., 2019; Wisz et al., 2008), features of time-series data (Montero-Manso et al., 2020), or network-based features (Azhagesan et al., 2018). These metrics can provide deeper insights into correlations between data structure and model performance. For example, a dataset with a more sparsely connected co-occurrence network may be more effectively modelled by univariate than multivariate methods. To our knowledge, extensive exploration on how data structures impact SDM predictive performance has not yet been undertaken.

Despite evidence that combining multiple models can improve predictions in diverse scenarios (Atiya, 2020; Gneiting & Raftery, 2005; Murray, 2018; Wang & Srinivasan, 2017), few ecological studies have applied ensemble methods (Araújo & New, 2007). Ensembles offer great advantages for ecological forecasting, as they allow the properties of multiple models to be combined into a single weighted prediction that often reduces prediction error relative to any of its constituent models (Araújo & New, 2007). Moreover, combining models into an ensemble algorithm often allows for a fast approach to yield optimised predictions (Lemke & Gabrys, 2010). No SDM correctly captures the true data generating process, suggesting it can be useful to hedge bets against model misspecification by combining predictions. This is especially true when using a diverse set of candidate models, as combinations from models with different degrees of flexibility should, on average, outperform predictions from individual models across heterogeneous environments. Determining appropriate model weights for ensemble models can be challenging. However, the calculation of features that describe structural differences among observed data offers a direct way to estimate model weights (Kang et al., 2017). Recently, a promising novel time-series ensemble approach was proposed, which uses a suite of descriptive features for each response variable in a multivariate dataset as predictors when training a machine learning algorithm to predict the relative weights of simple forecast models (Montero-Manso et al., 2020). The method won second place in M4, a highly competitive global forecasting competition (Makridakis et al., 2020) and has been applied to both aid in the selection of individual models and to build weighted ensemble models (Kück et al., 2016; Lemke & Gabrys, 2010; Talagala et al., 2018). 
However, while this method has been applied for economic forecasting purposes, to our knowledge, a similar approach has not been applied in ecological modelling. We propose that using binary community features to both understand why some models outperform others and to combine model predictions into a weighted ensemble is a useful avenue of research for building better predictions for communities of species.

Our study explores the effects of the underlying data structure of binary datasets, and how this can be used to predict model weights within an ensemble model to optimise predictions of species distributions. We do this by evaluating the predictive performance of five candidate models and use the deviance residuals from each respective model to generate optimised model weights. Using the optimised weights as the response variables, we build an ensemble algorithm that learns from features describing the composition of the species communities to predict ensemble weights for generating out-of-sample forecasts, a novel approach not yet applied in the field of ecology. We suggest that our framework can be useful for applied modellers seeking to predict the distributions of large sets of species for practical tasks where it is not feasible to undergo the lengthy process of fitting bespoke models.

2 MATERIALS AND METHODS

2.1 Data collection and preparation

We used a total of 30 binary presence–absence co-occurrence datasets across pathogen, vegetation and animal communities. Datasets originally containing abundance or count measures for species occurrence were converted to binary data, where any species with a value ≥1 was considered to be present, that is, assigned a value of 1. Descriptions, number of species and observations, median prevalence and prevalence range are summarised for each dataset in Table 1 (see Supplementary File 1 for more detailed descriptions). To reduce the risk of overfitting, all covariates for each dataset were standardised using principal component analysis (PCA), and the first five principal components (PCs) were selected as predictors, unless fewer PCs were required to explain at least 80% of the variation in the covariate space, as per Norberg et al. (2019), or if fewer PCs were available (see Supplementary File 1 for the number of PCs included for each dataset and the cumulative variation explained by these PCs). All individual models were fitted using the same PCs as covariates for each species. For each model apart from the GBM stacking models and the MVRF, covariates were included as additive linear effects. The MVRF can learn nonlinear effects, while the GBM models did not use covariates (the fitted values and residuals from the univariate GLM were already conditioned on covariates prior to their inclusion in the GBM models).
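The two preparation rules above (binarising counts and choosing how many PCs to keep) can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the function names `binarise` and `n_components_to_keep` are hypothetical, and the sketch assumes the per-PC shares of explained variance have already been computed by a PCA routine.

```python
def binarise(counts):
    """Convert abundance/count rows to presence-absence (1 if value >= 1)."""
    return [[1 if value >= 1 else 0 for value in row] for row in counts]


def n_components_to_keep(explained_ratio, max_pcs=5, target=0.80):
    """Number of leading PCs to retain (hypothetical helper).

    explained_ratio: share of variance per PC, sorted in decreasing order.
    Keeps the fewest PCs whose cumulative share reaches `target`,
    capped at `max_pcs` and at the number of PCs available.
    """
    limit = min(max_pcs, len(explained_ratio))
    cumulative = 0.0
    for k in range(limit):
        cumulative += explained_ratio[k]
        if cumulative >= target:
            return k + 1
    return limit


# Toy inputs (not data from the study).
counts = [[0, 3, 1], [2, 0, 0]]
presence = binarise(counts)

# 80% is reached after two PCs, so only two are kept.
k_small = n_components_to_keep([0.60, 0.25, 0.10, 0.05])

# 80% is never reached within five PCs, so the cap of five applies.
k_capped = n_components_to_keep([0.2, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1])
```

Under this reading, a dataset whose first five PCs explain less than 80% of the variation still contributes five PCs, which is consistent with the minimum cumulative variation of 0.407 reported in Table 3.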

TABLE 1. Description of datasets used for analyses. Values reported for number of species (N species), number of observations (N obs.), number of covariates (N covariates) and prevalence (median and range) are the values after data cleaning for analysis. For original values, see Supplementary File 1.
No. | Dataset | Species | Test/train | N species | N obs. | N covariates | Prevalence (median and range) | Reference
--- | --- | --- | --- | --- | --- | --- | --- | ---
1 | bird_parasites | Malaria parasites in birds | Test | 4 | 449 | 1 | 0.156 (0.098–0.265) | Clark et al. (2016)
2 | helminths | Soil-Transmitted Helminths in School Children | Train | 4 | 8,786 | 19 | 0.139 (0.021–0.375) | Ruberanziza et al. (2019)
3 | fennoscandia_birds | Birds | Train | 141 | 1,800 | 21 | 0.122 (0.010–0.944) | Norberg et al. (2019)
4 | uk_butterflies | Butterflies | Test | 47 | 1,800 | 34 | 0.452 (0.023–0.948) | Norberg et al. (2019)
5 | victoria_plants | Plants | Train | 162 | 1,800 | 19 | 0.025 (0.005–0.148) | Norberg et al. (2019)
6 | usa_trees | Trees | Train | 63 | 1,800 | 38 | 0.043 (0.012–0.339) | Norberg et al. (2019)
7 | norway_vegetation | Vegetation | Test | 242 | 1,800 | 6 | 0.058 (0.007–0.750) | Norberg et al. (2019)
8 | eelgrass | Species found in eelgrass communities | Train | 32 | 96 | 15 | 0.276 (0.042–0.885) | Stark et al. (2020)
9 | shrews | European Shrews | Train | 7 | 2,921 | 8 | 0.163 (0.117–0.687) | Neves et al. (2022)
10 | mussel_parasites | Parasites in mussels | Train | 13 | 720 | 6 | 0.200 (0.014–0.731) | Brian and Aldridge (2021)
11 | lion_infections | Various infectious pathogens in lions | Train | 5 | 105 | 11 | 0.533 (0.333–0.562) | Fountain-Jones et al. (2019)
12 | eucalyptus | Eucalyptus | Train | 20 | 327 | 33 | 0.090 (0.003–0.284) | Pollock et al. (2015)
13 | grassland_birds | Birds | Test | 30 | 560 | 4 | 0.040 (0.002–0.421) | Han et al. (2020)
14 | mulu_birds | Birds | Test | 84 | 166 | 3 | 0.136 (0.036–0.500) | Burner et al. (2019)
15 | usa_birds | Birds | Train | 101 | 1,284 | 28 | 0.031 (0.001–0.450) | Steen et al. (2020)
16 | swiss_birds | Birds | Test | 56 | 1,774 | 53 | 0.240 (0.029–0.726) | Zurell et al. (2020)
17 | swiss_forest | Trees | Train | 63 | 4,816 | 45 | 0.055 (0.012–0.792) | Zurell et al. (2020)
18 | fish_parasites | Parasites in fish | Train | 42 | 3,966 | 8 | 0.028 (0.001–0.364) | Bolnick et al. (2020)
19 | brazil_fish | Fish | Train | 66 | 52 | 12 | 0.077 (0.019–0.481) | Vieira et al. (2020)
20 | reptiles | Reptiles | Train | 104 | 455 | 11 | 0.015 (0.002–0.411) | Escoriza (2020)
21 | canopy_ants | Ants | Train | 99 | 153 | 5 | 0.039 (0.007–0.693) | Adams et al. (2017)
22 | swissalps_plants | Plants | Train | 175 | 912 | 7 | 0.080 (0.024–0.476) | D'Amen et al. (2018)
23 | earthworms | Earthworms | Test | 97 | 1,352 | 4 | 0.004 (0.001–0.708) | Mathieu and Jonathan Davies (2014)
24 | vines | Vines | Test | 42 | 50 | 16 | 0.070 (0.020–0.780) | Delgado and Restrepo (2019)
25 | buffalo_infections | Various infectious pathogens in buffalo | Train | 6 | 343 | 10 | 0.106 (0.088–0.185) | Glidden et al. (2021)
26 | andean_birds | Birds | Test | 159 | 358 | 2 | 0.022 (0.003–0.411) | Montaño-Centellas (2020)
27 | finland_beetles | Beetles | Train | 239 | 152 | 16 | 0.118 (0.026–0.941) | Burner et al. (2021)
28 | germany_beetles | Beetles | Train | 75 | 386 | 11 | 0.031 (0.013–0.277) | Burner et al. (2021)
29 | norway_beetles | Beetles | Test | 125 | 1,111 | 14 | 0.023 (0.005–0.369) | Burner et al. (2021)
30 | nz_forest | Trees | Train | 205 | 964 | 2 | 0.004 (0.001–0.500) | Popovic et al. (2019)

Training our ensemble required measures of model predictive performance across a large number of datasets with a diversity of binary feature profiles. Training datasets were selected by stratifying the number of species in each community, number of PCs and median prevalence into three groups (low, medium and high values). These values were used to select 10 of the 30 datasets to be withheld for testing. One dataset from each combination of three stratified variables was withheld. In cases where only one combination was present, the dataset was withheld as a testing dataset to enable extrapolation. Datasets retained for training and withheld for testing are identified in Table 1. A total of 20 datasets, containing 1,622 binary vectors (64.67%), were used as training datasets, and 10 datasets, containing 886 binary vectors (35.33%), were withheld as testing datasets for the final ensemble model. The median prevalence for the training and testing data was 5.23% (Q1 = 1.81%; Q3 = 13.34%) and 5.17% (Q1 = 1.61%; Q3 = 16.03%) respectively (see also Supplementary File 2 for a visualisation of feature diversity in training and testing datasets). Although we acknowledge that not every vector is necessarily a different species since some species may be present in multiple datasets, the features will vary at the species level when measured in different communities, and therefore for clarity, binary vectors will be referred to as ‘species’.

2.2 Fitting multivariate models and obtaining predictive performance metrics

We fitted a total of five individual models to the 20 training datasets, to replicate what modellers may be faced with if modelling hundreds of species with limited resources. Three of the models will likely be familiar to quantitative ecologists. They included (a) a generalised linear model (Bernoulli outcomes with a logit link function) to be used as the univariate baseline predictions for comparison (GLM-BASE), which was fitted by applying iteratively reweighted least squares; (b) a Multivariate Random Forest model (MVRF) fitted using the Fast Unified Random Forests for Survival, Regression and Classification function (Ishwaran et al., 2008), using a node size of 8 to define the average number of observations in a terminal node; and (c) a Hierarchical Modelling of Species Communities model (HMSC; Tikhonov et al., 2021), which was fitted using two MCMC chains with a burn-in of 2,000 and 1,000 iterations, and with default priors for all model parameters (see Table 2 and Supplementary File 3 for further descriptions of these methods and the R packages used).

TABLE 2. Description of models used to predict species occurrence
Model | Abbreviation | Type | Multi-outcome method | Parametric/nonparametric | R packages | Source
--- | --- | --- | --- | --- | --- | ---
Generalised Linear Model (Baseline) | GLM-BASE | Univariate | Stacked Species Distribution Model | Parametric | stats | R Core Team (2021)
Gradient Boosted Model – Pearson Residuals | GBM-PR | Univariate | Stacked Species Distribution Model | Nonparametric | gbm | Greenwell et al. (2020)
Gradient Boosted Model – Deviance Residuals | GBM-DR | Univariate | Stacked Species Distribution Model | Nonparametric | gbm | Greenwell et al. (2020)
Multivariate Random Forest | MVRF | Multivariate | Joint Species Distribution Model | Nonparametric | randomForestSRC | Ishwaran et al. (2008)
Hierarchical Modelling of Species Communities | HMSC | Multivariate | Joint Species Distribution Model | Parametric | Hmsc | Tikhonov et al. (2021)

To our knowledge, the two remaining models have not been previously used in ecological applications, hence we describe them in more detail here. These models take the original in-sample predictions from a univariate model (in our case, a generalised linear regression model) and learn from the errors in a stacking algorithm to adjust the out-of-sample predictions. In our approach the errors (i.e. residuals from a focal species' GLM) are modelled as a function of the fitted values from other species' univariate GLMs. This allows the model to uncover potentially nonlinear species associations, avoids the need to parameterise a covariance matrix or set of latent factors, and ensures that out-of-sample predictions can be made for the entire community. We included two versions of the stacking model: one that uses the Pearson Residuals (PR) and another that uses Deviance Residuals (DR) as per Xing et al. (2020) from the individual species as the outcome, with the fitted values from the other species included as features in the stacking algorithm. These residuals are defined as:

Pearson Residual:
$$r_{ik} = \frac{Y_{ik} - \hat{P}_{ik}}{\sqrt{\hat{P}_{ik}\,(1 - \hat{P}_{ik})}}$$
Deviance Residual:
$$r_{ik} = \begin{cases} \sqrt{-2 \log \hat{P}_{ik}} & \text{if } Y_{ik} = 1, \\ -\sqrt{-2 \log (1 - \hat{P}_{ik})} & \text{if } Y_{ik} = 0, \end{cases}$$
where $r_{ik}$ denotes the residual for the k-th outcome for the i-th sample in the univariate GLM model, $\hat{P}_{ik}$ denotes the predicted probability from the GLM model and $Y_{ik}$ denotes the binary outcome. Following the fitting of the GBM stacking model, the adjustment of the GLM original univariate predictions was made following Xing et al. (2020). Both stacking models were fitted using the gbm package in R (Greenwell et al., 2020), and parameters were tuned using 50 trees, a maximum depth for each tree of 2, and the default shrinkage parameter of 0.1. We specifically used a weak learner to reduce overfitting and prioritise species associations that may be important for out-of-sample predictions.
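As an illustration of the two residual types, the following minimal Python sketch computes them for a single binary observation using the standard Bernoulli definitions. It is not the authors' R code; the function names and the clipping constant `eps` (added only to avoid taking the logarithm of zero) are assumptions made for this example.

```python
import math


def _clip(p, eps=1e-9):
    """Keep a probability strictly inside (0, 1); eps is an assumed guard."""
    return min(max(p, eps), 1.0 - eps)


def pearson_residual(y, p):
    """Pearson residual for a Bernoulli outcome y with fitted probability p."""
    p = _clip(p)
    return (y - p) / math.sqrt(p * (1.0 - p))


def deviance_residual(y, p):
    """Signed deviance residual for a Bernoulli outcome."""
    p = _clip(p)
    if y == 1:
        return math.sqrt(-2.0 * math.log(p))
    return -math.sqrt(-2.0 * math.log(1.0 - p))


# A confident but wrong prediction yields a large residual of either type.
wrong_pr = pearson_residual(1, 0.05)
wrong_dr = deviance_residual(1, 0.05)
right_dr = deviance_residual(1, 0.95)
```

In the stacking approach described above, residuals of this kind for a focal species would form the response, with the fitted values of the other species as features in the boosted model.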
Each of the training datasets was split into training and testing folds, whereby 70% of the data was randomly selected for fitting the models and the remaining 30% was used for evaluating model predictions. This process was repeated three times for each dataset to capture heterogeneity in model performance among testing folds. To measure predictive accuracy of the five individual models, we binarised predicted probabilities of occurrence into presence/absence using a standard threshold of 0.5 for simplicity, since our datasets contained a range of median prevalence values and species, and as models were not optimised to improve predictions for a particular community but rather to compare model performance. Using the binarised predictions, we calculated out-of-sample recall (the ratio of correctly predicted species present to all observations where the species is actually present), precision (the ratio of correctly predicted species present to the total species predicted to be present) and the F1 statistic (the harmonic mean of precision and recall). We used the F1 score instead of accuracy (total number of correctly predicted observations over the total number of observations) or area under the receiver operating characteristic curve (AUROC; the area under the ROC curve, which plots sensitivity against 1 − specificity to quantify the performance of a model) due to the likely unequal distribution of false positives and false negatives resulting from the large proportion of rare species. This metric is calculated as:
$$F_1\ \text{score} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$
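A minimal Python sketch of these metrics for one species is given below. The function name is hypothetical, and two details are assumptions not stated in the text: probabilities equal to the threshold are treated as presences, and a metric is set to zero when its denominator is zero.

```python
def classification_metrics(observed, predicted_prob, threshold=0.5):
    """Return (precision, recall, F1) after binarising probabilities."""
    # Assumption: a probability exactly at the threshold counts as presence.
    predicted = [1 if p >= threshold else 0 for p in predicted_prob]
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, q in pairs if o == 1 and q == 1)
    fp = sum(1 for o, q in pairs if o == 0 and q == 1)
    fn = sum(1 for o, q in pairs if o == 1 and q == 0)
    # Assumption: undefined ratios (zero denominators) are reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two true positives, one false positive, one false negative.
observed = [1, 1, 0, 0, 1]
predicted_prob = [0.9, 0.4, 0.6, 0.1, 0.7]
precision, recall, f1 = classification_metrics(observed, predicted_prob)
```

Because true negatives do not enter any of the three quantities, a model that predicts a rare species as absent everywhere scores zero on F1 even though its accuracy would be high.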

2.3 Ensemble model

Our goal was to find a weighted ensemble of model predictions (on the probability scale) that could minimise an appropriate binary loss function. In practice, for each species in each evaluation set (i.e. containing the withheld 30% of observations), we optimised weights that minimised the mean squared deviance residual. We accounted for class imbalance by weighting residuals for positive and negative observations by their respective frequencies in the test set when calculating the final mean residual. Optimisations of the unknown model weights were performed using the L-BFGS-B algorithm (Byrd et al., 1995) in the R function optim of the stats package (R Core Team, 2021). For all species we used five separate optimisations with different random starting weights to ensure the parameter space was adequately explored. Final model weights for each species were calculated by taking the mean from the three sets used for training.
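The weight optimisation for one species can be sketched as follows. This is a rough Python illustration under several labelled assumptions, not the procedure used in the study: the ensemble probability is taken to be a normalised weighted average of model probabilities, class imbalance is handled by averaging the squared deviance residuals within each class before averaging across classes, and a simple random-restart hill climb stands in for the L-BFGS-B optimiser. All function names are hypothetical.

```python
import math
import random


def ensemble_probability(model_probs, weights):
    """Weighted average of model probabilities (assumed normalisation)."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, model_probs)) / total


def balanced_deviance_loss(weights, observed, probs_by_obs, eps=1e-9):
    """Mean squared deviance residual, averaged within each class first."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for y, model_probs in zip(observed, probs_by_obs):
        p = ensemble_probability(model_probs, weights)
        p = min(max(p, eps), 1.0 - eps)
        # Squared deviance residual of a Bernoulli outcome.
        sums[y] += -2.0 * math.log(p if y == 1 else 1.0 - p)
        counts[y] += 1
    class_means = [sums[c] / counts[c] for c in (0, 1) if counts[c] > 0]
    return sum(class_means) / len(class_means)


def search_weights(observed, probs_by_obs, n_models,
                   n_starts=5, n_steps=200, seed=1):
    """Random-restart hill climb (a stand-in for L-BFGS-B)."""
    rng = random.Random(seed)
    best_w, best_loss = None, float("inf")
    for _ in range(n_starts):  # five random starting points, as in the text
        w = [rng.random() + 1e-6 for _ in range(n_models)]
        loss = balanced_deviance_loss(w, observed, probs_by_obs)
        for _ in range(n_steps):
            cand = [min(1.0, max(1e-6, wi + rng.gauss(0.0, 0.1))) for wi in w]
            cand_loss = balanced_deviance_loss(cand, observed, probs_by_obs)
            if cand_loss < loss:
                w, loss = cand, cand_loss
        if loss < best_loss:
            best_w, best_loss = w, loss
    total = sum(best_w)
    return [wi / total for wi in best_w], best_loss


# Toy data: the first model is closer to the truth on every observation.
observed = [1, 0, 1, 0, 0, 0]
probs_by_obs = [(0.9, 0.2), (0.1, 0.7), (0.8, 0.4),
                (0.2, 0.6), (0.1, 0.5), (0.3, 0.8)]
weights, loss = search_weights(observed, probs_by_obs, n_models=2)
equal_loss = balanced_deviance_loss([1.0, 1.0], observed, probs_by_obs)
```

In this toy case the search shifts weight toward the first model and the resulting loss falls below that of equal weighting, which is the comparison the ENS and NULL-ENS models formalise later in the paper.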

Our ensemble model was a multivariate random forest that was trained to predict optimal model weights for a set of binary observations based on features that described the structures and community contexts of those observations. We calculated 23 features to describe the characteristics of species individually and within their community, as well as features to describe the overall nature of community structure (Table 3). These features included three measures of prevalence, the numbers of observations and species, network analysis metrics, measures of species ‘uniqueness’, measures describing characteristics of the Markov Random Field (MRF) Networks, and features that describe the predictors and covariates for each of the datasets (See Supplementary File 4 for histograms showing the distribution of features across all, training, and testing datasets). Note that this set of features is not exhaustive, and it would be fruitful and ecologically interesting to consider other features to describe variation among species' observation vectors.

TABLE 3. Description of features used to define community structures for inclusion in ensemble as predictors of model weights. Value range shows the min and max values for each feature across both training and testing datasets
No. | Feature | Description | Level | Value range
--- | --- | --- | --- | ---
1 | Prevalence | Describes how rare or common a species is | Species | 0.001, 0.948
2 | Prevalence Rank | Describes how rare or common a species is relative to the other species within a community | Species | 0.004, 1
3 | Prevalence Standard Deviation | Describes how much variation in prevalence there is within a community | Community | 0.026, 0.326
4 | Number of observations | Describes how many sampling units are present in the dataset | Community | 50, 8786
5 | Number of Species | Describes how many species are present within a community | Community | 4, 242
6 | Degree Centrality | Describes the number of species with which one species co-occurs | Species | 0, 1
7 | Eigenvector Centrality | Describes how influential one species is within the community | Species | <0.001, 1
8 | Betweenness Centrality | Describes how influential one species is within a community | Species | 0, 1.415
9 | Modularity (Newman's Q) | Describes the structure of the species network in terms of clustering | Community | −1.459, 0.515
10 | Mean Jaccard Distance | Describes how unique individual species are relative to others | Species | 0.659, 1
11 | Mean Jaccard Distance Standard Deviation | Describes the variation in how unique species in a community are | Community | 0.004, 0.119
12 | Mean Sørensen–Dice Distance | Describes how unique individual species are relative to others | Species | 0.539, 1
13 | Mean Sørensen–Dice Distance Standard Deviation | Describes the variation in how unique species in a community are | Community | 0.010, 0.138
14 | Mean Sørensen Index | Describes the similarity between two samples of binary observations | Species | 0.355, 0.962
15 | Mean Sørensen Index Standard Deviation | Describes the variation of the Sørensen Index within the community | Community | 0.093, 0.345
16 | MRF Intercept | Describes the probability of occurrence (on the logit scale) when all other species are equal to 0 | Community | −49.943, 4.066
17 | MRF Network Information | Describes how connected the MRF graph is overall. This metric is normalised by the number of species in the data | Community | 0.641, 85.825
18 | MRF Network Information Standard Deviation | Describes the variation in the MRF Network Information within a community | Community | 0.134, 2.076
19 | MRF Trace | Describes the total amount of dispersion of the variables in the MRF network | Community | −2.734, 4.387
20 | Log Determinant | Describes the correlations among pairs of variables in the MRF network | Community | −0.943, 0.177
21 | Number of Covariates | The number of raw predictors in the dataset used to run the PCA to prepare covariates for analysis | Community | 1, 53
22 | Number of PCs | The number of PCs included as covariates in the analysis | Community | 1, 5
23 | Cumulative Variation Explained by PCs | The cumulative variation explained by the PCs included in the analysis | Community | 0.407, 1
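To make a few of the species-level features in Table 3 concrete, the sketch below computes prevalence, a normalised prevalence rank and the mean Jaccard distance from a small sites-by-species matrix. It is a Python illustration with hypothetical function names. The exact definitions used in the study are not given here, so the rank scaling (rank divided by the number of species, with ties broken by column order) and the handling of two all-absent species (distance of 0) are assumptions.

```python
def jaccard_distance(a, b):
    """1 - |shared presences| / |sites where either species is present|."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Assumption: two species absent everywhere are treated as identical.
    return 1.0 - both / either if either else 0.0


def species_features(matrix):
    """matrix: rows are sampling units, columns are species (0/1 values)."""
    n_species = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_species)]
    prevalence = [sum(col) / len(col) for col in columns]

    # Assumed scaling: rank 1..n by increasing prevalence, divided by n.
    order = sorted(range(n_species), key=lambda j: prevalence[j])
    rank = [0.0] * n_species
    for position, j in enumerate(order, start=1):
        rank[j] = position / n_species

    mean_jaccard = []
    for j in range(n_species):
        dists = [jaccard_distance(columns[j], columns[k])
                 for k in range(n_species) if k != j]
        mean_jaccard.append(sum(dists) / len(dists))
    return prevalence, rank, mean_jaccard


# Toy community: four sampling units, three species.
community = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
prevalence, rank, mean_jaccard = species_features(community)
```

Each species then contributes one row of features, which is the form in which features enter the ensemble as predictors of model weights.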

2.4 Ensemble model performance

We used the 10 datasets excluded from the model training to test the predictive accuracy of our ensemble model relative to the individual models. We again used a 70–30 split for validation. For the training dataset containing 70% of the data, we fit the candidate models as described above. We then calculated the 23 features to use as new data in the ensemble algorithm (‘ENS’) to predict weights for each species to generate weighted ensemble predictions. We also generated a null ensemble model (‘NULL-ENS’) for comparison that assigned equal weightings for each candidate model. We then calculated performance metrics as above for the five individual models as well as the two ensemble models.

As our case study aimed to describe a proof-of-concept, all models used in our study were fitted using default configurations. However, it is important to note that an ensemble could just as easily be fitted to bespoke models to capture domain knowledge and tune model parameters, which would likely increase prediction performance. All models were implemented in the R environment, version 4.0.2 (R Core Team, 2021).

3 RESULTS

3.1 Variability among individual model performance

Models were compared based on their predictive performance using classification metrics (recall, precision and F1) for a total of 1,622 binary vectors (referred to as ‘species’ here), which we grouped into four prevalence groups for initial exploration: rare, with prevalence <10% (n = 1,110), uncommon (prevalence 10 to 30%; n = 339), common (prevalence 30 to 75%; n = 160) and very common (prevalence >75%; n = 13). For rare species, out-of-sample F1 performance was comparable between the GBM-DR and HMSC methods, which both performed substantially better than the GLM-BASE by 52.34% and 48.11% respectively (Figure 1). Similarly, for uncommon species HMSC (70.50% average net improvement) and GBM-DR (59.88% improvement), along with the GBM-PR (44.54% improvement), performed better than the base, while MVRF performed slightly worse (by 1.77%). The relative performances of HMSC and GBM-DR were highest for uncommon species and decreased as prevalence increased, with GBM-DR performance falling below the GLM-BASE model performance for common species (by 6.25%) and for both GBM-DR and HMSC for very common species (by 84.62% and 100.00% respectively). HMSC and GBM-DR both showed higher recall values compared to the GLM-BASE model across all prevalence categories except for ‘Very Common’, where they both performed significantly worse than the GLM-BASE in terms of recall (both by 100.00%). HMSC and GBM-DR also showed improvements over the GLM-BASE in terms of precision for ‘Rare’ species (by 24.23% and 35.67% respectively). See Supplementary File 5 for all comparisons for precision and recall, as well as values used to calculate percentages of net improvement for F1 by prevalence category.
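The grouping and the percentage comparisons above can be sketched in a few lines of Python. The function names are hypothetical, the treatment of species falling exactly on a group boundary is an assumption, and net improvement is assumed to be the difference from the baseline F1 expressed as a percentage of the baseline F1, which the text does not state explicitly.

```python
def prevalence_group(prevalence):
    """Assign a species to one of the four prevalence groups."""
    # Assumption: boundary values fall into the lower-named middle groups.
    if prevalence < 0.10:
        return "rare"
    if prevalence <= 0.30:
        return "uncommon"
    if prevalence <= 0.75:
        return "common"
    return "very common"


def net_improvement(model_f1, base_f1):
    """Percentage change in F1 relative to the baseline (assumed formula)."""
    return 100.0 * (model_f1 - base_f1) / base_f1


groups = [prevalence_group(p) for p in (0.02, 0.15, 0.50, 0.90)]
gain = net_improvement(0.30, 0.20)   # model improves on the baseline
loss = net_improvement(0.00, 0.20)   # model never predicts a true presence
```

Under this assumed formula, a decrease of 100% corresponds to an F1 of zero, that is, no correctly predicted presences in the evaluation folds.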

FIGURE 1. Relative performance of the compared models MVRF, HMSC, GBM-PR and GBM-DR (see Table 2 for details), measured using the F1 metric (the harmonic mean of precision and recall), compared to the baseline GLM-BASE model by species prevalence for 1,622 species. Species prevalence is classified into four categories: <10% as ‘rare’ (1,110 species), between 10% and 30% as ‘uncommon’ (339 species), between 30% and 75% as ‘common’ (160 species) and >75% as ‘very common’ (13 species). HMSC and GBM-DR perform best for ‘rare’ species; HMSC, GBM-DR and GBM-PR perform best for ‘uncommon’ species; all models perform similarly for ‘common’ species; and HMSC and GBM-DR perform worse than the GLM-BASE model for ‘very common’ species.

3.2 Predicted model performance based on data features

Across the datasets used to train the ensemble, the mean weightings (as percentages) for the models in the ensemble were: 8.80% for GLM-BASE, 23.70% for GBM-DR, 7.95% for GBM-PR, 70.39% for HMSC and 10.52% for MVRF. Predicted response functions from the ensemble can be used to interrogate how model performance is related to particular features of a community dataset, providing useful insights for improving both domain knowledge and model performance. In our case study, prevalence, eigenvector centrality and degree centrality were the three most important predictors of variation in performance across all five models, while betweenness centrality was the least informative (Figure 2). Across all metrics, HMSC was consistently attributed the highest weights, but its weights also showed the greatest variability across prevalence values. For rare species, HMSC was clearly the prioritised method (Figure 3). For common species and, in particular, species with mid-range prevalence values, the differences in weights between HMSC and MVRF were much less pronounced (see Supplementary File 6 for the response functions for the remaining 20 features included in our case study). With the exception of prevalence, which ranked as the most important predictor of model weighting for GLM-BASE, GBM-PR, HMSC and MVRF, the most influential features on model weights were co-occurrence network features, with eigenvector centrality surpassing prevalence as the most important predictor for GBM-DR. GBM-DR and HMSC were the models most influenced by the 23 features overall, with higher relative importance values across multiple features compared to the other models. In particular, the contrast between these two models in terms of feature importance highlights that individual features influence performance differently for each model (Figure 2).

FIGURE 2. Heatmap showing the relative importance of each of the 23 features (see Table 3 for details) for the compared models GLM-BASE, GBM-DR, GBM-PR, MVRF and HMSC (see Table 2 for details) in predicting model weights by the ensemble model. Features are ranked from highest to lowest relative importance across all five models.
FIGURE 3. Model weight response functions for the three features with highest relative importance (prevalence, eigenvector centrality and degree centrality) for the GLM-BASE (pink), GBM-DR (blue), GBM-PR (green), HMSC (orange) and MVRF (yellow). Functions were estimated by holding all other feature predictors at their mean value and predicting from the ensemble random forest. (a) Response function shows that HMSC receives the highest model weighting by the ensemble model. MVRF is attributed the second highest weighting based on prevalence, and peaks at mid-range prevalence values. At a prevalence of 0.5, HMSC and MVRF are assigned mid-range weighting values. (b) Response function shows that HMSC is attributed the highest weightings across all eigenvector centrality values, peaking at lower values. GBM-DR is the second-best performing model. (c) Response function shows that HMSC and GBM-PR are attributed the highest and second highest weightings across all degree centrality values, respectively, with complementary performance for number of observations.
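The way these response functions were estimated (varying one feature while holding all others at their mean value) can be sketched in Python as follows. This is a minimal illustration rather than the study's implementation: `response_function` is a hypothetical helper, and `toy_weight_model` is an invented stand-in for the fitted ensemble random forest, used only so that the example runs.

```python
from statistics import mean


def response_function(predict, feature_rows, focal_feature, grid):
    """Predict over a grid of values for one feature, with the rest at their means.

    predict       -- callable taking a dict of feature values, returning a weight
    feature_rows  -- list of dicts, one per species, all with the same keys
    focal_feature -- name of the feature to vary
    grid          -- values of the focal feature at which to predict
    """
    # Hold every feature at its mean across species.
    baseline = {name: mean(row[name] for row in feature_rows)
                for name in feature_rows[0]}
    curve = []
    for value in grid:
        point = dict(baseline)
        point[focal_feature] = value  # vary only the focal feature
        curve.append((value, predict(point)))
    return curve


def toy_weight_model(features):
    # Hypothetical stand-in for the trained ensemble model; NOT from the study.
    return 0.2 + 0.5 * features["prevalence"] + 0.1 * features["degree_centrality"]


rows = [
    {"prevalence": 0.1, "degree_centrality": 1.0},
    {"prevalence": 0.3, "degree_centrality": 3.0},
]
for value, weight in response_function(toy_weight_model, rows, "prevalence", [0.0, 0.5, 1.0]):
    print(value, round(weight, 3))
```

In practice the `predict` callable would wrap whichever fitted model produces the weights, with one curve drawn per constituent model.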

3.3 Ensemble model performance comparable with best performing models

We tested the predictive performance of our ensemble (ENS) and an equally weighted ensemble (NULL-ENS). Overall, the GBM-DR performed the best based on the F1 statistic, followed by the ENS and HMSC (Figure 4). Out of the 886 species included in the final validation set, the ENS had the greatest net improvement (51.13%), followed by HMSC (48.87%) and GBM-DR (46.50%; Table 4). Of all six models tested, GBM-PR and the ENS provided the most robust predictions, yielding the lowest proportions of species with F1 scores worse than those of the GLM-BASE (3.95% and 5.87% respectively), followed by the MVRF (7.79%) and the GBM-DR (8.92%). The GBM-DR method showed the highest improvement in precision (34.20%), followed by the ENS (33.30%) and HMSC (24.60%). In contrast, HMSC showed the highest improvement in recall (68.85%), followed by the ENS (59.26%) and the GBM-DR (48.98%; see Supplementary File 7 for the tabulated results for precision and recall values and boxplots, as well as results for accuracy and deviance residual performance metrics).

FIGURE 4. Relative performance of the Null Ensemble (NULL-ENS), MVRF, HMSC, GBM-PR, GBM-DR and the Weighted Ensemble (ENS) relative to the base GLM model (GLM-BASE), as measured by the F1 statistic (the harmonic mean of precision and recall). The GBM-DR, HMSC and ENS models perform significantly better than the GLM-BASE.
TABLE 4. Improvement of predictions over the GLM-BASE for each method based on the F1 statistic for 886 species in the test datasets
Method | Positive difference (adj. F1 > 0.02) | No difference (adj. F1 −0.02 to 0.02) | Negative difference (adj. F1 < −0.02) | Net improvement (positive − negative)
ENS | 505 | 329 | 52 | 453
NULL-ENS | 52 | 695 | 139 | −87
GBM-DR | 491 | 316 | 79 | 412
GBM-PR | 207 | 644 | 35 | 172
MVRF | 209 | 608 | 69 | 140
HMSC | 540 | 239 | 107 | 433
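The arithmetic behind Table 4 and the net improvement percentages reported in the text can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes that ‘adj. F1’ denotes the per-species difference in F1 relative to GLM-BASE, classified with the ±0.02 margin shown in the table header.

```python
def summarise_differences(f1_differences, margin=0.02):
    """Count positive, neutral and negative F1 differences relative to the baseline.

    Assumption: each value is (model F1 - GLM-BASE F1) for one species.
    """
    positive = sum(1 for d in f1_differences if d > margin)
    negative = sum(1 for d in f1_differences if d < -margin)
    no_difference = len(f1_differences) - positive - negative
    return {"positive": positive, "no_difference": no_difference,
            "negative": negative, "net": positive - negative}


def net_improvement_percent(positive, negative, n_species):
    """Net improvement (positive minus negative) as a percentage of all species."""
    return 100.0 * (positive - negative) / n_species


# Counts taken from Table 4 (886 species in the test datasets).
table4 = {"ENS": (505, 52), "HMSC": (540, 107), "GBM-DR": (491, 79)}
for method, (pos, neg) in table4.items():
    print(method, round(net_improvement_percent(pos, neg, 886), 2))
```

With the Table 4 counts this gives 51.13% for the ENS, 48.87% for HMSC and 46.50% for GBM-DR, matching the values reported in Section 3.3.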

4 DISCUSSION

Given the overwhelming volume of SDMs available and their high variability in performance for predicting species distributions, selecting an appropriate model for analysis is not a straightforward task and often requires the lengthy process of fitting several models with complementary performance. This is not always feasible for ecologists seeking to model hundreds or thousands of species under time constraints. We proposed an ensemble approach that could be used to determine a weighted value for the performance of each desired model based on features of the data. While initial training of the proposed ensemble also requires fitting individual models, and as such will be equally time-consuming, a continuously trained ensemble model could significantly reduce computational times for practitioners. Ultimately, such a model could remove the need to fit all constituent models to each new dataset, and could instead be used as a tool to select the single model best suited to that dataset. Alternatively, over time this model could also be used to select a subset of models to be fitted as an ensemble, together with their respective weights, as a platform for providing more robust predictions than individual JSDMs or SSDMs, as demonstrated by our case study.

In practical settings, SDMs for hundreds or thousands of species are widely applied for management and conservation purposes (Palacio et al., 2021; Velásquez-Tibatá et al., 2019). In this case study, we illustrate a basic example of how a feature-based ensemble may be applied to a small subset of SDMs to improve species occurrence predictions. Our findings demonstrated a net improvement over GLM-BASE, as measured by the F1 statistic, of 51.13% for the ENS model, 2.26 percentage points higher than the second-best performing model, the HMSC, and 4.63 percentage points higher than the GBM-DR (Table 4). These findings support the idea that combining predictions of multiple models within an ensemble algorithm helps to reduce the biases from individual constituent models, offering predictions that are both robust and reliable (Araújo & New, 2007). The ENS was also competitive on the other performance metrics estimated from the binary predictions, ranking second for both precision and recall, which highlights its ability to detect true presences. Similarly, the ENS was competitive on the performance metric estimated from probability predictions (deviance residuals); however, it performed relatively poorly in terms of accuracy, suggesting that the ENS may be unable to predict absences as accurately as other models (given that the median prevalence value for the testing datasets is 5.17%; see Supplementary File 7 for tabulated results for the various performance metrics). These findings suggest that consideration of the most appropriate performance metric for the data is important when selecting a model for use.
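The idea of combining constituent model predictions through weights can be sketched in Python as follows. This is a simplified illustration, not the study's implementation: the function names, the probabilities and the weights are invented for the example, and normalising the weights to sum to one is an assumption. Setting all weights equal corresponds to the equally weighted NULL-ENS.

```python
def ensemble_probability(model_probabilities, model_weights):
    """Weighted average of occurrence probabilities from several models.

    Both arguments are dicts keyed by model name. Weights are normalised
    to sum to one (an assumption made for this sketch).
    """
    total = sum(model_weights[m] for m in model_probabilities)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(model_weights[m] * model_probabilities[m]
               for m in model_probabilities) / total


def binarise(probability, threshold=0.5):
    """Convert a probability to presence (1) or absence (0)."""
    return 1 if probability >= threshold else 0


# Hypothetical predictions for one species at one site.
probs = {"GLM-BASE": 0.30, "GBM-DR": 0.70, "GBM-PR": 0.40, "HMSC": 0.80, "MVRF": 0.35}
# Hypothetical feature-based weights for this species (not values from the study).
weights = {"GLM-BASE": 0.05, "GBM-DR": 0.25, "GBM-PR": 0.05, "HMSC": 0.55, "MVRF": 0.10}
equal = {name: 1.0 for name in probs}  # equally weighted (null) ensemble

ens = ensemble_probability(probs, weights)
null_ens = ensemble_probability(probs, equal)
print(round(ens, 3), binarise(ens), round(null_ens, 3), binarise(null_ens))
```

In the feature-based approach the weights would differ for every species, because they are predicted from that species' descriptive features rather than fixed in advance.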

To enable robust and optimised predictions, our methodological approach utilises simple descriptive features that describe species and their associated communities. As such, these features provide insights into why and when some models outperform others, improving the interpretability of model performance. In particular, our findings highlight the importance of features that relate to the co-occurrence network (Figure 2). This is particularly evident in the response functions for several network metrics, which show the variability in attributed weights as the association between species differs (see Supplementary File 6). For example, as the value of the ‘MRF Network Information’ feature increases, the weighting attributed to the GBM-DR model increases, while the weighting attributed to the HMSC model decreases within the ensemble (Supplementary Figure 6-17). This suggests that co-occurrence datasets with more or stronger associations between species, that is, datasets in which the presence of one species has a greater influence on the presence or absence of another, tend to favour the GBM-DR method more and the HMSC method less. This provides important and useful evidence that multivariate structure in the observed data can be a key indicator of which models are likely to perform best. While previous studies have attempted to interpret why some models outperform others in particular situations, usually by using post-hoc descriptive statistics (e.g. Norberg et al., 2019), our study uniquely quantified these associations through features that describe characteristics of binary co-occurrence data. Thus, the results, together with the valuable insights into how models perform relative to features, offer promise for the feature-ensemble method's broader applications.

While our example highlights the utility of ensemble modelling without necessarily having to fit a bespoke model, the flexibility of this approach means that users could incorporate more bespoke, knowledge-driven models. Bayesian models with context-specific prior information can readily be included (Clark et al., 2017; Ovaskainen & Soininen, 2011), as well as models that rely solely on expert opinion to estimate species occurrence (Velásquez-Tibatá et al., 2019). Beyond ecology, ensembles that combine a diversity of expert-driven predictions have demonstrated their superiority compared to individual models in many settings, such as forecasting weekly deaths from COVID-19 in the USA (https://viz.covid19forecasthub.org/). Evaluating the performance of the feature-based ensemble method using more specialised individual models offers exciting avenues for future investigations.

Our findings also highlight some of the strengths and limitations of the individual constituent models. Of particular note is the GBM-DR method, whose competitive performance offers some valuable insights into the importance of learning from other species to predict the occurrence of a focal species, adding to the growing body of evidence regarding the importance of accounting for biotic associations in species distribution modelling (Araújo & Luoto, 2007; Heikkinen et al., 2007; Leathwick et al., 2006; Ovaskainen et al., 2017). While our GBM-DR model only used GLMs as the base models for all species and a weak GBM learner as the stacker, in principle, a wide variety of models could be applied to each individual species prior to stacking. The flexibility of the approach means that users can potentially incorporate any model of any form, so long as they can generate fitted values and residuals, and there is opportunity to use other learners to optimise the stacking predictions (Xing et al., 2020).

Another advantage of the SSDM approach is the ability to estimate nonlinear species associations, rather than relying on additive-only associations described by loadings on latent factors, as in the HMSC approach, or estimated from the full covariance matrix (Clark et al., 2018; Ovaskainen et al., 2016), which can be slow and inefficient for large and complex datasets (Norberg et al., 2019; Pichler & Hartig, 2021). Covariates could also be included within the stacking learner, which could in principle capture how species associations change across environmental gradients. This ability to use recent advances from machine learning for the stacking model coincides with the rising need for interpretable machine learning processes to interrogate and understand these models. For example, the recently developed Multi-response Interpretable Machine Learning (MrIML) framework offers a flexible approach that compares the performance of multivariate models and delivers interpretable outputs, which could be used to better understand the associations estimated in the stacking model (Fountain-Jones et al., 2021).

Beyond our case study, the feature-based ensemble framework could be adapted to suit different end user requirements. For example, while we used deviance residuals to obtain the initial model weights to train the ensemble model, different loss functions including Pearson residuals or even classification metrics such as F1 scores could be used instead. Uncertainty could also be incorporated by optimising on a penalised prediction interval rather than on a point metric such as the deviance residual, although this approach is more challenging when considering methods such as GBM-DR and GBM-PR, as there is no convenient way to quantify their prediction uncertainty. Alternatively, identifying more precise ways than using posterior means to calculate point predictions from Bayesian posterior distributions (as we did here) could allow for optimisation of the Bayesian methods where models do not allow for quantification of prediction uncertainty. For simplicity, we set the binarisation threshold for species predictions to 0.5, but this arbitrary value could also be optimised to improve each model's predictive ability.
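Two of the options discussed above, alternative residual-based loss functions and an optimised binarisation threshold, can be sketched in Python. The residual formulas are the standard definitions for binary (Bernoulli) outcomes; how residuals were converted into model weights in the study is not reproduced here. The function names, the candidate thresholds and the choice of F1 as the criterion for threshold selection are assumptions made for illustration.

```python
import math


def deviance_residual(observed, probability, eps=1e-12):
    """Standard deviance residual for a 0/1 observation and a predicted probability."""
    p = min(max(probability, eps), 1.0 - eps)  # guard against log(0)
    if observed == 1:
        return math.sqrt(-2.0 * math.log(p))
    return -math.sqrt(-2.0 * math.log(1.0 - p))


def pearson_residual(observed, probability, eps=1e-12):
    """Standard Pearson residual for a 0/1 observation and a predicted probability."""
    p = min(max(probability, eps), 1.0 - eps)
    return (observed - p) / math.sqrt(p * (1.0 - p))


def f1_at_threshold(observed, probabilities, threshold):
    """F1 score after binarising probabilities at the given threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for o, q in zip(observed, predicted) if o == 1 and q == 1)
    fp = sum(1 for o, q in zip(observed, predicted) if o == 0 and q == 1)
    fn = sum(1 for o, q in zip(observed, predicted) if o == 1 and q == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(observed, probabilities, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(observed, probabilities, t))


# Toy data: six sites for one species (illustrative values only).
observed = [1, 1, 0, 0, 1, 0]
probabilities = [0.9, 0.4, 0.35, 0.1, 0.45, 0.2]
print(best_threshold(observed, probabilities, [0.3, 0.4, 0.5]))
print(round(deviance_residual(1, 0.5), 4), pearson_residual(1, 0.5))
```

A threshold chosen this way should be selected on training or validation data rather than on the test data, otherwise the reported out-of-sample performance would be optimistic.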

5 CONCLUSIONS

Improving the predictability and interpretability of species distribution models for practical applications requires more than comparisons of model performance across ecological contexts: it requires a deeper understanding of how co-occurrence data drive model performance and better ways of accounting for variations in species associations. In our study, we have demonstrated the utility of a flexible feature-based ensemble approach with the capacity to retrieve accurate and robust predictions rapidly over a range of ecological contexts, without necessarily needing to fit highly specialised models. Within the case study used to highlight the potential applications of our ensemble, we have also introduced a new SSDM approach with great potential for future applications in ecological modelling.

AUTHORS' CONTRIBUTIONS

F.P.-R. conducted the data analysis, prepared figures and led the writing of the manuscript; N.J.C. led the project, conceived the ideas for the project and contributed critically to the syntax and design of the methodology; N.M.F.-J. provided valuable input for the methodological design and provided recommendations for improving results; A.N. provided valuable recommendations for refining models and results based on her expertise. All authors contributed critically to the drafts and gave final approval for publication.

ACKNOWLEDGEMENTS

This project was funded by an ARC Discovery Early Career Researcher Award (DE210101439) and by an Australian Research Council Discovery Project Grant (DP190102020). Open access publishing facilitated by The University of Queensland, as part of the Wiley - The University of Queensland agreement via the Council of Australian University Librarians.

CONFLICT OF INTEREST

The authors declare that they have no conflict of interest.

PEER REVIEW

The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13915.

DATA AVAILABILITY STATEMENT

All data were obtained from open-source databases. Original and cleaned versions of the datasets, code and guided workflow for the analysis of this study can be found on the Zenodo Repository https://doi.org/10.5281/zenodo.6565339 (Powell-Romero et al., 2022).