Reliably predicting pollinator abundance: Challenges of calibrating process‐based ecological models

Pollination is a key ecosystem service for global agriculture but evidence of pollinator population declines is growing. Reliable spatial modelling of pollinator abundance is essential if we are to identify areas at risk of pollination service deficit and effectively target resources to support pollinator populations. Many models exist which predict pollinator abundance but few have been calibrated against observational data from multiple habitats to ensure their predictions are accurate. We selected the most advanced process‐based pollinator abundance model available and calibrated it for bumblebees and solitary bees using survey data collected at 239 sites across Great Britain. We compared three versions of the model: one parameterised using estimates based on expert opinion, one where the parameters are calibrated using a purely data‐driven approach and one where we allow the expert opinion estimates to inform the calibration process. All three model versions showed significant agreement with the survey data, demonstrating this model's potential to reliably map pollinator abundance. However, there were significant differences between the nesting/floral attractiveness scores obtained by the two calibration methods and from the original expert opinion scores. Our results highlight a key universal challenge of calibrating spatially explicit, process‐based ecological models. Notably, the desire to reliably represent complex ecological processes in finely mapped landscapes necessarily generates a large number of parameters, which are challenging to calibrate with ecological and geographical data that are often noisy, biased, asynchronous and sometimes inaccurate. Purely data‐driven calibration can therefore result in unrealistic parameter values, despite appearing to improve model‐data agreement over initial expert opinion estimates. 
We therefore advocate a combined approach where data‐driven calibration and expert opinion are integrated into an iterative Delphi‐like process, which simultaneously combines model calibration and credibility assessment. This may provide the best opportunity to obtain realistic parameter estimates and reliable model predictions for ecological systems with expert knowledge gaps and patchy ecological data.


| INTRODUCTION
Pollination is a key ecosystem service underpinning the reproduction of many flowering plants, including many crops. Pollinators enhance production in ∼75% of globally significant crops, adding >$235bn p.a. of productivity and substantially increasing the nutritional security of people all over the world (Breeze et al., 2016; Smith et al., 2015). However, pollinator populations are under increasing pressure from landscape simplification (Kennedy et al., 2013), agrochemical use (Rundlöf et al., 2015; Woodcock et al., 2017) and climate change (Kerr et al., 2015), and there is growing evidence of instability in pollinator-dependent crop yields (Garibaldi et al., 2011; Garratt et al., 2014). Unless addressed, these pressures are expected to cause significant declines in global pollinator diversity in the coming decades (Balfour et al., 2018; Rasmont et al., 2015), threatening global food security.
To date, very few countries have sufficient data to monitor pollinator abundance (O'Connor et al., 2019) or diversity (Carvalheiro et al., 2013; Kerr et al., 2015; Powney et al., 2019) and therefore cannot reliably identify areas suffering declines or at risk of suboptimal pollination services (Garibaldi et al., 2011). Although field monitoring of national scale trends in pollinators and pollination services is both scientifically and economically viable (Breeze et al., 2020; O'Connor et al., 2019), it will take several years to build up such databases. Until then, additional approaches are needed to help target resources to support pollinator populations.
Spatial modelling of pollinator populations can support decision-making and is essential to predict the effects of future land-use change on pollinator populations. The simplest spatial models of pollination are based purely on crop forage distance from semi-natural habitat (Priess et al., 2007). Other studies assign habitat quality scores to all habitat types in the landscape (Nogué et al., 2016; Schulp et al., 2014), but this does not capture the fact that pollinators may use different habitats for different resources. The more sophisticated InVEST pollinator model, developed by Lonsdorf et al. (2009), assigns a separate nesting and flowering quality score to each habitat for different taxa, accounting for flight distances. This model and adaptations of it have already been used to infer spatially explicit current (Koh et al., 2016; Zhao et al., 2019) and future trends in pollinators/pollination (Chaplin-Kramer et al., 2019) and estimate pollinator natural capital.
More recent studies have refined this process-based InVEST model further by assuming that pollinators are optimal foragers (Olsson et al., 2015), accounting for temporal variation in floral resources and using expert-derived floral attractiveness scores (Häussler et al., 2017).
If models are to be capable of predicting the impact of future land-use change on pollinators and reliably informing conservation management, such sophisticated and realistic simulation of pollinator requirements and resource use is essential. The most advanced social bee models currently available - BEEHAVE and Bumble-BEEHAVE (Becher et al., 2014, 2018) - adopt an agent-based approach, simulating the behaviour of individual bees. However, such agent-based modelling is computationally intensive, such that process-based models remain the most viable option for predicting pollinator visitation across large spatial scales, while still accounting for fine-grained differences in land-use.
Structural realism alone is not sufficient, however; any model used to inform land management and policy must also be validated against observational data to ensure its predictions reflect current observed reality. Several studies have compared predictions from process-based pollinator models with field data (e.g. Groff et al., 2016; Kennedy et al., 2013; Lonsdorf et al., 2009; Nicholson et al., 2019), but these have primarily focused on predicting pollinator abundance within specific crops. Such models are yet to be validated more widely using pollinator abundance measurements in both crop and non-crop landcovers. Here, we take the most advanced process-based pollinator abundance model available (Poll4pop; Häussler et al., 2017), which simulates both solitary and social bees (the main UK pollinators of crops and wild flowers), and we compare its predictions to abundance data collected at 239 sites across Great Britain, including crop, non-crop and urban sites. Our aim is to identify an optimum set of parameters for the model that produces the best agreement with the observed survey data and enables the model to be used with confidence to predict the consequences of land-use change on UK pollinator populations and pollination service. We first parameterise the model using nesting and floral attractiveness scores derived from expert opinion and assess the level of model-data agreement. We then use the observed abundance data to calibrate these nesting and floral attractiveness scores and improve the model-data agreement, using an Approximate Bayesian Computation-like approach. We test two calibration methods: a free data-driven calibration and an expert-informed prioritised calibration. We discuss the implications of these three different model parameterisations, the realism of their derived parameter values and the consequences for reliably modelling pollination service at large spatial scales.

K E Y W O R D S
calibration, credibility assessment, Delphi panels, ecosystem services, pollinators, process-based models, validation

| Pollinator abundance data
We collated transect data from surveys conducted between 2011 and 2016 at 239 sites across Great Britain (Figure 1; Table S1), including 84 crop sites, 12 urban sites and 143 non-crop sites (i.e. nature reserves and semi-natural habitat). The number of surveys per site ranged from 1 to 14, with a mean of 4.5 ± 0.1 surveys per site.
For each survey, we sum up the total number of individuals observed within each of four guilds, which we can then compare to the model predictions, controlling for total transect length and survey date. The guilds are ground nesting bumblebees (GNBB), tree nesting bumblebees (TNBB), ground nesting solitary bees (GNSB) and cavity nesting solitary bees (CNSB), with species allocated to guilds following the nesting preferences given in Falk (2015). Where observations were not recorded to species level but instead recorded as 'Bombus unknown' or 'solitary unknown', we divide these unknown individuals between the nesting guilds according to the proportions of known individuals assigned to each guild on that particular survey. In practice, unknown Bombus and unknown solitary bees were predominantly assigned to ground nesting guilds due to observations of ground nesting species significantly outnumbering other guilds.
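The proportional reallocation of unidentified individuals can be sketched as follows (a minimal sketch; the function name `allocate_unknowns` and the example counts are illustrative, not from the paper):

```python
def allocate_unknowns(known_counts, n_unknown):
    """Split unidentified individuals between nesting guilds in
    proportion to the identified individuals on the same survey.

    known_counts: dict of guild -> individuals identified to species level
    n_unknown:    'Bombus unknown' / 'solitary unknown' individuals
    """
    total_known = sum(known_counts.values())
    if total_known == 0:
        # No identified individuals to apportion by; leave counts as-is.
        return dict(known_counts)
    return {guild: count + n_unknown * count / total_known
            for guild, count in known_counts.items()}

# Example: 12 ground nesting and 3 tree nesting bumblebees identified,
# plus 5 'Bombus unknown' individuals on the same survey.
totals = allocate_unknowns({"GNBB": 12, "TNBB": 3}, 5)
```

Because ground nesting species dominate the identified counts in most surveys, most unknowns end up in the ground nesting guilds, as described above.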

| Model description
Poll4pop is a process-based model that predicts spatially explicit abundance and flower visitation rates by wild central-place-foraging pollinators in a given landscape. It accounts for population growth over time, allows different dispersal distances for workers and reproductives, includes preferential use of more rewarding floral and nesting resources, and can incorporate fine-scale edge features in the landscape. We summarise the model below. For a detailed description of the model see Häussler et al. (2017).
The model requires a rasterised landcover map detailing the landcover class (e.g. cereal, woodland, etc.) of each pixel, as well as rasters containing the area within each pixel that is covered by specific edge features (e.g. hedgerows, flower margins). Each landclass is scored according to the amount of floral cover it provides during each season (spring, summer and autumn), the attractiveness of those floral resources to each pollinator guild (floral attractiveness) and the attractiveness of the nesting opportunities the landclass provides to each pollinator guild (nesting attractiveness). For each guild, the model then generates a nesting resources map (i.e. nesting attractiveness score for each pixel multiplied by the pixel area and maximum nest density input into the model), plus floral resource maps for each season (i.e. floral attractiveness multiplied by seasonal floral cover score for each pixel).
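The resource-map construction described above amounts to per-pixel lookups and multiplications. A minimal sketch, with illustrative (not the paper's) landclass codes, scores and an assumed maximum nest density:

```python
import numpy as np

# Hypothetical landclass scores for one guild (codes/values illustrative).
nest_attract = {0: 0.0, 1: 0.8, 2: 0.3}          # nesting attractiveness
floral_attract = {0: 0.0, 1: 0.9, 2: 0.5}        # floral attractiveness
floral_cover_spring = {0: 0.0, 1: 0.6, 2: 0.2}   # seasonal floral cover

landcover = np.array([[1, 1, 0],
                      [2, 1, 0],
                      [2, 2, 1]])  # toy 3x3 landcover raster

PIXEL_AREA = 100.0       # m^2, for 10 m x 10 m pixels
MAX_NEST_DENSITY = 0.01  # nests per m^2 (assumed value)

# Nesting resources: attractiveness x pixel area x maximum nest density.
nest_resources = (np.vectorize(nest_attract.get)(landcover)
                  * PIXEL_AREA * MAX_NEST_DENSITY)
# Spring floral resources: floral attractiveness x seasonal floral cover.
floral_spring = (np.vectorize(floral_attract.get)(landcover)
                 * np.vectorize(floral_cover_spring.get)(landcover))
```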
Nests are then randomly allocated across the landscape, with the number of nests in a pixel drawn randomly from a Poisson distribution around the expected number predicted by the nesting resources map. For each nest, the model uses the foraging distance of the pollinator to calculate the resources gathered by the nest from its surroundings, which in turn determines how many workers (if social) and new reproductive females the nest produces using the input growth parameters for that pollinator. New reproductive females then disperse according to the dispersal kernel of the pollinator. In any given pixel, the number of new reproductive females that survive the following year is limited by the expected number of nests in that pixel according to the nesting resources map.

F I G U R E 1 Locations of survey sites colour-coded by type of survey site (see Table S1 for type definitions)
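The stochastic nest allocation and the survival cap on dispersing reproductive females can be sketched as follows (toy expectation values, not the paper's parameterisation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Expected nests per pixel, from the nesting resources map (toy values).
expected_nests = np.array([[0.8, 0.8, 0.0],
                           [0.3, 0.8, 0.0],
                           [0.3, 0.3, 0.8]])

# Realised nests: an independent Poisson draw around each expectation.
nests = rng.poisson(expected_nests)

# Dispersed new reproductive females arriving in each pixel (toy values);
# survival into the following year is capped by the expected nest number.
arrivals = np.array([[5.0, 0.0, 2.0],
                     [1.0, 3.0, 0.0],
                     [0.0, 0.0, 9.0]])
survivors = np.minimum(arrivals, expected_nests)
```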
The model outputs visitation rate to each pixel in each season (based on the amount of time pollinators from all nests spend foraging in each pixel). Solitary bees are assumed to be active only during one season, with new bees produced at the end of this season. Social bees (e.g. bumblebees) are assumed to be active in all three seasons, with queens (i.e. reproductive females) foraging during season 1, workers foraging during seasons 2 and 3, and new queens produced at the end of season 3. For each landcover raster, we also generate edgecover rasters for six edge features (ditches, fallow field margins, grassy field margins, flower-rich margins, hedgerows and woodland edges) using information from Rural Payments Agency databases and the CEH Woody Linear Features Database (Scholefield et al., 2016). See Supporting Information for full details of landcover/edgecover raster generation.

| Landcover/edgecover rasters
For each survey site, we generate 10 × 10 km landcover/edgecover rasters with 10 × 10 m pixels centred on the survey site, which are used to obtain model predictions for calibration and validation. To obtain upscaled calibrated model predictions for Great Britain, we also generate 512 35 × 35 km landcover/edgecover rasters with 10 × 10 m pixels, which cover the entire geographical area with a 5 km overlap between rasters (later removed from output rasters to eliminate edge effects).
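The overlapping national tiling can be sketched in one dimension (the tile and overlap sizes come from the text; the helper itself is our illustration):

```python
def tile_origins(extent_m, tile_m=35_000, overlap_m=5_000):
    """1-D origins (metres) of overlapping tiles covering [0, extent_m):
    each tile advances by tile_m - overlap_m, so adjacent tiles share an
    overlap_m strip that can later be trimmed to avoid edge effects."""
    step = tile_m - overlap_m
    origins = [0]
    while origins[-1] + tile_m < extent_m:
        origins.append(origins[-1] + step)
    return origins

# Tiles needed to span a 100 km stretch of the national grid.
xs = tile_origins(100_000)
```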

| Expert opinion data: Floral cover, floral attractiveness and nesting attractiveness
Ten UK pollinator experts were asked to score 35 common European landclasses (Table S2) for abundance and duration of floral resources per season (later multiplied to obtain floral cover). They were also asked to assign floral and nesting attractiveness scores to each landclass for the pollinator guilds they had experience of. Scores were collected on a 6-point scale, along with corresponding 'certainty scores'. We then calculated the mean scores across all experts and their variance, weighted by the experts' certainty scores. See Supporting Information for full details.
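A sketch of one plausible certainty-weighted mean and variance calculation (the exact weighting used is given in the paper's Supporting Information; this particular form is our assumption):

```python
import numpy as np

def weighted_score(scores, certainty):
    """Certainty-weighted mean and variance of expert scores.

    scores:    scores from each expert (e.g. on the 6-point scale)
    certainty: each expert's self-reported certainty, used as a weight
    """
    s = np.asarray(scores, dtype=float)
    w = np.asarray(certainty, dtype=float)
    mean = np.sum(w * s) / np.sum(w)
    var = np.sum(w * (s - mean) ** 2) / np.sum(w)
    return mean, var

# Three experts score a landclass 2, 3 and 5; the last is twice as certain.
m, v = weighted_score([2, 3, 5], [1, 1, 2])
```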

| Literature data: Maximum nest density, foraging distances, dispersal distances and growth parameters
We use maximum nest density, foraging distance, dispersal distance and growth parameter values taken from the literature (Table 1). For simplicity, we assume both bumblebee nesting guilds have the same values for these parameters. Similarly, we assume both solitary bee nesting guilds have the same values for these parameters. This is unlikely to be true. However, the identical maximum nest density assumption is unimportant for our results, since we never compare the relative abundance of guilds and are concerned only with calibrating relative attractiveness of landclasses within guilds. Similarly, we consider the uniform foraging and dispersal distances within bumblebees and solitary bees an appropriate simplification, since foraging and dispersal distances are poorly known and vary between species (even within guilds) and we compare our model predictions to observed guild totals of varying species composition.

| Comparison of model predictions with pollinator abundance data
To obtain a model prediction for a given survey site, we input the site's 10 × 10 km landcover/edgecover rasters and calculate the predicted spring visitation rate per m² within the survey area (V1) by summing the season 1 visitation rate over all pixels inside the survey area and dividing by the total survey area. We compare this to the observed number of bees on each survey (Nobs) by fitting the model:

ln(Nobs + 1) = α + β ln(V1 L) + γW + δY (Equation 1)

where L is the total transect length (i.e. we implicitly assume bees are detected within some unknown width either side of the transect which is constant across sites), W is the week of the year that the survey was carried out, Y is the year the survey was carried out and we fit to (Nobs + 1) to avoid taking the logarithm of zero when no pollinators were recorded. The co-variable W allows us to account for the fact that pollinator population size changes during the survey season, for example, as bumblebee nests produce workers over time and solitary bees' active periods pass. The co-variable Y allows us to account for the fact that pollinator abundance nationally shows between-year variability due to year-to-year variation in weather suitability impacting pollinator growth directly (e.g. through poor weather reducing foraging time) and indirectly (e.g. by reducing floral cover).
Although the survey data represent counts, we fit the linear model assuming a Gaussian error distribution rather than Poisson, because the count data are over-dispersed with variance much larger than the mean. We choose a Gaussian error distribution with logged variables rather than any other method to deal with overdispersion, such as quasi-Poisson distribution, because this approach produces the smallest and most uniform residuals across the data range.
We fit the linear model using r version 3.5.1 (R Core Team, 2018).
A positive value of β that is significantly different from zero indicates significant model-data agreement.
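Equation 1 is an ordinary least-squares fit on logged variables, and the slope recovery can be demonstrated with simulated data (all data and coefficients below are illustrative only, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated survey data: predicted spring visitation V1 (per m^2),
# transect length L (m), survey week W and year Y, with observed counts
# generated from a known slope beta_true = 0.8.
n = 200
V1 = rng.uniform(0.01, 1.0, n)
L = rng.uniform(100.0, 1000.0, n)
W = rng.integers(14, 36, n).astype(float)
Y = rng.integers(2011, 2017, n).astype(float)
beta_true = 0.8
log_resp = 0.5 + beta_true * np.log(V1 * L) + 0.02 * W + rng.normal(0, 0.1, n)
N_obs = np.expm1(log_resp)  # invert ln(N_obs + 1)

# Design matrix for ln(N_obs + 1) = alpha + beta*ln(V1*L) + gamma*W + delta*Y
X = np.column_stack([np.ones(n), np.log(V1 * L), W, Y])
coef, *_ = np.linalg.lstsq(X, np.log1p(N_obs), rcond=None)
beta_hat = coef[1]  # the model-data agreement slope
```

A Gaussian fit on logged variables, as described below, corresponds exactly to this least-squares problem.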

| Sensitivity analysis
We conduct a sensitivity analysis to determine how sensitive the model-data agreement is to changes in the input nesting and floral attractiveness scores. For each guild, we calculate the change (∆) in model-data agreement slope (β; obtained from fitting Equation 1) produced by increasing and decreasing each attractiveness score by 50% in turn:

∆ = (β₊ − β₋)/β

where β₊ and β₋ are the model-data agreement slopes obtained with the score increased and decreased by 50% respectively, and β is the model-data agreement slope when all attractiveness parameters are set to their original expert opinion values. We calculate the uncertainty in ∆ by propagating the SEs on the individual slopes (α₊, α₋ and α_β), following Hughes and Hase (2010), as:

α_∆ = |∆| √[(α₊² + α₋²)/(β₊ − β₋)² + (α_β/β)²]
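Under this reconstruction of the sensitivity formulae, ∆ and its propagated SE can be computed as follows (the function name and example values are ours):

```python
import math

def delta_sensitivity(beta_plus, beta_minus, beta,
                      alpha_plus, alpha_minus, alpha_beta):
    """Percentage change in model-data slope for +/-50% changes in one
    attractiveness score, with its SE propagated in quadrature."""
    delta = (beta_plus - beta_minus) / beta
    alpha_delta = abs(delta) * math.sqrt(
        (alpha_plus**2 + alpha_minus**2) / (beta_plus - beta_minus)**2
        + (alpha_beta / beta)**2
    )
    return 100.0 * delta, 100.0 * alpha_delta

# Slopes of 1.10 / 0.90 after a +/-50% change, baseline slope 1.00,
# all with SE 0.05 (illustrative numbers).
d, d_err = delta_sensitivity(1.10, 0.90, 1.00, 0.05, 0.05, 0.05)
```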

| Model calibration
We separate the survey sites into 120 calibration sites and 119 validation sites, using stratified random sampling to ensure both subsets contain equal proportions of crop/non-crop sites and zero/non-zero surveys per guild. The validation sites are not used for calibration but reserved for assessing improvement in model-data agreement.
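A stratified random split of this kind can be sketched as follows (the `stratified_split` helper and its labels are hypothetical; the paper additionally stratifies on zero/non-zero surveys per guild):

```python
import random

def stratified_split(sites, seed=1):
    """Split sites into calibration/validation halves, stratified so that
    both halves contain (near-)equal numbers from each stratum.

    sites: dict mapping site id -> stratum label (e.g. 'crop'/'noncrop').
    """
    rng = random.Random(seed)
    strata = {}
    for site, label in sites.items():
        strata.setdefault(label, []).append(site)
    calibration, validation = [], []
    for label in sorted(strata):
        members = sorted(strata[label])
        rng.shuffle(members)
        cut = (len(members) + 1) // 2  # calibration takes any odd one out
        calibration.extend(members[:cut])
        validation.extend(members[cut:])
    return calibration, validation

# 24 toy sites: 10 crop, 14 non-crop.
sites = {f"s{i}": ("crop" if i < 10 else "noncrop") for i in range(24)}
cal, val = stratified_split(sites)
```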
For each guild, we focus on calibrating the nesting and floral attractiveness scores for each landclass, excluding the landclasses buckwheat (which does not occur in any of our survey site rasters) and 'unsuitable' (which is used for water/bare rock etc. and is fixed at zero attractiveness). We keep the floral cover scores fixed at their original expert opinion values, to allow us to decouple the guilds and calibrate each guild separately, and all other parameters remain fixed at their literature values.
We test two different methods of calibration. Method 1 involves searching the parameter space of all eligible parameters simultaneously (free data-driven calibration). Method 2 (expert-informed prioritised calibration) involves first prioritising parameters for calibration according to the results of the sensitivity analysis and searching the parameter space of parameters which the model is most sensitive to first, while less sensitive parameters remain fixed at their original expert opinion values. The parameter space of these less sensitive parameters is only searched once the more sensitive parameters have been calibrated. We define three sensitivity thresholds for Method 2: parameters which produce ∆ ≥ 5% are calibrated first, followed by parameters which produce 5% > ∆ ≥ 0.5%, with the remaining parameters producing ∆ < 0.5% calibrated last.

The calibration process itself follows an Approximate Bayesian Computation-like approach (Fearnhead & Prangle, 2012) and involves running Poll4pop 1,000 times across all of our calibration sites. Each run uses a unique set of attractiveness parameters, where each eligible attractiveness parameter is assigned a random value drawn from a uniform distribution between the allowable limits for that parameter, while any ineligible attractiveness parameters remain fixed at their original expert opinion scores. For each run, we fit Equation 1 to assess the model-data agreement between the calibration site survey data and model predictions, and select the 100 runs which produce (β, R²) closest to (1, 1). We then calculate the density distributions of the eligible parameters across these 100 best runs. While the density distributions of the eligible parameters across all 1,000 runs are flat, the density distributions corresponding to the 100 best runs should be biased towards parameter values that produce the best fit to the data and show a peak around these values.
If the full width half maximum (FWHM) of the density distribution peak is ≤60% of the parameter's allowable range, then we assume the parameter has been sufficiently constrained and we define the parameter's calibrated score as the score that corresponds to the density distribution peak. The FWHM limit of ≤60% was set after careful consideration of the density distribution widths the calibration process and data quality are capable of producing.
Typically only a few (∼1-5) parameters will be constrained from analysing a single batch of 1,000 runs, due to the large number of parameters being varied simultaneously broadening the density distributions of individual parameters. After a single batch of 1,000 runs, any successfully calibrated parameters are set to their calibrated values and the process is repeated until all parameters have been calibrated or the remaining parameters do not yield FWHM ≤ 60% due to their minimal leverage on the model-data agreement.
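The batch logic can be sketched end-to-end with a toy stand-in for Poll4pop (everything below, from the two-parameter linear "model" to the 20-bin histogram used for the FWHM, is our illustrative assumption, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy stand-in for Poll4pop: predicted visitation at each of 120
# "calibration sites" is a linear mix of two habitat areas, weighted by
# the two attractiveness parameters being calibrated.
areas = rng.uniform(0.0, 1.0, size=(120, 2))
true_params = np.array([0.7, 0.2])
obs = areas @ true_params + rng.normal(0.0, 0.02, 120)

N_RUNS, N_BEST = 1000, 100
params = rng.uniform(0.0, 1.0, size=(N_RUNS, 2))  # uniform priors

# Score each run by the distance of its (slope, R^2) from (1, 1).
scores = np.empty(N_RUNS)
for i in range(N_RUNS):
    pred = areas @ params[i]
    slope, intercept = np.polyfit(pred, obs, 1)
    resid = obs - (slope * pred + intercept)
    r2 = 1.0 - resid.var() / obs.var()
    scores[i] = np.hypot(slope - 1.0, r2 - 1.0)

best = params[np.argsort(scores)[:N_BEST]]  # the 100 best runs

def peak_and_fwhm(samples, lo=0.0, hi=1.0, bins=20):
    """Histogram peak of a parameter's density across the best runs, and
    the peak's full width at half maximum as a fraction of the range."""
    counts, edges = np.histogram(samples, bins=bins, range=(lo, hi))
    centres = 0.5 * (edges[:-1] + edges[1:])
    peak = centres[counts.argmax()]
    above = centres[counts >= counts.max() / 2.0]
    fwhm = above.max() - above.min() + (hi - lo) / bins
    return peak, fwhm / (hi - lo)

peak0, width0 = peak_and_fwhm(best[:, 0])
# A parameter would count as calibrated only if its width is <= 0.6 (the
# FWHM threshold); otherwise it is carried into the next batch.
```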

| Predicting visitation rates across Great Britain
We generate two model predictions per guild for spring visitation rate across Great Britain: one prediction using the calibrated attractiveness scores obtained using Method 1 (Vcal) and one using the original expert opinion attractiveness scores (Vexp). We compare these two predictions by calculating the ratio Vcal/Vexp to identify regions of the country where they differ.
We quantify the uncertainty on Vcal by running 100 simulations, where the score of each attractiveness parameter is randomly selected from the density distribution of that calibrated parameter.
The uncertainty on the model prediction in each pixel is then represented by the SD of these 100 simulations.
We quantify the uncertainty on Vexp by running 100 simulations, where the score of each attractiveness parameter is randomly selected from a beta distribution (B(a, b)) with mean (µ = a/(a + b)) and variance (σ² = µ(1 − µ)/(a + b + 1)) equal to the mean and variance of the expert opinion score for that parameter. Since B(a, b) is only defined on the interval (0, 1), we rescale the floral attractiveness parameter means (originally scored from 0-20) and variances onto the interval (0, 1), draw randomly from the appropriate beta distribution and multiply the randomly selected scores by 20 to return them to the appropriate scale.
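The beta-distribution sampling can be sketched as follows, recovering a and b from the rescaled mean and variance via the moment relations above (the function name and example scores are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_attractiveness(mean, var, scale=20.0, n=100):
    """Draw attractiveness scores from a beta distribution matching the
    expert-opinion mean and variance, after rescaling 0-`scale` scores
    onto (0, 1); uses mu = a/(a+b) and sigma^2 = mu(1-mu)/(a+b+1)."""
    mu = mean / scale
    sigma2 = var / scale**2
    nu = mu * (1.0 - mu) / sigma2 - 1.0  # nu = a + b
    a, b = mu * nu, (1.0 - mu) * nu
    return scale * rng.beta(a, b, size=n)

# Expert mean 12 and variance 4 on the 0-20 floral attractiveness scale.
draws = sample_attractiveness(mean=12.0, var=4.0, n=5000)
```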
The uncertainty on the ratio Vcal/Vexp is taken as the SD of the ratios calculated from dividing one Vcal simulation by the corresponding Vexp simulation. We assess the significance of the ratio Vcal/Vexp in each pixel by calculating the number of SDs the ratio is away from a ratio of 1:1 in that pixel; pixels where the ratio lies more than 3 SDs from 1:1 are considered to show a significant difference between the calibrated and expert opinion model predictions.

| Initial model-data comparison
All four guilds show significant model-data agreement (i.e. statistically significant β > 0) between the initial model predictions for each survey site, calculated using the expert opinion attractiveness scores, and the observed survey data (Table 2).

| Sensitivity analysis
For each guild, only a small number of parameters produce a significant change in model-data agreement slope when adjusted (Figure 2). For GNBB, only the nesting attractiveness of unimproved permanent grassland and cereal, and the floral attractiveness of coniferous woodland, unimproved permanent grassland, cereal and oilseed, produce a percentage change in slope with uncertainty bounds that do not overlap zero. This is due to the large scatter in the model-data agreement producing ∼10% uncertainty on individual model-data slopes, and the small geographic area covered by many landclasses (e.g. vegetables).

F I G U R E 2 Percentage change in slope for a ±50% change in input score, shown per guild (e.g. cavity nesting solitary bees)

| Improvement in model-data agreement
All four guilds show β closer to 1 and improved R² values after calibration (except Method 2 for TNBB), with R² values for all guilds now ranging from 0.342 to 0.486 (Table 2). Calibration Method 1 generally produces a slightly higher R² than Method 2, but there is typically no significant difference between the model-data agreement slopes obtained by the two methods, with TNBB the only guild for which the SEs on the two slopes do not overlap.
The results in Table 2 represent the model-data agreement using all survey sites. Figure 3 (and the corresponding Figures S1, S2 and S3) shows the improvement in β and R² as successive batches of parameters are calibrated for the calibration and validation sites separately. The validation sites, which were not used to calibrate the model, generally show a similar improvement in model-data agreement to the calibration sites, with the exception of TNBB R² using Method 1 (Figure S1). In this case, the calibration subset began with a lower R² than the validation subset, and selecting for (β, R²) close to (1, 1) in the calibration subset produces a slight reduction in R² for the validation subset.
Across all four guilds, the biggest improvements in model-data agreement occur at the beginning of both calibration processes, when the most influential parameters are calibrated (despite these not being forcibly prioritised by Method 1). GNBB show no further significant improvement in model-data agreement slope via Method 1 after the first six batches of parameters have been calibrated (18 out of 66 parameters). The optimal β value is typically achieved faster via Method 2 than Method 1. Method 2's prioritising of slope-influencing parameters means that improvements in R² often take longer to achieve than improvements in β, whereas β and R² improve at roughly the same rate using Method 1 (Figure 3).
For TNBB, this prioritisation of slope-influencing parameters actually results in an overall reduction in R² by the end of the Method 2 calibration process (Figure S1). This may be due to TNBB being most restricted by the Method 2 calibration process, with just 11 parameters falling below the first sensitivity threshold for calibration, compared to 23, 15 and 19 parameters for GNBB, GNSB and CNSB guilds, respectively.

| Calibration rates
For GNBB and TNBB (Figure 4, left panel) the cumulative number of parameters fixed per batch is initially higher using Method 2 but drops below Method 1 for later batches, such that both calibration methods require the same total number of batches. However, the Method 2 calibration rate remains higher than Method 1 while the calibration process is still producing significant improvements in model-data agreement (cf. Figure 3; Figure S1), only dropping below Method 1 after the overall change in model-data agreement becomes negligible. This suggests adopting Method 2 may be advantageous for these guilds, although TNBB may benefit from lower ∆ thresholds to avoid the over-prioritisation of slope improvement at the expense of improvements in R².
For GNSB, the cumulative number of parameters fixed by Method 2 is always comparable to or less than Method 1, such that adopting Method 2 offers no advantage for this guild (circles, Figure 4). In contrast, for CNSB, the calibration rate by Method 2 is always substantially higher than using Method 1 (diamonds, Figure 4). Method 2 fixes a lower total number of parameters for CNSB than Method 1 (51 vs. 58 respectively). However, Figure S3 shows that significant improvements in model-data agreement ceased around batch 16 for both methods for this guild, at which point Method 2 had calibrated a greater number of parameters than Method 1.

Both bumblebee guilds show lower visitation rates in lowland arable areas of eastern England, due to the predominance of (low calibrated floral score) cereals in these areas (Figure 7; Figure S4); Figures S4, S5 and S6 show corresponding maps for the other pollinator guilds. The solitary bee guilds show higher visitation rates in arable areas (reflecting the calibrated scores for maize and cereals; Figure 6) and lower visitation rates in upland areas of permanent grassland (Figures S5 and S6). They also show the most extreme differences between the two models, producing in some upland wetland areas a factor of 10⁷ difference in predicted visitation rate, due to the calibrated model obtaining non-zero nesting scores for many landclasses to which the experts assigned zero attractiveness (Figure 5). Once the uncertainties on the attractiveness scores are taken into account, there are generally no significant differences between the two model predictions for solitary bees, with the largest discrepancies (between 1 and 2 SD away from a 1:1 ratio) occurring in suburban areas and at the interface between suburban areas and woodlands (bottom right panels; Figures S5 and S6). In contrast, both bumblebee guilds show significant differences between the two model predictions in arable areas (>3 SD away from a 1:1 ratio; Figure 7; Figure S4). See the Supporting Information for a full discussion of the national-level model predictions for each guild.

| D ISCUSS I ON
We have compared the most advanced spatially explicit process-based pollinator abundance model currently available to bee abundance data collected at 239 sites across Great Britain. Our initial model version, parameterised using expert opinion nesting and floral attractiveness scores, showed significant (but nonlinear) model-data agreement for all four guilds. We then tested two different methods to calibrate the nesting and floral attractiveness scores for each guild and improve the model-data agreement: (1) a free, purely data-driven calibration and (2) an expert-informed prioritised calibration.
Method 2 calibrated parameters at a faster rate (initially) for three of the four guilds, but not for GNSB.
Although our calibrated models both showed improvements in model-data agreement, there were significant differences between the calibrated attractiveness scores obtained by the two methods, reflecting the fact that, in complex interacting process-based models, the order in which parameters are calibrated matters. Another factor may be 'over adjustment' of parameters by the prioritised calibration method to compensate for the fact that the process could not always simultaneously adjust other parameters which, if allowed to vary, might have enabled a better combined fit to the data. It may also simply be a consequence of our low model-data sensitivity to small area landclasses (Yapo et al., 1996).
Both calibration processes selected attractiveness scores that improved the fit to our observed abundance data. However, closer examination of the calibrated scores reveals instances where the calibration process identified ecologically unrealistic values, for example, high floral attractiveness scores for solitary bees for cereals, which are wind pollinated and do not provide significant nectar resources (except potentially in organic systems with higher in-crop wild floral cover; Holzschuh et al., 2007). There are many reasons why our calibration processes might find erroneous/unrealistic attractiveness values:
• Detectability biases in survey data. Erroneous calibrated scores may arise if species detectability varies systematically with landcover, for example, through reduced sight lines/fewer individuals in flight. Our guild-level approach may further exacerbate this if guild species composition systematically alters across landcovers such that more readily detectable species are replaced with less detectable species in some habitats, causing an apparent reduction in measured guild abundance unrelated to actual guild abundance. Solitary bees are typically under-recorded on transects due to their smaller size (O'Connor et al., 2019), and their short flight periods also reduce data availability for these guilds.
• Use of survey totals. We necessarily compared mean model visitation rate within the survey area with summed abundance along all surveyed transects. This may produce erroneous calibrated scores in heterogeneous survey areas containing multiple landclasses.
• Timing of crop surveys. All crop surveys were conducted during the (relatively short) peak flowering period of the crop when temporarily high foraging abundances occur within the crop relative to the wider countryside. However, the model predicts total visitation rate per season. Two adjacent parcels containing equally attractive resources for short durations will receive the same predicted seasonal visitation rate, that is, half the bees forage in each. If, in reality, these two parcels flower sequentially within one season, such that all the bees forage in one and then in the other, the model cannot capture this unless we subdivide the season and increase the temporal resolution at which we run the model. This temporal limitation may be driving the unusually high calibrated nesting/floral scores obtained for some crops.
• Geographical distribution of survey sites. Biases in geographical coverage can produce spurious calibration results if these correlate with systematic changes in landcover or data collection conditions. Despite wide geographical coverage, there were more lowland intensive arable sites than upland sites and often better survey conditions at lowland sites. Wide variation in how survey weather conditions were recorded (and their absence for some sites) prevented us from controlling for this.
• Geographical differences in total abundance unrelated to floral/nesting attractiveness. Climatic gradients and current range limitations (relevant to TNBB Bombus hypnorum) can also cause variation in pollinator population size and produce spurious calibration results where these gradients correlate with systematic geographical changes in landcover.
• Limitations of mapping data: misclassifications. Inaccuracies in mapping data can lead to spurious calibration results if mapping data indicate a landclass is present when in reality it is not.
• Limitations of mapping data: omissions. Lack of fine-scale feature mapping (IPBES, 2016) prevents many important pollinator habitats from being included in our input model landscapes. We also only mapped obviously pollinator-relevant agri-environment features and used a simplistic approach of placing boundary features around the entire perimeter of the containing land parcel, due to lack of information on feature placement.
• Limitations of mapping data: no accounting for within-habitat heterogeneity. Large-scale systematic differences in habitat quality between regions (e.g. due to management) could influence the calibrated attractiveness scores, while small-scale within-habitat heterogeneity will influence measured abundances in the field but will not be present in the mapping data, which is predominantly derived from the 25 × 25 m resolution LCM2015 dataset.
• Dataset asynchrony and dynamic landscapes. Crop rotation means that our study landscapes are likely to contain roughly the right proportions of crops but not necessarily in exactly the right places due to asynchrony of our mapping and survey data. Although we forced the surveyed crop fields to contain the correct crop, erroneous attractiveness scores may be obtained for crops that are adjacent in our mapped landscapes but were not adjacent in reality at the time of the survey. Lack of crop rotation information also means we cannot account for the legacy of past flowering crop distributions on current year pollinator population size/distribution.
• Non-stationary populations and flight seasons. We compared the observed data to the predicted spring visitation rates using a survey date co-variable. For bumblebees, this reflects the fact that numbers increase as spring-foraging queens produce summer-foraging workers. The model only permits solitary bees to fly in one season (with no allowance for primitive eusocial or multivoltine behaviour) and so, in order to compare spring visitation rates, we simulated only spring-flying solitary bees. By comparing spring solitary bee and bumblebee numbers to survey data collected throughout spring-summer with a date co-variable, we are assuming that spring and summer abundance obey the same correlation over time in different landscapes, which is unlikely to be true if some landscapes contain a high proportion of landclasses with very temporally restricted floral cover scores (Persson & Smith, 2013). An improvement would therefore be to explicitly model both spring- and summer-flying solitary bees and to match surveys to the appropriate seasonal visitation rate. However, this adds an extra layer of complexity to an already complex process and can produce very different results depending on where the (arbitrary and latitude-dependent) cut-off between spring and summer is placed.
• Choice of parameters to calibrate. We did not calibrate the floral cover scores, leaving these fixed at their expert opinion estimates to enable decoupling of the guilds. However, experts can struggle to accurately assess floral cover, and quantitative sampling (e.g. Baude et al., 2016; Hicks et al., 2016) can provide more accurate estimates. Under/over-estimated floral cover scores could cause higher/lower floral attractiveness scores respectively.
• Parameter degeneracy. Crop sites consisted of observational data collected within a single landclass; without multiple simultaneous observations in adjacent landclasses with different nesting/floral properties, the calibration process will struggle to disentangle which attractiveness scores (nesting, floral or both) for the landclasses should be altered to match the data. Measurements in multiple nearby landclasses are needed to capture the movement of bees from good nesting to good foraging areas and so break this degeneracy. This is almost certainly another cause of unrealistic calibrated scores, for agricultural landclasses in particular.
• Structural limitations of model. The model does not account for density-dependent competition for floral resources, land-use factors such as pesticide risk, or flexibility in foraging range. Changes in guild species composition with habitat may cause a change in the typical foraging range for that guild, which may impact calibrated attractiveness scores.
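The parameter-degeneracy point above can be made concrete with a toy stand-in for the process model (the product-form visitation formula, the grid and all numbers are illustrative assumptions, not the published model):

```python
# Toy stand-in for the process model (illustrative assumption): predicted
# visitation in a survey area covering a single landclass is taken to be
# nesting_score * floral_score for that landclass.
observed = 0.36  # one (normalised) abundance observation in that landclass

# Grid-search 'calibration' of the two scores against this lone observation.
grid = [round(0.05 * i, 2) for i in range(1, 21)]  # candidate scores 0.05-1.0
fits = [(n, f) for n in grid for f in grid if abs(n * f - observed) < 1e-9]

print(fits)  # several very different nesting/floral splits fit equally well
```

Every pair with the same product reproduces the observation exactly, so single-landclass data cannot say whether the habitat is good for nesting, good for forage, or both; only simultaneous observations in landclasses with contrasting properties break the tie.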
The fact that our expert-informed prioritised calibration process also produced some unrealistic scores raises the question of whether this expert-informed calibration was informed enough.
Perhaps we should have gone further and restricted the parameter space searched, for example, by using the expert opinion scores as Bayesian priors (e.g. Choy et al., 2009). However, the suburban floral attractiveness scores highlight why we might be cautious about taking such a strongly expert-influenced a priori approach: it could have prevented the calibration from even exploring the range that both tested calibration methods identified as preferred for all guilds.
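As a minimal sketch of what such a prior-constrained (MAP-style) calibration would do to a single attractiveness score, assume a Gaussian expert prior and a Gaussian data misfit (all numbers are invented for illustration):

```python
# Illustrative MAP-style calibration of one attractiveness score:
# negative log-posterior = data misfit + prior penalty centred on the
# expert estimate (all values invented for illustration).
expert_mean, expert_sd = 0.30, 0.05   # tight expert prior for the score
observed, model_sd = 0.70, 0.10       # the data pull the score much higher

def neg_log_posterior(score):
    data_term = ((observed - score) / model_sd) ** 2 / 2
    prior_term = ((score - expert_mean) / expert_sd) ** 2 / 2
    return data_term + prior_term

# Coarse 1-D search over the admissible score range [0, 1].
scores = [i / 1000 for i in range(1001)]
map_score = min(scores, key=neg_log_posterior)
print(round(map_score, 3))  # -> 0.38
```

Because the prior is tight (sd 0.05), the optimum lands at 0.38, far below the data-preferred 0.70, which is exactly the risk discussed above: a strongly expert-weighted prior can stop the calibration from ever reaching the range the data favour.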
There are good reasons why expert opinion scores may be inaccurate or may not yield the most appropriate values within our modelling scenario:
• Expert elicitation method. Experts scored landclasses independently. A more sophisticated elicitation method, such as the Delphi process (O'Hagan, 2019), may have provided more reliable final scores with lower variance, by allowing the experts to collectively review all opinions and iteratively refine and discuss their scores. In addition, we calculated the certainty-weighted mean score and variance across all experts and used these to parameterise a beta distribution uncertainty profile for each score. Other studies (e.g. Koh et al., 2016) assign beta uncertainty distributions to the individual expert scores and average these, which may yield a broader mean uncertainty distribution and a slightly different weighted mean.
• Semantic uncertainties. Each landclass that occurred in the mapping data had to be matched to one of the 35 expert opinion landclasses, generating semantic uncertainties. For example, experts scored 'garden' attractiveness and this was applied to all 'suburbs' in LCM2015, where gardens are diluted by less attractive landclasses such as buildings and roads. This could explain the calibrated/expert discrepancy for suburbs. Semantic uncertainties also arise where different experts assign different scores to the same landclass due to different interpretations of a landcover term, for example, based on field experience in different geographical regions.
• Knowledge gaps. There was a trend for the calibrated nesting scores to be higher than the experts predicted. It is difficult to find nests in the field and therefore plausible that experts in general may be less reliable at assessing nesting quality.
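The certainty-weighted beta parameterisation described in the first bullet above amounts to a method-of-moments fit; a minimal sketch, with invented expert scores and certainty weights:

```python
# Method-of-moments beta fit for one expert-elicited score: the
# certainty-weighted mean and variance across experts define the
# uncertainty profile (all numbers invented for illustration).
scores  = [0.20, 0.40, 0.35, 0.60]   # individual expert scores (0-1 scale)
weights = [0.9, 0.5, 0.8, 0.3]       # self-reported certainty weights

m = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
v = sum(w * (s - m) ** 2 for w, s in zip(weights, scores)) / sum(weights)

# Beta(a, b) matching this mean and variance (valid only when v < m*(1-m)).
common = m * (1 - m) / v - 1
a, b = m * common, (1 - m) * common
print(round(m, 3), round(a, 2), round(b, 2))
```

By construction the fitted Beta(a, b) has mean a/(a+b) = m and variance v, so the per-landclass profile directly encodes both the panel's central estimate and its collective (un)certainty.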
Clearly, we can improve on expert opinion estimates by including data-driven calibration, which relates observational data more directly to the modelling environment (e.g. Groff et al., 2016). However, ecological survey data cannot be treated as 'true' due to their own inherent observational biases. Ideally, expert opinion would be entirely supplanted by field data on nesting and floral attractiveness, but such data require specialised efforts to obtain, are hard to determine (e.g. Baey et al., 2017; Bahlai & Landis, 2016; Osborne et al., 2008) and can vary strongly between species even within guilds (Falk, 2015). The collection of large-scale systematic pollinator monitoring data (as proposed by Carvell et al., 2016) could help our data-driven calibration to derive more realistic, consistent estimates with lower temporal/regional biases, but no such data are currently available for the United Kingdom.
This means some expert moderation is essential to identify unrealistic parameter values, which may reflect the limitations and biases of our current datasets and the insensitivities of our model more than the preferences of the species we are modelling.
Our results show that a purely expert-opinion parameterisation and a purely data-driven calibration both have limitations in their ability to yield accurate parameter estimates. Our maps of pollinator visitation illustrate how differences in the parameter values obtained by these two approaches can produce enormous differences in model outputs (e.g. a factor-of-10⁷ increase in TNBB visitation in some locations) when used to predict abundance on a landscape scale. This emphasises the need to reconcile these two approaches and obtain the most reliable/realistic estimate for each parameter; crucially, the approach which yields the most reliable estimate may be different for each parameter. The fact that our expert-informed prioritised calibration also yielded unrealistic parameter values suggests that a more integrated, iterative approach may be better.
We propose integrating data-driven calibration results within a Delphi-like process, adding a data-driven 'expert' to the human members of the panel. Expert opinion is not imposed a priori; instead, an initial independent data-driven calibration is conducted for comparison with expert opinion. Each calibrated parameter can then be discussed, examining reasons why the data-driven calibration may be preferred over the expert estimate and vice versa. Unrealistic parameter values can be identified, appropriate limits (priors) set if over-adjustment is suspected and the calibration process repeated. Model predictions generated using the final hybrid parameter values can then be compared to the original survey data, to ensure significant agreement is still maintained.
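In outline, this loop can be sketched as a runnable toy, where the survey signal, the panel's realism caps and the one-line 'calibration' step are all invented placeholders for the real modelling and elicitation activities:

```python
# Runnable toy of the proposed Delphi-like loop: unconstrained data-driven
# calibration, panel review flagging unrealistic values, bounds tightened,
# re-calibration. All data and rules are invented for illustration.
observed = {"grassland": 0.8, "cereal": 0.3}   # toy data-preferred scores
realism_caps = {"cereal": 0.2}                 # panel: cereals offer little forage

def calibrate(bounds):
    # Placeholder 'calibration': clip each data-preferred score to its bounds.
    return {lc: min(max(observed[lc], lo), hi) for lc, (lo, hi) in bounds.items()}

bounds = {lc: (0.0, 1.0) for lc in observed}   # round 1: no constraints
for _ in range(5):
    scores = calibrate(bounds)
    flagged = {lc for lc, s in scores.items()
               if s > realism_caps.get(lc, 1.0)}   # panel credibility review
    if not flagged:                                # all values judged realistic
        break
    for lc in flagged:                             # set limits (priors), re-run
        bounds[lc] = (0.0, realism_caps[lc])

print(scores)
```

In round one the unconstrained calibration pushes the cereal score to the data-preferred 0.3; the panel flags it, sets a bound, and the re-run settles at the limit (0.2) while leaving the credible grassland score untouched.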

| CONCLUSIONS
Reliably modelling pollinator abundance is essential if we are to identify areas of pollination service deficit and effectively target resources to support pollinator populations. The central-place foraging behaviour of many pollinators favours a process-based model in order to accurately reflect how the distribution of nesting/floral resources affects landscape-level pollinator abundance. We selected the most advanced process-based pollinator abundance model available and calibrated it against observational data collected across Great Britain, to assess its suitability for generating spatially explicit estimates of national pollinator abundance.
In its initial expert-parameterised version, the model showed significant agreement with the survey data, which further improved with calibration for three out of four modelled pollinator guilds. This demonstrates the model's potential to reliably map pollination service/natural capital, identify target areas for interventions and form the basis of novel tools to inform land-use decision-making. Our aim was to identify the parameter set that produced the best fit to the survey data and could be used with confidence to predict the consequences of land-use change on UK pollinator populations. Although the calibrated parameterisations satisfy the former, their inclusion of unrealistic parameter values means they fail at the latter; adopting the calibrated parameters for the sake of a small increase in R² would (far more seriously) cause the model to predict that increasing cereal cover is beneficial for many pollinators, which is generally not the case. This demonstrates that our concept of model accuracy must include both accurate prediction within the calibration/validation environment and ecological realism of underlying parameters (given our wider knowledge base) to enable meaningful model application outside it.
Our work highlights the universal challenges faced when calibrating any spatially explicit, process-based ecological model. The desire to realistically represent complex ecological processes in finely mapped landscapes necessarily generates models with large numbers of parameters. Computational limitations and model insensitivities may preclude calibration of all parameters, making some use of expert estimates a necessity. This, combined with survey and geographical data biases, means purely data-driven calibration can easily identify spurious parameter values. We suggest that treating expert elicitation and data-driven calibration as complementary parts of a single iterative process, which integrates model calibration and credibility assessment, may provide the best opportunity to obtain realistic parameter estimates for process-based models in ecological systems with expert knowledge gaps and patchy/biased ecological data.