Volume 92, Issue 12 p. 2248-2262
RESEARCH METHODS GUIDE
Open Access

Integrated community models: A framework combining multispecies data sources to estimate the status, trends and dynamics of biodiversity

Elise F. Zipkin (corresponding author), Jeffrey W. Doser, Courtney L. Davis, Wendy Leuenberger, Samuel Ayebare, Kayla L. Davis

All authors: Department of Integrative Biology; Ecology, Evolutionary Biology, and Behavior Program, Michigan State University, East Lansing, Michigan, USA

Courtney L. Davis, additional affiliation: Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA

Correspondence: Elise F. Zipkin

First published: 25 October 2023
Handling Editor: Thierry Boulinier

Abstract

  1. Data deficiencies among rare or cryptic species preclude assessment of community-level processes using many existing approaches, limiting our understanding of the trends and stressors for large numbers of species. Yet evaluating the dynamics of whole communities, not just common or charismatic species, is critical to understanding the responses of biodiversity to ongoing environmental pressures.
  2. A recent surge in both public science and government-funded data collection efforts has led to a wealth of biodiversity data. However, these data collection programmes use a wide range of sampling protocols (from unstructured, opportunistic observations of wildlife to well-structured, design-based programmes) and record information at a variety of spatiotemporal scales. As a result, available biodiversity data vary substantially in quantity and information content, which must be carefully reconciled for meaningful ecological analysis.
  3. Hierarchical modelling, including single-species integrated models and hierarchical community models, has improved our ability to assess and predict biodiversity trends and processes. Here, we highlight the emerging ‘integrated community modelling’ framework that combines both data integration and community modelling to improve inferences on species- and community-level dynamics.
  4. We illustrate the framework with a series of worked examples. Our three case studies demonstrate how integrated community models can be used to extend the geographic scope when evaluating species distributions and community-level richness patterns; discern population and community trends over time; and estimate demographic rates and population growth for communities of sympatric species. We implemented these worked examples using multiple software methods through the R platform via packages with formula-based interfaces and through development of custom code in JAGS, NIMBLE and Stan.
  5. Integrated community models provide an exciting approach to model biological and observational processes for multiple species using multiple data types and sources simultaneously, thus accounting for uncertainty and sampling error within a unified framework. By leveraging the combined benefits of both data integration and community modelling, integrated community models can produce valuable information about both common and rare species as well as community-level dynamics, allowing for holistic evaluation of the effects of global change on biodiversity.

1 INTRODUCTION

The consequences of global change on animal communities remain unclear for all but the most abundant taxa because of data limitations (Breiner et al., 2015; Kindsvater et al., 2018). As biodiversity continues to decline, it is critical to assess the status and dynamics of whole communities, and not just those common or charismatic species with large amounts of data. While many sampling schemes target multiple species simultaneously (e.g. breeding birds via point counts, small mammals via trapping), traditional approaches to evaluate community-level processes require a large number of observations (Hamilton et al., 2015; Sor et al., 2017), precluding assessments of rare or data-deficient species. One in six species has been classified as data deficient by the International Union for the Conservation of Nature (Bland et al., 2017), leading to gaps in our understanding of the trends and stressors for a large proportion of species, often within taxonomically related communities. Furthermore, data sets collected by individual researchers are generally restricted in spatiotemporal scope, resulting in narrow and inadequate assessments of the biotic and abiotic factors influencing species' trends when the effects of environmental factors vary across space and/or time (Rollinson et al., 2021).

Fortunately, there has been a surge of data collection programmes in recent decades, including hundreds of public (citizen) science (e.g. eBird, iNaturalist) and government-funded (e.g. National Ecological Observatory Network [NEON]) programmes that provide a wealth of fine- to broad-scale data on multiple species simultaneously (Barnett et al., 2019; Thornhill et al., 2016). Incorporating these distinct data sources into rigorous analyses is an ongoing challenge because of variations in data type, quality, sampling protocols and geographic coverage (Chandler et al., 2017; Moussy et al., 2021). Yet, understanding and evaluating both species- and community-level processes, including responses to environmental change, is critical to maintaining biodiversity, and hence, a key objective of the ecological research community (Johnson et al., 2017).

Hierarchical modelling has significantly advanced the use of such ecological data because it allows for the separation of biological and observation processes (Kéry & Royle, 2016) and can thus account for variations in the types and information content of specific data sources. Two recent advancements in hierarchical modelling are critical to the development of rapid, comprehensive assessments of biodiversity: (1) single-species integrated models and (2) hierarchical community models.

Single-species integrated models combine multiple data sets into a single model on a target species, generally by defining a joint likelihood among available data sources (Miller et al., 2019; Schaub & Kéry, 2021). A key advantage of integrated modelling (sometimes referred to as data fusion) is the ability to merge multiple data types, regardless of collection method and spatiotemporal scope (Zipkin et al., 2021). Single-species integrated models improve inferences through increased precision of parameter estimates (Mosnier et al., 2015), the estimation of parameters for which no explicit data are available (Oppel et al., 2014), and by accounting for uncertainties and correlations among data sets (Lee et al., 2015). Single-species integrated modelling has primarily taken two forms: integrated population models and integrated distribution models (Zipkin et al., 2019). Integrated population models allow for a mechanistic understanding of species-level processes primarily by combining demographic data with time-series population- or site-level count data (Brown & Collopy, 2013; Saunders et al., 2018). Integrated distribution models synthesize presence-only, detection–nondetection and/or count data to estimate species distribution patterns and the effects of covariates on occurrence or abundance (Fletcher Jr. et al., 2019; Grattarola et al., 2023).

Hierarchical community models link together occurrence or abundance parameters of individual species (estimated using a single, multispecies detection–nondetection or count data set), through community-level distributions (Devarajan et al., 2020; Dorazio et al., 2006; Sollmann et al., 2016). This is done either by treating species-level parameters (i.e. intercepts or covariate effects) as random variables arising from normal distributions, characterized by a community-level mean and variance across species (e.g. multispecies occurrence models; Zipkin et al., 2010; Guillera-Arroita, 2017), or by using multivariate distributions to explicitly model species associations (e.g. joint species distribution models; Ovaskainen et al., 2017; Warton et al., 2015). Hierarchical community models improve biological inferences by accounting for both species-level effects and the aggregated effects of covariates on a community as a whole, leading to increased precision in species parameter estimates, even for those species that were observed infrequently (Zipkin et al., 2009), and the ability to estimate biodiversity metrics such as richness and composition (Guillera-Arroita et al., 2019).
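The random-effects structure described above can be illustrated with a minimal simulation sketch. It assumes a logit-scale occurrence intercept for each species drawn from a single community-level normal distribution; the function names, the number of species and the values of the community mean and standard deviation are hypothetical and are not taken from any of the cited analyses.

```python
import math
import random


def inv_logit(x):
    """Map a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_species_effects(n_species, mu, sigma, seed=1):
    """Draw one logit-scale intercept per species from Normal(mu, sigma).

    mu and sigma play the role of the community-level mean and
    standard deviation across species (hypothetical values).
    """
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n_species)]


# Hypothetical community: 27 species, community mean -1.0, SD 1.5 (logit scale)
intercepts = simulate_species_effects(n_species=27, mu=-1.0, sigma=1.5)
occurrence = [inv_logit(b) for b in intercepts]

# Species-level occurrence probabilities scatter around inv_logit(mu);
# a smaller sigma pulls them closer together (stronger sharing of information).
print(min(occurrence), max(occurrence))
```

In a fitted model the direction is reversed: the community mean and variance are estimated from the data, and the estimates for sparsely observed species are drawn towards the community mean.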

Despite recent advances in single-species integrated models and hierarchical community models, there has been comparatively less research focused on modelling multiple species using more than one data source. Several applications of multispecies integrated population models in which count data are combined with demographic data have examined interactions between species, including competition (Perón & Koons, 2012), synchrony in survival and reproduction rates (Lahoz-Monfort et al., 2017) and predator–prey dynamics (Barraquand & Gimenez, 2019; Clark, 2021; Paquet & Barraquand, 2022; Quéroué et al., 2021). A key feature of these models is that species' demographic rates are assumed to be influenced by the dynamics of one or two other species such that the models account for both abiotic effects (via covariates) and biotic interactions among species. There are also examples of integrated community occupancy models that combine multiple detection–nondetection data sets to estimate species co-occurrence patterns using two or more data sets (Doser, Leuenberger, et al., 2022; Lauret et al., 2023). Clark et al. (2017) developed a generalized joint attribute model to estimate the distribution and abundance of multiple species using combinations of detection–nondetection and various types of count data. In such models, species abundances are assumed to come from a community level, multivariate normal distribution, thereby accounting for associations among related species within the predefined community. Recent efforts integrating multispecies data sets only scratch the surface of what can be achieved by uniting available data sources on multiple species and whole communities (Rapacciuolo & Blois, 2019), but highlight the growing relevance of, and need for, a generalizable modelling framework.

In this paper, we highlight the class of models broadly categorized as integrated community models. Although the term integrated community modelling has been used with several meanings, here we define it as a framework that links multiple data sources on multiple sympatric species through a hierarchical community component. We outline the general approach of the method, demonstrate three specific applications with different biological and observation model structures using empirical data and simulations and offer a discussion about various factors to consider before using integrated community models. We conclude by providing suggestions for future developments.

2 EXPLANATION OF THE METHOD

We consider integrated community models broadly as the combination of single-species integrated models and hierarchical community models (Figure 1). Each of the different data sources (e.g. detection–nondetection, count, demographic) can be used to inform various components of the underlying biological process model through hierarchical, observation models linked together with a joint likelihood or through simpler approaches that account for variation across data sources through covariates and/or random effects (Wikle & Berliner, 2007; Zipkin et al., 2021). Covariates can be included using link functions to incorporate relevant biotic and abiotic factors that influence species dynamics and detection across space and time. The biological process models for species (Figure 1, yellow shading) can range from simple (e.g. estimates of species occurrence; Miller et al., 2019) to complex (e.g. estimates of species survival, reproduction and abundance; Schaub & Abadi, 2011), and depend on both the life history and ecology of the taxonomic group and the quantity and type of available data.

FIGURE 1. Integrated community models enable the estimation of the status, trends and/or population dynamics of multiple species simultaneously by combining disparate data sources within a unified analysis framework. The schematic diagram outlines the hierarchical approach of the integrated community modelling process: the underlying species-level biological process (yellow; e.g. abundance, $z_{i,j,t}$, of species i at survey site j in year t) is modelled using relevant environmental covariates and/or demographic rates (not shown). Each data source ($y_1, y_2, y_3, \ldots, y_D$) provides a piece of information on the biological process and is connected to the biological parameters via an observation process model (grey). Species-specific parameters are linked through community-level distributions to share information across species (blue).

Parameters in the biological process model are linked to the survey data through observation process models (Figure 1, grey shading). As with the biological process, the observation models can range from simple (e.g. fixed and/or random effects on individual surveys or data sources) to increasingly mechanistic (e.g. modelling detection probabilities based on covariates and recorded auxiliary sampling information), primarily depending on the type and quantity of available data (Dorazio, 2014; Kéry & Royle, 2016). Commonly collected wildlife data types span a spectrum of collection effort and information content including (from low to high): presence-only data (observations of a species in a given location at a given time), detection–nondetection data (records that also include locations that were surveyed but where the species was not observed), count data (total number of individuals observed at a location within a specific time frame) and demographic data (e.g. capture–recapture, productivity data; to inform survival, reproduction, immigration etc.). Typically, the higher the information content, the more difficult the data are to collect and the more likely it is that geographic coverage is limited. In addition to survey type, we can classify presence-only, detection–nondetection and count data as structured, semi-structured or unstructured. Structured data are collected systematically with a design protocol (e.g. NEON), while unstructured data are collected opportunistically and without a defined purpose (e.g. iNaturalist). Semi-structured data are typically collected in volunteer-based monitoring programmes (e.g. eBird) that have some degree of structure (e.g. collection of auxiliary sampling information, checklists) that helps mitigate observational and sampling biases present in unstructured data, but lack the rigorous sampling design of structured data (Altwegg & Nichols, 2019; Kelling et al., 2019).

The various available data sources provide unique or complementary information (in terms of spatiotemporal location, life stage and/or demography) on the biological processes of interest (Saunders, Farr, et al., 2019). For example, available data sources may consist of multiple different data types (e.g. one count and one demographic data set) or may all be the same type (e.g. two detection–nondetection data sets) but collected under distinct protocols with different types of observation and/or sampling biases. Thus, the value of combining available data sources can either be inference on mechanistic biological processes (as is done in integrated population models; Schaub & Kéry, 2021) or expansion of the spatiotemporal scope of inference (as is done in integrated distribution models; Isaac et al., 2020). When modelled together, the information from each data source can be used to jointly estimate biological parameters, improving accuracy on inferences (Pacifici et al., 2017), and in many cases, allowing estimation of a greater number of parameters than is possible through independent analyses of the various data sets (Plard et al., 2019; Schaub et al., 2007).
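As a minimal sketch of how a joint likelihood lets two data sources inform the same biological parameter, the example below sums the marginal occupancy log-likelihoods of two detection–nondetection data sets that share one occurrence probability but have source-specific detection probabilities. It assumes a single species, constant parameters, and that the two sources survey different sites and are independent given the parameters; the source names, data and parameter values are hypothetical.

```python
import math


def site_log_lik(history, psi, p):
    """Marginal log-likelihood of one site's detection history (list of 0/1).

    The latent occupancy state is summed out: the site is either occupied
    (probability psi) and produces the history with per-visit detection
    probability p, or unoccupied (1 - psi), which is only possible when
    the species was never detected.
    """
    detected = sum(history)
    visits = len(history)
    lik_if_occupied = psi * p ** detected * (1.0 - p) ** (visits - detected)
    lik_if_empty = (1.0 - psi) if detected == 0 else 0.0
    return math.log(lik_if_occupied + lik_if_empty)


def joint_log_lik(psi, p_by_source, data_by_source):
    """Sum the log-likelihoods across data sources that share psi."""
    total = 0.0
    for source, histories in data_by_source.items():
        for history in histories:
            total += site_log_lik(history, psi, p_by_source[source])
    return total


# Hypothetical data: three sites per source, three visits per site
data = {
    "structured": [[1, 0, 1], [0, 0, 0], [0, 1, 0]],
    "semi_structured": [[0, 0, 0], [0, 0, 1], [0, 0, 0]],
}
detection = {"structured": 0.5, "semi_structured": 0.2}

print(joint_log_lik(psi=0.6, p_by_source=detection, data_by_source=data))
```

Because both sources contribute terms that depend on the shared occurrence probability, maximizing or sampling from this joint function uses all of the data to estimate it, while each detection probability is informed only by its own source.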

The species-level biological and observation process models are joined by allowing parameters to come from common, community-level distributions (Figure 1, blue shading), which can be univariate (e.g. Farr et al., 2019) or multivariate (e.g. Thorson et al., 2016), depending on whether estimation of covariances among species is desirable. This facilitates information sharing among species in the community, allowing parameter estimation for all species and not just those with large sample sizes (Zipkin et al., 2009). Additionally, this approach produces estimates of community-level parameters (mean across all species and variance among species), which can be used to summarize community responses to relevant covariates and environmental stressors (Threlfall et al., 2017).

In theory, integrated community models could be analysed with either a frequentist or Bayesian approach. In most cases, Bayesian analysis will be more practical to simultaneously estimate the biological and observation parameters at species and community levels. Bayesian inference allows for maximum flexibility in terms of hierarchical model structure and convergence (Fordyce et al., 2011) and enables straightforward calculation of derived biodiversity metrics (e.g. site-level richness, evenness, composition and turnover; Dorazio, 2016; Gelman & Hill, 2007). In our case studies, parameters are all estimated using Bayesian analysis with several different approaches in the R software platform (R Core Team, 2022), including built-in packages and custom code that implements Markov chain Monte Carlo (MCMC) algorithms via JAGS (Plummer, 2003), NIMBLE (de Valpine et al., 2017) and Stan (Stan Development Team, 2022).
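A derived metric such as site-level richness can be computed directly from posterior draws of the latent occurrence states. The sketch below assumes the draws are stored as nested lists indexed by draw, species and site; this layout and the toy values are hypothetical and do not reflect the output format of JAGS, NIMBLE, Stan or any R package.

```python
def richness_per_draw(z_draws):
    """Sum latent occurrence states over species for every draw and site.

    z_draws[d][i][j] is 1 if species i occupies site j in posterior draw d,
    and 0 otherwise. Returns richness[d][j].
    """
    richness = []
    for draw in z_draws:
        n_sites = len(draw[0])
        richness.append(
            [sum(species_row[j] for species_row in draw) for j in range(n_sites)]
        )
    return richness


def posterior_mean_by_site(richness):
    """Average the per-draw richness values for each site."""
    n_draws = len(richness)
    n_sites = len(richness[0])
    return [sum(r[j] for r in richness) / n_draws for j in range(n_sites)]


# Toy example: 2 posterior draws, 3 species, 2 sites
z_draws = [
    [[1, 0], [1, 1], [0, 0]],
    [[1, 0], [0, 1], [1, 1]],
]
richness = richness_per_draw(z_draws)
print(richness)                          # [[2, 1], [2, 2]]
print(posterior_mean_by_site(richness))  # [2.0, 1.5]
```

Because richness is calculated within each draw, its full posterior distribution (and therefore credible intervals) comes at no extra modelling cost.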

3 WORKED EXAMPLES

Here, we show three case studies to provide practical examples of the integrated community modelling framework that (1) extend the geographic scope of inference to evaluate species distributions and effects of habitat for a specialized bird community, (2) discern the population trends of open-habitat-associated butterfly species over a recent decade and (3) estimate demographic rates and population growth for communities of sympatric species. Our worked examples demonstrate approaches to combine structured and semi-structured data including detection–nondetection, site- and population-level count and demographic data types. We show how the integrated community modelling framework is capable of producing inferences for both individual species (rare and common) and community-level metrics of interest. We present a general overview and the basic methods of each case study in the main text with details including mathematical equations and the specifics of implementation described in associated supplemental materials. The complete code for the worked examples is available at https://zipkinlab.github.io/#icm2023Z and is also archived at Zipkin et al. (2023).

3.1 Case study 1: Spatial distributions of forest birds across the Northeastern United States

3.1.1 Background and motivation

Communities of species with specialized habitat requirements, such as grassland or interior forest obligate birds, are particularly vulnerable to global change because of limited availability of habitat. Furthermore, species in specialized communities tend to be rare, resulting in few observations within large-scale monitoring programmes and low precision of species distribution estimates and effects of environmental drivers (Lomba et al., 2010). In this case study, we use an integrated community occupancy model (Doser, Leuenberger, et al., 2022) to estimate the spatial variation in species richness of a community of 27 interior forest obligate bird species, most of which are rarely detected, across 11 states in the Northeastern United States. The integrated community occupancy model combines multiple replicated and/or nonreplicated detection–nondetection data sources in a community occupancy modelling framework to provide inferences on species-specific and community-level occurrence patterns across space (and time). By sharing information across species and data sources, this approach can yield increased precision and accuracy of species-specific occurrence patterns compared to single data source models and single species models (Doser, Leuenberger, et al., 2022). Additionally, the community component enables estimation of aggregated biodiversity metrics, such as site-level species richness, which allow for local-scale assessment of available habitat use by the community.

3.1.2 Data

We integrated detection–nondetection data on 27 interior forest obligate bird species from 356 routes in the North American Breeding Bird Survey (BBS; Pardieck et al., 2020) and 10,383 eBird checklists (Sullivan et al., 2009) across 11 states in the Northeastern United States, all collected during the breeding season of 2017 (Supplemental Information S1). Many species were detected infrequently, with 52% and 96% of species detected at less than 30% of the spatial locations in the BBS and eBird data sets respectively. For the BBS data, which is a structured road-side survey, observers recorded all birds seen within a 0.4-km radius at 50 locations (i.e. stops) along each ~39.2 km roadside survey (i.e. route). We assigned each route to a 5 × 5 km grid cell based on its midpoint location and aggregated the 50 stops per route into five spatial replicates of 10 stops each (i.e. we assume each route was sampled five times, but the specific sampling locations occur at different spots along the route). We summarized the detection (1) or nondetection (0) of each species at each of the five spatial replicates. For the eBird data, which is an opportunistic, semi-structured data source, we assigned each complete checklist during a 3-week period in early June (to match the time frame of the BBS data) to a specific grid cell, and used multiple checklists within each cell as repeat surveys for the occupancy modelling framework. To mitigate preferential sampling biases, we used multiple spatiotemporal filtering criteria following standard recommendations for using eBird data in occupancy models (Johnston et al., 2021; Strimas-Mackey et al., 2020).
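The aggregation of 50 BBS stops into five spatial replicates of 10 stops each can be sketched as follows. The sketch assumes that stops are stored in route order as 0/1 detections for a single species on a single route, and that a replicate is scored as a detection if the species was recorded at any of its 10 stops; the function name and example data are hypothetical.

```python
def aggregate_stops(stop_detections, n_replicates=5):
    """Collapse stop-level detections (0/1, in route order) into replicates.

    Consecutive stops are grouped into equal-sized blocks; a block is 1 if
    the species was detected at any stop in it, and 0 otherwise.
    """
    if len(stop_detections) % n_replicates != 0:
        raise ValueError("number of stops must divide evenly into replicates")
    block = len(stop_detections) // n_replicates
    return [
        1 if any(stop_detections[k * block:(k + 1) * block]) else 0
        for k in range(n_replicates)
    ]


# Hypothetical route: species detected at stops 4 and 48 only (1-based)
stops = [0] * 50
stops[3] = 1
stops[47] = 1
print(aggregate_stops(stops))  # [1, 0, 0, 0, 1]
```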

3.1.3 Modelling approach

We modelled the data using an integrated community occupancy model that consisted of individual observation models (i.e. likelihoods) for the BBS and eBird data sets, which shared a species-specific biological process model. The biological process model described how species-specific occurrence varies across space as a function of elevation, forest cover and five bioclimatic variables, with individual species-specific effects treated as random variables arising from common community-level normal distributions (Dorazio & Royle, 2005). Both observation models explicitly accounted for imperfect detection by allowing detection probability to vary across species, space and replicate surveys. For the BBS observation component, we modelled detection probability as a function of day of the year. For the eBird observation component, we modelled detection probability as a function of day of the year, time of day of collection, time spent, distance travelled and the number of observers. Similar to the biological process model, we assumed that each of the species-specific parameters in the observation models arises from a community-level distribution. We estimated species richness of the community in the 5 × 5 km grid cells as a derived quantity across the Northeastern United States. We fit the model in a Bayesian framework using the spOccupancy R package (Doser, Finley, et al., 2022). We also fit the model with custom code by calling NIMBLE through R to provide potential users with additional implementation resources (code provided but results not shown). See Supplemental Information S1 for full model details.
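The structure described in this section can be summarized with a generic model statement. This is a sketch only: the notation is assumed for illustration (species i, grid cell j, replicate k, data source d, occurrence covariates x and detection covariates w) and may differ from the equations in Supplemental Information S1, which remain the authoritative description of the fitted model.

```latex
\begin{align*}
z_{i,j} &\sim \mathrm{Bernoulli}(\psi_{i,j}),
  & \mathrm{logit}(\psi_{i,j}) &= \beta_{0,i} + \sum_{r=1}^{R} \beta_{r,i}\, x_{r,j}, \\
y^{(d)}_{i,j,k} \mid z_{i,j} &\sim \mathrm{Bernoulli}\!\big(z_{i,j}\, p^{(d)}_{i,j,k}\big),
  & \mathrm{logit}\big(p^{(d)}_{i,j,k}\big) &= \alpha^{(d)}_{0,i} + \sum_{q=1}^{Q_d} \alpha^{(d)}_{q,i}\, w^{(d)}_{q,j,k}, \\
\beta_{r,i} &\sim \mathrm{Normal}\big(\mu_{\beta_r}, \sigma^{2}_{\beta_r}\big),
  & \alpha^{(d)}_{q,i} &\sim \mathrm{Normal}\big(\mu_{\alpha^{(d)}_q}, \sigma^{2}_{\alpha^{(d)}_q}\big), \\
N_{j} &= \sum_{i} z_{i,j}, & d &\in \{\mathrm{BBS}, \mathrm{eBird}\}.
\end{align*}
```

Here the latent occurrence state $z_{i,j}$ is shared by both observation models, which is what ties the BBS and eBird data to a single biological process, and $N_j$ is the derived species richness in grid cell j.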

3.1.4 Results

Species richness of the interior forest bird community varied substantially across the northeast, with high richness across the Appalachian and Adirondack mountains and low richness in urban areas (Figure 2a). Spatial variation in richness and species-specific occurrence probabilities were largely driven by forest cover, with all species in the community showing a positive relationship to the amount of local-level forest cover (Figure 2b). The integrated community occupancy model shares information across individual species and data sources, which allowed us to estimate reasonably precise occurrence probabilities and covariate effects throughout a broad spatial extent for common as well as rare species (e.g. Figure 2c,d). Given that the species most vulnerable to global change are often rarely observed in large-scale monitoring programmes, integrated community modelling frameworks that can leverage comparable data types (e.g. structured or semi-structured occurrence or count data) from multiple monitoring programmes and multiple species can provide a richer understanding of vulnerable communities and the specific species within them.

FIGURE 2. Estimates from an integrated community occupancy model for a community of 27 interior forest obligate bird species across the Northeastern United States using a 5 × 5 km grid. Panel (a) shows estimated mean species richness across the region while panel (b) shows the estimated mean (dark line), 50% credible interval (box) and 95% credible interval (whiskers) for the effect of forest cover on the overall community (COMM) and individual species (see Table S1.1 for species codes). Panel (c) shows estimated mean occurrence probabilities for Black-throated Green Warblers (Setophaga virens; BTNW), a common species of least concern, while panel (d) shows mean occurrence probabilities for Cerulean Warbler (Setophaga cerulea; CERW), a rare species that is classified as near threatened.

3.2 Case study 2: Temporal trends of butterflies in the Midwestern United States

3.2.1 Background and motivation

Insect communities face numerous threats, with a variety of anthropogenic stressors contributing to population and diversity declines (Forister et al., 2021; Hallmann et al., 2017; Wepprich et al., 2019). Rigorously quantifying insect population trends is notoriously difficult due to their complex life histories (e.g. seasonal variation in activity; Saunders, Ries, et al., 2019), rarity and elusiveness of species, biases in long-term data sets (e.g. natural history collections; Davis et al., 2023; Ries et al., 2019) and unbalanced sampling across space and time in volunteer monitoring programmes (Dennis et al., 2013). Integrating multiple data sources in a community modelling framework can mitigate many of these data complexities by increasing the amount of data available, sharing information across species and accommodating sampling biases within individual data sources. In this case study, we use an integrated community model to quantify relative abundance trends in 10 open-habitat-associated butterfly species in the Midwestern United States over a recent decade. Our model explicitly accounts for variation at multiple spatial (i.e. site, county) and observational levels (i.e. survey) while simultaneously accounting for variation in expected counts between different data sources due to differences in survey protocols and observer skills (Zylstra et al., 2021).

3.2.2 Data

We integrated count data from five volunteer-based monitoring programmes to assess early-summer butterfly trends (June through July) from 2008 to 2017 across six Midwestern US states (Iowa, Wisconsin, Illinois, Indiana, Michigan, Ohio). We focused our analysis on 10 species that are year-round residents, are active during summer, inhabit open areas, are multivoltine, are relatively easy to detect and are adequately sampled by all monitoring programmes (Supplemental Information S2). Four data sets come from statewide, structured butterfly surveys: (1) Illinois Butterfly Monitoring Network (266 spatial locations); (2) Iowa Butterfly Survey Network (61 spatial locations); (3) Michigan Butterfly Network (133 spatial locations); and (4) Ohio Lepidopterists (118 spatial locations). In each programme, data are collected following a Pollard transect protocol in which trained volunteers walk ~1 km transects weekly or biweekly throughout the summer and record every butterfly detected (Pollard, 1977), although survey protocols differ slightly among states. While each of these four data sets has substantial, repeated temporal sampling across a given summer, they are restricted in spatial extent to only a single state. Our fifth data set comprises semi-structured count data from the North American Butterfly Association (NABA), in which volunteer observers extensively survey a 25-km diameter circle once a year, recording all butterflies observed, by species (https://www.naba.org). We used data from 85 sites that fell within our study region and survey time period. Alone, the NABA data may not provide reliable estimates of species temporal trends since only one count is performed at a given site in each year, which often does not adequately represent temporal variation in butterfly counts due to high variation in seasonal activity periods of species (Dennis et al., 2013; Zylstra et al., 2021). However, NABA data have a much larger spatial extent (i.e. North America) compared to the statewide butterfly surveys, and are thus able to provide critical information across the full study region.

3.2.3 Modelling approach

Our integrated community model is based on a negative binomial hierarchical model, adapted from an analysis on monarch butterflies (Zylstra et al., 2021). We modelled the mean expected count for each butterfly species in a given week during a given year as a function of multiple covariates and random effects. To account for variation in survey effort across datasets, we included a fixed categorical variable of survey protocol (five levels) and a (log) linear effect of survey effort (i.e. the total number of surveys performed at a given spatial location in a given week). We further included a linear and quadratic effect of week to account for species-specific seasonal variation in activity, a linear effect of year to estimate any temporal trends in butterfly abundance, and random effects of county, site and year to account for additional variation. We allowed the species-specific effects of week to vary by year to account for differences in butterfly phenology across the decade. We treated each of the species-specific parameters (intercept and covariate effects) as random variables that come from common, parameter-specific, community-level normal distributions. We estimated a derived annual relative abundance index for each species as the expected number of individuals counted in a single survey (i.e. using the species-specific intercept, linear trend of year and year random effect) to create an average across sites and weeks. We fit the model in a Bayesian framework using the spAbundance R package (Doser, 2023). We also fit the model with custom code by calling Stan through R (code provided but results not shown). See Supplemental Information S2 for full model details.
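The log-linear structure of the expected count can be sketched for a single species and survey as below. Several details are assumptions made for illustration rather than the fitted model: survey effort enters as a coefficient on the log of the number of surveys, week and year are taken to be centred and scaled, all random effects are lumped into one term, and the negative binomial is parameterized by its mean and a dispersion ("size") parameter. All numeric values are hypothetical; see Supplemental Information S2 for the actual specification.

```python
import math


def expected_count(intercept, protocol_effect, effort_coef, n_surveys,
                   week_lin, week_quad, week, year_trend, year,
                   random_effects=0.0):
    """Expected count from a log-linear predictor (illustrative form only)."""
    log_mu = (intercept
              + protocol_effect                   # categorical survey protocol
              + effort_coef * math.log(n_surveys) # assumed log-scale effort
              + week_lin * week                   # seasonal activity, linear
              + week_quad * week ** 2             # seasonal activity, quadratic
              + year_trend * year                 # linear temporal trend
              + random_effects)                   # county + site + year effects
    return math.exp(log_mu)


def neg_binom_log_pmf(y, mu, size):
    """Log-probability of count y under a negative binomial with mean mu.

    Variance is mu + mu**2 / size, so smaller size means more overdispersion.
    """
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))


# Hypothetical parameter values for one species, site, week and year
mu = expected_count(intercept=1.0, protocol_effect=-0.2, effort_coef=1.0,
                    n_surveys=2, week_lin=0.1, week_quad=-0.3, week=0.5,
                    year_trend=-0.04, year=1.0)
print(mu, neg_binom_log_pmf(y=4, mu=mu, size=2.0))
```

In the community model, each species would have its own set of these coefficients, with every coefficient type drawn from its own community-level normal distribution.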

3.2.4 Results

The integrated community model revealed varying support for linear trends in relative abundance across butterfly species. The 95% credible intervals of all trend estimates overlapped zero, indicating uncertainty in population changes over this short time period (Figure 3), which is not entirely unexpected for butterfly species as insects tend to have high variation in abundance from year to year (Didham et al., 2020; Wagner et al., 2021). Seven of the 10 species had negative average trend estimates, with the most support for declines in Eastern Tiger Swallowtail (Papilio glaucus; probability of negative trend = 0.87), Cabbage White (Pieris rapae; 0.83), Peck's Skipper (Polites peckius; 0.82) and Spring/summer Azure (Celastrina ladon; 0.80). There was mild support (69% probability) for a declining trend across the community (log-scale mean = −0.04) although with high uncertainty (SD = 0.08). In our case study, the integrated community modelling framework allowed us to take advantage of the within-season repeated sampling of the statewide structured monitoring data sets while simultaneously leveraging the large spatial extent of the NABA data to generate trend estimates of multiple butterfly species across the Midwestern United States. As global climate and land-use change continue to pose threats to animal communities, modelling frameworks that leverage all available data sources on multiple species can provide critical insights on which species are most vulnerable and which may be less susceptible.

Relative abundance trends of 10 butterfly species and the community (COMM) in the Midwestern United States from 2008 to 2017 using an integrated community model. The probability that the linear trend estimate is less than zero is also included in each panel (i.e. that abundance is decreasing through time). Points (posterior medians with 95% credible interval lines) show the derived annual relative abundance index for each species. Black trend lines represent the posterior median trend estimate with the shaded area denoting the 95% credible interval. Species that indicate a possible decline over the 10-year study period are shaded in yellow (i.e. probability of a negative trend >0.6), while species that are likely to have increased (i.e. probability of negative trend <0.4) are shaded in blue. All others are shaded in grey.
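
The probability of a negative trend reported in each panel is the share of posterior draws of the trend parameter that fall below zero. The sketch below illustrates that calculation and the shading rule from the caption. It is not the authors' code: the function names are hypothetical, and the posterior of the community-level trend is approximated by a normal distribution with the reported log-scale mean (−0.04) and SD (0.08), which is an assumption because the actual posterior comes from MCMC output.

```python
from statistics import NormalDist

def prob_negative(draws):
    """Share of posterior draws of a trend parameter that are below zero."""
    return sum(d < 0 for d in draws) / len(draws)

def shade(p_neg):
    """Panel shading rule described in the figure caption."""
    if p_neg > 0.6:
        return "yellow"   # possible decline
    if p_neg < 0.4:
        return "blue"     # likely increase
    return "grey"

# Normal approximation to the community-level trend posterior (an assumption).
approx = NormalDist(mu=-0.04, sigma=0.08)

# Analytical probability that the trend is below zero under the approximation.
p_comm = approx.cdf(0.0)

# The same quantity computed from simulated draws, as one would with MCMC output.
draws = approx.samples(20000, seed=42)
p_mc = prob_negative(draws)

print(round(p_comm, 2), shade(p_comm))
```

Under this approximation the probability is close to the 69% reported for the community, which suggests the reported mean, SD and probability are mutually consistent.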

3.3 Case study 3: Estimating species- and community-level demographic rates and population growth

3.3.1 Background and motivation

Comprehensive monitoring and evaluation of biodiversity requires data not only on species distribution and abundance patterns but also on their demographic rates including survival and reproduction (Beyer & Manica, 2020). However, collecting multiple data types is resource-intensive for individual species, and orders of magnitude harder for entire ecological communities. Thus, analytical frameworks that can combine very different types of data, deriving from a variety of sources and protocols, are especially valuable for biodiversity assessments. In this case study, we develop a multispecies integrated population model that extends integrated population models (Besbeas et al., 2002; Schaub & Kéry, 2021) to a community level within a single analytical framework. Previous work has leveraged the single-species integrated population model to estimate interspecific interactions within a multispecies context (e.g. Quéroué et al., 2021), but only for a small number (i.e. two or three) of species. The approach we outline here is capable of making inferences on tens of species simultaneously by treating species-specific demographic rates as random effects from shared, community-level distributions (Iknayan et al., 2014). By combining multiple data types (e.g. population-level counts, productivity, capture–recapture data) on multiple sympatric species, this integrated community modelling approach has the potential to improve parameter identifiability for species and/or demographic rates for which only limited data are available (Zipkin & Saunders, 2018). We demonstrate the potential utility of the multispecies integrated population modelling framework using a data simulation approach.

3.3.2 Data and approach

To construct the multispecies integrated population model, we first defined a biological process model that incorporates both demographic rates and population sizes for each of 10 hypothetical species that we assume are all part of the same community. We linked species' population sizes with their demographic rates using a female-based, age-structured matrix model (Caswell, 2000) with two age classes (juvenile and adult) and a pre-breeding census for each species (Figure 4a). We assume that species-specific fecundity, juvenile survival and adult survival are each derived from parameter-specific, normal distributions with a community-level mean and variance (Dorazio & Royle, 2005). We used the biological process model to simulate 100 independent, annual population-level counts (i.e. census), productivity (i.e. number of juveniles produced per adult) and capture–mark–recapture data sets for the hypothetical community of 10 species over a 10-year time period. While we simulated all three types of data (census, productivity and capture–recapture data) for all 10 species, species differed in the relative amount of data depending on their population size. For example, rare and declining species naturally had less capture–recapture and productivity data because there were fewer individuals available for sampling. We further assumed that all data sources were collected via design-based, structured sampling protocols, such that the data sources are representative of each species' population but may also contain sampling error. We then estimated parameters from a joint likelihood of the three independent data sets to make inferences on the species- and community-level demographic rates as well as derived parameters, including annual population sizes and growth rates (Kéry & Royle, 2016; Schaub & Kéry, 2021). 
We evaluated the performance of the model by calculating the relative bias, (posterior mean − true value)/true value, of estimated parameters at both the species and community levels. We fit our model by developing custom code in R and JAGS with the jagsUI package (Kellner, 2021). See Supplemental Information S3 for full model details.
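
As a rough illustration of the biological process model and the bias metric, the Python sketch below projects a two-age-class population forward in time and computes relative bias as (posterior mean − true value)/true value. It is a deterministic simplification under stated assumptions, not the authors' JAGS model: the function names and parameter values are hypothetical, both age classes are assumed to breed, `f` is taken as female offspring per female, and new recruits must survive their first year (`phi_juv`) to be counted at the next pre-breeding census.

```python
from statistics import geometric_mean

def project(n_juv, n_ad, f, phi_juv, phi_ad, years):
    """Deterministic projection of a two-age-class, pre-breeding-census model.

    Returns total population size at each census, including the initial one.
    """
    totals = [n_juv + n_ad]
    for _ in range(years):
        breeders = n_juv + n_ad               # assumption: both classes breed
        n_juv, n_ad = (breeders * f * phi_juv,  # recruits surviving year one
                       breeders * phi_ad)       # survivors, all adults next year
        totals.append(n_juv + n_ad)
    return totals

def growth_rates(totals):
    """Annual population growth rates N[t+1] / N[t]."""
    return [later / earlier for earlier, later in zip(totals, totals[1:])]

def relative_bias(posterior_mean, true_value):
    """Relative bias of an estimate with respect to the simulated truth."""
    return (posterior_mean - true_value) / true_value

# Hypothetical species: growth rate equals f * phi_juv + phi_ad = 1.05.
totals = project(n_juv=40.0, n_ad=60.0, f=1.5, phi_juv=0.3, phi_ad=0.6, years=10)
lam = growth_rates(totals)

# Community-level growth rate as the geometric mean across species
# (hypothetical species-level values).
community_lambda = geometric_mean([1.05, 0.95, 1.00])

# Relative bias for a hypothetical estimate of adult survival.
bias = relative_bias(posterior_mean=0.62, true_value=0.60)
```

The simulation described in the text additionally includes demographic stochasticity and sampling error in each data set, which this sketch omits.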

Multispecies integrated population models (MIPM) enable unbiased inference on species-specific and community-level demographic parameters as well as mean population growth rates over time, as shown by our simple simulation study. Panel (a) describes the female-based age-structured model with two age classes (juvenile and adult) and a pre-breeding census that was used to link population sizes with demographic rates with i = 10 species in our hypothetical community. Species-specific demographic rates of fecundity f_i, juvenile survival ϕ_{i,1} and adult survival ϕ_{i,2} were derived from shared distributions with a community-level mean and variance (not shown). Panel (b) shows the mean relative bias in estimated demographic parameters (dark line) with the 50% (box) and 95% (whiskers) credible intervals at both the community (blue) and species (white) levels compared to a simulated truth. Population growth rates were also estimated from the MIPM with high accuracy and precision, as shown in panel (c), where black dots show the true mean population growth rates for each species (S1–S10) and dark lines show the mean estimated values with 50% (box) and 95% credible interval (whiskers). Community-level (COMM) growth rates were derived as the geometric mean across species. Estimated population growth rates were less precise for declining species with low survival, as shown in panel (d) for two species (S4 and S7) with the same expected fecundity.

3.3.3 Results

The multispecies integrated population model was able to recover the true biological parameters at both the species and community levels with little bias, and with especially high accuracy and precision of community-level demographic rates (Figure 4b). Mean population growth rates across the 10 years were also highly accurate for all species, as well as the community (Figure 4c). However, estimates of population growth rates tended to be less precise for declining species as compared to those species whose growth rates were positive (Figure 4c,d), likely because declining species tend to have smaller population sizes and thus less data available for analysis. In conducting this simple simulation, we demonstrated that combining multispecies demographic and population-level count data provides a viable solution for quantifying species-specific and community-level dynamics and growth rates, which can ultimately aid in biodiversity monitoring and assessments from local to regional scales. Applications of this approach may be particularly beneficial in cases where data are limited for some species within a community (but plentiful for others), or when one or more data types are not collected for all species in every year (e.g. multispecies amphibian survey data, waterfowl banding data, mist net capture data of birds). As these types of models are fairly new, additional work is necessary to understand the full inferential benefits—and potential biases—under scenarios that vary parameter values (e.g. low vs. high detection probability and/or sampling error) and include an exploration as to how the amounts of various data types influence model estimates (e.g. when all three types of data are not available for all species in all years).

4 THINGS TO CONSIDER BEFORE USING THIS METHOD

Potential users should consider several points before developing an integrated community model for their system. Initial steps in determining the utility and value of combining multispecies data sources should focus on the specific information that could be gained from an integrated community model and the quantity and types of available data. Data integration approaches are becoming increasingly popular (Zipkin et al., 2021), and for good reason, as they have immense potential to expand inferential and predictive capabilities from available, yet imperfect, data (Zylstra & Zipkin, 2021). However, data integration is not without its challenges and limitations. In some cases, combining multispecies data sources—rather than estimating parameters separately for individual species or data sets—may be orders of magnitude more complicated with high computational burdens (Gotway & Young, 2002; Pacifici et al., 2019), or may not substantially help to answer the particular question of interest. Although integrating different data types generally helps with parameter identifiability and precision (Doser et al., 2021; Farr et al., 2021), data integration alone cannot correct for biases in unstructured data or problems with collinearity among environmental variables, especially when the drivers of such biases are unknown or cannot be incorporated within the observation model (Simmonds et al., 2020). Simulations and model assessments can help establish the inferential value of integrated community models as compared to simpler alternatives for specific study systems. Furthermore, the quantity and types of data available have important implications for the potential structure and complexity of the biological process model. 
Although the incorporation of mechanistic processes within models is a clear goal within ecological research, if data are unavailable to estimate detailed demographic parameters, no amount of integration will be able to rectify the situation (Plard et al., 2021; Riecke et al., 2019).

Several papers have focused on the individual challenges of both single-species data integration (Isaac et al., 2020; Miller et al., 2019; Zipkin et al., 2021) and hierarchical community modelling (Guillera-Arroita, 2017; Iknayan et al., 2014). Successful data integration models require consideration of the spatial extent and scale of the various data sources to resolve mismatches in the collection grain (e.g. through change of support; Farr et al., 2021; Pacifici et al., 2019; Zipkin et al., 2017), spatial biases in unstructured and semi-structured data due to preferential sampling (e.g. by including site selection within the model [Conn et al., 2017] or spatially correlated random effects [Hefley et al., 2017]), and issues related to unbalanced quantities of various data sources (e.g. through subsampling or downweighting high volume, low information-content data [Johnston et al., 2018] or modelling biases within the likelihood [Fer et al., 2018; Tang et al., 2021]). Similarly, the value of inferences within community models will depend on how the community is defined and the proportion of species in the community that are rare or undersampled. Statistically speaking, including all species that were observed during a multispecies data collection event within a community model is legitimate, as the result will simply draw species-level parameters towards community averages (Dorazio & Royle, 2005). In practice, however, researchers will want to have an ecological justification for including individual species within a community modelling analysis. Additionally, when a high proportion of species are rare or infrequently observed, it is difficult to achieve convergence in community models and parameter estimates are likely to be exceedingly imprecise for many species (Zipkin et al., 2020), potentially rendering such analyses less informative.
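
The pull of species-level parameters towards community averages can be illustrated with the textbook normal-normal shrinkage formula. The Python sketch below is a simplified stand-in for the behaviour of a hierarchical community model, not a method taken from the paper: it assumes known within-species and among-species variances, and the function name and all numerical values are hypothetical.

```python
def shrunk_estimate(species_mean, n_obs, sigma2, community_mean, tau2):
    """Partial-pooling estimate for one species under a normal-normal model.

    species_mean   : raw mean of the species' own data
    n_obs          : number of observations for the species
    sigma2         : within-species (sampling) variance, assumed known
    community_mean : mean of the community-level distribution
    tau2           : among-species variance, assumed known
    Returns the shrunk estimate and the weight placed on the species' own data.
    """
    weight = tau2 / (tau2 + sigma2 / n_obs)
    estimate = weight * species_mean + (1 - weight) * community_mean
    return estimate, weight

# A well-sampled species keeps most of its own signal ...
common_est, common_w = shrunk_estimate(species_mean=0.5, n_obs=100, sigma2=1.0,
                                       community_mean=0.0, tau2=0.04)

# ... whereas a rarely observed species is drawn strongly to the community mean.
rare_est, rare_w = shrunk_estimate(species_mean=0.5, n_obs=2, sigma2=1.0,
                                   community_mean=0.0, tau2=0.04)
```

With these hypothetical values the well-sampled species retains a weight of 0.8 on its own data, while the rare species retains less than 0.1. This is why estimates for rare species in a community model largely reflect the community average, and why the definition of the community matters.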

Within the context of integrated community models, the challenges of both data integration and community modelling are likely to be present and may be exacerbated. For example, problems associated with unbalanced data in integrated community models may be acute when the spatial extent of the study is broad (Zipkin et al., 2021) or the goal is to understand mechanistic processes (Campbell et al., 2018). This is especially true if data quantities are highly uneven among species or many species are underrepresented or absent within a specific data source, either due to sampling constraints or cryptic behaviours that vary among species within the community. Combining replicated and non-replicated data sets typically requires more complex ways of accounting for errors in detection (Doser, Leuenberger, et al., 2022). In many hierarchical community models, which focus solely on estimating occurrence or abundance rates, defining the community is done rather loosely and may simply be ‘all species’ (all observed species, or both observed and unobserved species if using data augmentation; Royle et al., 2007). In the context of an integrated community model, it may be more important to have a clear definition of, and strong biological justification for, the target community because of the additional parameters that are estimated (e.g. demographic rates). Otherwise, inferences on community parameters—as well as for rare species—may not be biologically meaningful.

Finally, we expect there to be computational issues for many types of integrated community models, as this is a concern for both single-species integrated models and hierarchical community models. Many hierarchical models are analysed using Bayesian methods, such as MCMC approaches that can take an exceedingly long time to run and may be difficult to troubleshoot. Typically, such models are fit using common Bayesian software packages such as JAGS, NIMBLE and Stan, which provide flexibility for defining specialized models, but require a substantial amount of programming knowledge and custom coding. Thus, potential users should consider if it is worth developing a complicated, but more adaptable, model or if qualitatively similar inferences can be achieved using built-in software (e.g. R packages). Fortunately, there has been substantial development on creating user-friendly and computationally efficient software for both single-species integrated models and community models. Simple forms of integrated community models, like the hierarchical negative binomial model used in the butterfly case study, can be fit using user-friendly R packages that can accommodate a variety of random effect structures (e.g. brms [Bürkner, 2017], spAbundance [Doser, 2023]), while the integrated community occupancy model used in the bird case study can be fit with the spOccupancy R package (Doser, Finley, et al., 2022). Continued development of specialized and computationally efficient software is an important avenue for future research that can more readily facilitate the implementation of integrated community models. However, for now, many complex biological and observational model structures are likely to require custom coding using Bayesian programming languages.

5 CONCLUSIONS

Integrated community models are an exciting framework to assess biodiversity dynamics using unified approaches that can combine data from multiple different sources that vary in information content. Such methods can leverage public science, government-funded and traditional, scientist-collected data sources to expand the spatial and temporal scope of inference for the many data-deficient species on which little is known. Integrated community models take advantage of the collective benefits of data integration and community modelling frameworks, while accounting for both biological processes and observation errors. Thus, these models can improve the accuracy and precision of both species-level parameters and community-level metrics, enhancing understanding about the variations in species' responses to local environmental factors and global climate and land-use change. Future research in integrated community modelling could focus on approaches to assess model fit and compare among competing model structures, on incorporating wider varieties of structured and unstructured data types (including opportunistic presence-only data), and on expanding inferences to communities with greater numbers of species. The development of integrated models also has important implications for the design of new data collection activities. Forthcoming sampling and monitoring programmes should consider what data sets are already available on particular species and communities in order to obtain maximally beneficial data (Chandler et al., 2017; Moussy et al., 2021). For example, targeting areas with limited data or collecting data on unknown demographic rates could be particularly advantageous within an integrated community modelling framework. 
While integrated community models are still at an early stage of development, given the exponentially growing use of both single-species integrated models and hierarchical community models over the last decade, we expect there to be much activity in this area over the next decade. Such approaches are leading to increasingly efficient and useful assessments of biodiversity, which is critically important as the world's environment is changing rapidly.

AUTHOR CONTRIBUTIONS

Elise F. Zipkin conceived of the study, led the team and wrote the initial draft of the paper with input from Jeffrey W. Doser and Courtney L. Davis. Jeffrey W. Doser, Wendy Leuenberger and Courtney L. Davis led the development and analysis of case studies one, two and three, respectively, with input from Samuel Ayebare, Kayla L. Davis and Elise F. Zipkin. All authors reviewed the manuscript and provided edits during revisions.

ACKNOWLEDGEMENTS

We thank Leslie Ries and the North American Butterfly Association (NABA) for providing access to the data used in case study two. We thank the editors and reviewers for many helpful comments on earlier drafts, especially Rob Salguero-Gómez for extensive feedback at various stages of the writing process. This study was funded by the United States National Science Foundation (NSF) with award DEB-1954406.

CONFLICT OF INTEREST STATEMENT

The authors have no conflict of interest.

ETHICS STATEMENT

Not applicable.

DATA AVAILABILITY STATEMENT

The code and data used in the integrated community modelling case studies can be found on the Zipkin Lab Code Archive (https://zipkinlab.github.io/) and are also permanently archived on Zenodo at https://doi.org/10.5281/zenodo.8361425 (Zipkin et al., 2023).
