Species data for understanding biodiversity dynamics: The what, where and when of species occurrence data collection

1. The availability and quantity of observational species occurrence records have greatly increased due to technological advancements and the rise of online portals, such as the Global Biodiversity Information Facility (GBIF), that coalesce occurrence records from multiple datasets. It is well established that such records are biased in time, space and taxonomy, but whether these datasets differ in relation to origin has not been assessed. If biases are specific to different types of datasets, and the relative contribution from these datasets has changed over time, these shifting biases will have implications for interpretations of results and, consequently, for management and conservation measures.
2. We examined observational GBIF records from Norway to test potential differences in taxonomic, time and land-cover biases between 10 different datasets, with a focus on red-listed and non-native species.
3. The datasets differ in their taxonomic coverage, with datasets dominated by citizen scientist recorders focusing greatly on birds. The number of records has increased over time; in particular, citizen science datasets [...] between species occurrence records with different origins for science-policy impact and management.


KEYWORDS
alien species, citizen science, Global Biodiversity Information Facility, land cover, museum collections, sampling bias, threatened species

INTRODUCTION
The amount and availability of data on species occurrences have increased tremendously in recent years (Gaiji et al., 2013), as has their use in applied conservation and biodiversity management (Powney & Isaac, 2015). Registering species occurrences has become far easier than in the early days of biogeographical surveys due to technological advancements and can be done with the help of volunteer amateurs ('citizen scientists') (Boakes et al., 2016). Online portals, for example the Global Biodiversity Information Facility (GBIF) (GBIF.org, 2019a), have further increased public availability and interest (Amano et al., 2016). These portals gather data from various sources, ranging from digitized natural history collections to observations made by citizen scientists. Thus, these records are a mixture of data on preserved specimens and observational records from both structured surveys and opportunistic sightings (Speed et al., 2018).
Volunteers participating in citizen science programmes (or autonomously reporting species occurrences) likely have different motivations for reporting, spanning both intrinsic and extrinsic factors, than institutional recorders who register species according to a specified aim. For participants in citizen science programmes, the most important motivational factors have been reported as a personal connection to, interest in and concern for nature, a wish to contribute to science and to biodiversity and nature conservation, and the value and usefulness of their contributions (Ganzevoort, van den Born, Halffman, & Turnhout, 2017; Larson et al., 2020; Tiago, Gouveia, Capinha, Santos-Reis, & Pereira, 2017).
These mixed datasets thus suffer from various biases and errors due to their diverse origins and underlying motivations (Newbold, 2010). A frequently recognized bias in occurrence records is the 'roadside' bias: observations are reported more frequently at short distances from roads and paths, because these areas are easier to access (Kadmon, Farber, & Danin, 2004; Tye, McCleery, Fletcher, Greene, & Butryn, 2017). The term can be expanded to include areas near densely populated areas (Luck, 2007; Robinson, Ruiz-Gutierrez, & Fink, 2018). Concern has been raised repeatedly over this bias, especially if sampled areas cover significantly different environmental conditions than un-sampled areas (Bystriakova, Peregrym, Erkens, Bezsmertna, & Schneider, 2012; Phillips et al., 2009; Speed et al., 2018). This potentially leads to faulty conclusions regarding biodiversity patterns (Kramer-Schadt et al., 2013).
More importantly, if such potential biases are not similar among data providers (e.g. datasets mainly consisting of purely opportunistic citizen science records vs. datasets from structured, targeted institutional surveys), conclusions can differ depending on the proportional contribution from the different data types (Tye et al., 2017). This holds even more strongly if the relative contribution from the various types of datasets has changed over time.
In terms of biodiversity management, attention is frequently focused on specific taxonomic groups or on species of conservation concern (e.g. red-listed and alien species). However, different data providers might prioritize differently regarding taxonomic groups and species' management status (red-listed vs. alien). Citizen scientists can be biased towards charismatic, easily recognizable taxa (Amano et al., 2016) and have a greater incentive to report red-listed and rare species (Tulloch, Mustin, Possingham, Szabo, & Wilson, 2013). Speed et al. (2018) showed that observational plant records and preserved specimens have different biases regarding taxonomic coverage, time and space, and hypothesized that these differences translate to some degree to whether the data originate from structured surveys or opportunistic records, thus illustrating some of the potential issues with these mixed datasets. Note, however, that the distinction between observation and specimen records is not equivalent to the distinction between citizen science and institutional records: vegetation plot data will be registered as observations, and some specimens in natural history collections are supplied by citizens (Miller-Rushing, Primack, & Bonney, 2012; NTNU University Museum, 2018). Geldmann et al. (2016) showed that spatial bias in citizen science records depended on the sampling scheme, distance to roads and the human population density.
Understanding the spatio-temporal dynamics of biodiversity is paramount to achieving sustainable management of biodiversity issues such as red-listed and alien species. For example, there is a general lack of understanding of how land use, a main but complex driver, affects biodiversity change, as detailed data on species occurrences associated with different land-use types are often limited.
Fine-grain data on species distributions and associations from local to global spatial scales, and over long time periods, are required, a task virtually impossible to achieve through targeted surveys alone (Dickinson, Zuckerberg, & Bonter, 2010; Theobald et al., 2015). Opportunistic citizen science records are frequently used as a data source for species distribution modelling (SDM) (Beck, Böller, Erhardt, & Schwanghart, 2014; Jetz, McPherson, & Guralnick, 2012), which can be used in decision-making for managing red-listed and alien species (Thuiller et al., 2005; Guisan et al., 2013; Syfert et al., 2014). As these models are sensitive to bias in the data (Yañez-Arenas, Guevara, Martínez-Meyer, Mandujano, & Lobo, 2014), methods to account for varying forms of bias in SDMs are still being explored (e.g. Kramer-Schadt et al., 2013; Dorazio, 2014; Robinson et al., 2018). A general caveat of using GBIF records in SDM is that only a few datasets report species absences, thus requiring the use of presence-only modelling.
If the inherent biases differ markedly between datasets collected through institutional surveys, through citizen science, or through a mixture of the two, and the proportional contribution from these groups has changed over time, this raises the additional issue of how to deal with shifting biases, rather than simply static ones.
To our knowledge, limited attention has been given to whether taxonomic, temporal and geographical sampling biases are similar for datasets with varying origins (i.e. predominantly from citizen science programmes or institutionally organized surveys), and whether these different datasets complement or amplify each other's biases. The same holds for records of conservation concern within these datasets (but see Beck et al., 2014 [...] (2) whether certain datasets (with specified origins and characteristics) are representative of all collected data, and if not: (3) how to ensure complementarity between datasets to obtain maximum coverage.
In this study, we examine the 10 GBIF datasets with the most records within the study region, detailing their differences and biases in taxonomy, time and land-cover associations and relating these to the various origins and characteristics of the datasets. The datasets range from 'pure' opportunistic citizen science records to observations from structured, targeted surveys by scientific institutions. To relate the results to biodiversity management, we focus on red-listed and alien species.
We hypothesize the following:

H1: The distribution of records between the three main kingdoms (H1a) and between alien and red-listed species (H1b) differs between the datasets, including within the datasets that do not explicitly focus on a particular taxonomic group.
H2: There has been an increase in the number of records over time, primarily reflecting an increase in the activity of citizen scientists.
H3: The different datasets will be unevenly distributed among land-cover types, with areas heavily influenced by humans (e.g. urban areas and agricultural land; areas classified as 'developed area' and 'cultivated' in Table S.1 in the Supporting Information (Figure 1, Figure S.1)) sampled more than would be expected by random chance; this oversampling is expected to be greater for datasets primarily consisting of citizen science records than for more targeted datasets.

Land-cover and species occurrence records
The study was limited to Norway (Figure 1). This region is well surveyed in terms of species occurrence records in GBIF (Chandler et al., 2017), covers great variation in land cover, climate and human population density, and has detailed land-cover data available (Statistics Norway, 2020).
Land cover was based on the Norwegian AR50 maps from NIBIO [...] and Hoem, 2020b). [...] eBird: citizen science records of birds, Levatich & Padilla, 2019)). Five datasets originated from museums and/or univer- [...] (Henriksen & Hilmo, 2015). In total, ≈4500 species are currently red-listed; of these, ≈2550 are animals (mainly invertebrates), [...] (Sandvik, Gederaas, & Hilmo, 2017; NBIC, 2018b). In total, ≈3000 species are listed as alien to mainland Norway, of which ≈1500 have a risk assessment. Of these, ≈390 are animals, ≈990 are plants and ≈100 are fungi.
As per the guidelines published by the NBIC (Sandvik et al., 2017), we here use the term 'alien species' rather than the frequently used 'invasive species'. 'Alien' refers to '(...) a species introduced outside its natural past or present distribution' (IUCN, 2020). The term 'invasive' suggests invasion potential and negative ecological effects, which is not necessarily the case for all alien species. To avoid subjective decisions as to which alien species to classify as 'invasive', all species classified as 'alien' on the Alien Species List (Gederaas et al., 2012) were included, and the term 'alien' was used rather than 'invasive'.
Species names of the GBIF records were matched with the Norwegian Red List and the Norwegian Alien Species List, including synonyms from the GBIF backbone taxonomy, using the package rgbif (Chamberlain & Boettiger, 2017). Species within the Red List categories 'regionally extinct', 'critically endangered', 'endangered', 'vulnerable', 'near threatened' and 'data deficient' were classified as 'red-listed'.
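The matching step described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the analyses in this study were performed in R with rgbif), and the species names, synonym table and the helper names `normalise` and `classify` are hypothetical placeholders, not data or functions from the study. The sketch assumes that a synonym table mapping synonyms to accepted names has already been obtained from a taxonomic backbone.

```python
# Red List categories treated as 'red-listed' in the text above.
RED_LISTED = {
    "regionally extinct", "critically endangered", "endangered",
    "vulnerable", "near threatened", "data deficient",
}


def normalise(name):
    """Lower-case a name and collapse whitespace so trivial spelling variants match."""
    return " ".join(name.lower().split())


def classify(name, red_list, alien_list, synonyms):
    """Return 'red-listed', 'alien' or 'other' for one occurrence record name."""
    key = normalise(name)
    accepted = synonyms.get(key, key)  # resolve a synonym to its accepted name
    if red_list.get(accepted) in RED_LISTED:
        return "red-listed"
    if accepted in alien_list:
        return "alien"
    return "other"


# Placeholder data (hypothetical names, not real taxa or real assessments).
red_list = {"aus bus": "vulnerable", "aus cus": "least concern"}
alien_list = {"xus yus"}
synonyms = {"aus bee": "aus bus", "xus why": "xus yus"}

records = ["Aus  bus", "Aus bee", "Aus cus", "Xus why", "Zus zus"]
statuses = [classify(r, red_list, alien_list, synonyms) for r in records]
print(statuses)
```

Resolving synonyms before the lookup means that a record filed under an outdated name is still assigned the status of its accepted name, which is the purpose of using the backbone taxonomy in the matching.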
As the majority of 'data-deficient' species are potentially threatened (Bland, Collen, Orme, & Bielby, 2015), and old records are included in the analyses, inclusion of the remaining Red List categories is warranted. Species alien to Svalbard but native to mainland Norway were not listed as alien; neither were alien species which have not yet established but are evaluated to have the potential to do so within 50 years (NBIC, 2018).
Maps and occurrence records were transformed to the projected coordinate reference system WGS 84 / UTM zone 32N (EPSG:32632).

Statistical analyses
Taxonomic differences within and between datasets were examined using χ² tests (base package: 'stats'), testing the null hypothesis of equal distribution of the kingdoms between and within the datasets.
Likewise, the distribution of red-listed and alien species between the datasets was tested with a χ² test.
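As an illustration of the χ² test of equal distribution described above, the following minimal Python sketch computes the Pearson χ² statistic and its degrees of freedom for a dataset × kingdom table of counts. The counts are made up for illustration, the function name `chi_square_homogeneity` is hypothetical, and the sketch is not the R code used in the study. For 10 datasets and three kingdoms, the degrees of freedom are (10 − 1)(3 − 1) = 18.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if kingdoms were distributed equally across datasets.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Made-up counts: rows are datasets, columns are animals, plants and fungi.
counts = [
    [90, 5, 5],
    [10, 80, 10],
    [20, 20, 60],
]
stat, df = chi_square_homogeneity(counts)
print(round(stat, 1), df)
```

A p-value would then be obtained from the χ² distribution with `df` degrees of freedom; this step is omitted here to keep the sketch free of external dependencies.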
To test for temporal patterns in the data, a Mann-Kendall test for a monotonic trend was applied (package: 'trend'; Pohlert, 2020 [...] Pebesma & Bivand, 2005). The null hypothesis was that the species occurrence records are randomly distributed across Norway, and that the number of records is a function of the area of each land-cover type. To evaluate the differences in biodiversity patterns obtained using occurrence records from the different datasets, or all in combination, individual-based species accumulation curves were made for each dataset × conservation status group, and the asymptotic species richness was calculated (package: 'iNEXT'; Hsieh et al., 2020).
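The Mann-Kendall test for a monotonic trend mentioned above can be sketched as follows. This is a minimal Python illustration using the standard normal approximation with a correction for ties; it is not the implementation in the R package 'trend', the function name `mann_kendall` is hypothetical, and the yearly record counts are made up.

```python
import math
from collections import Counter


def mann_kendall(values):
    """Return (S, Z, two-sided p) for a Mann-Kendall trend test on a time-ordered series."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of each later-minus-earlier difference
    # Variance of S under the null hypothesis, corrected for groups of tied values.
    tie_sizes = Counter(values).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value from the normal distribution
    return s, z, p


# Made-up number of records per year, in chronological order.
yearly_records = [3, 5, 4, 8, 12, 15, 21, 30]
s, z, p = mann_kendall(yearly_records)
print(s, round(z, 2), p < 0.05)
```

A positive S with a small p-value indicates a monotonic increase in the number of records over time, which is the pattern stated in H2.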
All data preparation and analyses were performed in R, version 3.6.1 (R Core Team, 2020). Maps were made in ArcMap version 10.6 (ESRI, 2018).

Taxonomic differences
The number of records from each dataset differed (χ² = 26,019,773, df = 9, p-value < 0.001), with the vast majority of the records belonging to the NBIC CS dataset, followed by the UiO Plant Notes (see Table 1 for description of dataset names). The kingdoms were not equally distributed between and within the datasets (χ² = 3,813,957, df = 18, p-value < 0.001). Obviously, the datasets with a specified taxonomic scope were dominated by records belonging to the particular king- [...]

Temporal differences
The

Geographic differences
The simulated numbers of records within the groups (conservation status × dataset) were predicted by the area of the specified land-cover type (Table 2, Figure 4).
Each land-cover type was relatively over- or under-sampled for different datasets (the observed number of records fell outside the 95% confidence interval of models based on the simulated data), except for snow/ice, which was under-sampled by all datasets. The results are summarized in Table 3, and the full table can be seen in the Supporting Information S.6.
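The logic of comparing observed records with an area-based null expectation can be illustrated with a simplified Python sketch. Note the assumptions: the study fitted Poisson generalized linear models to the simulated data, whereas this sketch swaps in a plain empirical 95% interval of the simulated counts; the land-cover areas, observed counts and the function names are made up for illustration.

```python
import random


def simulate_counts(areas, n_records, n_reps, seed=1):
    """Simulate record counts per land-cover type, with probability proportional to area."""
    rng = random.Random(seed)
    types = list(areas)
    weights = [areas[t] for t in types]
    sims = {t: [] for t in types}
    for _ in range(n_reps):
        draws = rng.choices(types, weights=weights, k=n_records)
        for t in types:
            sims[t].append(draws.count(t))
    return sims


def empirical_interval(samples, level=0.95):
    """Approximate central interval of the simulated counts from their order statistics."""
    ordered = sorted(samples)
    low_idx = int((1 - level) / 2 * (len(ordered) - 1))
    high_idx = int(round((1 + level) / 2 * (len(ordered) - 1)))
    return ordered[low_idx], ordered[high_idx]


def sampling_status(observed, areas, n_reps=100, seed=1):
    """Label each land-cover type as over-sampled, under-sampled or as expected."""
    sims = simulate_counts(areas, sum(observed.values()), n_reps, seed)
    status = {}
    for land_cover, count in observed.items():
        low, high = empirical_interval(sims[land_cover])
        if count > high:
            status[land_cover] = "over-sampled"
        elif count < low:
            status[land_cover] = "under-sampled"
        else:
            status[land_cover] = "as expected"
    return status


# Made-up areas (arbitrary units) and observed record counts.
areas = {"developed": 10, "forest": 60, "snow/ice": 30}
observed = {"developed": 500, "forest": 480, "snow/ice": 20}
status = sampling_status(observed, areas)
print(status)
```

Because the null model distributes records by area alone, a land-cover type is flagged only when its observed count lies outside what random placement would plausibly produce; the sketch does not account for intrinsic differences in species richness or abundance between land-cover types.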
Models and results for the datasets regardless of conservation status are given in the Supporting Information S.5.
Comparing the absolute residuals between predicted and observed number of records within each land-cover type, the largest numerical discrepancies were seen for open firm ground, developed areas and cultivated land (Figure 5(a)). However, comparing the relative residuals (disregarding un-mapped areas and snow/ice), only alien records associated with open firm ground showed a consistent pattern between datasets (under-sampling; Figure 5(b)).

Asymptotic species richness
The asymptotic species richness differed for most of the datasets (Sup-

Differences in taxonomic groups and conservation status between datasets
The taxonomic bias within and between the datasets differs markedly, both in the sense that several of the datasets are concerned with a single taxonomic group, and in that the multi-taxa datasets are skewed towards a single group. The datasets originating from museums all focus on plants (except for UiO Lichen; lichens are here classified as fungi). These patterns are reflected when comparing the multi-taxa datasets: the two datasets from the NBIC are both dominated by animal records, whereas the BioFokus and Jordal datasets are both dominated by plants. Interestingly, only two of the 10 datasets can be regarded as citizen science, yet they make up the bulk of the records. The dominance of birds within these datasets reflects the long-term popularity of ornithology (Devictor, Whittaker, & Beltrame, 2010), the incentive for people to report on charismatic, recognizable species, and the fact that many citizen science programmes have focused on birds.
However, if the datasets dominated by citizen science records are not considered, the avian dominance is much less pronounced. This echoes the taxonomic differences observed by Troudet et al. (2017) and Speed et al. (2018). Theobald et al. (2015) found the taxonomic bias in citizen science and institutional datasets to be consistent; however, they did see an over-representation of birds and plants, respectively, in the two groups.
This underlines the careful consideration that must be taken when using citizen science data in multi-taxa analyses. Nevertheless, within popular taxa, citizen science records can be a useful supplement to institutional observations, as this allows for otherwise impossible sample sizes (Powney & Isaac, 2015).
Citizen science data on popular taxa have proven useful for discovering population trends, conservation and management (e.g. for birds: Lehikoinen et al., 2019; and examples in Sullivan et al., 2009).
The datasets with more alien than red-listed records were all datasets focused on vascular plants; for all other datasets, more red-listed than alien records were registered. This illustrates that most species on the Alien Species List are plants (NBIC, 2018

FIGURE 3 (a) Number of GBIF records across years in total. (b) Density plots of the number of records, divided by datasets. Note that the y-axis in (b) indicates proportion rather than absolute number. Acronyms refer to the datasets described in Table 1.

Geographical biases
The most anthropogenic land-cover types have higher numbers of records than would be expected for most, but not all, groups. [...] concern, thus warranting attention from different recorders (Pärtel, Bruun, & Sammul, 2005).
The picture is highly nuanced regarding the different forest categories. The cases of oversampling may reflect that sampling tends to be done where high species richness is expected a priori (Boakes et al., 2016), the high proportion of woodland in Norway (>35%), and the high species richness of forests (≈60% of Norwegian species are associated with woodlands). The highest number and concentration of red-listed species are found in coniferous woodlands and broad-leaved deciduous woodland, respectively (Gjerde, Brandrud, Ohlson, & Ødegaard, 2010), which is somewhat seen in the positive residuals of red-listed records from most datasets. Some of the datasets hold fewer red-listed records than expected for coniferous (KMN, eBird, Jordal, NBIC CS (red-listed), and UiO Plant Obs) and deciduous (eBird, NBIC CS (red-listed), NTNU (red-listed), UiO Lichen, and UiO Plant Obs) forests. This discrepancy presumably stems from the taxonomic difference between the datasets: red-listed woodland species in Norway are mainly fungi, insects and lichens (Gjerde et al., 2010; Henriksen & Hilmo, 2015), and the num- [...] and 'snow/ice', both of which have fewer records than predicted by the null models. Consequently, part of the difference between observations and predictions can be attributed to the null models not taking intrinsic differences in species richness and abundance into account.
Nevertheless, as we were not modelling species richness, but number of records (a proxy of sampling effort), the main signals are attributable to differences in sampling effort.

TABLE 2 Model output. Simulated occurrence data were randomly distributed across the AR50 map; conservation status and dataset name were assigned in the same proportions as for the GBIF data (100 repetitions). Generalized linear models (Poisson error distribution, 'identity' link function) of the simulated data were fitted, predicting the number of records falling within each land-cover type by the area of the respective land-cover type. P-values below 0.05 are highlighted in bold text. Acronyms refer to the datasets described in [...]

Dataset complementarity
The general quality of the data found in open databases, such as GBIF, is a point worth general discussion. Various opinions on the matter exist (Gaiji et al., 2013;Newbold, 2010;Powney & Isaac, 2015). The

TABLE 3 Over- versus under-sampled land-cover types for each dataset. A summary of which land-cover types have either more or fewer observed records than expected by the generalized linear models summarized in [...]

[...] Colours indicate conservation status, shapes indicate dataset. The land-cover types are ordered increasingly with respect to area.

[...] on single datasets further than the extents of the individual datasets, geographically or taxonomically.

Integrating multiple datasets for understanding and managing biodiversity
Data availability thus remains the main challenge for understanding biodiversity patterns, and ultimately for how we manage biodiversity (Magurran, Dornelas, Moyes, & Henderson, 2019). This study has examined how different datasets, with different origins and characteristics, can complement each other in filling data availability gaps, specifically the gaps for three kingdoms (animals, plants and fungi), red-listed and alien species and their distributions across land covers and time.
Despite the emerging paradigm of data reuse and sharing among scientists, lack of data publishing is still an issue; only 10% of biocollections are estimated to be digitally available, including data used prior to recent changes in data publishing policies provided by funding agencies and journals (Ball-Damerow et al., 2019). Traditionally, most collected data have been stored locally, and data not directly used in publications have remained unused and potentially forgotten with time (Osawa, 2019). In the worst case, this means that not all parts of datasets are published. Likewise, standardization of biodiversity data among data providers is important to ensure interoperability (Poisot, Bruneau, Gonzalez, Gravel, & Peres-Neto, 2019). One attempt at this is the Darwin Core Archive format adopted by GBIF (Osawa, 2019; Wieczorek et al., 2012). Despite these efforts, substantial quantities of primary biodiversity data (and metadata) remain undiscovered (Chavan & Penev, 2011). This leaves a gap in the foundation of biodiversity research. In light of the results presented here, if the lack of data sharing is uneven among datasets with different origins, the gap is even more severe.
Open-source, compiled biodiversity data have the potential to be used for biodiversity modelling, if spatially biased sampling effort can be corrected for (Higa et al., 2015). Unfortunately, a recent review found that only 69% of the examined papers addressed some aspect of data quality (Ball-Damerow et al., 2019). Our results caution that careful consideration of the data used in such studies is needed; as the contribution from different datasets has changed over time, so has the geographical bias. Therefore, accounting for bias should be a dynamic process, dependent on the timespan of the included data and on the data contributors. If observational datasets of mixed origins are used indiscriminately, the reported spatio-temporal patterns could merely reflect spatio-temporal shifts in bias.
Future surveys and citizen science programmes should aim to include otherwise neglected taxonomic groups, especially in under-sampled land-cover types, such as remote mountainous areas. In particular, non-avian animals are underrepresented compared to their actual abundance, and open firm ground and mires should be investigated more closely. Citizen science programmes focusing on non-avian taxa should be designed, learning from the success of previous programmes for birds (Sullivan et al., 2009), butterflies (Butterfly Conservation, 2020) and bumblebees (Bumblebee Conservation Trust, 2019) and using their established frameworks. Both citizen scientists and institutional recorders should be encouraged to record observations in secluded areas and to include observations of 'less prestigious' species.
The quality of data from institutional recorders and citizen scientists, respectively, will vary immensely depending on methods and organism group. Whereas trained professionals likely exhibit greater skill with some of the more challenging groups, this is not necessarily the case for all taxa. If quality can be ensured, citizen scientists can provide otherwise unattainable amounts of data to facilitate the science-policy impact of sustainable biodiversity management. This study has shown the different biases of different datasets and illustrates some of the challenges of accounting for all of them in a single study.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHORS' CONTRIBUTIONS
TKP, GA, JDMS and VG conceived the idea and designed the methodology; TKP retrieved and analysed the data; TKP wrote the first draft of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.
Land-cover data are available through Kartkatalogen (Geonorge, 2019) and were downloaded on 23 November 2019.
All R code written to perform the data download and analyses can be viewed and downloaded in a public repository: https://doi.org/10.5281/zenodo.4455460 (Petersen, 2021).