Volume 10, Issue 5 p. 744-751
APPLICATION
Open Access

CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases

Alexander Zizka (corresponding author)

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

German Center for Integrative Biodiversity Research (iDiv), Leipzig, Germany

Correspondence: Alexander Zizka. Email: [email protected]
Daniele Silvestro

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

Department of Computational Biology, University of Lausanne, Lausanne, Switzerland

Tobias Andermann

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

Josué Azevedo

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

Camila Duarte Ritter

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

Department of Eukaryotic Microbiology, University of Duisburg-Essen, Essen, Germany

Daniel Edler

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

Integrated Science Lab, Department of Physics, Umeå University, Umeå, Sweden

Harith Farooq

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

Departamento de Biologia & CESAM, Universidade de Aveiro, Aveiro, Portugal

Faculty of Natural Sciences, Lúrio University, Pemba, Mozambique

Andrei Herdean

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

María Ariza

Natural History Museum, University of Oslo, Oslo, Norway

Ruud Scharn

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

Department of Earth Sciences, University of Gothenburg, Göteborg, Sweden

Sten Svantesson

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Niklas Wengström

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Vera Zizka

Faculty of Biology, University of Duisburg-Essen, Essen, Germany

Alexandre Antonelli

Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Gothenburg Global Biodiversity Centre, Göteborg, Sweden

Gothenburg Botanical Garden, Göteborg, Sweden

First published: 20 January 2019

rOpenSci Resources

The software package CoordinateCleaner, developed as part of this research effort, was extensively reviewed and approved by the rOpenSci project (https://ropensci.org). A full record of the review is available at https://github.com/ropensci/CoordinateCleaner

Funding information

A.A. and A.Z. are supported by the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013, ERC Grant Agreement no. 331024 to A.A.). D.S. received funding from the Swedish Research Council (2015-04748). A.A. is further supported by the Swedish Research Council, the Swedish Foundation for Strategic Research, a Wallenberg Academy Fellowship, the Faculty of Sciences at the University of Gothenburg, and the David Rockefeller Center for Latin American Studies at Harvard University. C.D.R. is financed by CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil: 249064/2013-8).

Abstract

  1. Species occurrence records from online databases are an indispensable resource in ecological, biogeographical and palaeontological research. However, issues with data quality, especially incorrect geo-referencing or dating, can diminish their usefulness. Manual cleaning is time-consuming, error-prone, difficult to reproduce and limited to known geographical areas and taxonomic groups, making it impractical for datasets with thousands or millions of records.
  2. Here, we present CoordinateCleaner, an r-package to scan datasets of species occurrence records for geo-referencing and dating imprecisions and data entry errors in a standardized and reproducible way. CoordinateCleaner is tailored to problems common in biological and palaeontological databases and can handle datasets with millions of records. The software includes (a) functions to flag potentially problematic coordinate records based on geographical gazetteers, (b) a global database of 9,691 geo-referenced biodiversity institutions to identify records that are likely from horticulture or captivity, (c) novel algorithms to identify datasets with rasterized data, conversion errors and strong decimal rounding and (d) spatio-temporal tests for fossils.
  3. We describe the individual functions available in CoordinateCleaner and demonstrate them on more than 90 million occurrences of flowering plants from the Global Biodiversity Information Facility (GBIF) and 19,000 fossil occurrences from the Palaeobiology Database (PBDB). We find that in GBIF more than 3.4 million records (3.7%) are potentially problematic and that 179 of the tested contributing datasets (18.5%) might be biased by rasterized coordinates. In PBDB, 1,205 records (6.3%) are potentially problematic.
  4. All cleaning functions and the biodiversity institution database are open-source and available within the CoordinateCleaner r-package.

1 INTRODUCTION

The digitization of biological and palaeontological collections from museums and herbaria is rapidly increasing the public availability of species’ geographical distribution records. To date, more than 1 billion geo-referenced occurrence records are freely available from online databases, such as the Global Biodiversity Information Facility (GBIF, www.gbif.org), BirdLife International (www.birdlife.org) or other taxonomically, temporally or spatially more focused databases (e.g. http://www.paleobiodb.org, http://bien.nceas.ucsb.edu/bien). Together, these resources have become widely used in ecological, biogeographical and palaeontological research and have greatly facilitated our understanding of biodiversity patterns and processes (e.g. Díaz et al., 2016; Zanne et al., 2014).

Most biodiversity databases are composed of, or provide access to, a variety of sources. Hence, they integrate data of varying quality, often compiled and curated at different times and places. Unfortunately, the available meta-data, for example on the nature of the records (museum specimen, survey, citizen science observation), the collection method (GPS record, grid cell from an atlas project) and collection-time, varies and often meta-data are missing. As a consequence, data quality in online databases is a major concern, and has limited their utility and reliability for research and conservation (Anderson et al., 2016; Chapman, 2005; Gratton et al., 2017; Yesson et al., 2007).

In the case of species occurrence records for extant taxa, problems with the geographical location constitute a major concern. In particular, erroneous or overly imprecise geographical coordinates can bias biodiversity patterns at multiple spatial scales (Maldonado et al., 2015). Common problems include (a) occurrence records assigned to country or province centroids due to automated geo-referencing from vague locality description, (b) records with switched latitude and longitude, (c) zero coordinates due to data entry errors, (d) records from zoos, botanical gardens or museums, (e) records based on rasterized collections and (f) records that have been subject to strong decimal rounding (Table 1; Gueta & Carmel, 2016; Maldonado et al., 2015; Robertson, Visser, & Hui, 2016; Yesson et al., 2007). Records affected by these issues can cause severe bias depending on the research question and the geographical scale of analyses (Graham et al., 2008; Gueta & Carmel, 2016; Johnson & Gillingham, 2008).

Table 1. Geographical and temporal tests implemented in the CoordinateCleaner package
| Test function | Level | Flags | Main error source | GBIF (%) | PBDB (%) |
|---|---|---|---|---|---|
| cc_cap | REC | Radius around country capitals | Imprecise geo-referencing based on vague locality description | 1.1 | |
| cc_cen | REC | Radius around country and province centroids | Imprecise geo-referencing based on vague locality description | 1.8 | 1 |
| cc_coun | REC | Records outside indicated country borders | Various, e.g. swapped latitude and longitude | | |
| cc_dupl | REC | Records from one species with identical coordinates | Various, e.g. duplicates from different institutions, records from genetic sequencing data | | |
| cc_equ | REC | Records with identical lon/lat | Data entry errors | 1.6 | 1 |
| cc_gbif | REC | Radius around the GBIF headquarters in Copenhagen | Data entry errors, erroneous geo-referencing | 0 | 0 |
| cc_inst | REC | Radius around biodiversity institutions | Cultivated/captive individuals, data entry errors | 0.8 | 0 |
| cc_iucn | REC | Records outside an external range polygon | Naturalized individuals, data entry errors | | |
| cc_outl | REC | Geographically isolated records of a species | Various, e.g. swapped latitude and longitude | | |
| cc_sea | REC | Records located within oceans | Various, e.g. swapped latitude and longitude | 0.1 | |
| cc_urb | REC | Records from within urban areas | Cultivated individuals, old records | | |
| cc_val | REC | Records outside the lat/lon coordinate system | Data entry errors, e.g. wrong decimal delimiter | 0 | 0 |
| cc_zero | REC | Plain zeros in the coordinates and a radius around (0, 0) | Data entry errors, failed geo-referencing | 1.6 | 0.01 |
| cd_ddmm | DS | Overproportional drop in coordinate decimals at 0.6 | Erroneous conversion from dd.mm to dd.dd | 4.1% of datasets | |
| cd_round | DS | Decimal periodicity or overproportional number of zero decimals | Rasterized or rounded data | 18.5% of datasets | |
| cf_age | FOS/REC | Temporal outliers in fossil age or collection year | Various | | |
| cf_equal | FOS | General time validity | Data entry errors | | 0 |
| cf_range | FOS | Overly imprecise age ranges | Lack of data | | 3.3 |
| cf_outl | FOS | Outliers in space-time | Data entry errors | | 2.1 |

Note: REC, record-level; DS, dataset-level; FOS, fossil-level; dd.mm, degree-minute annotation; dd.dd, decimal-degree annotation; GBIF, Global Biodiversity Information Facility; PBDB, Paleobiology Database. Percentages give the share of records flagged in the empirical example (Section 4); empty cells indicate tests that were not run there.

In addition to spatial issues, the temporal information (i.e. the year of collection) associated with occurrence records can be erroneous. In the case of fossil occurrences, the temporal information includes the age of the specimen, typically defined by the stratigraphic range of the sampling locality. Although sampling biases (and their temporal and spatial heterogeneity) are arguably the most severe issue in the analysis of the fossil record (Foote, 2000; Xing et al., 2016), overly imprecise or erroneous fossil ages, data entry errors or taxonomic uncertainties can negatively affect the reliability of analyses (Varela, Lobo, & Hortal, 2011). While large-scale analyses of the fossil record appear resilient to errors in the data (Adrain & Westrop, 2000; Sepkoski, 1993), the inclusion of erroneous data is likely to generate non-negligible biases at smaller temporal and taxonomic scales.

Manual cleaning is possible, but time-consuming and limited by the taxonomic and geographical expertise of individual researchers; it is thus generally not feasible for datasets comprising thousands or millions of occurrence records. Furthermore, manual cleaning is often based on poorly documented, and thus irreproducible, ad hoc decisions, which can add subjectivity and, in the worst case, bias. These issues call for standardized data validation and cleaning tools for large-scale biodiversity data (Gueta & Carmel, 2016).

2 DESCRIPTION

Here, we present CoordinateCleaner, a new software package for standardized, reproducible and fast identification of potential geographical and temporal errors in databases of recent and fossil species occurrences. CoordinateCleaner is implemented in R (R Core Team, 2018) based on standard tools for data handling and spatial statistics (Allaire et al., 2018; Arel-Bundock, 2018; Becker, Wilks, Brownrigg, Minka, Deckmyn, 2017; Bivand & Lewin-Koh, 2017; Bivand & Rundel, 2018; Chamberlain, 2017; Hester, 2017; Hijmans, 2017a,b; Pebesma & Bivand, 2005; Varela, Gonzalez Hernandez, & Fabris Sgarbi, 2016; Wickham, 2011, 2016; Wickham, Danenberg, & Eugster, 2017; Wickham & Hesselberth, 2018; Wickham, Hester, & Chang, 2018; Xie, 2018). See the online documentation available at https://ropensci.github.io/CoordinateCleaner for an in-depth description of methods and simulations. The main features of the package are listed below.

2.1 Automatic tests for suspicious geographical coordinates or temporal information

CoordinateCleaner compares the coordinates of occurrence records to reference databases of country and province centroids, country capitals, urban areas and known natural ranges, and tests for plain zeros, equal longitude and latitude, coordinates at sea, records outside the indicated country and outliers in collection year. The reference databases are compiled from several sources (Central Intelligence Agency, 2014; South, 2017; www.naturalearthdata.com). All functions available in CoordinateCleaner are summarized in Table 1, and each of them can be customized with flexible parameters and individual reference databases.
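As a usage sketch (the toy records are invented, and the lon/lat column names are passed explicitly since defaults can differ between package versions):

```r
# A minimal sketch of three record-level tests, assuming CoordinateCleaner
# is installed; e.g. the second record sits at (0, 0) and should fail cc_zero
library(CoordinateCleaner)

occs <- data.frame(species = c("Species A", "Species A", "Species B"),
                   decimallongitude = c(12.58, 0, 34.84),
                   decimallatitude  = c(55.67, 0, -2.98))

# with value = "flagged", each test returns a logical vector
# (TRUE = record passed, FALSE = record is potentially problematic)
cc_cap(occs,  lon = "decimallongitude", lat = "decimallatitude", value = "flagged")
cc_cen(occs,  lon = "decimallongitude", lat = "decimallatitude", value = "flagged")
cc_zero(occs, lon = "decimallongitude", lat = "decimallatitude", value = "flagged")
```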

2.2 A global database of biodiversity institutions

A common problem is occurrence records matching the location of biodiversity institutions, such as zoological and botanical gardens, museums, herbaria or universities. These can have various origins: records from living individuals in captivity or horticulture, individuals that have escaped from cultivation near an institution, or specimens without collection coordinates that have been erroneously geo-referenced to the institution's physical location (e.g. a museum). To address these problems, we compiled a global reference database of 9,691 biodiversity institutions from multiple sources (Botanic Gardens Conservation International, 2017; GeoNames, 2017; Global Biodiversity Information Facility, 2017; Index Herbariorum, 2017; The Global Registry of Biodiversity Repositories, 2017; Wikipedia, 2017) and geo-referenced them using the ggmap and opencage R-packages (Kahle & Wickham, 2013; Salmon, 2017). Where automatic geo-referencing failed (c. 50% of the entries), we geo-referenced the entries manually using Google Earth Pro (Google Inc, 2017) or information from the institutions' websites, where available. We acknowledge that this database might not be complete and have set up a website at http://biodiversity-institutions.surge.sh/ where scientists can explore the database and submit additions or corrections. See https://ropensci.github.io/CoordinateCleaner/articles/Background_the_institutions_database.html for a detailed description of the database.
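The gazetteer ships with the package and underlies the cc_inst test; a brief, hedged sketch of inspecting and using it (the example record is hypothetical):

```r
library(CoordinateCleaner)

# the geo-referenced institutions gazetteer shipped with the package
data("institutions")
nrow(institutions)   # c. 9,700 entries
head(institutions)

# flag records falling within a buffer around any of these institutions
occs <- data.frame(species = "Species A",
                   decimallongitude = 12.58,
                   decimallatitude  = 55.67)
cc_inst(occs, lon = "decimallongitude", lat = "decimallatitude",
        value = "flagged")
```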

2.3 Algorithms to identify conversion errors and rasterized data

Two types of potential bias are unidentifiable at the record level if the relevant meta-data are missing: (A) coordinate conversion errors based on the misinterpretation of the degree sign (°) as a decimal delimiter and (B) occurrence records derived from rasterized collection designs or subjected to strong decimal rounding (e.g. presence/absence in 100 × 100 km grid cells). These biases may be particularly problematic for studies at small geographical scales, which need high precision, and when the erroneous records have been combined with precise GPS records into datasets of mixed precision. CoordinateCleaner implements two novel algorithms to identify these problems at the dataset level (a dataset in this context can be either all available records or subsets thereof, for instance from different contributing institutions). The tests assume that datasets with a sufficient number of biased records show a characteristic periodicity in the statistical distribution of their coordinates or coordinate decimals.

To detect coordinate conversion bias (A), we use a binomial test based on the expectation of randomly distributed coordinate decimals in the dataset (implemented in the cd_ddmm function). If we consider a dataset of coordinates spanning several degrees of latitude and longitude, we can expect the distribution of decimals to be roughly uniform in the range [0, 1). In the case of a conversion error, the coordinate decimal cannot be above 0.59 (because one degree only has 59 min), so conversion errors inflate the frequency of coordinates with decimals <0.6. We use two tests to identify this bias. First, we use the fraction of coordinate decimals below 0.6 to fit a binomial distribution with parameter $q = 0.592$ (which assumes uniformly distributed decimals). This yields estimates of (a) a p-value for accepting or rejecting the hypothesis of a uniform distribution and (b) the parameter $\hat{q}$ that best explains the empirical distribution of decimals below and above 0.6. The first test is therefore given by the p-value, which can be used to reject the hypothesis of a uniform distribution when it is smaller than a given threshold. The second test is based on the relative difference $r$ between the estimated frequency of decimals below 0.6 ($\hat{q}$) and the expected one ($q$); thus, any $r > 0$ indicates a higher-than-expected frequency of decimals smaller than 0.6. We flag a dataset as biased if the p-value is smaller than a user-defined threshold (by default set to 0.025) and $r$ is larger than a user-defined threshold (by default set to 1).
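A self-contained sketch of this binomial logic in base R (simplified and illustrative; the toy decimals, the use of binom.test and the final relative-difference line are our assumptions, not the packaged cd_ddmm implementation):

```r
# Simplified sketch of the cd_ddmm idea: are coordinate decimals below 0.6
# overrepresented relative to the expectation q under uniformity?
decs  <- c(0.12, 0.45, 0.33, 0.58, 0.07, 0.51, 0.22, 0.49, 0.71, 0.38)
q     <- 0.592                 # expected fraction, as given in the text
below <- sum(decs < 0.6)

# one-sided binomial test against the uniform expectation
binom.test(below, length(decs), p = q, alternative = "greater")$p.value

# one possible relative-difference measure (> 0 = excess of small decimals)
(below / length(decs)) / q - 1
```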

To detect rasterized sampling bias (B), we test for the regular pattern in the sampled coordinates caused by rasterized sampling (or strong decimal rounding). This test involves three steps, which are implemented in a single function (cd_round). First, the algorithm amplifies the pattern by binning the coordinates and then calculates the autocorrelation of the number of records per bin, as the covariance between two consecutive sliding windows. This step generates a vector $x$ of autocorrelation values.

Second, we identify outliers of high autocorrelation within $x$, which we interpret as points of high sampling frequency, that is, the nodes of the sampling raster. Using a second sliding window $\mathbf{x}_k$ of size 10, where $\mathbf{x}_k = \{x_k, x_{k+1}, \ldots, x_{k+9}\}$, we flag a point $x_{k+i}$ as highly autocorrelated when

$$x_{k+i} > Q_{75}(\mathbf{x}_k) + \mathrm{IQR}(\mathbf{x}_k) \times T,$$

where $Q_{75}(\mathbf{x}_k)$ is the 75% quantile of $\mathbf{x}_k$, $\mathrm{IQR}(\mathbf{x}_k)$ is its interquartile range and $T$ is a user-set multiplier defining the test sensitivity. Third, we compute the distances (in degrees) between all flagged outliers and identify the most common distance $D$. A dataset is then flagged as potentially biased if $D$ lies within a user-defined range (by default between 0.1 and 2 degrees) and the number of outliers spaced by distance $D$ exceeds a user-defined value (by default set to 3).
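A simplified sketch of the second step's outlier rule (illustrative only; the packaged cd_round implementation differs in details such as coordinate binning and window handling):

```r
# Flag entries of an autocorrelation vector x that exceed the 75% quantile
# of a size-10 sliding window by more than T times the window's IQR
flag_high_autocorrelation <- function(x, T = 1) {
  flags <- logical(length(x))
  for (k in seq_len(length(x) - 9)) {
    win <- x[k:(k + 9)]
    thr <- unname(quantile(win, 0.75)) + IQR(win) * T
    flags[k:(k + 9)] <- flags[k:(k + 9)] | (win > thr)
  }
  flags
}

# example: a spike every 10th position, as regularly spaced raster nodes
# would produce; the spikes are flagged
x <- rep(c(rep(0.1, 9), 2), 5)
which(flag_high_autocorrelation(x))
```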

We optimized all default settings based on simulations to obtain high sensitivity for datasets of variable size and geographical scale. In these simulations, the cd_ddmm and cd_round functions successfully identified bias (A) in datasets with more than 100 records and bias (B) in datasets with more than 50 individual sampling locations, respectively (https://ropensci.github.io/CoordinateCleaner/articles/Background_dataset_level_cleaning.html). Both functions include optional visual diagnostic output for evaluating the results of flagged datasets, which we recommend using to guide a final decision, especially for datasets with few records or a geographically restricted extent.

2.4 Spatio-temporal tests for fossil data

Problems with inaccurate or overly imprecise temporal information are exacerbated in fossils. In particular, insufficient data, taxonomic misidentification, homonyms (names with the same spelling but referring to different taxa) and data entry errors can cause very imprecise or wrong ages. CoordinateCleaner includes functions to identify fossils with (a) an unexpectedly large age range ($r = a_{\max} - a_{\min}$), (b) an unexpected age and (c) an unexpected location in space-time in a given dataset. To identify (a) and (b), we use an interquartile-based outlier test implemented in the cf_range function, so that a fossil $i$ in a dataset is flagged if

$$r_i > Q_{75}(r) + \mathrm{IQR}(r) \times M,$$

where $Q_{75}(r)$ is the 75% quantile of the age ranges (a) or ages (b) across all records in the set, $\mathrm{IQR}(r)$ is the interquartile range of $r$ and $M$ is a user-defined sensitivity threshold (by default set to 5).
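As a worked illustration of this interquartile rule (toy numbers, not package code):

```r
# hypothetical age ranges (max - min, in Myr); the last value is a clear outlier
r <- c(2, 5, 3, 4, 2.5, 3.5, 60)
M <- 5                              # default sensitivity threshold

# flag fossils whose age range exceeds Q75 + IQR * M; only the 60-Myr
# range is flagged here
r > quantile(r, 0.75) + IQR(r) * M
```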

To identify (c), we test for outliers in a linear combination of range-standardized geographical and temporal distances, based on ages randomly sampled between the minimum and maximum age (implemented in the cf_outl function). For each fossil $i$, we calculate the mean scaled temporal and spatial distances to all other records in the set, $t_i$ and $s_i$ respectively. To compare temporal and spatial distances, which are otherwise expressed in different units (Myr and km), we rescale the temporal distances to the range of the spatial distances. We then use the sum of mean scaled distances ($t_i + s_i$) to identify outliers in space-time, based on interquartile ranges as above:

$$(t_i + s_i) > Q_{75}(t + s) + \mathrm{IQR}(t + s) \times Q,$$

where $Q$ is a user-set sensitivity threshold (five by default). The test is replicated $n$ times, where each replicate uses an age randomly sampled within the age range of fossil $i$. Records are flagged if they are identified as outliers in at least a fraction $k$ of the replicates, where $n$ and $k$ are user-defined parameters (by default set to 5 and 0.5, respectively). The cf_range and cf_outl functions can identify outliers across entire datasets or on a per-taxon basis.
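A hedged usage sketch of both fossil tests (the toy records are invented; the explicit column and argument names follow PBDB conventions and the package documentation, but defaults and thresholds may differ between package versions):

```r
library(CoordinateCleaner)

# toy fossil records; the last one is distant in both space and time
fos <- data.frame(accepted_name    = rep("Taxon A", 8),
                  decimallongitude = c(12.2, 12.8, 13.1, 12.5, 13.4, 12.9, 13.0, -70.0),
                  decimallatitude  = c(45.1, 45.6, 44.9, 45.3, 45.8, 45.2, 45.5, -30.0),
                  min_ma           = c(54, 55, 53, 56, 54, 55, 53, 5),
                  max_ma           = c(60, 61, 59, 62, 60, 61, 59, 10))

# flag overly imprecise age ranges, then spatio-temporal outliers
cf_range(fos, taxon = "accepted_name", min_age = "min_ma", max_age = "max_ma",
         lon = "decimallongitude", lat = "decimallatitude", value = "flagged")
cf_outl(fos, taxon = "accepted_name", min_age = "min_ma", max_age = "max_ma",
        lon = "decimallongitude", lat = "decimallatitude", value = "flagged")
```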

3 RUNNING CoordinateCleaner

CoordinateCleaner includes three wrapper functions, clean_coordinates, clean_dataset and clean_fossils, which each combine a set of tests suitable for the respective data type. clean_coordinates is the main function and creates an object of the S3 class ‘spatialvalid’, which has summary and plot methods. Flagged occurrence records can easily be identified, checked or removed before further analyses. We provide two tutorials demonstrating how to use CoordinateCleaner on recent and fossil datasets, and multiple short examples, at https://ropensci.github.io/CoordinateCleaner/. A reproducible minimal example is shown below.
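The published example is displayed as an image in the article; the following sketch reproduces the spirit of that workflow (the simulated records and the selection of tests are illustrative assumptions):

```r
library(CoordinateCleaner)

# simulate a minimal dataset of occurrence records
set.seed(1)
minimal_data <- data.frame(species = rep(letters[1:10], 25),
                           decimallongitude = runif(250, min = 42, max = 51),
                           decimallatitude  = runif(250, min = -26, max = -11))

# run a set of record-level tests; returns an object of class 'spatialvalid'
flags <- clean_coordinates(x = minimal_data,
                           lon = "decimallongitude",
                           lat = "decimallatitude",
                           species = "species",
                           tests = c("capitals", "centroids", "equal",
                                     "gbif", "zeros"))

summary(flags)   # per-test counts of flagged records
plot(flags)      # map of flagged vs. passing records
```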

Alternatively, each cleaning function can be called individually, for instance in pipelines based on the magrittr pipe (%>%), as in the sketch below.
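Continuing the minimal example above, a hedged pipeline sketch (each cc_* function returns the input with flagged records removed by default, value = "clean", so tests can be chained):

```r
library(magrittr)

cleaned <- minimal_data %>%
  cc_val(lon = "decimallongitude", lat = "decimallatitude") %>%
  cc_equ(lon = "decimallongitude", lat = "decimallatitude") %>%
  cc_zero(lon = "decimallongitude", lat = "decimallatitude") %>%
  cc_cap(lon = "decimallongitude", lat = "decimallatitude") %>%
  cc_cen(lon = "decimallongitude", lat = "decimallatitude")
```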

4 EMPIRICAL EXAMPLE

We demonstrate CoordinateCleaner on occurrence records of flowering plants available from GBIF (c. 91 million geo-referenced records; Global Biodiversity Information Facility, 2017, accessed 02 Feb 2017) and from the Palaeobiology Database (PBDB, c. 19,000 records; PBDB, 2018, accessed 26 Jan 2018). We chose GBIF and PBDB as examples because they are large and widely used providers of biodiversity data. We stress that both platforms put substantial effort into identifying problematic records and acquiring meta-data to increase data quality, and that we consider their data to be of generally high, and improving, quality. We ran the clean_coordinates, clean_fossils and clean_dataset wrapper functions with all tests recommended in our tutorials, except those that are dependent on downstream analyses (Table 1). For cc_sea, we used a custom gazetteer with a 1-degree buffer to avoid flagging records close to the coastline (available in the package via data('buffland')). For computational efficiency, we divided the GBIF data into subsets of 200,000 records (see the sketch below).
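A sketch of that set-up (gbif_subset stands for a hypothetical 200,000-record chunk; the seas_ref argument for passing a custom sea reference follows the package documentation, though argument names may vary between versions):

```r
# use the buffered land polygon shipped with the package as a custom
# reference for the sea test, so near-coastal records are not flagged
data("buffland")

flags <- clean_coordinates(x = gbif_subset,   # one 200,000-record GBIF chunk
                           lon = "decimallongitude",
                           lat = "decimallatitude",
                           species = "species",
                           seas_ref = buffland)
```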

clean_coordinates flagged more than 3,340,000 GBIF records (3.7%), the majority due to coordinates matching country centroids, zero coordinates and equal latitude and longitude (Table 1). Figure 1a shows the number of occurrence records flagged per 100 × 100 km grid cell globally. Concerning the fossil data from PBDB, clean_fossils flagged 1,205 records (6.3%), mostly due to large uncertainty in dating and unexpectedly old ages or distant locations. These flags might include records for which precise dating was not possible, records with low taxonomic resolution, homonyms or problems during data entry. Figure 1b shows the number of fossil records flagged per 100 × 100 km grid cell globally.

Figure 1. The number of species occurrence records flagged by CoordinateCleaner in the empirical datasets, per 100 × 100 km grid cell. Warmer colours indicate more flagged records. (a) Flowering plants from the Global Biodiversity Information Facility (c. 91 million records; Global Biodiversity Information Facility, 2017). (b) Angiosperm fossils from PBDB (c. 19,000 records; PBDB, 2018). Note the logarithmic scale.

At the dataset level, we retrieved 2,494 individual datasets of flowering plants from GBIF, mostly representing data from different publishers (e.g. collections of specific museums). These datasets varied considerably in the number of records (from 1 record to 16 million) and geographical extent (<1 degree to global). We limited the tests to 641 datasets with at least 50 individual sampling locations for the decimal conversion bias (function cd_ddmm, Table 1) and 966 datasets with more than 100 occurrence records for the rasterization bias (function cd_round, Table 1). clean_dataset flagged 26 datasets (4.1%) as biased towards decimals below 0.6 (potentially related to dd.mm to dd.dd conversion) and 179 datasets (18.5%) with a signature of decimal periodicity (potential rounding or rasterization). The high percentage of flagged datasets was surprising; the flags might also include datasets with strongly clustered sampling. Since the value of such data for biological research depends strongly on the follow-up analyses, we recommend a case-by-case judgement based on the desired precision, the diagnostic plots and the meta-data for a final decision on flagged datasets. In general, not all flagged records and datasets are necessarily erroneous: our tests only indicate deviations from common and explicit assumptions. Flagged data may require further validation by researchers or exclusion from subsequent analyses.
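A hedged sketch of this dataset-level run (gbif_subset is again the hypothetical chunk from above; the grouping column is assumed to be the GBIF datasetKey, and test names follow the package documentation):

```r
# run both dataset-level tests, grouping records by a dataset identifier
ds_flags <- clean_dataset(gbif_subset,
                          lon   = "decimallongitude",
                          lat   = "decimallatitude",
                          ds    = "datasetkey",
                          tests = c("ddmm", "periodicity"))

head(ds_flags)   # per-dataset summary of both tests
```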

5 COMPARISON TO OTHER SOFTWARE

To our knowledge, few other tools exist for standardized data cleaning, namely the scrubr (Chamberlain, 2016) and biogeo (Robertson et al., 2016) r-packages. Additionally, the modestR software (García-Roselló et al., 2013) implements a graphical user interface and includes cleaning of GBIF data based on habitat suitability. Some of the basic functions performed by CoordinateCleaner overlap with these packages; however, CoordinateCleaner provides a substantially more comprehensive set of options, including novel tests and data (see https://ropensci.github.io/CoordinateCleaner/articles/Background_comparison_other_software for a function-by-function comparison of CoordinateCleaner, scrubr and biogeo).

Primarily, CoordinateCleaner adds the following novelties compared to available packages: (a) a unique set of tests for problematic geographical coordinates, tailored to common but often overlooked problems in biological databases and not restricted to specific organisms; (b) a global, geo-referenced database of biodiversity institutions to identify records from cultivation, zoos, museums, etc.; (c) novel algorithms to identify problems that are not identifiable at the record level, for example errors from the conversion of the coordinate annotation or low coordinate precision due to rasterized data collection; (d) tests tailored to fossils, accounting for problems in dating; and (e) applicability to large datasets. These features, in combination with a user-friendly implementation and extensive documentation and tutorials, should make CoordinateCleaner a useful tool for research in biogeography, palaeontology, ecology and conservation.

In general, no hard rule exists for judging data quality for biogeographical analyses: what constitutes ‘good data’ depends largely on the downstream analyses. For instance, continent-level precision might suffice for ancestral range estimation in some global studies, whereas species distribution models based on environmental data can require 1-km precision. The objective of CoordinateCleaner is to automate the identification of problematic records as far as possible for all scales, with default values tailored to large datasets with millions of records and thousands of species. Nevertheless, some researcher judgement will always be necessary to choose suitable tests, specify appropriate thresholds and avoid adding bias through cleaning. In the worst case, automatic cleaning could bias downstream analyses through information loss caused by overly strict filtering, through exacerbated sampling bias caused by false outlier removal, and through over-confidence in the cleaned data. In most cases, however, CoordinateCleaner speeds up the identification of problematic records and of common problems in a dataset for further verification. In some cases, disregarding flagged records might be warranted, but we recommend judging carefully and verifying flagged records where possible, especially for the outlier and dataset-level tests. We provide extensive documentation to guide cleaning and output interpretation (https://ropensci.github.io/CoordinateCleaner).

ACKNOWLEDGEMENTS

We thank all GBIF and PBDB administrators and contributors for their excellent work. We thank Sara Varela, Carsten Meyer and an anonymous reviewer for helpful comments on an earlier version of the manuscript; rOpenSci, Maëlle Salmon, Irene Steves and Francisco Rodriguez-Sanchez for helpful comments on the R code; and Juan D. Carrillo for valuable feedback on the tutorial for cleaning fossil records.

AUTHORS’ CONTRIBUTIONS

A.Z. developed the tools and designed this study. D.S. and A.Z. designed and implemented the dataset-level cleaning algorithms. D.E. developed the website for contributing to the biodiversity institutions database. A.Z., T.A., J.A., C.D.R., H.F., A.H., M.A., R.S., S.t.S., N.W. and V.Z. contributed data to the biodiversity institutions database. A.Z. wrote the manuscript, with contributions from A.A., D.S., T.A., J.A., D.E., H.F. and V.Z. All authors read and approved the final version of the manuscript.

DATA ACCESSIBILITY

The code of CoordinateCleaner is open source and has been reviewed by rOpenSci. The package is available from the CRAN repository (stable version, https://cran.rstudio.com/web/packages/CoordinateCleaner/index.html) and from GitHub (development version, https://github.com/ropensci/CoordinateCleaner). The biodiversity institutions database is part of the package under a CC-BY license. Cleaning pipelines for occurrence records from GBIF and fossils from PBDB are available from https://ropensci.github.io/CoordinateCleaner (https://doi.org/10.5281/zenodo.2539408) and from CRAN as part of the package.