Volume 9, Issue 2
APPLICATION

The zoon r package for reproducible and shareable species distribution modelling

Nick Golding

School of BioSciences, University of Melbourne, Parkville, Vic., Australia

Correspondence: Nick Golding (nick.golding.research@gmail.com)

Tom A. August

NERC Centre for Ecology & Hydrology, Crowmarsh Gifford, Wallingford, UK

Tim C. D. Lucas

Oxford Big Data Institute, Li Ka Shing Centre for Health Information and Discovery Nuffield Department of Medicine, University of Oxford, Oxford, UK

David J. Gavaghan

Department of Computer Science, University of Oxford, Oxford, UK

E. Emiel van Loon

Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, The Netherlands

Greg McInerny

Centre for Interdisciplinary Methodologies, Social Sciences, University of Warwick, Coventry, UK

First published: 30 July 2017

Abstract

  1. The rapid growth of species distribution modelling (SDM) as an ecological discipline has resulted in a large and diverse set of methods and software for constructing and evaluating SDMs. The disjointed nature of the current SDM research environment hinders evaluation of new methods, synthesis of current knowledge and the dissemination of new methods to SDM users.
  2. The zoon r package aims to overcome these problems by providing a modular framework for constructing reproducible SDM workflows. zoon modules are interoperable snippets of r code, each carrying an SDM method, which zoon combines into a single analysis object.
  3. Rather than defining these modules itself, zoon draws them from an open, version‐controlled online repository. zoon makes it easy for SDM researchers to contribute modules to this repository, enabling others to rapidly deploy new methods in their own workflows or to compare alternative methods.
  4. Each workflow object created by zoon is a rerunnable record of the data, code and results of an entire SDM analysis. This can then be easily shared, scrutinised, reproduced and extended by the whole SDM research community.
  5. We explain how zoon works and demonstrate how it can be used to construct a completely reproducible SDM analysis, create and share a new module, and perform a methodological comparison study.

1 INTRODUCTION

Species distribution modelling (SDM) has grown rapidly over the last 20 years. It is now one of “the most widely‐reviewed topics in the ecological literature” (Araújo & Peterson, 2012) and the growth of this literature is still accelerating (Barbosa & Schneck, 2015). The SDM software market is similarly large and diverse (Ahmed et al., 2015). While most SDM users rely on either the MaxEnt standalone application (Phillips, Anderson, & Schapire, 2006) or the r programming language (R Core Team, 2016) as the first-choice software for their analyses (Ahmed et al., 2015), a variety of SDM‐specific r packages, each implementing different approaches to fitting SDMs (Hijmans, Phillips, Leathwick, & Elith, 2016; Naimi & Araújo, 2016; Thuiller, Lafourcade, Engler, & Araújo, 2009), are available, as well as a range of alternative standalone applications.

As a result, the SDM community can become siloed, with groups of researchers collaborating primarily with others who use the same software, even the same r packages (Ahmed et al., 2015), presenting a barrier to disseminating new SDM findings and methods. This is compounded by the fact that almost all available SDM software is focussed on analytical tasks such as constructing models, rather than on enabling scientists to produce analyses in a format that others can reproduce and modify. The inability of SDM researchers to reproduce, scrutinise and build on others’ research prevents rigorous peer review and synthesis of research findings across studies, and reduces the capacity of the science to be a self‐correcting process (Royal Society, 2012).

For example, a fundamental dispute over the ability of SDMs to detect environmental associations (Araújo, Thuiller, & Yoccoz, 2009; Beale, Lennon, & Gimona, 2008, 2009) was left unresolved because only the original publication shared its analytical code (Beale et al., 2008). Similarly, in the past decade the SDM community has failed to carry out any large-scale comparisons of SDM methods, still relying on the work of Elith et al. (2006) to select models and software (Joppa et al., 2014), despite the development of many new methods, as well as improved evaluation procedures, since the publication of that study (Roberts et al., 2017). The community's inability to repeat this comprehensive analysis is due both to a lack of access to comparable data and to the difficulty of learning and applying the many different pieces of software for model fitting.

Given the current software and patterns of collaboration, the SDM community is unlikely either to repeat or modify existing analyses or to produce the large model comparisons that would answer even basic questions about how best to do SDM. This situation may also have contributed to widespread misunderstandings about how to apply and interpret different modelling approaches (Yackulic et al., 2013). To overcome these problems, the data and code underpinning SDM research need to be made more accessible, reproducible and modifiable by the whole research community. This can be achieved if technologies enable and encourage sharing of research as fully reproducible objects (Peng, 2011), in ways that suit the diversity of users involved in SDM (Ahmed et al., 2015).

The zoon r package has been developed specifically to improve the reproducibility and comparability of SDMs in r by allowing users to encode entire SDM analyses as repeatable and extensible workflows consisting of independently executable, community‐contributed modules. The module‐workflow structure enables scientists to more easily create and share components of their analysis, and then access, modify, reuse and combine the components of others (see below and Figure 1). While zoon's modular nature is similar to that of other SDM r packages such as biomod2 (Thuiller et al., 2009) and sdm (Naimi & Araújo, 2016), zoon pulls each module from an open repository that any SDM user can contribute to, and makes it easy for non‐developers to contribute modules. zoon's focus on reproducible and modifiable workflows is inspired by repositories such as the Cardiac Web Lab (Cooper, Scharm, & Mirams, 2016) and workflow systems such as Taverna (Wolstencroft et al., 2013) and the biovel system (De Giovanni et al., 2016), but embeds an SDM‐specific workflow system within r, making it much more accessible to the SDM community.

Figure 1. The modular species distribution modelling (SDM) structure encoded by a zoon workflow. (a) Description of the five module types. (b) Flow diagram illustrating how objects are passed between different module types: ‘data frame’—a data frame of occurrence records; ‘raster’—a RasterStack object of the covariates; ‘model’—a ZoonModel object, generating standardised predictions from a given model. (c) The flow diagram implied by chaining two ‘process’ modules. (d) The flow diagram implied by listing three ‘model’ modules. Full details of module inputs and outputs, and the effects of listing and chaining each module type, are given in the zoon vignette ‘Building a module’.

This paper introduces version 0.6 of the zoon r package. We describe the modular structure of zoon workflows and how they can be constructed, shared, reproduced and modified. We then illustrate how these concepts enable better SDM research by reproducing a published SDM analysis, converting a recently proposed method into a zoon module and performing a reproducible methodological comparison.

2 BUILDING A WORKFLOW

zoon encodes SDM analyses as a simple workflow of five key steps: obtaining occurrence data; obtaining covariate data; applying processes to these data; fitting one or more models; and generating outputs (Figure 1a). Each of these steps is carried out by one or more software modules (snippets of r code), each completing one of these tasks. zoon modules can use methods from any r package, providing a common interface between the many SDM packages already available. Users combine these modules via a call to the workflow function, which executes each module in turn before returning a zoonWorkflow object—a shareable, extensible and fully reproducible record of the SDM analysis. Figure 1b illustrates how data and outputs are passed between modules of the five different types in a zoon workflow.

The following code uses workflow to run a simple presence‐background SDM for a mosquito species in the UK, fitting a MaxEnt model with default settings and 500 randomly placed background points.
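A minimal sketch of such a call is given below. It assumes the modules named here (UKAnophelesPlumbeus, UKAir, Background, MaxEnt and PrintMap) are available on the module repository; treat the exact names and arguments as illustrative rather than definitive.

  library(zoon)

  # a presence-background SDM for a UK mosquito species:
  # MaxEnt with default settings and 500 random background points
  mosquito <- workflow(
    occurrence = UKAnophelesPlumbeus,   # presence records for Anopheles plumbeus
    covariate  = UKAir,                 # gridded air-temperature covariate for the UK
    process    = Background(n = 500),   # add 500 randomly placed background points
    model      = MaxEnt,                # MaxEnt model with default settings
    output     = PrintMap               # print a map of the predicted distribution
  )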

3 CHAINS, LISTS AND REPLICATES

Many SDM analyses apply more than one method within one of the five SDM steps, for example to combine multiple covariates and processing steps, to compare various models on the same dataset or to run the same procedure for several species. zoon supports these more complex workflows by allowing users to pass multiple modules of each type via the list, Chain and Replicate functions.

3.1 Chain

Chain runs multiple modules of the same type sequentially, as illustrated in Figure 1c. For example, in a presence‐only analysis a chain of two process modules could be used first to generate background data and then to standardise the covariate rasters using the combined presence/background dataset. Chains can also be applied to occurrence or covariate modules, to combine multiple datasets (e.g. occurrences from different databases) into one. Chaining output modules simply runs each module separately, allowing the user to create multiple maps and summary figures, calculate performance metrics and create other model outputs in one workflow. Model modules are the only module type that may not be chained, since their inputs and outputs are of different types.

3.2 list

list splits a workflow into parallel paths, with each path using a different one of the listed modules, as illustrated in Figure 1d. For example, listing three model modules would take the same occurrence and covariate data, fit the three models separately and apply the same output modules to each fitted model. Lists can also be used to run the same SDM procedure for multiple species or multiple sets of covariates, or to compare different process modules. Listing output modules causes each module to be applied separately.

3.3 Replicate

Some steps in SDM analyses are stochastic and need to be run multiple times; Replicate enables this by generating a list with one module repeated a given number of times. Replicate could be used to run the same workflow for hundreds of simulated occurrence or covariate datasets, to generate multiple bootstraps for modelling or to fit models with stochastic elements.

The following workflow uses a chain to generate background records and then standardise the covariate rasters, and uses a list to fit three different models: MaxEnt, Boosted Regression Trees (also known as Generalized Boosted Regression Models [GBM]) and RandomForest. The results of the three models are then returned in three Shiny apps (created by the Appify module), allowing the user to interactively explore the occurrence and covariate data, model summaries and prediction maps.
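A sketch of this workflow is shown below. The process module used for standardising covariates (here called StandardiseCov) and the exact names of the model modules (MaxEnt, GBM, RandomForest) are assumptions about what is available on the module repository.

  library(zoon)

  comparison <- workflow(
    occurrence = UKAnophelesPlumbeus,
    covariate  = UKAir,
    # chain: generate background points, then standardise the covariate rasters
    process    = Chain(Background(n = 500), StandardiseCov),
    # list: fit three alternative models to the same processed data
    model      = list(MaxEnt, GBM, RandomForest),
    # one interactive Shiny app is created for each fitted model
    output     = Appify
  )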

4 CROSS‐VALIDATION AND EXTERNAL VALIDATION

workflow handles cross‐validation and external validation internally, storing this information in the data frame of occurrence records. Occurrence or process modules can be used to assign records to one or more cross‐validation folds (positive integers indicating different hold‐out groups) or to an external validation dataset (indicated by a zero). At the model stage, workflow fits a separate model for each cross‐validation fold, then makes and stores predictions for the hold‐out datapoints, as well as fitting a model to the whole dataset. In all model fitting, records flagged for external validation are omitted. The cross‐validation predictions can subsequently be analysed by output modules to estimate out‐of‐sample predictive performance. Predictions can also be made from the full model to an external validation dataset and evaluated.
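As an illustration, the sketch below adds five-fold cross-validation to the earlier workflow; the Crossvalidate and PerformanceMeasures module names and arguments are assumptions based on the description above.

  cv_run <- workflow(
    occurrence = UKAnophelesPlumbeus,
    covariate  = UKAir,
    # assign each record to one of five cross-validation folds
    process    = Chain(Background(n = 500), Crossvalidate(k = 5)),
    model      = LogisticRegression,
    # summarise out-of-sample predictive performance (e.g. AUC) across folds
    output     = PerformanceMeasures
  )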

5 SHARING WORKFLOWS

The zoonWorkflow object returned by workflow contains all of the data, code and results used in each stage of the analysis. The object is therefore a self‐contained representation of an entire analysis, which can be saved as a binary RData object, archived, shared with colleagues or uploaded to the web. Anyone else may then load the zoon r package, load the object into their r session and investigate the analysis.

Workflow objects could, therefore, be provided as supplementary information to journal articles or hosted online to meet journal requirements for public archiving of data and software. zoon provides the ZoonFigshare function to facilitate sharing a completed workflow object from within r, via the free web platform figshare. ZoonFigshare takes a workflow object and some minimal metadata and uploads the workflow as an RData object, along with a metadata text file, to the user's figshare profile for others to download, inspect and modify.
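For example, a completed workflow might be archived and shared as follows; the exact argument names of ZoonFigshare are assumptions based on the description above.

  # save the workflow object as a binary RData file for archiving or sharing
  save(mosquito, file = "mosquito_workflow.RData")

  # upload the workflow and minimal metadata to the user's figshare profile
  ZoonFigshare(zoonWorkflow = mosquito,
               title       = "Presence-background SDM for a UK mosquito",
               description = "zoon workflow: MaxEnt with 500 background points",
               authors     = "A. N. Author",
               categories  = "ecology",
               tags        = c("SDM", "zoon"))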

6 EXPLORING WORKFLOWS

zoonWorkflow objects are simply r lists containing the outputs from each module in the workflow, as well as the modules used, their versions and arguments, and information about the r session in which the workflow was run.

The overall structure of a zoonWorkflow object can be inspected and visualised using the provided print and plot methods. The outputs and intermediate steps of the analysis can be extracted directly from the zoonWorkflow object, or using the utility functions Occurrence, Covariate, Process, Model and Output to extract the outputs of each analytical stage. These functions return either the single r object produced by each module, or a list of objects when a list of modules was used in the workflow.
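For instance, the workflow created in the listing example above could be inspected as follows (a sketch; the accessor functions are assumed to take the workflow object as their argument).

  print(comparison)                 # summary of the modules and the call used
  plot(comparison)                  # diagram of the module-workflow structure

  occ   <- Occurrence(comparison)   # the occurrence data used in the analysis
  mods  <- Model(comparison)        # a list of the three fitted model objects
  preds <- Output(comparison)       # the outputs of the output module(s)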

7 REPRODUCING AND EXTENDING WORKFLOWS

Given a zoonWorkflow object, the analysis can be repeated in its entirety using the zoon function RerunWorkflow. This function can therefore be used to rapidly update an analysis whenever the underlying occurrence or covariate data (from an online repository, for example) are updated. When used in this way, the function is equivalent to copying and rerunning the author's original source code. RerunWorkflow can also run from a specific stage of the workflow, for example using the previously downloaded data stored in the object but rerunning an output module to recreate a plot.

Workflows can similarly be modified and rerun by replacing or adding modules in one or more of the five analytical steps using the function ChangeWorkflow. ChangeWorkflow reruns only the changed modules and those downstream; the stored outputs of earlier modules are reused rather than recomputed.

This functionality therefore enables researchers to explore and alter existing workflows, such as one from a published analysis, without having to rerun computationally expensive models or redownload datasets that may have changed in the interim. See Example 1 for a demonstration of this functionality.
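A sketch of both functions applied to a shared workflow object is given below; the ResponseCurve module's argument is an assumption.

  load("mosquito_workflow.RData")        # a workflow object shared by its author

  # repeat the entire analysis, re-downloading data and refitting the model
  rerun <- RerunWorkflow(mosquito)

  # swap the output module; only the output stage is rerun, the stored
  # occurrence data, covariates and fitted model are reused
  updated <- ChangeWorkflow(mosquito,
                            output = ResponseCurve(cov = 1))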

8 EXPLORING AVAILABLE MODULES

zoon modules are not distributed with the zoon r package, but are instead downloaded on‐the‐fly from the online module repository. New modules uploaded to the repository therefore become instantly available to zoon users, without requiring an update to the r package. The modules currently available for each type can be queried from within r via the function GetModuleList.

All zoon modules on the repository are accompanied by documentation and metadata, similar to r's help files. The documentation for any module on the repository can be accessed from r using the function ModuleHelp. zoon also provides the function ZoonCitation, which returns citation information for a module, similar to the citation function distributed with r for citing r packages.
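These functions can be used as follows (a sketch; the argument forms shown are assumptions).

  GetModuleList()             # list the modules currently available, by type
  ModuleHelp("Background")    # view the documentation for a named module
  ZoonCitation("MaxEnt")      # citation information for a module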

The only prerequisite for uploading a module to the online repository is that it passes a series of automated unit and integration tests to ensure interoperability with other modules. As with the CRAN archive of r packages, zoon leaves the SDM community to determine which methods and software should be used, rather than acting as a gatekeeper of SDM methods.

A web service for interactive exploration of available modules and their documentation is currently being developed. This platform may include systems by which users can rate different modules, making it easier for the community to promote the best modules, flag methodological issues and collaborate on module development.

9 EXAMPLE APPLICATIONS

Next, we demonstrate how zoon can be used in practice for running SDMs, creating and sharing new modules, and performing a methods comparison. The workflow objects created by these analyses can be accessed at https://doi.org/10.6084/m9.figshare.4597792.v1. We encourage readers to download, interrogate and alter these workflows for themselves.

9.1 Example 1: Modelling the potential distribution of nine‐banded armadillo

Feng and Papeş (2015) constructed a MaxEnt species distribution model for the nine‐banded armadillo using presence‐only data on the species’ current distribution and the bioclim (Hijmans, Cameron, Parra, Jones, & Jarvis, 2005) set of environmental correlates. This model was then used to predict areas in the Americas that may be suitable for the species to become established.

Such a model can be quickly and easily reconstructed as a zoon workflow using modules available in the zoon module repository. Feng and Papeş (2015) used a combination of occurrence data from GBIF and additional occurrence data manually collected from the published literature. Unfortunately, the latter data have not been made publicly available, so here we use only the data from GBIF. If the additional data had been made available, it would be straightforward to incorporate them, for example using the LocalOccurrenceData module.
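A sketch of the core workflow is given below. The module names (SpOcc, Bioclim, Clean, Background, MESSMask, MaxEnt, PrintMap), the extent and the number of background points are assumptions about the archived analysis and will differ from it in detail.

  library(zoon)

  ext <- c(-130, -20, -60, 60)   # assumed extent covering the Americas (xmin, xmax, ymin, ymax)

  FengPapes <- workflow(
    occurrence = SpOcc(species = "Dasypus novemcinctus",
                       extent = ext, databases = "gbif"),   # GBIF records via spocc
    covariate  = Bioclim(extent = ext),                     # bioclim covariate rasters
    process    = Chain(Clean,                               # drop dubious records
                       Background(n = 10000),               # add background points
                       MESSMask),                           # mask areas of extrapolation
    model      = MaxEnt,
    output     = PrintMap                                   # static map of the prediction
  )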

This workflow plots a static map of the predicted distribution, shown in Figure 2a, which corresponds to figure 3 in Feng and Papeş (2015). The resulting workflow object contains all the code required to rerun the analysis, the input data and the results of executing each module. The object FengPapes could therefore be saved as a binary file and shared as a reproducible representation of this research.

Figure 2. Outputs of the workflow objects ‘FengPapes’ and ‘FengPapesUpdate’. (a) Map of the MaxEnt predicted distribution, with a 5% omission rate threshold, produced by the ‘PrintMap’ module in the workflow ‘FengPapes’, which encodes the core of a published analysis. (b) A response curve for the first covariate, bio1, produced by the ‘ResponseCurve’ module in the workflow ‘FengPapesUpdate’, which modifies the original analysis workflow. (c) A screenshot of the interactive map produced by the ‘InteractiveMap’ module in the workflow ‘FengPapesUpdate’, displaying the raw occurrence data and predicted distribution over a global map and allowing users to interactively explore the results. White areas are masked out by the MESS mask. Any species distribution modelling (SDM) analysis distributed as a zoon workflow can easily be explored and scrutinised by modifying its output modules using the function ‘ChangeWorkflow’.

Next, we update the workflow to produce an interactive map, enabling anyone to inspect the data and predictions on a zoomable map, and to inspect the response curves of the fitted model. These outputs are shown in Figure 2, panels b and c. Various other output modules could be used here, e.g. to validate models, project distributions to new regions or save prediction maps in a variety of formats.
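A sketch of this update is shown below; only the output modules are replaced, so the stored data and fitted model are reused rather than recomputed (the module arguments are assumptions).

  FengPapesUpdate <- ChangeWorkflow(
    FengPapes,
    output = Chain(InteractiveMap,           # zoomable map of data and predictions
                   ResponseCurve(cov = 1))   # response curve for the first covariate (bio1)
  )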

9.2 Example 2: Building a spatial thinning module

Aiello‐Lammens, Boria, Radosavljevic, Vilela, and Anderson (2015) proposed an approach for dealing with spatial sampling bias in presence‐only data by ‘thinning’ the presence records, and provided the r package spThin to implement their procedure (Aiello‐Lammens, Boria, Radosavljevic, Vilela, & Anderson, 2014). We can incorporate this approach in a workflow by defining a simple process module that converts the zoon data into the spThin format, uses the spThin package to apply the thinning algorithm, and then converts the data back into zoon's expected format:
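A sketch of such a module is given below. It assumes the structure of the zoon data object described in the ‘Building a module’ vignette (a list with a data frame .data$df of records, including ‘longitude’, ‘latitude’ and ‘type’ columns, and a RasterStack .data$ras) and wraps the thin function from the spThin package; the details should be treated as illustrative.

  spThin <- function (.data, thin.par = 10, reps = 1) {

    df  <- .data$df
    occ <- df[df$type == "presence", ]   # thin only the presence records
    occ$species <- "sp"                  # spThin expects a species column

    # run the spatial thinning algorithm from the spThin package
    thinned <- spThin::thin(loc.data  = occ,
                            lat.col   = "latitude",
                            long.col  = "longitude",
                            spec.col  = "species",
                            thin.par  = thin.par,   # minimum distance between records (km)
                            reps      = reps,
                            locs.thinned.list.return = TRUE,
                            write.files    = FALSE,
                            write.log.file = FALSE)[[1]]

    # keep the thinned presences, plus all non-presence records, in zoon format
    keep <- paste(occ$longitude, occ$latitude) %in%
              paste(thinned$Longitude, thinned$Latitude)
    .data$df <- rbind(occ[keep, names(df)], df[df$type != "presence", ])

    .data
  }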

To convert this code into a zoon module, we need to write it to a standalone file (named spThin.R) with the metadata required to build the module documentation. The zoon function BuildModule helps with this step and can also run checks to make sure we got everything right:
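A sketch of that step is shown below; the exact metadata arguments accepted by BuildModule are assumptions based on its documentation.

  BuildModule(object      = spThin,
              type        = "process",
              dir         = ".",           # writes the standalone file spThin.R here
              title       = "Spatial thinning of presence records",
              description = "Applies the spThin algorithm of Aiello-Lammens et al. (2015).",
              paras       = list(thin.par = "Minimum distance between retained records (km)",
                                 reps     = "Number of thinning replicates"),
              author      = "A. N. Author",
              email       = "author@example.com",
              check       = TRUE)          # run checks on the new module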

This module can now be shared so that others can use it in their zoon workflows. Modules can be uploaded to the zoon modules repository via the online submission system at zoonproject.org. The zoon package vignette Building modules provides full details and examples of how to build modules of each type.

9.3 Example 3: Unpacking MaxEnt

The popular MaxEnt SDM method has recently been shown to be equivalent to a Poisson point process model (Renner & Warton, 2013) and to be closely approximated by a logistic regression model with weights applied to background datapoints (Fithian & Hastie, 2013; Renner et al., 2015; Warton & Shepherd, 2010). Given this close correspondence between MaxEnt and logistic regression, Renner and Warton (2013) and others have suggested that MaxEnt's superior predictive performance (Elith et al., 2006) is most likely due to the array of features (candidate covariate transformations) it constructs and its use of regularisation to prevent overfitting.

zoon enables us to investigate this hypothesis by easily comparing MaxEnt models fitted with and without regularisation and feature construction. We fit MaxEnt using both the widely used java implementation (omitting threshold features) and the downweighted logistic regression approximation provided in the maxnet r package (Phillips, 2016). We also compare these models with a standard logistic regression model, which does not apply the downweighting step. The following workflow fits these models to a previously published presence‐background dataset on the Carolina wren in the USA (Royle, Chandler, Yackulic, & Nichols, 2012), maps their predictions (Figure 3) and evaluates the predictive performance of each model against the full presence–absence dataset by AUC. Note that setting the regularisation constant in the maxnet package to zero caused numerical errors, so we use a small value instead.
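A sketch of this comparison is given below. The occurrence and covariate module names (CarolinaWrenPO, CarolinaWrenRasters), the MaxNet module arguments and the number of background points are assumptions; the archived workflow should be consulted for the exact specification, including the output module used to compute AUC against the presence–absence data.

  library(zoon)

  MaxEntComparison <- workflow(
    occurrence = CarolinaWrenPO,                  # presence records (Royle et al., 2012)
    covariate  = CarolinaWrenRasters,             # covariate rasters for the USA
    process    = Background(n = 10000),
    model      = list(
      MaxEnt,                                     # java MaxEnt, threshold features omitted
      MaxNet,                                     # default features and regularisation
      MaxNet(regmult = 1e-6),                     # regularisation effectively switched off
      MaxNet(features = "l"),                     # linear features only
      MaxNet(features = "l", regmult = 1e-6),     # linear features, no regularisation
      LogisticRegression                          # plain logistic regression
    ),
    output     = PrintMap                         # one prediction map per listed model
  )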

Figure 3. Predicted distributions of the Carolina wren from six different models, produced by the workflow ‘MaxEntComparison’. (a) MaxEnt without threshold features. (b) MaxNet with default settings. (c) MaxNet without regularisation. (d) MaxNet with regularisation but only linear features. (e) MaxNet without regularisation and only linear features. (f) Logistic regression. MaxNet with full features but no regularisation (c) gave the most local complexity, indicative of overfitting to the data. The models with only linear features (d–f) predicted a broader distribution, indicative of underfitting. Differences between MaxNet with linear features and no regularisation (e) and logistic regression (f) are due to the downweighting applied to background data in the former.

The AUC statistics calculated against the presence–absence data were: MaxEnt 0.9669; MaxNet 0.9673; MaxNet with no regularisation 0.9586; MaxNet with only linear features 0.944; MaxNet with only linear features and no regularisation 0.944; logistic regression 0.9449. As expected, MaxEnt and MaxNet (the logistic regression approximation) generated similar predictions and had similar predictive performance. Likewise, once the regularisation and features of MaxNet were switched off, both the predictions and performance were very similar to those of logistic regression.

Readers interested in a more comprehensive comparison, or in exploring the response curves, can easily reproduce and modify the analysis either by editing and rerunning the above code block or by downloading the workflow object and using the ChangeWorkflow function.

10 FUTURE DEVELOPMENTS

Reproducibility and open science are gaining significant attention (Borregaard & Hart, 2016), and journals, funders and governments are increasingly making them a requirement (European Commission, 2016; McNutt, 2016; Obama, 2013). Researchers must therefore find ways of at least disseminating code and data, if not making them usable. Despite numerous platforms for archiving and sharing code and data, users still face a challenge locating and using resources across different platforms. Increasing the usability of code and data is therefore the real challenge of open and reproducible science.

The size and fractured nature of the SDM research community therefore warrant a discipline‐specific open research platform such as zoon to resolve these problems. By providing a common interface for SDM and the capacity to create and share methodological experiments, zoon offers a new opportunity for the SDM community to develop community modelling benchmarks and to resolve scientific and methodological questions in ways that the current culture of publishing cannot achieve.

zoon also has the potential to improve methodological standards in SDM, a research area in which there is widespread misunderstanding and misuse of techniques (Yackulic et al., 2013). SDM methodologists could encode model validation procedures and best practice guidance as zoon output modules, enabling end users to rapidly ensure their modelling procedures are fit for purpose (Guillera‐Arroita et al., 2015).

zoon was both conceived and developed based on input from users. Continuing and expanding user engagement will be vital to form a critical mass of ecologists creating and using zoon modules in their research. To assist users, a number of tutorials on using zoon and developing modules are provided on the zoonproject GitHub repository and as vignettes distributed with zoon. We intend to supplement these with a gallery of research examples and a forum where the community can discuss and evaluate best SDM practice in a reproducible way.

ACKNOWLEDGEMENTS

The 2020 Science programme is funded through the EPSRC Cross‐Disciplinary Interface Programme (grant number EP/I017909/1).

AUTHORS’ CONTRIBUTIONS

All authors contributed to the design of the zoon r package. T.A.A., T.C.D.L. and N.G. implemented and programmed the r package with input from the other authors. N.G. wrote the first draft of the manuscript and all authors contributed to subsequent revisions and gave final approval for publication.

DATA ACCESSIBILITY

Nine‐banded armadillo data were downloaded from GBIF; Carolina wren data from Royle et al. (2012) were accessed via the maxlike r package. Both datasets can be accessed using the code presented here. Version 0.6 of the zoon r package is archived at https://doi.org/10.5281/zenodo.240926. The code required to reproduce this manuscript is archived at https://doi.org/10.5281/zenodo.834870.
