FossilSim: An r package for simulating fossil occurrence data under mechanistic models of preservation and recovery

Key features of the fossil record that present challenges for integrating palaeontological and phylogenetic datasets include (i) non‐uniform fossil recovery, (ii) stratigraphic age uncertainty and (iii) inconsistencies in the definition of species origination and taxonomy. We present an r package FossilSim that can be used to simulate and visualise fossil data for phylogenetic analysis under a range of flexible models. The package includes interval‐, environment‐ and lineage‐dependent models of fossil recovery that can be combined with models of stratigraphic age uncertainty and species evolution. The package input and output can be used in combination with the wide range of existing phylogenetic and palaeontological r packages. We also provide functions for converting between FossilSim and paleotree objects. Simulated datasets provide enormous potential to assess the performance of phylogenetic methods and to explore the impact of using fossil occurrence databases on parameter estimation in macroevolution.


| INTRODUC TI ON
The fossil record plays a fundamental role in understanding many aspects of evolution. Fossil occurrences have long provided the basis for inferring macroevolutionary trends, diversification rates and a timescale for events in Earth's history. It is also increasingly used to reconstruct and timescale phylogenetic relationships, motivating the development of phylogenetic models that can incorporate information from the fossil record (Heath, Huelsenbeck, & Stadler, 2014;Mitchell, Etienne, & Rabosky, 2018;Stadler, Gavryushkina, Warnock, Drummond, & Heath, 2018). However, available models necessarily make simplifying assumptions about processes leading to the generation of palaeontological data. Incorporating more complex models of fossil recovery and species evolution will be essential in developing these methods further.
A key feature of the fossil record is the incomplete and non-uniform nature of preservation and sampling, which varies over time, across space and among taxa. This aspect of palaeontological data must be considered in the interpretation of events in deep time (Holland, & Patzkowsky, 2015) but has rarely been considered in constructing models in phylogenetics. The interpretation of species and speciation in the fossil record is also not straightforward and can impact macroevolutionary parameter estimates (Ezard, Pearson, Aze, & Purvis, 2012). Similarly, the role of different speciation processes in phylogenetics and their combined interaction with fossil recovery processes remains relatively unexplored (Bapst, 2013).
Improving models in macroevolution and phylogenetics requires assessing the performance of different methods under a wide range of speciation, preservation and sampling scenarios. This assessment is only possible in datasets where the true underlying parameters are known, thus simulations play a critical role in the exploration and refinement of inference tools. Simulations are valuable for validating software implementations, assessing model adequacy and establishing the limitations of existing methods. Indeed, simulation studies have already played an important role in demonstrating the impact of palaeontological data on parameter inference in both phylogenetics (e.g. Heath et al., 2014;Warnock, Yang, & Donoghue, 2017) and palaeobiology (e.g. Soul, & Friedman, 2017;Smiley, 2018).
Here, we present an r package Fossilsim that can be used to simulate fossil and taxonomy data under a wide range of mechanistic models in a format that can be integrated easily with existing tools.
The package also provides visualisation functions that allow the user to summarise information about tree topology, taxonomy and stratigraphic data in a single plot.

| Models of speciation and taxonomy
We use the term 'taxonomy' to describe the interspecific relationships among simulated fossil and extant samples obtained using any combination of the speciation modes outlined in the following text and Figure 1. This is intended to emulate the process by which a taxonomist may divide lineages into units of morphotaxa, each one representing intervals during which characters of the greatest diagnostic value appear invariant.
Species taxonomy can be generated using Fossilsim for any fully resolved timescaled phylogenetic tree. Although the number of co-existing lineages is informative about the total number of species at a given moment in time, a phylogenetic tree object does not typically contain information about how individual species relate to the underlying branching process. Fossilsim can be used to model the speciation process along a tree and incorporates three possible modes of speciation shown in Figure 1: budding, bifurcation and anagenesis. A budding or asymmetric speciation event gives rise to one new species and does not result in the extinction of the ancestor.
A bifurcation or symmetric speciation event gives rise to two new species and results in the extinction of the ancestor. At each branching event in a phylogenetic tree, bifurcation speciation occurs with probability β. If β = 0 all speciation occurs via budding and if β = 1 all speciation occurs via bifurcation. Anagenetic speciation occurs along each branch in a phylogenetic tree with rate a .
Cryptic speciation can also be modelled: in this case, ancestor and descendant species cannot be distinguished and Fossilsim will record both the true and apparent species identity. At each speciation event, cryptic speciation occurs with probability κ. This mixed model of speciation (or taxonomy) is described in Bapst (2012) and Stadler et al. (2018).
We note that this model of speciation and the Fossilsim package in general treats taxonomic and species units as equivalent, and treats speciation and extinction as point processes, reflecting the assumptions made by models available for the analysis of fossil data.
Simulation and inference tools incorporating protracted speciation are not yet widely available but users should always consider the implications of the assumptions being made about the diversification process.

| Constant and time-dependent fossil recovery
The simplest model is a Poisson process, where each species has a constant fossil recovery rate ψ, which corresponds to the exponential waiting times between fossil sampling events (Stadler, 2010).
Under the time-dependent model, fossil recovery varies in a piecewise manner across discrete geological intervals. These intervals are specified by the user and can be of even or uneven lengths. During a given interval i, each species has a constant rate of fossil recovery i (Gavryushkina, Welch, Stadler, & Drummond, 2014). The number of fossils sampled during each interval will be drawn from a Poisson distribution with rate i . Alternatively, per-interval fossil recovery can be described using probabilities: during a given interval i, each species has a probability P collection i of being sampled. When using probabilities, at most one fossil per species will be sampled during each interval.

| Environment-dependent fossil recovery
The environmental processes that control the spatial and temporal distribution of a living species also drive patterns of fossil recovery. Holland (1995) described a model of fossil recovery based on the abundance of species relative to an environmental or ecological gradient (i.e. a species response curve). The probability of sampling a fossil of species i, P collection i , is described using a symmetric unimodal species response curve with the following Gaussian distribution: where e is the current environment (i.e. the position along the gradient) of species i, PE is the species' preferred environment, ET is its environmental tolerance and PA its peak abundance. PE, ET and PA give respectively the mean, the standard deviation and the amplitude of the distribution. Changes in environment (e) can be modelled using any user defined function. The model can be applied to any environmental or depositional context, in which a measure of environmental gradient (1) P collection i (e) = PAe −(e−PE) 2 ∕2ET 2 , F I G U R E 1 Three different modes of speciation that can be simulated using Fossilsim. The shaded area represents the ancestral species, while the white area shows the descendant species. Budding and bifurcation may also be referred to as asymmetric and symmetric speciation

(a) Budding (b) Bifurcation (c) Anagenesis
can be provided (Patzkowsky, & Holland, 2012) but has typically been applied in a marine context. In this context, environmental changes are often linked to relative water depth.

| Lineage-and trait-dependent fossil recovery
where the standard deviation = t i and t i is the duration of the species (Thorne, & Kishino, 2002;Heath et al., 2014). Under the independent trait values model, values are drawn from a user-specified distribution and assigned independently to each species (Drummond, Ho, Phillips, & Rambaut, 2006). In addition, we incorporate the parameter P change , which is the probability that there will be a change from the ancestral trait value at each speciation event.
Lineage-dependent models of fossil recovery can be coupled with both the time-homogenous Poisson process and depth-dependent models of fossil recovery. Lineage-specific trait values can be assigned to the ψ, PE, ET and PA parameters.

| Simulation
Fossilsim stores the simulated taxonomy and fossils as custom objects, taking advantage of the class system in the R programming language. These objects can also be created from a dataframe containing the required information, which allows the user to easily import pre-existing datasets.
In Figure 2, we distinguish between five important ages that may be used in reference to a given fossil species (Holland, & Patzkowsky, 2002). The speciation (or origination) and extinction times reflect the endpoints of the true species duration, and fossil occurrences are sampled within this interval. The oldest and youngest fossil occurrences reflect the endpoints of the sampled duration (i.e. the first and last appearance times). Speciation and extinction times can be calculated from the taxonomy object using the functions species.start and species.end. Fossil occurrence times as stored as part of the fossils object, as hmax and hmin, which represent the oldest and youngest age of the sampling horizon, respectively. The oldest and youngest occurrence ages can be calculated from the information provided by the fossils object.

| Taxonomy objects
The taxonomy object is a dataframe associated with a tree, which records the positions of species on the topology as well as information about each species. It contains one row per unique combination of edge and species, that is, the same species can appear multiple times in the case of budding speciation and the same edge can appear multiple times in the case of anagenetic speciation. Each row has the following information: • corresponding species (sp) and edge (edge) • parent (parent) of the species • mode of speciation that generated the species (mode): anagenetic (a), budding or asymmetric (b), bifurcating or symmetric (s) • start (start) and end (end) time of the species along the edge • optional: whether the species is cryptic (cryptic) and if true which species it is identified as (cryptic.id) F I G U R E 2 Fossilsim plot illustrating the relationship between the five ages associated with fossil species described in (Holland, & Patzkowsky, 2002): speciation, extinction, fossil occurrences ages, oldest occurrence (first appearance) and youngest occurrence (last appearance). A simulated tree is shown with each branch representing a separate species; the start and end points of each branch represent the speciation and extinction times, respectively. Each fossil occurrence is represented by a diamond. In (a) all simulated fossil occurrences are shown. In (b) only the oldest and youngest occurrences are shown (i.e. the stratigraphic ranges). Blue bars represent the interval between the oldest and youngest occurrence. If a species is sampled once (i.e. the species is a singleton) only a single occurrence is shown. If the age of fossil occurrences is not known precisely, each occurrence will also be associated with a minimum and maximum age

| Fossil objects
The fossils object is a dataframe that records a series of fossil occurrences and contains the following information for each fossil: • position of the occurrence on the tree topology (edge) • species to which the sample belongs (sp) • youngest (hmin) and oldest (hmax) age associated with the occurrence Fossil simulation functions will by default record the true simulated age of the fossils, in which case hmin and hmax are equal to this true age. To better represent the dating uncertainty associated with the fossil recovery process, fossils can also be simulated using user-defined stratigraphic intervals, in which case only the minimum and maximum age associated with each occurrence will be recorded.

| Simulation functions
Fossilsim can be used to simulate data for any user-specified fully resolved and dated phylogeny (simulated or empirical). Taxonomy objects are simulated from a tree topology in ape (Paradis, Claude, & Strimmer, 2004) format under the mixed model of speciation described above. Fossils objects are simulated from a tree topology and/or a taxonomy object under the models of fossil recovery described above. The package also contains several functions that can simulate trees, taxonomy and fossils in an integrated way.

| Interaction with r packages & other software
paleotree (Bapst, 2012) is a package for simulating fossil data and timescaling phylogenetic trees. paleotree simulates jointly the tree, taxonomy and fossil record, while these are separate functions in our package. This modularity allows the user to easily replace one component of the simulation with their own model. By allowing the user to specify the starting tree, Fossilsim creates the opportunity to explore expected distributions of fossil occurrences under different taxonomic and sampling scenarios for empirical phylogenies. In addition, we provide a range of alternative models for simulating fossil recovery. On the other hand, paleotree offers simulation models that are not currently supported by Fossilsim and can filter the simulation results under multiple conditions. These conditions can only be achieved with Fossilsim by using rejection sampling, although the starting tree can be conditioned on age and/or number of taxa when produced using treesim. Thus, we intend our simulation and plotting functions to be complementary to those provided by paleotree and provide functions for converting between Fossilsim and paleotree objects. paleotree was also used to validate Fossilsim functions (see Supplementary Material).
BEAST2 (Bouckaert, et al., 2014) is a popular Bayesian inference software for phylogenetic analyses. Its add-on sampled ancestors package (Gavryushkina, et al., 2014) is designed to handle datasets containing fossil samples, and in particular datasets where fossils are present along the lineages of the tree and not simply treated as tips. To maintain a strictly binary structure for the tree, these sampled ancestors are stored as tips at the end of zero-length branches.
Fossilsim implements this format in the SAtree class and provides functions to convert simulated trees and fossils into this format. This allows datasets simulated by Fossilsim to be used for validation or exploration of the sampled ancestors package. Trees formatted as SAtree objects are also suitable for use in the statistical computing language revBayes (Höhna, et al., 2016).

| Data visualisation
These functions require either a tree object in ape format or a taxonomy object. If the tree is provided without taxonomy, all speciation events will be assumed to be symmetric bifurcations. When plotting trees in SAtree format, all speciation events are assumed to be asymmetric.

| Examples
In values is used to simulate fossil recovery rates under the independent trait values model. In this example, the probability that trait values change at each speciation event is 0.5 and new values are drawn from an exponential distribution with mean = 3. If a tree object is passed to the simulation function, rather than a taxonomy object, the functions assume that all speciation occurs via bifurcation. Figure 3d shows a non-bifurcating representation for asymmetric speciation events. This representation is not currently compatible with mixed speciation and can only be used for fully budding trees. It was produced by the following commands:

| CON CLUS IONS
A large number of r packages have been developed for simulating and analysing phylogenetic data (a comprehensive list is compiled by the CRAN R project https://cran.rproject.org/web/views/ Phylogenetics.html). Many available packages use the ape phylo class. Since Fossilsim also uses this framework, the output can easily be combined with data simulated using a wide range of phylogenetic packages, including molecular sequences (e.g. phylosim, Sipos et al., 2011) or morphological character data (e.g. geiger, Pennell, et al., 2014). A much smaller number of packages exist for simulating palaeontological data (e.g. paleotree, Bapst, 2012). By providing a package that combines flexible, mechanistic models of fossil recovery that can output data in a range of formats, we expand the scope for exploring the performance of methods in macroevolution.
F I G U R E 3 Example output produced by Fossilsim. (a) Taxonomy plot produced by Fossilsim, showing each species as a colour. The symbol at the start of each species range marks the speciation mode (budding, bifurcation or anagenesis) which gave rise to that species. Note that in case of budding speciation, the range of a given species can cover several edges.

DATA ACCE SS I B I LIT Y
The