How should we estimate diversity in the fossil record? Testing richness estimators using sampling‐standardised discovery curves

To infer genuine patterns of biodiversity change in the fossil record, we must be able to accurately estimate relative differences in numbers of taxa (richness) despite considerable variation in sampling between time intervals. Popular subsampling (=interpolation) methods aim to standardise diversity samples by rarefying the data to equal sample size or equal sample completeness (=coverage). Standardising by sample size is misleading because it compresses richness ratios, thereby flattening diversity curves. However, standardising by coverage reconstructs relative richness ratios with high accuracy. Asymptotic richness extrapolators are widely used in ecology, but rarely applied to fossil data. However, a recently developed parametric extrapolation method, TRiPS (True Richness estimation using Poisson Sampling), specifically aims to estimate the true richness of fossil assemblages. Here, we examine the suitability of a range of richness estimators (both interpolators and extrapolators) for fossil datasets, using simulations and a novel method for comparing the performance of richness estimators with empirical data. We constructed sampling‐standardised discovery curves (SSDCs) for two datasets, each spanning 150 years of palaeontological research: Mesozoic dinosaurs at global scale, and Mesozoic–early Cenozoic tetrapods from North America. These approaches reveal how each richness estimator responds to both simulated best‐case and empirical real‐world accumulation of fossil occurrences. We find that extrapolators can only truly standardise diversity data once sampling is sufficient for richness estimates to have asymptoted. Below this point, directly comparing extrapolated estimates derived from samples of different sizes may not accurately reconstruct relative richness ratios. When abundance distributions are not perfectly flat and sampling is moderate to good, but not perfect, TRiPS does not extrapolate, because it overestimates binomial sampling probabilities. 
Coverage‐based interpolators, by contrast, generally yield more stable subsampled diversity estimates, even in the face of dramatic increases in face‐value counts of species richness. Richness estimators that standardise by coverage are among the best currently available methods for reconstructing deep‐time biodiversity patterns. However, we recommend the use of sampling‐standardised discovery curves to understand how biased reporting of fossil occurrences may affect sampling‐standardised diversity estimates.


| INTRODUCTION
Early studies of taxonomic richness through deep time (e.g. Benton, 1985; Sepkoski, Bambach, Raup, & Valentine, 1981; Valentine, 1969) interpreted the fossil record literally using face-value (=raw or observed) counts of taxa. However, because fossil record sampling varies considerably among clades, geological time-intervals and geographic regions, direct comparisons of face-value richness can be misleading (e.g. Alroy et al., 2001, 2010a, 2010b; Peters, 2005; Raup, 1972; Smith & McGowan, 2011). To infer genuine patterns of deep-time biodiversity, we need methods that successfully standardise samples of unequal sizes and permit direct comparisons of richness among assemblages.
An early approach was to standardise samples by size, drawing down samples to equal numbers of specimens, individuals or localities (classical rarefaction [CR]; Sanders, 1968). However, item-quota standardisation methods such as CR under-sample more diverse assemblages, compressing relative richness ratios and artificially flattening diversity curves (Alroy, 2010b, 2010c; Chao & Jost, 2012). The solution to this problem is to standardise samples to equal levels of completeness, or "coverage" of the species' underlying frequency distribution (Alroy, 2010a; Chao & Jost, 2012; Jost, 2010). This approach is known among palaeobiologists as shareholder quorum subsampling (SQS), and among ecologists as coverage-based rarefaction (CBR). It reconstructs richness ratios with high accuracy, provided that the shape of the abundance distribution does not vary substantially between assemblages (Alroy, 2010a-2010c; Chao & Jost, 2012).
Here, we describe a new approach for examining the real-world performance of richness estimators when confronted with new data.
We evaluate the ability of both interpolators and extrapolators to successfully standardise diversity data and accurately reconstruct relative magnitudes of richness between assemblages. We construct sampling-standardised discovery curves (SSDCs; also known as species-accumulation or collector curves) spanning 150 years of palaeontological exploration for (1) Mesozoic-early Cenozoic terrestrial tetrapods, and (2) Mesozoic dinosaurs. This novel historical dimension reveals how the potentially biased accumulation of new fossil data affects richness estimates generated by different methods. We interpret empirical patterns in light of results from simulated datasets, in which richness and evenness are precisely and independently varied and sampling is unbiased. Although we focus on fossil datasets, many of our conclusions are equally applicable to modern-day ecological studies.

| Interpolators
We evaluated two interpolation methods, CR and SQS. CR is flawed because it artificially compresses richness ratios (Alroy, 2010b, 2010c; Chao & Jost, 2012), and we implemented it in our simulations for illustrative reasons only (using the R package iNEXT; Hsieh, Ma, & Chao, 2016). The alternative approach of standardising data to equal coverage (="quorum" level) was proposed and implemented algorithmically by Alroy (2009, 2010a-2010c) under the name SQS, and independently described by Jost (2010). Chao and Jost (2012) described the analytical solutions and expanded the approach to permit extrapolation, calling it coverage-based rarefaction (CBR). The names SQS and CBR refer to the same broad approach of standardising diversity samples by coverage, and do not uniquely map onto any method of implementation (algorithmic or analytical) or piece of software (e.g. J. Alroy's SQS Perl and R scripts, and iNEXT [Hsieh et al., 2016]; see Appendix S1 for detailed discussion of SQS).
Coverage is an objective measure of diversity-sample completeness that can be efficiently estimated from the frequencies of rare species in a sufficiently large sample (Esty, 1986; see Appendix S4).
The simplest and most commonly used coverage estimator is Good's u (Good, 1953). Good's u ranges between 0 and 1, and is equal to one minus the number of singletons (species only observed once) divided by the total number of sampling units (specimens, individuals or occurrences). Sampling is poor when there are many singletons, and good when there are few. Coverage indicates what percentage of individuals in the original population belong to species included in the sample.
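As a concrete illustration (a minimal sketch; the function and its input format are our own, not from the published scripts), Good's u can be computed directly from a vector of per-species occurrence counts:

```python
def goods_u(counts):
    """Estimate sample coverage (Good's u) from per-species occurrence counts.

    counts: one entry per observed species, giving its number of sampling
    units (specimens, individuals or occurrences).
    Returns 1 minus the proportion of sampling units that are singletons.
    """
    n = sum(counts)                        # total sampling units drawn
    f1 = sum(1 for c in counts if c == 1)  # species observed exactly once
    return 1.0 - f1 / n

# A sample with many singletons is poorly covered...
print(goods_u([1, 1, 1, 1, 6]))   # 4 singletons / 10 units -> 0.6
# ...while one dominated by multitons is well covered.
print(goods_u([5, 4, 3, 2, 1]))   # 1 singleton / 15 units -> ~0.933
```

A coverage of 0.6 here means the sampled species are estimated to account for 60% of individuals in the source assemblage, with a 40% chance that the next individual drawn belongs to a new species.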
Conversely, the complement of coverage, the 'coverage deficit', indicates the fraction of individuals in the source population belonging to unsampled species. The coverage deficit at any particular level of sample completeness is proportional to the slope at that point on a rarefaction curve; this also corresponds to the probability that a new species will be observed by adding one more individual to the sample (Chao & Jost, 2012). For example, if coverage is estimated to be 0.9, the species in the sample account for 90% of the individuals in the focal assemblage, and there is a 10% chance that a new species will be discovered if the sample size is increased by one (Chao & Jost, 2012).

KEYWORDS
Dinosauria, diversity, extrapolation, fossil record, interpolation, sample coverage, shareholder quorum subsampling, species accumulation curve
Shareholder quorum subsampling has been implemented using two subsampling algorithms (Alroy, 2009, 2010a-2010c, 2014) and one set of analytical equations (Chao & Jost, 2012; see Appendix S1). Unlike the original approximate algorithm, the exact algorithm (Alroy, 2014) used here consistently imposes the target quorum. During each subsampling trial, all occurrences are drawn sequentially and randomly, continually tracking the value of Good's u. As occurrences are drawn, u may either rise or fall. Each time u crosses the target quorum, richness is recorded, and the median of these values from all subsampling trials represents the overall estimate. The exact algorithm produces results that are identical to the analytical equations of Chao and Jost (2012, implemented in iNEXT; see Figure S1). Importantly, the exact algorithm lets us implement additional protocols to address biases affecting fossil occurrence datasets (Alroy, 2009, 2010a-2010c, 2014; see Appendix S1).
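The exact algorithm can be sketched in a few lines (our own simplified re-implementation for illustration; it omits the additional bias-correction protocols mentioned above, and takes occurrences as a flat list of taxon labels):

```python
import random
from collections import Counter
from statistics import median

def exact_sqs(occurrences, quorum, trials=100, seed=1):
    """Median subsampled richness at a target quorum (exact SQS, simplified):
    draw occurrences in random order, recompute Good's u after each draw, and
    record the current species count every time u crosses the quorum."""
    rng = random.Random(seed)
    crossings = []
    for _ in range(trials):
        order = occurrences[:]
        rng.shuffle(order)
        seen = Counter()
        prev_u = 0.0
        for i, taxon in enumerate(order, start=1):
            seen[taxon] += 1
            f1 = sum(1 for c in seen.values() if c == 1)
            u = 1.0 - f1 / i          # Good's u for the current subsample
            # u may rise or fall; record richness at every quorum crossing
            if (prev_u < quorum <= u) or (u < quorum <= prev_u):
                crossings.append(len(seen))
            prev_u = u
    return median(crossings) if crossings else None

# toy occurrence list: five species with uneven abundances
print(exact_sqs(list("aaaaabbbbcccdde"), quorum=0.5))
```

Because u fluctuates as rare taxa enter and leave singleton status, a single trial may cross the quorum several times; pooling crossings across many trials and taking the median is what makes the imposed quorum consistent.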
For the simulations, we used a combination of the analytical equations and the exact algorithm, newly implemented here in the R language. SQS richness estimates for fossil datasets were calculated using iNEXT and the exact algorithm implemented in J. Alroy's Perl script (version 4.3). The latter allows us to apply the three-collections-per-reference protocol necessary to account for the reference effect (see Appendix S1).

| Extrapolators
We evaluated three extrapolators: TRiPS (Starrfelt & Liow, 2016a), Chao1 (Chao, 1984) and λ5 ("lambda-5"; Alroy, 2017). TRiPS aims to estimate true richness by modelling per-interval sampling rates for extinct lineages as a homogeneous Poisson process. Maximum likelihood is used to infer a single sampling rate for all taxa present in each interval, using observed taxon occurrence frequencies and interval durations. Sampling rates are then used to estimate a single per-lineage binomial probability per interval: the odds that a species would be sampled given that it was extant during that interval. The richness estimated by TRiPS is that which maximises the binomial likelihood given that binomial sampling probability and the observed number of species. We implemented TRiPS using the R scripts provided by Starrfelt and Liow (2016a, 2016b).
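A schematic re-implementation of this logic (our reading of the description above, not Starrfelt and Liow's published code; interval duration is folded into the rate, and the binomial likelihood is maximised by brute force over candidate richness values):

```python
import math

def trips_estimate(occ_counts):
    """Schematic TRiPS: occ_counts are per-species occurrence counts (all >= 1).
    Fit the sampling intensity m (= rate x interval duration) by maximum
    likelihood under a zero-truncated Poisson, convert it to a binomial
    sampling probability, then maximise the binomial likelihood over N."""
    n_obs = len(occ_counts)
    mean_count = sum(occ_counts) / n_obs
    # MLE of m for a zero-truncated Poisson satisfies
    #   mean_count = m / (1 - exp(-m)); solve by bisection
    lo, hi = 1e-9, max(mean_count, 1.0) * 10
    for _ in range(200):
        m = 0.5 * (lo + hi)
        if m / (1.0 - math.exp(-m)) < mean_count:
            lo = m
        else:
            hi = m
    m = 0.5 * (lo + hi)
    p = 1.0 - math.exp(-m)   # probability a species is sampled at least once
    # binomial MLE of true richness N, given n_obs "successes" with probability p
    best_n, best_ll = n_obs, -math.inf
    for n in range(n_obs, 10 * n_obs + 1):
        ll = (math.lgamma(n + 1) - math.lgamma(n - n_obs + 1)
              - math.lgamma(n_obs + 1)
              + n_obs * math.log(p) + (n - n_obs) * math.log1p(-p))
        if ll > best_ll:
            best_n, best_ll = n, ll
    return best_n, p

# high occurrence counts imply p near 1, so estimated richness ~ observed;
# a sample dominated by singletons implies low p and substantial extrapolation
print(trips_estimate([6, 5, 7, 4, 6]))
print(trips_estimate([1, 1, 2, 1, 1]))
```

The sketch makes the failure mode discussed later easy to see: with right-skewed abundances, a few common taxa inflate the mean occurrence count, pushing the fitted sampling probability towards 1 and the estimate back towards the face-value count.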
Chao1 is an asymptotic richness extrapolator widely used in ecology (Colwell & Coddington, 1994;Gotelli & Chao, 2013) that uses information about rare species (singletons and doubletons) to estimate a lower bound for true species richness. Chao1 assumes that singletons, doubletons and undetected species have equal underlying frequencies, and that the sample size is large enough that the mean abundances of singletons and undetected species are similar (Chao & Chiu, 2016). Estimates should therefore be more downward-biased as communities become more uneven and sample sizes are smaller.
To address this, Alroy (2017) proposed λ5, which reformulates Chao1 in terms of Poisson sampling, and incorporates information about the number of observed species, singletons and total individuals sampled. The λ5 equation is solved via a simple hill-climbing algorithm. Although λ5 also assumes a flat abundance distribution, it incorporates information about sample size to reduce the downward bias when abundance distributions are uneven.
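The classical Chao1 estimator is compact enough to state in code (a generic implementation of the published formula, including the standard bias-corrected fallback used when doubletons are absent; this is not code from the present study):

```python
def chao1(counts):
    """Chao1 lower-bound richness estimate from per-species abundance counts.
    Uses singletons (f1) and doubletons (f2); falls back to the bias-corrected
    form f1*(f1-1)/2 when no doubletons are present."""
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    return s_obs + f1 * (f1 - 1) / 2.0

# many singletons relative to doubletons imply many undetected species...
print(chao1([1, 1, 1, 1, 2, 2, 5]))   # 7 + 4^2/(2*2) = 11.0
# ...no singletons implies the inventory is effectively complete
print(chao1([4, 3, 3, 5, 2]))         # 5.0
```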

| Simulation experiments
We performed three simulation experiments to show how richness estimators perform under ideal conditions, when true richness and evenness are known and sampling is unbiased. The first two experiments (SE1 and SE2) precisely and independently varied sampling effort, true richness and evenness, while the third (SE3) varied sampling effort systematically, but varied true richness and evenness stochastically.
For SE1 and SE2, we simulated communities with all combinations of four values of true richness (50, 100, 200 and 400 species) and four levels of underlying evenness (one perfectly even/flat, and three lognormally distributed, with standard deviations [SDs] of 1, 1.5 and 2). The simulations in SE1 are directly analogous to sampling-standardised discovery curves derived from empirical datasets. For each simulated community, samples were drawn progressively at sample sizes ranging from 1 to 10,000 individuals. At each sample size, we recorded face-value species counts and richness estimates from SQS, CR, TRiPS, Chao1 and λ5. This procedure produces a single simulated discovery curve (face-value species counts) and set of SSDCs (extrapolated richness estimates). The procedure was repeated 1,000 times and the curves averaged to produce rarefaction curves for each richness estimator. For face-value species counts, this procedure yields an item-quota or size-based rarefaction curve (i.e. a CR curve). However, performing this procedure for other richness estimators yields sampling-standardised rarefaction curves (SSRCs), and allows point estimates using size-rarefied extrapolated richness estimates (e.g. size-rarefied Chao1, TRiPS or λ5 richness).
Such curves reveal how each richness estimator is expected to respond to the progressive accumulation of data when sampling is entirely random and unbiased.
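In outline, one replicate of this procedure looks like the following sketch (our own illustration: a lognormally distributed community sampled one individual at a time, with face-value counts recorded at a few checkpoint sample sizes; the real experiments repeat this 1,000 times and track all five estimators):

```python
import math
import random

def simulate_discovery(true_richness=50, sd=1.0, max_n=2000, seed=1):
    """One simulated discovery curve: draw individuals one at a time from a
    lognormal community and record face-value species counts at checkpoints."""
    rng = random.Random(seed)
    # lognormal relative-abundance distribution (SD on a log scale);
    # sd = 0 would give a perfectly even (flat) community
    weights = [math.exp(rng.gauss(0.0, sd)) for _ in range(true_richness)]
    species = list(range(true_richness))
    seen = set()
    curve = {}
    for n in range(1, max_n + 1):
        seen.add(rng.choices(species, weights=weights)[0])
        if n in (10, 100, 1000, max_n):
            curve[n] = len(seen)
    return curve

curve = simulate_discovery()
print(curve)  # face-value richness rises towards true richness as n grows
```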
The simulations in SE2 also generated rarefaction curves for the simulated communities. However, these simulations are not analogous to empirical SSDCs, because the data were rarefied by coverage, not sample size. SE2 therefore demonstrates how each richness estimator is expected to respond to progressive increases in sample coverage (from 0.1 to 0.9999). Although the analytical equations of Chao and Jost (2012) permit Chao1 estimates at specific levels of coverage, we opted to use the exact algorithm because this allows us to standardise any estimator to equal coverage. We achieved this by modifying the code for the exact algorithm in R to calculate not only simple counts of species whenever the target quorum was crossed or reached, but also estimates from Chao1 and λ5, yielding coverage-rarefied extrapolated richness estimates. The asymptotic richness estimates from these extrapolators are derived from repeated subsamples of the data at specific levels of coverage.
We did not rarefy TRiPS to equal coverage because implementing this method within the exact algorithm was too computationally intensive. TRiPS runs approximately three orders of magnitude slower than other extrapolators, and the exact algorithm used to standardise richness estimators to equal coverage is also computationally demanding. However, results from size-based rarefaction of TRiPS richness estimates in SE1 demonstrate that coverage-based rarefaction of TRiPS would not be beneficial (see Section 3).
SE3 tested the sensitivity of richness estimators to stochastic variation in richness and evenness. We generated sampling-standardised richness estimates for many simulated communities in which true richness was sampled from a lognormal distribution (SD = 1 and M = 5 on a log scale), and the SD of the underlying lognormal abundance distribution was randomly varied from 1 to 2 on a log scale. These were standardised at both a range of sample sizes and coverages.
We also used the simulation framework from SE1 to show expected counts of singletons, doubletons and multitons (species that have been sampled at least twice) with increasing sampling effort.
Comparing curves of singletons, doubletons and multitons from empirical fossil datasets to expected patterns under entirely unbiased sampling can shed light on the nature of reporting biases. For example, novelty biases (see Section 4) are expected to inflate the frequencies of singletons relative to multitons, and might therefore distort curves of counts as a function of sampling effort.

| Empirical sampling-standardised discovery curves
Full details of the fossil occurrence data are provided in Appendix S2. We downloaded Mesozoic-early Eocene occurrence data for Tetrapodomorpha, and Mesozoic occurrence data for Dinosauromorpha, from the Paleobiology Database (PaleoDB).
Marine tetrapods and flying taxa were excluded, and the datasets were cleaned (see Appendix S2).
Our analyses focus on two partitions of these data. The first comprises North American data because this continent has the best sampled fossil record for much of this interval. The second partition comprises all global occurrences of Mesozoic dinosaurs. These data are exceptionally complete and well-vetted in the PaleoDB, and there has been intense interest in reconstructing dinosaur diversity patterns (e.g. Barrett, McGowan, & Page, 2009; Starrfelt & Liow, 2016a; Tennant, Chiarenza, & Baron, 2018), including in the initial publication of TRiPS. Global diversity curves suffer from profound issues with highly variable sampling universes (Appendix S2), and here we only analyse dinosaur data at the global level to enable direct comparison with Starrfelt and Liow (2016a).
We reconstructed SSDCs for each geological stage-level time interval by subsetting our data to create 150 historical datasets, representing yearly timeslices through the history of palaeontological discovery, from 1866 to 2016. Each PaleoDB occurrence is associated with a published reference that corresponds to either the original description or the latest accepted taxonomic revision. This information was used to limit each historical timeslice to only those occurrences published prior to or during the year in question. Historical snapshots of the fossil record may include taxonomic opinions and identifications that were later rendered obsolete. This provides a more accurate picture of the history of palaeontological discovery, and is akin to using historical literature compilations (e.g. Alroy, 2000;Sepkoski, 1993;Tennant et al., 2018). We did not construct SSDCs using CR because our simulations provide further evidence that the method produces misleading results (see below).
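The timeslicing step reduces to filtering occurrences on the publication year of their governing reference (a sketch using hypothetical records; the field names are our own, not PaleoDB column names):

```python
def historical_timeslices(occurrences, start=1866, end=2016):
    """Yield (year, dataset) pairs in which each dataset contains only the
    occurrences whose governing reference (the original description or latest
    accepted revision) was published in or before that year."""
    for year in range(start, end + 1):
        yield year, [occ for occ in occurrences if occ["ref_year"] <= year]

# hypothetical records: taxon plus the year of its governing reference
occs = [{"taxon": "A", "ref_year": 1877},
        {"taxon": "B", "ref_year": 1905},
        {"taxon": "C", "ref_year": 1990}]
snapshots = dict(historical_timeslices(occs))
print(len(snapshots[1900]), len(snapshots[2016]))  # 1 3
```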
Sampling-standardised discovery curves in which collection effort is quantified by time in years may be misleading if discovery rates are strongly heterogeneous. We therefore focus on SSDCs in which effort is quantified by the chronological addition of occurrences. Together with coverage estimates, these provide a much clearer view of sampling effort through collector-time.
To examine biases in the real-world accumulation of species in the fossil record, we compared empirical SSDCs to null distributions in which the order of occurrence discovery is repeatedly randomised (these are equivalent to sampling-standardised rarefaction curves). This is achieved by generating many replicate datasets in which occurrences are randomly assigned publication years, and calculating SSDCs for each. These null distributions shed light on the performance of sampling-standardisation methods for constructing SSDCs in the absence of systematic collection and reporting biases (Alroy, 2010a-2010c), including the expansion of the sampling universe (e.g. when the empirical SSDC falls well above or below the range observed in the null). We calculated palaeogeographic spread (the spatial distribution of fossil localities within a time interval) in order to quantify the expansion of the geographic sampling universe through collector-time (Appendix S2). All analyses were conducted in R (version 3.2.2; R Development Core Team, 2015), unless otherwise stated. All analysis code and data have been archived on Zenodo (https://doi.org/10.5281/zenodo.1167536; Close et al., 2018).
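Generating such a null distribution amounts to repeatedly shuffling the discovery order and recomputing the curve (a sketch computing cumulative face-value richness from taxon labels; the full analysis recomputes subsampled estimates at each step):

```python
import random

def null_discovery_curves(occurrences, replicates=100, seed=1):
    """Null SSDC envelope: shuffle the discovery order of the occurrences many
    times and return, per replicate, the cumulative count of distinct taxa."""
    rng = random.Random(seed)
    curves = []
    for _ in range(replicates):
        order = occurrences[:]
        rng.shuffle(order)          # randomise the order of discovery
        seen, curve = set(), []
        for taxon in order:
            seen.add(taxon)
            curve.append(len(seen))
        curves.append(curve)
    return curves

curves = null_discovery_curves(list("aabbbcddde"), replicates=50)
# every null curve ends at the same total richness (5 distinct taxa)
print({c[-1] for c in curves})  # {5}
```

An empirical curve falling well below this envelope at some point in collection history is the signature of a late-expanding sampling universe, as described above.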

| Simulation experiments
Rarefying face-value species counts by sample size, as in SE1, produces conventional size-based rarefaction curves. Interpolated estimates, by contrast, do not change once the target sample size or quorum has been reached. As a result, SSRCs for SQS and CR are simply flat lines extending out from the point at which the sample size or coverage is sufficient to obtain an estimate (Figure S3).
However, although it is relatively insensitive to raw sample size, CR artificially compresses richness ratios by progressively underestimating the relative richnesses of more diverse communities (Alroy, 2010b, 2010c; Chao & Jost, 2012). This results in a nonlinear relationship between true and estimated richness (especially when evenness is high; Figure 3). In contrast, standardising by coverage yields perfectly accurate relative richnesses provided that the shape of the abundance distribution does not vary among communities.

FIGURE 1  Size-based rarefaction curves for face-value counts (=CR), Chao1 and λ5, analysing communities from Simulation Experiment 1. Columns represent true evenness values (flat; lognormal SD = 1, 1.5 and 2). Until they asymptote, extrapolated richness estimates are strongly sample-size dependent. Extrapolators converge on true richness rapidly when communities are perfectly even, but take progressively longer as evenness decreases. When communities are not perfectly even, TRiPS ceases to extrapolate above a certain sample size. λ5 performs better than other methods tested when communities are uneven.

FIGURE 2  Coverage-based rarefaction curves using both face-value counts (=SQS) and extrapolated richness from Chao1 and λ5, analysing communities from Simulation Experiment 1. TRiPS is not included due to computational issues, but Figure 1 demonstrates that this method performs poorly when abundance distributions are uneven and sampling is moderate to good.

The sampling level required for extrapolators to asymptote becomes greater as true richness increases or evenness decreases.
When communities are perfectly even (i.e. when the species abundance distribution is perfectly flat), sampling-standardised rarefaction curves for Chao1, TRiPS and λ5 stabilise at very small sample sizes (Figure S3). For a perfectly even community with 400 species, these extrapolators asymptote on true richness after sample sizes reach 100 individuals, or when sample coverage is <0.5 (Figures S3 and 2).
By contrast, face-value counts of species only asymptote on true richness after at least 1,000 individuals have been sampled, the point at which coverage is total. Confidence intervals are large at small sample sizes, with an upper bound that peaks sharply (greatly exceeding true richness) before shrinking, then disappearing as the coverage deficit diminishes to zero (Figure S4). Chao1 yields the most conservative estimate, but converges on true richness slightly later than TRiPS and λ5.

FIGURE 3  Relationship between true and estimated richness for estimators standardised by sample size (face-value counts [=CR], TRiPS, Chao1 and λ5), analysing communities from Simulation Experiment 3. Standardising to equal sample size causes estimators to scale nonlinearly with true richness, particularly when sampling is limited. Standardising Chao1 and λ5 to equal sample size yields a tighter relationship, but the nonlinear pattern remains.
As evenness decreases, extrapolators require progressively more data in order to converge on true richness (Figures S3 and 2). For communities with lognormal frequency distributions, λ5 converges on true richness earlier than Chao1 (SD = 1-1.5), but initially overshoots true richness when evenness is very low (SD = 2). As evenness decreases, upper confidence interval bounds for Chao1 and λ5 (but not TRiPS) usually approach or encompass true richness (Figure S4).

FIGURE 4  Relationship between true and estimated richness when standardising face-value counts (=SQS), Chao1 or λ5 estimates to equal coverage, analysing communities from Simulation Experiment 3. Coverage-standardised estimators scale linearly with true richness. Variation in evenness (SD of underlying lognormal distribution) causes a looser relationship at lower levels of coverage, but the effect diminishes as coverage increases. Standardising extrapolators (especially λ5) to equal coverage yields a visibly tighter relationship.
TRiPS, however, ceases to extrapolate (simply returning face-value counts of species) when the underlying frequency distribution is not perfectly flat and sample sizes are moderate to large (Figures 1 and S3). Once TRiPS ceases to extrapolate, confidence intervals shrink to negligible sizes (Figure S4). This even occurs when the underlying lognormal frequency distribution is comparatively even (SD = 1; Figure S3). When evenness is very low (SD = 2) and sample sizes are large, TRiPS ceases to yield richness estimates altogether.
If sampling is limited relative to true richness, all richness estimators tested here are biased by low evenness. Regardless of whether samples are rarefied by size (CR) or coverage (SQS), richness estimates drop as evenness diminishes (compare Figures 5 and S5).
When sampling is comparatively limited, substantial among-sample differences in evenness can severely confound estimates of relative richness: even at a quorum of 0.9, the coverage-standardised estimate for a community with 400 species and low evenness (SD = 2) is substantially less than that for a perfectly even community with 200 species (Figure 5). As communities diverge in evenness, progressively greater coverage is required in order to accurately infer relative richness for communities as a whole, since it becomes ever harder to detect the rarest species. When sampling is very poor, standardising by coverage produces richness estimates that are slightly more biased by differences in evenness than standardising by sample size.
However, when sampling is very good, the situation is reversed, and CR becomes more sensitive to evenness than SQS (Figure S6; see Discussion). The influence of evenness diminishes as sample size or coverage level increases (Figures 5, S5 and S7). Crucially, however, only standardising by coverage yields a linear relationship between true and estimated richness (Figures 3 and 4).
Downward biases to richness estimates caused by low evenness are substantially reduced by using coverage-based rarefaction of extrapolated richness estimates, rather than coverage-based rarefaction of simple face-value counts of species (=SQS/CBR).
Coverage-rarefied λ5 richness estimates are the least affected by evenness (compare richness estimates at a coverage of 0.99 for face-value counts, Chao1 and λ5 in Figure 5; coverage-rarefied λ5 is nearly unaffected by differences in evenness).
We were not able to coverage-rarefy TRiPS richness estimates (see above). Our simulations also yield expected counts of singletons, doubletons and multitons under entirely unbiased sampling; empirical patterns can be compared against these to assess the strength of the reporting biases (see below).

| Empirical sampling-standardised discovery curves
Empirical SSDCs using extrapolators show little sign of asymptoting. Simulations show that SSDCs using subsampling methods will follow a perfectly flat trajectory if (1) sampling is random and unbiased, and (2) the size of the underlying sampling universe is static (Figure S3). However, idiosyncratic sampling of the fossil record may cause subsampled diversity estimates to fluctuate. Firstly, SQS richness may decline as new occurrences are added; this is most evident for global dinosaurs during the Maastrichtian (Figures 6c, 7b). Null distributions reveal the range of patterns SSDCs would take if all currently-known occurrences had been discovered in random order, and thus shed light on systematic reporting biases, such as a preference for reporting novel taxa, or systematic expansion of the sampling universe through collector-time (see Section 4). When the sampling universe is expanded late in collection history, the empirical SQS SSDC lies below the range of randomised collection histories (e.g. North American tetrapods during the Danian and Ypresian above quorum 0.1; Figure S12b,c).
Progressively better sampling of the same universe causes the empirical curve to lie within the range of the null (e.g. for global dinosaurs during the Tithonian; Figure S13). Maastrichtian tetrapods exhibit both patterns depending on the quorum level (Figure S12c): a late, steep rise is evident at a quorum of 0.6, causing the empirical curve to fall below the null. However, the empirical curves overlap with the null for quorums of 0.1-0.5. This may be because later collecting efforts do little to alter the species sampled at low quorum levels (the most common taxa). Diversity curves constructed using all the estimators we test are shown in Figures S14 and S15.

| DISCUSSION
Both our simulations (Figures 1, S3 and 2) and SSDCs for fossil datasets (Figures S9 and 6) demonstrate that although interpolators consistently standardise diversity samples of differing sizes, extrapolators should no more be expected to yield fair results from such samples than direct comparisons of face-value counts of species (unless coverage for all assemblages is sufficient for extrapolators to have reached an asymptote). Extrapolators yield a minimum bound for true richness (Chao, 1984). However, true richness may substantially exceed this when sample sizes are insufficient (Chao, Colwell, Lin, & Gotelli, 2009). The sample-size dependency of extrapolators is well-known (Chao et al., 2009; Colwell & Coddington, 1994) but perhaps not widely appreciated.
By their very nature, richness estimates obtained from SQS and CR are relatively sample-size invariant, and our empirical SSDC results broadly reflect this. However, extrapolated SSDCs for empirical fossil data show few signs of asymptoting (Figures 6 and S9). This suggests that sampling in the tetrapod fossil record is generally not yet good enough to use extrapolators unless they are applied within a rarefy-and-extrapolate protocol of the kind used in our simulation experiments. Our simulations show that rarefying Chao1 or λ5 estimates to equal coverage produces the best results (particularly λ5; Figures 5 and S5). In particular, rarefying extrapolators to equal coverage is the best way to remove confounding effects of among-sample variation in evenness, a problem that affects all richness estimators when sampling is comparatively limited (see below). A similar approach is advocated by Colwell et al. (2012) and Chao and Jost (2012). iNEXT (Hsieh et al., 2016) implements this for Chao1, but analytical solutions for adjusting λ5 to particular levels of coverage do not yet exist.

| TRiPS
Our results suggest that, as additional data are accumulated, TRiPS eventually stops extrapolating when evenness is less than perfect because it fits a single sampling rate for all species in each interval (="uniform" sampling rates in the parlance of Wagner & Marcot, 2013). Attempting to fit a single sampling rate and probability to all taxa in each sampling unit is unlikely to work on real-world data, because empirical species-abundance distributions tend to be heavily right-skewed on an arithmetic scale (Preston, 1962a, 1962b). Modern species-abundance distributions are best described by double-geometric distributions, but the lognormal is a reasonable alternative (Alroy, 2015). Although time-averaging could potentially alter these patterns, log-normal distributions of per-collection sampling rates among taxa have been shown to fit empirical fossil occurrence data much better than uniform rates (Wagner & Marcot, 2013). Even if the per-individual chance of preservation were identical for all species, ubiquitously right-skewed abundance distributions cause sampling rates and probabilities to be overestimated for rare taxa.

| Reporting biases and sampling-universe variability
Sampling-standardised discovery curves are valuable because the fossil record is not sampled in an unbiased and random manner, and because the nature of sampling may change through collector time.
SQS SSDCs are considerably more stable than those for extrapolators. However, in some instances SQS richness may rise or fall as data accrues. We attribute such fluctuations to two drivers: (1) nonrandom reporting of fossil discoveries, and (2) sporadic expansion of the sampling universe through collector-time.
A key assumption of any richness estimator is that sampling is unbiased. However, palaeontological research probably exhibits a 'novelty bias'-a tendency to prioritise publication of new taxa over new occurrences of named taxa (Alroy, 2010c;Tennant et al., 2018).
At least in the early phases of discovery, this bias results in inflated counts of singletons, which bias estimates of sample coverage downwards, and estimated richness upwards. When novel taxa become scarce, efforts may shift towards reporting additional occurrences of named taxa. This phenomenon may explain the decline in SQS richness of Maastrichtian tetrapods over the latter half of the twentieth century (see Figure S9a). The non-random nature of palaeontological reporting practices is underscored by the trajectories of singleton and multiton taxa through collector-time (Figure S9v-x). When sampling is entirely random (Figure 8), the ratio of singleton to multiton taxa is expected to decline more or less monotonically, yet for Maastrichtian tetrapods it appears invariant: multitons are under-reported relative to singletons. This may explain why Good's u often appears to asymptote well below 1 (Figure 6s-u).
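The expected behaviour under genuinely random sampling can be reproduced with a minimal simulation (the community used here, with species i represented by i + 1 individuals, is purely hypothetical):

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical uneven community: species i is represented by i + 1 individuals
pool = [i for i in range(100) for _ in range(i + 1)]
random.shuffle(pool)  # random sampling order, i.e. unbiased discovery

sample = Counter()
ratios = []  # singleton:multiton ratio as sampling proceeds
for n, species in enumerate(pool, start=1):
    sample[species] += 1
    if n % 1000 == 0:
        f1 = sum(1 for c in sample.values() if c == 1)
        multitons = sum(1 for c in sample.values() if c > 1)
        ratios.append(f1 / multitons)
```

Under random accumulation the ratio falls steadily as singletons are promoted to multitons; a flat empirical trajectory therefore signals non-random reporting.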
SQS SSDCs may also fluctuate due to non-random expansion of the sampling universe (e.g. increases in sampled geographic area, palaeolatitudes or palaeoenvironments). Studies of regional-level diversity patterns (i.e. continental-scale or gamma diversity) implicitly assume that fossil discoveries are a representative, random sample of that geographic region. However, fossil discoveries within continental regions have highly non-random spatial distributions, providing only a partial window into the intended geographic sampling universe. Furthermore, the realised sampling universe tends to expand as new fossiliferous regions are discovered. Even the best richness estimators cannot correct for variability in the size of the underlying taxon pool. It is, therefore, important that the realised sampling universes within focal assemblages are comparable.
SSDCs provide valuable context for gauging the maturity of sampling in focal assemblages. At the regional level, very few intervals of the dinosaur record have emerged from an early phase of discovery that tends to be characterised by volatile SQS SSDCs (Figure 7b-c and f-g). The Maastrichtian, Kimmeridgian and Tithonian of North America are likely exceptions (Figure 6). However, even the apparent stability of SQS richness in these intervals could change if productive new fossiliferous regions are discovered. This emphasises the need to recognise potential disconnects between the extent of the intended and realised sampling universes and tailor comparisons of diversity accordingly (Close, Benson, Upchurch, & Butler, 2017).

| Among-sample variation in evenness
SQS has recently been criticised for tracking evenness (Hannisdal, Haaga, Reitan, Diego, & Liow, 2017). In fact, among-sample variation in evenness will confound any richness estimator that implicitly or explicitly utilises information about relative frequencies of taxa (see also Kosnik & Wagner, 2006). This is simply because it becomes much harder to sample all of the species in a community when evenness is very low. We consider that any additional sensitivity SQS may have to differences in evenness at low coverage is a worthwhile trade-off (Figures 3 and 4). The initial description of SQS (Alroy, 2010a) acknowledged the potential for among-sample variation in evenness to confound SQS; indeed, the central assumption of SQS is that substitutions of taxa occur randomly with respect to their relative frequencies. In other words, SQS is only guaranteed to estimate richness ratios with perfect accuracy when evenness (or the shape of the species abundance distribution more generally) does not vary systematically between communities.
In fact, depending on the level of sampling, standardising to equal coverage is either more or less sensitive to evenness compared to methods that standardise to sample size (e.g. CR). When coverage is poor to moderate, richness estimates standardised to equal coverage are marginally more sensitive to evenness than those standardised to sample size. This is because SQS establishes how many species will be found, on average, by repeatedly sampling a fixed proportion of individuals in the community. We naturally expect to sample fewer species in a given fraction of the community if evenness is very low, and more species if evenness is very high. This is why, when sampling is comparatively limited, sample coverage (both true and estimated) actually increases as abundance distributions become more uneven (Figure S16; note that this shows the coverage deficit in order to allow log-transformation of the y-axis). When evenness is very low, coverage at smaller sample sizes is relatively high (and the coverage deficit is therefore low): although very rare species are unlikely to be sampled even once, common species are easy to find, and they collectively account for a large fraction of individuals in the population. Conversely, when evenness is very high, coverage drops, because limited samples likely contain many singletons. However, because coverage increases and asymptotes more rapidly in very even communities than very uneven ones, this relationship reverses when sampling is very good. Eventually, coverage for a given sample size will be higher when communities are more even, and lower if they are less even (Figures S16 and S6). Thus, as coverage increases, problems arising from among-sample differences in evenness diminish and eventually disappear.
The implication of this changing relationship between coverage, evenness and sample sizes is that SQS is more sensitive than CR to differences in evenness at low quorum levels, because it undersamples (relative to total species richness) when evenness is low. Conversely, SQS is less sensitive than CR to differences in evenness at very high levels of coverage, because SQS then samples harder when evenness is low.
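This crossover can be verified directly from expected coverage, without simulation. The sketch below compares two hypothetical communities of equal richness, one perfectly even and one with geometric-series abundances:

```python
def expected_coverage(probs, n):
    """Expected sample coverage: summed relative abundance of the species
    expected to appear at least once in a random sample of n individuals."""
    return sum(p * (1 - (1 - p) ** n) for p in probs)

S = 50
even = [1 / S] * S  # perfectly even community

# Geometric-series (uneven) abundances with ratio k, normalised to sum to 1
k = 0.7
raw = [k ** i for i in range(S)]
uneven = [r / sum(raw) for r in raw]

# At small sample sizes the uneven community has HIGHER expected coverage
# (its common species dominate the individual pool and are quickly found)...
small = (expected_coverage(even, 10), expected_coverage(uneven, 10))

# ...but at large sample sizes the ranking reverses, because coverage
# asymptotes far more quickly in the even community
large = (expected_coverage(even, 500), expected_coverage(uneven, 500))
```

The specific richness, geometric ratio and sample sizes are arbitrary; the qualitative reversal holds across a wide range of values.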
From a theoretical perspective, total species richness and the shape of the abundance distribution are distinct properties.
However, practicalities of sampling mean that it may be difficult to disentangle these two properties. As Chao and Jost (2012) observe, variation in the shape of the abundance distribution is the reason why size-based (CR) rarefaction curves for different assemblages can cross (signifying points where the rank-order richness of communities switches). Coverage-based rarefaction curves (plots of richness as a function of coverage; e.g. Figure 2) cross the same number of times as size-based rarefaction curves, but less data is required to detect where this occurs. The only way to correctly resolve true differences in ranked richness is by attaining sufficient coverage in each assemblage to have observed all the crossing points-but in reality, we can never know if we have surpassed this point (Chao & Jost, 2012). This is the main reason for using the highest quorum level possible, and for treating estimates from low quorum levels with scepticism. However, SQS does tell you how many species will be found, on average, in a random sample of a fixed percentage of individuals drawn from the population, information that is biologically meaningful.
Another reason for preferring higher quorum levels-even if the shape of the abundance distribution does not vary between communities-is that it is difficult to accurately estimate low levels of coverage from limited samples. All else being equal, SQS requires much less data than CR to accurately reconstruct richness ratios and, in theory, richness ratios can be accurately reconstructed from very small sample sizes provided that abundance distributions do not differ (Chao & Jost, 2012, Table 2). In practice, however, sample coverage must be estimated from the data. Our simulations demonstrate that coverage can be very accurately estimated when sample sizes are moderately large; precision increases with sample size (Figures S17 and S18). However, both accuracy and precision depend on true richness and evenness: coverage is more difficult to estimate from small samples when true richness and evenness are low, and easier to estimate when richness and evenness are high.
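The dependence of estimation accuracy on sample size can be illustrated by comparing Good's u against true coverage for samples drawn from a known (hypothetical) community; the error magnitudes here are illustrative, not general:

```python
import random
from collections import Counter

random.seed(42)

S = 100
# Hypothetical community with linearly declining relative abundances
probs = [2 * (S - i) / (S * (S + 1)) for i in range(S)]

def coverage_error(n):
    """|Good's u - true coverage| for one random sample of n individuals."""
    counts = Counter(random.choices(range(S), weights=probs, k=n))
    true_cov = sum(probs[i] for i in counts)  # summed abundance of detected species
    f1 = sum(1 for c in counts.values() if c == 1)
    return abs((1 - f1 / n) - true_cov)

def mean_error(n, reps=200):
    return sum(coverage_error(n) for _ in range(reps)) / reps

# Mean absolute estimation error shrinks as samples grow
errs = {n: mean_error(n) for n in (50, 200, 1000)}
```

Because true coverage is known only in simulation, exercises like this are the only way to calibrate how much trust to place in coverage estimates from small empirical samples.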

| CONCLUSIONS
Simulations and empirical sampling-standardised discovery curves (SSDCs) for fossil datasets show that standardising diversity data to equal coverage ensures fair comparisons of richness when sampling is limited. When sampling is unbiased and the shape of the abundance distribution does not vary among communities, SQS yields perfectly accurate relative richness ratios, and standardised estimates scale linearly with true richness. Empirical SSDCs using SQS are more stable than those using extrapolators. Richness estimators that standardise by coverage are among the best currently available methods for reconstructing deep-time biodiversity patterns.
Extrapolated richness estimates obtained from samples of unequal sizes may be almost as misleading as direct comparisons of unstandardised richness. Unless sampling is sufficiently complete for the estimator to have asymptoted, extrapolated estimates may strongly depend on sample size, yielding inaccurate relative richness ratios among assemblages. This is especially crucial for fossil occurrence data, because sample completeness varies substantially among time intervals and geographic regions. The sampling level required for extrapolators to asymptote increases with true richness and decreases with evenness. Of the extrapolators we tested, the λ5 method performs best when evenness is low.
When abundance distributions are less than perfectly even and sampling is moderate to good but not complete, TRiPS stops extrapolating and instead returns face-value counts of taxa. This is because TRiPS fits a single sampling rate for all species in each interval, which causes the method to overestimate binomial sampling probabilities.
Most assemblages of interest to palaeobiologists or ecologists are unlikely to have flat abundance distributions, and indeed SSDCs using TRiPS often closely track unstandardised discovery curves.
All richness estimators are biased by differences in evenness when sampling is comparatively limited. Richness estimates become downwardly biased as evenness diminishes, since it becomes ever harder to detect the rarest species. When overall sampling is very poor, standardising by coverage produces richness estimates that are slightly more biased by differences in evenness than standardising by sample size. However, when sampling is very good, the situation is reversed.
Rarefying extrapolated richness estimators to equal sample coverage (i.e. using a coverage-based rarefaction algorithm to standardise extrapolated, rather than face-value counts of species) gives us the best of both worlds: it makes our samples effectively a little bigger, and therefore diminishes the impact of evenness while retaining the desirable properties of SQS (e.g. a linear relationship between true and estimated richness). Coverage-based rarefaction of extrapolators removes any potential sample-size dependency, and effectively extends the maximum coverage obtainable from limited diversity samples.
Our empirical SSDCs reveal biases in the accumulation of palaeobiological knowledge that may confound even the best richness estimators. We recommend constructing SSDCs for fossil datasets in order to shed light on these sources of bias, and to provide important historical context for understanding the reliability of present-day sampling-standardised richness estimates.