Volume 8, Issue 11
Research Article
Open Access

A general sampling formula for community structure data

Bart Haegeman

Corresponding Author

E-mail address: bart.haegeman@sete.cnrs.fr

Centre for Biodiversity Theory and Modelling, Theoretical and Experimental Ecology Station, CNRS and Paul Sabatier University, 2 route du CNRS, 09200 Moulis, France

Rampal S. Etienne

Groningen Institute for Evolutionary Life Sciences, University of Groningen, Box 11103, 9700 CC Groningen, The Netherlands

First published: 09 May 2017

Summary

  1. The development of neutral community theory has shown that the assumption of species neutrality, although implausible on the level of individual species, can lead to reasonable predictions on the community level. While Hubbell's neutral model and several of its variants have been analysed in quite some detail, the comparison of theoretical predictions with empirical abundance data is often hindered by technical problems. Only for a few models is the exact solution of the stationary abundance distribution known and sufficiently simple to be applied to data. For other models, approximate solutions have been proposed, but their accuracy is questionable.
  2. Here, we argue that many of these technical problems can be overcome by replacing the assumption of constant community size (the zero‐sum constraint) by the assumption of independent species abundances.
  3. We present a general sampling formula for community abundance data under this assumption. We show that for the few models for which an exact solution with zero‐sum constraint is known, our independent species approach leads to parameter estimates that are very similar to those of the zero‐sum models, for six frequently studied tropical forest community samples.
  4. We show that our general sampling formula can easily be confronted with a much wider range of datasets (very large datasets, relative abundance data, presence‐absence data, and sets of multiple samples) for a large class of models, including non‐neutral ones. We provide an R package, called SADISA (Species Abundance Distributions under the Independent Species Assumption), to facilitate the use of the sampling formula.

Introduction

Species abundance distributions (SADs) have long intrigued ecologists (Fisher, Corbet & Williams 1943; Preston 1948; MacArthur 1957). The motivation is, besides the relative ease of collecting this type of data, that they may contain information on how species assemble in ecological communities, and on differences in species' properties. Indeed, intuitively a high abundance seems a sign of strong adaptation to the habitat where the species resides, indicating competitive dominance. However, such a high abundance perhaps just arises by chance. In the search for explanatory mechanisms, a plethora of models have been proposed to describe the SADs (McGill et al. 2007).

The last decade has seen a revived interest in the SAD because it is one of the key predictions of the neutral theory of biodiversity (Hubbell 2001; Rosindell, Hubbell & Etienne 2011), a theory that assumes that all individuals are functionally equivalent, regardless of the species they belong to. This theory attributes the differences in abundance not to differences in adaptation, but to inherent demographic stochasticity, i.e. a large abundance need not be a sign of strong adaptation, but is just due to demographic fortune. Comparing the neutral model predictions to those of more traditional niche‐based models on abundance data has led to mixed results (Purves & Pacala 2005; Du, Zhou & Etienne 2011; Haegeman & Etienne 2011). This has invigorated the criticism that SADs do not contain sufficient information to infer the underlying process. However, stronger inferences might be possible when increasing the size of the community samples (Al Hammal et al. 2015). Moreover, in combination with other community patterns such as species‐area curves, SADs may be informative (May, Huth & Wiegand 2015). Hence, it remains a useful exercise to fit reasonable models to species abundance data.

Sampling formulas are the central ingredient of fitting community models to data. These formulas are used to evaluate the likelihood of data for a set of model parameters, find the optimal parameters using maximum likelihood and compare the fit quality of competing models, e.g. using the Akaike information criterion. For Hubbell's neutral model, an exact sampling formula was derived by Etienne (2005). This formula gives the likelihood of observing S species abundances $n_1, n_2, \ldots, n_S$ in a sample of size J individuals according to a neutral model of a local community connected by immigration (described by the dispersal probability m, or equivalently by the dispersal number I) to a metacommunity governed by point‐mutation speciation (described by parameter θ, called the biodiversity number). However, this sampling formula is computationally demanding for samples of large size.

Nevertheless, the formula paved the way for a more general sampling theory (Etienne & Alonso 2005; Green & Plotkin 2007) in which the sampling formula was presented as a compound distribution of local, dispersal‐limited sampling, and a metacommunity abundance distribution. It has been extended to multiple samples connected to the same metacommunity (Munoz et al. 2007; Etienne 2007, 2009), random‐fission speciation (Haegeman & Etienne 2010; Etienne & Haegeman 2011) and multiple guilds (Janzen, Haegeman & Etienne 24; see also Walker 2007). In all cases, the sampling formula was cumbersome to derive and demanding to compute, and the total sample size allowing numerical computation was limited. Harris et al. (2017) circumvented the latter problem, but their approach is based on Bayesian computation rather than on a simple likelihood formula.

Here we present a new framework within which sampling formulas can be relatively easily derived and computed, not only for the models for which a zero‐sum sampling formula is already available, but also for a wealth of other models. The crucial step is that we abandon the assumption of zero‐sum dynamics, i.e. constant community size, and embrace the independent species assumption, i.e. we assume that species fluctuate independently of one another. It has been shown before that the zero‐sum and independent species variants of neutral community models are intimately linked (Etienne, Alonso & McKane 2007a; Haegeman & Etienne 2008). In particular, the two model variants yield identical predictions for the local community model with fixed species pool and for the metacommunity model with point‐mutation speciation. For Hubbell's neutral model, in which the local community model is coupled to the metacommunity model, the equivalence breaks down (Haegeman & Etienne 2011), but we show that there is still an excellent agreement, especially for highly diverse systems. We exploit this correspondence to derive sampling formulas that are easy to evaluate, even for very large sample size.

Independent‐species approaches have been repeatedly applied to analyse the predictions of neutral community models. Alonso & McKane (2004) and Volkov et al. (2003, 2005, 2007) used this assumption to construct approximate solutions of the point‐mutation speciation model. Haegeman & Etienne (2010) and Etienne & Haegeman (2011) used it as a starting point to get to a zero‐sum sampling formula for random‐fission speciation. Chisholm & Pacala (2010) and Haegeman & Etienne (2011) used it as a basis for a niche model. However, none of these studies have constructed a general framework to fit community models to abundance data, as we present here.

We start by providing an intuitive idea of the independent species approach and of its computational advantages over the standard zero‐sum approach. Then, we present the general sampling formulas under the independent species assumption. We apply these formulas to the few models for which the zero‐sum approach has been developed, and show that the independent species approach leads to very similar parameter estimates. Next, we present several model fitting problems which cannot be dealt with in the zero‐sum framework, but for which the independent‐species framework can be used. In particular, we consider community models with protracted speciation, species‐level density dependence, and species‐specific dispersal rates, and datasets of very large size, relative abundance data, presence‐absence data and sets of multiple samples. In each of these cases the independent species framework leads to a straightforward fitting procedure, illustrating its simplicity and versatility. We provide an R package called SADISA (Species Abundance Distributions under the Independent Species Assumption) to evaluate the new sampling formulas.

From the zero‐sum to the independent species assumption

The large majority of neutral community models is based on the zero‐sum assumption. This assumption states that the number of individuals in the community is constant over time, implying that species abundance fluctuations are correlated: a decrease in one species has to be instantaneously compensated by an increase in another species. Here we explore the consequences of replacing the zero‐sum by the independent species assumption, stating that species abundances fluctuate independently.

We illustrate the two assumptions using a simple community model. We consider a pool of species, whose relative abundances are assumed to be known and invariant over time (note that this assumption is limited to this example model; in the rest of the paper the species pool is governed by the probability distribution dictated by the metacommunity model). The dynamics of the local community coupled to this species pool consist of two processes: local mortality and immigration from the species pool (that is, we discard local reproduction; in the framework of Hubbell's model, this corresponds to setting m = 1 or I→∞; again, this assumption is limited to this example model). This holds for both the zero‐sum and the independent species variant of the model. The difference between the model variants resides in the way death and immigration events alternate. In the zero‐sum version, each death event is immediately followed by an immigration event. As a result, the sum of all species abundance changes is zero (hence the term ‘zero sum’) and local community size remains constant over time. In the independent species version, each event, whether it is a death or an immigration, is uncoupled from other events. Hence, it is possible that several immigrations occur without any death in between them, or vice versa, so that the local community size would increase or decrease. In stationary state, however, the number of immigrations and deaths occurring over a longer period of time balance each other, so that the community size fluctuates around an average value. Moreover, because these stationary fluctuations are induced by independent events, the variability of community size is typically small. This strongly suggests that the predictions of the independent species model are often close to those of the zero‐sum model. This is indeed what we find, as shown below.
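The claim that community size varies little can be made concrete for this example model. The following is a minimal sketch, assuming immigration occurs at a constant total rate and each individual dies at a constant per‐capita rate (the names `immigration_rate` and `death_rate` are hypothetical and not taken from the paper). A standard result for such immigration‐death processes is that the stationary community size is Poisson distributed, so its coefficient of variation equals one over the square root of the mean size. The sketch checks the stationary balance condition numerically.

```python
import math

def poisson_pmf(n, mean):
    """Poisson probability of n, computed in log space for stability."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))

def balance_residual(n, immigration_rate, death_rate):
    """Mismatch between probability flow n -> n+1 (immigration)
    and n+1 -> n (death) under a Poisson stationary distribution."""
    mean = immigration_rate / death_rate
    up = immigration_rate * poisson_pmf(n, mean)
    down = death_rate * (n + 1) * poisson_pmf(n + 1, mean)
    return abs(up - down)

def size_cv(mean_size):
    """Coefficient of variation of a Poisson-distributed community size."""
    return 1.0 / math.sqrt(mean_size)

# Hypothetical rates: 50 immigrants per unit time, unit per-capita death rate.
worst = max(balance_residual(n, 50.0, 1.0) for n in range(200))
```

For a community of about ten thousand individuals, `size_cv(10000)` is 0.01, so the community size under the independent species variant stays within a few percent of its mean.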

In this paper we exploit the near equivalence of the two assumptions to simplify the evaluation of their model predictions. Here we provide a first intuition of how this simplification works, while we refer to the next section for more details. We consider the case in which the species pool abundances are not known (if they are known, the evaluation of the predictions is straightforward under both the zero‐sum and the independent species assumption). In this case, a community model at the regional scale (i.e. a metacommunity model) predicts the distribution of species pool abundances. We obtain the predictions for the local community abundances by averaging the local community composition for a given species pool over the distribution of species pool abundances. Under the zero‐sum assumption, the species pool abundances are linked, and the computation of the average requires the evaluation of an S‐dimensional integral, with S the number of species in the species pool. This is usually an extremely difficult numerical problem. In contrast, under the independent species assumption, species independence allows us to consider the S species one by one. As a result, the local community predictions decompose into S single‐species averages, each of which requires the evaluation of a one‐dimensional integral. This is an easy task, because the numerical integration of one‐dimensional functions is not costly, even if there are many of them. Hence, by replacing the zero‐sum by the independent species assumption, the evaluation of the model predictions simplifies drastically.

General sampling formula under the independent‐species assumption

As for the zero‐sum case, sampling formulas are the central ingredient of the inference procedure in the independent species case. These formulas give the probability of observing a specific set of abundance data under a community model for a specific set of parameters. Here we show that under the independent species assumption general sampling formulas can be derived, in contrast to the zero‐sum assumption. Concrete examples for which independent species but not zero‐sum formulas can be calculated are presented afterwards.

Single‐sample sampling formula

We first analyse the case in which a single sample taken from the community is available. We assume that the abundances of the species observed in the sample are quantified (in contrast to, e.g. presence‐absence data). We represent the data as species abundance frequencies $s_k$, i.e. the number of species that are observed k times in the sample. For example, if there are nine observed species in the sample with abundances (species are ordered from most to least abundant),

Species # 1 2 3 4 5 6 7 8 9
Abundance in sample 11 5 5 4 2 1 1 1 1

then the corresponding abundance frequencies are $s_{11} = 1$, $s_5 = 2$, $s_4 = 1$, $s_2 = 1$, $s_1 = 4$, and all other $s_k = 0$.
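The conversion from a list of species abundances to abundance frequencies can be sketched as follows. The function name `abundance_frequencies` is hypothetical; it is an illustration and not part of the SADISA package.

```python
from collections import Counter

def abundance_frequencies(abundances):
    """Map each observed abundance k to s_k, the number of species observed k times.

    Species with abundance 0 are not observed and are therefore skipped."""
    return dict(Counter(a for a in abundances if a > 0))

# The nine-species example from the text.
sample = [11, 5, 5, 4, 2, 1, 1, 1, 1]
s = abundance_frequencies(sample)

n_species = sum(s.values())                        # number of observed species
n_individuals = sum(k * s_k for k, s_k in s.items())  # sample size
```

Only abundances with a non‐zero frequency are stored, which is all the sampling formula needs.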

Many independent species models have abundance frequencies that are approximately Poisson distributed. In Appendix S1, Supporting Information, we show that if the number of species in the metacommunity is Poisson distributed, the Poisson distribution is exact. Moreover, we argue that even if this condition is not met, the Poisson approximation is often very accurate. In those cases, which include all the independent species models considered in this paper, the independent species sampling formula is, either exactly or to a very good approximation, a product of Poisson samples,
$$P(\mathcal{D}) = \prod_{k=1}^{\infty} \frac{\lambda_k^{s_k}\, e^{-\lambda_k}}{s_k!} \qquad \text{(eqn 1)}$$
where $\mathcal{D}$ stands for the data, i.e. the observed abundance frequencies. The numbers $\lambda_k$ denote the predicted abundance frequencies, given by,
$$\lambda_k = \int_0^{\infty} P(k \mid x)\, \rho(x)\, \mathrm{d}x \qquad \text{(eqn 2)}$$
The term $P(k \mid x)$ in the integrand of eqn 2 stands for the probability that a species with relative abundance x in the metacommunity is observed k times in the sample taken from the local community. For example, for neutral dispersal‐limited sampling, it is given by a negative binomial distribution,
$$P(k \mid x) = \frac{\Gamma(Ix + k)}{k!\, \Gamma(Ix)}\, q^{k}\, (1 - q)^{Ix} \qquad \text{(eqn 3)}$$
with I the dispersal number and q a parameter that can be interpreted as sampling effort (see Appendix S2). The term ρ(x) in the integrand of eqn 2 denotes the metacommunity abundance density, that is, ρ(x)dx gives the number of species with relative abundance in the interval [x,x + dx] in the metacommunity. For example, for a neutral model with point‐mutation speciation, we have
$$\rho(x) = \frac{\theta}{x}\, e^{-\theta x} \qquad \text{(eqn 4)}$$
where θ is the metacommunity diversity (see Appendix S3). Note the similarity in model structure between local community and metacommunity: while the sum $\sum_{k=k_1}^{k_2} \lambda_k$ equals the expected number of species with abundance k between $k_1$ and $k_2$ in the local community, the integral $\int_{x_1}^{x_2} \rho(x)\, \mathrm{d}x$ equals the expected number of species with abundance x between $x_1$ and $x_2$ in the metacommunity. Also, the interpretation of variable x as relative abundance requires some care (see Appendix S3). The sum of x over all metacommunity species is equal to one only on average, although its fluctuations are often limited. Alternatively, variable x can be interpreted as an immigration propensity (see Appendix S3).
The evaluation of sampling formula 1 boils down to the computation of several integrals of the form of eqn 2. It suffices to compute the integrals $\lambda_k$ for abundances k that are observed in the sample, i.e. for which $s_k > 0$. This can be seen by rewriting eqn 1 as
$$P(\mathcal{D}) = e^{-\Lambda} \prod_{k:\, s_k > 0} \frac{\lambda_k^{s_k}}{s_k!} \qquad \text{(eqn 5)}$$
with Λ the expected number of observed species,
$$\Lambda = \sum_{k=1}^{\infty} \lambda_k = \int_0^{\infty} P(\mathrm{obs} \mid x)\, \rho(x)\, \mathrm{d}x \qquad \text{(eqn 6)}$$
where $P(\mathrm{obs} \mid x)$ is the probability that a species with relative abundance x in the metacommunity is present in the data, $P(\mathrm{obs} \mid x) = 1 - P(0 \mid x)$.

By substituting eqns 3 and 4 into eqns 1 and 2, we obtain a concrete sampling formula with model parameters θ, I and q. This formula can be directly used for likelihood maximization, and connects model predictions and empirical data. Regarding its application, the independent species sampling formula is very similar to the zero‐sum sampling formula.
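The substitution described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm used in SADISA: it assumes the negative binomial form P(k|x) = Γ(Ix+k)/(k! Γ(Ix)) q^k (1−q)^(Ix) for eqn 3 and the density ρ(x) = θ e^(−θx)/x for eqn 4, and it replaces the dedicated integration routine by a plain Simpson rule in the variable u = log x. For these two assumed forms the integral for Λ has the closed form Λ = θ ln(1 − (I/θ) ln(1 − q)), which the sketch uses. All function names and parameter values are hypothetical.

```python
import math

def log_nb(k, x, I, q):
    """Log of the assumed eqn 3: negative binomial with size I*x and effort q."""
    a = I * x
    return (math.lgamma(a + k) - math.lgamma(k + 1) - math.lgamma(a)
            + k * math.log(q) + a * math.log1p(-q))

def predicted_frequency(k, theta, I, q, n=2000, u_min=-30.0):
    """lambda_k of eqn 2, by Simpson's rule in u = log(x) (n must be even)."""
    u_max = math.log(60.0 / theta)   # exp(-theta * x) is below 1e-26 beyond this point
    h = (u_max - u_min) / n
    total = 0.0
    for i in range(n + 1):
        x = math.exp(u_min + i * h)
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        # rho(x) dx = theta * exp(-theta * x) du for the assumed eqn 4
        total += weight * math.exp(log_nb(k, x, I, q)) * theta * math.exp(-theta * x)
    return total * h / 3.0

def expected_observed_species(theta, I, q):
    """Lambda of eqn 6, closed form valid only for the assumed eqns 3 and 4."""
    return theta * math.log(1.0 - (I / theta) * math.log1p(-q))

def log_likelihood(s, theta, I, q):
    """Log of eqn 5; s maps each observed abundance k to its frequency s_k > 0."""
    ll = -expected_observed_species(theta, I, q)
    for k, s_k in s.items():
        ll += s_k * math.log(predicted_frequency(k, theta, I, q)) - math.lgamma(s_k + 1)
    return ll

# Abundance frequencies of the nine-species example; parameter values are arbitrary.
s = {11: 1, 5: 2, 4: 1, 2: 1, 1: 4}
ll = log_likelihood(s, theta=5.0, I=10.0, q=0.5)
```

Only the five observed abundances require an integral, as noted for eqn 5. A maximum‐likelihood fit would pass `log_likelihood` to an optimizer over θ and I; the fixed upper integration limit is a crude choice that suits small examples only.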

In comparison with the zero‐sum case, the independent species sampling formula depends on an additional parameter, the sampling effort q. It is a number between 0 and 1; the larger this number, the larger the expected sample size (see Appendix S2). It can be estimated from the data, as the other model parameters. Alternatively, it can be determined a priori, based on the sample size J. The latter approach leads to a close correspondence with the zero‐sum estimation procedure, in which the sample size J is also set beforehand. The parameter q can be tuned such that the expected sample size in the independent species approach matches the real sample size, which is also the fixed sample size used in the zero‐sum approach. By applying this tuning, we obtain parameter estimates with the independent species approach that are almost identical to those obtained with the zero‐sum approach, as we will show in the next section.
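The tuning of q can be written out explicitly under two assumptions that are not stated in this paragraph: that local sampling follows a negative binomial distribution with size Ix and effort q, whose mean is Ix q/(1 − q), and that the relative abundances x sum to one on average. The expected sample size is then I q/(1 − q), and matching it to the observed sample size J gives q = J/(I + J). The sketch below is an illustration of this reasoning; Appendix S2 remains the authoritative source, and the sample size used here is a hypothetical value.

```python
def expected_sample_size(I, q):
    """Expected number of sampled individuals, assuming a negative binomial
    with mean I*x*q/(1-q) per species and relative abundances summing to one."""
    return I * q / (1.0 - q)

def sampling_effort(I, J):
    """Sampling effort q for which the expected sample size equals J."""
    return J / (I + J)

# I taken from Table 1 (BCI, ZSC); J = 20000 is a hypothetical sample size.
q = sampling_effort(2211.0, 20000)
J_expected = expected_sample_size(2211.0, q)
```

With this choice q is no longer a free parameter, so the independent species fit has the same number of fitted parameters as the zero‐sum fit.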

For the case of dispersal‐limited sampling, given by eqn 3, the same sampling formula applies for the entire local community or for a sample taken from the local community. This is due to a property called sampling invariance (see Appendix S2). It suffices to set the parameter q in accordance with the size of the dataset, whether it is an exhaustive census or a non‐exhaustive sample. In particular, the sampling formula does not depend on the size of the local community from which the sample was taken. However, sampling invariance, and the associated flexibility in dealing with either census or sample data, does not hold generally, as we will illustrate in the next section.
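Sampling invariance can be checked numerically for the negative binomial case. The sketch below assumes that taking a sample means keeping each individual independently with probability p (binomial thinning); this sampling scheme is an assumption made for illustration, as the precise statement is in Appendix S2. Under this assumption, thinning a negative binomial with size r and effort q gives again a negative binomial with the same size and effort q' = pq/(1 − q + pq). The function names are hypothetical.

```python
import math

def nb_pmf(k, r, q):
    """Negative binomial probability with size r and effort q."""
    return math.exp(math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
                    + k * math.log(q) + r * math.log1p(-q))

def binom_pmf(k, n, p):
    """Probability that k out of n individuals end up in the sample."""
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))

def thinned_nb_pmf(k, r, q, p, n_max=600):
    """Distribution of the sampled abundance, summing over the local abundance n."""
    return sum(nb_pmf(n, r, q) * binom_pmf(k, n, p) for n in range(k, n_max + 1))

def thinned_effort(q, p):
    """Effort parameter of the negative binomial obtained after thinning."""
    return p * q / (1.0 - q + p * q)

# Arbitrary illustration values: size r = I*x = 1.7, effort 0.6, sampling fraction 0.3.
max_diff = max(abs(thinned_nb_pmf(k, 1.7, 0.6, 0.3)
                   - nb_pmf(k, 1.7, thinned_effort(0.6, 0.3)))
               for k in range(8))
```

The two distributions agree to rounding error, which is why a census and a sample can share one formula that differs only in the value of q.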

Multiple‐samples sampling formula

We now extend the sampling formula to L local communities connected to a single metacommunity. There is no direct migration between local communities; they are interdependent due to the immigration from the common metacommunity. We assume that we have a sample with abundance data taken from each of the local communities. As for the single‐sample case, we express the data in terms of abundance frequencies. In particular, for each of the species observed in at least one of the L samples, we introduce the abundance vector $\mathbf{k} = (k_1, k_2, \ldots, k_L)$ containing its abundance in each sample. Abundance frequency $s_{\mathbf{k}}$ is equal to the number of species with abundance vector $\mathbf{k}$. For example, consider L = 2 local communities and suppose there are 8 observed species in total. If their abundances are given by,

Species # 1 2 3 4 5 6 7 8
Abundance in sample of 1st community 7 4 2 2 1 1 0 0
Abundance in sample of 2nd community 9 3 1 1 1 0 1 1

then the corresponding abundance frequencies are s(7,9) = 1, s(4,3) = 1, s(2,1) = 2, s(1,1) = 1, s(1,0) = 1, s(0,1) = 2, and all other $s_{\mathbf{k}} = 0$.
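The same bookkeeping for several samples can be sketched as follows; the function name `abundance_vector_frequencies` is hypothetical.

```python
from collections import Counter

def abundance_vector_frequencies(*samples):
    """Map each abundance vector (k_1, ..., k_L) to the number of species having it.

    Each argument lists the abundances of the same species, in the same order,
    in one sample. Species absent from all samples are skipped."""
    vectors = zip(*samples)
    return dict(Counter(v for v in vectors if any(v)))

# The two-sample, eight-species example from the text.
first = [7, 4, 2, 2, 1, 1, 0, 0]
second = [9, 3, 1, 1, 1, 0, 1, 1]
s = abundance_vector_frequencies(first, second)
```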

For independent species models the abundance frequencies are Poisson distributed, approximately if not exactly (see Appendix S1). The independent species sampling formula is
$$P(\mathcal{D}) = e^{-\Lambda} \prod_{\mathbf{k}:\, s_{\mathbf{k}} > 0} \frac{\lambda_{\mathbf{k}}^{s_{\mathbf{k}}}}{s_{\mathbf{k}}!} \qquad \text{(eqn 7)}$$
where $\lambda_{\mathbf{k}}$ is given by
$$\lambda_{\mathbf{k}} = \int_0^{\infty} \prod_{\ell=1}^{L} P_\ell(k_\ell \mid x)\; \rho(x)\, \mathrm{d}x \qquad \text{(eqn 8)}$$
and Λ is given by
$$\Lambda = \int_0^{\infty} P(\mathrm{obs} \mid x)\, \rho(x)\, \mathrm{d}x \qquad \text{(eqn 9)}$$
In these eqns $P_\ell(k_\ell \mid x)$ is the probability of observing a species with relative abundance x in the metacommunity $k_\ell$ times in the sample taken from local community ℓ, and $P(\mathrm{obs} \mid x)$ is the probability of observing a species with relative abundance x in the metacommunity in at least one of the samples, i.e. $P(\mathrm{obs} \mid x) = 1 - \prod_{\ell=1}^{L} P_\ell(0 \mid x)$. For example, under neutral dispersal‐limited sampling with dispersal number $I_\ell$ and sampling effort $q_\ell$ in the local community ℓ, we have
$$P_\ell(k_\ell \mid x) = \frac{\Gamma(I_\ell x + k_\ell)}{k_\ell!\, \Gamma(I_\ell x)}\, q_\ell^{k_\ell}\, (1 - q_\ell)^{I_\ell x} \qquad \text{(eqn 10)}$$
Combining this expression with a choice for the metacommunity abundance density ρ(x), we obtain a complete multiple‐samples sampling formula.
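A complete multiple‐samples formula can be sketched for one particular choice. The sketch assumes negative binomial sampling in each local community, P(k|x) = Γ(Ix+k)/(k! Γ(Ix)) q^k (1−q)^(Ix) with community‐specific I and q, and the point‐mutation density ρ(x) = θ e^(−θx)/x. A plain Simpson rule in u = log x stands in for a dedicated integration routine. Function names and parameter values are hypothetical.

```python
import math

def log_nb(k, x, I, q):
    """Log negative binomial probability of k individuals (assumed form of eqn 10)."""
    a = I * x
    return (math.lgamma(a + k) - math.lgamma(k + 1) - math.lgamma(a)
            + k * math.log(q) + a * math.log1p(-q))

def integrate_against_rho(f, theta, n=2000, u_min=-30.0):
    """Integral of f(x) * rho(x) dx for rho(x) = theta * exp(-theta x) / x,
    by Simpson's rule in u = log(x), where rho(x) dx = theta * exp(-theta x) du."""
    u_max = math.log(60.0 / theta)
    h = (u_max - u_min) / n
    total = 0.0
    for i in range(n + 1):
        x = math.exp(u_min + i * h)
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * f(x) * theta * math.exp(-theta * x)
    return total * h / 3.0

def lam_vec(kvec, theta, Is, qs):
    """lambda_k of eqn 8 for an abundance vector kvec."""
    def joint(x):
        return math.exp(sum(log_nb(k, x, I, q) for k, I, q in zip(kvec, Is, qs)))
    return integrate_against_rho(joint, theta)

def big_lambda(theta, Is, qs):
    """Lambda of eqn 9, with P(obs|x) = 1 - prod_l (1 - q_l)^(I_l x)."""
    a = -sum(I * math.log1p(-q) for I, q in zip(Is, qs))
    return integrate_against_rho(lambda x: -math.expm1(-a * x), theta)

def log_likelihood(s, theta, Is, qs):
    """Log of eqn 7; s maps abundance vectors to their frequencies."""
    ll = -big_lambda(theta, Is, qs)
    for kvec, s_k in s.items():
        ll += s_k * math.log(lam_vec(kvec, theta, Is, qs)) - math.lgamma(s_k + 1)
    return ll

# Two-sample example from the text; parameter values are arbitrary.
s = {(7, 9): 1, (4, 3): 1, (2, 1): 2, (1, 1): 1, (1, 0): 1, (0, 1): 2}
ll = log_likelihood(s, theta=3.0, Is=(5.0, 8.0), qs=(0.7, 0.6))
```

Each observed abundance vector still costs only one one‐dimensional integral, regardless of the number of samples L.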

Multiple‐guilds sampling formula

Another extension of the sampling formula consists in allowing for guild structure within the community (or communities) under study. We denote the number of guilds by G, and we assume that they do not interact at the metacommunity level. The local community is composed of species that immigrated from the guild metacommunities, and the sample data are taken from the local community, possibly containing species of different guilds. We specify the data using abundance frequencies $s_k^{(g)}$, which are the number of species with abundance k in guild g. For example, if there are G = 2 guilds with species abundances,

then urn:x-wiley:2041210X:media:mee312807:mee312807-math-0034, urn:x-wiley:2041210X:media:mee312807:mee312807-math-0035, urn:x-wiley:2041210X:media:mee312807:mee312807-math-0036, urn:x-wiley:2041210X:media:mee312807:mee312807-math-0037, urn:x-wiley:2041210X:media:mee312807:mee312807-math-0038, and all other urn:x-wiley:2041210X:media:mee312807:mee312807-math-0039.

The independent species sampling formula is, either exactly or approximately (see Appendix S1),
$$P(\mathcal{D}) = \prod_{g=1}^{G} \left[ e^{-\Lambda^{(g)}} \prod_{k:\, s_k^{(g)} > 0} \frac{\bigl(\lambda_k^{(g)}\bigr)^{s_k^{(g)}}}{s_k^{(g)}!} \right] \qquad \text{(eqn 11)}$$
where $\lambda_k^{(g)}$ and $\Lambda^{(g)}$ are given by eqns 2 and 6. Local sampling probabilities $P^{(g)}(k \mid x)$ and metacommunity abundance densities $\rho^{(g)}(x)$ can be guild‐dependent. Despite this complexity, sampling formula 11 expresses independence between species belonging to the same and to different guilds.
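Because the guilds contribute independent factors, the log‐likelihood is a sum of single‐guild terms. A minimal sketch, assuming the predicted frequencies and the expected number of observed species of each guild have already been computed; the function names and the numbers below are hypothetical placeholders, not fitted values.

```python
import math

def guild_log_likelihood(s, lam, big_lambda):
    """Log of one guild's factor: -Lambda + sum over observed k of
    s_k * log(lambda_k) - log(s_k!)."""
    return -big_lambda + sum(
        s_k * math.log(lam[k]) - math.lgamma(s_k + 1) for k, s_k in s.items())

def multi_guild_log_likelihood(guilds):
    """Sum of the guild terms; each guild is a tuple (s, lam, big_lambda)."""
    return sum(guild_log_likelihood(s, lam, big_lambda)
               for s, lam, big_lambda in guilds)

# Placeholder inputs for two guilds.
guild_a = ({1: 2, 3: 1}, {1: 1.5, 3: 0.4}, 3.0)
guild_b = ({2: 1}, {2: 0.8}, 2.0)
total = multi_guild_log_likelihood([guild_a, guild_b])
```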

Comparison to models with zero‐sum sampling formula

We compare the parameter estimates and likelihoods obtained with the independent species approach and the zero‐sum approach, in those cases where a zero‐sum sampling formula is available and computable.

Single samples

The most studied neutral community model, also known as Hubbell's model, combines point‐mutation speciation and dispersal‐limited sampling (Hubbell 2001). To evaluate the zero‐sum sampling formula, we follow the approach of Etienne (2005). This involves an arbitrary‐precision computation with Stirling numbers, using the computer algebra system PARI/GP. The evaluation of the independent species sampling formula, given by eqns 1‐4, requires the computation of several one‐dimensional integrals. Because the integrands are often sharply peaked, we use a dedicated numerical integration algorithm, which is included in the R package SADISA.

We apply both sampling formulas to six datasets of tropical tree communities (Volkov et al. 2005; Etienne & Haegeman 2011). The parameter estimates obtained with the zero‐sum and the independent species approach are very similar (Table 1, rows ZSC and ISA). Importantly, the likelihood values should not be compared, because they are not likelihoods for exactly the same data. The zero‐sum approach assumes that the total number of individuals is given by the observed value, while the independent species approach treats this as additional data the probability of which is incorporated in the total likelihood. This explains why the zero‐sum likelihood is systematically higher than the independent species likelihood (the log‐likelihood is less negative, see Table 1). However, after conditioning the independent species likelihood on sample size (see Appendix S4), the zero‐sum and independent species likelihood values almost coincide (Table 1, rows ZSC and ISAC). Note that the parameter estimates are even closer than in the case without conditioning (except for the Sinharaja dataset).

Table 1. Fits for neutral model with point‐mutation speciation and dispersal‐limited sampling. We analysed six datasets of tropical tree communities (Volkov et al. 2005; Etienne et al. 2007b; Etienne & Haegeman 2011), and we computed the maximum‐likelihood fits for three model variants. The first variant, ZSC, imposes the zero‐sum constraint, so that community size is invariant over time (results taken from Etienne et al. 2007b). The second variant, ISA, assumes independence between species. The third variant, ISAC, is also based on species independence, but the abundance distribution is conditioned on sample size. Note that likelihoods of model variants ZSC and ISAC are comparable (but the likelihood of ISA is not comparable with those of ZSC and ISAC).

| Dataset | Model | θ | I | m | LL |
|---|---|---|---|---|---|
| BCI | ZSC | 47·67 | 2211 | 0·0934 | −308·73 |
| | ISA | 47·94 | 2175 | 0·0920 | −317·70 |
| | ISAC | 47·67 | 2213 | 0·0935 | −308·73 |
| Korup | ZSC | 52·73 | 29 700 | 0·5470 | −317·04 |
| | ISA | 52·88 | 29 290 | 0·5436 | −326·09 |
| | ISAC | 52·73 | 29 700 | 0·5471 | −317·04 |
| Pasoh | ZSC | 190·9 | 2708 | 0·0926 | −359·38 |
| | ISA | 191·4 | 2689 | 0·0919 | −367·90 |
| | ISAC | 190·9 | 2712 | 0·0927 | −359·38 |
| Sinharaja | ZSC | 436·8 | 32·38 | 0·0019 | −252·93 |
| | ISA | 439·8 | 32·45 | 0·0019 | −262·00 |
| | ISAC | 461·5 | 31·96 | 0·0019 | −253·05 |
| Yasuni | ZSC | 204·2 | 13 170 | 0·4288 | −297·15 |
| | ISA | 204·4 | 13 110 | 0·4277 | −305·20 |
| | ISAC | 204·2 | 13 180 | 0·4289 | −297·15 |
| Lambir | ZSC | 285·6 | 4296 | 0·1146 | −386·38 |
| | ISA | 286·0 | 4280 | 0·1143 | −394·93 |
| | ISAC | 285·5 | 4299 | 0·1147 | −386·39 |
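The relation between the three likelihood columns of Table 1 can be checked with simple arithmetic. The sketch below transcribes the LL values from the table (decimal points replace the table's raised dots) and computes, per dataset, the gap between the zero‐sum likelihood and each independent species variant.

```python
# Maximum log-likelihoods transcribed from Table 1.
LL = {
    "BCI":       {"ZSC": -308.73, "ISA": -317.70, "ISAC": -308.73},
    "Korup":     {"ZSC": -317.04, "ISA": -326.09, "ISAC": -317.04},
    "Pasoh":     {"ZSC": -359.38, "ISA": -367.90, "ISAC": -359.38},
    "Sinharaja": {"ZSC": -252.93, "ISA": -262.00, "ISAC": -253.05},
    "Yasuni":    {"ZSC": -297.15, "ISA": -305.20, "ISAC": -297.15},
    "Lambir":    {"ZSC": -386.38, "ISA": -394.93, "ISAC": -386.39},
}

# Gap between zero-sum and unconditioned independent species log-likelihood.
gap_isa = {d: round(v["ZSC"] - v["ISA"], 2) for d, v in LL.items()}

# Gap after conditioning the independent species likelihood on sample size.
gap_isac = {d: round(v["ZSC"] - v["ISAC"], 2) for d, v in LL.items()}
```

The unconditioned gap lies between roughly 8 and 9 log units for every dataset, consistent with the explanation that the ISA likelihood also accounts for the probability of the observed sample size, whereas the conditioned gap is at most 0.12 (Sinharaja).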

The likelihood landscapes for the zero‐sum and the independent species approach are almost identical (Fig. 1). The ridge of high likelihood, present in both cases, is related to a well‐known problem of Hubbell's neutral model, namely, the difficulty of distinguishing abundance distributions resulting from high regional diversity and low dispersal from those resulting from low regional diversity and high dispersal (Etienne et al. 2006). Clearly, the independent species approach has the same problem. Note that the colour code in the two panels is not exactly the same; the colour codes for the log‐likelihood function differ by an additive constant. However, this constant difference has no effect on the maximum‐likelihood estimates. Figure 2 shows that also the fitted SADs are almost identical. Hence, at least for the community model and the datasets considered here, the zero‐sum approach and the independent species approach give practically equivalent results.

Figure 1. Likelihood landscape for zero‐sum and independent species approach. We consider the point‐mutation speciation model with dispersal‐limited sampling. We computed the zero‐sum and independent‐species likelihood as a function of metacommunity diversity θ (x‐axis) and dispersal number I (y‐axis) for the BCI dataset. Warmer colours correspond to higher likelihood values. The white ×‐mark indicates the maximum‐likelihood parameters. The two likelihood functions are almost identical, up to a constant factor (the colour code is relative to the maximum log‐likelihood value; for example, dark blue corresponds to log‐likelihood values at least 40 units below the maximum).

Figure 2. Species abundance distributions for neutral model with point‐mutation speciation and dispersal‐limited sampling. For the six tropical forest plots (data represented by grey bars) we plot the fitted distributions with the zero‐sum approach (thick green line) and the independent species approach (thin red line). The two fitted distributions are almost identical.

For two other speciation models, the zero‐sum sampling formula for a single sample and single guild has been derived, assuming neutral dispersal‐limited sampling. For random‐fission speciation, the metacommunity abundance density ρ(x) is given by (see Appendix S3; compare with eqn 4),
$$\rho(x) = \phi^{2}\, e^{-\phi x} \qquad \text{(eqn 12)}$$
Like θ for point mutation, the parameter ϕ characterizes the metacommunity diversity (in particular, it gives the expected number of species in the metacommunity). Also a model with per‐species speciation has a zero‐sum sampling formula (Etienne et al. 2007b). In the independent species setting, the metacommunity abundance density ρ(x) is given by
$$\rho(x) = \frac{\theta^{1-\alpha}}{\Gamma(1-\alpha)}\, x^{-1-\alpha}\, e^{-\theta x} \qquad \text{(eqn 13)}$$

Parameter θ is related to the per‐individual speciation rate, while parameter α measures the importance of per‐species speciation (with 0 ≤ α < 1). The metacommunity diversity increases both with increasing θ and increasing α. Note that we recover the point‐mutation model for α = 0 and the random‐fission model for α = −1 (formally, because α = −1 is outside the range 0 ≤ α < 1 of values allowed by the per‐species speciation model). While we do not have a direct independent species derivation of eqn 13, we show in Appendix S5 that this equation is the independent species equivalent of the zero‐sum solution.

Similarly to the case of point mutation, we find that the zero‐sum and independent species estimates are very close, both for the random‐fission speciation model (Table 2) and for the per‐species speciation model (Table 3). The absolute log‐likelihood values should not be compared (because they are not likelihoods for exactly the same data, see above), but the log‐likelihood values relative to the point‐mutation values are comparable. The log‐likelihood differences ΔLL are very similar in all cases, showing that the zero‐sum approach and the independent species approach lead to the same inferences.

Table 2. Fits for neutral model with random‐fission speciation and dispersal‐limited sampling. Same datasets as in Table 1. We consider two model variants: variant ZSC imposes the zero‐sum constraint (results taken from Etienne & Haegeman 2011); variant ISA assumes independence between species. ZSC and ISA likelihoods are not comparable. In column ΔLL we compare the maximum log‐likelihoods of the random‐fission model with those of the point‐mutation model, for the ZSC and the ISA variant.

| Dataset | Model | ϕ | I | m | LL | ΔLL |
|---|---|---|---|---|---|---|
| BCI | ZSC | 595·1 | 61·61 | 0·0029 | −311·92 | −3·20 |
| | ISA | 595·2 | 61·81 | 0·0029 | −321·11 | −3·41 |
| Korup | ZSC | | 49·52 | 0·0020 | −318·67 | −1·63 |
| | ISA | | 49·61 | 0·0020 | −327·75 | −1·66 |
| Pasoh | ZSC | 1528 | 263·4 | 0·0098 | −363·75 | −4·37 |
| | ISA | 1527 | 264·0 | 0·0098 | −372·49 | −4·58 |
| Sinharaja | ZSC | 927·6 | 32·42 | 0·0019 | −252·88 | +0·05 |
| | ISA | 950·1 | 32·35 | 0·0019 | −261·97 | +0·03 |
| Yasuni | ZSC | 10 980 | 197·0 | 0·0111 | −306·75 | −9·60 |
| | ISA | 11 130 | 196·9 | 0·0111 | −314·88 | −9·68 |
| Lambir | ZSC | 2500 | 372·5 | 0·0111 | −402·32 | −15·94 |
| | ISA | 2500 | 372·9 | 0·0111 | −411·08 | −16·15 |

Table 3. Fits for per‐species speciation model, or equivalently, metacommunity model with density dependence. Same datasets as in Table 1. Model variants are combinations of nDL, no dispersal limitation; DL, dispersal limitation; ZSC, zero‐sum constraint; ISA, species independence approach. Results for model (nDL, ZSC) are taken from Etienne et al. (2007b), but results for model (DL, ZSC) have not been reported before. The maximum likelihood of the per‐species speciation model is always larger than the corresponding point‐mutation likelihood (column ΔLL), because point‐mutation speciation is a special case of per‐species speciation (case α = 0).

| Dataset | Model | θ | α | I | m | LL | ΔLL |
|---|---|---|---|---|---|---|---|
| BCI | nDL ZSC | 34·97 | 0 | | 1 | −318·85 | 0 |
| | nDL ISA | 35·06 | 0 | | 1 | −327·97 | 0 |
| | DL ZSC | 38·32 | 0·1203 | 1049 | 0·0466 | −308·19 | 0·54 |
| | DL ISA | 37·33 | 0·1354 | 960·2 | 0·0428 | −317·01 | 0·69 |
| Korup | nDL ZSC | 44·54 | 0·0289 | | 1 | −318·31 | 0·36 |
| | nDL ISA | 44·19 | 0·0303 | | 1 | −327·35 | 0·40 |
| | DL ZSC | 13·87 | 0·4326 | 1046 | 0·0408 | −306·82 | 10·22 |
| | DL ISA | 12·99 | 0·4420 | 996·8 | 0·0390 | −315·38 | 10·71 |
| Pasoh | nDL ZSC | 126·4 | 0 | | 1 | −392·51 | 0 |
| | nDL ISA | 126·7 | 0 | | 1 | −401·20 | 0 |
| | DL ZSC | 184·2 | 0·0361 | 2192 | 0·0763 | −359·31 | 0·07 |
| | DL ISA | 183·0 | 0·0447 | 2081 | 0·0727 | −367·80 | 0·11 |
| Sinharaja | nDL ZSC | 25·63 | 0 | | 1 | −253·78 | 0 |
| | nDL ISA | 25·73 | 0 | | 1 | −262·82 | 0 |
| | DL ZSC | 12·72 | 0·5123 | 145·3 | 0·0085 | −252·13 | 1·19 |
| | DL ISA | 11·77 | 0·5270 | 138·8 | 0·0081 | −260·59 | 1·42 |
| Yasuni | nDL ZSC | 178·3 | 0 | | 1 | −307·58 | 0 |
| | nDL ISA | 178·6 | 0 | | 1 | −315·68 | 0 |
| | DL ZSC | 61·86 | 0·5272 | 1117 | 0·0598 | −278·88 | 18·27 |
| | DL ISA | 60·39 | 0·5324 | 1098 | 0·0589 | −286·54 | 18·66 |
| Lambir | nDL ZSC | 195·0 | 0 | | 1 | −437·89 | 0 |
| | nDL ISA | 195·3 | 0 | | 1 | −446·57 | 0 |
| | DL ZSC | 245·5 | 0·1161 | 2546 | 0·0713 | −385·20 | 1·18 |
| | DL ISA | 244·3 | 0·1202 | 2503 | 0·0702 | −393·65 | 1·28 |

The independent species sampling formula 1 is only approximately valid for these two speciation models (see Appendix S1). Nevertheless, the agreement with the zero‐sum results is as strong as for the case of point‐mutation speciation, for which the independent species sampling formula 1 is exact. This indicates, in addition to the general argument of Appendix S1, that the Poisson approximation is very accurate.
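The closeness of the two approaches can be checked directly from the ΔLL column of Table 2. The following minimal sketch (variable and function names are illustrative, not part of SADISA) transcribes the ZSC and ISA values and reports how far apart they are per dataset.

```python
# Delta-LL values (random fission minus point mutation) transcribed from Table 2,
# stored as (ZSC, ISA) pairs. Names used here are illustrative only.
delta_ll = {
    "BCI": (-3.20, -3.41),
    "Korup": (-1.63, -1.66),
    "Pasoh": (-4.37, -4.58),
    "Sinharaja": (0.05, 0.03),
    "Yasuni": (-9.60, -9.68),
    "Lambir": (-15.94, -16.15),
}


def zsc_isa_gap(values):
    """Return the absolute difference between ZSC and ISA Delta-LL per dataset."""
    return {name: abs(zsc - isa) for name, (zsc, isa) in values.items()}


gaps = zsc_isa_gap(delta_ll)
largest_gap = max(gaps.values())
# Both variants favour the same speciation model if the two values share a sign.
same_sign = all((zsc > 0) == (isa > 0) for zsc, isa in delta_ll.values())

for name, gap in gaps.items():
    print(f"{name:10s} |dLL_ZSC - dLL_ISA| = {gap:.2f}")
print(f"largest gap: {largest_gap:.2f}, same preferred model: {same_sign}")
```

In every dataset the two variants agree on the sign of ΔLL, and the gap between them stays small compared with the larger ΔLL values in the table.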

The data provides stronger support for point‐mutation speciation than for random‐fission speciation, as reported by Etienne & Haegeman (2011). The data does not contain signs of per‐species speciation in the case without dispersal limitation, in agreement with Etienne et al. (2007b). However, in the case with dispersal limitation, which has not been studied previously, there is strong evidence of per‐species speciation in the Korup and Yasuni datasets. Hence, the selection between speciation models depends on whether or not dispersal limitation is taken into account. While this is an intriguing result, an analysis of its precise meaning is beyond the scope of this paper.
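As a worked illustration of how the ΔLL column of Table 3 translates into model selection, the sketch below converts the (DL, ZSC) values into AIC differences. It assumes that the per‐species speciation model has exactly one parameter more than the point‐mutation model (α), and it uses a cut‐off of 2 AIC units, which is a common rule of thumb and not a value taken from this paper.

```python
# Delta-LL = LL(per-species speciation) - LL(point mutation),
# transcribed from the "DL ZSC" rows of Table 3.
delta_ll_dl_zsc = {
    "BCI": 0.54,
    "Korup": 10.22,
    "Pasoh": 0.07,
    "Sinharaja": 1.19,
    "Yasuni": 18.27,
    "Lambir": 1.18,
}

EXTRA_PARAMS = 1   # assumption: per-species speciation adds one parameter (alpha)
THRESHOLD = -2.0   # assumption: rule-of-thumb cut-off, not taken from the paper


def delta_aic(delta_ll, extra_params=EXTRA_PARAMS):
    """AIC(per-species) - AIC(point mutation); negative values favour per-species."""
    return 2 * extra_params - 2 * delta_ll


supported = sorted(
    name for name, d in delta_ll_dl_zsc.items() if delta_aic(d) < THRESHOLD
)

for name, d in delta_ll_dl_zsc.items():
    print(f"{name:10s} dAIC = {delta_aic(d):7.2f}")
print("per-species speciation supported in:", supported)
```

Under these assumptions only Korup and Yasuni pass the cut‐off, in line with the statement above.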

Multiple samples

The zero‐sum analog of the multiple‐samples sampling formula 7 has only been explored for the point‐mutation speciation process and neutral dispersal‐limited sampling (Etienne 2007; Connolly, Hughes & Bellwood 6). Here we apply the independent species sampling formula 7 to the same datasets. We follow the approach of Etienne (2007) and reduce the number of parameters to estimate by assuming that I_ℓ = I for all ℓ. Moreover, we eliminate the sampling efforts q_ℓ by setting the expected sample size equal to the observed sample size for each local community ℓ. As a result, the likelihood has to be maximized over two parameters only (θ and I).

We find very good agreement between the estimates obtained with the zero‐sum constraint and those obtained with the independent species assumption (Table 4). The likelihood values are different, but as explained before, they should not be compared. Indeed, the zero‐sum approach imposes a constraint on the allowed datasets that is not present in the independent species approach.

Table 4. Fits for multiple samples. From the abundance data of three Panamanian forest plots, we constructed eleven datasets, each consisting of three samples (one full dataset, and ten reduced datasets; see Etienne (2007) for details). We computed the maximum‐likelihood fits for two model variants. The first variant, ZSC, imposes the zero‐sum constraint (results taken from Etienne 2007). The second variant, ISA, assumes independence between species. Likelihoods of the two model variants are not comparable.

Dataset       Model  θ      I      LL
Full dataset  ZSC    259·3  44·24  −1091·80
              ISA    259·4  44·46  −1116·12
Subsample 1   ZSC    270·5  39·18  −679·87
              ISA    270·8  39·41  −702·08
Subsample 2   ZSC    273·9  39·21  −668·84
              ISA    274·2  39·44  −690·96
Subsample 3   ZSC    280·0  41·18  −673·74
              ISA    280·2  41·41  −695·75
Subsample 4   ZSC    282·2  42·63  −680·40
              ISA    282·4  42·87  −702·35
Subsample 5   ZSC    290·8  41·71  −679·28
              ISA    291·1  41·94  −701·23
Subsample 6   ZSC    297·3  39·13  −654·40
              ISA    297·6  39·35  −676·45
Subsample 7   ZSC    298·6  37·27  −652·12
              ISA    299·0  37·48  −674·39
Subsample 8   ZSC    296·5  36·32  −640·46
              ISA    296·8  36·53  −662·70
Subsample 9   ZSC    300·4  37·65  −647·22
              ISA    300·7  37·87  −669·34
Subsample 10  ZSC    271·5  40·47  −688·08
              ISA    271·7  40·70  −710·15

Multiple guilds

Recently, we derived the zero‐sum sampling formula for a single sample of two dispersal guilds with a metacommunity governed by point‐mutation speciation (Janzen, Haegeman & Etienne 24). As we were interested in detecting guild differences in dispersal rate, we assumed that the two guilds have the same distribution of relative abundances in the metacommunity, but no species in common. Here we apply the multiple‐guilds sampling formula 11 of the independent species approach to the dataset studied by Janzen, Haegeman & Etienne (24).

Importantly, the assumption that the guild metacommunities do not differ can be implemented in different ways. The zero‐sum approach of Janzen, Haegeman & Etienne (24) assumed that the two guilds have the same speciation rates, and hence, the same metacommunity diversity θ (denoted by ‘sS’, which stands for same speciation rate). However, this assumption does not eliminate differences in guild metacommunity sizes. One can therefore impose additionally that guild metacommunity sizes are the same (denoted by ‘sM’, which stands for same metacommunity size). It turns out that this additional assumption has a strong effect on the parameter estimates [Table 5; compare rows (sM, ZSC) and (sS, ZSC)], regardless of whether guilds have the same or different dispersal rates. The effect on the likelihood depends on the dispersal assumption: in Table 5, the likelihood of the second implementation (same speciation rate and same guild metacommunity size) is higher than that of the first implementation (same speciation rate, but guild metacommunity size can vary) when guilds have different dispersal rates, and lower when guilds have the same dispersal rate.

Table 5. Fits for multiple guilds. Guild 1: species with biotic dispersal; guild 2: species with abiotic dispersal; see Janzen, Haegeman & Etienne (24) for details. For six censuses of the BCI plot we computed the maximum‐likelihood fits for several model variants: sM, guild metacommunities have same size; sS, guilds have same speciation rate; dD, guilds have different dispersal rate; sD, guilds have same dispersal rate; ZSC, zero‐sum constraint; ISA, species independence approach. Results for model (sS, ZSC) are taken from Janzen, Haegeman & Etienne (24), but results for model (sM, ZSC) have not been reported before.

Dataset     Model      θ      I_1     I_2     LL
BCI (1982)  sM dD ZSC  80·50  2433    13·56   −365·92
            sM dD ISA  80·85  2399    13·90   −382·59
            sM sD ZSC  41·22  79 520  79 520  −410·32
            sM sD ISA  41·49  71 420  71 420  −426·80
            sS dD ZSC  503·0  49·91   7·871   −368·06
            sS sD ZSC  67·29  520·7   520·7   −399·18
BCI (1985)  sM dD ZSC  79·43  2743    12·75   −365·39
            sM dD ISA  79·77  2704    13·08   −382·07
            sM sD ZSC         20·31   20·31   −411·55
            sM sD ISA         20·41   20·41   −428·05
            sS dD ZSC  561·0  47·76   7·338   −367·52
            sS sD ZSC  65·57  573·4   573·4   −400·82
BCI (1990)  sM dD ZSC  78·62  2078    12·52   −361·33
            sM dD ISA  78·92  2059    12·86   −378·08
            sM sD ZSC  42·19  8137    8137    −407·51
            sM sD ISA  42·53  7803    7803    −424·00
            sS dD ZSC  107·0  53·68   7·546   −365·42
            sS sD ZSC  62·13  583·7   583·7   −393·86
BCI (1995)  sM dD ZSC  77·93  2078    12·05   −371·03
            sM dD ISA  78·24  2057    12·37   −387·83
            sM sD ZSC  41·31  9329    9329    −417·96
            sM sD ISA  41·65  8859    8859    −434·49
            sS dD ZSC  106·5  53·32   7·277   −374·98
            sS sD ZSC  62·00  554·1   554·1   −404·08
BCI (2000)  sM dD ZSC  77·77  2060    12·53   −361·10
            sM dD ISA  78·08  2040    12·86   −377·85
            sM sD ZSC  42·08  7148    7148    −405·99
            sM sD ISA  42·41  6897    6897    −422·49
            sS dD ZSC  105·8  54·41   7·594   −364·99
            sS sD ZSC  61·12  595·6   595·6   −392·54
BCI (2005)  sM dD ZSC  76·09  2589    13·01   −359·54
            sM dD ISA  76·39  2558    13·37   −376·26
            sM sD ZSC  40·50  21 040  21 040  −401·50
            sM sD ISA  40·79  19 980  19 980  −417·99
            sS dD ZSC  471·3  48·09   7·665   −361·97
            sS sD ZSC  60·41  669·9   669·9   −390·99

This distinction is crucial for the comparison of the zero‐sum and independent species estimates. The independent species model underlying sampling formula 11 corresponds to the second implementation, i.e. the identity of guild speciation rates implies the identity of guild metacommunity sizes. Indeed, the independent species estimates are very similar to the zero‐sum estimates obtained with the second implementation [Table 5; compare rows (sM, ZSC) and (sM, ISA)]. This agreement holds both when assuming that guilds have the same or different dispersal rates. Note that there is no independent species model that corresponds to the first implementation, where guild metacommunity sizes can vary.

Extensions to models without zero‐sum sampling formula

We study several problems of fitting community models to abundance data for which the zero‐sum approach does not lead to a workable solution. We show that by adapting the independent species approach each of these problems can be solved without major obstacles.

Different urn:x-wiley:2041210X:media:mee312807:mee312807-math-0047: local community models

Until now we have assumed that the sampling probability is given by neutral dispersal‐limited sampling (eqn 3). The independent species framework allows us to analyse other local community models. As an illustration, we consider a model with density dependence, which constitutes a departure from neutrality (see Allouche & Kadmon 2009; Jabot & Chave 2011 for other extensions of the neutral model with density dependence).

Many forms of density dependence can be incorporated in the independent species framework. We assume that the per capita birth rate is proportional to urn:x-wiley:2041210X:media:mee312807:mee312807-math-0048 and that the per capita death rate is constant. This leads to positive density dependence for 0 < α < 1 and negative density dependence for α < 0. In Appendix S6 we show that the sampling probability urn:x-wiley:2041210X:media:mee312807:mee312807-math-0049 then becomes,
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0050(eqn 14)

This expression replaces eqn 3 in sampling formula 1. Note that the sampling formula with density dependence lacks sampling invariance, that is, eqn 14 changes when considering a sample taken from the local community rather than the entire local community. This implies that, when applied to sample abundance data, the sampling formula depends on local community size, introducing an additional parameter to estimate. When fitting the model to the tropical forest plots, we find some evidence of negative density dependence in the local community (Table S1).

Different ρ(x): metacommunity models

The metacommunity abundance density ρ(x) depends on the metacommunity dynamics. Particular interest has been given to how new species arise. Rosindell et al. (2010) proposed the protracted speciation model to account for the fact that speciation takes time. In Appendix S3 we show that the corresponding metacommunity abundance density ρ(x) is given by
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0051(eqn 15)

Parameter θ is related to the speciation‐initiation rate, while parameter ϕ is inversely proportional to speciation time. Interestingly, in the limit ϕ→∞ we recover eqn 4 for point‐mutation speciation, and in the limit θ→∞ we recover eqn 12 for random‐fission speciation. Hence, the protracted‐speciation model interpolates between the two speciation models. Fitting the model to the six tropical forest plots shows that protractedness cannot be detected in the SADs (Table S2). Rosindell et al. (2010) reached the same conclusion using the approximate fitting procedure of Alonso & McKane (2004). Note that this procedure can be reinterpreted in the independent species framework (see Discussion).

As another example, we consider a metacommunity model with density dependence. Density dependence at large scales can effectively emerge from local interactions (Steele & Forrester 2005). We take the same form of density dependence as in the local community example: the per capita birth rate is proportional to urn:x-wiley:2041210X:media:mee312807:mee312807-math-0052 and the per capita death rate is constant. The corresponding abundance density ρ(x) is given by (see Appendix S5),
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0053(eqn 16)
which, interestingly, is the same expression as eqn 13 for per‐species speciation. However, where in the case of per‐species speciation only non‐negative values of α were meaningful (in particular, 0 ≤ α < 1), the density‐dependence interpretation of eqn 16 also allows negative values of α (in case of negative density dependence). The model fits for the tropical forest data have positive values of α (Table 3, rows DL). Hence, the interpretation is not univocal: it can indicate either per‐species speciation or positive density dependence.

Species‐dependent parameters

The previous models are based on the assumption of species equivalence. While species differences are difficult to deal with in the zero‐sum framework (Zhou & Zhang 2008), they can be easily incorporated with the independent species approach. Indeed, because the likelihood is equal to the product of species‐level likelihoods, it suffices to introduce species‐dependent parameters in each of the factors of this product. However, this leads to likelihood functions of a large number of parameters (proportional to the number of species), which cannot be inferred from the data. To reduce the number of parameters, we consider an alternative model in which parameters differ between species, but species‐specific parameters are drawn from a distribution that is the same for all species. Likelihood maximization can then be used to infer information about this distribution.

As an example, we suppose that dispersal number I differs between species and that the species‐specific dispersal numbers I_i are drawn from distribution σ(I). In Appendix S7 we show that the independent species sampling formula 1 still holds, with λ_k given by (instead of eqn 2),
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0054(eqn 17)
and Λ given by (instead of eqn 6),
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0055
In a concrete application, one could parameterize the distribution σ(I) by its variance, and infer this parameter from the data. If the likelihood for non‐zero variance is higher than the likelihood for zero variance, there might be evidence that the dispersal number I differs between species. The strength of the evidence can be quantified, using likelihood‐ratio tests. Note that this procedure informs us only on the existence of species differences in dispersal rate, but not on the dispersal rate of specific species.
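The likelihood‐ratio comparison described above can be sketched as follows, for the case in which the distribution σ(I) is described by a single extra parameter (its variance). The function name and the two log‐likelihood values are hypothetical and serve only as an illustration. The chi‐square reference distribution with one degree of freedom is an assumption of this sketch: because zero variance lies on the boundary of the parameter space, it should be read as a rough and possibly conservative guide.

```python
import math


def lrt_pvalue_1df(ll_null, ll_alt):
    """P-value of a likelihood-ratio test with one extra parameter.

    Uses the chi-square distribution with 1 degree of freedom as the
    reference distribution (an assumption of this sketch).
    """
    stat = 2.0 * (ll_alt - ll_null)
    if stat <= 0.0:
        return 1.0
    # Survival function of chi-square with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical maximum log-likelihoods, for illustration only.
ll_zero_variance = -317.0   # all species share the same dispersal number I
ll_free_variance = -314.5   # variance of sigma(I) estimated from the data

p_value = lrt_pvalue_1df(ll_zero_variance, ll_free_variance)
print(f"LRT statistic = {2 * (ll_free_variance - ll_zero_variance):.2f}, p = {p_value:.4f}")
```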

A similar approach could be applied to other model parameters. For example, in the multiple‐sample case, one could assume that dispersal number I differs between samples. To limit the number of parameters, i.e. to avoid the introduction of a parameter for each patch, one could assume that the sample‐specific dispersal numbers I_ℓ are drawn from a common distribution σ(I). The corresponding sampling formula can then be constructed along the lines explained above. However, because different species are affected by the same choice of dispersal number I_ℓ, the likelihood no longer has the product structure of independent species, so that the sampling formula is more complicated to evaluate.

Large datasets

Even if the zero‐sum sampling formula is available, its evaluation often becomes cumbersome for large datasets. We have argued above that the independent species sampling formula is easier to evaluate. To further support this statement, we consider Hubbell's neutral model (point‐mutation speciation and dispersal‐limited sampling). For a fixed set of parameter values (metacommunity diversity θ = 50 and dispersal number I = 1000), we generate sample data for sample sizes ranging from J = 10³ to J = 10⁶. This can be easily done within the independent species framework, because the abundance frequencies are independent Poisson random variables (see eqn 1). For each of the generated samples, we fit the model parameters using maximum likelihood, once with the zero‐sum sampling formula and once with the independent species sampling formula. We then compare the time it takes to complete the maximization. Note that one maximization typically requires a few hundred sampling formula evaluations.
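The simulation step can be sketched in a few lines once the expected abundance frequencies λ_k are known. The sketch below does not evaluate eqn 2 of Hubbell's model. As a loudly labelled stand‐in it uses log‐series means, λ_k = θ x^k / k, truncated at a maximum abundance k_max, and all function names are illustrative. What it shows is the structure: each abundance frequency is drawn independently from a Poisson distribution, and the log‐likelihood is a sum of Poisson log‐probabilities.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate by CDF inversion (adequate for lam below a few hundred)."""
    u = rng.random()
    k = 0
    p = math.exp(-lam)
    cdf = p
    # Stop when the CDF passes u, or when the terms underflow to zero.
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k
        cdf += p
    return k


def logseries_means(theta, x, k_max):
    """Stand-in for lambda_k: expected number of species with abundance k."""
    return [theta * x**k / k for k in range(1, k_max + 1)]


def simulate_frequencies(means, rng):
    """Independent Poisson draw for every abundance class."""
    return [poisson_draw(lam, rng) for lam in means]


def log_likelihood(freqs, means):
    """Sum of independent Poisson log-probabilities, one term per abundance class."""
    return sum(
        s * math.log(lam) - lam - math.lgamma(s + 1) for s, lam in zip(freqs, means)
    )


rng = random.Random(1)
means = logseries_means(theta=50.0, x=0.99, k_max=2000)  # illustrative values
freqs = simulate_frequencies(means, rng)
n_species = sum(freqs)
n_individuals = sum(k * s for k, s in enumerate(freqs, start=1))
print(n_species, n_individuals, round(log_likelihood(freqs, means), 2))
```

Note that the sample size is itself random in this construction, which is exactly the feature that distinguishes the independent species approach from the zero‐sum approach.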

The comparison results are shown in Fig. 3. The scaling of computation time with sample size differs between the two approaches: the independent species computation time scales as urn:x-wiley:2041210X:media:mee312807:mee312807-math-0056, and the zero‐sum computation time scales as J². The independent species approach is faster for sample size J > 10⁴. For example, for J = 10⁵, the independent species computation takes about a minute, while the zero‐sum computation takes about half an hour (on a standard laptop computer; see Fig. 3 for specifications). For still larger sample size, J > 2 × 10⁵, our implementation of the zero‐sum computation does not complete, due to memory problems that occurred during the computation of large Stirling numbers (on which the zero‐sum sampling formula is based; see Etienne 2005). In contrast, the independent species computation time remains below a few minutes for sample size J up to 10⁶.

Figure 3. Computational complexity of zero‐sum and independent species likelihood maximization. We generated samples of different size for the neutral community model with point‐mutation speciation (θ = 50) and dispersal limitation (I = 1000), and estimated the model parameters, using the zero‐sum (red dots) and independent species (green dots) sampling formula. Computation time scales consistently with sample size J: proportional to J² for the zero‐sum approach (red line) and proportional to urn:x-wiley:2041210X:media:mee312807:mee312807-math-0057 for the independent species approach (green line). We did not succeed in evaluating the zero‐sum likelihood for sample size J > 2 × 10⁵ due to memory problems (vertical red line). Computations were performed on a laptop computer with Intel Core i5 microprocessor (two cores, 2·80 GHz clock speed and 6 MB on‐board memory) and 3·8 GB main memory.
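To get a feeling for what quadratic scaling implies, the following back‐of‐the‐envelope sketch extrapolates the zero‐sum computation time from the reported reference point (about half an hour at J = 10⁵) to J = 10⁶. It assumes, purely for illustration, that the quadratic scaling would continue to hold and that memory would not be limiting. The function name is hypothetical.

```python
def extrapolate_time(t_ref_minutes, j_ref, j_new, exponent):
    """Scale a reference run time by (j_new / j_ref) ** exponent."""
    return t_ref_minutes * (j_new / j_ref) ** exponent


# Reference point read from the text: roughly 30 minutes at J = 1e5 (approximate).
zs_minutes = extrapolate_time(30.0, 1e5, 1e6, exponent=2.0)
zs_hours = zs_minutes / 60.0
print(f"extrapolated zero-sum time at J = 1e6: about {zs_hours:.0f} hours")
```

Under these assumptions the zero‐sum maximization would take on the order of two days, compared with the few minutes reported for the independent species approach.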

As an illustration, we fit Hubbell's model to an extended dataset of the BCI tropical forest plot, which includes all trees with dbh (diameter at breast height) above 1 cm (rather than trees with dbh above 10 cm). Due to the large sample size (J ≈ 2·3 × 10⁵), we were not able to evaluate the zero‐sum likelihood on our computer. Likelihood maximization using the independent species approach did not pose any problem (see Table S3).

Relative abundance data

Another limitation of the zero‐sum sampling formula is that it can only be applied to absolute species abundances. However, abundance data are often available as relative abundances (e.g. vegetation cover, biomass, fingerprint data). The independent species approach can be easily extended to that type of data, with sampling formula,
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0058(eqn 18)
with pi the observed relative abundance and Λ the expected number of observed species,
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0059
The integrand in eqn 19 contains two sampling probabilities. The first one is the probability density urn:x-wiley:2041210X:media:mee312807:mee312807-math-0060 for local relative abundance p given metacommunity relative abundance x. For the case of neutral dispersal‐limited sampling, it is the continuous version of the negative binomial distribution (eqn 3), which is the gamma distribution,
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0061(eqn 19)
The second one is the probability urn:x-wiley:2041210X:media:mee312807:mee312807-math-0062 to observe in the sample a species with local relative abundance p. For example, one could take urn:x-wiley:2041210X:media:mee312807:mee312807-math-0063, so that species with relative abundance under the threshold relative abundance 1/ξ are typically not detected, and species with relative abundances above it have a substantial chance of being detected. Note that sampling formula 18 can be generalized to multiple samples,
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0064(eqn 20)
with urn:x-wiley:2041210X:media:mee312807:mee312807-math-0065. The index i runs over all species that are observed at least in one sample. The index ℓ runs over the local communities from which a sample is taken; the first product inside the integrand corresponds to samples in which species i is observed, while the second product corresponds to samples in which species i is unobserved.

Presence‐absence data

We can also apply our approach to datasets where only species occurrences were scored in multiple sites, i.e. presence‐absence data. We consider L samples. We introduce the presence‐absence vector urn:x-wiley:2041210X:media:mee312807:mee312807-math-0066 of a species, i.e. urn:x-wiley:2041210X:media:mee312807:mee312807-math-0067 with o_ℓ = 1 if the species is present in sample ℓ and o_ℓ = 0 if not. We denote the corresponding abundance frequencies by urn:x-wiley:2041210X:media:mee312807:mee312807-math-0068. Then, the independent species sampling formula is,
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0069(eqn 21)
with
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0070(eqn 22)
and urn:x-wiley:2041210X:media:mee312807:mee312807-math-0071 the probability that a species with metacommunity abundance x is present in sample ℓ. For neutral dispersal‐limited sampling (with dispersal number I_ℓ and sampling effort q_ℓ), we have (see eqn 10),
urn:x-wiley:2041210X:media:mee312807:mee312807-math-0072

Discussion

We have provided a framework to compute, under the independent species assumption, a sampling formula for all mainland‐island(s) models for which we can specify the metacommunity abundance density ρ(x) and the local sampling probability urn:x-wiley:2041210X:media:mee312807:mee312807-math-0073. The computational complexity of the sampling formula reduces to the evaluation of one‐dimensional integrals of the form urn:x-wiley:2041210X:media:mee312807:mee312807-math-0074. Because the integrands are often sharply peaked, the numerical evaluation of these integrals can be challenging. We include a dedicated integration algorithm in the R package SADISA (which stands for Species Abundance Distributions under the Independent Species Assumption). Currently, the package implements the sampling formulas only for the analyses presented in the paper. However, it is relatively straightforward to use the methods implemented in the package for other community models.

The independent species framework allows us to fit a broad set of neutral community models. This set is much broader than the models with zero‐sum sampling formulas, for which our approach is often (much) more efficient. The framework can be applied to larger datasets (higher abundances, more species, more samples) and to relative abundance and presence‐absence data. The only requirement is the specification of the metacommunity abundance density ρ(x) – which depends on the speciation process – and the local sampling probability urn:x-wiley:2041210X:media:mee312807:mee312807-math-0075 – which depends on the local demographic dynamics. Even in cases where the independent species sampling formulas are approximate, such as the random‐fission and the per‐species speciation models, the parameter estimates are almost indistinguishable from the zero‐sum results. The approach is not restricted to neutral scenarios, as illustrated by our examples of density dependence and species‐dependent parameters. Independent‐species models can be easily simulated, because the abundance frequencies are independent Poisson random variables (see Appendix S1). Simulated datasets are useful to explore model predictions, but also to evaluate the accuracy of parameter estimates and the reliability of model inference (see below).

We have shown that the sampling formulas under the independent species assumption yield parameter estimates that are very similar to those obtained under the zero‐sum constraint. This need not always be the case. The condition for this similarity is that the community size distribution is sharply peaked. This happens for the local community when the dispersal number I is large (e.g. I > 10; see Appendix S2), and in the metacommunity (under point mutation) when the diversity parameter θ is large (e.g. θ > 10; see Appendix S3). Sampling formulas are typically applied to highly diverse systems, because only those systems are considered to contain sufficient information (i.e. enough ‘replicates’) to reliably estimate the parameters. Hence, we expect that the zero‐sum and independent species fits will often agree. Even if the fits do not agree, this discrepancy should not be seen as a failure of the independent‐species approach. Independent‐species models are not only approximations of zero‐sum models; they are fully consistent mathematical models in their own right. However, in such (rare) cases of discrepancy, the ecological meaning should be critically evaluated.

Our work sheds new light on previous attempts to link abundance data with community models. Alonso & McKane (2004) proposed a somewhat ad hoc approach to fit community models to abundance data. Within the independent‐species framework, it corresponds to applying an additional conditioning on the observed number of species. As our approach does not have this conditioning, it does not discard the information contained in the observed number of species, and is thus more powerful. Volkov et al. (2003) combined the independent species metacommunity abundance density under point mutation with the zero‐sum version of local dispersal‐limited sampling. This mixed approach can be used to compute the expected abundance distribution, but is less helpful to derive the full sampling formula. We have shown how a consistent application of the independent species approach readily provides both the abundance distribution and the sampling formula. Green & Plotkin (2007) proposed abundance distributions which have the same structure as the ones we obtained from solving the independent species community models (compare their eqn 1 with our eqn 2). Our results can be interpreted as a more mechanistic underpinning of their distributions. Moreover, our framework indicates how to incorporate their abundance distributions into sampling formulas, which can then be used for parameter estimation and model selection.

The theory we have developed results in a long list of sampling formulas (see Appendix S8). The question arises how to choose among them in practice. The general structure of the sampling formula is dictated by the nature of the data: is the data expressed in absolute abundances, relative abundances, or as presence‐absence data; is there a single or are there multiple samples? The biological question determines the different processes to include in the community models, which in turn determine the functions appearing in the sampling formula: the abundance density ρ(x) at the regional scale, and the sampling probability urn:x-wiley:2041210X:media:mee312807:mee312807-math-0076 at the local scale. We have presented a derivation for several of these functions, which can serve as a template for other community models. Once the functions ρ(x) and urn:x-wiley:2041210X:media:mee312807:mee312807-math-0077 have been specified, we can apply the independent species formalism to evaluate the sampling formula and to determine the maximum‐likelihood parameters. The R package SADISA includes a step‐by‐step demonstration for single‐sample and multiple‐samples examples.

Reliable inference of community processes from abundance data is well‐known to be very challenging. While the independent species approach drastically simplifies the evaluation of the likelihood function, it evidently does not resolve fundamental issues of fitting community models to abundance data. For example, in Hubbell's neutral model, very large samples are required to distinguish between cases with high regional diversity and low dispersal and cases with low regional diversity and high dispersal (see the ridge of high likelihood in Fig. 1). Community structure is the result of the interplay between several processes, both at local and regional scales, which are often difficult to tell apart using abundance data alone (McGill et al. 2007; Al Hammal et al. 2015). These issues are as problematic for the independent species approach as for the zero‐sum approach.

Therefore, the independent species sampling formulas must not be applied blindly, but should be combined with techniques to evaluate the reliability of the maximum‐likelihood estimates. When applying the sampling formulas in practice, it is important to assess the estimation bias of the model parameters. A common approach consists in simulating many times the community model with the estimated parameter values, and determining the maximum‐likelihood parameters for each of the simulated datasets, which are then compared to the simulation values. The zero‐sum and independent species model variants present the same parameter estimation biases. However, the evaluation of these biases is more efficient for independent species models, because they are particularly easy to simulate. Simulated datasets are also used to test whether the fitted model can satisfactorily reproduce the empirical data (Etienne 2007; Jabot & Chave 2011).
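The simulate‐and‐refit procedure described above can be sketched as follows. To keep the example self‐contained it uses a deliberately simple one‐parameter toy model, λ_k = θ w_k with known weights w_k = x^k / k, for which the maximum‐likelihood estimate has the closed form θ̂ = Σ s_k / Σ w_k. This toy model, the parameter values and all function names are assumptions made for illustration. In a real application the simulation and the fit would use the sampling formula of the community model under study.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate by CDF inversion (adequate for moderate lam)."""
    u = rng.random()
    k = 0
    p = math.exp(-lam)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k
        cdf += p
    return k


def class_weights(x, k_max):
    """Known weights w_k of the toy model lambda_k = theta * w_k."""
    return [x**k / k for k in range(1, k_max + 1)]


def simulate(theta, weights, rng):
    """Simulate abundance frequencies as independent Poisson variables."""
    return [poisson_draw(theta * w, rng) for w in weights]


def fit_theta(freqs, weights):
    """Closed-form maximum-likelihood estimate for the toy model."""
    return sum(freqs) / sum(weights)


def bootstrap_bias(theta_hat, weights, n_rep, rng):
    """Simulate at the fitted value, refit each dataset, and report the mean deviation."""
    estimates = [
        fit_theta(simulate(theta_hat, weights, rng), weights) for _ in range(n_rep)
    ]
    return sum(estimates) / n_rep - theta_hat, estimates


rng = random.Random(7)
weights = class_weights(x=0.95, k_max=400)
observed = simulate(50.0, weights, rng)      # stand-in for an empirical dataset
theta_hat = fit_theta(observed, weights)
bias, estimates = bootstrap_bias(theta_hat, weights, n_rep=200, rng=rng)
print(f"theta_hat = {theta_hat:.2f}, estimated bias = {bias:.2f}")
```

For this toy model the estimator is unbiased, so the reported bias should be close to zero. For realistic community models with several parameters the same loop can reveal substantial biases.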

The flexibility of the independent species assumption allows us to construct new hypothesis tests on a wide range of community processes. However, the reliability of such tests should be carefully assessed. For example, we repeatedly used the tropical forest data to illustrate our sampling formulas. Each of these sampling formulas deals with one or two community processes (including dispersal limitation, different speciation mechanisms, and density dependence), and we determined for each process separately whether it is supported by the data (using Akaike information criterion). A more satisfying approach would combine these processes in a single, nested model, and test whether particular instances of this general model provide fits of similar quality. However, this approach would most probably lead to overparametrization problems, which can be detected by appropriate model selection techniques (Burnham & Anderson 2003; note that these techniques are often simulation‐based). Clearly, the technical possibility to evaluate the likelihood function does not at all guarantee the reliability of the inference results.

Species abundance distributions are known to contain limited information about the processes that structured the community (McGill et al. 2007). More powerful inferences might be possible based on abundance data coming from multiple sites, which can be handled with the approach presented in this paper. A similar approach can be instrumental to integrate also other types of data, such as species‐area relationships (O'Dwyer & Green 2010), time‐series data (Kalyuzhny, Kadmon & Shnerb 2015) and phylogenetic information (Manceau, Lambert & Morlon 2015). Combining different patterns will yield stronger tests of the adequacy of a model to fit the data. To tackle this, the independent species approach seems a promising tool.

Authors' contributions

B.H. and R.S.E. conceived the study, developed the theory, analysed the examples, programmed the R package, and wrote the paper.

Acknowledgements

We thank four anonymous reviewers and S. Dray for insightful comments, and the Center for Tropical Forest Science for data collection. Financial support was provided by the French National Research Agency (ANR) through the TULIP Laboratory of Excellence (to B.H., grant number ANR‐10‐LABX‐41), by the Netherlands Organization for Scientific Research (NWO) through VIDI and VICI grants to R.S.E. (VICI grant number 865.13.003), and by the bilateral French‐Dutch Van Gogh programme (to B.H. and R.S.E.).

Data accessibility

All datasets and models analysed in this paper are available in the R package SADISA, which can be downloaded at https://CRAN.R-project.org/package=SADISA.
