Likelihood analysis of species occurrence probability from presence-only data for modelling species distributions
Correspondence site: http://www.respond2articles.com/MEE/
Summary
1. Understanding the factors affecting species occurrence is a pre-eminent focus of applied ecological research. However, direct information about species occurrence is lacking for many species. Instead, researchers sometimes have to rely on so-called presence-only data (i.e. when no direct information about absences is available), which often results from opportunistic, unstructured sampling. maxent is a widely used software program designed to model and map species distribution using presence-only data.
2. We provide a critical review of maxent as applied to species distribution modelling and discuss how it can lead to inferential errors. A chief concern is that maxent produces a number of poorly defined indices that are not directly related to the actual parameter of interest – the probability of occurrence (ψ). This focus on an index was motivated by the belief that it is not possible to estimate ψ from presence-only data; however, we demonstrate that ψ is identifiable using conventional likelihood methods under the assumptions of random sampling and constant probability of species detection.
3. The model is implemented in a convenient r package which we use to apply the model to simulated data and data from the North American Breeding Bird Survey. We demonstrate that maxent produces extreme under-predictions when compared to estimates produced by logistic regression which uses the full (presence/absence) data set. We note that maxent predictions are extremely sensitive to specification of the background prevalence, which is not objectively estimated using the maxent method.
4. As with maxent, formal model-based inference requires a random sample of presence locations. Many presence-only data sets, such as those based on museum records and herbarium collections, may not satisfy this assumption. However, when sampling is random, we believe that inference should be based on formal methods that facilitate inference about interpretable ecological quantities instead of vaguely defined indices.
Introduction
Species distribution is naturally characterized by the probability of occurrence of a species, say ψ(x) = Pr(y(x) = 1) where y(x) is the true occurrence state of a species at some location (pixel) x (Kéry 2011). Inference about ψ(x) can be achieved directly from presence–absence data using logistic regression and related models (MacKenzie et al. 2002). However, ecologists are not always fortunate enough to have presence–absence data, and many data sets exist which only contain locations of species presence – so-called presence-only data.
maxent (e.g. Phillips et al. 2006) is a popular software package for producing ‘species distribution’ maps from presence-only data. Interestingly, maxent does not produce estimates of occurrence probability but, instead, produces estimates of an ill-defined ‘suitability index’ (Elith et al. 2011). Because maxent does not correspond to an explicit model of species occurrence, it is not suitable for making explicit predictions of an actual state variable or testing hypotheses about factors that influence occurrence probability. Support for producing indices of species distribution from presence-only data, as opposed to estimates of occurrence probability, has been justified in the literature based on the incorrect assertion that occurrence probability ψ (sometimes referred to as ‘prevalence’ or occupancy) cannot be estimated from presence-only data.
The principal aim of our paper is to show that occurrence probability can be estimated from presence-only data. We consider a formal model-based approach to analysis of presence-only data. We emphasize the critical assumption required for statistical inference about species occurrence probability from presence-only data, which is random sampling of space as a basis for accumulating presence-only observations. In addition, the estimator we devise here is most relevant only when species detection probability is constant. We conclude that, under these assumptions, inference about occurrence probability can be achieved directly from presence-only data using conventional likelihood methods (e.g. Lancaster and Imbens 1996). We suspect that this is surprising to many users of maxent and related species distribution modelling tools in the light of repeated statements to the contrary in the literature (e.g. Phillips and Dudik 2008; Elith et al. 2011; Kéry 2011), asserting that probability of occurrence is not identifiable. For example, Elith et al. (2010) state that
Formally, we say that prevalence is not identifiable from presence-only data (Ward et al. 2009). This means that it cannot be exactly determined, regardless of the sample size; this is a fundamental limitation of presence-only data.
In fact, Ward et al. (2009) do not make such a definitive claim. Their precise claim is
[...occurrence probability...] is identifiable only if we make unrealistic assumptions about the structure of [...the relationship between occurrence probability and covariates....] such as in logistic regression....
In that context, it seems that subsequent references to Ward et al. (2009) misconstrue their result. In our view, logistic regression (or other binary regression models) is hardly unrealistic. Indeed, such models are the most common approach to modelling binary variables in ecology (and probably all of statistics), especially in the context of modelling species occurrence (MacKenzie et al. 2002; Tyre et al. 2003; Kéry et al. 2010). Even more generally, the logistic function is the canonical link of the binomial GLM (McCullagh and Nelder 1989, p. 38) and, as such, it is customarily adopted and widely used, and even books have been written about it (Hosmer and Lemeshow 2000).
We demonstrate the application of the formal model-based framework for estimating occurrence probability from presence-only data using a data set derived from the North American Breeding Bird Survey, and we provide an r package for producing estimates of species distribution model parameters from presence-only data.
Before proceeding, we note that the statistical principle of maximum entropy (Jaynes 1957, 1963; Jaynes and Bretthorst 2003) is widely applied to problems in statistics and other disciplines, and our development here is not critical of these ideas. Rather, we are critical of the routine application of the software package maxent as applied to species distribution modelling. We specifically object to the pervasive views in the maxent user community that one should avoid characterizing species distribution by occurrence probability, that occurrence probability is not identifiable and that one should instead obtain indices of species occurrence probability by using maxent.
Genesis of presence-only data
The original motivation for the development of maxent was to estimate and model the distribution of a species using presence-only data (Dudik et al. 2004; Phillips et al. 2004). Species distribution is naturally characterized by occurrence probability, which provides a quantitative description of the probability of the focal species occurring at a location and a mechanism for generating explicit predictions of occurrence and testing hypotheses related to factors that influence occurrence. maxent attempts to approximate the probability of occurrence by using a logistic transformation of its suitability index (Phillips & Dudik 2008). Before explaining the details of this indirect method, we first consider a model to describe the genesis of presence-only data, and the common approach of estimating the probability of occurrence using standard sampling methods.
Occupancy or Presence/Absence Sampling Experiment




Presence-Only Sampling Experiment
We adopt the view here that presence-only data, that is, a sample of locations for which y = 1, arise by discarding the y = 0 observations from a data set that arose by random sampling as described earlier. That is, we sample pixels randomly and obtain x1,…,xN and record y(x1),…,y(xN). Then, we consider only those sites x1,...xn for which y(x) = 1. The corresponding subset of locations constitutes our data set, which we will label here x1,...,xn. We use ‘n’ here instead of ‘N’ as above and recognize that the presence-only x’s are a reordered version of a subset of the initial sample.
Likelihood analysis
The basic characteristic of presence-only data is that the variable y is no longer random in our sample, because y = 1 with probability 1 for all observations. Instead, x is the random quantity, and the n locations x1,…,xn are the data upon which inference is based. Importantly, the specific values of x that appear in the sample represent a biased selection from all possible values of x, favouring those for which y = 1. To clarify the nature of the induced bias in our sample of x, we invoke Bayes rule. In the remainder, we use π() to represent probability distributions of x, and ψ() to represent probability distributions of y.
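In the notation just defined, Bayes rule for the location of a presence can be written out as follows. This is a sketch assembled from the surrounding definitions rather than a verbatim reproduction of the displayed equation; the symbol \mathcal{X}, used here for the set of all pixels in the landscape, is an assumed notation.

```latex
\pi(x \mid y = 1)
  = \frac{\psi(y = 1 \mid x)\,\pi(x)}{\psi(y = 1)},
\qquad
\psi(y = 1) = \sum_{x \in \mathcal{X}} \psi(y = 1 \mid x)\,\pi(x)
```

When π(x) is constant over pixels, it cancels from the numerator and denominator, leaving π(x|y = 1) = ψ(y = 1|x) / Σ ψ(y = 1|x), with the sum taken over all pixels.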

This might appear to be an awkward invocation of Bayes rule because we often do not think of spatial location as the outcome of a stochastic process in most contexts. It is somewhat more natural in the context of environmental covariates (Lancaster and Imbens 1996; Lele and Keim 2006), but they are equivalent formulations. We proceed with this development in terms of x here because this is pervasive in the maxent literature.



Equation 2 makes it clear that the x’s for which y = 1 are not a representative sample of all x’s. Intuitively, the presence-only sample (i.e. x’s for which y = 1) will favour pixels for which ψ(y = 1|x) is large relative to ψ(y = 1).
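The bias described here can be illustrated with a small simulation. The following is a minimal sketch, assuming a hypothetical landscape of 10000 pixels, a single covariate z and a logistic form for ψ; none of these values come from the analyses in this paper.

```r
set.seed(1)
M <- 10000                   # number of pixels in a hypothetical landscape
z <- rnorm(M)                # covariate value at each pixel
psi <- plogis(-1 - 1 * z)    # assumed occurrence probability psi(y = 1 | x)

# Bayes rule with constant pi(x): pi(x | y = 1) = psi(x) / sum(psi)
pi_pres <- psi / sum(psi)

# Random presence-absence sample of the pixels, then discard the zeros
y <- rbinom(M, 1, psi)
z_pres <- z[y == 1]

# Mean covariate: whole landscape, Bayes-rule expectation, observed presences
c(landscape = mean(z), bayes = sum(pi_pres * z), presences = mean(z_pres))
```

Because ψ decreases with z in this sketch, both the Bayes-rule expectation and the observed presences should sit below the landscape mean, which is the sense in which the presence-only x's are not representative.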
In this expression of Bayes rule, the variable x is ‘pixel identity’ for which π(x) is constant. It is not an indicator of whether pixel x appears in the sample. In the latter case, Pr(y = 1|x) would be the probability of occupancy conditional on pixel x being in the sample, which has no clearly useful meaning. That said, random sampling is important for the invocation of Bayes rule: we imagine that the presence-only sample arises by first randomly sampling pixels for presence–absence and then discarding the y = 0 pixels (see Appendix). Alternatively, it can be justified by randomly sampling the sample frame consisting of all y = 1 observations. Under random sampling, either with or without replacement, the probability that a sample unit appears in the sample is constant and thus has no effect on Eqn 2.
The Likelihood
We note that this expression of Bayes rule appears in a large number of species distribution modelling papers that involve the development or application of maxent. However, these papers never provide further analysis of the result, instead claiming that direct analysis of ψ(y = 1|x) is intractable because the background prevalence (ψ(y = 1) here) is not identifiable. This claim is incorrect, as has been noted in other contexts that produce similarly biased data (e.g. case–control studies, Lancaster and Imbens 1996, and ‘resource selection probability functions’, RSPF; Manly et al. 2002; Lele and Keim 2006). To clarify this, we use the previous application of Bayes rule to describe the likelihood for the presence-only data.
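Assuming that π(x) is constant and that ψ depends on a parameter vector β, the likelihood of the n presence-only locations would take the following form. This is a reconstruction from the surrounding text, so its layout may differ from the displayed equations; \mathcal{X} is an assumed symbol for the landscape.

```latex
L(\beta;\, x_1, \ldots, x_n)
  = \prod_{i=1}^{n} \pi(x_i \mid y_i = 1)
  = \prod_{i=1}^{n}
    \frac{\psi(y_i = 1 \mid x_i;\, \beta)}
         {\sum_{x \in \mathcal{X}} \psi(y = 1 \mid x;\, \beta)}
```

Each factor corresponds to the ratio datapsi/sum(gridpsi) evaluated in the R code given later in the paper.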



As noted earlier, the denominator is the marginal probability of occurrence over the landscape, and it is computed by summing over all elements of 𝒳, where 𝒳 is the state space of x, that is, the landscape as defined by the analyst. Clearly, this marginal probability could be estimated by evaluating ψ(y = 1|x) at a random sample of points x chosen independently of y (Lele and Keim 2006). Sometimes, a sample of x ∈ 𝒳 chosen independently of y is referred to as the ‘background’ in species distribution modelling or, in the context of case–control models, ‘contaminated controls’ (Lancaster and Imbens 1996).
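The idea of estimating the marginal probability from a background sample can be sketched numerically. The example below is illustrative only: the landscape size, the background sample size of 5000 and the parameter values are assumptions, and the helper psi() is a hypothetical name.

```r
set.seed(2)
M <- 100000
z <- rnorm(M)    # covariate at every pixel of a hypothetical landscape

# Assumed logistic occurrence model (parameter values are arbitrary)
psi <- function(z, beta0 = -1, beta1 = -1) plogis(beta0 + beta1 * z)

# Exact marginal probability: average psi over every pixel
marg_exact <- mean(psi(z))

# Background estimate: average psi over pixels drawn at random, ignoring y
bg <- sample(z, 5000)
marg_bg <- mean(psi(bg))

c(exact = marg_exact, background = marg_bg)
```

With constant π(x), the sum in the denominator of the likelihood is simply M times this average, so a background sample stands in for the full landscape at the cost of some Monte Carlo error.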
Imperfect Detection of Species
In practice, we expect bias in observing species presence, such that the probability of detecting a species given that it is present is less than 1. A standard model of this phenomenon (MacKenzie et al. 2002; Tyre et al. 2003) is constructed as follows: let yobs be the observed species presence and define the probability of detection as Pr(yobs = 1|y = 1, x) = p. If p is constant spatially, the marginal probability of the contaminated observations yobs is pψ(y = 1|x), and the constant p cancels from Eqn 4, so inferences about ψ are unaffected.
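The cancellation of a constant p can be checked numerically. The sketch below only verifies the algebra: the data are not re-simulated with detection error, and the values of p and of the parameters are arbitrary assumptions.

```r
set.seed(3)
z <- rnorm(5000)                        # covariate at every pixel
y <- rbinom(5000, 1, plogis(-1 - z))    # true occurrence state
pres <- z[y == 1]                       # presence-only covariate values

# Negative log-likelihood with a spatially constant detection probability p
nll <- function(parm, p = 1) {
  num <- p * plogis(parm[1] + parm[2] * pres)      # p * psi at the presences
  den <- sum(p * plogis(parm[1] + parm[2] * z))    # p * psi summed over pixels
  -sum(log(num / den))
}

# The objective is the same whether p = 1 or p = 0.3
c(p_1 = nll(c(-1, -1), p = 1), p_0.3 = nll(c(-1, -1), p = 0.3))
```

If p varied with x, it would no longer factor out of the sum in the denominator, and this equality would fail.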
Geographic vs. Environmental Space

By the law of total probability, the marginal probability ψ(y = 1|z(x)) can be computed directly or estimated if a random sample of z(x) independent of y is available (Lele and Keim 2006).

Likelihood Analysis in R

# simulate a covariate on a landscape of 10000 pixels
z <- rnorm(10000, 0, 1)
# define the linear predictor: logit(psi) = -1 - 1*z
lpsi <- -1 - 1 * z
# occurrence probability (inverse logit)
psi <- plogis(lpsi)
# generate presence-absence data
y <- rbinom(10000, 1, psi)
# keep a random sample of 2000 presence-only locations
pres <- sample(z[y == 1], 2000)
# define the negative log-likelihood
nll <- function(parm) {
  beta0 <- parm[1]
  beta1 <- parm[2]
  # psi at every pixel (the 'background')
  gridpsi <- plogis(beta0 + beta1 * z)
  # psi at the presence-only locations
  datapsi <- plogis(beta0 + beta1 * pres)
  -sum(log(datapsi / sum(gridpsi)))
}
# minimize it
out <- nlm(nll, c(0, 0), hessian = TRUE)
# produce the estimates
out$estimate
We conducted 5000 simulations under the model described above and found that the MLEs were unbiased (Fig. 1). Furthermore, the log-likelihood has a distinct mode (Fig. 2), indicating that occurrence probability, ψ(y = 1|z), is identifiable in the situations we examined, under random sampling. We do note, however, that there is a prominent ridge in the likelihood, highlighting the low information content inherent in presence-only data. We developed an r package ‘maxlike’, which implements the likelihood analysis in some generality.
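One way to see the low information content mentioned above is to inspect the approximate standard errors and the correlation between the two estimates, obtained by inverting the Hessian of the negative log-likelihood. The following is a self-contained sketch under the same simulation settings; it uses optim() rather than nlm(), which is a substitution made here for convenience and not part of the original analysis.

```r
set.seed(4)
z <- rnorm(10000)                          # covariate at every pixel
y <- rbinom(10000, 1, plogis(-1 - z))      # presence-absence data
pres <- sample(z[y == 1], 2000)            # presence-only sample

# Negative log-likelihood; log.p = TRUE returns log(psi) in a stable way
nll <- function(parm) {
  log_num <- plogis(parm[1] + parm[2] * pres, log.p = TRUE)
  log_den <- log(sum(plogis(parm[1] + parm[2] * z)))
  -sum(log_num - log_den)
}

fit <- optim(c(0, 0), nll, method = "BFGS", hessian = TRUE)
vc <- solve(fit$hessian)                   # approximate covariance of (beta0, beta1)
se <- sqrt(diag(vc))                       # approximate standard errors

cbind(estimate = fit$par, se = se)
cov2cor(vc)[1, 2]                          # correlation between the two estimates
```

A correlation far from zero, together with a comparatively large standard error for the intercept, would be consistent with the ridge in the likelihood surface.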

Fig. 1. Distributions of the maximum likelihood estimates obtained by fitting our model to 5000 simulated data sets. β0 and β1 are the intercept and slope parameters of the linear model of occurrence probability (ψ(y = 1|z)). The data-generating values are indicated by vertical lines. Kernel density estimators were used to represent the distributions.

Fig. 2. The log-likelihood surface of the maxlike model for a data set simulated using logit(ψ(y = 1|z)) = −1 − 1*z. The ‘X’ indicates the maximum. Parameters of the model are identifiable, but there exists a prominent ridge in the likelihood.
MAXENT analysis



We therefore are led to ask: In what sense is maxent ‘estimating π’? The only clear interpretation of π(x|y = 1) ≡ q(x) is that maxent is estimating a specific version of π(x|y = 1) in which Pr(y = 1|z(x)) = ψ(y = 1|z(x)) = exp(βz(x)) (i.e. occurrence probability is modelled as an exponential function) and, furthermore, a penalized form of that specific π(x|y = 1). Clear advantages to either of these two methodological choices (ψ(y = 1|z(x)) = exp(βz(x)) and the penalty) have not been established. In particular, modelling probabilities by a simple exponential function does not appear to be customary, or even very natural, as it does not have bounded support on [0,1] as ψ(y = 1|z(x)) must.
Identifiability of β0 or ‘Species Prevalence’
There is a widespread and incorrect belief (see Phillips and Dudik quote above) that species prevalence (Elith et al. 2011 also use the term ‘proportion of occupied sites’) cannot be determined from presence-only data (Phillips and Dudik 2008; Elith et al. 2011; Kéry 2011), and this is widely used as justification for producing vaguely defined ‘suitability indices’. While this is repeatedly asserted, there is never any specific discussion or argument as to why this is the case. In fact, it is in direct contradiction to existing literature (Lancaster and Imbens 1996; Lele and Keim 2006).
As we demonstrated, lack of identifiability of occurrence probability is not a general feature of presence-only data. We can clearly estimate the intercept term in Eqn 5 by maximum likelihood, if a suitable parametric form of ψ(y = 1|z) is assumed (Lancaster and Imbens 1996) and a continuous covariate is present (Lele and Keim 2006). Lacking a particular parametric form, one must know ψ(y = 1) (Lancaster and Imbens 1996) and, if covariates are only nominal categorical, then only relative probabilities of occurrence are achievable (Lele and Keim 2006). Conversely, it is clear that, no matter the composition of covariates, with the choice ψ(yi = 1|z) = exp(βzi), the intercept term is not identifiable, as the intercept cancels from Eqn 6 (Lele and Keim 2006). Thus, the inability to estimate occurrence probability is a feature of the specific model used by maxent and not a feature of presence-only data. As such, we do not see an advantage to using the exponential form for ψ(y = 1|z) over the more conventional logistic occurrence probability model.
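The contrast between the two parametric forms can be checked directly by evaluating the presence-only likelihood at several intercept values. This is a minimal sketch on simulated data; the function nll() and the chosen parameter values are illustrative assumptions.

```r
set.seed(5)
z <- rnorm(5000)                        # covariate at every pixel
y <- rbinom(5000, 1, plogis(-1 - z))    # presence-absence data
pres <- z[y == 1]                       # presence-only covariate values

# Negative log-likelihood for a generic occurrence function psi_fun
nll <- function(beta0, beta1, psi_fun) {
  -sum(log(psi_fun(beta0 + beta1 * pres) / sum(psi_fun(beta0 + beta1 * z))))
}

b0 <- c(-3, -1, 1)    # candidate intercepts, slope held at -1

# Exponential form: exp(beta0) cancels, so the value does not depend on beta0
exp_vals <- sapply(b0, nll, beta1 = -1, psi_fun = exp)

# Logistic form: the value changes with beta0, so the data inform the intercept
logit_vals <- sapply(b0, nll, beta1 = -1, psi_fun = plogis)

rbind(exponential = exp_vals, logistic = logit_vals)
```

The exponential row should be constant up to rounding error, whereas the logistic row should vary across the three intercepts.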
Logistic output from maxent


‘Regularization’ in maxent
In maxent, the specific objective function maximized is not the likelihood given in Eqn 4 earlier. Rather, it is the exponential function along with a penalty term that has the effect of penalizing the maximum entropy distribution for large covariate effects. This penalization is termed ‘regularization’ in the maxent terminology. The effect of the penalty is clear – it shrinks the regression coefficients to 0 – and it is a standard concept in smoothing methods and other contexts (Green and Silverman 1994; Tibshirani 1996). General motivation for the need of a penalty term in the context of species distribution modelling is not clear, and we have not seen specific justification given in the literature other than the suggestions that it may prevent over-fitting or save the user time by avoiding the need to formally compare competing hypotheses (Phillips et al. 2004). As an alternative to regularization, we recommend that exercising restraint in the creation of covariate data sets and maintaining a focus on developing a priori models can also prevent over-fitting.
The practical problem in using the penalized objective function is that it will generally lead to biased estimators of the important β parameters, and we believe that this should be considered and understood by users of maxent prior to analysis. In our view, this penalty is not necessary in developing occupancy models from presence-only data. Moreover, the way in which it is handled by maxent seems ad hoc, which is to say, the smoothing parameters are fixed a priori based on heuristics.
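The shrinkage induced by a penalty can be illustrated with a one-parameter example. The sketch below fits the exponential-form density to simulated presence-only data with and without an absolute-value penalty on the coefficient. The penalty weight lambda is a hypothetical value chosen only for illustration, and this is not a reproduction of the objective function or the heuristics used in maxent.

```r
set.seed(6)
z <- rnorm(5000)                        # covariate at every pixel
y <- rbinom(5000, 1, plogis(-1 - z))    # presence-absence data
pres <- z[y == 1]                       # presence-only covariate values

# Negative log-likelihood of q(x) = exp(b * z(x)) / sum(exp(b * z))
nll <- function(b) -sum(b * pres - log(sum(exp(b * z))))

lambda <- 200    # hypothetical penalty weight, not a maxent default

# Unpenalized and penalized estimates of the coefficient
b_mle <- optimize(nll, c(-10, 10))$minimum
b_pen <- optimize(function(b) nll(b) + lambda * abs(b), c(-10, 10))$minimum

c(unpenalized = b_mle, penalized = b_pen)
```

The penalized estimate should lie closer to zero than the unpenalized one, which is the source of the bias discussed above.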
There could be at least two situations in which using a penalized objective function is a sensible thing to do. One is when no obvious model set can be developed a priori and the number of potential models is extremely large. In that case, we might wish to fit some omnibus complex model for the sake of hypothesis generation. A second possibility for using a penalty to the objective function is in the presence of sparse data, or small sample size relative to the number of predictors. In that case, some model parameters will be weakly identified, and the penalty term essentially keeps the parameter in a reasonable region of the parameter space and probably alleviates numerical errors and other pathologies that you might expect in such cases.
MAXENT Scalings of q(x)

We see that LN itself is a redundant parameter because exp(LN+β*z(x)) = exp(LN)*exp(β*z(x)) and LN is included in both the denominator and numerator. Furthermore, we can absorb DN directly into β, yielding the reparameterization β* = β/DN. As such, the scaling of q does not appear to be a meaningful methodological element of the maxent problem.
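The redundancy can be confirmed numerically. The sketch below assumes a scaled density of the form exp(LN + (β/DN) z(x)) normalized over all pixels, which is inferred from the sentence above rather than taken from the maxent documentation; the values of beta, LN and DN are arbitrary.

```r
set.seed(7)
z <- rnorm(1000)    # covariate at every pixel
beta <- -0.8        # hypothetical coefficient
LN <- 3.2           # hypothetical additive scaling constant
DN <- 2.5           # hypothetical divisor

# Normalized density with both scaling constants (assumed form)
q_scaled <- exp(LN + (beta / DN) * z) / sum(exp(LN + (beta / DN) * z))

# Same density with LN dropped and DN absorbed into beta_star
beta_star <- beta / DN
q_plain <- exp(beta_star * z) / sum(exp(beta_star * z))

max(abs(q_scaled - q_plain))    # difference is at rounding-error level
```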
Comparison between maxent and maxlike
Simulation Study


Fig. 3. Comparison between maxlike and maxent estimates to true values of ψ(y = 1|z(x)). The grey lines represent the relationship between the estimate and the true value for each of 100 simulated data sets. The maxent index is not proportional to the probability of occurrence.
North American Breeding Bird Survey
We applied the maximum likelihood estimator of occurrence probability to data from the North American Breeding Bird Survey (BBS) and compared the resulting estimates to predictions based on logistic regression and also to maxent’s ‘logistic output’. We fitted models to data on the Carolina wren (Thryothorus ludovicianus) using four land cover variables (per cent cover of mixed forest, deciduous forest, coniferous forest and grasslands) and latitude and longitude. We considered quadratic effects for each covariate. Fitting this model in maxent required that we modify the default settings, so that so-called hinge, threshold and product features were disabled. We restricted our analysis to the 2222 BBS routes surveyed in the United States during 2006, the year when the land cover variables were measured. Each BBS route is an approximately 40-km-long stretch of road consisting of 50 ‘stops’, points at which observers record counts of all bird species seen or heard during a 3-min period.
Traditional analyses of BBS data treat either the stop or the route as the sample unit; however, maxent requires data formatted as rasters (spatially referenced grids) and treats the pixel as the sample unit. Thus, we imposed a 25 km² grid over the study area, and for pixels with >1 stop, we classified the pixel as being occupied (yi = 1) if ≥1 detection was made at any of the stops in the pixel or unoccupied (yi = 0) otherwise. Note that only logistic regression made use of the yi = 0 data. Covariate values for each pixel in the United States were used as the ‘background’.
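The pixel-level classification rule can be sketched as follows. The data frame stops and its columns are hypothetical stand-ins for stop-level BBS records that have already been assigned to grid pixels; they are not the data used in our analysis.

```r
# Hypothetical stop-level records: pixel id and whether the species was detected
stops <- data.frame(
  pixel    = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
  detected = c(0, 1, 0, 0, 0, 1, 1, 0, 0)
)

# A pixel is occupied (y = 1) if at least one of its stops has a detection
y_pixel <- tapply(stops$detected, stops$pixel,
                  function(d) as.integer(any(d == 1)))
y_pixel

# The presence-only data retain just the occupied pixels
presence_pixels <- names(y_pixel)[y_pixel == 1]
presence_pixels
```

In this toy example, pixels 1 and 3 are classified as occupied. Only a presence–absence analysis such as logistic regression would also use the zeros for pixels 2 and 4.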
Of the three methods, the logistic regression model makes the most use of the data and thus is expected to outperform the presence-only models. More generally, presence–absence data should always be preferred to presence-only data because observed zeros are informative about the species’ range. Predicted probability maps from logistic regression have a clear interpretation as the probability that a pixel would yield an observation of the species in question. For these reasons, we considered the results of the logistic regression models as the standard against which we compared results of the other two estimators.
Maps of the Carolina wren distribution created using each of the three estimation methods are shown in Fig. 4. The salient points of the analysis are that maxent’s logistic output is less similar to the logistic regression predictions than the output of the maximum likelihood estimator, and that it is generally inconsistent with the observed data in the sense that the resulting ‘index’ of species range is less defined and more geographically diffuse. From the maps, we see that maxent’s ‘logistic output’ greatly underestimates the probability of occurrence throughout the core of the species’ range and overestimates occurrence probability in regions where the species was never detected. The reason for this bias is the same as the bias in our simulation study: maxent uses a default intercept value that implies baseline prevalence of 0·50. Clearly, the maxent predictions will depend on this value, which is not estimated from the data using the maxent procedure, and this subjectivity prohibits a clear interpretation of the index. Thus, while one might obtain more consistency using a different value of this parameter, there is no objective basis for setting this value.

Fig. 4. Maps of the Carolina wren distribution generated using the three estimators applied to Breeding Bird Survey data.
Another important limitation of maxent for modelling these data relates to the difficulty of specifying the desired model of interest. For example, one cannot test for a specific interaction because the software requires that either all or none of the possible interactions are tested. Similarly, one cannot evaluate the possibility of a specific quadratic effect.
Discussion
Inference about occupancy from presence-only data has proved to be elusive. Rather than developing methods for direct inference about species occurrence, ecologists have settled for the production of ill-defined ‘suitability indices’ such as those issued by maxent. However, under random sampling, formal statistical inference about the probability of species occurrence can be achieved from presence-only data using conventional likelihood methods (Lancaster and Imbens 1996). We imagine that inference based on this likelihood should be accessible to practitioners familiar with ordinary statistical concepts.
Our simulation study using a standard logistic regression type of model indicates that occupancy probability is identifiable from presence-only data, consistent with what has been shown in related classes of models (Lancaster and Imbens 1996; Lele and Keim 2006). Some might argue that parametric assumptions are overly restrictive and, as a result, it is better to estimate something vaguely defined, which only might be proportional to occurrence probability. However, the most sensible and natural interpretation of the model underlying maxent is that it also assumes a parametric relationship between ψ(y = 1|z) and covariates, one that is exponential. This is widely justified based solely on the incorrect assertion that the marginal probability of occurrence is not identifiable, and not based on any specific benefit of the exponential function. The lack of identifiability problem is specific to the parametric model that is implemented in maxent, and not a general feature of presence-only data. In our view, it does not make sense to forgo estimation of what is an eminently sensible quantity in the absence of any concrete technical or conceptual argument.
The ability to estimate occurrence probability parameters from presence-only data naturally requires larger sample sizes than from presence–absence data. Our analysis of data from the North American BBS provides sufficient data for this purpose, but such data may be unavailable in many studies. This was noted by Ward et al. (2009), who qualified their statement about identifiability by noting that
Even when [...occurrence probability...] is identifiable, the estimate is highly variable.
While clearly the precision of estimates of model parameters in any specific application is a matter of sample size and complexity of the model, this does caution that sufficiently precise estimates might not be produced for all applications. We do not view this as a serious deterrent to seeking out estimates for parameters of models that are ecologically sensible independent of whether or not data are available to achieve a certain level of precision. Whether an estimator is ‘highly variable’ in a situation is not relevant if the estimand is the object of inference, and there are not competing (presumably more precise) estimators available for achieving that objective.
We emphasize that random sampling is critical because under this assumption, the marginal probability that a presence-only unit is included in the sample is constant, and thus, the sample inclusion probability does not affect π(x|y = 1) constructed by invocation of Bayes rule. The issue of imperfect detection does not affect the development of the estimator, but it does affect interpretation of results. If detection probability, p, is constant spatially, then parameter estimators under the random sampling presence-only model are unaffected. Despite this, imperfect detection poses a number of complications. It stands to reason that detection probability should be influenced by many things, including nuisance effects such as ‘effort’, which might reflect human population or road density. In that case, what we obtain in many samples is a random sample of presence-only sites from among those that are easy to access or sample. Also, detection probability might be related to ecological processes including population density of the species being studied, such that detection is more likely to occur at high-density sites and vice versa (Royle and Nichols 2003). In such cases, we expect to obtain a sample of presence-only sites that is biased toward high-density sites. Despite the importance of imperfect detection, it is possible to accommodate this in formal models for inference about species occurrence probability. So-called ‘occupancy models’ (MacKenzie et al. 2002; Tyre et al. 2003) generalize the logistic regression model to allow for imperfect detection, that is, false negatives, and some work has also been done to accommodate false positives within the occupancy modelling framework (Royle and Link 2006; Miller et al. 2011).
In our view, application of any statistical procedure to presence-only data should acknowledge the core assumptions, seriously consider consequences of their violation on inferences and discuss them in the context of the specific study.
maxent is a popular software package for producing species distribution models, but it is not based on a formal model for species occurrence. As such, the focus of maxent, as it is applied in practice, is not on formal inference about mechanisms responsible for observed species distribution pattern. We believe that it is not widely appreciated that direct inference about occurrence probability can be achieved using standard likelihood methods. We believe that the likelihood approach advanced in this paper offers a better framework for species distribution modelling because it allows users to estimate the fundamental parameter governing species distributions, the probability of occurrence. In this paper, we have also tried to provide context to certain technical facets of maxent that have important consequences. First among those are the implicit assumption equating the conditional probability Pr(x|y = 1) to the specific exponential function referred to as the ‘maximum entropy distribution’ and the implication that occupancy probability is modelled by an exponential model, thereby neglecting estimation of the intercept parameter and forsaking the ability to estimate occurrence probability. Second, we believe that many maxent users are unaware of the relevance of the penalty term that appears fundamental to maxent. To date, there has not been a formal justification of the need, importance or consequences of the penalty in the context of species distribution modelling, and there has been no mention of the possible bias introduced by this procedure. In our view, poorly motivated and justified technical elements of maxent distract from understanding the central inference problem of species distribution modelling.
Acknowledgements
We thank Peter Blank for providing habitat and weather data sets used in the BBS analysis, and members of PWRC's WMS Research Group.