Comparative interpretation of count, presence–absence and point methods for species distribution models
Correspondence site: http://www.respond2articles.com/MEE/
Summary
1. The need to understand the processes shaping population distributions has resulted in a vast increase in the diversity of spatial wildlife data, leading to the development of many novel analytical techniques that are fit-for-purpose. One may aggregate location data into spatial units (e.g. grid cells) and model the resulting counts or presence–absences as a function of environmental covariates. Alternatively, the point data may be modelled directly, by combining the individual observations with a set of random or regular points reflecting habitat availability, a method known as a use-availability design (or, alternatively, a presence–pseudo-absence or case–control design).
2. Although these spatial point, count and presence–absence methods are widely used, the ecological literature is not explicit about their connections and how their parameter estimates and predictions should be interpreted. The objective of this study is to recapitulate some recent statistical results and illustrate that under certain assumptions, each method can be motivated by the same underlying spatial inhomogeneous Poisson point process (IPP) model in which the intensity function is modelled as a log-linear function of covariates.
3. The Poisson likelihood used for count data is a discrete approximation of the IPP likelihood. Similarly, the presence–absence design will approximate the IPP likelihood, but only when spatial units (i.e. pixels) are extremely small (Electronic Journal of Statistics, 2010, 4, 1151–1201). For larger pixel sizes, presence–absence designs do not differentiate between one or multiple observations within each pixel, hence leading to information loss.
4. Logistic regression is often used to estimate the parameters of the IPP model using point data. Although the response variable is defined as 0 for the availability points, these zeros do not serve as true absences as is often assumed; rather, their role is to approximate the integral of the denominator in the IPP likelihood (The Annals of Applied Statistics, 2010, 4, 1383–1402). Because of this common misconception, the estimated exponential function of the linear predictor (i.e. the resource selection function) is often assumed to be proportional to occupancy. Instead, as in IPP and count models, this function is proportional to the expected density of observations.
5. Understanding these (dis-)similarities between different species distribution modelling techniques should improve biological interpretation of spatial models and therefore advance ecological and methodological cross-fertilization.
Introduction
Ecological conservation and management require an understanding of the distribution of populations and the covariates that shape them. These information needs have fuelled the collection of a vast amount of data on species distributions. Such data come in many forms (e.g. presence–absence, presence-only, use-availability and count data) and consequently have led to the development of many novel analytical techniques (Buckland & Elston 1993; Guisan, Edwards, & Hastie 2002; Hirzel & Guisan 2002; Drake, Randin, & Guisan 2006; Elith et al. 2006; Leathwick et al. 2006; Pearce & Boyce 2006). Unfortunately, it is not always clear how the findings of these techniques should be interpreted biologically.
Invariably, all spatial observations on individual plants and animals can be treated as unique points. In addition to the spatial position and time of observation, each point is characterized by individual-specific features (e.g. age, sex, travel direction and behaviour), local environmental conditions (e.g. canopy closure, proximity to water, etc.) or characteristics influencing the observation process (e.g. interference from weather conditions or visibility to the observer as a function of distance). These points arise from complex ecological processes that can be summarized by a temporally and spatially heterogeneous intensity function (Warton & Shepherd 2010).
In an attempt to reconstruct this intensity function, many researchers discretize space, recording either the presence/absence of a species in each unit or the number of occurrences per unit area and effort. Such data can then be used to quantify empirically how the distribution of a species depends on environmental variables (Buckland & Elston 1993; Mackenzie & Royle 2005), e.g. by using generalized linear models (GLM; McCullagh & Nelder 1989) or generalized additive models (GAM; Wood 2006). Count methods, in particular, are well-established in ecology, primarily because of their potential for estimating and predicting population abundance (Guisan & Zimmermann 2000; Austin 2002; Buckland et al. 2004; Guisan & Thuiller 2005).
For some methods of data collection, such as the remote tracking of individual animals or presence-only records of plants, the data are, by default, point observations in space and time. Such data have fuelled the development of methods that use individual point observations directly (Boyce & McDonald 1999), such as the use-availability, presence–pseudo-absence or case–control design (Pearce & Boyce 2006). These methods quantify a species’ preference for environmental variables by comparing observed locations with a random selection of points in space (and time) reflecting habitat availability. Because such models link species distribution data directly with environmental variables, they are also used to make spatial predictions for other points in space and time for which the necessary environmental data are available. However, it is currently unclear how the parameter estimates and predictions of point methods relate to those based on count or presence–absence data. As a consequence, these approaches have been developed independently, limiting methodological and ecological cross-fertilization.
Our main objective in this review article is to clarify the similarities between these seemingly disparate methods for studying species distribution and habitat selection. We first describe the conceptual transition from traditional habitat selection analyses (in discrete environmental space) to species distribution modelling (in geographical space). Specifically, researchers studying habitat use traditionally evaluated the importance of environmental variables by comparing the proportion of observations falling into different habitat types to the availability (measured in units of area) of these habitat types on the landscape (Johnson 1980). To allow for a sufficient number of observations in each habitat, environmental variables were generally discretized into coarse bins. Over the past few decades, the development and popularization of geographic information systems and remote sensing techniques have led to an enormous increase in the number and spatial resolution of environmental covariates. As a result, most researchers now model the distribution of species in geographical space as a function of a complex suite of spatial predictors. Together, traditional habitat selection analysis and more recent species distribution modelling have considered a variety of response data. Here, we compare the likelihood functions most often used with these different response data. In particular, we focus on point and count data. Owing to work performed by others (Diggle 1990; Cressie 1993; Diggle & Rowlingson 1994; Baddeley & Turner 2000; Lele & Keim 2006; Lele 2009; Baddeley et al. 2010; Diggle, Kaimi, & Abellana 2010a; Warton & Shepherd 2010), it is possible to show that the Poisson likelihood function used for count data, and likelihood functions used for point methods [e.g. Weighted Distribution Theory (WDT) and point logistic regression] can be motivated by the same underlying inhomogeneous Poisson point process (IPP) model.
Using simulations, we assess whether these approaches indeed give similar parameter estimates and standard errors. Drawing connections between these approaches should help to clarify the interpretation of the estimated regression parameters (e.g. see Keating & Cherry 2004) and ensure that biological conclusions regarding habitat preference and spatial predictions are insensitive to the method used.
Materials and methods
Habitat use and preference in discrete environmental space
Species or individual occurrences in geographical space are largely driven by preference (or avoidance) for underlying environmental conditions. Therefore, habitat studies mostly examine the choices individuals make in environmental, rather than geographical space (Fig. 1 and see Buckland & Elston 1993; Boyce & McDonald 1999). Here, environmental space has k dimensions, one for each environmental variable affecting the distribution of a species. In this space, an arbitrarily small convex hull, centred on a combination of environmental conditions X_{i} = (x_{1}, …, x_{k}), is defined as a habitat or environmental unit.
Habitat use is not only influenced by an organism’s preference for various environmental conditions, but also by the relative abundance and distribution of these habitats (i.e. their ‘availability’). When organisms show no preference and move randomly, resulting in a homogeneous distribution of usage in geographical space, habitat use is proportional to habitat availability. Therefore, deviations from proportionality indicate the existence of preference (or avoidance). Consequently, many analyses define the preference w(X) as the ratio of habitat use over availability (Manly et al. 2002).
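The ratio of use over availability can be illustrated with a small calculation. The sketch below is hypothetical: the habitat names, counts and areas are invented for illustration, and the function name `selection_ratios` is not taken from any package.

```python
def selection_ratios(use_counts, available_area):
    """Preference per habitat type: (proportion of use) / (proportion of available area)."""
    total_use = sum(use_counts.values())
    total_area = sum(available_area.values())
    ratios = {}
    for habitat, area in available_area.items():
        prop_used = use_counts.get(habitat, 0) / total_use
        prop_available = area / total_area
        ratios[habitat] = prop_used / prop_available
    return ratios


# Hypothetical data: number of observations and area (km^2) per habitat type.
use_counts = {"forest": 60, "grassland": 30, "wetland": 10}
available_area = {"forest": 40.0, "grassland": 40.0, "wetland": 20.0}

ratios = selection_ratios(use_counts, available_area)
for habitat, ratio in ratios.items():
    print(habitat, round(ratio, 2))
```

A ratio above 1 indicates that a habitat is used more than expected from its availability (preference), and a ratio below 1 indicates the opposite (avoidance).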
Use and preference in continuous environmental space
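When environmental space is treated as continuous, the distribution of used environmental conditions, f^{u}(X), can be written as the product of preference w(X) and habitat availability f^{a}(X), normalized to integrate to 1: f^{u}(X) = w(X)f^{a}(X)/∫w(X)f^{a}(X)dX = w(X)f^{a}(X)/K (eqn 1). The denominator K is a normalizing constant ensuring that f^{u}(X) integrates to 1 over all X (Lele & Keim 2006). Typically, f^{a}(X_{i}) is determined by the distribution of environmental conditions within a predefined study area. In the simplest case, all points in space (within the study area) are assumed to be equally accessible, but this can easily be extended to situations where accessibility is defined more flexibly (Matthiopoulos 2003b; Johnson et al. 2008).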
Any non-negative function of X can be used to model preference, but the exponential function is used most frequently: w(X) = exp(Xβ). This formulation of preference is known as a resource selection function (RSF) (Boyce & McDonald 1999).
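One consequence of the exponential form is that only relative preference is identifiable from the distribution of used conditions. The short derivation below is a sketch that assumes eqn 1 has the weighted-distribution form f^{u} = w f^{a}/K implied by the definition of K; the intercept β_{0} is added here only to show that it cancels.

```latex
\begin{align*}
f^{u}(X) &= \frac{\exp(\beta_{0} + X\beta)\, f^{a}(X)}{\int \exp(\beta_{0} + X\beta)\, f^{a}(X)\, dX} \\
         &= \frac{\exp(\beta_{0})\exp(X\beta)\, f^{a}(X)}{\exp(\beta_{0})\int \exp(X\beta)\, f^{a}(X)\, dX}
          = \frac{\exp(X\beta)\, f^{a}(X)}{\int \exp(X\beta)\, f^{a}(X)\, dX}.
\end{align*}
```

Multiplying w(X) by any positive constant therefore leaves f^{u}(X) unchanged, so the slope parameters β carry all the information about preference.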
Use and preference in continuous and discrete geographical space
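As a hedged sketch, the geographical-space analogue of eqn 1 can be written as follows, assuming that all locations S in the study area A are equally accessible, so that availability is uniform in space. The symbols A and X(S) are notation introduced here for illustration; the form is inferred from how eqn 2 is described in the text (a usage density h^{u}(S_{j}) with an integral in the denominator) and may differ in detail from the published equation.

```latex
h^{u}(S_{j}) = \frac{w\big(X(S_{j})\big)}{\int_{A} w\big(X(S)\big)\, dS}
```

Under this form, usage at a location depends only on the preference for the environmental conditions found there, relative to the preference integrated over the whole study area.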
Definition of the response based on the realized distribution of observations
Equations 1 and 2 describe the link between the organisms’ preference for environmental conditions and their distribution in space. The ultimate objective is to reconstruct the surface of usage h^{u}(S_{j}) (Fig. 1c) from spatial data and to quantify how it arises from the organism’s environmental preference w (Fig. 1a). Several different response variables have been used to estimate h^{u} and w (see also Table 1):
Point of view | Method | Discretization | Response data | Likelihood function |
---|---|---|---|---|
Environmental space | Discrete habitat selection analysis | In environmental space, resulting in discrete habitat types | The ratio of the amount of use (e.g. time spent) over availability (e.g. area) | Most analyses employ parametric or nonparametric hypothesis testing (not addressed here, but see e.g. Johnson 1980) |
Environmental space | Use-availability design based on point data | None | Two separate samples of points (used and available). When using logistic regression, the response variable is equal to 1 for used points and 0 for the availability points | Conditional or unconditional inhomogeneous Poisson process (CIPP and UIPP, respectively), Weighted Distribution theory* or Partial likelihood (of which logistic regression is a special case) |
Geographical space | Spatial point process | None | Idem | Idem |
Geographical space | Count method | In geographical space, resulting in spatial units, e.g. grid cells | Number of species observations (i.e. counts) per unit of area | Poisson generalized linear model |
Geographical space | Presence–absence or occupancy modelling | Idem | Presence or absence of a species | Logistic regression or generalized linear model with complementary log-log link |
- *With an exponential model, this approach is equivalent to the conditional inhomogeneous Poisson process likelihood.
1. In discrete environmental space, preference is directly estimated as the ratio of habitat usage (number of observations) over habitat availability (units of area) (Johnson 1980; Manly et al. 2002).
2. In continuous environmental space, observations are realizations from f^{u}(X), contrasted to the environmental conditions at a random sample of points in space, representing availability, f^{a}(X). Most frequently, the combined data are analysed using logistic regression, with Y = 1 for observed locations and 0 for random points, and the exponential of the linear predictor (Xβ) is assumed to represent an RSF (Boyce & McDonald 1999), a function which is said to be proportional to the probability of use. Lele & Keim (2006) and Lele (2009) derived the general form of the likelihood function for use-availability designs (derived from eqn 1), which allows for other (i.e. non-exponential) functional forms for w(X).
3. In continuous geographical space (analogously to environmental space), the data consist of species observations that are realizations from a probability density function h^{u}(S_{j}), and a sample of points in space can again be used to approximate the denominator in eqn 2 (Diggle 1990; Diggle & Rowlingson 1994; Diggle, Kaimi, & Abellana 2010a).
4. In discrete geographical space,
   a. Counts: the response variable is defined as the number of observations in each spatial unit (e.g. a grid cell) and such counts can be modelled using a Poisson GLM (Buckland & Elston 1993; Guisan, Edwards, & Hastie 2002). The expected count in grid cell j is modelled as λ_{j} = exp(β_{0} + X_{j}β), where the covariates X_{j} are typically grid cell averages or values measured at the cell’s centroid.
   b. Presence–absence: here, similar to count data, the spatial domain is divided into spatial units and for each unit, the response variable is defined as 1 if at least one observation is present in that cell, and 0 otherwise. Such occupancy data (MacKenzie et al. 2005) are often modelled as a Bernoulli random variable. Most often, a logit link is used to model the presence probability as a function of predictors. Alternatively, a complementary log-log link provides a more natural parameterization for continuous point process models observed in discrete space (Prentice & Gloeckler 1978) and can facilitate comparisons and predictions across grids that differ in their grid cell size (Baddeley et al. 2010).
All of these response variables attempt to link the distribution of an organism to the environmental conditions at which it is observed. In some cases, the resulting model is subsequently used to make spatial predictions. However, it is not evident what the preference function, w(X) in eqn 1, represents biologically and whether the different specifications of the response variable lead to similar parameter estimates. This issue can be resolved by comparing the different likelihood functions employed to estimate the parameters in the different models.
Likelihood functions
We now consider the case of continuous environmental space (point 2 in Definition of the response based on the realized distribution of observations above). The M observations with environmental conditions X_{i} (i = 1, …, M) are random realizations from the process f^{u}(X), which can be written as a function of preference w and habitat availability f^{a}(X) (eqn 1), leading to the WDT log-likelihood of Lele & Keim (2006). The WDT likelihood is equivalent to that of a spatially continuous conditional IPP in which the total number of points is considered fixed, with w playing the role of the intensity function λ (Cressie 1993, page 651, eq. 8·5·3).
We will refer to the full likelihood (eqn 4) as the unconditional IPP (UIPP) and eqn 5 as the conditional IPP approach (CIPP).
Although the UIPP (eqn 4) and the CIPP (eqn 5) likelihoods look similar, they differ in two aspects. First, because the CIPP treats the total number of observations as fixed, the intercept is not identifiable, i.e. β_{0} will drop out of eqn 1 as it appears in both the numerator and the denominator. Second, in contrast to the UIPP (eqn 4), the right part of the CIPP likelihood (eqn 5) uses the log of the integral. The main question is, therefore, whether maximization of the UIPP likelihood and of the CIPP likelihood lead to the same slope parameters β. In Appendix A, we show this is indeed the case, i.e. β̂_{UIPP} = β̂_{CIPP}. Therefore, we can conclude that the UIPP likelihood (and its discrete approximation, the Poisson GLM) and the CIPP likelihood will result in similar estimates of preference, w(X). However, in contrast to the CIPP likelihood, the UIPP likelihood allows the estimation of the intercept and hence the absolute density of observations in geographical space.
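The agreement between the two sets of slope estimates can be seen from a short profile-likelihood argument. The derivation below is a sketch using the standard forms of the unconditional and conditional IPP log-likelihoods with intensity λ(s) = exp(β_{0})w(X(s)); the symbols s_{i}, A, ℓ_{U} and ℓ_{C} are notation introduced here and may not match eqns 4 and 5 exactly.

```latex
\begin{align*}
\ell_{U}(\beta_{0}, \beta) &= \sum_{i=1}^{M} \big[\beta_{0} + \log w\big(X(s_{i})\big)\big]
                              - e^{\beta_{0}} \int_{A} w\big(X(s)\big)\, ds, \\
\ell_{C}(\beta) &= \sum_{i=1}^{M} \log w\big(X(s_{i})\big)
                   - M \log \int_{A} w\big(X(s)\big)\, ds. \\[4pt]
\frac{\partial \ell_{U}}{\partial \beta_{0}} = 0
  &\;\Rightarrow\; e^{\hat{\beta}_{0}(\beta)} = \frac{M}{\int_{A} w\big(X(s)\big)\, ds}, \\[4pt]
\ell_{U}\big(\hat{\beta}_{0}(\beta), \beta\big)
  &= \sum_{i=1}^{M} \log w\big(X(s_{i})\big) - M \log \int_{A} w\big(X(s)\big)\, ds + M \log M - M \\
  &= \ell_{C}(\beta) + \text{constant}.
\end{align*}
```

Because the profiled unconditional log-likelihood differs from the conditional one only by a term that does not involve β, both are maximized at the same slopes; only the unconditional form additionally yields an intercept.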
Finally, we consider presence–absence data in discrete geographical space. To quantify the occupancy probability as a function of environmental covariates, such presence–absence data are most often modelled using the logistic regression log-likelihood function. This likelihood function will only approximate the IPP likelihood when spatial pixels are extremely small (Baddeley et al. 2010). Under a coarse discretization, presence–absence data do not differentiate between one or several observations being present in each cell, leading to some loss of information compared to count and point data. How much information is lost depends on the resolution of the spatial (and temporal) discretization and the organism’s prevalence. In some cases, using presence–absence data may be unavoidable. In this case, Baddeley et al. (2010) provide empirical and theoretical evidence that a complementary log-log link, with the logarithm of pixel area included as an offset, provides the best approximation of the parameters of the (log-linear) IPP model – i.e. the aforementioned approach is preferable to using logistic regression, which assumes a logit link.
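The effect of the link function can be checked numerically. If counts in a pixel of area a are Poisson with mean a·exp(β_{0} + βx), the presence probability is 1 − exp(−a·exp(β_{0} + βx)), so the complementary log-log of that probability minus log(a) recovers the linear predictor exactly. The sketch below assumes a covariate that is constant within each pixel and uses illustrative values β_{0} = −1 and β = 0·5; the function names are hypothetical.

```python
import math


def presence_prob(x, area, beta0=-1.0, beta=0.5):
    """P(at least one point in a pixel) when the count is Poisson with mean area * exp(beta0 + beta * x)."""
    mu = area * math.exp(beta0 + beta * x)
    return 1.0 - math.exp(-mu)


def cloglog(p):
    return math.log(-math.log(1.0 - p))


def logit(p):
    return math.log(p / (1.0 - p))


# Change in each link scale when the covariate increases from 0 to 1, for three pixel areas.
results = {}
for area in (0.01, 1.0, 4.0):
    p0, p1 = presence_prob(0.0, area), presence_prob(1.0, area)
    offset = math.log(area)  # log pixel area, used as an offset on the cloglog scale
    results[area] = {
        "cloglog_slope": (cloglog(p1) - offset) - (cloglog(p0) - offset),
        "logit_slope": logit(p1) - logit(p0),
    }

for area, slopes in results.items():
    print(area, round(slopes["cloglog_slope"], 3), round(slopes["logit_slope"], 3))
```

On the complementary log-log scale the slope equals β for every pixel size, whereas the slope on the logit scale approaches β only for very small pixels and grows as pixels become larger.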
Numerical integration of the IPP likelihood functions
The integral of the denominator of the IPP likelihood functions (eqns 4 and 5) is intractable, but can be approximated by evaluating w(X) at a set of availability or control observations. Perhaps the simplest approach is to use a Monte Carlo approximation to the integral, given by the average (1/B) Σ_{j=1}^{B} w(X_{j}) (multiplied by the area of the study region when the integral is taken over geographical space), where X_{j} (j = 1,…,B) are the environmental conditions at a (large) random sample of points from geographical space (Lele & Keim 2006). Alternatively, one can use quadrature methods to perform numerical integration (Baddeley & Turner 2000). A variety of methods can be used to choose availability points and their weights. A simple approach is to divide space into a regular grid, place availability points at the centre of each pixel and form quadrature weights for each point (used and available) as α_{i} = a_{i}/n_{i}, where a_{i} is the area of the ith cell and n_{i} is the total number of points that fall in the ith cell. Finally, a weighted log-linear Poisson model can be fit to the data, with the value of the response variable equal to 0 for availability points and 1/α_{i} for the used points. The quadrature weights can be easily specified as ‘prior weights’ in most GLM software packages. The quadrature method can be more efficient, because the used points also contribute to estimation of the IPP integral (Baddeley & Turner 2000).
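The two ways of placing availability points, and the grid-based quadrature weights, can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: it assumes a unit-square study area (area 1), a hypothetical covariate equal to the first coordinate, and w = exp(0·5x), for which the integral is known exactly. All function names are invented for this example.

```python
import math
import random

BETA = 0.5  # slope of the preference function (illustrative value)


def covariate(s1, s2):
    """Hypothetical covariate surface on the unit square: a gradient along the first axis."""
    return s1


def preference(x, beta=BETA):
    return math.exp(beta * x)


def mc_integral_random(n_points, rng):
    """Monte Carlo approximation of the integral of w using randomly placed availability points."""
    total = sum(preference(covariate(rng.random(), rng.random())) for _ in range(n_points))
    return total / n_points  # study area is 1, so no area multiplier is needed


def mc_integral_regular(n_side):
    """Same average, with availability points at the centres of an n_side x n_side grid."""
    total = 0.0
    for i in range(n_side):
        for j in range(n_side):
            total += preference(covariate((i + 0.5) / n_side, (j + 0.5) / n_side))
    return total / (n_side * n_side)


def cell_of(s1, s2, n_side):
    """Index of the grid cell containing the point (s1, s2)."""
    return (min(int(s1 * n_side), n_side - 1), min(int(s2 * n_side), n_side - 1))


def quadrature_weights(used_points, n_side):
    """Weights alpha_i = a_i / n_i, with one availability point per cell plus the used points."""
    cell_area = 1.0 / (n_side * n_side)
    cells = [(i, j) for i in range(n_side) for j in range(n_side)]
    n_in_cell = {cell: 1 for cell in cells}  # start with one availability point per cell
    for s1, s2 in used_points:
        n_in_cell[cell_of(s1, s2, n_side)] += 1
    used_w = [cell_area / n_in_cell[cell_of(s1, s2, n_side)] for s1, s2 in used_points]
    avail_w = [cell_area / n_in_cell[cell] for cell in cells]
    return used_w, avail_w


exact = (math.exp(BETA) - 1.0) / BETA  # closed-form integral of exp(BETA * s1) over the unit square
rng = random.Random(42)
approx_random = mc_integral_random(10_000, rng)
approx_regular = mc_integral_regular(10)  # only 100 regularly placed points

used = [(rng.random(), rng.random()) for _ in range(50)]  # hypothetical used points
used_w, avail_w = quadrature_weights(used, 10)

print(exact, approx_random, approx_regular, sum(used_w) + sum(avail_w))
```

In this smooth example, 100 regularly placed points already approximate the integral closely, and the quadrature weights of used and availability points together sum to the area of the study region.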
The major advantage of the IPP approaches and the aforementioned numerical integration techniques is that each used point can be linked to the exact underlying environmental conditions or characteristics associated with each observation. Furthermore, in Poisson GLMs, the choice of the spatial scale of the grouping is arbitrary, while for the IPP the number of availability points can simply be increased until an acceptable approximation is achieved (Warton & Shepherd 2010).
Estimating parameters using logistic regression fitted to point data
Most often, if preference (or the resource selection function) is modelled as an exponential function of the linear predictor, logistic regression is used. Here, Y = 1 for the species locations and Y = 0 for the availability points (Boyce & McDonald 1999). The binomial likelihood can be seen as a special case of the partial likelihood (Gilbert, Lele, & Vardi 1999; Lele 2009), and if the number of availability points increases towards infinity, point logistic regression estimators for model parameters will converge to values obtained by maximizing eqn 4 (Diggle & Rowlingson 1994; Lele 2009; Warton & Shepherd 2010). In contrast to the CIPP approach, logistic regression will also estimate an intercept, which is defined as log[α/(1 − α)] − log(K), where α is the fraction of observations where Y = 1 and K is the normalizing constant of eqn 1. This may have consequences for the standard error estimates, which will be examined by means of simulation, below.
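The form of the logistic regression intercept follows from treating the pooled sample as a case–control design. The derivation below is a sketch under that assumption, with n_{u} used points drawn from f^{u}(X) = exp(Xβ)f^{a}(X)/K and n_{a} availability points drawn from f^{a}(X); the symbols n_{u} and n_{a} are introduced here for illustration.

```latex
\begin{align*}
P(Y = 1 \mid X) &= \frac{n_{u}\, f^{u}(X)}{n_{u}\, f^{u}(X) + n_{a}\, f^{a}(X)}
                 = \frac{n_{u}\exp(X\beta)/K}{n_{u}\exp(X\beta)/K + n_{a}}, \\[4pt]
\operatorname{logit} P(Y = 1 \mid X) &= \log\frac{n_{u}}{n_{a}} - \log K + X\beta
                 = \log\frac{\alpha}{1 - \alpha} - \log K + X\beta,
\qquad \alpha = \frac{n_{u}}{n_{u} + n_{a}}.
\end{align*}
```

The slopes β are therefore the same as in the preference function, while the intercept absorbs both the sampling fraction and the normalizing integral.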
Data simulation
Several studies have confirmed that logistic regression fitted to point data leads to asymptotically unbiased parameter estimates (Lele & Keim 2006; Lele 2009; Warton & Shepherd 2010), but estimates of standard errors, which are crucial for drawing the correct biological conclusions, are reported as inappropriate (Lele & Keim 2006). So far, this aspect has received little attention. By simulating the distribution of organisms arising from a known intensity surface, we try to understand the behaviour of the parameter estimates and standard errors resulting from the different likelihood functions.
Parameter estimates will vary across different realizations of the species’ distribution, but we also postulated that large errors introduced by the approximation of the integral in the denominator of the likelihood function could degrade the performance of parameter and standard error estimators. To test this hypothesis, we compared the IPP and logistic regression estimators fitted to point data using either n = 100 or 10 000 availability points. Although n = 100 will give a poor approximation to the integral in the likelihood function, which can be detrimental to the parameter and standard error estimators, it may be unavoidable. For example, collecting data on the environmental conditions underlying the availability points can be expensive and logistically challenging. In this case, placing the availability points efficiently and choosing the best numerical integration technique is essential. Although Monte Carlo simulation is used most often, other techniques, such as numerical quadrature, may be more efficient because the used points also contribute to estimation of the IPP integral (Baddeley & Turner 2000; Warton & Shepherd 2010). Following sampling design theory (Gruijter et al. 2006), we may also expect improvement in the approximation of the integral by placing points regularly, instead of randomly, in space.
To assess the performance of the different estimators, numerical integration techniques and the effect of the placement of the availability points, we compared (i) maximum likelihood estimates of the UIPP model obtained using Monte Carlo integration with either randomly placed or regularly placed availability points; (ii) maximum likelihood estimates of the UIPP model obtained using numerical quadrature [implemented using the dirichlet tessellation function (‘dirichlet.weights’, package ‘spatstat’) and the ‘glm’ function in program R], (iii) logistic regression with randomly placed or regularly placed availability points, and (iv) Poisson GLM fitted to count data. Detailed simulations based on discrete presence–absence data can be found in Baddeley et al. (2010).
Without loss of generality, we assumed that the simulated organisms respond to just one, spatially autocorrelated environmental variable on a fine grid (100 × 100 cells) (Fig. 2a). The preference for each cell was specified as w(x_{j}) = exp(βx_{j}), with β = 0·5, where x_{j} is the value of the environmental variable in cell j. As each cell is of equal size and the arena is assumed to be equally accessible, the expected usage h^{u}(S_{j}) of the jth cell (in geographical space) will be proportional to preference (eqn 1 and Fig. 2b). Therefore, in the simulation each individual randomly selected a cell with a probability proportional to that cell’s expected usage. This was repeated for M = 3000 individuals (Fig. 2c).
For the IPP and logistic regression approaches, we used the environmental conditions underlying the species observations and a set of points reflecting habitat availability. The availability points were either placed in the centre of each grid cell (regular design) or randomly in space (random design) (Fig. 2e). When fitting a Poisson GLM, we calculated the counts of observations in each grid cell (Fig. 2d). The quadrature weights were based on a similar grid. Although the expected usage in the simulation was known exactly, the observed distribution of organisms represents a single realization of this stochastic process. Therefore, each realization will result in slightly different parameter estimates (and standard errors). To capture this variability, we repeated the creation of the environmental covariate and species observations 500 times. Each simulation resulted in an estimate for the coefficient of the preference function, β̂. We compared the mean of these estimates to the true β and also calculated the Monte Carlo standard deviation of the estimates across the Q = 500 simulations (see Table 2). This statistic should be close to the mean standard error estimated by the model if the standard error estimator is unbiased.
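A single run of the count-based part of this simulation can be sketched in a few lines. The example below is a simplified illustration, not the code used in the study: the covariate is drawn independently for each cell (i.e. without spatial autocorrelation), a 50 × 50 grid is used to keep the run short, and the Poisson GLM is fitted with a hand-written Newton–Raphson routine. The function names are hypothetical.

```python
import math
import random


def simulate_counts(n_cells=2500, n_obs=3000, beta=0.5, seed=1):
    """Each individual selects a cell with probability proportional to exp(beta * x)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]  # simplification: no spatial autocorrelation
    w = [math.exp(beta * xi) for xi in x]
    chosen = rng.choices(range(n_cells), weights=w, k=n_obs)
    counts = [0] * n_cells
    for j in chosen:
        counts[j] += 1
    return x, counts


def fit_poisson_glm(x, y, n_iter=25):
    """Newton-Raphson for a log-linear Poisson model with an intercept and one covariate."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2)
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: inverse information times score
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    se_b1 = math.sqrt(i00 / det)  # model-based standard error of the slope
    return b0, b1, se_b1


x, counts = simulate_counts()
b0_hat, b1_hat, se_b1 = fit_poisson_glm(x, counts)
print(round(b1_hat, 3), round(se_b1, 3))
```

Repeating the two calls with different seeds and taking the standard deviation of the resulting slope estimates gives a Monte Carlo standard deviation that can be compared with the model-based standard error.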
Method | Generation of availability points | Mean estimate† (no. availability points* = 100) | Mean SE† (100) | MC SD† (100) | Mean estimate† (no. availability points = 10 000) | Mean SE† (10 000) | MC SD† (10 000) |
---|---|---|---|---|---|---|---|
UIPP – MC integration | Random | 0·515 | 0·021 | 0·138 | 0·500 | 0·021 | 0·025 |
UIPP – MC integration | Regular | 0·507 | 0·021 | 0·068 | 0·501 | 0·021 | 0·020 |
UIPP – Quadrature | Regular | 0·460 | 0·021 | 0·046 | 0·499 | 0·021 | 0·020 |
Logistic regression | Random | 0·500 | 0·114 | 0·114 | 0·500 | 0·024 | 0·025 |
Logistic regression | Regular | 0·503 | 0·114 | 0·052 | 0·501 | 0·024 | 0·021 |
Poisson GLM | – | 0·237 | 0·025 | 0·278 | 0·501 | 0·021 | 0·020 |
- *For the Poisson GLM, it represents the number of grid cells (i.e. 100 or 10 000).
- †Based on 500 simulation runs, the table shows the mean estimated slope of the preference function, the mean estimated standard error (SE) and the Monte Carlo (MC) standard deviation (SD) of the slope estimates obtained from fitting models to repeated (Q = 500 times) realizations of the distribution of animals (Fig. 2c). In all cases, the true value of β was set to 0·5. The type of model leading to the lowest MC standard deviation is best at estimating the parameters. Furthermore, if the parameter standard errors are estimated unbiasedly, they should be approximately equal to the MC standard deviation.
Simulation results
Mean parameter estimates were close to 0·5 (the true β) in all cases, except for the Poisson GLM when fit to a coarse grid (Table 2). The Poisson GLM in this case performed poorly because the scale at which the environmental covariate was distributed was much smaller than the resolution of the grid cells. Both IPP and logistic regression estimates based on point data were less variable (i.e. had smaller Monte Carlo standard deviations) when using a set of regular (vs. random) points, and these differences were more pronounced at the smaller sample size (100). Lastly, as expected, precision improved for all estimators when a larger sample of availability points was used; if a large sample cannot be obtained, the results suggest that point logistic regression and numerical quadrature are the most efficient approaches.
In contrast to the parameter estimates, standard errors were frequently biased. In particular, the UIPP model tended to overestimate precision, particularly when the number of availability points was small (see Table 2, which shows that standard errors for the UIPP approaches with 100 availability points did not differ from those of the 10 000-point simulation). Standard errors for the UIPP model were only estimated correctly when based on many, regularly spaced availability points (Table 2). Conversely, logistic regression underestimated precision (i.e. overestimated standard errors) when used in conjunction with a set of regular availability points, irrespective of their number.
Discussion
The ecological literature contains a diversity of novel statistical methods, each trying to improve our understanding of the complex link between a species and its environment (Buckland & Elston 1993; Hirzel & Guisan 2002; Fauchald & Tveraa 2003; Lehmann, Overton, & Leathwick 2003; Matthiopoulos 2003a; Drake, Randin, & Guisan 2006; Elith et al. 2006; Leathwick et al. 2006; Phillips, Anderson, & Schapire 2006; Guisan et al. 2007). Despite their differences, all of these methods require at least one response variable that relates to the distribution of the study species to estimate the model parameters. So far, this link between the processes responsible for the observed distribution of individual organisms (i.e. the spatial point process) and the response variable (and analysis method) has received little attention. Here, we argued that many popular analysis methods can be motivated by the same underlying exponential IPP model, and thus that the IPP model provides a useful unifying framework for modelling species distribution and habitat preference data. More specifically, this study illustrates, both analytically and through simulation, that treating species observations as points under a use-availability design results in similar estimates of habitat preference, w(X), to models fitted to count data in discrete space. This implies that under both survey designs, the preference function, w(X) (also known as the resource selection function or RSF), can be interpreted as providing information on the relative density of observations in space.
Parameter and standard error estimators
Although the interpretation of the regression parameters is the same whether data are analysed using logistic regression fitted to point data, WDT, IPP or Poisson GLMs, the likelihood function and numerical integration techniques used to estimate the parameters may lead to slightly different parameter estimates and standard errors, as seen in our simulation study. With the exception of the Poisson GLM fit to a coarse grid, estimators appeared to be unbiased, and their precision increased with the regularity and density of availability points. On the other hand, the simulation study suggests that the UIPP approach estimates standard errors as if there is no uncertainty resulting from the quality of the integral approximation. So, for example, standard errors will be more seriously underestimated when the analysis employs a small number of availability points. In contrast, logistic regression tended to overestimate the standard errors when availability points were placed regularly in space. This may be a result of the fact that in logistic regression, the estimated intercept is normalized by the integral in the denominator of the likelihood function. Consequently, some portion of the estimation uncertainty caused by a poor approximation of this integral may be absorbed by the intercept. Thus, there may be some advantages in using logistic regression on point data if it is difficult to obtain a large sample of availability points. Whenever possible, however, it is preferable to increase the number of availability observations to improve the approximation of the likelihood function (see Warton & Shepherd 2010). More research is certainly needed to determine whether these simulation results apply more generally.
The role of the availability or ‘pseudo-absence’ points
So what are the implications of the inherent similarity between the count methods and the spatial point process approaches for ecology? The literature on usage and habitat preference appears to be divided: many studies (Buckland & Elston 1993; Guisan & Zimmermann 2000; Guisan, Edwards & Hastie 2002) model the spatial distribution of species and use environmental variables as covariates to allow for predictions of densities in areas or times not sampled. In contrast, habitat preference or resource selection studies primarily focus on understanding why species select certain habitats. Although resource selection functions are used to make spatial predictions, the interpretation of these spatial maps is not always clear (Keating & Cherry 2004). This is especially true when logistic models are used in conjunction with a use-availability sampling design. Some studies have interpreted the exponential function of the linear predictor obtained from logistic regression fitted to point data as being proportional to occupancy (i.e. probability that a site is used at least once) (Boyce & McDonald 1999; Manly et al. 2002; Keating & Cherry 2004). This interpretation seems incorrect. Spatial predictions of preference (the exponential of the linear predictor) are instead proportional to the density of observations.
The most likely reason for this common misinterpretation is that logistic regression treats the control points reflecting habitat availability as zeros. Consequently, this may suggest that the zeros represent true absences. This assumption is incorrect; the ‘zeros’ reflect the availability of environmental conditions in the study area, and treating the availability points as zeros is just a numerical trick to estimate the parameters of the exponential preference function. Thinking in terms of a spatial point process clarifies the role of the availability points. Also, more recent methods have incorporated this principle. For example, Phillips, Anderson & Schapire (2006: 238) argue that one of the main benefits of using a maximum entropy model (MAXENT) over logistic regression and Poisson GLMs is that it provides a clearer interpretation of the data-generating mechanism, because areas without species records do not have to be treated as absences. With a clearer statement of the underlying data-generating model, we argue this ‘advantage’ is no longer relevant (it is also interesting to note that fitting a maximum entropy model is equivalent to maximizing the CIPP likelihood; see eqn 2 in Phillips, Anderson & Schapire 2006). The random points or pseudo-absences need not be assumed to be true absences when analysing data using logistic regression or Poisson GLMs; rather, these points are implicitly used to estimate integrals in the likelihood for the spatial point process model. A clear understanding of the role of availability points is also necessary to extend models to more complex situations, e.g. multilevel models that account for animal-to-animal variability in habitat preference studies (Fieberg et al. 2010).
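The role of the availability points as a numerical device can be sketched with simulated data. The example below is an illustration under stated assumptions, not the analysis used in this study: the study area is a hypothetical unit interval, the single covariate equals the location s, the true intensity is exp(beta0 + beta1 s) with arbitrary values for beta0 and beta1, and the helper functions `simulate_ipp` and `fit_logistic` are written here for illustration only. Observations are simulated by thinning a homogeneous Poisson process, availability points are placed on a regular grid and coded as zeros, and a logistic regression is fitted by Newton-Raphson.

```python
import numpy as np

def simulate_ipp(beta0, beta1, rng):
    """Simulate an IPP on [0, 1] with intensity exp(beta0 + beta1 * s) by thinning."""
    lam_max = np.exp(beta0 + max(beta1, 0.0))     # upper bound of the intensity on [0, 1]
    n = rng.poisson(lam_max)                      # homogeneous candidate points
    s = rng.uniform(0.0, 1.0, n)
    keep = rng.uniform(0.0, 1.0, n) < np.exp(beta0 + beta1 * s) / lam_max
    return s[keep]

def fit_logistic(x, y, n_iter=25):
    """Maximum-likelihood logistic regression (intercept and slope) by Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - p)                          # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # observed information
        b = b + np.linalg.solve(hess, grad)
    return b

rng = np.random.default_rng(42)
beta0, beta1 = np.log(1000.0), 1.5                # hypothetical true parameters (assumption)
used = simulate_ipp(beta0, beta1, rng)            # observed locations, coded as 1

n_avail = 20000
avail = (np.arange(n_avail) + 0.5) / n_avail      # regular availability points, coded as 0

x = np.concatenate([used, avail])
y = np.concatenate([np.ones(used.size), np.zeros(n_avail)])
b = fit_logistic(x, y)

print("estimated slope:", b[1], "true beta1:", beta1)
# On a unit-length study area, the intercept is offset by log(n_avail)
print("intercept + log(n_avail):", b[0] + np.log(n_avail), "true beta0:", beta0)
```

No location in this simulation is a true absence, since every availability point lies in a region where the intensity is positive, yet the slope of the logistic regression estimates beta1. The intercept, by contrast, depends on how many availability points were generated, which is why it carries no ecological meaning on its own.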
Use-availability vs. presence–absence design
Although the use-availability and presence–absence modelling approaches appear similar (because in both cases the response consists of zeros and ones), they are fundamentally different. The use-availability design treats each individual observation as one data point (Boyce & McDonald 1999). In contrast, the presence–absence design (also known as occupancy modelling) discretizes space and defines for each spatial unit whether an organism is present (MacKenzie et al. 2005). Consequently, it does not differentiate between one or several animals being present. Particularly when organisms congregate in large numbers, resulting in a high abundance in some spatial units, occupancy may be a poor descriptor of local density. Indeed, several empirical studies have demonstrated a lack of correlation between occupancy and local abundance (Pearce & Ferrier 2001; Nielsen et al. 2005; Jimenez-Valverde et al. 2009). Therefore, imposing a presence–absence design will lead to loss of information (except when each spatial unit only contains one observation). In the extreme case where each spatial unit is used at least once, we have completely lost the ability to investigate the influence of environmental conditions on the species distribution. Lastly, when a logit link function is used to model the relationship between spatial covariates and presence probability, imposing different pixel sizes leads to different parameter estimates and estimated probabilities that cannot be reconciled. Consequently, meaningful predictions for regions discretized at different spatial resolutions cannot be made, although a complementary log-log link may allow such comparisons (Baddeley et al. 2010).
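The dependence of the logit-scale relationship on pixel size can be made concrete with a short calculation. The sketch below assumes an underlying IPP with intensity exp(beta0 + beta1 x) per unit area, a covariate x that is constant within each pixel, and arbitrary parameter values and pixel areas. Under these assumptions, the probability that a pixel of area a contains at least one observation is 1 - exp(-a exp(beta0 + beta1 x)), so the complementary log-log transform equals log(a) + beta0 + beta1 x. The slopes are obtained with a simple straight-line fit to the exact probabilities, used here as a summary rather than as a fitted GLM.

```python
import numpy as np

beta0, beta1 = 0.0, 1.0                     # hypothetical IPP parameters (assumption)
x = np.linspace(-1.0, 1.0, 201)             # covariate value of each pixel
lam = np.exp(beta0 + beta1 * x)             # intensity per unit area

logit_slope, cloglog_slope = {}, {}
for area in (0.01, 0.1, 1.0):               # hypothetical pixel areas
    # Probability of at least one observation in a pixel of this area
    p = -np.expm1(-area * lam)
    logit = np.log(p) - np.log1p(-p)
    cloglog = np.log(-np.log1p(-p))
    # Least-squares slope of each transformed probability against the covariate
    logit_slope[area] = np.polyfit(x, logit, 1)[0]
    cloglog_slope[area] = np.polyfit(x, cloglog, 1)[0]
    print(area, logit_slope[area], cloglog_slope[area])
```

On the complementary log-log scale the slope equals beta1 for every pixel area and only the intercept shifts by log(a), whereas on the logit scale the slope itself changes with pixel area and approaches beta1 only as pixels become very small.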
Using point methods or discrete counts
Because models fitted to count or point data can lead to identical results, researchers may define the response variable as either a count per unit area (e.g. grid cells for wildlife telemetry data or segments for line-transect data) or implement the use-availability design. Unfortunately, the spatial scale (i.e. the level of aggregation) and the zonation chosen may have a substantial influence on the results. This problem is closely related to the change-of-support or the modifiable areal unit problem, which is extensively discussed in agriculture, geography, sociology, statistics and ecology (Jelinski & Wu 1996; Dungan et al. 2002; Gotway & Young 2002; Svancara et al. 2002). Although some statistical solutions exist (e.g. see Jelinski & Wu 1996; Dungan et al. 2002; Gotway & Young 2002; Svancara et al. 2002), the most appropriate method ultimately depends on the spatial scale at which the species selects certain environmental conditions.
Point methods may prove useful even when environmental data are only available on a relatively coarse grid, because they allow one to consider individual-level covariates (e.g. sex or age of the study individuals) as well as characteristics of the observation process that may be pertinent to the study (e.g. the distances of animals relative to the observer, sighting conditions or weather, and serial correlation in animal movement data). As such, point methods provide a means to explore individual behaviours, such as the fine-scale spatial dependence between predator and prey (e.g. harbour porpoise – fish schools and wolf kill sites: Embling et al. 2005; Webb, Hebblewhite & Merrill 2008), and to address questions regarding how preference for environmental conditions may depend on the characteristics of the individuals under study (Aarts et al. 2008). By contrast, count methods require summation of individuals within a spatial (or temporal) unit, and consequently the information associated with unique observations is lost.
Extending the Poisson point process models
Typical biological applications involve a multitude of (potentially correlated) variables (Cramer 1985), constraints in movement or dispersal (Matthiopoulos 2003b), unbalanced sampling effort across individuals (Aarts et al. 2008; Hebblewhite & Merrill 2008) or space (Diggle, Menezes & Su 2010b), inhomogeneous detection probabilities (Buckland et al. 2004; Frair et al. 2004; Royle, Nichols & Kery 2005) and nonlinear responses of individuals to the environment. Perhaps most importantly, data on species distributions are generally spatially or temporally autocorrelated (Dormann 2007; Dormann et al. 2007; Johnson et al. 2008; Fieberg et al. 2010). Under an infinitely fine discretization, we expect no difference in the effect of these confounding factors, because point and count methods approximate the same Poisson point process model. In contrast, under a coarse discretization these approaches may differ. For example, a coarse discretization may alleviate small-scale spatial autocorrelation in count methods. This may be an advantage if one wishes to avoid the effect of correlation on the parameter estimates and their standard errors, but if one is also interested in capturing the actual mechanism causing the clustering of organisms (Fieberg et al. 2010), point methods may be more fruitful. The exponential IPP model does not address this latter form of clustering, but more complex point process models could be fitted to address these issues (Baddeley & Turner 2000; Johnson et al. 2008; Diggle, Kaimi & Abellana 2010a).
A plea for unification of habitat selection and species distribution methodology
Methods for estimating the spatial distribution of a species and its dependence on environmental covariates have improved greatly in the last decade. This is particularly true for studies using detailed movement data from individual animals carrying telemetry (e.g. GPS) devices. Methods have been developed that allow one to account for heterogeneous detection probabilities (Frair et al. 2004) and unequal accessibility of habitats (Matthiopoulos 2003b; Aarts et al. 2008), to incorporate movement models directly into the estimation of environmental preference (Johnson et al. 2008), and to investigate how preference itself may change as a function of habitat availability (Mysterud & Ims 1998; Matthiopoulos et al. 2011). This latter aspect, known as a functional response in habitat use, is particularly important if we wish to generalize the conclusions on preference and make spatial predictions for other regions. Furthermore, looking at how preference changes with changes in absolute availability is essential if we are to understand which habitats are crucial to the existence of a species (Mysterud & Ims 1998; Matthiopoulos et al. 2011). Interestingly, many of the issues mentioned previously also apply to studies modelling count data as a function of environmental variables, but are rarely addressed by these models. Therefore, appreciation of the similarities between the count and point-level methods may greatly improve cross-fertilization. Furthermore, if these models are to be useful for wildlife management or conservation, it is essential to understand the exact meaning of their results.
Acknowledgements
We thank J.J. Poos, S. Brasseur, three anonymous reviewers and the associate editor for their valuable feedback. We thank P. Diggle and D. Warton for some useful suggestions.