Design of occupancy studies with imperfect detection
Correspondence site: http://www.respond2articles.com/MEE/
Summary
1. Occupancy is an important concept in ecology. To obtain an unbiased estimator of occupancy it is necessary to address the issue of imperfect detection, which requires conducting replicate surveys at the sites being sampled. As the allocation of total effort can be done in different ways, occupancy studies should be designed carefully to ensure an efficient use of available resources.
2. In this paper we address the design of single-season single-species occupancy studies with a focus on: (1) issues relating to small sample sizes and (2) the potential relevance of including the precision of the detectability estimator as a criterion for design. We explore analytically the model with constant probabilities and examine how bias and precision are affected by the numbers of sites and replicates used.
3. We show how, for small sample sizes, the estimator properties depart from those predicted by large sample approximations, emphasize the need to use simulations when designing for small sample sizes and provide a new software tool that can assist in this process.
4. We offer advice on the amount of replication needed when the probability of detection is a quantity of interest and show that, in this case, it is more efficient to reduce the number of sites and increase the amount of replication per site compared with situations where only occupancy is of concern.
5. Synthesis and applications. It is essential to have clearly stated objectives before starting a study and to design the sampling accordingly. As the allocation of effort into replication and sites can be done in different ways, occupancy studies should be designed carefully to ensure an efficient use of available resources. To avoid waste, it is crucial to anticipate the quality of the estimates that can be expected from a particular study design. The discussion and guidance provided here is of special interest for those designing occupancy studies with small sample sizes, something not uncommon in the context of ecology and conservation.
Introduction
Occupancy, defined as the proportion of sites occupied by a species, is a state variable commonly used in ecology for the modelling of habitat relationships, metapopulation studies and wildlife monitoring programmes. When species detection is imperfect, occupied sites may be classified as unoccupied based on survey data. If not accounted for, these false absences lead to underestimates of occupancy. The issue of imperfect detection in the context of occupancy studies has received much attention in recent years. MacKenzie et al. (2002) presented a modelling approach for addressing the simultaneous estimation of occupancy and detectability which has since been developed in a number of ways including extensions to cover multiple seasons (MacKenzie et al. 2003), multiple species (MacKenzie, Bailey, & Nichols 2004) and heterogeneity in detection probability (Royle 2006). To account for imperfect detection when modelling occupancy, replicate surveys have to be carried out at sampled sites. Replication is commonly achieved by conducting repeated surveys at different points in time or by surveying different sectors of each sampled site. Other methods include independent surveys carried out by different observers within a single visit or the simultaneous use of independent detection methods. The need for replication creates a trade-off between the number of sites to survey and the number of replicate surveys to carry out per site.
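The size of the underestimation caused by false absences can be sketched analytically: under constant probabilities, the naïve estimator (the proportion of sites with at least one detection) converges to ψp*, where p* = 1 − (1 − p)^K is the probability of detecting the species in at least one of K surveys at an occupied site. A minimal illustration in Python, with arbitrarily chosen values:

```python
def naive_expectation(psi: float, p: float, K: int) -> float:
    """Large-sample expectation of the naive occupancy estimator.

    A site contributes to the naive estimate only if the species is
    detected in at least one of the K surveys, which for an occupied
    site happens with probability p* = 1 - (1 - p)**K.
    """
    p_star = 1 - (1 - p) ** K
    return psi * p_star

# With psi = 0.6, p = 0.3 and K = 2 replicate surveys, the naive
# estimator converges to 0.6 * (1 - 0.7**2) = 0.306, roughly half
# the true occupancy.
print(round(naive_expectation(0.6, 0.3, 2), 3))  # -> 0.306
```

Adding replicates drives p* towards 1, which is precisely why the trade-off between sites and replicates arises.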
Several papers have addressed the issue of study design in the context of occupancy modelling. MacKenzie et al. (2002), Tyre et al. (2003) and Field, Tyre, & Possingham (2005a) provided some guidance on the number of replicate surveys needed based on simulations. MacKenzie & Royle (2005) presented the first detailed investigation on this subject, giving advice on general issues and providing specific recommendations for the most efficient allocation of survey effort under three sampling schemes and different cost function scenarios. They based their guidance on analytic results obtained by considering the large sample properties of the maximum-likelihood estimator for occupancy probability under a model with constant probabilities of occupancy and detectability. Bailey et al. (2007) later described a software tool developed for exploring design trade-offs for different occupancy models, either using analytic approximations or simulations. They presented an example and noted that the use of simulations is important when working with small sample sizes.
Small sample sizes are not uncommon in ecological studies. In particular they are frequently encountered in surveys linked to conservation projects, as these often have limited resources and tend to focus on rare species. Pilot studies, by their nature, also tend to deal with relatively small amounts of data. Under these circumstances the large sample approximations may be poor. In our experience, the effects of working with small sample sizes are not always addressed in practice and the use of simulations as a tool for assisting study design appears not to be widespread.
While for many studies the primary object of inference is the probability of occupancy, with the probability of detection being regarded merely as a nuisance parameter, there are circumstances when the latter is a quantity of interest in its own right. For instance, this is the case when the estimates obtained from a (pilot) study are to be used as input for the design of subsequent monitoring protocols (e.g. Field et al. 2005b; Pellet 2008) or when there is interest in evaluating the performance of detection methods (e.g. Mortelliti & Boitani 2008). Detectability may also be of interest when it reflects some important characteristic of the ecological system. For example, it could be associated with reproduction (Best & Petersen 1982). Detectability estimates provide information on the number of times that a site needs to be visited before stating with a given degree of certainty whether the species of interest is present or absent at that particular location. This information can be especially relevant in the context of environmental impact assessments. Under these scenarios there is a benefit in obtaining a precise estimate of detection probability.
In this paper we address the design of single-season single-species occupancy studies with a focus on: (1) issues relating to small sample sizes and (2) the potential relevance of including the precision of the detectability estimator as a criterion for design. We investigate analytically the quality of the maximum-likelihood estimators for the occupancy model with constant probabilities of occupancy and detection. We also show how bias and precision are affected by the number of sites and replicates employed and illustrate how the predictions made by large sample theory diverge from the actual distribution of the estimator when sample sizes are small. We discuss how studies are designed using recommendations based on asymptotic approximations and provide guidance to assist survey design when detection probability is a parameter of interest. Finally, we describe the design procedure with an emphasis on the need to use simulations as a tool for sampling design when the sample size is small and provide a numerical example to illustrate the steps. In this context we present a new software application (Single-season Occupancy study Design Assistant, soda) that can assist in the process by automating the search for a suitable design.
Modelling occupancy under imperfect detection: estimator properties
The detailed formulation of occupancy models with imperfect detection is well covered in the literature (e.g. MacKenzie et al. 2006); so, here we limit the description to key aspects relevant to our analysis. Let ψ be the probability of occupancy, p the probability of detection, S the number of sites to be surveyed and K the number of replicate surveys per sampling site. We assume that both occupancy and detection probabilities are constant in time and space. Although in practice this simplification may not always be reasonable, it is necessary in order to provide general study design guidelines. We use the maximum-likelihood approach for model fitting as proposed by MacKenzie et al. (2002) and assume a standard survey design with K surveys carried out in all S sampling sites.
Let S_D denote the number of sites at which the species is detected at least once, and d the total number of detections recorded over the S × K surveys. Under the constant model, the likelihood of the data is

L(ψ, p) = ψ^(S_D) p^d (1 − p)^(K·S_D − d) [(1 − ψ) + ψ(1 − p)^K]^(S − S_D),  (eqn 1)

so that (S_D, d) is a sufficient statistic. When the maximum-likelihood estimates lie in the interior of the parameter space, they satisfy

ψ̂ = S_D / (S·p̂*),    p̂ / [1 − (1 − p̂)^K] = d / (K·S_D),  (eqn 2)

where p̂* = 1 − (1 − p̂)^K is the estimated probability of detecting the species in at least one of the K surveys at an occupied site; the second expression is solved numerically for p̂. The occupancy estimator takes the boundary value ψ̂ = 1 when

(S − S_D) / S < (1 − d / (SK))^K.  (eqn 3)
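As a concrete sketch of this model-fitting step (an illustrative Python reimplementation, not the authors' code), note that the likelihood under the constant model depends on the data only through the number of sites with at least one detection (`S_D`) and the total number of detections (`d`), and can be maximized numerically:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, S_D, d, S, K):
    """Negative log-likelihood of the constant-probability occupancy model.

    S_D: number of sites with at least one detection
    d:   total number of detections over all S * K surveys
    """
    psi, p = params
    # Sites with detections contribute psi * p^(d_i) * (1 - p)^(K - d_i);
    # sites without detections contribute (1 - psi) + psi * (1 - p)^K.
    return -(S_D * np.log(psi) + d * np.log(p)
             + (K * S_D - d) * np.log(1 - p)
             + (S - S_D) * np.log((1 - psi) + psi * (1 - p) ** K))

def fit_occupancy(S_D, d, S, K):
    """Return (psi_hat, p_hat) by numerical maximization."""
    res = minimize(neg_log_lik, x0=[0.5, 0.5], args=(S_D, d, S, K),
                   method="L-BFGS-B",
                   bounds=[(1e-6, 1.0), (1e-6, 1 - 1e-6)])
    return res.x

# Hypothetical data: detections at 15 of 30 sites, 30 detections in
# total over K = 3 surveys per site.
psi_hat, p_hat = fit_occupancy(S_D=15, d=30, S=30, K=3)
```

For this hypothetical data set the fitted ψ̂ lies above the naïve estimate 15/30 = 0.5, as it corrects for occupied sites at which the species was never detected.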
Eqn 3 indicates that the occupancy estimate hits the boundary when the proportion of sites where the species was not detected (left term) is smaller than the proportion of zeros in the history raised to the power of K (right term). This suggests that boundary estimates may be an issue when working with small sample sizes and low probabilities, especially when the amount of replication is small.
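Expressed in code (a direct transcription of the condition just described, with hypothetical counts for illustration):

```python
def hits_boundary(S_D: int, d: int, S: int, K: int) -> bool:
    """True when the occupancy MLE is the boundary estimate psi_hat = 1.

    Left term:  observed proportion of sites without any detection.
    Right term: overall proportion of zeros among the S * K survey
                outcomes, raised to the power K.
    """
    return (S - S_D) / S < (1 - d / (S * K)) ** K

# Ten sites, two surveys each, three sites with a single detection apiece:
print(hits_boundary(S_D=3, d=3, S=10, K=2))  # sparse detections -> True
# Same sites, but the species detected in both surveys at each:
print(hits_boundary(S_D=3, d=6, S=10, K=2))  # -> False
```

The toy example shows the pattern described in the text: with few sites, few replicates and thin detections, the boundary case is easy to trigger.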
A graphical representation of all the MLEs obtainable for a given design illustrates the issues resulting from small sample sizes and the effect that increasing the number of sites or replicates has on the quality of the estimates (Fig. 1). Given a finite number of sites (S) and replicates (K), there is a finite number of detection histories that can theoretically be observed (i.e. 2^(SK) possible combinations of zeros and ones). Under the model with constant probabilities of occupancy and detectability, all histories that share the same S_D and d produce the same estimates of occupancy and detection (eqn 2). This results in (S + 1)[1 + S(K − 1)/2] possible estimate points in the parameter space (dots in the figure). When sample sizes are very small, there are only a few distinct detection histories that can be observed and, correspondingly, few possible parameter estimate values (Fig. 1a). The parameter space is sparsely covered by the MLEs, which means that the estimator is imprecise, an effect that becomes more pronounced as the probabilities of occupancy and detection get smaller. In fact, there are no solutions covering the area corresponding to the lowest probabilities, which causes the estimator to be substantially biased in this region. As more samples are added to the study, the MLE solutions cover more of the probability space. Additional replication results in better coverage of the area corresponding to low probabilities of detection (Fig. 1b), while an increase in the number of sampling sites achieves a more even coverage of the area corresponding to high probabilities of detection (Fig. 1c). When the amount of replication is large, the MLEs coincide with the naïve estimates in most cases, as p* is close to unity except for very low values of p.
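The count of attainable estimate points can be verified by enumerating the possible (S_D, d) pairs, where S_D is the number of sites with detections and d the total detection count (a small verification sketch, not from the original paper):

```python
def n_estimate_points(S: int, K: int) -> int:
    """Count the distinct (S_D, d) combinations.

    For a given S_D > 0, the total detection count d can range from S_D
    (one detection at each such site) up to K * S_D (a detection in every
    survey); S_D = 0 forces d = 0. Each S_D therefore contributes
    S_D * (K - 1) + 1 values of d.
    """
    return sum(S_D * (K - 1) + 1 for S_D in range(S + 1))

# Agrees with the closed form (S + 1)[1 + S(K - 1)/2]:
for S, K in [(10, 3), (10, 9), (30, 3), (100, 9)]:
    assert n_estimate_points(S, K) == (S + 1) * (2 + S * (K - 1)) // 2
```

For the smallest design in Fig. 1 (S = 10, K = 3) this gives only 121 distinct estimate points to cover the whole unit square, which is the sparsity visible in panel (a).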

Maximum-likelihood estimates for all possible detection histories that can be observed under a design with (a) S = 10, K = 3, (b) S = 10, K = 9, (c) S = 30, K = 3, (d) S = 30, K = 9, (e) S = 100, K = 3, (f) S = 100, K = 9. No assumptions are made here about the true values of the parameters. Each dot represents a pair of estimates (ψ̂, p̂), which corresponds to the solution for all histories summarized by the sufficient statistics (S_D, d). There are (S + 1)[1 + S(K − 1)/2] different possible (S_D, d) combinations. Dotted lines connect estimates for histories that share the same S_D, from 1 (bottom line) to S (top line). Moving along the lines from right to left, each dot corresponds to histories with a decreasing value of d, from a maximum of K·S_D to a minimum of S_D. At the right-most side of the graph the estimates correspond to the naïve estimates and ‘bend’ upwards as detectability (p) gets smaller. For clarity, (e) and (f) have been plotted without lines and using smaller markers.
For large samples, the variance–covariance matrix of the estimators can be approximated by inverting the expected Fisher information. In particular, the variance of the occupancy estimator is approximately (MacKenzie & Royle 2005)

var(ψ̂) ≈ (ψ / S) [(1 − ψ) + (1 − p*) / (p* − Kp(1 − p)^(K−1))],

where p* = 1 − (1 − p)^K. The corresponding approximations for var(p̂) and the covariance term follow from the same information matrix.
Design of occupancy studies
Large sample approximations and simulations are tools that can assist in the design of occupancy studies. Here, we comment on these two approaches and provide an overall picture of the design process with an emphasis on small sample sizes. Note that, to design a study, we need to assume values for the parameters to be estimated.
Optimal design based on asymptotic approximations
The asymptotic variance approximations can be of use when designing occupancy studies, as they allow us to explore analytically how estimator precision changes for different design parameters. MacKenzie & Royle (2005) derived study design recommendations based on the asymptotic approximation of the variance of the occupancy estimator (Table 1a). Recommendations can also be produced incorporating the variance of p̂ as part of the design criterion, which is useful when detectability is itself a parameter of interest. There are different criteria that can be used for optimal design; for a discussion of their merits, see Atkinson & Donev (1992, p. 106). One common approach is to minimize the trace of the variance–covariance matrix, that is, the sum of the variances of the parameters. This is called A-optimality, and it gives equal weight to the two variances rather than minimizing the variance of each parameter separately (i.e. the variances of ψ̂ and p̂ in our case). Alternatively, D-optimality minimizes the determinant of the variance–covariance matrix. For large samples, the maximum-likelihood estimators ψ̂ and p̂ are approximately normally distributed, and a D-optimal design minimizes the area of the elliptical confidence region based on this distribution. Here, we derive the optimal number of replicate surveys to be carried out at each sampling site using the A-optimality (Table 1b) and D-optimality (Table 1c) criteria. Relative to designs based on the occupancy estimator alone, the optimal number of replicates increases, driven by the variance of p̂, with larger changes observed for low probabilities of occupancy and low probabilities of detection respectively. As happens when considering the variance of the occupancy estimator only, the optimal number of replicate surveys in these two cases is determined by the parameter values (ψ and p) irrespective of the total effort assigned to the survey (TS). Note that the optimal number of replicates is the same regardless of whether the study is designed to minimize survey effort or estimator variance (measured through any of the three criteria above).
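Calculations of this kind can be reproduced numerically (a sketch under the constant model with arbitrary example values, not the authors' code): build the expected per-site Fisher information from the K + 1 possible site outcomes, invert it to approximate the variance–covariance matrix, and minimize the A- or D-criterion over K for a fixed total effort TS = S × K:

```python
import numpy as np
from math import comb

def site_info(psi, p, K, eps=1e-6):
    """Expected per-site Fisher information for (psi, p), via central
    finite differences of the log-probabilities of the K + 1 possible
    site outcomes (x detections at a site, x = 0, ..., K)."""
    def logprobs(psi, p):
        lp = [np.log((1 - psi) + psi * (1 - p) ** K)]        # x = 0
        lp += [np.log(psi * comb(K, x) * p ** x * (1 - p) ** (K - x))
               for x in range(1, K + 1)]
        return np.array(lp)
    probs = np.exp(logprobs(psi, p))                         # outcome probabilities
    g_psi = (logprobs(psi + eps, p) - logprobs(psi - eps, p)) / (2 * eps)
    g_p = (logprobs(psi, p + eps) - logprobs(psi, p - eps)) / (2 * eps)
    scores = np.vstack([g_psi, g_p])                         # 2 x (K + 1)
    return (scores * probs) @ scores.T                       # E[score score^T]

def optimal_K(psi, p, total_effort=300, crit="A", K_range=range(2, 40)):
    """Number of replicates minimizing the chosen criterion for fixed TS."""
    best_K, best_val = None, np.inf
    for K in K_range:
        V = np.linalg.inv((total_effort / K) * site_info(psi, p, K))
        val = np.trace(V) if crit == "A" else np.linalg.det(V)
        if val < best_val:
            best_K, best_val = K, val
    return best_K
```

Because the criterion value scales with total effort uniformly for every K, the minimizing K does not depend on TS, consistent with the observation above; the trend of fewer optimal replicates as p grows can be checked directly, e.g. `optimal_K(0.5, 0.2)` versus `optimal_K(0.5, 0.6)`.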



Table 1. Optimal number of replicate surveys (K) per site for given probabilities of occupancy (ψ) and detection (p): (a) minimizing the asymptotic variance of the occupancy estimator; (b) A-optimality; (c) D-optimality.

(a)

| p \ ψ | 0·1 | 0·2 | 0·3 | 0·4 | 0·5 | 0·6 | 0·7 | 0·8 | 0·9 |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0·1 | 14 | 15 | 16 | 17 | 18 | 20 | 23 | 26 | 34 |
| 0·2 | 7 | 7 | 8 | 8 | 9 | 10 | 11 | 13 | 16 |
| 0·3 | 5 | 5 | 5 | 5 | 6 | 6 | 7 | 8 | 10 |
| 0·4 | 3 | 4 | 4 | 4 | 4 | 5 | 5 | 6 | 7 |
| 0·5 | 3 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 5 |
| 0·6 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 4 |
| 0·7 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |
| 0·8 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 0·9 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |

(b)

| p \ ψ | 0·1 | 0·2 | 0·3 | 0·4 | 0·5 | 0·6 | 0·7 | 0·8 | 0·9 |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0·1 | 19 | 16 | 17 | 17 | 19 | 20 | 23 | 27 | 34 |
| 0·2 | 13 | 10 | 9 | 9 | 9 | 10 | 11 | 13 | 16 |
| 0·3 | 10 | 7 | 7 | 6 | 6 | 7 | 7 | 8 | 10 |
| 0·4 | 8 | 6 | 5 | 5 | 5 | 5 | 5 | 6 | 7 |
| 0·5 | 7 | 5 | 4 | 4 | 4 | 4 | 4 | 5 | 6 |
| 0·6 | 6 | 4 | 4 | 3 | 3 | 3 | 3 | 4 | 4 |
| 0·7 | 5 | 4 | 3 | 3 | 3 | 3 | 3 | 3 | 4 |
| 0·8 | 4 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 3 |
| 0·9 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |

(c)

| p \ ψ | 0·1 | 0·2 | 0·3 | 0·4 | 0·5 | 0·6 | 0·7 | 0·8 | 0·9 |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0·1 | 19 | 19 | 20 | 21 | 23 | 24 | 27 | 30 | 36 |
| 0·2 | 9 | 10 | 10 | 11 | 11 | 12 | 13 | 14 | 17 |
| 0·3 | 6 | 6 | 7 | 7 | 7 | 8 | 8 | 9 | 11 |
| 0·4 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | 7 | 8 |
| 0·5 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 5 | 6 |
| 0·6 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 5 |
| 0·7 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 4 |
| 0·8 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 |
| 0·9 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
Design based on a simulation study
Likelihood theory tells us that asymptotic approximations are good when the sample size is large enough; however, it does not tell us how large it needs to be. In Fig. 2 we illustrate how the properties of the MLEs under the constant occupancy model depart from the asymptotic approximation for a combination of design parameter values that is realistic within the context of ecological studies (168 units of total effort). The difference between the approximated and actual estimator distributions is larger for low probabilities of occupancy and detection. Designing an occupancy study based on asymptotic properties of the estimators is therefore not appropriate if the intended sample size is small, especially when dealing with rare and elusive species. Under these circumstances, the actual quality of the estimators may be very different from that predicted by the asymptotic variance expressions and the design identified as optimal using large sample approximations may not be the best available, as illustrated in the example section. In these cases the most appropriate method for designing a study relies on the use of simulations.
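A simulation of this kind can be sketched as follows (illustrative Python with an arbitrary seed and a modest number of iterations; the figures in the text come from the authors' own computations): generate detection histories under assumed ψ and p, refit the model to each, and summarize the spread of the resulting estimates.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

def fit_mle(S_D, d, S, K):
    """Numerical MLE of (psi, p) for the constant-probability model."""
    def nll(t):
        psi, p = t
        return -(S_D * np.log(psi) + d * np.log(p)
                 + (K * S_D - d) * np.log(1 - p)
                 + (S - S_D) * np.log(1 - psi + psi * (1 - p) ** K))
    return minimize(nll, [0.5, 0.5], method="L-BFGS-B",
                    bounds=[(1e-6, 1.0), (1e-6, 1 - 1e-6)]).x

def rmse_by_simulation(psi, p, S, K, n_sims=1000):
    """Monte Carlo root mean-squared error of (psi_hat, p_hat)."""
    estimates = []
    for _ in range(n_sims):
        occupied = rng.random(S) < psi                 # latent occupancy state
        detections = rng.binomial(K, p, size=S) * occupied
        S_D, d = int((detections > 0).sum()), int(detections.sum())
        if S_D == 0:        # species never detected: no usable estimate
            continue
        estimates.append(fit_mle(S_D, d, S, K))
    err = np.array(estimates) - [psi, p]
    return np.sqrt(np.mean(err ** 2, axis=0))

# e.g. one of the designs of Fig. 2, under assumed psi = p = 0.5:
# rmse_psi, rmse_p = rmse_by_simulation(0.5, 0.5, S=56, K=3)
```

Comparing the simulated RMSE with the asymptotic standard error for the same design reveals directly whether the large sample approximation can be trusted.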

Actual (top row) and asymptotic (bottom row) distribution of the MLEs for different underlying probabilities of occupancy and detectability (marked with a triangle) under an optimal design with 168 units of total effort: (a) 12 sites and 14 replicates (S = 12 and K = 14); (b,c) 56 sites and 3 replicates (S = 56, K = 3). Plots show the part of the distribution that contains 0·999 probability. For small probabilities of occupancy and detection the estimators have strong bias, with many of the detection histories resulting in boundary estimates (dots at the top left of the plot). As probabilities increase, the true distribution of the MLEs becomes closer to the bivariate normal distribution predicted by the asymptotic approximation.
Sampling design procedure for occupancy surveys: the big picture
The design of an occupancy survey (Fig. 3) should start with a clear statement of the project requirements in terms of the quality of the estimators (e.g. maximum allowed variance) and total survey effort available. With this in mind the design can be made to either (A) maximize the quality of the estimators or (B) minimize the effort employed. We also need to assume initial values for the parameters to be estimated. These can be based on the results of a pilot study, on studies carried out for the same or similar species in comparable circumstances or on expert opinion. The first issue to address is whether the sample size can be considered large enough to base the choice of design parameters on asymptotic approximations. If the total effort available is large and the probabilities of occupancy and detectability are expected to be relatively high, the design can safely be based on these approximations. Nevertheless, we recommend verifying that the approximations are valid before proceeding to collect data. This involves running a simulation with the chosen design parameters (K and S) and given parameter assumptions (ψ and p). If the sample size is not large enough for the asymptotic approximation to be good, the design needs to be based on a simulation study, in which the quality of estimators is evaluated for different combinations of design parameters. There is software which allows simulating the model with a given set of K, S, ψ and p to evaluate estimator bias and variance (genpres, Bailey et al. 2007). Program soda offers the possibility of running an automated search for a suitable design which explores different combinations of K and S given the assumptions and requirements specified by the user. The tool allows the user to select whether priority is given to maximizing estimator quality or minimizing total effort, and allows detectability to be incorporated as part of the design criteria. 
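Purely as an illustration of the search idea (not a reimplementation of soda; the design grid, target and simulation settings here are arbitrary), an automated scan over candidate designs can be built on top of such a simulation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def fit_mle(S_D, d, S, K):
    """Numerical MLE of (psi, p) for the constant-probability model."""
    def nll(t):
        psi, p = t
        return -(S_D * np.log(psi) + d * np.log(p)
                 + (K * S_D - d) * np.log(1 - p)
                 + (S - S_D) * np.log(1 - psi + psi * (1 - p) ** K))
    return minimize(nll, [0.5, 0.5], method="L-BFGS-B",
                    bounds=[(1e-6, 1.0), (1e-6, 1 - 1e-6)]).x

def simulated_rmse(psi, p, S, K, n_sims):
    """RMSE of (psi_hat, p_hat) under the design (S, K), by simulation."""
    ests = []
    for _ in range(n_sims):
        occ = rng.random(S) < psi
        det = rng.binomial(K, p, size=S) * occ
        S_D, d = int((det > 0).sum()), int(det.sum())
        if S_D > 0:
            ests.append(fit_mle(S_D, d, S, K))
    return np.sqrt(np.mean((np.array(ests) - [psi, p]) ** 2, axis=0))

def search_designs(psi, p, total_effort, max_rmse_psi,
                   K_range=range(2, 10), n_sims=200):
    """List (S, K, rmse_psi) for designs with S * K <= total_effort whose
    simulated RMSE for psi_hat meets the stated requirement."""
    suitable = []
    for K in K_range:
        S = total_effort // K
        if S < 2:
            continue
        rmse = simulated_rmse(psi, p, S, K, n_sims)
        if rmse[0] <= max_rmse_psi:
            suitable.append((S, K, float(rmse[0])))
    return suitable

# e.g. search_designs(psi=0.4, p=0.3, total_effort=200, max_rmse_psi=0.10)
```

A criterion on p̂ (or the A-/D-optimality score) can be screened in the same loop; the point is simply that the search replaces the asymptotic formulae with the simulated estimator properties.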
Program soda can be freely downloaded at http://www.kent.ac.uk/ims/personal/msr/soda.html. An R function for evaluating the performance of a given design is available at the same site.

Occupancy survey design procedure. Shaded boxes represent decision stages. Survey design has to start with clear targets for total effort and estimator quality (e.g. measured as the MSE of ψ̂, or including p̂ through the A- or D-optimality criteria). Priority can be given to maximizing estimator quality (A) or minimizing total effort (B). Although not included here for simplicity, there are other issues, such as the cost of surveys and logistical constraints, that may need to be incorporated in the design process.
Once a candidate design is identified, either through asymptotic approximations or simulations, we need to verify whether it fulfils the requirements of the project. If it does, the study can proceed to data collection. Otherwise, if no suitable design was found, the objectives and constraints of the project need to be reconsidered: can more resources be allocated to this study? Could less precise estimates still be informative for the purpose of the study? If the answer to these questions is negative the study should not continue as it would be a waste of resources that could be used elsewhere (Legg & Nagy 2006). If the project objectives or constraints are redefined, a new design should be sought given the new requirements.
Example: designing an occupancy study when sample size is small



Table 2. Estimator performance for candidate designs in the example: six levels of replication (K = 4–9) at three levels of total effort (TS ≈ 250, 300 and 350).

| | K = 4 | K = 5 | K = 6 | K = 7 | K = 8 | K = 9 |
|---|---|---|---|---|---|---|
| **TS ≈ 250** | | | | | | |
| S | 62 | 50 | 42 | 36 | 31 | 28 |
| aRMSE ψ̂/p̂ (×10²) | 6·9/9·6 | 6·8/8·6 | 6·9/7·9 | 7·1/7·5 | 7·5/7·3 | 7·8/7·1 |
| RMSE ψ̂/p̂ (×10²) | 12·6/10·1 | 10·6/9·3 | 9·6/8·7 | 9·3/8·4 | 9·6/8·2 | 9·6/8·0 |
| RMSE* ψ̂/p̂ (×10²) | 9·3/9·7 | 8·2/9·0 | 7·7/8·4 | 7·5/8·1 | 7·7/8·0 | 7·9/7·7 |
| Boundary estimates | 1·1% | 0·7% | 0·5% | 0·5% | 0·5% | 0·5% |
| **TS ≈ 300** | | | | | | |
| S | 75 | 60 | 50 | 43 | 37 | 33 |
| aRMSE ψ̂/p̂ (×10²) | 6·3/8·7 | 6·2/7·9 | 6·3/7·3 | 6·6/6·9 | 6·9/6·7 | 7·2/6·5 |
| RMSE ψ̂/p̂ (×10²) | 10·1/9·2 | 8·4/8·4 | 7·8/7·8 | 7·5/7·5 | 7·9/7·4 | 8·1/7·2 |
| RMSE* ψ̂/p̂ (×10²) | 8·2/8·9 | 7·2/8·2 | 6·9/7·7 | 6·7/7·4 | 7·0/7·3 | 7·2/7·1 |
| Boundary estimates | 0·5% | 0·3% | 0·2% | 0·2% | 0·2% | 0·3% |
| **TS ≈ 350** | | | | | | |
| S | 87 | 70 | 58 | 50 | 43 | 39 |
| aRMSE ψ̂/p̂ (×10²) | 5·8/8·1 | 5·7/7·3 | 5·9/6·8 | 6·1/6·4 | 6·4/6·2 | 6·6/6·0 |
| RMSE ψ̂/p̂ (×10²) | 8·3/8·5 | 7·0/7·6 | 6·6/7·2 | 6·7/6·9 | 6·9/6·7 | 7·1/6·6 |
| RMSE* ψ̂/p̂ (×10²) | 7·4/8·4 | 6·5/7·6 | 6·3/7·2 | 6·3/6·9 | 6·5/6·6 | 6·6/6·5 |
| Boundary estimates | 0·2% | 0·1% | 0·1% | 0·1% | 0·1% | 0·1% |
| A-optimality criterion (×10³) | 14·1 | 10·7 | 9·5 | 9·2 | 9·3 | 9·3 |
| D-optimality criterion (×10⁻⁵) | 3·28 | 2·21 | 1·95 | 1·96 | 2·04 | 2·05 |

Note: Asymptotic root mean-squared error (aRMSE) was obtained analytically; actual root mean-squared error (RMSE) was estimated via simulation with 50 000 iterations. The frequency of boundary estimates (ψ̂ = 1) and the actual root mean-squared error after removing these (RMSE*) are also shown for reference. For TS ≈ 350, the sum of the mean-squared errors (A-optimality criterion) and the determinant of the MSE matrix (D-optimality criterion) are also shown.

Discussion
When faced with the task of planning a study it is essential to address explicitly three basic questions: (1) why is the study needed, (2) what is a suitable state variable and (3) how to do the sampling? (Yoccoz, Nichols, & Boulinier 2001). Here, we have concentrated on aspects related to the ‘how’ question in the context of occupancy studies, in particular on issues derived from the trade-off resulting from the allocation of survey effort between number of sites and number of replicates. However, we emphasize the need to first deal properly with the ‘why’ and ‘what’ questions, as well as to consider other elements related to the ‘how’ such as the selection of sites, the timing of surveys (MacKenzie & Royle 2005) or decisions on the type of replication to be used.
Addressing the ‘why’ question requires a clear statement of the objectives of the study from which design requirements can be derived, including the maximum survey effort available and the level of precision needed for results to be meaningful (Field et al. 2007). Defining this is not just a statistical decision and should incorporate considerations of the species biology and the system in general. For instance, management decisions should explicitly evaluate the costs associated with false positives and false negatives when detecting trends, costs that are not necessarily equal (Field et al. 2005a). Although studies often focus on the estimate of occupancy, here we argue that there are situations when the probability of detection is also of interest. In these cases it is natural for the precision of p̂ to be included as part of the design criterion. We show that, under these scenarios, the best design will tend to require more replication than in cases where only the precision of the occupancy estimator is considered, especially when working with rare species.
Ecological studies often involve small sample sizes; this is particularly true for studies related to conservation. Here, we show that the asymptotic approximations to the distributions of the maximum-likelihood estimators are unreliable for sample sizes that, although small, are realistic in the context of ecology. Estimators are biased and less precise than indicated by these large sample approximations. This is especially relevant when working with rare and elusive species, as then the probabilities of occupancy and detection are low. We highlight the importance of taking these issues into consideration when designing occupancy studies and argue that simulations should be used in the design process. It is essential to determine the actual properties of the estimators under a chosen design, to make sure that they fulfil the design targets before spending, and maybe wasting, time and effort in the field. With a clear description of the overall design procedure, supported by a numerical example and a new software application, we aim to promote the good practice of addressing small sample considerations when designing occupancy studies. However, it is important to note that this guidance does not replace the careful evaluation of each project's characteristics. Apart from the requirements addressed here, there may be other issues that need to be incorporated in the design process, such as decisions on the minimum number of sites that the programme aims to survey, the cost of each survey or other logistical considerations.

The large sample recommendations discussed are based on the model with constant probabilities. We do not give specific recommendations for studies involving covariates (e.g. occupancy in two habitats), but the same general approach is applicable and the use of simulations remains the best tool to guide study design. Here, we have concentrated on maximum-likelihood inference. An alternative Bayesian approach avoids asymptotic assumptions; however, it is still necessary to select an optimal design, and prior sensitivity needs to be considered.
Designing a study requires initial values of the parameters to be estimated. It is important to realize that the actual performance of the chosen design depends on the correctness of these initial values. Given that these parameters are the object under study, there may be considerable uncertainty about their true values. Before deciding on a final design, we recommend exploring the sensitivity of the design to a change in these initial values. Bayesian experimental design (Chaloner & Verdinelli 1995) provides a systematic framework to account for prior knowledge on the parameters in the design process. Sequential methods divide studies into stages, with later stages designed using the results of earlier ones to update the initial estimates (Abdelbasit & Plackett 1983). The potential of these techniques in the context of occupancy study design is the subject of future work.
Acknowledgements
This research has been supported by an EPSRC/NCSE grant. The authors thank Darryl MacKenzie and one anonymous reviewer for valuable comments that improved the quality of this manuscript.