Measuring β‐diversity with species abundance data

Summary In 2003, 24 presence–absence β‐diversity metrics were reviewed and a number of trade‐offs and redundancies identified. We present a parallel investigation into the performance of abundance‐based metrics of β‐diversity. β‐diversity is a multi‐faceted concept, central to spatial ecology. There are multiple metrics available to quantify it: the choice of metric is an important decision. We test 16 conceptual properties and two sampling properties of a β‐diversity metric: metrics should be 1) independent of α‐diversity and 2) cumulative along a gradient of species turnover. Similarity should be 3) probabilistic when assemblages are independently and identically distributed. Metrics should have 4) a minimum of zero and increase monotonically with the degree of 5) species turnover, 6) decoupling of species ranks and 7) evenness differences. However, complete species turnover should always generate greater values of β than extreme 8) rank shifts or 9) evenness differences. Metrics should 10) have a fixed upper limit, 11) symmetry (βA,B = βB,A), 12) double‐zero asymmetry for double absences and double presences and 13) not decrease in a series of nested assemblages. Additionally, metrics should be independent of 14) species replication 15) the units of abundance and 16) differences in total abundance between sampling units. When samples are used to infer β‐diversity, metrics should be 1) independent of sample sizes and 2) independent of unequal sample sizes. We test 29 metrics for these properties and five ‘personality’ properties. Thirteen metrics were outperformed or equalled across all conceptual and sampling properties. Differences in sensitivity to species’ abundance lead to a performance trade‐off between sample size bias and the ability to detect turnover among rare species. In general, abundance‐based metrics are substantially less biased in the face of undersampling, although the presence–absence metric, βsim, performed well overall. 
Only βBaselga R turn, βBaselga B‐C turn and βsim measured purely species turnover and were independent of nestedness. Among the other metrics, sensitivity to nestedness varied >4‐fold. Our results indicate large amounts of redundancy among existing β‐diversity metrics, whilst the estimation of unseen shared and unshared species is lacking and should be addressed in the design of new abundance‐based metrics.


Introduction
Metrics of β-diversity are widely used in ecological studies, but there is uncertainty about the degree of redundancy among the metrics available and the facets of β-diversity being measured. Whittaker (1960, 1972) broadly defined β-diversity as the spatial variation (turnover) in species composition and abundance between sampling units, whilst α-diversity is the local diversity within a single sampling unit and γ-diversity measures larger-scale diversity.
The number of studies investigating β-diversity has increased considerably in recent years (Koleff, Gaston & Lennon 2003a; Anderson et al. 2011). β-diversity has been linked to the shape of the species-area curve (Harte et al. 1999), variance in species occupancy (McGlinn & Hurlbert 2012) and species' spatial aggregation (Morlon et al. 2008). The distance-decay relationship (the increase in β-diversity with geographical distance) is a critical component of three of the six unified theories of biodiversity reviewed by McGill (2010). Measures of β-diversity in relation to environmental and spatial gradients have been used to unpick community assembly (Chase 2003) and drivers of global scale biodiversity patterns (Qian & Ricklefs 2007). Empirical measures of β-diversity can be used to delineate biotic regions (Holt et al. 2013) and to inform the optimal configuration of reserves (Wiersma & Urban 2005). β-diversity has been used to evaluate the landscape-scale implications of farm management (Gabriel et al. 2006) and to assess the effects of environmental change on biotic homogenization (Baiser et al. 2012). Because γ-diversity is entirely determined by the α- and β-components of diversity, empirical estimates of β-diversity link biodiversity at local and regional scales (Smith 2010). Turnover in abundance also has important implications for ecosystem functioning and monitoring responses to disturbance (Balata, Piazzi & Benedetti-Cecchi 2007).
A key distinction is between β-diversity metrics that use presence-absence data and metrics that use species abundances (Anderson et al. 2011). Abundance data are clearly more information-rich than presence-absence data, and this can change how we interpret spatial variation in assemblage structure (Cassey et al. 2008). For presence-absence metrics, the only visible differences between sites are in species identities. Abundance-based measures detect more nuanced variation: we may observe all the same species at two sites, but those species may have different abundance ranks (the commonest species here may be rare there and vice versa). Even when the ranks are the same, evenness of abundances can vary (the common species can be more or less dominant). Consequently, we distinguish sensitivity to (i) species turnover, (ii) species richness differences, (iii) rank abundance shifts and (iv) evenness differences as distinct components of β-diversity. Abundance-based indices may also be expected to be more robust to incomplete sampling (Beck, Holloway & Schwanghart 2013): stochastic differences in rare species are an artefact of undersampling, but abundance-based metrics are less influenced by turnover of rare species than their presence-absence counterparts. Whilst abundance information makes our inferences about β-diversity more powerful, it also introduces a source of subjectivity: we need to decide how to weight turnover in common and rare species. Koleff, Gaston & Lennon (2003a) compared the performance of 24 presence-absence metrics of β-diversity and identified a number of trade-offs and redundancies among the presence-absence metrics available. Overall, they recommended βsim (Lennon et al. 2001) as the best-performing index. We are lacking an equivalent investigation into the performance and 'personality' of the many abundance-based metrics available.
We test 16 conceptual properties that are important for an abundance-based β-diversity metric, whatever the application. Where applicable, we note the relationship between these properties and those previously described in the literature.

desirable properties
We make a distinction between conceptual and sampling properties. Conceptual properties (C1-C16) are intrinsic to the design of the metric (e.g. the use of abundance information and whether the metric has a fixed upper limit). Sampling properties (S1-S2) explore responses to undersampling: true differences between assemblages are confounded by imperfect detection, especially of rare species. We consider both conceptual and sampling properties as desirable when choosing a metric.
Independence of α-diversity (C1)
β-diversity should be independent of α-diversity within assemblage pairs, so that the α- and β-components of diversity can be partitioned (Jost 2007; Chase et al. 2011) and β-diversity can be meaningfully compared between regions differing in α-diversity. If α- and β-diversities are independent, then pairs of assemblages with the same proportion of species turnover should have the same value of β-diversity, regardless of whether α-diversity within those assemblages is high or low. Legendre & De Cáceres (2013: property 10) test this property algebraically for 16 dissimilarity metrics. In P1, we consider an alternative where assemblage pairs have unequal species richness.
β is cumulative along a gradient of species turnover (C2)
When assemblages are positioned along an environmental gradient, species turnover will be directional. Koleff, Gaston & Lennon (2003a) call this property additivity. Species are gradually replaced as conditions change, so turnover between neighbouring pairs of assemblages is lower than between pairs that are farther apart. When samples A, B and C are positioned in sequence along such a gradient, summed β-diversity between consecutive pairs of samples (βA,B + βB,C) should equal the total β-diversity between the end points of the gradient (βA,C). Metrics with disproportionate sensitivity to small amounts of turnover will lead to overestimates of cumulative β.
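A small worked example may make the cumulative property concrete. The sketch below is written in Python rather than the R used for the analyses, and it is not the authors' code. The three assemblages are hypothetical: ten equally abundant species, with two species replaced at each step of a strictly directional gradient. Under those assumptions, Bray-Curtis dissimilarity is cumulative, whereas Jaccard dissimilarity computed on the same assemblages is not.

```python
# Illustrative sketch only: hypothetical assemblages, not the authors' simulation code.

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two {species: abundance} dicts."""
    species = set(x) | set(y)
    diff = sum(abs(x.get(s, 0) - y.get(s, 0)) for s in species)
    return diff / (sum(x.values()) + sum(y.values()))

def jaccard(x, y):
    """Presence-absence Jaccard dissimilarity: 1 - shared species / all species."""
    px = {s for s, n in x.items() if n > 0}
    py = {s for s, n in y.items() if n > 0}
    return 1 - len(px & py) / len(px | py)

def assemblage(first, last, abundance=10):
    """Hypothetical assemblage holding species sp<first>..sp<last>, all equally abundant."""
    return {f"sp{i}": abundance for i in range(first, last + 1)}

A = assemblage(1, 10)   # start of the gradient
B = assemblage(3, 12)   # two species of A replaced
C = assemblage(5, 14)   # two further species of A replaced

for name, metric in [("Bray-Curtis", bray_curtis), ("Jaccard", jaccard)]:
    summed = metric(A, B) + metric(B, C)
    print(name, round(summed, 3), round(metric(A, C), 3))
```

For Bray-Curtis the summed and end-point values are both 0.4. For Jaccard the summed value (0.667) exceeds the end-point value (0.571), which is the kind of overestimate of cumulative β described above.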
Similarity is probabilistic when assemblages are independently and identically distributed (C3)
When assemblages are independently drawn from within a larger, well-mixed metacommunity, then similarity (i.e. 1 − β for metrics with an upper limit of 1) among multiple pairs of assemblages should be probabilistic. The expected similarity of assemblages A and C (1 − βA,C) is given by the product of similarities between A and B, and B and C, (1 − βA,B)(1 − βB,C). Metrics that lack an upper limit cannot be converted to their similarity complement and so cannot be probabilistic.
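The product rule can be written as a two-line check. This is a minimal Python sketch with hypothetical function names and input values, assuming a metric bounded between 0 and 1. It is not the authors' implementation.

```python
# Illustrative sketch only: function names and values are hypothetical.

def predicted_similarity(beta_ab, beta_bc):
    """Similarity of A and C expected if similarity is probabilistic (C3)."""
    return (1.0 - beta_ab) * (1.0 - beta_bc)

def departure_from_probabilistic(beta_ab, beta_bc, beta_ac):
    """Observed minus predicted similarity; zero for a probabilistic metric."""
    return (1.0 - beta_ac) - predicted_similarity(beta_ab, beta_bc)

# Hypothetical case: dissimilarity of 0.2 between A and B, and between B and C.
print(round(predicted_similarity(0.2, 0.2), 2))  # 0.64
```

With β = 0.2 for both consecutive pairs, the predicted similarity of A and C is 0.64, so a probabilistic metric would return βA,C = 0.36. Along a directional gradient, the cumulative property (C2) would instead lead us to expect 0.4.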

Minimum of zero (C4)
Legendre & De Cáceres (2013: property 1) state that when comparing an assemblage to itself, β should always be zero, and when comparing two different assemblages, β should be equal to or greater than zero.

Fixed upper limit (C5)
Legendre & De Cáceres (2013: property 9) note that bounded metrics are easier to compare than unbounded ones. For example, the maximum value of βEuclidean and βManhattan depends on the combined abundances of an assemblage pair, making it difficult to interpret the values of β when assemblage pairs have different numbers of individuals.
Monotonic increase with species turnover (C6)
β should be a strictly increasing monotonic function of the proportion of species in the first assemblage that are replaced by new species in the second assemblage; otherwise, it is not reflecting species turnover. A pair of assemblages in which 20% of assemblage A species are replaced by new species in assemblage B should have lower β-diversity than an assemblage pair with 40% turnover. The property is closely related to the property described by Jost, Chao & Chazdon (2011) as monotonicity.
Monotonic increase with the decoupling of species ranks (C7)
An abundance-based β-diversity metric should be sensitive to the degree to which species ranks are decoupled between assemblage pairs (reflecting differences in the dominant and rare species). Therefore, β-diversity should decrease monotonically with increased correlation between species ranks.

Monotonic increase with differences in evenness (C8)
Even if two sites have the same species, with the same rank order of abundances, they may still differ in evenness: the commonest species may dominate more in some sites than others. A good abundance-based β-diversity metric should increase monotonically as differences in evenness between sites grow larger. Properties C7 and C8 are two aspects of a property described as monotonicity to changes in abundance by Legendre & De Cáceres (2013: property 3).
β is lower for complete decoupling of species ranks than for complete species turnover (C9)
Consider a pair of assemblages in which all species are unshared and a second pair of assemblages in which all species are shared, but the rank abundances are reversed, such that the dominant species in assemblage A becomes the rarest in assemblage B and vice versa. The first pair of assemblages must be considered more different than the second pair.
β is lower for evenness differences than for complete species turnover (C10)
As an alternative scenario for abundance differences, consider a pair of assemblages in which all species are shared: in the first assemblage, the abundances are perfectly even and in the second assemblage, all species are singletons except the dominant species (i.e. extreme unevenness). Compare this to an assemblage pair where no species are shared. As above, the loss or gain of a species should always be deemed a more extreme difference than a shift in its abundance. Sites with no species in common should have the largest values of β (Legendre & De Cáceres 2013: property 5). Properties C9 and C10 describe two alternative scenarios in which this property should hold.

Symmetry (C11)
Legendre & De Cáceres (2013: property 2) and Koleff, Gaston & Lennon (2003a) note that the order in which two assemblages, A and B, are considered should not change the value of β for that pair (i.e. βA,B = βB,A).

Double-zero asymmetry (C12)
Legendre & De Cáceres (2013: property 4) argue that the absence of a species from both assemblages does not indicate resemblance between the two assemblages in the way that shared presences do: double absences contain no information about the distance in ecological niche space. Consequently, the addition of zero abundances to both assemblages should not change the value of β, whilst the addition of shared presences should lower the value of β.
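The two conditions can be checked by hand on a toy example. The Python sketch below uses hypothetical abundance vectors and two of the metrics named in the text, Bray-Curtis and Euclidean distance. It is an illustration of the arithmetic only, not a reproduction of the test applied in this study.

```python
# Illustrative sketch only: the abundance vectors are hypothetical.
import math

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity for two equal-length abundance lists."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

def euclidean(x, y):
    """Euclidean distance for two equal-length abundance lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [10, 5, 0, 3]
y = [0, 5, 8, 3]

with_zeros = (x + [0, 0], y + [0, 0])    # two double absences added
with_shared = (x + [4, 4], y + [4, 4])   # two shared species, 4 individuals each

for name, metric in [("Bray-Curtis", bray_curtis), ("Euclidean", euclidean)]:
    print(name,
          round(metric(x, y), 3),
          round(metric(*with_zeros), 3),
          round(metric(*with_shared), 3))
```

In this example Bray-Curtis is unchanged by the double absences and falls when the shared presences are added, so both conditions are met. Euclidean distance is unchanged in both cases, so the second condition is not met.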
β does not decrease in a series of nested assemblages (C13)
Metrics vary in how they respond to nestedness. However, β should never decrease when species richness differences increase, as the addition of unique species should not increase similarity (Legendre & De Cáceres 2013: property 6).

Independence of species replication (C14)
When all species in both the assemblages being compared are duplicated, the value of β should remain constant. This becomes important when identical subsets of an assemblage are pooled (Jost, Chao & Chazdon 2011; Legendre & De Cáceres 2013: property 7).

Independence of units of abundance (C15)
When comparing β among regions differing in productivity or the units used to measure abundance, metrics that are sensitive to the total abundance in an assemblage pair will be inappropriate. Legendre & De Cáceres (2013: property 8) call this property invariance to measurement units.

Independent of differences in abundance (C16)
This property was described as invariance to the total abundance in each assemblage by Legendre & De Cáceres (2013: property 11) and density invariance by Jost, Chao & Chazdon (2011). It is designed to identify metrics that are mathematically dependent on differences in abundance between sampling units. C15 and C16 differ from undersampling in that there is no stochasticity.

Unbiased by undersampling (S1)
In all previous simulations, we have assumed our simulated assemblages represent the 'true' composition. However, β-diversity is usually estimated from samples, which generates differences in richness and abundances as a sampling artefact (Chao et al. 2005, 2006). A good β-diversity metric should remain constant as the sample size decreases.

Unbiased by unequal sampling effort (S2)
Differences in sample size can also inflate β-diversity due to imperfect detection of rare species. A good β-diversity metric should remain constant with increasing difference in sample sizes.

personality properties
In addition to the desirable properties identified above, β-diversity metrics may differ in other respects that are worthy of note. We term these differences the 'personality' of the metrics; their importance will depend on the ecological question concerned.

Sensitivity to nestedness (P1)
For presence-absence metrics, Koleff, Gaston & Lennon (2003a) distinguish 'narrow-sense' metrics, which measure purely species turnover, from 'broad-sense' metrics, which measure both species turnover and differences in species richness. We may want a β-diversity metric to reflect differences in richness, as these will mean that one site will have species that are absent in another. On the other hand, we may want the value of β to measure purely species turnover, especially if we are comparing β-diversity between regions with different species richness. This differs from the test in C1 (independence of differences in α-diversity): in C1, each pair of assemblages we compare has an equal number of species. Here, species richness differs between the two assemblages we compare.
Relative sensitivity to nestedness and turnover components of β (P2)
We test two metrics (βBray-Curtis and βRůžička) that can be additively partitioned into independent nestedness and turnover components (Baselga 2013; Podani, Ricotta & Schmera 2013; Legendre 2014). For metrics that cannot be deconstructed, it is useful to compare the value of β for complete turnover to that for extreme nestedness to estimate the relative sensitivity to these components.

Relative weighting of species turnover and abundance differences (P3 and P4)
We have identified two ways in which species abundances can vary between assemblages: decoupling of species ranks and differences in evenness. The relative weighting of these components and species turnover is a useful property to quantify. The ideal weighting is somewhat subjective (provided that β-diversity is less for extreme differences in abundance than for turnover of a species, see C9 and C10, above).

Relative sensitivity to turnover of rare versus common species (P5)
There is scope for variation in how common versus rare species contribute to β. One reason for investigating this is the occupancy-abundance relationship (ONR). Positive ONRs are nearly ubiquitous (Brown 1984) and reflect that rare species are generally more range restricted and so more likely to be turned over than are locally abundant (and more widespread) species.
Here, we manipulate the composition and structure of hypothetical assemblages and apply 29 β-diversity metrics to the resulting assemblage pairs. Each metric is evaluated against 18 desirable properties (C1-C16 and S1-S2) to generate a score card, which we use to identify the best-performing abundance-based β-diversity metrics. We then explore how personality properties may affect the choice of metric for different ecological applications.

β-diversity metrics
In total, we evaluated 24 abundance-based metrics and five presence-absence metrics (Appendix S1, Supplementary Information). All metrics are expressed so that higher values of β indicate more differentiation (1 − β for similarity metrics). For comparability, metrics were rescaled relative to the maximum value obtained in each set of simulations, before calculating scores.

hypothetical species assemblages
Abundance differences in our hypothetical assemblages were modelled using the log series distribution (Fisher, Corbet & Williams 1943) using the function fisher.ecosystem in R package 'untb' (Hankin 2007). Our conclusions would be qualitatively identical using other commonly used models of the species abundance distribution (McGill 2010). A hypothetical species assemblage with 100 species and 10 000 individuals was used as the starting assemblage for all simulations.

evaluation of properties
For β-diversity metrics that have been previously implemented in R, the functions vegdist and d and adipart in R package 'vegan' v.2.0-5 (Oksanen et al. 2013) were used to calculate β-diversity. Formulae for the remaining metrics can be found in Appendix S1. Each of our properties was assessed by exploring how measured β-diversity covaried with a test-specific parameter, describing some aspect of assemblage structure. We manipulated the starting assemblage according to the specific rules for each test. Each simulation described below was run 10 000 times at each unique combination of the test-specific parameter and proportion species turnover, t = 0, 0.2, 0.4, 0.6, 0.8 and 1.0, to obtain median β for that combination. All simulations were carried out in R v.3.0.3 (R Core Team 2014). Formulae for evaluating β-diversity metrics for each of the properties can be found in Appendix S2.

Independence of α-diversity (C1)
Fisher's α of assemblages was manipulated using the function fisher.ecosystem in R package 'untb' (Hankin 2007). The expected number of individuals was fixed at N = 10 000, whilst manipulating the number of expected species, S, to generate a series of assemblages with S = 300, 250, 200, 150, 100, 80, 60, 40, 20 and 10. Fisher's α was estimated for each assemblage. For each α-diversity:turnover combination, we calculated error as the difference between the median β-diversity at each level of α and the median β-diversity when α was highest (S = 300): dependence on α-diversity was measured as the root-mean-squared error (RMSE).
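The RMSE scoring used for this and the other quantitative properties can be sketched as follows. This is a Python illustration, not the authors' R code. The function name and the simulated values are hypothetical, and it covers a single level of turnover. Whether the reference level itself contributes a zero error to the mean is an assumption made here.

```python
# Illustrative sketch only: names and values are hypothetical.
import math
from statistics import median

def rmse_against_reference(betas_by_level, reference_level):
    """RMSE of median beta at each level relative to the reference level.

    betas_by_level maps a parameter value (e.g. species richness S) to the
    list of simulated beta values obtained at that value.
    Assumption: the reference level is included and contributes zero error.
    """
    ref = median(betas_by_level[reference_level])
    errors = [median(values) - ref for values in betas_by_level.values()]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical simulated beta values at three richness levels.
simulated = {
    300: [0.50, 0.52, 0.48],
    100: [0.50, 0.50, 0.50],
    10: [0.58, 0.60, 0.62],
}
score = rmse_against_reference(simulated, reference_level=300)
print(round(score, 4))
```

A metric that is independent of the manipulated parameter has a score of zero. Larger scores indicate stronger dependence.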
β is cumulative along a gradient of species turnover (C2)
In each simulation, three assemblages, A, B and C, were generated according to the following rules: a proportion of species, t, in assemblage A were randomly selected to be turned over in assemblage B (t = 0, 0.1, 0.2, 0.3, 0.4 and 0.5). Of the species in assemblage B, the same proportion was turned over in assemblage C, with the condition that species shared between assemblages A and B were g times more likely to be turned over in assemblage C than species unique to assemblage B, where g is a test-specific parameter which we manipulate to simulate different strengths of directional species turnover (g = 1, 5, 10, 50, 100, 500 or 1000). At each turnover:gradient combination, we calculated error as the difference between observed β-diversity for assemblages A and C (βA,C) and the value predicted if the metric was cumulative (βA,B + βB,C): departure from cumulative β was evaluated as the RMSE.
Similarity is probabilistic when assemblages are distributed independently and identically in space (C3)
In each simulation, three assemblages, A, B and C, were generated according to the following rules: a proportion, p (p = 0-1 in increments of 0.2), of the species in assemblage A were randomly selected to be conserved in assemblage B. This process was repeated with the species in assemblages A and B (with the same value of p) to obtain the third assemblage, C. Species lost from assemblage A can reappear in assemblage C, as we would expect in independent samples drawn from a well-mixed species pool, but entirely novel species can also appear in assemblage C. In each simulation, we calculated error as the difference between observed similarity for assemblages A and C (1 − βA,C) and the similarity predicted if the metric is probabilistic, (1 − βA,B)(1 − βB,C): departure from probabilistic similarity was evaluated as the RMSE.

Minimum of zero (C4)
The starting assemblage was manipulated to generate assemblage pairs with increasing differences in species turnover, t; decoupling of species ranks, r; and evenness differences, ΔE. Methods for these simulations can be found in C7 and C8. Two behaviours were tested: (i) β is zero for identical assemblages and (ii) β is greater than or equal to zero when assemblages are different, either because of species turnover, decoupling of species ranks or evenness differences. The metric was scored as TRUE if both qualities were met.

Fixed upper bound (C5)
This property was evaluated as TRUE/FALSE by applying equation 8 and then equation 3 in Legendre & De Cáceres (2013: property 9) to calculate the upper limit of a metric, using a pair of assemblages with no shared species.

Monotonic increase with species turnover (C6)
A series of assemblages with increasing species turnover was generated by randomly selecting a proportion of species (t = 0-1 in increments of 0.2) in the starting assemblage and assigning them a new identity in the new assemblage. Metrics were scored as TRUE if each consecutive increase in species turnover generated an increase in median β.
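The turnover manipulation can be sketched in a few lines. The Python code below is a simplified illustration rather than the authors' R implementation. The starting assemblage (10 species with halving abundances), the function names, the seed and the number of runs are all assumptions, and Bray-Curtis stands in for any of the metrics tested.

```python
# Illustrative sketch only: assemblage, names, seed and run count are hypothetical.
import random
from statistics import median

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two {species: abundance} dicts."""
    species = set(x) | set(y)
    diff = sum(abs(x.get(s, 0) - y.get(s, 0)) for s in species)
    return diff / (sum(x.values()) + sum(y.values()))

def turn_over(assemblage, t, rng):
    """Give a randomly chosen proportion t of species a new identity,
    keeping their abundances unchanged."""
    species = sorted(assemblage)
    replaced = set(rng.sample(species, round(t * len(species))))
    return {(s + "_new" if s in replaced else s): n
            for s, n in assemblage.items()}

# Hypothetical starting assemblage: 10 species, abundances 512, 256, ..., 1.
start = {f"sp{i}": 2 ** (10 - i) for i in range(1, 11)}
rng = random.Random(1)  # fixed seed so the sketch is repeatable

for t in (0, 0.2, 0.4, 0.6, 0.8, 1.0):
    betas = [bray_curtis(start, turn_over(start, t, rng)) for _ in range(1000)]
    print(t, round(median(betas), 3))
```

Because abundances are carried over unchanged, each Bray-Curtis value here equals the share of individuals that belong to replaced species, so β is 0 at t = 0 and 1 at t = 1. The test then asks whether the medians at intermediate t increase at every step.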

Monotonic increase with decoupling of species ranks (C7)
A series of assemblages with increased decoupling of species ranks was generated by determining species ranks in the new assemblage partially by the ranks in the starting assemblage and partially at random (r = +1.0 (a perfect positive correlation between ranks) to −1.0 (a perfect negative correlation) in increments of 0.1). Metrics were scored as TRUE if each incremental decrease in r generated an increase in median β at a given level of species turnover.

Monotonic increase with differences in evenness (C8)
In the starting assemblage for this test, all except the dominant species have just one individual (extreme unevenness). A series of assemblages with increasing evenness differences was generated by redistributing individuals from the dominant species among the other 99 species: the probability of being allocated to each species was determined by raising the abundances in a Fisher's log series distributed assemblage to a power, b = 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 4.0, 6.0 and 8.0. These values were chosen to generate assemblages with both more and less evenness relative to a Fisher's log series distribution. Metrics were scored as TRUE if each incremental increase in ΔE led to an increase in median β.
β under extreme decoupling of species ranks < β when species turnover is complete; β under extreme evenness differences < β when species turnover is complete (C9 and C10)
The turnover of a species should be weighted greater than a change in abundance. Metrics were scored as TRUE for these two properties if median β is lower for extreme decoupling of species ranks (r = −1) and extreme evenness differences (ΔE = 0.97) than for complete species turnover (t = 1). The relative weighting of abundance differences and species turnover also has a personality component (see P3 and P4).

Symmetry (C11)
Symmetry was tested by reversing the order in which assemblages A and B were given to a metric. This was tested for assemblage pairs with multiple levels of species turnover, t; decoupling of species ranks, r; and evenness differences, ΔE. A metric was scored as TRUE if βA,B = βB,A in all simulations.

Double-zero asymmetry (C12)
We generated a series of eleven assemblage pairs, the first with no double zeros and then consecutively adding up to 10 double zeros to the assemblage pair. This was repeated, but adding double presences of equal abundance. Abundances in each simulation were chosen at random from within the starting assemblage. Two behaviours were tested: (i) β does not change with the addition of double zeros and (ii) β decreases with the addition of double presences. Metrics were scored as TRUE if both conditions were met.

β does not decrease in a series of nested assemblages (C13)
A series of nested assemblages was generated by randomly selecting a number of species to be lost from the starting assemblage (S = 0-90 in increments of 10). Metrics were scored as TRUE if each incremental increase in species loss led to an increase in median β.

Independence of species replication (C14)
A series of 10 assemblage pairs with all species replicated x times at six levels of species turnover, t, was used to simulate the effect of pooling identical subsets of unshared species. At each combination of x (in 1-10) and t, error was calculated as the difference between median β in one identical subset and when x identical subsets were pooled. Metrics were scored as the RMSE.

Independent of the units of abundance (C15)
Following the method in Legendre & De Cáceres (2013), we test this property by generating a series of assemblage pairs in which the abundances in both assemblages are multiplied by a constant factor (cc = 1-10). Error was calculated as the difference between median β in the starting assemblage pair (cc = 1) and between median β at each combination of cc and species turnover, t. Metrics were scored as the RMSE.

Independence of differences in abundance (C16)
We test this property by generating a series of assemblage pairs in which the abundances in one assemblage are multiplied by a constant factor (c = 1-10). At each c:turnover combination, error was calculated as the difference between median β at each value of c and median β in the starting assemblage pair (c = 1). Metrics were scored as the RMSE.
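The difference between the two scaling tests (C15 and C16) is easiest to see on a toy example. The Python sketch below uses hypothetical abundance vectors and a hypothetical scaling factor of 10. The third function computes one minus the summed minimum relative abundances, which is assumed here to correspond to the form usually attributed to Renkonen. None of this is the authors' code.

```python
# Illustrative sketch only: vectors, factor and function names are hypothetical.

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity for two equal-length abundance lists."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

def manhattan(x, y):
    """Manhattan distance: summed absolute abundance differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def proportional_dissimilarity(x, y):
    """1 - summed minimum relative abundance (assumed Renkonen-type form)."""
    nx, ny = sum(x), sum(y)
    return 1 - sum(min(a / nx, b / ny) for a, b in zip(x, y))

x = [10, 5, 0, 3]
y = [0, 5, 8, 3]
c = 10

both_scaled = ([c * a for a in x], [c * b for b in y])  # C15: change of units
one_scaled = (x, [c * b for b in y])                    # C16: unequal totals

for name, metric in [("Bray-Curtis", bray_curtis),
                     ("Manhattan", manhattan),
                     ("Proportional", proportional_dissimilarity)]:
    print(name,
          round(metric(x, y), 3),
          round(metric(*both_scaled), 3),
          round(metric(*one_scaled), 3))
```

In this example Manhattan distance scales with the units, Bray-Curtis is unaffected by a change of units but rises when only one assemblage is scaled, and the measure based on relative abundances is unaffected by either manipulation.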
The following two properties test the behaviour of metrics when samples are used to infer β-diversity.

Independence of sample size (S1)
For a series of assemblage pairs with different levels of turnover, t, both assemblages were randomly sampled, without replacement, to generate a series of assemblage pairs with equal sample sizes of N = 10 000 (fully censused), 9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000, 500, 200, 100, 50, 20 and 10. For each sample size:turnover combination, error was calculated as the difference between median β-diversity at sample size N and median β-diversity in a fully censused assemblage: dependence on sample size was measured as the RMSE.
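Sampling individuals without replacement can be sketched as below. This Python illustration is not the authors' R code. The starting assemblage is a hypothetical 10-species example with 1023 individuals rather than the 10 000-individual log series assemblage, and the function names, seed and sample sizes are assumptions.

```python
# Illustrative sketch only: assemblage, names, seed and sample sizes are hypothetical.
import random
from collections import Counter

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two {species: abundance} dicts."""
    species = set(x) | set(y)
    diff = sum(abs(x.get(s, 0) - y.get(s, 0)) for s in species)
    return diff / (sum(x.values()) + sum(y.values()))

def subsample(assemblage, n, rng):
    """Draw n individuals without replacement from a {species: abundance} dict."""
    individuals = [s for s, count in assemblage.items() for _ in range(count)]
    return dict(Counter(rng.sample(individuals, n)))

# Hypothetical assemblage: 10 species, abundances 512, 256, ..., 1.
start = {f"sp{i}": 2 ** (10 - i) for i in range(1, 11)}
total = sum(start.values())
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two samples from the SAME assemblage: the true beta is zero (t = 0),
# so any non-zero value is a sampling artefact.
for n in (total, 200, 50, 10):
    a, b = subsample(start, n, rng), subsample(start, n, rng)
    print(n, round(bray_curtis(a, b), 3))
```

A full census returns the assemblage unchanged and therefore β = 0. Values above zero at smaller sample sizes arise from sampling alone, which is the bias that S1 measures.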

Independence of unequal sample sizes (S2)
For a series of assemblage pairs with different levels of turnover, t, one assemblage in each pair was randomly sampled, without replacement, whilst the other was fully sampled to generate sample size differences of ΔN = 0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 9500, 9800, 9900, 9950, 9980 and 9990. As above, for each ΔN:turnover combination, we calculated error as the difference between the median β-diversity at sample size difference ΔN and median β-diversity when both assemblages were fully censused (ΔN = 0): dependence on unequal sample size was measured as the RMSE.

Sensitivity to nestedness (P1)
To generate ten assemblages with differences in species richness, ΔS, we randomly selected S species (see C13) to be lost from the starting assemblage. For each species loss:turnover combination, we calculated error as the difference between the median β-diversity for S and median β-diversity when species richness was equal (S = 0): sensitivity to nestedness was measured as the RMSE.

Relative sensitivity to nestedness and turnover (P2)
This property was measured as the ratio of β under extreme nestedness but no turnover (ΔS = 90, t = 0) and the value for complete species turnover but no species loss (t = 1, ΔS = 0).

Relative sensitivity to abundance differences and species turnover (P3 and P4)
We calculated β under extreme decoupling of species ranks (r = −1), and extreme differences in evenness (ΔE = 0.97), using simulated assemblages from C7 and C8. These values were expressed as a proportion of the value of median β under complete species turnover, t = 1.
Relative sensitivity to turnover in rare versus common species (P5)
We turned over a single species in the starting assemblage, from the dominant (1450 individuals) to the rarest species (1 individual) and recorded the value of β for each. Relative sensitivity to rare and common species was evaluated as the ratio between β when the rarest species was turned over to β when the dominant species was turned over.
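The ratio can be illustrated with a toy assemblage. In the Python sketch below, only the abundances of the dominant (1450) and rarest (1) species follow the text. The three intermediate species, the species labels and the function names are hypothetical, and the Sørensen formula shown is the standard presence-absence form. This is not the authors' code.

```python
# Illustrative sketch only: intermediate species and names are hypothetical.

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two {species: abundance} dicts."""
    species = set(x) | set(y)
    diff = sum(abs(x.get(s, 0) - y.get(s, 0)) for s in species)
    return diff / (sum(x.values()) + sum(y.values()))

def sorensen(x, y):
    """Presence-absence Sorensen dissimilarity: 1 - 2*shared / (Sx + Sy)."""
    px, py = set(x), set(y)
    return 1 - 2 * len(px & py) / (len(px) + len(py))

def replace_species(assemblage, species):
    """Turn over one species: same abundance, new identity."""
    out = dict(assemblage)
    out[species + "_new"] = out.pop(species)
    return out

start = {"dominant": 1450, "common": 300, "mid": 50, "scarce": 9, "rarest": 1}

ratios = {}
for name, metric in [("Bray-Curtis", bray_curtis), ("Sorensen", sorensen)]:
    beta_rare = metric(start, replace_species(start, "rarest"))
    beta_dominant = metric(start, replace_species(start, "dominant"))
    ratios[name] = beta_rare / beta_dominant
    print(name, ratios[name])
```

The presence-absence metric gives a ratio of 1, because every species counts equally. For Bray-Curtis the ratio is 1/1450, the ratio of the two abundances, so turnover of the rarest species is almost invisible.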
In order to investigate redundancy and complementarity among the 29 metrics, a principal component analysis was performed using all quantitatively measured properties, using the function prcomp in R version 3.0.3 (R Development Core Team, 2014). We also investigate which of the metrics are Pareto-dominated, that is, those metrics that are outperformed or equalled across all desirable properties.
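Identifying Pareto-dominated metrics can be sketched as follows. This Python illustration uses hypothetical metric names and scores, with lower scores treated as better, as for the RMSE-based properties. It uses the standard definition, in which a metric is dominated if another metric is at least as good on every property and strictly better on at least one. That is an assumption about how ties were treated, and qualitative TRUE/FALSE properties would first need to be coded numerically.

```python
# Illustrative sketch only: metric names and scores are hypothetical.

def dominates(a, b):
    """True if score vector a is at least as good as b on every property
    and strictly better on at least one (lower scores are better)."""
    return (all(p <= q for p, q in zip(a, b))
            and any(p < q for p, q in zip(a, b)))

def pareto_dominated(scores):
    """Return the names of metrics dominated by at least one other metric."""
    return {name for name, s in scores.items()
            if any(dominates(other, s)
                   for other_name, other in scores.items()
                   if other_name != name)}

# Hypothetical scores on three quantitative properties.
scores = {
    "metric_1": (0.00, 0.10, 0.05),
    "metric_2": (0.02, 0.10, 0.07),  # never better than metric_1
    "metric_3": (0.30, 0.01, 0.20),  # best on the second property
}
print(pareto_dominated(scores))
```

Here only metric_2 is dominated. metric_3 scores poorly on two properties but survives because no other metric matches it on the second.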

Results
We have scored the performance of 29 metrics for 16 conceptual and two sampling properties (Table 1). In addition, a further five personality tests have enabled us to identify more subjective variation in metrics' behaviour (Table 2). The results of all simulations are presented in Appendix S3.

conceptual and sampling properties
All 29 metrics satisfied properties C4, C6 (minimum of zero and positiveness, monotonic increase with species turnover: Fig. S4) and C11 (symmetry). We use the remaining properties to discriminate between the performances of metrics. Thirteen metrics were Pareto-dominated (Tables 1, S5). We focus on the metrics that performed best against the conceptual and sampling properties and consider their contrasting strengths and weaknesses.
Nine metrics passed all qualitatively scored tests (βMorisita, βHorn, βMorisita-Horn, βJost Simpson, βRenkonen, βKulczynski, βBray-Curtis, βCanberra and βRůžička: C5-C13, Table 1). The presence-absence metrics βsim, βClassic Jaccard and βClassic Sørensen failed only C7 and C8 (monotonic increase with decoupling of species ranks and evenness differences: Figs S5 and S6), as such measures, by definition, are insensitive to differences in abundance. All abundance-based metrics became less sensitive to abundance differences as the species turnover between assemblages became more extreme (Figs S5 and S6). Across all quantitative tests, βMorisita obtained the best mean score. The presence-absence metric, βsim, performed best or joint best for six of the eight quantitative conceptual and sampling properties, with the exception of C2 (β is cumulative) and S1 (independence of sample size). βMorisita was the most robust metric to undersampling, performing best when both assemblages were undersampled (S1) and second best under unequal sample sizes (S2). βsim was best for S2, but performed poorly for S1 (Figs S2 and S12: Table 1). βCanberra scored equally highly with βsim, βClassic Sørensen and βClassic Jaccard for C1 (independence of α-diversity: Fig. S1), C14 (independence of species replication: Fig. S7) and C15 (independence of measurement units: Fig. S8), but performed poorly on C2 (β is cumulative: Fig. S2), C3 (similarity is probabilistic: Fig. S3) and C16 (independence of differences in abundance: Fig. S9) and for both sampling properties (S1 and S2). βBinomial was joint best for C15 (independence of measurement units: Fig. S8), but performed poorly for all other quantitative properties. βHorn and βRenkonen performed relatively well across all quantitative properties, but were never best for any property.
In sampling simulations S1 and S2 (Table 1; Figs S12 and S13), most presence-absence metrics were positively biased by undersampling, with the exception of βChao Sørensen and βChao Jaccard, which have a correction for undersampling.
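The positive bias of an uncorrected presence-absence metric under undersampling can be reproduced with a toy simulation. The sketch below is not the simulation design used for S1 and S2: the community (50 species with relative abundance proportional to 1/rank), the sample sizes, the number of replicates and the seed are all assumptions chosen for illustration, and the Morisita-Horn form is used as a stand-in for a metric dominated by common species. Both samples are drawn from the same assemblage, so any β above zero is sampling bias.

```python
import random
from collections import Counter


def sorensen_dissimilarity(x: Counter, y: Counter) -> float:
    """Presence-absence Sørensen dissimilarity, (b + c) / (2a + b + c)."""
    sx, sy = set(x), set(y)
    a, b, c = len(sx & sy), len(sx - sy), len(sy - sx)
    return (b + c) / (2 * a + b + c)


def morisita_horn_dissimilarity(x: Counter, y: Counter) -> float:
    """1 minus the Morisita-Horn overlap, computed from abundance counts."""
    total_x, total_y = sum(x.values()), sum(y.values())
    cross = sum(x[s] * y[s] for s in x)  # a Counter returns 0 for absent species
    dx = sum(v * v for v in x.values()) / total_x ** 2
    dy = sum(v * v for v in y.values()) / total_y ** 2
    return 1 - 2 * cross / ((dx + dy) * total_x * total_y)


def mean_bias(n_individuals: int, n_reps: int = 200,
              n_species: int = 50, seed: int = 1):
    """Mean dissimilarity between two samples of the SAME assemblage."""
    rng = random.Random(seed)
    species = list(range(n_species))
    weights = [1 / (rank + 1) for rank in species]  # hypothetical abundance curve
    sor = mh = 0.0
    for _ in range(n_reps):
        s1 = Counter(rng.choices(species, weights=weights, k=n_individuals))
        s2 = Counter(rng.choices(species, weights=weights, k=n_individuals))
        sor += sorensen_dissimilarity(s1, s2)
        mh += morisita_horn_dissimilarity(s1, s2)
    return sor / n_reps, mh / n_reps


for n in (100, 2000):
    sor, mh = mean_bias(n)
    print(f"n={n}: Sørensen={sor:.3f}, Morisita-Horn={mh:.3f}")
```

Under these assumptions, the presence-absence value is inflated at the small sample size because rare species are detected in one sample but not the other, and the inflation shrinks as the sample size grows.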

Personality properties
With the exception of βsim and the partitioned turnover components of βBray-Curtis and βRužička, all metrics were at least somewhat sensitive to nestedness (P1), although there were fourfold differences in the degree of sensitivity to species richness differences (P2, Table 2).
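The insensitivity of βsim to nestedness follows directly from its form. The sketch below uses the standard matching components (a = shared species, b and c = species unique to each assemblage) and the usual definitions of βsim, Sørensen and Jaccard dissimilarity; the species lists are hypothetical, and both assemblages are assumed to be non-empty.

```python
def matching_components(site1: set, site2: set):
    """Return (a, b, c): shared species and species unique to each site."""
    return len(site1 & site2), len(site1 - site2), len(site2 - site1)


def beta_sim(site1: set, site2: set) -> float:
    a, b, c = matching_components(site1, site2)
    return min(b, c) / (a + min(b, c))


def beta_sorensen(site1: set, site2: set) -> float:
    a, b, c = matching_components(site1, site2)
    return (b + c) / (2 * a + b + c)


def beta_jaccard(site1: set, site2: set) -> float:
    a, b, c = matching_components(site1, site2)
    return (b + c) / (a + b + c)


# Hypothetical nested pair: the poor site is a strict subset of the rich site.
rich = set("ABCDEFGHIJ")
poor = set("ABCD")

print(beta_sim(rich, poor))       # 0.0: no species are replaced
print(beta_sorensen(rich, poor))  # 6/14, driven only by the richness difference
print(beta_jaccard(rich, poor))   # 0.6
```

Because min(b, c) is zero whenever one assemblage is a subset of the other, βsim responds only to species replacement, whereas the Sørensen and Jaccard forms also respond to differences in richness.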
The relative weighting of abundance differences and turnover varied substantially among abundance-based metrics (Table 2). With the exception of βGower, βalt. Gower, βAv. Euclidean, βLande Simpson and βEuclidean, metrics were more sensitive to species turnover than to differences in abundance (P3: decoupling of species ranks, P4: differences in evenness, Figs S5 and S6).
The relative sensitivity to turnover in rare versus common species (P5) varied substantially among metrics, from equal weighting of rare and common species (all presence-absence metrics) to a negligible response to turnover in rare species (βMorisita: Fig. S11).
A principal component analysis revealed substantial redundancy among the 29 metrics investigated (Fig. 1).

Discussion
Our results identify a number of trade-offs in performance, consider redundancy and complementarity among existing metrics and suggest areas to be addressed in the design of new metrics.
Table 1. Scorecard for 29 β-diversity metrics against the 16 conceptual and two sampling properties described in the text. Metrics are ordered by number of TRUES and, when equal, by the mean of quantitative scores. Note this weights qualitative properties more heavily than quantitative properties, such that metrics with one or two fails drop down the scorecard. Metrics have an ideal score of TRUE (T) for qualitative properties and 0 for quantitative properties. C4, C6 and C11 were TRUE for all metrics and scores are not shown.
We suggest that our desirable properties provide a useful primary filter when choosing a metric. We focus on the best-performing metrics in Table 1, but other metrics may still be useful if the relative weighting of the desirable properties is changed, or if personality properties or additional properties, untested here, become important. Our personality properties highlight two additional sources of variation which may further filter the appropriate metrics for some applications: (i) sensitivity to rare species and (ii) sensitivity to nestedness. Our results indicate the first of these is traded off against performance for sampling properties (Fig. S16).
The most extreme example of this trade-off is βMorisita, which is the most independent of sample size (Fig. S10), at the expense of being almost completely insensitive to turnover in rare species (Fig. S13). β-diversity metrics fall along a continuum in terms of sensitivity to rare species. βClassic Sørensen is conceptually linked to species richness metrics of α-diversity such that rare and dominant species are weighted equally. βHorn relates to Shannon entropy: species are weighted by their relative abundance. βMorisita is linked to the Gini-Simpson index of α-diversity (Jost 2007): rare species contribute little to the final value of such metrics. Consequently, βMorisita performs well, even with the very partial samples that ecologists usually work with, because the missing rare species in small samples have a negligible effect on the value of β. This has practical consequences: the emphasis βMorisita places on common species is suitable when shifts in dominance are of interest (e.g. when linking diversity to ecosystem function), but will be less appropriate when patterns of turnover in rare species are of particular interest (e.g. complementarity of reserve networks: Wiersma & Urban 2005). Unfortunately, those metrics that are sensitive to turnover in rare species are, consequently, less robust in the face of undersampling.
Fig. 1. Biplot of the first two principal component axes of the scores of 29 β-diversity metrics based on quantitative scores for properties C1-C2, C14-C16, S1-S2 and P1-P5. Four partitioned turnover components are also shown, using the partitioning methods proposed by Baselga (2013) and Podani, Ricotta & Schmera (2013). Together, PC1 and PC2 explain 52% of variation in scores.
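The continuum of sensitivity to rare species described in the text can be seen in a small worked example. The sketch below is an assumption-laden illustration rather than the formulas behind Table 1: it uses the presence-absence Sørensen dissimilarity, one minus Horn's information-based overlap for raw counts, and one minus the Morisita-Horn overlap as a stand-in for Simpson-type metrics such as βMorisita. The two assemblages are hypothetical: they share two dominant species at identical abundances and differ only in which singleton species they contain.

```python
import math


def xlogx(v: float) -> float:
    """v * ln(v), with the convention 0 * ln(0) = 0."""
    return v * math.log(v) if v > 0 else 0.0


def beta_sorensen(x, y) -> float:
    """Richness-based: every species counts equally."""
    a = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi > 0)
    b = sum(1 for xi, yi in zip(x, y) if xi > 0 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi > 0)
    return (b + c) / (2 * a + b + c)


def beta_horn(x, y) -> float:
    """Shannon-based: 1 minus Horn's overlap computed from raw counts."""
    num = sum(xlogx(xi + yi) - xlogx(xi) - xlogx(yi) for xi, yi in zip(x, y))
    den = xlogx(sum(x) + sum(y)) - xlogx(sum(x)) - xlogx(sum(y))
    return 1 - num / den


def beta_morisita_horn(x, y) -> float:
    """Simpson-based stand-in: 1 minus the Morisita-Horn overlap."""
    tx, ty = sum(x), sum(y)
    cross = sum(xi * yi for xi, yi in zip(x, y))
    dx = sum(xi * xi for xi in x) / tx ** 2
    dy = sum(yi * yi for yi in y) / ty ** 2
    return 1 - 2 * cross / ((dx + dy) * tx * ty)


# Hypothetical counts: two shared dominants, complete turnover of singletons.
site1 = [100, 80, 1, 1, 0, 0]
site2 = [100, 80, 0, 0, 1, 1]

for name, metric in [("Sørensen", beta_sorensen),
                     ("Horn", beta_horn),
                     ("Morisita-Horn", beta_morisita_horn)]:
    print(f"{name}: {metric(site1, site2):.6f}")
```

For these counts the richness-based value is 0.5, the Shannon-based value is about 0.011 and the Simpson-based value is about 0.0001, so the same turnover of rare species is registered strongly, weakly or hardly at all depending on how the metric weights abundance.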
In general, our results suggest that when insensitivity to sample size (S1 and S2), sensitivity to turnover of individuals (C7 and C8) and/or cumulative β (C2) are priorities, βMorisita should be favoured. When turnover in rare species is important and undersampling is not severe, the presence-absence metric, βsim, is favoured due to superior performance in terms of independence of α-diversity (C1), probabilistic similarity (C3), independence of species replication (C14), measurement units (C15) and differences in abundance (C16). However, βMorisita is almost completely independent of sample size (S1), whilst βsim, βClassic Sørensen and βClassic Jaccard rank eleventh, twelfth and eighteenth, respectively. This is consistent with predictions that presence-absence metrics are more sensitive to sample sizes.
Our results have implications for existing studies of the spatial scaling of β-diversity. Studies using presence-absence metrics have shown that β-diversity decreases with the spatial grain of samples (McGlinn & Hurlbert 2012; Barton et al. 2013). One reason for this is statistical: the probability of a rare species being turned over increases at finer grains (Keil et al. 2012) both because rare species are range-restricted and because fine-grain samples have (almost by definition) much smaller sample sizes of individuals than do coarse-grain samples. By contrast, common species are usually more widespread than rare species and much less likely to be turned over at fine grains. The trade-off we have noted between robustness to undersampling and sensitivity to rare species thus becomes relevant here: those metrics which weight rare species turnover highly (including all presence-absence measures) will likely find β shifting with scale. It follows that abundance-based metrics, particularly those disproportionately influenced by dominant species, will likely be less scale dependent than presence-absence metrics (Fig. S15).
A second consequence of this trade-off is that metrics that are insensitive to turnover in rare species will also return very low values of β under a positive occupancy-abundance relationship (Fig. S14), a pattern that is near ubiquitous. Specialist applications focussing on rare species may need to use metrics that are less robust to undersampling but, consequently, will require larger sample sizes to observe the rarer species: no abundance-based metric is able to account for unseen shared species (i.e. there are no abundance-based equivalents of βChao Sørensen and βChao Jaccard).
Another potential filter of metrics is sensitivity to nestedness (P1). There are circumstances when the partitioning of the nestedness and turnover components will be a priority when choosing a metric. First, metrics measuring purely species turnover address methodological issues associated with species richness gradients (e.g. latitudinal gradients: Koleff, Lennon & Gaston 2003b). Moreover, patterns of nestedness and turnover are likely to emerge as a result of different processes: distinguishing these patterns may contribute to a more mechanistic understanding of spatial patterns in β-diversity (e.g. Baselga 2010). Our simulations include two abundance-based metrics, βBray-Curtis and βRužička, that can be additively partitioned into independent nestedness and turnover components. We find the partitioning method described by Baselga (2013) generates turnover components that are independent of nestedness, whilst the method proposed by Podani, Ricotta & Schmera (2013) does not.
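The additive partitioning of Bray-Curtis dissimilarity can be sketched as follows. This is our reading of the Baselga (2013) formulas, with A the summed minimum abundances and B and C the abundances unique to each assemblage; the function name and the example counts are hypothetical, and the sketch assumes the two assemblages are non-empty and that at least one has individuals in excess of the shared minimum or shares some abundance.

```python
def bray_curtis_partition(x, y):
    """Return (total, balanced, gradient) for two abundance vectors.

    total    = (B + C) / (2A + B + C)       Bray-Curtis dissimilarity
    balanced = min(B, C) / (A + min(B, C))  turnover-like component
    gradient = total - balanced             nestedness-like component
    """
    A = sum(min(xi, yi) for xi, yi in zip(x, y))
    B = sum(x) - A
    C = sum(y) - A
    total = (B + C) / (2 * A + B + C)
    balanced = min(B, C) / (A + min(B, C))
    return total, balanced, total - balanced


# Hypothetical pair differing only by a uniform loss of individuals.
print(bray_curtis_partition([10, 8, 6], [5, 4, 3]))

# Hypothetical pair where abundance lost in one species is gained in another.
print(bray_curtis_partition([10, 0, 5], [0, 10, 5]))
```

In the first pair one assemblage is an abundance subset of the other, so the balanced (turnover) component is zero and all of the dissimilarity falls in the gradient component; in the second pair total abundance is equal, so the gradient component is zero and all of the dissimilarity is balanced variation.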
A principal component analysis indicated a large amount of redundancy among metrics. Yet our results highlight one property which is lacking among existing abundance-based β-diversity metrics. Three pieces of information are absent in samples of species assemblages: (i) how many species are missing in the sample, but present at the site, (ii) their abundances and (iii) whether they are shared or unshared between undersampled assemblage pairs. Abundance-based β-diversity metrics that estimate this information and adjust the value of β accordingly are one avenue for improving performance when there is undersampling. Recent developments in biodiversity sampling theory (Green & Plotkin 2007; Morlon et al. 2008; McGill 2011) and hierarchical Bayesian techniques that model the observation process (Kéry & Royle 2008) provide a useful starting point for developing such metrics.
The issues we have raised highlight that β-diversity is a multi-faceted concept. Any study measuring β-diversity should be explicit about its goals (which properties should be emphasized) and assumptions (e.g. about sampling) when filtering the available metrics.