Volume 6, Issue 7
Application

fuzzySim: applying fuzzy logic to binary similarity indices in ecology

A. Márcia Barbosa

Centro de Investigação em Biodiversidade e Recursos Genéticos (CIBIO), InBIO Research Network in Biodiversity and Evolutionary Biology, University of Évora, 7004‐516 Évora, Portugal

Correspondence author. E‐mail: barbosa@uevora.pt
First published: 14 March 2015

Summary

  1. Binary similarity indices are widely used in ecology, for example for detecting associations between species occurrence patterns, comparing regional and temporal species assemblages, and assessing beta diversity patterns, including spatial and temporal species loss and turnover. Such indices have widespread applications in biogeography, global change biology and biodiversity conservation.
  2. Similarity indices are commonly calculated upon binary presence/absence (or sometimes modelled suitable/unsuitable) data, which are generally incomplete and more categorical than their underlying natural patterns. Probable false absences are disregarded, amplifying the effects of data deficiencies and the scale dependence of the results.
  3. Fuzzy occurrence data, with a degree of uncertainty attributed to localities where presence or absence cannot be safely assigned, could better reflect species distributions, compensating for incomplete knowledge and methodological errors. Similarity indices would therefore also benefit from accommodating such fuzzy data directly.
  4. This study proposes fuzzy versions of the binary similarity indices most commonly used in ecology, so that they can be directly applied to continuous (fuzzy) rather than binary occurrence values, thus producing more realistic similarity assessments. Fuzzy occurrence can be obtained with several methods, some of which are also provided. The procedure is robust to data source disparities, gaps or other errors in species occurrence records, even for restricted species for which slight inaccuracies can affect substantial parts of their range.
  5. The method is implemented in a free and open‐source software package, fuzzySim, which is available for the R statistical software and under implementation for the QGIS geographic information system. It is provided with sample data and an illustrated tutorial suitable for non‐experienced users.

Introduction

Spatial associations between species distributions provide deep insights into the processes that drive biodiversity patterns. Areas with similar species compositions (biogeographic or biotic regions) (Márquez, Real & Vargas 2001; Holt et al. 2013; Olivero, Márquez & Real 2013) or species with similar occurrence patterns (chorotypes) (Baroni Urbani, Ruffo & Vigna Taglianti 1978; Márquez et al. 1997; Real, Olivero & Vargas 2008; Olivero, Real & Márquez 2011) serve, for example, as natural units to maximize efficiency in biodiversity management and conservation planning. Beta diversity patterns provide a link between local biodiversity and the broader regional species pool, and are at the core of community ecology (Anderson et al. 2011). Like chorotypes and biotic regions, they can be used in tests of hypotheses about the processes driving species distribution and diversity (Baselga & Orme 2012) and can reveal effects of large‐scale historic processes on current biodiversity (Baselga, Gómez‐Rodríguez & Lobo 2012).

Identifying chorotypes, biotic regions and beta diversity patterns requires comparing species occurrence patterns via similarity indices. These indices are typically based on the numbers of shared localities among species or the numbers of shared species among localities (Olivero, Real & Márquez 2011; Baselga & Orme 2012; Olivero, Márquez & Real 2013). Such measures are not otherwise spatially explicit, that is, they do not take into account the proximity (whether spatial or environmental) between species occurrence sites. Consequently, the distributions of species recorded at adjacent (even interspersed) and/or highly similar sites are considered as different as those of species occurring at opposite ends of the geographical or environmental space (Barbosa et al. 2012) (Fig. 1). This amplifies the effects of false absences and small spatial errors in the georeferencing of species occurrences, and the scale dependence of distributional relationships, precluding the identification of species associations except at broad spatial scales. These problems can particularly affect species with restricted distributions, as erroneous or missing distribution records can involve significant parts of their range (Barbosa et al. 2012).

Fig. 1. Left: distributions of three vole (Microtus) species in western Europe according to atlas (Mitchell‐Jones et al. 1999) and range map data (IUCN 2010). Right: pairwise similarities among these distributions using binary (based on presence/absence) and fuzzy (based on distance interpolation of presences) versions of the Jaccard and Baroni similarity indices. Note that, with atlas data, binary similarity is zero in all cases, while fuzzy similarity detects that M. guentheri and M. thomasi are more similar to each other (even without any overlapping presences) than to M. cabrerae.

Species distributions are inherently complex, and survey methods are always fallible. Species occurrence data therefore have intrinsic, unavoidable errors (Rocchini et al. 2011). For example, false absences often arise from insufficient sampling, while range map filling introduces false presences (Barbosa et al. 2012). It could therefore be more appropriate to treat distribution data as fuzzy, with localities classified with a degree of uncertainty on whether a species occurs or not, rather than putative presences and absences (Rocchini 2010; Rocchini et al. 2011; Duff, Bell & York 2014).

Fuzzy logic has recently been applied to smooth the limits between chorotypes (Olivero, Real & Márquez 2011) and biotic regions (Olivero, Márquez & Real 2013), but these are still primarily defined based on categorical presence/absence data on categorical sampling units. Thus, species recorded at adjacent but not strictly coincident localities (e.g. distribution atlas data for Microtus thomasi and M. guentheri, Fig. 1) are considered to be as dissimilar from each other as from a species occurring thousands of kilometres away (e.g. M. cabrerae, Fig. 1). However, localities adjacent to (or scattered among) recorded presences, particularly if they are within similar environments and not separated by physical barriers, are often false absences resulting from insufficient sampling or other methodological constraints (e.g., Rocchini et al. 2011), so they should be attributed a degree of uncertainty about species presence. Another approach has been to compare species distributions or assemblages based on modelled habitat suitability values (Sillero et al. 2009; Albouy et al. 2012). However, comparisons are then still based on binary similarity indices after a binarization of suitability, thus discarding relevant quantitative information and introducing abrupt arbitrary thresholds.

Similarity indices that account for fuzziness of location, such as the fuzzy numerical comparison (Visser & de Nijs 2006) or the improved fuzzy kappa (Hagen‐Zanker 2009), go beyond site‐by‐site comparison by giving partial credit to neighbouring sites. They thus introduce tolerance for small spatial differences in species occurrences and have been proposed for improving future analyses (Barbosa & Real 2012; Barbosa et al. 2012). However, these indices are computationally intensive and require the use of complex spatial data structures such as raster maps, rather than simple occurrence data tables. Raster maps also imply that the spatial units under analysis are square pixels, precluding the use of other, sometimes more appropriate divisions such as equal‐area units, provinces or watersheds (Carmona et al. 1999; Márquez, Real & Vargas 2001). Furthermore, considerable development would be necessary before these indices could be routinely used for such analyses, namely in optimizing their computation for multiple species and in determining their levels of significance (Barbosa et al. 2012). Likewise, indices of niche overlap (Warren, Glor & Turelli 2008; Broennimann et al. 2012) might be applicable to comparing spatial occurrence patterns, but their use is not established in species distributional and regional clustering, and their significance thresholds entail computer‐intensive randomized simulations.

This study proposes an alternative, simple protocol for analysing fuzzy similarity between species occurrence patterns and between regional or temporal species compositions, based on fuzzy versions of the binary similarity indices commonly used in ecology, whose significance thresholds and other properties are therefore already known. The fuzzy indices are directly applicable to fuzzy (continuous) occurrence values, which can be obtained, for example, with distribution modelling, kernel smoothing or spatial interpolation of occurrence localities. To be used within a fuzzy logic framework, fuzzy occurrence must be bounded between 0 and 1 and represent the degree of membership of each locality to the species occurrence area, or of each species to a regional species pool (Zadeh 1965; Olivero, Real & Márquez 2011; Olivero, Márquez & Real 2013).

Calculating fuzzy similarity

Jaccard's (1901) index is one of the most widely used similarity indices in ecology, for example for detecting species distributional associations (Real & Vargas 1996; Sillero et al. 2009; Barbosa et al. 2012), comparing regional species compositions (Chao et al. 2004; Anderson, Bolton & Stegenga 2009; Engen, Grøtan & Saether 2011), calculating beta diversity patterns (Anderson et al. 2011) or assessing temporal species turnover in climate change research (Albouy et al. 2012). Sørensen's (1948) index is widely used for assessing compositional similarity between sites or communities and beta diversity patterns in space or time (Anderson et al. 2011; Baselga, Gómez‐Rodríguez & Lobo 2012; Baselga & Orme 2012; Dobrovolski et al. 2012). Simpson's (1960) similarity is the proportion, in the smaller of two samples, of taxa/regions common to both, thereby minimizing the effect of sample size discrepancies. Baroni‐Urbani & Buser's (1976) index (hereafter Baroni for short) is broadly used in species distributional clustering and biotic regionalization (Márquez et al. 1997; Real, Olivero & Vargas 2008; Olivero, Real & Márquez 2011; Moya, Saucède & Manjón‐Cabeza 2012; Olivero, Márquez & Real 2013). It accounts for both shared presences and shared absences, but gives greater weight to presences.

All these similarity indices vary between zero (no distributional or compositional overlap) and one (identical distributions or compositions). Tables of significant values per sample size are available (e.g. Baroni‐Urbani & Buser 1976; Real 1999), so these indices can be used for identifying significant associations (Olivero, Real & Márquez 2011; Olivero, Márquez & Real 2013). All these indices are calculated on two or more of the following terms: A and B (the numbers of species/localities in each sample), C (the number of shared species/localities) and D (the number of species/localities missing from both samples). All these terms have direct correspondence with Boolean logic expressions and can thus be translated into their fuzzy logical equivalents to assess fuzzy binary similarity (Table 1).

Table 1. Correspondence between the terms in the formulas of binary similarity indices for a given pair of species (sp1 and sp2) and their equivalent expressions in classical and fuzzy set theory (Zadeh 1965)

Term   Boolean logic          Classical sets           Fuzzy sets
A      sp1                    sp1                      sum(sp1)
B      sp2                    sp2                      sum(sp2)
C      sp1 AND sp2            sp1 ∩ sp2                sum(minimum(sp1, sp2))
D      NOT sp1 AND NOT sp2    complement(sp1 ∪ sp2)    sum(1 – maximum(sp1, sp2))
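
The fuzzy terms in Table 1 can be illustrated with a minimal Python sketch. This is not the fuzzySim code (which is written in R), and the function names fuzzy_terms and fuzzy_similarity are hypothetical. The index formulas are the standard textbook forms of the Jaccard, Sørensen, Simpson and Baroni indices written in terms of A, B, C and D; they are assumed here rather than taken from the text above.

```python
from math import sqrt


def fuzzy_terms(sp1, sp2):
    """Return the fuzzy A, B, C, D terms of Table 1 for two equal-length
    sequences of membership values bounded between 0 and 1."""
    if len(sp1) != len(sp2):
        raise ValueError("sp1 and sp2 must have the same length")
    a = sum(sp1)                                      # A: sum(sp1)
    b = sum(sp2)                                      # B: sum(sp2)
    c = sum(min(x, y) for x, y in zip(sp1, sp2))      # C: fuzzy intersection
    d = sum(1 - max(x, y) for x, y in zip(sp1, sp2))  # D: fuzzy shared absence
    return a, b, c, d


def fuzzy_similarity(sp1, sp2, index="jaccard"):
    """Fuzzy similarity between two occurrence patterns.

    Assumed formulas (standard forms, in terms of A, B, C, D):
      jaccard:  C / (A + B - C)
      sorensen: 2C / (A + B)
      simpson:  C / min(A, B)
      baroni:   (sqrt(C*D) + C) / (sqrt(C*D) + A + B - C)
    """
    a, b, c, d = fuzzy_terms(sp1, sp2)
    if a == 0 or b == 0:
        raise ValueError("each sample needs a non-zero sum of memberships")
    if index == "jaccard":
        return c / (a + b - c)
    if index == "sorensen":
        return 2 * c / (a + b)
    if index == "simpson":
        return c / min(a, b)
    if index == "baroni":
        root = sqrt(c * d)
        return (root + c) / (root + a + b - c)
    raise ValueError("unknown index: " + index)


# Two species recorded at adjacent but non-coincident localities:
# binary similarity is zero, fuzzy similarity is not.
binary_1, binary_2 = [1, 0, 0, 0], [0, 1, 0, 0]
fuzzy_1, fuzzy_2 = [1.0, 0.5, 0.2, 0.0], [0.5, 1.0, 0.5, 0.1]
print(fuzzy_similarity(binary_1, binary_2, "jaccard"))
print(fuzzy_similarity(fuzzy_1, fuzzy_2, "jaccard"))
```

With strictly binary (0/1) input the fuzzy terms reduce to the usual counts, so the same functions return the traditional index values.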

Availability and functionality

This methodology is implemented in a free and open‐source software package, fuzzySim, which works under the R programming environment (R Core Team 2014). The package, including some sample data (Fontaneto et al. 2012), is available on the public platform R‐Forge (http://fuzzysim.r-forge.r-project.org), together with a reference manual and a step‐by‐step tutorial on its installation and usage. Most functionalities of fuzzySim (Fig. 2) are also being implemented as a graphical user interface extension for QGIS (QGIS Development Team 2014), which is also free and open‐source.

Fig. 2. Basic workflow of the fuzzySim package. Text inside arrows represents function names. Text in grey represents operations to do outside fuzzySim.

The fuzzySim package allows a variety of methods for converting (multiple) species presence/absence data into continuous, fuzzy surfaces, including inverse distance to presence raised to any power (function distPres), trend surface analysis of any given degree (function multTSA) and generalized linear models based on presence–absence (function multGLM). The former two methods can be useful for purposes other than comparing species distributions and assemblages, for example for defining putative geographical ranges (Takahashi et al. 2014) or for delimiting the geographical background in species distribution modelling (Acevedo et al. 2012). Besides several methods for selecting predictor variables (including information criteria and false discovery rate), multGLM includes an option to convert probability to prevalence‐independent favourability values, which have proven appropriate for use within a fuzzy logic framework (Real, Barbosa & Vargas 2006; Acevedo & Real 2012). The package also allows using other continuous distribution data that users can obtain elsewhere, as long as they are bounded between 0 and 1, directly comparable among species and interpretable as fuzzy membership values (Zadeh 1965; Real, Barbosa & Vargas 2006; Barbosa & Real 2012).
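
Two of these conversions can be sketched in a few lines of Python. The sketch is illustrative only and does not reproduce the distPres or multGLM functions: the function names are hypothetical, and the rescaling of distance to the 0 to 1 range (division by the maximum distance, then subtraction from 1) is an assumption made here for simplicity. The favourability formula is the one given by Real, Barbosa & Vargas (2006), F = [P/(1 − P)] / [n1/n0 + P/(1 − P)], where P is the modelled probability and n1 and n0 are the numbers of presences and absences in the modelled data.

```python
from math import hypot


def inverse_distance_to_presence(coords, presence, power=2):
    """Fuzzy occurrence from distance to the nearest recorded presence.

    coords:   list of (x, y) locality coordinates (planar units assumed)
    presence: list of 0/1 records, one per locality
    power:    exponent applied to distance (2 = squared distance)

    Assumed scaling: distance**power is divided by its maximum and
    subtracted from 1, so presences get 1 and the farthest locality gets 0.
    """
    presences = [pt for pt, p in zip(coords, presence) if p == 1]
    if not presences:
        raise ValueError("at least one presence is required")
    dist = [min(hypot(x - px, y - py) for px, py in presences) ** power
            for x, y in coords]
    dmax = max(dist)
    if dmax == 0:  # every locality is a presence
        return [1.0] * len(coords)
    return [1 - d / dmax for d in dist]


def favourability(p, n1, n0):
    """Prevalence-independent favourability from probability p,
    given n1 presences and n0 absences in the modelled data."""
    if p >= 1:
        return 1.0
    odds = p / (1 - p)
    return odds / (n1 / n0 + odds)


coords = [(0, 0), (1, 0), (2, 0)]
print(inverse_distance_to_presence(coords, [1, 0, 0], power=2))
print(favourability(0.2, n1=20, n0=80))
```

A probability equal to the species prevalence (here 20 presences in 100 localities) maps to a favourability of 0.5, which is what makes favourability values comparable among species with different prevalences.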

Species occurrence data, whether binary or fuzzy, can be transposed (so that regions go in columns and species in rows) for comparing regional species compositions (Olivero, Márquez & Real 2013). Pairwise similarity matrices between either species distributions or regions' compositions can be calculated with function simMat based on fuzzy versions of the Jaccard, Sørensen, Simpson and Baroni indices. These metrics were chosen for their simplicity, long‐time widespread use and known significance thresholds for meaningful clustering, as well as their comparability to traditional indices. Nonetheless, additional similarity indices can be implemented as necessary, such as fuzzy versions of other measures of agreement between binary variables (e.g. simple matching coefficient, Cohen's kappa, true skill statistic) and of binary correlation indices (Phi, Matthews, Yule), as well as standard correlation coefficients (Pearson, Spearman, Kendall) and measures of niche overlap (Warren, Glor & Turelli 2008; Broennimann et al. 2012).
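
The transposition and the pairwise matrix calculation can be sketched as follows. This is a hypothetical Python illustration, not the simMat function: the data layout (a dictionary mapping each species name to its list of membership values per locality) and all function names are assumptions, and only the fuzzy Jaccard index, in its assumed standard form C / (A + B − C), is included.

```python
def fuzzy_jaccard(u, v):
    """Fuzzy Jaccard index, C / (A + B - C), with the terms of Table 1."""
    a, b = sum(u), sum(v)
    c = sum(min(x, y) for x, y in zip(u, v))
    if a == 0 or b == 0:
        raise ValueError("each sample needs a non-zero sum of memberships")
    return c / (a + b - c)


def similarity_matrix(table, sim=fuzzy_jaccard):
    """Pairwise similarities among the entries of `table`
    (name -> list of values), returned as a nested dictionary."""
    names = list(table)
    return {n1: {n2: sim(table[n1], table[n2]) for n2 in names}
            for n1 in names}


def transpose(table, row_names):
    """Swap rows and columns: from species -> values per locality
    to locality -> values per species."""
    return {name: [values[i] for values in table.values()]
            for i, name in enumerate(row_names)}


# Made-up fuzzy occurrence of three species at four localities
species = {
    "spA": [1.0, 0.8, 0.1, 0.0],
    "spB": [0.9, 1.0, 0.2, 0.0],
    "spC": [0.0, 0.1, 0.9, 1.0],
}
between_species = similarity_matrix(species)

# Same data transposed, to compare localities by their species composition
localities = transpose(species, ["loc1", "loc2", "loc3", "loc4"])
between_localities = similarity_matrix(localities)
print(between_species["spA"]["spB"], between_localities["loc1"]["loc4"])
```

The same similarity function serves both analyses; only the orientation of the table changes.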

The fuzzy similarity matrices produced with fuzzySim can then be plotted, compared, classified, clustered and converted into dendrograms depicting the fuzzy relationships between species distributions or between regional species compositions. R code for all these operations is provided in the tutorials available from the package homepage (http://fuzzysim.r-forge.r-project.org). The fuzzy similarity matrices can also be entered in the RMACOQUI package (Olivero, Real & Márquez 2011) for a systematic analysis of chorotypes or biotic regions. The fuzzy versions of binary similarity indices can also be integrated within other software packages that currently compute these indices, such as vegan (Oksanen et al. 2013) or betapart (Baselga & Orme 2012).

Example analyses

To allow direct comparison with traditional methods, I used previously analysed data (Barbosa et al. 2012) on the occurrence of 156 terrestrial mammal species in Western Europe, based on both a distribution atlas (Mitchell‐Jones et al. 1999) and a set of range maps (IUCN 2010), under a 50 × 50 km UTM grid containing 2118 cells (Sastre, Roca & Lobo 2009). These data were previously used for comparing distributional relationships inferred from distribution atlas versus range maps, to assess the effects of data type on the results of such analyses. Although there was a good general agreement between the relationships obtained from the two data sources, there were some visible differences, most notably for small‐range species (Barbosa et al. 2012).

The procedure is here illustrated with two fuzzy versions of species occurrence data: inverse squared distance interpolation of presences (Shepard 1968; Takahashi et al. 2014) and environmental favourability for presence (Real, Barbosa & Vargas 2006) based on information criterion selection of WorldClim bioclimatic variables (Hijmans et al. 2005). Note that these are simple examples; fuzzy distribution data should be obtained based on knowledge of what best reflects the distributions of the target species. Matrices of fuzzy similarity between species distributions were then calculated with the fuzzy similarity indices. For comparison, binary vs. fuzzy similarity was plotted for two similarity indices, one that considers only shared presences (Jaccard) and another that also considers shared absences (Baroni).

As expected, fuzzy similarity was generally higher than binary similarity (Fig. 3), as similarity among species distributions is more than the strict coincidence of their recorded localities. Also, fuzzy similarity matrices obtained from atlas and from range map data were more similar (Mantel rank correlation tests, ρ = 0·98 for Jaccard and 0·97 for Baroni) than with binary similarity (Barbosa et al. 2012), indicating that fuzzy similarity minimizes the effects of data type on the results of chorological analyses. More notably, the fuzzy approach solved the problem posed by small‐range species, for which slight differences between data sets could mean their coincidence or not in substantial parts of their distribution areas. Chorotypes, biogeographic regions and beta diversity patterns defined with fuzzy similarity indices are thus more likely to be robust to disparities, errors or gaps in species occurrence data, even for narrowly distributed species, which usually face higher conservation concern and where slight inaccuracies can affect substantial parts of their recorded range.
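
For readers unfamiliar with the comparison of similarity matrices, a Mantel rank correlation can be sketched as below. This is a generic Python illustration, not the code used for the analyses reported here: the function names, the number of permutations and the small matrices are assumptions. The statistic is the Spearman correlation between the corresponding off-diagonal entries of two symmetric matrices, and significance is assessed by jointly permuting the rows and columns of one of them.

```python
import random
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation (undefined if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lower_triangle(m, order=None):
    """Entries below the diagonal, optionally after permuting
    rows and columns together according to `order`."""
    idx = order if order is not None else list(range(len(m)))
    return [m[idx[i]][idx[j]] for i in range(len(m)) for j in range(i)]


def mantel_spearman(m1, m2, permutations=999, seed=0):
    """Return (rho, one-sided permutation p-value)."""
    x = average_ranks(lower_triangle(m1))
    rho = pearson(x, average_ranks(lower_triangle(m2)))
    rng = random.Random(seed)
    order = list(range(len(m2)))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(order)
        if pearson(x, average_ranks(lower_triangle(m2, order))) >= rho:
            hits += 1
    return rho, hits / (permutations + 1)


# Made-up similarity matrices for four species from two data sources
atlas = [[1.0, 0.8, 0.2, 0.1],
         [0.8, 1.0, 0.3, 0.2],
         [0.2, 0.3, 1.0, 0.7],
         [0.1, 0.2, 0.7, 1.0]]
range_maps = [[v ** 2 for v in row] for row in atlas]
rho, p_value = mantel_spearman(atlas, range_maps, permutations=99)
print(rho, p_value)
```

Because the statistic is rank based, any monotonic rescaling of one matrix (here, squaring its values) leaves the correlation unchanged.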

Fig. 3. Pairwise distributional similarities among 156 species in the European mammal atlas (Mitchell‐Jones et al. 1999) when using binary vs. fuzzy versions of the Jaccard and Baroni similarity indices. Fuzzy distribution data were obtained with inverse squared distance interpolation of presences (top) and with environmental favourability based on informative bioclimatic variables (bottom).

Acknowledgements

This work was supported by Fundação para a Ciência e a Tecnologia through ‘FCT Investigator’ contract IF/00266/2013 and exploratory project CP1168/CT0001. The IUCN and Societas Europaea Mammalogica (particularly A.J. Mitchell‐Jones) kindly provided the distribution data for analysis. Raimundo Real, Jesús Olivero and Ana Luz Márquez provided valuable early discussions on chorotypes and fuzzy logic. Alba Estrada, Raquel Garcia, Rich Grenyer and Duccio Rocchini provided valuable feedback.

Data accessibility

Mammal atlas maps are available from the European Mammal Society (http://www.european-mammals.org). Range maps are available at the IUCN website (http://www.iucnredlist.org). Data used in the fuzzySim tutorial are included in the supplementary material of Fontaneto et al. (2012) and in the R package (http://fuzzysim.r-forge.r-project.org).