Volume 7, Issue 12
Research Article

Mapping Averaged Pairwise Information (MAPI): a new exploratory tool to uncover spatial structure

Sylvain Piry

UMR CBGP, INRA, 34988, Montferrier sur Lez, France

Correspondence author. E‐mail: piry@supagro.inra.fr
Marie‐Pierre Chapuis

UMR CBGP, CIRAD, 34988 Montferrier sur Lez, France

Bertrand Gauffre

UMR 7372, Centre d'Etudes Biologiques de Chizé, CNRS – Université de La Rochelle, 79360 Villiers‐en‐Bois, France

USC1339, Centre d'Etudes Biologiques de Chizé, INRA, 79360 Villiers‐en‐Bois, France

Julien Papaïx

Biostatistique et Processus Spatiaux, INRA, 84914 Avignon, France

Astrid Cruaud

UMR CBGP, INRA, 34988, Montferrier sur Lez, France

Karine Berthier

Pathologie Végétale, INRA, 84140 Montfavet, France

First published: 18 July 2016

Summary

  1. Visualisation of spatial networks based on pairwise metrics such as (dis)similarity coefficients provides direct information on the spatial organisation of biological systems. However, for large networks, graphical representations are often unreadable because nodes (samples) and edges (links between samples) strongly overlap. We present a new method, MAPI, allowing translation from spatial networks to variation surfaces.
  2. MAPI relies on (i) a spatial network in which samples are linked by ellipses and (ii) a grid of hexagonal cells encompassing the study area. Pairwise metric values are attributed to ellipses and averaged within the cells they intersect. The resulting surface of variation can be displayed as a colour map in Geographical Information System (GIS), along with other relevant layers, such as land cover. The method also allows the identification of significant discontinuities in grid cell values through a nonparametric randomisation procedure.
  3. The interest of MAPI is here demonstrated in the field of spatial and landscape genetics. Using simulated test data sets, as well as observed data from three biological models, we show that MAPI is (i) relatively insensitive to confounding effects resulting from isolation by distance (i.e. over‐structuring), (ii) efficient in detecting barriers when they are not too permeable to gene flow and, (iii) useful to explore relationships between spatial genetic patterns and landscape features.
  4. MAPI is freely provided as a PostgreSQL/PostGIS database extension allowing easy interaction with GIS or the R software and other programming languages. Although developed for spatial and landscape genetics, the method can also be useful to visualise spatial organisation from other kinds of data from which pairwise metrics can be computed.

Introduction

Computation of metrics to estimate similarity, difference or flow between samples is a common approach to study a wide range of social and biological systems (Borgatti et al. 2009; Miele, Picard & Dray 2014). Computing pairwise metrics allows the construction of networks with nodes (samples) being connected by edges that are associated with the value of the metric of interest. When geographic coordinates of samples are available, such networks are inherently informative about the spatial organisation of the system under study (Barthélemy 2011). This organisation can be visualised by superimposing the network on a geographic map, with the thickness or colour of the edges displayed according to the pairwise values. However, such graphical representations are difficult to interpret for large networks as the numerous nodes and edges strongly overlap and obscure both the network and the underlying geographic space (Hennemann 2013). To better visualise spatial information, variation surfaces can be produced using interpolation procedures (e.g. kernel density estimation, kriging). To do so, pairwise metric values have first to be attributed to unique geographical points (i.e. transformed into punctual values) as, for example, to the middle of the edges connecting the samples in the network (e.g. Miller 2005).

In this paper, we present a novel method, MAPI, to produce variation surfaces from pairwise metrics computed between georeferenced samples. In essence, MAPI is a nonparametric smoothing procedure applied to pairwise values rather than punctual values. To smooth out pairwise values, the straight lines connecting the samples in the network are replaced by ellipsoidal polygons for which the foci are located on the two samples being connected. Then, similarly to kernel density estimators for punctual values, ellipses constitute a geometric shape used to average information between overlapping connections. This smoothing procedure produces a two‐dimensional geographical layer that can be easily visualised and customised in a Geographical Information System (GIS). Although this method may be of interest in other research fields, we will hereafter focus on spatial and landscape genetics, where pairwise metrics are widely used but flexible visualisation tools are still lacking for these measures (see Miller 2005; Vandergast et al. 2010; Etherington 2011; Petkova, Novembre & Stephens 2016).

Although new methods, such as spatially explicit clustering, are increasingly used in spatial genetics to describe genetic variation and identify barriers to gene flow (Guillot et al. 2009), pairwise genetic measures, such as distance, remain appealing as: (i) they can be computed between individuals or populations, (ii) they facilitate handling massive data sets, which are increasing with the advance in high‐throughput sequencing (Duforet‐Frebourg & Blum 2014) and (iii) their regression against geographic distances may be informative with respect to isolation by distance (IBD), that is the increase of genetic differentiation with spatial separation due to restricted dispersal (Wright 1943). The original IBD model has been further extended in landscape genetics to assess the impact of landscape heterogeneity on spatial genetic structure and gene flow. In this field, replacing straight line distances between samples by ecological distances has given rise to concepts of isolation by barrier, isolation by resistance and isolation by environment (Balkenhol et al. 2015). Nowadays, various approaches using least cost paths, circuit theory or population graphs have been developed to assess the impact of landscape features on genetic structure and gene flow (e.g. Coulon et al. 2004; Dyer & Nason 2004; Cushman et al. 2006; McRae 2006; McRae & Beier 2007; Garroway et al. 2008; Dyer 2015; Petkova, Novembre & Stephens 2016).

Landscape genetics approaches often require a priori knowledge on species–landscape interactions to target relevant environmental variables, set cost/resistance values and account for demographic effects such as population size (e.g. Broquet et al. 2006; Weckworth et al. 2013). These methods can be difficult to apply when little is known about the species under study. In such a situation, we still rely on exploratory analyses to identify environmental variables of interest and draw up hypotheses that can be further tested using appropriate sampling schemes (Kelling et al. 2009; Richardson et al. 2016). Although the visualisation of pairwise genetic measures may be an obvious first step in such an approach (Etherington 2011), there are still very few tools allowing visualisation of such measures along with environmental layers without, first, attributing features such as cost or resistance values to habitat types. Among those tools, Allele In Space (Miller 2005) and the GIS toolbox of Vandergast et al. (2010) offer the rare possibility to produce variation surfaces of genetic distances that can be mapped over landscape layers (Wood et al. 2013; Adams & Burg 2015). Both tools rely on an inverse‐distance‐weighted interpolation of pairwise measures that are first spatially attributed to the middle of the segments linking the samples in the network. Confounding effects due to IBD can be somewhat limited by (i) using residuals from the regression of genetic distances against geographic distances (Miller 2005) or (ii) limiting the network to the nearest neighbours for each sample (Vandergast et al. 2010).

Disentangling the relative contributions of IBD and environmental features in shaping spatial genetic patterns remains a central issue in landscape genetics (Bradburd, Ralph & Coop 2013; Wang & Bradburd 2014). This question is often addressed using Mantel and partial Mantel tests between matrices of population‐ or individual‐based genetic distances, Euclidean distances and ecological distances (e.g. Cushman et al. 2006; Hagerty et al. 2011). There is an ongoing debate on the statistical performance of the Mantel test in spatial and landscape genetics (see Raufaste & Rousset 2001; Cushman & Landguth 2010; Legendre & Fortin 2010; Cushman et al. 2013; Guillot & Rousset 2013) and, recently, alternative methods using multivariate or geostatistical techniques have been proposed to disentangle the effects of geographic distance and environmental heterogeneity on spatial genetic structure (e.g. Bradburd, Ralph & Coop 2013; Duforet‐Frebourg & Blum 2014; Galpern et al. 2014; Botta et al. 2015). Moreover, the results of many of these methods do not include spatially explicit variation surfaces that can be mapped over landscape layers to facilitate their interpretation.

In this general context, MAPI provides an approach to (i) visualise pairwise genetic measures without confounding effects resulting from IBD, (ii) test for spatial genetic discontinuity through a nonparametric randomisation procedure and (iii) explore relationships between observed genetic patterns and environmental heterogeneity, notably by using MAPI results for further statistical analyses. In this work, we assessed the efficiency of MAPI to accurately detect spatial discontinuity in pairwise genetic metrics by applying permutation tests on controlled simulated data sets of panmictic populations and populations under IBD, including scenarios of separation by a linear barrier to gene flow. We also appraised the potential of MAPI in a landscape genetics framework by analysing genotypes simulated under landscape constraints with various spatial configurations of favourable and unfavourable habitats. We used a Bayesian conditional autoregressive model to analyse the relationships between MAPI results and landscape variables. To illustrate MAPI, we also re‐analysed published data from three biological models: (i) microsatellite genotypes from a rodent population under IBD, (ii) DNA sequences from a plant virus exhibiting spatial genetic discontinuities and (iii) microsatellites and landscape data from both a forest specialist and a generalist species of ground beetles. Finally, we used controlled and observed data sets to explore MAPI sensitivity to parameter setting and sampling scheme.

Materials and methods

MAPI Methodology

MAPI is implemented as an open‐source SQL extension for the open‐source database PostgreSQL 9.x with PostGIS 2.x (1996–2013, The PostgreSQL Global Development Group: http://www.postgresql.org/; http://postgis.net/). Source code, user manual and training data sets can be downloaded from: https://www1.montpellier.inra.fr/CBGP/software/MAPI/. Data files can be imported and MAPI commands run directly from the software R (R Core Team 2015). We hereafter detail the steps of MAPI, with Figs 1 and 2 illustrating the general framework and Table 1 providing hints for setting parameter values.

Figure 1. Schematic view of MAPI components.

Figure 2. Steps in the test for spatial structure. Computation of mw values for the observed (a1, a2) and np permuted data sets (b1, b2); ranking of the observed mw value against the cumulative null distribution for each cell (c1, example for one cell represented by a red hexagon on a2, b2 outputs); lower‐tail (LT) and upper‐tail (UT) P‐values of the cells (c2, significant LT and UT P‐values after FDR corrections are delineated by thin and thick black contours, respectively); visualisation of the final output with raw mw values and significant areas (d). NB: for the sake of visibility, only a subset of ellipses is represented on outputs a1 and b1.

Table 1. Setting of MAPI parameter values
Step | Parameter | Definition | Hints | Default | Simulated test data: spatial framework | Simulated test data: landscape framework
Grid of cells | β | Defines the spatial resolution of the grid of cells | 0·50 and 0·25 for regular and irregular samplings, respectively | NA | Set to 0·5 (855 cells) and 0·25 (~843 cells) for large regular and small irregular samplings, respectively | Not used; importation of a grid with the resolution of the landscape raster pixels (a)
Network of ellipses | Eccentricity | Ellipse eccentricity | 0·975 is a good starting point; run several analyses from 0·800 (inflated) to 0·999 (narrow) to assess robustness | 0·975 | Set to 0·975 | Set to 0·975
Network of ellipses | error_circle_radius | Error circle on sample locations | Higher values for larger uncertainty on sample positions | 10 map units | No error: set to 0·01 | Error: set to 0·50
Network of ellipses | min_distance, max_distance (optional) | Minimum and maximum distances between samples | Start without filtering. Filter on minimum distance to avoid local effects (see section 3·1 in the Appendix) | NA | Not used | min_distance set to 1
Test for spatial structure | n_permutations | Number of permutations (np) | ≥1000 | 1000 | Set to 1000 | Not used
Test for spatial structure | my_alpha | Significance level (α) | α = 0·05 | 0·05 | Set to 0·05 | Not used
(a) When a grid is imported, the cells can be of any shape (e.g. hexagons, squares, etc.).

Data input

The method requires sample geographical coordinates and pairwise metric values computed between samples (e.g. genetic distance).

Grid of cells

A grid of hexagonal cells is superimposed on the area defined by the convex hull of the sampling points. Based on the Nyquist frequency concept, the optimised cell size should be at most half the average distance between closest samples. The half‐width (hw) of the cells is computed as: $h_w = \beta \sqrt{A/N} / \sqrt{2.5980}$, with N the number of sampling points, A the area defined by their convex hull and β a parameter depending on their spatial dispersion: 0·5 for regular sampling and 0·25 for irregular sampling (see Hengl 2006). The grid is automatically generated by informing the β parameter. The final number of cells (nc) in the grid is then computed as: $n_c = A / (h_w^2 \times 2.5980)$. Alternatively, a user‐defined grid of cells can be imported in PostgreSQL (see user manual for details on how to build or import a grid).
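
As a rough illustration of how β controls the grid resolution, the following Python sketch computes a cell half‐width and the corresponding number of cells. It assumes that the half‐width takes the form hw = β √(A/N) / √2·5980, where 2·5980 ≈ 3√3/2 is the area of a regular hexagon of unit side. The function names and the example area (171, a hypothetical convex hull for a 20 × 10 lattice with unit spacing) are illustrative only and are not part of MAPI.

```python
import math

# Area of a regular hexagon with unit side, approximately 2.5980.
HEX_AREA_FACTOR = 3.0 * math.sqrt(3.0) / 2.0


def cell_half_width(area, n_samples, beta):
    """Half-width of the hexagonal cells (assumed formula, see lead-in)."""
    if area <= 0 or n_samples <= 0 or beta <= 0:
        raise ValueError("area, n_samples and beta must be positive")
    return beta * math.sqrt(area / n_samples) / math.sqrt(HEX_AREA_FACTOR)


def expected_cell_count(area, half_width):
    """Number of hexagons of the given half-width needed to tile the area."""
    return area / (HEX_AREA_FACTOR * half_width ** 2)


# Hypothetical example: 200 samples on a regular lattice, beta = 0.5.
hw = cell_half_width(area=171.0, n_samples=200, beta=0.5)
nc = expected_cell_count(171.0, hw)
print(f"half-width = {hw:.4f}, expected number of cells = {nc:.0f}")
```

Under these assumptions the cell count reduces to N/β², so halving β multiplies the number of cells by four. A real grid is likely to contain somewhat more cells than this estimate because cells straddling the hull boundary are kept.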

Network of ellipses

Network edges are formed by ellipsoidal polygons (hereafter referred to as ellipses) for which the foci are the geographical locations of the two samples being connected. The shape of these polygons can be adjusted by two parameters: (i) the eccentricity of the ellipses, which controls the smoothing intensity and must be greater than 0 (infinite circle) and smaller than 1 (straight line), and (ii) the radius of the error circle, which controls for uncertainty on sample coordinates (error_circle_radius). In addition, two optional parameters (min_distance and max_distance) limit the analysis to a given range of between‐sample distances. Effects of these four parameters on the shape of the network are detailed in section 1 of Appendix S1, Supporting Information.
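
The geometry of a single ellipse can be sketched in a few lines of Python. This is a minimal illustration, not the MAPI implementation: it builds a polygon from the two foci and the eccentricity using the standard relations a = c/e and b = a√(1 − e²), where c is half the distance between the foci. The function name and the number of vertices are hypothetical, and the error circle is not modelled here.

```python
import math


def ellipse_polygon(focus1, focus2, eccentricity, n_vertices=32):
    """Vertices of a polygon approximating the ellipse with the given foci."""
    if not 0.0 < eccentricity < 1.0:
        raise ValueError("eccentricity must lie strictly between 0 and 1")
    (x1, y1), (x2, y2) = focus1, focus2
    c = math.hypot(x2 - x1, y2 - y1) / 2.0  # centre-to-focus distance
    if c == 0.0:
        raise ValueError("the two foci must be distinct")
    a = c / eccentricity                          # semi-major axis
    b = a * math.sqrt(1.0 - eccentricity ** 2)    # semi-minor axis
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0     # ellipse centre
    theta = math.atan2(y2 - y1, x2 - x1)          # orientation of the major axis
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    vertices = []
    for k in range(n_vertices):
        t = 2.0 * math.pi * k / n_vertices
        px, py = a * math.cos(t), b * math.sin(t)
        # Rotate the axis-aligned point and translate it to the centre.
        vertices.append((cx + px * cos_t - py * sin_t,
                         cy + px * sin_t + py * cos_t))
    return vertices


# Two samples two map units apart, joined with the default eccentricity.
vertices = ellipse_polygon((0.0, 0.0), (2.0, 0.0), 0.975)
```

With the foci fixed, lowering the eccentricity lengthens both axes, which is why low values give inflated ellipses and stronger smoothing, while values close to 1 approach the straight line between the samples.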

Surface

The ellipses receive the value of the metric computed between the samples they connect and the cells of the grid receive the weighted arithmetic mean (mw) of the ne ellipses intercepting their geographical extent, computed as: $m_w = \frac{1}{s_w} \sum_{i=1}^{n_e} \frac{v_i}{a_i}$, with vi and ai the metric value and area of the ellipse i, respectively, and sw the sum‐of‐weights of the ellipses defined as: $s_w = \sum_{i=1}^{n_e} \frac{1}{a_i}$. This weighting procedure limits long‐distance effects as long and inflated ellipses participate far less than short ellipses to the computation of mw. Cells not intersected by any ellipse have no mw value and are not included in the final result.
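
The cell averaging can be illustrated with a short Python sketch. It assumes that each ellipse is weighted by the inverse of its area, which is the reading suggested by the statement that long and inflated ellipses contribute far less than short ones. The function name and the example values are hypothetical.

```python
def weighted_cell_mean(ellipses):
    """Inverse-area weighted mean for one cell.

    `ellipses` is an iterable of (metric_value, ellipse_area) pairs for the
    ellipses intersecting the cell. Returns None when no ellipse intersects
    the cell, mirroring the exclusion of such cells from the result.
    """
    sum_weights = 0.0
    weighted_sum = 0.0
    for value, area in ellipses:
        if area <= 0:
            raise ValueError("ellipse areas must be positive")
        weight = 1.0 / area          # assumed weight: inverse of the area
        sum_weights += weight
        weighted_sum += weight * value
    if sum_weights == 0.0:
        return None
    return weighted_sum / sum_weights


# One short ellipse (area 2) and one long ellipse (area 20) crossing a cell.
cell = [(0.10, 2.0), (0.50, 20.0)]
m_w = weighted_cell_mean(cell)
print(round(m_w, 4))
```

In this example the short ellipse carries ten times the weight of the long one, so the cell value stays much closer to 0·10 than the unweighted mean of 0·30 would.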

Test of significance for spatial structure

We implemented a nonparametric randomisation procedure, as schematised in Fig. 2, to test whether the pairwise metric values associated with the ellipses are independent of the sample locations (i.e. under the null hypothesis that pairwise metric values, and hence mw cell values, are randomly distributed in space). Sample locations from the observed data set (Fig. 2a1) are permuted np times (Fig. 2b1) (parameter n_permutations). At each permutation, new mw values are computed and stored to build a cumulative null distribution for each of the nc cells of the grid (Fig. 2b2). Each mw cell value from the observed data set is then ranked against its null distribution (Fig. 2c1). For each cell, the proportion of permuted values that are smaller than the observed value provides a lower‐tailed test (LT) P‐value. An upper‐tailed test (UT) P‐value is also computed for each cell as: 1 − (LT P‐value). As the probability of finding significant cells only by chance (i.e. type I error) increases with the number of tests performed (i.e. nc tests), the false discovery rate (FDR) procedure proposed by Benjamini & Yekutieli (2001) is applied to account for multiple testing under positive dependency conditions (i.e. spatial autocorrelation between cells) (Fig. 2c2). The significance level at which FDR is controlled can be set by users through the parameter my_alpha. For example, when my_alpha is set to 0·05, up to 5% of the cells detected as significant are expected to be false positives. Finally, for each test, significant cells that are spatially connected are aggregated together. When using metrics estimating differences, such as distances, significantly high mw values localise areas of higher dissimilarity than expected by chance (hereafter referred to as discontinuous areas) while significantly low mw values localise areas of higher similarity than expected by chance (hereafter referred to as continuous areas).
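
The main ingredients of this test can be sketched in Python as follows. This is an illustrative outline under stated assumptions rather than the MAPI code: the function names are hypothetical, the tail P‐values follow the definition given above, and the multiple‐testing step uses the step‐up procedure of Benjamini & Yekutieli (2001) with its harmonic correction factor, which may differ from the exact variant applied by MAPI.

```python
import random


def permute_locations(locations, rng):
    """Return a random reassignment of sample locations (one permutation)."""
    shuffled = list(locations)
    rng.shuffle(shuffled)
    return shuffled


def tail_p_values(observed, null_values):
    """Lower- and upper-tail P-values of one cell against its null values."""
    n_smaller = sum(1 for value in null_values if value < observed)
    lower = n_smaller / len(null_values)
    return lower, 1.0 - lower


def benjamini_yekutieli(p_values, alpha=0.05):
    """Flag the P-values rejected by the Benjamini-Yekutieli step-up rule."""
    m = len(p_values)
    if m == 0:
        return []
    harmonic = sum(1.0 / i for i in range(1, m + 1))  # correction factor c(m)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * alpha / (m * harmonic):
            largest_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[index] = True
    return rejected


rng = random.Random(42)
permuted = permute_locations([(0, 0), (1, 0), (2, 0), (3, 0)], rng)
lt, ut = tail_p_values(0.5, [0.1, 0.2, 0.6, 0.7])
flags = benjamini_yekutieli([0.001, 0.5, 0.9, 0.002], alpha=0.05)
```

In a full analysis the cell values would be recomputed after each call to `permute_locations`, and the correction would be applied separately to the LT and UT P‐values of all cells.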

Data output

Cell‐specific information, such as their geometry and associated mw and P‐values, is stored as a spatial PostgreSQL table which can be exported as a shapefile or text file and imported into GIS software or the software R using the rgdal package. These data can be used for further statistical analyses of relationships between genetic measures and landscape variables.

Visualisation

On the final graphical output (Fig. 2d), the observed cell values (mw) are displayed using a colour scale. Cells with significant extreme values after FDR correction are visualised as black contours mapped as an additional layer. When significant cells are spatially connected, only the perimeter of the polygon they form is represented. Vector or raster maps, satellite imagery or teledetection products can be inserted below the MAPI layers for landscape interpretation.

Simulated Data in Spatial Genetics

Using backward‐in‐time (genealogy‐based) simulation algorithms, we simulated data sets of 10 microsatellite genotypes for 200 diploid individuals distributed on the nodes of a 20 × 10 lattice (i.e. one individual per node). Three demographic models with different sets of parameter values were analysed (25 simulated data sets each): (i) one IBD population without barriers to gene flow (spatial test data sets 1; IBD); (ii) two panmictic populations separated by a barrier (spatial test data sets 2; Barrier); and (iii) two IBD populations separated by a barrier (spatial test data sets 3; Barrier and IBD) (see Fig. 3 to visualise the simulated lattice). A null model of one panmictic population without a barrier to gene flow was also considered (spatial test data sets 0; Panmixia). The permeability of the barrier is controlled by setting the Nem parameter, that is the product of the effective population size and migration rate per generation. Details on simulation algorithms, parameter settings and basic measures of genetic variation can be found in sections 2·1 and 2·3 of the Appendix S1. We investigated MAPI sensitivity to sampling effort by (i) randomly resampling 75 individuals out of the 200 simulated for each of the test data sets, which provided a test for how MAPI deals with both random irregular sampling and small sample size, and (ii) simulating additional test data sets with 100 loci for a representative subset of spatial data sets under both large regular and small irregular samplings (see Table 2 to identify test data sets).

Figure 3. Subset of MAPI graphical outputs from data sets simulated under panmixia (a), IBD (e, i), barrier to gene flow between panmictic (b–d) and IBD (f–h, j–l) populations. Significant continuous and discontinuous areas are denoted by thin and thick black contours, respectively. The white line marks the simulated barrier. NB: colour scale varies.
Table 2. Performance of MAPI under different spatial models. Parameters for each spatial model are indicated, including the product of the effective population size and migration rate per generation (Nem); the shape parameter of the geometric distribution (g); the number of simulated populations (Ksim); and the number of loci (Nloc). Performance is measured as the average percentage, over 25 replicates, of the barrier covered by significant discontinuous areas (Coverage), and the average percentage, over 25 replicates, of significant discontinuous areas having no contact with the barrier (Unexpected). Large sampling refers to the sampling of 200 genotypes distributed regularly on the lattice and small sampling to the sampling of 75 genotypes drawn randomly from the lattice. The data sets with 100 loci are indicated in italics
Spatial model | Nem | g | Ksim | Nloc | Coverage (%), large sampling | Coverage (%), small sampling | Unexpected (%), large sampling | Unexpected (%), small sampling
Panmixia (test data set 0) | NA | NA | 1 | 10 | NA | NA | 0·10 | 0·05
IBD (test data sets 1) | NA | 0·250 | 1 | 10 | NA | NA | 0·07 | 0·53
 | NA | 0·250 | 1 | 100 | NA | NA | 0·00 | 2·75
 | NA | 0·500 | 1 | 10 | NA | NA | 0·06 | 0·23
 | NA | 0·500 | 1 | 100 | NA | NA | 0·04 | 0·70
 | NA | 0·675 | 1 | 10 | NA | NA | 0·06 | 0·06
 | NA | 0·750 | 1 | 10 | NA | NA | 0·06 | 0·26
Barrier (test data sets 2) | 0·1 | NA | 2 | 10 | 99 | 84 | 0·03 | 0·21
 | 1 | NA | 2 | 10 | 65 | 31 | 0·08 | 0·35
 | 2·5 | NA | 2 | 10 | 7 | 2 | 0·15 | 0·27
 | 2·5 | NA | 2 | 100 | 98 | 85 | 0·04 | 0·06
 | 10 | NA | 2 | 10 | 0 | 0 | 0·12 | 0·11
 | 10 | NA | 2 | 100 | 30 | 13 | 0·09 | 0·28
Barrier and IBD (test data sets 3) | 0·1 | 0·250 | 2 | 10 | 100 | 90 | 0·04 | 0·29
 | 1 | 0·250 | 2 | 10 | 100 | 81 | 0·07 | 0·24
 | 2·5 | 0·250 | 2 | 10 | 90 | 66 | 0·04 | 0·61
 | 10 | 0·250 | 2 | 10 | 41 | 21 | 0·08 | 0·49
 | 0·1 | 0·500 | 2 | 10 | 100 | 92 | 0·02 | 0·21
 | 1 | 0·500 | 2 | 10 | 96 | 79 | 0·01 | 0·36
 | 2·5 | 0·500 | 2 | 10 | 92 | 62 | 0·07 | 0·21
 | 10 | 0·500 | 2 | 10 | 44 | 26 | 0·10 | 0·21
 | 10 | 0·500 | 2 | 100 | 99 | 82 | 0·05 | 0·31
 | 0·1 | 0·675 | 2 | 10 | 100 | 94 | 0·02 | 0·23
 | 1 | 0·675 | 2 | 10 | 96 | 74 | 0·02 | 0·35
 | 2·5 | 0·675 | 2 | 10 | 85 | 53 | 0·08 | 0·21
 | 2·5 | 0·675 | 2 | 100 | 100 | 96 | 0·03 | 0·36
 | 10 | 0·675 | 2 | 10 | 23 | 7 | 0·13 | 0·21
 | 0·1 | 0·750 | 2 | 10 | 99 | 94 | 0·03 | 0·15
 | 1 | 0·750 | 2 | 10 | 93 | 65 | 0·05 | 0·27
 | 2·5 | 0·750 | 2 | 10 | 79 | 47 | 0·07 | 0·24
 | 10 | 0·750 | 2 | 10 | 15 | 6 | 0·02 | 0·12

Analyses of simulated test data sets with MAPI were performed using a grid of hexagonal cells defined by setting the parameter β to 0·5 and 0·25 for regular and irregular samplings, respectively (Table 1). The network was built using the genetic distance ar (Rousset 2000), an error radius for sample location of 0·01 and ellipses with an eccentricity value of 0·975. Cells with extreme high mw values were detected using 1000 permutations of the sample locations and a significance level of 0·05 for FDR control. As the exact location of the barrier was known, we computed the proportion of the barrier covered by cells with significantly higher values (hereafter referred to as Coverage). We also computed the proportion of the study area covered by cells with significantly higher values having no contact with the simulated barrier, which may result from stochasticity, edge effects or IBD (hereafter referred to as Unexpected). We also investigated the sensitivity of MAPI to the eccentricity value (0·900, 0·975 and 0·999) and number of cells (setting the parameter β to 0·15, 0·25, 0·5 and 0·75). To this aim, we used a subset of data sets simulated under both large regular sampling and small irregular sampling (see section 2·1 in the Appendix S1 to identify test data sets).
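
The two performance measures can be approximated with a short Python sketch. It rests on simplifying assumptions that are not taken from the original analysis: cells are identified by arbitrary labels and treated as having equal area, the barrier is represented by the set of cells it crosses, and an aggregate of significant cells counts as being in contact with the barrier when it contains at least one barrier cell. All names are hypothetical.

```python
def connected_aggregates(significant, neighbours):
    """Group significant cells into spatially connected aggregates."""
    remaining = set(significant)
    aggregates = []
    while remaining:
        start = remaining.pop()
        component = {start}
        stack = [start]
        while stack:
            cell = stack.pop()
            for neighbour in neighbours.get(cell, ()):
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        aggregates.append(component)
    return aggregates


def coverage_and_unexpected(all_cells, significant, barrier_cells, neighbours):
    """Return (Coverage, Unexpected) as percentages of cell counts."""
    coverage = 100.0 * len(significant & barrier_cells) / len(barrier_cells)
    unexpected_cells = sum(
        len(aggregate)
        for aggregate in connected_aggregates(significant, neighbours)
        if not aggregate & barrier_cells  # aggregate never touches the barrier
    )
    unexpected = 100.0 * unexpected_cells / len(all_cells)
    return coverage, unexpected


# Toy example: a strip of 10 cells, barrier crossing cells 4 and 5.
all_cells = set(range(10))
neighbours = {i: [j for j in (i - 1, i + 1) if j in all_cells] for i in all_cells}
coverage, unexpected = coverage_and_unexpected(
    all_cells, significant={3, 4, 8}, barrier_cells={4, 5}, neighbours=neighbours
)
print(coverage, unexpected)
```

In the toy example, cells 3 and 4 form one aggregate that touches the barrier, so only the isolated cell 8 counts as unexpected.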

Simulated Data in Landscape Genetics

Using a forward‐in‐time (individual‐based) simulation algorithm, we simulated data sets of 10 microsatellite genotypes for diploid individuals distributed in a landscape raster of about 50 × 50 cells. Three landscape models, which consider two habitats with contrasting carrying capacities (20 and 2 for the favourable and unfavourable habitat, respectively), were analysed (20 simulated data sets each): (i) a spatial transition from the favourable to unfavourable habitat with a high level of interpenetration (landscape test data sets 1; Gradient); (ii) fragmentation of the favourable habitat in small areas isolated by a prevalent unfavourable habitat (landscape test data sets 2; Fragmentation); and (iii) a random distribution of the favourable and unfavourable habitats over the study area (landscape test data sets 3; Random) (see Fig. 4 to visualise the simulated landscapes). For the latter, no habitat effect is expected to be detected since the spatial scale of autocorrelation in habitat is very small. Details on simulation algorithms, parameter settings and basic measures of genetic variation can be found in sections 2·2 and 2·3 of the Appendix S1. For each of the simulated data sets, we also investigated MAPI sensitivity to sampling effort by sampling 200 and 500 individuals under three sampling schemes: (i) random sampling, in which individuals were sampled anywhere in the landscape regardless of habitat; (ii) balanced sampling, in which an equal number of individuals was sampled from each habitat; and (iii) gridded sampling, in which the same number of individuals was randomly sampled from each square of a 3 × 3 grid encompassing the landscape raster.

Figure 4. Subset of MAPI graphical outputs for the gradient, fragmented and random landscape configurations and three sampling schemes (random: 500 genotypes sampled anywhere, balanced: 250 genotypes sampled in each habitat and gridded: 56 genotypes sampled in each cell of a 3 × 3 lattice (not shown) covering the study area). NB: colour scale varies and white lines represent habitat configuration.

Analyses of simulated data sets with MAPI were performed using grids made of square cells that matched the landscape raster pixels (side length = 1), the genetic distance ar (Rousset 2000), an error radius for sample location of 0·5 to consider that individuals can be anywhere within a cell, ellipses with an eccentricity of 0·975 and a minimal distance between samples of 0·1 to exclude intracell connections, as individuals simulated within the same cell have the same geographical coordinates. For each simulation and sampling scheme, we extracted from each cell both the mw value and the habitat type, which was expressed as a factor according to favourability (hab): 1 for the favourable habitat and 2 for the unfavourable habitat. We fitted a regression model as follows: $m_{w,i} = \mu_{hab_i} + \varepsilon_i$, where $\mu_{hab_i}$ is an intercept depending on the habitat type in cell i. To account for spatial correlations, it was assumed that the error term εi included two components: a structured component (i.e. spatially correlated) and an unstructured component (i.e. independently distributed). The unstructured heterogeneity term was assumed to be centred and normally distributed while the structured heterogeneity term was assumed to have a conditional intrinsic Gaussian autoregressive (CAR) distribution (Besag, York & Mollié 1991) with first‐ and second‐order neighbours as neighbouring structure. The model was fitted in a Bayesian framework with an INLA approach (Blangiardo & Cameletti 2015) using the R‐INLA package (Martins et al. 2013) of the software R (R Core Team 2015). The fit of the model to the data was assessed using posterior predictive checking (Gelman et al. 2004), and we systematically checked that residuals were not overly structured in space.
Bayesian inference resulted in posterior densities for the parameters $\mu_1$ (favourable habitat) and $\mu_2$ (unfavourable habitat). As we used a genetic distance, we expect a high posterior probability for the intercept of the favourable habitat ($\mu_1$) to be less than the intercept of the unfavourable habitat ($\mu_2$). The significance was assessed with a threshold of 0·05.

Biological Data Sets

To illustrate MAPI behaviour in the presence of IBD, we analysed microsatellite genotypes from a rodent population that was previously characterised, using the clustering method GENELAND (Guillot, Mortier & Estoup 2005), as a single genetic unit structured only by IBD (Gauffre et al. 2008). MAPI efficiency in detecting genetic discontinuities is further illustrated using DNA sequences from different strains of a plant virus. These data were previously analysed (Desbiez et al. 2009; Joannon et al. 2010) using phylogenetic analyses, the clustering method SAMOVA (Dupanloup, Schneider & Excoffier 2002) and Monmonier's maximum difference algorithm (Monmonier 1973). Finally, MAPI was applied to genetic distances computed from microsatellite genotypes of two species of forest ground beetles with contrasting levels of habitat specialisation (i.e. specialist and generalist). Previous work using Mantel tests between genetic, geographic and landscape distances showed that open field areas were a stronger barrier to gene flow for the forest specialist (Brouat et al. 2003). See section 3 in the Appendix S1 for further details on the data sets and related works.

Results

All simulation set‐ups and MAPI graphical outputs are available online at: https://www1.montpellier.inra.fr/CBGP/software/MAPI/.

Sensitivity to Isolation by Distance

Simulation test data sets

MAPI did not detect unexpected areas of genetic discontinuity under strict IBD, regardless of its strength (see test data sets 1 in Table 2; Fig. 3e,i). When models combined IBD and barrier effects, MAPI still did not detect unexpected areas of genetic discontinuity that did not overlap with the simulated barrier (see test data sets 3 in Table 2; Fig. 3f–h, j–l). Under large regular sampling, the average percentage over replicates of unexpected discontinuous areas was <0·15% whatever the simulation setting considered (see ‘Unexpected’ in Table 2), with a maximal percentage across the 625 simulated data sets of 1·3%. Under small irregular sampling, the percentage of unexpected discontinuous areas was still minute, with an average lower than 0·53% whatever the simulation setting (Table 2) and a maximum across the 625 simulated data sets of 5·8%. Thus, gaps between samples did not drastically increase the proportion of unexpected significant discontinuous areas, even under strong IBD (g = 0·25, slope = 0·055). The single exception was the combination of strong IBD and genotype data sets of 100 loci, for which the small and irregular sampling led to a larger (but still relatively low) increase in the percentage of unexpected discontinuous areas, with an average over replicates of 2·75% and a maximum of 10·8% (Table 2). However, the spatial distribution of the unexpected significant cells hardly suggested the presence of a barrier to gene flow (see online illustrations).

Biological data set

When applied to the rodent microsatellite data, MAPI did not find any spurious significant area of genetic discontinuity that could be interpreted as a barrier to gene flow, despite IBD (slope = 0·005, P‐value = 0·005 – see section 3·1 in the Appendix S1).

Detection of a Barrier to Gene Flow

Simulation test data sets

We found that MAPI was efficient in detecting strong to moderate barriers to gene flow (Nem ≤ 1; FST ≥ 0·1) with a barrier coverage ≥75% (see test data sets 2 in Table 2; Fig. 3c,d). When permeability increased (Nem > 1; FST < 0·1), the method lost accuracy in detecting the barrier (the proportion of undetected barriers became large – see test data sets 2 in Table 2; Fig. 3b). However, our simulations showed that IBD increased the power of the method to detect the linear barrier, especially for high levels of gene flow. For example, for Nem = 2·5, even weak IBD (g = 0·750; slope = 0·005) increased the barrier coverage from 7% to 79%. As a result, MAPI performed well in identifying barriers to gene flow provided that Nem < 10 (see test data sets 3 in Table 2; Fig. 3f–h and j–l). When samplings were small and irregular (N = 75), the performance of MAPI in detecting a weak barrier to gene flow (Nem > 1) decreased by 30–70% (Table 2), though in these situations there was still often a graphical signal for a barrier (see Fig. S3 in the Appendix S1). As expected, increasing genotyping effort further improved the detection of barriers for high levels of gene flow (Table 2). For example, for Nem = 2·5, increasing the number of loci from 10 to 100 increased the barrier coverage from 7% to 98% in the absence of IBD, and for Nem = 10, 30% of the barrier was still recovered.

Biological data set

In line with previous results published on the plant virus sequence data set, MAPI identified a major area of genetic discontinuity bisecting the study area from north to south (see section 3·2 in the Appendix S1).

Sensitivity to parameter settings

The eccentricity and the number of cells have little effect on the detection of unexpected significant areas (i.e. Unexpected in Fig. 5). The worst situation (i.e. 3%) occurred when using very narrow ellipses (e = 0·999; see Fig. S1 of the Appendix S1) combined with a very high number of cells (i.e. β = 0·15). Combining narrow ellipses (e = 0·999), which result in fewer ellipses intercepting each cell, with a very low number of cells (i.e. β = 0·75) decreased MAPI efficiency in detecting a barrier to gene flow (i.e. Coverage in Fig. 5; Fig. S3 in the Appendix S1). An eccentricity value of 0·975 associated with a β value of 0·5 for regular sampling or 0·25 for irregular sampling ensures high barrier coverage, a low rate of false positives and reasonable computational time. Lower values of eccentricity (0·8–0·95; inflated ellipses) provide results of the same quality as a value of 0·975 but at the cost of increased computational time and a stronger smoothing effect (see the broader aggregates of significant cells for a value of 0·9 in Fig. S3 in the Appendix S1). The effect of the eccentricity parameter is further illustrated on virus data (Fig. S8 in the Appendix S1). In all cases, the major area of genetic discontinuity was still uncovered by MAPI.

Figure 5. Interactions between eccentricity and β parameter values on MAPI performance. For the different values of β, the y‐axis shows, when applicable, the percentage of the simulated barrier covered by significant discontinuous areas (i.e. Coverage, left panel) or the percentage of significant discontinuous areas having no contact with the barrier (Unexpected, right panel). Eccentricity is on the x‐axis. We considered large regular sampling schemes (200 samples, top row) and small irregular sampling schemes (75 samples, bottom row). We used a subset of ten data sets simulated under spatial genetics models (see the section 2·1 in the Appendix S1 to identify test data sets).
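The two performance measures named in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' scripts: cells are identified by arbitrary ids, "contact" with the barrier is taken to mean that a connected area of significant cells shares at least one cell with the barrier, and Unexpected is expressed relative to the total number of grid cells (an assumption suggested by the fact that it is also reported for simulations without any barrier). The function name and arguments are hypothetical.

```python
def coverage_and_unexpected(significant, barrier, neighbours, n_cells):
    """Return (coverage %, unexpected %) for one MAPI-like result.

    significant: set of cell ids flagged as significantly discontinuous
    barrier:     set of cell ids crossed by the simulated barrier (may be empty)
    neighbours:  dict mapping a cell id to the set of adjacent cell ids
    n_cells:     total number of cells in the grid (assumed denominator)
    """
    # Coverage: share of barrier cells that are flagged as significant.
    coverage = 100.0 * len(barrier & significant) / len(barrier) if barrier else None

    # Group significant cells into connected areas with a simple flood fill.
    unexpected_cells = 0
    seen = set()
    for start in significant:
        if start in seen:
            continue
        area, stack = set(), [start]
        while stack:
            cell = stack.pop()
            if cell in area:
                continue
            area.add(cell)
            stack.extend(n for n in neighbours.get(cell, ())
                         if n in significant and n not in area)
        seen |= area
        # An area is "unexpected" if none of its cells lies on the barrier.
        if not (area & barrier):
            unexpected_cells += len(area)

    return coverage, 100.0 * unexpected_cells / n_cells


# Toy example: a one-dimensional strip of 10 cells, barrier on cells 4 and 5.
strip = {i: {j for j in (i - 1, i + 1) if 0 <= j < 10} for i in range(10)}
print(coverage_and_unexpected({3, 4, 8}, {4, 5}, strip, n_cells=10))
```

In the toy example, one of the two barrier cells is flagged, and the isolated significant cell 8 forms an area with no contact with the barrier.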

Assessment of Landscape Effects

Simulation test data sets

As expected, when the two habitats were distributed randomly, there was no obvious correspondence between the spatial variation in mw and landscape cell values (Fig. 4; see also Fig. S4 in the Appendix S1 to visualise examples of MAPI graphical outputs). This was confirmed by the results of the Bayesian conditional autoregressive model, which showed that most of the posterior probabilities for a habitat effect were not significant (Table 3 and Fig. S5 in the Appendix S1). A difference between habitats was, however, still detected in a few simulated test data sets, especially when using random sampling. For the gradient and fragmented landscapes, MAPI graphical outputs showed a relative convergence between variation in mw cell values and spatial habitat configuration (Fig. 4 and Fig. S4 in the Appendix S1). Accordingly, the intercept was significantly lower for the favourable habitat ($\alpha_1$) than for the unfavourable habitat ($\alpha_2$) for all data sets regardless of the sampling scheme and size for the fragmented landscape, and for most of the data sets when large (500 genotypes) and random or gridded samplings were used for the gradient landscape (Table 3). When the sampling size was smaller (200 genotypes), the proportion of significant posterior probabilities for $\alpha_1 < \alpha_2$ decreased by 25 and 50 percentage points for the random and gridded samplings, respectively. Interestingly, under the gradient configuration, the balanced sampling led to very poor results and even, in a few cases, to false positives (i.e. $\alpha_1 > \alpha_2$) (Table 3 and Fig. S5 in the Appendix S1).

Table 3. Performance of MAPI under different landscape models. For each simulated data set and sampling scheme, we extracted cell‐specific mw values and habitat type (1 or 2) to fit the following regression model: $m_{w,i} = \alpha_{h(i)} + \varepsilon_i$, where $\alpha_{h(i)}$ is an intercept depending on the habitat type $h(i)$ in cell $i$ and the error term $\varepsilon_i$ includes a structured and an unstructured component. Presented here are the posterior probability that $\alpha_1 < \alpha_2$, $P(\alpha_1 < \alpha_2)$, averaged over 20 replicates ($\bar{P}$), as well as the percentage of replicates with $P(\alpha_1 < \alpha_2)$ ≥ 0·95 (P95%) and ≤ 0·05 (P5%). Note that for all simulated test data sets, the regression model was well adjusted with a mean predictive P‐value very close or equal to 0·5 and residuals not overly structured in space (see landscape genetics illustrations on the website)

| Landscape model | Sampling scheme | $\bar{P}$, large sampling | $\bar{P}$, small sampling | P95%, large sampling | P95%, small sampling | P5%, large sampling | P5%, small sampling |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gradient (test data sets 1) | Random | 0·945 | 0·883 | 80 | 55 | 0 | 0 |
| Gradient (test data sets 1) | Balanced | 0·614 | 0·709 | 10 | 15 | 5 | 5 |
| Gradient (test data sets 1) | Gridded | 0·972 | 0·862 | 90 | 40 | 0 | 0 |
| Fragmentation (test data sets 2) | Random | 0·999 | 0·999 | 100 | 100 | 0 | 0 |
| Fragmentation (test data sets 2) | Balanced | 0·999 | 0·998 | 100 | 100 | 0 | 0 |
| Fragmentation (test data sets 2) | Gridded | 0·999 | 0·999 | 100 | 100 | 0 | 0 |
| Random (test data sets 3) | Random | 0·762 | 0·741 | 40 | 25 | 0 | 0 |
| Random (test data sets 3) | Balanced | 0·655 | 0·546 | 15 | 15 | 0 | 5 |
| Random (test data sets 3) | Gridded | 0·780 | 0·639 | 10 | 15 | 0 | 0 |
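As an illustration of how the three summary columns of Table 3 relate to model output, the sketch below assumes that each replicate yields paired posterior draws of the two habitat intercepts (for instance from an MCMC fit of the regression model). The function names and the toy numbers are hypothetical, and the fitting of the conditional autoregressive model itself is not shown.

```python
def prob_alpha1_less(draws_alpha1, draws_alpha2):
    """Posterior probability that alpha1 < alpha2, estimated as the
    share of paired posterior draws in which the inequality holds."""
    pairs = list(zip(draws_alpha1, draws_alpha2))
    return sum(a1 < a2 for a1, a2 in pairs) / len(pairs)


def summarise_replicates(probs):
    """Summarise per-replicate probabilities as in the table columns:
    their mean, and the percentage of replicates >= 0.95 and <= 0.05."""
    n = len(probs)
    return {
        "mean": sum(probs) / n,
        "P95%": 100.0 * sum(p >= 0.95 for p in probs) / n,
        "P5%": 100.0 * sum(p <= 0.05 for p in probs) / n,
    }


# Toy example with four replicates (values invented for illustration only).
print(prob_alpha1_less([1, 2, 3, 4], [2, 2, 4, 3]))
print(summarise_replicates([0.99, 0.97, 0.50, 0.02]))
```

Under this reading, a high P95% indicates that the expected habitat effect is recovered in most replicates, whereas a non-zero P5% flags replicates supporting the opposite ordering of the intercepts.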

Biological data set

On the ground beetle microsatellite data sets, MAPI analyses supported previously published conclusions by identifying a significant area of genetic discontinuity corresponding to a large open field for the forest specialist only (Fig. 6). When using mw cell values in the Bayesian conditional autoregressive model with the proportion of trees as an explanatory variable, we found a significant negative relationship only for the forest specialist (posterior probability = 1). This means that, on average, between‐individual genetic distances were lower within highly forested areas. No significant pattern was found for the generalist (see section 3·3 in the Appendix S1).

Figure 6. MAPI graphical output for the forest specialist species of ground beetle. Cell‐specific mw values appear as a colour scale. The hatched area corresponds to the location of the significant area of genetic discontinuity. Variation in tree coverage follows a greyscale from a minimum of 11% to a maximum of 75%. Sampling locations are illustrated by white circles proportional to sampling sizes.

Discussion

In this work, we presented a new method to translate networks of pairwise relationships into variation surfaces. MAPI is essentially a smoothing procedure using the overlap between ellipses as a way to share information between spatial connections. The surface produced is a grid of cells that provide information on the average intensity of the pairwise relationships crossing at the cell locations. The variation in the cell values over the surface makes it possible to localise areas where the majority of the crossing connections correspond to very low or very high pairwise values. The significance of these areas can be assessed using a nonparametric randomisation procedure.
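The smoothing and randomisation ideas summarised above can be sketched as follows. This is a simplified illustration rather than the MAPI implementation, and it rests on several assumptions: cells are represented by their centre points only (so a cell counts as intersected when its centre falls inside an ellipse, and any grid shape can be used), ellipses have their foci at the two linked samples, the randomisation permutes sample locations, and only the upper tail (discontinuity) is tested. All function names are hypothetical.

```python
import math
import random


def ellipse_semi_axes(f1, f2, ecc):
    """Semi-major and semi-minor axes of the ellipse with foci f1, f2."""
    a = math.dist(f1, f2) / (2.0 * ecc)
    return a, a * math.sqrt(1.0 - ecc ** 2)


def cell_values(coords, metric, centres, ecc=0.975):
    """Inverse-area-weighted mean of the pairwise values whose ellipse
    contains each cell centre (NaN where no ellipse reaches the cell)."""
    values = []
    for p in centres:
        num = den = 0.0
        for i in range(len(coords)):
            for j in range(i + 1, len(coords)):
                if coords[i] == coords[j]:
                    continue  # coincident samples define no ellipse
                a, b = ellipse_semi_axes(coords[i], coords[j], ecc)
                # Point-in-ellipse test: summed distances to the foci <= 2a.
                if math.dist(p, coords[i]) + math.dist(p, coords[j]) <= 2.0 * a:
                    w = 1.0 / (math.pi * a * b)  # inverse ellipse area
                    num += w * metric[i][j]
                    den += w
        values.append(num / den if den > 0 else float("nan"))
    return values


def permutation_pvalues(coords, metric, centres, n_perm=199, ecc=0.975, seed=0):
    """Upper-tail p-value per cell: share of data sets (observed one included)
    whose cell value is at least as high as the observed value."""
    rng = random.Random(seed)
    observed = cell_values(coords, metric, centres, ecc)
    exceed = [1] * len(centres)  # the observed data set counts once
    for _ in range(n_perm):
        shuffled = coords[:]
        rng.shuffle(shuffled)  # reassign sample locations at random
        permuted = cell_values(shuffled, metric, centres, ecc)
        for k, (obs, val) in enumerate(zip(observed, permuted)):
            if val >= obs:
                exceed[k] += 1
    pvals = [float("nan") if math.isnan(obs) else e / (n_perm + 1)
             for obs, e in zip(observed, exceed)]
    return observed, pvals


# Toy example: two groups of three samples separated by a gap, with
# pairwise distance 0 within groups and 1 between groups.
coords = [(-2.0, -1.0), (-2.0, 0.0), (-2.0, 1.0), (2.0, -1.0), (2.0, 0.0), (2.0, 1.0)]
metric = [[0.0 if (i < 3) == (j < 3) else 1.0 for j in range(6)] for i in range(6)]
centres = [(0.0, 0.0), (-2.0, 0.5)]  # one cell in the gap, one inside a group
print(permutation_pvalues(coords, metric, centres))
```

In the toy example, the cell lying in the gap is crossed only by between-group connections and therefore takes a high value, while the cell inside a group is informed only by short within-group connections.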

MAPI can be applied to genetic data to detect areas of high genetic continuity and discontinuity. When using neutral markers (e.g. microsatellites) and genetic differentiation measures, continuous and discontinuous areas reflect areas where gene flow is the highest and the lowest, respectively. Variation in gene flow intensity can result from different processes such as spatial heterogeneity in population density or migration success (Richardson et al. 2016). Here, using controlled simulations, we determined that MAPI performed well in detecting genetic discontinuities resulting from a physical barrier as long as it is not too permeable to gene flow. When gene flow was high, MAPI's performance improved substantially with increased sampling effort in the number of loci and individuals. Under such spatial genetics models, the sensitivity analysis to eccentricity setting showed that a value of 0·975 ensured high barrier coverage, low rate of unexpected significant cells, high spatial resolution and reasonable computational time. This result might, however, differ in scenarios more complex than a linear barrier to gene flow (e.g. highly fragmented landscape).

A central feature of MAPI is its relative insensitivity to IBD, which is a critical issue for spatial and landscape genetics analyses (Guillot et al. 2009; Thomassen et al. 2010; Bradburd, Ralph & Coop 2013). In an ideal situation, with perfect regular sampling and no edge effects, all cells should theoretically display the same mw value under strict IBD. When analysing highly irregular samplings, the method detected a few unexpected significant areas of genetic discontinuity as the cells located within the spatial gaps were only informed from long‐distance connections. These highly discontinuous areas were, however, small and generally located on the border of the study area. As for many spatial exploratory analyses, regular individual‐based samplings are more likely to provide reliable and interpretable results (Oyler‐McCance, Bradley & Landguth 2013; Balkenhol et al. 2015).

In landscape genetics, assessing the effects of environmental features on spatial patterns of genetic variation often requires going beyond the detection of barriers to gene flow. In this context, our controlled simulations and ground beetle data sets (Brouat et al. 2003) showed that overlaying MAPI graphical outputs on landscape layers may provide information on which environmental variables are potential candidates to explain observed genetic patterns (see Fig. 4 and Fig. S9 in the Appendix S1). These candidate variables can be further explored by using mw cell values in post‐MAPI analyses. To illustrate this possibility, we used a regression model accounting for spatial autocorrelation in mw cell values and fitted in a Bayesian framework. From our simulated data sets, the regression model successfully retrieved the expected relationships (i.e. smaller mw values within the favourable habitat) for the fragmented and gradient landscapes but not when the two habitats were randomly distributed. These results can be explained by the difference in the scale at which habitats are spatially autocorrelated in the three simulated landscapes. As long‐distance effects are limited in the computation of mw by using an inverse‐area‐weighting procedure, the cells mainly reflect the average intensity of the shorter connections crossing at one location. Consequently, significant relationships between mw cell values and landscape variables are expected to be detectable when the landscape is spatially structured in such a way that a cell is mostly influenced by short‐distance connections occurring between samples located within a relatively homogeneous landscape aggregate. Conversely, when the spatial scale of autocorrelation in landscape features is smaller than the resolution of the sampling, a cell reflects between‐ as much as within‐habitat connections and habitat effects are likely to become undetectable (e.g. random landscape).
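To make the effect of inverse‐area weighting concrete, the relation below assumes that each ellipse has its foci at the two linked samples, separated by a distance $d$, and that all ellipses share the same eccentricity $e$. The symbols $a$, $b$ and $A$ (semi‐major axis, semi‐minor axis and ellipse area) are introduced here for illustration only.

```latex
a = \frac{d}{2e}, \qquad
b = a\sqrt{1 - e^{2}}, \qquad
A = \pi a b = \frac{\pi\sqrt{1 - e^{2}}}{4e^{2}}\, d^{2}
\quad\Longrightarrow\quad
\frac{1}{A} \propto \frac{1}{d^{2}}
```

Under these assumptions, with e = 0·975 the area is approximately 0·18 d², so a connection ten times longer than another carries a weight one hundred times smaller in the cells that both cross.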
In the data sets simulated under a fragmented landscape model, the sampling size and strategy did not affect the detection of the habitat effect. Under the gradient configuration, the habitat effect was well detected, especially when a large number (i.e. 500) of genotypes were sampled following a random or, even better, a gridded sampling strategy. In contrast, the balanced sampling strategy was ineffective at detecting an effect. This result may be due to the multiscale patterns of habitat autocorrelation, which have two consequences. First, as the two habitats strongly interpenetrate, the central part of the study area looks like the random landscape model (i.e. small habitat patches). Within this zone, the balanced sampling produces numerous between‐habitats connections at short distance that blur the relationship between the mw cell values and habitat type. Secondly, over the whole study area, the scale at which the habitats are autocorrelated is quite large and, consequently, autocorrelation in mw cell values can be expected to be large as well. In such a situation, increasing the neighbouring structure from two to five cells in the conditional autoregressive model significantly improves the detection of the habitat effect from the balanced sampling (i.e. mean posterior probability = 0·909 and 0·927; percentage of significant probabilities = 60 and 75 for the data sets with 200 and 500 genotypes, respectively).

Analysis of MAPI cell values jointly with landscape variables is a data‐driven exploratory approach that can help identify candidate landscape features. Adequate sampling schemes and analyses aiming to test for the effects of those candidate variables should be conducted at an appropriate spatial scale before drawing conclusions (Richardson et al. 2016).

Acknowledgements

The authors thank T. Jombart, G. Guillot, N. Verzelen, R. Leblois and A. Estoup for helpful comments on previous versions of the manuscript as well as the associate editor and four referees for their very constructive remarks. The authors also thank C. Desbiez and C. Brouat for providing the virus and forest ground beetle data, respectively. K.B. acknowledges the INRA Metaprogram ‘Sustainable Management of Crop Health’ for funding the COPACABANA project. This work is dedicated to the late Serge Meusnier.

Data accessibility

Rodent microsatellite data: DRYAD entry http://dx.doi.org/10.5061/dryad.jf7sn (Gauffre et al. 2014)

Plant virus sequence data: GenBank accession numbers EU660581–EU660590, HM044202–HM044215.

Ground beetle microsatellite data: Figshare entry https://figshare.com/articles/GroundBeetles_zip/3482804 and French National Institute for Agricultural Research, https://www1.montpellier.inra.fr/CBGP/software/MAPI/ (file: MAPI_examples.zip).

Simulated test data sets and R scripts: French National Institute for Agricultural Research, https://www1.montpellier.inra.fr/CBGP/software/MAPI/ (file: MAPI_examples.zip).
