Guidelines for the prediction of species interactions through binary classification
Abstract
- The prediction of species interactions is gaining momentum as a way to circumvent limitations in data volume. Yet, ecological networks are challenging to predict because they are typically small and sparse. Dealing with extreme class imbalance is a challenge for most binary classifiers, and there are currently no guidelines as to how predictive models can be trained for this specific problem.
- Using simple mathematical arguments and numerical experiments in which a variety of classifiers (for supervised learning) are trained on simulated networks, we develop a series of guidelines related to the choice of measures to use for model selection, and the ways to assemble the training dataset.
- Neither classifier accuracy nor the area under the receiver operating characteristic curve (ROC-AUC) is an informative measure of the performance of interaction prediction. The area under the precision-recall curve (PR-AUC) is a fairer assessment of performance. In some cases, even standard measures can lead to selecting a more biased classifier because the effect of connectance is strong. The amount of correction to apply to the training dataset depends on network connectance, on the measure to be optimized, and only weakly on the classifier.
- These results reveal that training machines to predict networks is a challenging task, and that in virtually all cases, the composition of the training set needs to be fine-tuned before performing the actual training. We discuss these consequences in the context of the low volume of data.
Species interactions, forming ecological networks, are a backbone for key ecological and evolutionary processes; yet, enumerating all of the interactions between species is a daunting task, as it scales with S², that is, the squared species richness (Martinez, 1992). Recent contributions to the field of ecological network prediction (Becker et al., 2022; Pichler et al., 2020; Strydom et al., 2021) highlight that although interactions can be predicted by adding ecologically relevant information (in the form of e.g. traits), we do not have robust guidelines as to how the predictive ability of models recommending species interactions should be evaluated, nor about how these models should be trained. Here, by relying on simple derivations and a series of simulations, we formulate a number of such guidelines, specifically for the case of binary classifiers derived from thresholded values. Specifically, we investigate models in terms of their skill (ability to make the right prediction), bias (tendency to systematically over-predict one class) and class imbalance (the relative number of cases representing interactions), and show how these effects interact. We conclude that models with the best interaction-scale predictive score do not necessarily result in the most accurate representation of the true network.
The prediction of ecological interactions shares conceptual and methodological issues with two fields in biology: species distribution modelling (SDMs) and genomics. SDMs suffer from issues that also affect interaction prediction, namely low prevalence (due to the sparsity of observations/interactions) and data aggregation (due to biases in sampling some locations/species). An important challenge lies in the fact that the best measure with which to quantify the performance of a model is not necessarily a point of consensus (these measures, their interpretation and the way they are computed are covered in depth in the next section). In previous work, Allouche et al. (2006) suggested that Cohen's agreement score (κ hereafter) was a better test of model performance than the true skill statistic (TSS; which we refer to as Youden's informedness hereafter); these conclusions were later criticized by Somodi et al. (2017), who emphasized that informedness is affected both by prevalence and bias. Although this work offers recommendations about the comparison of models, it does not establish baselines or good practices for training on imbalanced ecological data, or ways to remedy the imbalance. Steen et al. (2021) show that, when applying spatial thinning (artificially re-balancing observation data in space to avoid artefacts due to auto-correlation), the best approach to train ML-based SDMs varies according to the balancing of the dataset and the evaluation measures used; there is no single ‘recipe’ that is guaranteed to give the best model. By contrast to networks, SDMs have the advantage of being able both to thin datasets to remove some of the sampling bias (e.g. Inman et al., 2021) and to create pseudo-absences to inflate the number of supposed negatives in the dataset (e.g. Iturbide et al., 2015). These powerful ways to remove data bias often have no analogue in networks, removing one potential tool from our methodological toolkit, and making the task of network prediction through classification potentially more demanding and more prone to underlying data biases.
An immense body of research on machine learning applications to the life sciences is focused on genomics (which has very specific challenges: see a recent discussion by Whalen et al., 2021); this sub-field has generated recommendations that do not necessarily match the current best practices for SDMs, and therefore hint at the importance of domain-specific guidelines. Chicco and Jurman (2020) suggest using the Matthews correlation coefficient (MCC) over F1, as a protection against over-inflation of predicted results; Delgado and Tibau (2019) advocate against the use of Cohen's κ, again in favour of MCC, as the relative nature of κ means that a worse classifier can be picked over a better one; similarly, Boughorbel et al. (2017) recommend MCC over other measures of performance for imbalanced data, as it has more desirable statistical properties. More recently, Chicco et al. (2021) temper the apparent supremacy of the MCC by suggesting that it should be replaced by Youden's informedness (also known as Youden's J, bookmaker's accuracy and the true skill statistic) when the imbalance in the dataset may not be representative of the actual imbalance. In a way, the measures themselves need not be a strong focus for network prediction, as they are routinely used in other fields; the discipline-specific question we seek to address is: ‘Which metric should be employed when predicting networks, and how can it be optimized?’
Species interaction networks are often undersampled (Jordano, 2016a, 2016b), and this undersampling is structured taxonomically (Beauchesne et al., 2016), structurally (de Aguiar et al., 2019) and spatially (Poisot, Bergeron, et al., 2021; Wood et al., 2015). As a consequence, networks suffer from data deficiencies both within and between datasets. This implies that the comparison of classifiers across space, when undersampling varies locally (see e.g. McLeod et al., 2021), is non-trivial. Furthermore, the baseline value of classifier performance measures under various conditions of skill, bias and prevalence has to be identified to allow researchers to evaluate whether their interaction prediction model is indeed learning. Taken together, these considerations highlight three specific issues for ecological networks. First, what values of performance measures are indicative of a classifier with no skill? This is particularly important, as it can tell us whether low prevalence can lull us into a false sense of predictive accuracy. Second, independently of the question of model evaluation, is low prevalence an issue for training or testing, and can we remedy it? Finally, because the low amount of data on interactions makes a lot of imbalance correction methods (see e.g. Branco et al., 2015) hard to apply, which measures of model performance can be optimized while sacrificing the least amount of positive interaction data?
A preliminary question is to examine the baseline performance of these measures, that is, the values they would take on hypothetical networks for a classifier that has no skill. It may sound counter-intuitive to care so deeply about how good a classifier with no skill is, as, by definition, it has no skill. The necessity of this exercise has its roots in the paradox of accuracy: when the desired class (‘two species interact’) is rare, a model can become less ecologically useful by only ever predicting the opposite class (‘these two species do not interact’) and yet see its accuracy increase; because most of the guesses have ‘these two species do not interact’ as the correct answer, a model that never predicts interactions would be right an overwhelming majority of the time; it would also be utterly useless. Herein lies the core challenge of predicting species interactions: the extreme imbalance between classes makes the training of predictive models difficult, and their validation even more so, as we do not reliably know which negatives are true. The connectance (the proportion of realized interactions, usually the number of interactions divided by the number of species pairs) of empirical networks is usually well under 20%, with larger networks having a lower connectance (MacDonald et al., 2020), and therefore being increasingly difficult to predict.
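As a worked illustration of this paradox (the connectance value is chosen for the example, within the empirical range just mentioned): for a network with connectance ρ = 0.1, a classifier that never predicts an interaction is correct on every non-interacting pair and wrong on every interacting pair, so that

$$\text{accuracy} = \frac{(1-\rho)\,S^2}{S^2} = 1 - \rho = 0.9,$$

despite the model recovering none of the interactions.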
1 A PRIMER ON BINARY CLASSIFIER EVALUATION
Many binary classifiers are built on top of a regressor (whose task is to guess the value of the interaction, and which can therefore return a value treated as a pseudo-probability); in this case, the optimal value below which predictions are assumed to be negative (i.e. the interaction does not exist) can be determined by picking a threshold maximizing some value on the ROC or the PR curve. The areas under these curves (ROC-AUC and PR-AUC henceforth) give an indication of the overall quality of the classifier, and the ideal threshold is the point on these curves that minimizes the trade-off they represent. Saito and Rehmsmeier (2015) established that the ROC-AUC is biased towards over-estimating performance for imbalanced data; on the contrary, the PR-AUC is able to identify classifiers that are less able to detect positive interactions correctly, with the additional advantage of having a baseline value equal to prevalence. Therefore, it is important to assess whether these two measures return different results when applied to ecological network prediction. The ROC curve is defined by the false positive rate on the x axis and the true positive rate on the y axis, and the PR curve is defined by the true positive rate on the x axis and the positive predictive value on the y axis.
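In terms of the entries of the confusion matrix (true positives tp, false positives fp, true negatives tn, false negatives fn), these axes are the usual rates:

$$\mathrm{TPR} = \frac{tp}{tp + fn}, \qquad \mathrm{FPR} = \frac{fp}{fp + tn}, \qquad \mathrm{PPV} = \frac{tp}{tp + fp},$$

so that the ROC curve plots TPR against FPR, and the PR curve plots PPV (precision) against TPR (recall).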
There is an immense diversity of measures to evaluate the performance of classification tasks (Ferri et al., 2009). Here, we will focus on five of them with high relevance for imbalanced learning (He & Ma, 2013). The choice of metrics with relevance to class-imbalanced problems is fundamental because, as Japkowicz (2013) unambiguously concluded, ‘relatively robust procedures used for unskewed data can break down miserably when the data is skewed’. Following Japkowicz (2013), we focus on two ranking metrics (the areas under the receiver operating characteristic and precision-recall curves) and three threshold metrics (κ, informedness and MCC); we will briefly discuss F1, but show early on that it has undesirable properties.
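For reference, these threshold metrics can all be computed from the four entries of the confusion matrix. The snippet below is a minimal Julia sketch of the standard formulas (it is not the archived analysis code, and it omits guards against division by zero in degenerate cases):

```julia
# Threshold metrics computed from the entries of a 2x2 confusion matrix
# (tp, tn, fp, fn); standard definitions, shown as a minimal reference.

accuracy(tp, tn, fp, fn) = (tp + tn) / (tp + tn + fp + fn)

f1score(tp, tn, fp, fn) = 2tp / (2tp + fp + fn)

# Youden's informedness (true skill statistic): TPR + TNR - 1
informedness(tp, tn, fp, fn) = tp / (tp + fn) + tn / (tn + fp) - 1

# Cohen's kappa: observed agreement corrected for chance agreement
function kappa(tp, tn, fp, fn)
    n = tp + tn + fp + fn
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
    return (po - pe) / (1 - pe)
end

# Matthews correlation coefficient
function mcc(tp, tn, fp, fn)
    num = tp * tn - fp * fn
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den
end
```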
2 BASELINE VALUES FOR THE THRESHOLD METRICS
In this section, we will assume a network with connectance equal to a scalar ρ, that is, having ρS² interactions (where S is the species richness) and (1 − ρ)S² non-interactions. Therefore, the vector describing the true state of the network (assumed to be an unweighted, directed network) is the column vector [ρ, 1 − ρ]ᵀ (we can safely drop the S² terms, as we will work on the confusion matrix, which ends up expressing relative values). We will apply skill and bias to this matrix and measure how a selection of performance metrics respond to changes in these values, in order to assess their suitability for model evaluation.
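As a point of reference for what follows (a sketch; the full parameterization with skill and bias is developed in the next subsection), the expected confusion matrix of a no-skill, no-bias classifier is the outer product of this vector with itself, whereas a perfect classifier concentrates all density on the diagonal:

$$C_0 = \begin{bmatrix} \rho^2 & \rho(1-\rho) \\ \rho(1-\rho) & (1-\rho)^2 \end{bmatrix}, \qquad C_1 = \begin{bmatrix} \rho & 0 \\ 0 & 1-\rho \end{bmatrix},$$

with rows giving the true state (interaction, no interaction) and columns the prediction.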
2.1 Confusion matrix with skill and bias
2.2 What are the baseline values of performance measures?
In this section, we will change the values of skill, bias and connectance, and report how the main measures discussed in the introduction (MCC, κ, F1 and informedness) respond. Before we do so, it is important to explain why we will not focus on accuracy too much. Accuracy is the number of correct predictions (tp + tn) divided by the sum of the confusion matrix. For a no-skill, no-bias classifier, accuracy is equal to ρ² + (1 − ρ)²; for ρ = 0.05, this is ≈0.90, and for ρ = 0.01, this is ≈0.98. In other words, the values of accuracy are high enough to be uninformative (for a small ρ, accuracy is approximately 1 − 2ρ). More concerning is the fact that introducing bias changes the response of accuracy in unexpected ways. Assuming a no-skill classifier, the numerator of accuracy (tp + tn) increases when the bias decreases, which specifically means that, at equal skill, a classifier that underpredicts interactions will have a higher accuracy than an unbiased classifier (because the value of accuracy is dominated by the size of tn, which increases). These issues are absent from balanced accuracy, but should nevertheless lead us to not report accuracy as the primary measure of network prediction success; moving forward, we will focus on other measures.
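The baseline values above can be checked in a couple of lines (a sketch, with ρ the class balance):

```julia
# Expected accuracy of a no-skill, no-bias classifier: correct positives (ρ²)
# plus correct negatives ((1-ρ)²), as derived above.
baseline_accuracy(ρ) = ρ^2 + (1 - ρ)^2

baseline_accuracy(0.05)  # ≈ 0.905
baseline_accuracy(0.01)  # ≈ 0.980
```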
In order to examine how MCC, κ, F1 and informedness change with respect to imbalance, skill and bias, we performed a grid exploration of the values of bias and skill, taken on a logit rather than a linear scale so that the extreme values correspond to a bias or skill that is essentially 0 or essentially 1—this choice was motivated by the fact that most responses are nonlinear with regards to bias and skill. The values of connectance ρ were taken linearly within the range of connectance observed for species interaction networks. Note that at this point, there is no network model to speak of; the confusion matrix we discuss can be obtained for any classification task. Based on the previous discussion, the desirable properties for a measure of classifier success are: an increase with classifier skill, especially at low bias; a hump-shaped response to bias, especially at high skill, and ideally centred around the point where the classifier is unbiased; and an increase with prevalence up until equiprevalence is reached.
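The sketch below reproduces the spirit of this grid exploration under an assumed parameterization (it is not the exact formulation of the previous subsection): skill s interpolates between the no-skill outer-product matrix and the perfect classifier, and bias b reweights each row of the confusion matrix towards predicted interactions (b = 0.5 being unbiased), before MCC and informedness are computed.

```julia
# Illustrative confusion matrix with connectance ρ, skill s ∈ [0, 1] and bias
# b ∈ (0, 1). This is an assumed parameterization for the sake of the example,
# not the article's exact formulation. Rows are the true state (+, −), columns
# are the prediction (+, −).
function confusion(ρ, s, b)
    q = 1 - ρ
    C0 = [ρ^2 ρ*q; ρ*q q^2]   # no-skill expectation (outer product of marginals)
    C1 = [ρ 0.0; 0.0 q]       # perfect classifier
    C = s .* C1 .+ (1 - s) .* C0
    B = copy(C)
    B[:, 1] .*= 2b            # over-predict interactions when b > 0.5
    B[:, 2] .*= 2(1 - b)
    B .*= sum(C; dims=2) ./ sum(B; dims=2)   # keep the true marginals (row sums)
    return B
end

mcc(tp, tn, fp, fn) = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
informedness(tp, tn, fp, fn) = tp / (tp + fn) + tn / (tn + fp) - 1

for s in (0.0, 0.5, 0.9), b in 0.1:0.2:0.9
    C = confusion(0.1, s, b)
    tp, fn, fp, tn = C[1, 1], C[1, 2], C[2, 1], C[2, 2]
    println((s=s, b=b, MCC=round(mcc(tp, tn, fp, fn); digits=2),
             J=round(informedness(tp, tn, fp, fn); digits=2)))
end
```

Under this parameterization, both MCC and informedness are exactly 0 when s = 0 regardless of the bias, which is the property required of a useful measure.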
In Figure 1, we show that none of the four measures satisfy all these considerations at once. F1 increases with skill, but also increases monotonically with bias; this is because F1 does not account for true negatives, and the increase in positive detection masks the over-prediction of interactions. Informedness varies with skill, reaching 0 for a no-skill classifier, but is entirely insensitive to bias. Both MCC and κ behave in the same way, in that they increase with skill. κ peaks at increasing values of bias for increasing skill; that is, it is likely to lead to the selection of a classifier that over-predicts interactions. By contrast, MCC peaks at the same value of bias regardless of skill, but this value does not correspond to an unbiased classifier: except at very high classifier skill, MCC risks leading to a model that over-predicts interactions. In Figure 2, we show that all measures except F1 give a value of 0 for a no-skill classifier, and are forced towards their correct maximal value as skill changes (i.e. a more connected network will have higher values for a skilled classifier, and lower values for a classifier making mostly mistakes).


These two analyses point to the following recommendations. MCC is indeed more appropriate than κ: although it is sensitive to bias, it is sensitive in a consistent way. Informedness is appropriate at discriminating between different skills, but is confounded by bias. As both of these measures bring valuable information on model behaviour, we will retain them for the following analyses. F1 increases with bias and should not be prioritized to evaluate model performance. The discussion of sensitivity to bias should come with a domain-specific caveat: although it is likely that interactions documented in ecological networks are correct, a lot of non-interactions are simply unobserved; as predictive models are used for data inflation (i.e. the prediction of new interactions), it is not necessarily a bad thing in practice to select models that predict more interactions than the original dataset contains, because the original dataset misses some interactions. Furthermore, the weight of positive interactions could be adjusted if some information about the extent of undersampling exists (e.g. Branco et al., 2015). In a recent large-scale imputation of interactions in the mammal–virus network, for example, Poisot, Ouellet, et al. (2021) estimated that 93% of interactions are yet to be documented.
3 NUMERICAL EXPERIMENTS ON TRAINING STRATEGY
The dataset used for numerical experiments is composed of a grid of 35 values of connectance (from 0.011 to 0.5) and 35 values of training set balance (from 0.02 to 0.98); for each pair of values, 500 networks are generated and predicted. For each network, we train four machines: a trait-based k-NN (e.g. Desjardins-Proulx et al., 2017), a regression tree, a regression random forest and a boosted regression tree; the latter three methods are turned into classifiers using thresholding, which oftentimes provides better results than classification when faced with class imbalance (Hong et al., 2016). Following results from Pichler et al. (2020), linear models have not been considered (in any case, the relationship in the simulated networks is nonlinear). The point of these numerical experiments is not to recommend the best model (this is likely problem-specific), but to highlight a series of recommendations that would work for supervised learning tasks. All models were taken from the MLJ.jl package (Blaom et al., 2020; Blaom & Vollmer, 2020) in Julia 1.7 (Bezanson et al., 2017). All machines use the default parameterization; this is an obvious deviation from best practices, as the hyperparameters of any machine require tuning before its application on a real dataset. As we use 612,500 such datasets, this would require over 2 million unique instances of tweaking the hyperparameters, which is prohibitive from a computing time point of view. An important thing to keep in mind is that the problem we simulate has been designed to be simple to solve: we expect all machines with sensible default parameters to fare well—the results presented in the later sections show that this assumption is warranted, and we further checked that the models do not overfit by ensuring that there is never more than 5% of difference between the accuracy on the training and testing sets. All machines return a quantitative prediction, usually (but not necessarily) in [0, 1], which is proportional (but not necessarily linearly so) to the probability of an interaction between the two species. The ROC-AUC and PR-AUC (and therefore the thresholds) can be measured by integrating over the domain of the values returned by each machine, but in order to make the average-based ensemble model more meaningful, all predictions are expressed in [0, 1].
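To make the grid on training set balance concrete: one way to assemble a training sample with a prescribed share of positives is to keep the interactions and subsample the non-interactions. The helper below (`balanced_pairs`) is a hypothetical illustration of this step, not the archived simulation code:

```julia
using Random

# Assemble a training sample of species pairs with a prescribed proportion of
# positives, by keeping all interactions and subsampling non-interactions.
# Illustrative only; the code archived on OSF differs in its details.
function balanced_pairs(A::AbstractMatrix{Bool}, balance::Real; rng=Random.default_rng())
    pos = findall(A)      # indices of interacting pairs
    neg = findall(.!A)    # indices of non-interacting pairs
    n_neg = min(length(neg), round(Int, length(pos) * (1 - balance) / balance))
    keep = shuffle(rng, neg)[1:n_neg]
    return shuffle(rng, vcat(pos, keep))
end

# Example: a random network with connectance ≈ 0.1, re-balanced to 50% positives
A = rand(50, 50) .< 0.1
training_pairs = balanced_pairs(A, 0.5)
```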
In order to pick the best confusion matrix for a given trained machine, we performed a thresholding approach using 500 steps on predictions from the testing set, and picked the threshold that maximized Youden's informedness. During the thresholding step, we measured the area under the receiver operating characteristic (ROC-AUC) and precision-recall (PR-AUC) curves, as measures of overall performance over the range of returned values. We report the ROC-AUC and PR-AUC, as well as a suite of other measures as introduced in the next section, for the best threshold. The ensemble model was generated by summing the predictions of all component models on the testing set (ranged back to [0, 1]) and then put through the same thresholding process. The complete code to run the simulations is available at https://doi.org/10.17605/OSF.IO/JKEWD.
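A minimal sketch of this thresholding step is given below (illustrative only; it uses simple trapezoidal integration and crude guards against division by zero, and is not the archived analysis code):

```julia
# Sweep candidate thresholds over the scores returned for the testing set, keep
# the threshold that maximizes Youden's informedness, and integrate the ROC and
# PR curves with the trapezoid rule.
function threshold_and_auc(scores::AbstractVector{<:Real}, truth::AbstractVector{Bool}; steps=500)
    thresholds = range(extrema(scores)...; length=steps)
    tpr, fpr, ppv, J = Float64[], Float64[], Float64[], Float64[]
    for t in thresholds
        pred = scores .>= t
        tp = sum(pred .& truth); fp = sum(pred .& .!truth)
        fn = sum(.!pred .& truth); tn = sum(.!pred .& .!truth)
        push!(tpr, tp / max(tp + fn, 1))
        push!(fpr, fp / max(fp + tn, 1))
        push!(ppv, tp / max(tp + fp, 1))   # precision; 0 when nothing is predicted
        push!(J, tpr[end] + tn / max(tn + fp, 1) - 1)
    end
    trapezoid(x, y) = abs(sum((x[2:end] .- x[1:(end-1)]) .* (y[2:end] .+ y[1:(end-1)]) ./ 2))
    return (threshold = thresholds[argmax(J)],
            rocauc = trapezoid(fpr, tpr),
            prauc = trapezoid(tpr, ppv))
end

# Usage with made-up scores correlated with the true labels
truth = rand(1000) .< 0.1                   # ≈ 10% positives
scores = 0.6 .* truth .+ 0.4 .* rand(1000)  # noisy pseudo-probabilities
threshold_and_auc(scores, truth)
```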
After the simulations were completed, we removed all runs (i.e. combinations of model, connectance and training set balance) for which at least one of the following conditions was met: the accuracy was 0, the true positive or true negative rates were 0, or the connectance was larger than 0.25. This removes both the obviously failed model runs and the networks that are more densely connected than empirical food webs usually are (and that are therefore less difficult to predict, being less imbalanced; preliminary analyses of data with a connectance larger than 0.3 revealed that all machines reached consistently high performance).
3.1 Effect of training set balance on performance
In Figure 3, we present the response of two ranking measures (PR-AUC and ROC-AUC) and two threshold measures (informedness and MCC) to a grid of 35 values of training set balance and 35 values of connectance, for the four component models as well as the ensemble. ROC-AUC is always high and does not vary with training set balance. On the other hand, PR-AUC shows very strong responses, increasing with training set balance. It is notable here that two classifiers that seemed to be performing well (decision tree and random forest) based on their MCC are not able to reach a high PR-AUC, even at higher connectances. All models reached a higher performance on more connected networks and when using more balanced training sets. In all cases, informedness was extremely high, which is an expected consequence of the fact that this is the value we optimized to determine the cutoff. MCC increased with training set balance, although this increase became less steep with increasing connectance. Three of the models (k-NN, decision tree and random forest) only increased their PR-AUC sharply when the training set was heavily imbalanced towards more interactions. Interestingly, the ensemble almost always outclassed its component models. For larger connectances (less difficult networks to predict, as they are more balanced), MCC and informedness started decreasing when the training set balance got too close to one, suggesting that a training set balance of 0.5 may often be appropriate if these measures are the ones to optimize.

Based on the results presented in Figure 3, it seems that informedness and ROC-AUC are not necessarily able to discriminate between good and bad classifiers (although this result may be an artefact for informedness, as it is the quantity we optimized when thresholding). On the other hand, MCC and PR-AUC show a strong response to training set balance, and may therefore be more useful for model comparison.
3.2 Required amount of positives to get the best performance
The previous results revealed that measures of classification performance respond both to the balance of the training set and to the connectance of the network; from a practical point of view, assembling a training set requires one to withhold positive information, which is very scarce in ecological networks (and typically more valuable than negatives, about which there is always some doubt). For this reason, across all values of connectance, we measured the training set balance that maximized a series of performance measures. When this value is high, the training set needs to skew more positive in order to get a performant model; when this value is close to 0.5, the training set simply needs to be artificially balanced to optimize model performance. These results are presented in Figure 4.

The more ‘optimistic’ measures (ROC-AUC and informedness) required a biasing of the dataset from about 0.4 to 0.75 to be maximized, with the amount of bias required decreasing only slightly with the connectance of the original network. MCC and PR-AUC required values of training set balance from 0.75 to almost 1 to be optimized, which is in line with the results of the previous section, that is, they are more stringent tests of model performance. These results suggest that learning from a dataset with very low connectance is a different task from learning on a more connected network: it becomes increasingly important to capture the mechanisms that make an interaction exist, and therefore having a slightly more biased training dataset can be beneficial. As connectance increases, the need for biased training sets becomes less prominent, as learning the rules for which interactions do not exist starts gaining importance.
When trained at their optimal training set balance, connectance still had a significant impact on the performance of some machines (Figure 5). Notably, the decision tree and k-NN, as well as the random forest to a lesser extent, had low values of PR-AUC. In all cases, the boosted regression tree reached very good predictions (especially for connectances larger than 0.1), and the ensemble was almost always scoring perfectly. This suggests that the component models are biased in different ways, and that the averaging in the ensemble is able to correct these biases. We do not expect this last result to have any generality, and provide below a discussion of a recent example in which an ensemble performed worse than its component models.

4 DOES BETTER CLASSIFICATION ACCURACY RESULT IN MORE REALISTIC NETWORKS?
In this last section, we generate a network using the same model as before, with a connectance of approximately 0.16 (see the ‘Data’ row of Table 1) and a training set balance that, as Figure 4 suggests, is optimal for this range of connectance. The prediction made on the complete dataset is presented in Figure 6.

The trained models were then thresholded (again by optimizing informedness), and their predictions transformed back into networks for analysis; specifically, we measured the connectance, nestedness (Bastolla et al., 2009), modularity (Barber, 2007), asymmetry (Delmas et al., 2018) and Jaccard network dissimilarity (Canard et al., 2014). This process was repeated 250 times, and the results are presented in Table 1. The k-NN model is an interesting instance here: it produces the network that looks the most like the original dataset, despite having the lowest PR-AUC, suggesting it achieves high recall at the cost of low precision. The ensemble was able to reach a very high PR-AUC (and a very high ROC-AUC), which translated into more accurate reconstructions of the structure of the network (with the exception of modularity, which is slightly underestimated). This result bears elaborating. Measures of model performance capture how many of the interactions and non-interactions are correctly identified. As long as these predictions are not perfect, some interactions will be predicted at the ‘wrong’ position in the network; these measures cannot describe the structural effect of these mistakes. On the other hand, measures of network structure can have the same value with interactions that fall at drastically different positions; this is in part because a lot of these measures covary with connectance, and in part because, as long as these values are not 0 or their respective maximum, there is a large number of network configurations that can have the same value. That ROC-AUC is consistently larger than PR-AUC may be a case of this measure masking models that are not, individually, strong predictors (Jeni et al., 2013). In this specific example, the combination of individually ‘adequate’ models resulted in an extremely strong ensemble, suggesting that the correct prediction of interactions (as measured by MCC, informedness, ROC-AUC and PR-AUC) and of network properties is indeed a feasible task under appropriately hyper-parameterized models.
Model | MCC | Inf. | ROC-AUC | PR-AUC | Conn. | Nestedness | Modularity | Asymmetry | Jaccard
---|---|---|---|---|---|---|---|---|---
Decision tree | 0.59 | 0.94 | 0.97 | 0.04 | 0.17 | 0.64 | 0.37 | 0.42 | 0.1
BRT | 0.46 | 0.91 | 0.97 | 0.36 | 0.2 | 0.78 | 0.29 | 0.41 | 0.19
Random Forest | 0.72 | 0.98 | 0.99 | 0.1 | 0.16 | 0.61 | 0.38 | 0.42 | 0.06
k-NN | 0.71 | 0.98 | 0.99 | 0.02 | 0.16 | 0.61 | 0.39 | 0.42 | 0.06
Ensemble | 0.74 | 0.98 | 1.0 | 0.79 | 0.16 | 0.61 | 0.38 | 0.42 | 0.06
Data | — | — | — | — | 0.16 | 0.56 | 0.41 | 0.42 | 0.0
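As a companion to Table 1, two of the simplest structural comparisons (the connectance of a predicted network, and the Jaccard dissimilarity between the interaction sets of the predicted and true networks, assumed here to be the share of non-shared interactions) can be sketched as follows; nestedness, modularity and asymmetry require dedicated implementations and are omitted:

```julia
# Connectance of a binary network, and Jaccard dissimilarity between the sets
# of interactions of a predicted and a true network (0 when they are identical).
# Minimal sketch; both networks are binary adjacency matrices of the same size.
connectance(A::AbstractMatrix{Bool}) = sum(A) / length(A)

function jaccard_dissimilarity(A::AbstractMatrix{Bool}, B::AbstractMatrix{Bool})
    shared = sum(A .& B)
    total = sum(A .| B)
    return 1 - shared / total
end

# Example with a true network and an imperfect prediction
truth = rand(30, 30) .< 0.16
predicted = xor.(truth, rand(30, 30) .< 0.02)   # flip ≈ 2% of the pairs
connectance(predicted), jaccard_dissimilarity(predicted, truth)
```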
5 GUIDELINES FOR THE ASSESSMENT OF NETWORK PREDICTIVE MODELS
We establish that due to the low prevalence of interactions, even poor classifiers applied to food web data will reach a high accuracy; this is because the measure is dominated by the accidentally correct predictions of negatives. On simulated confusion matrices with ranges of imbalance that are credible for ecological networks, MCC had the most desirable behaviour, and informedness is a linear measure of classifier skill. By performing simulations with four models and an ensemble, we show that informedness and ROC-AUC are consistently high on network data, whereas MCC and PR-AUC are more accurate measures of the effective performance of the classifier. Finally, by measuring the structure of predicted networks, we highlight an interesting paradox: the models with the best performance measures are not necessarily the models with the closest reconstructed network structure. We discuss these results in the context of establishing guidelines for the prediction of ecological interactions.
It is noteworthy that the ensemble model was systematically better than its component models. We do not expect that ensembles will always be better than single models. Networks with different structures than the one we simulated here may respond in different ways, especially if the rules determining interactions are fuzzier than the simple rule used here. In a recent multi-model comparison involving supervised and unsupervised learning, Becker et al. (2022) found that the ensemble was not the best model, and was specifically underperforming compared to models using biological traits. This may be because the dataset of Becker et al. (2022) was known to be undersampled, and so the network alone contained less information than the combination of the network and species traits. There is no general conclusion to draw from either these results or ours, besides reinforcing the need to be pragmatic about which models should be included in the ensemble, and about whether to use an ensemble at all. The surprising performance of the ensemble model should nonetheless form the basis of the first broad recommendation: the optimal training set balance, and the way it interacts with connectance and the specific binary classifier used, is in effect a hyperparameter that should be assessed following the approach outlined in this manuscript. The distribution of results in Figures 4 and 5 shows that there are variations around the trend, and multiple models should probably be trained on their own ‘optimal’ training/testing sets, as opposed to the same ones.
The results presented here highlight an interesting paradox: although the k-NN model was ultimately able to get a correct estimate of network structure (see Table 1; Figure 6), it remains a poor classifier, as evidenced by its low PR-AUC. This suggests that the goals of predicting interactions and predicting networks may not always be attainable in the same way—of course, a perfect classifier of interactions would make a perfect prediction of the network, and indeed the best-scoring predictor of interactions (the ensemble model) had the best prediction of network structure. Nevertheless, predicting the structure of networks and predicting interactions within networks are essentially two different tasks. For some applications (e.g. comparison of network structure across gradients), one may care more about a robust estimate of the structure, at the cost of putting some interactions in the wrong place. For other applications (e.g. identifying pairs of interacting species), one may conversely care more about getting as many pairs as possible right, even though the mistakes accumulate in the form of a slightly worse estimate of network structure. How these two approaches can be reconciled is something to evaluate on a case-by-case basis, especially since there is no guarantee that an ensemble model will always be the most precise one. Despite this apparent tension at the heart of the predictive exercise, we can use the results presented here to suggest a number of guidelines.
First, because we have more trust in reported interactions than in reported absences of interactions (which are overwhelmingly pseudo-absences), we can draw on previous literature to recommend informedness as the measure used to decide on a threshold for binary classification (Chicco et al., 2021); this being said, because informedness is insensitive to bias (although it is a linear measure of skill), overall model performance is better evaluated through the use of MCC (Figures 4 and 5). Because F1 is monotonically sensitive to classifier bias (Figure 1) and to network connectance (Figure 2), MCC should be preferred as a measure of model evaluation and comparison. When dealing with multiple models, we therefore suggest finding the optimal threshold using informedness, and picking the best model using MCC (assuming one does not want to use an ensemble model).
Second, accuracy alone should not be the main measure of model performance; rather, it should be compared against an expectation of how well the model should behave given the class balance in the set on which predictions are made. As derived earlier, the expected accuracy for a no-skill, no-bias classifier is ρ² + (1 − ρ)² (where ρ is the class balance), which will most often be large. This pitfall is notably illustrated in a recent food-web model (Caron et al., 2022) wherein the authors, using a training set with only 100 positive interactions (representing 0.1% of the total), reached a good accuracy. Reporting a good accuracy is not informative, especially when accuracy is not (i) compared to the baseline expected value under the given class balance, and (ii) interpreted alongside a measure that is not sensitive to the chance prediction of many negatives (like MCC).
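For instance, with the class balance reported in that example (0.1% of positives, i.e. ρ = 0.001), the no-skill, no-bias baseline derived above is

$$\rho^2 + (1-\rho)^2 = 0.001^2 + 0.999^2 \approx 0.998,$$

so a reported accuracy needs to exceed this value before it says anything about the skill of the model.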
Third, because the PR-AUC responds more to network connectance (Figure 5) and training set imbalance (Figure 4) than the ROC-AUC, it should be used as a measure of model performance over the ROC-AUC. This is not to say that the ROC-AUC should be discarded (in fact, a low ROC-AUC is undoubtedly a sign of an issue with the model), but its interpretation should be guided by the PR-AUC value. Specifically, a high ROC-AUC is not informative on its own, as it can be associated with a low PR-AUC (see e.g. the random forest in Table 1). This again echoes recommendations from other fields (Jeni et al., 2013; Saito & Rehmsmeier, 2015). We therefore expect to see high ROC-AUC values, and then pick the model that maximizes the PR-AUC value. Taken together with the previous two guidelines, we strongly encourage researchers (i) to ensure that accuracy and ROC-AUC are high (in the case of accuracy, higher than expected under a no-skill, no-bias situation), and (ii) to discuss the performance of the model in terms of the most discriminant measures, that is, PR-AUC and MCC.
Finally, network connectance (i.e. the empirical class imbalance) should inform the composition of the training and testing sets, because it is an ecologically relevant value. In the approach outlined here, we treat the class imbalance of the training set as a hyperparameter, but test the model on a set that has the same class imbalance as the actual dataset. This is an important distinction, as it ensures that the prediction environment matches the testing environment (we cannot manipulate the connectance of the empirical dataset on which the predictions will be made), and so the values measured on the testing set (or the validation set, if the data volume allows one to exist) can be directly compared to the values for the actual prediction. A striking result from Figure 4 is that informedness was almost always maximal at a 50/50 balance (regardless of connectance), whereas MCC required more positives to be maximized as connectance increased, matching the idea that it is a more stringent measure of performance. This has an important consequence for ecological networks, in which the pool of positive cases (interactions) to draw from is typically small: the most parsimonious measure (i.e. the one requiring the fewest interactions to be discarded in order to train the model) will give the best validation potential, and in this light it is very likely informedness (maximizing informedness is, in fact, the generally accepted default for imbalanced classification regardless of the problem domain; Schisterman et al., 2005). This last result further strengthens the view that the amount of bias in the training set is a hyperparameter that must be fine-tuned, as using the wrong bias can lead to models with lower performance; for this reason, it makes sense not to train all models on the same training/testing sets, but rather to optimize the set composition for each of them.
One key element of real-life data that can make the prediction exercise more tractable is that some interactions can safely be assumed to be impossible; indeed, a lot of networks can be reasonably well described using a stochastic block model (e.g. Xie et al., 2017). In ecological networks, this can be due to spatial constraints (Valdovinos, 2019), or to the long-standing knowledge that some links are ‘forbidden’ due to traits (Olesen et al., 2011) or abundances (Canard et al., 2014). Such matching rules (Olito & Fox, 2015; Strona & Veech, 2017) can be incorporated in the model either by adding compatibility traits, or by only training the model on pairs of species that are not likely to form forbidden links. Knowledge of true negative interactions could thus be propagated into training/testing sets, and in this situation it may be possible to use the more usual 70/30 split for training/testing folds, as the need to protect against potential imbalance is lowered. Besides forbidden links, a real-life situation that may arise is that of multi-interaction or multi-layer networks (Pilosof et al., 2017). These can be studied using the same general approach outlined here, either by assuming that pairs of species can interact in more than one way (wherein one would train a model for each type of interaction, based on the relevant predictors), or by assuming that pairs of species can only have one type of interaction (wherein this becomes a multi-label classification problem).
AUTHOR CONTRIBUTIONS
Timothée Poisot designed the study, performed the analyses, and wrote the manuscript.
ACKNOWLEDGEMENTS
The author acknowledges that this study was conducted on land within the traditional unceded territory of the Saint Lawrence Iroquoian, Anishinabewaki, Mohawk, Huron-Wendat and Omà miwininiwak nations. The author thanks Colin J. Carlson, Michael D. Catchen, Giulio Valentino Dalla Riva and Tanya Strydom for inputs on earlier versions of this manuscript. This research was enabled in part by support provided by Calcul Québec (www.calculquebec.ca) through the Narval general purpose cluster. T.P. is supported by the Fondation Courtois, a NSERC Discovery Grant and Discovery Acceleration Supplement, by funding to the Viral Emergence Research Initiative (VERENA) consortium including NSF BII 2021909 and NSF BII 2213854, and by a grant from the Institut de Valorisation des Données (IVADO).
CONFLICT OF INTEREST STATEMENT
The author declares no conflict of interest.
Open Research
PEER REVIEW
The peer review history for this article is available at https://www.webofscience.com/api/gateway/wos/peer-review/10.1111/2041-210X.14071.
DATA AVAILABILITY STATEMENT
The manuscript does not use empirical datasets; the code is accessible through the doi: https://doi.org/10.17605/OSF.IO/JKEWD.