Exploiting the full potential of Bayesian networks in predictive ecology
Abstract
- Although ecological models used to make predictions from underlying covariates have a record of success, they also suffer from limitations. They are typically unable to make predictions when the value of one or more covariates is missing during the testing. Missing values can be estimated but methods are often unreliable and can result in poor accuracy. Similarly, missing values during the training can hinder parameter estimation of many ecological models. Bayesian networks can handle these and other limiting issues, such as having highly correlated covariates. However, they are rarely used to their full potential.
- Indeed, Bayesian networks are commonly used to evaluate the knowledge of experts by constructing the network manually and often (incorrectly) interpreting the resulting network causally. We provide an approach to learn a Bayesian network fully from observed data, without relying on experts and show how to appropriately interpret the resulting network, both to identify how the variables (covariates and target) are interrelated and to answer probabilistic queries.
- We apply this method to the case study of a mountain pine beetle infestation and find that the trained Bayesian network has a predictive accuracy of 0.88 AUC. We classify the covariates as primary and secondary in terms of contributing to the prediction and show that the predictive accuracy does not deteriorate when the secondary covariates are missing and degrades to only 0.76 when one of the primary covariates is missing.
- As a complement to the previous work on constructing Bayesian networks by hand, we show that if instead, both the structure and parameters are learned only from data, we can achieve more accurate predictions as well as generate new insights about the underlying processes.
1 INTRODUCTION
Predictions are essential in aquatic and terrestrial ecology, whether the focus lies in changes in ecosystem composition, structure and richness to preserve the biodiversity and ecosystem function, or in the spatial distribution of individuals and species to inform conservation and invasive species policies. The field of predictive ecology focuses on how to make such predictions, particularly in the context of climate change, and has grown exponentially since the 1990s, given the quality and quantity of available ecological data (Mouquet et al., 2015; Purves et al., 2013). Simple and advanced statistical and machine-learning approaches have been used to this end, and some have reported great success. Commonly applied models include mechanistic equations, individual-based models, GLMs (Aukema et al., 2008; Preisler et al., 2012), generalized additive models, MaxEnt (Merow et al., 2013), decision trees, support vector machines and artificial neural networks (Marmion et al., 2009; Youssef et al., 2016).
These standard models, however, lack some practical features, which calls into question their use as predictors. They are unable to make predictions when the value of a covariate is missing, a typical issue because some covariates are expensive or logistically impossible to collect. Imputing the missing values can be unreliable, as modelling assumptions are needed to ‘guess’ them. The assumptions may even conflict with those posed by the original model using the imputed values. Another approach is to produce a model that does not involve any covariate that is ever missing. This can be problematic as well, because (a) the set of missing covariates is not fixed across the area of interest: the value of a covariate may be missing at location A, but present at location B, and the opposite may hold for another covariate; and (b) even if a covariate is only measured in the laboratory and never in the field, incorporating it in the model can still reveal its effect on the response variable. Most models also cannot reveal the co-effect of more than one covariate on the response variable, and some do not allow for statistical inference. Moreover, those that are used for statistical inference cannot handle correlated covariates.
Bayesian networks (BNs) can deal with these issues. They are directed acyclic graphs, whose nodes are the response variable and covariates, and the links between the nodes show how these nodes are related to each other. Both links from covariate to response and from covariate to covariate are allowed in the network. BNs are graphical, and hence often simpler to understand than complex systems of equations (e.g. Bode et al., 2017; Eklöf et al., 2013; Rish et al., 2009; Troyanskaya et al., 2003), deepening our understanding of natural phenomena as well as allowing for accurate predictions. However, there are two main issues with how BNs are typically applied in practice: (a) they are rarely used to their full potential and (b) they are misinterpreted as causal networks. The common practice of applying BNs is to manually construct the structure (network), based on the knowledge of experts, then either set the parameters manually or learn them from data, and finally, read the links as causal relationships in the resulting BN. Although useful in assessing the qualitative descriptions of an ecological process, this approach relies heavily on our prior understanding of the process, and hence, is only as good as our understanding. If, instead, both the structure and parameters of the BN are learned only from the data, there will be room for more accurate predictions as well as new insights about the process. Moreover, BNs are not causal networks, but essentially a set of conditional (in)dependencies that factorize the joint probability distribution of all of the variables. Causal deductions, hence, may not be made, although some hypotheses may be tested.
We complement previous studies on BNs that used the knowledge of experts (Chen & Pollino, 2012; Marcot et al., 2006) by focusing on learning the structure, and proper model interpretation in the form of conditional probabilistic inferences rather than causal deductions. The goal of this paper was (a) to discuss the advantages of different ecological modelling approaches, and highlight what BNs can offer in this context; (b) to provide a systematic approach for training a BN completely from data, without incorporating the prior knowledge of experts, and then evaluating and interpreting the resulting BN and (c) to apply this method to the case study of a mountain pine beetle (MPB) outbreak.
2 ADVANTAGES OF BAYESIAN NETWORKS
In what follows, we first briefly introduce BNs and then compare them with other modelling approaches in predictive ecology (Table 1). Here, we focus on the ‘typical’ situation with each model; for example, the prediction accuracy of a properly trained BN being typically high does not imply that it is always higher or even as high as other highly accurate models.
Model | Capable of generative learning | Handles missing values at training | Handles missing values at testing | Level of non-linearity that can be handled | Allows for statistical inference and model selection | Tolerates correlated variables in statistical inference | Provides insights on variables' co-effects | Ease of incorporation of prior knowledge | Does not need prior knowledge | Predictive accuracy | Marginalizable |
---|---|---|---|---|---|---|---|---|---|---|---|
Mechanistic equation | • | ● | ● | ● | • | ||||||
Individual-based model | ● | ● | ● | ||||||||
Generalized linear model | ● | • | • | • | |||||||
Generalized additive model | • | ● | • | • | |||||||
MaxEnt | • | • | ● | ||||||||
Decision tree | ● | • | ● | ● | |||||||
Support vector machine (linear) | ● | • | • | ||||||||
Neural network | ● | ● | ● | ||||||||
Bayesian network | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● |
Note
- (empty) typically low; • typically medium; ● typically high.
2.1 Introduction to Bayesian networks
2.2 Generative versus discriminative learning
Consider the response variable Y and set of covariates (features) X that are used to estimate Y. One may pursue either of the two learning tasks with respect to these variables: generative, that is to learn the joint probability distribution P(X, Y), or discriminative, that is to learn the conditional probability P(Y | X). On the one hand, the joint probability distribution P(X, Y) represents the probability of any given assignment to all of the variables X and Y in the data, or loosely speaking, how all the variables are related to each other. On the other hand, the conditional probability P(Y | X) represents the probability of Y happening given X, or in other words, in which cases Y happens. So discriminative learning focuses only on the probability of the response variable whereas generative learning also reveals the probability of the covariates. For example, on the one hand, an ecologist may be interested in two species' co-occurrence, which is a generative question given by the distribution P(S_{1}, S_{2}), where S_{1} and S_{2} are the densities of the two species. On the other hand, the same ecologist may be interested in whether the density S_{2} of the second species (as a response variable) can be estimated using the density S_{1} of the first species (as the covariate), which is a discriminative question, given by P(S_{2} | S_{1}).
Note that knowing the ‘true’ joint distribution P(X, Y) allows knowing the conditional distribution P(Y | X). However, because small errors in estimating P(X, Y), which typically happen in practice, might lead to large errors in the associated values of P(Y | X) (Ng & Jordan, 2002), each learning task deserves its own treatment. Although potentially capable of modelling the joint probability distribution, mechanistic models are not commonly used for this purpose as it would require a great deal of prior knowledge of the process. Roughly speaking, none of the models in Table 1, except for BNs, are effective at generative learning.
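As a minimal sketch of the distinction, the following Python snippet estimates a joint distribution from counts and then derives the conditional distribution from it by normalization. The records and the function names `joint_distribution` and `conditional_from_joint` are hypothetical and serve only to illustrate the two quantities.

```python
from collections import Counter

# Hypothetical presence/absence records (x, y) for two species at eight sites.
records = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

def joint_distribution(pairs):
    """Generative view: estimate P(X = x, Y = y) from relative frequencies."""
    counts = Counter(pairs)
    total = len(pairs)
    return {xy: c / total for xy, c in counts.items()}

def conditional_from_joint(joint, x):
    """Discriminative quantity P(Y | X = x), obtained by normalizing the joint."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

joint = joint_distribution(records)
cond = conditional_from_joint(joint, 1)
print(joint)  # probability of every (x, y) assignment
print(cond)   # probability of y given x = 1
```

With so few records, a small change in any joint entry shifts the derived conditional noticeably, which is the practical concern raised above.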
2.3 Missing data
Datasets often have many instances (observations) where the value of one or more of the covariates and/or the response variable is missing. Missing values can occur both at the time of training and testing of a model.
Should the training dataset contain missing values, most traditional statistical methods such as regressions would use casewise deletion, that is, removing the entire instance (observation) from the dataset if the value of one or more variables is missing (Harrell, 2015). Casewise deletions can lead to bias in the estimated parameters if the probability that a variable's value is missing is correlated with the value itself, for example, when a temperature sensor fails to record values below −10°C. Casewise deletions also result in losing the information provided by the remaining variables in the instance with missing values. Therefore, imputation is often used to estimate the missing values, which can be as simple as using the variable's mean or the variable's value from a similar instance, or can be more complex, such as using the chained equation method (Harrell, 2015). However, in essence, imputation presumes a model for the variables with missing values, which may conflict with the actual model that is going to be trained on the imputed dataset, resulting in a poor predictor. With BNs, methods such as expectation maximization (EM) and structural EM can be used to learn the parameters and structure, without imputation or casewise deletion (Koller & Friedman, 2009).
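The bias introduced by casewise deletion can be illustrated with a short simulation of the temperature-sensor example. This is only a sketch: the Gaussian temperature distribution, its parameters and the sample size are assumptions chosen for illustration, not values from any dataset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical winter temperatures (degrees C); the sensor fails below -10.
true_temps = [random.gauss(-8.0, 5.0) for _ in range(10_000)]
recorded = [t if t >= -10.0 else None for t in true_temps]

# Casewise deletion: drop every instance whose temperature is missing.
kept = [t for t in recorded if t is not None]

true_mean = statistics.fmean(true_temps)
kept_mean = statistics.fmean(kept)
print(f"true mean: {true_mean:.2f}, mean after deletion: {kept_mean:.2f}")
print(f"instances lost: {len(true_temps) - len(kept)}")
```

Because missingness depends on the value itself, the mean of the retained instances is shifted upwards, and any parameter estimated from them inherits that shift.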
Should the testing dataset contain missing values, almost all models fail to make predictions as each covariate has to take some value, that is, they cannot be left with ‘NA’s (not available). Imputation comes with the above-mentioned shortcomings. Another alternative is to use expert knowledge to obtain probable limits for the covariates with missing values, and run the model on those limits to get a probable range for the prediction. For example, in climate change models, the exact pathway that a covariate such as greenhouse gas emissions will follow in the future is unknown. Therefore, models use a series of scenarios ranging from best to worst case in order to predict changes in CO_{2} emissions and temperatures (Pachauri et al., 2015). There is, however, no need for these rough approximations when applying BNs. By marginalizing over the unobserved covariates, BNs can predict the target variable based on any observed subset of the covariates.
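To make the last point concrete, the sketch below performs inference by enumeration in a tiny discrete BN with the assumed structure Y → X1 and Y → X2. All probabilities and the names `P_Y`, `P_X1_GIVEN_Y`, `P_X2_GIVEN_Y` and `predict` are hypothetical. A covariate passed as `None` is treated as unobserved and summed out.

```python
from itertools import product

# Hypothetical CPDs for a tiny BN with structure Y -> X1 and Y -> X2.
P_Y = {0: 0.7, 1: 0.3}
P_X1_GIVEN_Y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # indexed [y][x1]
P_X2_GIVEN_Y = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # indexed [y][x2]

def joint(y, x1, x2):
    """Joint probability given by the BN factorization."""
    return P_Y[y] * P_X1_GIVEN_Y[y][x1] * P_X2_GIVEN_Y[y][x2]

def predict(x1=None, x2=None):
    """Return P(Y = 1 | observed covariates); None marks a missing value."""
    x1_vals = [x1] if x1 is not None else [0, 1]
    x2_vals = [x2] if x2 is not None else [0, 1]
    score = {0: 0.0, 1: 0.0}
    # Sum the joint over every value of each unobserved covariate.
    for y, a, b in product([0, 1], x1_vals, x2_vals):
        score[y] += joint(y, a, b)
    return score[1] / (score[0] + score[1])

print(predict(x1=1, x2=1))  # both covariates observed
print(predict(x1=1))        # X2 missing: marginalized out
print(predict())            # nothing observed: falls back to the prior P(Y = 1)
```

The same function serves every pattern of missing covariates, so no imputation step or scenario range is needed at prediction time.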
2.4 Non-linearity of the relationship between the covariates and response variable
In many real-world situations, the response variable may be related to the covariates in a highly nonlinear manner. Simple models such as linear regressions, however, assume a linear relationship. To capture some levels of non-linearity, GLMs extend the regressions by applying nonlinear functions to the covariates. Other extensions, such as generalized additive models, fit a smooth curve to the data for each covariate, thereby allowing complex nonlinear relationships (Guisan et al., 2002). Another extension is the machine-learning method MaxEnt (Phillips et al., 2006) that is able to fit highly nonlinear response curves and estimate the probability distribution of the response variable using maximum entropy. Likewise, support vector machines classify the covariate space using hyper-planes, and hence, are linear, yet can allow for some non-linearity by first transforming the space using nonlinear kernels (Scholkopf & Smola, 2001). Process-based models can also build in highly complex nonlinear relationships. In all of these cases, the relationships between the response variable and covariates must be entirely described, based on an a priori model, a constraint that is relaxed in some other machine-learning models. For example, classification trees can represent any function over the set of discrete covariates, and this function does not need to be defined beforehand. Note that this may require a very deep classification tree. Moreover, the fact that a classification tree can represent a complex function does not mean it can be learned effectively. Likewise, BNs are flexible in dealing with nonlinear relationships. Over a set of discrete variables, BNs can represent an arbitrary joint probability distribution P(X, Y), which can represent any arbitrary conditional distribution P(Y | X).
2.5 Hypothesis testing, statistical inference and model selection
The objective of hypothesis testing is to make inference through deduction. It consists of devising one or more working hypotheses and challenging them with data for corroboration (Hilborn & Mangel, 1997; Stephens et al., 2005). The hypothesis to test is translated into a mathematical equation and is verified using methods such as least squares and maximum likelihood. So to test a hypothesis, one needs (a) a mathematical equation representing a biological hypothesis and (b) a test statistic with a distribution that can be determined, representing the model accuracy when confronted with data. The complexity of machine-learning models usually prevents us from obtaining a simple equation representing the hypothesis, but this is not the case for BNs.
2.6 Prior knowledge of the processes
Unlike mechanistic models that typically need a comprehensive knowledge of the involved processes to make accurate predictions, phenomenological methods such as traditional statistics and especially machine-learning have more leeway. One does not need to have any knowledge about the ecological process to train and test a support vector machine, or neural network, for example. Although one may argue that the functions used in a neural network or the number of nodes and layers are parameters to be determined beforehand, these too can be selected automatically based on the training data or general rules of thumb. The level of autonomous learning is even higher with BNs. The whole structure and parameters of a discrete BN can be completely learned from data (McCann et al., 2006). The same goes for decision trees.
Although they can be trained autonomously, BNs allow experts to incorporate their knowledge into the network by forcing or preventing links between the nodes and by adding latent variables, that is, unobservable and often abstract variables such as habitat quality. Indeed, the spectrum of autonomous learning for BNs ranges from neither to both of the structure and parameters being set based on experts' knowledge.
2.7 Correlations
Often two or more of the covariates in a process are highly correlated. This hinders statistical inference as the effects of the correlated covariates on the response variable are difficult to separate (Dormann et al., 2013; Stewart, 1987). This would happen if we were building a model, say a logistic regression, with two covariates that are both relevant to the response variable, and also are highly correlated with each other. Thus, typically one of the variables is eliminated beforehand, either randomly, based on ecological relevance, measurement feasibility and proximity to the mechanisms (Dormann et al., 2013; Harrell, 2015), or by using some autonomous technique such as minimum-redundancy maximum-relevance (Peng et al., 2005). However, this prevents understanding the impact of both of the correlated covariates together on the response variable. Process-based models do not suffer from correlation (except for parameter estimability), yet they require the mechanisms to be known a priori (Dormann et al., 2013). In contrast, a BN whose structure is learned from data does not require any prior knowledge, and reveals the differences between the correlated covariates in terms of their probabilistic dependence on other covariates as well as the response variable.
2.8 Predictive accuracy
Despite the complexity of ecological systems (Anand et al., 2010; Levin, 1992), some machine-learning models are reported to make accurate predictions. In contrast, process-based and traditional statistical models are rarely able to reach the same level of accuracy (e.g. Elith et al., 2006). Particularly, process-based models are known for their inability to make good predictions, although this has been challenged by, for example, Håkanson (2004), who presented an accurate mechanistic model for aquatic systems. Within machine-learning models, neural networks are acknowledged for accurate performance in highly complex tasks such as image recognition (Egmont-Petersen et al., 2002). However, this does not mean that neural networks necessarily outperform simpler models in practice. Firstly, finding the optimal number of layers and nodes is not always practical due to limited computational resources. Secondly, proper estimation of the many parameters of a neural network often requires massive data. Hence, while asymptotically effective, neural networks may not be as successful as simple models when the available data are insufficient. Finally, if the system in question is actually simple, then neural networks, especially deep ones, can easily overfit the training data. Simpler models may be a better choice also in this case.
2.9 Marginalizability
The notion of marginalizability addresses the possibility of separately studying how a particular covariate or subset of the covariates informs us about the response variable. We call a model marginalizable if it allows us to compute the probability of the response variable Y given any subset of the covariates X′ ⊆ X; that is, P(Y | X′). Most predictive models allow us to compute P(Y | X), that is, the likelihood of the response given all of the covariates. However, only those that perform a generative task, that is, learning P(X, Y), allow us to marginalize the likelihood over the variables X \ X′, to obtain the likelihood conditioned on only those variables that we are interested in: P(Y | X′). Therefore, only BNs and those mechanistic models developed to formulate the joint probability are marginalizable.
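For discrete variables, the marginalization can be written out explicitly. In the sketch below, X′ denotes the observed subset of the covariates and X″ the remaining, unobserved covariates; these symbols are our notation for this illustration.

```latex
P(Y \mid X') \;=\;
\frac{\sum_{x''} P(Y,\, X',\, X'' = x'')}
     {\sum_{y} \sum_{x''} P(Y = y,\, X',\, X'' = x'')},
\qquad X'' = X \setminus X'
```

The numerator is the joint probability of the response and the observed covariates, and the denominator is the probability of the observed covariates alone, so both follow directly from the learned joint distribution.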
3 LEARNING BAYESIAN NETWORKS FROM DATA
We explain, step by step, how to learn and then use a BN to make predictions and acquire biological insights. Most steps are general enough to be applied by any statistical/machine-learning method in the context of model selection or prediction making.
3.1 Setup
Ecological processes are typically modelled by a response variable Y and a set of covariates X. If the process is spatial and temporal, then each instance (observation) of the process has a unique pair of identities: (a) the time t of the instance, the unit of which indicates the frequency of the observations, for example, a year, month or day, and (b) a general index g, roughly to distinguish the instances location-wise. For example, if the process of interest is Cyanobacteria bloom in lakes, then g indicates the label of the lakes. If the interest is in the spread of an infestation over a given area, then we may divide the area into r × r squares for, say, r = 1 km, and label them by g = 1, 2, …. We may exclude time when modelling a stationary quantity, for example, the joint distribution of several species in a specific area. Similarly, we may exclude the index g, if all instances are taken from the same location, for example, from the same lake. Also, note that time and especially the index g are not necessarily two covariates of the process. Indeed, time must be excluded from the set of covariates if the goal is to obtain a model that can be applied to times different from those in the available data, for example, to predict the future (see Supporting Information). Similarly, the index g may be excluded; however, one must acknowledge the possible performance loss when applying the model to areas far away from the training area, with dramatically different geographic features.
For illustration purposes, in what follows, we consider a spatial and temporal process. For each index g and time t, let X_{g,t} denote the set of covariates and Y_{g,t} denote the response variable (Table 2). Although the response variable can be continuous or integer, in order to use acknowledged performance measures such as AUC (Section 3.5), we restrict it to be binary. For example, given an index g and time t, the response variable may represent the presence, Y_{g,t} = 1, or absence, Y_{g,t} = 0, of infestation or a species of interest. The covariates can be correlated with each other and may include variables that are not known a priori to contribute to the response variable. Our goal is to estimate (learn) the joint probability distribution P(X_{g,t}, Y_{g,t}) using available data.
Notation | Variable |
---|---|
t | Time |
g | General index |
Y_{g,t} | Response variable |
X_{g,t} | Set of covariates |
P(·) | Probability function |
3.2 Step 1: Data discretization
The random variables in a BN can be either continuous or categorical. However, if they are continuous, we must predetermine their distributional forms, for example, a Gaussian distribution. To avoid making such assumptions, we use discrete BNs where every variable is categorical. We discretize all continuous variables by considering various numbers of intervals, or discretization levels k, and using data to determine which number leads to a higher performance score. If a continuous variable's range does not have evident thresholds in terms of the biological context, we use Hartemink's information-preserving algorithm (Hartemink, 2001) to quantize the values in a way that maximizes the mutual information shared by the variables (Cover & Thomas, 2012).
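As an illustration of discretizing a continuous variable into k levels, the sketch below uses plain equal-frequency binning. This is a deliberately simpler stand-in, not Hartemink's information-preserving algorithm, and the function names and temperature values are hypothetical.

```python
def equal_frequency_cutpoints(values, k):
    """Return k - 1 cut points splitting the sorted values into k roughly equal groups."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def discretize(values, cutpoints):
    """Map each value to a level 0..k-1: the number of cut points it reaches or exceeds."""
    return [sum(v >= c for c in cutpoints) for v in values]

# Hypothetical temperature readings, discretized into k = 3 levels.
temps = [3.1, 7.4, 12.0, 15.2, 18.9, 21.3, 24.8, 27.5, 30.1]
cuts = equal_frequency_cutpoints(temps, 3)
levels = discretize(temps, cuts)
print(cuts)    # [15.2, 24.8]
print(levels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Repeating the call for several values of k yields the family of discretized datasets that the later steps compare.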
3.3 Step 2: Partitioning the dataset into train and test
The typical machine-learning approach to learn, then evaluate a model, is to randomly partition the dataset into two subsets, train and test, where the greater portion (train) is used to estimate the model, and the remaining portion (test) to evaluate the trained model. However, evaluation concerns are raised if the instances of the original data are randomly partitioned into train and test. Indeed, using this method, the train and the test datasets are extremely similar (see Supporting Information). For each instance of the test dataset, it is highly likely to have a matching instance in the train dataset due to correlations in time and space. The purpose of a test dataset is to simulate how the model performs when applied in practice to a new dataset. If the goal is to make predictions in the future, say next month, we set the test dataset to be the data from the final observations (instances) and let the train dataset be the remaining instances. Namely, we make the train and test datasets time-wise disjoint.
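A time-wise disjoint partition can be sketched as follows, assuming each instance is a dictionary carrying its time under the key "t". The function name `split_by_time`, the key names and the example records are hypothetical.

```python
def split_by_time(instances, n_test_periods=1):
    """Hold out the latest time periods as the test set; train on everything earlier."""
    times = sorted({inst["t"] for inst in instances})
    test_times = set(times[-n_test_periods:])
    train = [inst for inst in instances if inst["t"] not in test_times]
    test = [inst for inst in instances if inst["t"] in test_times]
    return train, test

# Hypothetical instances: location index g, year t and binary response y.
data = [
    {"g": 1, "t": 2010, "y": 0}, {"g": 2, "t": 2010, "y": 1},
    {"g": 1, "t": 2011, "y": 1}, {"g": 2, "t": 2011, "y": 1},
    {"g": 1, "t": 2012, "y": 0}, {"g": 2, "t": 2012, "y": 1},
]
train, test = split_by_time(data)
print(len(train), len(test))  # 4 2
```

Every training instance precedes every test instance in time, so the evaluation mimics predicting a period that the model has never seen.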
3.4 Step 3: Learning
3.4.1 Step 3.1: Learning the BN structure
For each of the k-level quantized training datasets, we find the structure that results in the lowest BIC or the lowest AIC. Although this can be done by performing an exhaustive search on all possible BN structures (that is, directed acyclic graphs with the response variable and covariates as the node-set), we instead use efficient algorithms, for example, that of Silander and Myllymaki (2012), which is implemented in the R package bnstruct (Franzin et al., 2017). Both BIC and AIC criteria penalize having more parameters, which reduces the chance of overfitting to the training dataset. The choice of BIC or AIC depends on the main goal of the study, the model complexity and the number of instances relative to the number of parameters (Aho et al., 2014).
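To show how such a score trades fit against complexity, the sketch below computes the BIC of a discrete BN structure in the "lower is better" form, BIC = −2 ln L + p ln N, where L is the maximized likelihood, p the number of free parameters and N the number of instances. The function `bic`, its arguments and the toy data are hypothetical, and software packages may use a differently scaled or signed version of the score.

```python
import math
from collections import Counter

def bic(data, parents, levels):
    """BIC = -2 * log-likelihood + (#free parameters) * ln(N); lower is better here."""
    n = len(data)
    log_lik = 0.0
    n_params = 0
    for node, pa in parents.items():
        # Counts of (parent configuration, node value) and of parent configuration alone.
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        # Maximum-likelihood contribution of this node to the log-likelihood.
        log_lik += sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
        # (levels - 1) free probabilities for each parent configuration.
        n_params += (levels[node] - 1) * math.prod(levels[p] for p in pa)
    return -2.0 * log_lik + n_params * math.log(n)

# Hypothetical data in which X and Y are strongly associated.
rows = ([{"X": 0, "Y": 0}] * 18 + [{"X": 0, "Y": 1}] * 2
        + [{"X": 1, "Y": 0}] * 2 + [{"X": 1, "Y": 1}] * 18)
levels = {"X": 2, "Y": 2}
bic_empty = bic(rows, {"X": [], "Y": []}, levels)    # no link between X and Y
bic_edge = bic(rows, {"X": [], "Y": ["X"]}, levels)  # link X -> Y
print(bic_empty, bic_edge)
```

On these data the structure with the link scores lower despite its extra parameter, because the gain in likelihood outweighs the penalty.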
Note that this approach is computationally infeasible if there are too many variables, for example, more than 25, or too many discretization levels. Then one may, instead, either use a fixed (a priori known) BN structure, for example, naive Bayes, or learn a ‘close-to-optimal’ (a priori unknown) BN on the training dataset using acknowledged searching algorithms (Table 3, Figure 1). We learn the structure of the a priori unknown networks by the bnlearn package in R (Scutari, 2009). The input to each algorithm is the variables and the corresponding training dataset, and the output is a BN structure whose nodes are the variables. In case the learned structure contains undirected links, we randomly assign directions as long as directed cycles and v-structures do not appear. This is because BNs must not contain cycles by definition, and the introduction of v-structures can change the performance of the resulting BN (Koller & Friedman, 2009). So for each discretization level k, we obtain a BN structure according to one of the algorithms or fixed structures in Table 3.
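The growth of the search space can be made tangible by counting the labelled directed acyclic graphs on n nodes with Robinson's recurrence. The sketch below is only meant to illustrate why exhaustive enumeration quickly becomes impractical; the function name `count_dags` is ours.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in (3, 4, 5, 10):
    print(n, count_dags(n))  # 3 -> 25, 4 -> 543, 5 -> 29281, then far larger
```

The count grows super-exponentially in the number of variables, which is why exact methods are limited to modest numbers of nodes and heuristic searches are used beyond that.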
Abbreviated name | Based on the algorithm/structure | Type of the algorithm | Description |
---|---|---|---|
GS | Grow shrink | Constraint based | Uses conditional independence tests on the training dataset to detect the Markov blankets of the variables (Margaritis & Thrun, 1999) |
IAMB | Incremental association Markov blanket | Constraint based | Detects Markov blankets with an attempt to avoid false positives, that is, false infestation predictions (Tsamardinos et al., 2003) |
IIAMB | Interleaved incremental association Markov blanket | Constraint based | A variant of IAMB to maintain the size of the Markov blanket as small as possible (Tsamardinos et al., 2003) |
HC | Hill climbing | Local search | Starts from a random directed graph and adds or removes an edge only if it results in a higher score (BIC in our case) on the train dataset (Margaritis, 2003) |
CL | Chow-Liu | Global search | Finds the undirected spanning tree of the variables to minimize the Kullback–Leibler distance from the actual distribution (Chow & Liu, 1968; Figure 1) |
NB | Naive Bayes | - | The most basic yet often successful BN formed by the response variable (Y_{g,t} in our case), linking to all of the covariates (Koller & Friedman, 2009; Figure 1) |
TAN | Tree-augmented naive Bayes | - | A NB network with a spanning tree among the covariates that can be learned from the train dataset (Friedman et al., 1997; Figure 1) |
3.4.2 Step 3.2: Learning the BN parameters
After finding the highest-scoring BN structure for each of the k-level quantized training datasets, we learn the associated conditional probability distribution (CPD) parameters on the same training dataset and denote the resulting BN by BN_{k}. We use the Bayesian parameter estimation approach (Koller & Friedman, 2009), implemented in bnlearn. As a result, for each quantization level k, we obtain a BN that best fits the training data in terms of BIC, AIC or other constraints listed in Table 3.
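A minimal sketch of Bayesian parameter estimation for one node of a discrete BN is given below. It assumes a uniform Dirichlet prior whose total weight, an imaginary sample size `iss`, is spread evenly over the cells of the conditional probability table, and returns the posterior mean. The function `bayesian_cpd`, its arguments and the data are hypothetical and are not claimed to reproduce any particular package's implementation.

```python
from collections import Counter
from itertools import product

def bayesian_cpd(data, node, parents, levels, iss=1.0):
    """Posterior-mean CPD under a uniform Dirichlet prior with imaginary sample size iss."""
    parent_configs = list(product(*(range(levels[p]) for p in parents)))
    # The prior mass is spread evenly over every cell of the table.
    alpha = iss / (levels[node] * len(parent_configs))
    counts = Counter((tuple(row[p] for p in parents), row[node]) for row in data)
    cpd = {}
    for cfg in parent_configs:
        total = sum(counts[(cfg, v)] for v in range(levels[node]))
        cpd[cfg] = [
            (counts[(cfg, v)] + alpha) / (total + alpha * levels[node])
            for v in range(levels[node])
        ]
    return cpd

# Hypothetical instances for the link X -> Y; Y = 0 never co-occurs with X = 1.
data = [{"X": 0, "Y": 0}, {"X": 0, "Y": 0}, {"X": 0, "Y": 1},
        {"X": 1, "Y": 1}, {"X": 1, "Y": 1}]
cpd = bayesian_cpd(data, "Y", ["X"], {"X": 2, "Y": 2}, iss=1.0)
print(cpd)  # the unseen combination (X = 1, Y = 0) still gets non-zero probability
```

Unlike the maximum-likelihood estimate, the posterior mean never assigns zero probability to a combination that happens to be absent from the training data.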
3.5 Step 4: Evaluation
How do we choose among the different networks BN_{k} from the previous step? Namely, what number of discretization levels results in ‘the best’ BN? We cannot compare them directly using a performance measure that involves the likelihood of the data, for example, log-likelihood, AIC and BIC, because the networks BN_{k} do not use the same data but different discretized versions of it.
However, all BNs use the same number of discretization levels for the response variable. So we can compare them based on how well they predict the response variable on the test dataset. Each network allows us to compute P(Y_{g,t} | X_{g,t}), that is, the chances of the observed response variable given the covariates, for every instance in the dataset. Correspondingly, we compare the area under receiver operating characteristic curve (AUROC or simply AUC; Bradley, 1997; Metz, 1978) score of the BNs on the test dataset (see Supporting Information). The choice of AUC is to make our results comparable with the huge body of literature using this performance score as the final performance of a classifier. For each discretization level k, we calculate the AUC score of BN_{k} and pick the highest-scoring one as our final BN. If there is a tie between the top BNs, we break it by looking at the area under precision-recall curve (AUPR; Raghavan et al., 1989; Saito & Rehmsmeier, 2015) scores; that is, among the top BNs with a deficit of at most, say 0.01, from the top AUC, we pick the one with the highest AUPR. The AUPR score better handles unbalanced data by looking at precision rather than the false positive rate (Davis & Goadrich, 2006; Saito & Rehmsmeier, 2015).
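The selection rule can be sketched in a few lines. Here the AUC is computed through its rank interpretation, the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, and the tie-break keeps every candidate within a tolerance of the top AUC before choosing the highest AUPR. The functions `auc` and `pick_best` and all score values are hypothetical illustrations, not results from the case study.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def pick_best(results, tolerance=0.01):
    """Among models within `tolerance` of the top AUC, return the one with the highest AUPR."""
    top_auc = max(r["auc"] for r in results.values())
    contenders = {k: r for k, r in results.items() if top_auc - r["auc"] <= tolerance}
    return max(contenders, key=lambda k: contenders[k]["aupr"])

# Hypothetical predicted probabilities for four test instances.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# Hypothetical test scores for networks learned with k = 2..5 discretization levels.
results = {
    2: {"auc": 0.860, "aupr": 0.40},
    3: {"auc": 0.900, "aupr": 0.42},
    4: {"auc": 0.895, "aupr": 0.47},
    5: {"auc": 0.800, "aupr": 0.60},
}
best_k = pick_best(results)
print(best_k)  # 4: within 0.01 of the top AUC and the highest AUPR among the contenders
```

Note that k = 5 has the highest AUPR overall but is excluded, because its AUC falls short of the top value by more than the tolerance.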
Given the temporal nature of our task, we evaluate the final model on a single test dataset, as explained in Section 3.3. If instead, one divides the original dataset into several yearly separated folds and uses cross-validation to obtain the AUC and AUPR values for each fold, then one could also provide confidence intervals for the reported AUC and AUPR values.
3.6 Step 5: Interpretation
3.7 Step 6 (optional): Sensitivity analysis
We examine the prediction accuracy (AUC and AUPR) of the best model when a primary covariate becomes unobservable. This roughly shows the contribution of each covariate to the prediction, although it is, indeed, the co-effect of all the covariates that leads to accurate predictions.
3.8 Step 7 (optional): Comparison with simple Bayesian networks
To further assess the prediction performance of the final BN, we may compare its AUC (or AUPR) with that of simple BNs consisting of a single or two covariates linked to the response variable. These BNs might be considered as the ‘null model’.
Recall that our final BN is designed to perform a generative task (that is to reveal the relationships between the variables), not a discriminative task (that is to predict the response variable). However, if the BN performs well on the first, it is likely to also do well on the second. Yet, the opposite does not hold (Ng & Jordan, 2002). So even if any of these simple BNs predicts the response variable better than our final BN, it does not question the capability of our BN in explaining the probabilistic relationships between the variables. The same may hold in the previous optional step: the AUC score of the BN may increase after removing some of the covariates. This can also be explained by the fact that our final BN is the best fit to the data under the performance score that we used, which is BIC (or AIC) not AUC.
Nevertheless, in such cases, we may train a BN with a different set of covariates for prediction purposes. For example, we may find that subset of the covariates that results in a BN scoring the highest AUC on the training dataset.
4 THE MOUNTAIN PINE BEETLE CASE STUDY
We illustrate the learning and interpretation of BNs via the data on the MPB infestation in the Cypress Hills park, an interprovincial park located in Alberta and Saskatchewan (Figure S2). Endemic-level populations of MPB have existed in Cypress Hills since the 1980s. However, an MPB outbreak started in 2006 and propagated in the park, where it continues until now.
4.1 Biology and management
Mountain pine beetle presents two main population phases: an endemic phase with small population size where beetles attack weak and stressed pines with the help of other bark beetles, and an epidemic phase where the number of individuals is large enough to overcome the defences of large and healthy pines (Safranyik & Carroll, 2006). In summer, beetles will emerge from a tree, mate and attack new pines to lay eggs in galleries under the bark. New MPB infestations are reported to frequently appear in south- and west-facing slopes (Safranyik, 2004). During the tree growing season, water stress negatively impacts the pine's ability to build its defence against bark beetles (Lusebrink et al., 2016; Safranyik, 1978). Indeed, pines use water to make a toxic resin that is exuded during a beetle attack to prevent beetles from attracting conspecifics and inhibit the formation of galleries and oviposition (Erbilgin et al., 2017; Raffa & Berryman, 1983). MPB emergence and flights are reduced with high temperatures during the dispersal season (Safranyik & Carroll, 2006). MPB can disperse at short distances within a stand or, more rarely, fly above the canopy to use the wind to travel long distances of the order of tens to thousands of kilometres (Robertson et al., 2007; Safranyik & Carroll, 2006). Once the eggs are laid, the adults die. Over the fall, winter and spring, eggs become larvae then pupae before finishing their transition to adult and emerging in the summer. Individuals need a minimum of 833 degree days to complete their transition to adult (Safranyik et al., 1975, 2010).
The Forest Service Branch of the Saskatchewan Ministry of Environment follows a strict direct control approach. At the start of every fall, the park is surveyed aerially to collect geo-referenced data on red-top trees, that is, trees that are dead or dying from an MPB infestation in the previous year. Then, on the ground, managers survey 50-m radius circular plots around each red-top tree to find trees that were infested during the summer. The newly found infestations are controlled in late fall or winter using a fell and burn method.
Our goal is to provide a set of covariates that potentially affect the MPB infestation in the Cypress Hills area, understand how they are related to each other and to the infestation, and find which covariates are sufficient for an accurate prediction. We are also interested in testing some of the claims in the literature, for example, that lower humidity increases the chances of infestation (Lusebrink et al., 2016), and in finding which values of the highly correlated covariates degree days and maximum temperature, which are typically not included together in a model, make infestation most likely. These objectives are well suited to BNs.
4.2 Methods
We divide the study area into 100 m × 100 m squares (pixels) and label them by g = 1, 2, …. We choose 1 year as our time unit and define the response variable as the presence or absence of infestation in pixel g at the fall of year t. We use the covariates listed in Table 4 and quantify them into levels. Our data include the values of and (Figure 2) over the years and for 18317 different pixels g in Cypress Hills, resulting in a total of instances (see Supporting Information for an instance of the data).
Name | Symbol | Description | Unit |
---|---|---|---|
Aspect | A_{g} | Compass direction that the slope at pixel g faces | ° |
Distance to infested border | B_{g} | Distance of the centre of pixel g to the border of the whole area of interest that was initially infested (Figure S2) | km |
Degree days | | Sum of daily temperatures above 5.5°C from fall of year t − 1 to summer of year t | Celsius degree-day |
Maximum temperature | | Highest maximum daily temperature in July and August of year t | °C |
Wind speed | | Average daily wind speed in July and August of year t | km/hour |
Relative humidity | | Average daily relative humidity in spring of year t | % |
Cold tolerance | | An index in [0, 1] representing the ability of the larvae to survive the cold season of year t − 1, as defined in Régnière & Bentz (2007) | |
Pine cover | | Pine density in summer of year t | % |
Managed last year infestation | | Defined to be 1 if pixel g includes at least one tree that was infested and managed (controlled) at year t − 1, and 0 otherwise (Figure 2) | — |
Missed last year infestation | | Defined to be 1 if pixel g includes at least one tree that was infested and missed (not controlled) at year t − 1, and 0 otherwise | — |
Missed neighbours' last year infestation | | MPB's ability to disperse at short distances within a stand, defined as where are those pixels that are essentially at a distance of i × 100 m from g (Figure S3) | — |
Managed neighbours' last year infestation | | Defined similar to , with the difference that is replaced by | — |
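Quantifying the continuous covariates of the table above into levels can be sketched as follows. The binning rule is an assumption: this section does not state how the cut points were chosen, so the sketch uses equal-frequency (quantile) bins purely for illustration, and the temperature values are hypothetical.

```python
import bisect


def equal_frequency_edges(values, n_levels):
    """Cut points that split the sorted values into n_levels groups of
    (nearly) equal size. Assumed rule, not necessarily the one used here."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_levels] for i in range(1, n_levels)]


def to_level(x, edges):
    """Map a raw value to its level index in 0 .. len(edges)."""
    return bisect.bisect_right(edges, x)


# Hypothetical maximum temperatures (degrees Celsius) for twelve pixel-years.
max_temps = [24.1, 26.3, 27.0, 28.4, 29.2, 29.9,
             30.5, 31.0, 31.5, 32.6, 33.1, 34.8]

edges = equal_frequency_edges(max_temps, n_levels=6)
levels = [to_level(t, edges) for t in max_temps]
```

With six levels and twelve values, each level receives two observations. Cut points should be estimated on the training data only and then reused unchanged on the test data.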
We compare the AUC and AUPR scores of our final model with those of what we call the one-memory infestation (OI) Bayesian network, consisting of and , being linked to the target , considered as the null model (Figure 1).
4.3 Resulting Bayesian network
We find the BN with the best BIC score on the training dataset with six discrete levels, that is, (Figure 3), as our ‘best model’ to explain the MPB infestation, with AUC = 0.88 and AUPR = 0.28. The OI model scores 0.75 for AUC and 0.19 for AUPR, both lower than our selected model. According to the structure of this BN, the infestation in location g at year t is directly connected to , , , , and . These together with , form the Markov blanket of the infestation node, and hence, are the primary covariates and sufficient for estimating infestation with an AUC score of 0.88. All other covariates are only indirectly linked to infestation and are secondary covariates. Given this BN, one can obtain the conditional independencies between the covariates and infestation using d-separation and plot the CPDs (see Supporting Information).
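The Markov blanket of a node consists of its parents, its children and the other parents of its children. The sketch below extracts it from a DAG stored as a parent list. The graph and node names are hypothetical and do not reproduce the learned MPB network.

```python
def markov_blanket(parents, node):
    """Return the parents, children and children's other parents of `node`.

    `parents` maps every node of a DAG to the list of its parents.
    """
    direct_parents = set(parents[node])
    children = {child for child, pa in parents.items() if node in pa}
    co_parents = {p for child in children for p in parents[child]}
    return (direct_parents | children | co_parents) - {node}


# Hypothetical DAG, not the learned MPB network.
toy_parents = {
    "target": ["x1", "x2"],   # x1 and x2 are parents of the target
    "x1": [],
    "x2": ["x5"],             # x5 reaches the target only through x2
    "x3": ["target", "x4"],   # x3 is a child; x4 is a co-parent
    "x4": [],
    "x5": [],
}

blanket = markov_blanket(toy_parents, "target")
```

Here `x5` is outside the blanket: once `x2` is observed, `x5` carries no further information about the target, which mirrors the role of secondary covariates.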
4.4 Sensitivity to missing covariates
The prediction accuracy of the selected BN does not deteriorate when the values of any of the secondary covariates are missing. When the value of a primary covariate is missing, the model can still predict infestation accurately because it can draw on some of the secondary covariates (Table 5).
Missing covariate | AUC | AUPR |
---|---|---|
Nothing missing | 0.882 | 0.277 |
Maximum temperature | 0.889 | 0.350 |
Cold tolerance | 0.881 | 0.290 |
Distance to infested border | 0.890 | 0.309 |
Missed neighbours' past infestation | 0.760 | 0.220 |
Managed neighbours' past infestation | 0.879 | 0.284 |
Missed last year infestation | 0.811 | 0.103 |
Managed last year infestation | 0.869 | 0.206 |
Last year infestation (both missed and managed) | 0.784 | 0.068 |
4.5 Discussion
The final model we have chosen to explain the MPB infestation in the Cypress Hills area is the BN with six discretization levels, scoring 0.88 AUC on the test dataset. For a managed MPB outbreak in the Cypress Hills area, the model postulates the following covariates as primary (and hence sufficient for a 0.88 AUC prediction) at each location, at each time: (1, 2) presence of infestation in last year, both managed and missed, (3, 4) neighbours' degree of infestation in last year, both managed and missed, (5) distance to the border where the infestation was initiated, (6) maximum temperature in July and August of that year and (7) cold tolerance in the cold season of that year. The remaining covariates are secondary and are used to predict infestation if one or more of the primary covariates are missing.
Given this BN, we can provide a wide range of ceteris paribus claims revealing the co-effects of the covariates on the presence of infestation (see Supporting Information). For example, if we know that maximum daily temperature is high (above 31.2°C), the interval of relative humidity that results in the highest infestation risk changes sharply from medium to low. This is in line with the claim that lower humidity increases the infestation probability (Lusebrink et al., 2016; Safranyik, 1978). However, for maximum daily temperatures lower than 31.2°C, the infestation likelihood is high for both low and high relative humidity. This apparent inconsistency can be resolved by looking at maximum temperature and relative humidity together. We find that humid areas require low maximum daily temperature, while dry areas require high maximum daily temperature, for a considerable risk of infestation (above 20%).
As another example, an MPB needs 833 degree days to complete its transition to adult (Safranyik et al., 1975, 2010), whereas the minimum number of degree days in the data is 1,054. Therefore, degree day never prevents infestation in our data and just reflects the negative impact of high summer temperatures. This, however, does not mean that degree day is useless in our model. First, as mentioned earlier, in the absence of some of the primary covariates, the model effectively estimates infestation via the information on degree day and the other available covariates. Second, although highly correlated, degree day and maximum temperature are different, and the model reveals their co-effect on the infestation: for low (resp. high) degree days, infestation becomes more likely as maximum temperature increases (resp. decreases; see Supporting Information).
We emphasize that one may not draw causal conclusions from the structure of the model alone. Clearly, the edge from infestation to managed-last-year-infestation does not imply that this year's infestation has caused last year's (managed) infestation. It only means that the two are probabilistically dependent. The same holds for all other links, such as the one from maximum temperature to infestation: although temperature may be ‘causing’ infestation, one may not conclude so just based on the BN. One may refer to the literature on causality and the corresponding tests in order to verify the causality of a link in a BN (Pearl, 2009; Pearl & Mackenzie, 2018). Moreover, the absence of an edge between, for example, degree day and infestation does not necessarily mean that the two are independent. They may be dependent but become conditionally independent once some other covariates are known.
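The last point, that two variables without a connecting edge can be dependent yet conditionally independent, can be checked numerically on a three-node chain A → B → C. The chain and its probabilities are hypothetical and serve only to illustrate the statement. There is no edge between A and C, but C depends on A unless B is known.

```python
from itertools import product

# Hypothetical chain A -> B -> C with made-up conditional probabilities.
P_A = {0: 0.6, 1: 0.4}
P_B_GIVEN_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_C_GIVEN_B = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

# Joint distribution implied by the chain; states are tuples (a, b, c).
JOINT = {(a, b, c): P_A[a] * P_B_GIVEN_A[a][b] * P_C_GIVEN_B[b][c]
         for a, b, c in product((0, 1), repeat=3)}


def prob(event, given=None):
    """P(event | given); both map a position (0=A, 1=B, 2=C) to a value."""
    given = given or {}

    def matches(state, cond):
        return all(state[i] == v for i, v in cond.items())

    denom = sum(p for s, p in JOINT.items() if matches(s, given))
    numer = sum(p for s, p in JOINT.items()
                if matches(s, given) and matches(s, event))
    return numer / denom


# Marginally, C depends on A even though no edge links them.
c_given_a0 = prob({2: 1}, {0: 0})
c_given_a1 = prob({2: 1}, {0: 1})

# Once B is known, A carries no further information about C.
c_given_a0_b1 = prob({2: 1}, {0: 0, 1: 1})
c_given_a1_b1 = prob({2: 1}, {0: 1, 1: 1})
```

Reversing every arrow of the chain (C → B → A) would encode the same conditional independence, which is one reason why edge directions in a learned BN cannot be read causally without further tests.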
In summary, the learned BN contributes to the prediction and understanding of MPB infestations by (1) accurately predicting MPB infestations, (2) identifying the primary set of covariates that are sufficient for making these predictions, (3) making acceptable predictions when data on some of the primary covariates are unavailable, (4) revealing the previously unknown co-effects of the covariates on infestation likelihood, (5) identifying the most informative covariate(s) to infestation likelihood and (6) proposing a BN structure that can serve as the basis for future causality tests between the variables. Points 1, 2, 3 and 5 are particularly useful to forest managers to plan ahead of time and know what data to collect. See SI for a more elaborate discussion on the MPB case study.
Nevertheless, as with almost all other machine-learning models, BNs are generally constructed under the stationarity assumption, implying fixed structure and parameters over time. If the ‘true ecological process’ is non-stationary, this may result in poor performance when the model is used to make predictions at a time different from those in the training dataset. For example, a BN trained on data collected at the beginning of an outbreak may not accurately predict the declining phases of the outbreak. Similar concerns arise when the learned BN is used in environmental conditions where the ranges of the covariates are very different from those in the training dataset. We refer the reader to Robinson et al. (2010), Zhou et al. (2008) and Zhu and Wang (2015) for ways of relaxing the stationarity assumption.
5 DISCUSSION
Although traditional models used to make ecological predictions from underlying covariates have a record of success, they also suffer from limitations. They cannot make predictions when one or more covariates are missing unless the missing values are imputed using other methods, which can be unreliable and result in low prediction accuracy. They also do not allow for statistical inference when some of the covariates are highly correlated. BNs can handle these issues. Specifically, they provide a primary and secondary ordering of the covariates, where primary covariates are essential to predicting the target variable and secondary covariates, while not always essential, can be helpful in making predictions when the values of some covariates are missing.
However, BNs are not used to their full potential in the literature as their structure is typically constructed based on the knowledge of experts. Moreover, the obtained BN is often read causally, a questionable practice as BNs are different from causal networks.
We have complemented previous work by providing a systematic approach to obtain a BN fully from data. We have demonstrated the approach via an MPB case study, where no expert knowledge was involved in finding either the structure or the CPDs. The resulting BN predicts infestations fairly accurately, even when any one of the covariates in the model is missing.
Researchers have utilized BNs to visualize their understanding of the causal relationships between the variables involved in ecological processes (Amstrup et al., 2008; Aps et al., 2009; Borsuk et al., 2004; Johnson et al., 2010; Newton, 2010; Pollino, Woodberry, et al., 2007). The resulting networks have often been used as predictors and sometimes reported to be fairly successful on a test dataset. This is an acceptable approach for assessing the a priori knowledge of the experts or when there are no data available to learn the BN structure. However, on the basis of the results for our MPB case study, we challenge claims that put forward this approach as ‘the (only) right one’ for constructing a BN. Examples include the claims that synthesizing existing knowledge into the model is necessary and that structural learning is only for modelling poorly understood systems or those difficult to characterize (Chen & Pollino, 2012), that modellers must demonstrate causal relations (McCann et al., 2006), that models based on theories about causal relations are generally better (Uusitalo, 2007) and that network structure is a matter of judgement and should reflect expert knowledge and stakeholder needs (Gutierrez et al., 2011). Some researchers have looked into fixed (naive Bayes) and partially learnable (tree-augmented naive Bayes) structures (Aguilera et al., 2010), yet this is different from learning the structure fully from data.
In general, for modelling the joint probability distribution of the variables involved in an ecological process, that is, a generative task, BNs seem to be the first and often best candidate, especially if the governing dynamics are not yet understood well enough to be modelled mechanistically. However, if the sole purpose is to predict the response variable, that is, a discriminative task, other models may show a higher prediction accuracy, although unlike BNs, they typically cannot deal with missing values in the covariates. We are currently exploring ways to use BNs, as well as other models, to predict infestation many years in the future (Ramazi et al., accepted).
ACKNOWLEDGEMENTS
We thank Rory L. McIntosh for providing the mountain pine beetle data needed for the application section. We thank the Greiner and Lewis Research Groups for helpful feedback on ideas related to this research. The research was partly funded by Alberta Environment & Parks (AEP). This research was also supported by a grant to M.A.L. from the Natural Sciences and Engineering Research Council of Canada (grant no. NET GP 434810-12) to the TRIA Network, with contributions from Alberta Agriculture and Forestry, Foothills Research Institute, Manitoba Conservation and Water Stewardship, Natural Resources Canada-Canadian Forest Service, Northwest Territories Environment and Natural Resources, Ontario Ministry of Natural Resources and Forestry, Saskatchewan Ministry of Environment, West Fraser and Weyerhaeuser. M.A.L. is also grateful for the support through the NSERC Discovery and the Canada Research Chair Programs. R.G. is grateful for funding from NSERC Discovery and Alberta Machine Intelligence Institute.
AUTHORS' CONTRIBUTIONS
All the authors conceived the ideas, interpreted the results and drafted the manuscript. P.R. developed the methods and undertook the analysis. All the authors gave final approval for publication.
Open Research
DATA AVAILABILITY STATEMENT
The dataset analysed in the current study is described in Kunegel-Lion et al. (2020a) and available from the Dryad Digital Repository (https://doi.org/10.5061/dryad.70rxwdbt9; Kunegel-Lion et al., 2020b).