Volume 12, Issue 1 p. 135-149
RESEARCH ARTICLE
Open Access

Exploiting the full potential of Bayesian networks in predictive ecology

Pouria Ramazi (Corresponding Author)
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
Correspondence: Pouria Ramazi. Email: [email protected]

Mélodie Kunegel-Lion
Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada

Russell Greiner
Department of Computing Science, University of Alberta, Edmonton, AB, Canada

Mark A. Lewis
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada

First published: 12 October 2020

Abstract

  1. Although ecological models used to make predictions from underlying covariates have a record of success, they also suffer from limitations. They are typically unable to make predictions when the value of one or more covariates is missing during the testing. Missing values can be estimated but methods are often unreliable and can result in poor accuracy. Similarly, missing values during the training can hinder parameter estimation of many ecological models. Bayesian networks can handle these and other limiting issues, such as having highly correlated covariates. However, they are rarely used to their full potential.
  2. Indeed, Bayesian networks are commonly used to evaluate the knowledge of experts by constructing the network manually and often (incorrectly) interpreting the resulting network causally. We provide an approach to learn a Bayesian network fully from observed data, without relying on experts and show how to appropriately interpret the resulting network, both to identify how the variables (covariates and target) are interrelated and to answer probabilistic queries.
  3. We apply this method to the case study of a mountain pine beetle infestation and find that the trained Bayesian network has a predictive accuracy of 0.88 AUC. We classify the covariates as primary and secondary in terms of contributing to the prediction and show that the predictive accuracy does not deteriorate when the secondary covariates are missing and degrades to only 0.76 when one of the primary covariates is missing.
  4. As a complement to the previous work on constructing Bayesian networks by hand, we show that if instead, both the structure and parameters are learned only from data, we can achieve more accurate predictions as well as generate new insights about the underlying processes.

1 INTRODUCTION

Predictions are essential in aquatic and terrestrial ecology, whether the focus lies in changes in ecosystem composition, structure and richness to preserve the biodiversity and ecosystem function, or in the spatial distribution of individuals and species to inform conservation and invasive species policies. The field of predictive ecology focuses on how to make such predictions, particularly in the context of climate change, and has grown exponentially since the 1990s, given the quality and quantity of available ecological data (Mouquet et al., 2015; Purves et al., 2013). Simple and advanced statistical and machine-learning approaches have been used to this end, and some have reported great success. Commonly applied models include mechanistic equations, individual-based models, GLMs (Aukema et al., 2008; Preisler et al., 2012), generalized additive models, MaxEnt (Merow et al., 2013), decision trees, support vector machines and artificial neural networks (Marmion et al., 2009; Youssef et al., 2016).

These standard models, however, lack some practical features, which calls into question their use as predictors. They are unable to make predictions when the value of a covariate is missing, a typical issue because some covariates are expensive or logistically impossible to collect. Imputing the missing values can be unreliable, as modelling assumptions are needed to ‘guess’ them. The assumptions may even conflict with those posed by the original model using the imputed values. Another approach is to produce a model that does not involve any covariate that is ever missing. This can be problematic as well, because (a) the covariates that are missing are not fixed across the area of interest: the value of a covariate may be missing at location A, but present at location B, and the opposite may hold for another covariate; and (b) even if a covariate is only measured in the laboratory and never in the field, incorporating it in the model can still reveal its effect on the response variable. Most models also cannot reveal the co-effect of more than one covariate on the response variable, and some do not allow for statistical inference. Moreover, those that are used for statistical inference cannot handle correlated covariates.

Bayesian networks (BNs) can deal with these issues. They are directed acyclic graphs, whose nodes are the response variable and covariates, and the links between the nodes show how these nodes are related to each other. Both links from covariate to response and from covariate to covariate are allowed in the network. BNs are graphical, and hence often simpler to understand than complex systems of equations (e.g. Bode et al., 2017; Eklöf et al., 2013; Rish et al., 2009; Troyanskaya et al., 2003), deepening our understanding of natural phenomena as well as allowing for accurate predictions. However, there are two main issues with how BNs are typically applied in practice: (a) they are rarely used to their full potential and (b) they are misinterpreted as causal networks. The common practice of applying BNs is to manually construct the structure (network), based on the knowledge of experts, then either set the parameters manually or learn them from data, and finally, read the links as causal relationships in the resulting BN. Although useful in assessing the qualitative descriptions of an ecological process, this approach relies heavily on our prior understanding of the process, and hence, is only as good as our understanding. If, instead, both the structure and parameters of the BN are learned only from the data, there will be room for more accurate predictions as well as new insights about the process. Moreover, BNs are not causal networks, but essentially a set of conditional (in)dependencies that factorize the joint probability distribution of all of the variables. Causal deductions, hence, may not be made, although some hypotheses may be tested.

We complement previous studies on BNs that used the knowledge of experts (Chen & Pollino, 2012; Marcot et al., 2006) by focusing on learning the structure, and proper model interpretation in the form of conditional probabilistic inferences rather than causal deductions. The goal of this paper was (a) to discuss the advantages of different ecological modelling approaches, and highlight what BNs can offer in this context; (b) to provide a systematic approach for training a BN completely from data, without incorporating the prior knowledge of experts, and then evaluating and interpreting the resulting BN and (c) to apply this method to the case study of a mountain pine beetle (MPB) outbreak.

2 ADVANTAGES OF BAYESIAN NETWORKS

In what follows, we first briefly introduce BNs and then compare them with other modelling approaches in predictive ecology (Table 1). Here, we focus on the ‘typical’ situation with each model; for example, the prediction accuracy of a properly trained BN being typically high does not imply that it is always higher or even as high as other highly accurate models.

TABLE 1. Comparison of models in predictive ecology. See Sections 2.2–2.9 for explanations of the model characteristics
Model characteristics (table columns; the rows are the models listed below): Capable of generative learning; Handles missing values at training; Handles missing values at testing; Level of non-linearity that can be handled; Allows for statistical inference and model selection; Tolerates correlated variables in statistical inference; Provides insights on variables co-effects; Ease of incorporation of prior knowledge; Does not need prior knowledge; Predictive accuracy; Marginalizable
Mechanistic equation
Individual-based model
Generalized linear model
Generalized additive model
MaxEnt
Decision tree
Support vector machine (linear)
Neural network
Bayesian network

Note

  • (empty) typically low; • typically medium; ● typically high.

2.1 Introduction to Bayesian networks

Given a set of n random variables $Y_1, \ldots, Y_n$ (consisting of the response variable and $n - 1$ covariates), a BN factorizes the joint probability $P(Y_1, \ldots, Y_n)$ according to a specified directed acyclic graph whose nodes are the variables $Y_1, \ldots, Y_n$, following the equation
$$P(Y_1, \ldots, Y_n) = \prod_{i=1}^{n} P\big(Y_i \mid \mathrm{Pa}(Y_i)\big) \tag{1}$$
where $\mathrm{Pa}(Y_i)$ denotes the parents of $Y_i$ in the graph, that is, those nodes that have an outgoing edge that leads to $Y_i$ (Figure S1). The individual factors $P(Y_i \mid \mathrm{Pa}(Y_i))$ are known as conditional probability distributions (CPDs; Koller & Friedman, 2009). A BN encodes the claim that given the Markov blanket $\mathrm{MB}(Y_i)$ of a node $Y_i$ (the set of its parents, children and the other parents of its children), the node becomes independent of the remaining nodes, written $Y_i \perp \{Y_1, \ldots, Y_n\} \setminus \big(\{Y_i\} \cup \mathrm{MB}(Y_i)\big) \mid \mathrm{MB}(Y_i)$. This provides the essentials for understanding how the variables relate to each other. We, therefore, refer to the nodes in the Markov blanket of the target node $Y_i$ as primary covariates and to others as secondary. The estimation of the target node based on the values of the primary covariates does not change if the values of the secondary covariates are additionally known. The conditional independencies also reduce the number of parameters needed to represent the joint distribution $P(Y_1, \ldots, Y_n)$. It is possible to learn from data, both the graph and the CPDs, known as the structure and parameters of the BN (Section 3).
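As a minimal sketch of the definitions above, the following Python snippet reads off the factorization and the Markov blanket from a small directed acyclic graph. The graph, the node names (`A` to `E`, with `Y` as the target) and the helper functions are hypothetical and chosen only for illustration; they are not taken from the case study.

```python
# Hypothetical DAG, stored as node -> list of parents (an assumption for illustration).
parents = {
    "A": [],
    "B": ["A"],
    "Y": ["B"],        # target node
    "C": [],
    "D": ["Y", "C"],   # child of Y; C is the "other parent" of that child
    "E": ["D"],
}

def children(node, parents):
    """Nodes that list `node` among their parents."""
    return [n for n, ps in parents.items() if node in ps]

def markov_blanket(node, parents):
    """Parents, children and the other parents of the children of `node`."""
    blanket = set(parents[node])
    for child in children(node, parents):
        blanket.add(child)
        blanket.update(parents[child])
    blanket.discard(node)
    return blanket

def factorization(parents):
    """Text form of the product of CPDs implied by the graph."""
    terms = []
    for node, ps in parents.items():
        terms.append(f"P({node} | {', '.join(ps)})" if ps else f"P({node})")
    return " ".join(terms)

primary = markov_blanket("Y", parents)          # primary covariates of Y
secondary = set(parents) - primary - {"Y"}      # everything else
print(factorization(parents))
print("primary:", sorted(primary), "secondary:", sorted(secondary))
```

In this toy graph, `A` and `E` fall outside the Markov blanket of `Y`, so they play the role of secondary covariates: once `B`, `C` and `D` are known, they carry no further information about `Y`.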
The factorization in Equation (1) is sufficient to define BNs and draws a clear line between BNs and causal networks. To explain, assume that we are modelling the co-occurrence of two competitive species, with densities denoted by $S_1$ and $S_2$, each corresponding to a node in a BN. We could link these two distributions using either a directed edge from $S_1$ to $S_2$, decomposing the joint density distribution of the two species as $P(S_1, S_2) = P(S_1) P(S_2 \mid S_1)$, or a directed edge from $S_2$ to $S_1$, resulting in $P(S_1, S_2) = P(S_2) P(S_1 \mid S_2)$. The first relies on the distribution of $S_1$ and the conditional distribution of $S_2$ given $S_1$, and the reverse holds for the second. Both of these models can be used to make acceptable predictions if one can estimate parameters $P(S_2 \mid S_1)$ and $P(S_1 \mid S_2)$ effectively. However, none of the models are causal: neither of $S_1$ or $S_2$ is causing the other. The edge simply means probabilistic dependence and dictates the factorization of the joint distribution. Now, assume that the species distributions are each partly ‘caused’ by a third variable vegetation, denoted by V. Should we construct a BN based on this ‘causal understanding’, we would add the node V and link it to both $S_1$ and $S_2$ without connecting the two. This results in the joint probability distribution of the two species and vegetation:
$$P(S_1, S_2, V) = P(V) P(S_1 \mid V) P(S_2 \mid V)$$
However, this is not the only way to model the joint distribution. Depending on the training data, one may obtain a more accurate model in terms of data fitting by also linking $S_1$ to $S_2$ (or vice versa), resulting in
$$P(S_1, S_2, V) = P(V) P(S_1 \mid V) P(S_2 \mid S_1, V)$$
This might be because vegetation is not the only cause of the two, and another factor, say temperature, also plays a role, which is not included in our variable list but is highly correlated with $S_1$, and hence, provides a better estimation of $S_2$ by linking the two. One may yet use a different model, where $S_1$ and $S_2$ are not linked to each other but both linked to V, resulting in
$$P(S_1, S_2, V) = P(S_1) P(S_2) P(V \mid S_1, S_2)$$
This is particularly useful if we know the distributions of the species densities, that is, $P(S_1)$ and $P(S_2)$, but not that of vegetation $P(V)$, and we know how vegetation can be estimated based on the distribution of the two species, that is, $P(V \mid S_1, S_2)$. None of the links in this model are causal.
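The point that edge direction only dictates a factorization can be checked numerically. The sketch below assumes two binary species variables, here called `S1` and `S2`, with a made-up joint table; it shows that the factorization for either edge direction reproduces the same joint distribution, so the direction by itself carries no causal information.

```python
import numpy as np

# Hypothetical joint distribution of two binary variables S1 and S2
# (rows index S1, columns index S2). The numbers are made up for illustration.
joint = np.array([[0.40, 0.10],
                  [0.20, 0.30]])

# Edge S1 -> S2: P(S1, S2) = P(S1) * P(S2 | S1)
p_s1 = joint.sum(axis=1)
p_s2_given_s1 = joint / p_s1[:, None]
joint_a = p_s1[:, None] * p_s2_given_s1

# Edge S2 -> S1: P(S1, S2) = P(S2) * P(S1 | S2)
p_s2 = joint.sum(axis=0)
p_s1_given_s2 = joint / p_s2[None, :]
joint_b = p_s2[None, :] * p_s1_given_s2

# Both factorizations describe exactly the same joint distribution.
same = np.allclose(joint_a, joint_b)
print(same)
```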

2.2 Generative versus discriminative learning

Consider the response variable Y and set of covariates (features) $\mathbf{X}$ that are used to estimate Y. One may pursue either of the two learning tasks with respect to these variables: generative, that is to learn the joint probability distribution $P(\mathbf{X}, Y)$, or discriminative, that is to learn the conditional probability $P(Y \mid \mathbf{X})$. On the one hand, the joint probability distribution $P(\mathbf{X}, Y)$ represents the probability of any given assignment to all of the variables $\mathbf{X}$ and $Y$ in the data, or loosely speaking, how all the variables are related to each other. On the other hand, the conditional probability $P(Y \mid \mathbf{X})$ represents the probability of $Y$ happening given $\mathbf{X}$, or in other words, in which cases does $Y$ happen. So discriminative learning focuses only on the probability of the response variable whereas generative learning also reveals the probability of the covariates. For example, on the one hand, an ecologist may be interested in two species' co-occurrence, which is a generative question given by the distribution $P(S_1, S_2)$, where $S_1$ and $S_2$ are the density of the species. On the other hand, the same ecologist may be interested in whether the density of species $S_1$ (as a response variable) can be estimated using that of species $S_2$ (as the covariate), which is a discriminative question, given by $P(S_1 \mid S_2)$.

Note that knowing the ‘true’ joint distribution $P(\mathbf{X}, Y)$ allows knowing the conditional distribution $P(Y \mid \mathbf{X})$. However, because small errors in estimating $P(\mathbf{X}, Y)$, which typically happen in practice, might lead to large errors in the associated values of $P(Y \mid \mathbf{X})$ (Ng & Jordan, 2002), each learning task deserves its own treatment. Although potentially capable of modelling the joint probability distribution, mechanistic models are not commonly used for this purpose as it would require a great deal of prior knowledge of the process. Roughly speaking, none of the models in Table 1, except for BNs, are effective at generative learning.
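A small numerical sketch may help to see why an error in the joint distribution can be amplified in the conditional. The tables below are invented for illustration (one binary covariate and one binary response, with a rare covariate value); the example only demonstrates the arithmetic of normalization and is not a reproduction of the analysis in Ng & Jordan (2002).

```python
import numpy as np

def conditional_y_given_x(joint):
    """Turn a joint table P(X, Y) (rows: X, columns: Y) into P(Y | X)."""
    return joint / joint.sum(axis=1, keepdims=True)

# Hypothetical 'true' joint distribution; X = 1 is rare (probability 0.01).
joint_true = np.array([[0.600, 0.390],
                       [0.006, 0.004]])

# Hypothetical estimate that is off by only 0.003 in two cells.
joint_est = np.array([[0.600, 0.390],
                      [0.003, 0.007]])

cond_true = conditional_y_given_x(joint_true)
cond_est = conditional_y_given_x(joint_est)

joint_error = np.abs(joint_true - joint_est).max()   # largest error in P(X, Y)
cond_error = np.abs(cond_true - cond_est).max()      # largest error in P(Y | X)
print(joint_error, cond_error)
```

Because the row for the rare covariate value is divided by a small marginal probability, an absolute error of 0.003 in the joint becomes an error of about 0.3 in the conditional probability for that row.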

2.3 Missing data

Datasets often have many instances (observations) where the value of one or more of the covariates and/or the response variable is missing. Missing values can occur both at the time of training and testing of a model.

Should the training dataset contain missing values, most traditional statistical methods such as regressions would use casewise deletion, that is, remove the entire instance (observation) from the dataset if the value of one or more variables is missing (Harrell, 2015). Casewise deletions can lead to bias in the estimated parameters if the degree to which the variable's value is likely to be missing is correlated with the actual range of values, for example, when a temperature sensor fails to record values below −10°C. Casewise deletions also result in losing the information provided by the remaining variables in the instance with missing values. Therefore, imputation is often used to estimate the missing values, which can be as simple as using the variable's mean or the variable's value from a similar instance, or can be more complex, such as using the chained equation method (Harrell, 2015). However, in essence, imputation presumes a model for the variables with missing values, which may conflict with the actual model that is going to be trained on the imputed dataset, resulting in a poor predictor. With BNs, in contrast, methods such as expectation maximization (EM) and structural EM can be used to learn the parameters and structure, without imputation or casewise deletion (Koller & Friedman, 2009).
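To make the idea of learning parameters without imputation or casewise deletion concrete, here is a minimal sketch of parameter EM for the smallest possible network, a single edge between two binary variables. The variable names (`x`, `y`), the toy data and the function `em_two_node` are assumptions made for illustration; the sketch covers parameter learning only (not structural EM) and treats the missingness mechanism as ignorable.

```python
import numpy as np

def em_two_node(x, y, n_iter=50):
    """EM for a binary BN X -> Y where missing entries of x are np.nan."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    missing = np.isnan(x)
    x_obs = np.nan_to_num(x)              # placeholder 0 where x is missing
    p_x1 = 0.5                            # initial guess for P(X = 1)
    p_y1 = np.array([0.4, 0.6])           # initial P(Y = 1 | X = 0), P(Y = 1 | X = 1)
    logliks = []
    for _ in range(n_iter):
        # Joint probability of each row's y with X = 0 and with X = 1.
        lik0 = (1 - p_x1) * np.where(y == 1, p_y1[0], 1 - p_y1[0])
        lik1 = p_x1 * np.where(y == 1, p_y1[1], 1 - p_y1[1])
        # E-step: expected value of X per row (observed rows keep their value).
        w = np.where(missing, lik1 / (lik0 + lik1), x_obs)
        # Observed-data log-likelihood under the current parameters.
        row_lik = np.where(missing, lik0 + lik1, np.where(x_obs == 1, lik1, lik0))
        logliks.append(np.log(row_lik).sum())
        # M-step: re-estimate the parameters from the expected counts.
        p_x1 = w.mean()
        p_y1 = np.array([((1 - w) * y).sum() / (1 - w).sum(),
                         (w * y).sum() / w.sum()])
    return p_x1, p_y1, logliks

# Toy data: two of the ten x values are missing, yet all ten rows are used.
x = [1, 1, 1, 0, 0, 0, np.nan, np.nan, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
p_x1, p_y1, logliks = em_two_node(x, y)
print(p_x1, p_y1)
```

Rows with a missing `x` still contribute through their observed `y`, weighted by the current belief about `x`, which is the information casewise deletion would discard.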

Should the testing dataset contain missing values, almost all models fail to make predictions as each covariate has to take some value, that is, they cannot be left with ‘NA’s (not available). Imputation comes with the above-mentioned shortcomings. Another alternative is to use expert knowledge to obtain probable limits for the covariates with missing values, and run the model on those limits to get a probable range for the prediction. For example, in climate change models, the exact pathway that a covariate such as greenhouse gas concentration will follow in the future is unknown. Therefore, models use a series of scenarios ranging from best to worst case in order to predict changes in CO2 emissions and temperatures (Pachauri et al., 2015). There is, however, no need for these rough approximations when applying BNs. By marginalizing over the unobserved covariates, BNs can predict the target variable based on any observed subset of the covariates.
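Prediction from an arbitrary observed subset can be sketched for a three-node discrete BN as follows. The network (edges `X1 -> X2`, `X1 -> Y` and `X2 -> Y`), the CPD values and the function `predict_y` are hypothetical; the sketch builds the full joint table explicitly, which is only practical for very small networks.

```python
import numpy as np

# Hypothetical CPDs of a binary BN with edges X1 -> X2, X1 -> Y, X2 -> Y.
p_x1 = np.array([0.7, 0.3])                        # P(X1)
p_x2_given_x1 = np.array([[0.8, 0.2],
                          [0.3, 0.7]])             # P(X2 | X1), rows index X1
p_y_given_x1_x2 = np.array([[[0.95, 0.05], [0.60, 0.40]],
                            [[0.70, 0.30], [0.20, 0.80]]])  # indexed [x1, x2, y]

# Joint table P(X1, X2, Y) as the product of the CPDs.
joint = p_x1[:, None, None] * p_x2_given_x1[:, :, None] * p_y_given_x1_x2

def predict_y(x1=None, x2=None):
    """P(Y = 1 | observed covariates); None marks a missing covariate."""
    i1 = slice(None) if x1 is None else [x1]
    i2 = slice(None) if x2 is None else [x2]
    sub = joint[i1][:, i2]            # keep observed values, all values otherwise
    p_y = sub.sum(axis=(0, 1))        # marginalize over whatever is unobserved
    return p_y[1] / p_y.sum()

print(predict_y(x1=1, x2=1))   # both covariates observed
print(predict_y(x1=1))         # X2 missing
print(predict_y(x2=1))         # X1 missing
print(predict_y())             # nothing observed: the prior P(Y = 1)
```

The same function answers all four queries; no imputed value or expert-chosen scenario is needed for the missing covariate.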

2.4 Non-linearity of the relationship between the covariates and response variable

In many real-world situations, the response variable may be related to the covariates in a highly nonlinear manner. Simple models such as linear regressions, however, assume a linear relationship. To capture some levels of non-linearity, GLMs extend the regressions by applying functions such as urn:x-wiley:2041210X:media:mee313509:mee313509-math-0063 and urn:x-wiley:2041210X:media:mee313509:mee313509-math-0064 to the covariates. Other extensions, such as generalized additive models, fit a smooth curve to the data for each covariate, thereby allowing complex nonlinear relationships (Guisan et al., 2002). Another extension is the machine-learning method MaxEnt (Phillips et al., 2006) that is able to link highly nonlinear response curves and estimate the probability distribution of the response variable using maximum entropy. Likewise, support vector machines classify the covariate space using hyper-planes, and hence, are linear, yet can allow for some non-linearity by first transforming the space using nonlinear kernels (Scholkopf & Smola, 2001). Process-based models can also build in highly complex nonlinear relationships. In all of these cases, the relationships between the response variable and covariates must be entirely described, based on an a priori model, a constraint that is relaxed in some other machine-learning models. For example, classification trees can represent any function over the set of discrete covariates, and this function does not need to be defined beforehand. Note, this may require a very deep classification tree. Moreover, the fact that a classification tree can represent a complex function does not mean it can be learned effectively. Likewise, BNs are flexible in dealing with nonlinear relationships. Over a set of discrete variables, BNs can represent an arbitrary joint probability distribution $P(\mathbf{X}, Y)$, which can represent any arbitrary conditional distribution $P(Y \mid \mathbf{X})$.

2.5 Hypothesis testing, statistical inference and model selection

The objective of hypothesis testing is to make inference through deduction. It consists of devising one or more working hypotheses and challenging them with data for corroboration (Hilborn & Mangel, 1997; Stephens et al., 2005). The hypothesis to test is translated into a mathematical equation and is verified using methods such as least squares and maximum likelihood. So to test a hypothesis, one needs (a) a mathematical equation representing a biological hypothesis and (b) a test statistic with a distribution that can be determined, representing the model accuracy when confronted with data. The complexity of machine-learning models usually prevents us from obtaining a simple equation representing the hypothesis, but this is not the case for BNs.

For example, consider a process with the response variable Y and covariates $X_1$ and $X_2$. One may hypothesize that the response variable Y depends on both $X_1$ and $X_2$ but becomes independent of $X_2$, given $X_1$. Namely, the response variable depends directly only on covariate $X_1$, and that $X_1$ itself depends only on $X_2$. This can be modelled by a BN with three nodes for the variables and two links: one from $X_2$ to $X_1$ and another from $X_1$ to $Y$. The BN assigns the following likelihood to each observation of the above process:
$$P(Y, X_1, X_2) = P(Y \mid X_1) P(X_1 \mid X_2) P(X_2) \tag{2}$$
The null hypothesis in this case is that there is no dependence among the variables: they are mutually independent. This results in a BN without any links between the nodes, yielding the following likelihood:
$$P(Y, X_1, X_2) = P(Y) P(X_1) P(X_2) \tag{3}$$
Given an observation, each of the probabilistic terms on the right-hand side of the above equations is simply a parameter provided that the BNs are discrete. Hence, the likelihood of a specified dataset for each of the BNs will be a polynomial in the parameters, the maximum of which is straightforward to derive. This allows for classical hypothesis testing, for example, by employing the likelihood ratio test, to reject the null hypothesis. Alternatively, among all BNs with the nodes Y, $X_1$ and $X_2$, one may find ‘the best’ using multiple working hypotheses, based on the Akaike information criterion (AIC; Akaike, 1974) or Bayesian information criterion (BIC; Schwarz, 1978). Therefore, with BNs, we are able to make inferences and obtain insights on the ecologically relevant covariates (e.g. Cooper & Herskovits, 1992; Milns et al., 2010; Pollino, White, et al., 2007).
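The comparison between a chain-structured BN and the link-free null model can be sketched as follows. The labelling of the chain (`X2 -> X1 -> Y`), the simulated data, and the helper functions `max_loglik` and `n_free_params` are assumptions made for this illustration; all variables are taken to be binary, and the maximum-likelihood parameters are the observed conditional frequencies.

```python
import numpy as np

def max_loglik(data, parents):
    """Maximized log-likelihood of a discrete BN with the given parent sets."""
    total = 0.0
    for node, ps in parents.items():
        n = len(data[node])
        # One key per row: the configuration of the node's parents in that row.
        keys = list(zip(*(data[p] for p in ps))) if ps else [()] * n
        counts = {}
        for k, v in zip(keys, data[node]):
            by_value = counts.setdefault(k, {})
            by_value[int(v)] = by_value.get(int(v), 0) + 1
        for by_value in counts.values():
            n_k = sum(by_value.values())
            total += sum(m * np.log(m / n_k) for m in by_value.values())
    return total

def n_free_params(parents):
    """Free parameters for binary nodes: one per parent configuration."""
    return sum(2 ** len(ps) for ps in parents.values())

# Simulated data from a chain X2 -> X1 -> Y (hypothetical probabilities).
rng = np.random.default_rng(0)
n = 500
x2 = (rng.random(n) < 0.5).astype(int)
x1 = (rng.random(n) < np.where(x2 == 1, 0.8, 0.2)).astype(int)
y = (rng.random(n) < np.where(x1 == 1, 0.7, 0.1)).astype(int)
data = {"X1": x1, "X2": x2, "Y": y}

chain = {"X2": [], "X1": ["X2"], "Y": ["X1"]}   # hypothesized structure
empty = {"X2": [], "X1": [], "Y": []}           # null: mutual independence

ll_chain, ll_empty = max_loglik(data, chain), max_loglik(data, empty)
lr_stat = 2 * (ll_chain - ll_empty)                 # likelihood ratio statistic
df = n_free_params(chain) - n_free_params(empty)    # difference in parameters
bic_chain = n_free_params(chain) * np.log(n) - 2 * ll_chain
bic_empty = n_free_params(empty) * np.log(n) - 2 * ll_empty
print(lr_stat, df, bic_chain, bic_empty)
```

Under the null hypothesis the statistic is approximately chi-squared distributed with `df` degrees of freedom, so a value above 5.99 (the 5% critical value for 2 degrees of freedom) would lead to rejecting mutual independence; the lower BIC points to the same preferred structure.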

2.6 Prior knowledge of the processes

Unlike mechanistic models that typically need a comprehensive knowledge of the involved processes to make accurate predictions, phenomenological methods such as traditional statistics and especially machine-learning have more leeway. One does not need to have any knowledge about the ecological process to train and test a support vector machine or neural network, for example. Although one may argue that the functions used in a neural network or the number of nodes and layers must be determined beforehand, these too can be selected automatically based on the training data or general rules of thumb. The level of autonomous learning is even higher with BNs. The whole structure and parameters of a discrete BN can be completely learned from data (McCann et al., 2006). The same goes for decision trees.

Although they can be trained autonomously, BNs allow experts to incorporate their knowledge into the network by forcing or preventing links between the nodes and by adding latent variables, which are unobservable and often abstract, such as habitat quality. Indeed, the spectrum of autonomous learning for BNs ranges from neither to both of the structure and parameters being set based on experts' knowledge.

2.7 Correlations

Often two or more of the covariates in a process are highly correlated. This hinders statistical inference as the effects of the correlated covariates on the response variable are difficult to separate (Dormann et al., 2013; Stewart, 1987). This would happen if we were building a model, say a logistic regression, with two covariates that are both relevant to the response variable, and also are highly correlated with each other. Thus, typically one of the variables is eliminated beforehand, either randomly, or based on ecological relevance, measurement feasibility and proximity to the mechanisms (Dormann et al., 2013; Harrell, 2015), or by using some autonomous technique such as minimum-redundancy maximum-relevance (Peng et al., 2005). However, this prevents understanding the impact of both of the correlated covariates together on the response variable. Process-based models do not suffer from correlation (except for parameter estimability), yet they require the mechanisms to be known a priori (Dormann et al., 2013). In contrast, a BN whose structure is learned from data does not require any prior knowledge, and reveals the differences of the correlated covariates in terms of their probabilistic dependence on other covariates as well as the response variable.

2.8 Predictive accuracy

Despite the complexity of ecological systems (Anand et al., 2010; Levin, 1992), some machine-learning models are reported to make accurate predictions. In contrast, process-based and traditional statistical models are rarely able to reach the same level of accuracy (e.g. Elith et al., 2006). Particularly, process-based models are known for their inability to make good predictions, although this has been challenged by, for example, Håkanson (2004), who presented an accurate mechanistic model for aquatic systems. Within machine-learning models, neural networks are acknowledged for accurate performance in highly complex tasks such as image recognition (Egmont-Petersen et al., 2002). However, this does not mean that neural networks necessarily outperform simpler models in practice. Firstly, finding the optimal number of layers and nodes is not always practical due to limited computational resources. Secondly, proper estimation of the many parameters of a neural network often requires massive data. Hence, while asymptotically effective, neural networks may not be as successful as simple models when the available data are insufficient. Finally, if the system in question is actually simple, then neural networks, especially deep ones, can easily overfit the training data. Simpler models may be a better choice also in this case.

2.9 Marginalizability

The notion of marginalizability addresses the possibility of separately studying how a particular covariate or subset of the covariates informs us about the response variable. We call a model marginalizable if it allows us to compute the probability of the response variable Y given any subset $\mathbf{X}'$ of the covariates $\mathbf{X}$; that is, $P(Y \mid \mathbf{X}')$. Most predictive models allow us to compute $P(Y \mid \mathbf{X})$, that is, the likelihood of the response given all of the covariates. However, only those that perform a generative task, that is, learning $P(\mathbf{X}, Y)$, allow us to marginalize the likelihood over the variables $\mathbf{X} \setminus \mathbf{X}'$, to obtain the likelihood conditioned on only those variables that we are interested in: $P(Y \mid \mathbf{X}')$. Therefore, only BNs and those mechanistic models developed to formulate the joint probability $P(\mathbf{X}, Y)$ are marginalizable.
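For discrete variables, the marginalization described above can be written out explicitly. In the following worked form, $\mathbf{X}'$ denotes the covariates of interest and $\mathbf{X}'' = \mathbf{X} \setminus \mathbf{X}'$ the remaining ones; this notation is an assumption introduced here for illustration.

```latex
P(Y = y \mid \mathbf{X}' = \mathbf{x}')
  = \frac{P(\mathbf{X}' = \mathbf{x}', Y = y)}{P(\mathbf{X}' = \mathbf{x}')}
  = \frac{\sum_{\mathbf{x}''} P(\mathbf{X}' = \mathbf{x}', \mathbf{X}'' = \mathbf{x}'', Y = y)}
         {\sum_{y'} \sum_{\mathbf{x}''} P(\mathbf{X}' = \mathbf{x}', \mathbf{X}'' = \mathbf{x}'', Y = y')}
```

Both sums require the joint distribution over all covariates and the response, which is why a model that only provides the conditional of the response given the full covariate set does not, in general, support this computation.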

3 LEARNING BAYESIAN NETWORKS FROM DATA

We explain, step by step, how to learn and then use a BN to make predictions and acquire biological insights. Most steps are general enough to be applied by any statistical/machine-learning method in the context of model selection or prediction making.

3.1 Setup

Ecological processes are typically modelled by a response variable Y and a set of covariates $\mathbf{X}$. If the process is spatial and temporal, then each instance (observation) of the process has a unique pair of identities: (a) the time t of the instance, the unit of which indicates the frequency of the observations, for example, a year, month or day, and (b) a general index g, roughly to distinguish the instances by location. For example, if the process of interest is Cyanobacteria bloom in lakes, then g indicates the label of the lakes. If the interest is in the spread of an infestation over a given area, then we may divide the area into r × r squares for say r = 1 km, and label them by g = 1, 2, …. We may exclude time when modelling a stationary quantity, for example, the joint distribution of several species in a specific area. Similarly, we may exclude the index g, if all instances are taken from the same location, for example, from the same lake. Also, note that time and especially the index g are not necessarily two covariates of the process. Indeed, time must be excluded from the set of covariates if the goal is to obtain a model that can be applied to times different from those in the available data, for example, to predict the future (see Supporting Information). Similarly, the index g may be excluded; however, one must acknowledge the possible performance loss when applying the model to areas far away from the training area, with dramatically different geographic features.

For illustration purposes, in what follows, we consider a spatial and temporal process. For each index g and time t, let $\mathbf{X}_{g,t}$ denote the set of covariates and $Y_{g,t}$ denote the response variable (Table 2). Although the response variable can be continuous or integer, in order to use established performance measures such as AUC (Section 3.5), we restrict it to be binary. For example, given an index g and time t, the response variable $Y_{g,t}$ may represent the presence, $Y_{g,t} = 1$, or absence, $Y_{g,t} = 0$, of infestation or a species of interest. The covariates can be correlated with each other and may include variables that are not known a priori to contribute to the response variable. Our goal is to estimate (learn) the joint probability distribution $P(Y_{g,t}, \mathbf{X}_{g,t})$ using available data.

TABLE 2. Variable notation

| Notation | Variable |
| --- | --- |
| $t$ | Time |
| $g$ | General index |
| $Y_{g,t}$ | Response variable |
| $\mathbf{X}_{g,t}$ | Set of covariates |
| $P(\cdot)$ | Probability function |

3.2 Step 1: Data discretization

The random variables in a BN can be either continuous or categorical. However, if they are continuous, we must predetermine their distributional forms, for example, a Gaussian distribution. To avoid making such assumptions, we use discrete BNs where every variable is categorical. We discretize all continuous variables by considering various numbers of intervals or discretization levels (say urn:x-wiley:2041210X:media:mee313509:mee313509-math-0101) and using data to determine which number leads to a higher performance score. If a continuous variable's range does not have evident thresholds in terms of the biological context, we use Hartemink's information-preserving algorithm (Hartemink, 2001) to quantify the values in a way that maximizes the mutual information shared by the variables (Cover & Thomas, 2012).
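To make the idea of k discretization levels concrete, the following Python sketch splits a continuous covariate into k equal-frequency intervals. This is a deliberately simple stand-in and is not Hartemink's algorithm; the function names and the temperature values are hypothetical.

```python
def quantile_cut_points(values, k):
    """Return k-1 cut points that split `values` into k intervals
    holding roughly the same number of observations."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def discretize(value, cut_points):
    """Map a continuous value to a level in 0..k-1 (right-open intervals)."""
    level = 0
    for c in cut_points:
        if value >= c:
            level += 1
    return level

# Made-up maximum temperatures in °C
temps = [18.2, 21.5, 24.0, 26.3, 27.9, 29.4, 31.2, 33.8, 35.1]
cuts = quantile_cut_points(temps, 3)
levels = [discretize(t, cuts) for t in temps]
```

Repeating this for several values of k yields the differently discretized datasets that the later steps compare.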

3.3 Step 2: Partitioning the dataset into train and test

The typical machine-learning approach to learning and then evaluating a model is to randomly partition the dataset into two subsets, train and test, where the greater portion (train) is used to estimate the model, and the remaining portion (test) to evaluate the trained model. However, randomly partitioning the instances of spatial and temporal data raises evaluation concerns: the train and test datasets become extremely similar (see Supporting Information), because, owing to correlations in time and space, each instance of the test dataset is highly likely to have a matching instance in the train dataset. The purpose of a test dataset is to simulate how the model performs when applied in practice to a new dataset. If the goal is to make predictions in the future, say next month, we set the test dataset to be the data from the final observations (instances) and let the train dataset be the remaining instances. Namely, we make the train and test datasets time-wise disjoint.
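A time-wise disjoint split can be sketched in a few lines of Python. The record layout (dictionaries with keys `g`, `t` and `y`), the function name `split_by_time` and the years are assumptions made for this illustration only.

```python
def split_by_time(instances, n_test_years=1):
    """Put the last `n_test_years` distinct time values in the test set
    and every earlier instance in the train set."""
    years = sorted({row["t"] for row in instances})
    test_years = set(years[-n_test_years:])
    train = [r for r in instances if r["t"] not in test_years]
    test = [r for r in instances if r["t"] in test_years]
    return train, test

# Made-up instances: two locations observed over three years
data = [{"g": g, "t": t, "y": (g + t) % 2}
        for t in (2010, 2011, 2012) for g in (1, 2)]
train, test = split_by_time(data)
```

No time value appears in both subsets, so a test instance cannot be matched by a train instance from the same year.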

3.4 Step 3: Learning

3.4.1 Step 3.1: Learning the BN structure

For each of the k-level quantified training datasets, we find the structure that results in the lowest BIC or the lowest AIC. Although this can be done by performing an exhaustive search on all possible BN structures (that is, directed acyclic graphs with the response variable and covariates as the node-set), we instead use efficient algorithms, for example, that of Silander and Myllymaki (2012), which is implemented in the R package bnstruct (Franzin et al., 2017). Both BIC and AIC criteria penalize having more parameters, which reduces the chance of overfitting to the training dataset. The choice of BIC or AIC depends on the main goal of the study, the model complexity and the number of instances relative to the number of parameters (Aho et al., 2014).
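The score being minimized can be illustrated with a short Python sketch that evaluates the BIC of a given structure on discrete data, using the lower-is-better convention adopted above. It only scores a structure and does not search over structures; the function name `bic`, the data layout and the toy data are assumptions for this example and are unrelated to the bnstruct implementation.

```python
import math
from collections import Counter

def bic(data, parents, levels):
    """BIC = -2 * log-likelihood + (#free parameters) * log(N); lower is better.
    data: list of dicts; parents: {node: [parent names]};
    levels: {node: number of categories}."""
    n = len(data)
    loglik, n_params = 0.0, 0
    for node, pa in parents.items():
        joint = Counter((tuple(r[p] for p in pa), r[node]) for r in data)
        marg = Counter(tuple(r[p] for p in pa) for r in data)
        # Maximum-likelihood log-likelihood contribution of this node
        loglik += sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
        n_parent_cfgs = math.prod(levels[p] for p in pa)
        n_params += (levels[node] - 1) * n_parent_cfgs
    return -2.0 * loglik + n_params * math.log(n)

# Made-up data in which Y mostly agrees with X
data = ([{"X": 0, "Y": 0}] * 40 + [{"X": 1, "Y": 1}] * 40
        + [{"X": 0, "Y": 1}] * 10 + [{"X": 1, "Y": 0}] * 10)
levels = {"X": 2, "Y": 2}
bic_empty = bic(data, {"X": [], "Y": []}, levels)    # no edge
bic_edge = bic(data, {"X": [], "Y": ["X"]}, levels)  # edge X -> Y
```

On this toy dataset the structure with the edge scores lower despite its extra parameter, because the gain in likelihood outweighs the penalty.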

Note that this approach is computationally infeasible if there are too many variables, for example, more than 25, or too many discretization levels. Then one may, instead, either use a fixed (a priori known) BN structure, for example, naive Bayes, or learn a ‘close-to-optimal’ (a priori unknown) BN on the training dataset using established search algorithms (Table 3, Figure 1). We learn the structure of the a priori unknown networks with the bnlearn package in R (Scutari, 2009). The input to each algorithm is the variables and the corresponding training dataset, and the output is a BN structure whose nodes are the variables. In case the learned structure contains undirected links, we randomly assign directions as long as directed cycles and v-structures do not appear. This is because BNs must not contain cycles by definition, and the introduction of v-structures can change the performance of the resulting BN (Koller & Friedman, 2009). So for each discretization level $k$, we obtain a BN structure according to one of the algorithms or fixed structures in Table 3.
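The two conditions for orienting an undirected link can be checked as in the Python sketch below. It assumes the partially directed graph is given as a set of directed (parent, child) pairs plus a set of all adjacencies; the function names and the three-node example are hypothetical and do not reproduce bnlearn's internal procedure.

```python
def creates_cycle(directed, u, v):
    """True if adding u -> v closes a directed cycle, i.e. v already reaches u."""
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(b for (a, b) in directed if a == node)
    return False

def creates_v_structure(directed, adjacent, u, v):
    """True if u -> v would form u -> v <- w with u and w not adjacent."""
    return any(b == v and a != u and frozenset((a, u)) not in adjacent
               for (a, b) in directed)

def can_orient(directed, adjacent, u, v):
    """An undirected link u - v may become u -> v only if neither problem arises."""
    return (not creates_cycle(directed, u, v)
            and not creates_v_structure(directed, adjacent, u, v))

# Hypothetical graph: A -> B is directed, B - C is still undirected
directed = {("A", "B")}
adjacent = {frozenset(("A", "B")), frozenset(("B", "C"))}
ok_b_to_c = can_orient(directed, adjacent, "B", "C")
ok_c_to_b = can_orient(directed, adjacent, "C", "B")
```

Here only B -> C is admissible, since C -> B would create the v-structure A -> B <- C.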

TABLE 3. Bayesian networks to compare with the learned one

| Abbreviated name | Based on the algorithm/structure | Type of the algorithm | Description |
| --- | --- | --- | --- |
| GS | Grow shrink | Constraint based | Uses conditional independence tests on the training dataset to detect the Markov blankets of the variables (Margaritis & Thrun, 1999) |
| IAMB | Incremental association Markov blanket | Constraint based | Detects Markov blankets with an attempt to avoid false positives, that is, false infestation predictions (Tsamardinos et al., 2003) |
| IIAMB | Interleaved incremental association Markov blanket | Constraint based | A variant of IAMB that keeps the size of the Markov blanket as small as possible (Tsamardinos et al., 2003) |
| HC | Hill climbing | Local search | Starts from a random directed graph and adds or removes an edge only if it results in a better score (BIC in our case) on the train dataset (Margaritis, 2003) |
| CL | Chow-Liu | Global search | Finds the undirected spanning tree of the variables that minimizes the Kullback–Leibler distance from the actual distribution (Chow & Liu, 1968; Figure 1) |
| NB | Naive Bayes | | The most basic yet often successful BN, formed by the response variable ($T_{g,t}$ in our case) linking to all of the covariates (Koller & Friedman, 2009; Figure 1) |
| TAN | Tree-augmented naive Bayes | | An NB network with a spanning tree among the covariates that can be learned from the train dataset (Friedman et al., 1997; Figure 1) |
FIGURE 1. Structure of different Bayesian networks. CL: Chow-Liu, NB: naive Bayes, TAN: tree-augmented naive Bayes and OI: one-memory infestation (Section 4). Grey and white circles represent the target urn:x-wiley:2041210X:media:mee313509:mee313509-math-0103 and its covariates. In the OI case, the covariates are urn:x-wiley:2041210X:media:mee313509:mee313509-math-0104 and urn:x-wiley:2041210X:media:mee313509:mee313509-math-0105

3.4.2 Step 3.2: Learning the BN parameters

After finding the best-scoring BN structure for each of the k-level quantified training datasets, we learn the associated CPD parameters on the same training dataset and denote the resulting BN by $\mathcal{B}_k$. We use the Bayesian parameter estimation approach (Koller & Friedman, 2009), implemented in bnlearn. As a result, for each quantization level k, we obtain a BN $\mathcal{B}_k$ that best fits the training data in terms of BIC, AIC or the other criteria listed in Table 3.
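One common form of Bayesian parameter estimation for a discrete CPD is the posterior mean under a symmetric Dirichlet prior, sketched below in Python. The pseudo-count of 1, the function name `estimate_cpd` and the toy data are assumptions for illustration; they are not necessarily the prior settings used with bnlearn in this study.

```python
from collections import Counter

def estimate_cpd(data, node, parents, node_levels, pseudo_count=1.0):
    """Posterior-mean CPD under a symmetric Dirichlet prior:
    P(node = v | parents = cfg) = (N[cfg, v] + a) / (N[cfg] + a * |levels|)."""
    counts = Counter((tuple(r[p] for p in parents), r[node]) for r in data)
    totals = Counter(tuple(r[p] for p in parents) for r in data)
    cpd = {}
    for cfg, total in totals.items():
        denom = total + pseudo_count * len(node_levels)
        cpd[cfg] = {v: (counts[(cfg, v)] + pseudo_count) / denom
                    for v in node_levels}
    return cpd

# Made-up data: Y = 0 is never observed together with T = 1
data = [{"T": 0, "Y": 0}] * 3 + [{"T": 0, "Y": 1}] + [{"T": 1, "Y": 1}] * 3
cpd = estimate_cpd(data, "Y", ["T"], node_levels=[0, 1])
```

Unlike the maximum-likelihood estimate, the unseen combination still receives a non-zero probability (0.2 here), which avoids zero-probability predictions on the test dataset.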

3.5 Step 4: Evaluation

How do we choose among the different BNs $\mathcal{B}_k$ (one per discretization level k) from the previous step? Namely, what number of discretization levels results in ‘the best’ BN? We cannot compare them directly using a performance measure that involves the likelihood of the data, for example, log-likelihood, AIC and BIC, because the $\mathcal{B}_k$ do not use the same data but differently discretized versions of it.

However, all BNs use the same number of discretization levels for the response variable. So we can compare them based on how well they predict the response variable on the test dataset. Each network allows us to compute $P(Y_{g,t} \mid \mathbf{X}_{g,t})$, that is, the probability of the observed response variable given the covariates, for every instance in the dataset. Correspondingly, we compare the area under receiver operating characteristic curve (AUROC or simply AUC; Bradley, 1997; Metz, 1978) score of the BNs on the test dataset (see Supporting Information). The choice of AUC is to make our results comparable with the large body of literature using this performance score as the final performance of a classifier. For each discretization level k, we calculate the AUC score of $\mathcal{B}_k$ and pick the highest-scoring one as our final BN. If there is a tie between the top BNs, we break it by looking at the area under precision-recall curve (AUPR; Raghavan et al., 1989; Saito & Rehmsmeier, 2015) scores; that is, among the top BNs with a deficit of at most, say 0.01, from the top AUC, we pick the one with the highest AUPR. The AUPR score better handles unbalanced data by looking at precision rather than the false positive rate (Davis & Goadrich, 2006; Saito & Rehmsmeier, 2015).
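The selection rule can be sketched in Python as follows. The AUC is computed through its rank interpretation (the probability that a random positive instance is scored above a random negative one), and the final model is picked with the 0.01 tolerance described above. The function names and all scores in the example are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def pick_final(candidates, tolerance=0.01):
    """candidates: {k: (auc, aupr)}. Among the models whose AUC is within
    `tolerance` of the best AUC, return the k with the highest AUPR."""
    best_auc = max(a for a, _ in candidates.values())
    near_top = {k: v for k, v in candidates.items()
                if best_auc - v[0] <= tolerance}
    return max(near_top, key=lambda k: near_top[k][1])

example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Made-up (AUC, AUPR) pairs for three discretization levels
chosen_k = pick_final({4: (0.880, 0.25), 5: (0.875, 0.30), 6: (0.860, 0.40)})
```

In this made-up comparison, k = 5 is chosen: it is within 0.01 of the top AUC and has a higher AUPR than k = 4, whereas k = 6 is excluded for falling too far behind on AUC.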

Given the temporal nature of our task, we evaluate the final model on a single test dataset, as explained in Section 3.3. If instead, one divides the original dataset into several yearly separated folds and uses cross-validation to obtain the AUC and AUPR values for each fold, then one could also provide confidence intervals for the reported AUC and AUPR values.

3.6 Step 5: Interpretation

Given an index g and year t, the final BN obtained from the above steps determines the joint probability distribution of the response variable $Y_{g,t}$ and covariates $\mathbf{X}_{g,t}$. Perhaps the most important implication of the obtained BN is the division of the covariates $\mathbf{X}_{g,t}$ into primary covariates $\mathbf{X}^{\mathrm{p}}_{g,t}$ and secondary covariates $\mathbf{X}^{\mathrm{s}}_{g,t}$ with respect to the response variable. Namely, if we know the primary covariates, there is no need to know the secondary covariates, that is,
$$P(Y_{g,t} \mid \mathbf{X}^{\mathrm{p}}_{g,t}, \mathbf{X}^{\mathrm{s}}_{g,t}) = P(Y_{g,t} \mid \mathbf{X}^{\mathrm{p}}_{g,t}). \tag{4}$$
Moreover, other conditional independencies between the covariates themselves can be identified based on the d-separations of the BN (Koller & Friedman, 2009).
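The primary covariates correspond to the Markov blanket of the response node, that is, its parents, its children and the other parents of its children. A minimal Python sketch of reading the blanket off a structure follows; the edge list is a hypothetical toy network and not the network learned in the case study.

```python
def markov_blanket(edges, target):
    """Parents, children and the children's other parents of `target`
    in a DAG given as a set of (parent, child) pairs."""
    parents = {a for a, b in edges if b == target}
    children = {b for a, b in edges if a == target}
    spouses = {a for a, b in edges if b in children and a != target}
    return parents | children | spouses

# Hypothetical DAG with response node "Y"
edges = {("T", "Y"), ("Y", "M"), ("D", "M"), ("H", "T"), ("W", "H")}
blanket = markov_blanket(edges, "Y")
secondary = {n for e in edges for n in e} - blanket - {"Y"}
```

In this toy network, T, M and D are primary, while H and W are secondary: given the blanket, they carry no further information about Y.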
Also, based on the CPDs, we can investigate which covariates, once known, increase the probability of the response variable the most. For example, consider the covariate temperature urn:x-wiley:2041210X:media:mee313509:mee313509-math-0118, discretized into the two ranges [20°C, 30°C) and [30°C, 40°C). We can see how the response variable $Y_{g,t}$ depends on this covariate by sweeping through these quantified levels, for example,
urn:x-wiley:2041210X:media:mee313509:mee313509-math-0120(5)
Hence, the response variable being equal to urn:x-wiley:2041210X:media:mee313509:mee313509-math-0121 is most likely when temperature is in the range urn:x-wiley:2041210X:media:mee313509:mee313509-math-0122. Note that this is only if other covariates are unknown. Now, comparing this with the similar probability conditioned on a different covariate clarifies which is more informative to the response variable.

3.7 Step 6 (optional): Sensitivity analysis

We examine the prediction accuracy (AUC and AUPR) of the best model when a primary covariate becomes unobservable. This roughly shows the contribution of each covariate to the prediction, although it is, indeed, the co-effect of all the covariates that leads to accurate predictions.
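How a BN keeps predicting when a covariate is unobservable can be illustrated with a naive Bayes network, used here only as a simple stand-in for the learned structure. In that special case, a missing covariate is marginalized exactly by skipping its factor. All names and probabilities in the Python sketch below are made up.

```python
def nb_posterior(prior, cpds, observed):
    """P(Y | observed) for a naive Bayes network.
    prior: {y: P(y)}; cpds: {covariate: {y: {value: P(value | y)}}};
    observed: {covariate: value}. Unobserved covariates are skipped, which
    is exact marginalization here because sum_x P(x | y) = 1."""
    unnorm = {}
    for y, p in prior.items():
        for name, value in observed.items():
            p *= cpds[name][y][value]
        unnorm[y] = p
    z = sum(unnorm.values())
    return {y: p / z for y, p in unnorm.items()}

prior = {0: 0.8, 1: 0.2}
cpds = {
    "temp": {0: {"low": 0.7, "high": 0.3}, 1: {"low": 0.2, "high": 0.8}},
    "last": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}},
}
full = nb_posterior(prior, cpds, {"temp": "high", "last": 1})
partial = nb_posterior(prior, cpds, {"temp": "high"})  # "last" is missing
```

Recomputing AUC and AUPR from predictions such as `partial` over the whole test dataset, one hidden covariate at a time, gives the kind of sensitivity analysis described above.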

3.8 Step 7 (optional): Comparison with simple Bayesian networks

To further assess the prediction performance of the final BN, we may compare its AUC (or AUPR) with that of simple BNs consisting of a single or two covariates linked to the response variable. These BNs might be considered as the ‘null model’.

Recall that our final BN is designed to perform a generative task (that is, to reveal the relationships between the variables), not a discriminative task (that is, to predict the response variable). However, if the BN performs well on the first, it is likely to also do well on the second. Yet, the opposite does not hold (Ng & Jordan, 2002). So even if any of these simple BNs predicts the response variable better than our final BN, it does not call into question the capability of our BN to explain the probabilistic relationships between the variables. The same may hold in the previous optional step: the AUC score of the BN may increase after removing some of the covariates. This can also be explained by the fact that our final BN is the best fit to the data under the performance score that we used, which is BIC (or AIC), not AUC.

Nevertheless, in such cases, we may train a BN with a different set of covariates for prediction purposes. For example, we may find that subset of the covariates that results in a BN scoring the highest AUC on the training dataset.

4 THE MOUNTAIN PINE BEETLE CASE STUDY

We illustrate the learning and interpretation of BNs via the data on the MPB infestation in the Cypress Hills park, an interprovincial park located in Alberta and Saskatchewan (Figure S2). Endemic-level populations of MPB have existed in Cypress Hills since the 1980s. However, an MPB outbreak started in 2006 and propagated in the park, where it continues to the present.

4.1 Biology and management

Mountain pine beetle presents two main population phases: an endemic phase with small population size where beetles attack weak and stressed pines with the help of other bark beetles, and an epidemic phase where the number of individuals is large enough to overcome the defences of large and healthy pines (Safranyik & Carroll, 2006). In summer, beetles will emerge from a tree, mate and attack new pines to lay eggs in galleries under the bark. New MPB infestations are reported to frequently appear in south- and west-facing slopes (Safranyik, 2004). During the tree growing season, water stress negatively impacts the pine's ability to build its defence against bark beetles (Lusebrink et al., 2016; Safranyik, 1978). Indeed, pines use water to make a toxic resin that is exuded during a beetle attack to prevent beetles from attracting conspecifics and inhibit the formation of galleries and oviposition (Erbilgin et al., 2017; Raffa & Berryman, 1983). MPB emergence and flights are reduced with high temperatures during the dispersal season (Safranyik & Carroll, 2006). MPB can disperse at short distances within a stand or, more rarely, fly above the canopy to use the wind to travel long distances of the order of tens to thousands of kilometres (Robertson et al., 2007; Safranyik & Carroll, 2006). Once the eggs are laid, the adults die. Over the fall, winter and spring, eggs become larvae then pupae before finishing their transition to adult and emerging in the summer. Individuals need a minimum of 833 degree days to complete their transition to adult (Safranyik et al., 1975, 2010).

The Forest Service Branch of the Saskatchewan Ministry of Environment follows a strict direct control approach. At the start of every fall, the park is surveyed aerially to collect geo-referenced data on red-top trees, that is, trees that are dead or dying from an MPB infestation in the previous year. Then, on the ground, managers survey 50-m radius circular plots around each red-top tree to find trees recently infested during the summer. The newly found infestations are later controlled in late fall/winter using a fell and burn method.

Our goal is to provide a set of covariates that potentially impact the MPB infestation in the Cypress Hills area, understand how they are related to each other and to the infestation, and find which covariates are sufficient for an accurate prediction. We are also interested in testing some of the claims in the literature, for example, that lower humidity increases the chances of infestation (Lusebrink et al., 2016), and in finding what values of the highly correlated covariates degree days and maximum temperature, which are typically not included together in a model, make infestation most likely. These objectives are well suited to BNs.

4.2 Methods

We divide the study area into 100 m × 100 m squares (pixels) and label them by g = 1, 2, …. We choose 1 year as our time unit and define the response variable urn:x-wiley:2041210X:media:mee313509:mee313509-math-0123 as the presence or absence of infestation in pixel g in the fall of year t. We use the covariates listed in Table 4 and quantify them into urn:x-wiley:2041210X:media:mee313509:mee313509-math-0124 levels. Our data include the values of urn:x-wiley:2041210X:media:mee313509:mee313509-math-0125 and urn:x-wiley:2041210X:media:mee313509:mee313509-math-0126 (Figure 2) over the years urn:x-wiley:2041210X:media:mee313509:mee313509-math-0127 and for 18,317 different pixels g in Cypress Hills, resulting in a total of urn:x-wiley:2041210X:media:mee313509:mee313509-math-0128 instances (see Supporting Information for an instance of the data).

TABLE 4. Description of the covariates urn:x-wiley:2041210X:media:mee313509:mee313509-math-0129

| Name | Symbol | Description | Unit |
| --- | --- | --- | --- |
| Aspect | $A_g$ | Compass direction that the slope at pixel g faces | ° |
| Distance to infested border | $B_g$ | Distance of the centre of pixel g to the border of the whole area of interest that was initially infested (Figure S2) | km |
| Degree days | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0130 | Sum of daily temperatures above 5.5°C from fall of year t − 1 to summer of year t | Celsius degree-day |
| Maximum temperature | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0131 | Highest maximum daily temperature in July and August of year t | °C |
| Wind speed | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0132 | Average daily wind speed in July and August of year t | km/hour |
| Relative humidity | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0133 | Average daily relative humidity in spring of year t | % |
| Cold tolerance | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0134 | An index in [0, 1] representing the ability of the larvae to survive the cold season of year t − 1, as defined in Régnière and Bentz (2007) | |
| Pine cover | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0135 | Pine density in summer of year t | % |
| Managed last year infestation | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0136 | Defined to be 1 if pixel g includes at least one tree that was infested and managed (controlled) at year t − 1, and 0 otherwise (Figure 2) | |
| Missed last year infestation | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0137 | Defined to be 1 if pixel g includes at least one tree that was infested and missed (not controlled) at year t − 1, and 0 otherwise | |
| Missed neighbours' last year infestation | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0138 | MPB's ability to disperse at short distances within a stand, defined as urn:x-wiley:2041210X:media:mee313509:mee313509-math-0139, where urn:x-wiley:2041210X:media:mee313509:mee313509-math-0140 are those pixels that are essentially at a distance of i × 100 m from g (Figure S3) | |
| Managed neighbours' last year infestation | urn:x-wiley:2041210X:media:mee313509:mee313509-math-0141 | Defined similarly to urn:x-wiley:2041210X:media:mee313509:mee313509-math-0142, with the difference that urn:x-wiley:2041210X:media:mee313509:mee313509-math-0143 is replaced by urn:x-wiley:2041210X:media:mee313509:mee313509-math-0144 | |
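As an illustration of the degree-days covariate, the Python sketch below accumulates the excess of each daily temperature over the 5.5°C threshold. This follows the usual degree-day convention, which is an assumption here since the table wording does not spell out the exact formula; the function name and the temperatures are made up.

```python
def degree_days(daily_mean_temps, threshold=5.5):
    """Accumulate the excess of each daily temperature over `threshold` (°C);
    days at or below the threshold contribute nothing."""
    return sum(max(0.0, t - threshold) for t in daily_mean_temps)

season = [3.0, 5.5, 7.5, 10.5, 15.5]  # made-up daily means in °C
total = degree_days(season)           # 0 + 0 + 2 + 5 + 10
```

Accumulated over the period from fall of year t − 1 to summer of year t, such a total is the quantity that can be compared with the 833 degree days mentioned in Section 4.1.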
FIGURE 2. Infestation status. Grey and white are used to indicate the presence and absence of infestation in a pixel. (a) None of the trees in pixel urn:x-wiley:2041210X:media:mee313509:mee313509-math-0145 were infested at year urn:x-wiley:2041210X:media:mee313509:mee313509-math-0146 (urn:x-wiley:2041210X:media:mee313509:mee313509-math-0147); however, at least one tree was infested at year urn:x-wiley:2041210X:media:mee313509:mee313509-math-0148 (urn:x-wiley:2041210X:media:mee313509:mee313509-math-0149). (b) All infested trees in pixel 1 that were infested at year urn:x-wiley:2041210X:media:mee313509:mee313509-math-0150 were managed at the same year (urn:x-wiley:2041210X:media:mee313509:mee313509-math-0151, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0152), there were no infested trees in pixel 2 at year urn:x-wiley:2041210X:media:mee313509:mee313509-math-0153 (urn:x-wiley:2041210X:media:mee313509:mee313509-math-0154, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0155), all infested trees in pixel 4 that were infested at year urn:x-wiley:2041210X:media:mee313509:mee313509-math-0156 were missed at the same year, and hence, turned red in the following year (urn:x-wiley:2041210X:media:mee313509:mee313509-math-0157, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0158), some infested trees were missed and some were managed in pixel 5 (urn:x-wiley:2041210X:media:mee313509:mee313509-math-0159, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0160). Missed and managed neighbours' last year infestation for pixel urn:x-wiley:2041210X:media:mee313509:mee313509-math-0161 at year urn:x-wiley:2041210X:media:mee313509:mee313509-math-0162 are, thus, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0163, presuming that urn:x-wiley:2041210X:media:mee313509:mee313509-math-0164

We compare the AUC and AUPR scores of our final model with those of what we call the one-memory infestation (OI) Bayesian network, consisting of urn:x-wiley:2041210X:media:mee313509:mee313509-math-0165 and urn:x-wiley:2041210X:media:mee313509:mee313509-math-0166, being linked to the target urn:x-wiley:2041210X:media:mee313509:mee313509-math-0167, considered as the null model (Figure 1).

4.3 Resulting Bayesian network

We find the BN with the best BIC score on the train dataset with six discrete levels, that is, $\mathcal{B}_6$ (Figure 3), as our ‘best model’ to explain the MPB infestation, with AUC = 0.88 and AUPR = 0.28. The OI model scores 0.75 for AUC and 0.19 for AUPR, both lower than our selected model. According to the structure of $\mathcal{B}_6$, the infestation urn:x-wiley:2041210X:media:mee313509:mee313509-math-0170 in location g at year t is directly connected to urn:x-wiley:2041210X:media:mee313509:mee313509-math-0171, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0172, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0173, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0174, urn:x-wiley:2041210X:media:mee313509:mee313509-math-0175 and urn:x-wiley:2041210X:media:mee313509:mee313509-math-0176. These, together with urn:x-wiley:2041210X:media:mee313509:mee313509-math-0177, form the Markov blanket of the infestation node and, hence, are the primary covariates and sufficient for estimating infestation with a 0.88 AUC score. Other covariates are all indirectly linked to infestation and are secondary covariates. Given $\mathcal{B}_6$, one can obtain conditional independencies of the covariates to infestation using d-separations and plot the CPDs (see Supporting Information).

FIGURE 3. The structure of ‘the best’ Bayesian network ($\mathcal{B}_6$). We choose this structure as the one to explain and predict MPB infestation. The response variable and its Markov blanket are in red and cyan

4.4 Sensitivity to missing covariates

The prediction accuracy of the selected BN $\mathcal{B}_6$ does not deteriorate when the values of any of the secondary covariates are missing. When values of the primary covariates are missing, the model can still accurately predict infestation as it can use some of the secondary covariates (Table 5).

TABLE 5. AUC and AUPR scores of ‘the best’ BN $\mathcal{B}_6$, when one of the primary covariates is missing

| Missing covariate | AUC | AUPR |
| --- | --- | --- |
| Nothing missing | 0.882 | 0.277 |
| Maximum temperature | 0.889 | 0.350 |
| Cold tolerance | 0.881 | 0.290 |
| Distance to infested border | 0.890 | 0.309 |
| Missed neighbours' past infestation | 0.760 | 0.220 |
| Managed neighbours' past infestation | 0.879 | 0.284 |
| Missed last year infestation | 0.811 | 0.103 |
| Managed last year infestation | 0.869 | 0.206 |
| Last year infestation (both missed and managed) | 0.784 | 0.068 |

4.5 Discussion

The final model we have chosen to explain the MPB infestation in the Cypress Hills area is the BN $\mathcal{B}_6$ with six discretization levels, scoring 0.88 AUC on the test dataset. For a managed MPB outbreak in the Cypress Hills area, the model postulates the following covariates as primary (and hence sufficient for a 0.88 AUC prediction) at each location, at each time: (1, 2) presence of infestation in last year, both managed and missed, (3, 4) neighbours' degree of infestation in last year, both managed and missed, (5) distance to the border where the infestation was initiated, (6) maximum temperature in July and August of that year and (7) cold tolerance in the cold season of that year. The remaining covariates are secondary and are used to predict infestation if one or more of the primary covariates are missing.

Given this BN, we can provide a wide range of ceteris paribus claims revealing the co-effects of the covariates on the presence of infestation (see Supporting Information). For example, if we know maximum daily temperature is high (above 31.2°C), the interval of relative humidity that results in the highest infestation risk sharply changes from medium to low. This is in line with the claim of Lusebrink et al. (2016) and Safranyik (1978) that lower humidity increases the infestation probability. However, for maximum daily temperatures lower than 31.2°C, the infestation likelihood is high for both low and high relative humidity. This inconsistency can be resolved by looking at maximum temperature and relative humidity together. We find that humid areas require low maximum daily temperature, while dry areas require high maximum daily temperature for a considerable risk of infestation (above 20%).

As another example, an MPB needs 833 degree days to complete its transition to adult, and the minimum number of degree days in the data is 1,054 (Safranyik et al., 1975, 2010). Therefore, degree day never prevents infestation in our data and just reflects the negative impact of high summer temperatures. This, however, does not mean that degree day is useless in our model. First, as mentioned earlier, in the absence of some of the primary covariates, the model effectively estimates infestation via the information on degree day and other present covariates. Second, although highly correlated, degree day and maximum temperature are different, and the model reveals their joint effect on the infestation: for low (resp. high) degree days, infestation becomes more likely as maximum temperature increases (resp. decreases; see Supporting Information).

We emphasize that one may not draw causal conclusions based on the structure of the model. Clearly, the edge from infestation to managed-last-year-infestation does not imply that this year's infestation has caused last year's (managed) infestation. It only means that the two are probabilistically dependent. The same holds for all other links, such as the one from maximum temperature to infestation: although temperature may be ‘causing’ infestation, one may not conclude so just based on the BN. One may refer to the literature on causality and the corresponding tests in order to verify the causality of a link in a BN (Pearl, 2009; Pearl & Mackenzie, 2018). Moreover, the absence of an edge between, for example, degree day and infestation does not necessarily mean that the two are independent. They may be dependent but become conditionally independent if some other covariates are known.

In summary, the learned BN contributes to the prediction and understanding of MPB infestations by (1) accurately predicting MPB infestations, (2) identifying the primary set of covariates that are sufficient for making these predictions, (3) making acceptable predictions when data on some of the primary covariates are unavailable, (4) revealing the previously unknown co-effects of the covariates on infestation likelihood, (5) identifying the most informative covariate(s) to infestation likelihood and (6) proposing a BN structure that can serve as the basis for future causality tests between the variables. Points 1, 2, 3 and 5 are particularly useful to forest managers to plan ahead of time and know what data to collect. See SI for a more elaborate discussion on the MPB case study.

Nevertheless, as with almost all other machine-learning models, BNs are generally constructed under the stationarity assumption, implying fixed structure and parameters over time. This may result in poor performance when the model is used to make predictions at a time different from those in the training dataset, provided that the ‘true ecological process’ is non-stationary. For example, a BN trained on data collected during the beginning of an outbreak may not accurately predict the declining phases of the outbreak. Similar concerns are raised when using the learned BN in environmental situations where the ranges of the covariates are very different from those in the training dataset. We refer the reader to Robinson et al. (2010), Zhou et al. (2008) and Zhu and Wang (2015) for relaxing the stationarity assumption.

5 DISCUSSION

Although traditional models used to make ecological predictions from underlying covariates have a record of success, they also suffer from limitations. They cannot make predictions when one or more covariates are missing, unless the missing values are imputed using other methods, which can be unreliable and result in low prediction accuracy. They also do not allow for statistical inference when some of the covariates are highly correlated. BNs can handle these issues. Specifically, they provide a primary and secondary ordering of the covariates, where primary covariates are essential to predicting the target variable and secondary covariates, while not always essential, can be helpful in making predictions when the values of some covariates are missing.

However, BNs are not used to their full potential in the literature as their structure is typically constructed based on the knowledge of experts. Moreover, the obtained BN is often read causally, a questionable practice as BNs are different from causal networks.

We have complemented previous work by providing a systematic approach to obtain a BN fully from data. We have demonstrated the approach via an MPB case study, where no knowledge of experts was involved in finding either the structure or CPDs. The resulting BN predicts infestations fairly accurately, even in the absence of any one of the selected covariates that are involved in the model.

Researchers have utilized BNs to visualize their understanding of the causal relationships between the variables involved in ecological processes (Amstrup et al., 2008; Aps et al., 2009; Borsuk et al., 2004; Johnson et al., 2010; Newton, 2010; Pollino, Woodberry, et al., 2007). The resulting networks have often been used as predictors and sometimes reported to be fairly successful on a test dataset. This is an acceptable approach to assess the a priori knowledge of the experts or when there are no data available to learn the BN structure. However, by means of the results for our MPB case study, we challenge claims that put forward this approach as ‘the (only) right one’ for constructing a BN. Examples include the claims that synthesizing existing knowledge into the model is necessary and structural learning is only for modelling poorly understood systems or those difficult to characterize (Chen & Pollino, 2012), that modellers must demonstrate causal relations (McCann et al., 2006), that models based on theories about causal relations are generally better (Uusitalo, 2007) and that network structure is a matter of judgement and should reflect expert knowledge and stakeholder needs (Gutierrez et al., 2011). Some researchers have looked into fixed (naive Bayes) and partially learnable (tree-augmented naive Bayes) structures (Aguilera et al., 2010), yet this is different from learning fully based on data.

In general, for modelling the joint probability distribution of the variables involved in an ecological process, that is, a generative task, BNs seem to be the first and often best candidate, especially if the governing dynamics are not yet understood well enough to be modelled mechanistically. However, if the sole purpose is to predict the response variable, that is, a discriminative task, other models may achieve higher prediction accuracy, although, unlike BNs, they typically cannot deal with missing values in the covariates. We are currently exploring ways to use BNs, as well as other models, to predict infestation many years into the future (Ramazi et al., accepted).
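To illustrate why a generative model can still answer a query when a covariate is missing, the following minimal sketch computes the posterior of a target by summing the joint distribution over every unobserved variable. It is not the network learned in this study: the three-node structure, the variable names (`cold_winter`, `nearby_infested`, `infested`) and all probabilities are invented for illustration, and brute-force enumeration is used only because the example is tiny.

```python
from itertools import product

# Hypothetical network: cold_winter -> infested <- nearby_infested.
# All names and numbers are illustrative assumptions, not values from the study.
P_COLD = {True: 0.3, False: 0.7}      # P(cold_winter)
P_NEARBY = {True: 0.2, False: 0.8}    # P(nearby_infested)
P_INF_TRUE = {                        # P(infested=True | cold_winter, nearby_infested)
    (True, True): 0.4,
    (True, False): 0.05,
    (False, True): 0.8,
    (False, False): 0.2,
}


def joint(cold, nearby, infested):
    """Joint probability as the product of the network's conditional distributions."""
    p_inf = P_INF_TRUE[(cold, nearby)]
    return P_COLD[cold] * P_NEARBY[nearby] * (p_inf if infested else 1.0 - p_inf)


def posterior_infested(evidence):
    """Return P(infested=True | evidence).

    Covariates absent from `evidence` are treated as missing and summed out.
    """
    totals = {True: 0.0, False: 0.0}
    for cold, nearby, infested in product([True, False], repeat=3):
        assignment = {"cold_winter": cold, "nearby_infested": nearby}
        # Keep only the assignments that agree with the observed covariates.
        if all(assignment[name] == value for name, value in evidence.items()):
            totals[infested] += joint(cold, nearby, infested)
    return totals[True] / (totals[True] + totals[False])


# Both covariates observed.
full = posterior_infested({"cold_winter": False, "nearby_infested": True})
# nearby_infested is missing, so it is marginalized out.
partial = posterior_infested({"cold_winter": False})
# No covariate observed: the query falls back to the marginal of the target.
prior = posterior_infested({})
```

The same function serves all three queries because the model encodes the full joint distribution. A purely discriminative model trained to map both covariates to the target would, by contrast, need the missing value to be imputed before it could make a prediction.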

ACKNOWLEDGEMENTS

We thank Rory L. McIntosh for providing the mountain pine beetle data needed for the application section. We thank the Greiner and Lewis Research Groups for helpful feedback on ideas related to this research. The research was partly funded by Alberta Environment & Parks (AEP). This research was also supported by a grant to M.A.L. from the Natural Sciences and Engineering Research Council of Canada (grant no. NET GP 434810-12) to the TRIA Network, with contributions from Alberta Agriculture and Forestry, Foothills Research Institute, Manitoba Conservation and Water Stewardship, Natural Resources Canada-Canadian Forest Service, Northwest Territories Environment and Natural Resources, Ontario Ministry of Natural Resources and Forestry, Saskatchewan Ministry of Environment, West Fraser and Weyerhaeuser. M.A.L. is also grateful for support through the NSERC Discovery and the Canada Research Chair Programs. R.G. is grateful for funding from NSERC Discovery and the Alberta Machine Intelligence Institute.

    AUTHORS' CONTRIBUTIONS

    All the authors conceived the ideas, interpreted the results and drafted the manuscript. P.R. developed the methods and undertook the analysis. All the authors gave final approval for publication.

    DATA AVAILABILITY STATEMENT

    The dataset analysed in the current study is described in Kunegel-Lion et al. (2020a) and is available from the Dryad Digital Repository (https://doi.org/10.5061/dryad.70rxwdbt9; Kunegel-Lion et al., 2020b).