Intrinsic inference difficulties for trait evolution with Ornstein‐Uhlenbeck models
Summary
- For the study of macroevolution, phenotypic data are analysed across species on a dated phylogeny using phylogenetic comparative methods. In this context, the Ornstein‐Uhlenbeck (OU) process is now being used extensively to model selectively driven trait evolution, whereby a trait is attracted to a selection optimum μ.
- We report here theoretical properties of the maximum‐likelihood (ML) estimators for these parameters, including their non‐uniqueness and inaccuracy, and show that theoretical expectations indeed apply to real trees. We provide necessary conditions for ML estimators to be well defined and practical implications for model parametrization.
- We then show how these limitations carry over to difficulties in detecting shifts in selection regimes along a phylogeny. When the phylogenetic placement of these shifts is unknown, we identify a ‘large p ‐ small n’ problem where traditional model selection criteria fail and favour overly complex scenarios. Instead, we propose a modified criterion that is better adapted to change‐point models.
- The challenges we identify here are inherent to trait evolution models on phylogenetic trees when observations are limited to present‐day taxa, and require the addition of fossil taxa to be alleviated. We conclude with recommendations for empiricists.
Introduction

The OU process is ideal to model changes in selection regimes, such as changes in the selection optimum μ between different parts of the tree (as pioneered by Butler & King (2004), see also Scales, King & Butler (2009) for instance), or changes in α or in the variance rate σ2 (Beaulieu et al. 2012). Changes in the selection optimum μ have been modelled elegantly by Hansen, Pienaar & Orzack (2008) and Bartoszek et al. (2012), under the assumption that μ is driven by a continuous predictor whose evolution is modelled with a BM or OU process.
- Consistent estimators: parameter estimates should converge to the true parameter values as more and more data are collected. For this, it is necessary (but not sufficient) that parameters are identifiable.
- Increasing power: tests of particular hypotheses should reach any desired power provided that enough data are collected.
- Accurate model selection using likelihood ratio tests, AIC or BIC.
In this paper, we first review cases for which these properties were proved to break down on phylogenies. We then describe limitations with OU models for trait evolution regarding all three properties: lack of identifiability and consistency, lack of power and inaccurate model selection. Our simulations show that the power to detect phylogenetic changes in the selection optimum depends little on the tree size. Moreover, when the phylogenetic position of the changes is unknown, AIC and BIC fail to select the correct model in favour of overly complex models with many changes. To remedy this problem, we introduce a phylogenetic adaptation of the modified BIC suggested by Zhang & Siegmund (2007). Other recommendations are provided to help empiricists handle these various limitations.
Review of known statistical limitations
Limitations in inferring ancestral states have been reported very early in empirical studies because large standard errors were observed (e.g. Schluter et al. 1997; Garland & Ives 2000). Confidence intervals for ancestral states can be so wide as to span the range of observed present‐day values, so much so that ‘[...] it might be best to accept our limitations and not even try to estimate ancestral states from comparative data’ (Martins 2000). These limitations, due to phylogenetic dependence among taxa, have recently been explained theoretically. Ané (2008) showed that under BM and some general conditions, the ancestral state is not estimated consistently as the tree grows indefinitely. More specifically, the variance of the maximum‐likelihood estimator (MLE) of the ancestral state cannot be lower than σ2t/k where k is the number of daughters of the root node and t is the length of the shortest branch stemming from the root. This result proves that the ancestral state reconstruction accuracy is very limited, unless evolution proceeded slowly or the ancestral node of interest consists of a large polytomy, or unless fossil taxa can be included in the analysis. The accuracy loss due to phylogenetic dependence can be substantial. Using the phylogeny from Bininda‐Emonds et al. (2007) with 4507 species for instance (Fig. 4, left) and under BM evolution, the reconstruction at the base of the mammal tree has the same accuracy as that obtained from only 5 independent taxa, if such existed, from a star tree of the same height. With marsupials excluded (4249 species), trait reconstruction for the non‐marsupial mammal ancestor has the same accuracy as would be obtained from about 19 independent species. Interestingly, this number of equivalent independent observations was shown to be lower than the adjusted degree of freedom defined by Paradis & Claude (2002) as the tree length to tree height ratio, when the tree is ultrametric.
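The lower bound σ2t/k can be checked numerically on a toy tree. The sketch below is illustrative only: the tree shape (k clades attached to the root by stems of length t, each clade a star of m tips) and the function names `bm_covariance` and `root_variance` are assumptions made for this example, not taken from Ané (2008). It computes the variance of the generalized least squares estimate of the root state under BM and expresses it as an equivalent number of independent tips on a star tree of the same height.

```python
import numpy as np

def bm_covariance(k, m, t, T, sigma2=1.0):
    """BM covariance among tips of a toy tree (hypothetical shape):
    the root has k daughter branches of length t, each ending in a star
    of m tips whose terminal branches have length T - t."""
    n = k * m
    V = np.zeros((n, n))
    for c in range(k):
        block = slice(c * m, (c + 1) * m)
        V[block, block] = sigma2 * t      # tips in one clade share the stem branch
    V[np.diag_indices(n)] = sigma2 * T    # every tip is at distance T from the root
    return V

def root_variance(V):
    """Variance of the GLS estimate of the root state: 1 / (1' V^-1 1)."""
    ones = np.ones(V.shape[0])
    return 1.0 / (ones @ np.linalg.solve(V, ones))

k, t, T, sigma2 = 2, 0.2, 1.0, 1.0
bound = sigma2 * t / k                    # lower bound sigma2 * t / k
for m in (5, 50, 500):
    v = root_variance(bm_covariance(k, m, t, T, sigma2))
    # On a star tree of height T with n_e independent tips, the variance
    # would be sigma2 * T / n_e, hence n_e = sigma2 * T / v.
    print(f"n = {k * m:4d}  variance = {v:.4f}  bound = {bound:.4f}  "
          f"equivalent independent tips = {sigma2 * T / v:.1f}")
```

On this toy tree the variance approaches the bound as tips are added within clades but never goes below it, so the equivalent number of independent tips stays small however many species are sampled.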
Under the OU model, Ho & Ané (2013) proved a similar issue with the ancestral state and the selection optimum μ. If the tree is ultrametric and the height of the tree is bounded, μ cannot be estimated consistently as the tree grows indefinitely, regardless of the estimation method. They also proved that the presence of fossil taxa is crucial to increase precision in μ. Slater & Harmon (2013) point to many other benefits of integrating data from both fossils and extant taxa.
While contemporary species bear limited information on ancestral states and optimal values, rates of evolution and correlation among traits are typically much easier to estimate (also shown experimentally by Oakley & Cunningham 2000). Under BM evolution, independent contrasts (Felsenstein 1985) have been used to detect significant correlations in many studies, too numerous to count. Many studies also successfully detected shifts in the rate of trait evolution σ2 (e.g. O'Meara et al. 2006; Davis et al. 2007; Eastman et al. 2011; Venditti, Meade & Pagel 2011). Indeed, the precision in estimated rates and regression coefficients is known to increase with the square root of the number of taxa, unlike ancestral state estimates (Ané 2008).
Inference on the level of phylogenetic correlation from contemporary taxa is also challenging. One common way to measure phylogenetic signal is through a λ parameter (Freckleton, Harvey & Pagel 2002), where λ = 0 corresponds to no phylogenetic correlation and λ = 1 corresponds to the BM model. Using simulations, Boettiger, Coop & Ralph (2012) showed clearly that λ estimates can be very uncertain and typically have a lot less precision than estimates of the variance rate σ2.
Another common way to measure phylogenetic signal is through the OU model (e.g. Lavin et al. 2008), with α = ∞ corresponding to no phylogenetic correlation and α = 0 to the BM model. Here again, Ho & Ané (2013) proved that, in some situations, α is not consistently estimable (while the variance rate σ2 might be), and Boettiger, Coop & Ralph (2012) demonstrated with simulations that the power to detect α>0 can be disappointingly low even on very large trees.
Traditional methods for model selection are also challenged by phylogenetic dependence. The standard BIC was proved to be inappropriate under BM evolution (Ané 2008). Its tendency to select overly simple models was explained by its penalty increasing with the number of present‐day species instead of a number of equivalent independent observations. Both AIC and BIC were shown in simulations to result in a strong bias for overparametrized models when applied to OU models with multiple selection regimes (Boettiger, Coop & Ralph 2012), casting doubts on the appropriateness of these criteria for phylogenetic comparative methods. Given these limitations when selection parameters are constant along the whole tree, precision and model selection issues are expected to be even more pronounced under complex OU models with multiple selection regimes in the tree (Butler & King 2004; Hansen, Pienaar & Orzack 2008; Bartoszek et al. 2012; Beaulieu et al. 2012).
(Un)identifiability of selection regimes
(eqn 1)
and
(obtained with the R function phylolm, Ho & Ané 2014). However, every point (y0,μ) lying on the line
maximized the likelihood (Fig. 1). This line formed a ridge of the likelihood surface. Hence, any good optimization procedure should report some convergence issue, and any one reported MLE value on this line is obviously not necessarily a good estimate for the true value of (y0,μ).

is where the likelihood achieves its maximum.
Identifiability of selection optima
More generally, non‐identifiability occurs whenever X is not of full rank, in which case the likelihood surface has a hyperplane ridge and the MLE is undefined. The unidentifiability of the ancestral state and selection optima can occur in models where μ takes different values (μ1,μ2,…,μm) along different parts of an ultrametric tree, even when the location in the phylogeny of the different regimes is known. It is the case when the part of the tree under the influence of μℓ forms a connected subtree for every ℓ (Fig. 2 left, proof in Appendix A1).

Note that connected subtrees do not need to form clades. The condition of connected subtrees is equivalent to there being a minimal number m−1 of changes of selection regimes along the tree. This is the case, for instance, if there is only one shift in the selection optimum, dividing the tree into 2 connected subtrees, each with its own selection regime. In this common situation, the ancestral state and the two selection optima are not separately identifiable.
On the contrary, OU models with multiple regimes are identifiable when selection regimes are not perfectly correlated with the tree, with at least one regime covering two disconnected subtrees (Fig. 2 right). However, we need to emphasize that identifiability only guarantees the uniqueness of the ML estimator. It does not guarantee its performance. In fact, the next section discusses situations when the MLE has poor precision even on large trees.
If we relax the assumption of known location(s) in the phylogeny for the putative shifts in selection regimes, then some of these location parameters also become unidentifiable, such as the precise timing of a shift along a branch or the number of shifts along a single branch (see Appendix A1). Also, alternative scenarios with separate shifts along adjacent edges might not be distinguishable. In particular, 2 shifts along 2 sister edges have the same signature of expected values at extant taxa as one shift on either of the 2 sister edges and the other shift earlier along the parent edge. We keep this in mind for model selection below.
Diagnostics and reparametrization
A lack of identifiability can be diagnosed by a lack of convergence during ML optimization, as algorithms may search the ridge of the likelihood surface forever. This might explain some convergence failures and the lack of reasonable estimates for some μℓ's reported by Butler & King (2004) or Beaulieu et al. (2012). Luckily, in studies using several selective regimes, hypotheses typically have at least one regime forming a disconnected subtree – see for instance the study of fibre‐type composition of iliofibularis muscle in lizards (Fig. 1 in Scales, King & Butler 2009). These models are thus identifiable with unique ML parameter estimates. Another source of non‐identifiability is when α = 0, because μ has no influence on the trait when selection is not acting. This can cause an almost flat ridge in the likelihood surface for very low estimates of α (Butler & King 2004). Note also that a ridge in the likelihood plagues any Bayesian method, with the prior distribution having complete influence over which values get supported along the ridge.
To fix this lack of identifiability, we can reparametrize the model to use a new design matrix of full rank. For example, with a single regime, we can re‐express the model by using (β0,α,σ2) instead of (y0,μ,α,σ2) where β0 = y0 e^{−αT} + μ(1 − e^{−αT}). With this new parametrization, Y = β0 1 + e, where 1 is the vector of ones, and β0 is now identifiable with a unique MLE.
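The rank deficiency behind this reparametrization can be made concrete with a small numerical sketch. The function names `ou_design`, `beta0` and `mu_on_ridge` and the values of α, T and n are hypothetical, chosen only for illustration. On an ultrametric tree every tip is at the same distance T from the root, so the two columns of the design matrix for (y0, μ) are proportional, and every pair (y0, μ) giving the same β0 produces the same expected trait values.

```python
import numpy as np

def ou_design(n, alpha, T):
    """Design matrix for (y0, mu) when all n tips are at distance T from the
    root: E(Y_i) = y0 * exp(-alpha*T) + mu * (1 - exp(-alpha*T))."""
    w = np.exp(-alpha * T)
    return np.column_stack([np.full(n, w), np.full(n, 1.0 - w)])

def beta0(y0, mu, alpha, T):
    """Expected trait value at the tips, the identifiable combination."""
    w = np.exp(-alpha * T)
    return y0 * w + mu * (1.0 - w)

def mu_on_ridge(y0, target, alpha, T):
    """Value of mu that, paired with y0, yields the expected tip value `target`."""
    w = np.exp(-alpha * T)
    return (target - y0 * w) / (1.0 - w)

alpha, T, n = 1.0, 1.0, 6                     # illustrative values only
X = ou_design(n, alpha, T)
print("rank of the (y0, mu) design matrix:", np.linalg.matrix_rank(X))

target = beta0(0.0, 1.0, alpha, T)
for y0 in (-2.0, 0.0, 3.0):
    mu = mu_on_ridge(y0, target, alpha, T)
    # Different (y0, mu) pairs, identical expected values at the tips.
    print(f"y0 = {y0:5.2f}  mu = {mu:6.3f}  beta0 = {beta0(y0, mu, alpha, T):.6f}")
```

The single column of ones used with β0 has full rank, which is why β0 has a unique estimate even though y0 and μ do not.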
To illustrate, we reanalysed flower diameter of 25 Euphorbiaceae species under an OU model with ancestral optimum size μ0 and a shift to a larger optimum flower size μ1 at the base of the 3‐species Rafflesiaceae clade (see Fig. 1 in Davis et al. 2007). As noted before, y0, μ0 and μ1 are not identifiable separately. We reparametrize the model using (β0,β1,α,σ2) where β0 = y0 e^{−αT} + μ0(1 − e^{−αT}) is the expected flower size in non‐Rafflesiaceae and β1 = (1 − e^{−αt})(μ1 − μ0) is the difference in expected flower diameter between Rafflesiaceae and non‐Rafflesiaceae. Here T is the age of the tree and t the age of the shift, which we do not need to estimate with our reparametrization. The model can be written Y = β0 1 + β1 X1 + e where X1 is a column of zeros and ones, with ones corresponding to Rafflesiaceae. We rescaled branch lengths in the phylogeny to a tree height of 1 and used phylolm (Ho & Ané 2014) on the log‐transformed diameters (as in Davis et al. 2007). We obtained
β̂0 = 0·88 (that is, exp(0·88) = 2·41 mm),
β̂1 = 4·4 (that is, an exp(4·4) = 81·2‐fold size increase in extant taxa),
and
, which indicates weak phylogenetic correlation (
of the tree height).
Another approach is to assume that the ancestral state y0 is random according to some prior distribution. If the OU process is homogeneous with a single selection optimum, the stationary distribution of that process is a natural choice for the prior distribution of y0: Gaussian with mean μ and variance σ2/(2α). However, the stationary distribution has no clear definition when there are several regimes. To use this approach for the flower diameter data above, we assumed that y0 followed a normal distribution with mean μ0 and variance σ2/(2α). By doing this, we drop y0 from the set of parameters, but keep the selection optima μ0, μ1. We also need to keep the shift age t in the model, which we assumed to be 0·605, at the base of Rafflesiaceae. We obtained the same
,
as before,
and
(that is, a 196‐fold increase in optimal flower size).
Another key ingredient to restore identifiability is to add fossil data, whenever possible. Fossil taxa can provide very influential information on both y0 and μ (or μℓ's) because they are at a shorter distance T from the root and under a lesser influence of μ than contemporary tips.
(Non)‐Microergodicity of selection parameters
Unlike in traditional regression models with independent residuals, identifiability does not guarantee that the MLE converges to the true parameter when the sample size increases indefinitely. A requirement stronger than identifiability is needed instead, historically called microergodicity (Stein 1999). To understand this concept, we first consider two simple examples.
Example 1: Assume that Y1, Y2, …, Yn are independent binary observations with unknown mean p = P{Yi = 1}, and P{Yi = 0} = 1−p. In this case, the sample mean Ȳn = (Y1 + … + Yn)/n is a ‘good’ estimator of p, in the sense that it converges to p as the sample size n increases, by the law of large numbers. Here we keep gaining more precision by collecting more data.
Example 2: Consider repeated binary observations Y1, Y2, …, Yn with mean p = P{Yi = 1}. Assume that Y2 is independent of Y1, but that the remaining observations simply repeat Y1 and Y2 over and over: Y_{2k−1} = Y1 and Y_{2k} = Y2 for k≥1. Here, observing the entire sequence does not provide any more information than observing Y1 and Y2 only, due to the extreme correlation between sampling units. Collecting more data does not increase information, and obviously, there does not exist any estimator f(Y1,…,Yn) that would converge to the true value of p.
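The contrast between the two examples is easy to reproduce by simulation. The following minimal sketch uses hypothetical function names (`example1`, `example2`, `sample_mean`) and an arbitrary value p = 0·8: it draws both kinds of sequences and prints the sample mean for increasing n.

```python
import random

def example1(p, n, rng):
    """n independent binary observations with P{Y = 1} = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def example2(p, n, rng):
    """Only Y1 and Y2 are drawn; later observations repeat them in turn."""
    y1 = 1 if rng.random() < p else 0
    y2 = 1 if rng.random() < p else 0
    # 0-based even positions hold Y1, Y3, Y5, ... and all equal Y1.
    return [y1 if i % 2 == 0 else y2 for i in range(n)]

def sample_mean(ys):
    return sum(ys) / len(ys)

rng = random.Random(2014)   # arbitrary seed, for reproducibility only
p = 0.8
for n in (10, 1000, 100000):
    m1 = sample_mean(example1(p, n, rng))
    m2 = sample_mean(example2(p, n, rng))
    print(f"n = {n:6d}  example 1 mean = {m1:.3f}  example 2 mean = {m2:.3f}")
```

In example 2 the sample mean for even n can only equal 0, 0·5 or 1, whatever the value of n, so it cannot settle on an arbitrary p.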
To conceptualize the difference between these two examples and to understand the amount of information carried by comparative data, we use the concepts of orthogonal distributions and of microergodicity.
Definition: Two distributions P1 and P2 are orthogonal if there exists an event A such that P1(A) = 1 and P2(A) = 0.
Intuitively, this means that we can use data and tell with certainty whether these data came from P1 or P2. If the event A is observed, then there is certainty that the model associated with P2 is wrong. But if we observe that A does not occur, then the model associated with P1 is certainly wrong. With example 1, we might want to compare the models where p = p1 = 0·5 vs. p = p2 = 0·8, and call Pi the distribution of the entire sequence (Yn)n≥1 when p = pi (i = 1 or 2). Consider the event A = {Ȳn converges to 0·5}. By the law of large numbers, P1(A) = 1 and P2(A) = 0, so P1 and P2 are orthogonal. This reflects the fact that we can identify whether p = 0·5 vs. p = 0·8 with certainty from the entire sequence (Yn)n≥1 in example 1.
Definition: Let (Yn)n≥1 be an infinite sequence of observations, and let Pθ be the joint distribution of (Yn)n≥1 under some model and parameters collected in a vector θ. A function f(θ) of θ is said to be microergodic if, for every θ1 and θ2 such that f(θ1) ≠ f(θ2), the distributions Pθ1 and Pθ2 are orthogonal.
We think of (Yn)n≥1 as trait values obtained from infinitely many species as a best case scenario, as if we were able to increase taxon sampling indefinitely. With the definition above, a function of parameters f(θ) is microergodic if the full data set (Yn)n≥1 contains enough information to tell the value of f(θ) with certainty. Unless θ is microergodic, there is no good estimator for it (Zhang 2004), even from an infinite sample. Therefore, microergodicity is necessary for a precise estimation of model parameters. In example 1, the mean p is microergodic, as the argument above used to distinguish p = 0·5 from 0·8 can be repeated to distinguish any two values of p. In example 2, p is not microergodic. The dependence between observations maintains uncertainty about the value of p, even from infinitely many Yn observations.
Recently, Ho & Ané (2013) investigated microergodicity for the one‐regime OU model on an ultrametric tree and with a random ancestral state at the root (to avoid the unidentifiability issue mentioned earlier). While this simple model is unlikely to be adequate for most real traits, it is still considered when comparing hypotheses and more complex models are expected to require even more data. Ho & Ané (2013) proved that if the height of the tree is bounded, as when sampling from a group of interest, then the selection optimum μ is not microergodic. Therefore, information from contemporary taxa is not sufficient to estimate μ exactly, even if infinitely many taxa were observable. On the other hand, σ2 was shown to be microergodic in the case when the tree has many ‘young’ internal nodes. Unfortunately, an additional assumption to ensure enough variation in internal node ages was required for α and the stationary variance γ = σ2/(2α) to be microergodic. The microergodicity of α is necessary for a good estimator α̂ to exist, but is it sufficient? This question remains open. In the particular case of a symmetric tree, we can prove that the restricted MLE of α converges to the true α with more data, if the microergodicity condition is met (using tools in Ho & Ané 2013, appendix B).
To illustrate these theoretical results, we simulated data on several very large phylogenies from across the tree of life: a 9993‐species phylogeny of birds (Jetz et al. 2012), a 4507‐species mammal tree (Bininda‐Emonds et al. 2007), an 839‐taxon tree on Fabales (Simon et al. 2009) and a 140‐species phylogeny of ants (Moreau et al. 2006). We also used a 400‐language phylogeny (Gray, Drummond & Greenhill 2009). For simplicity, we present here the results on the largest tree only (9993 taxa). The results on the other phylogenies are similar (see Figs A1–A4).
Twenty sequences of nested phylogenies from 50 to 9993 taxa were created by randomly selecting subsets of taxa from 20 bootstrap trees from Jetz et al. (2012), conditional on the root being the only common ancestor of the selected taxa to guarantee that all trees have the same ancestral species and same height. Trees were rescaled by a common factor to have height 1. Data were simulated from a one‐regime OU model with μ = 0, γ = 1, α = 0·1, 1 or 10 and σ2 = 2αγ. This model is very close to a Brownian motion when α = 0·1 (t1/2 = 6·9 much larger than the tree height T = 1) and very close to phylogenetic independence when α = 10 (t1/2 = 0·069 very small). Figure 3 shows the MLEs μ̂, γ̂, α̂ and σ̂2, which were obtained using phylolm (Ho & Ané 2014). When phylogenetic correlation is strong to moderate (α = 0·1 or 1), the accuracy of μ̂ does not improve with more taxa, illustrating the non‐microergodicity of μ. Under strong phylogenetic correlation (α = 0·1), α̂ and γ̂ are strongly biased with few taxa, with α being overestimated almost all the time with fewer than 100 taxa. Their bias improves with more taxa but their precision does not: there is as much uncertainty about α and γ from 9993 taxa as from 50. This illustrates that α and γ are not microergodic. On the other hand, the variance rate σ2 is estimated precisely, regardless of the phylogenetic correlation. Under weak phylogenetic correlation (α = 10), all parameters are estimated as usual: with little or no bias and with increasing precision from more taxa. These simulations illustrate the main difficulty of detecting selection. When it is moderate, phylogenetic correlation between taxa greatly reduces the actual amount of information on selection parameters, especially on the target of selection. With these limitations in mind for a single selection regime, we now turn to models with multiple regimes.
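The plateau in the precision of the estimate of μ can be reproduced without simulating traits, by computing the variance of the generalized least squares estimate of μ when α and γ are treated as known. The sketch below is a simplification under stated assumptions: it uses a toy ultrametric tree (k clades, each a star of m tips attached at depth t) rather than the bird phylogeny, a root drawn from the stationary distribution so that Cov(Yi, Yj) = γ e^{−α dij} with dij the tree distance between tips, and hypothetical function names `ou_covariance` and `mu_variance`.

```python
import numpy as np

def ou_covariance(k, m, t, T, alpha, gamma=1.0):
    """Stationary OU covariance gamma * exp(-alpha * d_ij) among the tips of a
    toy ultrametric tree of height T: k clades, each a star of m tips whose
    common ancestor sits at depth t below the root."""
    clade = np.repeat(np.arange(k), m)
    same = clade[:, None] == clade[None, :]
    d = np.where(same, 2.0 * (T - t), 2.0 * T)   # tip-to-tip path lengths
    np.fill_diagonal(d, 0.0)
    return gamma * np.exp(-alpha * d)

def mu_variance(V):
    """Variance of the GLS estimate of mu with known covariance: 1 / (1' V^-1 1)."""
    ones = np.ones(V.shape[0])
    return 1.0 / (ones @ np.linalg.solve(V, ones))

k, t, T = 2, 0.5, 1.0                            # illustrative tree shape
for alpha in (0.1, 1.0, 10.0):
    half_life = np.log(2.0) / alpha              # t1/2 = ln(2) / alpha
    variances = [mu_variance(ou_covariance(k, m, t, T, alpha)) for m in (10, 100, 500)]
    print(f"alpha = {alpha:4.1f}  t1/2 = {half_life:.3f}  "
          + "  ".join(f"var(n={k * m}) = {v:.4f}" for m, v in zip((10, 100, 500), variances)))
```

With α = 0·1 or 1 the variance levels off at a positive value as tips are added, whereas with α = 10 it keeps shrinking roughly like γ/n, in line with the qualitative pattern described for Fig. 3.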

Power to detect shifts in the selection optimum
Detecting changes in selection regime is important to study the drivers of selection pressure. For example, Brawand et al. (2011) detected evolutionary shifts in gene expression in testes for a large number of mammalian genes, by comparing a single‐regime OU model to a two‐regime OU model in which the optimal gene expression level μ undergoes a shift on a specific lineage (see also Rohlfs, Harrigan & Nielsen 2014). In this section, we use simulations to study the power to detect such shifts. We first consider the case when the location in the phylogeny of the shift is known, such as when it is hypothesized to be driven by an environmental change or change in another trait that can be mapped onto the phylogeny (e.g. Butler & King 2004). Next we consider the case when the number and phylogenetic position of the shift(s) need to be estimated.
Known location: We simulated data from an OU model with two selection regimes on the 4507‐species mammal tree from Bininda‐Emonds et al. (2007). The shift in selection optimum was placed at the base of Euarchontoglires (Fig. 4, left), so that each regime applied to about half the taxa. Twenty sequences of nested trees from 10 to 4507 taxa were created by randomly selecting subsets of taxa, under the constraint that the number of sampled Euarchontoglires was half the total number of sampled taxa. Along each tree, 100 data sets were simulated according to the OU model with ancestral state and ancestral optimum y0 = μ0 = 0, γ = 1, α = 1, and a Euarchontoglires optimum shift to μ1 = 1·079, 2·157 or 4·314. This model corresponds to expected values in extant taxa of β0 = 0 for non‐Euarchontoglires and of β1 = 0·5, 1 or 2 for Euarchontoglires. The parametrization with β values was used for inference, to avoid the non‐identifiability issue described earlier. Testing the shift in selection regime is then equivalent to testing whether β1 = 0 or not. Figure 5 shows that the estimation error in β1 decreased from 10 to 50 taxa, but did not change much from 50 to 4507 taxa. This phenomenon reflects the non‐microergodicity of both selection optima, and suggests that the power to detect the shift (β1 ≠ 0) does not increase significantly as the tree grows. The same simulation was carried out on the 839‐taxon tree on Fabales (Simon et al. 2009) with similar results (see Fig. S5).
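The correspondence between the simulated optima μ1 and the expected differences β1 can be checked with the relation β1 = (1−e−αt)(μ1−μ0) given earlier for the two‐regime reparametrization. The sketch below is only a consistency check under that formula: the function name `implied_shift_age` is hypothetical, and the resulting age is expressed in whatever time units make α = 1, which is an assumption since the scaling of the mammal tree is not restated here.

```python
import math

def implied_shift_age(beta1, mu1, mu0, alpha):
    """Solve beta1 = (1 - exp(-alpha * t)) * (mu1 - mu0) for the shift age t."""
    ratio = beta1 / (mu1 - mu0)          # equals 1 - exp(-alpha * t)
    return -math.log(1.0 - ratio) / alpha

alpha, mu0 = 1.0, 0.0
# (beta1, mu1) pairs as reported for the Euarchontoglires simulations
pairs = [(0.5, 1.079), (1.0, 2.157), (2.0, 4.314)]
for beta1, mu1 in pairs:
    t = implied_shift_age(beta1, mu1, mu0, alpha)
    print(f"beta1 = {beta1:3.1f}  mu1 = {mu1:5.3f}  implied shift age t = {t:.3f}")
```

All three pairs imply essentially the same shift age, as expected for a single shift location. They also show that, under these settings, less than half of the change in optimum is realized in the expected trait values of extant taxa.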
Fig. 5. Estimates of β1 from data simulated on the 4507‐species mammal tree in Fig. 4 (left).
From each simulated data set, we rejected the null hypothesis of no shift if
where the standard error of
for each tree size was estimated as the standard deviation of
over the 2000 simulations. Figure 6 shows that the power depends mostly on the effect size and very little on the tree size: when the signal is weak (effect size
), the shift is detected only 39·8% of the time even from 4507 taxa. On the other hand, the shift is detected 99% of the time from only 50 taxa when the signal is strong (effect size
). In the example above on flower diameter evolution from Davis et al. (2007), the shift at the base of Rafflesiaceae was statistically significant (SE(
, P‐value 10−12). This is likely not because of a large sample size: there were 22 and 3 species in each regime. Instead, the shift was detectable because of a very large effect size;
was estimated to be 8·92.

To further illustrate that the power to detect a selection optimum shift depends little on tree size, we used the approach introduced by Boettiger, Coop & Ralph (2012) to detect the shift. This method uses a bootstrap‐based likelihood ratio test and simultaneously estimates the power of the test. We simulated data from the OU model on the 839‐taxon tree on Fabales (Simon et al. 2009; Fig. 4 right) with one shift in μ at the base of a clade containing roughly half the total number of taxa. Twenty sequences of nested trees from 10 to 839 taxa were created by sampling taxa randomly, but requiring an equal number of taxa in each regime when n < 839. Along each tree, one data set D was simulated under the OU model as before with γ = 1, α = 1, y0 = μ0 = 0 and a shift to μ1 = 0·873, 1·747 or 3·494, corresponding to β0 = 0 and β1 = 0·5, 1 or 2. Each data set D was analysed as follows (Boettiger, Coop & Ralph 2012). First, model parameters were estimated under a single regime (model 1). Bootstrap data sets were then simulated independently with these estimated parameters under model 1 (Fig. 6). For each bootstrap data set, the log likelihood ratio δ = 2(log L2 − log L1) was computed to compare model 1 to the two‐regime OU model (model 2) with the shift placed at its correct location. Second, D was used again to estimate parameters under model 2 and parametric bootstrap replicates were simulated under the two‐regime model to obtain a sample of δ values under that model. The area of the overlap between the two distributions of δ was then used to measure the power to detect the shift: the smaller this overlap, the greater the power (see Fig. 7 insets for examples of this overlap). As Fig. 7 shows, the power to detect the shift increases very little (i.e. the overlap decreases very little) with the number of taxa. Instead, the overlap and hence the power are most influenced by the true magnitude of the shift.
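One simple way to quantify the overlap between two bootstrap samples of δ is to bin both samples on a common grid and sum the smaller of the two bin proportions. The sketch below illustrates that idea only. It is not necessarily how Boettiger, Coop & Ralph (2012) compute the overlap; the function name `overlap_area` and the number of bins are assumptions, and the two samples are drawn from chi‐square distributions as stand‐ins rather than obtained by fitting OU models.

```python
import numpy as np

def overlap_area(sample_a, sample_b, bins=50):
    """Histogram-based overlap between two empirical distributions:
    1 for identical samples, 0 for samples with disjoint supports."""
    lo = min(np.min(sample_a), np.min(sample_b))
    hi = max(np.max(sample_a), np.max(sample_b))
    edges = np.linspace(lo, hi, bins + 1)        # common grid for both samples
    count_a, _ = np.histogram(sample_a, bins=edges)
    count_b, _ = np.histogram(sample_b, bins=edges)
    prop_a = count_a / count_a.sum()
    prop_b = count_b / count_b.sum()
    return float(np.minimum(prop_a, prop_b).sum())

rng = np.random.default_rng(7)                   # arbitrary seed
delta_null = rng.chisquare(df=1, size=2000)      # stand-in for delta under model 1
for nonc in (1.0, 5.0, 20.0):                    # stand-ins for increasing shift size
    delta_alt = rng.noncentral_chisquare(df=1, nonc=nonc, size=2000)
    print(f"noncentrality = {nonc:5.1f}  overlap = {overlap_area(delta_null, delta_alt):.3f}")
```

A smaller overlap corresponds to a test that separates the two models more easily. In this stand‐in example the separation is driven by the noncentrality, which plays the role of the shift magnitude.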


where tb is the shift's age, due to phylogenetic inertia. We seek to identify on which branch there is a shift, that is, for which branch, βb≠0. To do so, we consider here a traditional stepwise selection method, similar to that used by Ingram & Mahler (2013) in SURFACE. The simplest one‐regime model is first evaluated. At each step, a list of candidate models is made by modifying the current model: either adding a shift on a branch, moving an existing shift to a neighbour branch or dropping a shift. The procedure stops if the current model is better than every candidate model based on a criterion like AIC (Akaike 1974) or BIC (Schwarz 1978). Otherwise, the best model in the list is selected and used as the new current model for the next step. For identifiability purposes again, we consider a maximum of one shift per branch and our procedure disregards models with ‘extinct’ regimes, that is regimes applying to internal branches only. Models with shifts located on sister branches are considered, with the caveat that their exact location among the 2 sister edges and their parent edge is not identifiable. An R implementation of this stepwise selection method is available from the authors.
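The add, move and drop steps of such a stepwise search can be sketched generically. This is not the authors' R implementation: the function `stepwise_select`, the `neighbours` map and the toy `score` are hypothetical, the criterion (AIC, BIC or any other) is passed in as a function to be minimized, and the identifiability filters mentioned above (such as excluding ‘extinct’ regimes) are left out. The search here stops as soon as no candidate strictly improves the criterion.

```python
def stepwise_select(score, branches, neighbours, max_shifts):
    """Greedy search over sets of shift branches (at most one shift per branch).
    `score(model)` returns a criterion to minimize for a frozenset of branches;
    `neighbours[b]` lists the branches adjacent to branch b."""
    current = frozenset()                          # start from the one-regime model
    best = score(current)
    while True:
        candidates = set()
        if len(current) < max_shifts:
            for b in branches:                     # add a shift on a new branch
                if b not in current:
                    candidates.add(current | {b})
        for b in current:
            candidates.add(current - {b})          # drop an existing shift
            for nb in neighbours.get(b, ()):       # move a shift to a neighbour branch
                if nb not in current:
                    candidates.add((current - {b}) | {nb})
        if not candidates:
            return current, best
        # Deterministic tie-breaking: lowest score, then smallest branch labels.
        challenger = min(candidates, key=lambda m: (score(m), sorted(m)))
        if score(challenger) >= best:              # no strict improvement: stop
            return current, best
        current, best = challenger, score(challenger)

# Toy illustration: a made-up criterion counting mismatches with branches {2, 5}.
target = frozenset({2, 5})
toy_score = lambda model: len(model ^ target)
toy_neighbours = {b: [b - 1, b + 1] for b in range(1, 7)}
print(stepwise_select(toy_score, range(8), toy_neighbours, max_shifts=4))
```

Because each accepted step strictly lowers the criterion and the number of models is finite, the loop always terminates. In practice `score` would refit the OU model for each candidate set of shifts.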
We simulated 10 data sets along a 140‐taxon tree under the OU model with 3 shifts, located on branches labelled
,
and
(Fig. 8, left). We used a moderate α = 1 after rescaling the tree to a height of 1, γ = 1, and simulated shifts in selection optima leading to
,
and
. The first shift is expected to be easy to detect, but the other two are not, because their magnitude is less and because they are located on sister edges. As discussed before, due to a lack of identifiability, these two shifts are expected to be detected on either
,
and/or on their parent edge. The stepwise model selection procedure was used with the constraint that the number of shifts in any proposed model may not exceed a given number. Figure 8 shows one simulation, for which AIC selected m = 10 shifts when allowed no more than 10. The first 3 corresponded to the true shifts (with some location error). With one exception, all other estimated shifts corresponded to a single taxon or a pair of sister taxa, misinterpreting one or two extreme value(s) for a shift in selection regime. Figure 9 (top) shows a steady decrease in AIC and BIC values with m, leading to the estimation of the maximum allowed of 10 shifts in all cases, and a strong overestimation of regime shifts. With the allowed maximum increased to 40, AIC selected 40 shifts for all 10 data sets. BIC also selected 40 shifts for 9 data sets and 28 shifts for the remaining data set.

Fig. 8. Left: AIC selected 10 shifts when allowed no more than 10, along edges marked in bold. Edge numbers indicate the order in which estimated shifts were added by AIC. Middle: simulated normalized trait data. Right: model selected by SURFACE when allowed no more than 10 shifts. Shifts were detected on the same 10 branches. Identical colours are used for regimes inferred to share the same optimum.

We conjecture that the failure of AIC and BIC may be caused by two phenomena related to the very large number of candidate models on large trees. First, there are more models to choose from than available data points, a problem coined ‘large p, small n’ in the statistics literature. Indeed, with n taxa on a rooted tree, there are 2n−2 models with one shift, one for each of the 2n−2 branches. So with a single shift and a single extra parameter, the number of potential models is already larger than the number of data points, providing ample opportunity for overfitting. The second issue with heterogeneous tree models is combinatorial: the number of models explodes with the number of shifts. On the 140‐taxon tree used above, there are 2n−2 = 278 models with a single shift, but 38 503 models with 2 shifts (from choosing 2 of the 278 edges), and 6·4×10^17 models with 10 shifts. This explosion gives an advantage to 10‐shift models over 2‐shift models, and to 2‐shift models over 1‐shift models when using AIC or BIC, which are not meant to protect against multiple testing.
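These counts follow directly from binomial coefficients, as this short sketch shows. It assumes, as in the text, a rooted binary tree with 2n−2 branches and at most one shift per branch; the function name `n_shift_models` is introduced here only for illustration.

```python
from math import comb

def n_shift_models(n_taxa, n_shifts):
    """Number of ways to place n_shifts shifts on distinct branches of a
    rooted binary tree with n_taxa tips (2 * n_taxa - 2 branches)."""
    n_branches = 2 * n_taxa - 2
    return comb(n_branches, n_shifts)

n_taxa = 140
print("branches:", 2 * n_taxa - 2)
for k in (1, 2, 10):
    print(f"{k:2d} shift(s): {n_shift_models(n_taxa, k):.3e} candidate models")
```

Even before any parameter is estimated, the set of 10‐shift models is many orders of magnitude larger than the set of 1‐shift models, which is the imbalance that a per‐parameter penalty does not account for.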

was detected first. The second detected shift was either on the parent branch of
and
(6 times) or on
(4 times), and overfitting at or near external edges was avoided. We repeated the modified BIC stepwise selection on 100 new simulated data sets. The average number of detected shifts was 2·55 (standard deviation 0·99), demonstrating that overfitting is not an issue for the modified BIC here. Further improvements would likely be achieved by adjusting for phylogenetic correlation in the BIC penalty, where n is replaced by an effective sample size (see Ané 2008; for the BM model). The modified BIC could also be used in a variety of problems, such as to detect changes in rate or in selection strength. However, more extensive simulations are needed to quantify the performance of modified BIC penalties for these problems.
Fully Bayesian methods are not immune to this large p‐small n issue if the prior distribution is not carefully designed to counteract the combinatorial explosion of models. Venditti, Meade & Pagel (2011) used a Bayesian reversible‐jump framework to detect changes in the evolutionary rate of body size. They detected evidence of shifts in about one‐third of all branches (1494 branches) in their 3185‐species mammal phylogeny. Their prior distribution for the number of changes was not discussed, so it is possible that the very large estimated number of shifts was driven by the combinatorial explosion of models and equal weights to all models a priori.
Eastman et al. (2011) proposed a truncated Poisson prior for the number of shifts, which places a high prior weight on models with few shifts and a much lower weight a priori on each individual model with many shifts. Rabosky (2014) used a similar approach to infer differential rates of species diversification or shifts in trait evolutionary rate, using a compound Poisson prior process for the location and number of shifts. Again, this choice controls the combinatorial explosion of models by giving a Poisson prior probability to the entire set of models with a given number of shifts. Simulations showed that overfitting was not an issue. After this work was completed, Uyeda & Harmon (2014) proposed a Bayesian method for OU models with shifts in the selection optimum, implemented in the R package bayou. Their prior on the number of shifts is a conditional Poisson distribution ranging from 0 to half the number of tips. Their simulations showed that the estimated number of shifts is sensitive to this prior.
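The effect of the prior on the number of shifts can be illustrated numerically. The sketch below is ours and does not reproduce the settings of any of the methods cited: the Poisson rate of 1, the cap of 10 shifts and the 278‐branch tree are assumptions chosen for illustration. It compares the prior on the number of shifts implied by giving equal weight to every individual shift configuration with a truncated Poisson prior placed directly on the number of shifts.

```python
from math import comb, exp, factorial


def uniform_over_models_prior(n_branches: int, max_shifts: int) -> list:
    """Implied prior on the number of shifts k when every individual
    shift configuration (set of k distinct branches) gets equal weight."""
    counts = [comb(n_branches, k) for k in range(max_shifts + 1)]
    total = sum(counts)
    return [c / total for c in counts]


def truncated_poisson_prior(rate: float, max_shifts: int) -> list:
    """Poisson prior on the number of shifts k, truncated to 0..max_shifts
    and renormalized. The rate is a hypothetical illustration value."""
    weights = [exp(-rate) * rate**k / factorial(k) for k in range(max_shifts + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Illustration on a 140-taxon tree (278 branches), at most 10 shifts.
uniform = uniform_over_models_prior(278, 10)
poisson = truncated_poisson_prior(1.0, 10)
for k in (0, 1, 2, 10):
    print(k, uniform[k], poisson[k])
```

Under equal weights for all configurations, nearly all prior mass falls on the largest allowed number of shifts, whereas the truncated Poisson prior keeps most of its mass on models with few shifts. Inspecting the prior in this way is the analytical counterpart of running a sampler with no data.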
The modified BIC used here is one attempt to solve the large p‐small n problem. Many methods have been developed to solve this problem for independent data, such as LASSO (Tibshirani 1996; Zhao & Yu 2006) or least angle regression (Efron et al. 2004). As pointed out for the modified BIC, more work should be done to adjust these methods for phylogenetic correlation and adapt them to change‐point detection problems on evolutionary trees.
Conclusion and Recommendations
We discussed here difficulties in studying trait evolution with OU models from present‐day species: unidentifiability of ancestral states, non‐microergodicity of parameters and limited power to detect changes in selection regimes. These issues are intrinsic to the nature of comparative data with observations at the tips of the tree only. Our study does not mean that OU models are limited and should be abandoned. If trait evolution truly followed an OU process, then analyses using an OU model are certainly most appropriate. But investigators should be aware from the outset that collecting data from more present‐day taxa may not provide the power needed to discriminate between hypotheses. These difficulties are not tied to maximum likelihood: non‐microergodic parameters are bound to be estimated with imprecision, regardless of the estimation method. Finally, we illustrated the breakdown of traditional model selection criteria (AIC and BIC) to detect changes in the trait evolution process and we identified a ‘large p‐small n’ issue: the number of candidate models is larger than the number of taxa and explodes with the number of shifts. We introduced a modified criterion borrowed from change‐point models for time series, which does not overestimate the number of shifts as AIC and BIC do. While fully Bayesian methods have been proposed for related heterogeneous trait evolution models on trees, more work is needed to discover appropriate model selection tools in a frequentist framework.
Recommendations for empiricists
When using an OU model to analyse comparative data, the chosen model should first be checked for identifiability. Unidentifiability would plague both maximum‐likelihood and Bayesian approaches. We proposed a reparametrization where all parameters are identifiable. A second recommended step is to compare the half‐life estimated from the data with the total tree height. If the half‐life is found to be very high relative to the tree height, this is indicative of strong phylogenetic signal and the researcher should be aware that α is likely to be overestimated. In this situation, there is also a lack of power to detect shifts of small magnitude in the selection optimum μ. The researcher should be aware that the absence of evidence for a shift might result from a lack of power, even on a huge tree, rather than from a true absence of shift. A recommended action here is to use the Monte Carlo approach introduced by Boettiger, Coop & Ralph (2012), using the tree at hand and estimated parameters, to determine whether a lack of power or a truly small (or absent) shift is responsible for the negative test result. Alternatively, a recommended action is to seek to add fossil data. Fossils can restore the identifiability of parameters and greatly increase the precision of estimated ancestral states, shifts in ancestral states and μ values (see Ho & Ané 2013). The benefit of combining fossils with living taxa has been recognized empirically and is receiving increased attention (Slater & Harmon 2013), with growing interest in testing increasingly complex evolutionary processes (Slater 2013). Finally, we strongly discourage the use of AIC to search for shifts in μ along the phylogeny, when the locations of these shifts are not known in advance. Instead, we recommend using the modified BIC or a fully Bayesian framework (Uyeda & Harmon 2014) where the prior is carefully chosen. Running the Bayesian estimation procedure with no data is a good way to check the mean number of shifts a priori. Doing so is important to understand the influence of the prior on the inferred number of shifts from the data.
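The comparison of the half‐life with the tree height recommended above can be sketched as follows. The helper names are ours, and the example values of α and tree height are hypothetical. The sketch uses the standard OU relation between the selection strength α and the phylogenetic half‐life, ln(2)/α. No cut‐off is implied: how large the ratio must be to count as ‘very high’ remains a judgement for the investigator.

```python
from math import log


def ou_half_life(alpha: float) -> float:
    """Phylogenetic half-life of an OU process: ln(2) / alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive for a finite half-life")
    return log(2) / alpha


def relative_half_life(alpha: float, tree_height: float) -> float:
    """Half-life expressed as a fraction of the total tree height."""
    return ou_half_life(alpha) / tree_height


# Hypothetical estimate: alpha = 0.1 per unit time on a tree of height 10.
print(relative_half_life(0.1, 10.0))
```

A ratio well above 1 means that the trait is expected to move less than halfway to its optimum over the whole depth of the tree, which is the situation of strong phylogenetic signal described above.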
Acknowledgements
We thank Emmanuel Paradis, Natalie Cooper and an anonymous reviewer for their insightful comments and suggestions. This work was supported in part by the National Science Foundation (DMS 1106483).
Data accessibility
The R code for the stepwise selection method using the modified BIC, and the data on flower diameter in Euphorbiaceae are available at https://github.com/lamho86/.
References