Parallel likelihood calculation for phylogenetic comparative models: The SPLITT C++ library

Phylogenetic comparative models (PCMs) have been used to study macroevolutionary patterns, to characterize adaptive phenotypic landscapes, to quantify rates of evolution, to measure trait heritability, and to test various evolutionary hypotheses. A major obstacle to applying these models has been the complexity of evaluating their likelihood function. Recent works have shown that for many PCMs, the likelihood can be obtained in time proportional to the size of the tree based on post-order tree traversal, also known as pruning. Despite this progress, inferring complex multi-trait PCMs on large trees remains a time-intensive task. Here, we study parallelizing the pruning algorithm as a generic technique for speeding-up PCM-inference. We implement several parallel traversal algorithms in the form of a generic C++ library for Serial and Parallel LIneage Traversal of Trees (SPLITT). Based on SPLITT, we provide examples of parallel likelihood evaluation for several popular PCMs, ranging from a single-trait Brownian motion model to complex multi-trait Ornstein-Uhlenbeck and mixed Gaussian phylogenetic models. Using the phylogenetic Ornstein-Uhlenbeck mixed model (POUMM) as a showcase, we run benchmarks on up to 24 CPU cores, reporting up to an order of magnitude parallel speed-up for the likelihood calculation on simulated balanced and unbalanced trees of up to 100,000 tips with up to 16 traits. Noticing that the parallel speed-up depends on multiple factors, the SPLITT library can automatically select the fastest traversal strategy for a given hardware, tree-topology, and data. Combining SPLITT likelihood calculation with adaptive Metropolis sampling on real data, we show that the time for Bayesian POUMM inference on a tree of 10,000 tips can be reduced from several days to less than an hour. We conclude that parallel pruning effectively accelerates the likelihood calculation and, thus, the statistical inference of Gaussian phylogenetic models.
For time‐intensive Bayesian inferences, we recommend combining this technique with adaptive Metropolis sampling. Beyond Gaussian models, the parallel tree traversal can be applied to numerous other models, including discrete trait and birth–death population dynamics models. Currently, SPLITT supports multi‐core shared memory architectures, but can be extended to distributed memory architectures as well as graphical processing units.


| INTRODUCTION
Phylogenetic comparative models (PCMs) have been used for studying the evolution of various biological species, ranging from microorganisms to animals and plants. Ultimately, these statistical models aim to understand the intricate connections between the macroevolutionary patterns observable in phenotype data from phylogenetically linked species and the fundamental mechanisms of evolution operating on the microevolutionary timescale, such as natural selection and random genetic drift (Felsenstein, 1985;Hansen & Martins, 1996;Harmon, 2018;Lande, 1976;Losos, 2011;Uyeda, Zenil-Ferguson, & Pennell, 2018). This quest has led to the recent development of complex multi-trait multi-regime models of evolution (Bastide, Ané, Robin, & Mariadassou, 2018;Clavel, Escarguel, & Merceron, 2015;Manceau, Lambert, & Morlon, 2016). The inherent complexity of these models is posing new challenges in terms of parameter inference and model selection.
In their effort to speed-up PCM inference, recent works have shown that, for a broad family of PCMs, the likelihood of an observed phylogenetic tree and data conditioned on the model parameters can be computed in time proportional to the size of the tree (FitzJohn, 2012;Goolsby, Bruggeman, & Ané, 2016;Ho & Ané, 2014;Mitov, Bartoszek, Asimomitis, & Stadler, 2018). This family includes Gaussian models like Brownian motion (BM) and Ornstein-Uhlenbeck (OU) phylogenetic models as well as some non-Gaussian models like phylogenetic logistic regression (Ho & Ané, 2014;Ives & Garland, 2010;Paradis & Claude, 2002). All of these likelihood calculation techniques rely on post-order tree traversal, also known as pruning (Felsenstein, 1973, 1981, 1983). For moderate numbers of traits, combining pruning algorithms for likelihood calculation with gradient-based optimization (Boyd & Vandenberghe, 2004) enables maximum likelihood model inference within seconds on contemporary computers, even for phylogenies of many thousands of tips (Ho & Ané, 2014). Despite its simple interpretation and several useful statistical properties, the maximum likelihood estimator has often been criticised for being a point estimator, uninformative about the likelihood surface, often prone to be a local optimum, and failing to quantify the uncertainty of a priori assumed models for comparative data (Bishop, 2007; Uyeda et al., 2018). In contrast with ML inference, though, Bayesian inference methods require many orders of magnitude more likelihood evaluations. This presents a bottleneck in Bayesian analysis, in particular, for complex models of many unknown parameters or when faced with large phylogenies of many thousands of tips, such as transmission trees from large-scale epidemiological studies, for example, Alizon et al. (2010); Shirreff et al. (2013); Hodcroft et al. (2014); Bertels et al. (2017).
While big data should provide the needed statistical power to fit a complex model, the time needed to perform a full-scale Bayesian fit often limits the choice to a faster but less informative ML-inference, or a Bayesian inference of a simplified model.
Speeding-up Bayesian inference is an active topic in applied statistics with recent advances that can be classified in several groups.
One group of methods are adaptive variants of the random walk Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953) that aim to decrease the number of MCMC iterations by performing "on-the-fly" changes of the jump distribution, based on what has been "learned" about the parameter space from past iterations (Haario, Saksman, & Tamminen, 2001;Vihola, 2012). A major advantage of these methods is that they are generic with respect to the models and can be implemented as general purpose Metropolis samplers (e.g. adaptMCMC; Scheidegger, 2017). A second group are "pre-fetching" methods which modify the Metropolis-Hastings algorithm so that it speculatively executes sequences of individual likelihood calls in parallel, "hoping" that these sequences tend to match the actual accepted states of the MCMC (Angelino, Kohler, Waterland, Seltzer, & Adams, 2014;Brockwell, 2006). Another possibility to use multiple processor power, which could potentially be combined with the above methods, is to delegate the parallelization problem to a low level linear algebra library, for example, OpenBLAS (Wang, Zhang, Zhang, & Yi, 2013).
This article contributes to a separate body of work, namely, the ensemble of model-specific approaches that parallelize the likelihood calculation by using specific features of the likelihood function. These include factorizations of the likelihood into a product of components associated with conditionally independent subsets of the model parameters (Goudie, Turner, De Angelis, & Thomas, 2017;Whiley & Wilson, 2004) or the observed variables (Ayres et al., 2012). Often, this factorization relies on strong model assumptions, such as a hierarchical structure of the model parameters or independence of the observed variables. A common approach used in software packages like BEAST (Bouckaert et al., 2014;Drummond, Suchard, Xie, & Rambaut, 2012) is to combine the factorization with caching and reusing of some of the previously calculated likelihood components in consecutive MCMC iterations, as long as these are not affected by the proposed jump in parameter space.

KEYWORDS
continuous time Markov process, continuous trait, discrete character, pre-order traversal

For a phylogenetic comparative model, though, the likelihood cannot (in general) be factorized across parameter groups, trait independence is acceptable only as a null hypothesis and, with a moderate number of traits and pruning likelihood calculation, parallelizing algebraic operations (on low-dimensional vectors and matrices) is inefficient. Hence, we explore the parallelization of the likelihood calculation at the level of traversing the phylogenetic tree, that is, the pruning itself. Parallel tree traversal has been studied in computer science, mostly for the purposes of parallel tree contraction (Reif, 1989), automated task scheduling (Qamnieh, 2015) and phylogenetic inference from multiple sequence alignment data (Ayres & Cummings, 2017;Ayres et al., 2012). Capitalizing on the same ideas, we developed SPLITT: a shared-memory C++ library for Serial and Parallel LIneage Traversal of Trees. While we focus on Gaussian phylogenetic models as the main application of the library, we designed the SPLITT programming interface to be generic with respect to the node-traversal operations, hoping that the library could find use in different models, including birth-death population models and discrete trait models. We tested SPLITT on large trees (up to N = 100,000 tips) and on different topologies, including balanced and highly unbalanced trees. These tests demonstrated a favourable property of the parallel pruning algorithm: its parallel efficiency increases with the tree size as well as with the complexity of the node-traversal operations. Thus, for large trees and complex models, the parallel speed-up is limited either by the number of available processors or by another limited resource such as the memory bandwidth.
Finally, we showcase that our parallel pruning algorithm coupled with adaptive Metropolis samplers dramatically reduces the time for Bayesian analysis of trees with thousands of tips.

| MATERIALS AND METHODS
In this section, we introduce SPLITT and show through an example how it is used to parallelize a pruning algorithm over a given phylogenetic tree and data. Further technical details and examples are provided in Sections 2 and 3 of the Supporting Information and the SPLITT online documentation (https://venelin.github.io/SPLITT/index.html).

| A general framework for parallel tree traversal
SPLITT implements a general framework for specifying the type of trait data, the model parameters and the "node-traversal" operations, which are executed in a pre-order or a post-order traversal of the tree (Section 1, Supporting Information). The node-traversal operations represent user-defined rules specifying how a set of variables associated with each node, called a "node-state," is initialized and updated in the computer memory, based on the input tree and data, the model parameters, and the node-states of the previously visited nodes. At the end of the traversal, the final node-state values are accessible for calculating a quantity of interest, such as the likelihood of the model, given the tree and the data.
The node-states can be calculated in parallel for any group of siblings or more remote cousins on the tree. Formally, SPLITT makes the following key assumption:

Assumption 1. Calculating the state of a node j can be done independently from the calculation of the state of any other node k, provided that neither j is an ancestor of k, nor k is an ancestor of j.
To maximize the potential for parallel execution, the life cycle of a node during traversal is divided into three operations (Figure 1c):
1. InitNode: initializes the node-state, based on the input tree, the data and the model parameters, independently from the states of all other nodes.
2. VisitNode: updates the node-state, based on the input tree, the data, the model parameters and the node-states of the previously visited nodes.
3. PruneNode (post-order traversal only): "communicates" the state of a node to its parent node. SPLITT ensures that this operation is synchronized between siblings, that is, daughters of the same parent node. Hence, this operation is convenient for accumulation (e.g. summation) of state-variables of the daughters into the state of their parent (Figure 1c). This operation is not defined for the root of the tree.
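As an illustration of these three operations, the following minimal serial sketch propagates a sum of tip values to the root. The struct layout and function signatures are invented for this example and do not reproduce SPLITT's actual interface:

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of the three node-traversal operations
// (names mirror the text; this is NOT the actual SPLITT API).
struct Tree {
    std::vector<int> parent;        // parent[i] = parent of node i; root has -1
    std::vector<double> tip_value;  // trait value observed at each tip
    int num_tips;                   // tips are nodes 0..num_tips-1
};

struct NodeState { double sum = 0.0; };

// InitNode: set up the node-state from the input data (here: tip values),
// independently from the states of all other nodes.
void InitNode(int i, const Tree& tree, std::vector<NodeState>& state) {
    state[i].sum = (i < tree.num_tips) ? tree.tip_value[i] : 0.0;
}

// VisitNode: update the node-state; a no-op placeholder here, whereas a
// real model would transform the state (e.g. rescale it by a branch length).
void VisitNode(int i, const Tree& tree, std::vector<NodeState>& state) {}

// PruneNode: "communicate" the daughter's state to its parent; SPLITT
// synchronizes this step between siblings.
void PruneNode(int i, const Tree& tree, std::vector<NodeState>& state) {
    state[tree.parent[i]].sum += state[i].sum;
}

double PostOrderTraverse(const Tree& tree) {
    int n = (int)tree.parent.size();
    std::vector<NodeState> state(n);
    for (int i = 0; i < n; ++i) InitNode(i, tree, state);
    // Assume nodes are numbered so that daughters precede parents
    // (tips first, root last), so a forward sweep is a valid post-order.
    for (int i = 0; i < n; ++i) {
        VisitNode(i, tree, state);
        if (tree.parent[i] >= 0) PruneNode(i, tree, state);
    }
    return state[n - 1].sum;  // root state: sum of all tip values
}
```

For a 3-tip tree with tip values 1, 2 and 4, the traversal accumulates the value 7 at the root; under Assumption 1, the InitNode calls and the PruneNode calls of non-ancestral nodes could run concurrently.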
The parallel speed-up can depend on multiple factors, including the balancedness of the tree, and the computing and memory complexity of the traversal operations, which can differ between nodes in the tree. Noticing that there is no one-size-fits-all parallel traversal strategy that guarantees fastest execution, previous works have studied queue-based and range-based parallelization strategies (Ayres & Cummings, 2017;Qamnieh, 2015;Reif, 1989). SPLITT implements several such strategies, called "orders" (Section 1.1, Supporting Information). By default, SPLITT runs in an "auto" mode, in which it compares the execution time of the different parallel orders during the first several calls on a given tree and data, choosing the fastest one for all subsequent calls.
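The queue-based order can be sketched as follows: a node becomes prune-able, and enters a queue, as soon as all of its daughters have been pruned. The sketch below runs this scheduling logic serially, using a per-node counter of unpruned daughters; in a parallel run the queue would be shared by several threads with synchronized access. Names and signatures are illustrative, not SPLITT's actual code:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Sketch of a queue-based pruning order: a node enters the queue as soon
// as all of its daughters have been pruned. Serial version showing the
// scheduling logic only (illustrative, not SPLITT's actual code).
std::vector<int> QueueBasedOrder(const std::vector<int>& parent,  // root = -1
                                 int num_tips) {                  // tips first
    int n = (int)parent.size();
    std::vector<int> unpruned_daughters(n, 0);
    for (int i = 0; i < n; ++i)
        if (parent[i] >= 0) ++unpruned_daughters[parent[i]];

    std::queue<int> ready;
    for (int i = 0; i < num_tips; ++i) ready.push(i);  // tips are ready first

    std::vector<int> order;
    while (!ready.empty()) {
        int i = ready.front(); ready.pop();
        order.push_back(i);                  // "prune" node i here
        int p = parent[i];
        if (p >= 0 && --unpruned_daughters[p] == 0)
            ready.push(p);                   // parent becomes prune-able
    }
    return order;
}
```

In a multi-threaded run, several workers pop from and push to this queue concurrently, which is why the queue accesses (and the daughter counters) must be synchronized.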

| A showcase: the phylogenetic (Ornstein-Uhlenbeck) mixed model
To illustrate the use and to test the SPLITT library, we developed two variants of the so-called phylogenetic mixed model (PMM): the original PMM assuming a Brownian motion process (Housworth, Martins, & Lynch, 2004;Lynch, 1991), and its recent extension to an Ornstein-Uhlenbeck process, the POUMM, which we and other authors have used previously to analyse the evolution of set-point viral load in HIV patients (and references therein).

FIGURE 1 Parallel pruning for calculating the log-likelihood of the phylogenetic mixed model
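For readers unfamiliar with pruning, the following minimal serial sketch computes the single-trait Brownian motion log-likelihood by Felsenstein's (1973) post-order algorithm, the simplest member of the model family discussed here. The notation is assumed for the example only: sigma2 denotes the BM rate, g0 the trait value at the root, nodes are numbered tips-first with daughters before parents, and internal nodes are binary:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal serial sketch of Felsenstein's (1973) pruning algorithm for the
// single-trait Brownian motion log-likelihood. Assumed notation: sigma2 is
// the BM rate, g0 the (known) trait value at the root. Nodes are numbered
// tips first, daughters before parents, root last; internal nodes binary.
double LogNormPdf(double x, double mean, double var) {
    return -0.5 * std::log(2.0 * M_PI * var)
           - (x - mean) * (x - mean) / (2.0 * var);
}

double BMLogLik(const std::vector<int>& parent,     // parent[i]; root = -1
                const std::vector<double>& branch,  // branch length above i
                const std::vector<double>& tip_x,   // tip trait values
                int num_tips, double sigma2, double g0) {
    int n = (int)parent.size();
    std::vector<double> mu(n), v(n);                // node-state: (mu_j, v_j)
    std::vector<std::vector<int>> daughters(n);
    for (int i = 0; i < n; ++i)
        if (parent[i] >= 0) daughters[parent[i]].push_back(i);

    double loglik = 0.0;
    for (int j = 0; j < n; ++j) {
        if (j < num_tips) { mu[j] = tip_x[j]; v[j] = 0.0; continue; }
        int d1 = daughters[j][0], d2 = daughters[j][1];
        double V1 = v[d1] + sigma2 * branch[d1];
        double V2 = v[d2] + sigma2 * branch[d2];
        // "Prune" the two daughters: their contrast is Gaussian.
        loglik += LogNormPdf(mu[d1] - mu[d2], 0.0, V1 + V2);
        mu[j] = (mu[d1] * V2 + mu[d2] * V1) / (V1 + V2);
        v[j] = V1 * V2 / (V1 + V2);
    }
    int root = n - 1;
    loglik += LogNormPdf(mu[root], g0, v[root]);    // condition on root value
    return loglik;
}
```

For a 2-tip tree with unit branch lengths and sigma2 = 1, the pruned log-likelihood coincides with the direct sum of the two independent tip densities, which is a convenient check of the recursion.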

| Generalization to multi-trait Ornstein-Uhlenbeck and mixed Gaussian phylogenetic models
The quadratic polynomial representation of the log-likelihood function (Equation 1) can be generalized to a broader family of models.
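In this representation, the log-likelihood, viewed as a function of the trait value $g_M$ at the root node $M$, is a quadratic polynomial whose coefficients are the $a_M$, $b_M$ and $c_M$ discussed in this section (notation assumed from the coefficient names; Equation 1 itself is given in the Supporting Information):

```latex
\ell(g_M) \;=\; a_M\, g_M^2 \;+\; b_M\, g_M \;+\; c_M
```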
In Section 2.2, Supporting Information, we show how the coefficients a M , b M and c M can be calculated for the single-trait Ornstein-Uhlenbeck mixed model. In a separate work, we extend this integration technique to evaluate the likelihood of multi-trait Ornstein-Uhlenbeck and mixed Gaussian phylogenetic models, that is, models in which different types of models are assigned to different lineages of the tree. These models have been implemented in several r-packages summarized in the following sub-section.

| Software
We provide SPLITT as a C++ library licensed under version 3.0 of the GNU Lesser General Public License (LGPL v3.0) and available at https://github.com/venelin/SPLITT.git. In its current implementation, the library uses the C++11 language standard, the standard template library and the OpenMP standard for parallel processing.
Section 3, Supporting Information provides details on the model inference procedure implemented within the package and reports a test of technical correctness.
The generalization to a multi-trait mixed Gaussian phylogenetic model (MGPM) has been implemented in the r-package PCMBase.

| Technical correctness
To test the technical correctness of the SPLITT library and the higher-level POUMM, PCMBase and PCMBaseCpp packages, we used the method of posterior quantiles (Cook, Gelman, & Rubin, 2006). For the single-trait POUMM implementation (POUMM r-package), we report the technical correctness test in Section 3.2, Supporting Information. For the multi-trait implementation (PCMBase and PCMBaseCpp r-packages), the technical validation is reported in a separate work.

| Simulations
We evaluated the performance of the SPLITT library using the single-trait and multi-trait phylogenetic Ornstein-Uhlenbeck mixed model (POUMM) as a showcase. The single-trait POUMM was implemented in the r-package POUMM (Section 3, Supporting Information), based on the quadratic polynomial representation of the log-likelihood (Section 2.2, Supporting Information). The multi-trait POUMM version was implemented in the r-package PCMBaseCpp, using a multi-trait generalization of the quadratic polynomial representation described in a separate work. The POUMM is a suitable model for a comparative benchmark, because a number of r-packages provide similar OU-based phylogenetic models, using C++ for the likelihood implementation. These include, among others, geiger (Pennell et al., 2014) and diversitree (FitzJohn, 2012) for the single-trait case and Rphylopars (Goolsby et al., 2016) for the multi-trait case.
This resulted in a total of 16 topologies (trees for N = 1,000 shown in Figure 2). For each topology, random branch lengths were assigned, overwriting the default branch lengths of 1 assigned by rtreeshape(). Since the OU-implementations in the current diversitree and Rphylopars versions do not support non-ultrametric trees, each tree was ultrametrized (adjusting branch lengths so that all tips have the same root-tip distance). For each tree, we generated random trait-values by simulating the POUMM model using random parameters.
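The ultrametrization step can be sketched as follows: compute each node's root-to-node distance and extend every terminal branch so that all tips reach the depth of the deepest tip. This is one simple way to satisfy the equal root-tip-distance requirement; the actual benchmark scripts may have adjusted branch lengths differently:

```cpp
#include <cassert>
#include <algorithm>
#include <vector>

// One simple way to ultrametrize a tree, as described in the text: extend
// each terminal branch so that every tip ends up at the same root-to-tip
// distance (the depth of the deepest tip). Illustrative sketch only.
void Ultrametrize(const std::vector<int>& parent,  // parent[i]; root = -1
                  std::vector<double>& branch,     // branch length above i
                  int num_tips) {                  // tips are 0..num_tips-1
    int n = (int)parent.size();
    std::vector<double> depth(n, 0.0);
    // Nodes are numbered daughters-before-parents with the root last, so a
    // backward sweep visits each parent before its daughters.
    for (int i = n - 2; i >= 0; --i)
        depth[i] = depth[parent[i]] + branch[i];
    double max_depth =
        *std::max_element(depth.begin(), depth.begin() + num_tips);
    for (int i = 0; i < num_tips; ++i)
        branch[i] += max_depth - depth[i];       // stretch terminal branches
}
```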

| Other pruning algorithm examples
In Sections 2.1 and 2.2.2, Supporting Information, we describe another pruning algorithm for calculating the POUMM log-likelihood, which is based on the generalized 3-point structure algorithm (Ho & Ané, 2014). In Section 2.3, Supporting Information, we give an example of a pruning algorithm for calculating the likelihood of a discrete (binary) trait observed at the tips of a phylogenetic tree.

| Time for preprocessing the tree
Each of the tested packages implements a preprocessing step initializing cached data-structures that are re-used during likelihood calculation. In the case of SPLITT, this is the constructor-function of the internal Tree structure; in the case of diversitree, this is the function make.ou; in the case of geiger, this is the internal function bm.lik. We note that the time for creating the cache structure is not important in scenarios of fitting Gaussian phylogenetic models to a fixed tree and data (created once, at the beginning of the inference process). However, these times become important in the case when the tree topology is inferred together with the model parameters from trait and sequence alignment data.
FIGURE 2 Test tree topologies for N = 1,000: p = 0.5 (balanced), p = 0.1, p = 0.01 and p = 0.01/N (ladder). For visualization purposes, all branch lengths have been set to 1, whereas random branch lengths were used in the benchmarks. Note that the tree for p = 0.5 is nearly but not perfectly balanced, due to the random nature of the tree generation process, as well as N not being an exact power of 2.

We measured the preprocessing time on the 16 trees (Table 1).
The times scaled linearly with the size of the tree for the packages using the SPLITT library (POUMM and PCMBaseCpp) and for diversitree. For these packages, the time was not affected by the unbalancedness of the tree. For geiger, we observed longer times, both for bigger N and for more unbalanced trees. For N = 100,000 and p = 0.01/N, both diversitree and geiger failed with a stack-overflow error. The relatively short times for the SPLITT-based POUMM and PCMBaseCpp packages indicate that SPLITT could potentially be used for phylogenetic inference.

| Time for POUMM likelihood calculation
To measure the likelihood calculation time, we ran performance benchmarks on a personal computer (PC) running OS X on an Intel(R) Core(TM) i7-4850HQ CPU @ 2.30 GHz with four physical cores. We distinguish the different implementations according to the following criteria:
• Number of traits: we distinguish between single-trait implementations, that is, geiger, diversitree and POUMM, and multi-trait implementations, that is, Rphylopars and PCMBaseCpp.
For the multi-trait implementations, we measured the time for 1, 4, 8 and 16 traits.
• Mode: denotes whether the implementation is single-threaded, using one physical core of the CPU (serial), or multi-threaded, running as many threads as there are physical CPU cores (parallel);
• Order: denotes the order in which the prune-able nodes are processed. We tested three possible orders: postorder: the nodes are processed sequentially; queue-based: the nodes are processed in parallel as they enter the queue, with synchronized thread access to the queue (Algorithm S1, Section 2, Supporting Information); range-based: the nodes in each pruning generation are processed in order of their allocation in memory, without the need for synchronized access to a queue (Algorithm S2, Section 2, Supporting Information).
• Implementation: the r-package and the back-end used (R or C++).
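The range-based order relies on grouping nodes into "pruning generations": tips form generation 0, and every internal node belongs to the generation one above its deepest daughter. Nodes within one generation are mutually non-ancestral, so (by Assumption 1) each generation can be swept by a parallel for-loop, for example an OpenMP "parallel for", without synchronized queue access. The sketch below computes the generations only; naming is illustrative:

```cpp
#include <cassert>
#include <algorithm>
#include <vector>

// Sketch of the range-based order: assign each node to a "pruning
// generation". All nodes of one generation are mutually non-ancestral, so
// each generation can be swept with a parallel for-loop (e.g. an OpenMP
// "#pragma omp parallel for") without a synchronized queue. Illustrative
// sketch, not SPLITT's actual code.
std::vector<int> PruningGeneration(const std::vector<int>& parent) {
    int n = (int)parent.size();
    std::vector<int> gen(n, 0);  // tips (no daughters) stay at generation 0
    // Daughters-before-parents numbering: a forward sweep finalizes gen[i]
    // before node i propagates it to its parent.
    for (int i = 0; i < n; ++i)
        if (parent[i] >= 0)
            gen[parent[i]] = std::max(gen[parent[i]], gen[i] + 1);
    return gen;
}
```

Processing then proceeds generation by generation; within a generation, the nodes occupy a contiguous index range after reordering, which is what makes unsynchronized memory-ordered sweeps possible.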
The likelihood calculation time was measured using the R function "sys.time", calling the specific likelihood implementation on a fixed set of parameters n = 100 times and then dividing the cumulative time by n. To avoid influence from other processes running on the same PC, the benchmark was run after a restart of the operating system (OS). The resulting times for the single-trait implementations running on the PC are shown in Figure 3.
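The timing protocol can be expressed in a few lines: invoke the likelihood n times and divide the cumulative wall-clock time by n. The benchmarks used an R-level timer; the sketch below is only the equivalent logic in C++:

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Sketch of the timing protocol described in the text: call the likelihood
// n times and divide the cumulative wall-clock time by n. (The benchmarks
// used an R-level timer; this is only the equivalent C++ logic.)
double MeanCallTimeMs(const std::function<void()>& likelihood_call, int n) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) likelihood_call();
    auto end = std::chrono::steady_clock::now();
    double total_ms =
        std::chrono::duration<double, std::milli>(end - start).count();
    return total_ms / n;   // mean time per call, in milliseconds
}
```

Averaging over many calls smooths out OS scheduling noise; restarting the OS before the run, as described above, further reduces interference from other processes.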
On small trees of 100 tips, the fastest single-trait POUMM implementations were the serial C++ implementations from the packages POUMM and diversitree (about 0.03 ms); the range-based parallel implementation was nearly as fast on balanced trees (p = 0.5) but was progressively slower on unbalanced trees. The geiger implementation was nearly an order of magnitude slower (0.2 ms). The POUMM queue-based parallel implementation was nearly 100 times slower (nearly 2 ms), presumably due to the excessive synchronization overhead.
The serial R implementation from the diversitree package was the slowest (above 2 ms), which was expected, since the R interpreter is notoriously slow compared to compiled languages such as C++.

| Parallel speedup
The parallel speed-ups measured in the Euler cluster benchmark are shown in Figures 5 and 6. For single-trait implementations, the parallel speed-up is negligible for trees of <1,000 tips and for highly unbalanced trees (Figure 5).
The parallel speed-up becomes noticeable for large balanced trees, peaking at 10× for a balanced tree of 100,000 tips, running on 20 CPU cores (Figure 5). The above behaviour is explained by the fact that the InitNode and VisitNode operations in the single-trait case are very fast relative to the thread-management operations. Also noteworthy is the fact that, even on balanced trees of 100,000 tips, the parallel efficiency, that is, the ratio of the parallel speed-up and the number of parallel cores, drops below 50% when running on more than 20 CPU cores. This suggests a possible competition between the CPU cores for a limited resource such as the processor cache or the memory bandwidth.
For the multi-trait implementations, the InitNode and VisitNode operations are computationally more intensive. This is why we observe substantial parallel speed-up on the smallest as well as the most unbalanced trees (Figure 6). However, for all multi-trait cases, we observe a decline in parallel speed-up with more than 12 CPU cores (Figure 6).

FIGURE 3 Likelihood calculation times for single-trait R and C++ implementations of the POUMM model on a PC (processor Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz with four physical cores). Both the x-axis, denoting the number of tips in the tree, and the y-axis, denoting the calculation time in milliseconds, are on a log-10 scale. Panels from left to right correspond to different tree topologies, the leftmost panel corresponding to a balanced tree and the rightmost panel to a ladder tree
The most reasonable explanation for this is competition between the CPU cores for a limited hardware resource.

| Combined parallel likelihood calculation with adaptive Metropolis sampling
In Bayesian MCMC inference, the parallel likelihood calculation can be combined with an adaptive MCMC sampler. The POUMM package combines the SPLITT-based parallel likelihood calculation with the adaptive Metropolis sampler of Vihola (2012). On real data, this showed faster MCMC convergence (Figure S10, Section 5, Supporting Information) and an overall time of <30 min for the MCMC run (Section 5, Supporting Information).

FIGURE 5 Parallel speed-up for the single-trait POUMM implementation on the Euler cluster (package POUMM). The grey and red lines denote the expected speed-up at 100% and 50% parallel efficiency, respectively. Horizontally, the panels correspond to the different tree topologies
FIGURE 6 Parallel speed-up for the multi-trait (k = 16 traits) POUMM implementation (package PCMBaseCpp) on the Euler cluster. The grey and red lines denote the expected speed-up at 100% and 50% parallel efficiency, respectively. Horizontally, the panels correspond to the different tree topologies; colours denote the mode and order (serial; parallel queue; parallel range)

| DISCUSSION
The examples in this article focused on Gaussian models of continuous trait evolution (see Section 2, Supporting Information).
Yet, SPLITT can in principle be used for any algorithm that runs a preorder or post-order tree traversal. For example, another family of models where SPLITT could be used are models of structured populations.
When calculating the likelihood for a phylogenetic tree under a structured birth-death model, the calculations proceed in a pruning fashion (Kühnert, Stadler, Vaughan, & Drummond, 2016) and may be improved with respect to speed using our approach. However, the structured coalescent likelihood for a tree is a function of all co-existing lineages even for approximate methods (Müller, Rasmussen, & Stadler, 2017), and thus a pruning formulation is not available. Popular tools for phylogenetic inference from multiple sequence alignments, for example, MrBayes (Ronquist and Huelsenbeck, 2003), use the library BEAGLE, which distributes the computation for the independent sites of the sequence alignment among multiple CPU or GPU cores (Ayres et al., 2012; but see also Ayres & Cummings, 2017). SPLITT operates on a different level: it parallelizes the computation for independent lineages in the tree. Both approaches are interesting because they fit well to different sizes of the input data: while BEAGLE achieves significant parallel speed-ups on long alignments comprising many thousands of nucleotide or codon columns (Ayres et al., 2012), SPLITT is better suited to shorter alignments of potentially many thousands of species.
Based on the performance benchmarks, we conclude that the current implementation of SPLITT, running on the tested multi-core hardware, effectively accelerates the likelihood calculation for large trees and complex models.

| Outlook
The past decade has seen a rapid advance in the production of multi-core processors. At the same time, it appears that the maximum clock frequency of a single processing unit is approaching the maximum achievable for semi-conductor-based architectures. In parallel with this development on the hardware side, the volume of sequence data and the size of phylogenetic trees are growing exponentially. For instance, in <5 years the size of phylogenetic trees used for calculating the heritability of HIV virulence has increased from a few hundred to several thousand patients (Alizon et al., 2010;Hodcroft et al., 2014). This motivates the development of novel parallel algorithms capitalizing on the multi-core technology.
The parallel tree traversal library, SPLITT, enables parallel computation for a vast set of phylogenetic models, facing the challenges of increasing model complexity and volumes of data in phylogenetic analysis.

ACKNOWLEDGEMENTS
We thank Dr. Krzysztof Bartoszek for valuable insights on the Ornstein-Uhlenbeck process. The PMM likelihood calculation example illustrated in Figure 1 has been implemented in the form of an