Lpnet: Reconstructing phylogenetic networks from distances using integer linear programming
Abstract
- Neighbor-net is a widely used network reconstruction method that approximates pairwise distances between taxa by a circular phylogenetic network.
- We present Lpnet, a variant of Neighbor-net. We first apply standard methods to construct a binary phylogenetic tree and then use integer linear programming to compute an optimal circular ordering that agrees with all tree splits.
- This approach achieves an improved approximation of the input distance for the clear majority of experiments that we have run for simulated and real data. We release an implementation in R that can handle up to 94 taxa and usually needs about 1 min on a standard computer for 80 taxa. For larger taxa sets, we include a top-down heuristic which also tends to perform better than Neighbor-net.
- Lpnet provides an alternative to Neighbor-net and performs better in most cases. We anticipate that Lpnet will be useful to generate phylogenetic hypotheses.
1 INTRODUCTION
Neighbor-net (Bryant & Moulton, 2004) is a widely used distance-based method that approximates the input distance by (the distance induced by) a weighted circular split system on the same taxa set, which is then visualized as a splits graph. It has been used to analyse numerous real datasets, and it is a crucial step of other phylogenetic tools, for example, to construct coalescent-based phylogenetic networks (Allman et al., 2019), to detect and filter out homoplastic sites of a sequence alignment (Dress et al., 2008) and to align multiple sequences without a guide tree (Kruspe & Stadler, 2007). Neighbor-net is the network analogue of the neighbor-joining (NJ) method (Saitou & Nei, 1987). First, a circular ordering of the taxa is computed heuristically, and then, a non-negative least squares (NNLS) procedure is applied to find optimal weights for all splits that agree with that circular ordering. The heuristic consists of joining two clusters, using the same criterion as neighbor-joining, and then choosing one taxon from each cluster such that those two taxa will be adjacent in the final circular ordering. The first part of this agglomeration defines a phylogenetic tree on all taxa (Levy & Pachter, 2011) and the second part chooses a circular ordering that agrees with all splits of that tree. It follows from Semple and Steel (2004) that there are ${2}^{n-3}$ circular orderings with that property for $n$ taxa.
Here, we present Lpnet, a variant of Neighbor-net, that does not apply the second heuristic step of the agglomeration. Instead, we only construct a phylogenetic tree heuristically. Then, we use integer linear programming to find an optimal circular ordering, and finally, we use the same NNLS procedure as Neighbor-net to assign split weights. We have run experiments to compare Lpnet and Neighbor-net, using random distances, simulated datasets and a real dataset, and we observe that Lpnet tends to approximate the input distance clearly better than Neighbor-net. Since the integer linear programming problem in Lpnet uses a quadratic number of variables and a cubic number of constraints, Lpnet is computationally more demanding than Neighbor-net. Our implementation in R is feasible for up to 94 taxa and usually needs about 1 min on a standard computer for 80 taxa. For larger taxa sets, we provide a top-down heuristic to compute a circular ordering from the tree. The method is available at https://github.com/yukimayuli-gmz/lpnet.
2 MATERIALS AND METHODS
2.1 Compatible splits and circular split systems
A split $S$ divides a set $X$ of taxa into two non-empty parts $A$ and $B$ and is denoted $S=A\mid B$. For a phylogenetic tree where the leaf nodes are labelled by the set $X$ of taxa, every branch of the tree represents a split of $X$. Sets of splits that can be obtained from a single tree in this way are called compatible (Semple & Steel, 2003).
A circular ordering of a set $X=\left\{{x}_{1},\dots ,{x}_{n}\right\}$ of $n\ge 3$ taxa can be obtained by labelling all vertices of an $n$-gon by the taxa (see Figure 1). A split $A\mid B$ agrees with a circular ordering, if both $A$ and $B$ label consecutive paths on the circle. A circular ordering is defined by the permutation of $X$ that we get by starting at an arbitrary taxon and then following all vertices of the cycle in a clock-like or anti-clock-like fashion. Therefore, there are $2n$ different permutations associated with the same circular ordering.
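The two definitions above can be checked mechanically. The following is a minimal sketch, not part of the Lpnet package: the function names are hypothetical, and a circular ordering is represented simply as a Python list of taxon labels.

```python
def agrees_with_ordering(part_a, ordering):
    """Return True if the split A | (X minus A) agrees with the circular ordering.

    A labels a consecutive path on the cycle exactly when a walk once
    around the cycle crosses the boundary between A and its complement
    exactly twice; the complement is then consecutive as well.
    """
    part_a = set(part_a)
    n = len(ordering)
    if not 0 < len(part_a) < n:
        raise ValueError("both parts of a split must be non-empty")
    boundaries = sum(
        (ordering[i] in part_a) != (ordering[(i + 1) % n] in part_a)
        for i in range(n)
    )
    return boundaries == 2


def equivalent_permutations(ordering):
    """All permutations that describe the same circular ordering."""
    n = len(ordering)
    result = set()
    # Walk the cycle in both directions, starting at every taxon.
    for seq in (list(ordering), list(reversed(ordering))):
        for start in range(n):
            result.add(tuple(seq[start:] + seq[:start]))
    return result


ordering = ["a", "b", "c", "d", "e"]
print(agrees_with_ordering({"a", "b"}, ordering))  # neighbours on the cycle
print(agrees_with_ordering({"e", "a"}, ordering))  # neighbours across the list boundary
print(agrees_with_ordering({"a", "c"}, ordering))  # separated by b
print(len(equivalent_permutations(ordering)))      # number of equivalent permutations
```

For five taxa the last line counts the $2n=10$ permutations mentioned in the text.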
A circular split system of $X$ is a set of splits of $X$ such that all splits agree with a single circular ordering. It has long been known that circular split systems have up to $\left(\begin{array}{c}n\\ 2\end{array}\right)$ splits (Bandelt & Dress, 1992). Furthermore, compatible split systems are circular; thus, an unrooted phylogenetic tree can be considered a circular split system. When non-negative weights are assigned to the splits, this weighted circular split system can be visualized by a planar splits graph, a network where the taxa are embedded as vertices, and every split $A\mid B$ of weight $w$ corresponds to a set of parallel edges which all have length $w$. Furthermore, removing all those edges decomposes the network into two connected components containing $A$ and $B$ respectively (Dress & Huson, 2004). Splits graphs are commonly referred to as implicit phylogenetic networks, and SplitsTree4 (Huson & Bryant, 2006) can be used to draw such a network from an input weighted circular split system.
2.2 Quartets and weights
A quartet $\mathit{uv}\mid \mathit{xy}$ consists of two pairs $\left\{u,v\right\}$ and $\left\{x,y\right\}$ for four different taxa $u,v,x,y$. A weight can be defined for every quartet $\mathit{uv}\mid \mathit{xy}$, based on the pairwise distances between the taxa in $\left\{u,v,x,y\right\}$; it quantifies the support for separating $u$ and $v$ from $x$ and $y$. For a distance $d$, and four taxa $u,v,x,y$, we define the weight of the quartet $\mathit{uv}\mid \mathit{xy}$ by $w\left(\mathit{uv}|\mathit{xy}\right)=d\left(u,x\right)+d\left(u,y\right)+d\left(v,x\right)+d\left(v,y\right)-2d\left(u,v\right)-2d\left(x,y\right)$. The support for a tree (without edge lengths) or a circular ordering can be quantified by summing up the weights of all supported quartets, and this number is maximized by a correct tree or ordering, if the input distance is induced by a tree or a circular split system respectively.
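As an illustration, the quartet weight and the resulting score of a circular ordering can be written in a few lines. This is a sketch under one stated assumption: for four taxa met in cyclic order $a,b,c,e$, we take the supported quartets to be $ab\mid ce$ and $ae\mid bc$, the two pairings that do not cross on the circle. The function names and the toy distance are hypothetical.

```python
from itertools import combinations


def quartet_weight(d, u, v, x, y):
    """Weight w(uv|xy) of the quartet separating {u, v} from {x, y}."""
    return (d[u][x] + d[u][y] + d[v][x] + d[v][y]
            - 2 * d[u][v] - 2 * d[x][y])


def sum_of_supported_quartets(d, ordering):
    """Sum of the weights of all quartets supported by a circular ordering.

    Assumption: four taxa in cyclic order a, b, c, e support ab|ce and ae|bc.
    """
    total = 0
    # combinations() keeps the positional order of the ordering.
    for a, b, c, e in combinations(ordering, 4):
        total += quartet_weight(d, a, b, c, e)
        total += quartet_weight(d, a, e, b, c)
    return total


# Toy tree metric: quartet tree 12|34, all five edges of length 1.
pairs = {(1, 2): 2, (3, 4): 2, (1, 3): 3, (1, 4): 3, (2, 3): 3, (2, 4): 3}
d = {t: {} for t in range(1, 5)}
for (a, b), value in pairs.items():
    d[a][b] = d[b][a] = value

print(quartet_weight(d, 1, 2, 3, 4))                # four times the interior edge length
print(sum_of_supported_quartets(d, [1, 2, 3, 4]))   # ordering that fits the tree
print(sum_of_supported_quartets(d, [1, 3, 2, 4]))   # ordering that does not
```

On this tree metric the ordering that agrees with the tree scores higher than the one that separates the cherries, in line with the maximization property stated above.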
2.3 Distance reduction and the order dependence of Neighbor-net
The first step of neighbor-joining and all other agglomerative algorithms discussed here is to identify two taxa $u,v$, such that ${\sum}_{x,y\ne u,v}w\left(\mathit{uv}|\mathit{xy}\right)$ is maximized. Note that the selection criterion of neighbor-joining and its variants can be formulated in terms of these quartet weights (Mihaescu et al., 2009). Then, the cluster $\left\{u,v\right\}$ is considered a single taxon, where the distances between this new taxon and another taxon $x$ are based on the distances between $u$, $v$, and $x$. This process is reiterated until there are only three taxa left, and the two steps are called the selection and the reduction step. It was pointed out by Bryant (2005) that, in order to reconstruct trees correctly from their induced metric, the selection step is unique, while the reduction step can give different weights to the two joined clusters. Neighbor-joining always gives the same weight to both clusters, while UNJ (Gascuel, 1997b) uses weights that are proportional to the cluster size, and BioNJ (Gascuel, 1997a) tries to minimize the variance of the reduced distances.
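The link between the quartet formulation of the selection step and the familiar neighbor-joining criterion $Q\left(u,v\right)=\left(n-2\right)d\left(u,v\right)-{r}_{u}-{r}_{v}$, with ${r}_{u}$ the row sum of $u$, can be verified numerically. Expanding the sum gives ${\sum}_{x,y\ne u,v}w\left(\mathit{uv}|\mathit{xy}\right)=-\left(n-1\right)Q\left(u,v\right)-2S$, where $S$ is the sum of all pairwise distances, so maximizing the quartet sum is the same as minimizing $Q$. The sketch below uses hypothetical function names and a small tree metric chosen for illustration.

```python
from itertools import combinations


def quartet_weight(d, u, v, x, y):
    """Weight w(uv|xy) as defined in Section 2.2."""
    return (d[u][x] + d[u][y] + d[v][x] + d[v][y]
            - 2 * d[u][v] - 2 * d[x][y])


def selection_score(d, u, v):
    """Sum of w(uv|xy) over all pairs {x, y} of taxa other than u and v."""
    others = [t for t in range(len(d)) if t not in (u, v)]
    return sum(quartet_weight(d, u, v, x, y)
               for x, y in combinations(others, 2))


def nj_q(d, u, v):
    """Classical neighbor-joining Q criterion (smaller is better)."""
    return (len(d) - 2) * d[u][v] - sum(d[u]) - sum(d[v])


def select_pair(d):
    """Pair of taxa that maximizes the quartet-based selection score."""
    return max(combinations(range(len(d)), 2),
               key=lambda pair: selection_score(d, *pair))


# Tree metric of ((0,1),2,(3,4)); pendant edges 1,2,3,1,4; interior edges 1 and 2.
d = [
    [0, 3, 5, 5, 8],
    [3, 0, 6, 6, 9],
    [5, 6, 0, 6, 9],
    [5, 6, 6, 0, 5],
    [8, 9, 9, 5, 0],
]
n = len(d)
total = sum(d[i][j] for i, j in combinations(range(n), 2))

for u, v in combinations(range(n), 2):
    # The two criteria differ only by a negative factor and a constant.
    assert selection_score(d, u, v) == -(n - 1) * nj_q(d, u, v) - 2 * total

print(select_pair(d))  # a cherry of the generating tree
```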
Neighbor-net does not reduce the distance for clusters of size two. Instead, it reduces three taxa to two, whenever a cluster of size two is merged with another cluster. If both clusters have size two, this reduction step has to be performed twice. This distinguishes the taxon that is not included in the first reduction from the other three. For default parameters, the taxa that are first reduced each receive 2/9 and the remaining taxon 1/3 of the total weight of the new cluster. Since the choice of which three taxa are reduced first is not determined by the input distances, the output of Neighbor-net can sometimes depend on the input order of the taxa, even if no ties occur. It therefore seems reasonable to give equal weights to all four clusters. We have implemented this variant of the Neighbor-net tree construction in Lpnet and refer to it as symmetric NNet tree. In addition, we use NNet tree to mimic the original weighting by randomly assigning 1/3 to one of the two candidate taxa that might receive that weight from the Neighbor-net algorithm.
2.4 The Lpnet algorithm
- Construct a phylogenetic tree. We use several methods, including NJ.
- Use Integer Linear Programming to find a circular ordering that is consistent with the tree and maximizes the sum of the weights of all quartets supported by the circular ordering.
- Estimate split weights from the circular ordering by using non-negative least squares such that the least squares fit (LSFit) is maximized.
Neighbor-net combines the agglomeration of bottom-up clustering with a second selection step that essentially defines an ordering of the taxa in the newly added cluster. Every cluster can be considered a path with its taxa as vertices, and the path will be an interval in the final circular ordering. Crucially, the decision about the ordering of the new cluster is made based on the information available at the time of merging.
For Lpnet, we choose to delay the ordering of the clusters until the whole tree is known. We do so by solving an integer linear programming problem that finds a circular ordering of all taxa that maximizes the sum of the weights of its supported quartets among all orderings that agree with all splits of the tree.
2.4.1 Constructing a tree
Neighbor-net constructs a circular ordering agglomeratively. The process can be described as adding edges to a graph whose vertices are the taxa such that every component is a path (Grünewald et al., 2007).
As indicated in the previous subsection, the second selection step has some influence on the tree construction. It was already suggested by Levy and Pachter (2011) to change the distance reduction of Neighbor-net such that all splits of the neighbor-joining tree are always supported by the output ordering. The main idea of Lpnet is to skip the second selection altogether and choose a circular ordering after the tree construction. In order to observe how much of the performance difference between Lpnet and Neighbor-net is caused by this new strategy, we include the NNet tree which stays as close as possible to Neighbor-net. Noting the undesired order dependence of Neighbor-net, we also include the symmetric NNet tree.
The other tree reconstruction methods implemented by Lpnet are neighbor-joining and its variants, UNJ and BioNJ. It is also possible to input a user-defined tree, and Lpnet will compute a circular split system where all splits of the input tree agree with the circular ordering.
2.4.2 Using linear programming to maximize quartet weights
When a phylogenetic tree is drawn in the plane, this embedding defines a circular ordering of the taxa which can be observed by traversing the tree such that every edge is visited once in each direction (see Figure 2). Given a binary unrooted tree $T$ with $n$ taxa and pairwise distances, we want to find a circular ordering that agrees with all splits of $T$, such that the sum of the weights of all supported quartets is maximized. Given an initial ordering that agrees with $T$, we can obtain another such ordering by choosing a non-trivial split $A\mid B$ and reversing the order of $A$. Note that reversing the order of $B$ yields the same circular ordering as reversing $A$, and reversing both yields the initial circular ordering. This process can be interpreted as flipping the edge that separates $A$ from $B$, and it follows from Semple and Steel (2004) that all allowed circular orderings can be obtained by a sequence of edge flips. Moreover, the final circular ordering does not depend on the order in which the edges are flipped, flipping an edge twice yields the same circular ordering as never flipping it, and flipping two different sets of edges always results in different circular orderings.
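The edge-flip description can be tried out on a small tree. The sketch below is illustrative only and is not the Lpnet implementation: a circular ordering is a Python list, each interior edge is represented by one side $A$ of its split, and the helper names are hypothetical. It enumerates all subsets of interior edges, flips them, and counts the distinct circular orderings up to rotation and reflection.

```python
from itertools import combinations


def flip(ordering, part_a):
    """Reverse the order of the taxa in A, assuming A is consecutive on the cycle."""
    part_a = set(part_a)
    n = len(ordering)
    # Find where the block A starts; index -1 wraps around the cycle.
    start = next(i for i in range(n)
                 if ordering[i] in part_a and ordering[i - 1] not in part_a)
    rotated = ordering[start:] + ordering[:start]
    k = len(part_a)
    return rotated[:k][::-1] + rotated[k:]


def canonical(ordering):
    """Smallest of the 2n permutations that describe one circular ordering."""
    n = len(ordering)
    candidates = []
    for seq in (list(ordering), list(ordering)[::-1]):
        for s in range(n):
            candidates.append(tuple(seq[s:] + seq[:s]))
    return min(candidates)


def all_orderings(initial, interior_splits):
    """Circular orderings reachable by flipping any subset of interior edges."""
    found = set()
    for r in range(len(interior_splits) + 1):
        for subset in combinations(interior_splits, r):
            order = list(initial)
            for part_a in subset:
                order = flip(order, part_a)
            found.add(canonical(order))
    return found


# Caterpillar trees on 5 and 6 taxa, each interior edge given by one split side.
five = all_orderings([1, 2, 3, 4, 5], [{1, 2}, {4, 5}])
six = all_orderings([1, 2, 3, 4, 5, 6], [{1, 2}, {1, 2, 3}, {5, 6}])
print(len(five), len(six))
```

Both counts equal ${2}^{n-3}$, the number of interior edges of a binary tree being $n-3$.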
These conditions are necessary, because every edge of the smallest subtree of $T$ containing $i,j,k$ is contained in exactly two of the three paths between two of those vertices. This means that the sum ${X}_{\mathit{ij}}+{X}_{\mathit{ik}}+{X}_{\mathit{jk}}$ is even, so either all three variables are zero or there are two ones and one zero. To see that the conditions are sufficient, we note that the (0,1)-assignment to all those variables ${X}_{\mathit{ij}}$ where $\mathit{ij}$ is an edge of $T$ already defines a circular ordering. Now there is a single extension of this assignment to all variables such that all conditions hold: Let $i$ and $k$ be two interior vertices of $T$ such that ${X}_{\mathit{ik}}$ is unknown while ${X}_{\mathit{ij}}$ and ${X}_{\mathit{jk}}$ have already been assigned for a vertex $j$. As before, ${X}_{\mathit{ij}}+{X}_{\mathit{ik}}+{X}_{\mathit{jk}}$ must be even, so there is only one allowed assignment for ${X}_{\mathit{ik}}$. We give an example of a $5$-taxa tree with all four possible circular orderings in Figure 3, and we list the values of the variables ${X}_{i,j}$.
In summary, we compute an optimal circular ordering by solving a binary linear programming problem with $\left(\begin{array}{c}n-2\\ 2\end{array}\right)$ variables and $4\left(\begin{array}{c}n-2\\ 3\end{array}\right)$ constraints, which maximizes the objective function globally.
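The explicit constraints are not reproduced in this text. What is stated is that ${X}_{\mathit{ij}}+{X}_{\mathit{ik}}+{X}_{\mathit{jk}}$ must be even for every triple of interior vertices and that there are four constraints per triple. One linear encoding consistent with both statements, given here as an assumption and not necessarily the one used in the R package, is ${X}_{\mathit{ij}}+{X}_{\mathit{ik}}+{X}_{\mathit{jk}}\le 2$ together with the three inequalities of the form ${X}_{\mathit{ij}}\le {X}_{\mathit{ik}}+{X}_{\mathit{jk}}$. The sketch below generates these rows and checks that they admit exactly the even-sum assignments.

```python
from itertools import combinations, product
from math import comb


def parity_constraints(num_interior):
    """Yield (coefficients, rhs) pairs meaning sum(c * X[pair]) <= rhs.

    Assumed encoding of 'X_ij + X_ik + X_jk is even' for binary variables.
    """
    for i, j, k in combinations(range(num_interior), 3):
        ij, ik, jk = (i, j), (i, k), (j, k)
        yield {ij: 1, ik: 1, jk: 1}, 2      # excludes (1, 1, 1)
        yield {ij: 1, ik: -1, jk: -1}, 0    # X_ij <= X_ik + X_jk
        yield {ik: 1, ij: -1, jk: -1}, 0    # X_ik <= X_ij + X_jk
        yield {jk: 1, ij: -1, ik: -1}, 0    # X_jk <= X_ij + X_ik


def satisfies(assignment, constraints):
    """Check a (0,1)-assignment against a list of constraint rows."""
    return all(sum(c * assignment[p] for p, c in coeffs.items()) <= rhs
               for coeffs, rhs in constraints)


# Exhaustive check on a single triple of interior vertices.
triple = list(parity_constraints(3))
pairs = [(0, 1), (0, 2), (1, 2)]
feasible = [bits for bits in product((0, 1), repeat=3)
            if satisfies(dict(zip(pairs, bits)), triple)]
print(feasible)

# Problem size for n taxa: a binary tree has n - 2 interior vertices.
n = 10
rows = list(parity_constraints(n - 2))
print(comb(n - 2, 2), len(rows))
```

The feasible assignments are exactly those with zero or two ones, and the number of generated rows matches $4\left(\begin{array}{c}n-2\\ 3\end{array}\right)$.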
2.4.3 Maximizing quartet weights heuristically
Even if an exact solution to maximize the quartet weights is not feasible, it has some advantages to first complete the tree construction and then compute a circular ordering. We include a top-down heuristic which iteratively fixes the variables ${X}_{i,j}$ for edges $\mathit{ij}$. In contrast to Neighbor-net, it uses all sets of four taxa. It also can choose between several candidate variables to be fixed first, and it revises its decisions in a post-processing step.
The agglomeration process of NJ and its variants ends when there are only three clusters left, and for each of those clusters, a rooted binary tree has been constructed. This corresponds to a rooted tree where the root has outdegree 3 which we will denote $T$. We use that tree and an initial circular ordering to decide in a top-down fashion which of the edges should be flipped. We assume that there is a subtree $U$ containing the root such that for all edges of $U$, a decision has been made. Initially, the subtree contains only the root and no edges. For every interior edge of $T$ that has exactly one vertex $u$ in $U$ and the other vertex $v$ outside $U$, we compute the average weight of all quartets that are displayed if and only if the edge $\mathit{uv}$ is flipped minus the average weight of the quartets that are displayed if and only if $\mathit{uv}$ is not flipped. A $4$-set $\left\{{x}_{1},{x}_{2},{x}_{3},{x}_{4}\right\}$ of taxa contributes to that score, if and only if, for the smallest subtree of $T$ connecting ${x}_{1},{x}_{2},{x}_{3},{x}_{4}$, one of the two vertices of degree $3$ is $v$ and the other one is in $U$. For the edge $\mathit{uv}$ that maximizes the absolute value of this difference, we flip the edge if the difference is positive and leave it unflipped otherwise. Then, we add the vertex $v$ and the edge $\mathit{uv}$ to $U$ and go to the next iteration. The process stops when $U$ contains all interior edges of $T$. After this top-down procedure, we do some post-processing by reversing individual decisions whenever doing so increases the sum of the quartet weights. In order to guarantee polynomial running time, we stop this process either when we find no edge that improves the score or when every edge has been checked $n/2$ times, where $n$ is the number of taxa.
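The post-processing step can be read as a bounded greedy local search over the flip decisions. The following sketch shows that reading only: the scoring function is a toy stand-in (in Lpnet it would be the sum of quartet weights of the resulting circular ordering), and the function and parameter names are hypothetical.

```python
def post_process(flips, score, max_checks_per_edge):
    """Greedy local search over flip decisions.

    flips: list of booleans, one decision per interior edge.
    score: callable that maps a list of decisions to a number.
    max_checks_per_edge: upper bound on passes (n/2 in the text).
    """
    flips = list(flips)
    best = score(flips)
    for _ in range(max_checks_per_edge):
        improved = False
        for edge in range(len(flips)):
            flips[edge] = not flips[edge]        # tentatively reverse the decision
            candidate = score(flips)
            if candidate > best:
                best = candidate                 # keep the reversal
                improved = True
            else:
                flips[edge] = not flips[edge]    # undo it
        if not improved:
            break                                # no edge improves the score
    return flips, best


# Toy score: number of decisions that match a hidden target pattern.
target = [True, False, True, True]


def toy_score(flips):
    return sum(f == t for f, t in zip(flips, target))


result, value = post_process([False] * 4, toy_score, max_checks_per_edge=2)
print(result, value)
```

Each pass evaluates the score once per edge, so with a polynomial-time score and at most $n/2$ passes the running time stays polynomial.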
2.4.4 Estimate split weights by non-negative least squares
2.5 Consistency of Lpnet
Consistency is an important feature of phylogenetic reconstruction methods. It means that a method does not make mistakes for perfect input. Specifically, a method that reconstructs a circular split system from distances is consistent if it returns the correct weighted split system whenever the input distance is induced by a weighted circular split system. Neighbor-net was shown to be consistent by Bryant et al. (2007). A more general proof was given by Levy and Pachter (2011), where all tree construction methods used by Lpnet are included in their definition of neighbor-joining which allows a wide class of weighting schemes for the reduction step. Their result implies that all splits of the tree constructed by Lpnet agree with some circular ordering that agrees with all splits of the underlying split system. It is easy to see that such an ordering will also maximize the sum of all supported quartets. Finally, NNLS will be able to match the input distance exactly; thus, Lpnet is consistent for all used tree reconstruction methods.
2.6 Implementation
We provide an R implementation of Lpnet that outputs a nexus file with a weighted circular split system. We recommend SplitsTree4 (Huson & Bryant, 2006) to draw a corresponding phylogenetic network. The user can choose one of the five tree reconstruction methods listed above or input a tree. For the linear programming problem, two solvers are supported: The R version Rglpk of the GNU Linear Programming Kit (http://www.gnu.org/software/glpk) is free and open source, while Gurobi (http://www.gurobi.com) is one of the most powerful commercial mathematical optimization solvers. A free licence of Gurobi is available for academic institutions.
In practice, the integer linear programming is the computationally most demanding part of Lpnet. The constraint matrix has $O\left({n}^{5}\right)$ entries, and the required memory grows equally fast. There is a hard limit of 94 taxa, because the matrix is stored as a vector, and R only allows vectors of length at most ${2}^{31}-1$.
We list the size of the constraint matrix and the average CPU time for running our Lpnet function in R using Gurobi with binary linear programming for different numbers of taxa in Table 1. Rglpk is much slower, and the problem was often not feasible for more than 50 taxa on our machine. For 40 taxa, we observed an average CPU time of 13.76 s with Rglpk.
Number of taxa | User time | Size of constraint matrix |
---|---|---|
20 | 0.0803 s | 3.81 MB |
40 | 1.4167 s | 180.985 MB |
60 | 11.7401 s | 1.52 GB |
80 | 59.8303 s | 6.809 GB |
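The sizes in Table 1 and the limit of 94 taxa follow from the variable and constraint counts given in Section 2.4.2. The short calculation below is a sketch under one assumption that is not stated in the text: each matrix entry is stored as an 8-byte double, and sizes are reported in binary units (MiB, GiB). With that assumption the numbers agree with Table 1.

```python
from math import comb

R_MAX_VECTOR_LENGTH = 2**31 - 1  # vector length limit quoted in the text


def matrix_entries(n):
    """Entries of the constraint matrix: variables times constraints."""
    variables = comb(n - 2, 2)
    constraints = 4 * comb(n - 2, 3)
    return variables * constraints


def size_in_bytes(n, bytes_per_entry=8):
    """Memory for a dense matrix; 8 bytes per entry is an assumption."""
    return matrix_entries(n) * bytes_per_entry


for n in (20, 40, 60, 80):
    mib = size_in_bytes(n) / 2**20
    print(n, matrix_entries(n), round(mib, 3), "MiB")

# Largest number of taxa whose matrix still fits into one R vector.
largest = max(n for n in range(5, 200)
              if matrix_entries(n) <= R_MAX_VECTOR_LENGTH)
print(largest)
```

The entry count grows like ${n}^{5}$, which is why the step from 94 to 95 taxa crosses the vector length limit.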
Integer linear programming is NP-hard (Karp, 1972), and solvers will often solve the relaxed problem where the variables are allowed to be non-integer first. For the problem solved by Lpnet, this solution happens to be an integer solution most of the time. If there are non-integer entries, the running time will increase. For example, we observed this case for less than 1% of random distances for 80 taxa and Gurobi, and then, the average CPU time was 307.27 s. In an attempt to estimate the worst case, we constructed an example where the optimal solution of the relaxed problem contains no integer at all. For 80 taxa, Gurobi needed almost 14 h for the solution. The distance matrix is available as no_integer.nex in the examples folder of our R package at https://github.com/yukimayuli-gmz/lpnet.
In summary, our Lpnet implementation can be used for up to 94 taxa, and with the Gurobi solver, a solution will usually take at most a few minutes. The Rglpk solver works well for smaller instances but will struggle for more than 50 taxa.
We report running times for our heuristic in Table 2. Using a computer with more memory (128 GB), an example with 500 taxa took 28 h.
Number of taxa | User time |
---|---|
50 | 2.0631 s |
100 | 42.5499 s |
200 | 1176.228 s |
280 | 5605.2504 s |
All other CPU times reported in this article were obtained running Lpnet on a laptop with Windows 10 operating system, Intel Core i7-9750H 2.60 GHz CPU with six cores and 16 GB of RAM.
3 RESULTS
In this section, we mainly compare the performance of Neighbor-net and Lpnet for different input distances. In order to evaluate different networks from the same input, we use the LSFit, which is also available in SplitsTree4 (Huson & Bryant, 2006). Before the split weights are computed, Neighbor-net and Lpnet both try to maximize the sum of the weights of the quartets that agree with the chosen circular ordering. Therefore, we use this sum to evaluate the algorithms at that stage and refer to it as sum of quartets. Within Lpnet, the result may depend on the choice of the tree reconstruction method. We use Neighbor-joining and its common variants BioNJ and UNJ, as well as two methods that try to mimic the internal tree building of Neighbor-net that we call NNet tree and symmetric NNet tree.
We run examples on four different kinds of input distances: First, we present a simple artificial example with only seven taxa and two reticulations that shows how the early ordering of clusters by Neighbor-net can cause problems. The second example uses random distances between taxa and represents the general approximation problem of an input metric by a circular split system, without any phylogenetic signal. The third example uses simulated sequences from random trees. The last example is a published dataset of Viburnum plants which contains a hypothesized hybrid and has been suggested as a benchmark for phylogenetic network methods.
3.1 An artificial example
We start with an artificial example that demonstrates the disadvantage of ordering clusters locally. As visualized in Figure 4, we assume that seven taxa mainly evolved under a clock-like tree, with the exception of two gene transfer events ${I}_{1}$ and ${I}_{2}$. We assume that 10% (for ${I}_{1}$) and 20% (for ${I}_{2}$), respectively, of the genome of the reticulation vertices are independently replaced along the reticulation arrows. This means that the genome consists of four parts representing sequences affected by one or both or none of the reticulations. Each part follows its own tree, and the observed distance is a convex combination of those four tree distances where the coefficients are the fractions of the genome that follow the trees. This distance corresponds to 10 non-trivial splits which do not fit on any circular ordering and seven trivial splits. The non-trivial splits with the weights and the contributing trees are listed in Table 3, and the distance matrix representing the network is given in Table 4.
| Original tree (72%) | | ${\mathit{TI}}_{1}$ (8%) | | ${\mathit{TI}}_{2}$ (18%) | | ${\mathit{TI}}_{12}$ (2%) | | Tree mixture | | NNet | Lpnet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Split | Weight | Split | Weight | Split | Weight | Split | Weight | Split | Weight | Weight | Weight |
| $12$ | 4 | | | $12$ | 4 | | | $12\mid 34567$ | 3.6 | 3.6 | 3.6 |
| $34$ | 4 | | | | | | | $12567\mid 34$ | 2.88 | 2.65 | 2.8375 |
| $567$ | 2 | $567$ | 2 | | | | | $1234\mid 567$ | 1.6 | 1.6 | 1.6 |
| | | $23$ | 1 | | | | | $14567\mid 23$ | 0.08 | 6.215e-16 | |
| | | $234$ | 4 | | | | | $1567\mid 234$ | 0.32 | 0.35 | 0.35 |
| | | | | $35$ | 2 | $35$ | 1 | $12467\mid 35$ | 0.38 | | 0.375 |
| | | | | $345$ | 4 | | | $1267\mid 345$ | 0.72 | 0.9 | 0.7125 |
| | | | | $67$ | 2 | $67$ | 2 | $12345\mid 67$ | 0.4 | 0.4 | 0.4 |
| | | | | | | $167$ | 4 | $167\mid 2345$ | 0.08 | 0.1 | 0.1 |
| | | | | | | $235$ | 1 | $1467\mid 235$ | 0.02 | | |
$12347\mid 56$ | 0 | 3.73 e-16
- Note: The contributing trees and their percentage of the genome are listed in the header. Every row corresponds to a non-trivial split. For every contributing tree, we represent its splits by the smaller split halves and list the split weights. The correct weight for every non-trivial split throughout the genome is given in the Tree mixture weight column.
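The percentages in the header of Table 3 and the tree mixture weights follow directly from the two replacement fractions. The sketch below recomputes them with exact fractions. The assignment of per-tree weights to the contributing trees is our reading of Table 3, checked against the mixture column, and the variable names are hypothetical.

```python
from fractions import Fraction

p1 = Fraction(1, 10)  # fraction of the genome replaced by transfer I1
p2 = Fraction(1, 5)   # fraction of the genome replaced by transfer I2

# Independent replacement: the genome splits into four parts.
tree_fraction = {
    "T": (1 - p1) * (1 - p2),     # original tree
    "TI1": p1 * (1 - p2),         # only I1
    "TI2": (1 - p1) * p2,         # only I2
    "TI12": p1 * p2,              # both transfers
}

# Weight of a split in each tree that contains it (three rows of Table 3).
split_weights = {
    "12|34567": {"T": 4, "TI2": 4},
    "12567|34": {"T": 4},
    "12467|35": {"TI2": 2, "TI12": 1},
}


def mixture_weight(split):
    """Convex combination of the per-tree weights of one split."""
    return sum(tree_fraction[tree] * weight
               for tree, weight in split_weights[split].items())


for name, value in tree_fraction.items():
    print(name, float(value))
for split in split_weights:
    print(split, float(mixture_weight(split)))
```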
Taxon | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
2 | 6.8 | | | | | |
3 | 14 | 13 | | | | |
4 | 14 | 13.2 | 6 | | | |
5 | 15.6 | 15.4 | 13.2 | 14 | | |
6 | 16 | 16 | 16 | 16 | 14.4 | |
7 | 16 | 16 | 16 | 16 | 14.4 | 14 |
When Neighbor-net is applied to this distance, it first correctly identifies the clusters $\left\{1,2\right\}$ and $\left\{3,4\right\}$ and then decides to join those clusters. The second selection criterion of Neighbor-net chooses to make taxa 2 and 3 neighbours in the circular ordering, because it relies on quartets that have at most one taxon from $\left\{5,6,7\right\}$. This decision based on local information makes it impossible to later include the cluster $\left\{3,5\right\}$, which corresponds to a much stronger split than $\left\{2,3\right\}$. Lpnet first correctly finds the whole main tree and then chooses a circular ordering that allows all true splits with a weight higher than 0.1.
As can be seen from Table 3, the weights of the correct splits in the Lpnet tend to be clearly closer to the true weights than those in the Neighbor-net, and even $23\mid 14567$, the only true split that is allowed by the Neighbor-net ordering and not by the Lpnet one, gets a negligible weight in the Neighbor-net. Both methods achieve a very high LSFit (Neighbor-net 99.99445 and Lpnet 99.99965), but the gap to 100 is still more than 15 times greater for Neighbor-net than for Lpnet.
3.2 Random distances
We use random numbers between 0 and 1 from the uniform distribution as pairwise distances, and then add a large enough constant to all distances to guarantee the triangle inequality.
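A minimal sketch of this construction is given below. The text only says that the constant is large enough. Using a shift of 1 is our assumption: it suffices because two shifted distances then sum to at least 2, while a single shifted distance stays below 2. Function names are hypothetical.

```python
import random
from itertools import permutations


def random_metric(n, rng, shift=1.0):
    """Symmetric matrix of uniform random distances, shifted by a constant."""
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = rng.random() + shift
    return d


def satisfies_triangle_inequality(d):
    """Check d(i,k) <= d(i,j) + d(j,k) for all triples of distinct taxa."""
    n = len(d)
    return all(d[i][k] <= d[i][j] + d[j][k]
               for i, j, k in permutations(range(n), 3))


rng = random.Random(0)           # fixed seed so the example is repeatable
d = random_metric(30, rng)
print(satisfies_triangle_inequality(d))

# Without the shift, violations are expected for random input.
raw = random_metric(30, random.Random(0), shift=0.0)
print(satisfies_triangle_inequality(raw))
```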
We generate 10,000 random distance matrices with 30 taxa. Then, we compare Lpnet using different methods to construct phylogenetic trees, with Neighbor-net. We use SplitsTree4 (Huson & Bryant, 2006) with default setting to get the Neighbor-net circular ordering. Then, we compare the sum of quartets and the LSFit value for Lpnet and Neighbor-net (Table 5). We observe that for all five tree building methods, for the sum of quartets and LSFit, the Lpnet algorithm clearly tends to get better scores than Neighbor-net. While Lpnet achieves a higher LSFit for roughly 80% of the input metrics, this fraction is more than 98% for the sum of quartets. Comparing the Lpnets using different tree building methods, we find that we often get the same circular ordering. Nevertheless, Table 6 shows that UNJ performs significantly better than the other methods. Our heuristic, based on the UNJ tree, is a clear improvement compared to Neighbor-net, but it produces worse results than Integer Linear Programming.
Method | Same circular ordering as Neighbor-net | LSFit: Lpnet > NNet | LSFit: Lpnet < NNet | Sum of quartets: Lpnet > NNet | Sum of quartets: Lpnet < NNet |
---|---|---|---|---|---|
Neighbor-joining | 0 | 7782 | 2218 | 9828 | 172 |
Symmetric NNet tree | 0 | 8259 | 1741 | 9974 | 26 |
NNet tree | 0 | 8272 | 1728 | 9967 | 33 |
UNJ | 0 | 8251 | 1749 | 9977 | 23 |
BioNJ | 0 | 7962 | 2038 | 9903 | 197 |
Heuristic method | 0 | 7200 | 2800 | 9872 | 128 |
Method | Same circular ordering as UNJ | LSFit $>$ UNJ | LSFit $<$ UNJ | Sum of quartets $>$ UNJ | Sum of quartets $<$ UNJ |
---|---|---|---|---|---|
Neighbor-joining | 1107 | 3989 | 4904 | 3370 | 5523 |
Symmetric NNet tree | 5427 | 2273 | 2300 | 2038 | 2535 |
NNet tree | 3909 | 2981 | 3110 | 2556 | 3535 |
BioNJ | 2138 | 3759 | 4103 | 3621 | 4241 |
Heuristic method | 2807 | 1951 | 5242 | 0 | 7193 |
3.3 Simulated sequences
We randomly generate a tree for $30$ taxa by using the function ‘sim.taxa’ from the R package TreeSimGM (Hagen & Stadler, 2018). We let the parameter ‘waiting time until speciation’ for ‘sim.taxa’ be exponentially distributed with rate parameter $\lambda =1.2$, and then normalize such that the longest pairwise distance is one. Then, we use the software Dawg (Cartwright, 2005) to simulate DNA sequences of length 10,000 bp from the random tree under the Jukes–Cantor model. Finally, we use SplitsTree4 (Huson & Bryant, 2006) to compute Jukes–Cantor distances with default settings. We repeat this process 10,000 times and compare Lpnet and Neighbor-net in the same way as for the random metrics. Table 7 shows the result of comparing the LSFit value and the sum of all quartets between Lpnet and Neighbor-net. We see that for all five tree construction methods, the advantage of Lpnet compared to Neighbor-net increases. The sum of quartets is now almost always higher and the LSFit better for almost 95% of the datasets when we use Lpnet. The input datasets for this experiment can be interpreted as a tree metric plus some random noise, and the results show that in this situation, the strategy of Lpnet to complete the tree reconstruction before embedding the tree pays off. The various tree building methods yield the same circular ordering more often than for random metrics, but again UNJ achieves significantly better scores than the other variants of NJ (see Table 8).
Method | Same circular ordering as Neighbor-net | LSFit: Lpnet > NNet | LSFit: Lpnet < NNet | Sum of quartets: Lpnet > NNet | Sum of quartets: Lpnet < NNet |
---|---|---|---|---|---|
Neighbor-joining | 0 | 9406 | 594 | 9997 | 3 |
Symmetric NNet tree | 0 | 9412 | 588 | 9998 | 2 |
NNet tree | 0 | 9407 | 593 | 9997 | 3 |
UNJ | 0 | 9409 | 591 | 9997 | 3 |
BioNJ | 0 | 9399 | 601 | 9997 | 3 |
Heuristic method | 0 | 8608 | 1392 | 9728 | 272 |
Method | Same circular ordering as UNJ | LSFit $>$ UNJ | LSFit $<$ UNJ | Sum of quartets $>$ UNJ | Sum of quartets $<$ UNJ |
---|---|---|---|---|---|
Neighbor-joining | 9276 | 327 | 384 | 266 | 445 |
Symmetric NNet tree | 9469 | 258 | 272 | 212 | 317 |
NNet tree | 9466 | 256 | 276 | 210 | 322 |
BioNJ | 9336 | 307 | 356 | 241 | 422 |
Heuristic method | 3280 | 1807 | 4897 | 0 | 6704 |
3.4 Analysis of a published dataset
As an example of an analysis of a real dataset, we choose a study of the genus Viburnum of flowering plants (Donoghue et al., 2004). The raw data are chloroplast trnK intron and nuclear ribosomal ITS DNA sequences from 43 species of Viburnum and 2 species of Sambucus. We use the uncorrected P distance from the combined sequence alignment to compute the Neighbor-net (Figure 5) and the Lpnet (Figure 6). Following the previous results, we chose UNJ as the tree reconstruction method for Lpnet.
This dataset has been proposed as an example for testing phylogenetic network methods by an influential but now inactive blog (phylonetworks.blogspot.com/p/datasets.html). We focus on the position of Viburnum prunifolium (V. prunifolium) which was already hypothesized to be a hybrid between Viburnum lentago (V. lentago) and Viburnum rufidulum (V. rufidulum) in 1956 (Brumbaugh & Guard, 1956). From the differences between the trees obtained by analysing the two loci separately, Donoghue et al. (2004) conclude that their dataset supports that hypothesis. In Figure 5 (NNet) and Figure 6 (Lpnet), we highlight V. prunifolium in red, and V. lentago and V. rufidulum in blue. We observe that in the Neighbor-net, V. prunifolium is not placed between V. lentago and V. rufidulum, and there is no split separating V. prunifolium and V. lentago from all other taxa and no split separating V. prunifolium and V. rufidulum from all other taxa. In the Lpnet, V. prunifolium is between V. lentago and V. rufidulum, and the 2-splits grouping together only V. prunifolium and V. lentago (with medium weight) and grouping together V. prunifolium and V. rufidulum (with low weight) are both present.
From Donoghue's study (Donoghue et al., 2004), we observe that there is a cluster containing only V. prunifolium and V. lentago in the phylogenetic tree for the trnK alignment and an unresolved cluster containing V. prunifolium, V. rufidulum and V. elatum for the ITS alignment. The latter split is strong and can be observed in both networks, while the former split has medium weight in the Lpnet and conflicts with the circular ordering of the Neighbor-net. This indicates that the distance matrix is indeed better represented by the Lpnet, which is also confirmed by the better LSFit value (Lpnet: 99.90413, NNet: 99.884).
4 DISCUSSION
Eighteen years after the release of Neighbor-net, our new variant Lpnet provides an alternative that approximates the input distance better for the clear majority of the datasets we have tried. The main disadvantage of Lpnet is that it is slower and needs more memory, but most datasets that have been analysed by Neighbor-net have fewer than 80 taxa and can therefore be handled by Lpnet as well.
The main application of splits graphs, and in particular of Neighbor-nets, is to find the main signals at an early stage of a data analysis (Huson & Bryant, 2006). In practice, a Neighbor-net often contains a few strong splits and many tiny ones, which are usually interpreted as irrelevant noise. We expect that the clear signals will often be detected by both methods, while differences between the minor splits will cause a slightly higher LSFit value for Lpnet. In such cases, it does not matter much which method is used.
However, our realistic artificial example demonstrates that there can be significant differences. If reticulations are anticipated in a dataset, then the goal has to be to reconstruct an explicit phylogenetic network as shown in Figure 4. While it is generally hard to guess that network from a splits graph, this task would be easier for the Lpnet than for the Neighbor-net. We therefore anticipate that Lpnet will be useful for interpreting real datasets in the future.
We provide five different algorithms to construct phylogenetic trees. All of them turn out to yield the best LSFit occasionally, and we have no strong preference. On average, UNJ performed best for our datasets, and it seems most consistent with the general approach taken by Lpnet, which treats all pairs of taxa and all quartets equally. Therefore, we select UNJ as the default method, but we recommend trying other methods as well. Other tree building methods such as minimum evolution, or even methods that are not distance-based such as maximum likelihood, might be worth trying, and any binary tree can be input to Lpnet. It is generally interesting to see whether all splits of a reasonable tree will have positive weights in the Lpnet, and to compare the weights of the strongest other splits with the weights of the conflicting tree splits. However, users should be aware that we only have a consistency proof for the neighbor-joining variants that we provide.
Even though we replaced a heuristic part of Neighbor-net by an exact algorithm, Lpnet is still a heuristic method. It relies on a heuristic tree construction, and it optimizes the sum of quartet weights, while the final score function for a weighted split system is the LSFit value. The weights of the supported quartets indicate but do not guarantee that the distance can be approximated well by the allowed splits, and this discrepancy causes almost all cases where the LSFit of Lpnet is worse than that of Neighbor-net. It would be desirable to have a method that directly optimizes the least squares fit, but this would not allow any agglomerative construction, and we are not aware of any such algorithm other than trying every possible ordering.
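The size of the two search spaces shows why trying every possible ordering is not a practical option. The number of circular orderings of n taxa, counted up to rotation and reflection, is (n-1)!/2, whereas the number of circular orderings that agree with all splits of a fixed binary tree is 2^(n-3), as stated in the Introduction. The short Python sketch below simply evaluates both counts; the function names are chosen for illustration.

```python
from math import factorial


def all_circular_orderings(n):
    """Circular orderings of n >= 3 taxa up to rotation and reflection."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(n - 1) // 2


def tree_compatible_orderings(n):
    """Circular orderings agreeing with all splits of a binary tree."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return 2 ** (n - 3)


# Compare the two search spaces for a few taxa set sizes.
for n in (5, 10, 20, 45):
    print(n, all_circular_orderings(n), tree_compatible_orderings(n))
```

Even the restricted space grows exponentially, which is consistent with the use of integer linear programming rather than enumeration to select an ordering among those that agree with the tree.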
QNet (Grünewald et al., 2007) is the quartet analogue of Neighbor-net. It uses quartet weights directly obtained from the raw data instead of distances to reconstruct a weighted circular split system. The strategy of Lpnet to first construct a tree and then use linear programming to get a circular ordering can also be applied to modify QNet, and the approximation is expected to improve.
AUTHOR CONTRIBUTIONS
Mengzhen Guo and Stefan Grünewald conceived the ideas and designed the methodology; Mengzhen Guo wrote the R package; Stefan Grünewald led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.
CONFLICT OF INTEREST STATEMENT
The authors declare no competing interests.
PEER REVIEW
The peer review history for this article is available at https://www.webofscience.com/api/gateway/wos/peer-review/10.1111/2041-210X.14086.
DATA AVAILABILITY STATEMENT
The tree files, sequence alignments and distance matrices for the simulated datasets, as well as the distance matrices for the random distances are available at https://github.com/yukimayuli-gmz/data (archived with Zenodo at https://doi.org/10.5281/zenodo.7657019, Guo, 2023b), while the real dataset (DonoghueAll.nex) can be found in the examples folder of our R package (https://github.com/yukimayuli-gmz/lpnet, archived with Zenodo at https://doi.org/10.5281/zenodo.7657039, Guo, 2023a).