Volume 8, Issue 1 p. 28-36
Application

ggtree: an r package for visualization and annotation of phylogenetic trees with their covariates and other associated data

Guangchuang Yu, David K. Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam*

State Key Laboratory of Emerging Infectious Diseases and Centre of Influenza Research, School of Public Health, The University of Hong Kong, 21 Sassoon Road, Pokfulam, Hong Kong SAR, China

*Correspondence author. E-mail: [email protected]
First published: 16 August 2016

Summary

  1. We present an r package, ggtree, which provides programmable visualization and annotation of phylogenetic trees.
  2. ggtree can read more tree file formats than other software, including newick, nexus, NHX, phylip and jplace formats, and supports visualization of phylo, multiphylo, phylo4, phylo4d, obkdata and phyloseq tree objects defined in other r packages. It can also extract the tree/branch/node-specific and other data from the analysis outputs of beast, epa, hyphy, paml, phylodog, pplacer, r8s, raxml and revbayes software, and allows these data to be used to annotate the tree.
  3. The package allows colouring and annotation of a tree by numerical/categorical node attributes, manipulating a tree by rotating, collapsing and zooming out clades, highlighting user selected clades or operational taxonomic units and exploration of a large tree by zooming into a selected portion.
  4. A two-dimensional tree can be drawn by scaling the tree width based on an attribute of the nodes. A tree can be annotated with an associated numerical matrix (as a heat map), multiple sequence alignment, subplots or silhouette images.
  5. The package ggtree is released under the artistic-2.0 license. The source code and documents are freely available through bioconductor (http://www.bioconductor.org/packages/ggtree).

Introduction

Phylogenetic trees are commonly used to present the evolutionary relationships of species. Many software packages and Web tools are designed for displaying phylogenetic trees, such as treeview (Page 1996), figtree (Rambaut 2014), treedyn (Chevenet et al. 2006), itol (Letunic & Bork 2011), evolview (Zhang et al. 2012) and dendroscope (Huson & Scornavacca 2012). Only a subset, such as figtree, treedyn and itol, allows users to annotate trees with coloured branches, highlighted clades and tree features. However, their pre-defined annotation functions are usually limited to specific types of evolutionary data and are not readily programmable within the same program platform. As phylogenetic trees become more widely used in multidisciplinary studies, there is an increasing need to incorporate various types of covariates and other associated data from different sources into the trees for visualization and further analysis. Users therefore require programmable software that allows high levels of customization and data integration over the trees, in addition to standalone applications that focus on specific analyses and data types.

To fill this gap, we developed ggtree, a package for the r programming language (R Core Team, 2015) released under the bioconductor project (Gentleman et al. 2004). ggtree builds on ggplot2 (Wickham 2009), which is based on the grammar of graphics (Wilkinson 2005). Unlike most other phylogenetic software and r packages, which only read tree files in newick and/or nexus formats, ggtree supports more formats, including NHX (New Hampshire eXtended format), jplace and phylip. It also allows evolutionary data to be parsed from the non-standard output files of different software (Table 1) into annotations on a tree. This enables diverse types of annotations to be combined, visualized and further processed on the same tree topology, where new patterns or correlations of evolutionary processes could be more easily identified.

Table 1. Computer programs and r packages for molecular evolutionary analyses in which their specific data outputs can be directly parsed by ggtree
Programs Data that can be parsed
ape (r package) Bootstrap values
beast Any information (e.g. substitution rates, node ages, geographic states) stored in the node attributes in the nexus-formatted tree file
paml-baseml Ancestral sequences (from rst file)
paml-codeml Ancestral sequences (from rst file); dN, dS and ω estimates (from mlc file)
hyphy Ancestral sequences (from the nexus-formatted tree file)
phangorn (r package) Ancestral sequences
raxml Branch support values
r8s Tree with branch in unit of time, rate and absolute substitution
pplacer Taxon placement information from jplace-formatted tree file
epa Taxon placement information from jplace-formatted tree file
phylodog Any information from the NHX-formatted tree file
revbayes Any information from the NHX-formatted tree file

The r language is increasingly being used in phylogenetics. However, a comprehensive package, designed for viewing and annotating phylogenetic trees, particularly with data integration, is not yet available. Most of the r packages in phylogenetics focus on specific statistical analyses rather than viewing and annotating the trees with more generalized tree-associated data. Some packages, including ape (Paradis, Claude & Strimmer 2004) and phytools (Revell 2012), which are capable of displaying and annotating trees, are developed using the base graphics system of r. outbreaktools (Jombart et al. 2014) and phyloseq (McMurdie & Holmes 2013) extended ggplot2 to draw phylogenetic trees. The ggplot2 system of graphics allows rapid customization and exploration of design solutions. However, these packages were designed for epidemiology and microbiome data, respectively, and did not aim to provide a general solution for tree visualization and annotation (Appendix S1, Supporting Information). The ggtree package inherits versatile properties of ggplot2 and thus allows constructing complex tree views by freely combining multiple layers of annotations from different sources of tree-associated data.

Description

The ggtree package

The ggtree package is designed for annotating phylogenetic trees with their associated data of different types and from various sources. These data could come from users or analysis programs and might include evolutionary rates, ancestral sequences, etc., that are associated with the taxa from real samples, with the internal nodes representing hypothetical ancestral strains/species, or with the tree branches indicating evolutionary time courses. For instance, the data could be the geographic positions of the sampled avian influenza viruses (informed by the survey locations) and the ancestral nodes (by phylogeographic inference) in the viral gene tree (Lam et al. 2012).

ggtree supports the graphical language of ggplot2, which makes a high level of customization intuitive and flexible. However, ggplot2 itself does not provide low-level geometric objects or other support for tree-like structures. Even though outbreaktools and phyloseq are developed based on ggplot2, the most valuable part of ggplot2 syntax (adding layers of annotations) is not supported in these packages. For example, if we have plotted a tree without taxa labels, outbreaktools and phyloseq provide no easy way for general r users, who have little knowledge of the infrastructure of these packages, to add a layer of taxa labels.

ggtree extends ggplot2 to support tree objects and implements a geometric layer, geom_tree, to support visualizing tree structure. In ggtree, viewing a phylogenetic tree is relatively easy, via the command ‘ggplot(tree_object) + geom_tree() + theme_tree()’ or ‘ggtree(tree_object)’ for short. Layers of annotations can be added one by one via the ‘+’ operator. To facilitate tree visualization, ggtree provides several geometric layers, including geom_treescale for adding a legend of the tree scale (genetic distance, divergence time, etc.), geom_range for displaying uncertainty of branch lengths (confidence interval or range, etc.), geom_tiplab for adding taxa labels, geom_tippoint and geom_nodepoint for adding symbols to tips and internal nodes, geom_hilight for highlighting a clade with a rectangle and geom_cladelabel for annotating a selected clade with a bar and text label (Table 2).
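The layered syntax described above can be sketched on a toy tree. This is a minimal illustration, assuming that the ape and ggtree packages are installed; the four-taxon newick string, its labels and the colours are made up for the example and are not part of the data set used in this paper.

```r
library(ape)     # read.tree() parses a newick string
library(ggtree)  # ggtree() and the tree-specific geometric layers

# Hypothetical four-taxon tree; labels and branch lengths are invented.
tree <- read.tree(text = "((A:1,B:2):1,(C:1.5,D:0.5):2);")

# Start from the basic tree view and add one annotation layer at a time.
p <- ggtree(tree) +
  geom_tiplab() +                         # taxa labels
  geom_tippoint(color = "steelblue") +    # symbols on the tips
  geom_nodepoint(color = "grey40") +      # symbols on the internal nodes
  geom_treescale()                        # legend of the tree scale

# print(p) renders the tree on the active graphics device.
```

Each layer is independent, so removing or reordering a line changes only that annotation and leaves the rest of the view intact.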

Table 2. Major ggtree functions
Function Description
as.binary Convert a multifurcating tree to a binary tree by resolving the polytomy with zero branch lengths.
MRCA Find most recent common ancestor of two or more tips
read.paml_rst Parse an ‘rst’ file from paml, which is then stored in a paml_rst object; outputs from baseml and codeml are supported
read.baseml Parse the output of baseml, which is then stored in a baseml object
read.codeml_mlc Parse a ‘mlc’ file from codeml which is then stored in a codeml_mlc object
read.codeml Parse the output from codeml, which is then stored in a codeml object
read.hyphy Parse the output from hyphy, which is then stored in a hyphy object
read.beast Parse the output from beast, which is then stored in a beast object
read.raxml Parse the output from raxml, which is then stored in a raxml object
read.r8s Parse the output from r8s, which is then stored in an r8s object
read.jplace Parse a jplace file into a jplace object. Outputs from epa, pplacer and ggtree are supported
read.nhx Parse a NHX file into nhx object. Outputs from phylodog and revbayes are supported
read.phylip Parse PHYLIP tree file.
apeBoot Integrate a phylo object with bootstrap values from ape::boot.phylo and store the result in an apeBootstrap object
phyPML Parse output from phangorn::optim.pml, infer ancestral sequences and store the result in a phangorn object
get.fields List the annotation attributes stored in a tree object
get.treetext Extract the newick tree string from tree objects
get.tree Extract the phylo object (tree representation) from a tree object
get.tipseqs Extract tip sequences from baseml, codeml or hyphy objects
get.subs Extract nucleotide or amino acid substitutions along the tree from a baseml, codeml or hyphy object
get.placement Extract placement information from a jplace object parsed from the output of epa or pplacer
get.phylopic Download a silhouette image from the PhyloPic data base
plot Plot methods for quickly viewing the annotation data of all types of tree objects defined in ggtree
ggtree Construct a tree view from a tree object. Supported layouts are rectangular, slanted, circular, fan, unrooted and two-dimensional tree
geom_tree Layer to support drawing a tree view with ggplot2
geom_cladelabel Layer to annotate clade with bar and text label
geom_range Layer to annotate uncertainty of branch lengths
geom_hilight Layer to highlight selected clade with rectangle
geom_tiplab Layer to add labels to tree tips
geom_tippoint Layer to add symbols to tree tips
geom_nodepoint Layer to add symbols to internal nodes
geom_rootpoint Layer to add symbols to root node
geom_treescale Layer to add tree scale (e.g. substitution rate)
geom_text2 Modified version of geom_text with subset supported
geom_point2 Modified version of geom_point with subset supported
geom_segment2 Modified version of geom_segment with subset supported
theme_tree Blank theme
theme_tree2 Blank theme with evolutionary distance as the x-axis
theme_transparent Background transparent theme
theme_inset Blank theme with background transparent
scale_color Define colours based on the numerical values (scale) of attributes associated with a tree. These can then be used in colouring a tree or annotation data
collapse Collapse a selected clade
expand Expand a collapsed clade
scaleClade Zoom in or zoom out a selected clade
flip Exchange positions of two clades that share a same parent node
rotate Rotate a selected clade
groupOTU Group selected OTUs by tracing back to their most recent common ancestor
groupClade Group a selected clade or list of clades
gzoom Zoom a selected portion of a very large tree
viewClade Visualize a clade of a tree
gheatmap Visualize a tree with an associated matrix displayed next to the tree as a heatmap
subview Embed subplot
inset Annotate nodes with subplots
nodebar Create a list of bar charts for nodes
nodepie Create a list of pie charts for nodes
phylopic Annotate a tree with a silhouette image downloaded from the PhyloPic data base
mask Mask all genetic substitutions on the tree branches, except for those specified.
msaplot Visualize a tree with a multiple sequence alignment displayed next to the tree
open_tree Convert circular layout tree to fan layout
rescale_tree Rescale branch length
rotate_tree Rotate tree by specific angle
%<% Update a tree view with another tree object
%<+% Append user-specific annotation data to an existing tree view. These data can be used for annotating the tree
write.jplace Output a jplace file of a tree with user-specified data. It can be used to store a tree with user's own annotation data. The output will be able to be parsed by read.jplace and is fully supported in ggtree

File formats and S4 classes

In ggtree, the S4 class defines a compound tree-based object that contains the tree and other information associated with the tree, branches or nodes. ggtree can read a number of tree file formats, including newick and nexus (via ape), NHX, jplace (Matsen et al. 2012) and phylip, into an S4 tree object. Non-standard analysis output files from various evolutionary biology software packages including beast (Bouckaert et al. 2014), epa (Berger, Krompass & Stamatakis 2011), hyphy (Pond, Frost & Muse 2005), paml (Yang 2007), phylodog (Bastien et al. 2013), pplacer (Matsen, Kodner & Armbrust 2010), raxml (Stamatakis 2014), revbayes (Sebastian et al. 2014) and r8s (Sanderson 2003) (Table 1) can also be parsed into S4 objects using functions read.beast, read.codeml_mlc, read.codeml, read.hyphy, read.jplace, read.nhx, read.paml_rst, read.phylip, read.raxml and read.r8s (Fig. 1, Table 2). After parsing, some node/branch-specific attribute data (e.g. evolutionary rates, ancestral/taxon sequences) are extracted from the files and stored in the S4 tree object. An overview of S4 classes and corresponding parser functions is illustrated in Fig. 1.

Fig. 1. Overview of S4 classes and their parser functions defined in ggtree. Phylogenetic tree, branch/node-associated data and other information contained in the standard tree files or output files from supported analysis programs (Table 1) can be imported into an S4 tree object by parser functions. The middle and bottom compartments of the class diagrams (boxes) show the attributes to be added to the tree object after parsing and other helper methods, respectively. Two S4 tree objects can be merged into one using the merge_tree function. ggtree also supports visualization of phylo, multiphylo, phylo4, phylo4d, obkdata and phyloseq tree objects defined in other r packages, via ggtree().

Furthermore, ggtree provides a function, merge_tree, to combine two trees together with their node/branch-specific attribute data. As a result, one such attribute (e.g. evolutionary rate) can be mapped to another attribute (e.g. dN/dS) of the same branch/node for comparison and further computations (Fig. 2). ggtree can also directly visualize and annotate phylo, multiphylo, phylo4, phylo4d, obkdata and phyloseq tree-related objects that are defined in other r packages. The tree object in ggtree can also be converted, via get.tree(), to phylo or multiphylo objects that are widely used in other r packages. In addition, ggtree provides a fortify method to convert the tree object to a tidy data frame, which is familiar to r users and easy to manipulate. Therefore, ggtree represents an infrastructure that enables phylogeny/taxon-related data inferred from different external computer programs or r packages to be unified and analysed in r.
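The conversion to a tidy data frame can be illustrated on a plain phylo object. The sketch below assumes that ape and ggtree are installed and that loading ggtree registers a fortify method for phylo objects; the newick string is a made-up toy tree, and the column names used (label, isTip, x, y) are those we expect in the resulting data frame.

```r
library(ape)
library(ggtree)

# Hypothetical toy tree with four taxa.
tree <- read.tree(text = "((A:1,B:2):1,(C:1.5,D:0.5):2);")

# One row per node (tips and internal nodes), with plotting coordinates
# computed for the default rectangular layout.
df <- ggplot2::fortify(tree)

# Ordinary data-frame operations then apply, e.g. keep only the tips.
tips <- df[df$isTip, c("label", "x", "y")]
```

Because the result is an ordinary data frame, it can be filtered, joined with external tables or passed to any ggplot2 geometry.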

Fig. 2. After merging the beast and codeml outputs, the branch-specific estimates (substitution rate, dN/dS, dN and dS) from the two analysis programs are compared on the same branch basis. The associations of dN/dS, dN and dS vs. rate are visualized in hexbin scatter plots.

Example 1: parsing tree and analysis output files

To illustrate the utilities of ggtree, we used a previously published data set: 76 H3 hemagglutinin gene sequences of a lineage containing swine and human influenza A viruses (Liang et al. 2014). The data set was re-analysed by beast for timescale estimation and codeml for synonymous and non-synonymous substitutions estimation. In this example, we first parsed the outputs from beast using read.beast and from codeml using read.codeml into two tree objects. Then, the two objects containing two sets of branch/node-specific data were merged via the merge_tree function.

    library(ggtree)

    # Locate the example files shipped with the package
    beast_file <- system.file("examples/MCC_FluA_H3.tree", package = "ggtree")
    rst_file <- system.file("examples/rst", package = "ggtree")
    mlc_file <- system.file("examples/mlc", package = "ggtree")

    # Parse the beast and codeml outputs into two tree objects
    beast_tree <- read.beast(beast_file)
    codeml_tree <- read.codeml(rst_file, mlc_file)

    # Merge the trees together with their branch/node-specific data
    merged_tree <- merge_tree(beast_tree, codeml_tree)
    get.fields(merged_tree)

    ##  [1] "height"          "height_0.95_HPD" "height_median"
    ##  [4] "height_range"    "length"          "length_0.95_HPD"
    ##  [7] "length_median"   "length_range"    "posterior"
    ## [10] "rate"            "rate_0.95_HPD"   "rate_median"
    ## [13] "rate_range"      "t"               "N"
    ## [16] "S"               "dN_vs_dS"        "dN"
    ## [19] "dS"              "N_x_dN"          "S_x_dS"
    ## [22] "marginal_subs"   "joint_subs"      "marginal_AA_subs"
    ## [25] "joint_AA_subs"

After merging the beast_tree and codeml_tree objects, all branch/node-specific data inferred by beast and codeml are available in the merged_tree object, in the components [1-13] and [14-25] of the vector above, respectively. We further converted the tree object to a data frame, df, and visualized hexbin scatter plots of dN/dS, dN and dS inferred by codeml vs. rate inferred by beast on the same branches.

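    library(ggplot2)   # ggplot(), geom_hex(), facet_wrap()
    library(magrittr)  # %>% pipe

    df <- fortify(merged_tree)
    df <- df[, c("dN_vs_dS", "dN", "dS", "rate")]
    df <- na.omit(df)
    # Keep branches with dN/dS in [0, 1.5] and reshape to long format
    df <- df[df$dN_vs_dS >= 0 & df$dN_vs_dS <= 1.5, ] %>%
        tidyr::gather(type, value, dN_vs_dS:dS)
    df$type[df$type == "dN_vs_dS"] <- "dN/dS"
    # Use a factor so that the panels appear in this order
    df$type <- factor(df$type, levels = c("dN/dS", "dN", "dS"))
    ggplot(df, aes(rate, value)) + geom_hex() +
        facet_wrap(~ type, scales = "free_y")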

The output is illustrated in Fig. 2. We can then test the association of these branch/node-specific data using Pearson correlation, which in this case showed that dN and dS, but not dN/dS, are significantly associated with rate.
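A Pearson correlation test of this kind can be sketched with base r alone. The per-branch values below are synthetic and only mimic the column layout (rate, dN, dS) of the data frame derived from merged_tree; the helper cor_with_rate is a hypothetical name introduced for illustration and is not a ggtree function.

```r
# Synthetic per-branch table: values are invented for illustration only.
rate <- seq(0.001, 0.010, length.out = 20)
wiggle <- rep(c(-1, 1), times = 10) * 1e-4
branches <- data.frame(
  rate = rate,
  dN   = 0.5 * rate + wiggle,
  dS   = 2.0 * rate - wiggle
)

# Hypothetical helper: Pearson correlation test of one column against rate.
cor_with_rate <- function(column, data) {
  cor.test(data[[column]], data$rate, method = "pearson")
}

results <- lapply(c(dN = "dN", dS = "dS"), cor_with_rate, data = branches)

# Collect the correlation coefficient and p-value of each test.
summary_table <- data.frame(
  estimate = names(results),
  r        = vapply(results, function(x) unname(x$estimate), numeric(1)),
  p_value  = vapply(results, function(x) x$p.value, numeric(1))
)
print(summary_table)
```

With the real data, the same helper would be applied to the wide data frame (before reshaping to long format), with one call per estimate to be compared against rate.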

Example 2: phylogenetic tree visualization and annotation

The following example turns the merged_tree into a graphic object with tree branches coloured by branch-specific substitution rates (rate) as shown in Fig. 3a.

Fig. 3. Phylogenetic tree of H3 influenza viruses. The tree with branches scaled in time (years from the root) and coloured by substitution rates (a). The tree was rescaled using dN as branch lengths and coloured by dN values (b). The tree branches were rescaled in time (Gregorian calendar) and were assigned to different groups based on host species of the taxa, by which the branches were annotated in different line types, colours and symbols (c).

    p <- ggtree(merged_tree, aes(color = rate)) +
        theme_tree2() +
        scale_color_continuous(high = '#D55E00', low = '#0072B2') +
        geom_tiplab(size = 2)

Other branch/node-specific data stored in the tree object (Fig. 5) can be displayed as an additional graphic layer of annotation on top of a tree. Complex presentations of trees are made possible by adding multiple layers of annotations. A phylogenetic tree can be rescaled using any numerical variable associated with its branches. For instance, branch-specific estimates of dN, dS and ω from codeml analysis can be used as lengths and colours of the branches in the tree (Fig. 3b). Tree nodes can be given different symbols based on their associated categorical values (Fig. 3c). ggtree can display a tree in different layouts, including rectangular, slanted, circular and fan layouts for phylogram and cladogram, rooted/unrooted, timescaled and two-dimensional phylogenies.

Compared to other phylogenetic tree visualization packages, ggtree excels at visual exploration of tree structure and related data. For example, a complex tree view with several annotation layers can be transferred to a new tree object without step-by-step re-creation. We have created an operator, %<%, to update a tree view with a new tree object. The following example rescales the branch lengths of the tree (merged_tree) with the branch-specific dN values and updates the graphic object (p) with this new tree via %<%. The branch colours of this updated tree view are re-mapped from ‘rate’ to ‘dN’.

    p %<% rescale_tree(merged_tree, 'dN') + aes(color = dN)

The groupClade function assigns the branches and nodes under different clades into different groups. Similarly, groupOTU function assigns branches and nodes to different groups based on user-specified groups of operational taxonomic units (OTUs) that are not necessarily within a clade, but can be monophyletic (clade), polyphyletic or paraphyletic. A phylogenetic tree can be annotated by mapping different line type, size, colour or shape to the branches or nodes that have been assigned to different groups. In the following example (Fig. 3c), we assigned branches and nodes to different groups based on the host species of the taxa via groupOTU(). According to the groupings, branches were then given different colours and line types, and the taxa were given symbols with different colours and shapes. We also applied the timescale, in Gregorian calendar, to the branch lengths by setting the most recent sampling date (mrsd).

    tip <- get.tree(merged_tree)$tip.label
    # Group the swine taxa; the grouping is stored in the "host" attribute
    merged_tree <- groupOTU(merged_tree, tip[grep("Swine", tip)], "host")
    ggtree(merged_tree, aes(color = host, linetype = host), mrsd = "2013-01-01") +
        geom_tippoint(aes(shape = host)) + theme_tree2()

To facilitate viewing and manipulating a phylogenetic tree, ggtree provides a number of helper functions. For example, the gzoom or viewClade functions allow the user to zoom into a selected portion or display a selected clade respectively. Other common tree manipulations could be achieved by collapse, expand, rotate, flip, etc., functions. A list of the major ggtree functions is given in Table 2, and their detailed explanations and examples are provided in the online vignette.
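Some of these helper functions can be sketched on a toy tree. The example assumes that ape and ggtree are installed; the five-taxon newick string is invented, the clade is located with ape::getMRCA rather than the MRCA function of Table 2, and collapse is called with an explicit ggtree:: prefix to avoid clashes with functions of the same name in other packages.

```r
library(ape)
library(ggtree)

# Hypothetical five-taxon tree.
tree <- read.tree(text = "(((A:1,B:1):1,C:2):1,(D:1,E:1):2);")
p <- ggtree(tree) + geom_tiplab()

# Internal node number of the most recent common ancestor of A and B.
clade <- getMRCA(tree, c("A", "B"))

# Highlight the clade with a rectangle.
p_highlighted <- p + geom_hilight(node = clade, fill = "steelblue", alpha = 0.3)

# Collapse the same clade into a single point.
p_collapsed <- ggtree::collapse(p, node = clade)

# Display only the larger clade containing A, B and C.
p_clade <- viewClade(p, node = getMRCA(tree, c("A", "B", "C")))
```

All three calls take the node number of the clade, so the same lookup can be reused across manipulations.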

Example 3: two-dimensional trees

The y-axis or width of a conventionally laid out tree (i.e. with tree branches spanning horizontally along the x-axis, as shown in Fig. 3) often provides only regular spatial separation of the tree branches, without quantitative biological meaning. ggtree can draw ‘two-dimensional’ trees by rescaling the y-axis/tree width to a node-specific numerical attribute that might be a measure of certain biological characteristics of the taxa and hypothetical ancestors in the tree. In this example, we used the previous timescaled tree object and aimed to scale its y-axis/tree width based on the number of predicted N-linked glycosylation sites (NLG) for each of the taxon and ancestral sequences. The NLG sites were predicted using the netnglyc 1.0 Server (Blom et al. 2004), read into r and stored in the NAG variable (Appendix S2). To scale the y-axis, the parameter yscale in the ggtree() function is set to a numerical or categorical variable. If yscale is a categorical variable, users should specify how the categories are to be mapped to numerical values via the yscale_mapping variable, as demonstrated in this case. The resultant two-dimensional tree is shown in Fig. 4.

Fig. 4. Two-dimensional tree with the trunk and other branches highlighted in red (for swine) and blue (for human). The x-axis is scaled to the branch length (in units of year) of the timescaled tree. The y-axis is scaled to the node attribute variable, in this case the number of predicted N-linked glycosylation sites (NLG) on the hemagglutinin protein. Coloured circles indicate the different types of tree nodes. Note that nodes assigned the same x (temporal) and y (NLG) coordinates are superimposed in this representation and appear as one node, which is shaded based on the colours of all the nodes at that point.

    ggtree(merged_tree, aes(color = host), mrsd = "2013-01-01",
           yscale = "label", yscale_mapping = NAG)

Example 4: more complex tree annotations

In this example, we demonstrate more complex tree annotation with additional texts and shapes (Fig. 5). We first visualized the tree in timescale and branches coloured by dN/dS. The tree was annotated with clade probabilities and the amino acid substitutions. The substitutions were determined via parent–child sequence comparison from the taxon sequences and ancestral sequences that can be estimated by any of hyphy, baseml or codeml.

Fig. 5. Timescaled phylogenetic tree annotated with a matrix of values associated with each taxon, in this case the genotypes of H3 influenza viruses. The x-axis is the timescale (in units of year) inferred by beast. The tree branches are coloured by their dN/dS values (as in the left scale at the top), and the internal node labels show the posterior clade probabilities. Tip labels (taxon names) and circles are coloured by species (human in blue and swine in red). The genotype, which is shown as an array of coloured boxes on the right, is composed of the lineages (either HuH3N2, Pdm/09 or TRIG, coloured as in the right legend at the top) of the eight genome segments of the virus. Any missing segment sequences are shown as empty boxes. Smaller rectangular and fan layouts of the unlabelled tree are shown in the upper and lower insets on the left. The complete code is available in Appendix S2.

While ggtree supports tree annotation using data from a list of software (Table 1), it also easily accepts user-defined data. In ggtree, the operator %<+% has been defined to allow user-defined annotation data (host.df in this example) to attach to a tree graphic object. In this example, we attached the host species information to the tree view and coloured the circle symbols and labels of taxa based on this information.

Users may have a matrix of data (from experiments or data analysis) about the taxa in the phylogenetic tree. In ggtree, this data matrix can be displayed as a heat map aligning with the corresponding taxa at the right side of the tree by gheatmap function. Here, we annotated the tree with a heat map of the genotypes for each taxon (Fig. 5). In the genotype matrix, the colour of each of the eight boxes indicates the lineage of each gene segment of the viruses that was classified according to Lam et al. (2011) and Liang et al. (2014).

ggtree provides the subview function to add a subplot as a new layer of the main plot. In this example, the tree with the associated matrix was condensed into rectangular and fan shapes, which were plotted as subplots inside Fig. 5.

    ## Below is a code excerpt; see Appendix S2 for details.
    ## Visualize a tree with branches in timescale and coloured by dN/dS.
    cols <- scale_color(merged_tree, "dN_vs_dS",
                        low = "#0072B2", high = "#D55E00",
                        interval = seq(0, 1.5, length.out = 100))
    p <- ggtree(merged_tree, size = .8, mrsd = "2013-01-01",
                ndigits = 2, color = cols)

    ## Add annotation of amino acid substitutions inferred by joint probabilities.
    p <- p + geom_text(aes(x = branch, label = joint_AA_subs),
                       vjust = -.03, size = 1.8)

    ## Use the %<+% operator to attach the host information to the tree view.
    p <- p %<+% host.df

    ## After the attachment via the %<+% operator, we can use host information
    ## to colour circles and labels of tips.
    p <- p + geom_tippoint(aes(color = host), size = 2) +
        geom_tiplab(aes(color = host), align = TRUE, size = 3, linesize = .3)

    ## Visualize genotype heatmap.
    gheatmap(p, genotype, width = .4, offset = 7, colnames = FALSE) %>%
        scale_x_ggtree()
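The excerpt above relies on host.df and genotype, which are built in Appendix S2. A self-contained sketch of the same two steps (attaching user data with %<+% and adding a heat map with gheatmap) is given below, assuming that ape and ggtree are installed. The tree, the host assignments and the two-segment genotype table are all invented for illustration; only the lineage names are taken from Fig. 5.

```r
library(ape)
library(ggtree)
library(ggplot2)

# Hypothetical four-taxon tree.
tree <- read.tree(text = "((A:1,B:2):1,(C:1.5,D:0.5):2);")

# Made-up annotation table; the first column holds the tip labels.
host.df <- data.frame(
  label = c("A", "B", "C", "D"),
  host  = c("Human", "Swine", "Swine", "Human"),
  stringsAsFactors = FALSE
)

# Attach the table to the tree view, then map the new column to colour.
p <- ggtree(tree) %<+% host.df +
  geom_tippoint(aes(color = host), size = 2) +
  geom_tiplab(aes(color = host), offset = 0.1)

# Made-up genotype table; row names must match the tip labels.
genotype <- data.frame(
  PB2 = c("HuH3N2", "TRIG", "TRIG", "HuH3N2"),
  HA  = c("HuH3N2", "HuH3N2", "Pdm/09", "HuH3N2"),
  row.names = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Draw the table as a heat map to the right of the tree.
p_heat <- gheatmap(p, genotype, width = 0.4, offset = 0.5, colnames = FALSE)
```

The two inputs are matched differently: %<+% joins on the first column of the data frame, whereas gheatmap aligns rows by their row names.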

In addition to heat map display of taxa-associated matrix data, the underlying multiple sequence alignment of the taxa could be displayed with the tree using the msaplot function. Furthermore, trees can also be annotated with subplots of different types of graphs (e.g. bar, pie, box plot) using inset function or with silhouette images taken from the PhyloPic data base (http://phylopic.org) with phylopic function.

Conclusions

The ggtree package features (i) high interoperability, as ggtree can import evolutionary data from different tree file formats and analysis programs as well as other associated data from experiments, so that various sources and types of data can be displayed on a tree for comparison and further analyses; (ii) complex phylogenetic presentations, such as two-dimensional trees and graph/image-associated trees; (iii) a highly flexible graphic system, as ggtree extends ggplot2 and allows creating separate geometric layers that can be freely combined, removed and rearranged to support diverse but convenient ways of tree manipulation and visualization. ggtree also supports visualization of tree objects defined by other r packages so that ggtree can be easily integrated into their analyses/packages. For example, phyloseq tree objects and microbiome data can be visualized using ggtree (Appendix S1). With the help of ggtree, users can easily create large phylogenetic trees with complex annotations by integrating various associated data including temporal, spatial and genotypic information, such as the trees created in Liang et al. (2014) and Lam et al. (2015).

Acknowledgements

We gratefully thank the Editor and three anonymous reviewers for their useful suggestions and comments that have significantly improved this manuscript. This research was supported by Seed Funding Programme for Basic Research, HKU (201411159214), Theme-based Research Scheme (T11-705/14-N) and Area of Excellence Scheme grant (AoE/M-12/06) from University Grants Committee of the HKSAR. This research is conducted in part using the research computing facilities (HPC2015) and advisory services offered by Information Technology Services, HKU. The authors declared no conflict of interest to the publication of this work.

Author contributions

G.Y. and T.T.-Y.L. conceived and developed the methods and r package; G.Y., D.K.S. and T.T.-Y.L. wrote the manuscript; G.Y., D.K.S., H.Z., Y.G. and T.T.-Y.L. contributed to the final version of the manuscript.

Data accessibility

Example data deposited in the Dryad repository: http://datadryad.org/resource/doi:10.5061/dryad.v15v0 (Yu et al. 2016).