Volume 10, Issue 7
APPLICATION

An r package and online resource for macroevolutionary studies using the ray‐finned fish tree of life

Jonathan Chang

School of Biological Sciences, Monash University, Clayton, VIC, Australia

Correspondence: Jonathan Chang, jonathan.chang@monash.edu

Daniel L. Rabosky

Museum of Zoology, Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI

Stephen A. Smith

Museum of Zoology, Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI

Michael E. Alfaro

Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA

First published: 28 March 2019

Abstract

  1. Comprehensive, time‐scaled phylogenies provide a critical resource for many questions in ecology, evolution and biodiversity. Methodological advances have increased the breadth of taxonomic coverage in phylogenetic data; however, accessing and reusing these data remain challenging.
  2. We introduce the Fish Tree of Life website and associated r package fishtree to provide convenient access to sequences, phylogenies, fossil calibrations and diversification rate estimates for the most diverse group of vertebrate organisms, the ray‐finned fishes. The Fish Tree of Life website presents subsets and visual summaries of phylogenetic and comparative data, and is complemented by the r package, which provides flexible programmatic access to the same underlying data source for advanced users wishing to extend or reanalyse the data.
  3. We demonstrate functionality with an overview of the website, and show three examples of advanced usage through the r package. First, we test for the presence of long branch attraction artefacts across the fish tree of life. The second example examines the effects of habitat on diversification rate in the pufferfishes. The final example demonstrates how a community phylogenetic analysis could be conducted with the package.
  4. This resource makes a large comparative vertebrate dataset easily accessible via the website, while the r package enables the rapid reuse and reproducibility of research results via its ability to easily integrate with other r packages and software for molecular biology and comparative methods.

1 INTRODUCTION

Phylogenies are fundamental to comparative evolutionary biology, and their use extends to community ecology, conservation biology, ecophysiology, developmental biology and translational medical research. New phylogenetic information can illuminate open questions in biology, but this work is clouded by the difficulty in inferring phylogenies, especially for non‐specialist researchers (Pearse & Purvis, 2013). To avoid these pitfalls, reusing existing phylogenies can make phylogenetic knowledge accessible without requiring researchers to collaborate with phylogenetic experts or learn these methods themselves (Arnold, Matthews, & Nunn, 2010; Magee, May, & Moore, 2014; Webb, Ackerly, & Kembel, 2008; Webb & Donoghue, 2005). However, surveys of the biological literature estimate that 60%–95% of previously published phylogenetic datasets are no longer accessible (Drew et al., 2013; Magee et al., 2014; McTavish, Drew, Redelings, & Cranston, 2017; Stoltzfus et al., 2012), highlighting the challenge of persistently sharing data and creating a major barrier to new comparative analyses.

One alternative solution is a “tree of life” approach that centralizes research effort across large groups to create a curated and validated phylogenetic dataset, as opposed to smaller family‐ or genus‐level analyses (Beaulieu & O'Meara, 2018; McTavish et al., 2017). These broad phylogenies, in diverse groups such as mammals, birds, squamate reptiles, fishes and angiosperms (Bininda‐Emonds et al., 2007; Jetz, Thomas, Joy, Hartmann, & Mooers, 2012; Pyron, Burbrink, & Wiens, 2013; Rabosky et al., 2018; Zanne et al., 2014), represent the best target for phylogenetic re‐use, as extensive sampling across these broad organismal groups is likely to cover the particular set of species that would interest a taxon‐focused researcher.

Here, we present a new community resource and accompanying r package, the Fish Tree of Life, focusing on the ray‐finned fishes, the most species‐rich group of vertebrates with over 33,000 species. We describe this resource, which is based on a recent complete phylogeny (Rabosky et al., 2018), and provide three motivating examples, showing how this large empirical dataset could be used to investigate the common problem of long branch attraction, study a specific taxon in a phylogenetic comparative analysis and analyse a dataset using methods from phylogenetic community ecology. This work joins other resources such as birdtree.org (Jetz et al., 2012), the Open Tree of Life (Hinchliff et al., 2015), and Phylotastic (Nguyen et al., 2018). We expand on these previous offerings by also providing pre‐computed taxonomic subsets with character matrices, phylogenies, fossil calibrations and diversification rate information in a website and r package.

2 FUNCTIONALITY

2.1 Website: fishtreeoflife.org

Our website aims to permit easy access to the curated dataset introduced in Rabosky et al. (2018), including the multiple sequence alignment, the phylogram from RAxML (Stamatakis, 2014), the time‐calibrated phylogeny from treePL (Smith & O'Meara, 2012) and the fossil calibrations used for divergence time estimation. We also generated pages and downloads for each rank above family in the Phylogenetic Fish Classification (Rabosky et al., 2018). Each page lists all species in that taxon, as well as taxonomy and subsets of the sequence alignments, phylogenies and fossil calibrations. Separate pages and downloads permit more focused work; for example, in conjunction with new genetic data, a researcher could use profile alignment in MAFFT (Katoh & Standley, 2013) to incorporate their new data into our existing sequence alignment. This saves time compared to a de novo analysis, as the rigorous validation and curation process in Rabosky et al. (2018) should reduce the number of erroneous or misidentified sequences in combined datasets (Bridge, Roberts, Spooner, & Panchal, 2003).

We have also included a fossil section in our Fish Tree of Life website (Figure 1). This lists all 139 fossils used in our analysis, as well as the phylogenetic placement of those fossils on the phylogeny. Each page includes the taxon it calibrates (e.g. crown Acanthuridae), as well as the minimum age, authorities for taxonomic placement and age and fossil locality. We also show the upper bound of the 95% confidence interval for the estimated age of the Hedman fossil outgroup process (Hedman, 2010), and list the fossil outgroup sequence used to calculate those bounds. Our approach explicitly integrates fossil knowledge in a phylogenetic context suitable for divergence time estimation, while some other resources, such as TimeTree (Hedges, Dudley, & Kumar, 2006) or DateLife (Nguyen et al., 2018), either do not permit reuse or lack detailed fossil taxonomy and locality data. Our compilation could provide an established starting point for analyses that, for example, vary fossil calibrations to estimate their downstream effects on diversification rate inference.

Figure 1. (a) An example of the fossil calibration page, which includes the exact locality and authorities of the fossil, as well as the outgroup sequence used to determine the 95% upper bound on maximum ages. (b) The same data represented as JavaScript Object Notation (JSON), a machine‐readable data format
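
Because each calibration is also published in a machine‐readable form, a record of this kind could be read by a script instead of by eye. The following is a minimal sketch in Python. The key names and all values in the embedded record are hypothetical placeholders chosen for illustration, not the schema or data actually served by the website.

```python
import json

# Hypothetical record: key names and values are illustrative placeholders,
# NOT the schema or data served by fishtreeoflife.org.
RAW = """
{
  "calibrates": "crown ExampleTaxon",
  "min_age_ma": 50.0,
  "max_age_95_ma": 62.5,
  "locality": "Example Formation",
  "outgroup_sequence_ma": [55.0, 58.0, 61.0]
}
"""


def summarize_calibration(raw):
    """Parse one calibration record; return (taxon, minimum age, 95% upper bound)."""
    record = json.loads(raw)
    lower = float(record["min_age_ma"])
    upper = float(record["max_age_95_ma"])
    # A usable calibration needs an upper bound at least as old as the minimum age.
    if upper < lower:
        raise ValueError("upper bound is younger than the minimum age")
    return record["calibrates"], lower, upper


taxon, lower, upper = summarize_calibration(RAW)
print(f"{taxon}: {lower} to {upper} Ma")
```

The sanity check on the two bounds is the kind of validation that becomes cheap once calibrations are available as structured data.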

2.2 r package: fishtree

As the website is intended for browsing, more complex analyses should be conducted in a reproducible programming environment. We, therefore, wrote the r package fishtree, which facilitates access to data from the fishtreeoflife.org website. Researchers can load the alignments, phylogenies and diversification rate metrics directly into native r objects, using the fishtree_alignment, fishtree_phylogeny and fishtree_tip_rates functions, respectively, and can subset data by taxonomic rank, for example, by family (Labridae) or order (Labriformes). Phylogenies are classed as type phylo from ape (Paradis & Schliep, 2018) to work seamlessly in conjunction with other commonly used r packages for phylogenetics and comparative analysis. We summarize the major fishtree functions in Table 1.

Table 1. An overview of the four major functions in the r package fishtree. For all functions that take a named taxonomic rank, any rank higher than family is accepted, including higher taxa, for example, Ostariophysi or Ovalentaria
Function | Data retrieved
fishtree_alignment | Aligned sequences for a taxonomic rank or list of species, optionally splitting by gene partition
fishtree_taxonomy | Information for a taxonomic rank, including a list of species and average diversification rates
fishtree_phylogeny | Phylogeny for a taxonomic rank or list of species. Permits downloads of paraphyletic taxa, either by dropping species that break monophyly, or by including all species descending from the most recent common ancestor of all species sampled in the taxon.
fishtree_tip_rates | Tip‐specific diversification rates for a taxonomic rank or list of species, computed via BAMM (Rabosky, 2014) or DR statistic (Jetz et al., 2012)

3 EXAMPLE APPLICATIONS

Here, we demonstrate three example studies that could be conducted with the fishtree r package. The first example shows how researchers could investigate a common problem in phylogenetic inference, long branch attraction. The second example shows how comparative biologists interested in a specific group (pufferfishes) could test a hypothesis related to trait‐dependent diversification. We also provide a final example as a vignette in the supplement that shows a phylogenetic community ecology analysis using the r package picante (Kembel et al., 2010). The latter two examples are available in the Supporting Information and as vignettes in the r package.

3.1 Example: testing long branch attraction across the fish tree of life

We demonstrate how a researcher might investigate the problem of long branch attraction (LBA). This occurs when two long branches are incorrectly grouped together as sisters (Bergsten, 2005), and is generally recognized as a problem when saturation, heterotachy or across‐lineage rate variation is rampant in a sequence alignment (Philippe, Zhou, Brinkmann, Rodrigue, & Delsuc, 2005).

Here, we reanalyse the phylogeny by family to determine what portions might have been affected by LBA. If LBA artefacts are present, we predict that the reanalysed topologies would be more balanced (less pectinate) than the original, globally analysed phylogeny. If saturation is causing LBA, we expect that the transition rates would also be faster in the reanalysed phylogenies. The faster transition rates may cause unrelated taxa to be recovered as sister lineages, as fast molecular evolution can lead to shared mutations that are identical by state, not by descent.
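
To make the balance prediction concrete, the sketch below computes the raw Colless index, the sum over internal nodes of the absolute difference in tip counts between the two daughter clades, for two toy four‐tip trees. It is an illustration in Python, not the analysis code used here: trees are represented as nested 2‐tuples (an assumption of this sketch), and the Yule normalization is omitted.

```python
def colless(tree):
    """Return (number of tips, raw Colless index) for a binary tree.

    A tree is either a tip label (any non-tuple value) or a 2-tuple of subtrees.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    if len(tree) != 2:
        raise ValueError("only strictly binary trees are supported")
    left_tips, left_score = colless(tree[0])
    right_tips, right_score = colless(tree[1])
    # Each internal node contributes the imbalance between its two daughters.
    imbalance = abs(left_tips - right_tips)
    return left_tips + right_tips, left_score + right_score + imbalance


# Toy trees with the same four tips.
pectinate = ((("A", "B"), "C"), "D")   # fully unbalanced (ladder-like)
balanced = (("A", "B"), ("C", "D"))    # fully balanced

print(colless(pectinate))
print(colless(balanced))
```

A lower index means a more balanced tree, so a reanalysed topology affected by LBA would be expected to score lower than its pruned counterpart. The recursion is adequate for small examples; very deep pectinate trees would need an iterative traversal.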

We downloaded the alignment for each family with fishtree_alignment and used fishtree_taxonomy to exclude families in which three or fewer species had data. We re‐estimated the topology of each remaining family using RAxML v8.2.11 (Stamatakis, 2014) under a partitioned GTR+GAMMA model (Yang, 1996); we refer to these as the “reanalysed” trees. We also downloaded the phylogeny for each family, pruned from the entire phylogeny, with fishtree_phylogeny; we refer to these as the “pruned” trees.

For each of the reanalysed and pruned topologies, we inferred the rates of molecular evolution using the ‐f e option in RAxML. We additionally conducted an approximately unbiased (AU) test of topologies (Shimodaira & Hasegawa, 1999), using the ‐f G option in RAxML to compute per‐site likelihoods for scoring in CONSEL (Shimodaira & Hasegawa, 2001). We reanalysed n = 268 family‐level phylogenies, with 43.16 species per family on average; the largest family (Cyprinidae) had 1,369 species. After correcting for multiple comparisons (Benjamini & Hochberg, 1995), we significantly rejected (pAU < 0.05) the pruned topology in 8 of 268 families with the AU test (Figure 2).
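
The multiple‐comparison step can be illustrated with a short sketch of the Benjamini–Hochberg step‐up adjustment. This is a plain Python re‐implementation for illustration, not the code used in the analysis, and the p‐values are toy numbers rather than results from the 268 families.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values, one per hypothetical family (not values from this study).
toy_p = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(toy_p)
rejected = [q < 0.05 for q in adjusted]
print(adjusted, rejected)
```

Each raw p‐value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.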

Figure 2. The approximately unbiased (AU) test for tree topologies significantly rejected 8 of 268 reanalysed phylogenies in favour of the original topology, coloured in dark red. (a) Alignment incompleteness, species richness, and their interaction significantly predicted the p‐value of the AU test. (b) Skeletal family‐level phylogeny of the ray‐finned fish tree of life; bar lengths are the negative log of the AU test p‐value. This figure with tips labelled by family is provided as Figure S1

We also computed the normalized Robinson‐Foulds (RF) distance (Robinson & Foulds, 1981) and Yule‐normalized Colless tree balance metric (Blum, François, & Janson, 2006; Colless, 1982) using apTreeshape and phangorn (Bortolussi, Durand, Blum, & François, 2006; Schliep, 2011). We fit two regression models: a full model that included alignment incompleteness, the log species richness in the family and the difference in the Colless metric and the RF distance between the pruned and reanalysed phylogenies, with all interaction terms; and a reduced model that only included the alignment incompleteness, log species richness and interaction term; both models used the AU test p‐value as the response term. A likelihood ratio test supported the less complex model, with all predictors significant at p < 0.001.
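
As an illustration of the topological distance used as a predictor, the sketch below computes a normalized Robinson‐Foulds distance between two small trees by comparing their non‐trivial bipartitions. It is a simplified Python stand‐in for what a dedicated package computes. It assumes trees are nested tuples, treats them as unrooted, and normalizes by 2(n − 3), the maximum distance for fully resolved unrooted trees with n tips.

```python
def tip_set(tree):
    """Return the frozenset of tip labels below a node."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(tip_set(child) for child in tree))


def splits(tree):
    """Return the set of non-trivial bipartitions of an (unrooted) tree."""
    all_tips = tip_set(tree)
    anchor = min(all_tips)  # each split is stored as the side lacking this tip
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_tips - clade if anchor in clade else clade
        # Skip trivial splits that isolate zero tips or a single tip.
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Normalized Robinson-Foulds distance for two trees on the same tips."""
    tips_a, tips_b = tip_set(tree_a), tip_set(tree_b)
    if tips_a != tips_b:
        raise ValueError("trees must share the same tip labels")
    n = len(tips_a)
    if n < 4:
        raise ValueError("at least four tips are needed")
    differing = len(splits(tree_a) ^ splits(tree_b))
    return differing / (2 * (n - 3))


# Toy five-tip trees that differ in the placement of B and C.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(tree_1, tree_2))
```

A value of 0 indicates identical unrooted topologies and a value of 1 indicates that two fully resolved trees share no non‐trivial bipartitions.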

Consistent with our prediction, we find that the reanalysed phylogenies tended to be more balanced (less pectinate) than the pruned topology, measured by the Colless metric (31 of 50, 62%). Relative to the pruned topologies, the reanalysed topologies generally had faster transition parameters and substantially different base composition frequencies (Table 2), suggesting that LBA contributed to the more balanced topologies recovered in the reanalysed phylogenies. Based on the significant predictors in the likelihood ratio test, we speculate that the larger dataset more robustly parameterizes the substitution model and leads to fewer LBA artefacts in the pruned trees.

Table 2. Transition rates A ↔ C, A ↔ G, A ↔ T, C ↔ G, C ↔ T tend to be faster in reanalysed phylogenies, base frequency parameters πA, πC, πG, πT have a substantially different distribution, and the α parameter of the gamma model of rate heterogeneity (Yang, 1996) suggests much less among‐site rate heterogeneity in reanalysed phylogenies. Transition rate parameters were computed relative to the G ↔ T transition rate
Parameter | Proportion of 268 reanalysed trees where this parameter was smaller
A ↔ C | 0.03
A ↔ G | 0.24
A ↔ T | 0.05
C ↔ G | 0.57
C ↔ T | 0.01
πA | 0.91
πC | 0.99
πG | 0.01
πT | 0.85
α | 0.97
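
The summary in Table 2 rests on two simple operations: expressing each rate relative to the G ↔ T rate, and counting how often a parameter estimate is smaller in one tree of a pair than in the other. The Python sketch below shows both steps with toy numbers. It assumes that "smaller" means the reanalysed estimate is below the pruned estimate for the same family, and the function names and values are hypothetical.

```python
def relative_to_reference(rates, reference="GT"):
    """Scale a dict of exchange rates so that the reference rate equals 1."""
    ref = rates[reference]
    if ref <= 0:
        raise ValueError("reference rate must be positive")
    return {pair: value / ref for pair, value in rates.items()}


def proportion_smaller(reanalysed, pruned):
    """Fraction of paired estimates where the reanalysed value is smaller."""
    if len(reanalysed) != len(pruned) or not reanalysed:
        raise ValueError("need two non-empty sequences of equal length")
    smaller = sum(1 for r, p in zip(reanalysed, pruned) if r < p)
    return smaller / len(reanalysed)


# Toy rates for one hypothetical family (not estimates from this study).
toy_rates = {"AC": 2.0, "CT": 6.0, "GT": 4.0}
print(relative_to_reference(toy_rates))

# Toy paired estimates of one parameter across four hypothetical families.
print(proportion_smaller([1.0, 5.0, 3.0, 0.5], [2.0, 4.0, 6.0, 1.0]))
```

Under this reading, a proportion near 0 for a rate parameter means the reanalysed estimate was usually the larger of the pair.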

4 CONCLUSION

We have presented a comprehensive resource that makes a massive comparative dataset of vertebrates available for evolutionary biologists and ecologists. Our resource has numerous facilities to permit researchers to easily use subsets of an otherwise impractically large dataset. We believe that making this dataset available in both web and r package formats will unlock a massive dataset for scientific reuse and synergize well with r Notebooks and other reproducible research tools such as Docker (Boettiger, 2017), while simultaneously lowering the barrier for starting a comparative analysis for researchers of all ability levels.

To demonstrate this, we have shown three example use‐cases of our resource, one examining a broad question in molecular evolution, another testing a comparative phylogenetic hypothesis, and the last using a community phylogenetics analysis. In the first example, we made an extremely time‐consuming task much easier, as we were able to rapidly import and subset the relevant data into r and focus our efforts on connecting the output from different software and analysing the results. In the second example, we were able to rapidly test a hypothesis in a comparative context, since fishtree was designed to work well within the r phylogenetics ecosystem and all analyses could be conducted without many data cleaning tasks. In the last example, we showed how fishtree could also be used in a community phylogenetics analysis by testing whether reef fish communities in several ocean basins are phylogenetically clustered or overdispersed.

As concerns around data curation and cleaning in large data aggregations become increasingly visible in biological research (Franz & Sterner, 2018), the ease of use of the tooling around this large, well‐curated dataset provides a framework for how concerns around data quality might be assuaged. Further development of the website and r package will focus on adding more pre‐computed analyses and figures, which will provide more starting points for researchers hoping to extend and reuse these resources. Finally, our website and r package can be easily updated as new phylogenetic knowledge becomes available. As the entire process has been standardized and automated inside of a Docker container, any newer ray‐finned fish phylogeny can be added as a data file to extend the available data.

ACKNOWLEDGEMENTS

We thank Matt McGee, Daniele Silvestro and two anonymous reviewers for their insightful comments on this manuscript, and Peter Cowman and Tom Near for testing early versions of the website and r package. This work was supported by an Encyclopedia of Life Rubenstein Fellowship (EOL‐33066‐13) and an NSF Doctoral Dissertation Improvement Grant (DEB‐1601830) to JC. Travel support to disseminate this research was provided to JC by UCLA and the Society of Systematic Biologists. This research used computational and storage services associated with the Hoffman2 Shared Cluster provided by UCLA Institute for Digital Research and Education's Research Technology Group, as well as computational resources provided by Advanced Research Computing at the University of Michigan, Ann Arbor.

AUTHORS' CONTRIBUTIONS

J.C. drafted the manuscript, developed the methods, and wrote the software and website. S.A.S., D.L.R., and M.E.A. assisted with analyses and website design. All authors planned the work and contributed to the final manuscript.

DATA ACCESSIBILITY

Our website can be accessed at https://fishtreeoflife.org. The r package is available on GitHub (https://github.com/jonchang/fishtree) as well as CRAN (https://CRAN.R-project.org/package=fishtree). Source code and data for the example demonstrations are available from the Dryad Digital Repository (Chang, Rabosky, Smith & Alfaro, 2019, https://doi.org/10.5061/dryad.6vg974n) and in the Supporting Information.
