RNeXML: a package for reading and writing richly annotated phylogenetic, character, and trait data in R

NeXML is a powerful and extensible exchange standard recently proposed to better meet the expanding needs for phylogenetic data and metadata sharing. Here we present the RNeXML package, which provides users of the R programming language with easy-to-use tools for reading and writing NeXML documents, including rich metadata, in a way that interfaces seamlessly with the extensive library of phylogenetic tools already available in the R ecosystem.


Introduction
Users of the popular statistical and mathematical computing platform R (R Core Team 2014) enjoy a wealth of readily installable comparative phylogenetic methods and tools (O'Meara 2014). Exploiting the opportunities arising from this wealth for complex and integrative comparative research questions relies on the ability to reuse and integrate previously generated or published data and metadata. The expanding data exchange needs of the evolutionary research community are rapidly outpacing the capabilities of most current and widely used data exchange standards , which were all developed a decade or more ago. This has resulted in a radiation of different data representations and exchange standard "flavors" that are no longer interoperable at the very time when the growth of available data and methods has made that interoperability most valuable. In response to the unmet needs for standardized data exchange in phylogenetics, a modern XML-based exchange standard, called NeXML, has recently been developed . NeXML comprehensively supports current data exchange needs, is predictably machine-readable, and is forward compatible.
The exchange problem for phylogenetic data is particularly acute in light of the challenges in finding and sharing phylogenetic data without the otherwise common loss of most data and metadata semantics Drew et al. 2013;Cranston et al. 2014). For example, the still popular NEXUS file format (Maddison et al. 1997) cannot consistently represent horizontal gene transfer or ambiguity in reading a character (such as a DNA sequence base pair). This and other limitations have led to modifications of NEXUS in different ways for different needs, with the unfortunate result that NEXUS files generated by one program can be incompatible with another . Without a formal grammar, software based on NEXUS files may also make inconsistent assumptions about tokens, quoting, or element lengths. Vos et al. (2012) estimates that as many as 15% of the NEXUS files in the CIPRES portal contain unrecoverable but hard to diagnose errors.
A detailed account of how the NeXML standard addresses these and other relevant challenges can be found in Vos et al. (2012). In brief, NeXML was designed with the following important properties. First, NeXML is defined by a precise grammar that can be programmatically validated; i.e., it can be verified whether a file precisely follows this grammar, and therefore whether it can be read (parsed) without errors by software that uses the NeXML grammar (e.g. RNeXML) is predictable. Second, NeXML is extensible: a user can define representations of new, previously unanticipated information (as we will illustrate) without violating its defining grammar. Third and most importantly, NeXML is rich in computable semantics: it is designed for expressing metadata such that machines can understand their meaning and make inferences from it. For example, OTUs in a tree or character matrix for frog species can be linked to concepts in a formally defined hierarchy of taxonomic concepts such as the Vertebrate Taxonomy Ontology (Midford et al. 2013), which enables a machine to infer that a query for amphibia is to include the frog data in what is returned. (For a more broader discussion of the value of such capabilities for evolutionary and biodiversity science we refer the reader to Parr et al. (2011).) To make the capabilities of NeXML available to R users in an easy-to-use form, and to lower the hurdles to adoption of the standard, we present RNeXML, an R package that aims to provide easy programmatic access to reading and writing NeXML documents, tailored for the kinds of use-cases that will be common for users and developers of the wealth of evolutionary analysis methods within the R ecosystem.

The RNeXML package
The RNeXML package is written entirely in R and available under a Creative Commons Zero public domain waiver. The current development version can be found on Github at https://github.com/ropensci/RNeXML, and the stable version can be installed from the CRAN repository. RNeXML is part of the rOpenSci project. Users of RNeXML are encouraged to submit bug reports or feature requests in the issues log on Github, or the phylogenetics R users group list at r-sig-phylo@r-project.org for help. Vignettes with more detailed examples of specific features of RNeXML are distributed with the R package and serve as a supplement to this manuscript. Each of the vignettes can be found at http://ropensci.github.io/RNeXML/.

Representation of NeXML documents in R
Conceptually, a NeXML document has the following components: (1) phylogeny topology and branch length data, (2) character or trait data in matrix form, (3) operational taxonomic units (OTUs), and (4) metadata. To represent the contents of a NeXML document (currently in memory), RNeXML defines the nexml object type. This type therefore holds phylogenetic trees as well as character or trait matrices, and all metadata, which is similar to the phylogenetic data object types defined in the phylobase package (NESCENT R Hackathon Team 2014), but contrasts with the more widely used ones defined in the ape package (Paradis et al. 2004), which represents trees alone.
When reading and writing NeXML documents, RNeXML aims to map their components to and from, respectively, their most widely used representations in R. As a result, the types of objects accepted or returned by the package's methods are the phylo and multiPhylo objects from the ape package (Paradis et al. 2004) for phylogenies, and R's native data.frame list structure for data matrices.

Reading phylogenies and character data
The method nexml_read() reads NeXML files, either from a local file, or from a remote location via its URL, and returns an object of type nexml: nex <-nexml_read("components/trees.xml") The method get_trees_list() can be used to extract the phylogenies as an ape::multiPhylo object, which can be treated as a list of ape::phylo objects: The get_trees_list() method is designed for use in scripts, providing a consistent and predictable return type regardless of the number of phylogenies a NeXML document contains. For greater convenience in interactive use, the method get_trees() returns the R object most intuitive given the arrangement of phylogeny data in the source NeXML document. For example, the method returns an ape::phylo object if the NeXML document contains a single phylogeny, an ape::multiPhylo object if it contains multiple phylogenies arranged in a single trees block, and a list of ape::multiPhylo objects if it contains multiple trees blocks (the capability for which NeXML inherits from NEXUS).
If the location parameter with which the nexml_read() method is invoked is recognized as a URL, the method will automatically download the document to the local working directory and read it from there. This gives convenient and rapid access to phylogenetic data published in NeXML format on the web, such as the content of the phylogenetic data repository TreeBASE (Piel et al. 2002(Piel et al. , 2009. For example, the following plots a tree in TreeBASE (using ape's plot function): tb_nex <-nexml_read( "https://raw.github.com/TreeBASE/supertreebase/master/data/treebase/S100.xml") tb_phy <-get_trees_list(tb_nex)

plot(tb_phy[[1]])
The method get_characters() obtains character data matrices from a nexml object, and returns them as a standard data.frame R object with columns as characters and rows as taxa: nex <-nexml_read("components/comp_analysis.xml") get_characters ( A NeXML data matrix can be of molecular (for molecular sequence alignments), discrete (for most morphological character data), or continuous type (for many trait data). To enable strict validation of data types NeXML allows, and if their data types differ requires multiple data matrices to be separated into different "blocks". Since the data.frame data structure in R has no such constraints, the get_characters() method combines such blocks as separate columns into a single data.frame object, provided they correspond to the same taxa. Otherwise, a list of data.frames is returned, with list elements corresponding to characters blocks. Similar to the methods for obtaining trees, there is also a method get_characters_list(), which always returns a list of data.frames, one for each character block.

Writing phylogenies and character data
The method nexml_write() generates a NeXML file from its input parameters. In its simplest invocation, the method writes a tree to a file: data(bird.orders) nexml_write(bird.orders, file = "birds.xml") The first argument to nexml_write() is either an object of type nexml, or any object that can be coerced to it, such as in the above example an ape::phylo phylogeny. Alternatively, passing a multiPhylo object would write a list of phylogenies to the file.
In addition to trees, the nexml_write() method also allows to specify character data as another parameter. The following example uses data from the comparative phylogenetics R package geiger (Pennell et al. 2014).
library("geiger") data(geospiza) nexml_write(trees = geospiza$phy, characters = geospiza$dat, file="geospiza.xml") Note that the NeXML format is well-suited for incomplete data: for instance, here it does not assume the character matrix has data for every tip in the tree.

Validating NeXML
File validation is a central feature of the NeXML format which ensures that any properly implemented NeXML parser will always be able to read the NeXML file. The function takes the path to any NeXML file and returns TRUE to indicate a valid file, or FALSE otherwise, along with a display of any error messages generated by the validator. nexml_validate("geospiza.xml") [1] TRUE The nexml_validate() function performs this validation using the online NeXML validator (when a network connection is available), which performs additional checks not expressed in the NeXML schema itself . If a network connection is not available, the function falls back on the schema validation method from the XML package (Lang 2013).

Creating and populating nexml objects
Instead of packaging the various components for a NeXML file at the time of writing the file, RNeXML also allows users to create and iteratively populate in-memory nexml objects. The methods to do this are add_characters(), add_trees(), and add_meta(), for adding characters, trees, and metadata, respectively. Each of these functions will automatically create a new nexml object if not supplied with an existing one as the last (optional) argument.
For example, here we use add_trees() to first create a nexml object with the phylogeny data, and then add the character data to it: nexObj <-add_trees(geospiza$phy) nexObj <-add_characters(geospiza$dat, nexObj) The data with which a nexml object is populated need not share the same OTUs. RNeXML automatically adds new, separate OTU blocks into the NeXML file for each data matrix and tree that uses a different set of OTUs.
Other than storage size, there is no limit to the number of phylogenies and character matrices that can be included in a single NeXML document. This allows, for example, to capture samples from a posterior probability distribution of inferred or simulated phylogenies and character states in a single NeXML file.

Data documentation and annotation with built-in metadata
NeXML allows attaching ("annotating") metadata to any data element, and even to metadata themselves. Whether at the level of the document as a whole or an individual data matrix or phylogeny, metadata can provide bibliographic and provenance information, for example about the study as part of which the phylogeny was generated or applied, which data matrix and which methods were used to generate it. Metadata can also be attached to very specific elements of the data, such as specific traits, individual OTUs, nodes, or even edges of the phylogeny.
As described in Vos et al. (2012), to encode metadata annotations NeXML uses the "Resource Description Framework in Annotations" (RDFa) (Prud'hommeaux 2014). This standard provides for a strict machine-readable format yet enables future backwards compatibility with compliant NeXML parsers (and thus RNeXML), because the capacity of a tool to parse annotations is not predicated on understanding the meaning of annotations it has not seen before.
To lower the barriers to sharing well-documented phylogenetic data, RNeXML aims to make recording useful and machine-readable metadata easier at several levels.
First, when writing a NeXML file the package adds certain basic metadata automatically if they are absent, using default values consistent with recommended best practices (Cranston et al. 2014). Currently, this includes naming the software generating the NeXML, a time-stamp of when a tree was produced, and an open data license. These are merely default arguments to add_basic_meta() and can be configured.
Second, RNeXML provides a simple method, called add_basic_metadata(), to set metadata attributes commonly recommended for inclusion with data to be publicly archived or shared (Cranston et al. 2014). The currently accepted parameters include title, description, creator, pubdate, rights, publisher, and citation. Behind the scenes the method automatically anchors these attributes in common vocabularies (such as Dublin Core).
Third, RNeXML integrates with the R package taxize (Chamberlain & Szöcs 2013) to mitigate one of the most common obstacles to reuse of phylogenetic data, namely the misspellings and inconsistent taxonomic naming with which OTU labels are often fraught. The taxize_nexml() method in RNeXML uses taxize to match OTU labels against the NCBI database, and, where a unique match is found, it annotates the respective OTU with the matching NCBI identifier.

Data annotation with custom metadata
The RNeXML interface described above for built-in metadata allows users to create precise and semantically rich annotations without confronting any of the complexity of namespaces and ontologies. Nevertheless, advanced users may desire the explicit control over these semantic tools that takes full advantage of the flexibility and extensibility of the NeXML specification (Parr et al. 2011;Vos et al. 2012). In this section we detail how to accomplish these more complex uses in RNeXML.
Using a vocabulary or ontology terms rather than simple text strings to describe data is crucial for allowing machines to not only parse but also interpret and potentially reason over their semantics. To achieve this benefit for custom metadata extensions, the user necessarily needs to handle certain technical details from which the RNeXML interface shields her otherwise, in particular the globally unique identifiers (normally HTTP URIs) of metadata terms and vocabularies. To be consistent with XML terminology, RNeXML calls vocabulary URIs namespaces, and their abbreviations prefixes. For example, the namespace for the Dublin Core Metadata Terms vocabulary is "http://purl.org/dc/elements/1.1/". Using its common abbreviation "dc", a metadata property "dc:title" expands to the identifier "http://purl.org/dc/elements/1.1/title". This URI resolves to a human and machine-readable (depending on access) definition of precisely what the term title in Dublin Core means. In contrast, just using the text string "title" could also mean the title of a person, a legal title, the verb title, etc. URI identifiers of metadata vocabularies and terms are not mandated to resolve, but if machines are to derive the maximum benefit from them, they should resolve to a definition of their semantics in RDF.
RNeXML includes methods to obtain and manipulate metadata properties, values, identifiers, and namespaces. The get_namespaces() method accepts a nexml object and returns a named list of namespace prefixes and their corresponding identifiers known to the object: birds <-nexml_read("birds.xml") prefixes <-get_namespaces(birds) prefixes["dc"] dc "http://purl.org/dc/elements/1.1/" The get_metadata() method returns, as a named list, the metadata annotations for a given nexml object at a given level, with the whole NeXML document being the default level ("all" extracts all metadata objects): meta <-get_metadata(birds) otu_meta <-get_metadata(birds, level="otu") The returned list does not include the data elements to which the metadata are attached. Therefore, a different approach, documented in the metadata vignette, is recommended for accessing the metadata attached to data elements.
The meta() method creates a new metadata object from a property name and content (value). For example, the following creates a modification date metadata object, using a property in the PRISM vocabulary: modified <meta(property = "prism:modificationDate", content = "2013-10-04") Metadata annotations in NeXML can be nested within another annotation, which the meta() method accommodates by accepting a parameter children, with the list of nested metadata objects (which can themselves be nested) as value.
The add_meta() function adds metadata objects as annotations to a nexml object at a specified level, with the default level being the NeXML document as a whole:

birds <-add_meta(modified, birds)
If the prefix used by the metadata property is not among the built-in ones (which can be obtained using get_namespaces()), it has to be provided along with its URI as the namespaces parameter. For example, the following uses the "Simple Knowledge Organization System" (SKOS) vocabulary to add a note to the trees in the nexml object: history <meta(property = "skos:historyNote", content = "Mapped from the bird.orders data in the ape package using RNeXML") birds <-add_meta(history, birds, level = "trees", namespaces = c(skos = "http://www.w3.org/2004/02/skos/core#")) Alternatively, additional namespaces can also be added in batch using the add_namespaces() method.
By virtue of subsetting the S4 nexml object, RNeXML also offers fine control of where a meta element is added, for which the package vignette on S4 subsetting of nexml contains examples.
Because NeXML expresses all metadata using the RDF standard, and stores them compliant with RDFa, they can be extracted as an RDF graph, queried, analyzed, and mashed up with other RDF data, local or on the web, using a wealth of off-the-shelf tools for working with RDF (see Prud'hommeaux (2014) or Hartig (2012)). Examples for these possibilities are included in the RNeXML SPARQL vignette (a recursive acronym for SPARQL Protocol and RDF Query Language, see http://www.w3.org/TR/rdf-sparql-query/), and the package also comes with a demonstration that can be run from R using the following command: demo("sparql", "RNeXML")).

Using metadata to extend the NeXML standard
NeXML was designed to prevent the need for future non-interoperable "flavors" of the standard in response to new research directions. Its solution to this inevitable problem is a highly flexible metadata system without sacrificing strict validation of syntax and structure.
Here we illustrate how RNeXML's interface to NeXML's metadata system can be used to record and share a type of phylogenetic data not taken into account when NeXML was designed, in this case stochastic character maps (Huelsenbeck et al. 2003). Such data assign certain parts (corresponding to time) of each branch in a time-calibrated phylogeny to a particular "state" (typically of a morphological characteristic). The current de-facto format for sharing stochastic character maps, created by simmap (Bollback 2006), a widely used tool for creating such maps, is a non-interoperable modification of the standard Newick tree format. This means that computer programs designed to read Newick or NEXUS formats may fail when trying to read in a phylogeny that includes simmap annotations.
In contrast, by allowing new data types to be added as -sometimes complex -metadata annotations NeXML can accommodate data extensions without compromise to its grammar and thus syntax In NeXML. To illustrate how RNeXML facilitates extending the NeXML standard in this way, we have implemented two functions in the package, nexml_to_simmap and simmap_to_nexml. These functions show how simmap data can be represented as meta annotations on the branch length elements of a NeXML tree, and provide routines to convert between this NeXML representation and the extended ape::phylo representation of a simmap tree in R that was introduced by Revell (2012). We encourage readers interested in this capability to consult the example code in simmap_to_nexml to see how this is implemented.
Extensions to NeXML must also be defined in the file's namespace in order to valid. This provides a way to ensure that a URI providing documentation of the extension is always included. Our examples here use the prefix, simmap, to group the newly introduced metadata properties in a vocabulary, for which the add_namespace() method can be used to give a URI as an identifier: nex <-add_namespaces(c(simmap = "https://github.com/ropensci/RNeXML/tree/master/inst/simmap.md")) Here the URI does not resolve to a fully machine-readable definition of the terms and their semantics, but it can nonetheless be used to provide at least a human-readable informal definition of the terms.

Publishing NeXML files from R
Data archiving is increasingly required by scientific journals, including in evolutionary biology, ecology, and biodiversity (e.g. Rausher et al. (2010)). The effort involved with preparing and submitting properly annotated data to archives remains a notable barrier to the broad adoption of data archiving and sharing as a normal part of the scholarly publication workflow (Tenopir et al. 2011;Stodden 2014).
In particular, the majority of phylogenetic trees published in the scholarly record are inaccessible or lost to the research community (Drew et al. 2013).
One of RNeXML's aims is to promote the archival of well-documented phylogenetic data in scientific data repositories, in the form of NeXML files. To this end, the method nexml_publish() provides an API directly from within R that allows data archival to become a step programmed into data management scripts. Initially, the method supports the data repository Figshare (http://figshare.com): doi <-nexml_publish(birds, repository="figshare") This method reserves a permanent identifier (DOI) on the figshare repository that can later be made public through the figshare web interface. This also acts as a secure backup of the data to a repository and a way to share with collaborators prior to public release.

Conclusions and future directions
RNeXML allows R's ecosystem to read and write data in the NeXML format through an interface that is no more involved than reading or writing data from other phylogenetic data formats. It also carries immediate benefits for its users compared to other formats. For example, comparative analysis R packages and users frequently add their own metadata annotations to the phylogenies they work with, such as annotations of species, stochastic character maps, trait values, model estimates and parameter values. RNeXML affords R the capability to harness machine-readable semantics and an extensible metadata schema to capture, preserve, and share these and other kinds of information, all through an API instead of having to understand in detail the schema underlying the NeXML standard. To assist users in meeting the rising bar for best practices in data sharing in phylogenetic research (Cranston et al. 2014), RNeXML captures metadata information from the R environment to the extent possible, and applies reasonable defaults.
The goals for continued development of RNeXML revolve primarily around better interoperability with other existing phylogenetic data representations in R, such as those found in the phylobase package (NESCENT R Hackathon Team 2014); and better integration of the rich metadata semantics found in ontologies defined in the Web Ontology Language (OWL), including programmatic access to machine reasoning with such metadata.