Classifying ecosystems with metaproperties from terrestrial laser scanner data

Abstract In this study, we introduce metaproperty analysis of terrestrial laser scanner (TLS) data, and demonstrate its application through several ecological classification problems. Metaproperty analysis considers pulse level and spatial metrics derived from the hundreds of thousands to millions of lidar pulses present in a single scan from a typical contemporary instrument. In such large aggregations, properties of the populations of lidar data reflect attributes of the underlying ecological conditions of the ecosystems. In this study, we provide the Metaproperty Classification Model to employ TLS metaproperty analysis for classification problems in ecology. We applied this to a proof‐of‐concept study, which classified 88 scans from rooms and forests with 100% accuracy, to serve as a template. We then applied the Metaproperty Classification Model in earnest, to separate scans from temperate and tropical forests with 97.09% accuracy (N = 224), and to classify scans from inland and coastal tropical rainforests with 84.07% accuracy (N = 270). The results demonstrate the potential for metaproperty analysis to identify subtle and important ecosystem conditions, including diseases and anthropogenic disturbances. Metaproperty analysis serves as an augmentation to contemporary object reconstruction applications of TLS in ecology, and can characterize regional heterogeneity.

For the study of ecosystems, individual metaproperties may describe and characterize particular attributes of an ecosystem. For example, the distance of returns in a forest scan could describe the spatial distribution of vegetation. By extension, groups of metaproperties describe multiple attributes of an ecosystem, and together act as a fingerprint for a scan's location. Therefore, metaproperties can classify the type of ecosystem in which a scan was taken. Furthermore, comparing metaproperties between similar ecosystems can classify ecological conditions, while metaproperties of scans within a single ecosystem can characterize spatial gradients and distinct areas.

FIGURE 1 Photographs and compact biomass lidar point clouds of a room (University of Massachusetts Boston), temperate forest (Harvard Forest) and tropical rainforest (La Selva, Costa Rica)

FIGURE 2 Diagram of metaproperties (descriptions in Table 4) featuring CBL2 TLS
In this study, we separate metaproperties into two types, pulse metaproperties and spatial metaproperties (Figure 2). Pulse metaproperties are population statistics of attributes of the pulses in a scan. Examples of pulse metaproperties could include the mean distance of returns in a scan or the ratio of first to second returns in pulses in a scan. Spatial metaproperties, on the other hand, are geometric attributes of the empty space between and around the objects detected by a TLS scan. This space is treated as a hypothetical object, whose geometric attributes, such as volume or cross-sectional area, can be derived for use as spatial metaproperties.
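The two example pulse metaproperties named above can be sketched in a few lines. This is a minimal illustration, not the study's processing code: the pulse record layout (a list of return distances per pulse) and the function names are assumptions made for the example, and the ratio is read here as a ratio of return counts.

```python
from statistics import mean

# Hypothetical pulse records: each pulse lists the distances (m) of its
# returns in order; an empty list means the pulse produced no return.
pulses = [
    [2.0],
    [4.0, 9.0],
    [],
    [6.0],
    [3.0, 7.5],
]

def mean_first_return_distance(pulses):
    """Mean distance of first returns, ignoring pulses with no return."""
    return mean(p[0] for p in pulses if p)

def first_to_second_ratio(pulses):
    """Number of first returns divided by number of second returns."""
    n_first = sum(1 for p in pulses if len(p) >= 1)
    n_second = sum(1 for p in pulses if len(p) >= 2)
    return n_first / n_second  # raises ZeroDivisionError if no second returns

mean_distance = mean_first_return_distance(pulses)
ratio_1st_2nd = first_to_second_ratio(pulses)
```

Each function reduces the whole population of pulses in a scan to a single number, which is what allows one scan to be treated as one observation in a later regression.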
The concept of spatial metaproperties has precedent in the field of mathematical morphology. Mathematical morphology concerns the properties of objects whose shape is the empty space or medium between objects in a 3-D space, encountered from a point in that space (Serra, 1982).
Metaproperty analysis augments contemporary TLS object reconstruction methods for studying ecosystems. Object reconstruction uses lidar data from one or more scans to construct representations of objects, such as trees, whose spatial properties, such as volume, are then measured and treated as proxies for the true objects' ecological properties, such as biomass (Kaasalainen et al., 2014; Krooks et al., 2014; Raumonen et al., 2015; Romanczyk et al., 2013; Wu, Cawse-Nicholson, & van Aardt, 2013). In this way, object reconstruction techniques refine a subset of the TLS data in one or more scans to model particular attributes of ecosystem structure.
Metaproperty analysis, on the other hand, utilizes almost all of the information in each scan, to provide a holistic assessment of ecosystem structure and reflective properties.
In this paper we seek to provide several proofs-of-concept for the use of metaproperty analysis for ecosystem classification. We also provide a template, the Metaproperty Classification Model (MCM), for applying the methods to future studies. We evaluate the potential of metaproperty analysis for classifying ecosystems through three increasingly subtle binary classification problems (Figure 1). Each of these three analyses uses a group of metaproperties to predict the type of ecosystem in which TLS scans were performed. We begin by demonstrating the steps and principles of metaproperty analysis through the intuitively simple task of separating scans taken in rooms from those taken in forests. We then proceed to distinguishing tropical forests from temperate forests. Finally, we attempt to distinguish between coastal and inland tropical rainforest areas within Costa Rica.

| Metaproperties used in this study
The metaproperties applied to the classification problems in this study are defined in Table 4 and depicted in Figure 2. There were two aims for the selection of the metaproperties for this study. The first aim was that the group of metaproperties could reasonably be expected to have explanatory power for the classification problems. We addressed this aim through the investigation of several independent preliminary datasets, not included in this study. This process particularly helped suggest which descriptive statistics might be appropriate as pulse metaproperties.
The second aim was that the group of metaproperties demonstrates the diversity of metrics that can be used with metaproperty analysis, so this study can act as a pathfinder for future studies. The spatial metaproperties, which are geometric attributes of the space, primarily fulfill this aim. The selected spatial metaproperties vary in their complexity, from the simple No returns:pulses, which is computed similarly to a traditional lidar estimation of gap fraction (Strahler et al., 2008) (Table 4); to the more abstract optical plane area (OPA), which is the area of the polygon created by joining the two-dimensional (X and Y) Cartesian co-ordinates of all first returns from pulses emitted at the optical plane of the TLS instrument (Figure 2). In other words, the OPA is a derivation of the area of the empty space at the optical plane.
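The two spatial metaproperties described above can be illustrated with a short sketch. It assumes the scanner sits at the origin, that the optical-plane first returns are joined in order of azimuth, and that the polygon area is taken with the shoelace formula; the function names and the square test polygon are hypothetical and are not taken from the study.

```python
import math

def optical_plane_area(points):
    """Shoelace area of the polygon through (x, y) first returns.

    Assumption: vertices are joined in order of azimuth around the
    scanner, which is taken to sit at the origin.
    """
    ordered = sorted(points, key=lambda p: math.atan2(p[1], p[0]))
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(ordered, ordered[1:] + ordered[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

def no_returns_to_pulses(n_no_return, n_pulses):
    """Fraction of emitted pulses that produced no return."""
    return n_no_return / n_pulses

# Hypothetical first returns at the corners of a 2 m x 2 m square.
square = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
opa = optical_plane_area(square)
gap_ratio = no_returns_to_pulses(250, 1000)
```

Sorting by azimuth keeps the polygon simple (non-self-intersecting) when the scanner can see every vertex, which is the situation for first returns from a single scan position.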

| Metaproperty classification model
We fully describe the MCM to provide a complete workflow for others wishing to apply metaproperty analysis to their own TLS lidar data and evaluate the results. The MCM currently comprises ten stages, with additional discretionary steps to adapt to specific data scenarios. These stages of the MCM are summarized in Figure 3 and detailed below.

Optical plane area (OPA)
Area of the polygon defined by joining the two-dimensional (X and Y) Cartesian co-ordinates of first returns from pulses emitted at the optical plane.

Rugosity
Ratio of the area of Delaunay triangulated surfaces (Lee & Schachter, 1980) fitted to the X, Y and Z co-ordinates of returns, to the area of the polygon defined by the two-dimensional (X and Y) Cartesian co-ordinates of returns.

| Step 1: Graphically assess explanatory variables
A scatter plot is created for each metaproperty, with the values of the metaproperty as the X axis and the classifications as 1 and 0 on the Y axis (for example, 1 = room, 0 = forest). This is overlaid with the logit transform of the classification, which is a continuous function of probabilities from 0 to 1, across the range of metaproperty values. For an example of a plot with an overlain logit transform of a variable, see Figure 4.

| Discretionary step: Transformation
A suitable transformation (typically the natural logarithm) can be applied if it improves the linearity of the logit-transformed variable.

| Step 3: Power analysis for testing sample size
A power analysis is performed to determine the required size of the testing dataset for 95% confidence and a 5% margin of error. This is of the form: where Z is the critical value for the confidence level (1.96 for 95% confidence), r is the proportion of one classification set, N is the total number of scans, and E is the margin of error (0.05).
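The sample-size expression itself is not reproduced in the text above. One standard expression that uses exactly these four symbols is Cochran's sample-size formula with a finite population correction; it is offered here as an assumption, not as the confirmed form used in the MCM, with n denoting the required number of testing scans:

```latex
n = \frac{N \, Z^{2} \, r(1-r)}{E^{2}(N-1) + Z^{2} \, r(1-r)}
```

As a purely illustrative case with hypothetical values N = 1000 and r = 0.5, taking Z = 1.96 and E = 0.05 gives n ≈ 277.7, which rounds up to 278 testing scans.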

| Step 4: Separate training and testing data
Individuals for the testing set should be randomly selected without replacement from the classification groups of scans, proportional to the groups' representation in the total population of scans.
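A proportional, without-replacement selection of this kind might be sketched as follows. The function name, the fixed seed, and the example labels are illustrative assumptions; rounding the per-group counts can make the total differ from the requested size by a scan or two when the proportions do not divide evenly.

```python
import random

def stratified_test_split(labels, n_test, seed=0):
    """Return (train_idx, test_idx).

    Test indices are drawn without replacement from each group, in
    proportion to that group's share of all scans.
    """
    rng = random.Random(seed)
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    test_idx = []
    for members in groups.values():
        k = round(n_test * len(members) / len(labels))
        test_idx.extend(rng.sample(members, k))  # sample() never repeats
    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, sorted(test_idx)

# Hypothetical population: 60 tropical scans and 40 temperate scans.
labels = ["tropical"] * 60 + ["temperate"] * 40
train_idx, test_idx = stratified_test_split(labels, n_test=20)
```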

| Discretionary step: Assess residuals for outliers
Individuals with high standardized residuals (>3) can be reported as outliers. If a Cook's distance test reveals that these same individuals had a disproportionate influence (Cook's distance >1.5) on the regression, then consider removing these cases and repeating the binomial logistic regression with the remainder of the same training set.

[...] (1). Such variables cannot be used in a binary logistic regression without employing a penalized likelihood method such as Firth's logistic regression.

| Discretionary step: Reduce model
If

| Step 6: Predict testing set
Use the β coefficients of the model with the metaproperties of the testing set to predict probabilities and assign predicted classifications for the testing set scans.
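As a minimal sketch of this step, the fitted coefficients can be applied through the logistic function to obtain a probability, which is then converted to a class. The coefficient values below are hypothetical, and the 0.5 decision threshold is an assumption, since the text does not state the cut-off used.

```python
import math

def predict_probability(intercept, betas, values):
    """Logistic-regression probability for one scan's metaproperty values."""
    z = intercept + sum(b * v for b, v in zip(betas, values))
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Assign class 1 at or above the threshold (assumed 0.5), else 0."""
    return 1 if probability >= threshold else 0

# Hypothetical coefficients for two metaproperties and one testing scan.
intercept, betas = -1.0, [0.5, 0.25]
scan = [4.0, 2.0]
p = predict_probability(intercept, betas, scan)
predicted_class = classify(p)
```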

| Step 7: Accuracy assessment
Assess the overall accuracy of the model for the training and testing sets. Calculate additional accuracy metrics that are informative to the specific classification problem, for example, true positive rate (sensitivity, also called recall), true negative rate (specificity), false positive rate and false negative rate.
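The metrics listed above all follow from the four cells of a binary confusion matrix. The sketch below uses hypothetical actual and predicted labels (1 = positive class, 0 = negative class) and an illustrative function name.

```python
def accuracy_metrics(actual, predicted):
    """Overall accuracy and the four rates of a binary confusion matrix."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn),   # sensitivity
        "true_negative_rate": tn / (tn + fp),   # specificity
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Hypothetical testing-set labels.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = accuracy_metrics(actual, predicted)
```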

| Step 8: Chi-squared statistic
Calculate the chi-squared statistic for the results, and report the statistic along with the degrees of freedom. If the chi-squared statistic does not suggest a difference between the observed and expected groups (evaluated at an alpha value of 0.05), this supports the performance of the model.
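The text does not specify which counts are compared, so the sketch below adopts one possible reading as an assumption: the number of scans predicted in each class (observed) is compared with the number actually in each class (expected) using Pearson's statistic. The counts are hypothetical. The critical value 3.841 is the standard chi-squared threshold for 1 degree of freedom at an alpha value of 0.05.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic for matched lists of counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Standard critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_DF1_ALPHA05 = 3.841

# Hypothetical counts for a two-class problem.
predicted_counts = [52, 48]   # scans predicted in each class (observed)
actual_counts = [50, 50]      # scans truly in each class (expected)

statistic = chi_squared(predicted_counts, actual_counts)
degrees_of_freedom = len(actual_counts) - 1
no_evidence_of_difference = statistic < CRITICAL_DF1_ALPHA05
```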

| Step 9: Compare to accuracy by chance
Calculate the accuracy by chance as the sum of the squared proportions of the number of individuals in each category to the total number of individuals in the population. This takes the form as follows:

$$A_{\text{chance}} = \sum_{i} \left( \frac{n_i}{N} \right)^{2}$$

where N is the total number of scans, and n_i is the number of scans in group i.
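The verbal definition above (sum of squared group proportions) translates directly into code. The group sizes in this sketch are hypothetical and the function name is illustrative.

```python
def accuracy_by_chance(group_sizes):
    """Sum of squared proportions of each group in the population."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)

balanced = accuracy_by_chance([50, 50])    # two equally sized groups
unbalanced = accuracy_by_chance([90, 10])  # one dominant group
```

With two equal groups the chance accuracy is 0.5, whereas a 90:10 split raises it to 0.82, which is why model accuracy should be judged against this baseline rather than against 50%.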

| Step 10: Receiver operating characteristic (ROC)
Perform a separate ROC analysis for the training and testing data and plot the curves together. Examine the plots, and report any localized changes in the rates of true and false positives. Report the area under the curve (AUC) with 95% confidence intervals. A high AUC (approaching 1) suggests the model has strong discriminatory power.
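The area under the ROC curve can be computed without plotting, using its rank interpretation: it equals the probability that a randomly chosen positive scan receives a higher predicted probability than a randomly chosen negative scan, with ties counted as one half. The sketch below uses hypothetical labels and scores and does not cover the confidence intervals, which would need a separate method such as bootstrapping.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical testing-set labels and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
```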

| Discretionary step: Precision and recall
If the classification groups were represented particularly unevenly, which may be reflected in anomalies in the ROC curve, then consider performing precision and recall analysis. Visualize the precision and recall curve and report any localized changes in precision and recall rates. Report the AUC with 95% confidence intervals. A high AUC (approaching 1) suggests the model has strong discriminatory power, given the underlying distribution of individuals between the groups.
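A precision and recall curve can be traced by ranking the scans by predicted probability and recording both rates after each scan. The sketch below uses hypothetical labels and scores; it summarizes the curve with average precision, which is one common step-wise estimate of the area under the precision-recall curve and is a substitution made here for illustration, not necessarily the estimator used in the study.

```python
def precision_recall_points(labels, scores):
    """(recall, precision) after each scan, in order of decreasing score."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    tp = 0
    points = []
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / n_pos, tp / k))
    return points

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive scan."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / k
    return total / tp

# Hypothetical unevenly represented groups: 2 positives, 4 negatives.
labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
points = precision_recall_points(labels, scores)
ap = average_precision(labels, scores)
```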

| Adaptations for rooms vs. forests
At step 2 of the MCM, we observed complete separation of Rooms and Forests with both the 1st:2nd Returns and no returns:pulses metaproperties. This prompted our addition of the recommendation that Firth's logistic regression be used in such cases in the future. We proceeded to form the binary logistic regression for this proof-of-concept classification problem without using the 1st:2nd Returns and no returns:pulses metaproperties as explanatory variables. Also, the regression for Rooms vs. Forests was trained and evaluated on the complete population of scans (no separation of training and testing sets).

| Overall
The MCM formed models with greater than 80% accuracy for testing set classification prediction for all of the classification problems.
The performance of the classification models declined slightly as the subtlety of the classification problems increased: Rooms vs.

| Rooms vs. Forests
The full model with four metaproperties as explanatory variables was utilized, having converged successfully (Wald statistic <0.01, 81 df, Table 5). There was strong evidence of a relationship between the response variable and the explanatory variables mean distance (Wald <0.01), mean intensity (Wald: 0.012), and Rugosity (Wald <0.01).
There was moderate evidence of a relationship between the response variable and OPA (Wald: 0.036). There were five outliers (standardized residuals 3.2-4.1), but the distribution of standardized residuals did not suggest that these were exceptional. Cook's distance test revealed two influential individuals (Cook's distance: 4.9, 1.9), but neither was among the outliers. The model discriminated rooms from forests with 100% accuracy.

| Tropical vs. Temperate Forests
The full model was utilized, having converged successfully (Wald statistic <0.01, 398 df,  (Table 7), and the AUC of the ROC (Figure 6)

| Inland vs. Coastal Rainforests
The full model for the binary logistic regression was utilized, having converged successfully (Wald <0.01, 929 df,

[...] explanatory power. Furthermore, each metaproperty had explanatory power in at least one of the classification problems (Table 10).

Rooms demonstrated a tendency towards higher mean distance and lower OPA (Figure 9) than Forests. Even though these had very small β coefficients, suggesting a low magnitude of effect, the existence of the relationships was strongly supported by the Wald statistics (Table 5). The lower mean distance in Forests could be explained by the abundance of near-field vegetation and tree trunks, with the increased OPA resulting from the long sight-lines between these objects (Figure 9). The many long sight-lines of Forests could have been expected to increase mean distance, but sight-lines that exceeded the maximum range of the CBL, or corresponded to gaps in the canopy, did not contribute to mean distance, since the metaproperty only considers pulses with first returns. Rugosity was higher (β: 0.117) in Rooms than Forests. This counterintuitive result is discussed below, but Rugosity still provided extremely strong discriminatory power (Wald <0.001), and thus was still valuable for the task of classifying Rooms from Forests.
Classifying Tropical Forests from Temperate Forests is also an easy task, with many potential diagnostic properties, including such basic information as the latitude of a scan's location. Given that metaproperties draw distinctions between ecosystems based on the integration of a large amount of spatial and reflective information captured by lidar data, the 97% testing set accuracy is encouraging. However, the rare misclassified cases (16/224, [...] greatly increase the area of the optical plane (Figure 10).
There was also reasonable evidence for discrimination based on no returns:pulses (Wald: 0.035), which indicated a strongly negative relationship [...]

[...] Inland Rainforest were similar in form, but with far smaller overall extent in the Inland group (Figure 12). The Inland Rainforest may have more vertical variation due to both the prevalent subcanopy, and the relative abundance of epiphytic plants on the trunks of trees (Merwin, Rentmeester, & Nadkarni, 2003). This vertical variation may account for its increased Rugosity (β: −0.0097), since height variation in lidar returns increases the 3D area.

| Selection of metaproperties
There are many possible metaproperties that could be extracted from TLS scans, and these could be used in many different combinations.
Selecting an appropriate metaproperty, or set of metaproperties, is therefore a challenging process. Sometimes a particular metaproperty will be hypothesized a priori to explain a particular, measurable ecosystem condition. For example, one might hypothesize that the number of laser pulses with multiple returns might be large in conifer forests, given their fine needles. In such cases, the single metaproperty can simply be extracted and the relationship to the ecosystem condition can be tested via traditional inferential statistical techniques such as linear regression.
However, when metaproperties are being used for more exploratory studies where no particular relationships are hypothesized a priori, as in this paper, a suite of metaproperties is desirable. The set of metaproperties to be included in an exploratory study should ideally be determined in an independent, but ecologically similar, preliminary dataset. Examining multiple potential combinations in the main dataset to select metaproperties is to be avoided, as this sort of "data snooping" drastically decreases confidence in any relationships that are eventually observed.
In the absence of a preliminary dataset, we can still guide the a priori selection of metaproperties with several general considerations.
Firstly, a group of metaproperties should include pulse metaproperties that utilize as much of the information captured in the lidar pulses of the relevant TLS instrument as possible. TLS instruments other than the CBL may capture more returns per pulse or full waveform data, or return intensity at multiple wavelengths (Douglas et al., 2012; Gaulton, Danson, Pearson, Lewis, & Disney, 2010; Howe et al., 2015), resulting in many potential pulse metaproperties. In general, pulse metaproperties will take the form of descriptive statistics of the entire population of pulses, such as the mean, minimum, maximum, standard deviation, range, or ratio. In this study, the particular statistics used as pulse metaproperties were partly chosen to provide some resilience against "ecosystem scaling." Ecosystem scaling occurs when objects are different in physical size, but not general morphology, such as dwarf vs. tall forests.
Secondly, spatial metaproperties should be independent of each other, since explanatory variables in a binomial logistic regression are assumed to be independent. Independence, in this case, means that the spatial metaproperties should not consider the same geometric properties or regions of the empty space. Additionally, spatial metaproperties should also not be substantially dependent on pulse metaproperties, such that they obviously co-vary.

| CONCLUSIONS
In this study, we introduced metaproperties as metrics that aggregate the spatial and reflective information in lidar data. We established metaproperty analysis as a way to effectively utilize the increasing variety and quality of information from contemporary TLS instruments to classify ecosystems. Through a series of ecosystem classification problems, we demonstrated how metaproperty analysis can find individual, powerful indicators for ecosystem type, as well as weighing more subtle evidence from multiple metaproperties. The MCM provided a complete workflow for ecosystem investigation, including considerations of statistical power, optimization of models, presence and influence of outliers, and appropriate metrics to assess model accuracy. The discretionary steps of the MCM are adaptable to a range of data scenarios.
Since metaproperty analysis can simultaneously consider many attributes of an ecosystem, it can uncover single diagnostic properties or form predictive models based on multiple metaproperties for ecosystem types and conditions. This could improve characterization of conditions which are currently challenging for ecological assessment, such as disease, storm damage, and anthropogenic disturbances.

FIGURE 12 OPA polygons for Inland vs. Coastal rainforests, plotted to scale. Coastal Rainforests tended towards higher OPA
Metaproperty analysis may be particularly useful for studying diseases and infestations in their early stages. While these conditions eventually result in easily identifiable changes in ecosystems, effective management relies on their classification in early stages, when the changes are more subtle. Metaproperty analysis could help establish patterns of spatial heterogeneity within ecosystems, which could guide appropriate stratified sampling for validation of airborne and satellite observations.
Metaproperty analysis methods can also be applied to historical lidar data, providing a baseline for observing ecosystem change.
The emerging class of TLS instruments that are optimized for rapid scanning and portability, such as the Compact Biomass Lidar (CBL), synergize well with metaproperty analysis. Favourable deployment logistics enable the capture of many TLS scans across large areas of ecosystems (Paynter et al., 2016). The resulting increase in sample size compared with previous instruments improves the inferential power of metaproperty analyses. A large number of scans can also provide subsets of data for preliminary analyses, yielding refined groups of metaproperties or candidates for diagnostic metaproperties for ecosystem conditions. Consideration of preliminary studies could be added to the MCM as a discretionary step. However, targeting reduced groups of metaproperties also warrants caution, as overfitting analyses to a current set of observations may exclude metaproperties with explanatory power for future observations and conditions.
Metaproperty analysis also reduces lidar data to a lightweight format, which improves the accessibility of the techniques, and thus encourages large-scale and collaborative ecosystem studies. To encourage collaboration, and maximize use of historical data, we must facilitate the combination of datasets from different TLS scanners. Adapting metaproperty analysis for use with airborne lidar data could also be extremely useful to achieve ecosystem assessment over larger spatial extents. The independence of metaproperties, and the independence of overlapping scans, also remain important topics for further investigation. However, metaproperty analysis techniques have the potential to be a pathfinder for transitioning TLS sampling from the plot scale to the landscape scale.

AUTHORS' CONTRIBUTION
I.P., D.G., E.S., F.P. and Z.L. collected TLS data. I.P. and D.G. conducted data analysis. I.P., D.G., E.S. and C.S. designed the study and produced the manuscript. Z.L. and A.S. offered additional refinements to the design of the study and manuscript. P.B. provided refinements to the study presentation and manuscript during the resubmission process.