Calibration chain transformation improves the comparability of organic hydrogen and oxygen stable isotope data

Stable hydrogen and oxygen isotopic compositions (δ2H and δ18O, respectively) of animal tissues have been used to infer geographical origin or mobility based on the premise that the isotopic composition of tissue is systematically related to that of local water sources. Isotopic data for known‐origin samples are required to quantify these tissue–environment relationships. Although many such data have been published and could be reused by researchers, differences in the standards used for calibration and in the analytical procedures used for different datasets limit the comparability of these data. We develop an algorithm that uses results from comparative analysis of secondary standards to transform data among reference scales and estimate the uncertainty inherent in these transformations. We apply the algorithm to a compilation of known‐origin keratin data published over the past ~20 years. We show that transformation improves the comparability of data from different laboratories, and that the transformed data suggest ecophysiologically meaningful differences in keratin–water relationships among different animal groups and taxa. The compiled data and algorithms are freely available in the assignR R‐package to support geographical provenance research, and more generally offer a methodology for overcoming several challenges in geochemical data integration and reuse.


Methods in Ecology and Evolution. MAGOZZI et al.

[…] to that of local water sources (i.e. precipitation) through relatively predictable relationships (e.g. Chamberlain et al., 1996; Ehleringer et al., 2008; Hobson & Wassenaar, 1997). The isotopic composition of precipitation varies predictably across space and time (Bowen & Revenaugh, 2003; Bowen, Wassenaar, et al., 2005; Craig, 1961), and thus tissue isotope ratios derived ultimately from precipitation-driven local food webs can be compared to environmental water isoscapes (predictive models of spatio-temporal isotope patterns) to infer tissue origin (Ma et al., 2020; Vander Zanden et al., 2014; Wunder, 2010). Keratinous tissues, such as feather, hair or nail, are metabolically inert once formed and so preserve an isotopic composition characteristic of the location of tissue growth (Hobson, 1999; Macko et al., 1999; West et al., 2004), and are the focus here and in many published studies.
Because tissue-environment relationships vary among taxa and regions (Magozzi et al., 2019 and references therein), samples of known origin are needed to quantify such relationships (e.g. Chamberlain et al., 1996; Ehleringer et al., 2008; Hobson et al., 2012; Hobson & Wassenaar, 1997). The collection and analysis of known-origin samples, however, is resource-intensive and in some cases prohibitive, reducing the efficiency and applicability of the approach. Although many known-origin datasets have already been published, and in theory could be reused to quantify tissue-environment relationships in new studies, different sample preparation, analytical and calibration practices used at different laboratories (or even at a single laboratory over time) have generated data that are not directly comparable (e.g. Bowen, Chesson, et al., 2005; Meier-Augenstein et al., 2013; Soto et al., 2017; Wassenaar & Hobson, 2003). As a result, responsible users of published data have thus far focused on measurements made in a single laboratory or obtained with identical protocols.
Different sample treatment and analysis methods are one source of inconsistency in published data. Exchange of H atoms from protein carboxyl and hydroxyl groups with ambient atmospheric water vapour molecules affects measured δ2H values and must be corrected for (Bowen, Chesson, et al., 2005; Chesson et al., 2009; Schimmelmann, 1991; Wassenaar & Hobson, 2000), most commonly through comparative analysis against keratin standards for which non-exchangeable δ2H values have been established (Kelly et al., 2009; Sauer et al., 2009; Wassenaar & Hobson, 2003). Water is also tightly adsorbed by keratin and adheres to sample capsules, contributing to the measured δ2H and δ18O values (Wortmann et al., 2001) unless samples are thoroughly dried prior to combustion (Bowen, Chesson, et al., 2005; Coplen & Qi, 2012; Soto et al., 2017).
Analytical methods have themselves evolved since the advent of online thermal conversion/elemental analysis isotope ratio mass spectrometry (TC/EA-IRMS) for keratin H and O isotope analysis ~20 years ago. The use of different pyrolysis reactor fillings (e.g. chromium vs. glassy carbon; Gehre et al., 2015; Nair et al., 2015) and chromatographic conditions (Hunsinger et al., 2013) has been shown to affect measured δ2H and δ18O values, respectively. Fortunately, isotope ratio analysis is performed as a comparative analysis, wherein the sample values are calibrated to the accepted values for co-analysed standards. Under ideal circumstances conforming to the principle of identical treatment (PIT; Werner & Brand, 2001), which requires the preparation and analysis of matrix-matched (chemically and physically equivalent) standards alongside the unknown samples, analytical biases should affect the samples and standards similarly and have little effect on sample values reported relative to the standard values (i.e. on a 'reference scale' defined by the standards used and their assigned values). However, ideal circumstances are not possible in many situations. Even where samples and standards are compositionally similar (i.e. keratin), differences in preparation (i.e. grinding/powdering) and biochemistry (i.e. amino acid profile) can affect H exchange and may impart bias to otherwise PIT-compliant comparisons (Alibardi, 2017; Bowen, Chesson, et al., 2005; Robbins, 2012). Additionally, the internationally accepted reference scale (here the Vienna Standard Mean Ocean Water, 'VSMOW', scale) for δ2H and δ18O data is defined by primary standards that are water, meaning that calibrating organic secondary standards to the VSMOW scale using PIT procedures is impossible.
As methodologies have advanced, a set of 'optimal' procedures minimizing bias in these non-PIT comparisons has been developed, but in the interim many different secondary standard calibrations have been produced, leading to published known-origin data that are laboratory-specific and not robustly traceable to the VSMOW scale.
Here, we leverage cross-calibration studies, in which one set of secondary standards is analysed alongside and calibrated to a reference scale defined by a second set, to develop a method that transforms data between reference scales. Transformation from the original scale to a target scale proceeds along a chain of linked calibrations, ideally with each link consisting of a PIT-based cross-calibration or a non-PIT calibration using optimal methods (Soto et al., 2017; see also Coplen & Qi, 2016). The algorithm propagates uncertainty, and permits comparison of data reported on more than two dozen reference scales. We apply and test the method using a compilation of data for >4,000 keratin samples, showing that the method reduces, but does not always eliminate, discrepancies among data from different laboratories. The data and transformation algorithm are available in the assignR R-package (Ma et al., 2020) to support their open reuse.

| Secondary standard calibration history
We compile information on widely used keratin secondary standards […]

Each reference scale is defined by two secondary standards and their assigned values (Dunn & Carter, 2018). In a typical laboratory application, a linear model relating instrument-reported values for the secondary standards to their assigned values is applied to the measured values for unknowns to calibrate them to the reference scale.
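The two-standard linear calibration described above can be illustrated with a minimal sketch. The function names and all numerical values below are hypothetical and chosen only for illustration; they are not taken from any of the calibrations compiled here.

```python
def two_point_calibration(measured_stds, assigned_stds):
    """Return (slope, intercept) of the line mapping instrument-reported
    values for two secondary standards onto their assigned values."""
    (m_lo, m_hi), (a_lo, a_hi) = measured_stds, assigned_stds
    if m_hi == m_lo:
        raise ValueError("the two standards must have distinct measured values")
    slope = (a_hi - a_lo) / (m_hi - m_lo)
    intercept = a_lo - slope * m_lo
    return slope, intercept


def calibrate(values, slope, intercept):
    """Apply the linear model to measured values for unknown samples."""
    return [slope * v + intercept for v in values]


# Hypothetical delta values (per mil): instrument-reported vs. assigned.
slope, intercept = two_point_calibration(
    measured_stds=(-100.0, -50.0),
    assigned_stds=(-110.0, -50.0),
)
calibrated = calibrate([-75.0], slope, intercept)
```

Because the reference scale is defined entirely by the two assigned values, changing either assigned value changes both the slope and the intercept applied to every unknown, which is why data calibrated against different assigned values are not directly comparable.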
Assigned values for secondary standards are the product of calibration (through co-analysis) relative to a reference scale defined by a different set of primary or secondary standards. These calibrations have used a wide range of methodologies, and for most keratin standards multiple calibrations have been generated using different methods. In general, these can be classified as calibrations based on (1) PIT methods, with full reporting of methods and uncertainty, (2) measurement against non-matrix-matched standards using optimal methods to minimize matrix effects or (3) measurements that are neither PIT-based nor optimal. For category 2, we consider optimal methods to include thorough sample drying (e.g. using a sealed, evacuated and heated carousel such as the Uni-Prep™ device; Wassenaar et al., 2015; Soto et al., 2017), correction for exchangeable H via equilibration with multiple waters and use of a Cr-filled (vs. glassy carbon) pyrolysis reactor to avoid bias caused by HCN-producing reactions (Soto et al., 2017; see also Coplen & Qi, 2016).

[…; Wassenaar & Hobson, 2003). In 2011, the Environment Canada laboratory introduced the KHS and CBS (kudu horn and caribou hoof) standards to replace CHS, CFS and BWB, and analysed these with online combustion continuous flow isotope ratio mass spectrometry techniques; between 2011 and 2015, nine calibrations for these secondary standards were reported (EC_H_1-9; Soto et al., 2017; Wassenaar et al., 2015; L.I. Wassenaar, pers. comm.).
The USGS Denver laboratory produced two hair standards […; Table S3). Each of these is either non-PIT or lacking uncertainty, however, and is therefore considered a floating calibration.
In 2011-2012, the USGS Reston isotope laboratory introduced the USGS42 and USGS43 (Tibetan and Indian human hair) standards.
Seven calibrations exist, including a series of calibrations to primary standards using non-optimal methods (US_H_1-4 and 6; Coplen & Qi, 2012, 2016; Wassenaar et al., 2015; summarized by Soto et al., 2017) and one using optimal methods (US_H_7; Soto et al., 2017). The certified values for these materials comprise the US_H_6 scale (https://isotopes.usgs.gov/lab/referencematerials/USGS42.pdf; https://isotopes.usgs.gov/lab/referencematerials/USGS43.pdf; Coplen & Qi, 2016), and, although this calibration used non-optimal drying methods, the values obtained are indistinguishable from US_H_7 (Soto et al., 2017) and are thus adopted here as the authoritative calibration to VSMOW. USGS42 and USGS43 were also calibrated to the UT_H_2 reference scale (US_H_5; […]

[Table 1, fragment; entries in their original order]
OldEC.1_H_1; BWB −108; OldEC.1_H_1; Wassenaar and Hobson (2003); equilibration with waters at high temperature; Zn-reduction and dual inlet-IRMS method described by Wassenaar and Hobson (2000); calibration to water standards
CHS −187; OldEC.2_H_1; BWB −108; OldEC.2_H_1; Wassenaar and Hobson (2003); equilibration with waters at high temperature; Zn-reduction and dual inlet-IRMS method described by Wassenaar and Hobson (2000); calibration to water standards
Soto et al. (2017); equilibration with waters at room temperature for 6 days, dried using different methods, assumed ε = 0‰; TCEA-IRMS with glassy C-filled reactor; calibration to water standards

[…] Brand et al., 2009; Schimmelmann, 2002), and a consensus value has been adopted based on an inter-laboratory average (IAEA_O_2; Brand et al., 2009, 2014). Although we include the benzoic acids here, we note that these are not matrix-matched standards for keratins and are not calibrated to VSMOW following the criteria that we have accepted as optimal. Thus, we discourage their use in keratin analysis and suggest caution in the interpretation of transformations involving these standards.
Several calibrations for the Environment Canada standard materials have been reported based on non-PIT, non-optimal methods (OldEC.3_O_1 and EC_O_9 and 10). These were also calibrated to VSMOW using optimal methods […]. The AND and CAL-SAL standards were calibrated to waters using non-optimal (CAN_O_6) and optimal (CAN_O_7) analytical methods […]. They were also PIT calibrated to the US_O_1 reference scale (CAN_O_5; Coplen & Qi, 2012).

| Secondary standard database
We compiled summary data and methodological information for each secondary standard calibration (Tables 1 and 2), a description of all unpublished calibration data (Table S1; Table 2), and, when available, the raw unpublished calibration data themselves (Tables S3 and S4).
A summary of the calibrations was added as a new object stds within the assignR R-package (Ma et al., 2020). This list object contains two data frames (hstds and ostds) that record assigned secondary […]

| Known-origin database
We updated assignR's database of known-origin tissue samples by adding keratin data and information on sample preparation and analysis. The new knownOrig database consists of three objects.
knownOrig$sources includes attribution and methodological information, where available, for the compiled datasets (see Table S5). Documentation includes (a) sample type (e.g. feather); […] OldEC.1_H_1); (j) whether the standards used for calibration were powdered (Y/N); (k) whether procedures that limit the contribution of adsorbed water to the analyses, either through dedicated sample preparation devices (Soto et al., 2017; Wassenaar et al., 2015) or careful drying and rapid handling of dried samples (e.g. Bowen, Chesson, et al., 2005), were used (Y/N); (l) the analysis method […]

[…] These functions allow transformations between most reference scales compiled here. We used calibration chains (Figures S3 and S4) to transform all compiled datasets to the VSMOW reference scale for analysis.
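As a rough illustration of how a calibration chain can carry a value and its uncertainty from one reference scale to another, the sketch below applies a sequence of linear links. It is not the assignR implementation: the scale names, slopes, intercepts and link uncertainties are hypothetical, and the uncertainty rule (rescale the incoming standard deviation by the link slope, then add the link's own uncertainty in quadrature) is a simplifying assumption.

```python
import math

# Hypothetical links: (from_scale, to_scale) -> (slope, intercept, link_sd).
# None of these numbers come from the compiled calibrations.
LINKS = {
    ("SCALE_A", "SCALE_B"): (0.95, -3.0, 1.0),
    ("SCALE_B", "VSMOW"): (1.02, 4.0, 1.5),
}


def transform_along_chain(value, sd, chain, links):
    """Transform a value and its SD through consecutive links of a chain.

    Assumption: each link is linear, the incoming SD is rescaled by the
    link slope, and the link SD is added in quadrature.
    """
    for step in zip(chain, chain[1:]):
        slope, intercept, link_sd = links[step]
        value = slope * value + intercept
        sd = math.sqrt((slope * sd) ** 2 + link_sd ** 2)
    return value, sd


value_vsmow, sd_vsmow = transform_along_chain(
    -100.0, 2.0, ["SCALE_A", "SCALE_B", "VSMOW"], LINKS
)
```

Under these assumptions, a link with a slope below 1 compresses the scale and shrinks the incoming uncertainty, while every link adds its own term, so the net effect on a sample's uncertainty depends on which contribution dominates.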

| Tissue-water relationships
We compared within-species site-average keratin values for known-origin samples with local precipitation amount-weighted annual average δ2H and δ18O values extracted from precipitation isoscapes (http://www.waterisotopes.org; Bowen & Revenaugh, 2003; Bowen, Wassenaar, et al., 2005). These values offer a standardized, first-order estimate of spatial variation in local environmental water isotope ratios that can be used to characterize tissue-water relationships across different groups of taxonomically and/or ecologically related species. We derived such relationships using both original and recalibrated keratin data.
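A minimal sketch of this comparison, assuming ordinary least squares as the regression method, is given below. The site identifiers and isotope values are hypothetical, and each site is assumed to carry a single precipitation value.

```python
from collections import defaultdict
from statistics import mean


def site_averages(records):
    """Collapse (site, tissue_value, precip_value) records to one
    (precip_value, mean_tissue_value) point per site."""
    tissue_by_site = defaultdict(list)
    precip_by_site = {}
    for site, tissue, precip in records:
        tissue_by_site[site].append(tissue)
        precip_by_site[site] = precip
    return [(precip_by_site[s], mean(v)) for s, v in sorted(tissue_by_site.items())]


def ols(points):
    """Ordinary least-squares slope and intercept for (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical records: (site, keratin delta, precipitation delta), per mil.
records = [
    ("s1", -100.0, -80.0), ("s1", -104.0, -80.0),
    ("s2", -80.0, -60.0),
    ("s3", -64.0, -40.0), ("s3", -60.0, -40.0),
]
points = site_averages(records)
slope, intercept = ols(points)
```

Averaging within sites before fitting gives each location equal weight in the regression regardless of how many samples were collected there.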

| Validation
We validated the quality of the calibration chain transformations using several datasets in which data for the same or related samples were originally reported on different reference scales. These include modern human hair δ2H data reported on the OldUT_H_1 (Ehleringer et al., 2008; Thompson et al., 2010) and CAN_H_1 scales ([…]). […] was that values for the scaup samples would be more similar after transformation to a common scale, and that the tissue-water relationship for other ecologically related samples would be more uniform following transformation. For two sample groups (modern humans and ground-foraging non-passerine birds), enough data were available to allow statistical testing. Levene's test was used to assess whether residual variance from tissue-water regressions was reduced using different regressions for data originally calibrated to different reference scales; the test was repeated for pre- and post-transformation data.
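For readers unfamiliar with Levene's test, the sketch below computes its W statistic for groups of regression residuals, using absolute deviations from each group mean. This is a generic illustration with hypothetical residuals, not the analysis script used here; the p value would be obtained by comparing W with an F distribution with (k − 1, N − k) degrees of freedom, which is not computed in this sketch.

```python
from statistics import mean


def levene_w(groups):
    """Levene's W statistic for equality of variances across groups
    (mean-centred version)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations of each observation from its group mean.
    z = []
    for g in groups:
        g_bar = mean(g)
        z.append([abs(y - g_bar) for y in g])
    z_bars = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zb - z_grand) ** 2 for zi, zb in zip(z, z_bars))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, z_bars) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical residuals (per mil) from two alternative regression fits.
residuals_separate = [-2.0, -1.0, 1.0, 2.0]
residuals_pooled = [-6.0, -3.0, 3.0, 6.0]
w = levene_w([residuals_separate, residuals_pooled])
```

A large W indicates that the spread of residuals differs between the groups being compared; a W near zero indicates similar spread.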

| Hydrogen
The compiled known-origin dataset includes 935 human hair and 3,075 bird feather samples analysed for δ2H values. Hair data were originally reported on the OldUT_H_1 or CAN_H_1 reference scales.
The majority of the feather data were referenced to the OldEC.1_H_1 scale, although five other scales were represented (Figure 1). None of the data were reported on scales that were directly traceable to the VSMOW reference scale based on the criteria used here (e.g. EC_H_9 and US_H_6), so all values were shifted during transformation. Most sample values are higher after transformation, but the magnitude of […] reflects a combination of the originally reported analytical uncertainty, contraction (or expansion) of the δ2H scale during transformation and uncertainty added within the calibration chain. Reported uncertainties varied widely among datasets, and uncertainties converge somewhat following transformation: δ2H scale compression drives a reduction in estimated uncertainty for many samples with high reported uncertainty, and the addition of uncertainty from transformation dominates for those with low reported uncertainty (Figure 1b).

| Oxygen
The compilation contains 358 human hair and 337 bird feather samples […]

| Validation
Calibration chain transformation leverages cross-calibration of keratin standards to develop scale transformations that are intended to improve data comparability. Examples in which known or presumed relationships exist between data originally calibrated to different scales allow us to test for improved comparability of transformed data and evaluate that improvement relative to other sources of variability among datasets.
We used the method to transform two sets of modern human hair δ2H data originally calibrated to the OldUT_H_1 (Ehleringer et al., 2008; Thompson et al., 2010) and CAN_H_1 (Bataille et al., 2020; C.P. Bataille, pers. comm.) reference scales. Although we expect some regional variation in hair isotope ratios due to dietary differences (Bowen et al., 2009), hair δ2H values are known to correlate strongly with local environmental water values (Ehleringer et al., 2008). Before recalibration, values in the CAN_H_1 dataset, consisting of samples from Canadian residents, were ~8‰ higher, on average, than those calibrated to OldUT_H_1, which included samples from the United States and east Asia (Figure 1a). This pattern is opposite to that expected based on water δ2H values for these regions (Bowen, Wassenaar, et al., 2005), and regression relationships between the two groups of data and local precipitation δ2H values were statistically distinct (Levene's test p value << 0.05; Figure 3a,b).
After transformation to the VSMOW reference scale, values in the Canadian dataset are lower than those for USA/Asia, as expected (Figure 1a), and the hair-water relationships are no longer distinct (Levene's test p value = 0.96; Figure 3a,b). This strongly suggests that the originally calibrated data were not comparable, and that the calibration chain transformation eliminates (or greatly reduces) the disparity between these datasets.
We also evaluated transformation effects on multi-species data for two broadly defined ecological guilds of birds. The vast majority of passerine δ2H values in the database were originally calibrated to the OldEC.1_H_1 and EC_H_5 reference scales. However, the compilation contains a dataset for a population of spotted towhees (Pipilo maculatus) from a single site in Utah (Magozzi et al., 2020) that was calibrated to UT_H_2. Both pre- and post-transformation, the towhee site-average value clusters well with other passerine data from environments with similar water δ2H values, but the post-transformation value falls closer to the mean tissue-water relationship for the composite dataset (Figure 3c,d). The same is true for pre- and post-transformation δ18O data for passerine feathers originally calibrated to IAEA_O_1, UT_O_1 and EC_O_10 scales (Figure 4c,d). These are relatively weak tests in that we lack a firm basis for predicting expected differences between species, but the results are consistent with the idea that calibration chain transformation increases comparability among datasets.
Values of δ2H for ground-foraging non-passerine birds originally calibrated to DEN_H_1 (Wunder et al., 2005; M.B. Wunder, pers. comm.) and OldEC.1_H_1 (Hobson et al., 2004) defined two discrete clusters when plotted against local water values (Figure 3e). Following transformation, the offset between these groups is reduced but not eliminated (Levene's test p value pre-transformation = 5e−4; post-transformation = 4e−3; Figure 3f). In this case, the residual offset may represent real, ecologically driven differences among taxa, rather than an analytical artefact. The DEN_H_1-calibrated data represent a single species (Charadrius montanus) that occupies dry grasslands with sparse vegetation cover, in which evaporative isotope effects might lead to higher food web δ2H values (e.g. Magozzi et al., 2019) than for the other species represented in the database.
The database includes only one case in which the same samples were calibrated to two different scales. Lesser scaup feathers analysed and calibrated to OldEC.1_H_1 (Hobson et al., 2009) were subsequently reanalysed at Purdue University relative to OldUT_H_1 (G.J. Bowen, pers. comm.); the data show a small but consistent (mean = 6.7‰) offset. In this case, the transformation results in almost no relative shift in values (Figure 1a). The BWB secondary standard was analysed alongside the feathers at Purdue, and both the originally calibrated and transformed data show good agreement with its value on the OldEC.1_H_1 scale. This suggests that the original reference scales are themselves closely comparable, and that the small offset in the scaup sample data might result from other methodological effects.
One possibility is inaccurate correction for H exchange due to differences in the physical condition (powdered vs. cut) of the samples and standards (Coplen & Qi, 2012) or differences in their amino acid composition. Thus, these data highlight the potential importance of minor deviations from PIT, which remain common in analytical work with complex organic materials, as an unresolved source of uncertainty in data compilations. Standardization of protocols across laboratories will be needed to reduce or eliminate this uncertainty.

| Tissue-water relationships
Differences in tissue-water isotope relationships may reflect environmentally or biologically controlled isotope effects during […]

FIGURE 3 Comparison of original (left panels) and VSMOW-recalibrated (right panels) site-average keratin and local precipitation δ2H values for taxonomically and/or ecologically related animals: modern humans (USA, Canada and Asia; a and b), passerines (c and d), ground-foraging non-passerine birds (e and f), waterbirds (g and h) and raptors (i and j). Local precipitation δ2H values are extracted from the precipitation amount-weighted annual average δ2H isoscape on http://www.waterisotopes.org (Bowen & Revenaugh, 2003; Bowen, Wassenaar, et al., 2005). Colours represent different reference scales used in calibration of the original data, symbols represent different datasets […]

[…] are similar, and slopes for all bird groups are substantially higher than that for humans (Figure 3). This is consistent with the idea that widespread consumption of non-local food resources dampens the isotopic variability that would otherwise be expected in human hair due to geography (Bowen et al., 2009; Ehleringer et al., 2008). Slope comparisons for δ18O values are complicated by the smaller sample sizes and weaker tissue-water correlations for avian groups, but at minimum suggest that human-avian differences are less apparent than for δ2H values (Figure 4). The smaller contribution of food-derived O (relative to H) to keratin (Ehleringer et al., 2008) is consistent with this result.
The data also show differences in keratin isotope ratios between different groups across a range of water isotopic compositions […] (Kohn, 1996; Magozzi et al., 2019).

| CONCLUSIONS
Although keratin δ2H and δ18O data from known-origin biological samples are important in movement ecology research, they are difficult to compare among studies and reuse due to heterogeneity in analytical methods. The calibration chain method introduced here attempts to resolve one major source of heterogeneity by reducing or eliminating differences related to the use of different secondary standards in data calibration. Comparisons of pre- and post-transformation data show that the method improves comparability, and suggest systematic isotopic differences between groups of organisms that likely reflect differences in isotope routing and fractionation within food webs. The method is implemented in the assignR R-package, and provides a basis for improved reuse of the database of >4,000 samples included therein and for transformation of user-generated data. The approach developed here could be extended to other matrices and isotope systems in future work.

FIGURE 4 Comparison of original (left panels) and VSMOW-recalibrated (right panels) site-average keratin and local precipitation δ18O values for taxonomically and/or ecologically related animals: modern humans (USA; a and b), passerines (c and d) and ground-foraging non-passerine birds (e and f). Symbology and precipitation estimation as in Figure 3.
We emphasize, however, that post-hoc correction such as that introduced here is not ideal, and encourage continued efforts within the research community to increase the availability of suitable standards and adoption of optimal analytical methods.
Standard development has thus far been conducted largely by self-organized groups, leading to the diversity and heterogeneity of materials evidenced here. In addition, other methodological factors continue to contribute uncertainty to comparisons among studies.
Some have been more thoroughly discussed in other papers (e.g. Soto et al., 2017), and are documented in the assignR database but not accounted for in calibration chain transformation. To leverage the method developed here and continue to support improvements in the standardization and comparability of organic H and O isotope data, we suggest the following three priorities for the community:
1. Organize and support coordinated efforts to develop, characterize (including regular round-robin comparisons among laboratories) and distribute large amounts of standard materials for keratin and other commonly studied biological materials (e.g. chitin).
2. Conduct analyses in compliance with PIT principles to the maximum extent possible, and, where identical treatment is not possible, adopt technologies and methodologies that are demonstrated to eliminate analytical effects known to impart bias in non-PIT analyses (e.g. Hunsinger & Stern, 2012; Wassenaar et al., 2015).
3. Ensure that all new data reports include essential quality control and methodological information, such as measured weight % H and O, the identity and accepted values of the standards used, and details of sample preparation, handling and drying, needed to assess and conduct post-hoc re-evaluation of the results.

ACKNOWLEDGEMENTS
We profoundly thank C Stricker, LI Wassenaar and DX Soto for providing mean δ2H values and associated standard errors for DEN_H_2 and EC_H_7 and 8 calibrations. Their hard work on producing data for these secondary standards over time has provided the foundation for this valuable study comparing calibrations and recalibrating available sample data to comparable reference scales. We also thank LI Wassenaar and an anonymous reviewer for providing constructive feedback on our manuscript.

AUTHORS' CONTRIBUTIONS
S.M. and G.J.B. conceived the ideas and designed the methodology; S.M. compiled the data; S.M. and G.J.B. led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.

PEER REVIEW
The peer review history for this article is available at https://publo ns.

DATA AVAILABILITY STATEMENT
The assignR package version 2.0.0 is available on the CRAN repository: https://cran.r-project.org/web/packages/assignR/index.