Count transformation models

The effect of explanatory environmental variables on a species' distribution is often assessed using a count regression model. Poisson generalized linear models or negative binomial models are common, but the traditional approach of modelling the mean after log or square root transformation remains popular and in some cases is even advocated. We propose a novel framework of linear models for count data. Similar to the traditional approach, the new models apply a transformation to count responses; however, this transformation is estimated from the data and not defined a priori. In contrast to simple least‐squares fitting and in line with Poisson or negative binomial models, the exact discrete likelihood is optimized for parameter estimation and inference. Simple interpretation of effects in the linear predictors is possible. Count transformation models provide a new approach to regressing count data in a distribution‐free yet fully parametric fashion, obviating the need to a priori commit to a specific parametric family of distributions or to a specific transformation. The models are a generalization of discrete Weibull models for counts and are thus able to handle over‐ and underdispersion. We demonstrate empirically that the models are more flexible than Poisson or negative binomial models but still maintain interpretability of multiplicative effects. A re‐analysis of deer–vehicle collisions and the results of artificial simulation experiments provide evidence of the practical applicability of the model framework. In ecology studies, uncertainties regarding whether and how to transform count data can be resolved in the framework of count transformation models, which were designed to simultaneously estimate an appropriate transformation and the linear effects of environmental variables by maximizing the exact count log‐likelihood. The application of data‐driven transformations allows over‐ and underdispersion to be addressed in a model‐based approach. 
Models in this class can be compared to Poisson or negative binomial models using the in‐ or out‐of‐sample log‐likelihood. Extensions to nonlinear additive or interaction effects, correlated observations, hurdle‐type models and other more complex situations are possible. A free software implementation is available in the cotram add‐on package to the R system for statistical computing.


| INTRODUCTION
Information represented by counts is ubiquitous in ecology. Perhaps the most obvious instance of ecological count data is animal abundances, which are determined either directly, for example by birdwatchers, or indirectly, by the counting of surrogates, for example the number of deer-vehicle collisions as a proxy for roe deer abundance. This information is later converted into models of animal densities or species distributions using statistical models for count data.
Distributions of count data are, of course, discrete and right-skewed, such that tailored statistical models are required for data analysis.
Here, we focus on models explaining the impact of explanatory environmental variables x on the distribution of a count response Y ∈ {0, 1, 2, …}. In the commonly used Poisson generalized linear model Y|x ∼ Po(exp(α + x⊤β)) with log-link, intercept α and linear predictor x⊤β, both the mean 𝔼(Y|x) and the variance 𝕍(Y|x) of the count response are given by exp(α + x⊤β). Overdispersion, that is, the situation 𝔼(Y|x) < 𝕍(Y|x), is allowed in the more complex negative binomial model Y|x ∼ NB(exp(α + x⊤β), ν) with mean 𝔼(Y|x) = exp(α + x⊤β) and potentially larger variance 𝕍(Y|x) = 𝔼(Y|x) + 𝔼(Y|x)²∕ν. For independent observations, the model parameters are obtained by maximizing the discrete log-likelihood function, in which an observation (y, x) contributes the log-density log(ℙ(Y = y|x)) of either the Poisson or the negative binomial distribution.
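As an illustration of these log-likelihood contributions, the following sketch (in Python with hypothetical values of α, β and ν; the paper's own implementation is in R) evaluates log ℙ(Y = y|x) under the Poisson and negative binomial models:

```python
import numpy as np
from scipy.stats import poisson, nbinom

def poisson_loglik(y, x, alpha, beta):
    """Log-density log P(Y = y | x) under Y|x ~ Po(exp(alpha + x * beta))."""
    mu = np.exp(alpha + x * beta)
    return poisson.logpmf(y, mu)

def negbin_loglik(y, x, alpha, beta, nu):
    """Log-density under the NB model with mean mu and variance mu + mu^2 / nu."""
    mu = np.exp(alpha + x * beta)
    # scipy's nbinom uses size n and success probability p;
    # n = nu, p = nu / (nu + mu) yields mean mu and variance mu + mu^2 / nu
    return nbinom.logpmf(y, n=nu, p=nu / (nu + mu))

y, x = 4, 0.5
ll_po = poisson_loglik(y, x, alpha=1.2, beta=0.8)
ll_nb = negbin_loglik(y, x, alpha=1.2, beta=0.8, nu=3.0)
```

The mapping of scipy's (n, p) parameterization to the mean/dispersion parameterization used in the text is the only non-obvious step here.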
Before the emergence of these models tailored to the analysis of count data (generalized linear models were introduced by Nelder & Wedderburn, 1972), researchers were restricted to analysing transformations of Y by normal linear regression models. Prominent textbooks at the time (Snedecor & Cochran, 1967; Sokal & Rohlf, 1967) suggested log transformations log(y + 1) or square root transformations √(y + 0.5) of observed counts y. The application of least-squares estimators to the log-transformed counts then leads to the mean 𝔼(log(y + 1)|x) = α + x⊤β. Implicitly, it is assumed that the variance after transformation 𝕍(log(y + 1)|x) = σ² is constant and that errors are normally distributed. Although it is clear that the normal assumption log(Y + 1)|x ∼ N(α + x⊤β, σ²) is incorrect (the count data are still discrete after transformation) and, consequently, that the wrong likelihood is maximized by applying least-squares to log(y + 1) for parameter estimation and inference, this approach is still broadly used both in practice and in theory (e.g. De Felipe, Sáez-Gómez, & Camacho, 2019; Dean, Voss, & Draguljić, 2017; Gotelli & Ellison, 2013; Ives, 2015; Mooney et al., 2016). Moreover, other deficits of this approach have been discussed in numerous papers (e.g. O'Hara & Kotze, 2010; St-Pierre, Shikon, & Schneider, 2018; Warton, 2018; Warton, Lyons, Stoklosa, & Ives, 2016). As a compromise between the two extremes of using rather strict count distribution models (such as the Poisson or negative binomial) and the analysis of transformed counts by normal linear regression models, we suggest a novel class of transformation models for count data that combines the strengths of both approaches. Briefly stated, in the newly proposed method, appropriate transformations of counts Y are estimated simultaneously with regression coefficients β from the data by maximizing the correct discrete form of the likelihood in models that ensure the interpretability of a linear predictor x⊤β.
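For concreteness, the traditional approach amounts to ordinary least squares on transformed counts; a minimal sketch (in Python, with simulated counts and hypothetical coefficients) is:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=200)
y = rng.poisson(np.exp(1.2 + 0.8 * x))        # simulated counts

# traditional approach: least squares on log(y + 1)
X = np.column_stack([np.ones_like(x), x])     # intercept plus x
coef, *_ = np.linalg.lstsq(X, np.log(y + 1), rcond=None)
alpha_hat, beta_hat = coef
```

Note that the fitted residuals remain discrete-valued given x, so the implicit normal likelihood is wrong even when the point estimates look reasonable.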
We describe the theoretical foundations of these novel count regression models in Section 2. Practical aspects of the methodology are demonstrated in Section 3 in a re-analysis of roe deer activity patterns based on deer-vehicle collision data, followed by an artificial simulation experiment contrasting the performance of Poisson, negative binomial and count transformation models under certain conditions.

| MATERIALS AND METHODS
The core idea of our count transformation model for describing the impact of explanatory environmental variables x on counts Y ∈ {0, 1, 2, …} is the simultaneous estimation of a fully parameterized smooth transformation α(Y) of the discrete response and the regression coefficients β in a linear predictor x⊤β. The aim of the approach is to model the discrete conditional distribution function of Y given x.

We develop the novel model starting with a generalized linear model (GLM) for a binary event (Y ≤ k) defined by some cut-off point k. Assuming a Bernoulli distribution (Y ≤ k)|x ∼ B(1, π(x)) with success parameter π(x), a binary GLM with link function g is given as g(π(x)) = α_k − x⊤β. The intercept α_k defines the probability of a 'success' (Y ≤ k) for a baseline configuration x⊤β = 0 and, in a logistic regression model with g = logit, the regression coefficients β have an interpretation as odds ratios exp(β). Such a binary model can be formulated for every possible cut-off point k ∈ {0, 1, 2, …}, and thus the number of intercept thresholds α_k may become quite large. The main aspect of our count transformation models is a smooth and parsimonious parameterization of these intercept thresholds. To simplify the notation, we note that the mean π(x) has an interpretation as a distribution function giving the probability of observing a count y smaller than or equal to k given the configuration x. Furthermore, each link function g = F⁻¹ corresponds to the quantile function of a specific continuous distribution function F (g = logit and F = g⁻¹ = expit for logistic regression, g = Φ⁻¹ for probit regression, etc.). Last, using a negative sign for the linear predictor x⊤β ensures that large values of x⊤β correspond to large means 𝔼(Y|x), however, in a nonlinear way.
For arbitrary cut-offs y, we introduce the count transformation model as a model for the conditional distribution function F_{Y|X=x}(y|x) of a count response Y given explanatory variables x, as

F_{Y|X=x}(y|x) = F(α(⌊y⌋) − x⊤β). (1)

The intercept thresholds are now specified by a smooth, continuous and monotonically increasing transformation function α: ℝ⁺ → ℝ applied to the greatest integer ⌊y⌋ less than or equal to the cut-off point y. Hothorn, Möst, and Bühlmann (2018) suggested the parameterization of α in terms of basis functions a: ℝ → ℝ^P and the corresponding parameters ϑ as α(y) = a(y)⊤ϑ. There is no deeper meaning behind the parameters ϑ and many basis functions a are possible; we comment on a specific choice later on. The only modification required for count data is to consider this transformation function as a step function with jumps at integers 0, 1, 2, … only. This is achieved in model (1) by the floor function ⌊y⌋.
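Model (1) can be evaluated directly once F, α and x⊤β are specified; the following sketch (in Python, with a hypothetical transformation function and the probit choice F = Φ) computes the conditional distribution function on a grid of counts:

```python
import numpy as np
from scipy.stats import norm

def cdf_count(y, x_beta, alpha_fn, F=norm.cdf):
    """Conditional distribution function of model (1):
    P(Y <= y | x) = F(alpha(floor(y)) - x'beta)."""
    return F(alpha_fn(np.floor(y)) - x_beta)

# hypothetical monotone transformation: alpha(y) = -1 + 1.5 * log(y + 1)
alpha_fn = lambda y: -1.0 + 1.5 * np.log(y + 1.0)
p = cdf_count(np.arange(5), x_beta=0.0, alpha_fn=alpha_fn)
```

Because α is increasing and F is a distribution function, the returned probabilities increase in y, as a distribution function must.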
The very same approach was suggested by Padellini and Rue (2019) to model quantile functions F⁻¹_{Y|X=x} of count data instead of the distribution functions we consider here. Figure 1 shows a distribution function F_Y(y) = F(α(⌊y⌋)) and the corresponding transformation function α, both as discrete step functions (flooring the argument first) and continuously (without doing so). The two versions are identical for integer-valued arguments. Thus, the transformation function α, and consequently the transformation model (1), are parameterized continuously but evaluated and interpreted discretely.
A computationally attractive, low-dimensional representation of a smooth function in terms of a few basis functions a and corresponding parameters ϑ is therefore the core ingredient of our novel model framework. In addition to the baseline transformation and distribution functions (i.e. for a configuration with x⊤β = 0 in model (1)), the conditional transformation and distribution function for some configuration x⊤β = 3 is also depicted. The impact of x⊤β = 3 on the transformation function is given by a vertical shift but is nonlinear on the scale of the distribution function.
On a more technical level, the basis a is specified in terms of a_{Bs,P−1}, the P-dimensional basis functions of a Bernstein polynomial (see Farouki, 2012, for an extensive overview) of order P − 1.
Specifically, the basis a(y) can be chosen as a_{Bs,P−1}(y) or a_{Bs,P−1}(y + 1), or as a Bernstein polynomial on the log-scale: a_{Bs,P−1}(log(y)) or a_{Bs,P−1}(log(y + 1)). The choice a(y) = a_{Bs,P−1}(log(y + 1)) is particularly well suited for modelling relatively small counts. We chose Bernstein polynomials purely for computational convenience, as it is easy to differentiate and integrate α(y) = a(y)⊤ϑ with respect to y. Furthermore, for P = 2 (a Bernstein polynomial of order one), the defined basis is equivalent to a linear function of either y, log(y) or log(y + 1), and monotonicity of the transformation function α can be obtained under the constraint ϑ_1 ≤ ϑ_2 ≤ ⋯ ≤ ϑ_P on the parameters ϑ = (ϑ_1, …, ϑ_P)⊤ ∈ ℝ^P (Hothorn et al., 2018). The interpretation of the linear predictor x⊤β depends on the choice of the link function g, as summarized in Table 1, with F_Y(y) = F(α(⌊y⌋)) denoting the distribution of the baseline configuration x⊤β = 0. Note that, with a sufficiently flexible parameterization of the transformation function α(y) = a(y)⊤ϑ, every distribution can be written in this way, such that the model is distribution-free (Hothorn et al., 2018). Bernstein polynomials of sufficiently large order P − 1 can approximate any function on intervals, and this theoretical property is practically achieved for relatively small orders P − 1 (see figure 5 in Hothorn, 2020a, for a head-to-head comparison with a nonparametric method).
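A sketch of the basis a_{Bs,P−1}(log(y + 1)) (in Python; the rescaling of log(y + 1) to the unit interval and the parameter values are illustrative assumptions) shows how ordered parameters ϑ yield a monotone transformation function:

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(z, order):
    """Bernstein polynomial basis of the given order, evaluated at z in [0, 1]."""
    k = np.arange(order + 1)
    return comb(order, k) * z[:, None] ** k * (1.0 - z[:, None]) ** (order - k)

def alpha(y, theta, ymax):
    """alpha(y) = a_{Bs,P-1}(log(y + 1))' theta, rescaling log(y + 1) to [0, 1]."""
    z = np.log(y + 1.0) / np.log(ymax + 1.0)
    return bernstein_basis(z, len(theta) - 1) @ theta

theta = np.array([-2.0, -0.5, 0.5, 1.0, 2.5])   # ordered parameters (P = 5)
vals = alpha(np.arange(11.0), theta, ymax=10)   # alpha at counts 0, ..., 10
```

Two properties worth noting: the basis functions form a partition of unity, and a Bernstein polynomial with increasing coefficients is itself increasing, which is exactly the monotonicity constraint used above.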
The parameters β describe a deviation from the baseline distribution F_Y in terms of the linear predictor x⊤β. For a probit link, the linear predictor is the conditional mean of the transformed counts α(Y). This interpretation, except for the fact that the intercept is now understood as being part of the transformation function α, is the same as in the traditional approach of first transforming the counts and only then estimating the mean using least-squares. However, the transformation α is not heuristically chosen or defined a priori but estimated from data through the parameters ϑ, as explained below. For a logit link, exp(−x⊤β) is the odds ratio comparing the conditional with the baseline odds. For a cloglog link, exp(−x⊤β) is the hazard ratio describing the multiplicative change in the cumulative hazard function −log(1 − F_Y). The loglog link leads to the reverse time hazard ratio with multiplicative changes in −log(F_Y). All models in Table 1 are parameterized to relate positive values of x⊤β to larger means, independent of the specified link g = F⁻¹.
In Section 3.1 of our empirical evaluation, we consider a linear count transformation model for discrete hazards by specifying the cloglog link. The discrete Cox count transformation model with P Bernstein basis functions a_{Bs,P−1} relates positive linear predictors to smaller hazards and thus larger means. The discrete hazard is the probability that y counts will be observed given that at least y counts were already observed. The model is equivalent to

ℙ(Y ≤ y|x) = 1 − exp(−exp(a_{Bs,P−1}(log(⌊y⌋ + 1))⊤ϑ − x⊤β)) (2)

and thus the hazard ratio exp(−x⊤β) gives the multiplicative change in discrete hazards.
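The hazard interpretation of model (2) can be checked numerically: under the cloglog link, F(z) = 1 − exp(−exp(z)), and the discrete hazards of two configurations satisfy log(1 − h(y|x)) = exp(−x⊤β) log(1 − h(y|0)). A Python sketch with a hypothetical transformation function:

```python
import numpy as np

def F_mev(z):
    """The cloglog link g = F^{-1} corresponds to F(z) = 1 - exp(-exp(z))."""
    return 1.0 - np.exp(-np.exp(z))

def discrete_hazards(ymax, x_beta, alpha_fn):
    """h(y|x) = P(Y = y | Y >= y, x) for y = 0, ..., ymax under model (2)."""
    grid = np.arange(ymax + 1.0)
    F = F_mev(alpha_fn(grid) - x_beta)          # P(Y <= y | x)
    F_prev = np.concatenate([[0.0], F[:-1]])    # P(Y <= y - 1 | x), zero at y = 0
    return (F - F_prev) / (1.0 - F_prev)

alpha_fn = lambda y: -1.0 + 1.2 * np.log(y + 1.0)   # hypothetical alpha
h0 = discrete_hazards(8, x_beta=0.0, alpha_fn=alpha_fn)   # baseline hazards
h1 = discrete_hazards(8, x_beta=0.8, alpha_fn=alpha_fn)   # shifted hazards
```

The proportionality holds on the scale of log(1 − h), because the survivor function under the cloglog link satisfies S(y|x) = S(y|0)^exp(−x⊤β).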
The Cox proportional hazards model with a simplified transformation function α(y) = ϑ_1 + ϑ_2 log(y + 1) specifies a discrete form of a Weibull model (introduced by Nakagawa & Osaki, 1975) that Peluso, Vinciotti, and Yu (2019) recently discussed as an extension of other count regression models and that serves as a more flexible approach for both over- and underdispersed data. The discrete Weibull model is a special form of our Cox count transformation model (2), as the former features a linear basis function a with P = 2 parameters defined by a Bernstein polynomial of order one. Thus, model (2) can be understood as a generalization moving away from the low-parametric discrete Weibull distribution while maintaining both the interpretability of the effects as log-hazard ratios and the ability to handle over- and underdispersion.
Simultaneous likelihood-based inference for ϑ and β in fully parameterized transformation models was developed by Hothorn et al. (2018); here we refer only to the most important aspects. The exact log-likelihood contribution of an independent observation (y_i, x_i) is the log-probability of observing the count y_i,

ℓ_i(ϑ, β) = log(F(a(y_i)⊤ϑ − x_i⊤β) − F(a(y_i − 1)⊤ϑ − x_i⊤β)),

with F(a(y_i − 1)⊤ϑ − x_i⊤β) := 0 for y_i = 0. The corresponding log-likelihood is then maximized simultaneously with respect to both ϑ and β under the monotonicity constraint ϑ_1 ≤ ϑ_2 ≤ ⋯ ≤ ϑ_P. Score functions and Hessians are available from Hothorn et al. (2018), and standard inference procedures relying on the asymptotic normal distribution of ϑ and β can be applied. The parametric bootstrap is an alternative, for example when uncertainties in the predicted conditional distribution function F_{Y|X=x} shall be assessed (see section 6 of Hothorn, 2020a). The likelihood highlights an important connection to a recently proposed approach to multivariate models (Clark, Nemergut, Seyednasrollah, Turner, & Zhang, 2017), where the main challenge is to make multiple response variables measured at different scales comparable. Latent continuous variables are used to model discrete responses by means of appropriate censoring. For the univariate case considered here, our likelihood is equivalent to censoring a latent continuous variable at the integers 0, 1, 2, …. Different choices of the link function g define the latent variable's distribution; for example, for a probit model with g = Φ⁻¹, a latent normal distribution is assumed.
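A compact sketch of this estimation scheme (in Python, using a probit link, an unconstrained reparameterization of ϑ through positive increments instead of explicit constraints, and simulated data; the R packages mlt and cotram implement the real thing):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import comb
from scipy.stats import norm

def neg_loglik(params, y, x, P, F=norm.cdf):
    """Negative exact discrete log-likelihood of a count transformation model."""
    # monotone theta via a first value plus positive increments
    theta = np.cumsum(np.concatenate([params[:1], np.exp(params[1:P])]))
    beta = params[P]
    # Bernstein basis of order P - 1 on log(y + 1), rescaled to [0, 1]
    z = np.log(np.arange(np.max(y) + 1.0) + 1.0)
    z = z / z[-1]
    k = np.arange(P)
    A = comb(P - 1, k) * z[:, None] ** k * (1.0 - z[:, None]) ** (P - 1 - k)
    a = A @ theta                               # alpha(0), ..., alpha(max(y))
    upper = F(a[y] - x * beta)                  # F(alpha(y) - x beta)
    lower = np.where(y > 0, F(a[y - 1] - x * beta), 0.0)
    return -np.sum(np.log(upper - lower))

rng = np.random.default_rng(0)
x = rng.uniform(size=300)
y = rng.poisson(np.exp(0.5 + 0.8 * x))          # simulated counts
P = 5
fit = minimize(neg_loglik, x0=np.zeros(P + 1), args=(y, x, P))
beta_hat = fit.x[P]
```

The increment parameterization is a common trick for enforcing ϑ_1 ≤ ⋯ ≤ ϑ_P without constrained optimization; the sign convention of model (1) means a positive effect of x on the mean translates into a positive β here.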

| RESULTS
In our empirical evaluation of the proposed count transformation models, we demonstrate practical aspects of the model framework in Section 3.1, by re-analysing data on deer-vehicle collisions, and examine their properties in the context of conventional count regression models assuming either a conditional Poisson or a negative binomial distribution. In Section 3.2, we use simulated count data to evaluate the robustness of count transformation models under model misspecification.

TABLE 1 Transformation models: interpretation of the linear predictor x⊤β under different link functions g = F⁻¹

| Analysis of deer-vehicle collision data
In the following, we re-analyse a time series of 341,655 deer-vehicle collisions (DVCs) involving roe deer Capreolus capreolus that were documented over a period of 10 years in Bavaria, Germany. The data were originally analysed by Hothorn, Müller, Held, Möst, and Mysterud (2015) with the aim of describing temporal patterns in roe deer activity. For each model, we computed the estimated multiplicative seasonal changes in risk depending on the time of day relative to baseline on January 1, including 95% simultaneous confidence bands.
We interpreted 'risk' as a multiplicative change to baseline with respect to either the conditional mean ('expectation ratio'; Poisson and negative binomial models) or the conditional discrete hazard function ('hazard ratio') for the Cox count transformation model (2).
The results in Figure 2 show a rather strong agreement between the three models with respect to the estimated risk (expectation ratio or hazard ratio). However, the uncertainty, assessed by the 95% confidence bands, was smaller in the Poisson model. The negative binomial and the Cox count transformation model (2) yielded wider confidence bands. The negative binomial model increased the out-of-sample log-likelihood to −58,234.72, which was closer to but did not match the out-of-sample log-likelihood of model (2). Practically, the count transformation model performed as well as the negative binomial model; however, the necessity to choose a specific parametric distribution was present in the latter model only, owing to the distribution-free nature of the former.
We further compared the three different models in terms of their conditional distribution functions for four selected time intervals of the year 2009. The discrete conditional distribution functions of the models, evaluated for all integers between 0 and 38, are given in Figure 3. The conditional medians obtained from all three models are rather close, but the variability assessed by the Poisson model is much smaller than that associated with the negative binomial and count transformation models, thus indicating overdispersion.
In addition to confidence statements about regression coefficients β or functions thereof as in Figure 2, it is also possible to obtain confidence bands for transformation functions α or for the conditional distribution function itself. Figure 4 presents the estimated conditional distribution for a specific half-hour interval, along with a confidence band calculated from the joint asymptotic normal distribution of the estimated ϑ parameters.

FIGURE 3 Discrete conditional distribution functions of the three models, evaluated for all integers between 0 and 38, for four selected time intervals of the year 2009

FIGURE 4 Estimated conditional distribution function for the half-hour interval 2009-04-21 04:30:00 - 05:00:00 (UTC+1), along with a confidence band

FIGURE 5 Artificial count-data-generating processes (DGPs). Visualization of the conditional distribution functions of counts given a single explanatory variable x. Samples drawn from these distributions were employed to assess the performance of the count regression models in Figure 6

FIGURE 6 Artificial count-data-generating processes (DGPs). The performance of the count regression models (Poisson, negative binomial and count transformation models outlined in Table 1), assessed by centred out-of-sample log-likelihoods

| Artificial count-data-generating processes
We investigated the performance of the different regression models in a simulation experiment based on count data from various underlying data-generating processes (DGPs). Count responses Y were generated conditionally on a numeric explanatory environmental variable x ∈ [0, 1] following a Poisson or negative binomial distribution or one of the discrete distributions underlying the four count transformation models corresponding to the four link functions from Table 1. For the Poisson model, the mean and variance were assumed to be 𝔼(Y|x) = 𝕍(Y|x) = exp(1.2 + 0.8x). The negative binomial data were chosen to be moderately overdispersed, with 𝔼(Y|x) = exp(1.2 + 0.8x) and 𝕍(Y|x) = 𝔼(Y|x) + 𝔼(Y|x)²∕3. The four data-generating processes arising from the count transformation models were specified by the different link functions in Table 1, a Bernstein polynomial a_{Bs,6}(log(y + 1)) and a regression coefficient β_1 = 0.8. The conditional distribution functions defined by each of the six data-generating processes are visualized in Figure 5 and reveal different functional forms.
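The Poisson and negative binomial DGPs can be sampled directly; a sketch (in Python, mirroring the stated means and variances; the transformation-model DGPs would additionally require inverting model (1)) is:

```python
import numpy as np

rng = np.random.default_rng(2025)

def draw_poisson(x, rng):
    """Poisson DGP: E(Y|x) = Var(Y|x) = exp(1.2 + 0.8 x)."""
    return rng.poisson(np.exp(1.2 + 0.8 * x))

def draw_negbin(x, rng, nu=3.0):
    """Negative binomial DGP with Var(Y|x) = E(Y|x) + E(Y|x)^2 / nu."""
    mu = np.exp(1.2 + 0.8 * x)
    p = nu / (nu + mu)           # numpy's (n, p) parameterization
    return rng.negative_binomial(nu, p)

x = rng.uniform(size=250)        # learning sample size used in the experiment
y_po = draw_poisson(x, rng)
y_nb = draw_negbin(x, rng)
```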
We repeated the simulation experiment for each data-generating process 100 times, with learning and validation sample sizes of 250 and 750 observations, respectively. The centred out-of-sample log-likelihoods, contrasting the model fits, were computed as the differences between the out-of-sample log-likelihoods of the models and those of the true data-generating processes.
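As an illustration of this centring (in Python; the 'fitted' model is a deliberately misspecified Poisson with hypothetical coefficients):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(7)
x_val = rng.uniform(size=750)                     # validation sample
y_val = rng.poisson(np.exp(1.2 + 0.8 * x_val))    # true Poisson DGP

def oos_loglik(alpha, beta):
    """Out-of-sample log-likelihood of a Poisson model on the validation data."""
    return np.sum(poisson.logpmf(y_val, np.exp(alpha + beta * x_val)))

true_ll = oos_loglik(1.2, 0.8)     # log-likelihood of the true DGP
fit_ll = oos_loglik(0.8, 0.4)      # a misspecified candidate model
centred = fit_ll - true_ll         # centred out-of-sample log-likelihood
```

Centred values near zero indicate a fit close to the truth; strongly negative values indicate a poorly fitting model.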
The results as given in Figure 6 follow a clear pattern. When misspecified, the model fit of the Poisson model is inferior to that of all other models. As expected, the negative binomial model fits both the data arising from the Poisson distribution (limiting case of the negative binomial distribution with ν → ∞) and the moderately overdispersed data well. However, it lacks robustness for more complex data-generating processes, such as the underlying mechanisms specified by a count transformation model. The fit of the count transformation models is satisfactory across all DGPs, albeit with some differences within the model framework.

| Computational details
All computations were performed using R version 3.6.2 (R Core Team, 2019). A reference implementation of transformation models is available in the mlt R add-on package (Hothorn, 2020a, 2020b). A simple user interface to linear count transformation models is available in the cotram R add-on package (Siegfried & Hothorn, 2019).
The package includes an introductory vignette and reproducibility material for the empirical results presented in Section 3.
The following example demonstrates the functionality of the cotram package in terms of a Cox count transformation model (2). The data are shown in Figure 7, overlaid with the smoothed version of the estimated conditional distribution functions for varying values of coverstorey.

ACK N OWLED G EM ENTS
The authors would like to thank Wendy Ran for improving the language. Constructive feedback by the handling editor, Prof. O'Hara, and two anonymous referees is highly appreciated.

AUTH O R S ' CO NTR I B UTI O N S
S.S. implemented the cotram package for count transformation models on top of the tram package for general linear transformation models written by T.H. Empirical results were obtained using R code by S.S., and the cotram package vignette was also written by S.S. Both authors designed the simulation experiments, and drafted and revised the manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
The code necessary to reproduce the empirical results presented in Section 3 is available from within the cotram R add-on package http://CRAN.R-project.org/package=cotram (Siegfried & Hothorn, 2019).
The deer-vehicle collisions data re-analysed in Section 3.1 were re-