The GENMOD Procedure

Tweedie Distribution for Generalized Linear Models

The Tweedie (1984) distribution has nonnegative support and can have a discrete mass at zero, which makes it useful for modeling responses that are a mixture of zeros and positive values. The Tweedie distribution belongs to the exponential family, so it fits conveniently in the generalized linear models framework. Under this parameterization, the mean and variance of the Tweedie random variable are $\mathrm{E}(Y) = \mu$ and $\mathrm{Var}(Y) = \phi \mu^{p}$, respectively, where $\phi$ is the dispersion parameter and $p$ is an extra parameter that controls the variance of the distribution.

The Tweedie family of distributions includes several important distributions for generalized linear models. When $p = 0$, the Tweedie distribution degenerates to the normal distribution; when $p = 1$, it becomes a Poisson distribution; when $p = 2$, it becomes a gamma distribution; when $p = 3$, it is an inverse Gaussian distribution.

Except for these special cases, the probability density function for the Tweedie distribution does not have a closed form and can at best be expressed as a series. Numerical approximations are needed to evaluate the density function. Dunn and Smyth (2005) propose using a finite series and provide a formula to determine its lower and upper indices in order to achieve a desired accuracy. Alternatively, you can apply the Fourier transformation to the characteristic function (Dunn and Smyth 2008). These approximations tend to be expensive when a high level of accuracy is demanded or when the data volume is large. PROC GENMOD uses the series method unless it becomes complicated to do so; in that case, it uses the method that is based on the Fourier transformation. The accuracy of the approximation is controlled by the EPSILON= option, whose default value is $10^{-5}$.
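To make the idea of a truncated series concrete, the following Python sketch evaluates the density for $1 < p < 2$ by summing the Poisson-gamma mixture that follows from the compound Poisson representation described later in this section. It is an illustration only: it is neither the Dunn and Smyth (2005) series nor the algorithm that PROC GENMOD uses. The function name `tweedie_density` and the stopping rule are assumptions, and the `eps` argument is only loosely analogous to the EPSILON= option.

```python
import math


def tweedie_density(y, mu, phi, p, eps=1e-5):
    """Tweedie density at y > 0, or probability mass at y == 0, for 1 < p < 2.

    Illustrative series: sum over t >= 1 of Poisson(t; lam) times the
    gamma density with shape t*alpha and scale gamma, evaluated at y.
    """
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch covers only 1 < p < 2")
    if y < 0.0:
        return 0.0
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    alpha = (2.0 - p) / (p - 1.0)
    gamma = phi * (p - 1.0) * mu ** (p - 1.0)
    if y == 0.0:
        return math.exp(-lam)  # P(T = 0)

    def log_term(t):
        shape = t * alpha
        log_poisson = -lam + t * math.log(lam) - math.lgamma(t + 1.0)
        log_gamma_pdf = ((shape - 1.0) * math.log(y) - y / gamma
                         - math.lgamma(shape) - shape * math.log(gamma))
        return log_poisson + log_gamma_pdf

    total, previous, t = 0.0, -math.inf, 1
    while True:
        current = log_term(t)
        term = math.exp(current)
        total += term
        # Assumed stopping rule: the terms are past their peak and the
        # latest one is negligible relative to the running sum.
        if current < previous and term <= eps * total:
            return total
        previous = current
        t += 1
```

Because the terms are unimodal in `t`, the loop stops shortly after the peak. The number of terms grows as $\phi$ shrinks or $y$ grows, which is one way to see why high accuracy on large data sets is expensive.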

The Tweedie distribution is not defined when $0 < p < 1$. In practice, the most interesting range is $1 < p < 2$, in which the Tweedie distribution gradually loses its mass at 0 as it shifts from a Poisson distribution to a gamma distribution. In this case, the Tweedie random variable $Y$ can be generated from a compound Poisson distribution (Smyth 1996) as

\[
\begin{aligned}
Y &= \sum_{i=1}^{T} X_i \\
T &\sim \mathrm{Poisson}(\lambda) \\
X_i &\sim \mathrm{gamma}(\alpha, \gamma)
\end{aligned}
\]

where $Y = 0$ if $T = 0$, $T$ and $X_i$ are statistically independent, and $\mathrm{gamma}(\alpha, \gamma)$ denotes a gamma random variable that has mean $\alpha\gamma$ and variance $\alpha\gamma^{2}$. These parameters are determined by the Tweedie parameters as follows:

\[
\begin{aligned}
\lambda &= \frac{\mu^{2-p}}{\phi (2-p)} \\
\alpha &= \frac{2-p}{p-1} \\
\gamma &= \phi (p-1) \mu^{p-1}
\end{aligned}
\]
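The compound Poisson construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not how PROC GENMOD generates or fits data: the function names are hypothetical, the Poisson draw uses Knuth's multiplication method (adequate only for modest $\lambda$), and `random.gammavariate` is called with shape $\alpha$ and scale $\gamma$ so that its mean is $\alpha\gamma$.

```python
import math
import random


def tweedie_to_compound_poisson(mu, phi, p):
    """Map Tweedie (mu, phi, p) with 1 < p < 2 to (lam, alpha, gamma)."""
    if not 1.0 < p < 2.0:
        raise ValueError("the compound Poisson form requires 1 < p < 2")
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    alpha = (2.0 - p) / (p - 1.0)
    gamma = phi * (p - 1.0) * mu ** (p - 1.0)
    return lam, alpha, gamma


def poisson_draw(rng, lam):
    """Draw T ~ Poisson(lam) by multiplying uniforms until the product <= exp(-lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def tweedie_draw(rng, mu, phi, p):
    """Draw Y as the sum of T gamma(alpha, gamma) variables; Y is 0 when T is 0."""
    lam, alpha, gamma = tweedie_to_compound_poisson(mu, phi, p)
    t = poisson_draw(rng, lam)
    return sum(rng.gammavariate(alpha, gamma) for _ in range(t))


if __name__ == "__main__":
    rng = random.Random(2022)
    draws = [tweedie_draw(rng, mu=2.0, phi=1.0, p=1.5) for _ in range(20000)]
    mean = sum(draws) / len(draws)
    variance = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
    zeros = sum(1 for d in draws if d == 0) / len(draws)
    print(mean, variance, zeros)
```

For a large sample, the mean should be close to $\mu$, the variance close to $\phi\mu^{p}$, and the proportion of exact zeros close to $e^{-\lambda}$.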

Conversely, given the parameters of the compound Poisson distribution, the Tweedie distributional parameters are determined as follows:

\[
\begin{aligned}
\mu &= \lambda \alpha \gamma \\
p &= \frac{\alpha + 2}{\alpha + 1} \\
\phi &= \frac{\lambda^{1-p} (\alpha\gamma)^{2-p}}{2-p}
\end{aligned}
\]
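As a quick consistency check on this mapping, the following Python sketch converts compound Poisson parameters to Tweedie parameters and compares the variance of the compound sum, $\lambda\,\mathrm{E}(X_i^2) = \lambda(\alpha\gamma^{2} + \alpha^{2}\gamma^{2})$, with the Tweedie variance $\phi\mu^{p}$. The function names are hypothetical and the input values are chosen only for illustration.

```python
def compound_poisson_to_tweedie(lam, alpha, gamma):
    """Map compound Poisson (lam, alpha, gamma) to Tweedie (mu, phi, p)."""
    mu = lam * alpha * gamma
    p = (alpha + 2.0) / (alpha + 1.0)
    phi = lam ** (1.0 - p) * (alpha * gamma) ** (2.0 - p) / (2.0 - p)
    return mu, phi, p


def compound_poisson_variance(lam, alpha, gamma):
    """Variance of a Poisson sum of gamma variables: lam * E(X^2)."""
    second_moment = alpha * gamma ** 2 + (alpha * gamma) ** 2
    return lam * second_moment


if __name__ == "__main__":
    lam, alpha, gamma = 3.0, 0.8, 1.7  # illustrative values only
    mu, phi, p = compound_poisson_to_tweedie(lam, alpha, gamma)
    # The two printed numbers should agree up to rounding error.
    print(compound_poisson_variance(lam, alpha, gamma), phi * mu ** p)
```

Any positive $\alpha$ yields a power parameter strictly between 1 and 2, which matches the range in which the compound Poisson representation applies.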

In terms of generalized linear models parameterizations, the canonical parameter theta for the Tweedie density can be expressed as

\[
\theta =
\begin{cases}
\dfrac{\mu^{1-p}}{1-p} & p \neq 1 \\[1ex]
\log \mu & p = 1
\end{cases}
\]

and the function $b(\theta)$ is

\[
b(\theta) =
\begin{cases}
\dfrac{\mu^{2-p}}{2-p} & p \neq 2 \\[1ex]
\log \mu & p = 2
\end{cases}
\]
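These two definitions are tied to the moments by the usual exponential family identities $b'(\theta) = \mu$ and $b''(\theta) = \mu^{p}$, which give the mean and, after multiplying by $\phi$, the variance. The following Python sketch checks both identities numerically with central differences. The function names and the step size `h` are assumptions made for illustration.

```python
import math


def theta_of_mu(mu, p):
    """Canonical parameter theta as a function of the mean."""
    return math.log(mu) if p == 1.0 else mu ** (1.0 - p) / (1.0 - p)


def mu_of_theta(theta, p):
    """Inverse of theta_of_mu."""
    if p == 1.0:
        return math.exp(theta)
    return ((1.0 - p) * theta) ** (1.0 / (1.0 - p))


def b_of_theta(theta, p):
    """The function b, written in terms of theta through mu."""
    mu = mu_of_theta(theta, p)
    return math.log(mu) if p == 2.0 else mu ** (2.0 - p) / (2.0 - p)


def b_derivatives(theta, p, h=1e-4):
    """Central-difference estimates of b'(theta) and b''(theta)."""
    upper = b_of_theta(theta + h, p)
    middle = b_of_theta(theta, p)
    lower = b_of_theta(theta - h, p)
    return (upper - lower) / (2.0 * h), (upper - 2.0 * middle + lower) / h ** 2


if __name__ == "__main__":
    mu = 2.0
    for p in (0.0, 1.0, 1.5, 2.0, 3.0):
        first, second = b_derivatives(theta_of_mu(mu, p), p)
        # first should be close to mu, second close to mu ** p
        print(p, first, second, mu ** p)
```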

Because of the intractability of differentiating the gradient functions with respect to the variance parameters, PROC GENMOD uses a quasi-Newton approach to maximize the likelihood function, in which the Hessian matrix is approximated by taking finite differences of the gradient functions. Two convergence criteria are used: the relative gradient convergence criterion is set to $10^{-9}$, and the relative function convergence criterion is set to $2 \times 10^{-9}$. Convergence is declared when at least one of the criteria is attained during the quasi-Newton iteration.
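The idea of approximating a Hessian from gradient differences can be sketched in Python as follows. This is a generic forward-difference scheme, not PROC GENMOD's implementation: the text does not state the difference formula or step size, so the step rule and the final symmetrization here are assumptions, and the function name is hypothetical.

```python
def finite_difference_hessian(gradient, x, h=1e-6):
    """Approximate the Hessian at x by forward differences of the gradient.

    gradient: callable that takes a list of floats and returns a list of floats.
    """
    n = len(x)
    base = gradient(x)
    raw = [[0.0] * n for _ in range(n)]
    for j in range(n):
        step = h * max(1.0, abs(x[j]))  # assumed step-size rule
        shifted = list(x)
        shifted[j] += step
        moved = gradient(shifted)
        for i in range(n):
            raw[i][j] = (moved[i] - base[i]) / step
    # Average with the transpose so that the result is exactly symmetric.
    return [[0.5 * (raw[i][j] + raw[j][i]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Gradient of f(a, b) = a^2 + 3ab + 2b^2, whose Hessian is [[2, 3], [3, 4]].
    def gradient(v):
        return [2.0 * v[0] + 3.0 * v[1], 3.0 * v[0] + 4.0 * v[1]]

    print(finite_difference_hessian(gradient, [1.0, -2.0]))
```

Each Hessian approximation costs one extra gradient evaluation per parameter, which adds up when every gradient itself requires a numerical density approximation.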

Before PROC GENMOD maximizes the approximate likelihood, it first maximizes the following extended log quasi-likelihood, which is constructed according to the definition of McCullagh and Nelder (1989, Chapter 9) as

\[
Q_p(\mathbf{y}, \boldsymbol{\mu}, \phi, p) = \sum_{i} q(y_i, \mu_i, \phi, p)
\]

where the contribution from an observation is

\[
q(y_i, \mu_i, \phi, p) = -0.5 \log\left( 2\pi\phi\, y_i^{p} / w_i \right) - w_i \left( \frac{y_i^{2-p} - (2-p)\, y_i\, \mu_i^{1-p} + (1-p)\, \mu_i^{2-p}}{(1-p)(2-p)} \right) / \phi
\]

and $w_i$ is the weight for the observation from the WEIGHT statement.
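The per-observation contribution translates directly into code. The following Python sketch is a literal transcription of the formula with hypothetical function names. The text does not say how zero responses enter the logarithmic term, so the sketch accepts only $y_i > 0$; that restriction is an assumption of the example, not a statement about PROC GENMOD.

```python
import math


def eql_contribution(y, mu, phi, p, w=1.0):
    """Extended log quasi-likelihood contribution of one observation, 1 < p < 2."""
    if not 1.0 < p < 2.0:
        raise ValueError("the quasi-likelihood is used for 1 < p < 2")
    if y <= 0.0:
        raise ValueError("this sketch handles only y > 0")
    numerator = (y ** (2.0 - p)
                 - (2.0 - p) * y * mu ** (1.0 - p)
                 + (1.0 - p) * mu ** (2.0 - p))
    half_deviance = numerator / ((1.0 - p) * (2.0 - p))
    return -0.5 * math.log(2.0 * math.pi * phi * y ** p / w) - w * half_deviance / phi


def extended_quasi_likelihood(ys, mus, phi, p, weights=None):
    """Sum of the contributions over all observations."""
    if weights is None:
        weights = [1.0] * len(ys)
    return sum(eql_contribution(y, m, phi, p, w)
               for y, m, w in zip(ys, mus, weights))
```

When $\mu_i = y_i$, the bracketed term vanishes and only the logarithmic term remains, so for fixed $\phi$ and $p$ each contribution is largest at $\mu_i = y_i$.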

The range of the parameter $p$ for the quasi-likelihood is from 1 to 2. For a specified P= value outside this range, PROC GENMOD skips optimization of the quasi-likelihood. To maintain numerical stability, PROC GENMOD imposes a lower bound of 1.1 and an upper bound of 1.99 for computation with the quasi-likelihood. The full-likelihood solution imposes the same lower bound but no upper bound. The estimates that are obtained from optimizing the quasi-likelihood are usually near the full-likelihood solution, so fewer iterations are needed to maximize the more expensive full likelihood.

Last updated: December 09, 2022