The GENMOD Procedure

Generalized Estimating Equations

Let y Subscript i j, j equals 1 comma ellipsis comma n Subscript i Baseline, i equals 1 comma ellipsis comma upper K, represent the jth measurement on the ith subject. There are n Subscript i measurements on subject i and sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline n Subscript i total measurements where f Subscript i is the cluster frequency that you specify in the FREQUENCY statement. If you do not specify a FREQUENCY statement, f Subscript i Baseline equals 1 for all observations. The frequencies must be the same for all observations within each cluster.

Correlated data are modeled using the same link function and linear predictor setup (systematic component) as the independence case. The random component is described by the same variance functions as in the independence case, but the covariance structure of the correlated measurements must also be modeled. In the rest of this section, v left-parenthesis mu right-parenthesis is the variance function of the specified distribution. For the case of the negative binomial, the variance function is fixed at v left-parenthesis mu right-parenthesis equals mu plus k mu squared, where k is either the maximum likelihood estimate of the negative binomial dispersion parameter or the value specified in the NOSCALE and SCALE= options in the MODEL statement.

Let the vector of measurements on the ith subject be bold upper Y Subscript i Baseline equals left-bracket y Subscript i Baseline 1 Baseline comma ellipsis comma y Subscript i n Sub Subscript i Subscript Baseline right-bracket prime with corresponding vector of means bold-italic mu Subscript i Baseline equals left-bracket mu Subscript i Baseline 1 Baseline comma ellipsis comma mu Subscript i n Sub Subscript i Subscript Baseline right-bracket prime, and let bold upper V Subscript i be the covariance matrix of bold upper Y Subscript i. Let the vector of independent, or explanatory, variables for the jth measurement on the ith subject be

bold x Subscript i j Baseline equals left-bracket x Subscript i j Baseline 1 Baseline comma ellipsis comma x Subscript i j p Baseline right-bracket prime

The generalized estimating equation of Liang and Zeger (1986) for estimating the p times 1 vector of regression parameters bold-italic beta is an extension of the independence estimating equation to correlated data and is given by

bold upper S left-parenthesis bold-italic beta right-parenthesis equals sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline bold upper D prime Subscript i Baseline bold upper V Subscript i Superscript negative 1 Baseline left-parenthesis bold upper Y Subscript i Baseline minus bold-italic mu Subscript i Baseline left-parenthesis bold-italic beta right-parenthesis right-parenthesis equals bold 0

where

bold upper D Subscript i Baseline equals StartFraction partial-differential bold-italic mu Subscript i Baseline Over partial-differential bold-italic beta EndFraction

Because

g left-parenthesis mu Subscript i j Baseline right-parenthesis equals bold x prime Subscript i j Baseline bold-italic beta

where g is the link function, the p times n Subscript i matrix of partial derivatives of the mean with respect to the regression parameters for the ith subject is given by

bold upper D prime Subscript i Baseline equals StartFraction partial-differential bold-italic mu prime Subscript i Over partial-differential bold-italic beta EndFraction equals Start 3 By 3 Matrix 1st Row 1st Column StartFraction x Subscript i Baseline 11 Baseline Over g prime left-parenthesis mu Subscript i Baseline 1 Baseline right-parenthesis EndFraction 2nd Column ellipsis 3rd Column StartFraction x Subscript i n Sub Subscript i Subscript Baseline 1 Baseline Over g prime left-parenthesis mu Subscript i n Sub Subscript i Subscript Baseline right-parenthesis EndFraction 2nd Row 1st Column vertical-ellipsis 2nd Column Blank 3rd Column vertical-ellipsis 3rd Row 1st Column StartFraction x Subscript i Baseline 1 p Baseline Over g prime left-parenthesis mu Subscript i Baseline 1 Baseline right-parenthesis EndFraction 2nd Column ellipsis 3rd Column StartFraction x Subscript i n Sub Subscript i Subscript p Baseline Over g prime left-parenthesis mu Subscript i n Sub Subscript i Subscript Baseline right-parenthesis EndFraction EndMatrix

Working Correlation Matrix

Let bold upper R Subscript i Baseline left-parenthesis bold-italic alpha right-parenthesis be an n Subscript i Baseline times n Subscript i "working" correlation matrix that is fully specified by the vector of parameters bold-italic alpha. The covariance matrix of bold upper Y Subscript i is modeled as

bold upper V Subscript i Baseline equals phi bold upper A Subscript i Superscript one-half Baseline bold upper W Subscript i Superscript negative one-half Baseline bold upper R left-parenthesis bold-italic alpha right-parenthesis bold upper W Subscript i Superscript negative one-half Baseline bold upper A Subscript i Superscript one-half

where bold upper A Subscript i is an n Subscript i Baseline times n Subscript i diagonal matrix with v left-parenthesis mu Subscript i j Baseline right-parenthesis as the jth diagonal element and bold upper W Subscript i is an n Subscript i Baseline times n Subscript i diagonal matrix with w Subscript i j as the jth diagonal, where w Subscript i j is a weight specified with the WEIGHT statement. If there is no WEIGHT statement, w Subscript i j Baseline equals 1 for all i and j. If bold upper R Subscript i Baseline left-parenthesis bold-italic alpha right-parenthesis is the true correlation matrix of bold upper Y Subscript i, then bold upper V Subscript i is the true covariance matrix of bold upper Y Subscript i.

The working correlation matrix is usually unknown and must be estimated. It is estimated in the iterative fitting process by using the current value of the parameter vector bold-italic beta to compute appropriate functions of the Pearson residual

e Subscript i j Baseline equals StartFraction y Subscript i j Baseline minus mu Subscript i j Baseline Over StartRoot v left-parenthesis mu Subscript i j Baseline right-parenthesis slash w Subscript i j Baseline EndRoot EndFraction

If you specify the working correlation as bold upper R 0 equals bold upper I, which is the identity matrix, the GEE reduces to the independence estimating equation.

Following are the structures of the working correlation supported by the GENMOD procedure and the estimators used to estimate the working correlations.

Table 11: continued

Working Correlation Structure Estimator
Fixed
normal upper C normal o normal r normal r left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i k Baseline right-parenthesis equals r Subscript j k where r Subscript j k is the jkth element of a constant, user-specified correlation matrix bold upper R 0. The working correlation is not estimated in this case.
Independent
normal upper C normal o normal r normal r left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i k Baseline right-parenthesis equals StartLayout Enlarged left-brace 1st Row 1st Column 1 2nd Column j equals k 2nd Row 1st Column 0 2nd Column j not-equals k EndLayout The working correlation is not estimated in this case.
m-dependent
normal upper C normal o normal r normal r left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i comma j plus t Baseline right-parenthesis equals StartLayout Enlarged left-brace 1st Row 1st Column 1 2nd Column t equals 0 2nd Row 1st Column alpha Subscript t Baseline 2nd Column t equals 1 comma 2 comma ellipsis comma m 3rd Row 1st Column 0 2nd Column t greater-than m EndLayout ModifyingAbove alpha With caret Subscript t Baseline equals StartFraction 1 Over left-parenthesis upper K Subscript t Baseline minus p right-parenthesis phi EndFraction sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline sigma-summation Underscript j less-than-or-equal-to n Subscript i Baseline minus t Endscripts e Subscript i j Baseline e Subscript i comma j plus t
upper K Subscript t Baseline equals sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline left-parenthesis n Subscript i Baseline minus t right-parenthesis
Exchangeable
normal upper C normal o normal r normal r left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i k Baseline right-parenthesis equals StartLayout Enlarged left-brace 1st Row 1st Column 1 2nd Column j equals k 2nd Row 1st Column alpha 2nd Column j not-equals k EndLayout ModifyingAbove alpha With caret equals StartFraction 1 Over left-parenthesis upper N Superscript asterisk Baseline minus p right-parenthesis phi EndFraction sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline sigma-summation Underscript j less-than k Endscripts e Subscript i j Baseline e Subscript i k
upper N Superscript asterisk Baseline equals 0.5 sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline n Subscript i Baseline left-parenthesis n Subscript i Baseline minus 1 right-parenthesis
Unstructured
normal upper C normal o normal r normal r left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i k Baseline right-parenthesis equals StartLayout Enlarged left-brace 1st Row 1st Column 1 2nd Column j equals k 2nd Row 1st Column alpha Subscript j k Baseline 2nd Column j not-equals k EndLayout ModifyingAbove alpha With caret Subscript j k Baseline equals StartFraction 1 Over left-parenthesis upper K minus p right-parenthesis phi EndFraction sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline e Subscript i j Baseline e Subscript i k
Autoregressive AR(1)
normal upper C normal o normal r normal r left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i comma j plus t Baseline right-parenthesis equals alpha Superscript t for t equals 0 comma 1 comma 2 comma ellipsis comma n Subscript i Baseline minus j ModifyingAbove alpha With caret equals StartFraction 1 Over left-parenthesis upper K 1 minus p right-parenthesis phi EndFraction sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline sigma-summation Underscript j less-than-or-equal-to n Subscript i Baseline minus 1 Endscripts e Subscript i j Baseline e Subscript i comma j plus 1
upper K 1 equals sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline left-parenthesis n Subscript i Baseline minus 1 right-parenthesis


Dispersion Parameter

The dispersion parameter phi is estimated by

ModifyingAbove phi With caret equals StartFraction 1 Over upper N minus p EndFraction sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline sigma-summation Underscript j equals 1 Overscript n Subscript i Baseline Endscripts e Subscript i j Superscript 2

where upper N equals sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline n Subscript i is the total number of measurements and p is the number of regression parameters.

The square root of ModifyingAbove phi With caret is reported by PROC GENMOD as the scale parameter in the "Analysis of GEE Parameter Estimates Model-Based Standard Error Estimates" output table. If a fixed scale parameter is specified with the NOSCALE option in the MODEL statement, then the fixed value is used in estimating the model-based covariance matrix and standard errors.

Fitting Algorithm

The following is an algorithm for fitting the specified model by using GEEs. Note that this is not in general a likelihood-based method of estimation, so that inferences based on likelihoods are not possible for GEE methods.

  1. Compute an initial estimate of bold-italic beta with an ordinary generalized linear model assuming independence.

  2. Compute the working correlations bold upper R based on the standardized residuals, the current bold-italic beta, and the assumed structure of bold upper R.

  3. Compute an estimate of the covariance:

    bold upper V Subscript i Baseline equals phi bold upper A Subscript i Superscript one-half Baseline bold upper W Subscript i Superscript negative one-half Baseline ModifyingAbove bold upper R With caret left-parenthesis bold-italic alpha right-parenthesis bold upper W Subscript i Superscript negative one-half Baseline bold upper A Subscript i Superscript one-half
  4. Update bold-italic beta:

    bold-italic beta Subscript r plus 1 Baseline equals bold-italic beta Subscript r Baseline plus left-bracket sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline StartFraction partial-differential bold-italic mu Subscript i Baseline Over partial-differential bold-italic beta EndFraction prime bold upper V Subscript i Superscript negative 1 Baseline StartFraction partial-differential bold-italic mu Subscript i Baseline Over partial-differential bold-italic beta EndFraction right-bracket Superscript negative 1 Baseline left-bracket sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline StartFraction partial-differential bold-italic mu Subscript i Baseline Over partial-differential bold-italic beta EndFraction prime bold upper V Subscript i Superscript negative 1 Baseline left-parenthesis bold upper Y Subscript i Baseline minus bold-italic mu Subscript i Baseline right-parenthesis right-bracket
  5. Repeat steps 2-4 until convergence.

Missing Data

See Diggle, Liang, and Zeger (1994, Chapter 11) for a discussion of missing values in longitudinal data. Suppose that you intend to take measurements upper Y Subscript i Baseline 1 Baseline comma ellipsis comma upper Y Subscript i n Baseline for the ith unit. Missing values for which upper Y Subscript i j are missing whenever upper Y Subscript i k is missing for all j greater-than-or-equal-to k are called dropouts. Otherwise, missing values that occur intermixed with nonmissing values are intermittent missing values. The GENMOD procedure can estimate the working correlation from data containing both types of missing values by using the all available pairs method, in which all nonmissing pairs of data are used in the moment estimators of the working correlation parameters defined previously. The resulting covariances and standard errors are valid under the missing completely at random (MCAR) assumption.

For example, for the unstructured working correlation model,

ModifyingAbove alpha With caret Subscript j k Baseline equals StartFraction 1 Over left-parenthesis upper K prime minus p right-parenthesis phi EndFraction sigma-summation f Subscript i Baseline e Subscript i j Baseline e Subscript i k

where the sum is over the units that have nonmissing measurements at times j and k, and upper K prime is the number of units with nonmissing measurements at j and k. Estimates of the parameters for other working correlation types are computed in a similar manner, using available nonmissing pairs in the appropriate moment estimators.

The contribution of the ith unit to the parameter update equation is computed by omitting the elements of left-parenthesis bold upper Y Subscript i Baseline minus bold-italic mu Subscript bold i Baseline right-parenthesis, the columns of bold upper D prime Subscript i Baseline equals StartFraction partial-differential bold-italic mu Over partial-differential bold-italic beta EndFraction prime, and the rows and columns of bold upper V Subscript i corresponding to missing measurements.

Parameter Estimate Covariances

The model-based estimator of Cov left-parenthesis ModifyingAbove bold-italic beta With caret right-parenthesis is given by

bold upper Sigma Subscript m Baseline left-parenthesis ModifyingAbove bold-italic beta With caret right-parenthesis equals bold upper I 0 Superscript negative 1

where

bold upper I 0 equals sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline StartFraction partial-differential bold-italic mu Subscript i Baseline Over partial-differential bold-italic beta EndFraction prime bold upper V Subscript i Superscript negative 1 Baseline StartFraction partial-differential bold-italic mu Subscript i Baseline Over partial-differential bold-italic beta EndFraction

This is the GEE equivalent of the inverse of the Fisher information matrix that is often used in generalized linear models as an estimator of the covariance estimate of the maximum likelihood estimator of bold-italic beta. It is a consistent estimator of the covariance matrix of ModifyingAbove bold-italic beta With caret if the mean model and the working correlation matrix are correctly specified.

The estimator

bold upper Sigma Subscript e Baseline equals bold upper I 0 Superscript negative 1 Baseline bold upper I 1 bold upper I 0 Superscript negative 1

is called the empirical, or robust, estimator of the covariance matrix of ModifyingAbove bold-italic beta With caret, where

bold upper I 1 equals sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline StartFraction partial-differential bold-italic mu Subscript i Baseline Over partial-differential bold-italic beta EndFraction prime bold upper V Subscript i Superscript negative 1 Baseline normal upper C normal o normal v left-parenthesis bold upper Y Subscript i Baseline right-parenthesis bold upper V Subscript i Superscript negative 1 Baseline StartFraction partial-differential bold-italic mu Subscript i Baseline Over partial-differential bold-italic beta EndFraction

It has the property of being a consistent estimator of the covariance matrix of ModifyingAbove bold-italic beta With caret, even if the working correlation matrix is misspecified—that is, if normal upper C normal o normal v left-parenthesis bold upper Y Subscript i Baseline right-parenthesis not-equals bold upper V Subscript i. For further information about the robust variance estimate, see Zeger, Liang, and Albert (1988); Royall (1986); White (1982). In computing bold upper Sigma Subscript e, bold-italic beta and phi are replaced by estimates, and normal upper C normal o normal v left-parenthesis bold upper Y Subscript i Baseline right-parenthesis is replaced by the estimate

left-parenthesis bold upper Y Subscript i Baseline minus bold-italic mu Subscript i Baseline left-parenthesis ModifyingAbove bold-italic beta With caret right-parenthesis right-parenthesis left-parenthesis bold upper Y Subscript i Baseline minus bold-italic mu Subscript i Baseline left-parenthesis ModifyingAbove bold-italic beta With caret right-parenthesis right-parenthesis prime

Multinomial GEEs

Lipsitz, Kim, and Zhao (1994) and Miller, Davis, and Landis (1993) describe how to extend GEEs to multinomial data. Currently, only the independent working correlation is available for multinomial models in PROC GENMOD.

Alternating Logistic Regressions

If the responses are binary (that is, they take only two values), then there is an alternative method to account for the association among the measurements. The alternating logistic regressions (ALR) algorithm of Carey, Zeger, and Diggle (1993) models the association between pairs of responses with log odds ratios, instead of with correlations, as ordinary GEEs do.

For binary data, the correlation between the jth and kth response is, by definition,

normal upper C normal o normal r normal r left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i k Baseline right-parenthesis equals StartFraction probability left-parenthesis upper Y Subscript i j Baseline equals 1 comma upper Y Subscript i k Baseline equals 1 right-parenthesis minus mu Subscript i j Baseline mu Subscript i k Baseline Over StartRoot mu Subscript i j Baseline left-parenthesis 1 minus mu Subscript i j Baseline right-parenthesis mu Subscript i k Baseline left-parenthesis 1 minus mu Subscript i k Baseline right-parenthesis EndRoot EndFraction

The joint probability in the numerator satisfies the following bounds, by elementary properties of probability, since mu Subscript i j Baseline equals probability left-parenthesis upper Y Subscript i j Baseline equals 1 right-parenthesis:

max left-parenthesis 0 comma mu Subscript i j Baseline plus mu Subscript i k Baseline minus 1 right-parenthesis less-than-or-equal-to probability left-parenthesis upper Y Subscript i j Baseline equals 1 comma upper Y Subscript i k Baseline equals 1 right-parenthesis less-than-or-equal-to min left-parenthesis mu Subscript i j Baseline comma mu Subscript i k Baseline right-parenthesis

The correlation, therefore, is constrained to be within limits that depend in a complicated way on the means of the data.

The odds ratio, defined as

normal upper O normal upper R left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i k Baseline right-parenthesis equals StartFraction probability left-parenthesis upper Y Subscript i j Baseline equals 1 comma upper Y Subscript i k Baseline equals 1 right-parenthesis probability left-parenthesis upper Y Subscript i j Baseline equals 0 comma upper Y Subscript i k Baseline equals 0 right-parenthesis Over probability left-parenthesis upper Y Subscript i j Baseline equals 1 comma upper Y Subscript i k Baseline equals 0 right-parenthesis probability left-parenthesis upper Y Subscript i j Baseline equals 0 comma upper Y Subscript i k Baseline equals 1 right-parenthesis EndFraction

is not constrained by the means and is preferred, in some cases, to correlations for binary data.

The ALR algorithm seeks to model the logarithm of the odds ratio, gamma Subscript i j k Baseline equals log left-parenthesis normal upper O normal upper R left-parenthesis upper Y Subscript i j Baseline comma upper Y Subscript i k Baseline right-parenthesis right-parenthesis, as

gamma Subscript i j k Baseline equals bold z prime Subscript i j k Baseline bold-italic alpha

where bold-italic alpha is a q times 1 vector of regression parameters and bold z Subscript i j k is a fixed, specified vector of coefficients.

The parameter gamma Subscript i j k can take any value in left-parenthesis negative normal infinity comma normal infinity right-parenthesis with gamma Subscript i j k Baseline equals 0 corresponding to no association.

The log odds ratio, when modeled in this way with a regression model, can take different values in subgroups defined by bold z Subscript i j k. For example, bold z Subscript i j k can define subgroups within clusters, or it can define "block effects" between clusters.

You specify a GEE model for binary data that uses log odds ratios by specifying a model for the mean, as in ordinary GEEs, and a model for the log odds ratios. You can use any of the link functions appropriate for binary data in the model for the mean, such as logistic, probit, or complementary log-log. The ALR algorithm alternates between a GEE step to update the model for the mean and a logistic regression step to update the log odds ratio model. Upon convergence, the ALR algorithm provides estimates of the regression parameters for the mean, bold-italic beta, the regression parameters for the log odds ratios, bold-italic alpha, their standard errors, and their covariances.

Specifying Log Odds Ratio Models

Specifying a regression model for the log odds ratio requires you to specify rows of the bold z matrix bold z Subscript i j k for each cluster i and each unique within-cluster pair left-parenthesis j comma k right-parenthesis. The GENMOD procedure provides several methods of specifying bold z Subscript i j k. These are controlled by the LOGOR=keyword and associated options in the REPEATED statement. The supported keywords and the resulting log odds ratio models are described as follows.

EXCH

specifies exchangeable log odds ratios. In this model, the log odds ratio is a constant for all clusters i and pairs left-parenthesis j comma k right-parenthesis. The parameter alpha is the common log odds ratio.

bold z Subscript i j k Baseline equals 1 for all i comma j comma k
FULLCLUST

specifies fully parameterized clusters. Each cluster is parameterized in the same way, and there is a parameter for each unique pair within clusters. If a complete cluster is of size n, then there are StartFraction n left-parenthesis n minus 1 right-parenthesis Over 2 EndFraction parameters in the vector bold-italic alpha. For example, if a full cluster is of size 4, then there are StartFraction 4 times 3 Over 2 EndFraction equals 6 parameters, and the bold z matrix is of the form

bold upper Z equals Start 6 By 6 Matrix 1st Row 1st Column 1 2nd Column 0 3rd Column 0 4th Column 0 5th Column 0 6th Column 0 2nd Row 1st Column 0 2nd Column 1 3rd Column 0 4th Column 0 5th Column 0 6th Column 0 3rd Row 1st Column 0 2nd Column 0 3rd Column 1 4th Column 0 5th Column 0 6th Column 0 4th Row 1st Column 0 2nd Column 0 3rd Column 0 4th Column 1 5th Column 0 6th Column 0 5th Row 1st Column 0 2nd Column 0 3rd Column 0 4th Column 0 5th Column 1 6th Column 0 6th Row 1st Column 0 2nd Column 0 3rd Column 0 4th Column 0 5th Column 0 6th Column 1 EndMatrix

The elements of bold-italic alpha correspond to log odds ratios for cluster pairs in the following order:

Pair Parameter
(1,2) Alpha1
(1,3) Alpha2
(1,4) Alpha3
(2.3) Alpha4
(2,4) Alpha5
(3,4) Alpha6

LOGORVAR(variable)

specifies log odds ratios by cluster. The argument variable is a variable name that defines the "block effects" between clusters. The log odds ratios are constant within clusters, but they take a different value for each different value of the variable. For example, if Center is a variable in the input data set taking a different value for k treatment centers, then specifying LOGOR=LOGORVAR(Center) requests a model with different log odds ratios for each of the k centers, constant within center.

NESTK

specifies k-nested log odds ratios. You must also specify the SUBCLUST=variable option to define subclusters within clusters. Within each cluster, PROC GENMOD computes a log odds ratio parameter for pairs having the same value of variable for both members of the pair and one log odds ratio parameter for each unique combination of different values of variable.

NEST1

specifies 1-nested log odds ratios. You must also specify the SUBCLUST=variable option to define subclusters within clusters. There are two log odds ratio parameters for this model. Pairs having the same value of variable correspond to one parameter; pairs having different values of variable correspond to the other parameter. For example, if clusters are hospitals and subclusters are wards within hospitals, then patients within the same ward have one log odds ratio parameter, and patients from different wards have the other parameter.

ZFULL

specifies the full bold z matrix. You must also specify a SAS data set containing the bold z matrix with the ZDATA=data-set-name option. Each observation in the data set corresponds to one row of the bold z matrix. You must specify the ZDATA data set as if all clusters are complete—that is, as if all clusters are the same size and there are no missing observations. The ZDATA data set has upper K left-bracket n Subscript m a x Baseline left-parenthesis n Subscript m a x Baseline minus 1 right-parenthesis slash 2 right-bracket observations, where K is the number of clusters and n Subscript m a x is the maximum cluster size. If the members of cluster i are ordered as 1 comma 2 comma ellipsis comma n, then the rows of the bold z matrix must be specified for pairs in the order left-parenthesis 1 comma 2 right-parenthesis comma left-parenthesis 1 comma 3 right-parenthesis comma ellipsis comma left-parenthesis 1 comma n right-parenthesis comma left-parenthesis 2 comma 3 right-parenthesis comma ellipsis comma left-parenthesis 2 comma n right-parenthesis comma ellipsis comma left-parenthesis n minus 1 comma n right-parenthesis. The variables specified in the REPEATED statement for the SUBJECT effect must also be present in the ZDATA= data set to identify clusters. You must specify variables in the data set that define the columns of the bold z matrix by the ZROW=variable-list option. If there are q columns (q variables in variable-list), then there are q log odds ratio parameters. You can optionally specify variables indicating the cluster pairs corresponding to each row of the bold z matrix with the YPAIR=(variable1, variable2) option. If you specify this option, the data from the ZDATA data set are sorted within each cluster by variable1 and variable2. See Example 51.6 for an example of specifying a full bold z matrix.

ZREP

specifies a replicated bold z matrix. You specify bold z matrix data exactly as you do for the ZFULL case, except that you specify only one complete cluster. The bold z matrix for the one cluster is replicated for each cluster. The number of observations in the ZDATA data set is StartFraction n Subscript m a x Baseline left-parenthesis n Subscript m a x Baseline minus 1 right-parenthesis Over 2 EndFraction, where n Subscript m a x is the size of a complete cluster (a cluster with no missing observations).

ZREP(matrix)

specifies direct input of the replicated bold z matrix. You specify the bold z matrix for one cluster with the syntax LOGOR=ZREP ( left-parenthesis y 1 y 2 right-parenthesis z 1 z 2 midline-horizontal-ellipsis z Subscript q Baseline comma midline-horizontal-ellipsis ), where y 1 and y 2 are numbers representing a pair of observations and the values z 1 comma z 2 comma ellipsis comma z Subscript q Baseline make up the corresponding row of the bold z matrix. The number of rows specified is StartFraction n Subscript m a x Baseline left-parenthesis n Subscript m a x Baseline minus 1 right-parenthesis Over 2 EndFraction, where n Subscript m a x is the size of a complete cluster (a cluster with no missing observations). For example,

logor =  zrep((1 2) 1 0,
              (1 3) 1 0,
              (1 4) 1 0,
              (2 3) 1 1,
              (2 4) 1 1,
              (3 4) 1 1)

specifies the StartFraction 4 times 3 Over 2 EndFraction equals 6 rows of the bold z matrix for a cluster of size 4 with q = 2 log odds ratio parameters. The log odds ratio for the pairs (1 2), (1 3), (1 4) is alpha 1, and the log odds ratio for the pairs (2 3), (2 4), (3 4) is alpha 1 plus alpha 2.

Quasi-likelihood Information Criterion

The quasi-likelihood information criterion (QIC) was developed by Pan (2001) as a modification of the Akaike information criterion (AIC) to apply to models fit by GEEs.

Define the quasi-likelihood under the independence working correlation assumption, evaluated with the parameter estimates under the working correlation of interest as

upper Q left-parenthesis ModifyingAbove bold-italic beta With caret left-parenthesis upper R right-parenthesis comma phi right-parenthesis equals sigma-summation Underscript i equals 1 Overscript upper K Endscripts f Subscript i Baseline sigma-summation Underscript j equals 1 Overscript n Subscript i Baseline Endscripts upper Q left-parenthesis ModifyingAbove bold-italic beta With caret left-parenthesis upper R right-parenthesis comma phi semicolon left-parenthesis upper Y Subscript i j Baseline comma bold upper X Subscript i j Baseline right-parenthesis right-parenthesis

where the quasi-likelihood contribution of the jth observation in the ith cluster is defined in the section Quasi-likelihood Functions and ModifyingAbove bold-italic beta With caret left-parenthesis upper R right-parenthesis are the parameter estimates obtained from GEEs with the working correlation of interest R.

QIC is defined as

normal upper Q normal upper I normal upper C left-parenthesis upper R right-parenthesis equals minus 2 upper Q left-parenthesis ModifyingAbove bold-italic beta With caret left-parenthesis upper R right-parenthesis comma phi right-parenthesis plus 2 normal t normal r normal a normal c normal e left-parenthesis ModifyingAbove normal upper Omega With caret Subscript upper I Baseline ModifyingAbove upper V With caret Subscript upper R Baseline right-parenthesis

where ModifyingAbove upper V With caret Subscript upper R is the robust covariance estimate and ModifyingAbove normal upper Omega With caret Subscript upper I is the inverse of the model-based covariance estimate under the independent working correlation assumption, evaluated at ModifyingAbove bold-italic beta With caret left-parenthesis upper R right-parenthesis, the parameter estimates obtained from GEEs with the working correlation of interest R.

PROC GENMOD also computes an approximation to normal upper Q normal upper I normal upper C left-parenthesis upper R right-parenthesis defined by Pan (2001) as

normal upper Q normal upper I normal upper C Subscript u Baseline left-parenthesis upper R right-parenthesis equals minus 2 upper Q left-parenthesis ModifyingAbove bold-italic beta With caret left-parenthesis upper R right-parenthesis comma phi right-parenthesis plus 2 p

where p is the number of regression parameters.

Pan (2001) notes that QIC is appropriate for selecting regression models and working correlations, whereas normal upper Q normal upper I normal upper C Subscript u is appropriate only for selecting regression models.

Quasi-likelihood Functions

See McCullagh and Nelder (1989) and Hardin and Hilbe (2003) for discussions of quasi-likelihood functions. The contribution of observation j in cluster i to the quasi-likelihood function evaluated at the regression parameters bold-italic beta is given by upper Q left-parenthesis bold-italic beta comma phi semicolon left-parenthesis upper Y Subscript i j Baseline comma bold upper X Subscript i j Baseline right-parenthesis right-parenthesis equals StartFraction upper Q Subscript i j Baseline Over phi EndFraction, where upper Q Subscript i j is defined in the following list. These are used in the computation of the quasi-likelihood information criteria (QIC) for goodness of fit of models fit with GEEs. The w Subscript i j are prior weights, if any, specified with the WEIGHT or FREQ statements. Note that the definition of the quasi-likelihood for the negative binomial differs from that given in McCullagh and Nelder (1989). The definition used here allows the negative binomial quasi-likelihood to approach the Poisson as k right-arrow 0.

  • Normal:

    upper Q Subscript i j Baseline equals minus one-half w Subscript i j Baseline left-parenthesis y Subscript i j Baseline minus mu Subscript i j Baseline right-parenthesis squared
  • Inverse Gaussian:

    upper Q Subscript i j Baseline equals StartFraction w Subscript i j Baseline left-parenthesis mu Subscript i j Baseline minus .5 y Subscript i j Baseline right-parenthesis Over mu Subscript i j Superscript 2 Baseline EndFraction
  • Gamma:

    upper Q Subscript i j Baseline equals minus w Subscript i j Baseline left-bracket StartFraction y Subscript i j Baseline Over mu Subscript i j Baseline EndFraction plus log left-parenthesis mu Subscript i j Baseline right-parenthesis right-bracket
  • Negative binomial:

    upper Q Subscript i j Baseline equals w Subscript i j Baseline left-bracket log normal upper Gamma left-parenthesis y Subscript i j Baseline plus StartFraction 1 Over k EndFraction right-parenthesis minus log normal upper Gamma left-parenthesis StartFraction 1 Over k EndFraction right-parenthesis plus y Subscript i j Baseline log left-parenthesis StartFraction k mu Subscript i j Baseline Over 1 plus k mu Subscript i j Baseline EndFraction right-parenthesis plus StartFraction 1 Over k EndFraction log left-parenthesis StartFraction 1 Over 1 plus k mu Subscript i j Baseline EndFraction right-parenthesis right-bracket
  • Poisson:

    upper Q Subscript i j Baseline equals w Subscript i j Baseline left-parenthesis y Subscript i j Baseline log left-parenthesis mu Subscript i j Baseline right-parenthesis minus mu Subscript i j Baseline right-parenthesis
  • Binomial:

    upper Q Subscript i j Baseline equals w Subscript i j Baseline left-bracket r Subscript i j Baseline log left-parenthesis p Subscript i j Baseline right-parenthesis plus left-parenthesis n Subscript i j Baseline minus r Subscript i j Baseline right-parenthesis log left-parenthesis 1 minus p Subscript i j Baseline right-parenthesis right-bracket
  • Multinomial (s categories):

    upper Q Subscript i j Baseline equals w Subscript i j Baseline sigma-summation Underscript k equals 1 Overscript s Endscripts y Subscript i j k Baseline log left-parenthesis mu Subscript i j k Baseline right-parenthesis

Generalized Score Statistics

Boos (1992) and Rotnitzky and Jewell (1990) describe score tests applicable to testing bold upper L prime bold-italic beta equals bold 0 in GEEs, where bold upper L prime is a user-specified r times p contrast matrix or a contrast for a Type 3 test of hypothesis.

Let bold-italic beta overTilde be the regression parameters resulting from solving the GEE under the restricted model bold upper L prime bold-italic beta equals bold 0, and let bold upper S left-parenthesis bold-italic beta overTilde right-parenthesis be the generalized estimating equation values at bold-italic beta overTilde.

The generalized score statistic is

upper T equals bold upper S left-parenthesis bold-italic beta overTilde right-parenthesis prime bold upper Sigma Subscript m Baseline bold upper L left-parenthesis bold upper L prime bold upper Sigma Subscript e Baseline bold upper L right-parenthesis Superscript negative 1 Baseline bold upper L prime bold upper Sigma Subscript m Baseline bold upper S left-parenthesis bold-italic beta overTilde right-parenthesis

where bold upper Sigma Subscript m is the model-based covariance estimate and bold upper Sigma Subscript e is the empirical covariance estimate. The p-values for T are computed based on the chi-square distribution with r degrees of freedom.

The preceding development for score tests assumes that the rank of the empirical covariance matrix bold upper Sigma Subscript e is not less than the row rank of the contrast matrix bold upper L. When the rank of bold upper Sigma Subscript e is less than the row rank of bold upper L, estimability of the function is not sufficient to ensure that the chi-square test statistic has a unique value no matter what kind of generalized inverse is used to compute left-parenthesis bold upper L prime bold upper Sigma Subscript e Baseline bold upper L right-parenthesis Superscript minus.

Although it is extremely rare, it is possible in practice that the uniqueness condition is not satisfied. For example, if the number of clusters is less than the number of nonsingular parameters in the model, then the matrix of coefficients for testing the overall null does not satisfy the uniqueness condition. If this condition is not satisfied, then the chi-square statistic for testing upper H colon bold upper L bold-italic beta equals bold 0 is not invariant to the choice of the g 2-inverse of bold upper L bold upper Sigma Subscript e Baseline bold upper L prime. This chi-square test is not recommended when the uniqueness condition is not satisfied. An alternative approach would be to increase the number of clusters or to find a parsimonious model so that the number of parameters is less than the number of clusters. When the rank of bold upper Sigma Subscript e is less than the row rank of bold upper L for a test, the procedure prints a note to the SAS log.

Last updated: December 09, 2022