The GEE Procedure

Weighted Generalized Estimating Equations under the MAR Assumption

In longitudinal studies, response measurements are often missing because of skipped visits or dropouts. Let $r_{ij}$ be the indicator that the response $y_{ij}$ is observed, where $r_{ij} = 1$ if $y_{ij}$ is observed and $r_{ij} = 0$ otherwise. Missing data patterns can be classified into two types: dropout and intermittent. A dropout occurs if an individual skips a particular visit and then never comes back for subsequent visits; that is, if $r_{ij} = 0$, then $r_{ik} = 0$ for all $k > j$. Otherwise, the missing data pattern is intermittent. Intermittent patterns can be quite complicated; only dropout patterns are considered here.
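For example, a DATA step such as the following can flag whether each subject's missingness pattern is dropout or intermittent. This is a minimal sketch: the data set WORK.LONG and its variables id, time, and y are hypothetical, and the data are assumed to be sorted by id and time.

   data pattern;
      set long;
      by id;
      retain seenMissing intermittent;
      if first.id then do;
         seenMissing = 0;
         intermittent = 0;
      end;
      if missing(y) then seenMissing = 1;
      else if seenMissing then intermittent = 1;   /* an observed response after a gap */
      if last.id then output;                      /* one record per subject           */
      keep id intermittent;
   run;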

The missingness mechanism can be described by a statistical model for the probability that a response is missing, and making the right assumption about this mechanism is crucial for methods that handle missing data. Missingness mechanisms are classified into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin 1976).

Assumptions about longitudinal data that include missing responses caused by dropouts are classified as follows:

  • The data are said to be MCAR if the probability of a missing response is independent of its past, current, and future responses, conditional on the covariates. That is, $P(r_{ij} = 0 \mid \mathbf{Y}_i, \mathbf{X}_i) = P(r_{ij} = 0 \mid \mathbf{X}_i)$.

  • The data are said to be MAR if the probability of a missing response is independent of its current and future responses, conditional on the observed past responses and the covariates. That is, $P(r_{ij} = 0 \mid r_{i,j-1} = 1, \mathbf{X}_i, \mathbf{Y}_i) = P(r_{ij} = 0 \mid r_{i,j-1} = 1, \mathbf{X}_i, y_{i1}, \ldots, y_{i,j-1})$. MAR is a weaker assumption than MCAR.

  • The data are said to be MNAR if the probability of a missing response depends on the unobserved responses. MNAR is the most general and the most problematic missing-data scenario.

The GEE procedure implements two different weighted methods (observation-specific and subject-specific) of estimating the regression parameter $\boldsymbol{\beta}$ when dropouts occur. Both methods provide consistent estimates if the data are MAR. The weighted GEE methods are not supported for the multinomial distribution for polytomous responses.

Observation-Specific Weighted GEE Method

Suppose $w_{ij}$ is the weight for $y_{ij}$, defined as the inverse probability of observing $y_{ij}$; that is, $w_{ij} = P(r_{ij} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1}$. Let $\mathbf{W}_i$ be the $T \times T$ diagonal matrix whose $j$th diagonal element is $r_{ij} w_{ij}$. The responses for the $i$th subject are $\mathbf{Y}_i = (y_{i1}, y_{i2}, \ldots, y_{iT})'$. Consider the following weighted generalized estimating equations (Robins and Rotnitzky 1995; Preisser, Lohman, and Rathouz 2002):

\[
\mathbf{S}_{ow}(\boldsymbol{\beta}) = \sum_{i=1}^{K} \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \, \mathbf{V}_i^{-1} \mathbf{W}_i \bigl( \mathbf{Y}_i - \boldsymbol{\mu}_i(\boldsymbol{\beta}) \bigr) = \mathbf{0}
\]

Unlike the standard generalized estimating equations, the weighted generalized estimating equations are unbiased when the observations are appropriately weighted, and they lead to consistent estimates of $\boldsymbol{\beta}$.

The weights $w_{ij}$ are often unknown in practice and are estimated by a logistic regression model under the MAR assumption. Specifically, let $\lambda_{ij} = P(r_{ij} = 1 \mid r_{i,j-1} = 1, \mathbf{X}_i, \mathbf{Y}_i)$ denote the probability of observing the response $y_{ij}$, given that its previous response is observed.

Under the MAR assumption,

\[
\lambda_{ij} = P(r_{ij} = 1 \mid r_{i,j-1} = 1, \mathbf{X}_i, \mathbf{Y}_i) = P(r_{ij} = 1 \mid r_{i,j-1} = 1, \mathbf{X}_i, y_{i1}, \ldots, y_{i,j-1})
\]

Using the observed data, $\lambda_{ij}$ can be predicted from a logistic regression model,

\[
\mathrm{logit}\{\lambda_{ij}\} = \mathbf{z}_{ij} \boldsymbol{\alpha}
\]

where the $\mathbf{z}_{ij}$ are predictors that usually include the covariates $\mathbf{x}_{ij}$, the past responses, and indicators for the visit times. The dropout process implies that the estimated probability of observing $y_{ij}$ can be expressed as a cumulative product of conditional probabilities:

\[
\hat{P}(r_{ij} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i) = \lambda_{i1}(\hat{\boldsymbol{\alpha}}) \times \lambda_{i2}(\hat{\boldsymbol{\alpha}}) \times \cdots \times \lambda_{ij}(\hat{\boldsymbol{\alpha}})
\]

With the estimated weights $\hat{w}_{ij} = \hat{P}(r_{ij} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1}$ plugged in, the regression parameter $\boldsymbol{\beta}$ is estimated by solving $\mathbf{S}_{ow}(\boldsymbol{\beta}) = \mathbf{0}$. The fitting algorithm is described in the section Fitting Algorithm for Weighted GEE.
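The following statements sketch this weight construction outside the GEE procedure. They are a minimal illustration, not PROC GEE's internal implementation: the data set WORK.LONG, its variables (id, time, y, and a covariate trt), and the choice of missingness-model predictors (the lagged response, trt, and time) are all hypothetical, and the first visit is assumed to be always observed (so its weight is 1).

   /* Records that contribute to the missingness model: visits whose
      previous response was observed (dropout-only missingness)           */
   data design;
      set long;
      by id;
      r = (not missing(y));    /* response indicator r_ij                 */
      prevy = lag(y);          /* previous response                       */
      prevr = lag(r);          /* previous response indicator             */
      if first.id then do;
         prevy = .;
         prevr = .;
      end;
   run;

   /* Fit logit(lambda_ij) = z_ij * alpha                                  */
   proc logistic data=design;
      where prevr = 1;
      model r(event='1') = prevy trt time;
      output out=lampred p=lambda;
   run;

   /* The cumulative product of the lambda_ij estimates P(r_ij=1 | X_i, Y_i);
      its inverse is the observation-specific weight w_ij. The weight that
      is computed at the dropout visit itself is never used, because
      r_ij = 0 there.                                                      */
   proc sort data=lampred;
      by id time;
   run;

   data obsweights;
      set lampred;
      by id;
      retain cumprob;
      if first.id then cumprob = 1;
      cumprob = cumprob * lambda;
      w = 1 / cumprob;
      keep id time w;
   run;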

Subject-Specific Weighted GEE Method

Unlike the observation-specific weighted method, which assigns a separate weight to each observation, the subject-specific weighted method assigns a single weight to each subject; in other words, all the observations from a subject receive the same weight. Specifically, the subject-specific weighted method obtains the regression parameter estimates by solving the equations

\[
\mathbf{S}_{sw}(\boldsymbol{\beta}) = \sum_{i=1}^{K} \mathbf{D}_i' \mathbf{V}_i^{-1} w_i \bigl( \mathbf{Y}_i - \boldsymbol{\mu}_i(\boldsymbol{\beta}) \bigr) = \mathbf{0}
\]

where $\mathbf{D}_i = \partial \boldsymbol{\mu}_i(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}$, the responses for the $i$th subject are $\mathbf{Y}_i = (y_{i1}, y_{i2}, \ldots, y_{i n_i})'$, and the weight $w_i$ for subject $i$ is the inverse probability of subject $i$ dropping out at the observed time (Fitzmaurice, Molenberghs, and Lipsitz 1995; Preisser, Lohman, and Rathouz 2002). Note that the weight $w_i$ is a scalar, in contrast to the weight matrix $\mathbf{W}_i$ that the observation-specific weighted GEE method uses.

The subject-specific weighted estimating equations are also unbiased when the subjects are appropriately weighted, and they lead to consistent estimates of the regression parameter $\boldsymbol{\beta}$.

The weight $w_i$ is usually unknown in practice and needs to be estimated. Suppose subject $i$ drops out at time $m_i = \sum_{j=1}^{T} r_{ij} + 1$. Assume that the first visit $y_{i1}$ is always observed, so that $r_{i1} = 1$. Thus the dropout times $m_i$ range from 2 to $T + 1$, where a dropout time of $T + 1$ indicates that subject $i$ completes all $T$ visits and does not drop out.

The weight $w_i$ is defined as follows: if subject $i$ drops out before completing the last visit (that is, $m_i \le T$), then $w_i = P(r_{i m_i} = 0,\, r_{i, m_i - 1} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1}$; otherwise, subject $i$ completes all $T$ visits (that is, $m_i = T + 1$) and $w_i = P(r_{iT} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1}$.

As in the observation-specific weighted method, the dropout process implies that the subject-specific weights can be estimated as a cumulative product of conditional probabilities:

\[
\hat{w}_i =
\begin{cases}
P(r_{i m_i} = 0,\, r_{i, m_i - 1} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1} = \bigl[ \lambda_{i1}(\hat{\boldsymbol{\alpha}}) \times \cdots \times \lambda_{i, m_i - 1}(\hat{\boldsymbol{\alpha}}) \times \bigl( 1 - \lambda_{i m_i}(\hat{\boldsymbol{\alpha}}) \bigr) \bigr]^{-1}, & \text{if } m_i \le T \\
P(r_{i, m_i - 1} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1} = \bigl[ \lambda_{i1}(\hat{\boldsymbol{\alpha}}) \times \lambda_{i2}(\hat{\boldsymbol{\alpha}}) \times \cdots \times \lambda_{i, m_i - 1}(\hat{\boldsymbol{\alpha}}) \bigr]^{-1}, & \text{if } m_i = T + 1
\end{cases}
\]

Thus, the subject-specific weights $\hat{w}_i$ can be obtained after $\lambda_{ij}$ is estimated by fitting a logistic regression to the data $(r_{ij}, \mathbf{z}_{ij})$.
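Continuing the earlier sketch (the data set WORK.LAMPRED, its predicted probability lambda, and the observation indicator r remain hypothetical), a DATA step like the following accumulates the product over each subject's retained visits and inverts it at the subject's last retained record, which is the dropout visit (where r = 0) for a dropout and visit T for a completer:

   proc sort data=lampred;
      by id time;
   run;

   data subweights;
      set lampred;
      by id;
      retain prob;
      if first.id then prob = 1;            /* lambda_i1 is taken to be 1   */
      if r = 1 then prob = prob * lambda;   /* subject stayed at this visit */
      else prob = prob * (1 - lambda);      /* subject dropped out here     */
      if last.id then do;
         w = 1 / prob;                      /* subject-specific weight w_i  */
         output;
      end;
      keep id w;
   run;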

The regression parameter $\boldsymbol{\beta}$ for the subject-specific weighted GEE method can be estimated by solving $\mathbf{S}_{sw}(\boldsymbol{\beta}) = \mathbf{0}$ after plugging in the estimated weights. The fitting algorithm is described in the section Fitting Algorithm for Weighted GEE. The subject-specific weighting scheme was originally developed for computational convenience. However, Preisser, Lohman, and Rathouz (2002) showed that the observation-specific (observation-level) weighted GEE method produces more efficient estimates than the subject-specific (cluster-level) weighted GEE method for incomplete longitudinal binary data.

Fitting Algorithm for Weighted GEE

The following fitting algorithm fits marginal models by using the observation-specific or the subject-specific weighted GEE method when the dropout process is missing at random:

  1. Fit a logistic regression to the data $(r_{ij}, \mathbf{z}_{ij})$ to obtain an estimate $\hat{\boldsymbol{\alpha}}$ of $\boldsymbol{\alpha}$, and estimate the weights.

  2. Compute an initial estimate of $\boldsymbol{\beta}$ by using an ordinary generalized linear model, assuming independence of the responses.

  3. Compute the working correlation matrix $\mathbf{R}$ based on the standardized residuals, the current estimate of $\boldsymbol{\beta}$, and the specified structure of $\mathbf{R}$.

  4. Compute the estimated covariance matrix:

     \[
     \mathbf{V}_i = \phi \mathbf{A}_i^{1/2} \hat{\mathbf{R}}(\boldsymbol{\alpha}) \mathbf{A}_i^{1/2}
     \]

  5. Update $\hat{\boldsymbol{\beta}}$:

     \[
     \hat{\boldsymbol{\beta}}_{r+1} = \hat{\boldsymbol{\beta}}_r + \left[ \sum_{i=1}^{K} \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \mathbf{V}_i^{-1} \frac{\partial \boldsymbol{\mu}_i}{\partial \boldsymbol{\beta}} \right]^{-1} \left[ \sum_{i=1}^{K} \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \mathbf{V}_i^{-1} \mathbf{W}_i \bigl( \mathbf{Y}_i - \boldsymbol{\mu}_i \bigr) \right]
     \]

     where $\mathbf{Y}_i$, $\boldsymbol{\mu}_i$, $\mathbf{V}_i$, and $\mathbf{W}_i$ are as follows:

     • For the observation-specific weighted method, $\mathbf{Y}_i = (y_{i1}, y_{i2}, \ldots, y_{iT})'$; $\boldsymbol{\mu}_i$ and $\mathbf{V}_i$ are its corresponding mean vector and working covariance matrix, respectively; and $\mathbf{W}_i$ is a $T \times T$ diagonal matrix whose $j$th diagonal element is $r_{ij} \hat{w}_{ij}$.

     • For the subject-specific weighted method, $\mathbf{Y}_i = (y_{i1}, y_{i2}, \ldots, y_{i n_i})'$; $\boldsymbol{\mu}_i$ and $\mathbf{V}_i$ are its corresponding mean vector and working covariance matrix, respectively; and $\mathbf{W}_i$ is an $n_i \times n_i$ diagonal matrix whose $j$th diagonal element is $\hat{w}_i$.

  6. Repeat steps 3–5 until convergence.
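In practice you do not program these steps yourself; specifying the MISSMODEL statement causes PROC GEE to carry them out. The following statements are a minimal sketch rather than a definitive template: the data set WORK.LONG, the analysis variables (id, time, y, trt), the lagged response prevy that is used as a missingness-model predictor, and the TYPE=OBSLEVEL option (assumed here to request the observation-specific method, with TYPE=SUBLEVEL assumed for the subject-specific method) are assumptions of this example.

   proc gee data=long;
      class id time;
      model y = trt time / dist=bin link=logit;
      repeated subject=id / within=time type=ind;
      missmodel prevy trt time / type=obslevel;
   run;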

Note that you can use the WEIGHT statement in the GENMOD procedure to perform a two-stage strategy that is often used in practice to obtain the weighted GEE estimates. You fit a logistic regression to the data $(r_{ij}, \mathbf{z}_{ij})$ to obtain the weights as described in the preceding steps. Then you estimate $\boldsymbol{\beta}$ by specifying the estimated weights in the WEIGHT statement in PROC GENMOD for the GEE analysis. For the subject-specific weighted GEE method, this approach is appropriate for any working correlation structure. However, for the observation-specific weighted method, this approach is appropriate only for the independent working correlation structure.
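For example, if the estimated observation-specific weights from the earlier sketch have been merged back into the analysis data as a variable w (a hypothetical name; first visits, assumed always observed, get w = 1), the second stage of this strategy might look like the following, with the independent working correlation structure that the observation-specific weights require:

   proc genmod data=longw;
      class id;
      model y = trt time / dist=bin link=logit;
      weight w;                         /* estimated weights, treated as fixed */
      repeated subject=id / type=ind;   /* independent working correlation     */
   run;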

The two-stage approach results in standard errors that are larger than those produced by using the MISSMODEL statement in the GEE procedure, because PROC GENMOD treats the estimated weights as fixed and known. Thus, the two-stage approach that uses PROC GENMOD results in conservative inference (Fitzmaurice, Laird, and Ware 2011). The GEE procedure computes the parameter estimate covariances as described in Fitzmaurice, Laird, and Ware (2011) and Preisser, Lohman, and Rathouz (2002).

Missing Data

Suppose that each subject in a longitudinal study is measured at $T$ times. In other words, for the $i$th subject you measure $T$ responses $(y_{i1}, y_{i2}, \ldots, y_{iT})$ and $T$ corresponding covariate vectors $(\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT})$.

By default, the GEE procedure handles missing data in the same manner as the standard GEE method in the GENMOD procedure. The working correlation matrix is estimated from data that contain both intermittent and dropout types of missing values by using the all-available-pairs method, in which all nonmissing pairs of data are used in the moment estimators. The resulting covariances and standard errors are valid under the missing completely at random (MCAR) assumption. For more information, see the section Missing Data in Chapter 51, The GENMOD Procedure.

When you specify the MISSMODEL statement in the GEE procedure to use the weighted GEE method to analyze the data, the procedure uses observations that have missing values in the response, provided that the missing values for all subjects are caused by dropouts. If the missing values are intermittent for any of the subjects, then the weighted GEE method does not apply and the procedure terminates.

For the observation-specific weighted GEE method, the covariates for all the observations for a subject must be observed, regardless of whether the response is missing. For each subject, the input data set must provide $T$ observations.

For the subject-specific weighted GEE method, the covariates for a subject who drops out at time $k$ must be observed for the observations up to and including time $k$. The input data set must provide at least $k$ observations for this subject. The covariates must be observed for all observations on a subject who completes the study, and the input data set must provide $T$ observations for this subject.
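For example, in a hypothetical study with T = 4 visits, a subject who drops out at visit 3 could contribute the following records. The observation-specific method requires all four records, with the missing responses coded as missing values and the covariates present; the subject-specific method requires at least the first three records.

   data layout;
      input id time y trt;
      datalines;
   101  1  0  1
   101  2  1  1
   101  3  .  1
   101  4  .  1
   ;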

For more information about how weighted GEE methods handle missing values, see Fitzmaurice, Laird, and Ware (2011) and Preisser, Lohman, and Rathouz (2002).
