The GEE Procedure

Weighted Generalized Estimating Equations under the MAR Assumption

In longitudinal studies, response measurements are often missing because of skipped visits or dropouts. Let $r_{ij}$ be the indicator that the response $y_{ij}$ is observed, where $r_{ij} = 1$ if $y_{ij}$ is observed and $r_{ij} = 0$ otherwise. Missing data patterns can be classified into two types: dropout and intermittent. A dropout occurs if an individual skips a particular visit and then never comes back for subsequent visits; that is, if $r_{ij} = 0$, then $r_{ik} = 0$ for all $k > j$. Otherwise, the missing data pattern is intermittent. Intermittent patterns can be quite complicated; only dropout patterns are considered here.
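For example, a DATA step such as the following can flag whether each subject's missingness pattern is dropout or intermittent. This is a minimal sketch: the data set WORK.LONG and its variables id, time, and y are hypothetical, and the data are assumed to be sorted by id and time.

   data pattern;
      set long;
      by id;
      retain seenMissing intermittent;
      if first.id then do;
         seenMissing = 0;
         intermittent = 0;
      end;
      if missing(y) then seenMissing = 1;
      else if seenMissing then intermittent = 1;   /* an observed response after a gap */
      if last.id then output;                      /* one record per subject           */
      keep id intermittent;
   run;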

The missingness mechanism can be described by a statistical model for the probability that a response is missing, and making the right assumption about this mechanism is crucial for methods that handle missing data. Missingness mechanisms are classified into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin 1976).

Assumptions about longitudinal data that include missing responses caused by dropouts are classified as follows:

  • The data are said to be MCAR if the probability of a missing response is independent of its past, current, and future responses, conditional on the covariates. That is, $P(r_{ij} = 0 \mid \mathbf{Y}_i, \mathbf{X}_i) = P(r_{ij} = 0 \mid \mathbf{X}_i)$.

  • The data are said to be MAR if the probability of a missing response is independent of its current and future responses, conditional on the observed past responses and the covariates. That is, $P(r_{ij} = 0 \mid r_{i,j-1} = 1, \mathbf{X}_i, \mathbf{Y}_i) = P(r_{ij} = 0 \mid r_{i,j-1} = 1, \mathbf{X}_i, y_{i1}, \ldots, y_{i,j-1})$. MAR is a weaker assumption than MCAR.

  • The data are said to be MNAR if the probability of a missing response depends on the unobserved responses. MNAR is the most general and the most problematic missing-data scenario.

The GEE procedure implements two different weighted methods (observation-specific and subject-specific) of estimating the regression parameter $\boldsymbol{\beta}$ when dropouts occur. Both methods provide consistent estimates if the data are MAR. The weighted GEE methods are not supported for the multinomial distribution for polytomous responses.

Observation-Specific Weighted GEE Method

Suppose $w_{ij}$ is the weight for $y_{ij}$, defined as the inverse probability of observing $y_{ij}$; that is, $w_{ij} = P(r_{ij} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1}$. Let $\mathbf{W}_i$ be the $T \times T$ diagonal matrix whose $j$th diagonal element is $r_{ij} w_{ij}$. The responses for the $i$th subject are $\mathbf{Y}_i = (y_{i1}, y_{i2}, \ldots, y_{iT})'$. Consider the following weighted generalized estimating equations (Robins and Rotnitzky 1995; Preisser, Lohman, and Rathouz 2002):

\[
\mathbf{S}_{ow}(\boldsymbol{\beta}) = \sum_{i=1}^{K} \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \, \mathbf{V}_i^{-1} \mathbf{W}_i \bigl( \mathbf{Y}_i - \boldsymbol{\mu}_i(\boldsymbol{\beta}) \bigr) = \mathbf{0}
\]

Unlike the standard generalized estimating equations, the weighted generalized estimating equations are unbiased when the observations are appropriately weighted, and they lead to consistent estimates of $\boldsymbol{\beta}$.

The weights $w_{ij}$ are often unknown in practice and are estimated by a logistic regression model under the MAR assumption. Specifically, let $\lambda_{ij} = P(r_{ij} = 1 \mid r_{i,j-1} = 1, \mathbf{X}_i, \mathbf{Y}_i)$ denote the probability of observing the response $y_{ij}$, given that its previous response is observed.

Under the MAR assumption,

\[
\lambda_{ij} = P(r_{ij} = 1 \mid r_{i,j-1} = 1, \mathbf{X}_i, \mathbf{Y}_i) = P(r_{ij} = 1 \mid r_{i,j-1} = 1, \mathbf{X}_i, y_{i1}, \ldots, y_{i,j-1})
\]

Using the observed data, $\lambda_{ij}$ can be predicted from a logistic regression model,

\[
\mathrm{logit}\{\lambda_{ij}\} = \mathbf{z}_{ij} \boldsymbol{\alpha}
\]

where the $\mathbf{z}_{ij}$ are predictors that usually include the covariates $\mathbf{x}_{ij}$, the past responses, and indicators for the visit times. The dropout process implies that the estimated probability of observing $y_{ij}$ can be expressed as a cumulative product of conditional probabilities:

\[
\hat{P}(r_{ij} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i) = \lambda_{i1}(\hat{\boldsymbol{\alpha}}) \times \lambda_{i2}(\hat{\boldsymbol{\alpha}}) \times \cdots \times \lambda_{ij}(\hat{\boldsymbol{\alpha}})
\]

With the estimated weights $\hat{w}_{ij} = \hat{P}(r_{ij} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1}$ plugged in, the regression parameter $\boldsymbol{\beta}$ is estimated by solving $\mathbf{S}_{ow}(\boldsymbol{\beta}) = \mathbf{0}$. The fitting algorithm is described in the section Fitting Algorithm for Weighted GEE.
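The following statements sketch this weight construction outside the GEE procedure. They are a minimal illustration, not PROC GEE's internal implementation: the data set WORK.LONG, its variables (id, time, y, and a covariate trt), and the choice of missingness-model predictors (the lagged response, trt, and time) are all hypothetical, and the first visit is assumed to be always observed (so its weight is 1).

   /* Records that contribute to the missingness model: visits whose
      previous response was observed (dropout-only missingness)           */
   data design;
      set long;
      by id;
      r = (not missing(y));    /* response indicator r_ij                 */
      prevy = lag(y);          /* previous response                       */
      prevr = lag(r);          /* previous response indicator             */
      if first.id then do;
         prevy = .;
         prevr = .;
      end;
   run;

   /* Fit logit(lambda_ij) = z_ij * alpha                                  */
   proc logistic data=design;
      where prevr = 1;
      model r(event='1') = prevy trt time;
      output out=lampred p=lambda;
   run;

   /* The cumulative product of the lambda_ij estimates P(r_ij=1 | X_i, Y_i);
      its inverse is the observation-specific weight w_ij. The weight that
      is computed at the dropout visit itself is never used, because
      r_ij = 0 there.                                                      */
   proc sort data=lampred;
      by id time;
   run;

   data obsweights;
      set lampred;
      by id;
      retain cumprob;
      if first.id then cumprob = 1;
      cumprob = cumprob * lambda;
      w = 1 / cumprob;
      keep id time w;
   run;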

Subject-Specific Weighted GEE Method

Unlike the observation-specific weighted method, which assigns a separate weight to each observation, the subject-specific weighted method assigns a single weight to each subject; in other words, all the observations from a subject receive the same weight. Specifically, the subject-specific weighted method obtains the regression parameter estimates by solving the equations

\[
\mathbf{S}_{sw}(\boldsymbol{\beta}) = \sum_{i=1}^{K} \mathbf{D}_i' \mathbf{V}_i^{-1} w_i \bigl( \mathbf{Y}_i - \boldsymbol{\mu}_i(\boldsymbol{\beta}) \bigr) = \mathbf{0}
\]

where $\mathbf{D}_i = \partial \boldsymbol{\mu}_i(\boldsymbol{\beta}) / \partial \boldsymbol{\beta}$, the responses for the $i$th subject are $\mathbf{Y}_i = (y_{i1}, y_{i2}, \ldots, y_{i n_i})'$, and the weight $w_i$ for subject $i$ is the inverse probability of subject $i$ dropping out at the observed time (Fitzmaurice, Molenberghs, and Lipsitz 1995; Preisser, Lohman, and Rathouz 2002). Note that the weight $w_i$ is a scalar, in contrast to the weight matrix $\mathbf{W}_i$ that the observation-specific weighted GEE method uses.

The subject-specific weighted estimating equations are also unbiased when the subjects are appropriately weighted, and they lead to consistent estimates of the regression parameter $\boldsymbol{\beta}$.

The weight $w_i$ is usually unknown in practice and needs to be estimated. Suppose subject $i$ drops out at time $m_i = \sum_{j=1}^{T} r_{ij} + 1$. Assume that the first visit $y_{i1}$ is always observed, so that $r_{i1} = 1$. Thus the dropout times $m_i$ range from 2 to $T + 1$, where a dropout time of $T + 1$ indicates that subject $i$ completes all $T$ visits and does not drop out.

The weight $w_i$ is defined as follows: if subject $i$ drops out before completing the last visit (that is, $m_i \le T$), then $w_i = P(r_{i m_i} = 0,\, r_{i, m_i - 1} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1}$; otherwise, subject $i$ completes all $T$ visits (that is, $m_i = T + 1$) and $w_i = P(r_{iT} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1}$.

As in the observation-specific weighted method, the dropout process implies that the subject-specific weights can be estimated as a cumulative product of conditional probabilities:

\[
\hat{w}_i =
\begin{cases}
P(r_{i m_i} = 0,\, r_{i, m_i - 1} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1} = \bigl[ \lambda_{i1}(\hat{\boldsymbol{\alpha}}) \times \cdots \times \lambda_{i, m_i - 1}(\hat{\boldsymbol{\alpha}}) \times \bigl( 1 - \lambda_{i m_i}(\hat{\boldsymbol{\alpha}}) \bigr) \bigr]^{-1}, & \text{if } m_i \le T \\
P(r_{i, m_i - 1} = 1 \mid \mathbf{X}_i, \mathbf{Y}_i)^{-1} = \bigl[ \lambda_{i1}(\hat{\boldsymbol{\alpha}}) \times \lambda_{i2}(\hat{\boldsymbol{\alpha}}) \times \cdots \times \lambda_{i, m_i - 1}(\hat{\boldsymbol{\alpha}}) \bigr]^{-1}, & \text{if } m_i = T + 1
\end{cases}
\]

Thus, the subject-specific weights $\hat{w}_i$ can be obtained after $\lambda_{ij}$ is estimated by fitting a logistic regression to the data $(r_{ij}, \mathbf{z}_{ij})$.
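Continuing the earlier sketch (the data set WORK.LAMPRED, its predicted probability lambda, and the observation indicator r remain hypothetical), a DATA step like the following accumulates the product over each subject's retained visits and inverts it at the subject's last retained record, which is the dropout visit (where r = 0) for a dropout and visit T for a completer:

   proc sort data=lampred;
      by id time;
   run;

   data subweights;
      set lampred;
      by id;
      retain prob;
      if first.id then prob = 1;            /* lambda_i1 is taken to be 1   */
      if r = 1 then prob = prob * lambda;   /* subject stayed at this visit */
      else prob = prob * (1 - lambda);      /* subject dropped out here     */
      if last.id then do;
         w = 1 / prob;                      /* subject-specific weight w_i  */
         output;
      end;
      keep id w;
   run;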

The regression parameter $\boldsymbol{\beta}$ for the subject-specific weighted GEE method can be estimated by solving $\mathbf{S}_{sw}(\boldsymbol{\beta}) = \mathbf{0}$ after plugging in the estimated weights. The fitting algorithm is described in the section Fitting Algorithm for Weighted GEE. The subject-specific weighting scheme was originally developed for computational convenience. However, Preisser, Lohman, and Rathouz (2002) showed that the observation-specific (observation-level) weighted GEE method produces more efficient estimates than the subject-specific (cluster-level) weighted GEE method for incomplete longitudinal binary data.

Fitting Algorithm for Weighted GEE

The following fitting algorithm fits marginal models by using the observation-specific or the subject-specific weighted GEE method when the dropout process is missing at random:

  1. Fit a logistic regression to the data $(r_{ij}, \mathbf{z}_{ij})$ to obtain an estimate $\hat{\boldsymbol{\alpha}}$ of $\boldsymbol{\alpha}$, and estimate the weights.

  2. Compute an initial estimate of $\boldsymbol{\beta}$ by using an ordinary generalized linear model, assuming independence of the responses.

  3. Compute the working correlation matrix $\mathbf{R}$ based on the standardized residuals, the current estimate of $\boldsymbol{\beta}$, and the specified structure of $\mathbf{R}$.

  4. Compute the estimated covariance matrix:

     \[
     \mathbf{V}_i = \phi \mathbf{A}_i^{1/2} \hat{\mathbf{R}}(\boldsymbol{\alpha}) \mathbf{A}_i^{1/2}
     \]

  5. Update $\hat{\boldsymbol{\beta}}$:

     \[
     \hat{\boldsymbol{\beta}}_{r+1} = \hat{\boldsymbol{\beta}}_r + \left[ \sum_{i=1}^{K} \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \mathbf{V}_i^{-1} \frac{\partial \boldsymbol{\mu}_i}{\partial \boldsymbol{\beta}} \right]^{-1} \left[ \sum_{i=1}^{K} \frac{\partial \boldsymbol{\mu}_i'}{\partial \boldsymbol{\beta}} \mathbf{V}_i^{-1} \mathbf{W}_i \bigl( \mathbf{Y}_i - \boldsymbol{\mu}_i \bigr) \right]
     \]

     where $\mathbf{Y}_i$, $\boldsymbol{\mu}_i$, $\mathbf{V}_i$, and $\mathbf{W}_i$ are as follows:

     • For the observation-specific weighted method, $\mathbf{Y}_i = (y_{i1}, y_{i2}, \ldots, y_{iT})'$; $\boldsymbol{\mu}_i$ and $\mathbf{V}_i$ are its corresponding mean vector and working covariance matrix, respectively; and $\mathbf{W}_i$ is a $T \times T$ diagonal matrix whose $j$th diagonal element is $r_{ij} \hat{w}_{ij}$.

     • For the subject-specific weighted method, $\mathbf{Y}_i = (y_{i1}, y_{i2}, \ldots, y_{i n_i})'$; $\boldsymbol{\mu}_i$ and $\mathbf{V}_i$ are its corresponding mean vector and working covariance matrix, respectively; and $\mathbf{W}_i$ is an $n_i \times n_i$ diagonal matrix whose $j$th diagonal element is $\hat{w}_i$.

  6. Repeat steps 3–5 until convergence.
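In practice you do not program these steps yourself; specifying the MISSMODEL statement causes PROC GEE to carry them out. The following statements are a minimal sketch rather than a definitive template: the data set WORK.LONG, the analysis variables (id, time, y, trt), the lagged response prevy that is used as a missingness-model predictor, and the TYPE=OBSLEVEL option (assumed here to request the observation-specific method, with TYPE=SUBLEVEL assumed for the subject-specific method) are assumptions of this example.

   proc gee data=long;
      class id time;
      model y = trt time / dist=bin link=logit;
      repeated subject=id / within=time type=ind;
      missmodel prevy trt time / type=obslevel;
   run;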

Note that you can use the WEIGHT statement in the GENMOD procedure to perform a two-stage strategy that is often used in practice to obtain the weighted GEE estimates. You fit a logistic regression to the data $(r_{ij}, \mathbf{z}_{ij})$ to obtain the weights as described in the preceding steps. Then you estimate $\boldsymbol{\beta}$ by specifying the estimated weights in the WEIGHT statement in PROC GENMOD for the GEE analysis. For the subject-specific weighted GEE method, this approach is appropriate for any working correlation structure. However, for the observation-specific weighted method, this approach is appropriate only for the independent working correlation structure.
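For example, if the estimated observation-specific weights from the earlier sketch have been merged back into the analysis data as a variable w (a hypothetical name; first visits, assumed always observed, get w = 1), the second stage of this strategy might look like the following, with the independent working correlation structure that the observation-specific weights require:

   proc genmod data=longw;
      class id;
      model y = trt time / dist=bin link=logit;
      weight w;                         /* estimated weights, treated as fixed */
      repeated subject=id / type=ind;   /* independent working correlation     */
   run;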

The two-stage approach results in standard errors that are larger than those produced by using the MISSMODEL statement in the GEE procedure, because PROC GENMOD treats the estimated weights as fixed and known. Thus, the two-stage approach that uses PROC GENMOD results in conservative inference (Fitzmaurice, Laird, and Ware 2011). The GEE procedure computes the parameter estimate covariances as described in Fitzmaurice, Laird, and Ware (2011) and Preisser, Lohman, and Rathouz (2002).

Missing Data

Suppose that each subject in a longitudinal study is measured at $T$ times. In other words, for the $i$th subject you measure $T$ responses $(y_{i1}, y_{i2}, \ldots, y_{iT})$ and $T$ corresponding covariate vectors $(\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT})$.

By default, the GEE procedure handles missing data in the same manner as the standard GEE method in the GENMOD procedure. The working correlation matrix is estimated from data that contain both intermittent and dropout types of missing values by using the all-available-pairs method, in which all nonmissing pairs of data are used in the moment estimators. The resulting covariances and standard errors are valid under the missing completely at random (MCAR) assumption. For more information, see the section Missing Data in Chapter 51, The GENMOD Procedure.

When you specify the MISSMODEL statement in the GEE procedure to use the weighted GEE method to analyze the data, the procedure uses observations that have missing values in the response, provided that the missing values for all subjects are caused by dropouts. If the missing values are intermittent for any of the subjects, then the weighted GEE method does not apply and the procedure terminates.

For the observation-specific weighted GEE method, the covariates for all the observations for a subject must be observed, regardless of whether the response is missing. For each subject, the input data set must provide $T$ observations.

For the subject-specific weighted GEE method, the covariates for a subject who drops out at time $k$ must be observed for the observations up to and including time $k$. The input data set must provide at least $k$ observations for this subject. The covariates must be observed for all observations on a subject who completes the study, and the input data set must provide $T$ observations for this subject.
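For example, in a hypothetical study with T = 4 visits, a subject who drops out at visit 3 could contribute the following records. The observation-specific method requires all four records, with the missing responses coded as missing values and the covariates present; the subject-specific method requires at least the first three records.

   data layout;
      input id time y trt;
      datalines;
   101  1  0  1
   101  2  1  1
   101  3  .  1
   101  4  .  1
   ;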

For more information about how weighted GEE methods handle missing values, see Fitzmaurice, Laird, and Ware (2011) and Preisser, Lohman, and Rathouz (2002).
