The MCMC Procedure

Handling of Missing Data

PROC MCMC automatically augments missing values[40] via the use of the MODEL statement. PROC MCMC treats missing values as unknown parameters, assigns distributions to the variables, and incorporates the sampling of the missing data as part of the Markov chain.

You can use the MISSING= option in the PROC MCMC statement to specify how you want PROC MCMC to handle the missing values.

  • If you specify MISSING=CC (CC stands for complete cases), PROC MCMC discards all observations that have missing or partially missing values before carrying out the simulation.

  • If you specify MISSING=AC (AC stands for all cases), PROC MCMC neither discards any missing values nor augments them.

  • If you specify MISSING=CCMODELY, PROC MCMC treats missing response values as parameters and incorporates them as part of the simulation. But the procedure discards all observations that have missing covariates. A covariate is a data set variable that appears in the program but does not appear to the left of the tilde in the MODEL statement.

  • If you specify MISSING=ACMODELY, PROC MCMC samples missing responses without discarding observations that contain missing values in covariates.
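As a minimal sketch of where the option goes, the following program fits a simple normal regression with MISSING=CCMODELY. The data set name mydata, the variables y and x, the parameter names, and the priors are all hypothetical and are chosen only for illustration.

```sas
/* Sketch only: mydata, y, x, and the priors below are assumptions */
proc mcmc data=mydata missing=ccmodely seed=1 nmc=10000 outpost=post;
   parms beta0 0 beta1 0 s2 1;
   prior beta0 beta1 ~ normal(0, var=1e4);
   prior s2 ~ igamma(shape=2, scale=2);
   mu = beta0 + beta1 * x;
   /* missing values of y are sampled as parameters;       */
   /* observations with a missing x are discarded up front */
   model y ~ normal(mu, var=s2);
run;
```

Replacing CCMODELY with CC, AC, or ACMODELY selects one of the other behaviors listed above without any other change to the program.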

Generally speaking, there are three types of missing data models, as discussed by Rubin (1976). Also see Little and Rubin (2002) for a comprehensive treatment of missing data analysis. The rest of this section provides an overview of these three types of missing data models and explains how to use PROC MCMC to fit them.

Missing Completely at Random (MCAR)

Data are said to be MCAR if the probability of a missing value (or the failure to observe a value) does not depend on any other observations in the data set, regardless of whether they are observed or missing. That is, the observed and unobserved values are independent of each other: if $y_i$ is missing, it is MCAR if the probability of observing $y_i$ is independent of other $y_j$ (and other covariates $x_i$) in the data set. Under this assumption, both the observed and unobserved data are random samples of all the data; hence, fitting a model based only on the observed data does not introduce any biases. This type of analysis is called a complete-case analysis. To carry out a complete-case analysis, you must specify MISSING=CC in the PROC MCMC statement.
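To make the MCAR assumption concrete, the following DATA step sketches one way to simulate it. The data set name mcar_demo, the regression coefficients, and the 20% deletion rate are hypothetical. The point is only that the decision to delete a response uses a random draw that never looks at x or y.

```sas
/* Sketch only: simulated data with responses deleted completely at random */
data mcar_demo;
   call streaminit(2022);
   do i = 1 to 200;
      x = rand('normal');
      y = 1 + 2 * x + rand('normal');
      /* the deletion probability is a constant, independent of x and y */
      if rand('uniform') < 0.2 then y = .;
      output;
   end;
   drop i;
run;
```

Because the deleted rows are a random subset, a complete-case fit of such data uses fewer observations but targets the same regression as a fit of the full data would.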

Missing at Random (MAR)

Data are said to be MAR if the probability of a missing value can depend on some observed quantities but does not depend on any unobserved data. For example, suppose that $x_i$ are completely observed for all observations and some $y_i$ are missing. MAR states that the probability of observing $y_i$ is independent of other missing $y_i$ (values that could have been observed) and that it depends only on $x_i$ (and, potentially, observed $y_i$).

The MAR assumption states that the missing $y_i$ are no longer random samples and that they need to be modeled (via the likelihood specification of the missing values). At the same time, the assumption that missingness is independent of the unobserved quantities means that the missing mechanism (usually a binary indicator variable such that $r_i = 1$ if $y_i$ is missing and $r_i = 0$ otherwise) can be ignored and does not need to be taken into account. Hence, MAR is sometimes referred to as ignorably missing. It is not the missing values that can be ignored; it is the missing mechanism that can be ignored.
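The reason the mechanism can be ignored under MAR can be sketched as follows. The notation is an assumption made for this illustration: $\alpha$ denotes the parameters of the response model, $\beta$ denotes the parameters of the missing mechanism, and the two sets of parameters are taken to be distinct. The likelihood of what is actually observed integrates over the missing responses:

```latex
\begin{align*}
f(\mathbf{y}_{obs}, \mathbf{r} \mid \mathbf{x}, \alpha, \beta)
  &= \int f(\mathbf{y}_{obs}, \mathbf{y}_{mis} \mid \mathbf{x}, \alpha)\,
          f(\mathbf{r} \mid \mathbf{y}_{obs}, \mathbf{y}_{mis}, \mathbf{x}, \beta)\,
          d\mathbf{y}_{mis} \\
  &= f(\mathbf{r} \mid \mathbf{y}_{obs}, \mathbf{x}, \beta)
     \int f(\mathbf{y}_{obs}, \mathbf{y}_{mis} \mid \mathbf{x}, \alpha)\,
          d\mathbf{y}_{mis}
     && \text{(MAR: no dependence on } \mathbf{y}_{mis}\text{)} \\
  &= f(\mathbf{r} \mid \mathbf{y}_{obs}, \mathbf{x}, \beta)\,
     f(\mathbf{y}_{obs} \mid \mathbf{x}, \alpha)
\end{align*}
```

The factor that involves $\mathbf{r}$ does not contain $\alpha$, so under these assumptions it acts as a constant when you make inferences about the response model.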

By default, PROC MCMC treats the missing data as MAR (this assumes that you do not input a binary indicator variable $r_i$ and model it specifically): each missing value becomes an extra parameter and PROC MCMC updates it in every iteration. PROC MCMC assumes that both the missing values and observed values arise from the same distribution (which is specified in the MODEL statement),

$$\mathbf{y} = \{\mathbf{y}_{obs}, \mathbf{y}_{mis}\} \sim f(\mathbf{y} \mid \theta)$$

where $\mathbf{y}$ consists of observed ($\mathbf{y}_{obs}$) and missing ($\mathbf{y}_{mis}$) values, and $f(\mathbf{y} \mid \theta)$ is the likelihood function with parameters $\theta$.
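Treating the missing values as parameters means that the Markov chain targets a joint posterior over the model parameters and the missing values. A sketch of that target follows, where $\pi(\theta)$ denotes the prior distribution (a symbol introduced here for illustration):

```latex
\begin{align*}
\pi(\theta, \mathbf{y}_{mis} \mid \mathbf{y}_{obs})
  &\propto \pi(\theta)\, f(\mathbf{y}_{obs}, \mathbf{y}_{mis} \mid \theta) \\
\pi(\theta \mid \mathbf{y}_{obs})
  &= \int \pi(\theta, \mathbf{y}_{mis} \mid \mathbf{y}_{obs})\, d\mathbf{y}_{mis}
\end{align*}
```

In a sampling context, the integral in the second line corresponds to keeping the draws of $\theta$ and disregarding the accompanying draws of $\mathbf{y}_{mis}$.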

You can use the MODEL statement to model missing covariates. Using multiple MODEL statements enables you to specify, for example, a marginal distribution for missing values in covariate x and a conditional distribution for the response variable y given x as follows:

model x ~ normal(alpha, var=s2_x);
model y ~ normal(beta * x, var=s2_y);

In each iteration, PROC MCMC draws samples for every missing value in variable x, then every missing value in variable y, conditional on the drawn values of the x variable.
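For context, the two MODEL statements above might sit in a complete program such as the following sketch. The data set name mydata, the initial values, and the prior distributions are hypothetical, and the program relies on the default handling of missing values that this section describes.

```sas
/* Sketch only: mydata, initial values, and priors are assumptions */
proc mcmc data=mydata seed=1 nmc=20000 outpost=post;
   parms alpha 0 beta 0;
   parms s2_x 1 s2_y 1;
   prior alpha beta ~ normal(0, var=1e4);
   prior s2_x s2_y ~ igamma(shape=2, scale=2);
   /* marginal model for x: supplies a distribution for missing x values */
   model x ~ normal(alpha, var=s2_x);
   /* conditional model for y given x (observed or currently drawn) */
   model y ~ normal(beta * x, var=s2_y);
run;
```

Because x appears to the left of the tilde in a MODEL statement, it is treated as a modeled variable rather than as a covariate in the sense defined earlier in this section.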

Missing Not at Random (MNAR)

Data are said to be MNAR if the probability of a missing value depends on unobserved data (or data that could have been observed): the probability that $y_i$ is missing depends on the missing values of other $y_i$. This is a very general scenario that assumes that the missing mechanism is no longer ignorable (it is sometimes referred to as nonignorably missing) and that a model for the missing mechanism is required in order to make correct inferences about the model parameters.

Let $\mathbf{R} = (r_1, \ldots, r_n)$ be the missing value indicator for $\mathbf{Y} = (y_1, \ldots, y_n)$, where $r_i = 1$ if $y_i$ is missing and $r_i = 0$ otherwise. This $\mathbf{R}$ is usually part of an input data set where you preprocess the response variable and create this missing value indicator variable. Modeling MNAR data implies that you must specify a joint likelihood function over $\mathbf{R}$ and $\mathbf{Y}$: $f(\mathbf{R}, \mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$, where $\mathbf{X}$ represents the covariates and $\boldsymbol{\theta}$ represents the model parameters. This joint distribution can be factored in two ways: a pattern-mixture model and a selection model.

The selection model factors the joint distribution of $\mathbf{R}$ and $\mathbf{Y}$ into a marginal distribution for $\mathbf{Y}$ and a conditional distribution for $\mathbf{R}$,

$$f(\mathbf{R}, \mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta}) \propto f(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\alpha}) \cdot f(\mathbf{R} \mid \mathbf{Y}, \mathbf{X}, \boldsymbol{\beta})$$

where $\boldsymbol{\theta} = (\boldsymbol{\alpha}, \boldsymbol{\beta})$, $f(\mathbf{R} \mid \mathbf{Y}, \mathbf{X}, \boldsymbol{\beta})$ is usually a binary model with a logit or probit link that involves regression parameters $\boldsymbol{\beta}$, and $f(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\alpha})$ is the sampling distribution that generates $y_i$ with model parameters $\boldsymbol{\alpha}$.
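A selection model of this form might be sketched in PROC MCMC as follows. This is an illustration under stated assumptions, not a program taken from the procedure's examples: the data set mydata, the variables y, x, and r, the parameter names a0, a1, b0, and b1, and the priors are all hypothetical. The a parameters belong to the response model and the b parameters belong to the missing mechanism.

```sas
/* Sketch only: data set, variables, parameter names, and priors are assumptions */
proc mcmc data=mydata seed=1 nmc=20000 outpost=post_sel;
   parms a0 0 a1 0;      /* parameters of the response model     */
   parms b0 0 b1 0;      /* parameters of the missing mechanism  */
   prior a0 a1 b0 b1 ~ normal(0, var=100);
   mu = a0 + a1 * x;
   model y ~ poisson(exp(mu));    /* marginal distribution for y */
   pi = logistic(b0 + b1 * y);
   model r ~ binary(pi);          /* distribution for r given y  */
run;
```

The key design choice is that the response y, including its currently drawn missing values, enters the model for r. That dependence is what makes the missing mechanism nonignorable in this sketch.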

The pattern-mixture model factors the joint distribution the opposite way, into a marginal distribution for $\mathbf{R}$ and a conditional distribution for $\mathbf{Y}$,

$$f(\mathbf{R}, \mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta}) \propto f(\mathbf{R} \mid \mathbf{X}, \boldsymbol{\gamma}) \cdot f(\mathbf{Y} \mid \mathbf{R}, \mathbf{X}, \boldsymbol{\delta})$$

where $\boldsymbol{\theta} = (\boldsymbol{\gamma}, \boldsymbol{\delta})$.

You can use PROC MCMC to fit either model by specifying multiple MODEL statements: one for the marginal distribution and one for the conditional distribution. Suppose that the variable r is the missing value indicator, which is modeled using a logit model, and that the response variable y follows a Poisson regression that includes the missing value indicator as one of its covariates. The following statements are part of a PROC MCMC program that fits a pattern-mixture model:

pi = logistic(alpha * x1);
model r ~ binary(pi);
mu = beta0 + beta1 * x2 + beta3 * r;
model y ~ poisson(exp(mu));

The first MODEL statement uses a binary model with a logit link to model the missing mechanism, and the second MODEL statement models the response variable with a Poisson regression that includes the missing value indicator as one of its covariates. Each of the two regressions has its own covariates and regression coefficients. If this hypothetical data set contained missing values in covariates x1 and x2, you could add two more MODEL statements to handle each variable as follows:

model x1 ~ normal(mu1, var=s2_x1);
pi = logistic(alpha * x1);
model r ~ binary(pi);
model x2 ~ normal(mu2, var=s2_x2);
mu = beta0 + beta1 * x2 + beta3 * r;
model y ~ poisson(exp(mu));
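The statements above are a fragment. A complete program also needs a PROC MCMC statement, PARMS statements that declare the parameters, and PRIOR statements for each of them. The following sketch shows one way to fill in those pieces. The data set name mydata, the initial values, and the prior distributions are hypothetical.

```sas
/* Sketch only: mydata, initial values, and priors are assumptions */
proc mcmc data=mydata seed=1 nmc=20000 outpost=post_pm;
   parms alpha 0 beta0 0 beta1 0 beta3 0;
   parms mu1 0 mu2 0;
   parms s2_x1 1 s2_x2 1;
   prior alpha beta0 beta1 beta3 mu1 mu2 ~ normal(0, var=100);
   prior s2_x1 s2_x2 ~ igamma(shape=2, scale=2);
   model x1 ~ normal(mu1, var=s2_x1);    /* missing values in x1  */
   pi = logistic(alpha * x1);
   model r ~ binary(pi);                 /* missing mechanism     */
   model x2 ~ normal(mu2, var=s2_x2);    /* missing values in x2  */
   mu = beta0 + beta1 * x2 + beta3 * r;
   model y ~ poisson(exp(mu));           /* response given r      */
run;
```

Each MODEL statement for a covariate is placed before the first statement that uses that covariate, which mirrors the order of the fragment above.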


[40] A missing value is usually, although not necessarily, represented by a single period (.) in the input data set.

Last updated: March 08, 2022