The SURVEYIMPUTE Procedure

Two-Stage Fully Efficient Fractional Imputation

The two-stage fully efficient fractional imputation method uses multiple donor units for a recipient unit. Each donor donates a fraction of the original weight of the recipient unit such that the sum of the fractional weights from all the donors is equal to the original weight of the recipient. The fraction of the recipient weight that a donor unit contributes to the recipient unit is known as the fractional weight. The method is called fully efficient because it does not introduce additional variability that is caused by the selection of donor units (Kim and Fuller 2004). Two-stage FEFI has two hierarchical imputation stages. One disadvantage of the two-stage FEFI method is that it can greatly increase the size of the imputed data set.

Two-stage FEFI is useful when you want to impute variables that have many unique observed values. FEFI creates many imputed rows if the variable that you are imputing has many unique observed values. Two-stage FEFI imputes these variables conditional on the imputed levels from the first-stage FEFI. Thus, two-stage FEFI often uses fewer imputed rows compared to a similar first-stage FEFI.

Variables that have many observed levels are grouped into imputation bins. The first-stage imputation is performed for all categorical variables by using the FEFI method. The categorical variables include the character variables, the CLASS variables that you also specify in the VAR statement, and the variables that contain the imputation bins of the continuous variables.

The second-stage imputation is performed for the continuous variables within each first-stage donor cell. Observations that contain missing values in any of the continuous items are the recipients, and observations that contain observed values for these missing items are the donors. The second-stage donor cells are defined by the unique combinations of the observed values for the continuous variables within the first-stage donor cells.

Imputation-adjusted replicate weights are computed by repeating both the first-stage and second-stage imputation in every replicate sample independently.

The method is similar to that of Im, Kim, and Fuller (2015).

Two-Stage FEFI Algorithm

Suppose you want to impute $P$ items jointly. Let $\mathbf{X}_i = (X_{i1}, \ldots, X_{iP_1})$ be the response for $P_1$ items in unit $i$, and let $\mathbf{Y}_i = (Y_{i1}, \ldots, Y_{iP_2})$ be the response for $P_2$ items in unit $i$, where $P = P_1 + P_2$. Let $X_{ij}$ be categorical with $J$ levels for item $j$, and let $Y_{ij}$ be continuous. Further assume that $\tilde{\mathbf{Y}} = (\tilde{Y}_{i1}, \ldots, \tilde{Y}_{iP_2})$ contains the discretized levels (imputation bins) for Y, where $\tilde{Y}_{ij}$ has $J$ levels. Define $Z_{ij} = (X_{ij}, \tilde{Y}_{ij})$. Then $Z_{ij}$ is categorical and has $J$ levels for item $j$. Denote $\mathbf{Z}_{i,\mathrm{obs}}$ as the observed part and $\mathbf{Z}_{i,\mathrm{miss}}$ as the missing part of $\mathbf{Z}_i$.

Let $\pi(\kappa_1 \kappa_2 \cdots \kappa_P)$ be the population proportion that falls in category $\kappa_1 \kappa_2 \cdots \kappa_P$. Assume that it is possible to estimate the population categories from the observed sample. For example, the conditional probability, $P(Z_{i1} = \kappa_1, Z_{i2} = \kappa_2 \mid Z_{i3} = \kappa_3, \ldots, Z_{iP} = \kappa_P)$, is the same for the observed data as it is for the data where $(Z_{i1} = \kappa_1, Z_{i2} = \kappa_2)$ are missing. The initial conditional probabilities are estimated by

$$\hat{P}(Z_{i1} = \kappa_1, Z_{i2} = \kappa_2 \mid Z_{i3} = \kappa_3, \ldots, Z_{iP} = \kappa_P) = \left\{ \sum_{\kappa_1 \kappa_2} \hat{\pi}(\kappa_1 \kappa_2 \cdots \kappa_P) \right\}^{-1} \hat{\pi}(\kappa_1 \kappa_2 \cdots \kappa_P)$$

where

$$\hat{\pi}(\kappa_1 \kappa_2 \cdots \kappa_P) = \left\{ \sum_i w_i \right\}^{-1} \sum_i w_i \, I(Z_{i1} = \kappa_1, \ldots, Z_{iP} = \kappa_P)$$

is the estimated joint probability, $I(\cdot)$ is an indicator function, and $w_i$ is the observation weight for unit $i$.
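The two estimators above amount to a weighted cell count followed by a renormalization. The following Python sketch illustrates that arithmetic under simplifying assumptions: there are only two categorical items, so the conditioning set is the single item $Z_{i2}$, and the cell counts and unit weights are illustrative values. The function names `joint_proportions` and `conditional_on` are hypothetical and are not part of PROC SURVEYIMPUTE.

```python
from collections import defaultdict


def joint_proportions(levels, weights):
    """Weighted share of each observed category combination (pi-hat)."""
    total = sum(weights)
    pi_hat = defaultdict(float)
    for z, w in zip(levels, weights):
        pi_hat[z] += w / total
    return dict(pi_hat)


def conditional_on(pi_hat, given):
    """P-hat(first item | remaining items == given).

    Keeps the cells whose trailing items match `given` and renormalizes
    their joint proportions so that they sum to 1.
    """
    matching = {z: p for z, p in pi_hat.items() if z[1:] == given}
    denom = sum(matching.values())
    return {z[0]: p / denom for z, p in matching.items()}


# Illustrative complete cases: (Z1, Z2) levels, every unit has weight 1.
levels = [(0, 0)] * 4 + [(0, 1)] * 3 + [(1, 0)] * 4 + [(1, 1)] * 4
weights = [1.0] * len(levels)

pi_hat = joint_proportions(levels, weights)
cond = conditional_on(pi_hat, given=(1,))

print(pi_hat)  # joint proportions for the four cells
print(cond)    # distribution of Z1 among units with Z2 = 1
```

With unequal survey weights, only the `weights` list changes; the renormalization step is the same.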

Let $l = 1, 2, \ldots, M_{il_1}$ be all the observed combinations of $\mathbf{Z}_{k:\,k \notin i,\,\mathrm{miss}}$ in the sample. Let $\mathbf{Z}_{i,\mathrm{miss}[l]}$ be the $l$th realization of $\mathbf{Z}_{i,\mathrm{miss}}$ in the sample. You must assume that at least one realization is available; otherwise, the observation is not imputed.

The two-stage FEFI method first computes the fully efficient fractional weights by using an EM-by-weighting algorithm like that of Kim and Fuller (2013) to impute the missing values in $\mathbf{Z}_i$. The missing values in $\mathbf{Y}_i$ are imputed in the second-stage imputation. Two-stage FEFI weights for imputing $\mathbf{Y}_i$ are computed independently in every imputed level of $\mathbf{Z}_{i,\mathrm{miss}[l]}$, where $l = 1, \ldots, M_{l_1}$ and $M_{l_1}$ is the number of first-stage donor cells.

The following steps describe the two-stage FEFI technique. If you do not use the CELL statement to specify imputation cells, PROC SURVEYIMPUTE uses the entire data set as one imputation cell. If you specify imputation cells, then all the probabilities are computed by using observations from the same imputation cell as the recipient unit. To simplify notation, subscripts are not used for imputation cells in the following description. Imputation cells are defined for the first-stage imputation. Steps 1 to 4 describe the first-stage FEFI for the categorical variables, which also include the imputation bins for the continuous variables. Step 5 describes the second-stage FEFI.

  1. Initialization: For each observation that has missing items, determine the number of first-stage donor cells. The first-stage donor cells are determined by using the number of unique combinations of observed levels in $\mathbf{Z}_i$ for imputing the missing items in $\mathbf{Z}_i$. Only the responding units in the imputation cell are used to determine the number of first-stage donor cells. Compute the initial fractional weight from donor cell $l$ to unit $i$, $w_{il(0)}$, by

    $$w_{il(0)} = \left\{ \sum_{k=1}^{M_{il_1}} \tilde{\pi}_{(0)}(\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[k]}) \right\}^{-1} \tilde{\pi}_{(0)}(\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[l]})$$

    where $l = 1, 2, \ldots, M_{l_1}$, $M_{l_1}$ is the number of first-stage donor cells, and

    $$\tilde{\pi}_{(0)}(\kappa_1, \ldots, \kappa_P) = \left\{ \sum_{i \in \mathcal{A}_c} w_i \right\}^{-1} \sum_{i \in \mathcal{A}_c} w_i \, I(Z_{i1} = \kappa_1, \ldots, Z_{iP} = \kappa_P)$$

    The sum of the fractional weights over all the donor cells is 1 for every observation unit; that is, $\sum_l w_{il(0)} = 1$ for all $i$. The $l$th imputed row for unit $i$ is created by keeping the observed items unchanged, replacing the missing items with the observed levels from the $l$th donor cell, and computing the fractional weight by $w_i w_{il(0)}$. Only the complete observations (observations that have no missing items) are used to compute the fractional weights in this step. If unit $i$ has no missing items, then $w_{i1(0)} = 1$. The initial FEFI data set contains all the observed units, the imputed rows for observations that have missing items, and the corresponding fractional weights.

  2. M-step: The $t$th maximization step (M-step) computes the joint probabilities by using the fractional weights from the $(t-1)$th expectation step (E-step),

    $$\tilde{\pi}_{(t)}(\kappa_1, \ldots, \kappa_P) = \left\{ \sum_i \sum_l w_i w_{il(t-1)} \right\}^{-1} \sum_i \sum_l w_i w_{il(t-1)} \, I(Z_{i1} = \kappa_1, \ldots, Z_{iP} = \kappa_P)$$

    for all $i$, all $l$, and $t > 0$. Note that for $t > 0$, $\tilde{\pi}_{(t)}$ uses all observation units, including observations that have missing items that are imputed in the initialization step.

  3. E-step: The $t$th expectation step (E-step) computes the fractional weights by using the joint probabilities $\tilde{\pi}_{(t)}(\kappa_1, \ldots, \kappa_P)$ from the $t$th M-step. The $t$th fractional weight for unit $i$ and donor cell $l$ is given by

    $$w_{il(t)} = \left\{ \sum_{k=1}^{M_{il_1}} \tilde{\pi}_{(t)}(\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[k]}) \right\}^{-1} \tilde{\pi}_{(t)}(\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[l]})$$

  4. Repetition: The expectation maximization steps (EM-steps, steps 2 and 3) are repeated for $t = 1, 2, \ldots$, until the changes in fractional weights over all observation units between two successive EM-steps are negligible or the maximum number of EM repetitions is reached.

    The maximum absolute difference convergence criterion, $\epsilon_{\mathrm{AD}}$, at step $t$ is defined as

    $$\max_{i,l} \left| w_{il(t)} - w_{il(t-1)} \right| \le \epsilon_{\mathrm{AD}}$$

    The maximum absolute relative difference convergence criterion, $\epsilon_{\mathrm{RD}}$, at step $t$ is defined as

    $$\max_{i,l} \left| w_{il(t)} - w_{il(t-1)} \right| / w_{il(t-1)} \le \epsilon_{\mathrm{RD}}$$

    where $w_{il(t-1)} > 0$.

  5. Second-stage imputation: The second-stage imputation replaces the missing values in the continuous variables by using the observed values within each selected first-stage donor cell. This step is similar to step 1 but is applied to impute the continuous variables.

    For a particular observation unit $i$, let $\mathbf{Z}_{i,\mathrm{dcell}[l_1]} = (\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[l_1]})$ be the $l_1$th donor cell from the first-stage imputation, where $l_1$ ranges from 1 to $M_{l_1}$. For each observation unit $i$, the possible number of second-stage donor cells is equal to the number of unique combinations of the observed levels for the missing items in $\mathbf{Y}_i$ from the responding units in the first-stage donor cell $l_1$.

    Let $\pi_{2 \mid l_1}(\kappa_{1 \mid l_1} \kappa_{2 \mid l_1} \cdots \kappa_{P_2 \mid l_1})$ be the population proportion that falls in category $\kappa_{1 \mid l_1} \kappa_{2 \mid l_1} \cdots \kappa_{P_2 \mid l_1}$. Assume that it is possible to estimate the population categories from the observed sample. For example, the conditional probability, $P(Y_{i1 \mid l_1} = \kappa_{1 \mid l_1}, Y_{i2 \mid l_1} = \kappa_{2 \mid l_1} \mid Y_{i3 \mid l_1} = \kappa_{3 \mid l_1}, \ldots, Y_{iP_2 \mid l_1} = \kappa_{P_2 \mid l_1})$, is the same for the observed data as it is for the data in which $(Y_{i1 \mid l_1} = \kappa_{1 \mid l_1}, Y_{i2 \mid l_1} = \kappa_{2 \mid l_1})$ are missing. The conditional probabilities are estimated by

    $$\begin{aligned} &\hat{P}(Y_{i1 \mid l_1} = \kappa_{1 \mid l_1}, Y_{i2 \mid l_1} = \kappa_{2 \mid l_1} \mid Y_{i3 \mid l_1} = \kappa_{3 \mid l_1}, \ldots, Y_{iP_2 \mid l_1} = \kappa_{P_2 \mid l_1}) = \\ &\quad \left\{ \sum_{\kappa_{1 \mid l_1} \kappa_{2 \mid l_1}} \hat{\pi}_{2 \mid l_1}(\kappa_{1 \mid l_1} \kappa_{2 \mid l_1} \cdots \kappa_{P_2 \mid l_1}) \right\}^{-1} \hat{\pi}_{2 \mid l_1}(\kappa_{1 \mid l_1} \kappa_{2 \mid l_1} \cdots \kappa_{P_2 \mid l_1}) \end{aligned}$$

    where

    $$\hat{\pi}_{2 \mid l_1}(\kappa_{1 \mid l_1} \kappa_{2 \mid l_1} \cdots \kappa_{P_2 \mid l_1}) = \left\{ \sum_{i \in \mathbf{Z}_{i,\mathrm{dcell}[l_1]}} w_i \right\}^{-1} \sum_{i \in \mathbf{Z}_{i,\mathrm{dcell}[l_1]}} w_i \, I(Y_{i1 \mid l_1} = \kappa_{1 \mid l_1}, \ldots, Y_{iP_2 \mid l_1} = \kappa_{P_2 \mid l_1})$$

    is the estimated joint probability, $I(\cdot)$ is an indicator function, and $w_i$ is the observation weight for unit $i$.

    Let $l = 1, 2, \ldots, M_{il_2 \mid l_1}$ be all the observed combinations of $\mathbf{Y}_{k:\,k \in \mathbf{Z}_{i,\mathrm{dcell}[l_1]},\,k \ne i,\,\mathrm{miss}}$ in the sample. Let $\mathbf{Y}_{i,\mathrm{miss}[l]}$ be the $l$th realization of $\mathbf{Y}_{i,\mathrm{miss}}$ in the sample. You must assume that at least one realization is available; otherwise, missing values in the continuous items for the observation are not imputed.

    Compute the second-stage fractional weight from the second-stage donor cell $l_2$ conditional on the first-stage donor cell $l_1$ for unit $i$, $w_{il_2 \mid l_1}$:

    $$w_{il_2 \mid l_1} = \left\{ \sum_{k=1}^{M_{il_2 \mid l_1}} \tilde{\pi}_{2 \mid l_1}(\mathbf{Y}_{i,\mathrm{obs}}, \mathbf{Y}_{i,\mathrm{miss}[k]}) \right\}^{-1} \tilde{\pi}_{2 \mid l_1}(\mathbf{Y}_{i,\mathrm{obs}}, \mathbf{Y}_{i,\mathrm{miss}[l_2]})$$

    where $l_2 = 1, 2, \ldots, M_{il_2 \mid l_1}$, $M_{il_2 \mid l_1}$ is the number of second-stage donor cells, and

    $$\tilde{\pi}_{2 \mid l_1}(\kappa_1, \ldots, \kappa_{P_2}) = \left\{ \sum_{i \in \mathbf{Z}_{i,\mathrm{dcell}[l_1]}} w_i \right\}^{-1} \sum_{i \in \mathbf{Z}_{i,\mathrm{dcell}[l_1]}} w_i \, I(Y_{i1 \mid l_1} = \kappa_1, \ldots, Y_{iP_2 \mid l_1} = \kappa_{P_2})$$

    The sum of the second-stage fractional weights over all second-stage donor cells is 1 for every observation unit; that is, $\sum_{l_2} w_{il_2 \mid l_1} = 1$ for all $l_1$ and $i$. The $l_2$th second-stage imputed row in the $l_1$th first-stage imputed row for unit $i$ is created by keeping the observed items unchanged, replacing the missing items in $\mathbf{Y}_i$ with the observed values from the $l_2$th second-stage donor cell, and computing the two-stage fractional weight by $w_{il_1 l_2} = w_{il_1} w_{il_2 \mid l_1}$, where $w_{il_1}$ is the first-stage fractional weight for the first-stage donor cell $l_1$. The maximum number of donor cells for unit $i$ is $M_{il_1} M_{il_2 \mid l_1}$. Only the complete observations are used to compute the second-stage fractional weights.
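The two stopping rules in step 4 compare the fractional weights from successive EM-steps. The following Python sketch shows one way to evaluate both criteria for a flat list of fractional weights. The weight values and the tolerance are illustrative assumptions, not defaults of PROC SURVEYIMPUTE, and the function names are hypothetical.

```python
def max_abs_diff(w_new, w_old):
    """Largest absolute change in any fractional weight between two EM-steps."""
    return max(abs(a - b) for a, b in zip(w_new, w_old))


def max_rel_diff(w_new, w_old):
    """Largest relative change, using only weights whose previous value is > 0."""
    ratios = [abs(a - b) / b for a, b in zip(w_new, w_old) if b > 0]
    return max(ratios) if ratios else 0.0


# Illustrative fractional weights from two successive EM-steps.
w_prev = [0.5714, 0.4286, 1.0]
w_curr = [0.5500, 0.4500, 1.0]

eps = 1e-5  # assumed tolerance, for illustration only
ad = max_abs_diff(w_curr, w_prev)
rd = max_rel_diff(w_curr, w_prev)
print(ad, ad <= eps)
print(rd, rd <= eps)
```

The relative criterion is stricter for small weights, because the same absolute change is divided by a smaller previous value.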

The imputation-adjusted replicate weights are created by using the following:

  1. The first-stage FEFI is repeated by using replicate weights in every replicate sample.

  2. The second-stage replicate fractional weights for the kth replicate are computed by using the estimated joint probabilities from the kth replicate sample:

    $$\hat{\pi}^{(k)}_{2 \mid l_1}(\kappa_{1 \mid l_1} \kappa_{2 \mid l_1} \cdots \kappa_{P_2 \mid l_1}) = \left\{ \sum_{i \in \mathbf{Z}_{i,\mathrm{dcell}[l_1]}} w_i^{(k)} \right\}^{-1} \sum_{i \in \mathbf{Z}_{i,\mathrm{dcell}[l_1]}} w_i^{(k)} \, I(Y_{i1 \mid l_1} = \kappa_{1 \mid l_1}, \ldots, Y_{iP_2 \mid l_1} = \kappa_{P_2 \mid l_1})$$

    where $w_i^{(k)}$ are the unadjusted replicate weights.
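To see how a replicate sample changes the second-stage probabilities, the following Python sketch evaluates the formula above twice for one first-stage donor cell: once with the full-sample weights and once with replicate weights. It uses the Y values of the four donors in the cell CX=0, CY=0 from the example later in this section. The replicate is an assumption made only for illustration: a delete-one jackknife replicate that drops the third donor and multiplies the remaining weights by $n/(n-1)$ with $n = 18$. The replicate weights that the procedure actually uses depend on its jackknife coefficients.

```python
def second_stage_probs(y_values, weights):
    """Weighted share of each distinct donor value within one first-stage donor cell."""
    total = sum(weights)
    probs = {}
    for y, w in zip(y_values, weights):
        if w > 0:  # a donor with zero replicate weight contributes nothing
            probs[y] = probs.get(y, 0.0) + w / total
    return probs


# Donor Y values in the cell CX=0, CY=0 (units 1 to 4 of the example data).
y_donors = [-0.54, -0.77, -0.59, -0.59]

full_weights = [1.0, 1.0, 1.0, 1.0]

# Assumed delete-one jackknife replicate: the third donor is removed and
# the remaining weights are scaled by n / (n - 1).
n = 18
scale = n / (n - 1)
replicate_weights = [scale, scale, 0.0, scale]

full = second_stage_probs(y_donors, full_weights)
rep = second_stage_probs(y_donors, replicate_weights)
print(full)
print(rep)
```

Because the probabilities are ratios of weight sums, the common scale factor cancels. In this replicate the share for Y=-0.59 drops from 1/2 to 1/3 only because one of its two donors is removed.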

Example of Two-Stage FEFI

The small data set shown in Figure 11 is used to illustrate the imputation technique. The data set contains 18 observation units, and each unit has four items (X, CX, Y, and CY). The variable Unit contains the observation identification. Variables CX and CY contain the imputation bins for variables X and Y, respectively. In this example, X and CX are missing for units 14 and 18, and Y and CY are missing for units 5 and 18.

Figure 11: Sample Data with Missing Items

Unit X CX Y CY
1 0.3 0 -0.54 0
2 0.2 0 -0.77 0
3 1.7 0 -0.59 0
4 1.7 0 -0.59 0
5 1.0 0 . .
6 1.8 0 -0.03 1
7 2.0 0 0.95 1
8 1.9 0 0.78 1
9 6.7 1 -0.15 0
10 6.0 1 -1.01 0
11 3.3 1 -1.86 0
12 7.3 1 -0.21 0
13 6.7 1 0.80 1
14 . . 1.23 1
15 2.9 1 0.65 1
16 9.6 1 0.95 1
17 10.0 1 0.13 1
18 . . . .


The following statements request joint imputation of X and Y by using the two-stage FEFI method. Two CLEVVAR= options specify variables CX and CY, which contain the imputation bins for variables X and Y, respectively. The following statements also request imputation-adjusted replicate weights for the jackknife replication method. The OUTPUT statement stores the imputed values in the Imputed data set and stores the jackknife coefficients in the OJKC data set. The FRACTIONALWEIGHTS= option in the OUTPUT statement saves the fractional weights in the Imputed data set.

proc surveyimpute data=Example method=fefi;
   var X (clevvar=CX) Y (clevvar=CY);
   output out=Imputed fractionalweights=FracWt outjkcoefs=OJKC;
run;

The first-stage FEFI imputes the imputation bin variables CX and CY by using the FEFI method. The imputed data set after the first-stage imputation is displayed in Figure 12. Variables X and Y are not imputed in the first-stage imputation.

Figure 12: First-Stage Fractional Imputation

Unit ImpIndex ImpWt FracWt X CX Y CY
1 0 1.0000 1.0000 0.3 0 -0.54 0
2 0 1.0000 1.0000 0.2 0 -0.77 0
3 0 1.0000 1.0000 1.7 0 -0.59 0
4 0 1.0000 1.0000 1.7 0 -0.59 0
5 1 0.5360 0.5360 1.0 0 . 0
5 2 0.4640 0.4640 1.0 0 . 1
6 0 1.0000 1.0000 1.8 0 -0.03 1
7 0 1.0000 1.0000 2.0 0 0.95 1
8 0 1.0000 1.0000 1.9 0 0.78 1
9 0 1.0000 1.0000 6.7 1 -0.15 0
10 0 1.0000 1.0000 6.0 1 -1.01 0
11 0 1.0000 1.0000 3.3 1 -1.86 0
12 0 1.0000 1.0000 7.3 1 -0.21 0
13 0 1.0000 1.0000 6.7 1 0.80 1
14 1 0.4640 0.4640 . 0 1.23 1
14 2 0.5360 0.5360 . 1 1.23 1
15 0 1.0000 1.0000 2.9 1 0.65 1
16 0 1.0000 1.0000 9.6 1 0.95 1
17 0 1.0000 1.0000 10.0 1 0.13 1
18 1 0.2668 0.2668 . 0 . 0
18 2 0.2310 0.2310 . 0 . 1
18 3 0.2353 0.2353 . 1 . 0
18 4 0.2668 0.2668 . 1 . 1


The first-stage FEFI is described as follows:

  • Observation unit 1 has no missing value. Therefore, the ImpIndex value is 0; the FracWt value is 1; and the values of X, CX, Y, and CY are the same as the observed values for observation unit 1 in Figure 12. Because all observation units have a weight of 1, the fractional weights (FracWt) and the imputation-adjusted weights (ImpWt) are the same for all rows.

  • Observation unit 5 has missing values in Y and CY. In the first stage, only CY is imputed conditional on the observed level of CX. The observed level of CX for observation unit 5 is 0. For CX=0, two levels for CY are observed: CY=0 and CY=1. Therefore, observation unit 5 receives two donor cells (ImpIndex=1 and ImpIndex=2). The fractional weights for these two donor cells are computed by applying FEFI on variables CX and CY. For more information about FEFI, see the section Example of FEFI. The fractional weights after the first-stage imputation are 0.536 and 0.464. Because CX is observed, CX values in both rows for observation unit 5 are the same as the observed value. However, the first recipient row for observation unit 5 has an imputed CY value of 0, the second recipient row for observation unit 5 has an imputed CY value of 1, and each of these rows has a corresponding fractional weight. Because no imputation is performed for Y in the first stage, both rows for observation unit 5 contain missing values for Y.

  • Observation unit 14 has missing values in X and CX. In the first stage, only CX is imputed conditional on the observed level of CY. The observed level of CY for unit 14 is 1. For CY=1, two levels for CX are observed: CX=0 and CX=1. Therefore, observation unit 14 receives two donor cells (ImpIndex=1 and ImpIndex=2). The fractional weights for these two donor cells are computed by applying FEFI on variables CX and CY. For more information about FEFI, see the section Example of FEFI. The fractional weights after the first-stage imputation are 0.464 and 0.536. Because CY is observed, CY values in both rows for unit 14 are the same as the observed value. However, the first recipient row for unit 14 has an imputed CX value of 0, the second recipient row for unit 14 has an imputed CX value of 1, and each of these rows has a corresponding fractional weight. Because no imputation is performed for X in the first stage, both rows for unit 14 contain missing values for X.

  • Observation unit 18 has missing values in all four variables: X, CX, Y, and CY. Only variables CX and CY are imputed in the first stage. From the observed data, CX and CY can take the following values: (CX=0, CY=0), (CX=0, CY=1), (CX=1, CY=0), and (CX=1, CY=1). The four imputed rows (ImpIndex 1, ImpIndex 2, ImpIndex 3, and ImpIndex 4) for observation unit 18 represent the four observed combinations for CX and CY along with their fractional weights. The fractional weights for these four donor cells are computed by applying FEFI on variables CX and CY. For more information about FEFI, see the section Example of FEFI.
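The first-stage weights described above can be approximated with a short EM-by-weighting loop over the bin variables CX and CY. The following Python sketch is an illustration of steps 1 to 4 under simplifying assumptions: one imputation cell, unit observation weights, and an assumed absolute-difference tolerance. The function name `fefi_first_stage` and its arguments are hypothetical. This is not the procedure's implementation, and the last printed digit can differ from Figure 12 because it depends on the convergence settings.

```python
def fefi_first_stage(rows, weights, tol=1e-10, max_iter=1000):
    """EM-by-weighting sketch; rows are tuples of levels with None for missing."""
    complete = [(r, w) for r, w in zip(rows, weights) if None not in r]
    cells = sorted({r for r, _ in complete})

    def donor_cells(row):
        # Observed combinations that agree with the row on its observed items.
        return [c for c in cells
                if all(v is None or v == cv for v, cv in zip(row, c))]

    def e_step(pi):
        fractions = []
        for row in rows:
            cand = donor_cells(row)
            denom = sum(pi[c] for c in cand)
            fractions.append({c: pi[c] / denom for c in cand})
        return fractions

    def m_step(fractions):
        pi = {c: 0.0 for c in cells}
        for frac, w in zip(fractions, weights):
            for c, f in frac.items():
                pi[c] += w * f
        total = sum(pi.values())
        return {c: p / total for c, p in pi.items()}

    # Step 1: initial probabilities from the complete observations only.
    pi = {c: 0.0 for c in cells}
    for r, w in complete:
        pi[r] += w
    total = sum(pi.values())
    pi = {c: p / total for c, p in pi.items()}
    fractions = e_step(pi)

    # Steps 2 to 4: alternate M-step and E-step until the weights stabilize.
    for _ in range(max_iter):
        pi = m_step(fractions)
        updated = e_step(pi)
        change = max((abs(updated[i][c] - fractions[i][c])
                      for i in range(len(rows)) for c in updated[i]),
                     default=0.0)
        fractions = updated
        if change <= tol:
            break
    return fractions


# (CX, CY) for units 1 to 18 of the example data; None marks a missing bin.
rows = ([(0, 0)] * 4 + [(0, None)] + [(0, 1)] * 3 + [(1, 0)] * 4
        + [(1, 1), (None, 1)] + [(1, 1)] * 3 + [(None, None)])
weights = [1.0] * len(rows)

fractions = fefi_first_stage(rows, weights)
for unit in (5, 14, 18):
    print(unit, {c: round(f, 4) for c, f in fractions[unit - 1].items()})
```

Units with complete bins keep a single donor cell with fractional weight 1, which corresponds to the rows with ImpIndex=0 in Figure 12.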

The second-stage FEFI imputes the missing values in X and Y conditional on the imputed levels for imputation bin variables CX and CY from the first-stage imputation. The imputed data set after the second-stage imputation is displayed in Figure 13. Variables X and Y are imputed in the second stage.

Figure 13: Two-Stage Fractional Imputation

Unit ImpIndex ImpWt FracWt X CX Y CY
1 0 1.0000 1.0000 0.3 0 -0.54 0
2 0 1.0000 1.0000 0.2 0 -0.77 0
3 0 1.0000 1.0000 1.7 0 -0.59 0
4 0 1.0000 1.0000 1.7 0 -0.59 0
5 1 0.1340 0.1340 1.0 0 -0.77 0
5 2 0.1340 0.1340 1.0 0 -0.54 0
5 3 0.2680 0.2680 1.0 0 -0.59 0
5 4 0.1547 0.1547 1.0 0 -0.03 1
5 5 0.1547 0.1547 1.0 0 0.78 1
5 6 0.1547 0.1547 1.0 0 0.95 1
6 0 1.0000 1.0000 1.8 0 -0.03 1
7 0 1.0000 1.0000 2.0 0 0.95 1
8 0 1.0000 1.0000 1.9 0 0.78 1
9 0 1.0000 1.0000 6.7 1 -0.15 0
10 0 1.0000 1.0000 6.0 1 -1.01 0
11 0 1.0000 1.0000 3.3 1 -1.86 0
12 0 1.0000 1.0000 7.3 1 -0.21 0
13 0 1.0000 1.0000 6.7 1 0.80 1
14 1 0.1547 0.1547 1.8 0 1.23 1
14 2 0.1547 0.1547 1.9 0 1.23 1
14 3 0.1547 0.1547 2.0 0 1.23 1
14 4 0.1340 0.1340 2.9 1 1.23 1
14 5 0.1340 0.1340 6.7 1 1.23 1
14 6 0.1340 0.1340 9.6 1 1.23 1
14 7 0.1340 0.1340 10.0 1 1.23 1
15 0 1.0000 1.0000 2.9 1 0.65 1
16 0 1.0000 1.0000 9.6 1 0.95 1
17 0 1.0000 1.0000 10.0 1 0.13 1
18 1 0.0667 0.0667 0.2 0 -0.77 0
18 2 0.0667 0.0667 0.3 0 -0.54 0
18 3 0.1334 0.1334 1.7 0 -0.59 0
18 4 0.0770 0.0770 1.8 0 -0.03 1
18 5 0.0770 0.0770 1.9 0 0.78 1
18 6 0.0770 0.0770 2.0 0 0.95 1
18 7 0.0588 0.0588 3.3 1 -1.86 0
18 8 0.0588 0.0588 6.0 1 -1.01 0
18 9 0.0588 0.0588 6.7 1 -0.15 0
18 10 0.0588 0.0588 7.3 1 -0.21 0
18 11 0.0667 0.0667 2.9 1 0.65 1
18 12 0.0667 0.0667 6.7 1 0.80 1
18 13 0.0667 0.0667 9.6 1 0.95 1
18 14 0.0667 0.0667 10.0 1 0.13 1
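
Because the second-stage fractions within a first-stage donor cell sum to 1, summing the FracWt values in Figure 13 within each (CX, CY) combination should recover the first-stage weight of that cell. The following Python sketch illustrates this for unit 18; it is not PROC SURVEYIMPUTE code, and the names `unit18_rows` and `first_stage` are hypothetical. The rows are transcribed from Figure 13, so the totals carry the figure's four-decimal rounding.

```python
from collections import defaultdict

# Rows for unit 18 transcribed from Figure 13: (FracWt, CX, CY).
unit18_rows = [
    (0.0667, 0, 0), (0.0667, 0, 0), (0.1334, 0, 0),
    (0.0770, 0, 1), (0.0770, 0, 1), (0.0770, 0, 1),
    (0.0588, 1, 0), (0.0588, 1, 0), (0.0588, 1, 0), (0.0588, 1, 0),
    (0.0667, 1, 1), (0.0667, 1, 1), (0.0667, 1, 1), (0.0667, 1, 1),
]

# Sum the two-stage fractional weights within each first-stage cell (CX, CY).
first_stage = defaultdict(float)
for frac_wt, cx, cy in unit18_rows:
    first_stage[(cx, cy)] += frac_wt

for cell, weight in sorted(first_stage.items()):
    print(cell, round(weight, 4))
```

To two decimals, the totals for the cells (CX=0, CY=0) and (CX=0, CY=1) are 0.27 and 0.23.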


The second-stage FEFI is described as follows:

  • Observation unit 1 has no missing value. Therefore, the ImpIndex value is 0; the FracWt value is 1; and the values of X, CX, Y, and CY are the same as the observed values for observation unit 1 in Figure 13. Because all observation units have a weight of 1, the fractional weights (FracWt) and the imputation-adjusted weights (ImpWt) are the same for all rows.

  • Observation unit 5 has missing values in Y and CY. The variable CY has two imputed levels (0 and 1) from the first-stage imputation. The observed level of CX for observation unit 5 is 0.

    The row that contains Unit 5 and ImpIndex=1 in Figure 12 has CX=0 and CY=0. Units 1, 2, 3, and 4 in the complete data have CX=0 and CY=0. These units are possible donors for a missing Y when CX=0 and CY=0. The four donor units have three unique values for Y: –0.54, –0.59, and –0.77. These three unique values define three donor cells to impute Y when CX=0 and CY=0. The missing value in Y for CX=0 and CY=0 is replaced by all three possible observed values. Because the weight for the donor cell that is defined by Y=–0.59 is double the weight for the other two donor cells, the imputed row that contains Y = –0.59 is assigned a second-stage fractional weight of 1/2 and the other two rows are each assigned a second-stage fractional weight of 1/4. The second-stage fractional weight is then multiplied by the first-stage FEFI weight (0.54) for CX=0, CY=0, and ImpIndex=1 to obtain the two-stage FEFI weight.

    The row that contains Unit 5 and ImpIndex=2 in Figure 12 has CX=0 and CY=1. Units 6, 7, and 8 in the complete data have CX=0 and CY=1. These units are donors for missing Y when CX=0 and CY=1. The three donor units have three unique values for Y: –0.03, 0.95, and 0.78. These three unique values define three donor cells for imputing Y when CX=0 and CY=1. The missing value in Y for CX=0 and CY=1 is replaced by all three possible observed values. Because all three donor cells have equal weights, all three imputed rows are assigned a second-stage fractional weight of 1/3. The second-stage fractional weight is then multiplied by the first-stage FEFI weight (0.46) for CX=0, CY=1, and ImpIndex=2 to obtain the two-stage FEFI weight.

    Therefore, after two-stage FEFI, missing values in Y are replaced by six imputed values that have fractional weights proportional to the observed weighted frequencies of the second-stage donor cells conditional on the first-stage FEFI.

  • Observation unit 14 has missing values in X and CX. The variable CX has two imputed levels (0 and 1) from the first-stage imputation. The observed level for CY for unit 14 is 1.

    The row that contains Unit 14 and ImpIndex=1 in Figure 12 has CX=0 and CY=1. Units 6, 7, and 8 in the complete data have CX=0 and CY=1. These units are possible donors for missing X when CX=0 and CY=1. The three donor units have three unique values for X: 1.8, 1.9, and 2.0. These three unique values define three donor cells for imputing X when CX=0 and CY=1. The missing value in X for CX=0 and CY=1 is replaced by all three possible observed values. Because all three donor cells have equal weights, all three imputed rows are assigned a second-stage fractional weight of 1/3. The second-stage fractional weight is then multiplied by the first-stage FEFI weight (0.46) for CX=0, CY=1, and ImpIndex=1 to obtain the two-stage FEFI weight.

    The row that contains Unit 14 and ImpIndex=2 in Figure 12 has CX=1 and CY=1. Units 13, 15, 16, and 17 in the complete data have CX=1 and CY=1. These units are donors for missing X when CX=1 and CY=1. The four donor units have four unique values for X: 2.9, 6.7, 9.6, and 10.0. These four unique values define four donor cells for imputing X when CX=1 and CY=1. The missing value in X for CX=1 and CY=1 is replaced by all four possible observed values. Because all four donor cells have equal weights, all four imputed rows are assigned a second-stage fractional weight of 1/4. The second-stage fractional weight is then multiplied by the first-stage FEFI weight (0.54) for CX=1, CY=1, and ImpIndex=2 to obtain the two-stage FEFI weight.

    Therefore, after two-stage FEFI, missing values in X are replaced by seven imputed values that have fractional weights proportional to the observed weighted frequencies of the second-stage donor cells conditional on the first-stage FEFI.

  • Observation unit 18 has missing values in all four variables X, CX, Y, and CY. The variable CX has two imputed levels (0 and 1), and the variable CY has two imputed levels (0 and 1) from the first-stage imputation.

    The row that contains Unit 18 and ImpIndex=1 in Figure 12 has CX=0 and CY=0. Units 1, 2, 3, and 4 in the complete data have CX=0 and CY=0. These units are possible donors for missing X and Y when CX=0 and CY=0. The four donor units have three unique values for (X, Y): (0.3, –0.54), (0.2, –0.77), and (1.7, –0.59). These three unique values define three donor cells for imputing (X, Y) when CX=0 and CY=0. The missing value in (X, Y) for CX=0 and CY=0 is replaced by all three possible observed values. Because the weight for the donor cell that is defined by (X, Y) = (1.7, –0.59) is double the weight for the other two donor cells, the imputed row that contains X = 1.7 and Y = –0.59 is assigned a second-stage fractional weight of 1/2 and the other two rows are each assigned a second-stage fractional weight of 1/4. The second-stage fractional weight is then multiplied by the first-stage FEFI weight (0.27) for CX=0, CY=0, and ImpIndex=1 to obtain the two-stage FEFI weight.

    The row that contains Unit 18 and ImpIndex=2 in Figure 12 has CX=0 and CY=1. Units 6, 7, and 8 in the complete data have CX=0 and CY=1. These units are donors for the missing (X, Y) when CX=0 and CY=1. The three donor units have three unique values for (X, Y): (1.8, –0.03), (1.9, 0.78), and (2.0, 0.95). These three unique values define three donor cells for imputing (X, Y) when CX=0 and CY=1. The missing value in (X, Y) for CX=0 and CY=1 is replaced by all three possible observed values. Because all three donor cells have equal weights, all three imputed rows are assigned a second-stage fractional weight of 1/3. The second-stage fractional weight is then multiplied by the first-stage FEFI weight (0.23) for CX=0, CY=1, and ImpIndex=2 to obtain the two-stage FEFI weight.

    Missing values in (X, Y) in the rows that contain Unit 18 with ImpIndex=3 and ImpIndex=4 in Figure 12 are imputed similarly.

    Thus, after two-stage FEFI, missing values in (X, Y) are replaced by 14 imputed values that have fractional weights proportional to the observed weighted frequencies of the second-stage donor cells conditional on the first-stage FEFI.
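
The second-stage computation that this list describes can be sketched in Python as follows. This is an illustration of the arithmetic only, not the procedure's implementation; the names `DONORS` and `second_stage_fractions` are hypothetical. The donor values are transcribed from the fully observed rows of Figure 13, every unit is assumed to have a weight of 1 as in this example, and the first-stage weight for unit 5 is taken as 0.536 (the sum of the three matching FracWt values in Figure 13, which the text rounds to 0.54).

```python
from collections import defaultdict
from fractions import Fraction

# Fully observed units from Figure 13: unit -> (X, CX, Y, CY), each with weight 1.
DONORS = {
    1: (0.3, 0, -0.54, 0), 2: (0.2, 0, -0.77, 0), 3: (1.7, 0, -0.59, 0),
    4: (1.7, 0, -0.59, 0), 6: (1.8, 0, -0.03, 1), 7: (2.0, 0, 0.95, 1),
    8: (1.9, 0, 0.78, 1), 9: (6.7, 1, -0.15, 0), 10: (6.0, 1, -1.01, 0),
    11: (3.3, 1, -1.86, 0), 12: (7.3, 1, -0.21, 0), 13: (6.7, 1, 0.80, 1),
    15: (2.9, 1, 0.65, 1), 16: (9.6, 1, 0.95, 1), 17: (10.0, 1, 0.13, 1),
}

def second_stage_fractions(cx, cy, impute):
    """Return {donor value(s): fraction} for the first-stage cell (cx, cy).

    impute is a tuple drawn from "X" and "Y" that names the items to impute.
    Each unique combination of donor values defines one second-stage donor cell.
    """
    totals = defaultdict(int)
    for x, donor_cx, y, donor_cy in DONORS.values():
        if (donor_cx, donor_cy) == (cx, cy):
            key = tuple({"X": x, "Y": y}[name] for name in impute)
            totals[key] += 1  # every unit has weight 1 in this example
    cell_total = sum(totals.values())
    return {key: Fraction(w, cell_total) for key, w in totals.items()}

# Unit 5, first-stage cell CX=0, CY=0: only Y is missing.
fractions_00 = second_stage_fractions(0, 0, ("Y",))

# Assumed first-stage weight for this cell (0.1340 + 0.1340 + 0.2680 in Figure 13).
first_stage_weight = 0.536
two_stage = {
    key: round(first_stage_weight * float(frac), 4)
    for key, frac in fractions_00.items()
}
print(two_stage)
```

Calling `second_stage_fractions(1, 1, ("X",))` gives the four equal fractions used for unit 14, and `second_stage_fractions(0, 0, ("X", "Y"))` gives the joint (X, Y) donor cells used for unit 18.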

The resulting data set has 42 rows: 15 rows for fully observed units (ImpIndex = 0), 6 rows for unit 5, 7 rows for unit 14, and 14 rows for unit 18. The sum of the fractional weights is 1 for every unit. The imputation-adjusted replicate weights are computed by applying the first-stage and second-stage imputations independently in each replicate sample, following the steps in the previous list. The imputed data set, along with the first four imputation-adjusted replicate weights, is displayed in Figure 14.

Figure 14: Two-Stage Fractional Imputation with the Imputation-Adjusted Replicate Weights

Unit ImpIndex ImpWt FracWt X CX Y CY ImpRepWt_1 ImpRepWt_2 ImpRepWt_3 ImpRepWt_4
1 0 1.0000 1.0000 0.3 0 -0.54 0 0 1.0588 1.0588 1.0588
2 0 1.0000 1.0000 0.2 0 -0.77 0 1.0588 0 1.0588 1.0588
3 0 1.0000 1.0000 1.7 0 -0.59 0 1.0588 1.0588 0 1.0588
4 0 1.0000 1.0000 1.7 0 -0.59 0 1.0588 1.0588 1.0588 0
5 1 0.1340 0.1340 1.0 0 -0.77 0 0.1637 0 0.1637 0.1637
5 2 0.1340 0.1340 1.0 0 -0.54 0 0 0.1637 0.1637 0.1637
5 3 0.2680 0.2680 1.0 0 -0.59 0 0.3274 0.3274 0.1637 0.1637
5 4 0.1547 0.1547 1.0 0 -0.03 1 0.1893 0.1893 0.1893 0.1893
5 5 0.1547 0.1547 1.0 0 0.78 1 0.1893 0.1893 0.1893 0.1893
5 6 0.1547 0.1547 1.0 0 0.95 1 0.1893 0.1893 0.1893 0.1893
6 0 1.0000 1.0000 1.8 0 -0.03 1 1.0588 1.0588 1.0588 1.0588
7 0 1.0000 1.0000 2.0 0 0.95 1 1.0588 1.0588 1.0588 1.0588
8 0 1.0000 1.0000 1.9 0 0.78 1 1.0588 1.0588 1.0588 1.0588
9 0 1.0000 1.0000 6.7 1 -0.15 0 1.0588 1.0588 1.0588 1.0588
10 0 1.0000 1.0000 6.0 1 -1.01 0 1.0588 1.0588 1.0588 1.0588
11 0 1.0000 1.0000 3.3 1 -1.86 0 1.0588 1.0588 1.0588 1.0588
12 0 1.0000 1.0000 7.3 1 -0.21 0 1.0588 1.0588 1.0588 1.0588
13 0 1.0000 1.0000 6.7 1 0.80 1 1.0588 1.0588 1.0588 1.0588
14 1 0.1547 0.1547 1.8 0 1.23 1 0.1656 0.1656 0.1656 0.1656
14 2 0.1547 0.1547 1.9 0 1.23 1 0.1656 0.1656 0.1656 0.1656
14 3 0.1547 0.1547 2.0 0 1.23 1 0.1656 0.1656 0.1656 0.1656
14 4 0.1340 0.1340 2.9 1 1.23 1 0.1405 0.1405 0.1405 0.1405
14 5 0.1340 0.1340 6.7 1 1.23 1 0.1405 0.1405 0.1405 0.1405
14 6 0.1340 0.1340 9.6 1 1.23 1 0.1405 0.1405 0.1405 0.1405
14 7 0.1340 0.1340 10.0 1 1.23 1 0.1405 0.1405 0.1405 0.1405
15 0 1.0000 1.0000 2.9 1 0.65 1 1.0588 1.0588 1.0588 1.0588
16 0 1.0000 1.0000 9.6 1 0.95 1 1.0588 1.0588 1.0588 1.0588
17 0 1.0000 1.0000 10.0 1 0.13 1 1.0588 1.0588 1.0588 1.0588
18 1 0.0667 0.0667 0.2 0 -0.77 0 0.0764 0 0.0764 0.0764
18 2 0.0667 0.0667 0.3 0 -0.54 0 0 0.0764 0.0764 0.0764
18 3 0.1334 0.1334 1.7 0 -0.59 0 0.1528 0.1528 0.0764 0.0764
18 4 0.0770 0.0770 1.8 0 -0.03 1 0.0884 0.0884 0.0884 0.0884
18 5 0.0770 0.0770 1.9 0 0.78 1 0.0884 0.0884 0.0884 0.0884
18 6 0.0770 0.0770 2.0 0 0.95 1 0.0884 0.0884 0.0884 0.0884
18 7 0.0588 0.0588 3.3 1 -1.86 0 0.0662 0.0662 0.0662 0.0662
18 8 0.0588 0.0588 6.0 1 -1.01 0 0.0662 0.0662 0.0662 0.0662
18 9 0.0588 0.0588 6.7 1 -0.15 0 0.0662 0.0662 0.0662 0.0662
18 10 0.0588 0.0588 7.3 1 -0.21 0 0.0662 0.0662 0.0662 0.0662
18 11 0.0667 0.0667 2.9 1 0.65 1 0.0750 0.0750 0.0750 0.0750
18 12 0.0667 0.0667 6.7 1 0.80 1 0.0750 0.0750 0.0750 0.0750
18 13 0.0667 0.0667 9.6 1 0.95 1 0.0750 0.0750 0.0750 0.0750
18 14 0.0667 0.0667 10.0 1 0.13 1 0.0750 0.0750 0.0750 0.0750
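
The row count and the fractional-weight totals stated for the two-stage imputed data set can be checked with a short Python sketch. The FracWt values are transcribed from the figures for the three units that have missing values; the variable names are hypothetical, and the sums match 1 only up to the four-decimal rounding of the displayed weights.

```python
# FracWt values for the imputed units, transcribed from the figure.
frac_wts = {
    5: [0.1340, 0.1340, 0.2680, 0.1547, 0.1547, 0.1547],
    14: [0.1547] * 3 + [0.1340] * 4,
    18: [0.0667, 0.0667, 0.1334] + [0.0770] * 3 + [0.0588] * 4 + [0.0667] * 4,
}

# Units 1-4, 6-13, and 15-17 are fully observed and contribute one row each.
fully_observed_rows = 15

total_rows = fully_observed_rows + sum(len(w) for w in frac_wts.values())
weight_sums = {unit: round(sum(w), 3) for unit, w in frac_wts.items()}

print(total_rows)
print(weight_sums)
```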


Last updated: December 09, 2022