The SURVEYIMPUTE Procedure

Fully Efficient Fractional Imputation

The fully efficient fractional imputation (FEFI) method uses multiple donor units for a recipient unit. The observation unit that contains the missing values is known as the recipient unit, and the observation unit that provides the value for imputation is known as the donor unit. The number of donor units for a recipient unit is equal to the number of observed levels for the missing items, given the observed levels for the nonmissing items of the recipient unit. Each donor donates a fraction of the original weight of the recipient unit such that the sum of the fractional weights from all the donors is equal to the original weight of the recipient. The fraction of the recipient weight that a donor unit contributes to the recipient unit is known as the fractional weight. The method is called fully efficient because it does not introduce additional variability that is caused by the selection of donor units (Kim and Fuller 2004). One disadvantage of the FEFI method is that it can greatly increase the size of the imputed data set. For more information, see Kalton and Kish (1984), Fuller (2009, Section 5.2.2), and Kim and Shao (2014, Section 4.6).

FEFI Algorithm

Suppose you want to impute $P$ items jointly (by using the IMPJOINT statement in PROC SURVEYIMPUTE). Let $\mathbf{Z}_i = (Z_{i1}, \ldots, Z_{iP})$ be the true response for the $P$ items for unit $i$. $\mathbf{Z}_i$ is completely known if all $P$ items are observed for unit $i$. However, the true response might not be known for some units because of item nonresponse. Let $Z_{ij}$ be categorical and have $J$ levels for item $j$. Denote $\mathbf{Z}_{i,\mathrm{obs}}$ as the observed part and $\mathbf{Z}_{i,\mathrm{miss}}$ as the missing part of $\mathbf{Z}_i$. Let $\pi(\kappa_1 \kappa_2 \cdots \kappa_P)$ be the population proportion that falls in category $(Z_{i1} = \kappa_1, Z_{i2} = \kappa_2, \cdots, Z_{iP} = \kappa_P)$. Assume that it is possible to estimate the population proportion from the observed sample. That is, for example, the conditional probability, $P(Z_{i1} = \kappa_1, Z_{i2} = \kappa_2 \mid Z_{i3} = \kappa_3, \ldots, Z_{iP} = \kappa_P)$, in the observed data is the same as the conditional probability in the data where $(Z_{i1} = \kappa_1, Z_{i2} = \kappa_2)$ are missing. The conditional probabilities are estimated by

$$\hat{P}(Z_{i1} = \kappa_1, Z_{i2} = \kappa_2 \mid Z_{i3} = \kappa_3, \ldots, Z_{iP} = \kappa_P) = \left\{ \sum_{\kappa_1 \kappa_2} \hat{\pi}(\kappa_1 \kappa_2 \cdots \kappa_P) \right\}^{-1} \hat{\pi}(\kappa_1 \kappa_2 \cdots \kappa_P)$$

where

$$\hat{\pi}(\kappa_1 \kappa_2 \cdots \kappa_P) = \left\{ \sum_i w_i \right\}^{-1} \sum_i w_i \, I(Z_{i1} = \kappa_1, \ldots, Z_{iP} = \kappa_P)$$

is the estimated joint probability, $I(\cdot)$ is an indicator function, and $w_i$ is the observation weight for unit $i$.
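The two estimators above amount to weighted counting followed by a renormalization. The following Python sketch illustrates this for three items; it is not part of PROC SURVEYIMPUTE. The function names, the toy responses, and the weights are hypothetical, and all rows are assumed to be fully observed.

```python
from collections import defaultdict


def joint_proportions(rows, weights):
    """Weighted proportion of each category combination (the pi-hat estimator)."""
    totals = defaultdict(float)
    for row, w in zip(rows, weights):
        totals[row] += w
    grand_total = sum(totals.values())
    return {cell: t / grand_total for cell, t in totals.items()}


def conditional(pi_hat, given):
    """P-hat of (Z1, Z2) given that the remaining items equal `given`."""
    matching = {cell: p for cell, p in pi_hat.items() if cell[2:] == given}
    denom = sum(matching.values())
    return {cell[:2]: p / denom for cell, p in matching.items()}


# Hypothetical fully observed responses (Z1, Z2, Z3) with unequal weights.
rows = [(0, 0, 1), (0, 1, 1), (1, 1, 1), (0, 0, 0), (0, 0, 1)]
weights = [2.0, 1.0, 1.0, 3.0, 1.0]

pi_hat = joint_proportions(rows, weights)
cond = conditional(pi_hat, given=(1,))
for cell, p in sorted(cond.items()):
    print(cell, round(p, 3))
```

The conditional probabilities are the joint proportions restricted to the cells that match the conditioning items and rescaled so that they sum to 1.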

The FEFI method uses an EM-by-weighting algorithm similar to that of Ibrahim (1990). The detailed algorithm is described in Kim and Fuller (2013). The following steps describe the imputation technique. If you do not specify imputation cells by using the CELLS statement, PROC SURVEYIMPUTE uses the entire data set as one imputation cell. If you specify imputation cells, then all the probabilities in these steps are computed by using observations from the same imputation cell as the recipient unit. To simplify notation, subscripts are not used for imputation cells in the following description.

For given $i$, let $\mathbf{Z}_{i,\mathrm{obs}}$ and $\mathbf{Z}_{i,\mathrm{miss}}$ be the observed part and the missing part, respectively, of unit $i$. Let $\mathcal{A}_c$ be the index set for the complete respondents. Suppose you want to impute the missing part of $\mathbf{Z}_i$, $\mathbf{Z}_{i,\mathrm{miss}}$. The index set $d_i = \{k : k \in \mathcal{A}_c \wedge \mathbf{Z}_{k,\mathrm{obs}} = \mathbf{Z}_{i,\mathrm{obs}}\}$ contains the indexes for all possible donor units for $\mathbf{Z}_i$. Let $l = 1, 2, \ldots, M_l$ index all the observed combinations of $\{\mathbf{Z}_{k,\mathrm{miss}} : k \in d_i\}$. The set of all observed combinations for unit $i$ defines the donor cells (all possible realizations) for unit $i$. Let $\mathbf{Z}_{i,\mathrm{miss}[l]}$ be the $l$th imputed value of $\mathbf{Z}_{i,\mathrm{miss}}$. At least one imputed value must be available for unit $i$; otherwise the observation is not imputed.

  1. Initialization: For each observation that has missing items, determine the number of donor cells by using the number of unique combinations of observed levels for the missing items for the responding units in the imputation cell. Compute the initial fractional weight from donor cell $l$ to unit $i$, $w_{il(0)}$, by

    $$w_{il(0)} = \left\{ \sum_{k=1}^{M_l} \tilde{\pi}_{(0)}(\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[k]}) \right\}^{-1} \tilde{\pi}_{(0)}(\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[l]})$$

    where $l = 1, 2, \ldots, M_l$ indexes the donor cells, $M_l$ is the number of donor cells, and

    $$\tilde{\pi}_{(0)}(\kappa_1 \cdots \kappa_P) = \left\{ \sum_{i \in \mathcal{A}_c} w_i \right\}^{-1} \sum_{i \in \mathcal{A}_c} w_i \, I(Z_{i1} = \kappa_1, \ldots, Z_{iP} = \kappa_P)$$

    The sum of the fractional weights over all the donor cells is 1 for every observation unit; that is, $\sum_l w_{il(0)} = 1$, for all $i$. The $l$th imputed row for unit $i$ is created by keeping the observed items unchanged, replacing the missing items with the observed levels from the $l$th donor cell, and computing the fractional weight by $w_i w_{il(0)}$. Only the complete respondents are used to compute the fractional weights in this step. If unit $i$ has no missing items, then $w_{i1(0)} = 1$. The initial FEFI data set contains all the observed units, the imputed rows for observations that had missing items, and the corresponding fractional weights.

  2. M-step: The $t$th M-step computes the joint probabilities by using the fractional weights from the $(t-1)$th E-step,

    $$\tilde{\pi}_{(t)}(\kappa_1 \cdots \kappa_P) = \left\{ \sum_i \sum_l w_i \, w_{il(t-1)} \right\}^{-1} \sum_i \sum_l w_i \, w_{il(t-1)} \, I(Z_{i1} = \kappa_1, \ldots, Z_{iP} = \kappa_P)$$

    for all $i$, all $l$, and $t > 0$. Note that for $t > 0$, $\tilde{\pi}_{(t)}$ uses all observation units, including observations where missing items are imputed in the initialization step.

  3. E-step: The $t$th E-step computes the fractional weights by using the joint probabilities $\tilde{\pi}_{(t)}(\kappa_1 \cdots \kappa_P)$ from the $t$th M-step. The $t$th fractional weight for unit $i$ and donor cell $l$ is given by

    $$w_{il(t)} = \left\{ \sum_{k=1}^{M_l} \tilde{\pi}_{(t)}(\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[k]}) \right\}^{-1} \tilde{\pi}_{(t)}(\mathbf{Z}_{i,\mathrm{obs}}, \mathbf{Z}_{i,\mathrm{miss}[l]})$$

  4. Repetition: The EM-steps are repeated for $t = 1, 2, \ldots,$ until the changes in fractional weights over all observation units between two successive EM-steps are negligible or the maximum number of EM repetitions is reached.

    The maximum absolute difference convergence criterion, $\epsilon_{\mathrm{AD}}$, at step $t$ is defined as

    $$\max_{i,l} \left| w_{il(t)} - w_{il(t-1)} \right| \le \epsilon_{\mathrm{AD}}$$

    The maximum absolute relative difference convergence criterion, $\epsilon_{\mathrm{RD}}$, at step $t$ is defined as

    $$\max_{i,l} \left| w_{il(t)} - w_{il(t-1)} \right| / w_{il(t-1)} \le \epsilon_{\mathrm{RD}}$$

    where $w_{il(t-1)} > 0$.
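The four steps above can be sketched in Python as follows. This is an illustration of the EM-by-weighting idea under stated assumptions, not the PROC SURVEYIMPUTE implementation: the function name `fefi_fractional_weights`, the use of `None` to mark a missing item, the default tolerance and iteration cap, the restriction to a single imputation cell, and the requirement that all weights be positive are assumptions made for the sketch. The data are the nine units of Figure 8 from the example later in this section, each with a weight of 1.

```python
def fefi_fractional_weights(data, weights, tol=1e-8, max_iter=500):
    """EM-by-weighting sketch; returns one {donor cell: fractional weight} dict per unit."""
    # Donor cells are the category combinations seen among complete respondents.
    cells = sorted({z for z in data if None not in z})
    # A cell can donate to unit i when it agrees with every observed item of unit i.
    donors = [[c for c in cells if all(a is None or a == b for a, b in zip(z, c))]
              for z in data]

    def normalize(totals):
        grand_total = sum(totals.values())
        return {c: v / grand_total for c, v in totals.items()}

    def e_step(pi):
        # Fractional weight: joint probability of the cell, rescaled to sum to 1.
        frac = []
        for cells_i in donors:
            denom = sum(pi[c] for c in cells_i)
            frac.append({c: pi[c] / denom for c in cells_i})
        return frac

    def m_step(frac):
        # Joint probabilities from all rows, each weighted by w_i * w_il.
        totals = {c: 0.0 for c in cells}
        for w, f in zip(weights, frac):
            for c, fw in f.items():
                totals[c] += w * fw
        return normalize(totals)

    # Step 1: initial joint probabilities use the complete respondents only.
    totals = {c: 0.0 for c in cells}
    for z, w in zip(data, weights):
        if None not in z:
            totals[z] += w
    frac = e_step(normalize(totals))

    # Steps 2-4: alternate M-step and E-step until the weights stop changing.
    for _ in range(max_iter):
        new_frac = e_step(m_step(frac))
        change = max(abs(n[c] - o[c]) for n, o in zip(new_frac, frac) for c in n)
        frac = new_frac
        if change <= tol:  # maximum absolute difference criterion
            break
    return frac


# Units 1-9 from Figure 8 as (X, Y); None marks a missing item.
data = [(0, 0), (0, None), (0, 1), (0, 0), (None, 1),
        (1, 0), (1, 1), (1, 1), (None, None)]
frac = fefi_fractional_weights(data, [1.0] * 9)
for cell, fw in frac[1].items():  # fractional weights for unit 2
    print(cell, round(fw, 4))
```

Run to this tight tolerance, the sketch gives fractional weights for unit 2 and unit 9 that are within about 0.001 of the FracWgt values displayed in Figure 10. The procedure's own stopping rule may end the iterations at a slightly different point.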

The replicate weights are created by computing a replicated version of $\tilde{\pi}_{(t)}(\kappa_1 \kappa_2 \cdots \kappa_P)$, $\tilde{\pi}_{(t)}^{(k)}(\kappa_1 \kappa_2 \cdots \kappa_P)$, and by repeating the EM-by-weighting algorithm as described earlier. For the $k$th replicate sample, $\tilde{\pi}_{(t)}^{(k)}(\kappa_1 \kappa_2 \cdots \kappa_P)$ is computed by

$$\tilde{\pi}_{(t)}^{(k)}(\kappa_1 \cdots \kappa_P) = \left\{ \sum_i \sum_l w_i^{(k)} \, w_{il(t-1)}^{(k)} \right\}^{-1} \sum_i \sum_l w_i^{(k)} \, w_{il(t-1)}^{(k)} \, I(Z_{i1} = \kappa_1, \ldots, Z_{iP} = \kappa_P)$$
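As a rough illustration of the replication idea, the following Python sketch reruns an EM-by-weighting loop once per replicate, with the replicate weights in place of the full-sample weights. Several details are assumptions rather than documented behavior of PROC SURVEYIMPUTE: a delete-one jackknife in which replicate $k$ sets the weight of unit $k$ to 0 and multiplies the other weights by $n/(n-1)$, starting values taken from the complete respondents in each replicate, the names `em_fractional_weights` and `rep_wt`, and the use of `None` for a missing item. The data are the nine units of Figure 8 from the example later in this section.

```python
def em_fractional_weights(data, weights, tol=1e-8, max_iter=500):
    """Fractional weights per unit for one set of (replicate) weights."""
    cells = sorted({z for z in data if None not in z})
    donors = [[c for c in cells if all(a is None or a == b for a, b in zip(z, c))]
              for z in data]

    def probs(totals):
        grand_total = sum(totals.values())
        return {c: v / grand_total for c, v in totals.items()}

    def e_step(pi):
        out = []
        for cells_i in donors:
            denom = sum(pi[c] for c in cells_i)
            # Guard: a deleted unit can leave a donor cell with zero probability.
            out.append({c: (pi[c] / denom if denom > 0 else 0.0) for c in cells_i})
        return out

    start = {c: 0.0 for c in cells}
    for z, w in zip(data, weights):
        if None not in z:
            start[z] += w
    frac = e_step(probs(start))
    for _ in range(max_iter):
        totals = {c: 0.0 for c in cells}
        for w, f in zip(weights, frac):
            for c, fw in f.items():
                totals[c] += w * fw  # replicated M-step: w_i^(k) * w_il^(k)
        new = e_step(probs(totals))
        change = max(abs(n[c] - o[c]) for n, o in zip(new, frac) for c in n)
        frac = new
        if change <= tol:
            break
    return frac


# Units 1-9 from Figure 8 as (X, Y); None marks a missing item.
data = [(0, 0), (0, None), (0, 1), (0, 0), (None, 1),
        (1, 0), (1, 1), (1, 1), (None, None)]
n = len(data)

# rep_wt[k][i][cell]: imputation-adjusted weight of unit i, donor cell, replicate k.
rep_wt = []
for k in range(n):
    wk = [0.0 if i == k else n / (n - 1) for i in range(n)]  # assumed jackknife weights
    frac_k = em_fractional_weights(data, wk)
    rep_wt.append([{c: wk[i] * fw for c, fw in frac_k[i].items()} for i in range(n)])

print(round(rep_wt[0][8][(1, 0)], 5))  # unit 9, cell (X=1, Y=0), first replicate
```

Under these assumptions the replicate weights follow the same pattern as the ImpRepWt columns in Figure 10 (zeros for the deleted unit and 1.125 for the other fully observed units), although individual values can differ slightly from the displayed ones because the starting values and stopping rule are not the procedure's.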

Example of FEFI

The small data set shown in Figure 8 is used to illustrate the imputation technique. The data set contains nine observation units, and each unit has two items (X and Y). The variable Unit contains the observation identification. In this example, X is missing for units 5 and 9, and Y is missing for units 2 and 9.

Figure 8: Sample Data with Missing Items

Unit X Y
1 0 0
2 0 .
3 0 1
4 0 0
5 . 1
6 1 0
7 1 1
8 1 1
9 . .


The following SAS statements request joint imputation of X and Y by using the FEFI method. These statements also request imputation-adjusted replicate weights for the jackknife replication method. The CLASS statement specifies that both X and Y are CLASS variables. The OUTPUT statement stores the imputed values in the data set Imputed and stores the jackknife coefficients in the data set Ojkc. The FRACTIONALWEIGHTS= option in the OUTPUT statement saves the fractional weights in the Imputed data set.

proc surveyimpute data=test varmethod=jackknife;
   class x y;
   var x y;
   id Unit;
   output out=Imputed fractionalweights=FracWgt outjkcoefs=Ojkc;
run;

The initial fractional weights, FracWgt, after the initialization step are displayed in Figure 9.

  • Observation unit 1 has no missing value. Therefore, the ImpIndex value is 0, the FracWgt value is 1, and the values of X and Y are the same as the observed values for observation unit 1 in Figure 9. Because all observation units have a weight of 1, the fractional weights, FracWgt, and the imputation-adjusted weights, ImpWt, are the same for all rows.

  • Observation unit 2 has a missing Y. The observed level for X for unit 2 is 0. For X = 0, two levels for Y are observed: Y = 0, which has a proportion (FracWgt) of 0.67, and Y = 1, which has a proportion of 0.33. Therefore, observation unit 2 receives two donor cells (ImpIndex = 1 and ImpIndex = 2), whose initial fractional weights are 0.67 and 0.33, respectively. Because X is observed, the X values in both rows for unit 2 are the same as the observed value. However, the first recipient row for unit 2 has an imputed Y value of 0, the second recipient row for unit 2 has an imputed Y value of 1, and each has a corresponding initial fractional weight.

  • Observation unit 5 has a missing X. The observed level for Y for unit 5 is 1. To impute X, note that two levels of X are observed when Y = 1: X = 0 with a proportion of 0.33 and X = 1 with a proportion of 0.67. The two recipient rows for observation unit 5 contain the initial fractional weights in the FracWgt column and the imputed X values.

  • Observation unit 9 has missing values for both X and Y. From the observed data, X and Y can take the following values: (X = 0, Y = 0) with probability 0.33, (X = 0, Y = 1) with probability 0.17, (X = 1, Y = 0) with probability 0.17, and (X = 1, Y = 1) with probability 0.33. The four imputed rows (ImpIndex 1, ImpIndex 2, ImpIndex 3, and ImpIndex 4) for observation unit 9 represent the four observed combinations for X and Y along with their initial fractional weights.

The resulting data set contains 14 rows. There are six rows for fully observed units (ImpIndex = 0), two rows for unit 2, two rows for unit 5, and four rows for unit 9. The sum of initial fractional weights is 1 for all units.
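
As a worked check of these values, the initialization formula can be applied by hand to the six complete respondents (units 1, 3, 4, 6, 7, and 8), each of which has a weight of 1. Writing the cells as $(X, Y)$, the initial joint probabilities are

$$\tilde{\pi}_{(0)}(0,0) = \tfrac{2}{6}, \quad \tilde{\pi}_{(0)}(0,1) = \tfrac{1}{6}, \quad \tilde{\pi}_{(0)}(1,0) = \tfrac{1}{6}, \quad \tilde{\pi}_{(0)}(1,1) = \tfrac{2}{6}$$

For unit 2 (X = 0 observed) and unit 5 (Y = 1 observed), the fractional weights are these probabilities rescaled over the matching donor cells:

$$w_{21(0)} = \frac{2/6}{2/6 + 1/6} = \tfrac{2}{3}, \quad w_{22(0)} = \frac{1/6}{2/6 + 1/6} = \tfrac{1}{3}, \quad w_{51(0)} = \frac{1/6}{1/6 + 2/6} = \tfrac{1}{3}, \quad w_{52(0)} = \frac{2/6}{1/6 + 2/6} = \tfrac{2}{3}$$

For unit 9, nothing is observed, so the denominator is the sum over all four cells (which is 1) and the fractional weights equal the joint probabilities themselves: $\tfrac{2}{6}$, $\tfrac{1}{6}$, $\tfrac{1}{6}$, and $\tfrac{2}{6}$.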

Figure 9: Fractional Imputation after Initialization

Unit ImpIndex ImpWt FracWgt X Y
1 0 1.00000 1.00000 0 0
2 1 0.66667 0.66667 0 0
2 2 0.33333 0.33333 0 1
3 0 1.00000 1.00000 0 1
4 0 1.00000 1.00000 0 0
5 1 0.33333 0.33333 0 1
5 2 0.66667 0.66667 1 1
6 0 1.00000 1.00000 1 0
7 0 1.00000 1.00000 1 1
8 0 1.00000 1.00000 1 1
9 1 0.33333 0.33333 0 0
9 2 0.16667 0.16667 0 1
9 3 0.16667 0.16667 1 0
9 4 0.33333 0.33333 1 1


The EM algorithm repeats the computation of the joint probabilities and the fractional weights until convergence. The fractional weights, FracWgt, after the EM step and the imputation-adjusted replicate weights (ImpRepWt_1, …, ImpRepWt_9) are displayed in Figure 10.

Figure 10: Fractional Imputation after the EM

Unit ImpIndex ImpWt FracWgt X Y ImpRepWt_1 ImpRepWt_2 ImpRepWt_3 ImpRepWt_4 ImpRepWt_5 ImpRepWt_6 ImpRepWt_7 ImpRepWt_8 ImpRepWt_9
1 0 1.00000 1.00000 0 0 0.00000 1.12500 1.12500 1.12500 1.12500 1.12500 1.12500 1.12500 1.12500
2 1 0.58601 0.58601 0 0 0.46072 0.00000 1.12498 0.46072 0.75009 0.65906 0.62682 0.62682 0.65877
2 2 0.41399 0.41399 0 1 0.66428 0.00000 0.00002 0.66428 0.37491 0.46594 0.49818 0.49818 0.46623
3 0 1.00000 1.00000 0 1 1.12500 1.12500 0.00000 1.12500 1.12500 1.12500 1.12500 1.12500 1.12500
4 0 1.00000 1.00000 0 0 1.12500 1.12500 1.12500 0.00000 1.12500 1.12500 1.12500 1.12500 1.12500
5 1 0.41399 0.41399 0 1 0.49821 0.37510 0.00002 0.49821 0.00000 0.46601 0.66443 0.66443 0.46623
5 2 0.58601 0.58601 1 1 0.62679 0.74990 1.12498 0.62679 0.00000 0.65899 0.46057 0.46057 0.65877
6 0 1.00000 1.00000 1 0 1.12500 1.12500 1.12500 1.12500 1.12500 0.00000 1.12500 1.12500 1.12500
7 0 1.00000 1.00000 1 1 1.12500 1.12500 1.12500 1.12500 1.12500 1.12500 0.00000 1.12500 1.12500
8 0 1.00000 1.00000 1 1 1.12500 1.12500 1.12500 1.12500 1.12500 1.12500 1.12500 0.00000 1.12500
9 1 0.32330 0.32330 0 0 0.22659 0.32143 0.48214 0.22659 0.42862 0.41563 0.41109 0.41109 0.00000
9 2 0.22840 0.22840 0 1 0.32669 0.21434 0.00001 0.32669 0.21424 0.29384 0.32672 0.32672 0.00000
9 3 0.12500 0.12500 1 0 0.16071 0.16071 0.16071 0.16071 0.16071 0.00000 0.16071 0.16071 0.00000
9 4 0.32330 0.32330 1 1 0.41101 0.42851 0.48214 0.41101 0.32143 0.41553 0.22648 0.22648 0.00000


Last updated: December 09, 2022