The SURVEYIMPUTE Procedure

PROC SURVEYIMPUTE Statement

  • PROC SURVEYIMPUTE <options>;

The PROC SURVEYIMPUTE statement invokes the SURVEYIMPUTE procedure. The DATA= option identifies the data set to be analyzed. Table 1 summarizes the options available in the PROC SURVEYIMPUTE statement.

Table 1: Options Available in the PROC SURVEYIMPUTE Statement

Option       Description
DATA=        Names the input data set
METHOD=      Specifies the imputation method
NDONORS=     Specifies the number of donors for a recipient
NOPRINT      Suppresses all displayed output
ORDER=       Specifies the sort order of CLASS variables
RATE=        Specifies the sampling rate for the primary sampling units
SEED=        Specifies the random number seed
TOTAL=       Specifies the total number of primary sampling units
VARMETHOD=   Specifies the variance estimation method


You can specify the following options.

DATA=SAS-data-set

names the SAS data set that contains the data to be analyzed. If you omit the DATA= option, PROC SURVEYIMPUTE uses the most recently created SAS data set.

METHOD=FEFI | FHDI | HOTDECK | HD <(method-option)>

specifies the imputation method to impute missing values for all variables in the VAR statement.

Table 2 summarizes the available method-options.

Table 2: Imputation Methods

METHOD=        Imputation Method                              Method-Options
FEFI           Fully efficient fractional imputation method   ABSEMWTCONV=
                                                              MAXDONORCELLS=
                                                              MAXEMITER=
                                                              RELEMWTCONV=
FHDI           Fractional hot-deck imputation method          ABSEMWTCONV=
                                                              DISP=MEAN
                                                              DISP=SSCP
                                                              MAXDONORCELLS=
                                                              MAXEMITER=
                                                              RELEMWTCONV=
                                                              REPWTADJ=RATIO
                                                              REPWTADJ=NEIGHBOR
                                                              REPWTADJ=NONE
                                                              SELECTION=PPSPERPATN
                                                              SELECTION=PPSPEROBS
HOTDECK | HD   Approximate Bayesian bootstrap                 SELECTION=ABB
               Simple random sampling without replacement     SELECTION=SRSWOR
               Simple random sampling with replacement        SELECTION=SRSWR
               Weighted selection                             SELECTION=WEIGHTED


By default, if all variables that you specify in the VAR statement are also specified in the CLASS statement, then METHOD=FEFI. Otherwise, the default imputation method is METHOD=HOTDECK. You can specify the following values:

FEFI <(method-options)>

requests the fully efficient fractional imputation (FEFI) method. For more information, see the section Fully Efficient Fractional Imputation.

You can specify the following method-options:

ABSEMWTCONV=r

specifies the absolute weighted convergence criterion. The expectation maximization (EM) algorithm stops when the maximum absolute difference between the fractional weights from the previous iteration and the fractional weights from the current iteration is less than r. The default value of r is 0.00001. For more information, see the section FEFI Algorithm.

MAXDONORCELLS=i

specifies the maximum number (i) of donor cells allowed for a recipient unit. If the number of donor cells for any recipient unit exceeds i, then no imputation is performed. By default, MAXDONORCELLS=5000.

MAXEMITER=i

specifies the maximum number (i) of iterations for the expectation maximization (EM) algorithm. By default, MAXEMITER=100.

RELEMWTCONV=r

specifies a relative weighted convergence criterion. The expectation maximization (EM) algorithm stops when the maximum absolute relative difference between the weights from the previous iteration and the weights from the current iteration is less than r. For more information, see the section FEFI Algorithm. The default value of r is 0.001.
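The two stopping criteria described for ABSEMWTCONV= and RELEMWTCONV= can be illustrated with a small Python sketch. It is not the procedure's implementation: the function names are hypothetical, the relative difference is assumed to be |current - previous| / |previous|, and the sketch evaluates each criterion separately because this section does not say how the procedure combines them.

```python
def max_abs_diff(prev, curr):
    """Largest absolute change in any fractional weight between two iterations."""
    return max(abs(c - p) for p, c in zip(prev, curr))


def max_abs_rel_diff(prev, curr):
    """Largest absolute change relative to the previous weight.

    Assumption: relative difference = |curr - prev| / |prev|; weights whose
    previous value is zero are skipped to avoid division by zero.
    """
    return max(
        (abs(c - p) / abs(p) for p, c in zip(prev, curr) if p != 0),
        default=0.0,
    )


ABSEMWTCONV = 0.00001  # documented default for r
RELEMWTCONV = 0.001    # documented default for r

# Made-up fractional weights from two consecutive EM iterations.
prev_weights = [0.25, 0.50, 0.25]
curr_weights = [0.2502, 0.4997, 0.2501]

abs_change = max_abs_diff(prev_weights, curr_weights)
rel_change = max_abs_rel_diff(prev_weights, curr_weights)

abs_met = abs_change < ABSEMWTCONV  # absolute criterion
rel_met = rel_change < RELEMWTCONV  # relative criterion
print(abs_met, rel_met)
```

With these made-up weights the largest absolute change (about 0.0003) is above the absolute default, while the largest relative change (about 0.0008) is below the relative default.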

FHDI <(method-options)>

requests the fractional hot-deck imputation (FHDI) method. For more information, see the section Fractional Hot-Deck Imputation.

You can specify the following method-options in parentheses:

ABSEMWTCONV=r

specifies the absolute weighted convergence criterion. The expectation maximization (EM) algorithm stops when the maximum absolute difference between the first-stage fractional weights from the previous iteration and the first-stage fractional weights from the current iteration is less than r. For more information, see the section FEFI Algorithm. The default value of r is 0.00001.

DISP=stat

displays the weighted mean or the crossproduct of the weighted mean sum of squares for variables that are specified in the VAR statement but not in the CLASS statement.

You can specify one of the following stats:

MEAN

displays the weighted mean.

SSCP

displays both the weighted mean and the crossproduct of the weighted mean sum of squares.

Displayed statistics are from both two-stage FEFI and FHDI because PROC SURVEYIMPUTE must perform two-stage FEFI in order to determine the donor sets for FHDI. The closer the means and crossproducts for FHDI are to the ones for two-stage FEFI, the more confidence you can have that FHDI is as efficient as FEFI.

MAXDONORCELLS=i

specifies the maximum number (i) of second-stage donor cells allowed for a recipient unit. If the number of second-stage donor cells for any recipient unit exceeds i, then no imputation is performed. By default, MAXDONORCELLS=5000.

MAXEMITER=i

specifies the maximum number (i) of iterations for the expectation maximization (EM) algorithm for the first-stage FEFI. By default, MAXEMITER=100.

RELEMWTCONV=r

specifies the relative weighted convergence criterion. The expectation maximization (EM) algorithm for the first-stage FEFI stops when the maximum absolute relative difference between the weights from the previous iteration and the weights from the current iteration is less than r. For more information, see the section FEFI Algorithm. The default value of r is 0.001.

REPWTADJ=replicate-adjustment-option

adjusts the replicate weights for FHDI. For more information, see the section Replicate Weight Adjustments for FHDI. You can specify one of the following values for replicate-adjustment-option:

NEIGHBOR

adjusts replicate weights by using the sum of replicate fractional weights in neighborhoods that are defined by the full sample fractional weights from two-stage FEFI.

NONE

does not adjust the replicate weights for the selection of donors after two-stage FEFI.

RATIO

adjusts replicate weights by using the ratio of replicate fractional weights and the full sample fractional weights from two-stage FEFI.

By default, REPWTADJ=NEIGHBOR.

SELECTION=selection-option

specifies how to perform the probability proportional to size (PPS) with replacement selection. For more information, see "Second-Stage Selection" in the section Fractional Hot-Deck Imputation Algorithm. You can specify one of the following selection-options:

PPSPEROBS

performs independent selection of second-stage donor cells for every observation unit that is a recipient for the second-stage imputation.

PPSPERPATN

performs one selection of second-stage donor cells for all observation units that are recipients for the second-stage imputation and have the same first-stage FEFI levels.

By default, SELECTION=PPSPERPATN for METHOD=FHDI.

HOTDECK < (SELECTION=selection-option) >
HD < (SELECTION=selection-option) >

requests the hot-deck imputation method. For more information, see the section Hot-Deck Imputation.

By default, SELECTION=SRSWR for METHOD=HOTDECK if you do not use the WEIGHT statement, and SELECTION=WEIGHTED for METHOD=HOTDECK if you use the WEIGHT statement. You can specify one of the following selection-options for donor selection:

ABB

requests donor selection by using the approximate Bayesian bootstrap method. For more information, see the section Approximate Bayesian Bootstrap.

SRSWOR

requests donor selection by using simple random samples without replacement. For more information, see the section Simple Random Samples without Replacement.

SRSWR

requests donor selection by using simple random samples with replacement. For more information, see the section Simple Random Samples with Replacement.

WEIGHTED

requests donor selection by using probability proportional to respondent weights with replacement. For more information, see the section Weighted Selection.
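The difference between the SRSWR, SRSWOR, and WEIGHTED selection-options can be sketched in Python with the standard random module. This is only an illustration of the sampling schemes, not the procedure's algorithm or random number generator; the function name, donor labels, and weights are hypothetical, and SELECTION=ABB is deliberately left out because this section does not describe its steps.

```python
import random


def select_donors(donors, k, selection, weights=None, rng=None):
    """Pick k donors for one recipient under a given selection scheme."""
    rng = rng or random.Random()
    if selection == "SRSWR":
        # Simple random sampling with replacement: a donor can repeat.
        return rng.choices(donors, k=k)
    if selection == "SRSWOR":
        # Simple random sampling without replacement: requires k <= len(donors).
        return rng.sample(donors, k)
    if selection == "WEIGHTED":
        # Probability proportional to respondent weights, with replacement.
        return rng.choices(donors, weights=weights, k=k)
    raise ValueError(f"selection scheme not sketched here: {selection}")


donors = ["d1", "d2", "d3", "d4"]
rng = random.Random(12345)  # fixed seed so repeated runs give the same picks

with_replacement = select_donors(donors, 3, "SRSWR", rng=rng)
without_replacement = select_donors(donors, 3, "SRSWOR", rng=rng)
# All of the weight is on d3, so weighted selection can only return d3.
weighted = select_donors(donors, 3, "WEIGHTED", weights=[0, 0, 1, 0], rng=rng)
```

Only SRSWOR guarantees that the selected donors are distinct; the other two schemes can pick the same donor more than once.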

NDONORS=r

specifies the number of donor units r, where r is either of the following:

  • the number of donor units to be used to impute every recipient unit when METHOD=HOTDECK

  • the maximum number of second-stage donor cells to be used to impute second-stage missing items conditional on the first-stage FEFI levels when METHOD=FHDI

If you specify NDONORS=0 for METHOD=HOTDECK, then no imputation is performed.

When METHOD=FEFI, the SURVEYIMPUTE procedure performs fully efficient fractional imputation, for which the NDONORS= option does not apply.

By default, NDONORS=1 for METHOD=HOTDECK and NDONORS=10 for METHOD=FHDI.
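The defaults stated so far for METHOD=, SELECTION=, and NDONORS= depend on one another, so a compact summary may help. The following Python sketch encodes only the rules given in this section; the function names are hypothetical and the code is not part of the procedure. Variable names are compared without regard to case because SAS variable names are not case-sensitive.

```python
def default_method(var_names, class_names):
    """FEFI if every VAR variable is also a CLASS variable, else HOTDECK."""
    var_set = {name.upper() for name in var_names}
    class_set = {name.upper() for name in class_names}
    return "FEFI" if var_set <= class_set else "HOTDECK"


def default_selection(method, has_weight_statement):
    """Default SELECTION= value; None where the option does not apply."""
    if method == "HOTDECK":
        return "WEIGHTED" if has_weight_statement else "SRSWR"
    if method == "FHDI":
        return "PPSPERPATN"
    return None  # FEFI has no SELECTION= method-option


def default_ndonors(method):
    """Default NDONORS= value; None for FEFI, where NDONORS= does not apply."""
    return {"HOTDECK": 1, "FHDI": 10}.get(method)


# Hypothetical variable lists: Income is not a CLASS variable.
method = default_method(["Q1", "Income"], ["Q1", "Region"])
selection = default_selection(method, has_weight_statement=True)
ndonors = default_ndonors(method)
print(method, selection, ndonors)
```

In this example the analysis variable Income is not listed in the CLASS statement, so the sketch resolves to hot-deck imputation with weighted selection and one donor.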

NOPRINT

suppresses all displayed output. This option temporarily disables the Output Delivery System (ODS); for more information about ODS, see Chapter 23, Using the Output Delivery System.

ORDER=DATA | FORMATTED | FREQ | INTERNAL

specifies the sort order for the levels of the classification variables (which are specified in the CLASS statement).

This option applies to the levels for all classification variables, except when you use the (default) ORDER=FORMATTED option with numeric classification variables that have no explicit format. In that case, the levels of such variables are ordered by their internal value.

The ORDER= option can take the following values:

Value of ORDER=   Levels Sorted By
DATA              Order of appearance in the input data set
FORMATTED         External formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ              Descending frequency count; levels with the most observations come first in the order
INTERNAL          Unformatted value

By default, ORDER=FORMATTED. For ORDER=FORMATTED and ORDER=INTERNAL, the sort order is machine-dependent.

For more information about sort order, see the chapter on the SORT procedure in the Base SAS Procedures Guide and the discussion of BY-group processing in the "Grouping Data" section of SAS Programmer's Guide: Essentials.
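The four ORDER= values can be compared on a small example. The Python sketch below is an approximation under stated assumptions: the function name and the fmt argument (a callable that maps an internal value to its formatted value) are hypothetical, ties under FREQ are assumed to keep their order of appearance, and Python's own collating order stands in for the machine-dependent sort order mentioned above.

```python
from collections import Counter


def order_levels(values, order="FORMATTED", fmt=None):
    """Return the distinct levels of a classification variable in sorted order."""
    levels = list(dict.fromkeys(values))  # distinct levels, order of appearance
    if order == "DATA":
        return levels
    if order == "INTERNAL":
        return sorted(levels)
    if order == "FREQ":
        counts = Counter(values)
        # Descending count; the stable sort keeps data order for ties (assumption).
        return sorted(levels, key=lambda level: -counts[level])
    if order == "FORMATTED":
        if fmt is None:
            # No explicit format: fall back to the internal value.
            return sorted(levels)
        return sorted(levels, key=fmt)
    raise ValueError(f"unknown ORDER= value: {order}")


values = [3, 1, 2, 3, 3, 2]
labels = {1: "Low", 2: "Medium", 3: "High"}  # hypothetical format

by_data = order_levels(values, "DATA")
by_internal = order_levels(values, "INTERNAL")
by_freq = order_levels(values, "FREQ")
by_formatted = order_levels(values, "FORMATTED", fmt=labels.get)
by_default = order_levels(values)  # FORMATTED with no explicit format
```

Here the formatted order follows the labels alphabetically (High, Low, Medium), which differs from the internal order 1, 2, 3.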

RATE=value | SAS-data-set
R=value | SAS-data-set

specifies the sampling rate, which PROC SURVEYIMPUTE uses to compute a finite population correction for bootstrap replicate weights. This option is ignored for the BRR and jackknife replication methods.

If your sample design has multiple stages, you should specify the first-stage sampling rate, which is the ratio of the number of primary sampling units (PSUs) that are selected to the total number of PSUs in the population.

You can specify the sampling rate in either of the following ways:

value

specifies a nonnegative number to use for a nonstratified design or for a stratified design that has the same sampling rate in each stratum.

SAS-data-set

specifies a SAS-data-set that contains the stratification variables and the sampling rates for a stratified design that has different sampling rates in the strata. You must provide the sampling rates in the data set variable named _RATE_.

The sampling rates must be nonnegative numbers. You can specify value as a number between 0 and 1. Alternatively, you can specify value in percentage form as a number between 1 and 100, and PROC SURVEYIMPUTE converts that number to a proportion. The procedure treats the value 1 as 100% instead of 1%.

For more information, see the section Population Totals and Sampling Rates.

If you do not specify the RATE= or TOTAL= option, then the bootstrap replicate weights do not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option in the same PROC SURVEYIMPUTE statement.
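The rule for interpreting a RATE= value (proportion versus percentage, with 1 read as 100%) can be written out as a short Python sketch. The function names are hypothetical. Rejecting values above 100 is an assumption, because this section does not say how the procedure handles them.

```python
def first_stage_rate(psus_selected, psus_in_population):
    """First-stage sampling rate: selected PSUs divided by PSUs in the population."""
    return psus_selected / psus_in_population


def normalize_rate(value):
    """Convert a RATE= value to a proportion between 0 and 1."""
    if value < 0:
        raise ValueError("sampling rates must be nonnegative")
    if value <= 1:
        # Already a proportion; exactly 1 means 100%, not 1%.
        return float(value)
    if value <= 100:
        # Percentage form, converted to a proportion.
        return value / 100
    # Assumption: values above 100 are treated as invalid in this sketch.
    raise ValueError("sampling rate exceeds 100%")


proportion_form = normalize_rate(0.2)
percentage_form = normalize_rate(20)
full_sample = normalize_rate(1)
computed = first_stage_rate(30, 150)
```

Under this rule RATE=0.2 and RATE=20 describe the same 20% sampling rate, which is also what selecting 30 of 150 PSUs gives.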

SEED=number

specifies the initial random number generation seed for selecting donor units for METHOD=HOTDECK or for selecting second-stage donor cells for METHOD=FHDI. The number should be a positive integer. If you do not specify this option or if number is 0, PROC SURVEYIMPUTE uses the time of day from the computer’s clock to obtain the initial seed. For more information, see the section Random Number Generation.

TOTAL=value | SAS-data-set

specifies the total number of primary sampling units (PSUs) in the population. PROC SURVEYIMPUTE uses the value to compute a finite population correction for bootstrap replicate weights. This option is ignored for the BRR and jackknife replication methods.

You can specify the total number of PSUs in either of the following ways:

value

specifies a positive number to use for a nonstratified design or for a stratified design that has the same population total in each stratum.

SAS-data-set

specifies a SAS-data-set that contains the stratification variables and the population totals for a stratified design that has different population totals in the strata. You must provide the stratum totals in the data set variable named _TOTAL_.

The stratum totals must be positive numbers.

For more information, see the section Population Totals and Sampling Rates.

If you do not specify the TOTAL= or RATE= option, then the bootstrap replicate weights do not include a finite population correction. You cannot specify both the TOTAL= option and the RATE= option in the same PROC SURVEYIMPUTE statement.

VARMETHOD=method <(method-options)>
REPWEIGHTSTYPE=method <(method-options)>

computes imputation-adjusted replicate weights.

Table 3 summarizes the available methods and method-options.

Table 3: Replicate Weights Options

method      Replicate Weights Method          method-options
BOOTSTRAP   Bootstrap                         MH=number
                                              REPS=number
BRR         Balanced repeated replication     FAY <=value>
                                              HADAMARD=SAS-data-set
                                              PRINTH
                                              REPS=number
JACKKNIFE   Jackknife                         None
NONE        No replicate weight computation   None


By default, VARMETHOD=JACKKNIFE when METHOD=FEFI or METHOD=FHDI, and VARMETHOD=NONE when METHOD=HOTDECK.

You can specify the following methods:

BOOTSTRAP < (method-options) >

computes imputation-adjusted bootstrap replicate weights. The bootstrap method requires at least two primary sampling units (PSUs) in each stratum for stratified designs unless you provide replicate weights by using a REPWEIGHTS statement. For more information, see the section Bootstrap Method.

You can specify the following method-options:

MH=value | (values)

specifies the number of PSUs to select for the bootstrap replicate samples. You can provide bootstrap stratum sample sizes $m_h$ by specifying a list of values. Alternatively, you can provide a single bootstrap sample size value to use for all strata or for a nonstratified design. For more information, see the section Bootstrap Method.

Each bootstrap sample size, $m_h$, must be a positive integer and must be less than $n_h$, which is the total number of PSUs in stratum $h$. By default, $m_h = n_h - 1$ for a stratified design. For a nonstratified design, the bootstrap sample size value must be less than $n$ (the total number of PSUs in the sample). By default, $m = n - 1$ for a nonstratified design. You can provide the bootstrap sample size by specifying one of the following forms:

MH=value

specifies a single bootstrap sample size value to use for all strata or for a nonstratified design.

MH=(values)

specifies a list of stratum bootstrap sample size values. You can separate the values by using blanks or commas, and you must enclose the list of values in parentheses. The number of values must not be less than the number of strata in the DATA= input data set.

The order of the stratum sample size values must match the order of the stratum levels in the DATA= input data set. Each stratum sample size value must be a positive integer and must be less than the total number of PSUs in the corresponding stratum.
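The constraints on MH= can be collected into one validation routine. The Python sketch below follows the rules stated above (default of one fewer than the number of PSUs, positive integers smaller than the stratum PSU count, at least one value per stratum). The function name is hypothetical, and ignoring surplus values in a list is an assumption.

```python
def resolve_mh(psu_counts, mh=None):
    """Return the bootstrap sample size for each stratum.

    psu_counts lists the number of PSUs per stratum in data set order;
    use a one-element list for a nonstratified design.
    """
    if mh is None:
        sizes = [n - 1 for n in psu_counts]  # documented default
    elif isinstance(mh, int):
        sizes = [mh] * len(psu_counts)       # MH=value applies to every stratum
    else:
        mh = list(mh)                        # MH=(values)
        if len(mh) < len(psu_counts):
            raise ValueError("fewer MH= values than strata")
        sizes = mh[:len(psu_counts)]         # assumption: surplus values ignored
    for m, n in zip(sizes, psu_counts):
        if not (isinstance(m, int) and 0 < m < n):
            raise ValueError(f"sample size {m} must be an integer in 1..{n - 1}")
    return sizes


psu_counts = [5, 3, 4]  # hypothetical PSU counts for three strata
default_sizes = resolve_mh(psu_counts)
single_value = resolve_mh(psu_counts, 2)
per_stratum = resolve_mh(psu_counts, (4, 2, 3))
```

A stratum that has a single PSU fails the check even under the default, which is consistent with the requirement of at least two PSUs in each stratum.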

REPS=number

specifies the number of replicates for bootstrap variance estimation, where number must be an integer greater than 1. Increasing the number of replicates improves the estimation precision but also increases the computation time. By default, REPS=250.

BRR < (method-options) >

computes the imputation-adjusted balanced repeated replication (BRR) weights. The BRR method requires a stratified sample design with two primary sampling units (PSUs) in each stratum. If you specify the VARMETHOD=BRR option, you must also use a STRATA statement unless you provide replicate weights by using a REPWEIGHTS statement. For more information, see the section Balanced Repeated Replication (BRR) Method.

You can specify the following method-options in parentheses after the VARMETHOD=BRR option:

FAY <=value>

requests Fay’s method, which is a modification of the BRR method. For more information, see the section Unadjusted Fay’s BRR Replicate Weights.

You can specify the value of the Fay coefficient, which is used in converting the original sampling weights to replicate weights. The Fay coefficient must be a nonnegative number less than 1. By default, the value of the Fay coefficient is 0.5.

HADAMARD=SAS-data-set
H=SAS-data-set

names a SAS data set that contains the Hadamard matrix for BRR replicate construction. If you do not provide a Hadamard matrix by using this method-option, PROC SURVEYIMPUTE generates an appropriate Hadamard matrix for replicate construction. For more information, see the sections Balanced Repeated Replication (BRR) Method and Hadamard Matrix.

If a Hadamard matrix of a particular dimension exists, it is not necessarily unique. Therefore, if you want to use a specific Hadamard matrix, you must provide the matrix as a SAS data set in the HADAMARD= method-option.

In the HADAMARD= input data set, each variable corresponds to a column of the Hadamard matrix, and each observation corresponds to a row of the matrix. You can use any variable names in the HADAMARD= data set. All values in the data set must equal either 1 or -1. You must ensure that the matrix you provide is indeed a Hadamard matrix; that is, $\mathbf{A}'\mathbf{A} = R\,\mathbf{I}$, where $\mathbf{A}$ is the Hadamard matrix of dimension $R$ and $\mathbf{I}$ is an identity matrix. PROC SURVEYIMPUTE does not check the validity of the Hadamard matrix that you provide.

The HADAMARD= input data set must contain at least H variables, where H denotes the number of first-stage strata in your design. If the data set contains more than H variables, PROC SURVEYIMPUTE uses only the first H variables. Similarly, the HADAMARD= input data set must contain at least H observations.

If you do not specify the REPS= method-option, then the number of replicates is equal to the number of observations in the HADAMARD= input data set. If you specify the number of replicates—for example, REPS=nreps—then the procedure uses the first nreps observations in the HADAMARD= data set to construct the replicates.

You can specify the PRINTH method-option to display the Hadamard matrix that the procedure uses to construct replicates for BRR.
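Because the procedure does not validate a user-supplied Hadamard matrix, it can be worth checking the matrix before you store it in a data set. The following Python sketch tests the two conditions stated above: every entry is 1 or -1, and the product of the transposed matrix with the matrix equals R times the identity. The function names are hypothetical, and the Sylvester doubling construction is used only to produce a test matrix; it is not claimed to be how PROC SURVEYIMPUTE generates its own matrix.

```python
def is_hadamard(a):
    """Check that square matrix a has entries of 1 or -1 and satisfies A'A = R*I."""
    r = len(a)
    if r == 0 or any(len(row) != r for row in a):
        return False
    if any(x not in (1, -1) for row in a for x in row):
        return False
    for i in range(r):
        for j in range(r):
            # Entry (i, j) of A'A is the dot product of columns i and j.
            dot = sum(a[k][i] * a[k][j] for k in range(r))
            if dot != (r if i == j else 0):
                return False
    return True


def sylvester(k):
    """Build a Hadamard matrix of dimension 2**k by repeated doubling."""
    h = [[1]]
    for _ in range(k):
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


h4 = sylvester(2)                # 4 x 4 matrix
valid = is_hadamard(h4)
invalid = is_hadamard([[1, 1], [1, 1]])  # columns are not orthogonal
```

Each row of a matrix that passes the check would become one observation in the HADAMARD= data set, with one variable per column.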

PRINTH

displays the Hadamard matrix that is used to construct replicates for BRR. When you provide the Hadamard matrix in the HADAMARD= method-option, PROC SURVEYIMPUTE displays only the rows and columns that are actually used to construct replicates. For more information, see the sections Balanced Repeated Replication (BRR) Method and Hadamard Matrix.

The PRINTH method-option is not available when you use a REPWEIGHTS statement to provide replicate weights, because the procedure does not use a Hadamard matrix in this case.

REPS=number

specifies the number of replicates for BRR variance estimation. The value of number must be an integer greater than 1.

If you do not provide a Hadamard matrix by using the HADAMARD= method-option, the number of replicates should be greater than the number of strata and should be a multiple of 4. For more information, see the section Balanced Repeated Replication (BRR) Method. If a Hadamard matrix cannot be constructed for the REPS= value that you specify, the value is increased until a Hadamard matrix of that dimension can be constructed. Therefore, it is possible for the actual number of replicates used to be larger than the REPS= value that you specify.

If you provide a Hadamard matrix by using the HADAMARD= method-option, the value of REPS= must not be greater than the number of rows in the Hadamard matrix. If you provide a Hadamard matrix and do not specify the REPS= method-option, the number of replicates equals the number of rows in the Hadamard matrix.

If you do not specify the REPS= or HADAMARD= method-option and do not include a REPWEIGHTS statement, the number of replicates equals the smallest multiple of 4 that is greater than the number of strata.

If you provide replicate weights by using a REPWEIGHTS statement, PROC SURVEYIMPUTE does not use the REPS= method-option. When you use a REPWEIGHTS statement, the number of replicates equals the number of REPWEIGHTS variables.
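The rules that determine the number of BRR replicates can be summarized in a small Python sketch. The function names and arguments are hypothetical. The sketch does not model the case in which the procedure increases a requested REPS= value until a Hadamard matrix of that dimension can be constructed, so a requested value is returned unchanged there.

```python
def default_brr_reps(n_strata):
    """Smallest multiple of 4 that is strictly greater than the number of strata."""
    return 4 * (n_strata // 4 + 1)


def resolve_brr_reps(n_strata, reps=None, hadamard_rows=None, n_repweights=None):
    """Number of BRR replicates under the rules described above."""
    if n_repweights is not None:
        # REPWEIGHTS statement: REPS= is not used.
        return n_repweights
    if hadamard_rows is not None:
        if reps is None:
            return hadamard_rows
        if reps > hadamard_rows:
            raise ValueError("REPS= exceeds the number of rows in the Hadamard matrix")
        return reps  # the first reps rows are used
    if reps is None:
        return default_brr_reps(n_strata)
    # Not modeled: the procedure may increase this value so that a
    # Hadamard matrix of that dimension can be constructed.
    return reps


three_strata = default_brr_reps(3)
four_strata = default_brr_reps(4)
from_hadamard = resolve_brr_reps(10, hadamard_rows=12)
from_repweights = resolve_brr_reps(10, reps=16, n_repweights=20)
```

Note that a design with exactly 4 strata gets 8 replicates by default, because the default must be greater than the number of strata, not merely equal to it.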

JACKKNIFE
JK

computes the imputation-adjusted jackknife replicate weights. For more information, see the section Jackknife Method.

NONE

does not compute replicate weights.

Last updated: December 09, 2022