The SURVEYSELECT Procedure

SAMPLINGUNIT | CLUSTER Statement

  • SAMPLINGUNIT | CLUSTER variables </ options> ;

The SAMPLINGUNIT statement names one or more variables that identify the sampling units as groups of observations (clusters). The combinations of categories of SAMPLINGUNIT variables define the sampling units. If there is a STRATA statement, sampling units are nested within strata.

When you use a SAMPLINGUNIT statement to define units (clusters), PROC SURVEYSELECT selects a sample of these units by using the selection method and design parameters that you specify in the PROC SURVEYSELECT statement. If you do not use a SAMPLINGUNIT statement, then PROC SURVEYSELECT uses the input data set observations as sampling units by default.
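As a minimal sketch of the behavior described above, the following step selects whole households instead of individual people. The data set Residents and its variables (HouseholdID, PersonID, Age) are hypothetical and exist only for illustration; the seed value is arbitrary.

```sas
/* Hypothetical input: one observation per person, several per household */
data Residents;
   input HouseholdID PersonID Age;
   datalines;
1 1 34
1 2 36
2 1 52
3 1 28
3 2 30
3 3 5
4 1 61
4 2 59
5 1 45
;
run;

/* Select 2 of the 5 households by simple random sampling.    */
/* The sampling unit is the household, not the observation.   */
proc surveyselect data=Residents method=srs n=2 seed=20221209
                  out=HouseholdSample;
   samplingunit HouseholdID;
run;
```

Without the SAMPLINGUNIT statement, the same step would select two individual observations from Residents.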

The SAMPLINGUNIT variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the SAMPLINGUNIT variables determine the SAMPLINGUNIT variable levels. Thus, you can use formats to group values into levels. For more information, see the FORMAT procedure in the Base SAS Procedures Guide and the FORMAT statement and SAS formats in SAS Formats and Informats: Reference.
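The following sketch illustrates grouping values into levels with a format. The format name ZipArea, the ZIP code ranges, and the Stores data set are all assumptions made for this example. Because the formatted values determine the levels, stores whose ZIP codes fall in the same range should be treated as one sampling unit.

```sas
/* Hypothetical format that groups ZIP codes into three areas */
proc format;
   value ZipArea
      10000-19999 = 'Area A'
      20000-29999 = 'Area B'
      30000-39999 = 'Area C';
run;

/* Hypothetical input: one observation per store, with the format attached */
data Stores;
   input StoreID Zip;
   format Zip ZipArea.;
   datalines;
1 10001
2 10450
3 20310
4 20999
5 30120
6 30875
;
run;

/* The formatted values of Zip define three sampling units (areas); */
/* this step selects one of them.                                    */
proc surveyselect data=Stores method=srs n=1 seed=4711
                  out=AreaSample;
   samplingunit Zip;
run;
```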

The SAMPLINGUNIT statement is available with all selection methods except Poisson sampling (METHOD=POISSON) and sequential Poisson sampling (METHOD=SEQ_POISSON).

If you specify the PPS option in the SAMPLINGUNIT statement and do not specify a SIZE statement, the procedure computes sampling unit size as the number of observations in the sampling unit. If you specify a SIZE statement and a SAMPLINGUNIT statement, the procedure computes sampling unit size by summing the size measures of all observations in the sampling unit.

By default, PROC SURVEYSELECT sorts the input data set by the SAMPLINGUNIT variables within strata before sample selection. This groups the observations into sampling units and orders the sampling units by the SAMPLINGUNIT variables. If you do not want the procedure to sort the input data set by the SAMPLINGUNIT variables, then specify the PRESORTED option in the SAMPLINGUNIT statement. By using the PRESORTED option, you can provide the order of the sampling units for systematic and sequential selection methods. Note that the CONTROL statement is not available with the SAMPLINGUNIT statement, so it cannot be used to order the sampling units.

PROC SURVEYSELECT selects sampling units (clusters) as a whole. When you specify a SAMPLINGUNIT statement, the procedure does not select samples of observations from within the sampling units. To select independent samples within groups, use the STRATA statement instead.
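To make the distinction between strata and sampling units concrete, here is a small sketch that combines both statements. The Pupils data set and its variables (District, School, Pupil) are hypothetical. The input is already in order by District, which the STRATA statement expects by default.

```sas
/* Hypothetical input: pupils within schools within districts, */
/* already sorted by the stratification variable District.     */
data Pupils;
   input District $ School Pupil;
   datalines;
A 101 1
A 101 2
A 102 1
A 103 1
A 103 2
B 201 1
B 202 1
B 202 2
B 203 1
;
run;

/* Within each district (stratum), select 2 schools (clusters). */
/* Selection is independent from one district to the next.      */
proc surveyselect data=Pupils method=srs n=2 seed=31415
                  out=SchoolSample;
   strata District;
   samplingunit School;
run;
```

Here District defines the groups in which independent samples are drawn, and School defines the units that are drawn within each group.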

You can specify the following options:

ID

includes identification numbers for the selected sampling units (clusters) in the OUT= output data set, in a variable named ClusterID. The ID numbers are consecutive integers that begin with the number 1. For more information, see the section Sample Output Data Set.

When the selection method is with-replacement or with-minimum-replacement and a cluster is selected more than once, PROC SURVEYSELECT assigns a separate ID number to each selection. (With-replacement and with-minimum-replacement selection methods include METHOD=URS, METHOD=BALBOOTSTRAP, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ.)

When you specify a STRATA statement, PROC SURVEYSELECT assigns cluster ID numbers within each stratum. When you specify replicated sampling (by using the REPS= option), PROC SURVEYSELECT assigns cluster ID numbers within each replicate.

For with-replacement and with-minimum-replacement selection methods, the ID option invokes the OUTHITS option, which includes a distinct copy of each selected unit in the output data set. When you specify the ID option together with the OUTALL option in the PROC SURVEYSELECT statement, the value of ClusterID is 0 for sampling units that are not selected.
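The sketch below requests cluster ID numbers for a with-replacement sample. The Residents data set is hypothetical, and the example assumes a SAS/STAT release in which the ID option is available. If a household is selected more than once, the output should contain a separate copy of it for each selection, each with its own ClusterID value.

```sas
/* Hypothetical input: one observation per person, grouped by household */
data Residents;
   input HouseholdID PersonID;
   datalines;
1 1
1 2
2 1
3 1
3 2
4 1
;
run;

/* Draw 3 households with replacement (METHOD=URS) and request */
/* ID numbers for the selected units in the variable ClusterID. */
proc surveyselect data=Residents method=urs n=3 seed=27182
                  out=HouseholdHits;
   samplingunit HouseholdID / id;
run;

/* Inspect the ClusterID values assigned to the selections */
proc print data=HouseholdHits;
run;
```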

PPS

computes a sampling unit’s size measure as the number of observations in the sampling unit. The procedure then uses these size measures to select a sample according to the PPS selection method that you specify in the METHOD= option in the PROC SURVEYSELECT statement.

This option has no effect when you specify a SIZE statement. When you specify a SIZE statement, the procedure computes sampling unit size by summing the size measures of all observations that belong to the sampling unit.
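The two ways of obtaining a sampling unit size can be sketched as follows. The Students data set and its variables (School, Student, Hours) are hypothetical; Hours stands in for any per-observation size measure. The sizes were chosen so that no school is too large relative to the total for a sample of two units.

```sas
/* Hypothetical input: one observation per student */
data Students;
   input School Student Hours;
   datalines;
1 1 10
1 2 12
1 3 8
2 1 15
2 2 5
3 1 5
3 2 5
3 3 10
3 4 10
4 1 20
;
run;

/* PPS option, no SIZE statement: school size is the number of */
/* student observations in the school (3, 2, 4, and 1).        */
proc surveyselect data=Students method=pps n=2 seed=16180
                  out=SampleByCount;
   samplingunit School / pps;
run;

/* SIZE statement: school size is the sum of Hours over the    */
/* school's observations (30, 20, 30, and 20).                 */
proc surveyselect data=Students method=pps n=2 seed=16180
                  out=SampleByHours;
   samplingunit School;
   size Hours;
run;
```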

PRESORTED

requests that PROC SURVEYSELECT not sort the input data set by the SAMPLINGUNIT variables within strata. By default, the procedure sorts the input data set by the SAMPLINGUNIT variables, which groups the observations into sampling units and orders the units by the SAMPLINGUNIT variables.

The PRESORTED option enables you to provide the order of the sampling units. For systematic and sequential selection methods, ordering provides additional control over the distribution of the sample and gives some benefits of proportionate stratification. Systematic and sequential methods include METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ. For more information, see the descriptions of these methods in the section Sample Selection Methods.

When you specify the PRESORTED option, the procedure treats the sampling unit groups as NOTSORTED. Like the BY statement option NOTSORTED, this does not mean that the data are unsorted by the SAMPLINGUNIT variables, but rather that the data are arranged in groups (according to values of the SAMPLINGUNIT variables) and that these groups are not necessarily in alphabetical or increasing numeric order. For more information about the BY statement NOTSORTED option, see SAS Programmer's Guide: Essentials.
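As an illustration of supplying your own unit order, the following sketch uses systematic selection on data that are grouped by block but deliberately not in increasing numeric order. The Route data set and its variables (Block, House) are hypothetical; the block order is assumed to reflect a geographic route.

```sas
/* Hypothetical input: houses grouped by block, with blocks listed */
/* in route order (30, 10, 40, 20) instead of numeric order.       */
data Route;
   input Block House;
   datalines;
30 1
30 2
10 1
10 2
10 3
40 1
20 1
20 2
;
run;

/* PRESORTED keeps the blocks in the order in which they appear, */
/* so systematic selection steps through them in route order.    */
proc surveyselect data=Route method=sys n=2 seed=14142
                  out=BlockSample;
   samplingunit Block / presorted;
run;
```

Without the PRESORTED option, the procedure would first sort the observations by Block, and systematic selection would step through the blocks in the order 10, 20, 30, 40.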

Last updated: December 09, 2022