The SURVEYSELECT Procedure

PROC SURVEYSELECT Statement

  • PROC SURVEYSELECT options;

The PROC SURVEYSELECT statement invokes the SURVEYSELECT procedure. Optionally, it identifies input and output data sets. If you do not name a DATA= input data set, the procedure selects the sample from the most recently created SAS data set. If you do not name an OUT= output data set to contain the sample of selected units, the procedure still creates an output data set and names it according to the DATAn convention.

The PROC SURVEYSELECT statement also specifies the sample selection method, the sample size, and other sample design parameters.

If you do not specify a selection method, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) by default unless you specify a SIZE statement or the PPS option in the SAMPLINGUNIT statement. If you specify a SIZE statement (or the PPS option), PROC SURVEYSELECT uses probability proportional to size selection without replacement (METHOD=PPS) by default. For more information, see the description of the METHOD= option.

You can use the SAMPSIZE=n option to specify the sample size, or you can use the SAMPSIZE=SAS-data-set option to name a secondary input data set that contains stratum sample sizes. You must specify a sample size or sampling rate except when you request one of the following: random assignment (GROUPS=); balanced bootstrap selection (METHOD=BALBOOTSTRAP); Poisson sampling (METHOD=POISSON); Brewer’s method or Murthy’s method, either of which selects two units from each stratum (METHOD=PPS_BREWER or METHOD=PPS_MURTHY); or sample allocation for a specified margin (MARGIN=).

You can provide stratum sample sizes, sampling rates, initial seeds, minimum size measures, maximum size measures, and certainty size measures in a secondary input data set. For more information, see the descriptions of the SAMPSIZE=, SAMPRATE=, SEED=, MINSIZE=, MAXSIZE=, CERTSIZE=, and CERTSIZE=P= options. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT. For more information, see the section Secondary Input Data Set.

Table 1 summarizes the options available in the PROC SURVEYSELECT statement. Descriptions of the options follow in alphabetical order.

Table 1: PROC SURVEYSELECT Statement Options

Option              Description

Input and Output Data Sets
  DATA=             Names the input SAS data set
  OUT=              Names the output SAS data set that contains the sample
  OUTORDER=RANDOM   Randomly orders the observations in the output data set
  OUTSORT=          Names an output SAS data set that stores the sorted input data set

Selection Method
  METHOD=           Specifies the sample selection method

Sample Size
  SAMPSIZE=         Specifies the sample size
  SELECTALL         Selects all stratum units when the sample size exceeds the total

Sampling Rate
  NMAX=             Specifies the maximum stratum sample size
  NMIN=             Specifies the minimum stratum sample size
  ROUND=            Specifies the rounding type
  SAMPRATE=         Specifies the sampling rate

Replicated Sampling
  REPS=             Specifies the number of replicate samples

Size Measures
  CERTSIZE=         Specifies the certainty size measure
  CERTSIZE=P=       Specifies the certainty proportion
  MAXSIZE=          Specifies the maximum size measure
  MINSIZE=          Specifies the minimum size measure

Control Sorting
  SORT=             Specifies the type of sorting

Random Number Generation
  RANUNI            Requests the RANUNI random number generator
  SEED=             Specifies the initial seed
  STRATUMSEED=      Controls the stratum initial seeds

Random Assignment
  GROUPS=           Requests random assignment

Displayed Output
  NOPRINT           Suppresses the display of all output

OUT= Data Set Contents
  CERTUNITS=        Includes number of certainty units
  JTPROBS           Includes joint probabilities of selection
  OUTALL            Includes all observations from the DATA= input data set
  OUTHITS           Includes a distinct copy of each selected unit
  OUTSEED           Includes the initial seed for each stratum
  OUTSIZE           Includes additional design and sampling frame information
  STATS             Includes selection probabilities and sampling weights


You can specify the following options:

CERTSIZE <=value | SAS-data-set>

specifies the certainty size value. When you specify this option, PROC SURVEYSELECT selects with certainty all sampling units whose size measures are greater than or equal to the certainty size value. After removing these certainty units, the procedure selects the remainder of the sample by using the method that you specify in the METHOD= option. You provide the size measures by using the SIZE statement.

You can provide a single certainty value for the entire sample selection, or you can provide stratum-level certainty values by specifying a SAS-data-set. The certainty size values must be positive numbers.

When you specify this option, the OUT= output data set contains a variable named Certain that identifies units that are selected with certainty. The selection probability of each certainty unit is one.

The CERTSIZE= option is available for the following PPS selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, METHOD=PPS_SYS, METHOD=PPS_WR, and METHOD=SEQ_POISSON. The CERTSIZE= option is not available when you specify a SAMPLINGUNIT statement.
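
As a rough illustration of the certainty rule described above, the following Python sketch splits a list of size measures into certainty units and remaining units. It is not SAS code and does not reproduce the procedure's implementation; the function name split_certainty and the sample size measures are hypothetical.

```python
def split_certainty(sizes, certsize):
    """Split unit indices into certainty units and remaining units.

    A unit counts as a certainty unit when its size measure is greater
    than or equal to the certainty size value. Hypothetical helper for
    illustration only; it is not part of any SAS interface.
    """
    if certsize <= 0:
        raise ValueError("certainty size value must be a positive number")
    certain = [i for i, s in enumerate(sizes) if s >= certsize]
    remaining = [i for i, s in enumerate(sizes) if s < certsize]
    return certain, remaining


# Made-up size measures for six sampling units.
sizes = [120, 45, 300, 80, 300.0, 15]
certain, remaining = split_certainty(sizes, certsize=120)
print(certain)    # units selected with probability one
print(remaining)  # units left for the METHOD= selection step
```

In this sketch a unit whose size equals the certainty size value (unit 0) is treated as a certainty unit, which matches the "greater than or equal to" wording above.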

You can provide certainty size values by specifying one of the following forms:

CERTSIZE

indicates that certainty size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). This data set should include a variable named _CERTSIZE_ that contains the certainty values. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

CERTSIZE=value

specifies a single certainty size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the certainty value to determine certainty selections for all strata.

CERTSIZE=SAS-data-set

names a SAS-data-set that contains stratum-level certainty size values. You should provide the certainty values in the data set variable named _CERTSIZE_. Each observation in this data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= data set. The order of the stratum groups in the CERTSIZE= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

CERTSIZE=P <=p | SAS-data-set>

specifies the certainty proportion to use for iterative certainty selection. When you specify this option, PROC SURVEYSELECT selects with certainty the sampling units whose size measure proportions (of the total size) are greater than or equal to the certainty proportion p. After removing the selected certainty units, the procedure repeats certainty selection by using the remaining sampling units, the total size of the remaining units, and the certainty proportion p. This process continues until no more certainty units are selected. PROC SURVEYSELECT then selects the remainder of the sample by using the method that you specify in the METHOD= option.

You can provide a single certainty proportion p for the entire sample, or you can provide stratum-level certainty proportions by specifying a SAS-data-set.

The certainty proportions must be positive numbers. You can specify a certainty proportion as a number between 0 and 1. Alternatively, you can specify it in percentage form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. Because the procedure treats the value 1 as 100% instead of 1%, specify 0.01 to request a certainty proportion of 1%.

When you specify this option, the OUT= output data set contains a variable named Certain that identifies units that are selected with certainty. The selection probability of each certainty unit is one.

You use the SIZE statement to provide size measures for the sampling units. The CERTSIZE=P= option is available for METHOD=PPS, METHOD=PPS_SAMPFORD, and METHOD=SEQ_POISSON. This option is not available when you specify a SAMPLINGUNIT statement.
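
The iterative certainty selection described above can be sketched in Python as follows. This is an illustrative reading of the text, not the procedure's code; the function names normalize_proportion and iterative_certainty and the sample data are assumptions made for the example.

```python
def normalize_proportion(p):
    """Read p as a proportion (0 < p <= 1) or as a percentage (1 < p <= 100).

    The value 1 is read as 100%, following the description above.
    """
    if p <= 0 or p > 100:
        raise ValueError("certainty proportion must be in (0, 100]")
    return p / 100.0 if p > 1 else float(p)


def iterative_certainty(sizes, p):
    """Repeat certainty selection until a pass finds no new certainty units."""
    p = normalize_proportion(p)
    remaining = dict(enumerate(sizes))  # unit index -> size measure
    certain = []
    while remaining:
        total = sum(remaining.values())
        # Units whose share of the remaining total reaches the proportion.
        hits = [i for i, s in remaining.items() if s / total >= p]
        if not hits:
            break
        certain.extend(hits)
        for i in hits:
            del remaining[i]
    return sorted(certain), sorted(remaining)


# Made-up size measures; 35 is read as 35%.
certain, remaining = iterative_certainty([500, 200, 100, 100, 50, 50], 35)
print(certain, remaining)
```

In the first pass only the unit of size 500 reaches 35% of the total (1000). After it is removed, the unit of size 200 reaches 35% of the new total (500), so it becomes a certainty unit in the second pass. The third pass finds no further units and the loop stops.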

You can provide certainty size proportions by specifying one of the following forms:

CERTSIZE=P

indicates that certainty size proportions are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the certainty proportions in the data set variable named _CERTP_. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

CERTSIZE=P=p

specifies a single certainty size proportion p, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the certainty proportion p to determine certainty selections for all strata.

CERTSIZE=P=SAS-data-set

names a SAS-data-set that contains stratum-level certainty size proportions. You should provide the certainty proportions in the data set variable named _CERTP_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the CERTSIZE=P= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

CERTUNITS=certain-option | (certain-options)

controls the display and output of the number of certainty units. This option is available when you specify the CERTSIZE= or CERTSIZE=P= option.

You can specify one or both of the following certain-options. If you specify both certain-options, enclose the values in parentheses after CERTUNITS=.

NOPRINT

suppresses display of the number of certainty units in the "Sample Selection Summary" table. For more information, see the section Displayed Output.

OUTPUT

includes the number of certainty units in the output data set. For more information, see the section Sample Output Data Set.

DATA=SAS-data-set

names the SAS-data-set from which PROC SURVEYSELECT selects the sample. If you omit the DATA= option, the procedure uses the most recently created SAS data set. In sampling terminology, the input data set is the sampling frame (the list of units from which the sample is selected).

By default, the procedure uses input data set observations as sampling units and selects a sample of these units. Alternatively, you can use the SAMPLINGUNIT statement to define sampling units as groups of observations (clusters).

GROUPS=n | (values) | P=(values) <(ROUND=RANDOM)>

requests random assignment (partitioning) of the input data set observations into groups. You can specify the number of equal-size groups, the number of observations in each group, or the proportion of observations in each group.

When you specify a STRATA statement, PROC SURVEYSELECT performs the random assignment independently for each stratum. Otherwise, PROC SURVEYSELECT performs the random assignment for the entire input data set.

You can specify one of the following forms:

GROUPS=n <(ROUND=RANDOM)>

requests random assignment of the observations in the input data set (or stratum) to n groups. The value of n must be a positive integer.

When you specify this form, PROC SURVEYSELECT assigns the same, or nearly the same, number of observations to each group. For example, if the data set contains 100 observations and you specify GROUPS=3, PROC SURVEYSELECT randomly assigns observations to three groups, and the group sizes are 33, 33, and 34 observations.

By default, when the group sizes are not equal, sequential group identification numbers (in the output data set variable GroupID) are designated in order of group size. In the example where GROUPS=3, GroupID=1 and GroupID=2 contain 33 observations, and GroupID=3 contains 34 observations by default. If you specify ROUND=RANDOM, the group identification numbers are designated randomly.
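
The group-size arithmetic in this example can be checked with a short Python sketch. It only mirrors the behavior described above (nearly equal sizes, smaller groups numbered first by default); the function names, the seed argument, and the use of Python's random module are assumptions for illustration and say nothing about how the procedure generates random numbers.

```python
import random


def equal_group_sizes(n_obs, n_groups):
    """Return nearly equal group sizes, smaller groups first."""
    if n_groups < 1:
        raise ValueError("number of groups must be a positive integer")
    base, extra = divmod(n_obs, n_groups)
    return [base] * (n_groups - extra) + [base + 1] * extra


def assign_groups(n_obs, n_groups, seed=None, round_random=False):
    """Randomly assign observations to groups; returns one GroupID per observation."""
    rng = random.Random(seed)
    sizes = equal_group_sizes(n_obs, n_groups)
    if round_random:
        rng.shuffle(sizes)  # sketch of ROUND=RANDOM: group IDs no longer follow size order
    labels = [g for g, size in enumerate(sizes, start=1) for _ in range(size)]
    rng.shuffle(labels)     # random assignment of observations to the group IDs
    return labels


print(equal_group_sizes(100, 3))
labels = assign_groups(100, 3, seed=1)
print([labels.count(g) for g in (1, 2, 3)])
```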

GROUPS=(values)

specifies the number of observations to be assigned to each group. The group size values must be positive integers. The sum of the group size values must equal the total number of observations in the input data set (or in the stratum, if you specify a STRATA statement).

The number of groups is the number of group size values that you list. You can separate the values by using blanks or commas, and you must enclose the list of values in parentheses.

GROUPS=P=(values) <(ROUND=RANDOM)>

specifies the proportion of observations to be assigned to each group. The group proportion values must be positive numbers between 0 and 1. The sum of the group proportion values must equal 1.

The number of groups is the number of proportion values that you list. You can separate the values by using blanks or commas, and you must enclose the list of values in parentheses.

PROC SURVEYSELECT converts the group proportions to positive, integer-valued group sizes. The procedure multiplies the total number of observations (in the input data set or in the stratum) by the group proportions to obtain the target group sizes. When a target group size is less than 1, it is always rounded up to 1. By default, PROC SURVEYSELECT computes the integer group sizes by rounding up the target group sizes in order of the fractional parts until the group size total equals the total number of observations. If you specify ROUND=RANDOM, PROC SURVEYSELECT rounds up the target group sizes in random order until the total is met.
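
The conversion from proportions to integer group sizes might be sketched as follows. The text does not state the direction of the ordering or how ties are broken, so this Python sketch assumes that targets with larger fractional parts are rounded up first; that assumption, the function name, and the seed argument are illustrative only. The sketch also does not handle the case where raising small targets to 1 pushes the total above the number of observations.

```python
import math
import random


def sizes_from_proportions(n_obs, proportions, round_random=False, seed=None):
    """Convert group proportions to integer group sizes that sum to n_obs."""
    targets = [n_obs * p for p in proportions]
    # Round every target down, but never below 1.
    sizes = [max(1, math.floor(t)) for t in targets]
    # Only targets above 1 with a nonzero fractional part can be rounded up.
    candidates = [i for i, t in enumerate(targets)
                  if t > 1 and t != math.floor(t)]
    if round_random:
        random.Random(seed).shuffle(candidates)  # sketch of ROUND=RANDOM
    else:
        # Assumption: largest fractional part first.
        candidates.sort(key=lambda i: targets[i] - math.floor(targets[i]),
                        reverse=True)
    shortfall = n_obs - sum(sizes)
    for i in candidates[:max(shortfall, 0)]:
        sizes[i] += 1
    return sizes


print(sizes_from_proportions(10, [0.5, 0.375, 0.125]))
print(sizes_from_proportions(10, [0.0625, 0.9375]))
```

In the first call the target sizes are 5, 3.75, and 1.25, so one unit is missing after rounding down, and the group with fractional part 0.75 receives it. In the second call the target 0.625 is raised to 1, which already brings the total to 10.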

The following options in the PROC SURVEYSELECT statement are available when you specify the GROUPS= option: the SEED=, RANUNI, and OUTSEED options, which pertain to random number generation; the REPS= option, which provides independent replicates of the random assignment; the NOPRINT option, which suppresses display of the "Random Assignment" table; and the OUTSIZE option, which adds information to the output data set.

The GROUPS= option does not select a sample, and therefore sample selection options (for example, METHOD= or SAMPSIZE=) are not available when you specify the GROUPS= option. The SAMPLINGUNIT statement is also not available when you specify the GROUPS= option.

When you specify the GROUPS= option, the OUT= output data set includes a variable named GroupID that identifies the group assignment of each observation. When you specify the OUTSIZE option, the output data set also includes the following variables: GroupSize, which is the number of observations in the group; Total, which is the total number of observations (in the data set or in the stratum); and NGroups, which is the number of groups (in the data set or in the stratum). For more information, see the section Random Assignment Output Data Set.

JTPROBS

includes joint probabilities of selection in the OUT= output data set. This option is available for the following probability proportional to size selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, and METHOD=PPS_WR. By default, PROC SURVEYSELECT outputs joint selection probabilities for METHOD=PPS_BREWER and METHOD=PPS_MURTHY, which select two units per stratum.

For information about joint selection probabilities for a particular sampling method, see the method description in the section Sample Selection Methods. For more information about the contents of the output data set, see the section Sample Output Data Set.

MAXSIZE <=value | SAS-data-set>

specifies the maximum size value, which PROC SURVEYSELECT uses to adjust size measures before sample selection. When a size measure exceeds the maximum size value, PROC SURVEYSELECT replaces that size measure with the maximum size value.

You can provide a single maximum size value for the entire sample selection, or you can provide stratum-level maximum size values by specifying a SAS-data-set. The maximum size values must be positive numbers.

You provide size measures by specifying the SIZE statement or by specifying the PPS option in the SAMPLINGUNIT statement.

Unless you specify a SAMPLINGUNIT statement, the OUT= data set includes a variable named AdjustedSize that contains the adjusted size measures.

If you use a SAMPLINGUNIT statement to define sampling units (clusters), PROC SURVEYSELECT adjusts the sampling unit sizes (instead of the observation sizes). If you specify a SIZE statement, the size of a sampling unit is the sum of the size measures of the observations in the unit. If you specify the SAMPLINGUNIT PPS option, the size of a sampling unit is the number of observations in the unit.

When you use a SAMPLINGUNIT statement, the OUT= data set includes a variable named UnitSize that contains the adjusted sampling unit size measures.
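
As a minimal sketch of the adjustment described above, the following Python function caps observation-level size measures at a maximum size value, given either as a single number or as one value per stratum. The function name, the (stratum, size) record layout, and the dictionary of stratum limits are hypothetical stand-ins for the SIZE variable and a secondary input data set; this is not SAS code.

```python
def apply_maxsize(records, maxsize):
    """Return adjusted size measures for a list of (stratum, size) records.

    maxsize is either one number for all strata or a dict that maps each
    stratum to its own limit (a stand-in for stratum-level values).
    """
    adjusted = []
    for stratum, size in records:
        limit = maxsize[stratum] if isinstance(maxsize, dict) else maxsize
        if limit <= 0:
            raise ValueError("maximum size values must be positive numbers")
        # A size measure that exceeds the limit is replaced by the limit.
        adjusted.append(min(size, limit))
    return adjusted


records = [("A", 12), ("A", 250), ("B", 40), ("B", 1000)]
print(apply_maxsize(records, 200))                   # single value for all strata
print(apply_maxsize(records, {"A": 100, "B": 500}))  # stratum-level values
```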

You can provide maximum size values by specifying one of the following forms:

MAXSIZE

indicates that maximum size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the maximum size values in the data set variable named _MAXSIZE_. For more information, see the section Secondary Input Data Set. You can specify only one secondary input data set in each invocation of PROC SURVEYSELECT.

MAXSIZE=value

specifies a single maximum size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the value to adjust size measures in all strata.

MAXSIZE=SAS-data-set

names a SAS-data-set that contains stratum-level maximum size values. You should provide the maximum size values in the data set variable named _MAXSIZE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= data set. The order of the stratum groups in the MAXSIZE= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

METHOD=method
M=method

specifies the sample selection method.

By default, METHOD=PPS if you specify a SIZE statement or the PPS option in the SAMPLINGUNIT statement; otherwise, METHOD=SRS by default.

You can specify one of the following methods:

BALBOOTSTRAP
BALBOOT

requests balanced bootstrap selection. This method selects r bootstrap samples of N units with equal probability and with replacement, where N is the total number of sampling units (in the stratum or in the data set) and r is the number of samples (replicates) that you specify in the REPS= option. The bootstrap selection is balanced so that the overall total number of selections is r for each sampling unit. For more information, see the section Balanced Bootstrap Sampling.

When you request this method, you must specify the number of bootstrap samples r in the REPS= option. The sample size for each bootstrap replicate is fixed at N units; therefore, you cannot specify the SAMPSIZE=, SAMPRATE=, or ALLOC= option together with METHOD=BALBOOTSTRAP.

For balanced bootstrap sampling, the output data set contains an observation for each unit that is selected (in each replicate sample). If you specify the OUTHITS option, the output data set also includes the variable NumberHits, which provides the number of selections in the replicate for each selected unit.
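
One common way to construct balanced bootstrap samples is to pool r copies of every unit, permute the pool, and cut it into r samples of N units. The Python sketch below uses that construction to show the balance property stated above; whether the procedure uses this exact construction is not stated here (see the section Balanced Bootstrap Sampling), and the function name and seed argument are assumptions.

```python
import random


def balanced_bootstrap(n_units, reps, seed=None):
    """Return `reps` samples of `n_units` unit indices each.

    Across all samples, every unit is selected exactly `reps` times.
    """
    rng = random.Random(seed)
    pool = [u for u in range(n_units) for _ in range(reps)]  # r copies of each unit
    rng.shuffle(pool)
    return [pool[k * n_units:(k + 1) * n_units] for k in range(reps)]


samples = balanced_bootstrap(n_units=5, reps=4, seed=42)
for sample in samples:
    print(sorted(sample))

# Total selections per unit over all replicates.
totals = [sum(sample.count(u) for sample in samples) for u in range(5)]
print(totals)
```

Within a single replicate a unit can appear several times or not at all; only the totals across replicates are fixed.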

BERNOULLI

requests Bernoulli sampling, which consists of N independent selection trials, each with constant inclusion probability π, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable. For more information, see the section Bernoulli Sampling.

When you specify this method, you must provide the sampling rate (inclusion probability π) in the SAMPRATE= option. For stratified sampling (which you request with the STRATA statement), you can specify the same sampling rate for each stratum in the SAMPRATE=value option. Or you can specify different sampling rates for different strata in the SAMPRATE=(values) or SAMPRATE=SAS-data-set option.

Because Bernoulli sampling is based on a specified inclusion probability instead of a fixed sample size, METHOD=BERNOULLI does not use the SAMPSIZE= option. Also, the ALLOC= option in the STRATA statement (which allocates the total sample size among strata) is not available with METHOD=BERNOULLI.

POISSON

requests Poisson sampling. A generalization of Bernoulli sampling, Poisson sampling consists of N independent selection trials with a separate inclusion probability specified for each unit, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable. For more information, see the section Poisson Sampling. For a fixed-sample-size modification of Poisson sampling, see METHOD=SEQ_POISSON.

You must provide inclusion probabilities for Poisson sampling in the SIZE variable. The probability values should be between 0 and 1. If a value of the SIZE variable is missing, nonpositive, or greater than 1, PROC SURVEYSELECT omits the observation from sample selection.

Because Poisson sampling is based on specified inclusion probabilities instead of on a fixed sample size, you cannot specify the SAMPSIZE= option together with METHOD=POISSON. Similarly, you cannot specify the ALLOC= option (which allocates the total sample size among strata) together with METHOD=POISSON.

The SAMPLINGUNIT statement is not available when you specify METHOD=POISSON.

When you specify the SAMPRATE= option for METHOD=POISSON but do not specify a SIZE statement, PROC SURVEYSELECT uses METHOD=BERNOULLI.
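
The independent selection trials that define Poisson sampling can be sketched in Python as follows. The function name, the use of None for a missing value, and the seed argument are assumptions for illustration; the sketch is not the procedure's random number generation. Passing the same probability for every unit gives the Bernoulli special case.

```python
import random


def poisson_sample(probs, seed=None):
    """Run one independent selection trial per unit.

    probs holds the inclusion probability of each unit. Units whose value
    is missing (None), nonpositive, or greater than 1 are omitted from
    selection, as described above.
    """
    rng = random.Random(seed)
    selected = []
    for i, p in enumerate(probs):
        if p is None or p <= 0 or p > 1:
            continue  # omitted from sample selection
        if rng.random() < p:
            selected.append(i)
    return selected


# Units 1, 2, and 3 have invalid values and are skipped; units 0 and 4
# have probability 1 and are always selected.
print(poisson_sample([1.0, 0.0, None, 1.5, 1.0], seed=0))

# The sample size varies from run to run when probabilities are below 1.
print(len(poisson_sample([0.3] * 1000, seed=1)))
```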

PPS <(method-option)>

requests selection with probability proportional to size and without replacement. PROC SURVEYSELECT performs PPS selection by using the Hanurav-Vijayan algorithm. For more information, see the section PPS Sampling without Replacement. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

You can specify the following method-option:

OUTSORT=SAS-data-set

names an output data set to store the input data that are sorted by size measures within strata. By default, PROC SURVEYSELECT does not save this data set. (Sorting by size measures is part of the Hanurav-Vijayan algorithm for PPS selection. For more information, see the section PPS Sampling without Replacement.) This option is not available when you specify the SAMPLINGUNIT statement.

PPS_BREWER
BREWER

requests selection according to Brewer’s method. Brewer’s method selects two units from each stratum with probability proportional to size and without replacement. For more information, see the section Brewer’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement. You do not need to specify the sample size in the SAMPSIZE= option because Brewer’s method selects two units from each stratum.

PPS_MURTHY
MURTHY

requests selection according to Murthy’s method. Murthy’s method selects two units from each stratum with probability proportional to size and without replacement. For more information, see the section Murthy’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement. You do not need to specify the sample size in the SAMPSIZE= option because Murthy’s method selects two units from each stratum.

PPS_SAMPFORD
SAMPFORD

requests selection according to Sampford’s method. Sampford’s method selects units with probability proportional to size and without replacement. For more information, see the section Sampford’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_SEQ
CHROMY

requests sequential selection with probability proportional to size and with minimum replacement. This method is also known as Chromy’s method. For more information, see the section PPS Sequential Sampling. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_SYS <(method-options)>

requests systematic selection with probability proportional to size. For more information, see the section PPS Systematic Sampling. When you specify this method, you must provide size measures by specifying the SIZE statement or the PPS option in the SAMPLINGUNIT statement.

You can specify the following method-options:

DETAILS

displays the random start and the systematic interval in the "Sample Selection Summary" table when the design does not include strata or replicates. For more information, see the section Displayed Output.

INTERVAL=value

specifies the interval value for PPS systematic selection. The interval value must be a positive number. It must not exceed the total of the size measures in the data set (or in each stratum if you specify a STRATA statement). By default, the systematic interval is the ratio of the size measure total to the sample size (which you provide in the SAMPSIZE= option). For more information, see the section PPS Systematic Sampling.

You cannot use the INTERVAL= method-option when you specify a sample size in the SAMPSIZE= option or when you specify the ALLOC= option, which allocates the total sample size among strata.

START=value

specifies the starting value for PPS systematic selection. The starting value must be a positive number that is less than the systematic interval. By default, PROC SURVEYSELECT randomly determines a starting point in the systematic interval. For more information, see the section PPS Systematic Sampling.

When you use this option to specify a systematic starting point (instead of allowing the procedure to randomly determine the starting point), the following options for random number generation have no effect: SEED=, RANUNI, and OUTSEED. You cannot use the REPS= option when you specify the START= method-option.

When the starting value that you provide is not randomly determined, the resulting selection is not a probability-based sample.
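
To make the roles of the interval and the starting value concrete, here is a textbook-style Python sketch of systematic selection over cumulative size measures. It assumes the default interval described above (size measure total divided by sample size) and selects the unit whose cumulative size range contains each selection point. The function name, arguments, and boundary handling are assumptions; the section PPS Systematic Sampling describes what the procedure actually does.

```python
import random
from itertools import accumulate


def pps_systematic(sizes, n, start=None, seed=None):
    """Return (interval, selected unit indices) for PPS systematic selection."""
    cum = list(accumulate(sizes))  # cumulative size measures
    total = cum[-1]
    interval = total / n           # default interval: size total / sample size
    if start is None:
        start = random.Random(seed).uniform(0, interval)  # random start
    elif not 0 < start < interval:
        raise ValueError("start must be positive and less than the interval")
    selected = []
    idx = 0
    for k in range(n):
        point = start + k * interval
        # Advance to the unit whose cumulative range contains the point.
        while idx < len(cum) - 1 and cum[idx] < point:
            idx += 1
        selected.append(idx)
    return interval, selected


interval, selected = pps_systematic([10, 20, 30, 40], n=2, start=25)
print(interval, selected)
```

With sizes 10, 20, 30, and 40 the total is 100 and the interval is 50. The points 25 and 75 fall in the cumulative ranges of units 1 and 3. A unit whose size exceeds the interval can be hit by more than one point in this sketch.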

PPS_WR

requests selection with probability proportional to size and with replacement. For more information, see the section PPS Sampling with Replacement. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

SEQ
CHROMY

requests sequential selection according to Chromy’s method. If you specify this method and do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses sequential zoned selection with equal probability and without replacement. For more information, see the section Sequential Random Sampling.

If you specify METHOD=SEQ and also specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses METHOD=PPS_SEQ, which is sequential selection with probability proportional to size and with minimum replacement. For more information, see the section PPS Sequential Sampling.

SEQ_POISSON

requests sequential Poisson sampling, which is a fixed-sample-size modification of Poisson sampling (METHOD=POISSON). For more information, see the section Sequential Poisson Sampling.

When you request this method, you must provide size measures in the SIZE variable and you must specify the sample size in the SAMPSIZE= option.

If you request this method, you cannot specify a SAMPLINGUNIT statement.

SRS

requests simple random sampling, which is selection with equal probability and without replacement. For more information, see the section Simple Random Sampling. METHOD=SRS is the default selection method if you do not specify the METHOD= option and also do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement).

SYS <(method-options)>

requests systematic random sampling. If you specify this method and do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses systematic random sampling with equal probability. For more information, see the section Systematic Random Sampling.

If you specify this method and also specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses systematic random sampling with probability proportional to size (METHOD=PPS_SYS). For more information, see the section PPS Systematic Sampling.

You can specify the following method-options:

DETAILS

displays the random start and the systematic interval in the "Sample Selection Summary" table when the design does not include strata or replicates. For more information, see the section Displayed Output.

INTERVAL=value

specifies the interval for systematic random sampling. The interval value must be a positive number and must not exceed the number of sampling units in the data set (or the number of units in each stratum, if you specify a STRATA statement).

By default, PROC SURVEYSELECT determines the systematic interval from the sampling rate or sample size that you provide in the SAMPRATE= or SAMPSIZE= option, respectively. When you specify the sampling rate, PROC SURVEYSELECT computes the systematic interval as the inverse of the sampling rate. When you specify the sample size, the procedure computes the interval as the ratio of the number of sampling units to the sample size. For more information, see the section Systematic Random Sampling.

You cannot use the INTERVAL= method-option when you specify the SAMPSIZE= option, the SAMPRATE= option, or the ALLOC= option (which allocates the total sample size among strata).

START=value

specifies the starting value for systematic selection. The starting value must be a positive number that is less than the systematic interval. By default, PROC SURVEYSELECT randomly determines a starting point in the systematic interval. For more information, see the section Systematic Random Sampling.

When you use this option to specify a systematic starting point (instead of allowing the procedure to randomly determine the starting point), the following options for random number generation have no effect: SEED=, RANUNI, and OUTSEED. You cannot use the REPS= option when you specify the START= method-option.

When the starting value that you provide is not randomly determined, the resulting selection is not a probability-based sample.
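
The default interval arithmetic described above (inverse of the sampling rate, or number of units divided by sample size) can be written out in Python. The second function shows one possible way to turn a fractional interval and a starting value into unit positions by taking the ceiling of each selection point; that rule, like the function names, is an assumption for illustration rather than the procedure's documented algorithm.

```python
import math


def systematic_interval(n_units, sampsize=None, samprate=None):
    """Default systematic interval from a sample size or a sampling rate."""
    if (sampsize is None) == (samprate is None):
        raise ValueError("specify exactly one of sampsize or samprate")
    if samprate is not None:
        return 1.0 / samprate   # inverse of the sampling rate
    return n_units / sampsize   # number of units / sample size


def systematic_positions(n_units, interval, start):
    """1-based unit positions hit by start, start + interval, ... (assumed rule)."""
    if not 0 < start < interval:
        raise ValueError("start must be positive and less than the interval")
    positions = []
    k = 0
    while start + k * interval <= n_units:
        positions.append(math.ceil(start + k * interval))
        k += 1
    return positions


print(systematic_interval(100, samprate=0.25))
print(systematic_interval(100, sampsize=8))
print(systematic_positions(100, 12.5, start=3.0))
```

For 100 units and a sample size of 8 the interval is 12.5, so the interval does not need to be an integer.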

URS

requests unrestricted random sampling, which is selection with equal probability and with replacement. For more information, see the section Unrestricted Random Sampling.

MINSIZE <=value | SAS-data-set>

specifies the minimum size value, which PROC SURVEYSELECT uses to adjust size measures before sample selection. When a size measure is less than the minimum size value, PROC SURVEYSELECT replaces that size measure with the minimum size value.

You can provide a single minimum size value for the entire sample selection, or you can provide stratum-level minimum size values by specifying a SAS-data-set. The minimum size values must be positive numbers.

You provide size measures by specifying the SIZE statement or by specifying the PPS option in the SAMPLINGUNIT statement.

Unless you specify a SAMPLINGUNIT statement, the OUT= data set includes a variable named AdjustedSize that contains the adjusted size measures.

If you use a SAMPLINGUNIT statement to define sampling units (clusters), PROC SURVEYSELECT adjusts the sampling unit sizes (not the observation sizes). If you specify a SIZE statement, the size of a sampling unit is the sum of the size measures of the observations in the unit. If you specify the SAMPLINGUNIT PPS option, the size of a sampling unit is the number of observations in the unit.

When you use a SAMPLINGUNIT statement, the OUT= data set includes a variable named UnitSize that contains the adjusted sampling unit size measures.

You can provide minimum size values by specifying one of the following forms:

MINSIZE

indicates that minimum size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the minimum size values in the data set variable named _MINSIZE_. For more information, see the section Secondary Input Data Set. You can specify only one secondary input data set in each invocation of PROC SURVEYSELECT.

MINSIZE=value

specifies a single minimum size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the minimum value to adjust size measures in all strata.

MINSIZE=SAS-data-set

names a SAS-data-set that contains stratum-level minimum size values. You should provide the minimum size values in the data set variable named _MINSIZE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the MINSIZE= data set must match the order of the groups in the DATA= input data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
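
The following sketch shows one way that you might request a single minimum size value. The data set Stores, the size measure Revenue, and the seed are hypothetical; the example assumes PPS selection without replacement and no SAMPLINGUNIT statement.

```sas
/* Hypothetical frame: 50 units whose size measures range from 1 to 50 */
data Stores;
   do StoreID = 1 to 50;
      Revenue = StoreID;   /* illustrative size measure */
      output;
   end;
run;

/* PPS selection: size measures that are less than 5 are replaced
   by 5 before sample selection */
proc surveyselect data=Stores method=pps sampsize=5 minsize=5
                  seed=1357 out=PpsSample;
   size Revenue;
run;
```

Because this example does not use a SAMPLINGUNIT statement, the adjusted size measures should appear in the OUT= data set variable AdjustedSize, as described earlier in this option's description.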

NMAX=n

specifies the maximum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the stratum sample size by multiplying the total number of units in the stratum by the specified sampling rate. If this sample size is greater than the value NMAX=n, PROC SURVEYSELECT selects only n units.

The maximum sample size n must be a positive integer. The NMAX= option is available only with the SAMPRATE= option, which you can specify for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). The NMAX= option is not available with METHOD=BERNOULLI, where the SAMPRATE= option specifies the constant inclusion probability.

NMIN=n

specifies the minimum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the stratum sample size by multiplying the total number of units in the stratum by the specified sampling rate. If this sample size is less than the value NMIN=n, PROC SURVEYSELECT selects n units.

The minimum sample size n must be a positive integer. The NMIN= option is available only with the SAMPRATE= option, which you can specify for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). The NMIN= option is not available with METHOD=BERNOULLI, where the SAMPRATE= option specifies the constant inclusion probability.
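
The following sketch illustrates how the NMIN= and NMAX= options bound the stratum sample sizes that are computed from a sampling rate. The frame, the stratum variable Region, and the seed are hypothetical, and the example assumes that you specify both options together in one invocation.

```sas
/* Hypothetical stratified frame: 400 units in Region A, 16 in Region B */
data Frame;
   length Region $1;
   do Region = 'A', 'B';
      NUnits = ifn(Region = 'A', 400, 16);
      do UnitID = 1 to NUnits;
         output;
      end;
   end;
   drop NUnits;
run;

/* 25% sampling rate with stratum sample sizes bounded by NMIN= and NMAX=:
   Region A: 400 * 0.25 = 100, which exceeds NMAX=50, so 50 units are selected
   Region B:  16 * 0.25 =   4, which is less than NMIN=5, so 5 units are selected */
proc surveyselect data=Frame method=srs samprate=0.25 nmin=5 nmax=50
                  seed=2468 out=BoundedSample;
   strata Region;
run;
```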

NOPRINT

suppresses the display of all output. You can use the NOPRINT option when you want only to create an output data set. This option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 23, Using the Output Delivery System.

OUT=SAS-data-set

names the output data set. If you omit the OUT= option, the data set is named DATAn, where n is the smallest integer that makes the name unique. If you request sample selection by specifying the METHOD= option, the output data set contains the observations that are selected for the sample. If you request sample allocation without sample selection by specifying the ALLOC= and NOSAMPLE options in the STRATA statement, the output data set contains the allocated sample sizes. If you request random assignment by specifying the GROUPS= option, the output data set contains all observations in the input data set together with their assigned group identification.

When PROC SURVEYSELECT selects a sample, the output data set contains the units that are selected, sample design information, and selection statistics. You can specify options that control the information to include in the output data set. For more information, see the section Sample Output Data Set and the descriptions of the following options: JTPROBS, OUTALL, OUTHITS, OUTSEED, OUTSIZE, and STATS.

By default, the sample output data set contains only those units that are selected for the sample. To include all observations from the input data set in the output data set, use the OUTALL option.

By default, the sample output data set includes one copy of each selected unit, even when a unit is selected more than once, which can occur when you use with-replacement or with-minimum-replacement selection methods. For with-replacement or with-minimum-replacement selection methods, the output data set includes a variable NumberHits that records the number of hits (selections) for each unit. To include a distinct copy of each selection in the output data set when the same unit is selected more than once, use the OUTHITS option.

When you specify the ALLOC= and NOSAMPLE options in the STRATA statement, PROC SURVEYSELECT allocates the total sample size among the strata but does not select a sample. In this case, the OUT= data set contains the allocated sample sizes. For more information, see the section Allocation Output Data Set.

When you specify the GROUPS= option, PROC SURVEYSELECT randomly assigns observations to groups; it does not select a sample. In this case, the OUT= data set contains all observations from the input data set and includes a variable named GroupID that identifies group assignments. For more information, see the section Random Assignment Output Data Set.

OUTALL <(ZEROSTRATA)>

includes all observations from the sampling frame in the OUT= output data set. By default, the output data set includes only those units selected for the sample. When you specify the OUTALL option, the output data set includes all observations in the sampling frame along with a variable (Selected) that indicates each observation’s selection status. The value of Selected is 1 for an observation that is selected or 0 for an observation that is not selected. For information about the contents of the output data set, see the section Sample Output Data Set.

The OUTALL option is available for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI) and for METHOD=POISSON.

If you specify a sample size of 0 for a stratum, PROC SURVEYSELECT omits this stratum from the sampling frame. By default, PROC SURVEYSELECT also omits this stratum from the output data set when you specify the OUTALL option. You can specify the OUTALL(ZEROSTRATA) option to include strata that have sample sizes of 0 in the output data set. For more information, see the description of the SAMPSIZE= option.
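
As a minimal sketch of the OUTALL option, the following statements keep every observation of a hypothetical frame in the output data set and then tabulate the Selected indicator. The data set names, the variable UnitID, and the seed are assumptions made for this example.

```sas
/* Hypothetical sampling frame of 100 units */
data Frame;
   do UnitID = 1 to 100;
      output;
   end;
run;

/* Simple random sample of 10 units; OUTALL keeps all 100 observations
   and adds the variable Selected (1 = selected, 0 = not selected) */
proc surveyselect data=Frame method=srs sampsize=10 outall
                  seed=99 out=FrameFlagged;
run;

/* Tabulate the selection indicator */
proc freq data=FrameFlagged;
   tables Selected;
run;
```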

OUTHITS

includes a distinct copy of each selected unit in the OUT= output data set when the same sampling unit is selected more than once. By default, the output data set contains a single copy of each unit selected, even when a unit is selected more than once, and the variable NumberHits records the number of hits (selections) for each unit. If you specify the OUTHITS option, the output data set contains m copies of a sampling unit for which NumberHits is m; for example, the output data set contains three copies of a unit that is selected three times (NumberHits is 3).

A sampling unit can be selected more than once by with-replacement and with-minimum-replacement selection methods, which include METHOD=URS, METHOD=BALBOOTSTRAP, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ. The OUTHITS option is available for these selection methods.

For information about the contents of the output data set, see the section Sample Output Data Set.
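
The following sketch requests unrestricted random sampling together with the OUTHITS option. The frame, the output data set name, and the seed are hypothetical. Because selection is with replacement, the requested sample size can exceed the number of units in the frame.

```sas
/* Hypothetical sampling frame of 20 units */
data Frame;
   do UnitID = 1 to 20;
      output;
   end;
run;

/* With-replacement selection of 30 units. With OUTHITS, a unit that is
   selected m times contributes m observations to the output data set
   instead of a single observation for which NumberHits is m */
proc surveyselect data=Frame method=urs sampsize=30 outhits
                  seed=4321 out=UrsSample;
run;
```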

OUTORDER=RANDOM <(option)>
OUTRANDOM <(option)>

randomly orders the sample observations in the OUT= output data set. If you specify a STRATA statement, the sample observations are randomly ordered within stratum groups. If you specify the REPS= option, the sample observations are randomly ordered within sample replicates (nested within stratum groups). This option does not affect the selection of the sample; it randomly orders the selected observations after sample selection is complete.

By default for most sample selection methods, the order of the sample observations in the OUT= data set corresponds to the order in the DATA= input data set. If you specify a CONTROL statement to sort the input data set for systematic selection methods, the order in the output data set is the same as the sorted order of the input data set. If you specify a sequential selection method, the order in the output data set is the selection order (which begins at a randomly chosen sampling unit). By default for METHOD=PPS, the order of the sample observations corresponds to the ascending order of the size measures.

The OUTORDER=RANDOM option is not available when you specify the GROUPS= option, a SAMPLINGUNIT statement, or the NOSAMPLE option in the STRATA statement.

You can specify the following option:

SEED=value

specifies the initial random number seed for randomly ordering the sample observations. To initialize random number generation, value must be a positive integer. If you do not specify this option or if you specify a value that is negative or 0, PROC SURVEYSELECT uses the existing random number stream that is used for sample selection. For more information, see the SEED= option and the section Random Number Generation.
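
A minimal sketch of the OUTORDER=RANDOM option and its SEED= suboption follows. The frame, the data set names, and both seed values are hypothetical. The first seed initializes sample selection, and the second seed initializes the random ordering of the selected observations.

```sas
/* Hypothetical sampling frame of 100 units */
data Frame;
   do UnitID = 1 to 100;
      output;
   end;
run;

/* Select 10 units, then randomly order them in the output data set.
   SEED=111 initializes sample selection; the SEED=222 suboption
   initializes the random ordering */
proc surveyselect data=Frame method=srs sampsize=10 seed=111
                  outorder=random(seed=222) out=Shuffled;
run;
```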

OUTSEED

includes the initial seed for each stratum in the OUT= output data set. The variable InitialSeed contains the stratum initial seeds. For information about the contents of the output data set, see the section Sample Output Data Set. The OUTSEED option is not available when you specify the STRATUMSEED=NONE option for a stratified sample.

To reproduce the same sample for any stratum in a subsequent execution of PROC SURVEYSELECT, you can specify the same stratum initial seed in the SEED=SAS-data-set option together with the same sample selection parameters. For more information, see the section Random Number Generation.

The "Sample Selection Summary" table displays the initial random number seed for the entire sample selection, which is the initial seed for the first stratum when the design is stratified. To reproduce the entire sample, you can specify this same seed value in the SEED= option, along with the same sample selection parameters.
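
The following two-step sketch shows one way that you might save stratum initial seeds and reuse them. All data set names, the stratum variable Region, and the seed value are hypothetical. The PROC SORT step is an assumption about how to reduce the sample to one observation per stratum so that it can serve as a secondary input data set.

```sas
/* Hypothetical stratified frame: 50 units in each of two regions */
data Frame;
   length Region $1;
   do Region = 'A', 'B';
      do UnitID = 1 to 50;
         output;
      end;
   end;
run;

/* Step 1: select the sample and store stratum initial seeds (InitialSeed) */
proc surveyselect data=Frame method=srs sampsize=5 seed=31415
                  outseed out=Sample1;
   strata Region;
run;

/* Keep one observation per stratum with its initial seed */
proc sort data=Sample1(keep=Region InitialSeed) out=StratumSeeds nodupkey;
   by Region;
run;

/* Step 2: rerun with the saved stratum seeds and the same parameters */
proc surveyselect data=Frame method=srs sampsize=5 seed=StratumSeeds
                  out=Sample2;
   strata Region;
run;
```

According to the description above, Sample2 should reproduce the units that are selected in Sample1 when the input data set and sample selection parameters are unchanged.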

OUTSIZE

includes additional design and sampling frame information in the OUT= output data set.

If you use a STRATA statement, the OUTSIZE option provides stratum-level values in the output data set. Otherwise, the OUTSIZE option provides overall values.

The OUTSIZE option includes the sample size or sampling rate in the output data set, depending on whether you specify the SAMPSIZE= option or the SAMPRATE= option, respectively. For PPS selection methods, the OUTSIZE option includes the total size measure in the output data set. If you do not provide size measures, or if you specify a SAMPLINGUNIT statement, the OUTSIZE option includes the total number of sampling units in the output data set.

If you request size measure adjustment or certainty selection, the OUTSIZE option includes the following information in the output data set: the minimum size measure if you specify the MINSIZE= option, the maximum size measure if you specify the MAXSIZE= option, the certainty size measure if you specify the CERTSIZE= option, and the certainty proportion if you specify the CERTSIZE=P= option.

For METHOD=BERNOULLI, the OUTSIZE option includes the following information in the output data set: total number of sampling units, selection probability (sampling rate), expected sample size, and actual sample size. See the section Bernoulli Sampling for descriptions of these statistics.

For more information about the contents of the output data set, see the section Sample Output Data Set.

If you specify the GROUPS= option for random assignment, the OUTSIZE option adds the following information to the output data set: total number of units, number of groups, and number of units in the group. For more information, see the section Random Assignment Output Data Set.

OUTSORT=SAS-data-set

names an output data set to store the sorted input data set. This option is available when you specify a CONTROL statement to sort the DATA= input data set for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ).

If you specify CONTROL variables but do not name an output data set in the OUTSORT= option, the sorted data set replaces the input data set.
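
The following sketch uses the OUTSORT= option so that the sorted frame is written to a separate data set and the input data set is left in its original order. The data set names, the CONTROL variables Region and Employees, and the seed are hypothetical.

```sas
/* Hypothetical frame with two illustrative control variables */
data Frame;
   do UnitID = 1 to 200;
      Region    = mod(UnitID, 4) + 1;
      Employees = mod(UnitID * 7, 13);
      output;
   end;
run;

/* Systematic selection from the frame sorted by the CONTROL variables.
   OUTSORT= stores the sorted frame in FrameSorted, so the sorted
   data set does not replace Frame */
proc surveyselect data=Frame method=sys samprate=0.1 seed=777
                  outsort=FrameSorted out=SysSample;
   control Region Employees;
run;
```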

RANUNI

requests uniform random number generation by the method of Fishman and Moore (1982), which is the random number generator that the RANUNI function provides. For more information, see the section Random Number Generation. For information about the RANUNI function, see SAS Functions and CALL Routines: Reference.

By default, PROC SURVEYSELECT uses the Mersenne twister random number generator (Matsumoto and Nishimura 1998). The Mersenne twister random number generator has a very long period and good statistical properties. This is the random number generator that the RAND function provides for the uniform distribution. For information about the RAND function, see SAS Functions and CALL Routines: Reference.

If you use the RANUNI option to select a sample, you can reproduce this same sample by specifying the same SEED= value together with the RANUNI option (for the same input data set and sample selection parameters).

In releases before SAS/STAT 12.1, PROC SURVEYSELECT used the RANUNI random number generator by default. To reproduce samples from releases before SAS/STAT 12.1, you can specify the same SEED= value together with the RANUNI option (for the same input data set and sample selection parameters).

REPS=nreps <(option)>

requests replicated sampling and specifies the number of replicate samples. The value of nreps must be a positive integer.

When you specify this option, PROC SURVEYSELECT selects nreps independent replicate samples. Each replicate sample is selected by using the same sample size (or sampling rate) and design parameters that you specify.

By default, the variable named Replicate in the OUT= data set contains replicate identification numbers for the sample observations. You can specify a different name for the replicate identification variable in the REPNAME= suboption.

You can use replicated sampling to provide a simple method of variance estimation for any form of statistic and to evaluate variable nonsampling errors such as interviewer differences. For more information, see Lohr (2010), Wolter (2007), Kish (1965), Kish (1987), and Kalton (1983). You can also use the REPS= option to perform a variety of other resampling and simulation tasks. For more information, see Cassell (2007).

You can specify the following option:

REPNAME=name

specifies the name of the replicate identification variable in the OUT= data set. By default, the variable name is Replicate. For more information, see the section Sample Output Data Set.
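
As a minimal sketch of replicated sampling, the following statements request five replicate samples and rename the replicate identification variable. The frame, the data set names, the variable name SampleID, and the seed are hypothetical.

```sas
/* Hypothetical sampling frame of 100 units */
data Frame;
   do UnitID = 1 to 100;
      output;
   end;
run;

/* Five independent replicate samples of 10 units each. The REPNAME=
   suboption names the identification variable SampleID instead of
   the default name Replicate */
proc surveyselect data=Frame method=srs sampsize=10
                  reps=5(repname=SampleID) seed=8642 out=RepSamples;
run;
```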

ROUND=type

specifies the type of rounding to use when the sampling rate is converted to a positive, integer-valued sample size. This option is available when you specify the SAMPRATE= option for one of the following equal probability selection methods: METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ.

For these selection methods, PROC SURVEYSELECT converts the sampling rate that you specify in the SAMPRATE= option to an integer-valued sample size before selecting the sample. PROC SURVEYSELECT multiplies the total number of units (in the stratum or in the data set) by the sampling rate to obtain the target sample size. If the target sample size is not an integer, the procedure rounds this number to a positive integer value. By default, PROC SURVEYSELECT rounds the target sample size up to the next integer (ROUND=UP).

To provide positive selection probabilities for all sampling units, all stratum sample sizes must be greater than 0. Therefore, when a target sample size is less than 1, it is always rounded up to 1 (instead of down to 0), regardless of the ROUND=type.

You can specify one of the following types:

ALTERNATE

alternates the rounding direction (up or down) by strata. This option rounds up for the first stratum, down for the second stratum, up for the third stratum, and so on, where the strata are processed in the order in which they appear in the input data set. The alternating sequence skips strata for which the target sample size is an integer (and therefore does not require rounding). This option has no effect unless you request stratified sampling by specifying a STRATA statement.

DOWN
FLOOR

rounds down to the largest integer that does not exceed the target sample size.

NEAREST <(HALF=DOWN)>

rounds the target sample size to the nearest integer. This option rounds up when the fractional part of the target sample size is greater than 0.5 and rounds down when the fractional part is less than 0.5. By default, the sample size is rounded up when the fractional part is exactly 0.5. When you specify ROUND=NEAREST(HALF=DOWN), the sample size is rounded down when the fractional part is 0.5.

RANDOM

determines the rounding direction (up or down) randomly. Each rounding direction is equally likely (with probability of 0.5).

UP
CEILING

rounds up to the smallest integer that is not less than the target sample size. This option is the default.
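
The following sketch illustrates how the rounding types differ for a target sample size that is not an integer. The frame of 37 units, the data set names, and the seed are hypothetical. The values in the comment follow from the definitions of the rounding types above.

```sas
/* Hypothetical sampling frame of 37 units */
data Frame;
   do UnitID = 1 to 37;
      output;
   end;
run;

/* Target sample size: 37 * 0.25 = 9.25
   ROUND=UP (default) gives 10
   ROUND=DOWN         gives  9
   ROUND=NEAREST      gives  9 (fractional part 0.25 is less than 0.5) */
proc surveyselect data=Frame method=srs samprate=0.25 round=nearest
                  seed=1618 out=Rounded;
run;
```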

SAMPRATE=value | (values) | SAS-data-set
RATE=value | (values) | SAS-data-set

specifies the sampling rate, which is the proportion of units to select for the sample. You can provide a single sampling rate value for the entire sample selection, or you can provide stratum sampling rates by specifying values or a SAS-data-set.

The sampling rate value must be a positive number. The stratum sampling rate values and the stratum sampling rates that you provide in the SAS-data-set must be nonnegative numbers. You can specify a sampling rate as a number between 0 and 1. Alternatively, you can specify a rate in percentage form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. The procedure treats the value 1 as 100% instead of 1%.

This option is available for the equal probability selection methods, as follows:

  • For METHOD=SYS (systematic random sampling), PROC SURVEYSELECT computes the selection interval as the inverse of the sampling rate. For more information, see the section Systematic Random Sampling.

  • For METHOD=BERNOULLI (Bernoulli sampling), the procedure uses the sampling rate as the inclusion probability. For more information, see the section Bernoulli Sampling.

  • For the other equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ), the procedure converts the sampling rate to an integer-valued sample size before selecting the sample.

To convert the sampling rate to an integer-valued sample size, PROC SURVEYSELECT multiplies the total number of units (in the stratum or data set) by the sampling rate that you specify to obtain the target sample size. If this number is not an integer, the target sample size is rounded to a positive integer. By default, PROC SURVEYSELECT rounds the target sample size up to the next integer. You can specify other types of rounding by using the ROUND= option.

You cannot specify the SAMPRATE= option together with the SAMPSIZE= option.

You can provide sampling rates by specifying one of the following forms:

SAMPRATE=value
RATE=value

specifies a single sampling rate value, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the rate value for all strata.

SAMPRATE=(values)
RATE=(values)

specifies a list of stratum sampling rate values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. The number of stratum sampling rate values should equal the number of strata in the input data set.

The order of the stratum sampling rate values must match the order of the stratum groups in the DATA= input data set. When you specify a list of values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.

The stratum sampling rate values must be nonnegative numbers. If you specify a stratum sampling rate of 0, PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.

SAMPRATE=SAS-data-set
RATE=SAS-data-set

names a SAS-data-set that contains stratum sampling rates. You should provide the sampling rates in the data set variable named _RATE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the SAMPRATE= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

The stratum sampling rates must be nonnegative numbers. If you specify a stratum sampling rate of 0, PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.
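
The following sketch shows the list form and the data set form of the SAMPRATE= option for the same design. The frame, the stratum variable Region, the data set names, and the seed are hypothetical. The input data set is created in ascending order of Region, as the list form requires.

```sas
/* Hypothetical stratified frame: 400 units in Region A, 16 in Region B */
data Frame;
   length Region $1;
   do Region = 'A', 'B';
      NUnits = ifn(Region = 'A', 400, 16);
      do UnitID = 1 to NUnits;
         output;
      end;
   end;
   drop NUnits;
run;

/* Form 1: list of stratum rates in stratum order (A, then B) */
proc surveyselect data=Frame method=srs samprate=(0.25 0.5)
                  seed=11 out=RateList;
   strata Region;
run;

/* Form 2: the same rates in a secondary input data set,
   in the variable _RATE_ */
data Rates;
   length Region $1;
   input Region $ _RATE_;
   datalines;
A 0.25
B 0.5
;

proc surveyselect data=Frame method=srs samprate=Rates
                  seed=11 out=RateData;
   strata Region;
run;
```

The data set form keeps each rate next to its stratum value, which can be easier to maintain than a positional list when the design has many strata.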

SAMPSIZE=n | (values) | SAS-data-set
N=n | (values) | SAS-data-set

specifies the sample size, which is the number of units to select for the sample. You can provide a single sample size n for the entire sample selection, or you can provide stratum sample sizes by specifying values or a SAS-data-set.

The value of n must be a positive integer. The stratum sample size values and the stratum sample sizes that you provide in the SAS-data-set must be nonnegative numbers. For selection methods that select without replacement, the sample size must not exceed the total number of units in the data set (or the number of units in the stratum, if you specify a STRATA statement).

This option specifies the number of sampling units to select. If you do not specify a SAMPLINGUNIT statement, PROC SURVEYSELECT defines sampling units as observations and selects the number of observations that you specify. If you specify a SAMPLINGUNIT statement, PROC SURVEYSELECT defines sampling units as groups of observations (clusters) and selects the number of clusters that you specify.

If you specify SAMPSIZE=n and the ALLOC= option in the STRATA statement, PROC SURVEYSELECT allocates the sample size n among the strata according to the allocation method that you request. For more information, see the section Sample Size Allocation. You cannot specify SAMPSIZE=values or SAMPSIZE=SAS-data-set when you use the ALLOC= option. You cannot specify SAMPSIZE= with the MARGIN= option, which determines stratum sample sizes that provide the specified margin of error. For more information, see the section Specifying the Margin of Error.

You cannot specify both the SAMPSIZE= option and the SAMPRATE= option.

You can provide sample size values by specifying one of the following forms:

SAMPSIZE=n
N=n

specifies a single sample size value n, which must be a positive integer. If you request a stratified sample design, PROC SURVEYSELECT selects n units from each stratum (unless you also specify the ALLOC= option in the STRATA statement, which allocates the total sample size among the strata).

For methods that select without replacement, the sample size n must not exceed the number of units in the stratum unless you also specify the SELECTALL option. If you specify the SELECTALL option, PROC SURVEYSELECT selects all stratum units when the stratum sample size exceeds the total number of units in the stratum.

SAMPSIZE=(values)
N=(values)

specifies a list of stratum sample size values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. The number of sample size values must equal the number of strata in the input data set.

The order of the stratum sample size values must match the order of the stratum groups in the DATA= input data set. When you specify a list of values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.

The values of the stratum sample sizes must be nonnegative numbers. If you specify a stratum sample size of 0, PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.

SAMPSIZE=SAS-data-set
N=SAS-data-set

names a SAS-data-set that contains stratum sample sizes. You should provide the sample sizes in the data set variable named _NSIZE_ or SampleSize. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the SAMPSIZE= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

The stratum sample sizes must be nonnegative numbers. If you specify a stratum sample size of 0, PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.
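
The following sketch shows the list form and the data set form of the SAMPSIZE= option. The frame, the stratum variable Region, the data set names, and the seed are hypothetical.

```sas
/* Hypothetical stratified frame: 400 units in Region A, 16 in Region B */
data Frame;
   length Region $1;
   do Region = 'A', 'B';
      NUnits = ifn(Region = 'A', 400, 16);
      do UnitID = 1 to NUnits;
         output;
      end;
   end;
   drop NUnits;
run;

/* Form 1: list of stratum sample sizes in stratum order (A, then B) */
proc surveyselect data=Frame method=srs sampsize=(20 4)
                  seed=22 out=SizeList;
   strata Region;
run;

/* Form 2: the same sample sizes in a secondary input data set,
   in the variable _NSIZE_ */
data Sizes;
   length Region $1;
   input Region $ _NSIZE_;
   datalines;
A 20
B 4
;

proc surveyselect data=Frame method=srs sampsize=Sizes
                  seed=22 out=SizeData;
   strata Region;
run;
```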

SEED <=value | SAS-data-set>

specifies the initial seed for random number generation. You can provide a single seed value for the entire sample selection, or you can provide stratum initial seeds by specifying a SAS-data-set. To initialize random number generation, a seed must be a positive integer. If you do not specify this option, or if you specify an initial seed that is negative or 0, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain an initial seed. For more information, see the section Random Number Generation.

PROC SURVEYSELECT displays the value of the initial seed in the "Sample Selection Summary" table. To reproduce the same sample in a subsequent execution of PROC SURVEYSELECT, you can specify the same initial seed in the SEED= option (for the same input data set and sample selection parameters).

If you specify a STRATA statement, you can provide stratum initial seeds in a SAS-data-set. If you do not provide stratum initial seeds and if you use the RANUNI random number generator, PROC SURVEYSELECT generates random numbers continuously across strata (from a single pseudorandom number stream). You can specify the OUTSEED option to include the stratum initial seeds in the output data set.

If you do not provide stratum initial seeds and if you use the (default) Mersenne twister random number generator, PROC SURVEYSELECT generates separate pseudorandom number streams for the strata by default. You can specify the STRATUMSEED=NONE option to use a single Mersenne twister stream across all strata. For more information, see the STRATUMSEED= option.

You can provide initial seeds by specifying one of the following forms:

SEED

indicates that stratum initial seeds are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the initial seeds in the data set variable named _SEED_ or InitialSeed. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

SEED=value

specifies a single initial seed value for random number generation. To initialize random number generation, the value must be a positive integer.

SEED=SAS-data-set

names a SAS-data-set that contains stratum initial seeds. You should provide the stratum initial seeds in the data set variable named _SEED_ or InitialSeed. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the SEED= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

The OUTSEED option includes the stratum initial seeds in the OUT= output data set. You can reproduce the same sample in a subsequent execution of PROC SURVEYSELECT by specifying the same stratum initial seeds (for the same input data set and sample selection parameters). If you need to reproduce the same sample for only a subset of the strata, you can use the same initial seeds for the strata in the subset.
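
Because you can name only one secondary input data set, one approach is to store stratum sample sizes and stratum initial seeds in the same data set and to specify the SEED option without a value. The following sketch assumes a hypothetical design data set named Design; the frame, the variable names other than _NSIZE_ and _SEED_, and the seed values are also hypothetical.

```sas
/* Hypothetical stratified frame: 400 units in Region A, 16 in Region B */
data Frame;
   length Region $1;
   do Region = 'A', 'B';
      NUnits = ifn(Region = 'A', 400, 16);
      do UnitID = 1 to NUnits;
         output;
      end;
   end;
   drop NUnits;
run;

/* One secondary input data set that carries both the stratum
   sample sizes (_NSIZE_) and the stratum initial seeds (_SEED_) */
data Design;
   length Region $1;
   input Region $ _NSIZE_ _SEED_;
   datalines;
A 20 1001
B 4 2002
;

/* SAMPSIZE= names the secondary data set; SEED without a value
   indicates that the seeds are read from that same data set */
proc surveyselect data=Frame method=srs sampsize=Design seed
                  out=SeededSample;
   strata Region;
run;
```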

SELECTALL

requests that PROC SURVEYSELECT select all stratum units when the stratum sample size exceeds the total number of units in the stratum. By default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units in the stratum unless you are using a with-replacement selection method.

The SELECTALL option is available for the following without-replacement selection methods: METHOD=SRS, METHOD=SYS, METHOD=SEQ, METHOD=PPS, and METHOD=PPS_SAMPFORD.

The SELECTALL option is not available for with-replacement selection methods, with-minimum-replacement methods, or those PPS methods that select two units per stratum.
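
The following sketch illustrates the SELECTALL option for a design in which one stratum has fewer units than the requested sample size. The frame, the stratum variable Region, the data set names, and the seed are hypothetical.

```sas
/* Hypothetical stratified frame: 400 units in Region A, 16 in Region B */
data Frame;
   length Region $1;
   do Region = 'A', 'B';
      NUnits = ifn(Region = 'A', 400, 16);
      do UnitID = 1 to NUnits;
         output;
      end;
   end;
   drop NUnits;
run;

/* SAMPSIZE=20 exceeds the 16 units in Region B. With SELECTALL, all
   16 units in Region B are selected, and 20 units are selected
   from Region A */
proc surveyselect data=Frame method=srs sampsize=20 selectall
                  seed=5 out=SelectAllSample;
   strata Region;
run;
```

If you omit the SELECTALL option from this example, the stratum sample size for Region B is not allowed, as described above.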

SORT=NEST | SERP

specifies the type of sorting to perform when you specify a CONTROL statement for systematic or sequential sample selection.

SORT=NEST requests nested sorting, and SORT=SERP requests hierarchic serpentine sorting. For more information, see the section Sorting by CONTROL Variables.

By default, SORT=SERP. When there is only one CONTROL variable, the two types of sorting are equivalent.

The SORT= option is available when you specify a CONTROL statement for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). When you specify a CONTROL statement, PROC SURVEYSELECT sorts the input data set by the CONTROL variables within strata before selecting the sample.

When you specify a CONTROL statement, you can use the OUTSORT= option to name an output data set that contains the sorted input data set. If you do not specify the OUTSORT= option, the sorted data set replaces the input data set.

The SORT= option and the CONTROL statement are not available when you specify a SAMPLINGUNIT statement.
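The following sketch shows one way to request nested sorting for systematic selection. It assumes the SASHELP.CARS sample data set is available and uses its variables Origin and Type as CONTROL variables; the output data set names, sample size, and seed are arbitrary. The OUTSORT= option is specified so that the sorted data are written to a new data set instead of replacing the input data set.

```sas
/* Systematic selection after nested sorting by two CONTROL variables. */
proc surveyselect data=sashelp.cars method=sys sampsize=20
                  seed=12345 sort=nest
                  outsort=work.cars_sorted  /* sorted copy of the input */
                  out=work.cars_sample;     /* selected units           */
   control Origin Type;
run;

/* Inspect the sort order that the procedure used. */
proc print data=work.cars_sorted(obs=10);
   var Origin Type Make Model;
run;
```

Replacing SORT=NEST with SORT=SERP (or omitting the option) requests hierarchic serpentine sorting, so comparing the two versions of Work.Cars_Sorted is a simple way to see how the orderings differ.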

STATS

includes the selection probability and sampling weight in the OUT= output data set for equal probability selection methods when you do not specify a STRATA statement. By default, the output data set does not include these values for equal probability selection methods unless you specify a STRATA statement. The STATS option applies to the following selection methods: METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, METHOD=BALBOOTSTRAP, and METHOD=BERNOULLI.

In addition to the selection probability and sampling weight, the STATS option includes the following statistics in the output data set for METHOD=BERNOULLI: total number of sampling units, expected sample size, actual sample size, and adjusted sampling weight. For more information, see the section Bernoulli Sampling.

For PPS selection methods, the output data set contains selection probabilities and sampling weights by default. The STATS option has no effect for PPS methods.

For more information about the contents of the output data set, see the section Sample Output Data Set.
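For example, the following sketch selects an unstratified simple random sample and uses the STATS option to add the selection probability and sampling weight to the output data set. It assumes that SASHELP.CLASS is available; the output data set name, sample size, and seed are hypothetical.

```sas
/* Unstratified SRS: STATS adds SelectionProb and SamplingWeight. */
proc surveyselect data=sashelp.class method=srs sampsize=5
                  seed=12345 stats out=work.srs_stats;
run;

/* Display the added variables for the selected units. */
proc print data=work.srs_stats;
   var Name SelectionProb SamplingWeight;
run;
```

Because SASHELP.CLASS contains 19 observations, each selected unit should have a selection probability of 5/19 and a sampling weight of 19/5.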

STRATUMSEED=NONE | RESTORE

controls stratum initial seeds for the Mersenne twister random number generator, which PROC SURVEYSELECT uses by default unless you specify the RANUNI option.

By default, PROC SURVEYSELECT uses separate, independent pseudorandom number streams for the strata. You can store the stratum initial seeds in the output data set by specifying the OUTSEED option. You can reproduce the entire sample (over all strata) by specifying the same initial seed in the SEED= option. You can also reproduce individual stratum samples by specifying the corresponding stratum initial seeds in the SEED= option.

The STRATUMSEED= option applies only to stratified selection, which you request by specifying a STRATA statement. The STRATUMSEED= option applies only to the Mersenne twister random number generator; you cannot specify this option together with the RANUNI option (which requests the RANUNI random number generator).

You cannot specify the STRATUMSEED= option when you provide stratum initial seeds by using the SEED=SAS-data-set option.

For more information, see the section Random Number Generation.

You can specify one of the following keywords:

NONE

uses a single pseudorandom number stream across all strata instead of separate pseudorandom number streams for the strata.

When you specify this option, stratum-level initial seeds are not available; you can reproduce the entire sample (over all strata), but you cannot reproduce individual stratum samples separately (apart from the overall sample). To reproduce an entire sample that the procedure selects when you specify the STRATUMSEED=NONE option, you can specify the same SEED= value, input data set, and selection parameters (along with the STRATUMSEED=NONE option). The OUTSEED option, which stores stratum initial seeds in the output data set, is not available when you specify STRATUMSEED=NONE.

RESTORE

reproduces the stratum initial seeds that PROC SURVEYSELECT uses in releases before SAS/STAT 14.1 for the Mersenne twister random number generator. To reproduce a stratified sample that PROC SURVEYSELECT selects in releases before SAS/STAT 14.1, you can specify STRATUMSEED=RESTORE along with the same SEED= value, input data set, and selection parameters.
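The following sketch shows how the STRATUMSEED= option might be specified for a stratified sample. It assumes that SASHELP.CLASS is available; the data set names, seed, and sample size are arbitrary. The OUTSEED option is omitted because it is not available with STRATUMSEED=NONE.

```sas
/* Sort the input by the stratification variable, as STRATA requires. */
proc sort data=sashelp.class out=work.class;
   by Sex;
run;

/* Use a single pseudorandom number stream across all strata. */
proc surveyselect data=work.class method=srs sampsize=3
                  seed=20221209 stratumseed=none
                  out=work.sample_single;
   strata Sex;
run;
```

Rerunning the same statements with the same SEED= value should reproduce the entire sample. Specifying STRATUMSEED=RESTORE in place of STRATUMSEED=NONE follows the same pattern when the goal is to match a stratified sample that was selected in a release before SAS/STAT 14.1.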

Last updated: December 09, 2022