The SURVEYSELECT Procedure

PPS Sampling without Replacement

When you specify the METHOD=PPS option, PROC SURVEYSELECT selects units with probability proportional to size and without replacement. The selection probability for unit $i$ in stratum $h$ is $n_h Z_{hi}$, where $n_h$ is the sample size for stratum $h$ and $Z_{hi}$ is the relative size of unit $i$ in stratum $h$. The relative size $Z_{hi}$ is computed as $M_{hi} / M_{h\cdot}$, which is the ratio of the size measure of unit $i$ in stratum $h$ to the total of all size measures in stratum $h$.

Because selection probabilities cannot exceed 1, the relative size for each unit must not exceed $1 / n_h$ for METHOD=PPS. This requirement can be expressed as $Z_{hi} \le 1 / n_h$, or equivalently as $M_{hi} \le M_{h\cdot} / n_h$. If your size measures do not meet this requirement, you can adjust the size measures by using the MAXSIZE= or MINSIZE= option. Or you can request certainty selection for the larger units by using the CERTSIZE= or CERTSIZE=P= option. Alternatively, you can use a selection method that does not have this relative size restriction, such as PPS with minimum replacement (METHOD=PPS_SEQ).
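The relative-size computation and the size restriction can be illustrated outside of SAS. The following Python sketch is not part of PROC SURVEYSELECT; the function names and the size measures are hypothetical, and integer size measures are assumed so that exact fractions can be used. It computes the relative sizes and selection probabilities for one stratum and lists the units whose size measure exceeds the stratum total divided by the sample size.

```python
from fractions import Fraction

def pps_selection_probabilities(sizes, n):
    """Return (relative sizes Z_i, selection probabilities n * Z_i) for one stratum."""
    total = sum(sizes)                          # M, the total of all size measures
    rel = [Fraction(m, total) for m in sizes]   # Z_i = M_i / M
    probs = [n * z for z in rel]                # selection probability n * Z_i
    return rel, probs

def units_violating_size_limit(sizes, n):
    """Indices of units with M_i > M / n, whose selection probability would exceed 1."""
    total = sum(sizes)
    return [i for i, m in enumerate(sizes) if m * n > total]

# Hypothetical size measures for a single stratum
sizes = [10, 20, 30, 40]
rel, probs = pps_selection_probabilities(sizes, n=2)
print([float(p) for p in probs])
print(units_violating_size_limit(sizes, n=2))
print(units_violating_size_limit(sizes, n=3))
```

With these illustrative values, a sample size of 2 satisfies the restriction for every unit, whereas a sample size of 3 does not for the largest unit (40 > 100/3). In that situation the options described above (MAXSIZE=, MINSIZE=, CERTSIZE=, CERTSIZE=P=, or METHOD=PPS_SEQ) would apply.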

PROC SURVEYSELECT performs PPS selection by using the Hanurav-Vijayan algorithm. Hanurav (1967) introduced this algorithm for the selection of two units per stratum, and Vijayan (1968) generalized it for the selection of more than two units. This algorithm enables computation of joint selection probabilities and provides joint selection probability values that usually ensure nonnegativity and stability of the Sen-Yates-Grundy variance estimator. For more information, see Fox (1989), Golmant (1990), and Watts (1991).

The notation in the remainder of this section drops the stratum subscript $h$ for simplicity. If you specify a stratified design, $n$ now denotes the sample size for the current stratum, $N$ denotes the stratum population size, $M_i$ denotes the size measure for unit $i$ in the stratum, and $M$ denotes the total of size measures in the stratum. For a stratified design, PROC SURVEYSELECT selects samples independently within strata by using the same selection method in each stratum.

PROC SURVEYSELECT performs the Hanurav-Vijayan selection algorithm as described by Fox (1989, p. 169). For the definition of $P_k^{(i)}$, see Golmant (1990). The sampling units are first sorted in ascending order by size measure so that $M_1 \le M_2 \le \cdots \le M_N$. The procedure then selects a PPS sample of $n$ units as follows:

  1. The procedure randomly chooses one of the integers $1, 2, \ldots, n$ with probability $\theta_1, \theta_2, \ldots, \theta_n$, where

    $$\theta_i = n \left( Z_{N-n+i+1} - Z_{N-n+i} \right) \left( T + i Z_{N-n+1} \right) / T$$

    where $Z_j = M_j / M$ and

    $$T = \sum_{j=1}^{N-n} Z_j$$

    By definition, $Z_{N+1} = 1/n$ to ensure that $\sum_{i=1}^{n} \theta_i = 1$.

  2. If the integer $i$ is selected in step 1, the procedure includes the last $(n - i)$ units in the sample (where the units are ordered by their size measures). The procedure then selects the remaining $i$ units by following steps 3 through 6.

  3. The procedure defines new normed size measures for the remaining $(N - n + i)$ units that were not selected in steps 1 and 2:

    $$Z_j^*(i) = \begin{cases} Z_j / \left( T + i Z_{N-n+1} \right) & \text{for } j = 1, \ldots, N-n+1 \\ Z_{N-n+1} / \left( T + i Z_{N-n+1} \right) & \text{for } j = N-n+2, \ldots, N-n+i \end{cases}$$

  4. The procedure selects the next unit from the first $(N - n + 1)$ units with probability proportional to $a_j(1)$, where

    $$\begin{aligned} a_1(1) &= i Z_1^*(i) \\ a_j(1) &= i Z_j^*(i) \prod_{k=1}^{j-1} \left( 1 - (i-1) P_k^{(i)} \right) \quad \text{for } j = 2, \ldots, N-n+1 \end{aligned}$$

    and

    $$P_k^{(i)} = M_k / \left( M_{k+1} + M_{k+2} + \cdots + M_{N-n+i} \right)$$

  5. Where $j_1$ denotes the unit that is selected in step 4, the procedure selects the next unit from units $(j_1 + 1)$ through $(N - n + 2)$ with probability proportional to $a_j(2, j_1)$, where

    $$\begin{aligned} a_{j_1+1}(2, j_1) &= (i-1) Z_{j_1+1}^*(i) \\ a_j(2, j_1) &= (i-1) Z_j^*(i) \prod_{k=j_1+1}^{j-1} \left( 1 - (i-2) P_k^{(i)} \right) \quad \text{for } j = j_1+2, \ldots, N-n+2 \end{aligned}$$

  6. The procedure repeats step 5 until all $n$ sample units are selected.
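Steps 1 through 3 can be sketched in Python to make the indexing concrete. This is an illustrative sketch only, not the procedure's implementation: the function names (`hv_setup`, `hv_normed_sizes`, `hv_steps_1_to_3`) and the size measures are hypothetical, integer size measures are assumed so that exact fractions can be used, and the sequential draws of steps 4 through 6 are not covered.

```python
from fractions import Fraction
import random

def hv_setup(sizes, n):
    """Sorted relative sizes Z_1..Z_{N+1}, the sum T, and theta_1..theta_n (step 1)."""
    N, total = len(sizes), sum(sizes)
    if not 0 < n < N:
        raise ValueError("this sketch assumes 0 < n < N so that T > 0")
    Z = [Fraction(m, total) for m in sorted(sizes)]   # Z[j-1] holds Z_j
    Z.append(Fraction(1, n))                          # Z_{N+1} = 1/n by definition
    if Z[N - 1] > Z[N]:
        raise ValueError("largest relative size exceeds 1/n")
    T = sum(Z[:N - n])                                # Z_1 + ... + Z_{N-n}
    z_c = Z[N - n]                                    # Z_{N-n+1}
    theta = [n * (Z[N - n + i] - Z[N - n + i - 1]) * (T + i * z_c) / T
             for i in range(1, n + 1)]
    return Z, T, theta

def hv_normed_sizes(Z, T, n, i):
    """Step 3: Z*_j(i) for the N - n + i units not taken in step 2."""
    N = len(Z) - 1
    z_c = Z[N - n]
    denom = T + i * z_c
    head = [Z[j] / denom for j in range(N - n + 1)]   # j = 1, ..., N-n+1
    tail = [z_c / denom] * (i - 1)                    # j = N-n+2, ..., N-n+i
    return head + tail

def hv_steps_1_to_3(sizes, n, rng=random):
    """Draw i (step 1), take the last n - i units (step 2), norm the rest (step 3)."""
    Z, T, theta = hv_setup(sizes, n)
    N = len(sizes)
    i = rng.choices(range(1, n + 1), weights=[float(t) for t in theta])[0]
    order = sorted(range(N), key=lambda k: sizes[k])  # original indices, ascending size
    certain = order[N - n + i:]                       # the last n - i units
    return i, certain, hv_normed_sizes(Z, T, n, i)

# Hypothetical size measures for a single stratum
sizes = [40, 10, 30, 20]
Z, T, theta = hv_setup(sizes, n=2)
print(theta, sum(theta))
i, certain, z_star = hv_steps_1_to_3(sizes, n=2, rng=random.Random(1))
print(i, certain, sum(z_star))
```

For these illustrative sizes and a sample size of 2, the step 1 probabilities work out to 2/5 and 3/5, which sum to 1 as the definition of $Z_{N+1}$ requires. The normed size measures from step 3 also sum to 1 for either value of $i$, because the $N - n + 1$ smallest relative sizes plus $(i - 1)$ copies of $Z_{N-n+1}$ total $T + i Z_{N-n+1}$.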

If you specify the JTPROBS option, PROC SURVEYSELECT computes the joint selection probabilities for all pairs of selected units in each stratum. The joint selection probability for units $i$ and $j$ is

$$P_{(ij)} = \sum_{r=1}^{n} \theta_r K_{ij}^{(r)}$$

where

$$K_{ij}^{(r)} = \begin{cases} 1 & N-n+r < i \le N-1 \\ r Z_{N-n+1} / \left( T + r Z_{N-n+1} \right) & N-n < i \le N-n+r, \; j > N-n+r \\ r Z_i / \left( T + r Z_{N-n+1} \right) & 1 \le i \le N-n, \; j > N-n+r \\ \pi_{ij}^{(r)} & j \le N-n+r \end{cases}$$

$$\pi_{ij}^{(r)} = r (r-1) P_i^{(r)} Z_j^*(r) \prod_{k=1}^{i-1} \left( 1 - P_k^{(r)} \right)$$

$$P_k^{(r)} = M_k / \left( M_{k+1} + M_{k+2} + \cdots + M_{N-n+r} \right)$$
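Joint selection probabilities matter mainly because the Sen-Yates-Grundy variance estimator uses them. The following Python sketch shows the standard form of that estimator for a Horvitz-Thompson total under a fixed-size design. It is an illustration only: the function name `syg_variance`, the input layout, and all numeric values are hypothetical and do not represent PROC SURVEYSELECT output.

```python
from itertools import combinations

def syg_variance(y, pi, pi_joint):
    """Sen-Yates-Grundy variance estimate for the Horvitz-Thompson total.

    y        : observed values for the sampled units
    pi       : selection probabilities of the sampled units
    pi_joint : dict mapping (i, j) with i < j to the joint selection
               probability of sampled units i and j
    """
    v = 0.0
    for i, j in combinations(range(len(y)), 2):
        pij = pi_joint[(i, j)]
        weight = (pi[i] * pi[j] - pij) / pij          # negative if pij > pi_i * pi_j
        v += weight * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return v

# Hypothetical two-unit sample with made-up probabilities
y = [12.0, 30.0]
pi = [0.4, 0.8]
pi_joint = {(0, 1): 0.25}
print(syg_variance(y, pi, pi_joint))
```

Each pairwise term is nonnegative only when the joint probability does not exceed the product of the two individual selection probabilities, which is the nonnegativity property that the Hanurav-Vijayan joint probabilities usually provide.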
Last updated: December 09, 2022