The SURVEYREG Procedure

Computational Details

Notation

For a stratified clustered sample design, observations are represented by an $n \times (p+2)$ matrix

$$(\mathbf{w}, \mathbf{y}, \mathbf{X}) = (w_{hij}, y_{hij}, \mathbf{x}_{hij})$$

where

  • $\mathbf{w}$ denotes the sampling weight vector

  • $\mathbf{y}$ denotes the dependent variable

  • $\mathbf{X}$ denotes the $n \times p$ design matrix. (When an effect contains only classification variables, the columns of $\mathbf{X}$ that correspond to this effect contain only 0s and 1s; no reparameterization is made.)

  • $h = 1, 2, \ldots, H$ is the stratum index

  • $i = 1, 2, \ldots, n_h$ is the cluster index within stratum $h$

  • $j = 1, 2, \ldots, m_{hi}$ is the unit index within cluster $i$ of stratum $h$

  • $p$ is the total number of parameters (including an intercept if the INTERCEPT effect is included in the MODEL statement)

  • $n = \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total number of observations in the sample

Also, $f_h$ denotes the sampling rate for stratum $h$. You can use the TOTAL= or RATE= option to input population totals or sampling rates. See the section Specification of Population Totals and Sampling Rates for details. If you input stratum totals, PROC SURVEYREG computes $f_h$ as the ratio of the stratum sample size to the stratum total. If you input stratum sampling rates, PROC SURVEYREG uses these values directly for $f_h$. If you do not specify the TOTAL= or RATE= option, then the procedure assumes that the stratum sampling rates $f_h$ are negligible, and a finite population correction is not used when computing variances.
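The rules for $f_h$ can be sketched in a few lines of Python. This is an illustration only, not SAS code and not the procedure's implementation: the function name `stratum_sampling_rates` and its arguments are hypothetical, a negligible rate is represented here as 0.0 (an assumption, chosen so that a factor such as $1 - f_h$ reduces to 1), and the sketch does not model what happens when both kinds of input are supplied.

```python
def stratum_sampling_rates(sample_sizes, totals=None, rates=None):
    """Return a dict of f_h values keyed by stratum (hypothetical helper).

    sample_sizes: dict stratum -> stratum sample size
    totals:       dict stratum -> stratum population total (plays the role of TOTAL=)
    rates:        dict stratum -> stratum sampling rate (plays the role of RATE=)
    """
    if totals is not None and rates is not None:
        # Limitation of this sketch, not a statement about the procedure.
        raise ValueError("this sketch accepts totals or rates, not both")
    if totals is not None:
        # f_h = stratum sample size / stratum total
        return {h: sample_sizes[h] / totals[h] for h in sample_sizes}
    if rates is not None:
        # Supplied rates are used directly.
        return {h: rates[h] for h in sample_sizes}
    # Neither input: rates treated as negligible (0.0 is this sketch's convention).
    return {h: 0.0 for h in sample_sizes}


sizes = {"A": 10, "B": 5}
print(stratum_sampling_rates(sizes, totals={"A": 100, "B": 20}))
print(stratum_sampling_rates(sizes))
```
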

Regression Coefficients

PROC SURVEYREG solves the normal equations $\mathbf{X}'\mathbf{W}\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{W}\mathbf{y}$ by using a modified sweep routine that produces a generalized (g2) inverse $(\mathbf{X}'\mathbf{W}\mathbf{X})^{-}$ and a solution (Pringle and Rayner 1971)

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-}\mathbf{X}'\mathbf{W}\mathbf{y}$$

where $\mathbf{W}$ is the diagonal matrix constructed from WEIGHT variable values.

For models with CLASS variables, there are more design matrix columns than there are degrees of freedom (df) for the effect. Thus, there are linear dependencies among the columns. In this case, the parameters are not estimable; there is an infinite number of least squares solutions. PROC SURVEYREG uses a generalized (g2) inverse to obtain values for the estimates. The solution values are not displayed unless you specify the SOLUTION option in the MODEL statement. The solution has the characteristic that estimates are zero whenever the design column for that parameter is a linear combination of previous columns. (In strict terms, the solution values should not be called estimates.) With this full parameterization, hypothesis tests are constructed to test linear functions of the parameters that are estimable.
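The behavior described above (a zero solution value for any column that is a linear combination of earlier columns) can be illustrated with a small sweep in Python. This is a simplified textbook sweep operator, not the modified routine that PROC SURVEYREG uses. The function name `sweep_solve`, the tolerance `tol`, and the singularity test are assumptions made for this sketch.

```python
def sweep_solve(X, w, y, tol=1e-8):
    """Solve X'WX b = X'Wy by sweeping; dependent columns get b = 0.

    X: list of rows (lists of floats), w: weights, y: responses.
    """
    p = len(X[0])
    m = p + 1
    # Augmented rows [x, y], so the cross-product matrix is
    # [[X'WX, X'Wy], [y'WX, y'Wy]].
    Z = [list(row) + [yi] for row, yi in zip(X, y)]
    A = [[sum(wi * z[r] * z[c] for wi, z in zip(w, Z)) for c in range(m)]
         for r in range(m)]
    diag0 = [A[k][k] for k in range(p)]  # original diagonal, for the pivot test
    for k in range(p):
        d = A[k][k]
        if abs(d) <= tol * diag0[k]:
            # Column k depends on earlier columns: zero its row and column,
            # which fixes the corresponding solution value at 0.
            for j in range(m):
                A[k][j] = 0.0
                A[j][k] = 0.0
            continue
        A[k] = [v / d for v in A[k]]
        for i in range(m):
            if i == k:
                continue
            b = A[i][k]
            A[i] = [A[i][j] - b * A[k][j] for j in range(m)]
            A[i][k] = -b / d
        A[k][k] = 1.0 / d
    # After sweeping, the last column holds the solution values.
    return [A[k][p] for k in range(p)]


# Intercept plus one two-level CLASS variable coded as two 0/1 columns.
X = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
w = [1.0, 3.0, 2.0, 2.0]
y = [2.0, 4.0, 6.0, 10.0]
print(sweep_solve(X, w, y))
```

In this example the third column equals the intercept column minus the second, so its solution value is set to zero. The intercept then equals the weighted mean of the second group, and the remaining value is the difference between the two weighted group means.
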

Design Effect

If you specify the DEFF option in the MODEL statement, PROC SURVEYREG calculates the design effects for the regression coefficients. The design effect of an estimate is the ratio of the actual variance to the variance computed under the assumption of simple random sampling:

$$\text{DEFF} = \frac{\text{variance under the sample design}}{\text{variance under simple random sampling}}$$

For more information, see Kish (1965, p. 258).

PROC SURVEYREG computes the numerator as described in the section Variance Estimation. The denominator is computed under the assumption that the sample design is simple random sampling, with no stratification and no clustering.

For Taylor series or bootstrap variance estimation, PROC SURVEYREG computes the overall sampling fraction $f_{\mathrm{SRS}}$ in the simple random sampling variance by using the value of the RATE= or TOTAL= option.

If you do not specify either of these options, PROC SURVEYREG assumes that the value of $f_{\mathrm{SRS}}$ is negligible and does not use a finite population correction in the analysis, as described in the section Specification of Population Totals and Sampling Rates.

If you specify RATE=value, PROC SURVEYREG uses this value as the overall sampling fraction $f_{\mathrm{SRS}}$. If you specify TOTAL=value, PROC SURVEYREG computes $f_{\mathrm{SRS}}$ as the ratio of the number of PSUs in the sample to the specified total.

If you specify stratum sampling rates by using the RATE=SAS-data-set option, then PROC SURVEYREG computes stratum totals based on these stratum sampling rates and the number of sample PSUs in each stratum. The procedure sums the stratum totals to form the overall total, and it computes $f_{\mathrm{SRS}}$ as the ratio of the number of sample PSUs to the overall total. Alternatively, if you specify stratum totals by using the TOTAL=SAS-data-set option, then PROC SURVEYREG sums these totals to compute the overall total. The overall sampling fraction $f_{\mathrm{SRS}}$ is then computed as the ratio of the number of sample PSUs to the overall total.
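The stratum-level cases above can be worked through in a short Python sketch. The function name `overall_sampling_fraction` and its arguments are hypothetical. The sketch assumes that a stratum total is recovered from a rate as (sample PSUs) / (rate), which is one reading of "computes stratum totals based on these stratum sampling rates and the number of sample PSUs", and it represents a negligible fraction as 0.0.

```python
def overall_sampling_fraction(n_psus, stratum_rates=None, stratum_totals=None):
    """Return the overall sampling fraction f_SRS (hypothetical helper).

    n_psus:         dict stratum -> number of sample PSUs
    stratum_rates:  dict stratum -> sampling rate (as in a RATE= data set)
    stratum_totals: dict stratum -> population total (as in a TOTAL= data set)
    """
    sample_psus = sum(n_psus.values())
    if stratum_rates is not None:
        if any(stratum_rates[h] <= 0 for h in n_psus):
            # Limitation of this sketch: zero rates are not handled here.
            raise ValueError("this sketch requires positive stratum rates")
        # Assumed back-calculation of each stratum total: n_h / f_h.
        overall_total = sum(n_psus[h] / stratum_rates[h] for h in n_psus)
    elif stratum_totals is not None:
        overall_total = sum(stratum_totals[h] for h in n_psus)
    else:
        # No RATE= or TOTAL= input: treated as negligible.
        return 0.0
    return sample_psus / overall_total


n = {"North": 10, "South": 20}
# Rates 0.1 and 0.5 imply totals of 100 and 40, so f_SRS = 30 / 140.
print(overall_sampling_fraction(n, stratum_rates={"North": 0.1, "South": 0.5}))
print(overall_sampling_fraction(n, stratum_totals={"North": 100, "South": 40}))
```
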

Stratum Collapse

If there is only one sampling unit in a stratum, then PROC SURVEYREG cannot estimate the variance for this stratum for the Taylor series method. To estimate stratum variances, by default the procedure collapses, or combines, those strata that contain only one sampling unit. If you specify the NOCOLLAPSE option in the STRATA statement, PROC SURVEYREG does not collapse strata and uses a variance estimate of zero for any stratum that contains only one sampling unit.

Note that stratum collapse only applies to Taylor series variance estimation (the default method, also specified by VARMETHOD=TAYLOR). The procedure does not collapse strata for replication methods.

If you do not specify the NOCOLLAPSE option for the Taylor series method, PROC SURVEYREG collapses strata according to the following rules. If there are multiple strata that contain only one sampling unit each, then the procedure collapses, or combines, all these strata into a new pooled stratum. If there is only one stratum with a single sampling unit, then PROC SURVEYREG collapses that stratum with the preceding stratum, where strata are ordered by the STRATA variable values. If the stratum with one sampling unit is the first stratum, then the procedure combines it with the following stratum.
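The collapse rules above can be expressed as a small Python function. This is a sketch of the stated rules only, not the procedure's implementation. The name `collapse_strata` is hypothetical, strata are assumed to sort in the same order as the STRATA variable values, and placing the pooled stratum last in the result is an arbitrary choice of this sketch.

```python
def collapse_strata(psu_counts):
    """Group strata according to the collapse rules (hypothetical helper).

    psu_counts: dict stratum value -> number of sampling units in that stratum
    Returns a list of groups; each group lists the strata analyzed together.
    """
    ordered = sorted(psu_counts)
    singles = [h for h in ordered if psu_counts[h] == 1]

    if len(singles) >= 2:
        # Several single-unit strata: pool all of them into one new stratum.
        kept = [[h] for h in ordered if psu_counts[h] != 1]
        return kept + [singles]

    if len(singles) == 1 and len(ordered) > 1:
        # One single-unit stratum: join it to the preceding stratum,
        # or to the following stratum if it is the first one.
        idx = ordered.index(singles[0])
        partner = idx - 1 if idx > 0 else idx + 1
        lo, hi = sorted((idx, partner))
        groups = [[h] for h in ordered[:lo]]
        groups.append([ordered[lo], ordered[hi]])
        groups.extend([h] for h in ordered[hi + 1:])
        return groups

    # No single-unit strata (or nothing to collapse with): leave as is.
    return [[h] for h in ordered]


print(collapse_strata({"a": 3, "b": 1, "c": 4}))
print(collapse_strata({"a": 1, "b": 2, "c": 1, "d": 5}))
```
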

If you specify stratum sampling rates by using the RATE=SAS-data-set option, PROC SURVEYREG computes the sampling rate for the new pooled stratum as the weighted average of the sampling rates for the collapsed strata. See the section Sampling Rate of the Pooled Stratum from Collapse for the formula. If the specified sampling rate equals 0 for any of the collapsed strata, then the pooled stratum is assigned a sampling rate of 0. If you specify stratum totals by using the TOTAL=SAS-data-set option, PROC SURVEYREG combines the totals for the collapsed strata to compute the sampling rate for the new pooled stratum.

Sampling Rate of the Pooled Stratum from Collapse

Assuming that PROC SURVEYREG collapses single-unit strata $h_1, h_2, \ldots, h_c$ into the pooled stratum, the procedure calculates the sampling rate for the pooled stratum as

$$f_{\text{Pooled Stratum}} = \begin{cases} 0 & \text{if any of } f_{h_l} = 0 \text{ where } l = 1, 2, \ldots, c \\ \left( \sum_{l=1}^{c} n_{h_l} f_{h_l}^{-1} \right)^{-1} \sum_{l=1}^{c} n_{h_l} & \text{otherwise} \end{cases}$$
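As a numerical check of the pooled-rate formula, here is a minimal Python sketch. The function name `pooled_sampling_rate` is hypothetical; the arithmetic follows the two cases of the formula directly.

```python
def pooled_sampling_rate(n, f):
    """Sampling rate of the pooled stratum (hypothetical helper).

    n: sample sizes n_{h_l} of the collapsed strata
    f: sampling rates f_{h_l} of the collapsed strata, in the same order
    """
    if any(rate == 0 for rate in f):
        # First case of the formula: any zero rate gives a pooled rate of 0.
        return 0.0
    # Second case: (sum of n / f)^(-1) * (sum of n).
    return sum(n) / sum(n_l / f_l for n_l, f_l in zip(n, f))


# Two single-unit strata with rates 0.1 and 0.5:
# 2 / (1/0.1 + 1/0.5) = 2 / 12, about 0.1667.
print(pooled_sampling_rate([1, 1], [0.1, 0.5]))
print(pooled_sampling_rate([1, 1], [0.1, 0.0]))
```

Because $n_{h_l} / f_{h_l}$ is the population total implied by each rate, the nonzero case is the combined sample size divided by the combined implied total.
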
Last updated: December 09, 2022