Introduction to Statistical Modeling with SAS/STAT Software

Inference Principles for Survey Data

Design-based and model-assisted statistical inference for survey data requires that the randomness due to the selection mechanism be taken into account. This can require special estimation principles and techniques.

The SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, and SURVEYPHREG procedures support design-based and/or model-assisted inference for sample surveys. Suppose $\pi_i$ is the selection (inclusion) probability for unit $i$ in sample $S$. The inverse of the inclusion probability is known as the sampling weight and is denoted by $w_i = 1/\pi_i$. Briefly, the idea is to apply a relationship that exists in the population to the sample and to take into account the sampling weights. For example, to estimate the finite population total $T_N = \sum_{i \in U_N} y_i$ based on the sample $S$, you can accumulate the sampled values while properly weighting: $\hat{T}_\pi = \sum_{i \in S} w_i y_i$. It is easy to verify that $\hat{T}_\pi$ is design-unbiased in the sense that $\mathrm{E}[\hat{T}_\pi \mid \mathcal{F}_N] = T_N$ (see Cochran 1977).
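The design-unbiasedness claim can be checked numerically on a toy population. The following Python sketch is illustrative only and is not part of any SAS procedure: the population values, the inclusion probabilities, and the function names `ht_total` and `expected_ht_total_poisson` are assumptions introduced here. It also assumes Poisson sampling (each unit enters the sample independently with its own inclusion probability), which makes it simple to enumerate every possible sample and compute the exact design expectation of the weighted total (often called the Horvitz-Thompson estimator).

```python
from fractions import Fraction
from itertools import product


def ht_total(sample_values, sample_probs):
    """Weighted sum of the sampled values, with weights w_i = 1 / pi_i."""
    return sum(y / p for y, p in zip(sample_values, sample_probs))


def expected_ht_total_poisson(y, pi):
    """Exact design expectation of the weighted total under Poisson sampling.

    Assumption for this sketch: unit i is included independently with
    probability pi[i], so all 2**N possible samples can be enumerated.
    """
    expectation = Fraction(0)
    for included in product([False, True], repeat=len(y)):
        # Probability of drawing exactly this sample.
        prob = Fraction(1)
        for flag, p in zip(included, pi):
            prob *= p if flag else 1 - p
        ys = [yi for yi, flag in zip(y, included) if flag]
        ps = [pi_i for pi_i, flag in zip(pi, included) if flag]
        expectation += prob * ht_total(ys, ps)
    return expectation


# Hypothetical population of N = 4 units (illustrative values only).
y_pop = [Fraction(10), Fraction(20), Fraction(35), Fraction(5)]
pi_pop = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 4), Fraction(1, 5)]

population_total = sum(y_pop)
expected_estimate = expected_ht_total_poisson(y_pop, pi_pop)
print(population_total, expected_estimate)  # prints: 70 70
```

Exact rational arithmetic is used so that the equality between the population total and the expected estimate holds exactly rather than up to floating-point error.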

When a statistical model is present, similar ideas apply. For example, if $\beta_{0N}$ and $\beta_{1N}$ are finite population quantities for a simple linear regression working model that minimize the sum of squares

$$\sum_{i \in U_N} \left( y_i - \beta_{0N} - \beta_{1N} x_i \right)^2$$

in the population, then the sample-based estimators $\hat{\beta}_{0S}$ and $\hat{\beta}_{1S}$ are obtained by minimizing the weighted sum of squares

$$\sum_{i \in S} w_i \left( y_i - \hat{\beta}_{0S} - \hat{\beta}_{1S} x_i \right)^2$$

in the sample, taking into account the inclusion probabilities.
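For a simple linear regression, the weighted sum of squares has a closed-form minimizer built from weighted means. The following Python sketch shows that calculation under stated assumptions: the sample values and weights are hypothetical, the function names are introduced here for illustration, and the code is not how the SURVEYREG procedure is implemented. It produces point estimates only and says nothing about design-based variances.

```python
def weighted_simple_regression(x, y, w):
    """Return (b0, b1) minimizing sum_i w_i * (y_i - b0 - b1 * x_i)**2.

    Assumes positive weights and at least two distinct x values.
    """
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def weighted_sse(x, y, w, b0, b1):
    """Weighted sum of squared residuals for a candidate (b0, b1)."""
    return sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y))


# Hypothetical sample with sampling weights w_i = 1 / pi_i (illustrative only).
x_s = [1.0, 2.0, 3.0, 4.0]
y_s = [2.1, 3.9, 6.2, 7.8]
w_s = [10.0, 10.0, 25.0, 40.0]

b0_hat, b1_hat = weighted_simple_regression(x_s, y_s, w_s)
print(b0_hat, b1_hat, weighted_sse(x_s, y_s, w_s, b0_hat, b1_hat))
```

When all weights are equal, the result coincides with ordinary least squares, which is one way to see that the weights carry the design information.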

In model-assisted inference, weighted least squares or pseudo-maximum likelihood estimators are commonly used to solve such estimation problems. Maximum pseudo-likelihood or weighted maximum likelihood estimators for survey data maximize a sample-based estimator of the population likelihood. Assume a working model with uncorrelated responses such that the finite population log-likelihood is

$$\sum_{i \in U_N} l(\theta_{1N}, \ldots, \theta_{pN}; y_i),$$

where $\theta_{1N}, \ldots, \theta_{pN}$ are finite population quantities. For independent sampling, one possible sample-based estimator of the population log-likelihood is

$$\sum_{i \in S} w_i \, l(\theta_{1N}, \ldots, \theta_{pN}; y_i)$$

Sample-based estimators $\hat{\theta}_{1S}, \ldots, \hat{\theta}_{pS}$ are obtained by maximizing this expression.
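As a small illustration of maximizing a weighted log-likelihood, consider a one-parameter Bernoulli working model, for which the maximizer is the weighted sample proportion. The following Python sketch is an assumption-laden example rather than the algorithm used by any SAS procedure: the Bernoulli model, the sample data, and the function names are all introduced here for illustration.

```python
import math


def weighted_log_likelihood(p, y, w):
    """Sum of w_i * log f(y_i; p) for a Bernoulli working model, 0 < p < 1."""
    return sum(
        wi * (yi * math.log(p) + (1 - yi) * math.log(1 - p))
        for wi, yi in zip(w, y)
    )


def pseudo_mle_bernoulli(y, w):
    """Closed-form maximizer of the weighted log-likelihood.

    Setting the derivative with respect to p to zero gives the weighted
    proportion sum(w_i * y_i) / sum(w_i).
    """
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


# Hypothetical binary responses and sampling weights (illustrative only).
y_s = [1, 0, 1, 1, 0]
w_s = [12.0, 30.0, 8.0, 20.0, 30.0]

p_hat = pseudo_mle_bernoulli(y_s, w_s)

# Cross-check against a coarse grid search over (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: weighted_log_likelihood(p, y_s, w_s))
print(p_hat, p_grid)
```

For models without a closed-form solution, the same weighted objective would typically be maximized with an iterative optimizer instead of a grid.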

Design-based and model-based statistical analysis might employ the same statistical model (for example, a linear regression) and the same estimation principle (for example, weighted least squares), and arrive at the same estimates. The design-based estimation of the precision of the estimators differs from the model-based estimation, however. For complex surveys, design-based variance estimates are in general different from their model-based counterparts. The SAS/STAT procedures for survey data (the SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, and SURVEYPHREG procedures) compute design-based variance estimates for complex survey data. See the section "Variance Estimation" in Chapter 15, "Introduction to Survey Sampling and Analysis Procedures," for details about design-based variance estimation.

Last updated: December 09, 2022