This chapter introduces the SAS/STAT procedures for survey sampling and describes how you can use these procedures to analyze survey data.
Researchers often use sample survey methodology to obtain information about a large population by selecting and measuring a sample from that population. Because of variability among items, researchers apply probability-based scientific designs to select the sample. This reduces the risk of a distorted view of the population and enables statistically valid inferences to be made from the sample. For more information about statistical sampling and analysis of complex survey data, see Lohr (2010), Kalton (1983), Cochran (1977), and Kish (1965).
To select probability-based random samples from a study population, you can use the SURVEYSELECT procedure, which provides a variety of probability sampling methods. To perform imputation of missing values in survey data, you can use the SURVEYIMPUTE procedure, which provides donor-based imputation methods. To analyze sample survey data, you can use the SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, and SURVEYPHREG procedures, which incorporate the sample design into the analyses.
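The basic idea of probability sampling, drawing units with known selection probabilities and weighting them back to the population, can be sketched outside SAS in a few lines of Python. This is only an illustration of the concept under simple random sampling without replacement; it does not show SURVEYSELECT syntax, and the population values and the function name `srs_total_estimate` are hypothetical.

```python
import random


def srs_total_estimate(population, n, seed=0):
    """Draw a simple random sample without replacement and estimate the population total."""
    rng = random.Random(seed)
    N = len(population)
    sample = rng.sample(population, n)  # equal-probability selection, no replacement
    weight = N / n                      # sampling weight = 1 / selection probability
    estimate = weight * sum(sample)     # weighted estimate of the population total
    return sample, weight, estimate


# Hypothetical population of 1,000 values (not data from this chapter).
population = [10 + (i % 7) for i in range(1000)]
sample, weight, estimate = srs_total_estimate(population, n=50, seed=42)
print(f"weight per sampled unit: {weight}, estimated total: {estimate}")
```

Here each sampled unit stands in for N/n = 20 population units, which is why the weight multiplies the sample sum.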
Many SAS/STAT procedures, such as the MEANS, FREQ, GLM, LOGISTIC, and PHREG procedures, can compute sample means, produce crosstabulation tables, and estimate regression relationships. However, in most of these procedures, statistical inference is based on the assumption that the sample is drawn from an infinite population by simple random sampling. If the sample is in fact selected from a finite population by using a complex survey design, these procedures generally do not calculate the estimates and their variances according to the design actually used. Using analyses that are not appropriate for your sample design can lead to incorrect statistical inferences.
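To see why the choice of analysis matters, consider a small numerical sketch in Python (an illustration of the statistics only, not of any SAS procedure). It assumes a stratified simple random sample with hypothetical stratum sizes and values, and compares the standard error computed as if the data were a simple random sample from an infinite population with the standard error that accounts for the strata and the finite population.

```python
import math
import statistics


def naive_se(values):
    """Standard error of the mean assuming simple random sampling from an infinite population."""
    return statistics.stdev(values) / math.sqrt(len(values))


def stratified_mean_and_se(strata):
    """Design-based mean and standard error for a stratified simple random sample.

    `strata` is a list of (N_h, sample_values) pairs, where N_h is the
    number of units in the stratum population.
    """
    N = sum(N_h for N_h, _ in strata)
    mean = 0.0
    variance = 0.0
    for N_h, y in strata:
        n_h = len(y)
        W_h = N_h / N   # stratum share of the population
        f_h = n_h / N_h  # stratum sampling fraction
        mean += W_h * statistics.mean(y)
        # (1 - f_h) is the finite population correction for this stratum
        variance += W_h**2 * (1 - f_h) * statistics.variance(y) / n_h
    return mean, math.sqrt(variance)


# Hypothetical strata: (population size, sampled values).
strata = [
    (40, [10, 11, 9, 10]),
    (60, [20, 21, 19, 20, 22, 18]),
]
pooled = [v for _, y in strata for v in y]

design_mean, design_se = stratified_mean_and_se(strata)
print(f"mean: {design_mean:.2f}")
print(f"SE ignoring the design: {naive_se(pooled):.4f}")
print(f"SE using the design:    {design_se:.4f}")
```

In this made-up example the two analyses agree on the mean but not on its precision: ignoring the strata treats the between-stratum difference as sampling noise and overstates the standard error. With other designs (for example, cluster samples) the error can go in the opposite direction.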
The SURVEYMEANS, SURVEYFREQ, SURVEYREG, SURVEYLOGISTIC, and SURVEYPHREG procedures properly analyze complex survey data by taking into account the sample design. You can use these procedures for multistage or single-stage designs, with or without stratification, and with or without unequal weighting. The survey analysis procedures provide a choice of variance estimation methods, which include Taylor series linearization, balanced repeated replication (BRR), bootstrap, and jackknife.
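As a rough illustration of how a replication method relates to a linearization method, the following Python sketch computes a delete-one jackknife variance for a sample mean and compares it with the usual formula s²/n. It assumes the simplest possible setting (one stratum, equal weights, no finite population correction) and hypothetical data; the survey procedures' own computations account for strata, clusters, and weights and are not reproduced here.

```python
import statistics


def jackknife_variance_of_mean(values):
    """Delete-one jackknife variance estimate for the sample mean."""
    n = len(values)
    total = sum(values)
    # Replicate estimates: the mean recomputed with each observation left out in turn.
    replicates = [(total - v) / (n - 1) for v in values]
    center = sum(replicates) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)


# Hypothetical sample values.
sample = [12.0, 15.0, 9.0, 14.0, 10.0]

jk_var = jackknife_variance_of_mean(sample)
formula_var = statistics.variance(sample) / len(sample)
print(f"jackknife: {jk_var:.4f}, s^2/n: {formula_var:.4f}")
```

For a plain mean the two estimates coincide; replication methods become more useful for statistics whose variance has no simple closed form.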
Table 1 briefly describes the SAS/STAT sampling and analysis procedures.
Table 1: Survey Sampling and Analysis Procedures
| Procedure and category | Features |
|---|---|
| PROC SURVEYSELECT | |
| Selection methods | Simple random sampling (without replacement) |
| | Unrestricted random sampling (with replacement) |
| | Balanced bootstrap |
| | Systematic |
| | Sequential |
| | Bernoulli |
| | Poisson |
| | Probability proportional to size (PPS) sampling, with and without replacement |
| | PPS systematic |
| | PPS for two units per stratum |
| | PPS sequential with minimum replacement |
| Allocation methods | Proportional |
| | Optimal |
| | Neyman |
| Sampling tools | Stratified sampling |
| | Cluster sampling |
| | Replicated sampling |
| | Serpentine sorting |
| | Random assignment |
| PROC SURVEYIMPUTE | |
| Imputation methods | Single and multiple hot-deck |
| | Approximate Bayesian bootstrap |
| | Fully efficient fractional |
| | Two-stage fully efficient fractional |
| | Fractional hot-deck |
| PROC SURVEYMEANS | |
| Statistics | Means and totals |
| | Proportions |
| | Quantiles |
| | Geometric means |
| | Ratios |
| | Standard errors |
| | Confidence limits |
| Analyses | Hypothesis tests |
| | Domain analysis |
| | Comparison of domain means |
| | Poststratification |
| Graphics | Histograms |
| | Box plots |
| | Summary panel plots |
| | Domain box plots |
| PROC SURVEYFREQ | |
| Tables | One-way frequency tables |
| | Two-way and multiway crosstabulation tables |
| | Estimates of totals and proportions |
| | Standard errors |
| | Confidence limits |
| Analyses | Tests of goodness of fit |
| | Tests of independence |
| | Risks and risk differences |
| | Odds ratios and relative risks |
| | Kappa coefficients |
| Graphics | Weighted frequency and percent plots |
| | Mosaic plots |
| | Odds ratio, relative risk, and risk difference plots |
| | Kappa plots |
| PROC SURVEYREG | |
| Analyses | Linear regression model fitting |
| | Regression coefficients |
| | Covariance matrices |
| | Confidence limits |
| | Hypothesis tests |
| | Estimable functions |
| | Contrasts |
| | Least squares means (LS-means) of effects |
| | Custom hypothesis tests among LS-means |
| | Regression with constructed effects |
| | Predicted values and residuals |
| | Domain analysis |
| Graphics | Fit plots |
| PROC SURVEYLOGISTIC | |
| Analyses | Cumulative logit regression model fitting |
| | Logit, probit, and complementary log-log link functions |
| | Generalized logit regression model fitting |
| | Regression coefficients |
| | Covariance matrices |
| | Confidence limits |
| | Hypothesis tests |
| | Odds ratios |
| | Estimable functions |
| | Contrasts |
| | Least squares means (LS-means) of effects |
| | Custom hypothesis tests among LS-means |
| | Regression with constructed effects |
| | Model diagnostics |
| | Domain analysis |
| PROC SURVEYPHREG | |
| Analyses | Proportional hazards regression model fitting |
| | Breslow and Efron likelihoods |
| | Regression coefficients |
| | Covariance matrices |
| | Confidence limits |
| | Hypothesis tests |
| | Hazard ratios |
| | Contrasts |
| | Predicted values and standard errors |
| | Martingale, Schoenfeld, score, and deviance residuals |
| | Domain analysis |