The SURVEYLOGISTIC Procedure

Variance Estimation

Due to the variability of characteristics among items in the population, researchers apply scientific sample designs in the sample selection process to reduce the risk of a distorted view of the population, and they make inferences about the population based on the information from the sample survey data. In order to make statistically valid inferences for the population, they must incorporate the sample design in the data analysis.

The SURVEYLOGISTIC procedure fits linear logistic regression models for discrete response survey data by using the maximum likelihood method. In the variance estimation, the procedure uses the Taylor series (linearization) method or replication (resampling) methods to estimate sampling errors of estimators based on complex sample designs, including designs with stratification, clustering, and unequal weighting (Binder 1981, 1983; Roberts, Rao, and Kumar 1987; Skinner, Holt, and Smith 1989; Binder and Roberts 2003; Morel 1989; Lehtonen and Pahkinen 1995; Woodruff 1971; Fuller 1975; Särndal, Swensson, and Wretman 1992; Fuller 2009; Wolter 2007; Rust 1985; Dippo, Fay, and Morganstein 1984; Rao and Shao 1999; Rao, Wu, and Yue 1992; Rao and Shao 1996).

You can use the VARMETHOD= option to specify a variance estimation method to use. By default, the Taylor series method is used. However, replication methods have recently gained popularity for estimating variances in complex survey data analysis. One reason for this popularity is the relative simplicity of replication-based estimates, especially for nonlinear estimators; another is that modern computational capacity has made replication methods feasible for practical survey analysis.

Replication methods draw multiple replicates (also called subsamples) from a full sample according to a specific resampling scheme. The most commonly used resampling schemes are the balanced repeated replication (BRR) method, the jackknife method, and the bootstrap method. For each replicate, the original weights are modified for the PSUs in the replicates in order to create replicate weights. The parameters of interest are estimated by using the replicate weights for each replicate. Then the variances of parameters of interest are estimated by the variability among the estimates derived from these replicates. You can use the REPWEIGHTS statement to provide your own replicate weights for variance estimation.

The following sections provide details about how the variance-covariance matrix of the estimated regression coefficients is estimated for each variance estimation method.

Taylor Series (Linearization)

The Taylor series (linearization) method is the most commonly used method to estimate the covariance matrix of the regression coefficients for complex survey data. It is the default variance estimation method used by PROC SURVEYLOGISTIC.

Using the notation described in the section Notation, the estimated covariance matrix of model parameters by the Taylor series method is

$$\hat{V}(\hat{\boldsymbol{\theta}}) = \hat{\mathbf{Q}}^{-1} \hat{\mathbf{G}} \hat{\mathbf{Q}}^{-1}$$

where

$$
\begin{aligned}
\hat{\mathbf{Q}} &= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} \hat{\mathbf{D}}_{hij} \left( \mathrm{diag}(\hat{\boldsymbol{\pi}}_{hij}) - \hat{\boldsymbol{\pi}}_{hij} \hat{\boldsymbol{\pi}}_{hij}' \right)^{-1} \hat{\mathbf{D}}_{hij}' \\
\hat{\mathbf{G}} &= \frac{n-1}{n-p} \sum_{h=1}^{H} \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h} (\mathbf{e}_{hi\cdot} - \bar{\mathbf{e}}_{h\cdot\cdot}) (\mathbf{e}_{hi\cdot} - \bar{\mathbf{e}}_{h\cdot\cdot})' \\
\mathbf{e}_{hi\cdot} &= \sum_{j=1}^{m_{hi}} w_{hij} \hat{\mathbf{D}}_{hij} \left( \mathrm{diag}(\hat{\boldsymbol{\pi}}_{hij}) - \hat{\boldsymbol{\pi}}_{hij} \hat{\boldsymbol{\pi}}_{hij}' \right)^{-1} (\mathbf{y}_{hij} - \hat{\boldsymbol{\pi}}_{hij}) \\
\bar{\mathbf{e}}_{h\cdot\cdot} &= \frac{1}{n_h} \sum_{i=1}^{n_h} \mathbf{e}_{hi\cdot}
\end{aligned}
$$

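and $\mathbf{D}_{hij}$ is the matrix of partial derivatives of the inverse link function with respect to $\boldsymbol{\theta}$, and $\hat{\mathbf{D}}_{hij}$ and the response probabilities $\hat{\boldsymbol{\pi}}_{hij}$ are evaluated at $\hat{\boldsymbol{\theta}}$.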

If you specify the TECHNIQUE=NEWTON option in the MODEL statement to request the Newton-Raphson algorithm, the matrix $\hat{\mathbf{Q}}$ is replaced by the negative (expected) Hessian matrix when the estimated covariance matrix $\hat{V}(\hat{\boldsymbol{\theta}})$ is computed.

Adjustments to the Variance Estimation

The factor $(n-1)/(n-p)$ in the computation of the matrix $\hat{\mathbf{G}}$ reduces the small sample bias associated with using the estimated function to calculate deviations (Morel 1989; Hidiroglou, Fuller, and Hickman 1980). For simple random sampling, this factor contributes to the degrees-of-freedom correction applied to the residual mean square for ordinary least squares in which p parameters are estimated. By default, the procedure uses this adjustment in Taylor series variance estimation. It is equivalent to specifying the VADJUST=DF option in the MODEL statement. If you do not want to use this multiplier in the variance estimation, you can specify the VADJUST=NONE option in the MODEL statement to suppress this factor.
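As a rough illustration of how the matrix G-hat in the Taylor series formula is accumulated, the following Python sketch sums the within-stratum cross-products of PSU-level residual totals and then applies the factor (n - 1)/(n - p). It is not the procedure's implementation: the function name, the dictionary layout, and the `df_adjust` flag (which loosely mirrors VADJUST=DF versus VADJUST=NONE) are assumptions made for this example, and the residual totals are taken as given rather than computed from a fitted model.

```python
def taylor_g_matrix(e_by_stratum, n, p, rates=None, df_adjust=True):
    """Accumulate G-hat from PSU-level residual totals e_hi (lists of floats).

    e_by_stratum maps a stratum label to the list of e_hi vectors of its PSUs.
    rates maps a stratum label to its sampling fraction f_h (0 if omitted).
    n and p are the sample size and number of parameters used in the factor.
    """
    rates = rates or {}
    dim = len(next(iter(e_by_stratum.values()))[0])
    g = [[0.0] * dim for _ in range(dim)]
    for h, e_list in e_by_stratum.items():
        n_h = len(e_list)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        f_h = rates.get(h, 0.0)
        # Stratum mean of the PSU totals (e-bar_h..).
        e_bar = [sum(e[a] for e in e_list) / n_h for a in range(dim)]
        scale = n_h * (1.0 - f_h) / (n_h - 1)
        for e in e_list:
            d = [e[a] - e_bar[a] for a in range(dim)]
            for a in range(dim):
                for b in range(dim):
                    g[a][b] += scale * d[a] * d[b]
    # Hypothetical switch: True applies (n - 1)/(n - p), False omits it.
    factor = (n - 1) / (n - p) if df_adjust else 1.0
    return [[factor * x for x in row] for row in g]


# Made-up residual totals for one stratum with two PSUs and two parameters.
e_totals = {"s1": [[1.0, 0.0], [3.0, 2.0]]}
g_df = taylor_g_matrix(e_totals, n=12, p=2)
g_none = taylor_g_matrix(e_totals, n=12, p=2, df_adjust=False)
```

In this toy input the unadjusted entries all equal 4, and the adjusted entries are larger by the factor 11/10.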

In addition, you can specify the VADJUST=MOREL option to request an adjustment to the variance estimator for the model parameters $\hat{\boldsymbol{\theta}}$, introduced by Morel (1989):

$$\hat{V}(\hat{\boldsymbol{\theta}}) = \hat{\mathbf{Q}}^{-1} \hat{\mathbf{G}} \hat{\mathbf{Q}}^{-1} + \kappa \lambda \hat{\mathbf{Q}}^{-1}$$

where for given nonnegative constants $\delta$ and $\phi$,

$$
\begin{aligned}
\kappa &= \max\left( \delta,\; p^{-1} \,\mathrm{trace}\left( \hat{\mathbf{Q}}^{-1} \hat{\mathbf{G}} \right) \right) \\
\lambda &= \min\left( \phi,\; \frac{p}{\tilde{n} - p} \right)
\end{aligned}
$$

The adjustment $\kappa \lambda \hat{\mathbf{Q}}^{-1}$ does the following:

reduces the small sample bias reflected in inflated Type I error rates
guarantees a positive-definite estimated covariance matrix provided that $\hat{\mathbf{Q}}^{-1}$ exists
is close to zero when the sample size becomes large

In this adjustment, $\kappa$ is an estimate of the design effect, which has been bounded below by the positive constant $\delta$. You can use DEFFBOUND=$\delta$ in the VADJUST=MOREL option in the MODEL statement to specify this lower bound; by default, the procedure uses $\delta = 1$. The factor $\lambda$ converges to zero when the sample size becomes large, and $\lambda$ has an upper bound $\phi$. You can use ADJBOUND=$\phi$ in the VADJUST=MOREL option in the MODEL statement to specify this upper bound; by default, the procedure uses $\phi = 0.5$.
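To make the roles of the two bounds concrete, here is a small Python sketch of the scalar factors in the Morel adjustment. It assumes the standard form of the adjustment, in which kappa is the larger of the lower bound and trace(Q^{-1} G)/p and lambda is the smaller of the upper bound and p/(n-tilde - p). The function name, argument names, and the numeric inputs are illustrative assumptions, not values taken from the procedure.

```python
def morel_factors(trace_qinv_g, p, n_tilde, delta, phi):
    """Return (kappa, lam) for an adjustment of the form kappa * lam * Q^{-1}.

    trace_qinv_g : trace of Q^{-1} G, supplied by the caller
    p            : number of model parameters
    n_tilde      : the sample-size quantity that appears in the lambda formula
    delta, phi   : lower bound for kappa and upper bound for lambda
    """
    if n_tilde <= p:
        raise ValueError("n_tilde must exceed p")
    kappa = max(delta, trace_qinv_g / p)   # design-effect estimate, bounded below
    lam = min(phi, p / (n_tilde - p))      # shrinks toward 0 as n_tilde grows
    return kappa, lam


# Illustrative numbers only.
kappa, lam = morel_factors(trace_qinv_g=7.5, p=3, n_tilde=43, delta=1.0, phi=0.5)
kappa_big, lam_big = morel_factors(trace_qinv_g=7.5, p=3, n_tilde=3003, delta=1.0, phi=0.5)
```

With the larger sample-size value, lambda drops from 0.075 to 0.001, so the added term becomes negligible, which matches the large-sample behavior described above.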

Bootstrap Method

The VARMETHOD=BOOTSTRAP option in the PROC SURVEYLOGISTIC statement requests the bootstrap method for variance estimation. This method can be used for stratified sample designs and for designs that have no stratification. If your design is stratified, the bootstrap method requires at least two PSUs in each stratum. You can provide your own bootstrap replicate weights for the analysis by using a REPWEIGHTS statement, or the procedure can construct bootstrap replicate weights for the analysis.

PROC SURVEYLOGISTIC estimates the parameter of interest (or requested statistics) from each replicate, and then uses the variability among replicate estimates to estimate the overall variance of these statistics.

This bootstrap method for complex survey data is similar to the method of Rao, Wu, and Yue (1992) and is also known as the bootstrap weights method (Mashreghi, Haziza, and Léger 2016). For more information, see Lohr (2010, Section 9.3.3), Wolter (2007, Chapter 5), Beaumont and Patak (2012), Fuller (2009, Section 4.5), and Shao and Tu (1995, Section 6.2.4). McCarthy and Snowden (1985), Rao and Wu (1988), Sitter (1992b), and Sitter (1992a) provide several adjusted bootstrap variance estimators that are consistent for complex survey data. The naive bootstrap variance estimator that is suitable for infinite populations is not consistent for complex survey data.

Replicate Weight Construction

If you do not provide replicate weights by using a REPWEIGHTS statement, PROC SURVEYLOGISTIC constructs bootstrap replicate weights for the analysis. The procedure selects replicate bootstrap samples by with-replacement random sampling of PSUs within strata. You can specify the number of bootstrap replicates in the REPS= method-option; by default, the number of replicates is 250. (Increasing the number of replicates can improve the estimation precision but also increases the computation time.) You can specify the bootstrap sample sizes $m_h$ in the MH= method-option; by default, $m_h = n_h - 1$, where $n_h$ is the number of PSUs in stratum h.

In each replicate sample, the original sampling weights of the selected units are adjusted to reflect the full sample. These adjusted weights are the bootstrap replicate weights. In replicate r, the bootstrap replicate weight for observation j in PSU i in stratum h is computed as

$$\tilde{w}_{hij}^{(r)} = \left\{ 1 + \sqrt{\frac{1 - f_h}{m_h (n_h - 1)}} \; n_h k_{hi}^{(r)} - \sqrt{\frac{m_h (1 - f_h)}{n_h - 1}} \right\} w_{hij}$$

where $k_{hi}^{(r)}$ is the number of times PSU i is selected for replicate r, and $f_h$ is the sampling fraction in stratum h that you can input by using either the RATE= or TOTAL= option in the PROC SURVEYLOGISTIC statement.
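The replicate weight formula can be tried out with a short Python sketch. For one stratum, it draws a bootstrap sample of PSUs with replacement, counts how often each PSU is selected, and returns the multiplier applied to the original weights of each PSU. The function name, the use of Python's `random` module, and the seed are assumptions for illustration; the procedure's own random number stream is not reproduced here.

```python
import math
import random


def bootstrap_weight_factors(n_h, m_h, f_h, rng):
    """Multipliers for one stratum and one replicate.

    n_h : number of PSUs in the stratum
    m_h : bootstrap sample size (number of with-replacement draws)
    f_h : sampling fraction in the stratum
    rng : a random.Random instance (illustrative choice of generator)
    """
    if n_h < 2:
        raise ValueError("the stratum needs at least two PSUs")
    counts = [0] * n_h                       # k_hi: times each PSU is drawn
    for _ in range(m_h):
        counts[rng.randrange(n_h)] += 1
    a = math.sqrt((1.0 - f_h) / (m_h * (n_h - 1)))
    b = math.sqrt(m_h * (1.0 - f_h) / (n_h - 1))
    return [1.0 + a * n_h * k - b for k in counts]


rng = random.Random(2022)
factors = bootstrap_weight_factors(n_h=5, m_h=4, f_h=0.0, rng=rng)
```

Whatever sample is drawn, the multipliers in a stratum add up to the number of PSUs in that stratum, so the replicate weights still reflect the full sample. With m_h = n_h - 1 and f_h = 0 the multiplier reduces to n_h k / (n_h - 1), so unselected PSUs receive weight zero.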

You can use the OUTWEIGHTS=SAS-data-set method-option to store the bootstrap replicate weights in a SAS data set. For information about the contents of the OUTWEIGHTS= data set, see the section Replicate Weights Output Data Set. You can provide these replicate weights to the procedure for subsequent analyses by using a REPWEIGHTS statement.

Variance Estimation

Let R be the total number of bootstrap replicate weights. Denote $\alpha_r$ as the replicate coefficient for the rth replicate ($r = 1, 2, \ldots, R$).

When the procedure generates the bootstrap replicate weights, $\alpha_r = 1/R$.

If you provide your own bootstrap replicate weights by including a REPWEIGHTS statement, you can specify $\alpha_r$ in the REPCOEFS= option. By default, $\alpha_r = 1/R$.

Let $\hat{\boldsymbol{\theta}}$ be the estimated regression coefficients from the full sample for $\boldsymbol{\theta}$, and let $\hat{\boldsymbol{\theta}}_r$ be the estimated regression coefficients when the rth set of bootstrap replicate weights is used. PROC SURVEYLOGISTIC estimates the covariance matrix of $\hat{\boldsymbol{\theta}}$ by the following equation:

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\theta}}) = \sum_{r=1}^{R} \alpha_r \left( \hat{\boldsymbol{\theta}}_r - \hat{\boldsymbol{\theta}} \right) \left( \hat{\boldsymbol{\theta}}_r - \hat{\boldsymbol{\theta}} \right)'$$

Here, the degrees of freedom is the number of clusters minus the number of strata. If there are no clusters, then the degrees of freedom equals the number of observations minus the number of strata. If the design is not stratified, then the degrees of freedom equals the number of PSUs minus one.

If you provide your own replicate weights without specifying the DF= option, the degrees of freedom is set to be the number of replicates R.

If you specify the CENTER=REPLICATES method-option, then PROC SURVEYLOGISTIC computes the covariance matrix of $\hat{\boldsymbol{\theta}}$ by

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\theta}}) = \sum_{r=1}^{R} \alpha_r \left( \hat{\boldsymbol{\theta}}_r - \bar{\hat{\boldsymbol{\theta}}} \right) \left( \hat{\boldsymbol{\theta}}_r - \bar{\hat{\boldsymbol{\theta}}} \right)'$$

where $\bar{\hat{\boldsymbol{\theta}}}$ is the average of the replicate estimates and is calculated as follows:

$$\bar{\hat{\boldsymbol{\theta}}} = \frac{1}{R} \sum_{r=1}^{R} \hat{\boldsymbol{\theta}}_r$$

If a parameter cannot be computed from one or more replicates, then the procedure computes the variance estimate by using those replicates from which the parameter can be estimated, and the number of those usable replicates replaces the original number of replicates, R, in the variance estimation.

If you do not provide your own value for the degrees of freedom and if the number of usable replicates is less than the number of PSUs minus the number of strata, then the degrees of freedom is set to the number of usable replicates.
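The replicate-based covariance estimate is a weighted sum of outer products of deviations, which the following Python sketch spells out. It assumes the replicate estimates have already been obtained by refitting the model with each set of replicate weights; the function name, the `center_on_replicates` flag (which loosely mirrors CENTER=REPLICATES), and the numeric estimates are illustrative assumptions.

```python
def replicate_covariance(theta_full, theta_reps, coefs, center_on_replicates=False):
    """Sum over replicates of alpha_r * d_r d_r', where d_r is a deviation vector.

    theta_full : full-sample estimates (list of floats)
    theta_reps : list of replicate estimate vectors
    coefs      : replicate coefficients alpha_r, one per replicate
    """
    p = len(theta_full)
    n_reps = len(theta_reps)
    if center_on_replicates:
        # Deviations from the average of the replicate estimates.
        center = [sum(t[a] for t in theta_reps) / n_reps for a in range(p)]
    else:
        # Deviations from the full-sample estimates.
        center = list(theta_full)
    cov = [[0.0] * p for _ in range(p)]
    for alpha, theta_r in zip(coefs, theta_reps):
        d = [theta_r[a] - center[a] for a in range(p)]
        for a in range(p):
            for b in range(p):
                cov[a][b] += alpha * d[a] * d[b]
    return cov


# Made-up estimates for two coefficients and four replicates.
theta_full = [0.5, -1.0]
theta_reps = [[0.4, -1.1], [0.6, -0.9], [0.5, -1.2], [0.7, -1.0]]
coefs = [1.0 / len(theta_reps)] * len(theta_reps)   # alpha_r = 1/R
cov = replicate_covariance(theta_full, theta_reps, coefs)
cov_centered = replicate_covariance(theta_full, theta_reps, coefs, center_on_replicates=True)
```

The same function covers other replication methods by changing `coefs`, for example a constant 1/R for BRR or the jackknife coefficients for the jackknife.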

Balanced Repeated Replication (BRR) Method

The balanced repeated replication (BRR) method requires that the full sample be drawn by using a stratified sample design with two primary sampling units (PSUs) per stratum. Let H be the total number of strata. The total number of replicates R is the smallest multiple of 4 that is greater than H. However, if you prefer a larger number of replicates, you can specify the REPS=number option. If a number × number Hadamard matrix cannot be constructed, the number of replicates is increased until a Hadamard matrix becomes available.

Each replicate is obtained by deleting one PSU per stratum according to the corresponding Hadamard matrix and adjusting the original weights for the remaining PSUs. The new weights are called replicate weights.

Replicates are constructed by using the first H columns of the R × R Hadamard matrix. The rth ($r = 1, 2, \ldots, R$) replicate is drawn from the full sample according to the rth row of the Hadamard matrix as follows:

If the (r, h) element of the Hadamard matrix is 1, then the first PSU of stratum h is included in the rth replicate and the second PSU of stratum h is excluded.
If the (r, h) element of the Hadamard matrix is –1, then the second PSU of stratum h is included in the rth replicate and the first PSU of stratum h is excluded.

Note that the "first" and "second" PSUs are determined by data order in the input data set. Thus, if you reorder the data set and perform the same analysis by using the BRR method, you might get slightly different results, because the contents of each replicate sample might change.

The replicate weights of the remaining PSUs in each half-sample are then set to twice their original weights. For more information about the BRR method, see Wolter (2007) and Lohr (2010).
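The half-sample construction can be sketched in Python as follows. The example hard-codes a 4 × 4 Hadamard matrix, which is an assumption made for illustration (the procedure generates or accepts its own matrix), and the function names are hypothetical. For each replicate it returns the weight multipliers of the first and second PSU in every stratum: 2 for the PSU that is kept and 0 for the PSU that is dropped.

```python
# A 4 x 4 Hadamard matrix, hard-coded for this illustration.
HADAMARD_4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]


def brr_replicate_count(n_strata):
    """Smallest multiple of 4 that is strictly greater than the number of strata."""
    return 4 * (n_strata // 4 + 1)


def brr_multipliers(hadamard, n_strata):
    """Per replicate, a list of (first_psu, second_psu) weight multipliers.

    Only the first n_strata columns of the matrix are used. An entry of 1 keeps
    the first PSU (doubling its weight); an entry of -1 keeps the second PSU.
    """
    replicates = []
    for row in hadamard:
        replicates.append(
            [(2.0, 0.0) if row[h] == 1 else (0.0, 2.0) for h in range(n_strata)]
        )
    return replicates


n_strata = 3
n_reps = brr_replicate_count(n_strata)          # 4 for three strata
multipliers = brr_multipliers(HADAMARD_4, n_strata)
```

For example, the second row of the matrix, (1, -1, 1, -1), keeps the first PSU in strata 1 and 3 and the second PSU in stratum 2.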

By default, an appropriate Hadamard matrix is generated automatically to create the replicates. You can request that the Hadamard matrix be displayed by specifying the VARMETHOD=BRR(PRINTH) method-option. If you provide a Hadamard matrix by specifying the VARMETHOD=BRR(HADAMARD=) method-option, then the replicates are generated according to the provided Hadamard matrix.

You can use the VARMETHOD=BRR(OUTWEIGHTS=) method-option to save the replicate weights into a SAS data set.

Let $\hat{\boldsymbol{\theta}}$ be the estimated regression coefficients from the full sample for $\boldsymbol{\theta}$, and let $\hat{\boldsymbol{\theta}}_r$ be the estimated regression coefficients from the rth replicate by using replicate weights. PROC SURVEYLOGISTIC estimates the covariance matrix of $\hat{\boldsymbol{\theta}}$ by

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\theta}}) = \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\boldsymbol{\theta}}_r - \hat{\boldsymbol{\theta}} \right) \left( \hat{\boldsymbol{\theta}}_r - \hat{\boldsymbol{\theta}} \right)'$$

with H degrees of freedom, where H is the number of strata. If you provide your own replicate weights without specifying the DF= option, the degrees of freedom is set to be the number of replicates R.

If you specify the CENTER=REPLICATES method-option, then PROC SURVEYLOGISTIC computes the covariance matrix of $\hat{\boldsymbol{\theta}}$ by

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\theta}}) = \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\boldsymbol{\theta}}_r - \bar{\hat{\boldsymbol{\theta}}} \right) \left( \hat{\boldsymbol{\theta}}_r - \bar{\hat{\boldsymbol{\theta}}} \right)'$$

where $\bar{\hat{\boldsymbol{\theta}}}$ is the average of the replicate estimates and is calculated as follows:

$$\bar{\hat{\boldsymbol{\theta}}} = \frac{1}{R} \sum_{r=1}^{R} \hat{\boldsymbol{\theta}}_r$$

If you do not provide your own value for the degrees of freedom, then the degrees of freedom equals the minimum of the number of replicates from which the parameter can be estimated and the number of strata, H.

Fay’s BRR Method

Fay’s method is a modification of the BRR method, and it requires a stratified sample design with two primary sampling units (PSUs) per stratum. The total number of replicates R is the smallest multiple of 4 that is greater than the total number of strata H. However, if you prefer a larger number of replicates, you can specify the REPS= method-option.

For each replicate, Fay’s method uses a Fay coefficient $0 \le \epsilon < 1$ to impose a perturbation of the original weights in the full sample that is gentler than using only half-samples, as in the traditional BRR method. The Fay coefficient can be set by specifying the FAY=$\epsilon$ method-option. By default, $\epsilon = 0.5$ if the FAY method-option is specified without providing a value for $\epsilon$ (Judkins 1990; Rao and Shao 1999). When $\epsilon = 0$, Fay’s method becomes the traditional BRR method. For more information, see Dippo, Fay, and Morganstein (1984); Fay (1984, 1989); Judkins (1990).

Let H be the number of strata. Replicates are constructed by using the first H columns of the R × R Hadamard matrix, where R is the number of replicates, $R > H$. The rth ($r = 1, 2, \ldots, R$) replicate is created from the full sample according to the rth row of the Hadamard matrix as follows:

If the (r, h) element of the Hadamard matrix is 1, then the full sample weight of the first PSU in stratum h is multiplied by $2 - \epsilon$ and the full sample weight of the second PSU is multiplied by $\epsilon$ to obtain the rth replicate weights.
If the (r, h) element of the Hadamard matrix is –1, then the full sample weight of the first PSU in stratum h is multiplied by $\epsilon$ and the full sample weight of the second PSU is multiplied by $2 - \epsilon$ to obtain the rth replicate weights.
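A minimal Python sketch of the Fay perturbation follows, assuming the usual Fay multipliers of 2 - epsilon for one PSU of each pair and epsilon for the other. The function name and the example row are illustrative assumptions. Given one row of a Hadamard matrix, it returns the multipliers applied to the full sample weights of the first and second PSU in each stratum.

```python
def fay_multipliers(hadamard_row, n_strata, eps):
    """(first_psu, second_psu) weight multipliers for one replicate.

    hadamard_row : one row of a Hadamard matrix (entries 1 or -1)
    n_strata     : number of strata H; only the first H entries are used
    eps          : Fay coefficient, 0 <= eps < 1
    """
    if not 0.0 <= eps < 1.0:
        raise ValueError("the Fay coefficient must satisfy 0 <= eps < 1")
    multipliers = []
    for h in range(n_strata):
        if hadamard_row[h] == 1:
            multipliers.append((2.0 - eps, eps))
        else:
            multipliers.append((eps, 2.0 - eps))
    return multipliers


row = [1, -1, 1, -1]                      # illustrative Hadamard row
fay_half = fay_multipliers(row, 3, 0.5)   # [(1.5, 0.5), (0.5, 1.5), (1.5, 0.5)]
fay_zero = fay_multipliers(row, 3, 0.0)   # [(2.0, 0.0), (0.0, 2.0), (2.0, 0.0)]
```

Setting the coefficient to 0 gives multipliers of 2 and 0, which is the traditional BRR half-sample. In every case the two multipliers in a stratum sum to 2, so no PSU is dropped entirely when the coefficient is positive.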

You can use the VARMETHOD=BRR(OUTWEIGHTS=) method-option to save the replicate weights into a SAS data set.

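Let $\hat{\boldsymbol{\theta}}$ be the estimated regression coefficients from the full sample for $\boldsymbol{\theta}$. Let $\hat{\boldsymbol{\theta}}_r$ be the estimated regression coefficients obtained from the rth replicate by using replicate weights. PROC SURVEYLOGISTIC estimates the covariance matrix of $\hat{\boldsymbol{\theta}}$ by

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\theta}}) = \frac{1}{R (1 - \epsilon)^2} \sum_{r=1}^{R} \left( \hat{\boldsymbol{\theta}}_r - \hat{\boldsymbol{\theta}} \right) \left( \hat{\boldsymbol{\theta}}_r - \hat{\boldsymbol{\theta}} \right)'$$

If you specify the CENTER=REPLICATES method-option, then PROC SURVEYLOGISTIC computes the covariance matrix of $\hat{\boldsymbol{\theta}}$ by

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\theta}}) = \frac{1}{R (1 - \epsilon)^2} \sum_{r=1}^{R} \left( \hat{\boldsymbol{\theta}}_r - \bar{\hat{\boldsymbol{\theta}}} \right) \left( \hat{\boldsymbol{\theta}}_r - \bar{\hat{\boldsymbol{\theta}}} \right)'$$

where $\bar{\hat{\boldsymbol{\theta}}}$ is the average of the replicate estimates and is calculated as follows:

$$\bar{\hat{\boldsymbol{\theta}}} = \frac{1}{R} \sum_{r=1}^{R} \hat{\boldsymbol{\theta}}_r$$

If you do not provide your own value for the degrees of freedom, then the degrees of freedom equals the minimum of the number of replicates from which the parameter can be estimated and the number of strata, H.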

Jackknife Method

The jackknife method of variance estimation deletes one PSU at a time from the full sample to create replicates. The total number of replicates R is the same as the total number of PSUs. In each replicate, the sample weights of the remaining PSUs are modified by the jackknife coefficient $\alpha_r$. The modified weights are called replicate weights.

The jackknife coefficient and replicate weights are described as follows.

Without Stratification

If there is no stratification in the sample design (no STRATA statement), the jackknife coefficients are the same for all replicates:

$$\alpha_r = \frac{R - 1}{R} \quad \text{where } r = 1, 2, \ldots, R$$

Denote the original weight in the full sample for the jth member of the ith PSU as $w_{ij}$. If the ith PSU is included in the rth replicate ($r = 1, 2, \ldots, R$), then the corresponding replicate weight for the jth member of the ith PSU is defined as

$$w_{ij}^{(r)} = w_{ij} / \alpha_r$$

With Stratification

If the sample design involves stratification, each stratum must have at least two PSUs to use the jackknife method.

Let stratum $\tilde{h}_r$ be the stratum from which a PSU is deleted for the rth replicate. Stratum $\tilde{h}_r$ is called the donor stratum. Let $n_{\tilde{h}_r}$ be the total number of PSUs in the donor stratum $\tilde{h}_r$. The jackknife coefficients are defined as

$$\alpha_r = \frac{n_{\tilde{h}_r} - 1}{n_{\tilde{h}_r}} \quad \text{where } r = 1, 2, \ldots, R$$

If the ith PSU is included in the rth replicate, then the corresponding replicate weight for the jth member of the ith PSU is defined as

$$
w_{ij}^{(r)} =
\begin{cases}
w_{ij} & \text{if the } i\text{th PSU is not in the donor stratum } \tilde{h}_r \\
w_{ij} / \alpha_r & \text{if the } i\text{th PSU is in the donor stratum } \tilde{h}_r
\end{cases}
$$
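The delete-one-PSU construction with stratification can be sketched in Python as follows. Given the stratum label of each PSU, the sketch returns, for every replicate, the jackknife coefficient and the multiplier applied to the weights of each PSU. The function name and data layout are assumptions for illustration, and the deleted PSU is given a multiplier of zero, which is how this sketch represents deletion.

```python
def jackknife_replicates(psu_strata):
    """One replicate per PSU: returns a list of (alpha_r, multipliers).

    psu_strata : list of stratum labels, one entry per PSU, in data order.
    """
    counts = {}
    for s in psu_strata:
        counts[s] = counts.get(s, 0) + 1
    if any(c < 2 for c in counts.values()):
        raise ValueError("each stratum must have at least two PSUs")
    replicates = []
    for deleted, donor in enumerate(psu_strata):
        n_donor = counts[donor]
        alpha = (n_donor - 1) / n_donor          # jackknife coefficient
        multipliers = []
        for i, s in enumerate(psu_strata):
            if i == deleted:
                multipliers.append(0.0)          # deleted PSU
            elif s == donor:
                multipliers.append(1.0 / alpha)  # other PSUs in the donor stratum
            else:
                multipliers.append(1.0)          # PSUs in other strata
        replicates.append((alpha, multipliers))
    return replicates


# Three PSUs in stratum "A" and two in stratum "B" (made-up design).
reps = jackknife_replicates(["A", "A", "A", "B", "B"])
alpha_first, mult_first = reps[0]    # deletes the first PSU of stratum "A"
alpha_fourth, mult_fourth = reps[3]  # deletes the first PSU of stratum "B"
```

In the fourth replicate the donor stratum has two PSUs, so the coefficient is 1/2 and the remaining PSU in that stratum has its weights doubled, while PSUs in the other stratum keep their original weights.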

You can use the VARMETHOD=JACKKNIFE(OUTJKCOEFS=) method-option to save the jackknife coefficients into a SAS data set and use the VARMETHOD=JACKKNIFE(OUTWEIGHTS=) method-option to save the replicate weights into a SAS data set.

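If you provide your own replicate weights in a REPWEIGHTS statement, then you can also provide corresponding jackknife coefficients in the JKCOEFS= or REPCOEFS= option. If you provide replicate weights but do not provide jackknife coefficients, PROC SURVEYLOGISTIC uses $\alpha_r = (R - 1)/R$ as the jackknife coefficient for all replicates by default.

Let $\hat{\boldsymbol{\theta}}$ be the estimated regression coefficients from the full sample for $\boldsymbol{\theta}$, and let $\hat{\boldsymbol{\theta}}_r$ be the estimated regression coefficients obtained from the rth replicate by using replicate weights. PROC SURVEYLOGISTIC estimates the covariance matrix of $\hat{\boldsymbol{\theta}}$ by

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\theta}}) = \sum_{r=1}^{R} \alpha_r \left( \hat{\boldsymbol{\theta}}_r - \hat{\boldsymbol{\theta}} \right) \left( \hat{\boldsymbol{\theta}}_r - \hat{\boldsymbol{\theta}} \right)'$$

with $R - H$ degrees of freedom, where R is the number of replicates and H is the number of strata, or $R - 1$ when there is no stratification. If you provide your own replicate weights without specifying the DF= option, the degrees of freedom is set to be the number of replicates R.

If you specify the CENTER=REPLICATES method-option, then PROC SURVEYLOGISTIC computes the covariance matrix of $\hat{\boldsymbol{\theta}}$ by

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\theta}}) = \sum_{r=1}^{R} \alpha_r \left( \hat{\boldsymbol{\theta}}_r - \bar{\hat{\boldsymbol{\theta}}} \right) \left( \hat{\boldsymbol{\theta}}_r - \bar{\hat{\boldsymbol{\theta}}} \right)'$$

where $\bar{\hat{\boldsymbol{\theta}}}$ is the average of the replicate estimates and is calculated as follows:

$$\bar{\hat{\boldsymbol{\theta}}} = \frac{1}{R} \sum_{r=1}^{R} \hat{\boldsymbol{\theta}}_r$$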

If you do not provide your own value for the degrees of freedom, then the degrees of freedom equals .

Hadamard Matrix

A Hadamard matrix $\mathbf{H}$ is a square matrix whose elements are either 1 or –1 such that

$$\mathbf{H} \mathbf{H}' = k \mathbf{I}$$

where k is the dimension of $\mathbf{H}$ and $\mathbf{I}$ is the identity matrix of order k. The order k is necessarily 1, 2, or a positive integer that is a multiple of 4.

For example, the following matrix is a Hadamard matrix of dimension k = 8:

$$
\begin{array}{rrrrrrrr}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1
\end{array}
$$
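The defining property can be checked numerically. The Python sketch below builds Hadamard matrices whose order is a power of 2 by Sylvester's doubling construction and verifies that the product of the matrix with its transpose equals k times the identity. Sylvester's construction is used here only as a convenient illustration; it does not cover orders such as 12 or 20, and it is not claimed to be the construction that the procedure uses. The function names are hypothetical.

```python
def sylvester_hadamard(k):
    """Hadamard matrix of order k, where k is a power of 2, by repeated doubling."""
    if k < 1 or k & (k - 1):
        raise ValueError("this construction requires k to be a power of 2")
    h = [[1]]
    while len(h) < k:
        # Block form [[H, H], [H, -H]] doubles the order at each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def is_hadamard(h):
    """True if h is square with entries 1 or -1 and H H' = k I."""
    k = len(h)
    if any(len(row) != k or any(x not in (1, -1) for x in row) for row in h):
        return False
    return all(
        sum(h[a][c] * h[b][c] for c in range(k)) == (k if a == b else 0)
        for a in range(k)
        for b in range(k)
    )


h8 = sylvester_hadamard(8)
ok = is_hadamard(h8)
```

For k = 8 this construction yields the same 8 × 8 matrix as the example in this section, and `is_hadamard` confirms that its rows are mutually orthogonal.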

Last updated: December 09, 2022