The SURVEYLOGISTIC Procedure

Logistic Regression Models and Parameters

The SURVEYLOGISTIC procedure fits a logistic regression model and estimates the corresponding regression parameters. Each model uses the link function that you specify in the LINK= option in the MODEL statement. You can use four types of models with the procedure: the cumulative logit model, the complementary log-log model, the probit model, and the generalized logit model.

Notation

Let $Y$ be the response variable with categories $1, 2, \ldots, D, D+1$. The $k$ covariates are denoted by a $k$-dimensional row vector $\mathbf{x}$.

For a stratified clustered sample design, each observation is represented by a row vector, $(w_{hij}, \mathbf{y}'_{hij}, y_{hij(D+1)}, \mathbf{x}_{hij})$, where

  • $h = 1, 2, \ldots, H$ is the stratum index

  • $i = 1, 2, \ldots, n_h$ is the cluster index within stratum $h$

  • $j = 1, 2, \ldots, m_{hi}$ is the unit index within cluster $i$ of stratum $h$

  • $w_{hij}$ denotes the sampling weight

  • $\mathbf{y}_{hij}$ is a $D$-dimensional column vector whose elements are indicator variables for the first $D$ categories for variable $Y$. If the response of the $j$th unit of the $i$th cluster in stratum $h$ falls in category $d$, the $d$th element of the vector is one, and the remaining elements of the vector are zero, where $d = 1, 2, \ldots, D$.

  • $y_{hij(D+1)}$ is the indicator variable for the $(D+1)$ category of variable $Y$

  • $\mathbf{x}_{hij}$ denotes the $k$-dimensional row vector of explanatory variables for the $j$th unit of the $i$th cluster in stratum $h$. If there is an intercept, then $x_{hij1} \equiv 1$.

  • $\tilde{n} = \sum_{h=1}^{H} n_h$ is the total number of clusters in the sample

  • $n = \sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total sample size

The following notations are also used:

  • $f_h$ denotes the sampling rate for stratum $h$

  • $\boldsymbol{\pi}_{hij}$ is the expected vector of the response variable:

    \[
    \begin{aligned}
    \boldsymbol{\pi}_{hij} &= E(\mathbf{y}_{hij} \mid \mathbf{x}_{hij}) \\
      &= (\pi_{hij1}, \pi_{hij2}, \ldots, \pi_{hijD})' \\
    \pi_{hij(D+1)} &= E(y_{hij(D+1)} \mid \mathbf{x}_{hij})
    \end{aligned}
    \]

Note that $\pi_{hij(D+1)} = 1 - \mathbf{1}' \boldsymbol{\pi}_{hij}$, where $\mathbf{1}$ is a $D$-dimensional column vector whose elements are 1.
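
As a small illustration of this notation, the following Python sketch builds the indicator vector and the reference-category indicator for a single unit, and recovers the reference-category probability from the first $D$ probabilities. The function names and the example values are hypothetical and are not part of the procedure.

```python
def encode_response(category, D):
    """Return (y, y_ref) for a response that falls in category 1..D+1.

    y is the D-element indicator vector for the first D categories;
    y_ref is the indicator for category D+1.
    """
    if not 1 <= category <= D + 1:
        raise ValueError("category must be between 1 and D+1")
    y = [1 if d == category else 0 for d in range(1, D + 1)]
    y_ref = 1 if category == D + 1 else 0
    return y, y_ref


def reference_probability(pi):
    """Return pi_(D+1) = 1 - 1'pi for a D-element probability vector pi."""
    return 1.0 - sum(pi)


# Hypothetical unit with D = 3 whose response falls in category 2.
y, y_ref = encode_response(category=2, D=3)
print(y, y_ref)  # [0, 1, 0] 0

# Hypothetical expected proportions for the first three categories.
print(reference_probability([0.2, 0.5, 0.1]))
```

Exactly one of the $D+1$ indicators equals one for each unit, so the elements of the indicator vector and the reference indicator always sum to one.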

Logistic Regression Models

If the response categories of the response variable $Y$ can be restricted to a number of ordinal values, you can fit cumulative probabilities of the response categories with a cumulative logit model, a complementary log-log model, or a probit model. Details of cumulative logit models (or proportional odds models) can be found in McCullagh and Nelder (1989). If the response categories of $Y$ are nominal responses without natural ordering, you can fit the response probabilities with a generalized logit model. The formulation of generalized logit models for nominal response variables can be found in Agresti (2002). For each model, the procedure estimates the model parameter $\boldsymbol{\theta}$ by using a pseudo-log-likelihood function. The procedure obtains the pseudo-maximum likelihood estimator $\hat{\boldsymbol{\theta}}$ by using the iterations described in the section Iterative Algorithms for Model Fitting and estimates its variance as described in the section Variance Estimation.

Cumulative Logit Model

A cumulative logit model uses the logit function

\[
g(t) = \log\left(\frac{t}{1-t}\right)
\]

as the link function.

Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

\[
F_{hijd} = \sum_{r=1}^{d} \pi_{hijr}
\]

for $d = 1, 2, \ldots, D$. Then the cumulative logit model can be written as

\[
\log\left(\frac{F_{hijd}}{1 - F_{hijd}}\right) = \alpha_d + \mathbf{x}_{hij} \boldsymbol{\beta}
\]

with the model parameters

\[
\begin{aligned}
\boldsymbol{\beta} &= (\beta_1, \beta_2, \ldots, \beta_k)' \\
\boldsymbol{\alpha} &= (\alpha_1, \alpha_2, \ldots, \alpha_D)', \quad \alpha_1 < \alpha_2 < \cdots < \alpha_D \\
\boldsymbol{\theta} &= (\boldsymbol{\alpha}', \boldsymbol{\beta}')'
\end{aligned}
\]

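
To make the cumulative logit formulation concrete, the following Python sketch inverts the logit link to obtain the cumulative probabilities and then differences them to recover the individual category probabilities. It is an illustration only: the function name and parameter values are hypothetical, and the covariate vector is assumed not to contain an intercept column because the intercepts are carried by the $\alpha_d$.

```python
import math


def cumulative_logit_probs(alpha, beta, x):
    """Category probabilities pi_1..pi_(D+1) under a cumulative logit model.

    alpha: D strictly increasing intercepts
    beta:  k slope coefficients
    x:     k covariate values (assumed to exclude an intercept column)
    """
    if any(a2 <= a1 for a1, a2 in zip(alpha, alpha[1:])):
        raise ValueError("alpha must be strictly increasing")
    eta = sum(b * v for b, v in zip(beta, x))
    # Inverse logit gives the cumulative probability F_d for d = 1..D.
    F = [1.0 / (1.0 + math.exp(-(a + eta))) for a in alpha]
    F.append(1.0)  # all D+1 categories together have probability 1
    # pi_d = F_d - F_(d-1), with F_0 = 0.
    return [F[0]] + [F[d] - F[d - 1] for d in range(1, len(F))]


# Hypothetical parameters for D = 2 (three response categories), k = 1.
probs = cumulative_logit_probs(alpha=[-1.0, 0.5], beta=[0.8], x=[0.25])
print(probs, sum(probs))
```

The ordering constraint on the intercepts is what keeps every differenced probability positive.
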
Complementary Log-Log Model

A complementary log-log model uses the complementary log-log function

\[
g(t) = \log(-\log(1 - t))
\]

as the link function. Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

\[
F_{hijd} = \sum_{r=1}^{d} \pi_{hijr}
\]

for $d = 1, 2, \ldots, D$. Then the complementary log-log model can be written as

\[
\log(-\log(1 - F_{hijd})) = \alpha_d + \mathbf{x}_{hij} \boldsymbol{\beta}
\]

with the model parameters

\[
\begin{aligned}
\boldsymbol{\beta} &= (\beta_1, \beta_2, \ldots, \beta_k)' \\
\boldsymbol{\alpha} &= (\alpha_1, \alpha_2, \ldots, \alpha_D)', \quad \alpha_1 < \alpha_2 < \cdots < \alpha_D \\
\boldsymbol{\theta} &= (\boldsymbol{\alpha}', \boldsymbol{\beta}')'
\end{aligned}
\]

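
The following Python sketch shows the complementary log-log link and its inverse, and uses the inverse to turn linear predictors into category probabilities. The function names and parameter values are hypothetical, and the covariate vector is assumed not to contain an intercept column.

```python
import math


def cloglog(t):
    """Complementary log-log link: g(t) = log(-log(1 - t)), for 0 < t < 1."""
    return math.log(-math.log(1.0 - t))


def inv_cloglog(eta):
    """Inverse link: solves g(F) = eta, giving F = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


def cloglog_model_probs(alpha, beta, x):
    """Category probabilities pi_1..pi_(D+1) under the cumulative cloglog model."""
    eta = sum(b * v for b, v in zip(beta, x))
    F = [inv_cloglog(a + eta) for a in alpha] + [1.0]
    # pi_d = F_d - F_(d-1), with F_0 = 0.
    return [F[0]] + [F[d] - F[d - 1] for d in range(1, len(F))]


# The link is asymmetric: unlike the logit, it is not zero at t = 0.5.
print(cloglog(0.5))  # log(log(2)), about -0.3665

# Hypothetical parameters for D = 2 (three response categories), k = 1.
print(cloglog_model_probs(alpha=[-1.0, 0.5], beta=[0.8], x=[0.25]))
```
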
Probit Model

A probit model uses the probit (or normit) function, which is the inverse of the cumulative standard normal distribution function,

\[
g(t) = \Phi^{-1}(t)
\]

as the link function, where

\[
\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-\frac{1}{2} z^2} \, dz
\]

Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

\[
F_{hijd} = \sum_{r=1}^{d} \pi_{hijr}
\]

for $d = 1, 2, \ldots, D$. Then the probit model can be written as

\[
F_{hijd} = \Phi(\alpha_d + \mathbf{x}_{hij} \boldsymbol{\beta})
\]

with the model parameters

\[
\begin{aligned}
\boldsymbol{\beta} &= (\beta_1, \beta_2, \ldots, \beta_k)' \\
\boldsymbol{\alpha} &= (\alpha_1, \alpha_2, \ldots, \alpha_D)', \quad \alpha_1 < \alpha_2 < \cdots < \alpha_D \\
\boldsymbol{\theta} &= (\boldsymbol{\alpha}', \boldsymbol{\beta}')'
\end{aligned}
\]

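
As a rough illustration of the probit formulation, the following Python sketch evaluates the standard normal distribution function with the standard library's `statistics.NormalDist` and differences the cumulative probabilities into category probabilities. The function names and parameter values are hypothetical, and the covariate vector is assumed not to contain an intercept column.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
_STD_NORMAL = NormalDist()


def probit(t):
    """Probit link: the inverse standard normal CDF, for 0 < t < 1."""
    return _STD_NORMAL.inv_cdf(t)


def probit_model_probs(alpha, beta, x):
    """Category probabilities pi_1..pi_(D+1) under the cumulative probit model."""
    eta = sum(b * v for b, v in zip(beta, x))
    # F_d = Phi(alpha_d + x beta) for d = 1..D, and F_(D+1) = 1.
    F = [_STD_NORMAL.cdf(a + eta) for a in alpha] + [1.0]
    # pi_d = F_d - F_(d-1), with F_0 = 0.
    return [F[0]] + [F[d] - F[d - 1] for d in range(1, len(F))]


# Hypothetical parameters for D = 2 (three response categories), k = 1.
probs = probit_model_probs(alpha=[-1.0, 0.5], beta=[0.8], x=[0.25])
print(probs, sum(probs))
```
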
Generalized Logit Model

For a nominal response, a generalized logit model uses a logit link function to model the ratio of the expected proportion for each response category to the expected proportion of a reference category.

Without loss of generality, let category $D+1$ be the reference category for the response variable $Y$. Denote the expected proportion for the $d$th category by $\pi_{hijd}$ as in the section Notation. Then the generalized logit model can be written as

\[
\log\left(\frac{\pi_{hijd}}{\pi_{hij(D+1)}}\right) = \mathbf{x}_{hij} \boldsymbol{\beta}_d
\]

for $d = 1, 2, \ldots, D$, with the model parameters

\[
\begin{aligned}
\boldsymbol{\beta}_d &= (\beta_{d1}, \beta_{d2}, \ldots, \beta_{dk})' \\
\boldsymbol{\theta} &= (\boldsymbol{\beta}'_1, \boldsymbol{\beta}'_2, \ldots, \boldsymbol{\beta}'_D)'
\end{aligned}
\]
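
Solving the $D$ log-ratio equations together with the constraint that the $D+1$ probabilities sum to one gives each category probability in closed form. The following Python sketch shows that calculation; the function name and coefficient values are hypothetical, and the first covariate is assumed to be an intercept.

```python
import math


def generalized_logit_probs(betas, x):
    """Category probabilities pi_1..pi_(D+1), with category D+1 as reference.

    betas: list of D coefficient vectors, one per non-reference category
    x:     k covariate values (here the first is assumed to be an intercept)
    """
    etas = [sum(b * v for b, v in zip(beta_d, x)) for beta_d in betas]
    # The reference category has linear predictor 0, so it contributes exp(0) = 1.
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]
    probs.append(1.0 / denom)  # reference category D+1
    return probs


# Hypothetical coefficients for D = 2 (three nominal categories), k = 2.
betas = [[0.3, -0.5], [-0.2, 1.1]]
x = [1.0, 0.4]
probs = generalized_logit_probs(betas, x)
print(probs, sum(probs))
```

Each non-reference category has its own coefficient vector, so the parameter vector has $D \times k$ elements, in contrast to the $D + k$ elements of the cumulative models.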

Likelihood Function

Let $\mathbf{g}(\cdot)$ be a link function such that

\[
\boldsymbol{\pi} = \mathbf{g}(\mathbf{x}, \boldsymbol{\theta})
\]

where $\boldsymbol{\theta}$ is a column vector for regression coefficients. The pseudo-log likelihood is

\[
l(\boldsymbol{\theta}) = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} \left( (\log(\boldsymbol{\pi}_{hij}))' \mathbf{y}_{hij} + \log(\pi_{hij(D+1)}) \, y_{hij(D+1)} \right)
\]
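
Because the response indicators select exactly one category per unit, each unit contributes its sampling weight times the log probability of its observed category. The following Python sketch evaluates the sum on that basis. It is only an illustration: the function name and data are hypothetical, and the stratum and cluster indices are flattened into a single list of units because they only organize the sum.

```python
import math


def pseudo_log_likelihood(weights, categories, probs):
    """Weighted log likelihood over all sampled units.

    weights:    sampling weight w for each unit
    categories: observed category (1..D+1) for each unit
    probs:      for each unit, a list of D+1 model probabilities
                (pi_1..pi_D followed by pi_(D+1))
    """
    total = 0.0
    for w, c, p in zip(weights, categories, probs):
        # The indicators pick out log(pi) of the observed category only.
        total += w * math.log(p[c - 1])
    return total


# Two hypothetical units with D = 2 (three response categories).
weights = [2.0, 1.5]
categories = [1, 3]
probs = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
print(pseudo_log_likelihood(weights, categories, probs))
```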

Denote the pseudo-estimator as $\hat{\boldsymbol{\theta}}$, which is a solution to the estimating equations:

\[
\sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij} \mathbf{D}_{hij} \left( \mathrm{diag}(\boldsymbol{\pi}_{hij}) - \boldsymbol{\pi}_{hij} \boldsymbol{\pi}'_{hij} \right)^{-1} (\mathbf{y}_{hij} - \boldsymbol{\pi}_{hij}) = \mathbf{0}
\]

where $\mathbf{D}_{hij}$ is the matrix of partial derivatives of the link function $\mathbf{g}$ with respect to $\boldsymbol{\theta}$.

To obtain the pseudo-estimator $\hat{\boldsymbol{\theta}}$, the procedure uses iterations with a starting value $\boldsymbol{\theta}^{(0)}$ for $\boldsymbol{\theta}$. See the section Iterative Algorithms for Model Fitting for more details.

Last updated: December 09, 2022