The LOGISTIC Procedure

Scoring Data Sets

Scoring a data set, which is especially important for predictive modeling, means applying a previously fitted model to a new data set in order to compute the conditional, or posterior, probabilities of each response category given the values of the explanatory variables in each observation.

The SCORE statement enables you to score new data sets and output the scored values and, optionally, the corresponding confidence limits into a SAS data set. If the response variable is included in the new data set, then you can request fit statistics for the data, which is especially useful for test or validation data. If the response is binary, you can also create a SAS data set containing the receiver operating characteristic (ROC) curve. You can specify multiple SCORE statements in the same invocation of PROC LOGISTIC.

By default, the posterior probabilities are based on implicit prior probabilities that are proportional to the frequencies of the response categories in the training data (the data used to fit the model). Explicit prior probabilities should be specified with the PRIOR= or PRIOREVENT= option when the sample proportions of the response categories in the training data differ substantially from those in the operational data to be scored. For example, to detect a rare category, it is common practice to use a training set in which the rare categories are overrepresented; without prior probabilities that reflect the true incidence rate, the predicted posterior probabilities for the rare category will be too high. Specifying the correct priors adjusts the posterior probabilities appropriately.

The model fit to the DATA= data set in the PROC LOGISTIC statement is the default model used for the scoring. Alternatively, you can save a model fit in one run of PROC LOGISTIC and use it to score new data in a subsequent run. The OUTMODEL= option in the PROC LOGISTIC statement saves the model information in a SAS data set. Specifying this data set in the INMODEL= option of a new PROC LOGISTIC run will score the DATA= data set in the SCORE statement without refitting the model.

The STORE statement can also be used to save your model. The PLM procedure can use this model to score new data sets; see Chapter 94, The PLM Procedure, for more information. You cannot specify priors in PROC PLM.

Fit Statistics for Scored Data Sets

Specifying the FITSTAT option displays the following fit statistics when the data set being scored includes the response variable:

| Statistic | Description |
|---|---|
| Total frequency | $F = \sum_i f_i n_i$ |
| Total weight | $W = \sum_i f_i w_i n_i$ |
| Log likelihood | $\log L = \sum_i f_i w_i \log(\hat{\pi}_i)$ |
| Full log likelihood | $\log L_f = \text{constant} + \log L$ |
| Misclassification (error) rate | $\dfrac{\sum_i 1\{\mathrm{F\_Y}_i \ne \mathrm{I\_Y}_i\}\, f_i n_i}{F}$ |
| AIC | $-2 \log L_f + 2p$ |
| AICC | $-2 \log L_f + \dfrac{2pn}{n - p - 1}$ |
| BIC | $-2 \log L_f + p \log(n)$ |
| SC | $-2 \log L_f + p \log(F)$ |
| R-square | $R^2 = 1 - \left(\dfrac{L_0}{L}\right)^{2/F}$ |
| Maximum-rescaled R-square | $\dfrac{R^2}{1 - L_0^{2/F}}$ |
| AUC | Area under the ROC curve |
| Brier score (polytomous response) | $\dfrac{1}{W} \sum_i f_i w_i \sum_j (y_{ij} - \hat{\pi}_{ij})^2$ |
| Brier score (binary response) | $\dfrac{1}{W} \sum_i f_i w_i \left( r_i (1 - \hat{\pi}_i)^2 + (n_i - r_i)\, \hat{\pi}_i^2 \right)$ |
| Brier reliability (events/trials syntax) | $\dfrac{1}{W} \sum_i f_i w_i \left( r_i / n_i - \hat{\pi}_i \right)^2$ |

In the preceding table, $f_i$ is the frequency of the $i$th observation in the data set being scored, $w_i$ is the weight of the observation, and $n = \sum_i f_i$. The number of trials when events/trials syntax is specified is $n_i$, and with single-trial syntax $n_i = 1$. The values $\mathrm{F\_Y}_i$ and $\mathrm{I\_Y}_i$ are described in the section OUT= Output Data Set in a SCORE Statement. The indicator function $1\{A\}$ is 1 if $A$ is true and 0 otherwise. The likelihood of the model is $L$, and $L_0$ denotes the likelihood of the intercept-only model. For polytomous response models, $y_i$ is the observed polytomous response level, $\hat{\pi}_{ij}$ is the predicted probability of the $j$th response level for observation $i$, and $y_{ij} = 1\{y_i = j\}$. For binary response models, $\hat{\pi}_i$ is the predicted probability of the observation, $r_i$ is the number of events when you specify events/trials syntax, and $r_i = y_i$ when you specify single-trial syntax.
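As a rough illustration of a few of the table entries, the following Python sketch computes the total frequency, total weight, misclassification rate, and binary Brier score for a small scored data set. The data values, the function name `fit_statistics`, and the 0.5 cutoff used to stand in for the predicted level are all assumptions made for this example; it uses single-trial syntax, so $n_i = 1$ and $r_i = y_i$. It is not the procedure's implementation.

```python
# Hypothetical scored observations (single-trial syntax, so n_i = 1):
# (observed event indicator y_i, predicted event probability pi_hat_i,
#  frequency f_i, weight w_i)
rows = [
    (1, 0.8, 1, 1.0),
    (0, 0.3, 2, 1.0),
    (1, 0.4, 1, 1.0),
    (0, 0.1, 1, 1.0),
]

def fit_statistics(rows, cutoff=0.5):
    """Return (F, W, misclassification rate, binary Brier score)."""
    total_freq = sum(f for _, _, f, _ in rows)          # F = sum f_i n_i
    total_weight = sum(f * w for _, _, f, w in rows)    # W = sum f_i w_i n_i
    # Assumed classification rule: predict the event when pi_hat >= cutoff.
    errors = sum(f for y, p, f, _ in rows if (p >= cutoff) != (y == 1))
    # Binary Brier score with r_i = y_i and n_i - r_i = 1 - y_i.
    brier = sum(
        f * w * (y * (1.0 - p) ** 2 + (1 - y) * p ** 2) for y, p, f, w in rows
    ) / total_weight
    return total_freq, total_weight, errors / total_freq, brier

F, W, error_rate, brier = fit_statistics(rows)
```

Here only the third row is misclassified under the assumed cutoff, so the error rate is its frequency divided by $F$.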

The log likelihood, Akaike’s information criterion (AIC), and Schwarz criterion (SC) are described in the section Model Fitting Information. The full log likelihood is displayed for models specified with events/trials syntax, and the constant term is described in the section Model Fitting Information. The AICC is a small-sample bias-corrected version of the AIC (Hurvich and Tsai 1993; Burnham and Anderson 1998). The Bayesian information criterion (BIC) is the same as the SC except when events/trials syntax is specified. The area under the ROC curve for binary response models is defined in the section ROC Computations. The R-square and maximum-rescaled R-square statistics, defined in Generalized Coefficient of Determination, are not computed when you specify both an OFFSET= variable and the INMODEL= data set. The Brier score (Brier 1950) is the weighted squared difference between the predicted probabilities and their observed response levels. For events/trials syntax, the Brier reliability is the weighted squared difference between the predicted probabilities and the observed proportions (Murphy 1973).
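The information criteria and R-square statistics in the table are simple functions of the log likelihoods. The sketch below evaluates them in Python; the function names are hypothetical, and `p` is assumed to be the number of model parameters, which the table itself does not define. The R-square functions take log likelihoods rather than likelihoods to avoid underflow, using $(L_0/L)^{2/F} = \exp\{2(\log L_0 - \log L)/F\}$.

```python
import math

def information_criteria(log_lf, p, n, total_freq):
    """AIC, AICC, BIC, and SC from the full log likelihood log_lf.

    p is assumed to be the number of parameters; n = sum of f_i;
    total_freq = F = sum of f_i * n_i.
    """
    return {
        "AIC": -2.0 * log_lf + 2.0 * p,
        "AICC": -2.0 * log_lf + 2.0 * p * n / (n - p - 1),
        "BIC": -2.0 * log_lf + p * math.log(n),
        "SC": -2.0 * log_lf + p * math.log(total_freq),
    }

def r_square(log_l0, log_l, total_freq):
    """Generalized R-square: 1 - (L0 / L)^(2 / F)."""
    return 1.0 - math.exp(2.0 * (log_l0 - log_l) / total_freq)

def max_rescaled_r_square(log_l0, log_l, total_freq):
    """R-square divided by its upper bound 1 - L0^(2 / F)."""
    upper_bound = 1.0 - math.exp(2.0 * log_l0 / total_freq)
    return r_square(log_l0, log_l, total_freq) / upper_bound

# With single-trial syntax n equals F, so BIC and SC coincide.
criteria = information_criteria(log_lf=-50.0, p=3, n=100, total_freq=100)
```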

Posterior Probabilities and Confidence Limits

Let F be the inverse link function. That is,

$$
F(t) =
\begin{cases}
\dfrac{1}{1 + \exp(-t)} & \text{logistic} \\
\Phi(t) & \text{normal} \\
1 - \exp(-\exp(t)) & \text{complementary log-log}
\end{cases}
$$

The first derivative of F is given by

$$
F'(t) =
\begin{cases}
\dfrac{\exp(-t)}{(1 + \exp(-t))^2} & \text{logistic} \\
\phi(t) & \text{normal} \\
\exp(t)\exp(-\exp(t)) & \text{complementary log-log}
\end{cases}
$$
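The three inverse links and their first derivatives can be written down directly. The following Python sketch is one way to do so using only the standard library; the dictionary name `INVERSE_LINKS` and its keys are hypothetical labels chosen for this example, and $\Phi$ is evaluated through the error function.

```python
import math

def normal_cdf(t):
    """Standard normal CDF, Phi(t), via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def normal_pdf(t):
    """Standard normal density, phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

# Each entry maps a link name to the pair (F, F').
INVERSE_LINKS = {
    "logistic": (
        lambda t: 1.0 / (1.0 + math.exp(-t)),
        lambda t: math.exp(-t) / (1.0 + math.exp(-t)) ** 2,
    ),
    "normal": (normal_cdf, normal_pdf),
    "cloglog": (
        lambda t: 1.0 - math.exp(-math.exp(t)),
        lambda t: math.exp(t) * math.exp(-math.exp(t)),
    ),
}
```

A central finite difference of `F` is a convenient check that each `F'` matches its inverse link.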

Suppose there are $k + 1$ response categories. Let $Y$ be the response variable with levels $1, \ldots, k + 1$. Let $\mathbf{x} = (x_0, x_1, \ldots, x_s)'$ be an $(s + 1)$-vector of covariates, with $x_0 \equiv 1$. Let $\boldsymbol{\beta}$ be the vector of intercept and slope regression parameters.

Posterior probabilities are given by

$$
p(Y = i \mid \mathbf{x}) =
\frac{p_o(Y = i \mid \mathbf{x})\, \dfrac{p_n(Y = i)}{p_o(Y = i)}}
     {\sum_j p_o(Y = j \mid \mathbf{x})\, \dfrac{p_n(Y = j)}{p_o(Y = j)}},
\quad i = 1, \ldots, k + 1
$$

where the old posterior probabilities ($p_o(Y = i \mid \mathbf{x}),\ i = 1, \ldots, k + 1$) are the conditional probabilities of the response categories given $\mathbf{x}$, the old priors ($p_o(Y = i),\ i = 1, \ldots, k + 1$) are the sample proportions of response categories of the training data, and the new priors ($p_n(Y = i),\ i = 1, \ldots, k + 1$) are specified in the PRIOR= or PRIOREVENT= option. To simplify notation, absorb the old priors into the new priors; that is,

$$
p(Y = i) = \frac{p_n(Y = i)}{p_o(Y = i)}, \quad i = 1, \ldots, k + 1
$$

Note that if the PRIOR= and PRIOREVENT= options are not specified, then $p(Y = i) = 1$.
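The prior adjustment is a reweighting followed by renormalization. As a minimal Python sketch (the function name `adjust_posteriors` and the numbers are made up for illustration), suppose a model is trained on data in which a rare event makes up half the sample, while its true incidence is assumed to be 5%:

```python
def adjust_posteriors(old_posteriors, old_priors, new_priors):
    """Reweight old posteriors p_o(Y=i|x) by p_n(Y=i) / p_o(Y=i) and renormalize."""
    weighted = [
        post * new / old
        for post, old, new in zip(old_posteriors, old_priors, new_priors)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical values: the fitted model gives the event a posterior of 0.6,
# the training proportions are 50/50, and the assumed true incidence is 5%.
adjusted = adjust_posteriors(
    old_posteriors=[0.6, 0.4],
    old_priors=[0.5, 0.5],
    new_priors=[0.05, 0.95],
)
```

In this example the event probability drops from 0.6 to 0.06 / 0.82, which illustrates why unadjusted probabilities from an oversampled training set are too high. When the new priors equal the old priors, every ratio is 1 and the posteriors are unchanged.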

The posterior probabilities are functions of $\boldsymbol{\beta}$, and their estimates are obtained by substituting the MLE $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$. The variances of the estimated posterior probabilities are given by the delta method as follows:

$$
\mathrm{Var}\left(\hat{p}(Y = i \mid \mathbf{x})\right) =
\left[\frac{\partial p(Y = i \mid \mathbf{x})}{\partial \boldsymbol{\beta}}\right]'
\mathrm{Var}(\hat{\boldsymbol{\beta}})
\left[\frac{\partial p(Y = i \mid \mathbf{x})}{\partial \boldsymbol{\beta}}\right]
$$

where

$$
\frac{\partial p(Y = i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} =
\frac{\dfrac{\partial p_o(Y = i \mid \mathbf{x})}{\partial \boldsymbol{\beta}}\, p(Y = i)}
     {\sum_j p_o(Y = j \mid \mathbf{x})\, p(Y = j)}
-
\frac{p_o(Y = i \mid \mathbf{x})\, p(Y = i) \sum_j \dfrac{\partial p_o(Y = j \mid \mathbf{x})}{\partial \boldsymbol{\beta}}\, p(Y = j)}
     {\left[\sum_j p_o(Y = j \mid \mathbf{x})\, p(Y = j)\right]^2}
$$

and the old posterior probabilities $p_o(Y = i \mid \mathbf{x})$ are described in the following sections.

A $100(1 - \alpha)\%$ confidence interval for $p(Y = i \mid \mathbf{x})$ is

$$
\hat{p}(Y = i \mid \mathbf{x}) \pm z_{1 - \alpha/2} \sqrt{\widehat{\mathrm{Var}}\left(\hat{p}(Y = i \mid \mathbf{x})\right)}
$$

where $z_\tau$ is the upper $100\tau$th percentile of the standard normal distribution.
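To make the delta-method steps concrete, here is a hedged Python sketch for the simplest case: a binary logistic model with two response levels, where $p_o(Y = 1 \mid \mathbf{x}) = F(\mathbf{x}'\boldsymbol{\beta})$ and $p_o(Y = 2 \mid \mathbf{x}) = 1 - F(\mathbf{x}'\boldsymbol{\beta})$. The function names, the coefficient values, and the covariance matrix are assumptions made for illustration only; `prior_ratios` holds the absorbed priors $p(Y = 1)$ and $p(Y = 2)$.

```python
import math
from statistics import NormalDist

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def adjusted_probability(x, beta, prior_ratios):
    """p(Y=1|x) after reweighting the old posteriors by the prior ratios."""
    p1 = logistic(sum(xi * bi for xi, bi in zip(x, beta)))
    a1, a2 = prior_ratios
    return p1 * a1 / (p1 * a1 + (1.0 - p1) * a2)

def adjusted_gradient(x, beta, prior_ratios):
    """Gradient of p(Y=1|x) with respect to beta, term by term as in the text."""
    p1 = logistic(sum(xi * bi for xi, bi in zip(x, beta)))
    d_f = p1 * (1.0 - p1)              # logistic F'(eta)
    a1, a2 = prior_ratios
    denom = p1 * a1 + (1.0 - p1) * a2  # sum_j p_o(Y=j|x) p(Y=j)
    grad = []
    for xi in x:
        dp1 = d_f * xi                 # derivative of p_o(Y=1|x)
        dp2 = -d_f * xi                # derivative of p_o(Y=2|x)
        dsum = dp1 * a1 + dp2 * a2
        grad.append(dp1 * a1 / denom - p1 * a1 * dsum / denom ** 2)
    return grad

def delta_method_interval(x, beta, cov, prior_ratios, alpha=0.05):
    """Return (estimate, variance, (lower, upper)) for p(Y=1|x)."""
    p = adjusted_probability(x, beta, prior_ratios)
    g = adjusted_gradient(x, beta, prior_ratios)
    m = len(g)
    var = sum(g[r] * cov[r][c] * g[c] for r in range(m) for c in range(m))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(var)
    return p, var, (p - half_width, p + half_width)

# Hypothetical fitted values and covariance matrix.
estimate, variance, limits = delta_method_interval(
    x=[1.0, 2.0],
    beta=[-0.5, 0.3],
    cov=[[0.04, -0.01], [-0.01, 0.02]],
    prior_ratios=(0.1, 1.9),
)
```

Because the interval is symmetric about the estimate, its limits are not guaranteed to stay inside $[0, 1]$ when the estimate is close to a boundary.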

Binary and Cumulative Response Models

Let $\alpha_1, \ldots, \alpha_k$ be the intercept parameters and let $\boldsymbol{\beta}_s$ be the vector of slope parameters. Denote $\boldsymbol{\beta} = (\alpha_1, \ldots, \alpha_k, \boldsymbol{\beta}_s')'$. Let

$$
\eta_i = \eta_i(\boldsymbol{\beta}) = \alpha_i + \mathbf{x}'\boldsymbol{\beta}_s, \quad i = 1, \ldots, k
$$

Estimates of $\eta_1, \ldots, \eta_k$ are obtained by substituting the maximum likelihood estimate $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$.

The predicted probabilities of the responses are

$$
\hat{p}_o(Y = i \mid \mathbf{x}) = \widehat{\Pr}(Y = i) =
\begin{cases}
F(\hat{\eta}_1) & i = 1 \\
F(\hat{\eta}_i) - F(\hat{\eta}_{i-1}) & i = 2, \ldots, k \\
1 - F(\hat{\eta}_k) & i = k + 1
\end{cases}
$$
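The predicted category probabilities are successive differences of the cumulative values $F(\hat{\eta}_i)$. The Python sketch below illustrates this for a logistic link; the function name, the intercepts, and the slope are hypothetical, and `x` here holds only the slope covariates (no leading 1), since the intercepts are passed separately.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def cumulative_model_probabilities(alphas, slopes, x, inverse_link=logistic):
    """Category probabilities for levels 1..k+1 from k ordered intercepts."""
    xb = sum(xi * bi for xi, bi in zip(x, slopes))
    # Cumulative probabilities P(Y <= i | x) for i = 1..k, then 1 for i = k+1.
    cumulative = [inverse_link(a + xb) for a in alphas] + [1.0]
    probs = [cumulative[0]]
    for i in range(1, len(cumulative)):
        probs.append(cumulative[i] - cumulative[i - 1])
    return probs

# Hypothetical three-level model: two intercepts and one slope.
probs = cumulative_model_probabilities(alphas=[-1.0, 1.0], slopes=[0.5], x=[0.0])
```

The differences are nonnegative only when the intercepts are in increasing order.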

For $i = 1, \ldots, k$, let $\boldsymbol{\delta}_i(\mathbf{x})$ be a $(k + 1)$ column vector with $i$th entry equal to 1, $(k + 1)$th entry equal to $\mathbf{x}$, and all other entries 0. The derivatives of $p_o(Y = i \mid \mathbf{x})$ with respect to $\boldsymbol{\beta}$ are

$$
\frac{\partial p_o(Y = i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} =
\begin{cases}
F'(\alpha_1 + \mathbf{x}'\boldsymbol{\beta}_s)\, \boldsymbol{\delta}_1(\mathbf{x}) & i = 1 \\
F'(\alpha_i + \mathbf{x}'\boldsymbol{\beta}_s)\, \boldsymbol{\delta}_i(\mathbf{x}) - F'(\alpha_{i-1} + \mathbf{x}'\boldsymbol{\beta}_s)\, \boldsymbol{\delta}_{i-1}(\mathbf{x}) & i = 2, \ldots, k \\
-F'(\alpha_k + \mathbf{x}'\boldsymbol{\beta}_s)\, \boldsymbol{\delta}_k(\mathbf{x}) & i = k + 1
\end{cases}
$$

The cumulative posterior probabilities are

$$
p(Y \le i \mid \mathbf{x}) =
\frac{\sum_{j=1}^{i} p_o(Y = j \mid \mathbf{x})\, p(Y = j)}
     {\sum_{j=1}^{k+1} p_o(Y = j \mid \mathbf{x})\, p(Y = j)}
= \sum_{j=1}^{i} p(Y = j \mid \mathbf{x}), \quad i = 1, \ldots, k + 1
$$

Their derivatives are

$$
\frac{\partial p(Y \le i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} =
\sum_{j=1}^{i} \frac{\partial p(Y = j \mid \mathbf{x})}{\partial \boldsymbol{\beta}}, \quad i = 1, \ldots, k + 1
$$

In the delta-method equation for the variance, replace $p(Y = \cdot \mid \mathbf{x})$ with $p(Y \le \cdot \mid \mathbf{x})$.

Finally, for the cumulative response model, use

$$
\begin{aligned}
\hat{p}_o(Y \le i \mid \mathbf{x}) &= F(\hat{\eta}_i), \quad i = 1, \ldots, k \\
\hat{p}_o(Y \le k + 1 \mid \mathbf{x}) &= 1 \\
\frac{\partial p_o(Y \le i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} &= F'(\alpha_i + \mathbf{x}'\boldsymbol{\beta}_s)\, \boldsymbol{\delta}_i(\mathbf{x}), \quad i = 1, \ldots, k \\
\frac{\partial p_o(Y \le k + 1 \mid \mathbf{x})}{\partial \boldsymbol{\beta}} &= 0
\end{aligned}
$$

Generalized Logit Model

Consider the last response level ($Y = k + 1$) as the reference. Let $\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_k$ be the (intercept and slope) parameter vectors for the first $k$ logits, respectively. Denote $\boldsymbol{\beta} = (\boldsymbol{\beta}_1', \ldots, \boldsymbol{\beta}_k')'$. Let $\boldsymbol{\eta} = (\eta_1, \ldots, \eta_k)'$ with

$$
\eta_i = \eta_i(\boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}_i, \quad i = 1, \ldots, k
$$

Estimates of $\eta_1, \ldots, \eta_k$ are obtained by substituting the maximum likelihood estimate $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$.

The predicted probabilities are

$$
\begin{aligned}
\hat{p}_o(Y = k + 1 \mid \mathbf{x}) \equiv \Pr(Y = k + 1 \mid \mathbf{x}) &= \frac{1}{1 + \sum_{l=1}^{k} \exp(\hat{\eta}_l)} \\
\hat{p}_o(Y = i \mid \mathbf{x}) \equiv \Pr(Y = i \mid \mathbf{x}) &= \hat{p}_o(Y = k + 1 \mid \mathbf{x}) \exp(\hat{\eta}_i), \quad i = 1, \ldots, k
\end{aligned}
$$
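A minimal Python sketch of these predicted probabilities follows, assuming `betas` is a list of the $k$ estimated parameter vectors and `x` includes the leading 1 for the intercept. The function name and parameter values are hypothetical, and no overflow protection is included for large linear predictors.

```python
import math

def generalized_logit_probabilities(betas, x):
    """Probabilities for levels 1..k+1, with level k+1 as the reference."""
    etas = [sum(xi * bi for xi, bi in zip(x, beta)) for beta in betas]
    # Reference-level probability: 1 / (1 + sum_l exp(eta_l)).
    reference = 1.0 / (1.0 + sum(math.exp(eta) for eta in etas))
    return [reference * math.exp(eta) for eta in etas] + [reference]

# Hypothetical two-level example: one logit with intercept log(2) and zero slope.
probs = generalized_logit_probabilities(betas=[[math.log(2.0), 0.0]], x=[1.0, 3.0])
```

With an intercept of $\log 2$ and a zero slope, the odds of level 1 against the reference are 2, so the two probabilities are 2/3 and 1/3.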

The derivatives of $p_o(Y = i \mid \mathbf{x})$ with respect to $\boldsymbol{\beta}$ are

$$
\begin{aligned}
\frac{\partial p_o(Y = i \mid \mathbf{x})}{\partial \boldsymbol{\beta}}
&= \frac{\partial \boldsymbol{\eta}}{\partial \boldsymbol{\beta}}\, \frac{\partial p_o(Y = i \mid \mathbf{x})}{\partial \boldsymbol{\eta}} \\
&= (I_k \otimes \mathbf{x}) \left( \frac{\partial p_o(Y = i \mid \mathbf{x})}{\partial \eta_1}, \cdots, \frac{\partial p_o(Y = i \mid \mathbf{x})}{\partial \eta_k} \right)'
\end{aligned}
$$

where

$$
\frac{\partial p_o(Y = i \mid \mathbf{x})}{\partial \eta_j} =
\begin{cases}
p_o(Y = i \mid \mathbf{x}) \left(1 - p_o(Y = i \mid \mathbf{x})\right) & j = i \\
-p_o(Y = i \mid \mathbf{x})\, p_o(Y = j \mid \mathbf{x}) & \text{otherwise}
\end{cases}
$$

Special Case of Binary Response Model with No Priors

Let $\boldsymbol{\beta}$ be the vector of regression parameters. Let

$$
\eta = \eta(\boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}
$$

The variance of $\hat{\eta}$ is given by

$$
\mathrm{Var}(\hat{\eta}) = \mathbf{x}'\, \mathrm{Var}(\hat{\boldsymbol{\beta}})\, \mathbf{x}
$$

A $100(1 - \alpha)\%$ confidence interval for $\eta$ is

$$
\hat{\eta} \pm z_{1 - \alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\eta})}
$$

Estimates of $p_o(Y = 1 \mid \mathbf{x})$ and confidence intervals for $p_o(Y = 1 \mid \mathbf{x})$ are obtained by back-transforming $\hat{\eta}$ and the confidence intervals for $\eta$, respectively. That is,

$$
\hat{p}_o(Y = 1 \mid \mathbf{x}) = F(\hat{\eta})
$$

and the confidence intervals are

$$
F\left(\hat{\eta} \pm z_{1 - \alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\eta})}\right)
$$
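The back-transformation can be sketched in a few lines of Python. This example assumes a logistic link; the function name, coefficients, and covariance matrix are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def back_transformed_interval(x, beta, cov, alpha=0.05):
    """Return (p_hat, (lower, upper)) by transforming the interval for eta."""
    m = len(x)
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    # Var(eta_hat) = x' Var(beta_hat) x
    var_eta = sum(x[r] * cov[r][c] * x[c] for r in range(m) for c in range(m))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(var_eta)
    return logistic(eta), (logistic(eta - half_width), logistic(eta + half_width))

# Hypothetical fitted values: eta = -1 + 0.5 * 2 = 0, so the estimate is 0.5.
p_hat, (lower, upper) = back_transformed_interval(
    x=[1.0, 2.0],
    beta=[-1.0, 0.5],
    cov=[[0.04, 0.0], [0.0, 0.01]],
)
```

Because $F$ is monotone and maps into $(0, 1)$, limits obtained this way always stay inside the unit interval, and they are generally not symmetric about the estimate.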
Last updated: December 09, 2022