The LOGISTIC Procedure

Scoring Data Sets

Scoring a data set, which is especially important for predictive modeling, means applying a previously fitted model to a new data set in order to compute the conditional, or posterior, probabilities of each response category given the values of the explanatory variables in each observation.

The SCORE statement enables you to score new data sets and output the scored values and, optionally, the corresponding confidence limits into a SAS data set. If the response variable is included in the new data set, then you can request fit statistics for the data, which is especially useful for test or validation data. If the response is binary, you can also create a SAS data set containing the receiver operating characteristic (ROC) curve. You can specify multiple SCORE statements in the same invocation of PROC LOGISTIC.

By default, the posterior probabilities are based on implicit prior probabilities that are proportional to the frequencies of the response categories in the training data (the data used to fit the model). Specify explicit prior probabilities with the PRIOR= or PRIOREVENT= option when the sample proportions of the response categories in the training data differ substantially from those in the operational data to be scored. For example, to detect a rare category, it is common practice to use a training set in which the rare category is overrepresented; without prior probabilities that reflect the true incidence rate, the predicted posterior probabilities for the rare category will be too high. Specifying the correct priors adjusts the posterior probabilities appropriately.

The model fit to the DATA= data set in the PROC LOGISTIC statement is the default model used for the scoring. Alternatively, you can save a model fit in one run of PROC LOGISTIC and use it to score new data in a subsequent run. The OUTMODEL= option in the PROC LOGISTIC statement saves the model information in a SAS data set. When you specify this data set in the INMODEL= option of a new PROC LOGISTIC run, the procedure scores the DATA= data set in the SCORE statement without refitting the model.

The STORE statement can also be used to save your model. The PLM procedure can use this model to score new data sets; see Chapter 94, The PLM Procedure, for more information. You cannot specify priors in PROC PLM.

Fit Statistics for Scored Data Sets

Specifying the FITSTAT option displays the following fit statistics when the data set being scored includes the response variable:

Statistic	Description
Total frequency
Total weight
Log likelihood
Full log likelihood
Misclassification (error) rate
AIC
AICC
BIC
SC
R-square
Maximum-rescaled R-square
AUC	Area under the ROC curve
Brier score (polytomous response)
Brier score (binary response)
Brier reliability (events/trials syntax)

In the preceding table, $f_i$ is the frequency of the ith observation in the data set being scored, $w_i$ is the weight of the observation, and $n = \sum_i f_i$. The number of trials when events/trials syntax is specified is $n_i$, and with single-trial syntax $n_i = 1$. The values $\mathrm{F\_Y}_i$ and $\mathrm{I\_Y}_i$ are described in the section OUT= Output Data Set in a SCORE Statement. The indicator function $I(A)$ is 1 if A is true and 0 otherwise. The likelihood of the model is L, and $L_0$ denotes the likelihood of the intercept-only model. For polytomous response models, $y_i$ is the observed polytomous response level, $\hat{\pi}_{ij}$ is the predicted probability of the jth response level for observation i, and $\delta_{ij} = I(y_i = j)$. For binary response models, $\hat{\pi}_i$ is the predicted probability of the observation, $r_i$ is the number of events when you specify events/trials syntax, and $r_i = y_i$ when you specify single-trial syntax.

The log likelihood, Akaike’s information criterion (AIC), and Schwarz criterion (SC) are described in the section Model Fitting Information. The full log likelihood is displayed for models specified with events/trials syntax, and the constant term is described in the section Model Fitting Information. The AICC is a small-sample bias-corrected version of the AIC (Hurvich and Tsai 1993; Burnham and Anderson 1998). The Bayesian information criterion (BIC) is the same as the SC except when events/trials syntax is specified. The area under the ROC curve for binary response models is defined in the section ROC Computations. The R-square and maximum-rescaled R-square statistics, defined in Generalized Coefficient of Determination, are not computed when you specify both an OFFSET= variable and the INMODEL= data set. The Brier score (Brier 1950) is the weighted squared difference between the predicted probabilities and their observed response levels. For events/trials syntax, the Brier reliability is the weighted squared difference between the predicted probabilities and the observed proportions (Murphy 1973).
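
As a rough illustration of the Brier score described above, the following Python sketch computes a frequency- and weight-adjusted mean squared difference between 0/1 outcomes and predicted event probabilities for single-trial data. The function name, the argument names, and the normalization by the sum of frequency times weight are assumptions made for this illustration; they are not taken from PROC LOGISTIC.

```python
def brier_score_binary(y, p_hat, freq=None, weight=None):
    """Weighted mean of (y - p_hat)^2 for a binary response coded 0/1.

    Assumption: each observation contributes freq * weight, and the sum of
    squared differences is divided by the total of those contributions.
    """
    n_obs = len(y)
    freq = freq if freq is not None else [1.0] * n_obs
    weight = weight if weight is not None else [1.0] * n_obs
    total = sum(f * w for f, w in zip(freq, weight))
    squared = sum(
        f * w * (yi - pi) ** 2
        for yi, pi, f, w in zip(y, p_hat, freq, weight)
    )
    return squared / total


# Four hypothetical scored observations with unit frequencies and weights.
score = brier_score_binary([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

A score of 0 would mean every predicted probability matched its observed outcome exactly; larger values indicate worse agreement.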

Posterior Probabilities and Confidence Limits

Let F be the inverse link function. That is,

\[
F(t) =
\begin{cases}
\dfrac{1}{1 + \exp(-t)} & \text{logistic} \\
\Phi(t) & \text{normal} \\
1 - \exp(-\exp(t)) & \text{complementary log-log}
\end{cases}
\]

The first derivative of F is given by

\[
F'(t) =
\begin{cases}
\dfrac{\exp(-t)}{(1 + \exp(-t))^2} & \text{logistic} \\
\phi(t) & \text{normal} \\
\exp(t)\exp(-\exp(t)) & \text{complementary log-log}
\end{cases}
\]
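
The three inverse link functions and their first derivatives above can be sketched in Python as follows. The function names and the link keys ("logistic", "normal", "cloglog") are illustrative choices, not SAS option names.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def inverse_link(t, link="logistic"):
    """F(t) for the logistic, normal, and complementary log-log links."""
    if link == "logistic":
        return 1.0 / (1.0 + math.exp(-t))
    if link == "normal":
        return _STD_NORMAL.cdf(t)
    if link == "cloglog":
        return 1.0 - math.exp(-math.exp(t))
    raise ValueError(f"unknown link: {link}")


def inverse_link_derivative(t, link="logistic"):
    """F'(t), the first derivative of the inverse link."""
    if link == "logistic":
        e = math.exp(-t)
        return e / (1.0 + e) ** 2
    if link == "normal":
        return _STD_NORMAL.pdf(t)
    if link == "cloglog":
        return math.exp(t) * math.exp(-math.exp(t))
    raise ValueError(f"unknown link: {link}")
```

A central finite difference of `inverse_link` is a quick way to check each branch of `inverse_link_derivative`.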

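Suppose there are $k+1$ response categories. Let Y be the response variable with levels $1, \ldots, k+1$. Let $\mathbf{x} = (x_0, x_1, \ldots, x_s)'$ be a $(s+1)$-vector of covariates, with $x_0 \equiv 1$. Let $\boldsymbol{\beta}$ be the vector of intercept and slope regression parameters.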

Posterior probabilities are given by

\[
p(Y=i \mid \mathbf{x}) = \frac{p_o(Y=i \mid \mathbf{x}) \, \dfrac{p_n(Y=i)}{p_o(Y=i)}}{\sum_j p_o(Y=j \mid \mathbf{x}) \, \dfrac{p_n(Y=j)}{p_o(Y=j)}}, \quad i = 1, \ldots, k+1
\]

where the old posterior probabilities ($p_o(Y=i \mid \mathbf{x}),\ i = 1, \ldots, k+1$) are the conditional probabilities of the response categories given $\mathbf{x}$, the old priors ($p_o(Y=i),\ i = 1, \ldots, k+1$) are the sample proportions of response categories of the training data, and the new priors ($p_n(Y=i),\ i = 1, \ldots, k+1$) are specified in the PRIOR= or PRIOREVENT= option. To simplify notation, absorb the old priors into the new priors; that is,

\[
p(Y=i) = \frac{p_n(Y=i)}{p_o(Y=i)}, \quad i = 1, \ldots, k+1
\]

Note that if the PRIOR= and PRIOREVENT= options are not specified, then $p(Y=i) = 1$.
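
The prior adjustment above can be sketched numerically. This minimal Python example reweights each old posterior probability by the ratio of new prior to old prior and renormalizes; the function name and the example numbers (a rare event oversampled to 50% in training, with an assumed true incidence of 2%) are hypothetical.

```python
def adjust_posteriors(p_old, old_priors, new_priors):
    """Rescale old posteriors by new_prior / old_prior and renormalize.

    All three arguments are lists with one entry per response category.
    """
    ratios = [pn / po for pn, po in zip(new_priors, old_priors)]
    weighted = [p * r for p, r in zip(p_old, ratios)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Old posterior 0.7 for the event, training proportions 50/50,
# assumed true incidence 2% for the event.
adjusted = adjust_posteriors([0.7, 0.3], [0.5, 0.5], [0.02, 0.98])
```

With these inputs the event probability drops from 0.7 to 1/22 (about 0.045), which shows how strongly an oversampled training set can inflate the unadjusted posterior. When the new priors equal the old priors, every ratio is 1 and the posteriors are returned unchanged.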

The posterior probabilities are functions of $\boldsymbol{\beta}$, and their estimates are obtained by replacing $\boldsymbol{\beta}$ with its MLE $\hat{\boldsymbol{\beta}}$. The variances of the estimated posterior probabilities are given by the delta method as follows:

\[
\mathrm{Var}\bigl(\hat{p}(Y=i \mid \mathbf{x})\bigr) = \left[ \frac{\partial p(Y=i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} \right]' \mathrm{Var}(\hat{\boldsymbol{\beta}}) \left[ \frac{\partial p(Y=i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} \right]
\]

where

\[
\frac{\partial p(Y=i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} = \frac{\dfrac{\partial p_o(Y=i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} \, p(Y=i)}{\sum_j p_o(Y=j \mid \mathbf{x}) \, p(Y=j)} - \frac{p_o(Y=i \mid \mathbf{x}) \, p(Y=i) \sum_j \dfrac{\partial p_o(Y=j \mid \mathbf{x})}{\partial \boldsymbol{\beta}} \, p(Y=j)}{\left[ \sum_j p_o(Y=j \mid \mathbf{x}) \, p(Y=j) \right]^2}
\]

and the old posterior probabilities are described in the following sections.

A $100(1-\alpha)$% confidence interval for $p(Y=i \mid \mathbf{x})$ is

\[
\hat{p}(Y=i \mid \mathbf{x}) \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}\bigl(\hat{p}(Y=i \mid \mathbf{x})\bigr)}
\]

where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution.
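
The delta-method variance and the confidence interval above can be illustrated with a short Python sketch. The function names, the 0-based category index, and the example numbers (a binary logistic fit with old posterior 0.7, covariate value 2, and a diagonal covariance matrix for the parameter estimates) are all hypothetical; the interval is returned as computed, with no truncation to [0, 1].

```python
from statistics import NormalDist


def posterior_gradient(i, p_old, dp_old, ratio):
    """Gradient of the prior-adjusted posterior for category i (0-based).

    p_old[j]  : old posterior probability of category j
    dp_old[j] : gradient of p_old[j] with respect to beta (a list)
    ratio[j]  : new prior divided by old prior for category j
    """
    denom = sum(p * r for p, r in zip(p_old, ratio))
    dim = len(dp_old[0])
    # Gradient of the denominator, one entry per parameter.
    d_denom = [sum(dp[m] * r for dp, r in zip(dp_old, ratio)) for m in range(dim)]
    return [
        dp_old[i][m] * ratio[i] / denom
        - p_old[i] * ratio[i] * d_denom[m] / denom ** 2
        for m in range(dim)
    ]


def delta_method_ci(p_hat, grad, cov, alpha=0.05):
    """Normal-theory interval p_hat +/- z * sqrt(grad' cov grad)."""
    size = len(grad)
    var = sum(grad[a] * cov[a][b] * grad[b] for a in range(size) for b in range(size))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * var ** 0.5
    return p_hat - half_width, p_hat + half_width


# Binary logistic example: F'(eta) = 0.7 * 0.3 = 0.21, covariates (1, 2).
p_old = [0.7, 0.3]
dp_old = [[0.21, 0.42], [-0.21, -0.42]]
ratio = [1.0, 1.0]                      # no PRIOR= adjustment
cov = [[0.04, 0.0], [0.0, 0.01]]        # assumed Var(beta_hat)
grad = posterior_gradient(0, p_old, dp_old, ratio)
lower, upper = delta_method_ci(p_old[0], grad, cov)
```

Because the adjusted posteriors sum to 1 over the categories, their gradients sum to the zero vector, which is a convenient sanity check on `posterior_gradient`.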

Binary and Cumulative Response Models

Let $\alpha_1, \ldots, \alpha_k$ be the intercept parameters and let $\boldsymbol{\beta}_s$ be the vector of slope parameters. Denote $\boldsymbol{\beta} = (\alpha_1, \ldots, \alpha_k, \boldsymbol{\beta}_s')'$. Let

\[
\eta_i = \eta_i(\boldsymbol{\beta}) = \alpha_i + \mathbf{x}'\boldsymbol{\beta}_s, \quad i = 1, \ldots, k
\]

Estimates of $\eta_1, \ldots, \eta_k$ are obtained by substituting the maximum likelihood estimate $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$.

The predicted probabilities of the responses are

\[
\hat{p}_o(Y=i \mid \mathbf{x}) = \widehat{\Pr}(Y=i) =
\begin{cases}
F(\hat{\eta}_1) & i = 1 \\
F(\hat{\eta}_i) - F(\hat{\eta}_{i-1}) & i = 2, \ldots, k \\
1 - F(\hat{\eta}_k) & i = k+1
\end{cases}
\]
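
The predicted probabilities above can be sketched for the logistic link as follows. The function names and the parameter values are hypothetical, and `x` is assumed here to hold only the non-intercept covariates, since the intercepts are passed separately.

```python
import math


def logistic(t):
    """Inverse logit link F(t)."""
    return 1.0 / (1.0 + math.exp(-t))


def category_probabilities(alphas, beta_s, x, inv_link=logistic):
    """Probabilities of the k + 1 ordered response levels.

    alphas : the k intercepts, in increasing order
    beta_s : slope parameters
    x      : covariate values matching beta_s (no intercept term)
    """
    xb = sum(b * v for b, v in zip(beta_s, x))
    cum = [inv_link(a + xb) for a in alphas]      # F(eta_1), ..., F(eta_k)
    probs = [cum[0]]                              # i = 1
    probs += [cum[i] - cum[i - 1] for i in range(1, len(cum))]  # i = 2..k
    probs.append(1.0 - cum[-1])                   # i = k + 1
    return probs


# Two intercepts give three response levels.
probs = category_probabilities([-1.0, 0.5], [0.8], [0.25])
```

The differences are nonnegative only when the intercepts are ordered, which is why `alphas` is assumed to be increasing.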

For $i = 1, \ldots, k$, let $\boldsymbol{\delta}_i(\mathbf{x})$ be a $(k+1)$ column vector with $i$th entry equal to 1, $(k+1)$th entry equal to $\mathbf{x}$, and all other entries 0. The derivatives of $p_o(Y=i \mid \mathbf{x})$ with respect to $\boldsymbol{\beta}$ are

\[
\frac{\partial p_o(Y=i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} =
\begin{cases}
F'(\alpha_1 + \mathbf{x}'\boldsymbol{\beta}_s) \, \boldsymbol{\delta}_1(\mathbf{x}) & i = 1 \\
F'(\alpha_i + \mathbf{x}'\boldsymbol{\beta}_s) \, \boldsymbol{\delta}_i(\mathbf{x}) - F'(\alpha_{i-1} + \mathbf{x}'\boldsymbol{\beta}_s) \, \boldsymbol{\delta}_{i-1}(\mathbf{x}) & i = 2, \ldots, k \\
-F'(\alpha_k + \mathbf{x}'\boldsymbol{\beta}_s) \, \boldsymbol{\delta}_k(\mathbf{x}) & i = k+1
\end{cases}
\]
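
A minimal Python sketch of these derivatives for the logistic link follows. It assumes the parameter vector is laid out as the k intercepts followed by the slopes, so that the vector delta_i(x) is flattened to k indicator entries followed by the entries of x; all names and numbers are hypothetical.

```python
import math


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def logistic_prime(t):
    e = math.exp(-t)
    return e / (1.0 + e) ** 2


def delta(i, k, x):
    """delta_i(x): 1 in intercept position i (1-based), then the covariates."""
    return [1.0 if j == i else 0.0 for j in range(1, k + 1)] + list(x)


def category_gradients(alphas, beta_s, x):
    """Gradients of the k + 1 level probabilities with respect to beta."""
    k = len(alphas)
    xb = sum(b * v for b, v in zip(beta_s, x))
    # terms[i - 1] holds F'(eta_i) * delta_i(x) for i = 1, ..., k.
    terms = [
        [logistic_prime(a + xb) * d for d in delta(i, k, x)]
        for i, a in enumerate(alphas, start=1)
    ]
    grads = [terms[0]]                                             # i = 1
    for i in range(1, k):                                          # i = 2..k
        grads.append([u - v for u, v in zip(terms[i], terms[i - 1])])
    grads.append([-u for u in terms[-1]])                          # i = k + 1
    return grads


grads = category_gradients([-1.0, 0.5], [0.8], [0.25])
```

Each gradient has k plus the number of slopes entries, and the gradients of all k + 1 levels sum to zero because the level probabilities sum to 1.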

The cumulative posterior probabilities are

\[
p(Y \le i \mid \mathbf{x}) = \frac{\sum_{j=1}^{i} p_o(Y=j \mid \mathbf{x}) \, p(Y=j)}{\sum_{j=1}^{k+1} p_o(Y=j \mid \mathbf{x}) \, p(Y=j)} = \sum_{j=1}^{i} p(Y=j \mid \mathbf{x}), \quad i = 1, \ldots, k+1
\]

Their derivatives are

\[
\frac{\partial p(Y \le i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} = \sum_{j=1}^{i} \frac{\partial p(Y=j \mid \mathbf{x})}{\partial \boldsymbol{\beta}}, \quad i = 1, \ldots, k+1
\]

In the delta-method equation for the variance, replace $p(Y = \cdot \mid \mathbf{x})$ with $p(Y \le \cdot \mid \mathbf{x})$.

Finally, for the cumulative response model, use

\[
\begin{aligned}
\hat{p}_o(Y \le i \mid \mathbf{x}) &= F(\hat{\eta}_i), \quad i = 1, \ldots, k \\
\hat{p}_o(Y \le k+1 \mid \mathbf{x}) &= 1 \\
\frac{\partial p_o(Y \le i \mid \mathbf{x})}{\partial \boldsymbol{\beta}} &= F'(\alpha_i + \mathbf{x}'\boldsymbol{\beta}_s) \, \boldsymbol{\delta}_i(\mathbf{x}), \quad i = 1, \ldots, k \\
\frac{\partial p_o(Y \le k+1 \mid \mathbf{x})}{\partial \boldsymbol{\beta}} &= 0
\end{aligned}
\]
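
The cumulative posterior probabilities in this section can be sketched as a running sum of prior-weighted old posteriors divided by their total. The function name and the example values are hypothetical; `prior_ratio` holds the new prior divided by the old prior for each level.

```python
from itertools import accumulate


def cumulative_posteriors(p_old, prior_ratio):
    """p(Y <= i | x) for i = 1, ..., k + 1 after prior adjustment."""
    weighted = [p * r for p, r in zip(p_old, prior_ratio)]
    total = sum(weighted)
    # Running sums of the weighted terms, normalized by the full sum.
    return [c / total for c in accumulate(weighted)]


# Three ordered levels with no prior adjustment (all ratios equal to 1).
cum = cumulative_posteriors([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
```

The last entry is always 1, matching the statement that the cumulative probability through level k + 1 is 1.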

Generalized Logit Model

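Consider the last response level ($Y = k+1$) as the reference. Let $\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_k$ be the (intercept and slope) parameter vectors for the first k logits, respectively. Denote $\boldsymbol{\beta} = (\boldsymbol{\beta}_1', \ldots, \boldsymbol{\beta}_k')'$. Let with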