The following subsection describes the standard Pearson and deviance goodness-of-fit tests. To be valid, these two tests require sufficient replication within subpopulations. When there are continuous predictors in the model, the data are often too sparse to use these statistics. The remaining subsections describe tests that are all designed to be valid in such situations.
Let N be the number of observations in your data, and let m denote the number of subpopulation profiles. You can use the AGGREGATE (or AGGREGATE=) option in the MODEL statement to define the subpopulation profiles. If you omit this option, each observation is regarded as coming from a separate subpopulation, and m=N.
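For example, statements similar to the following request the Pearson and deviance goodness-of-fit tests for a hypothetical data set MyData that has a binary response y (coded 0 and 1) and predictors Treatment and Dose; the SCALE=NONE option displays the goodness-of-fit analysis without adjusting for overdispersion:

   proc logistic data=MyData;
      class Treatment;
      model y(event='1') = Treatment Dose / aggregate=(Treatment Dose) scale=none;
   run;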
Let $K$ be the number of response levels and $p$ be the number of parameters that are estimated. For the $j$th profile (or observation) and the $i$th response level, let $w_{ji}$ be the total weight (sum of the product of the frequencies and the weights of the observations within that profile), let $w_{j} = \sum_{i=1}^{K} w_{ji}$, and let $\hat{\pi}_{ji}$ be the fitted probability. For events/trials syntax, let $r_{ji}$ be the number of responses with level $i$, and let $n_j = \sum_{i=1}^{K} r_{ji}$ be the number of trials. In either case, $\hat{\pi}_{ji}$ denotes the predicted probability of response level $i$, $i = 1, \ldots, K$.
The Pearson chi-square statistic $\chi^2_P$ and the deviance $\chi^2_D$ are given by

$$\chi^2_P = \sum_{j=1}^{m} \sum_{i=1}^{K} \frac{\left(w_{ji} - w_{j}\hat{\pi}_{ji}\right)^2}{w_{j}\hat{\pi}_{ji}}$$

$$\chi^2_D = 2 \sum_{j=1}^{m} \sum_{i=1}^{K} w_{ji} \log\!\left(\frac{w_{ji}}{w_{j}\hat{\pi}_{ji}}\right)$$
Each of these chi-square statistics has $m(K-1) - p$ degrees of freedom.
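For example, the following DATA step illustrates the degrees-of-freedom calculation and the corresponding p-value for hypothetical values of $m$, $K$, $p$, and the Pearson statistic:

   data _null_;
      m = 30;  K = 2;  p = 4;          /* profiles, response levels, parameters */
      df    = m*(K - 1) - p;
      chisq = 31.7;                    /* hypothetical Pearson chi-square value */
      pval  = 1 - probchi(chisq, df);
      put df= chisq= pval=;
   run;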
A large difference between the Pearson statistic and the deviance provides some evidence that the data are too sparse to use either statistic.
Without the AGGREGATE (or AGGREGATE=) option, the Pearson chi-square statistic and the deviance are calculated only for events/trials syntax.
Sufficient replication within subpopulations is required to make the Pearson and deviance goodness-of-fit tests valid. When there are one or more continuous predictors in the model, the data are often too sparse to use these statistics. Hosmer and Lemeshow (2000) proposed a statistic that they show, through simulation, is distributed as chi-square when there is no replication in any of the subpopulations. Fagerland, Hosmer, and Bofin (2008) and Fagerland and Hosmer (2013, 2016) extend this test to polytomous response models.
The observations are sorted in increasing order of a scored value. For binary response variables, the scored value of an observation is its estimated event probability. The event is the response level specified in the response variable option EVENT=, the response level that is not specified in the REF= option, or, if neither of these options was specified, the response level identified in the "Response Profile" table as "Ordered Value 1." For nominal response variables (LINK=GLOGIT), the scored value of an observation is 1 minus the estimated probability of the reference level (specified using the REF= option). For ordinal response variables, the scored value of an observation is $\sum_{i=1}^{K} i\,\hat{\pi}_i$, where $K$ is the number of response levels and $\hat{\pi}_i$ is the predicted probability of the $i$th ordered response.
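For example, statements similar to the following (with a hypothetical data set MyData and a 0/1 binary response y) request the Hosmer-Lemeshow test and save the scored values on which the grouping is based:

   proc logistic data=MyData;
      model y(event='1') = x1 x2 / lackfit;
      output out=Scored predicted=phat;   /* phat holds the estimated event probabilities */
   run;

   proc sort data=Scored;
      by phat;                            /* observations are grouped in increasing order of phat */
   run;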
The observations (and frequencies) are then combined into $G$ groups. By default $G = 10$, but you can specify $G$ with the NGROUPS= suboption of the LACKFIT option in the MODEL statement. For single-trial syntax, observations with identical scored values are combined and placed in the same group. Let $F$ be the total frequency. The target frequency for each group is $F_t = [\,F/G\,]$, which is the integer part of $F/G$. Load the first group ($j = 1$) with the observation that has the smallest scored value and with frequency $F_1$, and let the next-smallest observation have a frequency of $f$. PROC LOGISTIC performs the following steps for each observation to create the groups:
If the final group has a frequency $F_g$ that is smaller than the target frequency, then add these observations to the preceding group. The total number of groups actually created, $g$, can be less than $G$. There must be at least three groups in order for the Hosmer-Lemeshow statistic to be computed.
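For example, you can request fewer groups as follows (the data set and variable names are hypothetical); at least three groups must remain after any merging for the statistic to be computed:

   proc logistic data=MyData;
      model y(event='1') = x1 x2 / lackfit(ngroups=5);
   run;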
For binary response variables, the Hosmer-Lemeshow goodness-of-fit statistic is obtained by calculating the Pearson chi-square statistic from the $2 \times g$ table of observed and expected frequencies, where $g$ is the number of groups. The statistic is written

$$\chi^2_{\mathit{HL}} = \sum_{j=1}^{g} \frac{\left(O_j - N_j\bar{\pi}_j\right)^2}{N_j\bar{\pi}_j\left(1 - \bar{\pi}_j\right)}$$

where $N_j$ is the total frequency of subjects in the $j$th group, $O_j$ is the total frequency of event outcomes in the $j$th group, and $\bar{\pi}_j$ is the average estimated probability of an event outcome for the $j$th group. (Note that the predicted probabilities are computed as shown in the section Linear Predictor, Predicted Probability, and Confidence Limits and are not the cross-validated estimates discussed in the section Classification Table.) The Hosmer-Lemeshow statistic is then compared to a chi-square distribution with $g - r$ degrees of freedom, where the value of $r$ can be specified in the DFREDUCE= suboption of the LACKFIT option in the MODEL statement. The default is $r = 2$.
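The following DATA step is a numerical illustration of this statistic for $g = 4$ hypothetical groups (the group frequencies, event counts, and average predicted probabilities are made up):

   data _null_;
      array Ntot{4} _temporary_ (25 25 25 25);           /* N_j: group frequencies          */
      array Obs{4}  _temporary_ ( 3  7 12 18);           /* O_j: observed event frequencies */
      array Pbar{4} _temporary_ (0.14 0.31 0.52 0.71);   /* average predicted probabilities */
      chisq = 0;
      do j = 1 to 4;
         e     = Ntot{j}*Pbar{j};                        /* expected events in group j      */
         chisq = chisq + (Obs{j} - e)**2 / (e*(1 - Pbar{j}));
      end;
      df = 4 - 2;                                        /* g - r with the default r = 2    */
      p  = 1 - probchi(chisq, df);
      put chisq= df= p=;
   run;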
For polytomous response variables, the Pearson chi-square statistic is computed from a $g \times K$ table of observed and expected frequencies,

$$\chi^2_{\mathit{HL}} = \sum_{j=1}^{g} \sum_{k=1}^{K} \frac{\left(O_{jk} - E_{jk}\right)^2}{E_{jk}}$$

where $O_{jk}$ is the sum of the observed frequencies of the observations in group $j$ that have response $k$, and $E_{jk}$ is the sum of the model-predicted probabilities of response $k$ for the observations in group $j$. The Hosmer-Lemeshow statistic is then compared to a chi-square distribution. The number of degrees of freedom for this test of cumulative and adjacent-category logit models with the equal-slopes assumption is given by Fagerland and Hosmer (2013) and Fagerland and Hosmer (2016) as $(g-r)(K-1)+(K-2)$; PROC LOGISTIC uses this number for all models that make the equal-slopes assumption. The number of degrees of freedom for this test of the generalized logit model is given by Fagerland, Hosmer, and Bofin (2008) as $(g-r)(K-1)$, where $K$ is the number of response levels; PROC LOGISTIC uses this number for all models that do not make the equal-slopes assumption. The degrees of freedom can also be specified in the DF= suboption of the LACKFIT option in the MODEL statement.
Large values of $\chi^2_{\mathit{HL}}$ (and small p-values) indicate a lack of fit of the model.
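For example, the following DATA step evaluates the two degrees-of-freedom formulas for hypothetical values $g = 10$, $K = 3$, and the default $r = 2$:

   data _null_;
      g = 10;  K = 3;  r = 2;
      df_equal_slopes = (g - r)*(K - 1) + (K - 2);   /* cumulative and adjacent-category models */
      df_glogit       = (g - r)*(K - 1);             /* generalized logit models                */
      put df_equal_slopes= df_glogit=;
   run;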
The tests in this section are valid even when the data are sparse and there is very little or no replication. These tests are currently available only for binary logistic regression models, and they are reported in the "Goodness-of-Fit Tests" table when you specify the GOF option in the MODEL statement. Let $\hat{\pi}_j$ denote the predicted event probability for observation $j$, $j = 1, \ldots, N$, and let $\widehat{\mathbf{V}}$ be the covariance matrix for the fitted model.
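For example, statements similar to the following request these tests for a hypothetical binary-response data set:

   proc logistic data=MyData;
      model y(event='1') = x1 x2 / gof;
   run;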
The general misspecification test of White (1982) is applied by Orme (1988) to binary response data. The design vector for observation $j$ is expanded by the upper triangular matrix of $\mathbf{x}_j\mathbf{x}_j^{\prime}$; that is, by the products $x_{ja}x_{jb}$, $a \le b$. The $x_{ja}x_{jb}$ are scaled by $(1 - 2\hat{\pi}_j)\sqrt{\hat{\pi}_j(1-\hat{\pi}_j)}$, the $x_{ja}$ values are scaled by $\sqrt{\hat{\pi}_j(1-\hat{\pi}_j)}$, and new response values are created by using the binary form of the residual $r_j$:

$$r_j = \frac{y_j - \hat{\pi}_j}{\sqrt{\hat{\pi}_j(1-\hat{\pi}_j)}} =
\begin{cases}
\sqrt{\dfrac{1-\hat{\pi}_j}{\hat{\pi}_j}} & \text{if } y_j = 1 \\[2ex]
-\sqrt{\dfrac{\hat{\pi}_j}{1-\hat{\pi}_j}} & \text{if } y_j = 0
\end{cases}$$

The model sum of squares from a linear regression of the $r_j$ against the expanded set of covariates has a chi-square distribution with degrees of freedom equal to the number of scaled $x_{ja}x_{jb}$ covariates that are nondegenerate. This test is labeled "Information Matrix" in the "Goodness-of-Fit Tests" table.
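The following statements sketch this construction for a hypothetical model with two covariates x1 and x2 and a 0/1 response y; the data set and variable names are illustrative, and the model sum of squares in the final regression is the test statistic:

   proc logistic data=MyData;
      model y(event='1') = x1 x2;
      output out=Fit predicted=pihat;
   run;

   data Aux;
      set Fit;
      w  = pihat*(1 - pihat);
      r  = (y - pihat)/sqrt(w);         /* binary form of the residual           */
      s0 = sqrt(w);                     /* scaled intercept column               */
      s1 = sqrt(w)*x1;
      s2 = sqrt(w)*x2;
      c  = (1 - 2*pihat)*sqrt(w);       /* scale factor for the expanded columns */
      z00 = c;         z01 = c*x1;      z02 = c*x2;
      z11 = c*x1*x1;   z12 = c*x1*x2;   z22 = c*x2*x2;
   run;

   proc reg data=Aux;
      model r = s0 s1 s2 z00 z01 z02 z11 z12 z22 / noint;   /* model SS = test statistic */
   run;
   quit;

Degenerate (collinear) expanded columns do not contribute degrees of freedom, so the reference chi-square distribution uses only the nondegenerate z columns.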
Kuss (2002) modifies this test by expanding the design matrix using only the diagonal values of $\mathbf{x}_j\mathbf{x}_j^{\prime}$ (the squared terms $x_{ja}^2$). This test is labeled "Information Matrix Diagonal" in the "Goodness-of-Fit Tests" table.
Osius and Rojek (1992) use fixed-cells asymptotics to derive the mean and variance of the Pearson chi-square statistic. The mean is $m$, the number of subpopulation profiles, and the variance is as follows:

$$\sigma^2 = 2\left(m - \sum_{j=1}^{m}\frac{1}{n_j}\right) + \mathrm{RSS}$$

where $n_j$ is the number of trials for the $j$th profile and RSS is the residual sum of squares from a weighted linear regression of $c_j = \dfrac{1 - 2\hat{\pi}_j}{n_j\hat{\pi}_j(1-\hat{\pi}_j)}$ on the design vectors $\mathbf{x}_j$ with weights $n_j\hat{\pi}_j(1-\hat{\pi}_j)$. Standardize the Pearson statistic, $z = \dfrac{\chi^2_P - m}{\sigma}$, then square it to obtain a $\chi^2_1$ test.
Copas (1989) bases a test on the asymptotic normal distribution of the numerator of the Pearson chi-square statistic,

$$S = \sum_{j=1}^{N}\left(y_j - \hat{\pi}_j\right)^2$$

which has mean $\sum_{j=1}^{N}\hat{\pi}_j\left(1-\hat{\pi}_j\right)$. Hosmer et al. (1997) simplify the distribution of $S$ for binary response models, so that its variance, $\sigma_S^2$, is the residual sum of squares of a regression of a vector with entries $1 - 2\hat{\pi}_j$ on the design vectors $\mathbf{x}_j$ with weights $\hat{\pi}_j(1-\hat{\pi}_j)$. Standardize the statistic, $z = \dfrac{S - \sum_{j=1}^{N}\hat{\pi}_j(1-\hat{\pi}_j)}{\sigma_S}$, then square it to obtain a $\chi^2_1$ test.
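The following statements sketch this computation for a hypothetical binary model; the variance is taken from the error sum of squares of the weighted regression described above, and the data set and variable names are illustrative:

   proc logistic data=MyData;
      model y(event='1') = x1 x2;
      output out=Fit predicted=pihat;
   run;

   data Aux;
      set Fit;
      w = pihat*(1 - pihat);
      d = 1 - 2*pihat;
   run;

   proc reg data=Aux;
      weight w;
      model d = x1 x2;
      ods output ANOVA=Anv;              /* the Error SS is the variance of S */
   run;
   quit;

   data _null_;
      set Anv;
      if Source = 'Error' then call symputx('varS', SS);
   run;

   proc sql;
      select sum((y - pihat)**2)                          as S,
             sum(pihat*(1 - pihat))                       as meanS,
             (calculated S - calculated meanS)**2 / &varS as ZSquare
      from Fit;
   quit;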
Spiegelhalter (1986) derives a test based on the Brier score, written in binary form as

$$B = \frac{1}{N}\sum_{j=1}^{N}\left(y_j - \hat{\pi}_j\right)^2$$

where $y_j = 1$ if the event is observed for observation $j$ and $y_j = 0$ otherwise. $B$ is asymptotically normal with

$$E(B) = \frac{1}{N}\sum_{j=1}^{N}\hat{\pi}_j\left(1-\hat{\pi}_j\right)$$

and

$$\mathrm{Var}(B) = \frac{1}{N^2}\sum_{j=1}^{N}\hat{\pi}_j\left(1-\hat{\pi}_j\right)\left(1-2\hat{\pi}_j\right)^2$$

Standardize the Brier score, $z = \dfrac{B - E(B)}{\sqrt{\mathrm{Var}(B)}}$, then square it to obtain a $\chi^2_1$ test.
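The following statements sketch this calculation from saved predicted probabilities; the data set and variable names are hypothetical, and y is assumed to be coded 1 for the event and 0 otherwise:

   proc logistic data=MyData;
      model y(event='1') = x1 x2;
      output out=Fit predicted=pihat;
   run;

   data _null_;
      set Fit end=last;
      n + 1;
      sumSq + (y - pihat)**2;
      sumEB + pihat*(1 - pihat);
      sumVB + pihat*(1 - pihat)*(1 - 2*pihat)**2;
      if last then do;
         B    = sumSq/n;                 /* Brier score                  */
         EB   = sumEB/n;                 /* its mean under the fit       */
         VB   = sumVB/(n*n);             /* its variance under the fit   */
         z2   = (B - EB)**2 / VB;
         pval = 1 - probchi(z2, 1);
         put z2= pval=;
      end;
   run;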
Stukel (1988) adds two covariates to the model and tests that they are insignificant. For a binary response, where $\hat{\eta}_j = \mathbf{x}_j^{\prime}\hat{\boldsymbol{\beta}}$ is the estimated linear predictor and $I(\cdot)$ is the indicator function, add the following covariates:

$$z_{1j} = 0.5\,\hat{\eta}_j^{\,2}\,I\!\left(\hat{\eta}_j \ge 0\right) \qquad z_{2j} = -0.5\,\hat{\eta}_j^{\,2}\,I\!\left(\hat{\eta}_j < 0\right)$$
Then use a score test (see the section Score Statistics and Tests) to test whether this larger model is significantly different from the fitted model.
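The following statements sketch the construction of the two covariates from the estimated linear predictor for a hypothetical binary model; a Wald test of the added coefficients is shown here in place of the score test described in the section Score Statistics and Tests:

   proc logistic data=MyData;
      model y(event='1') = x1 x2;
      output out=Fit xbeta=eta;          /* estimated linear predictor */
   run;

   data Stukel;
      set Fit;
      z1 =  0.5*eta*eta*(eta >= 0);      /* nonzero where the estimated event probability >= 0.5 */
      z2 = -0.5*eta*eta*(eta <  0);      /* nonzero where the estimated event probability <  0.5 */
   run;

   proc logistic data=Stukel;
      model y(event='1') = x1 x2 z1 z2;
      Stukel: test z1=0, z2=0;           /* joint test that the added covariates are unnecessary */
   run;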