The DISCRIM Procedure

Posterior Probability Error-Rate Estimates

The posterior probability error-rate estimates (Fukunaga and Kessel 1973; Glick 1978; Hora and Wilcox 1982) for each group are based on the posterior probabilities of the observations classified into that same group.

A sample of observations with classification results can be used to estimate the posterior error rates. The following notation is used to describe the sample:

$\mathcal{S}$

the set of observations in the (training) sample

$n$

the number of observations in $\mathcal{S}$

$n_t$

the number of observations in $\mathcal{S}$ in group $t$

$\mathcal{R}_t$

the set of observations such that the posterior probability belonging to group $t$ is the largest

$\mathcal{R}_{ut}$

the set of observations from group $u$ such that the posterior probability belonging to group $t$ is the largest

The classification error rate for group t is defined as

$$e_t = 1 - \int_{\mathcal{R}_t} f_t(\mathbf{x}) \, d\mathbf{x}$$

The posterior probability of $\mathbf{x}$ for group $t$ can be written as

$$p(t \mid \mathbf{x}) = \frac{q_t f_t(\mathbf{x})}{f(\mathbf{x})}$$

where $f(\mathbf{x}) = \sum_u q_u f_u(\mathbf{x})$ is the unconditional density of $\mathbf{x}$.
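As a minimal illustration of the posterior formula, the following Python sketch computes the posterior probabilities at a single point from priors and group-specific densities. The univariate normal densities, the helper names `normal_pdf` and `posterior`, and the numeric values are assumptions chosen for illustration only; they are not part of the procedure.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density, used here only as a stand-in for f_t(x)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior(x, priors, densities):
    """Return the list of p(t | x) over groups t, given priors q_t and densities f_t."""
    joint = [q * f(x) for q, f in zip(priors, densities)]  # q_t * f_t(x)
    fx = sum(joint)                                        # unconditional density f(x)
    return [j / fx for j in joint]

# Hypothetical two-group setup: equal priors, unit-variance normals at 0 and 2.
priors = [0.5, 0.5]
densities = [
    lambda x: normal_pdf(x, 0.0, 1.0),
    lambda x: normal_pdf(x, 2.0, 1.0),
]
post = posterior(1.0, priors, densities)
```

In this hypothetical setup the point 1.0 lies midway between the two group means, so with equal priors the two posterior probabilities are equal.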

Thus, if you replace $f_t(\mathbf{x})$ with $p(t \mid \mathbf{x}) f(\mathbf{x}) / q_t$, the error rate is

$$e_t = 1 - \frac{1}{q_t} \int_{\mathcal{R}_t} p(t \mid \mathbf{x}) f(\mathbf{x}) \, d\mathbf{x}$$

An estimator of $e_t$, unstratified over the groups from which the observations come, is then given by

$$\hat{e}_t(\text{unstratified}) = 1 - \frac{1}{n q_t} \sum_{\mathcal{R}_t} p(t \mid \mathbf{x})$$

where $p(t \mid \mathbf{x})$ is estimated from the classification criterion, and the summation is over all sample observations of $\mathcal{S}$ classified into group $t$. The true group membership of each observation is not required in the estimation. The term $n q_t$ is the number of observations that are expected to be classified into group $t$, given the priors. If more observations than expected are classified into group $t$, then $\hat{e}_t$ can be negative.
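The unstratified estimator can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure's implementation: the function name `unstratified_error_rates`, the list-of-lists layout for the estimated posteriors, and the sample values are all assumptions. Ties in the largest posterior are broken in favor of the lowest group index, which is also an assumption.

```python
def unstratified_error_rates(posteriors, priors):
    """Per-group unstratified posterior error-rate estimates.

    posteriors[i][t] is the estimated p(t | x_i); priors[t] is q_t.
    True group labels are not needed.
    """
    n = len(posteriors)
    sums = [0.0] * len(priors)  # sum of p(t | x) over observations classified into t
    for p in posteriors:
        t = max(range(len(p)), key=lambda k: p[k])  # group with the largest posterior
        sums[t] += p[t]
    return [1.0 - s / (n * q) for s, q in zip(sums, priors)]

# Hypothetical posteriors for four observations and two groups.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
rates = unstratified_error_rates(posteriors, [0.5, 0.5])
```

With these values, two observations are classified into each group, which matches the expected count of 4 × 0.5 = 2 per group, so both estimates fall between 0 and 1.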

Further, if you replace $f(\mathbf{x})$ with $\sum_u q_u f_u(\mathbf{x})$, the error rate can be written as

$$e_t = 1 - \frac{1}{q_t} \sum_u q_u \int_{\mathcal{R}_{ut}} p(t \mid \mathbf{x}) f_u(\mathbf{x}) \, d\mathbf{x}$$

and an estimator stratified over the group from which the observations come is given by

$$\hat{e}_t(\text{stratified}) = 1 - \frac{1}{q_t} \sum_u q_u \frac{1}{n_u} \left( \sum_{\mathcal{R}_{ut}} p(t \mid \mathbf{x}) \right)$$

The inner summation is over all sample observations of $\mathcal{S}$ coming from group $u$ and classified into group $t$, and $n_u$ is the number of observations originally from group $u$. The stratified estimate uses only the observations with known group membership. When the prior probabilities of group membership are proportional to the group sizes, the stratified estimate is the same as the unstratified estimate.
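The stratified estimator can be sketched the same way. Again, this is only an illustration under stated assumptions: the function name `stratified_error_rates`, the integer group labels starting at 0, and the sample values are hypothetical, and every group is assumed to have at least one observation.

```python
def stratified_error_rates(posteriors, labels, priors):
    """Per-group stratified posterior error-rate estimates.

    posteriors[i][t] is the estimated p(t | x_i), labels[i] is the true
    group u of observation i, and priors[t] is q_t.
    """
    g = len(priors)
    n_u = [labels.count(u) for u in range(g)]  # group sizes
    sums = [[0.0] * g for _ in range(g)]       # sums[u][t]: from group u, classified into t
    for p, u in zip(posteriors, labels):
        t = max(range(g), key=lambda k: p[k])  # group with the largest posterior
        sums[u][t] += p[t]
    return [
        1.0 - sum(priors[u] * sums[u][t] / n_u[u] for u in range(g)) / priors[t]
        for t in range(g)
    ]

# Hypothetical sample: two observations per group, two of them misclassified.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
labels = [0, 1, 1, 0]
rates_equal = stratified_error_rates(posteriors, labels, [0.5, 0.5])
rates_skewed = stratified_error_rates(posteriors, labels, [0.8, 0.2])
```

In `rates_equal` the priors are proportional to the group sizes (2 of 4 in each group), so the result coincides with the unstratified formula applied to the same posteriors. In `rates_skewed` the priors no longer match the group sizes, and the estimate for the second group becomes negative.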

The estimated group-specific error rates can be less than zero, usually because of a large discrepancy between the prior probabilities of group membership and the group sizes. To obtain reliable group-specific error-rate estimates, use group sizes that are at least approximately proportional to the prior probabilities of group membership.
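A small worked example may help show how a negative estimate arises. The numbers are hypothetical: suppose the sample has 10 observations, the prior for group t is 0.2 (so 2 observations are expected to be classified into group t), but 4 observations are actually classified into group t, each with a posterior probability of 0.9. The unstratified estimate is then

```latex
\hat{e}_t(\text{unstratified})
  = 1 - \frac{1}{n q_t} \sum_{\mathcal{R}_t} p(t \mid \mathbf{x})
  = 1 - \frac{4 \times 0.9}{10 \times 0.2}
  = 1 - \frac{3.6}{2}
  = -0.8
```

The sum of posteriors over the observations classified into the group exceeds the expected count, which pushes the estimate below zero.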

A total error rate is defined as a weighted average of the individual group error rates

$$e = \sum_t q_t e_t$$

and can be estimated from

$$\hat{e}(\text{unstratified}) = \sum_t q_t \, \hat{e}_t(\text{unstratified})$$

or

$$\hat{e}(\text{stratified}) = \sum_t q_t \, \hat{e}_t(\text{stratified})$$

The total unstratified error-rate estimate can also be written as

$$\hat{e}(\text{unstratified}) = 1 - \frac{1}{n} \sum_t \sum_{\mathcal{R}_t} p(t \mid \mathbf{x})$$

which is one minus the average value of the maximum posterior probabilities for each observation in the sample. The prior probabilities of group membership do not appear explicitly in this overall estimate.
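The priors cancel because each group-specific estimate divides by the prior of its group and the total then multiplies by that same prior, and the priors themselves sum to one. The following Python sketch checks this numerically by computing the total unstratified estimate both ways. The function name `total_unstratified` and the sample values are assumptions for illustration.

```python
def total_unstratified(posteriors, priors):
    """Return the total unstratified estimate computed two ways.

    weighted: prior-weighted average of the per-group estimates.
    direct:   one minus the mean of each observation's maximum posterior.
    """
    n = len(posteriors)
    g = len(priors)
    sums = [0.0] * g
    for p in posteriors:
        t = max(range(g), key=lambda k: p[k])  # group with the largest posterior
        sums[t] += p[t]
    group_rates = [1.0 - s / (n * q) for s, q in zip(sums, priors)]
    weighted = sum(q * e for q, e in zip(priors, group_rates))
    direct = 1.0 - sum(max(p) for p in posteriors) / n
    return weighted, direct

# Hypothetical posteriors with priors that do not match the classification counts.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
weighted, direct = total_unstratified(posteriors, [0.8, 0.2])
```

Here one of the group-specific estimates is negative because the priors are far from the observed classification proportions, yet the two totals still agree.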

Last updated: December 09, 2022