The DISCRIM Procedure

Background

The following notation is used to describe the classification methods:

$\mathbf{x}$: a $p$-dimensional vector containing the quantitative variables of an observation

$\mathbf{S}_p$: the pooled covariance matrix

$t$: a subscript to distinguish the groups

$n_t$: the number of training set observations in group $t$

$\mathbf{m}_t$: the $p$-dimensional vector containing variable means in group $t$

$\mathbf{S}_t$: the covariance matrix within group $t$

$|\mathbf{S}_t|$: the determinant of $\mathbf{S}_t$

$q_t$: the prior probability of membership in group $t$

$p(t \mid \mathbf{x})$: the posterior probability of an observation $\mathbf{x}$ belonging to group $t$

$f_t$: the probability density function for group $t$

$f_t(\mathbf{x})$: the group-specific density estimate at $\mathbf{x}$ from group $t$

$f(\mathbf{x})$: $\sum_t q_t f_t(\mathbf{x})$, the estimated unconditional density at $\mathbf{x}$

$e_t$: the classification error rate for group $t$

Bayes’ Theorem

Assuming that the prior probabilities of group membership are known and that the group-specific densities at $\mathbf{x}$ can be estimated, PROC DISCRIM computes $p(t \mid \mathbf{x})$, the probability of $\mathbf{x}$ belonging to group $t$, by applying Bayes’ theorem:

$$p(t \mid \mathbf{x}) = \frac{q_t f_t(\mathbf{x})}{f(\mathbf{x})}$$

PROC DISCRIM partitions a $p$-dimensional vector space into regions $R_t$, where the region $R_t$ is the subspace containing all $p$-dimensional vectors $\mathbf{y}$ such that $p(t \mid \mathbf{y})$ is the largest among all groups. An observation is classified as coming from group $t$ if it lies in region $R_t$.
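
As a hypothetical numeric illustration: with two groups, equal priors $q_1 = q_2 = 0.5$, and density values $f_1(\mathbf{x}) = 0.20$ and $f_2(\mathbf{x}) = 0.05$, the unconditional density is $f(\mathbf{x}) = 0.5(0.20) + 0.5(0.05) = 0.125$, so $p(1 \mid \mathbf{x}) = 0.5(0.20)/0.125 = 0.8$ and $p(2 \mid \mathbf{x}) = 0.2$; the observation therefore lies in region $R_1$.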

Parametric Methods

Assuming that each group has a multivariate normal distribution, PROC DISCRIM develops a discriminant function or classification criterion by using a measure of generalized squared distance. The classification criterion is based on either the individual within-group covariance matrices or the pooled covariance matrix; it also takes into account the prior probabilities of the classes. Each observation is placed in the class from which it has the smallest generalized squared distance. PROC DISCRIM also computes the posterior probability of an observation belonging to each class.
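
To make the covariance choice concrete, the following sketch (with a hypothetical data set Train, class variable Group, and quantitative variables X1-X4) contrasts the two parametric rules: POOL=YES bases the criterion on the pooled covariance matrix (a linear discriminant function), and POOL=NO uses the individual within-group covariance matrices (a quadratic discriminant function).

   /* pooled covariance matrix: linear discriminant function */
   proc discrim data=Train method=normal pool=yes;
      class Group;
      var X1-X4;
   run;

   /* within-group covariance matrices: quadratic discriminant function */
   proc discrim data=Train method=normal pool=no;
      class Group;
      var X1-X4;
   run;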

The squared Mahalanobis distance from $\mathbf{x}$ to group $t$ is

$$d_t^2(\mathbf{x}) = (\mathbf{x} - \mathbf{m}_t)'\, \mathbf{V}_t^{-1} (\mathbf{x} - \mathbf{m}_t)$$

where $\mathbf{V}_t = \mathbf{S}_t$ if the within-group covariance matrices are used, or $\mathbf{V}_t = \mathbf{S}_p$ if the pooled covariance matrix is used.

The group-specific density estimate at $\mathbf{x}$ from group $t$ is then given by

$$f_t(\mathbf{x}) = (2\pi)^{-p/2}\, |\mathbf{V}_t|^{-1/2} \exp\left(-0.5\, d_t^2(\mathbf{x})\right)$$

Using Bayes’ theorem, the posterior probability of $\mathbf{x}$ belonging to group $t$ is

$$p(t \mid \mathbf{x}) = \frac{q_t f_t(\mathbf{x})}{\sum_u q_u f_u(\mathbf{x})}$$

where the summation is over all groups.

The generalized squared distance from $\mathbf{x}$ to group $t$ is defined as

$$D_t^2(\mathbf{x}) = d_t^2(\mathbf{x}) + g_1(t) + g_2(t)$$

where

$$g_1(t) = \begin{cases} \ln |\mathbf{S}_t| & \text{if the within-group covariance matrices are used} \\ 0 & \text{if the pooled covariance matrix is used} \end{cases}$$

and

$$g_2(t) = \begin{cases} -2 \ln(q_t) & \text{if the prior probabilities are not all equal} \\ 0 & \text{if the prior probabilities are all equal} \end{cases}$$

The posterior probability of $\mathbf{x}$ belonging to group $t$ is then equal to

$$p(t \mid \mathbf{x}) = \frac{\exp\left(-0.5\, D_t^2(\mathbf{x})\right)}{\sum_u \exp\left(-0.5\, D_u^2(\mathbf{x})\right)}$$

The discriminant scores are $-0.5\, D_u^2(\mathbf{x})$. An observation is classified into group $u$ if setting $t = u$ produces the largest value of $p(t \mid \mathbf{x})$ or the smallest value of $D_t^2(\mathbf{x})$. If this largest posterior probability is less than the threshold specified, $\mathbf{x}$ is labeled as Other.
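
To illustrate the effect of unequal priors, consider a hypothetical two-group problem that uses the pooled covariance matrix (so $g_1(t) = 0$) with priors $q_1 = 0.75$ and $q_2 = 0.25$. Then $g_2(1) = -2\ln 0.75 \approx 0.575$ and $g_2(2) = -2\ln 0.25 \approx 2.773$. If $d_1^2(\mathbf{x}) = 4.0$ and $d_2^2(\mathbf{x}) = 3.0$, the generalized squared distances are $D_1^2(\mathbf{x}) \approx 4.575$ and $D_2^2(\mathbf{x}) \approx 5.773$, so $\mathbf{x}$ is assigned to group 1 even though group 2 is closer in the Mahalanobis metric.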

Nonparametric Methods

Nonparametric discriminant methods are based on nonparametric estimates of group-specific probability densities. Either a kernel method or the k-nearest-neighbor method can be used to generate a nonparametric density estimate in each group and to produce a classification criterion. The kernel method uses uniform, normal, Epanechnikov, biweight, or triweight kernels in the density estimation.

Either Mahalanobis distance or Euclidean distance can be used to determine proximity. When the k-nearest-neighbor method is used, the Mahalanobis distances are based on the pooled covariance matrix. When a kernel method is used, the Mahalanobis distances are based on either the individual within-group covariance matrices or the pooled covariance matrix. Either the full covariance matrix or the diagonal matrix of variances can be used to calculate the Mahalanobis distances.
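
As a sketch (with a hypothetical data set Train, class variable Group, and variables X1-X4), the kernel, radius, distance metric, and covariance choice are specified together in a single PROC DISCRIM invocation:

   /* normal-kernel density estimates with radius 0.5;
      Mahalanobis distances based on the full pooled covariance matrix */
   proc discrim data=Train method=npar kernel=normal r=0.5 pool=yes metric=full;
      class Group;
      var X1-X4;
   run;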

The squared distance between two observation vectors, $\mathbf{x}$ and $\mathbf{y}$, in group $t$ is given by

$$d_t^2(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})'\, \mathbf{V}_t^{-1} (\mathbf{x} - \mathbf{y})$$

where $\mathbf{V}_t$ has one of the following forms:

$$\mathbf{V}_t = \begin{cases} \mathbf{S}_p & \text{the pooled covariance matrix} \\ \operatorname{diag}(\mathbf{S}_p) & \text{the diagonal matrix of the pooled covariance matrix} \\ \mathbf{S}_t & \text{the covariance matrix within group } t \\ \operatorname{diag}(\mathbf{S}_t) & \text{the diagonal matrix of the covariance matrix within group } t \\ \mathbf{I} & \text{the identity matrix} \end{cases}$$

The classification of an observation vector $\mathbf{x}$ is based on the estimated group-specific densities from the training set. From these estimated densities, the posterior probabilities of group membership at $\mathbf{x}$ are evaluated. An observation $\mathbf{x}$ is classified into group $u$ if setting $t = u$ produces the largest value of $p(t \mid \mathbf{x})$. If there is a tie for the largest probability or if this largest probability is less than the threshold specified, $\mathbf{x}$ is labeled as Other.

The kernel method uses a fixed radius, $r$, and a specified kernel, $K_t$, to estimate the group $t$ density at each observation vector $\mathbf{x}$. Let $\mathbf{z}$ be a $p$-dimensional vector. Then the volume of a $p$-dimensional unit sphere bounded by $\mathbf{z}'\mathbf{z} = 1$ is

$$v_0 = \frac{\pi^{p/2}}{\Gamma\left(\frac{p}{2} + 1\right)}$$

where $\Gamma$ represents the gamma function (see SAS Functions and CALL Routines: Reference).

Thus, in group $t$, the volume of a $p$-dimensional ellipsoid bounded by $\{\mathbf{z} \mid \mathbf{z}'\, \mathbf{V}_t^{-1} \mathbf{z} = r^2\}$ is

$$v_r(t) = r^p\, |\mathbf{V}_t|^{1/2}\, v_0$$
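
For instance, with $p = 2$ the unit-sphere volume is $v_0 = \pi / \Gamma(2) = \pi$, so if $\mathbf{V}_t = \mathbf{I}$ and $r = 0.5$, the ellipsoid volume is $v_r(t) = (0.5)^2\, \pi \approx 0.785$.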

The kernel method uses one of the following densities as the kernel density in group t:

Uniform Kernel

$$K_t(\mathbf{z}) = \begin{cases} \dfrac{1}{v_r(t)} & \text{if } \mathbf{z}'\mathbf{V}_t^{-1}\mathbf{z} \le r^2 \\ 0 & \text{elsewhere} \end{cases}$$

Normal Kernel (with mean zero, variance $r^2 \mathbf{V}_t$)

$$K_t(\mathbf{z}) = \frac{1}{c_0(t)} \exp\left(-\frac{1}{2r^2}\, \mathbf{z}'\mathbf{V}_t^{-1}\mathbf{z}\right)$$

where $c_0(t) = (2\pi)^{p/2}\, r^p\, |\mathbf{V}_t|^{1/2}$.

Epanechnikov Kernel

$$K_t(\mathbf{z}) = \begin{cases} c_1(t)\left(1 - \dfrac{1}{r^2}\, \mathbf{z}'\mathbf{V}_t^{-1}\mathbf{z}\right) & \text{if } \mathbf{z}'\mathbf{V}_t^{-1}\mathbf{z} \le r^2 \\ 0 & \text{elsewhere} \end{cases}$$

where $c_1(t) = \dfrac{1}{v_r(t)}\left(1 + \dfrac{p}{2}\right)$.

Biweight Kernel

$$K_t(\mathbf{z}) = \begin{cases} c_2(t)\left(1 - \dfrac{1}{r^2}\, \mathbf{z}'\mathbf{V}_t^{-1}\mathbf{z}\right)^2 & \text{if } \mathbf{z}'\mathbf{V}_t^{-1}\mathbf{z} \le r^2 \\ 0 & \text{elsewhere} \end{cases}$$

where $c_2(t) = \left(1 + \dfrac{p}{4}\right) c_1(t)$.

Triweight Kernel

$$K_t(\mathbf{z}) = \begin{cases} c_3(t)\left(1 - \dfrac{1}{r^2}\, \mathbf{z}'\mathbf{V}_t^{-1}\mathbf{z}\right)^3 & \text{if } \mathbf{z}'\mathbf{V}_t^{-1}\mathbf{z} \le r^2 \\ 0 & \text{elsewhere} \end{cases}$$

where $c_3(t) = \left(1 + \dfrac{p}{6}\right) c_2(t)$.

The group $t$ density at $\mathbf{x}$ is estimated by

$$f_t(\mathbf{x}) = \frac{1}{n_t} \sum_{\mathbf{y}} K_t(\mathbf{x} - \mathbf{y})$$

where the summation is over all observations $\mathbf{y}$ in group $t$, and $K_t$ is the specified kernel function. The posterior probability of membership in group $t$ is then given by

$$p(t \mid \mathbf{x}) = \frac{q_t f_t(\mathbf{x})}{f(\mathbf{x})}$$

where $f(\mathbf{x}) = \sum_u q_u f_u(\mathbf{x})$ is the estimated unconditional density. If $f(\mathbf{x})$ is zero, the observation $\mathbf{x}$ is labeled as Other.

The uniform-kernel method treats $K_t(\mathbf{z})$ as a multivariate uniform function with density uniformly distributed over $\mathbf{z}'\, \mathbf{V}_t^{-1} \mathbf{z} \le r^2$. Let $k_t$ be the number of training set observations $\mathbf{y}$ from group $t$ within the closed ellipsoid centered at $\mathbf{x}$ specified by $d_t^2(\mathbf{x}, \mathbf{y}) \le r^2$. Then the group $t$ density at $\mathbf{x}$ is estimated by

$$f_t(\mathbf{x}) = \frac{k_t}{n_t\, v_r(t)}$$

When the identity matrix or the pooled within-group covariance matrix is used in calculating the squared distance, $v_r(t)$ is a constant, independent of group membership. The posterior probability of $\mathbf{x}$ belonging to group $t$ is then given by

$$p(t \mid \mathbf{x}) = \frac{q_t k_t / n_t}{\sum_u q_u k_u / n_u}$$

If the closed ellipsoid centered at $\mathbf{x}$ does not include any training set observations, $f(\mathbf{x})$ is zero and $\mathbf{x}$ is labeled as Other. When the prior probabilities are equal, $p(t \mid \mathbf{x})$ is proportional to $k_t / n_t$, and $\mathbf{x}$ is classified into the group that has the highest proportion of observations in the closed ellipsoid. When the prior probabilities are proportional to the group sizes, $p(t \mid \mathbf{x}) = k_t / \sum_u k_u$, and $\mathbf{x}$ is classified into the group that has the largest number of observations in the closed ellipsoid.
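
As a numeric illustration with made-up counts: suppose the closed ellipsoid around $\mathbf{x}$ contains $k_1 = 6$ of the $n_1 = 50$ group 1 observations and $k_2 = 2$ of the $n_2 = 100$ group 2 observations. With equal priors, $p(1 \mid \mathbf{x}) = (6/50)/(6/50 + 2/100) = 0.12/0.14 \approx 0.857$, so $\mathbf{x}$ is classified into group 1. With priors proportional to group size, $p(1 \mid \mathbf{x}) = 6/8 = 0.75$, and the classification is again group 1 because it contributes more points to the ellipsoid.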

The nearest-neighbor method fixes the number, $k$, of training set points for each observation $\mathbf{x}$. The method finds the radius $r_k(\mathbf{x})$ that is the distance from $\mathbf{x}$ to the $k$th-nearest training set point in the metric $\mathbf{V}_t^{-1}$. Consider a closed ellipsoid centered at $\mathbf{x}$ bounded by $\{\mathbf{z} \mid (\mathbf{z} - \mathbf{x})'\, \mathbf{V}_t^{-1} (\mathbf{z} - \mathbf{x}) = r_k^2(\mathbf{x})\}$; the nearest-neighbor method is equivalent to the uniform-kernel method with a location-dependent radius $r_k(\mathbf{x})$. Note that, with ties, more than $k$ training set points might be in the ellipsoid.

Using the $k$-nearest-neighbor rule, the $k$ (or more with ties) smallest distances are saved. Of these $k$ distances, let $k_t$ represent the number of distances that are associated with group $t$. Then, as in the uniform-kernel method, the estimated group $t$ density at $\mathbf{x}$ is

$$f_t(\mathbf{x}) = \frac{k_t}{n_t\, v_k(\mathbf{x})}$$

where $v_k(\mathbf{x})$ is the volume of the ellipsoid bounded by $\{\mathbf{z} \mid (\mathbf{z} - \mathbf{x})'\, \mathbf{V}_t^{-1} (\mathbf{z} - \mathbf{x}) = r_k^2(\mathbf{x})\}$. Since the pooled within-group covariance matrix is used to calculate the distances used in the nearest-neighbor method, the volume $v_k(\mathbf{x})$ is a constant independent of group membership. When $k = 1$ is used in the nearest-neighbor rule, $\mathbf{x}$ is classified into the group associated with the $\mathbf{y}$ point that yields the smallest squared distance $d_t^2(\mathbf{x}, \mathbf{y})$. Prior probabilities affect nearest-neighbor results in the same way that they affect uniform-kernel results.
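
A minimal nearest-neighbor sketch (with the same hypothetical Train, Group, and X1-X4 names) replaces the kernel radius R= with a neighbor count K=:

   /* 5-nearest-neighbor rule; distances are based on the pooled covariance matrix */
   proc discrim data=Train method=npar k=5;
      class Group;
      var X1-X4;
   run;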

With a specified squared distance formula (METRIC=, POOL=), the values of r and k determine the degree of irregularity in the estimate of the density function, and they are called smoothing parameters. Small values of r or k produce jagged density estimates, and large values of r or k produce smoother density estimates. Various methods for choosing the smoothing parameters have been suggested, and there is as yet no simple solution to this problem.

For a fixed kernel shape, one way to choose the smoothing parameter r is to plot estimated densities with different values of r and to choose the estimate that is most in accordance with the prior information about the density. For many applications, this approach is satisfactory.

Another way of selecting the smoothing parameter $r$ is to choose a value that optimizes a given criterion. Different groups might have different sets of optimal values. Assume that the unknown density has bounded and continuous second derivatives and that the kernel is a symmetric probability density function. One criterion is to minimize an approximate mean integrated square error of the estimated density (Rosenblatt 1956). The resulting optimal value of $r$ depends on the density function and the kernel. A reasonable choice for the smoothing parameter $r$ is to optimize the criterion with the assumption that group $t$ has a normal distribution with covariance matrix $\mathbf{V}_t$. Then, in group $t$, the resulting optimal value for $r$ is given by

$$\left(\frac{A(K_t)}{n_t}\right)^{1/(p+4)}$$

where the optimal constant $A(K_t)$ depends on the kernel $K_t$ (Epanechnikov 1969). For some useful kernels, the constants $A(K_t)$ are given by the following:

$$\begin{aligned}
A(K_t) &= \frac{1}{p}\, 2^{p+1} (p+2)\, \Gamma\!\left(\frac{p}{2}\right) & & \text{with a uniform kernel} \\
A(K_t) &= \frac{4}{2p+1} & & \text{with a normal kernel} \\
A(K_t) &= \frac{2^{p+2}\, p^2\, (p+2)(p+4)}{2p+1}\, \Gamma\!\left(\frac{p}{2}\right) & & \text{with an Epanechnikov kernel}
\end{aligned}$$
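
For example, with a normal kernel, $p = 2$, and $n_t = 100$, the constant is $A(K_t) = 4/(2 \cdot 2 + 1) = 0.8$, and the resulting optimal smoothing parameter is $r = (0.8/100)^{1/6} \approx 0.45$.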

These selections of $A(K_t)$ are derived under the assumption that the data in each group are from a multivariate normal distribution with covariance matrix $\mathbf{V}_t$. However, when Euclidean distances are used in calculating the squared distance ($\mathbf{V}_t = \mathbf{I}$), the smoothing constant should be multiplied by $s$, where $s$ is an estimate of the standard deviation for all variables. A reasonable choice for $s$ is

$$s = \left(\frac{1}{p} \sum_{j} s_{jj}\right)^{1/2}$$

where the $s_{jj}$ are the group $t$ marginal variances.

The DISCRIM procedure uses only a single smoothing parameter for all groups. However, the selection of the matrix in the distance formula (from the METRIC= or POOL= option) enables individual groups and variables to have different scalings. When $\mathbf{V}_t$, the matrix used in calculating the squared distances, is an identity matrix, the kernel estimate at each data point is scaled equally for all variables in all groups. When $\mathbf{V}_t$ is the diagonal matrix of a covariance matrix, each variable in group $t$ is scaled separately by its variance in the kernel estimation, where the variance can be the pooled variance ($\mathbf{V}_t = \mathbf{S}_p$) or an individual within-group variance ($\mathbf{V}_t = \mathbf{S}_t$). When $\mathbf{V}_t$ is a full covariance matrix, the variables in group $t$ are scaled simultaneously by $\mathbf{V}_t$ in the kernel estimation.

In nearest-neighbor methods, the choice of $k$ is usually relatively uncritical (Hand 1982). A practical approach is to try several different values of the smoothing parameters within the context of the particular application and to choose the one that gives the best cross-validated estimate of the error rate.
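
One way to carry out such a search is to run PROC DISCRIM repeatedly, varying the smoothing parameter and comparing the cross-validated error rates. The following macro sketch (again with the hypothetical Train, Group, and X1-X4 names) tries several odd values of k:

   %macro knn_search;
      %do k = 1 %to 9 %by 2;
         /* k-nearest-neighbor rule with cross validation error-rate estimates */
         proc discrim data=Train method=npar k=&k crossvalidate;
            class Group;
            var X1-X4;
         run;
      %end;
   %mend knn_search;
   %knn_search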

Classification Error-Rate Estimates

A classification criterion can be evaluated by its performance in the classification of future observations. PROC DISCRIM uses two types of error-rate estimates to evaluate the derived classification criterion based on parameters estimated by the training sample:

  • error-count estimates

  • posterior probability error-rate estimates

The error-count estimate is calculated by applying the classification criterion derived from the training sample to a test set and then counting the number of misclassified observations. The group-specific error-count estimate is the proportion of misclassified observations in the group. When the test set is independent of the training sample, the estimate is unbiased. However, the estimate can have a large variance, especially if the test set is small.

When the input data set is an ordinary SAS data set and no independent test sets are available, the same data set can be used both to define and to evaluate the classification criterion. The resulting error-count estimate has an optimistic bias and is called an apparent error rate. To reduce the bias, you can split the data into two sets—one set for deriving the discriminant function and the other set for estimating the error rate. Such a split-sample method has the unfortunate effect of reducing the effective sample size.

Another way to reduce bias is cross validation (Lachenbruch and Mickey 1968). Cross validation treats n – 1 out of n training observations as a training set. It determines the discriminant functions based on these n – 1 observations and then applies them to classify the one observation left out. This is done for each of the n training observations. The misclassification rate for each group is the proportion of sample observations in that group that are misclassified. This method achieves a nearly unbiased estimate but with a relatively large variance.

To reduce the variance in an error-count estimate, smoothed error-rate estimates are suggested (Glick 1978). Instead of summing terms that are either zero or one as in the error-count estimator, the smoothed estimator uses a continuum of values between zero and one in the terms that are summed. The resulting estimator has a smaller variance than the error-count estimate. The posterior probability error-rate estimates provided by the POSTERR option in the PROC DISCRIM statement (see the section Posterior Probability Error-Rate Estimates) are smoothed error-rate estimates. The posterior probability estimates for each group are based on the posterior probabilities of the observations classified into that same group. The posterior probability estimates provide good estimates of the error rate when the posterior probabilities are accurate. When a parametric classification criterion (linear or quadratic discriminant function) is derived from a nonnormal population, the resulting posterior probability error-rate estimators might not be appropriate.

The overall error rate is estimated through a weighted average of the individual group-specific error-rate estimates, where the prior probabilities are used as the weights.

To reduce both the bias and the variance of the estimator, Hora and Wilcox (1982) compute the posterior probability estimates based on cross validation. The resulting estimates are intended to have both low variance from using the posterior probability estimate and low bias from cross validation. They use Monte Carlo studies on two-group multivariate normal distributions to compare the cross validation posterior probability estimates with three other estimators: the apparent error rate, cross validation estimator, and posterior probability estimator. They conclude that the cross validation posterior probability estimator has a lower mean squared error in their simulations.
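
As a sketch of how these options combine in PROC DISCRIM (hypothetical data set and variable names), the POSTERR option requests the posterior probability error-rate estimates, and adding the CROSSVALIDATE option bases the classification results on cross validation:

   /* smoothed (posterior probability) error-rate estimates with cross validation */
   proc discrim data=Train method=normal pool=yes posterr crossvalidate;
      class Group;
      var X1-X4;
   run;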

Quasi-inverse

Consider the plot shown in Figure 6 with two variables, X1 and X2, and two classes, A and B. The within-class covariance matrix is diagonal, with a positive value for X1 but zero for X2. Using a Moore-Penrose pseudo-inverse would effectively ignore X2 in doing the classification, and the two classes would have a zero generalized distance and could not be discriminated at all. The quasi inverse used by PROC DISCRIM replaces the zero variance for X2 with a small positive number to remove the singularity. This permits X2 to be used in the discrimination and correctly results in a large generalized distance between the two classes and a zero error rate. It also permits new observations, such as the one indicated by N, to be classified in a reasonable way. PROC CANDISC also uses a quasi inverse when the total-sample covariance matrix is considered to be singular and Mahalanobis distances are requested. This problem with singular within-class covariance matrices is discussed in Ripley (1996, p. 38). The use of the quasi inverse is an innovation introduced by SAS.

Figure 6: Plot of Data with Singular Within-Class Covariance Matrix

Let $\mathbf{S}$ be a singular covariance matrix. The matrix $\mathbf{S}$ can be a within-group covariance matrix, a pooled covariance matrix, or a total-sample covariance matrix. Let $v$ be the number of variables in the VAR statement, and let the nullity $n$ be the number of variables among them with (partial) $R^2$ exceeding $1 - p$, where $p$ is the singularity criterion (the value of the SINGULAR= option). If the determinant of $\mathbf{S}$ (Testing of Homogeneity of Within Covariance Matrices) or the inverse of $\mathbf{S}$ (Squared Distances and Generalized Squared Distances) is required, a quasi determinant or quasi inverse is used instead. With raw data input, PROC DISCRIM scales each variable to unit total-sample variance before calculating this quasi inverse. The calculation is based on the spectral decomposition $\mathbf{S} = \boldsymbol{\Gamma} \boldsymbol{\Lambda} \boldsymbol{\Gamma}'$, where $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues $\lambda_j$, $j = 1, \ldots, v$, with $\lambda_i \ge \lambda_j$ when $i < j$, and $\boldsymbol{\Gamma}$ is a matrix with the corresponding orthonormal eigenvectors of $\mathbf{S}$ as columns. When the nullity $n$ is less than $v$, set $\lambda_j^0 = \lambda_j$ for $j = 1, \ldots, v - n$, and $\lambda_j^0 = p \bar{\lambda}$ for $j = v - n + 1, \ldots, v$, where

$$\bar{\lambda} = \frac{1}{v - n} \sum_{k=1}^{v-n} \lambda_k$$

When the nullity $n$ is equal to $v$, set $\lambda_j^0 = p$ for $j = 1, \ldots, v$. A quasi determinant is then defined as the product of the $\lambda_j^0$, $j = 1, \ldots, v$. Similarly, a quasi inverse is then defined as $\mathbf{S}^* = \boldsymbol{\Gamma} \boldsymbol{\Lambda}^* \boldsymbol{\Gamma}'$, where $\boldsymbol{\Lambda}^*$ is a diagonal matrix of values $1 / \lambda_j^0$, $j = 1, \ldots, v$.
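
As a small hypothetical illustration, let $v = 3$ with eigenvalues $\lambda = (3, 1, 0)$, so the nullity is $n = 1$. Then $\bar{\lambda} = (3 + 1)/2 = 2$, and the zero eigenvalue is replaced by $\lambda_3^0 = 2p$. The quasi determinant is $3 \cdot 1 \cdot 2p = 6p$, and the quasi inverse uses the diagonal values $(1/3,\ 1,\ 1/(2p))$ in $\boldsymbol{\Lambda}^*$.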
