The CLUSTER Procedure

Miscellaneous Formulas

The root mean squared standard deviation of a cluster $C_K$ is

$$\text{RMSSTD} = \sqrt{\frac{W_K}{v(N_K - 1)}}$$
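
As a rough illustration, the formula can be evaluated for a single small cluster. This sketch assumes the usual reading of the symbols, which this section does not restate: $W_K$ is the sum of squared Euclidean distances from each observation in the cluster to the cluster mean, $v$ is the number of variables, and $N_K$ is the number of observations in the cluster. The function names are hypothetical.

```python
import math

def within_cluster_ss(points):
    """W_K (assumed meaning): sum of squared distances to the cluster mean."""
    n_k = len(points)
    v = len(points[0])
    mean = [sum(p[j] for p in points) / n_k for j in range(v)]
    return sum((p[j] - mean[j]) ** 2 for p in points for j in range(v))

def rmsstd(points):
    """RMSSTD = sqrt(W_K / (v * (N_K - 1))) for one cluster."""
    n_k = len(points)
    v = len(points[0])
    if n_k < 2:
        raise ValueError("RMSSTD needs at least two observations in the cluster")
    return math.sqrt(within_cluster_ss(points) / (v * (n_k - 1)))

# Four observations on two variables, arranged as a square around (1, 1).
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
rmsstd_value = rmsstd(cluster)
print(round(rmsstd_value, 4))
```

Here each observation lies at squared distance 2 from the mean, so $W_K = 8$ and the result is $\sqrt{8/6} \approx 1.1547$.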

The R-square statistic for a given level of the hierarchy is

$$R^2 = 1 - \frac{P_G}{T}$$

The squared semipartial correlation for joining clusters $C_K$ and $C_L$ is

$$\text{semipartial } R^2 = \frac{B_{KL}}{T}$$
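
The two statistics can be sketched on a toy data set with two clusters. The symbol meanings used here are assumptions, since this section does not define them: $T$ is the total sum of squares about the overall mean, $P_G$ is the sum of the within-cluster sums of squares over the $G$ clusters at the current level, and $B_{KL} = W_M - W_K - W_L$, where $C_M$ is the cluster formed by joining $C_K$ and $C_L$. The helper name is hypothetical.

```python
def sum_sq_about_mean(points):
    """Sum of squared Euclidean distances from each point to the mean of `points`."""
    n = len(points)
    v = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(v)]
    return sum((p[j] - mean[j]) ** 2 for p in points for j in range(v))

cluster_k = [(0.0, 0.0), (0.0, 2.0)]
cluster_l = [(4.0, 0.0), (4.0, 2.0)]
all_points = cluster_k + cluster_l

total_ss = sum_sq_about_mean(all_points)      # T
w_k = sum_sq_about_mean(cluster_k)            # W_K
w_l = sum_sq_about_mean(cluster_l)            # W_L
pooled_within = w_k + w_l                     # P_G at the G = 2 level

r_squared = 1.0 - pooled_within / total_ss

# Joining C_K and C_L gives C_M; B_KL is the resulting increase in within-cluster SS.
w_m = sum_sq_about_mean(cluster_k + cluster_l)
b_kl = w_m - w_k - w_l
semipartial_r_squared = b_kl / total_ss

print(r_squared, semipartial_r_squared)
```

Under these definitions, joining two clusters raises $P_G$ by $B_{KL}$, so the semipartial $R^2$ is the amount by which $R^2$ drops at that merge. In this example the merge takes $R^2$ from 0.8 to 0.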

The bimodality coefficient is

$$b = \frac{m_3^2 + 1}{m_4 + \dfrac{3(n-1)^2}{(n-2)(n-3)}}$$

where $m_3$ is the skewness and $m_4$ is the kurtosis. Values of $b$ greater than 0.555 (the value for a uniform population) can indicate bimodal or multimodal marginal distributions. The maximum of 1.0 is obtained for a population with only two distinct values (the Bernoulli distribution). Very heavy-tailed distributions have small values of $b$ regardless of the number of modes.
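
The two reference values quoted above can be checked numerically. This is a minimal sketch that assumes $m_4$ is excess kurtosis (zero for a normal distribution), which is the reading under which the uniform value comes out near 0.555. It plugs in population skewness and kurtosis with a very large $n$ so that the finite-sample term is close to 3. The function name is hypothetical.

```python
import math

def bimodality_coefficient(m3, m4, n):
    """b = (m3^2 + 1) / (m4 + 3(n-1)^2 / ((n-2)(n-3))); m4 assumed to be excess kurtosis."""
    if n <= 3:
        raise ValueError("the formula requires n > 3")
    correction = 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (m3 ** 2 + 1.0) / (m4 + correction)

n_large = 10**9  # large n, so the correction term is approximately 3

# Uniform population: skewness 0, excess kurtosis -6/5.
b_uniform = bimodality_coefficient(0.0, -1.2, n_large)

# Bernoulli(p) population: two distinct values.
p = 0.3
q = 1.0 - p
bernoulli_skewness = (1.0 - 2.0 * p) / math.sqrt(p * q)
bernoulli_excess_kurtosis = (1.0 - 6.0 * p * q) / (p * q)
b_bernoulli = bimodality_coefficient(bernoulli_skewness, bernoulli_excess_kurtosis, n_large)

print(round(b_uniform, 4), round(b_bernoulli, 4))
```

In the large-$n$ limit the uniform case gives $1/1.8 \approx 0.5556$, and the Bernoulli case gives 1 for any $p$ strictly between 0 and 1, not only for $p = 0.5$.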

Formulas for the cubic clustering criterion and approximate expected $R^2$ are given in Sarle (1983).

The pseudo $F$ statistic for a given level is

$$\text{pseudo } F = \frac{\dfrac{T - P_G}{G - 1}}{\dfrac{P_G}{n - G}}$$
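
A small numeric sketch of this ratio follows. It assumes $T$ is the total sum of squares, $P_G$ is the pooled within-cluster sum of squares at the level with $G$ clusters, and $n$ is the number of observations. None of these are restated in this section, and the function and argument names are hypothetical.

```python
def pseudo_f(total_ss, pooled_within_ss, n, n_clusters):
    """Pseudo F = ((T - P_G) / (G - 1)) / (P_G / (n - G))."""
    if not 1 < n_clusters < n:
        raise ValueError("pseudo F is defined only for 1 < G < n")
    if pooled_within_ss <= 0:
        raise ValueError("P_G must be positive for the ratio to be finite")
    between_mean_square = (total_ss - pooled_within_ss) / (n_clusters - 1)
    within_mean_square = pooled_within_ss / (n - n_clusters)
    return between_mean_square / within_mean_square

# Illustrative values: T = 20, P_G = 4, n = 4 observations, G = 2 clusters.
pseudo_f_value = pseudo_f(total_ss=20.0, pooled_within_ss=4.0, n=4, n_clusters=2)
print(pseudo_f_value)
```

With these inputs the between-cluster mean square is 16 and the within-cluster mean square is 2, so the statistic is 8. The guards reflect that the ratio is undefined at $G = 1$ and at $G = n$.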

The pseudo $t^2$ statistic for joining $C_K$ and $C_L$ is

$$\text{pseudo } t^2 = \frac{B_{KL}}{\dfrac{W_K + W_L}{N_K + N_L - 2}}$$
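
The statistic can be sketched for one candidate merge as follows. The inputs are assumed to be the within-cluster sums of squares $W_K$ and $W_L$, the cluster sizes $N_K$ and $N_L$, and $B_{KL} = W_M - W_K - W_L$ for the merged cluster $C_M$. These definitions are taken as assumptions here, and the function name is hypothetical.

```python
def pseudo_t_squared(b_kl, w_k, w_l, n_k, n_l):
    """Pseudo t^2 = B_KL / ((W_K + W_L) / (N_K + N_L - 2))."""
    pooled_df = n_k + n_l - 2
    if pooled_df <= 0:
        raise ValueError("need N_K + N_L > 2")
    if w_k + w_l <= 0:
        raise ValueError("W_K + W_L must be positive for the ratio to be finite")
    return b_kl / ((w_k + w_l) / pooled_df)

# Illustrative merge: two clusters of two observations each,
# with W_K = W_L = 2 and W_M = 20, so B_KL = 20 - 2 - 2 = 16.
w_k, w_l, w_m = 2.0, 2.0, 20.0
b_kl = w_m - w_k - w_l
pseudo_t2_value = pseudo_t_squared(b_kl, w_k, w_l, n_k=2, n_l=2)
print(pseudo_t2_value)
```

The pooled within-cluster variance term is $4/2 = 2$, so the statistic is $16/2 = 8$. A large value suggests that the two clusters being joined are well separated relative to their internal spread.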

The pseudo $F$ and $t^2$ statistics can be useful indicators of the number of clusters, but they are not distributed as $F$ and $t^2$ random variables. If the data are independently sampled from a multivariate normal distribution with a scalar covariance matrix and if the clustering method allocates observations to clusters randomly (which no clustering method actually does), then the pseudo $F$ statistic is distributed as an $F$ random variable with $v(G - 1)$ and $v(n - G)$ degrees of freedom. Under the same assumptions, the pseudo $t^2$ statistic is distributed as an $F$ random variable with $v$ and $v(N_K + N_L - 2)$ degrees of freedom. The pseudo $t^2$ statistic differs computationally from Hotelling’s $T^2$ in that the latter uses a general symmetric covariance matrix instead of a scalar covariance matrix. The pseudo $F$ statistic was suggested by Caliński and Harabasz (1974). The pseudo $t^2$ statistic is related to the $J_e(2)/J_e(1)$ statistic of Duda and Hart (1973) by

$$\frac{J_e(2)}{J_e(1)} = \frac{W_K + W_L}{W_M} = \frac{1}{1 + \dfrac{t^2}{N_K + N_L - 2}}$$
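
The second equality can be verified in a few lines. This derivation assumes $B_{KL} = W_M - W_K - W_L$, where $C_M$ is the cluster formed by joining $C_K$ and $C_L$. That definition is not restated in this section.

```latex
\begin{align*}
\frac{t^2}{N_K + N_L - 2}
  &= \frac{B_{KL}}{W_K + W_L}
  && \text{(definition of pseudo } t^2\text{)} \\
1 + \frac{t^2}{N_K + N_L - 2}
  &= \frac{W_K + W_L + B_{KL}}{W_K + W_L}
   = \frac{W_M}{W_K + W_L}
  && \text{(using } B_{KL} = W_M - W_K - W_L\text{)} \\
\frac{1}{1 + \dfrac{t^2}{N_K + N_L - 2}}
  &= \frac{W_K + W_L}{W_M}
\end{align*}
```

For instance, with $W_K = W_L = 2$, $W_M = 20$, and $N_K = N_L = 2$, the pseudo $t^2$ is 8 and both sides equal 0.2.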

See Milligan and Cooper (1985) and Cooper and Milligan (1988) regarding the performance of these statistics in estimating the number of population clusters. Conservative tests for the number of clusters using the pseudo $F$ and $t^2$ statistics can be obtained by the Bonferroni approach (Hawkins, Muller, and ten Krooden 1982, pp. 337–340).

Last updated: December 09, 2022