The following notation is used in this section:
$v$: the number of variables or the dimensionality
$w_j$: weight for the jth variable from the WEIGHTS= option in the VAR statement. $w_j = 0$ when either $x_j$ or $y_j$ is missing, where $x_j$ and $y_j$ are the values of the jth variable for observations x and y.
$W$: the sum of total weights. A variable's weight is added to this sum whether or not the observation has a missing value for that variable.
$\bar{x}$: mean for observation x
$\bar{y}$: mean for observation y
$d(x, y)$: the distance or dissimilarity between observations x and y
$s(x, y)$: the similarity between observations x and y
The factor $W / \sum_{j=1}^{v} w_j$ is used to adjust some of the proximity measures for missing values.
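The effect of the missing-value adjustment can be illustrated with a minimal Python sketch. It assumes that the factor is the ratio of the total weight to the summed weights of the nonmissing pairs and that it multiplies a weighted sum of squared differences before the square root is taken. The function name `adjusted_euclid` and the use of `None` for missing values are hypothetical choices for this illustration; this is not PROC DISTANCE code.

```python
import math

def adjusted_euclid(x, y, weights):
    """Weighted Euclidean distance, rescaled for missing pairs (illustrative)."""
    total_weight = sum(weights)      # every weight counts, missing or not
    used_weight = 0.0                # weights of the nonmissing pairs only
    acc = 0.0
    for xj, yj, wj in zip(x, y, weights):
        if xj is None or yj is None:
            continue                 # the weight is treated as 0 for this pair
        used_weight += wj
        acc += wj * (xj - yj) ** 2
    if used_weight == 0:
        return None                  # no usable pairs: the distance is missing
    # Assumed adjustment: scale the partial sum up to the full weight total.
    return math.sqrt(total_weight / used_weight * acc)

full = adjusted_euclid([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1, 1, 1])
partial = adjusted_euclid([1.0, 2.0, None], [2.0, 4.0, 6.0], [1, 1, 1])
```

In the second call only two of the three pairs are usable, so the partial sum of squares is multiplied by 3/2 to stay comparable with distances computed from complete pairs.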
Similar methods are presented together in the following list.
Note: Squared shape distance plus squared size distance equals squared Euclidean distance.
DCORR: correlation transformed to Euclidean distance as $\sqrt{1 - \mathrm{CORR}}$
DSQCORR: squared correlation transformed to squared Euclidean distance as $1 - \mathrm{SQCORR}$
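POWER(p,r): generalized Euclidean distance, where p is a nonnegative numeric value and r is a positive numeric value. The distance between two observations is the rth root of the sum of the absolute differences to the pth power between the values for the observations, adjusted for missing values:
$$d(x, y) = \left( \frac{W \sum_{j=1}^{v} w_j \lvert x_j - y_j \rvert^{p}}{\sum_{j=1}^{v} w_j} \right)^{1/r}$$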
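The generalized Euclidean distance described above can be sketched in Python for the simplest case. The sketch assumes unit weights and no missing values, so no adjustment factor is applied; the function name `power_distance` is hypothetical.

```python
def power_distance(x, y, p, r):
    """rth root of the sum of absolute differences raised to the pth power."""
    if p < 0 or r <= 0:
        raise ValueError("p must be nonnegative and r must be positive")
    total = sum(abs(xj - yj) ** p for xj, yj in zip(x, y))
    return total ** (1.0 / r)

euclid = power_distance([0, 0], [3, 4], p=2, r=2)       # Euclidean distance
cityblock = power_distance([0, 0], [3, 4], p=1, r=1)    # city-block distance
sqeuclid = power_distance([0, 0], [3, 4], p=2, r=1)     # squared Euclidean
```

Choosing p = r = 2 gives the Euclidean distance, p = r = 1 gives the city-block distance, and p = 2 with r = 1 gives the squared Euclidean distance.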
Similar methods are presented together in the following list.
CANBERRA: Canberra metric distance. See Sneath and Sokal (1973, pp. 125–126).
Note: When both $x_j$ and $y_j$ are zero, the corresponding component fraction is 0/0 and is defined to be zero. Two variants of this distance measure, CANSCALED and CANADKINS, follow.
CANSCALED: Canberra metric distance, scaled
Note: This measure is a scaled version of CANBERRA, so the resulting range falls between 0 and 1.
CANADKINS: Canberra metric distance, Adkins form (Lance and Williams 1967). The formula includes an indicator variable for zero values in both $x_j$ and $y_j$.
Note: This measure is similar to the scaled Canberra distance, CANSCALED, but the zero-valued ($x_j$, $y_j$) pairs are not considered in the scaling.
CHISQ: chi-square. If the data represent frequency counts, the chi-square dissimilarity between two sets of frequencies can be computed. A 2-by-v contingency table illustrates how the chi-square dissimilarity is computed:
where
The chi-square measure is computed as follows:
where for j= 1, 2, …, v
PHISQ: phi-square
This is the CHISQ dissimilarity normalized by the sum of weights.
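The chi-square and phi-square entries above can be illustrated with a minimal Python sketch based on the usual Pearson statistic for a 2-by-v contingency table: each expected count is the row sum times the column sum divided by the grand total. The sketch assumes unit weights and no missing values, so the sum of weights equals v. The function names, and the choice to skip empty columns and to return `None` for an all-zero row, are assumptions made for this illustration.

```python
def chisq_dissimilarity(x, y):
    """Pearson chi-square for the 2-by-v table whose rows are x and y."""
    row_x, row_y = sum(x), sum(y)
    total = row_x + row_y
    if row_x == 0 or row_y == 0:
        return None                   # expected counts would all be zero
    stat = 0.0
    for xj, yj in zip(x, y):
        col = xj + yj                 # column sum for variable j
        if col == 0:
            continue                  # empty column contributes nothing (assumption)
        exp_x = row_x * col / total   # expected count for x in column j
        exp_y = row_y * col / total   # expected count for y in column j
        stat += (xj - exp_x) ** 2 / exp_x + (yj - exp_y) ** 2 / exp_y
    return stat

def phisq_dissimilarity(x, y):
    """Chi-square divided by the sum of weights (v under unit weights)."""
    stat = chisq_dissimilarity(x, y)
    return None if stat is None else stat / len(x)

chisq = chisq_dissimilarity([10, 20], [20, 10])
phisq = phisq_dissimilarity([10, 20], [20, 10])
proportional = chisq_dissimilarity([1, 2], [2, 4])
```

Two observations whose frequencies are proportional, as in the last call, have a chi-square dissimilarity of zero because every observed count equals its expected count.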
The following notation is used for computing the measures in this section. Notice that only the nonmissing pairs are discussed here; all pairs with at least one missing value are excluded from the computations in this section because $w_j = 0$ if either $x_j$ or $y_j$ is missing.
The following list presents distance methods that accept symmetric nominal variables. Similar methods are grouped together.
DMATCH: simple matching coefficient transformed to Euclidean distance
DSQMATCH: simple matching coefficient transformed to squared Euclidean distance
SS3: Sokal and Sneath 3. The coefficient between an observation and itself is always indeterminate (missing) because there is no mismatch.
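The matching-based measures in the preceding list can be sketched in Python. The sketch assumes unit weights, takes the simple matching coefficient to be the number of matches divided by the number of nonmissing pairs, and takes the Sokal and Sneath 3 coefficient to be matches divided by mismatches. The function names and the use of `None` for missing values are hypothetical.

```python
import math

def match_counts(x, y):
    """Count matches and mismatches over the nonmissing pairs."""
    matches = mismatches = 0
    for xj, yj in zip(x, y):
        if xj is None or yj is None:
            continue                      # pairs with a missing value are excluded
        if xj == yj:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches

def dsqmatch(x, y):
    """1 minus the simple matching coefficient (squared Euclidean form)."""
    matches, mismatches = match_counts(x, y)
    n = matches + mismatches
    return None if n == 0 else 1.0 - matches / n

def dmatch(x, y):
    """Square root of the squared form (Euclidean form)."""
    d2 = dsqmatch(x, y)
    return None if d2 is None else math.sqrt(d2)

def ss3(x, y):
    """Matches over mismatches; missing when there is no mismatch."""
    matches, mismatches = match_counts(x, y)
    return None if mismatches == 0 else matches / mismatches

x = ["a", "b", "c", "d"]
y = ["a", "b", "x", None]
```

With these values there are two matches and one mismatch among three nonmissing pairs. Calling `ss3(x, x)` returns `None`, which mirrors the indeterminate result for an observation compared with itself.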
The following notation is used for computing the measures in this section. Notice that only the nonmissing pairs are discussed in this section; all pairs that have at least one missing value are excluded from the computations here because $w_j = 0$ if either $x_j$ or $y_j$ is missing.
Also, the observed nonmissing data of an asymmetric binary variable can have only two possible outcomes: presence or absence. Therefore, PX (present mismatches) always has a value of zero for an asymmetric binary variable.
The following list presents distance methods that accept asymmetric nominal and ratio variables:
JACCARD: Jaccard similarity coefficient
The JACCARD method is equivalent to the SIMRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (SIMRATIO) and the coefficient from the asymmetric nominal variables.
DJACCARD: Jaccard dissimilarity coefficient
The DJACCARD method is equivalent to the DISRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (DISRATIO) and the coefficient from the asymmetric nominal variables.
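For asymmetric binary variables only, the Jaccard coefficients can be sketched in Python as follows. The sketch assumes unit weights, codes presence as 1 and absence as 0, ignores pairs in which both values are absent, and takes the dissimilarity to be 1 minus the similarity. It does not cover ratio variables or the mixed case described above, and the function names are hypothetical.

```python
def jaccard_binary(x, y):
    """Both-present pairs divided by pairs with at least one present value."""
    both_present = one_present = 0
    for xj, yj in zip(x, y):
        if xj is None or yj is None:
            continue                  # missing pairs are excluded
        if xj == 1 and yj == 1:
            both_present += 1
        elif xj == 1 or yj == 1:
            one_present += 1          # present in exactly one observation
        # pairs that are absent in both observations are not counted
    denom = both_present + one_present
    return None if denom == 0 else both_present / denom

def djaccard_binary(x, y):
    """Complement of the Jaccard similarity (assumed dissimilarity form)."""
    s = jaccard_binary(x, y)
    return None if s is None else 1.0 - s

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, None]
```

Here one pair is present in both observations and two pairs are present in only one, so the similarity is 1/3 and the dissimilarity is 2/3.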
With reference to the notation defined in the previous section, the following list presents distance methods that accept asymmetric nominal variables:
DICE: Dice coefficient or Czekanowski/Sorensen similarity coefficient
RR: Russell and Rao. This is the binary equivalent of the dot product coefficient.
BLWNM | BRAYCURTIS: Binary Lance and Williams, also known as Bray and Curtis coefficient
K1: Kulczynski 1. The coefficient between an observation and itself is always indeterminate (missing) because there is no mismatch.
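The four coefficients in the preceding list can be sketched in Python by using their usual textbook definitions for binary data. The sketch assumes unit weights, codes presence as 1 and absence as 0, and uses the hypothetical names `a` (present in both), `b` (present in exactly one), and `n` (all nonmissing pairs). The function name and the dictionary keys are chosen for this illustration only.

```python
def binary_coefficients(x, y):
    """Dice, Russell and Rao, binary Lance and Williams, and Kulczynski 1."""
    a = b = n = 0
    for xj, yj in zip(x, y):
        if xj is None or yj is None:
            continue                  # missing pairs are excluded
        n += 1
        if xj == 1 and yj == 1:
            a += 1                    # present in both observations
        elif xj == 1 or yj == 1:
            b += 1                    # present in exactly one observation
    present = 2 * a + b
    return {
        "dice": None if present == 0 else 2 * a / present,
        "rr": None if n == 0 else a / n,
        "blwnm": None if present == 0 else b / present,
        "k1": None if b == 0 else a / b,   # indeterminate without mismatches
    }

coef = binary_coefficients([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
same = binary_coefficients([1, 0, 1], [1, 0, 1])
```

In the second call the observation is compared with itself, so there are no mismatches and the Kulczynski 1 entry is `None`, which mirrors the indeterminate result noted in the list.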