The DISTANCE Procedure

Proximity Measures

The following notation is used in this section:

v

the number of variables or the dimensionality

$x_j$

data for observation x and the jth variable, where $j = 1, \ldots, v$

$y_j$

data for observation y and the jth variable, where $j = 1, \ldots, v$

$w_j$

weight for the jth variable from the WEIGHTS= option in the VAR statement. The weight $w_j = 0$ when either $x_j$ or $y_j$ is missing.

W

the sum of total weights. Regardless of whether the observation is missing, its weight is added to this metric.

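$\bar{x}$

mean for observation x, $\bar{x} = \sum_{j=1}^{v} w_j x_j / \sum_{j=1}^{v} w_j$

$\bar{y}$

mean for observation y, $\bar{y} = \sum_{j=1}^{v} w_j y_j / \sum_{j=1}^{v} w_j$

$d(x,y)$

the distance or dissimilarity between observations x and y

$s(x,y)$

the similarity between observations x and y

The factor $W / \sum_{j=1}^{v} w_j$ is used to adjust some of the proximity measures for missing values.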
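The weight handling described above can be sketched in Python. This is an illustration only, not the procedure's implementation: the function names are hypothetical, missing values are assumed to be coded as `None` or NaN, and the factor is assumed to be W divided by the sum of the weights that remain after missing pairs are zeroed.

```python
import math

def effective_weights(x, y, weights):
    """Set w_j to 0 wherever x_j or y_j is missing (None or NaN)."""
    def missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return [0.0 if missing(a) or missing(b) else float(w)
            for a, b, w in zip(x, y, weights)]

def missing_value_factor(x, y, weights):
    """Return W / sum(w_j), where W counts every weight, missing or not."""
    total = float(sum(weights))                   # W
    used = sum(effective_weights(x, y, weights))  # sum of w_j over nonmissing pairs
    return total / used if used > 0 else math.nan

x = [1.0, None, 3.0, 4.0]
y = [2.0, 5.0, 3.0, float("nan")]
w = [1, 1, 1, 1]
print(effective_weights(x, y, w))     # [1.0, 0.0, 1.0, 0.0]
print(missing_value_factor(x, y, w))  # 2.0
```

Here two of the four pairs are usable, so a measure that sums over the usable pairs is scaled up by 4 / 2 = 2.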

Methods That Accept All Measurement Levels

GOWER

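Gower’s similarity

$$s(x,y) = \frac{\sum_{j=1}^{v} w_j \, \delta_{x,y}^{j} \, d_{x,y}^{j}}{\sum_{j=1}^{v} w_j \, \delta_{x,y}^{j}}$$

You compute $\delta_{x,y}^{j}$ and $d_{x,y}^{j}$ as follows: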

For nominal, ordinal, interval, or ratio variables,

$$\delta_{x,y}^{j} = 1$$

For asymmetric nominal variables,

$$\delta_{x,y}^{j} = \begin{cases} 1, & \text{if either } x_j \text{ or } y_j \text{ is present} \\ 0, & \text{if both } x_j \text{ and } y_j \text{ are absent} \end{cases}$$

For nominal or asymmetric nominal variables,

$$d_{x,y}^{j} = \begin{cases} 1, & \text{if } x_j = y_j \\ 0, & \text{if } x_j \ne y_j \end{cases}$$

For ordinal, interval, or ratio variables,

$$d_{x,y}^{j} = 1 - |x_j - y_j|$$

DGOWER

one minus Gower’s similarity
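The per-variable rules for GOWER can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure's code: the level labels (`"nominal"`, `"anominal"`, `"numeric"`) and the function name are hypothetical, a value of 0 is assumed to mean "absent" for asymmetric nominal variables, missing values are coded as `None`, and numeric variables are assumed to be already scaled to the range 0 to 1 so that 1 - |x_j - y_j| stays between 0 and 1.

```python
def gower_similarity(x, y, levels, weights=None):
    """Weighted average of per-variable agreement d, counted only where delta = 1."""
    if weights is None:
        weights = [1.0] * len(x)
    num = den = 0.0
    for xj, yj, level, wj in zip(x, y, levels, weights):
        if xj is None or yj is None:
            continue                      # w_j = 0 for a missing pair
        if level == "anominal":
            delta = 1.0 if (xj != 0 or yj != 0) else 0.0
        else:
            delta = 1.0
        if level in ("nominal", "anominal"):
            d = 1.0 if xj == yj else 0.0
        else:                             # ordinal, interval, or ratio
            d = 1.0 - abs(xj - yj)
        num += wj * delta * d
        den += wj * delta
    return num / den if den > 0 else float("nan")

x = ["red", 1, 0, 0.25]
y = ["red", 0, 0, 0.75]
levels = ["nominal", "anominal", "anominal", "numeric"]
s = gower_similarity(x, y, levels)
print(s, 1.0 - s)  # GOWER and DGOWER: 0.5 0.5
```

The third variable is absent in both observations, so it drops out of both the numerator and the denominator.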

Methods That Accept Ratio, Interval, and Ordinal Variables

Similar methods are presented together in the following list.

EUCLID

Euclidean distance

SQEUCLID

Squared Euclidean distance

SIZE

size distance

SHAPE

shape distance

Note: Squared shape distance plus squared size distance equals squared Euclidean distance.
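The identity in the preceding note can be checked numerically. The sketch below assumes unit weights and no missing values, in which case the size distance reduces to |Σ(x_j − y_j)| / √v and the shape distance to the Euclidean distance between the mean-centered observations. These simplified forms are an assumption made for illustration; the function name is hypothetical.

```python
import math

def euclid_size_shape(x, y):
    """Return (Euclidean, size, shape) distances for unit weights, no missing values."""
    v = len(x)
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = sum(diffs) / v                 # equals mean(x) - mean(y)
    euclid = math.sqrt(sum(d * d for d in diffs))
    size = abs(sum(diffs)) / math.sqrt(v)
    shape = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs))
    return euclid, size, shape

euclid, size, shape = euclid_size_shape([1, 2, 3, 4], [2, 2, 5, 9])
# Squared values here are 30 (Euclidean), 16 (size), and 14 (shape).
print(math.isclose(size ** 2 + shape ** 2, euclid ** 2))  # True
```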

COV

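covariance similarity coefficient

$$s(x,y) = \frac{\sum_{j=1}^{v} w_j (x_j - \bar{x})(y_j - \bar{y})}{\mathit{vardiv}}$$

where the divisor $\mathit{vardiv}$ depends on the VARDEF= option: $v$ for VARDEF=N, $v - 1$ for VARDEF=DF, $W$ for VARDEF=WEIGHT, and $W - 1$ for VARDEF=WDF.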

CORR

correlation similarity coefficient

DCORR

correlation transformed to Euclidean distance as sqrt(1–CORR)

SQCORR

squared correlation

DSQCORR

squared correlation transformed to squared Euclidean distance as (1–SQCORR)

L(p)

Minkowski ($L_p$) distance, where p is a positive numeric value

CITYBLOCK

$L_1$ distance

CHEBYCHEV

$L_\infty$ distance

POWER(p,r)

generalized Euclidean distance, where p is a nonnegative numeric value and r is a positive numeric value. The distance between two observations is the rth root of the sum of the pth powers of the absolute differences between the values for the observations:

$$d(x,y) = \left[ \left( \sum_{j=1}^{v} w_j \, |x_j - y_j|^{p} \right) \frac{W}{\sum_{j=1}^{v} w_j} \right]^{1/r}$$
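The relationship among POWER(p,r), L(p), CITYBLOCK, and CHEBYCHEV can be sketched in Python. The sketch assumes unit weights and no missing values, so the weights and the missing-value factor drop out; the function names are hypothetical.

```python
def power_distance(x, y, p, r):
    """r-th root of the sum of |x_j - y_j|**p (unit weights, no missing values)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

def minkowski(x, y, p):
    """L(p) is the special case r = p."""
    return power_distance(x, y, p, p)

def cityblock(x, y):
    """L(1): sum of absolute differences."""
    return minkowski(x, y, 1)

def chebychev(x, y):
    """Limit of L(p) as p grows: the largest absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [0, 0], [3, 4]
print(minkowski(x, y, 2))          # 5.0  (Euclidean)
print(power_distance(x, y, 2, 1))  # 25.0 (squared Euclidean)
print(cityblock(x, y))             # 7.0
print(chebychev(x, y))             # 4
```

Choosing p = 2 with r = 1 reproduces the squared Euclidean distance, which shows why p and r are kept as separate parameters.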

Methods That Accept Ratio Variables

Similar methods are presented together in the following list.

SIMRATIO

similarity ratio

DISRATIO

one minus similarity ratio

NONMETRIC

Lance-Williams nonmetric coefficient

CANBERRA

Canberra metric distance. See Sneath and Sokal (1973, pp. 125–126)

Note: When both $x_j$ and $y_j$ are zero, the corresponding component fraction is defined to be zero. Two variants of this distance measure, CANSCALED and CANADKINS, follow.

CANSCALED

Canberra metric distance, scaled

Note: This measure is a scaled version of the CANBERRA distance so that the resulting range falls between 0 and 1.

CANADKINS

Canberra metric distance, Adkins form (Lance and Williams 1967), which uses an indicator variable for zero values in both $x_j$ and $y_j$.

Note: This measure is similar to the scaled Canberra distance, CANSCALED, but without considering the zero-valued $(x_j, y_j)$ pairs in the scaling.
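The three Canberra variants can be compared with a short Python sketch. It assumes the textbook, unweighted Canberra form (a sum of |x_j − y_j| / (x_j + y_j) over nonnegative ratio values), scaling by the number of variables for CANSCALED, and scaling by the number of pairs that are not both zero for CANADKINS. These forms and the function names are assumptions for illustration; the procedure's weighted formulas may differ in detail.

```python
def canberra_terms(x, y):
    """Per-variable fractions; a pair of zeros contributes 0 by definition."""
    return [0.0 if a == 0 and b == 0 else abs(a - b) / (a + b)
            for a, b in zip(x, y)]

def canberra(x, y):
    return sum(canberra_terms(x, y))

def canscaled(x, y):
    """Divide by the number of variables so the result lies between 0 and 1."""
    return canberra(x, y) / len(x)

def canadkins(x, y):
    """Divide by the number of pairs that are not both zero."""
    informative = sum(1 for a, b in zip(x, y) if not (a == 0 and b == 0))
    return canberra(x, y) / informative if informative else float("nan")

x = [1, 0, 3, 0]
y = [3, 0, 3, 2]
print(canberra(x, y), canscaled(x, y), canadkins(x, y))  # 1.5 0.375 0.5
```

The second pair is zero in both observations, so it lowers the CANSCALED value but is ignored by CANADKINS.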

COSINE

cosine coefficient

DOT

dot (inner) product coefficient

OVERLAP

sum of the minimum values

DOVERLAP

maximum of the sum of the x and the sum of y minus overlap

CHISQ

chi-square. If the data represent frequency counts, you can compute the chi-square dissimilarity between two sets of frequencies. The following 2-by-v contingency table illustrates how the chi-square dissimilarity is computed:

	Variable				Row
Observation	Var 1	Var 2	…	Var v	Sum
X	$w_1 x_1$	$w_2 x_2$	…	$w_v x_v$	$r_x$
Y	$w_1 y_1$	$w_2 y_2$	…	$w_v y_v$	$r_y$
Column Sum	$c_1$	$c_2$	…	$c_v$	T

where

$$
\begin{aligned}
r_x &= \sum_{j=1}^{v} w_j x_j \\
r_y &= \sum_{j=1}^{v} w_j y_j \\
c_j &= w_j (x_j + y_j) \\
T &= r_x + r_y = \sum_{j=1}^{v} c_j
\end{aligned}
$$

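The chi-square measure is computed as follows:

$$d(x,y) = \left( \sum_{j=1}^{v} \frac{\left( w_j x_j - E(x_j) \right)^2}{E(x_j)} + \sum_{j=1}^{v} \frac{\left( w_j y_j - E(y_j) \right)^2}{E(y_j)} \right) \frac{W}{\sum_{j=1}^{v} w_j}$$

where for j = 1, 2, …, v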

$$
\begin{aligned}
E(x_j) &= r_x c_j / T \\
E(y_j) &= r_y c_j / T
\end{aligned}
$$

CHI

square root of chi-square

PHISQ

phi-square. This is the CHISQ dissimilarity normalized by the sum of weights.

PHI

square root of phi-square
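The CHISQ and CHI computations can be traced with a small Python sketch built from the row sums, column sums, and expected values defined above. It assumes unit weights, no missing values, and positive row totals, so the statistic is the usual Pearson chi-square of the 2-by-v table; skipping columns whose sum is zero is a further assumption of the sketch. The function names are hypothetical, and PHISQ and PHI are left out.

```python
import math

def chisq(x, y):
    """Pearson chi-square of the 2-by-v table formed by x and y (unit weights)."""
    r_x, r_y = sum(x), sum(y)             # row sums
    total = r_x + r_y                     # T
    stat = 0.0
    for xj, yj in zip(x, y):
        c_j = xj + yj                     # column sum
        if c_j == 0:
            continue                      # assumption: an empty column adds nothing
        e_x = r_x * c_j / total           # E(x_j)
        e_y = r_y * c_j / total           # E(y_j)
        stat += (xj - e_x) ** 2 / e_x + (yj - e_y) ** 2 / e_y
    return stat

def chi(x, y):
    """CHI is the square root of CHISQ."""
    return math.sqrt(chisq(x, y))

x, y = [10, 20], [20, 10]
# Every expected count is 15, so each of the four cells contributes 25 / 15.
print(chisq(x, y), chi(x, y))
```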

Methods That Accept Symmetric Nominal Variables

The following notation is used for computing the measures in this section. Only the nonmissing pairs are discussed here; all pairs with at least one missing value are excluded from the computations in this section because $w_j = 0$ if either $x_j$ or $y_j$ is missing.

M

nonmissing matches

$M = \sum_{j=1}^{v} w_j \, \delta_{x,y}^{j}$, where

$$\delta_{x,y}^{j} = \begin{cases} 1, & \text{if } x_j = y_j \\ 0, & \text{otherwise} \end{cases}$$

X

nonmissing mismatches

$X = \sum_{j=1}^{v} w_j \, \delta_{x,y}^{j}$, where

$$\delta_{x,y}^{j} = \begin{cases} 1, & \text{if } x_j \ne y_j \\ 0, & \text{otherwise} \end{cases}$$

N

total nonmissing pairs

$N = \sum_{j=1}^{v} w_j$

The following list presents distance methods that accept symmetric nominal variables. Similar methods are grouped together.

HAMMING: Hamming distance
MATCH: simple matching coefficient
DMATCH: simple matching coefficient transformed to Euclidean distance
DSQMATCH: simple matching coefficient transformed to squared Euclidean distance
HAMANN: Hamann coefficient
RT: Rogers and Tanimoto
SS1: Sokal and Sneath 1
SS3: Sokal and Sneath 3. The coefficient between an observation and itself is always indeterminate (missing) because there is no mismatch.
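The counts M, X, and N and the measures built on them can be sketched in Python. The formulas used here are the standard textbook forms of these coefficients (for example, MATCH as M / N and SS3 as M / X). They are an assumption of this sketch, as are unit weights, the function names, and the coding of missing values as `None`.

```python
import math

def match_counts(x, y):
    """Return (M, X, N) over the nonmissing pairs, with unit weights."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    matches = sum(1 for a, b in pairs if a == b)
    return matches, len(pairs) - matches, len(pairs)

def symmetric_measures(x, y):
    m, mis, n = match_counts(x, y)
    def ratio(num, den):
        return num / den if den else math.nan   # indeterminate when den is 0
    return {
        "HAMMING": n - m,
        "MATCH": ratio(m, n),
        "DMATCH": math.sqrt(1 - m / n) if n else math.nan,
        "DSQMATCH": 1 - m / n if n else math.nan,
        "HAMANN": ratio(m - mis, n),
        "RT": ratio(m, m + 2 * mis),
        "SS1": ratio(2 * m, 2 * m + mis),
        "SS3": ratio(m, mis),
    }

x = ["a", "b", "c", "d", None]
y = ["a", "b", "x", "d", "e"]
print(match_counts(x, y))                # (3, 1, 4)
print(symmetric_measures(x, x)["SS3"])   # nan: no mismatches with itself
```

The last line shows why SS3 between an observation and itself is indeterminate: the number of mismatches in the denominator is zero.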

Methods That Accept Asymmetric Nominal and Ratio Variables

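The following notation is used for computing the measures in this section. Only the nonmissing pairs are discussed in this section; all pairs that have at least one missing value are excluded from the computations here because $w_j = 0$ if either $x_j$ or $y_j$ is missing.

Also, the observed nonmissing data of an asymmetric binary variable can have only two possible outcomes: presence or absence. Therefore, PX (present mismatches) is always zero for an asymmetric binary variable.

X

mismatches with at least one present

$X = \sum_{j=1}^{v} w_j \, \delta_{x,y}^{j}$, where

$$\delta_{x,y}^{j} = \begin{cases} 1, & \text{if } x_j \ne y_j \text{ and not both } x_j \text{ and } y_j \text{ are absent} \\ 0, & \text{otherwise} \end{cases}$$

PM

present matches

$PM = \sum_{j=1}^{v} w_j \, \delta_{x,y}^{j}$, where

$$\delta_{x,y}^{j} = \begin{cases} 1, & \text{if } x_j = y_j \text{ and both } x_j \text{ and } y_j \text{ are present} \\ 0, & \text{otherwise} \end{cases}$$

PX

present mismatches

$PX = \sum_{j=1}^{v} w_j \, \delta_{x,y}^{j}$, where

$$\delta_{x,y}^{j} = \begin{cases} 1, & \text{if } x_j \ne y_j \text{ and both } x_j \text{ and } y_j \text{ are present} \\ 0, & \text{otherwise} \end{cases}$$

PP

both present = PM + PX

P

at least one present = PM + X

PAX

present-absent mismatches

$PAX = \sum_{j=1}^{v} w_j \, \delta_{x,y}^{j}$, where

$$\delta_{x,y}^{j} = \begin{cases} 1, & \text{if exactly one of } x_j \text{ and } y_j \text{ is present} \\ 0, & \text{otherwise} \end{cases}$$

N

total nonmissing pairs

$N = \sum_{j=1}^{v} w_j$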

The following list presents distance methods that accept asymmetric nominal and ratio variables:

JACCARD

Jaccard similarity coefficient

The JACCARD method is equivalent to the SIMRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (SIMRATIO) and the coefficient from the asymmetric nominal variables.

DJACCARD

Jaccard dissimilarity coefficient

The DJACCARD method is equivalent to the DISRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (DISRATIO) and the coefficient from the asymmetric nominal variables.

Methods That Accept Asymmetric Nominal Variables

With reference to the notation defined in the previous section, the following list presents distance methods that accept asymmetric nominal variables:

DICE

Dice coefficient or Czekanowski/Sorensen similarity coefficient

RR

Russell and Rao. This is the binary equivalent of the dot product coefficient.

BLWNM | BRAYCURTIS

Binary Lance and Williams, also known as Bray and Curtis coefficient

K1

Kulczynski 1. The coefficient between an observation and itself is always indeterminate (missing) because there is no mismatch.
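The presence/absence counts and the coefficients for asymmetric nominal variables can be sketched in Python. The sketch assumes unit weights, asymmetric nominal variables only, missing values coded as `None`, and absence coded as 0. The formulas are the standard textbook forms (for example, Jaccard as PM / P and Dice as 2 PM / (P + PM)) and are assumptions of this illustration, as are the function names.

```python
import math

def asymmetric_counts(x, y, absent=0):
    """Count PM, PX, PAX, X, PP, P, and N over the nonmissing pairs."""
    pm = px = pax = n = 0
    for xj, yj in zip(x, y):
        if xj is None or yj is None:
            continue                      # missing pairs are excluded
        n += 1
        x_present, y_present = xj != absent, yj != absent
        if x_present and y_present:
            if xj == yj:
                pm += 1                   # present match
            else:
                px += 1                   # present mismatch
        elif x_present or y_present:
            pax += 1                      # present-absent mismatch
    mismatches = px + pax                 # X: mismatches with at least one present
    return {"PM": pm, "PX": px, "PAX": pax, "X": mismatches,
            "PP": pm + px, "P": pm + mismatches, "N": n}

def asymmetric_measures(x, y):
    c = asymmetric_counts(x, y)
    def ratio(num, den):
        return num / den if den else math.nan   # indeterminate when den is 0
    return {
        "JACCARD": ratio(c["PM"], c["P"]),
        "DJACCARD": ratio(c["X"], c["P"]),
        "DICE": ratio(2 * c["PM"], c["P"] + c["PM"]),
        "RR": ratio(c["PM"], c["N"]),
        "BLWNM": ratio(c["X"], c["PM"] + c["P"]),
        "K1": ratio(c["PM"], c["X"]),
    }

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(asymmetric_counts(x, y))
print(asymmetric_measures(x, y)["JACCARD"])  # 0.5
```

For this binary example PX is 0, as noted earlier, so X equals PAX, and the pair in which both values are absent counts toward N but not toward P.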

Last updated: December 09, 2022