The HPSPLIT Procedure

Measures of Model Fit

Various measures of model fit have been proposed in the data mining literature. The HPSPLIT procedure measures model fit based on a number of metrics for classification trees and regression trees.

If you specify a variable in the WEIGHT statement, then the weight of an observation is the value of the weight variable for that observation. If no WEIGHT statement is specified, then the weight of each observation is equal to one. In this case, the sum of weights of observations is equal to the number of observations.

Measures of Model Fit for Classification Trees

The HPSPLIT procedure measures model fit based on the following metrics for classification trees: entropy, Gini index, misclassification rate (Misc), residual sum of squares (RSS), average square error (ASE, also known as the Brier score), sensitivity, specificity, area under the curve (AUC), and the confusion matrix.

Entropy for Classification Trees

Entropy for classification trees is defined as

$$\mathrm{Entropy} = -\sum_{\lambda} \frac{N_{w\lambda}}{N_{w0}} \sum_{\tau} \frac{N_{w\tau}^{\lambda}}{N_{w\lambda}} \log_2\left( \frac{N_{w\tau}^{\lambda}}{N_{w\lambda}} \right)$$

where

  • $\lambda$ is a leaf

  • $N_{w\lambda}$ is the sum of weights of observations on leaf $\lambda$

  • $N_{w0}$ is the total sum of weights of observations in the entire data set

  • $\tau$ is a level of the response variable

  • $N_{w\tau}^{\lambda}$ is the sum of weights of observations on leaf $\lambda$ that have response level $\tau$
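
The weighted entropy can be sketched in a few lines. The following is an illustrative Python computation over hypothetical per-leaf weight sums, not the HPSPLIT implementation; the function name and data layout are assumptions:

```python
# Illustrative sketch of weighted entropy over leaves (not PROC HPSPLIT itself).
# Each leaf is a dict mapping a response level to the sum of observation
# weights on that leaf for that level.
from math import log2

def entropy(leaves):
    n_w0 = sum(sum(leaf.values()) for leaf in leaves)  # total weight N_w0
    total = 0.0
    for leaf in leaves:
        n_wl = sum(leaf.values())                      # N_{w lambda}
        for n_wt in leaf.values():                     # N_{w tau}^lambda
            p = n_wt / n_wl
            if p > 0:                                  # 0 * log2(0) is taken as 0
                total -= (n_wl / n_w0) * p * log2(p)
    return total

# A pure leaf contributes 0 bits; a 50/50 leaf contributes its weight share of 1 bit.
print(entropy([{"yes": 4, "no": 0}, {"yes": 2, "no": 2}]))  # 0.5
```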

Gini Index for Classification Trees

The Gini index for classification trees is defined as

$$\mathrm{Gini} = \sum_{\lambda} \frac{N_{w\lambda}}{N_{w0}} \sum_{\tau} \frac{N_{w\tau}^{\lambda}}{N_{w\lambda}} \left( 1 - \frac{N_{w\tau}^{\lambda}}{N_{w\lambda}} \right)$$

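
As a hedged illustration, the Gini index can be computed from per-leaf weight sums per response level; the function name and data layout below are assumptions, not the HPSPLIT implementation:

```python
# Illustrative sketch of the weighted Gini index (not PROC HPSPLIT itself).
# Each leaf is a dict mapping a response level to the sum of observation weights.
def gini(leaves):
    n_w0 = sum(sum(leaf.values()) for leaf in leaves)  # total weight N_w0
    total = 0.0
    for leaf in leaves:
        n_wl = sum(leaf.values())                      # N_{w lambda}
        for n_wt in leaf.values():                     # N_{w tau}^lambda
            p = n_wt / n_wl
            total += (n_wl / n_w0) * p * (1 - p)
    return total

print(gini([{"yes": 4, "no": 0}, {"yes": 2, "no": 2}]))  # 0.25
```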
Misclassification Rate for Classification Trees

The misclassification rate (Misc) is based on the sum of weights of incorrectly predicted observations. It is defined as

$$\mathrm{Misc} = \frac{1}{N_{w0}} \sum_{i} c_i, \qquad c_i = \begin{cases} 0 & \text{if the prediction for observation } i \text{ is correct} \\ w_i & \text{otherwise} \end{cases}$$

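
The weighted misclassification rate can be sketched as follows; the function name and data are hypothetical, not the HPSPLIT implementation:

```python
# Illustrative sketch of the weighted misclassification rate
# (not PROC HPSPLIT itself; names and data are hypothetical).
def misc_rate(actual, predicted, weights):
    n_w0 = sum(weights)                                   # N_w0
    wrong = sum(w for a, p, w in zip(actual, predicted, weights) if a != p)
    return wrong / n_w0

actual    = ["yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no"]
weights   = [1.0, 1.0, 2.0, 1.0]
print(misc_rate(actual, predicted, weights))  # (2.0 + 1.0) / 5.0 = 0.6
```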
Residual Sum of Squares for Classification Trees

The residual sum of squares (RSS) for classification trees is defined as

$$\mathrm{RSS} = \sum_{\lambda} \sum_{\Phi} N_{w\Phi}^{\lambda} \left[ \sum_{\tau \neq \Phi} \left( P_{w\tau}^{\lambda} \right)^2 + \left( 1 - P_{w\Phi}^{\lambda} \right)^2 \right]$$

where

  • $\Phi$ is the actual response level

  • $N_{w\Phi}^{\lambda}$ is the sum of weights of observations on leaf $\lambda$ that have response level $\Phi$

  • $P_{w\tau}^{\lambda}$ is the weighted posterior probability for response level $\tau$ on leaf $\lambda$,

    $P_{w\tau}^{\lambda} = \dfrac{N_{w\tau}^{\lambda}}{N_{w\lambda}}$

  • $P_{w\Phi}^{\lambda}$ is the weighted posterior probability for the actual response level $\Phi$ on leaf $\lambda$,

    $P_{w\Phi}^{\lambda} = \dfrac{N_{w\Phi}^{\lambda}}{N_{w\lambda}}$

Average Square Error for Classification Trees

The average square error (ASE) is also known as the Brier score for classification trees. It is defined as

$$\mathrm{ASE} = \frac{\mathrm{RSS}}{N_{w0} \, N_T}$$

where $N_T$ is the number of levels of the response variable.
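
The RSS and ASE definitions can be sketched together; this is an illustrative computation over hypothetical per-leaf weight sums, not the HPSPLIT implementation, and deriving the number of response levels from the leaf dicts is an assumption of this sketch:

```python
# Illustrative sketch of RSS and ASE (Brier score) for a classification tree
# (not PROC HPSPLIT itself). Each leaf is a dict mapping a response level to
# the sum of observation weights on that leaf for that level.
def rss_classification(leaves):
    total = 0.0
    for leaf in leaves:
        n_wl = sum(leaf.values())
        post = {t: n / n_wl for t, n in leaf.items()}  # P_{w tau}^lambda
        for phi, n_phi in leaf.items():                # actual level Phi
            err = sum(p * p for t, p in post.items() if t != phi)
            err += (1 - post[phi]) ** 2
            total += n_phi * err                       # weighted by N_{w Phi}^lambda
    return total

def ase_classification(leaves):
    n_w0 = sum(sum(leaf.values()) for leaf in leaves)  # N_w0
    n_t = len({t for leaf in leaves for t in leaf})    # response levels N_T (assumed derivable)
    return rss_classification(leaves) / (n_w0 * n_t)

leaves = [{"yes": 3, "no": 1}]
print(rss_classification(leaves))  # 1.5
print(ase_classification(leaves))  # 1.5 / (4 * 2) = 0.1875
```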

Sensitivity for Binary Classification Trees

Sensitivity is the probability of predicting an event for the response variable when the actual state is an event. For example, if the event is "an individual is sick," then sensitivity is the probability of predicting that an individual is sick given that the individual is actually sick. For binary classification trees, it is defined as

$$\mathrm{Sensitivity} = \frac{\mathrm{TP}_w}{P_w}$$

where

  • $\mathrm{TP}_w$ is the sum of weights of true positives (sick individuals who are predicted to be sick)

  • $P_w$ is the sum of weights of positive observations (sick individuals)

Specificity for Binary Classification Trees

Specificity is the probability of predicting a nonevent for the response variable when the actual state is a nonevent. For example, if the event is "an individual is sick," then specificity is the probability of predicting that an individual is not sick given that the individual is actually not sick. For a binary classification tree, specificity is defined as

$$\mathrm{Specificity} = \frac{\mathrm{TN}_w}{N_w}$$

where

  • $\mathrm{TN}_w$ is the sum of weights of true negatives (healthy individuals who are predicted to be not sick)

  • $N_w$ is the sum of weights of negative observations (healthy individuals)
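
Sensitivity and specificity can be illustrated together on hypothetical data; the function name, event level, and data layout below are assumptions, not the HPSPLIT implementation:

```python
# Illustrative sketch of weighted sensitivity and specificity for a binary
# response (not PROC HPSPLIT itself; the event level "sick" is hypothetical).
def sens_spec(actual, predicted, weights, event="sick"):
    tp = sum(w for a, p, w in zip(actual, predicted, weights)
             if a == event and p == event)                   # TP_w
    tn = sum(w for a, p, w in zip(actual, predicted, weights)
             if a != event and p != event)                   # TN_w
    pos = sum(w for a, w in zip(actual, weights) if a == event)  # P_w
    neg = sum(w for a, w in zip(actual, weights) if a != event)  # N_w
    return tp / pos, tn / neg

actual    = ["sick", "sick", "well", "well"]
predicted = ["sick", "well", "well", "sick"]
weights   = [2.0, 1.0, 1.0, 1.0]
sens, spec = sens_spec(actual, predicted, weights)
print(sens, spec)  # 2/3 and 1/2
```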

Area under the Curve for Binary Classification Trees

Area under the curve (AUC) is defined as the area under the receiver operating characteristic (ROC) curve. PROC HPSPLIT uses sensitivity as the Y axis and 1 – specificity as the X axis to draw the ROC curve. AUC is calculated by trapezoidal rule integration,

$$\mathrm{AUC} = \frac{1}{2} \sum_{\lambda} \left( x_{\lambda} - x_{\lambda-1} \right) \left( y_{\lambda} + y_{\lambda-1} \right)$$

where

  • $y_{\lambda}$ is the sensitivity value at leaf $\lambda$

  • $x_{\lambda}$ is the 1 − specificity value at leaf $\lambda$

Note: For a binary response, the event level that is used for calculating sensitivity, specificity, and AUC is specified in the EVENT= option in the MODEL statement.
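
The trapezoidal integration above can be sketched as follows. This is an illustration, not the HPSPLIT implementation; in particular, anchoring the curve at (0, 0) and (1, 1) is an assumption of this sketch rather than something the text specifies:

```python
# Illustrative trapezoidal-rule AUC (not PROC HPSPLIT itself). Points are
# (1 - specificity, sensitivity) pairs, one per leaf-based cutoff; anchoring
# the curve at (0, 0) and (1, 1) is an assumption of this sketch.
def auc(points):
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return 0.5 * sum((x1 - x0) * (y1 + y0)
                     for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# One interior point at (0.2, 0.8):
print(auc([(0.2, 0.8)]))  # area of two trapezoids, 0.08 + 0.72 ≈ 0.8
```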

Confusion Matrix for Classification Trees

A confusion matrix is also known as a contingency table. It contains information about actual values and predicted values from a classification tree. A confusion matrix has $N_T$ rows and $N_T$ columns, where each row corresponds to the actual response level and each column corresponds to the predicted response level. The values in the matrix represent the number of observations that have the actual response represented in the row and the predicted response represented in the column. The error rate per actual response level is also reported,

$$\mathrm{ErrorRate} = \frac{N_{ww}}{N_{w\Phi}}$$

where

  • $N_{ww}$ is the sum of weights of wrong predictions for observations that have response level $\Phi$

  • $N_{w\Phi}$ is the sum of weights of observations that have response level $\Phi$
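
A weighted confusion matrix and its per-level error rate can be sketched as follows; the function names and data are hypothetical, not the HPSPLIT implementation:

```python
# Illustrative sketch of a weighted confusion matrix and the per-level error
# rate (not PROC HPSPLIT itself; data are hypothetical).
from collections import defaultdict

def confusion(actual, predicted, weights):
    m = defaultdict(float)              # (actual level, predicted level) -> weight sum
    for a, p, w in zip(actual, predicted, weights):
        m[(a, p)] += w
    return m

def error_rate(m, phi):
    row = {p: w for (a, p), w in m.items() if a == phi}
    n_w_phi = sum(row.values())         # N_{w Phi}: total weight for actual level phi
    n_ww = n_w_phi - row.get(phi, 0.0)  # N_ww: weight of wrong predictions in that row
    return n_ww / n_w_phi

m = confusion(["yes", "yes", "no"], ["yes", "no", "no"], [1.0, 2.0, 1.0])
print(error_rate(m, "yes"))  # 2.0 / 3.0
```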

Measures of Model Fit for Regression Trees

The HPSPLIT procedure measures model fit for regression trees based on RSS and ASE.

Residual Sum of Squares for Regression Trees

The residual sum of squares (RSS) for regression trees is defined as

$$\mathrm{RSS} = \sum_{\lambda} \sum_{i \in \lambda} w_i \left( y_i - \hat{y}_{\lambda} \right)^2$$

where

  • $i$ is an observation on leaf $\lambda$

  • $y_i$ is the actual value of the response variable for observation $i$

  • $\hat{y}_{\lambda}$ is the predicted value of the response variable on leaf $\lambda$

Average Square Error for Regression Trees

The average square error (ASE) for regression trees is defined as

$$\mathrm{ASE} = \frac{\mathrm{RSS}}{N_{w0}}$$

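
The regression-tree RSS and ASE can be sketched together; the data layout (one predicted value per leaf, with weighted observations) and function names are assumptions, not the HPSPLIT implementation:

```python
# Illustrative sketch of RSS and ASE for a regression tree (not PROC HPSPLIT
# itself). Each leaf is (predicted value y_hat_lambda, list of (y_i, w_i)).
def rss_regression(leaves):
    return sum(w * (y - y_hat) ** 2
               for y_hat, obs in leaves
               for y, w in obs)

def ase_regression(leaves):
    n_w0 = sum(w for _, obs in leaves for _, w in obs)  # N_w0
    return rss_regression(leaves) / n_w0

leaves = [(2.0, [(1.0, 1.0), (3.0, 1.0)]),  # residuals -1 and +1
          (5.0, [(5.0, 2.0)])]              # perfect prediction, weight 2
print(rss_regression(leaves))  # 2.0
print(ase_regression(leaves))  # 2.0 / 4.0 = 0.5
```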
Last updated: December 09, 2022