The HPSPLIT Procedure

Measures of Model Fit

Various measures of model fit have been proposed in the data mining literature. The HPSPLIT procedure measures model fit based on a number of metrics for classification trees and regression trees.

If you specify a variable in the WEIGHT statement, then the weight of an observation is the value of the weight variable for that observation. If no WEIGHT statement is specified, then the weight of each observation is equal to one. In this case, the sum of weights of observations is equal to the number of observations.

Measures of Model Fit for Classification Trees

The HPSPLIT procedure measures model fit based on the following metrics for classification trees: entropy, Gini index, misclassification rate (Misc), residual sum of squares (RSS), average square error (ASE, also known as the Brier score), sensitivity, specificity, area under the curve (AUC), and the confusion matrix.

Entropy for Classification Trees

Entropy for classification trees is defined as

$$\mathrm{Entropy} = -\sum_{\lambda} \frac{N_{w\lambda}}{N_{w0}} \sum_{\tau} \frac{N_{w\tau}^{\lambda}}{N_{w\lambda}} \log_2\left( \frac{N_{w\tau}^{\lambda}}{N_{w\lambda}} \right)$$

where

  • $\lambda$ is a leaf

  • $N_{w\lambda}$ is the sum of weights of observations on leaf $\lambda$

  • $N_{w0}$ is the total sum of weights of observations in the entire data set

  • $\tau$ is a level of the response variable

  • $N_{w\tau}^{\lambda}$ is the sum of weights of observations on leaf $\lambda$ that have response level $\tau$
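
The weighted entropy can be sketched in a few lines. The following is an illustrative Python computation over hypothetical per-leaf weight sums, not the HPSPLIT implementation; the function name and data layout are assumptions:

```python
# Illustrative sketch of weighted entropy over leaves (not PROC HPSPLIT itself).
# Each leaf is a dict mapping a response level to the sum of observation
# weights on that leaf for that level.
from math import log2

def entropy(leaves):
    n_w0 = sum(sum(leaf.values()) for leaf in leaves)  # total weight N_w0
    total = 0.0
    for leaf in leaves:
        n_wl = sum(leaf.values())                      # N_{w lambda}
        for n_wt in leaf.values():                     # N_{w tau}^lambda
            p = n_wt / n_wl
            if p > 0:                                  # 0 * log2(0) is taken as 0
                total -= (n_wl / n_w0) * p * log2(p)
    return total

# A pure leaf contributes 0 bits; a 50/50 leaf contributes its weight share of 1 bit.
print(entropy([{"yes": 4, "no": 0}, {"yes": 2, "no": 2}]))  # 0.5
```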

Gini Index for Classification Trees

The Gini index for classification trees is defined as

$$\mathrm{Gini} = \sum_{\lambda} \frac{N_{w\lambda}}{N_{w0}} \sum_{\tau} \frac{N_{w\tau}^{\lambda}}{N_{w\lambda}} \left( 1 - \frac{N_{w\tau}^{\lambda}}{N_{w\lambda}} \right)$$

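
As a hedged illustration, the Gini index can be computed from per-leaf weight sums per response level; the function name and data layout below are assumptions, not the HPSPLIT implementation:

```python
# Illustrative sketch of the weighted Gini index (not PROC HPSPLIT itself).
# Each leaf is a dict mapping a response level to the sum of observation weights.
def gini(leaves):
    n_w0 = sum(sum(leaf.values()) for leaf in leaves)  # total weight N_w0
    total = 0.0
    for leaf in leaves:
        n_wl = sum(leaf.values())                      # N_{w lambda}
        for n_wt in leaf.values():                     # N_{w tau}^lambda
            p = n_wt / n_wl
            total += (n_wl / n_w0) * p * (1 - p)
    return total

print(gini([{"yes": 4, "no": 0}, {"yes": 2, "no": 2}]))  # 0.25
```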
Misclassification Rate for Classification Trees

The misclassification rate (Misc) is based on the sum of weights of incorrectly predicted observations. It is defined as

$$\mathrm{Misc} = \frac{1}{N_{w0}} \sum_{i} c_i, \qquad c_i = \begin{cases} 0 & \text{if the prediction for observation } i \text{ is correct} \\ w_i & \text{otherwise} \end{cases}$$

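
The weighted misclassification rate can be sketched as follows; the function name and data are hypothetical, not the HPSPLIT implementation:

```python
# Illustrative sketch of the weighted misclassification rate
# (not PROC HPSPLIT itself; names and data are hypothetical).
def misc_rate(actual, predicted, weights):
    n_w0 = sum(weights)                                   # N_w0
    wrong = sum(w for a, p, w in zip(actual, predicted, weights) if a != p)
    return wrong / n_w0

actual    = ["yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no"]
weights   = [1.0, 1.0, 2.0, 1.0]
print(misc_rate(actual, predicted, weights))  # (2.0 + 1.0) / 5.0 = 0.6
```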
Residual Sum of Squares for Classification Trees

The residual sum of squares (RSS) for classification trees is defined as

$$\mathrm{RSS} = \sum_{\lambda} \sum_{\Phi} N_{w\Phi}^{\lambda} \left[ \sum_{\tau \neq \Phi} \left( P_{w\tau}^{\lambda} \right)^2 + \left( 1 - P_{w\Phi}^{\lambda} \right)^2 \right]$$

where

  • $\Phi$ is the actual response level

  • $N_{w\Phi}^{\lambda}$ is the sum of weights of observations on leaf $\lambda$ that have response level $\Phi$

  • $P_{w\tau}^{\lambda}$ is the weighted posterior probability for response level $\tau$ on leaf $\lambda$,

    $P_{w\tau}^{\lambda} = \dfrac{N_{w\tau}^{\lambda}}{N_{w\lambda}}$

  • $P_{w\Phi}^{\lambda}$ is the weighted posterior probability for the actual response level $\Phi$ on leaf $\lambda$,

    $P_{w\Phi}^{\lambda} = \dfrac{N_{w\Phi}^{\lambda}}{N_{w\lambda}}$

Average Square Error for Classification Trees

The average square error (ASE) is also known as the Brier score for classification trees. It is defined as

$$\mathrm{ASE} = \frac{\mathrm{RSS}}{N_{w0} \, N_T}$$

where $N_T$ is the number of levels of the response variable.
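
The RSS and ASE definitions can be sketched together; this is an illustrative computation over hypothetical per-leaf weight sums, not the HPSPLIT implementation, and deriving the number of response levels from the leaf dicts is an assumption of this sketch:

```python
# Illustrative sketch of RSS and ASE (Brier score) for a classification tree
# (not PROC HPSPLIT itself). Each leaf is a dict mapping a response level to
# the sum of observation weights on that leaf for that level.
def rss_classification(leaves):
    total = 0.0
    for leaf in leaves:
        n_wl = sum(leaf.values())
        post = {t: n / n_wl for t, n in leaf.items()}  # P_{w tau}^lambda
        for phi, n_phi in leaf.items():                # actual level Phi
            err = sum(p * p for t, p in post.items() if t != phi)
            err += (1 - post[phi]) ** 2
            total += n_phi * err                       # weighted by N_{w Phi}^lambda
    return total

def ase_classification(leaves):
    n_w0 = sum(sum(leaf.values()) for leaf in leaves)  # N_w0
    n_t = len({t for leaf in leaves for t in leaf})    # response levels N_T (assumed derivable)
    return rss_classification(leaves) / (n_w0 * n_t)

leaves = [{"yes": 3, "no": 1}]
print(rss_classification(leaves))  # 1.5
print(ase_classification(leaves))  # 1.5 / (4 * 2) = 0.1875
```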

Sensitivity for Binary Classification Trees

Sensitivity is the probability of predicting an event for the response variable when the actual state is an event. For example, if the event is "an individual is sick," then sensitivity is the probability of predicting that an individual is sick given that the individual is actually sick. For binary classification trees, it is defined as

$$\mathrm{Sensitivity} = \frac{\mathrm{TP}_w}{P_w}$$

where

  • $\mathrm{TP}_w$ is the sum of weights of true positives (sick individuals who are predicted to be sick)

  • $P_w$ is the sum of weights of positive observations (sick individuals)

Specificity for Binary Classification Trees

Specificity is the probability of predicting a nonevent for the response variable when the actual state is a nonevent. For example, if the event is "an individual is sick," then specificity is the probability of predicting that an individual is not sick given that the individual is actually not sick. For a binary classification tree, specificity is defined as

$$\mathrm{Specificity} = \frac{\mathrm{TN}_w}{N_w}$$

where

  • $\mathrm{TN}_w$ is the sum of weights of true negatives (healthy individuals who are predicted to be not sick)

  • $N_w$ is the sum of weights of negative observations (healthy individuals)
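
Sensitivity and specificity can be illustrated together on hypothetical data; the function name, event level, and data layout below are assumptions, not the HPSPLIT implementation:

```python
# Illustrative sketch of weighted sensitivity and specificity for a binary
# response (not PROC HPSPLIT itself; the event level "sick" is hypothetical).
def sens_spec(actual, predicted, weights, event="sick"):
    tp = sum(w for a, p, w in zip(actual, predicted, weights)
             if a == event and p == event)                   # TP_w
    tn = sum(w for a, p, w in zip(actual, predicted, weights)
             if a != event and p != event)                   # TN_w
    pos = sum(w for a, w in zip(actual, weights) if a == event)  # P_w
    neg = sum(w for a, w in zip(actual, weights) if a != event)  # N_w
    return tp / pos, tn / neg

actual    = ["sick", "sick", "well", "well"]
predicted = ["sick", "well", "well", "sick"]
weights   = [2.0, 1.0, 1.0, 1.0]
sens, spec = sens_spec(actual, predicted, weights)
print(sens, spec)  # 2/3 and 1/2
```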

Area under the Curve for Binary Classification Trees

Area under the curve (AUC) is defined as the area under the receiver operating characteristic (ROC) curve. PROC HPSPLIT uses sensitivity as the Y axis and 1 – specificity as the X axis to draw the ROC curve. AUC is calculated by trapezoidal rule integration,

$$\mathrm{AUC} = \frac{1}{2} \sum_{\lambda} \left( x_{\lambda} - x_{\lambda-1} \right) \left( y_{\lambda} + y_{\lambda-1} \right)$$

where

  • $y_{\lambda}$ is the sensitivity value at leaf $\lambda$

  • $x_{\lambda}$ is the 1 − specificity value at leaf $\lambda$

Note: For a binary response, the event level that is used for calculating sensitivity, specificity, and AUC is specified in the EVENT= option in the MODEL statement.
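
The trapezoidal integration above can be sketched as follows. This is an illustration, not the HPSPLIT implementation; in particular, anchoring the curve at (0, 0) and (1, 1) is an assumption of this sketch rather than something the text specifies:

```python
# Illustrative trapezoidal-rule AUC (not PROC HPSPLIT itself). Points are
# (1 - specificity, sensitivity) pairs, one per leaf-based cutoff; anchoring
# the curve at (0, 0) and (1, 1) is an assumption of this sketch.
def auc(points):
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return 0.5 * sum((x1 - x0) * (y1 + y0)
                     for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# One interior point at (0.2, 0.8):
print(auc([(0.2, 0.8)]))  # area of two trapezoids, 0.08 + 0.72 ≈ 0.8
```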

Confusion Matrix for Classification Trees

A confusion matrix is also known as a contingency table. It contains information about actual values and predicted values from a classification tree. A confusion matrix has $N_T$ rows and $N_T$ columns, where each row corresponds to the actual response level and each column corresponds to the predicted response level. The values in the matrix represent the number of observations that have the actual response represented in the row and the predicted response represented in the column. The error rate per actual response level is also reported,

$$\mathrm{ErrorRate} = \frac{N_{ww}}{N_{w\Phi}}$$

where

  • $N_{ww}$ is the sum of weights of wrong predictions for observations that have response level $\Phi$

  • $N_{w\Phi}$ is the sum of weights of observations that have response level $\Phi$
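
A weighted confusion matrix and its per-level error rate can be sketched as follows; the function names and data are hypothetical, not the HPSPLIT implementation:

```python
# Illustrative sketch of a weighted confusion matrix and the per-level error
# rate (not PROC HPSPLIT itself; data are hypothetical).
from collections import defaultdict

def confusion(actual, predicted, weights):
    m = defaultdict(float)              # (actual level, predicted level) -> weight sum
    for a, p, w in zip(actual, predicted, weights):
        m[(a, p)] += w
    return m

def error_rate(m, phi):
    row = {p: w for (a, p), w in m.items() if a == phi}
    n_w_phi = sum(row.values())         # N_{w Phi}: total weight for actual level phi
    n_ww = n_w_phi - row.get(phi, 0.0)  # N_ww: weight of wrong predictions in that row
    return n_ww / n_w_phi

m = confusion(["yes", "yes", "no"], ["yes", "no", "no"], [1.0, 2.0, 1.0])
print(error_rate(m, "yes"))  # 2.0 / 3.0
```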

Measures of Model Fit for Regression Trees

The HPSPLIT procedure measures model fit for regression trees based on RSS and ASE.

Residual Sum of Squares for Regression Trees

The residual sum of squares (RSS) for regression trees is defined as

$$\mathrm{RSS} = \sum_{\lambda} \sum_{i \in \lambda} w_i \left( y_i - \hat{y}_{\lambda} \right)^2$$

where

  • $i$ is an observation on leaf $\lambda$

  • $y_i$ is the actual value of the response variable for observation $i$

  • $\hat{y}_{\lambda}$ is the predicted value of the response variable on leaf $\lambda$

Average Square Error for Regression Trees

The average square error (ASE) for regression trees is defined as

$$\mathrm{ASE} = \frac{\mathrm{RSS}}{N_{w0}}$$

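
The regression-tree RSS and ASE can be sketched together; the data layout (one predicted value per leaf, with weighted observations) and function names are assumptions, not the HPSPLIT implementation:

```python
# Illustrative sketch of RSS and ASE for a regression tree (not PROC HPSPLIT
# itself). Each leaf is (predicted value y_hat_lambda, list of (y_i, w_i)).
def rss_regression(leaves):
    return sum(w * (y - y_hat) ** 2
               for y_hat, obs in leaves
               for y, w in obs)

def ase_regression(leaves):
    n_w0 = sum(w for _, obs in leaves for _, w in obs)  # N_w0
    return rss_regression(leaves) / n_w0

leaves = [(2.0, [(1.0, 1.0), (3.0, 1.0)]),  # residuals -1 and +1
          (5.0, [(5.0, 2.0)])]              # perfect prediction, weight 2
print(rss_regression(leaves))  # 2.0
print(ase_regression(leaves))  # 2.0 / 4.0 = 0.5
```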
Last updated: December 09, 2022