The HPSPLIT Procedure

Variable Importance

A training data set can contain a large number of predictors, only some of which are useful for predicting the response variable. Variable importance indicates which predictors are most useful, and you can use the HPSPLIT procedure to select predictors on that basis. (See Example 68.5: Assessing Variable Importance.) Various measures of variable importance have been proposed in the data mining literature.

The most important variables might not be the ones near the top of the tree. PROC HPSPLIT measures variable importance based on the following metrics: count, surrogate count, RSS, and relative importance. The count-based variable importance simply counts the number of times in the tree that a particular variable is used in a split. Similarly, the surrogate count tallies the number of times that a variable is used in a surrogate splitting rule.
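As a rough illustration of the two count-based metrics, the following Python sketch tallies primary and surrogate split variables over a small hand-built tree. The `Node` class, its field names, and the example variables are hypothetical and are not part of PROC HPSPLIT; they only stand in for whatever tree representation is available.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical tree node: split_var is None for a leaf.
    split_var: Optional[str] = None
    surrogate_vars: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def count_importance(root):
    """Return (count, surrogate_count) Counters over all nodes in the tree."""
    count, surrogate_count = Counter(), Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.split_var is not None:
            count[node.split_var] += 1
        for var in node.surrogate_vars:
            surrogate_count[var] += 1
        stack.extend(node.children)
    return count, surrogate_count


# Illustrative tree: the root splits on "age", one child splits on "income".
tree = Node("age", ["income"], [
    Node("income", ["age"], [Node(), Node()]),
    Node(),
])
count, surrogate_count = count_importance(tree)
print(dict(count), dict(surrogate_count))
```

Note that a variable used only once, at the root, gets the same count as a variable used once near the leaves, which is one reason the count alone can be a coarse measure.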

The RSS-based metric measures variable importance by the change in the residual sum of squares (RSS) when a node is split. The change is

$$\Delta_d = \mathrm{RSS}_d - \sum_i \mathrm{RSS}_i^d$$

where

  • d denotes the node

  • i denotes the index of a child that this node has

  • $\mathrm{RSS}_d$ is the RSS if the node is treated as a leaf

  • $\mathrm{RSS}_i^d$ is the RSS of child i of node d after the node has been split

If the change in RSS is negative (which is possible when you use the validation set), then the change is set to 0.
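The following Python sketch works through the change in RSS at a single node, including the rule that a negative change is set to 0. It assumes an interval response and treats RSS as the sum of squared differences between the observed values and the node's predicted value; the function names and the data are illustrative only.

```python
def rss(values, prediction):
    """Residual sum of squares of `values` around a fixed `prediction`."""
    return sum((y - prediction) ** 2 for y in values)


def rss_change(parent, children):
    """Change in RSS at a node, floored at 0.

    `parent` and each entry of `children` are (values, prediction) pairs.
    """
    parent_rss = rss(*parent)
    child_rss = sum(rss(values, pred) for values, pred in children)
    return max(0.0, parent_rss - child_rss)


# Training-style case: predictions are the means of the same observations.
gain = rss_change(
    ([1.0, 2.0, 9.0, 10.0], 5.5),
    [([1.0, 2.0], 1.5), ([9.0, 10.0], 9.5)],
)

# Validation-style case: predictions come from other data, so the split
# can make the fit worse; the negative change is set to 0.
clamped = rss_change(
    ([4.0, 6.0], 5.0),
    [([4.0], 8.0), ([6.0], 2.0)],
)
print(gain, clamped)
```

In the first call the parent RSS is 65 and the children's RSS sums to 1, so the change is 64. In the second call the children fit worse than the parent (32 versus 2), so the raw change of -30 is replaced by 0.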

If surrogate rules are in effect, they are also credited with a portion of the change in RSS. The credit is proportional to the agreement between the primary and surrogate splitting rules at the node. The agreement at node d, $\kappa_d$, is defined as

$$\kappa_d = \sum_i \frac{N_i}{N_d}$$

where

  • $N_d$ is the number of nonmissing observations

  • $N_i$ is the number of observations that were assigned to child i by both the primary and surrogate rules
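The agreement can be sketched in Python as follows. This is a simplified illustration, not the procedure's implementation: it assumes that both rules have already been applied to the same set of nonmissing observations at the node and that each rule's result is available as a list of child indices. The function name and the data are hypothetical.

```python
def agreement(primary, surrogate):
    """Fraction of observations sent to the same child by both rules.

    `primary` and `surrogate` hold the child index that each rule assigns
    to each observation; both lists cover the same observations.
    """
    if len(primary) != len(surrogate):
        raise ValueError("assignment lists must have the same length")
    n_d = len(primary)
    if n_d == 0:
        return 0.0
    # Summing the per-child agreement counts over all children equals
    # counting every observation on which the two rules agree.
    n_agree = sum(1 for p, s in zip(primary, surrogate) if p == s)
    return n_agree / n_d


kappa = agreement([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(kappa)
```

Here the two rules agree on three of the five observations, so the agreement is 3/5.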

The change in RSS from the surrogate rules is defined as

$$\Delta_d = \kappa_d \left( \mathrm{RSS}_d - \sum_i \mathrm{RSS}_i^d \right)$$
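As a small worked example of the surrogate credit, the Python sketch below scales a node's change in RSS by the agreement. The function name and the numbers are illustrative assumptions, not output from PROC HPSPLIT.

```python
def surrogate_rss_change(kappa, parent_rss, child_rss):
    """Change in RSS credited to a surrogate rule at one node.

    `kappa` is the agreement at the node, `parent_rss` is the RSS of the
    node treated as a leaf, and `child_rss` lists the RSS of each child.
    """
    return kappa * (parent_rss - sum(child_rss))


# Agreement of 0.5 at a node whose split reduces RSS from 65 to 1.
credit = surrogate_rss_change(0.5, 65.0, [0.5, 0.5])
print(credit)
```

With an agreement of 0.5, the surrogate variable is credited with half of the node's change of 64.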

The RSS-based importance is then defined as

$$\sqrt{\sum_{d=1}^{D} \Delta_d}$$

where D is the total number of nodes.
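The following Python sketch shows one way to accumulate the RSS-based importance per variable. It assumes that a variable is credited only at nodes where it supplies the primary or a surrogate splitting rule, and that those per-node changes are already available as (variable, change) pairs; the function name and the data are hypothetical.

```python
import math
from collections import defaultdict


def rss_importance(node_changes):
    """Square root of the summed RSS changes credited to each variable.

    `node_changes` is an iterable of (variable, change) pairs, one pair per
    node at which the variable is credited with a change in RSS.
    """
    totals = defaultdict(float)
    for variable, change in node_changes:
        totals[variable] += max(0.0, change)  # negative changes count as 0
    return {variable: math.sqrt(total) for variable, total in totals.items()}


importance = rss_importance([
    ("age", 64.0),
    ("income", 9.0),
    ("age", 36.0),
    ("income", -2.0),
])
print(importance)
```

In this example "age" accumulates 64 + 36 = 100 for an importance of 10, and "income" accumulates 9 + 0 = 9 for an importance of 3.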

The relative importance metric is a number between 0 and 1. It is calculated in two steps. First, PROC HPSPLIT finds the maximum RSS-based variable importance. Then, for each variable, it calculates the relative variable importance as the RSS-based importance of this variable divided by the maximum RSS-based importance among all the variables.
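The two steps can be sketched in Python as follows. The function name and the importance values are illustrative, and the handling of the case where every importance is 0 is an assumption, because the text does not describe that case.

```python
def relative_importance(importance):
    """Scale each importance by the largest importance among all variables."""
    largest = max(importance.values(), default=0.0)
    if largest == 0.0:
        # Assumption: with no positive importance, report 0 for every variable.
        return {variable: 0.0 for variable in importance}
    return {variable: value / largest for variable, value in importance.items()}


relative = relative_importance({"age": 10.0, "income": 3.0, "region": 0.0})
print(relative)
```

The variable with the largest RSS-based importance always has a relative importance of 1, which makes the values comparable across trees with different RSS scales.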

The RSS and relative importance are calculated from the training set. They are calculated again from the validation set if one exists.

Last updated: December 09, 2022