The HPSPLIT Procedure

Splitting Criteria

The goal of recursive partitioning, as described in the section Building a Decision Tree, is to subdivide the predictor space in such a way that the response values for the observations in the terminal nodes are as similar as possible. The HPSPLIT procedure provides two types of criteria for splitting a parent node $\tau$: criteria that maximize a decrease in node impurity, as defined by an impurity function, and criteria that are defined by a statistical test. You select the criterion by specifying an option in the GROW statement.

Criteria Based on Impurity

The entropy, Gini index, and RSS criteria decrease impurity. The impurity of a parent node $\tau$ is defined as $i(\tau)$, a nonnegative number that is equal to zero for a pure node, that is, a node for which all the observations have the same value of the response variable. Parent nodes for which the observations have very different values of the response variable have a large impurity.

The HPSPLIT procedure selects the best splitting variable and the best cutoff value to produce the highest reduction in impurity,

$$\Delta i(s, \tau) = i(\tau) - \sum_{b=1}^{B} p(\tau_b \mid \tau)\, i(\tau_b)$$

where $\tau_b$ denotes the $b$th child node, $p(\tau_b \mid \tau)$ is the sum of the weights of observations in $\tau$ that are assigned to $\tau_b$ divided by the sum of the weights of observations in $\tau$, and $B$ is the number of branches after splitting $\tau$.

If you specify a variable in the WEIGHT statement, then the weight of an observation is the value of the weight variable for that observation. If you omit the WEIGHT statement, then the weight of each observation is equal to 1. In this case, the sum of weights of observations is equal to the number of observations.
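The impurity reduction and the role of the weights can be sketched in Python. This is an illustration only, not code from the HPSPLIT procedure: the function names `node_weight` and `impurity_reduction` are hypothetical, and the sketch assumes that every observation in the parent node is assigned to exactly one child node, so the parent's weight sum equals the total of the child weight sums.

```python
from typing import Optional, Sequence


def node_weight(n_obs: int, weights: Optional[Sequence[float]] = None) -> float:
    """Sum of observation weights in a node.

    With no weights supplied, each observation counts as 1, so the sum
    equals the number of observations.
    """
    if weights is None:
        return float(n_obs)
    if len(weights) != n_obs:
        raise ValueError("expected one weight per observation")
    return float(sum(weights))


def impurity_reduction(
    parent_impurity: float,
    child_impurities: Sequence[float],
    child_weight_sums: Sequence[float],
) -> float:
    """Parent impurity minus the weight-proportional average of child impurities."""
    if len(child_impurities) != len(child_weight_sums):
        raise ValueError("one impurity and one weight sum per child node")
    # Assumption: the children partition the parent, so their weight sums add up
    # to the parent's weight sum.
    parent_weight = sum(child_weight_sums)
    if parent_weight <= 0:
        raise ValueError("parent node must have positive total weight")
    weighted_children = sum(
        (w / parent_weight) * i for w, i in zip(child_weight_sums, child_impurities)
    )
    return parent_impurity - weighted_children


# Hypothetical two-branch split: one child has explicit weights, the other has none.
left = node_weight(3, [2.0, 1.0, 1.0])   # weight sum 4.0
right = node_weight(4)                   # four observations of weight 1
delta = impurity_reduction(0.5, [0.0, 0.375], [left, right])
```

Here each child carries half of the parent's weight, so the reduction is the parent impurity minus the plain average of the two child impurities.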

The impurity reduction criteria available for classification trees are based on different impurity functions $i(\tau)$ as follows:

Entropy criterion (default)

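The entropy impurity of node $\tau$ is defined as

$$i(\tau) = -\sum_{j=1}^{J} p_j \log_2 p_j$$

where $J$ is the number of response levels and $p_j$ is the sum of the weights of observations in $\tau$ that have the $j$th response value divided by the sum of weights of all observations in $\tau$.

Gini index criterion

The Gini index criterion defines $i(\tau)$ as the Gini index that corresponds to the ASE of a class response and is given by

$$i(\tau) = 1 - \sum_{j=1}^{J} p_j^2$$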

For more information, see Hastie, Tibshirani, and Friedman (2009).
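The two classification impurity functions can be sketched in Python as follows. This is a hedged illustration rather than the procedure's implementation: the helper names are hypothetical, entropy is assumed to use a base-2 logarithm, and the Gini index is assumed to be one minus the sum of squared weighted class proportions.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence


def class_proportions(
    responses: Sequence[Hashable], weights: Optional[Sequence[float]] = None
) -> List[float]:
    """Weighted share of each response level among the observations in a node."""
    if len(responses) == 0:
        raise ValueError("node has no observations")
    if weights is None:
        weights = [1.0] * len(responses)  # no weights given: each observation counts as 1
    if len(weights) != len(responses):
        raise ValueError("expected one weight per observation")
    totals = defaultdict(float)
    for y, w in zip(responses, weights):
        totals[y] += w
    total = sum(totals.values())
    return [t / total for t in totals.values()]


def entropy(responses, weights=None) -> float:
    """Entropy impurity (assumed base 2); zero for a pure node."""
    return sum(-p * math.log2(p) for p in class_proportions(responses, weights) if p > 0)


def gini(responses, weights=None) -> float:
    """Gini index impurity; zero for a pure node."""
    return 1.0 - sum(p * p for p in class_proportions(responses, weights))


balanced = ["a", "a", "b", "b"]
entropy_balanced = entropy(balanced)          # two equally likely levels
gini_balanced = gini(balanced)
gini_weighted = gini(["a", "b"], [3.0, 1.0])  # proportions 0.75 and 0.25
```

Both functions return zero for a pure node and grow as the weighted class proportions become more even, which is the behavior the impurity definition requires.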

The impurity reduction criterion available for regression trees is as follows:

RSS criterion (default)

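The RSS criterion, also referred to as the ANOVA criterion, defines $i(\tau)$ as the residual sum of squares,

$$i(\tau) = \frac{1}{N_w(\tau)} \sum_{i=1}^{N(\tau)} w_i \left( y_i - \bar{y}_w \right)^2$$

where $N(\tau)$ is the number of observations in $\tau$, $N_w(\tau)$ is the sum of weights of all observations in $\tau$, $y_i$ is the response value of observation $i$, $\bar{y}_w$ is the weighted average response of the observations in $\tau$, and $w_i$ is the weight of observation $i$.

The weighted average response is defined as

$$\bar{y}_w = \frac{1}{N_w(\tau)} \sum_{i=1}^{N(\tau)} w_i y_i$$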
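A minimal Python sketch of the RSS impurity for a single node follows. It is an illustration under stated assumptions, not the procedure's code: the function names are hypothetical, and the weighted sum of squared deviations is assumed to be divided by the sum of the weights in the node.

```python
from typing import Optional, Sequence


def weighted_mean(y: Sequence[float], w: Optional[Sequence[float]] = None) -> float:
    """Weighted average response of the observations in a node."""
    if len(y) == 0:
        raise ValueError("node has no observations")
    if w is None:
        w = [1.0] * len(y)  # no weights given: each observation counts as 1
    if len(w) != len(y):
        raise ValueError("expected one weight per observation")
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def rss_impurity(y: Sequence[float], w: Optional[Sequence[float]] = None) -> float:
    """Weighted squared deviations from the weighted mean, divided by the weight sum.

    Assumption: the normalization uses the sum of weights in the node.
    """
    mean = weighted_mean(y, w)
    if w is None:
        w = [1.0] * len(y)
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y)) / sum(w)


unweighted = rss_impurity([1.0, 3.0])            # mean 2, deviations of 1
weighted = rss_impurity([0.0, 4.0], [3.0, 1.0])  # weighted mean 1
pure = rss_impurity([5.0, 5.0, 5.0])             # identical responses
```

A node whose observations all share one response value has zero impurity, and heavier observations pull the weighted mean toward their own response value.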

Criteria Based on Statistical Test

The chi-square, F-test, CHAID, and FastCHAID criteria are defined by statistical tests. These criteria calculate the worth of a split by testing for a significant difference in the response variable across the branches defined by a split. The worth is defined as $-\log(p)$, where $p$ is the p-value of the test. You can adjust the p-values for these criteria by specifying the BONFERRONI option in the GROW statement.
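The relationship between a p-value and the worth of a split can be sketched in Python. The function name `split_worth` is hypothetical, and because the base of the logarithm is not stated in this section, the sketch assumes the natural logarithm; a different base only rescales the worth and does not change which split ranks highest.

```python
import math


def split_worth(p_value: float) -> float:
    """Worth of a split as the negative logarithm of the test's p-value.

    Assumption: natural logarithm.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    return -math.log(p_value)


# A smaller p-value (stronger evidence of a difference across branches)
# yields a larger worth.
strong = split_worth(0.001)
weak = split_worth(0.05)
```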

The criteria based on statistical tests compute the worth of a split as follows:

Chi-square criterion

For categorical response variables, the worth is based on the p-value for the Pearson chi-square test that compares the frequencies of the levels of the response across the child nodes.

F-test criterion

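For continuous response variables, the worth is based on the F test for the null hypothesis that the means of the response values are identical across the child nodes. The test statistic is

$$F = \frac{SS_{\mathrm{between}} / (B - 1)}{SS_{\mathrm{within}} / (N_w(\tau) - B)}$$

where

$$SS_{\mathrm{between}} = \sum_{b=1}^{B} N_w(\tau_b) \left( \bar{y}_w(\tau_b) - \bar{y}_w(\tau) \right)^2$$

$$SS_{\mathrm{within}} = \sum_{b=1}^{B} \sum_{i=1}^{N(\tau_b)} w_{bi} \left( y_{bi} - \bar{y}_w(\tau_b) \right)^2$$

If you specify the UNWEIGHTDF option in the WEIGHT statement, then the degrees of freedom for the F test are not weighted. The $SS_{\mathrm{between}}$ and $SS_{\mathrm{within}}$ formulas are the same whether the degrees of freedom are weighted or unweighted. The F statistic with unweighted degrees of freedom is

$$F = \frac{SS_{\mathrm{between}} / (B - 1)}{SS_{\mathrm{within}} / (N(\tau) - B)}$$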
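The effect of weighted versus unweighted degrees of freedom can be sketched in Python. This is an assumption-laden illustration, not the procedure's implementation: it uses the standard one-way ANOVA form of the F statistic with weighted between-branch and within-branch sums of squares, takes the within-branch degrees of freedom as the sum of weights minus the number of branches (or the number of observations minus the number of branches when unweighted), and the name `f_statistic` and its `unweighted_df` flag are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

# One child node: (response values, optional observation weights)
Branch = Tuple[Sequence[float], Optional[Sequence[float]]]


def f_statistic(branches: Sequence[Branch], unweighted_df: bool = False) -> float:
    """One-way ANOVA F statistic across child nodes (assumed standard form)."""
    n_branches = len(branches)
    if n_branches < 2:
        raise ValueError("need at least two child nodes")

    stats = []  # per branch: (weight sum, weighted mean, within sum of squares)
    n_obs = 0
    for y, w in branches:
        if len(y) == 0:
            raise ValueError("each child node needs at least one observation")
        w = [1.0] * len(y) if w is None else list(w)
        if len(w) != len(y):
            raise ValueError("expected one weight per observation")
        w_sum = sum(w)
        mean = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
        within = sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))
        stats.append((w_sum, mean, within))
        n_obs += len(y)

    total_w = sum(s[0] for s in stats)
    grand_mean = sum(s[0] * s[1] for s in stats) / total_w
    ss_between = sum(ws * (m - grand_mean) ** 2 for ws, m, _ in stats)
    ss_within = sum(s[2] for s in stats)

    # Only the within-branch degrees of freedom depend on the weighting choice.
    df_within = (n_obs if unweighted_df else total_w) - n_branches
    if df_within <= 0:
        raise ValueError("not enough observations for the F test")
    if ss_within == 0:
        return math.inf  # branches are internally constant
    return (ss_between / (n_branches - 1)) / (ss_within / df_within)


y_left, y_right = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
double = [2.0, 2.0, 2.0]  # every observation has weight 2
f_weighted_df = f_statistic([(y_left, double), (y_right, double)])
f_unweighted_df = f_statistic([(y_left, double), (y_right, double)], unweighted_df=True)
f_no_weights = f_statistic([(y_left, None), (y_right, None)])
```

In this made-up example, doubling every weight inflates the statistic when the degrees of freedom are weighted, whereas the unweighted degrees of freedom give the same value as the analysis without weights.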

The following criterion is available for both categorical and continuous response variables:

CHAID criterion

For categorical and continuous response variables, CHAID is an approach first described by Kass (1980) that regards every possible split as representing a test. CHAID tests the hypothesis of no association between the values of the response (target) and the branches of a node. The Bonferroni adjusted probability is defined as $m\alpha$, where $\alpha$ is the significance level of a test and $m$ is the number of independent tests.

For categorical response variables, the HPSPLIT procedure also provides the FastCHAID criterion, which is a special case of CHAID. FastCHAID is faster to compute because it sorts the possible splits according to the response variable.
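The Bonferroni adjustment described for CHAID can be sketched in Python as the number of independent tests multiplied by the significance level. The function name is hypothetical, and capping the result at 1 so that it remains a valid probability is an assumption of this sketch rather than something stated in this section.

```python
def bonferroni_adjusted(alpha: float, n_tests: int) -> float:
    """Bonferroni adjusted probability: number of tests times the significance level.

    Assumption: the product is capped at 1.0.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the interval [0, 1]")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return min(1.0, n_tests * alpha)


# Five candidate splits, each treated as an independent test at level 0.01.
adjusted = bonferroni_adjusted(0.01, 5)
capped = bonferroni_adjusted(0.5, 4)
```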

Last updated: December 09, 2022