The HPSPLIT Procedure

Splitting Criteria

The goal of recursive partitioning, as described in the section Building a Decision Tree, is to subdivide the predictor space in such a way that the response values for the observations in the terminal nodes are as similar as possible. The HPSPLIT procedure provides two types of criteria for splitting a parent node $\tau$: criteria that maximize a decrease in node impurity, as defined by an impurity function, and criteria that are defined by a statistical test. You select the criterion by specifying an option in the GROW statement.

Criteria Based on Impurity

The entropy, Gini index, and RSS criteria choose splits that decrease impurity. The impurity of a parent node $\tau$ is defined as $i(\tau)$, a nonnegative number that is equal to zero for a pure node (a node for which all the observations have the same value of the response variable). Parent nodes for which the observations have very different values of the response variable have a large impurity.

The HPSPLIT procedure selects the best splitting variable and the best cutoff value to produce the highest reduction in impurity,

$$\Delta i(s, \tau) = i(\tau) - \sum_{b=1}^{B} p(\tau_b \mid \tau)\, i(\tau_b)$$

where $\tau_b$ denotes the $b$th child node, $p(\tau_b \mid \tau)$ is the sum of the weights of observations in $\tau$ that are assigned to $\tau_b$ divided by the sum of the weights of observations in $\tau$, and $B$ is the number of branches after splitting $\tau$.

If you specify a variable in the WEIGHT statement, then the weight of an observation is the value of the weight variable for that observation. If you omit the WEIGHT statement, then the weight of each observation is equal to 1. In this case, the sum of weights of observations is equal to the number of observations.
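The impurity reduction formula can be illustrated with a minimal Python sketch. This is not how the HPSPLIT procedure is implemented; the function name `impurity_reduction` and its argument layout are assumptions made for illustration. The sketch takes the parent impurity, the sum of weights in the parent, and one (weight sum, impurity) pair per child node.

```python
def impurity_reduction(parent_impurity, parent_weight, children):
    """Return i(tau) - sum_b p(tau_b | tau) * i(tau_b).

    children is a list of (child_weight_sum, child_impurity) pairs,
    one pair per branch. p(tau_b | tau) is the child's weight sum
    divided by the parent's weight sum.
    """
    if parent_weight <= 0:
        raise ValueError("parent_weight must be positive")
    weighted_child_impurity = sum(
        (w_b / parent_weight) * i_b for w_b, i_b in children
    )
    return parent_impurity - weighted_child_impurity


# Hypothetical numbers: a parent with weight sum 10 and impurity 0.5
# is split into two pure children (impurity 0) with weight sums 4 and 6.
reduction = impurity_reduction(0.5, 10.0, [(4.0, 0.0), (6.0, 0.0)])
```

When the WEIGHT statement is omitted, every weight is 1, so the weight sums in this sketch are simply observation counts.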

The impurity reduction criteria available for classification trees are based on different impurity functions $i(\tau)$ as follows:

  • Entropy criterion (default)

    The entropy impurity of node $\tau$ is defined as

    $$i(\tau) = -\sum_{j=1}^{J} p_j \log_2 p_j$$

    where $p_j$ is the sum of the weights of observations that have the $j$th response value divided by the sum of weights of all observations.

  • Gini index criterion

    The Gini index criterion defines $i(\tau)$ as the Gini index that corresponds to the ASE of a class response and is given by

    $$i(\tau) = 1 - \sum_{j=1}^{J} p_j^2$$

    For more information, see Hastie, Tibshirani, and Friedman (2009).
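As an illustration of the two classification impurity functions, the following Python sketch computes the weighted class proportions and then the entropy and Gini index of a single node. The function names are hypothetical, and the sketch assumes that the weights have a positive sum; it is not the procedure's implementation.

```python
import math
from collections import defaultdict


def class_proportions(labels, weights=None):
    """Return p_j for each response level: weight of level j / total weight."""
    if weights is None:
        weights = [1.0] * len(labels)  # no weight variable: every weight is 1
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    total_weight = sum(totals.values())
    return [w / total_weight for w in totals.values()]


def entropy_impurity(labels, weights=None):
    """Entropy impurity: -sum_j p_j * log2(p_j), skipping levels with p_j = 0."""
    props = class_proportions(labels, weights)
    return -sum(p * math.log2(p) for p in props if p > 0)


def gini_impurity(labels, weights=None):
    """Gini index: 1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels, weights))
```

For a node that is evenly divided between two response levels, the entropy is 1 and the Gini index is 0.5; both are 0 for a pure node.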

The impurity reduction criterion available for regression trees is as follows:

  • RSS criterion (default)

    The RSS criterion, also referred to as the ANOVA criterion, defines $i(\tau)$ as the residual sum of squares,

    $$i(\tau) = \frac{1}{N_w(\tau)} \sum_{i=1}^{N(\tau)} \left( Y_i - \bar{Y}_w \right)^2$$

    where $N(\tau)$ is the number of observations in $\tau$, $N_w(\tau)$ is the sum of weights of all observations in $\tau$, $Y_i$ is the response value of observation $i$, $\bar{Y}_w$ is the weighted average response of the observations in $\tau$, and $w_i$ is the weight of observation $i$.

    The weighted average response is defined as

    $$\bar{Y}_w = \frac{\sum_{i=1}^{N(\tau)} w_i Y_i}{\sum_{i=1}^{N(\tau)} w_i}$$
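The RSS impurity and the weighted average response can be sketched in Python as follows. The function names are hypothetical, and the sketch follows the formulas exactly as they are printed above: the weights enter through the weighted average and through the divisor $N_w(\tau)$, and the squared deviations themselves are not multiplied by $w_i$. Whether the procedure also weights each squared deviation is not stated here, so treat that detail as an assumption.

```python
def weighted_mean(values, weights):
    """Weighted average response: sum_i w_i * Y_i / sum_i w_i."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def rss_impurity(values, weights=None):
    """RSS impurity as printed above: sum_i (Y_i - Ybar_w)^2 / N_w(tau)."""
    if weights is None:
        weights = [1.0] * len(values)  # no weight variable: every weight is 1
    y_bar_w = weighted_mean(values, weights)
    n_w = sum(weights)  # N_w(tau), the sum of weights in the node
    # Assumption: deviations are not multiplied by w_i, matching the printed formula.
    return sum((y - y_bar_w) ** 2 for y in values) / n_w
```

With unit weights, this quantity is the mean squared deviation of the responses from their average, and it is 0 when all responses in the node are equal.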

Criteria Based on Statistical Test

The chi-square, F-test, CHAID, and FastCHAID criteria are defined by statistical tests. These criteria calculate the worth of a split by testing for a significant difference in the response variable across the branches defined by a split. The worth is defined as $-\log(p)$, where $p$ is the p-value of the test. You can adjust the p-values for these criteria by specifying the BONFERRONI option in the GROW statement.

The criteria based on statistical tests compute the worth of a split as follows:

  • Chi-square criterion

    For categorical response variables, the worth is based on the p-value for the Pearson chi-square test that compares the frequencies of the levels of the response across the child nodes.

  • F-test criterion

    For continuous response variables, the worth is based on the F test for the null hypothesis that the means of the response values are identical across the child nodes. The test statistic is

    $$F = \frac{SS_{\text{between}} / (B - 1)}{SS_{\text{within}} / (N_w(\tau) - B)}$$

    where

    $$SS_{\text{between}} = \sum_{b=1}^{B} N_w(\tau_b) \left( \bar{Y}_w(\tau_b) - \bar{Y}_w(\tau) \right)^2$$
    $$SS_{\text{within}} = \sum_{b=1}^{B} \sum_{i=1}^{N(\tau_b)} w_i \left( Y_{bi} - \bar{Y}_w(\tau_b) \right)^2$$

    If you specify the UNWEIGHTDF option in the WEIGHT statement, then the degrees of freedom for the F test are not weighted. The $SS_{\text{between}}$ and $SS_{\text{within}}$ formulas are the same when the degrees of freedom are either weighted or unweighted. The F statistic with unweighted degrees of freedom is

    $$F = \frac{SS_{\text{between}} / (B - 1)}{SS_{\text{within}} / (N(\tau) - B)}$$
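The F statistic above, with either weighted or unweighted denominator degrees of freedom, can be sketched in Python as follows. The function name `f_statistic` and the `unweighted_df` flag are hypothetical stand-ins for illustration (the flag mimics the effect of the UNWEIGHTDF option). The sketch assumes that the child nodes together contain every observation of the parent node, and it stops at the statistic; obtaining the p-value requires an F distribution function from a statistics library.

```python
def f_statistic(branches, unweighted_df=False):
    """Compute F for a candidate split.

    branches is a list with one entry per child node; each entry is a
    list of (response, weight) pairs for the observations in that child.
    """
    num_branches = len(branches)  # B
    all_obs = [obs for branch in branches for obs in branch]
    n_w_parent = sum(w for _, w in all_obs)  # N_w(tau)
    y_bar_parent = sum(w * y for y, w in all_obs) / n_w_parent

    ss_between = 0.0
    ss_within = 0.0
    for branch in branches:
        n_w_child = sum(w for _, w in branch)  # N_w(tau_b)
        y_bar_child = sum(w * y for y, w in branch) / n_w_child
        ss_between += n_w_child * (y_bar_child - y_bar_parent) ** 2
        ss_within += sum(w * (y - y_bar_child) ** 2 for y, w in branch)

    # Denominator degrees of freedom: N(tau) - B if unweighted, else N_w(tau) - B.
    n_for_df = len(all_obs) if unweighted_df else n_w_parent
    return (ss_between / (num_branches - 1)) / (ss_within / (n_for_df - num_branches))


# Hypothetical data: two branches, three observations each, every weight equal to 2.
split = [[(1.0, 2.0), (2.0, 2.0), (3.0, 2.0)], [(4.0, 2.0), (5.0, 2.0), (6.0, 2.0)]]
f_weighted = f_statistic(split)
f_unweighted = f_statistic(split, unweighted_df=True)
```

In this example the sums of squares are identical in both calls; only the denominator degrees of freedom change (10 when weighted, 4 when unweighted), which is why the two F values differ.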

The following criterion is available for both categorical and continuous response variables:

  • CHAID criterion

    For categorical and continuous response variables, CHAID is an approach first described by Kass (1980) that regards every possible split as representing a test. CHAID tests the hypothesis of no association between the values of the response (target) and the branches of a node. The Bonferroni adjusted probability is defined as $m\alpha$, where $\alpha$ is the significance level of a test and $m$ is the number of independent tests.

    For categorical response variables, the HPSPLIT procedure also provides the FastCHAID criterion, which is a special case of CHAID. FastCHAID is faster to compute because it sorts the possible splits according to the response variable.
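The worth calculation and a Bonferroni-style adjustment described in this section can be sketched in Python as follows. Several details are assumptions made for illustration and are not confirmed by the text: the function names, the use of the natural logarithm as the default base for the worth, applying the adjustment by multiplying a p-value by the number of tests, and capping the adjusted value at 1.

```python
import math


def bonferroni_adjust(p_value, num_tests):
    """Multiply a p-value by the number of tests m, capped at 1 (assumed cap)."""
    return min(1.0, num_tests * p_value)


def worth(p_value, base=math.e):
    """Worth of a split: -log(p). The logarithm base is an assumption."""
    return -math.log(p_value, base)


# Hypothetical split: raw p-value 0.001 from 5 candidate tests.
adjusted_p = bonferroni_adjust(0.001, 5)
split_worth = worth(adjusted_p)
```

A smaller p-value yields a larger worth, so the adjustment lowers the worth of splits that were selected from many candidate tests.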

Last updated: December 09, 2022