The ADAPTIVEREG Procedure

Fitting Algorithms

The multivariate adaptive regression splines algorithm (Friedman 1991b) is a predictive modeling algorithm that combines nonparametric variable transformations with a recursive partitioning scheme.

The algorithm originates with Smith (1982), who proposes a nonparametric method that applies a model selection method (stepwise regression) to a large number of truncated power spline functions that are evaluated at different knot values. This method constructs spline functions and selects relevant knot values automatically through model selection. However, the method is applicable only to low-dimensional problems. With multiple variables, the number of tensor products between spline basis functions is too large for even a single model to be fit. The multivariate adaptive regression splines algorithm avoids this situation by using forward selection to build the model gradually instead of using the full set of tensor products of spline basis functions.

Like the recursive partitioning algorithm, which has "growing" and "pruning" steps, the multivariate adaptive regression splines algorithm consists of two stages: forward selection and backward selection. During the forward selection process, candidate bases are created from interactions between existing parent bases and nonparametric transformations of continuous or classification variables. After the model grows to a certain size, the backward selection process begins by deleting selected bases. The deletion continues until the null model is reached, and an overall best model is then chosen based on a goodness-of-fit criterion. The next three subsections give details about the selection processes and the methods of nonparametric transformation of variables. The fourth subsection describes how the multivariate adaptive regression splines algorithm is applied to fit generalized linear models. The fifth subsection describes the fast algorithm (Friedman 1993) for speeding up the fitting process.

Forward Selection

The forward selection process in the multivariate adaptive regression splines algorithm is as follows:

  1. Initialize by setting $\mathbf{B}_0 = \mathbf{1}$ and $M = 1$.

  2. Repeat the following steps until the maximum number of bases $M_{\max}$ has been reached or the model cannot be improved by any combination of $\mathbf{B}_m$, $\mathbf{v}$, and $t$.

    1. Set the lack-of-fit criterion $\mathrm{LOF}^* = \infty$.

    2. For each selected basis $\mathbf{B}_m$, $m \in \{0, \ldots, M-1\}$, do the following for each variable $\mathbf{v}$ that is not already involved in $\mathbf{B}_m$, that is, $\mathbf{v} \notin \{\mathbf{v}(k, m) \mid 1 \le k \le K_m\}$:

      1. For each knot value (or subset of categories) $t$ of $\mathbf{v}$, $t \in \{\mathbf{v}\}$, form a model that contains all currently selected bases $\sum_{i=0}^{M-1} \mathbf{B}_i$ and two new bases, $\mathbf{B}_m \mathbf{T}_1(\mathbf{v}, t)$ and $\mathbf{B}_m \mathbf{T}_2(\mathbf{v}, t)$.

      2. Compute the lack-of-fit criterion LOF for the new model.

      3. If $\mathrm{LOF} < \mathrm{LOF}^*$, then update $\mathrm{LOF}^* = \mathrm{LOF}$, $m^* = m$, $\mathbf{v}^* = \mathbf{v}$, and $t^* = t$.

    3. Update the model by adding the two bases that improve it the most, $\mathbf{B}_{m^*} \mathbf{T}_1(\mathbf{v}^*, t^*)$ and $\mathbf{B}_{m^*} \mathbf{T}_2(\mathbf{v}^*, t^*)$.

    4. Set $M = M + 2$.

The essential part of each iteration is the search for a combination of $\mathbf{B}_m$, $\mathbf{v}$, and $t$ such that adding the two corresponding bases improves the model the most. The objective of the forward selection step is to build a model that overfits the data. The lack-of-fit criterion for linear models is usually the residual sum of squares (RSS).
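
To make the loop structure concrete, the following minimal Python sketch shows the forward pass for a linear model with continuous variables only, using RSS as the lack-of-fit criterion. The helper names (`hinge_pair`, `fit_rss`, `forward_selection`) and the handling of candidate knots are illustrative only; this is not the procedure's internal implementation.

```python
import numpy as np

def hinge_pair(v, t):
    """Truncated power spline pair T1 = (v - t)_+ and T2 = (t - v)_+ at knot t."""
    return np.maximum(v - t, 0.0), np.maximum(t - v, 0.0)

def fit_rss(design, y):
    """Residual sum of squares of an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def forward_selection(X, y, max_bases, knots):
    """Grow the basis set two columns at a time (MARS-style forward pass).

    knots[v] is the list of candidate knot values for variable v.
    """
    n, p = X.shape
    B = [np.ones(n)]                 # B_0 = 1, the intercept basis
    used = [set()]                   # variables already involved in each basis
    while len(B) < max_bases:
        best = None                  # (LOF*, m*, v*, t*)
        for m, Bm in enumerate(B):
            for v in range(p):
                if v in used[m]:
                    continue         # skip variables the parent basis already involves
                for t in knots[v]:
                    T1, T2 = hinge_pair(X[:, v], t)
                    design = np.column_stack(B + [Bm * T1, Bm * T2])
                    lof = fit_rss(design, y)
                    if best is None or lof < best[0]:
                        best = (lof, m, v, t)
        if best is None:
            break                    # no admissible combination remains
        _, m, v, t = best
        T1, T2 = hinge_pair(X[:, v], t)
        B += [B[m] * T1, B[m] * T2]  # add the two bases that improve the model most
        used += [used[m] | {v}, used[m] | {v}]
    return np.column_stack(B)
```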

Backward Selection

The backward selection process in the multivariate adaptive regression splines algorithm is as follows:

  1. Initialize by setting the overall lack-of-fit criterion $\mathrm{LOF}^* = \infty$.

  2. Repeat the following steps until the null model is reached. The final model is the best one that is found during the backward deletion process.

    1. For each selected basis $\mathbf{B}_m$, $m \in \{1, \ldots, M\}$:

      1. Compute the lack-of-fit criterion, LOF, for a model that excludes $\mathbf{B}_m$.

      2. If $\mathrm{LOF} < \mathrm{LOF}^*$, save the model as the best one and let $m^* = m$.

      3. Delete $\mathbf{B}_{m^*}$ from the current model.

    2. Set $M = M - 1$.

The objective of the backward selection is to "prune" the overfitted model back to the best model that has good predictive performance. Therefore, lack-of-fit criteria that characterize only the model's fidelity to the original data are not appropriate. Instead, the multivariate adaptive regression splines algorithm uses a quantity similar to the generalized cross validation criterion. See the section Goodness-of-Fit Criteria for more information.
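
The corresponding backward pass can be sketched as follows. The function assumes a `lof(design, y)` callable that returns the GCV-like lack-of-fit criterion described in the section Goodness-of-Fit Criteria (a sketch of that criterion appears in that section). Names are illustrative rather than the procedure's actual code.

```python
import numpy as np

def backward_selection(B, y, lof):
    """Prune the basis matrix B one column at a time and keep the best model seen."""
    keep = list(range(B.shape[1]))        # column 0 is the intercept basis
    best_score, best_keep = np.inf, list(keep)
    while len(keep) > 1:                  # continue until the null model is reached
        scores = []
        for m in keep[1:]:                # never delete the intercept
            trial = [j for j in keep if j != m]
            scores.append((lof(B[:, trial], y), m))
        score, m_star = min(scores)       # basis whose removal gives the best criterion
        keep.remove(m_star)
        if score < best_score:            # remember the overall best model
            best_score, best_keep = score, list(keep)
    return B[:, best_keep], best_score
```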

Variable Transformations

The type of transformation depends on the variable type:

  • For a continuous variable, the transformation is a linear truncated power spline,

    $$\mathbf{T}_1(\mathbf{v}, t) = (v - t)_+ = \begin{cases} v - t, & \text{if } v > t \\ 0, & \text{otherwise} \end{cases}$$
    $$\mathbf{T}_2(\mathbf{v}, t) = [-(v - t)]_+ = \begin{cases} 0, & \text{if } v > t \\ t - v, & \text{otherwise} \end{cases}$$

    where $t$ is a knot value for variable $\mathbf{v}$ and $v$ is an observed value of $\mathbf{v}$. Instead of examining every unique value of $\mathbf{v}$, the algorithm uses a series of knot values with a minimum span, under the assumption that the underlying function is smooth. Friedman (1991b) uses the following formulas to determine a reasonable number of observations between knots (the span size). For interior knots, the span size is determined by

    $$-\frac{2}{5} \log_2 \left[ -\frac{\log(1 - \alpha)}{p\, n_m} \right]$$

    For boundary knots, the span size is determined by

    $$3 - \log_2 \frac{\alpha}{p}$$

    where $\alpha$ is the parameter that controls the knot density, $p$ is the number of variables, and $n_m$ is the number of observations for which the parent basis $\mathbf{B}_m$ is positive ($\mathbf{B}_m > 0$).

  • For a classification variable, the transformation is an indicator function,

    $$\mathbf{T}_1(\mathbf{v}, t) = \begin{cases} 1, & \text{if } v \in \{c_1, \ldots, c_t\} \\ 0, & \text{otherwise} \end{cases}$$
    $$\mathbf{T}_2(\mathbf{v}, t) = \begin{cases} 0, & \text{if } v \in \{c_1, \ldots, c_t\} \\ 1, & \text{otherwise} \end{cases}$$

    where $\{c_1, \ldots, c_t\}$ is a subset of all categories of variable $\mathbf{v}$. Smoothing is applied to categorical variables by assuming that subsets of categories tend to have similar properties, analogous to the assumption for continuous variables that observations in a local neighborhood have similar predictions.

    If a categorical variable has $k$ distinct categories, then there are a total of $2^{k-1} - 1$ possible subsets to consider. The computational cost is equal to that of all-subsets selection in regression, which is expensive for large values of $k$. The multivariate adaptive regression splines algorithm instead uses the stepwise selection method to select the categories that form the subset $\{c_1, \ldots, c_t\}$. The method is still greedy, but it reduces the computation and still yields reasonable final models. (Both transformation types are illustrated in the sketch that follows this list.)
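
Below is a short illustrative Python sketch of the two transformation types and of Friedman's span-size formulas. The function names and the example values are hypothetical.

```python
import numpy as np

def continuous_pair(v, t):
    """Linear truncated power spline pair at knot t: (v - t)_+ and (t - v)_+."""
    return np.maximum(v - t, 0.0), np.maximum(t - v, 0.0)

def categorical_pair(v, subset):
    """Indicator pair for membership in a subset {c_1, ..., c_t} of categories."""
    member = np.isin(v, list(subset)).astype(float)
    return member, 1.0 - member

def interior_span(alpha, p, n_m):
    """Approximate number of observations between interior knots (Friedman 1991b)."""
    return -0.4 * np.log2(-np.log(1.0 - alpha) / (p * n_m))

def boundary_span(alpha, p):
    """Approximate number of observations between a boundary knot and the data edge."""
    return 3.0 - np.log2(alpha / p)

# Example with illustrative values: knot t = 2, and alpha = 0.05, p = 3, n_m = 100
v = np.array([0.5, 1.5, 2.5, 3.5])
print(continuous_pair(v, 2.0))            # T1 = [0, 0, 0.5, 1.5], T2 = [1.5, 0.5, 0, 0]
print(categorical_pair(np.array(["a", "b", "c"]), {"a", "c"}))
print(interior_span(0.05, 3, 100), boundary_span(0.05, 3))
```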

Goodness-of-Fit Criteria

Like other nonparametric regression procedures, the multivariate adaptive regression splines algorithm can yield complicated models that involve high-order interactions in which many knot values or subsets are considered. Besides the basis functions, both the forward selection and backward selection processes are also highly nonlinear. Because of the trade-off between bias and variance, the complicated models that contain many parameters tend to have low bias but high variance. To select models that achieve good prediction performance, Craven and Wahba (1979) propose the widely used generalized cross validation criterion (GCV),

$$\mathrm{GCV} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{f}_i}{1 - \mathrm{trace}(\mathbf{S})/n} \right)^2 = \frac{\mathrm{RSS}}{n \left( 1 - \mathrm{trace}(\mathbf{S})/n \right)^2}$$

where $y$ is the response, $\hat{f}$ is an estimate of the underlying smooth function, and $\mathbf{S}$ is the smoothing matrix such that $\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$. The effective degrees of freedom for the smoothing spline can be defined as $\mathrm{trace}(\mathbf{S})$. In the multivariate adaptive regression splines algorithm, Friedman (1991b) uses a similar quantity as the lack-of-fit criterion,

$$\mathrm{LOF} = \frac{\mathrm{RSS}}{n \left( 1 - \left( M + d(M-1)/2 \right)/n \right)^2}$$

where $d$ is the degrees-of-freedom cost of each nonlinear basis function and $M$ is the total number of linearly independent bases in the model. Because any candidate model that is evaluated at each step of the multivariate adaptive regression splines algorithm is a linear model, $M$ is actually the trace of the hat matrix. The only difference between the GCV criterion and the LOF criterion is the extra term $d(M-1)/2$. The corresponding effective degrees of freedom is defined as $M + d(M-1)/2$. The quantity $d$ accounts for the extra nonlinearity in forming new bases, and it operates as a smoothing parameter: larger values of $d$ result in smoother function estimates. Based on many practical experiments and some theoretical work (Owen 1991), Friedman suggests that the value of $d$ typically lies in the range $[2, 4]$. For data that have complicated structures, the value of $d$ can be much larger.
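
The following is a minimal sketch of this criterion for a given basis matrix, assuming an ordinary least squares fit; the function name and the default value $d = 3$ (a value inside Friedman's suggested range) are illustrative choices. This is also the `lof` callable assumed in the backward-selection sketch earlier.

```python
import numpy as np

def lof(design, y, d=3.0):
    """GCV-style lack of fit: RSS / (n * (1 - edf/n)^2) with edf = M + d*(M - 1)/2."""
    n, M = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    edf = M + d * (M - 1) / 2.0           # effective degrees of freedom
    return rss / (n * (1.0 - edf / n) ** 2)
```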

Alternatively, you can use cross validation as the goodness-of-fit criterion, or you can use a separate validation data set to select models and a separate test data set to evaluate the selected models.

Generalized Linear Models

Friedman (1991b) applies the multivariate adaptive regression splines algorithm to a logistic model by using the squared error loss between the response and inversely linked values in the goodness-of-fit criterion:

$$\sum_{i=1}^{n} \left( y_i - \frac{1}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})} \right)^2$$

When a final model is obtained, an ordinary logistic model is fitted on the selected bases. Some implementations of the multivariate adaptive regression splines algorithm ignore the distributional properties and derive model bases based on the least squares criterion. The reason for ignoring the distributional properties or using least squares approximations is that evaluating the lack-of-fit criterion for every combination of $\mathbf{B}_m$, $\mathbf{v}$, and $t$ is computationally formidable, because a single generalized linear model fit involves multiple steps of weighted least squares. The ADAPTIVEREG procedure extends the multivariate adaptive regression splines algorithm to generalized linear models as suggested by Buja et al. (1991).

In the forward selection process, the ADAPTIVEREG procedure extends the algorithm in the following way. Suppose there are $(2k+1)$ bases after iteration $k$. A generalized linear model is first fitted against the data by using the selected bases. The working weights and working response from the last step of the iteratively reweighted least squares algorithm are then used as the weight and response in the weighted least squares fits that select new bases in iteration $(k+1)$. The score chi-square statistic is then used to select the two new bases. This is similar to the forward selection scheme that the LOGISTIC procedure uses. For more information about the score chi-square statistic, see the section Testing Individual Effects Not in the Model in Chapter 79, The LOGISTIC Procedure.
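
The sketch below illustrates this idea for a logistic model: the working weights and working response from the last iteratively reweighted least squares step are frozen and reused when candidate bases are evaluated by weighted least squares. Names are illustrative, and the weighted residual sum of squares shown here is only a simplified stand-in for the score chi-square statistic that the procedure actually uses.

```python
import numpy as np

def irls_logistic(B, y, n_iter=25):
    """Fit a logistic model on basis matrix B by IRLS; return the coefficients
    and the working weights and working response from the last step."""
    beta = np.zeros(B.shape[1])
    for _ in range(n_iter):
        eta = B @ beta                              # linear predictor
        mu = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-8, 1 - 1e-8)
        w = mu * (1.0 - mu)                         # working weights
        z = eta + (y - mu) / w                      # working response
        Bw = B * w[:, None]
        beta = np.linalg.solve(B.T @ Bw, Bw.T @ z)  # weighted least squares update
    return beta, w, z

def score_candidate(B, new_bases, w, z):
    """Weighted RSS of the working response for the augmented basis matrix
    (a simplified stand-in for the score chi-square statistic)."""
    design = np.column_stack([B] + list(new_bases))
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], z * sw, rcond=None)
    resid = (z * sw) - (design * sw[:, None]) @ coef
    return float(resid @ resid)
```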

In the backward selection process, the ADAPTIVEREG procedure extends the algorithm in the following way. Suppose there are $M$ bases in the selected model. The Wald chi-square statistic is used to determine which basis to delete. After one basis is selected for deletion, a generalized linear model is refitted with the remaining bases. This is similar to the backward deletion scheme that the LOGISTIC procedure uses. For more information about the Wald chi-square statistic, see the section Testing Linear Hypotheses about the Regression Coefficients in Chapter 79, The LOGISTIC Procedure.

Accordingly, the lack-of-fit criterion in the forward selection for generalized linear models is the score chi-square statistic. For the lack-of-fit criterion in the backward selection process for generalized linear models, the residual sum of squares term is replaced by the model deviance.

Fast Algorithm

The original multivariate adaptive regression splines algorithm is computationally expensive. To improve the computation speed, Friedman (1993) proposes the fast algorithm. The essential idea of the fast algorithm is to reduce the number of combinations of $\mathbf{B}$, $\mathbf{v}$, and $t$ that are examined at each step of forward selection.

Suppose there are $(2k+1)$ bases that have been formed after iteration $k$, where a parent basis $\mathbf{B}_m$ is selected to construct two new bases. Consider a queue that has bases as its elements. At the top of the queue are the selected parent basis $\mathbf{B}_m$ and the two newly constructed bases, $\mathbf{B}_{2k}$ and $\mathbf{B}_{2k+1}$. The rest of the queue is sorted based on the minimum lack-of-fit criterion for each basis:

$$J(\mathbf{B}_i) = \min_{\substack{\text{all eligible } \mathbf{v} \\ \text{all knots } t}} \mathrm{LOF}(\mathbf{v}, t \mid \mathbf{B}_i), \qquad i = 1, \ldots, 2k-1$$

When $k$ is not small, there are already a relatively large number of bases in the model, and adding more bases is unlikely to improve the fit dramatically. Thus the ranking of the bases in the priority queue is not likely to change much between adjacent iterations, so the candidate parent bases can be restricted to the top $K$ bases in the queue for iteration $(k+1)$. After iteration $k$, the top bases have new $J(\mathbf{B}_i)$ values, whereas the values of the bottom bases are unchanged. The queue is then reordered based on the $J(\mathbf{B}_i)$ values. The number $K$ corresponds to the K= option value for the FAST option in the MODEL statement.

To avoid losing candidate bases that are ranked at the bottom of the queue and to allow them to rise back to the top, a natural "aging" factor is introduced for each basis. This is accomplished by defining the priority of each basis function to be

$$P(\mathbf{B}_i) = R(\mathbf{B}_i) + \beta (k_c - k_r)$$

where $R(\mathbf{B}_i)$ is the rank of the $i$th basis in the queue, $k_c$ is the current iteration number, and $k_r$ is the number of the iteration in which the $J(\mathbf{B}_i)$ value was last computed. The top $K$ candidate bases are then sorted again based on this priority. Large $\beta$ values cause bases that showed little improvement during previous iterations to rise faster to the top of the list. The factor $\beta$ corresponds to the BETA= value for the FAST option in the MODEL statement.
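
A small sketch of the aging bookkeeping follows. The data layout and the rank convention (the best lack of fit receives the largest rank, so that stale bases rise in priority as they age) are assumptions made for illustration; this is not the procedure's internal code.

```python
def reprioritize(candidates, beta, k_current):
    """Recompute priorities P = R + beta * (k_current - k_last) and re-sort.

    Each candidate is a dict with
      'J'      : minimum lack of fit last computed for this basis,
      'k_last' : iteration at which J was last computed.
    """
    # Rank by the stored J values; the best (smallest) J receives the largest rank.
    # This convention is an assumption chosen so that aging moves stale bases upward.
    by_fit = sorted(candidates, key=lambda b: b["J"], reverse=True)
    for rank, b in enumerate(by_fit, start=1):
        b["priority"] = rank + beta * (k_current - b["k_last"])
    # Bases with higher priority are considered first as candidate parents.
    return sorted(candidates, key=lambda b: b["priority"], reverse=True)
```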

For a candidate basis near the top of the priority queue, the minimum lack-of-fit criterion $J(\mathbf{B}_i)$ is recomputed over all eligible variables $\mathbf{v}$ for iteration $(k+1)$. The optimal variable is likely to be the same as the one that was found during the previous iteration, so the fast multivariate adaptive regression splines algorithm introduces another factor, $H$, to save computation. This factor specifies how often $J(\mathbf{B}_i)$ is recomputed over all eligible variables. If $H = 1$, the optimization over all variables is done at each iteration in which a parent basis is considered. If $H = 5$, the complete optimization is done every five iterations; in the intervening iterations, the optimization is done only for the optimal variable that was found in the last complete optimization. The only exceptions are the top three candidates: $\mathbf{B}_{2k-1}$ (the parent basis $\mathbf{B}_m$ that was used to construct the two new bases) and the two new bases, $\mathbf{B}_{2k}$ and $\mathbf{B}_{2k+1}$. The complete optimization for these bases is performed at each iteration. The factor $H$ corresponds to the H= option value for the FAST option in the MODEL statement.

Last updated: December 09, 2022