The ADAPTIVEREG Procedure

MODEL Statement

  • MODEL dependent <(options)>=<effects> </ options>;

  • MODEL events/trials = <effects> </ options>;

The MODEL statement names the response variable and the explanatory effects, including covariates, main effects, interactions, and nested effects; see the section Specification of Effects in Chapter 53, The GLM Procedure, for more information. If you omit the explanatory effects, the procedure fits an intercept-only model. You must specify exactly one MODEL statement.

You can specify two forms of the MODEL statement. The first form, referred to as single-trial syntax, is applicable to binary, ordinal, and nominal response data. The second form, referred to as events/trials syntax, is restricted to binary response data. You use the single-trial syntax when each observation in the DATA= data set contains information about only a single trial, such as a single subject in an experiment. When each observation contains information about multiple binary response trials, such as the counts of the number of observed subjects and the number of subjects who respond, then you can use the events/trials syntax.

In the events/trials syntax, you specify two variables that contain count data for a binomial experiment. These two variables are separated by a slash. The value of the first variable, events, is the number of positive responses (or events). The value of the second variable, trials, is the number of trials. The values of both events and (trials - events) must be nonnegative, and the value of trials must be positive for the response to be valid.
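
As an illustration of the events/trials syntax, the following minimal sketch fits a model to grouped binomial counts. The data set name assay, its variables, and its values are hypothetical and exist only to show the shape of the input: one row per group, with events never exceeding trials.

```sas
/* Hypothetical grouped data: one row per dose group.      */
/* events = number of responders, trials = group size.     */
data assay;
   input dose events trials;
   datalines;
1   2 50
2   5 50
3   9 50
4  16 50
5  24 50
6  33 50
7  41 50
8  46 50
;

/* Events/trials syntax: the two count variables are separated by a slash. */
proc adaptivereg data=assay;
   model events/trials = dose;
run;
```

Because no DIST= option is given, the procedure is expected to use the binomial distribution, which is the documented default for this syntax.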

In the single-trial syntax, you specify one variable (on the left side of the equal sign) as the response variable. This variable can be character or numeric. You can specify variable options specific to the response variable immediately after the response variable with parentheses around them.

For both forms of the MODEL statement, explanatory effects follow the equal sign. Variables can be either continuous or classification variables. Classification variables can be character or numeric, and they must be declared in the CLASS statement. When an effect is a classification variable, the procedure inserts a set of coded columns into the design matrix instead of directly entering a single column that contains the values of the variable.
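
The following sketch shows the single-trial syntax with a mix of continuous and classification effects. It assumes that the sample data set sashelp.cars is available in your SAS installation; the choice of response and effects is illustrative only.

```sas
/* Single-trial syntax with continuous and classification effects.  */
/* Origin and Type are character variables, so they are declared    */
/* in the CLASS statement before they are used as effects.          */
proc adaptivereg data=sashelp.cars;
   class Origin Type;
   model MPG_City = Horsepower Weight Origin Type;
run;
```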

Table 3 summarizes the options available in the MODEL statement.

Table 3: MODEL Statement Options

Option          Description

Response Variable Options
DESCENDING      Reverses the order of the response categories
EVENT=          Specifies the event category for the binary response
ORDER=          Specifies the sort order for the binary response
REFERENCE=      Specifies the reference category for the binary response

Statistical Modeling Options
ADDITIVE        Requests an additive model
ALPHA=          Controls the knot selection
CVMETHOD=       Specifies how subsets for cross validation are formed
DFPERBASIS=     Specifies degrees of freedom per basis function
DIST=           Specifies the distribution family
FAST            Controls the fast-forward selection algorithm
FORWARDONLY     Requests that the backward selection process be skipped
KEEP=           Specifies effects to be included in the final model
LINEAR=         Specifies linear effects to be examined in model selection
LINK=           Specifies the link function
MAXBASIS=       Specifies the maximum number of basis functions allowed
MAXORDER=       Specifies the maximum order of interactions allowed
NOMISS          Requests removal of missing values from modeling
OFFSET=         Specifies an offset for the linear predictor
VARPENALTY=     Specifies the penalty for variable reentry


You can specify the following options in the MODEL statement.

Response Variable Options

Response variable options determine how the ADAPTIVEREG procedure models probabilities for binary data. You can specify the following response variable options by enclosing them in parentheses after the response variable.

DESCENDING
DESC

reverses the order of the response categories. If both the DESCENDING and ORDER= options are specified, PROC ADAPTIVEREG orders the response categories according to the ORDER= option and then reverses that order.

EVENT='category' | FIRST | LAST

specifies the event category for the binary response model. PROC ADAPTIVEREG models the probability of the event category. You can specify one of the following values for this option:

'category'

specifies the formatted value of the event category.

FIRST

designates the first ordered category as the event.

LAST

designates the last ordered category as the event.

The default is EVENT=FIRST.

One of the most common sets of response levels is {0, 1}, with 1 representing the event for which the probability is to be modeled. Consider the example where Y takes the value 1 for event and 0 for nonevent, and X is the explanatory variable. To specify the value 1 as the event category, use the following MODEL statement:

   model Y(event='1') = X;

ORDER=order-type

specifies the sort order for the categories of categorical variables. This ordering determines which parameters in the model correspond to each level in the data. When the default ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values. Table 4 shows how PROC ADAPTIVEREG interprets values of the ORDER= option.

Table 4: Sort Order for Categorical Variables

order-type      Levels Sorted By
DATA            Order of appearance in the input data set
FORMATTED       External formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ            Descending frequency count; levels with the most observations come first in the order
FREQDATA        Order of descending frequency count, and within counts by order of appearance in the input data set when counts are tied
FREQFORMATTED   Order of descending frequency count, and within counts by formatted value (as above) when counts are tied
FREQINTERNAL    Order of descending frequency count, and within counts by unformatted value when counts are tied
INTERNAL        Unformatted value


For the FORMATTED and INTERNAL values, the sort order is machine-dependent. If you specify the ORDER= option in the MODEL statement and the ORDER= option in the CLASS statement, the former takes precedence.

For more information about sort order, see the chapter on the SORT procedure in the Base SAS Procedures Guide and the discussion of BY-group processing in SAS Programmer's Guide: Essentials.

REFERENCE='category' | FIRST | LAST
REF='category' | FIRST | LAST

specifies the reference category for the binary or multinomial response model. For the binary response model, specifying one response category as the reference is the same as specifying the other response category as the event category. You can specify one of the following values for this option:

'category'

specifies the formatted value of the reference category.

FIRST

designates the first ordered category as the reference.

LAST

designates the last ordered category as the reference.

The default is REFERENCE=LAST.
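
To illustrate the relationship between the EVENT= and REFERENCE= options for a binary response, the following sketch writes the same model in two ways. It assumes that the sample data set sashelp.heart is available, with the character response Status taking the values 'Alive' and 'Dead'. Under that assumption, both steps are expected to model the probability of 'Dead'.

```sas
/* Form 1: name the event category directly. */
proc adaptivereg data=sashelp.heart;
   model Status(event='Dead') = AgeAtStart Systolic / dist=binomial;
run;

/* Form 2: name the other category as the reference.            */
/* For a binary response this should be equivalent to Form 1.   */
proc adaptivereg data=sashelp.heart;
   model Status(ref='Alive') = AgeAtStart Systolic / dist=binomial;
run;
```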

Model Options

You can specify the following model options.

ADDITIVE

requests an additive model for which only main effects are included in the fitted model. If you do not specify the ADDITIVE option, PROC ADAPTIVEREG fits a model that has both main effects and two-way interaction terms.
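
The following sketch contrasts the default fit with an additive fit. It assumes that the sample data set sashelp.cars is available; the response and effects are chosen only for illustration.

```sas
/* Default: main effects and two-way interactions can enter the model. */
proc adaptivereg data=sashelp.cars;
   model MPG_City = Horsepower Weight EngineSize;
run;

/* ADDITIVE: only main effects can enter the model. */
proc adaptivereg data=sashelp.cars;
   model MPG_City = Horsepower Weight EngineSize / additive;
run;
```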

ALPHA=number

specifies the parameter that controls the number of knots considered for each variable. Friedman (1991b) uses the following as the number of observations between interior knots:

$$-\frac{2}{5}\,\log_2\left[-\frac{\log(1-\alpha)}{p\,n_m}\right]$$

Friedman also uses the following as the number of observations between extreme knots and the corresponding variable boundary values:

$$3 - \log_2\frac{\alpha}{p}$$

In these expressions, $p$ is the number of variables and $n_m$ is the number of observations for which a parent basis $\mathbf{B}_m > 0$. The value of $\alpha$ should be greater than 0 and less than 1. The default is ALPHA=0.05.
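
To get a feel for the two knot-spacing expressions, you can evaluate them in a DATA step. The following sketch uses illustrative values for the number of variables and the observation count (both hypothetical), and it assumes that the inner logarithm is the natural logarithm, which this section does not state explicitly.

```sas
/* Sketch: evaluate the two knot-spacing expressions for example inputs. */
data knot_spacing;
   alpha = 0.05;   /* default ALPHA= value                              */
   p     = 10;     /* assumed number of variables                       */
   n_m   = 100;    /* assumed number of observations with B_m > 0       */

   /* Observations between interior knots (natural log assumed inside). */
   interior = -(2/5) * log2( -log(1 - alpha) / (p * n_m) );

   /* Observations between an extreme knot and the variable boundary.   */
   boundary = 3 - log2(alpha / p);

   /* Write both values to the SAS log. */
   put interior= boundary=;
run;
```

Changing alpha in this step shows how the ALPHA= value shifts both spacings.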

CVMETHOD=RANDOM <(n)>
CVMETHOD=INDEX (variable)

specifies the method for subdividing the training data into n parts when you request n-fold cross validation for backward selection. CVMETHOD=RANDOM assigns each training observation randomly to one of the n parts. CVMETHOD=INDEX(variable) assigns observations to parts based on the formatted value of the named variable. This input data set variable is treated as a classification variable, and the number of parts n is the number of distinct levels of this variable. By optionally naming this variable in a CLASS statement, you can use the ORDER= option in the CLASS statement to control how this variable is levelized.

The value of n defaults to 5 with CVMETHOD=RANDOM.
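
The following sketch shows one way to supply your own fold assignment through CVMETHOD=INDEX. The data set cars_folds and the variable fold are hypothetical, and the sketch assumes that sashelp.cars is available. This section does not say whether other options are needed to turn on cross validation, so treat the MODEL statement as a syntax illustration only.

```sas
/* Assign each observation to one of five folds in rotation. */
data cars_folds;
   set sashelp.cars;
   fold = mod(_n_, 5) + 1;   /* values 1 through 5 */
run;

/* The five distinct levels of fold define the five parts. */
proc adaptivereg data=cars_folds;
   model MPG_City = Horsepower Weight EngineSize / cvmethod=index(fold);
run;
```

Replacing the option with cvmethod=random(10) would instead request ten randomly formed parts.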

DFPERBASIS=d
DF=d

specifies the degrees of freedom (d) that are "charged" for each basis function that is used in the lack-of-fit function for backward selection. Larger values of d lead to fewer spline knots and thus smoother function estimates. The default is DFPERBASIS=2.

DIST=distribution-id

specifies the distribution family used in the model.

If you do not specify a distribution-id, the ADAPTIVEREG procedure defaults to the normal distribution for continuous response variables and to the binary distribution for classification or character variables, unless the events/trials syntax is used in the MODEL statement. If you choose the events/trials syntax, the ADAPTIVEREG procedure defaults to the binomial distribution.

Table 5 lists the values of the DIST= option and the corresponding default link functions. For generalized linear models with these distributions, you can find expressions for the log-likelihood functions in the section Log-Likelihood Functions in Chapter 51, The GENMOD Procedure.

Table 5: Values of the DIST= Option

distribution-id   Aliases          Distribution        Default Link Function
BINOMIAL                           Binomial            Logit
GAMMA             GAM, G           Gamma               Reciprocal
GAUSSIAN          NORMAL, N, NOR   Normal              Identity
IGAUSSIAN         IG               Inverse Gaussian    Inverse squared (power(-2))
NEGBIN            NB               Negative binomial   Log
POISSON           POI              Poisson             Log


FAST<(fast-options)>

improves the speed of the modeling. Because of the computational complexity of the original multivariate adaptive regression splines algorithm, Friedman (1993) proposes modifications that improve the speed by tuning several parameters. See the section Fast Algorithm for more information about these modifications. You can specify the following fast-options:

BETA=beta

specifies the "aging" factor in the priority queue of candidate parent bases. Larger values of beta cause low-improvement parents to rise more quickly into the top list of candidates. The default value is BETA=1.

H=h

specifies the parameter that controls how often the improvement is recomputed for a parent basis $\mathbf{B}_m$ over all candidate variables. Larger values of h cause fewer computations of improvement. The default value is H=1.

K=k

specifies the number of top candidates in the priority queue of parent bases for selecting new bases. Larger values of k cause more parent bases to be considered. The default is to use half of the eligible parent bases at every iteration.
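
As a sketch of how the fast-options fit together, the following step sets all three in one FAST option. It assumes that sashelp.cars is available and that multiple fast-options are separated by spaces inside the parentheses. The values are arbitrary illustrations, not recommendations.

```sas
/* K=3    : consider the top three parent bases in the priority queue. */
/* BETA=1 : default aging factor.                                      */
/* H=2    : recompute improvements less often than the default H=1.    */
proc adaptivereg data=sashelp.cars;
   model MPG_City = Horsepower Weight EngineSize Wheelbase Length
         / fast(k=3 beta=1 h=2);
run;
```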

FORWARDONLY

skips the backward selection step after forward selection is finished.

KEEP=effects

specifies a list of variables to be included in the final model.

LINEAR=effects

specifies a list of variables to be considered without nonparametric transformation. If these variables are selected, they enter the model in linear form.
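
The following sketch combines the KEEP= and LINEAR= options. It assumes that sashelp.cars is available, and it names a single variable in each option because this section does not show how a list of several variables is delimited.

```sas
/* Weight is forced into the final model.                         */
/* Horsepower can be selected, but only as a linear term without  */
/* a spline transformation.                                       */
proc adaptivereg data=sashelp.cars;
   model MPG_City = Horsepower Weight EngineSize
         / keep=Weight linear=Horsepower;
run;
```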

LINK=keyword

specifies the link function in the model. Not all link functions are available for all distribution families. The keywords and expressions for the associated link functions are shown in Table 6.

Table 6: Link Functions in MODEL Statement of the ADAPTIVEREG Procedure

keyword       Alias     Link Function             $g(\mu) = \eta =$
IDENTITY      ID        Identity                  $\mu$
LOG                     Log                       $\log(\mu)$
LOGIT                   Logit                     $\log(\mu/(1-\mu))$
POWERMINUS2             Power with exponent -2    $1/\mu^2$
PROBIT        NORMIT    Probit                    $\Phi^{-1}(\mu)$
RECIPROCAL    INVERSE   Reciprocal                $1/\mu$
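
The following sketch overrides the default link for a binary response. It assumes that sashelp.heart is available and that the probit link is permitted with the binomial distribution in this procedure, which is plausible but not confirmed by this section.

```sas
/* Binomial response with a probit link instead of the default logit. */
proc adaptivereg data=sashelp.heart;
   model Status(event='Dead') = AgeAtStart Systolic
         / dist=binomial link=probit;
run;
```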


MAXBASIS=number

specifies the maximum number of basis functions ($M_{\max}$) that can be used in the final model. The default value is the larger of 21 and one plus two times the number of nonintercept effects specified in the MODEL statement.

MAXORDER=number

specifies the maximum interaction levels for effects that could potentially enter the model. The default value is MAXORDER=2.
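
The following sketch sets both limits explicitly. It assumes that sashelp.cars is available. With three nonintercept effects, the default MAXBASIS= value would be the larger of 21 and 1 + 2(3) = 7, which is 21, so the values below loosen both defaults for illustration.

```sas
/* Allow up to 31 basis functions and up to three-way interactions. */
proc adaptivereg data=sashelp.cars;
   model MPG_City = Horsepower Weight EngineSize
         / maxbasis=31 maxorder=3;
run;
```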

NOMISS

excludes all observations with missing values from the model fitting. By default, the ADAPTIVEREG procedure takes the missingness into account when an explanatory variable has missing values. For more information about how PROC ADAPTIVEREG handles missing values, see the section Missing Values.

OFFSET=variable

specifies an offset for the linear predictor. An offset plays the role of a predictor whose coefficient is known to be 1. For example, you can use an offset in a Poisson model when counts have been obtained in time intervals of different lengths. With a log link function, you can model the counts as Poisson variables with the logarithm of the time interval as the offset variable. The offset variable cannot appear in the CLASS statement or elsewhere in the MODEL statement.
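
The Poisson example described above can be sketched as follows. The data set visits, its variables, and its values are hypothetical: count is the number of events observed over months of follow-up, and the offset is the logarithm of that interval.

```sas
/* Hypothetical counts observed over intervals of different lengths. */
data visits;
   input age months count;
   log_months = log(months);   /* offset on the scale of the log link */
   datalines;
25  6  1
31 12  3
38 12  2
44 24  7
47  6  2
52 18  6
58 24 11
63 12  6
66  6  4
71 18 12
;

/* The offset variable enters the linear predictor with coefficient 1. */
proc adaptivereg data=visits;
   model count = age / dist=poisson link=log offset=log_months;
run;
```

Note that log_months appears only in the OFFSET= option, not among the effects or in a CLASS statement.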

VARPENALTY=gamma

specifies the incremental penalty $\gamma$ for increasing the number of variables in the adaptive regression model. To discourage a model with too many variables, at each iteration of the forward selection the model improvement is reduced by a factor of $(1-\gamma)$ for any new variable that is introduced.

For highly collinear designs, the VARPENALTY= option helps PROC ADAPTIVEREG produce models that are nearly equivalent in terms of residual sum of squares but have fewer independent variables. Friedman (1991b) suggests the following values for $\gamma$:

0.0

no penalty (default value)

0.05

moderate penalty

0.1

heavy penalty

The best value depends on the specific situation. Some experimenting with different values is usually required. You should use this option with care.
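
One way to experiment with different penalties is to refit the same model once per candidate value and compare the selected variables. The following sketch does this with a small macro. The macro name try_penalty is hypothetical, and the sketch assumes that sashelp.cars is available.

```sas
/* Refit the same model for each penalty value in a space-separated list. */
%macro try_penalty(values=0 0.05 0.1);
   %local i g;
   %let i = 1;
   /* Split on blanks only, so that 0.05 is not broken at the period. */
   %let g = %scan(&values, &i, %str( ));
   %do %while(%length(&g) > 0);
      title "VARPENALTY=&g";
      proc adaptivereg data=sashelp.cars;
         model MPG_City = Horsepower Weight EngineSize Wheelbase Length
               / varpenalty=&g;
      run;
      %let i = %eval(&i + 1);
      %let g = %scan(&values, &i, %str( ));
   %end;
   title;   /* clear the title after the last fit */
%mend try_penalty;

%try_penalty()
```

The default list holds the three values suggested above; pass a different values= list to try others.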

Last updated: December 09, 2022