The TPSPLINE Procedure

MODEL Statement

  • MODEL dependent-variables = <regression-variables> (smoothing-variables)</ options>;

The MODEL statement specifies the dependent variables, the independent regression variables, which are listed with no parentheses, and the independent smoothing variables, which are listed inside parentheses.

The regression variables are optional. At least one smoothing variable is required, and it must be listed after the regression variables. No variables can be listed in both the regression variable list and the smoothing variable list.

If you specify more than one dependent variable, PROC TPSPLINE calculates a thin-plate smoothing spline estimate for each dependent variable by using the regression variables and smoothing variables specified on the right side.

If you specify regression variables, PROC TPSPLINE fits a semiparametric model by using the regression variables as the linear part of the model.

Table 3 summarizes the options available in the MODEL statement.

Table 3: MODEL Statement Options

Option         Description
ALPHA=         Specifies the significance level
DF=            Specifies the degrees of freedom
DISTANCE=      Defines a range in which points are treated as replicates
LAMBDA0=       Specifies the smoothing parameter
LAMBDA=        Specifies a set of values for the $\lambda$ parameter
LOGNLAMBDA0=   Specifies the smoothing parameter on the $\log_{10}(n\lambda)$ scale
LOGNLAMBDA=    Specifies a set of values for the $\lambda$ parameter on the $\log_{10}(n\lambda)$ scale
M=             Specifies the order of the derivative
RANGE=         Specifies the range for smoothing values to be evaluated


You can specify the following options in the MODEL statement:

ALPHA=number

specifies the significance level $\alpha$ of the confidence limits on the final thin-plate smoothing spline estimate when you request confidence limits to be included in the output data set. Specify number as a value between 0 and 1. The default value is 0.05. See the section OUTPUT Statement for more information about the OUTPUT statement.

DF=df

specifies the degrees of freedom of the thin-plate smoothing spline estimate, defined as

$$\mathit{df} = \mathrm{tr}(\mathbf{A}(\lambda))$$

where $\mathbf{A}(\lambda)$ is the hat matrix. Specify df as a value between zero and the number of unique design points $n_q$. Smaller $\mathit{df}$ values cause more penalty on the roughness and thus smoother fits.

DISTANCE=number
D=number

defines a range such that if the $L_\infty$ distance between two data points $(\mathbf{x}_i, \mathbf{z}_i)$ and $(\mathbf{x}_j, \mathbf{z}_j)$ satisfies

$$\|\mathbf{x}_i - \mathbf{x}_j\|_\infty \le D/2$$

then these data points are treated as replicates, where $\mathbf{x}_i$ are the smoothing variables and $\mathbf{z}_i$ are the regression variables.

You can use the DISTANCE= option to reduce the number of unique design points by treating nearby data as replicates. This can be useful when you have a large data set. Larger DISTANCE= values result in fewer unique design points $n_q$. The default value is 0.

PROC TPSPLINE uses the DISTANCE= value to group points as follows: The data are first sorted by the smoothing variables in the order in which they appear in the MODEL statement. The first point in the sorted data becomes the first unique point. Subsequent points have their values set equal to that point until the first point where the maximum distance in one dimension is larger than $D/2$. This point becomes the next unique point, and so on. Because of this sequential processing, the set of unique points differs depending on the order of the smoothing variables in the MODEL statement.

For example, with a model that has two smoothing variables (x1, x2), the data are first sorted by x1 and x2 (in that order), and then uniqueness is assessed sequentially. The first point in the sorted data, $\mathbf{x}_1 = (\mathsf{x1}_1, \mathsf{x2}_1)$, becomes the first unique point, $\mathbf{u}_1 = (\mathsf{u1}_1, \mathsf{u2}_1)$. Subsequent points $\mathbf{x}_i = (\mathsf{x1}_i, \mathsf{x2}_i)$ are set equal to $\mathbf{u}_1$ until the algorithm comes to a point with $\max(|\mathsf{x1}_i - \mathsf{u1}_1|, |\mathsf{x2}_i - \mathsf{u2}_1|) > D/2$. This point becomes the second unique point, $\mathbf{u}_2$, and the sequential grouping continues from there.
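The sequential grouping described above can be sketched in Python. This is only an illustration of the documented behavior, not PROC TPSPLINE's actual implementation. The function name group_replicates, its argument names, and the sample points are hypothetical, and the sketch assumes that the sort is lexicographic over the smoothing variables in MODEL statement order.

```python
def group_replicates(points, d):
    """Sequentially group points into unique design points.

    points: iterable of tuples of smoothing-variable values, in MODEL order.
    d:      the DISTANCE= value (hypothetical argument name).
    Returns (unique, assigned), where assigned[i] is the unique point that
    the i-th point of the sorted data is set equal to.
    """
    ordered = sorted(points)  # sort by x1, then x2, ...
    unique, assigned = [], []
    current = None
    for p in ordered:
        # Start a new unique point when the largest coordinate-wise
        # difference from the current unique point exceeds D/2.
        if current is None or max(abs(a - b) for a, b in zip(p, current)) > d / 2:
            current = p
            unique.append(p)
        assigned.append(current)
    return unique, assigned


# Made-up sample data with two smoothing variables (x1, x2).
pts = [(0.3, 0.0), (0.0, 0.0), (1.0, 1.0), (0.1, 0.05), (0.35, 0.4)]
unique, assigned = group_replicates(pts, d=0.5)
print(len(unique), "unique design points:", unique)
```

In this sketch, a value of 0 for d merges only points that coincide exactly, which matches the default. Swapping the coordinate order of the input before calling the function can change which points are grouped together, mirroring the order dependence noted above.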

LAMBDA0=number

specifies the smoothing parameter, $\lambda_0$, to be used in the thin-plate smoothing spline estimate. By default, PROC TPSPLINE uses the $\lambda$ parameter that minimizes the GCV function for the final fit. The LAMBDA0= value must be positive. Larger $\lambda_0$ values cause smoother fits.

LAMBDA=list-of-values

specifies a set of values for the $\lambda$ parameter. PROC TPSPLINE returns a GCV value for each $\lambda$ point that you specify. You can use the LAMBDA= option to study the GCV function curve for a set of values for $\lambda$. All values listed in the LAMBDA= option must be positive.

LOGNLAMBDA0=number
LOGNL0=number

specifies the smoothing parameter $\lambda_0$ on the $\log_{10}(n\lambda)$ scale. If you specify both the LOGNL0= and LAMBDA0= options, only the value provided by the LOGNL0= option is used. Larger $\log_{10}(n\lambda_0)$ values cause smoother fits. By default, PROC TPSPLINE uses the $\lambda$ parameter that minimizes the GCV function for the estimate.

LOGNLAMBDA=list-of-values
LOGNL=list-of-values

specifies a set of values for the $\lambda$ parameter on the $\log_{10}(n\lambda)$ scale. PROC TPSPLINE returns a GCV value for each $\lambda$ point that you specify. You can use the LOGNLAMBDA= option to study the GCV function curve for a set of $\lambda$ values. If you specify both the LOGNL= and LAMBDA= options, only the list of values provided by the LOGNL= option is used.

In some cases, the LOGNL= option might be preferred over the LAMBDA= option. Because the LAMBDA= value must be positive, a small change in that value can result in a major change in the GCV value. If you instead specify $\lambda$ on the $\log_{10}(n\lambda)$ scale, the allowable range is enlarged to include negative values. Thus, the GCV function is less sensitive to changes in LOGNLAMBDA.
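The relationship between the two scales can be sketched with a short Python example. The helper names to_lognlambda and to_lambda are hypothetical, and the sketch assumes that n is supplied by the caller as the sample size used in the $\log_{10}(n\lambda)$ transformation. It is an arithmetic illustration only, not part of PROC TPSPLINE.

```python
import math


def to_lognlambda(lam, n):
    """Map a positive lambda to the log10(n * lambda) scale."""
    if lam <= 0:
        raise ValueError("lambda must be positive")
    return math.log10(n * lam)


def to_lambda(lognl, n):
    """Map a value on the log10(n * lambda) scale back to lambda."""
    return 10.0 ** lognl / n


# Assumed sample size for illustration only.
n = 100
for lam in (1e-6, 1e-4, 1e-2, 1.0):
    print(f"lambda={lam:g}  log10(n*lambda)={to_lognlambda(lam, n):.1f}")
```

Values of $\lambda$ that are crowded near zero spread out over negative values on the log scale, which is why a grid of LOGNLAMBDA= values can be easier to choose than a grid of LAMBDA= values.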

The DF= option, LAMBDA0= option, and LOGNLAMBDA0= option all specify exact smoothness of a nonparametric fit. If you want to fit a model with specified smoothness, the DF= option is preferable to the other two options because $(0, n_q)$, the range of $\mathrm{df}$, is much smaller in length than the range $(0, \infty)$ of $\lambda$ and the range $(-\infty, \infty)$ of $\log_{10}(n\lambda)$.

M=number

specifies the order of the derivative in the penalty term. The number must be a positive integer. The default value is $\max(2, \mathrm{int}(d/2) + 1)$, where $d$ is the number of smoothing variables.
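The default formula can be tabulated with a minimal Python sketch. The function name default_m is hypothetical, and the sketch assumes that int denotes truncation to an integer, which for positive $d$ is the same as integer division. The check $2m > d$ in the loop is the usual requirement for a thin-plate spline penalty and is included here as background rather than as something stated in this section.

```python
def default_m(d):
    """Default M= value for d smoothing variables: max(2, int(d/2) + 1)."""
    if d < 1:
        raise ValueError("at least one smoothing variable is required")
    return max(2, d // 2 + 1)


for d in range(1, 7):
    m = default_m(d)
    print(f"d={d}: default M={m}, 2m > d: {2 * m > d}")
```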

RANGE=(lower, upper)

specifies that on the $\log_{10}(n\lambda)$ scale only smoothing values greater than or equal to lower and less than or equal to upper be evaluated to minimize the GCV function.

Last updated: December 09, 2022