The ROBUSTREG Procedure

High Breakdown Value Estimation

The breakdown value of an estimator is the smallest contamination fraction of the data that can cause the estimates on the entire data to be arbitrarily far from the estimates on only the uncontaminated data. The breakdown value of an estimator can be used to measure the robustness of the estimator. Rousseeuw and Leroy (1987) and others introduced the following high breakdown value estimators for linear regression.

LTS Estimate

The least trimmed squares (LTS) estimate that was proposed by Rousseeuw (1984) is defined as the p-vector

ModifyingAbove bold-italic theta With caret Subscript upper L upper T upper S Baseline equals arg min Underscript bold-italic theta Endscripts upper Q Subscript upper L upper T upper S Baseline left-parenthesis bold-italic theta right-parenthesis with upper Q Subscript upper L upper T upper S Baseline left-parenthesis bold-italic theta right-parenthesis equals sigma-summation Underscript i equals 1 Overscript h Endscripts r Subscript left-parenthesis i right-parenthesis Superscript 2

where r Subscript left-parenthesis 1 right-parenthesis Superscript 2 Baseline less-than-or-equal-to r Subscript left-parenthesis 2 right-parenthesis Superscript 2 Baseline less-than-or-equal-to ellipsis less-than-or-equal-to r Subscript left-parenthesis n right-parenthesis Superscript 2 are the ordered squared residuals r Subscript i Superscript 2 Baseline equals left-parenthesis y Subscript i Baseline minus bold x prime Subscript i Baseline bold-italic theta right-parenthesis squared, i equals 1 comma ellipsis comma n, and h is defined in the range StartFraction n Over 2 EndFraction plus 1 less-than-or-equal-to h less-than-or-equal-to StartFraction 3 n plus p plus 1 Over 4 EndFraction.

You can specify the parameter h by using the H= option in the PROC ROBUSTREG statement. By default, h equals left-bracket StartFraction 3 n plus p plus 1 Over 4 EndFraction right-bracket. The breakdown value is StartFraction n minus h Over n EndFraction for the LTS estimate.

The ROBUSTREG procedure computes LTS estimates by using the FAST-LTS algorithm of Rousseeuw and Van Driessen (2000). The estimates are often used to detect outliers in the data, which are then downweighted in the resulting weighted LS regression.

Algorithm

Least trimmed squares (LTS) regression is based on the subset of h observations (out of a total of n observations) whose least squares fit possesses the smallest sum of squared residuals. The coverage h can be set between StartFraction n Over 2 EndFraction and n. The LTS method was proposed by Rousseeuw (1984, p. 876) as a highly robust regression estimator with breakdown value StartFraction n minus h Over n EndFraction. The ROBUSTREG procedure uses the FAST-LTS algorithm that was proposed by Rousseeuw and Van Driessen (2000). The intercept adjustment technique is also used in this implementation. However, because this adjustment is expensive to compute, it is optional. You can use the IADJUST= option in the PROC ROBUSTREG statement to request or suppress the intercept adjustment. By default, PROC ROBUSTREG does intercept adjustment for data sets that contain fewer than 10,000 observations. The steps of the algorithm are described briefly as follows. For more information, see Rousseeuw and Van Driessen (2000).

  1. The default h is left-bracket StartFraction 3 n plus p plus 1 Over 4 EndFraction right-bracket, where p is the number of independent variables. You can specify any integer h with left-bracket StartFraction n Over 2 EndFraction right-bracket plus 1 less-than-or-equal-to h less-than-or-equal-to left-bracket StartFraction 3 n plus p plus 1 Over 4 EndFraction right-bracket by using the H= option in the MODEL statement. The breakdown value for LTS, StartFraction n minus h Over n EndFraction, is reported. The default h is a good compromise between breakdown value and statistical efficiency.

  2. If p = 1 (single regressor), the procedure uses the exact algorithm of Rousseeuw and Leroy (1987, p. 172).

  3. If p greater-than-or-equal-to 2, PROC ROBUSTREG uses the following algorithm. If n < 2 ssubs, where ssubs is the size of the subgroups (you can specify ssubs by using the SUBGROUPSIZE= option in the PROC ROBUSTREG statement; by default, ssubs = 300), PROC ROBUSTREG draws a random p-subset and computes the regression coefficients by using these p points (if the regression is degenerate, another p-subset is drawn). The absolute residuals for all observations in the data set are computed, and the first h points that have the smallest absolute residuals are selected. From this selected h-subset, PROC ROBUSTREG carries out nsteps C-steps (concentration steps; for more information, see Rousseeuw and Van Driessen 2000). You can specify nsteps by using the CSTEP= option in the PROC ROBUSTREG statement; by default, nsteps = 2. PROC ROBUSTREG redraws p-subsets and repeats the preceding computation nrep times, and then finds the nbsol (at most) solutions that have the lowest sums of h squared residuals. You can specify nrep by using the NREP= option in the PROC ROBUSTREG statement; by default, NREP=min left-brace 500 comma StartBinomialOrMatrix n Choose p EndBinomialOrMatrix right-brace. For small n and p, all StartBinomialOrMatrix n Choose p EndBinomialOrMatrix subsets are used and the NREP= option is ignored (Rousseeuw and Hubert 1996). You can specify nbsol by using the NBEST= option in the PROC ROBUSTREG statement; by default, NBEST=10. For each of these nbsol best solutions, C-steps are taken until convergence and the best final solution is found.

  4. If n greater-than-or-equal-to 5 s s u b s, construct five disjoint random subgroups with size ssubs. If 2 s s u b s less-than n less-than 5 s s u b s, the data are split into at most four subgroups with ssubs or more observations in each subgroup, so that each observation belongs to a subgroup and the subgroups have roughly the same size. Let nsubs denote the number of subgroups. Inside each subgroup, PROC ROBUSTREG repeats the step 3 algorithm nrep / nsubs times, keeps the nbsol best solutions, and pools the subgroups, yielding the merged set of size n Subscript m e r g e d. In the merged set, for each of the n s u b s times n b s o l best solutions, nsteps C-steps are carried out by using n Subscript m e r g e d and h Subscript m e r g e d Baseline equals left-bracket n Subscript m e r g e d Baseline StartFraction h Over n EndFraction right-bracket and the nbsol best solutions are kept. In the full data set, for each of these nbsol best solutions, C-steps are taken by using n and h until convergence and the best final solution is found.

Note: At step 3 in the algorithm, a randomly selected p-subset might be degenerate (that is, its design matrix might be singular). If the total number of p-subsets from any subgroup is greater than 4,000 and the ratio of degenerate p-subsets is higher than the threshold that is specified in the FAILRATIO= option, the algorithm terminates with an error message.

R Square

For models with the intercept term, the robust version of R square for the LTS estimate is defined as

upper R Subscript upper L upper T upper S Superscript 2 Baseline equals 1 minus StartFraction s Subscript upper L upper T upper S Superscript 2 Baseline left-parenthesis bold upper X comma bold y right-parenthesis Over s Subscript upper L upper T upper S Superscript 2 Baseline left-parenthesis bold 1 comma bold y right-parenthesis EndFraction

For models without the intercept term, it is defined as

upper R Subscript upper L upper T upper S Superscript 2 Baseline equals 1 minus StartFraction s Subscript upper L upper T upper S Superscript 2 Baseline left-parenthesis bold upper X comma bold y right-parenthesis Over s Subscript upper L upper T upper S Superscript 2 Baseline left-parenthesis bold 0 comma bold y right-parenthesis EndFraction

For both models,

s Subscript upper L upper T upper S Baseline left-parenthesis bold upper X comma bold y right-parenthesis equals d Subscript h comma n Baseline StartRoot StartFraction 1 Over h EndFraction sigma-summation Underscript i equals 1 Overscript h Endscripts r Subscript left-parenthesis i right-parenthesis Superscript 2 Baseline EndRoot

Note that s Subscript upper L upper T upper S is a preliminary estimate of the parameter sigma in the distribution function upper L left-parenthesis dot slash sigma right-parenthesis.

Here d Subscript h comma n is chosen to make s Subscript upper L upper T upper S consistent, assuming a Gaussian model. Specifically,

StartLayout 1st Row 1st Column d Subscript h comma n 2nd Column equals 3rd Column 1 slash StartRoot 1 minus StartFraction 2 n Over h c Subscript h comma n Baseline EndFraction phi left-parenthesis 1 slash c Subscript h comma n Baseline right-parenthesis EndRoot 2nd Row 1st Column c Subscript h comma n 2nd Column equals 3rd Column 1 slash normal upper Phi Superscript negative 1 Baseline left-parenthesis StartFraction h plus n Over 2 n EndFraction right-parenthesis EndLayout

where normal upper Phi and phi are the distribution function and the density function of the standard normal distribution, respectively.

Final Weighted Scale Estimator

The ROBUSTREG procedure displays two scale estimators, s Subscript upper L upper T upper S and Wscale. The estimator Wscale is a more efficient scale estimator based on the preliminary estimate s Subscript upper L upper T upper S; it is defined as

Wscale equals StartRoot StartFraction sigma-summation Underscript i Endscripts w Subscript i Baseline r Subscript i Baseline Superscript 2 Baseline Over sigma-summation Underscript i Endscripts w Subscript i Baseline minus p EndFraction EndRoot

where

w Subscript i Baseline equals StartLayout Enlarged left-brace 1st Row 1st Column 0 2nd Column if StartAbsoluteValue r Subscript i Baseline EndAbsoluteValue slash s Subscript upper L upper T upper S Baseline greater-than k 2nd Row 1st Column 1 2nd Column otherwise EndLayout

You can specify k by using the CUTOFF= option in the MODEL statement. By default, k = 3.

S Estimate

The S estimate that was proposed by Rousseeuw and Yohai (1984) is defined as the p-vector

ModifyingAbove bold-italic theta With caret Subscript upper S Baseline equals arg min Underscript theta Endscripts upper S left-parenthesis bold-italic theta right-parenthesis

where the dispersion upper S left-parenthesis bold-italic theta right-parenthesis is the solution of

StartFraction 1 Over n minus p EndFraction sigma-summation Underscript i equals 1 Overscript n Endscripts chi left-parenthesis StartFraction y Subscript i Baseline minus bold x prime Subscript i Baseline bold-italic theta Over upper S EndFraction right-parenthesis equals beta

Here beta is set to integral chi left-parenthesis s right-parenthesis d normal upper Phi left-parenthesis s right-parenthesis such that ModifyingAbove bold-italic theta With caret Subscript upper S and upper S left-parenthesis ModifyingAbove bold-italic theta With caret Subscript upper S Baseline right-parenthesis are asymptotically consistent estimates of bold-italic theta and sigma for the Gaussian regression model. The breakdown value of the S estimate is

StartFraction beta Over max Underscript s Endscripts chi left-parenthesis s right-parenthesis EndFraction

The ROBUSTREG procedure provides two choices for chi: Tukey’s bisquare function and Yohai’s optimal function.

Tukey’s bisquare function, which you can specify by using the option CHIF=TUKEY, is

chi Subscript k 0 Baseline left-parenthesis s right-parenthesis equals StartLayout Enlarged left-brace 1st Row 1st Column 3 left-parenthesis StartFraction s Over k 0 EndFraction right-parenthesis squared minus 3 left-parenthesis StartFraction s Over k 0 EndFraction right-parenthesis Superscript 4 Baseline plus left-parenthesis StartFraction s Over k 0 EndFraction right-parenthesis Superscript 6 Baseline comma 2nd Column if StartAbsoluteValue s EndAbsoluteValue less-than-or-equal-to k 0 2nd Row 1st Column 1 2nd Column otherwise EndLayout

The constant k 0 controls the breakdown value and efficiency of the S estimate. If you use the EFF= option to specify the efficiency, you can determine the corresponding k 0. The default k 0 is 2.9366, such that the breakdown value of the S estimate is 0.25, with a corresponding asymptotic efficiency for the Gaussian model of 75.9%.

The Yohai function, which you can specify by using the option CHIF=YOHAI, is

chi Subscript k 0 Baseline left-parenthesis s right-parenthesis equals StartLayout Enlarged left-brace 1st Row 1st Column StartFraction s squared Over 2 EndFraction 2nd Column if StartAbsoluteValue s EndAbsoluteValue less-than-or-equal-to 2 k 0 2nd Row 1st Column k 0 squared left-bracket b 0 plus b 1 left-parenthesis StartFraction s Over k 0 EndFraction right-parenthesis squared plus b 2 left-parenthesis StartFraction s Over k 0 EndFraction right-parenthesis Superscript 4 Baseline 2nd Column Blank 3rd Row 1st Column plus b 3 left-parenthesis StartFraction s Over k 0 EndFraction right-parenthesis Superscript 6 Baseline plus b 4 left-parenthesis StartFraction s Over k 0 EndFraction right-parenthesis Superscript 8 Baseline right-bracket 2nd Column if 2 k 0 less-than StartAbsoluteValue s EndAbsoluteValue less-than-or-equal-to 3 k 0 4th Row 1st Column 3.25 k 0 squared 2nd Column if StartAbsoluteValue s EndAbsoluteValue greater-than 3 k 0 EndLayout

where b 0 equals 1.792, b 1 equals negative 0.972, b 2 equals 0.432, b 3 equals negative 0.052, and b 4 equals 0.002. If you use the EFF= option to specify the efficiency, you can determine the corresponding k 0. By default, k 0 is set to 0.7405, such that the breakdown value of the S estimate is 0.25, with a corresponding asymptotic efficiency for the Gaussian model of 72.7%.

Algorithm

The ROBUSTREG procedure implements the algorithm that was proposed by Marazzi (1993) for the S estimate, which is a refined version of the algorithm that was proposed by Ruppert (1992). The refined algorithm is briefly described as follows.

Initialize iter = 1.

  1. Draw a random q-subset of the total n observations, and compute the regression coefficients by using these q observations (if the regression is degenerate, draw another q-subset), where q greater-than-or-equal-to p can be specified by using the SUBSIZE= option. By default, q equals p.

  2. Compute the residuals: r Subscript i Baseline equals y Subscript i Baseline minus sigma-summation Underscript j equals 1 Overscript p Endscripts x Subscript i j Baseline theta Subscript j for i equals 1 comma ellipsis comma n. For the first iteration, where iter = 1, take the following substeps:

    1. If all StartAbsoluteValue r Subscript i Baseline EndAbsoluteValue equals 0 for i equals 1 comma ellipsis comma n, which means y Subscript i exactly equals sigma-summation Underscript j equals 1 Overscript p Endscripts x Subscript i j Baseline theta Subscript j for all i equals 1 comma ellipsis comma n, this algorithm terminates with a message for exact fit.

    2. Otherwise, set s Superscript asterisk Baseline equals 2 median StartSet StartAbsoluteValue r Subscript i Baseline EndAbsoluteValue comma i equals 1 comma ellipsis comma n EndSet.

    3. If s Superscript asterisk Baseline equals 0, update s Superscript asterisk Baseline equals min left-brace StartAbsoluteValue r Subscript i Baseline EndAbsoluteValue greater-than 0 comma i equals 1 comma ellipsis comma n right-brace.

    4. If sigma-summation Underscript i equals 1 Overscript n Endscripts chi left-parenthesis r Subscript i Baseline slash s Superscript asterisk Baseline right-parenthesis greater-than left-parenthesis n minus p right-parenthesis beta, set s Superscript asterisk Baseline equals 1.5 s Superscript asterisk; go to step 3.

    If iter > 1 and sigma-summation Underscript i equals 1 Overscript n Endscripts chi left-parenthesis r Subscript i Baseline slash s Superscript asterisk Baseline right-parenthesis less-than equals left-parenthesis n minus p right-parenthesis beta, go to step 3; otherwise, go to step 5.

  3. Solve the following equation for s by using an iterative algorithm:

    StartFraction 1 Over n minus p EndFraction sigma-summation Underscript i equals 1 Overscript n Endscripts chi left-parenthesis r Subscript i Baseline slash s right-parenthesis equals beta

  4. If iter > 1 and s greater-than s Superscript asterisk, go to step 5. Otherwise, set s Superscript asterisk Baseline equals s and bold-italic theta Superscript asterisk Baseline equals bold-italic theta. If s Superscript asterisk Baseline less-than TOLS, return s Superscript asterisk and bold-italic theta Superscript asterisk; otherwise, go to step 5.

  5. If iter < NREP, set i t e r equals i t e r plus 1 and return to step 1; otherwise, return s Superscript asterisk and bold-italic theta Superscript asterisk.

The ROBUSTREG procedure performs the following refinement step by default. You can request that this refinement not be performed by specifying the NOREFINE option in the PROC ROBUSTREG statement.

  1. Let psi equals chi prime. Using the values s Superscript asterisk and bold-italic theta Superscript asterisk from the previous steps, compute the M estimates bold-italic theta Subscript upper M and sigma Subscript upper M of bold-italic theta and sigma with the setup for M estimation that is described in the section M Estimation. If sigma Subscript upper M Baseline greater-than s Superscript asterisk, give a warning and return s Superscript asterisk and bold-italic theta Superscript asterisk; otherwise, return sigma Subscript upper M and bold-italic theta Subscript upper M.

You can specify TOLS by using the TOLERANCE= option; by default, TOLERANCE=0.001. Alternately, you can specify NREP by using the NREP= option. You can also use the option NREP=NREP0 or NREP=NREP1 to determine NREP according to Table 9. NREP=NREP0 is set as the default.

Table 9: Default NREP

P NREP0 NREP1
1 150 500
2 300 1000
3 400 1500
4 500 2000
5 600 2500
6 700 3000
7 850 3000
8 1250 3000
9 1500 3000
>9 1500 3000


Note: At step 1 in the algorithm, a randomly selected q-subset might be degenerate. If the total number of q-subsets from any subgroup is greater than 4,000 and the ratio of degenerate q-subsets is higher than the threshold specified in the FAILRATIO= option, the algorithm terminates with an error message.

R Square and Deviance

For the model with the intercept term, the robust version of R square for the S estimate is defined as

upper R Subscript upper S Superscript 2 Baseline equals 1 minus StartFraction left-parenthesis n minus p right-parenthesis upper S Subscript p Superscript 2 Baseline Over left-parenthesis n minus 1 right-parenthesis upper S Subscript mu Superscript 2 Baseline EndFraction

For the model without the intercept term, it is defined as

upper R Subscript upper S Superscript 2 Baseline equals 1 minus StartFraction left-parenthesis n minus p right-parenthesis upper S Subscript p Superscript 2 Baseline Over n upper S 0 squared EndFraction

In both cases, upper S Subscript p is the S estimate of the scale in the full model, upper S Subscript mu is the S estimate of the scale in the regression model with only the intercept term, and upper S 0 is the S estimate of the scale without any regressor. The deviance D is defined as the optimal value of the objective function on the sigma squared scale:

upper D equals upper S Subscript p Superscript 2
Asymptotic Covariance and Confidence Intervals

Because the S estimate satisfies the first-order necessary conditions as the M estimate, it has the same asymptotic covariance as the M estimate. All three estimators of the asymptotic covariance for the M estimate in the section Asymptotic Covariance and Confidence Intervals can be used for the S estimate. Besides, the weighted covariance estimator H4 that is described in the section Asymptotic Covariance and Confidence Intervals is also available and is set as the default. Confidence intervals for estimated parameters are computed from the diagonal elements of the estimated asymptotic covariance matrix.

Last updated: December 09, 2022