Introduction to Regression Procedures

Predicted and Residual Values

After the model has been fit, predicted and residual values are usually calculated, graphed, and output. The predicted values are calculated from the estimated regression equation; the raw residuals are calculated as the observed value minus the predicted value. Often other forms of residuals, such as studentized or cumulative residuals, are used for model diagnostics. Some procedures can calculate standard errors of residuals, predicted mean values, and individual predicted values.

Consider the ith observation, where $\mathbf{x}_i'$ is the row of regressors, $\hat{\boldsymbol{\beta}}$ is the vector of parameter estimates, and $s^2$ is the estimate of the residual variance (the mean squared error). The leverage value of the ith observation is defined as

$$
h_i = w_i \, \mathbf{x}_i' \left( \mathbf{X}' \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{x}_i
$$

where $\mathbf{X}$ is the design matrix for the observed data, $\mathbf{x}_i'$ is an arbitrary regressor vector (possibly but not necessarily a row of $\mathbf{X}$), $\mathbf{W}$ is a diagonal matrix of observed weights, and $w_i$ is the weight corresponding to $\mathbf{x}_i'$.
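The leverage definition can be sketched numerically as follows. This is a minimal illustration that assumes NumPy is available; the function name `leverages` and the toy data are hypothetical, and the sketch handles only regressor vectors that are rows of the design matrix.

```python
import numpy as np

def leverages(X, w):
    """Return h_i = w_i * x_i' (X'WX)^{-1} x_i for every row x_i of X.

    X is the n-by-p design matrix and w holds the n observation weights.
    (Hypothetical helper for illustration; not a SAS procedure.)
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    # X'WX, computed without building the diagonal matrix W explicitly
    xtwx = X.T @ (w[:, None] * X)
    # Column i of Z is (X'WX)^{-1} x_i; solving avoids an explicit inverse
    Z = np.linalg.solve(xtwx, X.T)
    # Row-wise quadratic form x_i' (X'WX)^{-1} x_i, then scale by w_i
    return w * np.einsum("ij,ji->i", X, Z)

# Toy example: intercept plus one regressor, unit weights
X_demo = [[1, 0], [1, 1], [1, 2], [1, 3]]
print(leverages(X_demo, [1, 1, 1, 1]))
```

For a full-rank design matrix the leverages of the analysis observations sum to the number of parameters, which gives a quick sanity check on the computation.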

Then the predicted mean and the standard error of the predicted mean are

$$
\begin{aligned}
\hat{y}_i &= \mathbf{x}_i' \hat{\boldsymbol{\beta}} \\
\mathrm{STDERR}(\hat{y}_i) &= \sqrt{s^2 h_i / w_i}
\end{aligned}
$$

The standard error of the individual (future) predicted value is

$$
\mathrm{STDERR}(y_i) = \sqrt{s^2 (1 + h_i) / w_i}
$$
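As a small sketch of the two standard errors, the functions below take the mean squared error, the leverage, and the weight as already-computed inputs. The function names and the numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def stderr_mean(s2, h, w=1.0):
    """Standard error of the predicted mean: sqrt(s^2 * h / w)."""
    return math.sqrt(s2 * h / w)

def stderr_individual(s2, h, w=1.0):
    """Standard error of an individual (future) predicted value:
    sqrt(s^2 * (1 + h) / w)."""
    return math.sqrt(s2 * (1.0 + h) / w)

# Illustrative inputs: s^2 = 4, h = 0.25, w = 1
print(stderr_mean(4.0, 0.25))        # sqrt(4 * 0.25) = 1.0
print(stderr_individual(4.0, 0.25))  # sqrt(4 * 1.25)
```

The squared standard errors differ by exactly $s^2 / w_i$, the estimated variance of the new observation itself, which is why the individual standard error is always the larger of the two.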

If the predictor vector $\mathbf{x}_i$ corresponds to an observation in the analysis data, then the raw residual for that observation and the standard error of the raw residual are defined as

$$
\begin{aligned}
\mathrm{RESID}_i &= y_i - \mathbf{x}_i' \hat{\boldsymbol{\beta}} \\
\mathrm{STDERR}(\mathrm{RESID}_i) &= \sqrt{s^2 (1 - h_i) / w_i}
\end{aligned}
$$

The studentized residual is the ratio of the raw residual to its estimated standard error. Symbolically,

$$
\mathrm{STUDENT}_i = \frac{\mathrm{RESID}_i}{\mathrm{STDERR}(\mathrm{RESID}_i)}
$$
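The raw residual, its standard error, and the studentized residual can be chained together as in the sketch below. The function name and the example numbers are hypothetical. The sketch assumes a leverage strictly less than 1, because a leverage of exactly 1 makes the residual standard error zero.

```python
import math

def studentized_residual(y, y_hat, s2, h, w=1.0):
    """Return (RESID, STDERR(RESID), STUDENT) for one analysis observation.

    Assumes 0 <= h < 1 so that the residual standard error is positive.
    """
    resid = y - y_hat                          # raw residual
    se_resid = math.sqrt(s2 * (1.0 - h) / w)   # its estimated standard error
    return resid, se_resid, resid / se_resid

# Illustrative inputs: y = 5, predicted 3, s^2 = 4, h = 0.75, w = 1
print(studentized_residual(5.0, 3.0, 4.0, 0.75))
```

Because the standard error shrinks as the leverage grows, a high-leverage observation with a modest raw residual can still have a large studentized residual.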

Two types of intervals provide a measure of confidence for prediction: the confidence interval for the mean value of the response, and the prediction (or forecasting) interval for an individual observation. As discussed in the section "Mean Squared Error" in Chapter 3, "Introduction to Statistical Modeling with SAS/STAT Software," both intervals are based on the mean squared error of predicting a target based on the result of the model fit. The expressions for the two intervals differ because the target of estimation is a constant for the confidence interval (the mean of an observation) but a random variable for the prediction interval (a new observation).

For example, you can construct a confidence interval for the ith observation that contains the true mean value of the response with probability $1 - \alpha$. The lower and upper limits of the confidence interval for the mean value are

$$
\begin{aligned}
\mathrm{LowerM} &= \mathbf{x}_i' \hat{\boldsymbol{\beta}} - t_{\alpha/2,\nu} \sqrt{s^2 h_i / w_i} \\
\mathrm{UpperM} &= \mathbf{x}_i' \hat{\boldsymbol{\beta}} + t_{\alpha/2,\nu} \sqrt{s^2 h_i / w_i}
\end{aligned}
$$

where $t_{\alpha/2,\nu}$ is the tabulated t quantile with degrees of freedom equal to the degrees of freedom for the mean squared error, $\nu = n - \mathrm{rank}(\mathbf{X})$.

The limits for the prediction interval for an individual response are

$$
\begin{aligned}
\mathrm{LowerI} &= \mathbf{x}_i' \hat{\boldsymbol{\beta}} - t_{\alpha/2,\nu} \sqrt{s^2 (1 + h_i) / w_i} \\
\mathrm{UpperI} &= \mathbf{x}_i' \hat{\boldsymbol{\beta}} + t_{\alpha/2,\nu} \sqrt{s^2 (1 + h_i) / w_i}
\end{aligned}
$$
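Both sets of limits can be computed from the same ingredients, as in the following sketch. The function name, the dictionary keys, and the numeric inputs are hypothetical. The critical value `t_crit` is passed in by the caller, for example from a t table or from a routine such as `scipy.stats.t.ppf(1 - alpha / 2, nu)`; the value 2.0 used here is only a placeholder, not a real quantile.

```python
import math

def interval_limits(y_hat, s2, h, t_crit, w=1.0):
    """Confidence limits for the mean and prediction limits for an
    individual response, given the predicted value y_hat, the mean
    squared error s2, the leverage h, the weight w, and the t quantile."""
    half_mean = t_crit * math.sqrt(s2 * h / w)          # half-width for the mean
    half_ind = t_crit * math.sqrt(s2 * (1.0 + h) / w)   # half-width for an individual
    return {
        "LowerM": y_hat - half_mean,
        "UpperM": y_hat + half_mean,
        "LowerI": y_hat - half_ind,
        "UpperI": y_hat + half_ind,
    }

# Illustrative inputs: predicted value 10, s^2 = 4, h = 0.25, placeholder t = 2.0
print(interval_limits(10.0, 4.0, 0.25, 2.0))
```

Both intervals are centered at the same predicted value, and the prediction interval is always the wider of the two because of the extra 1 under the square root.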

Influential observations are those that, according to various criteria, appear to have a large influence on the analysis. One measure of influence, Cook’s D, measures the change to the estimates that results from deleting an observation,

$$
\mathrm{COOKD}_i = \frac{1}{k} \, \mathrm{STUDENT}_i^2 \left( \frac{\mathrm{STDERR}(\hat{y}_i)}{\mathrm{STDERR}(\mathrm{RESID}_i)} \right)^2
$$

where k is the number of parameters in the model (including the intercept). For more information, see Cook (1977, 1979).
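The sketch below evaluates Cook's D directly from the definition, using the standard errors given earlier in this section. The function name and the inputs are hypothetical. When the two standard-error formulas are substituted, the squared ratio reduces to h / (1 - h), so the result should agree with STUDENT^2 * h / (k * (1 - h)); the last lines check this numerically.

```python
import math

def cooks_d(resid, s2, h, k, w=1.0):
    """Cook's D for one analysis observation, following the definition:
    (1/k) * STUDENT^2 * (STDERR(y_hat) / STDERR(RESID))^2.
    Assumes 0 <= h < 1."""
    se_pred = math.sqrt(s2 * h / w)            # standard error of predicted mean
    se_resid = math.sqrt(s2 * (1.0 - h) / w)   # standard error of raw residual
    student = resid / se_resid                 # studentized residual
    return (student ** 2 / k) * (se_pred / se_resid) ** 2

# Illustrative inputs: residual 2, s^2 = 4, h = 0.5, k = 2 parameters
d = cooks_d(2.0, 4.0, 0.5, 2)
student_sq = 2.0 ** 2 / (4.0 * (1.0 - 0.5))
simplified = student_sq * 0.5 / (2 * (1.0 - 0.5))
print(d, simplified)
```

The simplified form makes the two drivers of influence explicit: a large studentized residual and a leverage close to 1.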

The predicted residual for observation i is defined as the residual for the ith observation that results from dropping the ith observation from the parameter estimates. The sum of squares of predicted residual errors is called the PRESS statistic:

$$
\begin{aligned}
\mathrm{PRESID}_i &= \frac{\mathrm{RESID}_i}{1 - h_i} \\
\mathrm{PRESS} &= \sum_{i=1}^{n} w_i \, \mathrm{PRESID}_i^2
\end{aligned}
$$
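The shortcut formula means that the predicted residuals do not require refitting the model n times. The following sketch, which assumes NumPy and uses hypothetical function names and made-up data, computes PRESS both ways: once from the raw residuals and leverages, and once by explicitly dropping each observation and refitting by weighted least squares.

```python
import numpy as np

def wls_fit(X, y, w):
    """Weighted least squares estimate: solve (X'WX) b = X'Wy."""
    XtW = X.T * w                      # X'W via broadcasting over columns
    return np.linalg.solve(XtW @ X, XtW @ y)

def press_shortcut(X, y, w):
    """PRESID_i = RESID_i / (1 - h_i), PRESS = sum of w_i * PRESID_i^2."""
    resid = y - X @ wls_fit(X, y, w)
    Z = np.linalg.solve((X.T * w) @ X, X.T)
    h = w * np.einsum("ij,ji->i", X, Z)          # leverages
    presid = resid / (1.0 - h)
    return presid, float(np.sum(w * presid ** 2))

def press_refit(X, y, w):
    """Same quantities, computed by leaving out each observation in turn."""
    n = len(y)
    presid = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # mask that drops observation i
        beta_i = wls_fit(X[keep], y[keep], w[keep])
        presid[i] = y[i] - X[i] @ beta_i
    return presid, float(np.sum(w * presid ** 2))

# Made-up data: intercept plus one regressor, unequal weights
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.0, 2.9, 5.2, 6.8, 9.1])
w = np.array([1.0, 2.0, 1.0, 0.5, 1.0])
print(press_shortcut(X, y, w)[1], press_refit(X, y, w)[1])
```

The two approaches should agree up to floating-point error as long as every leave-one-out design matrix remains full rank.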

Last updated: December 09, 2022