Introduction to Statistical Modeling with SAS/STAT Software

Analysis of Variance

The identity

$$ \mathbf{Y} = \mathbf{X}\tilde{\boldsymbol{\beta}} + (\mathbf{Y} - \mathbf{X}\tilde{\boldsymbol{\beta}}) $$

holds for all vectors $\tilde{\boldsymbol{\beta}}$, but only for the least squares solution is the residual $(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}})$ orthogonal to the predicted value $\mathbf{X}\widehat{\boldsymbol{\beta}}$. Because of this orthogonality, the additive identity holds not only for the vectors themselves, but also for their lengths (Pythagorean theorem):

$$ \|\mathbf{Y}\|^2 = \|\mathbf{X}\widehat{\boldsymbol{\beta}}\|^2 + \|\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2 $$

Note that $\mathbf{X}\widehat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y}$ and note that $\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}} = (\mathbf{I} - \mathbf{H})\mathbf{Y} = \mathbf{M}\mathbf{Y}$. The matrices $\mathbf{H}$ and $\mathbf{M} = \mathbf{I} - \mathbf{H}$ play an important role in the theory of linear models and in statistical computations. Both are projection matrices; that is, they are symmetric and idempotent. (An idempotent matrix $\mathbf{A}$ is a square matrix that satisfies $\mathbf{A}\mathbf{A} = \mathbf{A}$. The eigenvalues of an idempotent matrix take on the values 1 and 0 only.) The matrix $\mathbf{H}$ projects onto the subspace of $R^n$ that is spanned by the columns of $\mathbf{X}$. The matrix $\mathbf{M}$ projects onto the orthogonal complement of that space. Because of these properties you have $\mathbf{H}' = \mathbf{H}$, $\mathbf{H}\mathbf{H} = \mathbf{H}$, $\mathbf{M}' = \mathbf{M}$, $\mathbf{M}\mathbf{M} = \mathbf{M}$, and $\mathbf{H}\mathbf{M} = \mathbf{0}$.
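
In the full-rank case these properties follow directly from the definition of $\mathbf{H}$ and the symmetry of $(\mathbf{X}'\mathbf{X})^{-1}$; a brief verification:

$$ \mathbf{H}' = \left(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right)' = \mathbf{X}\left((\mathbf{X}'\mathbf{X})^{-1}\right)'\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{H} $$
$$ \mathbf{H}\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\,\mathbf{X}'\mathbf{X}\,(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{H}, \qquad \mathbf{H}\mathbf{M} = \mathbf{H}(\mathbf{I} - \mathbf{H}) = \mathbf{H} - \mathbf{H}\mathbf{H} = \mathbf{0} $$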

The Pythagorean relationship now can be written in terms of bold upper H and bold upper M as follows:

$$ \|\mathbf{Y}\|^2 = \mathbf{Y}'\mathbf{Y} = \|\mathbf{H}\mathbf{Y}\|^2 + \|\mathbf{M}\mathbf{Y}\|^2 = \mathbf{Y}'\mathbf{H}'\mathbf{H}\mathbf{Y} + \mathbf{Y}'\mathbf{M}'\mathbf{M}\mathbf{Y} = \mathbf{Y}'\mathbf{H}\mathbf{Y} + \mathbf{Y}'\mathbf{M}\mathbf{Y} $$
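
As an illustration, the following PROC IML statements (using a small, purely hypothetical design matrix and response vector) form $\mathbf{H}$ and $\mathbf{M}$, check the projection properties numerically, and verify the decomposition $\mathbf{Y}'\mathbf{Y} = \mathbf{Y}'\mathbf{H}\mathbf{Y} + \mathbf{Y}'\mathbf{M}\mathbf{Y}$:

proc iml;
   /* hypothetical full-rank design (intercept and one regressor) and response */
   X = {1 1, 1 2, 1 3, 1 4, 1 5};
   Y = {2, 3, 5, 4, 7};
   n = nrow(X);

   H = X * inv(X`*X) * X`;      /* projects onto the column space of X     */
   M = I(n) - H;                /* projects onto its orthogonal complement */

   /* projection properties: these differences should all be essentially zero */
   print (max(abs(H - H`)))[label="max |H-H`|"]
         (max(abs(H*H - H)))[label="max |HH-H|"]
         (max(abs(H*M)))[label="max |HM|"];

   /* Pythagorean decomposition of the total sum of squares */
   print (Y`*Y)[label="Y`Y"] (Y`*H*Y)[label="Y`HY"] (Y`*M*Y)[label="Y`MY"];
quit;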

If $\mathbf{X}'\mathbf{X}$ is deficient in rank and a generalized inverse is used to solve the normal equations, then you work instead with the projection matrices $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ and $\mathbf{M} = \mathbf{I} - \mathbf{H}$. Note that if $\mathbf{G}$ is a generalized inverse of $\mathbf{X}'\mathbf{X}$, then $\mathbf{X}\mathbf{G}\mathbf{X}'$, and hence also $\mathbf{H}$ and $\mathbf{M}$, are invariant to the choice of $\mathbf{G}$.
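
This invariance can be checked numerically. In the sketch below (a hypothetical rank-deficient design built by duplicating a column), $\mathbf{H}$ is computed from the Moore-Penrose generalized inverse returned by the GINV function and compared with the hat matrix computed from a full-rank subset of the columns; the two agree because both project onto the same column space:

proc iml;
   /* rank-deficient design: the third column duplicates the second */
   X  = {1 1 1, 1 2 2, 1 3 3, 1 4 4, 1 5 5};
   Xf = X[, 1:2];                       /* full-rank subset of the columns       */

   H_g = X  * ginv(X`*X) * X`;          /* hat matrix from a generalized inverse */
   H_f = Xf * inv(Xf`*Xf) * Xf`;        /* hat matrix from the full-rank subset  */

   /* both project onto the same column space, so the difference is ~0 */
   print (max(abs(H_g - H_f)))[label="max |H_g - H_f|"];
quit;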

The matrix $\mathbf{H}$ is sometimes referred to as the "hat" matrix because when you premultiply the vector of observations with $\mathbf{H}$, you produce the fitted values, which are commonly denoted by placing a "hat" over the $\mathbf{Y}$ vector, $\widehat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$.

The term $\mathbf{Y}'\mathbf{Y}$ is the uncorrected total sum of squares (SST) of the linear model, $\mathbf{Y}'\mathbf{M}\mathbf{Y}$ is the error (residual) sum of squares (SSR), and $\mathbf{Y}'\mathbf{H}\mathbf{Y}$ is the uncorrected model sum of squares. This leads to the analysis of variance table shown in Table 2.

Table 2: Analysis of Variance with Uncorrected Sums of Squares

Source          df                                Sum of Squares
Model           $\mathrm{rank}(\mathbf{X})$       $SSM = \mathbf{Y}'\mathbf{H}\mathbf{Y} = \widehat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y}$
Residual        $n - \mathrm{rank}(\mathbf{X})$   $SSR = \mathbf{Y}'\mathbf{M}\mathbf{Y} = \mathbf{Y}'\mathbf{Y} - \widehat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2$
Uncorr. Total   $n$                               $SST = \mathbf{Y}'\mathbf{Y} = \sum_{i=1}^{n} Y_i^2$
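
The entries of Table 2 can be computed directly from $\mathbf{H}$ and $\mathbf{M}$. The following PROC IML sketch (same hypothetical data as above) does so, using the fact that the trace of a projection matrix equals its rank:

proc iml;
   X = {1 1, 1 2, 1 3, 1 4, 1 5};       /* hypothetical design matrix */
   Y = {2, 3, 5, 4, 7};                 /* hypothetical response      */
   n = nrow(X);

   H = X * inv(X`*X) * X`;
   M = I(n) - H;
   rankX = round(trace(H));             /* trace of a projection matrix = its rank */

   SSM = Y` * H * Y;                    /* uncorrected model sum of squares */
   SSR = Y` * M * Y;                    /* residual sum of squares          */
   SST = Y` * Y;                        /* uncorrected total sum of squares */

   print rankX (n - rankX)[label="n-rank(X)"] SSM SSR SST;
quit;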


When the model contains an intercept term, the analysis of variance is usually corrected for the mean, as shown in Table 3.

Table 3: Analysis of Variance with Corrected Sums of Squares

Source            df                                    Sum of Squares
Model             $\mathrm{rank}(\mathbf{X}) - 1$       $SSM_c = \widehat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} - n\bar{Y}^2 = \sum_{i=1}^{n} (\widehat{Y}_i - \bar{Y})^2$
Residual          $n - \mathrm{rank}(\mathbf{X})$       $SSR = \mathbf{Y}'\mathbf{M}\mathbf{Y} = \mathbf{Y}'\mathbf{Y} - \widehat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2$
Corrected Total   $n - 1$                               $SST_c = \mathbf{Y}'\mathbf{Y} - n\bar{Y}^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
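
In practice these corrected sums of squares are produced automatically by the modeling procedures. For example, the following PROC REG step (with a small hypothetical data set and variable names chosen here only for illustration) prints an analysis of variance table in the corrected form of Table 3, along with the R-square statistic discussed next:

data fit;
   input x y @@;                 /* small hypothetical data set */
   datalines;
1 2  2 3  3 5  4 4  5 7
;

proc reg data=fit;
   /* the intercept is included by default, so the ANOVA table that
      PROC REG prints uses the corrected sums of squares of Table 3 */
   model y = x;
run;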


The coefficient of determination, also called the R-square statistic, measures the proportion of the total variation explained by the linear model. In models with an intercept, it is defined as the ratio

$$ R^2 = 1 - \frac{SSR}{SST_c} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} $$

In models without an intercept, the R-square statistic is a ratio of the uncorrected sums of squares:

$$ R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2}{\sum_{i=1}^{n} Y_i^2} $$
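
As a check of these two definitions, the following PROC IML statements (hypothetical data again) compute the R-square statistic for a model with an intercept, using the corrected total sum of squares, and for a no-intercept model, using the uncorrected total sum of squares:

proc iml;
   x = {1, 2, 3, 4, 5};
   Y = {2, 3, 5, 4, 7};
   n = nrow(Y);

   /* model with intercept: corrected total sum of squares in the denominator */
   X1   = j(n, 1, 1) || x;              /* design matrix with intercept column */
   H1   = X1 * inv(X1`*X1) * X1`;
   SSR1 = Y` * (I(n) - H1) * Y;
   Ybar = sum(Y) / n;
   SSTc = Y`*Y - n*Ybar*Ybar;
   R2_int = 1 - SSR1 / SSTc;

   /* no-intercept model: uncorrected total sum of squares in the denominator */
   H0   = x * inv(x`*x) * x`;
   SSR0 = Y` * (I(n) - H0) * Y;
   R2_noint = 1 - SSR0 / (Y`*Y);

   print R2_int R2_noint;
quit;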