Introduction to Analysis of Variance Procedures

From Sums of Squares to Linear Hypotheses

Analysis of variance (ANOVA) is a technique for analyzing data in which one or more response (or dependent or simply Y) variables are measured under various conditions identified by one or more classification variables. The combinations of levels for the classification variables form the cells of the design for the data. This design can be the result of a controlled experiment or the result of an observational study in which you observe factors and factor level combinations in an uncontrolled environment. For example, an experiment might measure weight change (the dependent variable) for men and women who participated in three different weight-loss programs. The six cells of the design are formed by the six combinations of gender (men, women) and program (A, B, C).

In an analysis of variance, the variation in the response is separated into variation attributable to differences between the classification variables and variation attributable to random error. An analysis of variance constructs tests to determine the significance of the classification effects. A typical goal in such an analysis is to compare means of the response variable for various combinations of the classification variables.

The least squares principle is central to computing sums of squares in analysis of variance models. Suppose that you are fitting the linear model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ and that the error terms satisfy the usual assumptions (uncorrelated, zero mean, homogeneous variance). Further, suppose that $\mathbf{X}$ is partitioned according to several model effects, $\mathbf{X} = [\,\mathbf{X}_1 \; \mathbf{X}_2 \; \cdots \; \mathbf{X}_k\,]$. If $\widehat{\boldsymbol{\beta}}$ denotes the ordinary least squares solution for this model, then the sum of squares attributable to the overall model can be written as

$$\mathrm{SSM} = \widehat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{Y} = \mathbf{Y}'\mathbf{H}\mathbf{Y}$$

where $\mathbf{H}$ is the "hat" matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$. (This model sum of squares is not yet corrected for the presence of an explicit or implied intercept. This adjustment would consist of subtracting $n\bar{Y}^2$ from $\mathrm{SSM}$.) Because of the properties of the hat matrix $\mathbf{H}$, you can write $\mathbf{X}' = \mathbf{X}'\mathbf{H}$ and $\mathbf{H}\mathbf{X} = \mathbf{X}$. The (uncorrected) model sum of squares thus can also be written as

$$\mathrm{SSM} = \widehat{\boldsymbol{\beta}}'(\mathbf{X}'\mathbf{X})\widehat{\boldsymbol{\beta}}$$
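As a quick numerical check, the following sketch verifies these identities. It is plain numpy rather than SAS code, and the design matrix and data are randomly generated purely for illustration:

```python
# Minimal sketch verifying the equivalent expressions for the
# (uncorrected) model sum of squares; data are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ rng.normal(size=p) + rng.normal(size=n)

XtX_inv = np.linalg.pinv(X.T @ X)        # Moore-Penrose pseudoinverse as (X'X)^-
beta_hat = XtX_inv @ X.T @ Y             # least squares solution
H = X @ XtX_inv @ X.T                    # "hat" matrix

ssm1 = beta_hat @ X.T @ Y                # beta_hat' X' Y
ssm2 = Y @ H @ Y                         # Y' H Y
ssm3 = beta_hat @ (X.T @ X) @ beta_hat   # beta_hat' (X'X) beta_hat
print(np.allclose(ssm1, ssm2), np.allclose(ssm1, ssm3))  # True True
# intercept-corrected model sum of squares: ssm1 - n * Y.mean()**2
```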

This step is significant because it demonstrates that sums of squares can be identified with quadratic forms in the least squares coefficients. The generalization of this idea is to do the following:

  • consider hypotheses of interest in an analysis of variance model

  • express the hypotheses in terms of linear estimable functions of the parameters

  • compute the sums of squares associated with the estimable functions

  • construct statistical tests based on the sums of squares

Decomposing a model sum of squares into sequential, additive components, testing the significance of experimental factors, comparing factor levels, and performing other statistical inferences fall within this generalization. Suppose that $\mathbf{L}\boldsymbol{\beta}$ is an estimable function (see the section Estimable Functions in Chapter 3, Introduction to Statistical Modeling with SAS/STAT Software, and Chapter 16, The Four Types of Estimable Functions, for details). The sum of squares associated with the hypothesis $H\colon \mathbf{L}\boldsymbol{\beta} = \mathbf{0}$ is

$$\mathrm{SS}(H) = \mathrm{SS}(\mathbf{L}\boldsymbol{\beta} = \mathbf{0}) = \widehat{\boldsymbol{\beta}}'\mathbf{L}'\left(\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'\right)^{-1}\mathbf{L}\widehat{\boldsymbol{\beta}}$$

One application is to form sums of squares associated with the different components of $\mathbf{X}$. For example, you can form a matrix $\mathbf{L}_2$ such that $\mathbf{L}_2\boldsymbol{\beta} = \mathbf{0}$ tests the effect of adding the columns of $\mathbf{X}_2$ to an empty model, or to a model that already contains $\mathbf{X}_1$.
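The quadratic form above can be evaluated directly. The following is a minimal sketch in numpy, not a SAS procedure; the design, the simulated data, and the particular hypothesis matrix $\mathbf{L}$ (which tests whether the last two coefficients are zero) are illustrative assumptions:

```python
# Hedged sketch: evaluate SS(H) for the hypothesis L*beta = 0 on simulated
# data. Here L plays the role of an L2-type matrix, testing whether the
# last two columns of X contribute anything to the model.
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 0.5, -0.5, 0.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.pinv(X.T @ X)        # (X'X)^-; this X has full rank
beta_hat = XtX_inv @ X.T @ Y

L = np.zeros((2, p))
L[0, 3] = L[1, 4] = 1.0                  # hypothesis: beta_3 = beta_4 = 0
ss_h = beta_hat @ L.T @ np.linalg.inv(L @ XtX_inv @ L.T) @ L @ beta_hat
print(ss_h)
```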

These sums of squares can also be expressed as the difference between two residual sums of squares, since $\mathbf{L}\boldsymbol{\beta} = \mathbf{0}$ can be thought of as a (linear) restriction on the parameter estimates in the model:

$$\mathrm{SS}(H) = \mathrm{SSR}(\text{constrained model}) - \mathrm{SSR}(\text{full model})$$
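The next sketch confirms this equivalence numerically, using the same illustrative data and hypothetical $\mathbf{L}$ as above; imposing $\beta_3 = \beta_4 = 0$ amounts to dropping the corresponding columns of $\mathbf{X}$:

```python
# Numerical check (illustrative data) that SS(H) equals the difference in
# residual sums of squares between the constrained model (beta_3 = beta_4
# = 0 imposed, i.e., columns 3 and 4 dropped) and the full model.
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 0.5, -0.5, 0.0, 0.0]) + rng.normal(size=n)

def ssr(Xm):
    # residual sum of squares after projecting Y onto the columns of Xm
    H = Xm @ np.linalg.pinv(Xm.T @ Xm) @ Xm.T
    r = Y - H @ Y
    return r @ r

XtX_inv = np.linalg.pinv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
L = np.zeros((2, p))
L[0, 3] = L[1, 4] = 1.0
ss_h = beta_hat @ L.T @ np.linalg.inv(L @ XtX_inv @ L.T) @ L @ beta_hat

print(np.allclose(ss_h, ssr(X[:, :3]) - ssr(X)))  # True
```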

If, in addition to the usual assumptions mentioned previously, the model errors are assumed to be normally distributed, then $\mathrm{SS}(H)$ follows a distribution that is proportional to a chi-square distribution. This fact, and the independence of $\mathrm{SS}(H)$ from the residual sum of squares, enables you to construct $F$ tests based on sums of squares in least squares models.
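In the full-rank case this construction takes the familiar form: with $\mathrm{SSE}$ denoting the residual sum of squares of the full model and $r = \mathrm{rank}(\mathbf{L})$,

$$F = \frac{\mathrm{SS}(H)/r}{\mathrm{SSE}/(n - \mathrm{rank}(\mathbf{X}))}$$

follows an $F$ distribution with $r$ numerator and $n - \mathrm{rank}(\mathbf{X})$ denominator degrees of freedom when the hypothesis is true.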

The extension of sum-of-squares analysis of variance to general analysis of variance for classification effects depends on the fact that the distributional properties of quadratic forms in normal random variables are well understood. It is not necessary to first formulate a sum of squares to arrive at an exact or even approximate $F$ test. The generalization of the expression for $\mathrm{SS}(H)$ is to form test statistics based on quadratic forms

$$\widehat{\boldsymbol{\beta}}'\mathbf{L}'\,\mathrm{Var}\!\left[\mathbf{L}\widehat{\boldsymbol{\beta}}\right]^{-1}\mathbf{L}\widehat{\boldsymbol{\beta}}$$

that follow a chi-square distribution if $\widehat{\boldsymbol{\beta}}$ is normally distributed.
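To see the connection with the least squares case: under the OLS assumptions, $\mathrm{Var}[\mathbf{L}\widehat{\boldsymbol{\beta}}] = \sigma^2\,\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'$, so the quadratic form above reduces to $\mathrm{SS}(H)/\sigma^2$. In more general models, an estimate of $\mathrm{Var}[\mathbf{L}\widehat{\boldsymbol{\beta}}]$ takes its place, yielding Wald-type statistics whose chi-square reference distributions are exact or approximate depending on the model.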
