Introduction to Statistical Modeling with SAS/STAT Software

Univariate and Multivariate Models

A multivariate statistical model is a model in which multiple response variables are modeled jointly. Suppose, for example, that your data consist of heights ($h_i$) and weights ($w_i$) of children, collected over several years ($t_i$). The following separate regressions represent two univariate models:

$$
\begin{aligned}
w_i &= \beta_{w0} + \beta_{w1} t_i + \epsilon_{wi} \\
h_i &= \beta_{h0} + \beta_{h1} t_i + \epsilon_{hi}
\end{aligned}
$$
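As an illustration only, the two univariate regressions above can be fit one at a time by ordinary least squares. The following is a minimal Python sketch, not SAS/STAT code; the helper name `fit_line` and all data values are hypothetical and chosen only to show that each fit uses its own response and ignores the other.

```python
# Minimal sketch: fit the two univariate regressions separately by
# ordinary least squares. All data values below are made up.

def fit_line(t, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * t."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    # Centered cross product and sum of squares
    s_ty = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    slope = s_ty / s_tt
    intercept = y_bar - slope * t_bar
    return intercept, slope

# Hypothetical measurements: time in years, weight in kg, height in cm.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [10.1, 12.3, 14.2, 16.4, 18.1]
heights = [75.0, 86.5, 95.0, 102.5, 109.0]

# Each call sees only one response; nothing is shared between the fits.
beta_w0, beta_w1 = fit_line(years, weights)
beta_h0, beta_h1 = fit_line(years, heights)

print("weight model:", beta_w0, beta_w1)
print("height model:", beta_h0, beta_h1)
```

Changing the values in `heights` has no effect on the weight estimates, because the two fits never see each other's data.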

In the univariate setting, no information about the children’s heights "flows" to the model about their weights and vice versa. In a multivariate setting, the heights and weights would be modeled jointly. For example:

$$
\begin{aligned}
\mathbf{Y}_i = \begin{bmatrix} w_i \\ h_i \end{bmatrix} &= \mathbf{X}\boldsymbol{\beta} + \begin{bmatrix} \epsilon_{wi} \\ \epsilon_{hi} \end{bmatrix} \\
&= \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}_i \\
\boldsymbol{\epsilon}_i &\sim \left( \mathbf{0}, \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} \right)
\end{aligned}
$$

The vectors $\mathbf{Y}_i$ and $\boldsymbol{\epsilon}_i$ collect the responses and errors for the two observations that belong to the same subject. The errors from the same child now have the correlation

$$
\mathrm{Corr}[\epsilon_{wi}, \epsilon_{hi}] = \frac{\sigma_{12}}{\sqrt{\sigma_1^2 \sigma_2^2}}
$$

and it is through this correlation that information about heights "flows" to the weights and vice versa. This simple example shows only one approach to modeling multivariate data, through the use of covariance structures. Other techniques involve seemingly unrelated regressions, systems of linear equations, and so on.
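To make the role of the correlation concrete, the following Python sketch computes it from the three covariance parameters. It also evaluates the best linear predictor of the weight error given the height error, which for zero-mean errors is the covariance divided by the height variance, times the height error. The function names and the numeric values are hypothetical; this is an illustration of the formula, not of how any SAS/STAT procedure estimates the parameters.

```python
import math

def error_correlation(sigma1_sq, sigma2_sq, sigma12):
    """Correlation between the weight and height errors of one child."""
    return sigma12 / math.sqrt(sigma1_sq * sigma2_sq)

def predicted_weight_error(eps_h, sigma2_sq, sigma12):
    """Best linear predictor of the weight error given the height error,
    assuming both errors have mean zero."""
    return sigma12 / sigma2_sq * eps_h

# Hypothetical covariance parameters: weight variance, height variance,
# and their covariance.
sigma1_sq, sigma2_sq, sigma12 = 4.0, 9.0, 3.0

rho = error_correlation(sigma1_sq, sigma2_sq, sigma12)
print(rho)  # 0.5

# With a zero covariance the height error carries no information about
# the weight error, so the prediction stays at zero.
print(predicted_weight_error(1.5, sigma2_sq, sigma12))
print(predicted_weight_error(1.5, sigma2_sq, 0.0))
```

When the covariance is zero the joint model reduces to the two separate regressions, which is the sense in which no information flows in the univariate setting.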

Multivariate data can be coarsely classified into three types. The response vectors of homogeneous multivariate data consist of observations of the same attribute. Such data are common in repeated measures experiments and longitudinal studies, where the same attribute is measured repeatedly over time. Homogeneous multivariate data also arise in spatial statistics where a set of geostatistical data is the incomplete observation of a single realization of a random experiment that generates a two-dimensional surface. One hundred measurements of soil electrical conductivity collected in a forest stand compose a single observation of a 100-dimensional homogeneous multivariate vector.

Heterogeneous multivariate observations arise when the responses that are modeled jointly refer to different attributes, such as in the previous example of children’s weights and heights. There are two important subtypes of heterogeneous multivariate data. In homocatanomic multivariate data the observations come from the same distributional family. For example, the weights and heights might both be assumed to be normally distributed. With heterocatanomic multivariate data the observations can come from different distributional families. The following are examples of heterocatanomic multivariate data:

  • For each patient you observe blood pressure (a continuous outcome), the number of prior episodes of an illness (a count variable), and whether the patient has a history of diabetes in the family (a binary outcome). A multivariate model that models the three attributes jointly might assume a lognormal distribution for the blood pressure measurements, a Poisson distribution for the count variable and a Bernoulli distribution for the family history.

  • In a study of HIV/AIDS survival, you model jointly a patient’s CD4 cell count over time—itself a homogeneous multivariate outcome—and the survival of the patient (event-time data).
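The classification described above can be summarized as a small decision rule. The following Python sketch is one possible encoding, assuming each component of a response vector is described by a hypothetical (attribute, distributional family) pair; the function name and the labels are illustrative and are not part of SAS/STAT software.

```python
def classify_response_vector(components):
    """Classify a response vector given (attribute, family) pairs.

    Same attribute throughout         -> homogeneous
    Different attributes, one family  -> heterogeneous, homocatanomic
    Different attributes and families -> heterogeneous, heterocatanomic
    """
    if not components:
        raise ValueError("a response vector needs at least one component")
    attributes = {attribute for attribute, _ in components}
    families = {family for _, family in components}
    if len(attributes) == 1:
        return "homogeneous"
    if len(families) == 1:
        return "heterogeneous, homocatanomic"
    return "heterogeneous, heterocatanomic"

# Soil conductivity measured at 100 locations: one attribute throughout.
soil = [("conductivity", "normal")] * 100

# Children's weights and heights, both assumed normal.
children = [("weight", "normal"), ("height", "normal")]

# The patient example: three attributes, three distributional families.
patient = [
    ("blood pressure", "lognormal"),
    ("prior episodes", "Poisson"),
    ("family history of diabetes", "Bernoulli"),
]

for name, vector in [("soil", soil), ("children", children), ("patient", patient)]:
    print(name, "->", classify_response_vector(vector))
```

The rule checks the attribute first, so repeated measurements of a single attribute are always labeled homogeneous.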

Last updated: December 09, 2022