Multivariate Analysis Of Variance

Data science nuggets

Introduction

  • my previous blog talked about comparing two means using T.TEST and as well as comparing more than 2 means with ANOVA

RECAP ANOVA

  • for this one we said consider yourself having having one dependent variable and one or more independent variables which are factors
day 5 day 20 day 35
10 15 35
15 30 40
25 20 50
15 15 43
20 23 45
18 20 40
103 123 253
  • the above dataset as it stands might look a bit weird but once we change it to long format we get
library(tidyverse)
df<-tribble(~DAY_5 ,~DAY_20,~DAY_35,
    10, 15, 35,
    15, 30, 40,
    25, 20, 50,
    15, 15, 43,
    20, 23, 45,
    18, 20, 40)
df |> 
  gather("day","time") |> 
  knitr::kable()
day time
DAY_5 10
DAY_5 15
DAY_5 25
DAY_5 15
DAY_5 20
DAY_5 18
DAY_20 15
DAY_20 30
DAY_20 20
DAY_20 15
DAY_20 23
DAY_20 20
DAY_35 35
DAY_35 40
DAY_35 50
DAY_35 43
DAY_35 45
DAY_35 40
  • now we have time as the dependent variable and day as the independent variable
  • so we generally assume the following assumptions

assumptions

  1. Univariate Normal distribution
    The residuals of ANOVA model can once again be visualised in the normal QQplot. If the residuals lie linearly on the 1:1 line of the QQplot, they can be considered as normally distributed. If not, the ANOVA results cannot be interpreted.
  2. Homoscedasticity
    To be valid, ANOVA must be performed on models with homogeneous variance of the residuals. This homoscedasticity can be verified using either the residuals vs fitted plot or the scale-location plot of the diagnostic plots.

Univariate Normal

Random variable \(X\) is distributed \(X \sim N(\mu, \sigma^2)\) if

\[f(X)=\frac{{1}}{{\sigma \sqrt{{2\pi}}}}e^{-.5(\frac{{x-\mu}}{{\sigma}})^2}\].

MANOVA

Multivariate Normal Distribution

Multivariate Normal Distribution

Let \(\mathbf{y}\) be a multivariate normal (MVN) random variable with mean \(\mu\) and variance \(\mathbf{\Sigma}\). Then the density of \(\mathbf{y}\) is

\[ f(\mathbf{y}) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp(-\frac{1}{2} \mathbf{(y-\mu)'\Sigma^{-1}(y-\mu)} ) \]

\(\mathbf{y} \sim N_p(\mu, \mathbf{\Sigma})\)

Properties of MVN

  • Let \(\mathbf{A}_{r \times p}\) be a fixed matrix. Then \(\mathbf{Ay} \sim N_r (\mathbf{A \mu, A \Sigma A'})\) . \(r \le p\) and all rows of \(\mathbf{A}\) must be linearly independent to guarantee that \(\mathbf{A \Sigma A}'\) is non-singular.

  • Let \(\mathbf{G}\) be a matrix such that \(\mathbf{\Sigma}^{-1} = \mathbf{GG}'\). Then \(\mathbf{G'y} \sim N_p(\mathbf{G' \mu, I})\) and \(\mathbf{G'(y-\mu)} \sim N_p (0,\mathbf{I})\)

  • Any fixed linear combination of \(y_1,...,y_p\) (say \(\mathbf{c'y}\)) follows \(\mathbf{c'y} \sim N_1 (\mathbf{c' \mu, c' \Sigma c})\)

Multivariate Analysis of Variance(MANOVA)

One-way MANOVA

  • Compare treatment means for h different populations
  • now we have more than one dependent variables

Population 1: \(\mathbf{y}_{11}, \mathbf{y}_{12}, \dots, \mathbf{y}_{1n_1} \sim idd N_p (\mathbf{\mu}_1, \mathbf{\Sigma})\)

\(\vdots\)

Population h: \(\mathbf{y}_{h1}, \mathbf{y}_{h2}, \dots, \mathbf{y}_{hn_h} \sim idd N_p (\mathbf{\mu}_h, \mathbf{\Sigma})\)


  • Using the ANOVA example , lets say we now have other than time we could also have the weight and the height of the the lab rats at those days.

Assumptions

  1. Independent random samples from \(h\) different populations
  2. Common covariance matrices
  3. Each population is multivariate normal

Calculate the summary statistics \(\mathbf{\bar{y}}_i, \mathbf{S}\) and the pooled estimate of the covariance matrix \(\mathbf{S}\)

Similar to the univariate one-way ANVOA, we can use the effects model formulation \(\mathbf{\mu}_i = \mathbf{\mu} + \mathbf{\tau}_i\), where

  • \(\mathbf{\mu}_i\) is the population mean for population i

  • \(\mathbf{\mu}\) is the overall mean effect

  • \(\mathbf{\tau}_i\) is the treatment effect of the i-th treatment.

For the one-way model: \(\mathbf{y}_{ij} = \mu + \tau_i + \epsilon_{ij}\) for \(i = 1,..,h; j = 1,..., n_i\) and \(\epsilon_{ij} \sim N_p(\mathbf{0, \Sigma})\)

However, the above model is over-parameterized (i.e., infinite number of ways to define \(\mathbf{\mu}\) and the \(\mathbf{\tau}_i\)’s such that they add up to \(\mu_i\). Thus we can constrain by having

\[ \sum_{i=1}^h n_i \tau_i = 0 \]

or

\[ \mathbf{\tau}_h = 0 \]

The observational equivalent of the effects model is

\[ \begin{aligned} \mathbf{y}_{ij} &= \mathbf{\bar{y}} + (\mathbf{\bar{y}}_i - \mathbf{\bar{y}}) + (\mathbf{y}_{ij} - \mathbf{\bar{y}}_i) \\ &= \text{overall sample mean} + \text{treatement effect} + \text{residual} \text{ (under univariate ANOVA)} \end{aligned} \]

After manipulation

\[ \sum_{i = 1}^h \sum_{j = 1}^{n_i} (\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})(\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})' = \sum_{i = 1}^h n_i (\mathbf{\bar{y}}_i - \mathbf{\bar{y}})(\mathbf{\bar{y}}_i - \mathbf{\bar{y}})' + \\ \sum_{i=1}^h \sum_{j = 1}^{n_i} (\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})(\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}}_i)' \]

LHS = Total corrected sums of squares and cross products (SSCP) matrix

RHS =

  • 1st term = Treatment (or between subjects) sum of squares and cross product matrix (denoted H;B)

  • 2nd term = residual (or within subject) SSCP matrix denoted (E;W)

Note:

\[ \mathbf{E} = (n_1 - 1)\mathbf{S}_1 + ... + (n_h -1) \mathbf{S}_h = (n-h) \mathbf{S} \]

MANOVA table

MONOVA table
Source SSCP df
Treatment \(\mathbf{H}\) \(h -1\)
Residual (error) \(\mathbf{E}\) \(\sum_{i= 1}^h n_i - h\)
Total Corrected \(\mathbf{H + E}\) \(\sum_{i=1}^h n_i -1\)

\[ H_0: \tau_1 = \tau_2 = \dots = \tau_h = \mathbf{0} \]

We consider the relative “sizes” of \(\mathbf{E}\) and \(\mathbf{H+E}\)

Wilk’s Lambda

Define Wilk’s Lambda

\[ \Lambda^* = \frac{|\mathbf{E}|}{|\mathbf{H+E}|} \]

Properties:

  1. Wilk’s Lambda is equivalent to the F-statistic in the univariate case

  2. The exact distribution of \(\Lambda^*\) can be determined for especial cases.

  3. For large sample sizes, reject \(H_0\) if

\[ -(\sum_{i=1}^h n_i - 1 - \frac{p+h}{2}) \log(\Lambda^*) > \chi^2_{(1-\alpha, p(h-1))} \]

MANOVA IN R

suppose for the labrats , we now give the rats different drugs now and test them on conditions y1,y2,y3 and y4

library(tidyverse)
## Read in Data
lab <- read.table("images/labrats.dat")
names(lab) <- c("drug", "y1", "y2", "y3", "y4")
## Create a subject ID nested within drug

lab <- lab %>%
    group_by(drug) %>%
    mutate(subject = row_number()) %>%
    ungroup()
str(lab)
#> tibble [24 x 6] (S3: tbl_df/tbl/data.frame)
#>  $ drug   : chr [1:24] "ax23" "ax23" "ax23" "ax23" ...
#>  $ y1     : int [1:24] 72 78 71 72 66 74 62 69 85 82 ...
#>  $ y2     : int [1:24] 86 83 82 83 79 83 73 75 86 86 ...
#>  $ y3     : int [1:24] 81 88 81 83 77 84 78 76 83 80 ...
#>  $ y4     : int [1:24] 77 82 75 69 66 77 70 70 80 84 ...
#>  $ subject: int [1:24] 1 2 3 4 5 6 7 8 1 2 ...

## Create means summary for profile plot, pivot longer for plotting with ggplot
lab_means <- lab %>%
    group_by(drug) %>%
    summarize_at(vars(starts_with("y")), mean) %>%
    ungroup() %>%
    pivot_longer(-drug, names_to = "subject", values_to = "mean") %>%
    mutate(subject = as.numeric(as.factor(subject)))
gg_profile <- ggplot(lab_means, aes(x = subject, y = mean)) +
    geom_line(aes(col = drug)) +
    geom_point(aes(col = drug)) +
    ggtitle("Profile Plot") +
    scale_y_continuous(name = "Response") +
    scale_x_discrete(name = "subject")
gg_profile

visualise boxplots

lab |> 
  select(-subject) |> 
  gather("response","value",-drug) |> 
  ggplot(aes(x=drug,y=value,fill=drug))+
  geom_boxplot()+
  facet_wrap(~response,scales="free")+
  tvthemes::scale_fill_avatar()

Fitting A manova in R

Using manova()

lm_mod <- manova(cbind(y1, y2, y3, y4) ~ drug, data = lab)
summary(lm_mod)
#>           Df Pillai approx F num Df den Df    Pr(>F)    
#> drug       2 1.2835   8.5081      8     38 1.501e-06 ***
#> Residuals 21                                            
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

look to see which ones differ

summary.aov(lm_mod)
#>  Response y1 :
#>             Df Sum Sq Mean Sq F value   Pr(>F)   
#> drug         2    567 283.500  9.2878 0.001289 **
#> Residuals   21    641  30.524                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>  Response y2 :
#>             Df Sum Sq Mean Sq F value   Pr(>F)   
#> drug         2 569.08 284.542  7.2528 0.004029 **
#> Residuals   21 823.88  39.232                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>  Response y3 :
#>             Df Sum Sq Mean Sq F value   Pr(>F)   
#> drug         2 391.08 195.542  6.2609 0.007368 **
#> Residuals   21 655.88  31.232                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>  Response y4 :
#>             Df Sum Sq Mean Sq F value  Pr(>F)  
#> drug         2  316.0 158.000  4.9192 0.01769 *
#> Residuals   21  674.5  32.119                  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comments

from the output above ,we can see that the responses are highly significantly different among drugs since all p-values are less than 5%

using lm

## Fit model
lab_mod <- lm(cbind(y1, y2, y3, y4) ~ drug, data = lab)
man_fit <- car::Anova(lab_mod)
summary(man_fit)
#> 
#> Type II MANOVA Tests:
#> 
#> Sum of squares and products for error:
#>        y1      y2      y3     y4
#> y1 641.00 601.750 535.250 426.00
#> y2 601.75 823.875 615.500 534.25
#> y3 535.25 615.500 655.875 555.25
#> y4 426.00 534.250 555.250 674.50
#> 
#> ------------------------------------------
#>  
#> Term: drug 
#> 
#> Sum of squares and products for the hypothesis:
#>        y1       y2       y3    y4
#> y1 567.00 335.2500  42.7500 387.0
#> y2 335.25 569.0833 404.5417 367.5
#> y3  42.75 404.5417 391.0833 171.0
#> y4 387.00 367.5000 171.0000 316.0
#> 
#> Multivariate Tests: drug
#>                  Df test stat  approx F num Df den Df     Pr(>F)    
#> Pillai            2  1.283456  8.508082      8     38 1.5010e-06 ***
#> Wilks             2  0.079007 11.509581      8     36 6.3081e-08 ***
#> Hotelling-Lawley  2  7.069384 15.022441      8     34 3.9048e-09 ***
#> Roy               2  6.346509 30.145916      4     19 5.4493e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

the output contains more information regarding the methods used