Multivariate Analysis Of Variance

Data science nuggets

Introduction

my previous blog talked about comparing two means using T.TEST and as well as comparing more than 2 means with ANOVA

RECAP ANOVA

for this one we said consider yourself having having one dependent variable and one or more independent variables which are factors

day 5	day 20	day 35
10	15	35
15	30	40
25	20	50
15	15	43
20	23	45
18	20	40
103	123	253

the above dataset as it stands might look a bit weird but once we change it to long format we get

library(tidyverse)
df<-tribble(~DAY_5 ,~DAY_20,~DAY_35,
    10, 15, 35,
    15, 30, 40,
    25, 20, 50,
    15, 15, 43,
    20, 23, 45,
    18, 20, 40)
df |> 
  gather("day","time") |> 
  knitr::kable()

day	time
DAY_5	10
DAY_5	15
DAY_5	25
DAY_5	15
DAY_5	20
DAY_5	18
DAY_20	15
DAY_20	30
DAY_20	20
DAY_20	15
DAY_20	23
DAY_20	20
DAY_35	35
DAY_35	40
DAY_35	50
DAY_35	43
DAY_35	45
DAY_35	40

now we have time as the dependent variable and day as the independent variable
so we generally assume the following assumptions

assumptions

Univariate Normal distribution
The residuals of ANOVA model can once again be visualised in the normal QQplot. If the residuals lie linearly on the 1:1 line of the QQplot, they can be considered as normally distributed. If not, the ANOVA results cannot be interpreted.
Homoscedasticity
To be valid, ANOVA must be performed on models with homogeneous variance of the residuals. This homoscedasticity can be verified using either the residuals vs fitted plot or the scale-location plot of the diagnostic plots.

Univariate Normal

Random variable \(X\) is distributed \(X \sim N(\mu, \sigma^2)\) if

\[f(X)=\frac{{1}}{{\sigma \sqrt{{2\pi}}}}e^{-.5(\frac{{x-\mu}}{{\sigma}})^2}\].

MANOVA

Multivariate Normal Distribution

Multivariate Normal Distribution

Let \(\mathbf{y}\) be a multivariate normal (MVN) random variable with mean \(\mu\) and variance \(\mathbf{\Sigma}\). Then the density of \(\mathbf{y}\) is

\[ f(\mathbf{y}) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp(-\frac{1}{2} \mathbf{(y-\mu)'\Sigma^{-1}(y-\mu)} ) \]

\(\mathbf{y} \sim N_p(\mu, \mathbf{\Sigma})\)

Properties of MVN

Let \(\mathbf{A}_{r \times p}\) be a fixed matrix. Then \(\mathbf{Ay} \sim N_r (\mathbf{A \mu, A \Sigma A'})\) . \(r \le p\) and all rows of \(\mathbf{A}\) must be linearly independent to guarantee that \(\mathbf{A \Sigma A}'\) is non-singular.
Let \(\mathbf{G}\) be a matrix such that \(\mathbf{\Sigma}^{-1} = \mathbf{GG}'\). Then \(\mathbf{G'y} \sim N_p(\mathbf{G' \mu, I})\) and \(\mathbf{G'(y-\mu)} \sim N_p (0,\mathbf{I})\)
Any fixed linear combination of \(y_1,...,y_p\) (say \(\mathbf{c'y}\)) follows \(\mathbf{c'y} \sim N_1 (\mathbf{c' \mu, c' \Sigma c})\)

Multivariate Analysis of Variance(MANOVA)

One-way MANOVA

Compare treatment means for h different populations
now we have more than one dependent variables

Population 1: \(\mathbf{y}_{11}, \mathbf{y}_{12}, \dots, \mathbf{y}_{1n_1} \sim idd N_p (\mathbf{\mu}_1, \mathbf{\Sigma})\)

\(\vdots\)

Population h: \(\mathbf{y}_{h1}, \mathbf{y}_{h2}, \dots, \mathbf{y}_{hn_h} \sim idd N_p (\mathbf{\mu}_h, \mathbf{\Sigma})\)

Using the ANOVA example , lets say we now have other than time we could also have the weight and the height of the the lab rats at those days.

Assumptions

Independent random samples from \(h\) different populations
Common covariance matrices
Each population is multivariate normal

Calculate the summary statistics \(\mathbf{\bar{y}}_i, \mathbf{S}\) and the pooled estimate of the covariance matrix \(\mathbf{S}\)

Similar to the univariate one-way ANVOA, we can use the effects model formulation \(\mathbf{\mu}_i = \mathbf{\mu} + \mathbf{\tau}_i\), where

\(\mathbf{\mu}_i\) is the population mean for population i
\(\mathbf{\mu}\) is the overall mean effect
\(\mathbf{\tau}_i\) is the treatment effect of the i-th treatment.

For the one-way model: \(\mathbf{y}_{ij} = \mu + \tau_i + \epsilon_{ij}\) for \(i = 1,..,h; j = 1,..., n_i\) and \(\epsilon_{ij} \sim N_p(\mathbf{0, \Sigma})\)

However, the above model is over-parameterized (i.e., infinite number of ways to define \(\mathbf{\mu}\) and the \(\mathbf{\tau}_i\)’s such that they add up to \(\mu_i\). Thus we can constrain by having

\[ \sum_{i=1}^h n_i \tau_i = 0 \]

\[ \mathbf{\tau}_h = 0 \]

The observational equivalent of the effects model is

\[ \begin{aligned} \mathbf{y}_{ij} &= \mathbf{\bar{y}} + (\mathbf{\bar{y}}_i - \mathbf{\bar{y}}) + (\mathbf{y}_{ij} - \mathbf{\bar{y}}_i) \\ &= \text{overall sample mean} + \text{treatement effect} + \text{residual} \text{ (under univariate ANOVA)} \end{aligned} \]

After manipulation

\[ \sum_{i = 1}^h \sum_{j = 1}^{n_i} (\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})(\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})' = \sum_{i = 1}^h n_i (\mathbf{\bar{y}}_i - \mathbf{\bar{y}})(\mathbf{\bar{y}}_i - \mathbf{\bar{y}})' + \\ \sum_{i=1}^h \sum_{j = 1}^{n_i} (\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})(\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}}_i)' \]

LHS = Total corrected sums of squares and cross products (SSCP) matrix

RHS =

1st term = Treatment (or between subjects) sum of squares and cross product matrix (denoted H;B)
2nd term = residual (or within subject) SSCP matrix denoted (E;W)

Note:

\[ \mathbf{E} = (n_1 - 1)\mathbf{S}_1 + ... + (n_h -1) \mathbf{S}_h = (n-h) \mathbf{S} \]

MANOVA table

MONOVA table
Source	SSCP	df
Treatment	\(\mathbf{H}\)	\(h -1\)
Residual (error)	\(\mathbf{E}\)	\(\sum_{i= 1}^h n_i - h\)
Total Corrected	\(\mathbf{H + E}\)	\(\sum_{i=1}^h n_i -1\)

\[ H_0: \tau_1 = \tau_2 = \dots = \tau_h = \mathbf{0} \]

We consider the relative “sizes” of \(\mathbf{E}\) and \(\mathbf{H+E}\)

Wilk’s Lambda

Define Wilk’s Lambda

\[ \Lambda^* = \frac{|\mathbf{E}|}{|\mathbf{H+E}|} \]

Properties:

Wilk’s Lambda is equivalent to the F-statistic in the univariate case
The exact distribution of \(\Lambda^*\) can be determined for especial cases.
For large sample sizes, reject \(H_0\) if

\[ -(\sum_{i=1}^h n_i - 1 - \frac{p+h}{2}) \log(\Lambda^*) > \chi^2_{(1-\alpha, p(h-1))} \]

MANOVA IN R

suppose for the labrats , we now give the rats different drugs now and test them on conditions y1,y2,y3 and y4

library(tidyverse)
## Read in Data
lab <- read.table("images/labrats.dat")
names(lab) <- c("drug", "y1", "y2", "y3", "y4")
## Create a subject ID nested within drug

lab <- lab %>%
    group_by(drug) %>%
    mutate(subject = row_number()) %>%
    ungroup()
str(lab)
#> tibble [24 x 6] (S3: tbl_df/tbl/data.frame)
#>  $ drug   : chr [1:24] "ax23" "ax23" "ax23" "ax23" ...
#>  $ y1     : int [1:24] 72 78 71 72 66 74 62 69 85 82 ...
#>  $ y2     : int [1:24] 86 83 82 83 79 83 73 75 86 86 ...
#>  $ y3     : int [1:24] 81 88 81 83 77 84 78 76 83 80 ...
#>  $ y4     : int [1:24] 77 82 75 69 66 77 70 70 80 84 ...
#>  $ subject: int [1:24] 1 2 3 4 5 6 7 8 1 2 ...

## Create means summary for profile plot, pivot longer for plotting with ggplot
lab_means <- lab %>%
    group_by(drug) %>%
    summarize_at(vars(starts_with("y")), mean) %>%
    ungroup() %>%
    pivot_longer(-drug, names_to = "subject", values_to = "mean") %>%
    mutate(subject = as.numeric(as.factor(subject)))
gg_profile <- ggplot(lab_means, aes(x = subject, y = mean)) +
    geom_line(aes(col = drug)) +
    geom_point(aes(col = drug)) +
    ggtitle("Profile Plot") +
    scale_y_continuous(name = "Response") +
    scale_x_discrete(name = "subject")
gg_profile

visualise boxplots

lab |> 
  select(-subject) |> 
  gather("response","value",-drug) |> 
  ggplot(aes(x=drug,y=value,fill=drug))+
  geom_boxplot()+
  facet_wrap(~response,scales="free")+
  tvthemes::scale_fill_avatar()

Fitting A manova in R

Using `manova()`

lm_mod <- manova(cbind(y1, y2, y3, y4) ~ drug, data = lab)
summary(lm_mod)
#>           Df Pillai approx F num Df den Df    Pr(>F)    
#> drug       2 1.2835   8.5081      8     38 1.501e-06 ***
#> Residuals 21                                            
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

look to see which ones differ

summary.aov(lm_mod)
#>  Response y1 :
#>             Df Sum Sq Mean Sq F value   Pr(>F)   
#> drug         2    567 283.500  9.2878 0.001289 **
#> Residuals   21    641  30.524                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>  Response y2 :
#>             Df Sum Sq Mean Sq F value   Pr(>F)   
#> drug         2 569.08 284.542  7.2528 0.004029 **
#> Residuals   21 823.88  39.232                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>  Response y3 :
#>             Df Sum Sq Mean Sq F value   Pr(>F)   
#> drug         2 391.08 195.542  6.2609 0.007368 **
#> Residuals   21 655.88  31.232                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>  Response y4 :
#>             Df Sum Sq Mean Sq F value  Pr(>F)  
#> drug         2  316.0 158.000  4.9192 0.01769 *
#> Residuals   21  674.5  32.119                  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comments

from the output above ,we can see that the responses are highly significantly different among drugs since all p-values are less than 5%

using `lm`

## Fit model
lab_mod <- lm(cbind(y1, y2, y3, y4) ~ drug, data = lab)
man_fit <- car::Anova(lab_mod)
summary(man_fit)
#> 
#> Type II MANOVA Tests:
#> 
#> Sum of squares and products for error:
#>        y1      y2      y3     y4
#> y1 641.00 601.750 535.250 426.00
#> y2 601.75 823.875 615.500 534.25
#> y3 535.25 615.500 655.875 555.25
#> y4 426.00 534.250 555.250 674.50
#> 
#> ------------------------------------------
#>  
#> Term: drug 
#> 
#> Sum of squares and products for the hypothesis:
#>        y1       y2       y3    y4
#> y1 567.00 335.2500  42.7500 387.0
#> y2 335.25 569.0833 404.5417 367.5
#> y3  42.75 404.5417 391.0833 171.0
#> y4 387.00 367.5000 171.0000 316.0
#> 
#> Multivariate Tests: drug
#>                  Df test stat  approx F num Df den Df     Pr(>F)    
#> Pillai            2  1.283456  8.508082      8     38 1.5010e-06 ***
#> Wilks             2  0.079007 11.509581      8     36 6.3081e-08 ***
#> Hotelling-Lawley  2  7.069384 15.022441      8     34 3.9048e-09 ***
#> Roy               2  6.346509 30.145916      4     19 5.4493e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

the output contains more information regarding the methods used

Multivariate Analysis Of Variance

Data science nuggets

Multivariate Analysis Of Variance

Introduction

RECAP ANOVA

assumptions

Univariate Normal

MANOVA

Multivariate Normal Distribution

Properties of MVN

Multivariate Analysis of Variance(MANOVA)

MANOVA IN R

visualise boxplots

Fitting A manova in R

Using manova()

look to see which ones differ

using lm

Using `manova()`

using `lm`