Multivariate Analysis Of Variance
Data science nuggets
Introduction
- my previous blog talked about comparing two means using
T.TEST
and as well as comparing more than 2 means withANOVA
RECAP ANOVA
- for this one we said consider yourself having having
one dependent
variable andone or more independent
variables which are factors
day 5 | day 20 | day 35 |
---|---|---|
10 | 15 | 35 |
15 | 30 | 40 |
25 | 20 | 50 |
15 | 15 | 43 |
20 | 23 | 45 |
18 | 20 | 40 |
103 | 123 | 253 |
- the above dataset as it stands might look a bit weird but once we change it to long format we get
library(tidyverse)
<-tribble(~DAY_5 ,~DAY_20,~DAY_35,
df10, 15, 35,
15, 30, 40,
25, 20, 50,
15, 15, 43,
20, 23, 45,
18, 20, 40)
|>
df gather("day","time") |>
::kable() knitr
day | time |
---|---|
DAY_5 | 10 |
DAY_5 | 15 |
DAY_5 | 25 |
DAY_5 | 15 |
DAY_5 | 20 |
DAY_5 | 18 |
DAY_20 | 15 |
DAY_20 | 30 |
DAY_20 | 20 |
DAY_20 | 15 |
DAY_20 | 23 |
DAY_20 | 20 |
DAY_35 | 35 |
DAY_35 | 40 |
DAY_35 | 50 |
DAY_35 | 43 |
DAY_35 | 45 |
DAY_35 | 40 |
- now we have
time
as thedependent
variable andday
as theindependent
variable - so we generally assume the following assumptions
assumptions
- Univariate Normal distribution
The residuals of ANOVA model can once again be visualised in the normal QQplot. If the residuals lie linearly on the 1:1 line of the QQplot, they can be considered as normally distributed. If not, the ANOVA results cannot be interpreted. - Homoscedasticity
To be valid, ANOVA must be performed on models with homogeneous variance of the residuals. This homoscedasticity can be verified using either the residuals vs fitted plot or the scale-location plot of the diagnostic plots.
Univariate Normal
Random variable \(X\) is distributed \(X \sim N(\mu, \sigma^2)\) if
\[f(X)=\frac{{1}}{{\sigma \sqrt{{2\pi}}}}e^{-.5(\frac{{x-\mu}}{{\sigma}})^2}\].
MANOVA
Multivariate Normal Distribution
Multivariate Normal Distribution
Let \(\mathbf{y}\) be a multivariate normal (MVN) random variable with mean \(\mu\) and variance \(\mathbf{\Sigma}\). Then the density of \(\mathbf{y}\) is
\[ f(\mathbf{y}) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp(-\frac{1}{2} \mathbf{(y-\mu)'\Sigma^{-1}(y-\mu)} ) \]
\(\mathbf{y} \sim N_p(\mu, \mathbf{\Sigma})\)
Properties of MVN
Let \(\mathbf{A}_{r \times p}\) be a fixed matrix. Then \(\mathbf{Ay} \sim N_r (\mathbf{A \mu, A \Sigma A'})\) . \(r \le p\) and all rows of \(\mathbf{A}\) must be linearly independent to guarantee that \(\mathbf{A \Sigma A}'\) is non-singular.
Let \(\mathbf{G}\) be a matrix such that \(\mathbf{\Sigma}^{-1} = \mathbf{GG}'\). Then \(\mathbf{G'y} \sim N_p(\mathbf{G' \mu, I})\) and \(\mathbf{G'(y-\mu)} \sim N_p (0,\mathbf{I})\)
Any fixed linear combination of \(y_1,...,y_p\) (say \(\mathbf{c'y}\)) follows \(\mathbf{c'y} \sim N_1 (\mathbf{c' \mu, c' \Sigma c})\)
Multivariate Analysis of Variance(MANOVA)
One-way MANOVA
- Compare treatment means for h different populations
- now we have more than one
dependent
variables
Population 1: \(\mathbf{y}_{11}, \mathbf{y}_{12}, \dots, \mathbf{y}_{1n_1} \sim idd N_p (\mathbf{\mu}_1, \mathbf{\Sigma})\)
\(\vdots\)
Population h: \(\mathbf{y}_{h1}, \mathbf{y}_{h2}, \dots, \mathbf{y}_{hn_h} \sim idd N_p (\mathbf{\mu}_h, \mathbf{\Sigma})\)
- Using the
ANOVA
example , lets say we now have other thantime
we could also have theweight
and theheight
of the thelab rats
at those days.
Assumptions
- Independent random samples from \(h\) different populations
- Common covariance matrices
- Each population is multivariate normal
Calculate the summary statistics \(\mathbf{\bar{y}}_i, \mathbf{S}\) and the pooled estimate of the covariance matrix \(\mathbf{S}\)
Similar to the univariate one-way ANVOA, we can use the effects model formulation \(\mathbf{\mu}_i = \mathbf{\mu} + \mathbf{\tau}_i\), where
\(\mathbf{\mu}_i\) is the population mean for population i
\(\mathbf{\mu}\) is the overall mean effect
\(\mathbf{\tau}_i\) is the treatment effect of the i-th treatment.
For the one-way model: \(\mathbf{y}_{ij} = \mu + \tau_i + \epsilon_{ij}\) for \(i = 1,..,h; j = 1,..., n_i\) and \(\epsilon_{ij} \sim N_p(\mathbf{0, \Sigma})\)
However, the above model is over-parameterized (i.e., infinite number of ways to define \(\mathbf{\mu}\) and the \(\mathbf{\tau}_i\)’s such that they add up to \(\mu_i\). Thus we can constrain by having
\[ \sum_{i=1}^h n_i \tau_i = 0 \]
or
\[ \mathbf{\tau}_h = 0 \]
The observational equivalent of the effects model is
\[ \begin{aligned} \mathbf{y}_{ij} &= \mathbf{\bar{y}} + (\mathbf{\bar{y}}_i - \mathbf{\bar{y}}) + (\mathbf{y}_{ij} - \mathbf{\bar{y}}_i) \\ &= \text{overall sample mean} + \text{treatement effect} + \text{residual} \text{ (under univariate ANOVA)} \end{aligned} \]
After manipulation
\[ \sum_{i = 1}^h \sum_{j = 1}^{n_i} (\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})(\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})' = \sum_{i = 1}^h n_i (\mathbf{\bar{y}}_i - \mathbf{\bar{y}})(\mathbf{\bar{y}}_i - \mathbf{\bar{y}})' + \\ \sum_{i=1}^h \sum_{j = 1}^{n_i} (\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})(\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}}_i)' \]
LHS = Total corrected sums of squares and cross products (SSCP) matrix
RHS =
1st term = Treatment (or between subjects) sum of squares and cross product matrix (denoted H;B)
2nd term = residual (or within subject) SSCP matrix denoted (E;W)
Note:
\[ \mathbf{E} = (n_1 - 1)\mathbf{S}_1 + ... + (n_h -1) \mathbf{S}_h = (n-h) \mathbf{S} \]
MANOVA table
Source | SSCP | df |
---|---|---|
Treatment | \(\mathbf{H}\) | \(h -1\) |
Residual (error) | \(\mathbf{E}\) | \(\sum_{i= 1}^h n_i - h\) |
Total Corrected | \(\mathbf{H + E}\) | \(\sum_{i=1}^h n_i -1\) |
\[ H_0: \tau_1 = \tau_2 = \dots = \tau_h = \mathbf{0} \]
We consider the relative “sizes” of \(\mathbf{E}\) and \(\mathbf{H+E}\)
Wilk’s Lambda
Define Wilk’s Lambda
\[ \Lambda^* = \frac{|\mathbf{E}|}{|\mathbf{H+E}|} \]
Properties:
Wilk’s Lambda is equivalent to the F-statistic in the univariate case
The exact distribution of \(\Lambda^*\) can be determined for especial cases.
For large sample sizes, reject \(H_0\) if
\[ -(\sum_{i=1}^h n_i - 1 - \frac{p+h}{2}) \log(\Lambda^*) > \chi^2_{(1-\alpha, p(h-1))} \]
MANOVA IN R
suppose for the labrats , we now give the rats different drugs now and test them on conditions
y1,y2,y3 and y4
library(tidyverse)
## Read in Data
<- read.table("images/labrats.dat")
lab names(lab) <- c("drug", "y1", "y2", "y3", "y4")
## Create a subject ID nested within drug
<- lab %>%
lab group_by(drug) %>%
mutate(subject = row_number()) %>%
ungroup()
str(lab)
#> tibble [24 x 6] (S3: tbl_df/tbl/data.frame)
#> $ drug : chr [1:24] "ax23" "ax23" "ax23" "ax23" ...
#> $ y1 : int [1:24] 72 78 71 72 66 74 62 69 85 82 ...
#> $ y2 : int [1:24] 86 83 82 83 79 83 73 75 86 86 ...
#> $ y3 : int [1:24] 81 88 81 83 77 84 78 76 83 80 ...
#> $ y4 : int [1:24] 77 82 75 69 66 77 70 70 80 84 ...
#> $ subject: int [1:24] 1 2 3 4 5 6 7 8 1 2 ...
## Create means summary for profile plot, pivot longer for plotting with ggplot
<- lab %>%
lab_means group_by(drug) %>%
summarize_at(vars(starts_with("y")), mean) %>%
ungroup() %>%
pivot_longer(-drug, names_to = "subject", values_to = "mean") %>%
mutate(subject = as.numeric(as.factor(subject)))
<- ggplot(lab_means, aes(x = subject, y = mean)) +
gg_profile geom_line(aes(col = drug)) +
geom_point(aes(col = drug)) +
ggtitle("Profile Plot") +
scale_y_continuous(name = "Response") +
scale_x_discrete(name = "subject")
gg_profile
visualise boxplots
|>
lab select(-subject) |>
gather("response","value",-drug) |>
ggplot(aes(x=drug,y=value,fill=drug))+
geom_boxplot()+
facet_wrap(~response,scales="free")+
::scale_fill_avatar() tvthemes
Fitting A manova in R
Using manova()
<- manova(cbind(y1, y2, y3, y4) ~ drug, data = lab)
lm_mod summary(lm_mod)
#> Df Pillai approx F num Df den Df Pr(>F)
#> drug 2 1.2835 8.5081 8 38 1.501e-06 ***
#> Residuals 21
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
look to see which ones differ
summary.aov(lm_mod)
#> Response y1 :
#> Df Sum Sq Mean Sq F value Pr(>F)
#> drug 2 567 283.500 9.2878 0.001289 **
#> Residuals 21 641 30.524
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Response y2 :
#> Df Sum Sq Mean Sq F value Pr(>F)
#> drug 2 569.08 284.542 7.2528 0.004029 **
#> Residuals 21 823.88 39.232
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Response y3 :
#> Df Sum Sq Mean Sq F value Pr(>F)
#> drug 2 391.08 195.542 6.2609 0.007368 **
#> Residuals 21 655.88 31.232
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Response y4 :
#> Df Sum Sq Mean Sq F value Pr(>F)
#> drug 2 316.0 158.000 4.9192 0.01769 *
#> Residuals 21 674.5 32.119
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comments
from the output above ,we can see that the responses are highly significantly different among drugs since all p-values are less than 5%
using lm
## Fit model
<- lm(cbind(y1, y2, y3, y4) ~ drug, data = lab)
lab_mod <- car::Anova(lab_mod)
man_fit summary(man_fit)
#>
#> Type II MANOVA Tests:
#>
#> Sum of squares and products for error:
#> y1 y2 y3 y4
#> y1 641.00 601.750 535.250 426.00
#> y2 601.75 823.875 615.500 534.25
#> y3 535.25 615.500 655.875 555.25
#> y4 426.00 534.250 555.250 674.50
#>
#> ------------------------------------------
#>
#> Term: drug
#>
#> Sum of squares and products for the hypothesis:
#> y1 y2 y3 y4
#> y1 567.00 335.2500 42.7500 387.0
#> y2 335.25 569.0833 404.5417 367.5
#> y3 42.75 404.5417 391.0833 171.0
#> y4 387.00 367.5000 171.0000 316.0
#>
#> Multivariate Tests: drug
#> Df test stat approx F num Df den Df Pr(>F)
#> Pillai 2 1.283456 8.508082 8 38 1.5010e-06 ***
#> Wilks 2 0.079007 11.509581 8 36 6.3081e-08 ***
#> Hotelling-Lawley 2 7.069384 15.022441 8 34 3.9048e-09 ***
#> Roy 2 6.346509 30.145916 4 19 5.4493e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
the output contains more information regarding the methods used