Multivariate Analysis Of Variance
Data science nuggets
Introduction
- my previous blog talked about comparing two means using
T.TESTand as well as comparing more than 2 means withANOVA
RECAP ANOVA
- for this one we said consider yourself having having
one dependentvariable andone or more independentvariables which are factors
| day 5 | day 20 | day 35 |
|---|---|---|
| 10 | 15 | 35 |
| 15 | 30 | 40 |
| 25 | 20 | 50 |
| 15 | 15 | 43 |
| 20 | 23 | 45 |
| 18 | 20 | 40 |
| 103 | 123 | 253 |
- the above dataset as it stands might look a bit weird but once we change it to long format we get
library(tidyverse)
df<-tribble(~DAY_5 ,~DAY_20,~DAY_35,
10, 15, 35,
15, 30, 40,
25, 20, 50,
15, 15, 43,
20, 23, 45,
18, 20, 40)
df |>
gather("day","time") |>
knitr::kable()| day | time |
|---|---|
| DAY_5 | 10 |
| DAY_5 | 15 |
| DAY_5 | 25 |
| DAY_5 | 15 |
| DAY_5 | 20 |
| DAY_5 | 18 |
| DAY_20 | 15 |
| DAY_20 | 30 |
| DAY_20 | 20 |
| DAY_20 | 15 |
| DAY_20 | 23 |
| DAY_20 | 20 |
| DAY_35 | 35 |
| DAY_35 | 40 |
| DAY_35 | 50 |
| DAY_35 | 43 |
| DAY_35 | 45 |
| DAY_35 | 40 |
- now we have
timeas thedependentvariable anddayas theindependentvariable - so we generally assume the following assumptions
assumptions
- Univariate Normal distribution
The residuals of ANOVA model can once again be visualised in the normal QQplot. If the residuals lie linearly on the 1:1 line of the QQplot, they can be considered as normally distributed. If not, the ANOVA results cannot be interpreted. - Homoscedasticity
To be valid, ANOVA must be performed on models with homogeneous variance of the residuals. This homoscedasticity can be verified using either the residuals vs fitted plot or the scale-location plot of the diagnostic plots.
Univariate Normal
Random variable \(X\) is distributed \(X \sim N(\mu, \sigma^2)\) if
\[f(X)=\frac{{1}}{{\sigma \sqrt{{2\pi}}}}e^{-.5(\frac{{x-\mu}}{{\sigma}})^2}\].
MANOVA
Multivariate Normal Distribution
Multivariate Normal Distribution
Let \(\mathbf{y}\) be a multivariate normal (MVN) random variable with mean \(\mu\) and variance \(\mathbf{\Sigma}\). Then the density of \(\mathbf{y}\) is
\[ f(\mathbf{y}) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp(-\frac{1}{2} \mathbf{(y-\mu)'\Sigma^{-1}(y-\mu)} ) \]
\(\mathbf{y} \sim N_p(\mu, \mathbf{\Sigma})\)
Properties of MVN
Let \(\mathbf{A}_{r \times p}\) be a fixed matrix. Then \(\mathbf{Ay} \sim N_r (\mathbf{A \mu, A \Sigma A'})\) . \(r \le p\) and all rows of \(\mathbf{A}\) must be linearly independent to guarantee that \(\mathbf{A \Sigma A}'\) is non-singular.
Let \(\mathbf{G}\) be a matrix such that \(\mathbf{\Sigma}^{-1} = \mathbf{GG}'\). Then \(\mathbf{G'y} \sim N_p(\mathbf{G' \mu, I})\) and \(\mathbf{G'(y-\mu)} \sim N_p (0,\mathbf{I})\)
Any fixed linear combination of \(y_1,...,y_p\) (say \(\mathbf{c'y}\)) follows \(\mathbf{c'y} \sim N_1 (\mathbf{c' \mu, c' \Sigma c})\)
Multivariate Analysis of Variance(MANOVA)
One-way MANOVA
- Compare treatment means for h different populations
- now we have more than one
dependentvariables
Population 1: \(\mathbf{y}_{11}, \mathbf{y}_{12}, \dots, \mathbf{y}_{1n_1} \sim idd N_p (\mathbf{\mu}_1, \mathbf{\Sigma})\)
\(\vdots\)
Population h: \(\mathbf{y}_{h1}, \mathbf{y}_{h2}, \dots, \mathbf{y}_{hn_h} \sim idd N_p (\mathbf{\mu}_h, \mathbf{\Sigma})\)
- Using the
ANOVAexample , lets say we now have other thantimewe could also have theweightand theheightof the thelab ratsat those days.
Assumptions
- Independent random samples from \(h\) different populations
- Common covariance matrices
- Each population is multivariate normal
Calculate the summary statistics \(\mathbf{\bar{y}}_i, \mathbf{S}\) and the pooled estimate of the covariance matrix \(\mathbf{S}\)
Similar to the univariate one-way ANVOA, we can use the effects model formulation \(\mathbf{\mu}_i = \mathbf{\mu} + \mathbf{\tau}_i\), where
\(\mathbf{\mu}_i\) is the population mean for population i
\(\mathbf{\mu}\) is the overall mean effect
\(\mathbf{\tau}_i\) is the treatment effect of the i-th treatment.
For the one-way model: \(\mathbf{y}_{ij} = \mu + \tau_i + \epsilon_{ij}\) for \(i = 1,..,h; j = 1,..., n_i\) and \(\epsilon_{ij} \sim N_p(\mathbf{0, \Sigma})\)
However, the above model is over-parameterized (i.e., infinite number of ways to define \(\mathbf{\mu}\) and the \(\mathbf{\tau}_i\)’s such that they add up to \(\mu_i\). Thus we can constrain by having
\[ \sum_{i=1}^h n_i \tau_i = 0 \]
or
\[ \mathbf{\tau}_h = 0 \]
The observational equivalent of the effects model is
\[ \begin{aligned} \mathbf{y}_{ij} &= \mathbf{\bar{y}} + (\mathbf{\bar{y}}_i - \mathbf{\bar{y}}) + (\mathbf{y}_{ij} - \mathbf{\bar{y}}_i) \\ &= \text{overall sample mean} + \text{treatement effect} + \text{residual} \text{ (under univariate ANOVA)} \end{aligned} \]
After manipulation
\[ \sum_{i = 1}^h \sum_{j = 1}^{n_i} (\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})(\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})' = \sum_{i = 1}^h n_i (\mathbf{\bar{y}}_i - \mathbf{\bar{y}})(\mathbf{\bar{y}}_i - \mathbf{\bar{y}})' + \\ \sum_{i=1}^h \sum_{j = 1}^{n_i} (\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}})(\mathbf{\bar{y}}_{ij} - \mathbf{\bar{y}}_i)' \]
LHS = Total corrected sums of squares and cross products (SSCP) matrix
RHS =
1st term = Treatment (or between subjects) sum of squares and cross product matrix (denoted H;B)
2nd term = residual (or within subject) SSCP matrix denoted (E;W)
Note:
\[ \mathbf{E} = (n_1 - 1)\mathbf{S}_1 + ... + (n_h -1) \mathbf{S}_h = (n-h) \mathbf{S} \]
MANOVA table
| Source | SSCP | df |
|---|---|---|
| Treatment | \(\mathbf{H}\) | \(h -1\) |
| Residual (error) | \(\mathbf{E}\) | \(\sum_{i= 1}^h n_i - h\) |
| Total Corrected | \(\mathbf{H + E}\) | \(\sum_{i=1}^h n_i -1\) |
\[ H_0: \tau_1 = \tau_2 = \dots = \tau_h = \mathbf{0} \]
We consider the relative “sizes” of \(\mathbf{E}\) and \(\mathbf{H+E}\)
Wilk’s Lambda
Define Wilk’s Lambda
\[ \Lambda^* = \frac{|\mathbf{E}|}{|\mathbf{H+E}|} \]
Properties:
Wilk’s Lambda is equivalent to the F-statistic in the univariate case
The exact distribution of \(\Lambda^*\) can be determined for especial cases.
For large sample sizes, reject \(H_0\) if
\[ -(\sum_{i=1}^h n_i - 1 - \frac{p+h}{2}) \log(\Lambda^*) > \chi^2_{(1-\alpha, p(h-1))} \]
MANOVA IN R
suppose for the labrats , we now give the rats different drugs now and test them on conditions
y1,y2,y3 and y4
library(tidyverse)
## Read in Data
lab <- read.table("images/labrats.dat")
names(lab) <- c("drug", "y1", "y2", "y3", "y4")
## Create a subject ID nested within drug
lab <- lab %>%
group_by(drug) %>%
mutate(subject = row_number()) %>%
ungroup()
str(lab)
#> tibble [24 x 6] (S3: tbl_df/tbl/data.frame)
#> $ drug : chr [1:24] "ax23" "ax23" "ax23" "ax23" ...
#> $ y1 : int [1:24] 72 78 71 72 66 74 62 69 85 82 ...
#> $ y2 : int [1:24] 86 83 82 83 79 83 73 75 86 86 ...
#> $ y3 : int [1:24] 81 88 81 83 77 84 78 76 83 80 ...
#> $ y4 : int [1:24] 77 82 75 69 66 77 70 70 80 84 ...
#> $ subject: int [1:24] 1 2 3 4 5 6 7 8 1 2 ...
## Create means summary for profile plot, pivot longer for plotting with ggplot
lab_means <- lab %>%
group_by(drug) %>%
summarize_at(vars(starts_with("y")), mean) %>%
ungroup() %>%
pivot_longer(-drug, names_to = "subject", values_to = "mean") %>%
mutate(subject = as.numeric(as.factor(subject)))
gg_profile <- ggplot(lab_means, aes(x = subject, y = mean)) +
geom_line(aes(col = drug)) +
geom_point(aes(col = drug)) +
ggtitle("Profile Plot") +
scale_y_continuous(name = "Response") +
scale_x_discrete(name = "subject")
gg_profilevisualise boxplots
lab |>
select(-subject) |>
gather("response","value",-drug) |>
ggplot(aes(x=drug,y=value,fill=drug))+
geom_boxplot()+
facet_wrap(~response,scales="free")+
tvthemes::scale_fill_avatar()Fitting A manova in R
Using manova()
lm_mod <- manova(cbind(y1, y2, y3, y4) ~ drug, data = lab)
summary(lm_mod)
#> Df Pillai approx F num Df den Df Pr(>F)
#> drug 2 1.2835 8.5081 8 38 1.501e-06 ***
#> Residuals 21
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1look to see which ones differ
summary.aov(lm_mod)
#> Response y1 :
#> Df Sum Sq Mean Sq F value Pr(>F)
#> drug 2 567 283.500 9.2878 0.001289 **
#> Residuals 21 641 30.524
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Response y2 :
#> Df Sum Sq Mean Sq F value Pr(>F)
#> drug 2 569.08 284.542 7.2528 0.004029 **
#> Residuals 21 823.88 39.232
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Response y3 :
#> Df Sum Sq Mean Sq F value Pr(>F)
#> drug 2 391.08 195.542 6.2609 0.007368 **
#> Residuals 21 655.88 31.232
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Response y4 :
#> Df Sum Sq Mean Sq F value Pr(>F)
#> drug 2 316.0 158.000 4.9192 0.01769 *
#> Residuals 21 674.5 32.119
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Comments
from the output above ,we can see that the responses are highly significantly different among drugs since all p-values are less than 5%
using lm
## Fit model
lab_mod <- lm(cbind(y1, y2, y3, y4) ~ drug, data = lab)
man_fit <- car::Anova(lab_mod)
summary(man_fit)
#>
#> Type II MANOVA Tests:
#>
#> Sum of squares and products for error:
#> y1 y2 y3 y4
#> y1 641.00 601.750 535.250 426.00
#> y2 601.75 823.875 615.500 534.25
#> y3 535.25 615.500 655.875 555.25
#> y4 426.00 534.250 555.250 674.50
#>
#> ------------------------------------------
#>
#> Term: drug
#>
#> Sum of squares and products for the hypothesis:
#> y1 y2 y3 y4
#> y1 567.00 335.2500 42.7500 387.0
#> y2 335.25 569.0833 404.5417 367.5
#> y3 42.75 404.5417 391.0833 171.0
#> y4 387.00 367.5000 171.0000 316.0
#>
#> Multivariate Tests: drug
#> Df test stat approx F num Df den Df Pr(>F)
#> Pillai 2 1.283456 8.508082 8 38 1.5010e-06 ***
#> Wilks 2 0.079007 11.509581 8 36 6.3081e-08 ***
#> Hotelling-Lawley 2 7.069384 15.022441 8 34 3.9048e-09 ***
#> Roy 2 6.346509 30.145916 4 19 5.4493e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1the output contains more information regarding the methods used