Joel Correa da Rosa
May 10th, 2017
ANOVA (Analysis of Variance ) is a generalization of the t-test for comparing means of 3 independent populations or more.
\( H_0:\mu_1=\mu_2=\mu_3 \)
\( H_1: \text{at least one difference} \)
A nutritionist is interested in comparing the effects of two alternative interventions: exercise and diet in reducing the levels of cholesterol. The magnitude of the effect is measured as the difference from a control group (non-intervention).
# Group 1
control<-c(310,320,370,240)
# Group 2
diet<-c(280,250,270,280)
# Group 3
exercise<-c(240,230,180,270)
# Building R data frame structure
chol.data<-cbind.data.frame(Cholesterol = c(control,diet,exercise),
Group =c(rep('1',length(control)),
rep('2',length(diet)),
rep('3',length(exercise))
)
)
summary(chol.data)
Cholesterol Group
Min. :180.0 1:4
1st Qu.:240.0 2:4
Median :270.0 3:4
Mean :270.0
3rd Qu.:287.5
Max. :370.0
Cholesterol | Group |
---|---|
310 | 1 |
320 | 1 |
370 | 1 |
240 | 1 |
280 | 2 |
250 | 2 |
270 | 2 |
280 | 2 |
240 | 3 |
230 | 3 |
180 | 3 |
270 | 3 |
boxplot(chol.data$Cholesterol~chol.data$Group)
The categorized box-plot is an important tool for visualizing differences
The ANOVA decomposes the total variance : \( \sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1} \) into two components :
Differences of means (between group)
Random variation (within group)
SS: Sum of Squares
\( SS_{Total} = SS_{Treatment} + SS_{Residuals} \)
\( SS_{Total}=\sum_{j=1}^{n_1}(y_{1j}-\bar{y})^2 +\sum_{j=1}^{n_2}(y_{2j}-\bar{y})^2 +...+\sum_{j=1}^{n_K}(y_{Kj}-\bar{y})^2 \)
\( SS_{Treatment}=n_1(\bar{y}_1-\bar{y})^2+n_2(\bar{y}_2-\bar{y})^2+...+n_K(\bar{y}_K-\bar{y})^2 \)
\( SS_{Residuals}=\sum_{j=1}^{n_1}(y_{1j}-\bar{y}_1)^2 +\sum_{j=1}^{n_2}(y_{2j}-\bar{y}_2)^2 +...+\sum_{j=1}^{n_K}(y_{Kj}-\bar{y}_K)^2 \)
Important : In this framework, the mean is the central tendency measure that represents the group. In other words, it is assumed to be a good summary of the population.
\( F = \frac{MS_{Treatment}}{MS_{Error}} \)
\( MS_{Treatment}=\frac{SS_{Treatment}}{df_1} \) \( MS_{Residuals}=\frac{SS_{Residual}}{df_2} \)
df: degrees of freedom.
The df (degrees of freedom) element is defined by the difference between the number of squared elements in the sum and the number of linear restrictions.
\( df = \# \text{squared elements} - \# \text{linear restrictions} \)
For the one-way ANOVA, \( df_1=K-1 \) and \( df_2=N-K \).
F follows the F-Snedecor distribution with two parameters: \( df_1 \): numerator's degrees of freedom and \( df_2 \): denominator's degrees of freedom.
The \( F \) test statistic is a ratio between variances. The rationale behind it is that the larger the contribution of \( SS_{Treatment} \) to the total variation \( SS_{Total} \), more evidence about the difference among populations.
The significance test for this statistic is different from the Student's t regarding the direction of the alternative hypothesis. Critical region is associated to high values of \( F \).
Consider the nutritionist example : \( K=3 \) groups and \( N=12 \) subjects. Under the null hypothesis, the test statistic behaves like:
# Number of Groups
K = 3
# Total Sample Siz
N = 12
# degrees of Freedom in the numerator
df1 = K-1
# degrees of freedom in the denominator
df2 = N-K
# sequence from 0 to 100
seq1<-seq(0,8,0.001)
# plot F distribution
plot(seq1,df(seq1,df1,df2))
set.seed(123)
summary(rf(1000,df1,df2))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000322 0.273007 0.692401 1.216963 1.485266 13.554842
The output of the operation below defines the lower bound for the critical region for F-statistic for the data in Example 01.
qf(0.95,df1,df2)
[1] 4.256495
SSTotal<-(310-270)^2+(320-270)^2+(370-270)^2+(240-270)^2 +
(280-270)^2+(250-270)^2+(270-270)^2+(280-270)^2 +
(240-270)^2+(230-270)^2+(180-270)^2+(270-270)^2
SSTotal
[1] 26200
Sum of Squares of Residuals partitioned by group
SSRes.control<-(310-310)^2+(320-310)^2+(370-310)^2+(240-310)^2
SSRes.diet<-(280-270)^2+(250-270)^2+(270-270)^2+(280-270)^2
SSRes.exercise<-(240-230)^2+(230-230)^2+(180-230)^2+(270-230)^2
SSRes.control
[1] 8600
SSRes.diet
[1] 600
SSRes.exercise
[1] 4200
Sum of Squares of Residuals
SSResiduals<-SSRes.control+SSRes.diet+SSRes.exercise
SSResiduals
[1] 13400
SSTreatment<-SSTotal-SSResiduals
SSTreatment
[1] 12800
Alternatively
SSTreatment<-4*(310-270)^2+4*(270-270)^2+4*(230-270)^2
SSTreatment
[1] 12800
MSTreatment<-SSTreatment/(3-1)
MSTreatment
[1] 6400
MSResiduals<-SSResiduals/(12-3)
MSResiduals
[1] 1488.889
Fstatistic<-MSTreatment/MSResiduals
Fstatistic
[1] 4.298507
Comparing observed F statistic with F distribution
# plot F distribution
plot(seq1,df(seq1,df1,df2))
abline(v=Fstatistic,col='red')
P-value associated with the observed F statistic
1-pf(Fstatistic,df1,df2)
[1] 0.04893457
av<-aov(chol.data$Cholesterol~chol.data$Group)
av
Call:
aov(formula = chol.data$Cholesterol ~ chol.data$Group)
Terms:
chol.data$Group Residuals
Sum of Squares 12800 13400
Deg. of Freedom 2 9
Residual standard error: 38.58612
Estimated effects may be unbalanced
summ.av<-summary(av)
summ.av
Df Sum Sq Mean Sq F value Pr(>F)
chol.data$Group 2 12800 6400 4.299 0.0489 *
Residuals 9 13400 1489
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
chol.data$Group | 2 | 12800 | 6400.000 | 4.298508 | 0.0489346 |
Residuals | 9 | 13400 | 1488.889 | NA | NA |
ANOVA can be formulated as a linear regression model
Model: “completely randomized model”
\( y_{ij}=\mu_i +\epsilon_{ij}=\mu+\alpha_i+\epsilon_{ij} \)
\( i=1,2,..,K \)
\( j=1,2,...,n_K \)
\( \epsilon_{ij} \sim N(0,\sigma^2) \)
This model is well defined by the restriction: \( \sum_{i=1}^n \alpha_i =0 \)
mod1<-lm(chol.data$Cholesterol~-1+chol.data$Group)
summary(mod1)
Call:
lm(formula = chol.data$Cholesterol ~ -1 + chol.data$Group)
Residuals:
Min 1Q Median 3Q Max
-70 -5 5 10 60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
chol.data$Group1 310.00 19.29 16.07 6.20e-08 ***
chol.data$Group2 270.00 19.29 13.99 2.06e-07 ***
chol.data$Group3 230.00 19.29 11.92 8.14e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 38.59 on 9 degrees of freedom
Multiple R-squared: 0.9851, Adjusted R-squared: 0.9802
F-statistic: 198.7 on 3 and 9 DF, p-value: 1.534e-08
anova(mod1)
Analysis of Variance Table
Response: chol.data$Cholesterol
Df Sum Sq Mean Sq F value Pr(>F)
chol.data$Group 3 887600 295867 198.72 1.534e-08 ***
Residuals 9 13400 1489
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod2<-lm(chol.data$Cholesterol~chol.data$Group)
summary(mod2)
Call:
lm(formula = chol.data$Cholesterol ~ chol.data$Group)
Residuals:
Min 1Q Median 3Q Max
-70 -5 5 10 60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 310.00 19.29 16.068 6.2e-08 ***
chol.data$Group2 -40.00 27.28 -1.466 0.1767
chol.data$Group3 -80.00 27.28 -2.932 0.0167 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 38.59 on 9 degrees of freedom
Multiple R-squared: 0.4885, Adjusted R-squared: 0.3749
F-statistic: 4.299 on 2 and 9 DF, p-value: 0.04893
anova(mod2)
Analysis of Variance Table
Response: chol.data$Cholesterol
Df Sum Sq Mean Sq F value Pr(>F)
chol.data$Group 2 12800 6400.0 4.2985 0.04893 *
Residuals 9 13400 1488.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA F-statistic tests for at least one difference between groups. Post-hoc tests are necessary to identify which groups are different.
A sequence of pairwise comparisons with a t-test will inflate the type-I error, in other words it will increase the Family-Wise Error Rate.
The FWER is the probability of coming to at least one wrong conclusion in a series of hypothesis tests.
\( FWER = 1 - (1-\alpha)^C \)
\( \alpha \) : significance level for an individual test
\( C \): number of comparisons
Example : \( \alpha \)=0.05 and C=5, results in \( FWER=1-(1-0.05)^5=0.2262 \)
When performing 5 hypothesis tests with fixed significance level as 5%, there is 22.62\% probability of at least one wrong conclusion.
There are two strategies for post hoc comparisons:
1) single step
2) step-down
Bonferroni correction consists of keeping the FWER by correcting the test-wise significant level. This can be done by simply dividing \( \alpha \) by the number of tests to be performed.
Example : \( \alpha \)=0.05 and C=5, results in \( FWER=1-(1-0.05)^5=0.2262 \)
To keep \( FWER \) at 5% significance level it would suffice to perform each test at \( \alpha^{*}=\alpha/5=0.01 \)
\( FWER=1-(1-0.01)^5\approx 0.05 \)
This test establishes a minimum difference to be considered significant at 5% level.
\( LSD = t\sqrt{2MSResiduals/n} \)
\( t \) is the critical value of the Student's t distribution with df associated with MSResiduals (also called MSE), the denominator of the F-statistic.
\( n \) is the number of observations per group
Cholesterol | Group |
---|---|
310 | control |
320 | control |
370 | control |
240 | control |
280 | diet |
250 | diet |
270 | diet |
280 | diet |
240 | exercise |
230 | exercise |
180 | exercise |
270 | exercise |
ddply(chol.data,.(Group),summarise,mean = mean(Cholesterol), sd = sd(Cholesterol), n= length(Cholesterol))
Group mean sd n
1 control 310 53.54126 4
2 diet 270 14.14214 4
3 exercise 230 37.41657 4
The first step for finding the LSD is to obtain the Critical value for t statistic with N-K degrees of freedom.
tc = qt(0.95,9)
tc
[1] 1.833113
MSE is the denominator of the F-statistic, that was called MSResiduals before.
LSD is obtained by
MSE<-MSResiduals
LSD = tc*sqrt(2*MSE/4)
LSD
[1] 50.01559
The differences observed were
Control_vs_Diet = 310-270
Control_vs_Exercise = 310-230
Diet_vs_Exercise = 270-230
Control_vs_Diet
[1] 40
Control_vs_Exercise
[1] 80
Diet_vs_Exercise
[1] 40
There is only significance when comparing Control versus Exercise
The test statistic for Tukey's HSD is:
\( q_r = \frac{\bar{x}_l-\bar{x}_s}{\sqrt{\frac{MSE}{n}}} \)
The distribution of \( q \) is called the studentized range distribution and depends on the the number of means and the degrees of freedom for SSResiduals.
\( \bar{x}_l \) and \( \bar{x}_s \) are the largest and smallest treatment means, respectively.
\( r \), the degrees of freedom of the distribution is given bythe number of means
\( n \) is the number of observations per group
qR<-(310-230)/sqrt(MSResiduals/4)
qR
[1] 4.146568
1-ptukey(qR,3,12,nranges=1)
[1] 0.03130034
The ANOVA tests for difference but does not locate where they are. Post-hoc tests are designed for that. The most used one is the Tukey HSD.
posthoc <- TukeyHSD(av,conf.level=0.95)
posthoc
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = chol.data$Cholesterol ~ chol.data$Group)
$`chol.data$Group`
diff lwr upr p adj
2-1 -40 -116.1785 36.178503 0.3506917
3-1 -80 -156.1785 -3.821497 0.0401865
3-2 -40 -116.1785 36.178503 0.3506917