Lecture 11: ANOVA

Joel Correa da Rosa
May 10th, 2017

Hypothesis testing: Comparing 3 or more means

ANOVA (Analysis of Variance ) is a generalization of the t-test for comparing means of 3 independent populations or more.

$ H_0:\mu_1=\mu_2=\mu_3 $

$ H_1: \text{at least one difference} $

Example 01

A nutritionist is interested in comparing the effects of two alternative interventions: exercise and diet in reducing the levels of cholesterol. The magnitude of the effect is measured as the difference from a control group (non-intervention).

Data - Example 01

# Group 1
control<-c(310,320,370,240)

# Group 2
diet<-c(280,250,270,280)

# Group 3
exercise<-c(240,230,180,270)

Data - Example 01

# Building R data frame structure

chol.data<-cbind.data.frame(Cholesterol = c(control,diet,exercise),
                            Group =c(rep('1',length(control)),
                                     rep('2',length(diet)),
                                     rep('3',length(exercise))
                                    )
                            )

summary(chol.data)

  Cholesterol    Group
 Min.   :180.0   1:4  
 1st Qu.:240.0   2:4  
 Median :270.0   3:4  
 Mean   :270.0        
 3rd Qu.:287.5        
 Max.   :370.0

Data Table - Example 01

Cholesterol	Group
310	1
320	1
370	1
240	1
280	2
250	2
270	2
280	2
240	3
230	3
180	3
270	3

Categorized Box Plot

boxplot(chol.data$Cholesterol~chol.data$Group)

plot of chunk unnamed-chunk-4

The categorized box-plot is an important tool for visualizing differences

Test Statistic for one-way ANOVA

The ANOVA decomposes the total variance : $ \sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1} $ into two components :

Differences of means (between group)
Random variation (within group)

Components of Variation

SS: Sum of Squares

$ SS_{Total} = SS_{Treatment} + SS_{Residuals} $

$ SS_{Total}=\sum_{j=1}^{n_1}(y_{1j}-\bar{y})^2 +\sum_{j=1}^{n_2}(y_{2j}-\bar{y})^2 +...+\sum_{j=1}^{n_K}(y_{Kj}-\bar{y})^2 $

$ SS_{Treatment}=n_1(\bar{y}_1-\bar{y})^2+n_2(\bar{y}_2-\bar{y})^2+...+n_K(\bar{y}_K-\bar{y})^2 $

$ SS_{Residuals}=\sum_{j=1}^{n_1}(y_{1j}-\bar{y}_1)^2 +\sum_{j=1}^{n_2}(y_{2j}-\bar{y}_2)^2 +...+\sum_{j=1}^{n_K}(y_{Kj}-\bar{y}_K)^2 $

Important : In this framework, the mean is the central tendency measure that represents the group. In other words, it is assumed to be a good summary of the population.

Test Statistic

$ F = \frac{MS_{Treatment}}{MS_{Error}} $

$ MS_{Treatment}=\frac{SS_{Treatment}}{df_1} $ $ MS_{Residuals}=\frac{SS_{Residual}}{df_2} $

df: degrees of freedom.

The df (degrees of freedom) element is defined by the difference between the number of squared elements in the sum and the number of linear restrictions.

$ df = \# \text{squared elements} - \# \text{linear restrictions} $

For the one-way ANOVA, $ df_1=K-1 $ and $ df_2=N-K $.

Test Statistic

F follows the F-Snedecor distribution with two parameters: $ df_1 $: numerator's degrees of freedom and $ df_2 $: denominator's degrees of freedom.

The $ F $ test statistic is a ratio between variances. The rationale behind it is that the larger the contribution of $ SS_{Treatment} $ to the total variation $ SS_{Total} $, more evidence about the difference among populations.

The significance test for this statistic is different from the Student's t regarding the direction of the alternative hypothesis. Critical region is associated to high values of $ F $.

Example F Test Statistic

Consider the nutritionist example : $ K=3 $ groups and $ N=12 $ subjects. Under the null hypothesis, the test statistic behaves like:

# Number of Groups
K = 3

# Total Sample Siz
N = 12

# degrees of Freedom in the numerator
df1 = K-1

# degrees of freedom in the denominator
df2 = N-K

# sequence from 0 to 100
seq1<-seq(0,8,0.001)

Example F Test Statistic

# plot F distribution
plot(seq1,df(seq1,df1,df2))

plot of chunk unnamed-chunk-6

What is expected under the H0 ?

set.seed(123)
summary(rf(1000,df1,df2))

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000322  0.273007  0.692401  1.216963  1.485266 13.554842

Critical Values for F-statistic

The output of the operation below defines the lower bound for the critical region for F-statistic for the data in Example 01.

qf(0.95,df1,df2)

[1] 4.256495

Using R to obtain ANOVA Statistics

SSTotal<-(310-270)^2+(320-270)^2+(370-270)^2+(240-270)^2 +
  (280-270)^2+(250-270)^2+(270-270)^2+(280-270)^2 +
  (240-270)^2+(230-270)^2+(180-270)^2+(270-270)^2

SSTotal

[1] 26200

Using R to obtain ANOVA Statistics

Sum of Squares of Residuals partitioned by group

SSRes.control<-(310-310)^2+(320-310)^2+(370-310)^2+(240-310)^2
SSRes.diet<-(280-270)^2+(250-270)^2+(270-270)^2+(280-270)^2
SSRes.exercise<-(240-230)^2+(230-230)^2+(180-230)^2+(270-230)^2

SSRes.control

[1] 8600

SSRes.diet

[1] 600

SSRes.exercise

[1] 4200

Using R to obtain ANOVA Statistics

Sum of Squares of Residuals

SSResiduals<-SSRes.control+SSRes.diet+SSRes.exercise
SSResiduals

[1] 13400

Using R to obtain ANOVA Statistics

SSTreatment<-SSTotal-SSResiduals
SSTreatment

[1] 12800

Alternatively

SSTreatment<-4*(310-270)^2+4*(270-270)^2+4*(230-270)^2
SSTreatment

[1] 12800

Using R to obtain ANOVA Statistics

MSTreatment<-SSTreatment/(3-1)
MSTreatment

[1] 6400

MSResiduals<-SSResiduals/(12-3)
MSResiduals

[1] 1488.889

Fstatistic<-MSTreatment/MSResiduals
Fstatistic

[1] 4.298507

Using R to obtain ANOVA Statistics

Comparing observed F statistic with F distribution

# plot F distribution
plot(seq1,df(seq1,df1,df2))
abline(v=Fstatistic,col='red')

plot of chunk unnamed-chunk-15

Using R to obtain ANOVA Statistics

P-value associated with the observed F statistic

1-pf(Fstatistic,df1,df2)

[1] 0.04893457

Running ANOVA with functions implemented in R

av<-aov(chol.data$Cholesterol~chol.data$Group)
av

Call:
   aov(formula = chol.data$Cholesterol ~ chol.data$Group)

Terms:
                chol.data$Group Residuals
Sum of Squares            12800     13400
Deg. of Freedom               2         9

Residual standard error: 38.58612
Estimated effects may be unbalanced

Running ANOVA with functions implemented in R

summ.av<-summary(av)
summ.av

                Df Sum Sq Mean Sq F value Pr(>F)  
chol.data$Group  2  12800    6400   4.299 0.0489 *
Residuals        9  13400    1489                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA Table

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
chol.data$Group	2	12800	6400.000	4.298508	0.0489346
Residuals	9	13400	1488.889	NA	NA

Regression Approach for ANOVA

ANOVA can be formulated as a linear regression model

Model: “completely randomized model”

$ y_{ij}=\mu_i +\epsilon_{ij}=\mu+\alpha_i+\epsilon_{ij} $

$ i=1,2,..,K $

$ j=1,2,...,n_K $

$ \epsilon_{ij} \sim N(0,\sigma^2) $

This model is well defined by the restriction: $ \sum_{i=1}^n \alpha_i =0 $

Regression Approach in R without Intercept

mod1<-lm(chol.data$Cholesterol~-1+chol.data$Group)
summary(mod1)


Call:
lm(formula = chol.data$Cholesterol ~ -1 + chol.data$Group)

Residuals:
   Min     1Q Median     3Q    Max 
   -70     -5      5     10     60 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
chol.data$Group1   310.00      19.29   16.07 6.20e-08 ***
chol.data$Group2   270.00      19.29   13.99 2.06e-07 ***
chol.data$Group3   230.00      19.29   11.92 8.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.59 on 9 degrees of freedom
Multiple R-squared:  0.9851,    Adjusted R-squared:  0.9802 
F-statistic: 198.7 on 3 and 9 DF,  p-value: 1.534e-08

Regression Approach in R without Intercept

anova(mod1)

Analysis of Variance Table

Response: chol.data$Cholesterol
                Df Sum Sq Mean Sq F value    Pr(>F)    
chol.data$Group  3 887600  295867  198.72 1.534e-08 ***
Residuals        9  13400    1489                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression Approach in R with Intercept

mod2<-lm(chol.data$Cholesterol~chol.data$Group)
summary(mod2)


Call:
lm(formula = chol.data$Cholesterol ~ chol.data$Group)

Residuals:
   Min     1Q Median     3Q    Max 
   -70     -5      5     10     60 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        310.00      19.29  16.068  6.2e-08 ***
chol.data$Group2   -40.00      27.28  -1.466   0.1767    
chol.data$Group3   -80.00      27.28  -2.932   0.0167 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.59 on 9 degrees of freedom
Multiple R-squared:  0.4885,    Adjusted R-squared:  0.3749 
F-statistic: 4.299 on 2 and 9 DF,  p-value: 0.04893

Regression Approach in R with Intercept

anova(mod2)

Analysis of Variance Table

Response: chol.data$Cholesterol
                Df Sum Sq Mean Sq F value  Pr(>F)  
chol.data$Group  2  12800  6400.0  4.2985 0.04893 *
Residuals        9  13400  1488.9                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post Hoc Comparisons for One-Way ANOVA

Rationale

The ANOVA F-statistic tests for at least one difference between groups. Post-hoc tests are necessary to identify which groups are different.
A sequence of pairwise comparisons with a t-test will inflate the type-I error, in other words it will increase the Family-Wise Error Rate.

FamilyWise Error Rate (FWER)

The FWER is the probability of coming to at least one wrong conclusion in a series of hypothesis tests.

$ FWER = 1 - (1-\alpha)^C $

$ \alpha $ : significance level for an individual test

$ C $: number of comparisons

Example : $ \alpha $=0.05 and C=5, results in $ FWER=1-(1-0.05)^5=0.2262 $

When performing 5 hypothesis tests with fixed significance level as 5%, there is 22.62\% probability of at least one wrong conclusion.

Altenatives for Post-hoc comparisons

LSD
Tukey HSD
Bonferroni
Sidak
Scheffé
Dunnet

There are two strategies for post hoc comparisons:

1) single step

2) step-down

Bonferroni-Correction

Bonferroni correction consists of keeping the FWER by correcting the test-wise significant level. This can be done by simply dividing $ \alpha $ by the number of tests to be performed.

Example : $ \alpha $=0.05 and C=5, results in $ FWER=1-(1-0.05)^5=0.2262 $

To keep $ FWER $ at 5% significance level it would suffice to perform each test at $ \alpha^{*}=\alpha/5=0.01 $

$ FWER=1-(1-0.01)^5\approx 0.05 $

Least Significant Difference (LSD)

This test establishes a minimum difference to be considered significant at 5% level.

$ LSD = t\sqrt{2MSResiduals/n} $

$ t $ is the critical value of the Student's t distribution with df associated with MSResiduals (also called MSE), the denominator of the F-statistic.

$ n $ is the number of observations per group

Example LSD statistic

Cholesterol	Group
310	control
320	control
370	control
240	control
280	diet
250	diet
270	diet
280	diet
240	exercise
230	exercise
180	exercise
270	exercise

Example LSD statistic

ddply(chol.data,.(Group),summarise,mean = mean(Cholesterol), sd = sd(Cholesterol), n= length(Cholesterol))

     Group mean       sd n
1  control  310 53.54126 4
2     diet  270 14.14214 4
3 exercise  230 37.41657 4

Example LSD statistic

The first step for finding the LSD is to obtain the Critical value for t statistic with N-K degrees of freedom.

tc = qt(0.95,9)
tc

[1] 1.833113

Example LSD statistic

MSE is the denominator of the F-statistic, that was called MSResiduals before.

LSD is obtained by

MSE<-MSResiduals
LSD = tc*sqrt(2*MSE/4)
LSD

[1] 50.01559

Example LSD statistic

The differences observed were

Control_vs_Diet = 310-270
Control_vs_Exercise = 310-230
Diet_vs_Exercise = 270-230

Control_vs_Diet

[1] 40

Control_vs_Exercise

[1] 80

Diet_vs_Exercise

[1] 40

There is only significance when comparing Control versus Exercise

Tukey Honest Significant Difference(HSD)

The test statistic for Tukey's HSD is:

$ q_r = \frac{\bar{x}_l-\bar{x}_s}{\sqrt{\frac{MSE}{n}}} $

The distribution of $ q $ is called the studentized range distribution and depends on the the number of means and the degrees of freedom for SSResiduals.

$ \bar{x}_l $ and $ \bar{x}_s $ are the largest and smallest treatment means, respectively.

$ r $, the degrees of freedom of the distribution is given bythe number of means

$ n $ is the number of observations per group

The studentized range distribution in R

qR<-(310-230)/sqrt(MSResiduals/4)
qR

[1] 4.146568

1-ptukey(qR,3,12,nranges=1)

[1] 0.03130034

Post-Hoc Tukey HSD

The ANOVA tests for difference but does not locate where they are. Post-hoc tests are designed for that. The most used one is the Tukey HSD.

posthoc <- TukeyHSD(av,conf.level=0.95)
posthoc

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = chol.data$Cholesterol ~ chol.data$Group)

$`chol.data$Group`
    diff       lwr       upr     p adj
2-1  -40 -116.1785 36.178503 0.3506917
3-1  -80 -156.1785 -3.821497 0.0401865
3-2  -40 -116.1785 36.178503 0.3506917