Using R for ANOVA

R software has different resources for ANOVA and post-hoc tests.

one-way ANOVA

A nutritionist is interested in comparing the effects of two alternative interventions: exercise and diet in reducing the levels of cholesterol. The magnitude of the effect is measured as the difference from a control group (non-intervention).

# Group 1
control<-c(310,320,370,240)

# Group 2
diet<-c(280,250,270,280)

# Group 3
exercise<-c(240,230,180,270)

Formatting the data for the one-way ANOVA

# concatenate data about cholesterol
cholesterol<-c(control,diet,exercise)

# concatenate data about treatment
treatment<-c( rep('control',4) , rep('diet',4) , rep('exercise',4))

# create a data frame
d<-cbind.data.frame(cholesterol = cholesterol , treatment = treatment)

# showing  the data
d
##    cholesterol treatment
## 1          310   control
## 2          320   control
## 3          370   control
## 4          240   control
## 5          280      diet
## 6          250      diet
## 7          270      diet
## 8          280      diet
## 9          240  exercise
## 10         230  exercise
## 11         180  exercise
## 12         270  exercise
# summarizing the data
summary(d)
##   cholesterol       treatment
##  Min.   :180.0   control :4  
##  1st Qu.:240.0   diet    :4  
##  Median :270.0   exercise:4  
##  Mean   :270.0               
##  3rd Qu.:287.5               
##  Max.   :370.0

The boxplot is a graphical representation that indicates how different the groups are:

boxplot(d$cholesterol ~ d$treatment)

ggplot

R software has recently incoporated a series of resources in the library ggplot2.

library(ggplot2)

gx<-ggplot(d,aes(x=treatment,y=cholesterol,colour=treatment))+
  geom_boxplot()
gx

The advantage of ggplot is that we can incorporate layers of informations. As an example, we will add dots for individual observations to the boxplot.

gx<-gx+geom_point()
gx

Building the ANOVA table

# anova decomposition of total variation
av<-aov(d$cholesterol ~ d$treatment)
av
## Call:
##    aov(formula = d$cholesterol ~ d$treatment)
## 
## Terms:
##                 d$treatment Residuals
## Sum of Squares        12800     13400
## Deg. of Freedom           2         9
## 
## Residual standard error: 38.58612
## Estimated effects may be unbalanced
# the classical anova table
summary(av)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## d$treatment  2  12800    6400   4.299 0.0489 *
## Residuals    9  13400    1489                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Snedecor F distribution

The F-statistic for the ANOVA test is calculated as a ratio of two variances (Between Treatment / Within Treatment). To understand what is expected for this statistic when \(H_0\), the null hypothesis, is true, we will simulate 1000 values from the F distribution with 2 and 9 degrees of freedom.

# Generate data from the f distribution
fvalues<-rf(n=1000,df1=2,df2=9)

# plot histogram of fvalues
hist(fvalues)

# plot a line density
plot(density(fvalues))

Post-hoc test

The default implementation for Post-hoc tests in R comes from the Tukey Honest Significant Difference

posthoc<-TukeyHSD(av,'d$treatment')
posthoc
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = d$cholesterol ~ d$treatment)
## 
## $`d$treatment`
##                  diff       lwr       upr     p adj
## diet-control      -40 -116.1785 36.178503 0.3506917
## exercise-control  -80 -156.1785 -3.821497 0.0401865
## exercise-diet     -40 -116.1785 36.178503 0.3506917

Sample Size calculation for one-way ANOVA

The naive way of calculating the sample size for ANOVA would have been the evaluation of the sample size for all pairwise comparisons between indepedent populations and choosing the largest sample size.

Instead of this procedure, we can use the function power.anova.test that is present in R base package.

We wil evaluate what will be the sample size to detect the effect size in the same magnitude as the one observed in the cholesterol dataset.

power.anova.test(groups = 3, between.var = 6400 , within.var = 1489, sig.level=0.05 , power=0.8)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 3
##               n = 2.439193
##     between.var = 6400
##      within.var = 1489
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group

To capture variation in this magnitude, the sample \(n=3\) is sufficient.

two-way ANOVA

Consider that Gender is also included as a factor to be controled in the study design. We will artificially incorporate this factor in our dataset about cholesterol with a simple R command.

# creating a variable for gender
d$gender<-rep(c('M','M','F','F'),3)

# the new dataset
d
##    cholesterol treatment gender
## 1          310   control      M
## 2          320   control      M
## 3          370   control      F
## 4          240   control      F
## 5          280      diet      M
## 6          250      diet      M
## 7          270      diet      F
## 8          280      diet      F
## 9          240  exercise      M
## 10         230  exercise      M
## 11         180  exercise      F
## 12         270  exercise      F
library(ggplot2)

gx1<-ggplot(d,aes(x=treatment,y=cholesterol,colour=treatment))+
  geom_boxplot()+
  geom_point(aes(shape=gender))
gx1

Performing the two-way ANOVA

# ANOVA with two factors
av2<-aov(d$cholesterol~d$treatment+d$gender)
av2
## Call:
##    aov(formula = d$cholesterol ~ d$treatment + d$gender)
## 
## Terms:
##                 d$treatment  d$gender Residuals
## Sum of Squares    12800.000    33.333 13366.667
## Deg. of Freedom           2         1         8
## 
## Residual standard error: 40.87583
## Estimated effects may be unbalanced
# ANOVA table with two factors
summary(av2)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## d$treatment  2  12800    6400    3.83 0.0681 .
## d$gender     1     33      33    0.02 0.8912  
## Residuals    8  13367    1671                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The two-ANOVA analysis did not include interactions, only main effects. To see whether interactions play an important role in the ANOVA, someone can use the function interaction.plot.

# interaction plot to the artificial cholesterol dataset
interaction.plot(d$gender,d$treatment,d$cholesterol)

In the cholesterol example we do not have evidence of interaction according to the interaction plot tool. We will explore another example

Example: Blood Pressure

Consider an example where the blood pressure will be associated to the use of a drug that interacts with obesity. The data is reported below.

bloodpressure <- c(158,163,173,178,168,188,183,198,178,193,186,191,196,181,176,185,190,195,200,180)
obesity <- factor(c(rep("present",10),rep("absent",10)))
drug <- factor(rep(c(rep("present",5),rep("absent",5)),2))
bpdata <- data.frame(bloodpressure, obesity, drug)

The data frame with the data is shown below.

bpdata
##    bloodpressure obesity    drug
## 1            158 present present
## 2            163 present present
## 3            173 present present
## 4            178 present present
## 5            168 present present
## 6            188 present  absent
## 7            183 present  absent
## 8            198 present  absent
## 9            178 present  absent
## 10           193 present  absent
## 11           186  absent present
## 12           191  absent present
## 13           196  absent present
## 14           181  absent present
## 15           176  absent present
## 16           185  absent  absent
## 17           190  absent  absent
## 18           195  absent  absent
## 19           200  absent  absent
## 20           180  absent  absent

The summary of the data.

summary(bpdata)
##  bloodpressure      obesity        drug   
##  Min.   :158.0   absent :10   absent :10  
##  1st Qu.:177.5   present:10   present:10  
##  Median :184.0                            
##  Mean   :183.0                            
##  3rd Qu.:191.5                            
##  Max.   :200.0

The ggplot is an important instrument to visualize data from multifactorial experiments

ggplot(bpdata,aes(x=drug,y=bloodpressure,fill=obesity))+
  geom_boxplot()+
  geom_point(position = position_jitter())

To infer about the interaction between drug and obesity, we will carry out the interaction plot.

interaction.plot(bpdata$drug,bpdata$obesity,bpdata$bloodpressure)

The interaction plot indicates that the drug effect in the obesity group is stronger than in the “control” group. This finding in the graphic is an evidence of interaction between drug and obesity conditions. To test for main effects and the effect of interaction we will run the two-way ANOVA with interaction following the commands below.

# Running ANOVA with interaction term
av3<-aov(bloodpressure~drug*obesity,data=bpdata)
av3
## Call:
##    aov(formula = bloodpressure ~ drug * obesity, data = bpdata)
## 
## Terms:
##                 drug obesity drug:obesity Residuals
## Sum of Squares   720     500          320      1000
## Deg. of Freedom    1       1            1        16
## 
## Residual standard error: 7.905694
## Estimated effects may be unbalanced
# ANOVA table
summary(av3)
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## drug          1    720   720.0   11.52 0.00371 **
## obesity       1    500   500.0    8.00 0.01211 * 
## drug:obesity  1    320   320.0    5.12 0.03792 * 
## Residuals    16   1000    62.5                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(av3)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = bloodpressure ~ drug * obesity, data = bpdata)
## 
## $drug
##                diff     lwr       upr     p adj
## present-absent  -12 -19.495 -4.505003 0.0037059
## 
## $obesity
##                diff     lwr       upr     p adj
## present-absent  -10 -17.495 -2.505003 0.0121093
## 
## $`drug:obesity`
##                                diff      lwr       upr     p adj
## present:absent-absent:absent     -4 -18.3051 10.305099 0.8534038
## absent:present-absent:absent     -2 -16.3051 12.305099 0.9775889
## present:present-absent:absent   -22 -36.3051 -7.694901 0.0022719
## absent:present-present:absent     2 -12.3051 16.305099 0.9775889
## present:present-present:absent  -18 -32.3051 -3.694901 0.0115535
## present:present-absent:present  -20 -34.3051 -5.694901 0.0051230