R software has different resources for ANOVA and post-hoc tests.
A nutritionist is interested in comparing the effects of two alternative interventions: exercise and diet in reducing the levels of cholesterol. The magnitude of the effect is measured as the difference from a control group (non-intervention).
# Group 1
control<-c(310,320,370,240)
# Group 2
diet<-c(280,250,270,280)
# Group 3
exercise<-c(240,230,180,270)
Formatting the data for the one-way ANOVA
# concatenate data about cholesterol
cholesterol<-c(control,diet,exercise)
# concatenate data about treatment
treatment<-c( rep('control',4) , rep('diet',4) , rep('exercise',4))
# create a data frame
d<-cbind.data.frame(cholesterol = cholesterol , treatment = treatment)
# showing the data
d
## cholesterol treatment
## 1 310 control
## 2 320 control
## 3 370 control
## 4 240 control
## 5 280 diet
## 6 250 diet
## 7 270 diet
## 8 280 diet
## 9 240 exercise
## 10 230 exercise
## 11 180 exercise
## 12 270 exercise
# summarizing the data
summary(d)
## cholesterol treatment
## Min. :180.0 control :4
## 1st Qu.:240.0 diet :4
## Median :270.0 exercise:4
## Mean :270.0
## 3rd Qu.:287.5
## Max. :370.0
The boxplot is a graphical representation that indicates how different the groups are:
boxplot(d$cholesterol ~ d$treatment)
R software has recently incoporated a series of resources in the library ggplot2.
library(ggplot2)
gx<-ggplot(d,aes(x=treatment,y=cholesterol,colour=treatment))+
geom_boxplot()
gx
The advantage of ggplot is that we can incorporate layers of informations. As an example, we will add dots for individual observations to the boxplot.
gx<-gx+geom_point()
gx
# anova decomposition of total variation
av<-aov(d$cholesterol ~ d$treatment)
av
## Call:
## aov(formula = d$cholesterol ~ d$treatment)
##
## Terms:
## d$treatment Residuals
## Sum of Squares 12800 13400
## Deg. of Freedom 2 9
##
## Residual standard error: 38.58612
## Estimated effects may be unbalanced
# the classical anova table
summary(av)
## Df Sum Sq Mean Sq F value Pr(>F)
## d$treatment 2 12800 6400 4.299 0.0489 *
## Residuals 9 13400 1489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic for the ANOVA test is calculated as a ratio of two variances (Between Treatment / Within Treatment). To understand what is expected for this statistic when \(H_0\), the null hypothesis, is true, we will simulate 1000 values from the F distribution with 2 and 9 degrees of freedom.
# Generate data from the f distribution
fvalues<-rf(n=1000,df1=2,df2=9)
# plot histogram of fvalues
hist(fvalues)
# plot a line density
plot(density(fvalues))
The default implementation for Post-hoc tests in R comes from the Tukey Honest Significant Difference
posthoc<-TukeyHSD(av,'d$treatment')
posthoc
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = d$cholesterol ~ d$treatment)
##
## $`d$treatment`
## diff lwr upr p adj
## diet-control -40 -116.1785 36.178503 0.3506917
## exercise-control -80 -156.1785 -3.821497 0.0401865
## exercise-diet -40 -116.1785 36.178503 0.3506917
The naive way of calculating the sample size for ANOVA would have been the evaluation of the sample size for all pairwise comparisons between indepedent populations and choosing the largest sample size.
Instead of this procedure, we can use the function power.anova.test that is present in R base package.
We wil evaluate what will be the sample size to detect the effect size in the same magnitude as the one observed in the cholesterol dataset.
power.anova.test(groups = 3, between.var = 6400 , within.var = 1489, sig.level=0.05 , power=0.8)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 2.439193
## between.var = 6400
## within.var = 1489
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
To capture variation in this magnitude, the sample \(n=3\) is sufficient.
Consider that Gender is also included as a factor to be controled in the study design. We will artificially incorporate this factor in our dataset about cholesterol with a simple R command.
# creating a variable for gender
d$gender<-rep(c('M','M','F','F'),3)
# the new dataset
d
## cholesterol treatment gender
## 1 310 control M
## 2 320 control M
## 3 370 control F
## 4 240 control F
## 5 280 diet M
## 6 250 diet M
## 7 270 diet F
## 8 280 diet F
## 9 240 exercise M
## 10 230 exercise M
## 11 180 exercise F
## 12 270 exercise F
library(ggplot2)
gx1<-ggplot(d,aes(x=treatment,y=cholesterol,colour=treatment))+
geom_boxplot()+
geom_point(aes(shape=gender))
gx1
Performing the two-way ANOVA
# ANOVA with two factors
av2<-aov(d$cholesterol~d$treatment+d$gender)
av2
## Call:
## aov(formula = d$cholesterol ~ d$treatment + d$gender)
##
## Terms:
## d$treatment d$gender Residuals
## Sum of Squares 12800.000 33.333 13366.667
## Deg. of Freedom 2 1 8
##
## Residual standard error: 40.87583
## Estimated effects may be unbalanced
# ANOVA table with two factors
summary(av2)
## Df Sum Sq Mean Sq F value Pr(>F)
## d$treatment 2 12800 6400 3.83 0.0681 .
## d$gender 1 33 33 0.02 0.8912
## Residuals 8 13367 1671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The two-ANOVA analysis did not include interactions, only main effects. To see whether interactions play an important role in the ANOVA, someone can use the function interaction.plot.
# interaction plot to the artificial cholesterol dataset
interaction.plot(d$gender,d$treatment,d$cholesterol)
In the cholesterol example we do not have evidence of interaction according to the interaction plot tool. We will explore another example
Consider an example where the blood pressure will be associated to the use of a drug that interacts with obesity. The data is reported below.
bloodpressure <- c(158,163,173,178,168,188,183,198,178,193,186,191,196,181,176,185,190,195,200,180)
obesity <- factor(c(rep("present",10),rep("absent",10)))
drug <- factor(rep(c(rep("present",5),rep("absent",5)),2))
bpdata <- data.frame(bloodpressure, obesity, drug)
The data frame with the data is shown below.
bpdata
## bloodpressure obesity drug
## 1 158 present present
## 2 163 present present
## 3 173 present present
## 4 178 present present
## 5 168 present present
## 6 188 present absent
## 7 183 present absent
## 8 198 present absent
## 9 178 present absent
## 10 193 present absent
## 11 186 absent present
## 12 191 absent present
## 13 196 absent present
## 14 181 absent present
## 15 176 absent present
## 16 185 absent absent
## 17 190 absent absent
## 18 195 absent absent
## 19 200 absent absent
## 20 180 absent absent
The summary of the data.
summary(bpdata)
## bloodpressure obesity drug
## Min. :158.0 absent :10 absent :10
## 1st Qu.:177.5 present:10 present:10
## Median :184.0
## Mean :183.0
## 3rd Qu.:191.5
## Max. :200.0
The ggplot is an important instrument to visualize data from multifactorial experiments
ggplot(bpdata,aes(x=drug,y=bloodpressure,fill=obesity))+
geom_boxplot()+
geom_point(position = position_jitter())
To infer about the interaction between drug and obesity, we will carry out the interaction plot.
interaction.plot(bpdata$drug,bpdata$obesity,bpdata$bloodpressure)
The interaction plot indicates that the drug effect in the obesity group is stronger than in the “control” group. This finding in the graphic is an evidence of interaction between drug and obesity conditions. To test for main effects and the effect of interaction we will run the two-way ANOVA with interaction following the commands below.
# Running ANOVA with interaction term
av3<-aov(bloodpressure~drug*obesity,data=bpdata)
av3
## Call:
## aov(formula = bloodpressure ~ drug * obesity, data = bpdata)
##
## Terms:
## drug obesity drug:obesity Residuals
## Sum of Squares 720 500 320 1000
## Deg. of Freedom 1 1 1 16
##
## Residual standard error: 7.905694
## Estimated effects may be unbalanced
# ANOVA table
summary(av3)
## Df Sum Sq Mean Sq F value Pr(>F)
## drug 1 720 720.0 11.52 0.00371 **
## obesity 1 500 500.0 8.00 0.01211 *
## drug:obesity 1 320 320.0 5.12 0.03792 *
## Residuals 16 1000 62.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(av3)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = bloodpressure ~ drug * obesity, data = bpdata)
##
## $drug
## diff lwr upr p adj
## present-absent -12 -19.495 -4.505003 0.0037059
##
## $obesity
## diff lwr upr p adj
## present-absent -10 -17.495 -2.505003 0.0121093
##
## $`drug:obesity`
## diff lwr upr p adj
## present:absent-absent:absent -4 -18.3051 10.305099 0.8534038
## absent:present-absent:absent -2 -16.3051 12.305099 0.9775889
## present:present-absent:absent -22 -36.3051 -7.694901 0.0022719
## absent:present-present:absent 2 -12.3051 16.305099 0.9775889
## present:present-present:absent -18 -32.3051 -3.694901 0.0115535
## present:present-absent:present -20 -34.3051 -5.694901 0.0051230