As we discussed previously, ANOVA is a test to see if there is statistically significant variance across groups in a data set. This can be done visually:
ggplot(mtcars,aes(x=cyl,y=mpg,group=cyl)) + geom_boxplot()
Or mathematically:
mpg_cyl_anova<-aov(mpg~cyl,data=mtcars)
summary(mpg_cyl_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 1 817.7 817.7 79.56 6.11e-10 ***
## Residuals 30 308.3 10.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The incredibly small p-value tells us that there REALLY IS some variance across these groups.
This is helpful when exploring your data – it gives you an idea if something is happening to a particular variable when analyzed against another variable.
Here is price vs. cut in diamonds:
ggplot(diamonds,aes(x=cut,y=price,group=cut)) + geom_boxplot()
It’s difficult to just look at this and decide that there is a statistically significant difference between groups. The boxes all occupy similar spots on the graph.
diamonds_anova<-aov(price~cut,data=diamonds)
summary(diamonds_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 1.104e+10 2.760e+09 175.7 <2e-16 ***
## Residuals 53935 8.474e+11 1.571e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The very small p-value suggests that there REALLY IS some difference in means between groups. But where? It’s still hard to tell.
The good news, is that we are at least on the right track.
If we want to dig further, it’s better to expand ANOVA into the linear model. As it happens, ANOVA is basically just a special case of “lm()”. We can use “lm()” instead of “aov()” to see a detailed explaination of where the “difference” between groups is actually happening, with the help of “TukeyHSD”:
diamonds_lm_anova<-lm(price~cut,data=diamonds)
TukeyHSD(aov(diamonds_lm_anova))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = diamonds_lm_anova)
##
## $cut
## diff lwr upr p adj
## Good-Fair -429.89331 -740.44880 -119.3378 0.0014980
## Very Good-Fair -376.99787 -663.86215 -90.1336 0.0031094
## Premium-Fair 225.49994 -59.26664 510.2665 0.1950425
## Ideal-Fair -901.21579 -1180.57139 -621.8602 0.0000000
## Very Good-Good 52.89544 -130.15186 235.9427 0.9341158
## Premium-Good 655.39325 475.65120 835.1353 0.0000000
## Ideal-Good -471.32248 -642.36268 -300.2823 0.0000000
## Premium-Very Good 602.49781 467.76249 737.2331 0.0000000
## Ideal-Very Good -524.21792 -647.10467 -401.3312 0.0000000
## Ideal-Premium -1126.71573 -1244.62267 -1008.8088 0.0000000
This creates a list of all the combinations of groups and if their means are different or not (less than 0.05, you can assume they’re different).
Which groups have means that ARENT statistically different?
Be careful: this can create a huge list if you have many groups (indeed, it will create groups*2 elements)
KEEP IN MIND: The Tukey HSD only works on categorical variables.
We have touched on this lightly before, but sometimes when you are researching the “effect of x and z upon y”, you notice that “x and z together” has a greater effect than just “x and z by itself”. This is called an “interaction”.
Remember tooth growth? We saw that orange juice is better for growth, but it also depends on the dosage.
Can we model orange juice + dosage? Interestingly, the correct equation is actually orange juice * dosage:
two_way_anova <- aov(len ~ supp * dose, data = ToothGrowth)
summary(two_way_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205.4 205.4 12.317 0.000894 ***
## dose 1 2224.3 2224.3 133.415 < 2e-16 ***
## supp:dose 1 88.9 88.9 5.333 0.024631 *
## Residuals 56 933.6 16.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do we see here? Supplement and Dose are both highly significant, which we expected, but “supp:dose” is also significant and positive. This means that ANOVA is detecting a significant effect of supplement and dose TOGETHER, even in addition to the regular effects that they have on Tooth Growth by themselves!
ggplot(ToothGrowth, aes(x = dose, y = len, color = supp, group = supp)) +
geom_line() +
geom_point() +
labs(title = "Interaction Effect of Supplement and Dose on Tooth Growth",
x = "Dose Level",
y = "Tooth Length",
color = "Supplement Type")
Let’s “smooth” this out even more:
ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "Tooth Length vs. Vitamin C Dose", x = "Dose (mg/day)", y = "Tooth Length")
## `geom_smooth()` using formula = 'y ~ x'
We can visualize here that OJ is broadly more effective for tooth growth – especially at lower doses. However, at the highest dose (2mg), the effect “narrows” and the distinctions seem to vanish. You will notice, however, that the relationship between the dose*interaction is not quite linear. It sort of goes up and then levels off. In a case like this, a linear model is actually not giving us the whole picture:
ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Tooth Length vs. Vitamin C Dose", x = "Dose (mg/day)", y = "Tooth Length")
## `geom_smooth()` using formula = 'y ~ x'
We see that the “loess” model is describing the flattening of effects between suppliment types better than the traditional “linear” model. More on this later in the course.
tooth_data<-ToothGrowth
tooth_data$supp<-as.factor(tooth_data$supp)
tooth_data$dose<-as.factor(tooth_data$dose)
tooth_model_lm<-lm(len~supp*dose,data=tooth_data)
summary(tooth_model_lm)
##
## Call:
## lm(formula = len ~ supp * dose, data = tooth_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.20 -2.72 -0.27 2.65 8.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.230 1.148 11.521 3.60e-16 ***
## suppVC -5.250 1.624 -3.233 0.00209 **
## dose1 9.470 1.624 5.831 3.18e-07 ***
## dose2 12.830 1.624 7.900 1.43e-10 ***
## suppVC:dose1 -0.680 2.297 -0.296 0.76831
## suppVC:dose2 5.330 2.297 2.321 0.02411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.631 on 54 degrees of freedom
## Multiple R-squared: 0.7937, Adjusted R-squared: 0.7746
## F-statistic: 41.56 on 5 and 54 DF, p-value: < 2.2e-16
So what’s the big deal? The linear model allows us to break down the ANOVA into parts. We see that “suppVC” (the categorical variable “VC” in “supp”) has a significant estimate of -5.25 compared to “suppOJ” (not seen). This means that, all things being equal, “suppOJ” is better for tooth growth than “suppVC” (which is what we suspected).
Likewise, we see that dose1 and dose2 are both significant and positive – meaning that tooth growth increases as dosage increases – all things being equal, including substance type (VC vs. OJ).
The real magic happens when we look at suppVC:dose1 and suppVC:dose2. These two coefficients are telling us if the INTERACTION between supplement and dose are having an effect, ABOVE AND BEYOND the effects of just suppliment and dose by themselves. We see that only suppVC:dose2 is significant. It’s also positive. This means that the combination of VC and a very high dose has an effect ABOVE AND BEYOND the effect of supplementation and dosage by itself. Interestingly, this suggests that very high doses of vitamin C is somewhat more effective than just regular dosages of orange juice by itself.
Let’s run ANOVA using the linear model on the plant growth dataset.
For this week’s homework:
Select your favorite dependent variable from your dataset, and ensure that it meets the criteria for parametric tests (normal distribution, continuous, etc.). If it does not, transform it using the log or sqrt transformation. Demonstrate that you have done this, if needed. Run a histogram to reveal the distribution of your dependent variable.
run an anova “aov()” this variable, including some kind of grouping variable (dependent~group,data=yourdata)
save the resulting aov as an object in the homework
run “summary()” on the saved object. What do you see? Explain the results to me.
Run the same aov, but this time as a linear model “lm()”. Save the linear model, and run summary() on it, just like you did with the aov(). What do you see? Explain the results to me. Are they similar? Different? Does the linear model explain the data more than the aov()?
Knit and submit via CANVAS