Research paper: very explicitly connect your lit review to your topic ("This article is the reason why I am including this variable in my model"). This can go at the end; the lit review justifies why you did what you did. There will be a drop-in Zoom session the week after Thanksgiving for anyone needing help finishing their paper. Even if you are happy with your first draft grade, please revise the paper based on the suggestions he gives in his grading comments. You don't need to rewrite the whole paper, just address his comments!
He’s offering 5 extra credit points on the final if you bring your paper to the tutoring center and send him your receipt
"Even when controlling for x and y, we saw this effect": this is what an interpretation of a linear model sounds like.
Lit review– why you are researching what you are researching, why you are using a certain method, why you are using different variables
As we discussed previously, ANOVA is a test to see if there is statistically significant variance across groups in a data set. This can be done visually:
library(ggplot2)  # provides ggplot() and the diamonds dataset used later
ggplot(mtcars, aes(x = cyl, y = mpg, group = cyl)) + geom_boxplot()
Or mathematically:
mpg_cyl_anova<-aov(mpg~cyl,data=mtcars)
summary(mpg_cyl_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 1 817.7 817.7 79.56 6.11e-10 ***
## Residuals 30 308.3 10.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
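One caveat: cyl is stored as a number in mtcars, so aov() above treats it as a continuous predictor (which is why it gets 1 Df). A minimal sketch of the group-wise version (mtcars_grouped is just an illustrative copy), converting cyl to a factor first:
mtcars_grouped <- mtcars
mtcars_grouped$cyl <- as.factor(mtcars_grouped$cyl)  # treat 4/6/8 cylinders as groups
summary(aov(mpg ~ cyl, data = mtcars_grouped))       # cyl now has 2 Df (three groups)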
The incredibly small p-value tells us that there REALLY IS a difference in mean mpg across these groups.
This is helpful when exploring your data – it gives you an idea if something is happening to a particular variable when analyzed against another variable. It is a useful starting point!
Here is price vs. cut in diamonds:
ggplot(diamonds,aes(x=cut,y=price,group=cut)) + geom_boxplot()
It’s difficult to just look at this and decide that there is a statistically significant difference between groups. The boxes all occupy similar spots on the graph.
diamonds_anova<-aov(price~cut,data=diamonds)
summary(diamonds_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 1.104e+10 2.760e+09 175.7 <2e-16 ***
## Residuals 53935 8.474e+11 1.571e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The very small p-value suggests that there REALLY IS some difference in means between groups. But where? It's still hard to tell. This doesn't tell us where the statistically significant differences are, just that they exist somewhere.
The good news is that we are at least on the right track.
If we want to dig further, it's better to expand ANOVA into the linear model. As it happens, ANOVA is basically just a special case of "lm()". We can use "lm()" instead of "aov()" to see a detailed explanation of where the "difference" between groups is actually happening, with the help of "TukeyHSD":
diamonds_lm_anova<-lm(price~cut,data=diamonds)
TukeyHSD(aov(diamonds_lm_anova))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = diamonds_lm_anova)
##
## $cut
## diff lwr upr p adj
## Good-Fair -429.89331 -740.44880 -119.3378 0.0014980
## Very Good-Fair -376.99787 -663.86215 -90.1336 0.0031094
## Premium-Fair 225.49994 -59.26664 510.2665 0.1950425
## Ideal-Fair -901.21579 -1180.57139 -621.8602 0.0000000
## Very Good-Good 52.89544 -130.15186 235.9427 0.9341158
## Premium-Good 655.39325 475.65120 835.1353 0.0000000
## Ideal-Good -471.32248 -642.36268 -300.2823 0.0000000
## Premium-Very Good 602.49781 467.76249 737.2331 0.0000000
## Ideal-Very Good -524.21792 -647.10467 -401.3312 0.0000000
## Ideal-Premium -1126.71573 -1244.62267 -1008.8088 0.0000000
This creates a list of every pairwise combination of groups and whether their means differ (the "p adj" column); if p adj is less than 0.05, you can assume they are different.
Which groups have means that AREN'T statistically different? In this case, Premium-Fair and Very Good-Good are the only two with a large p-value, where there is not a difference.
Be careful: this can create a huge list if you have many groups (with k groups it produces k*(k-1)/2 pairwise comparisons).
KEEP IN MIND: The Tukey HSD only works on a categorical (factor) grouping variable. It will not work on continuous data.
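As a quick sanity check (not in the original notes), base R's choose() gives the number of pairwise comparisons:
choose(5, 2)  # the 5 levels of cut produce 10 comparisons, matching the table above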
We have touched on this lightly before, but sometimes when you are researching the "effect of x and z upon y", you notice that "x and z together" has a greater effect than x and z separately. This is called an "interaction" or interaction term.
Remember tooth growth? We saw that orange juice is better for growth, but it also depends on the dosage.
Can we model orange juice + dosage? Interestingly, the correct formula is actually orange juice * dosage. In R, "supp * dose" expands to "supp + dose + supp:dose", i.e., both main effects plus their interaction:
two_way_anova <- aov(len ~ supp * dose, data = ToothGrowth)
summary(two_way_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205.4 205.4 12.317 0.000894 ***
## dose 1 2224.3 2224.3 133.415 < 2e-16 ***
## supp:dose 1 88.9 88.9 5.333 0.024631 *
## Residuals 56 933.6 16.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do we see here? Supplement and dose are both highly significant, which we expected, but "supp:dose" is also significant. This means that ANOVA is detecting a significant effect of supplement and dose TOGETHER, in addition to the regular effects that they have on tooth growth by themselves!
ggplot(ToothGrowth, aes(x = dose, y = len, color = supp, group = supp)) +
geom_line() +
geom_point() +
labs(title = "Interaction Effect of Supplement and Dose on Tooth Growth",
x = "Dose Level",
y = "Tooth Length",
color = "Supplement Type")
At the lowest dose, OJ is WAY up on tooth length. At the 1.0 dose, there is a smaller spread between OJ and VC. At 2.0, OJ and VC are mixed together. At lower doses, OJ looks more effective. At higher doses, the type of supplement matters less. It just matters that they are getting some kind of supplement!
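A cleaner way to see this pattern (a sketch, not from the original notes) is to plot only the mean tooth length in each supplement/dose cell:
# Mean tooth length per dose for each supplement (a classic interaction plot)
ggplot(ToothGrowth, aes(x = dose, y = len, color = supp, group = supp)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  labs(title = "Mean Tooth Length by Dose and Supplement",
       x = "Dose (mg/day)", y = "Mean Tooth Length",
       color = "Supplement Type")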
Let’s “smooth” this out even more:
ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "Tooth Length vs. Vitamin C Dose", x = "Dose (mg/day)", y = "Tooth Length")
## `geom_smooth()` using formula = 'y ~ x'
We can visualize here that OJ is broadly more effective for tooth growth, especially at lower doses. However, at the highest dose (2 mg/day), the effect "narrows" and the distinctions seem to vanish. You will notice, however, that the relationship between dose and tooth length is not quite linear: it goes up and then levels off. In a case like this, a linear model is actually not giving us the whole picture:
ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Tooth Length vs. Vitamin C Dose", x = "Dose (mg/day)", y = "Tooth Length")
## `geom_smooth()` using formula = 'y ~ x'
We see that the “loess” model is describing the flattening of effects between supplement types better than the traditional “linear” model. More on this later in the course.
# Work on a copy of ToothGrowth and treat supplement and dose as categorical factors
tooth_data <- ToothGrowth
tooth_data$supp <- as.factor(tooth_data$supp)
tooth_data$dose <- as.factor(tooth_data$dose)
# Linear model with both main effects and the supp:dose interaction
tooth_model_lm <- lm(len ~ supp * dose, data = tooth_data)
summary(tooth_model_lm)
##
## Call:
## lm(formula = len ~ supp * dose, data = tooth_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.20 -2.72 -0.27 2.65 8.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.230 1.148 11.521 3.60e-16 ***
## suppVC -5.250 1.624 -3.233 0.00209 **
## dose1 9.470 1.624 5.831 3.18e-07 ***
## dose2 12.830 1.624 7.900 1.43e-10 ***
## suppVC:dose1 -0.680 2.297 -0.296 0.76831
## suppVC:dose2 5.330 2.297 2.321 0.02411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.631 on 54 degrees of freedom
## Multiple R-squared: 0.7937, Adjusted R-squared: 0.7746
## F-statistic: 41.56 on 5 and 54 DF, p-value: < 2.2e-16
Vitamin C by itself has a negative effect on tooth length relative to orange juice, but vitamin C at dose 2 has an additional positive effect. [OJ is the reference level, absorbed into the intercept]
So what’s the big deal? The linear model allows us to break down the ANOVA into parts. We see that “suppVC” (the categorical variable “VC” in “supp”) has a significant estimate of -5.25 compared to “suppOJ” (not seen). This means that, all things being equal, “suppOJ” is better for tooth growth than “suppVC” (which is what we suspected).
Likewise, we see that dose1 and dose2 are both significant and positive – meaning that tooth growth increases as dosage increases – all things being equal, including substance type (VC vs. OJ).
The real magic happens when we look at suppVC:dose1 and suppVC:dose2. These two coefficients tell us whether the INTERACTION between supplement and dose is having an effect, ABOVE AND BEYOND the effects of supplement and dose by themselves. We see that only suppVC:dose2 is significant, and it is positive. This means that the combination of VC and a very high dose has an effect ABOVE AND BEYOND the effects of supplementation and dosage by themselves. Interestingly, this suggests that at the very high dose, vitamin C catches up to (and even slightly exceeds) orange juice.
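To make that concrete, one way (a sketch; cell_grid is just an illustrative name) is to have the model predict the mean length for each supplement/dose cell. For example, the predicted dose-2 means are 13.23 + 12.83 = 26.06 for OJ and 13.23 - 5.25 + 12.83 + 5.33 = 26.14 for VC, which is exactly the pattern described above.
# Predicted mean tooth length for every supp/dose combination in the fitted model
cell_grid <- expand.grid(supp = levels(tooth_data$supp),
                         dose = levels(tooth_data$dose))
cell_grid$predicted_len <- predict(tooth_model_lm, newdata = cell_grid)
cell_grid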
# A GLM may be a better fit here because the relationship is not strictly linear. The dependent variable is continuous; the independent variables are continuous or categorical.
tooth.glm <- glm(len~supp*dose, data=tooth_data, family = "gaussian")
summary(tooth.glm)
##
## Call:
## glm(formula = len ~ supp * dose, family = "gaussian", data = tooth_data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.230 1.148 11.521 3.60e-16 ***
## suppVC -5.250 1.624 -3.233 0.00209 **
## dose1 9.470 1.624 5.831 3.18e-07 ***
## dose2 12.830 1.624 7.900 1.43e-10 ***
## suppVC:dose1 -0.680 2.297 -0.296 0.76831
## suppVC:dose2 5.330 2.297 2.321 0.02411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 13.18715)
##
## Null deviance: 3452.21 on 59 degrees of freedom
## Residual deviance: 712.11 on 54 degrees of freedom
## AIC: 332.71
##
## Number of Fisher Scoring iterations: 2
The model itself is significant (a rough, "bootleg" p-value can be computed by comparing the drop in deviance, null minus residual, to the change in degrees of freedom).
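If you want to compute that rough p-value yourself, one sketch (using the deviance pieces stored in the glm object; only an approximation here, since the gaussian dispersion is estimated) is:
# Approximate overall-model test: drop in deviance vs. change in degrees of freedom
pchisq(tooth.glm$null.deviance - tooth.glm$deviance,
       df = tooth.glm$df.null - tooth.glm$df.residual,
       lower.tail = FALSE)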
The findings are similar to what we found with the linear model. Dose 2 has a large, positive, statistically significant effect on tooth length. Vitamin C at Dose 2 has a positive and statistically significant effect.
AIC(tooth_model_lm, tooth.glm)
## df AIC
## tooth_model_lm 7 332.7056
## tooth.glm 7 332.7056
We usually prefer the model with the lowest AIC; it is the better fit. In this case the AIC is identical, but Karlerik would recommend using the GLM because we know the underlying data is not linear.
Let’s run ANOVA using the linear model on the plant growth dataset.
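A minimal sketch of that exercise, using the built-in PlantGrowth data (columns weight and group):
# ANOVA on plant weight across the three treatment groups, fit as a linear model
plant_lm <- lm(weight ~ group, data = PlantGrowth)
summary(plant_lm)
# Pairwise comparisons between the groups
TukeyHSD(aov(weight ~ group, data = PlantGrowth))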
For this week’s homework:
Select your favorite dependent variable from your dataset, and ensure that it meets the criteria for parametric tests (normal distribution, continuous, etc.). If it does not, transform it using the log or sqrt transformation. Demonstrate that you have done this, if needed. Run a histogram to reveal the distribution of your dependent variable.
Run an ANOVA with "aov()" on this variable, including some kind of grouping variable (dependent~group, data=yourdata). A bare skeleton of the whole workflow is sketched after these instructions.
save the resulting aov as an object in the homework
run “summary()” on the saved object. What do you see? Explain the results to me.
Run the same aov, but this time as a linear model “lm()”. Save the linear model, and run summary() on it, just like you did with the aov(). What do you see? Explain the results to me. Are they similar? Different? Does the linear model explain the data more than the aov()?
Knit and submit via CANVAS
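A bare-bones skeleton of the homework chunk (yourdata, dependent, and group are placeholder names to swap for your own dataset and columns):
# Placeholders: replace yourdata, dependent, and group with your own data frame and columns
hist(yourdata$dependent)                         # check the distribution of the dependent variable
# yourdata$dependent <- log(yourdata$dependent)  # transform here if it is skewed
hw_aov <- aov(dependent ~ group, data = yourdata)
summary(hw_aov)
hw_lm <- lm(dependent ~ group, data = yourdata)
summary(hw_lm)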