Markdown Author: Jessie Bell, 2023
Libraries Used: ggplot
Answers: sea green
Preamble: ANOVA stands for Analysis of Variance. It tests for differences in means among 2 or more groups, and with two-way ANOVA, you can test two factors. Your response variable should technically be continuous. In ANOVA, each group of samples is assumed to have been drawn from normally distributed populations that have the same variance (\(\sigma^2\)).
Before running an ANOVA we check that our data do not violate ANY of the following assumptions:
1. NORMALITY: the data have normal histograms and QQplots
2. HOMEOSCEDASTICITY: the data have approximately equal variance (\(s^2\)). Note that variance is standard deviation squared (see chapter 3 of your text for a refresher on summary stats). An interesting fact about ANOVA is that the mean square error (MSE) output in the ANOVA table estimates \(\sigma^2\).
If either assumption is violated, some additional thought is required to determine whether the violation can safely be ignored, if data transformation is warranted, or if a different approach to analysis is required. This is discussed in Ch. 13 of your textbook.
Data types should be
Response Variable: 1 continuous
Predictor Variable: 1 categorical (3+ levels)
#try this yourself, iris data is built into R so you should be able to use it without setting a working directory.
irisData <- iris
head(irisData)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
#here the only factors (aka: CATEGORICAL DATA) in our dataset IRIS are SPECIES. This means, these data have 1 factor (there is 1 categorical column in the data)
count(irisData$Species)
## x freq
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
#There are 3 different species in the data (aka: 3 levels).
#lets start using ggplot
#type the following code without teh hashtag
#install.packages("ggplot")
#use quotes around the name in the install.packages function
#next type library(ggplot2)
#now the below codes should work to check for equal variance and normality
#THE NORMALITY ASSUMPTION
ggplot(irisData, aes(Sepal.Length, fill=Species))+
geom_histogram(bins = 30)+
labs(title="Are the data normal?")
#LOOKING PRETTY FRIGGIN NORMAL TO ME. If you are still skeptical, try the shapiro-wilk test.
#a p-value > 0.05 means the data are normal when you run this test.
#CHECK FOR SIMILAR VARIANCE:
ggplot(irisData, aes(x=Species, y=Sepal.Length))+
geom_boxplot()
#you can color code your boxes by species like this:
ggplot(irisData, aes(x=Species, y=Sepal.Length, color=Species))+
geom_boxplot()+
labs(title="How's the variance?")
#variance seems pretty equal, with an outlier at the bottom of virginica. I am not too worried about this outlier, but if you are feel free to run Levene's Test in the car package.
#If the p-value > 0.05 you can assume your data have equal variance.
Note: statistical hypotheses are NOT the same as scientific hypotheses.
ANOVA hypotheses look like this
H0: \(\mu_1\) = \(\mu_2\) = \(\mu_3\)
HA: at least one of the means is statistically different from the others
anova2<-aov(Response variable~Explanatory variable, data=DATA)
anova1 <- aov(Sepal.Length ~ Species, data=irisData)
summary(anova1) #be sure to takethe summary of the aov() function
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
**Crazy! Our p-value is tiny (0.0000000000000002) and we can reject our null hypothesis in favor of our alternative. The mean sepal lengths seem to be different between at least 1 species of iris. We can now figure out which one exactly by doing the post-hoc Tukey Test
#first make a table of the means
tapply(iris$Sepal.Length, iris$Species, mean)
## setosa versicolor virginica
## 5.006 5.936 6.588
#it looks like it will be setosa vs. virginica in this. Lets try Tukey now
TukeyHSD(anova1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Length ~ Species, data = irisData)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
the dataset collected by Darnel land Munguia (2011) provided evidence that mean sepal length (cm) differed among all three species of iris sp. (F2,147 = 119, p<0.001 (never say 0 for p-value)) with I. setosa being the shortest and I. virginica being the longest (Tukey HSD, p <0.001).
Data types should be
Response Variable: 1 continuous
Predictor Variable: 2 categorical (i.e. flower color and flower species)
I don’t have time to make an example for you BUT! Here is a good website explaining the steps:
code to run the 2-way anova to test model assumptions:
aov(formula = Sepal.Length ~ Species * Sepal.Color, data = irisData)
note: sepal color is not in the iris dataset, but this is to show you an example of how to set it up.
The F-Distribution with different degrees of freedom (df)
#And the F curve, with 2 and 27 degrees of freedom, looks like this:
x <- seq(0,15, by = 0.001)
curve(df(x, df1 = 2, df2 = 27), xlim = c(0,10), ylab = "Density")
#you are asked to make the f-dist with 4 and 12 degrees of freedom
x <- seq(0,15, by = 0.001)
curve(df(x, df1 = 4, df2 = 8), xlim = c(0,10), ylab = "Density", col="green")
curve(df(x, df1 = 2, df2 = 48), xlim = c(0,10), ylab = "Density", col="blue")
curve(df(x, df1 = 2, df2 = 24), xlim = c(0,10), ylab = "Density", col="cyan")
ANOVA
#setup the table
f <- c(rep("a", times=6), rep("b", times=6))
g1 <- c(6.9, 7.3, 11.2, 13.6, 8.0, 3.3, 8.3, 6.3, 8.1, 5.7, 13.8, 12.8) #I added the numbers like this so that all the a's wil still line up and all the b's will still line up
#combine the data
dan <- data.frame(f, g1)
#check your dataframe. You should have 6 as and 6 bs and they should correspond to the table in the pdf handout
head(dan)
## f g1
## 1 a 6.9
## 2 a 7.3
## 3 a 11.2
## 4 a 13.6
## 5 a 8.0
## 6 a 3.3
Data types
Response Variable (y): g1, continuous numeric
Predictor Variable (x): f, 1 categorical (2 levels)
#THE NORMALITY ASSUMPTION
ggplot(dan, aes(g1, fill=f))+
geom_histogram(bins=5)
#Hard to say with a graph. Lets run qqplot to double check normality
qqnorm(dan$g1, main = "Are the data normal?")
qqline(dan$g1)
#I am def. not convinced that that follows a straight diagonal line. It is hard to tell because the data are so small though. Let's assume they follow the assumption of normality for now.
#let's triple check iwth Shapiro
shapiro.test(dan$g1)
##
## Shapiro-Wilk normality test
##
## data: dan$g1
## W = 0.926, p-value = 0.3396
#a p-value > 0.05 for Shapiro Wilk means the data are normal when you run this test. LOOKS PRETTTTY GOOD!
#CHECK FOR SIMILAR VARIANCE ASSUMPTION:
ggplot(dan, aes(f, g1, color=f))+
geom_jitter() #you can use a jitter plot or a bocplot here. I was always taught to run a jitter for the variance, since this assumption does not care abotu teh mean.
#variance seems pretty equal, our dataset is like so smoll tho. Because our dataset is so small, we can assume that the more we sample, the easier to answer this question
#If you are not satisfied with the variance of the data, run Levene's test from the car package
#If the p-value of Levene's test is > 0.05 you can assume your data have equal variance.
H0: \(\mu_1\) = \(\mu_2\)
HA: at least one of the means is statistically different from the others
PLEASE NOTE: I AM REALLY NOT SURE WHY WE ARENT RUNNIN A t-test ON THESE DATA. BESIDES THAT CONFUSION, I WILL CONTINUE THE STEPS YOU ARE ASKED TO COMPLETE.
anova2 <- aov(g1 ~ f, data=dan)
summary(anova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## f 1 1.84 1.841 0.152 0.705
## Residuals 10 121.30 12.130
#p>0.05 and we fail to reject H0
Run this to determine which two means are different if you have 3 levels! AND IF YOUR P-VALUE ABOVE IS LESS THAN 0. Otherwise, the means are not different and there is no reason to run Tukey-HSD.
This is not a good example. Here you only have 2 LEVELS (a, b). I think this is actually a t-test. If it were an anova problem it would have one more level (i.e. a, b, and c – not just a, b)
**The Island Dataset should work well with an anova though, it has the Face category with 4 levels (“north”, “south”, “east”, “west”).
the dataset we made up above and labelled “dan”, provided evidence that mean g1 did NOT differ between a and b (F1,10 = 0.152, p > 0.05) and we fail to reject H0.
islandData <- read.csv("island.csv")
head(islandData)
## Island Face AlgalDensity Nutrients
## 1 SanJuan North 47.59 11.296981
## 2 SanJuan North 39.56 8.796913
## 3 SanJuan North 47.62 10.399961
## 4 SanJuan North 53.82 9.903147
## 5 SanJuan North 50.64 6.947449
## 6 SanJuan South 46.43 14.702849
Note: statistical hypotheses are NOT the same as scientific hypotheses.
ANOVA hypotheses look like this H0: \(\mu_1\) \(=\) \(\mu_2\) = \(\mu_3\) HA: at least one of the means is statistically different from the others
anova2<-aov(Response variable~Explanatory variable, data=DATA)
#example from iris dataset (see question 1, 1-factor anova)
anova1 <- aov(Sepal.Length ~ Species, data=irisData)
#change the data to match your island data.
summary(anova1) #be sure to takethe summary of the aov() function
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If your p is low, the null has got to go! If p-value is less that 0.05 you will need to run a TUKEY-HSD to determine which means are different
#example of how to run tukey for iris dataset
#first make a table of the means for funsies
tapply(iris$Sepal.Length, iris$Species, mean)
## setosa versicolor virginica
## 5.006 5.936 6.588
#it looks like it will be setosa vs. virginica in this. Lets try Tukey now
TukeyHSD(anova1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Length ~ Species, data = irisData)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
the island dataset provided evidence that mean sepal length (cm) differed among all three species of iris sp. (F2,147 = 119, p<0.001 (never say 0 for p-value)) with I. setosa being the shortest and I. virginica being the longest (Tukey HSD, p <0.001).
Obviously this needs to change to match the island dataset!