Markdown Author: Jessie Bell, 2023

Libraries Used: ggplot

Answers: sea green

ANOVA

Preamble: ANOVA stands for Analysis of Variance. It tests for differences in means among 2 or more groups, and with two-way ANOVA, you can test two factors. Your response variable should technically be continuous. In ANOVA, each group of samples is assumed to have been drawn from normally distributed populations that have the same variance (\(\sigma^2\)).

Before running an ANOVA we check that our data do not violate ANY of the following assumptions:

  • 1. NORMALITY: the data have normal histograms and QQplots

  • 2. HOMEOSCEDASTICITY: the data have approximately equal variance (\(s^2\)). Note that variance is standard deviation squared (see chapter 3 of your text for a refresher on summary stats). An interesting fact about ANOVA is that the mean square error (MSE) output in the ANOVA table estimates \(\sigma^2\).

If either assumption is violated, some additional thought is required to determine whether the violation can safely be ignored, if data transformation is warranted, or if a different approach to analysis is required. This is discussed in Ch. 13 of your textbook.

1

1-FACTOR vs. 2-FACTOR

1 Factor ANOVA

Answer: A 1-factor ANOVA compares the means across more than two groups.

Data types should be

  • Response Variable: 1 continuous

  • Predictor Variable: 1 categorical (3+ levels)

EXAMPLE
We have 3 species of iris flowers we collected and we want to know if there are differences in some of their characteristics. 🌻
Scientific Question: Is the mean sepal length different depending on the species of iris?
#try this yourself, iris data is built into R so you should be able to use it without setting a working directory. 

irisData <- iris

head(irisData)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#here the only factors (aka: CATEGORICAL DATA) in our dataset IRIS are SPECIES. This means, these data have 1 factor (there is 1 categorical column in the data)

count(irisData$Species) 
##            x freq
## 1     setosa   50
## 2 versicolor   50
## 3  virginica   50
#There are 3 different species in the data (aka: 3 levels). 



Step 1: Does the Iris data violate the 2 assumptions of ANOVA? Plot the data!
#lets start using ggplot

#type the following code without teh hashtag

#install.packages("ggplot")

#use quotes around the name in the install.packages function

#next type library(ggplot2)

#now the below codes should work to check for equal variance and normality

#THE NORMALITY ASSUMPTION
ggplot(irisData, aes(Sepal.Length, fill=Species))+
  geom_histogram(bins = 30)+
  labs(title="Are the data normal?")

#LOOKING PRETTY FRIGGIN NORMAL TO ME. If you are still skeptical, try the shapiro-wilk test. 

#a p-value > 0.05 means the data are normal when you run this test.

#CHECK FOR SIMILAR VARIANCE:

ggplot(irisData, aes(x=Species, y=Sepal.Length))+
  geom_boxplot()

#you can color code your boxes by species like this: 

ggplot(irisData, aes(x=Species, y=Sepal.Length, color=Species))+
  geom_boxplot()+
  labs(title="How's the variance?")

#variance seems pretty equal, with an outlier at the bottom of virginica. I am not too worried about this outlier, but if you are feel free to run Levene's Test in the car package.

#If the p-value > 0.05 you can assume your data have equal variance. 
The data seem to have equal variance, and are normally distributed. Let’s move onto the next step.



Step 2: State your statistical hypotheses

Note: statistical hypotheses are NOT the same as scientific hypotheses.

ANOVA hypotheses look like this

H0: \(\mu_1\) = \(\mu_2\) = \(\mu_3\)

HA: at least one of the means is statistically different from the others

Step 3: Run the ANOVA and test your hypotheses

anova2<-aov(Response variable~Explanatory variable, data=DATA)

anova1 <- aov(Sepal.Length ~ Species, data=irisData)

summary(anova1) #be sure to takethe summary of the aov() function 
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

**Crazy! Our p-value is tiny (0.0000000000000002) and we can reject our null hypothesis in favor of our alternative. The mean sepal lengths seem to be different between at least 1 species of iris. We can now figure out which one exactly by doing the post-hoc Tukey Test

Step 4: Decide which means are different
#first make a table of the means
tapply(iris$Sepal.Length, iris$Species, mean)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588
#it looks like it will be setosa vs. virginica in this. Lets try Tukey now

TukeyHSD(anova1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Length ~ Species, data = irisData)
## 
## $Species
##                       diff       lwr       upr p adj
## versicolor-setosa    0.930 0.6862273 1.1737727     0
## virginica-setosa     1.582 1.3382273 1.8257727     0
## virginica-versicolor 0.652 0.4082273 0.8957727     0
Step 5: Create a summary statement

the dataset collected by Darnel land Munguia (2011) provided evidence that mean sepal length (cm) differed among all three species of iris sp. (F2,147 = 119, p<0.001 (never say 0 for p-value)) with I. setosa being the shortest and I. virginica being the longest (Tukey HSD, p <0.001).

2 Factor ANOVA

Answer: A 2-factor ANOVA compares the means across two categorical variables.

Data types should be

  • Response Variable: 1 continuous

  • Predictor Variable: 2 categorical (i.e. flower color and flower species)

EXAMPLE

I don’t have time to make an example for you BUT! Here is a good website explaining the steps:

A Pirates Guide, Ch. 14

code to run the 2-way anova to test model assumptions:

  aov(formula = Sepal.Length ~ Species * Sepal.Color, data = irisData)
  

note: sepal color is not in the iris dataset, but this is to show you an example of how to set it up.

2

The F-Distribution with different degrees of freedom (df)

#And the F curve, with 2 and 27 degrees of freedom, looks like this:
x <- seq(0,15, by = 0.001)
curve(df(x, df1 = 2, df2 = 27), xlim = c(0,10), ylab = "Density")

#you are asked to make the f-dist with 4 and 12 degrees of freedom
x <- seq(0,15, by = 0.001)
curve(df(x, df1 = 4, df2 = 8), xlim = c(0,10), ylab = "Density", col="green")

curve(df(x, df1 = 2, df2 = 48), xlim = c(0,10), ylab = "Density", col="blue")

curve(df(x, df1 = 2, df2 = 24), xlim = c(0,10), ylab = "Density", col="cyan")

3

ANOVA

#setup the table
f <- c(rep("a", times=6), rep("b", times=6))

g1 <- c(6.9, 7.3, 11.2, 13.6, 8.0, 3.3, 8.3, 6.3, 8.1, 5.7, 13.8, 12.8) #I added the numbers like this so that all the a's wil still line up and all the b's will still line up



#combine the data
dan <- data.frame(f, g1)

#check your dataframe. You should have 6 as and 6 bs and they should correspond to the table in the pdf handout
head(dan)
##   f   g1
## 1 a  6.9
## 2 a  7.3
## 3 a 11.2
## 4 a 13.6
## 5 a  8.0
## 6 a  3.3
Scientific Question: Are the column f means of a and b statistically different from one another?

Data types

  • Response Variable (y): g1, continuous numeric

  • Predictor Variable (x): f, 1 categorical (2 levels)

Step 1: Check the 2 Assumptions
#THE NORMALITY ASSUMPTION

ggplot(dan, aes(g1, fill=f))+
  geom_histogram(bins=5)

#Hard to say with a graph. Lets run qqplot to double check normality
qqnorm(dan$g1, main = "Are the data normal?")
qqline(dan$g1)

#I am def. not convinced that that follows a straight diagonal line. It is hard to tell because the data are so small though. Let's assume they follow the assumption of normality for now.

#let's triple check iwth Shapiro
shapiro.test(dan$g1)
## 
##  Shapiro-Wilk normality test
## 
## data:  dan$g1
## W = 0.926, p-value = 0.3396
#a p-value > 0.05 for Shapiro Wilk means the data are normal when you run this test. LOOKS PRETTTTY GOOD!

#CHECK FOR SIMILAR VARIANCE ASSUMPTION:

ggplot(dan, aes(f, g1, color=f))+
  geom_jitter() #you can use a jitter plot or a bocplot here. I was always taught to run a jitter for the variance, since this assumption does not care abotu teh mean. 

#variance seems pretty equal, our dataset is like so smoll tho. Because our dataset is so small, we can assume that the more we sample, the easier to answer this question

#If you are not satisfied with the variance of the data, run Levene's test from the car package

#If the p-value of Levene's test is > 0.05 you can assume your data have equal variance. 
Step 2: State your Statistical Hypotheses

H0: \(\mu_1\) = \(\mu_2\)

HA: at least one of the means is statistically different from the others

PLEASE NOTE: I AM REALLY NOT SURE WHY WE ARENT RUNNIN A t-test ON THESE DATA. BESIDES THAT CONFUSION, I WILL CONTINUE THE STEPS YOU ARE ASKED TO COMPLETE.

4

Step 3: Run ANOVA
anova2 <- aov(g1 ~ f, data=dan)
summary(anova2)
##             Df Sum Sq Mean Sq F value Pr(>F)
## f            1   1.84   1.841   0.152  0.705
## Residuals   10 121.30  12.130
#p>0.05 and we fail to reject H0

5

Step 4: Tukey-HSD

Run this to determine which two means are different if you have 3 levels! AND IF YOUR P-VALUE ABOVE IS LESS THAN 0. Otherwise, the means are not different and there is no reason to run Tukey-HSD.

This is not a good example. Here you only have 2 LEVELS (a, b). I think this is actually a t-test. If it were an anova problem it would have one more level (i.e. a, b, and c – not just a, b)

**The Island Dataset should work well with an anova though, it has the Face category with 4 levels (“north”, “south”, “east”, “west”).

Step 5: Create a summary statement

the dataset we made up above and labelled “dan”, provided evidence that mean g1 did NOT differ between a and b (F1,10 = 0.152, p > 0.05) and we fail to reject H0.

6 - 8

Scientific Question: Are nutrients different between beaches?
islandData <- read.csv("island.csv")

head(islandData)
##    Island  Face AlgalDensity Nutrients
## 1 SanJuan North        47.59 11.296981
## 2 SanJuan North        39.56  8.796913
## 3 SanJuan North        47.62 10.399961
## 4 SanJuan North        53.82  9.903147
## 5 SanJuan North        50.64  6.947449
## 6 SanJuan South        46.43 14.702849

Follow the steps listed on 1-factor ANOVA.

Step 1: Does the data violate the 2 assumptions of ANOVA? Plot the data!
Step 2: State your statistical hypotheses

Note: statistical hypotheses are NOT the same as scientific hypotheses.

ANOVA hypotheses look like this H0: \(\mu_1\) \(=\) \(\mu_2\) = \(\mu_3\) HA: at least one of the means is statistically different from the others

Step 3: Run the ANOVA and test your hypotheses

anova2<-aov(Response variable~Explanatory variable, data=DATA)

#example from iris dataset (see question 1, 1-factor anova)
anova1 <- aov(Sepal.Length ~ Species, data=irisData)

#change the data to match your island data. 

summary(anova1) #be sure to takethe summary of the aov() function 
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If your p is low, the null has got to go! If p-value is less that 0.05 you will need to run a TUKEY-HSD to determine which means are different

Step 4: Decide which means are different
#example of how to run tukey for iris dataset 

#first make a table of the means for funsies
tapply(iris$Sepal.Length, iris$Species, mean)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588
#it looks like it will be setosa vs. virginica in this. Lets try Tukey now

TukeyHSD(anova1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Length ~ Species, data = irisData)
## 
## $Species
##                       diff       lwr       upr p adj
## versicolor-setosa    0.930 0.6862273 1.1737727     0
## virginica-setosa     1.582 1.3382273 1.8257727     0
## virginica-versicolor 0.652 0.4082273 0.8957727     0
Step 5: Create a summary statement

the island dataset provided evidence that mean sepal length (cm) differed among all three species of iris sp. (F2,147 = 119, p<0.001 (never say 0 for p-value)) with I. setosa being the shortest and I. virginica being the longest (Tukey HSD, p <0.001).

Obviously this needs to change to match the island dataset!