ANOVA

ANOVA can be broken into two ideas. One-way and two-way.

One-way ANOVA will examine the means of a quantitative variable by examining the variance among the groups. We should be comparing more than two groups (for two groups do the hypothesis test of module 0). We could do this by hand but nobody does so I am not going to bother either. The null hypothesis for this will always be that all means are equal while the alternative is that there are at least two that are not equal. Be careful with this as some treatments may still have equal means but not all are equal. Assumptions for the test include normality, sample independence, and the tricky one equal variances between different groups. We have seen normality (qq-plots) The equality of variances will be checked by using the residual vs plotted graph and again asking do we fall on the line.

Two-way ANOVA is going to look at two categorical variables and ask if there is an effect with these categories on the dependent variable. You will have to look not only at the contribution but the interaction between those categorical variables. Conditions in two-way are similar to the one-way except that we also ask that the number in each category is equal.

One Way

  1. An experiment is conducted to determine whether there is a difference among the mean increases in growth produced by five strains (A, B, C, D and E) of growth hormones for plants. The experimental material consists of 20 cuttings of a shrub (all of equal weight), with four cuttings randomly assigned to each of the five different strains. The increases in weight for each cutting are given in the table below.
A B C D E
Plant 1 19 28 25 13 6
Plant 2 8 27 22 12 7
Plant 3 14 20 24 13 15
Plant 4 11 17 22 16 8

Let’s add this into R.

data = data.frame("A" = c(19,8,14,11), "B" = c(28,27,20,17), "C" = c(25,22,24,22), "D" = c(13,12,13,16),"E" = c(6,7,15,8),"Plant"=1:4)
data
##    A  B  C  D  E Plant
## 1 19 28 25 13  6     1
## 2  8 27 22 12  7     2
## 3 14 20 24 13 15     3
## 4 11 17 22 16  8     4

I need to make this into tidy data. It took me a while to find the correct code but I have done it here. By the way I don’t think I needed to include the plant number but I have included it anyway.

data <-
  data %>% 
  pivot_longer(c('A','B','C','D','E'), names_to = "Strain", values_to = "Weight")

data
## # A tibble: 20 x 3
##    Plant Strain Weight
##    <int> <chr>   <dbl>
##  1     1 A          19
##  2     1 B          28
##  3     1 C          25
##  4     1 D          13
##  5     1 E           6
##  6     2 A           8
##  7     2 B          27
##  8     2 C          22
##  9     2 D          12
## 10     2 E           7
## 11     3 A          14
## 12     3 B          20
## 13     3 C          24
## 14     3 D          13
## 15     3 E          15
## 16     4 A          11
## 17     4 B          17
## 18     4 C          22
## 19     4 D          16
## 20     4 E           8

To apply the ANOVA test in R you actually must do two calls. The first call aov asks you to create a function. Essentially you are asking is there a difference in the weight because of the strain you picked. This was why I had to get the data tidy! Once you run the fit, you run a test on that fit with anova of the fit.

fm = aov(Weight ~ Strain, data)
summary(fm)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## Strain       4  660.8  165.20   11.38 0.00019 ***
## Residuals   15  217.7   14.52                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Strain here could also be called the Treatment and the Residuals could be called the Error.

Let’s include a plot to see what is going on here.

ggplot(data , aes(x = Strain, y = Weight)) +
  geom_boxplot()

We can see it is very clear that these will have a different means. B and C are clearly have heavier averages while A, D, and E are lighter.

Let’s check the assumptions of ANOVA are fulfilled.

plot(fm, 1:2)

In the first graph you are looking to see if there is any relationship. If there is you will not meet the criteria of homogeneity of variance. Here we do not observe any relationship.

In the second graph you in fact want a relationship! You would like for all the points to fall on that line. Here they do not clearly fall on the line but it is close enough to continue with our results. We will discuss ways to check for this in the next module!

If we needed to find a critical F value, we would need to know two degrees of freedom and the p value we are using. Looking at the 95% critical value, we use the degrees of freedom from the numerator (or Treatment) and the degrees of freedom from the denominator (or Error). You stick a q in front of f to make it give you critical values just like with t or z.

qf(.95,4,15)
## [1] 3.055568

I want to recreate one more visualization to really express the differences in these means.

y1 <- mean(data$Weight)
ggplot(data = data, aes(x = Strain, y = Weight)) +
  geom_point() + 
  stat_summary(fun.data = 'mean_se',color = "red") +
  geom_hline(yintercept = y1, color ="blue", linetype = "dashed") 

I had seen this graph some where and decided to recreate it here. Totally unnecessary but creates another visualization of really what the ANOVA test is looking at.

Two Way

The homework only has the output from a two way ANOVA only after it has been run through software. So I don’t have an exact homework question to cover but I’ll still preform one on a built in dataset. Everybody loves diamonds right?

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

We want to ask is the mean price different for diamonds of different cut and color? .

table(diamonds$cut,diamonds$color)
##            
##                D    E    F    G    H    I    J
##   Fair       163  224  312  314  303  175  119
##   Good       662  933  909  871  702  522  307
##   Very Good 1513 2400 2164 2299 1824 1204  678
##   Premium   1603 2337 2331 2924 2360 1428  808
##   Ideal     2834 3903 3826 4884 3115 2093  896

This also violates one of the requirements of the test. We are supposed to have the same number of entries in each category and I do not. I continue anyway. Let’s visualize!

ggplot(diamonds, aes(x = cut, y =price, color = color))+
  geom_boxplot()

Time to run the ANOVA

twoWayAnova <- aov(price ~ cut*color, data = diamonds)
summary(twoWayAnova)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## cut             4 1.104e+10 2.760e+09 181.405 <2e-16 ***
## color           6 2.551e+10 4.251e+09 279.371 <2e-16 ***
## cut:color      24 1.653e+09 6.889e+07   4.527  1e-12 ***
## Residuals   53905 8.203e+11 1.522e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So here we examined if cut, color and the interaction between the two will have an effect on the price. Since all are significant they do.

Let’s check the requirements are met

plot(twoWayAnova, 1:2)

Both of these graphs are actually problematic. The QQ-Plot is showing that the data is clearly not normal. What is normal about diamonds anyway? The first plot is clearly showing a trend as well. We expect the red line to stay near the dotted line and that is not the case. These give us a hint that I should not have used this test! We’ll explore what to do in this case during the 5th and final module.