Chapter 5

Summary

This week, I worked through Chapter 5, One-way ANOVA and Randomized Experiments, of the Stat2 book, which is the opening chapter of Unit B: Analysis of Variance. The chapter covered one-way ANOVA models, including how to analyze them and how to use them to compare groups.

This week, my R skills were again put to the test. One-way ANOVA is a way to test whether there are statistically significant differences among the means of three or more groups. As I worked along with the practice examples in the book, I had to spend a lot of time on Google figuring out how to group a variable on the x-axis by categories such as “Beef, Cereal, or Pork”. In this specific example, the purpose of grouping the dotplots was to compare the amount of weight gained by rats on three different diets based on beef, cereal, or pork. Because of the grouping variable, we are able to more easily compare the means of the different diet groups, and therefore more easily see the effects of the different diets on the rats.
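A minimal sketch of grouping a response by a categorical variable on the x-axis. The book's rat diet data is not reproduced here, so this assumes R's built-in PlantGrowth data (columns weight and group) as a stand-in.

```r
# PlantGrowth ships with R; it stands in for the rat diet data
# stripchart() with a formula draws one column of dots per group
stripchart(weight ~ group, data = PlantGrowth,
           vertical = TRUE, method = "jitter", pch = 16,
           xlab = "Group", ylab = "Weight")

# Group means, to compare alongside the plot
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
print(group_means)
```

Writing the grouping variable on the right side of the formula is what places each category at its own position on the x-axis.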

An important part of one-way ANOVA is randomization, which has three purposes: to justify using a probability model, to protect against bias, and to support inference about cause and effect. These are similar to conditions that need to be checked when performing an ANOVA test. Confounding variables can prevent a model from being generalized to other populations. Without justification of a probability model, our distribution may not be normal because the sample was not random. And without justification for a cause-and-effect relationship, further generalizations cannot be made. I feel that understanding these “conditions” is really important so as not to abuse the one-way ANOVA test.

The chapter suggested thinking of ANOVA as a CAT scan that sees below the surface and reveals more numerical detail than may be seen in a simple regression model. These “extra layers” come from the ability of ANOVA to see the grand (overall) average, how far each group average departs from the grand average, and how far each response is from its group average. ANOVA measures group-to-group differences in a way that lets us ask whether the group differences we see are statistically significant or just noise.
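The layers described above can be sketched in a few lines of R. The numbers below are hypothetical toy data made up for illustration, not values from the book.

```r
# Hypothetical toy data: three diet groups with three responses each
gain <- c(10, 12, 14, 20, 22, 24, 15, 17, 19)
diet <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean   <- mean(gain)               # overall average
group_mean   <- ave(gain, diet)          # group average, repeated for each observation
group_effect <- group_mean - grand_mean  # how far each group departs from the grand average
residual     <- gain - group_mean        # how far each response is from its group average

# Each response is grand mean + group effect + residual
decomposition <- data.frame(diet, gain, grand_mean, group_effect, residual)
print(decomposition)
```

Adding the last three columns of each row gives back the observed response, which is the sense in which ANOVA splits the data into layers.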

Just as a regression model has a p-value and an R^2 value to measure variation, ANOVA uses the F-ratio to evaluate whether group-to-group variation and unit-to-unit variation are similar. In other words, the F-ratio is the ratio of the variation between groups to the variation within groups. Looking at a table of results from performing a one-way ANOVA, the results are grouped into two columns labeled Parameter and Statistic. Parameters are numbers that summarize the entire population and model, while statistics are numerical estimates of those parameters computed from the sample data.
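To make the F-ratio concrete, here is a sketch that computes it by hand and compares it with aov(). The data are hypothetical toy values, and the variable names are my own.

```r
# Hypothetical toy data (not from the book)
y <- c(10, 12, 14, 20, 22, 24, 15, 17, 19)
g <- factor(rep(c("A", "B", "C"), each = 3))

k <- nlevels(g)   # number of groups
n <- length(y)    # total number of observations

group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

ss_groups <- sum(group_sizes * (group_means - mean(y))^2)  # between-group sum of squares
ss_error  <- sum((y - ave(y, g))^2)                        # within-group sum of squares

ms_groups <- ss_groups / (k - 1)   # mean square for groups
ms_error  <- ss_error / (n - k)    # mean square for error
f_ratio   <- ms_groups / ms_error
print(f_ratio)                     # 18.75

# The same value as reported by aov()
f_aov <- summary(aov(y ~ g))[[1]][["F value"]][1]
print(f_aov)
```

A large F-ratio means the group averages differ by more than the unit-to-unit noise within groups would suggest.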

While doing the homework, a lot of Googling was required in order to correctly group my plots, such as in problem 5.27. This problem allowed me to perform my first one-way ANOVA, but I also had to learn how to interpret the summary table of values, as it is not exactly the same as a regression output. From the chapter, I had seen how the values were actually computed, but I had to spend extra time on the ANOVA output table in R to make sure I understood what each output value meant.

In other problems such as 5.52, I simply had to check the conditions for ANOVA and then perform an ANOVA test. This was interesting, as I used the same function I had used in previous chapters to test conditions for regression models, which was plotting the residuals of the model. When I first did this, it arose out of muscle memory from graphing plots of residuals from regression models, and after hitting enter, I expected an error message saying that the types of functions were not compatible. Surprisingly, using the same function was exactly what I needed to do, and I was able to plot normal QQ plots and Residuals vs Fitted plots to check conditions in R.
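A short sketch of why the regression plotting habit carries over. An object returned by aov() also has class "lm", so plot() uses the same diagnostic plots as for a regression model. The example assumes R's built-in PlantGrowth data as a stand-in for the homework data.

```r
# PlantGrowth ships with R; it is a stand-in, not one of the book's datasets
fit <- aov(weight ~ group, data = PlantGrowth)

# aov objects inherit from "lm", so the regression diagnostics apply
print(class(fit))

plot(fit, which = 1)  # Residuals vs Fitted
plot(fit, which = 2)  # Normal Q-Q
```

Calling plot(fit) without the which argument steps through the full default set of diagnostic plots.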

I feel that this week was a stepping stone towards furthering my understanding of R as well as ANOVA. In previous weeks, I had not had to work as hard at interpreting the data in R, but this week, that understanding was essential to compute any test statistics or models. As I move along in this project, I can feel myself becoming more confident in my R abilities as well as my statistical abilities.

# load libraries
library(onewaytests)
library(data.table)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

5.16

a.

The explanatory variable is the type of dog food, and the response variable is how much the dog sleeps, energy of the dog

b.

This is an observational study because the dogs are being observed

c.

There are different groups of dog food, which means we are comparing the mean final score of the energy level, meaning we will use ANOVA.

5.24

a.

The null hypothesis is that the means of the size of the hatchling bottle turtles do not differ by geographical region

b.

We should know if the number of turtles differs by state, which can affect the p-value.

c.

To ensure the conditions for ANOVA are satisfied, we need to assess whether randomization was used to select the three states. We also have to check for bias, because if there was no randomization, there is potential for bias. We also need information about the justification for the researcher’s goal of looking into the hatching rates of these turtles and whether they differ between regions.

5.27

mouse = read.csv("~/downloads/MouseBrain.csv")
plot(factor(mouse$Genotype), mouse$Contacts)

# a
aggregate(mouse$Contacts, list(mouse$Genotype), FUN = mean)

##   Group.1        x
## 1   Minus 12.70000
## 2   Mixed 19.26316
## 3    Plus 17.52632

mouseaov = aov(Contacts ~ Genotype, data = mouse)
summary(mouseaov)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Genotype     2  285.4  142.70    5.66 0.00642 **
## Residuals   45 1134.5   25.21                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# b - residuals
# normal Q-Q plot of the ANOVA residuals (not the raw response)
qqnorm(residuals(mouseaov), pch = 16)
qqline(residuals(mouseaov))

# Levene's homogeneity test (what homog.test runs when no method is given)
homog.test(Contacts ~ Genotype, data = mouse)

## 
##   Levene's Homogeneity Test (alpha = 0.05) 
## ----------------------------------------------- 
##   data : Contacts and Genotype 
## 
##   statistic  : 0.430331 
##   num df     : 2 
##   denum df   : 45 
##   p.value    : 0.6529416 
## 
##   Result     : Variances are homogeneous. 
## -----------------------------------------------

a.

The boxplots and statistics show that, on average, the mixed genotype has the greatest number of social contacts, with the minus genotype having the least on average. The data also show that the mixed genotype has the greatest range of social contacts, and that there are two outliers in total, within the minus and plus genotypes.

b.

Conditions

Constant and additive effects - Because scientists are running this experiment in a lab setting, we can assume that the effects are constant.

The normal QQ plot shows that the residuals follow a normal distribution.

Using Levene’s homogeneity test in R, I found that the variances are homogeneous (p-value = 0.653).
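As a side note on checking equal variances, base R also provides Bartlett's test through bartlett.test() in the stats package. This is a minimal sketch on the built-in PlantGrowth data, used here as a stand-in because the mouse data file is not included.

```r
# PlantGrowth ships with R; it is a stand-in, not the mouse data
bt <- bartlett.test(weight ~ group, data = PlantGrowth)
print(bt)

# A p-value above 0.05 gives no evidence against equal variances
# at the 5% significance level
equal_var_plausible <- bt$p.value > 0.05
print(equal_var_plausible)
```

Bartlett's test is known to be more sensitive to non-normal data than Levene's test, so the two can disagree when the residuals are skewed.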

5.32

hawk = read.csv("~/downloads/Hawks.csv")
# a
plot(factor(hawk$Species), hawk$Weight)

# b
aggregate(hawk$Weight, list(hawk$Species), FUN = sd, na.rm = TRUE)

##   Group.1         x
## 1      CH 162.03164
## 2      RT 189.21025
## 3      SS  80.65267

a.

The boxplot shows a large number of outliers, which is something to be concerned about. Outliers can affect the F-statistic, because the sample mean is not resistant to outliers, and this can affect the conclusion of the hypothesis test.

b.

When comparing the standard deviations of the weights of the three different species, the ratio between the largest and smallest standard deviation is 189.21 / 80.65, or about 2.35, which is above the rule-of-thumb threshold of 2. This means that the data may violate the equal variance condition.

CH = 162.03

RT = 189.21

SS = 80.65
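The rule-of-thumb check can be sketched directly from the standard deviations printed in the output above. The variable names are my own.

```r
# Standard deviations copied from the aggregate() output above
sds <- c(CH = 162.03164, RT = 189.21025, SS = 80.65267)

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- max(sds) / min(sds)
print(round(sd_ratio, 2))   # 2.35

# Rule of thumb: a ratio above 2 is a warning sign for the equal variance condition
print(sd_ratio > 2)         # TRUE
```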

5.38

menin = read.csv("~/downloads/Meniscus.csv")
meninone = aov(Stiffness ~ factor(Method), data = menin)
## qqplot and residuals
plot(meninone)

# boxplot
ggplot(menin) + geom_boxplot(aes(x = factor(Method), y = Stiffness))

# c
summary(meninone)

##                Df Sum Sq Mean Sq F value Pr(>F)  
## factor(Method)  2  10.57   5.285   4.981 0.0219 *
## Residuals      15  15.91   1.061                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a.

The null hypothesis is that the three treatments have equal mean stiffness, meaning the repair method makes no difference. The alternative hypothesis is that at least one of the treatments has a different mean stiffness.

b.

The QQ plot shows that the points follow the line well, showing that the condition for normality is met. The residuals vs. fitted plot shows the red line is very close to the dotted line, meaning that the residuals are evenly banded around zero and the condition for equality of variance is met. The methods are all independent, meeting the condition of independence.

c.

The p-value is 0.02193, which is smaller than the significance level of 0.05, giving us evidence to reject the null hypothesis and conclude that the stiffness of the meniscus differs based on the type of repair. The F-value is 4.981, which indicates that the average variation between the treatment groups is about five times larger than the average variation within groups.
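The F-value and p-value can be rebuilt from the mean squares and degrees of freedom in the ANOVA table above. This sketch uses the rounded values printed in the table, so the results match only to about three digits. The variable names are my own.

```r
# Values copied from the printed ANOVA table above (rounded)
ms_method   <- 5.285   # Mean Sq for factor(Method)
ms_residual <- 1.061   # Mean Sq for Residuals
df_method   <- 2
df_residual <- 15

# F is the ratio of between-group to within-group mean squares
f_value <- ms_method / ms_residual
print(round(f_value, 2))   # 4.98

# Upper-tail area of the F distribution gives the p-value
p_value <- pf(f_value, df_method, df_residual, lower.tail = FALSE)
print(p_value)
```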

5.52

words = read.csv("~/downloads/WordsWithFriends.csv")
# a
wordsone = aov(WinMargin ~ factor(BlanksNumber), data = words)
plot(wordsone)

ggplot(words) + geom_boxplot(aes(x = factor(BlanksNumber), y = WinMargin))

summary(wordsone)

##                       Df Sum Sq Mean Sq F value  Pr(>F)   
## factor(BlanksNumber)   2   9514    4757   6.988 0.00103 **
## Residuals            441 300202     681                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a.

The normal QQ plot shows that the data follows the line well, indicating that the condition of normality is met. The residuals vs. fitted plot has a flat red line on the dotted line, showing the condition for equality of variance is met. The boxplot shows the condition for independence is met.

b.

The ANOVA summary gives a p-value of 0.00103, which is lower than the significance level of 0.05. This gives us evidence to reject the null hypothesis, meaning that there is evidence of a relationship between the winning margin, or how much a player wins by, and the number of blank tiles. The F-value of 6.99 shows that the average variation between groups is about 7 times greater than the average variation within groups.