One-Way ANOVA

Suppose you have four groups of observations (data below).

x1 <- c(12.636195, 11.302671,  9.155329, 10.281724)
x2 <- c(6.911613, 10.769846, 10.475964, 12.119434)
x3 <- c(13.35268, 14.35637, 15.05340, 12.47963)
x4 <- c(15.88884, 18.94905, 16.78968, 17.76045)

To analyze the data with a one-way ANOVA using the aov function, you first have to combine all the data into one vector:

(y <- c(x1, x2, x3, x4))
##  [1] 12.636195 11.302671  9.155329 10.281724  6.911613 10.769846 10.475964
##  [8] 12.119434 13.352680 14.356370 15.053400 12.479630 15.888840 18.949050
## [15] 16.789680 17.760450

Notice, however, that the y variable no longer records which group each observation came from. Here’s one way to create the categorical variable.

groups <- c(rep(1,4), rep(2,4), rep(3,4), rep(4,4))
length(groups)
## [1] 16

However, R currently treats “groups” as numeric.

summary(groups)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.75    2.50    2.50    3.25    4.00

The following line of code sets “groups” as categorical (a “factor”). Checking out the summary, you can see that R now views “groups” as a factor.

groups <- as.factor(groups) 
summary(groups) 
## 1 2 3 4 
## 4 4 4 4

Now we can analyze the data. The null hypothesis is that all four group means are equal:

\(H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4\).

The alternative hypothesis is that at least two of the means differ.

Check the boxplots:

boxplot(y ~ groups)

Create the model and run the ANOVA.

mod <- aov(y ~ groups)
anova(mod)
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## groups     3 131.160  43.720  17.248 0.0001195 ***
## Residuals 12  30.418   2.535                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
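To see where the numbers in the table come from, the sums of squares, mean squares, F statistic, and p-value can be recomputed by hand. This is a sketch for checking the output, not a replacement for `aov`; variable names such as `ss_between` and `ss_within` are introduced here for illustration.

```r
x1 <- c(12.636195, 11.302671,  9.155329, 10.281724)
x2 <- c(6.911613, 10.769846, 10.475964, 12.119434)
x3 <- c(13.35268, 14.35637, 15.05340, 12.47963)
x4 <- c(15.88884, 18.94905, 16.78968, 17.76045)
y <- c(x1, x2, x3, x4)
groups <- factor(rep(1:4, each = 4))

k       <- nlevels(groups)   # number of groups
n_total <- length(y)         # total number of observations

group_means <- tapply(y, groups, mean)
group_sizes <- tapply(y, groups, length)
grand_mean  <- mean(y)

# Between-group sum of squares: spread of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group (residual) sum of squares: spread of observations around their own group mean
ss_within  <- sum((y - ave(y, groups))^2)

ms_between <- ss_between / (k - 1)        # Df = 3
ms_within  <- ss_within / (n_total - k)   # Df = 12

f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df1 = k - 1, df2 = n_total - k, lower.tail = FALSE)

c(ss_between = ss_between, ss_within = ss_within, F = f_stat, p = p_value)
```

These values should agree with the `Sum Sq`, `F value`, and `Pr(>F)` columns of the ANOVA table, up to rounding.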

Because our p-value (0.0001195) is much smaller than \(\alpha = .05\), we reject the null hypothesis in favor of the alternative. That is, at least two of the group means are significantly different. We should follow up (later) to see which ones differ.
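One common way to carry out such a follow-up is Tukey's honest significant difference procedure, available in base R as `TukeyHSD`. The sketch below is only one possible choice and is not necessarily the method intended for the later follow-up; the object name `tuk` is introduced here for illustration.

```r
x1 <- c(12.636195, 11.302671,  9.155329, 10.281724)
x2 <- c(6.911613, 10.769846, 10.475964, 12.119434)
x3 <- c(13.35268, 14.35637, 15.05340, 12.47963)
x4 <- c(15.88884, 18.94905, 16.78968, 17.76045)
y <- c(x1, x2, x3, x4)
groups <- factor(rep(1:4, each = 4))

mod <- aov(y ~ groups)

# All pairwise comparisons of group means with family-wise 95% confidence intervals
tuk <- TukeyHSD(mod, conf.level = 0.95)
tuk          # one row per pair: diff, lwr, upr, p adj
plot(tuk)    # intervals that exclude zero indicate pairs that differ
```

With four groups there are six pairwise comparisons, and the adjusted p-values account for making all six at once.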

Practicing Paired Data

Does the shaded half of the grapefruit have more solids than the exposed half?

Let \(\mu\) be the mean difference in the percentage of solids (shaded minus exposed). The null hypothesis is that \(\mu = 0\), and the alternative hypothesis is that \(\mu > 0\). Our test statistic is

\[\dfrac{\bar{x} - 0}{s/\sqrt{n}}\]

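where \(\bar{x}\) and \(s\) are the sample mean and standard deviation of the \(n = 25\) paired differences. Under the null hypothesis, it follows a t distribution with \(n-1 = 24\) degrees of freedom.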

library(PairedData)
## Loading required package: MASS
## Loading required package: gld
## Loading required package: mvtnorm
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'PairedData'
## The following object is masked from 'package:base':
## 
##     summary
data("GrapeFruit")
diffs <- GrapeFruit$Shaded - GrapeFruit$Exposed
(n <- length(diffs))
## [1] 25
(xbar <- mean(diffs))
## [1] 0.1936
(mysd <- sd(diffs))
## [1] 0.3066469
(teststat <- xbar/(mysd/sqrt(n)))
## [1] 3.156725
(pval <- pt(teststat, df = n-1, lower.tail = FALSE))
## [1] 0.002132029
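As a cross-check on the hand computation, the same one-sided test can be run with base R's `t.test` on the differences. This is a sketch that assumes the PairedData package is installed so that the `GrapeFruit` data set can be loaded; the object name `res` is introduced here for illustration.

```r
# Load only the data set; assumes the PairedData package is installed
data("GrapeFruit", package = "PairedData")
diffs <- GrapeFruit$Shaded - GrapeFruit$Exposed

# One-sample t test on the differences: H0: mu = 0 versus Ha: mu > 0
res <- t.test(diffs, mu = 0, alternative = "greater")

res$statistic   # should match teststat
res$parameter   # degrees of freedom, n - 1
res$p.value     # should match pval
```

Equivalently, `t.test(GrapeFruit$Shaded, GrapeFruit$Exposed, paired = TRUE, alternative = "greater")` performs the same paired test without forming the differences first.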

Because our p-value is quite small (0.002132), we reject \(H_0\) in favor of \(H_a\). That is, the shaded part of the grapefruit has significantly more solids than the exposed part, on average.