Topic 7: One-way ANOVA

These are the solutions for Computer Lab 8.

1 One-way ANOVA

Example R code for this question is included below, where appropriate.

1.1

install.packages("gapminder") # Install package
library(gapminder) # Load package
data(gapminder) # Load gapminder data

1.2

boxplot(gapminder$lifeExp ~ gapminder$continent,
        main = "Life Expectancy Box Plots for each Continent",
        ylab = "Life Expectancy (years)", xlab = "Continent", col = 2:6)

1.3

tapply(gapminder$lifeExp, gapminder$continent, mean)

##   Africa Americas     Asia   Europe  Oceania 
## 48.86533 64.65874 60.06490 71.90369 74.32621

tapply(gapminder$lifeExp, gapminder$continent, sd)

##    Africa  Americas      Asia    Europe   Oceania 
##  9.150210  9.345088 11.864532  5.433178  3.795611

# Count the number of observations for each continent
table(gapminder$continent)

## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

1.4

We observe that the average life expectancy appears to differ across the continents. In the box plots, the data appear similarly spread out for Africa, the Americas and Asia, and more narrowly spread for Europe and Oceania. The standard deviations support this: roughly 9.2, 9.3 and 11.9 years for the first three continents, compared with roughly 5.4 and 3.8 years for Europe and Oceania.

The mean life expectancy for people in Africa (about 48.9 years) is much lower than for people in the other continents, whose means range from about 60.1 to 74.3 years.

The sample size for each continent is different, being 624, 300, 396, 360 and 24 for Africa, the Americas, Asia, Europe and Oceania respectively.

1.5

Appropriate notation is as follows:

Let $\mu_1$ denote the true average life expectancy at birth of people born in Africa.
Let $\mu_2$ denote the true average life expectancy at birth of people born in the Americas.
Let $\mu_3$ denote the true average life expectancy at birth of people born in Asia.
Let $\mu_4$ denote the true average life expectancy at birth of people born in Europe.
Let $\mu_5$ denote the true average life expectancy at birth of people born in Oceania.

Using this notation, we can define: \[H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\] versus \[H_1: \text{not all of the } \mu_i \text{ are equal, for } i = 1, \ldots, 5\]

1.6

Example R code for fitting a one-way ANOVA using the specified data is shown below:

anova.life <- aov(lifeExp ~ continent, data = gapminder)

1.7

The summary function can be used as shown below:

summary(anova.life)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## continent      4 139343   34836   408.7 <2e-16 ***
## Residuals   1699 144805      85                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We note here that $d_1 = 5 - 1 = 4$ (the number of groups, i.e. continents, minus 1) and $d_2 = 1704 - 5 = 1699$ (the number of observations minus the number of groups).
The p-value is reported as $< 2 \times 10^{-16}$, which is far smaller than $0.05$.
The test statistic is $F=408.7$.
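
As a cross-check, the entries of the ANOVA table can be reproduced from the group means, standard deviations and sample sizes reported in 1.3. The following is a minimal base R sketch; the variable names (`n`, `xbar`, `s` and so on) are our own choices, and because the inputs are rounded, the results agree with the `summary` output only up to rounding.

```r
# Group summaries copied from the output in 1.3
# (order: Africa, Americas, Asia, Europe, Oceania)
n    <- c(624, 300, 396, 360, 24)
xbar <- c(48.86533, 64.65874, 60.06490, 71.90369, 74.32621)
s    <- c(9.150210, 9.345088, 11.864532, 5.433178, 3.795611)

k <- length(n)  # number of groups
N <- sum(n)     # total number of observations

grand_mean <- sum(n * xbar) / N

ss_between <- sum(n * (xbar - grand_mean)^2)  # "continent" Sum Sq
ss_within  <- sum((n - 1) * s^2)              # "Residuals" Sum Sq

d1 <- k - 1  # numerator degrees of freedom
d2 <- N - k  # denominator degrees of freedom

# F statistic: between-group mean square over within-group mean square
f_stat  <- (ss_between / d1) / (ss_within / d2)
p_value <- pf(f_stat, d1, d2, lower.tail = FALSE)

round(c(ss_between = ss_between, ss_within = ss_within, F = f_stat), 1)
```

The three values printed should match, up to rounding, the Sum Sq and F value entries in the table above, and `p_value` should fall below the `2e-16` threshold that R reports.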

1.8

To summarise, we can write:

There was a statistically significant difference in the average life expectancy at birth (in years) $\left[F(4, 1699) = 408.7, p < 0.001\right]$ for people born in different continents.

1.9

library(car)

leveneTest(gapminder$lifeExp ~ gapminder$continent)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    4  51.568 < 2.2e-16 ***
##       1699                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

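Since the p-value is close to 0, and much smaller than $0.05$, we reject the null hypothesis of equal variances, so we cannot assume that the variances are equal across continents. This is not a surprising result, given the differences in spread we observed in the box plots earlier.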
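
To see what the `center = median` version of Levene's test is doing, it can help to carry it out by hand: it amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below is illustrative only. It uses the built-in `PlantGrowth` data set rather than gapminder so that it runs in base R without extra packages, and the names `abs_dev` and `levene_by_hand` are our own.

```r
# PlantGrowth ships with base R and is used here purely for illustration
abs_dev <- with(PlantGrowth, abs(weight - ave(weight, group, FUN = median)))

# One-way ANOVA on the absolute deviations from the group medians
levene_by_hand <- anova(lm(abs_dev ~ group, data = PlantGrowth))
levene_by_hand
```

The same approach should carry over to the gapminder data by replacing `weight` and `group` with `lifeExp` and `continent`.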

1.10

Remember that, for this question, we must assess the residuals from the one-way ANOVA rather than the raw data.

par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)

residuals <- anova.life$residuals # obtain residuals

hist(residuals, ylim = c(0, 0.06), freq = FALSE,
     main = "Histogram of residuals \n (with normal density curve overlaid)", 
     xlab = "Residuals", col = "skyblue")
curve(dnorm(x, mean = mean(residuals), sd = sd(residuals)), add = TRUE, lwd = 2)

qqnorm(residuals, main = "Normal Q-Q plot", pch = 19)
qqline(residuals)

The histogram shows that the residuals appear to be approximately normally distributed. The Normal Q-Q plot shows some deviation from the reference line for low theoretical quantile values, but this is not extreme enough to cause major concern.

1.11

shapiro.test(residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.9954, p-value = 4.4e-05

Since the p-value $= 4.4 \times 10^{-5} < 0.001$, we reject the null hypothesis that the residuals are normally distributed, even though our visual inspection suggested otherwise.

Let’s consider why we have obtained this unexpected result.

Because the sample size is large ($n = 1704$), the Shapiro-Wilk test has very high power, so even small departures from normality that are of no practical importance can produce a small p-value. Given the approximate symmetry of the residuals and the large sample size (so that the Central Limit Theorem applies), we have no concerns about the normality assumption here, despite the result of the Shapiro-Wilk test.
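
The effect of sample size on the Shapiro-Wilk test can be illustrated with a small simulation. The sketch below is only an illustration under assumptions of our own: it draws samples from a $t$ distribution with 10 degrees of freedom (symmetric, with slightly heavier tails than the normal) and compares how often the test rejects at the $5\%$ level for $n = 50$ and $n = 1704$. The helper `rejection_rate`, the seed and the choice of distribution are hypothetical and not part of the lab.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Proportion of simulated samples for which Shapiro-Wilk rejects at the 5% level
rejection_rate <- function(n, reps = 200) {
  p_values <- replicate(reps, shapiro.test(rt(n, df = 10))$p.value)
  mean(p_values < 0.05)
}

rate_small <- rejection_rate(50)    # small sample
rate_large <- rejection_rate(1704)  # same size as the gapminder data

c(n_50 = rate_small, n_1704 = rate_large)
```

The same mild departure from normality should be flagged far more often at the larger sample size, which is the behaviour described above.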

1.12

TukeyHSD(anova.life)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = lifeExp ~ continent, data = gapminder)
## 
## $continent
##                       diff       lwr       upr     p adj
## Americas-Africa  15.793407 14.022263 17.564550 0.0000000
## Asia-Africa      11.199573  9.579887 12.819259 0.0000000
## Europe-Africa    23.038356 21.369862 24.706850 0.0000000
## Oceania-Africa   25.460878 20.216908 30.704848 0.0000000
## Asia-Americas    -4.593833 -6.523432 -2.664235 0.0000000
## Europe-Americas   7.244949  5.274203  9.215696 0.0000000
## Oceania-Americas  9.667472  4.319650 15.015293 0.0000086
## Europe-Asia      11.838783 10.002952 13.674614 0.0000000
## Oceania-Asia     14.261305  8.961718 19.560892 0.0000000
## Oceania-Europe    2.422522 -2.892185  7.737230 0.7250559

1.12.1

We note that the Oceania-Europe comparison is not statistically significant, since $p = 0.725 > 0.05$ and the $95\%$ confidence interval, roughly $(-2.89, 7.74)$, includes 0. We therefore have insufficient evidence of a difference in the average life expectancy of people born in these two continents.
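
The Oceania-Europe interval can be reproduced approximately by hand, which shows where its width comes from. The sketch below assumes the Tukey-Kramer form of the interval for unequal group sizes and uses rounded values copied from the earlier outputs, so it should agree with the `TukeyHSD` row only up to rounding. The variable names are our own.

```r
# Values copied from the outputs in 1.3 and 1.7
mean_oceania <- 74.32621
n_oceania    <- 24
mean_europe  <- 71.90369
n_europe     <- 360
mse          <- 144805 / 1699  # residual mean square (Sum Sq / Df)
df_residual  <- 1699
k            <- 5              # number of continents

diff_means <- mean_oceania - mean_europe

# Tukey-Kramer margin of error: q * sqrt(MSE / 2 * (1 / n_i + 1 / n_j))
q_crit <- qtukey(0.95, nmeans = k, df = df_residual)
margin <- q_crit * sqrt(mse / 2 * (1 / n_oceania + 1 / n_europe))

ci <- c(diff = diff_means, lwr = diff_means - margin, upr = diff_means + margin)
round(ci, 3)
```

The interval is wide mainly because Oceania contributes only 24 observations, which inflates the standard error of the difference.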

All other comparisons have a statistically significant difference in average life expectancy between the two continents under consideration. For example, we note that:

People born in the Americas have an average life expectancy which is 15.79 years greater than that of people born in Africa. We have $p < 0.001$, and the $95\%$ confidence interval for this difference is roughly $(14.02, 17.56)$, which does not contain 0.
People born in Oceania have an average life expectancy which is a whopping 25.46 years greater than that of people born in Africa. We have $p < 0.001$, and the $95\%$ confidence interval for this difference is roughly $(20.22, 30.70)$, which does not contain 0.

N.B. These data are outdated now (the gapminder data set covers 1952 to 2007), so the differences may no longer be as extreme as suggested here.

1.12.2

install.packages("lsr")

library(lsr) 
etaSquared(anova.life)

##              eta.sq eta.sq.part
## continent 0.4903887   0.4903887

We obtain an $\eta^2$ value of roughly $0.49$, which is considered large. This makes sense based on our previous results: about $49\%$ of the variation in life expectancy can be attributed to continent, i.e. the continent in which a person is born is strongly associated with their life expectancy.
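
The $\eta^2$ value can also be checked directly from the ANOVA table, since it is the between-group sum of squares divided by the total sum of squares. A minimal sketch, using the rounded Sum Sq values from the `summary(anova.life)` output (the variable names are our own):

```r
# Sums of squares copied from the summary(anova.life) output in 1.7
ss_continent <- 139343
ss_residuals <- 144805

# Eta squared: share of the total sum of squares explained by continent
eta_sq <- ss_continent / (ss_continent + ss_residuals)
round(eta_sq, 4)
```

With only one factor in the model, the partial $\eta^2$ is the same quantity, which is why the two columns of the `etaSquared` output are identical.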

2 Practice

Follow the steps outlined above, with lifeExp replaced by gdpPercap throughout the code (e.g. gapminder$lifeExp becomes gapminder$gdpPercap, and the aov formula becomes gdpPercap ~ continent).

That’s all the questions done! If there were any parts you were unsure about, take a look back over the relevant sections of the Topic 7 material.

These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

STM1001: Computer Lab 8 Solutions