Topic 7: One-way ANOVA

These are the solutions for Computer Lab 8.

1 One-way ANOVA

Example R code for this question is included below, where appropriate.

1.1

No answer required.

1.2

gapminder_2002 <- read.csv("gapminder_2002.csv", header = TRUE)

1.3 Initial Exploratory Analysis

boxplot(lifeExp ~ continent, data = gapminder_2002,
        main = "Life Expectancy Box Plots for each Continent",
        ylab = "Life Expectancy (years)", xlab = "Continent", col = 2:5)

1.3.1

tapply(gapminder_2002$lifeExp, gapminder_2002$continent, mean)

##       Africa     Americas Asia_Oceania       Europe 
##     53.32523     72.42204     69.83423     76.70060

tapply(gapminder_2002$lifeExp, gapminder_2002$continent, sd)

##       Africa     Americas Asia_Oceania       Europe 
##     9.586496     4.799705     8.494322     2.922180

# Number of observations for each continent
table(gapminder_2002$continent)

## 
##       Africa     Americas Asia_Oceania       Europe 
##           52           25           35           30

1.3.2

We observe that the average life expectancy appears to be different across the continents. The data appear to be similarly spread out for the Continent categories Africa and Asia_Oceania, whereas the data have a narrower spread for Europe and Americas. This is supported by an assessment of the standard deviation values.

The mean life expectancy for people in Africa is much lower than for people in other continents.

The sample size for each Continent is different, being 52, 25, 35 and 30 for Africa, the Americas, Asia_Oceania, and Europe respectively.

1.4 Defining Hypotheses

Appropriate notation is as follows:

Let $\mu_1$ denote the population average life expectancy at birth of people born in Africa.
Let $\mu_2$ denote the population average life expectancy at birth of people born in the Americas.
Let $\mu_3$ denote the population average life expectancy at birth of people born in Asia_Oceania.
Let $\mu_4$ denote the population average life expectancy at birth of people born in Europe.

Using this notation, we can define: \[H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4\] versus \[H_1: \text{not all of the $\mu_i$ are equal, for $i = 1, \ldots, 4$}\]

1.5 Conducting a one-way ANOVA in R

Example R code for fitting a one-way ANOVA using the specified data is shown below:

anova.life <- aov(lifeExp ~ continent, data = gapminder_2002)

1.5.1

The summary function can be used as shown below:

summary(anova.life)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## continent     3  13321    4440   77.17 <2e-16 ***
## Residuals   138   7941      58                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We note here that $d_1 = 4 - 1 = 3$ (the number of continents, i.e. groups, minus 1) and $d_2 = 142 - 4 = 138$ (the number of observations minus the number of groups).
The $p$-value is reported as less than $2 \times 10^{-16}$, which is much less than $0.05$.
The test statistic is $F = 77.17$.
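
The entries of the ANOVA table can be reproduced, up to rounding, from the group means, standard deviations and sample sizes reported in 1.3.1. The following is a minimal sketch of that arithmetic, not part of the lab code; the variable names are our own, and the numbers are typed in by hand from the output above.

```r
# Group summaries copied from the tapply() and table() output in 1.3.1
group_mean <- c(Africa = 53.32523, Americas = 72.42204,
                Asia_Oceania = 69.83423, Europe = 76.70060)
group_sd   <- c(9.586496, 4.799705, 8.494322, 2.922180)
group_n    <- c(52, 25, 35, 30)

k <- length(group_n)   # number of groups
n <- sum(group_n)      # total number of observations

# Grand mean: sample-size-weighted average of the group means
grand_mean <- sum(group_n * group_mean) / n

# Between-group and within-group (residual) sums of squares
ss_between <- sum(group_n * (group_mean - grand_mean)^2)
ss_within  <- sum((group_n - 1) * group_sd^2)

# Degrees of freedom
d1 <- k - 1
d2 <- n - k

# F statistic (ratio of mean squares) and its upper-tail p-value
f_stat  <- (ss_between / d1) / (ss_within / d2)
p_value <- pf(f_stat, d1, d2, lower.tail = FALSE)

round(c(ss_between = ss_between, ss_within = ss_within, F = f_stat), 2)
```

The results should agree with the `Sum Sq` and `F value` columns of `summary(anova.life)` to within rounding.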

1.5.2

To summarise, we can write:

There was a significant difference in the average life expectancy at birth (in years) [$F(3, 138) = 77.17, p < 0.001$] for people born on different continents.

1.6 Test Assumption Checks

No answer required.

1.6.1 Equal Variances Assumption Check

library(car)

leveneTest(gapminder_2002$lifeExp ~ gapminder_2002$continent)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   3  7.7959 7.598e-05 ***
##       138                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the $p$-value ($\approx 0.00008$) is much smaller than $0.05$, we reject the null hypothesis of equal variances, so we cannot assume equal variances across the continents. This is not a surprising result, given our box plot observations earlier.
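
Levene's test with `center = median` amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below shows this idea using base R only. As an assumption for illustration, it uses R's built-in `PlantGrowth` data as a stand-in, so that it runs without `gapminder_2002.csv` or the `car` package; the numbers it prints are unrelated to the lab data.

```r
# Stand-in data (assumption): built-in PlantGrowth, NOT the gapminder file
y <- PlantGrowth$weight
g <- PlantGrowth$group

# Absolute deviation of each observation from its own group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the absolute deviations; a small p-value suggests
# that the spread differs between groups
levene_by_hand <- anova(lm(abs_dev ~ g))
levene_by_hand
```

The `F value` and `Pr(>F)` in the first row should match what `leveneTest(weight ~ group, data = PlantGrowth)` reports.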

1.6.2 Normality Assumption Check - Visual

Remember that for this question we must assess the residuals from the one-way ANOVA.

par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)

residuals <- anova.life$residuals # obtain residuals

hist(residuals, ylim = c(0, 0.08), freq = FALSE,
     main = "Histogram of residuals \n (with normal density curve overlaid)", 
     xlab = "Residuals", col = "skyblue")
curve(dnorm(x, mean = mean(residuals), sd = sd(residuals)), add = TRUE, lwd = 2)

qqnorm(residuals, main = "Normal Q-Q plot", pch = 19); qqline(residuals)

The histogram of residuals shows that the residuals appear to be at least approximately normally distributed. The Normal Q-Q plot shows some deviation from the qqline for low and high theoretical quantile values, but this is not extreme enough to cause major concern.

1.6.3 Normality Assumption Check - Formal

shapiro.test(residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.97155, p-value = 0.004661

Since the $p$-value $\approx 0.005 < 0.05$, the test indicates that the residuals are not normally distributed, even though our visual inspection suggested otherwise.

Let’s consider why we have obtained this unexpected result.

Because the sample size is large ($n = 142$), the Shapiro-Wilk test is very high-powered, which means it is likely to return a small $p$-value even when the departure from normality is minor and the data are still approximately normal. Given the symmetry of the residuals, and because the Central Limit Theorem makes the test robust to mild non-normality when the sample size is large, we do not have any concerns about the normality assumption here, despite the result of the Shapiro-Wilk test.
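
The effect of sample size on the Shapiro-Wilk test can be illustrated with a small simulation. The sketch below is our own illustration, not part of the lab: it assumes data drawn from a $t$ distribution with 5 degrees of freedom (symmetric, but with somewhat heavier tails than a normal distribution), and the seed, sample sizes and number of repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, used only so the run is reproducible

# Proportion of simulated samples of size n for which the
# Shapiro-Wilk test gives p < 0.05
rejection_rate <- function(n, reps = 200) {
  p_values <- replicate(reps, shapiro.test(rt(n, df = 5))$p.value)
  mean(p_values < 0.05)
}

sample_sizes <- c(20, 142, 1000)  # 142 matches the number of observations here
rates <- sapply(sample_sizes, rejection_rate)
names(rates) <- sample_sizes
rates
```

The same mild departure from normality should be flagged more often as the sample size grows, which is why a small $p$-value on its own does not settle the question for a large sample.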

1.7 Post-hoc Testing

TukeyHSD(anova.life)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = lifeExp ~ continent, data = gapminder_2002)
## 
## $continent
##                            diff       lwr       upr     p adj
## Americas-Africa       19.096809 14.295731 23.897888 0.0000000
## Asia_Oceania-Africa   16.508998 12.195902 20.822093 0.0000000
## Europe-Africa         23.375369 18.852544 27.898194 0.0000000
## Asia_Oceania-Americas -2.587811 -7.753601  2.577978 0.5627338
## Europe-Americas        4.278560 -1.063587  9.620707 0.1638481
## Europe-Asia_Oceania    6.866371  1.958116 11.774627 0.0021688

1.7.1

We note that the Asia_Oceania - Americas comparison is not statistically significant, since we have $p \approx 0.563 > 0.05$, and the confidence interval includes 0. This means that the difference in the average life expectancy of people born within these two continents is not statistically significant.

Similarly, we note that the Europe - Americas comparison is not statistically significant, with $p \approx 0.164 > 0.05$.

All other comparisons have a statistically significant difference in average life expectancy between the two continents under consideration. For example, we note that:

People born in the Americas have an average life expectancy which is 19.1 years greater than the average life expectancy of people born in Africa. We have a very small $p$-value with $p < 0.001$, and the $95\%$ confidence interval for this difference is roughly $(14.3, 23.9)$. This confidence interval does not contain 0.
People born in Asia_Oceania have an average life expectancy which is 16.5 years greater than the average life expectancy of people born in Africa. We have a very small $p$-value with $p < 0.001$, and the $95\%$ confidence interval for this difference is roughly $(12.2, 20.8)$. This confidence interval does not contain 0.

N.B. These data are outdated now, so the differences may no longer be as extreme as suggested here.
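
For each Tukey comparison, an adjusted $p$-value below $0.05$ goes together with a $95\%$ confidence interval that excludes 0. As a quick check, the sketch below types in the `lwr`, `upr` and `p adj` columns from the output above (the data frame and its column names are our own) and confirms that the two criteria agree.

```r
# Values copied by hand from the TukeyHSD(anova.life) output
tukey <- data.frame(
  comparison = c("Americas-Africa", "Asia_Oceania-Africa", "Europe-Africa",
                 "Asia_Oceania-Americas", "Europe-Americas",
                 "Europe-Asia_Oceania"),
  lwr   = c(14.295731, 12.195902, 18.852544, -7.753601, -1.063587, 1.958116),
  upr   = c(23.897888, 20.822093, 27.898194,  2.577978,  9.620707, 11.774627),
  p_adj = c(0.0000000, 0.0000000, 0.0000000, 0.5627338, 0.1638481, 0.0021688)
)

# Criterion 1: the confidence interval lies entirely above or below 0
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0

# Criterion 2: the adjusted p-value is below the 5% significance level
tukey$significant <- tukey$p_adj < 0.05

tukey[, c("comparison", "excludes_zero", "significant")]
```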

1.8 Effect Size

install.packages("lsr") # only needed once, if the package is not already installed

library(lsr) 
etaSquared(anova.life)

##              eta.sq eta.sq.part
## continent 0.6265306   0.6265306

We obtain an $\eta^2$ value of roughly $0.63$, which is considered large. This makes sense, based on our previous results: roughly 63% of the variation in life expectancy can be attributed to differences between continents, i.e. the continent in which a person is born is strongly associated with their life expectancy.
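
For a one-way ANOVA, $\eta^2$ is the between-group sum of squares divided by the total sum of squares, so it can be checked by hand from the ANOVA table in 1.5.1. A minimal sketch, using the rounded `Sum Sq` values printed there (the variable names are our own):

```r
# Sums of squares copied from the summary(anova.life) output
ss_continent <- 13321   # between-group (continent) sum of squares
ss_residuals <- 7941    # within-group (residual) sum of squares

# Eta squared: proportion of total variation explained by continent
eta_sq <- ss_continent / (ss_continent + ss_residuals)
eta_sq
```

Because the inputs are rounded, the result should agree with the `etaSquared()` output only to about four decimal places.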

2 Extension: One-way ANOVA Practice

Follow the steps outlined above, with gapminder_2002$lifeExp replaced by gapminder_2002$gdpPercap in the code.
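
One way to avoid retyping the code for a new response variable is to wrap the model fit in a small function. The sketch below is only a suggestion: `run_one_way_anova` is a hypothetical helper of our own, not something provided in the lab, and it is demonstrated on R's built-in `PlantGrowth` data as a stand-in so that it runs without the CSV file.

```r
# Hypothetical helper (not part of the lab): fit a one-way ANOVA for any
# response column, so the same steps can be rerun with a different variable.
run_one_way_anova <- function(data, response, group) {
  # Build the formula response ~ group from the column names
  model_formula <- reformulate(group, response = response)
  fit <- aov(model_formula, data = data)
  list(fit = fit, table = summary(fit))
}

# Stand-in example (assumption): built-in PlantGrowth data
example_result <- run_one_way_anova(PlantGrowth, "weight", "group")
example_result$table

# With the lab data loaded, the equivalent call would be:
# run_one_way_anova(gapminder_2002, "gdpPercap", "continent")
```

The assumption checks, post-hoc tests and effect size can then be applied to the returned `fit` object in the same way as for `anova.life`.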

That’s all the questions done! If there were any parts you were unsure about, take a look back over the relevant sections of the Topic 7 material.

These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

STM1001: Computer Lab 8 Solutions