Example R code for this question is included below, where appropriate.
install.packages("gapminder") # Install package
library(gapminder) # Load package
data(gapminder) # Load gapminder data
boxplot(gapminder$lifeExp ~ gapminder$continent,
main = "Life Expectancy Box Plots for each Continent",
ylab = "Life Expectancy (years)", xlab = "Continent", col = c(2:6))
tapply(gapminder$lifeExp, gapminder$continent, mean)
##   Africa Americas     Asia   Europe  Oceania
## 48.86533 64.65874 60.06490 71.90369 74.32621
tapply(gapminder$lifeExp, gapminder$continent, sd)
##    Africa  Americas      Asia    Europe   Oceania
##  9.150210  9.345088 11.864532  5.433178  3.795611
# Count the number of observations for each continent
table(gapminder$continent)
##
##   Africa Americas     Asia   Europe  Oceania
##      624      300      396      360       24
We observe that the average life expectancy appears to be different across the continents. The data appear to be similarly spread out for the continents Africa, Americas and Asia, whereas the data have a narrower spread for Europe and Oceania. This is supported by an assessment of the standard deviation values.
The mean life expectancy for people in Africa (about 48.9 years) is much lower than for people on any other continent (about 60.1 years or more).
The sample size for each continent is different, being 624, 300, 396, 360 and 24 for Africa, the Americas, Asia, Europe and Oceania respectively.
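Appropriate notation is as follows: let \(\mu_i\) denote the population mean life expectancy (in years) for continent \(i\), for \(i = 1, \ldots, 5\).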
Using this notation, we can define: \[H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\] versus \[H_1: \text{ not all $\mu_i$ are equal, for $i = 1, \ldots, 5$ } \]
Example R code for fitting a one-way ANOVA using the specified data is shown below:
anova.life <- aov(lifeExp ~ continent, data = gapminder)
The summary function can be used to display the ANOVA table, as shown below:
summary(anova.life)
##               Df Sum Sq Mean Sq F value Pr(>F)
## continent      4 139343   34836   408.7 <2e-16 ***
## Residuals   1699 144805      85
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We note here that \(d_1 = 4\) (the number of groups, i.e. continents, minus 1: \(5 - 1\)) and \(d_2 = 1699\) (the number of observations minus the number of groups: \(1704 - 5\)).
The p-value is reported as \(< 2 \times 10^{-16}\), which is much less than \(0.05\).
The test statistic is \(F=408.7\).
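As a rough check, the F statistic can be recomputed by hand from the sums of squares in the ANOVA table. The sketch below hard-codes the rounded values printed above, so it agrees with the R output only up to rounding; the variable names are our own.

```r
# Values copied from the ANOVA table and the sample sizes above (rounded as printed)
ss_between <- 139343   # Sum Sq for continent
ss_within  <- 144805   # Sum Sq for Residuals
n_groups   <- 5
n_obs      <- sum(c(624, 300, 396, 360, 24))  # total number of observations

d1 <- n_groups - 1       # numerator degrees of freedom
d2 <- n_obs - n_groups   # denominator degrees of freedom

ms_between <- ss_between / d1   # Mean Sq for continent
ms_within  <- ss_within / d2    # Mean Sq for Residuals
f_stat     <- ms_between / ms_within

# The upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, d1, d2, lower.tail = FALSE)

round(f_stat, 1)
```

The ratio of the two mean squares is the F value reported by summary, and the upper-tail probability corresponds to the Pr(>F) column.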
To summarise, we can write:
There was a significant difference in the average life expectancy at birth (in years) \(\left[F(4, 1699) = 408.7, p < 0.001\right]\) for people living on different continents.
library(car)
leveneTest(gapminder$lifeExp ~ gapminder$continent)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)
## group    4  51.568 < 2.2e-16 ***
##       1699
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is close to 0, and much smaller than \(0.05\), we reject the null hypothesis of equal variances, i.e. we cannot assume equal variances. This is not a surprising result, given our box plot observations earlier.
Remember that for this question we must assess the residuals from the one-way ANOVA.
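For intuition, Levene's test with center = median amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal sketch of that idea follows; levene_by_hand is a hypothetical helper written for illustration and is not part of the car package.

```r
# Hypothetical helper: Levene's test (median-centred) computed by hand
levene_by_hand <- function(y, g) {
  g <- as.factor(g)
  # Absolute deviation of each observation from its own group median
  z <- abs(y - ave(y, g, FUN = median))
  # One-way ANOVA on the absolute deviations
  anova(lm(z ~ g))
}

# With the gapminder data loaded, it would be called as:
# levene_by_hand(gapminder$lifeExp, gapminder$continent)
```

The F value and p-value in the first row of the returned table should agree with those reported by leveneTest.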
par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)
residuals <- anova.life$residuals # obtain residuals
hist(residuals, ylim = c(0, 0.06), freq = FALSE,
main = "Histogram of residuals \n (with normal density curve overlaid)",
xlab = "Residuals", col = "skyblue")
curve(dnorm(x, mean = mean(residuals), sd = sd(residuals)), add = TRUE, lwd = 2)
qqnorm(residuals, main = "Normal Q-Q plot", pch = 19); qqline(residuals)
The histogram of residuals shows that the residuals appear to be normally distributed. The Normal Q-Q plot shows some deviation from the qqline for low theoretical quantile values, but this is not extreme enough to cause major concern.
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.9954, p-value = 4.4e-05
Since the p-value \(=4.4 \times 10^{-5} < 0.001\), we conclude that the residuals are in fact not normally distributed, despite our visual inspection appearing to suggest otherwise.
Let’s consider why we have obtained this unexpected result.
Because the sample size is large, the Shapiro-Wilk test has very high power, so it is likely to return a small p-value even when the departure from normality is slight and the data are still approximately normal. Given the symmetry of the residuals and the large sample size (for which the Central Limit Theorem applies), we do not have any concerns about the normality assumption here, despite the result of the Shapiro-Wilk test.
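The effect of sample size on the Shapiro-Wilk test can be explored with a small simulation. The sketch below uses simulated data, not the gapminder residuals: values are drawn from a t-distribution with 10 degrees of freedom, which is close to, but not exactly, normal. The seed, sample sizes and degrees of freedom are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Mildly non-normal data: t-distribution with 10 degrees of freedom
small_sample <- rt(50, df = 10)
large_sample <- rt(5000, df = 10)  # shapiro.test accepts at most 5000 values

p_small <- shapiro.test(small_sample)$p.value
p_large <- shapiro.test(large_sample)$p.value

# Compare the two p-values side by side
c(n_50 = p_small, n_5000 = p_large)
```

For the same mild departure from normality, the larger sample will typically give a much smaller p-value, although individual runs vary.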
TukeyHSD(anova.life)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = lifeExp ~ continent, data = gapminder)
##
## $continent
##                       diff       lwr       upr     p adj
## Americas-Africa  15.793407 14.022263 17.564550 0.0000000
## Asia-Africa      11.199573  9.579887 12.819259 0.0000000
## Europe-Africa    23.038356 21.369862 24.706850 0.0000000
## Oceania-Africa   25.460878 20.216908 30.704848 0.0000000
## Asia-Americas    -4.593833 -6.523432 -2.664235 0.0000000
## Europe-Americas   7.244949  5.274203  9.215696 0.0000000
## Oceania-Americas  9.667472  4.319650 15.015293 0.0000086
## Europe-Asia      11.838783 10.002952 13.674614 0.0000000
## Oceania-Asia     14.261305  8.961718 19.560892 0.0000000
## Oceania-Europe    2.422522 -2.892185  7.737230 0.7250559
We note that the Oceania-Europe comparison is not statistically significant, since \(p = 0.725 > 0.05\) and the \(95\%\) confidence interval, roughly \((-2.89, 7.74)\), includes 0. We therefore do not have sufficient evidence of a difference in the average life expectancy of people born within these two continents.
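To see where such an interval comes from, the Oceania-Europe interval can be approximately reconstructed from the group means, the sample sizes and the residual mean square, using the studentized range distribution (the Tukey-Kramer form for unequal group sizes). This is a sketch based on the rounded values printed above, so small discrepancies from the TukeyHSD output are expected; the variable names are our own.

```r
# Values copied from the output above
mean_oceania <- 74.32621
mean_europe  <- 71.90369
n_oceania    <- 24
n_europe     <- 360
n_groups     <- 5
df_resid     <- 1699
ms_within    <- 144805 / df_resid   # residual mean square from the ANOVA table

diff_means <- mean_oceania - mean_europe
se_diff    <- sqrt(ms_within * (1 / n_oceania + 1 / n_europe))

# Critical value from the studentized range distribution
q_crit     <- qtukey(0.95, nmeans = n_groups, df = df_resid)
half_width <- q_crit / sqrt(2) * se_diff

c(diff = diff_means,
  lwr  = diff_means - half_width,
  upr  = diff_means + half_width)
```

The small Oceania sample (24 observations) inflates the standard error of the difference, which is why this interval is so much wider than, say, the Europe-Africa interval.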
All other comparisons have a statistically significant difference in average life expectancy between the two continents under consideration. For example, we note that:
People born in the Americas have an average life expectancy which is 15.79 years greater than the average life expectancy of people born in Africa. We have \(p < 0.05\) (in fact, \(p < 0.001\)), and the \(95\%\) confidence interval for this difference is roughly \((14.02, 17.56)\). This confidence interval does not contain 0.
People born in Oceania have an average life expectancy which is a whopping 25.46 years greater than the average life expectancy of people born in Africa. We have \(p < 0.05\) (in fact, \(p < 0.001\)), and the \(95\%\) confidence interval for this difference is roughly \((20.22, 30.70)\). This confidence interval does not contain 0.
N.B. These data are now outdated, so the differences may no longer be as extreme as suggested here.
install.packages("lsr")
library(lsr)
etaSquared(anova.life)
##              eta.sq eta.sq.part
## continent 0.4903887   0.4903887
We obtain an \(\eta^2\) value of roughly \(0.49\), which is considered large. This makes sense given our previous results: roughly \(49\%\) of the variation in life expectancy can be attributed to continent, i.e. the continent in which a person lives is strongly associated with their life expectancy.
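For a one-way ANOVA, \(\eta^2\) is the between-group sum of squares divided by the total sum of squares, so it can be checked directly against the ANOVA table. The sketch below uses the rounded sums of squares printed earlier; the variable names are our own.

```r
# Sums of squares copied from the ANOVA table (rounded as printed)
ss_between <- 139343   # continent
ss_within  <- 144805   # Residuals

# Proportion of total variation explained by continent
eta_sq <- ss_between / (ss_between + ss_within)

round(eta_sq, 4)
```

With a single factor in the model, eta.sq and eta.sq.part coincide, which is why both columns of the etaSquared output show the same value.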
Follow the steps outlined above, with gapminder$lifeExp replaced by gapminder$gdpPercap in the code.
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.