Example R code for this question is included below, where appropriate.
No answer required.
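# Read the 2002 gapminder data; continent is stored as a factor for the analyses below
gapminder_2002 <- read.csv("gapminder_2002.csv", header = TRUE, stringsAsFactors = TRUE)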
boxplot(gapminder_2002$lifeExp ~ gapminder_2002$continent,
main = "Life Expectancy Box Plots for each Continent",
ylab = "Life Expectancy (years)", xlab = "Continent", col = c(2:6))
tapply(gapminder_2002$lifeExp, gapminder_2002$continent, mean)
##       Africa     Americas Asia_Oceania       Europe
##     53.32523     72.42204     69.83423     76.70060
tapply(gapminder_2002$lifeExp, gapminder_2002$continent, sd)
##       Africa     Americas Asia_Oceania       Europe
##     9.586496     4.799705     8.494322     2.922180
# Number of observations in each continent
table(gapminder_2002$continent)
##
##       Africa     Americas Asia_Oceania       Europe
##           52           25           35           30
We observe that the average life expectancy appears to differ across the continents. The data appear to be similarly spread out for the continent categories Africa and Asia_Oceania, whereas the data have a narrower spread for Europe and Americas. This is supported by the standard deviation values.
The mean life expectancy for people in Africa is much lower than for people in other continents.
The sample size for each continent is different, being 52, 25, 35 and 30 for Africa, the Americas, Asia_Oceania, and Europe respectively.
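Appropriate notation is as follows: let \(\mu_1\), \(\mu_2\), \(\mu_3\) and \(\mu_4\) denote the population mean life expectancy (in years) for Africa, the Americas, Asia_Oceania and Europe respectively.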
Using this notation, we can define: \[H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4\] versus \[H_1: \text{not all of the } \mu_i \text{ are equal, for } i = 1, \ldots, 4\]
Example R code for fitting a one-way ANOVA using the specified data is shown below:
anova.life <- aov(lifeExp ~ continent, data = gapminder_2002)
The summary function can be used as shown below:
summary(anova.life)
##              Df Sum Sq Mean Sq F value Pr(>F)
## continent     3  13321    4440   77.17 <2e-16 ***
## Residuals   138   7941      58
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We note here that \(d_1 = 3\) (the number of groups, i.e. continents, minus 1) and \(d_2 = 138\) (the number of observations minus the number of groups, i.e. \(142 - 4\)).
The \(p\)-value is less than \(2 \times 10^{-16}\), which is much smaller than \(0.05\).
The test statistic is \(F = 77.17\).
To summarise, we can write:
There was a significant difference in the average life expectancy from birth (in years) [\(F(3, 138) = 77.17, p < 0.001\)] for people living on different continents.
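As a rough check of where the F statistic and \(p\)-value in the ANOVA table come from, the sketch below recomputes them from the rounded sums of squares and degrees of freedom printed in the summary output. The variable names are illustrative only, and because the inputs are rounded the result agrees with the table only approximately.

```r
# Values copied from the summary(anova.life) output (rounded as printed)
ss_between <- 13321   # Sum Sq for continent
ss_within  <- 7941    # Sum Sq for Residuals
d1 <- 3               # number of groups minus 1
d2 <- 138             # number of observations minus number of groups

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / d1
ms_within  <- ss_within / d2

# F statistic and its upper-tail p-value under the F(d1, d2) distribution
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df1 = d1, df2 = d2, lower.tail = FALSE)

f_stat            # close to the reported 77.17 (inputs are rounded)
p_value < 0.001   # consistent with the reported p < 0.001
```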
No answer required.
library(car)
leveneTest(gapminder_2002$lifeExp ~ gapminder_2002$continent)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)
## group   3  7.7959 7.598e-05 ***
##       138
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
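Since the \(p\)-value (\(\approx 0.00008\)) is much smaller than \(0.05\), we reject the null hypothesis of equal variances, so we cannot assume equal variances across the continents. This is not a surprising result, given our earlier box plot and standard deviation observations.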
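To illustrate what Levene's test with center = median is doing, the following sketch runs a one-way ANOVA on the absolute deviations of each observation from its group median. It uses simulated data rather than the gapminder file (the group names, sample sizes and standard deviations are assumptions chosen for illustration), so the numbers will not match the output above.

```r
set.seed(1)  # simulated data, for illustration only

# Hypothetical groups with unequal spread (not the gapminder data)
group <- factor(rep(c("A", "B", "C"), times = c(40, 30, 30)))
y <- c(rnorm(40, mean = 50, sd = 10),
       rnorm(30, mean = 70, sd = 5),
       rnorm(30, mean = 75, sd = 3))

# Absolute deviation of each observation from its own group median
group_medians <- tapply(y, group, median)
abs_dev <- abs(y - as.numeric(group_medians[as.character(group)]))

# Median-centred Levene's test is a one-way ANOVA on these deviations
levene_table <- anova(lm(abs_dev ~ group))
levene_table

levene_F <- levene_table[["F value"]][1]
levene_p <- levene_table[["Pr(>F)"]][1]
```

A small \(p\)-value in this table indicates that the typical distance from the group median differs between groups, which is evidence of unequal variances.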
Remember that for this question we must assess the residuals from the one-way ANOVA.
par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)
residuals <- anova.life$residuals # obtain residuals
hist(residuals, ylim = c(0, 0.08), freq = FALSE,
main = "Histogram of residuals \n (with normal density curve overlaid)",
xlab = "Residuals", col = "skyblue")
curve(dnorm(x, mean = mean(residuals), sd = sd(residuals)), add = TRUE, lwd = 2)
qqnorm(residuals, main = "Normal Q-Q plot", pch = 19); qqline(residuals)
The histogram of residuals shows that the residuals appear to be at least approximately normally distributed. The Normal Q-Q plot shows some deviation from the qqline for low and high theoretical quantile values, but this is not extreme enough to cause major concern.
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.97155, p-value = 0.004661
Since the \(p\)-value \(\approx 0.005 < 0.05\), the test indicates that the residuals are not normally distributed, even though our visual inspection suggested otherwise.
Let’s consider why we have obtained this unexpected result.
With a fairly large sample (142 observations), the Shapiro-Wilk test has high power, which means that we are likely to obtain a small \(p\)-value even for minor departures from normality, when the data may still be approximately normal. Given the approximate symmetry of the residuals, and because with a large sample the Central Limit Theorem makes the ANOVA \(F\)-test robust to mild non-normality, we do not have any concerns about the normality assumption here despite the result of the Shapiro-Wilk test.
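The effect of sample size on the Shapiro-Wilk test can be explored with a small simulation. The sketch below is an illustration under assumed settings (a \(t\)-distribution with 5 degrees of freedom as the mildly non-normal population, sample sizes of 30 and 1000, and 200 replications); none of these choices come from the gapminder analysis.

```r
set.seed(2002)  # arbitrary seed so the simulation is reproducible

# Proportion of simulated samples for which shapiro.test() gives p < 0.05
rejection_rate <- function(n, reps = 200) {
  p_values <- replicate(reps, shapiro.test(rt(n, df = 5))$p.value)
  mean(p_values < 0.05)
}

rate_small <- rejection_rate(30)    # small samples
rate_large <- rejection_rate(1000)  # large samples

c(n_30 = rate_small, n_1000 = rate_large)
```

Under these assumptions the rejection rate should be noticeably higher for the larger samples, even though the population is the same symmetric, bell-shaped distribution in both cases.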
TukeyHSD(anova.life)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = lifeExp ~ continent, data = gapminder_2002)
##
## $continent
##                            diff       lwr       upr     p adj
## Americas-Africa       19.096809 14.295731 23.897888 0.0000000
## Asia_Oceania-Africa   16.508998 12.195902 20.822093 0.0000000
## Europe-Africa         23.375369 18.852544 27.898194 0.0000000
## Asia_Oceania-Americas -2.587811 -7.753601  2.577978 0.5627338
## Europe-Americas        4.278560 -1.063587  9.620707 0.1638481
## Europe-Asia_Oceania    6.866371  1.958116 11.774627 0.0021688
We note that the Asia_Oceania - Americas comparison is not statistically significant, since we have \(p \approx 0.563 > 0.05\), and the confidence interval includes 0. That is, we do not have evidence of a difference in the average life expectancy of people born within these two continents.
Similarly, we note that the Europe - Americas comparison is not statistically significant, with \(p \approx 0.164 > 0.05\) and a confidence interval that also includes 0.
All other comparisons have a statistically significant difference in average life expectancy between the two continents under consideration. For example, we note that:
People born in the Americas have an average life expectancy which is 19.1 years greater than the average life expectancy of people born in Africa. We have a very small \(p\)-value with \(p < 0.001\), and the \(95\%\) confidence interval for this difference is roughly \((14.3, 23.9)\). This confidence interval does not contain 0.
People born in Asia_Oceania have an average life expectancy which is 16.5 years greater than the average life expectancy of people born in Africa. We have a very small \(p\)-value with \(p < 0.001\), and the \(95\%\) confidence interval for this difference is roughly \((12.2, 20.8)\). This confidence interval does not contain 0.
N.B. This data is outdated now, so the differences may no longer be as extreme as suggested here.
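As a rough check on how the Tukey intervals are formed, the sketch below rebuilds the Americas-Africa interval from quantities printed earlier (the group means, group sizes and residual sum of squares), using the Tukey-Kramer form for unequal group sizes. Variable names are illustrative, and the inputs are rounded, so agreement with the TukeyHSD output is approximate.

```r
# Values copied from the earlier output (rounded as printed)
mean_americas <- 72.42204
mean_africa   <- 53.32523
n_americas    <- 25
n_africa      <- 52
mse           <- 7941 / 138   # residual mean square from the ANOVA table
k             <- 4            # number of continents being compared
df_resid      <- 138          # residual degrees of freedom

# Difference in sample means
diff_means <- mean_americas - mean_africa

# Tukey-Kramer margin: studentized range quantile / sqrt(2) * standard error
q_crit <- qtukey(0.95, nmeans = k, df = df_resid)
margin <- (q_crit / sqrt(2)) * sqrt(mse * (1 / n_americas + 1 / n_africa))

ci <- c(diff = diff_means, lwr = diff_means - margin, upr = diff_means + margin)
ci   # compare with the Americas-Africa row of the TukeyHSD output
```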
install.packages("lsr") # only needed once, if lsr is not already installed
library(lsr)
etaSquared(anova.life)
##              eta.sq eta.sq.part
## continent 0.6265306   0.6265306
We obtain an \(\eta^2\) value of roughly \(0.63\), which is considered large. This makes sense, based on our previous results - a large proportion of the variation in the life expectancy values of people can be attributed to the continent in which they live, i.e. the continent in which a person is born has a large impact on their life expectancy.
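As a quick check, \(\eta^2\) can be recovered (up to rounding) from the sums of squares in the ANOVA table, as the proportion of the total variation that is attributed to continent. The variable names below are illustrative only.

```r
# Sums of squares copied from the summary(anova.life) output (rounded as printed)
ss_continent <- 13321
ss_residuals <- 7941

# Eta squared: between-group variation as a share of total variation
eta_sq <- ss_continent / (ss_continent + ss_residuals)
round(eta_sq, 2)
```

With a single factor in the model, the partial and ordinary \(\eta^2\) values coincide, which is why both columns of the etaSquared output are identical.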
Follow the steps outlined above, with gapminder_2002$lifeExp replaced by gapminder_2002$gdpPercap in the code (and lifeExp replaced by gdpPercap in the aov formula).
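One way to organise the repeated analysis is to wrap the base R steps in a small helper function. The sketch below is only an assumption about how this might be done: run_one_way is a hypothetical helper (not part of the lab code), it covers only the steps that use base R (not leveneTest or etaSquared), and it is demonstrated on a small simulated data frame because the gapminder file is not reproduced here.

```r
# Hypothetical helper: one-way ANOVA, Tukey comparisons and residual normality test
run_one_way <- function(data, response, group = "continent") {
  data[[group]] <- factor(data[[group]])   # TukeyHSD needs a factor
  model_formula <- reformulate(group, response = response)
  fit <- aov(model_formula, data = data)
  list(
    anova   = summary(fit),
    tukey   = TukeyHSD(fit),
    shapiro = shapiro.test(residuals(fit))
  )
}

# Simulated stand-in for the real data (values are made up)
set.seed(42)
demo <- data.frame(
  continent = rep(c("Africa", "Americas", "Europe"), each = 20),
  gdpPercap = c(rlnorm(20, 7), rlnorm(20, 9), rlnorm(20, 10))
)

results <- run_one_way(demo, response = "gdpPercap")
results$anova
```

With the real data, the corresponding call would be run_one_way(gapminder_2002, response = "gdpPercap"), followed by the Levene's test and effect size steps shown earlier.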
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.