knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
This study is a Problem Set of the “Data Analysis with R” course, part of Udacity’s Data Analyst Nanodegree curriculum. In this document, we select three datasets from Gapminder and perform exploratory analyses using different types of visualisations. We try to identify associations between different variables, more particularly Life Expectancy at Birth and DTP3 Immunization of 1-year-old infants.
For educational purposes, we make the code visible. The code is developped using R and the following libraries: tidyr, dplyr, ggplot2, gridExtra, GGally and RColorBrewer.
The questions we would like to answer are the following:
We select the Life Expectancy At Birth dataset. We download the file, save it as a .tsv (“tabulation-separated values”) and load it into R:
require(tidyr)
require(dplyr)
require(ggplot2)
require(gridExtra)
require(GGally)
require(RColorBrewer)
# Load the data file about Life Expectancy:
le_full <- read.csv('./indicator_life_expectancy_at_birth.tsv', sep = '\t',
stringsAsFactors = FALSE)
colnames(le_full)[1] <- "country"
le_full[1:5, 1:5]
## country X1800 X1801 X1802 X1803
## 1 Abkhazia NA NA NA NA
## 2 Afghanistan 28.21 28.20 28.19 28.18
## 3 Akrotiri and Dhekelia NA NA NA NA
## 4 Albania 35.40 35.40 35.40 35.40
## 5 Algeria 28.82 28.82 28.82 28.82
The data documentation specifies that data prior to 1950 is estimated from models using other data, and might thus be unreliable. We will therefore only use post-1950 data:
le <- select(le_full, country, num_range("X", 1950:2016))
dim(le)
## [1] 260 68
le[1:5, 1:5]
## country X1950 X1951 X1952 X1953
## 1 Abkhazia NA NA NA NA
## 2 Afghanistan 26.85 27.13 27.67 28.19
## 3 Akrotiri and Dhekelia NA NA NA NA
## 4 Albania 54.48 54.72 55.23 55.85
## 5 Algeria 42.77 43.03 43.50 43.96
The data contains 260 countries and covers 67 years (+ the “country” column). We note that there are NAs in the dataset.
For an overview idea of the data, we compute some summary statistics and plot the data distribution for 1950 and 2016 (first and latest years in the sample):
# Some summary statistics for 1950 and 2016:
print("The World in 1950:")
## [1] "The World in 1950:"
summary(le$X1950)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 23.51 39.83 49.06 49.26 58.50 71.64 58
print(paste("Standard deviation:", sd(le$X1950, na.rm = TRUE)))
## [1] "Standard deviation: 11.7946752231682"
print("The World in 2016:")
## [1] "The World in 2016:"
summary(le$X2016)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 48.86 67.18 74.50 72.56 78.65 84.80 52
print(paste("Standard deviation:", sd(le$X2016, na.rm = TRUE)))
## [1] "Standard deviation: 7.73853493362854"
# Histograms for both years:
p1 <- ggplot(aes(x = X1950), data = le) +
theme_dark() +
geom_histogram(binwidth = 1, color = 'black', fill = 'steelblue') +
ggtitle("Life Expectancy Distribution in 1950") +
scale_x_continuous(limits = c(20, 90))
p2 <- ggplot(aes(x = X2016), data = le) +
theme_dark() +
geom_histogram(binwidth = 1, color = 'black', fill = 'steelblue') +
ggtitle("Life Expectancy Distribution in 2016") +
scale_x_continuous(limits = c(20, 90))
grid.arrange(p1, p2, ncol = 1)
These numbers and plots show that the modern world is a radically different place from 1950. The mean LBE moved from 49.3 to 72.6 years while the minimum LBE more than doubled from 23.5 to 48.9. At the same time, the standard deviation \(\sigma\) decreased from 11.8 to 7.7, meaning that not only people live longer on average, but there is also less disparity between countries.
These trends are obvious from the histograms. The whole plot shifted to the right between 1950 and 2016. The data from 1950 is more spread out and is somewhat bi-modal, hinting at well-separated groups of countries. It exhibits almost no skewness.
In contrast, the 2016 data is closer to unimodal, less spread out and negatively skewed (the left tail is longer than the right tail). In other words, the majority of countries are above the average – an observation confirmed by the fact that median > mean.
Moreover, every single country has improved their LEB:
le_decrease <- filter(le, X2016 < X1950)
nrow(le_decrease)
## [1] 0
To get a better sense for the rate of improvement in LEB, we want to compare LEB distributions accross countries at a few regularly spaced dates. We will focus on years 1956 to 2016 by 15-year increments.
We plot the distribution for each of the selected years using frequency polygons, which allow us to represent several data series on the same plot.
# Convert data to long format:
le_long <- gather(le, key = year, value = life_exp, -country, convert = TRUE)
le_long$year <- as.factor(sub('X', '', le_long$year))
# Subset 15-year periods:
fifteen_y <- as.character(seq(1956, 2016, 15))
le_fifteen <- filter(le_long, year %in% fifteen_y)
ggplot(aes(x = life_exp), data = le_fifteen) +
theme_dark() +
geom_freqpoly(aes(color = year), binwidth = 5, size = 1) +
scale_color_brewer(type = 'seq', palette = 'Blues')
It would appear that the biggest changes happened in the earlier years of the dataset (up to 1986), with the distribution’s shape shifting from bimodal to unimodal and negative skeweness. The rate of change seems comparatively slower in the years after that.
Boxplots will allow a more quantitative view:
five_y <- as.character(seq(1951, 2016, 5))
le_five <- filter(le_long, year %in% five_y)
ggplot(aes(x = year, y = life_exp), data = le_five) +
theme_dark() +
geom_boxplot(fill = "lightsteelblue") +
ggtitle("Life Expectancy Worldwide, 1950-2016")
The shift from a bimodal distribution to unimodal with negative skewness might be hint to the fact that there were well-defined groups of countries (in terms of life expectancy) that progressively merged while improving. Typically, poor countries will tend to have lower life expectancy, so we are interested in comparing the rate of improvement for different groups of countries defined by their GDP per capita. The suspicion is that poor countries improved their LEB faster than rich ones, and that this catching up explains the previous observations.
For a full analysis of potential correlations, we would need to join the LEB data with GDP data – which we will do in part 3. For now, we just want to get a feel for the correctness of our assumption. We will therefore select only a few countries in each group.
Note: The countries are classified according to their GDP per capita as of 2015.
With the defined groups, we can then make boxplots for each, similar to the work done above.
# Define country groups:
poor <- c("Somalia", "Congo, Dem. Rep.", "Mozambique", "Ethiopia")
mid_lower <- c("Bangladesh", "Kenya", "Cameroon", "India")
mid_upper <- c("Indonesia", "China", "Brazil", "Russia")
rich <- c("United States", "Japan", "Switzerland", "United Kingdom")
le_5_select <- subset(le_five,
country %in% c(poor, mid_lower, mid_upper, rich))
le_5_select$country_group <- NA
le_5_select[le_5_select$country %in% poor,
"country_group"] <- "poor"
le_5_select[le_5_select$country %in% mid_lower,
"country_group"] <- "mid_lower"
le_5_select[le_5_select$country %in% mid_upper,
"country_group"] <- "mid_upper"
le_5_select[le_5_select$country %in% rich,
"country_group"] <- "rich"
# Build boxplots:
p1 <- ggplot(aes(x = as.factor(year), y = life_exp),
data = subset(le_5_select, country_group == "poor")) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(30, 85)) +
ggtitle("Low GDP per Capita") +
xlab("Year") + ylab("Life Expectancy at Birth") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p2 <- ggplot(aes(x = as.factor(year), y = life_exp),
data = subset(le_5_select, country_group == "mid_lower")) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(30, 85)) +
ggtitle("Lower-Mid GDP per Capita") +
xlab("Year") + ylab("Life Expectancy at Birth") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p3 <- ggplot(aes(x = as.factor(year), y = life_exp),
data = subset(le_5_select, country_group == "mid_upper")) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(30, 85)) +
ggtitle("Upper-Mid GDP per Capita") +
xlab("Year") + ylab("Life Expectancy at Birth") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p4 <- ggplot(aes(x = as.factor(year), y = life_exp),
data = subset(le_5_select, country_group == "rich")) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(30, 85)) +
ggtitle("High GDP per Capita") +
xlab("Year") + ylab("Life Expectancy at Birth") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(p1, p2, p3, p4, ncol = 2)
These plots tend to confirm our assumption. IQRs are not very relevant with such small samples, but we can clearly see that medians improved significantly more for countries in the “Low GDP”, “Lower-Mid GDP” and “Upper-Mid” groups (by about 25 to 27 years for each of them), than the “High GDP” group (about 14 years). Moreover, it seems that the richer the country, the slower the rate of change between 2010 and 2016.
These trends make sense, as the law of diminishing returns would apply to life expectancy – it takes a lot more effort to get from 60 to 80 than to get from 40 to 60. However, they also indicate that health in developping countries has improved dramatically in the second half of the XX\(^{th}\) century and continues to do so.
This exploratory analysis has given us some key insights into the data. We observed that the overall LEB has improved dramatically since 1950 for every single country in the world, albeit at a slower rate since the 1990s. Moreover, the improvement benefited poorer countries the most, which led to a much more compact distribution of LEB accross the world.
In this second part of the analysis, we introduce another variable, the proportion of one-year-olds immunized with 3 doses of Diphteria-Tetanus-Pertussis (DTP3). This dataset contains observations since 1980 and is also provided by Gapminder.
We would like to answer the following questions:
dtp <- read.csv('./one_year_old_dtp3_immunized.tsv', sep = '\t',
stringsAsFactors = FALSE)
colnames(dtp)[1] <- 'country'
dim(dtp)
## [1] 270 33
The data contains 270 countries over 32 years (from 1980 to 2011). Note that the LEB dataset previously explored had only 260 countries but spanned a much longer period (with many missing values). To compare the two, we will need to abandon some of the data.
We first need to convert the dataset into a long format:
dtp_long <- gather(dtp, key = year, value = pct_immunized,
-country, convert = TRUE)
dtp_long$year <- as.factor(sub('X', '', dtp_long$year))
We then build a frequency polygon plot, selecting a few years:
ggplot(aes(x = pct_immunized), data = subset(dtp_long, year %in% c('1981', '1991', '2001', '2011'))) +
theme_dark() +
geom_freqpoly(aes(color = year), binwidth = 5, size = 1) +
scale_color_brewer(type = 'seq', palette = 'Blues') +
scale_x_continuous(limits = c(0, 100))
While in 1981, the distribution of countries was very uniform, by 1991 it was already starting to display a significant peak between 80 and 100%. This trend continued in the next 20 years until the distribution looked heavily negatively skewed, with a left-tail comprising of a only a few countries. This is similar to what we witnessed with LEB, but here the phenomenon seems even more accute and much faster.
two_y <- as.character(seq(1981, 2011, 2))
ggplot(aes(x = year, y = pct_immunized),
data = subset(dtp_long, year %in% two_y)) +
theme_dark() +
geom_boxplot(fill = "lightsteelblue") +
ggtitle("Percentage of 1-year-old immunized with DTP3")
Indeed, these boxplots confirm a critical increase in immunization and the equally considerable reduction of the distribution’s dispersion.
Finally, we re-use the country groups created in Part 1 to compare how DTP3 Immunization progressed between countries of different wealth.
dtp_select <- subset(dtp_long,
country %in% c(poor, mid_lower, mid_upper, rich))
dtp_select$country_group <- NA
dtp_select[dtp_select$country %in% poor,
"country_group"] <- "poor"
dtp_select[dtp_select$country %in% mid_lower,
"country_group"] <- "mid_lower"
dtp_select[dtp_select$country %in% mid_upper,
"country_group"] <- "mid_upper"
dtp_select[dtp_select$country %in% rich,
"country_group"] <- "rich"
# Build boxplots:
p1 <- ggplot(aes(x = year, y = pct_immunized),
data = subset(dtp_select,
year %in% two_y & country_group == "poor")) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(0, 100)) +
ggtitle("Low GDP per Capita") +
xlab("Year") + ylab("% immunized with DTP3")
p2 <- ggplot(aes(x = year, y = pct_immunized),
data = subset(dtp_select,
year %in% two_y & country_group == "mid_lower")) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(0, 100)) +
ggtitle("Lower-Mid GDP per Capita") +
xlab("Year") + ylab("% immunized with DTP3")
p3 <- ggplot(aes(x = year, y = pct_immunized),
data = subset(dtp_select,
year %in% two_y & country_group == "mid_upper")) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(0, 100)) +
ggtitle("Upper-Mid GDP per Capita") +
xlab("Year") + ylab("% immunized with DTP3")
p4 <- ggplot(aes(x = year, y = pct_immunized),
data = subset(dtp_select,
year %in% two_y & country_group == "rich")) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(0, 100)) +
ggtitle("High GDP per Capita") +
xlab("Year") + ylab("% immunized with DTP3")
grid.arrange(p1, p2, p3, p4, ncol = 2)
As with Life Expectancy at Birth, the biggest gains happened in the poorer countries. The median values are more erratic than with life expectancy. A larger sample would probably fix that. The lowest group however is still far from the levels reached by the other countries.
In this section, we will explore possible correlations between Life Expectancy at Birth and DTP3 immunization among 1-year-old infants.
We first have to join the two datasets. As we already noticed that the LEB dataset had less countries but more years that the DTP3 dataset, we will need to subset it for the same time span and then perform a left join.
period <- as.character(seq(1980, 2011))
le_dtp <- le_long %>%
filter(year %in% period) %>%
left_join(dtp_long, by = c("country", "year"))
le_dtp$year <- as.numeric(le_dtp$year)
We can now build a scatterplot including all country x year
combinations. We use horizontal jittering to reduce overplotting because percentages on the x-axis are expressed as integers, which tends to create “stripes”:
ggplot(data = le_dtp, aes(x = pct_immunized, y = life_exp)) +
geom_jitter(colour = 'lightsteelblue', alpha = '0.16', height = 0, width = .4) +
theme_dark() +
coord_cartesian(ylim = c(30, 90))
Here, each observation or dot on the plot is a combination of a country and a year. There seems to be a relatively linear correlation between LEB and DTP3 Immunization.
cor.test(le_dtp$pct_immunized, le_dtp$life_exp)
##
## Pearson's product-moment correlation
##
## data: le_dtp$pct_immunized and le_dtp$life_exp
## t = 61.348, df = 5368, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6259937 0.6574495
## sample estimates:
## cor
## 0.6419917
This confirms that there is a moderate to strong positive correlation.
To improve the plot and give it a sense of progression over time, we can group observations by 5-year periods. We can also add the linear regression line (brown), the mean (blue), the median (white, solid) and the .1 and .9 quantiles (white, dashed) for LEB as a function of Immunization:
le_dtp$period <- cut(le_dtp$year,
breaks = c(seq(1980, 2010, 5), 2011),
labels = c("1980-1984", "1985-1989",
"1990-1994", "1995-1999",
"2000-2004", "2005-2009",
"2010-2011"),
include.lowest = TRUE, right = TRUE)
ggplot(data = le_dtp, aes(x = pct_immunized, y = life_exp)) +
scale_colour_brewer(type = 'div', palette = 'Spectral') +
geom_jitter(aes(colour = period), alpha = '0.15', height = 0, width = .4) +
geom_smooth(formula = y ~ x, method = 'lm', colour = 'brown') +
geom_line(stat = 'summary', fun.y = "mean",
fun.args = list(na.rm = TRUE),
color = 'blue') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .1),
linetype = 2, color = 'white') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .5),
color = 'white') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .9),
linetype = 2, color = 'white') +
theme_dark() +
coord_cartesian(ylim = c(30, 90))
We notice that as observed in previous plots, observations corresponding to more recent periods are bunched up near the top-right corner of the plot (higher immunization and higher life expectancy) whereas older periods are further to the left. Moreover, the 0.1-0.9 quantile range tends to narrow down as we progress towards the higher values of immunization, although this could be an effect of the higher observation count in that region of the plot. We can also remark that the mean and median lines remain intertwined for the most part, except over 80% immunization where the LEB median becomes consistently higher than the mean by a few years.
From all the observations above, we can conclude that a high immunization ratio is associated with:
In this part of the analysis, we made the following observations:
In this section, we are going to cross-analyse the previous two variables against GDP per capita much more accurately.
The questions we would like to answer are:
We again pull our GDP dataset from Gapminder – this time the data is GDP/capita (US$, inflation-adjusted).
gdp <- read.csv('./GDPpercapitaconstant2000US.tsv', sep = '\t')
colnames(gdp)[1] <- "country"
colnames(gdp)[2 : 53] <- sub("X", "", colnames(gdp)[2 : 53])
gdp[1:5, 1:5]
## country 1960 1961 1962 1963
## 1 Abkhazia NA NA NA NA
## 2 Afghanistan NA NA NA NA
## 3 Akrotiri and Dhekelia NA NA NA NA
## 4 Albania NA NA NA NA
## 5 Algeria 1280.385 1085.415 855.948 1128.416
The dataset contains inflation-adjusted GDP per capita values for 275 countries since 1960. However there are many missing values, particularly before 1980. As this is also the starting point for the combined LEB/DTP3 dataset, we drop any prior columns, convert it to a long format then left-join this dataset with the dataset:
gdp_long <- gather(gdp, key = year, value = gdp_cap, -country)
gdp_long$year <- as.numeric(gdp_long$year)
le_dtp_gdp <- gdp_long %>%
filter(year %in% period) %>% # Susbset for the period 1980-2011
left_join(le_dtp, by = c("country", "year"))
We can now plot these variables against each other:
ggpairs(le_dtp_gdp[, 3:5])
To improve the plot we add the variable gdp_cap_log
which is the base 10-logarithm of gdp_cap
and plot again with refined aesthetics and linear smoothers:
le_dtp_gdp$gdp_cap_log <- log10(le_dtp_gdp$gdp_cap)
custom_points <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_point(..., alpha = .2, color = 'lightsteelblue') +
geom_smooth(method = 'lm')
}
ggpairs(select(le_dtp_gdp, gdp_cap_log, life_exp, pct_immunized),
lower = list(continuous = custom_points))
Using log values for GDP per capita improves the linear correlation significantly – we now have a strong correlation (.80) with life expectancy and a medium one with immunization (.51).
We also see that when using log values, the GDP/capita distribution accross country/year pairs is much more balanced and exhibits almost no skewness, whereas it was heavily skewed (positively) using raw values: Only a few rich country/year pairs are contained in the long right tail. In the context of building a predictive model, a log transformation on GDP/capita would be strongly recommended.
With the GDP/capita data, we can now have a much more rigourous approach to the analyses by country group that we ran in the previous sections.
We first plot life expectancy against immunization, colouring by GDP/capita:
ggplot(data = le_dtp_gdp, aes(x = pct_immunized, y = life_exp)) +
scale_color_continuous(low = "#132B43", high = "#56B1F7") +
geom_jitter(aes(colour = gdp_cap_log),
alpha = '0.15', height = 0, width = .4) +
geom_smooth(formula = y ~ x, method = 'lm', colour = 'brown') +
geom_line(stat = 'summary', fun.y = "mean",
fun.args = list(na.rm = TRUE),
color = 'blue') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .1),
linetype = 2, color = 'white') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .5),
color = 'white') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .9),
linetype = 2, color = 'white') +
theme_dark() +
coord_cartesian(ylim = c(30, 90))
To improve the plot’s readability, we can create groups by GDP/capita. Let’s use quartiles:
le_dtp_gdp$gdp_quartile <- with(le_dtp_gdp,
cut(gdp_cap,
breaks = quantile(gdp_cap,
probs =
seq(
0., 1., by = 0.25),
na.rm = TRUE),
labels = c(
"Low", "Lower-Mid", "Higher-Mid", "High"),
include.lowest=TRUE))
ggplot(data = subset(le_dtp_gdp, !is.na(gdp_quartile)),
aes(x = pct_immunized, y = life_exp)) +
scale_color_brewer(type = 'qual', palette = 2) +
geom_jitter(aes(colour = gdp_quartile),
alpha = '0.5', height = 0, width = .4) +
geom_smooth(formula = y ~ x, method = 'lm', colour = 'brown') +
geom_line(stat = 'summary', fun.y = "mean",
fun.args = list(na.rm = TRUE),
color = 'blue') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .1),
linetype = 2, color = 'white') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .5),
color = 'white') +
geom_line(stat = 'summary', fun.y = "quantile",
fun.args = list(na.rm = TRUE, prob = .9),
linetype = 2, color = 'white') +
theme_dark() +
coord_cartesian(ylim = c(30, 90))
On the plot, we can clearly see “layers” by colour: at the bottom are the Low GDP countries and at the top the richest countries.
But maybe the correlation between Life Expectancy and Immmunization only exists through GRP/capita? In other words, richer countries tend to have higher life expectancy, and they also happen to have higher immunization ratios.
To try and reduce the influence of GDP/capita on the other two variables, we can plot them by wealth group:
ggplot(data = subset(le_dtp_gdp, !is.na(gdp_quartile)),
aes(x = pct_immunized, y = life_exp)) +
geom_jitter(colour = 'lightsteelblue',
alpha = '0.3', height = 0, width = .4) +
theme_dark() +
coord_cartesian(ylim = c(40, 90)) +
facet_wrap( ~ gdp_quartile, ncol = 2)
The correlations are still there, but in the two middle categories we can see a large number of outliers that have high immunization and low life expectancy. Let’s look at them in more detail:
high_immu_low_le <- subset(le_dtp_gdp,
gdp_quartile %in% c("Lower-Mid", "Higher-Mid") &
pct_immunized > 80 & life_exp < 57)
group_by(high_immu_low_le, country, period) %>%
summarise(mean_le = mean(life_exp), mean_immu = mean(pct_immunized))
## Source: local data frame [13 x 4]
## Groups: country [?]
##
## country period mean_le mean_immu
## (chr) (fctr) (dbl) (dbl)
## 1 Angola 2005-2009 56.700 81.0
## 2 Botswana 1990-1994 56.800 95.0
## 3 Botswana 1995-1999 51.080 96.8
## 4 Botswana 2000-2004 46.800 96.4
## 5 Botswana 2005-2009 53.340 96.0
## 6 Botswana 2010-2011 56.500 96.0
## 7 Equatorial Guinea 1995-1999 50.750 81.0
## 8 Namibia 2000-2004 52.800 83.5
## 9 Namibia 2005-2009 54.900 86.0
## 10 Swaziland 1995-1999 52.375 84.5
## 11 Swaziland 2000-2004 44.240 91.8
## 12 Swaziland 2005-2009 44.960 93.4
## 13 Swaziland 2010-2011 48.000 91.0
These outliers are African countries, some of which are recovering from a troubled history (Angola, Equatorial Guinea).
There are also outliers in the High GDP group, whose average life expectancy is fairly high but immunization is low.
rich_low_immu <- subset(le_dtp_gdp,
gdp_quartile == "High" & pct_immunized < 50)
group_by(rich_low_immu, country, period) %>%
summarise(mean_le = mean(life_exp), mean_immu = mean(pct_immunized))
## Source: local data frame [8 x 4]
## Groups: country [?]
##
## country period mean_le mean_immu
## (chr) (fctr) (dbl) (dbl)
## 1 Australia 1980-1984 74.72667 40.33333
## 2 Bahamas 1980-1984 66.75000 36.00000
## 3 Ireland 1980-1984 74.35833 40.83333
## 4 Ireland 1985-1989 75.50333 43.66667
## 5 Italy 1980-1984 75.56500 11.00000
## 6 Saudi Arabia 1980-1984 66.99000 41.00000
## 7 United Arab Emirates 1980-1984 67.35333 33.00000
## 8 United Kingdom 1980-1984 73.73500 43.50000
These outliers correspond to the early years of the dataset. Some are countries that increased their wealth rapidly over the last 30 years (thus had more of a “developping country” profile during the 1980s). Public immunization policies are also likely to have played a role in explaning the low immunization rate in the 1980s. Note that in the case of Italy, the extremely low values for Italy are all the more surprising that in 1986, the immunization rate suddenly jumped to 98% according to the data. Despite some Internet research, we could not find an evidence of a major shift in policy at that time, so we cannot exclude that there are errors in the data.
Now that we have rigorously defined categories for countries, we can re-run the plots that we made in parts 1 and 2 and see how life expectancy and immunization evolved for each quartile. We are, however, limited to post-1980 data because of the GDP/capita dataset.
Life Expectancy at Birth:
years_select <- seq(1981, 2011, 2)
ggplot(data = subset(le_dtp_gdp,
year %in% years_select &
!is.na(life_exp) &
!is.na(gdp_quartile)),
aes(x = as.factor(year), y = life_exp)) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(45, 85)) +
xlab('Year') + ylab('Life Expectancy at Birth') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_wrap(~ gdp_quartile, ncol = 2)
In this shorter period of time, the differences in the rate of improvement of life expectancy are less obvious than on the 60-year plot of Part 1. However it is quite interesting to observe that each of the first three groups’ median life expectancy in 2011 is very close to the next group’s median life expectancy in 1981, as if there was a 30-year gap in development between each quartile (as far as life expectancy is concerned).
DTP3 Immunization of 1-year-olds:
ggplot(data = subset(le_dtp_gdp,
year %in% years_select &
!is.na(pct_immunized) &
!is.na(gdp_quartile)),
aes(x = as.factor(year), y = pct_immunized)) +
theme_dark() +
geom_boxplot(fill = 'lightsteelblue') +
coord_cartesian(ylim = c(0, 100)) +
xlab('Year') + ylab('DTP3 Immunization of 1-yo infants') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_wrap(~ gdp_quartile, ncol = 2)
This confirms the spectacular catch-up performed by countries at the lower-end of the wealth spectrum. Naturally, rich countries are already at almost 100% immunization, therefore they are no longer growing significantly.
To further dissociate life expectancy and DTP3 immunization from their association with GDP per capita, we can create a new variable called le_by_log_gdp
that is calculated as follow:
\(le\_by\_log\_gdp = \frac{life\_exp}{log_{10}(gdp\_cap)}\)
le_dtp_gdp$le_by_log_gdp <- with(le_dtp_gdp,
life_exp / gdp_cap_log)
ggplot(data = subset(le_dtp_gdp, year %in% c(1981, 1991, 2001, 2011)),
aes(x = gdp_cap_log, y = le_by_log_gdp)) +
geom_point(aes(colour = as.factor(year))) +
scale_colour_discrete() +
theme_dark() +
geom_smooth(method = 'lm')
There we discover an unexpected relationship: There seems to be a negative linear correlation between le_by_log_gdp and gdp_cap_log. In other words, the higher the wealth of a country, the smaller the gain in life expectancy for every further increase in wealth. This is again the law of diminishing returns at play. We can however notice that these further increments are themselves increasing in size over time: As technology and healthcare improve, each increment in wealth generates a larger increment in life expectancy. We can see this because the earlier years tend to be below the more recent ones on the plot.
What are the parameters of this correlation?
with(le_dtp_gdp, cor.test(gdp_cap_log, le_by_log_gdp))
##
## Pearson's product-moment correlation
##
## data: gdp_cap_log and le_by_log_gdp
## t = -83.184, df = 5498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7579616 -0.7345490
## sample estimates:
## cor
## -0.7464863
lm(data = le_dtp_gdp, le_by_log_gdp ~ gdp_cap_log)
##
## Call:
## lm(formula = le_by_log_gdp ~ gdp_cap_log, data = le_dtp_gdp)
##
## Coefficients:
## (Intercept) gdp_cap_log
## 30.406 -2.919
We can use these results to build a new expression for life expectancy as a function of GDP per capita:
$$ \[\begin{aligned} \frac{life\_exp}{log_{10}(gdp\_cap)} &\approx -2.919 \times log_{10}(gdp\_cap) + 30.406 \\ Define: \\ le\_gdp\_const &= \frac{life\_exp}{log_{10}(gdp\_cap)} + 2.919 \times log_{10}(gdp\_cap) \approx 30.406 \\ \end{aligned}\]$$
This new variable should be almost constant around 30.406. We add it to our dataset and look at its distribution:
le_dtp_gdp$le_gdp_const <- with(le_dtp_gdp,
le_by_log_gdp + 2.919 * gdp_cap_log)
qplot(data = le_dtp_gdp, le_gdp_const, fill = I('lightsteelblue'),
colour = I('steelblue'), binwidth = .2) +
theme_dark()
Indeed, the new variable has a narrow normal distribution with a maximum between 30 and 31. Plotting this variable against pct_immunized
should allow us to almost completely remove the association to GDP/capita:
ggplot(data = subset(le_dtp_gdp, !is.na(gdp_quartile)),
aes(x = pct_immunized, y = le_gdp_const)) +
geom_jitter(colour = 'lightsteelblue', width = .2, height = 0.,
alpha = '0.3') +
theme_dark() +
coord_cartesian(ylim = c(25, 35))
with(le_dtp_gdp, cor.test(pct_immunized, le_gdp_const))
##
## Pearson's product-moment correlation
##
## data: pct_immunized and le_gdp_const
## t = 25.564, df = 4995, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3153921 0.3644356
## sample estimates:
## cor
## 0.3401451
The correlation between LEB and immunization is now weaker but still highly significant ($ p < 2 10^{-16}$). It is about half of the correlation we found when using the raw life expectancy values.
In this section, we achieved the following results:
Overall this analysis shows just how much public health has improved since the middle of the XX\(^{th}\) century, especially in poorer countries. Life Expectancy at Birth and DTP3 Immunization Ratio among 1-year-olds have both improved dramatically in every country, regardless of their wealth. They remain clearly associated with the GDP per capita, but the poorer countries have been catching up relentlessly.
In terms of public health policy, we also showed that DTP Imunization of 1-year-olds and Life Expectancy at Birth are directly correlated, and not just through economic growth. Although this does not prove a cause-consequence relationship, this observation advocates for comprehensive immunization campaigns of young infants.