knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Introduction

This study is a Problem Set of the “Data Analysis with R” course, part of Udacity’s Data Analyst Nanodegree curriculum. In this document, we select three datasets from Gapminder and perform exploratory analyses using different types of visualisations. We try to identify associations between different variables, more particularly Life Expectancy at Birth and DTP3 Immunization of 1-year-old infants.

For educational purposes, we make the code visible. The code is developped using R and the following libraries: tidyr, dplyr, ggplot2, gridExtra, GGally and RColorBrewer.

1. Life Expectancy at Birth over Time

Questions

The questions we would like to answer are the following:

How did Life Expectancy at Birth (LEB) and its distribution among countries evolve overall since the middle of the 20th century?
What are the periods where it has changed the most?
How did the evolution of LEB compare between countries of different wealth?

Data Load

We select the Life Expectancy At Birth dataset. We download the file, save it as a .tsv (“tabulation-separated values”) and load it into R:

require(tidyr)
require(dplyr)
require(ggplot2)
require(gridExtra)
require(GGally)
require(RColorBrewer)

# Load the data file about Life Expectancy:
le_full <- read.csv('./indicator_life_expectancy_at_birth.tsv', sep = '\t', 
                stringsAsFactors = FALSE)

colnames(le_full)[1] <- "country"
le_full[1:5, 1:5]

##                 country X1800 X1801 X1802 X1803
## 1              Abkhazia    NA    NA    NA    NA
## 2           Afghanistan 28.21 28.20 28.19 28.18
## 3 Akrotiri and Dhekelia    NA    NA    NA    NA
## 4               Albania 35.40 35.40 35.40 35.40
## 5               Algeria 28.82 28.82 28.82 28.82

The data documentation specifies that data prior to 1950 is estimated from models using other data, and might thus be unreliable. We will therefore only use post-1950 data:

le <- select(le_full, country, num_range("X", 1950:2016))
dim(le)

## [1] 260  68

le[1:5, 1:5]

##                 country X1950 X1951 X1952 X1953
## 1              Abkhazia    NA    NA    NA    NA
## 2           Afghanistan 26.85 27.13 27.67 28.19
## 3 Akrotiri and Dhekelia    NA    NA    NA    NA
## 4               Albania 54.48 54.72 55.23 55.85
## 5               Algeria 42.77 43.03 43.50 43.96

The data contains 260 countries and covers 67 years (+ the “country” column). We note that there are NAs in the dataset.

LEB distribution in 1950 vs. 2016

For an overview idea of the data, we compute some summary statistics and plot the data distribution for 1950 and 2016 (first and latest years in the sample):

# Some summary statistics for 1950 and 2016:
print("The World in 1950:")

## [1] "The World in 1950:"

summary(le$X1950)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   23.51   39.83   49.06   49.26   58.50   71.64      58

print(paste("Standard deviation:", sd(le$X1950, na.rm = TRUE)))

## [1] "Standard deviation: 11.7946752231682"

print("The World in 2016:")

## [1] "The World in 2016:"

summary(le$X2016)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   48.86   67.18   74.50   72.56   78.65   84.80      52

print(paste("Standard deviation:", sd(le$X2016, na.rm = TRUE)))

## [1] "Standard deviation: 7.73853493362854"

# Histograms for both years:
p1 <- ggplot(aes(x = X1950), data = le) +
    theme_dark() +
    geom_histogram(binwidth = 1, color = 'black', fill = 'steelblue') +
    ggtitle("Life Expectancy Distribution in 1950") + 
    scale_x_continuous(limits = c(20, 90))
p2 <- ggplot(aes(x = X2016), data = le) +
    theme_dark() +
    geom_histogram(binwidth = 1, color = 'black', fill = 'steelblue') +
    ggtitle("Life Expectancy Distribution in 2016") +
    scale_x_continuous(limits = c(20, 90))
grid.arrange(p1, p2, ncol = 1)

These numbers and plots show that the modern world is a radically different place from 1950. The mean LBE moved from 49.3 to 72.6 years while the minimum LBE more than doubled from 23.5 to 48.9. At the same time, the standard deviation $\sigma$ decreased from 11.8 to 7.7, meaning that not only people live longer on average, but there is also less disparity between countries.

These trends are obvious from the histograms. The whole plot shifted to the right between 1950 and 2016. The data from 1950 is more spread out and is somewhat bi-modal, hinting at well-separated groups of countries. It exhibits almost no skewness.
In contrast, the 2016 data is closer to unimodal, less spread out and negatively skewed (the left tail is longer than the right tail). In other words, the majority of countries are above the average – an observation confirmed by the fact that median > mean.

Moreover, every single country has improved their LEB:

le_decrease <- filter(le, X2016 < X1950)
nrow(le_decrease)

## [1] 0

Evolution of LEB over time

To get a better sense for the rate of improvement in LEB, we want to compare LEB distributions accross countries at a few regularly spaced dates. We will focus on years 1956 to 2016 by 15-year increments.

We plot the distribution for each of the selected years using frequency polygons, which allow us to represent several data series on the same plot.

# Convert data to long format:
le_long <- gather(le, key = year, value = life_exp, -country, convert = TRUE)
le_long$year <- as.factor(sub('X', '', le_long$year))

# Subset 15-year periods:
fifteen_y <- as.character(seq(1956, 2016, 15))
le_fifteen <- filter(le_long, year %in% fifteen_y)

ggplot(aes(x = life_exp), data = le_fifteen) +
    theme_dark() +
    geom_freqpoly(aes(color = year), binwidth = 5, size = 1) +
    scale_color_brewer(type = 'seq', palette = 'Blues')

It would appear that the biggest changes happened in the earlier years of the dataset (up to 1986), with the distribution’s shape shifting from bimodal to unimodal and negative skeweness. The rate of change seems comparatively slower in the years after that.

Boxplots will allow a more quantitative view:

five_y <- as.character(seq(1951, 2016, 5))
le_five <- filter(le_long, year %in% five_y)
ggplot(aes(x = year, y = life_exp), data = le_five) +
    theme_dark() +
    geom_boxplot(fill = "lightsteelblue") + 
    ggtitle("Life Expectancy Worldwide, 1950-2016")

As noted before, the largest improvements in median values happened between 1950 and 1980 approximately,
The growth seemed to slow down the the late 80’s / early 90’s, then bounced back especially in the early 2000’s,
The interquartile range (IQR – spread of the middle 50% of the countries) decreased most significantly from 1971 to 1986,
The median progressively shifted towards the upper bound of the IQR, confirming a shift towards negative skewness
A few outliers appeared aound 2006 at the lower end of the distribution.

Evolution by Country Wealth

The shift from a bimodal distribution to unimodal with negative skewness might be hint to the fact that there were well-defined groups of countries (in terms of life expectancy) that progressively merged while improving. Typically, poor countries will tend to have lower life expectancy, so we are interested in comparing the rate of improvement for different groups of countries defined by their GDP per capita. The suspicion is that poor countries improved their LEB faster than rich ones, and that this catching up explains the previous observations.

For a full analysis of potential correlations, we would need to join the LEB data with GDP data – which we will do in part 3. For now, we just want to get a feel for the correctness of our assumption. We will therefore select only a few countries in each group.

Note: The countries are classified according to their GDP per capita as of 2015.

With the defined groups, we can then make boxplots for each, similar to the work done above.

# Define country groups:
poor <- c("Somalia", "Congo, Dem. Rep.", "Mozambique", "Ethiopia")
mid_lower <- c("Bangladesh", "Kenya", "Cameroon", "India")
mid_upper <- c("Indonesia", "China", "Brazil", "Russia")
rich <- c("United States", "Japan", "Switzerland", "United Kingdom")

le_5_select <- subset(le_five, 
                        country %in% c(poor, mid_lower, mid_upper, rich))
le_5_select$country_group <- NA
le_5_select[le_5_select$country %in% poor, 
              "country_group"] <- "poor"
le_5_select[le_5_select$country %in% mid_lower, 
              "country_group"] <- "mid_lower"
le_5_select[le_5_select$country %in% mid_upper, 
              "country_group"] <- "mid_upper"
le_5_select[le_5_select$country %in% rich, 
              "country_group"] <- "rich"

# Build boxplots:
p1 <- ggplot(aes(x = as.factor(year), y = life_exp), 
             data = subset(le_5_select, country_group == "poor")) +
      theme_dark() +
      geom_boxplot(fill = 'lightsteelblue') +
      coord_cartesian(ylim = c(30, 85)) +
      ggtitle("Low GDP per Capita") +
      xlab("Year") + ylab("Life Expectancy at Birth") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))

p2 <- ggplot(aes(x = as.factor(year), y = life_exp), 
      data = subset(le_5_select, country_group == "mid_lower")) +
      theme_dark() +
      geom_boxplot(fill = 'lightsteelblue') +
      coord_cartesian(ylim = c(30, 85)) +
      ggtitle("Lower-Mid GDP per Capita") +
      xlab("Year") + ylab("Life Expectancy at Birth") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))
      
p3 <- ggplot(aes(x = as.factor(year), y = life_exp), 
      data = subset(le_5_select, country_group == "mid_upper")) +
      theme_dark() +
      geom_boxplot(fill = 'lightsteelblue') +
      coord_cartesian(ylim = c(30, 85)) +
      ggtitle("Upper-Mid GDP per Capita") +
      xlab("Year") + ylab("Life Expectancy at Birth") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))

p4 <- ggplot(aes(x = as.factor(year), y = life_exp), 
      data = subset(le_5_select, country_group == "rich")) +
      theme_dark() +
      geom_boxplot(fill = 'lightsteelblue') +
      coord_cartesian(ylim = c(30, 85)) +
      ggtitle("High GDP per Capita") +
      xlab("Year") + ylab("Life Expectancy at Birth") + 
      theme(axis.text.x = element_text(angle = 90, hjust = 1))

grid.arrange(p1, p2, p3, p4, ncol = 2)

These plots tend to confirm our assumption. IQRs are not very relevant with such small samples, but we can clearly see that medians improved significantly more for countries in the “Low GDP”, “Lower-Mid GDP” and “Upper-Mid” groups (by about 25 to 27 years for each of them), than the “High GDP” group (about 14 years). Moreover, it seems that the richer the country, the slower the rate of change between 2010 and 2016.

These trends make sense, as the law of diminishing returns would apply to life expectancy – it takes a lot more effort to get from 60 to 80 than to get from 40 to 60. However, they also indicate that health in developping countries has improved dramatically in the second half of the XX$^{th}$ century and continues to do so.

Recap: Observations on Life Expectancy at Birth

This exploratory analysis has given us some key insights into the data. We observed that the overall LEB has improved dramatically since 1950 for every single country in the world, albeit at a slower rate since the 1990s. Moreover, the improvement benefited poorer countries the most, which led to a much more compact distribution of LEB accross the world.

2. Life Expectancy vs DTP3 Immunization

In this second part of the analysis, we introduce another variable, the proportion of one-year-olds immunized with 3 doses of Diphteria-Tetanus-Pertussis (DTP3). This dataset contains observations since 1980 and is also provided by Gapminder.

Questions:

We would like to answer the following questions:

How did the immunization rate evolve overall and in specific countries in the last 36 years?
Is there a correlation between the immunization rate and life expectancy at birth? What are the key relationships between these two variables?

Data Load

dtp <- read.csv('./one_year_old_dtp3_immunized.tsv', sep = '\t',
                stringsAsFactors = FALSE)
colnames(dtp)[1] <- 'country'
dim(dtp)

## [1] 270  33

The data contains 270 countries over 32 years (from 1980 to 2011). Note that the LEB dataset previously explored had only 260 countries but spanned a much longer period (with many missing values). To compare the two, we will need to abandon some of the data.

Evolution of DTP3 Immunization over Time

We first need to convert the dataset into a long format:

dtp_long <- gather(dtp, key = year, value = pct_immunized, 
                   -country, convert = TRUE)
dtp_long$year <- as.factor(sub('X', '', dtp_long$year))

We then build a frequency polygon plot, selecting a few years:

ggplot(aes(x = pct_immunized), data = subset(dtp_long, year %in% c('1981', '1991', '2001', '2011'))) +
    theme_dark() +
    geom_freqpoly(aes(color = year), binwidth = 5, size = 1) +
    scale_color_brewer(type = 'seq', palette = 'Blues') +
    scale_x_continuous(limits = c(0, 100))

While in 1981, the distribution of countries was very uniform, by 1991 it was already starting to display a significant peak between 80 and 100%. This trend continued in the next 20 years until the distribution looked heavily negatively skewed, with a left-tail comprising of a only a few countries. This is similar to what we witnessed with LEB, but here the phenomenon seems even more accute and much faster.

two_y <- as.character(seq(1981, 2011, 2))
ggplot(aes(x = year, y = pct_immunized), 
       data = subset(dtp_long, year %in% two_y)) +
    theme_dark() +
    geom_boxplot(fill = "lightsteelblue") + 
    ggtitle("Percentage of 1-year-old immunized with DTP3")

Indeed, these boxplots confirm a critical increase in immunization and the equally considerable reduction of the distribution’s dispersion.

Finally, we re-use the country groups created in Part 1 to compare how DTP3 Immunization progressed between countries of different wealth.

dtp_select <- subset(dtp_long, 
                        country %in% c(poor, mid_lower, mid_upper, rich))
dtp_select$country_group <- NA
dtp_select[dtp_select$country %in% poor, 
              "country_group"] <- "poor"
dtp_select[dtp_select$country %in% mid_lower, 
              "country_group"] <- "mid_lower"
dtp_select[dtp_select$country %in% mid_upper, 
              "country_group"] <- "mid_upper"
dtp_select[dtp_select$country %in% rich, 
              "country_group"] <- "rich"

# Build boxplots:
p1 <- ggplot(aes(x = year, y = pct_immunized), 
             data = subset(dtp_select, 
                           year %in% two_y & country_group == "poor")) +
      theme_dark() +
      geom_boxplot(fill = 'lightsteelblue') +
      coord_cartesian(ylim = c(0, 100)) +
      ggtitle("Low GDP per Capita") +
      xlab("Year") + ylab("% immunized with DTP3")

p2 <- ggplot(aes(x = year, y = pct_immunized), 
      data = subset(dtp_select, 
                    year %in% two_y & country_group == "mid_lower")) +
      theme_dark() +
      geom_boxplot(fill = 'lightsteelblue') +
      coord_cartesian(ylim = c(0, 100)) +
      ggtitle("Lower-Mid GDP per Capita") +
      xlab("Year") + ylab("% immunized with DTP3")
      
p3 <- ggplot(aes(x = year, y = pct_immunized), 
      data = subset(dtp_select, 
                    year %in% two_y & country_group == "mid_upper")) +
      theme_dark() +
      geom_boxplot(fill = 'lightsteelblue') +
      coord_cartesian(ylim = c(0, 100)) +
      ggtitle("Upper-Mid GDP per Capita") +
      xlab("Year") + ylab("% immunized with DTP3")

p4 <- ggplot(aes(x = year, y = pct_immunized), 
      data = subset(dtp_select, 
                    year %in% two_y & country_group == "rich")) +
      theme_dark() +
      geom_boxplot(fill = 'lightsteelblue') +
      coord_cartesian(ylim = c(0, 100)) +
      ggtitle("High GDP per Capita") +
      xlab("Year") + ylab("% immunized with DTP3")

grid.arrange(p1, p2, p3, p4, ncol = 2)

As with Life Expectancy at Birth, the biggest gains happened in the poorer countries. The median values are more erratic than with life expectancy. A larger sample would probably fix that. The lowest group however is still far from the levels reached by the other countries.

Life Expectancy at Birth vs. DTP3 Immunization of 1-year-olds

In this section, we will explore possible correlations between Life Expectancy at Birth and DTP3 immunization among 1-year-old infants.

We first have to join the two datasets. As we already noticed that the LEB dataset had less countries but more years that the DTP3 dataset, we will need to subset it for the same time span and then perform a left join.

period <- as.character(seq(1980, 2011))
le_dtp <- le_long %>% 
    filter(year %in% period) %>%
    left_join(dtp_long, by = c("country", "year"))
le_dtp$year <- as.numeric(le_dtp$year)

We can now build a scatterplot including all country x year combinations. We use horizontal jittering to reduce overplotting because percentages on the x-axis are expressed as integers, which tends to create “stripes”:

ggplot(data = le_dtp, aes(x = pct_immunized, y = life_exp)) +
    geom_jitter(colour = 'lightsteelblue', alpha = '0.16', height = 0, width = .4) +
    theme_dark() +
    coord_cartesian(ylim = c(30, 90))

Here, each observation or dot on the plot is a combination of a country and a year. There seems to be a relatively linear correlation between LEB and DTP3 Immunization.

cor.test(le_dtp$pct_immunized, le_dtp$life_exp)

## 
##  Pearson's product-moment correlation
## 
## data:  le_dtp$pct_immunized and le_dtp$life_exp
## t = 61.348, df = 5368, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6259937 0.6574495
## sample estimates:
##       cor 
## 0.6419917

This confirms that there is a moderate to strong positive correlation.

To improve the plot and give it a sense of progression over time, we can group observations by 5-year periods. We can also add the linear regression line (brown), the mean (blue), the median (white, solid) and the .1 and .9 quantiles (white, dashed) for LEB as a function of Immunization:

le_dtp$period <- cut(le_dtp$year, 
                     breaks = c(seq(1980, 2010, 5), 2011),
                     labels = c("1980-1984", "1985-1989",
                                "1990-1994", "1995-1999",
                                "2000-2004", "2005-2009",
                                "2010-2011"),
                     include.lowest = TRUE, right = TRUE)
ggplot(data = le_dtp, aes(x = pct_immunized, y = life_exp)) +
    scale_colour_brewer(type = 'div', palette = 'Spectral') +
    geom_jitter(aes(colour = period), alpha = '0.15', height = 0, width = .4) +
    geom_smooth(formula = y ~ x, method = 'lm', colour = 'brown') +
    geom_line(stat = 'summary', fun.y = "mean", 
              fun.args = list(na.rm = TRUE),
              color = 'blue') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .1),
              linetype = 2, color = 'white') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .5),
              color = 'white') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .9),
              linetype = 2, color = 'white') +
    theme_dark() +
    coord_cartesian(ylim = c(30, 90))

We notice that as observed in previous plots, observations corresponding to more recent periods are bunched up near the top-right corner of the plot (higher immunization and higher life expectancy) whereas older periods are further to the left. Moreover, the 0.1-0.9 quantile range tends to narrow down as we progress towards the higher values of immunization, although this could be an effect of the higher observation count in that region of the plot. We can also remark that the mean and median lines remain intertwined for the most part, except over 80% immunization where the LEB median becomes consistently higher than the mean by a few years.

From all the observations above, we can conclude that a high immunization ratio is associated with:

Higher life expectancy,
Smaller variance in life expectancy between observations,
A slightly negative skewness that indicates that although a minority of countries/year pairs is lowering the mean by poor LEB values, the majority is actually above the mean. We can clearly see a number of outliers in the lower-right part of the plot (between 80% and 100% immunization and 40 to 55 years old, approximately).

Recap: Observations on Immunization vs. Life Expectancy

In this part of the analysis, we made the following observations:

The DTP3 Immunization percentage among 1-year-old followed a similar pattern to Life Expectancy at Birth (but in half as many years): A rapid progress, a reduction in the variance accross countries, poorer countries catching up with rich ones;
There is a robust positive correlation (over .6) between Immunization and Life Expectancy;
Where and when immunization is high, we tend to have the majority of observations above the LEB average. Although a few outliers tend to pull the mean down, this is a more desirable situation than the opposite. These outliers could be due to one-off events (such a natural disasters, wars, epidemics) which, although tragic, are unlikely to occur on a regular basis.

3. Life Expectancy at Birth and DTP3 Immunization vs. GDP per capita

In this section, we are going to cross-analyse the previous two variables against GDP per capita much more accurately.

Questions:

The questions we would like to answer are:

What associations exist between our three variables?
How do LEB and DTP3 Immunization evolve over time for the different wealth quartiles? (A more robust version of the analyses performed in Parts 1 and 2)
Is the relationship between LEB and DTP3 Immunization only due to economic growth, acting as a hidden variable, or is there a direct association between the two?

We again pull our GDP dataset from Gapminder – this time the data is GDP/capita (US$, inflation-adjusted).

gdp <- read.csv('./GDPpercapitaconstant2000US.tsv', sep = '\t')
colnames(gdp)[1] <- "country"
colnames(gdp)[2 : 53] <- sub("X", "", colnames(gdp)[2 : 53])
gdp[1:5, 1:5]

##                 country     1960     1961    1962     1963
## 1              Abkhazia       NA       NA      NA       NA
## 2           Afghanistan       NA       NA      NA       NA
## 3 Akrotiri and Dhekelia       NA       NA      NA       NA
## 4               Albania       NA       NA      NA       NA
## 5               Algeria 1280.385 1085.415 855.948 1128.416

The dataset contains inflation-adjusted GDP per capita values for 275 countries since 1960. However there are many missing values, particularly before 1980. As this is also the starting point for the combined LEB/DTP3 dataset, we drop any prior columns, convert it to a long format then left-join this dataset with the dataset:

gdp_long <- gather(gdp, key = year, value = gdp_cap, -country)
gdp_long$year <- as.numeric(gdp_long$year)
le_dtp_gdp <- gdp_long %>% 
    filter(year %in% period) %>%  # Susbset for the period 1980-2011
    left_join(le_dtp, by = c("country", "year"))

Correlations

We can now plot these variables against each other:

ggpairs(le_dtp_gdp[, 3:5])

To improve the plot we add the variable gdp_cap_log which is the base 10-logarithm of gdp_cap and plot again with refined aesthetics and linear smoothers:

le_dtp_gdp$gdp_cap_log <- log10(le_dtp_gdp$gdp_cap)

custom_points <- function(data, mapping, ...) {
    ggplot(data = data, mapping = mapping) +
        geom_point(..., alpha = .2, color = 'lightsteelblue') +
        geom_smooth(method = 'lm')
}
    
ggpairs(select(le_dtp_gdp, gdp_cap_log, life_exp, pct_immunized),
        lower = list(continuous = custom_points))

Using log values for GDP per capita improves the linear correlation significantly – we now have a strong correlation (.80) with life expectancy and a medium one with immunization (.51).

We also see that when using log values, the GDP/capita distribution accross country/year pairs is much more balanced and exhibits almost no skewness, whereas it was heavily skewed (positively) using raw values: Only a few rich country/year pairs are contained in the long right tail. In the context of building a predictive model, a log transformation on GDP/capita would be strongly recommended.

Analysis by GDP/capita quantile

With the GDP/capita data, we can now have a much more rigourous approach to the analyses by country group that we ran in the previous sections.

We first plot life expectancy against immunization, colouring by GDP/capita:

ggplot(data = le_dtp_gdp, aes(x = pct_immunized, y = life_exp)) +
    scale_color_continuous(low = "#132B43", high = "#56B1F7") +
    geom_jitter(aes(colour = gdp_cap_log), 
                    alpha = '0.15', height = 0, width = .4) +
    geom_smooth(formula = y ~ x, method = 'lm', colour = 'brown') +
    geom_line(stat = 'summary', fun.y = "mean", 
              fun.args = list(na.rm = TRUE),
              color = 'blue') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .1),
              linetype = 2, color = 'white') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .5),
              color = 'white') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .9),
              linetype = 2, color = 'white') +
    theme_dark() +
    coord_cartesian(ylim = c(30, 90))

To improve the plot’s readability, we can create groups by GDP/capita. Let’s use quartiles:

le_dtp_gdp$gdp_quartile <- with(le_dtp_gdp, 
                                cut(gdp_cap, 
                                breaks = quantile(gdp_cap, 
                                                  probs = 
                                                      seq(
                                                      0., 1., by = 0.25),
                                                  na.rm = TRUE), 
                                labels = c(
                                    "Low", "Lower-Mid", "Higher-Mid", "High"),
                                include.lowest=TRUE))

ggplot(data = subset(le_dtp_gdp, !is.na(gdp_quartile)), 
       aes(x = pct_immunized, y = life_exp)) +
    scale_color_brewer(type = 'qual', palette = 2) +
    geom_jitter(aes(colour = gdp_quartile), 
                    alpha = '0.5', height = 0, width = .4) +
    geom_smooth(formula = y ~ x, method = 'lm', colour = 'brown') +
    geom_line(stat = 'summary', fun.y = "mean", 
              fun.args = list(na.rm = TRUE),
              color = 'blue') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .1),
              linetype = 2, color = 'white') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .5),
              color = 'white') +
    geom_line(stat = 'summary', fun.y = "quantile", 
              fun.args = list(na.rm = TRUE, prob = .9),
              linetype = 2, color = 'white') +
    theme_dark() +
    coord_cartesian(ylim = c(30, 90))

On the plot, we can clearly see “layers” by colour: at the bottom are the Low GDP countries and at the top the richest countries.

But maybe the correlation between Life Expectancy and Immmunization only exists through GRP/capita? In other words, richer countries tend to have higher life expectancy, and they also happen to have higher immunization ratios.

To try and reduce the influence of GDP/capita on the other two variables, we can plot them by wealth group:

ggplot(data = subset(le_dtp_gdp, !is.na(gdp_quartile)), 
       aes(x = pct_immunized, y = life_exp)) +
    geom_jitter(colour = 'lightsteelblue',  
                    alpha = '0.3', height = 0, width = .4) +
    theme_dark() +
    coord_cartesian(ylim = c(40, 90)) +
    facet_wrap( ~ gdp_quartile, ncol = 2)

The correlations are still there, but in the two middle categories we can see a large number of outliers that have high immunization and low life expectancy. Let’s look at them in more detail:

high_immu_low_le <- subset(le_dtp_gdp, 
       gdp_quartile %in% c("Lower-Mid", "Higher-Mid") & 
           pct_immunized > 80 & life_exp < 57)

group_by(high_immu_low_le, country, period) %>%
    summarise(mean_le = mean(life_exp), mean_immu = mean(pct_immunized))

## Source: local data frame [13 x 4]
## Groups: country [?]
## 
##              country    period mean_le mean_immu
##                (chr)    (fctr)   (dbl)     (dbl)
## 1             Angola 2005-2009  56.700      81.0
## 2           Botswana 1990-1994  56.800      95.0
## 3           Botswana 1995-1999  51.080      96.8
## 4           Botswana 2000-2004  46.800      96.4
## 5           Botswana 2005-2009  53.340      96.0
## 6           Botswana 2010-2011  56.500      96.0
## 7  Equatorial Guinea 1995-1999  50.750      81.0
## 8            Namibia 2000-2004  52.800      83.5
## 9            Namibia 2005-2009  54.900      86.0
## 10         Swaziland 1995-1999  52.375      84.5
## 11         Swaziland 2000-2004  44.240      91.8
## 12         Swaziland 2005-2009  44.960      93.4
## 13         Swaziland 2010-2011  48.000      91.0

These outliers are African countries, some of which are recovering from a troubled history (Angola, Equatorial Guinea).

There are also outliers in the High GDP group, whose average life expectancy is fairly high but immunization is low.

rich_low_immu <- subset(le_dtp_gdp, 
       gdp_quartile == "High" & pct_immunized < 50)

group_by(rich_low_immu, country, period) %>%
    summarise(mean_le = mean(life_exp), mean_immu = mean(pct_immunized))

## Source: local data frame [8 x 4]
## Groups: country [?]
## 
##                country    period  mean_le mean_immu
##                  (chr)    (fctr)    (dbl)     (dbl)
## 1            Australia 1980-1984 74.72667  40.33333
## 2              Bahamas 1980-1984 66.75000  36.00000
## 3              Ireland 1980-1984 74.35833  40.83333
## 4              Ireland 1985-1989 75.50333  43.66667
## 5                Italy 1980-1984 75.56500  11.00000
## 6         Saudi Arabia 1980-1984 66.99000  41.00000
## 7 United Arab Emirates 1980-1984 67.35333  33.00000
## 8       United Kingdom 1980-1984 73.73500  43.50000

These outliers correspond to the early years of the dataset. Some are countries that increased their wealth rapidly over the last 30 years (thus had more of a “developping country” profile during the 1980s). Public immunization policies are also likely to have played a role in explaning the low immunization rate in the 1980s. Note that in the case of Italy, the extremely low values for Italy are all the more surprising that in 1986, the immunization rate suddenly jumped to 98% according to the data. Despite some Internet research, we could not find an evidence of a major shift in policy at that time, so we cannot exclude that there are errors in the data.

Now that we have rigorously defined categories for countries, we can re-run the plots that we made in parts 1 and 2 and see how life expectancy and immunization evolved for each quartile. We are, however, limited to post-1980 data because of the GDP/capita dataset.

Life Expectancy at Birth:

years_select <- seq(1981, 2011, 2)
ggplot(data = subset(le_dtp_gdp, 
                     year %in% years_select & 
                     !is.na(life_exp) & 
                     !is.na(gdp_quartile)), 
       aes(x = as.factor(year), y = life_exp)) +
    theme_dark() +
    geom_boxplot(fill = 'lightsteelblue') +
    coord_cartesian(ylim = c(45, 85)) +
    xlab('Year') + ylab('Life Expectancy at Birth') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    facet_wrap(~ gdp_quartile, ncol = 2)

In this shorter period of time, the differences in the rate of improvement of life expectancy are less obvious than on the 60-year plot of Part 1. However it is quite interesting to observe that each of the first three groups’ median life expectancy in 2011 is very close to the next group’s median life expectancy in 1981, as if there was a 30-year gap in development between each quartile (as far as life expectancy is concerned).

DTP3 Immunization of 1-year-olds:

ggplot(data = subset(le_dtp_gdp, 
                     year %in% years_select & 
                     !is.na(pct_immunized) & 
                     !is.na(gdp_quartile)), 
       aes(x = as.factor(year), y = pct_immunized)) +
    theme_dark() +
    geom_boxplot(fill = 'lightsteelblue') +
    coord_cartesian(ylim = c(0, 100)) +
    xlab('Year') + ylab('DTP3 Immunization of 1-yo infants') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    facet_wrap(~ gdp_quartile, ncol = 2)

This confirms the spectacular catch-up performed by countries at the lower-end of the wealth spectrum. Naturally, rich countries are already at almost 100% immunization, therefore they are no longer growing significantly.

Removing the association with GDP/capita

To further dissociate life expectancy and DTP3 immunization from their association with GDP per capita, we can create a new variable called le_by_log_gdp that is calculated as follow:

$le\_by\_log\_gdp = \frac{life\_exp}{log_{10}(gdp\_cap)}$

le_dtp_gdp$le_by_log_gdp <- with(le_dtp_gdp,
                                 life_exp / gdp_cap_log)
ggplot(data = subset(le_dtp_gdp, year %in% c(1981, 1991, 2001, 2011)),
      aes(x = gdp_cap_log, y = le_by_log_gdp)) +
    geom_point(aes(colour = as.factor(year))) +
    scale_colour_discrete() +
    theme_dark() +
    geom_smooth(method = 'lm')

There we discover an unexpected relationship: There seems to be a negative linear correlation between le_by_log_gdp and gdp_cap_log. In other words, the higher the wealth of a country, the smaller the gain in life expectancy for every further increase in wealth. This is again the law of diminishing returns at play. We can however notice that these further increments are themselves increasing in size over time: As technology and healthcare improve, each increment in wealth generates a larger increment in life expectancy. We can see this because the earlier years tend to be below the more recent ones on the plot.

What are the parameters of this correlation?

with(le_dtp_gdp, cor.test(gdp_cap_log, le_by_log_gdp))

## 
##  Pearson's product-moment correlation
## 
## data:  gdp_cap_log and le_by_log_gdp
## t = -83.184, df = 5498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7579616 -0.7345490
## sample estimates:
##        cor 
## -0.7464863

lm(data = le_dtp_gdp, le_by_log_gdp ~ gdp_cap_log)

## 
## Call:
## lm(formula = le_by_log_gdp ~ gdp_cap_log, data = le_dtp_gdp)
## 
## Coefficients:
## (Intercept)  gdp_cap_log  
##      30.406       -2.919

We can use these results to build a new expression for life expectancy as a function of GDP per capita:

$$ \[\begin{aligned} \frac{life\_exp}{log_{10}(gdp\_cap)} &\approx -2.919 \times log_{10}(gdp\_cap) + 30.406 \\ Define: \\ le\_gdp\_const &= \frac{life\_exp}{log_{10}(gdp\_cap)} + 2.919 \times log_{10}(gdp\_cap) \approx 30.406 \\ \end{aligned}\]

This new variable should be almost constant around 30.406. We add it to our dataset and look at its distribution:

le_dtp_gdp$le_gdp_const <- with(le_dtp_gdp, 
                                le_by_log_gdp + 2.919 * gdp_cap_log)
qplot(data = le_dtp_gdp, le_gdp_const, fill = I('lightsteelblue'),
      colour = I('steelblue'), binwidth = .2) +
    theme_dark()

Indeed, the new variable has a narrow normal distribution with a maximum between 30 and 31. Plotting this variable against pct_immunized should allow us to almost completely remove the association to GDP/capita:

ggplot(data = subset(le_dtp_gdp, !is.na(gdp_quartile)), 
       aes(x = pct_immunized, y = le_gdp_const)) +
    geom_jitter(colour = 'lightsteelblue', width = .2, height = 0.,
                    alpha = '0.3') +
    theme_dark() +
    coord_cartesian(ylim = c(25, 35))

with(le_dtp_gdp, cor.test(pct_immunized, le_gdp_const))

## 
##  Pearson's product-moment correlation
## 
## data:  pct_immunized and le_gdp_const
## t = 25.564, df = 4995, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3153921 0.3644356
## sample estimates:
##       cor 
## 0.3401451

The correlation between LEB and immunization is now weaker but still highly significant ($ p < 2 10^{-16}$). It is about half of the correlation we found when using the raw life expectancy values.

Recap: Observations on life expectancy and immunization vs GDP/capita

In this section, we achieved the following results:

We demonstrated a strong correlation between GDP/capita and LEB and a moderate correlation between GDP/capita and DTP3 Immunization at 1-year-old,
We found that the correlation between LEB and DTP3 Immmunization exists within every quartile of GDP/capita and identified some of the outliers,
We confirmed the results of parts 1 and 2 related to the rate of improvement of both LEB and DTP3 Immunization for each of the GDP quartiles,
Discovered a formula to approximate the relationship between LEB and GDP/capita,
Used this formula to remove most of the association of life expectancy to GDP/capita and thus demonstrate a weak to moderate, but highly significant, correlation between life expectancy and immunization.

Conclusion

Overall this analysis shows just how much public health has improved since the middle of the XX$^{th}$ century, especially in poorer countries. Life Expectancy at Birth and DTP3 Immunization Ratio among 1-year-olds have both improved dramatically in every country, regardless of their wealth. They remain clearly associated with the GDP per capita, but the poorer countries have been catching up relentlessly.

In terms of public health policy, we also showed that DTP Imunization of 1-year-olds and Life Expectancy at Birth are directly correlated, and not just through economic growth. Although this does not prove a cause-consequence relationship, this observation advocates for comprehensive immunization campaigns of young infants.

Gapminder Data - Exploratory Analysis

Luc Frachon

12 janvier 2017

Introduction

1. Life Expectancy at Birth over Time

Questions

Data Load

LEB distribution in 1950 vs. 2016

Evolution of LEB over time

Evolution by Country Wealth

Recap: Observations on Life Expectancy at Birth

2. Life Expectancy vs DTP3 Immunization

Questions:

Data Load

Evolution of DTP3 Immunization over Time

Life Expectancy at Birth vs. DTP3 Immunization of 1-year-olds

Recap: Observations on Immunization vs. Life Expectancy

3. Life Expectancy at Birth and DTP3 Immunization vs. GDP per capita

Questions:

Correlations

Analysis by GDP/capita quantile

Removing the association with GDP/capita

Recap: Observations on life expectancy and immunization vs GDP/capita

Conclusion