A reworking of a data investigation produced by Alfred Essa (@malpaso): “The Rwandan Tragedy: Data Analysis with 7 Lines of Simple Python Code”.
The first step is to get some life expectancy data from the World Bank. It seems there's an R package, WDI, that wraps the World Bank data API:
# install.packages('WDI')
library(WDI)
## Loading required package: RJSONIO
## Warning: package 'RJSONIO' was built under R version 2.15.3
Let's see if we can find an indicator code for life expectancy:
WDIsearch(string = "life.*expectancy", field = "name", cache = NULL)
## indicator
## [1,] "SP.DYN.LE00.FE.IN"
## [2,] "SP.DYN.LE00.IN"
## [3,] "SP.DYN.LE00.MA.IN"
## [4,] "UIS.SLE.0"
## [5,] "UIS.SLE.0.F"
## [6,] "UIS.SLE.0.M"
## [7,] "UIS.SLE.123"
## [8,] "UIS.SLE.123.F"
## [9,] "UIS.SLE.123.GPI"
## [10,] "UIS.SLE.123.M"
## [11,] "UIS.SLE.1t6.GPI"
## [12,] "UIS.SLE.56"
## [13,] "UIS.SLE.56.F"
## [14,] "UIS.SLE.56.GPI"
## [15,] "UIS.SLE.56.M"
## name
## [1,] "Life expectancy at birth, female (years)"
## [2,] "Life expectancy at birth, total (years)"
## [3,] "Life expectancy at birth, male (years)"
## [4,] "School life expectancy (years). Pre-primary. Total"
## [5,] "School life expectancy (years). Pre-primary. Female"
## [6,] "School life expectancy (years). Pre-primary. Male"
## [7,] "School life expectancy (years). Primary to secondary. Total"
## [8,] "School life expectancy (years). Primary to secondary. Female"
## [9,] "Gender parity index for school life expectancy. Primary to secondary."
## [10,] "School life expectancy (years). Primary to secondary. Male"
## [11,] "Gender parity index for school life expectancy. Primary to tertiary."
## [12,] "School life expectancy (years). Tertiary. Total"
## [13,] "School life expectancy (years). Tertiary. Female"
## [14,] "Gender parity index for school life expectancy. Tertiary."
## [15,] "School life expectancy (years). Tertiary. Male"
Ah ha, seems like “SP.DYN.LE00.IN” (Life expectancy at birth, total (years)) will do it…
(There are also codes for life expectancy for males and females separately)
# World Bank series begin in 1960, so an earlier start year adds nothing
df.le = WDI(country = "all", indicator = c("SP.DYN.LE00.IN"), start = 1960,
            end = 2012)
We'll be doing some charting, so let's use ggplot… Load the required library…
require(ggplot2)
Alfred used a boxplot to provide an overview of the range of life expectancies across countries over a period of years. The outliers during the 1990s really jumped out:
g = ggplot() + geom_boxplot(data = df.le, aes(x = year, y = SP.DYN.LE00.IN,
group = year))
g = g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g
Let's filter the data to drill down and see which country or countries the outliers correspond to:
subset(df.le, year > 1988 & SP.DYN.LE00.IN < 40)
## iso2c country SP.DYN.LE00.IN year
## 10086 RW Rwanda 37.76 1997
## 10087 RW Rwanda 33.98 1996
## 10088 RW Rwanda 30.47 1995
## 10089 RW Rwanda 27.94 1994
## 10090 RW Rwanda 26.82 1993
## 10091 RW Rwanda 27.33 1992
## 10092 RW Rwanda 29.44 1991
## 10093 RW Rwanda 32.83 1990
## 10094 RW Rwanda 36.97 1989
## 10507 SL Sierra Leone 39.73 2000
## 10508 SL Sierra Leone 38.97 1999
## 10509 SL Sierra Leone 38.33 1998
## 10510 SL Sierra Leone 37.80 1997
## 10511 SL Sierra Leone 37.42 1996
## 10512 SL Sierra Leone 37.21 1995
## 10513 SL Sierra Leone 37.19 1994
## 10514 SL Sierra Leone 37.35 1993
## 10515 SL Sierra Leone 37.66 1992
## 10516 SL Sierra Leone 38.12 1991
## 10517 SL Sierra Leone 38.72 1990
## 10518 SL Sierra Leone 39.46 1989
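The threshold of 40 years in that filter was picked by eye from the chart. As a rough sketch of an alternative, the points could be flagged with the boxplot's own 1.5 × IQR rule, year by year. `flag_outliers()` is a hypothetical helper name, and the toy values below are invented purely so the block runs offline; they are not World Bank figures. Note too that `boxplot.stats()` computes its hinges slightly differently from ggplot2, so the flagged set may differ at the margins from the dots on the chart.

```r
# Toy data: invented values for illustration only, NOT World Bank figures.
toy <- data.frame(
  country = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  year    = rep(c(1993, 1994), each = 6),
  value   = c(60, 62, 64, 66, 68, 27,
              61, 63, 65, 67, 69, 28)
)

# Hypothetical helper: keep the rows that fall outside the boxplot whiskers
# (1.5 * IQR beyond the hinges), evaluated separately for each year.
flag_outliers <- function(df, value_col = "value") {
  is_out <- ave(df[[value_col]], df$year, FUN = function(v) {
    v %in% boxplot.stats(v)$out
  })
  df[as.logical(is_out), ]
}

outliers <- flag_outliers(toy)
print(outliers)
```

Against the real data the call would be something like `flag_outliers(df.le, "SP.DYN.LE00.IN")`.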
Rwanda is notable, so let's overlay the numbers for life expectancy in Rwanda on the chart:
g = g + geom_line(data = subset(df.le, country == "Rwanda"), aes(x = year, y = SP.DYN.LE00.IN),
col = "red")
g
So what's causing the drop in life expectancy? One way of exploring this is to look at other countries that had known problems over a particular period and see whether their life expectancy figures show a similar signature.
So for example, let's bring in data for Kenyan life expectancy - does the AIDS epidemic that hit that country have a similar signature effect?
g = g + geom_line(data = subset(df.le, country == "Kenya"), aes(x = year, y = SP.DYN.LE00.IN),
col = "green")
g
How about Uganda, which suffered similarly?
g = g + geom_line(data = subset(df.le, country == "Uganda"), aes(x = year, y = SP.DYN.LE00.IN),
col = "blue")
g
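The overlays so far repeat the same `geom_line()` call with only the country and colour changing. A minimal sketch of wrapping that in a helper follows. `add_country()` is a hypothetical name, not part of ggplot2 or WDI, and the small data frame simply re-types the 1989 to 1997 rows for Rwanda and Sierra Leone from the table above so that the block runs without a network call. The `.data` pronoun assumes a reasonably recent ggplot2.

```r
library(ggplot2)

# Rows copied from the subset() output above (1989-1997 only).
le <- data.frame(
  country = rep(c("Rwanda", "Sierra Leone"), each = 9),
  year    = rep(1989:1997, times = 2),
  SP.DYN.LE00.IN = c(36.97, 32.83, 29.44, 27.33, 26.82, 27.94, 30.47, 33.98, 37.76,
                     39.46, 38.72, 38.12, 37.66, 37.35, 37.19, 37.21, 37.42, 37.80)
)

# Hypothetical helper: add one country's line to an existing plot.
add_country <- function(p, df, name, colour, y = "SP.DYN.LE00.IN") {
  p + geom_line(data = df[df$country == name, ],
                aes(x = year, y = .data[[y]]), colour = colour)
}

p <- ggplot()
p <- add_country(p, le, "Rwanda", "red")
p <- add_country(p, le, "Sierra Leone", "orange")
p
```

With the full data frame, the Kenya overlay would then read `g = add_country(g, df.le, "Kenya", "green")`.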
Neither of those traces appears to have the same signature as the Rwandan curve. So might there be another cause? How about civil war? For example, Bangladesh suffered a civil war in the early 1970s - what was the effect on life expectancy over that period?
g = g + geom_line(data = subset(df.le, country == "Bangladesh"), aes(x = year,
y = SP.DYN.LE00.IN), col = "purple")
g
Ah ha - that has a marked similarity, to the eye at least…
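Comparing curves by eye only goes so far. One rough way to put numbers on a curve's “signature” is to locate its trough and measure how far it fell. The sketch below does that for the Rwanda figures listed earlier; `describe_dip()` is a hypothetical helper, and measuring the drop from the first year in the window is an arbitrary choice made for illustration.

```r
# Rwanda life expectancy, copied from the subset() output earlier in this post.
rw <- data.frame(
  year = 1989:1997,
  le   = c(36.97, 32.83, 29.44, 27.33, 26.82, 27.94, 30.47, 33.98, 37.76)
)

# Hypothetical helper: find the lowest point of a series and how far it
# sits below the first observation in the window.
describe_dip <- function(year, value) {
  i <- which.min(value)
  list(trough_year     = year[i],
       trough_value    = value[i],
       drop_from_start = value[1] - value[i])
}

dip <- describe_dip(rw$year, rw$le)
str(dip)
```

The same function could be applied to `subset(df.le, country == "Bangladesh")` over the early 1970s to compare the depth and timing of the two dips.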
Search for mortality indicators:
WDIsearch(string = "mortality", field = "name", cache = NULL)
## indicator
## [1,] "SH.DYN.CHLD.FE"
## [2,] "SH.DYN.CHLD.MA"
## [3,] "SH.DYN.MORT"
## [4,] "SH.DYN.MORT.FE"
## [5,] "SH.DYN.MORT.MA"
## [6,] "SH.DYN.NMRT"
## [7,] "SH.STA.MMRT"
## [8,] "SH.STA.MMRT.NE"
## [9,] "SP.DYN.AMRT.FE"
## [10,] "SP.DYN.AMRT.MA"
## [11,] "SP.DYN.IMRT.FE.IN"
## [12,] "SP.DYN.IMRT.IN"
## [13,] "SP.DYN.IMRT.MA.IN"
## name
## [1,] "Mortality rate, female child (per 1,000 female children age one)"
## [2,] "Mortality rate, male child (per 1,000 male children age one)"
## [3,] "Mortality rate, under-5 (per 1,000 live births)"
## [4,] "Mortality rate, under-5, female (per 1,000)"
## [5,] "Mortality rate, under-5, male (per 1,000)"
## [6,] "Mortality rate, neonatal (per 1,000 live births)"
## [7,] "Maternal mortality ratio (modeled estimate, per 100,000 live births)"
## [8,] "Maternal mortality ratio (national estimate, per 100,000 live births)"
## [9,] "Mortality rate, adult, female (per 1,000 female adults)"
## [10,] "Mortality rate, adult, male (per 1,000 male adults)"
## [11,] "Mortality rate, infant, female (per 1,000 live births)"
## [12,] "Mortality rate, infant (per 1,000 live births)"
## [13,] "Mortality rate, infant, male (per 1,000 live births)"
Grab “Mortality rate, under-5 (per 1,000 live births)” overall, and broken down for male and female:
# World Bank series begin in 1960, so an earlier start year adds nothing
df.cm = WDI(country = "all", indicator = c("SH.DYN.MORT", "SH.DYN.MORT.FE",
            "SH.DYN.MORT.MA"), start = 1960, end = 2009)
Overall boxplot:
gx = ggplot(df.cm) + geom_boxplot(aes(x = year, y = SH.DYN.MORT, group = year))
gx
Does Rwanda account for any of the outliers? Let's overlay the Rwanda stats:
gx = gx + geom_line(data = subset(df.cm, country == "Rwanda"), aes(x = year,
y = SH.DYN.MORT), col = "red")
gx
Hmmm.. some but not all… So what are the outliers in this case? Do you think you could find out?