The Rwandan Tragedy

A reworking in R of a data investigation by Alfred Essa (@malpaso), “The Rwandan Tragedy: Data Analysis with 7 Lines of Simple Python Code”.

The first step is to get some life expectancy data from the World Bank - the WDI package for R provides a wrapper around the World Bank data API:

# install.packages('WDI')
library(WDI)
## Loading required package: RJSONIO
## Warning: package 'RJSONIO' was built under R version 2.15.3

Let's see if we can find an indicator code for life expectancy:

WDIsearch(string = "life.*expectancy", field = "name", cache = NULL)
##       indicator          
##  [1,] "SP.DYN.LE00.FE.IN"
##  [2,] "SP.DYN.LE00.IN"   
##  [3,] "SP.DYN.LE00.MA.IN"
##  [4,] "UIS.SLE.0"        
##  [5,] "UIS.SLE.0.F"      
##  [6,] "UIS.SLE.0.M"      
##  [7,] "UIS.SLE.123"      
##  [8,] "UIS.SLE.123.F"    
##  [9,] "UIS.SLE.123.GPI"  
## [10,] "UIS.SLE.123.M"    
## [11,] "UIS.SLE.1t6.GPI"  
## [12,] "UIS.SLE.56"       
## [13,] "UIS.SLE.56.F"     
## [14,] "UIS.SLE.56.GPI"   
## [15,] "UIS.SLE.56.M"     
##       name                                                                   
##  [1,] "Life expectancy at birth, female (years)"                             
##  [2,] "Life expectancy at birth, total (years)"                              
##  [3,] "Life expectancy at birth, male (years)"                               
##  [4,] "School life expectancy (years).  Pre-primary.  Total"                 
##  [5,] "School life expectancy (years).  Pre-primary.  Female"                
##  [6,] "School life expectancy (years).  Pre-primary.  Male"                  
##  [7,] "School life expectancy (years).  Primary to secondary.  Total"        
##  [8,] "School life expectancy (years).  Primary to secondary.  Female"       
##  [9,] "Gender parity index for school life expectancy. Primary to secondary."
## [10,] "School life expectancy (years).  Primary to secondary.  Male"         
## [11,] "Gender parity index for school life expectancy.  Primary to tertiary."
## [12,] "School life expectancy (years).  Tertiary.  Total"                    
## [13,] "School life expectancy (years).  Tertiary.  Female"                   
## [14,] "Gender parity index for school life expectancy.  Tertiary."           
## [15,] "School life expectancy (years).  Tertiary.  Male"

Ah ha, seems like “SP.DYN.LE00.IN” (Life expectancy at birth, total (years)) will do it…

(There are also separate codes for male and female life expectancy.)

df.le = WDI(country = "all", indicator = c("SP.DYN.LE00.IN"), start = 1900, 
    end = 2012)
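
The request asks for data from 1900 onwards, but as far as I know the World Bank series generally start around 1960, so it can be worth checking which years actually came back. A minimal sketch, assuming the downloaded data frame has a `year` column alongside the indicator column; `year_coverage` is a hypothetical helper and not part of the WDI package:

```r
# Hypothetical helper (not part of the WDI package): report the first and
# last year for which an indicator column holds a non-missing value.
year_coverage <- function(df, value_col) {
  has_value <- !is.na(df[[value_col]])
  range(df$year[has_value])
}

# Intended use once df.le has been downloaded:
# year_coverage(df.le, "SP.DYN.LE00.IN")
```

The helper only looks at rows where the indicator is present, so years that were requested but came back empty do not widen the reported range.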

We'll be doing some charting, so let's use ggplot… Load the required library…

library(ggplot2)

Alfred used a boxplot to provide an overview of the range of life expectancies across countries over a period of years. The outliers during the 1990s really jumped out:

g = ggplot() + geom_boxplot(data = df.le, aes(x = year, y = SP.DYN.LE00.IN, 
    group = year))
g = g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

[Plot: boxplots of life expectancy at birth by year, all countries]

Let's filter the data to drill down and see which country or countries the outliers correspond to:

subset(df.le, year > 1988 & SP.DYN.LE00.IN < 40)
##       iso2c      country SP.DYN.LE00.IN year
## 10086    RW       Rwanda          37.76 1997
## 10087    RW       Rwanda          33.98 1996
## 10088    RW       Rwanda          30.47 1995
## 10089    RW       Rwanda          27.94 1994
## 10090    RW       Rwanda          26.82 1993
## 10091    RW       Rwanda          27.33 1992
## 10092    RW       Rwanda          29.44 1991
## 10093    RW       Rwanda          32.83 1990
## 10094    RW       Rwanda          36.97 1989
## 10507    SL Sierra Leone          39.73 2000
## 10508    SL Sierra Leone          38.97 1999
## 10509    SL Sierra Leone          38.33 1998
## 10510    SL Sierra Leone          37.80 1997
## 10511    SL Sierra Leone          37.42 1996
## 10512    SL Sierra Leone          37.21 1995
## 10513    SL Sierra Leone          37.19 1994
## 10514    SL Sierra Leone          37.35 1993
## 10515    SL Sierra Leone          37.66 1992
## 10516    SL Sierra Leone          38.12 1991
## 10517    SL Sierra Leone          38.72 1990
## 10518    SL Sierra Leone          39.46 1989
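
With more than twenty rows to scan, it can help to reduce the filtered table to one row per country showing the lowest value and the year it occurred. The sketch below rebuilds the rows printed above by hand so that it runs without a download; `lowest_by_country` is a hypothetical helper, not something from the original analysis:

```r
# The rows printed above, re-entered by hand (values copied from the output).
low <- data.frame(
  country = c(rep("Rwanda", 9), rep("Sierra Leone", 12)),
  year = c(1997:1989, 2000:1989),
  SP.DYN.LE00.IN = c(
    37.76, 33.98, 30.47, 27.94, 26.82, 27.33, 29.44, 32.83, 36.97,
    39.73, 38.97, 38.33, 37.80, 37.42, 37.21, 37.19, 37.35, 37.66,
    38.12, 38.72, 39.46
  )
)

# Hypothetical helper: for each country keep the row with the smallest value.
lowest_by_country <- function(df, value_col) {
  pieces <- split(df, df$country)
  rows <- lapply(pieces, function(d) {
    d[which.min(d[[value_col]]), c("country", "year", value_col)]
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

lowest_by_country(low, "SP.DYN.LE00.IN")
```

On these rows the minimum for Rwanda is 26.82 in 1993 and for Sierra Leone 37.19 in 1994. With the full download, the same call could be made on `subset(df.le, year > 1988 & SP.DYN.LE00.IN < 40)` directly.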

Rwanda is notable, so let's overlay the numbers for life expectancy in Rwanda on the chart:

g = g + geom_line(data = subset(df.le, country == "Rwanda"), aes(x = year, y = SP.DYN.LE00.IN), 
    col = "red")
g

[Plot: life expectancy boxplots by year with Rwanda overlaid in red]

So what's causing the drop in life expectancy? One way of exploring this problem is to look at the life expectancy figures for other countries with known problems over a particular period, to see whether their figures have a similar signature over that period.

So, for example, let's bring in data for Kenyan life expectancy - does the AIDS epidemic that hit that country have a similar signature effect?

g = g + geom_line(data = subset(df.le, country == "Kenya"), aes(x = year, y = SP.DYN.LE00.IN), 
    col = "green")
g

[Plot: life expectancy boxplots by year with Rwanda (red) and Kenya (green) overlaid]

How about Uganda, which suffered similarly?

g = g + geom_line(data = subset(df.le, country == "Uganda"), aes(x = year, y = SP.DYN.LE00.IN), 
    col = "blue")
g

[Plot: life expectancy boxplots by year with Rwanda (red), Kenya (green) and Uganda (blue) overlaid]
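
The same overlay has now been written out three times with only the country and colour changing, so it could be folded into a small function. A minimal sketch, assuming the column names returned by WDI above; `overlay_country` is a hypothetical helper, not part of ggplot2:

```r
library(ggplot2)

# Hypothetical helper: add one country's life expectancy line to a plot.
# Assumes df has the columns country, year and SP.DYN.LE00.IN.
overlay_country <- function(plot, df, country_name, colour) {
  plot + geom_line(
    data = df[df$country == country_name, ],
    aes(x = year, y = SP.DYN.LE00.IN),
    colour = colour
  )
}

# Intended use with the objects built above:
# g <- overlay_country(g, df.le, "Kenya", "green")
# g <- overlay_country(g, df.le, "Uganda", "blue")
```

Each call returns a new plot object with one more layer, which is the same pattern as the repeated `g = g + geom_line(...)` steps.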

Neither of those traces appears to have the same signature as the Rwandan curve. So might there be another cause? How about civil war? For example, Bangladesh suffered a civil war in the early 1970s - what was the effect on life expectancy over that period?

g = g + geom_line(data = subset(df.le, country == "Bangladesh"), aes(x = year, 
    y = SP.DYN.LE00.IN), col = "purple")
g

[Plot: life expectancy boxplots by year with Rwanda (red), Kenya (green), Uganda (blue) and Bangladesh (purple) overlaid]

Ah ha - that has a marked similarity, to the eye at least…
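
To go a little beyond "to the eye", one rough option is to measure the largest fall from an earlier peak in each country's series and compare the numbers. This is only one possible measure and was not part of the original investigation; `largest_drop` is a hypothetical helper. The demonstration uses just the nine Rwanda rows printed earlier (1989 to 1997), so it says nothing about the years before 1989:

```r
# Hypothetical helper: largest fall from any earlier peak in a yearly series.
largest_drop <- function(years, values) {
  keep <- !is.na(values)            # ignore years with no data
  ord <- order(years[keep])         # make sure the series runs forward in time
  yrs <- years[keep][ord]
  vals <- values[keep][ord]
  fall <- cummax(vals) - vals       # distance below the highest value so far
  i <- which.max(fall)
  list(drop = fall[i], trough_year = yrs[i])
}

# Rwanda, 1989 to 1997, values copied from the filtered output earlier on.
rwanda_years <- 1989:1997
rwanda_le <- c(36.97, 32.83, 29.44, 27.33, 26.82, 27.94, 30.47, 33.98, 37.76)
largest_drop(rwanda_years, rwanda_le)
```

On these rows the trough is 1993, about 10.15 years below the 1989 value. With the full download, something like `with(subset(df.le, country == "Bangladesh"), largest_drop(year, SP.DYN.LE00.IN))` would give the matching figure for Bangladesh.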

Search for mortality indicators:

WDIsearch(string = "mortality", field = "name", cache = NULL)
##       indicator          
##  [1,] "SH.DYN.CHLD.FE"   
##  [2,] "SH.DYN.CHLD.MA"   
##  [3,] "SH.DYN.MORT"      
##  [4,] "SH.DYN.MORT.FE"   
##  [5,] "SH.DYN.MORT.MA"   
##  [6,] "SH.DYN.NMRT"      
##  [7,] "SH.STA.MMRT"      
##  [8,] "SH.STA.MMRT.NE"   
##  [9,] "SP.DYN.AMRT.FE"   
## [10,] "SP.DYN.AMRT.MA"   
## [11,] "SP.DYN.IMRT.FE.IN"
## [12,] "SP.DYN.IMRT.IN"   
## [13,] "SP.DYN.IMRT.MA.IN"
##       name                                                                   
##  [1,] "Mortality rate, female child (per 1,000 female children age one)"     
##  [2,] "Mortality rate, male child (per 1,000 male children age one)"         
##  [3,] "Mortality rate, under-5 (per 1,000 live births)"                      
##  [4,] "Mortality rate, under-5, female (per 1,000)"                          
##  [5,] "Mortality rate, under-5, male (per 1,000)"                            
##  [6,] "Mortality rate, neonatal (per 1,000 live births)"                     
##  [7,] "Maternal mortality ratio (modeled estimate, per 100,000 live births)" 
##  [8,] "Maternal mortality ratio (national estimate, per 100,000 live births)"
##  [9,] "Mortality rate, adult, female (per 1,000 female adults)"              
## [10,] "Mortality rate, adult, male (per 1,000 male adults)"                  
## [11,] "Mortality rate, infant, female (per 1,000 live births)"               
## [12,] "Mortality rate, infant (per 1,000 live births)"                       
## [13,] "Mortality rate, infant, male (per 1,000 live births)"

Grab “Mortality rate, under-5 (per 1,000 live births)” overall, and broken down for male and female:

df.cm = WDI(country = "all", indicator = c("SH.DYN.MORT", "SH.DYN.MORT.FE", 
    "SH.DYN.MORT.MA"), start = 1900, end = 2009)

Overall boxplot

gx = ggplot(df.cm) + geom_boxplot(aes(x = year, y = SH.DYN.MORT, group = year))
gx

[Plot: boxplots of under-5 mortality rate by year, all countries]

Does Rwanda account for any of the outliers? Let's overlay the Rwanda stats:

gx = gx + geom_line(data = subset(df.cm, country == "Rwanda"), aes(x = year, 
    y = SH.DYN.MORT), col = "red")
gx

[Plot: under-5 mortality boxplots by year with Rwanda overlaid in red]

Hmmm… some, but not all. So what are the outliers in this case? Do you think you could find out?
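
One way to approach that question is to apply the usual boxplot rule (points more than 1.5 times the interquartile range beyond the hinges) year by year and list the countries that fall outside it. A sketch using base R's `boxplot.stats`; `outliers_by_year` is a hypothetical helper, and because ggplot2 may compute its quartiles slightly differently, the flagged points might not match the chart exactly:

```r
# Hypothetical helper: list the rows that the standard boxplot rule would
# flag as outliers within each year. Assumes columns country and year.
outliers_by_year <- function(df, value_col) {
  pieces <- split(df, df$year)
  flagged <- lapply(pieces, function(d) {
    d <- d[!is.na(d[[value_col]]), ]                 # drop missing values
    out_values <- boxplot.stats(d[[value_col]])$out  # values beyond the whiskers
    d[d[[value_col]] %in% out_values, c("country", "year", value_col)]
  })
  result <- do.call(rbind, flagged)
  rownames(result) <- NULL
  result
}

# Intended use once df.cm has been downloaded:
# out <- outliers_by_year(df.cm, "SH.DYN.MORT")
# sort(table(as.character(out$country)), decreasing = TRUE)
```

Counting how often each country appears gives a quick ranking of the most persistent outliers, which could then be overlaid on the chart in the same way as Rwanda.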