Project 2: Data Preparation & Transformation Pt. III

Data Set III - Child Mortality
Load Libraries
Load Data
Preview Data Structure
Analysis
Conclusion

Data Set III - Child Mortality

This data set was posted by Raj Kumar. Childmortality.org is an organization that publishes Child Mortality estimates for all the countries around the world. They provide all available data and the latest child mortality estimates for each country based on the research of the UN Inter-agency Group for Child Mortality Estimation.

The data is in a very wide format and contains six variables with values of interest. Each of these variables is concatenated with each year from 1950 to 2016, which, together with three identifier columns, results in 405 columns. These key variables are:

(-) Under-5 (0-4 years) mortality
(-) Infant (0-1 years) mortality
(-) Neonatal (0-1 month) mortality
(-) Number of under-5 deaths
(-) Number of infant deaths
(-) Number of neonatal deaths
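
The 405-column figure can be checked with quick arithmetic. This is a minimal sketch that assumes the three leading identifier columns reported by the structure preview further down (ISO.Code, CountryName, Uncertainty.bounds.); the variable names are illustrative only.

```r
# Six measures, each repeated for every year from 1950 to 2016
n_measures <- 6
years      <- 1950:2016

# Identifier columns: ISO.Code, CountryName, Uncertainty.bounds.
n_id_cols  <- 3

n_value_cols <- n_measures * length(years)  # 6 * 67 = 402
n_cols       <- n_value_cols + n_id_cols    # 402 + 3 = 405
n_cols
```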

As suggested by Raj, we can read this data into a data frame and subset it to the median estimate for each country. We also need to handle missing values. Then we can convert the data into long format with four variables: country, year, category, and value. This will make the data easier to analyze.

My goals with this data set are as follows:

(-) Load, tidy, and transform the data.
(-) Map Infant Mortality Rate across the globe.
(-) Map and compare the change in Infant Mortality Rates.
(-) Create time series plots of Infant Mortality for the extremes.

Load Libraries

library("tidyverse")

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts --------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("stringr")
library("DT")
library("rworldmap")

## Loading required package: sp

## ### Welcome to rworldmap ###

## For a short introduction type :   vignette('rworldmap')

Load Data

data <- read.csv("data/ChildMortality.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)

Preview Data Structure

str(data)

## 'data.frame':    585 obs. of  405 variables:
##  $ ISO.Code              : chr  "AFG" "AFG" "AFG" "AGO" ...
##  $ CountryName           : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Angola" ...
##  $ Uncertainty.bounds.   : chr  "Lower" "Median" "Upper" "Lower" ...
##  $ U5MR.1950             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1951             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1952             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1953             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1954             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1955             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1956             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1957             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1958             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1959             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ U5MR.1960             : num  308 364 427 NA NA ...
##  $ U5MR.1961             : num  307 358 414 NA NA ...
##  $ U5MR.1962             : num  306 352 404 NA NA ...
##  $ U5MR.1963             : num  303 346 394 NA NA ...
##  $ U5MR.1964             : num  299 340 386 NA NA ...
##  $ U5MR.1965             : num  295 334 378 NA NA ...
##  $ U5MR.1966             : num  290 328 372 NA NA ...
##  $ U5MR.1967             : num  286 323 365 NA NA ...
##  $ U5MR.1968             : num  281 317 359 NA NA ...
##  $ U5MR.1969             : num  276 312 353 NA NA ...
##  $ U5MR.1970             : num  270 306 348 NA NA ...
##  $ U5MR.1971             : num  265 300 342 NA NA ...
##  $ U5MR.1972             : num  261 295 336 NA NA ...
##  $ U5MR.1973             : num  255 289 330 NA NA ...
##  $ U5MR.1974             : num  250 283 323 NA NA ...
##  $ U5MR.1975             : num  245 277 316 NA NA ...
##  $ U5MR.1976             : num  240 271 308 NA NA ...
##  $ U5MR.1977             : num  235 264 301 NA NA ...
##  $ U5MR.1978             : num  230 258 293 NA NA ...
##  $ U5MR.1979             : num  224 252 285 NA NA ...
##  $ U5MR.1980             : num  218 245 277 181 236 ...
##  $ U5MR.1981             : num  213 238 269 185 234 ...
##  $ U5MR.1982             : num  207 232 262 188 230 ...
##  $ U5MR.1983             : num  201 225 253 190 228 ...
##  $ U5MR.1984             : num  196 218 245 192 225 ...
##  $ U5MR.1985             : num  190 211 237 193 224 ...
##  $ U5MR.1986             : num  184 204 228 194 223 ...
##  $ U5MR.1987             : num  179 197 219 195 222 ...
##  $ U5MR.1988             : num  173 190 210 196 221 ...
##  $ U5MR.1989             : num  168 184 202 197 221 ...
##  $ U5MR.1990             : num  162 177 194 198 221 ...
##  $ U5MR.1991             : num  157 171 186 198 222 ...
##  $ U5MR.1992             : num  152 165 180 199 223 ...
##  $ U5MR.1993             : num  147 160 173 200 224 ...
##  $ U5MR.1994             : num  142 154 168 201 224 ...
##  $ U5MR.1995             : num  138 150 162 200 223 ...
##  $ U5MR.1996             : num  134 145 157 199 222 ...
##  $ U5MR.1997             : num  130 141 152 196 220 ...
##  $ U5MR.1998             : num  127 137 148 192 216 ...
##  $ U5MR.1999             : num  124 133 144 188 212 ...
##  $ U5MR.2000             : num  120 130 140 183 207 ...
##  $ U5MR.2001             : num  117 126 136 176 201 ...
##  $ U5MR.2002             : num  113 122 132 170 194 ...
##  $ U5MR.2003             : num  110 118 128 161 186 ...
##  $ U5MR.2004             : num  106 114 124 152 177 ...
##  $ U5MR.2005             : num  102 110 120 141 167 ...
##  $ U5MR.2006             : num  98.4 106.3 115 129.2 157.5 ...
##  $ U5MR.2007             : num  94.4 102.2 110.8 117.2 147.6 ...
##  $ U5MR.2008             : num  90.3 98.2 106.6 105.6 137.9 ...
##  $ U5MR.2009             : num  86.1 94.1 102.9 94.3 128.3 ...
##  $ U5MR.2010             : num  81.7 90.2 99.4 83.1 119.4 ...
##  $ U5MR.2011             : num  77.3 86.4 96.3 73 111 ...
##  $ U5MR.2012             : num  72.7 82.8 93.4 63.8 103.5 ...
##  $ U5MR.2013             : num  68.2 79.3 90.7 56.1 96.8 ...
##  $ U5MR.2014             : num  64 76.1 88.1 50 91.2 ...
##  $ U5MR.2015             : num  60.2 73.2 86.1 45 86.5 ...
##  $ U5MR.2016             : num  56.6 70.4 84.7 41.2 82.5 147 7.3 13.5 24.8 1.6 ...
##  $ IMR.1950              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1951              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1952              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1953              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1954              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1955              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1956              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1957              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1958              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1959              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ IMR.1960              : num  206 246 292 NA NA ...
##  $ IMR.1961              : num  206 241 283 NA NA ...
##  $ IMR.1962              : num  205 237 275 NA NA ...
##  $ IMR.1963              : num  203 233 268 NA NA ...
##  $ IMR.1964              : num  200 228 262 NA NA ...
##  $ IMR.1965              : num  197 224 256 NA NA ...
##  $ IMR.1966              : num  194 220 252 NA NA ...
##  $ IMR.1967              : num  191 216 247 NA NA ...
##  $ IMR.1968              : num  187 213 242 NA NA ...
##  $ IMR.1969              : num  184 209 238 NA NA ...
##  $ IMR.1970              : num  180 205 234 NA NA ...
##  $ IMR.1971              : num  177 201 230 NA NA ...
##  $ IMR.1972              : num  174 197 226 NA NA ...
##  $ IMR.1973              : num  170 193 221 NA NA ...
##  $ IMR.1974              : num  167 189 217 NA NA ...
##  $ IMR.1975              : num  164 185 212 NA NA ...
##  $ IMR.1976              : num  160 181 207 NA NA ...
##  $ IMR.1977              : num  157 176 201 NA NA ...
##  $ IMR.1978              : num  153 172 196 NA NA ...
##   [list output truncated]

The structure reveals a data set with 405 variables. Variables of the form Under.five.Deaths.1955 pose a problem for analysis because each one encodes two variables: category and year.

Use the gather function to convert variables of the form variable.name.year to rows. This should result in a variable called YearCat (preserving the current name format) and a variable called Rate.

data <- gather(data, "YearCat", "Rate", 4:405)
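
To see what gather does here, the following is a minimal sketch on a made-up two-row data frame. The country codes and values are invented for illustration; only the column naming pattern mirrors the real file.

```r
library("tidyr")

# Toy stand-in for the real file: 2 rows, 3 identifier columns, 4 value columns
toy <- data.frame(
  ISO.Code            = c("AAA", "BBB"),
  CountryName         = c("Country A", "Country B"),
  Uncertainty.bounds. = c("Median", "Median"),
  U5MR.1980 = c(100, 50),
  U5MR.2016 = c(40, 10),
  IMR.1980  = c(70, 30),
  IMR.2016  = c(30, 8),
  stringsAsFactors = FALSE
)

# Columns 4 to 7 become key/value pairs; the identifier columns are repeated
toy_long <- gather(toy, "YearCat", "Rate", 4:7)

nrow(toy_long)   # 2 rows x 4 value columns = 8
names(toy_long)
```

In tidyr 1.0.0 and later, pivot_longer() is the suggested replacement for gather(). gather() still works, which is why it is kept here to match the rest of the analysis.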

Split the YearCat variable into Category and Year.

data$Category <- str_sub(data$YearCat,1,-6)
data$Year <- strtoi(str_sub(data$YearCat, -4))
datatable( tail(data), options = list(filter = FALSE),filter="none" )
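
The negative indices are what make this split work for prefixes of any length: every name ends in a dot followed by a four-digit year, so position -6 is always the last character of the category. A small sketch on three column names that appear in this document:

```r
library("stringr")

year_cat <- c("U5MR.1950", "IMR.1978", "Under.five.Deaths.1955")

# Everything up to the sixth character from the end is the category
category <- str_sub(year_cat, 1, -6)

# The last four characters are the year
year <- strtoi(str_sub(year_cat, -4))

category
year
```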

Analysis

Map the Infant Mortality Rate for the years 1980 and 2016, using the median estimate.

# filter the data to match criteria
mdata <- filter(data, `Category` == "IMR", `Uncertainty.bounds.` == "Median", Year == 1980)

# match country polygon to country code in data set
sPDF <- joinCountryData2Map( mdata,joinCode = "ISO3", nameJoinColumn = "ISO.Code")

## 195 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 48 codes from the map weren't represented in your data

mapParams <- mapCountryData(sPDF, nameColumnToPlot = 'Rate',
                            missingCountryCol = NA,
                            addLegend = FALSE,
                            mapTitle = "Infant Mortality Rate Per 1000 \n Year: 1980")

do.call( addMapLegend, c( mapParams, legendLabels = "all", legendWidth = 1.5 ) )

# filter the data to match criteria
mdata <- filter(data, `Category` == "IMR", `Uncertainty.bounds.` == "Median", Year == 2016)

# match country polygon to country code in data set
sPDF <- joinCountryData2Map( mdata,joinCode = "ISO3", nameJoinColumn = "ISO.Code")

## 195 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 48 codes from the map weren't represented in your data

mapParams <- mapCountryData(sPDF, nameColumnToPlot = 'Rate',
                            missingCountryCol = NA,
                            addLegend = FALSE,
                            mapTitle = "Infant Mortality Rate Per 1000 \n Year: 2016")

do.call( addMapLegend, c( mapParams, legendLabels = "all", legendWidth = 1.5 ) )
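
The two map chunks differ only in the year, so they could be folded into a helper. The sketch below is one possible way to do that and is not part of the original analysis. map_imr and its arguments are hypothetical names, and it assumes a long-format data frame with the columns created above (ISO.Code, Uncertainty.bounds., Category, Year, Rate).

```r
library("dplyr")
library("rworldmap")

# Hypothetical helper: draw the median IMR map for a single year.
# `long_data` is assumed to have the columns built earlier in this document.
map_imr <- function(long_data, target_year) {
  # keep the median IMR estimate for the requested year
  mdata <- filter(long_data,
                  Category == "IMR",
                  Uncertainty.bounds. == "Median",
                  Year == target_year)

  # match country polygons to the ISO3 codes in the data
  sPDF <- joinCountryData2Map(mdata, joinCode = "ISO3",
                              nameJoinColumn = "ISO.Code")

  # draw the map without a legend, then add a customised legend
  mapParams <- mapCountryData(sPDF, nameColumnToPlot = "Rate",
                              missingCountryCol = NA,
                              addLegend = FALSE,
                              mapTitle = paste("Infant Mortality Rate Per 1000 \n Year:",
                                               target_year))
  do.call(addMapLegend, c(mapParams, legendLabels = "all", legendWidth = 1.5))
}
```

With the data object from above, map_imr(data, 1980) and map_imr(data, 2016) would be expected to reproduce the two maps.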

The maps show a stunning decrease in Infant Mortality Rates from 1980 to 2016. This is evident from the legend scales: the minimum fell from 7.1 to 1.6 and the maximum from 177 to 88.5. Even so, Africa is still plagued by relatively high infant mortality rates after all those years.

Europe, North America (excluding Mexico), Russia, and Australia have the lowest rates. China is the only “world power” with questionable Infant Mortality Rates.
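
The scale comparison can also be checked numerically rather than read off the legends. This is a minimal sketch on invented long-format rows; the codes and values are made up, and on the real data the same summary would be applied after filtering to the median IMR rows.

```r
library("dplyr")

# Made-up long-format rows; the values are illustrative, not from the real file
toy <- data.frame(
  ISO.Code = c("AAA", "BBB", "CCC", "AAA", "BBB", "CCC"),
  Year     = c(1980, 1980, 1980, 2016, 2016, 2016),
  Rate     = c(120, 45, 10, 60, 20, 4),
  stringsAsFactors = FALSE
)

# Smallest and largest rate per year
rate_range <- toy %>%
  group_by(Year) %>%
  summarise(min_rate = min(Rate, na.rm = TRUE),
            max_rate = max(Rate, na.rm = TRUE))

rate_range
```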

What about the change in Infant Mortality Rates across the same time period?

# calculate the change in IMR between 1980 and 2016
# (1980 minus 2016, so positive values are decreases)
mData <- select(data, -YearCat) %>%
          filter( `Category` == "IMR", `Uncertainty.bounds.` == "Median", Year %in% c(1980, 2016) ) %>%
          spread( Year, Rate )  %>%
          mutate( Rate =  `1980` - `2016` )
          
# match country polygon to country code in data set
sPDF <- joinCountryData2Map( data.frame(mData),joinCode = "ISO3", nameJoinColumn = "ISO.Code" )

## 195 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 48 codes from the map weren't represented in your data

mapParams <- mapCountryData(sPDF, nameColumnToPlot = 'Rate',
                            numCats = 50,
                            missingCountryCol = NA,
                            colourPalette = 'diverging',
                            addLegend = FALSE,
                            mapTitle = "Change in Infant Mortality Rate Per 1000 \n Year: 1980 - 2016")

do.call( addMapLegend, c( mapParams, legendLabels = "all", legendWidth = 1.5 ) )

The map shows that the largest decreases in Infant Mortality Rates occurred in countries that previously had very high rates. These are concentrated in Africa and the Middle East. However, the map does a poor job of displaying countries where the rate increased. The map legend suggests that at least one country had an increase in Infant Mortality Rate of 16.3.
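
Because the change is computed as the 1980 rate minus the 2016 rate, a positive value means a decrease and a negative value means an increase. A toy sketch of that sign convention, with invented codes and values:

```r
library("dplyr")
library("tidyr")

# Made-up rows: AAA improves, BBB gets worse
toy <- data.frame(
  ISO.Code = c("AAA", "AAA", "BBB", "BBB"),
  Year     = c(1980, 2016, 1980, 2016),
  Rate     = c(100, 40, 10, 15),
  stringsAsFactors = FALSE
)

change <- toy %>%
  spread(Year, Rate) %>%           # one column per year: `1980` and `2016`
  mutate(Rate = `1980` - `2016`)   # positive = decrease, negative = increase

change

# Countries whose rate went up over the period
increased <- filter(change, Rate < 0)
increased
```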

Which countries had an increase in Infant Mortality Rate from 1980 to 2016?

datatable(filter(mData, Rate < 0 ), options = list(filter = FALSE),filter="none")

Dominica is the only country that had a higher Infant Mortality Rate in 2016 when compared to 1980!

Plot a time series chart of the Infant Mortality Rate for the countries with the largest increase and the largest decrease over the period.

# keep the median IMR series for the two countries at either end of the change ranking
pData <- filter(data, 
                `ISO.Code` == arrange( mData, `Rate` )$`ISO.Code`[1] | `ISO.Code` == arrange( mData, desc( `Rate` ) )$`ISO.Code`[1], 
                `Category` == "IMR", 
                `Uncertainty.bounds.` == "Median",
                !is.na(`Rate`) ) 

# one line per country
ggplot( pData, aes(x = Year, y = Rate) ) +
  labs( title = "Infant Mortality Rate" ) +
  geom_line( aes(color = `ISO.Code`), size = 1 )

The plots are almost the inverse of each other. Dominica’s rate stabilized in the 1980s but began increasing at an alarming rate around 2005. Mozambique’s rate trended downwards throughout. Mozambique probably had the largest decrease because of its very high initial rate.
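
The two countries plotted above are picked by sorting the change table twice. An alternative sketch, assuming a table with the same ISO.Code and Rate columns, uses which.min and which.max instead; the codes and values below are invented.

```r
# Made-up per-country changes (1980 rate minus 2016 rate)
toy_change <- data.frame(
  ISO.Code = c("AAA", "BBB", "CCC"),
  Rate     = c(60, -5, 12),
  stringsAsFactors = FALSE
)

# First the code with the smallest change (largest increase),
# then the code with the largest change (largest decrease)
extremes <- toy_change$ISO.Code[c(which.min(toy_change$Rate),
                                  which.max(toy_change$Rate))]
extremes
```

The resulting vector could then be used as ISO.Code %in% extremes inside filter().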

Conclusion

Much more can be done with this data set. It would be interesting to bring in related financial, economic, and management data to gauge the effectiveness of the methodologies used to combat infant mortality.

Tools in the Tidyverse were particularly useful when this data set ballooned from 585 to over 200,000 observations.
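
The size of the long table follows from the shape of the wide one. A quick sketch, assuming gather() was left at its default of keeping NA cells, so every value cell becomes one row:

```r
n_rows_wide  <- 585   # observations reported by str(data)
n_value_cols <- 402   # columns 4:405 passed to gather()

n_rows_long <- n_rows_wide * n_value_cols
n_rows_long           # 235170
```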