Assignment Summary

For Project 2, choose any three of the “wide” datasets identified in the Week 6 Discussion items. This document contains the third of the three files selected for this assignment.

File selected: UNICEF data on Under 5 Mortality Rates, submitted by Samuel Bellows. Link: UNICEF Under 5 Mortality Rates

Analysis to perform: This UNICEF dataset gives the under-5 mortality rate for many countries across the years 1950-2015. The problem is that the year variable is spread across 66 columns, one for each year, which need to be gathered into a single column. Produce a three-column dataset of country, year, and mortality. Provide narrative descriptions of the data cleanup work and analysis completed, along with conclusions about the data.

Setup and load data

library(dplyr)
library(tidyr)
library(data.table)
library(ggplot2)
#Load the data
data <-  read.csv('unicef-u5mr.csv')
head(data,4)
##   CountryName U5MR.1950 U5MR.1951 U5MR.1952 U5MR.1953 U5MR.1954 U5MR.1955
## 1 Afghanistan        NA        NA        NA        NA        NA        NA
## 2     Albania        NA        NA        NA        NA        NA        NA
## 3     Algeria        NA        NA        NA        NA       251     249.9
## 4     Andorra        NA        NA        NA        NA        NA        NA
##   U5MR.1956 U5MR.1957 U5MR.1958 U5MR.1959 U5MR.1960 U5MR.1961 U5MR.1962
## 1        NA        NA        NA        NA        NA     356.5     350.6
## 2        NA        NA        NA        NA        NA        NA        NA
## 3       249       248     247.5     246.7     246.3     246.1     246.2
## 4        NA        NA        NA        NA        NA        NA        NA
##   U5MR.1963 U5MR.1964 U5MR.1965 U5MR.1966 U5MR.1967 U5MR.1968 U5MR.1969
## 1     345.0     339.7     334.1     328.7     323.3     318.1     313.0
## 2        NA        NA        NA        NA        NA        NA        NA
## 3     246.8     247.4     248.2     248.7     248.4     247.4     245.3
## 4        NA        NA        NA        NA        NA        NA        NA
##   U5MR.1970 U5MR.1971 U5MR.1972 U5MR.1973 U5MR.1974 U5MR.1975 U5MR.1976
## 1     307.8     302.1     296.4     290.8     284.9     279.4     273.6
## 2        NA        NA        NA        NA        NA        NA        NA
## 3     241.7     236.5     230.0     222.5     214.2     205.0     195.2
## 4        NA        NA        NA        NA        NA        NA        NA
##   U5MR.1977 U5MR.1978 U5MR.1979 U5MR.1980 U5MR.1981 U5MR.1982 U5MR.1983
## 1     267.8     261.6     255.5     249.1     242.7     236.2     229.7
## 2        NA      91.1      84.7      78.6      73.0      67.8      62.8
## 3     184.9     173.8     161.8     148.1     132.5     115.8      99.2
## 4        NA        NA        NA        NA        NA        NA        NA
##   U5MR.1984 U5MR.1985 U5MR.1986 U5MR.1987 U5MR.1988 U5MR.1989 U5MR.1990
## 1     222.9     216.0     209.2     202.1     195.0     187.8     181.0
## 2      58.3      54.3      50.7      47.6      44.9      42.5      40.6
## 3      83.8      71.2      61.9      55.4      51.2      48.5      46.8
## 4        NA        NA        NA        NA        NA        NA       8.5
##   U5MR.1991 U5MR.1992 U5MR.1993 U5MR.1994 U5MR.1995 U5MR.1996 U5MR.1997
## 1     174.2     167.8     162.0     156.8     152.3     148.6     145.5
## 2      38.8      37.3      36.0      34.6      33.2      31.8      30.3
## 3      45.7      44.9      44.1      43.3      42.5      41.8      41.1
## 4       7.9       7.4       6.9       6.4       6.0       5.7       5.3
##   U5MR.1998 U5MR.1999 U5MR.2000 U5MR.2001 U5MR.2002 U5MR.2003 U5MR.2004
## 1     142.6     139.9     137.0     133.8     130.3     126.8     123.2
## 2      28.9      27.5      26.2      24.9      23.6      22.5      21.5
## 3      40.6      40.2      39.7      38.9      37.8      36.5      35.1
## 4       5.0       4.8       4.6       4.4       4.2       4.1       4.0
##   U5MR.2005 U5MR.2006 U5MR.2007 U5MR.2008 U5MR.2009 U5MR.2010 U5MR.2011
## 1     119.6     116.3     113.2     110.4     107.6     105.0     102.3
## 2      20.5      19.5      18.7      17.9      17.3      16.6      16.0
## 3      33.6      32.1      30.7      29.4      28.3      27.3      26.6
## 4       3.9       3.7       3.6       3.5       3.4       3.3       3.2
##   U5MR.2012 U5MR.2013 U5MR.2014 U5MR.2015
## 1      99.5      96.7      93.9      91.1
## 2      15.5      14.9      14.4      14.0
## 3      26.1      25.8      25.6      25.5
## 4       3.1       3.0       2.9       2.8

Tidy up the data

#Convert the years data from a wide to long format, naming the new columns explicitly
yearlydata <- pivot_longer(data, cols = 2:67, names_to = 'year',
                           names_prefix = 'U5MR\\.', values_to = 'mortality')
yearlydata <- rename(yearlydata, country = CountryName)
yearlydata_sorted <- yearlydata[order(yearlydata$mortality),]
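
As a quick check on the reshape, the sketch below runs the same kind of pivot on a small made-up data frame and confirms the result has one row per country-year. The `toy` data and its values are illustrative assumptions, not taken from the UNICEF file, and the conversion of `year` to integer is an optional extra step that the analysis in this document does not perform.

```r
library(tidyr)

# Toy wide data in the same layout as the UNICEF file (values are made up)
toy <- data.frame(
  CountryName = c("A", "B"),
  U5MR.1950 = c(NA, 100),
  U5MR.1951 = c(250, 90)
)

# Gather the year columns, strip the "U5MR." prefix, and store year as integer
toy_long <- pivot_longer(
  toy,
  cols = starts_with("U5MR."),
  names_to = "year",
  names_prefix = "U5MR\\.",
  names_transform = list(year = as.integer),
  values_to = "mortality"
)

# One row per country-year: 2 countries x 2 years = 4 rows, 3 columns
dim(toy_long)
```

Storing `year` as a number rather than a label such as `U5MR.1950` would make it usable directly on a plot axis or in a filter like `year >= 1990`.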

Analysis

#Generate some summary information about the dataset
summary(yearlydata)
##               country          year             mortality     
##  Afghanistan      :   66   Length:12936       Min.   :  1.90  
##  Albania          :   66   Class :character   1st Qu.: 19.90  
##  Algeria          :   66   Mode  :character   Median : 53.55  
##  Andorra          :   66                      Mean   : 85.01  
##  Angola           :   66                      3rd Qu.:128.30  
##  Antigua & Barbuda:   66                      Max.   :443.50  
##  (Other)          :12540                      NA's   :2692
#Only the mortality column contains missing values, so count the NAs there
missing_data <- sum(is.na(yearlydata$mortality))
percent_missing_data <- round(missing_data / nrow(yearlydata) * 100)

cat('\n')
cat('Number of country-years missing data:', missing_data,'\n')
## Number of country-years missing data: 2692
cat('Percent of missing data:', percent_missing_data, '%\n')
## Percent of missing data: 21 %
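
The overall missing-data figure does not show where the gaps are concentrated. A minimal sketch of a per-country breakdown follows, assuming a long-format table with the same three columns; `toy_long` and its values are made up for illustration.

```r
library(dplyr)

# Toy long-format data (made-up values) with the same three columns
toy_long <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year = c(1950, 1951, 1952, 1950, 1951, 1952),
  mortality = c(NA, NA, 250, 100, 90, NA)
)

# Count and share of missing years per country, worst coverage first
missing_by_country <- toy_long %>%
  group_by(country) %>%
  summarise(
    n_missing = sum(is.na(mortality)),
    pct_missing = round(mean(is.na(mortality)) * 100)
  ) %>%
  arrange(desc(n_missing))

missing_by_country
```

Grouping by `year` instead of `country` would show which years are most sparsely covered.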
mortality_avg <- data %>%
             transmute(CountryName,
             Mean = rowMeans(select(., c(2:67)),na.rm = TRUE))

mortality_avg <- mortality_avg[order(mortality_avg$Mean, decreasing = FALSE),]
head(mortality_avg)
##        CountryName     Mean
## 4          Andorra 4.676923
## 113         Monaco 5.214815
## 151     San Marino 6.563333
## 161       Slovenia 7.717143
## 46          Cyprus 9.002703
## 47  Czech Republic 9.585714
#Boxplot the data
ggplot(mortality_avg, aes(x = "", y = Mean)) + 
  geom_boxplot()

#order() sorts NaN last, so the final row is Liechtenstein, which has no data
mortality_avg_top20 <- tail(mortality_avg,20)
mortality_avg_top20
##                  CountryName     Mean
## 85             Cote d Ivoire 183.3508
## 175                     Togo 183.5864
## 33  Central African Republic 190.1143
## 166              South Sudan 199.6162
## 154                  Senegal 201.2924
## 35                      Chad 204.1864
## 1                Afghanistan 207.2182
## 128                  Nigeria 208.7269
## 118               Mozambique 208.9745
## 64                Gambia The 209.9067
## 5                     Angola 210.8361
## 48                     Benin 215.7379
## 105                   Malawi 222.5250
## 99                   Liberia 223.9966
## 72                    Guinea 235.4727
## 190             Burkina Faso 247.1606
## 127                    Niger 262.1510
## 157             Sierra Leone 267.9500
## 108                     Mali 273.7038
## 101            Liechtenstein      NaN
#Bargraph the top 20 countries with the highest average mortality
ggplot(mortality_avg_top20, aes(x=reorder(CountryName, Mean), y=Mean)) + 
  geom_bar(stat="identity", fill="lightgreen", color="grey50") +
  ggtitle("Countries with Highest Mortality Rates") +
  xlab("country") + ylab("mortality rate") + 
  coord_flip()
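
Because `order()` places `NaN` at the end, a country with no observations occupies one of the 20 slots in the highest-mortality table. One possible way to avoid this, sketched on made-up data (`toy_avg` is a hypothetical stand-in for `mortality_avg`), is to drop the `NaN` rows before ranking:

```r
# Toy averages (made-up values); "C" has no data at all, so its mean is NaN
toy_avg <- data.frame(
  CountryName = c("A", "B", "C", "D"),
  Mean = c(10, 200, NaN, 150)
)

# Drop countries without any observations, then sort ascending as above
toy_ranked <- toy_avg[!is.na(toy_avg$Mean), ]
toy_ranked <- toy_ranked[order(toy_ranked$Mean), ]

# The two highest averages, with no NaN row taking up a slot
top2 <- tail(toy_ranked, 2)
top2
```

Note that `is.na()` returns `TRUE` for `NaN` as well as `NA`, so a single filter covers both cases.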

mortality_avg_bottom20 <- head(mortality_avg,20)
mortality_avg_bottom20
##              CountryName      Mean
## 4                Andorra  4.676923
## 113               Monaco  5.214815
## 151           San Marino  6.563333
## 161             Slovenia  7.717143
## 46                Cyprus  9.002703
## 47        Czech Republic  9.585714
## 170               Sweden 10.393939
## 44               Croatia 10.397143
## 103           Luxembourg 10.861538
## 66               Germany 10.866667
## 24                Brunei 11.002941
## 83                Israel 11.042857
## 77               Iceland 11.153846
## 130               Norway 12.460606
## 123          Netherlands 12.565152
## 19  Bosnia & Herzegovina 12.940625
## 59               Finland 12.949231
## 116           Montenegro 13.256250
## 49               Denmark 13.348485
## 57               Estonia 13.861111
#Bargraph the countries with the lowest mortality rates
ggplot(mortality_avg_bottom20, aes(x=reorder(CountryName, -Mean), y=Mean)) + 
  geom_bar(stat="identity", fill="lightgreen", color="grey50") +
  ggtitle("Countries with Lowest Mortality Rates") +
  xlab("country") + ylab("mortality rate") + 
  coord_flip()

#What is the US average mortality rate
subset(mortality_avg, CountryName == 'United States of America')
##                  CountryName     Mean
## 189 United States of America 17.46667
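
A country's position in the ordering can also be computed directly rather than read off the sorted table. The sketch below uses base R `rank()` on made-up data; `toy_avg` and the `Rank` column are illustrative names, not part of the analysis above.

```r
# Toy averages (made-up values); "C" has no data, so its mean is NaN
toy_avg <- data.frame(
  CountryName = c("A", "B", "C", "D"),
  Mean = c(10, 200, NaN, 150)
)

# rank() gives 1 to the lowest mean; countries without data stay unranked (NA)
toy_avg$Rank <- rank(toy_avg$Mean, na.last = "keep", ties.method = "min")

# Look up a single country's rank alongside its mean
subset(toy_avg, CountryName == "D")
```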

Conclusions and Recommendations

There are 196 countries in the dataset. Across all country-year observations from 1950 to 2015, the under-5 mortality rate has a mean of 85 and a median of 54, and the middle 50% of observations fall between about 20 and 128. Roughly 21% of country-years (2,692 of 12,936) have no recorded value.

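Mali has the highest average mortality rate at 274, and the rate drops to 183 for Cote d'Ivoire, the 19th highest country with data; the final row of the top-20 table is Liechtenstein, which has no recorded values (NaN). At the other end, Andorra has the lowest average rate at 5, and the 20th lowest country is Estonia at 14. Interestingly, the United States ranks 33rd with an average rate of 17. Note that each average covers only the years a country reports, so countries whose data begin late (Andorra's series starts in 1990) are not directly comparable with countries reporting from the 1950s.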

Recommendations for further study of this data are to:

  1. align each country with its continent and evaluate countries by continent, similar to prior datasets in this assignment when states were analyzed by region.

  2. identify additional demographics about each of these countries and look for the top features that could help predict a country's mortality rate for children under age 5 (high correlations would be the desired find).
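
Both recommendations amount to joining the per-country averages to an external lookup table. The following is a rough sketch under stated assumptions: the `country_info` table, its `Continent` and `GdpPerCapita` columns, and all values are hypothetical, and a real lookup would have to come from a separate source with matching country names.

```r
library(dplyr)

# Toy per-country averages (made-up values)
toy_avg <- data.frame(
  CountryName = c("A", "B", "C"),
  Mean = c(10, 200, 150),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table with a continent and one demographic feature
country_info <- data.frame(
  CountryName = c("A", "B", "C"),
  Continent = c("Europe", "Africa", "Africa"),
  GdpPerCapita = c(50000, 1000, 2000),
  stringsAsFactors = FALSE
)

joined <- left_join(toy_avg, country_info, by = "CountryName")

# Recommendation 1: average mortality by continent, highest first
by_continent <- joined %>%
  group_by(Continent) %>%
  summarise(MeanU5MR = mean(Mean, na.rm = TRUE), Countries = n()) %>%
  arrange(desc(MeanU5MR))

# Recommendation 2: correlation between mortality and a candidate feature
feature_cor <- cor(joined$Mean, joined$GdpPerCapita, use = "complete.obs")

by_continent
feature_cor
```

With real data, country names that fail to match between the two tables would show up as `NA` in the joined columns and should be checked before summarising.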