For Project 2, choose any three of the “wide” datasets identified in the Week 6 Discussion items. This document contains the third of the three files selected for this assignment.
File selected: UNICEF data on under-5 mortality rates, submitted by Samuel Bellows. Link: UNICEF Under 5 Mortality Rates
Analysis to perform: This UNICEF dataset gives the under-5 mortality rate for many countries across the years 1950-2015. The problem is that the year variable is spread across 66 different columns, one for each year, that need to be gathered into a single column. Produce a 3-column dataset of country, year, and mortality. Provide narrative descriptions of the data cleanup work and analysis completed, along with conclusions about the data.
library(dplyr)
library(tidyr)
library(data.table)
library(ggplot2)
#Load the data
data <- read.csv('unicef-u5mr.csv')
head(data,4)
## CountryName U5MR.1950 U5MR.1951 U5MR.1952 U5MR.1953 U5MR.1954 U5MR.1955
## 1 Afghanistan NA NA NA NA NA NA
## 2 Albania NA NA NA NA NA NA
## 3 Algeria NA NA NA NA 251 249.9
## 4 Andorra NA NA NA NA NA NA
## U5MR.1956 U5MR.1957 U5MR.1958 U5MR.1959 U5MR.1960 U5MR.1961 U5MR.1962
## 1 NA NA NA NA NA 356.5 350.6
## 2 NA NA NA NA NA NA NA
## 3 249 248 247.5 246.7 246.3 246.1 246.2
## 4 NA NA NA NA NA NA NA
## U5MR.1963 U5MR.1964 U5MR.1965 U5MR.1966 U5MR.1967 U5MR.1968 U5MR.1969
## 1 345.0 339.7 334.1 328.7 323.3 318.1 313.0
## 2 NA NA NA NA NA NA NA
## 3 246.8 247.4 248.2 248.7 248.4 247.4 245.3
## 4 NA NA NA NA NA NA NA
## U5MR.1970 U5MR.1971 U5MR.1972 U5MR.1973 U5MR.1974 U5MR.1975 U5MR.1976
## 1 307.8 302.1 296.4 290.8 284.9 279.4 273.6
## 2 NA NA NA NA NA NA NA
## 3 241.7 236.5 230.0 222.5 214.2 205.0 195.2
## 4 NA NA NA NA NA NA NA
## U5MR.1977 U5MR.1978 U5MR.1979 U5MR.1980 U5MR.1981 U5MR.1982 U5MR.1983
## 1 267.8 261.6 255.5 249.1 242.7 236.2 229.7
## 2 NA 91.1 84.7 78.6 73.0 67.8 62.8
## 3 184.9 173.8 161.8 148.1 132.5 115.8 99.2
## 4 NA NA NA NA NA NA NA
## U5MR.1984 U5MR.1985 U5MR.1986 U5MR.1987 U5MR.1988 U5MR.1989 U5MR.1990
## 1 222.9 216.0 209.2 202.1 195.0 187.8 181.0
## 2 58.3 54.3 50.7 47.6 44.9 42.5 40.6
## 3 83.8 71.2 61.9 55.4 51.2 48.5 46.8
## 4 NA NA NA NA NA NA 8.5
## U5MR.1991 U5MR.1992 U5MR.1993 U5MR.1994 U5MR.1995 U5MR.1996 U5MR.1997
## 1 174.2 167.8 162.0 156.8 152.3 148.6 145.5
## 2 38.8 37.3 36.0 34.6 33.2 31.8 30.3
## 3 45.7 44.9 44.1 43.3 42.5 41.8 41.1
## 4 7.9 7.4 6.9 6.4 6.0 5.7 5.3
## U5MR.1998 U5MR.1999 U5MR.2000 U5MR.2001 U5MR.2002 U5MR.2003 U5MR.2004
## 1 142.6 139.9 137.0 133.8 130.3 126.8 123.2
## 2 28.9 27.5 26.2 24.9 23.6 22.5 21.5
## 3 40.6 40.2 39.7 38.9 37.8 36.5 35.1
## 4 5.0 4.8 4.6 4.4 4.2 4.1 4.0
## U5MR.2005 U5MR.2006 U5MR.2007 U5MR.2008 U5MR.2009 U5MR.2010 U5MR.2011
## 1 119.6 116.3 113.2 110.4 107.6 105.0 102.3
## 2 20.5 19.5 18.7 17.9 17.3 16.6 16.0
## 3 33.6 32.1 30.7 29.4 28.3 27.3 26.6
## 4 3.9 3.7 3.6 3.5 3.4 3.3 3.2
## U5MR.2012 U5MR.2013 U5MR.2014 U5MR.2015
## 1 99.5 96.7 93.9 91.1
## 2 15.5 14.9 14.4 14.0
## 3 26.1 25.8 25.6 25.5
## 4 3.1 3.0 2.9 2.8
#Convert the years data from a wide to long format
yearlydata <- pivot_longer(data, cols = 2:67, names_to = 'year')
yearlydata <- yearlydata %>% `colnames<-`(c('country', 'year', 'mortality'))
yearlydata_sorted <- yearlydata[order(yearlydata$mortality),]
#Generate some summary information about the dataset
summary(yearlydata)
## country year mortality
## Afghanistan : 66 Length:12936 Min. : 1.90
## Albania : 66 Class :character 1st Qu.: 19.90
## Algeria : 66 Mode :character Median : 53.55
## Andorra : 66 Mean : 85.01
## Angola : 66 3rd Qu.:128.30
## Antigua & Barbuda: 66 Max. :443.50
## (Other) :12540 NA's :2692
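The summary shows `year` stored as character because each value still carries the `U5MR.` prefix from the original column names. A minimal sketch of stripping that prefix and converting to integer during the pivot, using a toy two-country table whose values are copied from the `head()` output above; the names `wide` and `long` are illustrative and not part of the original analysis:

```r
library(tidyr)

# Toy stand-in for the wide UNICEF table (two countries, two years)
wide <- data.frame(
  CountryName = c("Afghanistan", "Algeria"),
  U5MR.1961 = c(356.5, 246.1),
  U5MR.1962 = c(350.6, 246.2)
)

# Pivot every column except CountryName; names_prefix is a regular
# expression, so the dot is escaped to match a literal "."
long <- pivot_longer(
  wide,
  cols = -CountryName,
  names_to = "year",
  names_prefix = "U5MR\\.",
  values_to = "mortality"
)

# After the prefix is removed the year strings are plain digits
long$year <- as.integer(long$year)
```

With `year` as an integer, the long table can be filtered by decade or plotted on a numeric time axis without further string handling.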
#Count the country-years with no mortality estimate
missing_data <- sum(is.na(yearlydata$mortality))
percent_missing_data <- round(missing_data / nrow(yearlydata) * 100)
cat('\n')
cat('Number of country-years missing data:', missing_data,'\n')
## Number of country-years missing data: 2692
cat('Percent of missing data:', percent_missing_data, '%\n')
## Percent of missing data: 21 %
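The overall 21% figure does not show which countries account for the gaps. A small sketch of breaking missingness down by country, using toy long-format rows copied from the 1988-1990 values shown earlier; the names `toy` and `missing_by_country` are hypothetical:

```r
library(dplyr)

# Toy long-format data: Andorra has no estimate before 1990
toy <- data.frame(
  country = c("Albania", "Albania", "Albania", "Andorra", "Andorra", "Andorra"),
  year = c(1988L, 1989L, 1990L, 1988L, 1989L, 1990L),
  mortality = c(44.9, 42.5, 40.6, NA, NA, 8.5),
  stringsAsFactors = FALSE
)

# Count and share of missing country-years per country, worst first
missing_by_country <- toy %>%
  group_by(country) %>%
  summarise(
    n_missing = sum(is.na(mortality)),
    pct_missing = round(100 * mean(is.na(mortality)))
  ) %>%
  arrange(desc(n_missing))
```

Applied to the full long table, the same grouping would indicate which countries' averages rest on only a few recent years.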
mortality_avg <- data %>%
transmute(CountryName,
Mean = rowMeans(select(., c(2:67)),na.rm = TRUE))
mortality_avg <- mortality_avg[order(mortality_avg$Mean, decreasing = FALSE),]
head(mortality_avg)
## CountryName Mean
## 4 Andorra 4.676923
## 113 Monaco 5.214815
## 151 San Marino 6.563333
## 161 Slovenia 7.717143
## 46 Cyprus 9.002703
## 47 Czech Republic 9.585714
#Boxplot the data
ggplot(mortality_avg, aes(x = "", y = Mean)) +
geom_boxplot()
#Note: order() sorts NaN last, so tail() also returns Liechtenstein, which has no data
mortality_avg_top20 <- tail(mortality_avg,20)
mortality_avg_top20
## CountryName Mean
## 85 Cote d Ivoire 183.3508
## 175 Togo 183.5864
## 33 Central African Republic 190.1143
## 166 South Sudan 199.6162
## 154 Senegal 201.2924
## 35 Chad 204.1864
## 1 Afghanistan 207.2182
## 128 Nigeria 208.7269
## 118 Mozambique 208.9745
## 64 Gambia The 209.9067
## 5 Angola 210.8361
## 48 Benin 215.7379
## 105 Malawi 222.5250
## 99 Liberia 223.9966
## 72 Guinea 235.4727
## 190 Burkina Faso 247.1606
## 127 Niger 262.1510
## 157 Sierra Leone 267.9500
## 108 Mali 273.7038
## 101 Liechtenstein NaN
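The `NaN` in the last row arises because `rowMeans(..., na.rm = TRUE)` over a row with no observations returns `NaN`, and `order()` places it at the end, where `tail()` picks it up. A minimal sketch of dropping such rows before ranking, on a toy table with hypothetical country names (`Low`, `High`, `NoData`):

```r
# Toy wide table; "NoData" is a hypothetical country with no observations
toy <- data.frame(
  CountryName = c("Low", "High", "NoData"),
  y1 = c(5, 250, NA),
  y2 = c(4, 240, NA),
  stringsAsFactors = FALSE
)

# Averaging an all-NA row with na.rm = TRUE yields NaN
toy_avg <- data.frame(
  CountryName = toy$CountryName,
  Mean = rowMeans(toy[, -1], na.rm = TRUE),
  stringsAsFactors = FALSE
)

# order() sorts NaN to the end, so it would land in any tail() call
toy_avg <- toy_avg[order(toy_avg$Mean), ]

# Keep only countries with a computable mean, then take the highest two
valid <- toy_avg[!is.na(toy_avg$Mean), ]
top2 <- tail(valid, 2)
```

Filtering first means a "top 20" list holds twenty ranked countries rather than nineteen plus an empty row.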
#Bargraph the top 20 countries with the highest average mortality
ggplot(mortality_avg_top20, aes(x=reorder(CountryName, Mean), y=Mean)) +
geom_bar(stat="identity", fill="lightgreen", color="grey50") +
ggtitle("Countries with Highest Mortality Rates") +
xlab("country") + ylab("mortality rate") +
coord_flip()
mortality_avg_bottom20 <- head(mortality_avg,20)
mortality_avg_bottom20
## CountryName Mean
## 4 Andorra 4.676923
## 113 Monaco 5.214815
## 151 San Marino 6.563333
## 161 Slovenia 7.717143
## 46 Cyprus 9.002703
## 47 Czech Republic 9.585714
## 170 Sweden 10.393939
## 44 Croatia 10.397143
## 103 Luxembourg 10.861538
## 66 Germany 10.866667
## 24 Brunei 11.002941
## 83 Israel 11.042857
## 77 Iceland 11.153846
## 130 Norway 12.460606
## 123 Netherlands 12.565152
## 19 Bosnia & Herzegovina 12.940625
## 59 Finland 12.949231
## 116 Montenegro 13.256250
## 49 Denmark 13.348485
## 57 Estonia 13.861111
#Bargraph the countries with the lowest mortality rates
ggplot(mortality_avg_bottom20, aes(x=reorder(CountryName, -Mean), y=Mean)) +
geom_bar(stat="identity", fill="lightgreen", color="grey50") +
ggtitle("Countries with Lowest Mortality Rates") +
xlab("country") + ylab("mortality rate") +
coord_flip()
#What is the US average mortality rate
subset(mortality_avg, CountryName == 'United States of America')
## CountryName Mean
## 189 United States of America 17.46667
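There are 196 countries in the dataset. Across all country-years from 1950 to 2015, the under-5 mortality rate (conventionally reported as deaths per 1,000 live births) averages 85 with a median of 54, and the middle 50% of observations fall between about 20 and 128.
Averaged over the years with data, Mali has the highest mortality rate at 274, and the rate falls to 183 for Cote d'Ivoire, the 19th highest; the 20th row of that list, Liechtenstein, has no data and shows NaN. At the other end, Andorra has the lowest average at 5, and the 20th lowest is Estonia at 14. Interestingly, the United States ranks 33rd with an average of 17. Because each average covers only the years a country reported, countries whose series begin late (Andorra's first value among the rows shown is for 1990) are not directly comparable with countries reporting since the 1950s.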
Recommendation for further study of this data: align each country with its continent and evaluate countries by continent, similar to the earlier datasets in this assignment where states were analyzed by region.
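One way the continent comparison could be approached is sketched below. The per-country means are copied from the tables above, while the `continents` lookup table is a hypothetical stand-in for a real country-to-continent mapping that would need to be sourced separately:

```r
library(dplyr)

# Per-country means taken from the tables above
avg <- data.frame(
  CountryName = c("Mali", "Niger", "Andorra", "Sweden"),
  Mean = c(273.7038, 262.1510, 4.676923, 10.393939),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table; a full analysis needs one row per country
continents <- data.frame(
  CountryName = c("Mali", "Niger", "Andorra", "Sweden"),
  Continent = c("Africa", "Africa", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Attach the continent, then summarise the country means per continent
by_continent <- avg %>%
  left_join(continents, by = "CountryName") %>%
  group_by(Continent) %>%
  summarise(
    n_countries = n(),
    mean_u5mr = mean(Mean, na.rm = TRUE)
  )
```

This is an unweighted mean of country averages, so small and large countries count equally; weighting by births or population would be a separate design choice.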