The United Nations Development Programme publishes an annual report on human development across the globe. An interactive tool to visualize some of the data is available at http://hdr.undp.org/en/data
Download the data for 2018 from the left side tab stating Download all 2018 Human Development data. This will lead you to an Excel document. Open it and comment on the structure. Focus on table 2: what problems do you see with the file format? List three problems and explain.
The Table 2 contains Countries and their HDI overal rank as well as their HDI over time and changes in their classification. One main problem is that the data is messy and the countries with the same overal Classification are in the same category, for example Very High HDI , High HDI, etc. ;However, their classification over time has not been considered. For instance, Ireland belongs to High HDI in 1990 and to Very High HDI in 2017, but it is in Very HDI category.
There is an R package available online (on CRAN) that allows to read Excel files. Find the package, (install it if necessary), load it and try to read the second table of the file “2018_statistical_annex_all.xlsx” into R (make use of the available parameters). Compare the result in R with what is visible in Excel. Keep track of and comment on all steps in the process. For the write-up assume, that the Excel file is in the same folder as the homework Rmd file. Also assume, that both the TA and I have an updated version of the package installed, i.e. only load the package but do not install it in the code.
library(readxl)
d<- read_excel(path = "2018_statistical_annex_all.xlsx", sheet = "Table 2",
range = cell_limits(c(4, "A"), c(205,"Q")))%>%
select("Country", "Value", "X__2", "X__4", "X__6", "X__8", "X__10", "X__12",
"X__14" )
## Warning in cell_limits(c(4, "A"), c(205, "Q")): NAs introduced by coercion
## Warning in cell_limits(c(4, "A"), c(205, "Q")): NAs introduced by coercion
d%>%head()
## # A tibble: 6 x 9
## Country Value X__2 X__4 X__6 X__8 X__10 X__12 X__14
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> 1990 2000 2010 2012 2014 2015 2016 2017
## 2 VERY HIGH~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Norway 0.8500~ 0.9166~ 0.9415~ 0.9421~ 0.9456~ 0.9480~ 0.951~ 0.952~
## 4 Switzerla~ 0.8319~ 0.8890~ 0.9319~ 0.9351~ 0.9393~ 0.9418~ 0.943~ 0.943~
## 5 Australia 0.8662~ 0.8982~ 0.9230~ 0.9287~ 0.9332~ 0.9356~ 0.937~ 0.938~
## 6 Ireland 0.7634~ 0.8566~ 0.9094~ 0.9017~ 0.9211~ 0.9292~ 0.934~ 0.938~
In reading the excel file into R, the difference is that any empty column in the excel sheet is reported as NA in R output. HDI are considred characters in this data set. Column names are different from the excel file and it is difficult to read them in R.
Introduce names to the data extracted for Table 2.
names(d)<-c("Country", 1990, 2000, 2010, 2012, 2014, 2015,2016,
"2017")
d%>% head()
## # A tibble: 6 x 9
## Country `1990` `2000` `2010` `2012` `2014` `2015` `2016` `2017`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> 1990 2000 2010 2012 2014 2015 2016 2017
## 2 VERY HIG~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Norway 0.8500~ 0.9166~ 0.9415~ 0.9421~ 0.9456~ 0.9480~ 0.9512~ 0.952~
## 4 Switzerl~ 0.8319~ 0.8890~ 0.9319~ 0.9351~ 0.9393~ 0.9418~ 0.9431~ 0.943~
## 5 Australia 0.8662~ 0.8982~ 0.9230~ 0.9287~ 0.9332~ 0.9356~ 0.9378~ 0.938~
## 6 Ireland 0.7634~ 0.8566~ 0.9094~ 0.9017~ 0.9211~ 0.9292~ 0.9344~ 0.938~
Use tidyr and dplyr functionality to get the data set into the following format (Classification of HDI is very high for [0.8,1], high for [0.7, .8), medium for [.55, .7) and low below .55):
d<-d[-c(1,2, 62,116,156,195),]
d<-d%>%
gather(key=Year, value = HDI, 2:9 )
#make classification
Classification<-d$HDI
Classification[(Classification<1) & (Classification >=0.8)]<-"very high"
Classification[(Classification<.8) & (Classification >=0.7)]<-"high"
Classification[(Classification<.7) & (Classification >=0.55)]<-"medium"
Classification[Classification<0.55]<-"low"
d<-d%>%
mutate(Classification= as.factor(Classification),
HDI = as.numeric(HDI))
## Warning in evalq(as.numeric(HDI), <environment>): NAs introduced by
## coercion
d%>% head()
## # A tibble: 6 x 4
## Country Year HDI Classification
## <chr> <chr> <dbl> <fct>
## 1 Norway 1990 0.850 very high
## 2 Switzerland 1990 0.832 very high
## 3 Australia 1990 0.866 very high
## 4 Ireland 1990 0.763 high
## 5 Germany 1990 0.801 very high
## 6 Iceland 1990 0.802 very high
Use ggplot to draw a plot of HDI over time for the US, Norway, Germany, and Canada.
d%>%
filter(Country %in% c( "United States", "Norway", "Germany", "Canada"))%>%
ggplot(aes(x = `Year`, y = `HDI`, color=Country, group= Country))+
geom_point()+
geom_line()
How many countries switched classifications between 2000 and 2017? How many countries went up, how many countries went down?
d1<-d%>%
filter(Year %in% c(2000, 2017))%>%
select(-HDI)%>%
spread(key = Year, value = Classification)%>%
filter(!`2000`== `2017`)
d1%>%head()
## # A tibble: 6 x 3
## Country `2000` `2017`
## <chr> <fct> <fct>
## 1 Albania medium high
## 2 Algeria medium high
## 3 Andorra high very high
## 4 Angola low medium
## 5 Antigua and Barbuda low high
## 6 Argentina high very high
d1%>%tally()
## # A tibble: 1 x 1
## n
## <int>
## 1 95
Ther are 95 countries that switched classification between the years 2000 and 2017
d%>%
filter(Year %in% c(2000, 2017))%>%
select(-HDI)%>%
spread(key = Year, value = Classification)%>%
filter(!`2000`== `2017`)%>%
filter(`2000` %in% c("very high", "high", "medium") &
!`2017` %in% c("very high", "high", "medium"))
## # A tibble: 1 x 3
## Country `2000` `2017`
## <chr> <fct> <fct>
## 1 Syrian Arab Republic medium low
There is only one country which went down from the Medium classification for HDI to Low classification.
d2<-d%>%
filter(Year %in% c(2000, 2017))%>%
select(-HDI)%>%
spread(key = Year, value = Classification)%>%
filter(!`2000`== `2017`)%>%
filter(`2000` %in% c("low", "medium", "high") &
`2017` %in% c( "medium", "high", "very high"))
d2%>%head()
## # A tibble: 6 x 3
## Country `2000` `2017`
## <chr> <fct> <fct>
## 1 Albania medium high
## 2 Algeria medium high
## 3 Andorra high very high
## 4 Angola low medium
## 5 Antigua and Barbuda low high
## 6 Argentina high very high
d2%>%tally()
## # A tibble: 1 x 1
## n
## <int>
## 1 94
94 countries went up in HDI classification between 2000 and 2017.