Required packages
library(readxl)
library(tidyr)
library(dplyr)
library(outliers)
library(infotheo)
Executive Summary
The objective of this report was to make a tidy dataset of countries with two indicators:
- Employment rates by education levels: the percentage of employed 25-64 year-olds by all 25-64 year-olds with the same education level
- Educational attainment: the highest level of education achieved as a percentage to the total population of 25-64 year-olds
Meanwhile, since income and educational attainment may be correlated, I added gross national income (GNI) and classified countries into different income levels. To my own interest, the working hour was also added to complement this classification. I adopted a combination of income level and the length of working hours to group countries, such as “High income & Short hours” and “Low income & Long hours”. To accomplish this, I obtained multiple datasets with the variables needed and joined them into a tidy one. Necessary scanning steps were taken to deal with missing values and to identify the outliers. For the classification, I created new variables of each country’s average GNI and working hours over three years and used the binning method to group the countries.
Data
Member countries of OECD (The Organisation for Economic Co-operation and Development) were investigated. I looked at three education levels for both employment rate and educational attainment: below upper secondary, upper secondary, and tertiary. The measures cover the latest three years available: 2014 to 2016. Five datasets were generated for further process. I introduced the main variables in each dataset below:
- GNI: Location(country code), Time(year), Value(gross national income: US dollars per capita); it is a CSV file downloaded from: data.oecd.org/natincome/gross-national-income.htm
- hours: Location(country code), Time(year), Value(average annual working hours per worker); it is a CSV file downloaded from: data.oecd.org/emp/hours-worked.htm
- employment: Location(country code), Subject(education level), Time(year),Value(employment rate); it is a CSV file downloaded from: data.oecd.org/emp/employment-by-education-level.htm
- education: Location(country code), Subject(education level), Time(year),Value(educational attainment); it is a CSV file downloaded from: data.oecd.org/eduatt/adult-education-level.htm
- code: Code value(country code), Definition(country’s full name); it is an EXCEL file downloaded from: www.dnb.com/content/dam/english/dnb-solutions/sales-and-marketing/iso_3digit_alpha_country_codes.xls
setwd("~/Desktop/DP Assignment 3")
GNI<-read.csv("GNI.csv",stringsAsFactors = FALSE)
GNI[1:3, ]
hours<-read.csv("hours worked.csv",stringsAsFactors = FALSE)
hours[1:3, ]
employment<-read.csv("employment.csv",stringsAsFactors = FALSE)
employment[1:3, ]
education<-read.csv("EducationLevel.csv",stringsAsFactors = FALSE)
education[1:3, ]
code<-read_excel("code.xls",skip=1)
code[1:3, ]
- I opened a new folder called “DP Assignment 3” on Desktop and saved the five datasets into it. This folder was set as the working directory using “setwd()”.
- CSV files were imported using Base R functions. I set “stringsAsFactors = FALSE” to read the variables in their original forms.
- The EXCEL file was imported using “read_excel()”.
- In order to save space for my report, I displayed only the first three rows of each dataset.
Understand
str(GNI)
'data.frame': 137 obs. of 8 variables:
$ LOCATION : chr "AUS" "AUS" "AUS" "AUT" ...
$ INDICATOR : chr "GNI" "GNI" "GNI" "GNI" ...
$ SUBJECT : chr "TOT" "TOT" "TOT" "TOT" ...
$ MEASURE : chr "USD_CAP" "USD_CAP" "USD_CAP" "USD_CAP" ...
$ FREQUENCY : chr "A" "A" "A" "A" ...
$ TIME : int 2014 2015 2016 2014 2015 2016 2014 2015 2016 2014 ...
$ Value : num 45953 45710 46885 48866 49594 ...
$ Flag.Codes: chr "" "" "" "" ...
str(hours)
'data.frame': 112 obs. of 8 variables:
$ LOCATION : chr "AUS" "AUS" "AUS" "AUT" ...
$ INDICATOR : chr "HRWKD" "HRWKD" "HRWKD" "HRWKD" ...
$ SUBJECT : chr "TOT" "TOT" "TOT" "TOT" ...
$ MEASURE : chr "HR_WKD" "HR_WKD" "HR_WKD" "HR_WKD" ...
$ FREQUENCY : chr "A" "A" "A" "A" ...
$ TIME : int 2014 2015 2016 2014 2015 2016 2014 2015 2014 2015 ...
$ Value : int 1681 1682 1669 1628 1608 1601 1556 1551 1704 1707 ...
$ Flag.Codes: logi NA NA NA NA NA NA ...
str(employment)
'data.frame': 363 obs. of 8 variables:
$ LOCATION : chr "NZL" "NZL" "NZL" "GBR" ...
$ INDICATOR : chr "EMPEDU" "EMPEDU" "EMPEDU" "EMPEDU" ...
$ SUBJECT : Factor w/ 3 levels "Below_upper_secondary",..: NA NA NA NA NA NA NA NA NA NA ...
$ MEASURE : chr "PC_25_64" "PC_25_64" "PC_25_64" "PC_25_64" ...
$ FREQUENCY : chr "A" "A" "A" "A" ...
$ TIME : int 2014 2015 2016 2014 2015 2016 2014 2015 2016 2014 ...
$ Value : num 80.3 81.3 82.1 84.6 85.4 ...
$ Flag.Codes: logi NA NA NA NA NA NA ...
str(education)
'data.frame': 363 obs. of 8 variables:
$ LOCATION : chr "AUS" "AUS" "AUS" "AUS" ...
$ INDICATOR : chr "EDUADULT" "EDUADULT" "EDUADULT" "EDUADULT" ...
$ SUBJECT : Factor w/ 3 levels "Below_upper_secondary",..: NA NA NA NA NA NA NA NA NA NA ...
$ MEASURE : chr "PC_25_64" "PC_25_64" "PC_25_64" "PC_25_64" ...
$ FREQUENCY : chr "A" "A" "A" "A" ...
$ TIME : int 2014 2015 2016 2014 2015 2016 2014 2015 2016 2014 ...
$ Value : num 22.9 21 20.1 41.9 42.9 ...
$ Flag.Codes: logi NA NA NA NA NA NA ...
str(code)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 239 obs. of 2 variables:
$ Code Value: chr "AFG" "ALA" "ALB" "DZA" ...
$ Definition: chr "Afghanistan" "Aland Islands" "Albania" "Algeria" ...
employment<-employment %>% mutate(SUBJECT=factor(SUBJECT,levels=c("BUPPSRY","UPPSRY_NTRY","TRY"),labels =c("Below_upper_secondary","Upper_secondary","Tertiary")))
education<-education %>% mutate(SUBJECT=factor(SUBJECT,levels=c("BUPPSRY","UPPSRY","TRY"),labels=c("Below_upper_secondary","Upper_secondary","Tertiary")))
I inspected the structure of the data frame and the type of variables using “str()”. There were character, numeric and logical values. Since there was no factors in my datasets, I converted the character variables “Subject” to factors and labelled them using “mutate()” and “factor()”. “Subject” includes three education levels that are suitable for a convertion to factors.
Tidy & Manipulate Data I
GNInew<-GNI %>% select(-INDICATOR,-SUBJECT,-MEASURE,-FREQUENCY,-Flag.Codes) %>% unite(Location_Year,LOCATION,TIME)
colnames(GNInew)[2]<-"GNI"
hoursnew<-hours%>%select(-INDICATOR,-SUBJECT,-MEASURE,-FREQUENCY,-Flag.Codes) %>% unite(Location_Year,LOCATION,TIME)
colnames(hoursnew)[2]<-"HRWKD"
employmentnew<-employment %>% select(-INDICATOR,-MEASURE,-FREQUENCY,-Flag.Codes) %>% unite(Location_Year_Subject, LOCATION, TIME,SUBJECT)
colnames(employmentnew)[2]<-"EMP"
educationnew<-education %>% select(-INDICATOR,-MEASURE,-FREQUENCY,-Flag.Codes) %>% unite(Location_Year_Subject,LOCATION,TIME,SUBJECT)
colnames(educationnew)[2]<-"EDU"
df1<-full_join(educationnew,employmentnew,by="Location_Year_Subject")
df1<-df1 %>% separate(1,into=c("Location","Year","Education_level"),extra = "merge",fill="left") %>% unite(Location_Year, Location, Year)
df1[1:3, ]#an example of df1
df2<-right_join(GNInew,hoursnew,by="Location_Year")
df2[1:3, ]#an example of df2
df<-left_join(df2,df1,by="Location_Year")
df<-separate(df,1,into=c("Country_code","Year"))
colnames(code)[1]<-"Country_code"
df<-right_join(code,df,by="Country_code")
df
I joined five datasets into one tidy dataset named “df”. To accomplish this, I took the below steps:
I deleted redundant variables to keep only the variables I need using “select()” for each dataset.
For GNI and hours, I combined two variables “Location” and “Time” using “unite()” and set it as a unique variable to match for joining purpose. I renamed “Value” to the corresponding indicators “GNI” and “HRWKD”.
For employment and education, I combined “Location”, “Time” and “Subject” using “unite()” and set it as a unique variable to match for joining purpose. I renamed “Value” to the corresponding indicators “EMP” and “EDU”.
I joined employmentnew and educationnew by “Location_Year_Subject” into a new dataset df1 using “full_join()” to keep all the values from both datasets. Then I separated the variable “Location_Year_Subject” into two variables “Location_Year” and “Education_level” for further joining purpose.
I joined matching rows from GNInew to hoursnew by “Location_Year” into a new dataset df2 using “right_join()” so that the variale “HRWKD” was placed to the right and all values from hoursnew were kept because hoursnew only contained countries that are OECD member countries.
I joined matching rows from df1 to df2 by “Location_Year” into the final combined dataset df using “left_join()” to keep all values of df2. Then I separated the variable “Location_Year” into “Country_code” and “Year”. At last, I joined matching rows from code to df by “Country_code” using “right_join()” so that I could have a new column “Definition” to check the full name of each country code.
Tidy & Manipulate Data II
freq<-data.frame(table(df$Country_code,df$Year))
freq<-spread(freq,key = Var2, value = Freq)
freq[1:3, ]#an example of freq
which(freq$`2014`==0)
integer(0)
which(freq$`2015`==0)
integer(0)
which(freq$`2016`==0)
[1] 3 5
freq$Var1[3:5]
[1] BEL CAN CHE
38 Levels: AUS AUT BEL CAN CHE CHL CRI CZE DEU DNK ESP EST FIN FRA GBR GRC HUN IRL ISL ISR ITA JPN KOR LTU LUX ... USA
dffreq1<-df[!(df$`Country_code`=="BEL"), ]
dfnew<-dffreq1[!(dffreq1$`Country_code`=="CHE"), ]
dfnew<-dfnew %>% group_by(`Country_code`) %>% mutate(Mean_GNI=mean(unique(GNI)),Mean_hours=mean(unique(HRWKD)))
dfnew
After joining the datasets, I assumed all countries have both GNI and working hours values available for three years, 2014 to 2016. I checked my assumption using “table()” to see the frequency of a country under each year. I converted the table to a dataframe for easy checking. Normally, each country should appear three times (for three education levels in a year) or at least once under each year. I found that two countries “BEL” and “CHE” do not match with this criteria. They have no rows for year 2016. It was because that these two countries have no “HRKWD” values in year 2016 or they have no values for both “GNI” and “HRKWD” in year 2016. If they have “HRKWD” values in year 2016, their 2016 rows should apprear because in the above joining process I kept all rows of the working hours dataset.
To classify countries based on a combination of their income level and working hours, I need to calculate both the average GNI and working hours over three years 2014-2016 for each country. Therefore I excluded these two countries in my report. Then I created two new columns of mean values using “group_by()” and “mutate()”. I added “unique()” in this function to ensure the value of each year only appears once in the calculation to get the correct mean value.
Scan I
colSums(is.na(dfnew))
Country_code Definition Year GNI HRWKD Education_level EDU
0 3 0 0 0 7 7
EMP Mean_GNI Mean_hours
7 0 0
which(is.na(dfnew$Definition))
[1] 275 276 277
head(dfnew[275:277, ])
dfnew1<-dfnew[!(dfnew$`Country_code`=="OECD"), ]
colSums((is.na(dfnew1)))
Country_code Definition Year GNI HRWKD Education_level EDU
0 0 0 0 0 4 4
EMP Mean_GNI Mean_hours
4 0 0
which(is.na(dfnew1$`Education_level`))==which(is.na(dfnew1$EDU))
[1] TRUE TRUE TRUE TRUE
which(is.na(dfnew1$`Education_level`))==which(is.na(dfnew1$EMP))
[1] TRUE TRUE TRUE TRUE
which(is.na(dfnew1$`Education_level`))
[1] 106 236 240 265
dfnew1[c(106,236,240,265), ]
#CHL
r1<-dfnew1[237:239, ]
dfnew2 <- rbind(dfnew1[1:236,],r1,r1,dfnew1[-(1:236),])
c1<-c("2014",21903.17218,1990)
dfnew2[237:239,3:5]<-rbind(c1,c1,c1)
c2<-c("2016",22355.9843,1974)
dfnew2[243:245,3:5]<-rbind(c2,c2,c2)
dfnew3<-dfnew2[-c(246,236), ]
dfnew3 %>% filter(Country_code=="CHL")
#IRL and RUS
r3<-dfnew3[106, ]
dfnew4<-rbind(dfnew3[1:106, ],r3,r3,dfnew3[-(1:106), ])
r4<-dfnew4[271, ]
dfnew5<-rbind(dfnew4[1:271, ],r4,r4,dfnew4[-(1:271), ])
c3<-c("Below_upper_secondary","Tertiary","Upper_secondary")
dfnew5[106:108,6]<-c3
dfnew5[271:273,6]<-c3
dfnew6<-dfnew5%>%group_by(`Country_code`,`Education_level`)%>% mutate(EDU=ifelse(is.na(EDU),mean(EDU,na.rm=TRUE),EDU)) %>% mutate(EMP=ifelse(is.na(EMP),mean(EMP,na.rm=TRUE),EMP))
dfnew6 %>% filter(Country_code=="IRL")
dfnew6 %>% filter(Country_code=="RUS")
Missing Value: I checked the number of missing values in each column. The three missing values in variable “Definition” are “OECD”, which represents an average value for all OECD countries. I deleted these three rows because I want to study each country separately. For all the other missing values in “Education_level”,“EMP” and “EDU”, I checked that they were in the same rows. Three countries included missing values: Ireland, Chile and Russia. Chile only had employment rate and educational attainment available in year 2015. I copied its “EMP” and “EDU” values for each education level of year 2015 to year 2014 and 2016. For Ireland and Russia, they had employment rate and educational attainment availble in year 2014 and 2015. I imputed the average values of employment rate and educational attainment for each education level in year 2014 and year 2015 to year 2016.
dfnew6$GNI<-as.numeric(dfnew6$GNI)
dfnew6$HRWKD<-as.integer(dfnew6$HRWKD)
dfnew6$Education_level<-as.factor(dfnew6$Education_level)
m<-cbind(dfnew6$GNI,dfnew6$HRWKD,dfnew6$EDU,dfnew6$EMP,dfnew6$Mean_GNI,dfnew6$Mean_hours) #with a reason
sum(is.nan(m))
[1] 0
sum(is.infinite(m))
[1] 0
Special Value: I checked the type of variables again and found that “GNI”,“HRWKD” and “Education_level” have become characters. It was probably due to the above “unite()”, “separate()” and “rbind()” process that forced all variables into one type. “Education_level” was converted to factors. “GNI” and “HRWKD” were converted to numeric values for further checking on special values. Functions “is.nan()” and “is.infinite()” only work for vectorial input. Therefore I combined all numeric values into one matrix. It turned out that there were no special values.
sum(dfnew6$GNI<0)
[1] 0
sum(dfnew6$HRWKD<0)
[1] 0
sum(dfnew6$EDU<=0 |dfnew6$EDU>=100)
[1] 0
sum(dfnew6$EMP<=0 |dfnew6$EMP>=100)
[1] 0
sum(dfnew6$Mean_GNI<0)
[1] 0
sum(dfnew6$Mean_hours<0)
[1] 0
checkEDU100<-dfnew6 %>% group_by(`Country_code`,Year) %>% mutate(sumEDU=sum(EDU))
which(round(checkEDU100$sumEDU,digits = 0)!=100)
[1] 118 119 120
dfnew6[118:120,]
Inconsistency:
- All values of “GNI”,“HRWKD”,“Mean_GNI” and “Mean_hours” should be greater than zero. After checking, the dataset obeys this rule.
- All values of “EDU” and “EMP” are percentages, therefore they should be within 0-100. After checking, the dataset obeys this rule.
- For each year in every country, the sum of “EDU” of three education levels should be 100. The educational attainment (“EDU”) represents the percentage of the 25-64 year-olds who achieved Tertiary/Upper secondary/Below upper secondary as the highest education level. These three education levels cover all the situtations. After checking, Japan does not fulfil the requirments because it only has the values of “EDU” for Tertiary level. I did not apply the same method to check “EMP” because the employment rate represents the percentage of employed 25-64 year-olds by all 25-64 year-olds with the same education level. They cannot be added up based on different population.
Scan II
z.scores<-dfnew6$GNI %>%scores(type="z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.8250 -0.7893 -0.1215 0.0000 0.7900 2.4540
which(abs(z.scores)>3)
integer(0)
z.scores<-dfnew6$HRWKD %>%scores(type="z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.89600 -0.57890 -0.03397 0.00000 0.57730 2.33100
which(abs(z.scores)>3)
integer(0)
z.scores<-dfnew6$Mean_GNI %>%scores(type="z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.7920 -0.7724 -0.1372 0.0000 0.8405 2.3340
which(abs(z.scores)>3)
integer(0)
z.scores<-dfnew6$Mean_hours %>%scores(type="z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.88700 -0.56860 -0.04827 0.00000 0.53370 2.30200
which(abs(z.scores)>3)
integer(0)
length(boxplot(dfnew6$EDU ~ dfnew6$Education_level, main="Educational attainment by Education level", ylab = "Education level", xlab = "Educational attainment")$out)
[1] 9

length(boxplot(dfnew6$EMP ~ dfnew6$Education_level, main="Employment rate by Education level", ylab = "Education level", xlab = "Employment rate")$out)
[1] 7

#identify the 9 outliers in educational attainment above the upper fence for "Below_upper_secondary" level
outlierEDU<-dfnew6 %>% filter(Education_level=="Below_upper_secondary") %>% select(-EMP)
outlierEDU<-outlierEDU[order(outlierEDU$EDU,decreasing=TRUE), ]
outlierEDU[1:9, ]
#identify the 1 outliers in employment rate above the upper fence for "Tertiary" level
outlierEMP1<-dfnew6 %>% filter(Education_level=="Tertiary") %>% select(-EDU)
outlierEMP1up<-outlierEMP1[order(outlierEMP1$EMP,decreasing=TRUE), ]
outlierEMP1up[1, ]
#identify the 3 outliers in employment rate below the lower fence for "Tertiary" level
outlierEMP1low<-outlierEMP1[order(outlierEMP1$EMP), ]
outlierEMP1low[1:3, ]
#identify the 3 outlier in employment rate below the lower fence for "Upper_secondary" level
outlierEMP2<-dfnew6 %>% filter(Education_level=="Upper_secondary") %>% select(-EDU)
outlierEMP2<-outlierEMP2[order(outlierEMP2$EMP), ]
outlierEMP2[1:3, ]
- For “GNI”,“HRWKD”,“Mean_GNI” and “Mean_hours”, I applied z-score method to identify outliers. After checking, there were no suggested outliers for them. It may imply that the values of gross national income and working hours for these countries remain in a reasonable range. No country stands out regarding gross national income or working hours.
- For the employment rate (“EMP”) and educational attainment (“EDU”), I applied bivariate box plots so that I can detect outliers by education levels. 9 outliers were identified in the educational attainment for “Below_upper_secondary” level. 4 outliers were identified in the employment rate for “Tertiary” level, with 1 above the upper fence and 3 below the lower fence. 3 outliers were identified in the employment rate for “Upper_secondary” level. I decided not to deal with these outliers. I compared the data on a global scale. It is reasonable for different countries to have different conditions on their education and employment.
---
title: "MATH2349 Semester 1, 2018"
author: "Yifan FENG s3704476"
subtitle: Assignment 3
output:
  html_notebook: default
  html_document:
    df_print: paged
---

## Required packages 

```{r}
library(readxl)
library(tidyr)
library(dplyr)
library(outliers)
library(infotheo)
```

## Executive Summary 

The objective of this report was to make a tidy dataset of countries with two indicators: 

* Employment rates by education levels: the percentage of employed 25-64 year-olds by all 25-64 year-olds with the same education level
* Educational attainment: the highest level of education achieved as a percentage to the total population of 25-64 year-olds

Meanwhile, since income and educational attainment may be correlated, I added gross national income (GNI) and classified countries into different income levels. To my own interest, the working hour was also added to complement this classification. I adopted a combination of income level and the length of working hours to group countries, such as "High income & Short hours" and "Low income & Long hours". To accomplish this, I obtained multiple datasets with the variables needed and joined them into a tidy one. Necessary scanning steps were taken to deal with missing values and to identify the outliers. For the classification, I created new variables of each country's average GNI and working hours over three years and used the binning method to group the countries. 

## Data

Member countries of OECD (The Organisation for Economic Co-operation and Development) were investigated. I looked at three education levels for both employment rate and educational attainment: below upper secondary, upper secondary, and tertiary. The measures cover the latest three years available: 2014 to 2016. Five datasets were generated for further process. I introduced the main variables in each dataset below:

* GNI: Location(country code), Time(year), Value(gross national income: US dollars per capita); it is a CSV file downloaded from: <span style="color:blue">*data.oecd.org/natincome/gross-national-income.htm*</span>
* hours: Location(country code), Time(year), Value(average annual working hours per worker); it is a CSV file downloaded from: <span style="color:blue">*data.oecd.org/emp/hours-worked.htm*</span>
* employment: Location(country code), Subject(education level), Time(year),Value(employment rate); it is a CSV file downloaded from: <span style="color:blue">*data.oecd.org/emp/employment-by-education-level.htm*</span>
* education: Location(country code), Subject(education level), Time(year),Value(educational attainment); it is a CSV file downloaded from: <span style="color:blue">*data.oecd.org/eduatt/adult-education-level.htm*</span>
* code: Code value(country code), Definition(country's full name); it is an EXCEL file downloaded from: <span style="color:blue">*www.dnb.com/content/dam/english/dnb-solutions/sales-and-marketing/iso_3digit_alpha_country_codes.xls*</span>

```{r}
setwd("~/Desktop/DP Assignment 3")
GNI<-read.csv("GNI.csv",stringsAsFactors = FALSE)
GNI[1:3, ]
hours<-read.csv("hours worked.csv",stringsAsFactors = FALSE)
hours[1:3, ]
employment<-read.csv("employment.csv",stringsAsFactors = FALSE)
employment[1:3, ]
education<-read.csv("EducationLevel.csv",stringsAsFactors = FALSE)
education[1:3, ]
code<-read_excel("code.xls",skip=1)
code[1:3, ]
```
1. I opened a new folder called "DP Assignment 3" on Desktop and saved the five datasets into it. This folder was set as the working directory using "setwd()".
2. CSV files were imported using Base R functions. I set "stringsAsFactors = FALSE" to read the variables in their original forms.
3. The EXCEL file was imported using "read_excel()".
4. In order to save space for my report, I displayed only the first three rows of each dataset.

## Understand 

```{r}
str(GNI)
str(hours)
str(employment)
str(education)
str(code)
employment<-employment %>% mutate(SUBJECT=factor(SUBJECT,levels=c("BUPPSRY","UPPSRY_NTRY","TRY"),labels =c("Below_upper_secondary","Upper_secondary","Tertiary")))
education<-education %>% mutate(SUBJECT=factor(SUBJECT,levels=c("BUPPSRY","UPPSRY","TRY"),labels=c("Below_upper_secondary","Upper_secondary","Tertiary")))
```
I inspected the structure of the data frame and the type of variables using "str()". There were character, numeric and logical values. Since there was no factors in my datasets, I converted the character variables "Subject" to factors and labelled them using "mutate()" and "factor()". "Subject" includes three education levels that are suitable for a convertion to factors.

##	Tidy & Manipulate Data I 

```{r}
GNInew<-GNI %>% select(-INDICATOR,-SUBJECT,-MEASURE,-FREQUENCY,-Flag.Codes) %>% unite(Location_Year,LOCATION,TIME)
colnames(GNInew)[2]<-"GNI"

hoursnew<-hours%>%select(-INDICATOR,-SUBJECT,-MEASURE,-FREQUENCY,-Flag.Codes) %>% unite(Location_Year,LOCATION,TIME)
colnames(hoursnew)[2]<-"HRWKD"

employmentnew<-employment %>% select(-INDICATOR,-MEASURE,-FREQUENCY,-Flag.Codes) %>% unite(Location_Year_Subject, LOCATION, TIME,SUBJECT)
colnames(employmentnew)[2]<-"EMP"

educationnew<-education %>% select(-INDICATOR,-MEASURE,-FREQUENCY,-Flag.Codes) %>% unite(Location_Year_Subject,LOCATION,TIME,SUBJECT)
colnames(educationnew)[2]<-"EDU"

df1<-full_join(educationnew,employmentnew,by="Location_Year_Subject")
df1<-df1 %>% separate(1,into=c("Location","Year","Education_level"),extra = "merge",fill="left") %>% unite(Location_Year, Location, Year)
df1[1:3, ]#an example of df1

df2<-right_join(GNInew,hoursnew,by="Location_Year")
df2[1:3, ]#an example of df2

df<-left_join(df2,df1,by="Location_Year")
df<-separate(df,1,into=c("Country_code","Year"))
colnames(code)[1]<-"Country_code"
df<-right_join(code,df,by="Country_code")
df
```
I joined five datasets into one tidy dataset named "df". To accomplish this, I took the below steps:

1. I deleted redundant variables to keep only the variables I need using "select()" for each dataset. 

2. For GNI and hours, I combined two variables "Location" and "Time" using "unite()" and set it as a unique variable to match for joining purpose. I renamed "Value" to the corresponding indicators "GNI" and "HRWKD".

3. For employment and education, I combined "Location", "Time" and "Subject" using "unite()" and set it as a unique variable to match for joining purpose. I renamed "Value" to the corresponding indicators "EMP" and "EDU".

3. I joined employmentnew and educationnew by "Location_Year_Subject" into a new dataset df1 using "full_join()" to keep all the values from both datasets. Then I separated the variable "Location_Year_Subject" into two variables "Location_Year" and "Education_level" for further joining purpose. 

4. I joined matching rows from GNInew to hoursnew by "Location_Year" into a new dataset df2 using "right_join()" so that the variale "HRWKD" was placed to the right and all values from hoursnew were kept because hoursnew only contained countries that are OECD member countries.

5. I joined matching rows from df1 to df2 by "Location_Year" into the final combined dataset df using "left_join()" to keep all values of df2. Then I separated the variable "Location_Year" into "Country_code" and "Year". At last, I joined matching rows from code to df by "Country_code" using "right_join()" so that I could have a new column "Definition" to check the full name of each country code.

##	Tidy & Manipulate Data II 

```{r}
freq<-data.frame(table(df$Country_code,df$Year))
freq<-spread(freq,key = Var2, value = Freq)
freq[1:3, ]#an example of freq
which(freq$`2014`==0)
which(freq$`2015`==0)
which(freq$`2016`==0)
freq$Var1[3:5]
dffreq1<-df[!(df$`Country_code`=="BEL"), ]
dfnew<-dffreq1[!(dffreq1$`Country_code`=="CHE"), ]
dfnew<-dfnew %>% group_by(`Country_code`) %>% mutate(Mean_GNI=mean(unique(GNI)),Mean_hours=mean(unique(HRWKD)))
dfnew
```
After joining the datasets, I assumed all countries have both GNI and working hours values available for three years, 2014 to 2016. I checked my assumption using "table()" to see the frequency of a country under each year. I converted the table to a dataframe for easy checking. Normally, each country should appear three times (for three education levels in a year) or at least once under each year. I found that two countries "BEL" and "CHE" do not match with this criteria. They have no rows for year 2016. It was because that these two countries have no "HRKWD" values in year 2016 or they have no values for both "GNI" and "HRKWD" in year 2016. If they have "HRKWD" values in year 2016, their 2016 rows should apprear because in the above joining process I kept all rows of the working hours dataset. 

To classify countries based on a combination of their income level and working hours, I need to calculate both the average GNI and working hours over three years 2014-2016 for each country. Therefore I excluded these two countries in my report. Then I created two new columns of mean values using "group_by()" and "mutate()". I added "unique()" in this function to ensure the value of each year only appears once in the calculation to get the correct mean value. 

##	Scan I 
```{r}
colSums(is.na(dfnew))
which(is.na(dfnew$Definition))
head(dfnew[275:277, ])
dfnew1<-dfnew[!(dfnew$`Country_code`=="OECD"), ]

colSums((is.na(dfnew1)))
which(is.na(dfnew1$`Education_level`))==which(is.na(dfnew1$EDU))
which(is.na(dfnew1$`Education_level`))==which(is.na(dfnew1$EMP))
which(is.na(dfnew1$`Education_level`))
dfnew1[c(106,236,240,265), ]

#CHL
r1<-dfnew1[237:239, ]
dfnew2 <- rbind(dfnew1[1:236,],r1,r1,dfnew1[-(1:236),])
c1<-c("2014",21903.17218,1990)
dfnew2[237:239,3:5]<-rbind(c1,c1,c1)
c2<-c("2016",22355.9843,1974)
dfnew2[243:245,3:5]<-rbind(c2,c2,c2)
dfnew3<-dfnew2[-c(246,236), ]
dfnew3 %>% filter(Country_code=="CHL")


#IRL and RUS
r3<-dfnew3[106, ]
dfnew4<-rbind(dfnew3[1:106, ],r3,r3,dfnew3[-(1:106), ])
r4<-dfnew4[271, ]
dfnew5<-rbind(dfnew4[1:271, ],r4,r4,dfnew4[-(1:271), ])
c3<-c("Below_upper_secondary","Tertiary","Upper_secondary")
dfnew5[106:108,6]<-c3
dfnew5[271:273,6]<-c3
dfnew6<-dfnew5%>%group_by(`Country_code`,`Education_level`)%>% mutate(EDU=ifelse(is.na(EDU),mean(EDU,na.rm=TRUE),EDU)) %>% mutate(EMP=ifelse(is.na(EMP),mean(EMP,na.rm=TRUE),EMP))
dfnew6 %>% filter(Country_code=="IRL")
dfnew6 %>% filter(Country_code=="RUS")
```
**Missing Value:**
I checked the number of missing values in each column. The three missing values in variable "Definition" are "OECD", which represents an average value for all OECD countries. I deleted these three rows because I want to study each country separately. For all the other missing values in "Education_level","EMP" and "EDU", I checked that they were in the same rows. Three countries included missing values: Ireland, Chile and Russia. Chile only had employment rate and educational attainment available in year 2015. I copied its "EMP" and "EDU" values for each education level of year 2015 to year 2014 and 2016. For Ireland and Russia, they had employment rate and educational attainment availble in year 2014 and 2015. I imputed the average values of employment rate and educational attainment for each education level in year 2014 and year 2015 to year 2016.

```{r}
dfnew6$GNI<-as.numeric(dfnew6$GNI)
dfnew6$HRWKD<-as.integer(dfnew6$HRWKD)
dfnew6$Education_level<-as.factor(dfnew6$Education_level)
m<-cbind(dfnew6$GNI,dfnew6$HRWKD,dfnew6$EDU,dfnew6$EMP,dfnew6$Mean_GNI,dfnew6$Mean_hours) #with a reason
sum(is.nan(m))
sum(is.infinite(m)) 
```
**Special Value:**
I checked the type of variables again and found that "GNI","HRWKD" and "Education_level" have become characters. It was probably due to the above "unite()", "separate()" and "rbind()" process that forced all variables into one type. "Education_level" was converted to factors. "GNI" and "HRWKD" were converted to numeric values for further checking on special values. Functions "is.nan()" and "is.infinite()" only work for vectorial input. Therefore I combined all numeric values into one matrix. It turned out that there were no special values.

```{r}
sum(dfnew6$GNI<0)
sum(dfnew6$HRWKD<0)
sum(dfnew6$EDU<=0 |dfnew6$EDU>=100)
sum(dfnew6$EMP<=0 |dfnew6$EMP>=100)
sum(dfnew6$Mean_GNI<0)
sum(dfnew6$Mean_hours<0)
checkEDU100<-dfnew6 %>% group_by(`Country_code`,Year) %>% mutate(sumEDU=sum(EDU))
which(round(checkEDU100$sumEDU,digits = 0)!=100)
dfnew6[118:120,]
```
**Inconsistency:**

* All values of "GNI","HRWKD","Mean_GNI" and "Mean_hours" should be greater than zero. After checking, the dataset obeys this rule.
* All values of "EDU" and "EMP" are percentages, therefore they should be within 0-100. After checking, the dataset obeys this rule.
* For each year in every country, the sum of "EDU" of three education levels should be 100. The educational attainment ("EDU") represents the percentage of the 25-64 year-olds who achieved Tertiary/Upper secondary/Below upper secondary as the highest education level. These three education levels cover all the situtations. After checking, Japan does not fulfil the requirments because it only has the values of "EDU" for Tertiary level. I did not apply the same method to check "EMP" because the employment rate represents the percentage of employed 25-64 year-olds by all 25-64 year-olds with the same education level. They cannot be added up based on different population.

##	Scan II

```{r}
z.scores<-dfnew6$GNI %>%scores(type="z")
z.scores %>% summary()
which(abs(z.scores)>3)
z.scores<-dfnew6$HRWKD %>%scores(type="z")
z.scores %>% summary()
which(abs(z.scores)>3)
z.scores<-dfnew6$Mean_GNI %>%scores(type="z")
z.scores %>% summary()
which(abs(z.scores)>3)
z.scores<-dfnew6$Mean_hours %>%scores(type="z")
z.scores %>% summary()
which(abs(z.scores)>3)
length(boxplot(dfnew6$EDU ~ dfnew6$Education_level, main="Educational attainment by Education level", ylab = "Education level", xlab = "Educational attainment")$out)
length(boxplot(dfnew6$EMP ~ dfnew6$Education_level, main="Employment rate by Education level", ylab = "Education level", xlab = "Employment rate")$out)
#identify the 9 outliers in educational attainment above the upper fence for "Below_upper_secondary" level
outlierEDU<-dfnew6 %>% filter(Education_level=="Below_upper_secondary") %>% select(-EMP)
outlierEDU<-outlierEDU[order(outlierEDU$EDU,decreasing=TRUE), ]
outlierEDU[1:9, ]
#identify the 1 outliers in employment rate above the upper fence for "Tertiary" level
outlierEMP1<-dfnew6 %>% filter(Education_level=="Tertiary") %>% select(-EDU)
outlierEMP1up<-outlierEMP1[order(outlierEMP1$EMP,decreasing=TRUE), ]
outlierEMP1up[1, ]
#identify the 3 outliers in employment rate below the lower fence for "Tertiary" level
outlierEMP1low<-outlierEMP1[order(outlierEMP1$EMP), ]
outlierEMP1low[1:3, ]
#identify the 3 outlier in employment rate below the lower fence for "Upper_secondary" level
outlierEMP2<-dfnew6 %>% filter(Education_level=="Upper_secondary") %>% select(-EDU)
outlierEMP2<-outlierEMP2[order(outlierEMP2$EMP), ]
outlierEMP2[1:3, ]
```
* For "GNI","HRWKD","Mean_GNI" and "Mean_hours", I applied z-score method to identify outliers. After checking, there were no suggested outliers for them. It may imply that the values of gross national income and working hours for these countries remain in a reasonable range. No country stands out regarding gross national income or working hours.
* For the employment rate ("EMP") and educational attainment ("EDU"), I applied bivariate box plots so that I can detect outliers by education levels. 9 outliers were identified in the educational attainment for "Below_upper_secondary" level. 4 outliers were identified in the employment rate for "Tertiary" level, with 1 above the upper fence and 3 below the lower fence. 3 outliers were identified in the employment rate for "Upper_secondary" level. I decided not to deal with these outliers. I compared the data on a global scale. It is reasonable for different countries to have different conditions on their education and employment. 

##	Transform 
```{r}
bin_GNI<-discretize(dfnew6$Mean_GNI, disc = "equalwidth",nbins=3)
dfnew7<-bind_cols(dfnew6,bin_GNI)
bin_hours<-discretize(dfnew7$Mean_hours,disc="equalwidth", nbins=3)
dfnew8<-bind_cols(dfnew7,bin_hours)

dfnew8<- dfnew8 %>% mutate(X=factor(X,levels=c(1,2,3),labels =c("Low_income","Middle_income","High_income"))) %>% mutate(X1=factor(X1,levels=c(1,2,3),labels=c("Short_hours","Medium_hours","Long_hours"))) %>% unite(Class,X,X1,sep = "&")
dfnew8
Country_class<-dfnew8 %>% ungroup() %>% distinct(Country_code,Class)
Country_class[order(Country_class$Class), ]
```
* I applied binning method to divide "Mean_GNI" and "Mean_hours" for setting up the scale of the classification. The variables were divided into 3 intervals of equal width using "discretize()". Then I added the bining columns to my dataset using "bind_cols()". I converted the binining columns to factors and labelled the values using "mutate()" and "factor()". Then I combined the two bining columns into one variable "Class". 
* I created a new dataset Country_class for a easy reference on the classification. I used "ungroup()" to drop each country's grouping with the three education levels. Then I used "distinct()" to remove duplicated rows based on "Country_code".

<br>
<br>
