library("readr")
library("dplyr")
library("tidyr")
library("deductive")
library("deducorrect")
library("editrules")
library("validate")
library("Hmisc")
library("forecast")
library("stringr")
library("lubridate")
library("car")
library("outliers")
library("mvoutlier")
library("MVN")
library("infotheo")
library("MASS")
library("mlr")
library("mlr3")
library("ggplot2")
library("knitr")
library("rmarkdown")
The aim of this assignment is to demonstrate the concepts and techniques of data preprocessing, also known as data wrangling. Data preprocessing is a data mining technique that transforms raw (and often complex) data into an understandable format ready for statistical analysis.
Major tasks in data preprocessing are as follows:
Three datasets are used to demonstrate the major tasks in data preprocessing in this assignment: the Human Freedom Index 2018 and the World Happiness Report for 2016 and 2017.
The exercise aims to prepare a dataset for analysing whether a higher degree of freedom means a higher level of happiness.
Human Freedom Index
The Human Freedom Index (‘HFI’) presents the degree of human freedom in the world based on 79 indicators of personal and economic freedom, grouped into 12 categories as follows (pf = personal freedom, ef = economic freedom):
The HFI 2018 report is co-published by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom, which can be found at this link: https://www.cato.org/human-freedom-index-new
The HFI 2018 dataset covers the years 2008 to 2016 and 162 countries. It can also be retrieved from Kaggle: https://www.kaggle.com/chchloe29/gdp-per-capita
The World Happiness Report
The World Happiness Report is a landmark survey of the state of global happiness that ranks 155-157 countries by how happy their citizens perceive themselves to be. The report is produced by the United Nations Sustainable Development Solutions Network in partnership with the Ernesto Illy Foundation. The original World Happiness Report 2018 can be found at this link: https://worldhappiness.report/ed/2018/
However, for the purpose of this data preprocessing exercise, the datasets have been sourced from Kaggle:
+ https://www.kaggle.com/unsdsn/world-happiness#2016.csv
+ https://www.kaggle.com/unsdsn/world-happiness#2017.csv
The datasets rank countries by their happiness level for 2016 and 2017 (157 countries in the 2016 file and 155 in the 2017 file). The happiness scores are based on answers to the main life evaluation question asked in the Gallup World Poll (Gallup is a US data analytics firm). In the survey, people were asked to rate their happiness on a scale from 0 to 10, with 0 being the worst possible life and 10 being the best possible life. The report then identifies six key factors that contribute to the happiness score:
+ economic production (*GDP per capita*)
+ social support (*Family*)
+ life expectancy (*Health*)
+ freedom to make life choices (*Freedom*)
+ absence of corruption (*Trust - Government Corruption*), and
+ generosity (*Generosity*).
Note: the Freedom variable in the World Happiness Report is a different measure from freedom in the Human Freedom Index. It is based on the Gallup World Poll, in which respondents were asked: “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
To make the happiness score for each country more meaningful, a hypothetical country with the lowest happiness level is established; it is called ‘Dystopia’, the opposite of Utopia. The purpose of establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables, thus allowing each sub-bar to be of positive width.
The residuals, or unexplained components, differ for each country, reflecting the extent to which the six variables either over- or under-explain average 2014-2016 life evaluations. These residuals have an average value of approximately zero over the whole set of countries.
The whiskers, or upper and lower confidence levels, indicate 95% confidence intervals for the estimated means.
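As a rough illustration of how the six factors and the Dystopia residual relate to the score, the sketch below adds up the 2016 values for Denmark as they are printed (rounded) by str(happy16a) later in this report. It assumes, as is commonly stated for these Kaggle files, that the happiness score is approximately the sum of the six factor contributions plus the Dystopia residual; the variable names are made up for the example.

```r
# Values for Denmark (2016), rounded as printed by str(happy16a) in this report
denmark <- c(
  gdp_per_capita    = 1.44,
  family            = 1.16,
  health            = 0.795,
  freedom           = 0.579,
  trust             = 0.445,
  generosity        = 0.362,
  dystopia_residual = 2.74
)

# Assumption: score is roughly the sum of the six factors and the residual
reconstructed_score <- sum(denmark)
reported_score <- 7.53

# The small gap comes from rounding in the printed values
abs(reconstructed_score - reported_score)
```

If this additive relationship holds across all rows, the score column is a linear combination of the other columns, which is relevant to the singular covariance matrix reported by the outlier check at the end of this report.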
setwd("~/Desktop/RMIT/MATH2349 – Data Preprocessing/Assignments/Assignment3/Data")
library(readr)
hfi_raw <- read_csv("hfi_cc_2018.csv")
Parsed with column specification:
cols(
.default = col_double(),
ISO_code = col_character(),
countries = col_character(),
region = col_character()
)
See spec(...) for full column specifications.
head(hfi_raw)
happy16 <- read_csv("2016.csv")
Parsed with column specification:
cols(
Country = col_character(),
Region = col_character(),
`Happiness Rank` = col_double(),
`Happiness Score` = col_double(),
`Lower Confidence Interval` = col_double(),
`Upper Confidence Interval` = col_double(),
`Economy (GDP per Capita)` = col_double(),
Family = col_double(),
`Health (Life Expectancy)` = col_double(),
Freedom = col_double(),
`Trust (Government Corruption)` = col_double(),
Generosity = col_double(),
`Dystopia Residual` = col_double()
)
head(happy16)
happy17 <- read_csv("2017.csv")
Parsed with column specification:
cols(
Country = col_character(),
Happiness.Rank = col_double(),
Happiness.Score = col_double(),
Whisker.high = col_double(),
Whisker.low = col_double(),
Economy..GDP.per.Capita. = col_double(),
Family = col_double(),
Health..Life.Expectancy. = col_double(),
Freedom = col_double(),
Generosity = col_double(),
Trust..Government.Corruption. = col_double(),
Dystopia.Residual = col_double()
)
head(happy17)
As happy16 and happy17 share almost identical variables, an additional column ‘year’ is added to each dataframe to distinguish the observations from each year.
The purpose of data preparation in this exercise is to produce an efficient and understandable dataset for analysing whether freedom means happiness. Hence, variables in happy16 and happy17 that have little impact on the end result are removed, e.g. the confidence-interval and whisker variables. Column ‘Region’ is also dropped to avoid creating missing values for happy17, which has no ‘Region’ column. Variables are dropped by subsetting each dataset with a negative index vector that excludes the positions of the unwanted columns.
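The code in this report drops columns by position. As an alternative sketch, the same step can be written by column name with dplyr::select(), which keeps working if the column order changes. The two-row table below is a made-up stand-in for happy16, not the real data.

```r
library(dplyr)

# Hypothetical stand-in for happy16 (values are invented for illustration)
toy16 <- tibble(
  Country = c("A", "B"),
  Region = c("R1", "R2"),
  `Happiness Score` = c(7.1, 6.2),
  `Lower Confidence Interval` = c(7.0, 6.1),
  `Upper Confidence Interval` = c(7.2, 6.3)
)

# Drop the unwanted columns by name instead of by position
toy16a <- toy16 %>%
  select(-Region, -`Lower Confidence Interval`, -`Upper Confidence Interval`)

names(toy16a)
```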
Each dataframe’s structure is checked with str() before merging, and anti_join() is used to see whether the two dataframes differ in dimension or variables. The result shows that they share only five identically named variables: “Country”, “Family”, “Freedom”, “Generosity” and “year”.
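A minimal sketch of what anti_join() reports, using two made-up tables rather than the real data. Note that when the join key includes ‘year’, no row of the 2016 table can match a row of the 2017 table, so every row is returned; restricting the key to ‘Country’ isolates the countries that appear in only one year.

```r
library(dplyr)

# Hypothetical tables standing in for happy16a and happy17a
toy16 <- tibble(Country = c("A", "B", "C"), year = 2016)
toy17 <- tibble(Country = c("A", "B", "D"), year = 2017)

# Key includes year: nothing matches, so all rows of toy16 come back
all_rows <- anti_join(toy16, toy17, by = c("Country", "year"))

# Key is Country only: rows whose country is missing from the other table
only_in_16 <- anti_join(toy16, toy17, by = "Country")
only_in_17 <- anti_join(toy17, toy16, by = "Country")
```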
The differences between happy16 and happy17 result from inconsistent variable names and from a difference in the number of countries surveyed. To avoid duplicating the common columns when the two dataframes are combined, the variable names must be consistent. Accordingly, columns in happy16 are renamed to match the corresponding variables in happy17 using the names() function.
Because both dataframes then share the same variables, their rows are stacked rather than joined by key. This is done with bind_rows(x, y).
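The toy example below sketches why the names must be aligned before stacking: bind_rows() matches columns by name, so differently named columns are kept side by side and padded with NA. The tables and values are invented for illustration.

```r
library(dplyr)

# Hypothetical one-row tables with the 2016 and 2017 naming styles
a <- tibble(Country = "A", `Happiness Score` = 7.1, year = 2016)
b <- tibble(Country = "B", Happiness.Score = 6.9, year = 2017)

# Names differ: two separate score columns, each half empty
mismatched <- bind_rows(a, b)

# Align the name first, then stack
names(a)[2] <- "Happiness.Score"
aligned <- bind_rows(a, b)
```

Here `mismatched` has four columns with two NA cells, while `aligned` has three columns and no missing values.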
For the purpose of demonstrating data preprocessing tasks, not all variables in the Human Freedom Index are used. A simplified dataframe is created by subsetting hfi_raw: only the variables that represent the twelve categories that determine the Human Freedom Index (described in the Data section) are kept, along with ‘year’, ‘region’, ‘ISO_code’, ‘countries’, and the total freedom scores and ranks. The value of each of the twelve categories is the mean of the components within that category.
Column ‘countries’ in the new dataframe hfi is renamed to ‘Country’ to match the corresponding column in joined_happy, using the names() function.
The Human Freedom Index dataframe hfi and the World Happiness dataframe joined_happy are then joined on all variables that appear in both tables (a natural join), here ‘year’ and ‘Country’. The join is done with left_join(), which keeps every row of hfi.
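A small sketch of the left join, with the key written out explicitly instead of relying on the natural join. The tables are made up; the point is that rows of the left table without a match (earlier years, or countries absent from the happiness data) are kept and filled with NA.

```r
library(dplyr)

# Hypothetical stand-ins for hfi and joined_happy
toy_hfi <- tibble(
  year = c(2015, 2016, 2016),
  Country = c("A", "A", "B"),
  hf_score = c(7.0, 7.1, 6.0)
)
toy_happy <- tibble(
  year = c(2016, 2017),
  Country = c("A", "A"),
  Happiness_Score = c(6.5, 6.6)
)

# Every row of toy_hfi is kept; unmatched rows get NA in Happiness_Score
joined <- left_join(toy_hfi, toy_happy, by = c("year", "Country"))
```

Naming the key with `by` also guards against an accidental join on other columns that happen to share a name.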
Lastly, the categorical variables in the combined table are converted to factors. In statistical modelling, storing categorical data as factors ensures that modelling functions treat the data correctly. The conversion is done with lapply() and as.factor().
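The conversion can be sketched on a two-row data frame built from the first two rows shown in the str() output of this report. Selecting the columns by name (the vector `cat_cols` is introduced here for illustration) avoids depending on their positions.

```r
# Two rows taken from the printed output in this report
toy <- data.frame(
  ISO_code = c("ALB", "DZA"),
  Country = c("Albania", "Algeria"),
  region = c("Eastern Europe", "Middle East & North Africa"),
  hf_score = c(7.57, 5.14),
  stringsAsFactors = FALSE
)

# Convert only the categorical columns to factors
cat_cols <- c("ISO_code", "Country", "region")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

sapply(toy, class)
```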
#To keep data each year:
happy16$year <- 2016
happy17$year <- 2017
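#Remove redundant variables in 'happy16' and 'happy17'
#happy16: drop column 2 (Region) and columns 5-6 (lower/upper confidence interval)
happy16a <- happy16[-c(2,5:6)]
head(happy16a)
#happy17: drop columns 4-5 (Whisker.high, Whisker.low)
happy17a <- happy17[-c(4:5)]
head(happy17a)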
#Check each dataframe structure before joining happy16 & happy17 dataframes to see if there is any difference in regards to the data dimension and variables.
str(happy16a)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 157 obs. of 11 variables:
$ Country : chr "Denmark" "Switzerland" "Iceland" "Norway" ...
$ Happiness Rank : num 1 2 3 4 5 6 7 8 9 10 ...
$ Happiness Score : num 7.53 7.51 7.5 7.5 7.41 ...
$ Economy (GDP per Capita) : num 1.44 1.53 1.43 1.58 1.41 ...
$ Family : num 1.16 1.15 1.18 1.13 1.13 ...
$ Health (Life Expectancy) : num 0.795 0.863 0.867 0.796 0.811 ...
$ Freedom : num 0.579 0.586 0.566 0.596 0.571 ...
$ Trust (Government Corruption): num 0.445 0.412 0.15 0.358 0.41 ...
$ Generosity : num 0.362 0.281 0.477 0.379 0.255 ...
$ Dystopia Residual : num 2.74 2.69 2.83 2.66 2.83 ...
$ year : num 2016 2016 2016 2016 2016 ...
str(happy17a)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 155 obs. of 11 variables:
$ Country : chr "Norway" "Denmark" "Iceland" "Switzerland" ...
$ Happiness.Rank : num 1 2 3 4 5 6 7 8 9 10 ...
$ Happiness.Score : num 7.54 7.52 7.5 7.49 7.47 ...
$ Economy..GDP.per.Capita. : num 1.62 1.48 1.48 1.56 1.44 ...
$ Family : num 1.53 1.55 1.61 1.52 1.54 ...
$ Health..Life.Expectancy. : num 0.797 0.793 0.834 0.858 0.809 ...
$ Freedom : num 0.635 0.626 0.627 0.62 0.618 ...
$ Generosity : num 0.362 0.355 0.476 0.291 0.245 ...
$ Trust..Government.Corruption.: num 0.316 0.401 0.154 0.367 0.383 ...
$ Dystopia.Residual : num 2.28 2.31 2.32 2.28 2.43 ...
$ year : num 2017 2017 2017 2017 2017 ...
#Check mismatches between happy16a and happy17a
anti_join(happy16a,happy17a)
Joining, by = c("Country", "Family", "Freedom", "Generosity", "year")
#Rename columns in happy16a dataframe to be consistent with happy17a columns:
names(happy16a)[2] <- "Happiness.Rank"
names(happy16a)[3] <- "Happiness.Score"
names(happy16a)[4] <- "Economy..GDP.per.Capita."
names(happy16a)[6] <- "Health..Life.Expectancy."
names(happy16a)[8] <- "Trust..Government.Corruption."
names(happy16a)[10] <- "Dystopia.Residual"
head(happy16a)
#Merge 'happy16a' with 'happy17a'
joined_happy <- bind_rows(happy16a,happy17a)
names(joined_happy) <- c("Country","Happiness_Rank",
"Happiness_Score","GDP_per_Capita",
"Family","Health","Freedom","Trust_Govt_Corruption",
"Generosity","Dystopia_Residual","year")
head(joined_happy)
str(joined_happy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 312 obs. of 11 variables:
$ Country : chr "Denmark" "Switzerland" "Iceland" "Norway" ...
$ Happiness_Rank : num 1 2 3 4 5 6 7 8 9 10 ...
$ Happiness_Score : num 7.53 7.51 7.5 7.5 7.41 ...
$ GDP_per_Capita : num 1.44 1.53 1.43 1.58 1.41 ...
$ Family : num 1.16 1.15 1.18 1.13 1.13 ...
$ Health : num 0.795 0.863 0.867 0.796 0.811 ...
$ Freedom : num 0.579 0.586 0.566 0.596 0.571 ...
$ Trust_Govt_Corruption: num 0.445 0.412 0.15 0.358 0.41 ...
$ Generosity : num 0.362 0.281 0.477 0.379 0.255 ...
$ Dystopia_Residual : num 2.74 2.69 2.83 2.66 2.83 ...
$ year : num 2016 2016 2016 2016 2016 ...
#Check structure of human freedom index 'hfi_raw'
str(hfi_raw)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1458 obs. of 123 variables:
$ year : num 2016 2016 2016 2016 2016 ...
$ ISO_code : chr "ALB" "DZA" "AGO" "ARG" ...
$ countries : chr "Albania" "Algeria" "Angola" "Argentina" ...
$ region : chr "Eastern Europe" "Middle East & North Africa" "Sub-Saharan Africa" "Latin America & the Caribbean" ...
$ pf_rol_procedural : num 6.66 NA NA 7.1 NA ...
$ pf_rol_civil : num 4.55 NA NA 5.79 NA ...
$ pf_rol_criminal : num 4.67 NA NA 4.34 NA ...
$ pf_rol : num 5.29 3.82 3.45 5.74 5 ...
$ pf_ss_homicide : num 8.92 9.46 8.06 7.62 8.81 ...
$ pf_ss_disappearances_disap : num 10 10 5 10 10 10 10 10 10 10 ...
$ pf_ss_disappearances_violent : num 10 9.29 10 10 10 ...
$ pf_ss_disappearances_organized : num 10 5 7.5 7.5 7.5 10 10 7.5 NA 2.5 ...
$ pf_ss_disappearances_fatalities : num 10 9.93 10 10 9.32 ...
$ pf_ss_disappearances_injuries : num 10 9.99 10 9.99 9.93 ...
$ pf_ss_disappearances : num 10 8.84 8.5 9.5 9.35 ...
$ pf_ss_women_fgm : num 10 10 10 10 10 10 10 10 NA 10 ...
$ pf_ss_women_missing : num 7.5 7.5 10 10 5 10 10 7.5 NA 7.5 ...
$ pf_ss_women_inheritance_widows : num 5 0 5 10 10 10 10 5 NA 0 ...
$ pf_ss_women_inheritance_daughters : num 5 0 5 10 10 10 10 10 NA 0 ...
$ pf_ss_women_inheritance : num 5 0 5 10 10 10 10 7.5 NA 0 ...
$ pf_ss_women : num 7.5 5.83 8.33 10 8.33 ...
$ pf_ss : num 8.81 8.04 8.3 9.04 8.83 ...
$ pf_movement_domestic : num 5 5 0 10 5 10 10 5 10 10 ...
$ pf_movement_foreign : num 10 5 5 10 5 10 10 5 10 5 ...
$ pf_movement_women : num 5 5 10 10 10 10 10 5 NA 5 ...
$ pf_movement : num 6.67 5 5 10 6.67 ...
$ pf_religion_estop_establish : num NA NA NA NA NA NA NA NA NA NA ...
$ pf_religion_estop_operate : num NA NA NA NA NA NA NA NA NA NA ...
$ pf_religion_estop : num 10 5 10 7.5 5 10 10 2.5 NA 10 ...
$ pf_religion_harassment : num 9.57 6.87 8.9 9.04 8.58 ...
$ pf_religion_restrictions : num 8.01 2.96 7.46 6.85 5.09 ...
$ pf_religion : num 9.19 4.94 8.79 7.8 6.22 ...
$ pf_association_association : num 10 5 2.5 7.5 7.5 10 10 2.5 NA 5 ...
$ pf_association_assembly : num 10 5 2.5 10 7.5 10 10 5 NA 0 ...
$ pf_association_political_establish: num NA NA NA NA NA NA NA NA NA NA ...
$ pf_association_political_operate : num NA NA NA NA NA NA NA NA NA NA ...
$ pf_association_political : num 10 5 2.5 5 5 10 10 2.5 NA 0 ...
$ pf_association_prof_establish : num NA NA NA NA NA NA NA NA NA NA ...
$ pf_association_prof_operate : num NA NA NA NA NA NA NA NA NA NA ...
$ pf_association_prof : num 10 5 5 7.5 5 10 10 2.5 NA 10 ...
$ pf_association_sport_establish : num NA NA NA NA NA NA NA NA NA NA ...
$ pf_association_sport_operate : num NA NA NA NA NA NA NA NA NA NA ...
$ pf_association_sport : num 10 5 7.5 7.5 7.5 10 10 2.5 NA 10 ...
$ pf_association : num 10 5 4 7.5 6.5 10 10 3 NA 5 ...
$ pf_expression_killed : num 10 10 10 10 10 10 10 10 10 10 ...
$ pf_expression_jailed : num 10 10 10 10 10 ...
$ pf_expression_influence : num 5 2.67 2.67 5.67 3.33 ...
$ pf_expression_control : num 5.25 4 2.5 5.5 4.25 7.75 8 0.25 7.25 0.75 ...
$ pf_expression_cable : num 10 10 7.5 10 7.5 10 10 10 NA 7.5 ...
$ pf_expression_newspapers : num 10 7.5 5 10 7.5 10 10 0 NA 7.5 ...
$ pf_expression_internet : num 10 7.5 7.5 10 7.5 10 10 7.5 NA 2.5 ...
$ pf_expression : num 8.61 7.38 6.45 8.74 7.15 ...
$ pf_identity_legal : num 0 NA 10 10 7 7 10 0 NA NA ...
$ pf_identity_parental_marriage : num 10 0 10 10 10 10 10 10 10 0 ...
$ pf_identity_parental_divorce : num 10 5 10 10 10 10 10 10 10 0 ...
$ pf_identity_parental : num 10 2.5 10 10 10 10 10 10 10 0 ...
$ pf_identity_sex_male : num 10 0 0 10 10 10 10 10 10 10 ...
$ pf_identity_sex_female : num 10 0 0 10 10 10 10 10 10 10 ...
$ pf_identity_sex : num 10 0 0 10 10 10 10 10 10 10 ...
$ pf_identity_divorce : num 5 0 10 10 5 10 10 5 NA 0 ...
$ pf_identity : num 6.25 0.833 7.5 10 8 ...
$ pf_score : num 7.6 5.28 6.11 8.1 6.91 ...
$ pf_rank : num 57 147 117 42 84 11 8 131 64 114 ...
$ ef_government_consumption : num 8.23 2.15 7.6 5.34 7.26 ...
$ ef_government_transfers : num 7.51 7.82 8.89 6.05 7.75 ...
$ ef_government_enterprises : num 8 0 0 6 8 10 10 0 7 10 ...
$ ef_government_tax_income : num 9 7 10 7 5 5 4 9 10 10 ...
$ ef_government_tax_payroll : num 7 2 9 1 5 5 3 4 10 10 ...
$ ef_government_tax : num 8 4.5 9.5 4 5 5 3.5 6.5 10 10 ...
$ ef_government : num 7.94 3.62 6.5 5.35 7 ...
$ ef_legal_judicial : num 2.67 4.19 1.84 3.69 3.87 ...
$ ef_legal_courts : num 3.15 4.33 1.97 2.93 4.2 ...
$ ef_legal_protection : num 4.51 4.69 2.51 4.26 5.66 ...
$ ef_legal_military : num 8.33 4.17 3.33 7.5 5.83 ...
$ ef_legal_integrity : num 4.17 5 4.17 3.33 5 ...
$ ef_legal_enforcement : num 4.39 4.51 2.3 3.63 5.2 ...
$ ef_legal_restrictions : num 6.49 6.63 5.46 6.86 9.8 ...
$ ef_legal_police : num 6.93 6.14 3.02 3.39 5.71 ...
$ ef_legal_crime : num 6.22 6.74 4.29 4.13 7.01 ...
$ ef_legal_gender : num 0.949 0.821 0.846 0.769 1 ...
$ ef_legal : num 5.07 4.69 2.96 3.9 5.81 ...
$ ef_money_growth : num 8.99 6.96 9.39 5.23 9.08 ...
$ ef_money_sd : num 9.48 8.34 4.99 5.22 9.26 ...
$ ef_money_inflation : num 9.74 8.72 3.05 2 9.75 ...
$ ef_money_currency : num 10 5 5 10 10 10 10 5 0 10 ...
$ ef_money : num 9.55 7.25 5.61 5.61 9.52 ...
$ ef_trade_tariffs_revenue : num 9.63 8.48 8.99 6.06 8.87 ...
$ ef_trade_tariffs_mean : num 9.24 6.22 7.72 7.26 8.76 9.5 8.96 8.2 3.36 9.06 ...
$ ef_trade_tariffs_sd : num 8.02 5.92 4.25 5.94 8.02 ...
$ ef_trade_tariffs : num 8.96 6.87 6.99 6.42 8.55 ...
$ ef_trade_regulatory_nontariff : num 5.57 4.96 3.13 4.47 5.92 ...
$ ef_trade_regulatory_compliance : num 9.405 0 0.917 5.156 8.466 ...
$ ef_trade_regulatory : num 7.49 2.48 2.02 4.81 7.19 ...
$ ef_trade_black : num 10 5.56 10 0 10 ...
$ ef_trade_movement_foreign : num 6.31 3.66 2.95 5.36 5.11 ...
$ ef_trade_movement_capital : num 4.615 0 3.077 0.769 5.385 ...
$ ef_trade_movement_visit : num 8.297 1.106 0.111 7.965 10 ...
$ ef_trade_movement : num 6.41 1.59 2.04 4.7 6.83 ...
$ ef_trade : num 8.21 4.13 5.26 3.98 8.14 ...
[list output truncated]
- attr(*, "spec")=
.. cols(
.. year = col_double(),
.. ISO_code = col_character(),
.. countries = col_character(),
.. region = col_character(),
.. pf_rol_procedural = col_double(),
.. pf_rol_civil = col_double(),
.. pf_rol_criminal = col_double(),
.. pf_rol = col_double(),
.. pf_ss_homicide = col_double(),
.. pf_ss_disappearances_disap = col_double(),
.. pf_ss_disappearances_violent = col_double(),
.. pf_ss_disappearances_organized = col_double(),
.. pf_ss_disappearances_fatalities = col_double(),
.. pf_ss_disappearances_injuries = col_double(),
.. pf_ss_disappearances = col_double(),
.. pf_ss_women_fgm = col_double(),
.. pf_ss_women_missing = col_double(),
.. pf_ss_women_inheritance_widows = col_double(),
.. pf_ss_women_inheritance_daughters = col_double(),
.. pf_ss_women_inheritance = col_double(),
.. pf_ss_women = col_double(),
.. pf_ss = col_double(),
.. pf_movement_domestic = col_double(),
.. pf_movement_foreign = col_double(),
.. pf_movement_women = col_double(),
.. pf_movement = col_double(),
.. pf_religion_estop_establish = col_double(),
.. pf_religion_estop_operate = col_double(),
.. pf_religion_estop = col_double(),
.. pf_religion_harassment = col_double(),
.. pf_religion_restrictions = col_double(),
.. pf_religion = col_double(),
.. pf_association_association = col_double(),
.. pf_association_assembly = col_double(),
.. pf_association_political_establish = col_double(),
.. pf_association_political_operate = col_double(),
.. pf_association_political = col_double(),
.. pf_association_prof_establish = col_double(),
.. pf_association_prof_operate = col_double(),
.. pf_association_prof = col_double(),
.. pf_association_sport_establish = col_double(),
.. pf_association_sport_operate = col_double(),
.. pf_association_sport = col_double(),
.. pf_association = col_double(),
.. pf_expression_killed = col_double(),
.. pf_expression_jailed = col_double(),
.. pf_expression_influence = col_double(),
.. pf_expression_control = col_double(),
.. pf_expression_cable = col_double(),
.. pf_expression_newspapers = col_double(),
.. pf_expression_internet = col_double(),
.. pf_expression = col_double(),
.. pf_identity_legal = col_double(),
.. pf_identity_parental_marriage = col_double(),
.. pf_identity_parental_divorce = col_double(),
.. pf_identity_parental = col_double(),
.. pf_identity_sex_male = col_double(),
.. pf_identity_sex_female = col_double(),
.. pf_identity_sex = col_double(),
.. pf_identity_divorce = col_double(),
.. pf_identity = col_double(),
.. pf_score = col_double(),
.. pf_rank = col_double(),
.. ef_government_consumption = col_double(),
.. ef_government_transfers = col_double(),
.. ef_government_enterprises = col_double(),
.. ef_government_tax_income = col_double(),
.. ef_government_tax_payroll = col_double(),
.. ef_government_tax = col_double(),
.. ef_government = col_double(),
.. ef_legal_judicial = col_double(),
.. ef_legal_courts = col_double(),
.. ef_legal_protection = col_double(),
.. ef_legal_military = col_double(),
.. ef_legal_integrity = col_double(),
.. ef_legal_enforcement = col_double(),
.. ef_legal_restrictions = col_double(),
.. ef_legal_police = col_double(),
.. ef_legal_crime = col_double(),
.. ef_legal_gender = col_double(),
.. ef_legal = col_double(),
.. ef_money_growth = col_double(),
.. ef_money_sd = col_double(),
.. ef_money_inflation = col_double(),
.. ef_money_currency = col_double(),
.. ef_money = col_double(),
.. ef_trade_tariffs_revenue = col_double(),
.. ef_trade_tariffs_mean = col_double(),
.. ef_trade_tariffs_sd = col_double(),
.. ef_trade_tariffs = col_double(),
.. ef_trade_regulatory_nontariff = col_double(),
.. ef_trade_regulatory_compliance = col_double(),
.. ef_trade_regulatory = col_double(),
.. ef_trade_black = col_double(),
.. ef_trade_movement_foreign = col_double(),
.. ef_trade_movement_capital = col_double(),
.. ef_trade_movement_visit = col_double(),
.. ef_trade_movement = col_double(),
.. ef_trade = col_double(),
.. ef_regulation_credit_ownership = col_double(),
.. ef_regulation_credit_private = col_double(),
.. ef_regulation_credit_interest = col_double(),
.. ef_regulation_credit = col_double(),
.. ef_regulation_labor_minwage = col_double(),
.. ef_regulation_labor_firing = col_double(),
.. ef_regulation_labor_bargain = col_double(),
.. ef_regulation_labor_hours = col_double(),
.. ef_regulation_labor_dismissal = col_double(),
.. ef_regulation_labor_conscription = col_double(),
.. ef_regulation_labor = col_double(),
.. ef_regulation_business_adm = col_double(),
.. ef_regulation_business_bureaucracy = col_double(),
.. ef_regulation_business_start = col_double(),
.. ef_regulation_business_bribes = col_double(),
.. ef_regulation_business_licensing = col_double(),
.. ef_regulation_business_compliance = col_double(),
.. ef_regulation_business = col_double(),
.. ef_regulation = col_double(),
.. ef_score = col_double(),
.. ef_rank = col_double(),
.. hf_score = col_double(),
.. hf_rank = col_double(),
.. hf_quartile = col_double()
.. )
#Simplify hfi_raw by subsetting
hfi <- subset(hfi_raw, select = c((1:4), hf_score, hf_rank, pf_rol, pf_ss, pf_movement, pf_religion, pf_association, pf_expression, pf_identity, pf_score, pf_rank, ef_government, ef_legal, ef_money, ef_trade, ef_regulation, ef_score, ef_rank))
#Rename variable 'countries' to match 'Country' in joined_happy
names(hfi)[3] <- "Country"
head(hfi)
dim(hfi)
[1] 1458 22
#Join human freedom index 'hfi' & world happiness index 'joined_happy'
hfi_happy <- hfi %>% left_join(joined_happy)
Joining, by = c("year", "Country")
#Transform column 2-4 into factor
hfi_happy[,2:4] <- lapply(hfi_happy[,2:4], as.factor)
str(hfi_happy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1458 obs. of 31 variables:
$ year : num 2016 2016 2016 2016 2016 ...
$ ISO_code : Factor w/ 162 levels "AGO","ALB","ARE",..: 2 43 1 4 5 6 7 8 16 15 ...
$ Country : Factor w/ 162 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
$ region : Factor w/ 10 levels "Caucasus & Central Asia",..: 3 5 9 4 1 7 10 1 4 5 ...
$ hf_score : num 7.57 5.14 5.64 6.47 7.24 ...
$ hf_rank : num 48 155 142 107 57 4 16 130 50 75 ...
$ pf_rol : num 5.29 3.82 3.45 5.74 5 ...
$ pf_ss : num 8.81 8.04 8.3 9.04 8.83 ...
$ pf_movement : num 6.67 5 5 10 6.67 ...
$ pf_religion : num 9.19 4.94 8.79 7.8 6.22 ...
$ pf_association : num 10 5 4 7.5 6.5 10 10 3 NA 5 ...
$ pf_expression : num 8.61 7.38 6.45 8.74 7.15 ...
$ pf_identity : num 6.25 0.833 7.5 10 8 ...
$ pf_score : num 7.6 5.28 6.11 8.1 6.91 ...
$ pf_rank : num 57 147 117 42 84 11 8 131 64 114 ...
$ ef_government : num 7.94 3.62 6.5 5.35 7 ...
$ ef_legal : num 5.07 4.69 2.96 3.9 5.81 ...
$ ef_money : num 9.55 7.25 5.61 5.61 9.52 ...
$ ef_trade : num 8.21 4.13 5.26 3.98 8.14 ...
$ ef_regulation : num 6.91 5.27 5.52 5.37 7.38 ...
$ ef_score : num 7.54 4.99 5.17 4.84 7.57 7.98 7.58 6.49 7.34 7.56 ...
$ ef_rank : num 34 159 155 160 29 10 27 106 49 30 ...
$ Happiness_Rank : num 109 38 141 26 121 9 12 81 NA 42 ...
$ Happiness_Score : num 4.66 6.36 3.87 6.65 4.36 ...
$ GDP_per_Capita : num 0.955 1.053 0.847 1.151 0.861 ...
$ Family : num 0.502 0.833 0.664 1.066 0.625 ...
$ Health : num 0.7301 0.618 0.0499 0.6971 0.6408 ...
$ Freedom : num 0.31866 0.21006 0.00589 0.42284 0.14037 ...
$ Trust_Govt_Corruption: num 0.053 0.1616 0.0843 0.073 0.0362 ...
$ Generosity : num 0.1684 0.0704 0.1207 0.1099 0.0779 ...
$ Dystopia_Residual : num 1.93 3.41 2.09 3.13 1.98 ...
The combined dataset is tidy in that it meets the three requirements of a tidy dataset:
+ Each variable must have its own column.
+ Each observation must have its own row.
+ Each value must have its own cell.
head(hfi_happy)
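#Compare each country's happiness rank between 2016 and 2017
#(selected by name: column 3 holds the happiness score, column 2 the rank)
df<-merge(happy16a[,c("Country","Happiness.Rank")],
happy17a[,c("Country","Happiness.Rank")],
by.x = "Country",
by.y = "Country")
colnames(df)<-c("Country","Happiness_Rank16","Happiness_Rank17")
df1<-df%>%
mutate(HappinessRank_Change = Happiness_Rank17 - Happiness_Rank16)
df1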
Most of the missing values result from the World Happiness variables, which are only available for 2016 and 2017, whilst the Human Freedom Index starts from 2008. As those missing values cannot be replaced with meaningful values, observations from 2008 to 2015 are removed by keeping only the 2016 and 2017 data.
Further investigation found that the World Happiness data cover 157 countries whilst the Human Freedom Index covers 162. This mismatch in the number of countries results in missing values (‘NA’). Rows with missing values in the happiness variables are therefore also removed, using the filter() function.
After these deletions, missing values still affect more than 5% of the observations (11 missing values out of 137 observations, about 8%), so deleting further rows is not appropriate. Instead, imputation is applied: missing values are replaced with the mean, median or mode of the affected variable. The data summary shows that all the missing values are in pf_association, so they are replaced with the mean of pf_association.
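The arithmetic behind the 5% check and the mean imputation can be sketched in base R. The counts 11 and 137 come from this report; the short `pf_association` vector is made up to keep the example small.

```r
# Share of observations with a missing value (counts from this report)
n_missing <- 11
n_obs <- 137
share_missing <- n_missing / n_obs  # roughly 0.08, above the 5% threshold

# Hypothetical values standing in for the real pf_association column
pf_association <- c(10, 5, NA, 7.5, NA, 2.5)

# Replace each NA with the mean of the observed values
observed_mean <- mean(pf_association, na.rm = TRUE)
imputed <- ifelse(is.na(pf_association), observed_mean, pf_association)
imputed
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when the imputed column is later used in variance-based methods.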
#scan for missing values
sum(is.na(hfi_happy))
[1] 13591
#Select data from year 2016 & 2017
hfi_happy_nomiss <- hfi_happy %>% filter(year==2016 | year == 2017)
sum(is.na(hfi_happy_nomiss))
[1] 252
#Delete countries that are not covered in World Happiness Index
hfi_happy_nomiss1 <- hfi_happy_nomiss %>% filter(!is.na(Happiness_Rank),
!is.na(Happiness_Score),
!is.na(GDP_per_Capita),
!is.na(Family),
!is.na(Health),
!is.na(Freedom),
!is.na(Trust_Govt_Corruption),
!is.na(Generosity),
!is.na(Dystopia_Residual)
)
sum(is.na(hfi_happy_nomiss1))
[1] 11
summary(hfi_happy_nomiss1)
year ISO_code Country region hf_score
Min. :2016 AGO : 1 Albania : 1 Sub-Saharan Africa :31 Min. :3.766
1st Qu.:2016 ALB : 1 Algeria : 1 Latin America & the Caribbean:22 1st Qu.:6.275
Median :2016 ARE : 1 Angola : 1 Eastern Europe :20 Median :6.848
Mean :2016 ARG : 1 Argentina: 1 Western Europe :18 Mean :6.945
3rd Qu.:2016 ARM : 1 Armenia : 1 Middle East & North Africa :17 3rd Qu.:7.858
Max. :2016 AUS : 1 Australia: 1 South Asia :15 Max. :8.887
(Other):131 (Other) :131 (Other) :14
hf_rank pf_rol pf_ss pf_movement pf_religion
Min. : 1.00 Min. :1.980 Min. :3.964 Min. : 0.000 Min. :3.215
1st Qu.: 37.00 1st Qu.:3.988 1st Qu.:7.383 1st Qu.: 6.667 1st Qu.:6.346
Median : 75.00 Median :4.861 Median :8.413 Median : 8.333 Median :7.893
Mean : 78.51 Mean :5.183 Mean :8.220 Mean : 7.701 Mean :7.389
3rd Qu.:119.00 3rd Qu.:6.451 3rd Qu.:9.370 3rd Qu.:10.000 3rd Qu.:8.540
Max. :162.00 Max. :8.687 Max. :9.960 Max. :10.000 Max. :9.731
pf_association pf_expression pf_identity pf_score pf_rank
Min. : 0.500 Min. :1.759 Min. : 0.000 Min. :2.512 Min. : 1.0
1st Qu.: 5.000 1st Qu.:6.964 1st Qu.: 4.375 1st Qu.:6.041 1st Qu.: 38.0
Median : 7.500 Median :8.018 Median : 8.000 Median :6.975 Median : 80.0
Mean : 7.127 Mean :7.745 Mean : 6.969 Mean :7.044 Mean : 79.9
3rd Qu.: 9.500 3rd Qu.:9.000 3rd Qu.: 9.250 3rd Qu.:8.259 3rd Qu.:121.0
Max. :10.000 Max. :9.798 Max. :10.000 Max. :9.399 Max. :161.0
NA's :11
ef_government ef_legal ef_money ef_trade ef_regulation ef_score
Min. :3.617 Min. :2.003 Min. :1.942 Min. :2.877 Min. :2.484 Min. :2.880
1st Qu.:5.664 1st Qu.:4.127 1st Qu.:7.217 1st Qu.:6.356 1st Qu.:6.475 1st Qu.:6.290
Median :6.576 Median :5.125 Median :8.680 Median :7.217 Median :7.098 Median :6.950
Mean :6.534 Mean :5.300 Mean :8.304 Mean :7.056 Mean :7.038 Mean :6.846
3rd Qu.:7.462 3rd Qu.:6.218 3rd Qu.:9.433 3rd Qu.:8.050 3rd Qu.:7.733 3rd Qu.:7.500
Max. :9.528 Max. :8.798 Max. :9.922 Max. :9.324 Max. :9.440 Max. :8.970
ef_rank Happiness_Rank Happiness_Score GDP_per_Capita Family
Min. : 1.00 Min. : 1.00 Min. :2.905 Min. :0.06831 Min. :0.0000
1st Qu.: 37.00 1st Qu.: 37.00 1st Qu.:4.459 1st Qu.:0.69429 1st Qu.:0.6450
Median : 76.00 Median : 78.00 Median :5.389 Median :1.05266 Median :0.8597
Mean : 78.23 Mean : 77.19 Mean :5.429 Mean :0.98463 Mean :0.8031
3rd Qu.:121.00 3rd Qu.:116.00 3rd Qu.:6.361 3rd Qu.:1.27964 3rd Qu.:1.0217
Max. :162.00 Max. :157.00 Max. :7.526 Max. :1.82427 Max. :1.1833
Health Freedom Trust_Govt_Corruption Generosity Dystopia_Residual
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.8179
1st Qu.:0.4249 1st Qu.:0.2767 1st Qu.:0.06126 1st Qu.:0.1546 1st Qu.:1.9982
Median :0.6201 Median :0.4027 Median :0.10398 Median :0.2225 Median :2.2809
Mean :0.5729 Mean :0.3764 Mean :0.13739 Mean :0.2438 Mean :2.3105
3rd Qu.:0.7301 3rd Qu.:0.4877 3rd Qu.:0.17554 3rd Qu.:0.3147 3rd Qu.:2.6646
Max. :0.9528 Max. :0.5961 Max. :0.50521 Max. :0.8197 Max. :3.5591
hfi_happy_imp <- hfi_happy_nomiss1 %>% mutate_at(vars(pf_association),~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))
sum(is.na(hfi_happy_imp))
[1] 0
The presence of the two outliers can dramatically affected mean and variance estimates of our data, thus can lead to inaccurate result or prediction. The outliers may occur as a result of recording error, limitations on measuring techniques, skewed population, samples are taken from not entirely the same population, among others. However, we can not be sure what caused the outliers in our dataset as we extracted the data from a publicly available data source.
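We use the Mahalanobis distance, as computed by the MVN package, to detect outliers in our data. The Mahalanobis distance measures how far each observation lies from the multivariate centre of the data, taking the covariance between the variables into account. Observations whose squared distance exceeds a chosen quantile of the chi-square distribution are flagged as outliers.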
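The idea behind the distance-based check can be sketched with base R alone. This is only an illustration on made-up data: the toy matrix, the planted extreme point, and the 97.5% cutoff are assumptions, and the sketch uses the classical mean and covariance rather than reproducing the internals of the MVN package.

```r
# Toy data: 50 bivariate normal points plus one planted extreme point
# (hypothetical values, not the HFI or happiness data)
set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2), c(8, -8))

# Squared Mahalanobis distance of each row from the sample mean,
# scaled by the sample covariance matrix
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows whose squared distance exceeds the 97.5% chi-square quantile,
# with degrees of freedom equal to the number of variables
cutoff <- qchisq(0.975, df = ncol(x))
outlier_rows <- which(d2 > cutoff)
outlier_rows
```

The planted point in row 51 should appear among the flagged rows; a few ordinary points may also be flagged, since the cutoff is a quantile rather than a hard rule.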
The results of the multivariate normality (‘mvn’) test are as follows:
Mardia’s multivariate normality (‘MVN’) test, which measures the multivariate skewness and kurtosis coefficients and their statistical significance, gives a skewness p-value below 0.05 and a kurtosis p-value above 0.05. For the data to be considered multivariate normal at the 0.05 significance level, the p-values of both tests must be above 0.05. Based on the Mardia test, our data therefore do not follow a multivariate normal distribution;
the Shapiro-Wilk test, which assesses the normality of each variable separately, shows that only two variables in our data are normally distributed, i.e. ef_government (the size of government) and Dystopia_Residual (the unexplained component of each country's level of happiness);
the Mahalanobis distance test shows that the data contain two outliers, which are the first and second observations of the dataset;
the MVN output also reports that “the covariance matrix of the data is singular”. This may be caused by linear interdependencies among the variables, for example when one variable is an exact linear combination of others, as the Happiness Score is of its components, including Dystopia Residual. The singularity issue is outside the scope of our project and is therefore left untreated.
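The link between linear dependence and a singular covariance matrix can be illustrated on a small made-up example. The data frame `toy` and its columns are hypothetical; dropping the derived column is shown only as one possible remedy and is not applied in this report.

```r
# Toy data in which one column is an exact sum of two others
# (hypothetical values, not the report's data)
set.seed(42)
a <- rnorm(30)
b <- rnorm(30)
toy <- data.frame(a = a, b = b, total = a + b)

# When a column is an exact linear combination of others, the covariance
# matrix is rank-deficient: its rank is lower than the number of columns
S <- cov(toy)
rank_full <- qr(S)$rank
rank_full < ncol(toy)

# Without the derived column the covariance matrix has full rank again
S_reduced <- cov(toy[, c("a", "b")])
rank_reduced <- qr(S_reduced)$rank
rank_reduced == 2
```

A rank check of this kind is a quick way to confirm whether a singularity message stems from derived variables such as totals or averages.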
#Choose only numerical variables (excl. 'year')
happy_nums <- dplyr::select_if((hfi_happy_imp[ , -1]), is.numeric)
dim(happy_nums)
[1] 137 27
head(happy_nums)
#check for outliers and multivariate normality
result <- mvn(happy_nums, multivariateOutlierMethod = "quan", showOutliers = TRUE)
The covariance matrix of the data is singular.
There are 137 observations (in the entire dataset of 137 obs.) lying on the hyperplane
with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip - m_p) = 0 with (m_1, ..., m_p) the
mean of these observations and coefficients a_i from the vector a <- c(0.8164966, 0,
0, 0, 0, 0, 0, 0, 0, -0.4082483, 0, 0, 0, 0, 0, 0, -0.4082483, 0, 0, -4e-07, 4e-07,
4e-07, 4e-07, 4e-07, 4e-07, 4e-07, 4e-07)
result
$multivariateNormality
$univariateNormality
$Descriptives
$multivariateOutliers
# remove outliers from dataset.
result_clean <- mvn(happy_nums, multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)
The covariance matrix of the data is singular.
There are 137 observations (in the entire dataset of 137 obs.) lying on the hyperplane
with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip - m_p) = 0 with (m_1, ..., m_p) the
mean of these observations and coefficients a_i from the vector a <- c(0.8164966, 0,
0, 0, 0, 0, 0, 0, 0, -0.4082483, 0, 0, 0, 0, 0, 0, -0.4082483, 0, 0, -4e-07, 4e-07,
4e-07, 4e-07, 4e-07, 4e-07, 4e-07, 4e-07)
dim(result_clean$newData)
[1] 135 27
Broadly, there are two ways of dealing with outliers: removing them from the dataset, or transforming the data.
As the outliers make up only about 1.5% of the sample (2 of 137 observations), they can be dealt with by simply removing them from our dataset. The cleaned data (result_clean$newData) has 135 observations and 27 variables.
We also tried a data transformation as an alternative way of reducing the influence of outliers. We chose one variable for this, the Human Freedom Index score (hf_score), and observed that it has a left-skewed distribution. We applied a square transformation to the variable and found that its distribution becomes more symmetrical, and closer to a normal distribution, as shown in the histograms below.
#Data transformation using square transformation:
freedomsquare <- (hfi_happy_imp$hf_score)^2
#Before transformation:
#histogram
qplot(hfi_happy_imp$hf_score,
      geom = "histogram",
      bins = 30,  # the misspelled `bindwith` was ignored, so ggplot2's default of 30 bins applied
      main = "Human Freedom Index Histogram",
      xlab = "Human Freedom Score",
      fill = I("light blue"),
      col = I("turquoise"),
      alpha = I(1))
#QQ plot
qqnorm(hfi_happy_imp$hf_score, main = "Human Freedom Index Score before data transformation")
qqline(hfi_happy_imp$hf_score, col = "deeppink")
#after data transformation
#histogram
qplot(freedomsquare,
      geom = "histogram",
      bins = 30,  # the misspelled `bindwith` was ignored, so ggplot2's default of 30 bins applied
      main = "Normalised Human Freedom Index Histogram",
      xlab = "Human Freedom Score",
      fill = I("palegreen"),
      col = I("deepskyblue"),
      alpha = I(1))
#QQ plot
qqnorm(freedomsquare, main = "Normalised Human Freedom Index Score")
qqline(freedomsquare, col = "lime green")
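The effect of a square transformation on a left-skewed variable can also be checked numerically instead of by eye. The sketch below uses simulated scores, not the HFI data, and the helper `skewness()` is a hypothetical function defined here for illustration only.

```r
# Hypothetical left-skewed scores on a 0-10 scale (simulated, not the HFI data)
set.seed(123)
score <- 10 * rbeta(500, shape1 = 5, shape2 = 2)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Negative skewness indicates a longer left tail; values near 0 indicate symmetry
skew_before <- skewness(score)
skew_after <- skewness(score^2)
c(before = skew_before, after = skew_after)
```

If the transformation helps, the skewness after squaring should be closer to zero than before. The same comparison could be run on hfi_happy_imp$hf_score and freedomsquare.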