Required packages
# This is the R chunk for the required packages
library(readr)
library(tidyr)
library(dplyr)
library(magrittr)
library(outliers)
library(forecast)
Executive Summary
Data preprocessing was performed on a dataset combining the happiness score and unemployment rate of countries in 2016, in order to investigate how they relate to factors including income group, region and GDP per capita.
The data wrangling process began by importing the 2016 World Happiness Report dataset and the World Bank unemployment rate dataset (1960-2019). These datasets were then filtered by selecting the relevant variables and merged into a new, tidy dataframe for further manipulation.
The structure of the new dataframe was then checked to understand the variables, and the appropriate data type conversions were performed. The numeric variables (unemployment rate, happiness score, GDP per capita) were scanned for missing values, special values and outliers. Finally, transformations were applied to reduce the skewness of non-symmetrically distributed variables.
Data
The first dataset is the World Happiness Report 2016, downloaded from https://www.kaggle.com/unsdsn/world-happiness?select=2016.csv under a CC0: Public Domain license. The report ranks 155 countries by their happiness levels based on a range of factors including economy (GDP per capita), freedom and health. The dataset is tidy: each variable has its own column, each observation has its own row and each value has its own cell.
The second and third datasets come from the World Bank unemployment data (1960-2019), downloaded from https://data.worldbank.org/indicator/SL.UEM.TOTL.NE.ZS under a World Bank dataset license. The second dataset contains the unemployment rate (%) reported for each country per year. The third dataset contains information on the countries in the second dataset, such as country code, name, region and income group. The second dataset is untidy because the columns "1960" to "2019" are values of a variable "year" rather than variables themselves. It was not reshaped here since only the 2016 values are of interest in this investigation. If the dataset were to be used for other purposes, it would need to be reshaped with the tidyr gather() function, for example: gather(dataset, key = "year", value = "unemployment_rate", `1960`:`2019`)
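The reshaping step described above can be sketched on a small stand-in table. The data frame `wide` and its values are invented for illustration and are not taken from the World Bank file. In newer tidyr versions pivot_longer() is the suggested successor to gather(), so treat this as one possible form rather than the required one.

```r
library(tidyr)

# Hypothetical stand-in for the wide World Bank table (values are made up)
wide <- data.frame(
  `Country Name` = c("A", "B"),
  `2015` = c(5.1, 7.3),
  `2016` = c(4.9, 7.8),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collapse the year columns into one "year" column and one value column
long <- gather(wide, key = "year", value = "unemployment_rate", `2015`:`2016`)
long$year <- as.integer(long$year)  # the former column names arrive as text
long
```

Each country now contributes one row per year, which is the tidy layout the paragraph refers to.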
# This is the R chunk for the Data Section
happy_2016<-read_csv("Happiness_world.csv")
Parsed with column specification:
cols(
Country = col_character(),
Region = col_character(),
`Happiness Rank` = col_double(),
`Happiness Score` = col_double(),
`Lower Confidence Interval` = col_double(),
`Upper Confidence Interval` = col_double(),
`Economy (GDP per Capita)` = col_double(),
Family = col_double(),
`Health (Life Expectancy)` = col_double(),
Freedom = col_double(),
`Trust (Government Corruption)` = col_double(),
Generosity = col_double(),
`Dystopia Residual` = col_double()
)
dfunemploy<-read_csv("Unemployment_world.csv")
Parsed with column specification:
cols(
.default = col_logical(),
`Country Name` = col_character(),
`Country Code` = col_character(),
`Indicator Name` = col_character(),
`Indicator Code` = col_character(),
`1991` = col_double(),
`1992` = col_double(),
`1993` = col_double(),
`1994` = col_double(),
`1995` = col_double(),
`1996` = col_double(),
`1997` = col_double(),
`1998` = col_double(),
`1999` = col_double(),
`2000` = col_double(),
`2001` = col_double(),
`2002` = col_double(),
`2003` = col_double(),
`2004` = col_double(),
`2005` = col_double(),
`2006` = col_double()
# ... with 13 more columns
)
See spec(...) for full column specifications.
dfcountry <- read_csv("Country_world.csv")
Parsed with column specification:
cols(
`Country Code` = col_character(),
Region = col_character(),
IncomeGroup = col_character(),
SpecialNotes = col_character(),
TableName = col_character()
)
head(happy_2016)
head(dfunemploy)
head(dfcountry)
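The column specification printed for the unemployment file shows `.default = col_logical()`, which is the type readr falls back to for columns that contain no values in the rows it inspects. A minimal sketch of declaring the types explicitly is given below. The temporary file and its contents are invented for illustration, and this is only one possible way to avoid the guess.

```r
library(readr)

# Hypothetical miniature of the unemployment file: the early year is empty
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country Name,1960,2016", "A,,4.9", "B,,7.8"), tmp)

# Declare every column as double except the country name
mini <- read_csv(
  tmp,
  col_types = cols(.default = col_double(), `Country Name` = col_character())
)
sapply(mini, class)
```

With explicit types, an empty year column is read as numeric NA rather than logical NA, which keeps later numeric operations consistent.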
Tidy & Manipulate Data I
Only the variables required for further analysis are selected from each dataset.
Mutating joins are then used to merge the three datasets into a new one.
# This is the R chunk for the Tidy & Manipulate Data I
#select data for 2016 from the datasets for further analysis
happy_2016 <- happy_2016 %>% select(c("Country", "Happiness Score", "Happiness Rank", "Economy (GDP per Capita)"))
unemploy_2016 <- dfunemploy %>% select(c("Country Name", "Country Code", "2016"))
dfcountry <- dfcountry %>% select(c("Country Code", "Region", "IncomeGroup"))
#left join the country information by country code, then inner join the happiness data by country name
df <- unemploy_2016 %>% left_join(dfcountry, by = "Country Code") %>% inner_join(happy_2016, by = c("Country Name" = "Country"))
head(df)
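Because the happiness data is joined by country name, any name spelled differently in the two sources is silently dropped by inner_join(). The merged dataframe has 138 rows, fewer than the number of countries in the happiness report, so a check of unmatched names may be worthwhile. The sketch below uses two invented miniature tables. The names and values are assumptions for illustration, not rows verified against the real files.

```r
library(dplyr)

# Hypothetical miniature tables with one country spelled differently
happy <- data.frame(Country = c("Denmark", "Russia"),
                    score = c(7.5, 5.9),
                    stringsAsFactors = FALSE)
unemp <- data.frame(`Country Name` = c("Denmark", "Russian Federation"),
                    rate = c(6.2, 5.6),
                    check.names = FALSE, stringsAsFactors = FALSE)

# Rows of happy with no partner in unemp, i.e. rows inner_join() would drop
unmatched <- anti_join(happy, unemp, by = c("Country" = "Country Name"))
unmatched
```

Names found this way could be recoded in one of the tables before joining, or the join could be based on a shared country code where one is available.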
Tidy & Manipulate Data II
Assign new column names to the merged dataset. Then sort the countries by unemployment rate and store their position in a new column, Unemployment_rank, which will be used in further analysis alongside the happiness ranking.
# Names follow the joined column order, in which `Happiness Score` comes before `Happiness Rank`
colnames(df) <- c("Country", "Country_code", "Unemployment_rate", "Region","Income_group", "Happy_score", "Happy_rank", "GDPperCapita")
df <- df %>% arrange(Unemployment_rate) %>% mutate(Unemployment_rank =row_number ())
df <- df %>% select(c("Country", "Region","Income_group","GDPperCapita", "Happy_score", "Happy_rank","Unemployment_rate", "Unemployment_rank"))
head(df)
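Note that row_number() without arguments numbers rows by position after arrange(), so tied rates receive different ranks and a row with a missing rate, which arrange() sorts last, still receives a rank. A small comparison on made-up values shows how the ranking helpers in dplyr behave when they are given the values directly. This is a sketch of an alternative, not the method used in this report.

```r
library(dplyr)

rates <- c(5.2, NA, 3.1, 5.2)  # made-up values with a tie and a missing value

rn <- row_number(rates)  # ties broken by position, NA stays NA
mr <- min_rank(rates)    # ties share the lowest rank, NA stays NA

data.frame(rates, rn, mr)
```

Passing the column to min_rank() inside mutate() would therefore leave the rank of a country with a missing rate as NA instead of placing it last.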
Understand
To understand the merged dataset, the structure of the dataframe was examined. Region and income group are categorical and were converted to factors, with income group ordered from low to high. The ranking variables were converted to ordered factors.
After type conversion, the structure of the dataframe was checked again to confirm that the desired data types were in place.
# This is the R chunk for the Understand Section
#structure before type conversion
glimpse(df)
Observations: 138
Variables: 8
$ Country <chr> "Qatar", "Niger", "Thailand", ...
$ Region <chr> "Middle East & North Africa", ...
$ Income_group <chr> "High income", "Low income", "...
$ GDPperCapita <dbl> 1.82427, 0.13270, 1.08930, 0.5...
$ Happy_score <dbl> 36, 142, 33, 140, 42, 152, 119...
$ Happy_rank <dbl> 6.375, 3.856, 6.474, 3.907, 6....
$ Unemployment_rate <dbl> 0.150, 0.505, 0.688, 0.716, 0....
$ Unemployment_rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,...
# Type conversion
df$Region <- as.factor(df$Region)
levels(df$Income_group) # the column is still a character vector here, so it has no levels yet
NULL
df$Happy_rank <- factor(df$Happy_rank, ordered = TRUE)
df$Unemployment_rank <- factor (df$Unemployment_rank, ordered = TRUE)
df$Income_group <- factor(df$Income_group, levels = c("Low income", "Lower middle income", "Upper middle income", "High income"), ordered = TRUE)
#structure after type conversion
glimpse(df)
Observations: 138
Variables: 8
$ Country <chr> "Qatar", "Niger", "Thailand", ...
$ Region <fct> Middle East & North Africa, Su...
$ Income_group <ord> High income, Low income, Upper...
$ GDPperCapita <dbl> 1.82427, 0.13270, 1.08930, 0.5...
$ Happy_score <dbl> 36, 142, 33, 140, 42, 152, 119...
$ Happy_rank <ord> 6.375, 3.856, 6.474, 3.907, 6....
$ Unemployment_rate <dbl> 0.150, 0.505, 0.688, 0.716, 0....
$ Unemployment_rank <ord> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,...
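The practical effect of declaring income group as an ordered factor can be seen in a short sketch. The three example values are chosen for illustration only. The level order is the one used in the conversion above.

```r
income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")
income <- factor(c("High income", "Low income", "Upper middle income"),
                 levels = income_levels, ordered = TRUE)

below_high <- income < "High income"  # element-wise comparison against a level
top_group <- max(income)              # highest income group present
table(income)                         # counts follow the level order, not the alphabet
```

Comparisons and max() are only defined because the factor is ordered. On an unordered factor they return NA with a warning or raise an error.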
Scan I
The dataframe was then scanned for missing values.
# Scan for missing values:
colSums(is.na(df))
Country Region Income_group
0 0 0
GDPperCapita Happy_score Happy_rank
0 0 0
Unemployment_rate Unemployment_rank
1 0
The results indicate one missing value in the unemployment rate. Since only a single value is missing, it was imputed with the mean unemployment rate of the remaining countries.
# Handle missing values
df$Unemployment_rate[is.na(df$Unemployment_rate)] <- mean(df$Unemployment_rate, na.rm = TRUE)
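The imputation pattern can be illustrated on a small made-up vector. The median variant is shown as a possible alternative that is less affected by extreme values. It is not what this report uses.

```r
x <- c(4, 6, NA, 10)  # made-up rates with one missing value

# Mean imputation, as in the chunk above
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)

# Median imputation, an alternative for skewed variables
x_median <- x
x_median[is.na(x_median)] <- median(x, na.rm = TRUE)

rbind(x_mean, x_median)
```

In both cases na.rm = TRUE is essential, because mean() and median() return NA if the missing value is left in.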
The dataset was also checked for special values (Inf, -Inf, NaN), since calculations involving special values produce further special values, which need to be found and handled.
A function was created to flag special values, and sapply() was used to apply it to the numeric columns of the dataframe.
# Scan for special values:
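# Flag Inf, -Inf and NaN; non-numeric input yields all FALSE instead of NULL
check_special <- function(x){if (is.numeric(x)) (is.infinite(x) | is.nan(x)) else rep(FALSE, length(x))}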
df %>% select('GDPperCapita','Happy_score', 'Unemployment_rate') %>% sapply(function(x) sum(check_special(x)))
GDPperCapita Happy_score Unemployment_rate
0 0 0
Results showed that there were no special values in the dataframe.
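For reference, the short sketch below shows how special values arise and how the base R predicates treat them. The vector is constructed for illustration only.

```r
v <- c(1, 0/0, 1/0, -1/0, NA)  # 1, NaN, Inf, -Inf, NA

is.nan(v)       # TRUE only for NaN
is.infinite(v)  # TRUE for Inf and -Inf
is.na(v)        # TRUE for both NA and NaN

n_special <- sum(is.infinite(v) | is.nan(v))
n_special
```

Since is.na() is also TRUE for NaN, a missing-value scan alone cannot tell NaN from NA, which is why a separate check for special values is useful.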
Scan II
The next step was to scan the numerical variables (unemployment rate, happy score and GDP) of the dataframe for outliers. Two approaches were used.
- The first approach was to check the descriptive statistics (summary) of each numeric variable, screening for abnormal values.
# Scan for outliers using descriptive statistics:
df %>% select("Happy_score", "Unemployment_rate", "GDPperCapita") %>% summary()
Happy_score Unemployment_rate GDPperCapita
Min. : 1.00 Min. : 0.150 Min. :0.0000
1st Qu.: 36.25 1st Qu.: 3.905 1st Qu.:0.6891
Median : 76.50 Median : 5.954 Median :1.0291
Mean : 76.96 Mean : 7.372 Mean :0.9606
3rd Qu.:116.75 3rd Qu.: 9.570 3rd Qu.:1.2787
Max. :157.00 Max. :26.551 Max. :1.8243
The results show that the maximum unemployment rate (26.551) is nearly three times its 3rd quartile (9.570) and lies well above the upper outlier fence Q3 + 1.5 * IQR, so it appears to be an outlier. The GDP data also has a minimum value of 0, which could be an outlier worth further exploration.
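The fence arithmetic behind this observation can be worked through with the quartiles printed in the summary() output. The figures are rounded as displayed, so the fences are approximate.

```r
# Quartiles of Unemployment_rate as printed by summary() (rounded)
q1 <- 3.905
q3 <- 9.570

iqr <- q3 - q1                 # 5.665
upper_fence <- q3 + 1.5 * iqr  # 18.0675
lower_fence <- q1 - 1.5 * iqr  # -4.5925

max_is_outlier <- 26.551 > upper_fence  # the maximum lies beyond the upper fence
max_is_outlier
```

The lower fence is negative, so no unemployment rate can fall below it. Only the upper fence matters for this variable.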
- To examine these candidates further, a second approach, Tukey’s method using boxplots, was used to scan for outliers in the happiness score, unemployment rate and GDP per capita.
# Scan for outliers using boxplot:
unemploy_out <- boxplot(df$Unemployment_rate,main ="Unemployment Rate Box Plot", ylab="Unemployment Rate", col = "grey")

happy_out <- boxplot(df$Happy_score,main ="Happy Score Box Plot", ylab="Happy Score", col = "pink")

GDP_out <- boxplot(df$GDPperCapita,main ="GDP Box Plot", ylab="GDP per Capita", col = "light blue")

The boxplots indicate that there are no outliers in the happiness score and GDP data, but there are some outliers in the unemployment rate data. These were handled using Tukey’s method, in which values outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR are treated as outliers. Since the boxplot showed outliers only at the high end, only the upper fence was calculated, and the outliers were removed using the subset() function.
#Handle the outliers:
HighFence <- quantile(df$Unemployment_rate, .75) + 1.5 * IQR(df$Unemployment_rate)
df <- subset(df, df$Unemployment_rate < HighFence)
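The points a boxplot flags can also be listed directly. The sketch below uses boxplot.stats() from base R on made-up values. The list returned by boxplot() carries the same information in its `out` element. Note that boxplots are based on Tukey's hinges, which can differ slightly from the quantile() quartiles used for HighFence above.

```r
x <- c(1, 2, 3, 4, 5, 100)  # made-up values with one extreme point

stats <- boxplot.stats(x)   # default coef = 1.5, as in Tukey's rule
stats$out                   # points beyond the whiskers

# Keep only the points that are not flagged
x_clean <- x[!x %in% stats$out]
x_clean
```

Listing the flagged values before removing them makes it easier to judge whether they are data errors or genuine extreme observations.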
Transform
Data transformation is often used to prepare data for further statistical analysis. It can aid understanding of the data by rescaling or standardising the values. It can also reduce skewness to achieve a more symmetric distribution, which is preferred in many statistical analyses.
- A histogram is used to display the distribution of the Unemployment_rate data:
# Histogram of unemployment rate
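As a brief illustration of the rescaling and standardising mentioned here, the sketch below applies a z-score and a min-max transformation to made-up values. Neither is applied to the report's dataframe.

```r
x <- c(2, 4, 6, 8, 10)  # made-up values

z <- (x - mean(x)) / sd(x)              # z-score: mean 0, standard deviation 1
mm <- (x - min(x)) / (max(x) - min(x))  # min-max: rescaled to the range 0 to 1

round(cbind(x, z, mm), 3)
```

Both transformations are linear, so they change the scale of a variable but not the shape of its distribution. Reducing skewness requires a non-linear transformation such as a logarithm or a root.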
hist(df$Unemployment_rate, main="Unemployment Rate Distribution", xlab = "Unemployment Rate %",breaks = 10)

Results showed that the unemployment rate is right-skewed.
To reduce the right skewness, a logarithmic transformation was attempted and the histogram was checked:
# Logarithm:
log_unemploy <- log10 (df$Unemployment_rate)
# Histogram
hist(log_unemploy, main="Unemployment Rate Distribution - log transformed", xlab = "log10(Unemployment Rate %)",breaks = 10)

Results showed that the log-transformed data was left-skewed, which means the logarithm was too strong and the data was over-transformed.
A milder alternative, the square root transformation, was tried instead:
# Taking square root:
sqrt_unemploy <- sqrt(df$Unemployment_rate)
# Histogram
hist(sqrt_unemploy, main="Unemployment Rate Distribution - sqrt transformed", xlab = "sqrt(Unemployment Rate %)",breaks = 10)

Results showed that the unemployment rate distribution was approximately symmetric, which means the data transformation was successful. The dataframe was updated with the transformed data:
# Update the unemployment rate variable
df$Unemployment_rate <- sqrt(df$Unemployment_rate)
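The histogram comparison can be supplemented with a numeric skewness measure. The helper `moment_skew` below is a hypothetical function written for this sketch, using the moment-based definition m3 / m2^1.5. The data is constructed so that its logarithm is exactly symmetric, and is not the report's unemployment column.

```r
# Hypothetical helper: moment-based sample skewness (0 for symmetric data)
moment_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

x <- 10^c(-2, -1, 0, 1, 2)  # constructed right-skewed values

s_raw <- moment_skew(x)         # clearly positive: right-skewed
s_sqrt <- moment_skew(sqrt(x))  # still positive, but smaller
s_log <- moment_skew(log10(x))  # essentially zero: symmetric

c(raw = s_raw, sqrt = s_sqrt, log10 = s_log)
```

Positive values indicate right skew and negative values left skew, so a transformation that moves the value closest to zero is the most symmetric choice. On this constructed data the logarithm wins, whereas in the report the histograms favoured the square root.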
- The histogram of happy_score data is shown below:
# Histogram
hist(df$Happy_score , main="Happy Score Distribution", xlab = "Happy Score",breaks = 10)

The data did not show a clear distribution shape. Several transformation methods were tried:
1) Logarithm:
# logarithm transformation:
log_happy <- log10(df$Happy_score)
# Histogram
hist(log_happy , main="Happy Score Distribution - log transformed", xlab = "Happy Score",breaks = 10)

2) Square:
# square transformation:
square_happy <- df$Happy_score^2
# Histogram
hist(square_happy , main="Happy Score Distribution - square transformed", xlab = "Happy Score",breaks = 10)

3) Square root:
# square root transformation:
sqrt_happy <- sqrt(df$Happy_score)
# Histogram
hist(sqrt_happy , main="Happy Score Distribution - sqrt transformed", xlab = "Happy Score",breaks = 10)

4) Box-Cox:
# Box-Cox transformation
bc_happy <- BoxCox(df$Happy_score, lambda = "auto")
# Histogram
hist(bc_happy , main="Happy Score Distribution - BoxCox Transformed", xlab = "Happy Score",breaks = 10)
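With lambda = "auto", BoxCox() chooses the power parameter from the data. To show what the transformation does for a given lambda, the sketch below writes out the standard Box-Cox formula for positive values. The function `box_cox` is a hypothetical helper for illustration, not the forecast package's implementation.

```r
# Hypothetical helper: standard Box-Cox transform for strictly positive x
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

bc_half <- box_cox(c(1, 4, 9), lambda = 0.5)  # square-root-like: 0 2 4
bc_zero <- box_cox(exp(1), lambda = 0)        # natural logarithm: 1
bc_one <- box_cox(c(1, 4, 9), lambda = 1)     # shift only: 0 3 8
```

The logarithm (lambda = 0), square root (lambda = 0.5) and square (lambda = 2) tried above are therefore, up to a linear rescaling, special cases of the same family.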

The happiness score data showed a clearer distribution after transformation.
However, each transformation produced either right-skewed or left-skewed data. The square root transformation gave only moderately left-skewed data, so it was chosen as the transformation method.
# Update the happy score variable
df$Happy_score <- sqrt(df$Happy_score)
- The histogram of GDP data is shown below:
# Histogram
hist(df$GDPperCapita , main="GDP per Capita Distribution", xlab = "GDP per Capita",breaks = 10)

The GDP per capita data was approximately normally distributed, so no transformation was performed.
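A visual judgement of normality can be backed up with a formal check. The sketch below runs the Shapiro-Wilk test from base R on simulated data. The vector `x` is generated for illustration and is not the GDP column, so the result says nothing about the report's data.

```r
set.seed(1)
x <- rnorm(100)  # simulated values, an assumption for this sketch

res <- shapiro.test(x)
res$statistic  # W close to 1 is consistent with normality
res$p.value    # a small p-value would speak against normality
```

Applied to df$GDPperCapita, the same call would give a numeric complement to the histogram. A normal Q-Q plot via qqnorm() and qqline() is another common option.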
Summary
Information on the unemployment rate and happiness score of countries in 2016 was gathered. Data cleansing and transformation were performed. The dataset is ready for further exploratory analysis.
