In the past few years, Covid-19 has led to a dramatic loss of human life worldwide. Countries have also suffered impacts on public health, food systems and employment. In other words, the pandemic has cost millions of lives, while millions more people suffer job losses and the after-effects of Covid-19. Hence, recovery from the impact of Covid-19 will be a long journey.
This raises a question about the world happiness rating, the Life Ladder published by the Sustainable Development Solutions Network. This indicator is recognized worldwide as a gauge of a country's happiness. Although we are still suffering the after-effects of Covid-19, countries are working hard on economic recovery so that people can return to their normal lives. We would therefore like to predict happiness so that we can gauge how long a country needs for its economic recovery; in other words, we will get a picture of when a country will bounce back to normal life.
In our assignment, we will use the model provided in the World Happiness Report to predict a country's happiness. The model is built on the following indicators: GDP, life expectancy, generosity, social support, freedom, and corruption. Due to time constraints, we are not able to cover other indicators such as income level, unemployment rate, poverty rate and other factors that could be affected by Covid-19.
There are three objectives we would like to achieve through this project. The objectives are as follows:
The data source for our project is the World Happiness Report, an annual report published by the Sustainable Development Solutions Network of the United Nations. The report ranks countries by the happiness ladder based on survey data from the Gallup World Poll. The dataset we used ranges from 2005 to 2020 and consists of 166 countries with 11 variables. The variables are as follows:
library(dplyr)
library(naniar)
library(ggplot2)
library(visdat)
library(reshape2)
library(ggpubr)
library(CatEncoders)
library(smotefamily)
library(caret)
library(MLmetrics)
library(adabag)
library(gbm)
library(cvms)
library(caret)
library(randomForest)
library(rpart)
library(e1071)
library(rpart.plot)
library(lares)
library(rsq)
library(cowplot)
Load the data using read.csv and check the dataframe structure. Note that most of the variables are numeric; only Country.name is character and year is integer.
myurl <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTNujIfWRdX0iF0aN25fB-2alBJc7mL40DDuNqL_92Ch--CnnbOFTHgShiboxRT90cgdyYYIhk4jQDs/pub?output=csv"
happiness_raw <- read.csv(url(myurl))
## Display the structure of data frame
str(happiness_raw)
## 'data.frame': 1949 obs. of 11 variables:
## $ Country.name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 ...
## $ Life.Ladder : num 3.72 4.4 4.76 3.83 3.78 ...
## $ Log.GDP.per.capita : num 7.37 7.54 7.65 7.62 7.71 ...
## $ Social.support : num 0.451 0.552 0.539 0.521 0.521 0.484 0.526 0.529 0.559 0.491 ...
## $ Healthy.life.expectancy.at.birth: num 50.8 51.2 51.6 51.9 52.2 ...
## $ Freedom.to.make.life.choices : num 0.718 0.679 0.6 0.496 0.531 0.578 0.509 0.389 0.523 0.427 ...
## $ Generosity : num 0.168 0.19 0.121 0.162 0.236 0.061 0.104 0.08 0.042 -0.121 ...
## $ Perceptions.of.corruption : num 0.882 0.85 0.707 0.731 0.776 0.823 0.871 0.881 0.793 0.954 ...
## $ Positive.affect : num 0.518 0.584 0.618 0.611 0.71 0.621 0.532 0.554 0.565 0.496 ...
## $ Negative.affect : num 0.258 0.237 0.275 0.267 0.268 0.273 0.375 0.339 0.348 0.371 ...
## Display the summary of the dataset
summary(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita
## Length:1949 Min. :2005 Min. :2.375 Min. : 6.635
## Class :character 1st Qu.:2010 1st Qu.:4.640 1st Qu.: 8.464
## Mode :character Median :2013 Median :5.386 Median : 9.460
## Mean :2013 Mean :5.467 Mean : 9.368
## 3rd Qu.:2017 3rd Qu.:6.283 3rd Qu.:10.353
## Max. :2020 Max. :8.019 Max. :11.648
## NA's :36
## Social.support Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
## Min. :0.2900 Min. :32.30 Min. :0.2580
## 1st Qu.:0.7498 1st Qu.:58.69 1st Qu.:0.6470
## Median :0.8355 Median :65.20 Median :0.7630
## Mean :0.8126 Mean :63.36 Mean :0.7426
## 3rd Qu.:0.9050 3rd Qu.:68.59 3rd Qu.:0.8560
## Max. :0.9870 Max. :77.10 Max. :0.9850
## NA's :13 NA's :55 NA's :32
## Generosity Perceptions.of.corruption Positive.affect Negative.affect
## Min. :-0.3350 Min. :0.0350 Min. :0.3220 Min. :0.0830
## 1st Qu.:-0.1130 1st Qu.:0.6900 1st Qu.:0.6255 1st Qu.:0.2060
## Median :-0.0255 Median :0.8020 Median :0.7220 Median :0.2580
## Mean : 0.0001 Mean :0.7471 Mean :0.7100 Mean :0.2685
## 3rd Qu.: 0.0910 3rd Qu.:0.8720 3rd Qu.:0.7990 3rd Qu.:0.3200
## Max. : 0.6980 Max. :0.9830 Max. :0.9440 Max. :0.7050
## NA's :89 NA's :110 NA's :22 NA's :16
## Check the first 6 rows of the dataset
head(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita Social.support
## 1 Afghanistan 2008 3.724 7.370 0.451
## 2 Afghanistan 2009 4.402 7.540 0.552
## 3 Afghanistan 2010 4.758 7.647 0.539
## 4 Afghanistan 2011 3.832 7.620 0.521
## 5 Afghanistan 2012 3.783 7.705 0.521
## 6 Afghanistan 2013 3.572 7.725 0.484
## Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
## 1 50.80 0.718 0.168
## 2 51.20 0.679 0.190
## 3 51.60 0.600 0.121
## 4 51.92 0.496 0.162
## 5 52.24 0.531 0.236
## 6 52.56 0.578 0.061
## Perceptions.of.corruption Positive.affect Negative.affect
## 1 0.882 0.518 0.258
## 2 0.850 0.584 0.237
## 3 0.707 0.618 0.275
## 4 0.731 0.611 0.267
## 5 0.776 0.710 0.268
## 6 0.823 0.621 0.273
## Display the columns of the dataset
colnames(happiness_raw)
## [1] "Country.name" "year"
## [3] "Life.Ladder" "Log.GDP.per.capita"
## [5] "Social.support" "Healthy.life.expectancy.at.birth"
## [7] "Freedom.to.make.life.choices" "Generosity"
## [9] "Perceptions.of.corruption" "Positive.affect"
## [11] "Negative.affect"
First, we drop the last two columns, Positive.affect and Negative.affect, which are not needed in our modelling. The dataset consists of 1949 rows, some of which contain missing values. Next, we remove the 237 rows (12.2%) that contain missing values.
## Remove the columns that are not relevant
happiness_raw <- subset(happiness_raw, select = -c(Positive.affect, Negative.affect))
## Check the total number of rows of dataset
nrow(happiness_raw)
## [1] 1949
## Check whether the dataset has any missing values
any(is.na(happiness_raw))
## [1] TRUE
## Visualize the missing data
data1 <- c(sum(is.na(happiness_raw)),sum(!is.na(happiness_raw)))
mylabel <- c("Missing","Present")
pct <- prop.table(data1)*100
mylabel2 <- paste(round(pct, digits = 1),"%")
mylabel3 <- paste(mylabel, mylabel2)
data2 <- setNames(data1,mylabel3)
pie(data2, main = "Figure 1 - Missingness of World Happiness Report")
## Check whether the dataset has duplicated values
sum(duplicated(happiness_raw))
## [1] 0
## Remove the rows with missing values from the dataset
happiness_raw <-na.omit(happiness_raw)
## Recheck the total number of rows of dataset
nrow(happiness_raw)
## [1] 1712
## Re-verify that there are no missing values
any(is.na(happiness_raw))
## [1] FALSE
## Total omitted rows: 237, i.e. 12.2% of the dataset (237/1949)
We add one more column called “Rank” to categorize the happiness level using the Life Ladder. A Life Ladder value above 7 is classified as High, below 3.5 as Low, and everything in between as Medium.
## Categorize the happiness level by adding new column rating
happiness_raw <-mutate(happiness_raw,Rank = if_else(Life.Ladder >7,"High",if_else(Life.Ladder <3.5,"Low","Medium")))
summary(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita
## Length:1712 Min. :2005 Min. :2.375 Min. : 6.635
## Class :character 1st Qu.:2010 1st Qu.:4.595 1st Qu.: 8.394
## Mode :character Median :2013 Median :5.359 Median : 9.456
## Mean :2013 Mean :5.445 Mean : 9.320
## 3rd Qu.:2017 3rd Qu.:6.252 3rd Qu.:10.268
## Max. :2020 Max. :7.971 Max. :11.648
## Social.support Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
## Min. :0.2900 Min. :32.30 Min. :0.2580
## 1st Qu.:0.7410 1st Qu.:58.17 1st Qu.:0.6440
## Median :0.8345 Median :65.10 Median :0.7575
## Mean :0.8102 Mean :63.22 Mean :0.7395
## 3rd Qu.:0.9080 3rd Qu.:68.68 3rd Qu.:0.8520
## Max. :0.9870 Max. :77.10 Max. :0.9850
## Generosity Perceptions.of.corruption Rank
## Min. :-0.3350000 Min. :0.0350 Length:1712
## 1st Qu.:-0.1112500 1st Qu.:0.6970 Class :character
## Median :-0.0255000 Median :0.8060 Mode :character
## Mean :-0.0007173 Mean :0.7511
## 3rd Qu.: 0.0890000 3rd Qu.:0.8750
## Max. : 0.6890000 Max. :0.9830
head(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita Social.support
## 1 Afghanistan 2008 3.724 7.370 0.451
## 2 Afghanistan 2009 4.402 7.540 0.552
## 3 Afghanistan 2010 4.758 7.647 0.539
## 4 Afghanistan 2011 3.832 7.620 0.521
## 5 Afghanistan 2012 3.783 7.705 0.521
## 6 Afghanistan 2013 3.572 7.725 0.484
## Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
## 1 50.80 0.718 0.168
## 2 51.20 0.679 0.190
## 3 51.60 0.600 0.121
## 4 51.92 0.496 0.162
## 5 52.24 0.531 0.236
## 6 52.56 0.578 0.061
## Perceptions.of.corruption Rank
## 1 0.882 Medium
## 2 0.850 Medium
## 3 0.707 Medium
## 4 0.731 Medium
## 5 0.776 Medium
## 6 0.823 Medium
boxplot(happiness_raw$Life.Ladder, main = "Life Ladder")
boxplot(happiness_raw$Log.GDP.per.capita, main = "Log GDP per Capita")
## We may see some outliers in the columns below. We do not treat them as outliers, as the values are measured per country.
## The values differ between well-developed, developing and under-developed countries,
## hence some values fall outside the whiskers of the box plot.
boxplot(happiness_raw$Social.support, main = "Social Support")
boxplot(happiness_raw$Healthy.life.expectancy.at.birth, main = "Healthy life expectancy at birth")
boxplot(happiness_raw$Freedom.to.make.life.choices, main = "Freedom to make life choices")
boxplot(happiness_raw$Generosity, main = "Generosity")
boxplot(happiness_raw$Perceptions.of.corruption, main = "Perceptions of corruption")
# Explore the structure of the dataframe
str(happiness_raw)
## 'data.frame': 1712 obs. of 10 variables:
## $ Country.name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 ...
## $ Life.Ladder : num 3.72 4.4 4.76 3.83 3.78 ...
## $ Log.GDP.per.capita : num 7.37 7.54 7.65 7.62 7.71 ...
## $ Social.support : num 0.451 0.552 0.539 0.521 0.521 0.484 0.526 0.529 0.559 0.491 ...
## $ Healthy.life.expectancy.at.birth: num 50.8 51.2 51.6 51.9 52.2 ...
## $ Freedom.to.make.life.choices : num 0.718 0.679 0.6 0.496 0.531 0.578 0.509 0.389 0.523 0.427 ...
## $ Generosity : num 0.168 0.19 0.121 0.162 0.236 0.061 0.104 0.08 0.042 -0.121 ...
## $ Perceptions.of.corruption : num 0.882 0.85 0.707 0.731 0.776 0.823 0.871 0.881 0.793 0.954 ...
## $ Rank : chr "Medium" "Medium" "Medium" "Medium" ...
## - attr(*, "na.action")= 'omit' Named int [1:237] 26 29 30 67 111 112 113 114 115 116 ...
## ..- attr(*, "names")= chr [1:237] "26" "29" "30" "67" ...
# Explore the max and min happiness in the dataset
max_happiness_summary <- happiness_raw %>% group_by(Country.name) %>% summarise (max_happiness=max(Life.Ladder)) %>% arrange(desc(max_happiness))
min_happiness_summary <- happiness_raw %>% group_by(Country.name) %>% summarise (min_happiness=min(Life.Ladder)) %>% arrange(min_happiness)
Top_6_highest <- head(max_happiness_summary)
Top_6_lowest <- head(min_happiness_summary)
ggplot(Top_6_highest, aes(x=reorder(Country.name, max_happiness), y=max_happiness, fill=Country.name)) +
geom_bar(stat="identity", width=0.5) +
coord_cartesian(ylim=c(7,8)) +
labs(y= "Happiness", x = "Country Name") +
ggtitle("Graph of Happiness vs Country Name (Highest 6)")
ggplot(Top_6_lowest, aes(x=reorder(Country.name, -min_happiness), y=min_happiness, fill=Country.name)) +
geom_bar(stat="identity", width=0.5) +
coord_cartesian(ylim=c(2,3)) +
labs(y= "Happiness", x = "Country Name") +
ggtitle("Graph of Happiness vs Country Name (Lowest 6)")
----------------------------- Q1 REGRESSION PROBLEM -----------------------------
# Drop Country.name, year and rank from dataframe
happiness_regression <- happiness_raw %>% select(-Country.name,-year,-Rank)
# Create Correlation Matrix
corr_mat <- round(cor(happiness_regression, method = "pearson"),3)
corr_mat
## Life.Ladder Log.GDP.per.capita Social.support
## Life.Ladder 1.000 0.793 0.713
## Log.GDP.per.capita 0.793 1.000 0.706
## Social.support 0.713 0.706 1.000
## Healthy.life.expectancy.at.birth 0.755 0.860 0.617
## Freedom.to.make.life.choices 0.524 0.352 0.410
## Generosity 0.182 -0.025 0.056
## Perceptions.of.corruption -0.448 -0.344 -0.227
## Healthy.life.expectancy.at.birth
## Life.Ladder 0.755
## Log.GDP.per.capita 0.860
## Social.support 0.617
## Healthy.life.expectancy.at.birth 1.000
## Freedom.to.make.life.choices 0.384
## Generosity 0.018
## Perceptions.of.corruption -0.335
## Freedom.to.make.life.choices Generosity
## Life.Ladder 0.524 0.182
## Log.GDP.per.capita 0.352 -0.025
## Social.support 0.410 0.056
## Healthy.life.expectancy.at.birth 0.384 0.018
## Freedom.to.make.life.choices 1.000 0.327
## Generosity 0.327 1.000
## Perceptions.of.corruption -0.487 -0.288
## Perceptions.of.corruption
## Life.Ladder -0.448
## Log.GDP.per.capita -0.344
## Social.support -0.227
## Healthy.life.expectancy.at.birth -0.335
## Freedom.to.make.life.choices -0.487
## Generosity -0.288
## Perceptions.of.corruption 1.000
# Melt Correlation Matrix
melted_corr_mat <- melt(corr_mat)
head(melted_corr_mat)
## Var1 Var2 value
## 1 Life.Ladder Life.Ladder 1.000
## 2 Log.GDP.per.capita Life.Ladder 0.793
## 3 Social.support Life.Ladder 0.713
## 4 Healthy.life.expectancy.at.birth Life.Ladder 0.755
## 5 Freedom.to.make.life.choices Life.Ladder 0.524
## 6 Generosity Life.Ladder 0.182
#Plot Correlation Heat Map
ggplot(melted_corr_mat, aes(Var1, Var2, fill=value)) + geom_tile() +
ggtitle("Correlation Matrix") + xlab("Features") + ylab("Features") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_text(aes(label = value)) +
scale_fill_gradient(low = "azure1", high = "cornflowerblue")
Based on the correlation matrix, Log.GDP.per.capita, Social.support, Healthy.life.expectancy.at.birth and Freedom.to.make.life.choices have a strong correlation (> 0.5) with Life.Ladder.
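As a quick check, the correlations with the target can be extracted from corr_mat and ranked; a minimal sketch reusing the matrix computed above:
# Rank the features by absolute Pearson correlation with Life.Ladder
target_corr <- corr_mat[, "Life.Ladder"]
sort(abs(target_corr[names(target_corr) != "Life.Ladder"]), decreasing = TRUE)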
# Combine and plot scatter plots to study the trend of Life.Ladder against the aforementioned features
sp1 <- ggplot(happiness_regression, aes(Log.GDP.per.capita, Life.Ladder)) +
geom_point(alpha=0.1) + geom_smooth(method="lm", colour = "blue") + ggtitle("Scatter Plot against Log.GDP.per.capita")
sp2 <- ggplot(happiness_regression, aes(Social.support, Life.Ladder)) +
geom_point(alpha = 0.1) + geom_smooth(method="lm", colour = "red") + ggtitle("Scatter Plot against Social.support")
sp3 <- ggplot(happiness_regression, aes(Healthy.life.expectancy.at.birth, Life.Ladder)) +
geom_point(alpha=0.1) + geom_smooth(method="lm", colour = "green") + ggtitle("Scatter Plot against Healthy.life.expectancy.at.birth")
sp4 <- ggplot(happiness_regression, aes(Freedom.to.make.life.choices, Life.Ladder)) +
geom_point(alpha = 0.1) + geom_smooth(method="lm", colour = "orange") + ggtitle("Scatter Plot against Freedom.to.make.life.choices")
# Scatter Plot with best fit of LM against 4 features with strong correlations (>0.5)
sp_comb <- ggarrange(sp1, sp2, sp3, sp4,ncol = 2, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
print(sp_comb)
As observed from the combined scatter plots, the higher the value of each of these features, the higher the Life.Ladder.
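To put numbers on these trends, a single multiple regression can be fitted on the same dataframe; a minimal sketch using base R's lm on happiness_regression as defined above:
# Quantify the linear trends seen in the scatter plots
lm_fit <- lm(Life.Ladder ~ ., data = happiness_regression)
round(summary(lm_fit)$coefficients, 3)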
----------------------------- Q2 CLASSIFICATION PROBLEM -----------------------------
# Count the number of rows for each Rank class
nrow(filter(happiness_raw, Rank == "Low"))
## [1] 54
nrow(filter(happiness_raw, Rank == "Medium"))
## [1] 1449
nrow(filter(happiness_raw, Rank == "High"))
## [1] 209
# The number of rows for Low, Medium and High are 54, 1449 & 209 respectively.
# Factor the Rank of Original Data for Plotting
happiness_class <- happiness_raw
happiness_class$Rank <- factor(happiness_class$Rank,
levels = c("Low", "Medium", "High"))
# Plot bar chart to visualize the class counts
ggplot(happiness_class, aes(Rank, fill=Rank)) + geom_bar(colour='black')
From the bar chart, it is noted that the data is strongly imbalanced: both the “Low” and “High” ranks are minority classes in this dataset. SMOTE will be applied in a later step.
# Plot density plots to see the distribution of the numeric variables
# Plot box plots to see their statistical distribution across the Rank classes
# Plot against Log.GDP.per.capita
dp1 <- ggplot(happiness_class, aes(x = Log.GDP.per.capita, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Log.GDP.per.capita")
bp1 <- ggplot(happiness_class, aes(x = Rank, y= Log.GDP.per.capita , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Log.GDP.per.capita")
dpbp1 <- ggarrange(dp1, bp1, ncol = 1, nrow = 2)
print(dpbp1)
# Plot against Social.support
dp2 <- ggplot(happiness_class, aes(x = Social.support, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Social.support")
bp2 <- ggplot(happiness_class, aes(x = Rank, y= Social.support , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Social.support")
dpbp2 <- ggarrange(dp2, bp2, ncol = 1, nrow = 2)
print(dpbp2)
# Plot against Healthy.life.expectancy.at.birth
dp3 <- ggplot(happiness_class, aes(x = Healthy.life.expectancy.at.birth, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Healthy.life.expectancy.at.birth")
bp3 <- ggplot(happiness_class, aes(x = Rank, y= Healthy.life.expectancy.at.birth , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Healthy.life.expectancy.at.birth")
dpbp3 <- ggarrange(dp3, bp3, ncol = 1, nrow = 2)
print(dpbp3)
# Plot against Freedom.to.make.life.choices
dp4 <- ggplot(happiness_class, aes(x = Freedom.to.make.life.choices, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Freedom.to.make.life.choices")
bp4 <- ggplot(happiness_class, aes(x = Rank, y= Freedom.to.make.life.choices , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Freedom.to.make.life.choices")
dpbp4 <- ggarrange(dp4, bp4, ncol = 1, nrow = 2)
print(dpbp4)
# Plot against Generosity
dp5 <- ggplot(happiness_class, aes(x = Generosity, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Generosity")
bp5 <- ggplot(happiness_class, aes(x = Rank, y= Generosity , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Generosity")
dpbp5 <- ggarrange(dp5, bp5, ncol = 1, nrow = 2)
print(dpbp5)
# Plot against Perceptions.of.corruption
dp6 <- ggplot(happiness_class, aes(x = Perceptions.of.corruption, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Perceptions.of.corruption")
bp6 <- ggplot(happiness_class, aes(x = Rank, y= Perceptions.of.corruption , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Perceptions.of.corruption")
dpbp6 <- ggarrange(dp6, bp6, ncol = 1, nrow = 2)
print(dpbp6)
# CATEGORY ENCODING
# Factor the Rank of Original Data
happiness_Factored <- happiness_raw
happiness_Factored$Rank <- factor(happiness_Factored$Rank,
levels = c("Low", "Medium", "High"))
table(happiness_Factored$Rank)
##
## Low Medium High
## 54 1449 209
prop.table(table(happiness_Factored$Rank))
##
## Low Medium High
## 0.03154206 0.84637850 0.12207944
#define original categorical labels
labs = LabelEncoder.fit(happiness_Factored$Rank)
#convert labels to numeric values
happiness_Factored$Rank = transform(labs, factor(happiness_Factored$Rank, levels = c("Low", "Medium", "High")))
After encoding, “Low” becomes 1, “Medium” becomes 2 and “High” becomes 3.
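As an optional sanity check, the codes can be decoded back to labels; a minimal sketch using CatEncoders' inverse.transform on the fitted encoder labs:
# Decode the numeric codes back to labels to confirm the mapping
head(inverse.transform(labs, happiness_Factored$Rank))
table(happiness_Factored$Rank) # 1 = Low, 2 = Medium, 3 = High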
MULTIVARIATE ANOVA TEST
The null hypothesis (H0) of the ANOVA is that there is no difference in means.
The alternative hypothesis (Ha) is that the means differ from one another.
Confidence interval = 95% (default).
If the p-value of a variable is low (p < 0.05), the variable is statistically significant, and vice versa.
Reference: https://www.scribbr.com/statistics/anova-in-r/#:~:text=Revised%20on%20November%2017%2C%202022,level%20of%20the%20independent%20variable.
aov1 = aov(data= happiness_Factored, Rank ~ Log.GDP.per.capita + Social.support + Healthy.life.expectancy.at.birth +
Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption)
summary(aov1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Log.GDP.per.capita 1 63.76 63.76 768.481 <2e-16 ***
## Social.support 1 5.73 5.73 69.091 <2e-16 ***
## Healthy.life.expectancy.at.birth 1 0.26 0.26 3.088 0.0791 .
## Freedom.to.make.life.choices 1 9.92 9.92 119.519 <2e-16 ***
## Generosity 1 12.05 12.05 145.262 <2e-16 ***
## Perceptions.of.corruption 1 15.80 15.80 190.402 <2e-16 ***
## Residuals 1705 141.46 0.08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test shows that all variables are statistically significant except Healthy.life.expectancy.at.birth.
SMOTE for IMBALANCED DATA
Reference: https://rdrr.io/cran/performanceEstimation/man/smote.html
“Low” and “High” are the minority classes in this dataset.
#SMOTE for "Low" rank
low_smoted <- SMOTE(happiness_Factored[,-1], happiness_Factored$Rank, K = 5, dup_size = 25)
low_smoted.data <- low_smoted$data
table(low_smoted.data$Rank)
##
## 1 2 3
## 1404 1449 209
prop.table(table(low_smoted.data$Rank))
##
## 1 2 3
## 0.45852384 0.47322012 0.06825604
#SMOTE for "High" rank
high_smoted <- SMOTE(low_smoted.data[,1:9], low_smoted.data$Rank, K = 5, dup_size = 6)
high_smoted.data <- high_smoted$data
table(high_smoted.data$Rank)
##
## 1 2 3
## 1404 1449 1463
prop.table(table(high_smoted.data$Rank))
##
## 1 2 3
## 0.3253012 0.3357275 0.3389713
# The data has been balanced with the SMOTE method.
#Dataframe with Balanced Data for Machine Learning Modelling
happiness_Balanced <- high_smoted.data
happiness_Balanced_L <- select(happiness_Balanced, -class, -Rank)
# Set the seed for reproducibility
set.seed(42)
# Test Design
# Split the dataset into 80% training set and 20% test set
split_train <- 0.80
dpartl <- createDataPartition(happiness_Balanced_L$Life.Ladder, p=split_train, list=FALSE)
dtrainl <- happiness_Balanced_L[ dpartl,]
dtestl <- happiness_Balanced_L[-dpartl,]
# Model Development
# Train Random Forest Regression model with the training set
# Note: x = dtrainl still includes Life.Ladder, so the target itself is among
# the predictors, which inflates the apparent goodness of fit.
regressor_rf = randomForest(x = dtrainl,
y = dtrainl$Life.Ladder,
ntree = 500)
# Predict Life Ladder score with Random Forest Regression model
rf_pred = predict(regressor_rf, newdata = dtestl)
rf_actual_pred <- as.data.frame(cbind(Prediction = rf_pred, Actual = dtestl$Life.Ladder))
# Model Performance Metrics
rf_perf_metrics <- data.frame(
R2 = R2(rf_pred, dtestl$Life.Ladder),
RMSE = RMSE(rf_pred, dtestl$Life.Ladder),
MAE = MAE(rf_pred, dtestl$Life.Ladder)
)
print(rf_perf_metrics)
## R2 RMSE MAE
## 1 0.9980992 0.07905845 0.04471033
# Regression Results Plot
print(lares::mplot_lineal(tag = dtestl$Life.Ladder,
score = rf_pred,
subtitle = "Random Forest Regression Model",
model_name = "regressor_rf"))
The chart plots the real value vs the predicted value. As observed, the R-squared is 0.9981, which shows a very high goodness of fit. The RMSE is 0.079 and the MAE is 0.045; the predictions consistently show low error across the whole range of real values.
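As a sanity check, the three metrics can be recomputed from their definitions; a minimal sketch reusing rf_pred and dtestl from above (note that caret's R2() defaults to squared correlation, so the traditional R-squared below may differ slightly):
# Recompute RMSE, MAE and a traditional R-squared from the residuals
res <- dtestl$Life.Ladder - rf_pred
c(R2_trad = 1 - sum(res^2) / sum((dtestl$Life.Ladder - mean(dtestl$Life.Ladder))^2),
  RMSE = sqrt(mean(res^2)),
  MAE = mean(abs(res)))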
# Distribution Plot
print(lares::mplot_density(tag = dtestl$Life.Ladder,
score = rf_pred,
subtitle = "Random Forest Regression Model",
model_name = "regressor_rf"))
The distribution plot shows that the model and the real values are consistent in terms of density across the continuous range of the dataset.
# Splits by Quantiles Plot
print(lares::mplot_splits(tag = dtestl$Life.Ladder,
score = rf_pred,
split = 8))
The split groups divide the dataset into 8 tags, each containing the same number of samples. Tags 1, 2, 3, 4, 5, 6 and 8 achieve a consistency of more than 90%, while tag 7 is at around 89%.
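The idea behind the split plot can be reproduced by hand; a rough sketch of our own binning (not the lares internals), reusing rf_pred and dtestl:
# Bin the real values into 8 equal-sized groups and measure per-bin agreement
brks <- quantile(dtestl$Life.Ladder, probs = seq(0, 1, length.out = 9))
real_bin <- cut(dtestl$Life.Ladder, breaks = brks, include.lowest = TRUE, labels = FALSE)
# Clamp predictions into the observed range before binning
pred_bin <- cut(pmin(pmax(rf_pred, min(brks)), max(brks)), breaks = brks,
                include.lowest = TRUE, labels = FALSE)
round(tapply(pred_bin == real_bin, real_bin, mean), 2) # agreement per tag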
# Model Development
# Train Support Vector Regression model with training set
regressor_svr = svm(formula = Life.Ladder ~ .,
data = dtrainl,
type = 'eps-regression',
kernel = 'radial')
# Predict Life Ladder score with Support Vector Regression model
svr_pred = predict(regressor_svr, newdata = dtestl)
svr_actual_pred <- as.data.frame(cbind(Prediction = svr_pred, Actual = dtestl$Life.Ladder))
# Model Performance Metrics
svr_perf_metrics <- data.frame(
R2 = R2(svr_pred, dtestl$Life.Ladder),
RMSE = RMSE(svr_pred, dtestl$Life.Ladder),
MAE = MAE(svr_pred, dtestl$Life.Ladder)
)
print(svr_perf_metrics)
## R2 RMSE MAE
## 1 0.9511555 0.3944255 0.2454234
# Regression Results Plot
print(lares::mplot_lineal(tag = dtestl$Life.Ladder,
score = svr_pred,
subtitle = "Support Vector Regression Model",
model_name = "regressor_svr"))
The R-squared is 0.9511, which shows a high goodness of fit. The RMSE is 0.394 and the MAE is 0.245. The predictions show low error when the real value is at the low or high end; however, the error is relatively larger when the real value is between 3.5 and 7.
# Distribution Plot
print(lares::mplot_density(tag = dtestl$Life.Ladder,
score = svr_pred,
subtitle = "Support Vector Regression Model",
model_name = "regressor_svr"))
The distribution plot shows that the model and the real values overlap over more than 70% of the region.
# Splits by Quantiles Plot
print(lares::mplot_splits(tag = dtestl$Life.Ladder,
score = svr_pred,
split = 8))
The split groups divide the dataset into 8 tags, each containing the same number of samples. The tag percentages range from 44% to 73%, and tags 3, 4, 5, 6 and 7 contain records from 4 or more tags. This suggests the model performs relatively poorly when the real value is between 3.5 and 7.
# Model Development
# Train Decision Tree Regression model with training set
regressor_dt = rpart(formula = Life.Ladder ~ .,
data = dtrainl,
control = rpart.control(minsplit = 10))
# Predict Life Ladder score with Decision Tree Regression model
dt_pred = predict(regressor_dt, newdata = dtestl)
dt_actual_pred <- as.data.frame(cbind(Prediction = dt_pred, Actual = dtestl$Life.Ladder))
# Plot Decision Tree
prp(regressor_dt)
# Model Performance Metrics
dt_perf_metrics <- data.frame(
R2 = R2(dt_pred, dtestl$Life.Ladder),
RMSE = RMSE(dt_pred, dtestl$Life.Ladder),
MAE = MAE(dt_pred, dtestl$Life.Ladder)
)
print(dt_perf_metrics)
## R2 RMSE MAE
## 1 0.8861933 0.6011727 0.4230862
# Regression Results Plot
print(lares::mplot_lineal(tag = dtestl$Life.Ladder,
score = dt_pred,
subtitle = "Decision Tree Regression Model",
model_name = "regressor_dt"))
The R-squared is 0.8862, the lowest goodness of fit of the three models. The RMSE is 0.6012 and the MAE is 0.4231.
# Distribution Plot
print(lares::mplot_density(tag = dtestl$Life.Ladder,
score = dt_pred,
subtitle = "Decision Tree Regression Model",
model_name = "regressor_dt"))
The distribution plot is considered inconsistent, as the model and the real values do not overlap.
# Splits by Quantiles Plot
print(lares::mplot_splits(tag = dtestl$Life.Ladder,
score = dt_pred,
split = 8))
The tag percentages range from 22% to 50%, which suggests that the model performs poorly in terms of prediction.
# Model Assessment
regression_evaluation <- data.frame(
rf_r2 = R2(rf_pred, dtestl$Life.Ladder),
svr_r2 = R2(svr_pred, dtestl$Life.Ladder),
dt_r2 = R2(dt_pred, dtestl$Life.Ladder)
)
print(regression_evaluation)
## rf_r2 svr_r2 dt_r2
## 1 0.9980992 0.9511555 0.8861933
Life Ladder Prediction (Regression Model)
The Random Forest Regression model achieves the highest R2, 0.9981 (99.81%), compared to the other two models.
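For completeness, the comparison can go beyond R2; a minimal sketch gathering all three metrics side by side, reusing the predictions computed above:
# Collect R2, RMSE and MAE for the three regression models
data.frame(
  Model = c("Random Forest", "SVR", "Decision Tree"),
  R2   = c(R2(rf_pred, dtestl$Life.Ladder), R2(svr_pred, dtestl$Life.Ladder), R2(dt_pred, dtestl$Life.Ladder)),
  RMSE = c(RMSE(rf_pred, dtestl$Life.Ladder), RMSE(svr_pred, dtestl$Life.Ladder), RMSE(dt_pred, dtestl$Life.Ladder)),
  MAE  = c(MAE(rf_pred, dtestl$Life.Ladder), MAE(svr_pred, dtestl$Life.Ladder), MAE(dt_pred, dtestl$Life.Ladder))
)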
# Prepare the balanced dataframe for the classification task
happiness_Balanced$Rank <- as.factor(happiness_Balanced$Rank)
happiness_Balanced_C <- select(happiness_Balanced, -class, -Life.Ladder)
# Set the seed for reproducibility
set.seed(42)
# Test Design
# Split the dataset into 80% training set and 20% test set
split_train <- 0.80
dpartc <- createDataPartition(happiness_Balanced_C$Rank, p=split_train, list=FALSE)
dtrainc <- happiness_Balanced_C[ dpartc,]
dtestc <- happiness_Balanced_C[-dpartc,]
# Cross Validation
# Set up the trainControl object
tc <- trainControl(method = "repeatedcv",
number=10,# 10-fold cross validation
classProbs = FALSE,
savePredictions = TRUE,
repeats = 3,
## Compute multi-class summary metrics
summaryFunction = multiClassSummary)
# Model Development
# Train K-Nearest Neighbors model with dtrainc
knn_model <- train(
Rank~.,
data=dtrainc,
trControl=tc,
preProcess = c("center","scale"),
method="knn",
metric='Accuracy',
tuneLength=20
)
# Print the model results
print(knn_model)
## k-Nearest Neighbors
##
## 3455 samples
## 7 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3110, 3109, 3110, 3109, 3109, 3109, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa Mean_F1 Mean_Sensitivity Mean_Specificity
## 5 0.9298562 0.8948191 0.9278536 0.9303716 0.9650330
## 7 0.9190520 0.8786294 0.9162738 0.9196545 0.9596538
## 9 0.9113337 0.8670704 0.9079585 0.9119949 0.9558302
## 11 0.9068948 0.8604233 0.9031392 0.9075913 0.9536281
## 13 0.9046743 0.8571024 0.9007150 0.9053941 0.9525403
## 15 0.9015872 0.8524830 0.8973293 0.9023240 0.9510203
## 17 0.8979213 0.8469935 0.8933859 0.8986957 0.9491942
## 19 0.8948323 0.8423657 0.8900388 0.8956388 0.9476492
## 21 0.8932906 0.8400554 0.8884157 0.8941060 0.9468796
## 23 0.8916528 0.8376035 0.8865938 0.8924777 0.9460670
## 25 0.8901089 0.8352926 0.8848471 0.8909451 0.9453019
## 27 0.8881791 0.8324033 0.8828839 0.8890400 0.9443405
## 29 0.8872154 0.8309639 0.8819247 0.8880887 0.9438701
## 31 0.8852861 0.8280797 0.8799478 0.8861862 0.9429208
## 33 0.8846098 0.8270692 0.8792623 0.8855181 0.9425884
## 35 0.8835497 0.8254841 0.8782759 0.8844752 0.9420639
## 37 0.8824903 0.8238964 0.8772082 0.8834213 0.9415333
## 39 0.8809464 0.8215873 0.8755289 0.8818971 0.9407686
## 41 0.8799818 0.8201460 0.8744997 0.8809425 0.9402949
## 43 0.8789244 0.8185631 0.8734278 0.8798970 0.9397695
## Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall
## 0.9354297 0.9678093 0.9354297 0.9303716
## 0.9263285 0.9633190 0.9263285 0.9196545
## 0.9199837 0.9601459 0.9199837 0.9119949
## 0.9161926 0.9583297 0.9161926 0.9075913
## 0.9147568 0.9574967 0.9147568 0.9053941
## 0.9124988 0.9563054 0.9124988 0.9023240
## 0.9092073 0.9547475 0.9092073 0.8986957
## 0.9062283 0.9534102 0.9062283 0.8956388
## 0.9046807 0.9527099 0.9046807 0.8941060
## 0.9034791 0.9520893 0.9034791 0.8924777
## 0.9023136 0.9515191 0.9023136 0.8909451
## 0.9000344 0.9505375 0.9000344 0.8890400
## 0.8989864 0.9500469 0.8989864 0.8880887
## 0.8970737 0.9491448 0.8970737 0.8861862
## 0.8963671 0.9488183 0.8963671 0.8855181
## 0.8951236 0.9482329 0.8951236 0.8844752
## 0.8937958 0.9476660 0.8937958 0.8834213
## 0.8924930 0.9470442 0.8924930 0.8818971
## 0.8917495 0.9466510 0.8917495 0.8809425
## 0.8905642 0.9461259 0.8905642 0.8798970
## Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.3099521 0.9477023
## 0.3063507 0.9396542
## 0.3037779 0.9339126
## 0.3022983 0.9306097
## 0.3015581 0.9289672
## 0.3005291 0.9266721
## 0.2993071 0.9239450
## 0.2982774 0.9216440
## 0.2977635 0.9204928
## 0.2972176 0.9192723
## 0.2967030 0.9181235
## 0.2960597 0.9166903
## 0.2957385 0.9159794
## 0.2950954 0.9145535
## 0.2948699 0.9140533
## 0.2945166 0.9132695
## 0.2941634 0.9124773
## 0.2936488 0.9113329
## 0.2933273 0.9106187
## 0.2929748 0.9098333
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
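Optionally, the tuning curve can be inspected with caret's plot method for train objects; a one-line sketch:
# Visualize accuracy across the tuned k values
plot(knn_model, metric = "Accuracy")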
# Predict Life Ladder Rank by K-Nearest Neighbours model
knn_pred <- predict(knn_model, dtestc)
# Confusion Matrix
knn_confusionmatrix<-confusionMatrix(knn_pred, dtestc$Rank)
print(knn_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 280 34 0
## 2 0 231 3
## 3 0 24 289
##
## Overall Statistics
##
## Accuracy : 0.9292
## 95% CI : (0.9099, 0.9454)
## No Information Rate : 0.3391
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8938
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 0.7993 0.9897
## Specificity 0.9415 0.9948 0.9578
## Pos Pred Value 0.8917 0.9872 0.9233
## Neg Pred Value 1.0000 0.9075 0.9945
## Prevalence 0.3252 0.3357 0.3391
## Detection Rate 0.3252 0.2683 0.3357
## Detection Prevalence 0.3647 0.2718 0.3635
## Balanced Accuracy 0.9707 0.8970 0.9738
# Feature Importance of K-Nearest Neighbors Model
# Create object of importance of our variables
knn_importance <- varImp(knn_model)
# Create box plot of importance of variables
gg.knn_feat_importance<-ggplot(data = knn_importance, mapping = aes(x = knn_importance[,1])) + # Data & mapping
geom_boxplot() + # Create box plot
labs(title = "Variable importance: K-Nearest Neighbors ") + # Title
theme_minimal() # Theme
plot(gg.knn_feat_importance)
# Plot confusion matrix
# Requires packages "rsvg" and "ggimage"
knn_cfm <- as_tibble(knn_confusionmatrix$table)
print(plot_confusion_matrix(knn_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
# Model Development
# Train AdaBoost model (without cross-validation) using the boosting function
ada_model = boosting(Rank~., data=dtrainc, boos=TRUE, mfinal=50)
# Predict Life Ladder Rank by AdaBoost model
ada_pred = predict(ada_model , newdata = dtestc)
# Confusion Matrix
ada_confusionmatrix<-confusionMatrix(
as.factor(ada_pred$class),
dtestc$Rank
)
print(ada_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 280 10 0
## 2 0 267 5
## 3 0 12 287
##
## Overall Statistics
##
## Accuracy : 0.9686
## 95% CI : (0.9547, 0.9792)
## No Information Rate : 0.3391
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.953
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 0.9239 0.9829
## Specificity 0.9828 0.9913 0.9789
## Pos Pred Value 0.9655 0.9816 0.9599
## Neg Pred Value 1.0000 0.9626 0.9911
## Prevalence 0.3252 0.3357 0.3391
## Detection Rate 0.3252 0.3101 0.3333
## Detection Prevalence 0.3368 0.3159 0.3473
## Balanced Accuracy 0.9914 0.9576 0.9809
# Feature Importance of AdaBoost Model
# Plot Feature Importance
importanceplot(ada_model, horiz=TRUE)
# Plot confusion matrix
# Requires packages "rsvg" and "ggimage"
ada_cfm <- as_tibble(ada_confusionmatrix$table)
print(plot_confusion_matrix(ada_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
# Cross Validation
# Set up the trainControl object
tc <- trainControl(method = "repeatedcv",
number=10,#10-fold cross validation
classProbs = FALSE,
savePredictions = TRUE,
repeats = 3,
## Compute multi-class summary metrics
summaryFunction = multiClassSummary)
# Model Development
# Train NaiveBayes model with training set
nb_model <- train(Rank~.,
dtrainc,
method="naive_bayes",
preProcess = c("center","scale"),
metric='Accuracy',
trControl=tc)
# Print the model results
print(nb_model)
## Naive Bayes
##
## 3455 samples
## 7 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3109, 3110, 3110, 3110, 3109, 3109, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa Mean_F1 Mean_Sensitivity
## FALSE 0.8401357 0.7603845 0.8368341 0.8409479
## TRUE 0.8563458 0.7846966 0.8537941 0.8571567
## Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision
## 0.9203027 0.8411017 0.9232458 0.8411017
## 0.9284432 0.8587607 0.9310392 0.8587607
## Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.8409479 0.2800452 0.8806253
## 0.8571567 0.2854486 0.8927999
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 1.
# Predict Life Ladder Rank by Naive Bayes model
nb_pred <- predict(nb_model, dtestc)
# Confusion Matrix
nb_confusionmatrix<-confusionMatrix(nb_pred, dtestc$Rank)
print(nb_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 266 66 0
## 2 14 208 29
## 3 0 15 263
##
## Overall Statistics
##
## Accuracy : 0.856
## 95% CI : (0.8307, 0.8788)
## No Information Rate : 0.3391
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7842
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9500 0.7197 0.9007
## Specificity 0.8864 0.9248 0.9736
## Pos Pred Value 0.8012 0.8287 0.9460
## Neg Pred Value 0.9735 0.8672 0.9503
## Prevalence 0.3252 0.3357 0.3391
## Detection Rate 0.3089 0.2416 0.3055
## Detection Prevalence 0.3856 0.2915 0.3229
## Balanced Accuracy 0.9182 0.8223 0.9372
# Feature Importance of Naive Bayes Model
# Create object of importance of our variables
nb_importance <- varImp(nb_model)
# Create box plot of importance of variables
gg.nb_feat_importance<-ggplot(data = nb_importance, mapping = aes(x = nb_importance[,1])) + # Data & mapping
geom_boxplot() + # Create box plot
labs(title = "Variable importance: Naive Bayes") + # Title
theme_minimal() # Theme
plot(gg.nb_feat_importance)
# Plot confusion matrix
# Requires packages "rsvg" and "ggimage"
nb_cfm <- as_tibble(nb_confusionmatrix$table)
print(plot_confusion_matrix(nb_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
classification_evaluation<-data.frame(
K_Nearest_Neighbours= knn_confusionmatrix$overall[1],
AdaBoost = ada_confusionmatrix$overall[1],
Naive_Bayes= nb_confusionmatrix$overall[1]
)
print(classification_evaluation)
## K_Nearest_Neighbours AdaBoost Naive_Bayes
## Accuracy 0.9291521 0.9686411 0.8559814
Life Ladder (Rank) Prediction (Classification Model)
The AdaBoost classification model is the best of the three models, as it achieves the highest accuracy of 0.969 (96.9%).
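Overall accuracy can hide per-class behaviour; a minimal sketch comparing per-class balanced accuracy, reusing the confusionMatrix objects computed above:
# Compare per-class balanced accuracy across the three classifiers
rbind(
  KNN        = knn_confusionmatrix$byClass[, "Balanced Accuracy"],
  AdaBoost   = ada_confusionmatrix$byClass[, "Balanced Accuracy"],
  NaiveBayes = nb_confusionmatrix$byClass[, "Balanced Accuracy"]
)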
The ranking positions of the happiest and unhappiest countries barely change from year to year; the same countries stayed at the top and bottom of the list from 2015 to 2020. The top 6 happiest countries are Denmark, Finland, Switzerland, Norway, Canada and the Netherlands, while the bottom 6 are Afghanistan, Syria, the Central African Republic, Zimbabwe, Liberia and Togo.
The factors that most affect happiness are GDP per capita, social support, healthy life expectancy at birth and freedom to make life choices, which have a strong correlation with the life ladder. Income lets people meet their basic needs, so it strongly affects people's happiness. Healthy life expectancy at birth can be attributed to rising living standards, improved lifestyles and better education, along with quality health services. Lastly, freedom to make life choices is another key contributing factor, because of the social dimensions of happiness within working and living environments, which affect the quality of life people enjoy.
Three regression models are used to predict the happiness score: Random Forest, Support Vector Regression and Decision Tree. Random Forest is the best regression model in our results, as it achieves the highest R2 of 99.81%.
Three classification models are used to predict the happiness rank: K-Nearest Neighbours, AdaBoost and Naive Bayes. AdaBoost is the best classification model in our results, as it achieves the highest prediction accuracy of 96.9%.
In a nutshell, this project has shown that happiness depends on a wide range of influences. Hence, large-scale, regular collection of happiness data can support the policymaking process and help foster well-being. In other words, moving to a country with a higher happiness rank could make people happier; by the same token, moving to a less happy country could reduce their happiness. In effect, this means that emotions are contagious, even at a national level.