In the past few years, Covid-19 has led to a dramatic loss of human life worldwide. Countries have also suffered impacts on public health, food systems and employment. In other words, the pandemic has cost millions of lives, while millions more people suffer job losses and the after-effects of Covid-19. Hence, recovery from the impact of Covid-19 will be a long journey.
This raises a question about the world happiness rating, the Life Ladder published by the Sustainable Development Solutions Network. This indicator is recognized worldwide as a gauge of a country's happiness. Although we are still suffering the after-effects of Covid-19, countries are working hard on economic recovery so that people can return to their normal lives. We would therefore like to predict happiness so that we can gauge how long a country needs for its economic recovery; in other words, we will get a picture of when a country will bounce back to normal life.
In our assignment, we will use the model provided in the World Happiness Report to predict a country's happiness. The model is built on the following indicators: GDP, life expectancy, generosity, social support, freedom, and corruption. Due to time constraints, we are not able to cover other indicators such as income level, unemployment rate, poverty rate and other factors that could be affected by Covid-19.
There are three objectives we would like to achieve through this project. The objectives are as follows:
The data source for our project is the World Happiness Report, an annual report published by the Sustainable Development Solutions Network of the United Nations. The report ranks countries by the happiness ladder based on survey data from the Gallup World Poll. The dataset we used ranges from 2005 to 2020 and consists of 166 countries with 11 variables. The variables are as follows:
library(dplyr)
library(naniar)
library(ggplot2)
library(visdat)
library(reshape2)
library(ggpubr)
library(CatEncoders)
library(smotefamily)
library(caret)
library(MLmetrics)
library(adabag)
library(gbm)
library(cvms)
library(caret)
library(randomForest)
library(rpart)
library(e1071)
library(rpart.plot)
library(lares)
library(rsq)
library(cowplot)
Load the data using read.csv and check the dataframe structure. Note that most of the variables are numeric; only Country.name is character and year is integer.
myurl <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTNujIfWRdX0iF0aN25fB-2alBJc7mL40DDuNqL_92Ch--CnnbOFTHgShiboxRT90cgdyYYIhk4jQDs/pub?output=csv"
happiness_raw <- read.csv(url(myurl))
## Display the structure of data frame
str(happiness_raw)
## 'data.frame': 1949 obs. of 11 variables:
## $ Country.name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 ...
## $ Life.Ladder : num 3.72 4.4 4.76 3.83 3.78 ...
## $ Log.GDP.per.capita : num 7.37 7.54 7.65 7.62 7.71 ...
## $ Social.support : num 0.451 0.552 0.539 0.521 0.521 0.484 0.526 0.529 0.559 0.491 ...
## $ Healthy.life.expectancy.at.birth: num 50.8 51.2 51.6 51.9 52.2 ...
## $ Freedom.to.make.life.choices : num 0.718 0.679 0.6 0.496 0.531 0.578 0.509 0.389 0.523 0.427 ...
## $ Generosity : num 0.168 0.19 0.121 0.162 0.236 0.061 0.104 0.08 0.042 -0.121 ...
## $ Perceptions.of.corruption : num 0.882 0.85 0.707 0.731 0.776 0.823 0.871 0.881 0.793 0.954 ...
## $ Positive.affect : num 0.518 0.584 0.618 0.611 0.71 0.621 0.532 0.554 0.565 0.496 ...
## $ Negative.affect : num 0.258 0.237 0.275 0.267 0.268 0.273 0.375 0.339 0.348 0.371 ...
## Display the summary of the dataset
summary(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita
## Length:1949 Min. :2005 Min. :2.375 Min. : 6.635
## Class :character 1st Qu.:2010 1st Qu.:4.640 1st Qu.: 8.464
## Mode :character Median :2013 Median :5.386 Median : 9.460
## Mean :2013 Mean :5.467 Mean : 9.368
## 3rd Qu.:2017 3rd Qu.:6.283 3rd Qu.:10.353
## Max. :2020 Max. :8.019 Max. :11.648
## NA's :36
## Social.support Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
## Min. :0.2900 Min. :32.30 Min. :0.2580
## 1st Qu.:0.7498 1st Qu.:58.69 1st Qu.:0.6470
## Median :0.8355 Median :65.20 Median :0.7630
## Mean :0.8126 Mean :63.36 Mean :0.7426
## 3rd Qu.:0.9050 3rd Qu.:68.59 3rd Qu.:0.8560
## Max. :0.9870 Max. :77.10 Max. :0.9850
## NA's :13 NA's :55 NA's :32
## Generosity Perceptions.of.corruption Positive.affect Negative.affect
## Min. :-0.3350 Min. :0.0350 Min. :0.3220 Min. :0.0830
## 1st Qu.:-0.1130 1st Qu.:0.6900 1st Qu.:0.6255 1st Qu.:0.2060
## Median :-0.0255 Median :0.8020 Median :0.7220 Median :0.2580
## Mean : 0.0001 Mean :0.7471 Mean :0.7100 Mean :0.2685
## 3rd Qu.: 0.0910 3rd Qu.:0.8720 3rd Qu.:0.7990 3rd Qu.:0.3200
## Max. : 0.6980 Max. :0.9830 Max. :0.9440 Max. :0.7050
## NA's :89 NA's :110 NA's :22 NA's :16
## Check the first 6 rows of the dataset
head(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita Social.support
## 1 Afghanistan 2008 3.724 7.370 0.451
## 2 Afghanistan 2009 4.402 7.540 0.552
## 3 Afghanistan 2010 4.758 7.647 0.539
## 4 Afghanistan 2011 3.832 7.620 0.521
## 5 Afghanistan 2012 3.783 7.705 0.521
## 6 Afghanistan 2013 3.572 7.725 0.484
## Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
## 1 50.80 0.718 0.168
## 2 51.20 0.679 0.190
## 3 51.60 0.600 0.121
## 4 51.92 0.496 0.162
## 5 52.24 0.531 0.236
## 6 52.56 0.578 0.061
## Perceptions.of.corruption Positive.affect Negative.affect
## 1 0.882 0.518 0.258
## 2 0.850 0.584 0.237
## 3 0.707 0.618 0.275
## 4 0.731 0.611 0.267
## 5 0.776 0.710 0.268
## 6 0.823 0.621 0.273
## Display the columns of the dataset
colnames(happiness_raw)
## [1] "Country.name" "year"
## [3] "Life.Ladder" "Log.GDP.per.capita"
## [5] "Social.support" "Healthy.life.expectancy.at.birth"
## [7] "Freedom.to.make.life.choices" "Generosity"
## [9] "Perceptions.of.corruption" "Positive.affect"
## [11] "Negative.affect"
First, we drop the last two columns, Positive.affect and Negative.affect, which are not needed in our modelling. The dataset consists of 1949 rows, some of which contain missing values. Next, we remove the 237 rows (12.2%) that contain missing values.
## Remove the columns that are not relevant
happiness_raw <- subset(happiness_raw, select = -c(Positive.affect, Negative.affect))
## Check the total number of rows of dataset
nrow(happiness_raw)
## [1] 1949
## Check whether the dataset has any missing values
any(is.na(happiness_raw))
## [1] TRUE
## Visualize the missing data
data1 <- c(sum(is.na(happiness_raw)),sum(!is.na(happiness_raw)))
mylabel <- c("Missing","Present")
pct <- prop.table(data1)*100
mylabel2 <- paste(round(pct, digits = 1),"%")
mylabel3 <- paste(mylabel, mylabel2)
data2 <- setNames(data1,mylabel3)
pie(data2, main = "Figure 1 - Missingness of World Happiness Report")
## Check whether the dataset has duplicated values
sum(duplicated(happiness_raw))
## [1] 0
## Remove the rows with missing values from the dataset
happiness_raw <-na.omit(happiness_raw)
## Recheck the total number of rows of dataset
nrow(happiness_raw)
## [1] 1712
## Re-verify that there are no missing values
any(is.na(happiness_raw))
## [1] FALSE
## Total omitted rows: 237, i.e. 12.2% of the dataset (237/1949)
We add one more column called “Rank” to categorize the happiness level using the Life Ladder. A Life Ladder value above 7 is classified as High, below 3.5 as Low, and everything in between as Medium.
## Categorize the happiness level by adding new column rating
happiness_raw <-mutate(happiness_raw,Rank = if_else(Life.Ladder >7,"High",if_else(Life.Ladder <3.5,"Low","Medium")))
summary(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita
## Length:1712 Min. :2005 Min. :2.375 Min. : 6.635
## Class :character 1st Qu.:2010 1st Qu.:4.595 1st Qu.: 8.394
## Mode :character Median :2013 Median :5.359 Median : 9.456
## Mean :2013 Mean :5.445 Mean : 9.320
## 3rd Qu.:2017 3rd Qu.:6.252 3rd Qu.:10.268
## Max. :2020 Max. :7.971 Max. :11.648
## Social.support Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
## Min. :0.2900 Min. :32.30 Min. :0.2580
## 1st Qu.:0.7410 1st Qu.:58.17 1st Qu.:0.6440
## Median :0.8345 Median :65.10 Median :0.7575
## Mean :0.8102 Mean :63.22 Mean :0.7395
## 3rd Qu.:0.9080 3rd Qu.:68.68 3rd Qu.:0.8520
## Max. :0.9870 Max. :77.10 Max. :0.9850
## Generosity Perceptions.of.corruption Rank
## Min. :-0.3350000 Min. :0.0350 Length:1712
## 1st Qu.:-0.1112500 1st Qu.:0.6970 Class :character
## Median :-0.0255000 Median :0.8060 Mode :character
## Mean :-0.0007173 Mean :0.7511
## 3rd Qu.: 0.0890000 3rd Qu.:0.8750
## Max. : 0.6890000 Max. :0.9830
head(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita Social.support
## 1 Afghanistan 2008 3.724 7.370 0.451
## 2 Afghanistan 2009 4.402 7.540 0.552
## 3 Afghanistan 2010 4.758 7.647 0.539
## 4 Afghanistan 2011 3.832 7.620 0.521
## 5 Afghanistan 2012 3.783 7.705 0.521
## 6 Afghanistan 2013 3.572 7.725 0.484
## Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
## 1 50.80 0.718 0.168
## 2 51.20 0.679 0.190
## 3 51.60 0.600 0.121
## 4 51.92 0.496 0.162
## 5 52.24 0.531 0.236
## 6 52.56 0.578 0.061
## Perceptions.of.corruption Rank
## 1 0.882 Medium
## 2 0.850 Medium
## 3 0.707 Medium
## 4 0.731 Medium
## 5 0.776 Medium
## 6 0.823 Medium
boxplot(happiness_raw$Life.Ladder, main = "Life Ladder")
boxplot(happiness_raw$Log.GDP.per.capita, main = "Log GDP per Capita")
## We may see some outliers in the columns below. We do not treat them as outliers, as the values are measured per country.
## The values differ between well-developed, developing and under-developed countries,
## hence some values fall outside the whiskers of the box plot.
boxplot(happiness_raw$Social.support, main = "Social Support")
boxplot(happiness_raw$Healthy.life.expectancy.at.birth, main = "Healthy life expectancy at birth")
boxplot(happiness_raw$Freedom.to.make.life.choices, main = "Freedom to make life choices")
boxplot(happiness_raw$Generosity, main = "Generosity")
boxplot(happiness_raw$Perceptions.of.corruption, main = "Perceptions of corruption")
# Explore the structure of the dataframe
str(happiness_raw)
## 'data.frame': 1712 obs. of 10 variables:
## $ Country.name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 ...
## $ Life.Ladder : num 3.72 4.4 4.76 3.83 3.78 ...
## $ Log.GDP.per.capita : num 7.37 7.54 7.65 7.62 7.71 ...
## $ Social.support : num 0.451 0.552 0.539 0.521 0.521 0.484 0.526 0.529 0.559 0.491 ...
## $ Healthy.life.expectancy.at.birth: num 50.8 51.2 51.6 51.9 52.2 ...
## $ Freedom.to.make.life.choices : num 0.718 0.679 0.6 0.496 0.531 0.578 0.509 0.389 0.523 0.427 ...
## $ Generosity : num 0.168 0.19 0.121 0.162 0.236 0.061 0.104 0.08 0.042 -0.121 ...
## $ Perceptions.of.corruption : num 0.882 0.85 0.707 0.731 0.776 0.823 0.871 0.881 0.793 0.954 ...
## $ Rank : chr "Medium" "Medium" "Medium" "Medium" ...
## - attr(*, "na.action")= 'omit' Named int [1:237] 26 29 30 67 111 112 113 114 115 116 ...
## ..- attr(*, "names")= chr [1:237] "26" "29" "30" "67" ...
# Explore the max and min happiness in the dataset
max_happiness_summary <- happiness_raw %>% group_by(Country.name) %>% summarise (max_happiness=max(Life.Ladder)) %>% arrange(desc(max_happiness))
min_happiness_summary <- happiness_raw %>% group_by(Country.name) %>% summarise (min_happiness=min(Life.Ladder)) %>% arrange(min_happiness)
Top_6_highest <- head(max_happiness_summary)
Top_6_lowest <- head(min_happiness_summary)
ggplot(Top_6_highest, aes(x=reorder(Country.name, max_happiness), y=max_happiness, fill=Country.name)) +
geom_bar(stat="identity", width=0.5) +
coord_cartesian(ylim=c(7,8)) +
labs(y= "Happiness", x = "Country Name") +
ggtitle("Graph of Happiness vs Country Name (Highest 6)")
ggplot(Top_6_lowest, aes(x=reorder(Country.name, -min_happiness), y=min_happiness, fill=Country.name)) +
geom_bar(stat="identity", width=0.5) +
coord_cartesian(ylim=c(2,3)) +
labs(y= "Happiness", x = "Country Name") +
ggtitle("Graph of Happiness vs Country Name (Lowest 6)")
----------------------------- Q1 REGRESSION PROBLEM -----------------------------
# Drop Country.name, year and rank from dataframe
happiness_regression <- happiness_raw %>% select(-Country.name,-year,-Rank)
# Create Correlation Matrix
corr_mat <- round(cor(happiness_regression, method = "pearson"),3)
corr_mat
## Life.Ladder Log.GDP.per.capita Social.support
## Life.Ladder 1.000 0.793 0.713
## Log.GDP.per.capita 0.793 1.000 0.706
## Social.support 0.713 0.706 1.000
## Healthy.life.expectancy.at.birth 0.755 0.860 0.617
## Freedom.to.make.life.choices 0.524 0.352 0.410
## Generosity 0.182 -0.025 0.056
## Perceptions.of.corruption -0.448 -0.344 -0.227
## Healthy.life.expectancy.at.birth
## Life.Ladder 0.755
## Log.GDP.per.capita 0.860
## Social.support 0.617
## Healthy.life.expectancy.at.birth 1.000
## Freedom.to.make.life.choices 0.384
## Generosity 0.018
## Perceptions.of.corruption -0.335
## Freedom.to.make.life.choices Generosity
## Life.Ladder 0.524 0.182
## Log.GDP.per.capita 0.352 -0.025
## Social.support 0.410 0.056
## Healthy.life.expectancy.at.birth 0.384 0.018
## Freedom.to.make.life.choices 1.000 0.327
## Generosity 0.327 1.000
## Perceptions.of.corruption -0.487 -0.288
## Perceptions.of.corruption
## Life.Ladder -0.448
## Log.GDP.per.capita -0.344
## Social.support -0.227
## Healthy.life.expectancy.at.birth -0.335
## Freedom.to.make.life.choices -0.487
## Generosity -0.288
## Perceptions.of.corruption 1.000
# Melt Correlation Matrix
melted_corr_mat <- melt(corr_mat)
head(melted_corr_mat)
## Var1 Var2 value
## 1 Life.Ladder Life.Ladder 1.000
## 2 Log.GDP.per.capita Life.Ladder 0.793
## 3 Social.support Life.Ladder 0.713
## 4 Healthy.life.expectancy.at.birth Life.Ladder 0.755
## 5 Freedom.to.make.life.choices Life.Ladder 0.524
## 6 Generosity Life.Ladder 0.182
#Plot Correlation Heat Map
ggplot(melted_corr_mat, aes(Var1, Var2, fill=value)) + geom_tile() +
ggtitle("Correlation Matrix") + xlab("Features") + ylab("Features") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_text(aes(label = value)) +
scale_fill_gradient(low = "azure1", high = "cornflowerblue")
Based on the correlation matrix, Log.GDP.per.capita, Social.support, Healthy.life.expectancy.at.birth and Freedom.to.make.life.choices have a strong correlation (> 0.5) with Life.Ladder.
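As a quick check, the correlations with the target can be extracted from corr_mat and ranked; a minimal sketch reusing the matrix computed above:
# Rank the features by absolute Pearson correlation with Life.Ladder
target_corr <- corr_mat[, "Life.Ladder"]
sort(abs(target_corr[names(target_corr) != "Life.Ladder"]), decreasing = TRUE)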
# Combine and plot scatter plots to study the trend of Life.Ladder against the aforementioned features
sp1 <- ggplot(happiness_regression, aes(Log.GDP.per.capita, Life.Ladder)) +
geom_point(alpha=0.1) + geom_smooth(method="lm", colour = "blue") + ggtitle("Scatter Plot against Log.GDP.per.capita")
sp2 <- ggplot(happiness_regression, aes(Social.support, Life.Ladder)) +
geom_point(alpha = 0.1) + geom_smooth(method="lm", colour = "red") + ggtitle("Scatter Plot against Social.support")
sp3 <- ggplot(happiness_regression, aes(Healthy.life.expectancy.at.birth, Life.Ladder)) +
geom_point(alpha=0.1) + geom_smooth(method="lm", colour = "green") + ggtitle("Scatter Plot against Healthy.life.expectancy.at.birth")
sp4 <- ggplot(happiness_regression, aes(Freedom.to.make.life.choices, Life.Ladder)) +
geom_point(alpha = 0.1) + geom_smooth(method="lm", colour = "orange") + ggtitle("Scatter Plot against Freedom.to.make.life.choices")
# Scatter Plot with best fit of LM against 4 features with strong correlations (>0.5)
sp_comb <- ggarrange(sp1, sp2, sp3, sp4,ncol = 2, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
print(sp_comb)
As observed from the combined scatter plots, the higher the value of each of these features, the higher the Life.Ladder.
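To put numbers on these trends, a single multiple regression can be fitted on the same dataframe; a minimal sketch using base R's lm on happiness_regression as defined above:
# Quantify the linear trends seen in the scatter plots
lm_fit <- lm(Life.Ladder ~ ., data = happiness_regression)
round(summary(lm_fit)$coefficients, 3)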
----------------------------- Q2 CLASSIFICATION PROBLEM -----------------------------
# Count the number of rows for each Rank class
nrow(filter(happiness_raw, Rank == "Low"))
## [1] 54
nrow(filter(happiness_raw, Rank == "Medium"))
## [1] 1449
nrow(filter(happiness_raw, Rank == "High"))
## [1] 209
# The number of rows for Low, Medium and High are 54, 1449 & 209 respectively.
# Factor the Rank of Original Data for Plotting
happiness_class <- happiness_raw
happiness_class$Rank <- factor(happiness_class$Rank,
levels = c("Low", "Medium", "High"))
# Plot bar chart to visualize the class counts
ggplot(happiness_class, aes(Rank, fill=Rank)) + geom_bar(colour='black')
From the bar chart, it is noted that the data is strongly imbalanced: both the “Low” and “High” ranks are minority classes in this dataset. SMOTE will be applied in a later step.
# Plot density plots to see the distribution of the numeric variables
# Plot box plots to see their statistical distribution across the Rank classes
# Plot against Log.GDP.per.capita
dp1 <- ggplot(happiness_class, aes(x = Log.GDP.per.capita, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Log.GDP.per.capita")
bp1 <- ggplot(happiness_class, aes(x = Rank, y= Log.GDP.per.capita , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Log.GDP.per.capita")
dpbp1 <- ggarrange(dp1, bp1, ncol = 1, nrow = 2)
print(dpbp1)
# Plot against Social.support
dp2 <- ggplot(happiness_class, aes(x = Social.support, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Social.support")
bp2 <- ggplot(happiness_class, aes(x = Rank, y= Social.support , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Social.support")
dpbp2 <- ggarrange(dp2, bp2, ncol = 1, nrow = 2)
print(dpbp2)
# Plot against Healthy.life.expectancy.at.birth
dp3 <- ggplot(happiness_class, aes(x = Healthy.life.expectancy.at.birth, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Healthy.life.expectancy.at.birth")
bp3 <- ggplot(happiness_class, aes(x = Rank, y= Healthy.life.expectancy.at.birth , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Healthy.life.expectancy.at.birth")
dpbp3 <- ggarrange(dp3, bp3, ncol = 1, nrow = 2)
print(dpbp3)
# Plot against Freedom.to.make.life.choices
dp4 <- ggplot(happiness_class, aes(x = Freedom.to.make.life.choices, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Freedom.to.make.life.choices")
bp4 <- ggplot(happiness_class, aes(x = Rank, y= Freedom.to.make.life.choices , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Freedom.to.make.life.choices")
dpbp4 <- ggarrange(dp4, bp4, ncol = 1, nrow = 2)
print(dpbp4)
# Plot against Generosity
dp5 <- ggplot(happiness_class, aes(x = Generosity, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Generosity")
bp5 <- ggplot(happiness_class, aes(x = Rank, y= Generosity , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Generosity")
dpbp5 <- ggarrange(dp5, bp5, ncol = 1, nrow = 2)
print(dpbp5)
# Plot against Perceptions.of.corruption
dp6 <- ggplot(happiness_class, aes(x = Perceptions.of.corruption, fill=Rank)) +
geom_density(alpha=0.25) + facet_wrap(~Rank) + ggtitle("Density Plot against Perceptions.of.corruption")
bp6 <- ggplot(happiness_class, aes(x = Rank, y= Perceptions.of.corruption , fill=Rank)) +
geom_boxplot() + ggtitle("Box Plot against Perceptions.of.corruption")
dpbp6 <- ggarrange(dp6, bp6, ncol = 1, nrow = 2)
print(dpbp6)
# CATEGORY ENCODING
# Factor the Rank of Original Data
happiness_Factored <- happiness_raw
happiness_Factored$Rank <- factor(happiness_Factored$Rank,
levels = c("Low", "Medium", "High"))
table(happiness_Factored$Rank)
##
## Low Medium High
## 54 1449 209
prop.table(table(happiness_Factored$Rank))
##
## Low Medium High
## 0.03154206 0.84637850 0.12207944
#define original categorical labels
labs = LabelEncoder.fit(happiness_Factored$Rank)
#convert labels to numeric values
happiness_Factored$Rank = transform(labs, factor(happiness_Factored$Rank, levels = c("Low", "Medium", "High")))
After encoding, “Low” becomes 1, “Medium” becomes 2 and “High” becomes 3.
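As an optional sanity check, the codes can be decoded back to labels; a minimal sketch using CatEncoders' inverse.transform on the fitted encoder labs:
# Decode the numeric codes back to labels to confirm the mapping
head(inverse.transform(labs, happiness_Factored$Rank))
table(happiness_Factored$Rank) # 1 = Low, 2 = Medium, 3 = High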
MULTIVARIATE ANOVA TEST
The null hypothesis (H0) of the ANOVA is that there is no difference in means.
The alternative hypothesis (Ha) is that the means differ from one another.
Confidence interval = 95% (default).
If the p-value of a variable is low (p < 0.05), the variable is statistically significant, and vice versa.
Reference: https://www.scribbr.com/statistics/anova-in-r/#:~:text=Revised%20on%20November%2017%2C%202022,level%20of%20the%20independent%20variable.
aov1 = aov(data= happiness_Factored, Rank ~ Log.GDP.per.capita + Social.support + Healthy.life.expectancy.at.birth +
Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption)
summary(aov1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Log.GDP.per.capita 1 63.76 63.76 768.481 <2e-16 ***
## Social.support 1 5.73 5.73 69.091 <2e-16 ***
## Healthy.life.expectancy.at.birth 1 0.26 0.26 3.088 0.0791 .
## Freedom.to.make.life.choices 1 9.92 9.92 119.519 <2e-16 ***
## Generosity 1 12.05 12.05 145.262 <2e-16 ***
## Perceptions.of.corruption 1 15.80 15.80 190.402 <2e-16 ***
## Residuals 1705 141.46 0.08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test shows that all variables are statistically significant except Healthy.life.expectancy.at.birth.
SMOTE for IMBALANCED DATA
Reference: https://rdrr.io/cran/performanceEstimation/man/smote.html
“Low” and “High” are the minority classes in this dataset.
#SMOTE for "Low" rank
low_smoted <- SMOTE(happiness_Factored[,-1], happiness_Factored$Rank, K = 5, dup_size = 25)
low_smoted.data <- low_smoted$data
table(low_smoted.data$Rank)
##
## 1 2 3
## 1404 1449 209
prop.table(table(low_smoted.data$Rank))
##
## 1 2 3
## 0.45852384 0.47322012 0.06825604
#SMOTE for "High" rank
high_smoted <- SMOTE(low_smoted.data[,1:9], low_smoted.data$Rank, K = 5, dup_size = 6)
high_smoted.data <- high_smoted$data
table(high_smoted.data$Rank)
##
## 1 2 3
## 1404 1449 1463
prop.table(table(high_smoted.data$Rank))
##
## 1 2 3
## 0.3253012 0.3357275 0.3389713
# The data has been balanced with the SMOTE method.
#Dataframe with Balanced Data for Machine Learning Modelling
happiness_Balanced <- high_smoted.data
happiness_Balanced_L <- select(happiness_Balanced, -class, -Rank)
# Set the seed for reproducibility
set.seed(42)
# Test Design
# Split the dataset into 80% training set and 20% test set
split_train <- 0.80
dpartl <- createDataPartition(happiness_Balanced_L$Life.Ladder, p=split_train, list=FALSE)
dtrainl <- happiness_Balanced_L[ dpartl,]
dtestl <- happiness_Balanced_L[-dpartl,]
# Model Development
# Train Random Forest Regression model with the training set
# Note: x = dtrainl still includes Life.Ladder, so the target itself is among
# the predictors, which inflates the apparent goodness of fit.
regressor_rf = randomForest(x = dtrainl,
y = dtrainl$Life.Ladder,
ntree = 500)
# Predict Life Ladder score with Random Forest Regression model
rf_pred = predict(regressor_rf, newdata = dtestl)
rf_actual_pred <- as.data.frame(cbind(Prediction = rf_pred, Actual = dtestl$Life.Ladder))
# Model Performance Metrics
rf_perf_metrics <- data.frame(
R2 = R2(rf_pred, dtestl$Life.Ladder),
RMSE = RMSE(rf_pred, dtestl$Life.Ladder),
MAE = MAE(rf_pred, dtestl$Life.Ladder)
)
print(rf_perf_metrics)
## R2 RMSE MAE
## 1 0.9980992 0.07905845 0.04471033
# Regression Results Plot
print(lares::mplot_lineal(tag = dtestl$Life.Ladder,
score = rf_pred,
subtitle = "Random Forest Regression Model",
model_name = "regressor_rf"))
The chart plots the real value vs the predicted value. As observed, the R-squared is 0.9981, which shows a very high goodness of fit. The RMSE is 0.079 and the MAE is 0.045; the predictions consistently show low error across the whole range of real values.
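As a sanity check, the three metrics can be recomputed from their definitions; a minimal sketch reusing rf_pred and dtestl from above (note that caret's R2() defaults to squared correlation, so the traditional R-squared below may differ slightly):
# Recompute RMSE, MAE and a traditional R-squared from the residuals
res <- dtestl$Life.Ladder - rf_pred
c(R2_trad = 1 - sum(res^2) / sum((dtestl$Life.Ladder - mean(dtestl$Life.Ladder))^2),
  RMSE = sqrt(mean(res^2)),
  MAE = mean(abs(res)))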
# Distribution Plot
print(lares::mplot_density(tag = dtestl$Life.Ladder,
score = rf_pred,
subtitle = "Random Forest Regression Model",
model_name = "regressor_rf"))
The distribution plot shows that the model and the real values are consistent in terms of density across the continuous range of the dataset.
# Splits by Quantiles Plot
print(lares::mplot_splits(tag = dtestl$Life.Ladder,
score = rf_pred,
split = 8))
The split groups divide the dataset into 8 tags, each containing the same number of samples. Tags 1, 2, 3, 4, 5, 6 and 8 achieve a consistency of more than 90%, while tag 7 is at around 89%.
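The idea behind the split plot can be reproduced by hand; a rough sketch of our own binning (not the lares internals), reusing rf_pred and dtestl:
# Bin the real values into 8 equal-sized groups and measure per-bin agreement
brks <- quantile(dtestl$Life.Ladder, probs = seq(0, 1, length.out = 9))
real_bin <- cut(dtestl$Life.Ladder, breaks = brks, include.lowest = TRUE, labels = FALSE)
# Clamp predictions into the observed range before binning
pred_bin <- cut(pmin(pmax(rf_pred, min(brks)), max(brks)), breaks = brks,
                include.lowest = TRUE, labels = FALSE)
round(tapply(pred_bin == real_bin, real_bin, mean), 2) # agreement per tag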
# Model Development
# Train Support Vector Regression model with training set
regressor_svr = svm(formula = Life.Ladder ~ .,
data = dtrainl,
type = 'eps-regression',
kernel = 'radial')
# Predict Life Ladder score with Support Vector Regression model
svr_pred = predict(regressor_svr, newdata = dtestl)
svr_actual_pred <- as.data.frame(cbind(Prediction = svr_pred, Actual = dtestl$Life.Ladder))
# Model Performance Metrics
svr_perf_metrics <- data.frame(
R2 = R2(svr_pred, dtestl$Life.Ladder),
RMSE = RMSE(svr_pred, dtestl$Life.Ladder),
MAE = MAE(svr_pred, dtestl$Life.Ladder)
)
print(svr_perf_metrics)
## R2 RMSE MAE
## 1 0.9511555 0.3944255 0.2454234
# Regression Results Plot
print(lares::mplot_lineal(tag = dtestl$Life.Ladder,
score = svr_pred,
subtitle = "Support Vector Regression Model",
model_name = "regressor_svr"))
The R-squared is 0.9511, which shows a high goodness of fit. The RMSE is 0.394 and the MAE is 0.245. The predictions show low error when the real value is at the low or high end; however, the error is relatively larger when the real value is between 3.5 and 7.
# Distribution Plot
print(lares::mplot_density(tag = dtestl$Life.Ladder,
score = svr_pred,
subtitle = "Support Vector Regression Model",
model_name = "regressor_svr"))
The distribution plot shows that the model and the real values overlap over more than 70% of the region.
# Splits by Quantiles Plot
print(lares::mplot_splits(tag = dtestl$Life.Ladder,
score = svr_pred,
split = 8))
The split groups divide the dataset into 8 tags, each containing the same number of samples. The tag percentages range from 44% to 73%, and tags 3, 4, 5, 6 and 7 contain records from 4 or more tags. This suggests the model performs relatively poorly when the real value is between 3.5 and 7.
# Model Development
# Train Decision Tree Regression model with training set
regressor_dt = rpart(formula = Life.Ladder ~ .,
data = dtrainl,
control = rpart.control(minsplit = 10))
# Predict Life Ladder score with Decision Tree Regression model
dt_pred = predict(regressor_dt, newdata = dtestl)
dt_actual_pred <- as.data.frame(cbind(Prediction = dt_pred, Actual = dtestl$Life.Ladder))
# Plot Decision Tree
prp(regressor_dt)
# Model Performance Metrics
dt_perf_metrics <- data.frame(
R2 = R2(dt_pred, dtestl$Life.Ladder),
RMSE = RMSE(dt_pred, dtestl$Life.Ladder),
MAE = MAE(dt_pred, dtestl$Life.Ladder)
)
print(dt_perf_metrics)
## R2 RMSE MAE
## 1 0.8861933 0.6011727 0.4230862
# Regression Results Plot
print(lares::mplot_lineal(tag = dtestl$Life.Ladder,
score = dt_pred,
subtitle = "Decision Tree Regression Model",
model_name = "regressor_dt"))
The R-squared is 0.8862, the lowest goodness of fit of the three models. The RMSE is 0.6012 and the MAE is 0.4231.
# Distribution Plot
print(lares::mplot_density(tag = dtestl$Life.Ladder,
score = dt_pred,
subtitle = "Decision Tree Regression Model",
model_name = "regressor_dt"))
The distribution plot is considered inconsistent, as the model and the real values do not overlap.
# Splits by Quantiles Plot
print(lares::mplot_splits(tag = dtestl$Life.Ladder,
score = dt_pred,
split = 8))
The tag percentages range from 22% to 50%, which suggests that the model performs poorly in terms of prediction.
# Model Assessment
regression_evaluation <- data.frame(
rf_r2 = R2(rf_pred, dtestl$Life.Ladder),
svr_r2 = R2(svr_pred, dtestl$Life.Ladder),
dt_r2 = R2(dt_pred, dtestl$Life.Ladder)
)
print(regression_evaluation)
## rf_r2 svr_r2 dt_r2
## 1 0.9980992 0.9511555 0.8861933
Life Ladder Prediction (Regression Model)
The Random Forest Regression model achieves the highest R2, 0.9981 (99.81%), compared to the other two models.
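For completeness, the comparison can go beyond R2; a minimal sketch gathering all three metrics side by side, reusing the predictions computed above:
# Collect R2, RMSE and MAE for the three regression models
data.frame(
  Model = c("Random Forest", "SVR", "Decision Tree"),
  R2   = c(R2(rf_pred, dtestl$Life.Ladder), R2(svr_pred, dtestl$Life.Ladder), R2(dt_pred, dtestl$Life.Ladder)),
  RMSE = c(RMSE(rf_pred, dtestl$Life.Ladder), RMSE(svr_pred, dtestl$Life.Ladder), RMSE(dt_pred, dtestl$Life.Ladder)),
  MAE  = c(MAE(rf_pred, dtestl$Life.Ladder), MAE(svr_pred, dtestl$Life.Ladder), MAE(dt_pred, dtestl$Life.Ladder))
)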
# Prepare the balanced dataframe for the classification task
happiness_Balanced$Rank <- as.factor(happiness_Balanced$Rank)
happiness_Balanced_C <- select(happiness_Balanced, -class, -Life.Ladder)
# Set the seed for reproducibility
set.seed(42)
# Test Design
# Split the dataset into 80% training set and 20% test set
split_train <- 0.80
dpartc <- createDataPartition(happiness_Balanced_C$Rank, p=split_train, list=FALSE)
dtrainc <- happiness_Balanced_C[ dpartc,]
dtestc <- happiness_Balanced_C[-dpartc,]
# Cross Validation
# Set up the trainControl object
tc <- trainControl(method = "repeatedcv",
number=10,# 10-fold cross validation
classProbs = FALSE,
savePredictions = TRUE,
repeats = 3,
## Compute multi-class summary metrics
summaryFunction = multiClassSummary)
# Model Development
# Train K-Nearest Neighbors model with dtrainc
knn_model <- train(
Rank~.,
data=dtrainc,
trControl=tc,
preProcess = c("center","scale"),
method="knn",
metric='Accuracy',
tuneLength=20
)
# Print the model results
print(knn_model)
## k-Nearest Neighbors
##
## 3455 samples
## 7 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3110, 3109, 3110, 3109, 3109, 3109, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa Mean_F1 Mean_Sensitivity Mean_Specificity
## 5 0.9298562 0.8948191 0.9278536 0.9303716 0.9650330
## 7 0.9190520 0.8786294 0.9162738 0.9196545 0.9596538
## 9 0.9113337 0.8670704 0.9079585 0.9119949 0.9558302
## 11 0.9068948 0.8604233 0.9031392 0.9075913 0.9536281
## 13 0.9046743 0.8571024 0.9007150 0.9053941 0.9525403
## 15 0.9015872 0.8524830 0.8973293 0.9023240 0.9510203
## 17 0.8979213 0.8469935 0.8933859 0.8986957 0.9491942
## 19 0.8948323 0.8423657 0.8900388 0.8956388 0.9476492
## 21 0.8932906 0.8400554 0.8884157 0.8941060 0.9468796
## 23 0.8916528 0.8376035 0.8865938 0.8924777 0.9460670
## 25 0.8901089 0.8352926 0.8848471 0.8909451 0.9453019
## 27 0.8881791 0.8324033 0.8828839 0.8890400 0.9443405
## 29 0.8872154 0.8309639 0.8819247 0.8880887 0.9438701
## 31 0.8852861 0.8280797 0.8799478 0.8861862 0.9429208
## 33 0.8846098 0.8270692 0.8792623 0.8855181 0.9425884
## 35 0.8835497 0.8254841 0.8782759 0.8844752 0.9420639
## 37 0.8824903 0.8238964 0.8772082 0.8834213 0.9415333
## 39 0.8809464 0.8215873 0.8755289 0.8818971 0.9407686
## 41 0.8799818 0.8201460 0.8744997 0.8809425 0.9402949
## 43 0.8789244 0.8185631 0.8734278 0.8798970 0.9397695
## Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall
## 0.9354297 0.9678093 0.9354297 0.9303716
## 0.9263285 0.9633190 0.9263285 0.9196545
## 0.9199837 0.9601459 0.9199837 0.9119949
## 0.9161926 0.9583297 0.9161926 0.9075913
## 0.9147568 0.9574967 0.9147568 0.9053941
## 0.9124988 0.9563054 0.9124988 0.9023240
## 0.9092073 0.9547475 0.9092073 0.8986957
## 0.9062283 0.9534102 0.9062283 0.8956388
## 0.9046807 0.9527099 0.9046807 0.8941060
## 0.9034791 0.9520893 0.9034791 0.8924777
## 0.9023136 0.9515191 0.9023136 0.8909451
## 0.9000344 0.9505375 0.9000344 0.8890400
## 0.8989864 0.9500469 0.8989864 0.8880887
## 0.8970737 0.9491448 0.8970737 0.8861862
## 0.8963671 0.9488183 0.8963671 0.8855181
## 0.8951236 0.9482329 0.8951236 0.8844752
## 0.8937958 0.9476660 0.8937958 0.8834213
## 0.8924930 0.9470442 0.8924930 0.8818971
## 0.8917495 0.9466510 0.8917495 0.8809425
## 0.8905642 0.9461259 0.8905642 0.8798970
## Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.3099521 0.9477023
## 0.3063507 0.9396542
## 0.3037779 0.9339126
## 0.3022983 0.9306097
## 0.3015581 0.9289672
## 0.3005291 0.9266721
## 0.2993071 0.9239450
## 0.2982774 0.9216440
## 0.2977635 0.9204928
## 0.2972176 0.9192723
## 0.2967030 0.9181235
## 0.2960597 0.9166903
## 0.2957385 0.9159794
## 0.2950954 0.9145535
## 0.2948699 0.9140533
## 0.2945166 0.9132695
## 0.2941634 0.9124773
## 0.2936488 0.9113329
## 0.2933273 0.9106187
## 0.2929748 0.9098333
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
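Optionally, the tuning curve can be inspected with caret's plot method for train objects; a one-line sketch:
# Visualize accuracy across the tuned k values
plot(knn_model, metric = "Accuracy")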
# Predict Life Ladder Rank by K-Nearest Neighbours model
knn_pred <- predict(knn_model, dtestc)
# Confusion Matrix
knn_confusionmatrix<-confusionMatrix(knn_pred, dtestc$Rank)
print(knn_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 280 34 0
## 2 0 231 3
## 3 0 24 289
##
## Overall Statistics
##
## Accuracy : 0.9292
## 95% CI : (0.9099, 0.9454)
## No Information Rate : 0.3391
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8938
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 0.7993 0.9897
## Specificity 0.9415 0.9948 0.9578
## Pos Pred Value 0.8917 0.9872 0.9233
## Neg Pred Value 1.0000 0.9075 0.9945
## Prevalence 0.3252 0.3357 0.3391
## Detection Rate 0.3252 0.2683 0.3357
## Detection Prevalence 0.3647 0.2718 0.3635
## Balanced Accuracy 0.9707 0.8970 0.9738
# Feature Importance of K-Nearest Neighbors Model
# Create object of importance of our variables
knn_importance <- varImp(knn_model)
# Create box plot of importance of variables
gg.knn_feat_importance<-ggplot(data = knn_importance, mapping = aes(x = knn_importance[,1])) + # Data & mapping
geom_boxplot() + # Create box plot
labs(title = "Variable importance: K-Nearest Neighbors ") + # Title
theme_minimal() # Theme
plot(gg.knn_feat_importance)
# Plot confusion matrix
# Requires packages "rsvg" and "ggimage"
knn_cfm <- as_tibble(knn_confusionmatrix$table)
print(plot_confusion_matrix(knn_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
# Model Development
# Train AdaBoost model (without cross-validation) using the boosting function
ada_model = boosting(Rank~., data=dtrainc, boos=TRUE, mfinal=50)
# Predict Life Ladder Rank by AdaBoost model
ada_pred = predict(ada_model , newdata = dtestc)
# Confusion Matrix
ada_confusionmatrix<-confusionMatrix(
as.factor(ada_pred$class),
dtestc$Rank
)
print(ada_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 280 10 0
## 2 0 267 5
## 3 0 12 287
##
## Overall Statistics
##
## Accuracy : 0.9686
## 95% CI : (0.9547, 0.9792)
## No Information Rate : 0.3391
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.953
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 0.9239 0.9829
## Specificity 0.9828 0.9913 0.9789
## Pos Pred Value 0.9655 0.9816 0.9599
## Neg Pred Value 1.0000 0.9626 0.9911
## Prevalence 0.3252 0.3357 0.3391
## Detection Rate 0.3252 0.3101 0.3333
## Detection Prevalence 0.3368 0.3159 0.3473
## Balanced Accuracy 0.9914 0.9576 0.9809
# Feature Importance of AdaBoost Model
# Plot Feature Importance
importanceplot(ada_model, horiz=TRUE)
# Plot confusion matrix
# Requires packages "rsvg" and "ggimage"
ada_cfm <- as_tibble(ada_confusionmatrix$table)
print(plot_confusion_matrix(ada_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
# Cross Validation
# Set up the trainControl object
tc <- trainControl(method = "repeatedcv",
number=10,#10-fold cross validation
classProbs = FALSE,
savePredictions = TRUE,
repeats = 3,
## Compute multi-class summary metrics
summaryFunction = multiClassSummary)
# Model Development
# Train NaiveBayes model with training set
nb_model <- train(Rank~.,
dtrainc,
method="naive_bayes",
preProcess = c("center","scale"),
metric='Accuracy',
trControl=tc)
# Print the model results
print(nb_model)
## Naive Bayes
##
## 3455 samples
## 7 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3109, 3110, 3110, 3110, 3109, 3109, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa Mean_F1 Mean_Sensitivity
## FALSE 0.8401357 0.7603845 0.8368341 0.8409479
## TRUE 0.8563458 0.7846966 0.8537941 0.8571567
## Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision
## 0.9203027 0.8411017 0.9232458 0.8411017
## 0.9284432 0.8587607 0.9310392 0.8587607
## Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.8409479 0.2800452 0.8806253
## 0.8571567 0.2854486 0.8927999
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 1.
# Predict Life Ladder Rank by Naive Bayes model
nb_pred <- predict(nb_model, dtestc)
# Confusion Matrix
nb_confusionmatrix<-confusionMatrix(nb_pred, dtestc$Rank)
print(nb_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 266 66 0
## 2 14 208 29
## 3 0 15 263
##
## Overall Statistics
##
## Accuracy : 0.856
## 95% CI : (0.8307, 0.8788)
## No Information Rate : 0.3391
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7842
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9500 0.7197 0.9007
## Specificity 0.8864 0.9248 0.9736
## Pos Pred Value 0.8012 0.8287 0.9460
## Neg Pred Value 0.9735 0.8672 0.9503
## Prevalence 0.3252 0.3357 0.3391
## Detection Rate 0.3089 0.2416 0.3055
## Detection Prevalence 0.3856 0.2915 0.3229
## Balanced Accuracy 0.9182 0.8223 0.9372
# Feature Importance of Naive Bayes Model
# Create object of importance of our variables
nb_importance <- varImp(nb_model)
# Create box plot of importance of variables
gg.nb_feat_importance<-ggplot(data = nb_importance, mapping = aes(x = nb_importance[,1])) + # Data & mapping
geom_boxplot() + # Create box plot
labs(title = "Variable importance: Naive Bayes") + # Title
theme_minimal() # Theme
plot(gg.nb_feat_importance)
# Plot confusion matrix
# Requires packages "rsvg" and "ggimage"
nb_cfm <- as_tibble(nb_confusionmatrix$table)
print(plot_confusion_matrix(nb_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
classification_evaluation<-data.frame(
K_Nearest_Neighbours= knn_confusionmatrix$overall[1],
AdaBoost = ada_confusionmatrix$overall[1],
Naive_Bayes= nb_confusionmatrix$overall[1]
)
print(classification_evaluation)
## K_Nearest_Neighbours AdaBoost Naive_Bayes
## Accuracy 0.9291521 0.9686411 0.8559814
Life Ladder (Rank) Prediction (Classification Model)
The AdaBoost classification model is the best of the three models, as it achieves the highest accuracy of 0.969 (96.9%).
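Overall accuracy can hide per-class behaviour; a minimal sketch comparing per-class balanced accuracy, reusing the confusionMatrix objects computed above:
# Compare per-class balanced accuracy across the three classifiers
rbind(
  KNN        = knn_confusionmatrix$byClass[, "Balanced Accuracy"],
  AdaBoost   = ada_confusionmatrix$byClass[, "Balanced Accuracy"],
  NaiveBayes = nb_confusionmatrix$byClass[, "Balanced Accuracy"]
)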
The ranking positions of the happiest and unhappiest countries barely change from year to year; the same countries stayed at the top and bottom of the list from 2015 to 2020. The top 6 happiest countries are Denmark, Finland, Switzerland, Norway, Canada and the Netherlands, while the bottom 6 are Afghanistan, Syria, the Central African Republic, Zimbabwe, Liberia and Togo.
The factors that most affect happiness are GDP per capita, social support, healthy life expectancy at birth and freedom to make life choices, which have a strong correlation with the life ladder. Income lets people meet their basic needs, so it strongly affects people's happiness. Healthy life expectancy at birth can be attributed to rising living standards, improved lifestyles and better education, along with quality health services. Lastly, freedom to make life choices is another key contributing factor, because of the social dimensions of happiness within working and living environments, which affect the quality of life people enjoy.
Three regression models are used to predict the happiness score: Random Forest, Support Vector Regression and Decision Tree. Random Forest is the best regression model in our results, as it achieves the highest R2 of 99.81%.
Three classification models are used to predict the happiness rank: K-Nearest Neighbours, AdaBoost and Naive Bayes. AdaBoost is the best classification model in our results, as it achieves the highest prediction accuracy of 96.9%.
In a nutshell, this project has shown that happiness depends on a wide range of influences. Hence, large-scale, regular collection of happiness data can support the policymaking process and help foster well-being. In other words, moving to a country with a higher happiness rank could make people happier; by the same token, moving to a less happy country could reduce their happiness. In effect, this means that emotions are contagious, even at a national level.