Happiness

GROUP MEMBERS (GROUP 13)

CHEN BAOGANG 17186722

LIM SHI JUN 17113677

RONJON AHMED S2150527

ZHU MINGLI S2147909

1 Introduction

Happiness is an emotional state characterized by feelings of joy, fulfillment, satisfaction and contentment. It has many distinct definitions and is frequently associated with positive emotions and life satisfaction. When people talk about happiness, they may be referring to how they feel in the present moment or to a broader sense of how they feel about life in general.

How happy are people today? Were people happier in the past? How satisfied are people in different societies with their lives? And how do our living conditions affect all of this? These questions are difficult to answer, but they matter to each of us individually. Life satisfaction and happiness have become important study topics in the social sciences, including in ‘mainstream’ economics. In this study, we explore the data and empirical evidence that may help answer these questions, focusing on survey-based measures of world happiness by country.

1.1 Problems

Which countries are on the Top 10 Happiness Countries list?
Which countries are on the Top 10 Progressive Countries list?
Which factor affects people’s happiness the most?
Which regression model is the best to predict the happiness scores?
Which classification model is the best to predict happiness levels?

1.2 Objective

This study aims to both quantify and analyze well-being around the world. Our main goal is to do an exploratory analysis of the factors that make people happy.
To predict happiness score through different regression models
To predict happiness level through two classification models

2 Data Preprocessing

2.1 Dataset

World Happiness Report

Source: https://www.kaggle.com/datasets/unsdsn/world-happiness

The dataset we use is titled “World Happiness Report”. It covers 2015 to 2019, with each year stored in a separate CSV file. Each file has 9 to 13 variables and covers roughly 155 to 158 countries. The happiness score is the dependent variable in this study. The independent variables are the factors that affect people’s happiness: family, freedom, life expectancy, GDP per capita, generosity, and trust (perceptions of government corruption).

The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale.

The datasets need to be cleaned and tidied: the variables are renamed so that they are titled identically across the five CSV files, the files are combined, and the result is checked for uncommon entries and missing values.
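The cleaning plan above can be sketched on two tiny stand-in data frames. This is only an illustration in base R: the frames `y2017` and `y2019`, the `name_map` vector, and the values are made up for the example and are not the real files.

```r
# Toy stand-ins for two yearly files that use different column names
# (illustrative values only).
y2017 <- data.frame(Country = c("Norway", "Denmark"),
                    Happiness.Score = c(7.5, 7.4))
y2019 <- data.frame(Country.or.region = c("Finland", "Denmark"),
                    Score = c(7.8, 7.6))

# Map the 2019 names onto the 2017 naming convention.
name_map <- c(Country.or.region = "Country", Score = "Happiness.Score")
hits <- names(y2019) %in% names(name_map)
names(y2019)[hits] <- name_map[names(y2019)[hits]]

# Tag each frame with its year, then stack them.
y2017$Year <- 2017
y2019$Year <- 2019
combined <- rbind(y2017, y2019)

# Check for missing values after combining.
colSums(is.na(combined))
```

Once the names agree, stacking is a single call, and the NA count per column shows whether anything still needs attention.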

2.2 Data Column understanding

Country
Happiness rank
Happiness score

This is obtained from a sample of the population. The survey-taker asked respondents to rate their own lives on a scale from 0 to 10.

Economy (GDP per capita)

Extent to which GDP contributes to the happiness score

Family

Extent to which family contributes to the happiness score

Health (life expectancy)

Extent to which health (life expectancy) contributes to the happiness score

Freedom

Extent to which freedom contributes to the happiness score. Freedom here covers freedom of speech, freedom to pursue what we want, etc.

Trust (Government corruption)

Extent to which trust, with regard to government corruption, contributes to the happiness score

Generosity

Extent to which generosity contributes to the happiness score

2.3 Load libraries

library(Metrics)
library(caret)
library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(skimr)
library(tidyr)
library(reshape2)
library(ggpubr)
library(stringr)
library(e1071)
library(pROC)

2.4 Load the data

happy15_df = read.csv ("data/2015.csv")
happy16_df = read.csv ("data/2016.csv")
happy17_df = read.csv ("data/2017.csv")
happy18_df = read.csv ("data/2018.csv")
happy19_df = read.csv ("data/2019.csv")
head(happy19_df)

##   Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1            1           Finland 7.769          1.340          1.587
## 2            2           Denmark 7.600          1.383          1.573
## 3            3            Norway 7.554          1.488          1.582
## 4            4           Iceland 7.494          1.380          1.624
## 5            5       Netherlands 7.488          1.396          1.522
## 6            6       Switzerland 7.480          1.452          1.526
##   Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1                   0.986                        0.596      0.153
## 2                   0.996                        0.592      0.252
## 3                   1.028                        0.603      0.271
## 4                   1.026                        0.591      0.354
## 5                   0.999                        0.557      0.322
## 6                   1.052                        0.572      0.263
##   Perceptions.of.corruption
## 1                     0.393
## 2                     0.410
## 3                     0.341
## 4                     0.118
## 5                     0.298
## 6                     0.343

Here we look at the most recent file, 2019 (by default, head() shows the first 6 rows).

2.5 Rename dataset columns

2.5.1 Rename the 2018 columns to match the 2017 dataset

happy18_df=plyr::rename(happy18_df, replace = c( "Country.or.region"="Country", 
                                  "Overall.rank"="Happiness.Rank" ,
                                  "GDP.per.capita"="Economy..GDP.per.Capita.",
                                  "Healthy.life.expectancy"="Health..Life.Expectancy.",
                                  "Freedom.to.make.life.choices"="Freedom",
                                  "Perceptions.of.corruption"="Trust..Government.Corruption.",
                                  "Social.support"="Family",
                                  "Score"="Happiness.Score"))
colnames(happy18_df)

## [1] "Happiness.Rank"                "Country"                      
## [3] "Happiness.Score"               "Economy..GDP.per.Capita."     
## [5] "Family"                        "Health..Life.Expectancy."     
## [7] "Freedom"                       "Generosity"                   
## [9] "Trust..Government.Corruption."

2.5.2 Rename the 2019 columns to match the 2017 dataset

happy19_df=plyr::rename(happy19_df, replace = c( "Country.or.region"="Country", 
                                  "Overall.rank"="Happiness.Rank" ,
                                  "GDP.per.capita"="Economy..GDP.per.Capita.",
                                  "Healthy.life.expectancy"="Health..Life.Expectancy.",
                                  "Freedom.to.make.life.choices"="Freedom",
                                  "Perceptions.of.corruption"="Trust..Government.Corruption.",
                                  "Social.support"="Family",
                                  "Score"="Happiness.Score"))
colnames(happy19_df)

## [1] "Happiness.Rank"                "Country"                      
## [3] "Happiness.Score"               "Economy..GDP.per.Capita."     
## [5] "Family"                        "Health..Life.Expectancy."     
## [7] "Freedom"                       "Generosity"                   
## [9] "Trust..Government.Corruption."

2.5.3 Rename the 2015 columns to match the 2017 dataset

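# Note: read.csv() (check.names = TRUE by default) already converts spaces and
# parentheses to dots, so this mapping only matters if the file is read with check.names = FALSE.
happy15_df=plyr::rename(happy15_df, replace = c( "Happiness Rank" = "Happiness.Rank", 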
                                  "Happiness Score" = "Happiness.Score",
                                  "Economy (GDP per Capita)" = "Economy..GDP.per.Capita.",
                                  "Health (Life Expectancy)" = "Health..Life.Expectancy.",
                                  "Trust (Government Corruption)" = "Trust..Government.Corruption.",
                                  "Dystopia Residual"="Dystopia.Residual"
                                  ))
colnames(happy15_df)

##  [1] "Country"                       "Region"                       
##  [3] "Happiness.Rank"                "Happiness.Score"              
##  [5] "Standard.Error"                "Economy..GDP.per.Capita."     
##  [7] "Family"                        "Health..Life.Expectancy."     
##  [9] "Freedom"                       "Trust..Government.Corruption."
## [11] "Generosity"                    "Dystopia.Residual"

2.5.4 Rename the 2016 columns to match the 2017 dataset

# Note: as for 2015, read.csv() has already converted these names to the dotted form.
happy16_df=plyr::rename(happy16_df, replace = c( "Happiness Rank" = "Happiness.Rank", 
                                  "Happiness Score" = "Happiness.Score",
                                  "Economy (GDP per Capita)" = "Economy..GDP.per.Capita.",
                                  "Health (Life Expectancy)" = "Health..Life.Expectancy.",
                                  "Trust (Government Corruption)"  = "Trust..Government.Corruption.",
                                  "Dystopia Residual"="Dystopia.Residual"
                                  ))
colnames(happy16_df)

##  [1] "Country"                       "Region"                       
##  [3] "Happiness.Rank"                "Happiness.Score"              
##  [5] "Lower.Confidence.Interval"     "Upper.Confidence.Interval"    
##  [7] "Economy..GDP.per.Capita."      "Family"                       
##  [9] "Health..Life.Expectancy."      "Freedom"                      
## [11] "Trust..Government.Corruption." "Generosity"                   
## [13] "Dystopia.Residual"

2.5.5 Insert Year column at the first position

happy15_df<-cbind(Year=2015,happy15_df)

happy16_df<-cbind(Year=2016,happy16_df)

happy17_df<-cbind(Year=2017,happy17_df)

happy18_df<-cbind(Year=2018,happy18_df)

happy19_df<-cbind(Year=2019,happy19_df)

2.5.6 Change column type before merging the datasets

happy18_df$Trust..Government.Corruption. = as.numeric(happy18_df$Trust..Government.Corruption.)

str(happy18_df)

## 'data.frame':    156 obs. of  10 variables:
##  $ Year                         : num  2018 2018 2018 2018 2018 ...
##  $ Happiness.Rank               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country                      : chr  "Finland" "Norway" "Denmark" "Iceland" ...
##  $ Happiness.Score              : num  7.63 7.59 7.55 7.5 7.49 ...
##  $ Economy..GDP.per.Capita.     : num  1.3 1.46 1.35 1.34 1.42 ...
##  $ Family                       : num  1.59 1.58 1.59 1.64 1.55 ...
##  $ Health..Life.Expectancy.     : num  0.874 0.861 0.868 0.914 0.927 0.878 0.896 0.876 0.913 0.91 ...
##  $ Freedom                      : num  0.681 0.686 0.683 0.677 0.66 0.638 0.653 0.669 0.659 0.647 ...
##  $ Generosity                   : num  0.202 0.286 0.284 0.353 0.256 0.333 0.321 0.365 0.285 0.361 ...
##  $ Trust..Government.Corruption.: num  0.393 0.34 0.408 0.138 0.357 0.295 0.291 0.389 0.383 0.302 ...
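The conversion above is only needed if the column was read as text. A minimal sketch of that situation, assuming the raw 2018 column contains a non-numeric placeholder such as "N/A" (the raw file is not shown here, so this is an assumption, and `trust_raw` is a made-up vector):

```r
# Hypothetical column as read from CSV: one non-numeric placeholder
# forces the whole column to type character.
trust_raw <- c("0.393", "0.340", "N/A", "0.138")

# as.numeric() converts parseable strings and turns the rest into NA
# (it also emits a coercion warning, suppressed here for a clean run).
trust_num <- suppressWarnings(as.numeric(trust_raw))

trust_num
sum(is.na(trust_num))
```

Under that assumption, the placeholder becomes a single NA, which would be consistent with the one missing value reported for this column in the NA counts of the merged data.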

2.6 Merge data from 2015-2019

happy15_16<-dplyr::bind_rows(happy15_df,happy16_df)

happy15_16_17<-dplyr::bind_rows(happy15_16,happy17_df)

happy18_19<-dplyr::bind_rows(happy18_df,happy19_df)

df<-dplyr::bind_rows(happy18_19,happy15_16_17)

head(df)

##   Year Happiness.Rank     Country Happiness.Score Economy..GDP.per.Capita.
## 1 2018              1     Finland           7.632                    1.305
## 2 2018              2      Norway           7.594                    1.456
## 3 2018              3     Denmark           7.555                    1.351
## 4 2018              4     Iceland           7.495                    1.343
## 5 2018              5 Switzerland           7.487                    1.420
## 6 2018              6 Netherlands           7.441                    1.361
##   Family Health..Life.Expectancy. Freedom Generosity
## 1  1.592                    0.874   0.681      0.202
## 2  1.582                    0.861   0.686      0.286
## 3  1.590                    0.868   0.683      0.284
## 4  1.644                    0.914   0.677      0.353
## 5  1.549                    0.927   0.660      0.256
## 6  1.488                    0.878   0.638      0.333
##   Trust..Government.Corruption. Region Standard.Error Dystopia.Residual
## 1                         0.393   <NA>             NA                NA
## 2                         0.340   <NA>             NA                NA
## 3                         0.408   <NA>             NA                NA
## 4                         0.138   <NA>             NA                NA
## 5                         0.357   <NA>             NA                NA
## 6                         0.295   <NA>             NA                NA
##   Lower.Confidence.Interval Upper.Confidence.Interval Whisker.high Whisker.low
## 1                        NA                        NA           NA          NA
## 2                        NA                        NA           NA          NA
## 3                        NA                        NA           NA          NA
## 4                        NA                        NA           NA          NA
## 5                        NA                        NA           NA          NA
## 6                        NA                        NA           NA          NA
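The NA columns in the output above come from how `dplyr::bind_rows()` stacks frames with different columns: it matches columns by name and fills the gaps with NA. A small sketch with two made-up one-row frames (`a` and `b` are illustrative, not the real data):

```r
library(dplyr)

# Frame without a Region column (like 2018/2019) ...
a <- data.frame(Year = 2018, Country = "Finland", Happiness.Score = 7.6)
# ... and a frame with one (like 2015/2016).
b <- data.frame(Year = 2015, Country = "Switzerland", Happiness.Score = 7.5,
                Region = "Western Europe")

# Columns are matched by name; values missing from one frame become NA.
stacked <- bind_rows(a, b)
stacked
```

This is why the year-specific columns (confidence intervals, whiskers, standard error) are mostly NA after merging, and why Region is missing for the years whose files do not carry it.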

2.7 Change Happiness.Rank data type

df$Happiness.Rank  = as.numeric(df$Happiness.Rank )

str(df)

## 'data.frame':    782 obs. of  17 variables:
##  $ Year                         : num  2018 2018 2018 2018 2018 ...
##  $ Happiness.Rank               : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country                      : chr  "Finland" "Norway" "Denmark" "Iceland" ...
##  $ Happiness.Score              : num  7.63 7.59 7.55 7.5 7.49 ...
##  $ Economy..GDP.per.Capita.     : num  1.3 1.46 1.35 1.34 1.42 ...
##  $ Family                       : num  1.59 1.58 1.59 1.64 1.55 ...
##  $ Health..Life.Expectancy.     : num  0.874 0.861 0.868 0.914 0.927 0.878 0.896 0.876 0.913 0.91 ...
##  $ Freedom                      : num  0.681 0.686 0.683 0.677 0.66 0.638 0.653 0.669 0.659 0.647 ...
##  $ Generosity                   : num  0.202 0.286 0.284 0.353 0.256 0.333 0.321 0.365 0.285 0.361 ...
##  $ Trust..Government.Corruption.: num  0.393 0.34 0.408 0.138 0.357 0.295 0.291 0.389 0.383 0.302 ...
##  $ Region                       : chr  NA NA NA NA ...
##  $ Standard.Error               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dystopia.Residual            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Lower.Confidence.Interval    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Upper.Confidence.Interval    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Whisker.high                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Whisker.low                  : num  NA NA NA NA NA NA NA NA NA NA ...

2.8 Remove unnecessary columns

Count NA values in every column

colSums(is.na(df))

##                          Year                Happiness.Rank 
##                             0                             0 
##                       Country               Happiness.Score 
##                             0                             0 
##      Economy..GDP.per.Capita.                        Family 
##                             0                             0 
##      Health..Life.Expectancy.                       Freedom 
##                             0                             0 
##                    Generosity Trust..Government.Corruption. 
##                             0                             1 
##                        Region                Standard.Error 
##                           467                           624 
##             Dystopia.Residual     Lower.Confidence.Interval 
##                           312                           625 
##     Upper.Confidence.Interval                  Whisker.high 
##                           625                           627 
##                   Whisker.low 
##                           627

Remove unnecessary columns

df = subset(df, select = -c(Lower.Confidence.Interval,Upper.Confidence.Interval,Dystopia.Residual,Standard.Error,Whisker.high,Whisker.low))

colSums(is.na(df))

##                          Year                Happiness.Rank 
##                             0                             0 
##                       Country               Happiness.Score 
##                             0                             0 
##      Economy..GDP.per.Capita.                        Family 
##                             0                             0 
##      Health..Life.Expectancy.                       Freedom 
##                             0                             0 
##                    Generosity Trust..Government.Corruption. 
##                             0                             1 
##                        Region 
##                           467

2.9 Impute missing values in numerical columns with the median

df$Trust..Government.Corruption.[is.na(df$Trust..Government.Corruption.)] <- median(df$Trust..Government.Corruption., na.rm = T)

colSums(is.na(df))

##                          Year                Happiness.Rank 
##                             0                             0 
##                       Country               Happiness.Score 
##                             0                             0 
##      Economy..GDP.per.Capita.                        Family 
##                             0                             0 
##      Health..Life.Expectancy.                       Freedom 
##                             0                             0 
##                    Generosity Trust..Government.Corruption. 
##                             0                             0 
##                        Region 
##                           467
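The imputation step can be illustrated on a short made-up vector. The values below are hypothetical and chosen to be right-skewed, which is one common reason to prefer the median over the mean:

```r
# Hypothetical skewed values with one missing entry.
x <- c(0.05, 0.09, NA, 0.15, 0.55)

# Median of the observed values: (0.09 + 0.15) / 2 = 0.12
med <- median(x, na.rm = TRUE)
# The mean is pulled upward by the large value: 0.84 / 4 = 0.21
avg <- mean(x, na.rm = TRUE)

# Replace only the missing entries with the median.
x[is.na(x)] <- med
x
```

With a single missing value out of several hundred rows, either choice has little effect on the later models; the median is simply the less outlier-sensitive option.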

2.10 Filter uncommon data in the Country column

The data describes happiness scores and related factors for countries across different years, so it is important to check that the same countries appear in every year.

Country counts grouped by Year

aggregate(df$Country, by=list(df$Year), FUN=length)

##   Group.1   x
## 1    2015 158
## 2    2016 157
## 3    2017 155
## 4    2018 156
## 5    2019 156

As the table above shows, the number of countries in the dataset differs from year to year. We therefore take the intersection of the yearly country lists to obtain the countries present in every year.

Country_2015 = subset(df, Year == 2015)$Country
Country_2016 = subset(df, Year == 2016)$Country
Country_2017 = subset(df, Year == 2017)$Country
Country_2018 = subset(df, Year == 2018)$Country
Country_2019 = subset(df, Year == 2019)$Country

common_country =intersect(intersect(intersect(intersect(Country_2015,
Country_2016),Country_2017),Country_2018),Country_2019)
length(common_country)

## [1] 141

There are therefore 141 countries with data for every year from 2015 to 2019. We then filter the original dataset by this common_country list.

df1 = subset(df,Country %in% common_country)
print(paste("The amount of rows in the dataset is: ",dim(df1)[1]))
print(paste("The amount of columns in the dataset is: ",dim(df1)[2]))

## [1] "The amount of rows in the dataset is:  705"
## [1] "The amount of columns in the dataset is:  11"
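The nested `intersect()` calls used above can be written more compactly with `Reduce()`, which folds a two-argument function over a list. A sketch on made-up yearly lists (the `years` list is illustrative only):

```r
# Hypothetical country lists for three years.
years <- list(
  `2015` = c("Switzerland", "Iceland", "Oman"),
  `2016` = c("Denmark", "Switzerland", "Iceland"),
  `2017` = c("Norway", "Iceland", "Switzerland")
)

# Reduce() applies intersect() pairwise: ((2015 n 2016) n 2017).
common <- Reduce(intersect, years)
common
```

On the merged data frame, the same idea would read `Reduce(intersect, split(df$Country, df$Year))`, which avoids creating one variable per year.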

2.11 Fill value for categorical columns

Create a lookup table of countries and their regions

common_region <- unique(subset(df1, !is.na(Region), c(Country, Region)))

head(common_country)

## [1] "Switzerland" "Iceland"     "Denmark"     "Norway"      "Canada"     
## [6] "Finland"

Fill in the missing values of the Region column from the lookup table

assign_region <- function(x){
  # Return the region recorded for country x in the lookup table
  common_region$Region[common_region$Country == x][1]
}

for(country in common_country)
      df1$Region[df1$Country == country] <- assign_region(country)
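The same fill can be done without a loop by using `match()` as a vectorized lookup. This is an alternative sketch, not the approach used above; `lookup` and `d` are made-up stand-ins for `common_region` and `df1`:

```r
# Hypothetical lookup table: one row per country.
lookup <- data.frame(Country = c("Finland", "Canada"),
                     Region  = c("Western Europe", "North America"))

# Hypothetical data with Region missing in some rows.
d <- data.frame(Country = c("Finland", "Canada", "Finland"),
                Region  = c(NA, "North America", NA))

# match() gives, for each row of d, the position of its country in the lookup.
d$Region <- lookup$Region[match(d$Country, lookup$Country)]
d
```

This assumes each country appears exactly once in the lookup table; a country missing from the lookup would receive NA rather than raise an error.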

2.12 Save cleaned dataset

write_csv(df1, file = "World Happiness Data (2015-2019)_cleaned.csv")

2.12.1 Brief statistical summary of the data

skimr::skim_without_charts(df1)

Data summary
Name	df1
Number of rows	705
Number of columns	11
_______________________
Column type frequency:
character	2
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Country	0	1	4	23	0	141	0
Region	0	1	12	31	0	10	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
Year	1	2017.00	1.42	2015.00	2016.00	2017.00	2018.00	2019.00
Happiness.Rank	1	76.85	45.28	1.00	37.00	77.00	116.00	158.00
Happiness.Score	1	5.43	1.13	2.84	4.52	5.39	6.29	7.77
Economy..GDP.per.Capita.	1	0.93	0.40	0.00	0.64	1.00	1.24	2.10
Family	1	1.09	0.32	0.00	0.88	1.14	1.35	1.64
Health..Life.Expectancy.	1	0.63	0.23	0.00	0.49	0.66	0.81	1.14
Freedom	1	0.41	0.15	0.00	0.31	0.43	0.53	0.72
Generosity	1	0.22	0.13	0.00	0.13	0.20	0.28	0.84
Trust..Government.Corruption.	1	0.12	0.11	0.00	0.05	0.09	0.15	0.55

print(paste("The amount of rows in the dataset is: ",dim(df)[1]))
print(paste("The amount of columns in the dataset is: ",dim(df)[2]))
print(paste("the column names in this dataset are:", paste(shQuote(colnames(df)), collapse=", ")))

## [1] "The amount of rows in the dataset is:  782"
## [1] "The amount of columns in the dataset is:  11"
## [1] "the column names in this dataset are: \"Year\", \"Happiness.Rank\", \"Country\", \"Happiness.Score\", \"Economy..GDP.per.Capita.\", \"Family\", \"Health..Life.Expectancy.\", \"Freedom\", \"Generosity\", \"Trust..Government.Corruption.\", \"Region\""

3 Exploratory Data Analysis

3.1 Explore data by country, region and year

3.1.1 Top 10 happiest countries

3.1.1.1 Top 10 happiest countries in 2015

df1 %>%
  filter(Year == 2015) %>%
  arrange(-Happiness.Score) %>%
  slice_head(n=10) %>%
  ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
  geom_point(colour = "red", size = 3) +
  theme(text=element_text(size=10)) + 
  coord_flip() +
  labs(title = "The 10 happiest countries in 2015", x = "")

3.1.1.2 Top 10 happiest countries in 2016

df1 %>%
  filter(Year == 2016) %>%
  arrange(-Happiness.Score) %>%
  slice_head(n=10) %>%
  ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
  geom_point(colour = "red", size = 3) +
  theme(text=element_text(size=10)) + 
  coord_flip() +
  labs(title = "The 10 happiest countries in 2016", x = "")

3.1.1.3 Top 10 happiest countries in 2017

df1 %>%
  filter(Year == 2017) %>%
  arrange(-Happiness.Score) %>%
  slice_head(n=10) %>%
  ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
  geom_point(colour = "red", size = 3) +
  theme(text=element_text(size=10)) + 
  coord_flip() +
  labs(title = "The 10 happiest countries in 2017", x = "")

3.1.1.4 Top 10 happiest countries in 2018

df1 %>%
  filter(Year == 2018) %>%
  arrange(-Happiness.Score) %>%
  slice_head(n=10) %>%
  ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
  geom_point(colour = "red", size = 3) +
  theme(text=element_text(size=10)) + 
  coord_flip() +
  labs(title = "The 10 happiest countries in 2018", x = "")

3.1.1.5 Top 10 happiest countries in 2019

df1 %>%
  filter(Year == 2019) %>%
  arrange(-Happiness.Score) %>%
  slice_head(n=10) %>%
  ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
  geom_point(colour = "red", size = 3) +
  theme(text=element_text(size=10)) + 
  coord_flip() +
  labs(title = "The 10 happiest countries in 2019", x = "")

In 2015, Switzerland was the happiest country, but it dropped to second place in 2016. Likewise, Denmark was the happiest country in 2016 but fell to second place in 2017, when Norway took the top spot. Finland was the happiest country in both 2018 and 2019.
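The five per-year rankings above repeat the same pipeline with a different year. One possible way to get all of them in a single step is to group by year and keep the top rows per group with `dplyr::slice_max()`. The sketch below uses a small `scores` frame built from the 2018 and 2019 rows shown earlier, with n = 2 instead of 10 to keep it short:

```r
library(dplyr)

# A few rows taken from the 2018 and 2019 outputs shown earlier.
scores <- data.frame(
  Year = c(2018, 2018, 2018, 2019, 2019, 2019),
  Country = c("Finland", "Norway", "Denmark", "Finland", "Denmark", "Norway"),
  Happiness.Score = c(7.632, 7.594, 7.555, 7.769, 7.600, 7.554)
)

# Keep the n highest-scoring countries within each year.
top_by_year <- scores %>%
  group_by(Year) %>%
  slice_max(Happiness.Score, n = 2) %>%
  ungroup()

top_by_year
```

The result could then be passed to a single ggplot call with `facet_wrap(~Year)` instead of five separate plots.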

3.1.2 From 2015 to 2019, Happiness Score distribution by region:

gg2 <- ggplot(df1 , aes(x = Region, y = Happiness.Score)) +
  geom_boxplot(aes(fill=Region)) + theme_bw() +
  theme(axis.text.x = element_text (angle = 90))

gg2

The three happiest regions are Australia and New Zealand, North America, and Western Europe.

3.1.3 From 2015 to 2019, Mean Happiness score by countries:

df1 %>%
  group_by(Country) %>%
  summarise(mscore = mean(Happiness.Score)) %>%
  arrange(-mscore) %>%
  slice_head(n=10) %>%
  
  ggplot(aes(reorder(Country, mscore), mscore)) +
  geom_point() +
  theme_bw() +
  
  coord_flip() +
  labs(title = "Happiness Score by Country",
       x = "", y = "Average happiness score")

The three happiest countries by average score are Denmark, Norway and Finland.

3.1.4 Top 10 Mean Happiness score by countries trends by years

Top10_happy_country_DF = df1 %>%
  group_by(Country) %>%
  summarise(mscore = mean(Happiness.Score)) %>%
  arrange(-mscore) %>%
  slice_head(n=10)

Top10_happy_country_DF_list = c(Top10_happy_country_DF$Country)

df1_Top10_happy_country = subset(df1,Country %in% Top10_happy_country_DF_list)

ggplot(df1_Top10_happy_country,  aes(x = Year,y = Happiness.Score,color = Country))+  geom_line()

Among these ten countries, only Finland shows a clear increase in happiness score from 2015 to 2019.

3.1.5 Top 10 most progressive countries from 2015 - 2019:

df1 %>%
  mutate(y = as.character(Year)) %>%
  select(y, Country, Region, Happiness.Score) %>%
  pivot_wider(names_from = y, values_from = Happiness.Score,
              names_prefix = "y_") %>%
  mutate(p = (y_2019 - y_2015)/y_2015 * 100) %>%
  arrange(-p) %>%
  slice_head(n = 10) %>%
  ggplot(aes(reorder(Country, p), p)) +
  geom_point() +
  theme_bw() +
  coord_flip() +
  labs(title = "The 10 most progressive countries from 2015 - 2019",
       y = "Percentage Increase of Happiness Score", x = "")

Top10_Progress_country_df = df1 %>%
  mutate(y = as.character(Year)) %>%
  select(y, Country, Region, Happiness.Score) %>%
  pivot_wider(names_from = y, values_from = Happiness.Score,
              names_prefix = "y_") %>%
  mutate(p = (y_2019 - y_2015)/y_2015 * 100) %>%
  arrange(-p) %>%
  slice_head(n = 10)

Top10_Progress_country_df_list = c(Top10_Progress_country_df$Country)

df1_Top10_Progress_country = subset(df1,Country %in% Top10_Progress_country_df_list)

ggplot(df1_Top10_Progress_country,  aes(x = Year,y = Happiness.Score,color = Country))+  geom_line()
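The "progress" measure used above is the percentage change in score between 2015 and 2019, p = (y_2019 - y_2015) / y_2015 * 100. A short worked example with made-up scores (the helper name `pct_change` is hypothetical):

```r
# Percentage change from a starting score to an ending score.
pct_change <- function(start, end) (end - start) / start * 100

# The same absolute gain of 0.5 points is a larger percentage
# for a country that starts from a lower score.
low_start  <- pct_change(4.0, 4.5)   # 12.5
high_start <- pct_change(7.0, 7.5)   # about 7.14

c(low_start = low_start, high_start = high_start)
```

Because the measure is relative, countries with low 2015 scores are favoured in this ranking; ranking by the absolute difference y_2019 - y_2015 would be a reasonable alternative view.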

3.2 Explore data by factors

colnames(df1)

##  [1] "Year"                          "Happiness.Rank"               
##  [3] "Country"                       "Happiness.Score"              
##  [5] "Economy..GDP.per.Capita."      "Family"                       
##  [7] "Health..Life.Expectancy."      "Freedom"                      
##  [9] "Generosity"                    "Trust..Government.Corruption."
## [11] "Region"

head(df1)

##   Year Happiness.Rank     Country Happiness.Score Economy..GDP.per.Capita.
## 1 2018              1     Finland           7.632                    1.305
## 2 2018              2      Norway           7.594                    1.456
## 3 2018              3     Denmark           7.555                    1.351
## 4 2018              4     Iceland           7.495                    1.343
## 5 2018              5 Switzerland           7.487                    1.420
## 6 2018              6 Netherlands           7.441                    1.361
##   Family Health..Life.Expectancy. Freedom Generosity
## 1  1.592                    0.874   0.681      0.202
## 2  1.582                    0.861   0.686      0.286
## 3  1.590                    0.868   0.683      0.284
## 4  1.644                    0.914   0.677      0.353
## 5  1.549                    0.927   0.660      0.256
## 6  1.488                    0.878   0.638      0.333
##   Trust..Government.Corruption.         Region
## 1                         0.393 Western Europe
## 2                         0.340 Western Europe
## 3                         0.408 Western Europe
## 4                         0.138 Western Europe
## 5                         0.357 Western Europe
## 6                         0.295 Western Europe

3.2.1 The mean value of the factors

df1 %>%
  summarise(gdp = mean(Economy..GDP.per.Capita.),
            family = mean(Family),
            life.expectancy = mean(Health..Life.Expectancy.),
            freedom = mean(Freedom),
            generosity = mean(Generosity),
            corruption = mean(Trust..Government.Corruption.)) %>%
  pivot_longer(c(gdp, family, life.expectancy,freedom,generosity, corruption),
               names_to = "f", values_to = "value") %>%
  ggplot(aes(reorder(f, value), value)) +
  geom_bar(stat = "identity", fill = "darkgreen", width = 0.55, alpha = 0.7) +
  geom_text(aes(label = paste0(round(value, 2)), vjust = -0.5)) +
  theme_bw() +
  labs(title = "The mean value of the factors" , y = "", x = "")

The family factor has the highest mean value, which is 1.09.

3.2.2 Average value of happiness variables for different regions

Happiness.Continent <- df1 %>%
                          select(-c(Year,Happiness.Rank))%>%
                          group_by(Region) %>%
                          summarise_at(vars(-Country), mean, na.rm = TRUE)


Happiness.Continent.melt <- melt(Happiness.Continent)


# Faceting
ggplot(Happiness.Continent.melt, aes(y=value, x=Region, color=Region, fill=Region)) + 
  geom_bar( stat="identity") +    
  facet_wrap(~variable) + theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Average value of happiness variables for different regions", 
       y = "Average value")

3.3 Find Relationship using Scatter Plot of Happiness Score with each variable (include regression line)

3.3.1 Scatter plot of Happiness Score with Economy_GDP per Capita (overall and by region)

ggline1 = ggplot(df1, aes(x = Economy..GDP.per.Capita., y = Happiness.Score)) + 
  geom_point(size = .5, alpha = 0.8) +  
  geom_smooth(method = "lm", fullrange = TRUE) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline1a = ggplot(df1, aes(x = Economy..GDP.per.Capita., y = Happiness.Score)) + 
  geom_point(aes(color=Region), size = .5, alpha = 0.8) +  
  geom_smooth(aes(color = Region, fill = Region), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Region) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline1
ggline1a

3.3.2 Scatter plot of Happiness Score with Family (overall and by region)

ggline2 = ggplot(df1, aes(x = Family, y = Happiness.Score)) + 
  geom_point(size = .5, alpha = 0.8) +  
  geom_smooth(method = "lm", fullrange = TRUE) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline2a = ggplot(df1, aes(x = Family, y = Happiness.Score)) + 
  geom_point(aes(color=Region), size = .5, alpha = 0.8) +  
  geom_smooth(aes(color = Region, fill = Region), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Region) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline2
ggline2a

3.3.3 Scatter plot of Happiness Score with Health_Life Expectancy (overall and by region)

ggline3 = ggplot(df1, aes(x = Health..Life.Expectancy., y = Happiness.Score)) + 
  geom_point(size = .5, alpha = 0.8) +  
  geom_smooth(method = "lm", fullrange = TRUE) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline3a = ggplot(df1, aes(x = Health..Life.Expectancy., y = Happiness.Score)) + 
  geom_point(aes(color=Region), size = .5, alpha = 0.8) +  
  geom_smooth(aes(color = Region, fill = Region), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Region) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline3
ggline3a

3.3.4 Scatter plot of Happiness Score with Freedom (overall and by region)

ggline4 = ggplot(df1, aes(x =Freedom, y = Happiness.Score)) + 
  geom_point(size = .5, alpha = 0.8) +  
  geom_smooth(method = "lm", fullrange = TRUE) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline4a = ggplot(df1, aes(x =Freedom, y = Happiness.Score)) + 
  geom_point(aes(color=Region), size = .5, alpha = 0.8) +  
  geom_smooth(aes(color = Region, fill = Region), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Region) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline4
ggline4a

3.3.5 Scatter plot of Happiness Score with Trust_Government Corruption (overall and by region)

ggline5 = ggplot(df1, aes(x = Trust..Government.Corruption., y = Happiness.Score)) + 
  geom_point(size = .5, alpha = 0.8) +  
  geom_smooth(method = "lm", fullrange = TRUE) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline5a = ggplot(df1, aes(x = Trust..Government.Corruption., y = Happiness.Score)) + 
  geom_point(aes(color=Region), size = .5, alpha = 0.8) +  
  geom_smooth(aes(color = Region, fill = Region), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Region) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline5
ggline5a

3.3.6 Scatter plot of Happiness Score with Generosity (overall and by region)

ggline6 = ggplot(df1, aes(x = Generosity, y = Happiness.Score)) + 
  geom_point(size = .5, alpha = 0.8) +  
  geom_smooth(method = "lm", fullrange = TRUE) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline6a = ggplot(df1, aes(x = Generosity, y = Happiness.Score)) + 
  geom_point(aes(color=Region), size = .5, alpha = 0.8) +  
  geom_smooth(aes(color = Region, fill = Region), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Region) +
  theme_bw() + labs(title = "Scatter plot with regression line")

ggline6
ggline6a
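The six pairs of plots above differ only in the x variable, so the pattern could be wrapped in a small helper. The following is a minimal sketch, not the code used to produce the figures; the function name `scatter_with_fit` and its arguments are hypothetical, and it assumes a data frame with the `Happiness.Score` and `Region` columns used above.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of Happiness.Score against one factor
# with a linear fit; by_region = TRUE colours by Region and adds one panel
# per Region.
scatter_with_fit <- function(df, xvar, by_region = FALSE) {
  p <- ggplot(df, aes(x = .data[[xvar]], y = Happiness.Score))
  if (by_region) {
    p <- p +
      geom_point(aes(color = Region), size = .5, alpha = 0.8) +
      geom_smooth(aes(color = Region, fill = Region),
                  method = "lm", fullrange = TRUE) +
      facet_wrap(~Region)
  } else {
    p <- p +
      geom_point(size = .5, alpha = 0.8) +
      geom_smooth(method = "lm", fullrange = TRUE)
  }
  p + theme_bw() +
    labs(title = "Scatter plot with regression line", x = xvar)
}

# Usage, assuming df1 from the earlier sections:
# scatter_with_fit(df1, "Generosity")
# scatter_with_fit(df1, "Generosity", by_region = TRUE)
```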

3.4 Find Correlation using Correlation Matrix Heatmap

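3.4.1 Drop columns before computing the correlation heatmap

We drop the Year, Country, Happiness.Rank and Region columns before computing the heatmap. Year, Country and Region are identifiers rather than happiness factors, and Happiness.Rank is derived directly from Happiness.Score.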

dataset = select(df1,-c("Year","Country","Happiness.Rank","Region"))
head(dataset)

##   Happiness.Score Economy..GDP.per.Capita. Family Health..Life.Expectancy.
## 1           7.632                    1.305  1.592                    0.874
## 2           7.594                    1.456  1.582                    0.861
## 3           7.555                    1.351  1.590                    0.868
## 4           7.495                    1.343  1.644                    0.914
## 5           7.487                    1.420  1.549                    0.927
## 6           7.441                    1.361  1.488                    0.878
##   Freedom Generosity Trust..Government.Corruption.
## 1   0.681      0.202                         0.393
## 2   0.686      0.286                         0.340
## 3   0.683      0.284                         0.408
## 4   0.677      0.353                         0.138
## 5   0.660      0.256                         0.357
## 6   0.638      0.333                         0.295

3.4.2 Compute Heatmap Correlation

library(corrplot)
Num.cols <- sapply(dataset, is.numeric)
Cor.data <- cor(dataset[, Num.cols])

corrplot(Cor.data, method = 'color')

library(GGally)

ggcorr(dataset, label = TRUE, label_round = 2, label_size = 3.5, size = 2, hjust = .85) +
  ggtitle("Correlation Heatmap") +
  theme(plot.title = element_text(hjust = 0.5))
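To read the heatmap as a ranking, the column of correlations with the score can be extracted and sorted. The sketch below uses a small made-up data frame (the values are illustrative, not taken from the dataset); with the real data, `toy` would be replaced by `dataset`.

```r
# Toy stand-in for `dataset`; the numbers are made up for illustration.
toy <- data.frame(
  Happiness.Score = c(1, 2, 3, 4, 5, 6),
  gdp             = c(1, 2, 3, 4, 5, 7),
  generosity      = c(3, 1, 4, 1, 5, 2)
)

# Correlation of every column with the score
cor_with_score <- cor(toy)[, "Happiness.Score"]

# Drop the score's correlation with itself, then sort by absolute strength
cor_with_score <- cor_with_score[names(cor_with_score) != "Happiness.Score"]
ranked <- sort(abs(cor_with_score), decreasing = TRUE)
print(ranked)
```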

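3.5 Categorize Happiness Score into 3 levels for the classification algorithms

The levels are High, Mid and Low, obtained by splitting the score range into three equal-width intervals.
Add a new column, Happy.Level, to the dataset.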

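# Width of each of the three equal-width score intervals
rge_dif = round((max(dataset$Happiness.Score) - min(dataset$Happiness.Score)) / 3, 3)

# Upper bounds of the Low and Mid groups
low = min(dataset$Happiness.Score) + rge_dif
mid = low + rge_dif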

print(paste("range difference in happiness score: ",rge_dif))
print(paste('upper bound of Low grp',low))
print(paste('upper bound of Mid grp',mid))
print(paste('upper bound of High grp','max:',max(dataset$Happiness.Score)))

## [1] "range difference in happiness score:  1.643"
## [1] "upper bound of Low grp 4.482"
## [1] "upper bound of Mid grp 6.125"
## [1] "upper bound of High grp max: 7.769"

Transform the “Happiness.Score” column into the “Happy.Level” column

dataset_level <- dataset %>%
  mutate(Happy.Level=case_when(
    Happiness.Score <=low  ~ "Low",
    Happiness.Score>low & Happiness.Score <=mid ~ "Mid",
    Happiness.Score >mid ~ "High"
  ))  %>%
  mutate(Happy.Level=factor(Happy.Level, levels=c("High", "Mid", "Low"))) %>%
  select(-Happiness.Score)
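The same three-way split can also be written with base R's `cut()`. The sketch below assumes the thresholds printed above (4.482 and 6.125); the vector `scores` is a made-up example. By default `cut()` uses right-closed intervals, which matches the `<=` comparisons in the `case_when()` call.

```r
low <- 4.482   # upper bound of the Low group (from the printed output)
mid <- 6.125   # upper bound of the Mid group (from the printed output)

# Illustrative scores, including values exactly on the boundaries
scores <- c(2.839, 4.482, 4.483, 6.125, 6.126, 7.769)

levels_cut <- cut(scores,
                  breaks = c(-Inf, low, mid, Inf),
                  labels = c("Low", "Mid", "High"))

# Reorder the factor levels to match Happy.Level (High, Mid, Low)
levels_cut <- factor(levels_cut, levels = c("High", "Mid", "Low"))
data.frame(scores, levels_cut)
```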

4 Regression Models

4.1 Split into train set (80%) and test set (20%)

# Splitting the dataset into the Training set and Test set
set.seed(123) 
split=0.80
trainIndex <- createDataPartition(dataset$Happiness.Score, p=split, list=FALSE) 
data_train <- dataset[ trainIndex,] 
data_test <- dataset[-trainIndex,]

4.2 Multiple Linear Regression for Happiness Score Prediction

4.2.1 Train Multiple Linear Regression model with data_train

# Fitting Multiple Linear Regression to the Training set
lm_model = lm(formula = Happiness.Score ~ .,
               data = data_train)

summary(lm_model)

## 
## Call:
## lm(formula = Happiness.Score ~ ., data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.87619 -0.33193  0.00798  0.34507  1.43298 
## 
## Coefficients:
##                               Estimate Std. Error t value             Pr(>|t|)
## (Intercept)                    2.09629    0.09560  21.927 < 0.0000000000000002
## Economy..GDP.per.Capita.       1.14157    0.09825  11.619 < 0.0000000000000002
## Family                         0.64275    0.09459   6.795   0.0000000000277987
## Health..Life.Expectancy.       1.26063    0.16461   7.658   0.0000000000000837
## Freedom                        1.21029    0.20527   5.896   0.0000000064505937
## Generosity                     0.71311    0.19481   3.661             0.000276
## Trust..Government.Corruption.  0.96843    0.26401   3.668             0.000268
##                                  
## (Intercept)                   ***
## Economy..GDP.per.Capita.      ***
## Family                        ***
## Health..Life.Expectancy.      ***
## Freedom                       ***
## Generosity                    ***
## Trust..Government.Corruption. ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5422 on 558 degrees of freedom
## Multiple R-squared:  0.7722, Adjusted R-squared:  0.7697 
## F-statistic: 315.2 on 6 and 558 DF,  p-value: < 0.00000000000000022

An adjusted R² close to 1 indicates that the regression model explains a large proportion of the variability in the outcome.

A value near 0 indicates that the model explains little of that variability.

Our adjusted R² is 0.7697, so the six predictors together explain about 77% of the variance in happiness score in the training set. All six coefficients are positive and significant at the 0.001 level.
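As a check on the summary output, the adjusted R² can be recomputed from the multiple R², the number of training rows and the number of predictors. The sketch assumes n = 565 rows (558 residual degrees of freedom plus 7 estimated coefficients) and p = 6 predictors; the result should agree with the reported value up to rounding of the inputs.

```r
r2 <- 0.7722        # Multiple R-squared from summary(lm_model)
p  <- 6             # number of predictors
n  <- 558 + p + 1   # residual df + predictors + intercept

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2
```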

4.2.2 Predict happiness score with data_test

y_pred_lm = predict(lm_model, newdata = data_test)
Actual_lm = data_test$Happiness.Score

Pred_Actual_lm <- as.data.frame(cbind(Prediction = y_pred_lm, Actual = Actual_lm))


gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Multiple Linear Regression", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm

data.frame(
  R2 = R2(y_pred_lm, data_test$Happiness.Score),
  RMSE = RMSE(y_pred_lm, data_test$Happiness.Score),
  MAE = MAE(y_pred_lm, data_test$Happiness.Score)
)

##          R2      RMSE       MAE
## 1 0.7643535 0.5478055 0.4256454
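For reference, the three metrics can be written out in base R. This is a sketch of the definitions as we understand caret's defaults (in particular, `R2()` defaults to the squared correlation between predictions and observations); the helper names and the example vectors are made up.

```r
# Hypothetical helpers mirroring the metric definitions
rmse   <- function(pred, obs) sqrt(mean((pred - obs)^2))  # root mean squared error
mae    <- function(pred, obs) mean(abs(pred - obs))       # mean absolute error
r2_cor <- function(pred, obs) cor(pred, obs)^2            # squared Pearson correlation

# Made-up predictions and observations
pred <- c(1.0, 2.0, 3.0, 4.0)
obs  <- c(1.5, 2.5, 2.5, 4.5)

data.frame(R2   = r2_cor(pred, obs),
           RMSE = rmse(pred, obs),
           MAE  = mae(pred, obs))
```

Because a correlation-based R² is unaffected by a constant shift in the predictions, RMSE and MAE are the safer basis for comparing models.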

4.3 Support Vector Regression for Happiness Score Prediction

4.3.1 Train SVR model with data_train

library(e1071)

regressor_svr = svm(formula = Happiness.Score ~ .,
                data = data_train,
                type = 'eps-regression',
                kernel = 'radial')

4.3.2 Predict happiness score with data_test

# Predicting happiness score with SVR model
y_pred_svr = predict(regressor_svr,  newdata = data_test)

Pred_Actual_svr <- as.data.frame(cbind(Prediction = y_pred_svr, Actual = data_test$Happiness.Score))


Pred_Actual_lm.versus.svr <- cbind(Prediction.lm = y_pred_lm, Prediction.svr = y_pred_svr, Actual = data_test$Happiness.Score)


gg.svr <- ggplot(Pred_Actual_svr, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "SVR", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.svr

data.frame(
  R2 = R2(y_pred_svr, data_test$Happiness.Score),
  RMSE = RMSE(y_pred_svr, data_test$Happiness.Score),
  MAE = MAE(y_pred_svr, data_test$Happiness.Score)
)

##          R2      RMSE       MAE
## 1 0.8246303 0.4740831 0.3504708

4.4 Decision Tree Regression for Happiness Score Prediction

4.4.1 Train Decision Tree Regression model with data_train

# install.packages("rpart")
library(rpart)
regressor_dt = rpart(formula = Happiness.Score ~ .,
                  data = data_train,
                  control = rpart.control(minsplit = 10))

4.4.2 Predict happiness score with data_test

# Predicting happiness score with Decision Tree Regression
y_pred_dt = predict(regressor_dt, newdata = data_test)

Pred_Actual_dt <- as.data.frame(cbind(Prediction = y_pred_dt, Actual = data_test$Happiness.Score))


gg.dt <- ggplot(Pred_Actual_dt, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Decision Tree Regression", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.dt

# install.packages("rpart.plot")
library(rpart.plot)
prp(regressor_dt)

data.frame(
  R2 = R2(y_pred_dt, data_test$Happiness.Score),
  RMSE = RMSE(y_pred_dt, data_test$Happiness.Score),
  MAE = MAE(y_pred_dt, data_test$Happiness.Score)
)

##         R2      RMSE       MAE
## 1 0.682486 0.6362329 0.5223723

4.5 Random Forest Regression for Happiness Score Prediction

4.5.1 Train Random Forest Regression model with data_train

library(randomForest)

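# Predictors only (Happiness.Score removed).
# Note: this uses the full `dataset`, which includes the rows of data_test,
# whereas the other models are fit on data_train only; the test metrics
# for this model are therefore optimistic.
x_train_rf<-select(dataset,-c("Happiness.Score"))

set.seed(1234)
regressor_rf = randomForest(x = x_train_rf,
                         y = dataset$Happiness.Score,
                         ntree = 500)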

4.5.2 Predict happiness score with data_test

# Predicting happiness score with Random Forest Regression
y_pred_rf = predict(regressor_rf, newdata = data_test)

Pred_Actual_rf <- as.data.frame(cbind(Prediction = y_pred_rf, Actual = data_test$Happiness.Score))


gg.rf <- ggplot(Pred_Actual_rf, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Random Forest Regression", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.rf

data.frame(
  R2 = R2(y_pred_rf, data_test$Happiness.Score),
  RMSE = RMSE(y_pred_rf, data_test$Happiness.Score),
  MAE = MAE(y_pred_rf, data_test$Happiness.Score)
)

##          R2      RMSE       MAE
## 1 0.9692887 0.2104387 0.1561681
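These figures should be read with care: the forest in 4.5.1 is fit on `dataset`, which contains the rows of `data_test`. A quick way to detect this kind of overlap is to compare row names, as in the sketch below. The toy data frame and the helper `leaked_rows` are hypothetical stand-ins, not part of the original analysis.

```r
# Hypothetical helper: row names that appear both in the data used for
# fitting and in the held-out test set
leaked_rows <- function(fit_data, test_data) {
  intersect(rownames(fit_data), rownames(test_data))
}

# Toy stand-in for `dataset` with an 80/20 split
set.seed(123)
toy   <- data.frame(x = rnorm(100), y = rnorm(100))
idx   <- sample(nrow(toy), 80)
train <- toy[idx, ]
test  <- toy[-idx, ]

length(leaked_rows(toy, test))    # fit on the full data: every test row overlaps
length(leaked_rows(train, test))  # fit on the training split: no overlap
```

In the report's code, passing the predictors of `data_train` as `x` and `data_train$Happiness.Score` as `y` would give test metrics that are comparable with the other three models.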

4.6 Model Evaluation

# Arrange the four prediction-vs-actual plots in a 2 x 2 grid
ggarrange(gg.lm, gg.svr, gg.dt, gg.rf, ncol = 2, nrow = 2)
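The test-set metrics printed in sections 4.2 to 4.5 can also be collected into one table for a side-by-side comparison. The sketch below simply re-enters the printed values, rounded to four decimals; the data frame name `reg_results` is made up.

```r
# Test-set metrics as printed in sections 4.2 to 4.5 (rounded)
reg_results <- data.frame(
  Model = c("Multiple Linear Regression", "SVR",
            "Decision Tree", "Random Forest"),
  R2    = c(0.7644, 0.8246, 0.6825, 0.9693),
  RMSE  = c(0.5478, 0.4741, 0.6362, 0.2104),
  MAE   = c(0.4256, 0.3505, 0.5224, 0.1562)
)

# Sort from lowest to highest test RMSE
reg_results[order(reg_results$RMSE), ]
```

The Random Forest row is not directly comparable with the others, because that model was fit on the full dataset rather than on `data_train`.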

5 Classification Models

The dependent variable is the happiness level (Happy.Level) in dataset_level.

5.1 Split into train set (80%) and test set (20%)

# Splitting the dataset into the Training set and Test set
set.seed(123) 

split=0.80
trainIndex <- createDataPartition(dataset_level$Happy.Level, p=split, list=FALSE) 
data_train <- dataset_level[ trainIndex,] 
data_test <- dataset_level[-trainIndex,]

5.2 Cross validation

tc <- trainControl(method = "repeatedcv",
                   number = 10,           # 10-fold cross-validation
                   repeats = 3,           # repeated 3 times
                   classProbs = TRUE,     # estimate class probabilities
                   savePredictions = TRUE,
                   summaryFunction = multiClassSummary)
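To make the resampling scheme concrete, the sketch below builds 10 folds, repeated 3 times, by hand for 566 training rows (the sample count reported in the model summaries). It is a simplified, unstratified illustration and not caret's internal procedure; the object names are made up.

```r
set.seed(123)
n       <- 566   # training rows, as reported in the model summaries
k       <- 10    # number of folds
repeats <- 3     # number of repeats

# For each repeat, assign every row to one of k folds at random
fold_ids <- replicate(repeats,
                      sample(rep(seq_len(k), length.out = n)),
                      simplify = FALSE)

# Size of the training part of every resample (n minus the held-out fold)
train_sizes <- unlist(lapply(fold_ids,
                             function(f) n - tabulate(f, nbins = k)))

length(train_sizes)   # number of resamples: k * repeats
range(train_sizes)    # smallest and largest training part
```

Each model is therefore fitted 30 times per candidate tuning value, and the reported metrics are averages over those resamples.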

5.3 K-Nearest Neighbor Classifier for predicting Happy Level

5.3.1 Train K-Nearest Neighbours model with data_train

set.seed(123)
model_knn <- train(
  Happy.Level~., 
  data=data_train, 
  trControl=tc,
  preProcess = c("center","scale"),
  method="knn",
  metric='Accuracy',
  tuneLength=20
  ) 

model_knn

## k-Nearest Neighbors 
## 
## 566 samples
##   6 predictor
##   3 classes: 'High', 'Mid', 'Low' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 509, 509, 509, 510, 510, 509, ... 
## Resampling results across tuning parameters:
## 
##   k   logLoss    AUC        prAUC      Accuracy   Kappa      Mean_F1  
##    5  1.6389388  0.8860447  0.5147550  0.7485366  0.5928032  0.7428531
##    7  1.1644979  0.8942592  0.5588632  0.7544371  0.5991474  0.7464060
##    9  0.7990958  0.9034868  0.5949783  0.7662165  0.6184617  0.7591744
##   11  0.7426081  0.9052587  0.6174876  0.7609736  0.6084687  0.7524037
##   13  0.7355638  0.9022291  0.6227409  0.7521386  0.5937375  0.7425297
##   15  0.7014662  0.9026759  0.6249766  0.7526820  0.5945413  0.7427888
##   17  0.6292874  0.9044261  0.6334476  0.7486198  0.5872415  0.7381265
##   19  0.5776395  0.9036594  0.6436939  0.7509690  0.5899505  0.7386437
##   21  0.5267587  0.9035198  0.6418789  0.7474192  0.5848969  0.7360335
##   23  0.5340227  0.9010732  0.6418692  0.7539567  0.5954132  0.7430073
##   25  0.5378848  0.9002502  0.6439309  0.7504782  0.5897219  0.7400664
##   27  0.5409197  0.8990913  0.6533776  0.7480767  0.5845696  0.7365311
##   29  0.5414947  0.8999231  0.6643858  0.7445471  0.5777809  0.7325024
##   31  0.5444279  0.8991016  0.6705870  0.7446191  0.5773339  0.7318675
##   33  0.5473342  0.8983055  0.6675026  0.7386973  0.5672005  0.7258717
##   35  0.5499086  0.8981910  0.6679192  0.7368810  0.5651835  0.7248550
##   37  0.5521854  0.8978885  0.6792291  0.7351572  0.5617189  0.7225790
##   39  0.5537471  0.8982587  0.6810644  0.7345728  0.5604520  0.7216963
##   41  0.5555175  0.8975937  0.6831163  0.7328184  0.5566654  0.7191187
##   43  0.5573566  0.8970735  0.6788005  0.7375493  0.5656219  0.7249270
##   Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
##   0.7376124         0.8575736         0.7629220            0.8631876          
##   0.7355902         0.8582846         0.7746184            0.8672235          
##   0.7481970         0.8648439         0.7858549            0.8736509          
##   0.7393457         0.8609607         0.7827231            0.8713618          
##   0.7294857         0.8559641         0.7717696            0.8661652          
##   0.7301567         0.8563026         0.7731878            0.8671111          
##   0.7240807         0.8534211         0.7710150            0.8651708          
##   0.7248006         0.8544001         0.7735897            0.8666976          
##   0.7224491         0.8527844         0.7686853            0.8642012          
##   0.7293335         0.8564037         0.7750833            0.8676991          
##   0.7260277         0.8544421         0.7740571            0.8658964          
##   0.7216382         0.8523689         0.7734210            0.8650008          
##   0.7166382         0.8498173         0.7734903            0.8632140          
##   0.7154329         0.8495512         0.7728127            0.8631285          
##   0.7090822         0.8460099         0.7661875            0.8589382          
##   0.7092720         0.8455246         0.7628424            0.8576898          
##   0.7061274         0.8441816         0.7621036            0.8567201          
##   0.7049964         0.8436465         0.7620600            0.8567828          
##   0.7016916         0.8422836         0.7617451            0.8555782          
##   0.7075941         0.8453322         0.7659850            0.8585823          
##   Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
##   0.7629220       0.7376124    0.2495122            0.7975930             
##   0.7746184       0.7355902    0.2514790            0.7969374             
##   0.7858549       0.7481970    0.2554055            0.8065204             
##   0.7827231       0.7393457    0.2536579            0.8001532             
##   0.7717696       0.7294857    0.2507129            0.7927249             
##   0.7731878       0.7301567    0.2508940            0.7932296             
##   0.7710150       0.7240807    0.2495399            0.7887509             
##   0.7735897       0.7248006    0.2503230            0.7896004             
##   0.7686853       0.7224491    0.2491397            0.7876168             
##   0.7750833       0.7293335    0.2513189            0.7928686             
##   0.7740571       0.7260277    0.2501594            0.7902349             
##   0.7734210       0.7216382    0.2493589            0.7870035             
##   0.7734903       0.7166382    0.2481824            0.7832277             
##   0.7728127       0.7154329    0.2482064            0.7824921             
##   0.7661875       0.7090822    0.2462324            0.7775460             
##   0.7628424       0.7092720    0.2456270            0.7773983             
##   0.7621036       0.7061274    0.2450524            0.7751545             
##   0.7620600       0.7049964    0.2448576            0.7743215             
##   0.7617451       0.7016916    0.2442728            0.7719876             
##   0.7659850       0.7075941    0.2458498            0.7764632             
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

plot(model_knn)

5.3.2 Predict happiness level by K-Nearest Neighbours model

pred_knn <- predict(model_knn, data_test)

cm_knn<-confusionMatrix(pred_knn, data_test$Happy.Level)

cm_knn

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Mid Low
##       High   33   7   0
##       Mid     5  56   8
##       Low     0   6  24
## 
## Overall Statistics
##                                             
##                Accuracy : 0.8129            
##                  95% CI : (0.7381, 0.874)   
##     No Information Rate : 0.4964            
##     P-Value [Acc > NIR] : 0.0000000000000105
##                                             
##                   Kappa : 0.7008            
##                                             
##  Mcnemar's Test P-Value : NA                
## 
## Statistics by Class:
## 
##                      Class: High Class: Mid Class: Low
## Sensitivity               0.8684     0.8116     0.7500
## Specificity               0.9307     0.8143     0.9439
## Pos Pred Value            0.8250     0.8116     0.8000
## Neg Pred Value            0.9495     0.8143     0.9266
## Prevalence                0.2734     0.4964     0.2302
## Detection Rate            0.2374     0.4029     0.1727
## Detection Prevalence      0.2878     0.4964     0.2158
## Balanced Accuracy         0.8996     0.8129     0.8470
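The headline statistics can be recomputed directly from the confusion matrix, which helps when interpreting them. The sketch below re-enters the KNN matrix printed above; the object names are made up.

```r
# Rows are predictions, columns are the reference (true) classes
cm <- matrix(c(33,  7,  0,
                5, 56,  8,
                0,  6, 24),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("High", "Mid", "Low"),
                             Reference  = c("High", "Mid", "Low")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Sensitivity (recall) per class: correct predictions / true class total
sensitivity <- diag(cm) / colSums(cm)

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappa_hat), 4)
round(sensitivity, 4)
```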

5.3.3 Feature Importance

# Create object of importance of our variables 
knn_importance <- varImp(model_knn) 

# Plot the importance scores using caret's ggplot method for varImp objects
ggplot(knn_importance) +
  labs(title = "Variable importance: K-Nearest Neighbours") + # Title
  theme_light() # Theme

5.4 Naive Bayes Classification model for predicting Happy Level

5.4.1 Train Naive Bayes Classification model with data_train

model_nb <- train(Happy.Level~.,
                  data_train,
                  method="naive_bayes",
                  preProcess = c("center","scale"),
                  metric='Accuracy',
                  trControl=tc)

model_nb

## Naive Bayes 
## 
## 566 samples
##   6 predictor
##   3 classes: 'High', 'Mid', 'Low' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 510, 510, 508, 509, 510, 509, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  logLoss    AUC        prAUC      Accuracy   Kappa      Mean_F1  
##   FALSE      0.7237029  0.8965975  0.7632952  0.7597568  0.6156457  0.7583292
##    TRUE      0.7014796  0.8940879  0.7646527  0.7662536  0.6232512  0.7631400
##   Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
##   0.7572299         0.8664677         0.7686608            0.8689490          
##   0.7596220         0.8684121         0.7801272            0.8726667          
##   Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
##   0.7686608       0.7572299    0.2532523            0.8118488             
##   0.7801272       0.7596220    0.2554179            0.8140171             
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 1.

plot(model_nb)

5.4.2 Predict happiness Level by Naive Bayes

pred_nb <- predict(model_nb, data_test)

cm_nb<-confusionMatrix(pred_nb, data_test$Happy.Level)

cm_nb

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Mid Low
##       High   31   4   0
##       Mid     7  54   8
##       Low     0  11  24
## 
## Overall Statistics
##                                            
##                Accuracy : 0.7842           
##                  95% CI : (0.7065, 0.8494) 
##     No Information Rate : 0.4964           
##     P-Value [Acc > NIR] : 0.000000000002766
##                                            
##                   Kappa : 0.6557           
##                                            
##  Mcnemar's Test P-Value : NA               
## 
## Statistics by Class:
## 
##                      Class: High Class: Mid Class: Low
## Sensitivity               0.8158     0.7826     0.7500
## Specificity               0.9604     0.7857     0.8972
## Pos Pred Value            0.8857     0.7826     0.6857
## Neg Pred Value            0.9327     0.7857     0.9231
## Prevalence                0.2734     0.4964     0.2302
## Detection Rate            0.2230     0.3885     0.1727
## Detection Prevalence      0.2518     0.4964     0.2518
## Balanced Accuracy         0.8881     0.7842     0.8236

5.4.3 Feature Importance

# Create object of importance of our variables 
nb_importance <- varImp(model_nb) 

# Plot the importance scores using caret's ggplot method for varImp objects
ggplot(nb_importance) +
  labs(title = "Variable importance: Naive Bayes model") + # Title
  theme_light() # Theme

5.5 Model Evaluation by AUC

model_list <- list(KNN = model_knn, NB=model_nb)
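# Collect the cross-validation results of both models
# (named model_resamples to avoid masking caret's resamples() function)
model_resamples <- resamples(model_list)

bwplot(model_resamples, metric="AUC")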

5.6 Model Evaluation by Prediction Accuracy

data.frame(
  K_Nearest_Neighbours= cm_knn$overall[1],
  Naive_Bayes=  cm_nb$overall[1]
)

##          K_Nearest_Neighbours Naive_Bayes
## Accuracy            0.8129496   0.7841727
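Whether this accuracy gap is meaningful on 139 test rows can be judged from exact binomial confidence intervals, which we understand to be what `confusionMatrix()` reports as the 95% CI. The sketch below uses the correct-prediction counts from the two confusion matrices; the object names are made up.

```r
n_test      <- 139            # total of either confusion matrix
correct_knn <- 33 + 56 + 24   # diagonal of the KNN matrix
correct_nb  <- 31 + 54 + 24   # diagonal of the Naive Bayes matrix

# Exact (Clopper-Pearson) 95% confidence intervals for the accuracies
ci_knn <- binom.test(correct_knn, n_test)$conf.int
ci_nb  <- binom.test(correct_nb,  n_test)$conf.int

# The intervals overlap if each lower bound lies below the other's upper bound
overlap <- ci_knn[1] < ci_nb[2] && ci_nb[1] < ci_knn[2]
overlap
```

Overlapping intervals suggest that the difference between the two classifiers could be due to the particular test split.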

6 Discussion

6.1 Happiness Countries from 2015 to 2019

Although their ranking positions change from year to year, the countries on the top 10 happiest countries list did not change from 2015 to 2019. They are Finland, Denmark, Norway, Iceland, Netherlands, Switzerland, Sweden, New Zealand, Canada and Australia.

6.2 Progressive Countries from 2015 to 2019

The top 10 progressive countries are Benin, Togo, Ivory Coast, Burundi, Burkina Faso, Guinea, Gabon, Cambodia, Honduras and Congo (Brazzaville). The happiness scores of these countries did not decline, and their people became happier every year.

6.3 The Main Factor Affecting Happiness

The factor that most affects happiness is the economy (GDP per capita), probably because income lets people meet their basic needs. In the multiple linear regression it has the largest t value (11.6). The other main factor is health (life expectancy), with a t value of 7.7.

6.4 Happiness Score Prediction (Regression Model)

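Random Forest Regression produced the best test metrics (R² 0.969, RMSE 0.210), but it was fit on the full dataset, which includes the test rows, so these figures are optimistic and not directly comparable with the other models. Among the models trained only on data_train, Support Vector Regression performed best (R² 0.825, RMSE 0.474), followed by Multiple Linear Regression (R² 0.764, RMSE 0.548). Decision Tree Regression was the weakest at predicting happiness scores (R² 0.682, RMSE 0.636).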

6.5 Happiness Level Prediction (Classification Model)

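The K-Nearest Neighbours model performed better than the Naive Bayes model on the test set, with a prediction accuracy of 0.813 against 0.784. The margin is small, however: the two 95% confidence intervals, (0.738, 0.874) and (0.707, 0.849), overlap, and the cross-validated accuracy of the two models was almost identical (about 0.766 each).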

7 Conclusion

In conclusion, this study has shown that happiness depends on a wide range of influences. Regular, large-scale collection of happiness data can therefore inform policy-making and help us identify which “deliverables” should be created to foster well-being. The differences between countries also suggest that moving to a happier country could plausibly make a person happier and, by the same token, that moving to a less happy country could reduce their happiness. Emotions may be contagious, even at a national level.

Analysis of World Happiness

2022-06-18