Happiness
GROUP MEMBERS (GROUP 13)
CHEN BAOGANG 17186722
LIM SHI JUN 17113677
RONJON AHMED S2150527
ZHU MINGLI S2147909
Happiness is an emotional state characterized by feelings of joy, fulfillment, satisfaction and contentment. It has many distinct definitions and is frequently associated with positive emotions and life satisfaction. When most people talk about happiness, they may be referring to how they feel in the present moment or to a broader sense of how they feel about life in general.
How happy are people today? Were people happier in the past? How satisfied are people in different societies with their lives? And how do our living conditions affect all of this? These are difficult questions to answer, but they are surely important to each of us individually. Indeed, life satisfaction and happiness are becoming important study topics in the social sciences, including in ‘mainstream’ economics. In this study, we explore the data and empirical evidence that may provide answers to these questions. Our focus is on survey-based measures of world happiness by country.
Which countries are on the Top 10 Happiness Countries list?
Which countries are on the Top 10 Progressive Countries list?
Which factor affects people’s happiness the most?
Which regression model is the best to predict the happiness scores?
Which classification model is the best to predict happiness levels?
This study aims to both quantify and analyze well-being around the world. Our main goal is to do an exploratory analysis of the factors that make people happy.
To predict happiness score through different regression models
To predict happiness level through two classification models
World Happiness Report
Source: https://www.kaggle.com/datasets/unsdsn/world-happiness
The dataset we use is titled “World Happiness Report”. It covers 2015 to 2019, with each year stored in a separate CSV file. Each file has 9 to 13 variables and covers roughly 155 to 158 countries. The dependent variable in this study is the happiness score. The independent variables are the factors that affect people’s happiness: family, freedom, life expectancy, GDP per capita, generosity, and trust (perceptions of government corruption).
The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale.
The datasets need to be cleaned and tidied by renaming the variables so that they have the same names in all five CSV files, combining the files, and ensuring that the combined dataset contains no anomalous or missing values.
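The renaming and combining steps described above can be sketched on toy data. This is a minimal illustration, not the code used in the study: the data frames `y2015` and `y2019`, the lookup vector `name_map` and the helper `harmonise_names` are hypothetical names introduced here, and the toy tables contain only three of the real columns.

```r
# Toy stand-ins for two yearly files (the real files have more columns).
y2015 <- data.frame(Country = c("A", "B"),
                    Happiness.Score = c(7.1, 5.2),
                    Family = c(1.3, 0.9))
y2019 <- data.frame(Country.or.region = c("A", "B"),
                    Score = c(7.3, 5.0),
                    Social.support = c(1.5, 1.1))

# Lookup: names are the old column names, values are the common names.
name_map <- c(Country.or.region = "Country",
              Score = "Happiness.Score",
              Social.support = "Family")

# Hypothetical helper: rename only the columns that appear in the lookup.
harmonise_names <- function(d, map) {
  hit <- names(d) %in% names(map)
  names(d)[hit] <- unname(map[names(d)[hit]])
  d
}

y2015 <- harmonise_names(y2015, name_map)  # no change: names already match
y2019 <- harmonise_names(y2019, name_map)

# Tag each table with its year, then stack them.
combined <- rbind(cbind(Year = 2015, y2015),
                  cbind(Year = 2019, y2019))
print(combined)
```

Keeping the mapping in one lookup vector means the same rule is applied to every file, which avoids the lists drifting apart when a name is changed.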
Country
Happiness rank
Happiness score
This is obtained from a sample of the population. The survey-taker asked each respondent to rate their life on a scale from 0 to 10.
Extent to which GDP per capita contributes to the happiness score
Extent to which family contributes to the happiness score
Extent to which health (life expectancy) contributes to the happiness score
Extent to which freedom contributes to the happiness score. Freedom here covers freedom of speech, freedom to pursue what we want, etc.
Extent to which trust with regard to government corruption contributes to the happiness score
Extent to which generosity contributes to the happiness score
library(Metrics)
library(caret)
library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(skimr)
library(tidyr)
library(reshape2)
library(ggpubr)
library(stringr)
library(e1071)
library(pROC)
happy15_df = read.csv ("data/2015.csv")
happy16_df = read.csv ("data/2016.csv")
happy17_df = read.csv ("data/2017.csv")
happy18_df = read.csv ("data/2018.csv")
happy19_df = read.csv ("data/2019.csv")
head(happy19_df)
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1 1 Finland 7.769 1.340 1.587
## 2 2 Denmark 7.600 1.383 1.573
## 3 3 Norway 7.554 1.488 1.582
## 4 4 Iceland 7.494 1.380 1.624
## 5 5 Netherlands 7.488 1.396 1.522
## 6 6 Switzerland 7.480 1.452 1.526
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1 0.986 0.596 0.153
## 2 0.996 0.592 0.252
## 3 1.028 0.603 0.271
## 4 1.026 0.591 0.354
## 5 0.999 0.557 0.322
## 6 1.052 0.572 0.263
## Perceptions.of.corruption
## 1 0.393
## 2 0.410
## 3 0.341
## 4 0.118
## 5 0.298
## 6 0.343
Above, we look at the most recent file, 2019 (head shows the first 6 rows by default).
happy18_df=plyr::rename(happy18_df, replace = c( "Country.or.region"="Country",
"Overall.rank"="Happiness.Rank" ,
"GDP.per.capita"="Economy..GDP.per.Capita.",
"Healthy.life.expectancy"="Health..Life.Expectancy.",
"Freedom.to.make.life.choices"="Freedom",
"Perceptions.of.corruption"="Trust..Government.Corruption.",
"Social.support"="Family",
"Score"="Happiness.Score"))
colnames(happy18_df)
## [1] "Happiness.Rank" "Country"
## [3] "Happiness.Score" "Economy..GDP.per.Capita."
## [5] "Family" "Health..Life.Expectancy."
## [7] "Freedom" "Generosity"
## [9] "Trust..Government.Corruption."
happy19_df=plyr::rename(happy19_df, replace = c( "Country.or.region"="Country",
"Overall.rank"="Happiness.Rank" ,
"GDP.per.capita"="Economy..GDP.per.Capita.",
"Healthy.life.expectancy"="Health..Life.Expectancy.",
"Freedom.to.make.life.choices"="Freedom",
"Perceptions.of.corruption"="Trust..Government.Corruption.",
"Social.support"="Family",
"Score"="Happiness.Score"))
colnames(happy19_df)
## [1] "Happiness.Rank" "Country"
## [3] "Happiness.Score" "Economy..GDP.per.Capita."
## [5] "Family" "Health..Life.Expectancy."
## [7] "Freedom" "Generosity"
## [9] "Trust..Government.Corruption."
happy15_df=plyr::rename(happy15_df, replace = c( "Happiness Rank" = "Happiness.Rank",
"Happiness Score" = "Happiness.Score",
"Economy (GDP per Capita)" = "Economy..GDP.per.Capita.",
"Health (Life Expectancy)" = "Health..Life.Expectancy.",
"Trust (Government Corruption)" = "Trust..Government.Corruption.",
"Dystopia Residual"="Dystopia.Residual"
))
colnames(happy15_df)
## [1] "Country" "Region"
## [3] "Happiness.Rank" "Happiness.Score"
## [5] "Standard.Error" "Economy..GDP.per.Capita."
## [7] "Family" "Health..Life.Expectancy."
## [9] "Freedom" "Trust..Government.Corruption."
## [11] "Generosity" "Dystopia.Residual"
happy16_df=plyr::rename(happy16_df, replace = c( "Happiness Rank" = "Happiness.Rank",
"Happiness Score" = "Happiness.Score",
"Economy (GDP per Capita)" = "Economy..GDP.per.Capita.",
"Health (Life Expectancy)" = "Health..Life.Expectancy.",
"Trust (Government Corruption)" = "Trust..Government.Corruption.",
"Dystopia Residual"="Dystopia.Residual"
))
colnames(happy16_df)
## [1] "Country" "Region"
## [3] "Happiness.Rank" "Happiness.Score"
## [5] "Lower.Confidence.Interval" "Upper.Confidence.Interval"
## [7] "Economy..GDP.per.Capita." "Family"
## [9] "Health..Life.Expectancy." "Freedom"
## [11] "Trust..Government.Corruption." "Generosity"
## [13] "Dystopia.Residual"
happy15_df<-cbind(Year=2015,happy15_df)
happy16_df<-cbind(Year=2016,happy16_df)
happy17_df<-cbind(Year=2017,happy17_df)
happy18_df<-cbind(Year=2018,happy18_df)
happy19_df<-cbind(Year=2019,happy19_df)
happy18_df$Trust..Government.Corruption. = as.numeric(happy18_df$Trust..Government.Corruption.)
str(happy18_df)
## 'data.frame': 156 obs. of 10 variables:
## $ Year : num 2018 2018 2018 2018 2018 ...
## $ Happiness.Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Country : chr "Finland" "Norway" "Denmark" "Iceland" ...
## $ Happiness.Score : num 7.63 7.59 7.55 7.5 7.49 ...
## $ Economy..GDP.per.Capita. : num 1.3 1.46 1.35 1.34 1.42 ...
## $ Family : num 1.59 1.58 1.59 1.64 1.55 ...
## $ Health..Life.Expectancy. : num 0.874 0.861 0.868 0.914 0.927 0.878 0.896 0.876 0.913 0.91 ...
## $ Freedom : num 0.681 0.686 0.683 0.677 0.66 0.638 0.653 0.669 0.659 0.647 ...
## $ Generosity : num 0.202 0.286 0.284 0.353 0.256 0.333 0.321 0.365 0.285 0.361 ...
## $ Trust..Government.Corruption.: num 0.393 0.34 0.408 0.138 0.357 0.295 0.291 0.389 0.383 0.302 ...
happy15_16<-dplyr::bind_rows(happy15_df,happy16_df)
happy15_16_17<-dplyr::bind_rows(happy15_16,happy17_df)
happy18_19<-dplyr::bind_rows(happy18_df,happy19_df)
df<-dplyr::bind_rows(happy18_19,happy15_16_17)
head(df)
## Year Happiness.Rank Country Happiness.Score Economy..GDP.per.Capita.
## 1 2018 1 Finland 7.632 1.305
## 2 2018 2 Norway 7.594 1.456
## 3 2018 3 Denmark 7.555 1.351
## 4 2018 4 Iceland 7.495 1.343
## 5 2018 5 Switzerland 7.487 1.420
## 6 2018 6 Netherlands 7.441 1.361
## Family Health..Life.Expectancy. Freedom Generosity
## 1 1.592 0.874 0.681 0.202
## 2 1.582 0.861 0.686 0.286
## 3 1.590 0.868 0.683 0.284
## 4 1.644 0.914 0.677 0.353
## 5 1.549 0.927 0.660 0.256
## 6 1.488 0.878 0.638 0.333
## Trust..Government.Corruption. Region Standard.Error Dystopia.Residual
## 1 0.393 <NA> NA NA
## 2 0.340 <NA> NA NA
## 3 0.408 <NA> NA NA
## 4 0.138 <NA> NA NA
## 5 0.357 <NA> NA NA
## 6 0.295 <NA> NA NA
## Lower.Confidence.Interval Upper.Confidence.Interval Whisker.high Whisker.low
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
df$Happiness.Rank = as.numeric(df$Happiness.Rank )
str(df)
## 'data.frame': 782 obs. of 17 variables:
## $ Year : num 2018 2018 2018 2018 2018 ...
## $ Happiness.Rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Country : chr "Finland" "Norway" "Denmark" "Iceland" ...
## $ Happiness.Score : num 7.63 7.59 7.55 7.5 7.49 ...
## $ Economy..GDP.per.Capita. : num 1.3 1.46 1.35 1.34 1.42 ...
## $ Family : num 1.59 1.58 1.59 1.64 1.55 ...
## $ Health..Life.Expectancy. : num 0.874 0.861 0.868 0.914 0.927 0.878 0.896 0.876 0.913 0.91 ...
## $ Freedom : num 0.681 0.686 0.683 0.677 0.66 0.638 0.653 0.669 0.659 0.647 ...
## $ Generosity : num 0.202 0.286 0.284 0.353 0.256 0.333 0.321 0.365 0.285 0.361 ...
## $ Trust..Government.Corruption.: num 0.393 0.34 0.408 0.138 0.357 0.295 0.291 0.389 0.383 0.302 ...
## $ Region : chr NA NA NA NA ...
## $ Standard.Error : num NA NA NA NA NA NA NA NA NA NA ...
## $ Dystopia.Residual : num NA NA NA NA NA NA NA NA NA NA ...
## $ Lower.Confidence.Interval : num NA NA NA NA NA NA NA NA NA NA ...
## $ Upper.Confidence.Interval : num NA NA NA NA NA NA NA NA NA NA ...
## $ Whisker.high : num NA NA NA NA NA NA NA NA NA NA ...
## $ Whisker.low : num NA NA NA NA NA NA NA NA NA NA ...
Count the NA values in every column
colSums(is.na(df))
## Year Happiness.Rank
## 0 0
## Country Happiness.Score
## 0 0
## Economy..GDP.per.Capita. Family
## 0 0
## Health..Life.Expectancy. Freedom
## 0 0
## Generosity Trust..Government.Corruption.
## 0 1
## Region Standard.Error
## 467 624
## Dystopia.Residual Lower.Confidence.Interval
## 312 625
## Upper.Confidence.Interval Whisker.high
## 625 627
## Whisker.low
## 627
Remove unnecessary columns
df = subset(df, select = -c(Lower.Confidence.Interval,Upper.Confidence.Interval,Dystopia.Residual,Standard.Error,Whisker.high,Whisker.low))
colSums(is.na(df))
## Year Happiness.Rank
## 0 0
## Country Happiness.Score
## 0 0
## Economy..GDP.per.Capita. Family
## 0 0
## Health..Life.Expectancy. Freedom
## 0 0
## Generosity Trust..Government.Corruption.
## 0 1
## Region
## 467
df$Trust..Government.Corruption.[is.na(df$Trust..Government.Corruption.)] <- median(df$Trust..Government.Corruption., na.rm = T)
colSums(is.na(df))
## Year Happiness.Rank
## 0 0
## Country Happiness.Score
## 0 0
## Economy..GDP.per.Capita. Family
## 0 0
## Health..Life.Expectancy. Freedom
## 0 0
## Generosity Trust..Government.Corruption.
## 0 0
## Region
## 467
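The median imputation applied above to the single missing trust value can be checked on a toy vector. This is a small sketch with made-up numbers, not values from the dataset.

```r
# Toy trust scores with one missing entry (illustrative values only).
trust <- c(0.39, 0.05, NA, 0.12, 0.30)

# Median of the observed values; na.rm = TRUE ignores the NA.
fill <- median(trust, na.rm = TRUE)

# Replace only the missing entries, leaving observed values untouched.
trust_filled <- ifelse(is.na(trust), fill, trust)
print(trust_filled)
```

The median is less sensitive to extreme values than the mean, which matters for a skewed variable such as the trust score.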
Because the data describe happiness scores and related factors for countries across several years, it is important to check how evenly the years are represented in the Year column.
Country counts grouped by Year
aggregate(df$Country, by=list(df$Year), FUN=length)
## Group.1 x
## 1 2015 158
## 2 2016 157
## 3 2017 155
## 4 2018 156
## 5 2019 156
As the table above shows, the number of countries in the dataset differs from year to year. We therefore take the intersection of the yearly country lists to obtain the countries common to all years.
Country_2015 = subset(df, Year == 2015)$Country
Country_2016 = subset(df, Year == 2016)$Country
Country_2017 = subset(df, Year == 2017)$Country
Country_2018 = subset(df, Year == 2018)$Country
Country_2019 = subset(df, Year == 2019)$Country
common_country =intersect(intersect(intersect(intersect(Country_2015,
Country_2016),Country_2017),Country_2018),Country_2019)
length(common_country)
## [1] 141
Therefore, 141 countries have data for every year from 2015 to 2019. We then filter the original dataset by this common_country list.
df1 = subset(df,Country %in% common_country)
print(paste("The amount of rows in the dataset is: ",dim(df1)[1]))
print(paste("The amount of columns in the dataset is: ",dim(df1)[2]))
## [1] "The amount of rows in the dataset is: 705"
## [1] "The amount of columns in the dataset is: 11"
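The nested intersect calls used above to build the common country list can also be written with `Reduce`, which scales to any number of years. This is a sketch on toy country lists; `countries_by_year` is a hypothetical name introduced for illustration.

```r
# Toy country lists per year (illustrative, not the real data).
countries_by_year <- list(
  `2015` = c("Finland", "Norway", "Oman", "Denmark"),
  `2016` = c("Norway", "Finland", "Denmark", "Belize"),
  `2017` = c("Denmark", "Finland", "Norway")
)

# Reduce applies intersect pairwise: intersect(intersect(y2015, y2016), y2017).
common <- Reduce(intersect, countries_by_year)
print(common)
```

On the real data, a list of this shape could presumably be produced with `split(df$Country, df$Year)`.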
Create a lookup table that stores the region of each common country
common_region <- unique(subset(df1, !is.na(Region), c(Country, Region)))
head(common_country)
## [1] "Switzerland" "Iceland" "Denmark" "Norway" "Canada"
## [6] "Finland"
Fill in the missing values of the Region column from this lookup
# Return the region recorded for country x in the lookup table
assign_region <- function(x){
common_region$Region[common_region$Country == x]
}
for(country in common_country)
df1$Region[df1$Country == country] <- assign_region(country)
write_csv(df1, "World Happiness Data (2015-2019)_cleaned.csv")
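The loop above that fills in missing regions can also be expressed as a vectorised lookup with `match`. This is a sketch on a toy panel; `panel` and `lookup` are hypothetical names and the rows are illustrative.

```r
# Toy panel: Region is known for 2015 but missing for 2019.
panel <- data.frame(
  Country = c("Finland", "Chile", "Finland", "Chile"),
  Year    = c(2015, 2015, 2019, 2019),
  Region  = c("Western Europe", "Latin America and Caribbean", NA, NA),
  stringsAsFactors = FALSE
)

# One row per country with a known region.
lookup <- unique(panel[!is.na(panel$Region), c("Country", "Region")])

# For each row with a missing region, find its country in the lookup table.
missing <- is.na(panel$Region)
panel$Region[missing] <- lookup$Region[match(panel$Country[missing], lookup$Country)]
print(panel)
```

`match` returns the first matching row, so this assumes each country maps to a single region name in the lookup table.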
skimr::skim_without_charts(df1)
| Name | df1 |
| Number of rows | 705 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Country | 0 | 1 | 4 | 23 | 0 | 141 | 0 |
| Region | 0 | 1 | 12 | 31 | 0 | 10 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1 | 2017.00 | 1.42 | 2015.00 | 2016.00 | 2017.00 | 2018.00 | 2019.00 |
| Happiness.Rank | 0 | 1 | 76.85 | 45.28 | 1.00 | 37.00 | 77.00 | 116.00 | 158.00 |
| Happiness.Score | 0 | 1 | 5.43 | 1.13 | 2.84 | 4.52 | 5.39 | 6.29 | 7.77 |
| Economy..GDP.per.Capita. | 0 | 1 | 0.93 | 0.40 | 0.00 | 0.64 | 1.00 | 1.24 | 2.10 |
| Family | 0 | 1 | 1.09 | 0.32 | 0.00 | 0.88 | 1.14 | 1.35 | 1.64 |
| Health..Life.Expectancy. | 0 | 1 | 0.63 | 0.23 | 0.00 | 0.49 | 0.66 | 0.81 | 1.14 |
| Freedom | 0 | 1 | 0.41 | 0.15 | 0.00 | 0.31 | 0.43 | 0.53 | 0.72 |
| Generosity | 0 | 1 | 0.22 | 0.13 | 0.00 | 0.13 | 0.20 | 0.28 | 0.84 |
| Trust..Government.Corruption. | 0 | 1 | 0.12 | 0.11 | 0.00 | 0.05 | 0.09 | 0.15 | 0.55 |
print(paste("The amount of rows in the dataset is: ",dim(df)[1]))
print(paste("The amount of columns in the dataset is: ",dim(df)[2]))
print(paste("the column names in this dataset are:", paste(shQuote(colnames(df)), collapse=", ")))
## [1] "The amount of rows in the dataset is: 782"
## [1] "The amount of columns in the dataset is: 11"
## [1] "the column names in this dataset are: \"Year\", \"Happiness.Rank\", \"Country\", \"Happiness.Score\", \"Economy..GDP.per.Capita.\", \"Family\", \"Health..Life.Expectancy.\", \"Freedom\", \"Generosity\", \"Trust..Government.Corruption.\", \"Region\""
df1 %>%
filter(Year == 2015) %>%
arrange(-Happiness.Score) %>%
slice_head(n=10) %>%
ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
geom_point(colour = "red", size = 3) +
theme(text=element_text(size=10)) +
coord_flip() +
labs(title = "The 10 happiest countries in 2015", x = "")
df1 %>%
filter(Year == 2016) %>%
arrange(-Happiness.Score) %>%
slice_head(n=10) %>%
ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
geom_point(colour = "red", size = 3) +
theme(text=element_text(size=10)) +
coord_flip() +
labs(title = "The 10 happiest countries in 2016", x = "")
df1 %>%
filter(Year == 2017) %>%
arrange(-Happiness.Score) %>%
slice_head(n=10) %>%
ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
geom_point(colour = "red", size = 3) +
theme(text=element_text(size=10)) +
coord_flip() +
labs(title = "The 10 happiest countries in 2017", x = "")
df1 %>%
filter(Year == 2018) %>%
arrange(-Happiness.Score) %>%
slice_head(n=10) %>%
ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
geom_point(colour = "red", size = 3) +
theme(text=element_text(size=10)) +
coord_flip() +
labs(title = "The 10 happiest countries in 2018", x = "")
df1 %>%
filter(Year == 2019) %>%
arrange(-Happiness.Score) %>%
slice_head(n=10) %>%
ggplot(aes(reorder(Country, Happiness.Score), Happiness.Score)) +
geom_point(colour = "red", size = 3) +
theme(text=element_text(size=10)) +
coord_flip() +
labs(title = "The 10 happiest countries in 2019", x = "")
In 2015, Switzerland was the happiest country, but it dropped to second place in 2016. Likewise, Denmark was the happiest country in 2016 but fell to second place in 2017, when Norway took the top spot. Finland was the happiest country in both 2018 and 2019.
gg2 <- ggplot(df1 , aes(x = Region, y = Happiness.Score)) +
geom_boxplot(aes(fill=Region)) + theme_bw() +
theme(axis.text.x = element_text (angle = 90))
gg2
The three happiest regions are Australia and New Zealand, North America, and Western Europe.
df1 %>%
group_by(Country) %>%
summarise(mscore = mean(Happiness.Score)) %>%
arrange(-mscore) %>%
slice_head(n=10) %>%
ggplot(aes(reorder(Country, mscore), mscore)) +
geom_point() +
theme_bw() +
coord_flip() +
labs(title = "Happiness Score by Country",
x = "", y = "Average happiness score")
The three happiest countries by average score are Denmark, Norway and Finland.
Top10_happy_country_DF = df1 %>%
group_by(Country) %>%
summarise(mscore = mean(Happiness.Score)) %>%
arrange(-mscore) %>%
slice_head(n=10)
Top10_happy_country_DF_list = c(Top10_happy_country_DF$Country)
df1_Top10_happy_country = subset(df1,Country %in% Top10_happy_country_DF_list)
ggplot(df1_Top10_happy_country, aes(x = Year,y = Happiness.Score,color = Country))+ geom_line()
Among these ten countries, only Finland shows a marked increase in happiness score from 2015 to 2019.
df1 %>%
mutate(y = as.character(Year)) %>%
select(y, Country, Region, Happiness.Score) %>%
pivot_wider(names_from = y, values_from = Happiness.Score,
names_prefix = "y_") %>%
mutate(p = (y_2019 - y_2015)/y_2015 * 100) %>%
arrange(-p) %>%
slice_head(n = 10) %>%
ggplot(aes(reorder(Country, p), p)) +
geom_point() +
theme_bw() +
coord_flip() +
labs(title = "The 10 most progressive countries from 2015 - 2019",
y = "Percentage Increase of Happiness Score", x = "")
Top10_Progress_country_df = df1 %>%
mutate(y = as.character(Year)) %>%
select(y, Country, Region, Happiness.Score) %>%
pivot_wider(names_from = y, values_from = Happiness.Score,
names_prefix = "y_") %>%
mutate(p = (y_2019 - y_2015)/y_2015 * 100) %>%
arrange(-p) %>%
slice_head(n = 10)
Top10_Progress_country_df_list = c(Top10_Progress_country_df$Country)
df1_Top10_Progress_country = subset(df1,Country %in% Top10_Progress_country_df_list)
ggplot(df1_Top10_Progress_country, aes(x = Year,y = Happiness.Score,color = Country))+ geom_line()
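The "progress" measure used above is the percentage change in happiness score between 2015 and 2019, p = (y_2019 - y_2015) / y_2015 * 100. A small worked example with made-up scores shows the arithmetic and the ranking step; the countries A, B and C are hypothetical.

```r
# Toy wide table: one row per country, one column per year (made-up scores).
scores <- data.frame(
  Country = c("A", "B", "C"),
  y_2015  = c(4.0, 5.0, 6.0),
  y_2019  = c(5.0, 5.5, 5.7)
)

# Percentage change relative to the 2015 score.
scores$p <- (scores$y_2019 - scores$y_2015) / scores$y_2015 * 100

# Rank countries from largest to smallest increase.
ranked <- scores[order(-scores$p), ]
print(ranked)
```

Because the change is relative to the 2015 score, a country that starts from a low score can rank highly with a modest absolute gain.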
colnames(df1)
## [1] "Year" "Happiness.Rank"
## [3] "Country" "Happiness.Score"
## [5] "Economy..GDP.per.Capita." "Family"
## [7] "Health..Life.Expectancy." "Freedom"
## [9] "Generosity" "Trust..Government.Corruption."
## [11] "Region"
head(df1)
## Year Happiness.Rank Country Happiness.Score Economy..GDP.per.Capita.
## 1 2018 1 Finland 7.632 1.305
## 2 2018 2 Norway 7.594 1.456
## 3 2018 3 Denmark 7.555 1.351
## 4 2018 4 Iceland 7.495 1.343
## 5 2018 5 Switzerland 7.487 1.420
## 6 2018 6 Netherlands 7.441 1.361
## Family Health..Life.Expectancy. Freedom Generosity
## 1 1.592 0.874 0.681 0.202
## 2 1.582 0.861 0.686 0.286
## 3 1.590 0.868 0.683 0.284
## 4 1.644 0.914 0.677 0.353
## 5 1.549 0.927 0.660 0.256
## 6 1.488 0.878 0.638 0.333
## Trust..Government.Corruption. Region
## 1 0.393 Western Europe
## 2 0.340 Western Europe
## 3 0.408 Western Europe
## 4 0.138 Western Europe
## 5 0.357 Western Europe
## 6 0.295 Western Europe
df1 %>%
summarise(gdp = mean(Economy..GDP.per.Capita.),
family = mean(Family),
life.expectancy = mean(Health..Life.Expectancy.),
freedom = mean(Freedom),
generosity = mean(Generosity),
corruption = mean(Trust..Government.Corruption.)) %>%
pivot_longer(c(gdp, family, life.expectancy,freedom,generosity, corruption),
names_to = "f", values_to = "value") %>%
ggplot(aes(reorder(f, value), value)) +
geom_bar(stat = "identity", fill = "darkgreen", width = 0.55, alpha = 0.7) +
geom_text(aes(label = paste0(round(value, 2)), vjust = -0.5)) +
theme_bw() +
labs(title = "The mean value of the factors" , y = "", x = "")
The family factor has the highest mean value, which is 1.09.
Happiness.Continent <- df1 %>%
select(-c(Year,Happiness.Rank))%>%
group_by(Region) %>%
summarise_at(vars(-Country), ~ mean(., na.rm = TRUE)) # formula lambda replaces the deprecated funs()
Happiness.Continent.melt <- melt(Happiness.Continent)
# Faceting
ggplot(Happiness.Continent.melt, aes(y=value, x=Region, color=Region, fill=Region)) +
geom_bar( stat="identity") +
facet_wrap(~variable) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Average value of happiness variables for different regions",
y = "Average value")
ggline1 = ggplot(df1, aes(x = Economy..GDP.per.Capita., y = Happiness.Score)) +
geom_point(size = .5, alpha = 0.8) +
geom_smooth(method = "lm", fullrange = TRUE) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline1a = ggplot(df1, aes(x = Economy..GDP.per.Capita., y = Happiness.Score)) +
geom_point(aes(color=Region), size = .5, alpha = 0.8) +
geom_smooth(aes(color = Region, fill = Region),
method = "lm", fullrange = TRUE) +
facet_wrap(~Region) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline1
ggline1a
ggline2 = ggplot(df1, aes(x = Family, y = Happiness.Score)) +
geom_point(size = .5, alpha = 0.8) +
geom_smooth(method = "lm", fullrange = TRUE) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline2a = ggplot(df1, aes(x = Family, y = Happiness.Score)) +
geom_point(aes(color=Region), size = .5, alpha = 0.8) +
geom_smooth(aes(color = Region, fill = Region),
method = "lm", fullrange = TRUE) +
facet_wrap(~Region) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline2
ggline2a
ggline3 = ggplot(df1, aes(x = Health..Life.Expectancy., y = Happiness.Score)) +
geom_point(size = .5, alpha = 0.8) +
geom_smooth(method = "lm", fullrange = TRUE) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline3a = ggplot(df1, aes(x = Health..Life.Expectancy., y = Happiness.Score)) +
geom_point(aes(color=Region), size = .5, alpha = 0.8) +
geom_smooth(aes(color = Region, fill = Region),
method = "lm", fullrange = TRUE) +
facet_wrap(~Region) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline3
ggline3a
ggline4 = ggplot(df1, aes(x =Freedom, y = Happiness.Score)) +
geom_point(size = .5, alpha = 0.8) +
geom_smooth(method = "lm", fullrange = TRUE) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline4a = ggplot(df1, aes(x =Freedom, y = Happiness.Score)) +
geom_point(aes(color=Region), size = .5, alpha = 0.8) +
geom_smooth(aes(color = Region, fill = Region),
method = "lm", fullrange = TRUE) +
facet_wrap(~Region) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline4
ggline4a
ggline5 = ggplot(df1, aes(x = Trust..Government.Corruption., y = Happiness.Score)) +
geom_point(size = .5, alpha = 0.8) +
geom_smooth(method = "lm", fullrange = TRUE) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline5a = ggplot(df1, aes(x = Trust..Government.Corruption., y = Happiness.Score)) +
geom_point(aes(color=Region), size = .5, alpha = 0.8) +
geom_smooth(aes(color = Region, fill = Region),
method = "lm", fullrange = TRUE) +
facet_wrap(~Region) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline5
ggline5a
ggline6 = ggplot(df1, aes(x = Generosity, y = Happiness.Score)) +
geom_point(size = .5, alpha = 0.8) +
geom_smooth(method = "lm", fullrange = TRUE) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline6a = ggplot(df1, aes(x = Generosity, y = Happiness.Score)) +
geom_point(aes(color=Region), size = .5, alpha = 0.8) +
geom_smooth(aes(color = Region, fill = Region),
method = "lm", fullrange = TRUE) +
facet_wrap(~Region) +
theme_bw() + labs(title = "Scatter plot with regression line")
ggline6
ggline6a
We drop the Year, Country, Happiness.Rank and Region columns before computing the correlation heatmap.
dataset = select(df1,-c("Year","Country","Happiness.Rank","Region"))
head(dataset)
## Happiness.Score Economy..GDP.per.Capita. Family Health..Life.Expectancy.
## 1 7.632 1.305 1.592 0.874
## 2 7.594 1.456 1.582 0.861
## 3 7.555 1.351 1.590 0.868
## 4 7.495 1.343 1.644 0.914
## 5 7.487 1.420 1.549 0.927
## 6 7.441 1.361 1.488 0.878
## Freedom Generosity Trust..Government.Corruption.
## 1 0.681 0.202 0.393
## 2 0.686 0.286 0.340
## 3 0.683 0.284 0.408
## 4 0.677 0.353 0.138
## 5 0.660 0.256 0.357
## 6 0.638 0.333 0.295
library(corrplot)
Num.cols <- sapply(dataset, is.numeric)
Cor.data <- cor(dataset[, Num.cols])
corrplot(Cor.data, method = 'color')
library(GGally)
ggcorr(dataset, label = TRUE, label_round = 2, label_size = 3.5, size = 2, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
rge_dif=round((max(dataset$Happiness.Score)-min(dataset$Happiness.Score))/3,3)
low=min(dataset$Happiness.Score)+rge_dif
mid=low+rge_dif
print(paste("range difference in happiness score: ",rge_dif))
print(paste('upper bound of Low grp',low))
print(paste('upper bound of Mid grp',mid))
print(paste('upper bound of High grp','max:',max(dataset$Happiness.Score)))
## [1] "range difference in happiness score: 1.643"
## [1] "upper bound of Low grp 4.482"
## [1] "upper bound of Mid grp 6.125"
## [1] "upper bound of High grp max: 7.769"
Transform the “Happiness.Score” column into a “Happy.Level” column
dataset_level <- dataset %>%
mutate(Happy.Level=case_when(
Happiness.Score <=low ~ "Low",
Happiness.Score>low & Happiness.Score <=mid ~ "Mid",
Happiness.Score >mid ~ "High"
)) %>%
mutate(Happy.Level=factor(Happy.Level, levels=c("High", "Mid", "Low"))) %>%
select(-Happiness.Score)
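The three-level binning above splits the score range into equal thirds, using the bounds 4.482 and 6.125 printed earlier. The same rule can be sketched with base R's `cut`, shown here on a few toy scores as an alternative to `case_when`, not the code used in the study.

```r
# Upper bounds of the Low and Mid groups, taken from the printed output above.
low <- 4.482
mid <- 6.125

# Toy scores, including values that sit exactly on the bounds.
scores <- c(2.9, 4.482, 4.5, 6.125, 6.2, 7.769)

# right = TRUE makes intervals closed on the right: (-Inf, low], (low, mid], (mid, Inf).
level <- cut(scores,
             breaks = c(-Inf, low, mid, Inf),
             labels = c("Low", "Mid", "High"),
             right = TRUE)

# Reorder the factor levels to match the ordering used in the study.
level <- factor(level, levels = c("High", "Mid", "Low"))
print(data.frame(scores, level))
```

A score equal to a bound falls into the lower group, which matches the `<=` comparisons in the `case_when` version.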
# Splitting the dataset into the Training set and Test set
set.seed(123)
split=0.80
trainIndex <- createDataPartition(dataset$Happiness.Score, p=split, list=FALSE)
data_train <- dataset[ trainIndex,]
data_test <- dataset[-trainIndex,]
# Fitting Multiple Linear Regression to the Training set
lm_model = lm(formula = Happiness.Score ~ .,
data = data_train)
summary(lm_model)
##
## Call:
## lm(formula = Happiness.Score ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87619 -0.33193 0.00798 0.34507 1.43298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.09629 0.09560 21.927 < 0.0000000000000002
## Economy..GDP.per.Capita. 1.14157 0.09825 11.619 < 0.0000000000000002
## Family 0.64275 0.09459 6.795 0.0000000000277987
## Health..Life.Expectancy. 1.26063 0.16461 7.658 0.0000000000000837
## Freedom 1.21029 0.20527 5.896 0.0000000064505937
## Generosity 0.71311 0.19481 3.661 0.000276
## Trust..Government.Corruption. 0.96843 0.26401 3.668 0.000268
##
## (Intercept) ***
## Economy..GDP.per.Capita. ***
## Family ***
## Health..Life.Expectancy. ***
## Freedom ***
## Generosity ***
## Trust..Government.Corruption. ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5422 on 558 degrees of freedom
## Multiple R-squared: 0.7722, Adjusted R-squared: 0.7697
## F-statistic: 315.2 on 6 and 558 DF, p-value: < 0.00000000000000022
An (adjusted) R2 close to 1 indicates that the regression model explains a large proportion of the variability in the outcome.
A value near 0 indicates that the model explains little of that variability.
Our adjusted R2 is 0.7697, so the model accounts for about 77% of the variance in happiness scores in the training set.
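The adjusted R2 reported in the summary can be reproduced approximately from the multiple R2 with the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1). In this sketch, n = 565 is inferred from the 558 residual degrees of freedom plus 7 estimated coefficients, and p = 6 is the number of predictors; `adjusted_r2` is a helper name introduced here.

```r
# Adjusted R-squared penalises R-squared for the number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values read from the lm summary above (R2 is rounded to 4 decimals there).
adj <- adjusted_r2(r2 = 0.7722, n = 565, p = 6)
print(adj)
```

Because the input R2 is already rounded, the result may differ from the printed value in the last decimal place.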
y_pred_lm = predict(lm_model, newdata = data_test)
Actual_lm = data_test$Happiness.Score
Pred_Actual_lm <- as.data.frame(cbind(Prediction = y_pred_lm, Actual = Actual_lm))
gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Multiple Linear Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm
data.frame(
R2 = R2(y_pred_lm, data_test$Happiness.Score),
RMSE = RMSE(y_pred_lm, data_test$Happiness.Score),
MAE = MAE(y_pred_lm, data_test$Happiness.Score)
)
## R2 RMSE MAE
## 1 0.7643535 0.5478055 0.4256454
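The three metrics reported above can be written out in base R to make their definitions explicit. This sketch uses made-up predictions and assumes caret's default for `R2`, which is the squared Pearson correlation between predictions and observations.

```r
# Toy actual and predicted happiness scores (illustrative values only).
actual    <- c(7.0, 6.0, 5.0, 4.0)
predicted <- c(6.8, 6.3, 4.9, 4.4)

err <- predicted - actual

# Root mean squared error: penalises large errors more heavily.
rmse <- sqrt(mean(err^2))

# Mean absolute error: average size of the error, in score units.
mae <- mean(abs(err))

# Squared correlation between predictions and actual values.
r2 <- cor(predicted, actual)^2

print(data.frame(R2 = r2, RMSE = rmse, MAE = mae))
```

RMSE and MAE are in the same units as the happiness score, so an RMSE of about 0.55 corresponds to roughly half a point on the 0 to 10 ladder.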
library(e1071)
regressor_svr = svm(formula = Happiness.Score ~ .,
data = data_train,
type = 'eps-regression',
kernel = 'radial')
# Predicting happiness score with SVR model
y_pred_svr = predict(regressor_svr, newdata = data_test)
Pred_Actual_svr <- as.data.frame(cbind(Prediction = y_pred_svr, Actual = data_test$Happiness.Score))
Pred_Actual_lm.versus.svr <- cbind(Prediction.lm = y_pred_lm, Prediction.svr = y_pred_svr, Actual = data_test$Happiness.Score)
gg.svr <- ggplot(Pred_Actual_svr, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "SVR", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.svr
data.frame(
R2 = R2(y_pred_svr, data_test$Happiness.Score),
RMSE = RMSE(y_pred_svr, data_test$Happiness.Score),
MAE = MAE(y_pred_svr, data_test$Happiness.Score)
)
## R2 RMSE MAE
## 1 0.8246303 0.4740831 0.3504708
# install.packages("rpart")
library(rpart)
regressor_dt = rpart(formula = Happiness.Score ~ .,
data = data_train,
control = rpart.control(minsplit = 10))
# Predicting happiness score with Decision Tree Regression
y_pred_dt = predict(regressor_dt, newdata = data_test)
Pred_Actual_dt <- as.data.frame(cbind(Prediction = y_pred_dt, Actual = data_test$Happiness.Score))
gg.dt <- ggplot(Pred_Actual_dt, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Decision Tree Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.dt
# install.packages("rpart.plot")
library(rpart.plot)
prp(regressor_dt)
data.frame(
R2 = R2(y_pred_dt, data_test$Happiness.Score),
RMSE = RMSE(y_pred_dt, data_test$Happiness.Score),
MAE = MAE(y_pred_dt, data_test$Happiness.Score)
)
## R2 RMSE MAE
## 1 0.682486 0.6362329 0.5223723
library(randomForest)
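x_train_rf<-select(dataset,-c("Happiness.Score"))
set.seed(1234)
# Note: the forest is fitted on the full dataset (x_train_rf and dataset$Happiness.Score),
# which includes the rows of data_test, so the test metrics below are optimistic.
# Fitting on data_train only would give a like-for-like comparison with the other models.
regressor_rf = randomForest(x = x_train_rf,
y = dataset$Happiness.Score,
ntree = 500)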
# Predicting happiness score with Random Forest Regression
y_pred_rf = predict(regressor_rf, newdata = data_test)
Pred_Actual_rf <- as.data.frame(cbind(Prediction = y_pred_rf, Actual = data_test$Happiness.Score))
gg.rf <- ggplot(Pred_Actual_rf, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Random Forest Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.rf
data.frame(
R2 = R2(y_pred_rf, data_test$Happiness.Score),
RMSE = RMSE(y_pred_rf, data_test$Happiness.Score),
MAE = MAE(y_pred_rf, data_test$Happiness.Score)
)
## R2 RMSE MAE
## 1 0.9692887 0.2104387 0.1561681
ggarrange(gg.lm, gg.svr, gg.dt, gg.rf, ncol = 2, nrow = 2)
The dependent variable is the happiness level (Happy.Level) in dataset_level.
# Splitting the dataset into the Training set and Test set
set.seed(123)
split=0.80
trainIndex <- createDataPartition(dataset_level$Happy.Level, p=split, list=FALSE)
data_train <- dataset_level[ trainIndex,]
data_test <- dataset_level[-trainIndex,]
tc <- trainControl(method = "repeatedcv",
number = 10, # 10-fold cross-validation
repeats = 3, # repeated 3 times
classProbs = TRUE, # estimate class probabilities
savePredictions = TRUE,
summaryFunction = multiClassSummary)
set.seed(123)
model_knn <- train(
Happy.Level~.,
data=data_train,
trControl=tc,
preProcess = c("center","scale"),
method="knn",
metric='Accuracy',
tuneLength=20
)
model_knn
## k-Nearest Neighbors
##
## 566 samples
## 6 predictor
## 3 classes: 'High', 'Mid', 'Low'
##
## Pre-processing: centered (6), scaled (6)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 509, 509, 509, 510, 510, 509, ...
## Resampling results across tuning parameters:
##
## k logLoss AUC prAUC Accuracy Kappa Mean_F1
## 5 1.6389388 0.8860447 0.5147550 0.7485366 0.5928032 0.7428531
## 7 1.1644979 0.8942592 0.5588632 0.7544371 0.5991474 0.7464060
## 9 0.7990958 0.9034868 0.5949783 0.7662165 0.6184617 0.7591744
## 11 0.7426081 0.9052587 0.6174876 0.7609736 0.6084687 0.7524037
## 13 0.7355638 0.9022291 0.6227409 0.7521386 0.5937375 0.7425297
## 15 0.7014662 0.9026759 0.6249766 0.7526820 0.5945413 0.7427888
## 17 0.6292874 0.9044261 0.6334476 0.7486198 0.5872415 0.7381265
## 19 0.5776395 0.9036594 0.6436939 0.7509690 0.5899505 0.7386437
## 21 0.5267587 0.9035198 0.6418789 0.7474192 0.5848969 0.7360335
## 23 0.5340227 0.9010732 0.6418692 0.7539567 0.5954132 0.7430073
## 25 0.5378848 0.9002502 0.6439309 0.7504782 0.5897219 0.7400664
## 27 0.5409197 0.8990913 0.6533776 0.7480767 0.5845696 0.7365311
## 29 0.5414947 0.8999231 0.6643858 0.7445471 0.5777809 0.7325024
## 31 0.5444279 0.8991016 0.6705870 0.7446191 0.5773339 0.7318675
## 33 0.5473342 0.8983055 0.6675026 0.7386973 0.5672005 0.7258717
## 35 0.5499086 0.8981910 0.6679192 0.7368810 0.5651835 0.7248550
## 37 0.5521854 0.8978885 0.6792291 0.7351572 0.5617189 0.7225790
## 39 0.5537471 0.8982587 0.6810644 0.7345728 0.5604520 0.7216963
## 41 0.5555175 0.8975937 0.6831163 0.7328184 0.5566654 0.7191187
## 43 0.5573566 0.8970735 0.6788005 0.7375493 0.5656219 0.7249270
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
## 0.7376124 0.8575736 0.7629220 0.8631876
## 0.7355902 0.8582846 0.7746184 0.8672235
## 0.7481970 0.8648439 0.7858549 0.8736509
## 0.7393457 0.8609607 0.7827231 0.8713618
## 0.7294857 0.8559641 0.7717696 0.8661652
## 0.7301567 0.8563026 0.7731878 0.8671111
## 0.7240807 0.8534211 0.7710150 0.8651708
## 0.7248006 0.8544001 0.7735897 0.8666976
## 0.7224491 0.8527844 0.7686853 0.8642012
## 0.7293335 0.8564037 0.7750833 0.8676991
## 0.7260277 0.8544421 0.7740571 0.8658964
## 0.7216382 0.8523689 0.7734210 0.8650008
## 0.7166382 0.8498173 0.7734903 0.8632140
## 0.7154329 0.8495512 0.7728127 0.8631285
## 0.7090822 0.8460099 0.7661875 0.8589382
## 0.7092720 0.8455246 0.7628424 0.8576898
## 0.7061274 0.8441816 0.7621036 0.8567201
## 0.7049964 0.8436465 0.7620600 0.8567828
## 0.7016916 0.8422836 0.7617451 0.8555782
## 0.7075941 0.8453322 0.7659850 0.8585823
## Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.7629220 0.7376124 0.2495122 0.7975930
## 0.7746184 0.7355902 0.2514790 0.7969374
## 0.7858549 0.7481970 0.2554055 0.8065204
## 0.7827231 0.7393457 0.2536579 0.8001532
## 0.7717696 0.7294857 0.2507129 0.7927249
## 0.7731878 0.7301567 0.2508940 0.7932296
## 0.7710150 0.7240807 0.2495399 0.7887509
## 0.7735897 0.7248006 0.2503230 0.7896004
## 0.7686853 0.7224491 0.2491397 0.7876168
## 0.7750833 0.7293335 0.2513189 0.7928686
## 0.7740571 0.7260277 0.2501594 0.7902349
## 0.7734210 0.7216382 0.2493589 0.7870035
## 0.7734903 0.7166382 0.2481824 0.7832277
## 0.7728127 0.7154329 0.2482064 0.7824921
## 0.7661875 0.7090822 0.2462324 0.7775460
## 0.7628424 0.7092720 0.2456270 0.7773983
## 0.7621036 0.7061274 0.2450524 0.7751545
## 0.7620600 0.7049964 0.2448576 0.7743215
## 0.7617451 0.7016916 0.2442728 0.7719876
## 0.7659850 0.7075941 0.2458498 0.7764632
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
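The "Summary of sample sizes" line in the output above follows from the resampling scheme. The sketch below illustrates how 10-fold cross-validation repeated 3 times splits 566 training rows. It is a simplification under stated assumptions: it assigns folds purely at random, whereas caret is understood to stratify folds by outcome class, so the exact fold sizes in a real run can differ slightly. The names `n_samples`, `n_folds`, `n_repeats` and `train_sizes` are introduced here for illustration only.

```r
set.seed(123)
n_samples <- 566   # training rows reported in the caret output
n_folds   <- 10
n_repeats <- 3

# For each repeat, shuffle the fold labels 1..10 across the rows, then record
# how many rows remain for training when each fold is held out in turn
train_sizes <- unlist(lapply(seq_len(n_repeats), function(r) {
  fold_id <- sample(rep(seq_len(n_folds), length.out = n_samples))
  n_samples - as.integer(table(fold_id))
}))

length(train_sizes)   # number of resamples: folds x repeats
table(train_sizes)    # how often each training-set size occurs
```

Each candidate value of k is therefore scored on 30 held-out folds, and the table of tuning results reports the average of each metric over those folds.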
plot(model_knn)
pred_knn <- predict(model_knn, data_test)
cm_knn<-confusionMatrix(pred_knn, data_test$Happy.Level)
cm_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Mid Low
## High 33 7 0
## Mid 5 56 8
## Low 0 6 24
##
## Overall Statistics
##
## Accuracy : 0.8129
## 95% CI : (0.7381, 0.874)
## No Information Rate : 0.4964
## P-Value [Acc > NIR] : 0.0000000000000105
##
## Kappa : 0.7008
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Mid Class: Low
## Sensitivity 0.8684 0.8116 0.7500
## Specificity 0.9307 0.8143 0.9439
## Pos Pred Value 0.8250 0.8116 0.8000
## Neg Pred Value 0.9495 0.8143 0.9266
## Prevalence 0.2734 0.4964 0.2302
## Detection Rate 0.2374 0.4029 0.1727
## Detection Prevalence 0.2878 0.4964 0.2158
## Balanced Accuracy 0.8996 0.8129 0.8470
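The headline statistics above can be recomputed directly from the confusion matrix, which helps when checking how accuracy, kappa and the per-class values relate to the raw counts. The sketch below copies the KNN counts from the output and uses base R only. The variable names are introduced here for illustration.

```r
# Confusion matrix copied from the KNN output (rows = prediction, columns = reference)
cm <- matrix(c(33,  7,  0,
                5, 56,  8,
                0,  6, 24),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("High", "Mid", "Low"),
                             Reference  = c("High", "Mid", "Low")))

n <- sum(cm)   # number of test cases

# Accuracy: share of test cases on the diagonal
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Per-class sensitivity (recall) and positive predictive value (precision)
sensitivity    <- diag(cm) / colSums(cm)
pos_pred_value <- diag(cm) / rowSums(cm)

round(c(Accuracy = accuracy, Kappa = cohen_kappa), 4)
round(rbind(Sensitivity = sensitivity, PosPredValue = pos_pred_value), 4)
```

The same calculation applies to the Naive Bayes confusion matrix by swapping in its counts.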
# Create object of importance of our variables
knn_importance <- varImp(model_knn)
# Plot variable importance (caret's ggplot method for varImp objects draws a bar chart)
ggplot(knn_importance) + # Importance per predictor
labs(title = "Variable importance: K-Nearest Neighbours") + # Title
theme_light() # Theme
model_nb <- train(Happy.Level~.,
data_train,
method="naive_bayes",
preProcess = c("center","scale"),
metric='Accuracy',
trControl=tc)
model_nb
## Naive Bayes
##
## 566 samples
## 6 predictor
## 3 classes: 'High', 'Mid', 'Low'
##
## Pre-processing: centered (6), scaled (6)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 510, 510, 508, 509, 510, 509, ...
## Resampling results across tuning parameters:
##
## usekernel logLoss AUC prAUC Accuracy Kappa Mean_F1
## FALSE 0.7237029 0.8965975 0.7632952 0.7597568 0.6156457 0.7583292
## TRUE 0.7014796 0.8940879 0.7646527 0.7662536 0.6232512 0.7631400
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
## 0.7572299 0.8664677 0.7686608 0.8689490
## 0.7596220 0.8684121 0.7801272 0.8726667
## Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.7686608 0.7572299 0.2532523 0.8118488
## 0.7801272 0.7596220 0.2554179 0.8140171
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 1.
plot(model_nb)
pred_nb <- predict(model_nb, data_test)
cm_nb<-confusionMatrix(pred_nb, data_test$Happy.Level)
cm_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Mid Low
## High 31 4 0
## Mid 7 54 8
## Low 0 11 24
##
## Overall Statistics
##
## Accuracy : 0.7842
## 95% CI : (0.7065, 0.8494)
## No Information Rate : 0.4964
## P-Value [Acc > NIR] : 0.000000000002766
##
## Kappa : 0.6557
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Mid Class: Low
## Sensitivity 0.8158 0.7826 0.7500
## Specificity 0.9604 0.7857 0.8972
## Pos Pred Value 0.8857 0.7826 0.6857
## Neg Pred Value 0.9327 0.7857 0.9231
## Prevalence 0.2734 0.4964 0.2302
## Detection Rate 0.2230 0.3885 0.1727
## Detection Prevalence 0.2518 0.4964 0.2518
## Balanced Accuracy 0.8881 0.7842 0.8236
# Create object of importance of our variables
nb_importance <- varImp(model_nb)
# Plot variable importance (caret's ggplot method for varImp objects draws a bar chart)
ggplot(nb_importance) + # Importance per predictor
labs(title = "Variable importance: Naive Bayes model") + # Title
theme_light() # Theme
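# Compare the cross-validation results of both classifiers
model_list <- list(KNN = model_knn, NB = model_nb)
# Named model_resamples so that the object does not mask caret's resamples() function
model_resamples <- resamples(model_list)
bwplot(model_resamples, metric = "AUC")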
data.frame(
K_Nearest_Neighbours= cm_knn$overall[1],
Naive_Bayes= cm_nb$overall[1]
)
## K_Nearest_Neighbours Naive_Bayes
## Accuracy 0.8129496 0.7841727
Although the ranking order within the top 10 changes from year to year, the same countries stayed on the top 10 happiest countries list from 2015 to 2019. The countries are Finland, Denmark, Norway, Iceland, Netherlands, Switzerland, Sweden, New Zealand, Canada and Australia.
The top 10 progressive countries are Benin, Togo, Ivory Coast, Burundi, Burkina Faso, Guinea, Gabon, Cambodia, Honduras and Congo (Brazzaville). Happiness scores in these countries did not decline over the period, and people there are getting happier every year.
The factor that most affects happiness is the economy (GDP per capita), probably because income lets people meet their basic needs. The other main factor is health (life expectancy).
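Random Forest Regression gave the best results of the four regression models (R2 = 0.969, RMSE = 0.210 on the test set). However, the reported forest was fitted on the full dataset, which includes the test rows, so these figures are likely optimistic. The Support Vector Regression and Multiple Linear Regression models also predicted well. The Decision Tree was the weakest model for predicting happiness scores (R2 = 0.682, RMSE = 0.636).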
The K-Nearest Neighbours classifier performed better than the Naive Bayes classifier in this project: its test-set accuracy was 0.813, compared with 0.784 for Naive Bayes. The 95% confidence intervals of the two accuracies overlap, so the difference is modest.
In conclusion, this study has shown that happiness depends on a wide range of influences. Regular, large-scale collection of happiness data can therefore inform policy-making and help identify what “deliverables” should be created to foster well-being. The differences between countries also suggest that moving to a happier country could plausibly make a person happier and, by the same token, that moving to a less happy country could reduce their happiness. Emotions may be contagious, even at a national level.