What is happiness? Happiness is an emotional state characterized by feelings of joy, satisfaction, contentment, and fulfillment. Perceptions of happiness differ from one person to the next. In recent decades, interest in country-level happiness analysis has increased, mainly because happiness can serve as a useful index to guide a country's policy and measure its effectiveness. In this project, we focus on predicting countries' happiness levels with machine learning algorithms.
There are 3 objectives we would like to achieve through this project. The objectives are as follows:
The data source for our project is the World Happiness Report, an annual report published by the Sustainable Development Solutions Network of the United Nations. The report ranks countries on the happiness ladder based on survey data from the Gallup World Poll. The data set we used ranges from 2005 to 2020. It covers 166 countries and has 11 variables. The variables are as follows:
library(dplyr)
library(naniar)
library(ggplot2)
library(visdat)
Load the data using read.csv and check the data frame structure. Note that most of the variables are numeric; the only two exceptions are Country.name, which is character, and year, which is integer.
myurl <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTNujIfWRdX0iF0aN25fB-2alBJc7mL40DDuNqL_92Ch--CnnbOFTHgShiboxRT90cgdyYYIhk4jQDs/pub?output=csv"
happiness_raw <- read.csv(url(myurl))
## Display the structure of data frame
str(happiness_raw)
## 'data.frame': 1949 obs. of 11 variables:
## $ Country.name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 ...
## $ Life.Ladder : num 3.72 4.4 4.76 3.83 3.78 ...
## $ Log.GDP.per.capita : num 7.37 7.54 7.65 7.62 7.71 ...
## $ Social.support : num 0.451 0.552 0.539 0.521 0.521 0.484 0.526 0.529 0.559 0.491 ...
## $ Healthy.life.expectancy.at.birth: num 50.8 51.2 51.6 51.9 52.2 ...
## $ Freedom.to.make.life.choices : num 0.718 0.679 0.6 0.496 0.531 0.578 0.509 0.389 0.523 0.427 ...
## $ Generosity : num 0.168 0.19 0.121 0.162 0.236 0.061 0.104 0.08 0.042 -0.121 ...
## $ Perceptions.of.corruption : num 0.882 0.85 0.707 0.731 0.776 0.823 0.871 0.881 0.793 0.954 ...
## $ Positive.affect : num 0.518 0.584 0.618 0.611 0.71 0.621 0.532 0.554 0.565 0.496 ...
## $ Negative.affect : num 0.258 0.237 0.275 0.267 0.268 0.273 0.375 0.339 0.348 0.371 ...
## Display the summary of the dataset
summary(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita
## Length:1949 Min. :2005 Min. :2.375 Min. : 6.635
## Class :character 1st Qu.:2010 1st Qu.:4.640 1st Qu.: 8.464
## Mode :character Median :2013 Median :5.386 Median : 9.460
## Mean :2013 Mean :5.467 Mean : 9.368
## 3rd Qu.:2017 3rd Qu.:6.283 3rd Qu.:10.353
## Max. :2020 Max. :8.019 Max. :11.648
## NA's :36
## Social.support Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
## Min. :0.2900 Min. :32.30 Min. :0.2580
## 1st Qu.:0.7498 1st Qu.:58.69 1st Qu.:0.6470
## Median :0.8355 Median :65.20 Median :0.7630
## Mean :0.8126 Mean :63.36 Mean :0.7426
## 3rd Qu.:0.9050 3rd Qu.:68.59 3rd Qu.:0.8560
## Max. :0.9870 Max. :77.10 Max. :0.9850
## NA's :13 NA's :55 NA's :32
## Generosity Perceptions.of.corruption Positive.affect Negative.affect
## Min. :-0.3350 Min. :0.0350 Min. :0.3220 Min. :0.0830
## 1st Qu.:-0.1130 1st Qu.:0.6900 1st Qu.:0.6255 1st Qu.:0.2060
## Median :-0.0255 Median :0.8020 Median :0.7220 Median :0.2580
## Mean : 0.0001 Mean :0.7471 Mean :0.7100 Mean :0.2685
## 3rd Qu.: 0.0910 3rd Qu.:0.8720 3rd Qu.:0.7990 3rd Qu.:0.3200
## Max. : 0.6980 Max. :0.9830 Max. :0.9440 Max. :0.7050
## NA's :89 NA's :110 NA's :22 NA's :16
## Check the first 6 rows of the dataset
head(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita Social.support
## 1 Afghanistan 2008 3.724 7.370 0.451
## 2 Afghanistan 2009 4.402 7.540 0.552
## 3 Afghanistan 2010 4.758 7.647 0.539
## 4 Afghanistan 2011 3.832 7.620 0.521
## 5 Afghanistan 2012 3.783 7.705 0.521
## 6 Afghanistan 2013 3.572 7.725 0.484
## Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
## 1 50.80 0.718 0.168
## 2 51.20 0.679 0.190
## 3 51.60 0.600 0.121
## 4 51.92 0.496 0.162
## 5 52.24 0.531 0.236
## 6 52.56 0.578 0.061
## Perceptions.of.corruption Positive.affect Negative.affect
## 1 0.882 0.518 0.258
## 2 0.850 0.584 0.237
## 3 0.707 0.618 0.275
## 4 0.731 0.611 0.267
## 5 0.776 0.710 0.268
## 6 0.823 0.621 0.273
## Display the columns of the dataset
colnames(happiness_raw)
## [1] "Country.name" "year"
## [3] "Life.Ladder" "Log.GDP.per.capita"
## [5] "Social.support" "Healthy.life.expectancy.at.birth"
## [7] "Freedom.to.make.life.choices" "Generosity"
## [9] "Perceptions.of.corruption" "Positive.affect"
## [11] "Negative.affect"
First, we drop the last 2 columns, Positive Affect and Negative Affect, which are not needed in our modelling. The dataset consists of 1949 rows, some of which contain missing values. Next, we remove the 237 rows (12.2% of the total) that contain missing values.
## Remove the columns that are not relevant
happiness_raw <- subset(happiness_raw, select = -c(Positive.affect, Negative.affect))
## Check the total number of rows of dataset
nrow(happiness_raw)
## [1] 1949
## Check whether the dataset has any missing values
any(is.na(happiness_raw))
## [1] TRUE
## Visualize the missing data
data1 <- c(sum(is.na(happiness_raw)),sum(!is.na(happiness_raw)))
mylabel <- c("Missing","Present")
pct <- prop.table(data1)*100
mylabel2 <- paste(round(pct, digits = 1),"%")
mylabel3 <- paste(mylabel, mylabel2)
data2 <- setNames(data1,mylabel3)
pie(data2, main = "Figure 1 - Missingness of World Happiness Report")
## Check whether the dataset has any duplicated rows
sum(duplicated(happiness_raw))
## [1] 0
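Note that `duplicated()` applied to a whole data frame only flags rows that match in every column. In a country-by-year panel it can also be worth checking that each country and year pair appears only once. The following is a minimal sketch of that idea on a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not part of the report's data):

```r
# Toy panel data; the values are made up for illustration only.
toy <- data.frame(
  Country.name = c("A", "A", "B", "B"),
  year         = c(2008, 2008, 2008, 2009),
  Life.Ladder  = c(3.7, 3.8, 5.1, 5.2),
  stringsAsFactors = FALSE
)

# Full-row check: no two rows are identical in every column.
full_row_dups <- sum(duplicated(toy))

# Key-only check: rows 1 and 2 share the same country and year.
key_dups <- sum(duplicated(toy[, c("Country.name", "year")]))

print(c(full_row = full_row_dups, key = key_dups))
```

The same key-only check could be run on the real data by selecting its Country.name and year columns.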
## Remove the rows with missing values from the dataset
happiness_raw <- na.omit(happiness_raw)
## Recheck the total number of rows of dataset
nrow(happiness_raw)
## [1] 1712
## Re-verify that no missing values remain
any(is.na(happiness_raw))
## [1] FALSE
## A total of 237 rows were omitted, which is 12.2% of the original 1949 rows
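To see where the missing values sit and how the share of dropped rows is derived, the counts can be broken down per column and per row. The sketch below uses a small made-up data frame (the `toy` object and its values are assumptions for illustration), then repeats the arithmetic with the row counts reported above, 1949 before and 1712 after `na.omit()`:

```r
# Toy data frame with a few missing cells; values are illustrative only.
toy <- data.frame(
  Life.Ladder    = c(3.7, NA, 5.1, 6.0),
  Generosity     = c(0.17, 0.19, NA, 0.06),
  Social.support = c(0.45, NA, 0.54, 0.52)
)

# Number of missing cells in each column.
na_per_column <- colSums(is.na(toy))

# Rows that na.omit() would drop: those with at least one NA.
rows_dropped <- sum(!complete.cases(toy))
pct_dropped  <- round(100 * rows_dropped / nrow(toy), 1)

print(na_per_column)
print(c(rows_dropped = rows_dropped, pct_dropped = pct_dropped))

# Same arithmetic with the row counts reported in this section.
pct_report <- round(100 * (1949 - 1712) / 1949, 1)
print(pct_report)
```

The share of dropped rows is computed over rows, whereas the pie chart in Figure 1 counts individual missing cells, so the two percentages are not expected to match.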
We add one more column called “Rank” to categorize the happiness level using the Life Ladder value in the dataset. A Life Ladder value greater than 7 is rated High, a value less than 3.5 is rated Low, and all other values are rated Medium.
## Categorize the happiness level by adding a new column, Rank
happiness_raw <- mutate(happiness_raw, Rank = if_else(Life.Ladder > 7, "High", if_else(Life.Ladder < 3.5, "Low", "Medium")))
summary(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita
## Length:1712 Min. :2005 Min. :2.375 Min. : 6.635
## Class :character 1st Qu.:2010 1st Qu.:4.595 1st Qu.: 8.394
## Mode :character Median :2013 Median :5.359 Median : 9.456
## Mean :2013 Mean :5.445 Mean : 9.320
## 3rd Qu.:2017 3rd Qu.:6.252 3rd Qu.:10.268
## Max. :2020 Max. :7.971 Max. :11.648
## Social.support Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
## Min. :0.2900 Min. :32.30 Min. :0.2580
## 1st Qu.:0.7410 1st Qu.:58.17 1st Qu.:0.6440
## Median :0.8345 Median :65.10 Median :0.7575
## Mean :0.8102 Mean :63.22 Mean :0.7395
## 3rd Qu.:0.9080 3rd Qu.:68.68 3rd Qu.:0.8520
## Max. :0.9870 Max. :77.10 Max. :0.9850
## Generosity Perceptions.of.corruption Rank
## Min. :-0.3350000 Min. :0.0350 Length:1712
## 1st Qu.:-0.1112500 1st Qu.:0.6970 Class :character
## Median :-0.0255000 Median :0.8060 Mode :character
## Mean :-0.0007173 Mean :0.7511
## 3rd Qu.: 0.0890000 3rd Qu.:0.8750
## Max. : 0.6890000 Max. :0.9830
head(happiness_raw)
## Country.name year Life.Ladder Log.GDP.per.capita Social.support
## 1 Afghanistan 2008 3.724 7.370 0.451
## 2 Afghanistan 2009 4.402 7.540 0.552
## 3 Afghanistan 2010 4.758 7.647 0.539
## 4 Afghanistan 2011 3.832 7.620 0.521
## 5 Afghanistan 2012 3.783 7.705 0.521
## 6 Afghanistan 2013 3.572 7.725 0.484
## Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
## 1 50.80 0.718 0.168
## 2 51.20 0.679 0.190
## 3 51.60 0.600 0.121
## 4 51.92 0.496 0.162
## 5 52.24 0.531 0.236
## 6 52.56 0.578 0.061
## Perceptions.of.corruption Rank
## 1 0.882 Medium
## 2 0.850 Medium
## 3 0.707 Medium
## 4 0.731 Medium
## 5 0.776 Medium
## 6 0.823 Medium
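Because Rank is stored as a character column, `summary()` only reports its length. As a hedged sketch, the same thresholds can be wrapped in a helper that returns an ordered factor, so that the categories sort as Low, Medium, High and can be counted with `table()`. The helper name `rank_happiness` and the `ladder` vector are assumptions introduced here for illustration; they are not part of the original analysis:

```r
# Same thresholds as the Rank column: > 7 is High, < 3.5 is Low, else Medium.
rank_happiness <- function(ladder) {
  labels <- ifelse(ladder > 7, "High", ifelse(ladder < 3.5, "Low", "Medium"))
  factor(labels, levels = c("Low", "Medium", "High"), ordered = TRUE)
}

# Illustrative values, including both boundaries (3.5 and 7),
# which fall in Medium because the comparisons are strict.
ladder <- c(2.375, 3.5, 3.724, 7, 7.971)
ranks <- rank_happiness(ladder)

# Count how many values fall in each category.
print(table(ranks))
```

Applied to the real data, the helper would be called on the Life.Ladder column in place of the nested `if_else()` call.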