Group Member

1. Introduction

What is happiness? Happiness is an emotional state characterized by feelings of joy, satisfaction, contentment, and fulfillment. Perceptions of happiness differ from one person to the next. In recent decades, interest in country-level happiness analysis has increased, mainly because happiness can serve as a useful index to guide a country's policy and measure its effectiveness. In this project, we focus on predicting countries' happiness levels using machine learning algorithms.

1.1 Problem

  1. What are the factors affecting the happiness of a country?
  2. How effective is the World Happiness Indicator?
  3. How can the happiness of a country's citizens be predicted?

1.2 Objective

There are 3 objectives we would like to achieve through this project. The objectives are as follows:

2. Data Preprocessing

2.1 Data Obtain

The data source for our project is the World Happiness Report, an annual report published by the Sustainable Development Solutions Network of the United Nations. The report ranks countries on a happiness ladder based on survey data from the Gallup World Poll. The data set we used ranges from 2005 to 2020. It covers 166 countries and contains 11 variables. The variables are as follows:

  1. Country name = Name of the country being analyzed
  2. year = The year the survey was conducted
  3. Life Ladder = The happiness score received by a country
  4. Log GDP per capita = The logarithm of GDP per capita
  5. Social support = A score for a country's social support
  6. Healthy life expectancy at birth = Healthy life expectancy at birth, in years
  7. Freedom to make life choices = Freedom index
  8. Generosity = A metric of how generous the country is
  9. Perceptions of corruption = How widespread the citizens of the country perceive corruption to be
  10. Positive affect = The average of individual yes or no answers to three questions about emotions experienced on the previous day: laughter, enjoyment, and learning or doing something interesting
  11. Negative affect = The average of individual yes or no answers to three questions about emotions experienced on the previous day: worry, sadness, and anger

2.2 Data Scrub

2.2.1 Import Library

library(dplyr)
library(naniar)
library(ggplot2)
library(visdat)

2.2.2 Load Data

Load the data using read.csv and check the data frame structure. Note that most of the variables are numeric; the two exceptions are Country.name (character) and year (integer).

myurl <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTNujIfWRdX0iF0aN25fB-2alBJc7mL40DDuNqL_92Ch--CnnbOFTHgShiboxRT90cgdyYYIhk4jQDs/pub?output=csv"
happiness_raw <- read.csv(url(myurl))
## Display the structure of data frame
str(happiness_raw)
## 'data.frame':    1949 obs. of  11 variables:
##  $ Country.name                    : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year                            : int  2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 ...
##  $ Life.Ladder                     : num  3.72 4.4 4.76 3.83 3.78 ...
##  $ Log.GDP.per.capita              : num  7.37 7.54 7.65 7.62 7.71 ...
##  $ Social.support                  : num  0.451 0.552 0.539 0.521 0.521 0.484 0.526 0.529 0.559 0.491 ...
##  $ Healthy.life.expectancy.at.birth: num  50.8 51.2 51.6 51.9 52.2 ...
##  $ Freedom.to.make.life.choices    : num  0.718 0.679 0.6 0.496 0.531 0.578 0.509 0.389 0.523 0.427 ...
##  $ Generosity                      : num  0.168 0.19 0.121 0.162 0.236 0.061 0.104 0.08 0.042 -0.121 ...
##  $ Perceptions.of.corruption       : num  0.882 0.85 0.707 0.731 0.776 0.823 0.871 0.881 0.793 0.954 ...
##  $ Positive.affect                 : num  0.518 0.584 0.618 0.611 0.71 0.621 0.532 0.554 0.565 0.496 ...
##  $ Negative.affect                 : num  0.258 0.237 0.275 0.267 0.268 0.273 0.375 0.339 0.348 0.371 ...
## Display the summary of the dataset
summary(happiness_raw)
##  Country.name            year       Life.Ladder    Log.GDP.per.capita
##  Length:1949        Min.   :2005   Min.   :2.375   Min.   : 6.635    
##  Class :character   1st Qu.:2010   1st Qu.:4.640   1st Qu.: 8.464    
##  Mode  :character   Median :2013   Median :5.386   Median : 9.460    
##                     Mean   :2013   Mean   :5.467   Mean   : 9.368    
##                     3rd Qu.:2017   3rd Qu.:6.283   3rd Qu.:10.353    
##                     Max.   :2020   Max.   :8.019   Max.   :11.648    
##                                                    NA's   :36        
##  Social.support   Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
##  Min.   :0.2900   Min.   :32.30                    Min.   :0.2580              
##  1st Qu.:0.7498   1st Qu.:58.69                    1st Qu.:0.6470              
##  Median :0.8355   Median :65.20                    Median :0.7630              
##  Mean   :0.8126   Mean   :63.36                    Mean   :0.7426              
##  3rd Qu.:0.9050   3rd Qu.:68.59                    3rd Qu.:0.8560              
##  Max.   :0.9870   Max.   :77.10                    Max.   :0.9850              
##  NA's   :13       NA's   :55                       NA's   :32                  
##    Generosity      Perceptions.of.corruption Positive.affect  Negative.affect 
##  Min.   :-0.3350   Min.   :0.0350            Min.   :0.3220   Min.   :0.0830  
##  1st Qu.:-0.1130   1st Qu.:0.6900            1st Qu.:0.6255   1st Qu.:0.2060  
##  Median :-0.0255   Median :0.8020            Median :0.7220   Median :0.2580  
##  Mean   : 0.0001   Mean   :0.7471            Mean   :0.7100   Mean   :0.2685  
##  3rd Qu.: 0.0910   3rd Qu.:0.8720            3rd Qu.:0.7990   3rd Qu.:0.3200  
##  Max.   : 0.6980   Max.   :0.9830            Max.   :0.9440   Max.   :0.7050  
##  NA's   :89        NA's   :110               NA's   :22       NA's   :16
## Check the first 6 rows of the dataset
head(happiness_raw)
##   Country.name year Life.Ladder Log.GDP.per.capita Social.support
## 1  Afghanistan 2008       3.724              7.370          0.451
## 2  Afghanistan 2009       4.402              7.540          0.552
## 3  Afghanistan 2010       4.758              7.647          0.539
## 4  Afghanistan 2011       3.832              7.620          0.521
## 5  Afghanistan 2012       3.783              7.705          0.521
## 6  Afghanistan 2013       3.572              7.725          0.484
##   Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
## 1                            50.80                        0.718      0.168
## 2                            51.20                        0.679      0.190
## 3                            51.60                        0.600      0.121
## 4                            51.92                        0.496      0.162
## 5                            52.24                        0.531      0.236
## 6                            52.56                        0.578      0.061
##   Perceptions.of.corruption Positive.affect Negative.affect
## 1                     0.882           0.518           0.258
## 2                     0.850           0.584           0.237
## 3                     0.707           0.618           0.275
## 4                     0.731           0.611           0.267
## 5                     0.776           0.710           0.268
## 6                     0.823           0.621           0.273
## Display the columns of the dataset
colnames(happiness_raw)
##  [1] "Country.name"                     "year"                            
##  [3] "Life.Ladder"                      "Log.GDP.per.capita"              
##  [5] "Social.support"                   "Healthy.life.expectancy.at.birth"
##  [7] "Freedom.to.make.life.choices"     "Generosity"                      
##  [9] "Perceptions.of.corruption"        "Positive.affect"                 
## [11] "Negative.affect"
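
The column types and the coverage of the dataset can also be checked programmatically instead of being read off the str() output. The following is a minimal sketch on a small made-up data frame; the name toy and its values are illustrative assumptions and are not taken from the report. The same calls should apply to happiness_raw.

```r
# Hypothetical toy data frame mimicking the layout of happiness_raw
toy <- data.frame(
  Country.name   = c("A", "A", "B", "B", "C"),
  year           = c(2005L, 2006L, 2005L, 2020L, 2019L),
  Life.Ladder    = c(3.7, 4.4, 7.2, 7.5, 5.1),
  Social.support = c(0.45, 0.55, 0.95, NA, 0.80),
  stringsAsFactors = FALSE
)

# Count how many columns there are of each data type
type_counts <- table(sapply(toy, class))
print(type_counts)

# Number of distinct countries and the range of years covered
n_countries <- length(unique(toy$Country.name))
year_range  <- range(toy$year)
print(n_countries)
print(year_range)
```

On the real data, length(unique(happiness_raw$Country.name)) and range(happiness_raw$year) would be the calls to compare against the 166 countries and the 2005 to 2020 period stated in Section 2.1.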

2.2.3 Data Cleaning

First, we drop the last 2 columns, Positive affect and Negative affect, which are not needed in our modelling. The dataset consists of 1949 rows, some of which contain missing values. Next, we remove the 237 rows (12.2% of the data) that contain missing values.

## Remove the columns that are not relevant
happiness_raw <- subset(happiness_raw, select = -c(Positive.affect, Negative.affect))
## Check the total number of rows of dataset
nrow(happiness_raw)
## [1] 1949
## Check whether the dataset has any missing values
any(is.na(happiness_raw))
## [1] TRUE
## Visualize the missing data
data1 <- c(sum(is.na(happiness_raw)),sum(!is.na(happiness_raw)))
mylabel <- c("Missing","Present")
pct <- prop.table(data1)*100
mylabel2 <- paste(round(pct, digits = 1),"%")
mylabel3 <- paste(mylabel, mylabel2)
data2 <- setNames(data1,mylabel3)
pie(data2, main = "Figure 1 - Missingness of World Happiness Report")

## Check whether the dataset has any duplicated rows
sum(duplicated(happiness_raw))
## [1] 0
## Remove the rows with missing values from the dataset
happiness_raw <- na.omit(happiness_raw)
## Recheck the total number of rows of dataset
nrow(happiness_raw)
## [1] 1712
## Verify that no missing values remain
any(is.na(happiness_raw))
## [1] FALSE
## A total of 237 rows were omitted, which is 12.2% of the original data
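
Note that the pie chart in Figure 1 counts missing cells, while na.omit removes whole rows, so the two percentages are not the same quantity. The sketch below illustrates the difference on a small made-up data frame (toy and its values are assumptions for illustration only) and reproduces the 237 rows and 12.2% arithmetic from the row counts reported above.

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  Life.Ladder        = c(3.7, 4.4, 7.2, 7.5, 5.1),
  Log.GDP.per.capita = c(7.4, NA, 10.9, 11.0, NA),
  Generosity         = c(0.17, 0.19, NA, 0.05, NA)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Share of missing cells (the quantity shown in the pie chart)
pct_missing_cells <- mean(is.na(toy)) * 100

# Share of rows with at least one missing value (the rows na.omit removes)
pct_incomplete_rows <- mean(!complete.cases(toy)) * 100
print(c(cells = pct_missing_cells, rows = pct_incomplete_rows))

# Row counts reported in this section: 1949 before and 1712 after na.omit
rows_before  <- 1949
rows_after   <- 1712
rows_dropped <- rows_before - rows_after
pct_dropped  <- round(rows_dropped / rows_before * 100, 1)
print(c(rows_dropped = rows_dropped, pct_dropped = pct_dropped))
```

Applying colSums(is.na(...)) to the real data before na.omit would show which variables contribute most of the dropped rows.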

2.2.4 Data Categorization

We added one more column called “Rank” to rank the happiness level using Life Ladder in the dataset. A Life Ladder value of more than 7 is considered High, a value of less than 3.5 is treated as Low, and all other values are categorized as Medium.

## Categorize the happiness level by adding a new column, Rank
happiness_raw <- mutate(happiness_raw, Rank = if_else(Life.Ladder > 7, "High", if_else(Life.Ladder < 3.5, "Low", "Medium")))

summary(happiness_raw)
##  Country.name            year       Life.Ladder    Log.GDP.per.capita
##  Length:1712        Min.   :2005   Min.   :2.375   Min.   : 6.635    
##  Class :character   1st Qu.:2010   1st Qu.:4.595   1st Qu.: 8.394    
##  Mode  :character   Median :2013   Median :5.359   Median : 9.456    
##                     Mean   :2013   Mean   :5.445   Mean   : 9.320    
##                     3rd Qu.:2017   3rd Qu.:6.252   3rd Qu.:10.268    
##                     Max.   :2020   Max.   :7.971   Max.   :11.648    
##  Social.support   Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
##  Min.   :0.2900   Min.   :32.30                    Min.   :0.2580              
##  1st Qu.:0.7410   1st Qu.:58.17                    1st Qu.:0.6440              
##  Median :0.8345   Median :65.10                    Median :0.7575              
##  Mean   :0.8102   Mean   :63.22                    Mean   :0.7395              
##  3rd Qu.:0.9080   3rd Qu.:68.68                    3rd Qu.:0.8520              
##  Max.   :0.9870   Max.   :77.10                    Max.   :0.9850              
##    Generosity         Perceptions.of.corruption     Rank          
##  Min.   :-0.3350000   Min.   :0.0350            Length:1712       
##  1st Qu.:-0.1112500   1st Qu.:0.6970            Class :character  
##  Median :-0.0255000   Median :0.8060            Mode  :character  
##  Mean   :-0.0007173   Mean   :0.7511                              
##  3rd Qu.: 0.0890000   3rd Qu.:0.8750                              
##  Max.   : 0.6890000   Max.   :0.9830
head(happiness_raw)
##   Country.name year Life.Ladder Log.GDP.per.capita Social.support
## 1  Afghanistan 2008       3.724              7.370          0.451
## 2  Afghanistan 2009       4.402              7.540          0.552
## 3  Afghanistan 2010       4.758              7.647          0.539
## 4  Afghanistan 2011       3.832              7.620          0.521
## 5  Afghanistan 2012       3.783              7.705          0.521
## 6  Afghanistan 2013       3.572              7.725          0.484
##   Healthy.life.expectancy.at.birth Freedom.to.make.life.choices Generosity
## 1                            50.80                        0.718      0.168
## 2                            51.20                        0.679      0.190
## 3                            51.60                        0.600      0.121
## 4                            51.92                        0.496      0.162
## 5                            52.24                        0.531      0.236
## 6                            52.56                        0.578      0.061
##   Perceptions.of.corruption   Rank
## 1                     0.882 Medium
## 2                     0.850 Medium
## 3                     0.707 Medium
## 4                     0.731 Medium
## 5                     0.776 Medium
## 6                     0.823 Medium
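
The categorization rule can be isolated in a small helper to make the boundary behaviour explicit. This is a minimal base R sketch, not the code used above: the function name rank_happiness and the example scores are hypothetical, while the thresholds 3.5 and 7 come from the mutate() call in this section. Scores exactly equal to 3.5 or 7 fall into Medium because both comparisons are strict.

```r
# Thresholds taken from the mutate() call in this section
rank_happiness <- function(life_ladder, low = 3.5, high = 7) {
  ifelse(life_ladder > high, "High",
         ifelse(life_ladder < low, "Low", "Medium"))
}

# Illustrative scores, including both boundary values
scores <- c(2.375, 3.5, 5.4, 7, 7.971)
ranks  <- rank_happiness(scores)
print(ranks)

# Store as an ordered factor so that Low < Medium < High
ranks_factor <- factor(ranks, levels = c("Low", "Medium", "High"), ordered = TRUE)
print(table(ranks_factor))
```

Storing Rank as an ordered factor rather than a character column is one possible design choice; it keeps the levels in a meaningful order in tables and plots.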