1 Introduction

The happiness level of a country is shaped by several factors. In this analysis I examine the determinants of happiness using a dataset from kaggle.com. The dataset contains the happiness score of each country along with several explanatory variables such as GDP per capita, social support, healthy life expectancy, freedom of choice, generosity and perceptions of corruption. The happiness score is modelled with a linear regression.

2 Load Dataset

happy <- read.csv("2019.csv")

3 Exploratory Data Analysis

str(happy)

## 'data.frame':    156 obs. of  9 variables:
##  $ Overall.rank                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country.or.region           : Factor w/ 156 levels "Afghanistan",..: 44 37 106 58 99 134 133 100 24 7 ...
##  $ Score                       : num  7.77 7.6 7.55 7.49 7.49 ...
##  $ GDP.per.capita              : num  1.34 1.38 1.49 1.38 1.4 ...
##  $ Social.support              : num  1.59 1.57 1.58 1.62 1.52 ...
##  $ Healthy.life.expectancy     : num  0.986 0.996 1.028 1.026 0.999 ...
##  $ Freedom.to.make.life.choices: num  0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
##  $ Generosity                  : num  0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
##  $ Perceptions.of.corruption   : num  0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...

3.1 Import Library

library(dplyr)
library(tidyverse)
library(ggplot2)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)

3.2 Rename variable

# rename() changes the column names in place and keeps the column order
happy <- happy %>% 
  rename(rank = Overall.rank,
         country = Country.or.region,
         happiness_score = Score,
         log_GDP = GDP.per.capita,
         sos_support = Social.support,
         h_life_exp = Healthy.life.expectancy,
         freedom = Freedom.to.make.life.choices,
         generosity = Generosity,
         corruption = Perceptions.of.corruption)

str(happy)

## 'data.frame':    156 obs. of  9 variables:
##  $ rank           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country        : Factor w/ 156 levels "Afghanistan",..: 44 37 106 58 99 134 133 100 24 7 ...
##  $ happiness_score: num  7.77 7.6 7.55 7.49 7.49 ...
##  $ log_GDP        : num  1.34 1.38 1.49 1.38 1.4 ...
##  $ sos_support    : num  1.59 1.57 1.58 1.62 1.52 ...
##  $ h_life_exp     : num  0.986 0.996 1.028 1.026 0.999 ...
##  $ freedom        : num  0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
##  $ generosity     : num  0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
##  $ corruption     : num  0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...

Here is the description of each variable:
- rank: rank of the country by happiness score
- country: name of the country/region
- happiness_score: happiness score
- log_GDP: log of GDP per capita
- sos_support: social support
- h_life_exp: healthy life expectancy
- freedom: freedom to make life choices
- generosity: generosity rate
- corruption: perceptions of corruption

# check for missing values (NA) in each column
colSums(is.na(happy))

##            rank         country happiness_score         log_GDP     sos_support 
##               0               0               0               0               0 
##      h_life_exp         freedom      generosity      corruption 
##               0               0               0               0

table(is.na(happy))

## 
## FALSE 
##  1404

There are no missing values (NA) in the dataset.

# 'rank' and 'country' are not used as predictors, so drop them first
happy1 <- happy %>% 
  select(-c(rank, country))

# correlation among the variables
ggcorr(happy1, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Based on the correlation plot above, every variable is positively correlated with happiness_score.
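The same reading can be taken numerically instead of from the plot. The sketch below is only an illustration on simulated data: the data frame `toy` and its columns `x1`, `x2` and `score` are made up and are not the happiness dataset.

```r
# Simulated stand-in for happy1 (NOT the real data)
set.seed(42)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$score <- 5 + 0.8 * toy$x1 + 0.1 * toy$x2 + rnorm(n, sd = 0.5)

# Correlation of every predictor with the response, strongest first
cors <- cor(toy)[, "score"]
cors <- sort(cors[names(cors) != "score"], decreasing = TRUE)
print(round(cors, 2))
```

With the real data the equivalent call would be `cor(happy1)[, "happiness_score"]`.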

#check distribution of each variable
boxplot(happy1)

Based on the boxplot above, there are no outliers in the happiness_score variable, so we can continue to the next step.
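The points that a boxplot draws as outliers can also be listed directly. This is a minimal sketch using base R's `boxplot.stats()`; the helper name `find_outliers` and the example vectors are hypothetical.

```r
# Hypothetical helper: values a boxplot would draw as individual points,
# i.e. more than 1.5 * IQR beyond the hinges (the default rule of boxplot())
find_outliers <- function(x) boxplot.stats(x)$out

# Small made-up vectors to show the behaviour
find_outliers(c(1, 2, 3, 4, 100))   # 100 is flagged
find_outliers(c(1, 2, 3, 4, 5))     # nothing is flagged
```

On the real data, `sapply(happy1, function(col) length(find_outliers(col)))` would count the flagged values per variable.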

4 Modelling

4.1 Selecting Model

# model_all uses all predictor variables to predict `happiness_score`
model_all <- lm(happiness_score~.,happy1)

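# model_forward using stepwise selection
# note: starting from the full model without a scope leaves no terms to add,
# so this returns the same model as model_all
model_forward <- step(object = model_all, direction = "forward", trace = F)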

# model_backward using step wise
model_backward <- step(object = model_all, direction = "backward", trace = F)
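Forward selection is usually started from an intercept-only model, with the full model given as the upper bound of the search. The sketch below shows that variant on simulated data. The data frame `toy` and the object names are assumptions made for illustration; it is not the code that produced the results in this report.

```r
# Simulated data (NOT the happiness dataset): x3 is pure noise
set.seed(123)
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

full_model <- lm(y ~ ., data = toy)
null_model <- lm(y ~ 1, data = toy)

# Forward search from the full model: there is nothing left to add
forward_from_full <- step(full_model, direction = "forward", trace = 0)

# Forward search from the intercept-only model, allowed to grow up to the full model
forward_from_null <- step(
  null_model,
  scope = list(lower = ~ 1, upper = formula(full_model)),
  direction = "forward",
  trace = 0
)

formula(forward_from_full)
formula(forward_from_null)
```

Starting from `null_model` lets `step()` add predictors one at a time by AIC, so the result can differ from the full model.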

4.2 Comparing performance of model_all / model_backward / model_forward

4.2.1 Model_all

summary(model_all)

## 
## Call:
## lm(formula = happiness_score ~ ., data = happy1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75304 -0.35306  0.05703  0.36695  1.19059 
## 
## Coefficients:
##             Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)   1.7952     0.2111   8.505 0.0000000000000177 ***
## log_GDP       0.7754     0.2182   3.553           0.000510 ***
## sos_support   1.1242     0.2369   4.745 0.0000048338066976 ***
## h_life_exp    1.0781     0.3345   3.223           0.001560 ** 
## freedom       1.4548     0.3753   3.876           0.000159 ***
## generosity    0.4898     0.4977   0.984           0.326709    
## corruption    0.9723     0.5424   1.793           0.075053 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5335 on 149 degrees of freedom
## Multiple R-squared:  0.7792, Adjusted R-squared:  0.7703 
## F-statistic: 87.62 on 6 and 149 DF,  p-value: < 0.00000000000000022

# mean absolute error of model_all on the data it was fitted to
MAE(
  y_pred = model_all$fitted.values,
  y_true = happy1$happiness_score
)

## [1] 0.4139013

In model_all, generosity and corruption are not significant at the 5% level. The adjusted R-squared is 0.7703 and the mean absolute error (MAE) is 0.4139. In other words, model_all explains about 77% of the variation in happiness_score, and its fitted values miss the observed score by about 0.41 points on average.
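The error measure used here is the mean of the absolute differences between observed and fitted values. A base R sketch of that definition follows. The function name `mae` and the example vectors are hypothetical; the report itself uses `MAE()` from MLmetrics.

```r
# Mean absolute error: average size of the prediction errors, ignoring their sign
mae <- function(y_true, y_pred) mean(abs(y_true - y_pred))

# Made-up example: errors of 0.5, 0 and 1 average to 0.5
mae(y_true = c(1, 2, 3), y_pred = c(1.5, 2, 2))
```

Because the fitted values come from the same data used to fit the model, this is an in-sample error rather than an estimate of error on new countries.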

4.2.2 Model_forward

summary(model_forward)

## 
## Call:
## lm(formula = happiness_score ~ log_GDP + sos_support + h_life_exp + 
##     freedom + generosity + corruption, data = happy1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75304 -0.35306  0.05703  0.36695  1.19059 
## 
## Coefficients:
##             Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)   1.7952     0.2111   8.505 0.0000000000000177 ***
## log_GDP       0.7754     0.2182   3.553           0.000510 ***
## sos_support   1.1242     0.2369   4.745 0.0000048338066976 ***
## h_life_exp    1.0781     0.3345   3.223           0.001560 ** 
## freedom       1.4548     0.3753   3.876           0.000159 ***
## generosity    0.4898     0.4977   0.984           0.326709    
## corruption    0.9723     0.5424   1.793           0.075053 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5335 on 149 degrees of freedom
## Multiple R-squared:  0.7792, Adjusted R-squared:  0.7703 
## F-statistic: 87.62 on 6 and 149 DF,  p-value: < 0.00000000000000022

# predicting error model_forward
MAE(
  y_pred = model_forward$fitted.values,
  y_true = happy1$happiness_score
)

## [1] 0.4139013

model_forward is identical to model_all: forward selection was started from the full model, so there were no further variables to add. As in model_all, generosity and corruption are not significant at the 5% level, the adjusted R-squared is 0.7703 and the MAE is 0.4139.

4.2.3 Model_backward

summary(model_backward)

## 
## Call:
## lm(formula = happiness_score ~ log_GDP + sos_support + h_life_exp + 
##     freedom + corruption, data = happy1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82997 -0.35344  0.05803  0.35977  1.17522 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   1.8689     0.1973   9.471 < 0.0000000000000002 ***
## log_GDP       0.7455     0.2161   3.450             0.000728 ***
## sos_support   1.1180     0.2368   4.722           0.00000533 ***
## h_life_exp    1.0840     0.3344   3.241             0.001467 ** 
## freedom       1.5340     0.3666   4.185           0.00004844 ***
## corruption    1.1176     0.5218   2.142             0.033839 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5335 on 150 degrees of freedom
## Multiple R-squared:  0.7777, Adjusted R-squared:  0.7703 
## F-statistic:   105 on 5 and 150 DF,  p-value: < 0.00000000000000022

# mean absolute error of model_backward on the data it was fitted to
MAE(
  y_pred = model_backward$fitted.values,
  y_true = happy1$happiness_score
)

## [1] 0.4132855

model_backward drops generosity, and every remaining predictor is significant at the 5% level. The adjusted R-squared is 0.7703 and the MAE is 0.4133. The model explains about 77% of the variation in happiness_score, and its fitted values miss the observed score by about 0.41 points on average.

4.2.4 Conclusion of Model Selection

model_backward has the same adjusted R-squared as model_all and model_forward (0.7703), a marginally smaller MAE (0.4133 versus 0.4139), and one predictor fewer. Therefore we use model_backward for the next steps.
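The adjusted R-squared values can be checked from the numbers printed in the summaries. The sketch below uses the usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 156 observations and p predictors. The function name `adj_r2` is hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 156
adj_all      <- adj_r2(0.7792, n, p = 6)   # model_all / model_forward
adj_backward <- adj_r2(0.7777, n, p = 5)   # model_backward

round(c(model_all = adj_all, model_backward = adj_backward), 4)
```

Both values round to 0.7703: dropping generosity lowers R-squared slightly, and the smaller penalty for having one predictor fewer offsets that loss.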

5 Analysis

Based on the regression analysis, every predictor kept in model_backward is significant at the 5% level. Holding the other predictors constant, the coefficients read as follows:
- GDP: a 1-unit increase in log_GDP is associated with a 0.75 increase in happiness score. The higher the GDP, the higher the happiness score of the country.
- social support: a 1-unit increase in social support is associated with a 1.12 increase in happiness score. The more social support people have, the happier they tend to be.
- healthy life expectancy: a 1-unit increase in healthy life expectancy is associated with a 1.08 increase in happiness score. As healthy life expectancy rises, people tend to be happier.
- freedom: a 1-unit increase in freedom is associated with a 1.53 increase in happiness score. When people are free to make choices, they tend to be happier.
- corruption: a 1-unit increase in corruption is associated with a 1.12 increase in happiness score, so the coefficient for perceptions of corruption is positive. However, it has the largest p-value (0.034) among the predictors, so the evidence for its effect is the weakest.
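To make the "holding the other predictors constant" reading concrete, the sketch below plugs the reported model_backward coefficients into the regression equation. The country profile `base` is made up for illustration, and `predict_score` is a hypothetical helper.

```r
# Coefficients as reported in summary(model_backward)
coefs <- c(intercept = 1.8689, log_GDP = 0.7455, sos_support = 1.1180,
           h_life_exp = 1.0840, freedom = 1.5340, corruption = 1.1176)

# Hypothetical helper: intercept plus the sum of coefficient * value
predict_score <- function(profile) {
  coefs[["intercept"]] + sum(coefs[names(profile)] * profile)
}

# Made-up profile (NOT a real country)
base <- c(log_GDP = 1.0, sos_support = 1.2, h_life_exp = 0.8,
          freedom = 0.4, corruption = 0.1)

# Same profile with freedom raised by 0.1, everything else unchanged
more_freedom <- base
more_freedom["freedom"] <- base["freedom"] + 0.1

predict_score(base)
predict_score(more_freedom) - predict_score(base)   # 0.1 * 1.5340
```

The difference depends only on the freedom coefficient and the size of the change, not on the other values in the profile.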

6 Checking Assumption

6.1 Normality Residual

hist(model_backward$residuals, breaks = 20)

Using the Shapiro-Wilk test. Hypothesis:
- H0: Residuals are normally distributed
- H1: Residuals are not normally distributed

#normality test using shapiro test
shapiro.test(model_backward$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.98147, p-value = 0.03426

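The test result shows that the residuals of the model are not normally distributed (p-value < 0.05). Non-normal residuals do not bias the coefficient estimates themselves, but they make the t-tests, p-values and confidence intervals less reliable, especially in small samples. The model therefore needs to be improved, for example by transforming the response or the predictors (such as a log or Box-Cox transformation). Simply rescaling the predictors will not help, because a linear rescaling leaves the residuals unchanged.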
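One point worth checking before rescaling anything: centering and scaling the predictors changes the coefficients but not the fitted values, so the residuals and any normality test on them stay the same. The sketch below illustrates this on simulated data. The data frame `toy` and all object names are assumptions, not part of the original analysis.

```r
# Simulated data with deliberately skewed noise (NOT the happiness dataset)
set.seed(10)
n <- 100
toy <- data.frame(x1 = rexp(n), x2 = runif(n, 0, 50))
toy$y <- 3 + 1.5 * toy$x1 + 0.1 * toy$x2 + rexp(n)

fit_raw <- lm(y ~ x1 + x2, data = toy)

# Same model after standardising both predictors
toy_scaled <- toy
toy_scaled$x1 <- as.numeric(scale(toy$x1))
toy_scaled$x2 <- as.numeric(scale(toy$x2))
fit_scaled <- lm(y ~ x1 + x2, data = toy_scaled)

# The residuals are numerically identical, so their distribution is unchanged
same_residuals <- isTRUE(all.equal(residuals(fit_raw), residuals(fit_scaled)))
same_residuals
```

A nonlinear transformation such as `log(y)` does change the residuals, which is why it is a more promising candidate. A visual check with `qqnorm()` and `qqline()` on the residuals can complement the Shapiro-Wilk test.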

6.2 Heteroscedasticity Test

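The model should have homoscedasticity, that is, constant variance of the residuals.

# residuals against fitted values; abline() is a separate call that draws on the existing plot
plot(model_backward$fitted.values, model_backward$residuals)
abline(h = 0, col = "red")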

#heteroscedasticity test using BP test
bptest(model_backward)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 16.115, df = 5, p-value = 0.006523

Hypothesis:
- H0: The residual variance is constant (homoscedastic)
- H1: The residual variance is not constant (heteroscedastic)

Based on the test result, the residuals of the model are heteroscedastic (p-value < 0.05).
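The idea behind the studentized Breusch-Pagan statistic can be sketched in base R: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. The example runs on simulated data whose noise is heteroscedastic by construction. The data frame `toy` and the object names are assumptions.

```r
# Simulated data (NOT the happiness dataset): noise spread grows with x1
set.seed(1)
n <- 200
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 1 + 2 * toy$x1 + toy$x2 + rnorm(n, sd = 0.2 + toy$x1)

fit <- lm(y ~ x1 + x2, data = toy)

# Auxiliary regression: squared residuals on the same predictors
toy$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ x1 + x2, data = toy)

bp_stat <- n * summary(aux)$r.squared          # test statistic
bp_df   <- length(coef(aux)) - 1               # one degree of freedom per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = bp_p)
```

A small p-value means the squared residuals can be partly predicted from the regressors, which is evidence against constant variance.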

6.3 Multicollinearity Test

We do not want multicollinearity in our model, and we use the variance inflation factor (VIF) to check for it. As a rule of thumb, each VIF value should be smaller than 10.

vif(model_backward)

##     log_GDP sos_support  h_life_exp     freedom  corruption 
##    4.035938    2.733741    3.571591    1.502702    1.325518

All VIF values are well below 10, so there is no sign of problematic multicollinearity in model_backward.
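For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A VIF of about 4.04 for log_GDP therefore corresponds to an R² of about 0.75 against the other predictors. The sketch below computes VIFs by hand on simulated data. The data frame `toy` and the name `vif_manual` are assumptions.

```r
# Simulated predictors (NOT the happiness dataset): x2 is deliberately correlated with x1
set.seed(7)
n <- 150
toy <- data.frame(x1 = rnorm(n))
toy$x2 <- 0.8 * toy$x1 + rnorm(n, sd = 0.6)
toy$x3 <- rnorm(n)

# For each predictor: regress it on the others and convert R-squared to a VIF
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

A predictor that is unrelated to the others has a VIF close to 1, and the value grows as the predictor becomes easier to predict from the rest.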

7 Conclusion

Our model performs well in predicting happiness_score based on the adjusted R-squared and the MAE. However, among the assumption tests, model_backward only passed the multicollinearity test. The residuals are not normally distributed and they are heteroscedastic. As a consequence the standard errors and p-values may be unreliable, so the significance of individual coefficients should be interpreted with caution. The model therefore needs to be improved, for example by transforming the variables.

8 Suggestion

Based on the regression above, it is important for the government to maintain economic conditions (GDP), give society freedom in making choices, support healthy lives through good public health services, strengthen social support, and improve people's perception of the country's corruption level. These variables are associated with happier people.

Determinant of Happiness

Meinari

3/25/2020