1. The Data

1.1 Context

The following are the World Happiness Report 2020 using the data downloaded from Kaggle.com. The data shows various aspect of life that affect Citizens’ Happiness level in different countries.

1.2 Inspiration & Aim

Using linear regression analysis we are going to make a model to predict which aspect of life affect people’s happiness the most.

2. Data Preparation

2.1 Library Preparation

First, let’s import all of the libraries that we are going to use in this report.

library(dplyr)
library(lmtest)
library(car)
library(MLmetrics)
library(GGally)

2.2 Read Data

happy <- read.csv("world-happiness-report-2021.csv")
glimpse(happy)
## Rows: 149
## Columns: 20
## $ ï..Country.name                            <chr> "Finland", "Denmark", "Swit~
## $ Regional.indicator                         <chr> "Western Europe", "Western ~
## $ Ladder.score                               <dbl> 7.842, 7.620, 7.571, 7.554,~
## $ Standard.error.of.ladder.score             <dbl> 0.032, 0.035, 0.036, 0.059,~
## $ upperwhisker                               <dbl> 7.904, 7.687, 7.643, 7.670,~
## $ lowerwhisker                               <dbl> 7.780, 7.552, 7.500, 7.438,~
## $ Logged.GDP.per.capita                      <dbl> 10.775, 10.933, 11.117, 10.~
## $ Social.support                             <dbl> 0.954, 0.954, 0.942, 0.983,~
## $ Healthy.life.expectancy                    <dbl> 72.000, 72.700, 74.400, 73.~
## $ Freedom.to.make.life.choices               <dbl> 0.949, 0.946, 0.919, 0.955,~
## $ Generosity                                 <dbl> -0.098, 0.030, 0.025, 0.160~
## $ Perceptions.of.corruption                  <dbl> 0.186, 0.179, 0.292, 0.673,~
## $ Ladder.score.in.Dystopia                   <dbl> 2.43, 2.43, 2.43, 2.43, 2.4~
## $ Explained.by..Log.GDP.per.capita           <dbl> 1.446, 1.502, 1.566, 1.482,~
## $ Explained.by..Social.support               <dbl> 1.106, 1.108, 1.079, 1.172,~
## $ Explained.by..Healthy.life.expectancy      <dbl> 0.741, 0.763, 0.816, 0.772,~
## $ Explained.by..Freedom.to.make.life.choices <dbl> 0.691, 0.686, 0.653, 0.698,~
## $ Explained.by..Generosity                   <dbl> 0.124, 0.208, 0.204, 0.293,~
## $ Explained.by..Perceptions.of.corruption    <dbl> 0.481, 0.485, 0.413, 0.170,~
## $ Dystopia...residual                        <dbl> 3.253, 2.868, 2.839, 2.967,~

The data consists of 149 rows which represents 149 countries and 20 columns. There are several columns that we are not going to use, so we will drop them.

#Subsetting
happy2 <- happy[,-c(1:2, 4:6, 14:20)]

2.3 Missing Values

Next let’s check for missing values

colSums(is.na(happy2))
##                 Ladder.score        Logged.GDP.per.capita 
##                            0                            0 
##               Social.support      Healthy.life.expectancy 
##                            0                            0 
## Freedom.to.make.life.choices                   Generosity 
##                            0                            0 
##    Perceptions.of.corruption     Ladder.score.in.Dystopia 
##                            0                            0

Since there are no missing value in this data, we can carry on!

3. Exploratory Data Analysis

Exploratory data analysis is a phase where we explore the data variables, see if there are any pattern that can indicate any kind of correlation between variables.

3.1 Correlation Chart

First of all, let’s find correlation between variables.

ggcorr(happy2, 
       label = TRUE, 
       label_size = 2.9, 
       hjust = 1, 
       layout.exp = 2)
## Warning in cor(data, use = method[1], method = method[2]): the standard
## deviation is zero

Looking at the plot above we can see that Logged.GDP.per.capita, Social.support & Healthy.life.expectancy has the most positive correlation with Ladder.score the most. Ladder.score is the score each country got for happiness, which will be our main focus in this data.

3.2 Linear Regression Model

Based on the Correlation chart above we are going to make a model using Logged.GDP.per.capita, Social.support & Healthy.life.expectancy as the predictor.

model1 <- lm(Ladder.score ~ 
               Logged.GDP.per.capita +
               Social.support +
               Healthy.life.expectancy,
             data = happy2)
summary(model1)
## 
## Call:
## lm(formula = Ladder.score ~ Logged.GDP.per.capita + Social.support + 
##     Healthy.life.expectancy, data = happy2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81242 -0.40736  0.01624  0.47390  1.31845 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -2.40583    0.47808  -5.032 1.41e-06 ***
## Logged.GDP.per.capita    0.27335    0.09419   2.902  0.00429 ** 
## Social.support           3.00051    0.70315   4.267 3.56e-05 ***
## Healthy.life.expectancy  0.04486    0.01447   3.101  0.00232 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6013 on 145 degrees of freedom
## Multiple R-squared:  0.6928, Adjusted R-squared:  0.6865 
## F-statistic:   109 on 3 and 145 DF,  p-value: < 2.2e-16

Insight: This model has Multiple R-Squared of 0.6928.

3.3 Step-Wise Regression

Next we will automatically define the predictors using Step-Wise Regression.

step_lm <- lm(Ladder.score ~ ., 
              data = happy2)
model_step <- step(step_lm, 
                   direction = "backward")
## Start:  AIC=-175.83
## Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption + 
##     Ladder.score.in.Dystopia
## 
## 
## Step:  AIC=-175.83
## Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption
## 
##                                Df Sum of Sq    RSS     AIC
## - Generosity                    1    0.3777 42.052 -176.49
## <none>                                      41.674 -175.84
## - Perceptions.of.corruption     1    1.2732 42.948 -173.35
## - Healthy.life.expectancy       1    1.5170 43.191 -172.51
## - Logged.GDP.per.capita         1    3.0409 44.715 -167.34
## - Social.support                1    4.0301 45.705 -164.08
## - Freedom.to.make.life.choices  1    4.8452 46.520 -161.45
## 
## Step:  AIC=-176.49
## Ladder.score ~ Logged.GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Perceptions.of.corruption
## 
##                                Df Sum of Sq    RSS     AIC
## <none>                                      42.052 -176.49
## - Healthy.life.expectancy       1    1.4290 43.481 -173.51
## - Perceptions.of.corruption     1    1.6089 43.661 -172.90
## - Logged.GDP.per.capita         1    2.7815 44.834 -168.95
## - Social.support                1    4.1366 46.189 -164.51
## - Freedom.to.make.life.choices  1    5.7233 47.775 -159.48
#Testing model from Step-Wise Regression
summary(lm(Ladder.score ~ 
             Logged.GDP.per.capita +
             Social.support +
             Healthy.life.expectancy +
             Freedom.to.make.life.choices +
             Perceptions.of.corruption, 
           data = happy2))
## 
## Call:
## lm(formula = Ladder.score ~ Logged.GDP.per.capita + Social.support + 
##     Healthy.life.expectancy + Freedom.to.make.life.choices + 
##     Perceptions.of.corruption, data = happy2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.93303 -0.29768  0.06863  0.33924  1.02304 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.11039    0.62112  -3.398 0.000880 ***
## Logged.GDP.per.capita         0.26400    0.08584   3.075 0.002518 ** 
## Social.support                2.50670    0.66835   3.751 0.000256 ***
## Healthy.life.expectancy       0.02936    0.01332   2.204 0.029095 *  
## Freedom.to.make.life.choices  2.13266    0.48342   4.412 2.01e-05 ***
## Perceptions.of.corruption    -0.66778    0.28549  -2.339 0.020718 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5423 on 143 degrees of freedom
## Multiple R-squared:  0.7536, Adjusted R-squared:  0.745 
## F-statistic: 87.49 on 5 and 143 DF,  p-value: < 2.2e-16
#Creating new model based on Step-Wise Regression

model2 <- lm(Ladder.score ~ 
               Logged.GDP.per.capita +
               Social.support +
               Healthy.life.expectancy +
               Freedom.to.make.life.choices +
               Perceptions.of.corruption, 
             data = happy2)

Step-Wise Regression will create the optimum formula using the AIC number, the lower the better. From SWR process above we can see that SWR gave us a new model using Logged.GDP.per.capita, Social.support, Healthy.life.expectancy, Freedom.to.make.life.choices & Perceptions.of.corruption as the predictor.

Insight: This model gave us a Multiple R-Squared of 0.7536, which is higher than that of Model 1. Therefore we can use Model 2 as the primary model.

Current Models:

Model 1: Ladder.score = -2.40583 + Logged.GDP.per.capita(0.27335) + Social.support(3.00051) + Healthy.life.expectancy(0.04486)

Model 2: Ladder.score = -2.11039 + Logged.GDP.per.capita(0.26400) + Social.support(2.50670) + Healthy.life.expectancy(0.02936) + Freedom.to.make.life.choices(2.13266) + Perceptions.of.corruption(2.13266)

4. Model Prediction & Error

4.1 Prediction & Error using Model 1

predict(model1, 
        data.frame(Logged.GDP.per.capita = 10,
                   Social.support = 0.95,
                   Healthy.life.expectancy = 72),
        interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 6.408241 6.215029 6.601452
sqrt((10-6.408241)^2)
## [1] 3.591759

4.2 Prediction & Error of Model 2

predict(model2, 
        data.frame(Logged.GDP.per.capita = 10,
                   Social.support = 0.95,
                   Healthy.life.expectancy = 72,
                   Freedom.to.make.life.choices = 0.95, 
                   Perceptions.of.corruption = 1),
        interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 6.383348 6.107969 6.658728
sqrt((10-6.383348)^2)
## [1] 3.616652

Through RSE calculation above we can see that both model has a similar Error margin 3.59 on model 1 and 3.62 on model 2. Sincec model 2 has a higher Multiple R-Squared value, we are going to continue using model 2.

5. Model Evaluation

5.1 Shapiro Test (Normality)

hist(model1$residuals, breaks =20)

hist(model2$residuals, breaks =20)

shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.9924, p-value = 0.6143
shapiro.test(model2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model2$residuals
## W = 0.97079, p-value = 0.002893

Looking at the results above, we can see that model 1 has a P-Value of > 0.05, while model 2 has a P-Value of < 0.05. For model 1 it means that the data is normal.

Theoretically, model 1 has an advantage since in Shapiro Test P-Value of > 0.05 is considered normal, and equal to or below 0.05 is not.

5.2 Heteroscedasticity

#Model 1
plot(happy2$Ladder.score, 
     model1$residuals)
abline(h = 0, col ="red")

#Model 2
plot(happy2$Ladder.score, 
     model2$residuals)
abline(h = 0, col ="red")

bptest(model1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 6.0644, df = 3, p-value = 0.1085
bptest(model2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model2
## BP = 17.147, df = 5, p-value = 0.004229

Again, looking at the result above we can see that Model 1 passed since it has a P-Value of > 0.05. Which means that Heteroscedasticity is not present, while it present on Model 2. In this test we are looking if the model has a pattern, Heteroscedasticity means that there is a pattern which we don’t want to have in a prediction model.

5.3 Variance Inflation Factor (multicollinearity)

vif(model1)
##   Logged.GDP.per.capita          Social.support Healthy.life.expectancy 
##                4.874626                2.671151                3.917951
vif(model2)
##        Logged.GDP.per.capita               Social.support 
##                     4.977918                     2.967390 
##      Healthy.life.expectancy Freedom.to.make.life.choices 
##                     4.083158                     1.510650 
##    Perceptions.of.corruption 
##                     1.317658

There are no variables that has more than 10 value in both models, so Multicollinearity is not present. According to the test above both model has a good criteria for a Linear Regression Model.

6. Conclusion

Although Model 2 has better R-Squared Value (0.75), it failed to pass both Shapiro Test and Breusch-Pagan Test.

Model 1 however passed with flying colors. Considering that Model 1 has a R-Squared value of 0.69 which is close enough with Model 2. Therefore we are going to carry on using Model 1.