1 Introduction

Measuring an individual happiness level is a tough task because emotions and feelings fluctuate very often. You could feel like you are on the top of the world today just to feel like you are trapped at the deepest part of trench the very next day. If it is hard enough to measure an individual happiness level, it must be even harder to measure a country happiness index, or is it?

Image via [Pixabay](https://pixabay.com/photos/people-women-talking-laugh-happy-2567915/)

Image via Pixabay

In 2020, The World Happiness Report (WHR) measures the happiness level of more than 150 countries. To calculate happiness, participants were asked to mentally construct a ladder with labels from 10 to 0, where 0 is the best possible life and 0 is the worst possible life for them. Then, participants were asked to place themselves in the ladder based on their current perceived happiness level. The participants various life factors were also calculated during the survey to construct correlation of happiness to various life factors.

2 Data Preparation

The 2020 World Happiness Report result was obtained from Makeover Monday from data world. You can find the tabular data here.

Loading required packages

Data Input

Country name The name of country (char)
Ladder score Cantril ladder score based on current perceived happines from the scale of 1-10 (numeric)
Regional Indicator Region location i.e., Western Europe, North America, Southeast Asia, etc. (factor)
Logged GDP per capita GDP per capita in log scale (numeric)
Social Support Amount of social support given by the government from the scale of 0-1 (numeric)
Healthy life expectancy Average life expectancy of a country (numeric)
Freedom to make life choices Average individual ability to freely determine life choices from the scale of 0-1 (numeric)
Generosity Perceived level of generosity (numeric)
Perceptions of corruption Perceived level of government corruption (numeric) from the scale of 0-1 (numeric)

3 Exploratory Data Analysis (EDA)

There are 3 steps that I implement for this EDA process.

  • Check data type and determine target
  • Remove irrelevant columns
  • Check missing values

3.1 Check data type and determine target

We can use str to determine the data types of every column

## tibble[,20] [153 x 20] (S3: tbl_df/tbl/data.frame)
##  $ Country name                              : chr [1:153] "Finland" "Denmark" "Switzerland" "Iceland" ...
##  $ Regional indicator                        : chr [1:153] "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
##  $ Ladder score                              : num [1:153] 7.81 7.65 7.56 7.5 7.49 ...
##  $ Standard error of ladder score            : num [1:153] 0.0312 0.0335 0.035 0.0596 0.0348 ...
##  $ upperwhisker                              : num [1:153] 7.87 7.71 7.63 7.62 7.56 ...
##  $ lowerwhisker                              : num [1:153] 7.75 7.58 7.49 7.39 7.42 ...
##  $ Logged GDP per capita                     : num [1:153] 10.6 10.8 11 10.8 11.1 ...
##  $ Social support                            : num [1:153] 0.954 0.956 0.943 0.975 0.952 ...
##  $ Healthy life expectancy                   : num [1:153] 71.9 72.4 74.1 73 73.2 ...
##  $ Freedom to make life choices              : num [1:153] 0.949 0.951 0.921 0.949 0.956 ...
##  $ Generosity                                : num [1:153] -0.0595 0.0662 0.1059 0.2469 0.1345 ...
##  $ Perceptions of corruption                 : num [1:153] 0.195 0.168 0.304 0.712 0.263 ...
##  $ Ladder score in Dystopia                  : num [1:153] 1.97 1.97 1.97 1.97 1.97 ...
##  $ Explained by: Log GDP per capita          : num [1:153] 1.29 1.33 1.39 1.33 1.42 ...
##  $ Explained by: Social support              : num [1:153] 1.5 1.5 1.47 1.55 1.5 ...
##  $ Explained by: Healthy life expectancy     : num [1:153] 0.961 0.979 1.041 1.001 1.008 ...
##  $ Explained by: Freedom to make life choices: num [1:153] 0.662 0.665 0.629 0.662 0.67 ...
##  $ Explained by: Generosity                  : num [1:153] 0.16 0.243 0.269 0.362 0.288 ...
##  $ Explained by: Perceptions of corruption   : num [1:153] 0.478 0.495 0.408 0.145 0.434 ...
##  $ Dystopia + residual                       : num [1:153] 2.76 2.43 2.35 2.46 2.17 ...

Let’s check the first 5 observations to make sure our data type is appropriate

Country name Regional indicator Ladder score Standard error of ladder score upperwhisker lowerwhisker Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Ladder score in Dystopia Explained by: Log GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption Dystopia + residual
Finland Western Europe 7.8087 0.0311563 7.869766 7.747634 10.63927 0.9543297 71.90083 0.9491722 -0.0594820 0.1954446 1.972317 1.285189 1.499526 0.9612714 0.6623167 0.1596704 0.4778573 2.762835
Denmark Western Europe 7.6456 0.0334923 7.711245 7.579955 10.77400 0.9559908 72.40250 0.9514443 0.0662018 0.1684895 1.972317 1.326949 1.503449 0.9793326 0.6650399 0.2427934 0.4952603 2.432741
Switzerland Western Europe 7.5599 0.0350142 7.628528 7.491272 10.97993 0.9428466 74.10245 0.9213367 0.1059110 0.3037284 1.972317 1.390774 1.472403 1.0405332 0.6289545 0.2690558 0.4079459 2.350267
Iceland Western Europe 7.5045 0.0596159 7.621347 7.387653 10.77256 0.9746696 73.00000 0.9488919 0.2469442 0.7117097 1.972317 1.326502 1.547567 1.0008434 0.6619807 0.3623302 0.1445408 2.460688
Norway Western Europe 7.4880 0.0348374 7.556281 7.419719 11.08780 0.9524866 73.20078 0.9557503 0.1345326 0.2632182 1.972317 1.424207 1.495173 1.0080719 0.6702009 0.2879851 0.4341006 2.168266

The data type of this dataset is already appropriate. As such, there is no need for data type conversion.

3.2 Remove Irrelevant Columns

Let’s check the distribution of numeric columns with the help of inspectdf package.

## Warning: package 'inspectdf' was built under R version 4.0.5

Some of the later columns, especially those that start with Explained by are derivative. As such, we are going to remove them. Irrelevant columns such as upper and lower whisker are also going to be removed.

3.3 Check and remove missing values

There are many ways to check and remove missing values. I utilized vis_miss from the library naniar to visualize missing values.

Good, we have no missing data. We can start to building our machine learning model right away.

4 Model Creation

In linear regression, we are trying to predict numeric values by taking advantage of the linear relations between predictor(s) and target. Linear regression is one of the simplest machine learning model with the following assumptions.

  • LINEARITY: X and Y has linear relationships (can be shown through correlations values)
  • NORMALITY: X and Y have normal distributions
  • HOMOSCEDASTICITY: Residual variance are the same throughout all X values. This shows that X values are not affected by random error
  • INDEPENDENCE: Observations are independent to each other meaning that there is no repeated/duplicated observations.

We can test for linearity using ggcorr function from the ggally package.

We see that generosity does not contribute to the ladder score. As such, we are going to remove this variable prior linear model construction. We are also going to remove character columns (country and region name) prior to model construction.

Before we modeling, lets do the simplest form of cross validation, test-train data splitting. Here, we split the data into train (80%) and test (20%) set.

Now that we have our trainining and testing set, we can finally build our model from the training set. We are going to build a linear regression model using Ladder Score as target and the rest in happy_clean2 as the predictors.

## 
## Call:
## lm(formula = `Ladder score` ~ ., data = data_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.170652 -0.076816 -0.000427  0.059615  0.313882 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -3.931169   0.121805  -32.27   <2e-16 ***
## `Logged GDP per capita`         0.280086   0.014496   19.32   <2e-16 ***
## `Social support`                2.430695   0.111828   21.74   <2e-16 ***
## `Healthy life expectancy`       0.036000   0.002359   15.26   <2e-16 ***
## `Freedom to make life choices`  1.464421   0.092209   15.88   <2e-16 ***
## `Perceptions of corruption`    -0.812368   0.058798  -13.82   <2e-16 ***
## `Dystopia + residual`           0.994968   0.015005   66.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08856 on 100 degrees of freedom
## Multiple R-squared:  0.9942, Adjusted R-squared:  0.9939 
## F-statistic:  2873 on 6 and 100 DF,  p-value: < 2.2e-16

We see that our model has good Multiple R-Squared and Residual standard error values. From the R-squared value, we know that our model could explain 99% of the of variance in the target variable (ladder score). Our standard error is also astonishingly low at 0.08. We will see if this is overfitting once we apply this model on the test set.

5 Model Evaluation

5.1 Performance

There are several performance metrics that we can use to determine the goodness of model once applied to test set. Here we are going to use RMSE, MAE, and MAPE for model evaluation.

## [1] 0.1013872
## [1] 0.07373572
## [1] 0.01394702

Our RMSE and MAE score is very close to the model residual error. The MAPE score is also very low at 1% error. This results indicates that our model performs well on the unseen data.

5.2 Assumptions

We have mentioned that in order for us to do linear regression, certain assumptions must be met. Linear regressions models that does not satisfy all the assumptions are highly susceptible to bias and it the results could even be misleading.

1. Linearity

Linear model assumes that there is a linear relationship between target and predictors. We can plot fitted values (predicted values) againts residuals (error). What we want to see is linear distribution of fitted values regardless of residuals.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can see that there are some patterns in our data albeit not very significant. Also, keep in mind that we have very little error (residuals) shown by residuals value that are very small (< 0.5).

2. Normality

Our second assumption is that our residuals are normally distributed. We can do Sahphiro normality test or plot the residuals distribution. We will do both here.

## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.97167, p-value = 0.02177

From the normality test, we found out that our p-value is significant. This indicates that our residuals are not normally distributed.

3. Independence

There is no test for this. Understanding of our data is what needed for this assumption satisfaction.

4. Heteroscedasticity

We are going to reuse our residuals vs fitted plot for visual presentation of heteroscedasticity or homoscedasticity.

This plot looks good because distribution of error does not seem to be affected by predicted values. We can confirm this by doing Breusch-Pagan test.

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 8.8686, df = 6, p-value = 0.1811

Confirmed by Breusch-Pagan test result, our model random variables are homoscedastic because the BP test result is not significant.

5. Multicollinearity

When multicollinearity exist, there exist a correlation between indpendent predictors. We can use vif to check for this.

##        `Logged GDP per capita`               `Social support` 
##                       4.242679                       2.540532 
##      `Healthy life expectancy` `Freedom to make life choices` 
##                       3.914377                       1.507692 
##    `Perceptions of corruption`          `Dystopia + residual` 
##                       1.412067                       1.031217

As a rule of thumb, a value greater than 5 would suggest that Multicollinearity exists. We do not have those in our predictors, as such our model satisfy multicollinearity assumption.

6 Conclusion

Variables that are useful to describe the variances in country happiness ladder score are GDP/capite, social support, healthy life expectancy, life choices freedom, corruptions perceptions, and dystopia residuals. Our final model has satisfied the classical assumptions aside from normality.

The R-squared of the model is very high, with 99% of the variables can explain the variances in ladder score. The accuracy of the model in predicting ladder score is measured with RMSE, with training data has RMSE of 0.08856 and testing data has RMSE of 0.1013872, suggesting that our model perform well on the unseen data.