Measuring an individual happiness level is a tough task because emotions and feelings fluctuate very often. You could feel like you are on the top of the world today just to feel like you are trapped at the deepest part of trench the very next day. If it is hard enough to measure an individual happiness level, it must be even harder to measure a country happiness index, or is it?
Image via Pixabay
In 2020, The World Happiness Report (WHR) measures the happiness level of more than 150 countries. To calculate happiness, participants were asked to mentally construct a ladder with labels from 10 to 0, where 0 is the best possible life and 0 is the worst possible life for them. Then, participants were asked to place themselves in the ladder based on their current perceived happiness level. The participants various life factors were also calculated during the survey to construct correlation of happiness to various life factors.
The 2020 World Happiness Report result was obtained from Makeover Monday from data world. You can find the tabular data here.
Loading required packages
library(tidyverse)
library(kableExtra)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(naniar)Data Input
Country name The name of country (char)
Ladder score Cantril ladder score based on current perceived happines from the scale of 1-10 (numeric)
Regional Indicator Region location i.e., Western Europe, North America, Southeast Asia, etc. (factor)
Logged GDP per capita GDP per capita in log scale (numeric)
Social Support Amount of social support given by the government from the scale of 0-1 (numeric)
Healthy life expectancy Average life expectancy of a country (numeric)
Freedom to make life choices Average individual ability to freely determine life choices from the scale of 0-1 (numeric)
Generosity Perceived level of generosity (numeric)
Perceptions of corruption Perceived level of government corruption (numeric) from the scale of 0-1 (numeric)
There are 3 steps that I implement for this EDA process.
We can use str to determine the data types of every column
## tibble[,20] [153 x 20] (S3: tbl_df/tbl/data.frame)
## $ Country name : chr [1:153] "Finland" "Denmark" "Switzerland" "Iceland" ...
## $ Regional indicator : chr [1:153] "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
## $ Ladder score : num [1:153] 7.81 7.65 7.56 7.5 7.49 ...
## $ Standard error of ladder score : num [1:153] 0.0312 0.0335 0.035 0.0596 0.0348 ...
## $ upperwhisker : num [1:153] 7.87 7.71 7.63 7.62 7.56 ...
## $ lowerwhisker : num [1:153] 7.75 7.58 7.49 7.39 7.42 ...
## $ Logged GDP per capita : num [1:153] 10.6 10.8 11 10.8 11.1 ...
## $ Social support : num [1:153] 0.954 0.956 0.943 0.975 0.952 ...
## $ Healthy life expectancy : num [1:153] 71.9 72.4 74.1 73 73.2 ...
## $ Freedom to make life choices : num [1:153] 0.949 0.951 0.921 0.949 0.956 ...
## $ Generosity : num [1:153] -0.0595 0.0662 0.1059 0.2469 0.1345 ...
## $ Perceptions of corruption : num [1:153] 0.195 0.168 0.304 0.712 0.263 ...
## $ Ladder score in Dystopia : num [1:153] 1.97 1.97 1.97 1.97 1.97 ...
## $ Explained by: Log GDP per capita : num [1:153] 1.29 1.33 1.39 1.33 1.42 ...
## $ Explained by: Social support : num [1:153] 1.5 1.5 1.47 1.55 1.5 ...
## $ Explained by: Healthy life expectancy : num [1:153] 0.961 0.979 1.041 1.001 1.008 ...
## $ Explained by: Freedom to make life choices: num [1:153] 0.662 0.665 0.629 0.662 0.67 ...
## $ Explained by: Generosity : num [1:153] 0.16 0.243 0.269 0.362 0.288 ...
## $ Explained by: Perceptions of corruption : num [1:153] 0.478 0.495 0.408 0.145 0.434 ...
## $ Dystopia + residual : num [1:153] 2.76 2.43 2.35 2.46 2.17 ...
Let’s check the first 5 observations to make sure our data type is appropriate
| Country name | Regional indicator | Ladder score | Standard error of ladder score | upperwhisker | lowerwhisker | Logged GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | Ladder score in Dystopia | Explained by: Log GDP per capita | Explained by: Social support | Explained by: Healthy life expectancy | Explained by: Freedom to make life choices | Explained by: Generosity | Explained by: Perceptions of corruption | Dystopia + residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Finland | Western Europe | 7.8087 | 0.0311563 | 7.869766 | 7.747634 | 10.63927 | 0.9543297 | 71.90083 | 0.9491722 | -0.0594820 | 0.1954446 | 1.972317 | 1.285189 | 1.499526 | 0.9612714 | 0.6623167 | 0.1596704 | 0.4778573 | 2.762835 |
| Denmark | Western Europe | 7.6456 | 0.0334923 | 7.711245 | 7.579955 | 10.77400 | 0.9559908 | 72.40250 | 0.9514443 | 0.0662018 | 0.1684895 | 1.972317 | 1.326949 | 1.503449 | 0.9793326 | 0.6650399 | 0.2427934 | 0.4952603 | 2.432741 |
| Switzerland | Western Europe | 7.5599 | 0.0350142 | 7.628528 | 7.491272 | 10.97993 | 0.9428466 | 74.10245 | 0.9213367 | 0.1059110 | 0.3037284 | 1.972317 | 1.390774 | 1.472403 | 1.0405332 | 0.6289545 | 0.2690558 | 0.4079459 | 2.350267 |
| Iceland | Western Europe | 7.5045 | 0.0596159 | 7.621347 | 7.387653 | 10.77256 | 0.9746696 | 73.00000 | 0.9488919 | 0.2469442 | 0.7117097 | 1.972317 | 1.326502 | 1.547567 | 1.0008434 | 0.6619807 | 0.3623302 | 0.1445408 | 2.460688 |
| Norway | Western Europe | 7.4880 | 0.0348374 | 7.556281 | 7.419719 | 11.08780 | 0.9524866 | 73.20078 | 0.9557503 | 0.1345326 | 0.2632182 | 1.972317 | 1.424207 | 1.495173 | 1.0080719 | 0.6702009 | 0.2879851 | 0.4341006 | 2.168266 |
The data type of this dataset is already appropriate. As such, there is no need for data type conversion.
Let’s check the distribution of numeric columns with the help of inspectdf package.
## Warning: package 'inspectdf' was built under R version 4.0.5
Some of the later columns, especially those that start with Explained by are derivative. As such, we are going to remove them. Irrelevant columns such as upper and lower whisker are also going to be removed.
happy_clean <- happy %>% #select relevant columns
select(`Country name`, `Ladder score`, `Regional indicator`, `Logged GDP per capita`, `Social support`,
`Healthy life expectancy`, `Freedom to make life choices`, Generosity,
`Perceptions of corruption`, `Dystopia + residual`) There are many ways to check and remove missing values. I utilized vis_miss from the library naniar to visualize missing values.
Good, we have no missing data. We can start to building our machine learning model right away.
In linear regression, we are trying to predict numeric values by taking advantage of the linear relations between predictor(s) and target. Linear regression is one of the simplest machine learning model with the following assumptions.
We can test for linearity using ggcorr function from the ggally package.
We see that generosity does not contribute to the ladder score. As such, we are going to remove this variable prior linear model construction. We are also going to remove character columns (country and region name) prior to model construction.
Before we modeling, lets do the simplest form of cross validation, test-train data splitting. Here, we split the data into train (80%) and test (20%) set.
set.seed(123)
samplesize <- round(0.7 * nrow(happy_clean2), 0)
index <- sample(seq_len(nrow(happy_clean2)), size = samplesize)
data_train <- happy_clean2[index, ]
data_test <- happy_clean2[-index, ]Now that we have our trainining and testing set, we can finally build our model from the training set. We are going to build a linear regression model using Ladder Score as target and the rest in happy_clean2 as the predictors.
##
## Call:
## lm(formula = `Ladder score` ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.170652 -0.076816 -0.000427 0.059615 0.313882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.931169 0.121805 -32.27 <2e-16 ***
## `Logged GDP per capita` 0.280086 0.014496 19.32 <2e-16 ***
## `Social support` 2.430695 0.111828 21.74 <2e-16 ***
## `Healthy life expectancy` 0.036000 0.002359 15.26 <2e-16 ***
## `Freedom to make life choices` 1.464421 0.092209 15.88 <2e-16 ***
## `Perceptions of corruption` -0.812368 0.058798 -13.82 <2e-16 ***
## `Dystopia + residual` 0.994968 0.015005 66.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08856 on 100 degrees of freedom
## Multiple R-squared: 0.9942, Adjusted R-squared: 0.9939
## F-statistic: 2873 on 6 and 100 DF, p-value: < 2.2e-16
We see that our model has good Multiple R-Squared and Residual standard error values. From the R-squared value, we know that our model could explain 99% of the of variance in the target variable (ladder score). Our standard error is also astonishingly low at 0.08. We will see if this is overfitting once we apply this model on the test set.
There are several performance metrics that we can use to determine the goodness of model once applied to test set. Here we are going to use RMSE, MAE, and MAPE for model evaluation.
model_predict <- predict(model, newdata = data_test)
MLmetrics::RMSE(y_pred = model_predict, y_true = data_test$`Ladder score`)## [1] 0.1013872
## [1] 0.07373572
## [1] 0.01394702
Our RMSE and MAE score is very close to the model residual error. The MAPE score is also very low at 1% error. This results indicates that our model performs well on the unseen data.
We have mentioned that in order for us to do linear regression, certain assumptions must be met. Linear regressions models that does not satisfy all the assumptions are highly susceptible to bias and it the results could even be misleading.
1. Linearity
Linear model assumes that there is a linear relationship between target and predictors. We can plot fitted values (predicted values) againts residuals (error). What we want to see is linear distribution of fitted values regardless of residuals.
resact <- data.frame(residual = model$residuals, fitted = model$fitted.values)
resact %>%
ggplot(aes(fitted, residual)) +
geom_point() +
geom_smooth() +
geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can see that there are some patterns in our data albeit not very significant. Also, keep in mind that we have very little error (residuals) shown by residuals value that are very small (< 0.5).
2. Normality
Our second assumption is that our residuals are normally distributed. We can do Sahphiro normality test or plot the residuals distribution. We will do both here.
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.97167, p-value = 0.02177
residuals <- as_tibble(model$residuals)
ggplot(residuals, mapping = aes(x = value)) +
geom_density()From the normality test, we found out that our p-value is significant. This indicates that our residuals are not normally distributed.
3. Independence
There is no test for this. Understanding of our data is what needed for this assumption satisfaction.
4. Heteroscedasticity
We are going to reuse our residuals vs fitted plot for visual presentation of heteroscedasticity or homoscedasticity.
resact %>%
ggplot(aes(fitted, residual)) +
geom_point() +
geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())This plot looks good because distribution of error does not seem to be affected by predicted values. We can confirm this by doing Breusch-Pagan test.
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 8.8686, df = 6, p-value = 0.1811
Confirmed by Breusch-Pagan test result, our model random variables are homoscedastic because the BP test result is not significant.
5. Multicollinearity
When multicollinearity exist, there exist a correlation between indpendent predictors. We can use vif to check for this.
## `Logged GDP per capita` `Social support`
## 4.242679 2.540532
## `Healthy life expectancy` `Freedom to make life choices`
## 3.914377 1.507692
## `Perceptions of corruption` `Dystopia + residual`
## 1.412067 1.031217
As a rule of thumb, a value greater than 5 would suggest that Multicollinearity exists. We do not have those in our predictors, as such our model satisfy multicollinearity assumption.
Variables that are useful to describe the variances in country happiness ladder score are GDP/capite, social support, healthy life expectancy, life choices freedom, corruptions perceptions, and dystopia residuals. Our final model has satisfied the classical assumptions aside from normality.
The R-squared of the model is very high, with 99% of the variables can explain the variances in ladder score. The accuracy of the model in predicting ladder score is measured with RMSE, with training data has RMSE of 0.08856 and testing data has RMSE of 0.1013872, suggesting that our model perform well on the unseen data.