The happiness level of a country is determined by several factors. This analysis examines the determinants of happiness using a dataset from kaggle.com. The dataset contains each country's happiness score together with several macroeconomic and social variables, such as GDP per capita, healthy life expectancy, freedom of choice, and generosity. The happiness score is modeled with linear regression.
## 'data.frame': 156 obs. of 9 variables:
## $ Overall.rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Country.or.region : Factor w/ 156 levels "Afghanistan",..: 44 37 106 58 99 134 133 100 24 7 ...
## $ Score : num 7.77 7.6 7.55 7.49 7.49 ...
## $ GDP.per.capita : num 1.34 1.38 1.49 1.38 1.4 ...
## $ Social.support : num 1.59 1.57 1.58 1.62 1.52 ...
## $ Healthy.life.expectancy : num 0.986 0.996 1.028 1.026 0.999 ...
## $ Freedom.to.make.life.choices: num 0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
## $ Generosity : num 0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
## $ Perceptions.of.corruption : num 0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...
happy <- happy %>%
  mutate(rank = Overall.rank,
         country = Country.or.region,
         happiness_score = Score,
         log_GDP = GDP.per.capita,
         sos_support = Social.support,
         h_life_exp = Healthy.life.expectancy,
         freedom = Freedom.to.make.life.choices,
         generosity = Generosity,
         corruption = Perceptions.of.corruption) %>%
  select(-c(Overall.rank, Country.or.region, Score, GDP.per.capita, Social.support, Healthy.life.expectancy, Freedom.to.make.life.choices, Generosity, Perceptions.of.corruption))
str(happy)
## 'data.frame': 156 obs. of 9 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : Factor w/ 156 levels "Afghanistan",..: 44 37 106 58 99 134 133 100 24 7 ...
## $ happiness_score: num 7.77 7.6 7.55 7.49 7.49 ...
## $ log_GDP : num 1.34 1.38 1.49 1.38 1.4 ...
## $ sos_support : num 1.59 1.57 1.58 1.62 1.52 ...
## $ h_life_exp : num 0.986 0.996 1.028 1.026 0.999 ...
## $ freedom : num 0.596 0.592 0.603 0.591 0.557 0.572 0.574 0.585 0.584 0.532 ...
## $ generosity : num 0.153 0.252 0.271 0.354 0.322 0.263 0.267 0.33 0.285 0.244 ...
## $ corruption : num 0.393 0.41 0.341 0.118 0.298 0.343 0.373 0.38 0.308 0.226 ...
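The same renaming can also be written with `dplyr::rename()`, which changes column names in place and so needs no follow-up `select()`. The sketch below is only an illustration: `toy` is a hypothetical two-row data frame with three of the original columns, not the real dataset.

```r
library(dplyr)

# Toy stand-in for the raw data (hypothetical values, three columns only)
toy <- data.frame(
  Overall.rank   = 1:2,
  Score          = c(7.77, 7.60),
  GDP.per.capita = c(1.34, 1.38)
)

# rename(new_name = old_name) keeps the column order and drops nothing
toy_renamed <- toy %>%
  rename(rank            = Overall.rank,
         happiness_score = Score,
         log_GDP         = GDP.per.capita)

names(toy_renamed)
```

Because `rename()` never creates duplicate columns, there is no risk of forgetting to drop one of the old names afterwards.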
Here is a description of each variable:
- rank –> the country's rank by happiness score
- country –> name of the country/region
- happiness_score –> Happiness score
- log_GDP –> log of GDP per capita
- sos_support –> social support
- h_life_exp –> healthy life expectancy
- freedom –> freedom to make life choices
- generosity –> generosity rate
- corruption –> perceptions of corruption
## rank country happiness_score log_GDP sos_support
## 0 0 0 0 0
## h_life_exp freedom generosity corruption
## 0 0 0 0
##
## FALSE
## 1404
There are no missing values (NA) in the dataset.
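The code behind the output above is not shown. Output of that shape (a count per column, then a single FALSE count of 1404 = 156 × 9 cells) is what `colSums(is.na(...))` and `table(is.na(...))` produce, so a check along these lines would do the job. The data frame `toy` is a hypothetical example with one missing value, used so that the counts are not all zero.

```r
# Toy data frame with one missing value (hypothetical, not the real data)
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))

na_per_column <- colSums(is.na(toy))  # number of NAs in each column
na_overall    <- table(is.na(toy))    # FALSE/TRUE counts over all cells
has_na        <- any(is.na(toy))      # TRUE if at least one cell is missing

na_per_column
na_overall
has_na
```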
# `rank` and `country` are not used as predictors, so drop them first
happy1 <- happy %>%
  select(-c(rank, country))
# plot the correlations among the variables
ggcorr(happy1, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
Based on the correlation plot above, every variable is positively correlated with the happiness score.
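The signs read off a correlation plot can be double-checked numerically with base R's `cor()`. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption made so that the example runs without the happiness dataset). With the real data, the same pattern would be applied to `happy1` and the `happiness_score` column.

```r
# Correlation of every variable with the response (here: mpg in mtcars)
cor_with_mpg <- cor(mtcars)[, "mpg"]

# Drop the response's correlation with itself (always 1)
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]

# TRUE only if every predictor is positively correlated with the response
all_positive <- all(cor_with_mpg > 0)

round(cor_with_mpg, 2)
all_positive
```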
Based on the boxplot above, there are no outliers in the happiness_score variable, so we can continue to the next step.
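A boxplot flags as outliers the points that lie more than 1.5 times the box height beyond the hinges. The same rule can be applied without drawing the plot through `boxplot.stats()`. The vector `x` below is made up for illustration and deliberately contains one extreme value.

```r
# Hypothetical scores with one extreme value at the end
x <- c(4.2, 4.8, 5.1, 5.4, 5.9, 6.3, 12.0)

# $out holds the values beyond the whiskers (1.5 x the hinge spread)
outliers <- boxplot.stats(x)$out

outliers
length(outliers) == 0  # TRUE would mean "no outliers"
```

For the happiness data, the equivalent call would be `boxplot.stats(happy1$happiness_score)$out`, which should return an empty vector if the boxplot shows no outliers.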
# model_all uses all predictor variables to predict `happiness_score`
model_all <- lm(happiness_score ~ ., data = happy1)
##
## Call:
## lm(formula = happiness_score ~ ., data = happy1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75304 -0.35306 0.05703 0.36695 1.19059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7952 0.2111 8.505 0.0000000000000177 ***
## log_GDP 0.7754 0.2182 3.553 0.000510 ***
## sos_support 1.1242 0.2369 4.745 0.0000048338066976 ***
## h_life_exp 1.0781 0.3345 3.223 0.001560 **
## freedom 1.4548 0.3753 3.876 0.000159 ***
## generosity 0.4898 0.4977 0.984 0.326709
## corruption 0.9723 0.5424 1.793 0.075053 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5335 on 149 degrees of freedom
## Multiple R-squared: 0.7792, Adjusted R-squared: 0.7703
## F-statistic: 87.62 on 6 and 149 DF, p-value: < 0.00000000000000022
# mean absolute error (MAE) of model_all
MAE(
  y_pred = model_all$fitted.values,
  y_true = happy1$happiness_score
)
## [1] 0.4139013
In model_all, generosity and corruption are not significant at the 5% level. The model's adjusted R-squared is 77.03% and its mean absolute error (MAE) is 0.4139. In other words, model_all explains about 77% of the variance in the happiness score, and its fitted values are off by about ±0.41 on average.
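Both performance figures can be reproduced by hand, which makes clear what they measure. The sketch below fits a small model on the built-in `mtcars` data (a stand-in, not the happiness data) and computes the MAE and the adjusted R-squared directly. The source does not say which package supplies `MAE()`, so only base R is used here.

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# MAE: average absolute gap between observed and fitted values
mae <- mean(abs(mtcars$mpg - fitted(fit)))

# R-squared and adjusted R-squared as reported by summary()
r2     <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared

# Adjusted R-squared by hand: penalises R-squared for the number of predictors p
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
adj_by_hand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(mae = mae, r2 = r2, adj_r2 = adj_r2, adj_by_hand = adj_by_hand)
```

The adjusted value is always at most the plain R-squared, which is why the summaries above show 0.7792 next to 0.7703.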
##
## Call:
## lm(formula = happiness_score ~ log_GDP + sos_support + h_life_exp +
## freedom + generosity + corruption, data = happy1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75304 -0.35306 0.05703 0.36695 1.19059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7952 0.2111 8.505 0.0000000000000177 ***
## log_GDP 0.7754 0.2182 3.553 0.000510 ***
## sos_support 1.1242 0.2369 4.745 0.0000048338066976 ***
## h_life_exp 1.0781 0.3345 3.223 0.001560 **
## freedom 1.4548 0.3753 3.876 0.000159 ***
## generosity 0.4898 0.4977 0.984 0.326709
## corruption 0.9723 0.5424 1.793 0.075053 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5335 on 149 degrees of freedom
## Multiple R-squared: 0.7792, Adjusted R-squared: 0.7703
## F-statistic: 87.62 on 6 and 149 DF, p-value: < 0.00000000000000022
# mean absolute error (MAE) of model_forward
MAE(
  y_pred = model_forward$fitted.values,
  y_true = happy1$happiness_score
)
## [1] 0.4139013
model_forward ends up with the same six predictors as model_all, so its results are identical: generosity and corruption are not significant at the 5% level, the adjusted R-squared is 77.03%, and the MAE is 0.4139. model_forward therefore also explains about 77% of the variance in the happiness score, with fitted values off by about ±0.41 on average.
##
## Call:
## lm(formula = happiness_score ~ log_GDP + sos_support + h_life_exp +
## freedom + corruption, data = happy1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82997 -0.35344 0.05803 0.35977 1.17522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8689 0.1973 9.471 < 0.0000000000000002 ***
## log_GDP 0.7455 0.2161 3.450 0.000728 ***
## sos_support 1.1180 0.2368 4.722 0.00000533 ***
## h_life_exp 1.0840 0.3344 3.241 0.001467 **
## freedom 1.5340 0.3666 4.185 0.00004844 ***
## corruption 1.1176 0.5218 2.142 0.033839 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5335 on 150 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7703
## F-statistic: 105 on 5 and 150 DF, p-value: < 0.00000000000000022
# mean absolute error (MAE) of model_backward
MAE(
  y_pred = model_backward$fitted.values,
  y_true = happy1$happiness_score
)
## [1] 0.4132855
model_backward drops generosity and keeps the other five predictors. Its adjusted R-squared is 77.03% and its MAE is 0.4133, so it explains about 77% of the variance in the happiness score with fitted values off by about ±0.41 on average.
Compared with model_all and model_forward, model_backward is slightly better: its MAE is slightly smaller (0.4133 versus 0.4139) and it needs one fewer predictor. We therefore use model_backward for the next steps.
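The code that built model_forward and model_backward is not shown. A common way to run forward and backward selection in R is the base function `step()`, sketched here on the built-in `mtcars` data. Whether the author used `step()` is an assumption, and the model names `model_none` and `model_full` are hypothetical.

```r
# Smallest and largest candidate models (stand-in data, not the happiness data)
model_none <- lm(mpg ~ 1, data = mtcars)
model_full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Forward: start from the intercept-only model and add predictors by AIC
fwd <- step(model_none,
            scope = list(lower = ~ 1, upper = ~ wt + hp + qsec + drat),
            direction = "forward", trace = 0)

# Backward: start from the full model and drop predictors by AIC
bwd <- step(model_full, direction = "backward", trace = 0)

formula(fwd)
formula(bwd)
```

`step()` compares models by AIC rather than by p-values, so a predictor can survive selection even when it is not significant at the 5% level.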
Based on the regression analysis, all predictors in model_backward have a significant effect on the happiness score at the 5% level. Holding the other predictors constant, the effects are as follows:
- GDP –> a 1-unit increase in log GDP raises the happiness score by 0.75. The higher the GDP, the higher the country's happiness score.
- social support –> a 1-unit increase in social support raises the happiness score by 1.12. The more social support people receive, the happier they are.
- healthy life expectancy –> a 1-unit increase in healthy life expectancy raises the happiness score by 1.08. The higher the healthy life expectancy, the happier people become.
- freedom –> a 1-unit increase in freedom raises the happiness score by 1.53. When people are free to make their own choices, they are happier.
- corruption –> a 1-unit increase in the corruption variable raises the happiness score by 1.12, so perceptions of corruption have a positive effect on happiness. However, this predictor has the largest p-value of the five (0.034), so the statistical evidence for its effect is the weakest.
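Written out as a single equation, with the coefficients taken from the model_backward summary above, the fitted model is:

```latex
\widehat{\text{happiness\_score}}
  = 1.8689
  + 0.7455\,\text{log\_GDP}
  + 1.1180\,\text{sos\_support}
  + 1.0840\,\text{h\_life\_exp}
  + 1.5340\,\text{freedom}
  + 1.1176\,\text{corruption}
```

Each coefficient is the change in the predicted score for a 1-unit change in that variable while the other four stay fixed.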
Normality of the residuals is checked using the Shapiro-Wilk test.
Hypothesis:
- H0: Residuals are normally distributed
- H1: Residuals are not normally distributed
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.98147, p-value = 0.03426
The test result shows that the residuals of the model are not normally distributed (p-value < 0.05). As a consequence, the p-values and confidence intervals of the model may be unreliable, which can lead to biased conclusions. The model therefore needs to be improved, for example by scaling or transforming the variables so that the error distribution is closer to normal.
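The call that produced this output is not shown, but the `data:` line suggests the test was run on `model_backward$residuals`. A minimal sketch of the same check with base R's `shapiro.test()`, using a stand-in model on `mtcars`:

```r
# Stand-in model (the real analysis would use model_backward)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test on the residuals; H0 = residuals are normal
sw <- shapiro.test(residuals(fit))

sw$statistic   # W, close to 1 when the residuals look normal
sw$p.value

# Reject H0 at the 5% level when the p-value is below 0.05
reject_normality <- sw$p.value < 0.05
reject_normality
```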
The residuals of the model should have constant variance (homoscedasticity).
## integer(0)
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 16.115, df = 5, p-value = 0.006523
Hypothesis:
- H0: The residuals are homoscedastic (constant variance)
- H1: The residuals are heteroscedastic
Based on the test result, the residuals of the model are heteroscedastic (p-value < 0.05).
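The studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the model's predictors and multiply that auxiliary regression's R-squared by the sample size. The sketch below does this in base R on a stand-in `mtcars` model. In practice a packaged function such as `lmtest::bptest()` gives the same test in one call; which function the author used is not shown.

```r
# Stand-in model (the real analysis would use model_backward)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
sq_res <- residuals(fit)^2
aux <- lm(sq_res ~ wt + hp, data = mtcars)

# Studentized BP statistic = n * R-squared of the auxiliary regression
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1   # one degree of freedom per predictor

# Under H0 (constant variance) the statistic is chi-squared with bp_df df
bp_pval <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_pval)
```

The degrees of freedom equal the number of predictors, which matches `df = 5` in the output above for the five predictors of model_backward.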
The model should not have multicollinearity among its predictors. We use the variance inflation factor (VIF) to check this: every VIF value has to be smaller than 10 to pass the multicollinearity test.
## log_GDP sos_support h_life_exp freedom corruption
## 4.035938 2.733741 3.571591 1.502702 1.325518
All VIF values are below 10, so there is no multicollinearity in model_backward.
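The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R for a stand-in set of `mtcars` columns. Packaged functions such as `car::vif()` report the same quantity; which function produced the output above is not shown.

```r
# Stand-in predictors (the real analysis would use the five in model_backward)
predictors <- c("wt", "hp", "qsec")

vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # Regress predictor v on the remaining predictors
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)   # VIF: grows as v becomes more predictable from the others
})

vif_by_hand
all(vif_by_hand < 10)  # the rule of thumb used in this analysis
```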
Our model performs well in predicting happiness_score, based on its adjusted R-squared and MAE. However, of the assumption tests, model_backward passes only the multicollinearity test: the residuals are not normally distributed and they are heteroscedastic. As a consequence, the standard errors and p-values of the model may be unreliable, which could lead to biased conclusions when interpreting it. The model therefore needs to be improved, for example by scaling or transforming the variables.
Based on the regression above, it is important for the government to maintain economic conditions (GDP), give society freedom in making life choices, help people stay healthy by providing good public health services, support them with social support, and improve their perception of the country's corruption level. These are the variables that can make people happier.