Background

The data come from a survey on happiness and contain several factors related to happiness.

My purpose in using this data is to identify the important factors that influence happiness.

Data Description:

Score: Happiness score
GDP.per.capita: Effect of gross domestic product per capita
Social.support: Effect of social support
Healthy.life.expectancy: Effect of healthy life expectancy
Freedom.to.make.life.choices: Effect of freedom to make life choices
Generosity: Effect of generosity (helping each other)
Perceptions.of.corruption: Effect of perceived corruption

Set Up

Load the libraries

library(GGally) #check correlation
library(dplyr) #piping data
library(rsample) #sampling data
library(tidyverse) #wrangling data
library(lmtest) #check assumption
library(car) #check vif
library(MLmetrics) #calculate error

Import Data

Import the data from the CSV file

rawdata <- read.csv("2019.csv", sep=",")
rawdata

Subsetting Data

Select the columns I need

happy <- rawdata[, c("Score",
                     "GDP.per.capita",
                     "Social.support",
                     "Healthy.life.expectancy",
                     "Freedom.to.make.life.choices",
                     "Generosity",
                     "Perceptions.of.corruption")]
happy

Data Inspection

Check Data Type

glimpse(happy)
## Rows: 156
## Columns: 7
## $ Score                        <dbl> 7.769, 7.600, 7.554, 7.494, 7.488, 7.4...
## $ GDP.per.capita               <dbl> 1.340, 1.383, 1.488, 1.380, 1.396, 1.4...
## $ Social.support               <dbl> 1.587, 1.573, 1.582, 1.624, 1.522, 1.5...
## $ Healthy.life.expectancy      <dbl> 0.986, 0.996, 1.028, 1.026, 0.999, 1.0...
## $ Freedom.to.make.life.choices <dbl> 0.596, 0.592, 0.603, 0.591, 0.557, 0.5...
## $ Generosity                   <dbl> 0.153, 0.252, 0.271, 0.354, 0.322, 0.2...
## $ Perceptions.of.corruption    <dbl> 0.393, 0.410, 0.341, 0.118, 0.298, 0.3...

All data types are already appropriate.

Check missing value

colSums(is.na(happy))
##                        Score               GDP.per.capita 
##                            0                            0 
##               Social.support      Healthy.life.expectancy 
##                            0                            0 
## Freedom.to.make.life.choices                   Generosity 
##                            0                            0 
##    Perceptions.of.corruption 
##                            0

No column has missing values.

Create Model

Cross Validation

The happy dataset is divided into two parts (a single hold-out split) for training and testing:

1) happy_train: about 80% of the dataset, used to train the model

2) happy_test: about 20% of the dataset, used to test the model

RNGkind(sample.kind = "Rounding")
set.seed(1616)
init <- initial_split(happy,
                      prop = 0.8,
                      strata = Score) 
happy_train <- training(init) 
happy_test <- testing(init) 
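
What the split does can be illustrated without rsample. The sketch below is a simplified base R version on a hypothetical toy data frame (`toy`, `toy_train`, and `toy_test` are illustration names, not part of the analysis). Unlike `initial_split()` with `strata = Score`, it does not stratify, so it only approximates the behaviour used above.

```r
# Simplified, non-stratified 80/20 hold-out split in base R
set.seed(1616)

# Hypothetical toy data, for illustration only
toy <- data.frame(id = 1:100, y = rnorm(100))

# Draw 80% of the row indices at random for training
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

toy_train <- toy[train_idx, ]   # rows used to fit the model
toy_test  <- toy[-train_idx, ]  # remaining rows, held out for evaluation

# The two parts should not share any rows
shared_rows <- intersect(toy_train$id, toy_test$id)
```

Stratifying on the target, as done above, additionally keeps the distribution of Score similar in both parts, which is why the real split is not exactly 80/20.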

Create Regression Model

Create a linear regression model and select the predictor variables with backward stepwise elimination

rawmodelhappy <- lm(Score~.,happy_train)
model_rawmodelhappy <- step(rawmodelhappy, direction = "backward")
## Start:  AIC=-151.4
## Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption
## 
##                                Df Sum of Sq    RSS     AIC
## - Perceptions.of.corruption     1    0.0592 35.218 -153.18
## <none>                                      35.158 -151.40
## - Generosity                    1    0.7186 35.877 -150.81
## - Healthy.life.expectancy       1    2.8395 37.998 -143.46
## - GDP.per.capita                1    3.3265 38.485 -141.83
## - Freedom.to.make.life.choices  1    3.5849 38.743 -140.97
## - Social.support                1    5.3416 40.500 -135.29
## 
## Step:  AIC=-153.18
## Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity
## 
##                                Df Sum of Sq    RSS     AIC
## <none>                                      35.218 -153.18
## - Generosity                    1    0.9017 36.119 -151.95
## - Healthy.life.expectancy       1    2.8751 38.093 -145.14
## - GDP.per.capita                1    3.6811 38.899 -142.46
## - Freedom.to.make.life.choices  1    4.2415 39.459 -140.63
## - Social.support                1    5.2980 40.516 -137.24

The variable “Perceptions.of.corruption” is removed because dropping it lowers the AIC (from -151.40 to -153.18), which means it contributes little to explaining the target variable “Score”.
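
The AIC values printed by `step()` can be reproduced by hand. For a linear model, `step()` relies on `extractAIC()`, which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below plugs in the RSS values from the output above and assumes n = 128 training rows, which is implied by the residual degrees of freedom in the model summary; `aic_lm` is a helper name introduced here for illustration.

```r
# AIC as computed by extractAIC() for lm objects: n * log(RSS / n) + 2 * edf
aic_lm <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n <- 128  # training rows (assumption derived from the residual degrees of freedom)

# Full model: intercept + 6 predictors
aic_full <- aic_lm(rss = 35.158, n = n, edf = 7)

# Reduced model: intercept + 5 predictors (without Perceptions.of.corruption)
aic_reduced <- aic_lm(rss = 35.218, n = n, edf = 6)

round(c(full = aic_full, reduced = aic_reduced), 2)
```

Up to rounding, these match the -151.40 and -153.18 reported by `step()`: the RSS barely increases when the variable is dropped, while the penalty term shrinks by 2.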

summary(model_rawmodelhappy)
## 
## Call:
## lm(formula = Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity, data = happy_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6969 -0.3842  0.0169  0.3877  1.2157 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.7053     0.2364   7.215 5.00e-11 ***
## GDP.per.capita                 0.8582     0.2403   3.571 0.000510 ***
## Social.support                 1.0953     0.2557   4.284 3.68e-05 ***
## Healthy.life.expectancy        1.1574     0.3667   3.156 0.002016 ** 
## Freedom.to.make.life.choices   1.5091     0.3937   3.833 0.000201 ***
## Generosity                     0.9138     0.5170   1.767 0.079663 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5373 on 122 degrees of freedom
## Multiple R-squared:  0.7668, Adjusted R-squared:  0.7572 
## F-statistic: 80.22 on 5 and 122 DF,  p-value: < 2.2e-16

The variable “Generosity” has a p-value greater than 0.05 (0.0797), which means it is not significant at the 5% level, so it can be removed.

Re-create Regression Model
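
The p-value for Generosity can be recomputed from the coefficient table. This is a minimal sketch that uses the rounded estimate and standard error printed above, so the result only matches the reported value approximately.

```r
# Values taken from the summary output above (rounded)
estimate  <- 0.9138
std_error <- 0.5170
df_resid  <- 122   # residual degrees of freedom of the model

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

c(t_value = t_value, p_value = p_value)
```

The p-value lands between 0.05 and 0.10, which is why the summary marks Generosity with a `.` rather than a `*`.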

model_linear_happy <- lm(Score~ . -Perceptions.of.corruption -Generosity,happy_train)
model_happy <- step(model_linear_happy, direction = "backward")
## Start:  AIC=-151.95
## Score ~ (GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption) - 
##     Perceptions.of.corruption - Generosity
## 
##                                Df Sum of Sq    RSS     AIC
## <none>                                      36.119 -151.95
## - Healthy.life.expectancy       1    3.1068 39.226 -143.38
## - GDP.per.capita                1    3.3173 39.437 -142.70
## - Social.support                1    4.8778 40.997 -137.73
## - Freedom.to.make.life.choices  1    6.1861 42.305 -133.71
summary(model_happy)
## 
## Call:
## lm(formula = Score ~ (GDP.per.capita + Social.support + Healthy.life.expectancy + 
##     Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption) - 
##     Perceptions.of.corruption - Generosity, data = happy_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.84158 -0.35063  0.01071  0.40918  1.16965 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.8590     0.2217   8.385 9.87e-14 ***
## GDP.per.capita                 0.8092     0.2408   3.361  0.00103 ** 
## Social.support                 1.0442     0.2562   4.076 8.17e-05 ***
## Healthy.life.expectancy        1.2005     0.3691   3.253  0.00148 ** 
## Freedom.to.make.life.choices   1.7290     0.3767   4.590 1.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5419 on 123 degrees of freedom
## Multiple R-squared:  0.7608, Adjusted R-squared:  0.753 
## F-statistic: 97.81 on 4 and 123 DF,  p-value: < 2.2e-16

Based on the summary of model_happy, we get the following information:

1. The AIC of the final model is -151.95. step() removes no further variable because every removal would raise the AIC. Note that this is slightly higher than the -153.18 of the model that still included Generosity.

2. The adjusted R-squared is 0.753, which means the model explains about 75.3% of the variation in the target variable (Score)

3. The p-value of each predictor variable is less than 0.05, which means each predictor has a significant effect on the target variable (Score)

4. Each predictor has a positive coefficient, so an increase in any of them raises the predicted Score. For more details, see the following formula:

\[ \begin{aligned} \text{Score} = {} & 1.8590 + (0.8092 \times \text{GDP.per.capita}) + (1.0442 \times \text{Social.support}) \\ & + (1.2005 \times \text{Healthy.life.expectancy}) + (1.7290 \times \text{Freedom.to.make.life.choices}) \end{aligned} \]
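
Points 2 and 4 can be checked by hand. The sketch below recomputes the adjusted R-squared from the reported multiple R-squared (assuming n = 128 training rows and 4 predictors, as implied by the degrees of freedom), and applies the formula to row 3 of the data, whose predictor values appear in the `glimpse()` output. The coefficients are the rounded ones from the summary, so results agree with R's output only up to rounding.

```r
# Adjusted R-squared from the multiple R-squared
r2 <- 0.7608
n  <- 128   # training rows (assumption derived from the degrees of freedom)
p  <- 4     # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Rounded coefficients from the model summary
coefs <- c(intercept = 1.8590,
           gdp       = 0.8092,
           social    = 1.0442,
           health    = 1.2005,
           freedom   = 1.7290)

# Row 3 of the data: a leading 1 for the intercept, then the four predictors
row3 <- c(1, 1.488, 1.582, 1.028, 0.603)

# Predicted score is the sum of coefficient * value
score_hat <- sum(coefs * row3)

round(c(adj_r2 = adj_r2, score_hat = score_hat), 4)
```

The predicted score of about 6.99 for row 3 is what `predict()` should return for the same row when it is part of the test set.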

Evaluation

Prediction

After building the formula for the target variable, the next step is to evaluate it on the test data

pred_test <- predict(model_happy, newdata = happy_test)
head(pred_test)
##        3        6       11       18       22       24 
## 6.991747 6.879350 6.792441 6.528286 6.672624 6.475829

Error Check

RMSE(pred_test, happy_test$Score)
## [1] 0.5324381

Checking the error with RMSE (Root Mean Squared Error) shows that the model's predictions deviate from the actual Score by about 0.53 on average on the test data.
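
For reference, RMSE is the square root of the mean squared difference between predicted and actual values. The sketch below is a base R version applied to hypothetical toy vectors (`toy_pred` and `toy_actual` are made-up numbers, not taken from the dataset); `rmse` is a helper name introduced here.

```r
# Root Mean Squared Error: sqrt(mean((predicted - actual)^2))
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Hypothetical values, for illustration only
toy_pred   <- c(7.0, 6.5, 6.0)
toy_actual <- c(7.5, 6.5, 5.5)

toy_rmse <- rmse(toy_pred, toy_actual)
toy_rmse
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones. The result is in the same units as Score.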

Assumption

Multicollinearity

vif(model_happy)
##               GDP.per.capita               Social.support 
##                     3.695632                     2.408059 
##      Healthy.life.expectancy Freedom.to.make.life.choices 
##                     3.371526                     1.226366

All VIF values are lower than 10, which means there is no strong multicollinearity among the predictors in our final model.
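
The VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on R's built-in `mtcars` data (chosen only so the example is self-contained; `manual_vif` is a helper name introduced here), and then inverts the VIF reported above for GDP.per.capita.

```r
# VIF for each predictor: regress it on the other predictors, then 1 / (1 - R^2)
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Illustration on built-in data, not on the happiness dataset
vifs <- manual_vif(mtcars, c("wt", "hp", "disp"))
vifs

# Implied R^2 behind the reported VIF of GDP.per.capita
r2_gdp <- 1 - 1 / 3.695632
r2_gdp
```

A VIF of about 3.7 therefore means roughly 73% of the variation in GDP.per.capita is explained by the other three predictors, which is noticeable but below the usual threshold of 10.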

Normality

qqPlot(model_happy$residuals)

## 148 153 
## 122 126
plot(density(model_happy$residuals))

The plots above show that the residuals look approximately normally distributed.

Heteroscedasticity
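
A formal test can complement the visual check. One common option is the Shapiro-Wilk test, available in base R as `shapiro.test()`. The sketch below runs it on a model fitted to the built-in `mtcars` data so that it is self-contained; with the analysis above, the same call would be applied to `model_happy$residuals`.

```r
# Illustration model on built-in data, not on the happiness dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
sw_p <- sw$p.value

# A p-value above 0.05 means normality is not rejected at the 5% level
sw_p
```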

plot(model_happy$fitted.values, # predictions
     model_happy$residuals) # errors (residuals)

In the plot above, the residuals form a visible shape instead of a random, evenly spread cloud, which suggests that heteroscedasticity is present.
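
The visual impression can be backed by a Breusch-Pagan test. The lmtest package loaded earlier provides `bptest()` for this. The sketch below shows the idea behind the studentized (Koenker) form of the test in base R on the built-in `mtcars` data: regress the squared residuals on the predictors and compare n * R² with a chi-squared distribution. It is an illustration under these assumptions, not a result for model_happy.

```r
# Illustration model on built-in data, not on the happiness dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(mtcars, sq_res = residuals(fit)^2)
aux <- lm(sq_res ~ wt + hp, data = aux_data)

# Test statistic n * R^2, with degrees of freedom = number of predictors
bp_stat <- nrow(aux_data) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# A p-value below 0.05 would indicate heteroscedasticity
c(statistic = bp_stat, df = bp_df, p_value = bp_p)
```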

Linearity

data.frame(prediction=model_happy$fitted.values,
     error=model_happy$residuals) %>% 
  ggplot(aes(prediction,error)) +
  geom_hline(yintercept=0) +
  geom_point() +
  geom_smooth() +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

In the plot above there is little to no discernible pattern in the residuals, so we can conclude that the relationship is approximately linear.

Conclusion

Based on the prediction model (model_happy), it can be concluded:

1. R-Squared: model_happy has an adjusted R-squared of 0.753, which means it explains about 75.3% of the variation in the target variable “Score”

2. Error: measured with RMSE (Root Mean Squared Error), model_happy has a test error of 0.5324381, so its predictions deviate from the actual data by about 0.53 on average

3. Assumption

Based on the plots, the model satisfies the multicollinearity, normality, and linearity assumptions, although the residual plot suggests some heteroscedasticity.

So it can be concluded that the important variables in determining happiness are those in model_happy: GDP.per.capita, Social.support, Healthy.life.expectancy, and Freedom.to.make.life.choices.