Load Libraries
library(GGally)    # check correlation
library(dplyr)     # piping data
library(rsample)   # sampling data
library(tidyverse) # wrangling data
library(lmtest)    # check assumptions
library(car)       # check VIF
library(MLmetrics) # calculate error
Import Data
Import the data from a CSV file
rawdata <- read.csv("2019.csv", sep=",")
rawdata
Subsetting Data
Select only the columns needed
happy <- rawdata[, c("Score", "GDP.per.capita", "Social.support",
                     "Healthy.life.expectancy", "Freedom.to.make.life.choices",
                     "Generosity", "Perceptions.of.corruption")]
happy
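Before modelling, it can help to glance at how Score correlates with the candidate predictors. A minimal sketch using ggcorr() from the GGally package loaded above (the styling arguments are just one possible choice):
ggcorr(happy, label = TRUE, label_size = 3) # pairwise Pearson correlations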
Check Data Type
glimpse(happy)
## Rows: 156
## Columns: 7
## $ Score <dbl> 7.769, 7.600, 7.554, 7.494, 7.488, 7.4...
## $ GDP.per.capita <dbl> 1.340, 1.383, 1.488, 1.380, 1.396, 1.4...
## $ Social.support <dbl> 1.587, 1.573, 1.582, 1.624, 1.522, 1.5...
## $ Healthy.life.expectancy <dbl> 0.986, 0.996, 1.028, 1.026, 0.999, 1.0...
## $ Freedom.to.make.life.choices <dbl> 0.596, 0.592, 0.603, 0.591, 0.557, 0.5...
## $ Generosity <dbl> 0.153, 0.252, 0.271, 0.354, 0.322, 0.2...
## $ Perceptions.of.corruption <dbl> 0.393, 0.410, 0.341, 0.118, 0.298, 0.3...
All data types are already appropriate
Check Missing Values
colSums(is.na(happy))
## Score GDP.per.capita
## 0 0
## Social.support Healthy.life.expectancy
## 0 0
## Freedom.to.make.life.choices Generosity
## 0 0
## Perceptions.of.corruption
## 0
No column has missing values
Cross Validation
The happy dataset is divided into two parts, for training and testing:
1) happy_train: 80% of the dataset, used to train the model
2) happy_test: 20% of the dataset, used to test the model
RNGkind(sample.kind = "Rounding")
set.seed(1616)
init <- initial_split(happy,
                      prop = 0.8,
                      strata = Score)
happy_train <- training(init)
happy_test <- testing(init)
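As a sanity check on the split sizes (with this seed the training set should have 128 rows and the test set 28, consistent with the residual degrees of freedom plus the number of coefficients reported in the model summaries below):
nrow(happy_train) # expected: 128
nrow(happy_test)  # expected: 28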
Create Regression Model
Create a linear regression model and select predictor variables with the backward stepwise method
rawmodelhappy <- lm(Score ~ ., data = happy_train)
model_rawmodelhappy <- step(rawmodelhappy, direction = "backward")
## Start: AIC=-151.4
## Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy +
## Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption
##
## Df Sum of Sq RSS AIC
## - Perceptions.of.corruption 1 0.0592 35.218 -153.18
## <none> 35.158 -151.40
## - Generosity 1 0.7186 35.877 -150.81
## - Healthy.life.expectancy 1 2.8395 37.998 -143.46
## - GDP.per.capita 1 3.3265 38.485 -141.83
## - Freedom.to.make.life.choices 1 3.5849 38.743 -140.97
## - Social.support 1 5.3416 40.500 -135.29
##
## Step: AIC=-153.18
## Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy +
## Freedom.to.make.life.choices + Generosity
##
## Df Sum of Sq RSS AIC
## <none> 35.218 -153.18
## - Generosity 1 0.9017 36.119 -151.95
## - Healthy.life.expectancy 1 2.8751 38.093 -145.14
## - GDP.per.capita 1 3.6811 38.899 -142.46
## - Freedom.to.make.life.choices 1 4.2415 39.459 -140.63
## - Social.support 1 5.2980 40.516 -137.24
The variable “Perceptions.of.corruption” is removed because dropping it lowers the AIC, i.e., it does not meaningfully affect the target variable “Score”
summary(model_rawmodelhappy)
##
## Call:
## lm(formula = Score ~ GDP.per.capita + Social.support + Healthy.life.expectancy +
## Freedom.to.make.life.choices + Generosity, data = happy_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6969 -0.3842 0.0169 0.3877 1.2157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7053 0.2364 7.215 5.00e-11 ***
## GDP.per.capita 0.8582 0.2403 3.571 0.000510 ***
## Social.support 1.0953 0.2557 4.284 3.68e-05 ***
## Healthy.life.expectancy 1.1574 0.3667 3.156 0.002016 **
## Freedom.to.make.life.choices 1.5091 0.3937 3.833 0.000201 ***
## Generosity 0.9138 0.5170 1.767 0.079663 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5373 on 122 degrees of freedom
## Multiple R-squared: 0.7668, Adjusted R-squared: 0.7572
## F-statistic: 80.22 on 5 and 122 DF, p-value: < 2.2e-16
The variable “Generosity” has a p-value > 0.05, which means it is not significant, so it can also be removed
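As an aside, the same reduced model could also be obtained with update(), refitting rawmodelhappy without the two dropped predictors; a sketch equivalent to the explicit formula in the next section:
# refit the full model without the two non-significant predictors
model_reduced <- update(rawmodelhappy, . ~ . - Perceptions.of.corruption - Generosity)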
Recreate the Regression Model
model_linear_happy <- lm(Score ~ . - Perceptions.of.corruption - Generosity, data = happy_train)
model_happy <- step(model_linear_happy, direction = "backward")
## Start: AIC=-151.95
## Score ~ (GDP.per.capita + Social.support + Healthy.life.expectancy +
## Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption) -
## Perceptions.of.corruption - Generosity
##
## Df Sum of Sq RSS AIC
## <none> 36.119 -151.95
## - Healthy.life.expectancy 1 3.1068 39.226 -143.38
## - GDP.per.capita 1 3.3173 39.437 -142.70
## - Social.support 1 4.8778 40.997 -137.73
## - Freedom.to.make.life.choices 1 6.1861 42.305 -133.71
summary(model_happy)
##
## Call:
## lm(formula = Score ~ (GDP.per.capita + Social.support + Healthy.life.expectancy +
## Freedom.to.make.life.choices + Generosity + Perceptions.of.corruption) -
## Perceptions.of.corruption - Generosity, data = happy_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.84158 -0.35063 0.01071 0.40918 1.16965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8590 0.2217 8.385 9.87e-14 ***
## GDP.per.capita 0.8092 0.2408 3.361 0.00103 **
## Social.support 1.0442 0.2562 4.076 8.17e-05 ***
## Healthy.life.expectancy 1.2005 0.3691 3.253 0.00148 **
## Freedom.to.make.life.choices 1.7290 0.3767 4.590 1.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5419 on 123 degrees of freedom
## Multiple R-squared: 0.7608, Adjusted R-squared: 0.753
## F-statistic: 97.81 on 4 and 123 DF, p-value: < 2.2e-16
Based on the summary of model_happy, we obtain the following information:
1. The best (lowest) AIC score is -151.95
2. The Adjusted R-squared value is 0.753, meaning the model can explain 75.3% of the variation in the target variable (Score)
3. The p-value of each predictor variable is less than 0.05, meaning every predictor has a significant effect on the target variable (Score)
4. Each variable contributes additively to the target variable; for the details, see the following formula:
\[ \begin{aligned} Score = {} & 1.8590 + (0.8092 \times GDP.per.capita) + (1.0442 \times Social.support) \\ & + (1.2005 \times Healthy.life.expectancy) + (1.7290 \times Freedom.to.make.life.choices) \end{aligned} \]
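To make the formula concrete, the coefficients can be applied by hand to one observation and compared against predict(); a minimal sketch:
b <- coef(model_happy) # fitted coefficients
x <- happy_test[1, ]   # first test observation
manual <- unname(b["(Intercept)"] +
  b["GDP.per.capita"] * x$GDP.per.capita +
  b["Social.support"] * x$Social.support +
  b["Healthy.life.expectancy"] * x$Healthy.life.expectancy +
  b["Freedom.to.make.life.choices"] * x$Freedom.to.make.life.choices)
manual # should equal predict(model_happy, newdata = x)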
Prediction
After obtaining the formula for the target variable, the next step is to evaluate the model on the test data
pred_test <- predict(model_happy, newdata = happy_test)
head(pred_test)
## 3 6 11 18 22 24
## 6.991747 6.879350 6.792441 6.528286 6.672624 6.475829
Error Check
RMSE(pred_test, happy_test$Score)
## [1] 0.5324381
Checking the error with RMSE (Root Mean Squared Error) gives 0.5324381, meaning the model's predictions deviate from the actual data by about 0.53 Score points on average
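Because RMSE is expressed in the same units as Score, other error functions from the MLmetrics package loaded above can add context; a minimal sketch:
MAE(pred_test, happy_test$Score)  # mean absolute error, same units as Score
MAPE(pred_test, happy_test$Score) # mean absolute percentage error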
Multicollinearity
vif(model_happy)
## GDP.per.capita Social.support
## 3.695632 2.408059
## Healthy.life.expectancy Freedom.to.make.life.choices
## 3.371526 1.226366
All VIF values are below 10, which means there is no severe multicollinearity: the predictors in the final model are sufficiently independent of one another
Normality
qqPlot(model_happy$residuals)
## 148 153
## 122 126
plot(density(model_happy$residuals))
From the plots above, the residuals look approximately normally distributed.
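Beyond the plots, base R's Shapiro-Wilk test offers a formal normality check; a minimal sketch (H0: the residuals are normally distributed, so a p-value above 0.05 supports normality):
shapiro.test(model_happy$residuals) # Shapiro-Wilk normality test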
Heteroscedasticity
plot(model_happy$fitted.values, # predictions (fitted values)
     model_happy$residuals)     # errors (residuals)
Checking the plot above, the residuals show some visible shape rather than a purely random scatter, which suggests heteroscedasticity may be present; the formal test below can confirm this.
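Beyond eyeballing the plot, the lmtest package loaded above provides the Breusch-Pagan test as a formal check; a minimal sketch (H0: constant residual variance, so a p-value above 0.05 indicates homoscedasticity):
bptest(model_happy) # Breusch-Pagan test for heteroscedasticity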
Linearity
data.frame(prediction=model_happy$fitted.values,
error=model_happy$residuals) %>%
ggplot(aes(prediction,error)) +
geom_hline(yintercept=0) +
geom_point() +
geom_smooth() +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the plot above, there is little to no discernible pattern in the residuals, so we can conclude that the model is linear.
Based on the prediction model (model_happy), it can be concluded that:
1. R-squared: model_happy has an Adjusted R-squared of 0.753, meaning it can explain 75.3% of the variation in the target variable “Score”
2. Error: from the error test using RMSE (Root Mean Squared Error), model_happy scores 0.5324381, so its predictions deviate from the actual data by only about 0.53 on average
3. Assumptions: based on the graphs and tests above, the model broadly satisfies the regression assumption checks
So it can be concluded that the variables important in determining happiness can be read from model_happy: GDP.per.capita, Social.support, Healthy.life.expectancy, and Freedom.to.make.life.choices.
Social.support: the effect of social support