\[{\LARGE \text{INTRODUCTION}}\]
The dataset used for this analysis is the Happiness dataset. It contains several variables related to the level of happiness in a number of countries. The happiness score in the data is built from seven components: economy (GDP per capita), family, health (life expectancy), freedom, trust in government (corruption), generosity, and the dystopia residual. A country's rank is determined by its happiness score, which is the sum of these seven components: the higher the happiness score, the numerically lower (that is, better) the happiness rank. A higher value on any of the seven components therefore means a higher level of happiness. Each component can be read as the extent to which that factor contributes to a country's happiness score.
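The relationship between the components, the score, and the rank can be checked on the first three rows printed by head(data) later in this report. This is only a minimal sketch: the data frame `toy` and its shortened column names are hypothetical stand-ins typed in by hand from those printed values, not the full dataset.

```r
# Hand-copied from the first three rows of the head(data) output in this report.
# `toy` and the shortened column names are illustrative assumptions.
toy <- data.frame(
  Country    = c("Switzerland", "Iceland", "Denmark"),
  Score      = c(7.587, 7.561, 7.527),
  Economy    = c(1.39651, 1.30232, 1.32548),
  Family     = c(1.34951, 1.40223, 1.36058),
  Health     = c(0.94143, 0.94784, 0.87464),
  Freedom    = c(0.66557, 0.62877, 0.64938),
  Trust      = c(0.41978, 0.14145, 0.48357),
  Generosity = c(0.29678, 0.43630, 0.34139),
  Dystopia   = c(2.51738, 2.70201, 2.49204)
)

components <- c("Economy", "Family", "Health", "Freedom",
                "Trust", "Generosity", "Dystopia")

# Sum of the seven components for each country
component_sum <- rowSums(toy[, components])

# Rank 1 goes to the highest score, so rank on the negated score
derived_rank <- rank(-toy$Score)

print(data.frame(Country = toy$Country,
                 Score = toy$Score,
                 component_sum = round(component_sum, 3),
                 derived_rank = derived_rank))
```

For these rows the component sum matches the published score to three decimals, which is why a higher score always goes with a numerically lower rank.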
\[{\LARGE \text{Data Preparation}}\]
library(ggplot2) #Used for data visualization
library(dplyr) #Used for data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(car) #Used for checking regression assumptions (for example vif())
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(lmtest) #Used to test for heteroskedasticity (bptest())
## Warning: package 'lmtest' was built under R version 4.1.2
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(GGally) #Used for scatter-matrix and correlation plots (ggcorr())
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(MLmetrics) #Used for model evaluation metrics
## Warning: package 'MLmetrics' was built under R version 4.1.2
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(lime) #Used to build explainer functions for models fitted on the training data
## Warning: package 'lime' was built under R version 4.1.2
##
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
##
## explain
library(rsample) #Provides general resampling infrastructure (initial_split())
## Warning: package 'rsample' was built under R version 4.1.2
library(caret) #Used to evaluate the performance of classification and regression models
## Warning: package 'caret' was built under R version 4.1.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:MLmetrics':
##
## MAE, RMSE
library(randomForest) #Random forests, which improve on the performance of single decision trees
## Warning: package 'randomForest' was built under R version 4.1.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caTools) #Used to split the data (sample.split())
## Warning: package 'caTools' was built under R version 4.1.2
data <- read.csv("Happiness.csv") #Read the data
head(data) #Show the first 6 rows
## Country Region Happiness.Rank Happiness.Score Standard.Error
## 1 Switzerland Western Europe 1 7.587 0.03411
## 2 Iceland Western Europe 2 7.561 0.04884
## 3 Denmark Western Europe 3 7.527 0.03328
## 4 Norway Western Europe 4 7.522 0.03880
## 5 Canada North America 5 7.427 0.03553
## 6 Finland Western Europe 6 7.406 0.03140
## Economy..GDP.per.Capita. Family Health..Life.Expectancy. Freedom
## 1 1.39651 1.34951 0.94143 0.66557
## 2 1.30232 1.40223 0.94784 0.62877
## 3 1.32548 1.36058 0.87464 0.64938
## 4 1.45900 1.33095 0.88521 0.66973
## 5 1.32629 1.32261 0.90563 0.63297
## 6 1.29025 1.31826 0.88911 0.64169
## Trust..Government.Corruption. Generosity Dystopia.Residual
## 1 0.41978 0.29678 2.51738
## 2 0.14145 0.43630 2.70201
## 3 0.48357 0.34139 2.49204
## 4 0.36503 0.34699 2.46531
## 5 0.32957 0.45811 2.45176
## 6 0.41372 0.23351 2.61955
str(data) #Show the structure of the data: the type and first values of each column
## 'data.frame': 158 obs. of 12 variables:
## $ Country : chr "Switzerland" "Iceland" "Denmark" "Norway" ...
## $ Region : chr "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
## $ Happiness.Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Happiness.Score : num 7.59 7.56 7.53 7.52 7.43 ...
## $ Standard.Error : num 0.0341 0.0488 0.0333 0.0388 0.0355 ...
## $ Economy..GDP.per.Capita. : num 1.4 1.3 1.33 1.46 1.33 ...
## $ Family : num 1.35 1.4 1.36 1.33 1.32 ...
## $ Health..Life.Expectancy. : num 0.941 0.948 0.875 0.885 0.906 ...
## $ Freedom : num 0.666 0.629 0.649 0.67 0.633 ...
## $ Trust..Government.Corruption.: num 0.42 0.141 0.484 0.365 0.33 ...
## $ Generosity : num 0.297 0.436 0.341 0.347 0.458 ...
## $ Dystopia.Residual : num 2.52 2.7 2.49 2.47 2.45 ...
colSums(is.na(data)) #Count the missing values in each column
## Country Region
## 0 0
## Happiness.Rank Happiness.Score
## 0 0
## Standard.Error Economy..GDP.per.Capita.
## 0 0
## Family Health..Life.Expectancy.
## 0 0
## Freedom Trust..Government.Corruption.
## 0 0
## Generosity Dystopia.Residual
## 0 0
anyNA(data) #Check for missing values anywhere in the data (FALSE: there are none, so the data can be processed)
## [1] FALSE
boxplot(data %>% select(-Country,-Region)) #Visualize the distribution of every column of the Happiness data except Country and Region
data = data[, -c(5)] #Data cleaning: drop column 5, Standard.Error, from the Happiness data
ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2) #Check the correlations between the numeric columns
## Warning in ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp =
## 2): data in column(s) 'Country', 'Region' are not numeric and were ignored
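The warning above appears because ggcorr() ignores non-numeric columns. A rough base-R equivalent, which selects the numeric columns explicitly before computing Pearson correlations, might look like the sketch below. The data frame `toy` is a hypothetical stand-in typed in from the six rows of the head(data) output, with shortened column names.

```r
# Hand-copied from the head(data) output; names shortened for illustration.
toy <- data.frame(
  Country = c("Switzerland", "Iceland", "Denmark", "Norway", "Canada", "Finland"),
  Score   = c(7.587, 7.561, 7.527, 7.522, 7.427, 7.406),
  Economy = c(1.39651, 1.30232, 1.32548, 1.45900, 1.32629, 1.29025),
  Family  = c(1.34951, 1.40223, 1.36058, 1.33095, 1.32261, 1.31826)
)

# Keep only the numeric columns, then compute the correlation matrix
numeric_cols <- toy[, sapply(toy, is.numeric), drop = FALSE]
corr_matrix <- cor(numeric_cols)

print(round(corr_matrix, 2))
```

Six rows are far too few to interpret the coefficients; the point is only the column selection step.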
\[ {\LARGE \text{Model 1. Simple Linear Regression}}\]
library(MASS) #Venables and Ripley's MASS package, used here for studres(); note that it masks dplyr::select
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
model_score<-lm(Happiness.Rank~Happiness.Score, data)#Fit the simple linear regression model
model_score #Print the fitted model
##
## Call:
## lm(formula = Happiness.Rank ~ Happiness.Score, data = data)
##
## Coefficients:
## (Intercept) Happiness.Score
## 292.61 -39.64
summary(model_score) #Show the summary of model_score
##
## Call:
## lm(formula = Happiness.Rank ~ Happiness.Score, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0608 -3.9687 0.0754 4.5849 9.5936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 292.6110 2.2049 132.71 <2e-16 ***
## Happiness.Score -39.6443 0.4012 -98.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.756 on 156 degrees of freedom
## Multiple R-squared: 0.9843, Adjusted R-squared: 0.9842
## F-statistic: 9763 on 1 and 156 DF, p-value: < 2.2e-16
#The simple linear regression model has Adjusted R-squared = 0.9842 (98%)
summary(model_score)$r.squared #Extract the multiple R-squared from the summary of model_score
## [1] 0.984273
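The adjusted R-squared in the summary can be reproduced from the multiple R-squared, the number of observations, and the number of predictors. The sketch below plugs in the values printed above (R-squared 0.984273, 158 observations, 1 predictor); the helper name `adjusted_r2` is illustrative.

```r
# Adjusted R-squared from R-squared (r2), sample size (n), and number of predictors (p)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj <- adjusted_r2(r2 = 0.984273, n = 158, p = 1)
print(adj)
```

Rounded to four decimals this agrees with the 0.9842 reported by summary(model_score).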
plot(Happiness.Rank~Happiness.Score, data=data, pch=1, xlab = "Happiness.Score", ylab = "Happiness.Rank") #Scatter plot with x = Happiness.Score and y = Happiness.Rank
sresid <- studres(model_score) #Studentized residuals of model_score
shapiro.test(sresid) #Tests whether the residuals are normally distributed. If the p-value is > 0.05, normality is not rejected. Here the p-value is 1.4e-05, which is below 0.05, so the studentized residuals of model_score are not normally distributed
##
## Shapiro-Wilk normality test
##
## data: sresid
## W = 0.9481, p-value = 1.4e-05
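The decision rule for the Shapiro-Wilk test can be written out explicitly: the null hypothesis is that the sample is normally distributed, and it is rejected when the p-value falls below the chosen significance level. The sketch below uses a hypothetical helper, `interpret_shapiro`, applied to the p-value printed above and to two deterministic samples built from evenly spaced quantiles.

```r
# Hypothetical helper: turn a Shapiro-Wilk p-value into a verbal conclusion
interpret_shapiro <- function(p_value, alpha = 0.05) {
  if (p_value > alpha) "no evidence against normality" else "normality rejected"
}

# p-value reported above for the studentized residuals of model_score
print(interpret_shapiro(1.4e-05))

# Deterministic samples: quantiles of a normal and of an exponential distribution
normal_like <- qnorm(ppoints(200))
skewed      <- qexp(ppoints(200))

p_normal <- shapiro.test(normal_like)$p.value
p_skewed <- shapiro.test(skewed)$p.value
print(c(normal_like = p_normal, skewed = p_skewed))
```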
\[{\LARGE \text{Comparison of Backward and Forward Methods in Linear Regression}}\]
model.backward<-step(object = model_score,direction = "backward", trace=T,data=data)#Backward elimination: start from the full model and try removing predictors one step at a time
## Start: AIC=555.08
## Happiness.Rank ~ Happiness.Score
##
## Df Sum of Sq RSS AIC
## <none> 5169 555.08
## - Happiness.Score 1 323504 328673 1209.16
Model.score.none<-lm(Happiness.Rank~1, data = data)
model.forward<-step(object = Model.score.none,direction = "forward",data = data,scope=list(lower=Model.score.none,upper=model_score)) #Forward selection: start from the intercept-only model and try adding predictors one step at a time
## Start: AIC=1209.16
## Happiness.Rank ~ 1
##
## Df Sum of Sq RSS AIC
## + Happiness.Score 1 323504 5169 555.08
## <none> 328673 1209.16
##
## Step: AIC=555.08
## Happiness.Rank ~ Happiness.Score
model.stepwise<-step(object = Model.score.none,data=data, direction = "both",scope = list(lower=Model.score.none, upper=model_score)) #Stepwise selection in both directions, used to find the best regression model by AIC
## Start: AIC=1209.16
## Happiness.Rank ~ 1
##
## Df Sum of Sq RSS AIC
## + Happiness.Score 1 323504 5169 555.08
## <none> 328673 1209.16
##
## Step: AIC=555.08
## Happiness.Rank ~ Happiness.Score
##
## Df Sum of Sq RSS AIC
## <none> 5169 555.08
## - Happiness.Score 1 323504 328673 1209.16
\[ \text{Comparing the backward, forward, and stepwise results by adj.r.squared} \]
summary(model.backward)$adj.r.squared
## [1] 0.9841721
summary(model.forward)$adj.r.squared
## [1] 0.9841721
summary(model.stepwise)$adj.r.squared
## [1] 0.9841721
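All three searches return the same model because the scope contains a single candidate predictor, Happiness.Score. A stepwise search only has something to choose from when the upper scope lists several candidates. The sketch below illustrates this on R's built-in mtcars data, used purely as a stand-in because it ships with R; the variables mpg, wt, hp, and qsec belong to mtcars, not to the Happiness data.

```r
# Intercept-only starting model on the stand-in data
null_model <- lm(mpg ~ 1, data = mtcars)

# Forward selection by AIC over three candidate predictors
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = ~ wt + hp + qsec),
                    direction = "forward",
                    trace = 0)

print(formula(forward_fit))
print(summary(forward_fit)$adj.r.squared)
```

With several candidates in the upper scope, the backward, forward, and both-direction searches can end at different models, which is when comparing their adjusted R-squared values becomes informative.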
\[ {\LARGE \text{Model 2. Multiple Linear Regression}}\]
final.model<-lm(Happiness.Rank ~ Happiness.Score+Health..Life.Expectancy.+Economy..GDP.per.Capita.+Family+Freedom+Trust..Government.Corruption.+Generosity, data= data)
summary(final.model)
##
## Call:
## lm(formula = Happiness.Rank ~ Happiness.Score + Health..Life.Expectancy. +
## Economy..GDP.per.Capita. + Family + Freedom + Trust..Government.Corruption. +
## Generosity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2965 -3.7153 0.2947 4.0706 11.3964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 291.0512 2.4846 117.142 <2e-16 ***
## Happiness.Score -39.1473 0.8310 -47.108 <2e-16 ***
## Health..Life.Expectancy. -6.1603 3.3300 -1.850 0.0663 .
## Economy..GDP.per.Capita. -0.2254 2.3607 -0.095 0.9241
## Family 1.4337 2.5576 0.561 0.5759
## Freedom -2.4135 4.0848 -0.591 0.5555
## Trust..Government.Corruption. 8.1398 4.5052 1.807 0.0728 .
## Generosity 5.9313 4.0059 1.481 0.1408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.627 on 150 degrees of freedom
## Multiple R-squared: 0.9855, Adjusted R-squared: 0.9849
## F-statistic: 1461 on 7 and 150 DF, p-value: < 2.2e-16
#The multiple linear regression model has Adjusted R-squared = 0.9849 (98%)
prediction <- predict(object = final.model, newdata = data )
head(prediction) #Show the first six ranks predicted by final.model
## 1 2 3 4 5 6
## -6.5680488 -6.8420777 -2.9529766 -3.8757139 0.1946160 0.4520993
test.linearirty<- data.frame(residual = final.model$residuals, fitted = final.model$fitted.values)
test.linearirty %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#The residual plot shows little or no visible pattern, so the linearity assumption appears reasonable for this model.
hist(final.model$residuals)#Histogram of the residuals of final.model
shapiro.test(final.model$residuals)#Tests whether the residuals are normally distributed. If the p-value is > 0.05, normality is not rejected. Here the p-value is below 0.05, so the residuals of final.model are not normally distributed
##
## Shapiro-Wilk normality test
##
## data: final.model$residuals
## W = 0.95468, p-value = 5.208e-05
vif(final.model) #Check final.model for multicollinearity; every VIF here is below 5, under the commonly used thresholds of 5 or 10
## Happiness.Score Health..Life.Expectancy.
## 4.489106 3.356490
## Economy..GDP.per.Capita. Family
## 4.490306 2.406064
## Freedom Trust..Government.Corruption.
## 1.878694 1.449949
## Generosity
## 1.276923
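The variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R, which can serve as a cross-check on car::vif(). It runs on the built-in mtcars data as a stand-in, and the helper name `manual_vif` is illustrative.

```r
# Hypothetical helper: VIF for each predictor via auxiliary regressions
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on the remaining predictors
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vif_values <- manual_vif(mtcars, c("wt", "hp", "qsec"))
print(round(vif_values, 2))
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values mean its coefficient variance is inflated by that factor.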
#Split the Happiness data into training and test sets
set.seed(123)
dataset <- data[4:11]
split = sample.split(data$Happiness.Score, SplitRatio = 0.75)
data.training = subset(dataset, split == TRUE)
data.test = subset(dataset, split == FALSE)
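For readers without caTools, a comparable 75/25 random split can be sketched in base R. This version simply samples row indices and makes no attempt to balance the outcome between the two sets. The data frame `toy` is synthetic and its column names are hypothetical.

```r
set.seed(123)

# Synthetic stand-in with the same number of rows as the Happiness data
toy <- data.frame(score = runif(158, 2.8, 7.6),
                  economy = runif(158, 0, 1.7))

n <- nrow(toy)

# Draw 75% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```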
\[{\LARGE \text{Model 3. Handle Outlier with Scaling}}\]
#Regression model (Handle Outlier with Scaling) with Happiness.Score as the response, fitted on the training data
model.linear <- lm(formula = Happiness.Score ~., data.training)
model.linear.step<- step(object = model.linear, direction = "both")
## Start: AIC=-1931.9
## Happiness.Score ~ Economy..GDP.per.Capita. + Family + Health..Life.Expectancy. +
## Freedom + Trust..Government.Corruption. + Generosity + Dystopia.Residual
##
## Df Sum of Sq RSS AIC
## <none> 0.000 -1931.90
## - Trust..Government.Corruption. 1 1.119 1.119 -535.70
## - Generosity 1 1.444 1.444 -505.62
## - Freedom 1 1.480 1.480 -502.71
## - Health..Life.Expectancy. 1 2.429 2.429 -444.23
## - Family 1 4.831 4.831 -363.08
## - Economy..GDP.per.Capita. 1 5.376 5.376 -350.48
## - Dystopia.Residual 1 34.328 34.328 -131.70
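summary(model.linear.step) #Summary of model.linear.step; Adjusted R-squared = 1 (100%). Every coefficient is approximately 1 and the intercept approximately 0, which indicates that Happiness.Score is essentially the sum of the seven predictors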
##
## Call:
## lm(formula = Happiness.Score ~ Economy..GDP.per.Capita. + Family +
## Health..Life.Expectancy. + Freedom + Trust..Government.Corruption. +
## Generosity + Dystopia.Residual, data = data.training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.451e-04 -2.128e-04 -1.252e-05 2.226e-04 5.033e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.860e-05 1.391e-04 0.565 0.573
## Economy..GDP.per.Capita. 1.000e+00 1.163e-04 8601.286 <2e-16 ***
## Family 9.999e-01 1.226e-04 8153.926 <2e-16 ***
## Health..Life.Expectancy. 9.999e-01 1.730e-04 5781.443 <2e-16 ***
## Freedom 9.999e-01 2.216e-04 4512.472 <2e-16 ***
## Trust..Government.Corruption. 9.999e-01 2.548e-04 3923.865 <2e-16 ***
## Generosity 1.000e+00 2.244e-04 4457.285 <2e-16 ***
## Dystopia.Residual 1.000e+00 4.601e-05 21735.552 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002696 on 110 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.999e+08 on 7 and 110 DF, p-value: < 2.2e-16
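An adjusted R-squared of 1, with every coefficient close to 1 and an intercept close to 0, is what a regression returns when the response is, up to rounding, the sum of its predictors. The simulation below is a sketch of that situation rather than a rerun of the analysis: the components c1 to c3 are synthetic, and the total is rounded to three decimals to mimic a published score.

```r
set.seed(42)
n <- 120

# Synthetic components; the names c1, c2, c3 are hypothetical
sim <- data.frame(c1 = runif(n, 0, 1.5),
                  c2 = runif(n, 0, 1.4),
                  c3 = runif(n, 0.3, 3.6))

# The response is the rounded sum of the components
sim$total <- round(sim$c1 + sim$c2 + sim$c3, 3)

fit <- lm(total ~ ., data = sim)

print(round(coef(fit), 4))
print(summary(fit)$r.squared)
print(max(abs(residuals(fit))))
```

The only residual variation left is the rounding error, so a fit like this says little about predictive skill.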
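#Predict Happiness.Score for data.test with the stepwise linear model
prediction_test <- predict(object = model.linear.step,newdata = data.test, interval = "confidence", level = 0.95)
#Compute the mean absolute error (MAE) on data.test; note that prediction_test is a matrix (fit, lwr, upr), so this averages over all three columns rather than the fit column alone
MAE<-mean(abs(prediction_test - data.test$Happiness.Score))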
MAE
## [1] 0.0002922013
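#Predict Happiness.Score for data.training with the stepwise linear model
prediction_train <- predict(object = model.linear.step,newdata = data.training, interval = "confidence", level = 0.95)
#Compute the mean absolute error (MAE) on data.training; note that prediction_train is a matrix (fit, lwr, upr), so this averages over all three columns rather than the fit column alone
MAE<-mean(abs(prediction_train - data.training$Happiness.Score))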
MAE
## [1] 0.0002373886
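With interval = "confidence", predict() returns a matrix with the columns fit, lwr, and upr, so subtracting the observed values from the whole matrix mixes the point predictions with the interval bounds. The sketch below, on synthetic data with hypothetical names, compares that calculation with an MAE based on the fit column only.

```r
set.seed(7)

# Synthetic data: y depends linearly on x plus noise
sim <- data.frame(x = runif(100))
sim$y <- 2 + 3 * sim$x + rnorm(100, sd = 0.2)

train <- sim[1:75, ]
test  <- sim[76:100, ]

fit <- lm(y ~ x, data = train)

# Matrix with one row per test observation and columns fit, lwr, upr
pred <- predict(fit, newdata = test, interval = "confidence", level = 0.95)
print(colnames(pred))

# MAE from the point predictions only
mae_fit_only <- mean(abs(pred[, "fit"] - test$y))

# Same formula applied to the whole matrix: also averages over lwr and upr
mae_all_cols <- mean(abs(pred - test$y))

print(c(fit_only = mae_fit_only, all_columns = mae_all_cols))
```

The two numbers generally differ, and only the first one is the mean absolute error of the predictions.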
#Normality Test
shapiro.test(model.linear$residuals)
##
## Shapiro-Wilk normality test
##
## data: model.linear$residuals
## W = 0.97434, p-value = 0.02321
#Heteroskedasticity assumption
bptest(formula = model.linear)
##
## studentized Breusch-Pagan test
##
## data: model.linear
## BP = 13.344, df = 7, p-value = 0.06417
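In the output above the p-value of 0.06417 exceeds 0.05, so constant residual variance is not rejected at the 5% level. The studentized Breusch-Pagan statistic itself can be reproduced by hand: regress the squared residuals on the model's predictors and multiply the R-squared of that auxiliary regression by the number of observations. Under the null hypothesis the statistic is approximately chi-squared with as many degrees of freedom as there are predictors. The base-R sketch below uses the built-in mtcars data as a stand-in and is expected to agree with the default studentized form of lmtest::bptest().

```r
# Main regression on the stand-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression of squared residuals on the same predictors
aux_data <- data.frame(sq_resid = residuals(fit)^2,
                       wt = mtcars$wt,
                       hp = mtcars$hp)
aux <- lm(sq_resid ~ wt + hp, data = aux_data)

# Test statistic: n times the auxiliary R-squared
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_df   <- 2  # number of predictors in the auxiliary regression
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = p_value))
```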
#Variance Inflation Factor assumption
vif(model.linear)
## Economy..GDP.per.Capita. Family
## 3.871583 1.963873
## Health..Life.Expectancy. Freedom
## 2.929787 1.774576
## Trust..Government.Corruption. Generosity
## 1.471724 1.257062
## Dystopia.Residual
## 1.070446
\[{\LARGE \text{Model 4. Min-Max Normalization}}\]
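#Min-max normalization: rescale values to the range [0, 1]
normalize <- function(x){
  return (
    (x - min(x))/(max(x) - min(x))
  )
}
index <- initial_split(dataset, prop = 0.85,strata = "Happiness.Score") #Split the data 85/15, stratified on Happiness.Score
mm_train <- training(index)
mm_test <- testing(index)
mm_train <- normalize(mm_train) #Applied to the whole data frame, so min() and max() are taken over all columns together rather than per column
mm_test <- normalize(mm_test) #The test set is rescaled with its own minimum and maximum, not those of the training set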
model.linear2<-lm(formula = Happiness.Score ~ ., data = mm_train)
model.linear2<- step(object = model.linear2, direction = "both")
## Start: AIC=-2729.39
## Happiness.Score ~ Economy..GDP.per.Capita. + Family + Health..Life.Expectancy. +
## Freedom + Trust..Government.Corruption. + Generosity + Dystopia.Residual
##
## Df Sum of Sq RSS AIC
## <none> 0.00000 -2729.39
## - Trust..Government.Corruption. 1 0.02548 0.02548 -1134.05
## - Generosity 1 0.02938 0.02938 -1115.01
## - Freedom 1 0.02945 0.02945 -1114.67
## - Health..Life.Expectancy. 1 0.03889 0.03890 -1077.39
## - Economy..GDP.per.Capita. 1 0.08349 0.08349 -975.04
## - Family 1 0.08824 0.08824 -967.62
## - Dystopia.Residual 1 0.66217 0.66217 -697.55
summary(model.linear2)
##
## Call:
## lm(formula = Happiness.Score ~ Economy..GDP.per.Capita. + Family +
## Health..Life.Expectancy. + Freedom + Trust..Government.Corruption. +
## Generosity + Dystopia.Residual, data = mm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.411e-05 -2.854e-05 -1.095e-06 3.101e-05 6.600e-05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.864e-06 1.831e-05 0.539 0.591
## Economy..GDP.per.Capita. 1.000e+00 1.270e-04 7876.886 <2e-16 ***
## Family 9.999e-01 1.235e-04 8098.075 <2e-16 ***
## Health..Life.Expectancy. 9.999e-01 1.860e-04 5376.320 <2e-16 ***
## Freedom 9.997e-01 2.137e-04 4678.255 <2e-16 ***
## Trust..Government.Corruption. 9.999e-01 2.298e-04 4351.792 <2e-16 ***
## Generosity 1.000e+00 2.141e-04 4672.300 <2e-16 ***
## Dystopia.Residual 1.000e+00 4.508e-05 22183.210 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.668e-05 on 126 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.19e+08 on 7 and 126 DF, p-value: < 2.2e-16
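pred_test<-predict(object = model.linear2,newdata = mm_test, interval = "confidence", level = 0.95) #Predict the normalized Happiness.Score for mm_test
MAE<-mean(abs(pred_test - mm_test$Happiness.Score)) #Mean absolute difference between the prediction matrix (fit, lwr, upr) and the actual score on mm_test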
MAE
## [1] 0.001865324
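pred_train<-predict(object = model.linear2,newdata = mm_train, interval = "confidence", level = 0.95) #Predict the normalized Happiness.Score for mm_train
MAE<-mean(abs(pred_train- mm_train$Happiness.Score)) #Mean absolute difference between the prediction matrix (fit, lwr, upr) and the actual score on mm_train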
MAE
## [1] 3.216925e-05
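Calling normalize() on a whole data frame takes min() and max() over all columns at once, and normalizing the test set separately generally puts it on a different scale from the training set. A common alternative rescales each column with the minimum and maximum learned from the training set only. The sketch below shows that approach with hypothetical helper names, `fit_minmax` and `apply_minmax`, on synthetic data.

```r
# Learn per-column minimum and maximum from a data frame of numeric columns
fit_minmax <- function(df) {
  list(min = sapply(df, min), max = sapply(df, max))
}

# Rescale each column of df with previously learned parameters
apply_minmax <- function(df, params) {
  scaled <- Map(function(col, lo, hi) (col - lo) / (hi - lo),
                df, params$min[names(df)], params$max[names(df)])
  as.data.frame(scaled)
}

set.seed(1)
toy <- data.frame(score = runif(20, 3, 8), economy = runif(20, 0, 1.7))
toy_train <- toy[1:15, ]
toy_test  <- toy[16:20, ]

# Parameters come from the training rows only and are reused for the test rows
params       <- fit_minmax(toy_train)
train_scaled <- apply_minmax(toy_train, params)
test_scaled  <- apply_minmax(toy_test, params)

print(sapply(train_scaled, range))
```

Scaled test values can fall slightly outside [0, 1] when a test row lies beyond the training range, which is expected with this design.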
\[{\LARGE \text{CONCLUSION}}\]
The modeling results for the Happiness data are as follows:
Model 1. Simple Linear Regression
-Adjusted R-squared : 0.9842 (98%)
Model 2. Multiple Linear Regression
-Adjusted R-squared : 0.9849 (98%)
Model 3. Handle Outlier with Scaling
-Adjusted R-squared : 100%
-MAE Train : 0.0002373886
-MAE Test : 0.0002922013
Model 4. Min-Max Normalization
-Adjusted R-squared : 100%
-MAE Train : 3.216925e-05
-MAE Test : 0.001865324
Based on the models tried above, the best models are Handle Outlier with Scaling and Min-Max Normalization, with Adjusted R-squared = 1 (100%) and MAE below 0.002 on both the training and test data.