Penelitian ini menggunakan data sekunder yang berasal dari gabungan dua situs, yaitu www.cato.org dan www.kaggle.com. Data terdiri dari 163 amatan berupa negara-negara yang ada di dunia. Peubah yang digunakan pada penelitian ini berjumlah 12 yang terdiri dari satu peubah respon dan sebelas peubah penjelas yang disajikan pada tabel berikut.
df <- data.frame(
"Peubah" = c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11"),
"Keterangan" = c("Human Development Index", "Security and Safety", "Movement", "Religion", "Association, Assembly, & Civil Society", "Expression & Information", "Identity & Relationships", "Size of Government", "Legal System & Property Rights", "Sound Money", "Freedom to Trade Internationally", "Regulation"),
"Satuan" = rep("Skor", 12)
)
print(df)
## Peubah Keterangan Satuan
## 1 Y Human Development Index Skor
## 2 X1 Security and Safety Skor
## 3 X2 Movement Skor
## 4 X3 Religion Skor
## 5 X4 Association, Assembly, & Civil Society Skor
## 6 X5 Expression & Information Skor
## 7 X6 Identity & Relationships Skor
## 8 X7 Size of Government Skor
## 9 X8 Legal System & Property Rights Skor
## 10 X9 Sound Money Skor
## 11 X10 Freedom to Trade Internationally Skor
## 12 X11 Regulation Skor
setwd("C:/Users/FARHAN ABDILLAH/Documents/1 KULIAH/SEMESTER 4/Analisis Regresi/PROJEK ANREG")
library(readxl)
## Warning: package 'readxl' was built under R version 4.2.3
df <- read_excel("data_anreg.xlsx")
df
## # A tibble: 163 × 13
## Country X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 9.25 6.55 9.76 8.66 6.11 8.75 7.82 5.26 9.79 8.22 7.11
## 2 Algeria 8.84 5.92 5.42 3.87 4.40 2.5 4.41 4.13 7.63 3.64 5.78
## 3 Angola 8.47 5.97 6.75 5.55 5.60 6.25 8.13 3.71 6.09 5.37 6.23
## 4 Argentina 8.50 7.86 9.87 9.19 8.39 10 6.48 4.80 4.52 3.09 5.49
## 5 Armenia 9.40 8.18 8.39 8.92 7.58 8.75 7.98 6.24 9.55 7.69 7.76
## 6 Australia 9.76 6.41 9.86 9.40 9.09 10 6.09 8.34 9.56 7.92 8.27
## 7 Austria 9.84 6.17 8.85 9.27 9.01 10 4.93 8.39 9.18 8.09 7.20
## 8 Azerbaijan 8.03 5.45 4.13 3.38 2.56 9.38 4.55 5.24 7.27 7.01 6.96
## 9 Bahamas 7.30 10 10 10 9.17 10 8.65 6.23 6.80 5.22 8.23
## 10 Bahrain 8.70 4.77 2.60 1.72 1.64 5 7.08 4.92 9.43 8.26 7.68
## # … with 153 more rows, and 1 more variable: Y <dbl>
library(corrplot)
## corrplot 0.92 loaded
df_numeric <- df[sapply(df, is.numeric)]
cor_matrix <- cor(df_numeric, use = "complete.obs")
corrplot(cor_matrix, method = "number")
library(ggplot2)
ggplot(df, aes(x = Y)) +
geom_histogram(binwidth = 0.05, color = "black", fill = "#69b3a2", alpha = 0.8,
size = 0.1, position = "identity", show.legend = FALSE) +
geom_vline(aes(xintercept = mean(Y)), color = "red", linetype = "dashed", size = 1) +
geom_density(alpha = 0.2, fill = "#FF9999") +
ggtitle("Sebaran Nilai Human Development Index") +
xlab("Human Development") + ylab("Frekuensi") +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
model <- lm(Y ~ X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11 ,data= df)
model
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 +
## X10 + X11, data = df)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## 0.353068 0.011585 -0.001163 -0.017904 0.001122 0.005530
## X6 X7 X8 X9 X10 X11
## 0.012629 -0.009577 0.055342 0.008086 0.007273 -0.008220
summary(model)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 +
## X10 + X11, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.269193 -0.061053 0.006342 0.054670 0.210460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.353068 0.073168 4.825 3.39e-06 ***
## X1 0.011585 0.005474 2.116 0.035970 *
## X2 -0.001163 0.007516 -0.155 0.877254
## X3 -0.017904 0.007158 -2.501 0.013444 *
## X4 0.001122 0.007382 0.152 0.879374
## X5 0.005530 0.008611 0.642 0.521740
## X6 0.012629 0.003295 3.833 0.000186 ***
## X7 -0.009577 0.006826 -1.403 0.162663
## X8 0.055342 0.009118 6.069 9.92e-09 ***
## X9 0.008086 0.006145 1.316 0.190225
## X10 0.007273 0.008382 0.868 0.386882
## X11 -0.008220 0.010215 -0.805 0.422282
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08793 on 151 degrees of freedom
## Multiple R-squared: 0.6947, Adjusted R-squared: 0.6724
## F-statistic: 31.23 on 11 and 151 DF, p-value: < 2.2e-16
library(car)
## Loading required package: carData
vif(model)
## X1 X2 X3 X4 X5 X6 X7 X8
## 1.797904 2.035125 4.650266 6.968862 8.506792 1.686062 1.281922 4.227177
## X9 X10 X11
## 1.906263 2.534636 1.997328
Multikolinieritas terjadi saat VIF > 10. Dalam kasus
ini, tidak ada multikolinieritas antarvariabel. Oleh karena itu, dapat
dilanjutkan ke tahap analisis berikutnya.
Menggunakan Best Subset Regression
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
bs.model <- ols_step_best_subset(model)
bs.model <- as.data.frame(bs.model)
bs.model
## mindex n predictors rsquare adjr predrsq
## 8 1 1 X8 0.6176617 0.6152870 0.6096146
## 53 2 2 X6 X8 0.6367169 0.6321758 0.6227058
## 162 3 3 X3 X6 X8 0.6680605 0.6617974 0.6508296
## 282 4 4 X1 X3 X6 X8 0.6813553 0.6732884 0.6596178
## 682 5 5 X1 X3 X6 X7 X8 0.6848359 0.6747989 0.6591477
## 1205 6 6 X1 X3 X6 X7 X8 X9 0.6896019 0.6776635 0.6555505
## 1647 7 7 X1 X3 X5 X6 X7 X8 X9 0.6918271 0.6779096 0.6528203
## 1921 8 8 X1 X3 X5 X6 X7 X8 X9 X10 0.6931997 0.6772620 0.6436073
## 2024 9 9 X1 X3 X5 X6 X7 X8 X9 X10 X11 0.6945860 0.6766204 0.6395040
## 2043 10 10 X1 X2 X3 X5 X6 X7 X8 X9 X10 X11 0.6946265 0.6745362 0.6347041
## 2047 11 11 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 0.6946733 0.6724309 0.6289367
## cp aic sbic sbc msep fpe apc
## 8 30.086213 -299.7913 -762.9552 -290.5101 1.480233 0.009192601 0.3918374
## 53 22.662458 -306.1244 -769.2720 -293.7494 1.415306 0.008842314 0.3769063
## 162 9.161418 -318.8318 -781.4625 -303.3631 1.301380 0.008179202 0.3486409
## 282 4.586404 -323.4947 -785.7275 -304.9322 1.257214 0.007948614 0.3388120
## 682 4.865072 -323.2849 -785.3138 -301.6287 1.251453 0.007958960 0.3392530
## 1205 4.508034 -323.7687 -785.4846 -299.0187 1.240480 0.007935529 0.3382543
## 1647 5.407572 -322.9414 -784.4130 -295.0977 1.239584 0.007976115 0.3399843
## 1921 6.728764 -321.6690 -782.9169 -290.7315 1.242129 0.008038891 0.3426601
## 2024 8.043169 -320.4072 -781.4105 -286.3760 1.244651 0.008101701 0.3453374
## 2043 10.023110 -318.4289 -779.2707 -281.3039 1.252728 0.008201052 0.3495723
## 2047 12.000000 -316.4538 -777.1331 -276.2351 1.260886 0.008301537 0.3538555
## hsp
## 8 5.675735e-05
## 53 5.460700e-05
## 162 5.052717e-05
## 282 4.912133e-05
## 682 4.920765e-05
## 1205 4.908885e-05
## 1647 4.936990e-05
## 1921 4.979249e-05
## 2024 5.021970e-05
## 2043 5.087811e-05
## 2047 5.154859e-05
min_aic_index <- which.min(bs.model$aic)
ggplot(bs.model, aes(x = n, y = aic)) +
geom_line() +
geom_point() +
geom_point(data = bs.model[min_aic_index, ], color = "red", size = 3) +
labs(x = "Jumlah Variabel Prediktor", y = "AIC") +
ggtitle("Pergerakan Nilai AIC") +
theme_minimal()
max_rsquare_index <- which.max(bs.model$rsquare)
ggplot(bs.model, aes(x = n, y = rsquare)) +
geom_line() +
geom_point() +
geom_point(data = bs.model[max_rsquare_index, ], color = "red", size = 3) +
labs(x = "Jumlah Variabel Prediktor", y = "Rsquare") +
ggtitle("Pergerakan Nilai Rsquare") +
theme_minimal()
best_model <- lm(Y~X1 + X3 + X6 + X7 + X8 + X9 , data = df)
best_model
##
## Call:
## lm(formula = Y ~ X1 + X3 + X6 + X7 + X8 + X9, data = df)
##
## Coefficients:
## (Intercept) X1 X3 X6 X7 X8
## 0.318646 0.012358 -0.013156 0.013666 -0.010593 0.057662
## X9
## 0.008534
summary(best_model)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X6 + X7 + X8 + X9, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.283392 -0.053519 0.003781 0.055185 0.213356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.318646 0.064584 4.934 2.05e-06 ***
## X1 0.012358 0.005003 2.470 0.01459 *
## X3 -0.013156 0.003929 -3.348 0.00102 **
## X6 0.013666 0.003109 4.395 2.04e-05 ***
## X7 -0.010593 0.006572 -1.612 0.10902
## X8 0.057662 0.007234 7.971 3.12e-13 ***
## X9 0.008534 0.005514 1.548 0.12373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08723 on 156 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6777
## F-statistic: 57.76 on 6 and 156 DF, p-value: < 2.2e-16
# ===== Eksplorasi asumsi =====
plot(best_model,1) # plot sisaan vs yduga
plot(best_model,2) # qq-plot
plot(x = 1:dim(df)[1],
y = best_model$residuals,
type = 'b',
ylab = "Residuals",
xlab = "Observation") # plot sisaan vs urutan
library(randtests)
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
# Melakukan uji asumsi
t_test <- t.test(best_model$residuals, mu = 0, conf.level = 0.95)
runs_test <- runs.test(best_model$residuals)
bp_test <- ncvTest(best_model)
shapiro_test <- shapiro.test(best_model$residuals)
# Membuat dataframe hasil
asumsi <- data.frame(
"Asumsi" = c("Nilai harapan sisaan sama dengan nol", "Sisaan saling bebas", "Ragam Sisaan Homogen", "Normalitas Sisaan"),
"Jenis Uji" = c("t-test", "Runs Test", "Non Constant Variance Test", "Shapiro-Wilk Test"),
"p-value" = c(t_test$p.value, runs_test$p.value, bp_test$p, shapiro_test$p.value),
"Kesimpulan" = c(ifelse(t_test$p.value < 0.05, "Tidak Terpenuhi", "Terpenuhi"),
ifelse(runs_test$p.value < 0.05, "Tidak Terpenuhi", "Terpenuhi"),
ifelse(bp_test$p < 0.05, "Tidak Terpenuhi", "Terpenuhi"),
ifelse(shapiro_test$p.value < 0.05, "Tidak Terpenuhi", "Terpenuhi"))
)
asumsi
## Asumsi Jenis.Uji p.value
## 1 Nilai harapan sisaan sama dengan nol t-test 1.000000e+00
## 2 Sisaan saling bebas Runs Test 3.442757e-01
## 3 Ragam Sisaan Homogen Non Constant Variance Test 1.573317e-05
## 4 Normalitas Sisaan Shapiro-Wilk Test 1.847863e-01
## Kesimpulan
## 1 Terpenuhi
## 2 Terpenuhi
## 3 Tidak Terpenuhi
## 4 Terpenuhi
weights <- 1 / lm(abs(best_model$residuals) ~ best_model$fitted.values)$fitted.values^2
best_model_wls <- lm(Y~X1 + X3 + X6 + X7 + X8 + X9, data = df, weights = weights)
summary(best_model_wls)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X6 + X7 + X8 + X9, data = df, weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.1496 -0.9153 0.1486 0.8292 2.7217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.194809 0.061656 3.160 0.001898 **
## X1 0.018441 0.005151 3.580 0.000458 ***
## X3 -0.007857 0.003267 -2.405 0.017349 *
## X6 0.011262 0.002983 3.776 0.000227 ***
## X7 -0.006983 0.005056 -1.381 0.169186
## X8 0.048120 0.005642 8.529 1.21e-14 ***
## X9 0.018179 0.006048 3.006 0.003087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.239 on 156 degrees of freedom
## Multiple R-squared: 0.7869, Adjusted R-squared: 0.7788
## F-statistic: 96.04 on 6 and 156 DF, p-value: < 2.2e-16
# Melakukan uji asumsi
t_test <- t.test(best_model_wls$residuals, mu = 0, conf.level = 0.95)
runs_test <- runs.test(best_model_wls$residuals)
bp_test <- ncvTest(best_model_wls)
shapiro_test <- shapiro.test(best_model_wls$residuals)
# Membuat dataframe hasil
asumsi <- data.frame(
"Asumsi" = c("Nilai harapan sisaan sama dengan nol", "Sisaan saling bebas", "Ragam Sisaan Homogen", "Normalitas Sisaan"),
"Jenis Uji" = c("t-test", "Runs Test", "Non Constant Variance Test", "Shapiro-Wilk Test"),
"p-value" = c(t_test$p.value, runs_test$p.value, bp_test$p, shapiro_test$p.value),
"Kesimpulan" = c(ifelse(t_test$p.value < 0.05, "Tidak Terpenuhi", "Terpenuhi"),
ifelse(runs_test$p.value < 0.05, "Tidak Terpenuhi", "Terpenuhi"),
ifelse(bp_test$p < 0.05, "Tidak Terpenuhi", "Terpenuhi"),
ifelse(shapiro_test$p.value < 0.05, "Tidak Terpenuhi", "Terpenuhi"))
)
asumsi
## Asumsi Jenis.Uji p.value
## 1 Nilai harapan sisaan sama dengan nol t-test 0.91619779
## 2 Sisaan saling bebas Runs Test 0.43062324
## 3 Ragam Sisaan Homogen Non Constant Variance Test 0.07265583
## 4 Normalitas Sisaan Shapiro-Wilk Test 0.23252504
## Kesimpulan
## 1 Terpenuhi
## 2 Terpenuhi
## 3 Terpenuhi
## 4 Terpenuhi
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.2.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Mendefinisikan peubah
x <- model.matrix(Y~X1 + X3 + X6 + X7 + X8 + X9, data = df)[,-1]
y <- df$Y
# Melakukan cross validation
cv.r <- cv.glmnet(x, y, alpha = 0)
plot(cv.r)
best_lambda <- cv.r$lambda.min
best_ridge <- glmnet(x, y, alpha = 0, lambda = best_lambda)
coef(best_ridge)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.32143475
## X1 0.01386950
## X3 -0.01019535
## X6 0.01300473
## X7 -0.01264700
## X8 0.04988640
## X9 0.01121121
# Mendefinisikan fungsi R-Square
rsq <- function(model, lambda, x, y){
# Prediksi y
y_pred <- predict(model, s = lambda, newx = x)
# Menghitung JKG dan JKT
jkt <- sum((y - mean(y))^2)
jkg <- sum((y_pred - y)^2)
# Menghitung R-Squared
rsq <- 1 - jkg/jkt
return(rsq)
}
r2_ridge <- rsq(best_ridge, best_lambda, x, y)
r2_ridge
## [1] 0.6863754
cv.l<-cv.glmnet(x,y,alpha=1);plot(cv.l)
best.ll<-cv.l$lambda.min
bestlasso<-glmnet(x,y,alpha=1,lambda=best.ll);coef(bestlasso)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.318905098
## X1 0.012047397
## X3 -0.012341975
## X6 0.013192995
## X7 -0.010199560
## X8 0.057550390
## X9 0.008216546
r2_lasso <- rsq(bestlasso, best.ll, x, y)
r2_lasso
## [1] 0.6894319
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
r2_ols <- summary(best_model)$r.squared
r2_wls <- summary(best_model_wls)$r.squared
r2_ridge <- r2_ridge
r2_lasso <- r2_lasso
comparison_table <- data.frame(
Method = c("OLS", "WLS", "Ridge", "Lasso"),
R2 = c(r2_ols, r2_wls, r2_ridge, r2_lasso)
) %>%
arrange(desc(R2))
print(comparison_table)
## Method R2
## 1 WLS 0.7869475
## 2 OLS 0.6896019
## 3 Lasso 0.6894319
## 4 Ridge 0.6863754
Performa Model WLS lebih baik diikuti dengan Model OLS, Lasso, dan Ridge. Hal ini cukup masuk akal karena Regresi Ridge dan Lasso bertujuan untuk mengatasi masalah multikolinieritas. Sedangkan, pada analisis sebelumnya, asumsi tidak ada multikolinieritas terpenuhi.
summary(best_model_wls)
##
## Call:
## lm(formula = Y ~ X1 + X3 + X6 + X7 + X8 + X9, data = df, weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.1496 -0.9153 0.1486 0.8292 2.7217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.194809 0.061656 3.160 0.001898 **
## X1 0.018441 0.005151 3.580 0.000458 ***
## X3 -0.007857 0.003267 -2.405 0.017349 *
## X6 0.011262 0.002983 3.776 0.000227 ***
## X7 -0.006983 0.005056 -1.381 0.169186
## X8 0.048120 0.005642 8.529 1.21e-14 ***
## X9 0.018179 0.006048 3.006 0.003087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.239 on 156 degrees of freedom
## Multiple R-squared: 0.7869, Adjusted R-squared: 0.7788
## F-statistic: 96.04 on 6 and 156 DF, p-value: < 2.2e-16
Peubah yang berpengaruh signifikan terhadap model adalah X1, X3, X6, X8, dan X9. Model ini memiliki R2 sebesar 0.7869. Artinya, model ini sudah menjelaskan 78.69% keberagaman dari data. Berikut adalah model yang dibentuk
\[ Y = 0.194809 + 0.018441X_1 - 0.007857X_3 + 0.011262X_6 - 0.006983X_7 + 0.048120X_8 + 0.018179X_9 \] Koefisien X1 (Security and Safety) adalah 0.018441, yang berarti bahwa ketika variabel X1 (Security and Safety) naik satu satuan, maka perkiraan nilai variabel dependen Y (Human Development Index) akan meningkat sebesar 0.018441, dengan asumsi variabel lainnya tetap konstan.
Koefisien X3 (Religion) adalah -0.007857, yang berarti bahwa ketika variabel X3 naik satu satuan, maka perkiraan nilai variabel dependen Y (Human Development Index) akan menurun sebesar 0.007857, dengan asumsi variabel lainnya tetap konstan.
Koefisien X6 (Identity & Relationships) adalah 0.011262, yang berarti bahwa ketika variabel X6 (Identity & Relationships) naik satu satuan, maka perkiraan nilai variabel dependen Y (Human Development Index) akan meningkat sebesar 0.011262, dengan asumsi variabel lainnya tetap konstan.
Koefisien X7 (Size of Government) adalah -0.006983, yang berarti bahwa ketika variabel X7 (Size of Government) naik satu satuan, maka perkiraan nilai variabel dependen Y (Human Development Index) akan menurun sebesar -0.006983, dengan asumsi variabel lainnya tetap konstan.
Koefisien X8 (Legal System & Property Rights) adalah 0.048120, yang berarti bahwa ketika variabel X8 (Legal System & Property Rights) naik satu satuan, maka perkiraan nilai variabel dependen Y (Human Development Index) akan meningkat sebesar 0.048120, dengan asumsi variabel lainnya tetap konstan.
Koefisien X9 (Sound Money) adalah 0.018179, yang berarti bahwa ketika variabel X9 (Sound Money) naik satu satuan, maka perkiraan nilai variabel dependen Y (Human Development Index) akan meningkat sebesar 0.018179, dengan asumsi variabel lainnya tetap konstan.