Projek penugasan ini menggunakan data sekunder yang berasal dari \(kaggle.com\). Dataset ini bertujuan untuk melihat peubah yang mempengaruhi student performance yang terdiri dari 4292 baris.
library(rio)
data <- import("https://raw.githubusercontent.com/AngelMartha/Pengantar-Sains-Data/main/Student_Performance.csv")
head(data)
## Hours Studied Previous Scores Extracurricular Activities Sleep Hours
## 1 7 99 Yes 9
## 2 4 82 No 4
## 3 8 51 Yes 7
## 4 5 52 Yes 5
## 5 7 75 No 8
## 6 3 78 No 9
## Sample Question Papers Practiced Performance Index
## 1 1 91
## 2 2 65
## 3 2 45
## 4 2 36
## 5 5 66
## 6 6 61
str(data)
## 'data.frame': 4292 obs. of 6 variables:
## $ Hours Studied : int 7 4 8 5 7 3 7 8 5 4 ...
## $ Previous Scores : int 99 82 51 52 75 78 73 45 77 89 ...
## $ Extracurricular Activities : chr "Yes" "No" "Yes" "Yes" ...
## $ Sleep Hours : int 9 4 7 5 8 9 5 4 8 4 ...
## $ Sample Question Papers Practiced: int 1 2 2 2 5 6 6 6 2 0 ...
## $ Performance Index : num 91 65 45 36 66 61 63 42 61 69 ...
x1 <- data$`Hours Studied`
x2 <- data$`Previous Scores`
x3 <- data$`Extracurricular Activities`
x4 <- data$`Sleep Hours`
x5 <- data$`Sample Question Papers Practiced`
y <- data$`Performance Index`
# Menghapus baris dengan nilai NA
sum(is.na(data))
## [1] 0
data <- na.omit(data)
# Mengubah tipe data peubah Extracurricular Activites menjadi factor
data$`Extracurricular Activities` = as.factor(data$`Extracurricular Activities`)
# Sebaran peubah Y (Indeks Alibaca)
hist(data$`Performance Index`, col = "palegreen2")
model <- lm(y ~ x1+x2+x3+x4+x5, data=data)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5991 -1.3767 -0.0606 1.3556 7.5570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.054413 0.194540 -175.05 <2e-16 ***
## x1 2.852940 0.012016 237.43 <2e-16 ***
## x2 1.017127 0.001797 566.04 <2e-16 ***
## x3Yes 0.638206 0.062226 10.26 <2e-16 ***
## x4 0.490291 0.018279 26.82 <2e-16 ***
## x5 0.190916 0.010882 17.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.037 on 4286 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9886
## F-statistic: 7.472e+04 on 5 and 4286 DF, p-value: < 2.2e-16
car::vif(model)
## x1 x2 x3 x4 x5
## 1.001441 1.000649 1.000690 1.000717 1.001927
##Kolmogorov-Smirnov Test
ks.test(model$residuals, "pnorm", mean=mean(model$residuals), sd=sd(model$residuals))
## Warning in ks.test.default(model$residuals, "pnorm", mean =
## mean(model$residuals), : ties should not be present for the Kolmogorov-Smirnov
## test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: model$residuals
## D = 0.014729, p-value = 0.3095
## alternative hypothesis: two-sided
##Shapiro-Wilk Test
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.99945, p-value = 0.2436
Berdasarkan Kolmogorov-Smirnov test dan Shapiro-Wilk, residual data menyebar normal dengan p-value > 5%.
lmtest::bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 4.7328, df = 5, p-value = 0.4494
Karena p-value > 0.05 maka ragam sisaan homogen atau tidak terdapat masalah heteroskedastisitas
library(randtests)
runs.test(model$residuals)
##
## Runs Test
##
## data: model$residuals
## statistic = -1.8014, runs = 2088, n1 = 2146, n2 = 2146, n = 4292,
## p-value = 0.07164
## alternative hypothesis: nonrandomness
Karena p-value > 0.05 maka sisaan saling bebas.
t.test(model$residuals,
mu = 0,
conf.level = 0.95)
##
## One Sample t-test
##
## data: model$residuals
## t = 1.0315e-15, df = 4291, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.06093703 0.06093703
## sample estimates:
## mean of x
## 3.206208e-17
Karena p-value > 0.05 maka nilai harapan sisaan sama dengan nol
bmodelselect <- step(model, direction="backward", scope=formula(lm(y ~ x1+x2+x3+x4+x5, data)), trace=1)
## Start: AIC=6115.35
## y ~ x1 + x2 + x3 + x4 + x5
##
## Df Sum of Sq RSS AIC
## <none> 17793 6115.4
## - x3 1 437 18229 6217.4
## - x5 1 1278 19070 6411.0
## - x4 1 2987 20779 6779.3
## - x1 1 234018 251811 17486.7
## - x2 1 1330087 1347880 24687.0
summary(bmodelselect)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5991 -1.3767 -0.0606 1.3556 7.5570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.054413 0.194540 -175.05 <2e-16 ***
## x1 2.852940 0.012016 237.43 <2e-16 ***
## x2 1.017127 0.001797 566.04 <2e-16 ***
## x3Yes 0.638206 0.062226 10.26 <2e-16 ***
## x4 0.490291 0.018279 26.82 <2e-16 ***
## x5 0.190916 0.010882 17.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.037 on 4286 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9886
## F-statistic: 7.472e+04 on 5 and 4286 DF, p-value: < 2.2e-16
Berdasarkan metode backward, model terbaik adalah model dengan peubah X1, X2, X3(Yes), X4,dan X5 dengan \(R−squared:0.9887\)
fmodelselect <- step(lm(y ~ 1, data), direction="forward", scope=formula(model), trace=1)
## Start: AIC=25330.29
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x2 1 1310870 257877 17583
## + x1 1 213944 1354803 24703
## + x5 1 4253 1564494 25321
## + x4 1 3341 1565406 25323
## + x3 1 1034 1567713 25330
## <none> 1568747 25330
##
## Step: AIC=17582.87
## y ~ x2
##
## Df Sum of Sq RSS AIC
## + x1 1 235473 22404 7098.5
## + x4 1 2862 255015 17537.0
## + x5 1 2554 255323 17542.1
## + x3 1 537 257340 17575.9
## <none> 257877 17582.9
##
## Step: AIC=7098.52
## y ~ x2 + x1
##
## Df Sum of Sq RSS AIC
## + x4 1 2870.33 19534 6512.1
## + x5 1 1225.63 21179 6859.1
## + x3 1 424.84 21979 7018.4
## <none> 22404 7098.5
##
## Step: AIC=6512.09
## y ~ x2 + x1 + x4
##
## Df Sum of Sq RSS AIC
## + x5 1 1304.6 18229 6217.4
## + x3 1 463.5 19070 6411.0
## <none> 19534 6512.1
##
## Step: AIC=6217.42
## y ~ x2 + x1 + x4 + x5
##
## Df Sum of Sq RSS AIC
## + x3 1 436.68 17793 6115.4
## <none> 18229 6217.4
##
## Step: AIC=6115.35
## y ~ x2 + x1 + x4 + x5 + x3
summary(fmodelselect)
##
## Call:
## lm(formula = y ~ x2 + x1 + x4 + x5 + x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5991 -1.3767 -0.0606 1.3556 7.5570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.054413 0.194540 -175.05 <2e-16 ***
## x2 1.017127 0.001797 566.04 <2e-16 ***
## x1 2.852940 0.012016 237.43 <2e-16 ***
## x4 0.490291 0.018279 26.82 <2e-16 ***
## x5 0.190916 0.010882 17.54 <2e-16 ***
## x3Yes 0.638206 0.062226 10.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.037 on 4286 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9886
## F-statistic: 7.472e+04 on 5 and 4286 DF, p-value: < 2.2e-16
Berdasarkan metode forward, model terbaik adalah model dengan peubah X1, X2, X3(Yes), X4,dan X5 dengan \(R−squared:0.9887\). Hal ini sama dengan metode backward.
smodelselect <- step(lm(y ~ 1, data), direction="both", scope=formula(model), trace=1)
## Start: AIC=25330.29
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x2 1 1310870 257877 17583
## + x1 1 213944 1354803 24703
## + x5 1 4253 1564494 25321
## + x4 1 3341 1565406 25323
## + x3 1 1034 1567713 25330
## <none> 1568747 25330
##
## Step: AIC=17582.87
## y ~ x2
##
## Df Sum of Sq RSS AIC
## + x1 1 235473 22404 7098.5
## + x4 1 2862 255015 17537.0
## + x5 1 2554 255323 17542.1
## + x3 1 537 257340 17575.9
## <none> 257877 17582.9
## - x2 1 1310870 1568747 25330.3
##
## Step: AIC=7098.52
## y ~ x2 + x1
##
## Df Sum of Sq RSS AIC
## + x4 1 2870 19534 6512.1
## + x5 1 1226 21179 6859.1
## + x3 1 425 21979 7018.4
## <none> 22404 7098.5
## - x1 1 235473 257877 17582.9
## - x2 1 1332399 1354803 24703.0
##
## Step: AIC=6512.09
## y ~ x2 + x1 + x4
##
## Df Sum of Sq RSS AIC
## + x5 1 1305 18229 6217.4
## + x3 1 464 19070 6411.0
## <none> 19534 6512.1
## - x4 1 2870 22404 7098.5
## - x1 1 235481 255015 17537.0
## - x2 1 1331916 1351450 24694.4
##
## Step: AIC=6217.42
## y ~ x2 + x1 + x4 + x5
##
## Df Sum of Sq RSS AIC
## + x3 1 437 17793 6115.4
## <none> 18229 6217.4
## - x5 1 1305 19534 6512.1
## - x4 1 2949 21179 6859.1
## - x1 1 234118 252348 17493.8
## - x2 1 1330543 1348772 24687.8
##
## Step: AIC=6115.35
## y ~ x2 + x1 + x4 + x5 + x3
##
## Df Sum of Sq RSS AIC
## <none> 17793 6115.4
## - x3 1 437 18229 6217.4
## - x5 1 1278 19070 6411.0
## - x4 1 2987 20779 6779.3
## - x1 1 234018 251811 17486.7
## - x2 1 1330087 1347880 24687.0
summary(smodelselect)
##
## Call:
## lm(formula = y ~ x2 + x1 + x4 + x5 + x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5991 -1.3767 -0.0606 1.3556 7.5570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.054413 0.194540 -175.05 <2e-16 ***
## x2 1.017127 0.001797 566.04 <2e-16 ***
## x1 2.852940 0.012016 237.43 <2e-16 ***
## x4 0.490291 0.018279 26.82 <2e-16 ***
## x5 0.190916 0.010882 17.54 <2e-16 ***
## x3Yes 0.638206 0.062226 10.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.037 on 4286 degrees of freedom
## Multiple R-squared: 0.9887, Adjusted R-squared: 0.9886
## F-statistic: 7.472e+04 on 5 and 4286 DF, p-value: < 2.2e-16
Hasil stepwise menunjukkan hasil yang sama dengan metode backward dan forward selection, yaitu model terbaik dengan peubah X1, X2, X3(Yes), X4,dan X5 dengan \(R−squared:0.9887\).
matrix_X <- data.matrix(data[, -6])
matrix_Y <- matrix(data$`Performance Index`) # Assuming 'y' is a single-column vector
alpha_ridge = 0
model_Ridge <- glmnet::cv.glmnet(matrix_X,matrix_Y,alpha=alpha_ridge)
summary(model_Ridge)
## Length Class Mode
## lambda 100 -none- numeric
## cvm 100 -none- numeric
## cvsd 100 -none- numeric
## cvup 100 -none- numeric
## cvlo 100 -none- numeric
## nzero 100 -none- numeric
## call 4 -none- call
## name 1 -none- character
## glmnet.fit 12 elnet list
## lambda.min 1 -none- numeric
## lambda.1se 1 -none- numeric
## index 2 -none- numeric
# Hasil Regresi Ridge
print(model_Ridge)
##
## Call: glmnet::cv.glmnet(x = matrix_X, y = matrix_Y, alpha = alpha_ridge)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 1.748 100 6.763 0.2206 5
## 1se 1.748 100 6.763 0.2206 5
# Memilih nilai lambda terbaik
best_lambda <- model_Ridge$lambda.min
cat("Lambda terbaik:", best_lambda, "\n")
## Lambda terbaik: 1.747632
# Melakukan prediksi dengan model Ridge terbaik
predictions <- predict(model_Ridge, s = best_lambda, newx = matrix_X)
# koefisien Ridge
coef(model_Ridge, s = best_lambda)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -27.1794272
## Hours Studied 2.6037407
## Previous Scores 0.9312880
## Extracurricular Activities 0.6104807
## Sleep Hours 0.4519084
## Sample Question Papers Practiced 0.1872488
# R-squared untuk model Ridge
r_squared_ridge <- 1 - sum((matrix_Y - predictions)^2) / sum((matrix_Y - mean(matrix_Y))^2)
r_squared_ridge
## [1] 0.9815616
Pada regresi Ridge, tidak terdapat peubah yang dihilangkan atau semua peubah dimasukkan dalam model. R-Square yang diperoleh dengan regresi ridge adalah \(R−squared:0.9815616\)
alpha_Lasso = 1
model_Lasso <- glmnet::cv.glmnet(matrix_X,matrix_Y,alpha=alpha_Lasso)
model_Lasso
##
## Call: glmnet::cv.glmnet(x = matrix_X, y = matrix_Y, alpha = alpha_Lasso)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.05463 63 4.173 0.07047 5
## 1se 0.12619 54 4.237 0.06430 5
print(model_Lasso)
##
## Call: glmnet::cv.glmnet(x = matrix_X, y = matrix_Y, alpha = alpha_Lasso)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.05463 63 4.173 0.07047 5
## 1se 0.12619 54 4.237 0.06430 5
# Memilih nilai lambda terbaik
best_lambdal <- model_Lasso$lambda.min
cat("Lambda terbaik:", best_lambdal, "\n")
## Lambda terbaik: 0.05462602
# Melakukan prediksi dengan model Ridge terbaik
predictions_lasso <- predict(model_Lasso, s = best_lambdal, newx = matrix_X)
# Koefisien Ridge
coef(model_Lasso, s = best_lambdal)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -33.9096822
## Hours Studied 2.8321776
## Previous Scores 1.0139862
## Extracurricular Activities 0.5303235
## Sleep Hours 0.4571373
## Sample Question Papers Practiced 0.1725999
# Menghitung R-squared untuk model Lasso
r_squared_lasso <- 1 - sum((matrix_Y - predictions_lasso)^2) / sum((matrix_Y - mean(matrix_Y))^2)
r_squared_lasso
## [1] 0.9886176
Melalui regresi Lasso terlihat bahwa tidak terdapat peubah yang dihilangkan atau semua peubah dimasukkan dalam model. R-Square yang diperoleh dengan regresi ridge adalah \(R−squared:0.9886176\)
Jika dibandingkan dari ketiga model, berdasarkan nilai R-Square, model terbaik adalah menggunakan regresi Lasso dengan \(R−squared:0.9886176\)
Model regresi Lasso yang diperoleh adalah \(Y=-33.9096822+ 2.8321776X_1+1.0139862X_2+0.5303235X_3+0.4571373X_4+0.1725999X_5\)
Model menjelaskan untuk setiap penambahan satu satuan X1, X2, X3, X4, dan X5 dengan peubah lain dianggap tetap akan meningkatkan peubah respon (Performance Index) sebanyak koefisien dari masing-masing peubah.
Sumber data : https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression/data