Pengantar Sains Data

Latar Belakang

Projek penugasan ini menggunakan data sekunder yang berasal dari \(kaggle.com\). Dataset ini bertujuan untuk melihat peubah yang mempengaruhi student performance yang terdiri dari 4292 baris.

Persiapan Data

library(rio)
data <- import("https://raw.githubusercontent.com/AngelMartha/Pengantar-Sains-Data/main/Student_Performance.csv")
head(data)

##   Hours Studied Previous Scores Extracurricular Activities Sleep Hours
## 1             7              99                        Yes           9
## 2             4              82                         No           4
## 3             8              51                        Yes           7
## 4             5              52                        Yes           5
## 5             7              75                         No           8
## 6             3              78                         No           9
##   Sample Question Papers Practiced Performance Index
## 1                                1                91
## 2                                2                65
## 3                                2                45
## 4                                2                36
## 5                                5                66
## 6                                6                61

str(data)

## 'data.frame':    4292 obs. of  6 variables:
##  $ Hours Studied                   : int  7 4 8 5 7 3 7 8 5 4 ...
##  $ Previous Scores                 : int  99 82 51 52 75 78 73 45 77 89 ...
##  $ Extracurricular Activities      : chr  "Yes" "No" "Yes" "Yes" ...
##  $ Sleep Hours                     : int  9 4 7 5 8 9 5 4 8 4 ...
##  $ Sample Question Papers Practiced: int  1 2 2 2 5 6 6 6 2 0 ...
##  $ Performance Index               : num  91 65 45 36 66 61 63 42 61 69 ...

Peubah

x1 <- data$`Hours Studied`
x2 <- data$`Previous Scores`
x3 <- data$`Extracurricular Activities`
x4 <- data$`Sleep Hours`
x5 <- data$`Sample Question Papers Practiced`
y <- data$`Performance Index`

Data cleaning

# Menghapus baris dengan nilai NA
sum(is.na(data))

## [1] 0

data <- na.omit(data)

# Mengubah tipe data peubah Extracurricular Activites menjadi factor
data$`Extracurricular Activities` = as.factor(data$`Extracurricular Activities`)

Eksplorasi Data

# Sebaran peubah Y (Indeks Alibaca)
hist(data$`Performance Index`, col = "palegreen2")

Model Awal

model <- lm(y ~ x1+x2+x3+x4+x5, data=data)
summary(model)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5991 -1.3767 -0.0606  1.3556  7.5570 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -34.054413   0.194540 -175.05   <2e-16 ***
## x1            2.852940   0.012016  237.43   <2e-16 ***
## x2            1.017127   0.001797  566.04   <2e-16 ***
## x3Yes         0.638206   0.062226   10.26   <2e-16 ***
## x4            0.490291   0.018279   26.82   <2e-16 ***
## x5            0.190916   0.010882   17.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.037 on 4286 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9886 
## F-statistic: 7.472e+04 on 5 and 4286 DF,  p-value: < 2.2e-16

Multikolinearitas

car::vif(model)

##       x1       x2       x3       x4       x5 
## 1.001441 1.000649 1.000690 1.000717 1.001927

Pengujian Asumsi

Uji Asumsi Normalitas

##Kolmogorov-Smirnov Test
ks.test(model$residuals, "pnorm", mean=mean(model$residuals), sd=sd(model$residuals))

## Warning in ks.test.default(model$residuals, "pnorm", mean =
## mean(model$residuals), : ties should not be present for the Kolmogorov-Smirnov
## test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  model$residuals
## D = 0.014729, p-value = 0.3095
## alternative hypothesis: two-sided

##Shapiro-Wilk Test
shapiro.test(model$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.99945, p-value = 0.2436

Berdasarkan Kolmogorov-Smirnov test dan Shapiro-Wilk, residual data menyebar normal dengan p-value > 5%.

Uji Asumsi Homoskedastisitas (Gauss Markov)

lmtest::bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 4.7328, df = 5, p-value = 0.4494

Karena p-value > 0.05 maka ragam sisaan homogen atau tidak terdapat masalah heteroskedastisitas

Uji Kebebasan Sisaan (Gauss Markov)

library(randtests)

runs.test(model$residuals)

## 
##  Runs Test
## 
## data:  model$residuals
## statistic = -1.8014, runs = 2088, n1 = 2146, n2 = 2146, n = 4292,
## p-value = 0.07164
## alternative hypothesis: nonrandomness

Karena p-value > 0.05 maka sisaan saling bebas.

Nilai harapan sisaan sama dengan nol (Gauss Markov)

t.test(model$residuals,
       mu = 0,
       conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  model$residuals
## t = 1.0315e-15, df = 4291, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.06093703  0.06093703
## sample estimates:
##    mean of x 
## 3.206208e-17

Karena p-value > 0.05 maka nilai harapan sisaan sama dengan nol

Pemilihan Peubah Penjelas/ Variable Selection

Metode Backward

bmodelselect <- step(model, direction="backward", scope=formula(lm(y ~ x1+x2+x3+x4+x5, data)), trace=1)

## Start:  AIC=6115.35
## y ~ x1 + x2 + x3 + x4 + x5
## 
##        Df Sum of Sq     RSS     AIC
## <none>                17793  6115.4
## - x3    1       437   18229  6217.4
## - x5    1      1278   19070  6411.0
## - x4    1      2987   20779  6779.3
## - x1    1    234018  251811 17486.7
## - x2    1   1330087 1347880 24687.0

summary(bmodelselect)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5991 -1.3767 -0.0606  1.3556  7.5570 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -34.054413   0.194540 -175.05   <2e-16 ***
## x1            2.852940   0.012016  237.43   <2e-16 ***
## x2            1.017127   0.001797  566.04   <2e-16 ***
## x3Yes         0.638206   0.062226   10.26   <2e-16 ***
## x4            0.490291   0.018279   26.82   <2e-16 ***
## x5            0.190916   0.010882   17.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.037 on 4286 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9886 
## F-statistic: 7.472e+04 on 5 and 4286 DF,  p-value: < 2.2e-16

Berdasarkan metode backward, model terbaik adalah model dengan peubah X1, X2, X3(Yes), X4,dan X5 dengan \(R−squared:0.9887\)

Metode Forward

fmodelselect <- step(lm(y ~ 1, data), direction="forward", scope=formula(model), trace=1)

## Start:  AIC=25330.29
## y ~ 1
## 
##        Df Sum of Sq     RSS   AIC
## + x2    1   1310870  257877 17583
## + x1    1    213944 1354803 24703
## + x5    1      4253 1564494 25321
## + x4    1      3341 1565406 25323
## + x3    1      1034 1567713 25330
## <none>              1568747 25330
## 
## Step:  AIC=17582.87
## y ~ x2
## 
##        Df Sum of Sq    RSS     AIC
## + x1    1    235473  22404  7098.5
## + x4    1      2862 255015 17537.0
## + x5    1      2554 255323 17542.1
## + x3    1       537 257340 17575.9
## <none>              257877 17582.9
## 
## Step:  AIC=7098.52
## y ~ x2 + x1
## 
##        Df Sum of Sq   RSS    AIC
## + x4    1   2870.33 19534 6512.1
## + x5    1   1225.63 21179 6859.1
## + x3    1    424.84 21979 7018.4
## <none>              22404 7098.5
## 
## Step:  AIC=6512.09
## y ~ x2 + x1 + x4
## 
##        Df Sum of Sq   RSS    AIC
## + x5    1    1304.6 18229 6217.4
## + x3    1     463.5 19070 6411.0
## <none>              19534 6512.1
## 
## Step:  AIC=6217.42
## y ~ x2 + x1 + x4 + x5
## 
##        Df Sum of Sq   RSS    AIC
## + x3    1    436.68 17793 6115.4
## <none>              18229 6217.4
## 
## Step:  AIC=6115.35
## y ~ x2 + x1 + x4 + x5 + x3

summary(fmodelselect)

## 
## Call:
## lm(formula = y ~ x2 + x1 + x4 + x5 + x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5991 -1.3767 -0.0606  1.3556  7.5570 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -34.054413   0.194540 -175.05   <2e-16 ***
## x2            1.017127   0.001797  566.04   <2e-16 ***
## x1            2.852940   0.012016  237.43   <2e-16 ***
## x4            0.490291   0.018279   26.82   <2e-16 ***
## x5            0.190916   0.010882   17.54   <2e-16 ***
## x3Yes         0.638206   0.062226   10.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.037 on 4286 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9886 
## F-statistic: 7.472e+04 on 5 and 4286 DF,  p-value: < 2.2e-16

Berdasarkan metode forward, model terbaik adalah model dengan peubah X1, X2, X3(Yes), X4,dan X5 dengan \(R−squared:0.9887\). Hal ini sama dengan metode backward.

Metode Stepwise

smodelselect <- step(lm(y ~ 1, data), direction="both", scope=formula(model), trace=1)

## Start:  AIC=25330.29
## y ~ 1
## 
##        Df Sum of Sq     RSS   AIC
## + x2    1   1310870  257877 17583
## + x1    1    213944 1354803 24703
## + x5    1      4253 1564494 25321
## + x4    1      3341 1565406 25323
## + x3    1      1034 1567713 25330
## <none>              1568747 25330
## 
## Step:  AIC=17582.87
## y ~ x2
## 
##        Df Sum of Sq     RSS     AIC
## + x1    1    235473   22404  7098.5
## + x4    1      2862  255015 17537.0
## + x5    1      2554  255323 17542.1
## + x3    1       537  257340 17575.9
## <none>               257877 17582.9
## - x2    1   1310870 1568747 25330.3
## 
## Step:  AIC=7098.52
## y ~ x2 + x1
## 
##        Df Sum of Sq     RSS     AIC
## + x4    1      2870   19534  6512.1
## + x5    1      1226   21179  6859.1
## + x3    1       425   21979  7018.4
## <none>                22404  7098.5
## - x1    1    235473  257877 17582.9
## - x2    1   1332399 1354803 24703.0
## 
## Step:  AIC=6512.09
## y ~ x2 + x1 + x4
## 
##        Df Sum of Sq     RSS     AIC
## + x5    1      1305   18229  6217.4
## + x3    1       464   19070  6411.0
## <none>                19534  6512.1
## - x4    1      2870   22404  7098.5
## - x1    1    235481  255015 17537.0
## - x2    1   1331916 1351450 24694.4
## 
## Step:  AIC=6217.42
## y ~ x2 + x1 + x4 + x5
## 
##        Df Sum of Sq     RSS     AIC
## + x3    1       437   17793  6115.4
## <none>                18229  6217.4
## - x5    1      1305   19534  6512.1
## - x4    1      2949   21179  6859.1
## - x1    1    234118  252348 17493.8
## - x2    1   1330543 1348772 24687.8
## 
## Step:  AIC=6115.35
## y ~ x2 + x1 + x4 + x5 + x3
## 
##        Df Sum of Sq     RSS     AIC
## <none>                17793  6115.4
## - x3    1       437   18229  6217.4
## - x5    1      1278   19070  6411.0
## - x4    1      2987   20779  6779.3
## - x1    1    234018  251811 17486.7
## - x2    1   1330087 1347880 24687.0

summary(smodelselect)

## 
## Call:
## lm(formula = y ~ x2 + x1 + x4 + x5 + x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5991 -1.3767 -0.0606  1.3556  7.5570 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -34.054413   0.194540 -175.05   <2e-16 ***
## x2            1.017127   0.001797  566.04   <2e-16 ***
## x1            2.852940   0.012016  237.43   <2e-16 ***
## x4            0.490291   0.018279   26.82   <2e-16 ***
## x5            0.190916   0.010882   17.54   <2e-16 ***
## x3Yes         0.638206   0.062226   10.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.037 on 4286 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9886 
## F-statistic: 7.472e+04 on 5 and 4286 DF,  p-value: < 2.2e-16

Hasil stepwise menunjukkan hasil yang sama dengan metode backward dan forward selection, yaitu model terbaik dengan peubah X1, X2, X3(Yes), X4,dan X5 dengan \(R−squared:0.9887\).

Ridge Regression

Matriks untuk regresi ridge

matrix_X <- data.matrix(data[, -6])
matrix_Y <- matrix(data$`Performance Index`)  # Assuming 'y' is a single-column vector

alpha_ridge = 0
model_Ridge <- glmnet::cv.glmnet(matrix_X,matrix_Y,alpha=alpha_ridge)
summary(model_Ridge)

##            Length Class  Mode     
## lambda     100    -none- numeric  
## cvm        100    -none- numeric  
## cvsd       100    -none- numeric  
## cvup       100    -none- numeric  
## cvlo       100    -none- numeric  
## nzero      100    -none- numeric  
## call         4    -none- call     
## name         1    -none- character
## glmnet.fit  12    elnet  list     
## lambda.min   1    -none- numeric  
## lambda.1se   1    -none- numeric  
## index        2    -none- numeric

# Hasil Regresi Ridge
print(model_Ridge)

## 
## Call:  glmnet::cv.glmnet(x = matrix_X, y = matrix_Y, alpha = alpha_ridge) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure     SE Nonzero
## min  1.748   100   6.763 0.2206       5
## 1se  1.748   100   6.763 0.2206       5

# Memilih nilai lambda terbaik
best_lambda <- model_Ridge$lambda.min
cat("Lambda terbaik:", best_lambda, "\n")

## Lambda terbaik: 1.747632

# Melakukan prediksi dengan model Ridge terbaik
predictions <- predict(model_Ridge, s = best_lambda, newx = matrix_X)

# koefisien Ridge
coef(model_Ridge, s = best_lambda)

## 6 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                      -27.1794272
## Hours Studied                      2.6037407
## Previous Scores                    0.9312880
## Extracurricular Activities         0.6104807
## Sleep Hours                        0.4519084
## Sample Question Papers Practiced   0.1872488

# R-squared untuk model Ridge
r_squared_ridge <- 1 - sum((matrix_Y - predictions)^2) / sum((matrix_Y - mean(matrix_Y))^2)
r_squared_ridge

## [1] 0.9815616

Pada regresi Ridge, tidak terdapat peubah yang dihilangkan atau semua peubah dimasukkan dalam model. R-Square yang diperoleh dengan regresi ridge adalah \(R−squared:0.9815616\)

Regresi Lasso

alpha_Lasso = 1
model_Lasso <- glmnet::cv.glmnet(matrix_X,matrix_Y,alpha=alpha_Lasso)
model_Lasso

## 
## Call:  glmnet::cv.glmnet(x = matrix_X, y = matrix_Y, alpha = alpha_Lasso) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.05463    63   4.173 0.07047       5
## 1se 0.12619    54   4.237 0.06430       5

Hasil regresi Lasso

print(model_Lasso)

## 
## Call:  glmnet::cv.glmnet(x = matrix_X, y = matrix_Y, alpha = alpha_Lasso) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.05463    63   4.173 0.07047       5
## 1se 0.12619    54   4.237 0.06430       5

# Memilih nilai lambda terbaik
best_lambdal <- model_Lasso$lambda.min
cat("Lambda terbaik:", best_lambdal, "\n")

## Lambda terbaik: 0.05462602

# Melakukan prediksi dengan model Ridge terbaik
predictions_lasso <- predict(model_Lasso, s = best_lambdal, newx = matrix_X)

# Koefisien Ridge
coef(model_Lasso, s = best_lambdal)

## 6 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                      -33.9096822
## Hours Studied                      2.8321776
## Previous Scores                    1.0139862
## Extracurricular Activities         0.5303235
## Sleep Hours                        0.4571373
## Sample Question Papers Practiced   0.1725999

# Menghitung R-squared untuk model Lasso
r_squared_lasso <- 1 - sum((matrix_Y - predictions_lasso)^2) / sum((matrix_Y - mean(matrix_Y))^2)
r_squared_lasso

## [1] 0.9886176

Melalui regresi Lasso terlihat bahwa tidak terdapat peubah yang dihilangkan atau semua peubah dimasukkan dalam model. R-Square yang diperoleh dengan regresi ridge adalah \(R−squared:0.9886176\)

Interpretasi

Jika dibandingkan dari ketiga model, berdasarkan nilai R-Square, model terbaik adalah menggunakan regresi Lasso dengan \(R−squared:0.9886176\)

Model regresi Lasso yang diperoleh adalah \(Y=-33.9096822+ 2.8321776X_1+1.0139862X_2+0.5303235X_3+0.4571373X_4+0.1725999X_5\)

Model menjelaskan untuk setiap penambahan satu satuan X1, X2, X3, X4, dan X5 dengan peubah lain dianggap tetap akan meningkatkan peubah respon (Performance Index) sebanyak koefisien dari masing-masing peubah.

Sumber data : https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression/data