Supervised Learning (Regresi): OLS, Seleksi Variabel, dan Regularisasi
Pendahuluan
Praktikum ini menunjukkan alur end-to-end supervised learning (regresi) menggunakan data Hitters dari buku An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani — JWHT/ISLR).
Cakupan:
- Train–Test Split (base R)
- Model Linear OLS (baseline)
- Seleksi Variabel: Forward, Backward, Stepwise
(step())
- Regularisasi: Ridge, LASSO, Elastic Net (paket
glmnet)
- Evaluasi di Test Set: RMSE, MAE, R²
Paket & Data
# Paket minimal
# ISLR: menyediakan dataset Hitters
# glmnet: untuk ridge, lasso, elastic net
if (!requireNamespace("ISLR", quietly = TRUE)) install.packages("ISLR", repos = "https://cloud.r-project.org")
if (!requireNamespace("glmnet", quietly = TRUE)) install.packages("glmnet", repos = "https://cloud.r-project.org")
library(ISLR)
library(glmnet)Dataset Hitters berisi data statistik pemain bisbol
Major League Baseball (MLB) musim 1986–1987
Berikut penjelasan setiap variabelnya:
AtBat: jumlah pukulan (at-bats) yang dilakukan pemain selama musim tersebutHits: jumlah pukulan berhasil (hits) yang dilakukan pemain selama musim tersebutHmRun: jumlah home runs yang dicetak pemain pada musim tersebutRuns: jumlah total runs yang dicetak oleh pemain pada musim tersebutRBI: Runs Batted In — jumlah runs yang dihasilkan oleh pukulan pemainWalks: jumlah kali pemain berhasil mencapai base melalui walks (tanpa memukul bola)Years: jumlah tahun pemain telah bermain di Major League BaseballCAtBat: jumlah kumulatif at-bats (pukulan) selama karier pemainCHits: jumlah kumulatif hits selama karier pemainCHmRun: jumlah kumulatif home runs selama karier pemainCRuns: jumlah kumulatif runs selama karier pemainCRBI: jumlah kumulatif runs batted in selama karier pemainCWalks: jumlah kumulatif walks selama karier pemainLeague: liga tempat pemain bermain pada musim sebelumnya (A= American League,N= National League).Division: divisi tempat pemain bermain pada musim tersebut (E= East,W= West)PutOuts: jumlah put outs — aksi pertahanan di mana pemain mengeluarkan lawan dari permainanAssists: jumlah assists — kontribusi pemain dalam membantu terjadinya put outErrors: jumlah kesalahan (errors) yang dilakukan pemain dalam bertahanSalary: gaji pemain pada tahun 1987 (dalam ribuan dolar AS)NewLeague: liga tempat pemain bermain pada musim berikutnya (AatauN)
# Muat data & bersihkan NA pada target
data("Hitters", package = "ISLR")
hitters <- na.omit(Hitters)
# Tampilkan ringkas data
str(hitters)## 'data.frame': 263 obs. of 20 variables:
## $ AtBat : int 315 479 496 321 594 185 298 323 401 574 ...
## $ Hits : int 81 130 141 87 169 37 73 81 92 159 ...
## $ HmRun : int 7 18 20 10 4 1 0 6 17 21 ...
## $ Runs : int 24 66 65 39 74 23 24 26 49 107 ...
## $ RBI : int 38 72 78 42 51 8 24 32 66 75 ...
## $ Walks : int 39 76 37 30 35 21 7 8 65 59 ...
## $ Years : int 14 3 11 2 11 2 3 2 13 10 ...
## $ CAtBat : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
## $ CHits : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
## $ CHmRun : int 69 63 225 12 19 1 0 6 253 90 ...
## $ CRuns : int 321 224 828 48 501 30 41 32 784 702 ...
## $ CRBI : int 414 266 838 46 336 9 37 34 890 504 ...
## $ CWalks : int 375 263 354 33 194 24 12 8 866 488 ...
## $ League : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
## $ Division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
## $ PutOuts : int 632 880 200 805 282 76 121 143 0 238 ...
## $ Assists : int 43 82 11 40 421 127 283 290 0 445 ...
## $ Errors : int 10 14 3 4 25 7 9 19 0 22 ...
## $ Salary : num 475 480 500 91.5 750 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:59] 1 16 19 23 31 33 37 39 40 42 ...
## ..- attr(*, "names")= chr [1:59] "-Andy Allanson" "-Billy Beane" "-Bruce Bochte" "-Bob Boone" ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.5 190.0 425.0 535.9 750.0 2460.0
Train–Test Split (Base R)
set.seed(123)
n <- nrow(hitters)
idx <- sample.int(n, size = floor(0.8 * n)) # 80% training
train <- hitters[idx, ]
test <- hitters[-idx, ]
dim(train); dim(test)## [1] 210 20
## [1] 53 20
Baseline: Model Linear OLS
##
## Call:
## lm(formula = Salary ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -753.89 -173.98 -11.87 159.32 1815.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 207.47345 101.18477 2.050 0.04169 *
## AtBat -1.68499 0.75317 -2.237 0.02644 *
## Hits 5.06495 2.89376 1.750 0.08168 .
## HmRun -2.68840 6.86520 -0.392 0.69579
## Runs -0.95353 3.41442 -0.279 0.78034
## RBI 1.41046 3.00406 0.470 0.63924
## Walks 6.35620 2.11308 3.008 0.00299 **
## Years -13.14742 14.12616 -0.931 0.35318
## CAtBat -0.25953 0.15700 -1.653 0.09996 .
## CHits 0.65158 0.80755 0.807 0.42076
## CHmRun 1.29483 1.86077 0.696 0.48737
## CRuns 1.25425 0.89170 1.407 0.16119
## CRBI 0.46318 0.84092 0.551 0.58242
## CWalks -0.68669 0.39171 -1.753 0.08121 .
## LeagueN 6.62438 91.91248 0.072 0.94262
## DivisionW -120.30537 45.33765 -2.654 0.00864 **
## PutOuts 0.19444 0.08678 2.240 0.02622 *
## Assists 0.51924 0.25194 2.061 0.04067 *
## Errors -7.33195 4.88072 -1.502 0.13470
## NewLeagueN 61.30716 91.82558 0.668 0.50517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.7 on 190 degrees of freedom
## Multiple R-squared: 0.6014, Adjusted R-squared: 0.5616
## F-statistic: 15.09 on 19 and 190 DF, p-value: < 2.2e-16
# Fungsi metrik evaluasi berbasis base R
rmse <- function(y, yhat) mean((y - yhat)^2)^0.5
mae <- function(y, yhat) mean(abs(y - yhat))
r2 <- function(y, yhat) cor(y, yhat)^2pred_ols <- predict(ols, newdata = test)
res_ols <- c(
RMSE = rmse(test$Salary, pred_ols),
MAE = mae(test$Salary, pred_ols),
R2 = r2(test$Salary, pred_ols)
)
res_ols## RMSE MAE R2
## 371.6850949 260.4381260 0.3028332
Seleksi Variabel
# Model kosong & penuh
m_null <- lm(Salary ~ 1, data = train)
m_full <- lm(Salary ~ ., data = train)
# Forward
step_forward <- step(m_null, scope = list(lower = m_null, upper = m_full),
direction = "forward", trace = 0)
# Backward
step_backward <- step(m_full, direction = "backward", trace = 0)
# Stepwise (dua arah)
step_both <- step(m_null, scope = list(lower = m_null, upper = m_full),
direction = "both", trace = 0)##
## Call:
## lm(formula = Salary ~ CRuns + Walks + CWalks + Division + CRBI +
## CAtBat + CHits + PutOuts + NewLeague, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -649.69 -182.02 -5.03 148.21 1959.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.08546 65.80910 1.247 0.213736
## CRuns 1.33824 0.54416 2.459 0.014770 *
## Walks 5.66119 1.32439 4.275 2.96e-05 ***
## CWalks -0.56861 0.29433 -1.932 0.054787 .
## DivisionW -124.70888 44.58535 -2.797 0.005660 **
## CRBI 0.79409 0.26432 3.004 0.003003 **
## CAtBat -0.41104 0.11971 -3.434 0.000723 ***
## CHits 0.99804 0.47977 2.080 0.038777 *
## PutOuts 0.14026 0.08387 1.672 0.096020 .
## NewLeagueN 71.73976 44.70648 1.605 0.110141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 312.3 on 200 degrees of freedom
## Multiple R-squared: 0.5789, Adjusted R-squared: 0.5599
## F-statistic: 30.55 on 9 and 200 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CHits +
## CHmRun + CRuns + CWalks + Division + PutOuts + Assists +
## Errors + NewLeague, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -770.98 -178.95 -3.66 138.62 1830.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 151.37091 79.91465 1.894 0.05968 .
## AtBat -1.53935 0.71874 -2.142 0.03345 *
## Hits 4.69395 2.26305 2.074 0.03937 *
## Walks 6.13058 1.84524 3.322 0.00106 **
## CAtBat -0.35200 0.13646 -2.580 0.01062 *
## CHits 1.14248 0.52879 2.161 0.03194 *
## CHmRun 2.30822 0.61691 3.742 0.00024 ***
## CRuns 0.92695 0.57732 1.606 0.10997
## CWalks -0.59690 0.33913 -1.760 0.07996 .
## DivisionW -120.56432 44.13464 -2.732 0.00688 **
## PutOuts 0.19750 0.08569 2.305 0.02222 *
## Assists 0.56187 0.23486 2.392 0.01768 *
## Errors -7.19408 4.73640 -1.519 0.13040
## NewLeagueN 71.80602 44.61091 1.610 0.10909
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 308.6 on 196 degrees of freedom
## Multiple R-squared: 0.5971, Adjusted R-squared: 0.5704
## F-statistic: 22.34 on 13 and 196 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Salary ~ CRuns + Walks + CWalks + Division + CRBI +
## CAtBat + CHits + PutOuts + NewLeague, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -649.69 -182.02 -5.03 148.21 1959.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.08546 65.80910 1.247 0.213736
## CRuns 1.33824 0.54416 2.459 0.014770 *
## Walks 5.66119 1.32439 4.275 2.96e-05 ***
## CWalks -0.56861 0.29433 -1.932 0.054787 .
## DivisionW -124.70888 44.58535 -2.797 0.005660 **
## CRBI 0.79409 0.26432 3.004 0.003003 **
## CAtBat -0.41104 0.11971 -3.434 0.000723 ***
## CHits 0.99804 0.47977 2.080 0.038777 *
## PutOuts 0.14026 0.08387 1.672 0.096020 .
## NewLeagueN 71.73976 44.70648 1.605 0.110141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 312.3 on 200 degrees of freedom
## Multiple R-squared: 0.5789, Adjusted R-squared: 0.5599
## F-statistic: 30.55 on 9 and 200 DF, p-value: < 2.2e-16
# Evaluasi masing-masing model di test set
eval_model <- function(model, newdata) {
p <- predict(model, newdata = newdata)
c(RMSE = rmse(newdata$Salary, p),
MAE = mae(newdata$Salary, p),
R2 = r2(newdata$Salary, p))
}
res_models <- rbind(
OLS = res_ols,
Forward = eval_model(step_forward, test),
Backward = eval_model(step_backward, test),
Stepwise = eval_model(step_both, test)
)
round(res_models, 3)## RMSE MAE R2
## OLS 371.685 260.438 0.303
## Forward 379.426 268.963 0.271
## Backward 367.065 259.246 0.319
## Stepwise 379.426 268.963 0.271
Regularisasi
Untuk
glmnet, kita perlu memisahkan X (matriks prediktor) dan y (vektor target), serta membuat dummy untuk variabel kategorikal menggunakanmodel.matrix().
# Matriks model tanpa intercept (kolom pertama adalah intercept, jadi kita drop)
x_train <- model.matrix(Salary ~ ., data = train)[, -1]
y_train <- train$Salary
x_test <- model.matrix(Salary ~ ., data = test)[, -1]
y_test <- test$Salary
dim(x_train); dim(x_test)## [1] 210 19
## [1] 53 19
Ridge Regression (alpha = 0)
set.seed(1234)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
lambda_ridge <- cv_ridge$lambda.min
ridge_fit <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_ridge)
pred_ridge <- as.numeric(predict(ridge_fit, s = lambda_ridge, newx = x_test))
res_ridge <- c(RMSE = rmse(y_test, pred_ridge),
MAE = mae(y_test, pred_ridge),
R2 = r2(y_test, pred_ridge))
res_ridge## RMSE MAE R2
## 363.9871284 262.9056758 0.2580805
LASSO (alpha = 1)
set.seed(1234)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
lambda_lasso <- cv_lasso$lambda.min
lasso_fit <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_lasso)
pred_lasso <- as.numeric(predict(lasso_fit, s = lambda_lasso, newx = x_test))
res_lasso <- c(RMSE = rmse(y_test, pred_lasso),
MAE = mae(y_test, pred_lasso),
R2 = r2(y_test, pred_lasso))
res_lasso## RMSE MAE R2
## 362.1504895 258.3503619 0.2906963
Variabel Terpilih LASSO (koefisien ≠ 0)
coef_lasso <- coef(lasso_fit, s = lambda_lasso)
selected <- rownames(coef_lasso)[as.vector(coef_lasso != 0)]
selected <- setdiff(selected, "(Intercept)")
selected## [1] "AtBat" "Hits" "HmRun" "RBI" "Walks"
## [6] "Years" "CHmRun" "CRuns" "CRBI" "CWalks"
## [11] "DivisionW" "PutOuts" "Assists" "Errors" "NewLeagueN"
Elastic Net (0 < alpha < 1)
set.seed(1234)
alphas <- seq(0.1, 0.9, by = 0.2)
cv_list <- vector("list", length(alphas))
names(cv_list) <- paste0("alpha_", alphas)
best <- list(alpha = NA, lambda = NA, mse = Inf)
for (i in seq_along(alphas)) {
a <- alphas[i]
cv_fit <- cv.glmnet(x_train, y_train, alpha = a, nfolds = 10)
cv_list[[i]] <- cv_fit
if (cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.min] < best$mse) {
best$alpha = a
best$lambda = cv_fit$lambda.min
best$mse = cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.min]
}
}
best## $alpha
## [1] 0.3
##
## $lambda
## [1] 1.212266
##
## $mse
## [1] 107546
enet_fit <- glmnet(x_train, y_train, alpha = best$alpha, lambda = best$lambda)
pred_enet <- as.numeric(predict(enet_fit, s = best$lambda, newx = x_test))
res_enet <- c(RMSE = rmse(y_test, pred_enet),
MAE = mae(y_test, pred_enet),
R2 = r2(y_test, pred_enet))
res_enet## RMSE MAE R2
## 368.1461131 259.9775333 0.2967051
Perbandingan Akhir (Test Set)
final_res <- rbind(
OLS = res_ols,
Forward = eval_model(step_forward, test),
Backward= eval_model(step_backward, test),
Stepwise= eval_model(step_both, test),
Ridge = res_ridge,
LASSO = res_lasso,
ElasticNet = res_enet
)
round(final_res[order(final_res[, "RMSE"]), ], 3)## RMSE MAE R2
## LASSO 362.150 258.350 0.291
## Ridge 363.987 262.906 0.258
## Backward 367.065 259.246 0.319
## ElasticNet 368.146 259.978 0.297
## OLS 371.685 260.438 0.303
## Forward 379.426 268.963 0.271
## Stepwise 379.426 268.963 0.271
Catatan Penutup
- OLS sebagai baseline interpretatif
- Seleksi variabel dengan
step()(AIC) membantu model parsimonious, namun sensitif terhadap sampel
- Regularisasi (glmnet) menstabilkan model dan dapat
melakukan seleksi otomatis (LASSO) atau
kombinasi shrinkage+seleksi (Elastic Net)
- Evaluasi akhir menggunakan test
set, bukan training
- Untuk data deret waktu, gunakan pembagian berdasarkan waktu, bukan acak