Supervised Learning (Regresi): OLS, Seleksi Variabel, dan Regularisasi

Pendahuluan

Praktikum ini menunjukkan alur end-to-end supervised learning (regresi) menggunakan data Hitters dari buku An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani — JWHT/ISLR).

Cakupan:

  1. Train–Test Split (base R)
  2. Model Linear OLS (baseline)
  3. Seleksi Variabel: Forward, Backward, Stepwise (step())
  4. Regularisasi: Ridge, LASSO, Elastic Net (paket glmnet)
  5. Evaluasi di Test Set: RMSE, MAE, R²

Paket & Data

# Paket minimal
# ISLR: menyediakan dataset Hitters
# glmnet: untuk ridge, lasso, elastic net
if (!requireNamespace("ISLR", quietly = TRUE)) install.packages("ISLR", repos = "https://cloud.r-project.org")
if (!requireNamespace("glmnet", quietly = TRUE)) install.packages("glmnet", repos = "https://cloud.r-project.org")

library(ISLR)
library(glmnet)

Dataset Hitters berisi data statistik pemain bisbol Major League Baseball (MLB) musim 1986–1987
Berikut penjelasan setiap variabelnya:

  • AtBat: jumlah pukulan (at-bats) yang dilakukan pemain selama musim tersebut

  • Hits: jumlah pukulan berhasil (hits) yang dilakukan pemain selama musim tersebut

  • HmRun: jumlah home runs yang dicetak pemain pada musim tersebut

  • Runs: jumlah total runs yang dicetak oleh pemain pada musim tersebut

  • RBI: Runs Batted In — jumlah runs yang dihasilkan oleh pukulan pemain

  • Walks: jumlah kali pemain berhasil mencapai base melalui walks (tanpa memukul bola)

  • Years: jumlah tahun pemain telah bermain di Major League Baseball

  • CAtBat: jumlah kumulatif at-bats (pukulan) selama karier pemain

  • CHits: jumlah kumulatif hits selama karier pemain

  • CHmRun: jumlah kumulatif home runs selama karier pemain

  • CRuns: jumlah kumulatif runs selama karier pemain

  • CRBI: jumlah kumulatif runs batted in selama karier pemain

  • CWalks: jumlah kumulatif walks selama karier pemain

  • League: liga tempat pemain bermain pada musim sebelumnya (A = American League, N = National League).

  • Division: divisi tempat pemain bermain pada musim tersebut (E = East, W = West)

  • PutOuts: jumlah put outs — aksi pertahanan di mana pemain mengeluarkan lawan dari permainan

  • Assists: jumlah assists — kontribusi pemain dalam membantu terjadinya put out

  • Errors: jumlah kesalahan (errors) yang dilakukan pemain dalam bertahan

  • Salary: gaji pemain pada tahun 1987 (dalam ribuan dolar AS)

  • NewLeague: liga tempat pemain bermain pada musim berikutnya (A atau N)

# Muat data & bersihkan NA pada target
data("Hitters", package = "ISLR")
hitters <- na.omit(Hitters)

# Tampilkan ringkas data
str(hitters)
## 'data.frame':    263 obs. of  20 variables:
##  $ AtBat    : int  315 479 496 321 594 185 298 323 401 574 ...
##  $ Hits     : int  81 130 141 87 169 37 73 81 92 159 ...
##  $ HmRun    : int  7 18 20 10 4 1 0 6 17 21 ...
##  $ Runs     : int  24 66 65 39 74 23 24 26 49 107 ...
##  $ RBI      : int  38 72 78 42 51 8 24 32 66 75 ...
##  $ Walks    : int  39 76 37 30 35 21 7 8 65 59 ...
##  $ Years    : int  14 3 11 2 11 2 3 2 13 10 ...
##  $ CAtBat   : int  3449 1624 5628 396 4408 214 509 341 5206 4631 ...
##  $ CHits    : int  835 457 1575 101 1133 42 108 86 1332 1300 ...
##  $ CHmRun   : int  69 63 225 12 19 1 0 6 253 90 ...
##  $ CRuns    : int  321 224 828 48 501 30 41 32 784 702 ...
##  $ CRBI     : int  414 266 838 46 336 9 37 34 890 504 ...
##  $ CWalks   : int  375 263 354 33 194 24 12 8 866 488 ...
##  $ League   : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
##  $ PutOuts  : int  632 880 200 805 282 76 121 143 0 238 ...
##  $ Assists  : int  43 82 11 40 421 127 283 290 0 445 ...
##  $ Errors   : int  10 14 3 4 25 7 9 19 0 22 ...
##  $ Salary   : num  475 480 500 91.5 750 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:59] 1 16 19 23 31 33 37 39 40 42 ...
##   ..- attr(*, "names")= chr [1:59] "-Andy Allanson" "-Billy Beane" "-Bruce Bochte" "-Bob Boone" ...
summary(hitters$Salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    67.5   190.0   425.0   535.9   750.0  2460.0

Train–Test Split (Base R)

set.seed(123)
n <- nrow(hitters)
idx <- sample.int(n, size = floor(0.8 * n))  # 80% training
train <- hitters[idx, ]
test  <- hitters[-idx, ]

dim(train); dim(test)
## [1] 210  20
## [1] 53 20

Baseline: Model Linear OLS

ols <- lm(Salary ~ ., data = train)
summary(ols)
## 
## Call:
## lm(formula = Salary ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -753.89 -173.98  -11.87  159.32 1815.62 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  207.47345  101.18477   2.050  0.04169 * 
## AtBat         -1.68499    0.75317  -2.237  0.02644 * 
## Hits           5.06495    2.89376   1.750  0.08168 . 
## HmRun         -2.68840    6.86520  -0.392  0.69579   
## Runs          -0.95353    3.41442  -0.279  0.78034   
## RBI            1.41046    3.00406   0.470  0.63924   
## Walks          6.35620    2.11308   3.008  0.00299 **
## Years        -13.14742   14.12616  -0.931  0.35318   
## CAtBat        -0.25953    0.15700  -1.653  0.09996 . 
## CHits          0.65158    0.80755   0.807  0.42076   
## CHmRun         1.29483    1.86077   0.696  0.48737   
## CRuns          1.25425    0.89170   1.407  0.16119   
## CRBI           0.46318    0.84092   0.551  0.58242   
## CWalks        -0.68669    0.39171  -1.753  0.08121 . 
## LeagueN        6.62438   91.91248   0.072  0.94262   
## DivisionW   -120.30537   45.33765  -2.654  0.00864 **
## PutOuts        0.19444    0.08678   2.240  0.02622 * 
## Assists        0.51924    0.25194   2.061  0.04067 * 
## Errors        -7.33195    4.88072  -1.502  0.13470   
## NewLeagueN    61.30716   91.82558   0.668  0.50517   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 311.7 on 190 degrees of freedom
## Multiple R-squared:  0.6014, Adjusted R-squared:  0.5616 
## F-statistic: 15.09 on 19 and 190 DF,  p-value: < 2.2e-16
# Fungsi metrik evaluasi berbasis base R
rmse <- function(y, yhat) mean((y - yhat)^2)^0.5
mae <- function(y, yhat) mean(abs(y - yhat))
r2  <- function(y, yhat) cor(y, yhat)^2
pred_ols <- predict(ols, newdata = test)
res_ols <- c(
  RMSE = rmse(test$Salary, pred_ols),
  MAE = mae(test$Salary, pred_ols),
  R2  = r2(test$Salary, pred_ols)
)
res_ols
##        RMSE         MAE          R2 
## 371.6850949 260.4381260   0.3028332

Seleksi Variabel

# Model kosong & penuh
m_null <- lm(Salary ~ 1, data = train)
m_full <- lm(Salary ~ ., data = train)

# Forward
step_forward  <- step(m_null, scope = list(lower = m_null, upper = m_full),
                      direction = "forward", trace = 0)

# Backward
step_backward <- step(m_full, direction = "backward", trace = 0)

# Stepwise (dua arah)
step_both     <- step(m_null, scope = list(lower = m_null, upper = m_full),
                      direction = "both", trace = 0)
summary(step_forward)
## 
## Call:
## lm(formula = Salary ~ CRuns + Walks + CWalks + Division + CRBI + 
##     CAtBat + CHits + PutOuts + NewLeague, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -649.69 -182.02   -5.03  148.21 1959.12 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   82.08546   65.80910   1.247 0.213736    
## CRuns          1.33824    0.54416   2.459 0.014770 *  
## Walks          5.66119    1.32439   4.275 2.96e-05 ***
## CWalks        -0.56861    0.29433  -1.932 0.054787 .  
## DivisionW   -124.70888   44.58535  -2.797 0.005660 ** 
## CRBI           0.79409    0.26432   3.004 0.003003 ** 
## CAtBat        -0.41104    0.11971  -3.434 0.000723 ***
## CHits          0.99804    0.47977   2.080 0.038777 *  
## PutOuts        0.14026    0.08387   1.672 0.096020 .  
## NewLeagueN    71.73976   44.70648   1.605 0.110141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 312.3 on 200 degrees of freedom
## Multiple R-squared:  0.5789, Adjusted R-squared:  0.5599 
## F-statistic: 30.55 on 9 and 200 DF,  p-value: < 2.2e-16
summary(step_backward)
## 
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CHits + 
##     CHmRun + CRuns + CWalks + Division + PutOuts + Assists + 
##     Errors + NewLeague, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -770.98 -178.95   -3.66  138.62 1830.76 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  151.37091   79.91465   1.894  0.05968 .  
## AtBat         -1.53935    0.71874  -2.142  0.03345 *  
## Hits           4.69395    2.26305   2.074  0.03937 *  
## Walks          6.13058    1.84524   3.322  0.00106 ** 
## CAtBat        -0.35200    0.13646  -2.580  0.01062 *  
## CHits          1.14248    0.52879   2.161  0.03194 *  
## CHmRun         2.30822    0.61691   3.742  0.00024 ***
## CRuns          0.92695    0.57732   1.606  0.10997    
## CWalks        -0.59690    0.33913  -1.760  0.07996 .  
## DivisionW   -120.56432   44.13464  -2.732  0.00688 ** 
## PutOuts        0.19750    0.08569   2.305  0.02222 *  
## Assists        0.56187    0.23486   2.392  0.01768 *  
## Errors        -7.19408    4.73640  -1.519  0.13040    
## NewLeagueN    71.80602   44.61091   1.610  0.10909    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 308.6 on 196 degrees of freedom
## Multiple R-squared:  0.5971, Adjusted R-squared:  0.5704 
## F-statistic: 22.34 on 13 and 196 DF,  p-value: < 2.2e-16
summary(step_both)
## 
## Call:
## lm(formula = Salary ~ CRuns + Walks + CWalks + Division + CRBI + 
##     CAtBat + CHits + PutOuts + NewLeague, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -649.69 -182.02   -5.03  148.21 1959.12 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   82.08546   65.80910   1.247 0.213736    
## CRuns          1.33824    0.54416   2.459 0.014770 *  
## Walks          5.66119    1.32439   4.275 2.96e-05 ***
## CWalks        -0.56861    0.29433  -1.932 0.054787 .  
## DivisionW   -124.70888   44.58535  -2.797 0.005660 ** 
## CRBI           0.79409    0.26432   3.004 0.003003 ** 
## CAtBat        -0.41104    0.11971  -3.434 0.000723 ***
## CHits          0.99804    0.47977   2.080 0.038777 *  
## PutOuts        0.14026    0.08387   1.672 0.096020 .  
## NewLeagueN    71.73976   44.70648   1.605 0.110141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 312.3 on 200 degrees of freedom
## Multiple R-squared:  0.5789, Adjusted R-squared:  0.5599 
## F-statistic: 30.55 on 9 and 200 DF,  p-value: < 2.2e-16
# Evaluasi masing-masing model di test set
eval_model <- function(model, newdata) {
  p <- predict(model, newdata = newdata)
  c(RMSE = rmse(newdata$Salary, p),
    MAE = mae(newdata$Salary, p),
    R2  = r2(newdata$Salary, p))
}

res_models <- rbind(
  OLS       = res_ols,
  Forward   = eval_model(step_forward,  test),
  Backward  = eval_model(step_backward, test),
  Stepwise  = eval_model(step_both,     test)
)
round(res_models, 3)
##             RMSE     MAE    R2
## OLS      371.685 260.438 0.303
## Forward  379.426 268.963 0.271
## Backward 367.065 259.246 0.319
## Stepwise 379.426 268.963 0.271

Regularisasi

Untuk glmnet, kita perlu memisahkan X (matriks prediktor) dan y (vektor target), serta membuat dummy untuk variabel kategorikal menggunakan model.matrix().

# Matriks model tanpa intercept (kolom pertama adalah intercept, jadi kita drop)
x_train <- model.matrix(Salary ~ ., data = train)[, -1]
y_train <- train$Salary

x_test  <- model.matrix(Salary ~ ., data = test)[, -1]
y_test  <- test$Salary

dim(x_train); dim(x_test)
## [1] 210  19
## [1] 53 19

Ridge Regression (alpha = 0)

set.seed(1234)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
lambda_ridge <- cv_ridge$lambda.min

ridge_fit <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_ridge)
pred_ridge <- as.numeric(predict(ridge_fit, s = lambda_ridge, newx = x_test))

res_ridge <- c(RMSE = rmse(y_test, pred_ridge),
               MAE = mae(y_test, pred_ridge),
               R2  = r2(y_test, pred_ridge))
res_ridge
##        RMSE         MAE          R2 
## 363.9871284 262.9056758   0.2580805

LASSO (alpha = 1)

set.seed(1234)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
lambda_lasso <- cv_lasso$lambda.min

lasso_fit <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_lasso)
pred_lasso <- as.numeric(predict(lasso_fit, s = lambda_lasso, newx = x_test))

res_lasso <- c(RMSE = rmse(y_test, pred_lasso),
               MAE = mae(y_test, pred_lasso),
               R2  = r2(y_test, pred_lasso))
res_lasso
##        RMSE         MAE          R2 
## 362.1504895 258.3503619   0.2906963

Variabel Terpilih LASSO (koefisien ≠ 0)

coef_lasso <- coef(lasso_fit, s = lambda_lasso)
selected <- rownames(coef_lasso)[as.vector(coef_lasso != 0)]
selected <- setdiff(selected, "(Intercept)")
selected
##  [1] "AtBat"      "Hits"       "HmRun"      "RBI"        "Walks"     
##  [6] "Years"      "CHmRun"     "CRuns"      "CRBI"       "CWalks"    
## [11] "DivisionW"  "PutOuts"    "Assists"    "Errors"     "NewLeagueN"

Elastic Net (0 < alpha < 1)

set.seed(1234)
alphas <- seq(0.1, 0.9, by = 0.2)
cv_list <- vector("list", length(alphas))
names(cv_list) <- paste0("alpha_", alphas)

best <- list(alpha = NA, lambda = NA, mse = Inf)

for (i in seq_along(alphas)) {
  a <- alphas[i]
  cv_fit <- cv.glmnet(x_train, y_train, alpha = a, nfolds = 10)
  cv_list[[i]] <- cv_fit
  if (cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.min] < best$mse) {
    best$alpha  = a
    best$lambda = cv_fit$lambda.min
    best$mse    = cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.min]
  }
}
best
## $alpha
## [1] 0.3
## 
## $lambda
## [1] 1.212266
## 
## $mse
## [1] 107546
enet_fit <- glmnet(x_train, y_train, alpha = best$alpha, lambda = best$lambda)
pred_enet <- as.numeric(predict(enet_fit, s = best$lambda, newx = x_test))

res_enet <- c(RMSE = rmse(y_test, pred_enet),
              MAE = mae(y_test, pred_enet),
              R2  = r2(y_test, pred_enet))
res_enet
##        RMSE         MAE          R2 
## 368.1461131 259.9775333   0.2967051

Perbandingan Akhir (Test Set)

final_res <- rbind(
  OLS     = res_ols,
  Forward = eval_model(step_forward,  test),
  Backward= eval_model(step_backward, test),
  Stepwise= eval_model(step_both,     test),
  Ridge   = res_ridge,
  LASSO   = res_lasso,
  ElasticNet = res_enet
)

round(final_res[order(final_res[, "RMSE"]), ], 3)
##               RMSE     MAE    R2
## LASSO      362.150 258.350 0.291
## Ridge      363.987 262.906 0.258
## Backward   367.065 259.246 0.319
## ElasticNet 368.146 259.978 0.297
## OLS        371.685 260.438 0.303
## Forward    379.426 268.963 0.271
## Stepwise   379.426 268.963 0.271

Catatan Penutup

  • OLS sebagai baseline interpretatif
  • Seleksi variabel dengan step() (AIC) membantu model parsimonious, namun sensitif terhadap sampel
  • Regularisasi (glmnet) menstabilkan model dan dapat melakukan seleksi otomatis (LASSO) atau kombinasi shrinkage+seleksi (Elastic Net)
  • Evaluasi akhir menggunakan test set, bukan training
  • Untuk data deret waktu, gunakan pembagian berdasarkan waktu, bukan acak