Exercises and Assignments
2025-09-04
Chapter 1 Concepts and Theory of Advanced Regression Analysis
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, dpi = 150,
fig.align = "center", fig.width = 7, fig.height = 4.5)
library(tidyverse)
## (startup messages condensed: tidyverse 2.0.0 core packages attached — dplyr 1.1.4,
## forcats 1.0.0, ggplot2 3.5.2, lubridate 1.9.4, purrr 1.1.0, readr 2.1.5, stringr 1.5.1,
## tibble 3.3.0, tidyr 1.3.1 — with dplyr::filter()/lag() masking stats; zoo, car/carData,
## and modelr are also attached, masking base::as.Date, dplyr::recode, purrr::some, and
## broom::bootstrap)
1.1 Concepts of Regression and Correlation Analysis
Regression analysis is a statistical method for studying the relationship between a dependent variable (response) and one or more independent variables (predictors). Its main purpose is to build a mathematical model that can be used to:
- Explain the relationship between variables.
- Predict the value of the dependent variable from the independent variables.
- Measure the effect of each predictor on the response.
The simple linear regression model is written as: \[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n \] where:
\(Y_i\) : dependent variable (response)
\(X_i\) : independent variable (predictor)
\(\beta_0, \beta_1\) : regression parameters
\(\varepsilon_i\) : error (residual)
1.1.0.1 The Concept of Correlation
Correlation measures the degree of linear association between two variables, without distinguishing which one is dependent and which is independent.
The Pearson correlation coefficient is defined as:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
- The value \(r \in [-1, 1]\).
- \(r > 0\): positive association; \(r < 0\): negative association.
- The closer \(|r|\) is to 1, the stronger the association.
Key Differences
\[ \begin{array}{|l|l|l|} \hline \textbf{Aspect} & \textbf{Correlation} & \textbf{Regression} \\ \hline \text{Purpose} & \text{Measure strength of association} & \text{Build a model for prediction/explanation} \\ \hline \text{Direction} & \text{Symmetric (X and Y interchangeable)} & \text{Asymmetric (Y as response, X as predictor)} \\ \hline \text{Main output} & \text{Correlation coefficient } r & \text{Regression equation } \hat{Y} = \beta_0 + \beta_1 X \\ \hline \end{array} \]
1.1.0.2 Example Application in R
Data
Use the mtcars dataset to examine the relationship between mpg (miles per gallon) and wt (car weight).
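The preview below was presumably produced by something like the following (the column selection is an assumption):
head(mtcars[, c("mpg", "wt")])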
## mpg wt
## Mazda RX4 21.0 2.620
## Mazda RX4 Wag 21.0 2.875
## Datsun 710 22.8 2.320
## Hornet 4 Drive 21.4 3.215
## Hornet Sportabout 18.7 3.440
## Valiant 18.1 3.460
1.1.0.2.1 Correlation
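The test output below is consistent with this call (judging from its data: line):
cor.test(mtcars$wt, mtcars$mpg)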
##
## Pearson's product-moment correlation
##
## data: mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
Regression
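The regression summary below matches this call (per its Call: line):
summary(lm(mpg ~ wt, data = mtcars))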
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Visualization
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Hubungan antara Berat Mobil (wt) dan Efisiensi Bahan Bakar (mpg)",
x = "Berat Mobil (1000 lbs)",
y = "Miles per Gallon (mpg)")1.2 Model Regresi Linear
We model the relationship between a response \(y \in \mathbb{R}^n\) and covariates \(X \in \mathbb{R}^{n \times p}\): \[ y = X\beta + \varepsilon, \qquad \mathbb{E}(\varepsilon) = 0, \qquad \text{Var}(\varepsilon) = \sigma^2 I_n. \]
Here \(X = [1, x_1, \dots, x_{p-1}]\) (including the intercept), \(\beta\) is the coefficient vector, and \(\varepsilon\) is the error.
1.2.1 The OLS Estimator & Projection Geometry
The estimator minimizes the residual sum of squares: \[ \hat{\beta}_{\text{OLS}} = \arg \min_{\beta} \| y - X\beta \|^2 = (X^\top X)^{-1} X^\top y. \]
The prediction \(\hat{y} = X\hat{\beta}\) is the orthogonal projection of \(y\) onto the column space of \(X\). The hat matrix and residuals are: \[ H = X(X^\top X)^{-1} X^\top, \qquad \hat{y} = Hy, \qquad e = (I - H)y. \]
The leverage of the \(i\)-th observation is \(h_{ii}\) (the diagonal element of \(H\)).
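As a minimal numeric sketch of these formulas in R (using mtcars; the variable choice is illustrative):
# Normal equations, hat matrix, and leverages computed by hand
X <- cbind(1, mtcars$wt)                           # design matrix with intercept
y <- mtcars$mpg
beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
H <- X %*% solve(crossprod(X)) %*% t(X)            # hat matrix
leverage <- diag(H)                                # h_ii
# Sanity check against the built-in computation
all.equal(unname(leverage), unname(hatvalues(lm(mpg ~ wt, data = mtcars))))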
1.2.2 Asymptotic Properties & Coefficient Variance
Under the Gauss–Markov assumptions (linearity, exogeneity, homoskedasticity, and no perfect multicollinearity), \(\hat{\beta}_{\text{OLS}}\) is BLUE (the best linear unbiased estimator). The coefficient variance and the variance estimator are:
\[ \text{Var}(\hat{\beta}_{\text{OLS}}) = \sigma^2 (X^\top X)^{-1}, \qquad \hat{\sigma}^2 = \frac{\|e\|^2}{n - p}. \]
Asymptotics: \(\sqrt{n} (\hat{\beta} - \beta) \overset{d}{\longrightarrow} \mathcal{N}(0, V)\), so normal/\(t\)-based inference is valid for large \(n\).
1.3 Generalizations: WLS & GLS
If the errors are heteroskedastic or correlated, then \(\text{Var}(\varepsilon) = \sigma^2 \Omega \neq \sigma^2 I\).
Weighted Least Squares (WLS), for unequal variances (\(\Omega\) diagonal): \[ \hat{\beta}_{\text{WLS}} = (X^\top W X)^{-1} X^\top W y, \qquad W = \Omega^{-1}. \]
Generalized Least Squares (GLS), for general \(\Omega\): \[ \hat{\beta}_{\text{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y. \]
1.4 Robust Standard Errors (White Sandwich)
Without specifying the form of \(\Omega\), White's heteroskedasticity-consistent (HC) covariance estimator is: \[ \widehat{\text{Var}}(\hat{\beta}) = (X^\top X)^{-1} \left( \sum_{i=1}^n x_i x_i^\top e_i^2 \right) (X^\top X)^{-1}. \]
This yields standard errors that remain valid under heteroskedasticity (variants HC0–HC3).
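A hedged sketch of how this is done with the sandwich and lmtest packages (the model here is illustrative):
fit <- lm(mpg ~ wt + hp, data = mtcars)
# Robust t-tests; type = "HC3" selects the small-sample-friendly variant
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC3"))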
1.5 Multicollinearity & Diagnostics
The Variance Inflation Factor (VIF) for the \(j\)-th coefficient is: \[ \text{VIF}_j = \frac{1}{1 - R_j^2}, \]
where \(R_j^2\) is the coefficient of determination from regressing \(x_j\) on the other covariates. Large values (\(\gtrsim 10\)) indicate strong multicollinearity.
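A small sketch computing VIF both from \(R_j^2\) by hand and with car::vif (illustrative model):
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared  # R_j^2 for wt
c(manual_wt = 1 / (1 - r2_wt), car::vif(fit)["wt"])            # both should agree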
1.6 Residual, Leverage, and Influence Diagnostics
The influence measure (Cook's distance): \[ D_i = \frac{e_i^2}{p \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}. \]
Use residual plots, leverage-vs-residual plots, QQ-plots, and inspection of \(D_i\) to check assumptions and spot outliers.
1.7 Bias–Variance Trade-off & Regularization (Overview)
More complex models (more parameters/basis functions) tend to have lower bias but higher variance. Ridge and Lasso add a penalty term:
\[ \text{Ridge: } \hat{\beta} = \arg \min_{\beta} \| y - X\beta \|^2 + \lambda \|\beta\|_2^2, \qquad \text{Lasso: } \hat{\beta} = \arg \min_{\beta} \| y - X\beta \|^2 + \lambda \|\beta\|_1. \]
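Ridge and Lasso are not demonstrated elsewhere in this chapter; a minimal sketch using glmnet (assumed installed — it is not among the packages loaded above), where alpha = 0 gives Ridge and alpha = 1 gives Lasso:
library(glmnet)  # assumption: install.packages("glmnet") if not available
X <- model.matrix(mpg ~ wt + hp + disp + qsec, data = mtcars)[, -1]
y <- mtcars$mpg
ridge <- cv.glmnet(X, y, alpha = 0)   # lambda chosen by 10-fold CV
lasso <- cv.glmnet(X, y, alpha = 1)
coef(lasso, s = "lambda.min")         # Lasso may set some coefficients exactly to zero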
1.8 Nonlinearity: Basis Expansions & Splines
We can add polynomial terms or spline basis functions \(b_k(x)\), so that the model remains linear in the parameters: \[ y = \beta_0 + \sum_k \beta_k b_k(x) + \varepsilon. \]
1.9 Case Study 1: Regression on Real Data (mtcars)
1.9.1 Data Preparation and Exploration
dat1 <- mtcars %>%
as_tibble(rownames = "car") %>%
mutate(across(c(mpg, disp, hp, wt, qsec), as.double))
# Initial look at the relationship
GG1 <- ggplot(dat1, aes(wt, mpg)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Weight (1000 lbs)", y = "Miles per Gallon (mpg)",
title = "Hubungan mpg vs wt pada mtcars")
GG1
1.9.2 OLS Estimation & Summary
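The Call: line below, together with the later diagnostics code that references fit1, implies the fit was created along these lines (conf.int = TRUE is inferred from the conf.low/conf.high columns):
fit1 <- lm(mpg ~ wt + hp + disp, data = dat1)
summary(fit1)
broom::tidy(fit1, conf.int = TRUE)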
##
## Call:
## lm(formula = mpg ~ wt + hp + disp, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## wt -3.800891 1.066191 -3.565 0.00133 **
## hp -0.031157 0.011436 -2.724 0.01097 *
## disp -0.000937 0.010350 -0.091 0.92851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 37.1 2.11 17.6 1.16e-16 32.8 41.4
## 2 wt -3.80 1.07 -3.56 1.33e- 3 -5.98 -1.62
## 3 hp -0.0312 0.0114 -2.72 1.10e- 2 -0.0546 -0.00773
## 4 disp -0.000937 0.0103 -0.0905 9.29e- 1 -0.0221 0.0203
1.9.3 Basic Diagnostics (Residuals, QQ-plot, Cook's D)
diag_df <- tibble(
fitted = fitted(fit1), resid = resid(fit1),
stdres = rstandard(fit1), hat = hatvalues(fit1), cook = cooks.distance(fit1)
)
p1 <- ggplot(diag_df, aes(fitted, resid)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = 0, linetype = 2) +
labs(x = "Fitted", y = "Residuals")
p2 <- ggplot(diag_df, aes(sample = stdres)) +
stat_qq() + stat_qq_line() + labs(x = "Theoretical", y = "Standardized Residuals")
p1; p2
Figure 1.2: OLS diagnostics for mtcars: residuals vs fitted (left) and QQ-plot (right).
qplot(seq_along(diag_df$cook), diag_df$cook, geom = c("line","point"),
xlab = "Index", ylab = "Cook's D")
1.9.5 Robust SE (White HC3) & Interpretation
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.10550527 2.11081525 17.5788 < 2.2e-16 ***
## wt -3.80089058 1.06619064 -3.5649 0.001331 **
## hp -0.03115655 0.01143579 -2.7245 0.010971 *
## disp -0.00093701 0.01034974 -0.0905 0.928507
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.10550527 2.71026981 13.6907 6.257e-14 ***
## wt -3.80089058 1.08086207 -3.5165 0.00151 **
## hp -0.03115655 0.01437457 -2.1675 0.03886 *
## disp -0.00093701 0.01012138 -0.0926 0.92690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1.9.6 Nonlinearity: Polynomial vs Spline on wt
fit_poly <- lm(mpg ~ poly(wt, 2, raw = TRUE) + hp + disp, data = dat1)
fit_spl <- lm(mpg ~ bs(wt, df = 5) + hp + disp, data = dat1)
AIC(fit1, fit_poly, fit_spl) %>% arrange(AIC)
## df AIC
## fit_poly 6 150.7305
## fit_spl 9 154.3323
## fit1 5 158.6430
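The BIC table that follows was presumably produced analogously:
BIC(fit1, fit_poly, fit_spl) %>% arrange(BIC)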
## df BIC
## fit_poly 6 159.5250
## fit1 5 165.9717
## fit_spl 9 167.5239
# Prediction curves
grid_wt <- tibble(wt = seq(min(dat1$wt), max(dat1$wt), length.out = 200),
hp = mean(dat1$hp), disp = mean(dat1$disp))
pred <- grid_wt %>%
mutate(OLS = predict(fit1, newdata = grid_wt),
Poly = predict(fit_poly, newdata = grid_wt),
Spline = predict(fit_spl, newdata = grid_wt)) %>%
pivot_longer(OLS:Spline, names_to = "model", values_to = "yhat")
GG2 <- ggplot() +
geom_point(data = dat1, aes(wt, mpg), alpha = 0.3) +
geom_line(data = pred, aes(wt, yhat, linetype = model), linewidth = 0.9) +
labs(x = "wt", y = "mpg", title = "Perbandingan OLS vs Polinomial vs Spline")
GG2
1.9.7 Cross-Validation (K-fold) for Model Comparison
k <- 5
cv_res <- crossv_kfold(dat1, k = k)
rmse_fun <- function(train, test, formula) {
m <- lm(formula, data = as.data.frame(train))
sqrt(mean((as.data.frame(test)$mpg - predict(m, newdata = as.data.frame(test)))^2))
}
forms <- list(
OLS = mpg ~ wt + hp + disp,
Poly = mpg ~ poly(wt, 2, raw = TRUE) + hp + disp,
Spline = mpg ~ bs(wt, df = 5) + hp + disp
)
cv_tbl <- map_dfr(names(forms), function(nm){
fml <- forms[[nm]]
tibble(model = nm, fold = seq_len(k),
rmse = map2_dbl(cv_res$train, cv_res$test, ~ rmse_fun(.x, .y, fml)))
})
cv_tbl %>% group_by(model) %>% summarise(mean_rmse = mean(rmse), sd_rmse = sd(rmse)) %>% arrange(mean_rmse)
## # A tibble: 3 × 3
## model mean_rmse sd_rmse
## <chr> <dbl> <dbl>
## 1 Poly 2.35 0.980
## 2 Spline 2.74 0.807
## 3 OLS 2.92 1.01
1.10 Case Study 2: Simulating Heteroskedasticity (WLS vs OLS vs Robust)
n <- 400
x1 <- runif(n, 0, 10)
x2 <- rnorm(n, 5, 2)
sigma_i <- 0.6 + 0.2 * x1 # variance increases with x1
err <- rnorm(n, 0, sigma_i)
y <- 2 + 1.5*x1 - 1.2*x2 + err
sim <- tibble(y, x1, x2, w = 1/sigma_i^2)
fit_ols <- lm(y ~ x1 + x2, data = sim)
fit_wls <- lm(y ~ x1 + x2, data = sim, weights = w)
# SE comparison table
comp_se <- tibble(
term = names(coef(fit_ols)),
OLS_SE = coef(summary(fit_ols))[, "Std. Error"],
Robust_SE = sqrt(diag(vcovHC(fit_ols, type = "HC3"))),
WLS_SE = coef(summary(fit_wls))[, "Std. Error"]
)
comp_se
## # A tibble: 3 × 4
## term OLS_SE Robust_SE WLS_SE
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.250 0.212 0.175
## 2 x1 0.0293 0.0282 0.0261
## 3 x2 0.0415 0.0399 0.0313
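The Breusch–Pagan output below (data: fit_ols) is consistent with:
lmtest::bptest(fit_ols)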
##
## studentized Breusch-Pagan test
##
## data: fit_ols
## BP = 29.415, df = 2, p-value = 4.098e-07
ggplot(data.frame(fitted=fitted(fit_ols), resid=resid(fit_ols)), aes(fitted, resid)) +
geom_point(alpha=.6) + geom_smooth(se=FALSE, method="loess") +
geom_hline(yintercept = 0, linetype=2) + labs(x="Fitted", y="Residuals")
Figure 1.4: Heteroskedasticity: residuals vs fitted (OLS).
1.11 Notes
- Start with OLS + diagnostics; if heteroskedasticity is present, choose robust SEs or WLS/GLS according to the variance structure.
- Check multicollinearity (VIF), leverage, Cook's D, and residual normality.
- Use basis expansions (polynomials/splines) when there is evidence of nonlinearity.
- Validate models (CV) and compare AIC/BIC to balance complexity against accuracy.
- For high dimensions, use regularization (Ridge/Lasso) and a proper pipeline (standardization, CV).
1.12 GLS with AR(1) Correlation: Example and Simulation
The errors follow an AR(1) process:
\[
\varepsilon_t = \rho \varepsilon_{t-1} + u_t, \quad \text{dengan } u_t \sim \mathcal{N}(0, \sigma_u^2).
\]
The residual covariance has a Toeplitz structure:
\[
\Omega_{ts} = \sigma_u^2 \rho^{|t-s|}/(1-\rho^2).
\]
GLS weights the data by \(\Omega^{-1}\) (or an equivalent transformation) so that estimation is efficient.
1.12.1 Simulating data with AR(1) errors
set.seed(42)
T <- 400
x <- runif(T, -2, 2)
rho <- 0.6
u <- rnorm(T, 0, 1)
err <- numeric(T)
for (t in 2:T) err[t] <- rho*err[t-1] + u[t]
y <- 1 + 2*x + err
dar1 <- tibble::tibble(t = 1:T, x, y)
head(dar1)
## # A tibble: 6 × 3
## t x y
## <int> <dbl> <dbl>
## 1 1 1.66 4.32
## 2 2 1.75 4.83
## 3 3 -0.855 0.661
## 4 4 1.32 6.53
## 5 5 0.567 2.49
## 6 6 0.0764 0.213
1.12.2 OLS vs GLS (nlme::gls with corAR1)
library(nlme)
# Plain OLS
fit_ols_ar1 <- lm(y ~ x, data = dar1)
# GLS with AR(1) correlation in the residuals
fit_gls_ar1 <- gls(y ~ x, data = dar1,
correlation = corAR1(form = ~ t))
summary(fit_ols_ar1)
##
## Call:
## lm(formula = y ~ x, data = dar1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3360 -0.8231 0.0026 0.8211 3.9013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95441 0.06043 15.79 <2e-16 ***
## x 2.06045 0.05162 39.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.209 on 398 degrees of freedom
## Multiple R-squared: 0.8002, Adjusted R-squared: 0.7997
## F-statistic: 1594 on 1 and 398 DF, p-value: < 2.2e-16
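The GLS summary below presumably comes from:
summary(fit_gls_ar1)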
## Generalized least squares fit by REML
## Model: y ~ x
## Data: dar1
## AIC BIC logLik
## 1139.541 1155.487 -565.7706
##
## Correlation Structure: AR(1)
## Formula: ~t
## Parameter estimate(s):
## Phi
## 0.5772332
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.9552097 0.11651869 8.19791 0
## x 2.0449090 0.03743358 54.62766 0
##
## Correlation:
## (Intr)
## x 0.004
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.752477543 -0.674750151 -0.007333444 0.679546648 3.225627947
##
## Residual standard error: 1.210605
## Degrees of freedom: 400 total; 398 residual
1.12.3 Serial correlation diagnostics
par(mfrow = c(1,2))
acf(resid(fit_ols_ar1), main = "ACF Residuals (OLS)")
acf(resid(fit_gls_ar1, type = "normalized"), main = "ACF Residuals (GLS)")
Figure 1.5: ACF of residuals from OLS vs GLS.
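The Durbin–Watson output below (data: fit_ols_ar1) matches:
lmtest::dwtest(fit_ols_ar1)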
##
## Durbin-Watson test
##
## data: fit_ols_ar1
## DW = 0.84993, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
1.12.4 Alternatif: Quasi-differencing (Cochrane–Orcutt) manual
# Estimasi rho dari regresi resid_t ~ resid_{t-1}
res <- resid(fit_ols_ar1)
rho_hat <- coef(lm(res[-1]~ 0 + res[-length(res)]))[1]
# Quasi-differencing (transformasi data)
y_star <- dar1$y[-1]- rho_hat*dar1$y[-T]
x_star <- dar1$x[-1]- rho_hat*dar1$x[-T]
fit_qd <- lm(y_star~ x_star)
summary(fit_qd)
##
## Call:
## lm(formula = y_star ~ x_star)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6691 -0.6723 -0.0232 0.6970 3.2363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40603 0.04955 8.194 3.52e-15 ***
## x_star 2.04499 0.03757 54.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9898 on 397 degrees of freedom
## Multiple R-squared: 0.8819, Adjusted R-squared: 0.8816
## F-statistic: 2964 on 1 and 397 DF, p-value: < 2.2e-16
Note: gls() estimates \(\rho\) and the variance simultaneously within a likelihood framework, so it is often more stable than two-step approaches.
1.13 Exercises
set.seed(123)
# Required packages
pkgs <- c("dplyr","ggplot2","car","lmtest","sandwich","MASS","boot","broom")
new <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(new)) install.packages(new, dependencies = TRUE)
lapply(pkgs, library, character.only = TRUE)
## (output condensed: the call prints the attached-package search path once per element of
## pkgs — eight near-identical lists covering nlme, splines, modelr, car, carData, lmtest,
## zoo, sandwich, broom, the tidyverse packages, MASS, and boot)
1. Data Simulation. Goal: generate data with three predictors (X1, X2, X3), injected with mild multicollinearity (X1 & X2 correlated) and heteroskedasticity (error variance increasing with |X1|), plus a few outliers.
n <- 200
X1 <- rnorm(n, 0, 1)
# X2 correlated ~0.7 with X1
rho <- 0.7
X2 <- rho*X1 + sqrt(1-rho^2)*rnorm(n)
X3 <- rnorm(n)
# Heteroskedasticity: error sd depends on |X1|
eps <- rnorm(n, sd = 0.5 + 0.5*abs(X1))
# True model (ground truth): y = 2 + 1.5*X1 - 0.8*X2 + 0.5*X3 + eps
y <- 2 + 1.5*X1 - 0.8*X2 + 0.5*X3 + eps
# Add a few outliers to the response
idx_out <- sample(1:n, 3)
y[idx_out] <- y[idx_out] + rnorm(3, mean = 6, sd = 1)
dat <- data.frame(y, X1, X2, X3)
summary(dat)
## y X1 X2 X3
## Min. :-3.194 Min. :-2.30917 Min. :-2.36565 Min. :-2.80977
## 1st Qu.: 1.157 1st Qu.:-0.62576 1st Qu.:-0.68926 1st Qu.:-0.55753
## Median : 2.040 Median :-0.05874 Median : 0.03268 Median : 0.07583
## Mean : 2.066 Mean :-0.00857 Mean : 0.02408 Mean : 0.03178
## 3rd Qu.: 2.895 3rd Qu.: 0.56840 3rd Qu.: 0.69036 3rd Qu.: 0.68098
## Max. : 9.427 Max. : 3.24104 Max. : 3.00049 Max. : 2.43023
2. Parameter Estimation (OLS)
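Judging from the Call: line below and the references to mod0 throughout, the model was fit as:
mod0 <- lm(y ~ X1 + X2 + X3, data = dat)
summary(mod0)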
##
## Call:
## lm(formula = y ~ X1 + X2 + X3, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3070 -0.6446 -0.0414 0.5284 7.1821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.07742 0.08726 23.806 < 2e-16 ***
## X1 1.31536 0.12466 10.551 < 2e-16 ***
## X2 -0.75833 0.12303 -6.164 3.97e-09 ***
## X3 0.57737 0.09070 6.366 1.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.232 on 196 degrees of freedom
## Multiple R-squared: 0.4404, Adjusted R-squared: 0.4318
## F-statistic: 51.41 on 3 and 196 DF, p-value: < 2.2e-16
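The one-row fit summary that follows is broom::glance output, presumably:
broom::glance(mod0)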
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.440 0.432 1.23 51.4 1.47e-24 3 -324. 657. 674.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Quick interpretation note: each coefficient is the average change in y for a one-unit increase in the predictor (holding the other predictors constant). See Pr(>|t|) for significance.
3. Hypothesis Tests
3.1 Overall F-test: already available in the summary(mod0) output (significance of the model as a whole).
3.2 Partial t-tests per coefficient: also in summary(mod0).
3.3 Joint hypothesis test (e.g., H0: X2 = 0 and X3 = 0 simultaneously):
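The joint test output below corresponds to car::linearHypothesis:
car::linearHypothesis(mod0, c("X2 = 0", "X3 = 0"))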
##
## Linear hypothesis test:
## X2 = 0
## X3 = 0
##
## Model 1: restricted model
## Model 2: y ~ X1 + X2 + X3
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 423.38
## 2 196 297.59 2 125.79 41.423 9.897e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3.4 Tests with robust SEs (to handle heteroskedasticity)
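The robust table below was presumably produced with coeftest plus a sandwich covariance; the exact HC type is not recorded in the original, so HC1 here is an assumption:
lmtest::coeftest(mod0, vcov = sandwich::vcovHC(mod0, type = "HC1"))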
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.077419 0.087811 23.6578 < 2.2e-16 ***
## X1 1.315357 0.163042 8.0676 7.027e-14 ***
## X2 -0.758334 0.123626 -6.1341 4.644e-09 ***
## X3 0.577370 0.107728 5.3595 2.330e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4. Assumption Checks
4.1 Multicollinearity
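The VIF values below come from:
car::vif(mod0)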
## X1 X2 X3
## 1.811923 1.816606 1.003660
Rule of thumb: VIF > 10 (or > 5) indicates increasing collinearity.
4.2 Heteroskedasticity
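The first Breusch–Pagan output below (data: mod0) is consistent with:
lmtest::bptest(mod0)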
##
## studentized Breusch-Pagan test
##
## data: mod0
## BP = 1.0283, df = 3, p-value = 0.7944
# White test (via bptest with squared fitted values)
lmtest::bptest(mod0, ~ fitted(mod0) + I(fitted(mod0)^2))
##
## studentized Breusch-Pagan test
##
## data: mod0
## BP = 5.3586, df = 2, p-value = 0.06861
If significant, use robust SEs (see 3.4) for more reliable inference.
4.3 Residual normality
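The Shapiro–Wilk output below (data: res) implies:
res <- resid(mod0)
shapiro.test(res)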
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.87436, p-value = 7.729e-12
4.4 Residual autocorrelation (optional; usually relevant for time-series data)
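The Durbin–Watson output below matches:
lmtest::dwtest(mod0)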
##
## Durbin-Watson test
##
## data: mod0
## DW = 2.0342, p-value = 0.5998
## alternative hypothesis: true autocorrelation is greater than 0
4.5 Specification (linearity) test – Ramsey RESET
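No RESET output appears in the original; a minimal sketch of the test with lmtest::resettest:
lmtest::resettest(mod0, power = 2:3, type = "fitted")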
5. Model Selection. Strategy: start from candidate models (including simple transformations), then use AIC (stepwise), compare with the initial model, and validate via CV.
dat2 <- dat %>%
mutate(X1_sq = X1^2, X2_sq = X2^2, X3_sq = X3^2,
X1X2 = X1*X2, X1X3 = X1*X3, X2X3 = X2*X3)
mod_full <- lm(y ~ X1 + X2 + X3 + X1_sq + X2_sq + X3_sq + X1X2 + X1X3 + X2X3, data = dat2)
AIC(mod0); AIC(mod_full)
## [1] 657.0574
## [1] 660.7653
# Stepwise based on AIC (both directions)
mod_step <- MASS::stepAIC(mod_full, direction = "both", trace = FALSE)
summary(mod_step)
##
## Call:
## lm(formula = y ~ X1 + X2 + X3 + X2_sq + X1X2, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2286 -0.6853 -0.0298 0.4825 7.0242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.19439 0.10487 20.925 < 2e-16 ***
## X1 1.31032 0.12513 10.472 < 2e-16 ***
## X2 -0.72799 0.12278 -5.929 1.37e-08 ***
## X3 0.57331 0.08996 6.373 1.32e-09 ***
## X2_sq -0.28501 0.11895 -2.396 0.0175 *
## X1X2 0.23671 0.13546 1.748 0.0821 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.22 on 194 degrees of freedom
## Multiple R-squared: 0.4571, Adjusted R-squared: 0.4431
## F-statistic: 32.67 on 5 and 194 DF, p-value: < 2.2e-16
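The four-column table that follows is presumably a trimmed glance of the stepwise model:
broom::glance(mod_step) %>% dplyr::select(r.squared, adj.r.squared, AIC, BIC)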
## # A tibble: 1 × 4
## r.squared adj.r.squared AIC BIC
## <dbl> <dbl> <dbl> <dbl>
## 1 0.457 0.443 655. 678.
Cross-validation (10-fold CV) to compare RMSE:
# CV RMSE helper using boot::cv.glm
cv_rmse <- function(model, data, K=10){
# cv.glm returns delta: [1] raw CV estimate, [2] adjusted CV estimate
# Caution: cv.glm expects a glm() fit; with lm() objects it cannot access the
# stored response (glmfit$y), which is why the result below comes out as NaN
set.seed(123)
res <- boot::cv.glm(data = data, glmfit = model, K = K)
sqrt(res$delta[1])
}
rmse_mod0 <- cv_rmse(mod0, dat)
rmse_modstep <- cv_rmse(mod_step, dat2)
c(RMSE_mod0 = rmse_mod0, RMSE_modStep = rmse_modstep)
## RMSE_mod0 RMSE_modStep
## NaN NaN
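The NaN values arise because boot::cv.glm expects a model fitted with glm(); an lm object does not carry the stored response (glmfit$y), so the default cost function returns NaN. A corrected sketch under that assumption:
# Refit the same formulas with glm() (gaussian family) so cv.glm can access $y
g0 <- glm(y ~ X1 + X2 + X3, data = dat)
gstep <- glm(formula(mod_step), data = dat2)
set.seed(123)
c(RMSE_mod0 = sqrt(boot::cv.glm(dat, g0, K = 10)$delta[1]),
  RMSE_modStep = sqrt(boot::cv.glm(dat2, gstep, K = 10)$delta[1]))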
Choose the model with the best trade-off between simplicity and performance (RMSE, AIC/BIC, adjusted R-squared).
6. Outlier & Influential Point Evaluation. Use studentized residuals, leverage (hat values), and Cook's distance; add a Bonferroni outlier test.
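The "Potentially influential observations" listing below is the printed summary of influence.measures; the Bonferroni outlier test mentioned above would be car::outlierTest (its output is not shown in the original):
summary(influence.measures(mod0))
car::outlierTest(mod0)   # Bonferroni-adjusted test of the largest studentized residual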
## Potentially influential observations of
## lm(formula = y ~ X1 + X2 + X3, data = dat) :
##
## dfb.1_ dfb.X1 dfb.X2 dfb.X3 dffit cov.r cook.d hat
## 16 -0.21 -0.06 -0.32 0.54 -0.78_* 0.94 0.15 0.07_*
## 18 0.30 -0.30 -0.24 0.09 0.74_* 0.75_* 0.13 0.03
## 27 0.00 0.00 0.00 0.01 0.01 1.07_* 0.00 0.04
## 56 -0.05 -0.05 0.01 0.11 -0.14 1.07_* 0.00 0.06
## 65 -0.14 0.36 -0.34 0.17 -0.45_* 0.97 0.05 0.04
## 90 0.31 0.25 0.04 -0.12 0.50_* 0.72_* 0.06 0.01
## 97 0.00 0.00 -0.01 0.00 -0.01 1.08_* 0.00 0.05
## 135 -0.03 0.07 -0.01 -0.08 -0.12 1.07_* 0.00 0.05
## 147 0.19 -0.21 -0.01 -0.25 0.42 0.92_* 0.04 0.03
## 149 0.23 0.54 -0.20 0.60 0.86_* 0.89_* 0.18 0.07_*
## 160 -0.02 0.04 -0.04 0.01 -0.05 1.06_* 0.00 0.04
## 162 0.43 -0.56 0.28 0.26 0.78_* 0.48_* 0.13 0.01
## 164 -0.20 -0.38 -0.25 -0.35 -0.87_* 0.95 0.18 0.08_*
## 191 0.03 0.02 -0.02 -0.08 0.08 1.07_* 0.00 0.05
# Diagnostic statistics
studres <- rstudent(mod0)
lev <- hatvalues(mod0)
cookd <- cooks.distance(mod0)
# Simple thresholds
thr_res <- 3 # |studentized residual| > 3
thr_lev <- 2*length(coef(mod0))/nrow(dat) # high leverage
thr_cook <- 4/nrow(dat) # large Cook's distance
flag <- data.frame(
id = 1:nrow(dat),
studres = studres,
leverage = lev,
cookd = cookd
) %>%
mutate(
flag_res = abs(studres) > thr_res,
flag_lev = leverage > thr_lev,
flag_cook = cookd > thr_cook,
any_flag = flag_res | flag_lev | flag_cook
) %>%
arrange(desc(any_flag), desc(abs(studres)))
head(flag, 10)
## id studres leverage cookd flag_res flag_lev flag_cook any_flag
## 162 162 6.450812 0.01441863 0.12607179 TRUE FALSE TRUE TRUE
## 90 90 4.302679 0.01343365 0.05785187 TRUE FALSE TRUE TRUE
## 18 18 4.188058 0.03056365 0.12748735 TRUE FALSE TRUE TRUE
## 149 149 3.243249 0.06634585 0.17821039 TRUE TRUE TRUE TRUE
## 164 164 -2.855691 0.08454000 0.18164105 FALSE TRUE TRUE TRUE
## 16 16 -2.803489 0.07241620 0.14821120 FALSE TRUE TRUE TRUE
## 147 147 2.584024 0.02573283 0.04284913 FALSE FALSE TRUE TRUE
## 143 143 2.200166 0.03420638 0.04203837 FALSE FALSE TRUE TRUE
## 30 30 2.172800 0.03120105 0.03730329 FALSE FALSE TRUE TRUE
## 65 65 -2.140175 0.04228760 0.04965406 FALSE TRUE TRUE TRUE
A quick plot of Cook's distance:
plot(cookd, type="h", main="Cook's Distance", ylab="D", xlab="Observasi")
abline(h = thr_cook, lty=2)
Follow-up (optional):
Try refitting without the most extreme influential observations and compare coefficients/fit. Consider robust regression (an M-estimator) if there are many outliers: MASS::rlm.
# Refit without the most influential points (example)
to_drop <- which(cookd > thr_cook | abs(studres) > thr_res)
mod_refit <- lm(y ~ X1 + X2 + X3, data = dat[-to_drop, ])
broom::tidy(mod0)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.08 0.0873 23.8 9.73e-60
## 2 X1 1.32 0.125 10.6 6.73e-21
## 3 X2 -0.758 0.123 -6.16 3.97e- 9
## 4 X3 0.577 0.0907 6.37 1.35e- 9
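The second tidy table below presumably summarizes the refit:
broom::tidy(mod_refit)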
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.99 0.0566 35.1 2.69e-83
## 2 X1 1.44 0.0861 16.7 1.11e-38
## 3 X2 -0.676 0.0819 -8.26 2.87e-14
## 4 X3 0.501 0.0606 8.27 2.75e-14
7. Summary & Recommendations
• Check coefficient significance (t-tests) and model fit (F-test, R²).
• Watch for multicollinearity (VIF); if high, consider transformations, dropping variables, or ridge/lasso.
• If heteroskedasticity is detected, use robust SEs for inference or model the variance (e.g., WLS).
• Perform model selection (AIC/BIC, CV) to balance bias and variance.
• Evaluate outliers/influence; consider robust regression when needed.
• Validate results substantively (domain plausibility) and for replicability (seed, pipeline).
1.14 Assignments/Practicum and Question Bank
1.14.1 Practicum (hands-on)
- Full OLS diagnostics on the mtcars data: heteroskedasticity (Breusch–Pagan), multicollinearity (VIF), influence (Cook's D), normality (QQ-plot). Report findings & recommendations.
- Robust SE (HC3) vs WLS on simulated heteroskedastic data; compare standard errors and p-values.
- Spline vs polynomial: model mpg ~ wt and compare AIC/BIC + 5-fold CV.
- AR(1) GLS: use the GLS code above, replace x with a seasonal signal (e.g. sin(2*pi*t/12)); interpret \(\hat{\rho}\); compare OLS vs GLS standard errors.
1.14.2 Exercise Questions
- Essay: Why does OLS remain unbiased under heteroskedasticity, yet become inefficient? How do robust SEs fix the inference problem?
- Essay: Derive the WLS estimator when the residual variance is known to be proportional to \(|x|\).
- Essay: Compare Ridge and Lasso from the bias–variance and variable-selection perspectives.
- Computation: Given a matrix \(X\) and vector \(y\), compute the OLS \(\hat{\beta}\) and \(H\); identify the maximum leverage.
- Multiple Choice:
- A high VIF indicates?
- Cook's D is used for?
- In an AR(1) process, the covariance between \(\varepsilon_t\) and \(\varepsilon_{t-k}\) is?
1.15 Answers
1.15.1 Practicum
install.packages(setdiff(c(
"tidyverse","broom","modelr","car","lmtest","sandwich","splines","nlme"
), rownames(installed.packages())))
library(tidyverse); library(broom); library(modelr); library(car)
library(lmtest); library(sandwich); library(splines); library(nlme)dat1 <- mtcars |>
tibble::as_tibble(rownames = "car") |>
mutate(across(c(mpg, disp, hp, wt, qsec), as.double))- Diagnostik lengkap OLS (mtcars)
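The summary below, and the later bptest output referencing mod_ols, imply:
mod_ols <- lm(mpg ~ wt + hp + disp, data = dat1)
summary(mod_ols)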
##
## Call:
## lm(formula = mpg ~ wt + hp + disp, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## wt -3.800891 1.066191 -3.565 0.00133 **
## hp -0.031157 0.011436 -2.724 0.01097 *
## disp -0.000937 0.010350 -0.091 0.92851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
Breusch–Pagan (heteroskedasticity)
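The test object is evidently stored for reuse further down (bp$p.value), presumably:
bp <- lmtest::bptest(mod_ols)
bp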
##
## studentized Breusch-Pagan test
##
## data: mod_ols
## BP = 0.9459, df = 3, p-value = 0.8143
vif_tbl <- tibble(term = names(car::vif(mod_ols)),
VIF = as.numeric(car::vif(mod_ols))) |>
arrange(desc(VIF))
vif_tbl
## # A tibble: 3 × 2
## term VIF
## <chr> <dbl>
## 1 disp 7.32
## 2 wt 4.84
## 3 hp 2.74
Influence (Cook's D) & leverage
cd <- cooks.distance(mod_ols)
plot(cd, ylab="Cook's D", main="Cook's Distance"); abline(h=4/length(cd), lty=2)top_infl <- tibble(car = dat1$car, cooks_d = as.numeric(cd)) |>
arrange(desc(cooks_d)) |> slice(1:5)
top_infl
## # A tibble: 5 × 2
## car cooks_d
## <chr> <dbl>
## 1 Maserati Bora 0.340
## 2 Chrysler Imperial 0.320
## 3 Toyota Corolla 0.153
## 4 Fiat 128 0.120
## 5 Pontiac Firebird 0.0698
Automated summary + recommendations
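The three summary lines below were presumably printed with cat() calls along these lines (a reconstruction, not the original code):
cat("BP p-value:", round(bp$p.value, 3), "\n")
cat("Highest VIF:", vif_tbl$term[1], sprintf("(%.2f)", vif_tbl$VIF[1]), "\n")
cat("Top Cook's D:", paste(top_infl$car, collapse = ", "), "\n")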
## BP p-value: 0.814
## Highest VIF: disp (7.32)
## Top Cook's D: Maserati Bora, Chrysler Imperial, Toyota Corolla, Fiat 128, Pontiac Firebird
if (bp$p.value < 0.05) {
cat("Recommendation: use robust SEs (HC3) or WLS/GLS; assess multicollinearity (high VIF).\n")
} else cat("Recommendation: OLS passes the heteroskedasticity test; proceed with validation & other diagnostics.\n")
## Recommendation: OLS passes the heteroskedasticity test; proceed with validation & other diagnostics.
- Robust SE (HC3) vs WLS on simulated heteroskedastic data
library(dplyr)
library(broom)
library(lmtest)
library(sandwich)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
sigma <- 0.5 + 0.4*x # variance increases with x
y <- 2 + 1.5*x + rnorm(n, 0, sigma)
simdat <- tibble(y, x)
m_ols_sim <- lm(y ~ x, data = simdat)
# Robust SE (HC3)
Vrob <- sandwich::vcovHC(m_ols_sim, type = "HC3")
tab_hc3 <- lmtest::coeftest(m_ols_sim, vcov = Vrob)
# WLS, weights ≈ 1/variance
w <- 1/sigma^2
m_wls <- lm(y ~ x, data = simdat, weights = w)
result <- bind_rows(
tidy(m_ols_sim) |> mutate(model="OLS"),
tidy(m_ols_sim) |>
mutate(std.error = sqrt(diag(Vrob)),
statistic = estimate/std.error,
p.value = 2*pnorm(abs(statistic), lower.tail = FALSE),
model="OLS+HC3"),
tidy(m_wls) |> mutate(model="WLS")
) |>
dplyr::select(c(model, term, estimate, std.error, p.value))
print(result)
## # A tibble: 6 × 5
## model term estimate std.error p.value
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 OLS (Intercept) 1.95 0.406 3.15e- 6
## 2 OLS x 1.51 0.0696 4.00e-54
## 3 OLS+HC3 (Intercept) 1.95 0.296 4.48e-11
## 4 OLS+HC3 x 1.51 0.0717 4.52e-98
## 5 WLS (Intercept) 2.05 0.172 7.33e-25
## 6 WLS x 1.49 0.0533 1.75e-70
Interpretation: compare the std.error and p.value columns. On heteroskedastic data, HC3 gives consistent SEs; WLS is usually more efficient when the weights approximate 1/Var(u).
- Spline vs polynomial: mpg ~ wt (AIC/BIC + 5-fold CV)
forms <- list(
OLS = mpg ~ wt + hp + disp,
Poly = mpg ~ poly(wt, 2, raw = TRUE) + hp + disp,
Spline = mpg ~ bs(wt, df = 5) + hp + disp
)
# AIC/BIC
aicbic <- map_dfr(names(forms), \(nm){
m <- lm(forms[[nm]], data = dat1)
tibble(model = nm, AIC = AIC(m), BIC = BIC(m))
}) |> arrange(AIC)
aicbic
## # A tibble: 3 × 3
## model AIC BIC
## <chr> <dbl> <dbl>
## 1 Poly 151. 160.
## 2 Spline 154. 168.
## 3 OLS 159. 166.
# 5-fold CV (RMSE)
set.seed(123)
k <- 5
cv_res <- modelr::crossv_kfold(dat1, k = k)
rmse_fun <- function(train, test, fml) {
m <- lm(fml, data = as.data.frame(train))
sqrt(mean((as.data.frame(test)$mpg - predict(m, newdata = as.data.frame(test)))^2))
}
cv_tbl <- map_dfr(names(forms), \(nm){
tibble(model = nm, fold = 1:k,
rmse = map2_dbl(cv_res$train, cv_res$test, ~ rmse_fun(.x, .y, forms[[nm]])))
})
cv_summary <- cv_tbl |> group_by(model) |>
summarise(mean_rmse = mean(rmse), sd_rmse = sd(rmse), .groups="drop") |>
arrange(mean_rmse)
cv_summary
## # A tibble: 3 × 3
## model mean_rmse sd_rmse
## <chr> <dbl> <dbl>
## 1 Poly 2.35 0.980
## 2 Spline 2.74 0.807
## 3 OLS 2.92 1.01
Pick the model with the smallest AIC/BIC or mean RMSE (the two metrics usually agree; if not, prioritize CV for predictive accuracy).
- AR(1) GLS with a seasonal signal; compare SEs
library(dplyr)
library(broom)
library(nlme)
# Simulate data
set.seed(123)
n <- 120
t <- 1:n
x <- sin(2*pi*t/12) # seasonal signal
eps <- as.numeric(arima.sim(list(ar = 0.6), n = n)) # AR(1) rho≈0.6
y <- 5 + 2*x + eps
glsdat <- tibble(y, x, t)
# OLS
m_ols_ar <- lm(y ~ x, data = glsdat)
# GLS AR(1)
m_gls <- gls(y ~ x, data = glsdat,
correlation = corAR1(form = ~ t))
# Extract the GLS results manually
gls_summary <- summary(m_gls)$tTable
gls_df <- as.data.frame(gls_summary)
gls_df <- tibble::rownames_to_column(gls_df, "term") %>%
rename(
estimate = Value,
std.error = Std.Error,
statistic = `t-value`,
p.value = `p-value`
) %>%
mutate(model = "GLS AR(1)")
# Extract the OLS results with broom
ols_df <- tidy(m_ols_ar) %>%
mutate(model = "OLS")
# Combine the results
result <- bind_rows(ols_df, gls_df) %>%
dplyr::select(model, term, estimate, std.error, statistic, p.value)
print(result)
## # A tibble: 4 × 6
## model term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 OLS (Intercept) 5.00 0.101 49.5 8.14e-81
## 2 OLS x 1.87 0.143 13.1 8.27e-25
## 3 GLS AR(1) (Intercept) 5.02 0.205 24.5 4.54e-48
## 4 GLS AR(1) x 1.88 0.205 9.17 1.86e-15
# Estimate of rho
rho_hat <- coef(m_gls$modelStruct$corStruct, unconstrained = FALSE)
cat("Estimated rho =", rho_hat, "\n")## Estimated rho = 0.6086503
Estimated \(\rho\) (AR(1)) & interpretation
If SE(\(\beta_x\)) under GLS is smaller than under OLS, modeling the AR(1) structure improves efficiency (sharper tests). Here the GLS SE for x is actually larger: with positively autocorrelated errors and a smooth seasonal regressor, conventional OLS SEs understate the true uncertainty, so the GLS SE is the more trustworthy one.
1.15.2 Exercise Questions
Essay 1: OLS Unbiased but Inefficient under Heteroskedasticity. OLS remains unbiased as long as \(\mathbb{E}(\varepsilon|X)=0\) (so \(\text{Cov}(X,\varepsilon)=0\)). Its variance, however, is no longer minimal because homoskedasticity is violated. Robust (HC) SEs correct the covariance estimate so that t- and F-tests remain valid even when the error variance is not constant.
Essay 2: The WLS estimator when \(\operatorname{Var}(u_i) \propto |x_i|\). With weights \(w_i \propto 1/|x_i|\) and \(W=\operatorname{diag}(w_i)\), \[ \hat{\beta}_{\text{WLS}} = \left( X^{\top} W X \right)^{-1} X^{\top} W y. \] Observations with larger variance (smaller weight) have less influence.
Essay 3: Ridge vs Lasso. Ridge adds the penalty \(\lambda\lVert\beta\rVert_2^2\) (shrinkage, no variable selection); Lasso adds \(\lambda\lVert\beta\rVert_1\) (shrinkage with selection—many coefficients can be exactly zero). Ridge suits multicollinearity; Lasso suits feature selection.
Computation: \(\hat{\beta}\), the hat matrix \(H\), and leverage. \(\hat{\beta}=(X^\top X)^{-1}X^\top y\); \(H = X(X^\top X)^{-1}X^\top\). The leverage \(h_{ii}\) is the diagonal element of \(H\); the largest value marks the most influential observation.
Multiple Choice: (A) A high VIF indicates multicollinearity among the predictors. (B) Cook's D is used to detect influential observations. (C) In an AR(1) process: \(\operatorname{Cov}(\varepsilon_t,\varepsilon_{t-k})=\sigma_u^2\,\rho^{k}/(1-\rho^2)\).