Regression analysis is a statistical method for studying the relationship between one dependent variable (the response) and one or more independent variables (the predictors). Its main goal is to build a mathematical model that can be used to explain the relationship between the variables and to predict the response.
The simple linear regression model is written as:
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n \] where
• \(Y_i\): dependent variable (response)
• \(X_i\): independent variable (predictor)
• \(\beta_0, \beta_1\): regression parameters
• \(\varepsilon_i\): random error term
Correlation measures the strength of the linear association between two variables, without distinguishing which is dependent and which is independent.
The Pearson correlation coefficient:
\[ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \; \sum_{i=1}^n (y_i - \bar{y})^2}} \]
• \(r \in [-1, 1]\).
• \(r > 0\): positive association; \(r < 0\): negative association.
• The closer \(|r|\) is to 1, the stronger the association.
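To connect the formula to practice, here is a minimal sketch that computes \(r\) directly from the definition and checks it against R's built-in cor(); it uses the mtcars data introduced below.
# Sketch: Pearson's r from the definition vs cor()
x <- mtcars$wt; y <- mtcars$mpg
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
c(manual = r_manual, builtin = cor(x, y)) # both: -0.8676594 (see cor.test below)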
The following table summarizes the main differences between correlation and regression:
Comparison Table
| Aspect | Correlation | Regression |
|---|---|---|
| Goal | Measure the strength of association | Build a model for prediction/explanation |
| Direction | Symmetric (X and Y interchangeable) | Asymmetric (Y is the response, X the predictor) |
| Main output | Correlation coefficient \(r\) | Regression equation \(\hat{Y} = \hat\beta_0 + \hat\beta_1 X\) |
Example Application in R
Data
We use the mtcars dataset to examine the relationship between mpg (miles per gallon) and wt (car weight).
data(mtcars)
head(mtcars[, c("mpg", "wt")])
## mpg wt
## Mazda RX4 21.0 2.620
## Mazda RX4 Wag 21.0 2.875
## Datsun 710 22.8 2.320
## Hornet 4 Drive 21.4 3.215
## Hornet Sportabout 18.7 3.440
## Valiant 18.1 3.460
Correlation
cor.test(mtcars$wt, mtcars$mpg)
##
## Pearson's product-moment correlation
##
## data: mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
Regression
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Visualization
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
labs(title = "Hubungan antara Berat Mobil (wt) dan Efisiensi Bahan Bakar (mpg)",
x = "Berat Mobil (1000 lbs)",
y = "Miles per Gallon (mpg)")
We model the relationship between a response \(y \in \mathbb{R}^n\) and covariates \(X \in \mathbb{R}^{n \times p}\):
\[ y = X\beta + \varepsilon,\ \mathbb{E}(\varepsilon)=0,\ \operatorname{Var}(\varepsilon)=\sigma^2 I_n. \]
Here \(X = [\mathbf{1},\, x_1,\, \ldots,\, x_{p-1}]\) (intercept included), \(\beta\) is the coefficient vector, and \(\varepsilon\) is the error term.
The ordinary least squares (OLS) estimator minimizes the residual sum of squares:
\[ \widehat{\beta}_{\text{OLS}} = \operatorname*{arg\,min}_{\beta}\,\lVert y - X\beta\rVert^{2} = (X^{\top}X)^{-1} X^{\top} y . \]
The prediction \(\hat y = X\widehat{\beta}\) is the orthogonal projection of \(y\) onto the column space of \(X\). Hat matrix and residuals: \[ H = X (X^{\top} X)^{-1} X^{\top}, \qquad \hat y = H y, \qquad e = (I - H) y . \]
The leverage of observation \(i\) is \(h_{ii}\), the \(i\)-th diagonal element of \(H\).
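As a numerical sanity check, a minimal sketch building \(H\) explicitly for a one-predictor mtcars fit and comparing it with R's hatvalues() and resid():
# Sketch: hat matrix, leverages, and residuals by hand
fit <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X) # hat matrix
all.equal(unname(diag(H)), unname(hatvalues(fit))) # TRUE
e_manual <- drop((diag(nrow(H)) - H) %*% mtcars$mpg)
all.equal(unname(e_manual), unname(resid(fit))) # TRUE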
Under the Gauss–Markov assumptions (linearity, exogeneity, homoskedasticity, and no perfect multicollinearity), \(\widehat{\beta}_{\text{OLS}}\) is BLUE (the best linear unbiased estimator).
Coefficient covariance and the variance estimator:
\[ \operatorname{Var}(\widehat{\beta}_{\text{OLS}})=\sigma^2 (X^{\top}X)^{-1}, \qquad \widehat{\sigma}^2=\frac{\lVert e\rVert^2}{n-p}. \]
Asymptotically, \[ \sqrt{n}\,\big(\widehat{\beta}-\beta\big)\xrightarrow{d}\mathcal{N}(0,\,V), \] so normal/\(t\)-based inference is valid for large \(n\).
If the errors are heteroskedastic or correlated, then \(\operatorname{Var}(\varepsilon) = \sigma^2 \Omega \neq \sigma^2 I_n.\)
Weighted least squares (WLS) for unequal variances (diagonal \(\Omega\)): \[ \widehat{\beta}_{\mathrm{WLS}} = (X^{\top} W X)^{-1}\, X^{\top} W y, \qquad W = \Omega^{-1}. \]
Generalized least squares (GLS) for a general \(\Omega\): \[ \widehat{\beta}_{\mathrm{GLS}} = (X^{\top} \Omega^{-1} X)^{-1}\, X^{\top} \Omega^{-1} y. \]
Without specifying the form of \(\Omega\), White's heteroskedasticity-consistent (HC) covariance (HC0) is \[ \widehat{\operatorname{Var}}(\widehat{\beta}) = (X^{\top}X)^{-1} \Bigg(\sum_{i=1}^n x_i x_i^{\top}\,\hat e_i^2\Bigg) (X^{\top}X)^{-1}. \]
This yields robust standard errors that remain valid under heteroskedasticity (variants HC0–HC3).
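A minimal sketch computing the HC0 sandwich by hand and checking it against sandwich::vcovHC (assumes the sandwich package is available):
library(sandwich)
fit <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(fit); e <- resid(fit)
bread <- solve(t(X) %*% X)
meat <- t(X) %*% (X * e^2) # = sum_i x_i x_i' e_i^2
all.equal(bread %*% meat %*% bread, vcovHC(fit, type = "HC0"),
check.attributes = FALSE) # TRUE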
The variance inflation factor (VIF) for the \(j\)-th coefficient: \[ \mathrm{VIF}_j \;=\; \frac{1}{1 - R_j^{2}}, \] where \(R_j^{2}\) is the coefficient of determination from regressing \(x_j\) on all other covariates. Large values (\(\gtrsim 10\)) indicate strong multicollinearity.
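To make the definition concrete, a sketch computing the VIF of wt in the three-predictor mtcars model from \(R_j^2\), checked against car::vif:
fit3 <- lm(mpg ~ wt + hp + disp, data = mtcars)
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
c(manual = 1/(1 - r2_wt), car = unname(car::vif(fit3)["wt"])) # both ~ 4.84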
Cook's distance measures influence: \[ D_i \;=\; \frac{e_i^2}{p\,\widehat{\sigma}^2}\;\frac{h_{ii}}{(1-h_{ii})^2}. \]
Use residual plots, leverage-vs-residual plots, QQ-plots, and inspection of \(D_i\) to check the assumptions and detect outliers.
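And a sketch checking the Cook's distance formula against cooks.distance():
fit <- lm(mpg ~ wt, data = mtcars)
e <- resid(fit); h <- hatvalues(fit); p <- length(coef(fit))
s2 <- sum(e^2)/df.residual(fit) # sigma-hat squared
D_manual <- e^2/(p*s2) * h/(1 - h)^2
all.equal(D_manual, cooks.distance(fit)) # TRUE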
More complex models (more parameters/basis functions) tend to have low bias but high variance. Ridge and lasso add a penalty:
\[ \text{Ridge:}\quad \widehat{\beta} = \operatorname*{arg\,min}_{\beta}\;\Big\{\;\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_2^2\;\Big\}. \]
\[ \text{Lasso:}\quad \widehat{\beta} = \operatorname*{arg\,min}_{\beta}\;\Big\{\;\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_1\;\Big\}. \]
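A minimal sketch with glmnet (an assumption: the glmnet package is installed); alpha = 0 gives ridge, alpha = 1 gives the lasso, with \(\lambda\) chosen by cross-validation:
library(glmnet)
X <- as.matrix(mtcars[, c("wt", "hp", "disp")]); y <- mtcars$mpg
set.seed(1)
ridge <- cv.glmnet(X, y, alpha = 0) # ridge: L2 penalty
lasso <- cv.glmnet(X, y, alpha = 1) # lasso: L1 penalty, can zero out coefficients
cbind(ridge = as.numeric(coef(ridge, s = "lambda.min")),
lasso = as.numeric(coef(lasso, s = "lambda.min")))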
We can add polynomial terms or spline bases while remaining linear in the parameters:
\[ y = \beta_0 + \sum_{k=1}^{K} \beta_k\, b_k(x) + \varepsilon. \]
library(dplyr) # pipes and mutate/across
library(tibble) # as_tibble
dat1 <- mtcars %>%
as_tibble(rownames = "car") %>%
mutate(across(c(mpg, disp, hp, wt, qsec), as.double))
# Initial relationship plot
GG1 <- ggplot(dat1, aes(wt, mpg)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Weight (1000 lbs)", y = "Miles per Gallon (mpg)",
title = "Hubungan mpg vs wt pada mtcars")
GG1
fit1 <- lm(mpg ~ wt + hp + disp, data = dat1)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ wt + hp + disp, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## wt -3.800891 1.066191 -3.565 0.00133 **
## hp -0.031157 0.011436 -2.724 0.01097 *
## disp -0.000937 0.010350 -0.091 0.92851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
# Coefficients with 95% CIs (broom)
broom::tidy(fit1, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 37.1 2.11 17.6 1.16e-16 32.8 41.4
## 2 wt -3.80 1.07 -3.56 1.33e- 3 -5.98 -1.62
## 3 hp -0.0312 0.0114 -2.72 1.10e- 2 -0.0546 -0.00773
## 4 disp -0.000937 0.0103 -0.0905 9.29e- 1 -0.0221 0.0203
diag_df <- tibble(
fitted = fitted(fit1), resid = resid(fit1),
stdres = rstandard(fit1), hat = hatvalues(fit1), cook = cooks.distance(fit1)
)
p1 <- ggplot(diag_df, aes(fitted, resid)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = 0, linetype = 2) +
labs(x = "Fitted", y = "Residuals")
p2 <- ggplot(diag_df, aes(sample = stdres)) +
stat_qq() + stat_qq_line() + labs(x = "Theoretical", y = "Standardized Residuals")
p1; p2
Figure 1.2: OLS diagnostics for mtcars: residuals vs fitted (left) and QQ-plot (right).
car::vif(fit1)
## wt hp disp
## 4.844618 2.736633 7.324517
coeftest(fit1) # classical OLS SEs
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.10550527 2.11081525 17.5788 < 2.2e-16 ***
## wt -3.80089058 1.06619064 -3.5649 0.001331 **
## hp -0.03115655 0.01143579 -2.7245 0.010971 *
## disp -0.00093701 0.01034974 -0.0905 0.928507
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qplot(seq_along(diag_df$cook), diag_df$cook, geom = c("line","point"),
xlab = "Index", ylab = "Cook's D")
Figure 1.3: Cook's distance per observation.
coeftest(fit1, vcov = vcovHC(fit1, type = "HC3")) # robust (HC3) SEs
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.10550527 2.71026981 13.6907 6.257e-14 ***
## wt -3.80089058 1.08086207 -3.5165 0.00151 **
## hp -0.03115655 0.01437457 -2.1675 0.03886 *
## disp -0.00093701 0.01012138 -0.0926 0.92690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(splines) # provides bs() for the spline basis
fit_poly <- lm(mpg ~ poly(wt, 2, raw = TRUE) + hp + disp, data = dat1)
fit_spl <- lm(mpg ~ bs(wt, df = 5) + hp + disp, data = dat1)
AIC(fit1, fit_poly, fit_spl) %>% arrange(AIC)
## df AIC
## fit_poly 6 150.7305
## fit_spl 9 154.3323
## fit1 5 158.6430
BIC(fit1, fit_poly, fit_spl) %>% arrange(BIC)
## df BIC
## fit_poly 6 159.5250
## fit1 5 165.9717
## fit_spl 9 167.5239
# Prediction curves
grid_wt <- tibble(wt = seq(min(dat1$wt), max(dat1$wt), length.out = 200),
hp = mean(dat1$hp), disp = mean(dat1$disp))
pred <- grid_wt %>%
mutate(OLS = predict(fit1, newdata = grid_wt),
Poly = predict(fit_poly, newdata = grid_wt),
Spline = predict(fit_spl, newdata = grid_wt)) %>%
pivot_longer(OLS:Spline, names_to = "model", values_to = "yhat")
GG2 <- ggplot() +
geom_point(data = dat1, aes(wt, mpg), alpha = 0.3) +
geom_line(data = pred, aes(wt, yhat, linetype = model), linewidth = 0.9) +
labs(x = "wt", y = "mpg", title = "Perbandingan OLS vs Polinomial vs Spline")
GG2
Simulation: heteroskedastic errors. We generate data in which the error standard deviation increases with x1 and compare the OLS, robust (HC3), and WLS standard errors.
n <- 400
x1 <- runif(n, 0, 10)
x2 <- rnorm(n, 5, 2)
sigma_i <- 0.6 + 0.2 * x1 # error sd increases with x1
err <- rnorm(n, 0, sigma_i)
y <- 2 + 1.5*x1 - 1.2*x2 + err
sim <- tibble(y, x1, x2, w = 1/sigma_i^2)
fit_ols <- lm(y ~ x1 + x2, data = sim)
fit_wls <- lm(y ~ x1 + x2, data = sim, weights = w)
# SE comparison table
comp_se <- tibble(
term = names(coef(fit_ols)),
OLS_SE = coef(summary(fit_ols))[, "Std. Error"],
Robust_SE = sqrt(diag(vcovHC(fit_ols, type = "HC3"))),
WLS_SE = coef(summary(fit_wls))[, "Std. Error"]
)
comp_se
## # A tibble: 3 × 4
## term OLS_SE Robust_SE WLS_SE
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.281 0.267 0.191
## 2 x1 0.0300 0.0313 0.0267
## 3 x2 0.0435 0.0466 0.0333
# Breusch-Pagan test for heteroskedasticity
lmtest::bptest(fit_ols)
##
## studentized Breusch-Pagan test
##
## data: fit_ols
## BP = 48.294, df = 2, p-value = 3.259e-11
ggplot(data.frame(fitted=fitted(fit_ols), resid=resid(fit_ols)), aes(fitted, resid)) +
geom_point(alpha=.6) + geom_smooth(se=FALSE, method="loess") +
geom_hline(yintercept = 0, linetype=2) + labs(x="Fitted", y="Residuals")
The error model follows an AR(1) process: \(\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t\) with \(u_t \sim \mathcal{N}(0,\ \sigma_u^2)\). The residual covariance has a Toeplitz form: \(\Omega_{ts} = \frac{\sigma_u^2}{1-\rho^2}\,\rho^{|t-s|}\). GLS weights the data by \(\Omega^{-1}\) (or an equivalent transformation) for efficient estimation.
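To make the structure concrete, a small sketch building this \(\Omega\) for \(n = 5\):
# Sketch: AR(1) Toeplitz covariance, n = 5, rho = 0.6, sigma_u = 1
rho <- 0.6; sigma_u <- 1; n5 <- 5
Omega <- sigma_u^2/(1 - rho^2) * rho^abs(outer(1:n5, 1:n5, "-"))
round(Omega, 3) # entries decay geometrically in |t - s|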
set.seed(42)
T <- 400
x <- runif(T, -2, 2)
rho <- 0.6
u <- rnorm(T, 0, 1)
err <- numeric(T)
for (t in 2:T) err[t] <- rho*err[t-1] + u[t]
y <- 1 + 2*x + err
dar1 <- tibble::tibble(t = 1:T, x, y)
head(dar1)
## # A tibble: 6 × 3
## t x y
## <int> <dbl> <dbl>
## 1 1 1.66 4.32
## 2 2 1.75 4.83
## 3 3 -0.855 0.661
## 4 4 1.32 6.53
## 5 5 0.567 2.49
## 6 6 0.0764 0.213
library(nlme)
# Plain OLS
fit_ols_ar1 <- lm(y ~ x, data = dar1)
# GLS with AR(1) correlation in the residuals
fit_gls_ar1 <- gls(y ~ x, data = dar1,
correlation = corAR1(form = ~ t))
summary(fit_ols_ar1)
##
## Call:
## lm(formula = y ~ x, data = dar1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3360 -0.8231 0.0026 0.8211 3.9013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95441 0.06043 15.79 <2e-16 ***
## x 2.06045 0.05162 39.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.209 on 398 degrees of freedom
## Multiple R-squared: 0.8002, Adjusted R-squared: 0.7997
## F-statistic: 1594 on 1 and 398 DF, p-value: < 2.2e-16
summary(fit_gls_ar1)
## Generalized least squares fit by REML
## Model: y ~ x
## Data: dar1
## AIC BIC logLik
## 1139.541 1155.487 -565.7706
##
## Correlation Structure: AR(1)
## Formula: ~t
## Parameter estimate(s):
## Phi
## 0.5772332
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.9552097 0.11651869 8.19791 0
## x 2.0449090 0.03743358 54.62766 0
##
## Correlation:
## (Intr)
## x 0.004
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.752477543 -0.674750151 -0.007333444 0.679546648 3.225627947
##
## Residual standard error: 1.210605
## Degrees of freedom: 400 total; 398 residual
par(mfrow = c(1,1))
lmtest::dwtest(fit_ols_ar1) # indicates serial correlation in the OLS fit
##
## Durbin-Watson test
##
## data: fit_ols_ar1
## DW = 0.84993, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
par(mfrow = c(1,2))
acf(resid(fit_ols_ar1), main = "ACF Residuals (OLS)")
acf(resid(fit_gls_ar1, type = "normalized"), main = "ACF Residuals (GLS)")
Figure 1.5: ACF of residuals, OLS vs GLS.
# Estimate rho by regressing resid_t on resid_{t-1}
res <- resid(fit_ols_ar1)
rho_hat <- coef(lm(res[-1] ~ 0 + res[-length(res)]))[1]
# Quasi-differencing (data transformation)
y_star <- dar1$y[-1] - rho_hat*dar1$y[-T]
x_star <- dar1$x[-1] - rho_hat*dar1$x[-T]
fit_qd <- lm(y_star ~ x_star)
summary(fit_qd)
##
## Call:
## lm(formula = y_star ~ x_star)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6691 -0.6723 -0.0232 0.6970 3.2363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40603 0.04955 8.194 3.52e-15 ***
## x_star 2.04499 0.03757 54.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9898 on 397 degrees of freedom
## Multiple R-squared: 0.8819, Adjusted R-squared: 0.8816
## F-statistic: 2964 on 1 and 397 DF, p-value: < 2.2e-16
Note: gls() estimates \(\rho\) and the variance simultaneously within a likelihood framework, which is often more stable than the two-step approach.
set.seed(123)
# Required packages
pkgs <- c("dplyr","ggplot2","car","lmtest","sandwich","MASS","boot","broom")
new <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(new)) install.packages(new, dependencies = TRUE)
invisible(lapply(pkgs, library, character.only = TRUE))
1. Data Simulation
Goal: generate data with three predictors (X1, X2, X3), injected with mild multicollinearity (X1 and X2 correlated) and heteroskedasticity (error variance increases with |X1|), plus a few outliers.
n <- 200
X1 <- rnorm(n, 0, 1)
# X2 correlated ~0.7 with X1
rho <- 0.7
X2 <- rho*X1 + sqrt(1 - rho^2)*rnorm(n)
X3 <- rnorm(n)
# Heteroskedasticity: error sd depends on |X1|
eps <- rnorm(n, sd = 0.5 + 0.5*abs(X1))
# Ground-truth model: y = 2 + 1.5*X1 - 0.8*X2 + 0.5*X3 + eps
y <- 2 + 1.5*X1 - 0.8*X2 + 0.5*X3 + eps
# Add a few outliers to the response
idx_out <- sample(1:n, 3)
y[idx_out] <- y[idx_out] + rnorm(3, mean = 6, sd = 1)
dat <- data.frame(y, X1, X2, X3)
summary(dat)
## y X1 X2 X3
## Min. :-3.194 Min. :-2.30917 Min. :-2.36565 Min. :-2.80977
## 1st Qu.: 1.157 1st Qu.:-0.62576 1st Qu.:-0.68926 1st Qu.:-0.55753
## Median : 2.040 Median :-0.05874 Median : 0.03268 Median : 0.07583
## Mean : 2.066 Mean :-0.00857 Mean : 0.02408 Mean : 0.03178
## 3rd Qu.: 2.895 3rd Qu.: 0.56840 3rd Qu.: 0.69036 3rd Qu.: 0.68098
## Max. : 9.427 Max. : 3.24104 Max. : 3.00049 Max. : 2.43023
2. Parameter Estimation (OLS)
mod0 <- lm(y ~ X1 + X2 + X3, data = dat)
summary(mod0)
##
## Call:
## lm(formula = y ~ X1 + X2 + X3, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3070 -0.6446 -0.0414 0.5284 7.1821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.07742 0.08726 23.806 < 2e-16 ***
## X1 1.31536 0.12466 10.551 < 2e-16 ***
## X2 -0.75833 0.12303 -6.164 3.97e-09 ***
## X3 0.57737 0.09070 6.366 1.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.232 on 196 degrees of freedom
## Multiple R-squared: 0.4404, Adjusted R-squared: 0.4318
## F-statistic: 51.41 on 3 and 196 DF, p-value: < 2.2e-16
broom::glance(mod0) # summary: R^2, adj. R^2, AIC, etc.
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.440 0.432 1.23 51.4 1.47e-24 3 -324. 657. 674.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Quick interpretation note: each coefficient gives the average change in y for a one-unit increase in that predictor, holding the other predictors constant. See Pr(>|t|) for significance.
3. Hypothesis Tests
3.1 Overall F-test: already shown in the summary(mod0) output (overall model significance).
3.2 Partial t-tests (per coefficient): also in summary(mod0).
3.3 Joint hypothesis test (example: \(H_0:\ \beta_{X2} = \beta_{X3} = 0\)):
car::linearHypothesis(mod0, c("X2 = 0", "X3 = 0"))
##
## Linear hypothesis test:
## X2 = 0
## X3 = 0
##
## Model 1: restricted model
## Model 2: y ~ X1 + X2 + X3
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 423.38
## 2 196 297.59 2 125.79 41.423 9.897e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3.4 Tests with robust SEs (to cope with heteroskedasticity)
# Coefficients + robust (HC3) SEs
coeftest(mod0, vcov. = sandwich::vcovHC(mod0, type = "HC3"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.077419 0.087811 23.6578 < 2.2e-16 ***
## X1 1.315357 0.163042 8.0676 7.027e-14 ***
## X2 -0.758334 0.123626 -6.1341 4.644e-09 ***
## X3 0.577370 0.107728 5.3595 2.330e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4. Assumption Checks
4.1 Multicollinearity
# Variance Inflation Factor (VIF)
car::vif(mod0)
## X1 X2 X3
## 1.811923 1.816606 1.003660
# Condition number (optional): kappa > 30 signals trouble
kappa(model.matrix(mod0))
## [1] 2.263679
Rule of thumb: VIF > 10 (or, more conservatively, > 5) indicates growing collinearity.
4.2 Heteroskedasticity
# Breusch-Pagan
lmtest::bptest(mod0)
##
## studentized Breusch-Pagan test
##
## data: mod0
## BP = 1.0283, df = 3, p-value = 0.7944
# White-style test (bptest on the fitted values and their squares)
lmtest::bptest(mod0, ~ fitted(mod0) + I(fitted(mod0)^2))
##
## studentized Breusch-Pagan test
##
## data: mod0
## BP = 5.3586, df = 2, p-value = 0.06861
If significant, use robust SEs (see 3.4) for more reliable inference.
4.3 Residual normality
res <- rstandard(mod0)
shapiro.test(res) # sensitive for large n; also inspect the QQ-plot
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.87436, p-value = 7.729e-12
qqnorm(res); qqline(res)
4.4 Residual autocorrelation (optional)
(Mainly relevant for time-series data.)
lmtest::dwtest(mod0)
##
## Durbin-Watson test
##
## data: mod0
## DW = 2.0342, p-value = 0.5998
## alternative hypothesis: true autocorrelation is greater than 0
4.5 Specification (linearity) check – Ramsey RESET
lmtest::resettest(mod0)
##
## RESET test
##
## data: mod0
## RESET = 4.2735, df1 = 2, df2 = 194, p-value = 0.01527
4.6 Standard Diagnostic Plots
par(mfrow=c(2,2))
plot(mod0)
par(mfrow=c(1,1))
5. Model Selection
Strategy: start from candidate models (including simple transformations), run AIC-based stepwise selection, compare against the initial model, and validate via CV.
dat2 <- dat %>%
mutate(X1_sq = X1^2, X2_sq = X2^2, X3_sq = X3^2,
X1X2 = X1*X2, X1X3 = X1*X3, X2X3 = X2*X3)
mod_full <- lm(y ~ X1 + X2 + X3 + X1_sq + X2_sq + X3_sq + X1X2 + X1X3 + X2X3, data = dat2)
AIC(mod0); AIC(mod_full)
## [1] 657.0574
## [1] 660.7653
# AIC-based stepwise (both directions)
mod_step <- MASS::stepAIC(mod_full, direction = "both", trace = FALSE)
summary(mod_step)
##
## Call:
## lm(formula = y ~ X1 + X2 + X3 + X2_sq + X1X2, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2286 -0.6853 -0.0298 0.4825 7.0242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.19439 0.10487 20.925 < 2e-16 ***
## X1 1.31032 0.12513 10.472 < 2e-16 ***
## X2 -0.72799 0.12278 -5.929 1.37e-08 ***
## X3 0.57331 0.08996 6.373 1.32e-09 ***
## X2_sq -0.28501 0.11895 -2.396 0.0175 *
## X1X2 0.23671 0.13546 1.748 0.0821 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.22 on 194 degrees of freedom
## Multiple R-squared: 0.4571, Adjusted R-squared: 0.4431
## F-statistic: 32.67 on 5 and 194 DF, p-value: < 2.2e-16
broom::glance(mod_step)[,c("r.squared","adj.r.squared","AIC","BIC")]
## # A tibble: 1 × 4
## r.squared adj.r.squared AIC BIC
## <dbl> <dbl> <dbl> <dbl>
## 1 0.457 0.443 655. 678.
Cross-validation (10-fold) to compare RMSE:
# CV RMSE helper using boot::cv.glm (expects a glm fit)
cv_rmse <- function(model, data, K=10){
# cv.glm returns delta: [1] raw CV estimate, [2] bias-adjusted
set.seed(123)
res <- boot::cv.glm(data = data, glmfit = model, K = K)
sqrt(res$delta[1])
}
rmse_mod0 <- cv_rmse(mod0, dat)
rmse_modstep<- cv_rmse(mod_step, dat2)
c(RMSE_mod0 = rmse_mod0, RMSE_modStep = rmse_modstep)
## RMSE_mod0 RMSE_modStep
## NaN NaN
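The NaN values occur because boot::cv.glm relies on components of a glm object (such as the stored response glmfit$y) that lm() fits do not carry. A minimal fix, sketched below, refits the same formulas as Gaussian GLMs (identical coefficients) before cross-validating:
# Refit as gaussian GLMs so boot::cv.glm can resample them
g0 <- glm(y ~ X1 + X2 + X3, data = dat, family = gaussian())
gstep <- glm(formula(mod_step), data = dat2, family = gaussian())
c(RMSE_mod0 = cv_rmse(g0, dat), RMSE_modStep = cv_rmse(gstep, dat2))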
Choose the model with the best trade-off between simplicity and performance (RMSE, AIC/BIC, adjusted R-squared).
6. Outliers & Influential Points
Use studentized residuals, leverage (hat values), and Cook's distance, plus a Bonferroni outlier test.
infl <- influence.measures(mod0)
# Influence summary
summary(infl)
## Potentially influential observations of
## lm(formula = y ~ X1 + X2 + X3, data = dat) :
##
## dfb.1_ dfb.X1 dfb.X2 dfb.X3 dffit cov.r cook.d hat
## 16 -0.21 -0.06 -0.32 0.54 -0.78_* 0.94 0.15 0.07_*
## 18 0.30 -0.30 -0.24 0.09 0.74_* 0.75_* 0.13 0.03
## 27 0.00 0.00 0.00 0.01 0.01 1.07_* 0.00 0.04
## 56 -0.05 -0.05 0.01 0.11 -0.14 1.07_* 0.00 0.06
## 65 -0.14 0.36 -0.34 0.17 -0.45_* 0.97 0.05 0.04
## 90 0.31 0.25 0.04 -0.12 0.50_* 0.72_* 0.06 0.01
## 97 0.00 0.00 -0.01 0.00 -0.01 1.08_* 0.00 0.05
## 135 -0.03 0.07 -0.01 -0.08 -0.12 1.07_* 0.00 0.05
## 147 0.19 -0.21 -0.01 -0.25 0.42 0.92_* 0.04 0.03
## 149 0.23 0.54 -0.20 0.60 0.86_* 0.89_* 0.18 0.07_*
## 160 -0.02 0.04 -0.04 0.01 -0.05 1.06_* 0.00 0.04
## 162 0.43 -0.56 0.28 0.26 0.78_* 0.48_* 0.13 0.01
## 164 -0.20 -0.38 -0.25 -0.35 -0.87_* 0.95 0.18 0.08_*
## 191 0.03 0.02 -0.02 -0.08 0.08 1.07_* 0.00 0.05
# Diagnostic statistics
studres <- rstudent(mod0)
lev <- hatvalues(mod0)
cookd <- cooks.distance(mod0)
# Simple thresholds
thr_res <- 3 # |studentized residual| > 3
thr_lev <- 2*length(coef(mod0))/nrow(dat) # high leverage
thr_cook <- 4/nrow(dat) # large Cook's distance
flag <- data.frame(
id = 1:nrow(dat),
studres = studres,
leverage = lev,
cookd = cookd
) %>%
mutate(
flag_res = abs(studres) > thr_res,
flag_lev = leverage > thr_lev,
flag_cook = cookd > thr_cook,
any_flag = flag_res | flag_lev | flag_cook
) %>%
arrange(desc(any_flag), desc(abs(studres)))
head(flag, 10)
## id studres leverage cookd flag_res flag_lev flag_cook any_flag
## 162 162 6.450812 0.01441863 0.12607179 TRUE FALSE TRUE TRUE
## 90 90 4.302679 0.01343365 0.05785187 TRUE FALSE TRUE TRUE
## 18 18 4.188058 0.03056365 0.12748735 TRUE FALSE TRUE TRUE
## 149 149 3.243249 0.06634585 0.17821039 TRUE TRUE TRUE TRUE
## 164 164 -2.855691 0.08454000 0.18164105 FALSE TRUE TRUE TRUE
## 16 16 -2.803489 0.07241620 0.14821120 FALSE TRUE TRUE TRUE
## 147 147 2.584024 0.02573283 0.04284913 FALSE FALSE TRUE TRUE
## 143 143 2.200166 0.03420638 0.04203837 FALSE FALSE TRUE TRUE
## 30 30 2.172800 0.03120105 0.03730329 FALSE FALSE TRUE TRUE
## 65 65 -2.140175 0.04228760 0.04965406 FALSE TRUE TRUE TRUE
car::outlierTest(mod0) # tests studentized residuals for outliers with a Bonferroni correction
## rstudent unadjusted p-value Bonferroni p
## 162 6.450812 8.5823e-10 1.7165e-07
## 90 4.302679 2.6674e-05 5.3348e-03
## 18 4.188058 4.2583e-05 8.5167e-03
A quick plot of Cook's distance:
plot(cookd, type="h", main="Cook's Distance", ylab="D", xlab="Observation")
abline(h = thr_cook, lty=2)
Optional follow-up:
Refit without the most influential observations and compare coefficients/fit (shown below).
Consider robust regression (an M-estimator) when outliers are numerous: MASS::rlm; a sketch follows the comparison below.
# Refit without the most influential points (example)
to_drop <- which(cookd > thr_cook | abs(studres) > thr_res)
mod_refit <- lm(y ~ X1 + X2 + X3, data = dat[-to_drop, ])
broom::tidy(mod0)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.08 0.0873 23.8 9.73e-60
## 2 X1 1.32 0.125 10.6 6.73e-21
## 3 X2 -0.758 0.123 -6.16 3.97e- 9
## 4 X3 0.577 0.0907 6.37 1.35e- 9
broom::tidy(mod_refit)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.99 0.0566 35.1 2.69e-83
## 2 X1 1.44 0.0861 16.7 1.11e-38
## 3 X2 -0.676 0.0819 -8.26 2.87e-14
## 4 X3 0.501 0.0606 8.27 2.75e-14
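Following the MASS::rlm suggestion above, a minimal robust-regression sketch (Huber M-estimation by default), which downweights outliers instead of deleting them:
library(MASS)
mod_rob <- rlm(y ~ X1 + X2 + X3, data = dat) # Huber M-estimator
summary(mod_rob)$coefficients # compare with the OLS estimates above
head(sort(mod_rob$w), 5) # smallest final weights = most downweighted points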
7. Summary & Recommendations
• Check coefficient significance (t-tests) and model fit (F-test, R²).
• Watch for multicollinearity (VIF); if high, consider transformations, dropping variables, or ridge/lasso.
• If heteroskedasticity is detected, use robust SEs for inference or model the variance (e.g., WLS).
• Perform model selection (AIC/BIC, CV) to balance bias and variance.
• Evaluate outliers/influence; consider robust regression when needed.
• Validate the results substantively (domain plausibility) and for reproducibility (seed, pipeline).
1. Full OLS diagnostics on the mtcars data: heteroskedasticity (Breusch–Pagan), multicollinearity (VIF), influence (Cook's D), normality (QQ-plot). Report the findings and recommendations.
Answer:
1. Heteroskedasticity — Breusch–Pagan Test
Core concept: the test regresses (a function of) the squared residuals on the predictors; a small p-value indicates that the error variance depends on the predictors, i.e., heteroskedasticity.
# Regression model
data(mtcars)
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
# Breusch–Pagan test
library(lmtest)
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 0.9459, df = 3, p-value = 0.8143
2. Multicollinearity — VIF (Variance Inflation Factor)
Core concept
What it checks: whether the predictors are strongly correlated with one another, which makes the coefficient estimates unstable.
VIF formula for predictor \(X_j\):
\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} \]
where \(R_j^2\) is the coefficient of determination when \(X_j\) is regressed on the other predictors.
Rule of thumb: VIF > 10 signals serious collinearity; values above 5 already deserve attention.
Implementation in R
# Regression model
data(mtcars)
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
# Compute VIF
library(car)
vif(model)
## wt hp disp
## 4.844618 2.736633 7.324517
3. Influential Observations — Cook's Distance
Core concept: Cook's distance measures how much the fitted values change when observation \(i\) is removed; it combines residual size and leverage (see the formula in the theory section above).
Common rule of thumb: values above 4/n (the dashed line in the plot below) deserve inspection.
Implementation in R
data(mtcars)
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
# Compute Cook's distance
cooksD <- cooks.distance(model)
# Plot
plot(cooksD, type = "h", main="Cook's Distance", ylab="Cook's D")
abline(h = 4/length(cooksD), col="red", lty=2)
# Show the observations with the largest Cook's D
sort(cooksD, decreasing = TRUE)[1:10]
## Maserati Bora Chrysler Imperial Toyota Corolla Fiat 128
## 0.34029114 0.31997067 0.15297714 0.11960192
## Pontiac Firebird Lotus Europa AMC Javelin Dodge Challenger
## 0.06980693 0.05959750 0.04909944 0.04218196
## Merc 280C Toyota Corona
## 0.02606104 0.02215865
4. Residual Normality — QQ-plot
Core concept
Goal: check whether the residuals are approximately normally distributed; this matters for inference (t-tests, F-tests, confidence intervals).
QQ-plot: points close to the reference line indicate approximate normality; systematic curvature or heavy tails indicate departures.
Implementation in R
data(mtcars)
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
# Residual QQ-plot
qqnorm(rstudent(model), main="QQ-Plot Residuals")
qqline(rstudent(model), col="red")
Summary of Findings
Model analyzed: mpg ~ wt + hp + disp.
1. Heteroskedasticity: the Breusch–Pagan test (BP = 0.9459, p = 0.8143) shows no evidence of heteroskedasticity.
2. Multicollinearity: disp has a fairly high VIF (≈ 7.3); wt is also relatively high (≈ 4.8) but still below the serious threshold.
3. Influential observations: Maserati Bora and Chrysler Imperial have the largest Cook's distances (≈ 0.34 and ≈ 0.32).
4. Residual normality: the QQ-plot should be checked for departures in the tails.
Practical recommendation: consider dropping disp or wt (keep whichever variable is most informative), or use regularization.
Closing note: overall the model is still usable, but the moderate collinearity and a few influential observations suggest exploring alternative specifications so that the results are more stable and trustworthy.
2. Robust SEs (HC3) vs WLS on simulated heteroskedastic data; compare standard errors and p-values.
Answer:
1) Problem Setting (Brief)
We want to estimate the simple linear model:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
assuming heteroskedastic errors with variance growing in \(x\):
\[ \mathrm{Var}(\varepsilon_i \mid x_i) \propto x_i^2 \]
Implication: the OLS point estimates remain unbiased, but the classical OLS standard errors are invalid under this variance structure.
2) OLS Estimation and Classical Standard Errors
The OLS estimator for the linear model:
\[ \hat{\beta} = (X'X)^{-1}X'y \]
where \(X\) is the \(n \times p\) design matrix, \(y\) the response vector, and \(\hat{\beta}\) the vector of estimated coefficients.
Variance & classical standard errors
Under homoskedasticity:
\[ \widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1} \]
with
\[ \hat{\sigma}^2 = \frac{\sum \hat{e}_i^2}{n - p} \]
and \(\hat{e}_i = y_i - \hat{y}_i\) the residuals.
Key caveat: under heteroskedasticity this formula is biased, so the classical SEs and the p-values built on them are unreliable.
3) Robust SEs (HC3)
Main idea: keep the OLS point estimates but replace the covariance estimator with one that stays consistent under arbitrary heteroskedasticity.
HC3 formula
\[ \widehat{\mathrm{Var}}_{\text{HC3}}(\hat{\beta}) = (X'X)^{-1} \Big( X' \, \mathrm{diag}\!\left(\frac{\hat{e}_i^2}{(1-h_{ii})^2}\right) X \Big) (X'X)^{-1} \]
where \(\hat{e}_i\) is the OLS residual and \(h_{ii}\) the leverage (diagonal element of the hat matrix \(H = X(X'X)^{-1}X'\)).
Intuition: high-leverage points have artificially small residuals; dividing by \((1-h_{ii})^2\) compensates, which makes HC3 comparatively reliable in small samples.
Practical implication: the coefficients are unchanged; only the SEs, t-statistics, and p-values are recomputed.
4) Weighted Least Squares (WLS)
Main idea: weight each observation by the inverse of its error variance so that noisy observations count less; with a correct variance model, WLS is more efficient than OLS.
WLS formula
\[ \hat{\beta}_{\text{WLS}} = (X'WX)^{-1} X'Wy \]
where \(W = \mathrm{diag}(w_i)\) is the weight matrix, usually with \(w_i = 1/\hat{\sigma}_i^2\), the inverse error variance.
Example: with \(\mathrm{Var}(\varepsilon_i) \propto x_i^2\), use \(w_i = 1/x_i^2\) (exactly what the simulation below does).
Practical implication: unlike HC3, WLS changes both the point estimates and the SEs; it pays off when the variance function is (approximately) correctly specified.
5) Comparison on Simulated (Heteroskedastic) Data
We simulate data whose error variance increases with \(x\). True model: \(y = \beta_0 + \beta_1 x + \varepsilon\) with \(\beta_0 = 2\), \(\beta_1 = 1.5\), and \(\mathrm{sd}(\varepsilon_i) \propto x_i\) so that the error variance grows with \(x\).
set.seed(123)
n <- 40
x <- seq(0.5, 4, length.out = n)
sd <- 0.6 * x # error sd proportional to x
eps <- rnorm(n, mean = 0, sd = sd)
y <- 2 + 1.5 * x + eps
dat <- data.frame(x, y, sd)
m_ols <- lm(y ~ x, data = dat)
m_wls <- lm(y ~ x, data = dat, weights = 1 / (x^2))
# Classical (default) SEs for OLS
se_ols <- sqrt(diag(vcov(m_ols)))
# HC3 robust SEs
library(sandwich)
library(lmtest)
vc_hc3 <- vcovHC(m_ols, type = "HC3")
se_hc3 <- sqrt(diag(vc_hc3))
# WLS SEs (default vcov from lm with weights)
se_wls <- sqrt(diag(vcov(m_wls)))
# Summarize coefficients, SEs, and p-values
tab <- data.frame(
Method = c("OLS (classical)", "OLS + HC3", "WLS"),
beta0 = c(coef(m_ols)[1], coef(m_ols)[1], coef(m_wls)[1]),
se_b0 = c(se_ols[1], se_hc3[1], se_wls[1]),
p_b0 = c(coeftest(m_ols)[1,4], coeftest(m_ols, vcov.=vc_hc3)[1,4], coeftest(m_wls)[1,4]),
beta1 = c(coef(m_ols)[2], coef(m_ols)[2], coef(m_wls)[2]),
se_b1 = c(se_ols[2], se_hc3[2], se_wls[2]),
p_b1 = c(coeftest(m_ols)[2,4], coeftest(m_ols, vcov.=vc_hc3)[2,4], coeftest(m_wls)[2,4])
)
knitr::kable(tab, digits = 5, caption = "Comparison of coefficients, standard errors, and p-values")
| Method | beta0 | se_b0 | p_b0 | beta1 | se_b1 | p_b1 |
|---|---|---|---|---|---|---|
| OLS (classical) | 1.82910 | 0.46313 | 0.00033 | 1.59916 | 0.18697 | 0 |
| OLS + HC3 | 1.82910 | 0.30483 | 0.00000 | 1.59916 | 0.15889 | 0 |
| WLS | 2.07486 | 0.20805 | 0.00000 | 1.48159 | 0.15306 | 0 |
plot(fitted(m_ols), resid(m_ols),
xlab = "Fitted (OLS)", ylab = "Residual (OLS)",
main = "Residual vs Fitted (OLS)")
abline(h = 0, lty = 2)
A Breusch–Pagan test confirms the heteroskedasticity:
library(lmtest)
bptest(m_ols)
##
## studentized Breusch-Pagan test
##
## data: m_ols
## BP = 4.3684, df = 1, p-value = 0.03661
3. Spline vs polynomial: model mpg ~ wt and compare AIC/BIC plus 5-fold CV.
Answer:
Data: mtcars (n = 32); response = mpg, predictor = wt. Goal: capture the relationship between wt and mpg with models of increasing flexibility.
Implementation in R
data(mtcars)
mtcars_small <- mtcars[, c("mpg","wt")]
str(mtcars_small)
## 'data.frame': 32 obs. of 2 variables:
## $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
summary(mtcars_small)
## mpg wt
## Min. :10.40 Min. :1.513
## 1st Qu.:15.43 1st Qu.:2.581
## Median :19.20 Median :3.325
## Mean :20.09 Mean :3.217
## 3rd Qu.:22.80 3rd Qu.:3.610
## Max. :33.90 Max. :5.424
Linear
\[
mpg = \beta_0 + \beta_1\, wt + \varepsilon
\]
Polynomial
Degree 2: \[ mpg = \beta_0 + \beta_1\,wt + \beta_2\,wt^2 + \varepsilon \]
Degree 3: \[ mpg = \beta_0 + \beta_1\,wt + \beta_2\,wt^2 + \beta_3\,wt^3 + \varepsilon \]
Natural spline
- df = 3
- df = 4
Note: splines provide local flexibility; polynomials are globally flexible and can oscillate near the edges of the data range.
Implementation in R
library(splines)
# Use gaussian GLMs so cv.glm can be applied directly
m_lin <- glm(mpg ~ wt, data = mtcars, family = gaussian())
m_p2 <- glm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars, family = gaussian())
m_p3 <- glm(mpg ~ poly(wt, 3, raw = TRUE), data = mtcars, family = gaussian())
m_ns3 <- glm(mpg ~ ns(wt, df = 3), data = mtcars, family = gaussian())
m_ns4 <- glm(mpg ~ ns(wt, df = 4), data = mtcars, family = gaussian())
mods <- list(
Linear = m_lin,
Poly2 = m_p2,
Poly3 = m_p3,
NS3 = m_ns3,
NS4 = m_ns4
)
lapply(mods, summary)
## $Linear
##
## Call:
## glm(formula = mpg ~ wt, family = gaussian(), data = mtcars)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 9.277398)
##
## Null deviance: 1126.05 on 31 degrees of freedom
## Residual deviance: 278.32 on 30 degrees of freedom
## AIC: 166.03
##
## Number of Fisher Scoring iterations: 2
##
##
## $Poly2
##
## Call:
## glm(formula = mpg ~ poly(wt, 2, raw = TRUE), family = gaussian(),
## data = mtcars)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.9308 4.2113 11.856 1.21e-12 ***
## poly(wt, 2, raw = TRUE)1 -13.3803 2.5140 -5.322 1.04e-05 ***
## poly(wt, 2, raw = TRUE)2 1.1711 0.3594 3.258 0.00286 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7.025705)
##
## Null deviance: 1126.05 on 31 degrees of freedom
## Residual deviance: 203.75 on 29 degrees of freedom
## AIC: 158.05
##
## Number of Fisher Scoring iterations: 2
##
##
## $Poly3
##
## Call:
## glm(formula = mpg ~ poly(wt, 3, raw = TRUE), family = gaussian(),
## data = mtcars)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.40370 15.58379 3.106 0.00431 **
## poly(wt, 3, raw = TRUE)1 -11.82598 15.46346 -0.765 0.45081
## poly(wt, 3, raw = TRUE)2 0.68938 4.74034 0.145 0.88541
## poly(wt, 3, raw = TRUE)3 0.04594 0.45070 0.102 0.91954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7.273924)
##
## Null deviance: 1126.05 on 31 degrees of freedom
## Residual deviance: 203.67 on 28 degrees of freedom
## AIC: 160.04
##
## Number of Fisher Scoring iterations: 2
##
##
## $NS3
##
## Call:
## glm(formula = mpg ~ ns(wt, df = 3), family = gaussian(), data = mtcars)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.286 1.664 19.406 < 2e-16 ***
## ns(wt, df = 3)1 -13.903 1.938 -7.173 8.32e-08 ***
## ns(wt, df = 3)2 -27.887 3.846 -7.251 6.80e-08 ***
## ns(wt, df = 3)3 -15.627 1.815 -8.612 2.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7.231457)
##
## Null deviance: 1126.05 on 31 degrees of freedom
## Residual deviance: 202.48 on 28 degrees of freedom
## AIC: 159.85
##
## Number of Fisher Scoring iterations: 2
##
##
## $NS4
##
## Call:
## glm(formula = mpg ~ ns(wt, df = 4), family = gaussian(), data = mtcars)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.543 1.834 17.745 < 2e-16 ***
## ns(wt, df = 4)1 -12.942 2.172 -5.960 2.35e-06 ***
## ns(wt, df = 4)2 -15.236 2.565 -5.941 2.47e-06 ***
## ns(wt, df = 4)3 -28.229 5.081 -5.556 6.87e-06 ***
## ns(wt, df = 4)4 -16.272 1.999 -8.139 9.65e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7.457091)
##
## Null deviance: 1126.05 on 31 degrees of freedom
## Residual deviance: 201.34 on 27 degrees of freedom
## AIC: 161.67
##
## Number of Fisher Scoring iterations: 2
Both criteria penalize complexity:
\[ \mathrm{AIC} = -2\ell + 2k, \qquad \mathrm{BIC} = -2\ell + k\log n, \]
where \(\ell\) is the model log-likelihood, \(k\) the effective number of parameters, and \(n\) the number of observations. The smaller the AIC/BIC, the better the fit-versus-complexity trade-off: a more complex model usually lowers \(-2\ell\) (better fit) but increases the penalty (\(2k\) or \(k\log n\)).
Implementation in R
library(dplyr)
metrics_ic <- tibble::tibble(
Model = names(mods),
AIC = sapply(mods, AIC) |> as.numeric(),
BIC = sapply(mods, BIC) |> as.numeric()
) %>% arrange(AIC)
knitr::kable(metrics_ic, digits = 3, caption = "Perbandingan AIC/BIC")
| Model | AIC | BIC |
|---|---|---|
| Poly2 | 158.048 | 163.911 |
| NS3 | 159.849 | 167.178 |
| Poly3 | 160.037 | 167.365 |
| NS4 | 161.669 | 170.463 |
| Linear | 166.029 | 170.427 |
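As a sanity check on the formula, the AIC/BIC of the linear fit can be reproduced from its log-likelihood (for a gaussian GLM, \(k\) counts the intercept, the slope, and the error variance):
ll <- logLik(m_lin); k <- attr(ll, "df"); n <- nobs(m_lin)
c(AIC_manual = -2*as.numeric(ll) + 2*k, AIC_R = AIC(m_lin)) # both 166.03
c(BIC_manual = -2*as.numeric(ll) + k*log(n), BIC_R = BIC(m_lin))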
cv_mse <- function(data, fit, K = 5) {
# cv.glm: delta[1] = raw CV estimate (MSE for gaussian models)
boot::cv.glm(data = data, glmfit = fit, K = K)$delta[1]
}
metrics_cv <- tibble::tibble(
Model = names(mods),
CV_MSE = sapply(mods, function(fit) cv_mse(mtcars, fit, K = 5))
) %>% arrange(CV_MSE)
knitr::kable(metrics_cv, digits = 4, caption = "5-fold CV (MSE)")
| Model | CV_MSE |
|---|---|
| Poly3 | 7.8588 |
| Poly2 | 7.9531 |
| NS4 | 8.9777 |
| Linear | 9.7610 |
| NS3 | 9.9831 |
Linear: likely underfits; the mpg–wt relationship is usually not purely linear (mpg falls more steeply for heavier cars).
Degree-2 polynomial: flexible enough for a gentle curve, so AIC drops markedly relative to linear.
Degree-3 polynomial: more flexible but prone to overfitting; BIC penalizes harder and may not support the extra degree.
Natural spline (df = 3/4): flexible yet stable at the edges of the range; often competitive with or better than high-order polynomials.
A priori expectations about the ranking:
AIC: more flexible models tend to win, subject to the penalty; expect Poly-2 and NS(df = 3) to beat Linear, with Poly-3 and NS(df = 4) close behind.
BIC: tends to favor simpler models, so Poly-2 or NS(df = 3) is often ahead.
5-fold CV: Poly-2 and NS(df = 3) usually give the smallest CV-MSE; Linear underfits, while Poly-3/NS(df = 4) do not always improve stably with only n = 32.
library(ggplot2)
wt_grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 200))
pred_df <- dplyr::bind_rows(
data.frame(wt = wt_grid$wt,
mpg_hat = predict(m_lin, newdata = wt_grid),
Model = "Linear"),
data.frame(wt = wt_grid$wt,
mpg_hat = predict(m_p2, newdata = wt_grid),
Model = "Poly-2"),
data.frame(wt = wt_grid$wt,
mpg_hat = predict(m_p3, newdata = wt_grid),
Model = "Poly-3"),
data.frame(wt = wt_grid$wt,
mpg_hat = predict(m_ns3, newdata = wt_grid),
Model = "NS df=3"),
data.frame(wt = wt_grid$wt,
mpg_hat = predict(m_ns4, newdata = wt_grid),
Model = "NS df=4")
)
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
geom_line(data = pred_df, aes(wt, mpg_hat, color = Model), linewidth = 1) +
labs(title = "mpg ~ wt: Spline vs Polinomial",
x = "wt", y = "mpg")
6. Recommendation
Practical choice: by AIC and BIC the degree-2 polynomial is best here (AIC 158.0, BIC 163.9) and it is near-best under 5-fold CV, so Poly-2 is a sensible default; NS(df = 3) is a reasonable alternative when stability at the edges of the range matters.
4. AR(1) GLS: use the GLS code above, replace x with a seasonal signal (e.g., \(\sin(2\pi t/12)\)); interpret \(\hat\rho\) and compare the OLS vs GLS SEs.
Answer:
1) Pemodelan & Tujuan
Model mean (data bulanan):
\[ y_t = \beta_0 + \beta_1\,x_t + \varepsilon_t,\quad x_t=\sin\!\Big(\tfrac{2\pi t}{12}\Big),\quad t=1,\dots,n. \]
AR(1) error structure:
\[ \varepsilon_t = \rho\,\varepsilon_{t-1} + u_t,\quad u_t \sim \text{i.i.d. }(0,\sigma_u^2),\ |\rho|<1. \]
Goals:
1) Estimate the seasonal effect \(\beta_1\) with a GLS that accounts for the autocorrelation.
2) Interpret \(\hat\rho\) as the month-to-month persistence of the disturbances.
3) Compare the OLS and GLS standard errors for \(\beta_1\).
2) OLS Baseline (Ignoring the AR Structure)
The OLS estimator:
\[ \hat\beta^{\text{OLS}} = (X'X)^{-1}X'y,\quad X=[\mathbf{1},\,x]. \]
If \(x_t\) is exogenous, \(\hat\beta^{\text{OLS}}\) is unbiased. The problem is that AR(1) autocorrelation makes the classical OLS covariance invalid for t/F tests.
The "true" error covariance under AR(1):
\[ \Omega=\frac{\sigma_u^2}{1-\rho^2}\big[\rho^{|t-s|}\big]_{t,s=1}^n. \]
3) GLS for AR(1): Idea & Transformation
In theory:
\[ \hat\beta^{\text{GLS}}=(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y. \]
In practice one uses FGLS via the Prais–Winsten transformation:
For \(t\ge 2\): \[ y_t^*=y_t-\rho\,y_{t-1},\qquad x_t^*=x_t-\rho\,x_{t-1}. \]
For \(t=1\): \[ y_1^*=\sqrt{1-\rho^2}\,y_1,\qquad x_1^*=\sqrt{1-\rho^2}\,x_1. \]
Then run OLS on the starred data:
\[ \hat\beta^{\text{GLS}}=\big((X^*)'X^*\big)^{-1}(X^*)'y^*,\qquad \widehat{\mathrm{Var}}(\hat\beta^{\text{GLS}})=\hat\sigma_*^2\big((X^*)'X^*\big)^{-1}. \]
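A minimal sketch of this transformation on the simulated AR(1) data from earlier (dar1, with T and rho_hat already defined there); unlike the quasi-differencing shown above, it keeps the first observation (scaled):
# Prais-Winsten: transform all T observations, then run OLS
y_pw <- c(sqrt(1 - rho_hat^2)*dar1$y[1], dar1$y[-1] - rho_hat*dar1$y[-T])
x_pw <- c(sqrt(1 - rho_hat^2)*dar1$x[1], dar1$x[-1] - rho_hat*dar1$x[-T])
i_pw <- c(sqrt(1 - rho_hat^2), rep(1 - rho_hat, T - 1)) # transformed intercept column
fit_pw <- lm(y_pw ~ 0 + i_pw + x_pw) # compare with the gls() fit above
coef(fit_pw)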
4) Estimating \(\rho\) and Interpreting \(\hat\rho\)
Brief steps: fit OLS, estimate \(\hat\rho\) by regressing \(\hat e_t\) on \(\hat e_{t-1}\) (or let nlme::gls estimate it jointly), then apply the transformation above.
Meaning of \(\hat\rho\): it measures how strongly this month's disturbance carries over into the next month; values near 1 mean very persistent shocks, values near 0 mean little carry-over.
Note on the seasonal sinusoid: for \(x_t=\sin(2\pi t/12)\), the lag-1 autocorrelation of the signal is \(\cos(2\pi/12)=\cos(\pi/6)\approx 0.866\). If the errors are also persistent (\(\hat\rho>0\)), the AR(1) correction matters so that inference is not overly optimistic.
5) OLS vs GLS SEs: What to Expect
With positively autocorrelated errors and a slowly varying regressor such as the monthly sinusoid, classical OLS SEs tend to understate the true uncertainty, while the GLS SEs are valid and the GLS estimator is typically more efficient.
6) What to Report: \(\hat\beta_1\) with both OLS and GLS standard errors, \(\hat\rho\) and its interpretation, and residual diagnostics (Durbin–Watson, ACF of raw vs normalized residuals).
7) Step Checklist: simulate or load the monthly series; fit OLS and check residual autocorrelation (dwtest, ACF); fit gls(..., correlation = corAR1(form = ~ t)); compare the SEs; report \(\hat\rho\) and the conclusions.