This analysis compares two simple linear regression models that relate Ozone concentration (the dependent variable) to two separate predictors: Temp (temperature) and Wind (wind speed). The goal is to determine which model is better according to MSE, AIC, and adjusted R-squared, and to draw insights from the comparison.
We use the airquality dataset, which is built into R. It contains daily air quality measurements taken in New York from May to September 1973, with the variables Ozone, Solar.R, Wind, Temp, Month, and Day.
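# Load the packages used throughout (not shown in the original listing)
library(tidyverse)  # tibble, dplyr, tidyr, ggplot2
library(broom)      # tidy(), glance()
library(knitr)      # kable()
# Load the dataset and convert it to a tibble
data("airquality")
aq <- as_tibble(airquality)
# Quick overview
glimpse(aq)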
## Rows: 153
## Columns: 6
## $ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
## $ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
## $ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
## $ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
summary(aq)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
# Keep only rows that are complete for Ozone, Temp, and Wind (of these, only Ozone has NAs)
aq_clean <- aq %>% select(Ozone, Temp, Wind) %>% drop_na()
glimpse(aq_clean)
## Rows: 116
## Columns: 3
## $ Ozone <int> 41, 36, 12, 18, 28, 23, 19, 8, 7, 16, 11, 14, 18, 14, 34, 6, 30,…
## $ Temp <int> 67, 72, 74, 62, 66, 65, 59, 61, 74, 69, 66, 68, 58, 64, 66, 57, …
## $ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.9, 8.6, 13.8, 20.1, 6.9, 9.7, 9.2, 10.9…
kable(head(aq_clean, 10))
Ozone | Temp | Wind |
---|---|---|
41 | 67 | 7.4 |
36 | 72 | 8.0 |
12 | 74 | 12.6 |
18 | 62 | 11.5 |
28 | 66 | 14.9 |
23 | 65 | 8.6 |
19 | 59 | 13.8 |
8 | 61 | 20.1 |
7 | 74 | 6.9 |
16 | 69 | 9.7 |
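As a quick sanity check on the cleaning step, the sketch below counts the missing values with base R only, so it runs without any extra packages. It uses `complete.cases()` as a stand-in for `drop_na()`, which is an assumption of this example rather than the method used above; the two should keep the same rows here.

```r
# Base R only; `airquality` ships with R's datasets package.
data("airquality")

cols <- c("Ozone", "Temp", "Wind")

# Missing values per column of interest.
na_counts <- colSums(is.na(airquality[, cols]))
print(na_counts)

# Rows that are complete across all three columns.
n_complete <- sum(complete.cases(airquality[, cols]))
print(n_complete)
```

The Ozone count should match the `NA's :37` entry in the summary, and the complete-row count should match the 116 rows reported by `glimpse(aq_clean)`.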
# Fit the linear model Ozone ~ Temp
m_temp <- lm(Ozone ~ Temp, data = aq_clean)
# Model summary
tidy(m_temp) %>% kable(digits = 4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -146.9955 | 18.2872 | -8.0382 | 0 |
Temp | 2.4287 | 0.2331 | 10.4177 | 0 |
glance(m_temp) %>% kable(digits = 4)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.4877 | 0.4832 | 23.7143 | 108.529 | 0 | 1 | -530.8532 | 1067.706 | 1075.967 | 64109.89 | 114 | 116 |
# Predictions and in-sample MSE
aq_clean <- aq_clean %>% mutate(pred_temp = predict(m_temp, aq_clean),
resid_temp = Ozone - pred_temp)
mse_temp <- mean( (aq_clean$Ozone - aq_clean$pred_temp)^2 )
aic_temp <- AIC(m_temp)
adjr2_temp <- summary(m_temp)$adj.r.squared
# Print the metrics
tibble(
model = "Ozone ~ Temp",
MSE = mse_temp,
AIC = aic_temp,
Adj_R2 = adjr2_temp
) %>% kable(digits = 4)
model | MSE | AIC | Adj_R2 |
---|---|---|---|
Ozone ~ Temp | 552.6715 | 1067.706 | 0.4832 |
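The MSE above is an in-sample figure: it divides the residual sum of squares by n, whereas the `sigma` reported by `glance()` divides by the residual degrees of freedom (n - 2). The following base R sketch, which refits the same model on its own copy of the data (`aq_cc` is a name introduced only for this example), shows how the two relate.

```r
data("airquality")
cols <- c("Ozone", "Temp", "Wind")
aq_cc <- airquality[complete.cases(airquality[, cols]), ]

m <- lm(Ozone ~ Temp, data = aq_cc)

rss <- sum(resid(m)^2)  # residual sum of squares, same value as deviance(m)
n <- nobs(m)

mse_in_sample <- rss / n          # divides by n
sigma_sq <- rss / df.residual(m)  # divides by n - 2

print(c(MSE = mse_in_sample, sigma_squared = sigma_sq, sigma = sqrt(sigma_sq)))
```

Because the divisor is smaller, `sigma^2` is always slightly larger than the in-sample MSE; neither is an estimate of out-of-sample error.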
ggplot(aq_clean, aes(x = Temp, y = Ozone)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Ozone vs Temp", subtitle = "Model: Ozone ~ Temp",
x = "Temperature (F)", y = "Ozone (ppb)") +
theme_minimal()
# Fit the linear model Ozone ~ Wind
m_wind <- lm(Ozone ~ Wind, data = aq_clean)
# Model summary
tidy(m_wind) %>% kable(digits = 4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 96.8729 | 7.2387 | 13.3827 | 0 |
Wind | -5.5509 | 0.6904 | -8.0401 | 0 |
glance(m_wind) %>% kable(digits = 4)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.3619 | 0.3563 | 26.4673 | 64.6437 | 0 | 1 | -543.5937 | 1093.187 | 1101.448 | 79859.01 | 114 | 116 |
# Predictions and in-sample MSE
aq_clean <- aq_clean %>% mutate(pred_wind = predict(m_wind, aq_clean),
resid_wind = Ozone - pred_wind)
mse_wind <- mean( (aq_clean$Ozone - aq_clean$pred_wind)^2 )
aic_wind <- AIC(m_wind)
adjr2_wind <- summary(m_wind)$adj.r.squared
# Print the metrics
tibble(
model = "Ozone ~ Wind",
MSE = mse_wind,
AIC = aic_wind,
Adj_R2 = adjr2_wind
) %>% kable(digits = 4)
model | MSE | AIC | Adj_R2 |
---|---|---|---|
Ozone ~ Wind | 688.4398 | 1093.187 | 0.3563 |
ggplot(aq_clean, aes(x = Wind, y = Ozone)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Ozone vs Wind", subtitle = "Model: Ozone ~ Wind",
x = "Wind (mph)", y = "Ozone (ppb)") +
theme_minimal()
comparison <- tibble(
model = c("Ozone ~ Temp", "Ozone ~ Wind"),
MSE = c(mse_temp, mse_wind),
AIC = c(aic_temp, aic_wind),
Adj_R2 = c(adjr2_temp, adjr2_wind)
)
# Optionally add a rank for each metric (1 = best)
comparison <- comparison %>%
mutate(rank_MSE = rank(MSE, ties.method = "min"),
rank_AIC = rank(AIC, ties.method = "min"),
rank_AdjR2 = rank(-Adj_R2, ties.method = "min")) # higher adjR2 => better
kable(comparison, digits = 4)
model | MSE | AIC | Adj_R2 | rank_MSE | rank_AIC | rank_AdjR2 |
---|---|---|---|---|---|---|
Ozone ~ Temp | 552.6715 | 1067.706 | 0.4832 | 1 | 1 | 1 |
Ozone ~ Wind | 688.4398 | 1093.187 | 0.3563 | 2 | 2 | 2 |
MSE (Mean Squared Error): The model with the smaller MSE gives more accurate predictions on average. Here the MSE is computed on the same data used to fit the model, so it measures in-sample fit. The Temp model has the smaller MSE (552.67 versus 688.44).
AIC (Akaike Information Criterion): The smaller the AIC, the better the model balances goodness of fit against model complexity. The Temp model has the lower AIC (1067.71 versus 1093.19).
Adjusted R-squared: Adjusted R-squared indicates the proportion of variation in Ozone explained by the predictor, adjusted for the number of predictors. A higher value means the model explains the data better. The Temp model scores higher (0.4832 versus 0.3563).
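To make the three metrics concrete, the sketch below recomputes them by hand from the residual sum of squares, using base R only. The helper `metrics_by_hand()` is a hypothetical name introduced for this example; the AIC formula assumes the Gaussian log-likelihood that `lm` uses, with the error variance counted as one extra parameter.

```r
data("airquality")
cols <- c("Ozone", "Temp", "Wind")
aq_cc <- airquality[complete.cases(airquality[, cols]), ]

# Hypothetical helper: MSE, AIC, and adjusted R-squared from first principles.
metrics_by_hand <- function(model) {
  n   <- nobs(model)
  rss <- sum(resid(model)^2)                 # residual sum of squares
  y   <- model.response(model.frame(model))  # observed response
  tss <- sum((y - mean(y))^2)                # total sum of squares
  k   <- length(coef(model)) + 1             # coefficients plus error variance
  log_lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
  c(MSE    = rss / n,
    AIC    = -2 * log_lik + 2 * k,
    Adj_R2 = 1 - (rss / df.residual(model)) / (tss / (n - 1)))
}

m_temp <- lm(Ozone ~ Temp, data = aq_cc)
m_wind <- lm(Ozone ~ Wind, data = aq_cc)

by_hand <- rbind(Temp = metrics_by_hand(m_temp),
                 Wind = metrics_by_hand(m_wind))
print(round(by_hand, 4))
```

All three formulas depend on the model only through its residual sum of squares once n and the number of parameters are fixed. Since both models here are fitted to the same rows with one predictor each, the three metrics must rank them identically; they would only disagree when comparing models of different size.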
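Coefficient Interpretation
In the Temp model the slope is 2.4287: each additional degree Fahrenheit is associated with an increase of about 2.43 ppb in Ozone. In the Wind model the slope is -5.5509: each additional mph of wind speed is associated with a decrease of about 5.55 ppb.
If the Temp model has the smaller MSE and AIC and the higher adjusted R-squared, Temp is a better predictor than Wind. If Wind were ahead on those metrics, wind would be the more influential variable. In this data the Temp model ranks first on all three metrics, so Temp is the better single predictor of Ozone.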
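One way to read a slope is as the change in predicted Ozone for a one-unit change in the predictor. The base R sketch below illustrates this with `predict()`; the input values (80 and 81 °F, 10 and 11 mph) are arbitrary choices for illustration, not values singled out by the analysis.

```r
data("airquality")
cols <- c("Ozone", "Temp", "Wind")
aq_cc <- airquality[complete.cases(airquality[, cols]), ]

m_temp <- lm(Ozone ~ Temp, data = aq_cc)
m_wind <- lm(Ozone ~ Wind, data = aq_cc)

# Predicted Ozone at two inputs one unit apart (illustrative values).
pred_temp <- predict(m_temp, newdata = data.frame(Temp = c(80, 81)))
pred_wind <- predict(m_wind, newdata = data.frame(Wind = c(10, 11)))

# The difference between the two predictions equals the fitted slope.
step_temp <- unname(diff(pred_temp))
step_wind <- unname(diff(pred_wind))
print(c(Temp_step = step_temp, Wind_step = step_wind))
```

The two slopes are on different scales (ppb per °F versus ppb per mph), so their magnitudes alone do not say which predictor matters more; the fit metrics in the comparison table are what support that conclusion.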