Introduction

In this analysis we compare two simple linear regression models that relate Ozone concentration (the dependent variable) to two separate predictors: Temp (temperature) and Wind (wind speed). The goal is to assess which model is better according to MSE, AIC, and adjusted R-squared, and to draw insights from the comparison.

Data

We use the airquality dataset, which ships with R. It contains daily air quality measurements taken in New York from May to September 1973: Ozone, Solar.R, Wind, and Temp, plus the Month and Day of each observation.

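# packages used throughout: tidyverse (pipes, plots), broom (tidy/glance), knitr (kable)
library(tidyverse)
library(broom)
library(knitr)

# load the data and convert it to a tibble
data("airquality")
aq <- as_tibble(airquality)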

# quick overview
glimpse(aq)
## Rows: 153
## Columns: 6
## $ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, …
## $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27…
## $ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9…
## $ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64…
## $ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
summary(aq)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
# Keep only the three columns we need and drop rows where Ozone is NA
aq_clean <- aq %>% select(Ozone, Temp, Wind) %>% drop_na()
glimpse(aq_clean)
## Rows: 116
## Columns: 3
## $ Ozone <int> 41, 36, 12, 18, 28, 23, 19, 8, 7, 16, 11, 14, 18, 14, 34, 6, 30,…
## $ Temp  <int> 67, 72, 74, 62, 66, 65, 59, 61, 74, 69, 66, 68, 58, 64, 66, 57, …
## $ Wind  <dbl> 7.4, 8.0, 12.6, 11.5, 14.9, 8.6, 13.8, 20.1, 6.9, 9.7, 9.2, 10.9…
kable(head(aq_clean, 10))
Ozone  Temp  Wind
   41    67   7.4
   36    72   8.0
   12    74  12.6
   18    62  11.5
   28    66  14.9
   23    65   8.6
   19    59  13.8
    8    61  20.1
    7    74   6.9
   16    69   9.7

Model 1 — Ozone ~ Temp

# Fit the linear model Ozone ~ Temp
m_temp <- lm(Ozone ~ Temp, data = aq_clean)

# Model summary
tidy(m_temp) %>% kable(digits = 4)
term          estimate  std.error  statistic  p.value
(Intercept)  -146.9955    18.2872    -8.0382        0
Temp            2.4287     0.2331    10.4177        0
glance(m_temp) %>% kable(digits = 4)
r.squared  adj.r.squared  sigma    statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual  nobs
0.4877     0.4832         23.7143  108.529    0        1   -530.8532  1067.706  1075.967  64109.89  114          116
# Predictions and MSE
aq_clean <- aq_clean %>% mutate(pred_temp = predict(m_temp, aq_clean),
                                resid_temp = Ozone - pred_temp)
mse_temp <- mean( (aq_clean$Ozone - aq_clean$pred_temp)^2 )
aic_temp <- AIC(m_temp)
adjr2_temp <- summary(m_temp)$adj.r.squared

# Print the metrics
tibble(
  model = "Ozone ~ Temp",
  MSE = mse_temp,
  AIC = aic_temp,
  Adj_R2 = adjr2_temp
) %>% kable(digits = 4)
model         MSE       AIC       Adj_R2
Ozone ~ Temp  552.6715  1067.706  0.4832
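
The MSE above can be cross-checked against the glance() row, since for an lm fit the deviance is the residual sum of squares. The following is a minimal sketch using only base R; the names aq_ozone and fit_temp are introduced here for illustration and are not part of the original analysis.

```r
# Self-contained sketch: base R plus the built-in airquality data only.
# aq_ozone and fit_temp are illustrative names, not objects from the report.
aq_ozone <- airquality[!is.na(airquality$Ozone), c("Ozone", "Temp", "Wind")]
fit_temp <- lm(Ozone ~ Temp, data = aq_ozone)

n   <- nobs(fit_temp)       # number of observations used in the fit
rss <- deviance(fit_temp)   # residual sum of squares for an lm object

mse_from_rss   <- rss / n                      # in-sample MSE, divides by n
mse_from_resid <- mean(residuals(fit_temp)^2)  # same quantity, from the residuals
sigma_squared  <- rss / df.residual(fit_temp)  # residual variance, divides by n - 2

c(MSE = mse_from_rss, sigma = sqrt(sigma_squared))
```

The MSE divides by n while sigma divides by the residual degrees of freedom, which is why the reported sigma (23.7143) is slightly larger than the square root of the MSE (552.6715).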

Plot: Ozone vs Temp with regression line

ggplot(aq_clean, aes(x = Temp, y = Ozone)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Ozone vs Temp", subtitle = "Model: Ozone ~ Temp",
       x = "Temperature (F)", y = "Ozone (ppb)") +
  theme_minimal()

Model 2 — Ozone ~ Wind

# Fit the linear model Ozone ~ Wind
m_wind <- lm(Ozone ~ Wind, data = aq_clean)

# Model summary
tidy(m_wind) %>% kable(digits = 4)
term         estimate  std.error  statistic  p.value
(Intercept)   96.8729     7.2387    13.3827        0
Wind          -5.5509     0.6904    -8.0401        0
glance(m_wind) %>% kable(digits = 4)
r.squared  adj.r.squared  sigma    statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual  nobs
0.3619     0.3563         26.4673  64.6437    0        1   -543.5937  1093.187  1101.448  79859.01  114          116
# Predictions and MSE
aq_clean <- aq_clean %>% mutate(pred_wind = predict(m_wind, aq_clean),
                                resid_wind = Ozone - pred_wind)
mse_wind <- mean( (aq_clean$Ozone - aq_clean$pred_wind)^2 )
aic_wind <- AIC(m_wind)
adjr2_wind <- summary(m_wind)$adj.r.squared

# Print the metrics
tibble(
  model = "Ozone ~ Wind",
  MSE = mse_wind,
  AIC = aic_wind,
  Adj_R2 = adjr2_wind
) %>% kable(digits = 4)
model         MSE       AIC       Adj_R2
Ozone ~ Wind  688.4398  1093.187  0.3563
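
Because each model has a single predictor, its R-squared equals the squared Pearson correlation between Ozone and that predictor. The sketch below checks this with base R; aq_ozone and the r_* and r2_* names are illustrative and not part of the original analysis.

```r
# Self-contained sketch: base R plus the built-in airquality data only.
# All object names here are illustrative.
aq_ozone <- airquality[!is.na(airquality$Ozone), c("Ozone", "Temp", "Wind")]

# Pearson correlations between the response and each predictor
r_temp <- cor(aq_ozone$Ozone, aq_ozone$Temp)
r_wind <- cor(aq_ozone$Ozone, aq_ozone$Wind)

# R-squared as reported by the fitted models
r2_temp <- summary(lm(Ozone ~ Temp, data = aq_ozone))$r.squared
r2_wind <- summary(lm(Ozone ~ Wind, data = aq_ozone))$r.squared

# Side-by-side comparison: squared correlation vs model R-squared
rbind(
  Temp = c(r = r_temp, r_squared = r_temp^2, lm_r_squared = r2_temp),
  Wind = c(r = r_wind, r_squared = r_wind^2, lm_r_squared = r2_wind)
)
```

The sign of each correlation matches the sign of the corresponding slope: positive for Temp, negative for Wind.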

Plot: Ozone vs Wind with regression line

ggplot(aq_clean, aes(x = Wind, y = Ozone)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Ozone vs Wind", subtitle = "Model: Ozone ~ Wind",
       x = "Wind (mph)", y = "Ozone (ppb)") +
  theme_minimal()

Perbandingan Model

comparison <- tibble(
  model = c("Ozone ~ Temp", "Ozone ~ Wind"),
  MSE = c(mse_temp, mse_wind),
  AIC = c(aic_temp, aic_wind),
  Adj_R2 = c(adjr2_temp, adjr2_wind)
)

# optionally add a rank per metric (1 = best)
comparison <- comparison %>%
  mutate(rank_MSE = rank(MSE, ties.method = "min"),
         rank_AIC = rank(AIC, ties.method = "min"),
         rank_AdjR2 = rank(-Adj_R2, ties.method = "min")) # higher adjR2 => better

kable(comparison, digits = 4)
model         MSE       AIC       Adj_R2  rank_MSE  rank_AIC  rank_AdjR2
Ozone ~ Temp  552.6715  1067.706  0.4832  1         1         1
Ozone ~ Wind  688.4398  1093.187  0.3563  2         2         2
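
The AIC values in the table can be reproduced from the log-likelihood as -2 * logLik + 2 * k, where k is the parameter count that R attaches to the logLik object (intercept, slope, and residual variance, so 3 for each model here). The sketch below is one way to check this and to express the gap as a difference from the best model; aic_by_hand is a hypothetical helper written for this illustration, and aq_ozone, fit_temp, and fit_wind are illustrative names.

```r
# Self-contained sketch: base R plus the built-in airquality data only.
aq_ozone <- airquality[!is.na(airquality$Ozone), c("Ozone", "Temp", "Wind")]
fit_temp <- lm(Ozone ~ Temp, data = aq_ozone)
fit_wind <- lm(Ozone ~ Wind, data = aq_ozone)

# Hypothetical helper: recompute AIC from the log-likelihood.
aic_by_hand <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")          # parameters counted by R (3 for these models)
  -2 * as.numeric(ll) + 2 * k
}

aic_values <- c(temp = aic_by_hand(fit_temp), wind = aic_by_hand(fit_wind))
delta_aic  <- aic_values - min(aic_values)  # 0 for the best model

aic_values
delta_aic
```

Since both models have the same k, the difference in AIC is simply twice the difference in log-likelihood.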

Insights and My Opinion

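  1. MSE (Mean Squared Error): The model with the smaller MSE gives more accurate in-sample predictions. Ozone ~ Temp has an MSE of 552.67 against 688.44 for Ozone ~ Wind, so Temp is better on this metric.

  2. AIC (Akaike Information Criterion): A smaller AIC indicates a better balance between goodness of fit and model complexity. Ozone ~ Temp scores 1067.71 against 1093.19 for Ozone ~ Wind. Both models have the same number of parameters, so the gap of about 25.5 comes entirely from fit.

  3. Adjusted R-squared: This is the proportion of variation in Ozone explained by the predictor, adjusted for the number of predictors; higher is better. Ozone ~ Temp explains about 48% (0.4832) and Ozone ~ Wind about 36% (0.3563).

  4. Coefficient interpretation: In Ozone ~ Temp the slope is 2.4287, so each additional degree Fahrenheit is associated with about 2.43 ppb more Ozone. In Ozone ~ Wind the slope is -5.5509, so each additional mph of wind is associated with about 5.55 ppb less Ozone. Both slopes are clearly different from zero (|t| > 8).

  5. Practical conclusion: Ozone ~ Temp ranks first on all three metrics, so of these two single-predictor models it is the better choice. It still leaves roughly half of the variation in Ozone unexplained.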
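
For reference, the three metrics discussed above can be written out as follows. This is a sketch of the standard definitions as they apply to these fits, with n = 116 observations, p = 1 predictor, k = 3 parameters as counted by R's logLik (intercept, slope, residual variance), and \hat{L} the maximized likelihood.

```latex
\begin{aligned}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \\
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
R^2_{\mathrm{adj}} &= 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
\end{aligned}
```

As a check against the reported values for Ozone ~ Temp: with R-squared = 0.4877, the adjusted value is 1 - 0.5123 * 115 / 114, which is approximately 0.4832, and with logLik = -530.8532 the AIC is 2 * 3 + 2 * 530.8532, which is approximately 1067.706.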