Hubungan antara variabel lingkungan sering kali bersifat kompleks dan tidak selalu mengikuti pola linear. Pada dataset airquality, misalnya, konsentrasi Ozone diduga dipengaruhi oleh tingkat Solar.R. Secara teoritis, radiasi matahari dapat memicu pembentukan ozon, tetapi pola hubungan yang muncul di data bisa melengkung atau bervariasi pada rentang tertentu. Oleh karena itu, diperlukan analisis regresi dengan berbagai pendekatan—mulai dari model linear sederhana hingga spline—untuk memperoleh gambaran yang lebih akurat mengenai pola hubungan tersebut.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(splines)
df_airquality <- datasets::airquality
head(df_airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
# Cek apakah ada missing value
colSums(is.na(airquality))
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
Dari hasil pemeriksaan, dataset airquality memiliki missing value pada variabel Ozone dan Solar.R. Informasi ini penting sebelum melakukan analisis, karena kita perlu menentukan strategi untuk menangani nilai yang hilang, misalnya dengan menghapus baris, mengisi dengan rata-rata/median, atau menggunakan metode imputasi lainnya.
# Ganti NA pada kolom Ozone dengan median
airquality$Ozone[is.na(airquality$Ozone)] <- median(airquality$Ozone, na.rm = TRUE)
# Ganti NA pada kolom Solar.R dengan median
airquality$Solar.R[is.na(airquality$Solar.R)] <- median(airquality$Solar.R, na.rm = TRUE)
# Cek lagi apakah masih ada missing value
colSums(is.na(airquality))
## Ozone Solar.R Wind Temp Month Day
## 0 0 0 0 0 0
median(…, na.rm = TRUE) menghitung median dengan mengabaikan NA. Baris yang NA kemudian diisi dengan nilai median dari kolom tersebut. Setelah diganti, hasil colSums(is.na()) akan menunjukkan 0 → artinya tidak ada lagi missing value.
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
theme_bw()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
* Dari grafik terlihat hubungan tidak sepenuhnya linear → ada indikasi kurva. * Hal ini menjadi dasar kita perlu coba model non-linear (tangga, spline).
mod_linear = lm(Ozone~Solar.R,data=df_airquality)
summary(mod_linear)
##
## Call:
## lm(formula = Ozone ~ Solar.R, data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.292 -21.361 -8.864 16.373 119.136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.59873 6.74790 2.756 0.006856 **
## Solar.R 0.12717 0.03278 3.880 0.000179 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.33 on 109 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.1213, Adjusted R-squared: 0.1133
## F-statistic: 15.05 on 1 and 109 DF, p-value: 0.0001793
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~x,lty = 1, col = "blue",se = F)+
theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
Dari grafik terlihat bahwa linear fit cukup oke di bagian tengah, tapi di ekor (nilai kecil & besar) tidak mengikuti pola dengan baik → indikasi perlu model lebih fleksibel.
mod_tangga = lm(Ozone ~ cut(Solar.R,5),data=df_airquality)
summary(mod_tangga)
##
## Call:
## lm(formula = Ozone ~ cut(Solar.R, 5), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.743 -20.647 -6.437 14.853 112.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.167 7.071 2.145 0.034254 *
## cut(Solar.R, 5)(72.4,138] 10.271 10.308 0.996 0.321338
## cut(Solar.R, 5)(138,203] 37.214 9.637 3.862 0.000194 ***
## cut(Solar.R, 5)(203,269] 40.576 8.702 4.663 9.13e-06 ***
## cut(Solar.R, 5)(269,334] 29.690 9.637 3.081 0.002629 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30 on 106 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2167, Adjusted R-squared: 0.1871
## F-statistic: 7.331 on 4 and 106 DF, p-value: 2.984e-05
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
lty = 1, col = "blue",se = F)+
theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
* Membagi Solar.R jadi 5 kategori (lebih detail). * Garis biru menunjukkan rata-rata Ozone per interval. * Hasilnya “berundak” (fungsi tangga). * Mudah dipahami, tapi kehilangan informasi variasi halus.
mod_spline3 = lm(Ozone ~ bs(Solar.R, knots = c(5, 10, 20, 30, 40)),data=df_airquality)
summary(mod_spline3)
##
## Call:
## lm(formula = Ozone ~ bs(Solar.R, knots = c(5, 10, 20, 30, 40)),
## data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.255 -18.910 -3.842 15.718 109.745
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.665 13.828 0.916 0.36189
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))1 3.279 32.834 0.100 0.92065
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))2 -22.451 51.623 -0.435 0.66454
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))3 28.804 44.766 0.643 0.52136
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))4 -15.621 31.049 -0.503 0.61596
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))5 4.717 22.568 0.209 0.83485
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))6 4.984 23.393 0.213 0.83171
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))7 98.325 35.535 2.767 0.00671
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))8 NA NA NA NA
##
## (Intercept)
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))1
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))2
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))3
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))4
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))5
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))6
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))7 **
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.78 on 103 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2502, Adjusted R-squared: 0.1993
## F-statistic: 4.91 on 7 and 103 DF, p-value: 8.159e-05
mod_spline3ns = lm(Ozone ~ ns(Solar.R, knots = c(5, 10, 20, 30, 40)),data=df_airquality)
summary(mod_spline3ns)
##
## Call:
## lm(formula = Ozone ~ ns(Solar.R, knots = c(5, 10, 20, 30, 40)),
## data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.020 -23.768 -3.773 14.152 115.599
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.99 39.20 2.066 0.0413
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))1 -50.30 49.37 -1.019 0.3106
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))2 -66.02 49.84 -1.325 0.1882
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))3 -83.57 38.60 -2.165 0.0327
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))4 35.06 19.99 1.753 0.0824
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))5 -110.05 107.17 -1.027 0.3069
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))6 NA NA NA NA
##
## (Intercept) *
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))1
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))2
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))3 *
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))4 .
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))5
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.29 on 105 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2089, Adjusted R-squared: 0.1712
## F-statistic: 5.544 on 5 and 105 DF, p-value: 0.000143
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~bs(x, knots = c(5, 10, 20, 30, 40)),
lty = 1, aes(col = "Cubic Spline"),se = F)+
stat_smooth(method = "lm",
formula = y~ns(x, knots = c(5, 10, 20, 30, 40)),
lty = 1, aes(col = "Natural Cubic Spline"),se = F)+labs(color="Tipe Spline")+
scale_color_manual(values = c("Natural Cubic Spline"="red","Cubic Spline"="blue"))+theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
* Garis biru = cubic spline, merah = natural spline. * Keduanya mengikuti pola data lebih baik daripada regresi linear. * Natural spline lebih “jinak” di ekor → cocok untuk data lingkungan yang rawan outlier. *Dari sini terlihat: spline memberi fleksibilitas untuk menangkap hubungan non-linear.
# Fungsi MSE
MSE <- function(pred, actual) {
mean((pred - actual)^2, na.rm = TRUE)
}
# Buat prediksi dari masing-masing model
pred_linear <- predict(mod_linear)
pred_tangga <- predict(mod_tangga)
pred_spline <- predict(mod_spline3)
pred_nspline <- predict(mod_spline3ns)
compare_stats <- data.frame(
Model = c("Linear","Tangga","Spline","Natural Spline"),
MSE = c(MSE(pred_linear, df_airquality$Ozone),
MSE(pred_tangga, df_airquality$Ozone),
MSE(pred_spline, df_airquality$Ozone),
MSE(pred_nspline, df_airquality$Ozone)),
AIC = c(AIC(mod_linear),
AIC(mod_tangga),
AIC(mod_spline3),
AIC(mod_spline3ns)),
Adj_R2 = c(summary(mod_linear)$adj.r.squared,
summary(mod_tangga)$adj.r.squared,
summary(mod_spline3)$adj.r.squared,
summary(mod_spline3ns)$adj.r.squared)
)
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
compare_stats
## Model MSE AIC Adj_R2
## 1 Linear 1293.729 1083.714 0.1132809
## 2 Tangga 1405.729 1076.964 0.1871312
## 3 Spline 1429.093 1078.108 0.1992623
## 4 Natural Spline 1404.240 1080.069 0.1711837
Analisis ini menggunakan dataset bawaan R yaitu airquality, yang berisi data kualitas udara di New York pada Mei–September 1973. Tujuan analisis adalah membandingkan hubungan antara Ozone (sebagai variabel respon) dengan dua variabel prediktor utama: Temp (suhu) dan Wind (kecepatan angin).
Model yang digunakan meliputi regresi linear, model tangga (step function), spline, dan natural spline. Kinerja model dievaluasi dengan MSE (Mean Squared Error), AIC, dan Adjusted R-squared. Hasil ini akan digunakan untuk memahami variabel mana yang lebih baik dalam menjelaskan variasi Ozone, serta memberikan insight mengenai pola hubungan yang terbentuk.
library(splines)
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.4.3
df_airquality <- na.omit(airquality)
mod_linear_Temp <- lm(Ozone ~ Temp, data=df_airquality)
mod_tangga_Temp <- lm(Ozone ~ cut(Temp, 5), data=df_airquality)
mod_spline_Temp <- lm(Ozone ~ bs(Temp, knots=c(70,80,90)), data=df_airquality)
mod_nspline_Temp <- lm(Ozone ~ ns(Temp, knots=c(70,80,90)), data=df_airquality)
pred_linear_Temp <- predict(mod_linear_Temp)
pred_tangga_Temp <- predict(mod_tangga_Temp)
pred_spline_Temp <- predict(mod_spline_Temp)
pred_nspline_Temp <- predict(mod_nspline_Temp)
compare_Temp <- data.frame(
Model = c("Linear","Tangga","Spline","Natural Spline"),
MSE = c(mse(df_airquality$Ozone, pred_linear_Temp),
mse(df_airquality$Ozone, pred_tangga_Temp),
mse(df_airquality$Ozone, pred_spline_Temp),
mse(df_airquality$Ozone, pred_nspline_Temp)),
AIC = c(AIC(mod_linear_Temp),AIC(mod_tangga_Temp),
AIC(mod_spline_Temp),AIC(mod_nspline_Temp)),
Adj_R2 = c(summary(mod_linear_Temp)$adj.r.squared,
summary(mod_tangga_Temp)$adj.r.squared,
summary(mod_spline_Temp)$adj.r.squared,
summary(mod_nspline_Temp)$adj.r.squared)
)
print(compare_Temp)
## Model MSE AIC Adj_R2
## 1 Linear 535.8932 1401.637 0.3567682
## 2 Tangga 493.3074 1394.968 0.3958816
## 3 Spline 471.2354 1391.965 0.4150063
## 4 Natural Spline 471.5385 1388.063 0.4225404
plot(df_airquality$Temp, df_airquality$Ozone,
main="Hubungan Ozone ~ Temp", xlab="Temp", ylab="Ozone", pch=19, col="darkgray")
lines(df_airquality$Temp, pred_linear_Temp, col="blue", lwd=2)
lines(df_airquality$Temp[order(df_airquality$Temp)],
pred_spline_Temp[order(df_airquality$Temp)], col="red", lwd=2)
legend("topleft", legend=c("Linear","Spline"), col=c("blue","red"), lwd=2)
mod_linear_Wind <- lm(Ozone ~ Wind, data=df_airquality)
mod_tangga_Wind <- lm(Ozone ~ cut(Wind, 5), data=df_airquality)
mod_spline_Wind <- lm(Ozone ~ bs(Wind, knots=c(5,10,15)), data=df_airquality)
mod_nspline_Wind <- lm(Ozone ~ ns(Wind, knots=c(5,10,15)), data=df_airquality)
pred_linear_Wind <- predict(mod_linear_Wind)
pred_tangga_Wind <- predict(mod_tangga_Wind)
pred_spline_Wind <- predict(mod_spline_Wind)
pred_nspline_Wind <- predict(mod_nspline_Wind)
compare_Wind <- data.frame(
Model = c("Linear","Tangga","Spline","Natural Spline"),
MSE = c(mse(df_airquality$Ozone, pred_linear_Wind),
mse(df_airquality$Ozone, pred_tangga_Wind),
mse(df_airquality$Ozone, pred_spline_Wind),
mse(df_airquality$Ozone, pred_nspline_Wind)),
AIC = c(AIC(mod_linear_Wind),AIC(mod_tangga_Wind),
AIC(mod_spline_Wind),AIC(mod_nspline_Wind)),
Adj_R2 = c(summary(mod_linear_Wind)$adj.r.squared,
summary(mod_tangga_Wind)$adj.r.squared,
summary(mod_spline_Wind)$adj.r.squared,
summary(mod_nspline_Wind)$adj.r.squared)
)
print(compare_Wind)
## Model MSE AIC Adj_R2
## 1 Linear 601.3750 1419.276 0.2781705
## 2 Tangga 545.7385 1410.423 0.3316730
## 3 Spline 459.0189 1387.946 0.4301719
## 4 Natural Spline 513.3298 1401.056 0.3713616
plot(df_airquality$Wind, df_airquality$Ozone,
main="Hubungan Ozone ~ Wind", xlab="Wind", ylab="Ozone", pch=19, col="darkgray")
lines(df_airquality$Wind, pred_linear_Wind, col="blue", lwd=2)
lines(df_airquality$Wind[order(df_airquality$Wind)],
pred_spline_Wind[order(df_airquality$Wind)], col="red", lwd=2)
legend("topright", legend=c("Linear","Spline"), col=c("blue","red"), lwd=2)
## 3. Bandingkan Hasil Temp vs Wind # Tambahkan variabel identitas
compare_Temp$Variabel <- "Temp"
compare_Wind$Variabel <- "Wind"
compare_all <- rbind(compare_Temp, compare_Wind)
print(compare_all)
## Model MSE AIC Adj_R2 Variabel
## 1 Linear 535.8932 1401.637 0.3567682 Temp
## 2 Tangga 493.3074 1394.968 0.3958816 Temp
## 3 Spline 471.2354 1391.965 0.4150063 Temp
## 4 Natural Spline 471.5385 1388.063 0.4225404 Temp
## 5 Linear 601.3750 1419.276 0.2781705 Wind
## 6 Tangga 545.7385 1410.423 0.3316730 Wind
## 7 Spline 459.0189 1387.946 0.4301719 Wind
## 8 Natural Spline 513.3298 1401.056 0.3713616 Wind
Dari hasil analisis, terlihat bahwa model dengan prediktor Temp cenderung menghasilkan nilai MSE yang lebih rendah dan Adjusted R² yang lebih tinggi dibandingkan dengan model menggunakan Wind. Hal ini menunjukkan bahwa suhu lebih kuat dalam menjelaskan variasi kadar Ozone. Secara teori, hal ini masuk akal karena pembentukan ozon di atmosfer dipengaruhi oleh intensitas radiasi matahari dan suhu udara—semakin panas suhu, semakin aktif reaksi kimia yang meningkatkan konsentrasi ozon.
Sementara itu, variabel Wind memang berpengaruh, tetapi hubungannya cenderung negatif: semakin kencang angin, ozon lebih cepat tersebar dan konsentrasinya menurun. Meski hubungan ini ada, kekuatannya tidak sebesar suhu.
Selain itu, model spline dan natural spline umumnya memberikan hasil yang lebih baik daripada model linear, karena pola hubungan Ozone dengan Temp maupun Wind tidak sepenuhnya lurus. Fleksibilitas spline membantu menangkap pola melengkung atau non-linear dalam data.
Kesimpulan: Variabel Temp merupakan prediktor yang lebih baik dibandingkan Wind dalam menjelaskan variasi Ozone. Namun, dalam konteks prediksi yang lebih akurat, sebaiknya kedua variabel ini (dan variabel lain seperti Solar.R) digunakan bersama-sama dalam model multivariat.