Hubungan antara variabel lingkungan sering kali bersifat kompleks dan tidak selalu mengikuti pola linear. Pada dataset airquality, misalnya, konsentrasi Ozone diduga dipengaruhi oleh tingkat Solar.R. Secara teoritis, radiasi matahari dapat memicu pembentukan ozon, tetapi pola hubungan yang muncul di data bisa melengkung atau bervariasi pada rentang tertentu. Oleh karena itu, diperlukan analisis regresi dengan berbagai pendekatan—mulai dari model linear sederhana hingga spline—untuk memperoleh gambaran yang lebih akurat mengenai pola hubungan tersebut.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(splines)
df_airquality <- datasets::airquality
head(df_airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
# Cek apakah ada missing value
colSums(is.na(airquality))
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
Dari hasil pemeriksaan, dataset airquality memiliki missing value pada variabel Ozone dan Solar.R. Informasi ini penting sebelum melakukan analisis, karena kita perlu menentukan strategi untuk menangani nilai yang hilang, misalnya dengan menghapus baris, mengisi dengan rata-rata/median, atau menggunakan metode imputasi lainnya.
# Ganti NA pada kolom Ozone dengan median
airquality$Ozone[is.na(airquality$Ozone)] <- median(airquality$Ozone, na.rm = TRUE)
# Ganti NA pada kolom Solar.R dengan median
airquality$Solar.R[is.na(airquality$Solar.R)] <- median(airquality$Solar.R, na.rm = TRUE)
# Cek lagi apakah masih ada missing value
colSums(is.na(airquality))
## Ozone Solar.R Wind Temp Month Day
## 0 0 0 0 0 0
median(…, na.rm = TRUE) menghitung median dengan mengabaikan NA. Baris yang NA kemudian diisi dengan nilai median dari kolom tersebut. Setelah diganti, hasil colSums(is.na()) akan menunjukkan 0 → artinya tidak ada lagi missing value.
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
theme_bw()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
Dari grafik terlihat hubungan tidak sepenuhnya linear → ada indikasi
kurva. Hal ini menjadi dasar kita perlu coba model non-linear (tangga,
spline).
mod_linear = lm(Ozone~Solar.R,data=df_airquality)
summary(mod_linear)
##
## Call:
## lm(formula = Ozone ~ Solar.R, data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.292 -21.361 -8.864 16.373 119.136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.59873 6.74790 2.756 0.006856 **
## Solar.R 0.12717 0.03278 3.880 0.000179 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.33 on 109 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.1213, Adjusted R-squared: 0.1133
## F-statistic: 15.05 on 1 and 109 DF, p-value: 0.0001793
-Model regresi linear sederhana: Ozone = β0 + β1*Solar.R. -summary() menampilkan koefisien, nilai p, R², dll. -Interpretasi: jika β1 signifikan, maka ada hubungan linear antara Solar.R dan Ozone. -Kelemahan: hubungan bisa saja non-linear → model ini bisa terlalu kaku.
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~x,lty = 1, col = "blue",se = F)+
theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
Dari grafik terlihat bahwa linear fit cukup oke di bagian tengah, tapi
di ekor (nilai kecil & besar) tidak mengikuti pola dengan baik →
indikasi perlu model lebih fleksibel.
mod_tangga = lm(Ozone ~ cut(Solar.R,5),data=df_airquality)
summary(mod_tangga)
##
## Call:
## lm(formula = Ozone ~ cut(Solar.R, 5), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.743 -20.647 -6.437 14.853 112.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.167 7.071 2.145 0.034254 *
## cut(Solar.R, 5)(72.4,138] 10.271 10.308 0.996 0.321338
## cut(Solar.R, 5)(138,203] 37.214 9.637 3.862 0.000194 ***
## cut(Solar.R, 5)(203,269] 40.576 8.702 4.663 9.13e-06 ***
## cut(Solar.R, 5)(269,334] 29.690 9.637 3.081 0.002629 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30 on 106 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2167, Adjusted R-squared: 0.1871
## F-statistic: 7.331 on 4 and 106 DF, p-value: 2.984e-05
-cut(Solar.R, 3) membagi Solar.R ke dalam 3 interval sama lebar. -Artinya model mengestimasi rata-rata Ozone berbeda-beda untuk tiap interval Solar.R. -Cocok kalau kita hanya ingin melihat perbedaan rata-rata antar kelompok. -Kelemahan: tidak halus (stepwise), prediksi bisa “loncat-loncat”.
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
lty = 1, col = "blue",se = F)+
theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
-Membagi Solar.R jadi 5 kategori (lebih detail). -Garis biru menunjukkan rata-rata Ozone per interval. -Hasilnya “berundak” (fungsi tangga). -Mudah dipahami, tapi kehilangan informasi variasi halus.
mod_spline3 = lm(Ozone ~ bs(Solar.R, knots = c(5, 10, 20, 30, 40)),data=df_airquality)
summary(mod_spline3)
##
## Call:
## lm(formula = Ozone ~ bs(Solar.R, knots = c(5, 10, 20, 30, 40)),
## data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.255 -18.910 -3.842 15.718 109.745
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.665 13.828 0.916 0.36189
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))1 3.279 32.834 0.100 0.92065
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))2 -22.451 51.623 -0.435 0.66454
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))3 28.804 44.766 0.643 0.52136
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))4 -15.621 31.049 -0.503 0.61596
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))5 4.717 22.568 0.209 0.83485
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))6 4.984 23.393 0.213 0.83171
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))7 98.325 35.535 2.767 0.00671
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))8 NA NA NA NA
##
## (Intercept)
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))1
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))2
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))3
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))4
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))5
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))6
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))7 **
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.78 on 103 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2502, Adjusted R-squared: 0.1993
## F-statistic: 4.91 on 7 and 103 DF, p-value: 8.159e-05
mod_spline3ns = lm(Ozone ~ ns(Solar.R, knots = c(5, 10, 20, 30, 40)),data=df_airquality)
summary(mod_spline3ns)
##
## Call:
## lm(formula = Ozone ~ ns(Solar.R, knots = c(5, 10, 20, 30, 40)),
## data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.020 -23.768 -3.773 14.152 115.599
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.99 39.20 2.066 0.0413
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))1 -50.30 49.37 -1.019 0.3106
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))2 -66.02 49.84 -1.325 0.1882
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))3 -83.57 38.60 -2.165 0.0327
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))4 35.06 19.99 1.753 0.0824
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))5 -110.05 107.17 -1.027 0.3069
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))6 NA NA NA NA
##
## (Intercept) *
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))1
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))2
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))3 *
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))4 .
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))5
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.29 on 105 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2089, Adjusted R-squared: 0.1712
## F-statistic: 5.544 on 5 and 105 DF, p-value: 0.000143
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~bs(x, knots = c(5, 10, 20, 30, 40)),
lty = 1, aes(col = "Cubic Spline"),se = F)+
stat_smooth(method = "lm",
formula = y~ns(x, knots = c(5, 10, 20, 30, 40)),
lty = 1, aes(col = "Natural Cubic Spline"),se = F)+labs(color="Tipe Spline")+
scale_color_manual(values = c("Natural Cubic Spline"="red","Cubic Spline"="blue"))+theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Fungsi MSE
MSE <- function(pred, actual) {
mean((pred - actual)^2, na.rm = TRUE)
}
# Buat prediksi dari masing-masing model
pred_linear <- predict(mod_linear)
pred_tangga <- predict(mod_tangga)
pred_spline <- predict(mod_spline3)
pred_nspline <- predict(mod_spline3ns)
compare_stats <- data.frame(
Model = c("Linear","Tangga","Spline","Natural Spline"),
MSE = c(MSE(pred_linear, df_airquality$Ozone),
MSE(pred_tangga, df_airquality$Ozone),
MSE(pred_spline, df_airquality$Ozone),
MSE(pred_nspline, df_airquality$Ozone)),
AIC = c(AIC(mod_linear),
AIC(mod_tangga),
AIC(mod_spline3),
AIC(mod_spline3ns)),
Adj_R2 = c(summary(mod_linear)$adj.r.squared,
summary(mod_tangga)$adj.r.squared,
summary(mod_spline3)$adj.r.squared,
summary(mod_spline3ns)$adj.r.squared)
)
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
compare_stats
## Model MSE AIC Adj_R2
## 1 Linear 1293.729 1083.714 0.1132809
## 2 Tangga 1405.729 1076.964 0.1871312
## 3 Spline 1429.093 1078.108 0.1992623
## 4 Natural Spline 1404.240 1080.069 0.1711837
Dalam analisis ini, kita akan memperluas analisis yang telah dilakukan sebelumnya dengan mengeksplorasi hubungan antara konsentrasi Ozone dengan dua variabel lingkungan lainnya: Temperature (Temp) dan Wind Speed (Wind). Seperti pada analisis sebelumnya dengan Solar.R, kita akan menggunakan berbagai pendekatan regresi untuk memahami pola hubungan yang mungkin tidak linear.
library(tidyverse)
library(splines)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Load dan persiapkan data
df_airquality <- datasets::airquality
# Cek missing values
colSums(is.na(df_airquality))
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
# Ganti NA dengan median
df_airquality$Ozone[is.na(df_airquality$Ozone)] <- median(df_airquality$Ozone, na.rm = TRUE)
df_airquality$Solar.R[is.na(df_airquality$Solar.R)] <- median(df_airquality$Solar.R, na.rm = TRUE)
# Cek lagi apakah masih ada missing value
colSums(is.na(df_airquality))
## Ozone Solar.R Wind Temp Month Day
## 0 0 0 0 0 0
# Visualisasi scatter plot
p1 <- ggplot(df_airquality, aes(x=Temp, y=Ozone)) +
geom_point(alpha=0.6, color="pink") +
labs(title = "Hubungan Ozone vs Temperature",
x = "Temperature (°F)",
y = "Ozone (ppb)") +
theme_bw()
print(p1)
- Scatter plot menunjukkan hubungan positif antara Temperature (°F)
dengan Ozone (ppb).
Artinya, ketika suhu meningkat, konsentrasi ozon cenderung meningkat juga.
Titik-titik agak menyebar, tapi pola naik cukup terlihat → ada indikasi hubungan linier positif.
mod_linear_temp <- lm(Ozone ~ Temp, data = df_airquality)
summary(mod_linear_temp)
##
## Call:
## lm(formula = Ozone ~ Temp, data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.464 -16.399 -3.866 10.314 122.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -104.0802 15.6655 -6.644 5.21e-10 ***
## Temp 1.8443 0.1997 9.236 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.3 on 151 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3568
## F-statistic: 85.31 on 1 and 151 DF, p-value: < 2.2e-16
p2 <- ggplot(df_airquality, aes(x=Temp, y=Ozone)) +
geom_point(alpha=0.6, color="pink") +
stat_smooth(method = "lm", formula = y~x, lty = 1, col = "black", se = F) +
labs(title = "Regresi Linear: Ozone vs Temperature",
x = "Temperature (°F)",
y = "Ozone (ppb)") +
theme_bw()
print(p2)
mod_tangga_temp <- lm(Ozone ~ cut(Temp, 5), data = df_airquality)
summary(mod_tangga_temp)
##
## Call:
## lm(formula = Ozone ~ cut(Temp, 5), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.342 -14.000 -1.011 8.500 118.878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.750 5.646 3.144 0.00201 **
## cut(Temp, 5)(64.2,72.4] 5.250 7.352 0.714 0.47627
## cut(Temp, 5)(72.4,80.6] 11.261 6.554 1.718 0.08787 .
## cut(Temp, 5)(80.6,88.8] 31.372 6.502 4.825 3.45e-06 ***
## cut(Temp, 5)(88.8,97] 61.092 7.663 7.973 3.86e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.58 on 148 degrees of freedom
## Multiple R-squared: 0.4118, Adjusted R-squared: 0.3959
## F-statistic: 25.9 on 4 and 148 DF, p-value: 2.779e-16
p3 <- ggplot(df_airquality, aes(x=Temp, y=Ozone)) +
geom_point(alpha=0.6, color="pink") +
stat_smooth(method = "lm", formula = y~cut(x,5), lty = 1, col = "skyblue", se = F) +
labs(title = "Regresi Fungsi Tangga: Ozone vs Temperature",
x = "Temperature (°F)",
y = "Ozone (ppb)") +
theme_bw()
print(p3)
# Tentukan knots berdasarkan kuantil
temp_knots <- quantile(df_airquality$Temp, c(0.2, 0.4, 0.6, 0.8), na.rm = TRUE)
print(paste("Temperature knots:", paste(temp_knots, collapse = ", ")))
## [1] "Temperature knots: 69, 76.8, 81, 86"
mod_spline_temp <- lm(Ozone ~ bs(Temp, knots = temp_knots), data = df_airquality)
summary(mod_spline_temp)
##
## Call:
## lm(formula = Ozone ~ bs(Temp, knots = temp_knots), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.518 -12.001 -2.314 8.704 126.644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.8677 13.3473 2.013 0.04597 *
## bs(Temp, knots = temp_knots)1 -18.4375 28.9008 -0.638 0.52451
## bs(Temp, knots = temp_knots)2 -0.7273 16.4801 -0.044 0.96486
## bs(Temp, knots = temp_knots)3 -6.2867 17.1459 -0.367 0.71441
## bs(Temp, knots = temp_knots)4 16.0112 14.5352 1.102 0.27249
## bs(Temp, knots = temp_knots)5 37.5144 19.5661 1.917 0.05716 .
## bs(Temp, knots = temp_knots)6 59.2630 21.5667 2.748 0.00676 **
## bs(Temp, knots = temp_knots)7 51.4744 22.4573 2.292 0.02334 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.27 on 145 degrees of freedom
## Multiple R-squared: 0.4394, Adjusted R-squared: 0.4123
## F-statistic: 16.23 on 7 and 145 DF, p-value: 1.151e-15
mod_nspline_temp <- lm(Ozone ~ ns(Temp, knots = temp_knots), data = df_airquality)
summary(mod_nspline_temp)
##
## Call:
## lm(formula = Ozone ~ ns(Temp, knots = temp_knots), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.962 -11.735 -1.614 10.270 127.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.977 9.021 2.325 0.0214 *
## ns(Temp, knots = temp_knots)1 2.739 10.053 0.272 0.7857
## ns(Temp, knots = temp_knots)2 19.562 11.629 1.682 0.0947 .
## ns(Temp, knots = temp_knots)3 49.792 10.672 4.666 6.86e-06 ***
## ns(Temp, knots = temp_knots)4 54.625 22.918 2.383 0.0184 *
## ns(Temp, knots = temp_knots)5 64.937 12.423 5.227 5.80e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.16 on 147 degrees of freedom
## Multiple R-squared: 0.4372, Adjusted R-squared: 0.418
## F-statistic: 22.84 on 5 and 147 DF, p-value: < 2.2e-16
p4 <- ggplot(df_airquality, aes(x=Temp, y=Ozone)) +
geom_point(alpha=0.6, color="pink") +
stat_smooth(method = "lm", formula = y~bs(x, knots = temp_knots),
lty = 1, aes(col = "Cubic Spline"), se = F) +
stat_smooth(method = "lm", formula = y~ns(x, knots = temp_knots),
lty = 1, aes(col = "Natural Cubic Spline"), se = F) +
labs(title = "Regresi Spline: Ozone vs Temperature",
x = "Temperature (°F)",
y = "Ozone (ppb)",
color = "Tipe Spline") +
scale_color_manual(values = c("Natural Cubic Spline"="brown", "Cubic Spline"="skyblue")) +
theme_bw()
print(p4)
Analisis Hubungan Ozone vs Wind Speed
# Visualisasi scatter plot
p5 <- ggplot(df_airquality, aes(x=Wind, y=Ozone)) +
geom_point(alpha=0.6, color="darkred") +
labs(title = "Hubungan Ozone vs Wind Speed",
x = "Wind Speed (mph)",
y = "Ozone (ppb)") +
theme_bw()
print(p5)
Terdapat hubungan negatif antara kecepatan angin dan konsentrasi ozon.
Semakin cepat angin bertiup, semakin rendah konsentrasi ozon di
udara.
mod_linear_wind <- lm(Ozone ~ Wind, data = df_airquality)
summary(mod_linear_wind)
##
## Call:
## lm(formula = Ozone ~ Wind, data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.971 -15.967 -3.924 12.933 99.676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.2388 6.0007 13.872 < 2e-16 ***
## Wind -4.3866 0.5683 -7.719 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.68 on 151 degrees of freedom
## Multiple R-squared: 0.2829, Adjusted R-squared: 0.2782
## F-statistic: 59.58 on 1 and 151 DF, p-value: 1.494e-12
p6 <- ggplot(df_airquality, aes(x=Wind, y=Ozone)) +
geom_point(alpha=0.6, color="pink") +
stat_smooth(method = "lm", formula = y~x, lty = 1, col = "darkred", se = F) +
labs(title = "Regresi Linear: Ozone vs Wind Speed",
x = "Wind Speed (mph)",
y = "Ozone (ppb)") +
theme_bw()
print(p6)
# Model Regresi Fungsi Tangga
mod_tangga_wind <- lm(Ozone ~ cut(Wind, 5), data = df_airquality)
summary(mod_tangga_wind)
##
## Call:
## lm(formula = Ozone ~ cut(Wind, 5), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.731 -14.807 -4.952 5.696 80.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.231 6.588 13.241 < 2e-16 ***
## cut(Wind, 5)(5.5,9.3] -40.924 7.300 -5.606 9.86e-08 ***
## cut(Wind, 5)(9.3,13.1] -58.279 7.365 -7.913 5.43e-13 ***
## cut(Wind, 5)(13.1,16.9] -61.427 7.972 -7.706 1.74e-12 ***
## cut(Wind, 5)(16.9,20.7] -70.231 15.214 -4.616 8.41e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.75 on 148 degrees of freedom
## Multiple R-squared: 0.3493, Adjusted R-squared: 0.3317
## F-statistic: 19.86 on 4 and 148 DF, p-value: 4.178e-13
p7 <- ggplot(df_airquality, aes(x=Wind, y=Ozone)) +
geom_point(alpha=0.6, color="purple") +
stat_smooth(method = "lm", formula = y~cut(x,5), lty = 1, col = "brown", se = F) +
labs(title = "Regresi Fungsi Tangga: Ozone vs Wind Speed",
x = "Wind Speed (mph)",
y = "Ozone (ppb)") +
theme_bw()
print(p7)
Model Regresi Spline
# Tentukan knots berdasarkan kuantil
wind_knots <- quantile(df_airquality$Wind, c(0.2, 0.4, 0.6, 0.8), na.rm = TRUE)
print(paste("Wind knots:", paste(wind_knots, collapse = ", ")))
## [1] "Wind knots: 6.9, 8.6, 10.42, 12.96"
mod_spline_wind <- lm(Ozone ~ bs(Wind, knots = wind_knots), data = df_airquality)
summary(mod_spline_wind)
##
## Call:
## lm(formula = Ozone ~ bs(Wind, knots = wind_knots), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.437 -13.683 -4.602 6.600 68.706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.28 18.42 3.055 0.00268 **
## bs(Wind, knots = wind_knots)1 107.29 34.96 3.069 0.00256 **
## bs(Wind, knots = wind_knots)2 -18.89 20.29 -0.931 0.35343
## bs(Wind, knots = wind_knots)3 -13.13 20.99 -0.626 0.53254
## bs(Wind, knots = wind_knots)4 -34.75 19.38 -1.794 0.07497 .
## bs(Wind, knots = wind_knots)5 -14.69 26.35 -0.558 0.57795
## bs(Wind, knots = wind_knots)6 -57.22 33.49 -1.709 0.08962 .
## bs(Wind, knots = wind_knots)7 -30.79 25.89 -1.189 0.23621
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.26 on 145 degrees of freedom
## Multiple R-squared: 0.4398, Adjusted R-squared: 0.4128
## F-statistic: 16.26 on 7 and 145 DF, p-value: 1.089e-15
mod_nspline_wind <- lm(Ozone ~ ns(Wind, knots = wind_knots), data = df_airquality)
summary(mod_nspline_wind)
##
## Call:
## lm(formula = Ozone ~ ns(Wind, knots = wind_knots), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.130 -14.039 -3.006 6.463 78.996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.63 13.94 7.147 3.84e-11 ***
## ns(Wind, knots = wind_knots)1 -70.15 13.72 -5.114 9.69e-07 ***
## ns(Wind, knots = wind_knots)2 -69.71 16.00 -4.357 2.46e-05 ***
## ns(Wind, knots = wind_knots)3 -65.42 12.05 -5.428 2.30e-07 ***
## ns(Wind, knots = wind_knots)4 -99.28 32.44 -3.060 0.00263 **
## ns(Wind, knots = wind_knots)5 -68.17 16.39 -4.159 5.40e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.03 on 147 degrees of freedom
## Multiple R-squared: 0.3922, Adjusted R-squared: 0.3715
## F-statistic: 18.97 on 5 and 147 DF, p-value: 1.58e-14
p8 <- ggplot(df_airquality, aes(x=Wind, y=Ozone)) +
geom_point(alpha=0.6, color="pink") +
stat_smooth(method = "lm", formula = y~bs(x, knots = wind_knots),
lty = 1, aes(col = "Cubic Spline"), se = F) +
stat_smooth(method = "lm", formula = y~ns(x, knots = wind_knots),
lty = 1, aes(col = "Natural Cubic Spline"), se = F) +
labs(title = "Regresi Spline: Ozone vs Wind Speed",
x = "Wind Speed (mph)",
y = "Ozone (ppb)",
color = "Tipe Spline") +
scale_color_manual(values = c("Natural Cubic Spline"="darkred", "Cubic Spline"="blue")) +
theme_bw()
print(p8)
# 3. Perbandingan Model # Fungsi untuk Menghitung Metrik
# Fungsi MSE
MSE <- function(pred, actual) {
mean((pred - actual)^2, na.rm = TRUE)
}
# Fungsi untuk membuat tabel perbandingan
create_comparison_table <- function(linear_mod, tangga_mod, spline_mod, nspline_mod, actual_values, variable_name) {
pred_linear <- predict(linear_mod)
pred_tangga <- predict(tangga_mod)
pred_spline <- predict(spline_mod)
pred_nspline <- predict(nspline_mod)
comparison <- data.frame(
Variabel = rep(variable_name, 4),
Model = c("Linear", "Tangga", "Spline", "Natural Spline"),
MSE = c(MSE(pred_linear, actual_values),
MSE(pred_tangga, actual_values),
MSE(pred_spline, actual_values),
MSE(pred_nspline, actual_values)),
AIC = c(AIC(linear_mod),
AIC(tangga_mod),
AIC(spline_mod),
AIC(nspline_mod)),
Adj_R2 = c(summary(linear_mod)$adj.r.squared,
summary(tangga_mod)$adj.r.squared,
summary(spline_mod)$adj.r.squared,
summary(nspline_mod)$adj.r.squared)
)
return(comparison)
}
comparison_temp <- create_comparison_table(
mod_linear_temp, mod_tangga_temp, mod_spline_temp, mod_nspline_temp,
df_airquality$Ozone, "Temperature"
)
print("=== Perbandingan Model: Ozone vs Temperature ===")
## [1] "=== Perbandingan Model: Ozone vs Temperature ==="
print(comparison_temp)
## Variabel Model MSE AIC Adj_R2
## 1 Temperature Linear 535.8932 1401.637 0.3567682
## 2 Temperature Tangga 493.3074 1394.968 0.3958816
## 3 Temperature Spline 470.1846 1393.623 0.4122853
## 4 Temperature Natural Spline 472.0010 1390.213 0.4180419
comparison_wind <- create_comparison_table(
mod_linear_wind, mod_tangga_wind, mod_spline_wind, mod_nspline_wind,
df_airquality$Ozone, "Wind Speed"
)
print("=== Perbandingan Model: Ozone vs Wind Speed ===")
## [1] "=== Perbandingan Model: Ozone vs Wind Speed ==="
print(comparison_wind)
## Variabel Model MSE AIC Adj_R2
## 1 Wind Speed Linear 601.3750 1419.276 0.2781705
## 2 Wind Speed Tangga 545.7385 1410.423 0.3316730
## 3 Wind Speed Spline 469.8080 1393.501 0.4127561
## 4 Wind Speed Natural Spline 509.7543 1401.986 0.3714935
# Gabungkan hasil perbandingan
comparison_all <- rbind(comparison_temp, comparison_wind)
# Format untuk presentasi yang lebih baik
comparison_all$MSE <- round(comparison_all$MSE, 2)
comparison_all$AIC <- round(comparison_all$AIC, 2)
comparison_all$Adj_R2 <- round(comparison_all$Adj_R2, 3)
print("=== Perbandingan Lengkap Semua Model ===")
## [1] "=== Perbandingan Lengkap Semua Model ==="
print(comparison_all)
## Variabel Model MSE AIC Adj_R2
## 1 Temperature Linear 535.89 1401.64 0.357
## 2 Temperature Tangga 493.31 1394.97 0.396
## 3 Temperature Spline 470.18 1393.62 0.412
## 4 Temperature Natural Spline 472.00 1390.21 0.418
## 5 Wind Speed Linear 601.38 1419.28 0.278
## 6 Wind Speed Tangga 545.74 1410.42 0.332
## 7 Wind Speed Spline 469.81 1393.50 0.413
## 8 Wind Speed Natural Spline 509.75 1401.99 0.371
# Visualisasi MSE
p_mse <- ggplot(comparison_all, aes(x = Model, y = MSE, fill = Variabel)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
labs(title = "Perbandingan MSE Antar Model",
x = "Model", y = "MSE") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Visualisasi AIC
p_aic <- ggplot(comparison_all, aes(x = Model, y = AIC, fill = Variabel)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
labs(title = "Perbandingan AIC Antar Model",
x = "Model", y = "AIC") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Visualisasi Adjusted R-squared
p_r2 <- ggplot(comparison_all, aes(x = Model, y = Adj_R2, fill = Variabel)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
labs(title = "Perbandingan Adjusted R² Antar Model",
x = "Model", y = "Adjusted R²") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Tampilkan semua grafik
grid.arrange(p_mse, p_aic, p_r2, ncol = 1)
Temuan Utama: ## Hubungan Ozone vs Temperature: Baik, aku buatkan versi lain dari insight dan pendapat dengan bahasa berbeda agar tidak sama persis dengan yang di gambar kamu 👇
Hubungan Ozone vs Temperature
Hubungan Ozone vs Wind Speed
# Hitung korelasi
cor_temp <- cor(df_airquality$Ozone, df_airquality$Temp)
cor_wind <- cor(df_airquality$Ozone, df_airquality$Wind)
print(paste("Korelasi Ozone vs Temperature:", round(cor_temp, 3)))
## [1] "Korelasi Ozone vs Temperature: 0.601"
print(paste("Korelasi Ozone vs Wind Speed:", round(cor_wind, 3)))
## [1] "Korelasi Ozone vs Wind Speed: -0.532"
Untuk Temperature: Disarankan menggunakan pendekatan spline karena lebih fleksibel dalam menangkap pola hubungan yang tidak sepenuhnya linier.
Untuk Wind Speed: Regresi linier sederhana sudah memadai untuk menggambarkan keterkaitan yang relatif jelas antara kedua variabel.
Penerapan Praktis: Model ini bisa digunakan sebagai dasar dalam memprediksi kualitas udara dengan mempertimbangkan faktor meteorologi.
Arah Penelitian Selanjutnya: Perlu dipertimbangkan analisis multivariat yang menggabungkan lebih banyak faktor lingkungan agar diperoleh model prediksi yang lebih menyeluruh dan akurat.
Kesimpulan:
Variabel temperature menunjukkan pengaruh yang lebih kuat dan lebih kompleks terhadap konsentrasi ozon dibandingkan dengan wind speed. Hal ini menegaskan pentingnya pendekatan non-linear dalam pemodelan hubungan lingkungan, terutama karena pembentukan ozon melibatkan mekanisme fotokimia yang kompleks. Sementara itu, wind speed berperan lebih sederhana sebagai faktor yang membantu mengurangi akumulasi ozon melalui proses dispersi polutan.