Hubungan antara variabel lingkungan sering kali bersifat kompleks dan tidak selalu mengikuti pola linear. Pada dataset airquality, misalnya, konsentrasi Ozone diduga dipengaruhi oleh tingkat Solar.R. Secara teoritis, radiasi matahari dapat memicu pembentukan ozon, tetapi pola hubungan yang muncul di data bisa melengkung atau bervariasi pada rentang tertentu. Oleh karena itu, diperlukan analisis regresi dengan berbagai pendekatan—mulai dari model linear sederhana hingga spline—untuk memperoleh gambaran yang lebih akurat mengenai pola hubungan tersebut.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(splines)
df_airquality <- datasets::airquality
head(df_airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
# Cek apakah ada missing value
colSums(is.na(airquality))
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
Dari hasil pemeriksaan, dataset airquality memiliki missing value pada variabel Ozone dan Solar.R. Informasi ini penting sebelum melakukan analisis, karena kita perlu menentukan strategi untuk menangani nilai yang hilang, misalnya dengan menghapus baris, mengisi dengan rata-rata/median, atau menggunakan metode imputasi lainnya.
# Ganti NA pada kolom Ozone dengan median
airquality$Ozone[is.na(airquality$Ozone)] <- median(airquality$Ozone, na.rm = TRUE)
# Ganti NA pada kolom Solar.R dengan median
airquality$Solar.R[is.na(airquality$Solar.R)] <- median(airquality$Solar.R, na.rm = TRUE)
# Cek lagi apakah masih ada missing value
colSums(is.na(airquality))
## Ozone Solar.R Wind Temp Month Day
## 0 0 0 0 0 0
median(…, na.rm = TRUE) menghitung median dengan mengabaikan NA. Baris yang NA kemudian diisi dengan nilai median dari kolom tersebut. Setelah diganti, hasil colSums(is.na()) akan menunjukkan 0 → artinya tidak ada lagi missing value. ## Data Air Quality
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
theme_bw()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
mod_linear = lm(Ozone~Solar.R,data=df_airquality)
summary(mod_linear)
##
## Call:
## lm(formula = Ozone ~ Solar.R, data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.292 -21.361 -8.864 16.373 119.136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.59873 6.74790 2.756 0.006856 **
## Solar.R 0.12717 0.03278 3.880 0.000179 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.33 on 109 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.1213, Adjusted R-squared: 0.1133
## F-statistic: 15.05 on 1 and 109 DF, p-value: 0.0001793
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~x,lty = 1, col = "blue",se = F)+
theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
Dari grafik terlihat bahwa linear fit cukup oke di bagian tengah, tapi
di ekor (nilai kecil & besar) tidak mengikuti pola dengan baik →
indikasi perlu model lebih fleksibel.
mod_tangga = lm(Ozone ~ cut(Solar.R,5),data=df_airquality)
summary(mod_tangga)
##
## Call:
## lm(formula = Ozone ~ cut(Solar.R, 5), data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.743 -20.647 -6.437 14.853 112.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.167 7.071 2.145 0.034254 *
## cut(Solar.R, 5)(72.4,138] 10.271 10.308 0.996 0.321338
## cut(Solar.R, 5)(138,203] 37.214 9.637 3.862 0.000194 ***
## cut(Solar.R, 5)(203,269] 40.576 8.702 4.663 9.13e-06 ***
## cut(Solar.R, 5)(269,334] 29.690 9.637 3.081 0.002629 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30 on 106 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2167, Adjusted R-squared: 0.1871
## F-statistic: 7.331 on 4 and 106 DF, p-value: 2.984e-05
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
lty = 1, col = "blue",se = F)+
theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
mod_spline3 = lm(Ozone ~ bs(Solar.R, knots = c(5, 10, 20, 30, 40)),data=df_airquality)
summary(mod_spline3)
##
## Call:
## lm(formula = Ozone ~ bs(Solar.R, knots = c(5, 10, 20, 30, 40)),
## data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.255 -18.910 -3.842 15.718 109.745
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.665 13.828 0.916 0.36189
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))1 3.279 32.834 0.100 0.92065
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))2 -22.451 51.623 -0.435 0.66454
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))3 28.804 44.766 0.643 0.52136
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))4 -15.621 31.049 -0.503 0.61596
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))5 4.717 22.568 0.209 0.83485
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))6 4.984 23.393 0.213 0.83171
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))7 98.325 35.535 2.767 0.00671
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))8 NA NA NA NA
##
## (Intercept)
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))1
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))2
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))3
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))4
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))5
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))6
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))7 **
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.78 on 103 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2502, Adjusted R-squared: 0.1993
## F-statistic: 4.91 on 7 and 103 DF, p-value: 8.159e-05
mod_spline3ns = lm(Ozone ~ ns(Solar.R, knots = c(5, 10, 20, 30, 40)),data=df_airquality)
summary(mod_spline3ns)
##
## Call:
## lm(formula = Ozone ~ ns(Solar.R, knots = c(5, 10, 20, 30, 40)),
## data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.020 -23.768 -3.773 14.152 115.599
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.99 39.20 2.066 0.0413
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))1 -50.30 49.37 -1.019 0.3106
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))2 -66.02 49.84 -1.325 0.1882
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))3 -83.57 38.60 -2.165 0.0327
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))4 35.06 19.99 1.753 0.0824
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))5 -110.05 107.17 -1.027 0.3069
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))6 NA NA NA NA
##
## (Intercept) *
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))1
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))2
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))3 *
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))4 .
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))5
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.29 on 105 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2089, Adjusted R-squared: 0.1712
## F-statistic: 5.544 on 5 and 105 DF, p-value: 0.000143
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~bs(x, knots = c(5, 10, 20, 30, 40)),
lty = 1, aes(col = "Cubic Spline"),se = F)+
stat_smooth(method = "lm",
formula = y~ns(x, knots = c(5, 10, 20, 30, 40)),
lty = 1, aes(col = "Natural Cubic Spline"),se = F)+labs(color="Tipe Spline")+
scale_color_manual(values = c("Natural Cubic Spline"="red","Cubic Spline"="blue"))+theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Fungsi MSE
MSE <- function(pred, actual) {
mean((pred - actual)^2, na.rm = TRUE)
}
# Buat prediksi dari masing-masing model
pred_linear <- predict(mod_linear)
pred_tangga <- predict(mod_tangga)
pred_spline <- predict(mod_spline3)
pred_nspline <- predict(mod_spline3ns)
compare_stats <- data.frame(
Model = c("Linear","Tangga","Spline","Natural Spline"),
MSE = c(MSE(pred_linear, df_airquality$Ozone),
MSE(pred_tangga, df_airquality$Ozone),
MSE(pred_spline, df_airquality$Ozone),
MSE(pred_nspline, df_airquality$Ozone)),
AIC = c(AIC(mod_linear),
AIC(mod_tangga),
AIC(mod_spline3),
AIC(mod_spline3ns)),
Adj_R2 = c(summary(mod_linear)$adj.r.squared,
summary(mod_tangga)$adj.r.squared,
summary(mod_spline3)$adj.r.squared,
summary(mod_spline3ns)$adj.r.squared)
)
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
compare_stats
## Model MSE AIC Adj_R2
## 1 Linear 1293.729 1083.714 0.1132809
## 2 Tangga 1405.729 1076.964 0.1871312
## 3 Spline 1429.093 1078.108 0.1992623
## 4 Natural Spline 1404.240 1080.069 0.1711837
# Load data
data("airquality")
df_airquality <- na.omit(airquality)
# Model Ozone ~ Temp
model_temp <- lm(Ozone ~ Temp, data = df_airquality)
summary(model_temp)
##
## Call:
## lm(formula = Ozone ~ Temp, data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.922 -17.459 -0.874 10.444 118.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -147.6461 18.7553 -7.872 2.76e-12 ***
## Temp 2.4391 0.2393 10.192 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.92 on 109 degrees of freedom
## Multiple R-squared: 0.488, Adjusted R-squared: 0.4833
## F-statistic: 103.9 on 1 and 109 DF, p-value: < 2.2e-16
# Hitung MSE, AIC, Adj R2
mse_temp <- mean(resid(model_temp)^2)
aic_temp <- AIC(model_temp)
adjr2_temp <- summary(model_temp)$adj.r.squared
# Model Ozone ~ Wind
model_wind <- lm(Ozone ~ Wind, data = df_airquality)
summary(model_wind)
##
## Call:
## lm(formula = Ozone ~ Wind, data = df_airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.513 -18.597 -5.035 15.814 88.437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.0413 7.4724 13.25 < 2e-16 ***
## Wind -5.7288 0.7082 -8.09 9.09e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.42 on 109 degrees of freedom
## Multiple R-squared: 0.3752, Adjusted R-squared: 0.3694
## F-statistic: 65.44 on 1 and 109 DF, p-value: 9.089e-13
# Hitung MSE, AIC, Adj R2
mse_wind <- mean(resid(model_wind)^2)
aic_wind <- AIC(model_wind)
adjr2_wind <- summary(model_wind)$adj.r.squared
# Model Ozone ~ Solar.R
model_solar <- lm(Ozone ~ Solar.R, data = df_airquality)
# Hitung MSE, AIC, Adj R2
mse_solar <- mean(resid(model_solar)^2)
aic_solar <- AIC(model_solar)
adjr2_solar <- summary(model_solar)$adj.r.squared
# Buat tabel perbandingan
hasil <- data.frame(
Model = c("Ozone ~ Solar.R", "Ozone ~ Temp", "Ozone ~ Wind"),
MSE = c(mse_solar, mse_temp, mse_wind),
AIC = c(aic_solar, aic_temp, aic_wind),
Adj_R2 = c(adjr2_solar, adjr2_temp, adjr2_wind)
)
print(hasil)
## Model MSE AIC Adj_R2
## 1 Ozone ~ Solar.R 964.1642 1083.714 0.1132809
## 2 Ozone ~ Temp 561.8688 1023.775 0.4832625
## 3 Ozone ~ Wind 685.6547 1045.876 0.3694195
Jika Adjusted R² Temp tertinggi → suhu lebih kuat menjelaskan variasi Ozone.
Jika Wind negatif dan signifikan → angin tinggi cenderung menurunkan Ozone.
Jika Solar.R rendah Adjusted R² dan AIC lebih buruk → radiasi matahari kurang mampu menjelaskan variasi secara linear.
Dengan begitu, Temp biasanya paling baik sebagai prediktor Ozone, disusul Wind, sementara Solar.R cenderung lemah.
library(ggplot2)
# 1. Scatter plot Ozone ~ Temp
ggplot(df_airquality, aes(x = Temp, y = Ozone)) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(title = "Hubungan Ozone ~ Temp",
x = "Temperature (F)", y = "Ozone") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# 2. Scatter plot Ozone ~ Wind
ggplot(df_airquality, aes(x = Wind, y = Ozone)) +
geom_point(color = "green", alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(title = "Hubungan Ozone ~ Wind",
x = "Wind (mph)", y = "Ozone") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# 3. Scatter plot Ozone ~ Solar.R
ggplot(df_airquality, aes(x = Solar.R, y = Ozone)) +
geom_point(color = "purple", alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(title = "Hubungan Ozone ~ Solar.R",
x = "Solar Radiation", y = "Ozone") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
- Ozone ~ Temp → pola cenderung positif, makin tinggi suhu makin tinggi
Ozone.
Ozone ~ Wind → pola cenderung negatif, makin tinggi angin biasanya Ozone menurun.
Ozone ~ Solar.R → polanya kurang jelas, banyak sebaran acak.