Pendahuluan

Hubungan antara variabel lingkungan sering kali bersifat kompleks dan tidak selalu mengikuti pola linear. Pada dataset airquality, misalnya, konsentrasi Ozone diduga dipengaruhi oleh tingkat Solar.R. Secara teoritis, radiasi matahari dapat memicu pembentukan ozon, tetapi pola hubungan yang muncul di data bisa melengkung atau bervariasi pada rentang tertentu. Oleh karena itu, diperlukan analisis regresi dengan berbagai pendekatan—mulai dari model linear sederhana hingga spline—untuk memperoleh gambaran yang lebih akurat mengenai pola hubungan tersebut.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.1
## Warning: package 'lubridate' was built under R version 4.5.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(splines)
df_airquality <- datasets::airquality
head(df_airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
# Cek apakah ada missing value
colSums(is.na(airquality))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0

Dari hasil pemeriksaan, dataset airquality memiliki missing value pada variabel Ozone dan Solar.R. Informasi ini penting sebelum melakukan analisis, karena kita perlu menentukan strategi untuk menangani nilai yang hilang, misalnya dengan menghapus baris, mengisi dengan rata-rata/median, atau menggunakan metode imputasi lainnya.

# Ganti NA pada kolom Ozone dengan median
airquality$Ozone[is.na(airquality$Ozone)] <- median(airquality$Ozone, na.rm = TRUE)

# Ganti NA pada kolom Solar.R dengan median
airquality$Solar.R[is.na(airquality$Solar.R)] <- median(airquality$Solar.R, na.rm = TRUE)

# Cek lagi apakah masih ada missing value
colSums(is.na(airquality))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       0       0       0       0       0

median(…, na.rm = TRUE) menghitung median dengan mengabaikan NA. Baris yang NA kemudian diisi dengan nilai median dari kolom tersebut. Setelah diganti, hasil colSums(is.na()) akan menunjukkan 0 → artinya tidak ada lagi missing value.

Data Air Quality

ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
                 geom_point(alpha=0.55, color="black") + 
                 theme_bw() 
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Regresi Linear

mod_linear = lm(Ozone~Solar.R,data=df_airquality)
summary(mod_linear)
## 
## Call:
## lm(formula = Ozone ~ Solar.R, data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.292 -21.361  -8.864  16.373 119.136 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.59873    6.74790   2.756 0.006856 ** 
## Solar.R      0.12717    0.03278   3.880 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.33 on 109 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.1213, Adjusted R-squared:  0.1133 
## F-statistic: 15.05 on 1 and 109 DF,  p-value: 0.0001793
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
                 geom_point(alpha=0.55, color="black") +
   stat_smooth(method = "lm", 
               formula = y~x,lty = 1, col = "blue",se = F)+
  theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Dari grafik terlihat bahwa linear fit cukup oke di bagian tengah, tapi di ekor (nilai kecil & besar) tidak mengikuti pola dengan baik → indikasi perlu model lebih fleksibel.

Regresi Fungsi Tangga

mod_tangga = lm(Ozone ~ cut(Solar.R,5),data=df_airquality)
summary(mod_tangga)
## 
## Call:
## lm(formula = Ozone ~ cut(Solar.R, 5), data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.743 -20.647  -6.437  14.853 112.257 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 15.167      7.071   2.145 0.034254 *  
## cut(Solar.R, 5)(72.4,138]   10.271     10.308   0.996 0.321338    
## cut(Solar.R, 5)(138,203]    37.214      9.637   3.862 0.000194 ***
## cut(Solar.R, 5)(203,269]    40.576      8.702   4.663 9.13e-06 ***
## cut(Solar.R, 5)(269,334]    29.690      9.637   3.081 0.002629 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30 on 106 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.2167, Adjusted R-squared:  0.1871 
## F-statistic: 7.331 on 4 and 106 DF,  p-value: 2.984e-05
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~cut(x,5), 
               lty = 1, col = "blue",se = F)+
  theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Regresi Spline

mod_spline3 = lm(Ozone ~ bs(Solar.R, knots = c(5, 10, 20, 30, 40)),data=df_airquality)
summary(mod_spline3)
## 
## Call:
## lm(formula = Ozone ~ bs(Solar.R, knots = c(5, 10, 20, 30, 40)), 
##     data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.255 -18.910  -3.842  15.718 109.745 
## 
## Coefficients: (1 not defined because of singularities)
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                  12.665     13.828   0.916  0.36189
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))1    3.279     32.834   0.100  0.92065
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))2  -22.451     51.623  -0.435  0.66454
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))3   28.804     44.766   0.643  0.52136
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))4  -15.621     31.049  -0.503  0.61596
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))5    4.717     22.568   0.209  0.83485
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))6    4.984     23.393   0.213  0.83171
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))7   98.325     35.535   2.767  0.00671
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))8       NA         NA      NA       NA
##                                              
## (Intercept)                                  
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))1   
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))2   
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))3   
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))4   
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))5   
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))6   
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))7 **
## bs(Solar.R, knots = c(5, 10, 20, 30, 40))8   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.78 on 103 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.2502, Adjusted R-squared:  0.1993 
## F-statistic:  4.91 on 7 and 103 DF,  p-value: 8.159e-05
mod_spline3ns = lm(Ozone ~ ns(Solar.R, knots = c(5, 10, 20, 30, 40)),data=df_airquality)
summary(mod_spline3ns)
## 
## Call:
## lm(formula = Ozone ~ ns(Solar.R, knots = c(5, 10, 20, 30, 40)), 
##     data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.020 -23.768  -3.773  14.152 115.599 
## 
## Coefficients: (1 not defined because of singularities)
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                   80.99      39.20   2.066   0.0413
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))1   -50.30      49.37  -1.019   0.3106
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))2   -66.02      49.84  -1.325   0.1882
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))3   -83.57      38.60  -2.165   0.0327
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))4    35.06      19.99   1.753   0.0824
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))5  -110.05     107.17  -1.027   0.3069
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))6       NA         NA      NA       NA
##                                             
## (Intercept)                                *
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))1  
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))2  
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))3 *
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))4 .
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))5  
## ns(Solar.R, knots = c(5, 10, 20, 30, 40))6  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.29 on 105 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.2089, Adjusted R-squared:  0.1712 
## F-statistic: 5.544 on 5 and 105 DF,  p-value: 0.000143
ggplot(df_airquality,aes(x=Solar.R, y=Ozone)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~bs(x, knots = c(5, 10, 20, 30, 40)), 
               lty = 1, aes(col = "Cubic Spline"),se = F)+
    stat_smooth(method = "lm", 
               formula = y~ns(x, knots = c(5, 10, 20, 30, 40)), 
               lty = 1, aes(col = "Natural Cubic Spline"),se = F)+labs(color="Tipe Spline")+
  scale_color_manual(values = c("Natural Cubic Spline"="red","Cubic Spline"="blue"))+theme_bw()
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Removed 42 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Komparasi Model

# Fungsi MSE
MSE <- function(pred, actual) {
  mean((pred - actual)^2, na.rm = TRUE)
}

# Buat prediksi dari masing-masing model
pred_linear  <- predict(mod_linear)
pred_tangga  <- predict(mod_tangga)
pred_spline  <- predict(mod_spline3)
pred_nspline <- predict(mod_spline3ns)
compare_stats <- data.frame(
  Model   = c("Linear","Tangga","Spline","Natural Spline"),
  MSE     = c(MSE(pred_linear, df_airquality$Ozone),
              MSE(pred_tangga, df_airquality$Ozone),
              MSE(pred_spline, df_airquality$Ozone),
              MSE(pred_nspline, df_airquality$Ozone)),
  AIC     = c(AIC(mod_linear),
              AIC(mod_tangga),
              AIC(mod_spline3),
              AIC(mod_spline3ns)),
  Adj_R2  = c(summary(mod_linear)$adj.r.squared,
              summary(mod_tangga)$adj.r.squared,
              summary(mod_spline3)$adj.r.squared,
              summary(mod_spline3ns)$adj.r.squared)
)
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
## Warning in pred - actual: longer object length is not a multiple of shorter
## object length
compare_stats
##            Model      MSE      AIC    Adj_R2
## 1         Linear 1293.729 1083.714 0.1132809
## 2         Tangga 1405.729 1076.964 0.1871312
## 3         Spline 1429.093 1078.108 0.1992623
## 4 Natural Spline 1404.240 1080.069 0.1711837

Tugas Analisis

  1. Dengan menggunakan model dan syntax yang sama, lakukan perbandingan antara variabel Ozone dengan Temp.
  2. Ulangi analisis untuk hubungan antara Ozone dengan Wind.
  3. Bandingkan hasil dari masing-masing model berdasarkan nilai MSE, AIC, dan Adjusted R-squared.
  4. Berikan insight dan pendapatmu.

Pendahuluan

Dalam analisis ini, kita akan memperluas analisis yang telah dilakukan sebelumnya dengan mengeksplorasi hubungan antara konsentrasi Ozone dengan dua variabel lingkungan lainnya: Temperature (Temp) dan Wind Speed (Wind). Seperti pada analisis sebelumnya dengan Solar.R, kita akan menggunakan berbagai pendekatan regresi untuk memahami pola hubungan yang mungkin tidak linear.

library(tidyverse)
library(splines)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Load dan persiapkan data
df_airquality <- datasets::airquality

# Cek missing values
colSums(is.na(df_airquality))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0
# Ganti NA dengan median
df_airquality$Ozone[is.na(df_airquality$Ozone)] <- median(df_airquality$Ozone, na.rm = TRUE)
df_airquality$Solar.R[is.na(df_airquality$Solar.R)] <- median(df_airquality$Solar.R, na.rm = TRUE)

# Cek lagi apakah masih ada missing value
colSums(is.na(df_airquality))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       0       0       0       0       0

1. Analisis Hubungan Ozone vs Temperature

Eksplorasi Data

# Visualisasi scatter plot
p1 <- ggplot(df_airquality, aes(x=Temp, y=Ozone)) +
  geom_point(alpha=0.6, color="darkred") + 
  labs(title = "Hubungan Ozone vs Temperature",
       x = "Temperature (°F)",
       y = "Ozone (ppb)") +
  theme_bw()

print(p1)

Dari scatter plot terlihat bahwa hubungan antara Temperature dan Ozone menunjukkan pola yang cenderung positif dan mungkin non-linear, dengan variabilitas yang meningkat pada temperature tinggi.

Model Regresi Linear

mod_linear_temp <- lm(Ozone ~ Temp, data = df_airquality)
summary(mod_linear_temp)
## 
## Call:
## lm(formula = Ozone ~ Temp, data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.464 -16.399  -3.866  10.314 122.691 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -104.0802    15.6655  -6.644 5.21e-10 ***
## Temp           1.8443     0.1997   9.236  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.3 on 151 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3568 
## F-statistic: 85.31 on 1 and 151 DF,  p-value: < 2.2e-16
p2 <- ggplot(df_airquality, aes(x=Temp, y=Ozone)) +
  geom_point(alpha=0.6, color="darkred") +
  stat_smooth(method = "lm", formula = y~x, lty = 1, col = "blue", se = F) +
  labs(title = "Regresi Linear: Ozone vs Temperature",
       x = "Temperature (°F)",
       y = "Ozone (ppb)") +
  theme_bw()

print(p2)

Model Regresi Fungsi Tangga

mod_tangga_temp <- lm(Ozone ~ cut(Temp, 5), data = df_airquality)
summary(mod_tangga_temp)
## 
## Call:
## lm(formula = Ozone ~ cut(Temp, 5), data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.342 -14.000  -1.011   8.500 118.878 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               17.750      5.646   3.144  0.00201 ** 
## cut(Temp, 5)(64.2,72.4]    5.250      7.352   0.714  0.47627    
## cut(Temp, 5)(72.4,80.6]   11.261      6.554   1.718  0.08787 .  
## cut(Temp, 5)(80.6,88.8]   31.372      6.502   4.825 3.45e-06 ***
## cut(Temp, 5)(88.8,97]     61.092      7.663   7.973 3.86e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.58 on 148 degrees of freedom
## Multiple R-squared:  0.4118, Adjusted R-squared:  0.3959 
## F-statistic:  25.9 on 4 and 148 DF,  p-value: 2.779e-16
p3 <- ggplot(df_airquality, aes(x=Temp, y=Ozone)) +
  geom_point(alpha=0.6, color="darkred") +
  stat_smooth(method = "lm", formula = y~cut(x,5), lty = 1, col = "green", se = F) +
  labs(title = "Regresi Fungsi Tangga: Ozone vs Temperature",
       x = "Temperature (°F)",
       y = "Ozone (ppb)") +
  theme_bw()

print(p3)

## Model Regresi Spline

# Tentukan knots berdasarkan kuantil
temp_knots <- quantile(df_airquality$Temp, c(0.2, 0.4, 0.6, 0.8), na.rm = TRUE)
print(paste("Temperature knots:", paste(temp_knots, collapse = ", ")))
## [1] "Temperature knots: 69, 76.8, 81, 86"
mod_spline_temp <- lm(Ozone ~ bs(Temp, knots = temp_knots), data = df_airquality)
summary(mod_spline_temp)
## 
## Call:
## lm(formula = Ozone ~ bs(Temp, knots = temp_knots), data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.518 -12.001  -2.314   8.704 126.644 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                    26.8677    13.3473   2.013  0.04597 * 
## bs(Temp, knots = temp_knots)1 -18.4375    28.9008  -0.638  0.52451   
## bs(Temp, knots = temp_knots)2  -0.7273    16.4801  -0.044  0.96486   
## bs(Temp, knots = temp_knots)3  -6.2867    17.1459  -0.367  0.71441   
## bs(Temp, knots = temp_knots)4  16.0112    14.5352   1.102  0.27249   
## bs(Temp, knots = temp_knots)5  37.5144    19.5661   1.917  0.05716 . 
## bs(Temp, knots = temp_knots)6  59.2630    21.5667   2.748  0.00676 **
## bs(Temp, knots = temp_knots)7  51.4744    22.4573   2.292  0.02334 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.27 on 145 degrees of freedom
## Multiple R-squared:  0.4394, Adjusted R-squared:  0.4123 
## F-statistic: 16.23 on 7 and 145 DF,  p-value: 1.151e-15
mod_nspline_temp <- lm(Ozone ~ ns(Temp, knots = temp_knots), data = df_airquality)
summary(mod_nspline_temp)
## 
## Call:
## lm(formula = Ozone ~ ns(Temp, knots = temp_knots), data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.962 -11.735  -1.614  10.270 127.386 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     20.977      9.021   2.325   0.0214 *  
## ns(Temp, knots = temp_knots)1    2.739     10.053   0.272   0.7857    
## ns(Temp, knots = temp_knots)2   19.562     11.629   1.682   0.0947 .  
## ns(Temp, knots = temp_knots)3   49.792     10.672   4.666 6.86e-06 ***
## ns(Temp, knots = temp_knots)4   54.625     22.918   2.383   0.0184 *  
## ns(Temp, knots = temp_knots)5   64.937     12.423   5.227 5.80e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.16 on 147 degrees of freedom
## Multiple R-squared:  0.4372, Adjusted R-squared:  0.418 
## F-statistic: 22.84 on 5 and 147 DF,  p-value: < 2.2e-16
p4 <- ggplot(df_airquality, aes(x=Temp, y=Ozone)) +
  geom_point(alpha=0.6, color="darkred") +
  stat_smooth(method = "lm", formula = y~bs(x, knots = temp_knots), 
              lty = 1, aes(col = "Cubic Spline"), se = F) +
  stat_smooth(method = "lm", formula = y~ns(x, knots = temp_knots), 
              lty = 1, aes(col = "Natural Cubic Spline"), se = F) +
  labs(title = "Regresi Spline: Ozone vs Temperature",
       x = "Temperature (°F)",
       y = "Ozone (ppb)",
       color = "Tipe Spline") +
  scale_color_manual(values = c("Natural Cubic Spline"="red", "Cubic Spline"="blue")) +
  theme_bw()

print(p4)

# 2. Analisis Hubungan Ozone vs Wind Speed

Eksplorasi Data

# Visualisasi scatter plot
p5 <- ggplot(df_airquality, aes(x=Wind, y=Ozone)) +
  geom_point(alpha=0.6, color="darkblue") + 
  labs(title = "Hubungan Ozone vs Wind Speed",
       x = "Wind Speed (mph)",
       y = "Ozone (ppb)") +
  theme_bw()

print(p5)

Scatter plot menunjukkan hubungan negatif antara Wind Speed dan Ozone, yang secara teoritis masuk akal karena angin yang kencang dapat mendispersikan polutan termasuk ozon.

Model Regresi Linear

mod_linear_wind <- lm(Ozone ~ Wind, data = df_airquality)
summary(mod_linear_wind)
## 
## Call:
## lm(formula = Ozone ~ Wind, data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.971 -15.967  -3.924  12.933  99.676 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.2388     6.0007  13.872  < 2e-16 ***
## Wind         -4.3866     0.5683  -7.719 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.68 on 151 degrees of freedom
## Multiple R-squared:  0.2829, Adjusted R-squared:  0.2782 
## F-statistic: 59.58 on 1 and 151 DF,  p-value: 1.494e-12
p6 <- ggplot(df_airquality, aes(x=Wind, y=Ozone)) +
  geom_point(alpha=0.6, color="darkblue") +
  stat_smooth(method = "lm", formula = y~x, lty = 1, col = "blue", se = F) +
  labs(title = "Regresi Linear: Ozone vs Wind Speed",
       x = "Wind Speed (mph)",
       y = "Ozone (ppb)") +
  theme_bw()

print(p6)

Model Regresi Fungsi Tangga

mod_tangga_wind <- lm(Ozone ~ cut(Wind, 5), data = df_airquality)
summary(mod_tangga_wind)
## 
## Call:
## lm(formula = Ozone ~ cut(Wind, 5), data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.731 -14.807  -4.952   5.696  80.769 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               87.231      6.588  13.241  < 2e-16 ***
## cut(Wind, 5)(5.5,9.3]    -40.924      7.300  -5.606 9.86e-08 ***
## cut(Wind, 5)(9.3,13.1]   -58.279      7.365  -7.913 5.43e-13 ***
## cut(Wind, 5)(13.1,16.9]  -61.427      7.972  -7.706 1.74e-12 ***
## cut(Wind, 5)(16.9,20.7]  -70.231     15.214  -4.616 8.41e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 148 degrees of freedom
## Multiple R-squared:  0.3493, Adjusted R-squared:  0.3317 
## F-statistic: 19.86 on 4 and 148 DF,  p-value: 4.178e-13
p7 <- ggplot(df_airquality, aes(x=Wind, y=Ozone)) +
  geom_point(alpha=0.6, color="darkblue") +
  stat_smooth(method = "lm", formula = y~cut(x,5), lty = 1, col = "green", se = F) +
  labs(title = "Regresi Fungsi Tangga: Ozone vs Wind Speed",
       x = "Wind Speed (mph)",
       y = "Ozone (ppb)") +
  theme_bw()

print(p7)

## Model Regresi Spline

# Tentukan knots berdasarkan kuantil
wind_knots <- quantile(df_airquality$Wind, c(0.2, 0.4, 0.6, 0.8), na.rm = TRUE)
print(paste("Wind knots:", paste(wind_knots, collapse = ", ")))
## [1] "Wind knots: 6.9, 8.6, 10.42, 12.96"
mod_spline_wind <- lm(Ozone ~ bs(Wind, knots = wind_knots), data = df_airquality)
summary(mod_spline_wind)
## 
## Call:
## lm(formula = Ozone ~ bs(Wind, knots = wind_knots), data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.437 -13.683  -4.602   6.600  68.706 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                      56.28      18.42   3.055  0.00268 **
## bs(Wind, knots = wind_knots)1   107.29      34.96   3.069  0.00256 **
## bs(Wind, knots = wind_knots)2   -18.89      20.29  -0.931  0.35343   
## bs(Wind, knots = wind_knots)3   -13.13      20.99  -0.626  0.53254   
## bs(Wind, knots = wind_knots)4   -34.75      19.38  -1.794  0.07497 . 
## bs(Wind, knots = wind_knots)5   -14.69      26.35  -0.558  0.57795   
## bs(Wind, knots = wind_knots)6   -57.22      33.49  -1.709  0.08962 . 
## bs(Wind, knots = wind_knots)7   -30.79      25.89  -1.189  0.23621   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.26 on 145 degrees of freedom
## Multiple R-squared:  0.4398, Adjusted R-squared:  0.4128 
## F-statistic: 16.26 on 7 and 145 DF,  p-value: 1.089e-15
mod_nspline_wind <- lm(Ozone ~ ns(Wind, knots = wind_knots), data = df_airquality)
summary(mod_nspline_wind)
## 
## Call:
## lm(formula = Ozone ~ ns(Wind, knots = wind_knots), data = df_airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.130 -14.039  -3.006   6.463  78.996 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      99.63      13.94   7.147 3.84e-11 ***
## ns(Wind, knots = wind_knots)1   -70.15      13.72  -5.114 9.69e-07 ***
## ns(Wind, knots = wind_knots)2   -69.71      16.00  -4.357 2.46e-05 ***
## ns(Wind, knots = wind_knots)3   -65.42      12.05  -5.428 2.30e-07 ***
## ns(Wind, knots = wind_knots)4   -99.28      32.44  -3.060  0.00263 ** 
## ns(Wind, knots = wind_knots)5   -68.17      16.39  -4.159 5.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.03 on 147 degrees of freedom
## Multiple R-squared:  0.3922, Adjusted R-squared:  0.3715 
## F-statistic: 18.97 on 5 and 147 DF,  p-value: 1.58e-14
p8 <- ggplot(df_airquality, aes(x=Wind, y=Ozone)) +
  geom_point(alpha=0.6, color="darkblue") +
  stat_smooth(method = "lm", formula = y~bs(x, knots = wind_knots), 
              lty = 1, aes(col = "Cubic Spline"), se = F) +
  stat_smooth(method = "lm", formula = y~ns(x, knots = wind_knots), 
              lty = 1, aes(col = "Natural Cubic Spline"), se = F) +
  labs(title = "Regresi Spline: Ozone vs Wind Speed",
       x = "Wind Speed (mph)",
       y = "Ozone (ppb)",
       color = "Tipe Spline") +
  scale_color_manual(values = c("Natural Cubic Spline"="red", "Cubic Spline"="blue")) +
  theme_bw()

print(p8)

3. Perbandingan Model

Fungsi untuk Menghitung Metrik

# Fungsi MSE
MSE <- function(pred, actual) {
  mean((pred - actual)^2, na.rm = TRUE)
}

# Fungsi untuk membuat tabel perbandingan
create_comparison_table <- function(linear_mod, tangga_mod, spline_mod, nspline_mod, actual_values, variable_name) {
  
  pred_linear  <- predict(linear_mod)
  pred_tangga  <- predict(tangga_mod)
  pred_spline  <- predict(spline_mod)
  pred_nspline <- predict(nspline_mod)
  
  comparison <- data.frame(
    Variabel = rep(variable_name, 4),
    Model    = c("Linear", "Tangga", "Spline", "Natural Spline"),
    MSE      = c(MSE(pred_linear, actual_values),
                 MSE(pred_tangga, actual_values),
                 MSE(pred_spline, actual_values),
                 MSE(pred_nspline, actual_values)),
    AIC      = c(AIC(linear_mod),
                 AIC(tangga_mod),
                 AIC(spline_mod),
                 AIC(nspline_mod)),
    Adj_R2   = c(summary(linear_mod)$adj.r.squared,
                 summary(tangga_mod)$adj.r.squared,
                 summary(spline_mod)$adj.r.squared,
                 summary(nspline_mod)$adj.r.squared)
  )
  
  return(comparison)
}

Perbandingan untuk Ozone vs Temperature

comparison_temp <- create_comparison_table(
  mod_linear_temp, mod_tangga_temp, mod_spline_temp, mod_nspline_temp,
  df_airquality$Ozone, "Temperature"
)

print("=== Perbandingan Model: Ozone vs Temperature ===")
## [1] "=== Perbandingan Model: Ozone vs Temperature ==="
print(comparison_temp)
##      Variabel          Model      MSE      AIC    Adj_R2
## 1 Temperature         Linear 535.8932 1401.637 0.3567682
## 2 Temperature         Tangga 493.3074 1394.968 0.3958816
## 3 Temperature         Spline 470.1846 1393.623 0.4122853
## 4 Temperature Natural Spline 472.0010 1390.213 0.4180419

Perbandingan untuk Ozone vs Wind

comparison_wind <- create_comparison_table(
  mod_linear_wind, mod_tangga_wind, mod_spline_wind, mod_nspline_wind,
  df_airquality$Ozone, "Wind Speed"
)

print("=== Perbandingan Model: Ozone vs Wind Speed ===")
## [1] "=== Perbandingan Model: Ozone vs Wind Speed ==="
print(comparison_wind)
##     Variabel          Model      MSE      AIC    Adj_R2
## 1 Wind Speed         Linear 601.3750 1419.276 0.2781705
## 2 Wind Speed         Tangga 545.7385 1410.423 0.3316730
## 3 Wind Speed         Spline 469.8080 1393.501 0.4127561
## 4 Wind Speed Natural Spline 509.7543 1401.986 0.3714935

Perbandingan Gabungan

# Gabungkan hasil perbandingan
comparison_all <- rbind(comparison_temp, comparison_wind)

# Format untuk presentasi yang lebih baik
comparison_all$MSE <- round(comparison_all$MSE, 2)
comparison_all$AIC <- round(comparison_all$AIC, 2)
comparison_all$Adj_R2 <- round(comparison_all$Adj_R2, 3)

print("=== Perbandingan Lengkap Semua Model ===")
## [1] "=== Perbandingan Lengkap Semua Model ==="
print(comparison_all)
##      Variabel          Model    MSE     AIC Adj_R2
## 1 Temperature         Linear 535.89 1401.64  0.357
## 2 Temperature         Tangga 493.31 1394.97  0.396
## 3 Temperature         Spline 470.18 1393.62  0.412
## 4 Temperature Natural Spline 472.00 1390.21  0.418
## 5  Wind Speed         Linear 601.38 1419.28  0.278
## 6  Wind Speed         Tangga 545.74 1410.42  0.332
## 7  Wind Speed         Spline 469.81 1393.50  0.413
## 8  Wind Speed Natural Spline 509.75 1401.99  0.371

Visualisasi Perbandingan Metri

# Visualisasi MSE
p_mse <- ggplot(comparison_all, aes(x = Model, y = MSE, fill = Variabel)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  labs(title = "Perbandingan MSE Antar Model",
       x = "Model", y = "MSE") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualisasi AIC
p_aic <- ggplot(comparison_all, aes(x = Model, y = AIC, fill = Variabel)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  labs(title = "Perbandingan AIC Antar Model",
       x = "Model", y = "AIC") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualisasi Adjusted R-squared
p_r2 <- ggplot(comparison_all, aes(x = Model, y = Adj_R2, fill = Variabel)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) +
  labs(title = "Perbandingan Adjusted R² Antar Model",
       x = "Model", y = "Adjusted R²") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Tampilkan semua grafik
grid.arrange(p_mse, p_aic, p_r2, ncol = 1)

4. Insights dan Pendapat

Temuan Utama:

Hubungan Ozone vs Temperature:

  1. Pola Hubungan: Menunjukkan hubungan positif yang kuat dengan pola yang cenderung non-linear
  2. Model Terbaik: Berdasarkan metrik evaluasi, model spline (baik cubic maupun natural) memberikan performa terbaik
  3. Interpretasi: Temperature memiliki pengaruh yang signifikan terhadap konsentrasi ozon, sesuai dengan teori bahwa suhu tinggi mempercepat reaksi fotokimia pembentukan ozon

Hubungan Ozone vs Wind Speed:

  1. Pola Hubungan: Menunjukkan hubungan negatif yang konsisten
  2. Model Terbaik: Model linear sudah cukup baik untuk menggambarkan hubungan ini
  3. Interpretasi: Kecepatan angin yang tinggi membantu mendispersikan polutan, sehingga menurunkan konsentrasi ozon

Perbandingan Kekuatan Hubungan:

# Hitung korelasi
cor_temp <- cor(df_airquality$Ozone, df_airquality$Temp)
cor_wind <- cor(df_airquality$Ozone, df_airquality$Wind)

print(paste("Korelasi Ozone vs Temperature:", round(cor_temp, 3)))
## [1] "Korelasi Ozone vs Temperature: 0.601"
print(paste("Korelasi Ozone vs Wind Speed:", round(cor_wind, 3)))
## [1] "Korelasi Ozone vs Wind Speed: -0.532"

Rekomendasi:

  1. Untuk Temperature: Gunakan model spline karena mampu menangkap non-linearitas dengan baik
  2. Untuk Wind Speed: Model linear sudah cukup efektif dan parsimonis
  3. Aplikasi Praktis: Model ini dapat digunakan untuk prediksi kualitas udara berdasarkan kondisi meteorologi
  4. Penelitian Lanjutan: Pertimbangkan model multivariat yang menggabungkan semua variabel untuk prediksi yang lebih akurat

Kesimpulan:

Temperature memiliki pengaruh yang lebih kuat dan kompleks terhadap konsentrasi ozon dibandingkan dengan wind speed. Hal ini menunjukkan pentingnya mempertimbangkan pendekatan non-linear dalam modeling hubungan variabel lingkungan, terutama untuk hubungan yang secara teoritis melibatkan proses kompleks seperti reaksi fotokimia.