Data Science Practicum: Assignment for Meeting 8

Problem

The data for this assignment can be obtained from the ISLR package. Please install the package first.

A. Define Data

Apply the non-linear models that have been covered to the Auto data, using mpg as the response variable, and choose three columns to serve as predictor variables.

library(ISLR)
attach(Auto)
#Load and inspect the data to be analyzed
View(Auto)

The predictor variables chosen are horsepower, weight, and acceleration.
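
As a quick check (a supplementary snippet beyond the assignment's required output), the response and the three chosen predictors can be summarized directly:

#Summarize the response and the three chosen predictors
summary(Auto[, c("mpg", "horsepower", "weight", "acceleration")])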

B. Visualize Data

Is there evidence of non-linear relationships for the variables you chose? Create visualizations to support your answer.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#Plot mpg vs horsepower
#Start with a linear model
mod_linear = lm(mpg~horsepower,data=Auto)
summary(mod_linear)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
#Plot the linear fit
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ x,
              lty = 1, col = "blue", se = TRUE) +
  theme_bw()

#Plot mpg vs weight
#Start with a linear model
mod_linear = lm(mpg~Auto$weight,data=Auto)
summary(mod_linear)
## 
## Call:
## lm(formula = mpg ~ Auto$weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9736  -2.7556  -0.3358   2.1379  16.5194 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.216524   0.798673   57.87   <2e-16 ***
## Auto$weight -0.007647   0.000258  -29.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.333 on 390 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6918 
## F-statistic: 878.8 on 1 and 390 DF,  p-value: < 2.2e-16
#Plot the linear fit
ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ x,
              lty = 1, col = "blue", se = TRUE) +
  xlab("Weight") +
  theme_bw()

#Plot mpg vs acceleration
#Start with a linear model
mod_linear = lm(mpg~Auto$acceleration,data=Auto)
summary(mod_linear)
## 
## Call:
## lm(formula = mpg ~ Auto$acceleration, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.989  -5.616  -1.199   4.801  23.239 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8332     2.0485   2.359   0.0188 *  
## Auto$acceleration   1.1976     0.1298   9.228   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.08 on 390 degrees of freedom
## Multiple R-squared:  0.1792, Adjusted R-squared:  0.1771 
## F-statistic: 85.15 on 1 and 390 DF,  p-value: < 2.2e-16
#Plot the linear fit
ggplot(Auto, aes(x = acceleration, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ x,
              lty = 1, col = "blue", se = TRUE) +
  xlab("Acceleration") +
  theme_bw()

Based on the three plots above, the relationship between mpg and acceleration looks linear, as does the relationship between mpg and weight. The pattern is different for mpg and horsepower, where the relationship tends to be non-linear, as can be seen from the points at the lower and upper ends of the horsepower range.
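
To complement the visual impression for horsepower, a small additional check (a sketch, not part of the required output) is a nested-model F-test comparing the straight-line fit against a quadratic fit; a very small p-value would indicate that the quadratic term adds real explanatory power, i.e. evidence of non-linearity.

#Extra check: does a quadratic term improve on the straight line?
fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ poly(horsepower, 2), data = Auto)
anova(fit_lin, fit_quad)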

C. Determine the Model

Determine the best non-linear model for each pair of variables (visually and/or empirically).

1. Polynomial Regression

In this polynomial approach, polynomials of degree 2, 3, and 4 are fitted and assessed both visually and empirically using their RMSE and MAE values.

library(tidyverse)
library(splines)
library(rsample)


#Polynomial regression, degree 2
mod_polinomial2 = lm(mpg ~ poly(horsepower,2,raw = T),
                     data=Auto)
summary(mod_polinomial2)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2, raw = T), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   56.9000997  1.8004268   31.60   <2e-16 ***
## poly(horsepower, 2, raw = T)1 -0.4661896  0.0311246  -14.98   <2e-16 ***
## poly(horsepower, 2, raw = T)2  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
#Plot the degree-2 polynomial fit
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE),
              lty = 1, col = "blue", se = TRUE) +
  labs(title = "Polynomial, d = 2") +
  theme_bw()

#Polynomial regression, degree 3
  
mod_polinomial3 = lm(mpg ~ poly(horsepower,3,raw = T),
                     data=Auto)
summary(mod_polinomial3)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 3, raw = T), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7039  -2.4491  -0.1519   2.2035  15.8159 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.068e+01  4.563e+00  13.298  < 2e-16 ***
## poly(horsepower, 3, raw = T)1 -5.689e-01  1.179e-01  -4.824 2.03e-06 ***
## poly(horsepower, 3, raw = T)2  2.079e-03  9.479e-04   2.193   0.0289 *  
## poly(horsepower, 3, raw = T)3 -2.147e-06  2.378e-06  -0.903   0.3673    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.375 on 388 degrees of freedom
## Multiple R-squared:  0.6882, Adjusted R-squared:  0.6858 
## F-statistic: 285.5 on 3 and 388 DF,  p-value: < 2.2e-16
#Plot the degree-3 polynomial fit
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ poly(x, 3, raw = TRUE),
              lty = 1, col = "blue", se = TRUE) +
  labs(title = "Polynomial, d = 3") +
  theme_bw()

#Polynomial regression, degree 4

mod_polinomial4 = lm(mpg ~ poly(horsepower,4,raw = T),
                     data=Auto)
summary(mod_polinomial4)
#Plot the degree-4 polynomial fit
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ poly(x, 4, raw = TRUE),
              lty = 1, col = "blue", se = TRUE) +
  labs(title = "Polynomial, d = 4") +
  theme_bw()

#Empirical test of the linear model with RMSE and MAE
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_linear1 <- map_dfr(cross_val$splits,
    function(x){
mod <- lm(mpg ~ horsepower,
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

metric_linear1
# compute the average over the 10 folds
mean_metric_linear1 <- colMeans(metric_linear1)

mean_metric_linear1      
##     rmse      mae 
## 4.889086 3.832328
#Model comparison with RMSE and MAE: polynomial, degree 2

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_poly_d2 <- map_dfr(cross_val$splits,
    function(x){
mod <- lm(mpg ~ poly(horsepower,2,raw = T),
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

metric_poly_d2
# compute the average over the 10 folds
mean_metric_poly_d2 <- colMeans(metric_poly_d2)

mean_metric_poly_d2
##     rmse      mae 
## 4.351129 3.266460
#Model comparison with RMSE and MAE: polynomial, degree 3
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_poly_d3 <- map_dfr(cross_val$splits,
    function(x){
mod <- lm(mpg ~ poly(horsepower,3,raw = T),
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

metric_poly_d3
# compute the average over the 10 folds
mean_metric_poly_d3 <- colMeans(metric_poly_d3)

mean_metric_poly_d3
##     rmse      mae 
## 4.355430 3.266945
#Model comparison with RMSE and MAE: polynomial, degree 4
set.seed(333)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_poly_d4 <- map_dfr(cross_val$splits,
    function(x){
mod <- lm(mpg ~ poly(horsepower,4,raw = T),
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg


rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

metric_poly_d4
# compute the average over the 10 folds
mean_metric_poly_d4 <- colMeans(metric_poly_d4)

mean_metric_poly_d4
##     rmse      mae 
## 4.368638 3.290133
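
The four cross-validation runs above repeat the same code for each degree. As a consolidated sketch (using the same rsample folds and mlr3measures metrics; the numbers can differ slightly from the tables above because of the seed), the degrees can also be compared in a single loop:

#Average CV RMSE and MAE for polynomial degrees 1-4 in one pass
set.seed(123)
cross_val <- vfold_cv(Auto, v = 10, strata = "mpg")

cv_poly <- map_dfr(1:4, function(d){
  fold_metrics <- map_dfr(cross_val$splits, function(x){
    mod   <- lm(mpg ~ poly(horsepower, d, raw = TRUE), data = Auto[x$in_id, ])
    pred  <- predict(mod, newdata = Auto[-x$in_id, ])
    truth <- Auto[-x$in_id, ]$mpg
    c(rmse = mlr3measures::rmse(truth = truth, response = pred),
      mae  = mlr3measures::mae(truth = truth, response = pred))
  })
  c(degree = d, colMeans(fold_metrics))
})
cv_poly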

2. Step Function Regression

In this step-function approach, a search is made for the number of cut points (bins) that gives the smallest RMSE and MAE values, and the resulting fit is also visualized.
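
Before searching over the number of bins, a minimal illustration of what cut() does may help: it splits horsepower into equal-width intervals, and the step-function regression then fits one mean mpg per interval (five bins here, matching the first model below).

#How cut() bins horsepower, and the mean mpg within each bin
table(cut(Auto$horsepower, 5))
tapply(Auto$mpg, cut(Auto$horsepower, 5), mean)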

library(tidyverse)
library(splines)
library(rsample)

#Fit a step-function (piecewise-constant) regression

mod_tangga1 = lm(mpg ~ cut(horsepower,5),data=Auto)
summary(mod_tangga1)
## 
## Call:
## lm(formula = mpg ~ cut(horsepower, 5), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.3033  -3.0220  -0.5413   2.4074  16.6394 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   31.3033     0.4272   73.27   <2e-16 ***
## cut(horsepower, 5)(82.8,120]  -8.2813     0.5642  -14.68   <2e-16 ***
## cut(horsepower, 5)(120,156]  -15.2427     0.7210  -21.14   <2e-16 ***
## cut(horsepower, 5)(156,193]  -17.5922     1.0036  -17.53   <2e-16 ***
## cut(horsepower, 5)(193,230]  -18.5340     1.3767  -13.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.719 on 387 degrees of freedom
## Multiple R-squared:  0.6382, Adjusted R-squared:  0.6345 
## F-statistic: 170.7 on 4 and 387 DF,  p-value: < 2.2e-16
#Scatter plot with the step-function fit
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ cut(x, 5),
              lty = 1, col = "blue", se = FALSE) +
  theme_bw()

#Empirical test with RMSE and MAE
set.seed(223)
cross_val1 <- vfold_cv(Auto, v = 10, strata = "mpg")

#Try different numbers of intervals to find the best model; here 3 to 10 intervals are tried
breaks <- 3:10

best_tangga1 <- map_dfr(breaks, function(i){

metric_tangga1 <- map_dfr(cross_val1$splits,
    function(x){
      
training1 <- Auto[x$in_id,]
training1$horsepower <- cut(training1$horsepower,i)

mod1 <- lm(mpg ~ horsepower,
         data=training1)

labs_x <- levels(mod1$model[,2])

labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                  upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))


testing1 <- Auto[-x$in_id,]

horsepower_new <- cut(testing1$horsepower,c(labs_x_breaks[1,1],labs_x_breaks[,2]))

pred1 <- predict(mod1,
               newdata=list(horsepower=horsepower_new))
truth <- testing1$mpg


data_eval1 <- na.omit(data.frame(truth,pred1))

rmse <- mlr3measures::rmse(truth = data_eval1$truth,
                           response = data_eval1$pred1
                           )
mae <- mlr3measures::mae(truth = data_eval1$truth,
                           response = data_eval1$pred1
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

# compute the average over the 10 folds
mean_metric_tangga1 <- colMeans(metric_tangga1)

mean_metric_tangga1
})
best_tangga1 <- cbind(breaks=breaks,best_tangga1)
# show the results for all break counts
best_tangga1
# best break count according to rmse
best_tangga1 %>% slice_min(rmse)
# best break count according to mae
best_tangga1 %>% slice_min(mae)
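
As a follow-up sketch, the step function can be refitted on the full data with the selected number of bins and drawn over the scatter plot; 8 bins are assumed here, matching the label used in the comparison table in part 4 (treat that number as an assumption if the CV table above selects a different value).

#Refit and plot the step function with the selected number of bins (assumed 8)
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ cut(x, 8),
              lty = 1, col = "blue", se = FALSE) +
  labs(title = "Step function, 8 bins") +
  theme_bw()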

3. Spline Regression

In this spline regression approach, both a B-spline basis and a natural spline basis are tried, and the two are then compared to see which gives the smaller RMSE and MAE values.
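
For context on the basis sizes (a supplementary check using the same interior knots as in the code below): with a cubic B-spline the number of basis columns equals the number of interior knots plus the degree, while the natural spline uses two fewer columns because of its linear boundary constraints.

library(splines)

#Basis dimensions for the same three interior knots:
#cubic B-spline: 3 knots + degree 3 = 6 columns; natural spline: 3 + 1 = 4 columns
ncol(bs(Auto$horsepower, knots = c(80, 120, 180)))
ncol(ns(Auto$horsepower, knots = c(80, 120, 180)))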

library(tidyverse)
library(splines)
library(rsample)

#b-spline
dim(bs(Auto$horsepower, knots = c(80, 120,180)))
## [1] 392   6
mod_spline3 = lm(mpg ~ bs(horsepower, knots = c(80, 120,180)),data=Auto)

summary(mod_spline3)
## 
## Call:
## lm(formula = mpg ~ bs(horsepower, knots = c(80, 120, 180)), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6595  -2.6233  -0.2615   2.2284  15.2843 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                32.838      1.757  18.691  < 2e-16
## bs(horsepower, knots = c(80, 120, 180))1    5.778      2.548   2.268   0.0239
## bs(horsepower, knots = c(80, 120, 180))2   -9.224      1.840  -5.013 8.19e-07
## bs(horsepower, knots = c(80, 120, 180))3  -14.738      2.510  -5.871 9.37e-09
## bs(horsepower, knots = c(80, 120, 180))4  -21.074      2.620  -8.043 1.09e-14
## bs(horsepower, knots = c(80, 120, 180))5  -20.772      3.509  -5.919 7.15e-09
## bs(horsepower, knots = c(80, 120, 180))6  -18.484      3.219  -5.742 1.91e-08
##                                             
## (Intercept)                              ***
## bs(horsepower, knots = c(80, 120, 180))1 *  
## bs(horsepower, knots = c(80, 120, 180))2 ***
## bs(horsepower, knots = c(80, 120, 180))3 ***
## bs(horsepower, knots = c(80, 120, 180))4 ***
## bs(horsepower, knots = c(80, 120, 180))5 ***
## bs(horsepower, knots = c(80, 120, 180))6 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.294 on 385 degrees of freedom
## Multiple R-squared:  0.7019, Adjusted R-squared:  0.6973 
## F-statistic: 151.1 on 6 and 385 DF,  p-value: < 2.2e-16
#natural spline
mod_spline3ns = lm(mpg ~ ns(horsepower, knots = c(80, 120,180)),data=Auto)
summary(mod_spline3ns)
## 
## Call:
## lm(formula = mpg ~ ns(horsepower, knots = c(80, 120, 180)), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8425  -2.4877  -0.1741   2.2245  15.9857 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                38.260      1.089   35.12   <2e-16
## ns(horsepower, knots = c(80, 120, 180))1  -22.367      1.262  -17.72   <2e-16
## ns(horsepower, knots = c(80, 120, 180))2  -21.902      1.758  -12.46   <2e-16
## ns(horsepower, knots = c(80, 120, 180))3  -35.347      2.706  -13.06   <2e-16
## ns(horsepower, knots = c(80, 120, 180))4  -18.569      1.852  -10.02   <2e-16
##                                             
## (Intercept)                              ***
## ns(horsepower, knots = c(80, 120, 180))1 ***
## ns(horsepower, knots = c(80, 120, 180))2 ***
## ns(horsepower, knots = c(80, 120, 180))3 ***
## ns(horsepower, knots = c(80, 120, 180))4 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.369 on 387 degrees of freedom
## Multiple R-squared:  0.6899, Adjusted R-squared:  0.6867 
## F-statistic: 215.2 on 4 and 387 DF,  p-value: < 2.2e-16
#Plot the B-spline fit
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ bs(x, knots = c(80, 120, 180)),
              lty = 1, se = FALSE)

#Plot the natural spline fit
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ ns(x, knots = c(80, 120, 180)),
              lty = 1, se = FALSE)

#Empirical comparison over different df values
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 6:12

best_spline3 <- map_dfr(df, function(i){
metric_spline3 <- map_dfr(cross_val$splits,
    function(x){
mod <- lm(mpg ~ bs(horsepower,df=i),
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg


rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

# compute the average over the 10 folds
mean_metric_spline3 <- colMeans(metric_spline3)

mean_metric_spline3
}
)
## Warning in bs(horsepower, degree = 3L, knots = c(`25%` = 75, `50%` = 92.5, :
## some 'x' values beyond boundary knots may cause ill-conditioned bases
## Warning in bs(horsepower, degree = 3L, knots = c(`20%` = 72, `40%` = 88, : some
## 'x' values beyond boundary knots may cause ill-conditioned bases
## Warning in bs(horsepower, degree = 3L, knots = c(`16.66667%` = 70, `33.33333%` =
## 84, : some 'x' values beyond boundary knots may cause ill-conditioned bases
## Warning in bs(horsepower, degree = 3L, knots = c(`14.28571%` = 68, `28.57143%`
## = 78.8571428571428, : some 'x' values beyond boundary knots may cause ill-
## conditioned bases
## Warning in bs(horsepower, degree = 3L, knots = c(`12.5%` = 67, `25%` = 75, :
## some 'x' values beyond boundary knots may cause ill-conditioned bases
## Warning in bs(horsepower, degree = 3L, knots = c(`11.11111%` = 67, `22.22222%` =
## 75, : some 'x' values beyond boundary knots may cause ill-conditioned bases
## Warning in bs(horsepower, degree = 3L, knots = c(`10%` = 66.3, `20%` = 72, :
## some 'x' values beyond boundary knots may cause ill-conditioned bases
best_spline3 <- cbind(df=df,best_spline3)
# show the results for all df values
best_spline3
# best df according to rmse
best_spline3 %>% slice_min(rmse)
# best df according to mae
best_spline3 %>% slice_min(mae)
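
As with the step function, a follow-up sketch refits the B-spline on the full data with the selected degrees of freedom and plots the fitted curve; df = 12 is assumed here, matching the label used in the comparison below (treat it as an assumption if the CV table above selects a different df).

#Refit and plot the B-spline with the selected degrees of freedom (assumed df = 12)
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm", formula = y ~ bs(x, df = 12),
              lty = 1, col = "blue", se = FALSE) +
  labs(title = "B-spline, df = 12") +
  theme_bw()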

4. Model Comparison

Having fitted the models with the various approaches, this section compares their RMSE and MAE values to determine which approach best describes the relationship between mpg and horsepower.

library(tidyverse)
library(splines)
library(rsample)

#Comparison of all models

nilai_metric <- rbind(mean_metric_linear1,
                      mean_metric_poly_d2,
                      mean_metric_poly_d3,
                      mean_metric_poly_d4,
                      best_tangga1 %>% select(-1) %>% slice_min(mae),
                      best_spline3 %>% select(-1) %>% slice_min(mae))

nama_model <- c("Linear", "Poly2", "Poly3", "Poly4",
                "Step (breaks = 8)", "Spline (df = 12)")

data.frame(nama_model, nilai_metric)

From this output it can be concluded that spline regression (the cubic B-spline with df = 12) is the best approach for describing the relationship between mpg and horsepower, since it has the smallest MAE.
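
To close, the competing fits can be overlaid on a single scatter plot for one final visual comparison (a sketch; the step and spline settings reuse the assumed breaks = 8 and df = 12 from the table above):

#Overlay the linear, quadratic, step (8 bins), and B-spline (df = 12) fits
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.4, color = "grey40") +
  stat_smooth(aes(color = "Linear"), method = "lm",
              formula = y ~ x, se = FALSE) +
  stat_smooth(aes(color = "Polynomial d=2"), method = "lm",
              formula = y ~ poly(x, 2, raw = TRUE), se = FALSE) +
  stat_smooth(aes(color = "Step (8 bins)"), method = "lm",
              formula = y ~ cut(x, 8), se = FALSE) +
  stat_smooth(aes(color = "B-spline (df=12)"), method = "lm",
              formula = y ~ bs(x, df = 12), se = FALSE) +
  labs(color = "Model") +
  theme_bw()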