P8 STA581 Moving Beyond Linearity

SOAL

Data bisa diperoleh untuk tugas dapat diperoleh di package ISLR. Silahkan Install dulu packagenya.

Gunakan model-model non-linier yang sudah dipelajari pada data Auto dengan peubah respon mpg dan pilih tiga kolom untuk dijadikan peubah prediktor.

  • Apakah ada bukti untuk hubungan non-linier untuk peubah-peubah yang anda pilih? Buat visualisasi untuk Mendukung jawaban Anda.
  • Tentukan model non-linear terbaik untuk masing pasangan peubah (Secara visual dan/atau secara empiris.

Packages

library(tidyverse)
library(splines)
library(rsample) #bootstrapping
library(ISLR)
library(cowplot) #Streamlined Plot Theme and Plot Annotations for 'ggplot2'
library(caret) # cross-validation
library(gridExtra) # combining graphs
library(ggplot2)
library(dplyr)
library(purrr)
library(ggpubr)
library(knitr)
library(DT)
library(kableExtra)
library(gam) # generalized additive models
library(mlr3measures)

Eksplorasi Data

Data yang digunakan adalah Auto dari package ISLR yang terdiri dari 392 observasi dan 9 variabel. Dataset ini diambil dari perpustakaan StatLib yang dikelola di Carnegie Mellon University. Dataset ini digunakan dalam American Statistical Association Exposition 1983.Variabel yang akan digunakan pada kasus ini adalah :

  • mpg : miles per gallon
  • horsepower: Engine horsepower
  • weight : Vehicle weight (lbs.)
  • acceleration : Time to accelerate from 0 to 60 mph (sec.)
Auto = Auto %>%  select(mpg, horsepower, acceleration, weight)
datatable(Auto)

Visualisasi Data

hp <- ggplot(Auto,aes(x=horsepower, y=mpg), add=TRUE) + 
  geom_point(alpha=0.55, color="#3399ff") +

  theme_bw()
  
we <- ggplot(Auto,aes(x=weight, y=mpg), add=TRUE) + 
  geom_point(alpha=0.55, color="#6600cc") +

  theme_bw()
  
ac <- ggplot(Auto,aes(x=acceleration, y=mpg), add=TRUE) + 
  geom_point(alpha=0.55, color="#ff3333") + 

  theme_bw()

plot_grid(hp, we, ac, labels = c("mpg vs horsepower","mpg vs Weight", "mpg vs Acceleration"), label_size = 6)

Jika kita amati lebih detail dari scatter plot diatas, pola data ketiga scatterplot cenderung tidak linier, maka dari itu ada beberapa pendekatan yang bisa digunakan.

Dalam kasus kali ini akan digunakan pendekatan Regresi Polinomial, Piecewise constant, dan Natural cubic splines.

Regresi Polinomial

Data mpg & horsepower

Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

ordo <- 2:10

best_poly <- map_dfr(ordo, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ poly(horsepower,ordo=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                              }
                            )
  metric_poly

# menghitung rata-rata untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly
  }
)

(polyh <- cbind(ordo=ordo,best_poly))
##   ordo     rmse      mae
## 1    2 4.351129 3.266460
## 2    3 4.355430 3.266945
## 3    4 4.367009 3.280452
## 4    5 4.318767 3.249003
## 5    6 4.317431 3.257272
## 6    7 4.293242 3.233376
## 7    8 4.309511 3.251758
## 8    9 4.323321 3.244919
## 9   10 4.371903 3.274887

Plot CV

polyh$ordo <- as.factor(polyh$ordo)
ggplot(data=best_poly, aes(x=ordo, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Root Mean Square Error")

Plot

A <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="6666cc") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Ordo 2") +
  xlab("Horse Power") + 
  ylab("mile per gallon")  

B <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="6666cc") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,7,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Ordo 7") +
  xlab("Horse Power") + 
  ylab("mile per gallon")

ggarrange(A, B, ncol = 2, nrow = 1)

Jika dilihat berdasarkan plot, meskipun nilai RMSE Regresi Polinomial ordo 7 lebih kecil daripada ordo 2, model terbaik yang dipilih adalah Regresi Polinomial ordo 2. Karena untuk ordo 7 pada bagian ujung ujung plot cenderung tidak stabil.

Data mpg & weight

Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

ordo <- 2:10

best_poly <- map_dfr(ordo, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ poly(weight,ordo=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                              }
                            )
  metric_poly

# menghitung rata-rata untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly
  }
)

(polyw <- cbind(ordo=ordo,best_poly))
##   ordo     rmse      mae
## 1    2 4.149977 3.060704
## 2    3 4.158482 3.068020
## 3    4 4.165585 3.067210
## 4    5 4.169760 3.079497
## 5    6 4.178027 3.098482
## 6    7 4.170375 3.100742
## 7    8 4.181430 3.108979
## 8    9 4.254235 3.164729
## 9   10 4.299992 3.167556

Plot CV

polyw$ordo <- as.factor(polyw$ordo)
ggplot(data=best_poly, aes(x=ordo, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Root Mean Square Error")

Plot

poly2<-ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="#9900FF") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Ordo 2") +
  xlab("weight") + 
  ylab("mile per gallon")  

Secara empiris dapat dilihat bahwa nilai rmse dan mae pada ordo 2 memiliki nilai paling kecil dibanding ordo lain dan dapat dibutkian pada plot diatas.

Data mpg & acceleration

Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

ordo <- 2:10

best_poly <- map_dfr(ordo, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ poly(acceleration,ordo=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                              }
                            )
  metric_poly

# menghitung rata-rata untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly
  }
)

(polya <- cbind(ordo=ordo,best_poly))
##   ordo     rmse      mae
## 1    2 7.037689 5.755305
## 2    3 7.048668 5.792450
## 3    4 7.008647 5.700795
## 4    5 7.042178 5.729396
## 5    6 7.046243 5.704385
## 6    7 7.097228 5.711475
## 7    8 7.167747 5.767367
## 8    9 7.150072 5.744992
## 9   10 7.164808 5.754932

Plot CV

polya$ordo <- as.factor(polya$ordo)
ggplot(data=best_poly, aes(x=ordo, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Root Mean Square Error")

Plot

poly31 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="#FF0033") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Ordo 2") +
  xlab("acceleration") + 
  ylab("mile per gallon")

poly32 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="#FF9966") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,4,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Ordo 4") +
  xlab("acceleration") + 
  ylab("mile per gallon")

poly33 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="#FF00CC") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,6,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Ordo 6") +
  xlab("acceleration") + 
  ylab("mile per gallon")

ggarrange(poly31, poly32, poly33, ncol = 2, nrow = 2)

Berdasarkan nilai rmse dan mae ordo 4 merupakan model terbaik untuk regresi polinomial namun jika dilihat secara visualisasi dapat dilihat bahwa kurva untuk ordo 4 tidak stabil sehingga dibandingkan dengan kedua ordo lainnya yang memiliki nilai rmse dan mae terkecil yaitu ordo 2 dan 6.

Dari ketiga ordo diatas dapat dilihat bahwa ordo 2 menunjukkan kurva yang cenderung stabil dibanding kedua ordo lainnya.

Piecewise Constant

Data mpg & horsepower

###Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

breaks <- 2:20

best_tangga <- map_dfr(breaks, function(i){
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$horsepower <- cut(training$horsepower,i)
                             mod <- lm(mpg ~ horsepower, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                             testing <- Auto[-x$in_id,]
                             horsepower_new <- cut(testing$horsepower,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
                             pred <- predict(mod,   newdata=list(horsepower=horsepower_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                             }
                           )

metric_tangga

# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)

mean_metric_tangga
})
(tanggah <- cbind(breaks=breaks,best_tangga))
##    breaks     rmse      mae
## 1       2 6.147626 4.790234
## 2       3 5.775792 4.521829
## 3       4 4.985270 3.774549
## 4       5 4.712623 3.571614
## 5       6 4.644383 3.548375
## 6       7 4.554116 3.379980
## 7       8 4.437597 3.405433
## 8       9 4.549668 3.471999
## 9      10 4.589172 3.455531
## 10     11 4.531039 3.380931
## 11     12 4.442697 3.321151
## 12     13 4.527126 3.375170
## 13     14 4.428824 3.270424
## 14     15 4.313162 3.286289
## 15     16 4.336661 3.302031
## 16     17 4.414343 3.347532
## 17     18 4.340715 3.221099
## 18     19 4.331798 3.224404
## 19     20 4.480710 3.320605

Plot CV rmse

tanggah$breaks <- as.factor(tanggah$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Root Mean Square Error")

Plot CV mae

tanggah$breaks <- as.factor(tanggah$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=mae, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("MAE")

Plot

pc8<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="#FF9966") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,8), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("K=8") +
  xlab("Horse Power") + ylab("mile per gallon")

pc15<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="#FF00CC") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,15), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("K=15") +
  xlab("Horse Power") + ylab("mile per gallon")

ggarrange(pc8, pc15, ncol = 2, nrow = 1)

Berdasarkan nilai rmse dan mae interval ordo 8 dan ordo 15 merupakan model terbaik, namun jika dibandingkan secara visual didapatkan bahwa interval sebesar 8 lebih baik dikarenakan kurva cenderung lebih mulus.

Data mpg & weight

Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

breaks <- 2:10

best_tangga <- map_dfr(breaks, function(i){
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$weight <- cut(training$weight,i)
                             mod <- lm(mpg ~ weight, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                             testing <- Auto[-x$in_id,]
                             weight_new <- cut(testing$weight,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
                             pred <- predict(mod,   newdata=list(weight=weight_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                             }
                           )

metric_tangga

# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)

mean_metric_tangga
})
tanggaw <- cbind(breaks=breaks,best_tangga)
tanggaw
##   breaks     rmse      mae
## 1      2 5.570088 4.345368
## 2      3 4.868847 3.657505
## 3      4 4.706593 3.633584
## 4      5 4.389515 3.221556
## 5      6 4.201264 3.112221
## 6      7 4.279395 3.191726
## 7      8 4.370964 3.249765
## 8      9 4.341743 3.182396
## 9     10 4.329114 3.176390

Plot CV rmse

tanggaw$breaks <- as.factor(tanggaw$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Root Mean Square Error")

Plot CV mae

tanggaw$breaks <- as.factor(tanggaw$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=mae, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("MAE")

Plot

pcw6<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="#FF9966") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,6), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("K=6") +
  xlab("weight") + ylab("mile per gallon")

Secara empiris dapat dilihat bahwa nilai rmse dan mae pada interval 6 memiliki nilai paling kecil dibanding interval lain dan dapat dibuktikan pada plot diatas.

Data mpg & acceleration

Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

breaks <- 2:12

best_tangga <- map_dfr(breaks, function(i){
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$acceleration <- cut(training$acceleration,i)
                             mod <- lm(mpg ~ acceleration, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                             testing <- Auto[-x$in_id,]
                             acceleration_new <- cut(testing$acceleration,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
                             pred <- predict(mod,   newdata=list(acceleration=acceleration_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                             }
                           )

metric_tangga

# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)

mean_metric_tangga
})
(tanggaa <- cbind(breaks=breaks,best_tangga))
##    breaks     rmse      mae
## 1       2 7.502748 6.315199
## 2       3 7.003365 5.637141
## 3       4 7.243047 5.879089
## 4       5 7.240624 5.948255
## 5       6 6.987921 5.640505
## 6       7 7.202002 5.846671
## 7       8 7.018944 5.692063
## 8       9 7.012388 5.626494
## 9      10 7.145911 5.803971
## 10     11 7.037956 5.692398
## 11     12 7.048222 5.630799

Plot CV rmse

tanggaa$breaks <- as.factor(tanggaa$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Root Mean Square Error")

Plot CV mae

tanggaa$breaks <- as.factor(tanggaa$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=mae, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("MAE")

Plot

pca6<- ggplot(Auto,aes(x=acceleration, y=mpg)) +
                 geom_point(alpha=0.55, color="#FF9966") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,6), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("K=6") +
  xlab("acceleration") + ylab("mile per gallon")

pca9<- ggplot(Auto,aes(x=acceleration, y=mpg)) +
                 geom_point(alpha=0.55, color="#FF00CC") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,9), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("K=9") +
  xlab("acceleration") + ylab("mile per gallon")

ggarrange(pca6, pca9, ncol = 2, nrow = 1)

Model terbaik menurut nilai rmse yaitu interval 6 sedangkan menurut nilai mae yaitu interval 9, namun jika dibandingkan secara visual didapatkan bahwa interval sebesar 6 lebih baik dikarenakan kurva cenderung lebih mulus.

Natural Cubic Splines

Data mpg & horsepower

Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_spline3 <- map_dfr(df, function(i) {
  metric_spline3 <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ ns(horsepower,df=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                              }
                            )
  metric_spline3

# menghitung rata-rata untuk 10 folds
  mean_metric_spline3 <- colMeans(metric_spline3)
  mean_metric_spline3
  }
)

(best_spline31 <- cbind(df=df,best_spline3))
##    df     rmse      mae
## 1   2 4.330108 3.252908
## 2   3 4.348006 3.261538
## 3   4 4.329461 3.257926
## 4   5 4.313312 3.246412
## 5   6 4.310011 3.237990
## 6   7 4.294671 3.229174
## 7   8 4.293369 3.232221
## 8   9 4.281877 3.211860
## 9  10 4.260316 3.192010
## 10 11 4.288667 3.220958
## 11 12 4.285063 3.192363
## 12 13 4.302071 3.220998
## 13 14 4.322821 3.244848
## 14 15 4.305928 3.250204

Plot CV

best_spline31$df <- as.factor(best_spline31$df)
ggplot(data=best_spline3, aes(x=df, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

Berdasarkan nilai rmse diatas didapatkan derajat 10 memiliki nilai terendah dibanding derajat lain.

Knots

attr(ns(Auto$horsepower, df=10),"knots")
##   10%   20%   30%   40%   50%   60%   70%   80%   90% 
##  67.0  72.0  80.0  88.0  93.5 100.0 110.0 140.0 157.7

Plot

ncs <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="purple")+
    stat_smooth(method = "lm", 
               formula = y~ns(x, df=10), 
               lty = 1,col = "black", se=F)+
    theme_bw()+
    ggtitle("DF =10") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(67, 72, 80, 88, 93.5, 100, 110, 140, 157.7), col="grey50", lty=2)
ncs

Data mpg & weight

Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_spline3 <- map_dfr(df, function(i) {
  metric_spline3 <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ ns(weight,df=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                              }
                            )
  metric_spline3

# menghitung rata-rata untuk 10 folds
  mean_metric_spline3 <- colMeans(metric_spline3)
  mean_metric_spline3
  }
)

(best_spline32 <- cbind(df=df,best_spline3))
##    df     rmse      mae
## 1   2 4.150320 3.062108
## 2   3 4.162964 3.064666
## 3   4 4.171112 3.073367
## 4   5 4.180162 3.092261
## 5   6 4.171277 3.102353
## 6   7 4.141103 3.061136
## 7   8 4.170552 3.098226
## 8   9 4.175108 3.091378
## 9  10 4.168457 3.089614
## 10 11 4.162104 3.087247
## 11 12 4.157470 3.086991
## 12 13 4.178029 3.102996
## 13 14 4.187941 3.109626
## 14 15 4.184376 3.108734

Plot CV

best_spline32$df <- as.factor(best_spline32$df)
ggplot(data=best_spline3, aes(x=df, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

Berdasarkan nilai rmse diatas didapatkan derajat 7 memiliki nilai terendah dibanding derajat lain.

Knots

attr(ns(Auto$weight, df=7),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  2074.857  2285.429  2635.000  2986.571  3446.143  4096.286

Plot

ncs <- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="purple")+
    stat_smooth(method = "lm", 
               formula = y~ns(x, df=7), 
               lty = 1,col = "black", se=F)+
    theme_bw()+
    ggtitle("DF =7") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2074.857, 2285.429, 2635.000, 2986.571, 3446.143, 4096.286), col="grey50", lty=2)
ncs

Data mpg & acceleration

Cross Validation

set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_spline3 <- map_dfr(df, function(i) {
  metric_spline3 <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ ns(acceleration,df=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                              }
                            )
  metric_spline3

# menghitung rata-rata untuk 10 folds
  mean_metric_spline3 <- colMeans(metric_spline3)
  mean_metric_spline3
  }
)

(best_spline33 <- cbind(df=df,best_spline3))
##    df     rmse      mae
## 1   2 7.020337 5.743563
## 2   3 7.033832 5.767504
## 3   4 6.968525 5.649933
## 4   5 6.965883 5.610253
## 5   6 6.977508 5.638072
## 6   7 6.959692 5.633256
## 7   8 6.936841 5.598141
## 8   9 6.997009 5.649171
## 9  10 6.977197 5.646953
## 10 11 7.013287 5.661907
## 11 12 7.027583 5.680325
## 12 13 7.046496 5.701470
## 13 14 7.079820 5.714688
## 14 15 7.109796 5.737668

Plot CV

best_spline33$df <- as.factor(best_spline33$df)
ggplot(data=best_spline3, aes(x=df, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

Berdasarkan nilai rmse diatas didapatkan derajat 8 memiliki nilai terendah dibanding derajat lain.

Knots

attr(ns(Auto$acceleration, df=8),"knots")
##   12.5%     25%   37.5%     50%   62.5%     75%   87.5% 
## 12.5000 13.7750 14.5000 15.5000 16.2000 17.0250 18.7125

Plot

ncs <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
                 geom_point(alpha=0.55, color="purple")+
    stat_smooth(method = "lm", 
               formula = y~ns(x, df=8), 
               lty = 1,col = "black", se=F)+
    theme_bw()+
    ggtitle("DF =8") +
  xlab("acceleration") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(12.5000, 13.7750, 14.5000, 15.5000, 16.2000, 17.0250, 18.7125), col="grey50", lty=2)
ncs

Perbandingan Model

Data mpg & horsepower

Secara Empiris

##              nama_model     rmse      mae
## 1            Polinomial 4.293242 3.233376
## 2    Piecewise Constant 4.313162 3.286289
## 3 Natural Cubic Splines 4.260316 3.192010

Secara Visual

Data mpg & weight

Secara Empiris

##              nama_model     rmse      mae
## 1            Polinomial 4.149977 3.060704
## 2    Piecewise Constant 4.201264 3.112221
## 3 Natural Cubic Splines 4.141103 3.061136

Secara Visual

Data mpg & acceleration

Secara Empiris

##              nama_model     rmse      mae
## 1            Polinomial 7.008647 5.700795
## 2    Piecewise Constant 6.987921 5.640505
## 3 Natural Cubic Splines 6.936841 5.598141

Secara Visual

Referensi

Sartono, B.. (November 1, 2020). Moving Beyond Linearity. Retrieved from https://rpubs.com/bagusco/taklinear

Dito, G.A. (April 14, 2021). Regresi Polinomial, Regresi Fungsi Tangga dan Regresi Spline di R. Retrieved from https://gerrydito.github.io/Regresi-Polinomial,-Regresi-Fungsi-Tangga-dan-Regresi-Spline/

Amalia, Reni. (April 23, 2021). Moving Beyond Linearity. Retrived from https://www.rpubs.com/reniamelia/MovingBeyondLinearity

Rahmi, Anisa. (April, 22, 2021). Beyond Linearity. Retrived from https://www.rpubs.com/annisarahmi/beyondlienarity