Non-Linier Regression


Data

Data yang digunakan adalah data Auto yang terdapat dalam package ISLR dengan peubah respon mpg dan variabel prediktor Engine horsepower, Time to Accelerate from 0 to 60 mph dan Vehicle weight

Untuk menjawab soal 2, Prosedur dalam pemilihan model terbaik dilakukan dengan cara melihat masing-masing pola hubungan data secara visual dan empiris dengan membandingkan nilai RMSE dan MAE dari ketiga model berikut:
1. Regresi Linier
2. Regresi Polinomial
3. Fungsi Tangga
4. Regresi Splines

Soal 1 : Melihat Pola Hubungan Data

g1 <- ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
g2 <- ggplot(Auto, aes(x = acceleration, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
g3 <- ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
grid.arrange(g1, g2, g3, ncol = 1, nrow=3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Berdasarkan visualisasi diatas dapat diambil kesimpulan sementara bahwa:
* Terdapat hubungan non-linier antara peubah respon mpg dengan variabel prediktor engine horsepower
* Terdapat hubungan non-linier antara peubah respon mpg dengan variabel prediktor time to accelerate from 0 to 60 mph
* Terdapat hubungan non-linier antara peubah respon mpg dengan variabel prediktor vehicle weight

Soal 2 : Menentukan Model Non-Linier Terbaik

MPG VS Engine Displacement

Regresi Linier

Secara Visual

ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
   stat_smooth(method = "lm", 
               formula = y~x,lty = 1,
               col = "red",se = F)+
  theme_bw()

Secara visual, model regresi linier kurang baik dalam merepresentasikan atau menjelaskan pola hubungan MPG dengan Engine horsepower karena pola garis lurus yang terbentuk tidak seutuhnya mengikuti pola sebaran data. Selanjutnya, perlu dilakukan pemilihan model regresi non-linier terbaik

Secara Empiris

#Derajat 1
set.seed(321)

cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_linear <- map_dfr(cross_val$splits, function(x){
  mod <- lm(mpg ~ horsepower, data=Auto[x$in_id,])
  pred <- predict(mod,newdata=Auto[-x$in_id,])
  truth <- Auto[-x$in_id,]$mpg
  rmse <- mlr3measures::rmse(truth = truth, response = pred)
  mae <- mlr3measures::mae(truth = truth,response = pred)
  metric <- c(rmse,mae)
  names(metric) <- c("rmse","mae")
  return(metric)
  }
)

# Rata-rata RMSE & MAE untuk 10 folds
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear
##     rmse      mae 
## 4.872408 3.843361

Regresi Polinomial

Secara Visual

#plot degree 2-7
d1 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,2,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 2")+
  theme_classic()

d2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,3,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 3")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,4,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 4")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,5,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 5")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,6,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 6")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,7,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 7")+
  theme_classic()
d7 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,8,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 8")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,9,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 9")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,10,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 10")+
  theme_classic()

grid.arrange(d1,d2,d3,d4,d5,d6,d7,d8,d9, nrow=3,ncol=3,
             top = textGrob("Plot Regresi Polinomial",gp=gpar(fontsize=15,font=1)))

Secara visual, model regresi polinomial dengan derajat 7 sampai derajat 10 cenderung lebih mengikuti pola sebaran data dibandingkan model regresi polinomial dengan derajat 2 sampai derajat 6. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi polinomial tersebut.

Secara Empiris

# Regresi polinomial derajat 2-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 2:10

best_poly <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x){
                           mod <- lm(mpg ~ poly(horsepower,derajat=i), data=Auto[x$in_id,])
                           pred <- predict(mod,newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth,response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                           }
                         )
  metric_poly
  
# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly 
  }
)
best_poly <- cbind(derajat=derajat,best_poly) 
best_poly
##   derajat     rmse      mae
## 1       2 4.334171 3.264925
## 2       3 4.345833 3.270972
## 3       4 4.351042 3.286527
## 4       5 4.289402 3.242422
## 5       6 4.273654 3.241089
## 6       7 4.257811 3.225156
## 7       8 4.265711 3.237336
## 8       9 4.271502 3.232318
## 9      10 4.303605 3.254084
best_poly$derajat <- as.factor(best_poly$derajat) 
prmse<-ggplot(data=best_poly, aes(x=derajat, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Root Mean Square Error")

best_poly$derajat <- as.factor(best_poly$derajat) 
pmae<-ggplot(data=best_poly, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE dan MAE minimum terdapat pada model regresi polinomial dengan derajat 7.

#Prediksi
prediksi.poly7 = predict( glm(mpg ~ poly(horsepower,7)) , data.frame(horsepower), interval="predict" )
#Plot
plot(horsepower, mpg, pch=19, col="blue",
     xlab="Engine Horsepower", ylab="Mile per Gallon",
     main="Regresi Polinomial Derajat 7")
ix <- sort(horsepower,index.return=T)$ix
lines(horsepower[ix], prediksi.poly7[ix], 
      main = "Polynomial Regression Derajat 7", col="red",
      type="l", lwd=2)

Fungsi Tangga

Secara Visual

#plot interval 2-10
d2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,2,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 2")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,3,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 3")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,4,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 4")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,5,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 5")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,6,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 6")+
  theme_classic()

d7 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,7,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 7")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,8,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 8")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,9,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 9")+
  theme_classic()

d10 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,10,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 10")+
  theme_classic()

grid.arrange(d2,d3,d4,d5,d6,d7,d8,d9,d10,nrow=3,ncol=3,
             top = textGrob("Plot Regresi Fungsi Tangga",gp=gpar(fontsize=15,font=1)))

Secara visual, fungsi tangga dengan interval 7 sampai derajat 10 cenderung lebih mengikuti pola sebaran data dibandingkan fungsi tangga dengan interval 2 sampai derajat 6. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model fungsi tangga.

Secara Empiris

# Fungsi tangga interval 2-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

interval <- 2:10

best_tangga <- map_dfr(interval, function(i){ 
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$horsepower <- cut(training$horsepower,i)
                             mod <- lm(mpg ~ horsepower, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_interval <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1",labs_x))) 
                             testing <- Auto[-x$in_id,]
                             horsepower_new <- cut(testing$horsepower,c(labs_x_interval[1,1],labs_x_interval[,2])) 
                             pred <-predict(mod, newdata=list(horsepower=horsepower_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred) 
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred) 
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                             }
                           )
  metric_tangga

# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_tangga <- colMeans(metric_tangga)
  mean_metric_tangga 
  }
)
best_tangga <- cbind(interval=interval,best_tangga)
best_tangga
##   interval     rmse      mae
## 1        2 6.142228 4.798645
## 2        3 5.748515 4.519001
## 3        4 4.951940 3.788360
## 4        5 4.667772 3.582610
## 5        6 4.638192 3.549758
## 6        7 4.433535 3.355719
## 7        8 4.393438 3.404443
## 8        9 4.512399 3.443004
## 9       10 4.531132 3.423550
best_tangga$interval <- as.factor(best_tangga$interval) 
prmse<-ggplot(data=best_tangga, aes(x=interval, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Interval") + 
  ylab("Root Mean Square Error")

best_tangga$interval <- as.factor(best_tangga$interval) 
pmae<-ggplot(data=best_tangga, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Interval") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE minimum terdapat pada fungsi tangga dengan interval 8 sedangkan MAE minimum terdapat pada fungsi tangga dengan interval 7

#Prediksi
prediksi.cut7 = predict( glm(mpg ~ cut(horsepower,7)) , data.frame(horsepower), interval="predict" )
prediksi.cut8 = predict( glm(mpg ~ cut(horsepower,8)) , data.frame(horsepower), interval="predict")
#Plot
plot(horsepower, mpg, pch=19, col="blue",
     xlab="Engine Horsepower", ylab="Mile per Gallon",
     main="Fungsi Tangga Interval 7 dan 8")
ix <- sort(horsepower,index.return=T)$ix
lines(horsepower[ix], prediksi.cut7[ix], 
      main = "Fungsi Tangga Interval 7", col="red",
      type="l", lwd=2)
lines(horsepower[ix], prediksi.cut8[ix], 
      main = "Fungsi Tangga Interval 8", col="red",
      type="l", lwd=2)

Regresi Spline

Secara Visual

bs <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("b-Splines") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~bs(x, knots = c(90,120,150,180)), 
              lty = 1,se = F)

ns <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("Natural Splines") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~ns(x, knots = c(90,120,150,180)), 
              lty = 1,se = F)

plot_grid(bs, ns, label_size = 6)

Secara visual, model regresi natural spline cenderung lebih mengikuti pola sebaran data dibandingkan model regresi b-splie. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi spline tersebut.

Secara Empiris (B-Spline)

# B-Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:10

best_bspline <- map_dfr(df, function(i){
  bspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ bs(horsepower,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  bspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_bspline <- colMeans(bspline)
  mean_bspline 
  }
)
best_bspline <- cbind(df=df,best_bspline)
best_bspline
##   df     rmse      mae
## 1  2 4.345833 3.270972
## 2  3 4.345833 3.270972
## 3  4 4.300196 3.249076
## 4  5 4.252369 3.218942
## 5  6 4.267055 3.231918
## 6  7 4.280071 3.239350
## 7  8 4.288528 3.240305
## 8  9 4.277210 3.230383
## 9 10 4.294385 3.243836
best_bspline %>% slice_min(rmse)
##   df     rmse      mae
## 1  5 4.252369 3.218942
best_bspline %>% slice_min(mae)
##   df     rmse      mae
## 1  5 4.252369 3.218942
#menentukan knot berdasarkan fungsi basis
df_5<-attr(bs(Auto$horsepower, df=5),"knots") 
df_5
## 33.33333% 66.66667% 
##        84       110
bs <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~bs(x, df=5),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("B-Spline With DF = 5") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(90,120,150,180), col="grey50", lty=2)
bs

Secara Empiris (Natural Spline)

# Natural Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_nspline <- map_dfr(df, function(i){
  nspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ ns(horsepower,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  nspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_nspline <- colMeans(nspline)
  mean_nspline 
  }
)
best_nspline <- cbind(df=df,best_nspline)
best_nspline
##    df     rmse      mae
## 1   2 4.316420 3.252256
## 2   3 4.339172 3.270060
## 3   4 4.298632 3.248019
## 4   5 4.275662 3.231437
## 5   6 4.277158 3.232881
## 6   7 4.249823 3.213396
## 7   8 4.271403 3.230483
## 8   9 4.255107 3.213574
## 9  10 4.235094 3.183301
## 10 11 4.257970 3.208380
## 11 12 4.244159 3.182543
## 12 13 4.245239 3.193597
## 13 14 4.255621 3.214075
## 14 15 4.250686 3.225731
best_nspline %>% slice_min(rmse)
##   df     rmse      mae
## 1 10 4.235094 3.183301
best_nspline %>% slice_min(mae)
##   df     rmse      mae
## 1 12 4.244159 3.182543
#menentukan knot berdasarkan fungsi basis
df_10<-attr(ns(Auto$horsepower, df=10),"knots") 
df_10
##   10%   20%   30%   40%   50%   60%   70%   80%   90% 
##  67.0  72.0  80.0  88.0  93.5 100.0 110.0 140.0 157.7
df_12<-attr(ns(Auto$horsepower, df=12),"knots") 
df_12
## 8.333333% 16.66667%       25% 33.33333% 41.66667%       50% 58.33333% 66.66667% 
##   65.0000   70.0000   75.0000   84.0000   88.0000   93.5000  100.0000  110.0000 
##       75% 83.33333% 91.66667% 
##  126.0000  150.0000  165.8333
ns10 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~ns(x, df=10),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("Natural Spline With DF = 10") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(90,120,150,180), col="grey50", lty=2)

ns12 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~ns(x, df=12),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("Natural Spline With DF = 12") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(90,120,150,180), col="grey50", lty=2)

plot_grid(ns10, ns12, label_size = 6)

Secara empiris, model regresi natural spline dengan df=12 lebih baik dalam merepresentasikan atau menjelaskan pola hubungan antara MPG dengan Engine Horse Power dibandingkan model regresi natural spline dengan df=10 maupun b-spline dengan df=5 karena nilai RMSE dan MAE natural spline dengan df=12 lebih kecil dibandingkan b-spline.

Perbandingan Semua Model

nilai_rmse <- rbind (4.872408,4.257811,4.433535, 4.393438, 4.252369, 4.235094, 4.244159) 
nilai_mae <- rbind (3.843361,3.225156,3.355719,3.404443, 3.218942,3.183301, 3.182543)
nama_model <- c("Linier", "Polinomial (derajat=7)", "Fungsi Tangga (interval=8)","Fungsi Tangga (interval=8)", "B-spline (df=5)", "Natural spline  (df=10)", "Natural spline  (df=12)")
model <- data.frame(nama_model,nilai_rmse,nilai_mae)
model
##                   nama_model nilai_rmse nilai_mae
## 1                     Linier   4.872408  3.843361
## 2     Polinomial (derajat=7)   4.257811  3.225156
## 3 Fungsi Tangga (interval=8)   4.433535  3.355719
## 4 Fungsi Tangga (interval=8)   4.393438  3.404443
## 5            B-spline (df=5)   4.252369  3.218942
## 6    Natural spline  (df=10)   4.235094  3.183301
## 7    Natural spline  (df=12)   4.244159  3.182543

Secara empiris, model regresi natural spline dengan df=12 merupakan model terbaik dalam merepresentasikan data, karena dibandingkan dengan model lain, natural spline dengan df=12 memiliki RMSE dan MAE paling minimum. Sehingga untuk melihat pola hubungan antara MPG dengan Engine horsepower lebih baik menggunakan model regresi natural spline dengan df 12 dengan visualisasi sebagai berikut:

MPG VS Time to Accelerate from 0 to 60 mph

Regresi Linier

Secara Visual

ggplot(Auto,aes(x=acceleration, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
   stat_smooth(method = "lm", 
               formula = y~x,lty = 1,
               col = "red",se = F)+
  theme_bw()

Secara visual, model regresi linier kurang baik dalam merepresentasikan atau menjelaskan pola hubungan MPG dengan Time to Accelerate from 0 to 60 mph karena pola garis lurus yang terbentuk tidak seutuhnya mengikuti pola sebaran data. Selanjutnya, perlu dilakukan pemilihan model regresi non-linier terbaik

Secara Empiris

#Derajat 1
set.seed(321)

cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_linear <- map_dfr(cross_val$splits, function(x){
  mod <- lm(mpg ~ acceleration, data=Auto[x$in_id,])
  pred <- predict(mod,newdata=Auto[-x$in_id,])
  truth <- Auto[-x$in_id,]$mpg
  rmse <- mlr3measures::rmse(truth = truth, response = pred)
  mae <- mlr3measures::mae(truth = truth,response = pred)
  metric <- c(rmse,mae)
  names(metric) <- c("rmse","mae")
  return(metric)
  }
)

# Rata-rata RMSE & MAE untuk 10 folds
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear
##     rmse      mae 
## 7.073565 5.823831

Regresi Polinomial

Secara Visual

#plot degree 2-7
d1 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,2,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 2")+
  theme_classic()

d2 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,3,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 3")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,4,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 4")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,5,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 5")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,6,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 6")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,7,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 7")+
  theme_classic()
d7 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,8,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 8")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,9,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 9")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,10,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 10")+
  theme_classic()

grid.arrange(d1,d2,d3,d4,d5,d6,d7,d8,d9, nrow=3,ncol=3,
             top = textGrob("Plot Regresi Polinomial",gp=gpar(fontsize=15,font=1)))

Secara visual, model regresi polinomial dengan derajat 4 sampai derajat 8 cenderung lebih mengikuti pola sebaran data dibandingkan model regresi polinomial dengan derajat 2,3, 8 dan 9. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi polinomial tersebut.

Secara Empiris

# Regresi polinomial derajat 2-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 2:10

best_poly <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x){
                           mod <- lm(mpg ~ poly(acceleration,derajat=i), data=Auto[x$in_id,])
                           pred <- predict(mod,newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth,response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                           }
                         )
  metric_poly
  
# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly 
  }
)
best_poly <- cbind(derajat=derajat,best_poly) 
best_poly
##   derajat     rmse      mae
## 1       2 7.026192 5.760824
## 2       3 7.047178 5.803701
## 3       4 6.986808 5.683944
## 4       5 7.011349 5.719472
## 5       6 7.005414 5.695174
## 6       7 7.080182 5.736693
## 7       8 7.161613 5.790400
## 8       9 7.129490 5.740258
## 9      10 7.124493 5.729222
best_poly$derajat <- as.factor(best_poly$derajat) 
prmse<-ggplot(data=best_poly, aes(x=derajat, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Root Mean Square Error")

best_poly$derajat <- as.factor(best_poly$derajat) 
pmae<-ggplot(data=best_poly, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE dan MAE minimum terdapat pada model regresi polinomial dengan derajat 4.

#Prediksi
prediksi.poly4 = predict( glm(mpg ~ poly(acceleration,4)) , data.frame(acceleration), interval="predict" )
#Plot
plot(acceleration, mpg, pch=19, col="blue",
     xlab="Time to Accerate", ylab="Mile per Gallon",
     main="Regresi Polinomial Derajat 4")
ix <- sort(acceleration,index.return=T)$ix
lines(acceleration[ix], prediksi.poly4[ix], 
      main = "Polynomial Regression Derajat 4", col="red",
      type="l", lwd=2)

Fungsi Tangga

Secara Visual

#plot interval 2-10
d2 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,2,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 2")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,3,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 3")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,4,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 4")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,5,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 5")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,6,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 6")+
  theme_classic()

d7 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,7,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 7")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,8,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 8")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,9,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 9")+
  theme_classic()

d10 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,10,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 10")+
  theme_classic()

grid.arrange(d2,d3,d4,d5,d6,d7,d8,d9,d10,nrow=3,ncol=3,
             top = textGrob("Plot Regresi Fungsi Tangga",gp=gpar(fontsize=15,font=1)))

Secara visual, fungsi tangga dengan interval 3 dan 4 cenderung lebih mengikuti pola sebaran data dibandingkan fungsi tangga dengan interval lainnya. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model fungsi tangga.

Secara Empiris

# Fungsi tangga interval 2-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

interval <- 2:10

best_tangga <- map_dfr(interval, function(i){ 
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$acceleration <- cut(training$acceleration,i)
                             mod <- lm(mpg ~ acceleration, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_interval <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1",labs_x))) 
                             testing <- Auto[-x$in_id,]
                             acceleration_new <- cut(testing$acceleration,c(labs_x_interval[1,1],labs_x_interval[,2])) 
                             pred <-predict(mod, newdata=list(acceleration=acceleration_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred) 
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred) 
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                             }
                           )
  metric_tangga

# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_tangga <- colMeans(metric_tangga)
  mean_metric_tangga 
  }
)
best_tangga <- cbind(interval=interval,best_tangga)
best_tangga
##   interval     rmse      mae
## 1        2 7.538018 6.334114
## 2        3 6.951984 5.625622
## 3        4 7.229533 5.886147
## 4        5 7.203617 5.977914
## 5        6 6.974349 5.656407
## 6        7 7.192183 5.865866
## 7        8 7.038860 5.697161
## 8        9 7.024980 5.685807
## 9       10 7.095391 5.811114
best_tangga$interval <- as.factor(best_tangga$interval) 
prmse<-ggplot(data=best_tangga, aes(x=interval, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Interval") + 
  ylab("Root Mean Square Error")

best_tangga$interval <- as.factor(best_tangga$interval) 
pmae<-ggplot(data=best_tangga, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Interval") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE dan MAE minimum terdapat pada fungsi tangga dengan interval 3

#Prediksi
prediksi.cut3 = predict( glm(mpg ~ cut(acceleration,3)) , data.frame(horsepower), interval="predict" )
#Plot
plot(acceleration, mpg, pch=19, col="blue",
     xlab="Time to Accelerate", ylab="Mile per Gallon",
     main="Fungsi Tangga Interval 7 dan 8")
ix <- sort(acceleration,index.return=T)$ix
lines(acceleration[ix], prediksi.cut3[ix], 
      main = "Fungsi Tangga Interval 7", col="red",
      type="l", lwd=2)

Regresi Spline

Secara Visual

bs <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("b-Splines") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~bs(x, knots = c(11,15,20,25)), 
              lty = 1,se = F)

ns <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("Natural Splines") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~ns(x, knots = c(11,15,20,25)), 
              lty = 1,se = F)

plot_grid(bs, ns, label_size = 6)
## Warning in predict.lm(model, newdata = new_data_frame(list(x = xseq)), se.fit =
## se, : prediction from a rank-deficient fit may be misleading
## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 1)

Secara Empiris (B-Spline)

# B-Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:10

best_bspline <- map_dfr(df, function(i){
  bspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ bs(acceleration,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  bspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_bspline <- colMeans(bspline)
  mean_bspline 
  }
)
best_bspline <- cbind(df=df,best_bspline)
best_bspline
##   df     rmse      mae
## 1  2 7.047178 5.803701
## 2  3 7.047178 5.803701
## 3  4 6.968859 5.674078
## 4  5 6.990827 5.694131
## 5  6 7.001749 5.692412
## 6  7 6.990255 5.635431
## 7  8 7.006893 5.661165
## 8  9 7.017205 5.699507
## 9 10 6.994071 5.660316
best_bspline %>% slice_min(rmse)
##   df     rmse      mae
## 1  4 6.968859 5.674078
best_bspline %>% slice_min(mae)
##   df     rmse      mae
## 1  7 6.990255 5.635431
#menentukan knot berdasarkan fungsi basis
df_4<-attr(bs(Auto$acceleration, df=4),"knots") 
df_4
##  50% 
## 15.5
#menentukan knot berdasarkan fungsi basis
df_7<-attr(bs(Auto$acceleration, df=7),"knots") 
df_7
##   20%   40%   60%   80% 
## 13.42 14.80 16.00 17.70
bs4 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~bs(x, df=4),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("B-Spline With DF = 4") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(11,15,20,25), col="grey50", lty=2)

bs7 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~bs(x, df=7),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("B-Spline With DF = 4") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(11,15,20,25), col="grey50", lty=2)

plot_grid(bs4, bs7, label_size = 6)

Secara Empiris (Natural Spline)

# Natural Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_nspline <- map_dfr(df, function(i){
  nspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ ns(acceleration,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  nspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_nspline <- colMeans(nspline)
  mean_nspline 
  }
)
best_nspline <- cbind(df=df,best_nspline)
best_nspline
##    df     rmse      mae
## 1   2 7.008099 5.745162
## 2   3 7.022494 5.773451
## 3   4 6.955818 5.671915
## 4   5 6.954930 5.627050
## 5   6 6.963101 5.644851
## 6   7 6.958731 5.651751
## 7   8 6.940616 5.619744
## 8   9 6.967658 5.642248
## 9  10 6.971402 5.654179
## 10 11 7.021230 5.687510
## 11 12 7.026188 5.701527
## 12 13 7.042842 5.713865
## 13 14 7.058619 5.717293
## 14 15 7.082357 5.740639
best_nspline %>% slice_min(rmse)
##   df     rmse      mae
## 1  8 6.940616 5.619744
best_nspline %>% slice_min(mae)
##   df     rmse      mae
## 1  8 6.940616 5.619744
#menentukan knot berdasarkan fungsi basis
df_8<-attr(ns(Auto$acceleration, df=8),"knots") 
df_8
##   12.5%     25%   37.5%     50%   62.5%     75%   87.5% 
## 12.5000 13.7750 14.5000 15.5000 16.2000 17.0250 18.7125
ns8 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~ns(x, df=8),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("Natural Spline With DF = 8") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(11,15,20,25), col="grey50", lty=2)
ns8

Secara empiris, model regresi natural spline dengan df=8 lebih baik dalam merepresentasikan atau menjelaskan pola hubungan antara MPG dengan Time to Accelerate from 0 to 60 mph dibandingkan model regresi b-spline dengan df=4 maupun b-spline dengan df=7 karena nilai RMSE dan MAE natural spline dengan df=8 lebih kecil dibandingkan b-spline.

Perbandingan Semua Model

nilai_rmse <- rbind (7.073565,6.986808,6.951984,6.968859,6.990255,6.940616 ) 
nilai_mae <- rbind (5.823831,5.683944,5.625622,5.674078,5.635431,5.619744)
nama_model <- c("Linier", "Polinomial (derajat=4)", "Fungsi Tangga (interval=3)","B-spline (df4)","B-spline (df7)","Natural spline  (df=8)")
model <- data.frame(nama_model,nilai_rmse,nilai_mae)
model
##                   nama_model nilai_rmse nilai_mae
## 1                     Linier   7.073565  5.823831
## 2     Polinomial (derajat=4)   6.986808  5.683944
## 3 Fungsi Tangga (interval=3)   6.951984  5.625622
## 4             B-spline (df4)   6.968859  5.674078
## 5             B-spline (df7)   6.990255  5.635431
## 6     Natural spline  (df=8)   6.940616  5.619744

Secara empiris, model regresi natural spline dengan df=8 merupakan model terbaik dalam merepresentasikan data, karena dibandingkan dengan model lain, natural spline dengan df=12 memiliki RMSE dan MAE paling minimum. Sehingga untuk melihat pola hubungan antara MPG dengan Time to Accelerate from 0 to 60 mph lebih baik menggunakan model regresi natural spline dengan df 8 dengan visualisasi sebagai berikut:

MPG VS Vehicle weight

Regresi Linier

Secara Visual

ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
   stat_smooth(method = "lm", 
               formula = y~x,lty = 1,
               col = "red",se = F)+
  theme_bw()

Secara visual, model regresi linier kurang baik dalam merepresentasikan atau menjelaskan pola hubungan MPG dengan vehicle weight karena pola garis lurus yang terbentuk tidak seutuhnya mengikuti pola sebaran data. Selanjutnya, perlu dilakukan pemilihan model regresi non-linier terbaik

Secara Empiris

#Derajat 1
set.seed(321)

cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_linear <- map_dfr(cross_val$splits, function(x){
  mod <- lm(mpg ~ weight, data=Auto[x$in_id,])
  pred <- predict(mod,newdata=Auto[-x$in_id,])
  truth <- Auto[-x$in_id,]$mpg
  rmse <- mlr3measures::rmse(truth = truth, response = pred)
  mae <- mlr3measures::mae(truth = truth,response = pred)
  metric <- c(rmse,mae)
  names(metric) <- c("rmse","mae")
  return(metric)
  }
)

# Rata-rata RMSE & MAE untuk 10 folds
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear
##     rmse      mae 
## 4.299574 3.282486

Regresi Polinomial

Secara Visual

#plot degree 2-7
d1 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,2,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 2")+
  theme_classic()

d2 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,3,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 3")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,4,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 4")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,5,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 5")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,6,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 6")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,7,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 7")+
  theme_classic()
d7 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,8,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 8")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,9,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 9")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,10,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 10")+
  theme_classic()

grid.arrange(d1,d2,d3,d4,d5,d6,d7,d8,d9, nrow=3,ncol=3,
             top = textGrob("Plot Regresi Polinomial",gp=gpar(fontsize=15,font=1)))

Secara visual, model regresi polinomial dengan derajat 2 sampai derajat 4 cenderung lebih mengikuti pola sebaran data dibandingkan model regresi polinomial dengan derajat 5 sampai dengan 10. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi polinomial tersebut.

Secara Empiris

# Regresi polinomial derajat 1-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 1:10

best_poly <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x){
                           mod <- lm(mpg ~ poly(weight,derajat=i), data=Auto[x$in_id,])
                           pred <- predict(mod,newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth,response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                           }
                         )
  metric_poly
  
# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly 
  }
)
best_poly <- cbind(derajat=derajat,best_poly) 
best_poly
##    derajat     rmse      mae
## 1        1 4.299574 3.282486
## 2        2 4.140466 3.060826
## 3        3 4.145446 3.065768
## 4        4 4.151575 3.066187
## 5        5 4.150254 3.067449
## 6        6 4.159668 3.086962
## 7        7 4.154581 3.090445
## 8        8 4.190959 3.117152
## 9        9 4.299573 3.178586
## 10      10 4.178087 3.117666
best_poly$derajat <- as.factor(best_poly$derajat) 
prmse<-ggplot(data=best_poly, aes(x=derajat, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Root Mean Square Error")

best_poly$derajat <- as.factor(best_poly$derajat) 
pmae<-ggplot(data=best_poly, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE dan MAE minimum terdapat pada model regresi polinomial dengan derajat 2.

#Prediksi
prediksi.poly2 = predict( glm(mpg ~ poly(weight,2)) , data.frame(weight), interval="predict" )
#Plot
plot(weight, mpg, pch=19, col="blue",
     xlab="Vehicle Weight", ylab="Mile per Gallon",
     main="Regresi Polinomial Derajat 2")
ix <- sort(weight,index.return=T)$ix
lines(weight[ix], prediksi.poly2[ix], 
      main = "Polynomial Regression Derajat 2", col="red",
      type="l", lwd=2)

Untuk melihat pola hubungan data antara MPG dengan vehicle weight lebih cenderung mengikuti pola polynomial atau spline, selanjutnya untuk memilih model terbaik untuk merepresentasikan pola hubungan MPG dengan vehicle weight dilakukan dengan membandingkan terhadap regresi spline.

Regresi Spline

Secara Visual

bs <- ggplot(Auto,aes(x=weight, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("b-Splines") +
  xlab("Vehicle Weight") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~bs(x, knots = c(2000,3000,4000,5000)), 
              lty = 1,se = F)

ns <- ggplot(Auto,aes(x=weight, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("Natural Splines") +
  xlab("Vehicle Weight") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~ns(x, knots = c(2000,3000,4000,5000)), 
              lty = 1,se = F)

plot_grid(bs, ns, label_size = 6)

Secara visual, model regresi natural spline cenderung lebih mengikuti pola sebaran data dibandingkan model regresi b-spline. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi spline tersebut.

Secara Empiris (B-Spline)

# B-Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:10

best_bspline <- map_dfr(df, function(i){
  bspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ bs(weight,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  bspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_bspline <- colMeans(bspline)
  mean_bspline 
  }
)
best_bspline <- cbind(df=df,best_bspline)
best_bspline
##   df     rmse      mae
## 1  2 4.145446 3.065768
## 2  3 4.145446 3.065768
## 3  4 4.152022 3.065852
## 4  5 4.148745 3.077351
## 5  6 4.150613 3.084954
## 6  7 4.190345 3.108026
## 7  8 4.178610 3.109603
## 8  9 4.169315 3.077254
## 9 10 4.450239 3.195706
best_bspline %>% slice_min(rmse)
##   df     rmse      mae
## 1  2 4.145446 3.065768
## 2  3 4.145446 3.065768
best_bspline %>% slice_min(mae)
##   df     rmse      mae
## 1  2 4.145446 3.065768
## 2  3 4.145446 3.065768
#menentukan knot berdasarkan fungsi basis
df_3<-attr(bs(Auto$weight, df=3),"knots") 
df_3
## numeric(0)
bs3 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~bs(x, df=3),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("B-Spline With DF = 3") +
  xlab("Vehicle Weight") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2000,3000,4000,5000), col="grey50", lty=2)
bs3

Secara Empiris (Natural Spline)

# Natural Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_nspline <- map_dfr(df, function(i){
  nspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ ns(weight,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  nspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_nspline <- colMeans(nspline)
  mean_nspline 
  }
)
best_nspline <- cbind(df=df,best_nspline)
best_nspline
##    df     rmse      mae
## 1   2 4.138968 3.059081
## 2   3 4.146196 3.061158
## 3   4 4.155606 3.066985
## 4   5 4.170902 3.091928
## 5   6 4.159089 3.093683
## 6   7 4.118805 3.041155
## 7   8 4.153896 3.083712
## 8   9 4.174299 3.088892
## 9  10 4.142906 3.064658
## 10 11 4.152285 3.069388
## 11 12 4.159216 3.077669
## 12 13 4.177854 3.091879
## 13 14 4.180087 3.099473
## 14 15 4.187073 3.109754
best_nspline %>% slice_min(rmse)
##   df     rmse      mae
## 1  7 4.118805 3.041155
best_nspline %>% slice_min(mae)
##   df     rmse      mae
## 1  7 4.118805 3.041155
#menentukan knot berdasarkan fungsi basis
df_7<-attr(ns(Auto$weight, df=7),"knots") 
df_7
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  2074.857  2285.429  2635.000  2986.571  3446.143  4096.286
ns7 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~ns(x, df=7),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("Natural Spline With DF = 7") +
  xlab("Vehicle Weight") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2000,3000,4000,5000), col="grey50", lty=2)
ns7

Secara empiris, model regresi natural spline dengan df=7 lebih baik dalam merepresentasikan atau menjelaskan pola hubungan antara MPG dengan vehicle weight dibandingkan model regresi b-spline dengan df=3 karena nilai RMSE dan MAE natural spline dengan df=7 lebih kecil dibandingkan b-spline.

Perbandingan Semua Model

nilai_rmse <- rbind (4.299574,4.140466,4.145446,4.118805) 
nilai_mae <- rbind (3.282486,3.060826,3.065768,3.041155)
nama_model <- c("Linier", "Polinomial (derajat=2)","B-spline (df3)","Natural spline  (df=7)")
model <- data.frame(nama_model,nilai_rmse,nilai_mae)
model
##               nama_model nilai_rmse nilai_mae
## 1                 Linier   4.299574  3.282486
## 2 Polinomial (derajat=2)   4.140466  3.060826
## 3         B-spline (df3)   4.145446  3.065768
## 4 Natural spline  (df=7)   4.118805  3.041155

Secara empiris, model regresi natural spline dengan df=7 merupakan model terbaik dalam merepresentasikan data, karena dibandingkan dengan model lain, natural spline dengan df=7 memiliki RMSE dan MAE paling minimum. Sehingga untuk melihat pola hubungan antara MPG dengan vehicle weight lebih baik menggunakan model regresi natural spline dengan df 7 dengan visualisasi sebagai berikut:

Kesimpulan

  1. Model yang paling tepat menjelaskan dan merepresentasikan pola hubungan mpg dan variabel prediktor Engine horsepower adalah model regresi natural spline dengan df 12
  2. Model yang paling tepat menjelaskan dan merepresentasikan pola hubungan mpg dan variabel prediktor Time to Accelerate from 0 to 60 mph adalah model regresi natural spline dengan df 8
  3. Model yang paling tepat menjelaskan dan merepresentasikan pola hubungan mpg dan variabel prediktor Vehicle weight adalah model regresi natural spline dengan df 7