Non-Linier Regression

Data

Data yang digunakan adalah data Auto yang terdapat dalam package ISLR dengan peubah respon mpg dan variabel prediktor Engine horsepower, Time to Accelerate from 0 to 60 mph dan Vehicle weight

Untuk menjawab soal 2, Prosedur dalam pemilihan model terbaik dilakukan dengan cara melihat masing-masing pola hubungan data secara visual dan empiris dengan membandingkan nilai RMSE dan MAE dari ketiga model berikut:
1. Regresi Linier
2. Regresi Polinomial
3. Fungsi Tangga
4. Regresi Splines

Soal 1 : Melihat Pola Hubungan Data

g1 <- ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
g2 <- ggplot(Auto, aes(x = acceleration, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
g3 <- ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
grid.arrange(g1, g2, g3, ncol = 1, nrow=3)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Berdasarkan visualisasi diatas dapat diambil kesimpulan sementara bahwa:
* Terdapat hubungan non-linier antara peubah respon mpg dengan variabel prediktor engine horsepower
* Terdapat hubungan non-linier antara peubah respon mpg dengan variabel prediktor time to accelerate from 0 to 60 mph
* Terdapat hubungan non-linier antara peubah respon mpg dengan variabel prediktor vehicle weight

Soal 2 : Menentukan Model Non-Linier Terbaik

MPG VS Engine Displacement

Regresi Linier

Secara Visual

ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
   stat_smooth(method = "lm", 
               formula = y~x,lty = 1,
               col = "red",se = F)+
  theme_bw()

Secara visual, model regresi linier kurang baik dalam merepresentasikan atau menjelaskan pola hubungan MPG dengan Engine horsepower karena pola garis lurus yang terbentuk tidak seutuhnya mengikuti pola sebaran data. Selanjutnya, perlu dilakukan pemilihan model regresi non-linier terbaik

Secara Empiris

#Derajat 1
set.seed(321)

cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_linear <- map_dfr(cross_val$splits, function(x){
  mod <- lm(mpg ~ horsepower, data=Auto[x$in_id,])
  pred <- predict(mod,newdata=Auto[-x$in_id,])
  truth <- Auto[-x$in_id,]$mpg
  rmse <- mlr3measures::rmse(truth = truth, response = pred)
  mae <- mlr3measures::mae(truth = truth,response = pred)
  metric <- c(rmse,mae)
  names(metric) <- c("rmse","mae")
  return(metric)
  }
)

# Rata-rata RMSE & MAE untuk 10 folds
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear

##     rmse      mae 
## 4.872408 3.843361

Regresi Polinomial

Secara Visual

#plot degree 2-7
d1 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,2,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 2")+
  theme_classic()

d2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,3,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 3")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,4,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 4")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,5,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 5")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,6,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 6")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,7,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 7")+
  theme_classic()
d7 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,8,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 8")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,9,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 9")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,10,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 10")+
  theme_classic()

grid.arrange(d1,d2,d3,d4,d5,d6,d7,d8,d9, nrow=3,ncol=3,
             top = textGrob("Plot Regresi Polinomial",gp=gpar(fontsize=15,font=1)))

Secara visual, model regresi polinomial dengan derajat 7 sampai derajat 10 cenderung lebih mengikuti pola sebaran data dibandingkan model regresi polinomial dengan derajat 2 sampai derajat 6. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi polinomial tersebut.

Secara Empiris

# Regresi polinomial derajat 2-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 2:10

best_poly <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x){
                           mod <- lm(mpg ~ poly(horsepower,derajat=i), data=Auto[x$in_id,])
                           pred <- predict(mod,newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth,response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                           }
                         )
  metric_poly
  
# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly 
  }
)
best_poly <- cbind(derajat=derajat,best_poly) 
best_poly

##   derajat     rmse      mae
## 1       2 4.334171 3.264925
## 2       3 4.345833 3.270972
## 3       4 4.351042 3.286527
## 4       5 4.289402 3.242422
## 5       6 4.273654 3.241089
## 6       7 4.257811 3.225156
## 7       8 4.265711 3.237336
## 8       9 4.271502 3.232318
## 9      10 4.303605 3.254084

best_poly$derajat <- as.factor(best_poly$derajat) 
prmse<-ggplot(data=best_poly, aes(x=derajat, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Root Mean Square Error")

best_poly$derajat <- as.factor(best_poly$derajat) 
pmae<-ggplot(data=best_poly, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE dan MAE minimum terdapat pada model regresi polinomial dengan derajat 7.

#Prediksi
prediksi.poly7 = predict( glm(mpg ~ poly(horsepower,7)) , data.frame(horsepower), interval="predict" )

#Plot
plot(horsepower, mpg, pch=19, col="blue",
     xlab="Engine Horsepower", ylab="Mile per Gallon",
     main="Regresi Polinomial Derajat 7")
ix <- sort(horsepower,index.return=T)$ix
lines(horsepower[ix], prediksi.poly7[ix], 
      main = "Polynomial Regression Derajat 7", col="red",
      type="l", lwd=2)

Fungsi Tangga

Secara Visual

#plot interval 2-10
d2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,2,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 2")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,3,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 3")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,4,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 4")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,5,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 5")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,6,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 6")+
  theme_classic()

d7 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,7,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 7")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,8,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 8")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,9,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 9")+
  theme_classic()

d10 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,10,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 10")+
  theme_classic()

grid.arrange(d2,d3,d4,d5,d6,d7,d8,d9,d10,nrow=3,ncol=3,
             top = textGrob("Plot Regresi Fungsi Tangga",gp=gpar(fontsize=15,font=1)))

Secara visual, fungsi tangga dengan interval 7 sampai derajat 10 cenderung lebih mengikuti pola sebaran data dibandingkan fungsi tangga dengan interval 2 sampai derajat 6. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model fungsi tangga.

Secara Empiris

# Fungsi tangga interval 2-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

interval <- 2:10

best_tangga <- map_dfr(interval, function(i){ 
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$horsepower <- cut(training$horsepower,i)
                             mod <- lm(mpg ~ horsepower, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_interval <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1",labs_x))) 
                             testing <- Auto[-x$in_id,]
                             horsepower_new <- cut(testing$horsepower,c(labs_x_interval[1,1],labs_x_interval[,2])) 
                             pred <-predict(mod, newdata=list(horsepower=horsepower_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred) 
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred) 
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                             }
                           )
  metric_tangga

# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_tangga <- colMeans(metric_tangga)
  mean_metric_tangga 
  }
)
best_tangga <- cbind(interval=interval,best_tangga)
best_tangga

##   interval     rmse      mae
## 1        2 6.142228 4.798645
## 2        3 5.748515 4.519001
## 3        4 4.951940 3.788360
## 4        5 4.667772 3.582610
## 5        6 4.638192 3.549758
## 6        7 4.433535 3.355719
## 7        8 4.393438 3.404443
## 8        9 4.512399 3.443004
## 9       10 4.531132 3.423550

best_tangga$interval <- as.factor(best_tangga$interval) 
prmse<-ggplot(data=best_tangga, aes(x=interval, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Interval") + 
  ylab("Root Mean Square Error")

best_tangga$interval <- as.factor(best_tangga$interval) 
pmae<-ggplot(data=best_tangga, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Interval") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE minimum terdapat pada fungsi tangga dengan interval 8 sedangkan MAE minimum terdapat pada fungsi tangga dengan interval 7

#Prediksi
prediksi.cut7 = predict( glm(mpg ~ cut(horsepower,7)) , data.frame(horsepower), interval="predict" )
prediksi.cut8 = predict( glm(mpg ~ cut(horsepower,8)) , data.frame(horsepower), interval="predict")

#Plot
plot(horsepower, mpg, pch=19, col="blue",
     xlab="Engine Horsepower", ylab="Mile per Gallon",
     main="Fungsi Tangga Interval 7 dan 8")
ix <- sort(horsepower,index.return=T)$ix
lines(horsepower[ix], prediksi.cut7[ix], 
      main = "Fungsi Tangga Interval 7", col="red",
      type="l", lwd=2)
lines(horsepower[ix], prediksi.cut8[ix], 
      main = "Fungsi Tangga Interval 8", col="red",
      type="l", lwd=2)

Regresi Spline

Secara Visual

bs <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("b-Splines") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~bs(x, knots = c(90,120,150,180)), 
              lty = 1,se = F)

ns <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("Natural Splines") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~ns(x, knots = c(90,120,150,180)), 
              lty = 1,se = F)

plot_grid(bs, ns, label_size = 6)

Secara visual, model regresi natural spline cenderung lebih mengikuti pola sebaran data dibandingkan model regresi b-splie. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi spline tersebut.

Secara Empiris (B-Spline)

# B-Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:10

best_bspline <- map_dfr(df, function(i){
  bspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ bs(horsepower,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  bspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_bspline <- colMeans(bspline)
  mean_bspline 
  }
)
best_bspline <- cbind(df=df,best_bspline)
best_bspline

##   df     rmse      mae
## 1  2 4.345833 3.270972
## 2  3 4.345833 3.270972
## 3  4 4.300196 3.249076
## 4  5 4.252369 3.218942
## 5  6 4.267055 3.231918
## 6  7 4.280071 3.239350
## 7  8 4.288528 3.240305
## 8  9 4.277210 3.230383
## 9 10 4.294385 3.243836

best_bspline %>% slice_min(rmse)

##   df     rmse      mae
## 1  5 4.252369 3.218942

best_bspline %>% slice_min(mae)

##   df     rmse      mae
## 1  5 4.252369 3.218942

#menentukan knot berdasarkan fungsi basis
df_5<-attr(bs(Auto$horsepower, df=5),"knots") 
df_5

## 33.33333% 66.66667% 
##        84       110

bs <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~bs(x, df=5),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("B-Spline With DF = 5") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(90,120,150,180), col="grey50", lty=2)
bs

Secara Empiris (Natural Spline)

# Natural Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_nspline <- map_dfr(df, function(i){
  nspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ ns(horsepower,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  nspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_nspline <- colMeans(nspline)
  mean_nspline 
  }
)
best_nspline <- cbind(df=df,best_nspline)
best_nspline

##    df     rmse      mae
## 1   2 4.316420 3.252256
## 2   3 4.339172 3.270060
## 3   4 4.298632 3.248019
## 4   5 4.275662 3.231437
## 5   6 4.277158 3.232881
## 6   7 4.249823 3.213396
## 7   8 4.271403 3.230483
## 8   9 4.255107 3.213574
## 9  10 4.235094 3.183301
## 10 11 4.257970 3.208380
## 11 12 4.244159 3.182543
## 12 13 4.245239 3.193597
## 13 14 4.255621 3.214075
## 14 15 4.250686 3.225731

best_nspline %>% slice_min(rmse)

##   df     rmse      mae
## 1 10 4.235094 3.183301

best_nspline %>% slice_min(mae)

##   df     rmse      mae
## 1 12 4.244159 3.182543

#menentukan knot berdasarkan fungsi basis
df_10<-attr(ns(Auto$horsepower, df=10),"knots") 
df_10

##   10%   20%   30%   40%   50%   60%   70%   80%   90% 
##  67.0  72.0  80.0  88.0  93.5 100.0 110.0 140.0 157.7

df_12<-attr(ns(Auto$horsepower, df=12),"knots") 
df_12

## 8.333333% 16.66667%       25% 33.33333% 41.66667%       50% 58.33333% 66.66667% 
##   65.0000   70.0000   75.0000   84.0000   88.0000   93.5000  100.0000  110.0000 
##       75% 83.33333% 91.66667% 
##  126.0000  150.0000  165.8333

ns10 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~ns(x, df=10),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("Natural Spline With DF = 10") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(90,120,150,180), col="grey50", lty=2)

ns12 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~ns(x, df=12),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("Natural Spline With DF = 12") +
  xlab("Horse Power") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(90,120,150,180), col="grey50", lty=2)

plot_grid(ns10, ns12, label_size = 6)

Secara empiris, model regresi natural spline dengan df=12 lebih baik dalam merepresentasikan atau menjelaskan pola hubungan antara MPG dengan Engine Horse Power dibandingkan model regresi natural spline dengan df=10 maupun b-spline dengan df=5 karena nilai RMSE dan MAE natural spline dengan df=12 lebih kecil dibandingkan b-spline.

Perbandingan Semua Model

nilai_rmse <- rbind (4.872408,4.257811,4.433535, 4.393438, 4.252369, 4.235094, 4.244159) 
nilai_mae <- rbind (3.843361,3.225156,3.355719,3.404443, 3.218942,3.183301, 3.182543)
nama_model <- c("Linier", "Polinomial (derajat=7)", "Fungsi Tangga (interval=8)","Fungsi Tangga (interval=8)", "B-spline (df=5)", "Natural spline  (df=10)", "Natural spline  (df=12)")
model <- data.frame(nama_model,nilai_rmse,nilai_mae)
model

##                   nama_model nilai_rmse nilai_mae
## 1                     Linier   4.872408  3.843361
## 2     Polinomial (derajat=7)   4.257811  3.225156
## 3 Fungsi Tangga (interval=8)   4.433535  3.355719
## 4 Fungsi Tangga (interval=8)   4.393438  3.404443
## 5            B-spline (df=5)   4.252369  3.218942
## 6    Natural spline  (df=10)   4.235094  3.183301
## 7    Natural spline  (df=12)   4.244159  3.182543

Secara empiris, model regresi natural spline dengan df=12 merupakan model terbaik dalam merepresentasikan data, karena dibandingkan dengan model lain, natural spline dengan df=12 memiliki RMSE dan MAE paling minimum. Sehingga untuk melihat pola hubungan antara MPG dengan Engine horsepower lebih baik menggunakan model regresi natural spline dengan df 12 dengan visualisasi sebagai berikut:

MPG VS Time to Accelerate from 0 to 60 mph

Regresi Linier

Secara Visual

ggplot(Auto,aes(x=acceleration, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
   stat_smooth(method = "lm", 
               formula = y~x,lty = 1,
               col = "red",se = F)+
  theme_bw()

Secara visual, model regresi linier kurang baik dalam merepresentasikan atau menjelaskan pola hubungan MPG dengan Time to Accelerate from 0 to 60 mph karena pola garis lurus yang terbentuk tidak seutuhnya mengikuti pola sebaran data. Selanjutnya, perlu dilakukan pemilihan model regresi non-linier terbaik

Secara Empiris

#Derajat 1
set.seed(321)

cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_linear <- map_dfr(cross_val$splits, function(x){
  mod <- lm(mpg ~ acceleration, data=Auto[x$in_id,])
  pred <- predict(mod,newdata=Auto[-x$in_id,])
  truth <- Auto[-x$in_id,]$mpg
  rmse <- mlr3measures::rmse(truth = truth, response = pred)
  mae <- mlr3measures::mae(truth = truth,response = pred)
  metric <- c(rmse,mae)
  names(metric) <- c("rmse","mae")
  return(metric)
  }
)

# Rata-rata RMSE & MAE untuk 10 folds
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear

##     rmse      mae 
## 7.073565 5.823831

Regresi Polinomial

Secara Visual

#plot degree 2-7
d1 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,2,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 2")+
  theme_classic()

d2 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,3,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 3")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,4,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 4")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,5,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 5")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,6,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 6")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,7,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 7")+
  theme_classic()
d7 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,8,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 8")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,9,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 9")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,10,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 10")+
  theme_classic()

grid.arrange(d1,d2,d3,d4,d5,d6,d7,d8,d9, nrow=3,ncol=3,
             top = textGrob("Plot Regresi Polinomial",gp=gpar(fontsize=15,font=1)))

Secara visual, model regresi polinomial dengan derajat 4 sampai derajat 8 cenderung lebih mengikuti pola sebaran data dibandingkan model regresi polinomial dengan derajat 2,3, 8 dan 9. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi polinomial tersebut.

Secara Empiris

# Regresi polinomial derajat 2-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 2:10

best_poly <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x){
                           mod <- lm(mpg ~ poly(acceleration,derajat=i), data=Auto[x$in_id,])
                           pred <- predict(mod,newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth,response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                           }
                         )
  metric_poly
  
# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly 
  }
)
best_poly <- cbind(derajat=derajat,best_poly) 
best_poly

##   derajat     rmse      mae
## 1       2 7.026192 5.760824
## 2       3 7.047178 5.803701
## 3       4 6.986808 5.683944
## 4       5 7.011349 5.719472
## 5       6 7.005414 5.695174
## 6       7 7.080182 5.736693
## 7       8 7.161613 5.790400
## 8       9 7.129490 5.740258
## 9      10 7.124493 5.729222

best_poly$derajat <- as.factor(best_poly$derajat) 
prmse<-ggplot(data=best_poly, aes(x=derajat, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Root Mean Square Error")

best_poly$derajat <- as.factor(best_poly$derajat) 
pmae<-ggplot(data=best_poly, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE dan MAE minimum terdapat pada model regresi polinomial dengan derajat 4.

#Prediksi
prediksi.poly4 = predict( glm(mpg ~ poly(acceleration,4)) , data.frame(acceleration), interval="predict" )

#Plot
plot(acceleration, mpg, pch=19, col="blue",
     xlab="Time to Accerate", ylab="Mile per Gallon",
     main="Regresi Polinomial Derajat 4")
ix <- sort(acceleration,index.return=T)$ix
lines(acceleration[ix], prediksi.poly4[ix], 
      main = "Polynomial Regression Derajat 4", col="red",
      type="l", lwd=2)

Fungsi Tangga

Secara Visual

#plot interval 2-10
d2 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,2,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 2")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,3,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 3")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,4,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 4")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,5,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 5")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,6,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 6")+
  theme_classic()

d7 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,7,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 7")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,8,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 8")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,9,raw=T), 
              lty = 1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 9")+
  theme_classic()

d10 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~cut(x,10,raw=T), 
              lty=1,lwd=2, col = "red",se = F)+
  ggtitle("Interval 10")+
  theme_classic()

grid.arrange(d2,d3,d4,d5,d6,d7,d8,d9,d10,nrow=3,ncol=3,
             top = textGrob("Plot Regresi Fungsi Tangga",gp=gpar(fontsize=15,font=1)))

Secara visual, fungsi tangga dengan interval 3 dan 4 cenderung lebih mengikuti pola sebaran data dibandingkan fungsi tangga dengan interval lainnya. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model fungsi tangga.

Secara Empiris

# Fungsi tangga interval 2-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

interval <- 2:10

best_tangga <- map_dfr(interval, function(i){ 
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$acceleration <- cut(training$acceleration,i)
                             mod <- lm(mpg ~ acceleration, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_interval <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1",labs_x))) 
                             testing <- Auto[-x$in_id,]
                             acceleration_new <- cut(testing$acceleration,c(labs_x_interval[1,1],labs_x_interval[,2])) 
                             pred <-predict(mod, newdata=list(acceleration=acceleration_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred) 
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred) 
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                             }
                           )
  metric_tangga

# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_tangga <- colMeans(metric_tangga)
  mean_metric_tangga 
  }
)
best_tangga <- cbind(interval=interval,best_tangga)
best_tangga

##   interval     rmse      mae
## 1        2 7.538018 6.334114
## 2        3 6.951984 5.625622
## 3        4 7.229533 5.886147
## 4        5 7.203617 5.977914
## 5        6 6.974349 5.656407
## 6        7 7.192183 5.865866
## 7        8 7.038860 5.697161
## 8        9 7.024980 5.685807
## 9       10 7.095391 5.811114

best_tangga$interval <- as.factor(best_tangga$interval) 
prmse<-ggplot(data=best_tangga, aes(x=interval, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Interval") + 
  ylab("Root Mean Square Error")

best_tangga$interval <- as.factor(best_tangga$interval) 
pmae<-ggplot(data=best_tangga, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Interval") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE dan MAE minimum terdapat pada fungsi tangga dengan interval 3

#Prediksi
prediksi.cut3 = predict( glm(mpg ~ cut(acceleration,3)) , data.frame(horsepower), interval="predict" )

#Plot
plot(acceleration, mpg, pch=19, col="blue",
     xlab="Time to Accelerate", ylab="Mile per Gallon",
     main="Fungsi Tangga Interval 7 dan 8")
ix <- sort(acceleration,index.return=T)$ix
lines(acceleration[ix], prediksi.cut3[ix], 
      main = "Fungsi Tangga Interval 7", col="red",
      type="l", lwd=2)

Regresi Spline

Secara Visual

bs <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("b-Splines") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~bs(x, knots = c(11,15,20,25)), 
              lty = 1,se = F)

ns <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("Natural Splines") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~ns(x, knots = c(11,15,20,25)), 
              lty = 1,se = F)

plot_grid(bs, ns, label_size = 6)

## Warning in predict.lm(model, newdata = new_data_frame(list(x = xseq)), se.fit =
## se, : prediction from a rank-deficient fit may be misleading

## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 1)

Secara Empiris (B-Spline)

# B-Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:10

best_bspline <- map_dfr(df, function(i){
  bspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ bs(acceleration,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  bspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_bspline <- colMeans(bspline)
  mean_bspline 
  }
)
best_bspline <- cbind(df=df,best_bspline)
best_bspline

##   df     rmse      mae
## 1  2 7.047178 5.803701
## 2  3 7.047178 5.803701
## 3  4 6.968859 5.674078
## 4  5 6.990827 5.694131
## 5  6 7.001749 5.692412
## 6  7 6.990255 5.635431
## 7  8 7.006893 5.661165
## 8  9 7.017205 5.699507
## 9 10 6.994071 5.660316

best_bspline %>% slice_min(rmse)

##   df     rmse      mae
## 1  4 6.968859 5.674078

best_bspline %>% slice_min(mae)

##   df     rmse      mae
## 1  7 6.990255 5.635431

#menentukan knot berdasarkan fungsi basis
df_4<-attr(bs(Auto$acceleration, df=4),"knots") 
df_4

##  50% 
## 15.5

#menentukan knot berdasarkan fungsi basis
df_7<-attr(bs(Auto$acceleration, df=7),"knots") 
df_7

##   20%   40%   60%   80% 
## 13.42 14.80 16.00 17.70

bs4 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~bs(x, df=4),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("B-Spline With DF = 4") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(11,15,20,25), col="grey50", lty=2)

bs7 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~bs(x, df=7),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("B-Spline With DF = 4") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(11,15,20,25), col="grey50", lty=2)

plot_grid(bs4, bs7, label_size = 6)

Secara Empiris (Natural Spline)

# Natural Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_nspline <- map_dfr(df, function(i){
  nspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ ns(acceleration,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  nspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_nspline <- colMeans(nspline)
  mean_nspline 
  }
)
best_nspline <- cbind(df=df,best_nspline)
best_nspline

##    df     rmse      mae
## 1   2 7.008099 5.745162
## 2   3 7.022494 5.773451
## 3   4 6.955818 5.671915
## 4   5 6.954930 5.627050
## 5   6 6.963101 5.644851
## 6   7 6.958731 5.651751
## 7   8 6.940616 5.619744
## 8   9 6.967658 5.642248
## 9  10 6.971402 5.654179
## 10 11 7.021230 5.687510
## 11 12 7.026188 5.701527
## 12 13 7.042842 5.713865
## 13 14 7.058619 5.717293
## 14 15 7.082357 5.740639

best_nspline %>% slice_min(rmse)

##   df     rmse      mae
## 1  8 6.940616 5.619744

best_nspline %>% slice_min(mae)

##   df     rmse      mae
## 1  8 6.940616 5.619744

#menentukan knot berdasarkan fungsi basis
df_8<-attr(ns(Auto$acceleration, df=8),"knots") 
df_8

##   12.5%     25%   37.5%     50%   62.5%     75%   87.5% 
## 12.5000 13.7750 14.5000 15.5000 16.2000 17.0250 18.7125

ns8 <- ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~ns(x, df=8),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("Natural Spline With DF = 8") +
  xlab("Time to Accelerate") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(11,15,20,25), col="grey50", lty=2)
ns8

Secara empiris, model regresi natural spline dengan df=8 lebih baik dalam merepresentasikan atau menjelaskan pola hubungan antara MPG dengan Time to Accelerate from 0 to 60 mph dibandingkan model regresi b-spline dengan df=4 maupun b-spline dengan df=7 karena nilai RMSE dan MAE natural spline dengan df=8 lebih kecil dibandingkan b-spline.

Perbandingan Semua Model

nilai_rmse <- rbind (7.073565,6.986808,6.951984,6.968859,6.990255,6.940616 ) 
nilai_mae <- rbind (5.823831,5.683944,5.625622,5.674078,5.635431,5.619744)
nama_model <- c("Linier", "Polinomial (derajat=4)", "Fungsi Tangga (interval=3)","B-spline (df4)","B-spline (df7)","Natural spline  (df=8)")
model <- data.frame(nama_model,nilai_rmse,nilai_mae)
model

##                   nama_model nilai_rmse nilai_mae
## 1                     Linier   7.073565  5.823831
## 2     Polinomial (derajat=4)   6.986808  5.683944
## 3 Fungsi Tangga (interval=3)   6.951984  5.625622
## 4             B-spline (df4)   6.968859  5.674078
## 5             B-spline (df7)   6.990255  5.635431
## 6     Natural spline  (df=8)   6.940616  5.619744

Secara empiris, model regresi natural spline dengan df=8 merupakan model terbaik dalam merepresentasikan data, karena dibandingkan dengan model lain, natural spline dengan df=12 memiliki RMSE dan MAE paling minimum. Sehingga untuk melihat pola hubungan antara MPG dengan Time to Accelerate from 0 to 60 mph lebih baik menggunakan model regresi natural spline dengan df 8 dengan visualisasi sebagai berikut:

MPG VS Vehicle weight

Regresi Linier

Secara Visual

ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
   stat_smooth(method = "lm", 
               formula = y~x,lty = 1,
               col = "red",se = F)+
  theme_bw()

Secara visual, model regresi linier kurang baik dalam merepresentasikan atau menjelaskan pola hubungan MPG dengan vehicle weight karena pola garis lurus yang terbentuk tidak seutuhnya mengikuti pola sebaran data. Selanjutnya, perlu dilakukan pemilihan model regresi non-linier terbaik

Secara Empiris

#Derajat 1
set.seed(321)

cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

metric_linear <- map_dfr(cross_val$splits, function(x){
  mod <- lm(mpg ~ weight, data=Auto[x$in_id,])
  pred <- predict(mod,newdata=Auto[-x$in_id,])
  truth <- Auto[-x$in_id,]$mpg
  rmse <- mlr3measures::rmse(truth = truth, response = pred)
  mae <- mlr3measures::mae(truth = truth,response = pred)
  metric <- c(rmse,mae)
  names(metric) <- c("rmse","mae")
  return(metric)
  }
)

# Rata-rata RMSE & MAE untuk 10 folds
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear

##     rmse      mae 
## 4.299574 3.282486

Regresi Polinomial

Secara Visual

#plot degree 2-7
d1 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,2,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 2")+
  theme_classic()

d2 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,3,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 3")+
  theme_classic()

d3 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,4,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 4")+
  theme_classic()

d4 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,5,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 5")+
  theme_classic()

d5 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,6,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 6")+
  theme_classic()

d6 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,7,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 7")+
  theme_classic()
d7 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,8,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 8")+
  theme_classic()

d8 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,9,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 9")+
  theme_classic()

d9 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm",formula = y~poly(x,10,raw=T), 
              lty = 1, col = "red",se = F)+
  ggtitle("Derajat 10")+
  theme_classic()

grid.arrange(d1,d2,d3,d4,d5,d6,d7,d8,d9, nrow=3,ncol=3,
             top = textGrob("Plot Regresi Polinomial",gp=gpar(fontsize=15,font=1)))

Secara visual, model regresi polinomial dengan derajat 2 sampai derajat 4 cenderung lebih mengikuti pola sebaran data dibandingkan model regresi polinomial dengan derajat 5 sampai dengan 10. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi polinomial tersebut.

Secara Empiris

# Regresi polinomial derajat 1-10
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 1:10

best_poly <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x){
                           mod <- lm(mpg ~ poly(weight,derajat=i), data=Auto[x$in_id,])
                           pred <- predict(mod,newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth,response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                           }
                         )
  metric_poly
  
# Rata-rata RMSE & MAE untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly 
  }
)
best_poly <- cbind(derajat=derajat,best_poly) 
best_poly

##    derajat     rmse      mae
## 1        1 4.299574 3.282486
## 2        2 4.140466 3.060826
## 3        3 4.145446 3.065768
## 4        4 4.151575 3.066187
## 5        5 4.150254 3.067449
## 6        6 4.159668 3.086962
## 7        7 4.154581 3.090445
## 8        8 4.190959 3.117152
## 9        9 4.299573 3.178586
## 10      10 4.178087 3.117666

best_poly$derajat <- as.factor(best_poly$derajat) 
prmse<-ggplot(data=best_poly, aes(x=derajat, y=rmse, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Root Mean Square Error")

best_poly$derajat <- as.factor(best_poly$derajat) 
pmae<-ggplot(data=best_poly, aes(x=derajat, y=mae, group=1)) +
  geom_line(linetype = "dashed")+ 
  geom_point()+
  ggtitle("K-Fold Cross Validation") + 
  xlab("Derajat") + 
  ylab("Mean Absolute Error")

plot_grid(prmse, pmae, labels=c("RMSE","MAE"), label_size = 6)

Secara empiris, nilai RMAE dan MAE minimum terdapat pada model regresi polinomial dengan derajat 2.

#Prediksi
prediksi.poly2 = predict( glm(mpg ~ poly(weight,2)) , data.frame(weight), interval="predict" )

#Plot
plot(weight, mpg, pch=19, col="blue",
     xlab="Vehicle Weight", ylab="Mile per Gallon",
     main="Regresi Polinomial Derajat 2")
ix <- sort(weight,index.return=T)$ix
lines(weight[ix], prediksi.poly2[ix], 
      main = "Polynomial Regression Derajat 2", col="red",
      type="l", lwd=2)

Untuk melihat pola hubungan data antara MPG dengan vehicle weight lebih cenderung mengikuti pola polynomial atau spline, selanjutnya untuk memilih model terbaik untuk merepresentasikan pola hubungan MPG dengan vehicle weight dilakukan dengan membandingkan terhadap regresi spline.

Regresi Spline

Secara Visual

bs <- ggplot(Auto,aes(x=weight, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("b-Splines") +
  xlab("Vehicle Weight") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~bs(x, knots = c(2000,3000,4000,5000)), 
              lty = 1,se = F)

ns <- ggplot(Auto,aes(x=weight, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  theme_bw()+
  ggtitle("Natural Splines") +
  xlab("Vehicle Weight") +
  ylab("mile per gallon") +
  stat_smooth(method = "lm", 
              formula = y~ns(x, knots = c(2000,3000,4000,5000)), 
              lty = 1,se = F)

plot_grid(bs, ns, label_size = 6)

Secara visual, model regresi natural spline cenderung lebih mengikuti pola sebaran data dibandingkan model regresi b-spline. Namun perlu dilakukan pengecekan secaran empiris melalui nilai RMSE dan MAE dari setiap model regresi spline tersebut.

Secara Empiris (B-Spline)

# B-Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:10

best_bspline <- map_dfr(df, function(i){
  bspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ bs(weight,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  bspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_bspline <- colMeans(bspline)
  mean_bspline 
  }
)
best_bspline <- cbind(df=df,best_bspline)
best_bspline

##   df     rmse      mae
## 1  2 4.145446 3.065768
## 2  3 4.145446 3.065768
## 3  4 4.152022 3.065852
## 4  5 4.148745 3.077351
## 5  6 4.150613 3.084954
## 6  7 4.190345 3.108026
## 7  8 4.178610 3.109603
## 8  9 4.169315 3.077254
## 9 10 4.450239 3.195706

best_bspline %>% slice_min(rmse)

##   df     rmse      mae
## 1  2 4.145446 3.065768
## 2  3 4.145446 3.065768

best_bspline %>% slice_min(mae)

##   df     rmse      mae
## 1  2 4.145446 3.065768
## 2  3 4.145446 3.065768

#menentukan knot berdasarkan fungsi basis
df_3<-attr(bs(Auto$weight, df=3),"knots") 
df_3

## numeric(0)

bs3 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~bs(x, df=3),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("B-Spline With DF = 3") +
  xlab("Vehicle Weight") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2000,3000,4000,5000), col="grey50", lty=2)
bs3

Secara Empiris (Natural Spline)

# Natural Spline
set.seed(321)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:15

best_nspline <- map_dfr(df, function(i){
  nspline <- map_dfr(cross_val$splits,
                     function(x){
                       mod <- lm(mpg ~ ns(weight,df=i),data=Auto[x$in_id,]) 
                       pred <-predict(mod,newdata=Auto[-x$in_id,])
                       truth <- Auto[-x$in_id,]$mpg
                       rmse <- mlr3measures::rmse(truth = truth,response = pred) 
                       mae <- mlr3measures::mae(truth = truth,response = pred ) 
                       metric <- c(rmse,mae)
                       names(metric) <- c("rmse","mae")
                       return(metric) 
                       }
                     )
  nspline

# Rata-rata RMSE & MAE untuk 10 folds
  mean_nspline <- colMeans(nspline)
  mean_nspline 
  }
)
best_nspline <- cbind(df=df,best_nspline)
best_nspline

##    df     rmse      mae
## 1   2 4.138968 3.059081
## 2   3 4.146196 3.061158
## 3   4 4.155606 3.066985
## 4   5 4.170902 3.091928
## 5   6 4.159089 3.093683
## 6   7 4.118805 3.041155
## 7   8 4.153896 3.083712
## 8   9 4.174299 3.088892
## 9  10 4.142906 3.064658
## 10 11 4.152285 3.069388
## 11 12 4.159216 3.077669
## 12 13 4.177854 3.091879
## 13 14 4.180087 3.099473
## 14 15 4.187073 3.109754

best_nspline %>% slice_min(rmse)

##   df     rmse      mae
## 1  7 4.118805 3.041155

best_nspline %>% slice_min(mae)

##   df     rmse      mae
## 1  7 4.118805 3.041155

#menentukan knot berdasarkan fungsi basis
df_7<-attr(ns(Auto$weight, df=7),"knots") 
df_7

## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  2074.857  2285.429  2635.000  2986.571  3446.143  4096.286

ns7 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm",
              formula = y~ns(x, df=7),
              lty = 1,col = "black", se=F)+ 
  theme_bw()+
  ggtitle("Natural Spline With DF = 7") +
  xlab("Vehicle Weight") +
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2000,3000,4000,5000), col="grey50", lty=2)
ns7

Secara empiris, model regresi natural spline dengan df=7 lebih baik dalam merepresentasikan atau menjelaskan pola hubungan antara MPG dengan vehicle weight dibandingkan model regresi b-spline dengan df=3 karena nilai RMSE dan MAE natural spline dengan df=7 lebih kecil dibandingkan b-spline.

Perbandingan Semua Model

nilai_rmse <- rbind (4.299574,4.140466,4.145446,4.118805) 
nilai_mae <- rbind (3.282486,3.060826,3.065768,3.041155)
nama_model <- c("Linier", "Polinomial (derajat=2)","B-spline (df3)","Natural spline  (df=7)")
model <- data.frame(nama_model,nilai_rmse,nilai_mae)
model

##               nama_model nilai_rmse nilai_mae
## 1                 Linier   4.299574  3.282486
## 2 Polinomial (derajat=2)   4.140466  3.060826
## 3         B-spline (df3)   4.145446  3.065768
## 4 Natural spline  (df=7)   4.118805  3.041155

Secara empiris, model regresi natural spline dengan df=7 merupakan model terbaik dalam merepresentasikan data, karena dibandingkan dengan model lain, natural spline dengan df=7 memiliki RMSE dan MAE paling minimum. Sehingga untuk melihat pola hubungan antara MPG dengan vehicle weight lebih baik menggunakan model regresi natural spline dengan df 7 dengan visualisasi sebagai berikut:

Kesimpulan

Model yang paling tepat menjelaskan dan merepresentasikan pola hubungan mpg dan variabel prediktor Engine horsepower adalah model regresi natural spline dengan df 12
Model yang paling tepat menjelaskan dan merepresentasikan pola hubungan mpg dan variabel prediktor Time to Accelerate from 0 to 60 mph adalah model regresi natural spline dengan df 8
Model yang paling tepat menjelaskan dan merepresentasikan pola hubungan mpg dan variabel prediktor Vehicle weight adalah model regresi natural spline dengan df 7