Tugas Praktikum 8 Sains Data

Andika Putri Ratnasari (G1501211018)

2021-11-01

Data

Data yang digunakan yaitu data Auto. Data ini berisi tentang informasi mengenai 392 jenis kendaraan. Berikut adalah penjelasan untuk masing-masing variabel dalam data:

  • mpg: miles per gallon
  • cylinders: Number of cylinders between 4 and 8
  • displacement: Engine displacement (cu. inches)
  • horsepower: Engine horsepower
  • weight: Vehicle weight (lbs.)
  • acceleration: Time to accelerate from 0 to 60 mph (sec.)
  • year: Model year (modulo 100)
  • origin: Origin of car (1. American, 2. European, 3. Japanese)
  • name: Vehicle name

Peubah respon yang digunakan yaitu mpg (miles per gallon) sedangkan peubah prediktor yang dipilih yaitu horsepower, displacement , dan weight.

Plot masing-masing peubah prediktor terhadap respon mpg:

plot_hp<-ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color=I("black")) + 
                 ggtitle("Plot mpg vs horsepower")+
                 theme_bw()

plot_dis<-ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color=I("blue")) + 
                 ggtitle("Plot mpg vs displacement")+
                 theme_bw()

plot_wei<-ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color=I("red")) +
                 ggtitle("Plot mpg vs weight")+
                 theme_bw()

plot_grid(plot_hp, plot_dis, plot_wei)

Berdasarkan plot di atas terlihat bahwa tidak terdapat pola linier antara masing-masing peubah prediktor (horsepower, displacement, dan weight) dan peubah respon (mpg). Oleh karena itu, dilakukan pemodelan hubungan tak linier dengan menggunakan metode:

  1. Regresi Polinomial
  2. Regresi Fungsi Tangga
  3. Cubic Splines
  4. Natural Cubic Splines

Pemodelan mpg vs horsepower

Regresi Polinomial

A. Regresi Polinomial (Secara Visual)

1. Regresi Polinomial Derajat 2

rp2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("Horse Power") + 
  ylab("mile per gallon") 

2. Regresi Polinomial Derajat 3

rp3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,3,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 3") +
  xlab("Horse Power") + 
  ylab("mile per gallon") 

3. Regresi Polinomial Derajat 4

rp4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="navy") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,4,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 4") +
  xlab("Horse Power") + 
  ylab("mile per gallon") 

4. Regresi Polinomial Derajat 5

rp5 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,5,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 5") +
  xlab("Horse Power") + 
  ylab("mile per gallon") 
plot_grid(rp2, rp3, rp4, rp5)

Berdasarkan keempat plot di atas dapat dilihat bahwa regresi polinomial derajat 2 dan 3 memiliki pola yang teratur dibandingkan dengan regresi polinomial derajat 4 dan 5. Sehingga, pada metode regresi polinomial dipilih derajat 2, karena modelnya lebih sederhana.

B. Regresi Polinomial (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 1:5

polinomial1 <- map_dfr(derajat, function(i) {
  metric_poli <- map_dfr(val_silang$splits,
                        function(x) {
                        mod <- lm(mpg ~ poly(horsepower,derajat=i), data=Auto[x$in_id,])
                        predik <- predict(mod, newdata=Auto[-x$in_id,])
                        truth <- Auto[-x$in_id,]$mpg
                        rmse <- mlr3measures::rmse(truth = truth, response = predik)
                        mae <- mlr3measures::mae(truth = truth, response = predik)
                        metric <- c(rmse,mae)
                        names(metric) <- c("RMSE","MAE")
                        return(metric)
                              }
                            )
  metric_poli
  mean_metric_poli<- colMeans(metric_poli) # menghitung rata-rata untuk 10 folds
  mean_metric_poli
  }
)

polinomial1<- cbind(derajat=derajat,polinomial1)

Perbandingan RMSE dan MAE

datatable(polinomial1)

Plot RMSE dan MAE

polinomial1$derajat <- as.factor(polinomial1$derajat)
prmse<-ggplot(data=polinomial1, aes(x=derajat, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Root Mean Square Error")

polinomial1$derajat <- as.factor(polinomial1$derajat)
pmae<-ggplot(data=polinomial1, aes(x=derajat, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Mean Absolute Error")

plot_grid(prmse,pmae, labels=c("RMSE","MAE"), label_size = 6)

Plot Regresi Polinomial d=2 dan d=5

regpol2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("Horse Power") + 
  ylab("mile per gallon")  

regpol5 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,5,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 5") +
  xlab("Horse Power") + 
  ylab("mile per gallon")

plot_grid(regpol2,regpol5)

Nilai RMSE dan MAE terendah terdapat pada regresi polinomial derajat 5, namun apabila dilihat dari plot regresi pada bagian ujung-ujung plot cenderung tidak teratur untuk regresi polinomial derajat 5. Oleh karena itu, regresi polinomial terbaik yaitu regresi polinomial derajat 2. Berikut merupakan summary dari model regresi polinomial derajat 2:

poli2 = lm(mpg ~ poly(horsepower,2), data = Auto)
summary(poli2)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Regresi Fungsi Tangga

A. Regresi Fungsi Tangga (Secara Visual)

Regresi Fungsi Tangga dengan Knots 1-10

ft1<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,2), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=1") +
  xlab("Horse Power") + ylab("mile per gallon")
ft2<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="yellow") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,3), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=2") +
  xlab("Horse Power") + ylab("mile per gallon")
ft3<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="green") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,4), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=3") +
  xlab("Horse Power") + ylab("mile per gallon")

ft4<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,5), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=4") +
  xlab("Horse Power") + ylab("mile per gallon")

ft5<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,6), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=5") +
  xlab("Horse Power") + ylab("mile per gallon")

ft6<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,7), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=6") +
  xlab("Horse Power") + ylab("mile per gallon")

ft7<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="yellow") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,8), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=7") +
  xlab("Horse Power") + ylab("mile per gallon")

ft8<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="green") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,9), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=8") +
  xlab("Horse Power") + ylab("mile per gallon")

ft9<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,10), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=9") +
  xlab("Horse Power") + ylab("mile per gallon")

ft10<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,11), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=10") +
  xlab("Horse Power") + ylab("mile per gallon")

Berdasarkan hasil plot di atas dapat dilihat bahwa plot dari regresi fungsi tangga yang dapat memberikan pola paling mulus yaitu knots 7, untuk knots 8 hingga 10 terdapat pola yang tidak teratur dalam plot. Sehingga secara visual, dipilih regresi fungsi tangga dengan knots 7 untuk memodelkan data mpg vs horsepower.

B. Regresi Fungsi Tangga (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

breakk <- 2:11 #knots 1-10

pemilihanf.tangga <- map_dfr(breakk, function(i){
  metric_ftangga <- map_dfr(val_silang$splits,
                          function(x){
                          datatrain <- Auto[x$in_id,]
                          datatrain$horsepower <- cut(datatrain$horsepower,i)
                          modelt <- lm(mpg ~ horsepower, data=datatrain)
                          labs_x <- levels(modelt$model[,2])
                          labs_x_b <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                            upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                          datatesting <- Auto[-x$in_id,]
                          horsepower_new <- cut(datatesting$horsepower,c(labs_x_b[1,1],labs_x_b[,2]))
                          predik <- predict(modelt,   newdata=list(horsepower=horsepower_new))
                          truth <- datatesting$mpg
                          evaluasi <- na.omit(data.frame(truth,predik))
                          rmse <- mlr3measures::rmse(truth = evaluasi$truth,response = evaluasi$predik)
                          mae <- mlr3measures::mae(truth = evaluasi$truth,response = evaluasi$predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )

metric_ftangga
mean_metricftangga <- colMeans(metric_ftangga)
mean_metricftangga
})
pemilihanf.tangga <- cbind(breakk=breakk,pemilihanf.tangga)

Perbandingan RMSE dan MAE

datatable(pemilihanf.tangga)

Plot RMSE dan MAE

pemilihanf.tangga$breakk <- as.factor(pemilihanf.tangga$breakk)
trmse<-ggplot(data=pemilihanf.tangga, aes(x=breakk, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Root Mean Square Error")

pemilihanf.tangga$breakk <- as.factor(pemilihanf.tangga$breakk)
tmae<-ggplot(data=pemilihanf.tangga, aes(x=breakk, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Mean Absolute Error")

plot_grid(trmse,tmae)

Berdasarkan nilai RMSE model regresi fungsi tangga terbaik adalah regresi fungsi tangga dengan knots=7, sedangkan berdasarkan nilai MAE model terbaik adalah regresi fungsi tangga dengan knots=6.

Plot Regresi Rungsi Tangga

ft7<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,7), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=6") +
  xlab("Horse Power") + ylab("mile per gallon")

ft8<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,8), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=7") +
  xlab("Horse Power") + ylab("mile per gallon")

plot_grid(ft7,ft8)

Apabila dilihat secara visual, plot dari regresi fungsi tangga dengan knots 7 memiliki kurva yang lebih mulus dibandingkan dengan knots 6, selain itu juga tidak terdapat pola yang tidak teratur pada plot dengan knots=7. Jadi, pada metode regresi fungsi tangga ini dipilih knots=7.

Berikut merupakan summary model regresi fungsi tangga dengan knots=7:

fungsitangga7 = lm(Auto$mpg ~ cut(Auto$horsepower,8)) #7 knots
summary(fungsitangga7)
## 
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$horsepower, 8))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9471  -2.6757  -0.1533   2.4015  14.5529 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       33.9085     0.5771   58.76   <2e-16 ***
## cut(Auto$horsepower, 8)(69,92]    -6.9614     0.6910  -10.07   <2e-16 ***
## cut(Auto$horsepower, 8)(92,115]  -12.7551     0.7425  -17.18   <2e-16 ***
## cut(Auto$horsepower, 8)(115,138] -15.6656     1.1264  -13.91   <2e-16 ***
## cut(Auto$horsepower, 8)(138,161] -18.7799     0.8568  -21.92   <2e-16 ***
## cut(Auto$horsepower, 8)(161,184] -19.9735     1.1470  -17.41   <2e-16 ***
## cut(Auto$horsepower, 8)(184,207] -21.1228     1.7720  -11.92   <2e-16 ***
## cut(Auto$horsepower, 8)(207,230] -21.0085     1.5159  -13.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.433 on 384 degrees of freedom
## Multiple R-squared:  0.6832, Adjusted R-squared:  0.6774 
## F-statistic: 118.3 on 7 and 384 DF,  p-value: < 2.2e-16

Cubic Splines

A. Cubic Splines (Secara Visual)

1. Knots 1-6

attr(bs(Auto$horsepower, df=4),"knots")
##  50% 
## 93.5
attr(bs(Auto$horsepower, df=5),"knots")
## 33.33333% 66.66667% 
##        84       110
attr(bs(Auto$horsepower, df=6),"knots")
##   25%   50%   75% 
##  75.0  93.5 126.0
attr(bs(Auto$horsepower, df=7),"knots")
## 20% 40% 60% 80% 
##  72  88 100 140
attr(bs(Auto$horsepower, df=8),"knots")
## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##      70.0      84.0      93.5     110.0     150.0
attr(bs(Auto$horsepower, df=9),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  68.85714  79.71429  89.57143  98.00000 113.57143 150.00000

Plot Cubic Splines dengan Knots 1-6

cs1 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="red")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(93.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=1") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(93.5), col="grey50", lty=2)

cs2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="orange")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(84,110)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=2") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(84,110), col="grey50", lty=2)

cs3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="orange")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(75.0, 93.5,126.0)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=3") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(75.0, 93.5,126.0), col="grey50", lty=2)

cs4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="green")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(72, 88, 100, 140)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=4") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(72, 88, 100, 140), col="grey50", lty=2)

cs5<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="blue")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(70.0,84.0,93.5,110.0,  150.0)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=5") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(70.0,84.0,93.5,110.0,150.0), col="grey50", lty=2)

cs6<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(68.85714, 79.71429, 89.57143, 98, 113.57143, 150)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=6") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(68.85714, 79.71429, 89.57143, 98, 113.57143, 150), col="grey50", lty=2)

Berdasarkan hasil plot di atas dapat disimpulkan bahwa model cubic splines terbaik adalah cubic splines dengan knots 1 karena mulai knots 2 hingga 6 terdapat pola yang tidak teratur pada ujung plot.

B. Cubic Splines (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 4:9 #knots 1-6

pemilihan.cs <- map_dfr(df, function(i) {
  metric_cs<- map_dfr(val_silang$splits,
                          function(x) {
                          modelcs <- lm(mpg ~ bs(horsepower,df=i), data=Auto[x$in_id,])
                          predik <- predict(modelcs, newdata=Auto[-x$in_id,])
                          truth <- Auto[-x$in_id,]$mpg
                          rmse <- mlr3measures::rmse(truth = truth, response = predik)
                          mae <- mlr3measures::mae(truth = truth, response = predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )
  metric_cs

# menghitung rata-rata untuk 10 folds
  mean_metric_cs <- colMeans(metric_cs)
  mean_metric_cs
  }
)

pemilihan.cs <- cbind(df=df,pemilihan.cs)

Perbandingan RMSE dan MAE

datatable(pemilihan.cs)

Plot RMSE dan MAE

pemilihan.cs$df <- as.factor(pemilihan.cs$df)
crmse<-ggplot(data=pemilihan.cs, aes(x=df, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

cmae<-ggplot(data=pemilihan.cs, aes(x=df, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Mean Absolute Error")

plot_grid(crmse,cmae)

Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model cubic splines dengan knots 2, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model cubic splines dengan knots 2 memiliki ketidakteraturan pola pada ujung-ujung plot. Sehingga model cubic splines terbaik adalah dengan knots 1. Berikut merupakan plot dan summary model:

Plot Cubic Splines

plotcs <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(93.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines with Knots=1") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(93.5), col="grey50", lty=2)
plotcs

cspline = lm(Auto$mpg ~ bs(Auto$horsepower, knots = c(93.5)))
summary(cspline)
## 
## Call:
## lm(formula = Auto$mpg ~ bs(Auto$horsepower, knots = c(93.5)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3859  -2.4974  -0.2154   2.0668  16.2357 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             35.366      1.532  23.082  < 2e-16 ***
## bs(Auto$horsepower, knots = c(93.5))1   -0.483      2.152  -0.224    0.823    
## bs(Auto$horsepower, knots = c(93.5))2  -26.299      1.962 -13.406  < 2e-16 ***
## bs(Auto$horsepower, knots = c(93.5))3  -19.576      2.952  -6.631 1.12e-10 ***
## bs(Auto$horsepower, knots = c(93.5))4  -23.402      2.294 -10.201  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.336 on 387 degrees of freedom
## Multiple R-squared:  0.6945, Adjusted R-squared:  0.6914 
## F-statistic:   220 on 4 and 387 DF,  p-value: < 2.2e-16

Natural Cubic Splines

A. Natural Cubic Splines (Secara Visual)

1. Knots 1-6

attr(ns(Auto$horsepower, df=2),"knots")
##  50% 
## 93.5
attr(ns(Auto$horsepower, df=3),"knots")
## 33.33333% 66.66667% 
##        84       110
attr(ns(Auto$horsepower, df=4),"knots")
##   25%   50%   75% 
##  75.0  93.5 126.0
attr(ns(Auto$horsepower, df=5),"knots")
## 20% 40% 60% 80% 
##  72  88 100 140
attr(ns(Auto$horsepower, df=6),"knots")
## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##      70.0      84.0      93.5     110.0     150.0
attr(ns(Auto$horsepower, df=7),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  68.85714  79.71429  89.57143  98.00000 113.57143 150.00000

Plot Natural Cubic Splines dengan Knots 1-6

ns1 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="red")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(93.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=1") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(93.5), col="grey50", lty=2)

ns2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="orange")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(84,110)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=2") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(84,110), col="grey50", lty=2)

ns3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="coral")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(75.0, 93.5,126.0)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=3") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(75.0, 93.5,126.0), col="grey50", lty=2)

ns4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="green")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(72, 88, 100, 140)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=4") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(72, 88, 100, 140), col="grey50", lty=2)

ns5<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="blue")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(70.0,84.0,93.5,110.0,  150.0)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=5") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(70.0,84.0,93.5,110.0,150.0), col="grey50", lty=2)

ns6<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(68.85714, 79.71429, 89.57143, 98, 113.57143, 150)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=6") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(68.85714, 79.71429, 89.57143, 98, 113.57143, 150), col="grey50", lty=2)

Berdasarkan hasil plot di atas dapat dilihat bahwa plot natural cubic splines dengan knots 1 hingga 5 tidak memiliki pola yang tidak teratur. Sedangkan untuk knots 6 terdapat pola yang tidak teratur pada ujung plot. Sehingga untuk lebih meyakinkan dalam memilih model terbaik, dilakukan pemilihan model secara empiris.

B. Natural Cubic Splines (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:7 #knots 1-6

pemilihan.ncs <- map_dfr(df, function(i) {
  metric_ncs<- map_dfr(val_silang$splits,
                          function(x) {
                          modelncs <- lm(mpg ~ ns(horsepower,df=i), data=Auto[x$in_id,])
                          predik <- predict(modelncs, newdata=Auto[-x$in_id,])
                          truth <- Auto[-x$in_id,]$mpg
                          rmse <- mlr3measures::rmse(truth = truth, response = predik)
                          mae <- mlr3measures::mae(truth = truth, response = predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )
  metric_ncs

# menghitung rata-rata untuk 10 folds
  mean_metric_ncs <- colMeans(metric_ncs)
  mean_metric_ncs
  }
)

pemilihan.ncs <- cbind(df=df,pemilihan.ncs)

Perbandingan RMSE dan MAE

datatable(pemilihan.ncs)

Plot RMSE dan MAE

pemilihan.ncs$df <- as.factor(pemilihan.ncs$df)
nrmse<-ggplot(data=pemilihan.ncs, aes(x=df, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

nmae<-ggplot(data=pemilihan.ncs, aes(x=df, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Mean Absolute Error")

plot_grid(nrmse,nmae)

Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model natural cubic splines dengan knots 6, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model natural cubic splines dengan knots 6 memiliki pola yang tidak teratur pada ujung-ujung plot. Sehingga model natural cubic splines terbaik adalah dengan knots 5. Berikut merupakan plot dan summary model:

Plot Natural Cubic Splines

plotncs <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(70.0,84.0,93.5,110.0,150.0)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=5") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(70.0,84.0,93.5,110.0,150.0), col="grey50", lty=2)
plotncs

ncspline = lm(Auto$mpg ~ ns(Auto$horsepower, knots = c(70.0,84.0,93.5,110.0,150.0)))
summary(ncspline)
## 
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$horsepower, knots = c(70, 84, 
##     93.5, 110, 150)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9491  -2.6183  -0.1595   2.3508  15.1349 
## 
## Coefficients:
##                                                         Estimate Std. Error
## (Intercept)                                               34.738      1.509
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))1   -8.210      1.594
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))2  -13.046      1.835
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))3  -14.577      1.886
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))4  -22.802      1.624
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))5  -22.758      3.512
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))6  -20.849      1.742
##                                                         t value Pr(>|t|)    
## (Intercept)                                              23.021  < 2e-16 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))1  -5.149 4.18e-07 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))2  -7.108 5.76e-12 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))3  -7.730 9.50e-14 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))4 -14.039  < 2e-16 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))5  -6.480 2.81e-10 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))6 -11.967  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.302 on 385 degrees of freedom
## Multiple R-squared:  0.7009, Adjusted R-squared:  0.6962 
## F-statistic: 150.4 on 6 and 385 DF,  p-value: < 2.2e-16

Perbandingan Model

nilai_rmse <- rbind(4.32756231703042,
                    4.41347139979508,
                    4.32073392297636,
                    4.2917765537155)

nilai_mae<-rbind(3.2639553356593,
                 3.41081122407494,
                 3.26311155879859,
                 3.24670654206857)

Metode <- c("Regresi Polinomial (d=2)",
            "Fungsi Tangga (knots=7)",
            "Cubic Splines (knots=1)",
            "Natural Cubic Splines (knots=5)")

perbandingan <- data.frame(Metode,nilai_rmse, nilai_mae)
datatable(perbandingan)

Plot dari Masing-masing Model

#Regresi Polinomial
rp2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("Horse Power") + 
  ylab("mile per gallon") 


#Regresi Fungsi Tangga
ft7<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="yellow") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,8), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=7") +
  xlab("Horse Power") + ylab("mile per gallon")

#Cubic Splines
cs1 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="red")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(93.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=1") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(93.5), col="grey50", lty=2)

#Natural Cubic Splines
ns5<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
           geom_point(alpha=0.55, color="blue")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(70.0,84.0,93.5,110.0,  150.0)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=5") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(70.0,84.0,93.5,110.0,150.0), col="grey50", lty=2)

plot_grid(rp2, ft7, cs1, ns5)

Berdasarkan nila RMSE dan MAE dapat disimpulkan bahwa model terbaik untuk memodelkanmpg vs horsepower adalah model natural cubic splines dengan knots 6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat horsepower (kekuatan mesin) dari kendaraan maka miles per gallon (jarak yang dapat ditempuh per galon bahan bakar) akan semakin kecil.

Pemodelan mpg vs displacement

Regresi Polinomial

A. Regresi Polinomial (Secara Visual)

1. Regresi Polinomial Derajat 2

rp22 <- ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("Displacement") + 
  ylab("mile per gallon") 

2. Regresi Polinomial Derajat 3

rp33 <- ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,3,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 3") +
  xlab("Displacement") + 
  ylab("mile per gallon") 

3. Regresi Polinomial Derajat 4

rp44 <- ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="navy") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,4,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 4") +
  xlab("Displacement") + 
  ylab("mile per gallon") 

4. Regresi Polinomial Derajat 5

rp55 <- ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,5,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 5") +
  xlab("Displacement") + 
  ylab("mile per gallon") 
plot_grid(rp22, rp33, rp44, rp55)

Berdasarkan keempat plot di atas dapat dilihat bahwa plot regresi polinomial derajat 2 hingga 5 memiliki pola yang hampir sama, tidak terdapat pola yang tidak teratur pada ujung plot. Sehingga, pada metode Regresi Polinomial dipilih derajat 2 karena model paling sederhana.

B. Regresi Polinomial (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 1:5

polinomial2 <- map_dfr(derajat, function(i) {
  metric_poli <- map_dfr(val_silang$splits,
                        function(x) {
                        mod <- lm(mpg ~ poly(displacement,derajat=i), data=Auto[x$in_id,])
                        predik <- predict(mod, newdata=Auto[-x$in_id,])
                        truth <- Auto[-x$in_id,]$mpg
                        rmse <- mlr3measures::rmse(truth = truth, response = predik)
                        mae <- mlr3measures::mae(truth = truth, response = predik)
                        metric <- c(rmse,mae)
                        names(metric) <- c("RMSE","MAE")
                        return(metric)
                              }
                            )
  metric_poli
  mean_metric_poli<- colMeans(metric_poli) # menghitung rata-rata untuk 10 folds
  mean_metric_poli
  }
)

polinomial2<- cbind(derajat=derajat,polinomial2)

Perbandingan RMSE dan MAE

datatable(polinomial2)

Plot RMSE dan MAE

polinomial2$derajat <- as.factor(polinomial2$derajat)
prmse2<-ggplot(data=polinomial2, aes(x=derajat, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Root Mean Square Error")

polinomial2$derajat <- as.factor(polinomial2$derajat)
pmae2<-ggplot(data=polinomial2, aes(x=derajat, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Mean Absolute Error")

plot_grid(prmse2,pmae2, labels=c("RMSE","MAE"), label_size = 6)

Berdasarkan nilai RMSE dan MAE terendah, dapat disimpulkan bahwa model regresi polinomial terbaik adalah regresi polinomial derajat 2. Berikut merupakan plot dan summary model:

Plot Regresi Polinomial d=2

regpol22 <- ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("Displacement") + 
  ylab("mile per gallon")  

regpol22

poli22= lm(mpg ~ poly(displacement,2), data = Auto)
summary(poli22)
## 
## Call:
## lm(formula = mpg ~ poly(displacement, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2165  -2.2404  -0.2508   2.1094  20.5158 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              23.4459     0.2205 106.343  < 2e-16 ***
## poly(displacement, 2)1 -124.2585     4.3652 -28.466  < 2e-16 ***
## poly(displacement, 2)2   31.0895     4.3652   7.122 5.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.365 on 389 degrees of freedom
## Multiple R-squared:  0.6888, Adjusted R-squared:  0.6872 
## F-statistic: 430.5 on 2 and 389 DF,  p-value: < 2.2e-16

Regresi Fungsi Tangga

A. Regresi Fungsi Tangga (Secara Visual)

Regresi Fungsi Tangga dengan Knots 1-10

ft11<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,2), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=1") +
  xlab("displacement") + ylab("mile per gallon")
ft22<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="yellow") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,3), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=2") +
  xlab("displacement") + ylab("mile per gallon")
ft33<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="green") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,4), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=3") +
  xlab("displacement") + ylab("mile per gallon")

ft44<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,5), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=4") +
  xlab("displacement") + ylab("mile per gallon")

ft55<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,6), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=5") +
  xlab("displacement") + ylab("mile per gallon")

ft66<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,7), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=6") +
  xlab("displacement") + ylab("mile per gallon")

ft77<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="yellow") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,8), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=7") +
  xlab("displacement") + ylab("mile per gallon")

ft88<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="green") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,9), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=8") +
  xlab("displacement") + ylab("mile per gallon")

ft99<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,10), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=9") +
  xlab("displacement") + ylab("mile per gallon")

ft110<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,11), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=10") +
  xlab("displacement") + ylab("mile per gallon")

Berdasarkan hasil plot di atas dapat dilihat bahwa plot dari regresi fungsi tangga yang dapat memberikan pola paling mulus yaitu knots 8, untuk knots 9 dan 10 terdapat pola yang tidak teratur dalam plot. Sehingga secara visual, dipilih regresi fungsi tangga dengan knots 8 untuk memodelkan data mpg vs displacement.

B. Regresi Fungsi Tangga (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

breakk <- 2:11 #knots 1-10

pemilihanf.tangga1 <- map_dfr(breakk, function(i){
  metric_ftangga <- map_dfr(val_silang$splits,
                          function(x){
                          datatrain <- Auto[x$in_id,]
                          datatrain$displacement <- cut(datatrain$displacement,i)
                          modelt <- lm(mpg ~ displacement, data=datatrain)
                          labs_x <- levels(modelt$model[,2])
                          labs_x_b <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                            upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                          datatesting <- Auto[-x$in_id,]
                          displacement_new <- cut(datatesting$displacement,c(labs_x_b[1,1],labs_x_b[,2]))
                          predik <- predict(modelt,   newdata=list(displacement=displacement_new))
                          truth <- datatesting$mpg
                          evaluasi <- na.omit(data.frame(truth,predik))
                          rmse <- mlr3measures::rmse(truth = evaluasi$truth,response = evaluasi$predik)
                          mae <- mlr3measures::mae(truth = evaluasi$truth,response = evaluasi$predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )

metric_ftangga
mean_metricftangga <- colMeans(metric_ftangga)
mean_metricftangga
})
pemilihanf.tangga1 <- cbind(breakk=breakk,pemilihanf.tangga1)

Perbandingan RMSE dan MAE

datatable(pemilihanf.tangga1)

Plot RMSE dan MAE

pemilihanf.tangga1$breakk <- as.factor(pemilihanf.tangga1$breakk)
trmse1<-ggplot(data=pemilihanf.tangga1, aes(x=breakk, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Root Mean Square Error")

pemilihanf.tangga1$breakk <- as.factor(pemilihanf.tangga1$breakk)
tmae1<-ggplot(data=pemilihanf.tangga1, aes(x=breakk, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Mean Absolute Error")

plot_grid(trmse1,tmae1)

Berdasarkan nilai RMSE dan MAE terkecil, model regresi fungsi tangga terbaik adalah regresi fungsi tangga dengan knots 8.

Berikut merupakan plot dan summary model regresi fungsi tangga dengan knots 8:

Plot Regresi Rungsi Tangga

ft88<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,9), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=8") +
  xlab("displacement") + ylab("mile per gallon")

ft88

fungsitangga88 = lm(Auto$mpg ~ cut(Auto$displacement,9)) #8 knots
summary(fungsitangga88)
## 
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$displacement, 9))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5181  -2.2250  -0.2905   2.0339  19.4607 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         31.5181     0.3947  79.850  < 2e-16 ***
## cut(Auto$displacement, 9)(111,154]  -5.5601     0.6010  -9.252  < 2e-16 ***
## cut(Auto$displacement, 9)(154,197]  -8.2848     1.0770  -7.693 1.23e-13 ***
## cut(Auto$displacement, 9)(197,240] -11.7022     0.7527 -15.547  < 2e-16 ***
## cut(Auto$displacement, 9)(240,283] -12.9788     0.8951 -14.499  < 2e-16 ***
## cut(Auto$displacement, 9)(283,326] -16.2276     0.7656 -21.197  < 2e-16 ***
## cut(Auto$displacement, 9)(326,369] -16.7923     0.8595 -19.537  < 2e-16 ***
## cut(Auto$displacement, 9)(369,412] -17.5494     1.1337 -15.479  < 2e-16 ***
## cut(Auto$displacement, 9)(412,455] -18.2959     1.4710 -12.438  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.251 on 383 degrees of freedom
## Multiple R-squared:  0.7094, Adjusted R-squared:  0.7033 
## F-statistic: 116.9 on 8 and 383 DF,  p-value: < 2.2e-16

Cubic Splines

A. Cubic Splines (Secara Visual)

1. Knots 1-6

attr(bs(Auto$displacement, df=4),"knots")
## 50% 
## 151
attr(bs(Auto$displacement, df=5),"knots")
## 33.33333% 66.66667% 
##       119       232
attr(bs(Auto$displacement, df=6),"knots")
##    25%    50%    75% 
## 105.00 151.00 275.75
attr(bs(Auto$displacement, df=7),"knots")
## 20% 40% 60% 80% 
##  98 122 225 305
attr(bs(Auto$displacement, df=8),"knots")
## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##        97       119       151       232       318
attr(bs(Auto$displacement, df=9),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  96.85714 108.00000 134.57143 198.00000 250.00000 321.14286

Plot Cubic Splines dengan Knots 1-6

cs11 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="red")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(151)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=1") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(151), col="grey50", lty=2)

cs22 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="orange")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(119,232)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=2") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(119,232), col="grey50", lty=2)

cs33 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="orange")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(105, 151, 275.75)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=3") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(105, 151,275.75), col="grey50", lty=2)

cs44 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="green")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(98, 122, 225, 305)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=4") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(98, 122, 225, 305), col="grey50", lty=2)

cs55<- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="blue")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(97,119,151,232,318)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=5") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(97,119,151,232,318), col="grey50", lty=2)

cs66<- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(96.85714,108, 134.57143, 198, 250, 321.14286)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=6") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(96.85714,108, 134.57143, 198, 250, 321.14286), col="grey50", lty=2)

Berdasarkan hasil plot di atas dapat disimpulkan bahwa model cubic splines terbaik adalah cubic splines dengan knots 1 karena mulai knots 2 hingga 6 terdapat pola yang tidak teratur pada ujung plot.

B. Cubic Splines (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 4:9 #knots 1-6

pemilihan.cs1 <- map_dfr(df, function(i) {
  metric_cs<- map_dfr(val_silang$splits,
                          function(x) {
                          modelcs <- lm(mpg ~ bs(displacement,df=i), data=Auto[x$in_id,])
                          predik <- predict(modelcs, newdata=Auto[-x$in_id,])
                          truth <- Auto[-x$in_id,]$mpg
                          rmse <- mlr3measures::rmse(truth = truth, response = predik)
                          mae <- mlr3measures::mae(truth = truth, response = predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )
  metric_cs

# menghitung rata-rata untuk 10 folds
  mean_metric_cs <- colMeans(metric_cs)
  mean_metric_cs
  }
)

pemilihan.cs1 <- cbind(df=df,pemilihan.cs1)

Perbandingan RMSE dan MAE

datatable(pemilihan.cs1)

Plot RMSE dan MAE

pemilihan.cs1$df <- as.factor(pemilihan.cs1$df)
crmse1<-ggplot(data=pemilihan.cs1, aes(x=df, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

cmae1<-ggplot(data=pemilihan.cs, aes(x=df, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Mean Absolute Error")

plot_grid(crmse1,cmae1)

Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model cubic splines dengan knots 2 dan 3, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model cubic splines dengan knots 2 dan 3 memiliki ketidakteraturan pola pada ujung-ujung plot. Sehingga model cubic splines terbaik adalah dengan knots 1. Berikut merupakan plot dan summary model:

Plot Cubic Splines

plotcs1 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(151)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines with Knots=1") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(151), col="grey50", lty=2)
plotcs1

cspline1 = lm(Auto$mpg ~ bs(Auto$displacement, knots = c(151)))
summary(cspline1)
## 
## Call:
## lm(formula = Auto$mpg ~ bs(Auto$displacement, knots = c(151)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1199  -2.3744  -0.2858   2.1009  20.5408 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                              33.308      1.218  27.351  < 2e-16 ***
## bs(Auto$displacement, knots = c(151))1   -2.579      1.831  -1.409     0.16    
## bs(Auto$displacement, knots = c(151))2  -18.060      1.864  -9.691  < 2e-16 ***
## bs(Auto$displacement, knots = c(151))3  -18.381      2.592  -7.093  6.3e-12 ***
## bs(Auto$displacement, knots = c(151))4  -20.465      1.857 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.368 on 387 degrees of freedom
## Multiple R-squared:   0.69,  Adjusted R-squared:  0.6868 
## F-statistic: 215.3 on 4 and 387 DF,  p-value: < 2.2e-16

Natural Cubic Splines

A. Natural Cubic Splines (Secara Visual)

1. Knots 1-6

attr(ns(Auto$displacement, df=2),"knots")
## 50% 
## 151
attr(ns(Auto$displacement, df=3),"knots")
## 33.33333% 66.66667% 
##       119       232
attr(ns(Auto$displacement, df=4),"knots")
##    25%    50%    75% 
## 105.00 151.00 275.75
attr(ns(Auto$displacement, df=5),"knots")
## 20% 40% 60% 80% 
##  98 122 225 305
attr(ns(Auto$displacement, df=6),"knots")
## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##        97       119       151       232       318
attr(ns(Auto$displacement, df=7),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  96.85714 108.00000 134.57143 198.00000 250.00000 321.14286

Plot Natural Cubic Splines dengan Knots 1-6

ns11 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="red")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(151)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=1") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(151), col="grey50", lty=2)

ns22 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="coral")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(119,232)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=2") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(119,232), col="grey50", lty=2)

ns33 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="yellow")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(105, 151, 275.75)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=3") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(105, 151, 275.75), col="grey50", lty=2)

ns44 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="green")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(98, 122, 225, 305)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=4") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(98, 122, 225, 305), col="grey50", lty=2)

ns55<- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="blue")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(97,119,151,232,318)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=5") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(97,119,151,232,318), col="grey50", lty=2)

ns66<- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(96.85714,108, 134.57143, 198, 250, 321.14286)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=6") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(96.85714,108, 134.57143, 198, 250, 321.14286), col="grey50", lty=2)

Berdasarkan hasil plot di atas dapat dilihat bahwa plot natural cubic splines dengan knots 1 hingga 3 tidak memiliki pola yang tidak teratur. Sedangkan untuk knots 4 hingga 6 terdapat pola yang tidak teratur pada ujung plot. Sehingga untuk lebih meyakinkan dalam memilih model terbaik, dilakukan pemilihan model secara empiris.

B. Natural Cubic Splines (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:7 #knots 1-6

pemilihan.ncs1 <- map_dfr(df, function(i) {
  metric_ncs<- map_dfr(val_silang$splits,
                          function(x) {
                          modelncs <- lm(mpg ~ ns(displacement,df=i), data=Auto[x$in_id,])
                          predik <- predict(modelncs, newdata=Auto[-x$in_id,])
                          truth <- Auto[-x$in_id,]$mpg
                          rmse <- mlr3measures::rmse(truth = truth, response = predik)
                          mae <- mlr3measures::mae(truth = truth, response = predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )
  metric_ncs

# menghitung rata-rata untuk 10 folds
  mean_metric_ncs <- colMeans(metric_ncs)
  mean_metric_ncs
  }
)

pemilihan.ncs1 <- cbind(df=df,pemilihan.ncs1)

Perbandingan RMSE dan MAE

datatable(pemilihan.ncs1)

Plot RMSE dan MAE

pemilihan.ncs1$df <- as.factor(pemilihan.ncs1$df)
nrmse1<-ggplot(data=pemilihan.ncs1, aes(x=df, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

nmae1<-ggplot(data=pemilihan.ncs1, aes(x=df, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Mean Absolute Error")

plot_grid(nrmse1,nmae1)

Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model natural cubic splines dengan knots 6, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model natural cubic splines dengan knots 4 hingga 6 memiliki pola yang tidak teratur pada ujung-ujung plot. Sehingga model natural cubic splines terbaik adalah dengan knots 1, karena memiliki nilai RMSE dan MAE terkecil di antara knots 1 hingga 3. Berikut merupakan plot dan summary model:

Plot Natural Cubic Splines

plotncs1 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(151)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=1") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(151), col="grey50", lty=2)
plotncs1

ncspline1 = lm(Auto$mpg ~ ns(Auto$displacement, knots = c(151)))
summary(ncspline1)
## 
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$displacement, knots = c(151)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3850  -2.3565  -0.3177   2.1249  20.5067 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             33.6114     0.4962   67.74   <2e-16 ***
## ns(Auto$displacement, knots = c(151))1 -30.1316     1.2818  -23.51   <2e-16 ***
## ns(Auto$displacement, knots = c(151))2 -14.3845     1.0485  -13.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.358 on 389 degrees of freedom
## Multiple R-squared:  0.6899, Adjusted R-squared:  0.6883 
## F-statistic: 432.6 on 2 and 389 DF,  p-value: < 2.2e-16

Perbandingan Model

nilai_rmse1 <- rbind(4.32105681825998,
                     4.23676691518348,
                     4.34109320633999,
                     4.31394442085514)

nilai_mae1<-rbind(3.16031517994995,
                  3.1001249378559,
                  3.186742052199,
                  3.16206372510196)

Metode <- c("Regresi Polinomial (d=2)",
            "Fungsi Tangga (knots=8)",
            "Cubic Splines (knots=1)",
            "Natural Cubic Splines (knots=1)")

perbandingan1 <- data.frame(Metode,nilai_rmse1, nilai_mae1)
datatable(perbandingan1)

Plot dari Masing-masing Model

#Regresi Polinomial
rp22 <- ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("Displacement") + 
  ylab("mile per gallon")  


#Regresi Fungsi Tangga
ft88<- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="green") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,9), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=8") +
  xlab("displacement") + ylab("mile per gallon")

#Cubic Splines
cs11 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="red")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(151)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=1") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(151), col="grey50", lty=2)

#Natural Cubic Splines
ns11 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
           geom_point(alpha=0.55, color="yellow")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(151)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=1") +
  xlab("displacement") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(151), col="grey50", lty=2)

plot_grid(rp22, ft88, cs11, ns11)

Berdasarkan nila RMSE dan MAE dapat disimpulkan bahwa model terbaik untuk memodelkanmpg vs displacement adalah model regresi fungsi tangga dengan knots 8. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat displacement (perpindahan mesin atau total volume bahan bakar yang dapat ditarik mesin selama satu siklus lengkap) dari kendaraan maka miles per gallon (jarak yang dapat ditempuh per galon bahan bakar) akan semakin kecil.

Pemodelan mpg vs weight

Regresi Polinomial

A. Regresi Polinomial (Secara Visual)

1. Regresi Polinomial Derajat 2

rp222 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("weight") + 
  ylab("mile per gallon") 

2. Regresi Polinomial Derajat 3

rp333 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,3,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 3") +
  xlab("weight") + 
  ylab("mile per gallon") 

3. Regresi Polinomial Derajat 4

rp444 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="navy") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,4,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 4") +
  xlab("weight") + 
  ylab("mile per gallon") 

4. Regresi Polinomial Derajat 5

rp555 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,5,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 5") +
  xlab("weight") + 
  ylab("mile per gallon") 
plot_grid(rp222, rp333, rp444, rp555)

Berdasarkan keempat plot di atas dapat dilihat bahwa plot regresi polinomial derajat 5 memiliki sedikit ketidakteraturan pola pada ujungnya, sedangkan untuk derajat 2 hingga 4 membentuk pola yang hampir sama dan tidak terdapat pola yang tidak teratur pada ujung plot. Sehingga, pada metode Regresi Polinomial dipilih derajat 2 karena model paling sederhana.

B. Regresi Polinomial (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 1:5

polinomial3 <- map_dfr(derajat, function(i) {
  metric_poli <- map_dfr(val_silang$splits,
                        function(x) {
                        mod <- lm(mpg ~ poly(weight,derajat=i), data=Auto[x$in_id,])
                        predik <- predict(mod, newdata=Auto[-x$in_id,])
                        truth <- Auto[-x$in_id,]$mpg
                        rmse <- mlr3measures::rmse(truth = truth, response = predik)
                        mae <- mlr3measures::mae(truth = truth, response = predik)
                        metric <- c(rmse,mae)
                        names(metric) <- c("RMSE","MAE")
                        return(metric)
                              }
                            )
  metric_poli
  mean_metric_poli<- colMeans(metric_poli) # menghitung rata-rata untuk 10 folds
  mean_metric_poli
  }
)

polinomial3<- cbind(derajat=derajat,polinomial3)

Perbandingan RMSE dan MAE

datatable(polinomial3)

Plot RMSE dan MAE

polinomial3$derajat <- as.factor(polinomial3$derajat)
prmse3<-ggplot(data=polinomial3, aes(x=derajat, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Root Mean Square Error")

polinomial3$derajat <- as.factor(polinomial3$derajat)
pmae3<-ggplot(data=polinomial3, aes(x=derajat, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Degree of Polynomial") + 
  ylab("Mean Absolute Error")

plot_grid(prmse3,pmae3, labels=c("RMSE","MAE"), label_size = 6)

Berdasarkan nilai RMSE dan MAE terendah, dapat disimpulkan bahwa model regresi polinomial terbaik adalah regresi polinomial derajat 2. Berikut merupakan plot dan summary model:

Plot Regresi Polinomial d=2

regpol222 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("Weight") + 
  ylab("mile per gallon")  

regpol222

poli222= lm(mpg ~ poly(weight,2), data = Auto)
summary(poli222)
## 
## Call:
## lm(formula = mpg ~ poly(weight, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6246  -2.7134  -0.3485   1.8267  16.0866 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        23.4459     0.2109 111.151  < 2e-16 ***
## poly(weight, 2)1 -128.4436     4.1763 -30.755  < 2e-16 ***
## poly(weight, 2)2   23.1589     4.1763   5.545 5.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.7137 
## F-statistic: 488.3 on 2 and 389 DF,  p-value: < 2.2e-16

Regresi Fungsi Tangga

A. Regresi Fungsi Tangga (Secara Visual)

Regresi Fungsi Tangga dengan Knots 1-10

ft111<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,2), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=1") +
  xlab("weight") + ylab("mile per gallon")
ft222<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="yellow") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,3), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=2") +
  xlab("weight") + ylab("mile per gallon")
ft333<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="green") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,4), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=3") +
  xlab("weight") + ylab("mile per gallon")

ft444<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,5), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=4") +
  xlab("weight") + ylab("mile per gallon")

ft555<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,6), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=5") +
  xlab("weight") + ylab("mile per gallon")

ft666<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,7), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=6") +
  xlab("weight") + ylab("mile per gallon")

ft777<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="yellow") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,8), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=7") +
  xlab("weight") + ylab("mile per gallon")

ft888<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="green") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,9), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=8") +
  xlab("weight") + ylab("mile per gallon")

ft999<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,10), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=9") +
  xlab("weight") + ylab("mile per gallon")

ft120<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,11), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=10") +
  xlab("weight") + ylab("mile per gallon")

Berdasarkan hasil plot di atas dapat dilihat bahwa plot dari regresi fungsi tangga yang dapat memberikan pola yang mulus mulai dari knots 5, namun pada knots 9 dan 10 terdapat ketidakteraturan pola dalam plot. Sehingga untuk lebih meyakinkan dalam memilih model, dilakukan pengecekan nilai RMSE dan MAE melalui pemilihan secara empiris.

B. Regresi Fungsi Tangga (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

breakk <- 2:11 #knots 1-10

pemilihanf.tangga12 <- map_dfr(breakk, function(i){
  metric_ftangga <- map_dfr(val_silang$splits,
                          function(x){
                          datatrain <- Auto[x$in_id,]
                          datatrain$weight <- cut(datatrain$weight,i)
                          modelt <- lm(mpg ~ weight, data=datatrain)
                          labs_x <- levels(modelt$model[,2])
                          labs_x_b <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                            upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                          datatesting <- Auto[-x$in_id,]
                          weight_new <- cut(datatesting$weight,c(labs_x_b[1,1],labs_x_b[,2]))
                          predik <- predict(modelt,   newdata=list(weight=weight_new))
                          truth <- datatesting$mpg
                          evaluasi <- na.omit(data.frame(truth,predik))
                          rmse <- mlr3measures::rmse(truth = evaluasi$truth,response = evaluasi$predik)
                          mae <- mlr3measures::mae(truth = evaluasi$truth,response = evaluasi$predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )

metric_ftangga
mean_metricftangga <- colMeans(metric_ftangga)
mean_metricftangga
})
pemilihanf.tangga12 <- cbind(breakk=breakk,pemilihanf.tangga12)

Perbandingan RMSE dan MAE

datatable(pemilihanf.tangga12)

Plot RMSE dan MAE

pemilihanf.tangga12$breakk <- as.factor(pemilihanf.tangga12$breakk)
trmse12<-ggplot(data=pemilihanf.tangga12, aes(x=breakk, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Root Mean Square Error")

pemilihanf.tangga12$breakk <- as.factor(pemilihanf.tangga12$breakk)
tmae12<-ggplot(data=pemilihanf.tangga12, aes(x=breakk, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("Breaks") + 
  ylab("Mean Absolute Error")

plot_grid(trmse12,tmae12)

Berdasarkan nilai RMSE dan MAE terkecil, model regresi fungsi tangga terbaik adalah regresi fungsi tangga dengan knots 5.

Berikut merupakan plot dan summary model regresi fungsi tangga dengan knots 5:

Plot Regresi Rungsi Tangga

ft555<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,6), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=5") +
  xlab("weight") + ylab("mile per gallon")

ft555

fungsitangga555 = lm(Auto$mpg ~ cut(Auto$weight,6)) #5 knots
summary(fungsitangga555)
## 
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$weight, 6))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.603  -2.486  -0.423   1.635  16.901 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             32.6033     0.4433   73.55   <2e-16 ***
## cut(Auto$weight, 6)(2.2e+03,2.79e+03]   -6.1043     0.6082  -10.04   <2e-16 ***
## cut(Auto$weight, 6)(2.79e+03,3.38e+03] -11.1170     0.6624  -16.78   <2e-16 ***
## cut(Auto$weight, 6)(3.38e+03,3.96e+03] -15.1324     0.6940  -21.80   <2e-16 ***
## cut(Auto$weight, 6)(3.96e+03,4.55e+03] -18.2380     0.7466  -24.43   <2e-16 ***
## cut(Auto$weight, 6)(4.55e+03,5.14e+03] -20.2283     1.1409  -17.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.205 on 386 degrees of freedom
## Multiple R-squared:  0.7134, Adjusted R-squared:  0.7097 
## F-statistic: 192.2 on 5 and 386 DF,  p-value: < 2.2e-16

Cubic Splines

A. Cubic Splines (Secara Visual)

1. Knots 1-6

attr(bs(Auto$weight, df=4),"knots")
##    50% 
## 2803.5
attr(bs(Auto$weight, df=5),"knots")
## 33.33333% 66.66667% 
##  2397.000  3333.667
attr(bs(Auto$weight, df=6),"knots")
##     25%     50%     75% 
## 2225.25 2803.50 3614.75
attr(bs(Auto$weight, df=7),"knots")
##    20%    40%    60%    80% 
## 2155.0 2583.2 3113.4 3820.8
attr(bs(Auto$weight, df=8),"knots")
## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##  2125.000  2397.000  2803.500  3333.667  3960.833
attr(bs(Auto$weight, df=9),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  2074.857  2285.429  2635.000  2986.571  3446.143  4096.286

Plot Cubic Splines dengan Knots 1-6

cs111 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="red")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(2803.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=1") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2803.5), col="grey50", lty=2)

cs222 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="pink")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(2397,3333.667)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=2") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2397,3333.667), col="grey50", lty=2)

cs333 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="orange")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(2225.25, 2803.50, 3614.75)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=3") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2225.25, 2803.50, 3614.75), col="grey50", lty=2)

cs444 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="green")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(2155.0, 2583.2, 3113.4, 3820.8)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=4") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2155.0, 2583.2, 3113.4, 3820.8), col="grey50", lty=2)

cs555<- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="blue")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(2125.000,2397.000, 2803.500,3333.667,3960.833)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=5") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2125.000,2397.000, 2803.500,3333.667,3960.833), col="grey50", lty=2)

cs666<- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(2074.857,2285.429,  2635.000,2986.571,3446.143,4096.286)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=6") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2074.857,2285.429,  2635.000,2986.571,3446.143,4096.286), col="grey50", lty=2)

Berdasarkan hasil plot di atas dapat disimpulkan bahwa model cubic splines terbaik adalah cubic splines dengan knots 1 karena mulai knots 2 hingga 6 terdapat pola yang tidak teratur pada ujung plot.

B. Cubic Splines (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 4:9 #knots 1-6

pemilihan.cs12 <- map_dfr(df, function(i) {
  metric_cs<- map_dfr(val_silang$splits,
                          function(x) {
                          modelcs <- lm(mpg ~ bs(weight,df=i), data=Auto[x$in_id,])
                          predik <- predict(modelcs, newdata=Auto[-x$in_id,])
                          truth <- Auto[-x$in_id,]$mpg
                          rmse <- mlr3measures::rmse(truth = truth, response = predik)
                          mae <- mlr3measures::mae(truth = truth, response = predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )
  metric_cs

# menghitung rata-rata untuk 10 folds
  mean_metric_cs <- colMeans(metric_cs)
  mean_metric_cs
  }
)

pemilihan.cs12 <- cbind(df=df,pemilihan.cs12)

Perbandingan RMSE dan MAE

datatable(pemilihan.cs12)

Plot RMSE dan MAE

pemilihan.cs12$df <- as.factor(pemilihan.cs12$df)
crmse12<-ggplot(data=pemilihan.cs12, aes(x=df, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

cmae12<-ggplot(data=pemilihan.cs12, aes(x=df, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Mean Absolute Error")

plot_grid(crmse12,cmae12)

Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model cubic splines dengan knots 6, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model cubic splines dengan knots 6 memiliki ketidakteraturan pola pada ujung-ujung plot. Sehingga model cubic splines terbaik adalah dengan knots 1. Berikut merupakan plot dan summary model:

Plot Cubic Splines

plotcs12 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(2803.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines with Knots=1") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2803.5), col="grey50", lty=2)
plotcs12

cspline12 = lm(Auto$mpg ~ bs(Auto$weight, knots = c(2803.5)))
summary(cspline12)
## 
## Call:
## lm(formula = Auto$mpg ~ bs(Auto$weight, knots = c(2803.5)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7865  -2.7357  -0.4536   1.8590  16.2655 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           35.737      1.675  21.330  < 2e-16 ***
## bs(Auto$weight, knots = c(2803.5))1   -3.245      2.553  -1.271    0.204    
## bs(Auto$weight, knots = c(2803.5))2  -18.339      1.792 -10.231  < 2e-16 ***
## bs(Auto$weight, knots = c(2803.5))3  -21.830      2.897  -7.535 3.48e-13 ***
## bs(Auto$weight, knots = c(2803.5))4  -24.419      2.338 -10.443  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.184 on 387 degrees of freedom
## Multiple R-squared:  0.7156, Adjusted R-squared:  0.7127 
## F-statistic: 243.5 on 4 and 387 DF,  p-value: < 2.2e-16

Natural Cubic Splines

A. Natural Cubic Splines (Secara Visual)

1. Knots 1-6

attr(ns(Auto$weight, df=2),"knots")
##    50% 
## 2803.5
attr(ns(Auto$weight, df=3),"knots")
## 33.33333% 66.66667% 
##  2397.000  3333.667
attr(ns(Auto$weight, df=4),"knots")
##     25%     50%     75% 
## 2225.25 2803.50 3614.75
attr(ns(Auto$weight, df=5),"knots")
##    20%    40%    60%    80% 
## 2155.0 2583.2 3113.4 3820.8
attr(ns(Auto$weight, df=6),"knots")
## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##  2125.000  2397.000  2803.500  3333.667  3960.833
attr(ns(Auto$weight, df=7),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  2074.857  2285.429  2635.000  2986.571  3446.143  4096.286

Plot Natural Cubic Splines dengan Knots 1-6

ns111 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="red")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(2803.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=1") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2803.5), col="grey50", lty=2)

ns222 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="yellow")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(2397,3333.667)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=2") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2397,3333.667), col="grey50", lty=2)

ns333 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="coral")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(2225.25, 2803.50, 3614.75)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=3") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2225.25, 2803.50, 3614.75), col="grey50", lty=2)

ns444 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="green")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(2155.0, 2583.2, 3113.4, 3820.8)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=4") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2155.0, 2583.2, 3113.4, 3820.8), col="grey50", lty=2)

ns555<- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="blue")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(2125.000,2397.000, 2803.500,3333.667,3960.833)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=5") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2125.000,2397.000, 2803.500,3333.667,3960.833), col="grey50", lty=2)

ns666<- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(2074.857,2285.429,  2635.000,2986.571,3446.143,4096.286)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=6") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2074.857,2285.429,  2635.000,2986.571,3446.143,4096.286), col="grey50", lty=2)

Berdasarkan hasil plot di atas dapat dilihat bahwa plot natural cubic splines dengan knots 1 hingga 5 tidak memiliki pola yang tidak teratur. Sedangkan untuk knots 6 terdapat pola yang tidak teratur pada ujung plot. Sehingga untuk lebih meyakinkan dalam memilih model terbaik, dilakukan pemilihan model secara empiris.

B. Natural Cubic Splines (Secara Empiris)

set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:7 #knots 1-6

pemilihan.ncs12 <- map_dfr(df, function(i) {
  metric_ncs<- map_dfr(val_silang$splits,
                          function(x) {
                          modelncs <- lm(mpg ~ ns(weight,df=i), data=Auto[x$in_id,])
                          predik <- predict(modelncs, newdata=Auto[-x$in_id,])
                          truth <- Auto[-x$in_id,]$mpg
                          rmse <- mlr3measures::rmse(truth = truth, response = predik)
                          mae <- mlr3measures::mae(truth = truth, response = predik)
                          metric <- c(rmse,mae)
                          names(metric) <- c("RMSE","MAE")
                          return(metric)
                          }
                          )
  metric_ncs

# menghitung rata-rata untuk 10 folds
  mean_metric_ncs <- colMeans(metric_ncs)
  mean_metric_ncs
  }
)

pemilihan.ncs12 <- cbind(df=df,pemilihan.ncs12)

Perbandingan RMSE dan MAE

datatable(pemilihan.ncs12)

Plot RMSE dan MAE

pemilihan.ncs12$df <- as.factor(pemilihan.ncs12$df)
nrmse12<-ggplot(data=pemilihan.ncs12, aes(x=df, y=RMSE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Root Mean Square Error")

nmae12<-ggplot(data=pemilihan.ncs12, aes(x=df, y=MAE, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()+
  ggtitle("K-Fold Cross Validation") +
  xlab("DF") + ylab("Mean Absolute Error")

plot_grid(nrmse12,nmae12)

Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model natural cubic splines dengan knots 6, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model natural cubic splines dengan knots 6 memiliki pola yang tidak teratur pada ujung-ujung plot. Sehingga model natural cubic splines terbaik adalah dengan knots 1, karena memiliki nilai RMSE dan MAE terkecil di antara knots 1 hingga 5. Berikut merupakan plot dan summary model:

Plot Natural Cubic Splines

plotncs12<- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="purple")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(2803.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=1") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2803.5), col="grey50", lty=2)
plotncs12

ncspline12 = lm(Auto$mpg ~ ns(Auto$weight, knots = c(2803.5)))
summary(ncspline12)
## 
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$weight, knots = c(2803.5)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6845  -2.7067  -0.3965   1.8145  16.2062 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          36.6987     0.6448   56.92   <2e-16 ***
## ns(Auto$weight, knots = c(2803.5))1 -34.8674     1.4493  -24.06   <2e-16 ***
## ns(Auto$weight, knots = c(2803.5))2 -18.3763     1.0192  -18.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.175 on 389 degrees of freedom
## Multiple R-squared:  0.7153, Adjusted R-squared:  0.7138 
## F-statistic: 488.7 on 2 and 389 DF,  p-value: < 2.2e-16

Perbandingan Model

nilai_rmse12 <- rbind(4.14814621667258,
                      4.21960381450313, 
                      4.16083965104084,
                      4.14665995693387)

nilai_mae12<-rbind(3.06116897397545,
                   3.12547480366824,
                   3.06302776816373,
                   3.0590341190436)

Metode <- c("Regresi Polinomial (d=2)",
            "Fungsi Tangga (knots=5)",
            "Cubic Splines (knots=1)",
            "Natural Cubic Splines (knots=1)")

perbandingan12 <- data.frame(Metode,nilai_rmse12, nilai_mae12)
datatable(perbandingan12)

Plot dari Masing-masing Model

#Regresi Polinomial
rp222 <- ggplot(Auto,aes(x=weight, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Polinomial Derajat 2") +
  xlab("weight") + 
  ylab("mile per gallon") 

#Regresi Fungsi Tangga
ft555<- ggplot(Auto,aes(x=weight, y=mpg)) +
                 geom_point(alpha=0.55, color="red") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,6), 
              lty = 1, col = "black",se = F)+
  theme_bw()+
  ggtitle("Regresi Fungsi Tangga Knots=5") +
  xlab("weight") + ylab("mile per gallon")

#Cubic Splines
cs111 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="yellow")+
           stat_smooth(method = "lm", 
           formula = y~bs(x, knots=c(2803.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Cubic Splines Knots=1") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2803.5), col="grey50", lty=2)

#Natural Cubic Splines
ns111 <- ggplot(Auto,aes(x=weight, y=mpg)) +
           geom_point(alpha=0.55, color="blue")+
           stat_smooth(method = "lm", 
           formula = y~ns(x, knots=c(2803.5)), 
           lty = 1,col = "black", se=F)+
           theme_bw()+
    ggtitle("Natural Cubic Splines Knots=1") +
  xlab("weight") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(2803.5), col="grey50", lty=2)

plot_grid(rp222, ft555, cs111, ns111)

Berdasarkan nila RMSE dan MAE dapat disimpulkan bahwa model terbaik untuk memodelkanmpg vs weight adalah model natural cubic splines dengan knots 1. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat weight (berat/bobot) dari kendaraan maka miles per gallon (jarak yang dapat ditempuh per galon bahan bakar) akan semakin kecil.

Referensi

  1. Dito, G.A. (April 14, 2021). Regresi Polinomial, Regresi Fungsi Tangga dan Regresi Spline di R. Retrieved from https://gerrydito.github.io/Regresi-Polinomial,-Regresi-Fungsi-Tangga-dan-Regresi-Spline/

  2. James, Gareth. et.al. (2013). An Introduction to Statistical Learning Aplication in R. New York: Springer.

  3. Sartono, B.. (November 1, 2020). Moving Beyond Linearity. Retrieved from https://rpubs.com/bagusco/taklinear