Tugas Praktikum 8 Sains Data

Soal

Gunakan model-model non-linier yang sudah dipelajari pada data Auto dengan peubah respon mpg dan pilih tiga kolom untuk dijadikan peubah prediktor.

a.) Apakah ada bukti untuk hubungan non-linier untuk peubah-peubah yang anda pilih? Buat visualisasi untuk Mendukung jawaban Anda.

b.) Tentukan model non-linear terbaik untuk masing pasangan peubah (Secara visual dan/atau secara empiris)

Library

library(ISLR)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(splines)
library(rsample)
library(DT)
library(ggpubr)

Dataset

Data yang digunakan adalah Auto dari package ISLR yang terdiri dari 392 observasi dan 9 variabel. Berikut keterangan 9 Variabel yang ada.

mpg: miles per gallon
cylinders: Number of cylinders between 4 and 8
displacement: Engine displacement (cu. inches)
horsepower: Engine horsepower
weight: Vehicle weight (lbs.)
acceleration: Time to accelerate from 0 to 60 mph (sec.)
year: Model year (modulo 100)
origin: Origin of car (1. American, 2. European, 3. Japanese)
name: Vehicle name

data("Auto",package="ISLR")
datatable(Auto)

plot_dis <-ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="pink") + 
  ggtitle("Scatterplot mpg vs displacement") +
  xlab("displacement")+
  ylab("mpg")+
  theme_bw() 
plot_accel<-ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") + 
  ggtitle("Scatterplot mpg vs acceleration") +
  xlab("acceleration")+
  ylab("mpg")+
  theme_bw() 

plot_year<-ggplot(Auto,aes(x=year, y=mpg)) + 
  geom_point(alpha=0.55, color="purple") + 
  ggtitle("Scatterplot mpg vs year") +
  xlab("year")+
  ylab("mpg")+
  theme_bw() 

ggarrange(plot_dis,plot_accel,plot_year, nrow = 1)

Dari visualisasi tersebut hubungan antara data peubah respons dan data peubah prediktor non-linier sehingga dilakukan pemodelan hubungan non-linier dengan menggunakan: 1) Polynomial Regression 2) Piecewise Constant 3) Natural Cubic Splines

Model MPG vs Displacement

Regresi Polinomial

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 2:4

polinomial <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x) {
                           mod <- lm(mpg ~ poly(displacement,derajat=i),data=Auto[x$in_id,])
                           pred <- predict(mod, newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth, response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                         }
  )
  metric_poly
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly
}
)

polinomial <- cbind(derajat=derajat,polinomial)
datatable(polinomial)

polinomial %>% slice_min(mae)

##   derajat     rmse      mae
## 1       2 4.361273 3.172365

polinomial %>% slice_min(rmse)

##   derajat     rmse      mae
## 1       2 4.361273 3.172365

Secara empiris jika dilihat dari rmse dan mae terlihat bahwa polinomial pangkat 2 memberikan nilai rmse dan mae yang terkecil sehingga dipilih polinomial pangkat 2.

Plot Terbaik Polinomial Regresi db=2

Poly2<-ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "magenta",se = F)+
  theme_bw()
  Poly2

Secara visual dan empiris plot untuk regresi polinomial terbaik adalah yang derajat 2

poly2 = lm(mpg ~ poly(displacement,2), data = Auto)
summary(poly2)

## 
## Call:
## lm(formula = mpg ~ poly(displacement, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2165  -2.2404  -0.2508   2.1094  20.5158 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              23.4459     0.2205 106.343  < 2e-16 ***
## poly(displacement, 2)1 -124.2585     4.3652 -28.466  < 2e-16 ***
## poly(displacement, 2)2   31.0895     4.3652   7.122 5.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.365 on 389 degrees of freedom
## Multiple R-squared:  0.6888, Adjusted R-squared:  0.6872 
## F-statistic: 430.5 on 2 and 389 DF,  p-value: < 2.2e-16

Piecewise Constant

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

breaks <- 2:11

fungsitangga <- map_dfr(breaks, function(i){
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$displacement <- cut(training$displacement,i)
                             mod <- lm(mpg ~ displacement, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*","\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                             testing <- Auto[-x$in_id,]
                             displacement_new <- cut(testing$displacement,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
                             pred <- predict(mod,   newdata=list(displacement=displacement_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                           }
  )
  
  metric_tangga
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_tangga <- colMeans(metric_tangga)
  
  mean_metric_tangga
})
fungsitangga <- cbind(breaks=breaks,fungsitangga)
datatable(fungsitangga)

fungsitangga %>% slice_min(mae)

##   breaks     rmse      mae
## 1      9 4.265649 3.130846

fungsitangga %>% slice_min(rmse)

##   breaks     rmse      mae
## 1      9 4.265649 3.130846

Berdasarkan nilai RMSE & MAE model piecewise constant terbaik adalah piecewise constant dengan knots=8

pc8<- ggplot(Auto,aes(x=displacement, y=mpg)) +
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,8), 
              lty = 1, col = "magenta",se = F)+
  theme_bw()+
  ggtitle("Piecewise Constant with Knots=8") +
  xlab("displacement") + ylab("mile per gallon")
pc8

Berikut merupakan summary model piecewise constant dengan knots=7:

ftangga9 = lm(Auto$mpg ~ cut(Auto$displacement,9)) #8 knots
summary(ftangga9)

## 
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$displacement, 9))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5181  -2.2250  -0.2905   2.0339  19.4607 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         31.5181     0.3947  79.850  < 2e-16 ***
## cut(Auto$displacement, 9)(111,154]  -5.5601     0.6010  -9.252  < 2e-16 ***
## cut(Auto$displacement, 9)(154,197]  -8.2848     1.0770  -7.693 1.23e-13 ***
## cut(Auto$displacement, 9)(197,240] -11.7022     0.7527 -15.547  < 2e-16 ***
## cut(Auto$displacement, 9)(240,283] -12.9788     0.8951 -14.499  < 2e-16 ***
## cut(Auto$displacement, 9)(283,326] -16.2276     0.7656 -21.197  < 2e-16 ***
## cut(Auto$displacement, 9)(326,369] -16.7923     0.8595 -19.537  < 2e-16 ***
## cut(Auto$displacement, 9)(369,412] -17.5494     1.1337 -15.479  < 2e-16 ***
## cut(Auto$displacement, 9)(412,455] -18.2959     1.4710 -12.438  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.251 on 383 degrees of freedom
## Multiple R-squared:  0.7094, Adjusted R-squared:  0.7033 
## F-statistic: 116.9 on 8 and 383 DF,  p-value: < 2.2e-16

Natural Cubic Splines

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:6

naturalspline3 <- map_dfr(df, function(i) {
  metric_spline3 <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ ns(displacement,df=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                            }
  )
  metric_spline3
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_spline3 <- colMeans(metric_spline3)
  mean_metric_spline3
}
)

naturalspline3 <- cbind(df=df,naturalspline3)
datatable(naturalspline3)

naturalspline3 %>% slice_min(mae)

##   df     rmse      mae
## 1  6 4.260237 3.165058

naturalspline3 %>% slice_min(rmse)

##   df     rmse      mae
## 1  6 4.260237 3.165058

Bersadarkan nilai rmse terkecil, didapat model natural cubic spline terbaik dengan df=6.

attr(ns(Auto$displacement, df=6),"knots")

## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##        97       119       151       232       318

ncs6 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm", 
              formula = y~ns(x, df=6), 
              lty = 1,col = "magenta", se=F)+
  theme_bw()+
  ggtitle("Natural Cubic Splines with df=6") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(97, 119, 151, 232, 318), col="grey50", lty=2)
ncs6

ncspline = lm(Auto$mpg ~ ns(Auto$displacement, knots =
c(97, 119, 151, 232, 318)))
summary(ncspline)

## 
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$displacement, knots = c(97, 119, 
##     151, 232, 318)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9381  -2.3829  -0.3102   2.1258  20.6988 
## 
## Coefficients:
##                                                           Estimate Std. Error
## (Intercept)                                                 29.254      1.409
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))1   -5.603      1.473
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))2   -2.879      2.154
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))3  -11.521      1.735
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))4  -16.379      1.372
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))5  -10.755      3.232
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))6  -19.107      1.483
##                                                           t value Pr(>|t|)    
## (Intercept)                                                20.760  < 2e-16 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))1  -3.803 0.000166 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))2  -1.336 0.182237    
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))3  -6.639 1.08e-10 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))4 -11.935  < 2e-16 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))5  -3.327 0.000961 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))6 -12.885  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.24 on 385 degrees of freedom
## Multiple R-squared:  0.7094, Adjusted R-squared:  0.7049 
## F-statistic: 156.6 on 6 and 385 DF,  p-value: < 2.2e-16

Perbandingan Model

nilai_rmse <- rbind(4.361273,
                    4.265649,
                    4.260237)

nilai_mae<-rbind(3.172365,
                 3.130846,
                 3.165058)

Metode <- c("Regresi Polinomial (d=2)",
            "Piecewise Constant (knots=8)",
            "Natural Cubic Splines (df=6)")

perbandingan <- data.frame(Metode,nilai_rmse, nilai_mae)
datatable(perbandingan)

ggplot(Auto,aes(x=displacement, y=mpg)) +
  geom_point(alpha=0.55, color="turquoise") + 
  ggtitle("Scatterplot Displacement VS mile per gallon") +
  stat_smooth(method = "lm",formula = y~poly(x,2,raw=T), 
              lty = 1, col = "yellow",se = F)+
  stat_smooth(method = "lm",formula = y~cut(x,8), 
              lty = 1, col = "purple",se = F)+
  stat_smooth(method = "lm",formula = y~ns(x,df=6), 
              lty = 1, col = "red",se = F)+
  theme_bw()

Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs displacement adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat displacement dari kendaraan maka miles per gallon akan semakin kecil atau kendaraan semakin boros.

Model MPG vs acceleration

Regresi Polinomial

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 2:4

polinomial <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x) {
                           mod <- lm(mpg ~ poly(acceleration,derajat=i),data=Auto[x$in_id,])
                           pred <- predict(mod, newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth, response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                         }
  )
  metric_poly
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly
}
)

polinomial <- cbind(derajat=derajat,polinomial)
datatable(polinomial)

polinomial %>% slice_min(mae)

##   derajat     rmse      mae
## 1       4 6.998135 5.690208

polinomial %>% slice_min(rmse)

##   derajat     rmse      mae
## 1       4 6.998135 5.690208

Secara empiris jika dilihat dari rmse dan mae terlihat bahwa polinomial pangkat 4 memberikan nilai rmse dan mae yang terkecil sehingga dipilih polinomial pangkat 4.

Plot Terbaik Polinomial Regresi db=4

Poly4<-ggplot(Auto,aes(x=acceleration, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,4,raw=T), 
              lty = 1, col = "magenta",se = F)+
  theme_bw()
Poly4

Secara visual dan empiris plot untuk regresi polinomial terbaik adalah yang derajat 4

poly4 = lm(mpg ~ poly(acceleration,4), data = Auto)
summary(poly4)

## 
## Call:
## lm(formula = mpg ~ poly(acceleration, 4), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4665  -5.3322  -0.8491   4.6067  22.9362 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             23.4459     0.3514  66.724  < 2e-16 ***
## poly(acceleration, 4)1  65.3340     6.9571   9.391  < 2e-16 ***
## poly(acceleration, 4)2 -18.7482     6.9571  -2.695  0.00735 ** 
## poly(acceleration, 4)3   6.0643     6.9571   0.872  0.38393    
## poly(acceleration, 4)4  20.7577     6.9571   2.984  0.00303 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.957 on 387 degrees of freedom
## Multiple R-squared:  0.2136, Adjusted R-squared:  0.2055 
## F-statistic: 26.28 on 4 and 387 DF,  p-value: < 2.2e-16

Piecewise Constant

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

breaks <- 2:11

fungsitangga <- map_dfr(breaks, function(i){
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$acceleration <- cut(training$acceleration,i)
                             mod <- lm(mpg ~ acceleration, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*","\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                             testing <- Auto[-x$in_id,]
                             acceleration_new <- cut(testing$acceleration,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
                             pred <- predict(mod,   newdata=list(acceleration=acceleration_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                           }
  )
  
  metric_tangga
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_tangga <- colMeans(metric_tangga)
  
  mean_metric_tangga
})
fungsitangga <- cbind(breaks=breaks,fungsitangga)
datatable(fungsitangga)

fungsitangga %>% slice_min(mae)

##   breaks     rmse      mae
## 1      3 6.984927 5.632108

fungsitangga %>% slice_min(rmse)

##   breaks     rmse      mae
## 1      3 6.984927 5.632108

Berdasarkan nilai RMSE & MAE model piecewise constant terbaik adalah piecewise constant dengan knots=2 namun berdasarkan visual piecewise constant yang terbaik dengan knots=5

pc5<- ggplot(Auto,aes(x=acceleration, y=mpg)) +
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,6), 
              lty = 1, col = "magenta",se = F)+
  theme_bw()+
  ggtitle("Piecewise Constant with Knots=5") +
  xlab("acceleration") + ylab("mile per gallon")
pc5

Berikut merupakan summary model piecewise constant dengan knots=5:

ftangga6 = lm(Auto$mpg ~ cut(Auto$acceleration,6)) #5 knots
summary(ftangga6)

## 
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$acceleration, 6))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6725  -5.0975  -0.6837   4.3191  20.9275 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            14.818      2.100   7.055 8.02e-12 ***
## cut(Auto$acceleration, 6)(10.8,13.6]    2.932      2.232   1.314     0.19    
## cut(Auto$acceleration, 6)(13.6,16.4]    9.866      2.171   4.544 7.40e-06 ***
## cut(Auto$acceleration, 6)(16.4,19.2]   10.854      2.211   4.910 1.35e-06 ***
## cut(Auto$acceleration, 6)(19.2,22]     12.441      2.492   4.993 9.01e-07 ***
## cut(Auto$acceleration, 6)(22,24.8]     15.896      3.368   4.720 3.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.966 on 386 degrees of freedom
## Multiple R-squared:  0.2137, Adjusted R-squared:  0.2035 
## F-statistic: 20.98 on 5 and 386 DF,  p-value: < 2.2e-16

Natural Cubic Splines

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:6

naturalspline2 <- map_dfr(df, function(i) {
  metric_spline3 <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ ns(acceleration,df=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                            }
  )
  metric_spline3
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_spline3 <- colMeans(metric_spline3)
  mean_metric_spline3
}
)

naturalspline2 <- cbind(df=df,naturalspline2)
datatable(naturalspline2)

naturalspline2 %>% slice_min(mae)

##   df     rmse      mae
## 1  5 6.984689 5.625165

naturalspline2 %>% slice_min(rmse)

##   df     rmse      mae
## 1  6 6.961777 5.627009

Berdasarkan nilai rmse dan mae yang terkecil, model terbaik adalah natural cubic spline dengan df=6.

attr(ns(Auto$acceleration, df=6),"knots")

## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##  13.00000  14.50000  15.50000  16.50000  18.18333

ncs6 <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm", 
              formula = y~ns(x, df=6), 
              lty = 1,col = "magenta", se=F)+
  theme_bw()+
  ggtitle("Natural Cubic Splines with df=6") +
  xlab("Acceleration") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(13, 14.5, 15.5, 16.5, 18.183), col="grey50", lty=2)
ncs6

ncspline = lm(Auto$mpg ~ ns(Auto$acceleration, knots =
                              c(13, 14.5, 15.5, 16.5, 18.183)))
summary(ncspline)

## 
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$acceleration, knots = c(13, 14.5, 
##     15.5, 16.5, 18.183)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7756  -5.3712  -0.8021   4.5241  22.2375 
## 
## Coefficients:
##                                                                 Estimate
## (Intercept)                                                       16.234
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))1    9.117
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))2    7.469
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))3   10.660
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))4    9.162
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))5   10.313
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))6   20.427
##                                                                 Std. Error
## (Intercept)                                                          3.383
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))1      3.248
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))2      3.835
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))3      3.578
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))4      2.840
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))5      7.846
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))6      3.827
##                                                                 t value
## (Intercept)                                                       4.799
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))1   2.807
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))2   1.947
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))3   2.980
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))4   3.226
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))5   1.314
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))6   5.337
##                                                                 Pr(>|t|)    
## (Intercept)                                                     2.29e-06 ***
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))1  0.00526 ** 
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))2  0.05223 .  
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))3  0.00307 ** 
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))4  0.00136 ** 
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))5  0.18948    
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))6 1.61e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.932 on 385 degrees of freedom
## Multiple R-squared:  0.2234, Adjusted R-squared:  0.2113 
## F-statistic: 18.45 on 6 and 385 DF,  p-value: < 2.2e-16

Perbandingan Model

nilai_rmse <- rbind(6.998135,
                    7.004580,
                    4.305876)

nilai_mae<-rbind(5.690208,
                 5.6388286,
                 3.246447)

Metode <- c("Regresi Polinomial (d=4)",
            "Piecewise Constant (knots=5)",
            "Natural Cubic Splines (df=6)")

perbandingan <- data.frame(Metode,nilai_rmse, nilai_mae)
datatable(perbandingan)

ggplot(Auto,aes(x=acceleration, y=mpg)) +
  geom_point(alpha=0.55, color="turquoise") + 
  ggtitle("Scatterplot acceleration VS mile per gallon") +
  stat_smooth(method = "lm",formula = y~poly(x,4,raw=T), 
              lty = 1, col = "yellow",se = F)+
  stat_smooth(method = "lm",formula = y~cut(x,6), 
              lty = 1, col = "purple",se = F)+
  stat_smooth(method = "lm",formula = y~ns(x,df=6), 
              lty = 1, col = "red",se = F)+
  theme_bw()

Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs acceleration adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat acceleration dari kendaraan maka miles per gallon akan semakin besar atau kendaraan semakin hemat.

Model MPG vs horsepower

Regresi Polinomial

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

derajat <- 2:4

polinomial <- map_dfr(derajat, function(i) {
  metric_poly <- map_dfr(cross_val$splits,
                         function(x) {
                           mod <- lm(mpg ~ poly(horsepower,derajat=i),data=Auto[x$in_id,])
                           pred <- predict(mod, newdata=Auto[-x$in_id,])
                           truth <- Auto[-x$in_id,]$mpg
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth, response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                         }
  )
  metric_poly
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_poly <- colMeans(metric_poly)
  mean_metric_poly
}
)

polinomial <- cbind(derajat=derajat,polinomial)
datatable(polinomial)

polinomial %>% slice_min(mae)

##   derajat     rmse      mae
## 1       2 4.339298 3.268563

polinomial %>% slice_min(rmse)

##   derajat     rmse      mae
## 1       2 4.339298 3.268563

Secara empiris jika dilihat dari rmse dan mae terlihat bahwa polinomial pangkat 2 memberikan nilai rmse dan mae yang terkecil sehingga dipilih polinomial pangkat 2.

Plot Terbaik Polinomial Regresi db=2

Poly2<-ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,2,raw=T), 
              lty = 1, col = "magenta",se = F)+
  theme_bw()
Poly2

Secara visual dan empiris plot untuk regresi polinomial terbaik adalah yang derajat 2

poly2 = lm(mpg ~ poly(horsepower,2), data = Auto)
summary(poly2)

## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Piecewise Constant

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

breaks <- 2:11

fungsitangga <- map_dfr(breaks, function(i){
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- Auto[x$in_id,]
                             training$horsepower <- cut(training$horsepower,i)
                             mod <- lm(mpg ~ horsepower, data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*","\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
                             testing <- Auto[-x$in_id,]
                             horsepower_new <- cut(testing$horsepower,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
                             pred <- predict(mod,   newdata=list(horsepower=horsepower_new))
                             truth <- testing$mpg
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                           }
  )
  
  metric_tangga
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_tangga <- colMeans(metric_tangga)
  
  mean_metric_tangga
})
fungsitangga <- cbind(breaks=breaks,fungsitangga)
datatable(fungsitangga)

fungsitangga %>% slice_min(mae)

##   breaks     rmse      mae
## 1     11 4.478578 3.349871

fungsitangga %>% slice_min(rmse)

##   breaks     rmse      mae
## 1      8 4.454052 3.395497

Berdasarkan nilai RMSE & MAE model piecewise constant terbaik adalah piecewise constant dengan knots=7

pc8<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="blue") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,7), 
              lty = 1, col = "magenta",se = F)+
  theme_bw()+
  ggtitle("Piecewise Constant with Knots=7") +
  xlab("horsepower") + ylab("mile per gallon")
pc8

Berikut merupakan summary model piecewise constant dengan knots=7:

ftangga8 = lm(Auto$mpg ~ cut(Auto$horsepower,8)) #7 knots
summary(ftangga8)

## 
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$horsepower, 8))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9471  -2.6757  -0.1533   2.4015  14.5529 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       33.9085     0.5771   58.76   <2e-16 ***
## cut(Auto$horsepower, 8)(69,92]    -6.9614     0.6910  -10.07   <2e-16 ***
## cut(Auto$horsepower, 8)(92,115]  -12.7551     0.7425  -17.18   <2e-16 ***
## cut(Auto$horsepower, 8)(115,138] -15.6656     1.1264  -13.91   <2e-16 ***
## cut(Auto$horsepower, 8)(138,161] -18.7799     0.8568  -21.92   <2e-16 ***
## cut(Auto$horsepower, 8)(161,184] -19.9735     1.1470  -17.41   <2e-16 ***
## cut(Auto$horsepower, 8)(184,207] -21.1228     1.7720  -11.92   <2e-16 ***
## cut(Auto$horsepower, 8)(207,230] -21.0085     1.5159  -13.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.433 on 384 degrees of freedom
## Multiple R-squared:  0.6832,   Adjusted R-squared:  0.6774 
## F-statistic: 118.3 on 7 and 384 DF,  p-value: < 2.2e-16

Natural Cubic Splines

set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

df <- 2:6

naturalspline3 <- map_dfr(df, function(i) {
  metric_spline3 <- map_dfr(cross_val$splits,
                            function(x) {
                              mod <- lm(mpg ~ ns(horsepower,df=i), data=Auto[x$in_id,])
                              pred <- predict(mod, newdata=Auto[-x$in_id,])
                              truth <- Auto[-x$in_id,]$mpg
                              rmse <- mlr3measures::rmse(truth = truth, response = pred)
                              mae <- mlr3measures::mae(truth = truth, response = pred)
                              metric <- c(rmse,mae)
                              names(metric) <- c("rmse","mae")
                              return(metric)
                            }
  )
  metric_spline3
  
  # menghitung rata-rata untuk 10 folds
  mean_metric_spline3 <- colMeans(metric_spline3)
  mean_metric_spline3
}
)

naturalspline3 <- cbind(df=df,naturalspline3)
datatable(naturalspline3)

naturalspline3 %>% slice_min(mae)

##   df     rmse      mae
## 1  6 4.305876 3.246447

naturalspline3 %>% slice_min(rmse)

##   df     rmse      mae
## 1  6 4.305876 3.246447

attr(ns(Auto$horsepower, df=6),"knots")

## 16.66667% 33.33333%       50% 66.66667% 83.33333% 
##      70.0      84.0      93.5     110.0     150.0

ncs6 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="blue")+
  stat_smooth(method = "lm", 
              formula = y~ns(x, df=6), 
              lty = 1,col = "magenta", se=F)+
  theme_bw()+
  ggtitle("Natural Cubic Splines with df=6") +
  xlab("Horse Power") + 
  ylab("mile per gallon") +
  geom_vline(xintercept = c(70, 84, 93.5, 110, 150), col="grey50", lty=2)
ncs6

ncspline = lm(Auto$mpg ~ ns(Auto$horsepower, knots =
c(70, 84, 93.5, 110, 150)))
summary(ncspline)

## 
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$horsepower, knots = c(70, 84, 
##     93.5, 110, 150)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9491  -2.6183  -0.1595   2.3508  15.1349 
## 
## Coefficients:
##                                                         Estimate Std. Error
## (Intercept)                                               34.738      1.509
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))1   -8.210      1.594
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))2  -13.046      1.835
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))3  -14.577      1.886
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))4  -22.802      1.624
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))5  -22.758      3.512
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))6  -20.849      1.742
##                                                         t value Pr(>|t|)    
## (Intercept)                                              23.021  < 2e-16 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))1  -5.149 4.18e-07 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))2  -7.108 5.76e-12 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))3  -7.730 9.50e-14 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))4 -14.039  < 2e-16 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))5  -6.480 2.81e-10 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))6 -11.967  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.302 on 385 degrees of freedom
## Multiple R-squared:  0.7009, Adjusted R-squared:  0.6962 
## F-statistic: 150.4 on 6 and 385 DF,  p-value: < 2.2e-16

Perbandingan Model

nilai_rmse <- rbind(4.339298,
                    4.454052,
                    4.305876)

nilai_mae<-rbind(3.268563,
                 3.395497,
                 3.246447)

Metode <- c("Regresi Polinomial (d=2)",
            "Piecewise Constant (knots=7)",
            "Natural Cubic Splines (df=6)")

perbandingan <- data.frame(Metode,nilai_rmse, nilai_mae)
datatable(perbandingan)

ggplot(Auto,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="turquoise") + 
  ggtitle("Scatterplot horsepower VS mile per gallon") +
  stat_smooth(method = "lm",formula = y~poly(x,2,raw=T), 
              lty = 1, col = "yellow",se = F)+
  stat_smooth(method = "lm",formula = y~cut(x,7), 
              lty = 1, col = "purple",se = F)+
  stat_smooth(method = "lm",formula = y~ns(x,df=6), 
              lty = 1, col = "red",se = F)+
  theme_bw()

Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs horsepower adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat horsepower dari kendaraan maka miles per gallon akan semakin kecil atau kendaraan semakin boros.

Kesimpulan

Berdasarkan perbandingan dari ketiga model regresi non-linier tersebut didapat masing-masing model terbaik untuk setiap pasangan peubah respons vs peubah prediktor.

mpg vs displacement : Regresi piecewise constant (knots=8) Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs displacement adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat displacement (Engine displacement) dari kendaraan maka miles per gallon akan semakin kecil atau kendaraan semakin boros.

mpg vs horsepower : Regresi natural cubic splines (df=6) Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs horsepower adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat Engine horsepower dari kendaraan maka miles per gallon akan semakin kecil atau kendaraan semakin boros.

mpg vs acceleration : Regresi natural cubic splines (df=6) Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs acceleration adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat acceleration atau waktu yang dibutuhkan untuk akselerasi dari 0 hingga 60 mph (miles per hour) dari kendaraan maka miles per gallon akan semakin besar atau kendaraan semakin hemat.