TUGAS PRAKTIKUM SAINS DATA

NAMA : SRI FITRIYANTI

NIM : G1501211025

TUGAS

Gunakan model-model non-linier yang sudah dipelajari pada data Auto dengan peubah respon mpg dan pilih tiga kolom untuk dijadikan peubah prediktor.

Apakah ada bukti untuk hubungan non-linier untuk peubah-peubah yang anda pilih? Buat visualisasi untuk mendukung jawaban Anda.
Tentukan model non-linear terbaik untuk masing-masing pasangan peubah (Secara visual dan/atau secara empiris)

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(splines)
library(ISLR)
library(boot)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(ggplot2)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(ggthemes)
library(broom)
library(knitr)
library(rsample)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggthemes':
## 
##     theme_map

library(dplyr)
library(purrr)
library(DT)
library(mlr3measures)

## In order to avoid name clashes, do not attach 'mlr3measures'. Instead, only load the namespace with `requireNamespace("mlrmeasures")` and access the measures directly via `::`, e.g. `mlr3measures::auc()`.

## 
## Attaching package: 'mlr3measures'

## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity

Dataset ini adalah data yang berisi tentang Miles Per Gallon (MPG) atau bahan bakar yang dihabiskan oleh setiap mobil yang dipengaruhi beberapa faktor yaitu : - Acceleration - Displacement - Horse Power - Weight - Cylinders - Model Year - Origin

Dimana pada dataset Auto terdiri atas 392 observasi. Akan dipilih peubah respon yaitu mpg dan peubah prediktor yaitu horsepower, acceleration dan displacement.

1. HUBUNGAN ANTARA RESPON MPG DAN PREDITOR (HORSEPOWER,ACCELERATION, DISPLACEMENT)

Bagian ini akan menunjukkan dari 3 variabel prekditor yang dipilih yaitu horsepower, displacement dan acceleration akan diperiksa hubungan non-linier dengan mpg, akan ditunjukkan secara empiris dan visual dengan mengunakan 3 metode yaitu polinomial, fungsi tangga dan natural spline.

a. Hubungan NON-LINIER MPG dan HORSEPOWER

#import data
data("Auto",package="ISLR") 
attach(Auto)

## The following object is masked from package:ggplot2:
## 
##     mpg

head(cbind(mpg, horsepower))

##      mpg horsepower
## [1,]  18        130
## [2,]  15        165
## [3,]  18        150
## [4,]  16        150
## [5,]  17        140
## [6,]  15        198

plot(horsepower, mpg, pch=19, col="darkgrey",
     xlab="Horse Power", ylab="mile per gallon")

#regresi linier
mod_linear = lm(mpg~horsepower,data=Auto)
summary(mod_linear)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

ggplot(Auto,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="darkgrey") +
  stat_smooth(method = "lm", 
              formula = y~x,lty = 1,
              col = "red",se = T)+
  theme_bw()

pada plot diatas dapat dilihat sebaran data antara mpg vs horsepower sudah terihat tidak linier, sehingga jika dipaksakan menggunakan model linier seperti pada gambar sangat tidak baik. Berikut akan di cek hubungan non-linier antara mpg vs horsepower dengan Regresi polinomial, kemudia dapat dilakukan pengecekan hubungan dengan metode fungsi tangga dan natural spline.

- Regresi Polinomial

folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()
set.seed(888)
for (k in 1:10) {
  for (i in 1:10) {
    model <- lm(mpg ~ poly(horsepower,i), data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
  }
}
errors <- apply(errors, 2, mean)

data_frame(RMSE = errors) %>%
  mutate(Polynomial = row_number()) %>%
  ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) + 
  geom_col() +
  scale_x_continuous(breaks = 1:10) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

poly_model <- lm(mpg ~ poly(horsepower, which.min(errors)), data = Auto)
tidy(poly_model) %>%
  mutate(term = ifelse(row_number() == 1, 'Intercept',
                       paste0('$Horsepower^', row_number()-1,'$'))) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
Intercept	23.446	0.221	106.161	0.000
\(Horsepower^1\)	-120.138	4.373	-27.475	0.000
\(Horsepower^2\)	44.090	4.373	10.083	0.000
\(Horsepower^3\)	-3.949	4.373	-0.903	0.367
\(Horsepower^4\)	-5.188	4.373	-1.186	0.236

Model terbaik polinomial yang terpilih adalah derajat 6 karena memiliki nilai RMSE terkecil daripada yang lainnya. Namun dilihat dari tabel secara statistik yang signifikan hanyalah koefisien derajat 1,2 dan 5 saja. Oleh karena itu menunjukkan sangat kuat bahwa terdapat hubungan non-linier antara Horse power dan mpg.

Kemudian akan di tunjukkan dalam plot bagaimana regresi polynomial dengan nilai RMSE terkecil diatas yaitu polinomial derajat 3 untuk mpg vs horsepower. Dapat dilihat pada ujung-ujung plot walaupun memiliki nilai RMSE terkecil polinomial derajat 6 masih terlihat tidak stabil atau ‘liar’.

poly6 <- Auto %>%
  mutate(Predictions = predict(poly_model, Auto)) %>%
  ggplot() + xlab('Horsepower') + ylab('Miles Per Gallon') +
  geom_point(aes(horsepower, mpg, col = 'blue')) +
  geom_line(aes(horsepower, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
poly6

Kemudian memoodelkan data mpg vs horsepower dengan fungsi tangga dan natural spline, yang juga menggunakan Cross Validation 10 fold.

- Fungsi Tangga

folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()
for (k in 1:10) {
  for (i in 1:10) {
    Auto$cut <- cut(Auto$horsepower, i+1)
    model <- lm(mpg ~ horsepower*cut, data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
    Auto$cut <- NULL
  }
}
errors <- apply(errors, 2, mean)

data_frame(RMSE = errors) %>%
  mutate(Steps = row_number() +1) %>%
  ggplot(aes(Steps, RMSE, fill = Steps == (which.min(errors) + 1))) + 
  geom_col() +
  scale_x_continuous(breaks = 2:(length(errors)+1)) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

#dari gambar dapat dilihat bahwa untuk horsepower diperlukan 7 potongan 

min_error <- which.min(errors) + 1
step_model <- lm(mpg ~ horsepower *cut(horsepower, min_error), data = Auto)
tidy(step_model) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	46.124	4.373	10.547	0.000
horsepower	-0.212	0.068	-3.128	0.002
cut(horsepower, min_error)(72.3,98.6]	7.795	6.257	1.246	0.214
cut(horsepower, min_error)(98.6,125]	-34.673	10.334	-3.355	0.001
cut(horsepower, min_error)(125,151]	-8.606	10.917	-0.788	0.431
cut(horsepower, min_error)(151,177]	-28.510	19.688	-1.448	0.148
cut(horsepower, min_error)(177,204]	-17.650	31.074	-0.568	0.570
cut(horsepower, min_error)(204,230]	-73.279	43.413	-1.688	0.092
horsepower:cut(horsepower, min_error)(72.3,98.6]	-0.115	0.085	-1.347	0.179
horsepower:cut(horsepower, min_error)(98.6,125]	0.295	0.110	2.675	0.008
horsepower:cut(horsepower, min_error)(125,151]	0.060	0.097	0.616	0.538
horsepower:cut(horsepower, min_error)(151,177]	0.190	0.134	1.413	0.159
horsepower:cut(horsepower, min_error)(177,204]	0.130	0.177	0.735	0.463
horsepower:cut(horsepower, min_error)(204,230]	0.395	0.209	1.892	0.059

pc7 <- Auto %>%
  mutate(Predictions = predict(step_model, Auto)) %>%
  ggplot() + xlab('Horse Power') + ylab('Miled Per Gallon') + 
  geom_point(aes(horsepower, mpg, col = 'blue')) +
  geom_line(aes(horsepower, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))

pc7

Pada fungsi tangga dengan CV 10 fold diperoleh bahwa steps atau breaks yang memiliki nilai RMSE terkecil yaitu pada breaks ke 7. Sehingga seperti terlihat diatas plot fungsi tangga untuk mpg vs horsepower bentuk tangga yang tidak terlalu jelas setiap tangga atau potongannya tetapi plot fungsi tangga 7 sudah menunjukkan dengan cukup baik . Untuk menlihat hubungan non-linier antara hubungan pada fungsi tangga diperoleh bahwa dengan 7 potongan/breaks untuk variabel horsepower. Dan dapat dilihat pada tabel statistik diatas bahwa dengan potongan sebanyak 7 ternyata masih ada beberapa potongan yang tidak signifikan karena nilai p-value yang besar dari 0,05, sehingga dapat dikatakan bahwa antara mpg dan weight memiliki hubungan non-linier.

- Natural Spline

set.seed(888)
spline_model <- train(mpg ~ horsepower, data
                      = Auto,
                      method = 'gamSpline',
                      trControl = trainControl(method = 'cv', number = 10),
                      tuneGrid = expand.grid(df = seq(1, 12, 1)))

## Loading required package: gam

## Loading required package: foreach

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loaded gam 1.20

disp_lm <- train(mpg ~ horsepower, data = Auto, method = 'lm',
                 trControl = trainControl(method = 'cv', number = 10))
plot(spline_model)

anova(disp_lm$finalModel, 
      spline_model$finalModel) %>%
  tidy %>%
  mutate(model = c('Linear', 'Spline'),
         percent = sumsq/lag(rss) * 100) %>% 
  select(model, rss, percent, sumsq, p.value) %>%
  kable(digits = 3)

model	rss	percent	sumsq	p.value
Linear	9385.916	NA	NA	NA
Spline	6876.918	26.732	2508.998	0

Pada tabel Anova diatas dapat dilihat jika dibandingkan dengan model linier makan model spline dengan df=12 meningkat 25.986 % lebih baik dari model linier sehingga dapat dikatakan dengan pasti bahwa antara mpg dan weight memiliki hubungan non-linier.

splineK12 <- Auto %>%
  mutate(pred = predict(spline_model, Auto)) %>%
  ggplot() + 
  geom_point(aes(horsepower, mpg, col = 'blue')) +
  geom_line(aes(horsepower, pred, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  labs(x = 'Horse Power', y = 'Miles Per Gallon', title = 'Spline Relationship') +
  theme(legend.position = 'top')

splineK12

Kemudian untuk model natural spline dengan menggunakan package caret dilakukan cross validation 10 fold, diperoleh df=12 memiliki nilai RMSE terkecil dilihat pada grafik diatas, kemudian dibentuk visual dari model dengan df=12 pada gambar diatas.Pada plot diatas menunjukkan plot yang cukup mewakili data dari horsepower secara stabil, walau pada ujung masih sedikit ‘aneh’ atau ‘liar’.

b. Hubungan NON-LINIER MPG dan ACCELERATION

Dilakukan pengecekan hubungan antara mpg vs acceleration dengan cross validation 10 fold juga dengan menggunakan 3 metode seperti sebelumnya.

- Regresi Polinomial

folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()
set.seed (888)
for (k in 1:10) {
  for (i in 1:10) {
    model <- lm(mpg ~ poly(acceleration,i), data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
  }
}
errors <- apply(errors, 2, mean)
errors

##  [1] 5.790129 8.153305 6.257939 6.871151 6.313805 6.930685 6.976644 8.627511
##  [9] 6.930772 8.822965

data_frame(RMSE = errors) %>%
  mutate(Polynomial = row_number()) %>%
  ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) + 
  geom_col() +
  scale_x_continuous(breaks = 1:10) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

poly_model <- lm(mpg ~ poly(acceleration, which.min(errors)), data = Auto)
tidy(poly_model) %>%
  mutate(term = ifelse(row_number() == 1, 'Intercept',
                       paste0('$acceleration^', row_number()-1,'$'))) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
Intercept	23.446	0.358	65.564	0
\(acceleration^1\)	65.334	7.080	9.228	0

Model terbaik polinomial adalah derajat 1 dari grafik nilai RMSE terkecil. Namun ketika melihat nilai p.value nya yang tidak signifikan berarti menunjukkan bahwa antara mpg dan memiliki hubungan non-linier. Berikut adalah visualisasi plot polinomial derajat 1 :

polys1 <- Auto %>%
  mutate(Predictions = predict(poly_model, Auto)) %>%
  ggplot() + xlab('acceleration') + ylab('Miles Per Gallon') +
  geom_point(aes(acceleration, mpg, col = 'blue')) +
  geom_line(aes(acceleration, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
polys1

dari plot model polinomial derajat 1 diatas sudah cukup baik mewakili data walau masih terlihat seperti model linier biasa.

- Fungsi Tangga

folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()
set.seed (888)
for (k in 1:10) {
  for (i in 1:10) {
    Auto$cut <- cut(Auto$acceleration, i+1)
    model <- lm(mpg ~ acceleration*cut, data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
    Auto$cut <- NULL
  }
}
errors<- apply(errors, 2, mean)
errors

##  [1] 5.696756 7.231980 6.223437 6.413796 7.154842 7.770116 8.818060 6.520029
##  [9] 7.223349 7.749723

data_frame(RMSE = errors) %>%
  mutate(Steps = row_number() +1) %>%
  ggplot(aes(Steps, RMSE, fill = Steps == (which.min(errors) + 1))) + 
  geom_col() +
  scale_x_continuous(breaks = 2:(length(errors)+1)) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

#dari gambar dapat dilihat bahwa untuk acceleration diperlukan 10 potongan 

min_error <- which.min(errors) + 1
step_model <- lm(mpg ~ acceleration *cut(acceleration, min_error), data = Auto)
tidy(step_model) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	-6.121	3.551	-1.724	0.086
acceleration	2.008	0.252	7.967	0.000
cut(acceleration, min_error)(16.4,24.8]	18.501	7.207	2.567	0.011
acceleration:cut(acceleration, min_error)(16.4,24.8]	-1.257	0.422	-2.981	0.003

pc1 <- Auto %>%
  mutate(Predictions = predict(step_model, Auto)) %>%
  ggplot() + xlab('acceleration') + ylab('Miled Per Gallon') + 
  geom_point(aes(acceleration, mpg, col = 'blue')) +
  geom_line(aes(acceleration, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
pc1

untuk menlihat hubungan non-linier antara hubungan pada fungsi tangga diperoleh bahwa dengan 1 potongan/breaks untuk variabel acceleration. dan dapat dilihat pada tabel statistik diatas bahwa dengan potongan sebanyak 1 ternyata m signifikan karena nilai p-value yang kecil dari 0,05, sehingga dapat dikatakan bahwa antara mpg dan acceleration memiliki hubungan non-linier.

- Natural Spline

spline_model <- train(mpg ~ acceleration, data
                      = Auto,
                      method = 'gamSpline',
                      trControl = trainControl(method = 'cv', number = 10),
                      tuneGrid = expand.grid(df = seq(1, 12, 1)))
disp_lm <- train(mpg ~ acceleration, data = Auto, method = 'lm',
                 trControl = trainControl(method = 'cv', number = 10))
disp_lm

## Linear Regression 
## 
## 392 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 353, 353, 352, 352, 353, 353, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   7.047074  0.187741  5.798512
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

plot(spline_model)

anova(disp_lm$finalModel, 
      spline_model$finalModel) %>%
  tidy %>%
  mutate(model = c('Linear', 'Spline'),
         percent = sumsq/lag(rss) * 100) %>% 
  select(model, rss, percent, sumsq, p.value) %>%
  kable(digits = 3)

model	rss	percent	sumsq	p.value
Linear	19550.46	NA	NA	NA
Spline	18701.24	4.344	849.217	0.001

Pada tabel Anova diatas dapat dilihat jika dibandingkan dengan model linier makan model spline dengan df=5 meningkat 4.344% lebih baik dari model linier sehingga dapat dikatakan dengan pasti bahwa antara mpg dan acceleration memiliki hubungan non-linier.

splineK4 <- Auto %>%
  mutate(pred = predict(spline_model, Auto)) %>%
  ggplot() + 
  geom_point(aes(acceleration, mpg, col = 'blue')) +
  geom_line(aes(acceleration, pred, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  labs(x = 'acceleration', y = 'Miles Per Gallon', title = 'Spline Relationship') +
  theme(legend.position = 'top')

splineK4

Dengan menggunakan metode natural spline akan di tunjukkan hubungan non-linier antara mpg dan acceleration. Dapat dilihat hasil cross validasi 10 fold pada grafik nilai RMSE bahwa plot menunjukkan untuk memilih basis spline dengan df=4 yang paling kecil nilai RMSE. Dapat dilihat plot yang ditunjukkan pada gambar diatas bahwa spline model sangat baik tidak ada bentuk aneh dari plot. Kemudian dapat dilihat pada tabel perbandingan dengan model linier biasa model spline pada nilai RSS meningkat sebesar 7,4% dan nilai p-value sebesar 0 sehingga dengan tepat dapat dikatakan bahwa antara mpg dan weight memiliki hubungan non-linier.

c. Hubungan Non-linier antara MPG VS DISPLACEMENT

- Regresi Polinomial

#regresi polinomial dengan CV-10
folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()

for (k in 1:10) {
  for (i in 1:10) {
    model <- lm(mpg ~ poly(displacement,i), data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
  }
}
errors <- apply(errors, 2, mean)

data_frame(RMSE = errors) %>%
  mutate(Polynomial = row_number()) %>%
  ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) + 
  geom_col() +
  scale_x_continuous(breaks = 1:10) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

#model terbaik polinomial adalah derajat 5 dari grafik tersebut

Hubungan antara mpg vs displacement pada grafik RMSE yang diperoleh menunjukkan nilai RMSE terkecil yaitu polinomial derajat 5.

poly_model <- lm(mpg ~ poly(displacement, which.min(errors)), data = Auto)
tidy(poly_model) %>%
  mutate(term = ifelse(row_number() == 1, 'Intercept',
                       paste0('$Horsepower^', row_number()-1,'$'))) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
Intercept	23.446	0.221	106.158	0.000
\(Horsepower^1\)	-124.258	4.373	-28.416	0.000
\(Horsepower^2\)	31.090	4.373	7.110	0.000
\(Horsepower^3\)	-4.466	4.373	-1.021	0.308
\(Horsepower^4\)	0.775	4.373	0.177	0.859
\(Horsepower^5\)	3.299	4.373	0.754	0.451

#namun ketika melihat nilai p.value nya yang signifikan hanya polinomial derajat 1,2,
#berarti menunjukkan bahwa antara mpg dan weight memiliki hubungan non-linier.

Kemudian dilihat pada tabel berikutnya bahwa pada polinomial derajat 5 yang signifikan hanya koefisien derajat 1 dan 2, sehingga dapat disimpulkan bahwa terdapat hubungan non-linier antara mpg dan displacement tersebut.

poly5 <- Auto %>%
  mutate(Predictions = predict(poly_model, Auto)) %>%
  ggplot() + xlab('displacement') + ylab('Miles Per Gallon') +
  geom_point(aes(displacement, mpg, col = 'blue')) +
  geom_line(aes(displacement, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
poly5

pada visualisasi polinomisl derajat 5 diatas dapat di lihat walaupun memiliki nilai RMSE terkecil, tetapi plot belum ‘stabil’ pada ujungnya, sehingga masih kurang baik mewakili data ini, akan dilakukan pemilihan model terbaik nantinya.

- Fungsi Tangga

set.seed(888)
folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()
for (k in 1:10) {
  for (i in 1:10) {
    Auto$cut <- cut(Auto$displacement, i+1)
    model <- lm(mpg ~ displacement*cut, data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
    Auto$cut <- NULL
  }
}
errors <- apply(errors, 2, mean)

data_frame(RMSE = errors) %>%
  mutate(Steps = row_number() +1) %>%
  ggplot(aes(Steps, RMSE, fill = Steps == (which.min(errors) + 1))) + 
  geom_col() +
  scale_x_continuous(breaks = 2:(length(errors)+1)) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

#dari gambar dapat dilihat bahwa untuk horsepower diperlukan 7 potongan 

min_error <- which.min(errors) + 1
step_model <- lm(mpg ~ displacement *cut(displacement, min_error), data = Auto)
tidy(step_model) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	45.544	2.405	18.935	0.000
displacement	-0.156	0.024	-6.574	0.000
cut(displacement, min_error)(123,179]	-9.311	7.640	-1.219	0.224
cut(displacement, min_error)(179,234]	-10.577	9.486	-1.115	0.266
cut(displacement, min_error)(234,289]	-110.659	40.271	-2.748	0.006
cut(displacement, min_error)(289,344]	-23.492	24.501	-0.959	0.338
cut(displacement, min_error)(344,400]	-19.866	26.346	-0.754	0.451
cut(displacement, min_error)(400,455]	-26.756	17.570	-1.523	0.129
displacement:cut(displacement, min_error)(123,179]	0.082	0.054	1.501	0.134
displacement:cut(displacement, min_error)(179,234]	0.088	0.048	1.833	0.068
displacement:cut(displacement, min_error)(234,289]	0.486	0.160	3.034	0.003
displacement:cut(displacement, min_error)(289,344]	0.134	0.082	1.635	0.103
displacement:cut(displacement, min_error)(344,400]	0.125	0.078	1.611	0.108
displacement:cut(displacement, min_error)(400,455]	0.144	0.048	3.002	0.003

pada hasil grafik RMSE diatas diperoleh yang memiliki nilai terkecil yaitu breaks ke 5, lalu pada tabel berikutnya dapat dilihat bahwa yang koefisien yang signifikan hanya beberapa saja, sehingga dapat dikatakan bahwa mpg dan displacement memiliki hubungan non-linier.

pc5 <- Auto %>%
  mutate(Predictions = predict(step_model, Auto)) %>%
  ggplot() + xlab('displacement') + ylab('Miled Per Gallon') + 
  geom_point(aes(displacement, mpg, col = 'blue')) +
  geom_line(aes(displacement, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
pc5

Pada plot diatas dilihat bahwa untuk break 5 diatas plot menunjukkan pola yang cukup berbeda dengan pola fungsi tangga yang biasa di lihat, namun plot diatas cukup mewakili data dari displacement.

- Natural Spline

spline_model <- train(mpg ~ displacement, data
                      = Auto,
                      method = 'gamSpline',
                      trControl = trainControl(method = 'cv', number = 10),
                      tuneGrid = expand.grid(df = seq(1, 12, 1)))
disp_lm <- train(mpg ~ displacement, data = Auto, method = 'lm',
                 trControl = trainControl(method = 'cv', number = 10))
plot(spline_model)

anova(disp_lm$finalModel, 
      spline_model$finalModel) %>%
  tidy %>%
  mutate(model = c('Linear', 'Spline'),
         percent = sumsq/lag(rss) * 100) %>% 
  select(model, rss, percent, sumsq, p.value) %>%
  kable(digits = 3)

model	rss	percent	sumsq	p.value
Linear	8378.822	NA	NA	NA
Spline	6629.277	20.881	1749.545	0

Pada tabel Anova diatas dapat dilihat jika dibandingkan dengan model linier makan model spline dengan df=12 meningkat 20.881 % lebih baik dari model linier sehingga dapat dikatakan dengan pasti bahwa antara mpg dan displacement memiliki hubungan non-linier. Pada grafik RMSE diatas ditunjukkan bahwa kita nilai RMSE terus turun hingga pada df=12 yang memiliki nilai paling kecil. Sehingga dilakukan visualisasi model natural spline dengan df-12 sebagai berikut :

splineK12 <- Auto %>%
  mutate(pred = predict(spline_model, Auto)) %>%
  ggplot() + 
  geom_point(aes(displacement, mpg, col = 'blue')) +
  geom_line(aes(displacement, pred, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00')) +
  labs(x = 'displacement', y = 'Miles Per Gallon', title = 'Spline Relationship') +
  theme(legend.position = 'top')
splineK12

Pada plot diatas model natural spline dengan df=12 dilihat bahwa plot sudah cukup baik namun pada bagian ujung tetap memiliki bentuk aneh, sehingga akan dilakukan pemilihan model terbaik selanjutnya.

2. PEMILIHAN MODEL TERBAIK

a. MODEL TERBAIK untuk mpg vs Horsepower

Untuk memilih model terbaik menggunakan cross validation 10 fold, akan dipilih model terbaik dengan 3 metode yaitu polinomial , fungsi tangga dan natural spline. Pemilihan model terbaik dilihat nilai RMSE dan atau MAE terkecil kemudian akan dilihat secara visualisasi disesuaikan dengan nilai RMSE dan MAE terkecil tersebut.

- Metode Rergresi Polinomial

set.seed(888)
folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()

for (k in 1:10) {
  for (i in 1:10) {
    model <- lm(mpg ~ poly(horsepower,i), data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
  }
}
errors <- apply(errors, 2, mean)
errors

##  [1] 4.214925 4.255124 3.778729 4.103760 5.118257 3.717611 4.924988 3.862065
##  [9] 4.513991 5.025943

data_frame(RMSE = errors) %>%
  mutate(Polynomial = row_number()) %>%
  ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) + 
  geom_col() +
  scale_x_continuous(breaks = 1:10) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

poly_model <- lm(mpg ~ poly(horsepower, which.min(errors)), data = Auto)
tidy(poly_model) %>%
  mutate(term = ifelse(row_number() == 1, 'Intercept',
                       paste0('$Horsepower^', row_number()-1,'$'))) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
Intercept	23.446	0.218	107.715	0.000
\(Horsepower^1\)	-120.138	4.310	-27.877	0.000
\(Horsepower^2\)	44.090	4.310	10.231	0.000
\(Horsepower^3\)	-3.949	4.310	-0.916	0.360
\(Horsepower^4\)	-5.188	4.310	-1.204	0.229
\(Horsepower^5\)	13.272	4.310	3.080	0.002
\(Horsepower^6\)	-8.546	4.310	-1.983	0.048

poly6 <- Auto %>%
  mutate(Predictions = predict(poly_model, Auto)) %>%
  ggplot() + xlab('Horsepower') + ylab('Miles Per Gallon') +
  geom_point(aes(horsepower, mpg, col = 'blue')) +
  geom_line(aes(horsepower, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
poly6

Dilihat dari plot diatas polinomial derajat 6 adalah polinomial dengan nilai RMSE terkecil namun jika dilihat plot nya masih terlihat di ujung-ujung plot tidak stabil atau ‘liar’. Sehingga sesuai disarankan di dalam buku ISLR untuk memilih derajat hingga 4. Untuk itu dipilih nilai RMSE terkecil kembali dari derajat 1-4, maka dipilih model polinomial derajat 3 dan dilakukan ulang cross validation 10 fold, akan dilihat bagaimana bentuk plot dari polinomial derajat 3 sebagai berikut:

cv.poly3 <- function(dataset){
  set.seed(888)
  cross_val <- vfold_cv(dataset,v=10,strata = "mpg")
  metric_poly3 <- map_dfr(cross_val$splits,
    function(x){  
      mod <- lm(mpg ~ poly(horsepower,3,raw = T),
         data=dataset[x$in_id,]) 
      pred <- predict(mod,
               newdata=dataset[-x$in_id,]) 
      truth <- dataset[-x$in_id,]$mpg
      rmse <- mlr3measures::rmse(truth = truth,response = pred)
      mae <- mlr3measures::mae(truth = truth,response = pred)
      metric <- c(rmse,mae) 
      names(metric) <- c("rmse","mae")
      return(metric)
      }
  )
  poly3<- colMeans(metric_poly3) #menghitung rmse dan mae rata-rata untuk 10 folds
  return(poly3)
}

cv.poly3(Auto)

##     rmse      mae 
## 4.351002 3.284153

poly3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) + 
  geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~poly(x,3,raw=T), 
               lty = 1, col = "blue",se = F)+
  theme_bw()
poly3

#perbandingan antara polinomial derajat 3 dan 6
ggplot(Auto,aes(x=horsepower,y=mpg))+
  geom_point(alpha=0.55,color="darkgrey")+
  stat_smooth(method="lm",
              formula=y~poly(x,6,raw=T),
            lty=1,aes(col="Regresi Polinomial"),col="blue",se=F)+
    stat_smooth(method="lm",
              formula=y~poly(x,3,raw=T),
            lty=1,aes(col="Regresi Polinomial"),col="red",se=F)+
  scale_color_manual(values = c("Regresi Polinomial"="blue","Regresi Polinomial"="red"))

Dapat dilihat pada plot perbaningan diatas plot biru adalah polinomial derajat 6 dan plot merah polinomial derajat 3, dari plot diatas dipilih model polinomial derajat 3 dengan bentuk plot yang lebih stabil dan juga dengan nilai RMSE dan mae yang kecil yaitu 4.351002 dan 3.284153.

- Metode Fungsi Tangga

set.seed(888)

cv.pc <- function(dataset){
  set.seed(888)
  cross_val <- vfold_cv(dataset,v=10,strata = "mpg")

breaks <- 3:12

best_tangga <- map_dfr(breaks, function(i){

metric_tangga <- map_dfr(cross_val$splits,
    function(x){

training <- dataset[x$in_id,]
training$horsepower <- cut(training$horsepower,i)

mod <- lm(mpg ~ horsepower,
         data=training)

labs_x <- levels(mod$model[,2])

labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                  upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))


testing <- dataset[-x$in_id,]

horsepower_new <- cut(testing$horsepower,c(labs_x_breaks[1,1],labs_x_breaks[,2]))

pred <- predict(mod,
               newdata=list(horsepower=horsepower_new))
truth <- testing$mpg

data_eval <- na.omit(data.frame(truth,pred))

rmse <- mlr3measures::rmse(truth = data_eval$truth,
                           response = data_eval$pred
                           )
mae <- mlr3measures::mae(truth = data_eval$truth,
                           response = data_eval$pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
colMeans(metric_tangga) # menghitung rata-rata untuk 10 folds
}
)
best_tangga <- cbind(breaks=breaks,best_tangga) # menampilkan hasil all breaks
basedonrmse <- as.data.frame(best_tangga %>% slice_min(rmse))
basedonmae <- as.data.frame(best_tangga %>% slice_min(mae))
return(rbind(basedonrmse,basedonmae))
}

cv.pc(Auto)

##   breaks     rmse      mae
## 1      8 4.417354 3.395444
## 2     12 4.446221 3.310524

Dilakakukan pengujian secara empiris diperoleh bahwa nilai RMSE terkecil pada breaks ke 8 dan nilai MAE terkecil pada breaks ke 12. Karena perbedaan keduanya nilai RMSE dan MAE dari kedua breaks maka dipilih salah satu yaitu breaks 12, berikut bentuk untuk kedua breaks visualisasinya;

pc8 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~cut(x,8), 
               lty = 1, col = "red",se = F)+
  theme_bw()

pc12 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~cut(x,12), 
               lty = 1, col = "blue",se = F)+
  theme_bw()

plot_grid(pc8, pc12, labels = c("pc8", "pc12"), label_size = 6)

- Natural Spline

set.seed(888)
cross.val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 1:12
best.spline3 <- map_dfr(df, function(i){
metric.spline3 <- map_dfr(cross.val$splits,
                  function(x){
                  mod <- lm(mpg ~ ns(horsepower,df=i),data=Auto[x$in_id,])
                  pred <- predict(mod,newdata=Auto[-x$in_id,])
                  truth <- Auto[-x$in_id,]$mpg
                  rmse <- mlr3measures::rmse(truth = truth,response = pred)
                  mae <- mlr3measures::mae(truth = truth,response = pred)
                  metric <- c(rmse,mae)
                  names(metric) <- c("rmse","mae")
                  return(metric)
                }
      )
    metric.spline3
    mean.metric.spline3 <- colMeans(metric.spline3) #rata-rata untuk 10 folds
    mean.metric.spline3
  }
)


best.spline3 <- cbind(df=df,best.spline3)
basis<-data.frame(rbind(best.spline3 %>% slice_min(rmse), #berdasarkan rmse
                 best.spline3 %>% slice_min(mae))) #berdasarkan mae

basis

##   df     rmse      mae
## 1  7 4.293046 3.213462
## 2 10 4.297427 3.199674

Untuk metode Natural Spline dengan cross validation diperoleh 2 df dengan nilai RMSE terkecil yaitu df=7 dan nilai MAE terkecil yaitu df=10. Kemudian di cari knot dari setiap df tersebut:

df_7<-attr(ns(Auto$horsepower, df=7),"knots")
df_7

## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 
##  68.85714  79.71429  89.57143  98.00000 113.57143 150.00000

df_10<-attr(ns(Auto$horsepower, df=10),"knots")
df_10

##   10%   20%   30%   40%   50%   60%   70%   80%   90% 
##  67.0  72.0  80.0  88.0  93.5 100.0 110.0 140.0 157.7

setelah diperoleh knot diatas akan dilakukan cross validation kembali untuk melihat mana yang memiliki nilai RMSE dan MAE terkecil, sebagai berikut :

set.seed(888)

cross.val <- vfold_cv(Auto,v=10,strata = "mpg")

metric.spline3.ns6 <- map_dfr(cross.val$splits,
    function(x){
mod <- lm(mpg ~ ns(horsepower, knots = c(68.85714, 79.71429, 89.57143, 98.00000, 113.57143,150.00000)),
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
mean.metric.spline3.ns6 <- colMeans(metric.spline3.ns6)# menghitung rata-rata untuk 10 folds



metric.spline3.ns9 <- map_dfr(cross.val$splits,
    function(x){
mod <- lm(mpg ~ ns(horsepower, knots = c(67,72,80,88,93.5,100,110,140,157.7)),
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
mean.metric.spline3.ns9 <- colMeans(metric.spline3.ns9)# menghitung rata-rata untuk 10 folds



nilai.cv.ns <- rbind("NCS6" = mean.metric.spline3.ns6,
                     "NCS9" = mean.metric.spline3.ns9)

nama.model.ncs <- c("6 - NCS (df=7)",
                    "9 - NCS (df=10)")

model.ncs<-data.frame(Model = nama.model.ncs, nilai.cv.ns)

model.ncs

##                Model     rmse      mae
## NCS6  6 - NCS (df=7) 4.295157 3.215237
## NCS9 9 - NCS (df=10) 4.286825 3.193450

dilihat pada tabel diatas bahwa yang memiliki nilai RMSE dan MAE terkecil adalah natural spline dengan df=10 atau 9 knot, berikut akan di ditampilkan visualisasi dari model natural spline diatas:

nc6 <- ggplot(Auto, aes(x = horsepower, y = mpg)) + 
  geom_point(alpha = 0.55, color="darkgrey") + 
  stat_smooth(method = "lm", formula = y~ns(x, knots = c(68.85714, 79.71429, 89.57143, 98.00000, 113.57143,150.00000)), col = "blue", se = F) +
  theme_bw()


nc9 <- ggplot(Auto, aes(x = horsepower, y = mpg)) + 
  geom_point(alpha = 0.55, color="darkgrey") + 
  stat_smooth(method = "lm", formula = y~ns(x, knots = c(67,72,80,88,93.5,100,110,140,157.7)), col = "blue", se = F)+
  theme_bw()

plot_grid(nc6, nc9, labels = c("nc6", "nc9"), label_size = 6)

dari plot yang ditampilkan diatas terlihat walaupun model natural spline df=10 terlihat bahwa di ujung masih tidak stabil atau ‘liar’.

setelah diperoleh model terbaik dari 3 metode diatas akan dipilih model terbaik antara mpg dan horsepower sebagai berikut :

perbandingan.model <- rbind(cv.poly3(Auto), cv.pc(Auto)[1,-1], model.ncs[2,-1])
perbandingan.model$metode <- c("Polinomial Ordo 3","Piecewise Constant 12","NCS 9 Knot")
perbandingan.model

##          rmse      mae                metode
## 1    4.351002 3.284153     Polinomial Ordo 3
## 2    4.417354 3.395444 Piecewise Constant 12
## NCS9 4.286825 3.193450            NCS 9 Knot

berikut adalah plot dari 3 model terbaik tersebut :

plot_grid(poly3, pc12, nc9, labels = c("poly3","pc12", "nc9"), label_size = 6)

ggplot(Auto,aes(x=horsepower,y=mpg))+
  geom_point(alpha=0.55,color="darkgrey")+
  stat_smooth(method="lm",
              formula=y~poly(x,3,raw=T),
            lty=1,aes(col="Regresi Polinomial"),col="blue",se=F)+
  stat_smooth(method="lm",
              formula=y~cut(x,12),
            lty=1,aes(col="Piecewise Constant"),col="red",se=F)+
   stat_smooth(method="lm",
              formula=y~ns(x,df=9),
            lty=1,aes(col="Natural Cubic Spline"),col="black",se=F)+
  scale_color_manual(values = c("Regresi Polinomial"="blue","Piecewise Constant"="red","Natural Cubic Spline"="black"))

Jika berdasarkan dengan nilai RMSE terkecil maka dipilih model natural spline df=10, namun jika melihat kembali kepada bentuk plot yang terbaik tanpa ada ‘aneh’ pada plot maka disarankan memilih model polinomial derajat 3.

Untuk hubungan non-linier antara mpg dan horsepower ini dapat dilihat pada plot bahwa semakin kecil horsepower (kekuatan mesin) maka semakin boros mpg (bahan bakar) yang dihabiskan dan begitu pun sebaliknya semakin tinggi horsepower maka semakin hemat mpg yang dihabiskan.

b. MODEL TERBAIK untuk mpg vs Acceleration

- Metode Rergresi Polinomial

set.seed(888)
folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()

for (k in 1:10) {
  for (i in 1:10) {
    model <- lm(mpg ~ poly(acceleration,i), data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
  }
}
errors <- apply(errors, 2, mean)
errors

##  [1] 5.832256 7.432576 6.043080 6.460294 7.289915 6.931338 8.824586 6.357289
##  [9] 7.388151 7.676203

data_frame(RMSE = errors) %>%
  mutate(Polynomial = row_number()) %>%
  ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) + 
  geom_col() +
  scale_x_continuous(breaks = 1:10) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

poly_model <- lm(mpg ~ poly(acceleration, which.min(errors)), data = Auto)
tidy(poly_model) %>%
  mutate(term = ifelse(row_number() == 1, 'Intercept',
                       paste0('$acceleration^', row_number()-1,'$'))) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
Intercept	23.446	0.358	65.564	0
\(acceleration^1\)	65.334	7.080	9.228	0

poly1 <- Auto %>%
  mutate(Predictions = predict(poly_model, Auto)) %>%
  ggplot() + xlab('acceleration') + ylab('Miles Per Gallon') +
  geom_point(aes(acceleration, mpg, col = 'blue')) +
  geom_line(aes(acceleration, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
poly1

Dilihat dari plot diatas polinomial derajat 1 adalah polinomial dengan nilai RMSE terkecil dan jika dilihat plot nya telah cukup satbil dan tidak ada bentuk aneh. Sehingga dipilih model polinomial dengan derajat 1.

- Metode Fungsi Tangga

set.seed(888)

cv.pc <- function(dataset){
  set.seed(888)
  cross_val <- vfold_cv(dataset,v=10,strata = "mpg")

breaks <- 3:12

best_tangga <- map_dfr(breaks, function(i){

metric_tangga <- map_dfr(cross_val$splits,
    function(x){

training <- dataset[x$in_id,]
training$acceleration <- cut(training$acceleration,i)

mod <- lm(mpg ~ acceleration,
         data=training)

labs_x <- levels(mod$model[,2])

labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                  upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))


testing <- dataset[-x$in_id,]

acceleration_new <- cut(testing$acceleration,c(labs_x_breaks[1,1],labs_x_breaks[,2]))

pred <- predict(mod,
               newdata=list(acceleration=acceleration_new))
truth <- testing$mpg

data_eval <- na.omit(data.frame(truth,pred))

rmse <- mlr3measures::rmse(truth = data_eval$truth,
                           response = data_eval$pred
                           )
mae <- mlr3measures::mae(truth = data_eval$truth,
                           response = data_eval$pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
colMeans(metric_tangga) # menghitung rata-rata untuk 10 folds
}
)
best_tangga <- cbind(breaks=breaks,best_tangga) # menampilkan hasil all breaks
basedonrmse <- as.data.frame(best_tangga %>% slice_min(rmse))
basedonmae <- as.data.frame(best_tangga %>% slice_min(mae))
return(rbind(basedonrmse,basedonmae))
}

cv.pc(Auto)

##   breaks     rmse      mae
## 1      3 6.980699 5.651621
## 2      3 6.980699 5.651621

Dilakakukan pengujian secara empiris diperoleh bahwa nilai RMSE dan MAE terkecil pada breaks ke 3 berikut bentuk untuk visualisasinya;

pc3 <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~cut(x,3), 
               lty = 1, col = "red",se = F)+
  theme_bw()

pc3

Terlihat pada plot diatas dengan menunjukkan fungsi tangga breaks 3, cukup mewakili data yang tersebar tersebut.

- Natural Spline

set.seed(888)
cross.val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 1:12
best.spline3 <- map_dfr(df, function(i){
metric.spline3 <- map_dfr(cross.val$splits,
                  function(x){
                  mod <- lm(mpg ~ ns(acceleration,df=i),data=Auto[x$in_id,])
                  pred <- predict(mod,newdata=Auto[-x$in_id,])
                  truth <- Auto[-x$in_id,]$mpg
                  rmse <- mlr3measures::rmse(truth = truth,response = pred)
                  mae <- mlr3measures::mae(truth = truth,response = pred)
                  metric <- c(rmse,mae)
                  names(metric) <- c("rmse","mae")
                  return(metric)
                }
      )
    metric.spline3
    mean.metric.spline3 <- colMeans(metric.spline3) #rata-rata untuk 10 folds
    mean.metric.spline3
  }
)


best.spline3 <- cbind(df=df,best.spline3)
basis<-data.frame(rbind(best.spline3 %>% slice_min(rmse), #berdasarkan rmse
                 best.spline3 %>% slice_min(mae))) #berdasarkan mae

basis

##   df     rmse      mae
## 1  5 6.947152 5.622538
## 2  5 6.947152 5.622538

Untuk metode Natural Spline dengan cross validation diperoleh df dengan nilai RMSE terkecil yaitu df=5. Kemudian di cari knot:

df_5<-attr(ns(Auto$acceleration, df=5),"knots")
df_5

##   20%   40%   60%   80% 
## 13.42 14.80 16.00 17.70

setelah diperoleh knot diatas akan dilakukan cross validation kembali untuk melihat nilai RMSE dan MAE, sebagai berikut :

set.seed(888)

cross.val <- vfold_cv(Auto,v=10,strata = "mpg")

metric.spline3.ns5 <- map_dfr(cross.val$splits,
    function(x){
mod <- lm(mpg ~ ns(acceleration, knots = c(13.42, 14.80, 16.00, 17.70 )),
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
mean.metric.spline3.ns5 <- colMeans(metric.spline3.ns5)# menghitung rata-rata untuk 10 folds



nilai.cv.ns <- mean.metric.spline3.ns5
                

nama.model.ncs <- "4 - NCS (df=5)"
                    

model.ncs<-data.frame(Model = nama.model.ncs, nilai.cv.ns)

model.ncs

##               Model nilai.cv.ns
## rmse 4 - NCS (df=5)    6.943104
## mae  4 - NCS (df=5)    5.617941

dilihat pada tabel diatas bahwa yang memiliki nilai RMSE dan MAE terkecil adalah natural spline dengan df=5 atau 4 knot, berikut akan di ditampilkan visualisasi dari model natural spline diatas:

nc4 <- ggplot(Auto, aes(x = acceleration, y = mpg)) + 
  geom_point(alpha = 0.55, color="darkgrey") + 
  stat_smooth(method = "lm", formula = y~ns(x, knots = c(13.42, 14.80, 16.00, 17.70 )), col = "blue", se = F)+
  theme_bw()

nc4

dari plot yang ditampilkan diatas terlihat walaupun model natural spline df=5 terlihat bahwa plot menunjukkan ke stabilan dalam menggambarkan data acceleration.

setelah diperoleh model terbaik dari 3 metode diatas akan dipilih model terbaik antara mpg dan acceleration sebagai berikut :

perbandingan.model <- rbind(5.832256, cv.pc(Auto)[1,-1], model.ncs[2,-1])
perbandingan.model$metode <- c("Polinomial Ordo 1","Piecewise Constant 3","NCS 4
                               Knot")
perbandingan.model

##       rmse      mae                                     metode
## 1 5.832256 5.832256                          Polinomial Ordo 1
## 2 6.980699 5.651621                       Piecewise Constant 3
## 3 5.617941 5.617941 NCS 4\n                               Knot

Jika dilihat dari nilai RMSE dan MAE terkecil yaitu model natural spline df=5, berikut adalah plot dari 3 model terbaik tersebut :

plot_grid(poly1, pc3, nc4, labels = c("poly1","pc3", "nc4"), label_size = 6)

ggplot(Auto,aes(x=acceleration,y=mpg))+
  geom_point(alpha=0.55,color="darkgrey")+
  stat_smooth(method="lm",
              formula=y~poly(x,1,raw=T),
            lty=1,aes(col="Regresi Polinomial"),col="blue",se=F)+
  stat_smooth(method="lm",
              formula=y~cut(x,3),
            lty=1,aes(col="Piecewise Constant"),col="red",se=F)+
   stat_smooth(method="lm",
              formula=y~ns(x,df=5),
            lty=1,aes(col="Natural Cubic Spline"),col="black",se=F)+
  scale_color_manual(values = c("Regresi Polinomial"="blue","Piecewise Constant"="red","Natural Cubic Spline"="black"))

setelah diperoleh model terbaik dari 3 metode diatas dengan nilai RMSE dan MAE terkecil ada pada model natural spline df=5 dan dapat dilihat pada plot yaitu plot dengan warna hitam tersebut cukup mewakili data dari acceleration ini.

dapat disimpulkan bahwa antara mpg dan acceleration ini memiliki hubungan dimana semakin tinggi waktu accelerate maka semakin tinggi mpg yang dihabiskan.

b. MODEL TERBAIK untuk mpg vs Displacement

- Metode Rergresi Polinomial

set.seed(888)
folds <- sample(rep(1:10, nrow(Auto)/10))
errors <- matrix(NA, 10, 10)
models <- list()

for (k in 1:10) {
  for (i in 1:10) {
    model <- lm(mpg ~ poly(displacement,i), data = Auto[folds != i,])
    pred <- predict(model, Auto[folds == i,])
    resid <- (Auto$mpg[folds==i] - pred)^2
    errors[k, i] <- sqrt(mean(resid))
  }
}
errors <- apply(errors, 2, mean)
errors

##  [1] 4.555659 4.078831 4.853867 3.670272 5.033470 3.111436 4.710184 4.082699
##  [9] 4.717649 4.430111

data_frame(RMSE = errors) %>%
  mutate(Polynomial = row_number()) %>%
  ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) + 
  geom_col() +
  scale_x_continuous(breaks = 1:10) +
  guides(fill = FALSE) +
  coord_cartesian(ylim = c(min(errors), max(errors)))

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

poly_model <- lm(mpg ~ poly(displacement, which.min(errors)), data = Auto)
tidy(poly_model) %>%
  mutate(term = ifelse(row_number() == 1, 'Intercept',
                       paste0('$displacement^', row_number()-1,'$'))) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
Intercept	23.446	0.219	106.820	0.000
\(displacement^1\)	-124.258	4.346	-28.593	0.000
\(displacement^2\)	31.090	4.346	7.154	0.000
\(displacement^3\)	-4.466	4.346	-1.028	0.305
\(displacement^4\)	0.775	4.346	0.178	0.859
\(displacement^5\)	3.299	4.346	0.759	0.448
\(displacement^6\)	-10.492	4.346	-2.414	0.016

polys6 <- Auto %>%
  mutate(Predictions = predict(poly_model, Auto)) %>%
  ggplot() + xlab('displacement') + ylab('Miles Per Gallon') +
  geom_point(aes(displacement, mpg, col = 'blue')) +
  geom_line(aes(displacement, Predictions, col = 'goldenrod2'), size = 1.5) +
  scale_color_manual(name = 'Value Type',
                     labels = c('Observed', 'Predicted'),
                     values = c('#56B4E9', '#E69F00'))
polys6

Dilihat dari plot diatas polinomial derajat 6 adalah polinomial dengan nilai RMSE terkecil dan jika dilihat plot nya telah tidak stbil dan ada bentuk aneh. Sehingga berdasarkan buku ISLR, maka di coba untuk melihat polinomial derajat 1-4 yang memiliki nilai RMSE terkecil, karena berdasarkan buku ini dikatakan jika model polinomial derajat >4 , maka dari grafik RMSE diatas maka dipilih model derajat 4, dan dilakukan cross validation sebagai berikut :

cv.poly4 <- function(dataset){
  set.seed(888)
  cross_val <- vfold_cv(dataset,v=10,strata = "mpg")
  metric_poly4 <- map_dfr(cross_val$splits,
    function(x){  
      mod <- lm(mpg ~ poly(displacement,4,raw = T),
         data=dataset[x$in_id,]) 
      pred <- predict(mod,
               newdata=dataset[-x$in_id,]) 
      truth <- dataset[-x$in_id,]$mpg
      rmse <- mlr3measures::rmse(truth = truth,response = pred)
      mae <- mlr3measures::mae(truth = truth,response = pred)
      metric <- c(rmse,mae) 
      names(metric) <- c("rmse","mae")
      return(metric)
      }
  )
  poly4<- colMeans(metric_poly4) #menghitung rmse dan mae rata-rata untuk 10 folds
  return(poly4)
}

cv.poly4(Auto)

##     rmse      mae 
## 4.318556 3.188496

poly4 <- ggplot(Auto,aes(x=displacement, y=mpg)) + 
  geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~poly(x,4,raw=T), 
               lty = 1, col = "blue",se = F)+
  theme_bw()
poly4

#perbandingan antara polinomial derajat 4 dan 6
ggplot(Auto,aes(x=displacement,y=mpg))+
  geom_point(alpha=0.55,color="darkgrey")+
  stat_smooth(method="lm",
              formula=y~poly(x,6,raw=T),
            lty=1,aes(col="Regresi Polinomial"),col="blue",se=F)+
    stat_smooth(method="lm",
              formula=y~poly(x,4,raw=T),
            lty=1,aes(col="Regresi Polinomial"),col="red",se=F)+
  scale_color_manual(values = c("Regresi Polinomial"="blue","Regresi Polinomial"="red"))

dapat dilihat dari perbandingan nilai RMSE terkecil yaitu pada polinomial derajat 6 namun,jika dilihat lebih lanjut pada visualisasi plot maka polinomial derajat 4 lebih stabil daripada polinomial derajat 6.Untuk itu kali ini akan di ambil polinomial derajat 4 yang plotnya lebih stabil dan nilai RMSE nya juga termasuk kecil.

- Metode Fungsi Tangga

set.seed(888)

cv.pc <- function(dataset){
  set.seed(888)
  cross_val <- vfold_cv(dataset,v=10,strata = "mpg")

breaks <- 3:12

best_tangga <- map_dfr(breaks, function(i){

metric_tangga <- map_dfr(cross_val$splits,
    function(x){

training <- dataset[x$in_id,]
training$displacement <- cut(training$displacement,i)

mod <- lm(mpg ~ displacement,
         data=training)

labs_x <- levels(mod$model[,2])

labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                  upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))


testing <- dataset[-x$in_id,]

displacement_new <- cut(testing$displacement,c(labs_x_breaks[1,1],labs_x_breaks[,2]))

pred <- predict(mod,
               newdata=list(displacement=displacement_new))
truth <- testing$mpg

data_eval <- na.omit(data.frame(truth,pred))

rmse <- mlr3measures::rmse(truth = data_eval$truth,
                           response = data_eval$pred
                           )
mae <- mlr3measures::mae(truth = data_eval$truth,
                           response = data_eval$pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
colMeans(metric_tangga) # menghitung rata-rata untuk 10 folds
}
)
best_tangga <- cbind(breaks=breaks,best_tangga) # menampilkan hasil all breaks
basedonrmse <- as.data.frame(best_tangga %>% slice_min(rmse))
basedonmae <- as.data.frame(best_tangga %>% slice_min(mae))
return(rbind(basedonrmse,basedonmae))
}

cv.pc(Auto)

##   breaks     rmse      mae
## 1      9 4.194392 3.103064
## 2      9 4.194392 3.103064

Dilakakukan pengujian secara empiris diperoleh bahwa nilai RMSE dan MAE terkecil pada breaks ke 9 berikut bentuk untuk visualisasinya;

pc9 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~cut(x,9), 
               lty = 1, col = "red",se = F)+
  theme_bw()

pc9

Terlihat pada plot diatas dengan menunjukkan fungsi tangga breaks 9, cukup mewakili data yang tersebar tersebut.

- Natural Spline

set.seed(888)
cross.val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 1:12
best.spline3 <- map_dfr(df, function(i){
metric.spline3 <- map_dfr(cross.val$splits,
                  function(x){
                  mod <- lm(mpg ~ ns(displacement,df=i),data=Auto[x$in_id,])
                  pred <- predict(mod,newdata=Auto[-x$in_id,])
                  truth <- Auto[-x$in_id,]$mpg
                  rmse <- mlr3measures::rmse(truth = truth,response = pred)
                  mae <- mlr3measures::mae(truth = truth,response = pred)
                  metric <- c(rmse,mae)
                  names(metric) <- c("rmse","mae")
                  return(metric)
                }
      )
    metric.spline3
    mean.metric.spline3 <- colMeans(metric.spline3) #rata-rata untuk 10 folds
    mean.metric.spline3
  }
)


best.spline3 <- cbind(df=df,best.spline3)
basis<-data.frame(rbind(best.spline3 %>% slice_min(rmse), #berdasarkan rmse
                 best.spline3 %>% slice_min(mae))) #berdasarkan mae

basis

##   df     rmse      mae
## 1 12 4.059328 3.063204
## 2 12 4.059328 3.063204

Untuk metode Natural Spline dengan cross validation diperoleh df dengan nilai RMSE terkecil yaitu df=12. Kemudian di cari knot:

df_12<-attr(ns(Auto$displacement, df=12),"knots")
df_12

## 8.333333% 16.66667%       25% 33.33333% 41.66667%       50% 58.33333% 66.66667% 
##   89.0000   97.0000  105.0000  119.0000  130.9167  151.0000  200.0000  232.0000 
##       75% 83.33333% 91.66667% 
##  275.7500  318.0000  351.0000

setelah diperoleh knot diatas akan dilakukan cross validation kembali untuk melihat nilai RMSE dan MAE, sebagai berikut :

set.seed(888)

cross.val <- vfold_cv(Auto,v=10,strata = "mpg")

metric.spline3.ns12 <- map_dfr(cross.val$splits,
    function(x){
mod <- lm(mpg ~ ns(acceleration, knots = c(12.00000, 13.00000, 13.77500, 14.50000, 14.90000, 15.50000, 16.00000, 16.50000, 17.02500 )),
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

rmse <- mlr3measures::rmse(truth = truth,
                           response = pred
                           )
mae <- mlr3measures::mae(truth = truth,
                           response = pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
mean.metric.spline3.ns12 <- colMeans(metric.spline3.ns12)# menghitung rata-rata untuk 10 folds



nilai.cv.ns <- mean.metric.spline3.ns12
                

nama.model.ncs <- "11 - NCS (df=12)"
                    

model.ncs<-data.frame(Model = nama.model.ncs, nilai.cv.ns)

model.ncs

##                 Model nilai.cv.ns
## rmse 11 - NCS (df=12)    7.008586
## mae  11 - NCS (df=12)    5.697098

dilihat pada tabel diatas bahwa yang memiliki nilai RMSE dan MAE terkecil adalah natural spline dengan df=12 atau 11 knot, berikut akan di ditampilkan visualisasi dari model natural spline diatas:

nc11 <- ggplot(Auto, aes(x = displacement, y = mpg)) + 
  geom_point(alpha = 0.55, color="darkgrey") + 
  stat_smooth(method = "lm", formula = y~ns(x, knots = c(12.00000, 13.00000, 13.77500, 14.50000, 14.90000, 15.50000, 16.00000, 16.50000, 17.02500 )), col = "blue", se = F)+
  theme_bw()

nc11

## Warning in predict.lm(model, newdata = new_data_frame(list(x = xseq)), se.fit =
## se, : prediction from a rank-deficient fit may be misleading

dari plot yang ditampilkan diatas terlihat walaupun model natural spline df=12 terlihat bahwa plot menunjukkan ke stabilan dalam menggambarkan data displacement.

setelah diperoleh model terbaik dari 3 metode diatas akan dipilih model terbaik antara mpg dan acceleration sebagai berikut :

perbandingan.model <- rbind(cv.poly4(Auto), cv.pc(Auto)[1,-1], model.ncs[2,-1])
perbandingan.model$metode <- c("Polinomial Ordo 4","Piecewise Constant 9","NCS 11 Knot")
perbandingan.model

##       rmse      mae               metode
## 1 4.318556 3.188496    Polinomial Ordo 4
## 2 4.194392 3.103064 Piecewise Constant 9
## 3 5.697098 5.697098          NCS 11 Knot

Jika dilihat dari nilai RMSE dan MAE terkecil yaitu model natural spline df=5, berikut adalah plot dari 3 model terbaik tersebut :

plot_grid(poly4, pc9, nc11, labels = c("poly4","pc9", "nc11"), label_size = 6)

## Warning in predict.lm(model, newdata = new_data_frame(list(x = xseq)), se.fit =
## se, : prediction from a rank-deficient fit may be misleading

ggplot(Auto,aes(x=displacement,y=mpg))+
  geom_point(alpha=0.55,color="darkgrey")+
  stat_smooth(method="lm",
              formula=y~poly(x,4,raw=T),
            lty=1,aes(col="Regresi Polinomial"),col="blue",se=F)+
  stat_smooth(method="lm",
              formula=y~cut(x,9),
            lty=1,aes(col="Piecewise Constant"),col="red",se=F)+
   stat_smooth(method = "lm", formula = y~ns(x, knots = c(12.00000, 13.00000, 13.77500, 14.50000, 14.90000, 15.50000, 16.00000, 16.50000, 17.02500 )), col = "black", se = F)+
  scale_color_manual(values = c("Regresi Polinomial"="blue","Piecewise Constant"="red","Natural Cubic Spline"="black"))

## Warning in predict.lm(model, newdata = new_data_frame(list(x = xseq)), se.fit =
## se, : prediction from a rank-deficient fit may be misleading

setelah diperoleh model terbaik dari 3 metode diatas dengan nilai RMSE dan MAE terkecil ada pada model piecewise constant 9 dan dapat dilihat pada plot yaitu plot dengan warna merah tersebut cukup mewakili data dari displacement ini.

Dapat disimpulkan bahwa antara mpg dan displacement ini memiliki hubungan dimana semakin tinggi engine displacment maka semakin tinggi mpg yng dihabiskan.

METODE SMOOTHING dan LOESS

a. METODE SMOOTHING SPLINE

Metode ini akan digunakan untuk data yang sama pada ketiga metode sebelumnya untuk membandingkan apakah smooting spline bisa lebih baik dari model yang sebelumnya telah diperoleh.

- mpg dan horsepower

model_sms <- with(data = Auto,smooth.spline(horsepower,mpg))
model_sms

## Call:
## smooth.spline(x = horsepower, y = mpg)
## 
## Smoothing Parameter  spar= 0.2648644  lambda= 5.101567e-07 (12 iterations)
## Equivalent Degrees of Freedom (Df): 42.14987
## Penalized Criterion (RSS): 1117.657
## GCV: 18.64132

pred_data <- broom::augment(model_sms)

pada hasil tabel diatas dapat dilihat bahwa nilai dari parameter lamda yang sangat kecil yaitu 5.101567e-07 sehingga menyebabkan kurva yang dihasilkan dari nilai lamda tersebut sangat tidak smooth, dapat dilihat pada gambar dibawah ini,

ggplot(pred_data,aes(x=x,y=y))+
  geom_point(alpha=0.55, color="black")+
  geom_line(aes(y=.fitted),col="blue",
            lty=1)+
  xlab("horsepower")+
  ylab("mpg")+
  theme_bw()

dengan hasil kurva smooth yang masih sangat kasar maka dilakukan pemilihan lamda yang dapat membuat kurva semakin halus, berikut dipilih nilai lamda untuk melihat pada nilai lamda berapa kurva semakin mulus,

model_sms_lambda <- data.frame(lambda=seq(0,5,by=0.5)) %>% 
  group_by(lambda) %>% 
  do(broom::augment(with(data = Auto,smooth.spline(horsepower,mpg,lambda = .$lambda))))

p <- ggplot(model_sms_lambda,
            aes(x=x,y=y))+
  geom_line(aes(y=.fitted),
            col="blue",
            lty=1
  )+
  facet_wrap(~lambda)
p

dapat dilihat pada pemilihan lamda untuk model smooth-spline ini semakin besar lamda maka semakin mulus plot nya, namun kita kita dapat memilih lamda disesuaikan dengan membandingkan nilai CGV dari setiap lamda yang memungkinkan akan memuluskan kurva, kemudian dapat ditentukan df=6, maka dapt mempresentasikan data lebih baik seperti berikut ,

model_sms <- with(data = Auto,smooth.spline(horsepower,mpg,df=6))
model_sms

## Call:
## smooth.spline(x = horsepower, y = mpg, df = 6)
## 
## Smoothing Parameter  spar= 0.8363393  lambda= 0.006861984 (13 iterations)
## Equivalent Degrees of Freedom (Df): 5.998717
## Penalized Criterion (RSS): 2490.625
## GCV: 18.92528

pred_data <- broom::augment(model_sms)

smooth1 <- ggplot(pred_data,aes(x=x,y=y))+
  geom_point(alpha=0.55, color="black")+
  geom_line(aes(y=.fitted),col="red",
            lty=1)+
  xlab("horsepower")+
  ylab("mpg")+
  theme_bw()
smooth1

dapat dilihat pada bahwa dengan df=6 diperoleh lamda 0.006861984 dari 13 iterasi dan diperoleh nilai CGV sebesar 18,92528. Dan terlihat pada plot bahwa kurva sudah sangat mulus untuk mpg dan horsepower.

- mpg dan acceleration

model_sms2 <- with(data = Auto,smooth.spline(acceleration,mpg))
model_sms2

## Call:
## smooth.spline(x = acceleration, y = mpg)
## 
## Smoothing Parameter  spar= 0.8515044  lambda= 0.009126167 (12 iterations)
## Equivalent Degrees of Freedom (Df): 5.556353
## Penalized Criterion (RSS): 7154.033
## GCV: 48.92947

pred_data2 <- broom::augment(model_sms2)

pada hasil tabel diatas dapat dilihat bahwa nilai dari parameter lamda yang sangat kecil yaitu 0.009126167 sehingga menyebabkan kurva yang dihasilkan dari nilai lamda tersebut sudah sangat smooth, dapat dilihat pada gambar dibawah ini,

smooth2 <- ggplot(pred_data2,aes(x=x,y=y))+
  geom_point(alpha=0.55, color="black")+
  geom_line(aes(y=.fitted),col="blue",
            lty=1)+
  xlab("acceleration")+
  ylab("mpg")+
  theme_bw()

smooth2

model_sms_lambda <- data.frame(lambda=seq(0,5,by=0.5)) %>% 
  group_by(lambda) %>% 
  do(broom::augment(with(data = Auto,smooth.spline(acceleration,mpg,lambda = .$lambda))))

p <- ggplot(model_sms_lambda,
            aes(x=x,y=y))+
  geom_line(aes(y=.fitted),
            col="blue",
            lty=1
  )+
  facet_wrap(~lambda)
p

dapat dilihat pada pemilihan lamda untuk model smooth-spline ini semakin besar lamda maka semakin mulus plot nya, namun kita kita dapat memilih lamda disesuaikan dengan membandingkan nilai CGV dari setiap lamda yang memungkinkan akan memuluskan kurva, kemudian dapat ditentukan df=3, maka dapt mempresentasikan data lebih baik seperti berikut ,

model_sms21 <- with(data = Auto,smooth.spline(acceleration,mpg,df=3))
model_sms21

## Call:
## smooth.spline(x = acceleration, y = mpg, df = 3)
## 
## Smoothing Parameter  spar= 1.0302  lambda= 0.1782674 (13 iterations)
## Equivalent Degrees of Freedom (Df): 2.999954
## Penalized Criterion (RSS): 7579.401
## GCV: 49.3904

pred_data21 <- broom::augment(model_sms21)

ggplot(pred_data21,aes(x=x,y=y))+
  geom_point(alpha=0.55, color="black")+
  geom_line(aes(y=.fitted),col="red",
            lty=1)+
  xlab("acceleration")+
  ylab("mpg")+
  theme_bw()

dapat dilihat pada bahwa dengan df=3 diperoleh lamda 0.1782674 dari 13 iterasi dan diperoleh nilai CGV sebesar 49.3904. Dan terlihat pada plot bahwa kurva sudah sangat mulus untuk mpg dan acceleration.

- mpg dan displacement

model_sms3 <- with(data = Auto,smooth.spline(displacement,mpg))
model_sms3

## Call:
## smooth.spline(x = displacement, y = mpg)
## 
## Smoothing Parameter  spar= 0.08978477  lambda= 6.674266e-09 (13 iterations)
## Equivalent Degrees of Freedom (Df): 52.73308
## Penalized Criterion (RSS): 471.1873
## GCV: 16.58384

pred_data3 <- broom::augment(model_sms)

pada hasil tabel diatas dapat dilihat bahwa nilai dari parameter lamda yang sangat kecil yaitu 6.674266e-09 sehingga menyebabkan kurva yang dihasilkan dari nilai lamda tersebut sudah cukup smooth, dapat dilihat pada gambar dibawah ini,

smooth3 <- ggplot(pred_data3,aes(x=x,y=y))+
  geom_point(alpha=0.55, color="black")+
  geom_line(aes(y=.fitted),col="blue",
            lty=1)+
  xlab("displacement")+
  ylab("mpg")+
  theme_bw()

smooth3

jika hasil kurva smooth yang masih sangat kasar maka dilakukan pemilihan lamda yang dapat membuat kurva semakin halus, berikut dipilih nilai lamda untuk melihat pada nilai lamda berapa kurva semakin mulus,

model_sms_lambda <- data.frame(lambda=seq(0,5,by=0.5)) %>% 
  group_by(lambda) %>% 
  do(broom::augment(with(data = Auto,smooth.spline(displacement,mpg,lambda = .$lambda))))

p <- ggplot(model_sms_lambda,
            aes(x=x,y=y))+
  geom_line(aes(y=.fitted),
            col="blue",
            lty=1
  )+
  facet_wrap(~lambda)
p

dapat dilihat pada pemilihan lamda untuk model smooth-spline ini semakin besar lamda maka semakin mulus plot nya, namun kita kita dapat memilih lamda disesuaikan dengan membandingkan nilai CGV dari setiap lamda yang memungkinkan akan memuluskan kurva, kemudian dapat ditentukan df=5, maka dapt mempresentasikan data lebih baik seperti berikut ,

model_sms31 <- with(data = Auto,smooth.spline(displacement,mpg,df=5))
model_sms31

## Call:
## smooth.spline(x = displacement, y = mpg, df = 5)
## 
## Smoothing Parameter  spar= 0.9815859  lambda= 0.01851043 (12 iterations)
## Equivalent Degrees of Freedom (Df): 5.000745
## Penalized Criterion (RSS): 2936.755
## GCV: 19.19855

pred_data31 <- broom::augment(model_sms31)

ggplot(pred_data31,aes(x=x,y=y))+
  geom_point(alpha=0.55, color="black")+
  geom_line(aes(y=.fitted),col="red",
            lty=1)+
  xlab("displacement")+
  ylab("mpg")+
  theme_bw()

dapat dilihat pada bahwa dengan df=5 diperoleh lamda 0.01851043 dari 12 iterasi dan diperoleh nilai CGV sebesar 19.19855. Dan terlihat pada plot bahwa kurva sudah sangat mulus untuk mpg dan displacement.

b. METODE LOESS

- mpg dan horsepower

model_loess <- loess(mpg~horsepower,
                     data = Auto)
summary(model_loess)

## Call:
## loess(formula = mpg ~ horsepower, data = Auto)
## 
## Number of Observations: 392 
## Equivalent Number of Parameters: 4.97 
## Residual Standard Error: 4.32 
## Trace of smoother matrix: 5.43  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

dapat dilihat beberapa plot jika kita mengamati nilai span dari range 0.1 hingga 5 dimana semakin besar nilai span semakin mulus kurvanya sebagai berikut :

model_loess_span <- data.frame(span=seq(0.1,5,by=0.5)) %>% 
  group_by(span) %>% 
  do(broom::augment(loess(mpg~horsepower,
                     data = Auto,span=.$span)))

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

p2 <- ggplot(model_loess_span,
       aes(x=horsepower,y=mpg))+
  geom_line(aes(y=.fitted),
            col="blue",
            lty=1
            )+
  facet_wrap(~span)
p2

pada hasil diatas dengan nilai span 0,6 saja sudah menunjukkan kurva semakin mulus, kemudian dibawah ini jika kita pilih nilai span 0,75 maka bentuk kurvanya sebagai berikut:

loess1 <- ggplot(Auto, aes(horsepower,mpg)) +
  geom_point(alpha=0.5,color="black") +
  stat_smooth(method='loess',
               formula=y~x,
              span = 0.75,
              col="blue",
              lty=1,
              se=F)
loess1

lalu dapat dicari juga nilai span dengan menggunakan cross validation sebagai berikut, akan dipilih span dengan nilai RMSE terkecil hasil cross validation :

set.seed(888)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

span <- seq(0.1,0.5,length.out=50)

best_loess1 <- map_dfr(span, function(i){
metric_loess <- map_dfr(cross_val$splits,
    function(x){
mod <- loess(mpg ~ horsepower,span = i,
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

data_eval <- na.omit(data.frame(pred=pred,
                                truth=truth))


rmse <- mlr3measures::rmse(truth = data_eval$truth,
                           response = data_eval$pred
                           )
mae <- mlr3measures::mae(truth = data_eval$truth,
                           response = data_eval$pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

head(metric_loess,20)

# menghitung rata-rata untuk 10 folds
mean_metric_loess <- colMeans(metric_loess)

mean_metric_loess
}
)

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 88

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.9473e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 90

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.5678e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 88

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.9473e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 90

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.5678e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 88

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.9473e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 90

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 5.9951e-16

best_loess1 <- cbind(span=span,best_loess1)
# menampilkan hasil all breaks
best_loess1

##         span     rmse      mae
## 1  0.1000000 4.252304 3.225419
## 2  0.1081633 4.231483 3.205378
## 3  0.1163265 4.248619 3.200222
## 4  0.1244898 4.232360 3.194677
## 5  0.1326531 4.237386 3.199095
## 6  0.1408163 4.232085 3.193228
## 7  0.1489796 4.246285 3.187883
## 8  0.1571429 4.249366 3.186191
## 9  0.1653061 4.248424 3.184295
## 10 0.1734694 4.241974 3.182916
## 11 0.1816327 4.247037 3.186926
## 12 0.1897959 4.253527 3.193200
## 13 0.1979592 4.252929 3.196654
## 14 0.2061224 4.252525 3.195074
## 15 0.2142857 4.262046 3.198223
## 16 0.2224490 4.263477 3.195271
## 17 0.2306122 4.272419 3.193290
## 18 0.2387755 4.266789 3.190315
## 19 0.2469388 4.265666 3.187888
## 20 0.2551020 4.271424 3.189890
## 21 0.2632653 4.267838 3.186223
## 22 0.2714286 4.276395 3.192654
## 23 0.2795918 4.279632 3.191888
## 24 0.2877551 4.272021 3.190894
## 25 0.2959184 4.275796 3.192166
## 26 0.3040816 4.273767 3.188727
## 27 0.3122449 4.279596 3.190772
## 28 0.3204082 4.278355 3.191824
## 29 0.3285714 4.273755 3.184855
## 30 0.3367347 4.274787 3.185930
## 31 0.3448980 4.267167 3.181951
## 32 0.3530612 4.269275 3.183528
## 33 0.3612245 4.271715 3.186843
## 34 0.3693878 4.272885 3.188304
## 35 0.3775510 4.274136 3.186489
## 36 0.3857143 4.267278 3.184746
## 37 0.3938776 4.269377 3.186387
## 38 0.4020408 4.270406 3.187194
## 39 0.4102041 4.269965 3.188680
## 40 0.4183673 4.270313 3.190169
## 41 0.4265306 4.271892 3.194359
## 42 0.4346939 4.272356 3.195559
## 43 0.4428571 4.271780 3.196905
## 44 0.4510204 4.268600 3.195091
## 45 0.4591837 4.269150 3.195939
## 46 0.4673469 4.269335 3.196224
## 47 0.4755102 4.270988 3.198561
## 48 0.4836735 4.267736 3.199559
## 49 0.4918367 4.271171 3.201232
## 50 0.5000000 4.269020 3.201599

#berdasarkan rmse
best_loess1 %>% slice_min(rmse)

##        span     rmse      mae
## 1 0.1081633 4.231483 3.205378

#berdasarkan mae
best_loess1 %>% slice_min(mae)

##       span     rmse      mae
## 1 0.344898 4.267167 3.181951

dapat dilihat bahwa nilai RMSE terkecil yaitu pada span=0.1081633.

loess12 <- ggplot(Auto, aes(horsepower,mpg)) +
  geom_point(alpha=0.5,color="black") +
  stat_smooth(method='loess',
               formula=y~x,
              span = 0.1081633,
              col="blue",
              lty=1,
              se=F)
loess12

jika kita lihat plot model loess dengan nilai span = 0.1081633 yang memiliki nilai RMSE terkecil maka kurva tidak mulus, oleh karena itu tetap akan dipilih nilai span=0.75, berikut perbandingan kurva antara span=0.75 dan span=0.1081633 :

ggplot(Auto,aes(x=horsepower,y=mpg))+
  geom_point(alpha=0.55,color="darkgrey")+
  stat_smooth(method="loess",
              formula=y~x,span = 0.1081633,col="blue",lty=1,se=F)+
  stat_smooth(method="loess",
              formula=y~x,span = 0.75,col="red",lty=1,se=F)+
  scale_color_manual(values = c("LOESS 0.1081633"="blue","OESS 0.75"="red"))

jika dilihat perbandingannya maka Model LOESS dengan span=0.75 lebih mulus dibandingkan dengan span= 0,1081633.

- mpg dan acceleration

model_loess <- loess(mpg~acceleration,
                     data = Auto)
summary(model_loess)

## Call:
## loess(formula = mpg ~ acceleration, data = Auto)
## 
## Number of Observations: 392 
## Equivalent Number of Parameters: 5.44 
## Residual Standard Error: 6.95 
## Trace of smoother matrix: 5.97  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

dapat dilihat beberapa plot jika kita mengamati nilai span dari range 0.1 hingga 5 dimana semakin besar nilai span semakin mulus kurvanya sebagai berikut :

model_loess_span <- data.frame(span=seq(0.1,5,by=0.5)) %>% 
  group_by(span) %>% 
  do(broom::augment(loess(mpg~acceleration,
                     data = Auto,span=.$span)))

p2 <- ggplot(model_loess_span,
       aes(x=acceleration,y=mpg))+
  geom_line(aes(y=.fitted),
            col="blue",
            lty=1
            )+
  facet_wrap(~span)
p2

pada hasil diatas dengan nilai span 0,6 saja sudah menunjukkan kurva semakin mulus, kemudian dibawah ini jika kita pilih nilai span 0,75 maka bentuk kurvanya sebagai berikut:

loess2 <- ggplot(Auto, aes(acceleration,mpg)) +
  geom_point(alpha=0.5,color="black") +
  stat_smooth(method='loess',
               formula=y~x,
              span = 0.75,
              col="blue",
              lty=1,
              se=F)
loess2

set.seed(888)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

span <- seq(0.1,0.5,length.out=50)

best_loess2 <- map_dfr(span, function(i){
metric_loess <- map_dfr(cross_val$splits,
    function(x){
mod <- loess(mpg ~ acceleration,span = i,
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

data_eval <- na.omit(data.frame(pred=pred,
                                truth=truth))


rmse <- mlr3measures::rmse(truth = data_eval$truth,
                           response = data_eval$pred
                           )
mae <- mlr3measures::mae(truth = data_eval$truth,
                           response = data_eval$pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

head(metric_loess,20)

# menghitung rata-rata untuk 10 folds
mean_metric_loess <- colMeans(metric_loess)

mean_metric_loess
}
)

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 14.7

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 5.0237e-17

best_loess2 <- cbind(span=span,best_loess2)
# menampilkan hasil all breaks
best_loess2

##         span     rmse      mae
## 1  0.1000000 7.151111 5.788477
## 2  0.1081633 7.125819 5.771783
## 3  0.1163265 7.134794 5.766607
## 4  0.1244898 7.133215 5.759420
## 5  0.1326531 7.097500 5.724688
## 6  0.1408163 7.100221 5.727130
## 7  0.1489796 7.128290 5.757999
## 8  0.1571429 7.089359 5.736927
## 9  0.1653061 7.099287 5.745807
## 10 0.1734694 7.103402 5.753668
## 11 0.1816327 7.102602 5.742795
## 12 0.1897959 7.098719 5.741475
## 13 0.1979592 7.116402 5.746295
## 14 0.2061224 7.096639 5.734868
## 15 0.2142857 7.098794 5.739141
## 16 0.2224490 7.103094 5.751629
## 17 0.2306122 7.089906 5.748137
## 18 0.2387755 7.080634 5.747328
## 19 0.2469388 7.072773 5.740761
## 20 0.2551020 7.060613 5.732950
## 21 0.2632653 7.053934 5.726877
## 22 0.2714286 7.047607 5.720023
## 23 0.2795918 7.045635 5.721605
## 24 0.2877551 7.038972 5.717494
## 25 0.2959184 7.036122 5.715767
## 26 0.3040816 7.027548 5.713194
## 27 0.3122449 7.015244 5.701765
## 28 0.3204082 7.012734 5.701001
## 29 0.3285714 7.010253 5.694352
## 30 0.3367347 7.001467 5.684793
## 31 0.3448980 6.995728 5.679982
## 32 0.3530612 6.988839 5.671989
## 33 0.3612245 6.981602 5.663705
## 34 0.3693878 6.981069 5.661170
## 35 0.3775510 6.974665 5.653863
## 36 0.3857143 6.971542 5.647048
## 37 0.3938776 6.967390 5.643307
## 38 0.4020408 6.965280 5.640694
## 39 0.4102041 6.962046 5.639393
## 40 0.4183673 6.959201 5.637927
## 41 0.4265306 6.958656 5.636661
## 42 0.4346939 6.957700 5.637191
## 43 0.4428571 6.956942 5.637362
## 44 0.4510204 6.954318 5.636558
## 45 0.4591837 6.952891 5.633994
## 46 0.4673469 6.951727 5.632833
## 47 0.4755102 6.949852 5.631223
## 48 0.4836735 6.950171 5.633131
## 49 0.4918367 6.949112 5.631564
## 50 0.5000000 6.947773 5.631030

#berdasarkan rmse
best_loess2 %>% slice_min(rmse)

##   span     rmse     mae
## 1  0.5 6.947773 5.63103

#berdasarkan mae
best_loess2 %>% slice_min(mae)

##   span     rmse     mae
## 1  0.5 6.947773 5.63103

dapat dilihat bahwa nilai RMSE terkecil yaitu pada span=0.5.

loess21 <- ggplot(Auto, aes(acceleration,mpg)) +
  geom_point(alpha=0.5,color="black") +
  stat_smooth(method='loess',
               formula=y~x,
              span =0.5,
              col="blue",
              lty=1,
              se=F)
loess21

jika kita lihat plot model loess dengan nilai span = 0.5 yang memiliki nilai RMSE terkecil maka kurva tidak mulus, oleh karena itu tetap akan dipilih nilai span=0.75, berikut perbandingan kurva antara span=0.75 dan span=0.5 :

ggplot(Auto,aes(x=acceleration,y=mpg))+
  geom_point(alpha=0.55,color="darkgrey")+
  stat_smooth(method="loess",
              formula=y~x,span = 0.5,col="blue",lty=1,se=F)+
  stat_smooth(method="loess",
              formula=y~x,span = 0.75,col="red",lty=1,se=F)+
  scale_color_manual(values = c("LOESS 0.1081633"="blue","OESS 0.75"="red"))

jika dilihat perbandingannya maka Model LOESS dengan span=0.75 lebih mulus dibandingkan dengan span= 0,5.

- mpg dan displacement

model_loess <- loess(mpg~displacement,
                     data = Auto)
summary(model_loess)

## Call:
## loess(formula = mpg ~ displacement, data = Auto)
## 
## Number of Observations: 392 
## Equivalent Number of Parameters: 4.42 
## Residual Standard Error: 4.372 
## Trace of smoother matrix: 4.82  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

dapat dilihat beberapa plot jika kita mengamati nilai span dari range 0.1 hingga 5 dimana semakin besar nilai span semakin mulus kurvanya sebagai berikut :

model_loess_span <- data.frame(span=seq(0.1,5,by=0.5)) %>% 
  group_by(span) %>% 
  do(broom::augment(loess(mpg~displacement,
                     data = Auto,span=.$span)))

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 97

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 2.0557e-17

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

p2 <- ggplot(model_loess_span,
       aes(x=displacement,y=mpg))+
  geom_line(aes(y=.fitted),
            col="blue",
            lty=1
            )+
  facet_wrap(~span)
p2

pada hasil diatas dengan nilai span 0,6 saja sudah menunjukkan kurva semakin mulus, kemudian dibawah ini jika kita pilih nilai span 0,75 maka bentuk kurvanya sebagai berikut:

loess3 <- ggplot(Auto, aes(displacement,mpg)) +
  geom_point(alpha=0.5,color="black") +
  stat_smooth(method='loess',
               formula=y~x,
              span = 0.75,
              col="blue",
              lty=1,
              se=F)
loess3

set.seed(888)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")

span <- seq(0.1,0.5,length.out=50)

best_loess2 <- map_dfr(span, function(i){
metric_loess <- map_dfr(cross_val$splits,
    function(x){
mod <- loess(mpg ~ displacement,span = i,
         data=Auto[x$in_id,])
pred <- predict(mod,
               newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg

data_eval <- na.omit(data.frame(pred=pred,
                                truth=truth))


rmse <- mlr3measures::rmse(truth = data_eval$truth,
                           response = data_eval$pred
                           )
mae <- mlr3measures::mae(truth = data_eval$truth,
                           response = data_eval$pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

head(metric_loess,20)

# menghitung rata-rata untuk 10 folds
mean_metric_loess <- colMeans(metric_loess)

mean_metric_loess
}
)

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 97

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.5678e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 98

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 3.1349e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 97

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 9.3816e-17

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 97

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 2.075e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 97.5

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.5

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 97

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.0334e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 98

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.3777e-17

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 97

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.5678e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 97

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.5678e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 98

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.0334e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1

best_loess2 <- cbind(span=span,best_loess2)
# menampilkan hasil all breaks
best_loess2

##         span     rmse      mae
## 1  0.1000000 4.225327 3.122863
## 2  0.1081633 4.417702 3.179048
## 3  0.1163265 4.157759 3.091476
## 4  0.1244898 4.103268 3.094051
## 5  0.1326531 4.108009 3.093898
## 6  0.1408163 4.115126 3.095317
## 7  0.1489796 4.116378 3.108550
## 8  0.1571429 4.125344 3.109337
## 9  0.1653061 4.121612 3.108970
## 10 0.1734694 4.118629 3.108359
## 11 0.1816327 4.093315 3.091939
## 12 0.1897959 4.093071 3.090964
## 13 0.1979592 4.090070 3.087954
## 14 0.2061224 4.089061 3.084043
## 15 0.2142857 4.086769 3.080722
## 16 0.2224490 4.084785 3.078321
## 17 0.2306122 4.074832 3.072816
## 18 0.2387755 4.074621 3.076482
## 19 0.2469388 4.068849 3.072597
## 20 0.2551020 4.065645 3.073137
## 21 0.2632653 4.062349 3.073529
## 22 0.2714286 4.065194 3.072428
## 23 0.2795918 4.073215 3.078652
## 24 0.2877551 4.082263 3.087041
## 25 0.2959184 4.086366 3.086601
## 26 0.3040816 4.107827 3.101168
## 27 0.3122449 4.119666 3.110011
## 28 0.3204082 4.119619 3.103120
## 29 0.3285714 4.132290 3.109060
## 30 0.3367347 4.137273 3.110742
## 31 0.3448980 4.138277 3.117718
## 32 0.3530612 4.142644 3.123295
## 33 0.3612245 4.150772 3.131658
## 34 0.3693878 4.151897 3.135399
## 35 0.3775510 4.158076 3.137300
## 36 0.3857143 4.161003 3.139435
## 37 0.3938776 4.163513 3.140417
## 38 0.4020408 4.165850 3.142187
## 39 0.4102041 4.166398 3.142692
## 40 0.4183673 4.197209 3.153756
## 41 0.4265306 4.203905 3.157064
## 42 0.4346939 4.207990 3.159411
## 43 0.4428571 4.212641 3.159148
## 44 0.4510204 4.213111 3.159723
## 45 0.4591837 4.213235 3.160100
## 46 0.4673469 4.212284 3.158568
## 47 0.4755102 4.213083 3.158899
## 48 0.4836735 4.217184 3.156877
## 49 0.4918367 4.229526 3.157638
## 50 0.5000000 4.229615 3.151470

#berdasarkan rmse
best_loess2 %>% slice_min(rmse)

##        span     rmse      mae
## 1 0.2632653 4.062349 3.073529

#berdasarkan mae
best_loess2 %>% slice_min(mae)

##        span     rmse      mae
## 1 0.2714286 4.065194 3.072428

dapat dilihat bahwa nilai RMSE terkecil yaitu pada span=0.2632653.

loess31 <- ggplot(Auto, aes(displacement,mpg)) +
  geom_point(alpha=0.5,color="black") +
  stat_smooth(method='loess',
               formula=y~x,
              span =0.2632653,
              col="blue",
              lty=1,
              se=F)
loess31

jika kita lihat plot model loess dengan nilai span = 0.2632653 yang memiliki nilai RMSE terkecil maka kurva tidak mulus, oleh karena itu tetap akan dipilih nilai span=0.75, berikut perbandingan kurva antara span=0.75 dan span=0.2632653 :

ggplot(Auto,aes(x=displacement,y=mpg))+
  geom_point(alpha=0.55,color="darkgrey")+
  stat_smooth(method="loess",
              formula=y~x,span = 0.2632653,col="blue",lty=1,se=F)+
  stat_smooth(method="loess",
              formula=y~x,span = 0.75,col="red",lty=1,se=F)+
  scale_color_manual(values = c("LOESS 0.1081633"="blue","OESS 0.75"="red"))

jika dilihat perbandingannya maka Model LOESS dengan span=0.75 lebih mulus dibandingkan dengan span=0.2632653.

sehingga dapat ditambahkan model terbaik untuk :

mpg vs horsepower

plot_grid(poly3, pc12, nc9,smooth1,loess1, labels = c("poly3","pc12", "nc9","smoothing","LOESS"), label_size = 6)

plot diatas adalah 5 metode untuk model non-linier untuk mpg vs horsepower,dapat dilihat perbedaan dari 5 metode tersebut, namun untuk model polinomial, smoothing spline dan local regresion memiliki curva yang sangat mirip dan sangat mulus.

mpg vs acceleration

plot_grid(poly1, pc3, nc4,smooth2,loess2, labels = c("poly1","pc3", "nc4","smoothing","LOESS"), label_size = 6)

dapat dilihat pada 5 plot diatas untuk 5 metode model non-linier yang berbeda untuk data mpg vs acceleration, dapat dilihat untuk curva model natural spline, smoothing spline dan local regression sangat mirip bentuknya.

mpg vs displacement

plot_grid(poly4, pc9, nc11,smooth3,loess3, labels = c("poly4","pc9", "nc11","smoothing","LOESS"), label_size = 6)

## Warning in predict.lm(model, newdata = new_data_frame(list(x = xseq)), se.fit =
## se, : prediction from a rank-deficient fit may be misleading

dapat dilihat pada 5 plot diatas untuk 5 metode model non-linier yang berbeda untuk data mpg vs displacement, dapat dilihat untuk kurva model polinomial , smoothing spline dan local regression sangat mirip bentuknya.