Final Exam Review Assignment 1: Introduction to Data Science

Muhammad Zhillan Zakiyyan

2022-11-04

1. Introduction

1.1. Scope of Material

  • Nonlinear Regression I
    Modeling nonlinear data, covering:
    1. Linear Regression
    2. Polynomial Regression
    3. Step Function Regression
  • Nonlinear Regression II
    Smoothing nonlinear data, covering:
    1. Piecewise Cubic Polynomials
    2. Spline Regression
    3. Smoothing Splines
    4. The LOESS Function

1.2. Data Description

The analysis uses two kinds of data:
1. generated data: 1000 observations drawn from a normal distribution with mean 3 and standard deviation 2 (the third argument of rnorm() below is the standard deviation).
2. the Auto dataset from library(ISLR).

The Auto dataset consists of 392 observations and 9 variables, with the following types:
1. mpg (numeric)
2. cylinders (numeric)
3. displacement (numeric)
4. horsepower (numeric)
5. weight (numeric)
6. acceleration (numeric)
7. year (numeric)
8. origin (numeric)
9. name (categorical (Factor))

For this nonlinear regression analysis, the variables used are mpg (fuel consumption in miles per gallon), horsepower (engine power in hp), and origin (the manufacturer's region):
Origin 1: America
Origin 2: Europe
Origin 3: Japan

1.3. Packages

  • Packages required for the analysis:
    1. ISLR
    2. tidyverse
    3. ggplot2
    4. dplyr
    5. purrr
    6. rsample
    7. splines
  • Custom function:
    1. MSE (Mean Squared Error)
MSE <- function(pred, act) {
    mean((pred - act)^2)
}
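
A quick sanity check of the helper on a toy vector (a minimal sketch; the numbers are arbitrary): errors of +0.1 and -0.1 should give an MSE of 0.01.

MSE(pred = c(1.1, 1.9), act = c(1, 2))
## [1] 0.01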

2. Nonlinear Modeling of the Generated Data

2.1. Data Initialization

set.seed(92)
psd.x.tr <- rnorm(1000, 3, 2)
err.tr <- rnorm(1000)
psd.y.tr <- 3 - 2 * psd.x.tr + psd.x.tr^2 + err.tr
plot(psd.x.tr, psd.y.tr)
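
For reference, the true mean function used to generate the data, y = 3 - 2x + x^2, can be overlaid on the scatterplot above (a small sketch; it draws onto the active plot):

curve(3 - 2 * x + x^2, add = TRUE, col = "blue", lwd = 2)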

2.2. Linear Modeling

lin.mod.tr <- lm(psd.y.tr ~ psd.x.tr)
plot(psd.x.tr, psd.y.tr)
lines(psd.x.tr, lin.mod.tr$fitted.values, col = "red")

summary(lin.mod.tr)
## 
## Call:
## lm(formula = psd.y.tr ~ psd.x.tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.567 -3.629 -1.902  1.461 53.266 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.33828    0.32640   -4.10 4.46e-05 ***
## psd.x.tr     3.84551    0.09153   42.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.895 on 998 degrees of freedom
## Multiple R-squared:  0.6388, Adjusted R-squared:  0.6384 
## F-statistic:  1765 on 1 and 998 DF,  p-value: < 2.2e-16

2.3. Polynomial Modeling

pol.mod.tr <- lm(psd.y.tr ~ poly(psd.x.tr, 2, raw = T))
ix.tr <- sort(psd.x.tr, index.return = T)$ix
plot(psd.x.tr, psd.y.tr)
lines(psd.x.tr[ix.tr], pol.mod.tr$fitted.values[ix.tr], col = "blue", cex = 2)

summary(pol.mod.tr)
## 
## Call:
## lm(formula = psd.y.tr ~ poly(psd.x.tr, 2, raw = T))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3029 -0.6866 -0.0281  0.6954  3.1818 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.043390   0.061876   49.19   <2e-16 ***
## poly(psd.x.tr, 2, raw = T)1 -1.972414   0.036241  -54.42   <2e-16 ***
## poly(psd.x.tr, 2, raw = T)2  0.994696   0.005565  178.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 997 degrees of freedom
## Multiple R-squared:  0.9891, Adjusted R-squared:  0.989 
## F-statistic: 4.51e+04 on 2 and 997 DF,  p-value: < 2.2e-16

2.4. Step Function Modeling

# Determine the range of the data
range(psd.x.tr)
## [1] -4.579933  9.162481
# Fit a step function with 6 breaks
step.mod.tr <- lm(psd.y.tr ~ cut(psd.x.tr, 6))
plot(psd.x.tr, psd.y.tr)
lines(psd.x.tr[ix.tr], step.mod.tr$fitted.values[ix.tr], col = "darkgreen")

summary(step.mod.tr)
## 
## Call:
## lm(formula = psd.y.tr ~ cut(psd.x.tr, 6))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0647  -2.2271  -0.2512   1.4305  23.1358 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       22.2047     1.8301  12.133  < 2e-16 ***
## cut(psd.x.tr, 6)(-2.29,0.000871] -16.9789     1.8782  -9.040  < 2e-16 ***
## cut(psd.x.tr, 6)(0.000871,2.29]  -19.6694     1.8427 -10.674  < 2e-16 ***
## cut(psd.x.tr, 6)(2.29,4.58]      -14.2925     1.8389  -7.772 1.92e-14 ***
## cut(psd.x.tr, 6)(4.58,6.87]       -0.3025     1.8490  -0.164     0.87    
## cut(psd.x.tr, 6)(6.87,9.18]       24.6640     1.9658  12.546  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.66 on 994 degrees of freedom
## Multiple R-squared:  0.8613, Adjusted R-squared:  0.8606 
## F-statistic:  1235 on 5 and 994 DF,  p-value: < 2.2e-16

2.5. Model Comparison (I)

model.tr <- data.frame(Model = c("Linear", "Polynomial (2)", "Step (6)"),
    AIC = c(AIC(lin.mod.tr), AIC(pol.mod.tr), AIC(step.mod.tr)), MSE = c(MSE(predict(lin.mod.tr),
        psd.y.tr), MSE(predict(pol.mod.tr), psd.y.tr), MSE(predict(step.mod.tr),
        psd.y.tr)))
knitr::kable(model.tr)
Model AIC MSE
Linear 6390.013 34.679046
Polynomial (2) 2894.280 1.049593
Step (6) 5440.859 13.316209

The best model so far (within the Nonlinear Regression I material) is chosen by the lowest AIC and MSE. The three fits make clear that the relationship is not linear: the degree-2 polynomial is the best model for the generated data, since its AIC and MSE (2894.280 and 1.049593) are the smallest among the candidates.
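
To make the AIC column less of a black box, the polynomial model's value can be reproduced by hand from its residual sum of squares (a sketch assuming the Gaussian likelihood that AIC() uses for lm objects; rss, n, and p are illustrative names):

rss <- sum(residuals(pol.mod.tr)^2)
n <- length(psd.y.tr)
p <- length(coef(pol.mod.tr))  # regression coefficients, incl. intercept
n * (log(2 * pi) + log(rss/n) + 1) + 2 * (p + 1)  # the extra +1 counts sigma^2
## [1] 2894.28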

2.6. Piecewise Cubic Polynomial Modelling

# Split the data into two pieces at the mean (mean = 3)
psd.dt.tr <- cbind(psd.x.tr, psd.y.tr)
psd.dt.tr1 <- psd.dt.tr[psd.x.tr <= 3, ]
psd.dt.tr2 <- psd.dt.tr[psd.x.tr > 3, ]

# Plotting
plot(psd.x.tr, psd.y.tr, main = "Piecewise Cubic Polynomial\nData Bangkitan")
abline(v = 3, col = "orange", lty = 3)
cub.mod.tr1 <- lm(psd.dt.tr1[, 2] ~ psd.dt.tr1[, 1] + I(psd.dt.tr1[, 1]^2) +
    I(psd.dt.tr1[, 1]^3))
ix.tr1 <- sort(psd.dt.tr1[, 1], index.return = T)$ix
lines(psd.dt.tr1[ix.tr1, 1], cub.mod.tr1$fitted.values[ix.tr1], col = "red",
    lwd = 2)
cub.mod.tr2 <- lm(psd.dt.tr2[, 2] ~ psd.dt.tr2[, 1] + I(psd.dt.tr2[, 1]^2) +
    I(psd.dt.tr2[, 1]^3))
ix.tr2 <- sort(psd.dt.tr2[, 1], index.return = T)$ix
lines(psd.dt.tr2[ix.tr2, 1], cub.mod.tr2$fitted.values[ix.tr2], col = "red",
    lwd = 2)

2.7. Spline Regression

1. Determine the number of basis functions and knots (computed automatically)

psd.dt.tr <- as.data.frame(psd.dt.tr)
knots.val.tr <- attr(bs(psd.dt.tr$psd.x.tr, df = 6), "knots")
knots.val.tr
##      25%      50%      75% 
## 1.578083 2.879866 4.343768

2. Fit the spline regression (df = number of segments + 2)

spline.mod.tr <- lm(psd.y.tr ~ bs(psd.x.tr, knots = knots.val.tr), data = psd.dt.tr)
summary(spline.mod.tr)
## 
## Call:
## lm(formula = psd.y.tr ~ bs(psd.x.tr, knots = knots.val.tr), data = psd.dt.tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3278 -0.6843 -0.0170  0.7025  3.1352 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          34.0164     0.8785  38.720  < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)1 -24.6332     1.2703 -19.392  < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)2 -35.8056     0.8389 -42.683  < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)3 -28.9695     0.9021 -32.114  < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)4 -15.2662     0.8867 -17.217  < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)5   7.6464     1.0076   7.589 7.43e-14 ***
## bs(psd.x.tr, knots = knots.val.tr)6  34.9890     1.0285  34.018  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 993 degrees of freedom
## Multiple R-squared:  0.9891, Adjusted R-squared:  0.9891 
## F-statistic: 1.504e+04 on 6 and 993 DF,  p-value: < 2.2e-16
ggplot(psd.dt.tr, aes(x = psd.x.tr, y = psd.y.tr)) + geom_point(color = "black") +
    stat_smooth(method = "lm", formula = y ~ bs(x, knots = knots.val.tr),
        lty = 1, se = F)
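
The df = 6 choice and the three knots above are two views of the same quantity: for a cubic B-spline basis without an intercept, df = number of interior knots + degree. A short sketch to verify (basis.tr is an illustrative name):

basis.tr <- bs(psd.dt.tr$psd.x.tr, df = 6)  # cubic (degree 3) by default
ncol(basis.tr)  # 6 basis functions
length(attr(basis.tr, "knots"))  # 3 interior knots, placed at the quartiles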

2.8. Smoothing Spline

ss.tr <- smooth.spline(psd.x.tr, psd.y.tr, all.knots = T)
ss.mod.tr <- with(data = psd.dt.tr, smooth.spline(psd.x.tr, psd.y.tr))
ss.mod.tr
## Call:
## smooth.spline(x = psd.x.tr, y = psd.y.tr)
## 
## Smoothing Parameter  spar= 0.9044535  lambda= 0.0004273051 (14 iterations)
## Equivalent Degrees of Freedom (Df): 12.86151
## Penalized Criterion (RSS): 1042.428
## GCV: 191.3561
plot(psd.x.tr, psd.y.tr, pch = 19, main = "Smoothing Spline Data Bangkitan")
lines(ss.tr, col = "red", lwd = 1.5)
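
By default smooth.spline() picks lambda by GCV, as seen in the output above. Two alternatives worth knowing are leave-one-out cross-validation and fixing the effective degrees of freedom directly (a sketch; ss.cv.tr and ss.df.tr are illustrative names):

ss.cv.tr <- smooth.spline(psd.x.tr, psd.y.tr, cv = TRUE)  # LOO-CV instead of GCV
ss.df.tr <- smooth.spline(psd.x.tr, psd.y.tr, df = 5)  # fix the effective df by hand
c(gcv = ss.mod.tr$df, loocv = ss.cv.tr$df, fixed = ss.df.tr$df)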

2.9. LOESS Function

loess.tr <- loess(formula = psd.y.tr ~ psd.x.tr, data = psd.dt.tr)
summary(loess.tr)
## Call:
## loess(formula = psd.y.tr ~ psd.x.tr, data = psd.dt.tr)
## 
## Number of Observations: 1000 
## Equivalent Number of Parameters: 5.37 
## Residual Standard Error: 1.027 
## Trace of smoother matrix: 5.88  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE
loess.span.tr <- data.frame(span = seq(0.1, 2, by = 0.1)) %>%
    group_by(span) %>%
    do(broom::augment(loess(psd.y.tr ~ psd.x.tr, data = psd.dt.tr, span = .$span)))
ggplot(loess.span.tr, aes(x = psd.x.tr, y = psd.y.tr)) + geom_line(aes(y = .fitted),
    col = "blue", lty = 1) + facet_wrap(~span)
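
Rather than judging the facets by eye, each span can also be scored numerically with the MSE helper (a sketch; note this is training MSE, so it will always favor the smallest spans):

loess.span.tr %>%
    group_by(span) %>%
    summarise(MSE = MSE(.fitted, psd.y.tr))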

2.10. Exploratory Comparison of All Models

ggplot(psd.dt.tr, aes(x = psd.x.tr, y = psd.y.tr)) + geom_point(color = "black") +
    stat_smooth(method = "lm", formula = y ~ x, col = "red", show.legend = T) +
    stat_smooth(method = "lm", formula = y ~ poly(x, 2, raw = T), col = "blue",
        lwd = 3.5, show.legend = T) + stat_smooth(method = "lm", formula = y ~
    cut(x, 6), col = "green", show.legend = T) + stat_smooth(method = "lm",
    formula = y ~ bs(x, knots = knots.val.tr), col = "yellow", lwd = 2,
    show.legend = T) + stat_smooth(method = "loess", formula = y ~ x, col = "orange",
    show.legend = T)

Legend:
Red: Linear Model
Blue: Polynomial Model
Green: Step Function Model
Yellow: Spline Regression Model
Orange: LOESS Function Model

model.tr2 <- data.frame(Model = c("Linear", "Polynomial (2)", "Step-Function (6)",
    "Spline Regression", "LOESS Function"), MSE = c(MSE(predict(lin.mod.tr),
    psd.y.tr), MSE(predict(pol.mod.tr), psd.y.tr), MSE(predict(step.mod.tr),
    psd.y.tr), MSE(predict(spline.mod.tr), psd.y.tr), MSE(predict(loess.tr),
    psd.y.tr)))
knitr::kable(model.tr2)
Model MSE
Linear 34.679046
Polynomial (2) 1.049593
Step-Function (6) 13.316209
Spline Regression 1.044992
LOESS Function 1.047259

Best model for the generated data: spline regression (smallest MSE).

3. Nonlinear Modeling of the Auto Data

3.1. Data Initialization

psd.dt.A <- Auto %>%
    select(mpg, horsepower, origin)
str(psd.dt.A)
## 'data.frame':    392 obs. of  3 variables:
##  $ mpg       : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ horsepower: num  130 165 150 150 140 198 220 215 225 190 ...
##  $ origin    : num  1 1 1 1 1 1 1 1 1 1 ...
knitr::kable(head(psd.dt.A))
mpg horsepower origin
18 130 1
15 165 1
18 150 1
16 150 1
17 140 1
15 198 1

3.2. Data Exploration

pairs(psd.dt.A, pch = 19)

clist <- c("darkgreen", "blue", "red")
plot(psd.dt.A$mpg, psd.dt.A$horsepower, pch = 19, col = clist[psd.dt.A$origin])

Based on this exploration, two modeling analyses of variable relationships will be carried out on the Auto data:
1. mpg (x) and origin (y)
2. mpg (x) and horsepower (y)

3.3. Linear Modeling

Model origin~mpg

lin.mod.1 <- lm(origin ~ mpg, data = psd.dt.A)
plot(psd.dt.A$mpg, psd.dt.A$origin, pch = 19)
lines(psd.dt.A$mpg, lin.mod.1$fitted.values, col = "red")

summary(lin.mod.1)
## 
## Call:
## lm(formula = origin ~ mpg, data = psd.dt.A)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48384 -0.40469 -0.08386  0.34740  1.74114 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.208871   0.106520   1.961   0.0506 .  
## mpg         0.058333   0.004311  13.531   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6654 on 390 degrees of freedom
## Multiple R-squared:  0.3195, Adjusted R-squared:  0.3177 
## F-statistic: 183.1 on 1 and 390 DF,  p-value: < 2.2e-16

Model horsepower~mpg

lin.mod.2 <- lm(horsepower ~ mpg, data = psd.dt.A)
plot(psd.dt.A$mpg, psd.dt.A$horsepower, pch = 19)
lines(psd.dt.A$mpg, lin.mod.2$fitted.values, col = "red")

summary(lin.mod.2)
## 
## Call:
## lm(formula = horsepower ~ mpg, data = psd.dt.A)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.892 -15.716  -2.094  13.108  96.947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 194.4756     3.8732   50.21   <2e-16 ***
## mpg          -3.8389     0.1568  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.19 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

3.4. Polynomial Modeling

Determining the best polynomial degree for each relationship

# The search for the best polynomial degree is capped at degree 11
df.list.pol <- data.frame(df = c(rep(NA, 10)), AIC1 = c(rep(NA, 10)), MSE1 = c(rep(NA,
    10)), AIC2 = c(rep(NA, 10)), MSE2 = c(rep(NA, 10)))
for (j in 2:3) {  # column 3 of psd.dt.A is origin (fills AIC1/MSE1); column 2 is horsepower (AIC2/MSE2)
    for (i in 2:11) {
        pol.mod <- lm(psd.dt.A[, j] ~ poly(mpg, i, raw = T), data = psd.dt.A)
        AIC.pol <- AIC(pol.mod)
        MSE.pol <- mean((predict(pol.mod) - psd.dt.A[, j])^2)
        if (j == 3) {
            df.list.pol[i - 1, 1:3] <- c(i, AIC.pol, MSE.pol)
        } else {
            df.list.pol[i - 1, 4:5] <- c(AIC.pol, MSE.pol)
        }
    }
}
knitr::kable(df.list.pol)
df AIC1 MSE1 AIC2 MSE2
2 797.6568 0.4389163 3485.217 416.7871
3 796.2795 0.4351511 3454.949 383.8523
4 797.3754 0.4341486 3456.813 383.7195
5 799.2506 0.4340104 3455.310 380.3060
6 800.0427 0.4326751 3455.135 378.2017
7 801.2607 0.4318129 3455.139 376.2801
8 802.3611 0.4308230 3454.849 374.0890
9 804.3474 0.4308080 3455.295 372.6086
10 803.5020 0.4276922 3456.768 372.1082
11 805.3653 0.4275430 3455.520 369.0375

Based on the loop results, the polynomial degree chosen for the origin~mpg relationship is 3, and for the horsepower~mpg relationship it is 8 (taking the minimum AIC).
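
The same choice can be read off programmatically instead of scanning the table (a one-line sketch per relationship):

df.list.pol$df[which.min(df.list.pol$AIC1)]  # origin~mpg: degree 3
df.list.pol$df[which.min(df.list.pol$AIC2)]  # horsepower~mpg: degree 8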

Model origin~mpg

ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "black") +
    stat_smooth(method = "lm", formula = y ~ poly(x, 3, raw = T), col = "blue",
        se = F)

Model horsepower~mpg

ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "black") +
    stat_smooth(method = "lm", formula = y ~ poly(x, 8, raw = T), col = "blue",
        se = F)

3.5. Step Function Modeling

Determining the best number of breaks for each relationship

# The search for the best number of breaks is capped at 12
df.list.step <- data.frame(df = c(rep(NA, 10)), AIC1 = c(rep(NA, 10)),
    MSE1 = c(rep(NA, 10)), AIC2 = c(rep(NA, 10)), MSE2 = c(rep(NA, 10)))
for (j in 2:3) {  # column 3 of psd.dt.A is origin (fills AIC1/MSE1); column 2 is horsepower (AIC2/MSE2)
    for (i in 3:12) {
        step.mod <- lm(psd.dt.A[, j] ~ cut(mpg, i), data = psd.dt.A)
        AIC.step <- AIC(step.mod)
        MSE.step <- mean((predict(step.mod) - psd.dt.A[, j])^2)
        if (j == 3) {
            df.list.step[i - 2, 1:3] <- c(i, AIC.step, MSE.step)
        } else {
            df.list.step[i - 2, 4:5] <- c(AIC.step, MSE.step)
        }
    }
}
knitr::kable(df.list.step)
df AIC1 MSE1 AIC2 MSE2
3 825.4751 0.4711959 3722.514 763.5092
4 816.3085 0.4579626 3605.549 563.6549
5 812.0036 0.4506557 3526.845 458.7770
6 812.9059 0.4493955 3554.364 489.6366
7 800.0531 0.4326866 3521.117 447.5318
8 809.4147 0.4408891 3551.701 481.3840
9 807.4994 0.4365074 3494.485 413.8924
10 805.2416 0.4317918 3478.747 395.5813
11 802.1242 0.4261916 3499.380 414.8383
12 792.2372 0.4134617 3496.674 409.8881

Based on the loop results, the number of breaks chosen for the origin~mpg relationship is 12, and for the horsepower~mpg relationship it is 10 (taking the minimum AIC).

Model origin~mpg

ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "black") +
    stat_smooth(method = "lm", formula = y ~ cut(x, 12), col = "blue",
        se = F)

Model horsepower~mpg

ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "black") +
    stat_smooth(method = "lm", formula = y ~ cut(x, 10), col = "blue",
        se = F)

3.6. Model Comparison (I)

pol.mod.1 <- lm(origin ~ poly(mpg, 3, raw = T), data = psd.dt.A)
pol.mod.2 <- lm(horsepower ~ poly(mpg, 8, raw = T), data = psd.dt.A)
step.mod.1 <- lm(origin ~ cut(mpg, 12), data = psd.dt.A)
step.mod.2 <- lm(horsepower ~ cut(mpg, 10), data = psd.dt.A)
model.auto <- data.frame(Model = c("Linear (mpg~origin)", "Linear (mpg~horsepower)",
    "Polynomial (3) (mpg~origin)", "Polynomial (8) (mpg~horsepower)", "Step (12) (mpg~origin)",
    "Step (10) (mpg~horsepower)"), AIC = c(AIC(lin.mod.1), AIC(lin.mod.2),
    AIC(pol.mod.1), AIC(pol.mod.2), AIC(step.mod.1), AIC(step.mod.2)),
    MSE = c(MSE(predict(lin.mod.1), psd.dt.A$origin), MSE(predict(lin.mod.2),
        psd.dt.A$horsepower), MSE(predict(pol.mod.1), psd.dt.A$origin),
        MSE(predict(pol.mod.2), psd.dt.A$horsepower), MSE(predict(step.mod.1),
            psd.dt.A$origin), MSE(predict(step.mod.2), psd.dt.A$horsepower)))
knitr::kable(model.auto)
Model AIC MSE
Linear (mpg~origin) 797.0222 0.4404478
Linear (mpg~horsepower) 3614.3235 582.3256764
Polynomial (3) (mpg~origin) 796.2795 0.4351511
Polynomial (8) (mpg~horsepower) 3454.8494 374.0890447
Step (12) (mpg~origin) 792.2372 0.4134617
Step (10) (mpg~horsepower) 3478.7475 395.5813018

Based on these results, the origin~mpg relationship is better modeled with the step function (12 breaks), while horsepower~mpg is better modeled with the degree-8 polynomial. These best models are still provisional, since more models follow (the Nonlinear Regression II material).

3.7. Spline Regression

Determining the number of basis functions and knots (computed automatically)

knots.val.auto <- attr(bs(psd.dt.A$mpg, df = 6), "knots")
knots.val.auto
##   25%   50%   75% 
## 17.00 22.75 29.00

A. B-Spline Model

1. B-Spline Model 1 (origin~mpg)

spline.mod.auto1 <- lm(origin ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A)
summary(spline.mod.auto1)
## 
## Call:
## lm(formula = origin ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39803 -0.38041 -0.02476  0.33868  1.81900 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                       1.04633    0.44800   2.336   0.0200 *
## bs(mpg, knots = knots.val.auto)1 -0.09890    0.64164  -0.154   0.8776  
## bs(mpg, knots = knots.val.auto)2 -0.07956    0.42992  -0.185   0.8533  
## bs(mpg, knots = knots.val.auto)3  0.57546    0.49774   1.156   0.2483  
## bs(mpg, knots = knots.val.auto)4  1.14229    0.49623   2.302   0.0219 *
## bs(mpg, knots = knots.val.auto)5  1.53577    0.64019   2.399   0.0169 *
## bs(mpg, knots = knots.val.auto)6  1.30545    0.61946   2.107   0.0357 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6649 on 385 degrees of freedom
## Multiple R-squared:  0.3292, Adjusted R-squared:  0.3187 
## F-statistic: 31.49 on 6 and 385 DF,  p-value: < 2.2e-16
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "orange") +
    stat_smooth(method = "lm", formula = y ~ bs(x, knots = knots.val.auto),
        lty = 1, se = F)

2. B-Spline Model 2 (horsepower~mpg)

spline.mod.auto2 <- lm(horsepower ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A)
summary(spline.mod.auto2)
## 
## Call:
## lm(formula = horsepower ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.043 -10.103  -0.818   8.075  95.778 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       195.012     13.181  14.795  < 2e-16 ***
## bs(mpg, knots = knots.val.auto)1    2.792     18.879   0.148    0.882    
## bs(mpg, knots = knots.val.auto)2  -80.488     12.649  -6.363 5.62e-10 ***
## bs(mpg, knots = knots.val.auto)3 -103.774     14.645  -7.086 6.62e-12 ***
## bs(mpg, knots = knots.val.auto)4 -126.115     14.600  -8.638  < 2e-16 ***
## bs(mpg, knots = knots.val.auto)5 -127.143     18.836  -6.750 5.45e-11 ***
## bs(mpg, knots = knots.val.auto)6 -141.413     18.226  -7.759 7.77e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.56 on 385 degrees of freedom
## Multiple R-squared:  0.7457, Adjusted R-squared:  0.7417 
## F-statistic: 188.1 on 6 and 385 DF,  p-value: < 2.2e-16
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "orange") +
    stat_smooth(method = "lm", formula = y ~ bs(x, knots = knots.val.auto),
        lty = 1, se = F)

B. Natural Spline

Model 1 (origin~mpg)

ns.spline.auto1 <- lm(origin ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A)
summary(ns.spline.auto1)
## 
## Call:
## lm(formula = origin ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36121 -0.36897 -0.03639  0.34308  1.82256 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        0.9776     0.2295   4.261 2.56e-05 ***
## ns(mpg, knots = knots.val.auto)1   0.6192     0.2188   2.830  0.00489 ** 
## ns(mpg, knots = knots.val.auto)2   1.2993     0.2091   6.215 1.33e-09 ***
## ns(mpg, knots = knots.val.auto)3   1.4244     0.5348   2.663  0.00806 ** 
## ns(mpg, knots = knots.val.auto)4   1.5339     0.2913   5.265 2.33e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6633 on 387 degrees of freedom
## Multiple R-squared:  0.3288, Adjusted R-squared:  0.3218 
## F-statistic: 47.39 on 4 and 387 DF,  p-value: < 2.2e-16
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "orange") +
    stat_smooth(method = "lm", formula = y ~ ns(x, knots = knots.val.auto),
        lty = 1, se = F)

Model 2 (horsepower~mpg)

ns.spline.auto2 <- lm(horsepower ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A)
summary(ns.spline.auto2)
## 
## Call:
## lm(formula = horsepower ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.461 -10.731  -1.433   8.759  95.650 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       218.000      6.786   32.13   <2e-16 ***
## ns(mpg, knots = knots.val.auto)1 -130.511      6.470  -20.17   <2e-16 ***
## ns(mpg, knots = knots.val.auto)2 -109.368      6.183  -17.69   <2e-16 ***
## ns(mpg, knots = knots.val.auto)3 -237.134     15.815  -14.99   <2e-16 ***
## ns(mpg, knots = knots.val.auto)4 -115.116      8.615  -13.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.62 on 387 degrees of freedom
## Multiple R-squared:  0.7429, Adjusted R-squared:  0.7403 
## F-statistic: 279.6 on 4 and 387 DF,  p-value: < 2.2e-16
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "orange") +
    stat_smooth(method = "lm", formula = y ~ ns(x, knots = knots.val.auto),
        lty = 1, se = F)
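
The practical difference between the two bases shows up outside the data: a natural spline is constrained to be linear beyond the boundary knots, whereas the B-spline fit keeps its cubic behavior there. A sketch (the mpg values 5 and 50 lie outside the observed range and are chosen purely for illustration; predict() will warn for the B-spline case):

new.mpg <- data.frame(mpg = c(5, 50))  # outside the observed mpg range
predict(spline.mod.auto2, newdata = new.mpg)  # cubic beyond the boundary knots
predict(ns.spline.auto2, newdata = new.mpg)  # linear beyond the boundary knots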

3.8. Smoothing Spline

A. Model origin~mpg

ss.auto1 <- with(data = psd.dt.A, smooth.spline(mpg, origin))
ss.auto1
## Call:
## smooth.spline(x = mpg, y = origin)
## 
## Smoothing Parameter  spar= 0.6891065  lambda= 0.0002244654 (11 iterations)
## Equivalent Degrees of Freedom (Df): 12.93154
## Penalized Criterion (RSS): 61.61513
## GCV: 0.4385944
pred.ss1 <- broom::augment(ss.auto1)
ggplot(pred.ss1, aes(x = x, y = y)) + geom_point(color = "orange") + geom_line(aes(y = .fitted),
    col = "blue", lty = 1) + xlab("mpg") + ylab("origin")

B. Model horsepower~mpg

ss.auto2 <- with(data = psd.dt.A, smooth.spline(mpg, horsepower))
ss.auto2
## Call:
## smooth.spline(x = mpg, y = horsepower)
## 
## Smoothing Parameter  spar= 0.8528728  lambda= 0.003422279 (12 iterations)
## Equivalent Degrees of Freedom (Df): 6.989153
## Penalized Criterion (RSS): 35962.53
## GCV: 386.3516
pred.ss2 <- broom::augment(ss.auto2)
ggplot(pred.ss2, aes(x = x, y = y)) + geom_point(color = "orange") + geom_line(aes(y = .fitted),
    col = "blue", lty = 1) + xlab("mpg") + ylab("horsepower")

3.9. LOESS Function

A. Model origin~mpg

loess.auto1 <- loess(origin ~ mpg, data = psd.dt.A)
summary(loess.auto1)
## Call:
## loess(formula = origin ~ mpg, data = psd.dt.A)
## 
## Number of Observations: 392 
## Equivalent Number of Parameters: 4.8 
## Residual Standard Error: 0.6627 
## Trace of smoother matrix: 5.24  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "black") +
    stat_smooth(method = "loess", formula = y ~ x, col = "red", show.legend = T)

B. Model horsepower~mpg

loess.auto2 <- loess(horsepower ~ mpg, data = psd.dt.A)
summary(loess.auto2)
## Call:
## loess(formula = horsepower ~ mpg, data = psd.dt.A)
## 
## Number of Observations: 392 
## Equivalent Number of Parameters: 4.8 
## Residual Standard Error: 19.58 
## Trace of smoother matrix: 5.24  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "black") +
    stat_smooth(method = "loess", formula = y ~ x, col = "red", show.legend = T)

3.10. Model Comparison (II)

A. Model origin~mpg

ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "black") +
    stat_smooth(method = "lm", formula = y ~ poly(x, 3, raw = T), col = "blue",
        lwd = 1.2, se = F) + stat_smooth(method = "lm", formula = y ~ cut(x,
    12), col = "green", lwd = 1.2, se = F) + stat_smooth(method = "lm",
    formula = y ~ bs(x, knots = knots.val.auto), col = "yellow", lwd = 1.2,
    se = F) + stat_smooth(method = "lm", formula = y ~ ns(x, knots = knots.val.auto),
    col = "darkgreen", lwd = 1.2, se = F) + stat_smooth(method = "loess",
    formula = y ~ x, col = "orange", lwd = 1.2, se = F) + stat_smooth(method = "lm",
    formula = y ~ x, col = "red", lwd = 1.2, se = F)

Legend:
Red: Linear Model
Blue: Polynomial Model
Green: Step Function Model
Yellow: Spline Regression Model
Dark green: Natural Spline Regression Model
Orange: LOESS Function Model

model.auto2a <- data.frame(Model = c("Linear", "Polynomial (3)", "Step-Function (12)",
    "Spline Regression", "Natural Spline Regression", "LOESS Function"),
    MSE = c(MSE(predict(lin.mod.1), psd.dt.A$origin), MSE(predict(pol.mod.1),
        psd.dt.A$origin), MSE(predict(step.mod.1), psd.dt.A$origin), MSE(predict(spline.mod.auto1),
        psd.dt.A$origin), MSE(predict(ns.spline.auto1), psd.dt.A$origin),
        MSE(predict(loess.auto1), psd.dt.A$origin)))
knitr::kable(model.auto2a)
Model MSE
Linear 0.4404478
Polynomial (3) 0.4351511
Step-Function (12) 0.4134617
Spline Regression 0.4341514
Natural Spline Regression 0.4344177
LOESS Function 0.4328007

Best model for the Auto origin~mpg data: step function with 12 breaks (smallest MSE).

B. Model horsepower~mpg

ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "black") +
    stat_smooth(method = "lm", formula = y ~ bs(x, knots = knots.val.auto),
        col = "yellow", lwd = 4, se = F) + stat_smooth(method = "loess",
    formula = y ~ x, col = "orange", lwd = 2.5, se = F) + stat_smooth(method = "lm",
    formula = y ~ cut(x, 10), col = "green", lwd = 1.2, se = F) + stat_smooth(method = "lm",
    formula = y ~ poly(x, 8, raw = T), col = "blue", lwd = 1.2, se = F) +
    stat_smooth(method = "lm", formula = y ~ ns(x, knots = knots.val.auto),
        col = "darkgreen", lwd = 1.2, se = F) + stat_smooth(method = "lm",
    formula = y ~ x, col = "red", lwd = 1.2, se = F)

Legend:
Red: Linear Model
Blue: Polynomial Model
Green: Step Function Model
Yellow: Spline Regression Model
Dark green: Natural Spline Regression Model
Orange: LOESS Function Model

model.auto2b <- data.frame(Model = c("Linear", "Polynomial (8)", "Step-Function (10)",
    "Spline Regression", "Natural Spline Regression", "LOESS Function"),
    MSE = c(MSE(predict(lin.mod.2), psd.dt.A$horsepower), MSE(predict(pol.mod.2),
        psd.dt.A$horsepower), MSE(predict(step.mod.2), psd.dt.A$horsepower),
        MSE(predict(spline.mod.auto2), psd.dt.A$horsepower), MSE(predict(ns.spline.auto2),
            psd.dt.A$horsepower), MSE(predict(loess.auto2), psd.dt.A$horsepower)))
knitr::kable(model.auto2b)
Model MSE
Linear 582.3257
Polynomial (8) 374.0890
Step-Function (10) 395.5813
Spline Regression 375.8340
Natural Spline Regression 379.8993
LOESS Function 377.9225

Best model for the Auto horsepower~mpg data: the degree-8 polynomial (smallest MSE).

4. Cross-Validation Evaluation of the Auto Data Models

4.1. Initialization

set.seed(92)
cross.val1 <- vfold_cv(psd.dt.A, v = 10, strata = "origin")
cross.val2 <- vfold_cv(psd.dt.A, v = 10, strata = "horsepower")
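
Each split object stores the row indices of its analysis set; the folds can be inspected with rsample's accessors or, as in the loops below, addressed via x$in_id (a quick sketch on the first fold):

dim(analysis(cross.val1$splits[[1]]))  # training part, roughly 9/10 of the 392 rows
dim(assessment(cross.val1$splits[[1]]))  # held-out fold, roughly 1/10 of the rows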

4.2. Model (origin~mpg)

4.2.1. Linear

lin.met1 <- map_dfr(cross.val1$splits, function(x) {
    mod <- lm(origin ~ mpg, data = psd.dt.A[x$in_id, ])
    pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
    truth <- psd.dt.A[-x$in_id, ]$origin
    rmse <- mlr3measures::rmse(truth = truth, response = pred)
    mae <- mlr3measures::mae(truth = truth, response = pred)
    metric <- c(rmse, mae)
    names(metric) <- c("rmse", "mae")
    return(metric)
})
m_lin.met1 <- colMeans(lin.met1)
m_lin.met1
##      rmse       mae 
## 0.6634798 0.5142883

4.2.2. Polynomial (3)

pol.met1 <- map_dfr(cross.val1$splits, function(x) {
    mod <- lm(origin ~ poly(mpg, 3, raw = T), data = psd.dt.A[x$in_id,
        ])
    pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
    truth <- psd.dt.A[-x$in_id, ]$origin
    rmse <- mlr3measures::rmse(truth = truth, response = pred)
    mae <- mlr3measures::mae(truth = truth, response = pred)
    metric <- c(rmse, mae)
    names(metric) <- c("rmse", "mae")
    return(metric)
})
m_pol.met1 <- colMeans(pol.met1)
m_pol.met1
##      rmse       mae 
## 0.6621524 0.5017377

4.2.3. Step-Function

breaks <- 3:12
best.step1 <- map_dfr(breaks, function(i) {
    step.met1 <- map_dfr(cross.val1$splits, function(x) {
        training <- psd.dt.A[x$in_id, ]
        training$mpg <- cut(training$mpg, i)
        mod <- lm(origin ~ mpg, data = training)
        labs_x <- levels(mod$model[, 2])
        labs_x_breaks <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1",
            labs_x)), upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1",
            labs_x)))
        testing <- psd.dt.A[-x$in_id, ]
        newdt <- cut(testing$mpg, c(labs_x_breaks[1, 1], labs_x_breaks[,
            2]))
        pred <- predict(mod, newdata = list(mpg = newdt))
        truth <- testing$origin
        data_eval <- na.omit(data.frame(truth, pred))
        rmse <- mlr3measures::rmse(truth = data_eval$truth, response = data_eval$pred)
        mae <- mlr3measures::mae(truth = data_eval$truth, response = data_eval$pred)
        metric <- c(rmse, mae)
        names(metric) <- c("rmse", "mae")
        return(metric)
    })
    # average the metrics over the 10 folds
    m_step.met1 <- colMeans(step.met1)
    m_step.met1
})

best.step1 <- cbind(breaks = breaks, best.step1)
# display the results for all break counts
best.step1 %>%
    slice_min(rmse, n = 5)
##   breaks      rmse       mae
## 1     11 0.6625326 0.4967856
## 2      7 0.6638193 0.5014978
## 3     12 0.6652098 0.4939470
## 4     10 0.6728325 0.5025272
## 5      6 0.6728480 0.5041022
best.step1 %>%
    slice_min(mae, n = 5)
##   breaks      rmse       mae
## 1     12 0.6652098 0.4939470
## 2     11 0.6625326 0.4967856
## 3      7 0.6638193 0.5014978
## 4     10 0.6728325 0.5025272
## 5      8 0.6735929 0.5040323

Chosen number of breaks for the best step function = 11
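
The two sub() calls above recover the numeric break points from the interval labels that cut() produces; a standalone sketch on one label makes the regular expressions concrete (lab is an illustrative value):

lab <- "(12.3,15.6]"  # the label format produced by cut()
as.numeric(sub("\\((.+),.*", "\\1", lab))  # lower bound: 12.3
as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", lab))  # upper bound: 15.6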

4.2.4. Spline Regression

spline.met1 <- map_dfr(cross.val1$splits, function(x) {
    mod <- lm(origin ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A[x$in_id,
        ])
    pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
    truth <- psd.dt.A[-x$in_id, ]$origin
    rmse <- mlr3measures::rmse(truth = truth, response = pred)
    mae <- mlr3measures::mae(truth = truth, response = pred)
    metric <- c(rmse, mae)
    names(metric) <- c("rmse", "mae")
    return(metric)
})
m_spline.met1 <- colMeans(spline.met1)
m_spline.met1
##      rmse       mae 
## 0.6643972 0.4957557

4.2.5. Natural Spline Regression

nspl.met1 <- map_dfr(cross.val1$splits, function(x) {
    mod <- lm(origin ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A[x$in_id,
        ])
    pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
    truth <- psd.dt.A[-x$in_id, ]$origin
    rmse <- mlr3measures::rmse(truth = truth, response = pred)
    mae <- mlr3measures::mae(truth = truth, response = pred)
    metric <- c(rmse, mae)
    names(metric) <- c("rmse", "mae")
    return(metric)
})
m_nspl.met1 <- colMeans(nspl.met1)
m_nspl.met1
##      rmse       mae 
## 0.6619944 0.4936618

4.2.6. LOESS Function

span <- seq(0.1, 1, length.out = 91)
best.loess1 <- map_dfr(span, function(i) {
    loess.met1 <- map_dfr(cross.val1$splits, function(x) {
        mod <- loess(origin ~ mpg, span = i, data = psd.dt.A[x$in_id, ])
        pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
        truth <- psd.dt.A[-x$in_id, ]$origin
        data_eval <- na.omit(data.frame(pred = pred, truth = truth))
        rmse <- mlr3measures::rmse(truth = data_eval$truth, response = data_eval$pred)
        mae <- mlr3measures::mae(truth = data_eval$truth, response = data_eval$pred)
        metric <- c(rmse, mae)
        names(metric) <- c("rmse", "mae")
        return(metric)
    })

    # average the metrics over the 10 folds
    m_loess.met1 <- colMeans(loess.met1)
    m_loess.met1
})
best.loess1 <- cbind(span = span, best.loess1)
# display the results for all spans
best.loess1 %>%
    slice_min(rmse, n = 10)
##    span      rmse       mae
## 1  0.39 0.6618308 0.4941327
## 2  0.38 0.6618467 0.4942368
## 3  0.48 0.6618643 0.4945020
## 4  0.44 0.6618700 0.4941280
## 5  0.45 0.6619338 0.4941375
## 6  0.43 0.6619899 0.4941977
## 7  0.46 0.6619976 0.4943916
## 8  0.51 0.6620351 0.4947503
## 9  0.52 0.6621044 0.4947526
## 10 0.47 0.6621310 0.4947993
best.loess1 %>%
    slice_min(mae, n = 10)
##    span      rmse       mae
## 1  0.10 0.6765811 0.4913830
## 2  0.16 0.6686370 0.4929906
## 3  0.17 0.6657005 0.4930809
## 4  0.18 0.6650503 0.4931812
## 5  0.11 0.6761704 0.4932793
## 6  0.19 0.6637576 0.4937502
## 7  0.44 0.6618700 0.4941280
## 8  0.39 0.6618308 0.4941327
## 9  0.25 0.6630546 0.4941343
## 10 0.24 0.6621407 0.4941345

Best LOESS by rmse: span 0.39
Best LOESS by mae: span 0.10
Chosen span (a compromise that ranks well on both metrics): 0.44

4.2.7. Model Comparison

df.cv1 <- rbind(m_lin.met1, m_pol.met1, best.step1[9, 2:3], m_spline.met1,
    m_nspl.met1, best.loess1[35, -1])
rownames(df.cv1) <- c("Linear", "Polynomial (3)", "Step-Function (11)",
    "Spline Regression", "Natural Spline Regression", "LOESS Function (0.44)")
knitr::kable(df.cv1)
rmse mae
Linear 0.6634798 0.5142883
Polynomial (3) 0.6621524 0.5017377
Step-Function (11) 0.6625326 0.4967856
Spline Regression 0.6643972 0.4957557
Natural Spline Regression 0.6619944 0.4936618
LOESS Function (0.44) 0.6618700 0.4941280

Based on 10-fold cross-validation, the LOESS model with span = 0.44 is the best model for the origin~mpg relationship, having a smaller rmse than the five other models, while the natural spline regression is the best model by mae.

4.3. Model (horsepower~mpg)

4.3.1. Linear

lin.met2 <- map_dfr(cross.val2$splits, function(x) {
    mod <- lm(horsepower ~ mpg, data = psd.dt.A[x$in_id, ])
    pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
    truth <- psd.dt.A[-x$in_id, ]$horsepower
    rmse <- mlr3measures::rmse(truth = truth, response = pred)
    mae <- mlr3measures::mae(truth = truth, response = pred)
    metric <- c(rmse, mae)
    names(metric) <- c("rmse", "mae")
    return(metric)
})
m_lin.met2 <- colMeans(lin.met2)
m_lin.met2
##     rmse      mae 
## 24.05804 18.51148

4.3.2. Polynomial (8)

pol.met2 <- map_dfr(cross.val2$splits, function(x) {
    mod <- lm(horsepower ~ poly(mpg, 8, raw = T), data = psd.dt.A[x$in_id,
        ])
    pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
    truth <- psd.dt.A[-x$in_id, ]$horsepower
    rmse <- mlr3measures::rmse(truth = truth, response = pred)
    mae <- mlr3measures::mae(truth = truth, response = pred)
    metric <- c(rmse, mae)
    names(metric) <- c("rmse", "mae")
    return(metric)
})
m_pol.met2 <- colMeans(pol.met2)
m_pol.met2
##     rmse      mae 
## 19.85886 14.21747

4.3.3. Step Function

breaks <- 3:12
best.step2 <- map_dfr(breaks, function(i) {
    step.met2 <- map_dfr(cross.val2$splits, function(x) {
        training <- psd.dt.A[x$in_id, ]
        training$mpg <- cut(training$mpg, i)
        mod <- lm(horsepower ~ mpg, data = training)
        labs_x <- levels(mod$model[, 2])
        labs_x_breaks <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1",
            labs_x)), upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1",
            labs_x)))
        testing <- psd.dt.A[-x$in_id, ]
        newdt <- cut(testing$mpg, c(labs_x_breaks[1, 1], labs_x_breaks[,
            2]))
        pred <- predict(mod, newdata = list(mpg = newdt))
        truth <- testing$horsepower
        data_eval <- na.omit(data.frame(truth, pred))
        rmse <- mlr3measures::rmse(truth = data_eval$truth, response = data_eval$pred)
        mae <- mlr3measures::mae(truth = data_eval$truth, response = data_eval$pred)
        metric <- c(rmse, mae)
        names(metric) <- c("rmse", "mae")
        return(metric)
    })
    # average the metrics over the 10 folds
    m_step.met2 <- colMeans(step.met2)
    m_step.met2
})

best.step2 <- cbind(breaks = breaks, best.step2)
# display the results for all break counts
best.step2 %>%
    slice_min(rmse, n = 5)
##   breaks     rmse      mae
## 1     10 20.28387 14.57820
## 2     12 20.35149 14.63873
## 3      9 20.81224 14.63435
## 4     11 21.10692 15.43080
## 5      7 21.23988 16.01389
best.step2 %>%
    slice_min(mae, n = 5)
##   breaks     rmse      mae
## 1     10 20.28387 14.57820
## 2      9 20.81224 14.63435
## 3     12 20.35149 14.63873
## 4      5 21.36285 15.41345
## 5     11 21.10692 15.43080

Chosen number of breaks for the best step function = 10

4.3.4. Spline Regression

spline.met2 <- map_dfr(cross.val2$splits, function(x) {
    mod <- lm(horsepower ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A[x$in_id,
        ])
    pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
    truth <- psd.dt.A[-x$in_id, ]$horsepower
    rmse <- mlr3measures::rmse(truth = truth, response = pred)
    mae <- mlr3measures::mae(truth = truth, response = pred)
    metric <- c(rmse, mae)
    names(metric) <- c("rmse", "mae")
    return(metric)
})
m_spline.met2 <- colMeans(spline.met2)
m_spline.met2
##     rmse      mae 
## 19.39506 14.12607

4.3.5. Natural Spline Regression

nspl.met2 <- map_dfr(cross.val2$splits, function(x) {
    mod <- lm(horsepower ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A[x$in_id,
        ])
    pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
    truth <- psd.dt.A[-x$in_id, ]$horsepower
    rmse <- mlr3measures::rmse(truth = truth, response = pred)
    mae <- mlr3measures::mae(truth = truth, response = pred)
    metric <- c(rmse, mae)
    names(metric) <- c("rmse", "mae")
    return(metric)
})
m_nspl.met2 <- colMeans(nspl.met2)
m_nspl.met2
##     rmse      mae 
## 19.33404 14.18485

4.3.6. LOESS Function

span <- seq(0.1, 1, length.out = 91)
best.loess2 <- map_dfr(span, function(i) {
    loess.met2 <- map_dfr(cross.val2$splits, function(x) {
        mod <- loess(horsepower ~ mpg, span = i, data = psd.dt.A[x$in_id,
            ])
        pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
        truth <- psd.dt.A[-x$in_id, ]$horsepower
        data_eval <- na.omit(data.frame(pred = pred, truth = truth))
        rmse <- mlr3measures::rmse(truth = data_eval$truth, response = data_eval$pred)
        mae <- mlr3measures::mae(truth = data_eval$truth, response = data_eval$pred)
        metric <- c(rmse, mae)
        names(metric) <- c("rmse", "mae")
        return(metric)
    })
    # average the metrics over the 10 folds
    m_loess.met2 <- colMeans(loess.met2)
    m_loess.met2
})
best.loess2 <- cbind(span = span, best.loess2)
# display the results for all spans
best.loess2 %>%
    slice_min(rmse, n = 5)
##   span     rmse      mae
## 1 0.58 19.32379 13.77316
## 2 0.60 19.32434 13.77218
## 3 0.61 19.32438 13.77701
## 4 0.56 19.32621 13.77059
## 5 0.53 19.32786 13.77461
best.loess2 %>%
    slice_min(mae, n = 5)
##   span     rmse      mae
## 1 0.55 19.33162 13.76996
## 2 0.56 19.32621 13.77059
## 3 0.60 19.32434 13.77218
## 4 0.54 19.33110 13.77222
## 5 0.58 19.32379 13.77316

Best LOESS by rmse: span 0.58
Best LOESS by mae: span 0.55
Chosen span (a compromise that ranks well on both metrics): 0.56

4.3.7. Model Comparison

df.cv2 <- rbind(m_lin.met2, m_pol.met2, best.step2[8, 2:3], m_spline.met2,
    m_nspl.met2, best.loess2[47, -1])
rownames(df.cv2) <- c("Linear", "Polynomial (8)", "Step-Function (10)",
    "Spline Regression", "Natural Spline Regression", "LOESS Function (0.56)")
knitr::kable(df.cv2)
rmse mae
Linear 24.05804 18.51148
Polynomial (8) 19.85886 14.21747
Step-Function (10) 20.28387 14.57820
Spline Regression 19.39506 14.12607
Natural Spline Regression 19.33404 14.18485
LOESS Function (0.56) 19.32621 13.77059

Based on 10-fold cross-validation, the LOESS model with span = 0.56 is the best model for the horsepower~mpg relationship, having the smallest rmse and mae of the six models.
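
For downstream use, the selected model would then be refit on the full data (a closing sketch; loess.final is an illustrative name):

loess.final <- loess(horsepower ~ mpg, span = 0.56, data = psd.dt.A)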