1. Introduction
1.1. Scope of Material
- Nonlinear Regression I
Modeling of nonlinear data, covering:
- Linear Regression
- Polynomial Regression
- Step Function Regression
- Nonlinear Regression II
Smoothing of nonlinear data, covering:
- Piecewise Cubic Polynomial
- Spline Regression
- Smoothing Spline
- LOESS Function
1.2. Data Description
Two datasets are used:
1. a simulated dataset of 1000 observations drawn from a normal distribution with mean 3 and standard deviation 2 (the sd argument of rnorm in the code below).
2. the Auto dataset from library(ISLR).
The Auto dataset contains 392 observations and 9 variables, with the following types:
1. mpg (numeric)
2. cylinders (numeric)
3. displacement (numeric)
4. horsepower (numeric)
5. weight (numeric)
6. acceleration (numeric)
7. year (numeric)
8. origin (numeric)
9. name (categorical (Factor))
In this nonlinear regression material, the variables used are mpg (fuel consumption of the car, miles/gallon), horsepower (engine power, hp), and origin (manufacturer origin):
Origin 1: America
Origin 2: Europe
Origin 3: Japan
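For readability in plots, the numeric codes can be mapped to labels, as in the sketch below (assuming ISLR's Auto is already loaded; the analyses in this document keep origin numeric):
# Hypothetical helper, not used later: label the origin codes
origin.lab <- factor(Auto$origin, levels = 1:3, labels = c("America", "Europe", "Japan"))
table(origin.lab)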
1.3. Packages
- Packages required for the analysis:
ISLR
tidyverse
ggplot2
dplyr
purrr
rsample
splines
(broom, mlr3measures, and knitr are also used later via the :: operator)
- Custom functions:
- MSE (Mean Squared Error)
MSE <- function(pred, act) {
mean((pred - act)^2)
}
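A quick usage sketch on toy data (the variables here are illustrative only, not part of the analysis):
# Toy example: in-sample MSE of a simple linear fit
x.toy <- 1:10
y.toy <- 2 * x.toy + rnorm(10)
fit.toy <- lm(y.toy ~ x.toy)
MSE(predict(fit.toy), y.toy)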
2. Nonlinear Modeling of Simulated Data
2.1. Data Initialization
set.seed(92)
psd.x.tr <- rnorm(1000, 3, 2)
err.tr <- rnorm(1000)
psd.y.tr <- 3 - 2 * psd.x.tr + psd.x.tr^2 + err.tr
plot(psd.x.tr, psd.y.tr)
2.2. Linear Model
lin.mod.tr <- lm(psd.y.tr ~ psd.x.tr)
plot(psd.x.tr, psd.y.tr)
lines(psd.x.tr, lin.mod.tr$fitted.values, col = "red")
summary(lin.mod.tr)
##
## Call:
## lm(formula = psd.y.tr ~ psd.x.tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.567 -3.629 -1.902 1.461 53.266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.33828 0.32640 -4.10 4.46e-05 ***
## psd.x.tr 3.84551 0.09153 42.01 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.895 on 998 degrees of freedom
## Multiple R-squared: 0.6388, Adjusted R-squared: 0.6384
## F-statistic: 1765 on 1 and 998 DF, p-value: < 2.2e-16
2.3. Polynomial Model
pol.mod.tr <- lm(psd.y.tr ~ poly(psd.x.tr, 2, raw = T))
ix.tr <- sort(psd.x.tr, index.return = T)$ix
plot(psd.x.tr, psd.y.tr)
lines(psd.x.tr[ix.tr], pol.mod.tr$fitted.values[ix.tr], col = "blue", lwd = 2)
summary(pol.mod.tr)
##
## Call:
## lm(formula = psd.y.tr ~ poly(psd.x.tr, 2, raw = T))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3029 -0.6866 -0.0281 0.6954 3.1818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.043390 0.061876 49.19 <2e-16 ***
## poly(psd.x.tr, 2, raw = T)1 -1.972414 0.036241 -54.42 <2e-16 ***
## poly(psd.x.tr, 2, raw = T)2 0.994696 0.005565 178.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.026 on 997 degrees of freedom
## Multiple R-squared: 0.9891, Adjusted R-squared: 0.989
## F-statistic: 4.51e+04 on 2 and 997 DF, p-value: < 2.2e-16
2.4. Step Function Model
# Find the range of the data
range(psd.x.tr)
## [1] -4.579933 9.162481
# Fit a step function with 6 bins
step.mod.tr <- lm(psd.y.tr ~ cut(psd.x.tr, 6))
plot(psd.x.tr, psd.y.tr)
lines(psd.x.tr[ix.tr], step.mod.tr$fitted.values[ix.tr], col = "darkgreen")
summary(step.mod.tr)
##
## Call:
## lm(formula = psd.y.tr ~ cut(psd.x.tr, 6))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0647 -2.2271 -0.2512 1.4305 23.1358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.2047 1.8301 12.133 < 2e-16 ***
## cut(psd.x.tr, 6)(-2.29,0.000871] -16.9789 1.8782 -9.040 < 2e-16 ***
## cut(psd.x.tr, 6)(0.000871,2.29] -19.6694 1.8427 -10.674 < 2e-16 ***
## cut(psd.x.tr, 6)(2.29,4.58] -14.2925 1.8389 -7.772 1.92e-14 ***
## cut(psd.x.tr, 6)(4.58,6.87] -0.3025 1.8490 -0.164 0.87
## cut(psd.x.tr, 6)(6.87,9.18] 24.6640 1.9658 12.546 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.66 on 994 degrees of freedom
## Multiple R-squared: 0.8613, Adjusted R-squared: 0.8606
## F-statistic: 1235 on 5 and 994 DF, p-value: < 2.2e-16
2.5. Model Comparison (I)
model.tr <- data.frame(Model = c("Linear", "Polynomial (2)", "Step (6)"),
AIC = c(AIC(lin.mod.tr), AIC(pol.mod.tr), AIC(step.mod.tr)), MSE = c(MSE(predict(lin.mod.tr),
psd.y.tr), MSE(predict(pol.mod.tr), psd.y.tr), MSE(predict(step.mod.tr),
psd.y.tr)))
knitr::kable(model.tr)
| Model | AIC | MSE |
|---|---|---|
| Linear | 6390.013 | 34.679046 |
| Polynomial (2) | 2894.280 | 1.049593 |
| Step (6) | 5440.859 | 13.316209 |
The best model so far (Nonlinear Regression I material) is selected by the lowest AIC and MSE. The three fits confirm that the relationship is not linear: the best model for the simulated data is the degree-2 polynomial, which has the smallest AIC and MSE of the three (2894.280 and 1.049593). Its estimated coefficients (3.04, -1.97, 0.99) also recover the true generating values (3, -2, 1).
2.6. Piecewise Cubic Polynomial Modeling
# Split the data into two parts at the mean (mean = 3). The two cubics
# below are fit independently, so continuity at the breakpoint is not
# enforced.
psd.dt.tr <- cbind(psd.x.tr, psd.y.tr)
psd.dt.tr1 <- psd.dt.tr[psd.x.tr <= 3, ]
psd.dt.tr2 <- psd.dt.tr[psd.x.tr > 3, ]
# Plotting
plot(psd.x.tr, psd.y.tr, main = "Piecewise Cubic Polynomial\nData Bangkitan")
abline(v = 3, col = "orange", lty = 3)
cub.mod.tr1 <- lm(psd.dt.tr1[, 2] ~ psd.dt.tr1[, 1] + I(psd.dt.tr1[, 1]^2) +
I(psd.dt.tr1[, 1]^3))
ix.tr1 <- sort(psd.dt.tr1[, 1], index.return = T)$ix
lines(psd.dt.tr1[ix.tr1, 1], cub.mod.tr1$fitted.values[ix.tr1], col = "red",
lwd = 2)
cub.mod.tr2 <- lm(psd.dt.tr2[, 2] ~ psd.dt.tr2[, 1] + I(psd.dt.tr2[, 1]^2) +
I(psd.dt.tr2[, 1]^3))
ix.tr2 <- sort(psd.dt.tr2[, 1], index.return = T)$ix
lines(psd.dt.tr2[ix.tr2, 1], cub.mod.tr2$fitted.values[ix.tr2], col = "red",
lwd = 2)
2.7. Spline Regression
1. Determine the number of basis functions and knots (computed automatically)
psd.dt.tr <- as.data.frame(psd.dt.tr)
knots.val.tr <- attr(bs(psd.dt.tr$psd.x.tr, df = 6), "knots")
knots.val.tr
## 25% 50% 75%
## 1.578083 2.879866 4.343768
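As a sanity check (a sketch): for cubic B-splines without an intercept, df = 6 implies df - 3 = 3 interior knots placed at the quartiles, so specifying df = 6 or these knots explicitly builds the same basis matrix:
# Compare the two equivalent basis specifications
b.df <- bs(psd.dt.tr$psd.x.tr, df = 6)
b.kn <- bs(psd.dt.tr$psd.x.tr, knots = knots.val.tr)
all.equal(c(b.df), c(b.kn))  # TRUE (comparing values, ignoring attributes)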
2. Fit the spline regression model (df = number of segments + 2)
spline.mod.tr <- lm(psd.y.tr ~ bs(psd.x.tr, knots = knots.val.tr), data = psd.dt.tr)
summary(spline.mod.tr)
##
## Call:
## lm(formula = psd.y.tr ~ bs(psd.x.tr, knots = knots.val.tr), data = psd.dt.tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3278 -0.6843 -0.0170 0.7025 3.1352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.0164 0.8785 38.720 < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)1 -24.6332 1.2703 -19.392 < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)2 -35.8056 0.8389 -42.683 < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)3 -28.9695 0.9021 -32.114 < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)4 -15.2662 0.8867 -17.217 < 2e-16 ***
## bs(psd.x.tr, knots = knots.val.tr)5 7.6464 1.0076 7.589 7.43e-14 ***
## bs(psd.x.tr, knots = knots.val.tr)6 34.9890 1.0285 34.018 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.026 on 993 degrees of freedom
## Multiple R-squared: 0.9891, Adjusted R-squared: 0.9891
## F-statistic: 1.504e+04 on 6 and 993 DF, p-value: < 2.2e-16
ggplot(psd.dt.tr, aes(x = psd.x.tr, y = psd.y.tr)) + geom_point(color = "black") +
stat_smooth(method = "lm", formula = y ~ bs(x, knots = knots.val.tr),
lty = 1, se = F)
2.8. Smoothing Spline
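# smooth.spline chooses the penalty lambda by generalized cross-validation
# (GCV); all.knots = TRUE places a knot at every distinct x value, so the
# penalty alone controls the smoothness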
ss.tr <- smooth.spline(psd.x.tr, psd.y.tr, all.knots = T)
ss.mod.tr <- with(data = psd.dt.tr, smooth.spline(psd.x.tr, psd.y.tr))
ss.mod.tr
## Call:
## smooth.spline(x = psd.x.tr, y = psd.y.tr)
##
## Smoothing Parameter spar= 0.9044535 lambda= 0.0004273051 (14 iterations)
## Equivalent Degrees of Freedom (Df): 12.86151
## Penalized Criterion (RSS): 1042.428
## GCV: 191.3561
plot(psd.x.tr, psd.y.tr, pch = 19, main = "Smoothing Spline Data Bangkitan")
lines(ss.tr, col = "red", lwd = 1.5)
2.9. LOESS Function
loess.tr <- loess(formula = psd.y.tr ~ psd.x.tr, data = psd.dt.tr)
summary(loess.tr)
## Call:
## loess(formula = psd.y.tr ~ psd.x.tr, data = psd.dt.tr)
##
## Number of Observations: 1000
## Equivalent Number of Parameters: 5.37
## Residual Standard Error: 1.027
## Trace of smoother matrix: 5.88 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
loess.span.tr <- data.frame(span = seq(0.1, 2, by = 0.1)) %>%
group_by(span) %>%
do(broom::augment(loess(psd.y.tr ~ psd.x.tr, data = psd.dt.tr, span = .$span)))
ggplot(loess.span.tr, aes(x = psd.x.tr, y = psd.y.tr)) + geom_line(aes(y = .fitted),
col = "blue", lty = 1) + facet_wrap(~span)
2.10. Exploratory Comparison of All Models
ggplot(psd.dt.tr, aes(x = psd.x.tr, y = psd.y.tr)) + geom_point(color = "black") +
stat_smooth(method = "lm", formula = y ~ x, col = "red", show.legend = T) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2, raw = T), col = "blue",
lwd = 3.5, show.legend = T) + stat_smooth(method = "lm", formula = y ~
cut(x, 6), col = "green", show.legend = T) + stat_smooth(method = "lm",
formula = y ~ bs(x, knots = knots.val.tr), col = "yellow", lwd = 2,
show.legend = T) + stat_smooth(method = "loess", formula = y ~ x, col = "orange",
show.legend = T)
Legend:
red: Linear Model
blue: Polynomial Model
green: Step Function Model
yellow: Spline Regression Model
orange: LOESS Function Model
model.tr2 <- data.frame(Model = c("Linear", "Polynomial (2)", "Step-Function (6)",
"Spline Regression", "LOESS Function"), MSE = c(MSE(predict(lin.mod.tr),
psd.y.tr), MSE(predict(pol.mod.tr), psd.y.tr), MSE(predict(step.mod.tr),
psd.y.tr), MSE(predict(spline.mod.tr), psd.y.tr), MSE(predict(loess.tr),
psd.y.tr)))
knitr::kable(model.tr2)
| Model | MSE |
|---|---|
| Linear | 34.679046 |
| Polynomial (2) | 1.049593 |
| Step-Function (6) | 13.316209 |
| Spline Regression | 1.044992 |
| LOESS Function | 1.047259 |
Best model for the simulated data: spline regression (smallest MSE), though its margin over the degree-2 polynomial and LOESS is very small.
3. Nonlinear Modeling of the Auto Data
3.1. Data Initialization
psd.dt.A <- Auto %>%
select(mpg, horsepower, origin)
str(psd.dt.A)
## 'data.frame': 392 obs. of 3 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ horsepower: num 130 165 150 150 140 198 220 215 225 190 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
knitr::kable(head(psd.dt.A))
| mpg | horsepower | origin |
|---|---|---|
| 18 | 130 | 1 |
| 15 | 165 | 1 |
| 18 | 150 | 1 |
| 16 | 150 | 1 |
| 17 | 140 | 1 |
| 15 | 198 | 1 |
3.2. Data Exploration
pairs(psd.dt.A, pch = 19)
clist <- c("darkgreen", "blue", "red")
plot(psd.dt.A$mpg, psd.dt.A$horsepower, pch = 19, col = clist[psd.dt.A$origin])
Based on this exploration, two models of the relationships between variables will be fit to the Auto data:
1. mpg (x) and origin (y)
2. mpg (x) and horsepower (y)
3.3. Linear Models
Model origin~mpg
lin.mod.1 <- lm(origin ~ mpg, data = psd.dt.A)
plot(psd.dt.A$mpg, psd.dt.A$origin, pch = 19)
lines(psd.dt.A$mpg, lin.mod.1$fitted.values, col = "red")
summary(lin.mod.1)
##
## Call:
## lm(formula = origin ~ mpg, data = psd.dt.A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48384 -0.40469 -0.08386 0.34740 1.74114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.208871 0.106520 1.961 0.0506 .
## mpg 0.058333 0.004311 13.531 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6654 on 390 degrees of freedom
## Multiple R-squared: 0.3195, Adjusted R-squared: 0.3177
## F-statistic: 183.1 on 1 and 390 DF, p-value: < 2.2e-16
Model horsepower~mpg
lin.mod.2 <- lm(horsepower ~ mpg, data = psd.dt.A)
plot(psd.dt.A$mpg, psd.dt.A$horsepower, pch = 19)
lines(psd.dt.A$mpg, lin.mod.2$fitted.values, col = "red")
summary(lin.mod.2)
##
## Call:
## lm(formula = horsepower ~ mpg, data = psd.dt.A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.892 -15.716 -2.094 13.108 96.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.4756 3.8732 50.21 <2e-16 ***
## mpg -3.8389 0.1568 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.19 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
3.4. Polynomial Models
Find the best polynomial degree for each relationship
# The search for the best polynomial degree is capped at degree 11
df.list.pol <- data.frame(df = c(rep(NA, 10)), AIC1 = c(rep(NA, 10)), MSE1 = c(rep(NA,
10)), AIC2 = c(rep(NA, 10)), MSE2 = c(rep(NA, 10)))
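# Columns: df = polynomial degree; AIC1/MSE1 belong to origin ~ mpg (j = 3),
# AIC2/MSE2 to horsepower ~ mpg (j = 2)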
for (j in 2:3) {
for (i in 2:11) {
pol.mod <- lm(psd.dt.A[, j] ~ poly(mpg, i, raw = T), data = psd.dt.A)
AIC.pol <- AIC(pol.mod)
MSE.pol <- mean((predict(pol.mod) - psd.dt.A[, j])^2)
if (j == 3) {
df.list.pol[i - 1, 1:3] <- c(i, AIC.pol, MSE.pol)
} else {
df.list.pol[i - 1, 4:5] <- c(AIC.pol, MSE.pol)
}
}
}
knitr::kable(df.list.pol)
| df | AIC1 | MSE1 | AIC2 | MSE2 |
|---|---|---|---|---|
| 2 | 797.6568 | 0.4389163 | 3485.217 | 416.7871 |
| 3 | 796.2795 | 0.4351511 | 3454.949 | 383.8523 |
| 4 | 797.3754 | 0.4341486 | 3456.813 | 383.7195 |
| 5 | 799.2506 | 0.4340104 | 3455.310 | 380.3060 |
| 6 | 800.0427 | 0.4326751 | 3455.135 | 378.2017 |
| 7 | 801.2607 | 0.4318129 | 3455.139 | 376.2801 |
| 8 | 802.3611 | 0.4308230 | 3454.849 | 374.0890 |
| 9 | 804.3474 | 0.4308080 | 3455.295 | 372.6086 |
| 10 | 803.5020 | 0.4276922 | 3456.768 | 372.1082 |
| 11 | 805.3653 | 0.4275430 | 3455.520 | 369.0375 |
Based on the loop results, the polynomial degree selected for origin~mpg is 3 and for horsepower~mpg is 8 (minimum AIC).
Model origin~mpg
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "black") +
stat_smooth(method = "lm", formula = y ~ poly(x, 3, raw = T), col = "blue",
se = F)
Model horsepower~mpg
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "black") +
stat_smooth(method = "lm", formula = y ~ poly(x, 8, raw = T), col = "blue",
se = F)
3.5. Step Function Models
Find the best number of bins for each relationship
# The search for the best number of step-function bins is capped at 12
df.list.step <- data.frame(df = c(rep(NA, 10)), AIC1 = c(rep(NA, 10)),
MSE1 = c(rep(NA, 10)), AIC2 = c(rep(NA, 10)), MSE2 = c(rep(NA, 10)))
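# Columns: df = number of bins; AIC1/MSE1 belong to origin ~ mpg (j = 3),
# AIC2/MSE2 to horsepower ~ mpg (j = 2)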
for (j in 2:3) {
for (i in 3:12) {
step.mod <- lm(psd.dt.A[, j] ~ cut(mpg, i), data = psd.dt.A)
AIC.step <- AIC(step.mod)
MSE.step <- mean((predict(step.mod) - psd.dt.A[, j])^2)
if (j == 3) {
df.list.step[i - 2, 1:3] <- c(i, AIC.step, MSE.step)
} else {
df.list.step[i - 2, 4:5] <- c(AIC.step, MSE.step)
}
}
}
knitr::kable(df.list.step)
| df | AIC1 | MSE1 | AIC2 | MSE2 |
|---|---|---|---|---|
| 3 | 825.4751 | 0.4711959 | 3722.514 | 763.5092 |
| 4 | 816.3085 | 0.4579626 | 3605.549 | 563.6549 |
| 5 | 812.0036 | 0.4506557 | 3526.845 | 458.7770 |
| 6 | 812.9059 | 0.4493955 | 3554.364 | 489.6366 |
| 7 | 800.0531 | 0.4326866 | 3521.117 | 447.5318 |
| 8 | 809.4147 | 0.4408891 | 3551.701 | 481.3840 |
| 9 | 807.4994 | 0.4365074 | 3494.485 | 413.8924 |
| 10 | 805.2416 | 0.4317918 | 3478.747 | 395.5813 |
| 11 | 802.1242 | 0.4261916 | 3499.380 | 414.8383 |
| 12 | 792.2372 | 0.4134617 | 3496.674 | 409.8881 |
Based on the loop results, the number of bins selected for origin~mpg is 12 and for horsepower~mpg is 10 (minimum AIC).
Model origin~mpg
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "black") +
stat_smooth(method = "lm", formula = y ~ cut(x, 12), col = "blue",
se = F)
Model horsepower~mpg
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "black") +
stat_smooth(method = "lm", formula = y ~ cut(x, 10), col = "blue",
se = F)
3.6. Model Comparison (I)
pol.mod.1 <- lm(origin ~ poly(mpg, 3, raw = T), data = psd.dt.A)
pol.mod.2 <- lm(horsepower ~ poly(mpg, 8, raw = T), data = psd.dt.A)
step.mod.1 <- lm(origin ~ cut(mpg, 12), data = psd.dt.A)
step.mod.2 <- lm(horsepower ~ cut(mpg, 10), data = psd.dt.A)
model.auto <- data.frame(Model = c("Linear (mpg~origin)", "Linear (mpg~horsepower)",
"Polynomial (3) (mpg~origin)", "Polynomial (8) (mpg~horsepower)", "Step (12) (mpg~origin)",
"Step (10) (mpg~horsepower)"), AIC = c(AIC(lin.mod.1), AIC(lin.mod.2),
AIC(pol.mod.1), AIC(pol.mod.2), AIC(step.mod.1), AIC(step.mod.2)),
MSE = c(MSE(predict(lin.mod.1), psd.dt.A$origin), MSE(predict(lin.mod.2),
psd.dt.A$horsepower), MSE(predict(pol.mod.1), psd.dt.A$origin),
MSE(predict(pol.mod.2), psd.dt.A$horsepower), MSE(predict(step.mod.1),
psd.dt.A$origin), MSE(predict(step.mod.2), psd.dt.A$horsepower)))
knitr::kable(model.auto)
| Model | AIC | MSE |
|---|---|---|
| Linear (mpg~origin) | 797.0222 | 0.4404478 |
| Linear (mpg~horsepower) | 3614.3235 | 582.3256764 |
| Polynomial (3) (mpg~origin) | 796.2795 | 0.4351511 |
| Polynomial (8) (mpg~horsepower) | 3454.8494 | 374.0890447 |
| Step (12) (mpg~origin) | 792.2372 | 0.4134617 |
| Step (10) (mpg~horsepower) | 3478.7475 | 395.5813018 |
Based on these results, the origin~mpg relationship is best modeled with a 12-bin step function, while horsepower~mpg is best modeled with a degree-8 polynomial, which has both the lowest AIC and the lowest MSE in the table above. These choices are provisional, since more model classes follow (Nonlinear Regression II material).
3.7. Spline Regression
Determine the number of basis functions and knots (computed automatically)
knots.val.auto <- attr(bs(psd.dt.A$mpg, df = 6), "knots")
knots.val.auto
## 25% 50% 75%
## 17.00 22.75 29.00
A. B-Spline Models
1. B-Spline Model 1 (origin~mpg)
spline.mod.auto1 <- lm(origin ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A)
summary(spline.mod.auto1)
##
## Call:
## lm(formula = origin ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39803 -0.38041 -0.02476 0.33868 1.81900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.04633 0.44800 2.336 0.0200 *
## bs(mpg, knots = knots.val.auto)1 -0.09890 0.64164 -0.154 0.8776
## bs(mpg, knots = knots.val.auto)2 -0.07956 0.42992 -0.185 0.8533
## bs(mpg, knots = knots.val.auto)3 0.57546 0.49774 1.156 0.2483
## bs(mpg, knots = knots.val.auto)4 1.14229 0.49623 2.302 0.0219 *
## bs(mpg, knots = knots.val.auto)5 1.53577 0.64019 2.399 0.0169 *
## bs(mpg, knots = knots.val.auto)6 1.30545 0.61946 2.107 0.0357 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6649 on 385 degrees of freedom
## Multiple R-squared: 0.3292, Adjusted R-squared: 0.3187
## F-statistic: 31.49 on 6 and 385 DF, p-value: < 2.2e-16
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "orange") +
stat_smooth(method = "lm", formula = y ~ bs(x, knots = knots.val.auto),
lty = 1, se = F)
2. B-Spline Model 2 (horsepower~mpg)
spline.mod.auto2 <- lm(horsepower ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A)
summary(spline.mod.auto2)
##
## Call:
## lm(formula = horsepower ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.043 -10.103 -0.818 8.075 95.778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 195.012 13.181 14.795 < 2e-16 ***
## bs(mpg, knots = knots.val.auto)1 2.792 18.879 0.148 0.882
## bs(mpg, knots = knots.val.auto)2 -80.488 12.649 -6.363 5.62e-10 ***
## bs(mpg, knots = knots.val.auto)3 -103.774 14.645 -7.086 6.62e-12 ***
## bs(mpg, knots = knots.val.auto)4 -126.115 14.600 -8.638 < 2e-16 ***
## bs(mpg, knots = knots.val.auto)5 -127.143 18.836 -6.750 5.45e-11 ***
## bs(mpg, knots = knots.val.auto)6 -141.413 18.226 -7.759 7.77e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.56 on 385 degrees of freedom
## Multiple R-squared: 0.7457, Adjusted R-squared: 0.7417
## F-statistic: 188.1 on 6 and 385 DF, p-value: < 2.2e-16
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "orange") +
stat_smooth(method = "lm", formula = y ~ bs(x, knots = knots.val.auto),
lty = 1, se = F)
B. Natural Spline
A natural spline constrains the fit to be linear beyond the boundary knots, which stabilizes the curve in the tails where data are sparse.
Model 1 (origin~mpg)
ns.spline.auto1 <- lm(origin ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A)
summary(ns.spline.auto1)
##
## Call:
## lm(formula = origin ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36121 -0.36897 -0.03639 0.34308 1.82256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9776 0.2295 4.261 2.56e-05 ***
## ns(mpg, knots = knots.val.auto)1 0.6192 0.2188 2.830 0.00489 **
## ns(mpg, knots = knots.val.auto)2 1.2993 0.2091 6.215 1.33e-09 ***
## ns(mpg, knots = knots.val.auto)3 1.4244 0.5348 2.663 0.00806 **
## ns(mpg, knots = knots.val.auto)4 1.5339 0.2913 5.265 2.33e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6633 on 387 degrees of freedom
## Multiple R-squared: 0.3288, Adjusted R-squared: 0.3218
## F-statistic: 47.39 on 4 and 387 DF, p-value: < 2.2e-16
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "orange") +
stat_smooth(method = "lm", formula = y ~ ns(x, knots = knots.val.auto),
lty = 1, se = F)
Model 2 (horsepower~mpg)
ns.spline.auto2 <- lm(horsepower ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A)
summary(ns.spline.auto2)
##
## Call:
## lm(formula = horsepower ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.461 -10.731 -1.433 8.759 95.650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 218.000 6.786 32.13 <2e-16 ***
## ns(mpg, knots = knots.val.auto)1 -130.511 6.470 -20.17 <2e-16 ***
## ns(mpg, knots = knots.val.auto)2 -109.368 6.183 -17.69 <2e-16 ***
## ns(mpg, knots = knots.val.auto)3 -237.134 15.815 -14.99 <2e-16 ***
## ns(mpg, knots = knots.val.auto)4 -115.116 8.615 -13.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.62 on 387 degrees of freedom
## Multiple R-squared: 0.7429, Adjusted R-squared: 0.7403
## F-statistic: 279.6 on 4 and 387 DF, p-value: < 2.2e-16
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "orange") +
stat_smooth(method = "lm", formula = y ~ ns(x, knots = knots.val.auto),
lty = 1, se = F)
3.8. Smoothing Spline
A. Model origin~mpg
ss.auto1 <- with(data = psd.dt.A, smooth.spline(mpg, origin))
ss.auto1
## Call:
## smooth.spline(x = mpg, y = origin)
##
## Smoothing Parameter spar= 0.6891065 lambda= 0.0002244654 (11 iterations)
## Equivalent Degrees of Freedom (Df): 12.93154
## Penalized Criterion (RSS): 61.61513
## GCV: 0.4385944
pred.ss1 <- broom::augment(ss.auto1)
ggplot(pred.ss1, aes(x = x, y = y)) + geom_point(color = "orange") + geom_line(aes(y = .fitted),
col = "blue", lty = 1) + xlab("mpg") + ylab("origin")
B. Model horsepower~mpg
ss.auto2 <- with(data = psd.dt.A, smooth.spline(mpg, horsepower))
ss.auto2
## Call:
## smooth.spline(x = mpg, y = horsepower)
##
## Smoothing Parameter spar= 0.8528728 lambda= 0.003422279 (12 iterations)
## Equivalent Degrees of Freedom (Df): 6.989153
## Penalized Criterion (RSS): 35962.53
## GCV: 386.3516
pred.ss2 <- broom::augment(ss.auto2)
ggplot(pred.ss2, aes(x = x, y = y)) + geom_point(color = "orange") + geom_line(aes(y = .fitted),
col = "blue", lty = 1) + xlab("mpg") + ylab("horsepower")
3.9. LOESS Function
A. Model origin~mpg
loess.auto1 <- loess(origin ~ mpg, data = psd.dt.A)
summary(loess.auto1)
## Call:
## loess(formula = origin ~ mpg, data = psd.dt.A)
##
## Number of Observations: 392
## Equivalent Number of Parameters: 4.8
## Residual Standard Error: 0.6627
## Trace of smoother matrix: 5.24 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "black") +
stat_smooth(method = "loess", formula = y ~ x, col = "red", show.legend = T)
B. Model horsepower~mpg
loess.auto2 <- loess(horsepower ~ mpg, data = psd.dt.A)
summary(loess.auto2)
## Call:
## loess(formula = horsepower ~ mpg, data = psd.dt.A)
##
## Number of Observations: 392
## Equivalent Number of Parameters: 4.8
## Residual Standard Error: 19.58
## Trace of smoother matrix: 5.24 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "black") +
stat_smooth(method = "loess", formula = y ~ x, col = "red", show.legend = T)
3.10. Model Comparison (II)
A. Model origin~mpg
ggplot(psd.dt.A, aes(x = mpg, y = origin)) + geom_point(color = "black") +
stat_smooth(method = "lm", formula = y ~ poly(x, 3, raw = T), col = "blue",
lwd = 1.2, se = F) + stat_smooth(method = "lm", formula = y ~ cut(x,
12), col = "green", lwd = 1.2, se = F) + stat_smooth(method = "lm",
formula = y ~ bs(x, knots = knots.val.auto), col = "yellow", lwd = 1.2,
se = F) + stat_smooth(method = "lm", formula = y ~ ns(x, knots = knots.val.auto),
col = "darkgreen", lwd = 1.2, se = F) + stat_smooth(method = "loess",
formula = y ~ x, col = "orange", lwd = 1.2, se = F) + stat_smooth(method = "lm",
formula = y ~ x, col = "red", lwd = 1.2, se = F)
Legend:
red: Linear Model
blue: Polynomial Model
green: Step Function Model
yellow: Spline Regression Model
darkgreen: Natural Spline Regression Model
orange: LOESS Function Model
model.auto2a <- data.frame(Model = c("Linear", "Polynomial (3)", "Step-Function (12)",
"Spline Regression", "Natural Spline Regression", "LOESS Function"),
MSE = c(MSE(predict(lin.mod.1), psd.dt.A$origin), MSE(predict(pol.mod.1),
psd.dt.A$origin), MSE(predict(step.mod.1), psd.dt.A$origin), MSE(predict(spline.mod.auto1),
psd.dt.A$origin), MSE(predict(ns.spline.auto1), psd.dt.A$origin),
MSE(predict(loess.auto1), psd.dt.A$origin)))
knitr::kable(model.auto2a)
| Model | MSE |
|---|---|
| Linear | 0.4404478 |
| Polynomial (3) | 0.4351511 |
| Step-Function (12) | 0.4134617 |
| Spline Regression | 0.4341514 |
| Natural Spline Regression | 0.4344177 |
| LOESS Function | 0.4328007 |
Best model for the Auto data origin~mpg: step function with 12 bins (smallest MSE).
B. Model horsepower~mpg
ggplot(psd.dt.A, aes(x = mpg, y = horsepower)) + geom_point(color = "black") +
stat_smooth(method = "lm", formula = y ~ bs(x, knots = knots.val.auto),
col = "yellow", lwd = 4, se = F) + stat_smooth(method = "loess",
formula = y ~ x, col = "orange", lwd = 2.5, se = F) + stat_smooth(method = "lm",
formula = y ~ cut(x, 10), col = "green", lwd = 1.2, se = F) + stat_smooth(method = "lm",
formula = y ~ poly(x, 8, raw = T), col = "blue", lwd = 1.2, se = F) +
stat_smooth(method = "lm", formula = y ~ ns(x, knots = knots.val.auto),
col = "darkgreen", lwd = 1.2, se = F) + stat_smooth(method = "lm",
formula = y ~ x, col = "red", lwd = 1.2, se = F)
Legend:
red: Linear Model
blue: Polynomial Model
green: Step Function Model
yellow: Spline Regression Model
darkgreen: Natural Spline Regression Model
orange: LOESS Function Model
model.auto2b <- data.frame(Model = c("Linear", "Polynomial (8)", "Step-Function (10)",
"Spline Regression", "Natural Spline Regression", "LOESS Function"),
MSE = c(MSE(predict(lin.mod.2), psd.dt.A$horsepower), MSE(predict(pol.mod.2),
psd.dt.A$horsepower), MSE(predict(step.mod.2), psd.dt.A$horsepower),
MSE(predict(spline.mod.auto2), psd.dt.A$horsepower), MSE(predict(ns.spline.auto2),
psd.dt.A$horsepower), MSE(predict(loess.auto2), psd.dt.A$horsepower)))
knitr::kable(model.auto2b)
| Model | MSE |
|---|---|
| Linear | 582.3257 |
| Polynomial (8) | 374.0890 |
| Step-Function (10) | 395.5813 |
| Spline Regression | 375.8340 |
| Natural Spline Regression | 379.8993 |
| LOESS Function | 377.9225 |
Best model for the Auto data horsepower~mpg: degree-8 polynomial (smallest MSE).
4. Cross-Validation Evaluation of the Auto Models
4.1. Initialization
set.seed(92)
cross.val1 <- vfold_cv(psd.dt.A, v = 10, strata = "origin")
cross.val2 <- vfold_cv(psd.dt.A, v = 10, strata = "horsepower")
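Each rsample split stores the analysis-set (training) row indices in its in_id element; the remaining rows form the held-out assessment fold. A quick check (a sketch, not part of the original analysis):
# Fold sizes for the first split: ~9/10 of the 392 rows are analysis rows
length(cross.val1$splits[[1]]$in_id)
nrow(psd.dt.A) - length(cross.val1$splits[[1]]$in_id)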
4.2. Model (origin~mpg)
4.2.1. Linear
lin.met1 <- map_dfr(cross.val1$splits, function(x) {
mod <- lm(origin ~ mpg, data = psd.dt.A[x$in_id, ])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$origin
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
m_lin.met1 <- colMeans(lin.met1)
m_lin.met1
## rmse mae
## 0.6634798 0.5142883
4.2.2. Polynomial (3)
pol.met1 <- map_dfr(cross.val1$splits, function(x) {
mod <- lm(origin ~ poly(mpg, 3, raw = T), data = psd.dt.A[x$in_id,
])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$origin
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
m_pol.met1 <- colMeans(pol.met1)
m_pol.met1
## rmse mae
## 0.6621524 0.5017377
4.2.3. Step-Function
breaks <- 3:12
best.step1 <- map_dfr(breaks, function(i) {
step.met1 <- map_dfr(cross.val1$splits, function(x) {
training <- psd.dt.A[x$in_id, ]
training$mpg <- cut(training$mpg, i)
mod <- lm(origin ~ mpg, data = training)
labs_x <- levels(mod$model[, 2])
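# Parse the cut() interval labels, e.g. "(9.66,12.4]", into numeric
# lower/upper break values so the test fold can be binned with the
# training fold's cut points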
labs_x_breaks <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1",
labs_x)), upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1",
labs_x)))
testing <- psd.dt.A[-x$in_id, ]
newdt <- cut(testing$mpg, c(labs_x_breaks[1, 1], labs_x_breaks[,
2]))
pred <- predict(mod, newdata = list(mpg = newdt))
truth <- testing$origin
data_eval <- na.omit(data.frame(truth, pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth, response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth, response = data_eval$pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
# average the metrics over the 10 folds
m_step.met1 <- colMeans(step.met1)
m_step.met1
})
best.step1 <- cbind(breaks = breaks, best.step1)
# show the results for all breaks
best.step1 %>%
slice_min(rmse, n = 5)
## breaks rmse mae
## 1 11 0.6625326 0.4967856
## 2 7 0.6638193 0.5014978
## 3 12 0.6652098 0.4939470
## 4 10 0.6728325 0.5025272
## 5 6 0.6728480 0.5041022
best.step1 %>%
slice_min(mae, n = 5)
## breaks rmse mae
## 1 12 0.6652098 0.4939470
## 2 11 0.6625326 0.4967856
## 3 7 0.6638193 0.5014978
## 4 10 0.6728325 0.5025272
## 5 8 0.6735929 0.5040323
Best step function selected: 11 breaks (lowest rmse).
4.2.4. Spline Regression
spline.met1 <- map_dfr(cross.val1$splits, function(x) {
mod <- lm(origin ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A[x$in_id,
])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$origin
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
m_spline.met1 <- colMeans(spline.met1)
m_spline.met1
## rmse mae
## 0.6643972 0.4957557
4.2.5. Natural Spline Regression
nspl.met1 <- map_dfr(cross.val1$splits, function(x) {
mod <- lm(origin ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A[x$in_id,
])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$origin
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
m_nspl.met1 <- colMeans(nspl.met1)
m_nspl.met1
## rmse mae
## 0.6619944 0.4936618
4.2.6. LOESS Function
span <- seq(0.1, 1, length.out = 91)
best.loess1 <- map_dfr(span, function(i) {
loess.met1 <- map_dfr(cross.val1$splits, function(x) {
mod <- loess(origin ~ mpg, span = i, data = psd.dt.A[x$in_id, ])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$origin
data_eval <- na.omit(data.frame(pred = pred, truth = truth))
rmse <- mlr3measures::rmse(truth = data_eval$truth, response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth, response = data_eval$pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
# average the metrics over the 10 folds
m_loess.met1 <- colMeans(loess.met1)
m_loess.met1
})
best.loess1 <- cbind(span = span, best.loess1)
# show the results for all spans
best.loess1 %>%
slice_min(rmse, n = 10)
## span rmse mae
## 1 0.39 0.6618308 0.4941327
## 2 0.38 0.6618467 0.4942368
## 3 0.48 0.6618643 0.4945020
## 4 0.44 0.6618700 0.4941280
## 5 0.45 0.6619338 0.4941375
## 6 0.43 0.6619899 0.4941977
## 7 0.46 0.6619976 0.4943916
## 8 0.51 0.6620351 0.4947503
## 9 0.52 0.6621044 0.4947526
## 10 0.47 0.6621310 0.4947993
best.loess1 %>%
slice_min(mae, n = 10)
## span rmse mae
## 1 0.10 0.6765811 0.4913830
## 2 0.16 0.6686370 0.4929906
## 3 0.17 0.6657005 0.4930809
## 4 0.18 0.6650503 0.4931812
## 5 0.11 0.6761704 0.4932793
## 6 0.19 0.6637576 0.4937502
## 7 0.44 0.6618700 0.4941280
## 8 0.39 0.6618308 0.4941327
## 9 0.25 0.6630546 0.4941343
## 10 0.24 0.6621407 0.4941345
Best LOESS based on rmse: span 0.39
Best LOESS based on mae: span 0.1
Best LOESS based on slice_min rankings: span 0.44
4.2.7. Model Comparison
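# Row 9 of best.step1 corresponds to breaks = 11; row 35 of best.loess1 to
# span = 0.44 (the winners identified above)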
df.cv1 <- rbind(m_lin.met1, m_pol.met1, best.step1[9, 2:3], m_spline.met1,
m_nspl.met1, best.loess1[35, -1])
rownames(df.cv1) <- c("Linear", "Polynomial (3)", "Step-Function (11)",
"Spline Regression", "Natural Spline Regression", "LOESS Function (0.44)")
knitr::kable(df.cv1)
| rmse | mae | |
|---|---|---|
| Linear | 0.6634798 | 0.5142883 |
| Polynomial (3) | 0.6621524 | 0.5017377 |
| Step-Function (11) | 0.6625326 | 0.4967856 |
| Spline Regression | 0.6643972 | 0.4957557 |
| Natural Spline Regression | 0.6619944 | 0.4936618 |
| LOESS Function (0.44) | 0.6618700 | 0.4941280 |
Based on 10-fold cross-validation, LOESS with span = 0.44 is the best model for the origin~mpg relationship by rmse, while natural spline regression is the best by mae, compared with the other five models.
4.3. Model (horsepower~mpg)
4.3.1. Linear
lin.met2 <- map_dfr(cross.val2$splits, function(x) {
mod <- lm(horsepower ~ mpg, data = psd.dt.A[x$in_id, ])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$horsepower
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
m_lin.met2 <- colMeans(lin.met2)
m_lin.met2
## rmse mae
## 24.05804 18.51148
4.3.2. Polynomial (8)
pol.met2 <- map_dfr(cross.val2$splits, function(x) {
mod <- lm(horsepower ~ poly(mpg, 8, raw = T), data = psd.dt.A[x$in_id,
])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$horsepower
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
m_pol.met2 <- colMeans(pol.met2)
m_pol.met2
## rmse mae
## 19.85886 14.21747
4.3.3. Step Function
breaks <- 3:12
best.step2 <- map_dfr(breaks, function(i) {
step.met2 <- map_dfr(cross.val2$splits, function(x) {
training <- psd.dt.A[x$in_id, ]
training$mpg <- cut(training$mpg, i)
mod <- lm(horsepower ~ mpg, data = training)
labs_x <- levels(mod$model[, 2])
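# Same label-parsing trick as in 4.2.3 to recover the training-fold bins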
labs_x_breaks <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1",
labs_x)), upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1",
labs_x)))
testing <- psd.dt.A[-x$in_id, ]
newdt <- cut(testing$mpg, c(labs_x_breaks[1, 1], labs_x_breaks[,
2]))
pred <- predict(mod, newdata = list(mpg = newdt))
truth <- testing$horsepower
data_eval <- na.omit(data.frame(truth, pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth, response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth, response = data_eval$pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
# average the metrics over the 10 folds
m_step.met2 <- colMeans(step.met2)
m_step.met2
})
best.step2 <- cbind(breaks = breaks, best.step2)
# show the results for all breaks
best.step2 %>%
slice_min(rmse, n = 5)
## breaks rmse mae
## 1 10 20.28387 14.57820
## 2 12 20.35149 14.63873
## 3 9 20.81224 14.63435
## 4 11 21.10692 15.43080
## 5 7 21.23988 16.01389
best.step2 %>%
slice_min(mae, n = 5)
## breaks rmse mae
## 1 10 20.28387 14.57820
## 2 9 20.81224 14.63435
## 3 12 20.35149 14.63873
## 4 5 21.36285 15.41345
## 5 11 21.10692 15.43080
Number of breaks for the best step function: 10.
4.3.4. Spline Regression
spline.met2 <- map_dfr(cross.val2$splits, function(x) {
mod <- lm(horsepower ~ bs(mpg, knots = knots.val.auto), data = psd.dt.A[x$in_id,
])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$horsepower
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
m_spline.met2 <- colMeans(spline.met2)
m_spline.met2
## rmse mae
## 19.39506 14.12607
4.3.5. Natural Spline Regression
nspl.met2 <- map_dfr(cross.val2$splits, function(x) {
mod <- lm(horsepower ~ ns(mpg, knots = knots.val.auto), data = psd.dt.A[x$in_id,
])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$horsepower
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
m_nspl.met2 <- colMeans(nspl.met2)
m_nspl.met2
## rmse mae
## 19.33404 14.18485
4.3.6. LOESS Function
span <- seq(0.1, 1, length.out = 91)
best.loess2 <- map_dfr(span, function(i) {
loess.met2 <- map_dfr(cross.val2$splits, function(x) {
mod <- loess(horsepower ~ mpg, span = i, data = psd.dt.A[x$in_id,
])
pred <- predict(mod, newdata = psd.dt.A[-x$in_id, ])
truth <- psd.dt.A[-x$in_id, ]$horsepower
data_eval <- na.omit(data.frame(pred = pred, truth = truth))
rmse <- mlr3measures::rmse(truth = data_eval$truth, response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth, response = data_eval$pred)
metric <- c(rmse, mae)
names(metric) <- c("rmse", "mae")
return(metric)
})
# average the metrics over the 10 folds
m_loess.met2 <- colMeans(loess.met2)
m_loess.met2
})
best.loess2 <- cbind(span = span, best.loess2)
# show the results for all spans
best.loess2 %>%
slice_min(rmse, n = 5)
## span rmse mae
## 1 0.58 19.32379 13.77316
## 2 0.60 19.32434 13.77218
## 3 0.61 19.32438 13.77701
## 4 0.56 19.32621 13.77059
## 5 0.53 19.32786 13.77461
best.loess2 %>%
slice_min(mae, n = 5)
## span rmse mae
## 1 0.55 19.33162 13.76996
## 2 0.56 19.32621 13.77059
## 3 0.60 19.32434 13.77218
## 4 0.54 19.33110 13.77222
## 5 0.58 19.32379 13.77316
Best LOESS based on rmse: span 0.58
Best LOESS based on mae: span 0.55
Best LOESS based on slice_min rankings: span 0.56
4.3.7. Model Comparison
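# Row 8 of best.step2 corresponds to breaks = 10; row 47 of best.loess2 to
# span = 0.56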
df.cv2 <- rbind(m_lin.met2, m_pol.met2, best.step2[8, 2:3], m_spline.met2,
m_nspl.met2, best.loess2[47, -1])
rownames(df.cv2) <- c("Linear", "Polynomial (8)", "Step-Function (10)",
"Spline Regression", "Natural Spline Regression", "LOESS Function (0.56)")
knitr::kable(df.cv2)
| rmse | mae | |
|---|---|---|
| Linear | 24.05804 | 18.51148 |
| Polynomial (8) | 19.85886 | 14.21747 |
| Step-Function (10) | 20.28387 | 14.57820 |
| Spline Regression | 19.39506 | 14.12607 |
| Natural Spline Regression | 19.33404 | 14.18485 |
| LOESS Function (0.56) | 19.32621 | 13.77059 |
Based on 10-fold cross-validation, LOESS with span = 0.56 is the best model for the horsepower~mpg relationship, with the smallest rmse and mae among the six models.