Pendahuluan
Tujuan
Melakukan cross validation untuk menghasilkan pemodelan antara tingkat efisiensi penggunaan bahan bakar (direpresentasikan oleh
mpg
atau miles per gallon) versus daya mesin (dalamhorsepower
) optimal berdasarkan dataISLR::Auto
dengan metode:- Polynomial regression
- Piecewise constant
- Natural cubic splines
Pemodelan dilakukan untuk masing-masing subset berdasarkan negara asal (
origin
), yaitu 1) Amerika, 2) Eropa, dan 3) Jepang
Output
Mendapatkan model non-linier yang optimal untuk masing-masing negara asal.
Metodologi
Tahapan
Pemodelan dilakukan dengan tahapan sebagai berikut:
Skenario Tuning Parameter
Cross validation
4-fold cross validation dengan meminimumkan RMSE (root mean squared error)
Polynomial regression
Derajat/degree polinomial awal:
c(2:10)
Piecewise constant
Jumlah bins/breaks awal:
c(2:10)
Natural Cubic Spline
Jumlah dan lokasi knots awal ditentukan perdasarkan quantile:
Jumlah knots | Lokasi knots (quantiles) |
---|---|
1 | .5, |
2 | .33 * c(1:2) |
3 | .25 * c(1:3) |
3* | c(.10, .5, .90) |
4 | .20 * c(1:4) |
4* | c(.05, .35, .65, .95) |
5 | .167 * c(1:5) |
5* | c(.05, .275, .5, .725, .95) |
6 | .143 * c(1:6) |
6* | c(.05, .23, .41, .59, .77, .95) |
7 | .125 * c(1:7) |
7* | c(.025, .1833, .3417, .5, .6583, .8167, .975) |
Sumber : Frank E. Harrell, Jr. (2015). Regression Modeling Strategies
Tools
- R
splines
danstats
untuk pemodelancaret
untuk cross validationggplot2
dancowplot
untuk grafik/visualisasi
Kumpulan Fungsi
Untuk memudahkan iterasi, dibuat beberapa fungsi:
Modeling dan Cross Validation
library(splines)
library(caret)
beyondLinear <- function(model, param, df){
dat <- df
#' model <string> : modeling approach, ie. polynomial (polynomial regression),
#' piecewise (piecewise constant), ncubicspline (natural cubic spline)
#' param <list or vector> : list of tuning parameters to be evaluated
cvResult <- mapply(
function(param, model){
if (model == "polynomial"){
f <- bquote(mpg ~ poly(horsepower, degree = .(param)))
} else if (model == "piecewise") {
f <- bquote(mpg ~ cut(horsepower, breaks = .(param)))
} else if (model == "ncubicspline") {
f <- bquote(mpg ~ ns(horsepower, knots = .(param)))
}
trCtl <- trainControl(method='cv', number = 4)
models <- train(as.formula(f), data = df, trControl = trCtl, method = 'glm')
RMSE <- models$results$RMSE
fModel <- models$finalModel
return(list(models, RMSE, fModel))
}, model = model, param = param, USE.NAMES = FALSE, SIMPLIFY = FALSE )
if(!is.null(names(param))){
paramIdx <- names(param)
} else {
paramIdx <- param
}
cvScore <- data.frame(paramIdx = as.character(unlist(paramIdx)),
score = unlist(sapply(cvResult, "[", 2, USE.NAMES = F)))
cvScore$param <- param
bestPos <- which.min(cvScore$score)[1]
bestParam <- param[[bestPos]]
bestScore <- cvResult[[bestPos]][[2]]
bestModel <- cvResult[[bestPos]][[1]]
predValue <- data.frame(df[c("mpg","horsepower")],
pred = predict(bestModel, df, interval = "prediction"))
return(list(model = model,
cvScore = cvScore,
whichBest = bestPos,
bestParamIdx = paramIdx[bestPos],
bestParam = bestParam,
bestScore = bestScore,
bestModel = bestModel,
predValue = predValue))
}
Grafik
library(ggplot2)
library(cowplot)
cvPlot <- function(df, title, xlab){
scaleFUN <- function(x) as.character(round(x,2))
df$paramIdx <- factor(df$paramIdx, levels = df$paramIdx)
ggplot(df, aes(x=paramIdx, y=score)) +
geom_bar(stat = "identity", width=0.1, fill = "black") +
coord_cartesian(ylim = c(0.99*min(df$score), NA)) +
geom_hline(yintercept = min(df$score), linetype="dashed", color = "red") +
ggtitle(title) +
xlab(xlab) + ylab("Root Mean Squared Error") +
theme_classic()
}
predPlot <- function(df, title){
ggplot(df) +
geom_point(aes(x = horsepower, y = mpg, fill = "*")) +
geom_line(aes(x = horsepower, y = pred, color = "*"), size = 1) +
scale_fill_manual(labels = c("Observed"), values = c("black")) +
scale_color_manual(labels = c("Predicted"), values = c("blue")) +
xlim(40, 235) +
theme_classic() +
ggtitle(title) +
xlab("Horsepower") + ylab("Miles per Gallon") +
theme(legend.title = element_blank(),
legend.position = "bottom",
legend.box = "horizontal")
}
predPlotAll <- function(df) {
ggplot(df) +
geom_point(aes(x = horsepower, y = mpg, fill = "*")) +
geom_line(aes(x = horsepower, y = polynomial, color = "a"), size = 1) +
geom_line(aes(x = horsepower, y = piecewise, color = "b"), size = 1) +
geom_line(aes(x = horsepower, y = ncubicspline, color = "c"), size = 1) +
scale_fill_manual(labels = c("Observed"), values = c("black")) +
scale_color_manual(labels = c("Polynomial", "Piecewise", "Natural Cubic Splines"),
values = c("blue", "red", "green")) +
xlim(40, 235) +
theme_classic() +
ggtitle("Comparison of Best Predictions") +
xlab("Horsepower") + ylab("Miles per Gallon") +
theme(legend.title = element_blank(),
legend.position = "bottom",
legend.box = "horizontal")
}
Fungsi wrapper
generateAllREg <- function(df, orig, degree, cut, knot, best=NULL){
methods <- c("polynomial", "piecewise", "ncubicspline")
method_alias <- c("Polynomial Regression", "Piecewise Constant", "Natural Cubic Splines")
params <- list(Degree = degree, `N Bins` = cut, `N Knots` = knot)
outModel <- list()
pltCV <- list()
pltPred <- list()
for( i in seq(length(methods)) ){
model <- beyondLinear(model = methods[i], param = params[[i]], df = df)
outModel[[i]] <- model
p <- cvPlot(model$cvScore,
title = method_alias[i],
xlab = names(params)[i])
pltCV[[i]] <- p
p <- predPlot(model$predValue,
sprintf("%s with %s=%s (RMSE=%s)",
method_alias[i], names(params)[i],
model$bestParamIdx, round(model$bestScore,2)))
if(methods[i] == "ncubicspline"){
p <- p + geom_vline(xintercept = model$bestParam, linetype="dashed", color = "gray")
}
pltPred[[i]] <- p
}
bestSummary <- data.frame(t(sapply(outModel, "[", c("model", "bestParamIdx", "bestScore"))))
names(bestSummary) <- c("method", "param", "RMSE")
bestSummary <- cbind(origin = orig, bestSummary)
predAllBest <- data.frame(outModel[[1]]$predValue,
outModel[[2]]$predValue$pred,
outModel[[3]]$predValue$pred)
names(predAllBest) <- c("mpg","horsepower", methods)
pltPredAll <- predPlotAll(predAllBest)
if(is.null(best)){
bIdx <- which.min(bestSummary$RMSE)
} else {
bIdx <- best
}
bestModel <- bestSummary[bIdx, ]
predicted <- outModel[[bIdx]]$predValue
predicted <- cbind(origin = orig, predicted)
return(list(bestSummary = bestSummary,
bestModel = bestModel,
bestModelDetail = outModel[[bIdx]]$bestModel,
bestParamDetail = outModel[[bIdx]]$bestParam,
model = outModel,
pltCV = pltCV,
pltPred = pltPred,
pltPredAll = pltPredAll,
pltCVGrid = plot_grid(pltCV[[1]], pltCV[[2]], pltCV[[3]]),
pltPredGrid = plot_grid(pltPred[[1]], pltPred[[2]], pltPred[[3]], pltPredAll),
predBest3 = predAllBest,
predBestFinal = predicted))
}
Penentuan Knots
pctKnots <- function(x){
pct <- list(
.5,
.33*c(1:2),
.25*c(1:3),
c(.10,.5,.90),
.20*c(1:4),
c(.05,.35,.65,.95),
.167*c(1:5),
c(.05,.275,.5,.725,.95),
.143*c(1:6),
c(.05,.23,.41,.59,.77,.95),
.125*c(1:7),
c(.025,.1833,.3417,.5,.6583,.8167,.975))
knt <- sapply(pct, function(p, i){round(quantile(i, p),2)}, i = x)
names(knt) <- c(1,2,3,"3*",4,"4*",5,"5*",6,"6*",7,"7*")
return(knt)
}
Hasil dan Pembahasan
Preview Dataset
library(ISLR)
head(Auto)
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
---|---|---|---|---|---|---|---|---|
18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
15 | 8 | 429 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
Hubungan mpg
dan horsepower
library(ggplot2)
ggplot(Auto) +
geom_point(aes(x = horsepower, y = mpg), alpha = 0.8) +
ggtitle("Miles per Gallon vs Horsepower") +
xlab("Horsepower") + ylab("Miles per Gallon") +
theme_classic()
Berdasarkan origin
freq <- table(Auto$origin)
data.frame(cbind(freq, prop = prop.table(freq)))
freq | prop |
---|---|
245 | 0.6250000 |
68 | 0.1734694 |
79 | 0.2015306 |
p <- ggplot(Auto) +
geom_point(aes(x = horsepower, y = mpg, color = as.factor(origin)), alpha = 0.8) +
scale_color_manual(values = c("black", "red", "blue"),
labels = c("American", "European", "Japanese")) +
ggtitle("Miles per Gallon vs Horsepower, by Origin") +
xlab("Horsepower") + ylab("Miles per Gallon") +
theme_classic() +
theme(legend.title = element_blank(),
legend.position = "bottom",
legend.box = "horizontal")
p
p + facet_grid(origin ~ .)
Berdasarkan visualisasi di atas terlihat bahwa hubungan mpg
dan horsepower
tidak bersifat linear. Untuk itu diperlukan metode non-linear untuk mendapatkan pemodelan optimal.
Semua Negara
AutoKnots <- pctKnots(Auto$horsepower)
set.seed(1234)
orig0 <- generateAllREg(df = Auto,
orig = "Semua Negara",
degree = c(2:10), cut = c(2:10), knot = AutoKnots)
Cross validation
orig0$pltCVGrid
Ringkasan tuning parameter terbaik
orig0$bestSummary
origin | method | param | RMSE |
---|---|---|---|
Semua Negara | polynomial | 9 | 4.313204 |
Semua Negara | piecewise | 8 | 4.453155 |
Semua Negara | ncubicspline | 4* | 4.306619 |
orig0$bestModel
origin | method | param | RMSE | |
---|---|---|---|---|
3 | Semua Negara | ncubicspline | 4* | 4.306619 |
Tunning Parameter terbaik
orig0$bestParamDetail
## 5% 35% 65% 95%
## 60.55 85.00 105.00 180.00
Nilai observasi vs prediksi
orig0$pltPredGrid
Model terbaik yang diperoleh dari cross-validation adalah natural cubic spline dengan 4 knots.
Ringkasan modeling
summary(orig0$bestModelDetail)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -15.7924 -2.6455 -0.2029 2.2116 15.4144
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 33.789 1.570 21.516
## `ns(horsepower, knots = c(60.55, 85, 105, 180))1` -9.327 1.597 -5.840
## `ns(horsepower, knots = c(60.55, 85, 105, 180))2` -15.555 2.140 -7.268
## `ns(horsepower, knots = c(60.55, 85, 105, 180))3` -22.579 1.904 -11.860
## `ns(horsepower, knots = c(60.55, 85, 105, 180))4` -19.168 3.508 -5.465
## `ns(horsepower, knots = c(60.55, 85, 105, 180))5` -21.579 1.798 -12.004
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))1` 1.11e-08 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))2` 2.04e-12 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))3` < 2e-16 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))4` 8.33e-08 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))5` < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 18.44132)
##
## Null deviance: 23819.0 on 391 degrees of freedom
## Residual deviance: 7118.4 on 386 degrees of freedom
## AIC: 2262.9
##
## Number of Fisher Scoring iterations: 2
Origin 1: Amerika
Auto1 <- Auto[Auto$origin == 1, ]
Auto1Knots <- pctKnots(Auto1$horsepower)
set.seed(1234)
orig1 <- generateAllREg(df = Auto1,
orig = "Amerika",
degree = c(2:10), cut = c(2:10), knot = Auto1Knots)
Cross validation
orig1$pltCVGrid
Ringkasan tuning parameter terbaik
orig1$bestSummary
origin | method | param | RMSE |
---|---|---|---|
Amerika | polynomial | 2 | 3.76728 |
Amerika | piecewise | 9 | 3.794538 |
Amerika | ncubicspline | 4 | 3.759049 |
orig1$bestModel
origin | method | param | RMSE | |
---|---|---|---|---|
3 | Amerika | ncubicspline | 4 | 3.759049 |
Tunning Parameter terbaik
orig1$bestParamDetail
## 20% 40% 60% 80%
## 85.0 99.2 122.0 150.0
Nilai observasi vs prediksi
orig1$pltPredGrid
Ringkasan modeling
summary(orig1$bestModelDetail)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.8393 -2.3393 -0.2519 2.1876 12.8473
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 32.869 2.003 16.409
## `ns(horsepower, knots = c(85, 99.2, 122, 150))1` -14.266 1.904 -7.494
## `ns(horsepower, knots = c(85, 99.2, 122, 150))2` -13.533 2.528 -5.352
## `ns(horsepower, knots = c(85, 99.2, 122, 150))3` -19.748 1.689 -11.691
## `ns(horsepower, knots = c(85, 99.2, 122, 150))4` -24.493 4.449 -5.505
## `ns(horsepower, knots = c(85, 99.2, 122, 150))5` -16.706 1.655 -10.093
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))1` 1.30e-12 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))2` 2.04e-07 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))3` < 2e-16 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))4` 9.51e-08 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))5` < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 14.29455)
##
## Null deviance: 10120.8 on 244 degrees of freedom
## Residual deviance: 3416.4 on 239 degrees of freedom
## AIC: 1354.9
##
## Number of Fisher Scoring iterations: 2
Origin 2: Eropa
Auto2 <- Auto[Auto$origin == 2, ]
Auto2Knots <- pctKnots(Auto2$horsepower)
set.seed(1234)
orig2 <- generateAllREg(df = Auto2,
orig = "Eropa",
degree = c(2:8), cut = c(2:10), knot = Auto2Knots)
Cross validation
orig2$pltCVGrid
Ringkasan tuning parameter terbaik
orig2$bestSummary
origin | method | param | RMSE |
---|---|---|---|
Eropa | polynomial | 4 | 5.205988 |
Eropa | piecewise | 5 | 4.891956 |
Eropa | ncubicspline | 3 | 4.860623 |
orig2$bestModel
origin | method | param | RMSE | |
---|---|---|---|---|
3 | Eropa | ncubicspline | 3 | 4.860623 |
Tunning Parameter terbaik
orig2$bestParamDetail
## 25% 50% 75%
## 69.75 76.50 90.00
Nilai observasi vs prediksi
orig2$pltPredGrid
Ringkasan modeling
summary(orig2$bestModelDetail)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10.4288 -2.7067 -0.6468 2.0588 12.6018
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 35.425 2.113 16.769
## `ns(horsepower, knots = c(69.75, 76.5, 90))1` -6.851 2.532 -2.705
## `ns(horsepower, knots = c(69.75, 76.5, 90))2` -12.103 3.160 -3.830
## `ns(horsepower, knots = c(69.75, 76.5, 90))3` -20.101 5.503 -3.652
## `ns(horsepower, knots = c(69.75, 76.5, 90))4` -15.407 3.897 -3.954
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))1` 0.008763 **
## `ns(horsepower, knots = c(69.75, 76.5, 90))2` 0.000298 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))3` 0.000530 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))4` 0.000197 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 24.63829)
##
## Null deviance: 2901.0 on 67 degrees of freedom
## Residual deviance: 1552.2 on 63 degrees of freedom
## AIC: 417.67
##
## Number of Fisher Scoring iterations: 2
Origin 3: Jepang
Auto3 <- Auto[Auto$origin == 3, ]
Auto3Knots <- pctKnots(Auto3$horsepower)
set.seed(1234)
orig3 <- generateAllREg(df = Auto3,
orig = "Jepang",
degree = c(2:8), cut = c(2:10), knot = Auto3Knots)
Cross validation
orig3$pltCVGrid
Ringkasan tuning parameter terbaik
orig3$bestSummary
origin | method | param | RMSE |
---|---|---|---|
Jepang | polynomial | 3 | 4.013379 |
Jepang | piecewise | 9 | 4.211734 |
Jepang | ncubicspline | 3* | 4.152535 |
orig3$bestModel
origin | method | param | RMSE |
---|---|---|---|
Jepang | polynomial | 3 | 4.013379 |
Nilai observasi vs prediksi
orig3$pltPredGrid
Ringkasan modeling
summary(orig3$bestModelDetail)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.8602 -2.7115 -0.5224 2.1985 11.6985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.4506 0.4536 67.138 < 2e-16 ***
## `poly(horsepower, degree = 3)1` -36.2030 4.0312 -8.981 1.62e-13 ***
## `poly(horsepower, degree = 3)2` 9.1966 4.0312 2.281 0.0254 *
## `poly(horsepower, degree = 3)3` 16.6991 4.0312 4.142 8.92e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 16.25097)
##
## Null deviance: 2892.9 on 78 degrees of freedom
## Residual deviance: 1218.8 on 75 degrees of freedom
## AIC: 450.35
##
## Number of Fisher Scoring iterations: 2
Origin 3: Jepang (Refined)
Meskipun mempunyai RMSE yang paling rendah berdasarkan hasil k-fold cross validation, model terbaik secara visual menunjukkan adanya belokan di sekitar ujung-ujung kurva. Hal ini dirasa kurang masuk akal. Untuk itu perlu dilakukan “revisi” model. Karena ketiga metode menunjukkan pola yang sama, akan dilakukan revisi terhadap model natural cubic spline. Langkah yang dilakukan adalah dengan menghapus knot dan mengubah lokasi (dalam hal ini adalah persentil 10 sampai dengan 90).
newKnots <- as.list(quantile(Auto3$horsepower, seq(0.1, 0.9, by=0.1)))
names(newKnots) <- paste0("1.", letters[1:length(newKnots)])
set.seed(1234)
orig3b <- generateAllREg(df = Auto3,
orig = "Jepang",
degree = c(2:8), cut = c(2:10), knot = newKnots, best = 3)
Cross validation
orig3b$pltCVGrid
Ringkasan tuning parameter terbaik
orig3b$bestSummary
origin | method | param | RMSE |
---|---|---|---|
Jepang | polynomial | 3 | 4.013379 |
Jepang | piecewise | 9 | 4.211734 |
Jepang | ncubicspline | 1.h | 4.504423 |
orig3b$bestModel
origin | method | param | RMSE | |
---|---|---|---|---|
3 | Jepang | ncubicspline | 1.h | 4.504423 |
Ringkasan modeling
summary(orig3b$bestModelDetail)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.9311 -2.5978 -0.3679 2.1341 12.3869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.918 1.250 31.146 < 2e-16 ***
## `ns(horsepower, knots = 96)1` -25.236 2.968 -8.504 1.19e-12 ***
## `ns(horsepower, knots = 96)2` -7.348 3.164 -2.322 0.0229 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 19.30245)
##
## Null deviance: 2892.9 on 78 degrees of freedom
## Residual deviance: 1467.0 on 76 degrees of freedom
## AIC: 462.99
##
## Number of Fisher Scoring iterations: 2
Perbandingan model lama dan baru
plt <- orig3$pltPred[[3]]
plt + geom_line(data = orig3b$model[[3]]$predValue,
aes(x = horsepower, y = pred, color = "1"), size = 1) +
scale_color_manual(labels = c(sprintf("%s(%s)",orig3$bestSummary[3,2],orig3$bestSummary[3,3]),
sprintf("%s(%s)",orig3b$bestModel[1,2],orig3b$bestModel[1,3])), values = c("blue", "red")) +
geom_vline(xintercept = orig3b$bestParamDetail, linetype="dashed", color = "red") +
ggtitle("Japan's Natural Cubic Spline: Old vs New")
Curva menunjukkan bahwa belokan bisa diminimalisasi. Konsekuensinya, RMSE meningkat dari 4.15 menjadi 4.50.
Nilai observasi vs prediksi
orig3b$pltPredGrid
All Models
Dengan demikin diperoleh pemodelan optimal untuk masing-masing negara asal sebagai berikut:
finalModels <- rbind(orig1$bestModel, orig2$bestModel, orig3b$bestModel)
finalModels
origin | method | param | RMSE | |
---|---|---|---|---|
3 | Amerika | ncubicspline | 4 | 3.759049 |
31 | Eropa | ncubicspline | 3 | 4.860623 |
32 | Jepang | ncubicspline | 1.h | 4.504423 |
finalPred <- rbind(orig1$predBestFinal, orig2$predBestFinal, orig3b$predBestFinal)
finalPred <- merge(finalPred, finalModels)
finalPred$originModel <- sprintf('%s: %s(%s)', finalPred$origin, finalPred$method, finalPred$param)
ggplot(finalPred) +
geom_point(aes(x = horsepower, y = mpg, color = originModel), alpha = 0.8) +
geom_line(aes(x = horsepower, y = pred, color = originModel), size = 1) +
scale_color_manual(values = c("black", "red", "blue")) +
theme_classic() +
ggtitle("Modeling Miles per Gallon (MPG) vs Horsepower, by Country of Origin") +
xlab("Horsepower") + ylab("Miles per Gallon") +
theme(legend.title = element_blank(),
legend.position = "bottom",
legend.box = "horizontal")
Kesimpulan
- Perlu kehati-hatian dalam memilih tuning parameter.
- Pemodelan terbaik hasil dari cross validation, belum tentu merupakan model yang bisa diterima, sehingga perlu pandangan analis untuk melakukan evaluasi terhadap model-model yang diperoleh.
- Model non-linier mampu digunakan untuk melakukan prediksi efisiensi bahan bakar menggunakan horsepower.
- Secara umum, semakin tinggi horsepower, maka kendaraan semakin boros dalam penggunaan bahan bakar (mpg lebih rendah). Hal ini berlaku untuk ketiga negara asal.
- Berdasarkan plot terakhir, terlihat bahwa untuk
horsepower
yang sama, kendaraan yang berasal dari Jepang lebih hemat dalam penggunaan bahan bakar, diikuti oleh Eropa dan Amerika.
Daftar Pustaka
Dito, Gerry A. (2021), Regresi Polinomial, Regresi Fungsi Tangga dan Regresi Spline di R, https://gerrydito.github.io/Regresi-Polinomial,-Regresi-Fungsi-Tangga-dan-Regresi-Spline/ (Oct 29, 2021).
Harrell, Frank E., Jr. (2015). Regression Modeling Strategies. Springer Series in Statistics, Springer.
James, G.; Witten, D.; Hastie, T. & Tibshirani, R. (2013), An Introduction to Statistical Learning: with Applications in R, Springer.
Mahasiswa Pascasarjana Statistika dan Sains Data, Institut Pertanian Bogor, NIM: G1501211061, nur.andi@apps.ipb.ac.id↩︎