Moving Beyond Linearity

Tugas Kuliah STA581 Sains Data

Nur Andi Setiabudi1

2021-10-29

Pendahuluan

Tujuan

  • Melakukan cross validation untuk menghasilkan pemodelan antara tingkat efisiensi penggunaan bahan bakar (direpresentasikan olehmpg atau miles per gallon) versus daya mesin (dalam horsepower) optimal berdasarkan data ISLR::Auto dengan metode:

    • Polynomial regression
    • Piecewise constant
    • Natural cubic splines
  • Pemodelan dilakukan untuk masing-masing subset berdasarkan negara asal (origin), yaitu 1) Amerika, 2) Eropa, dan 3) Jepang

Output

Mendapatkan model non-linier yang optimal untuk masing-masing negara asal.

Metodologi

Tahapan

Pemodelan dilakukan dengan tahapan sebagai berikut:

Skenario Tuning Parameter

  • Cross validation

    4-fold cross validation dengan meminimumkan RMSE (root mean squared error)

  • Polynomial regression

    Derajat/degree polinomial awal: c(2:10)

  • Piecewise constant

    Jumlah bins/breaks awal: c(2:10)

  • Natural Cubic Spline

    Jumlah dan lokasi knots awal ditentukan perdasarkan quantile:

Jumlah knots Lokasi knots (quantiles)
1 .5,
2 .33 * c(1:2)
3 .25 * c(1:3)
3* c(.10, .5, .90)
4 .20 * c(1:4)
4* c(.05, .35, .65, .95)
5 .167 * c(1:5)
5* c(.05, .275, .5, .725, .95)
6 .143 * c(1:6)
6* c(.05, .23, .41, .59, .77, .95)
7 .125 * c(1:7)
7* c(.025, .1833, .3417, .5, .6583, .8167, .975)

Sumber : Frank E. Harrell, Jr. (2015). Regression Modeling Strategies

Tools

  • R
  • splines dan stats untuk pemodelan
  • caret untuk cross validation
  • ggplot2 dan cowplot untuk grafik/visualisasi

Kumpulan Fungsi

Untuk memudahkan iterasi, dibuat beberapa fungsi:

Modeling dan Cross Validation

library(splines)
library(caret)

beyondLinear <- function(model, param, df){
  
  dat <- df
  
  #' model <string> : modeling approach, ie. polynomial (polynomial regression),
  #'   piecewise (piecewise constant), ncubicspline (natural cubic spline)
  #' param <list or vector> : list of tuning parameters to be evaluated

  cvResult <- mapply(
      function(param, model){
        
        if (model == "polynomial"){
          f <- bquote(mpg ~ poly(horsepower, degree = .(param)))
        } else if (model == "piecewise") {
          f <- bquote(mpg ~ cut(horsepower, breaks = .(param)))
        } else if (model == "ncubicspline") {
          f <- bquote(mpg ~ ns(horsepower, knots = .(param)))
        } 
  
        trCtl <- trainControl(method='cv', number = 4)
        models <- train(as.formula(f), data = df, trControl = trCtl, method = 'glm')
        
        RMSE <- models$results$RMSE
        fModel <- models$finalModel
        
        return(list(models, RMSE, fModel))
      },  model = model, param = param, USE.NAMES = FALSE, SIMPLIFY = FALSE )
  
  
  if(!is.null(names(param))){
    paramIdx <- names(param)
  } else {
    paramIdx <- param
  }

  cvScore <- data.frame(paramIdx = as.character(unlist(paramIdx)),
                        score = unlist(sapply(cvResult, "[", 2, USE.NAMES = F)))
  cvScore$param <- param
  
  bestPos <- which.min(cvScore$score)[1]
  bestParam <- param[[bestPos]]
  bestScore <- cvResult[[bestPos]][[2]]
  bestModel <- cvResult[[bestPos]][[1]]
  
  predValue <- data.frame(df[c("mpg","horsepower")], 
                    pred = predict(bestModel, df, interval = "prediction"))
  
  return(list(model = model,
              cvScore = cvScore, 
              whichBest = bestPos, 
              bestParamIdx = paramIdx[bestPos],
              bestParam = bestParam, 
              bestScore = bestScore, 
              bestModel = bestModel,
              predValue = predValue))
}

Grafik

library(ggplot2)
library(cowplot)

cvPlot <- function(df, title, xlab){
  scaleFUN <- function(x) as.character(round(x,2))
  df$paramIdx <- factor(df$paramIdx, levels = df$paramIdx)
  
  ggplot(df, aes(x=paramIdx, y=score)) + 
    geom_bar(stat = "identity", width=0.1, fill = "black") +
    coord_cartesian(ylim = c(0.99*min(df$score), NA)) +
    geom_hline(yintercept = min(df$score), linetype="dashed", color = "red") +
    ggtitle(title) +
    xlab(xlab) + ylab("Root Mean Squared Error") +
    theme_classic()
}

predPlot <- function(df, title){
  ggplot(df) +
    geom_point(aes(x = horsepower, y = mpg, fill = "*")) +
    geom_line(aes(x = horsepower, y = pred, color = "*"), size = 1) +
    scale_fill_manual(labels = c("Observed"), values = c("black")) +
    scale_color_manual(labels = c("Predicted"), values = c("blue"))  +
    xlim(40, 235) +
    theme_classic() +
    ggtitle(title) +
    xlab("Horsepower") + ylab("Miles per Gallon") +
    theme(legend.title = element_blank(), 
          legend.position = "bottom", 
          legend.box = "horizontal")
}

predPlotAll <- function(df) {
  ggplot(df) +
  geom_point(aes(x = horsepower, y = mpg, fill = "*")) +
  geom_line(aes(x = horsepower, y = polynomial, color = "a"), size = 1) +
  geom_line(aes(x = horsepower, y = piecewise, color = "b"), size = 1) +
  geom_line(aes(x = horsepower, y = ncubicspline, color = "c"), size = 1) +
  scale_fill_manual(labels = c("Observed"), values = c("black")) +
  scale_color_manual(labels = c("Polynomial", "Piecewise", "Natural Cubic Splines"), 
                     values = c("blue", "red", "green"))  +
  xlim(40, 235) +
  theme_classic() +
  ggtitle("Comparison of Best Predictions") +
  xlab("Horsepower") + ylab("Miles per Gallon") +
  theme(legend.title = element_blank(), 
        legend.position = "bottom", 
        legend.box = "horizontal")
}

Fungsi wrapper

generateAllREg <- function(df, orig, degree, cut, knot, best=NULL){

  methods <- c("polynomial", "piecewise", "ncubicspline")
  method_alias <- c("Polynomial Regression", "Piecewise Constant", "Natural Cubic Splines")
  params <- list(Degree = degree, `N Bins` = cut, `N Knots` = knot)
  
  outModel <- list()
  pltCV <- list()
  pltPred <- list()
  
  for( i in seq(length(methods)) ){
    
    model <- beyondLinear(model = methods[i], param = params[[i]], df = df)
      
    outModel[[i]] <- model
    

    p <- cvPlot(model$cvScore,
           title = method_alias[i],
           xlab =  names(params)[i])
    
    pltCV[[i]] <- p
    
    p <- predPlot(model$predValue, 
             sprintf("%s with %s=%s (RMSE=%s)", 
                     method_alias[i], names(params)[i], 
                     model$bestParamIdx, round(model$bestScore,2)))
    
    if(methods[i] == "ncubicspline"){
      p <- p + geom_vline(xintercept = model$bestParam, linetype="dashed", color = "gray")
    }
  
    pltPred[[i]] <- p
  }
  
  bestSummary <- data.frame(t(sapply(outModel, "[", c("model", "bestParamIdx", "bestScore"))))
  names(bestSummary) <- c("method", "param", "RMSE")
  
  bestSummary <- cbind(origin = orig, bestSummary)
  
  predAllBest <- data.frame(outModel[[1]]$predValue, 
                            outModel[[2]]$predValue$pred, 
                            outModel[[3]]$predValue$pred)
  names(predAllBest) <- c("mpg","horsepower", methods)
  
  pltPredAll <- predPlotAll(predAllBest)

  if(is.null(best)){
    bIdx <- which.min(bestSummary$RMSE)
  } else {
    bIdx <- best
  }
  
  bestModel <- bestSummary[bIdx, ]

  predicted <- outModel[[bIdx]]$predValue
  predicted <- cbind(origin = orig, predicted)
  
  return(list(bestSummary = bestSummary, 
              bestModel = bestModel,
              bestModelDetail = outModel[[bIdx]]$bestModel,
              bestParamDetail = outModel[[bIdx]]$bestParam,
              model = outModel, 
              pltCV = pltCV, 
              pltPred = pltPred,
              pltPredAll = pltPredAll,
              pltCVGrid = plot_grid(pltCV[[1]], pltCV[[2]], pltCV[[3]]),
              pltPredGrid = plot_grid(pltPred[[1]], pltPred[[2]], pltPred[[3]], pltPredAll),
              predBest3 = predAllBest,
              predBestFinal = predicted))

}

Penentuan Knots

pctKnots <- function(x){
  pct <- list(
    .5,
    .33*c(1:2),
    .25*c(1:3),
    c(.10,.5,.90),
    .20*c(1:4),
    c(.05,.35,.65,.95),
    .167*c(1:5),
    c(.05,.275,.5,.725,.95),
    .143*c(1:6),
    c(.05,.23,.41,.59,.77,.95),
    .125*c(1:7),
    c(.025,.1833,.3417,.5,.6583,.8167,.975))
  
  knt <- sapply(pct, function(p, i){round(quantile(i, p),2)}, i = x)
  names(knt) <- c(1,2,3,"3*",4,"4*",5,"5*",6,"6*",7,"7*")
  return(knt)
  
}

Hasil dan Pembahasan

Preview Dataset

library(ISLR)
head(Auto)
mpg cylinders displacement horsepower weight acceleration year origin name
18 8 307 130 3504 12.0 70 1 chevrolet chevelle malibu
15 8 350 165 3693 11.5 70 1 buick skylark 320
18 8 318 150 3436 11.0 70 1 plymouth satellite
16 8 304 150 3433 12.0 70 1 amc rebel sst
17 8 302 140 3449 10.5 70 1 ford torino
15 8 429 198 4341 10.0 70 1 ford galaxie 500
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

Hubungan mpg dan horsepower

library(ggplot2)
ggplot(Auto) +
  geom_point(aes(x = horsepower, y = mpg), alpha = 0.8) +
  ggtitle("Miles per Gallon vs Horsepower") +
  xlab("Horsepower") + ylab("Miles per Gallon") +
  theme_classic() 

Berdasarkan origin

freq <- table(Auto$origin)
data.frame(cbind(freq, prop = prop.table(freq)))
freq prop
245 0.6250000
68 0.1734694
79 0.2015306
p <- ggplot(Auto) +
  geom_point(aes(x = horsepower, y = mpg, color = as.factor(origin)), alpha = 0.8) +
  scale_color_manual(values = c("black", "red", "blue"),
                     labels = c("American", "European", "Japanese")) +
  ggtitle("Miles per Gallon vs Horsepower, by Origin") +
  xlab("Horsepower") + ylab("Miles per Gallon") +
  theme_classic() +
  theme(legend.title = element_blank(), 
        legend.position = "bottom", 
        legend.box = "horizontal")
p

p + facet_grid(origin ~ .)

Berdasarkan visualisasi di atas terlihat bahwa hubungan mpg dan horsepower tidak bersifat linear. Untuk itu diperlukan metode non-linear untuk mendapatkan pemodelan optimal.

Semua Negara

AutoKnots <- pctKnots(Auto$horsepower)

set.seed(1234)
orig0 <- generateAllREg(df = Auto, 
                  orig = "Semua Negara", 
                  degree = c(2:10), cut = c(2:10), knot = AutoKnots)

Cross validation

orig0$pltCVGrid

Ringkasan tuning parameter terbaik

orig0$bestSummary
origin method param RMSE
Semua Negara polynomial 9 4.313204
Semua Negara piecewise 8 4.453155
Semua Negara ncubicspline 4* 4.306619
orig0$bestModel
origin method param RMSE
3 Semua Negara ncubicspline 4* 4.306619

Tunning Parameter terbaik

orig0$bestParamDetail
##     5%    35%    65%    95% 
##  60.55  85.00 105.00 180.00

Nilai observasi vs prediksi

orig0$pltPredGrid

Model terbaik yang diperoleh dari cross-validation adalah natural cubic spline dengan 4 knots.

Ringkasan modeling

summary(orig0$bestModelDetail)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -15.7924   -2.6455   -0.2029    2.2116   15.4144  
## 
## Coefficients:
##                                                   Estimate Std. Error t value
## (Intercept)                                         33.789      1.570  21.516
## `ns(horsepower, knots = c(60.55, 85, 105, 180))1`   -9.327      1.597  -5.840
## `ns(horsepower, knots = c(60.55, 85, 105, 180))2`  -15.555      2.140  -7.268
## `ns(horsepower, knots = c(60.55, 85, 105, 180))3`  -22.579      1.904 -11.860
## `ns(horsepower, knots = c(60.55, 85, 105, 180))4`  -19.168      3.508  -5.465
## `ns(horsepower, knots = c(60.55, 85, 105, 180))5`  -21.579      1.798 -12.004
##                                                   Pr(>|t|)    
## (Intercept)                                        < 2e-16 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))1` 1.11e-08 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))2` 2.04e-12 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))3`  < 2e-16 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))4` 8.33e-08 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))5`  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 18.44132)
## 
##     Null deviance: 23819.0  on 391  degrees of freedom
## Residual deviance:  7118.4  on 386  degrees of freedom
## AIC: 2262.9
## 
## Number of Fisher Scoring iterations: 2

Origin 1: Amerika

Auto1 <- Auto[Auto$origin == 1, ]
Auto1Knots <- pctKnots(Auto1$horsepower)

set.seed(1234)
orig1 <- generateAllREg(df = Auto1, 
                  orig = "Amerika", 
                  degree = c(2:10), cut = c(2:10), knot = Auto1Knots)

Cross validation

orig1$pltCVGrid

Ringkasan tuning parameter terbaik

orig1$bestSummary
origin method param RMSE
Amerika polynomial 2 3.76728
Amerika piecewise 9 3.794538
Amerika ncubicspline 4 3.759049
orig1$bestModel
origin method param RMSE
3 Amerika ncubicspline 4 3.759049

Tunning Parameter terbaik

orig1$bestParamDetail
##   20%   40%   60%   80% 
##  85.0  99.2 122.0 150.0

Nilai observasi vs prediksi

orig1$pltPredGrid

Ringkasan modeling

summary(orig1$bestModelDetail)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -13.8393   -2.3393   -0.2519    2.1876   12.8473  
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                        32.869      2.003  16.409
## `ns(horsepower, knots = c(85, 99.2, 122, 150))1`  -14.266      1.904  -7.494
## `ns(horsepower, knots = c(85, 99.2, 122, 150))2`  -13.533      2.528  -5.352
## `ns(horsepower, knots = c(85, 99.2, 122, 150))3`  -19.748      1.689 -11.691
## `ns(horsepower, knots = c(85, 99.2, 122, 150))4`  -24.493      4.449  -5.505
## `ns(horsepower, knots = c(85, 99.2, 122, 150))5`  -16.706      1.655 -10.093
##                                                  Pr(>|t|)    
## (Intercept)                                       < 2e-16 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))1` 1.30e-12 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))2` 2.04e-07 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))3`  < 2e-16 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))4` 9.51e-08 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))5`  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 14.29455)
## 
##     Null deviance: 10120.8  on 244  degrees of freedom
## Residual deviance:  3416.4  on 239  degrees of freedom
## AIC: 1354.9
## 
## Number of Fisher Scoring iterations: 2

Origin 2: Eropa

Auto2 <- Auto[Auto$origin == 2, ]
Auto2Knots <- pctKnots(Auto2$horsepower)

set.seed(1234)
orig2 <- generateAllREg(df = Auto2, 
                  orig = "Eropa", 
                  degree = c(2:8), cut = c(2:10), knot = Auto2Knots)

Cross validation

orig2$pltCVGrid

Ringkasan tuning parameter terbaik

orig2$bestSummary
origin method param RMSE
Eropa polynomial 4 5.205988
Eropa piecewise 5 4.891956
Eropa ncubicspline 3 4.860623
orig2$bestModel
origin method param RMSE
3 Eropa ncubicspline 3 4.860623

Tunning Parameter terbaik

orig2$bestParamDetail
##   25%   50%   75% 
## 69.75 76.50 90.00

Nilai observasi vs prediksi

orig2$pltPredGrid

Ringkasan modeling

summary(orig2$bestModelDetail)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -10.4288   -2.7067   -0.6468    2.0588   12.6018  
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                     35.425      2.113  16.769
## `ns(horsepower, knots = c(69.75, 76.5, 90))1`   -6.851      2.532  -2.705
## `ns(horsepower, knots = c(69.75, 76.5, 90))2`  -12.103      3.160  -3.830
## `ns(horsepower, knots = c(69.75, 76.5, 90))3`  -20.101      5.503  -3.652
## `ns(horsepower, knots = c(69.75, 76.5, 90))4`  -15.407      3.897  -3.954
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))1` 0.008763 ** 
## `ns(horsepower, knots = c(69.75, 76.5, 90))2` 0.000298 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))3` 0.000530 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))4` 0.000197 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 24.63829)
## 
##     Null deviance: 2901.0  on 67  degrees of freedom
## Residual deviance: 1552.2  on 63  degrees of freedom
## AIC: 417.67
## 
## Number of Fisher Scoring iterations: 2

Origin 3: Jepang

Auto3 <- Auto[Auto$origin == 3, ]
Auto3Knots <- pctKnots(Auto3$horsepower)

set.seed(1234)
orig3 <- generateAllREg(df = Auto3, 
                  orig = "Jepang", 
                  degree = c(2:8), cut = c(2:10), knot = Auto3Knots)

Cross validation

orig3$pltCVGrid

Ringkasan tuning parameter terbaik

orig3$bestSummary
origin method param RMSE
Jepang polynomial 3 4.013379
Jepang piecewise 9 4.211734
Jepang ncubicspline 3* 4.152535
orig3$bestModel
origin method param RMSE
Jepang polynomial 3 4.013379

Nilai observasi vs prediksi

orig3$pltPredGrid

Ringkasan modeling

summary(orig3$bestModelDetail)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.8602  -2.7115  -0.5224   2.1985  11.6985  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      30.4506     0.4536  67.138  < 2e-16 ***
## `poly(horsepower, degree = 3)1` -36.2030     4.0312  -8.981 1.62e-13 ***
## `poly(horsepower, degree = 3)2`   9.1966     4.0312   2.281   0.0254 *  
## `poly(horsepower, degree = 3)3`  16.6991     4.0312   4.142 8.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 16.25097)
## 
##     Null deviance: 2892.9  on 78  degrees of freedom
## Residual deviance: 1218.8  on 75  degrees of freedom
## AIC: 450.35
## 
## Number of Fisher Scoring iterations: 2

Origin 3: Jepang (Refined)

Meskipun mempunyai RMSE yang paling rendah berdasarkan hasil k-fold cross validation, model terbaik secara visual menunjukkan adanya belokan di sekitar ujung-ujung kurva. Hal ini dirasa kurang masuk akal. Untuk itu perlu dilakukan “revisi” model. Karena ketiga metode menunjukkan pola yang sama, akan dilakukan revisi terhadap model natural cubic spline. Langkah yang dilakukan adalah dengan menghapus knot dan mengubah lokasi (dalam hal ini adalah persentil 10 sampai dengan 90).

newKnots <- as.list(quantile(Auto3$horsepower, seq(0.1, 0.9, by=0.1)))
names(newKnots) <- paste0("1.", letters[1:length(newKnots)])

set.seed(1234)
orig3b <- generateAllREg(df = Auto3, 
                  orig = "Jepang", 
                  degree = c(2:8), cut = c(2:10), knot = newKnots, best = 3)

Cross validation

orig3b$pltCVGrid

Ringkasan tuning parameter terbaik

orig3b$bestSummary
origin method param RMSE
Jepang polynomial 3 4.013379
Jepang piecewise 9 4.211734
Jepang ncubicspline 1.h 4.504423
orig3b$bestModel
origin method param RMSE
3 Jepang ncubicspline 1.h 4.504423

Ringkasan modeling

summary(orig3b$bestModelDetail)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.9311  -2.5978  -0.3679   2.1341  12.3869  
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     38.918      1.250  31.146  < 2e-16 ***
## `ns(horsepower, knots = 96)1`  -25.236      2.968  -8.504 1.19e-12 ***
## `ns(horsepower, knots = 96)2`   -7.348      3.164  -2.322   0.0229 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 19.30245)
## 
##     Null deviance: 2892.9  on 78  degrees of freedom
## Residual deviance: 1467.0  on 76  degrees of freedom
## AIC: 462.99
## 
## Number of Fisher Scoring iterations: 2

Perbandingan model lama dan baru

plt <- orig3$pltPred[[3]]


plt + geom_line(data = orig3b$model[[3]]$predValue, 
                aes(x = horsepower, y = pred, color = "1"), size = 1) +
scale_color_manual(labels = c(sprintf("%s(%s)",orig3$bestSummary[3,2],orig3$bestSummary[3,3]),
                              sprintf("%s(%s)",orig3b$bestModel[1,2],orig3b$bestModel[1,3])), values = c("blue", "red")) +
geom_vline(xintercept = orig3b$bestParamDetail, linetype="dashed", color = "red") +
ggtitle("Japan's Natural Cubic Spline: Old vs New")

Curva menunjukkan bahwa belokan bisa diminimalisasi. Konsekuensinya, RMSE meningkat dari 4.15 menjadi 4.50.

Nilai observasi vs prediksi

orig3b$pltPredGrid

All Models

Dengan demikin diperoleh pemodelan optimal untuk masing-masing negara asal sebagai berikut:

finalModels <- rbind(orig1$bestModel, orig2$bestModel, orig3b$bestModel)
finalModels
origin method param RMSE
3 Amerika ncubicspline 4 3.759049
31 Eropa ncubicspline 3 4.860623
32 Jepang ncubicspline 1.h 4.504423
finalPred <- rbind(orig1$predBestFinal, orig2$predBestFinal, orig3b$predBestFinal)

finalPred <- merge(finalPred, finalModels)
finalPred$originModel <- sprintf('%s: %s(%s)', finalPred$origin, finalPred$method, finalPred$param)


ggplot(finalPred) +
  geom_point(aes(x = horsepower, y = mpg, color = originModel), alpha = 0.8) +
  geom_line(aes(x = horsepower, y = pred, color = originModel), size = 1) +
  scale_color_manual(values = c("black", "red", "blue"))  +
  theme_classic() +
  ggtitle("Modeling Miles per Gallon (MPG) vs Horsepower, by Country of Origin") +
  xlab("Horsepower") + ylab("Miles per Gallon") +
  theme(legend.title = element_blank(), 
        legend.position = "bottom", 
        legend.box = "horizontal")

Kesimpulan

  • Perlu kehati-hatian dalam memilih tuning parameter.
  • Pemodelan terbaik hasil dari cross validation, belum tentu merupakan model yang bisa diterima, sehingga perlu pandangan analis untuk melakukan evaluasi terhadap model-model yang diperoleh.
  • Model non-linier mampu digunakan untuk melakukan prediksi efisiensi bahan bakar menggunakan horsepower.
  • Secara umum, semakin tinggi horsepower, maka kendaraan semakin boros dalam penggunaan bahan bakar (mpg lebih rendah). Hal ini berlaku untuk ketiga negara asal.
  • Berdasarkan plot terakhir, terlihat bahwa untuk horsepower yang sama, kendaraan yang berasal dari Jepang lebih hemat dalam penggunaan bahan bakar, diikuti oleh Eropa dan Amerika.

Daftar Pustaka

Dito, Gerry A. (2021), Regresi Polinomial, Regresi Fungsi Tangga dan Regresi Spline di R, https://gerrydito.github.io/Regresi-Polinomial,-Regresi-Fungsi-Tangga-dan-Regresi-Spline/ (Oct 29, 2021).

Harrell, Frank E., Jr. (2015). Regression Modeling Strategies. Springer Series in Statistics, Springer.

James, G.; Witten, D.; Hastie, T. & Tibshirani, R. (2013), An Introduction to Statistical Learning: with Applications in R, Springer.


  1. Mahasiswa Pascasarjana Statistika dan Sains Data, Institut Pertanian Bogor, NIM: G1501211061, ↩︎