Moving Beyond Linearity

Tugas Kuliah STA581 Sains Data

Nur Andi Setiabudi¹

2021-10-29

Pendahuluan

Tujuan

Melakukan cross validation untuk menghasilkan pemodelan antara tingkat efisiensi penggunaan bahan bakar (direpresentasikan olehmpg atau miles per gallon) versus daya mesin (dalam horsepower) optimal berdasarkan data ISLR::Auto dengan metode:
- Polynomial regression
- Piecewise constant
- Natural cubic splines
Pemodelan dilakukan untuk masing-masing subset berdasarkan negara asal (origin), yaitu 1) Amerika, 2) Eropa, dan 3) Jepang

Output

Mendapatkan model non-linier yang optimal untuk masing-masing negara asal.

Metodologi

Tahapan

Pemodelan dilakukan dengan tahapan sebagai berikut:

Skenario Tuning Parameter

Cross validation

4-fold cross validation dengan meminimumkan RMSE (root mean squared error)
Polynomial regression

Derajat/degree polinomial awal: c(2:10)
Piecewise constant

Jumlah bins/breaks awal: c(2:10)
Natural Cubic Spline

Jumlah dan lokasi knots awal ditentukan perdasarkan quantile:

Jumlah knots	Lokasi knots (quantiles)
1	`.5,`
2	`.33 * c(1:2)`
3	`.25 * c(1:3)`
3*	`c(.10, .5, .90)`
4	`.20 * c(1:4)`
4*	`c(.05, .35, .65, .95)`
5	`.167 * c(1:5)`
5*	`c(.05, .275, .5, .725, .95)`
6	`.143 * c(1:6)`
6*	`c(.05, .23, .41, .59, .77, .95)`
7	`.125 * c(1:7)`
7*	`c(.025, .1833, .3417, .5, .6583, .8167, .975)`

Sumber : Frank E. Harrell, Jr. (2015). Regression Modeling Strategies

Tools

R
splines dan stats untuk pemodelan
caret untuk cross validation
ggplot2 dan cowplot untuk grafik/visualisasi

Kumpulan Fungsi

Untuk memudahkan iterasi, dibuat beberapa fungsi:

Modeling dan Cross Validation

library(splines)
library(caret)

beyondLinear <- function(model, param, df){
  
  dat <- df
  
  #' model <string> : modeling approach, ie. polynomial (polynomial regression),
  #'   piecewise (piecewise constant), ncubicspline (natural cubic spline)
  #' param <list or vector> : list of tuning parameters to be evaluated

  cvResult <- mapply(
      function(param, model){
        
        if (model == "polynomial"){
          f <- bquote(mpg ~ poly(horsepower, degree = .(param)))
        } else if (model == "piecewise") {
          f <- bquote(mpg ~ cut(horsepower, breaks = .(param)))
        } else if (model == "ncubicspline") {
          f <- bquote(mpg ~ ns(horsepower, knots = .(param)))
        } 
  
        trCtl <- trainControl(method='cv', number = 4)
        models <- train(as.formula(f), data = df, trControl = trCtl, method = 'glm')
        
        RMSE <- models$results$RMSE
        fModel <- models$finalModel
        
        return(list(models, RMSE, fModel))
      },  model = model, param = param, USE.NAMES = FALSE, SIMPLIFY = FALSE )
  
  
  if(!is.null(names(param))){
    paramIdx <- names(param)
  } else {
    paramIdx <- param
  }

  cvScore <- data.frame(paramIdx = as.character(unlist(paramIdx)),
                        score = unlist(sapply(cvResult, "[", 2, USE.NAMES = F)))
  cvScore$param <- param
  
  bestPos <- which.min(cvScore$score)[1]
  bestParam <- param[[bestPos]]
  bestScore <- cvResult[[bestPos]][[2]]
  bestModel <- cvResult[[bestPos]][[1]]
  
  predValue <- data.frame(df[c("mpg","horsepower")], 
                    pred = predict(bestModel, df, interval = "prediction"))
  
  return(list(model = model,
              cvScore = cvScore, 
              whichBest = bestPos, 
              bestParamIdx = paramIdx[bestPos],
              bestParam = bestParam, 
              bestScore = bestScore, 
              bestModel = bestModel,
              predValue = predValue))
}

Grafik

library(ggplot2)
library(cowplot)

cvPlot <- function(df, title, xlab){
  scaleFUN <- function(x) as.character(round(x,2))
  df$paramIdx <- factor(df$paramIdx, levels = df$paramIdx)
  
  ggplot(df, aes(x=paramIdx, y=score)) + 
    geom_bar(stat = "identity", width=0.1, fill = "black") +
    coord_cartesian(ylim = c(0.99*min(df$score), NA)) +
    geom_hline(yintercept = min(df$score), linetype="dashed", color = "red") +
    ggtitle(title) +
    xlab(xlab) + ylab("Root Mean Squared Error") +
    theme_classic()
}

predPlot <- function(df, title){
  ggplot(df) +
    geom_point(aes(x = horsepower, y = mpg, fill = "*")) +
    geom_line(aes(x = horsepower, y = pred, color = "*"), size = 1) +
    scale_fill_manual(labels = c("Observed"), values = c("black")) +
    scale_color_manual(labels = c("Predicted"), values = c("blue"))  +
    xlim(40, 235) +
    theme_classic() +
    ggtitle(title) +
    xlab("Horsepower") + ylab("Miles per Gallon") +
    theme(legend.title = element_blank(), 
          legend.position = "bottom", 
          legend.box = "horizontal")
}

predPlotAll <- function(df) {
  ggplot(df) +
  geom_point(aes(x = horsepower, y = mpg, fill = "*")) +
  geom_line(aes(x = horsepower, y = polynomial, color = "a"), size = 1) +
  geom_line(aes(x = horsepower, y = piecewise, color = "b"), size = 1) +
  geom_line(aes(x = horsepower, y = ncubicspline, color = "c"), size = 1) +
  scale_fill_manual(labels = c("Observed"), values = c("black")) +
  scale_color_manual(labels = c("Polynomial", "Piecewise", "Natural Cubic Splines"), 
                     values = c("blue", "red", "green"))  +
  xlim(40, 235) +
  theme_classic() +
  ggtitle("Comparison of Best Predictions") +
  xlab("Horsepower") + ylab("Miles per Gallon") +
  theme(legend.title = element_blank(), 
        legend.position = "bottom", 
        legend.box = "horizontal")
}

Fungsi wrapper

generateAllREg <- function(df, orig, degree, cut, knot, best=NULL){

  methods <- c("polynomial", "piecewise", "ncubicspline")
  method_alias <- c("Polynomial Regression", "Piecewise Constant", "Natural Cubic Splines")
  params <- list(Degree = degree, `N Bins` = cut, `N Knots` = knot)
  
  outModel <- list()
  pltCV <- list()
  pltPred <- list()
  
  for( i in seq(length(methods)) ){
    
    model <- beyondLinear(model = methods[i], param = params[[i]], df = df)
      
    outModel[[i]] <- model
    

    p <- cvPlot(model$cvScore,
           title = method_alias[i],
           xlab =  names(params)[i])
    
    pltCV[[i]] <- p
    
    p <- predPlot(model$predValue, 
             sprintf("%s with %s=%s (RMSE=%s)", 
                     method_alias[i], names(params)[i], 
                     model$bestParamIdx, round(model$bestScore,2)))
    
    if(methods[i] == "ncubicspline"){
      p <- p + geom_vline(xintercept = model$bestParam, linetype="dashed", color = "gray")
    }
  
    pltPred[[i]] <- p
  }
  
  bestSummary <- data.frame(t(sapply(outModel, "[", c("model", "bestParamIdx", "bestScore"))))
  names(bestSummary) <- c("method", "param", "RMSE")
  
  bestSummary <- cbind(origin = orig, bestSummary)
  
  predAllBest <- data.frame(outModel[[1]]$predValue, 
                            outModel[[2]]$predValue$pred, 
                            outModel[[3]]$predValue$pred)
  names(predAllBest) <- c("mpg","horsepower", methods)
  
  pltPredAll <- predPlotAll(predAllBest)

  if(is.null(best)){
    bIdx <- which.min(bestSummary$RMSE)
  } else {
    bIdx <- best
  }
  
  bestModel <- bestSummary[bIdx, ]

  predicted <- outModel[[bIdx]]$predValue
  predicted <- cbind(origin = orig, predicted)
  
  return(list(bestSummary = bestSummary, 
              bestModel = bestModel,
              bestModelDetail = outModel[[bIdx]]$bestModel,
              bestParamDetail = outModel[[bIdx]]$bestParam,
              model = outModel, 
              pltCV = pltCV, 
              pltPred = pltPred,
              pltPredAll = pltPredAll,
              pltCVGrid = plot_grid(pltCV[[1]], pltCV[[2]], pltCV[[3]]),
              pltPredGrid = plot_grid(pltPred[[1]], pltPred[[2]], pltPred[[3]], pltPredAll),
              predBest3 = predAllBest,
              predBestFinal = predicted))

}

Penentuan Knots

pctKnots <- function(x){
  pct <- list(
    .5,
    .33*c(1:2),
    .25*c(1:3),
    c(.10,.5,.90),
    .20*c(1:4),
    c(.05,.35,.65,.95),
    .167*c(1:5),
    c(.05,.275,.5,.725,.95),
    .143*c(1:6),
    c(.05,.23,.41,.59,.77,.95),
    .125*c(1:7),
    c(.025,.1833,.3417,.5,.6583,.8167,.975))
  
  knt <- sapply(pct, function(p, i){round(quantile(i, p),2)}, i = x)
  names(knt) <- c(1,2,3,"3*",4,"4*",5,"5*",6,"6*",7,"7*")
  return(knt)
  
}

Hasil dan Pembahasan

Preview Dataset

library(ISLR)
head(Auto)

mpg	cylinders	displacement	horsepower	weight	acceleration	year	origin	name
18	8	307	130	3504	12.0	70	1	chevrolet chevelle malibu
15	8	350	165	3693	11.5	70	1	buick skylark 320
18	8	318	150	3436	11.0	70	1	plymouth satellite
16	8	304	150	3433	12.0	70	1	amc rebel sst
17	8	302	140	3449	10.5	70	1	ford torino
15	8	429	198	4341	10.0	70	1	ford galaxie 500

summary(Auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

Hubungan `mpg` dan `horsepower`

library(ggplot2)
ggplot(Auto) +
  geom_point(aes(x = horsepower, y = mpg), alpha = 0.8) +
  ggtitle("Miles per Gallon vs Horsepower") +
  xlab("Horsepower") + ylab("Miles per Gallon") +
  theme_classic()

Berdasarkan `origin`

freq <- table(Auto$origin)
data.frame(cbind(freq, prop = prop.table(freq)))

freq	prop
245	0.6250000
68	0.1734694
79	0.2015306

p <- ggplot(Auto) +
  geom_point(aes(x = horsepower, y = mpg, color = as.factor(origin)), alpha = 0.8) +
  scale_color_manual(values = c("black", "red", "blue"),
                     labels = c("American", "European", "Japanese")) +
  ggtitle("Miles per Gallon vs Horsepower, by Origin") +
  xlab("Horsepower") + ylab("Miles per Gallon") +
  theme_classic() +
  theme(legend.title = element_blank(), 
        legend.position = "bottom", 
        legend.box = "horizontal")
p

p + facet_grid(origin ~ .)

Berdasarkan visualisasi di atas terlihat bahwa hubungan mpg dan horsepower tidak bersifat linear. Untuk itu diperlukan metode non-linear untuk mendapatkan pemodelan optimal.

Semua Negara

AutoKnots <- pctKnots(Auto$horsepower)

set.seed(1234)
orig0 <- generateAllREg(df = Auto, 
                  orig = "Semua Negara", 
                  degree = c(2:10), cut = c(2:10), knot = AutoKnots)

Cross validation

orig0$pltCVGrid

Ringkasan tuning parameter terbaik

orig0$bestSummary

origin	method	param	RMSE
Semua Negara	polynomial	9	4.313204
Semua Negara	piecewise	8	4.453155
Semua Negara	ncubicspline	4*	4.306619

orig0$bestModel

	origin	method	param	RMSE
3	Semua Negara	ncubicspline	4*	4.306619

Tunning Parameter terbaik

orig0$bestParamDetail

##     5%    35%    65%    95% 
##  60.55  85.00 105.00 180.00

Nilai observasi vs prediksi

orig0$pltPredGrid

Model terbaik yang diperoleh dari cross-validation adalah natural cubic spline dengan 4 knots.

Ringkasan modeling

summary(orig0$bestModelDetail)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -15.7924   -2.6455   -0.2029    2.2116   15.4144  
## 
## Coefficients:
##                                                   Estimate Std. Error t value
## (Intercept)                                         33.789      1.570  21.516
## `ns(horsepower, knots = c(60.55, 85, 105, 180))1`   -9.327      1.597  -5.840
## `ns(horsepower, knots = c(60.55, 85, 105, 180))2`  -15.555      2.140  -7.268
## `ns(horsepower, knots = c(60.55, 85, 105, 180))3`  -22.579      1.904 -11.860
## `ns(horsepower, knots = c(60.55, 85, 105, 180))4`  -19.168      3.508  -5.465
## `ns(horsepower, knots = c(60.55, 85, 105, 180))5`  -21.579      1.798 -12.004
##                                                   Pr(>|t|)    
## (Intercept)                                        < 2e-16 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))1` 1.11e-08 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))2` 2.04e-12 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))3`  < 2e-16 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))4` 8.33e-08 ***
## `ns(horsepower, knots = c(60.55, 85, 105, 180))5`  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 18.44132)
## 
##     Null deviance: 23819.0  on 391  degrees of freedom
## Residual deviance:  7118.4  on 386  degrees of freedom
## AIC: 2262.9
## 
## Number of Fisher Scoring iterations: 2

Origin 1: Amerika

Auto1 <- Auto[Auto$origin == 1, ]
Auto1Knots <- pctKnots(Auto1$horsepower)

set.seed(1234)
orig1 <- generateAllREg(df = Auto1, 
                  orig = "Amerika", 
                  degree = c(2:10), cut = c(2:10), knot = Auto1Knots)

Cross validation

orig1$pltCVGrid

Ringkasan tuning parameter terbaik

orig1$bestSummary

origin	method	param	RMSE
Amerika	polynomial	2	3.76728
Amerika	piecewise	9	3.794538
Amerika	ncubicspline	4	3.759049

orig1$bestModel

	origin	method	param	RMSE
3	Amerika	ncubicspline	4	3.759049

Tunning Parameter terbaik

orig1$bestParamDetail

##   20%   40%   60%   80% 
##  85.0  99.2 122.0 150.0

Nilai observasi vs prediksi

orig1$pltPredGrid

Ringkasan modeling

summary(orig1$bestModelDetail)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -13.8393   -2.3393   -0.2519    2.1876   12.8473  
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                        32.869      2.003  16.409
## `ns(horsepower, knots = c(85, 99.2, 122, 150))1`  -14.266      1.904  -7.494
## `ns(horsepower, knots = c(85, 99.2, 122, 150))2`  -13.533      2.528  -5.352
## `ns(horsepower, knots = c(85, 99.2, 122, 150))3`  -19.748      1.689 -11.691
## `ns(horsepower, knots = c(85, 99.2, 122, 150))4`  -24.493      4.449  -5.505
## `ns(horsepower, knots = c(85, 99.2, 122, 150))5`  -16.706      1.655 -10.093
##                                                  Pr(>|t|)    
## (Intercept)                                       < 2e-16 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))1` 1.30e-12 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))2` 2.04e-07 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))3`  < 2e-16 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))4` 9.51e-08 ***
## `ns(horsepower, knots = c(85, 99.2, 122, 150))5`  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 14.29455)
## 
##     Null deviance: 10120.8  on 244  degrees of freedom
## Residual deviance:  3416.4  on 239  degrees of freedom
## AIC: 1354.9
## 
## Number of Fisher Scoring iterations: 2

Origin 2: Eropa

Auto2 <- Auto[Auto$origin == 2, ]
Auto2Knots <- pctKnots(Auto2$horsepower)

set.seed(1234)
orig2 <- generateAllREg(df = Auto2, 
                  orig = "Eropa", 
                  degree = c(2:8), cut = c(2:10), knot = Auto2Knots)

Cross validation

orig2$pltCVGrid

Ringkasan tuning parameter terbaik

orig2$bestSummary

origin	method	param	RMSE
Eropa	polynomial	4	5.205988
Eropa	piecewise	5	4.891956
Eropa	ncubicspline	3	4.860623

orig2$bestModel

	origin	method	param	RMSE
3	Eropa	ncubicspline	3	4.860623

Tunning Parameter terbaik

orig2$bestParamDetail

##   25%   50%   75% 
## 69.75 76.50 90.00

Nilai observasi vs prediksi

orig2$pltPredGrid

Ringkasan modeling

summary(orig2$bestModelDetail)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -10.4288   -2.7067   -0.6468    2.0588   12.6018  
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                     35.425      2.113  16.769
## `ns(horsepower, knots = c(69.75, 76.5, 90))1`   -6.851      2.532  -2.705
## `ns(horsepower, knots = c(69.75, 76.5, 90))2`  -12.103      3.160  -3.830
## `ns(horsepower, knots = c(69.75, 76.5, 90))3`  -20.101      5.503  -3.652
## `ns(horsepower, knots = c(69.75, 76.5, 90))4`  -15.407      3.897  -3.954
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))1` 0.008763 ** 
## `ns(horsepower, knots = c(69.75, 76.5, 90))2` 0.000298 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))3` 0.000530 ***
## `ns(horsepower, knots = c(69.75, 76.5, 90))4` 0.000197 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 24.63829)
## 
##     Null deviance: 2901.0  on 67  degrees of freedom
## Residual deviance: 1552.2  on 63  degrees of freedom
## AIC: 417.67
## 
## Number of Fisher Scoring iterations: 2

Origin 3: Jepang

Auto3 <- Auto[Auto$origin == 3, ]
Auto3Knots <- pctKnots(Auto3$horsepower)

set.seed(1234)
orig3 <- generateAllREg(df = Auto3, 
                  orig = "Jepang", 
                  degree = c(2:8), cut = c(2:10), knot = Auto3Knots)

Cross validation

orig3$pltCVGrid

Ringkasan tuning parameter terbaik

orig3$bestSummary

origin	method	param	RMSE
Jepang	polynomial	3	4.013379
Jepang	piecewise	9	4.211734
Jepang	ncubicspline	3*	4.152535

orig3$bestModel

origin	method	param	RMSE
Jepang	polynomial	3	4.013379

Nilai observasi vs prediksi

orig3$pltPredGrid

Ringkasan modeling

summary(orig3$bestModelDetail)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.8602  -2.7115  -0.5224   2.1985  11.6985  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      30.4506     0.4536  67.138  < 2e-16 ***
## `poly(horsepower, degree = 3)1` -36.2030     4.0312  -8.981 1.62e-13 ***
## `poly(horsepower, degree = 3)2`   9.1966     4.0312   2.281   0.0254 *  
## `poly(horsepower, degree = 3)3`  16.6991     4.0312   4.142 8.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 16.25097)
## 
##     Null deviance: 2892.9  on 78  degrees of freedom
## Residual deviance: 1218.8  on 75  degrees of freedom
## AIC: 450.35
## 
## Number of Fisher Scoring iterations: 2

Origin 3: Jepang (Refined)

Meskipun mempunyai RMSE yang paling rendah berdasarkan hasil k-fold cross validation, model terbaik secara visual menunjukkan adanya belokan di sekitar ujung-ujung kurva. Hal ini dirasa kurang masuk akal. Untuk itu perlu dilakukan “revisi” model. Karena ketiga metode menunjukkan pola yang sama, akan dilakukan revisi terhadap model natural cubic spline. Langkah yang dilakukan adalah dengan menghapus knot dan mengubah lokasi (dalam hal ini adalah persentil 10 sampai dengan 90).

newKnots <- as.list(quantile(Auto3$horsepower, seq(0.1, 0.9, by=0.1)))
names(newKnots) <- paste0("1.", letters[1:length(newKnots)])

set.seed(1234)
orig3b <- generateAllREg(df = Auto3, 
                  orig = "Jepang", 
                  degree = c(2:8), cut = c(2:10), knot = newKnots, best = 3)

Cross validation

orig3b$pltCVGrid

Ringkasan tuning parameter terbaik

orig3b$bestSummary

origin	method	param	RMSE
Jepang	polynomial	3	4.013379
Jepang	piecewise	9	4.211734
Jepang	ncubicspline	1.h	4.504423

orig3b$bestModel

	origin	method	param	RMSE
3	Jepang	ncubicspline	1.h	4.504423

Ringkasan modeling

summary(orig3b$bestModelDetail)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.9311  -2.5978  -0.3679   2.1341  12.3869  
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     38.918      1.250  31.146  < 2e-16 ***
## `ns(horsepower, knots = 96)1`  -25.236      2.968  -8.504 1.19e-12 ***
## `ns(horsepower, knots = 96)2`   -7.348      3.164  -2.322   0.0229 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 19.30245)
## 
##     Null deviance: 2892.9  on 78  degrees of freedom
## Residual deviance: 1467.0  on 76  degrees of freedom
## AIC: 462.99
## 
## Number of Fisher Scoring iterations: 2

Perbandingan model lama dan baru

plt <- orig3$pltPred[[3]]


plt + geom_line(data = orig3b$model[[3]]$predValue, 
                aes(x = horsepower, y = pred, color = "1"), size = 1) +
scale_color_manual(labels = c(sprintf("%s(%s)",orig3$bestSummary[3,2],orig3$bestSummary[3,3]),
                              sprintf("%s(%s)",orig3b$bestModel[1,2],orig3b$bestModel[1,3])), values = c("blue", "red")) +
geom_vline(xintercept = orig3b$bestParamDetail, linetype="dashed", color = "red") +
ggtitle("Japan's Natural Cubic Spline: Old vs New")

Curva menunjukkan bahwa belokan bisa diminimalisasi. Konsekuensinya, RMSE meningkat dari 4.15 menjadi 4.50.

Nilai observasi vs prediksi

orig3b$pltPredGrid

All Models

Dengan demikin diperoleh pemodelan optimal untuk masing-masing negara asal sebagai berikut:

finalModels <- rbind(orig1$bestModel, orig2$bestModel, orig3b$bestModel)
finalModels

	origin	method	param	RMSE
3	Amerika	ncubicspline	4	3.759049
31	Eropa	ncubicspline	3	4.860623
32	Jepang	ncubicspline	1.h	4.504423

finalPred <- rbind(orig1$predBestFinal, orig2$predBestFinal, orig3b$predBestFinal)

finalPred <- merge(finalPred, finalModels)
finalPred$originModel <- sprintf('%s: %s(%s)', finalPred$origin, finalPred$method, finalPred$param)


ggplot(finalPred) +
  geom_point(aes(x = horsepower, y = mpg, color = originModel), alpha = 0.8) +
  geom_line(aes(x = horsepower, y = pred, color = originModel), size = 1) +
  scale_color_manual(values = c("black", "red", "blue"))  +
  theme_classic() +
  ggtitle("Modeling Miles per Gallon (MPG) vs Horsepower, by Country of Origin") +
  xlab("Horsepower") + ylab("Miles per Gallon") +
  theme(legend.title = element_blank(), 
        legend.position = "bottom", 
        legend.box = "horizontal")

Kesimpulan

Perlu kehati-hatian dalam memilih tuning parameter.
Pemodelan terbaik hasil dari cross validation, belum tentu merupakan model yang bisa diterima, sehingga perlu pandangan analis untuk melakukan evaluasi terhadap model-model yang diperoleh.
Model non-linier mampu digunakan untuk melakukan prediksi efisiensi bahan bakar menggunakan horsepower.
Secara umum, semakin tinggi horsepower, maka kendaraan semakin boros dalam penggunaan bahan bakar (mpg lebih rendah). Hal ini berlaku untuk ketiga negara asal.
Berdasarkan plot terakhir, terlihat bahwa untuk horsepower yang sama, kendaraan yang berasal dari Jepang lebih hemat dalam penggunaan bahan bakar, diikuti oleh Eropa dan Amerika.

Daftar Pustaka

Dito, Gerry A. (2021), Regresi Polinomial, Regresi Fungsi Tangga dan Regresi Spline di R, https://gerrydito.github.io/Regresi-Polinomial,-Regresi-Fungsi-Tangga-dan-Regresi-Spline/ (Oct 29, 2021).

Harrell, Frank E., Jr. (2015). Regression Modeling Strategies. Springer Series in Statistics, Springer.

James, G.; Witten, D.; Hastie, T. & Tibshirani, R. (2013), An Introduction to Statistical Learning: with Applications in R, Springer.

Mahasiswa Pascasarjana Statistika dan Sains Data, Institut Pertanian Bogor, NIM: G1501211061, nur.andi@apps.ipb.ac.id ↩︎

Moving Beyond Linearity

Tugas Kuliah STA581 Sains Data

Pendahuluan

Tujuan

Output

Metodologi

Tahapan

Skenario Tuning Parameter

Tools

Kumpulan Fungsi

Modeling dan Cross Validation

Grafik

Fungsi wrapper

Penentuan Knots

Hasil dan Pembahasan

Preview Dataset

Hubungan mpg dan horsepower

Berdasarkan origin

Semua Negara

Cross validation

Ringkasan tuning parameter terbaik

Nilai observasi vs prediksi

Ringkasan modeling

Origin 1: Amerika

Cross validation

Ringkasan tuning parameter terbaik

Nilai observasi vs prediksi

Ringkasan modeling

Origin 2: Eropa

Cross validation

Ringkasan tuning parameter terbaik

Nilai observasi vs prediksi

Ringkasan modeling

Origin 3: Jepang

Cross validation

Ringkasan tuning parameter terbaik

Nilai observasi vs prediksi

Ringkasan modeling

Origin 3: Jepang (Refined)

Cross validation

Ringkasan tuning parameter terbaik

Ringkasan modeling

Perbandingan model lama dan baru

Nilai observasi vs prediksi

All Models

Kesimpulan

Daftar Pustaka

Hubungan `mpg` dan `horsepower`

Berdasarkan `origin`