Data Science

Practicum Assignment 8

Import Libraries

The data for this assignment can be obtained from the ISLR package. Please install the package first.

library(MultiKink)
library(mlr3measures)
library(OneR)
library(fastDummies)
library(caret)
library(e1071)
library(doSNOW)
library(ipred)
library(xgboost)
library(Lahman)
library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(readxl)
library(corrplot)
library(knitr)
library(smbinning) 
library(ISLR)
library(splines)
library(corrplot)
library(boot)
library(MLmetrics)
library(caTools)
library(rpart)
library(rsample)

Import Data

This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University, and was used in the 1983 American Statistical Association Exposition. The original dataset has 397 observations, of which 5 have missing values for the variable "horsepower"; those rows have been removed.

A data frame with 392 observations on the following 9 variables:

  • mpg : miles per gallon
  • cylinders : number of cylinders (between 4 and 8)
  • displacement : engine displacement (cu. inches)
  • horsepower : engine horsepower
  • weight : vehicle weight (lbs.)
  • acceleration : time to accelerate from 0 to 60 mph (sec.)
  • year : model year (modulo 100)
  • origin : origin of car (1. American, 2. European, 3. Japanese)
  • name : vehicle name
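
The cleaned data frame ships with the ISLR package. A minimal check (a sketch, assuming ISLR is installed) that the rows with missing horsepower have already been removed:

data(Auto, package = "ISLR")
dim(Auto)                     # 392 rows, 9 columns
sum(is.na(Auto$horsepower))   # 0 -- the 5 incomplete rows were dropped by the package authors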

attach(Auto)
df <- as.data.frame(cbind(mpg,cylinders,displacement,horsepower,weight,acceleration))
df
   mpg cylinders displacement horsepower weight acceleration
1   18         8          307        130   3504         12.0
2   15         8          350        165   3693         11.5
3   18         8          318        150   3436         11.0
4   16         8          304        150   3433         12.0
5   17         8          302        140   3449         10.5
6   15         8          429        198   4341         10.0
7   14         8          454        220   4354          9.0
8   14         8          440        215   4312          8.5
9   14         8          455        225   4425         10.0
10  15         8          390        190   3850          8.5
11  15         8          383        170   3563         10.0
12  14         8          340        160   3609          8.0
 [ reached 'max' / getOption("max.print") -- omitted 380 rows ]

Task A

Is there evidence of a non-linear relationship for the variables you selected? Create visualizations to support your answer.

  • Scatter plot MPG vs Displacement

    df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration,year))
    plot(df$displacement, df$mpg,  pch=19, col = "blue", xlab = "displacement", ylab = "mile per gallon")
    title("Plot of mpg vs displacement", cex.sub = 0.8)
    abline(lm(mpg ~ displacement))

    # correlation between displacement and MPG
    cat("Correlation coefficient (r) =",cor(df$displacement, df$mpg, method = "pearson"))
    Correlation coefficient (r) = -0.8051269
    cor.test(df$displacement,df$mpg)
    
        Pearson's product-moment correlation
    
    data:  df$displacement and df$mpg
    t = -26.808, df = 390, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.8373959 -0.7672653
    sample estimates:
           cor 
    -0.8051269 

    Based on the pattern in the scatterplot, the relationship between displacement and miles per gallon appears non-linear: around a displacement of roughly 420 the downward trend stops and the points tend to rise again. The Pearson correlation between displacement and miles per gallon is about -0.80, meaning that as displacement increases, miles per gallon tends to decrease; so although the correlation indicates a strong monotone association, the scatterplot shows that the relationship is not linear.

  • Scatter plot MPG vs Horsepower

    df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration,year))
    plot(df$horsepower, df$mpg,  pch=19, col = "black", xlab = "Horse Power", ylab = "mile per gallon")
    title("Plot of mpg vs Horse Power", cex.sub = 0.8)
    abline(lm(mpg ~ horsepower))

    # correlation between horsepower and MPG
    cat("Correlation coefficient (r) =",cor(df$horsepower, df$mpg, method = "pearson"))
    Correlation coefficient (r) = -0.7784268
    cor.test(df$horsepower,df$mpg)
    
        Pearson's product-moment correlation
    
    data:  df$horsepower and df$mpg
    t = -24.489, df = 390, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.8146631 -0.7361359
    sample estimates:
           cor 
    -0.7784268 

    Based on the pattern in the scatterplot, the relationship between horsepower and miles per gallon appears non-linear: around a horsepower of roughly 200 the downward trend stops and the points tend to rise again. The Pearson correlation between horsepower and miles per gallon is about -0.77, meaning that as horsepower increases, miles per gallon tends to decrease; so although the correlation indicates a strong monotone association, the scatterplot shows that the relationship is not linear.

  • Scatter plot MPG vs weight

    df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration,year))
    plot(df$weight, df$mpg,  pch=19, col = "coral", xlab = "weight", ylab = "mile per gallon")
    title("Plot of mpg vs weight", cex.sub = 0.8)
    abline(lm(mpg ~ weight))

    # correlation between weight and MPG
    cat("Correlation coefficient (r) =",cor(df$weight, df$mpg, method = "pearson"))
    Correlation coefficient (r) = -0.8322442
    cor.test(df$weight,df$mpg)
    
        Pearson's product-moment correlation
    
    data:  df$weight and df$mpg
    t = -29.645, df = 390, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.8603702 -0.7990647
    sample estimates:
           cor 
    -0.8322442 

    Based on the pattern in the scatterplot, the relationship between weight and miles per gallon appears non-linear: around a weight of roughly 5000 the downward trend stops and the points tend to rise again. The Pearson correlation between weight and miles per gallon is about -0.83, meaning that as weight increases, miles per gallon tends to decrease; so although the correlation indicates a strong monotone association, the scatterplot shows that the relationship is not linear.

  • Scatter plot MPG vs acceleration

    df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration,year))
    plot(df$acceleration, df$mpg,  pch=19, col = "grey", xlab = "acceleration", ylab = "mile per gallon")
    title("Plot of mpg vs acceleration", cex.sub = 0.8)
    abline(lm(mpg ~ acceleration))

    # correlation between acceleration and MPG
    cat("Correlation coefficient (r) =",cor(df$acceleration, df$mpg, method = "pearson"))
    Correlation coefficient (r) = 0.4233285
    cor.test(df$acceleration,df$mpg)
    
        Pearson's product-moment correlation
    
    data:  df$acceleration and df$mpg
    t = 9.2277, df = 390, p-value < 2.2e-16
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.3384724 0.5013550
    sample estimates:
          cor 
    0.4233285 

    Based on the pattern in the scatterplot, the relationship between acceleration and miles per gallon looks roughly linear: mpg tends to be larger when acceleration is larger. This is supported by the Pearson correlation between acceleration and miles per gallon of about 0.42, which means that as acceleration increases, miles per gallon tends to increase as well. A more formal check of the curvature seen for the other predictors, comparing a straight-line fit with a quadratic fit, is sketched right after this list.
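
As a more formal check of the curvature noted above (a sketch, not part of the original output), the straight-line fit can be compared with a quadratic fit using an F-test; a small p-value means the quadratic term adds real explanatory power, i.e. evidence of non-linearity. The same comparison can be repeated for horsepower and weight.

fit_lin  <- lm(mpg ~ displacement, data = Auto)           # straight line
fit_quad <- lm(mpg ~ poly(displacement, 2), data = Auto)  # adds a quadratic term
anova(fit_lin, fit_quad)                                  # significant F-test -> non-linearity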

Task B

Determine the best non-linear model for each pair of variables (visually and/or empirically). Based on the scatter-plot analysis in Task A, the predictors chosen for non-linear modelling of mpg are displacement, horsepower, and weight. The following non-linear modelling steps are carried out for each predictor.

  1. Split the data into train and test sets with a 70:30 or 80:20 proportion; here a 70:30 split is used.

  2. Find the optimum parameter via K-fold cross-validation on the training set from step 1. For each candidate parameter (degree or number of knots) the smallest weighted score is chosen, where the weighted score is the mean of RMSE, MAE, and MAPE (see the sketch after this list).

  3. Refit the model with the selected parameter, using the training set for fitting and the test set for evaluation.

  4. Choose the best model based on the smallest weighted score.
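
For clarity, here is a minimal sketch of the weighted score used in steps 2 and 4 (weighted_score is only an illustrative helper, not used elsewhere in the report; it assumes the RMSE, MAE, and MAPE functions from the MLmetrics package loaded above):

# weighted score = simple average of RMSE, MAE and MAPE for one prediction/actual pair
weighted_score <- function(pred, actual) {
  mean(c(RMSE(pred, actual), MAE(pred, actual), MAPE(pred, actual)))
}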

B.1 Displacement vs MPG

B.1.1 Polynomial Regression

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
pol_reg_cv <- function(df,pcg=0.75,K=10,D=5,dep_col=FALSE,columns){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  degree <- matrix(nrow = D*K, ncol = 1)
  CV <- matrix(nrow = D*K, ncol = 1)
  RMSES <- matrix(nrow = D*K, ncol = 1)
  MAES <- matrix(nrow = D*K, ncol = 1)
  MAPES <- matrix(nrow = D*K, ncol = 1)
  i <- 0
  for (d in 1:D) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      polinom_train <- lm(mpg ~ poly(displacement,d), data = training)
      prediksi <- predict(polinom_train, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      degree[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(degree,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(degree) %>% summarise(RMSE_score = mean(RMSES),
 MAE_score = mean(MAES),
 MAPE_score = mean(MAPES),
 WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
D <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- pol_reg_cv(df = df, pcg = pcg, K = K, D = D, dep_col = dep_col, columns = columns)
model_pol_all$df_pol
   degree  CV    RMSES     MAES     MAPES WEIGHTED
1       1 CV1 4.420065 3.419909 0.1499310 2.663302
2       1 CV2 4.335985 3.285711 0.1441593 2.588619
3       1 CV3 4.538141 3.259594 0.1268758 2.641537
4       1 CV4 4.597318 3.710270 0.1737371 2.827108
5       1 CV5 4.960205 3.493081 0.1525632 2.868617
6       2 CV1 4.138268 3.055096 0.1309501 2.441438
7       2 CV2 3.977767 2.951997 0.1254357 2.351733
8       2 CV3 4.338138 3.129038 0.1189623 2.528713
9       2 CV4 4.286909 3.257085 0.1458972 2.563297
10      2 CV5 4.838493 3.261941 0.1339528 2.744796
11      3 CV1 4.111953 3.022902 0.1295969 2.421484
12      3 CV2 3.949769 2.928616 0.1245307 2.334305
 [ reached 'max' / getOption("max.print") -- omitted 38 rows ]
model_pol_all$df_pol_agg
   degree RMSE_score MAE_score MAPE_score WEIGHTED_score
9       9   4.207214  3.142020  0.1319057       2.493713
6       6   4.241103  3.126389  0.1307972       2.499430
10     10   4.206002  3.167422  0.1334887       2.502304
8       8   4.229312  3.158230  0.1325185       2.506687
7       7   4.239218  3.173912  0.1335556       2.515562
4       4   4.310350  3.126605  0.1310293       2.522661
5       5   4.304216  3.137739  0.1321408       2.524699
3       3   4.305306  3.138277  0.1320104       2.525198
2       2   4.315915  3.131031  0.1310396       2.525995
1       1   4.570343  3.433713  0.1494533       2.717836

The optimum parameter (degree) for the polynomial regression model is degree 9, based on the smallest weighted score. The model is then refitted as a degree-9 polynomial on the training set and evaluated on the test set to check whether the non-linear model is overfitting or underfitting.

degree <- 9
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
df_train <- dfrand[model_pol_all$train_ind,]
df_test <- dfrand[model_pol_all$test_ind,]
polinom_train = lm(mpg ~ poly(displacement,degree), data = df_train)
#train
prediksi <- predict(polinom_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$displacement, df_train$mpg, pch=19, col="coral",xlab='displacement', ylab='mpg',main="Polynomial Regression, Degree 9")
ix <- sort(df_train$displacement,index.return=T)$ix
lines(df_train[ix,'displacement'], prediksi[ix], main = 'polynomial regression', col="blue",type="l", lwd=2)

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(polinom_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$displacement, df_test$mpg, pch=19, col="coral",xlab='displacement', ylab='mpg',main="Polynomial Regression, Degree 9")
ix <- sort(df_test$displacement,index.return=T)$ix
lines(df_test[ix,'displacement'], prediksi[ix], main = 'polynomial regression', col="blue",type="l", lwd=2)

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.119046 3.036755 0.1267458 2.427516
2  testing 4.298119 3.304675 0.1475408 2.583445

B.1.2 Step Functions

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
piecewise_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      id = seq(1, length(unique(training$displacement)), by = round(length(unique(training$displacement))/(d+1)))
      breaks = sort(unique(training$displacement))[id]
      breaks=breaks[-c(1,length(breaks))]
      breaks[length(breaks)+1] <- 2*max(training$displacement)
      breaks[length(breaks)+1] <- (-2)*min(training$displacement)
      labels=paste('gr',seq_len(length(breaks)-1),sep='')
      training$gr <- cut(training$displacement,breaks=breaks,labels=labels, include.lowest=TRUE)
      validation$gr <- cut(validation$displacement,breaks=breaks,labels=labels, include.lowest=TRUE)
      sf_train <- lm(mpg ~ gr, data = training)
      prediksi <- predict(sf_train, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                        MAE_score = mean(MAES),
                                                                        MAPE_score = mean(MAPES),
                                                                        WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'displacement'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- piecewise_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1      2 CV1 5.772823 4.270018 0.1841975 3.409013
2      2 CV2 5.497051 4.502477 0.2188102 3.406113
3      2 CV3 5.559409 4.273895 0.1785762 3.337294
4      2 CV4 4.236006 3.298433 0.1588730 2.564437
5      2 CV5 5.233705 3.934705 0.1887980 3.119069
6      3 CV1 4.639002 3.523816 0.1479911 2.770270
7      3 CV2 4.267873 3.375280 0.1505377 2.597897
8      3 CV3 4.898393 3.872457 0.1574689 2.976106
9      3 CV4 4.678056 3.737885 0.1768993 2.864280
10     3 CV5 5.467206 4.469303 0.2218816 3.386130
11     4 CV1 4.328386 3.214958 0.1378119 2.560385
12     4 CV2 4.673731 3.766893 0.1723247 2.870983
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
5     6   4.303624  3.145174  0.1329524       2.527250
9    10   4.366322  3.190972  0.1342283       2.563841
8     9   4.388511  3.215215  0.1354847       2.579737
6     7   4.412208  3.292275  0.1410216       2.615168
7     8   4.499340  3.316479  0.1396854       2.651835
4     5   4.534251  3.566113  0.1579259       2.752763
3     4   4.703221  3.660230  0.1631372       2.842196
2     3   4.790106  3.795748  0.1709557       2.918937
1     2   5.259799  4.055906  0.1858510       3.167185

The optimum parameter (knots) for the step-function model is 6, based on the smallest weighted score. The model is then refitted as a step function with 6 break points (7 intervals) on the training set and evaluated on the test set to check whether the non-linear model is overfitting or underfitting.

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'displacement'
xlab <- "displacement"
ylab <- "mile per gallon"
model_type <- "step function"
Knots <- 6
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
id = seq(1, length(unique(df_train$displacement)), by = round(length(unique(df_train$displacement))/(Knots+1)))
breaks = sort(unique(df_train$displacement))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(df_train$displacement)
breaks[length(breaks)+1] <- (-2)*min(df_train$displacement)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
df_train$gr <- cut(df_train$displacement,breaks=breaks,labels=labels, include.lowest=TRUE)
df_test$gr <- cut(df_test$displacement,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = df_train)
#train
prediksi <- predict(sf_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$displacement, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function knots 6")
ix <- sort(df_train$displacement,index.return=T)$ix
lines(df_train[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(sf_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$displacement, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function knots 6")
ix <- sort(df_test$displacement,index.return=T)$ix
lines(df_test[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.400215 3.256056 0.1408742 2.599049
2  testing 4.418019 3.284186 0.1460048 2.616070

B.1.3 Basis Splines (B-splines)

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
bs_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      regspline = glm( mpg ~ bs(displacement, df = d), data=training)
      prediksi <- predict(regspline, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                       MAE_score = mean(MAES),
                                                                       MAPE_score = mean(MAPES),
                                                                       WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'displacement'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- bs_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1      2 CV1 4.111953 3.022902 0.1295969 2.421484
2      2 CV2 3.949769 2.928616 0.1245307 2.334305
3      2 CV3 4.346956 3.130042 0.1181625 2.531720
4      2 CV4 4.287805 3.270083 0.1476468 2.568512
5      2 CV5 4.830048 3.339743 0.1401152 2.769969
6      3 CV1 4.111953 3.022902 0.1295969 2.421484
7      3 CV2 3.949769 2.928616 0.1245307 2.334305
8      3 CV3 4.346956 3.130042 0.1181625 2.531720
9      3 CV4 4.287805 3.270083 0.1476468 2.568512
10     3 CV5 4.830048 3.339743 0.1401152 2.769969
11     4 CV1 4.128312 3.010619 0.1291213 2.422684
12     4 CV2 4.002056 2.966462 0.1262836 2.364934
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
6     7   4.177243  3.097753  0.1293515       2.468116
5     6   4.190485  3.118070  0.1302337       2.479596
4     5   4.215277  3.115346  0.1300918       2.486905
7     8   4.230068  3.146471  0.1313725       2.502637
8     9   4.260037  3.178469  0.1326127       2.523706
1     2   4.305306  3.138277  0.1320104       2.525198
2     3   4.305306  3.138277  0.1320104       2.525198
9    10   4.297378  3.172256  0.1320087       2.533881
3     4   4.324482  3.157979  0.1332974       2.538586

The optimum parameter for the basis-spline model is df = 7, based on the smallest weighted score (for a cubic B-spline, df = 7 corresponds to 7 − 3 = 4 interior knots). The model is then refitted with bs(displacement, df = 7) on the training set and evaluated on the test set to check whether the non-linear model is overfitting or underfitting.
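
For reference (a sketch, not part of the original output), the interior knots that bs() places for a given df can be inspected directly; with the default cubic degree, df = 7 yields 7 − 3 = 4 interior knots located at quantiles of the predictor:

# interior knot locations of a cubic B-spline basis with df = 7
attr(bs(Auto$displacement, df = 7), "knots")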

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)

xcols <- 'displacement'
xlab <- "displacement"
ylab <- "mile per gallon"
model_type <- "natural spline"
Knots <- 7
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ bs(displacement, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$displacement, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="basis spline, df = 7")
ix <- sort(df_train$displacement,index.return=T)$ix
lines(df_train[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(bs(df_train$displacement, df = Knots),"knots"), col="grey50", lty=2)  # interior knots of the fitted B-spline

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$displacement, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="basis spline, df = 7")
ix <- sort(df_test$displacement,index.return=T)$ix
lines(df_test[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(bs(df_train$displacement, df = Knots),"knots"), col="grey50", lty=2)  # same knots as the training fit

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.090548 3.017533 0.1260246 2.411369
2  testing 4.243822 3.286924 0.1461223 2.558956

B.1.4 Natural Splines

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
ns_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      regspline = glm( mpg ~ ns(displacement, df = d), data=training)
      prediksi <- predict(regspline, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                       MAE_score = mean(MAES),
                                                                       MAPE_score = mean(MAPES),
                                                                       WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'displacement'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- ns_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1      2 CV1 4.120023 3.020128 0.1293868 2.423179
2      2 CV2 3.957653 2.938173 0.1249792 2.340268
3      2 CV3 4.330537 3.127524 0.1185001 2.525520
4      2 CV4 4.271793 3.249318 0.1458685 2.555660
5      2 CV5 4.834369 3.331811 0.1392585 2.768480
6      3 CV1 4.095964 3.015267 0.1294412 2.413558
7      3 CV2 3.926923 2.902464 0.1234775 2.317621
8      3 CV3 4.360187 3.123338 0.1175143 2.533680
9      3 CV4 4.274105 3.268234 0.1478315 2.563390
10     3 CV5 4.814877 3.330390 0.1396040 2.761623
11     4 CV1 4.098069 2.993124 0.1284785 2.406557
12     4 CV2 4.005912 2.986635 0.1275349 2.373360
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
4     5   4.253471  3.136378  0.1313241       2.507058
9    10   4.235413  3.163258  0.1325501       2.510407
2     3   4.294411  3.127938  0.1315737       2.517974
1     2   4.302875  3.133391  0.1315986       2.522621
8     9   4.262170  3.181392  0.1326983       2.525420
5     6   4.289218  3.171906  0.1327490       2.531291
3     4   4.308491  3.154456  0.1331978       2.532048
6     7   4.290514  3.178776  0.1326212       2.533970
7     8   4.296379  3.174654  0.1320417       2.534358

The optimum parameter for the natural-spline model is df = 5, based on the smallest weighted score (a natural spline with df = 5 has 4 interior knots). The model is then refitted with ns(displacement, df = 5) on the training set and evaluated on the test set to check whether the non-linear model is overfitting or underfitting.

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)

xcols <- 'displacement'
xlab <- "displacement"
ylab <- "mile per gallon"
model_type <- "natural spline"
Knots <- 5
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ ns(displacement, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$displacement, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 4")
ix <- sort(df_train$displacement,index.return=T)$ix
lines(df_train[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(displacement, df = Knots),"knots"), col="grey50", lty=2)

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$displacement, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 4")
ix <- sort(df_test$displacement,index.return=T)$ix
lines(df_test[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(displacement, df = Knots),"knots"), col="grey50", lty=2)

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.202994 3.089173 0.1289719 2.473713
2  testing 4.486540 3.389999 0.1500923 2.675544

===============================================

x : displacement
y : mpg

Model                              tipe      RMSE      MAE       MAPE      WEIGHTED
Polynomial regression (degree 9)   training  4.119046  3.036755  0.1267458 2.427516
Polynomial regression (degree 9)   testing   4.298119  3.304675  0.1475408 2.583445
Step function (6 knots)            training  4.400215  3.256056  0.1408742 2.599049
Step function (6 knots)            testing   4.418019  3.284186  0.1460048 2.616070
Basis spline (df = 7)              training  4.090548  3.017533  0.1260246 2.411369
Basis spline (df = 7)              testing   4.243822  3.286924  0.1461223 2.558956
Natural spline (df = 5)            training  4.202994  3.089173  0.1289719 2.473713
Natural spline (df = 5)            testing   4.486540  3.389999  0.1500923 2.675544

===============================================

The best model for the predictor displacement is the basis spline with df = 7: it has the smallest weighted score on both the training set (2.411369) and the test set (2.558956).
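
As an optional final look (a sketch, not part of the original output), the selected basis-spline model can be refitted on the full data and overlaid on the scatter plot:

best_fit <- lm(mpg ~ bs(displacement, df = 7), data = Auto)   # winning model for displacement
ord <- order(Auto$displacement)
plot(Auto$displacement, Auto$mpg, pch = 19, col = "grey60",
     xlab = "displacement", ylab = "mpg", main = "Selected model: basis spline, df = 7")
lines(Auto$displacement[ord], fitted(best_fit)[ord], col = "blue", lwd = 2)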

B.2 Horsepower vs MPG

B.2.1 Polynomial Regression

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
pol_reg_cv <- function(df,pcg=0.75,K=10,D=5,dep_col=FALSE,columns){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  degree <- matrix(nrow = D*K, ncol = 1)
  CV <- matrix(nrow = D*K, ncol = 1)
  RMSES <- matrix(nrow = D*K, ncol = 1)
  MAES <- matrix(nrow = D*K, ncol = 1)
  MAPES <- matrix(nrow = D*K, ncol = 1)
  i <- 0
  for (d in 1:D) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      polinom_train <- lm(mpg ~ poly(horsepower,d), data = training)
      prediksi <- predict(polinom_train, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      degree[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(degree,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(degree) %>% summarise(RMSE_score = mean(RMSES),
 MAE_score = mean(MAES),
 MAPE_score = mean(MAPES),
 WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
D <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- pol_reg_cv(df = df, pcg = pcg, K = K, D = D, dep_col = dep_col, columns = columns)
model_pol_all$df_pol
   degree  CV    RMSES     MAES     MAPES WEIGHTED
1       1 CV1 4.834504 3.898745 0.1699088 2.967719
2       1 CV2 5.343545 3.974648 0.1860230 3.168072
3       1 CV3 4.621917 3.606616 0.1439466 2.790826
4       1 CV4 4.573925 3.720405 0.1814015 2.825244
5       1 CV5 4.907973 3.733567 0.1700596 2.937200
6       2 CV1 4.650057 3.603537 0.1500087 2.801201
7       2 CV2 4.933845 3.490390 0.1621746 2.862136
8       2 CV3 4.318095 3.355824 0.1345258 2.602815
9       2 CV4 3.753398 2.994182 0.1342507 2.293943
10      2 CV5 4.699263 3.330653 0.1352009 2.721705
11      3 CV1 4.672475 3.590570 0.1486807 2.803908
12      3 CV2 4.961901 3.520806 0.1637465 2.882151
 [ reached 'max' / getOption("max.print") -- omitted 38 rows ]
model_pol_all$df_pol_agg
   degree RMSE_score MAE_score MAPE_score WEIGHTED_score
7       7   4.448757  3.281076  0.1413419       2.623725
8       8   4.461356  3.290221  0.1419090       2.631162
9       9   4.469974  3.298036  0.1424602       2.636823
6       6   4.465935  3.305980  0.1426649       2.638193
10     10   4.486505  3.313439  0.1432379       2.647727
2       2   4.470931  3.354917  0.1432322       2.656360
5       5   4.489308  3.367478  0.1455442       2.667444
3       3   4.504663  3.390128  0.1452515       2.680014
4       4   4.511212  3.413632  0.1476968       2.690847
1       1   4.856373  3.786796  0.1702679       2.937812

The optimum parameter (degree) for the polynomial regression model is degree 7, based on the smallest weighted score. The model is then refitted as a degree-7 polynomial on the training set and evaluated on the test set to check whether the non-linear model is overfitting or underfitting.

degree <- 7
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
df_train <- dfrand[model_pol_all$train_ind,]
df_test <- dfrand[model_pol_all$test_ind,]
polinom_train = lm(mpg ~ poly(horsepower,degree), data = df_train)
#train
prediksi <- predict(polinom_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$horsepower, df_train$mpg, pch=19, col="coral",xlab='horsepower', ylab='mpg',main="Polynomial Regression, Degree 7")
ix <- sort(df_train$horsepower,index.return=T)$ix
lines(df_train[ix,'horsepower'], prediksi[ix], main = 'polynomial regression', col="blue",type="l", lwd=2)

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(polinom_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$horsepower, df_test$mpg, pch=19, col="coral",xlab='horsepower', ylab='mpg',main="Polynomial Regression, Degree 7")
ix <- sort(df_test$horsepower,index.return=T)$ix
lines(df_test[ix,'horsepower'], prediksi[ix], main = 'polynomial regression', col="blue",type="l", lwd=2)

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.327981 3.194352 0.1370348 2.553123
2  testing 4.111706 3.112849 0.1390994 2.454552

B.2.2 Step Functions

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
piecewise_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      id = seq(1, length(unique(training$horsepower)), by = round(length(unique(training$horsepower))/(d+1)))
      breaks = sort(unique(training$horsepower))[id]
      breaks=breaks[-c(1,length(breaks))]
      breaks[length(breaks)+1] <- 2*max(training$horsepower)
      breaks[length(breaks)+1] <- (-2)*min(training$horsepower)
      labels=paste('gr',seq_len(length(breaks)-1),sep='')
      training$gr <- cut(training$horsepower,breaks=breaks,labels=labels, include.lowest=TRUE)
      validation$gr <- cut(validation$horsepower,breaks=breaks,labels=labels, include.lowest=TRUE)
      sf_train <- lm(mpg ~ gr, data = training)
      prediksi <- predict(sf_train, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                        MAE_score = mean(MAES),
                                                                        MAPE_score = mean(MAPES),
                                                                        WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'horsepower'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- piecewise_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1      2 CV1 5.825717 4.180151 0.1822688 3.396046
2      2 CV2 6.149926 4.844878 0.2444107 3.746405
3      2 CV3 5.703813 4.512520 0.1888911 3.468408
4      2 CV4 4.123990 3.213432 0.1463161 2.494579
5      2 CV5 6.462275 5.326581 0.2748729 4.021243
6      3 CV1 4.604564 3.629283 0.1505007 2.794783
7      3 CV2 5.333277 3.829079 0.1838242 3.115394
8      3 CV3 5.032771 3.922097 0.1591387 3.038002
9      3 CV4 4.012484 2.908284 0.1362304 2.352333
10     3 CV5 4.877438 3.712836 0.1610665 2.917114
11     4 CV1 4.476669 3.562161 0.1510829 2.729971
12     4 CV2 4.676163 3.637898 0.1726260 2.828896
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
9    10   4.326833  3.237146  0.1392678       2.567749
8     9   4.438962  3.319902  0.1441795       2.634348
7     8   4.481515  3.319321  0.1432834       2.648040
6     7   4.500671  3.355770  0.1451928       2.667211
5     6   4.470448  3.408200  0.1486145       2.675754
4     5   4.519053  3.449184  0.1495761       2.705938
3     4   4.513945  3.455816  0.1504288       2.706730
2     3   4.772107  3.600316  0.1581521       2.843525
1     2   5.653144  4.415512  0.2073519       3.425336

The optimum parameter (knots) for the step-function model is 10, based on the smallest weighted score. The model is then refitted as a step function with 10 break points (11 intervals) on the training set and evaluated on the test set to check whether the non-linear model is overfitting or underfitting.

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'horsepower'
xlab <- "horsepower"
ylab <- "mile per gallon"
model_type <- "step function"
Knots <- 10
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
id = seq(1, length(unique(df_train$horsepower)), by = round(length(unique(df_train$horsepower))/(Knots+1)))
breaks = sort(unique(df_train$horsepower))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(df_train$horsepower)
breaks[length(breaks)+1] <- (-2)*min(df_train$horsepower)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
df_train$gr <- cut(df_train$horsepower,breaks=breaks,labels=labels, include.lowest=TRUE)
df_test$gr <- cut(df_test$horsepower,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = df_train)
#train
prediksi <- predict(sf_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$horsepower, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function overall knots 10")
ix <- sort(df_train$horsepower,index.return=T)$ix
lines(df_train[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(sf_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$horsepower, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function overall knots 10")
ix <- sort(df_test$horsepower,index.return=T)$ix
lines(df_test[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.199674 3.104994 0.1330575 2.479242
2  testing 4.211224 3.264680 0.1447796 2.540228

B.2.3 Basis Splines (B-splines)

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
bs_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      regspline = glm( mpg ~ bs(horsepower, df = d), data=training)
      prediksi <- predict(regspline, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                       MAE_score = mean(MAES),
                                                                       MAPE_score = mean(MAPES),
                                                                       WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'horsepower'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- bs_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1      2 CV1 4.672475 3.590570 0.1486807 2.803908
2      2 CV2 4.961901 3.520806 0.1637465 2.882151
3      2 CV3 4.334660 3.381311 0.1358210 2.617264
4      2 CV4 3.765460 2.981998 0.1334898 2.293649
5      2 CV5 4.788818 3.475956 0.1445197 2.803098
6      3 CV1 4.672475 3.590570 0.1486807 2.803908
7      3 CV2 4.961901 3.520806 0.1637465 2.882151
8      3 CV3 4.334660 3.381311 0.1358210 2.617264
9      3 CV4 3.765460 2.981998 0.1334898 2.293649
10     3 CV5 4.788818 3.475956 0.1445197 2.803098
11     4 CV1 4.477430 3.406691 0.1416677 2.675263
12     4 CV2 4.988939 3.552203 0.1684208 2.903188
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
4     5   4.449046  3.277204  0.1407714       2.622340
5     6   4.450946  3.275481  0.1408224       2.622416
6     7   4.472071  3.284396  0.1412787       2.632582
8     9   4.489045  3.276452  0.1412380       2.635578
7     8   4.490068  3.298059  0.1420330       2.643387
9    10   4.498160  3.292837  0.1419609       2.644319
3     4   4.495600  3.372730  0.1459311       2.671421
1     2   4.504663  3.390128  0.1452515       2.680014
2     3   4.504663  3.390128  0.1452515       2.680014

Based on the smallest weighted score, the optimum parameter for the baseline-spline model is df = 5 (referred to in this report as 4 knots, i.e. 5 intervals). The model is then refit as a baseline spline with this df on the training set, and we check whether the non-linear model is overfitting or underfitting.

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)

xcols <- 'horsepower'
xlab <- "horsepower"
ylab <- "mile per gallon"
model_type <- "baseline spline"
Knots <- 5
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ bs(horsepower, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$horsepower, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 4")
ix <- sort(df_train$horsepower,index.return=T)$ix
lines(df_train[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(bs(df_train$horsepower, df = Knots),"knots"), col="grey50", lty=2)  # interior knots of the fitted B-spline basis

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$horsepower, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 4")
ix <- sort(df_test$horsepower,index.return=T)$ix
lines(df_test[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(bs(df_train$horsepower, df = Knots),"knots"), col="grey50", lty=2)  # interior knots of the fitted B-spline basis

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.331941 3.203326 0.1376315 2.557633
2  testing 4.124905 3.131825 0.1398707 2.465534
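
A simple way to quantify the over/underfitting check is to compare the testing error with the training error. The lines below are a minimal sketch (not part of the original code) using the df data frame printed above; a large positive gap would indicate overfitting, whereas here the gap is slightly negative.

# relative gap between testing and training RMSE (row 2 vs row 1 of df)
gap_rmse <- (df$RMSES[2] - df$RMSES[1]) / df$RMSES[1]
gap_rmse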

B.2.4 Natural Splines

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
ns_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      regspline = glm( mpg ~ ns(horsepower, df = d), data=training)
      prediksi <- predict(regspline, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                       MAE_score = mean(MAES),
                                                                       MAPE_score = mean(MAPES),
                                                                       WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'horsepower'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- ns_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1      2 CV1 4.645561 3.544641 0.1463739 2.778859
2      2 CV2 4.905000 3.450502 0.1607161 2.838739
3      2 CV3 4.302078 3.308859 0.1318420 2.580926
4      2 CV4 3.720110 3.004466 0.1352199 2.286599
5      2 CV5 4.734787 3.443485 0.1431158 2.773796
6      3 CV1 4.639128 3.549877 0.1468206 2.778609
7      3 CV2 4.965028 3.519192 0.1643371 2.882852
8      3 CV3 4.385813 3.422505 0.1377939 2.648704
9      3 CV4 3.735204 2.953269 0.1319278 2.273467
10     3 CV5 4.756529 3.462496 0.1441429 2.787723
11     4 CV1 4.489656 3.360756 0.1389494 2.663121
12     4 CV2 4.993260 3.522083 0.1658521 2.893732
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
9    10   4.444106  3.261510  0.1405965       2.615404
6     7   4.452166  3.269877  0.1405072       2.620850
8     9   4.468448  3.269994  0.1407372       2.626393
7     8   4.460465  3.285620  0.1413188       2.629135
5     6   4.461346  3.307380  0.1422334       2.636986
4     5   4.460384  3.316710  0.1425750       2.639889
1     2   4.461507  3.350391  0.1434535       2.651784
3     4   4.479450  3.352933  0.1444600       2.658948
2     3   4.496341  3.381468  0.1450045       2.674271

The optimum parameter for the natural-spline model is 9 knots (df = 10), based on the smallest weighted score. The model is then refit as a natural spline with 9 knots (10 intervals) on the training set, and we check whether this non-linear model is overfitting or underfitting.
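
Before refitting, we can verify where ns() actually places its knots. The lines below are a quick check (a sketch, not part of the original code); they assume model_pol_all from the chunk above is still in scope, and df_train_ns / basis_ns are illustrative names only.

library(splines)
df_train_ns <- model_pol_all$dfrand[model_pol_all$train_ind, ]
basis_ns <- ns(df_train_ns$horsepower, df = 10)
attr(basis_ns, "knots")            # the 9 interior knots (quantiles of horsepower)
attr(basis_ns, "Boundary.knots")   # range of the training horsepower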

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)

xcols <- 'horsepower'
xlab <- "horsepower"
ylab <- "mile per gallon"
model_type <- "natural spline"
Knots <- 10
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ ns(horsepower, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$horsepower, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 9")
ix <- sort(df_train$horsepower,index.return=T)$ix
lines(df_train[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(df_train$horsepower, df = Knots),"knots"), col="grey50", lty=2)  # interior knots of the fitted natural-spline basis

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$horsepower, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 9")
ix <- sort(df_test$horsepower,index.return=T)$ix
lines(df_test[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(df_train$horsepower, df = Knots),"knots"), col="grey50", lty=2)  # interior knots of the fitted natural-spline basis

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.249171 3.098283 0.1329692 2.493474
2  testing 4.058468 3.093289 0.1378698 2.429875

===============================================

x : Horsepower
y : mpg

===============================================

Model : Polynomial Regression order 7

      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.327981 3.194352 0.1370348 2.553123
2  testing 4.111706 3.112849 0.1390994 2.454552

===============================================

Model : Step Functions Knots 10

      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.199674 3.104994 0.1330575 2.479242
2  testing 4.211224 3.264680 0.1447796 2.540228

===============================================

Model : Baseline Splines Knots 4

      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.331941 3.203326 0.1376315 2.557633
2  testing 4.124905 3.131825 0.1398707 2.465534

===============================================

Model : Natural Splines Knots 9

      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.249171 3.098283 0.1329692 2.493474
2  testing 4.058468 3.093289 0.1378698 2.429875

===============================================

The best model for the predictor horsepower is the natural spline with Knots 9, which has the lowest testing weighted score.
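
For reference, the testing rows above can be collected into a single data frame so this ranking is reproducible. This is a minimal sketch (not part of the original code); the values are copied from the tables in this section and ringkasan_hp is an illustrative name.

ringkasan_hp <- data.frame(
  model    = c("Polynomial Regression order 7", "Step Functions Knots 10",
               "Baseline Splines Knots 4", "Natural Splines Knots 9"),
  RMSES    = c(4.111706, 4.211224, 4.124905, 4.058468),
  MAES     = c(3.112849, 3.264680, 3.131825, 3.093289),
  MAPES    = c(0.1390994, 0.1447796, 0.1398707, 0.1378698),
  WEIGHTED = c(2.454552, 2.540228, 2.465534, 2.429875)
)
ringkasan_hp[order(ringkasan_hp$WEIGHTED), ]   # the natural spline ranks first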

B.3 Weight vs MPG

B.3.1 Polynomial Regression

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
pol_reg_cv <- function(df,pcg=0.75,K=10,D=5,dep_col=FALSE,columns){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  degree <- matrix(nrow = D*K, ncol = 1)
  CV <- matrix(nrow = D*K, ncol = 1)
  RMSES <- matrix(nrow = D*K, ncol = 1)
  MAES <- matrix(nrow = D*K, ncol = 1)
  MAPES <- matrix(nrow = D*K, ncol = 1)
  i <- 0
  for (d in 1:D) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      polinom_train <- lm(mpg ~ poly(weight,d), data = training)
      prediksi <- predict(polinom_train, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      degree[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(degree,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(degree) %>% summarise(RMSE_score = mean(RMSES),
 MAE_score = mean(MAES),
 MAPE_score = mean(MAPES),
 WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
D <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- pol_reg_cv(df = df, pcg = pcg, K = K, D = D, dep_col = dep_col, columns = columns)
model_pol_all$df_pol
   degree  CV    RMSES     MAES      MAPES WEIGHTED
1       1 CV1 4.098737 3.262634 0.13677871 2.499383
2       1 CV2 4.407682 3.413286 0.15665201 2.659207
3       1 CV3 4.248817 2.989446 0.11483872 2.451034
4       1 CV4 3.999092 3.104835 0.13741690 2.413781
5       1 CV5 4.982291 3.778535 0.17524118 2.978689
6       2 CV1 4.025090 3.204660 0.13329280 2.454347
7       2 CV2 4.065905 3.158427 0.13959506 2.454642
8       2 CV3 4.169867 2.763159 0.09937366 2.344133
9       2 CV4 3.807461 2.818342 0.12229193 2.249365
10      2 CV5 4.862025 3.460551 0.14721963 2.823265
11      3 CV1 4.025691 3.208582 0.13351829 2.455931
12      3 CV2 4.103869 3.182700 0.14039059 2.475653
 [ reached 'max' / getOption("max.print") -- omitted 38 rows ]
model_pol_all$df_pol_agg
   degree RMSE_score MAE_score MAPE_score WEIGHTED_score
2       2   4.186069  3.081028  0.1283546       2.465151
5       5   4.204156  3.088254  0.1283595       2.473590
3       3   4.203005  3.092800  0.1289282       2.474911
4       4   4.210471  3.095284  0.1288693       2.478208
6       6   4.223372  3.109551  0.1294450       2.487456
7       7   4.246695  3.125095  0.1300688       2.500620
9       9   4.260505  3.129379  0.1307195       2.506868
8       8   4.259763  3.137417  0.1311929       2.509457
10     10   4.291399  3.141929  0.1315923       2.521640
1       1   4.347324  3.309747  0.1441855       2.600419

The optimum parameter (degree) for the polynomial regression model is degree 2, based on the smallest weighted score. The model is then refit as a degree-2 polynomial on the training set, and we check whether this non-linear model is overfitting or underfitting.
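
As a side note, poly() builds orthogonal polynomials by default; with raw = TRUE the coefficients correspond directly to weight and weight^2, but the fitted values are identical, so this choice does not affect the errors reported below. The lines that follow are a sketch (not part of the original code); they assume model_pol_all from the chunk above is still in scope, and df_train_chk is an illustrative name.

df_train_chk <- model_pol_all$dfrand[model_pol_all$train_ind, ]
fit_orth <- lm(mpg ~ poly(weight, 2), data = df_train_chk)
fit_raw  <- lm(mpg ~ poly(weight, 2, raw = TRUE), data = df_train_chk)
all.equal(fitted(fit_orth), fitted(fit_raw))   # TRUE: same fit, different parameterisation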

degree <- 2
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
df_train <- dfrand[model_pol_all$train_ind,]
df_test <- dfrand[model_pol_all$test_ind,]
polinom_train = lm(mpg ~ poly(weight,degree), data = df_train)
#train
prediksi <- predict(polinom_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$weight, df_train$mpg, pch=19, col="coral",xlab='weight', ylab='mpg',main="Regresi Polinomial Ordo 2")
ix <- sort(df_train$weight,index.return=T)$ix
lines(df_train[ix,'weight'], prediksi[ix], main = 'regresi polinomial', col="blue",type="l", lwd=2)

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(polinom_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$weight, df_test$mpg, pch=19, col="coral",xlab='weight', ylab='mpg',main="Regresi Polinomial Ordo 2")
ix <- sort(df_test$weight,index.return=T)$ix
lines(df_test[ix,'weight'], prediksi[ix], main = 'regresi polinomial', col="blue",type="l", lwd=2)

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.173080 3.068577 0.1277039 2.456454
2  testing 4.152997 2.952700 0.1265259 2.410741

B.3.2 Step Functions

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
piecewise_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      id = seq(1, length(unique(training$weight)), by = round(length(unique(training$weight))/(d+1)))
      breaks = sort(unique(training$weight))[id]
      breaks=breaks[-c(1,length(breaks))]
      breaks[length(breaks)+1] <- 2*max(training$weight)
      breaks[length(breaks)+1] <- (-2)*min(training$weight)
      labels=paste('gr',seq_len(length(breaks)-1),sep='')
      training$gr <- cut(training$weight,breaks=breaks,labels=labels, include.lowest=TRUE)
      validation$gr <- cut(validation$weight,breaks=breaks,labels=labels, include.lowest=TRUE)
      sf_train <- lm(mpg ~ gr, data = training)
      prediksi <- predict(sf_train, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                        MAE_score = mean(MAES),
                                                                        MAPE_score = mean(MAPES),
                                                                        WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'weight'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- piecewise_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1      2 CV1 5.489604 4.067482 0.1744693 3.243852
2      2 CV2 4.601544 3.717976 0.1692796 2.829600
3      2 CV3 4.740351 3.672659 0.1456014 2.852871
4      2 CV4 5.299839 4.478935 0.2171118 3.331962
5      2 CV5 6.212907 5.117106 0.2662996 3.865438
6      3 CV1 4.444610 3.517650 0.1494944 2.703918
7      3 CV2 5.052088 4.180859 0.1958051 3.142917
8      3 CV3 4.738872 3.296872 0.1247334 2.720159
9      3 CV4 4.449043 3.448903 0.1651446 2.687697
10     3 CV5 5.507773 4.173390 0.2066991 3.295954
11     4 CV1 4.259783 3.401037 0.1433772 2.601399
12     4 CV2 4.717958 3.814678 0.1734151 2.902017
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
9    10   4.265315  3.132899  0.1328419       2.510352
8     9   4.311208  3.185551  0.1330501       2.543270
7     8   4.321397  3.212570  0.1359790       2.556649
6     7   4.376691  3.291181  0.1401156       2.602662
5     6   4.391405  3.304476  0.1399056       2.611929
3     4   4.477855  3.473564  0.1543016       2.701907
4     5   4.566261  3.524613  0.1540219       2.748299
2     3   4.838477  3.723535  0.1683753       2.910129
1     2   5.268849  4.210832  0.1945523       3.224744

The optimum parameter for the step-function model is 10 knots, based on the smallest weighted score. The model is then refit as a step function with 10 knots (11 intervals) on the training set, and we check whether this non-linear model is overfitting or underfitting.

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'weight'
xlab <- "weight"
ylab <- "mile per gallon"
model_type <- "step function"
Knots <- 10
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
id = seq(1, length(unique(df_train$weight)), by = round(length(unique(df_train$weight))/(Knots+1)))
breaks = sort(unique(df_train$weight))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(df_train$weight)
breaks[length(breaks)+1] <- (-2)*min(df_train$weight)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
df_train$gr <- cut(df_train$weight,breaks=breaks,labels=labels, include.lowest=TRUE)
df_test$gr <- cut(df_test$weight,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = df_train)
#train
prediksi <- predict(sf_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$weight, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function overall knots 10")
ix <- sort(df_train$weight,index.return=T)$ix
lines(df_train[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(sf_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$weight, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function overall knots 10")
ix <- sort(df_test$weight,index.return=T)$ix
lines(df_test[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.157577 3.089568 0.1304351 2.459193
2  testing 4.134778 3.060675 0.1331605 2.442871

B.3.3 Baseline Splines

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
bs_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      regspline = glm( mpg ~ bs(weight, df = d), data=training)
      prediksi <- predict(regspline, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                       MAE_score = mean(MAES),
                                                                       MAPE_score = mean(MAPES),
                                                                       WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'weight'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- bs_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES      MAPES WEIGHTED
1      2 CV1 4.025691 3.208582 0.13351829 2.455931
2      2 CV2 4.103869 3.182700 0.14039059 2.475653
3      2 CV3 4.172695 2.753937 0.09868328 2.341772
4      2 CV4 3.810854 2.817392 0.12225921 2.250168
5      2 CV5 4.901914 3.501390 0.14978985 2.851031
6      3 CV1 4.025691 3.208582 0.13351829 2.455931
7      3 CV2 4.103869 3.182700 0.14039059 2.475653
8      3 CV3 4.172695 2.753937 0.09868328 2.341772
9      3 CV4 3.810854 2.817392 0.12225921 2.250168
10     3 CV5 4.901914 3.501390 0.14978985 2.851031
11     4 CV1 4.032868 3.195323 0.13256580 2.453586
12     4 CV2 4.155572 3.234149 0.14171161 2.510478
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
1     2   4.203005  3.092800  0.1289282       2.474911
2     3   4.203005  3.092800  0.1289282       2.474911
3     4   4.209264  3.094364  0.1288044       2.477478
4     5   4.224417  3.110194  0.1293965       2.488002
5     6   4.232796  3.111767  0.1293421       2.491301
6     7   4.246197  3.105733  0.1289232       2.493617
7     8   4.267770  3.119323  0.1294530       2.505515
8     9   4.298056  3.123232  0.1293944       2.516894
9    10   4.418708  3.164247  0.1306858       2.571214

Based on the smallest weighted score, the optimum parameter for the baseline-spline model is df = 2 (referred to in this report as 1 knot, i.e. 2 intervals); note that the df = 2 and df = 3 settings give identical scores (see the check below). The model is then refit as a baseline spline with df = 2 on the training set, and we check whether this non-linear model is overfitting or underfitting.
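
The identical df = 2 and df = 3 rows can be verified directly: bs() cannot use fewer degrees of freedom than its default cubic degree, so df = 2 is bumped up to 3 (with a warning) and both settings produce the same cubic basis with no interior knots. The lines below are a sketch (not part of the original code); they assume model_pol_all from the chunk above is still in scope, and df_train_chk is an illustrative name.

library(splines)
df_train_chk <- model_pol_all$dfrand[model_pol_all$train_ind, ]
b2 <- suppressWarnings(bs(df_train_chk$weight, df = 2))
b3 <- bs(df_train_chk$weight, df = 3)
max(abs(b2 - b3))   # 0: the two bases are identical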

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)

xcols <- 'weight'
xlab <- "weight"
ylab <- "mile per gallon"
model_type <- "baseline spline"
Knots <- 2
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ bs(weight, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$weight, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 1")
ix <- sort(df_train$weight,index.return=T)$ix
lines(df_train[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(bs(df_train$weight, df = Knots),"knots"), col="grey50", lty=2)  # interior knots of the fitted B-spline basis (none when df = 2)

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$weight, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 1")
ix <- sort(df_test$weight,index.return=T)$ix
lines(df_test[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(bs(df_train$weight, df = Knots),"knots"), col="grey50", lty=2)  # interior knots of the fitted B-spline basis (none when df = 2)

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.173074 3.068949 0.1277261 2.456583
2  testing 4.152980 2.952380 0.1265228 2.410627

B.3.4 Natural Splines

split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
  set.seed(1501211042)
  df <- df[sample(nrow(df)),]
  if (dep_col==TRUE) {
    indeks <- sample.split(df[,columns],SplitRatio = pcg)
    df1 <- df[which(indeks==TRUE),]
    df2 <- df[which(indeks==FALSE),]
    df <- rbind(df1,df2)
  } else {df <- df}
  row.names(df) <- seq_len(nrow(df))
  train_ind <- seq_len(floor(pcg * nrow(df)))
  test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
  df_train <- df[train_ind, ]
  df_test <- df[test_ind, ]
  smp_size <- round(nrow(df_train)/K)
  group_ind <- matrix(nrow = smp_size, ncol = K)
  indeks <- as.integer(row.names(df_train))
  min_indeks <- min(indeks)
  for (i in 1:K){
    ind <- seq(min_indeks,min_indeks+smp_size-1,1)
    group_ind[,i] <- ind
    min_indeks <- min_indeks+smp_size
  }
  df_group  <- as.data.frame(group_ind)
  cvlist <- paste('CV',c(1:K),sep = '')
  colnames(df_group) <- cvlist
  return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
ns_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
  sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
  dfrand <- sp_dataset$dfrand
  df_group <- sp_dataset$df_group
  train_ind <- sp_dataset$train_ind
  test_ind <- sp_dataset$test_ind
  df_train <- dfrand[train_ind,]
  df_test <- dfrand[test_ind,]
  KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
  CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
  RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
  i <- 0
  for (d in 2:Knots) {
    for (cv in colnames(df_group)) {
      `%!in%` <- Negate(`%in%`)
      training <- df_train[row.names(df_train) %!in% df_group[,cv],]
      validation <- df_train[row.names(df_train) %in% df_group[,cv],]
      regspline = glm( mpg ~ ns(weight, df = d), data=training)
      prediksi <- predict(regspline, newdata = validation)
      actual <- validation$mpg
      i <- i+1
      KNOTS[i] <- d
      CV[i] <- cv
      RMSES[i] <- RMSE(prediksi,actual)
      MAES[i] <- MAE(prediksi,actual)
      MAPES[i] <- MAPE(prediksi, actual)
    }
  }
  WEIGHTED <- (RMSES+MAES+MAPES)/3
  df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
  df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
                                                                       MAE_score = mean(MAES),
                                                                       MAPE_score = mean(MAPES),
                                                                       WEIGHTED_score =  mean(WEIGHTED)))
  df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
  
  return(list('train_ind' = train_ind,'test_ind'= test_ind,
              'df_group' = df_group, 'dfrand'= dfrand,
              'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'weight'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- ns_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES      MAPES WEIGHTED
1      2 CV1 4.024755 3.186773 0.13220455 2.447911
2      2 CV2 4.059506 3.171044 0.14066066 2.457070
3      2 CV3 4.167291 2.740486 0.09757765 2.335118
4      2 CV4 3.800988 2.813894 0.12224855 2.245710
5      2 CV5 4.864412 3.466742 0.14812645 2.826427
6      3 CV1 4.023849 3.197002 0.13287077 2.451241
7      3 CV2 4.139431 3.205717 0.14084242 2.495330
8      3 CV3 4.168504 2.741861 0.09770511 2.336023
9      3 CV4 3.807571 2.810138 0.12215481 2.246621
10     3 CV5 4.884863 3.489401 0.14945498 2.841240
11     4 CV1 4.024065 3.179424 0.13168007 2.445057
12     4 CV2 4.164382 3.240913 0.14216629 2.515820
 [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
1     2   4.183390  3.075788  0.1281636       2.462447
2     3   4.204844  3.088824  0.1286056       2.474091
3     4   4.207898  3.090500  0.1286277       2.475675
6     7   4.261913  3.099014  0.1284067       2.496444
4     5   4.244174  3.125766  0.1302307       2.500057
7     8   4.277739  3.108912  0.1288897       2.505180
8     9   4.284369  3.113305  0.1290438       2.508906
5     6   4.265652  3.132358  0.1302423       2.509417
9    10   4.293648  3.126213  0.1297538       2.516538

The optimum parameter for the natural-spline model is 1 knot (df = 2), based on the smallest weighted score. The model is then refit as a natural spline with 1 knot (2 intervals) on the training set, and we check whether this non-linear model is overfitting or underfitting.

tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)

xcols <- 'weight'
xlab <- "weight"
ylab <- "mile per gallon"
model_type <- "natural spline"
Knots <- 2
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ ns(weight, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$weight, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 1")
ix <- sort(df_train$weight,index.return=T)$ix
lines(df_train[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(df_train$weight, df = Knots),"knots"), col="grey50", lty=2)  # interior knot of the fitted natural-spline basis

tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$weight, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 1")
ix <- sort(df_test$weight,index.return=T)$ix
lines(df_test[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(df_train$weight, df = Knots),"knots"), col="grey50", lty=2)  # interior knot of the fitted natural-spline basis

tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.170799 3.061577 0.1273529 2.453243
2  testing 4.154022 2.952269 0.1264608 2.410917

===============================================

x : Weight
y : mpg

===============================================

Model : Polynomial Regression order 2

      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.173080 3.068577 0.1277039 2.456454
2  testing 4.152997 2.952700 0.1265259 2.410741

===============================================

Model : Step Functions Knots 10

      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.157577 3.089568 0.1304351 2.459193
2  testing 4.134778 3.060675 0.1331605 2.442871

===============================================

Model : Baseline Splines Knots 1

      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.173074 3.068949 0.1277261 2.456583
2  testing 4.152980 2.952380 0.1265228 2.410627

===============================================

Model : Natural Splines Knots 1

      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.170799 3.061577 0.1273529 2.453243
2  testing 4.154022 2.952269 0.1264608 2.410917

===============================================

The best model for the predictor weight is the baseline spline with Knots 1, which has the lowest testing weighted score.
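
As a closing illustration, the selected model can be refit and used for prediction. This is a minimal usage sketch (not part of the original assignment); it assumes model_pol_all from the last chunk is still in scope, model_final and weight_baru are illustrative names, and the 3000 lb weight is an arbitrary example value.

library(splines)
df_train_final <- model_pol_all$dfrand[model_pol_all$train_ind, ]
model_final <- lm(mpg ~ bs(weight, df = 2), data = df_train_final)
weight_baru <- data.frame(weight = 3000)   # hypothetical new vehicle
predict(model_final, newdata = weight_baru)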