Data Science
Practicum Assignment 8
Import Library
The data for this assignment can be obtained from the ISLR package. Please install the package first.
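If any of the packages below are not yet installed, they can be installed once beforehand; a minimal sketch (the package list here is only illustrative, install whichever are missing on your machine):

install.packages(c("ISLR", "MLmetrics", "caTools", "tidyverse", "caret"))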
library(MultiKink)
library(mlr3measures)
library(OneR)
library(fastDummies)
library(caret)
library(e1071)
library(doSNOW)
library(ipred)
library(xgboost)
library(Lahman)
library(tidyverse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(readxl)
library(corrplot)
library(knitr)
library(smbinning)
library(ISLR)
library(splines)
library(boot)
library(MLmetrics)
library(caTools)
library(rpart)
library(rsample)
Import Data
This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. The original dataset has 397 observations, of which 5 have missing values for the variable "horsepower".
A data frame with 392 observations on the following 9 variables.
mpg : miles per gallon
cylinders : number of cylinders, between 4 and 8
displacement : engine displacement (cu. inches)
horsepower : engine horsepower
weight : vehicle weight (lbs.)
acceleration : time to accelerate from 0 to 60 mph (sec.)
year : model year (modulo 100)
origin : origin of car (1. American, 2. European, 3. Japanese)
name : vehicle name
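As a quick check (a sketch, not part of the original assignment output), the Auto data frame shipped with ISLR already has the 5 rows with missing horsepower removed, which can be verified as follows:

library(ISLR)
dim(Auto)        # expected: 392 rows, 9 columns
sum(is.na(Auto)) # expected: 0, the incomplete rows were already dropped
str(Auto)        # types of the 9 variables described above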
attach(Auto)
df <- as.data.frame(cbind(mpg,cylinders,displacement,horsepower,weight,acceleration))
df
mpg cylinders displacement horsepower weight acceleration
1 18 8 307 130 3504 12.0
2 15 8 350 165 3693 11.5
3 18 8 318 150 3436 11.0
4 16 8 304 150 3433 12.0
5 17 8 302 140 3449 10.5
6 15 8 429 198 4341 10.0
7 14 8 454 220 4354 9.0
8 14 8 440 215 4312 8.5
9 14 8 455 225 4425 10.0
10 15 8 390 190 3850 8.5
11 15 8 383 170 3563 10.0
12 14 8 340 160 3609 8.0
[ reached 'max' / getOption("max.print") -- omitted 380 rows ]
Task A
Is there evidence of a non-linear relationship for the variables you selected? Create visualizations to support your answer.
Scatter plot MPG vs Displacement
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration,year))
plot(df$displacement, df$mpg, pch=19, col = "blue", xlab = "displacement", ylab = "mile per gallon")
title("Plot of mpg vs displacement", cex.sub = 0.8)
abline(lm(mpg ~ displacement))
# correlation between displacement and MPG
cat("Correlation coefficient (r) =",cor(df$displacement, df$mpg, method = "pearson"))
Correlation coefficient (r) = -0.8051269
cor.test(df$displacement,df$mpg)

	Pearson's product-moment correlation

data:  df$displacement and df$mpg
t = -26.808, df = 390, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8373959 -0.7672653
sample estimates:
       cor
-0.8051269

Based on the pattern in the scatterplot, the relationship between displacement and miles per gallon appears to be non-linear: around a displacement of roughly 420 the mpg values no longer keep decreasing but instead tend to rise. The correlation between displacement and miles per gallon is about -0.80, meaning that as displacement increases, miles per gallon decreases. So although the correlation suggests a strong negative linear association, the visual pattern shows that the relationship between displacement and miles per gallon is not strictly linear.
Scatter plot MPG vs HorsePower
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration,year))
plot(df$horsepower, df$mpg, pch=19, col = "black", xlab = "Horse Power", ylab = "mile per gallon")
title("Plot of mpg vs Horse Power", cex.sub = 0.8)
abline(lm(mpg ~ horsepower))
# correlation between horsepower and MPG (overall)
cat("Correlation coefficient (r) =",cor(df$horsepower, df$mpg, method = "pearson"))
Correlation coefficient (r) = -0.7784268
cor.test(df$horsepower,df$mpg)

	Pearson's product-moment correlation

data:  df$horsepower and df$mpg
t = -24.489, df = 390, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8146631 -0.7361359
sample estimates:
       cor
-0.7784268

Based on the pattern in the scatterplot, the relationship between horsepower and miles per gallon appears to be non-linear: around a horsepower of roughly 200 the mpg values no longer keep decreasing but instead tend to rise. The correlation between horsepower and miles per gallon is about -0.78, meaning that as horsepower increases, miles per gallon decreases. So although the correlation suggests a strong negative linear association, the visual pattern shows that the relationship between horsepower and miles per gallon is not strictly linear.
Scatter plot MPG vs weight
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration,year))
plot(df$weight, df$mpg, pch=19, col = "coral", xlab = "weight", ylab = "mile per gallon")
title("Plot of mpg vs weight", cex.sub = 0.8)
abline(lm(mpg ~ weight))
# correlation between weight and MPG (overall)
cat("Correlation coefficient (r) =",cor(df$weight, df$mpg, method = "pearson"))
Correlation coefficient (r) = -0.8322442
cor.test(df$weight,df$mpg)

	Pearson's product-moment correlation

data:  df$weight and df$mpg
t = -29.645, df = 390, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8603702 -0.7990647
sample estimates:
       cor
-0.8322442

Based on the pattern in the scatterplot, the relationship between weight and miles per gallon appears to be non-linear: around a weight of roughly 5000 the mpg values no longer keep decreasing but instead tend to rise. The correlation between weight and miles per gallon is about -0.83, meaning that as weight increases, miles per gallon decreases. So although the correlation suggests a strong negative linear association, the visual pattern shows that the relationship between weight and miles per gallon is not strictly linear.
Scatter plot MPG vs acceleration
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration,year))
plot(df$acceleration, df$mpg, pch=19, col = "grey", xlab = "acceleration", ylab = "mile per gallon")
title("Plot of mpg vs acceleration", cex.sub = 0.8)
abline(lm(mpg ~ acceleration))
# correlation between acceleration and MPG
cat("Correlation coefficient (r) =",cor(df$acceleration, df$mpg, method = "pearson"))
Correlation coefficient (r) = 0.4233285
cor.test(df$acceleration,df$mpg)

	Pearson's product-moment correlation

data:  df$acceleration and df$mpg
t = 9.2277, df = 390, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3384724 0.5013550
sample estimates:
      cor
0.4233285

Based on the pattern in the scatterplot, the relationship between acceleration and miles per gallon looks roughly linear: mpg tends to increase as acceleration increases. This is also reflected in the correlation between acceleration and miles per gallon of about 0.42, meaning that higher acceleration times are associated with higher miles per gallon.
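To make the curvature easier to judge than with the straight abline alone, a smoother can be overlaid on each scatterplot. A minimal sketch using base R's lowess for the displacement case (the smoothing span f is an arbitrary choice):

# Straight regression line vs. lowess smoother: a clearly curved smooth line
# is informal evidence of a non-linear relationship.
plot(Auto$displacement, Auto$mpg, pch = 19, col = "grey60",
     xlab = "displacement", ylab = "miles per gallon")
abline(lm(mpg ~ displacement, data = Auto), col = "red", lwd = 2)
lines(lowess(Auto$displacement, Auto$mpg, f = 2/3), col = "blue", lwd = 2)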
Task B
Determine the best non-linear model for each pair of variables (visually and/or empirically). Based on the scatterplot analysis in Task A, the predictors selected for non-linear modelling against mpg are displacement, horsepower, and weight. The non-linear modelling steps carried out for each predictor are listed below (a small sketch of the weighted score follows the list).
1. Split the data into training and test sets with a 70:30 or 80:20 proportion; here 70:30 is used.
2. Find the optimum parameter via K-fold cross-validation on the training set from step 1. For each parameter value (e.g. degree), the smallest weighted score is selected, where the weighted score is computed as the average of RMSE, MAE, and MAPE.
3. Refit the model with the selected parameter on the training set and evaluate it on the test set.
4. Choose the best model based on the smallest weighted score.
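A minimal sketch of the weighted score used in the steps above, assuming the RMSE, MAE, and MAPE functions from the MLmetrics package loaded earlier; the numbers in the example are made up for illustration only.

library(MLmetrics)

# Weighted score = average of RMSE, MAE, and MAPE, as used throughout this report
weighted_score <- function(pred, actual) {
  (RMSE(pred, actual) + MAE(pred, actual) + MAPE(pred, actual)) / 3
}

# Illustration with toy values (not from the Auto data)
weighted_score(pred = c(20, 25, 30), actual = c(22, 24, 33))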
B.1 Displacement vs MPG
B.1.1 Polynomial Regression
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
pol_reg_cv <- function(df,pcg=0.75,K=10,D=5,dep_col=FALSE,columns){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
degree <- matrix(nrow = D*K, ncol = 1)
CV <- matrix(nrow = D*K, ncol = 1)
RMSES <- matrix(nrow = D*K, ncol = 1)
MAES <- matrix(nrow = D*K, ncol = 1)
MAPES <- matrix(nrow = D*K, ncol = 1)
i <- 0
for (d in 1:D) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
polinom_train <- lm(mpg ~ poly(displacement,d), data = training)
prediksi <- predict(polinom_train, newdata = validation)
actual <- validation$mpg
i <- i+1
degree[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(degree,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(degree) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
D <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- pol_reg_cv(df = df, pcg = pcg, K = K, D = D, dep_col = dep_col, columns = columns)
model_pol_all$df_pol
degree CV RMSES MAES MAPES WEIGHTED
1 1 CV1 4.420065 3.419909 0.1499310 2.663302
2 1 CV2 4.335985 3.285711 0.1441593 2.588619
3 1 CV3 4.538141 3.259594 0.1268758 2.641537
4 1 CV4 4.597318 3.710270 0.1737371 2.827108
5 1 CV5 4.960205 3.493081 0.1525632 2.868617
6 2 CV1 4.138268 3.055096 0.1309501 2.441438
7 2 CV2 3.977767 2.951997 0.1254357 2.351733
8 2 CV3 4.338138 3.129038 0.1189623 2.528713
9 2 CV4 4.286909 3.257085 0.1458972 2.563297
10 2 CV5 4.838493 3.261941 0.1339528 2.744796
11 3 CV1 4.111953 3.022902 0.1295969 2.421484
12 3 CV2 3.949769 2.928616 0.1245307 2.334305
[ reached 'max' / getOption("max.print") -- omitted 38 rows ]
model_pol_all$df_pol_agg
degree RMSE_score MAE_score MAPE_score WEIGHTED_score
9 9 4.207214 3.142020 0.1319057 2.493713
6 6 4.241103 3.126389 0.1307972 2.499430
10 10 4.206002 3.167422 0.1334887 2.502304
8 8 4.229312 3.158230 0.1325185 2.506687
7 7 4.239218 3.173912 0.1335556 2.515562
4 4 4.310350 3.126605 0.1310293 2.522661
5 5 4.304216 3.137739 0.1321408 2.524699
3 3 4.305306 3.138277 0.1320104 2.525198
2 2 4.315915 3.131031 0.1310396 2.525995
1 1 4.570343 3.433713 0.1494533 2.717836
The optimum parameter (degree) for the polynomial regression model is degree 9, based on the smallest weighted score. The model is then refit with a degree-9 polynomial on the training set, and the test set is used to check whether the non-linear model overfits or underfits.
degree <- 9
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
df_train <- dfrand[model_pol_all$train_ind,]
df_test <- dfrand[model_pol_all$test_ind,]
polinom_train = lm(mpg ~ poly(displacement,degree), data = df_train)
#train
prediksi <- predict(polinom_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$displacement, df_train$mpg, pch=19, col="coral",xlab='displacement', ylab='mpg',main="Regresi Polinomial Ordo 9")
ix <- sort(df_train$displacement,index.return=T)$ix
lines(df_train[ix,'displacement'], prediksi[ix], main = 'regresi polinomial', col="blue",type="l", lwd=2)
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(polinom_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$displacement, df_test$mpg, pch=19, col="coral",xlab='displacement', ylab='mpg',main="Regresi Polinomial Ordo 9")
ix <- sort(df_test$displacement,index.return=T)$ix
lines(df_test[ix,'displacement'], prediksi[ix], main = 'regresi polinomial', col="blue",type="l", lwd=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
tipe RMSES MAES MAPES WEIGHTED
1 training 4.119046 3.036755 0.1267458 2.427516
2 testing 4.298119 3.304675 0.1475408 2.583445
B.1.2 Step Functions
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
piecewise_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
id = seq(1, length(unique(training$displacement)), by = round(length(unique(training$displacement))/(d+1)))
breaks = sort(unique(training$displacement))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(training$displacement)
breaks[length(breaks)+1] <- (-2)*min(training$displacement)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
training$gr <- cut(training$displacement,breaks=breaks,labels=labels, include.lowest=TRUE)
validation$gr <- cut(validation$displacement,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = training)
prediksi <- predict(sf_train, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'displacement'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- piecewise_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
KNOTS CV RMSES MAES MAPES WEIGHTED
1 2 CV1 5.772823 4.270018 0.1841975 3.409013
2 2 CV2 5.497051 4.502477 0.2188102 3.406113
3 2 CV3 5.559409 4.273895 0.1785762 3.337294
4 2 CV4 4.236006 3.298433 0.1588730 2.564437
5 2 CV5 5.233705 3.934705 0.1887980 3.119069
6 3 CV1 4.639002 3.523816 0.1479911 2.770270
7 3 CV2 4.267873 3.375280 0.1505377 2.597897
8 3 CV3 4.898393 3.872457 0.1574689 2.976106
9 3 CV4 4.678056 3.737885 0.1768993 2.864280
10 3 CV5 5.467206 4.469303 0.2218816 3.386130
11 4 CV1 4.328386 3.214958 0.1378119 2.560385
12 4 CV2 4.673731 3.766893 0.1723247 2.870983
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
5 6 4.303624 3.145174 0.1329524 2.527250
9 10 4.366322 3.190972 0.1342283 2.563841
8 9 4.388511 3.215215 0.1354847 2.579737
6 7 4.412208 3.292275 0.1410216 2.615168
7 8 4.499340 3.316479 0.1396854 2.651835
4 5 4.534251 3.566113 0.1579259 2.752763
3 4 4.703221 3.660230 0.1631372 2.842196
2 3 4.790106 3.795748 0.1709557 2.918937
1 2 5.259799 4.055906 0.1858510 3.167185
The optimum parameter (knots) for the step-function model is 6 knots, based on the smallest weighted score. The model is then refit with 6 cut points (giving 7 intervals) on the training set, and the test set is used to check whether the non-linear model overfits or underfits.
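For intuition on how the step function assigns observations to intervals, here is a small self-contained illustration of cut() with explicit break points (toy values, not the fitted breaks used below):

# Toy example: 3 interior break points give 4 groups (intervals)
x      <- c(70, 120, 200, 310, 430)
breaks <- c(-Inf, 100, 250, 400, Inf)
cut(x, breaks = breaks, labels = paste0("gr", 1:4), include.lowest = TRUE)
# returns: gr1 gr2 gr2 gr3 gr4 (with levels gr1 ... gr4)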
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'displacement'
xlab <- "displacement"
ylab <- "mile per gallon"
model_type <- "step function"
Knots <- 6
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
id = seq(1, length(unique(df_train$displacement)), by = round(length(unique(df_train$displacement))/(Knots+1)))
breaks = sort(unique(df_train$displacement))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(df_train$displacement)
breaks[length(breaks)+1] <- (-2)*min(df_train$displacement)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
df_train$gr <- cut(df_train$displacement,breaks=breaks,labels=labels, include.lowest=TRUE)
df_test$gr <- cut(df_test$displacement,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = df_train)
#train
prediksi <- predict(sf_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$displacement, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function knots 6")
ix <- sort(df_train$displacement,index.return=T)$ix
lines(df_train[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(sf_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$displacement, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function knots 6")
ix <- sort(df_test$displacement,index.return=T)$ix
lines(df_test[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
tipe RMSES MAES MAPES WEIGHTED
1 training 4.400215 3.256056 0.1408742 2.599049
2 testing 4.418019 3.284186 0.1460048 2.616070
B.1.3 Basis Splines (B-splines)
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
bs_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
regspline = glm( mpg ~ bs(displacement, df = d), data=training)
prediksi <- predict(regspline, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'displacement'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- bs_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
KNOTS CV RMSES MAES MAPES WEIGHTED
1 2 CV1 4.111953 3.022902 0.1295969 2.421484
2 2 CV2 3.949769 2.928616 0.1245307 2.334305
3 2 CV3 4.346956 3.130042 0.1181625 2.531720
4 2 CV4 4.287805 3.270083 0.1476468 2.568512
5 2 CV5 4.830048 3.339743 0.1401152 2.769969
6 3 CV1 4.111953 3.022902 0.1295969 2.421484
7 3 CV2 3.949769 2.928616 0.1245307 2.334305
8 3 CV3 4.346956 3.130042 0.1181625 2.531720
9 3 CV4 4.287805 3.270083 0.1476468 2.568512
10 3 CV5 4.830048 3.339743 0.1401152 2.769969
11 4 CV1 4.128312 3.010619 0.1291213 2.422684
12 4 CV2 4.002056 2.966462 0.1262836 2.364934
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
6 7 4.177243 3.097753 0.1293515 2.468116
5 6 4.190485 3.118070 0.1302337 2.479596
4 5 4.215277 3.115346 0.1300918 2.486905
7 8 4.230068 3.146471 0.1313725 2.502637
8 9 4.260037 3.178469 0.1326127 2.523706
1 2 4.305306 3.138277 0.1320104 2.525198
2 3 4.305306 3.138277 0.1320104 2.525198
9 10 4.297378 3.172256 0.1320087 2.533881
3 4 4.324482 3.157979 0.1332974 2.538586
The optimum parameter for the basis-spline (B-spline) model is df = 7 (treated in this report as 6 knots and 7 intervals), based on the smallest weighted score. The model is then refit with bs(displacement, df = 7) on the training set, and the test set is used to check whether the non-linear model overfits or underfits.
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'displacement'
xlab <- "displacement"
ylab <- "mile per gallon"
model_type <- "natural spline"
Knots <- 7
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ bs(displacement, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$displacement, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 6")
ix <- sort(df_train$displacement,index.return=T)$ix
lines(df_train[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(displacement, df = Knots),"knots"), col="grey50", lty=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$displacement, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 6")
ix <- sort(df_test$displacement,index.return=T)$ix
lines(df_test[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(displacement, df = Knots),"knots"), col="grey50", lty=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
tipe RMSES MAES MAPES WEIGHTED
1 training 4.090548 3.017533 0.1260246 2.411369
2 testing 4.243822 3.286924 0.1461223 2.558956
B.1.4 Natural Splines
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
ns_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
regspline = glm( mpg ~ ns(displacement, df = d), data=training)
prediksi <- predict(regspline, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'displacement'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- ns_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
KNOTS CV RMSES MAES MAPES WEIGHTED
1 2 CV1 4.120023 3.020128 0.1293868 2.423179
2 2 CV2 3.957653 2.938173 0.1249792 2.340268
3 2 CV3 4.330537 3.127524 0.1185001 2.525520
4 2 CV4 4.271793 3.249318 0.1458685 2.555660
5 2 CV5 4.834369 3.331811 0.1392585 2.768480
6 3 CV1 4.095964 3.015267 0.1294412 2.413558
7 3 CV2 3.926923 2.902464 0.1234775 2.317621
8 3 CV3 4.360187 3.123338 0.1175143 2.533680
9 3 CV4 4.274105 3.268234 0.1478315 2.563390
10 3 CV5 4.814877 3.330390 0.1396040 2.761623
11 4 CV1 4.098069 2.993124 0.1284785 2.406557
12 4 CV2 4.005912 2.986635 0.1275349 2.373360
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
4 5 4.253471 3.136378 0.1313241 2.507058
9 10 4.235413 3.163258 0.1325501 2.510407
2 3 4.294411 3.127938 0.1315737 2.517974
1 2 4.302875 3.133391 0.1315986 2.522621
8 9 4.262170 3.181392 0.1326983 2.525420
5 6 4.289218 3.171906 0.1327490 2.531291
3 4 4.308491 3.154456 0.1331978 2.532048
6 7 4.290514 3.178776 0.1326212 2.533970
7 8 4.296379 3.174654 0.1320417 2.534358
The optimum parameter for the natural spline model is df = 5 (treated in this report as 4 knots and 5 intervals), based on the smallest weighted score. The model is then refit with ns(displacement, df = 5) on the training set, and the test set is used to check whether the non-linear model overfits or underfits.
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'displacement'
xlab <- "displacement"
ylab <- "mile per gallon"
model_type <- "natural spline"
Knots <- 5
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ ns(displacement, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$displacement, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 4")
ix <- sort(df_train$displacement,index.return=T)$ix
lines(df_train[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(displacement, df = Knots),"knots"), col="grey50", lty=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$displacement, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 4")
ix <- sort(df_test$displacement,index.return=T)$ix
lines(df_test[ix,'displacement'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(displacement, df = Knots),"knots"), col="grey50", lty=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
tipe RMSES MAES MAPES WEIGHTED
1 training 4.202994 3.089173 0.1289719 2.473713
2 testing 4.486540 3.389999 0.1500923 2.675544
===============================================
x : displacement
y : mpg
===============================================
Model : Polynomial Regression degree 9
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.119046 3.036755 0.1267458 2.427516
2 testing 4.298119 3.304675 0.1475408 2.583445
===============================================
Model : Step Functions Knots 6
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.400215 3.256056 0.1408742 2.599049
2 testing 4.418019 3.284186 0.1460048 2.616070
===============================================
Model : Basis Splines Knots 6
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.090548 3.017533 0.1260246 2.411369
2 testing 4.243822 3.286924 0.1461223 2.558956
===============================================
Model : Natural Splines Knots 4
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.202994 3.089173 0.1289719 2.473713
2 testing 4.486540 3.389999 0.1500923 2.675544
===============================================
The best model for the predictor displacement is the basis spline (Knots 6), since it has the smallest testing weighted score (2.558956) among the four models.
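As a closing sketch (not part of the original output), the winning specification for displacement can be refit on the full Auto data to inspect the fitted curve; df = 7 matches the value of Knots used in section B.1.3:

library(splines)
best_fit <- lm(mpg ~ bs(displacement, df = 7), data = Auto)
summary(best_fit)$r.squared   # fraction of mpg variance explained by the B-spline fit

ord <- order(Auto$displacement)
plot(Auto$displacement, Auto$mpg, pch = 19, col = "coral",
     xlab = "displacement", ylab = "miles per gallon")
lines(Auto$displacement[ord], fitted(best_fit)[ord], col = "blue", lwd = 2)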
B.2 Horsepower vs MPG
B.2.1 Polynomial Regression
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
pol_reg_cv <- function(df,pcg=0.75,K=10,D=5,dep_col=FALSE,columns){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
degree <- matrix(nrow = D*K, ncol = 1)
CV <- matrix(nrow = D*K, ncol = 1)
RMSES <- matrix(nrow = D*K, ncol = 1)
MAES <- matrix(nrow = D*K, ncol = 1)
MAPES <- matrix(nrow = D*K, ncol = 1)
i <- 0
for (d in 1:D) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
polinom_train <- lm(mpg ~ poly(horsepower,d), data = training)
prediksi <- predict(polinom_train, newdata = validation)
actual <- validation$mpg
i <- i+1
degree[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(degree,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(degree) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
D <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- pol_reg_cv(df = df, pcg = pcg, K = K, D = D, dep_col = dep_col, columns = columns)
model_pol_all$df_pol
degree CV RMSES MAES MAPES WEIGHTED
1 1 CV1 4.834504 3.898745 0.1699088 2.967719
2 1 CV2 5.343545 3.974648 0.1860230 3.168072
3 1 CV3 4.621917 3.606616 0.1439466 2.790826
4 1 CV4 4.573925 3.720405 0.1814015 2.825244
5 1 CV5 4.907973 3.733567 0.1700596 2.937200
6 2 CV1 4.650057 3.603537 0.1500087 2.801201
7 2 CV2 4.933845 3.490390 0.1621746 2.862136
8 2 CV3 4.318095 3.355824 0.1345258 2.602815
9 2 CV4 3.753398 2.994182 0.1342507 2.293943
10 2 CV5 4.699263 3.330653 0.1352009 2.721705
11 3 CV1 4.672475 3.590570 0.1486807 2.803908
12 3 CV2 4.961901 3.520806 0.1637465 2.882151
[ reached 'max' / getOption("max.print") -- omitted 38 rows ]
model_pol_all$df_pol_agg
degree RMSE_score MAE_score MAPE_score WEIGHTED_score
7 7 4.448757 3.281076 0.1413419 2.623725
8 8 4.461356 3.290221 0.1419090 2.631162
9 9 4.469974 3.298036 0.1424602 2.636823
6 6 4.465935 3.305980 0.1426649 2.638193
10 10 4.486505 3.313439 0.1432379 2.647727
2 2 4.470931 3.354917 0.1432322 2.656360
5 5 4.489308 3.367478 0.1455442 2.667444
3 3 4.504663 3.390128 0.1452515 2.680014
4 4 4.511212 3.413632 0.1476968 2.690847
1 1 4.856373 3.786796 0.1702679 2.937812
The optimum parameter (degree) for the polynomial regression model is degree 7, based on the smallest weighted score. The model is then refit with a degree-7 polynomial on the training set, and the test set is used to check whether the non-linear model overfits or underfits.
degree <- 7
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
df_train <- dfrand[model_pol_all$train_ind,]
df_test <- dfrand[model_pol_all$test_ind,]
polinom_train = lm(mpg ~ poly(horsepower,degree), data = df_train)
#train
prediksi <- predict(polinom_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$horsepower, df_train$mpg, pch=19, col="coral",xlab='horsepower', ylab='mpg',main="Regresi Polinomial Ordo 7")
ix <- sort(df_train$horsepower,index.return=T)$ix
lines(df_train[ix,'horsepower'], prediksi[ix], main = 'regresi polinomial', col="blue",type="l", lwd=2)
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(polinom_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$horsepower, df_test$mpg, pch=19, col="coral",xlab='horsepower', ylab='mpg',main="Regresi Polinomial Ordo 7")
ix <- sort(df_test$horsepower,index.return=T)$ix
lines(df_test[ix,'horsepower'], prediksi[ix], main = 'regresi polinomial', col="blue",type="l", lwd=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
tipe RMSES MAES MAPES WEIGHTED
1 training 4.327981 3.194352 0.1370348 2.553123
2 testing 4.111706 3.112849 0.1390994 2.454552
B.2.2 Step Functions
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
piecewise_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
id = seq(1, length(unique(training$horsepower)), by = round(length(unique(training$horsepower))/(d+1)))
breaks = sort(unique(training$horsepower))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(training$horsepower)
breaks[length(breaks)+1] <- (-2)*min(training$horsepower)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
training$gr <- cut(training$horsepower,breaks=breaks,labels=labels, include.lowest=TRUE)
validation$gr <- cut(validation$horsepower,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = training)
prediksi <- predict(sf_train, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'horsepower'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- piecewise_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
KNOTS CV RMSES MAES MAPES WEIGHTED
1 2 CV1 5.825717 4.180151 0.1822688 3.396046
2 2 CV2 6.149926 4.844878 0.2444107 3.746405
3 2 CV3 5.703813 4.512520 0.1888911 3.468408
4 2 CV4 4.123990 3.213432 0.1463161 2.494579
5 2 CV5 6.462275 5.326581 0.2748729 4.021243
6 3 CV1 4.604564 3.629283 0.1505007 2.794783
7 3 CV2 5.333277 3.829079 0.1838242 3.115394
8 3 CV3 5.032771 3.922097 0.1591387 3.038002
9 3 CV4 4.012484 2.908284 0.1362304 2.352333
10 3 CV5 4.877438 3.712836 0.1610665 2.917114
11 4 CV1 4.476669 3.562161 0.1510829 2.729971
12 4 CV2 4.676163 3.637898 0.1726260 2.828896
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
9 10 4.326833 3.237146 0.1392678 2.567749
8 9 4.438962 3.319902 0.1441795 2.634348
7 8 4.481515 3.319321 0.1432834 2.648040
6 7 4.500671 3.355770 0.1451928 2.667211
5 6 4.470448 3.408200 0.1486145 2.675754
4 5 4.519053 3.449184 0.1495761 2.705938
3 4 4.513945 3.455816 0.1504288 2.706730
2 3 4.772107 3.600316 0.1581521 2.843525
1 2 5.653144 4.415512 0.2073519 3.425336
The optimum parameter (knots) for the step-function model is 10 knots, based on the smallest weighted score. The model is then refit with 10 cut points (giving 11 intervals) on the training set, and the test set is used to check whether the non-linear model overfits or underfits.
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'horsepower'
xlab <- "horsepower"
ylab <- "mile per gallon"
model_type <- "step function"
Knots <- 10
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
id = seq(1, length(unique(df_train$horsepower)), by = round(length(unique(df_train$horsepower))/(Knots+1)))
breaks = sort(unique(df_train$horsepower))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(df_train$horsepower)
breaks[length(breaks)+1] <- (-2)*min(df_train$horsepower)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
df_train$gr <- cut(df_train$horsepower,breaks=breaks,labels=labels, include.lowest=TRUE)
df_test$gr <- cut(df_test$horsepower,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = df_train)
#train
prediksi <- predict(sf_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$horsepower, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function overall knots 10")
ix <- sort(df_train$horsepower,index.return=T)$ix
lines(df_train[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(sf_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$horsepower, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function overall knots 10")
ix <- sort(df_test$horsepower,index.return=T)$ix
lines(df_test[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
tipe RMSES MAES MAPES WEIGHTED
1 training 4.199674 3.104994 0.1330575 2.479242
2 testing 4.211224 3.264680 0.1447796 2.540228
B.2.3 Basis Splines (B-splines)
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
bs_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
regspline = glm( mpg ~ bs(horsepower, df = d), data=training)
prediksi <- predict(regspline, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'horsepower'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- bs_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
KNOTS CV RMSES MAES MAPES WEIGHTED
1 2 CV1 4.672475 3.590570 0.1486807 2.803908
2 2 CV2 4.961901 3.520806 0.1637465 2.882151
3 2 CV3 4.334660 3.381311 0.1358210 2.617264
4 2 CV4 3.765460 2.981998 0.1334898 2.293649
5 2 CV5 4.788818 3.475956 0.1445197 2.803098
6 3 CV1 4.672475 3.590570 0.1486807 2.803908
7 3 CV2 4.961901 3.520806 0.1637465 2.882151
8 3 CV3 4.334660 3.381311 0.1358210 2.617264
9 3 CV4 3.765460 2.981998 0.1334898 2.293649
10 3 CV5 4.788818 3.475956 0.1445197 2.803098
11 4 CV1 4.477430 3.406691 0.1416677 2.675263
12 4 CV2 4.988939 3.552203 0.1684208 2.903188
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
4 5 4.449046 3.277204 0.1407714 2.622340
5 6 4.450946 3.275481 0.1408224 2.622416
6 7 4.472071 3.284396 0.1412787 2.632582
8 9 4.489045 3.276452 0.1412380 2.635578
7 8 4.490068 3.298059 0.1420330 2.643387
9 10 4.498160 3.292837 0.1419609 2.644319
3 4 4.495600 3.372730 0.1459311 2.671421
1 2 4.504663 3.390128 0.1452515 2.680014
2 3 4.504663 3.390128 0.1452515 2.680014
The optimum parameter for the basis-spline (B-spline) model is df = 5 (treated in this report as 4 knots and 5 intervals), based on the smallest weighted score. The model is then refit with bs(horsepower, df = 5) on the training set, and the test set is used to check whether the non-linear model overfits or underfits.
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'horsepower'
xlab <- "horsepower"
ylab <- "mile per gallon"
model_type <- "baseline spline"
Knots <- 5
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ bs(horsepower, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$horsepower, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 4")
ix <- sort(df_train$horsepower,index.return=T)$ix
lines(df_train[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(horsepower, df = Knots),"knots"), col="grey50", lty=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$horsepower, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 4")
ix <- sort(df_test$horsepower,index.return=T)$ix
lines(df_test[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(horsepower, df = Knots),"knots"), col="grey50", lty=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
tipe RMSES MAES MAPES WEIGHTED
1 training 4.331941 3.203326 0.1376315 2.557633
2 testing 4.124905 3.131825 0.1398707 2.465534
B.2.4 Natural Splines
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
ns_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
regspline = glm( mpg ~ ns(horsepower, df = d), data=training)
prediksi <- predict(regspline, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'horsepower'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- ns_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1 2 CV1 4.645561 3.544641 0.1463739 2.778859
2 2 CV2 4.905000 3.450502 0.1607161 2.838739
3 2 CV3 4.302078 3.308859 0.1318420 2.580926
4 2 CV4 3.720110 3.004466 0.1352199 2.286599
5 2 CV5 4.734787 3.443485 0.1431158 2.773796
6 3 CV1 4.639128 3.549877 0.1468206 2.778609
7 3 CV2 4.965028 3.519192 0.1643371 2.882852
8 3 CV3 4.385813 3.422505 0.1377939 2.648704
9 3 CV4 3.735204 2.953269 0.1319278 2.273467
10 3 CV5 4.756529 3.462496 0.1441429 2.787723
11 4 CV1 4.489656 3.360756 0.1389494 2.663121
12 4 CV2 4.993260 3.522083 0.1658521 2.893732
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
9 10 4.444106 3.261510 0.1405965 2.615404
6 7 4.452166 3.269877 0.1405072 2.620850
8 9 4.468448 3.269994 0.1407372 2.626393
7 8 4.460465 3.285620 0.1413188 2.629135
5 6 4.461346 3.307380 0.1422334 2.636986
4 5 4.460384 3.316710 0.1425750 2.639889
1 2 4.461507 3.350391 0.1434535 2.651784
3 4 4.479450 3.352933 0.1444600 2.658948
2 3 4.496341 3.381468 0.1450045 2.674271
The optimum parameter (knots) for the natural spline model is 9 knots (df = 10, i.e. 10 intervals), based on the smallest weighted score. The model is then refit as a natural spline with 9 knots on the training set, and the training and testing errors are compared to check whether this non-linear model is overfitting or underfitting.
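As a quick check of how df maps to knot positions, a small sketch (assuming the splines package and the Auto data are loaded) prints the 9 interior knots that ns() places at quantiles of horsepower when df = 10:
library(splines)
# ns() with df = 10 (and no intercept) uses df - 1 = 9 interior knots at quantiles of x
attr(ns(Auto$horsepower, df = 10), "knots")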
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'horsepower'
xlab <- "horsepower"
ylab <- "mile per gallon"
model_type <- "natural spline"
Knots <- 10
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ ns(horsepower, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$horsepower, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 9")
ix <- sort(df_train$horsepower,index.return=T)$ix
lines(df_train[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(horsepower, df = Knots),"knots"), col="grey50", lty=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$horsepower, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 9")
ix <- sort(df_test$horsepower,index.return=T)$ix
lines(df_test[ix,'horsepower'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(horsepower, df = Knots),"knots"), col="grey50", lty=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.249171 3.098283 0.1329692 2.493474
2 testing 4.058468 3.093289 0.1378698 2.429875
===============================================
x : Horsepower
y : mpg
===============================================
Model : Polynomial Regression order 7
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.327981 3.194352 0.1370348 2.553123
2 testing 4.111706 3.112849 0.1390994 2.454552
===============================================
Model : Step Functions Knots 10
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.199674 3.104994 0.1330575 2.479242
2 testing 4.211224 3.264680 0.1447796 2.540228
===============================================
Model : Baseline Splines Knots 4
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.331941 3.203326 0.1376315 2.557633
2 testing 4.124905 3.131825 0.1398707 2.465534
===============================================
Model : Natural Splines Knots 9
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.249171 3.098283 0.1329692 2.493474
2 testing 4.058468 3.093289 0.1378698 2.429875
===============================================
The best model with Horsepower as the predictor variable is Natural Splines with 9 knots.
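The comparison above can also be assembled programmatically; a minimal sketch (comparison is a hypothetical name, and the testing weighted scores are copied from the tables above) picks the model with the smallest score:
comparison <- data.frame(
  model = c("Polynomial Regression order 7", "Step Functions Knots 10",
            "Baseline Splines Knots 4", "Natural Splines Knots 9"),
  weighted_test = c(2.454552, 2.540228, 2.465534, 2.429875)
)
comparison[which.min(comparison$weighted_test), ]  # the natural spline has the lowest test score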
B.3 Weight vs MPG
B.3.1 Polynomial Regression
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
pol_reg_cv <- function(df,pcg=0.75,K=10,D=5,dep_col=FALSE,columns){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
degree <- matrix(nrow = D*K, ncol = 1)
CV <- matrix(nrow = D*K, ncol = 1)
RMSES <- matrix(nrow = D*K, ncol = 1)
MAES <- matrix(nrow = D*K, ncol = 1)
MAPES <- matrix(nrow = D*K, ncol = 1)
i <- 0
for (d in 1:D) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
polinom_train <- lm(mpg ~ poly(weight,d), data = training)
prediksi <- predict(polinom_train, newdata = validation)
actual <- validation$mpg
i <- i+1
degree[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(degree,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(degree) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
D <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- pol_reg_cv(df = df, pcg = pcg, K = K, D = D, dep_col = dep_col, columns = columns)
model_pol_all$df_pol
   degree  CV    RMSES     MAES       MAPES WEIGHTED
1 1 CV1 4.098737 3.262634 0.13677871 2.499383
2 1 CV2 4.407682 3.413286 0.15665201 2.659207
3 1 CV3 4.248817 2.989446 0.11483872 2.451034
4 1 CV4 3.999092 3.104835 0.13741690 2.413781
5 1 CV5 4.982291 3.778535 0.17524118 2.978689
6 2 CV1 4.025090 3.204660 0.13329280 2.454347
7 2 CV2 4.065905 3.158427 0.13959506 2.454642
8 2 CV3 4.169867 2.763159 0.09937366 2.344133
9 2 CV4 3.807461 2.818342 0.12229193 2.249365
10 2 CV5 4.862025 3.460551 0.14721963 2.823265
11 3 CV1 4.025691 3.208582 0.13351829 2.455931
12 3 CV2 4.103869 3.182700 0.14039059 2.475653
[ reached 'max' / getOption("max.print") -- omitted 38 rows ]
model_pol_all$df_pol_agg
   degree RMSE_score MAE_score MAPE_score WEIGHTED_score
2 2 4.186069 3.081028 0.1283546 2.465151
5 5 4.204156 3.088254 0.1283595 2.473590
3 3 4.203005 3.092800 0.1289282 2.474911
4 4 4.210471 3.095284 0.1288693 2.478208
6 6 4.223372 3.109551 0.1294450 2.487456
7 7 4.246695 3.125095 0.1300688 2.500620
9 9 4.260505 3.129379 0.1307195 2.506868
8 8 4.259763 3.137417 0.1311929 2.509457
10 10 4.291399 3.141929 0.1315923 2.521640
1 1 4.347324 3.309747 0.1441855 2.600419
The optimum parameter (degree) for the polynomial regression model is degree 2, based on the smallest weighted score. The model is then refit as a degree-2 polynomial regression on the training set, and the training and testing errors are compared to check whether this non-linear model is overfitting or underfitting.
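For context, poly(weight, 2) builds an orthogonal polynomial basis, so the fitted curve is simply quadratic in weight; a minimal sketch (quad_fit is a hypothetical name; assumes the Auto data) shows the fitted coefficients:
quad_fit <- lm(mpg ~ poly(weight, 2), data = Auto)
coef(summary(quad_fit))
# poly(weight, 2, raw = TRUE) would report the coefficients on weight and weight^2 directly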
degree <- 2
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
df_train <- dfrand[model_pol_all$train_ind,]
df_test <- dfrand[model_pol_all$test_ind,]
polinom_train = lm(mpg ~ poly(weight,degree), data = df_train)
#train
prediksi <- predict(polinom_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$weight, df_train$mpg, pch=19, col="coral",xlab='weight', ylab='mpg',main="Regresi Polinomial Ordo 2")
ix <- sort(df_train$weight,index.return=T)$ix
lines(df_train[ix,'weight'], prediksi[ix], main = 'regresi polinomial', col="blue",type="l", lwd=2)
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(polinom_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$weight, df_test$mpg, pch=19, col="coral",xlab='weight', ylab='mpg',main="Regresi Polinomial Ordo 2")
ix <- sort(df_test$weight,index.return=T)$ix
lines(df_test[ix,'weight'], prediksi[ix], main = 'regresi polinomial', col="blue",type="l", lwd=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.173080 3.068577 0.1277039 2.456454
2 testing 4.152997 2.952700 0.1265259 2.410741
B.3.2 Step Functions
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
piecewise_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
id = seq(1, length(unique(training$weight)), by = round(length(unique(training$weight))/(d+1)))
breaks = sort(unique(training$weight))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(training$weight)
breaks[length(breaks)+1] <- (-2)*min(training$weight)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
training$gr <- cut(training$weight,breaks=breaks,labels=labels, include.lowest=TRUE)
validation$gr <- cut(validation$weight,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = training)
prediksi <- predict(sf_train, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'weight'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- piecewise_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES     MAPES WEIGHTED
1 2 CV1 5.489604 4.067482 0.1744693 3.243852
2 2 CV2 4.601544 3.717976 0.1692796 2.829600
3 2 CV3 4.740351 3.672659 0.1456014 2.852871
4 2 CV4 5.299839 4.478935 0.2171118 3.331962
5 2 CV5 6.212907 5.117106 0.2662996 3.865438
6 3 CV1 4.444610 3.517650 0.1494944 2.703918
7 3 CV2 5.052088 4.180859 0.1958051 3.142917
8 3 CV3 4.738872 3.296872 0.1247334 2.720159
9 3 CV4 4.449043 3.448903 0.1651446 2.687697
10 3 CV5 5.507773 4.173390 0.2066991 3.295954
11 4 CV1 4.259783 3.401037 0.1433772 2.601399
12 4 CV2 4.717958 3.814678 0.1734151 2.902017
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
9 10 4.265315 3.132899 0.1328419 2.510352
8 9 4.311208 3.185551 0.1330501 2.543270
7 8 4.321397 3.212570 0.1359790 2.556649
6 7 4.376691 3.291181 0.1401156 2.602662
5 6 4.391405 3.304476 0.1399056 2.611929
3 4 4.477855 3.473564 0.1543016 2.701907
4 5 4.566261 3.524613 0.1540219 2.748299
2 3 4.838477 3.723535 0.1683753 2.910129
1 2 5.268849 4.210832 0.1945523 3.224744
The optimum parameter (knots) for the step function model is 10 knots (11 intervals), based on the smallest weighted score. The model is then refit as a step function with 10 knots on the training set, and the training and testing errors are compared to check whether this non-linear model is overfitting or underfitting.
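The step function works by discretising weight with cut() and fitting one mean per bin; a small illustrative sketch (brks and grp are hypothetical names, and the quartile breakpoints are only an example, not the breaks used above) shows the idea:
brks <- c(-Inf, quantile(Auto$weight, probs = c(0.25, 0.50, 0.75)), Inf)
grp  <- cut(Auto$weight, breaks = brks, labels = paste0("gr", 1:4))
table(grp)                   # how many cars fall into each bin
tapply(Auto$mpg, grp, mean)  # the step heights that lm(mpg ~ grp) would fit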
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'weight'
xlab <- "weight"
ylab <- "mile per gallon"
model_type <- "step function"
Knots <- 10
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
id = seq(1, length(unique(df_train$weight)), by = round(length(unique(df_train$weight))/(Knots+1)))
breaks = sort(unique(df_train$weight))[id]
breaks=breaks[-c(1,length(breaks))]
breaks[length(breaks)+1] <- 2*max(df_train$weight)
breaks[length(breaks)+1] <- (-2)*min(df_train$weight)
labels=paste('gr',seq_len(length(breaks)-1),sep='')
df_train$gr <- cut(df_train$weight,breaks=breaks,labels=labels, include.lowest=TRUE)
df_test$gr <- cut(df_test$weight,breaks=breaks,labels=labels, include.lowest=TRUE)
sf_train <- lm(mpg ~ gr, data = df_train)
#train
prediksi <- predict(sf_train, newdata = df_train)
actual <- df_train$mpg
plot(df_train$weight, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function overall knots 10")
ix <- sort(df_train$weight,index.return=T)$ix
lines(df_train[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(sf_train, newdata = df_test)
actual <- df_test$mpg
plot(df_test$weight, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="step function overall knots 10")
ix <- sort(df_test$weight,index.return=T)$ix
lines(df_test[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.157577 3.089568 0.1304351 2.459193
2 testing 4.134778 3.060675 0.1331605 2.442871
B.3.3 Baseline Splines
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
bs_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
regspline = glm( mpg ~ bs(weight, df = d), data=training)
prediksi <- predict(regspline, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'weight'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- bs_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES      MAPES WEIGHTED
1 2 CV1 4.025691 3.208582 0.13351829 2.455931
2 2 CV2 4.103869 3.182700 0.14039059 2.475653
3 2 CV3 4.172695 2.753937 0.09868328 2.341772
4 2 CV4 3.810854 2.817392 0.12225921 2.250168
5 2 CV5 4.901914 3.501390 0.14978985 2.851031
6 3 CV1 4.025691 3.208582 0.13351829 2.455931
7 3 CV2 4.103869 3.182700 0.14039059 2.475653
8 3 CV3 4.172695 2.753937 0.09868328 2.341772
9 3 CV4 3.810854 2.817392 0.12225921 2.250168
10 3 CV5 4.901914 3.501390 0.14978985 2.851031
11 4 CV1 4.032868 3.195323 0.13256580 2.453586
12 4 CV2 4.155572 3.234149 0.14171161 2.510478
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
1 2 4.203005 3.092800 0.1289282 2.474911
2 3 4.203005 3.092800 0.1289282 2.474911
3 4 4.209264 3.094364 0.1288044 2.477478
4 5 4.224417 3.110194 0.1293965 2.488002
5 6 4.232796 3.111767 0.1293421 2.491301
6 7 4.246197 3.105733 0.1289232 2.493617
7 8 4.267770 3.119323 0.1294530 2.505515
8 9 4.298056 3.123232 0.1293944 2.516894
9 10 4.418708 3.164247 0.1306858 2.571214
The optimum parameter (knots) for the baseline spline model is 1 knot (df = 2, i.e. 2 intervals), based on the smallest weighted score. The model is then refit as a baseline spline with this setting on the training set, and the training and testing errors are compared to check whether this non-linear model is overfitting or underfitting.
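Note that bs() builds cubic B-splines by default, which require at least 3 degrees of freedom; asking for df = 2 triggers a warning and is treated as df = 3 (a plain cubic with no interior knots), which is likely why the rows for df 2 and df 3 in the table above are identical. A small sketch (assuming the splines package and the Auto data):
library(splines)
dim(bs(Auto$weight, df = 2))            # warns that df was too small; returns a 3-column basis
attr(bs(Auto$weight, df = 3), "knots")  # numeric(0): no interior knots
attr(bs(Auto$weight, df = 4), "knots")  # one interior knot at the median of weight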
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'weight'
xlab <- "weight"
ylab <- "mile per gallon"
model_type <- "baseline spline"
Knots <- 2
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ bs(weight, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$weight, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 1")
ix <- sort(df_train$weight,index.return=T)$ix
lines(df_train[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(weight, df = Knots),"knots"), col="grey50", lty=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$weight, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="baseline spline knots 1")
ix <- sort(df_test$weight,index.return=T)$ix
lines(df_test[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(weight, df = Knots),"knots"), col="grey50", lty=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.173074 3.068949 0.1277261 2.456583
2 testing 4.152980 2.952380 0.1265228 2.410627
B.3.4 Natural Splines
split_dataset <- function(df,pcg=0.75,K=10,dep_col=FALSE,columns){
set.seed(1501211042)
df <- df[sample(nrow(df)),]
if (dep_col==TRUE) {
indeks <- sample.split(df[,columns],SplitRatio = pcg)
df1 <- df[which(indeks==TRUE),]
df2 <- df[which(indeks==FALSE),]
df <- rbind(df1,df2)
} else {df <- df}
row.names(df) <- seq_len(nrow(df))
train_ind <- seq_len(floor(pcg * nrow(df)))
test_ind <- sort(as.integer(row.names(df[-train_ind, ])))
df_train <- df[train_ind, ]
df_test <- df[test_ind, ]
smp_size <- round(nrow(df_train)/K)
group_ind <- matrix(nrow = smp_size, ncol = K)
indeks <- as.integer(row.names(df_train))
min_indeks <- min(indeks)
for (i in 1:K){
ind <- seq(min_indeks,min_indeks+smp_size-1,1)
group_ind[,i] <- ind
min_indeks <- min_indeks+smp_size
}
df_group <- as.data.frame(group_ind)
cvlist <- paste('CV',c(1:K),sep = '')
colnames(df_group) <- cvlist
return(list('train_ind' = train_ind,'test_ind'= test_ind,'df_group' = df_group, 'dfrand'= df))
}
ns_cv <- function(df,pcg=0.75,K=10,Knots=5,dep_col=FALSE,columns,xcols){
sp_dataset <- split_dataset(df = df,pcg = pcg,K = K,dep_col = dep_col, columns = columns)
dfrand <- sp_dataset$dfrand
df_group <- sp_dataset$df_group
train_ind <- sp_dataset$train_ind
test_ind <- sp_dataset$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
KNOTS <- matrix(nrow = (Knots-1)*K, ncol = 1)
CV <- matrix(nrow = (Knots-1)*K, ncol = 1)
RMSES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAES <- matrix(nrow = (Knots-1)*K, ncol = 1)
MAPES <- matrix(nrow = (Knots-1)*K, ncol = 1)
i <- 0
for (d in 2:Knots) {
for (cv in colnames(df_group)) {
`%!in%` <- Negate(`%in%`)
training <- df_train[row.names(df_train) %!in% df_group[,cv],]
validation <- df_train[row.names(df_train) %in% df_group[,cv],]
regspline = glm( mpg ~ ns(weight, df = d), data=training)
prediksi <- predict(regspline, newdata = validation)
actual <- validation$mpg
i <- i+1
KNOTS[i] <- d
CV[i] <- cv
RMSES[i] <- RMSE(prediksi,actual)
MAES[i] <- MAE(prediksi,actual)
MAPES[i] <- MAPE(prediksi, actual)
}
}
WEIGHTED <- (RMSES+MAES+MAPES)/3
df_pol <- data.frame(KNOTS,CV,RMSES,MAES,MAPES,WEIGHTED)
df_pol_agg <- as.data.frame(df_pol %>% group_by(KNOTS) %>% summarise(RMSE_score = mean(RMSES),
MAE_score = mean(MAES),
MAPE_score = mean(MAPES),
WEIGHTED_score = mean(WEIGHTED)))
df_pol_agg <- df_pol_agg[order(df_pol_agg$WEIGHTED_score),]
return(list('train_ind' = train_ind,'test_ind'= test_ind,
'df_group' = df_group, 'dfrand'= dfrand,
'df_pol'=df_pol,'df_pol_agg'=df_pol_agg))
}
attach(Auto)
df <- as.data.frame(cbind(mpg,displacement,horsepower,weight,acceleration))
pcg <- 0.70
K <- 5
xcols <-'weight'
Knots <- 10
dep_col <- FALSE
columns <- 'origin'
model_pol_all <- ns_cv(df = df, pcg = pcg, K = K, Knots = Knots, dep_col = dep_col, columns = columns, xcols = xcols)
model_pol_all$df_pol
   KNOTS  CV    RMSES     MAES      MAPES WEIGHTED
1 2 CV1 4.024755 3.186773 0.13220455 2.447911
2 2 CV2 4.059506 3.171044 0.14066066 2.457070
3 2 CV3 4.167291 2.740486 0.09757765 2.335118
4 2 CV4 3.800988 2.813894 0.12224855 2.245710
5 2 CV5 4.864412 3.466742 0.14812645 2.826427
6 3 CV1 4.023849 3.197002 0.13287077 2.451241
7 3 CV2 4.139431 3.205717 0.14084242 2.495330
8 3 CV3 4.168504 2.741861 0.09770511 2.336023
9 3 CV4 3.807571 2.810138 0.12215481 2.246621
10 3 CV5 4.884863 3.489401 0.14945498 2.841240
11 4 CV1 4.024065 3.179424 0.13168007 2.445057
12 4 CV2 4.164382 3.240913 0.14216629 2.515820
[ reached 'max' / getOption("max.print") -- omitted 33 rows ]
model_pol_all$df_pol_agg
  KNOTS RMSE_score MAE_score MAPE_score WEIGHTED_score
1 2 4.183390 3.075788 0.1281636 2.462447
2 3 4.204844 3.088824 0.1286056 2.474091
3 4 4.207898 3.090500 0.1286277 2.475675
6 7 4.261913 3.099014 0.1284067 2.496444
4 5 4.244174 3.125766 0.1302307 2.500057
7 8 4.277739 3.108912 0.1288897 2.505180
8 9 4.284369 3.113305 0.1290438 2.508906
5 6 4.265652 3.132358 0.1302423 2.509417
9 10 4.293648 3.126213 0.1297538 2.516538
The optimum parameter (knots) for the natural spline model is 1 knot (df = 2, i.e. 2 intervals), based on the smallest weighted score. The model is then refit as a natural spline with 1 knot on the training set, and the training and testing errors are compared to check whether this non-linear model is overfitting or underfitting.
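Here ns() with df = 2 places a single interior knot, by default at the median of weight; a small sketch (assuming the splines package and the Auto data) confirms the knot location:
library(splines)
attr(ns(Auto$weight, df = 2), "knots")  # one interior knot at the 50% quantile of weight
median(Auto$weight)                     # should match the value above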
tipe_models <- matrix(nrow = 2, ncol = 1)
tipe <- matrix(nrow = 2, ncol = 1)
populasi <- matrix(nrow = 2, ncol = 1)
RMSES <- matrix(nrow = 2, ncol = 1)
MAPES <- matrix(nrow = 2, ncol = 1)
MAES <- matrix(nrow = 2, ncol = 1)
WEIGHTED <- matrix(nrow = 2, ncol = 1)
xcols <- 'weight'
xlab <- "weight"
ylab <- "mile per gallon"
model_type <- "natural spline"
Knots <- 2
dfrand <- model_pol_all$dfrand
df_group <- model_pol_all$df_group
train_ind <- model_pol_all$train_ind
test_ind <- model_pol_all$test_ind
df_train <- dfrand[train_ind,]
df_test <- dfrand[test_ind,]
regspline = lm( mpg ~ ns(weight, df = Knots), data=df_train)
#train
prediksi <- predict(regspline, newdata = df_train)
actual <- df_train$mpg
plot(df_train$weight, df_train$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 1")
ix <- sort(df_train$weight,index.return=T)$ix
lines(df_train[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(weight, df = Knots),"knots"), col="grey50", lty=2)
tipe[1] <- 'training'
RMSES[1] <- RMSE(prediksi,actual)
MAES[1] <- MAE(prediksi,actual)
MAPES[1] <- MAPE(prediksi, actual)
WEIGHTED[1] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
#test
prediksi <- predict(regspline, newdata = df_test)
actual <- df_test$mpg
plot(df_test$weight, df_test$mpg, pch=19, col="coral",xlab=xlab, ylab=ylab,main="natural spline knots 1")
ix <- sort(df_test$weight,index.return=T)$ix
lines(df_test[ix,'weight'], prediksi[ix], main = model_type, col="blue",type="l", lwd=2)
abline(v=attr(ns(weight, df = Knots),"knots"), col="grey50", lty=2)
tipe[2] <- 'testing'
RMSES[2] <- RMSE(prediksi,actual)
MAES[2] <- MAE(prediksi,actual)
MAPES[2] <- MAPE(prediksi, actual)
WEIGHTED[2] <- (RMSE(prediksi,actual)+MAPE(prediksi, actual)+MAE(prediksi,actual))/3
df <- data.frame(tipe,RMSES,MAES,MAPES,WEIGHTED)
df
      tipe    RMSES     MAES     MAPES WEIGHTED
1 training 4.170799 3.061577 0.1273529 2.453243
2 testing 4.154022 2.952269 0.1264608 2.410917
===============================================
x : Weight
y : mpg
===============================================
Model : Polynomial Regression order 2
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.173080 3.068577 0.1277039 2.456454
2 testing 4.152997 2.952700 0.1265259 2.410741
===============================================
Model : Step Functions Knots 10
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.157577 3.089568 0.1304351 2.459193
2 testing 4.134778 3.060675 0.1331605 2.442871
===============================================
Model : Baseline Splines Knots 1
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.173074 3.068949 0.1277261 2.456583
2 testing 4.152980 2.952380 0.1265228 2.410627
===============================================
Model : Natural Splines Knots 1
no tipe RMSES MAES MAPES WEIGHTED
1 training 4.170799 3.061577 0.1273529 2.453243
2 testing 4.154022 2.952269 0.1264608 2.410917
===============================================
The best model with Weight as the predictor variable is Baseline Splines with 1 knot.
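As a closing sketch (final_fit is a hypothetical name and the 3000 lb weight is only an illustration), the selected baseline spline could be refit on the full Auto data and used to predict mpg for a new car weight:
library(splines)
final_fit <- lm(mpg ~ bs(weight, df = 2), data = Auto)   # df = 2 is promoted to a plain cubic (see the note above)
predict(final_fit, newdata = data.frame(weight = 3000))  # predicted mpg for a hypothetical 3000 lb car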