P8 STA581 Moving Beyond Linearity
SOAL
Data bisa diperoleh untuk tugas dapat diperoleh di package ISLR. Silahkan Install dulu packagenya.
Gunakan model-model non-linier yang sudah dipelajari pada data Auto dengan peubah respon mpg dan pilih tiga kolom untuk dijadikan peubah prediktor.
- Apakah ada bukti untuk hubungan non-linier untuk peubah-peubah yang anda pilih? Buat visualisasi untuk Mendukung jawaban Anda.
- Tentukan model non-linear terbaik untuk masing pasangan peubah (Secara visual dan/atau secara empiris.
Packages
library(tidyverse)
library(splines)
library(rsample) #bootstrapping
library(ISLR)
library(cowplot) #Streamlined Plot Theme and Plot Annotations for 'ggplot2'
library(caret) # cross-validation
library(gridExtra) # combining graphs
library(ggplot2)
library(dplyr)
library(purrr)
library(ggpubr)
library(knitr)
library(DT)
library(kableExtra)
library(gam) # generalized additive models
library(mlr3measures)Eksplorasi Data
Data yang digunakan adalah Auto dari package ISLR yang terdiri dari 392 observasi dan 9 variabel. Dataset ini diambil dari perpustakaan StatLib yang dikelola di Carnegie Mellon University. Dataset ini digunakan dalam American Statistical Association Exposition 1983.Variabel yang akan digunakan pada kasus ini adalah :
mpg: miles per gallonhorsepower: Engine horsepowerweight: Vehicle weight (lbs.)acceleration: Time to accelerate from 0 to 60 mph (sec.)
Auto = Auto %>% select(mpg, horsepower, acceleration, weight)
datatable(Auto)Visualisasi Data
hp <- ggplot(Auto,aes(x=horsepower, y=mpg), add=TRUE) +
geom_point(alpha=0.55, color="#3399ff") +
theme_bw()
we <- ggplot(Auto,aes(x=weight, y=mpg), add=TRUE) +
geom_point(alpha=0.55, color="#6600cc") +
theme_bw()
ac <- ggplot(Auto,aes(x=acceleration, y=mpg), add=TRUE) +
geom_point(alpha=0.55, color="#ff3333") +
theme_bw()
plot_grid(hp, we, ac, labels = c("mpg vs horsepower","mpg vs Weight", "mpg vs Acceleration"), label_size = 6) Jika kita amati lebih detail dari scatter plot diatas, pola data ketiga scatterplot cenderung tidak linier, maka dari itu ada beberapa pendekatan yang bisa digunakan.
Dalam kasus kali ini akan digunakan pendekatan Regresi Polinomial, Piecewise constant, dan Natural cubic splines.
Regresi Polinomial
Data mpg & horsepower
Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
ordo <- 2:10
best_poly <- map_dfr(ordo, function(i) {
metric_poly <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ poly(horsepower,ordo=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly
# menghitung rata-rata untuk 10 folds
mean_metric_poly <- colMeans(metric_poly)
mean_metric_poly
}
)
(polyh <- cbind(ordo=ordo,best_poly))## ordo rmse mae
## 1 2 4.351129 3.266460
## 2 3 4.355430 3.266945
## 3 4 4.367009 3.280452
## 4 5 4.318767 3.249003
## 5 6 4.317431 3.257272
## 6 7 4.293242 3.233376
## 7 8 4.309511 3.251758
## 8 9 4.323321 3.244919
## 9 10 4.371903 3.274887
Plot CV
polyh$ordo <- as.factor(polyh$ordo)
ggplot(data=best_poly, aes(x=ordo, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Root Mean Square Error")Plot
A <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="6666cc") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Ordo 2") +
xlab("Horse Power") +
ylab("mile per gallon")
B <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="6666cc") +
stat_smooth(method = "lm",
formula = y~poly(x,7,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Ordo 7") +
xlab("Horse Power") +
ylab("mile per gallon")
ggarrange(A, B, ncol = 2, nrow = 1)Jika dilihat berdasarkan plot, meskipun nilai RMSE Regresi Polinomial ordo 7 lebih kecil daripada ordo 2, model terbaik yang dipilih adalah Regresi Polinomial ordo 2. Karena untuk ordo 7 pada bagian ujung ujung plot cenderung tidak stabil.
Data mpg & weight
Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
ordo <- 2:10
best_poly <- map_dfr(ordo, function(i) {
metric_poly <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ poly(weight,ordo=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly
# menghitung rata-rata untuk 10 folds
mean_metric_poly <- colMeans(metric_poly)
mean_metric_poly
}
)
(polyw <- cbind(ordo=ordo,best_poly))## ordo rmse mae
## 1 2 4.149977 3.060704
## 2 3 4.158482 3.068020
## 3 4 4.165585 3.067210
## 4 5 4.169760 3.079497
## 5 6 4.178027 3.098482
## 6 7 4.170375 3.100742
## 7 8 4.181430 3.108979
## 8 9 4.254235 3.164729
## 9 10 4.299992 3.167556
Plot CV
polyw$ordo <- as.factor(polyw$ordo)
ggplot(data=best_poly, aes(x=ordo, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Root Mean Square Error")Plot
poly2<-ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="#9900FF") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Ordo 2") +
xlab("weight") +
ylab("mile per gallon") Secara empiris dapat dilihat bahwa nilai rmse dan mae pada ordo 2 memiliki nilai paling kecil dibanding ordo lain dan dapat dibutkian pada plot diatas.
Data mpg & acceleration
Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
ordo <- 2:10
best_poly <- map_dfr(ordo, function(i) {
metric_poly <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ poly(acceleration,ordo=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly
# menghitung rata-rata untuk 10 folds
mean_metric_poly <- colMeans(metric_poly)
mean_metric_poly
}
)
(polya <- cbind(ordo=ordo,best_poly))## ordo rmse mae
## 1 2 7.037689 5.755305
## 2 3 7.048668 5.792450
## 3 4 7.008647 5.700795
## 4 5 7.042178 5.729396
## 5 6 7.046243 5.704385
## 6 7 7.097228 5.711475
## 7 8 7.167747 5.767367
## 8 9 7.150072 5.744992
## 9 10 7.164808 5.754932
Plot CV
polya$ordo <- as.factor(polya$ordo)
ggplot(data=best_poly, aes(x=ordo, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Root Mean Square Error")Plot
poly31 <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="#FF0033") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Ordo 2") +
xlab("acceleration") +
ylab("mile per gallon")
poly32 <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="#FF9966") +
stat_smooth(method = "lm",
formula = y~poly(x,4,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Ordo 4") +
xlab("acceleration") +
ylab("mile per gallon")
poly33 <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="#FF00CC") +
stat_smooth(method = "lm",
formula = y~poly(x,6,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Ordo 6") +
xlab("acceleration") +
ylab("mile per gallon")
ggarrange(poly31, poly32, poly33, ncol = 2, nrow = 2)Berdasarkan nilai rmse dan mae ordo 4 merupakan model terbaik untuk regresi polinomial namun jika dilihat secara visualisasi dapat dilihat bahwa kurva untuk ordo 4 tidak stabil sehingga dibandingkan dengan kedua ordo lainnya yang memiliki nilai rmse dan mae terkecil yaitu ordo 2 dan 6.
Dari ketiga ordo diatas dapat dilihat bahwa ordo 2 menunjukkan kurva yang cenderung stabil dibanding kedua ordo lainnya.
Piecewise Constant
Data mpg & horsepower
###Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
breaks <- 2:20
best_tangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- Auto[x$in_id,]
training$horsepower <- cut(training$horsepower,i)
mod <- lm(mpg ~ horsepower, data=training)
labs_x <- levels(mod$model[,2])
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
testing <- Auto[-x$in_id,]
horsepower_new <- cut(testing$horsepower,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
pred <- predict(mod, newdata=list(horsepower=horsepower_new))
truth <- testing$mpg
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_tangga
# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
})
(tanggah <- cbind(breaks=breaks,best_tangga))## breaks rmse mae
## 1 2 6.147626 4.790234
## 2 3 5.775792 4.521829
## 3 4 4.985270 3.774549
## 4 5 4.712623 3.571614
## 5 6 4.644383 3.548375
## 6 7 4.554116 3.379980
## 7 8 4.437597 3.405433
## 8 9 4.549668 3.471999
## 9 10 4.589172 3.455531
## 10 11 4.531039 3.380931
## 11 12 4.442697 3.321151
## 12 13 4.527126 3.375170
## 13 14 4.428824 3.270424
## 14 15 4.313162 3.286289
## 15 16 4.336661 3.302031
## 16 17 4.414343 3.347532
## 17 18 4.340715 3.221099
## 18 19 4.331798 3.224404
## 19 20 4.480710 3.320605
Plot CV rmse
tanggah$breaks <- as.factor(tanggah$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Root Mean Square Error")Plot CV mae
tanggah$breaks <- as.factor(tanggah$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=mae, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("MAE")Plot
pc8<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="#FF9966") +
stat_smooth(method = "lm",
formula = y~cut(x,8),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("K=8") +
xlab("Horse Power") + ylab("mile per gallon")
pc15<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="#FF00CC") +
stat_smooth(method = "lm",
formula = y~cut(x,15),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("K=15") +
xlab("Horse Power") + ylab("mile per gallon")
ggarrange(pc8, pc15, ncol = 2, nrow = 1) Berdasarkan nilai rmse dan mae interval ordo 8 dan ordo 15 merupakan model terbaik, namun jika dibandingkan secara visual didapatkan bahwa interval sebesar 8 lebih baik dikarenakan kurva cenderung lebih mulus.
Data mpg & weight
Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
breaks <- 2:10
best_tangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- Auto[x$in_id,]
training$weight <- cut(training$weight,i)
mod <- lm(mpg ~ weight, data=training)
labs_x <- levels(mod$model[,2])
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
testing <- Auto[-x$in_id,]
weight_new <- cut(testing$weight,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
pred <- predict(mod, newdata=list(weight=weight_new))
truth <- testing$mpg
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_tangga
# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
})tanggaw <- cbind(breaks=breaks,best_tangga)
tanggaw## breaks rmse mae
## 1 2 5.570088 4.345368
## 2 3 4.868847 3.657505
## 3 4 4.706593 3.633584
## 4 5 4.389515 3.221556
## 5 6 4.201264 3.112221
## 6 7 4.279395 3.191726
## 7 8 4.370964 3.249765
## 8 9 4.341743 3.182396
## 9 10 4.329114 3.176390
Plot CV rmse
tanggaw$breaks <- as.factor(tanggaw$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Root Mean Square Error")Plot CV mae
tanggaw$breaks <- as.factor(tanggaw$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=mae, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("MAE")Plot
pcw6<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="#FF9966") +
stat_smooth(method = "lm",
formula = y~cut(x,6),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("K=6") +
xlab("weight") + ylab("mile per gallon")Secara empiris dapat dilihat bahwa nilai rmse dan mae pada interval 6 memiliki nilai paling kecil dibanding interval lain dan dapat dibuktikan pada plot diatas.
Data mpg & acceleration
Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
breaks <- 2:12
best_tangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- Auto[x$in_id,]
training$acceleration <- cut(training$acceleration,i)
mod <- lm(mpg ~ acceleration, data=training)
labs_x <- levels(mod$model[,2])
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
testing <- Auto[-x$in_id,]
acceleration_new <- cut(testing$acceleration,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
pred <- predict(mod, newdata=list(acceleration=acceleration_new))
truth <- testing$mpg
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_tangga
# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
})
(tanggaa <- cbind(breaks=breaks,best_tangga))## breaks rmse mae
## 1 2 7.502748 6.315199
## 2 3 7.003365 5.637141
## 3 4 7.243047 5.879089
## 4 5 7.240624 5.948255
## 5 6 6.987921 5.640505
## 6 7 7.202002 5.846671
## 7 8 7.018944 5.692063
## 8 9 7.012388 5.626494
## 9 10 7.145911 5.803971
## 10 11 7.037956 5.692398
## 11 12 7.048222 5.630799
Plot CV rmse
tanggaa$breaks <- as.factor(tanggaa$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Root Mean Square Error")Plot CV mae
tanggaa$breaks <- as.factor(tanggaa$breaks)
ggplot(data=best_tangga, aes(x=breaks, y=mae, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("MAE")Plot
pca6<- ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="#FF9966") +
stat_smooth(method = "lm",
formula = y~cut(x,6),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("K=6") +
xlab("acceleration") + ylab("mile per gallon")
pca9<- ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="#FF00CC") +
stat_smooth(method = "lm",
formula = y~cut(x,9),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("K=9") +
xlab("acceleration") + ylab("mile per gallon")
ggarrange(pca6, pca9, ncol = 2, nrow = 1) Model terbaik menurut nilai rmse yaitu interval 6 sedangkan menurut nilai mae yaitu interval 9, namun jika dibandingkan secara visual didapatkan bahwa interval sebesar 6 lebih baik dikarenakan kurva cenderung lebih mulus.
Natural Cubic Splines
Data mpg & horsepower
Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:15
best_spline3 <- map_dfr(df, function(i) {
metric_spline3 <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ ns(horsepower,df=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_spline3
# menghitung rata-rata untuk 10 folds
mean_metric_spline3 <- colMeans(metric_spline3)
mean_metric_spline3
}
)
(best_spline31 <- cbind(df=df,best_spline3))## df rmse mae
## 1 2 4.330108 3.252908
## 2 3 4.348006 3.261538
## 3 4 4.329461 3.257926
## 4 5 4.313312 3.246412
## 5 6 4.310011 3.237990
## 6 7 4.294671 3.229174
## 7 8 4.293369 3.232221
## 8 9 4.281877 3.211860
## 9 10 4.260316 3.192010
## 10 11 4.288667 3.220958
## 11 12 4.285063 3.192363
## 12 13 4.302071 3.220998
## 13 14 4.322821 3.244848
## 14 15 4.305928 3.250204
Plot CV
best_spline31$df <- as.factor(best_spline31$df)
ggplot(data=best_spline3, aes(x=df, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error") Berdasarkan nilai rmse diatas didapatkan derajat 10 memiliki nilai terendah dibanding derajat lain.
Knots
attr(ns(Auto$horsepower, df=10),"knots")## 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 67.0 72.0 80.0 88.0 93.5 100.0 110.0 140.0 157.7
Plot
ncs <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, df=10),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("DF =10") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(67, 72, 80, 88, 93.5, 100, 110, 140, 157.7), col="grey50", lty=2)
ncsData mpg & weight
Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:15
best_spline3 <- map_dfr(df, function(i) {
metric_spline3 <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ ns(weight,df=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_spline3
# menghitung rata-rata untuk 10 folds
mean_metric_spline3 <- colMeans(metric_spline3)
mean_metric_spline3
}
)
(best_spline32 <- cbind(df=df,best_spline3))## df rmse mae
## 1 2 4.150320 3.062108
## 2 3 4.162964 3.064666
## 3 4 4.171112 3.073367
## 4 5 4.180162 3.092261
## 5 6 4.171277 3.102353
## 6 7 4.141103 3.061136
## 7 8 4.170552 3.098226
## 8 9 4.175108 3.091378
## 9 10 4.168457 3.089614
## 10 11 4.162104 3.087247
## 11 12 4.157470 3.086991
## 12 13 4.178029 3.102996
## 13 14 4.187941 3.109626
## 14 15 4.184376 3.108734
Plot CV
best_spline32$df <- as.factor(best_spline32$df)
ggplot(data=best_spline3, aes(x=df, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error")Berdasarkan nilai rmse diatas didapatkan derajat 7 memiliki nilai terendah dibanding derajat lain.
Knots
attr(ns(Auto$weight, df=7),"knots")## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429%
## 2074.857 2285.429 2635.000 2986.571 3446.143 4096.286
Plot
ncs <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, df=7),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("DF =7") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2074.857, 2285.429, 2635.000, 2986.571, 3446.143, 4096.286), col="grey50", lty=2)
ncsData mpg & acceleration
Cross Validation
set.seed(123)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:15
best_spline3 <- map_dfr(df, function(i) {
metric_spline3 <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ ns(acceleration,df=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_spline3
# menghitung rata-rata untuk 10 folds
mean_metric_spline3 <- colMeans(metric_spline3)
mean_metric_spline3
}
)
(best_spline33 <- cbind(df=df,best_spline3))## df rmse mae
## 1 2 7.020337 5.743563
## 2 3 7.033832 5.767504
## 3 4 6.968525 5.649933
## 4 5 6.965883 5.610253
## 5 6 6.977508 5.638072
## 6 7 6.959692 5.633256
## 7 8 6.936841 5.598141
## 8 9 6.997009 5.649171
## 9 10 6.977197 5.646953
## 10 11 7.013287 5.661907
## 11 12 7.027583 5.680325
## 12 13 7.046496 5.701470
## 13 14 7.079820 5.714688
## 14 15 7.109796 5.737668
Plot CV
best_spline33$df <- as.factor(best_spline33$df)
ggplot(data=best_spline3, aes(x=df, y=rmse, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error")Berdasarkan nilai rmse diatas didapatkan derajat 8 memiliki nilai terendah dibanding derajat lain.
Knots
attr(ns(Auto$acceleration, df=8),"knots")## 12.5% 25% 37.5% 50% 62.5% 75% 87.5%
## 12.5000 13.7750 14.5000 15.5000 16.2000 17.0250 18.7125
Plot
ncs <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, df=8),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("DF =8") +
xlab("acceleration") +
ylab("mile per gallon") +
geom_vline(xintercept = c(12.5000, 13.7750, 14.5000, 15.5000, 16.2000, 17.0250, 18.7125), col="grey50", lty=2)
ncsPerbandingan Model
Data mpg & horsepower
Secara Empiris
## nama_model rmse mae
## 1 Polinomial 4.293242 3.233376
## 2 Piecewise Constant 4.313162 3.286289
## 3 Natural Cubic Splines 4.260316 3.192010
Secara Visual
Data mpg & weight
Secara Empiris
## nama_model rmse mae
## 1 Polinomial 4.149977 3.060704
## 2 Piecewise Constant 4.201264 3.112221
## 3 Natural Cubic Splines 4.141103 3.061136
Secara Visual
Data mpg & acceleration
Secara Empiris
## nama_model rmse mae
## 1 Polinomial 7.008647 5.700795
## 2 Piecewise Constant 6.987921 5.640505
## 3 Natural Cubic Splines 6.936841 5.598141
Secara Visual
Referensi
Sartono, B.. (November 1, 2020). Moving Beyond Linearity. Retrieved from https://rpubs.com/bagusco/taklinear
Dito, G.A. (April 14, 2021). Regresi Polinomial, Regresi Fungsi Tangga dan Regresi Spline di R. Retrieved from https://gerrydito.github.io/Regresi-Polinomial,-Regresi-Fungsi-Tangga-dan-Regresi-Spline/
Amalia, Reni. (April 23, 2021). Moving Beyond Linearity. Retrived from https://www.rpubs.com/reniamelia/MovingBeyondLinearity
Rahmi, Anisa. (April, 22, 2021). Beyond Linearity. Retrived from https://www.rpubs.com/annisarahmi/beyondlienarity