Data
Data yang digunakan yaitu data Auto
. Data ini berisi tentang informasi mengenai 392 jenis kendaraan. Berikut adalah penjelasan untuk masing-masing variabel dalam data:
- mpg: miles per gallon
- cylinders: Number of cylinders between 4 and 8
- displacement: Engine displacement (cu. inches)
- horsepower: Engine horsepower
- weight: Vehicle weight (lbs.)
- acceleration: Time to accelerate from 0 to 60 mph (sec.)
- year: Model year (modulo 100)
- origin: Origin of car (1. American, 2. European, 3. Japanese)
- name: Vehicle name
Peubah respon yang digunakan yaitu mpg
(miles per gallon) sedangkan peubah prediktor yang dipilih yaitu horsepower
, displacement
, dan weight
.
Plot masing-masing peubah prediktor terhadap respon mpg
:
plot_hp<-ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color=I("black")) +
ggtitle("Plot mpg vs horsepower")+
theme_bw()
plot_dis<-ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color=I("blue")) +
ggtitle("Plot mpg vs displacement")+
theme_bw()
plot_wei<-ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color=I("red")) +
ggtitle("Plot mpg vs weight")+
theme_bw()
plot_grid(plot_hp, plot_dis, plot_wei)
Berdasarkan plot di atas terlihat bahwa tidak terdapat pola linier antara masing-masing peubah prediktor (horsepower
, displacement
, dan weight
) dan peubah respon (mpg
). Oleh karena itu, dilakukan pemodelan hubungan tak linier dengan menggunakan metode:
- Regresi Polinomial
- Regresi Fungsi Tangga
- Cubic Splines
- Natural Cubic Splines
Pemodelan mpg
vs horsepower
Regresi Polinomial
A. Regresi Polinomial (Secara Visual)
1. Regresi Polinomial Derajat 2
rp2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("Horse Power") +
ylab("mile per gallon")
2. Regresi Polinomial Derajat 3
rp3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~poly(x,3,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 3") +
xlab("Horse Power") +
ylab("mile per gallon")
3. Regresi Polinomial Derajat 4
rp4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="navy") +
stat_smooth(method = "lm",
formula = y~poly(x,4,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 4") +
xlab("Horse Power") +
ylab("mile per gallon")
4. Regresi Polinomial Derajat 5
rp5 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~poly(x,5,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 5") +
xlab("Horse Power") +
ylab("mile per gallon")
plot_grid(rp2, rp3, rp4, rp5)
Berdasarkan keempat plot di atas dapat dilihat bahwa regresi polinomial derajat 2 dan 3 memiliki pola yang teratur dibandingkan dengan regresi polinomial derajat 4 dan 5. Sehingga, pada metode regresi polinomial dipilih derajat 2, karena modelnya lebih sederhana.
B. Regresi Polinomial (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
derajat <- 1:5
polinomial1 <- map_dfr(derajat, function(i) {
metric_poli <- map_dfr(val_silang$splits,
function(x) {
mod <- lm(mpg ~ poly(horsepower,derajat=i), data=Auto[x$in_id,])
predik <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_poli
mean_metric_poli<- colMeans(metric_poli) # menghitung rata-rata untuk 10 folds
mean_metric_poli
}
)
polinomial1<- cbind(derajat=derajat,polinomial1)
Perbandingan RMSE dan MAE
datatable(polinomial1)
Plot RMSE dan MAE
polinomial1$derajat <- as.factor(polinomial1$derajat)
prmse<-ggplot(data=polinomial1, aes(x=derajat, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Root Mean Square Error")
polinomial1$derajat <- as.factor(polinomial1$derajat)
pmae<-ggplot(data=polinomial1, aes(x=derajat, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Mean Absolute Error")
plot_grid(prmse,pmae, labels=c("RMSE","MAE"), label_size = 6)
Plot Regresi Polinomial d=2 dan d=5
regpol2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("Horse Power") +
ylab("mile per gallon")
regpol5 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,5,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 5") +
xlab("Horse Power") +
ylab("mile per gallon")
plot_grid(regpol2,regpol5)
Nilai RMSE dan MAE terendah terdapat pada regresi polinomial derajat 5, namun apabila dilihat dari plot regresi pada bagian ujung-ujung plot cenderung tidak teratur untuk regresi polinomial derajat 5. Oleh karena itu, regresi polinomial terbaik yaitu regresi polinomial derajat 2. Berikut merupakan summary dari model regresi polinomial derajat 2:
poli2 = lm(mpg ~ poly(horsepower,2), data = Auto)
summary(poli2)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
Regresi Fungsi Tangga
A. Regresi Fungsi Tangga (Secara Visual)
Regresi Fungsi Tangga dengan Knots 1-10
ft1<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,2),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=1") +
xlab("Horse Power") + ylab("mile per gallon")
ft2<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="yellow") +
stat_smooth(method = "lm",
formula = y~cut(x,3),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=2") +
xlab("Horse Power") + ylab("mile per gallon")
ft3<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="green") +
stat_smooth(method = "lm",
formula = y~cut(x,4),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=3") +
xlab("Horse Power") + ylab("mile per gallon")
ft4<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=4") +
xlab("Horse Power") + ylab("mile per gallon")
ft5<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~cut(x,6),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=5") +
xlab("Horse Power") + ylab("mile per gallon")
ft6<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,7),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=6") +
xlab("Horse Power") + ylab("mile per gallon")
ft7<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="yellow") +
stat_smooth(method = "lm",
formula = y~cut(x,8),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=7") +
xlab("Horse Power") + ylab("mile per gallon")
ft8<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="green") +
stat_smooth(method = "lm",
formula = y~cut(x,9),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=8") +
xlab("Horse Power") + ylab("mile per gallon")
ft9<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,10),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=9") +
xlab("Horse Power") + ylab("mile per gallon")
ft10<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~cut(x,11),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=10") +
xlab("Horse Power") + ylab("mile per gallon")
Berdasarkan hasil plot di atas dapat dilihat bahwa plot dari regresi fungsi tangga yang dapat memberikan pola paling mulus yaitu knots 7, untuk knots 8 hingga 10 terdapat pola yang tidak teratur dalam plot. Sehingga secara visual, dipilih regresi fungsi tangga dengan knots 7 untuk memodelkan data mpg vs horsepower.
B. Regresi Fungsi Tangga (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
breakk <- 2:11 #knots 1-10
pemilihanf.tangga <- map_dfr(breakk, function(i){
metric_ftangga <- map_dfr(val_silang$splits,
function(x){
datatrain <- Auto[x$in_id,]
datatrain$horsepower <- cut(datatrain$horsepower,i)
modelt <- lm(mpg ~ horsepower, data=datatrain)
labs_x <- levels(modelt$model[,2])
labs_x_b <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
datatesting <- Auto[-x$in_id,]
horsepower_new <- cut(datatesting$horsepower,c(labs_x_b[1,1],labs_x_b[,2]))
predik <- predict(modelt, newdata=list(horsepower=horsepower_new))
truth <- datatesting$mpg
evaluasi <- na.omit(data.frame(truth,predik))
rmse <- mlr3measures::rmse(truth = evaluasi$truth,response = evaluasi$predik)
mae <- mlr3measures::mae(truth = evaluasi$truth,response = evaluasi$predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_ftangga
mean_metricftangga <- colMeans(metric_ftangga)
mean_metricftangga
})
pemilihanf.tangga <- cbind(breakk=breakk,pemilihanf.tangga)
Perbandingan RMSE dan MAE
datatable(pemilihanf.tangga)
Plot RMSE dan MAE
pemilihanf.tangga$breakk <- as.factor(pemilihanf.tangga$breakk)
trmse<-ggplot(data=pemilihanf.tangga, aes(x=breakk, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Root Mean Square Error")
pemilihanf.tangga$breakk <- as.factor(pemilihanf.tangga$breakk)
tmae<-ggplot(data=pemilihanf.tangga, aes(x=breakk, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Mean Absolute Error")
plot_grid(trmse,tmae)
Berdasarkan nilai RMSE model regresi fungsi tangga terbaik adalah regresi fungsi tangga dengan knots=7, sedangkan berdasarkan nilai MAE model terbaik adalah regresi fungsi tangga dengan knots=6.
Plot Regresi Rungsi Tangga
ft7<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,7),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=6") +
xlab("Horse Power") + ylab("mile per gallon")
ft8<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,8),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=7") +
xlab("Horse Power") + ylab("mile per gallon")
plot_grid(ft7,ft8)
Apabila dilihat secara visual, plot dari regresi fungsi tangga dengan knots 7 memiliki kurva yang lebih mulus dibandingkan dengan knots 6, selain itu juga tidak terdapat pola yang tidak teratur pada plot dengan knots=7. Jadi, pada metode regresi fungsi tangga ini dipilih knots=7.
Berikut merupakan summary model regresi fungsi tangga dengan knots=7:
fungsitangga7 = lm(Auto$mpg ~ cut(Auto$horsepower,8)) #7 knots
summary(fungsitangga7)
##
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$horsepower, 8))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9471 -2.6757 -0.1533 2.4015 14.5529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9085 0.5771 58.76 <2e-16 ***
## cut(Auto$horsepower, 8)(69,92] -6.9614 0.6910 -10.07 <2e-16 ***
## cut(Auto$horsepower, 8)(92,115] -12.7551 0.7425 -17.18 <2e-16 ***
## cut(Auto$horsepower, 8)(115,138] -15.6656 1.1264 -13.91 <2e-16 ***
## cut(Auto$horsepower, 8)(138,161] -18.7799 0.8568 -21.92 <2e-16 ***
## cut(Auto$horsepower, 8)(161,184] -19.9735 1.1470 -17.41 <2e-16 ***
## cut(Auto$horsepower, 8)(184,207] -21.1228 1.7720 -11.92 <2e-16 ***
## cut(Auto$horsepower, 8)(207,230] -21.0085 1.5159 -13.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.433 on 384 degrees of freedom
## Multiple R-squared: 0.6832, Adjusted R-squared: 0.6774
## F-statistic: 118.3 on 7 and 384 DF, p-value: < 2.2e-16
Cubic Splines
A. Cubic Splines (Secara Visual)
1. Knots 1-6
attr(bs(Auto$horsepower, df=4),"knots")
## 50%
## 93.5
attr(bs(Auto$horsepower, df=5),"knots")
## 33.33333% 66.66667%
## 84 110
attr(bs(Auto$horsepower, df=6),"knots")
## 25% 50% 75%
## 75.0 93.5 126.0
attr(bs(Auto$horsepower, df=7),"knots")
## 20% 40% 60% 80%
## 72 88 100 140
attr(bs(Auto$horsepower, df=8),"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 70.0 84.0 93.5 110.0 150.0
attr(bs(Auto$horsepower, df=9),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429%
## 68.85714 79.71429 89.57143 98.00000 113.57143 150.00000
Plot Cubic Splines dengan Knots 1-6
cs1 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="red")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(93.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=1") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(93.5), col="grey50", lty=2)
cs2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="orange")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(84,110)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=2") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(84,110), col="grey50", lty=2)
cs3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="orange")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(75.0, 93.5,126.0)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=3") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(75.0, 93.5,126.0), col="grey50", lty=2)
cs4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="green")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(72, 88, 100, 140)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=4") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(72, 88, 100, 140), col="grey50", lty=2)
cs5<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(70.0,84.0,93.5,110.0, 150.0)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=5") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(70.0,84.0,93.5,110.0,150.0), col="grey50", lty=2)
cs6<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(68.85714, 79.71429, 89.57143, 98, 113.57143, 150)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=6") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(68.85714, 79.71429, 89.57143, 98, 113.57143, 150), col="grey50", lty=2)
Berdasarkan hasil plot di atas dapat disimpulkan bahwa model cubic splines terbaik adalah cubic splines dengan knots 1 karena mulai knots 2 hingga 6 terdapat pola yang tidak teratur pada ujung plot.
B. Cubic Splines (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 4:9 #knots 1-6
pemilihan.cs <- map_dfr(df, function(i) {
metric_cs<- map_dfr(val_silang$splits,
function(x) {
modelcs <- lm(mpg ~ bs(horsepower,df=i), data=Auto[x$in_id,])
predik <- predict(modelcs, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_cs
# menghitung rata-rata untuk 10 folds
mean_metric_cs <- colMeans(metric_cs)
mean_metric_cs
}
)
pemilihan.cs <- cbind(df=df,pemilihan.cs)
Perbandingan RMSE dan MAE
datatable(pemilihan.cs)
Plot RMSE dan MAE
pemilihan.cs$df <- as.factor(pemilihan.cs$df)
crmse<-ggplot(data=pemilihan.cs, aes(x=df, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error")
cmae<-ggplot(data=pemilihan.cs, aes(x=df, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Mean Absolute Error")
plot_grid(crmse,cmae)
Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model cubic splines dengan knots 2, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model cubic splines dengan knots 2 memiliki ketidakteraturan pola pada ujung-ujung plot. Sehingga model cubic splines terbaik adalah dengan knots 1. Berikut merupakan plot dan summary model:
Plot Cubic Splines
plotcs <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(93.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines with Knots=1") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(93.5), col="grey50", lty=2)
plotcs
cspline = lm(Auto$mpg ~ bs(Auto$horsepower, knots = c(93.5)))
summary(cspline)
##
## Call:
## lm(formula = Auto$mpg ~ bs(Auto$horsepower, knots = c(93.5)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3859 -2.4974 -0.2154 2.0668 16.2357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.366 1.532 23.082 < 2e-16 ***
## bs(Auto$horsepower, knots = c(93.5))1 -0.483 2.152 -0.224 0.823
## bs(Auto$horsepower, knots = c(93.5))2 -26.299 1.962 -13.406 < 2e-16 ***
## bs(Auto$horsepower, knots = c(93.5))3 -19.576 2.952 -6.631 1.12e-10 ***
## bs(Auto$horsepower, knots = c(93.5))4 -23.402 2.294 -10.201 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.336 on 387 degrees of freedom
## Multiple R-squared: 0.6945, Adjusted R-squared: 0.6914
## F-statistic: 220 on 4 and 387 DF, p-value: < 2.2e-16
Natural Cubic Splines
A. Natural Cubic Splines (Secara Visual)
1. Knots 1-6
attr(ns(Auto$horsepower, df=2),"knots")
## 50%
## 93.5
attr(ns(Auto$horsepower, df=3),"knots")
## 33.33333% 66.66667%
## 84 110
attr(ns(Auto$horsepower, df=4),"knots")
## 25% 50% 75%
## 75.0 93.5 126.0
attr(ns(Auto$horsepower, df=5),"knots")
## 20% 40% 60% 80%
## 72 88 100 140
attr(ns(Auto$horsepower, df=6),"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 70.0 84.0 93.5 110.0 150.0
attr(ns(Auto$horsepower, df=7),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429%
## 68.85714 79.71429 89.57143 98.00000 113.57143 150.00000
Plot Natural Cubic Splines dengan Knots 1-6
ns1 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="red")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(93.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=1") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(93.5), col="grey50", lty=2)
ns2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="orange")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(84,110)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=2") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(84,110), col="grey50", lty=2)
ns3 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="coral")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(75.0, 93.5,126.0)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=3") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(75.0, 93.5,126.0), col="grey50", lty=2)
ns4 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="green")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(72, 88, 100, 140)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=4") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(72, 88, 100, 140), col="grey50", lty=2)
ns5<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(70.0,84.0,93.5,110.0, 150.0)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=5") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(70.0,84.0,93.5,110.0,150.0), col="grey50", lty=2)
ns6<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(68.85714, 79.71429, 89.57143, 98, 113.57143, 150)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=6") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(68.85714, 79.71429, 89.57143, 98, 113.57143, 150), col="grey50", lty=2)
Berdasarkan hasil plot di atas dapat dilihat bahwa plot natural cubic splines dengan knots 1 hingga 5 tidak memiliki pola yang tidak teratur. Sedangkan untuk knots 6 terdapat pola yang tidak teratur pada ujung plot. Sehingga untuk lebih meyakinkan dalam memilih model terbaik, dilakukan pemilihan model secara empiris.
B. Natural Cubic Splines (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:7 #knots 1-6
pemilihan.ncs <- map_dfr(df, function(i) {
metric_ncs<- map_dfr(val_silang$splits,
function(x) {
modelncs <- lm(mpg ~ ns(horsepower,df=i), data=Auto[x$in_id,])
predik <- predict(modelncs, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_ncs
# menghitung rata-rata untuk 10 folds
mean_metric_ncs <- colMeans(metric_ncs)
mean_metric_ncs
}
)
pemilihan.ncs <- cbind(df=df,pemilihan.ncs)
Perbandingan RMSE dan MAE
datatable(pemilihan.ncs)
Plot RMSE dan MAE
pemilihan.ncs$df <- as.factor(pemilihan.ncs$df)
nrmse<-ggplot(data=pemilihan.ncs, aes(x=df, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error")
nmae<-ggplot(data=pemilihan.ncs, aes(x=df, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Mean Absolute Error")
plot_grid(nrmse,nmae)
Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model natural cubic splines dengan knots 6, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model natural cubic splines dengan knots 6 memiliki pola yang tidak teratur pada ujung-ujung plot. Sehingga model natural cubic splines terbaik adalah dengan knots 5. Berikut merupakan plot dan summary model:
Plot Natural Cubic Splines
plotncs <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(70.0,84.0,93.5,110.0,150.0)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=5") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(70.0,84.0,93.5,110.0,150.0), col="grey50", lty=2)
plotncs
ncspline = lm(Auto$mpg ~ ns(Auto$horsepower, knots = c(70.0,84.0,93.5,110.0,150.0)))
summary(ncspline)
##
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$horsepower, knots = c(70, 84,
## 93.5, 110, 150)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9491 -2.6183 -0.1595 2.3508 15.1349
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 34.738 1.509
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))1 -8.210 1.594
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))2 -13.046 1.835
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))3 -14.577 1.886
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))4 -22.802 1.624
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))5 -22.758 3.512
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))6 -20.849 1.742
## t value Pr(>|t|)
## (Intercept) 23.021 < 2e-16 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))1 -5.149 4.18e-07 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))2 -7.108 5.76e-12 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))3 -7.730 9.50e-14 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))4 -14.039 < 2e-16 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))5 -6.480 2.81e-10 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))6 -11.967 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.302 on 385 degrees of freedom
## Multiple R-squared: 0.7009, Adjusted R-squared: 0.6962
## F-statistic: 150.4 on 6 and 385 DF, p-value: < 2.2e-16
Perbandingan Model
nilai_rmse <- rbind(4.32756231703042,
4.41347139979508,
4.32073392297636,
4.2917765537155)
nilai_mae<-rbind(3.2639553356593,
3.41081122407494,
3.26311155879859,
3.24670654206857)
Metode <- c("Regresi Polinomial (d=2)",
"Fungsi Tangga (knots=7)",
"Cubic Splines (knots=1)",
"Natural Cubic Splines (knots=5)")
perbandingan <- data.frame(Metode,nilai_rmse, nilai_mae)
datatable(perbandingan)
Plot dari Masing-masing Model
#Regresi Polinomial
rp2 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("Horse Power") +
ylab("mile per gallon")
#Regresi Fungsi Tangga
ft7<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="yellow") +
stat_smooth(method = "lm",
formula = y~cut(x,8),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=7") +
xlab("Horse Power") + ylab("mile per gallon")
#Cubic Splines
cs1 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="red")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(93.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=1") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(93.5), col="grey50", lty=2)
#Natural Cubic Splines
ns5<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(70.0,84.0,93.5,110.0, 150.0)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=5") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(70.0,84.0,93.5,110.0,150.0), col="grey50", lty=2)
plot_grid(rp2, ft7, cs1, ns5)
Berdasarkan nila RMSE dan MAE dapat disimpulkan bahwa model terbaik untuk memodelkanmpg
vs horsepower
adalah model natural cubic splines dengan knots 6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat horsepower
(kekuatan mesin) dari kendaraan maka miles per gallon
(jarak yang dapat ditempuh per galon bahan bakar) akan semakin kecil.
Pemodelan mpg
vs displacement
Regresi Polinomial
A. Regresi Polinomial (Secara Visual)
1. Regresi Polinomial Derajat 2
rp22 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("Displacement") +
ylab("mile per gallon")
2. Regresi Polinomial Derajat 3
rp33 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~poly(x,3,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 3") +
xlab("Displacement") +
ylab("mile per gallon")
3. Regresi Polinomial Derajat 4
rp44 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="navy") +
stat_smooth(method = "lm",
formula = y~poly(x,4,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 4") +
xlab("Displacement") +
ylab("mile per gallon")
4. Regresi Polinomial Derajat 5
rp55 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~poly(x,5,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 5") +
xlab("Displacement") +
ylab("mile per gallon")
plot_grid(rp22, rp33, rp44, rp55)
Berdasarkan keempat plot di atas dapat dilihat bahwa plot regresi polinomial derajat 2 hingga 5 memiliki pola yang hampir sama, tidak terdapat pola yang tidak teratur pada ujung plot. Sehingga, pada metode Regresi Polinomial dipilih derajat 2 karena model paling sederhana.
B. Regresi Polinomial (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
derajat <- 1:5
polinomial2 <- map_dfr(derajat, function(i) {
metric_poli <- map_dfr(val_silang$splits,
function(x) {
mod <- lm(mpg ~ poly(displacement,derajat=i), data=Auto[x$in_id,])
predik <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_poli
mean_metric_poli<- colMeans(metric_poli) # menghitung rata-rata untuk 10 folds
mean_metric_poli
}
)
polinomial2<- cbind(derajat=derajat,polinomial2)
Perbandingan RMSE dan MAE
datatable(polinomial2)
Plot RMSE dan MAE
polinomial2$derajat <- as.factor(polinomial2$derajat)
prmse2<-ggplot(data=polinomial2, aes(x=derajat, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Root Mean Square Error")
polinomial2$derajat <- as.factor(polinomial2$derajat)
pmae2<-ggplot(data=polinomial2, aes(x=derajat, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Mean Absolute Error")
plot_grid(prmse2,pmae2, labels=c("RMSE","MAE"), label_size = 6)
Berdasarkan nilai RMSE dan MAE terendah, dapat disimpulkan bahwa model regresi polinomial terbaik adalah regresi polinomial derajat 2. Berikut merupakan plot dan summary model:
Plot Regresi Polinomial d=2
regpol22 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("Displacement") +
ylab("mile per gallon")
regpol22
poli22= lm(mpg ~ poly(displacement,2), data = Auto)
summary(poli22)
##
## Call:
## lm(formula = mpg ~ poly(displacement, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2165 -2.2404 -0.2508 2.1094 20.5158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2205 106.343 < 2e-16 ***
## poly(displacement, 2)1 -124.2585 4.3652 -28.466 < 2e-16 ***
## poly(displacement, 2)2 31.0895 4.3652 7.122 5.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.365 on 389 degrees of freedom
## Multiple R-squared: 0.6888, Adjusted R-squared: 0.6872
## F-statistic: 430.5 on 2 and 389 DF, p-value: < 2.2e-16
Regresi Fungsi Tangga
A. Regresi Fungsi Tangga (Secara Visual)
Regresi Fungsi Tangga dengan Knots 1-10
ft11<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,2),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=1") +
xlab("displacement") + ylab("mile per gallon")
ft22<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="yellow") +
stat_smooth(method = "lm",
formula = y~cut(x,3),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=2") +
xlab("displacement") + ylab("mile per gallon")
ft33<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="green") +
stat_smooth(method = "lm",
formula = y~cut(x,4),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=3") +
xlab("displacement") + ylab("mile per gallon")
ft44<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=4") +
xlab("displacement") + ylab("mile per gallon")
ft55<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~cut(x,6),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=5") +
xlab("displacement") + ylab("mile per gallon")
ft66<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,7),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=6") +
xlab("displacement") + ylab("mile per gallon")
ft77<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="yellow") +
stat_smooth(method = "lm",
formula = y~cut(x,8),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=7") +
xlab("displacement") + ylab("mile per gallon")
ft88<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="green") +
stat_smooth(method = "lm",
formula = y~cut(x,9),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=8") +
xlab("displacement") + ylab("mile per gallon")
ft99<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,10),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=9") +
xlab("displacement") + ylab("mile per gallon")
ft110<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~cut(x,11),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=10") +
xlab("displacement") + ylab("mile per gallon")
Berdasarkan hasil plot di atas dapat dilihat bahwa plot dari regresi fungsi tangga yang dapat memberikan pola paling mulus yaitu knots 8, untuk knots 9 dan 10 terdapat pola yang tidak teratur dalam plot. Sehingga secara visual, dipilih regresi fungsi tangga dengan knots 8 untuk memodelkan data mpg vs displacement.
B. Regresi Fungsi Tangga (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
breakk <- 2:11 #knots 1-10
pemilihanf.tangga1 <- map_dfr(breakk, function(i){
metric_ftangga <- map_dfr(val_silang$splits,
function(x){
datatrain <- Auto[x$in_id,]
datatrain$displacement <- cut(datatrain$displacement,i)
modelt <- lm(mpg ~ displacement, data=datatrain)
labs_x <- levels(modelt$model[,2])
labs_x_b <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
datatesting <- Auto[-x$in_id,]
displacement_new <- cut(datatesting$displacement,c(labs_x_b[1,1],labs_x_b[,2]))
predik <- predict(modelt, newdata=list(displacement=displacement_new))
truth <- datatesting$mpg
evaluasi <- na.omit(data.frame(truth,predik))
rmse <- mlr3measures::rmse(truth = evaluasi$truth,response = evaluasi$predik)
mae <- mlr3measures::mae(truth = evaluasi$truth,response = evaluasi$predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_ftangga
mean_metricftangga <- colMeans(metric_ftangga)
mean_metricftangga
})
pemilihanf.tangga1 <- cbind(breakk=breakk,pemilihanf.tangga1)
Perbandingan RMSE dan MAE
datatable(pemilihanf.tangga1)
Plot RMSE dan MAE
pemilihanf.tangga1$breakk <- as.factor(pemilihanf.tangga1$breakk)
trmse1<-ggplot(data=pemilihanf.tangga1, aes(x=breakk, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Root Mean Square Error")
pemilihanf.tangga1$breakk <- as.factor(pemilihanf.tangga1$breakk)
tmae1<-ggplot(data=pemilihanf.tangga1, aes(x=breakk, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Mean Absolute Error")
plot_grid(trmse1,tmae1)
Berdasarkan nilai RMSE dan MAE terkecil, model regresi fungsi tangga terbaik adalah regresi fungsi tangga dengan knots 8.
Berikut merupakan plot dan summary model regresi fungsi tangga dengan knots 8:
Plot Regresi Rungsi Tangga
ft88<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,9),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=8") +
xlab("displacement") + ylab("mile per gallon")
ft88
fungsitangga88 = lm(Auto$mpg ~ cut(Auto$displacement,9)) #8 knots
summary(fungsitangga88)
##
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$displacement, 9))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5181 -2.2250 -0.2905 2.0339 19.4607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.5181 0.3947 79.850 < 2e-16 ***
## cut(Auto$displacement, 9)(111,154] -5.5601 0.6010 -9.252 < 2e-16 ***
## cut(Auto$displacement, 9)(154,197] -8.2848 1.0770 -7.693 1.23e-13 ***
## cut(Auto$displacement, 9)(197,240] -11.7022 0.7527 -15.547 < 2e-16 ***
## cut(Auto$displacement, 9)(240,283] -12.9788 0.8951 -14.499 < 2e-16 ***
## cut(Auto$displacement, 9)(283,326] -16.2276 0.7656 -21.197 < 2e-16 ***
## cut(Auto$displacement, 9)(326,369] -16.7923 0.8595 -19.537 < 2e-16 ***
## cut(Auto$displacement, 9)(369,412] -17.5494 1.1337 -15.479 < 2e-16 ***
## cut(Auto$displacement, 9)(412,455] -18.2959 1.4710 -12.438 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.251 on 383 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7033
## F-statistic: 116.9 on 8 and 383 DF, p-value: < 2.2e-16
Cubic Splines
A. Cubic Splines (Secara Visual)
1. Knots 1-6
attr(bs(Auto$displacement, df=4),"knots")
## 50%
## 151
attr(bs(Auto$displacement, df=5),"knots")
## 33.33333% 66.66667%
## 119 232
attr(bs(Auto$displacement, df=6),"knots")
## 25% 50% 75%
## 105.00 151.00 275.75
attr(bs(Auto$displacement, df=7),"knots")
## 20% 40% 60% 80%
## 98 122 225 305
attr(bs(Auto$displacement, df=8),"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 97 119 151 232 318
attr(bs(Auto$displacement, df=9),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429%
## 96.85714 108.00000 134.57143 198.00000 250.00000 321.14286
Plot Cubic Splines dengan Knots 1-6
cs11 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="red")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(151)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=1") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(151), col="grey50", lty=2)
cs22 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="orange")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(119,232)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=2") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(119,232), col="grey50", lty=2)
cs33 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="orange")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(105, 151, 275.75)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=3") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(105, 151,275.75), col="grey50", lty=2)
cs44 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="green")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(98, 122, 225, 305)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=4") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(98, 122, 225, 305), col="grey50", lty=2)
cs55<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(97,119,151,232,318)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=5") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(97,119,151,232,318), col="grey50", lty=2)
cs66<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(96.85714,108, 134.57143, 198, 250, 321.14286)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=6") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(96.85714,108, 134.57143, 198, 250, 321.14286), col="grey50", lty=2)
Berdasarkan hasil plot di atas dapat disimpulkan bahwa model cubic splines terbaik adalah cubic splines dengan knots 1 karena mulai knots 2 hingga 6 terdapat pola yang tidak teratur pada ujung plot.
B. Cubic Splines (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 4:9 #knots 1-6
pemilihan.cs1 <- map_dfr(df, function(i) {
metric_cs<- map_dfr(val_silang$splits,
function(x) {
modelcs <- lm(mpg ~ bs(displacement,df=i), data=Auto[x$in_id,])
predik <- predict(modelcs, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_cs
# menghitung rata-rata untuk 10 folds
mean_metric_cs <- colMeans(metric_cs)
mean_metric_cs
}
)
pemilihan.cs1 <- cbind(df=df,pemilihan.cs1)
Perbandingan RMSE dan MAE
datatable(pemilihan.cs1)
Plot RMSE dan MAE
pemilihan.cs1$df <- as.factor(pemilihan.cs1$df)
crmse1<-ggplot(data=pemilihan.cs1, aes(x=df, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error")
cmae1<-ggplot(data=pemilihan.cs, aes(x=df, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Mean Absolute Error")
plot_grid(crmse1,cmae1)
Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model cubic splines dengan knots 2 dan 3, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model cubic splines dengan knots 2 dan 3 memiliki ketidakteraturan pola pada ujung-ujung plot. Sehingga model cubic splines terbaik adalah dengan knots 1. Berikut merupakan plot dan summary model:
Plot Cubic Splines
plotcs1 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(151)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines with Knots=1") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(151), col="grey50", lty=2)
plotcs1
cspline1 = lm(Auto$mpg ~ bs(Auto$displacement, knots = c(151)))
summary(cspline1)
##
## Call:
## lm(formula = Auto$mpg ~ bs(Auto$displacement, knots = c(151)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1199 -2.3744 -0.2858 2.1009 20.5408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.308 1.218 27.351 < 2e-16 ***
## bs(Auto$displacement, knots = c(151))1 -2.579 1.831 -1.409 0.16
## bs(Auto$displacement, knots = c(151))2 -18.060 1.864 -9.691 < 2e-16 ***
## bs(Auto$displacement, knots = c(151))3 -18.381 2.592 -7.093 6.3e-12 ***
## bs(Auto$displacement, knots = c(151))4 -20.465 1.857 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.368 on 387 degrees of freedom
## Multiple R-squared: 0.69, Adjusted R-squared: 0.6868
## F-statistic: 215.3 on 4 and 387 DF, p-value: < 2.2e-16
Natural Cubic Splines
A. Natural Cubic Splines (Secara Visual)
1. Knots 1-6
attr(ns(Auto$displacement, df=2),"knots")
## 50%
## 151
attr(ns(Auto$displacement, df=3),"knots")
## 33.33333% 66.66667%
## 119 232
attr(ns(Auto$displacement, df=4),"knots")
## 25% 50% 75%
## 105.00 151.00 275.75
attr(ns(Auto$displacement, df=5),"knots")
## 20% 40% 60% 80%
## 98 122 225 305
attr(ns(Auto$displacement, df=6),"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 97 119 151 232 318
attr(ns(Auto$displacement, df=7),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429%
## 96.85714 108.00000 134.57143 198.00000 250.00000 321.14286
Plot Natural Cubic Splines dengan Knots 1-6
ns11 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="red")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(151)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=1") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(151), col="grey50", lty=2)
ns22 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="coral")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(119,232)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=2") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(119,232), col="grey50", lty=2)
ns33 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="yellow")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(105, 151, 275.75)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=3") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(105, 151, 275.75), col="grey50", lty=2)
ns44 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="green")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(98, 122, 225, 305)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=4") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(98, 122, 225, 305), col="grey50", lty=2)
ns55<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(97,119,151,232,318)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=5") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(97,119,151,232,318), col="grey50", lty=2)
ns66<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(96.85714,108, 134.57143, 198, 250, 321.14286)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=6") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(96.85714,108, 134.57143, 198, 250, 321.14286), col="grey50", lty=2)
Berdasarkan hasil plot di atas dapat dilihat bahwa plot natural cubic splines dengan knots 1 hingga 3 tidak memiliki pola yang tidak teratur. Sedangkan untuk knots 4 hingga 6 terdapat pola yang tidak teratur pada ujung plot. Sehingga untuk lebih meyakinkan dalam memilih model terbaik, dilakukan pemilihan model secara empiris.
B. Natural Cubic Splines (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:7 #knots 1-6
pemilihan.ncs1 <- map_dfr(df, function(i) {
metric_ncs<- map_dfr(val_silang$splits,
function(x) {
modelncs <- lm(mpg ~ ns(displacement,df=i), data=Auto[x$in_id,])
predik <- predict(modelncs, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_ncs
# menghitung rata-rata untuk 10 folds
mean_metric_ncs <- colMeans(metric_ncs)
mean_metric_ncs
}
)
pemilihan.ncs1 <- cbind(df=df,pemilihan.ncs1)
Perbandingan RMSE dan MAE
datatable(pemilihan.ncs1)
Plot RMSE dan MAE
pemilihan.ncs1$df <- as.factor(pemilihan.ncs1$df)
nrmse1<-ggplot(data=pemilihan.ncs1, aes(x=df, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error")
nmae1<-ggplot(data=pemilihan.ncs1, aes(x=df, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Mean Absolute Error")
plot_grid(nrmse1,nmae1)
Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model natural cubic splines dengan knots 6, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model natural cubic splines dengan knots 4 hingga 6 memiliki pola yang tidak teratur pada ujung-ujung plot. Sehingga model natural cubic splines terbaik adalah dengan knots 1, karena memiliki nilai RMSE dan MAE terkecil di antara knots 1 hingga 3. Berikut merupakan plot dan summary model:
Plot Natural Cubic Splines
plotncs1 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(151)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=1") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(151), col="grey50", lty=2)
plotncs1
ncspline1 = lm(Auto$mpg ~ ns(Auto$displacement, knots = c(151)))
summary(ncspline1)
##
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$displacement, knots = c(151)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3850 -2.3565 -0.3177 2.1249 20.5067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.6114 0.4962 67.74 <2e-16 ***
## ns(Auto$displacement, knots = c(151))1 -30.1316 1.2818 -23.51 <2e-16 ***
## ns(Auto$displacement, knots = c(151))2 -14.3845 1.0485 -13.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.358 on 389 degrees of freedom
## Multiple R-squared: 0.6899, Adjusted R-squared: 0.6883
## F-statistic: 432.6 on 2 and 389 DF, p-value: < 2.2e-16
Perbandingan Model
nilai_rmse1 <- rbind(4.32105681825998,
4.23676691518348,
4.34109320633999,
4.31394442085514)
nilai_mae1<-rbind(3.16031517994995,
3.1001249378559,
3.186742052199,
3.16206372510196)
Metode <- c("Regresi Polinomial (d=2)",
"Fungsi Tangga (knots=8)",
"Cubic Splines (knots=1)",
"Natural Cubic Splines (knots=1)")
perbandingan1 <- data.frame(Metode,nilai_rmse1, nilai_mae1)
datatable(perbandingan1)
Plot dari Masing-masing Model
#Regresi Polinomial
rp22 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("Displacement") +
ylab("mile per gallon")
#Regresi Fungsi Tangga
ft88<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="green") +
stat_smooth(method = "lm",
formula = y~cut(x,9),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=8") +
xlab("displacement") + ylab("mile per gallon")
#Cubic Splines
cs11 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="red")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(151)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=1") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(151), col="grey50", lty=2)
#Natural Cubic Splines
ns11 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="yellow")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(151)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=1") +
xlab("displacement") +
ylab("mile per gallon") +
geom_vline(xintercept = c(151), col="grey50", lty=2)
plot_grid(rp22, ft88, cs11, ns11)
Berdasarkan nila RMSE dan MAE dapat disimpulkan bahwa model terbaik untuk memodelkanmpg
vs displacement
adalah model regresi fungsi tangga dengan knots 8. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat displacement
(perpindahan mesin atau total volume bahan bakar yang dapat ditarik mesin selama satu siklus lengkap) dari kendaraan maka miles per gallon
(jarak yang dapat ditempuh per galon bahan bakar) akan semakin kecil.
Pemodelan mpg
vs weight
Regresi Polinomial
A. Regresi Polinomial (Secara Visual)
1. Regresi Polinomial Derajat 2
rp222 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("weight") +
ylab("mile per gallon")
2. Regresi Polinomial Derajat 3
rp333 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~poly(x,3,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 3") +
xlab("weight") +
ylab("mile per gallon")
3. Regresi Polinomial Derajat 4
rp444 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="navy") +
stat_smooth(method = "lm",
formula = y~poly(x,4,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 4") +
xlab("weight") +
ylab("mile per gallon")
4. Regresi Polinomial Derajat 5
rp555 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~poly(x,5,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 5") +
xlab("weight") +
ylab("mile per gallon")
plot_grid(rp222, rp333, rp444, rp555)
Berdasarkan keempat plot di atas dapat dilihat bahwa plot regresi polinomial derajat 5 memiliki sedikit ketidakteraturan pola pada ujungnya, sedangkan untuk derajat 2 hingga 4 membentuk pola yang hampir sama dan tidak terdapat pola yang tidak teratur pada ujung plot. Sehingga, pada metode Regresi Polinomial dipilih derajat 2 karena model paling sederhana.
B. Regresi Polinomial (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
derajat <- 1:5
polinomial3 <- map_dfr(derajat, function(i) {
metric_poli <- map_dfr(val_silang$splits,
function(x) {
mod <- lm(mpg ~ poly(weight,derajat=i), data=Auto[x$in_id,])
predik <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_poli
mean_metric_poli<- colMeans(metric_poli) # menghitung rata-rata untuk 10 folds
mean_metric_poli
}
)
polinomial3<- cbind(derajat=derajat,polinomial3)
Perbandingan RMSE dan MAE
datatable(polinomial3)
Plot RMSE dan MAE
polinomial3$derajat <- as.factor(polinomial3$derajat)
prmse3<-ggplot(data=polinomial3, aes(x=derajat, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Root Mean Square Error")
polinomial3$derajat <- as.factor(polinomial3$derajat)
pmae3<-ggplot(data=polinomial3, aes(x=derajat, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Degree of Polynomial") +
ylab("Mean Absolute Error")
plot_grid(prmse3,pmae3, labels=c("RMSE","MAE"), label_size = 6)
Berdasarkan nilai RMSE dan MAE terendah, dapat disimpulkan bahwa model regresi polinomial terbaik adalah regresi polinomial derajat 2. Berikut merupakan plot dan summary model:
Plot Regresi Polinomial d=2
regpol222 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("Weight") +
ylab("mile per gallon")
regpol222
poli222= lm(mpg ~ poly(weight,2), data = Auto)
summary(poli222)
##
## Call:
## lm(formula = mpg ~ poly(weight, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6246 -2.7134 -0.3485 1.8267 16.0866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2109 111.151 < 2e-16 ***
## poly(weight, 2)1 -128.4436 4.1763 -30.755 < 2e-16 ***
## poly(weight, 2)2 23.1589 4.1763 5.545 5.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.7137
## F-statistic: 488.3 on 2 and 389 DF, p-value: < 2.2e-16
Regresi Fungsi Tangga
A. Regresi Fungsi Tangga (Secara Visual)
Regresi Fungsi Tangga dengan Knots 1-10
ft111<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,2),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=1") +
xlab("weight") + ylab("mile per gallon")
ft222<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="yellow") +
stat_smooth(method = "lm",
formula = y~cut(x,3),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=2") +
xlab("weight") + ylab("mile per gallon")
ft333<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="green") +
stat_smooth(method = "lm",
formula = y~cut(x,4),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=3") +
xlab("weight") + ylab("mile per gallon")
ft444<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=4") +
xlab("weight") + ylab("mile per gallon")
ft555<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~cut(x,6),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=5") +
xlab("weight") + ylab("mile per gallon")
ft666<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,7),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=6") +
xlab("weight") + ylab("mile per gallon")
ft777<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="yellow") +
stat_smooth(method = "lm",
formula = y~cut(x,8),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=7") +
xlab("weight") + ylab("mile per gallon")
ft888<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="green") +
stat_smooth(method = "lm",
formula = y~cut(x,9),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=8") +
xlab("weight") + ylab("mile per gallon")
ft999<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,10),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=9") +
xlab("weight") + ylab("mile per gallon")
ft120<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~cut(x,11),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=10") +
xlab("weight") + ylab("mile per gallon")
Berdasarkan hasil plot di atas dapat dilihat bahwa plot dari regresi fungsi tangga yang dapat memberikan pola yang mulus mulai dari knots 5, namun pada knots 9 dan 10 terdapat ketidakteraturan pola dalam plot. Sehingga untuk lebih meyakinkan dalam memilih model, dilakukan pengecekan nilai RMSE dan MAE melalui pemilihan secara empiris.
B. Regresi Fungsi Tangga (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
breakk <- 2:11 #knots 1-10
pemilihanf.tangga12 <- map_dfr(breakk, function(i){
metric_ftangga <- map_dfr(val_silang$splits,
function(x){
datatrain <- Auto[x$in_id,]
datatrain$weight <- cut(datatrain$weight,i)
modelt <- lm(mpg ~ weight, data=datatrain)
labs_x <- levels(modelt$model[,2])
labs_x_b <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
datatesting <- Auto[-x$in_id,]
weight_new <- cut(datatesting$weight,c(labs_x_b[1,1],labs_x_b[,2]))
predik <- predict(modelt, newdata=list(weight=weight_new))
truth <- datatesting$mpg
evaluasi <- na.omit(data.frame(truth,predik))
rmse <- mlr3measures::rmse(truth = evaluasi$truth,response = evaluasi$predik)
mae <- mlr3measures::mae(truth = evaluasi$truth,response = evaluasi$predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_ftangga
mean_metricftangga <- colMeans(metric_ftangga)
mean_metricftangga
})
pemilihanf.tangga12 <- cbind(breakk=breakk,pemilihanf.tangga12)
Perbandingan RMSE dan MAE
datatable(pemilihanf.tangga12)
Plot RMSE dan MAE
pemilihanf.tangga12$breakk <- as.factor(pemilihanf.tangga12$breakk)
trmse12<-ggplot(data=pemilihanf.tangga12, aes(x=breakk, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Root Mean Square Error")
pemilihanf.tangga12$breakk <- as.factor(pemilihanf.tangga12$breakk)
tmae12<-ggplot(data=pemilihanf.tangga12, aes(x=breakk, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("Breaks") +
ylab("Mean Absolute Error")
plot_grid(trmse12,tmae12)
Berdasarkan nilai RMSE dan MAE terkecil, model regresi fungsi tangga terbaik adalah regresi fungsi tangga dengan knots 5.
Berikut merupakan plot dan summary model regresi fungsi tangga dengan knots 5:
Plot Regresi Rungsi Tangga
ft555<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,6),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=5") +
xlab("weight") + ylab("mile per gallon")
ft555
fungsitangga555 = lm(Auto$mpg ~ cut(Auto$weight,6)) #5 knots
summary(fungsitangga555)
##
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$weight, 6))
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.603 -2.486 -0.423 1.635 16.901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.6033 0.4433 73.55 <2e-16 ***
## cut(Auto$weight, 6)(2.2e+03,2.79e+03] -6.1043 0.6082 -10.04 <2e-16 ***
## cut(Auto$weight, 6)(2.79e+03,3.38e+03] -11.1170 0.6624 -16.78 <2e-16 ***
## cut(Auto$weight, 6)(3.38e+03,3.96e+03] -15.1324 0.6940 -21.80 <2e-16 ***
## cut(Auto$weight, 6)(3.96e+03,4.55e+03] -18.2380 0.7466 -24.43 <2e-16 ***
## cut(Auto$weight, 6)(4.55e+03,5.14e+03] -20.2283 1.1409 -17.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.205 on 386 degrees of freedom
## Multiple R-squared: 0.7134, Adjusted R-squared: 0.7097
## F-statistic: 192.2 on 5 and 386 DF, p-value: < 2.2e-16
Cubic Splines
A. Cubic Splines (Secara Visual)
1. Knots 1-6
attr(bs(Auto$weight, df=4),"knots")
## 50%
## 2803.5
attr(bs(Auto$weight, df=5),"knots")
## 33.33333% 66.66667%
## 2397.000 3333.667
attr(bs(Auto$weight, df=6),"knots")
## 25% 50% 75%
## 2225.25 2803.50 3614.75
attr(bs(Auto$weight, df=7),"knots")
## 20% 40% 60% 80%
## 2155.0 2583.2 3113.4 3820.8
attr(bs(Auto$weight, df=8),"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 2125.000 2397.000 2803.500 3333.667 3960.833
attr(bs(Auto$weight, df=9),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429%
## 2074.857 2285.429 2635.000 2986.571 3446.143 4096.286
Plot Cubic Splines dengan Knots 1-6
cs111 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="red")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(2803.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=1") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2803.5), col="grey50", lty=2)
cs222 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="pink")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(2397,3333.667)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=2") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2397,3333.667), col="grey50", lty=2)
cs333 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="orange")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(2225.25, 2803.50, 3614.75)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=3") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2225.25, 2803.50, 3614.75), col="grey50", lty=2)
cs444 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="green")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(2155.0, 2583.2, 3113.4, 3820.8)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=4") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2155.0, 2583.2, 3113.4, 3820.8), col="grey50", lty=2)
cs555<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(2125.000,2397.000, 2803.500,3333.667,3960.833)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=5") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2125.000,2397.000, 2803.500,3333.667,3960.833), col="grey50", lty=2)
cs666<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(2074.857,2285.429, 2635.000,2986.571,3446.143,4096.286)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=6") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2074.857,2285.429, 2635.000,2986.571,3446.143,4096.286), col="grey50", lty=2)
Berdasarkan hasil plot di atas dapat disimpulkan bahwa model cubic splines terbaik adalah cubic splines dengan knots 1 karena mulai knots 2 hingga 6 terdapat pola yang tidak teratur pada ujung plot.
B. Cubic Splines (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 4:9 #knots 1-6
pemilihan.cs12 <- map_dfr(df, function(i) {
metric_cs<- map_dfr(val_silang$splits,
function(x) {
modelcs <- lm(mpg ~ bs(weight,df=i), data=Auto[x$in_id,])
predik <- predict(modelcs, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_cs
# menghitung rata-rata untuk 10 folds
mean_metric_cs <- colMeans(metric_cs)
mean_metric_cs
}
)
pemilihan.cs12 <- cbind(df=df,pemilihan.cs12)
Perbandingan RMSE dan MAE
datatable(pemilihan.cs12)
Plot RMSE dan MAE
pemilihan.cs12$df <- as.factor(pemilihan.cs12$df)
crmse12<-ggplot(data=pemilihan.cs12, aes(x=df, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error")
cmae12<-ggplot(data=pemilihan.cs12, aes(x=df, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Mean Absolute Error")
plot_grid(crmse12,cmae12)
Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model cubic splines dengan knots 6, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model cubic splines dengan knots 6 memiliki ketidakteraturan pola pada ujung-ujung plot. Sehingga model cubic splines terbaik adalah dengan knots 1. Berikut merupakan plot dan summary model:
Plot Cubic Splines
plotcs12 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(2803.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines with Knots=1") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2803.5), col="grey50", lty=2)
plotcs12
cspline12 = lm(Auto$mpg ~ bs(Auto$weight, knots = c(2803.5)))
summary(cspline12)
##
## Call:
## lm(formula = Auto$mpg ~ bs(Auto$weight, knots = c(2803.5)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7865 -2.7357 -0.4536 1.8590 16.2655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.737 1.675 21.330 < 2e-16 ***
## bs(Auto$weight, knots = c(2803.5))1 -3.245 2.553 -1.271 0.204
## bs(Auto$weight, knots = c(2803.5))2 -18.339 1.792 -10.231 < 2e-16 ***
## bs(Auto$weight, knots = c(2803.5))3 -21.830 2.897 -7.535 3.48e-13 ***
## bs(Auto$weight, knots = c(2803.5))4 -24.419 2.338 -10.443 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.184 on 387 degrees of freedom
## Multiple R-squared: 0.7156, Adjusted R-squared: 0.7127
## F-statistic: 243.5 on 4 and 387 DF, p-value: < 2.2e-16
Natural Cubic Splines
A. Natural Cubic Splines (Secara Visual)
1. Knots 1-6
attr(ns(Auto$weight, df=2),"knots")
## 50%
## 2803.5
attr(ns(Auto$weight, df=3),"knots")
## 33.33333% 66.66667%
## 2397.000 3333.667
attr(ns(Auto$weight, df=4),"knots")
## 25% 50% 75%
## 2225.25 2803.50 3614.75
attr(ns(Auto$weight, df=5),"knots")
## 20% 40% 60% 80%
## 2155.0 2583.2 3113.4 3820.8
attr(ns(Auto$weight, df=6),"knots")
## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 2125.000 2397.000 2803.500 3333.667 3960.833
attr(ns(Auto$weight, df=7),"knots")
## 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429%
## 2074.857 2285.429 2635.000 2986.571 3446.143 4096.286
Plot Natural Cubic Splines dengan Knots 1-6
ns111 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="red")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(2803.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=1") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2803.5), col="grey50", lty=2)
ns222 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="yellow")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(2397,3333.667)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=2") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2397,3333.667), col="grey50", lty=2)
ns333 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="coral")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(2225.25, 2803.50, 3614.75)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=3") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2225.25, 2803.50, 3614.75), col="grey50", lty=2)
ns444 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="green")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(2155.0, 2583.2, 3113.4, 3820.8)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=4") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2155.0, 2583.2, 3113.4, 3820.8), col="grey50", lty=2)
ns555<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(2125.000,2397.000, 2803.500,3333.667,3960.833)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=5") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2125.000,2397.000, 2803.500,3333.667,3960.833), col="grey50", lty=2)
ns666<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(2074.857,2285.429, 2635.000,2986.571,3446.143,4096.286)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=6") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2074.857,2285.429, 2635.000,2986.571,3446.143,4096.286), col="grey50", lty=2)
Berdasarkan hasil plot di atas dapat dilihat bahwa plot natural cubic splines dengan knots 1 hingga 5 tidak memiliki pola yang tidak teratur. Sedangkan untuk knots 6 terdapat pola yang tidak teratur pada ujung plot. Sehingga untuk lebih meyakinkan dalam memilih model terbaik, dilakukan pemilihan model secara empiris.
B. Natural Cubic Splines (Secara Empiris)
set.seed(018)
val_silang <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:7 #knots 1-6
pemilihan.ncs12 <- map_dfr(df, function(i) {
metric_ncs<- map_dfr(val_silang$splits,
function(x) {
modelncs <- lm(mpg ~ ns(weight,df=i), data=Auto[x$in_id,])
predik <- predict(modelncs, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = predik)
mae <- mlr3measures::mae(truth = truth, response = predik)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_ncs
# menghitung rata-rata untuk 10 folds
mean_metric_ncs <- colMeans(metric_ncs)
mean_metric_ncs
}
)
pemilihan.ncs12 <- cbind(df=df,pemilihan.ncs12)
Perbandingan RMSE dan MAE
datatable(pemilihan.ncs12)
Plot RMSE dan MAE
pemilihan.ncs12$df <- as.factor(pemilihan.ncs12$df)
nrmse12<-ggplot(data=pemilihan.ncs12, aes(x=df, y=RMSE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Root Mean Square Error")
nmae12<-ggplot(data=pemilihan.ncs12, aes(x=df, y=MAE, group=1)) +
geom_line(linetype = "dashed")+
geom_point()+
ggtitle("K-Fold Cross Validation") +
xlab("DF") + ylab("Mean Absolute Error")
plot_grid(nrmse12,nmae12)
Berdasarkan nilai RMSE dan MAE diperoleh nilai terendah pada model natural cubic splines dengan knots 6, namun berdasarkan hasil visualisasi data dapat dilihat bahwa untuk model natural cubic splines dengan knots 6 memiliki pola yang tidak teratur pada ujung-ujung plot. Sehingga model natural cubic splines terbaik adalah dengan knots 1, karena memiliki nilai RMSE dan MAE terkecil di antara knots 1 hingga 5. Berikut merupakan plot dan summary model:
Plot Natural Cubic Splines
plotncs12<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(2803.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=1") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2803.5), col="grey50", lty=2)
plotncs12
ncspline12 = lm(Auto$mpg ~ ns(Auto$weight, knots = c(2803.5)))
summary(ncspline12)
##
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$weight, knots = c(2803.5)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6845 -2.7067 -0.3965 1.8145 16.2062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.6987 0.6448 56.92 <2e-16 ***
## ns(Auto$weight, knots = c(2803.5))1 -34.8674 1.4493 -24.06 <2e-16 ***
## ns(Auto$weight, knots = c(2803.5))2 -18.3763 1.0192 -18.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.175 on 389 degrees of freedom
## Multiple R-squared: 0.7153, Adjusted R-squared: 0.7138
## F-statistic: 488.7 on 2 and 389 DF, p-value: < 2.2e-16
Perbandingan Model
nilai_rmse12 <- rbind(4.14814621667258,
4.21960381450313,
4.16083965104084,
4.14665995693387)
nilai_mae12<-rbind(3.06116897397545,
3.12547480366824,
3.06302776816373,
3.0590341190436)
Metode <- c("Regresi Polinomial (d=2)",
"Fungsi Tangga (knots=5)",
"Cubic Splines (knots=1)",
"Natural Cubic Splines (knots=1)")
perbandingan12 <- data.frame(Metode,nilai_rmse12, nilai_mae12)
datatable(perbandingan12)
Plot dari Masing-masing Model
#Regresi Polinomial
rp222 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Polinomial Derajat 2") +
xlab("weight") +
ylab("mile per gallon")
#Regresi Fungsi Tangga
ft555<- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="red") +
stat_smooth(method = "lm",
formula = y~cut(x,6),
lty = 1, col = "black",se = F)+
theme_bw()+
ggtitle("Regresi Fungsi Tangga Knots=5") +
xlab("weight") + ylab("mile per gallon")
#Cubic Splines
cs111 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="yellow")+
stat_smooth(method = "lm",
formula = y~bs(x, knots=c(2803.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Cubic Splines Knots=1") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2803.5), col="grey50", lty=2)
#Natural Cubic Splines
ns111 <- ggplot(Auto,aes(x=weight, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~ns(x, knots=c(2803.5)),
lty = 1,col = "black", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines Knots=1") +
xlab("weight") +
ylab("mile per gallon") +
geom_vline(xintercept = c(2803.5), col="grey50", lty=2)
plot_grid(rp222, ft555, cs111, ns111)
Berdasarkan nila RMSE dan MAE dapat disimpulkan bahwa model terbaik untuk memodelkanmpg
vs weight
adalah model natural cubic splines dengan knots 1. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat weight
(berat/bobot) dari kendaraan maka miles per gallon
(jarak yang dapat ditempuh per galon bahan bakar) akan semakin kecil.
Referensi
Dito, G.A. (April 14, 2021). Regresi Polinomial, Regresi Fungsi Tangga dan Regresi Spline di R. Retrieved from https://gerrydito.github.io/Regresi-Polinomial,-Regresi-Fungsi-Tangga-dan-Regresi-Spline/
James, Gareth. et.al. (2013). An Introduction to Statistical Learning Aplication in R. New York: Springer.
Sartono, B.. (November 1, 2020). Moving Beyond Linearity. Retrieved from https://rpubs.com/bagusco/taklinear