Tugas Praktikum 8 Sains Data
Soal
Gunakan model-model non-linier yang sudah dipelajari pada data Auto dengan peubah respon mpg dan pilih tiga kolom untuk dijadikan peubah prediktor.
a.) Apakah ada bukti untuk hubungan non-linier untuk peubah-peubah yang anda pilih? Buat visualisasi untuk Mendukung jawaban Anda.
b.) Tentukan model non-linear terbaik untuk masing pasangan peubah (Secara visual dan/atau secara empiris)
Library
library(ISLR)
library(tidyverse)## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(splines)
library(rsample)
library(DT)
library(ggpubr)Dataset
Data yang digunakan adalah Auto dari package ISLR yang terdiri dari 392 observasi dan 9 variabel. Berikut keterangan 9 Variabel yang ada.
mpg: miles per gallon
cylinders: Number of cylinders between 4 and 8
displacement: Engine displacement (cu. inches)
horsepower: Engine horsepower
weight: Vehicle weight (lbs.)
acceleration: Time to accelerate from 0 to 60 mph (sec.)
year: Model year (modulo 100)
origin: Origin of car (1. American, 2. European, 3. Japanese)
name: Vehicle name
data("Auto",package="ISLR")
datatable(Auto)plot_dis <-ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="pink") +
ggtitle("Scatterplot mpg vs displacement") +
xlab("displacement")+
ylab("mpg")+
theme_bw()
plot_accel<-ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
ggtitle("Scatterplot mpg vs acceleration") +
xlab("acceleration")+
ylab("mpg")+
theme_bw()
plot_year<-ggplot(Auto,aes(x=year, y=mpg)) +
geom_point(alpha=0.55, color="purple") +
ggtitle("Scatterplot mpg vs year") +
xlab("year")+
ylab("mpg")+
theme_bw()
ggarrange(plot_dis,plot_accel,plot_year, nrow = 1)Dari visualisasi tersebut hubungan antara data peubah respons dan data peubah prediktor non-linier sehingga dilakukan pemodelan hubungan non-linier dengan menggunakan: 1) Polynomial Regression 2) Piecewise Constant 3) Natural Cubic Splines
Model MPG vs Displacement
Regresi Polinomial
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
derajat <- 2:4
polinomial <- map_dfr(derajat, function(i) {
metric_poly <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ poly(displacement,derajat=i),data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly
# menghitung rata-rata untuk 10 folds
mean_metric_poly <- colMeans(metric_poly)
mean_metric_poly
}
)
polinomial <- cbind(derajat=derajat,polinomial)
datatable(polinomial)polinomial %>% slice_min(mae)## derajat rmse mae
## 1 2 4.361273 3.172365
polinomial %>% slice_min(rmse)## derajat rmse mae
## 1 2 4.361273 3.172365
Secara empiris jika dilihat dari rmse dan mae terlihat bahwa polinomial pangkat 2 memberikan nilai rmse dan mae yang terkecil sehingga dipilih polinomial pangkat 2.
Plot Terbaik Polinomial Regresi db=2
Poly2<-ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "magenta",se = F)+
theme_bw()
Poly2Secara visual dan empiris plot untuk regresi polinomial terbaik adalah yang derajat 2
poly2 = lm(mpg ~ poly(displacement,2), data = Auto)
summary(poly2)##
## Call:
## lm(formula = mpg ~ poly(displacement, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2165 -2.2404 -0.2508 2.1094 20.5158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2205 106.343 < 2e-16 ***
## poly(displacement, 2)1 -124.2585 4.3652 -28.466 < 2e-16 ***
## poly(displacement, 2)2 31.0895 4.3652 7.122 5.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.365 on 389 degrees of freedom
## Multiple R-squared: 0.6888, Adjusted R-squared: 0.6872
## F-statistic: 430.5 on 2 and 389 DF, p-value: < 2.2e-16
Piecewise Constant
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
breaks <- 2:11
fungsitangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- Auto[x$in_id,]
training$displacement <- cut(training$displacement,i)
mod <- lm(mpg ~ displacement, data=training)
labs_x <- levels(mod$model[,2])
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*","\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
testing <- Auto[-x$in_id,]
displacement_new <- cut(testing$displacement,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
pred <- predict(mod, newdata=list(displacement=displacement_new))
truth <- testing$mpg
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_tangga
# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
})
fungsitangga <- cbind(breaks=breaks,fungsitangga)
datatable(fungsitangga)fungsitangga %>% slice_min(mae)## breaks rmse mae
## 1 9 4.265649 3.130846
fungsitangga %>% slice_min(rmse)## breaks rmse mae
## 1 9 4.265649 3.130846
Berdasarkan nilai RMSE & MAE model piecewise constant terbaik adalah piecewise constant dengan knots=8
pc8<- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,8),
lty = 1, col = "magenta",se = F)+
theme_bw()+
ggtitle("Piecewise Constant with Knots=8") +
xlab("displacement") + ylab("mile per gallon")
pc8 Berikut merupakan summary model piecewise constant dengan knots=7:
ftangga9 = lm(Auto$mpg ~ cut(Auto$displacement,9)) #8 knots
summary(ftangga9)##
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$displacement, 9))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5181 -2.2250 -0.2905 2.0339 19.4607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.5181 0.3947 79.850 < 2e-16 ***
## cut(Auto$displacement, 9)(111,154] -5.5601 0.6010 -9.252 < 2e-16 ***
## cut(Auto$displacement, 9)(154,197] -8.2848 1.0770 -7.693 1.23e-13 ***
## cut(Auto$displacement, 9)(197,240] -11.7022 0.7527 -15.547 < 2e-16 ***
## cut(Auto$displacement, 9)(240,283] -12.9788 0.8951 -14.499 < 2e-16 ***
## cut(Auto$displacement, 9)(283,326] -16.2276 0.7656 -21.197 < 2e-16 ***
## cut(Auto$displacement, 9)(326,369] -16.7923 0.8595 -19.537 < 2e-16 ***
## cut(Auto$displacement, 9)(369,412] -17.5494 1.1337 -15.479 < 2e-16 ***
## cut(Auto$displacement, 9)(412,455] -18.2959 1.4710 -12.438 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.251 on 383 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7033
## F-statistic: 116.9 on 8 and 383 DF, p-value: < 2.2e-16
Natural Cubic Splines
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:6
naturalspline3 <- map_dfr(df, function(i) {
metric_spline3 <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ ns(displacement,df=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_spline3
# menghitung rata-rata untuk 10 folds
mean_metric_spline3 <- colMeans(metric_spline3)
mean_metric_spline3
}
)
naturalspline3 <- cbind(df=df,naturalspline3)
datatable(naturalspline3)naturalspline3 %>% slice_min(mae)## df rmse mae
## 1 6 4.260237 3.165058
naturalspline3 %>% slice_min(rmse)## df rmse mae
## 1 6 4.260237 3.165058
Bersadarkan nilai rmse terkecil, didapat model natural cubic spline terbaik dengan df=6.
attr(ns(Auto$displacement, df=6),"knots")## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 97 119 151 232 318
ncs6 <- ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~ns(x, df=6),
lty = 1,col = "magenta", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines with df=6") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(97, 119, 151, 232, 318), col="grey50", lty=2)
ncs6ncspline = lm(Auto$mpg ~ ns(Auto$displacement, knots =
c(97, 119, 151, 232, 318)))
summary(ncspline)##
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$displacement, knots = c(97, 119,
## 151, 232, 318)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9381 -2.3829 -0.3102 2.1258 20.6988
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 29.254 1.409
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))1 -5.603 1.473
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))2 -2.879 2.154
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))3 -11.521 1.735
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))4 -16.379 1.372
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))5 -10.755 3.232
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))6 -19.107 1.483
## t value Pr(>|t|)
## (Intercept) 20.760 < 2e-16 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))1 -3.803 0.000166 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))2 -1.336 0.182237
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))3 -6.639 1.08e-10 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))4 -11.935 < 2e-16 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))5 -3.327 0.000961 ***
## ns(Auto$displacement, knots = c(97, 119, 151, 232, 318))6 -12.885 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.24 on 385 degrees of freedom
## Multiple R-squared: 0.7094, Adjusted R-squared: 0.7049
## F-statistic: 156.6 on 6 and 385 DF, p-value: < 2.2e-16
Perbandingan Model
nilai_rmse <- rbind(4.361273,
4.265649,
4.260237)
nilai_mae<-rbind(3.172365,
3.130846,
3.165058)
Metode <- c("Regresi Polinomial (d=2)",
"Piecewise Constant (knots=8)",
"Natural Cubic Splines (df=6)")
perbandingan <- data.frame(Metode,nilai_rmse, nilai_mae)
datatable(perbandingan)ggplot(Auto,aes(x=displacement, y=mpg)) +
geom_point(alpha=0.55, color="turquoise") +
ggtitle("Scatterplot Displacement VS mile per gallon") +
stat_smooth(method = "lm",formula = y~poly(x,2,raw=T),
lty = 1, col = "yellow",se = F)+
stat_smooth(method = "lm",formula = y~cut(x,8),
lty = 1, col = "purple",se = F)+
stat_smooth(method = "lm",formula = y~ns(x,df=6),
lty = 1, col = "red",se = F)+
theme_bw()Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs displacement adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat displacement dari kendaraan maka miles per gallon akan semakin kecil atau kendaraan semakin boros.
Model MPG vs acceleration
Regresi Polinomial
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
derajat <- 2:4
polinomial <- map_dfr(derajat, function(i) {
metric_poly <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ poly(acceleration,derajat=i),data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly
# menghitung rata-rata untuk 10 folds
mean_metric_poly <- colMeans(metric_poly)
mean_metric_poly
}
)
polinomial <- cbind(derajat=derajat,polinomial)
datatable(polinomial)polinomial %>% slice_min(mae)## derajat rmse mae
## 1 4 6.998135 5.690208
polinomial %>% slice_min(rmse)## derajat rmse mae
## 1 4 6.998135 5.690208
Secara empiris jika dilihat dari rmse dan mae terlihat bahwa polinomial pangkat 4 memberikan nilai rmse dan mae yang terkecil sehingga dipilih polinomial pangkat 4.
Plot Terbaik Polinomial Regresi db=4
Poly4<-ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~poly(x,4,raw=T),
lty = 1, col = "magenta",se = F)+
theme_bw()
Poly4Secara visual dan empiris plot untuk regresi polinomial terbaik adalah yang derajat 4
poly4 = lm(mpg ~ poly(acceleration,4), data = Auto)
summary(poly4)##
## Call:
## lm(formula = mpg ~ poly(acceleration, 4), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4665 -5.3322 -0.8491 4.6067 22.9362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.3514 66.724 < 2e-16 ***
## poly(acceleration, 4)1 65.3340 6.9571 9.391 < 2e-16 ***
## poly(acceleration, 4)2 -18.7482 6.9571 -2.695 0.00735 **
## poly(acceleration, 4)3 6.0643 6.9571 0.872 0.38393
## poly(acceleration, 4)4 20.7577 6.9571 2.984 0.00303 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.957 on 387 degrees of freedom
## Multiple R-squared: 0.2136, Adjusted R-squared: 0.2055
## F-statistic: 26.28 on 4 and 387 DF, p-value: < 2.2e-16
Piecewise Constant
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
breaks <- 2:11
fungsitangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- Auto[x$in_id,]
training$acceleration <- cut(training$acceleration,i)
mod <- lm(mpg ~ acceleration, data=training)
labs_x <- levels(mod$model[,2])
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*","\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
testing <- Auto[-x$in_id,]
acceleration_new <- cut(testing$acceleration,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
pred <- predict(mod, newdata=list(acceleration=acceleration_new))
truth <- testing$mpg
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_tangga
# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
})
fungsitangga <- cbind(breaks=breaks,fungsitangga)
datatable(fungsitangga)fungsitangga %>% slice_min(mae)## breaks rmse mae
## 1 3 6.984927 5.632108
fungsitangga %>% slice_min(rmse)## breaks rmse mae
## 1 3 6.984927 5.632108
Berdasarkan nilai RMSE & MAE model piecewise constant terbaik adalah piecewise constant dengan knots=2 namun berdasarkan visual piecewise constant yang terbaik dengan knots=5
pc5<- ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,6),
lty = 1, col = "magenta",se = F)+
theme_bw()+
ggtitle("Piecewise Constant with Knots=5") +
xlab("acceleration") + ylab("mile per gallon")
pc5 Berikut merupakan summary model piecewise constant dengan knots=5:
ftangga6 = lm(Auto$mpg ~ cut(Auto$acceleration,6)) #5 knots
summary(ftangga6)##
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$acceleration, 6))
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6725 -5.0975 -0.6837 4.3191 20.9275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.818 2.100 7.055 8.02e-12 ***
## cut(Auto$acceleration, 6)(10.8,13.6] 2.932 2.232 1.314 0.19
## cut(Auto$acceleration, 6)(13.6,16.4] 9.866 2.171 4.544 7.40e-06 ***
## cut(Auto$acceleration, 6)(16.4,19.2] 10.854 2.211 4.910 1.35e-06 ***
## cut(Auto$acceleration, 6)(19.2,22] 12.441 2.492 4.993 9.01e-07 ***
## cut(Auto$acceleration, 6)(22,24.8] 15.896 3.368 4.720 3.31e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.966 on 386 degrees of freedom
## Multiple R-squared: 0.2137, Adjusted R-squared: 0.2035
## F-statistic: 20.98 on 5 and 386 DF, p-value: < 2.2e-16
Natural Cubic Splines
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:6
naturalspline2 <- map_dfr(df, function(i) {
metric_spline3 <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ ns(acceleration,df=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_spline3
# menghitung rata-rata untuk 10 folds
mean_metric_spline3 <- colMeans(metric_spline3)
mean_metric_spline3
}
)
naturalspline2 <- cbind(df=df,naturalspline2)
datatable(naturalspline2)naturalspline2 %>% slice_min(mae)## df rmse mae
## 1 5 6.984689 5.625165
naturalspline2 %>% slice_min(rmse)## df rmse mae
## 1 6 6.961777 5.627009
Berdasarkan nilai rmse dan mae yang terkecil, model terbaik adalah natural cubic spline dengan df=6.
attr(ns(Auto$acceleration, df=6),"knots")## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 13.00000 14.50000 15.50000 16.50000 18.18333
ncs6 <- ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~ns(x, df=6),
lty = 1,col = "magenta", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines with df=6") +
xlab("Acceleration") +
ylab("mile per gallon") +
geom_vline(xintercept = c(13, 14.5, 15.5, 16.5, 18.183), col="grey50", lty=2)
ncs6ncspline = lm(Auto$mpg ~ ns(Auto$acceleration, knots =
c(13, 14.5, 15.5, 16.5, 18.183)))
summary(ncspline)##
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$acceleration, knots = c(13, 14.5,
## 15.5, 16.5, 18.183)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7756 -5.3712 -0.8021 4.5241 22.2375
##
## Coefficients:
## Estimate
## (Intercept) 16.234
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))1 9.117
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))2 7.469
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))3 10.660
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))4 9.162
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))5 10.313
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))6 20.427
## Std. Error
## (Intercept) 3.383
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))1 3.248
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))2 3.835
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))3 3.578
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))4 2.840
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))5 7.846
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))6 3.827
## t value
## (Intercept) 4.799
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))1 2.807
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))2 1.947
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))3 2.980
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))4 3.226
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))5 1.314
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))6 5.337
## Pr(>|t|)
## (Intercept) 2.29e-06 ***
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))1 0.00526 **
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))2 0.05223 .
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))3 0.00307 **
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))4 0.00136 **
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))5 0.18948
## ns(Auto$acceleration, knots = c(13, 14.5, 15.5, 16.5, 18.183))6 1.61e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.932 on 385 degrees of freedom
## Multiple R-squared: 0.2234, Adjusted R-squared: 0.2113
## F-statistic: 18.45 on 6 and 385 DF, p-value: < 2.2e-16
Perbandingan Model
nilai_rmse <- rbind(6.998135,
7.004580,
4.305876)
nilai_mae<-rbind(5.690208,
5.6388286,
3.246447)
Metode <- c("Regresi Polinomial (d=4)",
"Piecewise Constant (knots=5)",
"Natural Cubic Splines (df=6)")
perbandingan <- data.frame(Metode,nilai_rmse, nilai_mae)
datatable(perbandingan)ggplot(Auto,aes(x=acceleration, y=mpg)) +
geom_point(alpha=0.55, color="turquoise") +
ggtitle("Scatterplot acceleration VS mile per gallon") +
stat_smooth(method = "lm",formula = y~poly(x,4,raw=T),
lty = 1, col = "yellow",se = F)+
stat_smooth(method = "lm",formula = y~cut(x,6),
lty = 1, col = "purple",se = F)+
stat_smooth(method = "lm",formula = y~ns(x,df=6),
lty = 1, col = "red",se = F)+
theme_bw() Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs acceleration adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat acceleration dari kendaraan maka miles per gallon akan semakin besar atau kendaraan semakin hemat.
Model MPG vs horsepower
Regresi Polinomial
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
derajat <- 2:4
polinomial <- map_dfr(derajat, function(i) {
metric_poly <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ poly(horsepower,derajat=i),data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly
# menghitung rata-rata untuk 10 folds
mean_metric_poly <- colMeans(metric_poly)
mean_metric_poly
}
)
polinomial <- cbind(derajat=derajat,polinomial)
datatable(polinomial)polinomial %>% slice_min(mae)## derajat rmse mae
## 1 2 4.339298 3.268563
polinomial %>% slice_min(rmse)## derajat rmse mae
## 1 2 4.339298 3.268563
Secara empiris jika dilihat dari rmse dan mae terlihat bahwa polinomial pangkat 2 memberikan nilai rmse dan mae yang terkecil sehingga dipilih polinomial pangkat 2.
Plot Terbaik Polinomial Regresi db=2
Poly2<-ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "magenta",se = F)+
theme_bw()
Poly2Secara visual dan empiris plot untuk regresi polinomial terbaik adalah yang derajat 2
poly2 = lm(mpg ~ poly(horsepower,2), data = Auto)
summary(poly2)##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
Piecewise Constant
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
breaks <- 2:11
fungsitangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- Auto[x$in_id,]
training$horsepower <- cut(training$horsepower,i)
mod <- lm(mpg ~ horsepower, data=training)
labs_x <- levels(mod$model[,2])
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*","\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x)))
testing <- Auto[-x$in_id,]
horsepower_new <- cut(testing$horsepower,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
pred <- predict(mod, newdata=list(horsepower=horsepower_new))
truth <- testing$mpg
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_tangga
# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
})
fungsitangga <- cbind(breaks=breaks,fungsitangga)
datatable(fungsitangga)fungsitangga %>% slice_min(mae)## breaks rmse mae
## 1 11 4.478578 3.349871
fungsitangga %>% slice_min(rmse)## breaks rmse mae
## 1 8 4.454052 3.395497
Berdasarkan nilai RMSE & MAE model piecewise constant terbaik adalah piecewise constant dengan knots=7
pc8<- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue") +
stat_smooth(method = "lm",
formula = y~cut(x,7),
lty = 1, col = "magenta",se = F)+
theme_bw()+
ggtitle("Piecewise Constant with Knots=7") +
xlab("horsepower") + ylab("mile per gallon")
pc8 Berikut merupakan summary model piecewise constant dengan knots=7:
ftangga8 = lm(Auto$mpg ~ cut(Auto$horsepower,8)) #7 knots
summary(ftangga8)##
## Call:
## lm(formula = Auto$mpg ~ cut(Auto$horsepower, 8))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9471 -2.6757 -0.1533 2.4015 14.5529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9085 0.5771 58.76 <2e-16 ***
## cut(Auto$horsepower, 8)(69,92] -6.9614 0.6910 -10.07 <2e-16 ***
## cut(Auto$horsepower, 8)(92,115] -12.7551 0.7425 -17.18 <2e-16 ***
## cut(Auto$horsepower, 8)(115,138] -15.6656 1.1264 -13.91 <2e-16 ***
## cut(Auto$horsepower, 8)(138,161] -18.7799 0.8568 -21.92 <2e-16 ***
## cut(Auto$horsepower, 8)(161,184] -19.9735 1.1470 -17.41 <2e-16 ***
## cut(Auto$horsepower, 8)(184,207] -21.1228 1.7720 -11.92 <2e-16 ***
## cut(Auto$horsepower, 8)(207,230] -21.0085 1.5159 -13.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.433 on 384 degrees of freedom
## Multiple R-squared: 0.6832, Adjusted R-squared: 0.6774
## F-statistic: 118.3 on 7 and 384 DF, p-value: < 2.2e-16
Natural Cubic Splines
set.seed(41)
cross_val <- vfold_cv(Auto,v=10,strata = "mpg")
df <- 2:6
naturalspline3 <- map_dfr(df, function(i) {
metric_spline3 <- map_dfr(cross_val$splits,
function(x) {
mod <- lm(mpg ~ ns(horsepower,df=i), data=Auto[x$in_id,])
pred <- predict(mod, newdata=Auto[-x$in_id,])
truth <- Auto[-x$in_id,]$mpg
rmse <- mlr3measures::rmse(truth = truth, response = pred)
mae <- mlr3measures::mae(truth = truth, response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_spline3
# menghitung rata-rata untuk 10 folds
mean_metric_spline3 <- colMeans(metric_spline3)
mean_metric_spline3
}
)
naturalspline3 <- cbind(df=df,naturalspline3)
datatable(naturalspline3)naturalspline3 %>% slice_min(mae)## df rmse mae
## 1 6 4.305876 3.246447
naturalspline3 %>% slice_min(rmse)## df rmse mae
## 1 6 4.305876 3.246447
attr(ns(Auto$horsepower, df=6),"knots")## 16.66667% 33.33333% 50% 66.66667% 83.33333%
## 70.0 84.0 93.5 110.0 150.0
ncs6 <- ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="blue")+
stat_smooth(method = "lm",
formula = y~ns(x, df=6),
lty = 1,col = "magenta", se=F)+
theme_bw()+
ggtitle("Natural Cubic Splines with df=6") +
xlab("Horse Power") +
ylab("mile per gallon") +
geom_vline(xintercept = c(70, 84, 93.5, 110, 150), col="grey50", lty=2)
ncs6ncspline = lm(Auto$mpg ~ ns(Auto$horsepower, knots =
c(70, 84, 93.5, 110, 150)))
summary(ncspline)##
## Call:
## lm(formula = Auto$mpg ~ ns(Auto$horsepower, knots = c(70, 84,
## 93.5, 110, 150)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9491 -2.6183 -0.1595 2.3508 15.1349
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 34.738 1.509
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))1 -8.210 1.594
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))2 -13.046 1.835
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))3 -14.577 1.886
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))4 -22.802 1.624
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))5 -22.758 3.512
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))6 -20.849 1.742
## t value Pr(>|t|)
## (Intercept) 23.021 < 2e-16 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))1 -5.149 4.18e-07 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))2 -7.108 5.76e-12 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))3 -7.730 9.50e-14 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))4 -14.039 < 2e-16 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))5 -6.480 2.81e-10 ***
## ns(Auto$horsepower, knots = c(70, 84, 93.5, 110, 150))6 -11.967 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.302 on 385 degrees of freedom
## Multiple R-squared: 0.7009, Adjusted R-squared: 0.6962
## F-statistic: 150.4 on 6 and 385 DF, p-value: < 2.2e-16
Perbandingan Model
nilai_rmse <- rbind(4.339298,
4.454052,
4.305876)
nilai_mae<-rbind(3.268563,
3.395497,
3.246447)
Metode <- c("Regresi Polinomial (d=2)",
"Piecewise Constant (knots=7)",
"Natural Cubic Splines (df=6)")
perbandingan <- data.frame(Metode,nilai_rmse, nilai_mae)
datatable(perbandingan)ggplot(Auto,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="turquoise") +
ggtitle("Scatterplot horsepower VS mile per gallon") +
stat_smooth(method = "lm",formula = y~poly(x,2,raw=T),
lty = 1, col = "yellow",se = F)+
stat_smooth(method = "lm",formula = y~cut(x,7),
lty = 1, col = "purple",se = F)+
stat_smooth(method = "lm",formula = y~ns(x,df=6),
lty = 1, col = "red",se = F)+
theme_bw() Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs horsepower adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat horsepower dari kendaraan maka miles per gallon akan semakin kecil atau kendaraan semakin boros.
Kesimpulan
Berdasarkan perbandingan dari ketiga model regresi non-linier tersebut didapat masing-masing model terbaik untuk setiap pasangan peubah respons vs peubah prediktor.
mpg vs displacement : Regresi piecewise constant (knots=8) Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs displacement adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat displacement (Engine displacement) dari kendaraan maka miles per gallon akan semakin kecil atau kendaraan semakin boros.
mpg vs horsepower : Regresi natural cubic splines (df=6) Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs horsepower adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat Engine horsepower dari kendaraan maka miles per gallon akan semakin kecil atau kendaraan semakin boros.
mpg vs acceleration : Regresi natural cubic splines (df=6) Berdasarkan perbandingan nilai RMSE dan MAE terkecil dapat disimpulkan bahwa model terbaik untuk memodelkan mpg vs acceleration adalah model natural cubic splines dengan df=6. Hasil visualisasi data menunjukkan bahwa secara umum semakin meningkat acceleration atau waktu yang dibutuhkan untuk akselerasi dari 0 hingga 60 mph (miles per hour) dari kendaraan maka miles per gallon akan semakin besar atau kendaraan semakin hemat.