STA1381: Nonlinear Regression Part 1
A. Library
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.3 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(dplyr)
library(purrr)
library(rsample)
B. Contoh Pada Kuliah
1. Eksplorasi Data
set.seed(123)
data.x <- rnorm(1000,3,1)
err <- rnorm(1000)
y <- 5+2*data.x+3*data.x^2+err
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70))
Data diatas merupakan data bangkitan dengan 1000 amatan yang menyebar normal dengan mean sebesar 3 dan ragam 1.
2. Regresi Linier
lin.mod <-lm( y~data.x)
plot( data.x,y,xlim=c( -2,5), ylim=c( -10,70))
lines(data.x,lin.mod$fitted.values,col="red")
summary(lin.mod)
##
## Call:
## lm(formula = y ~ data.x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.686 -2.574 -1.428 1.195 27.185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.1524 0.4254 -47.38 <2e-16 ***
## data.x 20.3790 0.1340 152.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.2 on 998 degrees of freedom
## Multiple R-squared: 0.9586, Adjusted R-squared: 0.9586
## F-statistic: 2.314e+04 on 1 and 998 DF, p-value: < 2.2e-16
3. Polynomial
pol.mod <- lm( y~data.x+I(data.x^2))
ix <- sort( data.x,index.return=T)$ix
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70))
lines(data.x[ix], pol.mod$fitted.values[ix],col="blue", cex=2)
summary(pol.mod)
##
## Call:
## lm(formula = y ~ data.x + I(data.x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0319 -0.6942 0.0049 0.7116 3.2855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.70052 0.21941 21.42 <2e-16 ***
## data.x 2.14409 0.14611 14.67 <2e-16 ***
## I(data.x^2) 2.99081 0.02338 127.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 997 degrees of freedom
## Multiple R-squared: 0.9976, Adjusted R-squared: 0.9976
## F-statistic: 2.094e+05 on 2 and 997 DF, p-value: < 2.2e-16
4. Fungsi Tangga
step.mod = lm(y ~ cut(data.x,5))
summary(step.mod)
##
## Call:
## lm(formula = y ~ cut(data.x, 5))
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1608 -5.6146 -0.4399 4.7188 31.4947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.4596 0.9519 10.99 <2e-16 ***
## cut(data.x, 5)(1.4,2.61] 12.4980 1.0418 12.00 <2e-16 ***
## cut(data.x, 5)(2.61,3.82] 31.5492 1.0072 31.32 <2e-16 ***
## cut(data.x, 5)(3.82,5.03] 57.8323 1.0869 53.21 <2e-16 ***
## cut(data.x, 5)(5.03,6.25] 92.2643 1.6801 54.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.059 on 995 degrees of freedom
## Multiple R-squared: 0.8835, Adjusted R-squared: 0.883
## F-statistic: 1886 on 4 and 995 DF, p-value: < 2.2e-16
5. Komparasi Model
nilai_AIC <- rbind(AIC(lin.mod),
AIC(pol.mod),
AIC(step.mod))
nama_model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)")
data.frame(nama_model,nilai_AIC)
## nama_model nilai_AIC
## 1 Linear 5711.836
## 2 Poly (ordo=2) 2856.470
## 3 Tangga (breaks=3) 6753.590
Berdasarkan nilai AIC diatas, maka dapat ditentukan bahwa model terbaik adalah model Poly dengan ordo 2 karena memiliki nilai AIC paling kecil diantara model lainnya.
C. Contoh Data TRICEPS
Data yang digunakan untuk ilustrasi berasal dari studi antropometri terhadap 892 perempuan di bawah 50 tahun di tiga desa Gambia di Afrika Barat. Data terdiri dari 3 Kolom yaitu Age, Intriceps dan tricepts. Berikut adalah penjelasannya pada masing-masing kolom:
age : umur responden
Intriceps : logaritma dari ketebalan lipatan kulit triceps
triceps: ketebalan lipatan kulit triceps Lipatan kulit trisep diperlukan untuk menghitung lingkar otot lengan atas.
Ketebalannya memberikan informasi tentang cadangan lemak tubuh, sedangkan massa otot yang dihitung memberikan informasi tentang cadangan protein
data("triceps", package="MultiKink")
Jika kita gambarkan dalam bentuk scatterplot, maka :
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
theme_bw()
Berdasarkan pola hubungan yang terlihat pada scatterplot, kita akan mencoba untuk mencari model yang bisa merepresentasikan pola hubungan tersebut dengan baik.
1. Regresi Linear
mod_linear = lm(triceps~age,data=triceps)
summary(mod_linear)
##
## Call:
## lm(formula = triceps ~ age, data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9512 -2.3965 -0.5154 1.5822 25.1233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.19717 0.21244 29.17 <2e-16 ***
## age 0.21584 0.01014 21.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.007 on 890 degrees of freedom
## Multiple R-squared: 0.3372, Adjusted R-squared: 0.3365
## F-statistic: 452.8 on 1 and 890 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~x,lty = 1,
col = "blue",se = F)+
theme_bw()
2. Regresi Polinomial Derajat 2 (ordo 2)
mod_polinomial2 = lm(triceps ~ poly(age,2,raw = T),
data=triceps)
summary(mod_polinomial2)
##
## Call:
## lm(formula = triceps ~ poly(age, 2, raw = T), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5677 -2.4401 -0.4587 1.6368 24.9961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0229191 0.3063806 19.658 < 2e-16 ***
## poly(age, 2, raw = T)1 0.2434733 0.0364403 6.681 4.17e-11 ***
## poly(age, 2, raw = T)2 -0.0006257 0.0007926 -0.789 0.43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.008 on 889 degrees of freedom
## Multiple R-squared: 0.3377, Adjusted R-squared: 0.3362
## F-statistic: 226.6 on 2 and 889 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "blue",se = F)+
theme_bw()
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "blue",se = T)+
theme_bw()
3.Regresi Polinomial Derajat 3 (ordo 3)
mod_polinomial3 = lm(triceps ~ poly(age,3,raw = T),data=triceps)
summary(mod_polinomial3)
##
## Call:
## lm(formula = triceps ~ poly(age, 3, raw = T), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5832 -1.9284 -0.5415 1.3283 24.4440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.004e+00 3.831e-01 20.893 < 2e-16 ***
## poly(age, 3, raw = T)1 -3.157e-01 7.721e-02 -4.089 4.73e-05 ***
## poly(age, 3, raw = T)2 3.101e-02 3.964e-03 7.824 1.45e-14 ***
## poly(age, 3, raw = T)3 -4.566e-04 5.612e-05 -8.135 1.38e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.868 on 888 degrees of freedom
## Multiple R-squared: 0.3836, Adjusted R-squared: 0.3815
## F-statistic: 184.2 on 3 and 888 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,3,raw=T),
lty = 1, col = "blue",se = T)+
theme_bw()
4.Regresi Fungsi Tangga (5)
mod_tangga5 = lm(triceps ~ cut(age,5),data=triceps)
summary(mod_tangga5)
##
## Call:
## lm(formula = triceps ~ cut(age, 5), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5474 -2.0318 -0.4465 1.3682 23.3759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2318 0.1994 36.260 < 2e-16 ***
## cut(age, 5)(10.6,20.9] 1.6294 0.3244 5.023 6.16e-07 ***
## cut(age, 5)(20.9,31.2] 5.9923 0.4222 14.192 < 2e-16 ***
## cut(age, 5)(31.2,41.5] 7.5156 0.4506 16.678 < 2e-16 ***
## cut(age, 5)(41.5,51.8] 7.4561 0.5543 13.452 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.939 on 887 degrees of freedom
## Multiple R-squared: 0.3617, Adjusted R-squared: 0.3588
## F-statistic: 125.7 on 4 and 887 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
lty = 1, col = "blue",se = F)+
theme_bw()
5. Regresi Fungsi Tangga (7)
mod_tangga7 = lm(triceps ~ cut(age,7),data=triceps)
summary(mod_tangga7)
##
## Call:
## lm(formula = triceps ~ cut(age, 7), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8063 -1.7592 -0.4366 1.2894 23.1461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5592 0.2219 34.060 < 2e-16 ***
## cut(age, 7)(7.62,15] -0.6486 0.3326 -1.950 0.0515 .
## cut(age, 7)(15,22.3] 3.4534 0.4239 8.146 1.27e-15 ***
## cut(age, 7)(22.3,29.7] 5.8947 0.4604 12.804 < 2e-16 ***
## cut(age, 7)(29.7,37] 7.8471 0.5249 14.949 < 2e-16 ***
## cut(age, 7)(37,44.4] 6.9191 0.5391 12.835 < 2e-16 ***
## cut(age, 7)(44.4,51.8] 6.3013 0.6560 9.606 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.805 on 885 degrees of freedom
## Multiple R-squared: 0.4055, Adjusted R-squared: 0.4014
## F-statistic: 100.6 on 6 and 885 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,7),
lty = 1, col = "blue",se = F)+
theme_bw()
6. Perbandingan Model
MSE = function(pred,actual){
mean((pred-actual)^2)
}
7. Membandingkan model
nilai_MSE <- rbind(MSE(predict(mod_linear),triceps$triceps),
MSE(predict(mod_polinomial2),triceps$triceps),
MSE(predict(mod_polinomial3),triceps$triceps),
MSE(predict(mod_tangga5),triceps$triceps),
MSE(predict(mod_tangga7),triceps$triceps))
nama_model <- c("Linear","Poly (ordo=2)", "Poly (ordo=3)",
"Tangga (breaks=5)", "Tangga (breaks=7)")
data.frame(nama_model,nilai_MSE)
## nama_model nilai_MSE
## 1 Linear 16.01758
## 2 Poly (ordo=2) 16.00636
## 3 Poly (ordo=3) 14.89621
## 4 Tangga (breaks=5) 15.42602
## 5 Tangga (breaks=7) 14.36779
Berdasarkan perbandingan model diatas, dapat dilihat bahwa nilai MSE terkecil ada pada model tangga dengan breaks 7 dengan nilai 14.36779.
D.Evaluasi Model Menggunakan Cross Validation
1. Regresi Linear
library(mlr3measures)
## In order to avoid name clashes, do not attach 'mlr3measures'. Instead, only load the namespace with `requireNamespace("mlrmeasures")` and access the measures directly via `::`, e.g. `mlr3measures::auc()`.
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
metric_linear <- map_dfr(cross_val$splits,
function(x){
mod <- lm(triceps ~ age,data=triceps[x$in_id,])
pred <- predict(mod,newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_linear
## # A tibble: 10 x 2
## rmse mae
## <dbl> <dbl>
## 1 3.65 2.82
## 2 4.62 3.22
## 3 4.38 3.00
## 4 3.85 2.80
## 5 3.08 2.36
## 6 3.83 2.81
## 7 3.59 2.78
## 8 4.66 3.06
## 9 3.50 2.56
## 10 4.59 2.93
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear
## rmse mae
## 3.973249 2.833886
2. Polynomial Derajat 2 (ordo 2)
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
metric_poly2 <- map_dfr(cross_val$splits,
function(x){
mod <- lm(triceps ~ poly(age,2,raw = T),data=triceps[x$in_id,])
pred <- predict(mod,newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly2
## # A tibble: 10 x 2
## rmse mae
## <dbl> <dbl>
## 1 3.64 2.82
## 2 4.62 3.26
## 3 4.39 3.02
## 4 3.85 2.81
## 5 3.10 2.38
## 6 3.82 2.81
## 7 3.62 2.83
## 8 4.65 3.07
## 9 3.50 2.57
## 10 4.59 2.94
# menghitung rata-rata untuk 10 folds
mean_metric_poly2 <- colMeans(metric_poly2)
mean_metric_poly2
## rmse mae
## 3.977777 2.851787
3. Polynomial Derajat 3 (ordo 3)
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
metric_poly3 <- map_dfr(cross_val$splits,
function(x){
mod <- lm(triceps ~ poly(age,3,raw = T),data=triceps[x$in_id,])
pred <- predict(mod,newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly3
## # A tibble: 10 x 2
## rmse mae
## <dbl> <dbl>
## 1 3.49 2.60
## 2 4.48 2.99
## 3 4.21 2.85
## 4 4.02 2.75
## 5 3.03 2.09
## 6 3.63 2.59
## 7 3.53 2.52
## 8 4.54 2.91
## 9 3.27 2.33
## 10 4.27 2.68
# menghitung rata-rata untuk 10 folds
mean_metric_poly3 <- colMeans(metric_poly3)
mean_metric_poly3
## rmse mae
## 3.845976 2.632125
4. Fungsi Tangga
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
breaks <- 3:10
best_tangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- triceps[x$in_id,]
training$age <- cut(training$age,i)
mod <- lm(triceps ~ age,data=training)
labs_x <- levels(mod$model[,2])
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))
testing <- triceps[-x$in_id,]
age_new <- cut(testing$age,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
pred <- predict(mod,newdata=list(age=age_new))
truth <- testing$triceps
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_tangga
# menghitung rata-rata untuk 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
}
)
best_tangga <- cbind(breaks=breaks,best_tangga)
# menampilkan hasil all breaks
best_tangga
## breaks rmse mae
## 1 3 3.835357 2.618775
## 2 4 3.882932 2.651911
## 3 5 3.917840 2.724368
## 4 6 3.836068 2.622939
## 5 7 3.789715 2.555062
## 6 8 3.812789 2.555563
## 7 9 3.781720 2.518706
## 8 10 3.795877 2.529479
#berdasarkan rmse
best_tangga %>% slice_min(rmse)
## breaks rmse mae
## 1 9 3.78172 2.518706
#berdasarkan mae
best_tangga %>% slice_min(mae)
## breaks rmse mae
## 1 9 3.78172 2.518706
#Perbandingan Model
nilai_metric <- rbind(mean_metric_linear,
mean_metric_poly2,
mean_metric_poly3,
best_tangga %>% select(-1) %>% slice_min(mae))
nama_model <- c("Linear","Poly2","Poly3","Tangga (breaks=9)")
data.frame(nama_model,nilai_metric)
## nama_model rmse mae
## 1 Linear 3.973249 2.833886
## 2 Poly2 3.977777 2.851787
## 3 Poly3 3.845976 2.632125
## 4 Tangga (breaks=9) 3.781720 2.518706
STA1381: Nonlinear Regression Part 2
A. Library
library(ggplot2)
library(dplyr)
library(purrr)
library(rsample)
library(splines)
B. Contoh Pada Kuliah
set.seed(123)
data.x <- rnorm(1000,1,1)
err <- rnorm(1000,0,5)
y <- 5+2*data.x+3*data.x^2+err
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="orange")
abline(v=1, col="red", lty=2)
Data diatas merupakan data bangkitan dengan 1000 amatan yang menyebar normal dengan mean sebesar 1 dan ragam 1.
1. Piecewise cubic polynomial
dt.all <- cbind(y,data.x)
##knots=1
dt1 <- dt.all[data.x <=1,]
dim(dt1)
## [1] 495 2
dt2 <- dt.all[data.x >1,]
dim(dt2)
## [1] 505 2
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="orange")
cub.mod1 <- lm(dt1[,1]~dt1[,2]+I(dt1[,2]^2)+I(dt1[,2]^3))
ix <- sort(dt1[,2], index.return=T)$ix
lines(dt1[ix,2],cub.mod1$fitted.values[ix],col="blue", lwd=2)
cub.mod2 <- lm(dt2[,1]~dt2[,2]+I(dt2[,2]^2)+I(dt2[,2]^3))
ix <- sort(dt2[,2], index.return=T)$ix
lines(dt2[ix,2],cub.mod2$fitted.values[ix],col="blue", lwd=2)
2. Truncated power basis
#dengan menggunakan truncated power basis
plot(data.x,y,xlim=c( -2,5), ylim=c( -10,70), pch=16,col="orange")
abline(v=1,col="red", lty=2)
hx <- ifelse( data.x>1,(data.x-1)^3,0)
cubspline.mod <- lm( y~data.x+I(data.x^2)+I(data.x^3)+hx)
ix <- sort( data.x,index.return=T)$ix
lines( data.x[ix], cubspline.mod$fitted.values[ix],col="green", lwd=2)
3. Perbandingan dengan k-fold CV
plot( data.x,y,xlim=c(-2,5), ylim=c( -10,70), pch=16,col="orange")
abline(v=0,col="red", lty=2)
abline(v=2,col="red", lty=2)
hx1 <- ifelse( data.x>0,(data.x-0)^3,0)
hx2 <- ifelse( data.x>2,(data.x-2)^3,0)
cubspline.mod2 <- lm( y~data.x+I(data.x^2)+I(data.x^3)+hx1+hx2)
ix <- sort( data.x,index.return=T)$ix
lines( data.x[ix],cubspline.mod2$fitted.values[ix],col="green", lwd=2)
4. Perbandingan 1 knot dan 2 knot dengan 5-fold CV
##1 knot
data.all <- cbind( y,data.x,hx)
set.seed(456)
ind <- sample(1:5,length( data.x),replace=T)
res <- c()
for( i in 1:5){
dt.train <- data.all[ ind!=i,]
x.test <- data.all[ ind==i, -1]
y.test <- data.all[ ind==i,1]
mod1 <- lm( dt.train[,1]~dt.train[,2]+I( dt.train[,2]^2)+I( dt.train[,2]^3)+dt.train[,3])
x.test.olah <- cbind(1,x.test[,1], x.test[,1]^2,x.test[,1]^3,x.test[,2])
beta <- coefficients(mod1)
prediksi <- x.test.olah%*%beta
res <- c( res,mean(( y.test-prediksi)^2))
}
res
## [1] 22.08767 21.39598 29.47677 29.68683 25.18070
mean(res)
## [1] 25.56559
##2 knot (knot = 0, 2)
data.all2 <- cbind(y,data.x,hx1,hx2)
set.seed(456)
ind2 <- sample(1:5,length( data.x),replace=T)
res2 <- c()
for( i in 1:5){
dt.train2 <- data.all2[ind2!=i,]
x.test2 <- data.all2[ind2==i,-1]
y.test2 <- data.all2[ind2==i,1]
mod2 <- lm(dt.train2[,1]~dt.train2[,2]+I(dt.train2[,2]^2)+I(dt.train2[,2]^3)+dt.train2[,3]+dt.train2[,4])
x.test.olah2 <- cbind(1,x.test2[,1],x.test2[,1]^2,x.test2[,1]^3,x.test2[,2],x.test2[,3])
beta2 <- coefficients(mod2)
prediksi2 <- x.test.olah2%*%beta2
res2 <- c(res2,mean((y.test2-prediksi2)^2))
}
res2
## [1] 22.32898 21.33868 29.40459 29.79913 25.22047
mean(res2)
## [1] 25.61837
5. Smoothing Spline
ss1 <- smooth.spline(data.x,y,all.knots = T)
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16,col="orange")
lines(ss1,col="purple", lwd=2)
C. Contoh data TRICEPS
Data yang digunakan untuk ilustrasi berasal dari studi antropometri terhadap 892 perempuan di bawah 50 tahun di tiga desa Gambia di Afrika Barat. Data terdiri dari 3 Kolom yaitu Age, Intriceps dan tricepts. Berikut adalah penjelasannya pada masing-masing kolom:
age : umur responden
Intriceps : logaritma dari ketebalan lipatan kulit triceps
triceps: ketebalan lipatan kulit triceps Lipatan kulit trisep diperlukan untuk menghitung lingkar otot lengan atas.
``
Ketebalannya memberikan informasi tentang cadangan lemak tubuh, sedangkan massa otot yang dihitung memberikan informasi tentang cadangan protein
data("triceps", package="MultiKink")
Jika kita gambarkan dalam bentuk scatterplot
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
theme_bw()
Berdasarkan pola hubungan yang terlihat pada scatterplot, kita akan mencoba untuk mencari model yang bisa merepresentasikan pola hubungan tersebut dengan baik.
1. Regresi Spline
#Menentukan banyaknya fungsi basis dan knots
dim(bs(triceps$age, knots = c(10, 20,40)))
## [1] 892 6
#atau
dim(bs(triceps$age, df=6))
## [1] 892 6
#nilai knots yang ditentukan oleh komputer
attr(bs(triceps$age, df=6),"knots")
## 25% 50% 75%
## 5.5475 12.2100 24.7275
2. b-spline
Menggunakan knot yang ditentukan manual
knot_value_manual_3 = c(10, 20,40)
mod_spline_1 = lm(triceps ~ bs(age, knots =knot_value_manual_3 ),data=triceps)
summary(mod_spline_1)
##
## Call:
## lm(formula = triceps ~ bs(age, knots = knot_value_manual_3),
## data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.385 -1.682 -0.393 1.165 22.855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.22533 0.61873 13.294 < 2e-16 ***
## bs(age, knots = knot_value_manual_3)1 -0.06551 1.14972 -0.057 0.955
## bs(age, knots = knot_value_manual_3)2 -4.30051 0.72301 -5.948 3.90e-09 ***
## bs(age, knots = knot_value_manual_3)3 7.80435 1.17793 6.625 6.00e-11 ***
## bs(age, knots = knot_value_manual_3)4 6.14890 1.27439 4.825 1.65e-06 ***
## bs(age, knots = knot_value_manual_3)5 5.56640 1.42225 3.914 9.78e-05 ***
## bs(age, knots = knot_value_manual_3)6 7.90178 1.54514 5.114 3.87e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.76 on 885 degrees of freedom
## Multiple R-squared: 0.4195, Adjusted R-squared: 0.4155
## F-statistic: 106.6 on 6 and 885 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~bs(x, knots = knot_value_manual_3),
lty = 1,se = F)
Menggunakan knot yang ditentukan fungsi komputer
knot_value_pc_df_6 = attr(bs(triceps$age, df=6),"knots")
mod_spline_1 = lm(triceps ~ bs(age, knots =knot_value_pc_df_6 ),data=triceps)
summary(mod_spline_1)
##
## Call:
## lm(formula = triceps ~ bs(age, knots = knot_value_pc_df_6), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0056 -1.7556 -0.2944 1.2008 23.0695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5925 0.8770 7.517 1.37e-13 ***
## bs(age, knots = knot_value_pc_df_6)1 3.7961 1.5015 2.528 0.01164 *
## bs(age, knots = knot_value_pc_df_6)2 -2.0749 0.8884 -2.335 0.01974 *
## bs(age, knots = knot_value_pc_df_6)3 1.5139 1.1645 1.300 0.19391
## bs(age, knots = knot_value_pc_df_6)4 11.6394 1.3144 8.855 < 2e-16 ***
## bs(age, knots = knot_value_pc_df_6)5 5.9680 1.5602 3.825 0.00014 ***
## bs(age, knots = knot_value_pc_df_6)6 8.9127 1.4053 6.342 3.60e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.757 on 885 degrees of freedom
## Multiple R-squared: 0.4206, Adjusted R-squared: 0.4167
## F-statistic: 107.1 on 6 and 885 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~bs(x, knots = knot_value_pc_df_6),
lty = 1,se = F)
3. Natural Spline
knot_value_manual_3 = c(10, 20,40)
mod_spline3ns = lm(triceps ~ ns(age, knots = knot_value_manual_3),data=triceps)
summary(mod_spline3ns)
##
## Call:
## lm(formula = triceps ~ ns(age, knots = knot_value_manual_3),
## data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7220 -1.7640 -0.3985 1.1908 22.9684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8748 0.3742 23.714 < 2e-16 ***
## ns(age, knots = knot_value_manual_3)1 7.0119 0.6728 10.422 < 2e-16 ***
## ns(age, knots = knot_value_manual_3)2 6.0762 0.8625 7.045 3.72e-12 ***
## ns(age, knots = knot_value_manual_3)3 2.0780 1.0632 1.954 0.051 .
## ns(age, knots = knot_value_manual_3)4 8.8616 0.9930 8.924 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.762 on 887 degrees of freedom
## Multiple R-squared: 0.4176, Adjusted R-squared: 0.415
## F-statistic: 159 on 4 and 887 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black")+
stat_smooth(method = "lm",
formula = y~ns(x, knots = knot_value_manual_3),
lty = 1,se=F)
4. Smoothing Splin
Fungsi smooth.spline() dapat digunakan untuk melakukan smoothing spline pada R
model_sms <- with(data = triceps,smooth.spline(age,triceps))
model_sms
## Call:
## smooth.spline(x = age, y = triceps)
##
## Smoothing Parameter spar= 0.5162704 lambda= 1.232259e-06 (13 iterations)
## Equivalent Degrees of Freedom (Df): 50.00639
## Penalized Criterion (RSS): 8591.581
## GCV: 13.77835
Fungsi smooth.spline secara otomatis memilih paramter lambda terbaik dengan menggunakan Cross-Validation (CV) dan Generalized Cross-Validation (GCV). Perbedaan utama antara keduanya adalah pada ukuran kebaikan model yang digunakan. Kalau CV menggunakan MSE sebagai ukuran kebaikan model, sedangkan GCV menggunakan Weighted-MSE. Banyaknya folds (lipatan) pada CV dan GCV ini adalah sejumlah obserbasinya.
pred_data <- broom::augment(model_sms)
ggplot(pred_data,aes(x=x,y=y))+
geom_point(alpha=0.55, color="black")+
geom_line(aes(y=.fitted),col="blue",
lty=1)+
xlab("age")+
ylab("triceps")+
theme_bw()
Pengaruh parameter lambda terhadap tingkat smoothness kurva bisa dilihat dari ilustrasi berikut:
model_sms_lambda <- data.frame(lambda=seq(0,5,by=0.5)) %>%
group_by(lambda) %>%
do(broom::augment(with(data = triceps,smooth.spline(age,triceps,lambda = .$lambda))))
p <- ggplot(model_sms_lambda,
aes(x=x,y=y))+
geom_line(aes(y=.fitted),
col="blue",
lty=1
)+
facet_wrap(~lambda)
p
Jika kita tentukan df=7, maka hasil kurva model smooth.spline akan lebih merepresentasikan data
model_sms <- with(data = triceps,smooth.spline(age,triceps,df=7))
model_sms
## Call:
## smooth.spline(x = age, y = triceps, df = 7)
##
## Smoothing Parameter spar= 1.049874 lambda= 0.008827582 (11 iterations)
## Equivalent Degrees of Freedom (Df): 7.000747
## Penalized Criterion (RSS): 10112.16
## GCV: 14.20355
pred_data <- broom::augment(model_sms)
ggplot(pred_data,aes(x=x,y=y))+
geom_point(alpha=0.55, color="black")+
geom_line(aes(y=.fitted),col="blue",
lty=1)+
xlab("age")+
ylab("triceps")+
theme_bw()
5. LOESS
Masih menggunakan data triceps seperti pada ilustrasi sebelumnya, kali ini kita akan mencoba melakukan pendekatan local regression dengan fungsi loess()
model_loess <- loess(triceps~age,
data = triceps)
summary(model_loess)
## Call:
## loess(formula = triceps ~ age, data = triceps)
##
## Number of Observations: 892
## Equivalent Number of Parameters: 4.6
## Residual Standard Error: 3.777
## Trace of smoother matrix: 5.01 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
Pengaruh parameter span terhadap tingkat smoothness kurva bisa dilihat dari ilustrasi berikut:
model_loess_span <- data.frame(span=seq(0.1,5,by=0.5)) %>%
group_by(span) %>%
do(broom::augment(loess(triceps~age,
data = triceps,span=.$span)))
p2 <- ggplot(model_loess_span,
aes(x=age,y=triceps))+
geom_line(aes(y=.fitted),
col="blue",
lty=1
)+
facet_wrap(~span)
p2
Pendekatan ini juga dapat dilakukan dengan fungsi stat_smooth() pada package ggplot2.
library(ggplot2)
ggplot(triceps, aes(age,triceps)) +
geom_point(alpha=0.5,color="black") +
stat_smooth(method='loess',
formula=y~x,
span = 0.75,
col="blue",
lty=1,
se=F)
Tuning span dapat dilakukan dengan menggunakan cross-validation:
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
span <- seq(0.1,1,length.out=50)
best_loess <- map_dfr(span, function(i){
metric_loess <- map_dfr(cross_val$splits,
function(x){
mod <- loess(triceps ~ age,span = i,
data=triceps[x$in_id,])
pred <- predict(mod,
newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps
data_eval <- na.omit(data.frame(pred=pred,
truth=truth))
rmse <- mlr3measures::rmse(truth = data_eval$truth,
response = data_eval$pred
)
mae <- mlr3measures::mae(truth = data_eval$truth,
response = data_eval$pred
)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
head(metric_loess,20)
# menghitung rata-rata untuk 10 folds
mean_metric_loess <- colMeans(metric_loess)
mean_metric_loess
}
)
best_loess <- cbind(span=span,best_loess)
# menampilkan hasil all breaks
best_loess
## span rmse mae
## 1 0.1000000 3.773822 2.495752
## 2 0.1183673 3.771342 2.493506
## 3 0.1367347 3.766450 2.487557
## 4 0.1551020 3.757385 2.480571
## 5 0.1734694 3.749539 2.473494
## 6 0.1918367 3.746288 2.470915
## 7 0.2102041 3.742741 2.467990
## 8 0.2285714 3.740397 2.465388
## 9 0.2469388 3.737724 2.463311
## 10 0.2653061 3.737105 2.462252
## 11 0.2836735 3.735645 2.460514
## 12 0.3020408 3.734522 2.459225
## 13 0.3204082 3.733298 2.457694
## 14 0.3387755 3.732538 2.456280
## 15 0.3571429 3.732082 2.455452
## 16 0.3755102 3.731261 2.454642
## 17 0.3938776 3.730882 2.454293
## 18 0.4122449 3.730585 2.454291
## 19 0.4306122 3.730351 2.454869
## 20 0.4489796 3.730717 2.456222
## 21 0.4673469 3.730831 2.456768
## 22 0.4857143 3.731387 2.458307
## 23 0.5040816 3.732091 2.460189
## 24 0.5224490 3.732392 2.461811
## 25 0.5408163 3.733609 2.464543
## 26 0.5591837 3.734393 2.467180
## 27 0.5775510 3.735554 2.470517
## 28 0.5959184 3.736679 2.473012
## 29 0.6142857 3.738227 2.476519
## 30 0.6326531 3.741401 2.476887
## 31 0.6510204 3.742660 2.479842
## 32 0.6693878 3.744336 2.483581
## 33 0.6877551 3.746651 2.488792
## 34 0.7061224 3.748708 2.492180
## 35 0.7244898 3.750734 2.495218
## 36 0.7428571 3.752815 2.498359
## 37 0.7612245 3.753998 2.504175
## 38 0.7795918 3.755981 2.511421
## 39 0.7979592 3.756794 2.514375
## 40 0.8163265 3.759456 2.522994
## 41 0.8346939 3.767892 2.543250
## 42 0.8530612 3.777168 2.557863
## 43 0.8714286 3.782407 2.566972
## 44 0.8897959 3.788428 2.578042
## 45 0.9081633 3.797002 2.595698
## 46 0.9265306 3.803259 2.609852
## 47 0.9448980 3.809478 2.623386
## 48 0.9632653 3.824192 2.655691
## 49 0.9816327 3.835596 2.679155
## 50 1.0000000 3.860514 2.718911
#berdasarkan rmse
best_loess %>% slice_min(rmse)
## span rmse mae
## 1 0.4306122 3.730351 2.454869
#berdasarkan mae
best_loess %>% slice_min(mae)
## span rmse mae
## 1 0.4122449 3.730585 2.454291
library(ggplot2)
ggplot(triceps, aes(age,triceps)) +
geom_point(alpha=0.5,color="black") +
stat_smooth(method='loess',
formula=y~x,
span = 0.4306122,
col="blue",
lty=1,
se=F)
LATIHAN
#install.packages("ISLR")
library(ISLR)
AutoData = Auto %>% select(mpg, horsepower, origin)
tibble(AutoData)
## # A tibble: 392 x 3
## mpg horsepower origin
## <dbl> <dbl> <dbl>
## 1 18 130 1
## 2 15 165 1
## 3 18 150 1
## 4 16 150 1
## 5 17 140 1
## 6 15 198 1
## 7 14 220 1
## 8 14 215 1
## 9 14 225 1
## 10 15 190 1
## # ... with 382 more rows
X <- AutoData$horsepower
Y <- AutoData$mpg
1. Regresi Linier
lin.mod <-lm( Y~X)
plot(X,Y)
lines(X,lin.mod$fitted.values,col="red")
2. Polynomial
pol.mod <- lm( Y~X+I(X^2))
ix <- sort(X,index.return=T)$ix
plot(X,Y)
lines(X[ix], pol.mod$fitted.values[ix],col="blue", cex=2)
summary(pol.mod)
##
## Call:
## lm(formula = Y ~ X + I(X^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## X -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(X^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
3. Fungsi Tangga
#regresi fungsi tangga
range(X)
## [1] 46 230
step.mod = lm(Y ~ cut(X,5))
summary(step.mod)
##
## Call:
## lm(formula = Y ~ cut(X, 5))
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.3033 -3.0220 -0.5413 2.4074 16.6394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.3033 0.4272 73.27 <2e-16 ***
## cut(X, 5)(82.8,120] -8.2813 0.5642 -14.68 <2e-16 ***
## cut(X, 5)(120,156] -15.2427 0.7210 -21.14 <2e-16 ***
## cut(X, 5)(156,193] -17.5922 1.0036 -17.53 <2e-16 ***
## cut(X, 5)(193,230] -18.5340 1.3767 -13.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.719 on 387 degrees of freedom
## Multiple R-squared: 0.6382, Adjusted R-squared: 0.6345
## F-statistic: 170.7 on 4 and 387 DF, p-value: < 2.2e-16
ggplot(AutoData,aes(x= X, y=Y)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = Y~cut(X,5),
lty = 1, col = "blue",se = F)+
theme_bw()
## Warning: 'newdata' had 80 rows but variables found have 392 rows
## Warning: Computation failed in `stat_smooth()`:
## arguments imply differing number of rows: 80, 392
4. Komparasi Model
nilai_AIC <- rbind(AIC(lin.mod),
AIC(pol.mod),
AIC(step.mod))
nama_model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)")
data.frame(nama_model,nilai_AIC)
## nama_model nilai_AIC
## 1 Linear 2363.324
## 2 Poly (ordo=2) 2274.354
## 3 Tangga (breaks=3) 2335.820
Berdasarkan komperasi model dapat dilihat bahwa model terbaik untuk data yang berasal dari package ISLR adalah model Tangga dengan breaks 3 dengan nilai AIC sebesar 2335.820. Nilai AIC pada model Tangga breaks 3 paling kecil.