Tugas Prak PSD

Berliana Apriyanti

2022-11-04

STA1381: Nonlinear Regression Part 1

A. Library

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6      v purrr   0.3.4 
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.1      v stringr 1.4.1 
## v readr   2.1.3      v forcats 0.5.2 
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)
library(purrr)
library(rsample)

B. Contoh Pada Kuliah

1. Eksplorasi Data

set.seed(123)
data.x <- rnorm(1000,3,1)
err <- rnorm(1000)
y <- 5+2*data.x+3*data.x^2+err
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70))

Data diatas merupakan data bangkitan dengan 1000 amatan yang menyebar normal dengan mean sebesar 3 dan ragam 1.

2. Regresi Linier

lin.mod <-lm( y~data.x)
plot( data.x,y,xlim=c( -2,5), ylim=c( -10,70))
lines(data.x,lin.mod$fitted.values,col="red")

summary(lin.mod)
## 
## Call:
## lm(formula = y ~ data.x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.686 -2.574 -1.428  1.195 27.185 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.1524     0.4254  -47.38   <2e-16 ***
## data.x       20.3790     0.1340  152.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.2 on 998 degrees of freedom
## Multiple R-squared:  0.9586, Adjusted R-squared:  0.9586 
## F-statistic: 2.314e+04 on 1 and 998 DF,  p-value: < 2.2e-16

3. Polynomial

pol.mod <- lm( y~data.x+I(data.x^2))
ix <- sort( data.x,index.return=T)$ix
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70))
lines(data.x[ix], pol.mod$fitted.values[ix],col="blue", cex=2)

summary(pol.mod)
## 
## Call:
## lm(formula = y ~ data.x + I(data.x^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0319 -0.6942  0.0049  0.7116  3.2855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.70052    0.21941   21.42   <2e-16 ***
## data.x       2.14409    0.14611   14.67   <2e-16 ***
## I(data.x^2)  2.99081    0.02338  127.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 997 degrees of freedom
## Multiple R-squared:  0.9976, Adjusted R-squared:  0.9976 
## F-statistic: 2.094e+05 on 2 and 997 DF,  p-value: < 2.2e-16

4. Fungsi Tangga

step.mod = lm(y ~ cut(data.x,5))
summary(step.mod)
## 
## Call:
## lm(formula = y ~ cut(data.x, 5))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1608  -5.6146  -0.4399   4.7188  31.4947 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                10.4596     0.9519   10.99   <2e-16 ***
## cut(data.x, 5)(1.4,2.61]   12.4980     1.0418   12.00   <2e-16 ***
## cut(data.x, 5)(2.61,3.82]  31.5492     1.0072   31.32   <2e-16 ***
## cut(data.x, 5)(3.82,5.03]  57.8323     1.0869   53.21   <2e-16 ***
## cut(data.x, 5)(5.03,6.25]  92.2643     1.6801   54.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.059 on 995 degrees of freedom
## Multiple R-squared:  0.8835, Adjusted R-squared:  0.883 
## F-statistic:  1886 on 4 and 995 DF,  p-value: < 2.2e-16

5. Komparasi Model

nilai_AIC <- rbind(AIC(lin.mod),
                   AIC(pol.mod),
                   AIC(step.mod))

nama_model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)")
data.frame(nama_model,nilai_AIC)
##          nama_model nilai_AIC
## 1            Linear  5711.836
## 2     Poly (ordo=2)  2856.470
## 3 Tangga (breaks=3)  6753.590

Berdasarkan nilai AIC diatas, maka dapat ditentukan bahwa model terbaik adalah model Poly dengan ordo 2 karena memiliki nilai AIC paling kecil diantara model lainnya.

C. Contoh Data TRICEPS

Data yang digunakan untuk ilustrasi berasal dari studi antropometri terhadap 892 perempuan di bawah 50 tahun di tiga desa Gambia di Afrika Barat. Data terdiri dari 3 Kolom yaitu Age, Intriceps dan tricepts. Berikut adalah penjelasannya pada masing-masing kolom:

age : umur responden

Intriceps : logaritma dari ketebalan lipatan kulit triceps

triceps: ketebalan lipatan kulit triceps Lipatan kulit trisep diperlukan untuk menghitung lingkar otot lengan atas.

Ketebalannya memberikan informasi tentang cadangan lemak tubuh, sedangkan massa otot yang dihitung memberikan informasi tentang cadangan protein

data("triceps", package="MultiKink")

Jika kita gambarkan dalam bentuk scatterplot, maka :

ggplot(triceps,aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") + 
      
             theme_bw() 

Berdasarkan pola hubungan yang terlihat pada scatterplot, kita akan mencoba untuk mencari model yang bisa merepresentasikan pola hubungan tersebut dengan baik.

1. Regresi Linear

mod_linear = lm(triceps~age,data=triceps)
summary(mod_linear)
## 
## Call:
## lm(formula = triceps ~ age, data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9512  -2.3965  -0.5154   1.5822  25.1233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.19717    0.21244   29.17   <2e-16 ***
## age          0.21584    0.01014   21.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.007 on 890 degrees of freedom
## Multiple R-squared:  0.3372, Adjusted R-squared:  0.3365 
## F-statistic: 452.8 on 1 and 890 DF,  p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") +
   stat_smooth(method = "lm", 
               formula = y~x,lty = 1,
               col = "blue",se = F)+
  theme_bw()

2. Regresi Polinomial Derajat 2 (ordo 2)

mod_polinomial2 = lm(triceps ~ poly(age,2,raw = T),
                     data=triceps)
summary(mod_polinomial2)
## 
## Call:
## lm(formula = triceps ~ poly(age, 2, raw = T), data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5677  -2.4401  -0.4587   1.6368  24.9961 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             6.0229191  0.3063806  19.658  < 2e-16 ***
## poly(age, 2, raw = T)1  0.2434733  0.0364403   6.681 4.17e-11 ***
## poly(age, 2, raw = T)2 -0.0006257  0.0007926  -0.789     0.43    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.008 on 889 degrees of freedom
## Multiple R-squared:  0.3377, Adjusted R-squared:  0.3362 
## F-statistic: 226.6 on 2 and 889 DF,  p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) + 
  geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~poly(x,2,raw=T), 
               lty = 1, col = "blue",se = F)+
  theme_bw()

ggplot(triceps,aes(x=age, y=triceps)) + 
  geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~poly(x,2,raw=T), 
               lty = 1, col = "blue",se = T)+
  theme_bw()

3.Regresi Polinomial Derajat 3 (ordo 3)

mod_polinomial3 = lm(triceps ~ poly(age,3,raw = T),data=triceps)
summary(mod_polinomial3)
## 
## Call:
## lm(formula = triceps ~ poly(age, 3, raw = T), data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5832  -1.9284  -0.5415   1.3283  24.4440 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8.004e+00  3.831e-01  20.893  < 2e-16 ***
## poly(age, 3, raw = T)1 -3.157e-01  7.721e-02  -4.089 4.73e-05 ***
## poly(age, 3, raw = T)2  3.101e-02  3.964e-03   7.824 1.45e-14 ***
## poly(age, 3, raw = T)3 -4.566e-04  5.612e-05  -8.135 1.38e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.868 on 888 degrees of freedom
## Multiple R-squared:  0.3836, Adjusted R-squared:  0.3815 
## F-statistic: 184.2 on 3 and 888 DF,  p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) + 
  geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
              formula = y~poly(x,3,raw=T), 
              lty = 1, col = "blue",se = T)+
  theme_bw()

4.Regresi Fungsi Tangga (5)

mod_tangga5 = lm(triceps ~ cut(age,5),data=triceps)
summary(mod_tangga5)
## 
## Call:
## lm(formula = triceps ~ cut(age, 5), data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5474  -2.0318  -0.4465   1.3682  23.3759 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.2318     0.1994  36.260  < 2e-16 ***
## cut(age, 5)(10.6,20.9]   1.6294     0.3244   5.023 6.16e-07 ***
## cut(age, 5)(20.9,31.2]   5.9923     0.4222  14.192  < 2e-16 ***
## cut(age, 5)(31.2,41.5]   7.5156     0.4506  16.678  < 2e-16 ***
## cut(age, 5)(41.5,51.8]   7.4561     0.5543  13.452  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.939 on 887 degrees of freedom
## Multiple R-squared:  0.3617, Adjusted R-squared:  0.3588 
## F-statistic: 125.7 on 4 and 887 DF,  p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
  geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,5), 
              lty = 1, col = "blue",se = F)+
  theme_bw()

5. Regresi Fungsi Tangga (7)

mod_tangga7 = lm(triceps ~ cut(age,7),data=triceps)
summary(mod_tangga7)
## 
## Call:
## lm(formula = triceps ~ cut(age, 7), data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8063  -1.7592  -0.4366   1.2894  23.1461 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.5592     0.2219  34.060  < 2e-16 ***
## cut(age, 7)(7.62,15]    -0.6486     0.3326  -1.950   0.0515 .  
## cut(age, 7)(15,22.3]     3.4534     0.4239   8.146 1.27e-15 ***
## cut(age, 7)(22.3,29.7]   5.8947     0.4604  12.804  < 2e-16 ***
## cut(age, 7)(29.7,37]     7.8471     0.5249  14.949  < 2e-16 ***
## cut(age, 7)(37,44.4]     6.9191     0.5391  12.835  < 2e-16 ***
## cut(age, 7)(44.4,51.8]   6.3013     0.6560   9.606  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.805 on 885 degrees of freedom
## Multiple R-squared:  0.4055, Adjusted R-squared:  0.4014 
## F-statistic: 100.6 on 6 and 885 DF,  p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
  geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
              formula = y~cut(x,7), 
              lty = 1, col = "blue",se = F)+
  theme_bw()

6. Perbandingan Model

MSE = function(pred,actual){
  mean((pred-actual)^2)
}

7. Membandingkan model

nilai_MSE <- rbind(MSE(predict(mod_linear),triceps$triceps),
                   MSE(predict(mod_polinomial2),triceps$triceps),
                   MSE(predict(mod_polinomial3),triceps$triceps),
                   MSE(predict(mod_tangga5),triceps$triceps),
                   MSE(predict(mod_tangga7),triceps$triceps))
nama_model <- c("Linear","Poly (ordo=2)", "Poly (ordo=3)",
                "Tangga (breaks=5)", "Tangga (breaks=7)")
data.frame(nama_model,nilai_MSE)
##          nama_model nilai_MSE
## 1            Linear  16.01758
## 2     Poly (ordo=2)  16.00636
## 3     Poly (ordo=3)  14.89621
## 4 Tangga (breaks=5)  15.42602
## 5 Tangga (breaks=7)  14.36779

Berdasarkan perbandingan model diatas, dapat dilihat bahwa nilai MSE terkecil ada pada model tangga dengan breaks 7 dengan nilai 14.36779.

D.Evaluasi Model Menggunakan Cross Validation

1. Regresi Linear

library(mlr3measures)
## In order to avoid name clashes, do not attach 'mlr3measures'. Instead, only load the namespace with `requireNamespace("mlrmeasures")` and access the measures directly via `::`, e.g. `mlr3measures::auc()`.
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
metric_linear <- map_dfr(cross_val$splits,
                         function(x){
                           mod <- lm(triceps ~ age,data=triceps[x$in_id,])
                           pred <- predict(mod,newdata=triceps[-x$in_id,])
                           truth <- triceps[-x$in_id,]$triceps
                           rmse <- mlr3measures::rmse(truth = truth,response = pred)
                           mae <- mlr3measures::mae(truth = truth,response = pred)
                           metric <- c(rmse,mae)
                           names(metric) <- c("rmse","mae")
                           return(metric)
                         }
)
metric_linear
## # A tibble: 10 x 2
##     rmse   mae
##    <dbl> <dbl>
##  1  3.65  2.82
##  2  4.62  3.22
##  3  4.38  3.00
##  4  3.85  2.80
##  5  3.08  2.36
##  6  3.83  2.81
##  7  3.59  2.78
##  8  4.66  3.06
##  9  3.50  2.56
## 10  4.59  2.93
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear
##     rmse      mae 
## 3.973249 2.833886

2. Polynomial Derajat 2 (ordo 2)

set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
metric_poly2 <- map_dfr(cross_val$splits,
                        function(x){
                          mod <- lm(triceps ~ poly(age,2,raw = T),data=triceps[x$in_id,])
                          pred <- predict(mod,newdata=triceps[-x$in_id,])
                          truth <- triceps[-x$in_id,]$triceps
                          rmse <- mlr3measures::rmse(truth = truth,response = pred)
                          mae <- mlr3measures::mae(truth = truth,response = pred)
                          metric <- c(rmse,mae)
                          names(metric) <- c("rmse","mae")
                          return(metric)
                        }
)
metric_poly2
## # A tibble: 10 x 2
##     rmse   mae
##    <dbl> <dbl>
##  1  3.64  2.82
##  2  4.62  3.26
##  3  4.39  3.02
##  4  3.85  2.81
##  5  3.10  2.38
##  6  3.82  2.81
##  7  3.62  2.83
##  8  4.65  3.07
##  9  3.50  2.57
## 10  4.59  2.94
# menghitung rata-rata untuk 10 folds
mean_metric_poly2 <- colMeans(metric_poly2)
mean_metric_poly2
##     rmse      mae 
## 3.977777 2.851787

3. Polynomial Derajat 3 (ordo 3)

set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")

metric_poly3 <- map_dfr(cross_val$splits,
                        function(x){
                          mod <- lm(triceps ~ poly(age,3,raw = T),data=triceps[x$in_id,])
                          pred <- predict(mod,newdata=triceps[-x$in_id,])
                          truth <- triceps[-x$in_id,]$triceps
                          rmse <- mlr3measures::rmse(truth = truth,response = pred)
                          mae <- mlr3measures::mae(truth = truth,response = pred)
                          metric <- c(rmse,mae)
                          names(metric) <- c("rmse","mae")
                          return(metric)
                        }
)
metric_poly3
## # A tibble: 10 x 2
##     rmse   mae
##    <dbl> <dbl>
##  1  3.49  2.60
##  2  4.48  2.99
##  3  4.21  2.85
##  4  4.02  2.75
##  5  3.03  2.09
##  6  3.63  2.59
##  7  3.53  2.52
##  8  4.54  2.91
##  9  3.27  2.33
## 10  4.27  2.68
# menghitung rata-rata untuk 10 folds
mean_metric_poly3 <- colMeans(metric_poly3)
mean_metric_poly3
##     rmse      mae 
## 3.845976 2.632125

4. Fungsi Tangga

set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
breaks <- 3:10
best_tangga <- map_dfr(breaks, function(i){
  metric_tangga <- map_dfr(cross_val$splits,
                           function(x){
                             training <- triceps[x$in_id,]
                             training$age <- cut(training$age,i)
                             mod <- lm(triceps ~ age,data=training)
                             labs_x <- levels(mod$model[,2])
                             labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                                                    upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))
                             testing <- triceps[-x$in_id,]
                             age_new <- cut(testing$age,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
                             pred <- predict(mod,newdata=list(age=age_new))
                             truth <- testing$triceps
                             data_eval <- na.omit(data.frame(truth,pred))
                             rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
                             mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
                             metric <- c(rmse,mae)
                             names(metric) <- c("rmse","mae")
                             return(metric)
                           }
  )
  metric_tangga
  # menghitung rata-rata untuk 10 folds
  mean_metric_tangga <- colMeans(metric_tangga)
  mean_metric_tangga
}
)

best_tangga <- cbind(breaks=breaks,best_tangga)
# menampilkan hasil all breaks
best_tangga
##   breaks     rmse      mae
## 1      3 3.835357 2.618775
## 2      4 3.882932 2.651911
## 3      5 3.917840 2.724368
## 4      6 3.836068 2.622939
## 5      7 3.789715 2.555062
## 6      8 3.812789 2.555563
## 7      9 3.781720 2.518706
## 8     10 3.795877 2.529479
#berdasarkan rmse
best_tangga %>% slice_min(rmse)
##   breaks    rmse      mae
## 1      9 3.78172 2.518706
#berdasarkan mae
best_tangga %>% slice_min(mae)
##   breaks    rmse      mae
## 1      9 3.78172 2.518706
#Perbandingan Model
nilai_metric <- rbind(mean_metric_linear,
                      mean_metric_poly2,
                      mean_metric_poly3,
                      best_tangga %>% select(-1) %>% slice_min(mae))

nama_model <- c("Linear","Poly2","Poly3","Tangga (breaks=9)")
data.frame(nama_model,nilai_metric)
##          nama_model     rmse      mae
## 1            Linear 3.973249 2.833886
## 2             Poly2 3.977777 2.851787
## 3             Poly3 3.845976 2.632125
## 4 Tangga (breaks=9) 3.781720 2.518706

STA1381: Nonlinear Regression Part 2

A. Library

library(ggplot2)
library(dplyr)
library(purrr)
library(rsample)
library(splines)

B. Contoh Pada Kuliah

set.seed(123)
data.x <- rnorm(1000,1,1)
err <- rnorm(1000,0,5)
y <- 5+2*data.x+3*data.x^2+err
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="orange")
abline(v=1, col="red", lty=2)

Data diatas merupakan data bangkitan dengan 1000 amatan yang menyebar normal dengan mean sebesar 1 dan ragam 1.

1. Piecewise cubic polynomial

dt.all <- cbind(y,data.x)

##knots=1
dt1 <- dt.all[data.x <=1,]
dim(dt1)
## [1] 495   2
dt2 <- dt.all[data.x >1,]
dim(dt2)
## [1] 505   2
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="orange")

cub.mod1 <- lm(dt1[,1]~dt1[,2]+I(dt1[,2]^2)+I(dt1[,2]^3))
ix <- sort(dt1[,2], index.return=T)$ix
lines(dt1[ix,2],cub.mod1$fitted.values[ix],col="blue", lwd=2)

cub.mod2 <- lm(dt2[,1]~dt2[,2]+I(dt2[,2]^2)+I(dt2[,2]^3))
ix <- sort(dt2[,2], index.return=T)$ix
lines(dt2[ix,2],cub.mod2$fitted.values[ix],col="blue", lwd=2)

2. Truncated power basis

#dengan menggunakan truncated power basis
plot(data.x,y,xlim=c( -2,5), ylim=c( -10,70), pch=16,col="orange")
abline(v=1,col="red", lty=2)

hx <- ifelse( data.x>1,(data.x-1)^3,0)

cubspline.mod <- lm( y~data.x+I(data.x^2)+I(data.x^3)+hx)
ix <- sort( data.x,index.return=T)$ix
lines( data.x[ix], cubspline.mod$fitted.values[ix],col="green", lwd=2)

3. Perbandingan dengan k-fold CV

plot( data.x,y,xlim=c(-2,5), ylim=c( -10,70), pch=16,col="orange")
abline(v=0,col="red", lty=2)
abline(v=2,col="red", lty=2)

hx1 <- ifelse( data.x>0,(data.x-0)^3,0)
hx2 <- ifelse( data.x>2,(data.x-2)^3,0)

cubspline.mod2 <- lm( y~data.x+I(data.x^2)+I(data.x^3)+hx1+hx2)
ix <- sort( data.x,index.return=T)$ix
lines( data.x[ix],cubspline.mod2$fitted.values[ix],col="green", lwd=2)

4. Perbandingan 1 knot dan 2 knot dengan 5-fold CV

##1 knot
data.all <- cbind( y,data.x,hx)
set.seed(456)
ind <- sample(1:5,length( data.x),replace=T)
res <- c()
for( i in 1:5){
  dt.train <- data.all[ ind!=i,]
  x.test <- data.all[ ind==i, -1]
  y.test <- data.all[ ind==i,1]
  mod1 <- lm( dt.train[,1]~dt.train[,2]+I( dt.train[,2]^2)+I( dt.train[,2]^3)+dt.train[,3])
  x.test.olah <- cbind(1,x.test[,1], x.test[,1]^2,x.test[,1]^3,x.test[,2])
  beta <- coefficients(mod1)
  prediksi <- x.test.olah%*%beta
  res <- c( res,mean(( y.test-prediksi)^2))
}
res
## [1] 22.08767 21.39598 29.47677 29.68683 25.18070
mean(res)
## [1] 25.56559
##2 knot (knot = 0, 2)
data.all2 <- cbind(y,data.x,hx1,hx2)
set.seed(456)
ind2 <- sample(1:5,length( data.x),replace=T)
res2 <- c()
for( i in 1:5){
  dt.train2 <- data.all2[ind2!=i,]
  x.test2 <- data.all2[ind2==i,-1]
  y.test2 <- data.all2[ind2==i,1]
  mod2 <- lm(dt.train2[,1]~dt.train2[,2]+I(dt.train2[,2]^2)+I(dt.train2[,2]^3)+dt.train2[,3]+dt.train2[,4])
  x.test.olah2 <- cbind(1,x.test2[,1],x.test2[,1]^2,x.test2[,1]^3,x.test2[,2],x.test2[,3])
  beta2 <- coefficients(mod2)
  prediksi2 <- x.test.olah2%*%beta2
  res2 <- c(res2,mean((y.test2-prediksi2)^2))
}
res2
## [1] 22.32898 21.33868 29.40459 29.79913 25.22047
mean(res2)
## [1] 25.61837

5. Smoothing Spline

ss1 <- smooth.spline(data.x,y,all.knots = T)
plot(data.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16,col="orange")
lines(ss1,col="purple", lwd=2)

C. Contoh data TRICEPS

Data yang digunakan untuk ilustrasi berasal dari studi antropometri terhadap 892 perempuan di bawah 50 tahun di tiga desa Gambia di Afrika Barat. Data terdiri dari 3 Kolom yaitu Age, Intriceps dan tricepts. Berikut adalah penjelasannya pada masing-masing kolom:

age : umur responden

Intriceps : logaritma dari ketebalan lipatan kulit triceps

triceps: ketebalan lipatan kulit triceps Lipatan kulit trisep diperlukan untuk menghitung lingkar otot lengan atas.

``

Ketebalannya memberikan informasi tentang cadangan lemak tubuh, sedangkan massa otot yang dihitung memberikan informasi tentang cadangan protein

data("triceps", package="MultiKink")

Jika kita gambarkan dalam bentuk scatterplot

ggplot(triceps,aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") + 
                 theme_bw() 

Berdasarkan pola hubungan yang terlihat pada scatterplot, kita akan mencoba untuk mencari model yang bisa merepresentasikan pola hubungan tersebut dengan baik.

1. Regresi Spline

#Menentukan banyaknya fungsi basis dan knots
dim(bs(triceps$age, knots = c(10, 20,40)))
## [1] 892   6
   #atau
dim(bs(triceps$age, df=6))
## [1] 892   6
#nilai knots yang ditentukan oleh komputer
attr(bs(triceps$age, df=6),"knots")
##     25%     50%     75% 
##  5.5475 12.2100 24.7275

2. b-spline

Menggunakan knot yang ditentukan manual

knot_value_manual_3 = c(10, 20,40)
mod_spline_1 = lm(triceps ~ bs(age, knots =knot_value_manual_3 ),data=triceps)
summary(mod_spline_1)
## 
## Call:
## lm(formula = triceps ~ bs(age, knots = knot_value_manual_3), 
##     data = triceps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.385  -1.682  -0.393   1.165  22.855 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            8.22533    0.61873  13.294  < 2e-16 ***
## bs(age, knots = knot_value_manual_3)1 -0.06551    1.14972  -0.057    0.955    
## bs(age, knots = knot_value_manual_3)2 -4.30051    0.72301  -5.948 3.90e-09 ***
## bs(age, knots = knot_value_manual_3)3  7.80435    1.17793   6.625 6.00e-11 ***
## bs(age, knots = knot_value_manual_3)4  6.14890    1.27439   4.825 1.65e-06 ***
## bs(age, knots = knot_value_manual_3)5  5.56640    1.42225   3.914 9.78e-05 ***
## bs(age, knots = knot_value_manual_3)6  7.90178    1.54514   5.114 3.87e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.76 on 885 degrees of freedom
## Multiple R-squared:  0.4195, Adjusted R-squared:  0.4155 
## F-statistic: 106.6 on 6 and 885 DF,  p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~bs(x, knots = knot_value_manual_3), 
               lty = 1,se = F)

Menggunakan knot yang ditentukan fungsi komputer

knot_value_pc_df_6 = attr(bs(triceps$age, df=6),"knots")
mod_spline_1 = lm(triceps ~ bs(age, knots =knot_value_pc_df_6 ),data=triceps)
summary(mod_spline_1)
## 
## Call:
## lm(formula = triceps ~ bs(age, knots = knot_value_pc_df_6), data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0056  -1.7556  -0.2944   1.2008  23.0695 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            6.5925     0.8770   7.517 1.37e-13 ***
## bs(age, knots = knot_value_pc_df_6)1   3.7961     1.5015   2.528  0.01164 *  
## bs(age, knots = knot_value_pc_df_6)2  -2.0749     0.8884  -2.335  0.01974 *  
## bs(age, knots = knot_value_pc_df_6)3   1.5139     1.1645   1.300  0.19391    
## bs(age, knots = knot_value_pc_df_6)4  11.6394     1.3144   8.855  < 2e-16 ***
## bs(age, knots = knot_value_pc_df_6)5   5.9680     1.5602   3.825  0.00014 ***
## bs(age, knots = knot_value_pc_df_6)6   8.9127     1.4053   6.342 3.60e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.757 on 885 degrees of freedom
## Multiple R-squared:  0.4206, Adjusted R-squared:  0.4167 
## F-statistic: 107.1 on 6 and 885 DF,  p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = y~bs(x, knots = knot_value_pc_df_6), 
               lty = 1,se = F)

3. Natural Spline

knot_value_manual_3 = c(10, 20,40)
mod_spline3ns = lm(triceps ~ ns(age, knots = knot_value_manual_3),data=triceps)
summary(mod_spline3ns)
## 
## Call:
## lm(formula = triceps ~ ns(age, knots = knot_value_manual_3), 
##     data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7220  -1.7640  -0.3985   1.1908  22.9684 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             8.8748     0.3742  23.714  < 2e-16 ***
## ns(age, knots = knot_value_manual_3)1   7.0119     0.6728  10.422  < 2e-16 ***
## ns(age, knots = knot_value_manual_3)2   6.0762     0.8625   7.045 3.72e-12 ***
## ns(age, knots = knot_value_manual_3)3   2.0780     1.0632   1.954    0.051 .  
## ns(age, knots = knot_value_manual_3)4   8.8616     0.9930   8.924  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.762 on 887 degrees of freedom
## Multiple R-squared:  0.4176, Adjusted R-squared:  0.415 
## F-statistic:   159 on 4 and 887 DF,  p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black")+
      stat_smooth(method = "lm", 
                 formula = y~ns(x, knots = knot_value_manual_3), 
                 lty = 1,se=F)

4. Smoothing Splin

Fungsi smooth.spline() dapat digunakan untuk melakukan smoothing spline pada R

model_sms <- with(data = triceps,smooth.spline(age,triceps))
model_sms 
## Call:
## smooth.spline(x = age, y = triceps)
## 
## Smoothing Parameter  spar= 0.5162704  lambda= 1.232259e-06 (13 iterations)
## Equivalent Degrees of Freedom (Df): 50.00639
## Penalized Criterion (RSS): 8591.581
## GCV: 13.77835

Fungsi smooth.spline secara otomatis memilih paramter lambda terbaik dengan menggunakan Cross-Validation (CV) dan Generalized Cross-Validation (GCV). Perbedaan utama antara keduanya adalah pada ukuran kebaikan model yang digunakan. Kalau CV menggunakan MSE sebagai ukuran kebaikan model, sedangkan GCV menggunakan Weighted-MSE. Banyaknya folds (lipatan) pada CV dan GCV ini adalah sejumlah obserbasinya.

pred_data <- broom::augment(model_sms)

ggplot(pred_data,aes(x=x,y=y))+
  geom_point(alpha=0.55, color="black")+
  geom_line(aes(y=.fitted),col="blue",
            lty=1)+
  xlab("age")+
  ylab("triceps")+
  theme_bw()

Pengaruh parameter lambda terhadap tingkat smoothness kurva bisa dilihat dari ilustrasi berikut:

model_sms_lambda <- data.frame(lambda=seq(0,5,by=0.5)) %>% 
  group_by(lambda) %>% 
  do(broom::augment(with(data = triceps,smooth.spline(age,triceps,lambda = .$lambda))))

p <- ggplot(model_sms_lambda,
       aes(x=x,y=y))+
  geom_line(aes(y=.fitted),
            col="blue",
            lty=1
            )+
  facet_wrap(~lambda)
p

Jika kita tentukan df=7, maka hasil kurva model smooth.spline akan lebih merepresentasikan data

model_sms <- with(data = triceps,smooth.spline(age,triceps,df=7))
model_sms 
## Call:
## smooth.spline(x = age, y = triceps, df = 7)
## 
## Smoothing Parameter  spar= 1.049874  lambda= 0.008827582 (11 iterations)
## Equivalent Degrees of Freedom (Df): 7.000747
## Penalized Criterion (RSS): 10112.16
## GCV: 14.20355
pred_data <- broom::augment(model_sms)

ggplot(pred_data,aes(x=x,y=y))+
  geom_point(alpha=0.55, color="black")+
  geom_line(aes(y=.fitted),col="blue",
            lty=1)+
  xlab("age")+
  ylab("triceps")+
  theme_bw()

5. LOESS

Masih menggunakan data triceps seperti pada ilustrasi sebelumnya, kali ini kita akan mencoba melakukan pendekatan local regression dengan fungsi loess()

model_loess <- loess(triceps~age,
                     data = triceps)
summary(model_loess)
## Call:
## loess(formula = triceps ~ age, data = triceps)
## 
## Number of Observations: 892 
## Equivalent Number of Parameters: 4.6 
## Residual Standard Error: 3.777 
## Trace of smoother matrix: 5.01  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

Pengaruh parameter span terhadap tingkat smoothness kurva bisa dilihat dari ilustrasi berikut:

model_loess_span <- data.frame(span=seq(0.1,5,by=0.5)) %>% 
  group_by(span) %>% 
  do(broom::augment(loess(triceps~age,
                     data = triceps,span=.$span)))

p2 <- ggplot(model_loess_span,
       aes(x=age,y=triceps))+
  geom_line(aes(y=.fitted),
            col="blue",
            lty=1
            )+
  facet_wrap(~span)
p2

Pendekatan ini juga dapat dilakukan dengan fungsi stat_smooth() pada package ggplot2.

library(ggplot2)
ggplot(triceps, aes(age,triceps)) +
  geom_point(alpha=0.5,color="black") +
  stat_smooth(method='loess',
               formula=y~x,
              span = 0.75,
              col="blue",
              lty=1,
              se=F)

Tuning span dapat dilakukan dengan menggunakan cross-validation:

set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")

span <- seq(0.1,1,length.out=50)

best_loess <- map_dfr(span, function(i){
metric_loess <- map_dfr(cross_val$splits,
    function(x){
mod <- loess(triceps ~ age,span = i,
         data=triceps[x$in_id,])
pred <- predict(mod,
               newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps

data_eval <- na.omit(data.frame(pred=pred,
                                truth=truth))


rmse <- mlr3measures::rmse(truth = data_eval$truth,
                           response = data_eval$pred
                           )
mae <- mlr3measures::mae(truth = data_eval$truth,
                           response = data_eval$pred
                           )
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)

head(metric_loess,20)

# menghitung rata-rata untuk 10 folds
mean_metric_loess <- colMeans(metric_loess)

mean_metric_loess
}
)

best_loess <- cbind(span=span,best_loess)
# menampilkan hasil all breaks
best_loess
##         span     rmse      mae
## 1  0.1000000 3.773822 2.495752
## 2  0.1183673 3.771342 2.493506
## 3  0.1367347 3.766450 2.487557
## 4  0.1551020 3.757385 2.480571
## 5  0.1734694 3.749539 2.473494
## 6  0.1918367 3.746288 2.470915
## 7  0.2102041 3.742741 2.467990
## 8  0.2285714 3.740397 2.465388
## 9  0.2469388 3.737724 2.463311
## 10 0.2653061 3.737105 2.462252
## 11 0.2836735 3.735645 2.460514
## 12 0.3020408 3.734522 2.459225
## 13 0.3204082 3.733298 2.457694
## 14 0.3387755 3.732538 2.456280
## 15 0.3571429 3.732082 2.455452
## 16 0.3755102 3.731261 2.454642
## 17 0.3938776 3.730882 2.454293
## 18 0.4122449 3.730585 2.454291
## 19 0.4306122 3.730351 2.454869
## 20 0.4489796 3.730717 2.456222
## 21 0.4673469 3.730831 2.456768
## 22 0.4857143 3.731387 2.458307
## 23 0.5040816 3.732091 2.460189
## 24 0.5224490 3.732392 2.461811
## 25 0.5408163 3.733609 2.464543
## 26 0.5591837 3.734393 2.467180
## 27 0.5775510 3.735554 2.470517
## 28 0.5959184 3.736679 2.473012
## 29 0.6142857 3.738227 2.476519
## 30 0.6326531 3.741401 2.476887
## 31 0.6510204 3.742660 2.479842
## 32 0.6693878 3.744336 2.483581
## 33 0.6877551 3.746651 2.488792
## 34 0.7061224 3.748708 2.492180
## 35 0.7244898 3.750734 2.495218
## 36 0.7428571 3.752815 2.498359
## 37 0.7612245 3.753998 2.504175
## 38 0.7795918 3.755981 2.511421
## 39 0.7979592 3.756794 2.514375
## 40 0.8163265 3.759456 2.522994
## 41 0.8346939 3.767892 2.543250
## 42 0.8530612 3.777168 2.557863
## 43 0.8714286 3.782407 2.566972
## 44 0.8897959 3.788428 2.578042
## 45 0.9081633 3.797002 2.595698
## 46 0.9265306 3.803259 2.609852
## 47 0.9448980 3.809478 2.623386
## 48 0.9632653 3.824192 2.655691
## 49 0.9816327 3.835596 2.679155
## 50 1.0000000 3.860514 2.718911
#berdasarkan rmse
best_loess %>% slice_min(rmse)
##        span     rmse      mae
## 1 0.4306122 3.730351 2.454869
#berdasarkan mae
best_loess %>% slice_min(mae)
##        span     rmse      mae
## 1 0.4122449 3.730585 2.454291
library(ggplot2)
ggplot(triceps, aes(age,triceps)) +
  geom_point(alpha=0.5,color="black") +
  stat_smooth(method='loess',
              formula=y~x,
              span = 0.4306122,
              col="blue",
              lty=1,
              se=F)

LATIHAN

#install.packages("ISLR")
library(ISLR)
AutoData = Auto %>% select(mpg, horsepower, origin)
tibble(AutoData)
## # A tibble: 392 x 3
##      mpg horsepower origin
##    <dbl>      <dbl>  <dbl>
##  1    18        130      1
##  2    15        165      1
##  3    18        150      1
##  4    16        150      1
##  5    17        140      1
##  6    15        198      1
##  7    14        220      1
##  8    14        215      1
##  9    14        225      1
## 10    15        190      1
## # ... with 382 more rows
X <- AutoData$horsepower
Y <- AutoData$mpg

1. Regresi Linier

lin.mod <-lm( Y~X)
plot(X,Y)
lines(X,lin.mod$fitted.values,col="red")

2. Polynomial

pol.mod <- lm( Y~X+I(X^2))
ix <- sort(X,index.return=T)$ix
plot(X,Y)
lines(X[ix], pol.mod$fitted.values[ix],col="blue", cex=2)

summary(pol.mod)
## 
## Call:
## lm(formula = Y ~ X + I(X^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.9000997  1.8004268   31.60   <2e-16 ***
## X           -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(X^2)       0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

3. Fungsi Tangga

#regresi fungsi tangga
range(X)
## [1]  46 230
step.mod = lm(Y ~ cut(X,5))
summary(step.mod)
## 
## Call:
## lm(formula = Y ~ cut(X, 5))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.3033  -3.0220  -0.5413   2.4074  16.6394 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          31.3033     0.4272   73.27   <2e-16 ***
## cut(X, 5)(82.8,120]  -8.2813     0.5642  -14.68   <2e-16 ***
## cut(X, 5)(120,156]  -15.2427     0.7210  -21.14   <2e-16 ***
## cut(X, 5)(156,193]  -17.5922     1.0036  -17.53   <2e-16 ***
## cut(X, 5)(193,230]  -18.5340     1.3767  -13.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.719 on 387 degrees of freedom
## Multiple R-squared:  0.6382, Adjusted R-squared:  0.6345 
## F-statistic: 170.7 on 4 and 387 DF,  p-value: < 2.2e-16
ggplot(AutoData,aes(x= X, y=Y)) +
                 geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm", 
               formula = Y~cut(X,5), 
               lty = 1, col = "blue",se = F)+
  theme_bw()
## Warning: 'newdata' had 80 rows but variables found have 392 rows
## Warning: Computation failed in `stat_smooth()`:
## arguments imply differing number of rows: 80, 392

4. Komparasi Model

nilai_AIC <- rbind(AIC(lin.mod),
                   AIC(pol.mod),
                   AIC(step.mod))

nama_model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)")
data.frame(nama_model,nilai_AIC)
##          nama_model nilai_AIC
## 1            Linear  2363.324
## 2     Poly (ordo=2)  2274.354
## 3 Tangga (breaks=3)  2335.820

Berdasarkan komperasi model dapat dilihat bahwa model terbaik untuk data yang berasal dari package ISLR adalah model Tangga dengan breaks 3 dengan nilai AIC sebesar 2335.820. Nilai AIC pada model Tangga breaks 3 paling kecil.