Recitation Assignment 1, STA1333 Introduction to Data Science: Nonlinear Regression
The assignment will be updated periodically at the following link: “https://rpubs.com/hanung0212safrizal/965994”
1. INTRODUCTION
1.1 Scope of Material
- Nonlinear Regression Part 1: modeling nonlinear data, covering:
- Linear regression
- Polynomial regression
- Step-function regression
- Nonlinear Regression Part 2: smoothing nonlinear data, covering:
- Piecewise cubic polynomial
- Spline regression
- Smoothing spline
- LOESS function
1.2 Data
The data used consist of two kinds: 1. Simulated data with 1000 observations drawn from a normal distribution with mean 2 and standard deviation 2. 2. The Auto dataset from library(ISLR).
The Auto dataset contains 392 observations and 9 variables, with the following variable types:
1. mpg (numeric)
2. cylinders (numeric)
3. displacement (numeric)
4. horsepower (numeric)
5. weight (numeric)
6. acceleration (numeric)
7. year (numeric)
8. origin (numeric)
9. name (categorical (factor))
In this nonlinear regression material, the variables used from Auto are mpg, horsepower, and origin.
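As a quick orientation, the short sketch below only inspects those three columns of Auto; it assumes library(ISLR) is attached as listed in Section 1.3 and adds nothing beyond a structure check.
# Minimal look at the Auto variables used later (assumes library(ISLR) is attached)
str(Auto[, c("mpg", "horsepower", "origin")])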
The TRICEPS dataset from library(MultiKink)
The data used for illustration come from an anthropometric study of 892 females under 50 years of age in three Gambian villages in West Africa. The data consist of three columns, age, lntriceps, and triceps, described as follows:
- age: respondent's age
- lntriceps: logarithm of the triceps skinfold thickness
- triceps: triceps skinfold thickness
The triceps skinfold is needed to calculate the upper-arm muscle circumference. Its thickness provides information about body fat reserves, while the calculated muscle mass provides information about protein reserves.
1.3 Packages
Packages used for nonlinear regression:
library(MultiKink)
library(ISLR)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(purrr)
library(rsample)
library(splines)
library(corrplot)
library(PerformanceAnalytics)
2. Nonlinear Regression on Simulated Data (Lecture Example)
2.1 Load the Data
set.seed(123)
datapsd.x <- rnorm(1000,2,2) # number of observations, mean (mu), standard deviation
err <- rnorm(1000)
y <- 5+2*datapsd.x+3*datapsd.x^2+err
plot(datapsd.x,y,xlim=c(-2,5), ylim=c(-10,70))
Based on the plot above, the simulated data follow a second-order polynomial model: the pattern is not a straight line and shows curvature at the ends and in the middle.
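For reference, the data-generating process coded above can be written as $y_i = 5 + 2x_i + 3x_i^2 + \varepsilon_i$ with $x_i \sim N(2, 2^2)$ and $\varepsilon_i \sim N(0, 1)$, so the true mean function is a second-order polynomial in x.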
2.2 Linear Regression Model
Linear regression is a statistical method used to model the relationship between one or more predictor variables X and a response variable Y.
lin.mod <-lm( y~datapsd.x)
plot( datapsd.x,y,xlim=c( -2,5), ylim=c( -10,70), xlab = "x", main = "Plot Data Bangkitan Fungsi Regresi Linier")
lines(datapsd.x,lin.mod$fitted.values,col="red")
summary(lin.mod)
##
## Call:
## lm(formula = y ~ datapsd.x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.530 -10.542 -6.695 4.101 110.065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5634 0.7417 4.804 1.79e-06 ***
## datapsd.x 14.6259 0.2612 55.985 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.38 on 998 degrees of freedom
## Multiple R-squared: 0.7585, Adjusted R-squared: 0.7582
## F-statistic: 3134 on 1 and 998 DF, p-value: < 2.2e-16
The red straight line shows the fitted linear regression function, built from the fitted values (the predicted values) and x. Based on the plot, an ordinary linear regression function is not appropriate here: the fitted line departs from the pattern of the observed data.
2.3 Second-Order Polynomial Model
pol.mod <- lm( y~datapsd.x+I(datapsd.x^2)) # order 2
ix <- sort( datapsd.x,index.return=T)$ix
plot(datapsd.x,y,xlim=c(-2,5), ylim=c(-10,70), xlab = "x", main = "Plot Data Bangkitan Fungsi Polinomial")
lines(datapsd.x[ix], pol.mod$fitted.values[ix],col="red", cex=4)
summary(pol.mod)
##
## Call:
## lm(formula = y ~ datapsd.x + I(datapsd.x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0319 -0.6942 0.0049 0.7116 3.2855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.951933 0.045676 108.41 <2e-16 ***
## datapsd.x 2.053661 0.029305 70.08 <2e-16 ***
## I(datapsd.x^2) 2.997702 0.005845 512.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 997 degrees of freedom
## Multiple R-squared: 0.9991, Adjusted R-squared: 0.9991
## F-statistic: 5.462e+05 on 2 and 997 DF, p-value: < 2.2e-16
The red curve shows the fitted polynomial function, built from the sorted fitted values (the predicted values) and x. The plot shows a second-order polynomial fit. Based on the plot, the polynomial function is appropriate for the simulated data: the fitted curve closely follows the pattern of the observations, so the resulting error is likely to be smaller.
2.4 Step-Function Model
range( datapsd.x)
## [1] -3.619549  8.482080
c1 <- as.factor(ifelse(datapsd.x<=0,1,0))
c2 <- as.factor(ifelse(datapsd.x<=2 & datapsd.x>0,1,0))
c3 <- as.factor(ifelse(datapsd.x>2,1,0))
data.frame(y, c1, c2, c3)
##             y c1 c2 c3
## 1    8.080479  0  1  0
## 2   14.150855  0  1  0
## 3   93.780712  0  0  1
## 4   22.901717  0  0  1
## 5   22.271298  0  0  1
## 6  105.359768  0  0  1
## ... (output truncated; 1000 rows in total)
step.mod <- lm(y~c1+c2+c3)
plot(datapsd.x,y,xlim=c(-2,5), ylim=c(-10,70))
lines(datapsd.x,lin.mod$fitted.values,col="blue", lwd =1)
lines(datapsd.x[ix], step.mod$fitted.values[ix],col="red", lwd =2)
summary(step.mod)
##
## Call:
## lm(formula = y ~ c1 + c2 + c3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.639 -11.091 -2.245 4.847 182.198
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.488 1.095 50.68 <2e-16 ***
## c11 -47.735 2.206 -21.64 <2e-16 ***
## c21 -43.407 1.741 -24.93 <2e-16 ***
## c31 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.6 on 997 degrees of freedom
## Multiple R-squared: 0.4555, Adjusted R-squared: 0.4544
## F-statistic: 417.1 on 2 and 997 DF, p-value: < 2.2e-16
The red line shows the fitted step function with three intervals: interval 1 is x ≤ 0, interval 2 is 0 < x ≤ 2, and interval 3 is x > 2. Because c1, c2, and c3 together cover every observation, one of them is redundant; lm() therefore drops c3 (the NA coefficient reported as a singularity). Exploratively, the step function follows the data pattern more closely than the linear regression line does.
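As a side note, the same three intervals can be built with base R's cut() instead of coding the dummies by hand; a minimal sketch, assuming datapsd.x and y as generated above (x.cut and step.mod.cut are illustrative names):
# Step-function fit via cut(): one mean per interval, equivalent to step.mod
x.cut <- cut(datapsd.x, breaks = c(min(datapsd.x) - 1, 0, 2, max(datapsd.x) + 1))
step.mod.cut <- lm(y ~ x.cut)
summary(step.mod.cut)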
2.5 Comparing the Models
The models can be compared exploratively from the plots above. For a more precise comparison, we can use AIC (Akaike's Information Criterion) and MSE (Mean Squared Error): the better the model, the smaller its AIC and MSE.
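As a quick sanity check on what AIC() reports, the sketch below recomputes AIC from the log-likelihood using the standard definition AIC = 2k - 2 log L (it assumes lin.mod from Section 2.2 is still in the workspace):
# AIC by hand: 2 * (number of estimated parameters) - 2 * log-likelihood
ll <- logLik(lin.mod)
2 * attr(ll, "df") - 2 * as.numeric(ll)
AIC(lin.mod)  # should match the value above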
MSE = function(pred,actual){
mean((pred-actual)^2)
}
AIC <- rbind(AIC(lin.mod),
AIC(pol.mod),
AIC(step.mod))
MSE <- rbind(MSE(predict(lin.mod),y),
MSE(predict(pol.mod),y),
MSE(predict(step.mod),y))
Model <- c("Linear","Polynomial (ordo=2)","Tangga (breaks=3)")
model.bangkitan1 <- data.frame(Model,AIC,MSE)
Based on the three fitted models, the relationship is clearly not linear. Exploratively, the second-order polynomial fits best, and the AIC/MSE comparison agrees: the second-order polynomial is the best of the three models, with the smallest AIC and MSE, 2856.470 and 1.010649 respectively.
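The comparison table built above is never printed in the output; a minimal way to display it, ordered from smallest AIC, using model.bangkitan1 exactly as defined above:
# Show the models ranked by AIC (smallest = best)
model.bangkitan1[order(model.bangkitan1$AIC), ]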
2.6 Piecewise Cubic Polynomial
# Split the data into two pieces at the mean (mean = 2)
dt.all <- cbind(y,datapsd.x)
## piece 1 (x <= 2)
dt1 <- dt.all[datapsd.x <=2,]
dim(dt1)
## [1] 495 2
## piece 2 (x > 2)
dt2 <- dt.all[datapsd.x >2,]
dim(dt2)
## [1] 505 2
plot(datapsd.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="orange")
cub.mod1 <- lm(dt1[,1]~dt1[,2]+I(dt1[,2]^2)+I(dt1[,2]^3))
ix <- sort(dt1[,2], index.return=T)$ix
lines(dt1[ix,2],cub.mod1$fitted.values[ix],col="red", lwd=2)
cub.mod2 <- lm(dt2[,1]~dt2[,2]+I(dt2[,2]^2)+I(dt2[,2]^3))
ix <- sort(dt2[,2], index.return=T)$ix
lines(dt2[ix,2],cub.mod2$fitted.values[ix],col="blue", lwd=2)
2.7 Truncated Power Basis
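With a single knot at $\xi = 1$, the cubic spline fitted below via the truncated power basis can be written as
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x-1)_+^3, \qquad (x-1)_+^3 = \begin{cases} (x-1)^3, & x > 1 \\ 0, & x \le 1, \end{cases}$$
which is exactly the extra term hx constructed in the code.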
plot(datapsd.x,y,xlim=c( -2,5), ylim=c( -10,70), pch=16,col="orange")
abline(v=1,col="red", lty=2)
hx <- ifelse( datapsd.x>1,(datapsd.x-1)^3,0)
cubspline.mod <- lm( y~datapsd.x+I(datapsd.x^2)+I(datapsd.x^3)+hx)
ix <- sort( datapsd.x,index.return=T)$ix
lines( datapsd.x[ix], cubspline.mod$fitted.values[ix],col="blue", lwd=2)
2.8 Comparison with K-Fold CV
plot( datapsd.x,y,xlim=c(-2,5), ylim=c( -10,70), pch=16,col="orange")
abline(v=0,col="red", lty=2)
abline(v=2,col="red", lty=2)
hx1 <- ifelse( datapsd.x>0,(datapsd.x-0)^3,0)
hx2 <- ifelse( datapsd.x>2,(datapsd.x-2)^3,0)
cubspline.mod2 <- lm( y~datapsd.x+I(datapsd.x^2)+I(datapsd.x^3)+hx1+hx2)
ix <- sort( datapsd.x,index.return=T)$ix
lines(datapsd.x[ix],cubspline.mod2$fitted.values[ix],col="blue", lwd=2)
2.9 Comparing 1 Knot and 2 Knots with 5-Fold CV
##1 knot
data.all <- cbind( y,datapsd.x,hx)
set.seed(456)
ind <- sample(1:5,length( datapsd.x),replace=T)
res <- c()
for( i in 1:5){
dt.train <- data.all[ ind!=i,]
x.test <- data.all[ ind==i, -1]
y.test <- data.all[ ind==i,1]
mod1 <- lm( dt.train[,1]~dt.train[,2]+I( dt.train[,2]^2)+I( dt.train[,2]^3)+dt.train[,3])
x.test.olah <- cbind(1,x.test[,1], x.test[,1]^2,x.test[,1]^3,x.test[,2])
beta <- coefficients(mod1)
prediksi <- x.test.olah%*%beta
res <- c( res,mean(( y.test-prediksi)^2))
}
res
## [1] 0.8883211 0.8545532 1.1762354 1.1853072 1.0085237
mean(res)
## [1] 1.022588
##2 knot (knot = 0, 2)
data.all2 <- cbind(y,datapsd.x,hx1,hx2)
set.seed(456)
ind2 <- sample(1:5,length( datapsd.x),replace=T)
res2 <- c()
for( i in 1:5){
dt.train2 <- data.all2[ind2!=i,]
x.test2 <- data.all2[ind2==i,-1]
y.test2 <- data.all2[ind2==i,1]
mod2 <- lm(dt.train2[,1]~dt.train2[,2]+I(dt.train2[,2]^2)+I(dt.train2[,2]^3)+dt.train2[,3]+dt.train2[,4])
x.test.olah2 <- cbind(1,x.test2[,1],x.test2[,1]^2,x.test2[,1]^3,x.test2[,2],x.test2[,3])
beta2 <- coefficients(mod2)
prediksi2 <- x.test.olah2%*%beta2
res2 <- c(res2,mean((y.test2-prediksi2)^2))
}
res2
## [1] 0.8926291 0.8563630 1.1799248 1.1941252 1.0096880
mean(res2)
## [1] 1.026546
The 1-knot spline gives a slightly smaller average 5-fold CV error (1.022588) than the 2-knot spline (1.026546), so adding the second knot does not improve prediction here.
2.10 Spline Regression
- Determining the number of basis functions and knots (chosen automatically by the software)
dt.all <- as.data.frame(dt.all)
knots.val <- attr(bs(dt.all$datapsd.x, df=6), "knots")
knots.val
##       25%       50%       75%
## 0.7433515 2.0184193 3.3292037
- Fitting the spline regression (df = number of pieces + 2)
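As a quick check of the df bookkeeping in splines::bs() (for a cubic B-spline without an intercept, df = number of interior knots + 3), the sketch below verifies that df = 6 corresponds to 3 interior knots:
# df = 6 with the default cubic degree implies 6 - 3 = 3 interior knots
ncol(bs(dt.all$datapsd.x, df = 6))
length(attr(bs(dt.all$datapsd.x, df = 6), "knots"))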
spline.mod.tr <- lm(y ~ bs(datapsd.x, knots=knots.val), data=dt.all)
summary(spline.mod.tr)
##
## Call:
## lm(formula = y ~ bs(datapsd.x, knots = knots.val), data = dt.all)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0346 -0.6809 -0.0013 0.7086 3.3460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.7407 0.5373 66.52 <2e-16 ***
## bs(datapsd.x, knots = knots.val)1 -26.9856 0.8589 -31.42 <2e-16 ***
## bs(datapsd.x, knots = knots.val)2 -39.9744 0.4999 -79.96 <2e-16 ***
## bs(datapsd.x, knots = knots.val)3 -15.8889 0.5700 -27.87 <2e-16 ***
## bs(datapsd.x, knots = knots.val)4 30.4698 0.5613 54.29 <2e-16 ***
## bs(datapsd.x, knots = knots.val)5 111.9672 0.7656 146.26 <2e-16 ***
## bs(datapsd.x, knots = knots.val)6 201.8858 0.9206 219.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.006 on 993 degrees of freedom
## Multiple R-squared: 0.9991, Adjusted R-squared: 0.9991
## F-statistic: 1.823e+05 on 6 and 993 DF, p-value: < 2.2e-16
2.11 Smoothing Spline
ss1 <- smooth.spline(datapsd.x,y,all.knots = T)
ss.mod.tr <- with(data=dt.all, smooth.spline(datapsd.x, y))
ss.mod.tr
## Call:
## smooth.spline(x = datapsd.x, y = y)
##
## Smoothing Parameter spar= 0.8275227 lambda= 0.0001455632 (14 iterations)
## Equivalent Degrees of Freedom (Df): 17.06386
## Penalized Criterion (RSS): 993.8526
## GCV: 2278.857
plot(datapsd.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16,col="orange")
lines(ss1,col="blue", lwd=2)
2.12 Loess Function
loess.tr <- loess(formula=y~datapsd.x, data=dt.all)
summary(loess.tr)
## Call:
## loess(formula = y ~ datapsd.x, data = dt.all)
##
## Number of Observations: 1000
## Equivalent Number of Parameters: 5.25
## Residual Standard Error: 1.006
## Trace of smoother matrix: 5.74 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
loess.span.tr <- data.frame(span=seq(0.1,2,by=0.1)) %>%
group_by(span) %>%
do(broom::augment(loess(y~datapsd.x, data=dt.all,span=.$span)))
ggplot(loess.span.tr, aes(x=datapsd.x, y=y)) + geom_line(aes(y=.fitted), col="blue", lty=1) + facet_wrap(~span)
TRICEPS Data Example
Data description
The TRICEPS data are as described in Section 1.2: an anthropometric study of 892 females under 50 years of age in three Gambian villages in West Africa, with columns age (respondent's age), lntriceps (log of the triceps skinfold thickness), and triceps (triceps skinfold thickness). The triceps skinfold is used to compute the upper-arm muscle circumference; its thickness reflects body fat reserves, while the derived muscle mass reflects protein reserves.
data("triceps", package="MultiKink")
Shown as a scatterplot:
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
theme_bw()
cor(triceps)
##                 age lntriceps   triceps
## age 1.0000000 0.5776443 0.5806972
## lntriceps 0.5776443 1.0000000 0.9612043
## triceps 0.5806972 0.9612043 1.0000000
The scatterplot above shows a moderate positive relationship, with a correlation between age and triceps of about 0.58. Based on the pattern visible in the scatterplot, we will look for a model that represents this relationship well.
Linear Regression
mod_linear = lm(triceps~age,data=triceps)
summary(mod_linear)
##
## Call:
## lm(formula = triceps ~ age, data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9512 -2.3965 -0.5154 1.5822 25.1233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.19717 0.21244 29.17 <2e-16 ***
## age 0.21584 0.01014 21.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.007 on 890 degrees of freedom
## Multiple R-squared: 0.3372, Adjusted R-squared: 0.3365
## F-statistic: 452.8 on 1 and 890 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~x,lty = 1,
col = "blue",se = F)+
theme_bw()
Based on the scatterplot above, the linear regression function does not yet follow the pattern of the relationship.
Second-Degree (Order 2) Polynomial Regression
mod_polinomial2 = lm(triceps ~ poly(age,2,raw = T), # raw = T uses raw (non-orthogonal) polynomials
data=triceps)
summary(mod_polinomial2)
##
## Call:
## lm(formula = triceps ~ poly(age, 2, raw = T), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5677 -2.4401 -0.4587 1.6368 24.9961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0229191 0.3063806 19.658 < 2e-16 ***
## poly(age, 2, raw = T)1 0.2434733 0.0364403 6.681 4.17e-11 ***
## poly(age, 2, raw = T)2 -0.0006257 0.0007926 -0.789 0.43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.008 on 889 degrees of freedom
## Multiple R-squared: 0.3377, Adjusted R-squared: 0.3362
## F-statistic: 226.6 on 2 and 889 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "blue",se = F)+
theme_bw()
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "blue",se = T)+
theme_bw()
The order-2 polynomial also does not yet follow the pattern in the scatterplot.
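Regarding the raw = T argument: raw and orthogonal polynomials (the default, raw = FALSE) parameterize the same model, so the coefficients differ but the fitted curve is identical. A minimal check, assuming the triceps data loaded above (mod_raw and mod_orth are illustrative names):
# Same fitted values under the raw and the orthogonal polynomial basis
mod_raw  <- lm(triceps ~ poly(age, 2, raw = TRUE), data = triceps)
mod_orth <- lm(triceps ~ poly(age, 2), data = triceps)
all.equal(fitted(mod_raw), fitted(mod_orth))  # TRUE (up to numerical error)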
Third-Degree (Order 3) Polynomial Regression
mod_polinomial3 = lm(triceps ~ poly(age,3,raw = T),data=triceps)
summary(mod_polinomial3)
##
## Call:
## lm(formula = triceps ~ poly(age, 3, raw = T), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5832 -1.9284 -0.5415 1.3283 24.4440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.004e+00 3.831e-01 20.893 < 2e-16 ***
## poly(age, 3, raw = T)1 -3.157e-01 7.721e-02 -4.089 4.73e-05 ***
## poly(age, 3, raw = T)2 3.101e-02 3.964e-03 7.824 1.45e-14 ***
## poly(age, 3, raw = T)3 -4.566e-04 5.612e-05 -8.135 1.38e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.868 on 888 degrees of freedom
## Multiple R-squared: 0.3836, Adjusted R-squared: 0.3815
## F-statistic: 184.2 on 3 and 888 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,3,raw=T),
lty = 1, col = "blue",se = T)+
theme_bw()
Based on the scatterplot above, the order-3 polynomial follows the pattern of the relationship more closely than the linear and order-2 polynomial fits.
Step-Function Regression (5 intervals)
mod_tangga5 = lm(triceps ~ cut(age,5),data=triceps)
summary(mod_tangga5)
##
## Call:
## lm(formula = triceps ~ cut(age, 5), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5474 -2.0318 -0.4465 1.3682 23.3759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2318 0.1994 36.260 < 2e-16 ***
## cut(age, 5)(10.6,20.9] 1.6294 0.3244 5.023 6.16e-07 ***
## cut(age, 5)(20.9,31.2] 5.9923 0.4222 14.192 < 2e-16 ***
## cut(age, 5)(31.2,41.5] 7.5156 0.4506 16.678 < 2e-16 ***
## cut(age, 5)(41.5,51.8] 7.4561 0.5543 13.452 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.939 on 887 degrees of freedom
## Multiple R-squared: 0.3617, Adjusted R-squared: 0.3588
## F-statistic: 125.7 on 4 and 887 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
lty = 1, col = "blue",se = F)+
theme_bw()
The scatterplot above shows the step function, with age divided into five intervals of equal width. The step function tends to follow the pattern of the data.
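To see exactly which five equal-width age intervals cut() produces, and how many observations fall in each, a minimal sketch:
# Five equal-width bins over the observed age range
table(cut(triceps$age, 5))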
Step-Function Regression (7 intervals)
mod_tangga7 = lm(triceps ~ cut(age,7),data=triceps)
summary(mod_tangga7)
##
## Call:
## lm(formula = triceps ~ cut(age, 7), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8063 -1.7592 -0.4366 1.2894 23.1461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5592 0.2219 34.060 < 2e-16 ***
## cut(age, 7)(7.62,15] -0.6486 0.3326 -1.950 0.0515 .
## cut(age, 7)(15,22.3] 3.4534 0.4239 8.146 1.27e-15 ***
## cut(age, 7)(22.3,29.7] 5.8947 0.4604 12.804 < 2e-16 ***
## cut(age, 7)(29.7,37] 7.8471 0.5249 14.949 < 2e-16 ***
## cut(age, 7)(37,44.4] 6.9191 0.5391 12.835 < 2e-16 ***
## cut(age, 7)(44.4,51.8] 6.3013 0.6560 9.606 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.805 on 885 degrees of freedom
## Multiple R-squared: 0.4055, Adjusted R-squared: 0.4014
## F-statistic: 100.6 on 6 and 885 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,7),
lty = 1, col = "blue",se = F)+
theme_bw()
The scatterplot above shows the step function with seven intervals. The fitted step function with seven intervals follows the data pattern more closely than the one with five.
Model Comparison
MSE = function(pred,actual){
mean((pred-actual)^2)
}
nilai_AIC_triceps <- rbind(AIC(mod_linear),
AIC(mod_polinomial2),
AIC(mod_polinomial3),
AIC(mod_tangga5),
AIC(mod_tangga7))
nilai_MSE_triceps <- rbind(MSE(predict(mod_linear),triceps$triceps),
MSE(predict(mod_polinomial2),triceps$triceps),
MSE(predict(mod_polinomial3),triceps$triceps),
MSE(predict(mod_tangga5),triceps$triceps),
MSE(predict(mod_tangga7),triceps$triceps))
nama_model_triceps <- c("Linear","Poly (ordo=2)", "Poly (ordo=3)",
"Tangga (breaks=5)", "Tangga (breaks=7)")
data.frame(nama_model_triceps,nilai_AIC_triceps, nilai_MSE_triceps)
##   nama_model_triceps nilai_AIC_triceps nilai_MSE_triceps
## 1 Linear 5011.515 16.01758
## 2 Poly (ordo=2) 5012.890 16.00636
## 3 Poly (ordo=3) 4950.774 14.89621
## 4 Tangga (breaks=5) 4983.948 15.42602
## 5 Tangga (breaks=7) 4924.556 14.36779
Comparing the models above in terms of MSE and AIC, the step-function model with seven intervals has the smallest MSE, 14.36779, and the smallest AIC, 4924.556.
Model Evaluation Using Cross-Validation
Cross-validation, also called rotation estimation, is a model-validation technique for assessing how the results of a statistical analysis generalize to an independent data set. It is used mainly in prediction settings, to estimate how accurately a predictive model will perform in practice.
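Concretely, with $k$ folds each observation is held out exactly once, and the overall error estimate is the average of the per-fold errors,
$$\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i,$$
and analogously for the RMSE and MAE used below.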
Linear Regression
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
metric_linear <- map_dfr(cross_val$splits,
function(x){
mod <- lm(triceps ~ age,data=triceps[x$in_id,])
pred <- predict(mod,newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_linear
## # A tibble: 10 x 2
## RMSE MAE
## <dbl> <dbl>
## 1 3.65 2.82
## 2 4.62 3.22
## 3 4.38 3.00
## 4 3.85 2.80
## 5 3.08 2.36
## 6 3.83 2.81
## 7 3.59 2.78
## 8 4.66 3.06
## 9 3.50 2.56
## 10 4.59 2.93
# average over the 10 folds
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear
##     RMSE      MAE
## 3.973249 2.833886
Second-Degree Polynomial (Order 2)
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
metric_poly2 <- map_dfr(cross_val$splits,
function(x){
mod <- lm(triceps ~ poly(age,2,raw = T),data=triceps[x$in_id,])
pred <- predict(mod,newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_poly2
## # A tibble: 10 x 2
## RMSE MAE
## <dbl> <dbl>
## 1 3.64 2.82
## 2 4.62 3.26
## 3 4.39 3.02
## 4 3.85 2.81
## 5 3.10 2.38
## 6 3.82 2.81
## 7 3.62 2.83
## 8 4.65 3.07
## 9 3.50 2.57
## 10 4.59 2.94
# average over the 10 folds
mean_metric_poly2 <- colMeans(metric_poly2)
mean_metric_poly2
##     RMSE      MAE
## 3.977777 2.851787
Third-Degree Polynomial (Order 3)
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
metric_poly3 <- map_dfr(cross_val$splits,
function(x){
mod <- lm(triceps ~ poly(age,3,raw = T),data=triceps[x$in_id,])
pred <- predict(mod,newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_poly3
## # A tibble: 10 x 2
## RMSE MAE
## <dbl> <dbl>
## 1 3.49 2.60
## 2 4.48 2.99
## 3 4.21 2.85
## 4 4.02 2.75
## 5 3.03 2.09
## 6 3.63 2.59
## 7 3.53 2.52
## 8 4.54 2.91
## 9 3.27 2.33
## 10 4.27 2.68
# average over the 10 folds
mean_metric_poly3 <- colMeans(metric_poly3)
mean_metric_poly3
##     RMSE      MAE
## 3.845976 2.632125
Step Function
set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
breaks <- 3:10
best_tangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- triceps[x$in_id,]
training$age <- cut(training$age,i)
mod <- lm(triceps ~ age,data=training)
labs_x <- levels(mod$model[,2])
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))
testing <- triceps[-x$in_id,]
age_new <- cut(testing$age,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
pred <- predict(mod,newdata=list(age=age_new))
truth <- testing$triceps
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("RMSE","MAE")
return(metric)
}
)
metric_tangga
# average over the 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
}
)
best_tangga <- cbind(breaks=breaks,best_tangga)
# show results for all breaks
best_tangga
##   breaks     RMSE      MAE
## 1 3 3.835357 2.618775
## 2 4 3.882932 2.651911
## 3 5 3.917840 2.724368
## 4 6 3.836068 2.622939
## 5 7 3.789715 2.555062
## 6 8 3.812789 2.555563
## 7 9 3.781720 2.518706
## 8 10 3.795877 2.529479
# based on RMSE
best_tangga %>% slice_min(RMSE)
##   breaks    RMSE      MAE
## 1 9 3.78172 2.518706
# based on MAE
best_tangga %>% slice_min(MAE)
##   breaks    RMSE      MAE
## 1 9 3.78172 2.518706
Model Comparison
nilai_metric <- rbind(mean_metric_linear,
mean_metric_poly2,
mean_metric_poly3,
best_tangga %>% select(-1) %>% slice_min(MAE))
nama_model <- c("Linear","Poly2","Poly3","Tangga (breaks=9)")
data.frame(nama_model,nilai_metric)
##          nama_model     RMSE      MAE
## 1 Linear 3.973249 2.833886
## 2 Poly2 3.977777 2.851787
## 3 Poly3 3.845976 2.632125
## 4 Tangga (breaks=9) 3.781720 2.518706
Based on the comparison above, the smallest RMSE and MAE are obtained by the step-function model with 9 breaks, at 3.781720 and 2.518706 respectively.
3. NONLINEAR REGRESSION PART 2
Lecture Example Data
The data again use set.seed(123): 1000 x-values are drawn from a normal distribution with mean 1 and standard deviation 1, and the error term is drawn from a normal distribution with standard deviation 5. The data are split into two parts, datapsd.x less than or equal to 1 and datapsd.x greater than 1.
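Written out, the data-generating model in the code below is $y_i = 5 + 2x_i + 3x_i^2 + \varepsilon_i$ with $x_i \sim N(1, 1)$ and $\varepsilon_i \sim N(0, 5^2)$: the same quadratic mean function as in Section 2, but with a much noisier error term.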
set.seed(123)
datapsd.x <- rnorm(1000,1,1)
err <- rnorm(1000,0,5)
y <- 5+2*datapsd.x+3*datapsd.x^2+err
plot(datapsd.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="blue")
abline(v=1, col="red", lty=2)
Piecewise Cubic Polynomial
dt.all <- cbind(y,datapsd.x)
## piece 1 (x <= 1)
dt1 <- dt.all[datapsd.x <=1,]
dim(dt1)
## [1] 495 2
## piece 2 (x > 1)
dt2 <- dt.all[datapsd.x >1,]
dim(dt2)
## [1] 505 2
plot(datapsd.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="dark blue")
cub.mod1 <- lm(dt1[,1]~dt1[,2]+I(dt1[,2]^2)+I(dt1[,2]^3))
ix <- sort(dt1[,2], index.return=T)$ix
lines(dt1[ix,2],cub.mod1$fitted.values[ix],col="red", lwd=2)
cub.mod2 <- lm(dt2[,1]~dt2[,2]+I(dt2[,2]^2)+I(dt2[,2]^3))
ix <- sort(dt2[,2], index.return=T)$ix
lines(dt2[ix,2],cub.mod2$fitted.values[ix],col="red", lwd=2)
There is a visible break between the two fitted curves at datapsd.x = 1.
Truncated Power Basis
To smooth the fit at datapsd.x = 1, a truncated power basis is used (the same basis written out in Section 2.7).
plot(datapsd.x,y,xlim=c( -2,5), ylim=c( -10,70), pch=16,col="dark blue")
abline(v=1,col="red", lty=2)
hx <- ifelse( datapsd.x>1,(datapsd.x-1)^3,0)
cubspline.mod <- lm( y~datapsd.x+I(datapsd.x^2)+I(datapsd.x^3)+hx)
ix <- sort( datapsd.x,index.return=T)$ix
lines( datapsd.x[ix], cubspline.mod$fitted.values[ix],col="green", lwd=2)
Comparison with k-fold CV
The data are divided into three regions by placing knots at 0 and 2: datapsd.x at most 0, datapsd.x between 0 and 2, and datapsd.x greater than 2.
plot( datapsd.x,y,xlim=c(-2,5), ylim=c( -10,70), pch=16,col="dark blue")
abline(v=0,col="red", lty=2)
abline(v=2,col="red", lty=2)
hx1 <- ifelse( datapsd.x>0,(datapsd.x-0)^3,0)
hx2 <- ifelse( datapsd.x>2,(datapsd.x-2)^3,0)
cubspline.mod2 <- lm( y~datapsd.x+I(datapsd.x^2)+I(datapsd.x^3)+hx1+hx2)
ix <- sort( datapsd.x,index.return=T)$ix
lines(datapsd.x[ix],cubspline.mod2$fitted.values[ix],col="green", lwd=2)
Comparing 1 knot and 2 knots with 5-fold CV
##1 knot
data.all <- cbind( y,datapsd.x,hx)
set.seed(456)
ind <- sample(1:5,length( datapsd.x),replace=T)
res <- c()
for( i in 1:5){
dt.train <- data.all[ ind!=i,]
x.test <- data.all[ ind==i, -1]
y.test <- data.all[ ind==i,1]
mod1 <- lm( dt.train[,1]~dt.train[,2]+I( dt.train[,2]^2)+I( dt.train[,2]^3)+dt.train[,3])
x.test.olah <- cbind(1,x.test[,1], x.test[,1]^2,x.test[,1]^3,x.test[,2])
beta <- coefficients(mod1)
prediksi <- x.test.olah%*%beta
res <- c(res,mean(( y.test-prediksi)^2))
}
res
## [1] 22.08767 21.39598 29.47677 29.68683 25.18070
mean(res)
## [1] 25.56559
##2 knot (knot = 0, 2)
data.all2 <- cbind(y,datapsd.x,hx1,hx2)
set.seed(456)
ind2 <- sample(1:5,length( datapsd.x),replace=T)
res2 <- c()
for( i in 1:5){
dt.train2 <- data.all2[ind2!=i,]
x.test2 <- data.all2[ind2==i,-1]
y.test2 <- data.all2[ind2==i,1]
mod2 <- lm(dt.train2[,1]~dt.train2[,2]+I(dt.train2[,2]^2)+I(dt.train2[,2]^3)+dt.train2[,3]+dt.train2[,4])
x.test.olah2 <- cbind(1,x.test2[,1],x.test2[,1]^2,x.test2[,1]^3,x.test2[,2],x.test2[,3])
beta2 <- coefficients(mod2)
prediksi2 <- x.test.olah2%*%beta2
res2 <- c(res2,mean((y.test2-prediksi2)^2))
}
res2
## [1] 22.32898 21.33868 29.40459 29.79913 25.22047
mean(res2)
## [1] 25.61837
As in Section 2, the 1-knot spline gives a marginally smaller average 5-fold CV error (25.56559) than the 2-knot spline (25.61837).
Smoothing Spline
ss1 <- smooth.spline(datapsd.x,y,all.knots = T)
plot(datapsd.x,y,xlim=c(-2,5), ylim=c(-10,70), pch=16,col="dark blue")
lines(ss1,col="red", lwd=2)
TRICEPS Data Example
The TRICEPS data are the same as described earlier: an anthropometric study of 892 females under 50 years of age in three Gambian villages, with columns age (respondent's age), lntriceps (log of the triceps skinfold thickness), and triceps (triceps skinfold thickness). The triceps skinfold is used to compute the upper-arm muscle circumference; its thickness reflects body fat reserves, while the derived muscle mass reflects protein reserves.
data("triceps", package="MultiKink")
Shown as a scatterplot:
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
theme_bw()
Spline Regression
library(splines)
# Determine the number of basis functions and knots
dim(bs(triceps$age, knots = c(10, 20,40)))
## [1] 892 6
# or
dim(bs(triceps$age, df=6))
## [1] 892 6
# knot values chosen automatically by bs()
attr(bs(triceps$age, df=6),"knots")
##     25%     50%     75%
## 5.5475 12.2100 24.7275
B-spline: using manually chosen knots
knot_value_manual_3 = c(10, 20,40)
mod_spline_1 = lm(triceps ~ bs(age, knots =knot_value_manual_3 ),data=triceps)
summary(mod_spline_1)
##
## Call:
## lm(formula = triceps ~ bs(age, knots = knot_value_manual_3),
## data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.385 -1.682 -0.393 1.165 22.855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.22533 0.61873 13.294 < 2e-16 ***
## bs(age, knots = knot_value_manual_3)1 -0.06551 1.14972 -0.057 0.955
## bs(age, knots = knot_value_manual_3)2 -4.30051 0.72301 -5.948 3.90e-09 ***
## bs(age, knots = knot_value_manual_3)3 7.80435 1.17793 6.625 6.00e-11 ***
## bs(age, knots = knot_value_manual_3)4 6.14890 1.27439 4.825 1.65e-06 ***
## bs(age, knots = knot_value_manual_3)5 5.56640 1.42225 3.914 9.78e-05 ***
## bs(age, knots = knot_value_manual_3)6 7.90178 1.54514 5.114 3.87e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.76 on 885 degrees of freedom
## Multiple R-squared: 0.4195, Adjusted R-squared: 0.4155
## F-statistic: 106.6 on 6 and 885 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~bs(x, knots = knot_value_manual_3),
lty = 1,se = F)
Using knots chosen by the computer
knot_value_pc_df_6 = attr(bs(triceps$age, df=6),"knots")
mod_spline_1 = lm(triceps ~ bs(age, knots =knot_value_pc_df_6 ),data=triceps)
summary(mod_spline_1)
##
## Call:
## lm(formula = triceps ~ bs(age, knots = knot_value_pc_df_6), data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0056 -1.7556 -0.2944 1.2008 23.0695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5925 0.8770 7.517 1.37e-13 ***
## bs(age, knots = knot_value_pc_df_6)1 3.7961 1.5015 2.528 0.01164 *
## bs(age, knots = knot_value_pc_df_6)2 -2.0749 0.8884 -2.335 0.01974 *
## bs(age, knots = knot_value_pc_df_6)3 1.5139 1.1645 1.300 0.19391
## bs(age, knots = knot_value_pc_df_6)4 11.6394 1.3144 8.855 < 2e-16 ***
## bs(age, knots = knot_value_pc_df_6)5 5.9680 1.5602 3.825 0.00014 ***
## bs(age, knots = knot_value_pc_df_6)6 8.9127 1.4053 6.342 3.60e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.757 on 885 degrees of freedom
## Multiple R-squared: 0.4206, Adjusted R-squared: 0.4167
## F-statistic: 107.1 on 6 and 885 DF, p-value: < 2.2e-16
ggplot(triceps,aes(x=age, y=triceps)) +
geom_point(alpha=0.55, color="black") +
stat_smooth(method = "lm",
formula = y~bs(x, knots = knot_value_pc_df_6),
lty = 1,se = F)
Natural Spline
knot_value_manual_3 = c(10, 20,40)
mod_spline3ns = lm(triceps ~ ns(age, knots = knot_value_manual_3),data=triceps)
summary(mod_spline3ns)
##
## Call:
## lm(formula = triceps ~ ns(age, knots = knot_value_manual_3),
## data = triceps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7220 -1.7640 -0.3985 1.1908 22.9684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8748 0.3742 23.714 < 2e-16 ***
## ns(age, knots = knot_value_manual_3)1 7.0119 0.6728 10.422 < 2e-16 ***
## ns(age, knots = knot_value_manual_3)2 6.0762 0.8625 7.045 3.72e-12 ***
## ns(age, knots = knot_value_manual_3)3 2.0780 1.0632 1.954 0.051 .
## ns(age, knots = knot_value_manual_3)4 8.8616 0.9930 8.924 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.762 on 887 degrees of freedom
## Multiple R-squared: 0.4176, Adjusted R-squared: 0.415
## F-statistic: 159 on 4 and 887 DF, p-value: < 2.2e-16
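The natural-spline fit itself can be plotted the same way as the B-spline fits above; a minimal sketch reusing the document's ggplot pattern (splines::ns is written explicitly so the function is found inside stat_smooth()):
ggplot(triceps, aes(x = age, y = triceps)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm",
              formula = y ~ splines::ns(x, knots = knot_value_manual_3),
              lty = 1, se = F)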
Smoothing Spline
model_sms <- with(data = triceps,smooth.spline(age,triceps))
model_sms## Call:
## smooth.spline(x = age, y = triceps)
##
## Smoothing Parameter spar= 0.5162704 lambda= 1.232259e-06 (13 iterations)
## Equivalent Degrees of Freedom (Df): 50.00639
## Penalized Criterion (RSS): 8591.581
## GCV: 13.77835
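By default, smooth.spline() selects the smoothing parameter by generalized cross-validation (GCV), which is what the output above reports. As a sketch, ordinary leave-one-out cross-validation can be requested instead through the cv argument (the object name below is ours):
# LOOCV instead of GCV; R may warn that LOOCV is doubtful when x contains tied values
model_sms_cv <- with(data = triceps, smooth.spline(age, triceps, cv = TRUE))
model_sms_cv$lambda # selected smoothing parameter
model_sms_cv$df     # equivalent degrees of freedom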
pred_data <- broom::augment(model_sms)
ggplot(pred_data,aes(x=x,y=y))+
geom_point(alpha=0.55, color="black")+
geom_line(aes(y=.fitted),col="blue",
lty=1)+
xlab("age")+
ylab("triceps")+
theme_bw()model_sms_lambda <- data.frame(lambda=seq(0,5,by=0.5)) %>%
group_by(lambda) %>%
do(broom::augment(with(data = triceps,smooth.spline(age,triceps,lambda = .$lambda))))
p <- ggplot(model_sms_lambda,
aes(x=x,y=y))+
geom_line(aes(y=.fitted),
col="blue",
lty=1
)+
facet_wrap(~lambda)
pmodel_sms <- with(data = triceps,smooth.spline(age,triceps,df=7))
model_sms ## Call:
## smooth.spline(x = age, y = triceps, df = 7)
##
## Smoothing Parameter spar= 1.049874 lambda= 0.008827582 (11 iterations)
## Equivalent Degrees of Freedom (Df): 7.000747
## Penalized Criterion (RSS): 10112.16
## GCV: 14.20355
pred_data <- broom::augment(model_sms)
ggplot(pred_data,aes(x=x,y=y))+
geom_point(alpha=0.55, color="black")+
geom_line(aes(y=.fitted),col="blue",
lty=1)+
xlab("age")+
ylab("triceps")+
theme_bw()
LOESS
model_loess <- loess(triceps~age,
data = triceps)
summary(model_loess)## Call:
## loess(formula = triceps ~ age, data = triceps)
##
## Number of Observations: 892
## Equivalent Number of Parameters: 4.6
## Residual Standard Error: 3.777
## Trace of smoother matrix: 5.01 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
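Fitted LOESS values at new ages can be obtained with predict(); a small sketch (the ages below are arbitrary illustrative values):
# predicted triceps thickness at a few illustrative ages
predict(model_loess, newdata = data.frame(age = c(5, 15, 30)))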
model_loess_span <- data.frame(span=seq(0.1,5,by=0.5)) %>%
group_by(span) %>%
do(broom::augment(loess(triceps~age,
data = triceps,span=.$span)))
p2 <- ggplot(model_loess_span,
aes(x=age,y=triceps))+
geom_line(aes(y=.fitted),
col="blue",
lty=1
)+
facet_wrap(~span)
p2library(ggplot2)
ggplot(triceps, aes(age,triceps)) +
geom_point(alpha=0.5,color="black") +
stat_smooth(method='loess',
formula=y~x,
span = 0.75,
col="blue",
lty=1,
se=F)set.seed(123)
cross_val <- vfold_cv(triceps,v=10,strata = "triceps")
span <- seq(0.1,1,length.out=50)
best_loess <- map_dfr(span, function(i){
metric_loess <- map_dfr(cross_val$splits,
function(x){
mod <- loess(triceps ~ age,span = i,
data=triceps[x$in_id,])
pred <- predict(mod,
newdata=triceps[-x$in_id,])
truth <- triceps[-x$in_id,]$triceps
data_eval <- na.omit(data.frame(pred=pred,
truth=truth))
rmse <- mlr3measures::rmse(truth = data_eval$truth,
response = data_eval$pred
)
mae <- mlr3measures::mae(truth = data_eval$truth,
response = data_eval$pred
)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
head(metric_loess,20)
# compute the average metrics across the 10 folds
mean_metric_loess <- colMeans(metric_loess)
mean_metric_loess
}
)
best_loess <- cbind(span=span,best_loess)
# display the results for all span values
best_loess## span rmse mae
## 1 0.1000000 3.773822 2.495752
## 2 0.1183673 3.771342 2.493506
## 3 0.1367347 3.766450 2.487557
## 4 0.1551020 3.757385 2.480571
## 5 0.1734694 3.749539 2.473494
## 6 0.1918367 3.746288 2.470915
## 7 0.2102041 3.742741 2.467990
## 8 0.2285714 3.740397 2.465388
## 9 0.2469388 3.737724 2.463311
## 10 0.2653061 3.737105 2.462252
## 11 0.2836735 3.735645 2.460514
## 12 0.3020408 3.734522 2.459225
## 13 0.3204082 3.733298 2.457694
## 14 0.3387755 3.732538 2.456280
## 15 0.3571429 3.732082 2.455452
## 16 0.3755102 3.731261 2.454642
## 17 0.3938776 3.730882 2.454293
## 18 0.4122449 3.730585 2.454291
## 19 0.4306122 3.730351 2.454869
## 20 0.4489796 3.730717 2.456222
## 21 0.4673469 3.730831 2.456768
## 22 0.4857143 3.731387 2.458307
## 23 0.5040816 3.732091 2.460189
## 24 0.5224490 3.732392 2.461811
## 25 0.5408163 3.733609 2.464543
## 26 0.5591837 3.734393 2.467180
## 27 0.5775510 3.735554 2.470517
## 28 0.5959184 3.736679 2.473012
## 29 0.6142857 3.738227 2.476519
## 30 0.6326531 3.741401 2.476887
## 31 0.6510204 3.742660 2.479842
## 32 0.6693878 3.744336 2.483581
## 33 0.6877551 3.746651 2.488792
## 34 0.7061224 3.748708 2.492180
## 35 0.7244898 3.750734 2.495218
## 36 0.7428571 3.752815 2.498359
## 37 0.7612245 3.753998 2.504175
## 38 0.7795918 3.755981 2.511421
## 39 0.7979592 3.756794 2.514375
## 40 0.8163265 3.759456 2.522994
## 41 0.8346939 3.767892 2.543250
## 42 0.8530612 3.777168 2.557863
## 43 0.8714286 3.782407 2.566972
## 44 0.8897959 3.788428 2.578042
## 45 0.9081633 3.797002 2.595698
## 46 0.9265306 3.803259 2.609852
## 47 0.9448980 3.809478 2.623386
## 48 0.9632653 3.824192 2.655691
## 49 0.9816327 3.835596 2.679155
## 50 1.0000000 3.860514 2.718911
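The span used in the plot below, 0.4306122, is the one with the smallest cross-validated RMSE in the table above (row 19). It can be pulled out programmatically rather than read off by eye, for example:
# row of the span grid with the smallest cross-validated RMSE
best_loess[which.min(best_loess$rmse), ]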
library(ggplot2)
ggplot(triceps, aes(age,triceps)) +
geom_point(alpha=0.5,color="black") +
stat_smooth(method='loess',
formula=y~x,
span = 0.4306122,
col="blue",
lty=1,
se=F)4. PRACTICE DATA
AutoData = Auto %>% select(mpg, horsepower, origin)
tibble(AutoData)## # A tibble: 392 x 3
## mpg horsepower origin
## <dbl> <dbl> <dbl>
## 1 18 130 1
## 2 15 165 1
## 3 18 150 1
## 4 16 150 1
## 5 17 140 1
## 6 15 198 1
## 7 14 220 1
## 8 14 215 1
## 9 14 225 1
## 10 15 190 1
## # ... with 382 more rows
Plot Data
plot(AutoData$mpg, AutoData$horsepower,col="yellow") The scatterplot above shows that the relationship between mpg and horsepower is not linear: the point cloud follows a curved, decreasing pattern rather than a straight line.
4.1 Linear Regression
lin.mod2 <-lm(AutoData$horsepower~AutoData$mpg)
plot(AutoData$mpg,AutoData$horsepower)
lines(AutoData$mpg,lin.mod2$fitted.values,col="red") The linear fit, shown as the red line, does not follow the pattern of the data.
summary(lin.mod2)##
## Call:
## lm(formula = AutoData$horsepower ~ AutoData$mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.892 -15.716 -2.094 13.108 96.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.4756 3.8732 50.21 <2e-16 ***
## AutoData$mpg -3.8389 0.1568 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.19 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
4.2 Order-2 Polynomial
pol.mod2 <- lm(AutoData$horsepower~AutoData$mpg+I(AutoData$mpg^2)) # order 2
ix2 <- sort(AutoData$mpg,index.return=T)$ix
plot(AutoData$mpg,AutoData$horsepower)
lines(AutoData$mpg[ix2], pol.mod2$fitted.values[ix2],col="blue", cex=2) The order-2 polynomial fit, shown in blue, follows the pattern of the data well.
summary(pol.mod2)##
## Call:
## lm(formula = AutoData$horsepower ~ AutoData$mpg + I(AutoData$mpg^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.52 -11.24 -0.11 9.47 92.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 302.06824 9.25689 32.63 <2e-16 ***
## AutoData$mpg -13.32406 0.77456 -17.20 <2e-16 ***
## I(AutoData$mpg^2) 0.18804 0.01513 12.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.49 on 389 degrees of freedom
## Multiple R-squared: 0.718, Adjusted R-squared: 0.7165
## F-statistic: 495.1 on 2 and 389 DF, p-value: < 2.2e-16
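An equivalent way to fit this model, consistent with the poly() notation used later in the model-comparison plot, is the following sketch (the object name pol.mod2_alt is ours):
# raw (non-orthogonal) degree-2 polynomial; same fitted values as pol.mod2
pol.mod2_alt <- lm(horsepower ~ poly(mpg, 2, raw = TRUE), data = AutoData)
AIC(pol.mod2, pol.mod2_alt) # identical fits give identical AIC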
4.3 Step Function
range(AutoData$mpg)## [1] 9.0 46.6
d1 <- as.factor(ifelse(AutoData$mpg<=20,1,0))
d2 <- as.factor(ifelse(AutoData$mpg<=30 & AutoData$mpg>20,1,0))
d3 <- as.factor(ifelse(AutoData$mpg>30,1,0))
head(data.frame(AutoData$horsepower,d1,d2,d3), 15)## AutoData.horsepower d1 d2 d3
## 1 130 1 0 0
## 2 165 1 0 0
## 3 150 1 0 0
## 4 150 1 0 0
## 5 140 1 0 0
## 6 198 1 0 0
## 7 220 1 0 0
## 8 215 1 0 0
## 9 225 1 0 0
## 10 190 1 0 0
## 11 170 1 0 0
## 12 160 1 0 0
## 13 150 1 0 0
## 14 225 1 0 0
## 15 95 0 1 0
step.mod2 <- lm(AutoData$horsepower~d1+d2+d3)
plot(AutoData$mpg,AutoData$horsepower)
lines(AutoData$mpg,lin.mod2$fitted.values,col="red")
lines(AutoData$mpg[ix2], step.mod2$fitted.values[ix2],col="blue") The step-function fit with 3 bins, shown as the blue line, roughly follows the pattern of the data. Because every observation falls into exactly one of d1, d2, or d3, the three dummies are perfectly collinear with the intercept, which is why one coefficient (d31) is reported as NA in the summary below.
summary(step.mod2)##
## Call:
## lm(formula = AutoData$horsepower ~ d1 + d2 + d3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.450 -12.966 0.034 12.550 92.550
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.518 2.886 24.431 < 2e-16 ***
## d11 66.932 3.557 18.816 < 2e-16 ***
## d21 17.448 3.602 4.844 1.84e-06 ***
## d31 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.3 on 389 degrees of freedom
## Multiple R-squared: 0.5356, Adjusted R-squared: 0.5332
## F-statistic: 224.3 on 2 and 389 DF, p-value: < 2.2e-16
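The same step function can be written more compactly with cut(), which produces a single factor and avoids the redundant third dummy; a sketch with break points mirroring d1-d3 above (object names are ours):
# one 3-level factor instead of three separate dummies
mpg_bin <- cut(AutoData$mpg, breaks = c(-Inf, 20, 30, Inf),
               labels = c("<=20", "(20,30]", ">30"))
step.mod2_cut <- lm(AutoData$horsepower ~ mpg_bin)
# same fitted step function as step.mod2, with no NA coefficient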
4.4 Comparing AIC Values
nilai_AIC <- rbind(AIC(lin.mod2),
AIC(pol.mod2),
AIC(step.mod2))
nama_model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)")
data.frame(nama_model,nilai_AIC)## nama_model nilai_AIC
## 1 Linear 3614.324
## 2 Poly (ordo=2) 3485.217
## 3 Tangga (breaks=3) 3680.688
Based on the output above, the order-2 polynomial model has the smallest AIC (3485.217), which indicates that, among these three models, it represents the pattern in the data best.
plot(AutoData$mpg,AutoData$horsepower)
lines(AutoData$mpg,lin.mod2$fitted.values,col="red")
lines(AutoData$mpg[ix2], pol.mod2$fitted.values[ix2],col="blue", cex=2)
lines(AutoData$mpg[ix2], step.mod2$fitted.values[ix2],col="yellow") Visually, the line that follows the pattern of the data most closely is the blue one, i.e. the order-2 polynomial fit.
4.5 Comparing MSE Values
MSE2 = function(pred,actual){
mean((pred-actual)^2)
}
MSE2(predict(lin.mod2),AutoData$horsepower)## [1] 582.3257
MSE2(predict(pol.mod2),AutoData$horsepower)## [1] 416.7871
MSE2(predict(step.mod2),AutoData$horsepower)## [1] 686.2376
nilai_MSE <- rbind(MSE2(predict(lin.mod2), AutoData$horsepower),
                   MSE2(predict(pol.mod2), AutoData$horsepower),
                   MSE2(predict(step.mod2), AutoData$horsepower))
nama_model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)")
data.frame(nama_model,nilai_MSE)## nama_model nilai_MSE
## 1 Linear 582.3257
## 2 Poly (ordo=2) 416.7871
## 3 Tangga (breaks=3) 686.2376
Based on the output above, the order-2 polynomial model has the smallest MSE (416.7871), so among these three models the order-2 polynomial is the best.
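These are in-sample (training) MSEs, which tend to favour the more flexible model. A held-out comparison could be sketched with the rsample package that is already loaded, for example:
# sketch: 80/20 train-test split and out-of-sample MSE for the polynomial model
set.seed(123)
split_auto <- rsample::initial_split(AutoData, prop = 0.8)
train_auto <- rsample::training(split_auto)
test_auto  <- rsample::testing(split_auto)
pol_train  <- lm(horsepower ~ mpg + I(mpg^2), data = train_auto)
MSE2(predict(pol_train, newdata = test_auto), test_auto$horsepower)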
4.6 Spline Regression
knot_value_manual_auto = c(10, 20, 40)
ggplot(AutoData,aes(x=horsepower, y=mpg)) +
  geom_point(alpha=0.55, color="black") +
  stat_smooth(method = "lm",
              formula = y~splines::bs(x, knots = knot_value_manual_auto),
              lty = 1,se = T)
Note that the knots c(10, 20, 40) lie below the observed range of horsepower (roughly 46 to 230), so the spline basis is rank-deficient here; the consequences are visible in the natural-spline fit in 4.7.
4.7 Natural Spline
knot_value_manual_auto = c(10, 20,40)
mod_spline_ns_auto = lm(mpg ~ ns(horsepower, knots = knot_value_manual_auto),data=AutoData)
summary(mod_spline_ns_auto)##
## Call:
## lm(formula = mpg ~ ns(horsepower, knots = knot_value_manual_auto),
## data = AutoData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 10.0857 0.5992 16.83
## ns(horsepower, knots = knot_value_manual_auto)1 32.2705 1.3177 24.49
## ns(horsepower, knots = knot_value_manual_auto)2 NA NA NA
## ns(horsepower, knots = knot_value_manual_auto)3 NA NA NA
## ns(horsepower, knots = knot_value_manual_auto)4 NA NA NA
## Pr(>|t|)
## (Intercept) <2e-16 ***
## ns(horsepower, knots = knot_value_manual_auto)1 <2e-16 ***
## ns(horsepower, knots = knot_value_manual_auto)2 NA
## ns(horsepower, knots = knot_value_manual_auto)3 NA
## ns(horsepower, knots = knot_value_manual_auto)4 NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
ggplot(AutoData,aes(x=horsepower, y=mpg)) +
geom_point(alpha=0.55, color="black")+
stat_smooth(method = "lm",
formula = y~ns(x, knots = knot_value_manual_auto),
lty = 1,se=T)## Warning in predict.lm(model, newdata = new_data_frame(list(x = xseq)), se.fit =
## se, : prediction from a rank-deficient fit may be misleading
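The NA coefficients and the rank-deficiency warning arise because the knots c(10, 20, 40) all lie below the observed range of horsepower (about 46 to 230). A sketch with knots placed inside the data, for instance at the quartiles of horsepower, avoids the problem (object names are ours):
# knots inside the range of horsepower; quartiles are one common choice
knot_value_hp <- quantile(AutoData$horsepower, probs = c(0.25, 0.50, 0.75))
mod_ns_hp <- lm(mpg ~ splines::ns(horsepower, knots = knot_value_hp), data = AutoData)
ggplot(AutoData, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.55, color = "black") +
  stat_smooth(method = "lm",
              formula = y ~ splines::ns(x, knots = knot_value_hp),
              lty = 1, se = T)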
4.8 Smoothing Spline
model_sms_auto <- with(data = AutoData,smooth.spline(horsepower,mpg))
model_sms_auto## Call:
## smooth.spline(x = horsepower, y = mpg)
##
## Smoothing Parameter spar= 0.2648644 lambda= 5.101567e-07 (12 iterations)
## Equivalent Degrees of Freedom (Df): 42.14987
## Penalized Criterion (RSS): 1117.657
## GCV: 18.64132
pred_data <- broom::augment(model_sms_auto)
ggplot(pred_data,aes(x=x,y=y))+
geom_point(alpha=0.55, color="black")+
geom_line(aes(y=.fitted),col="blue",
lty=1)+
xlab("horsepower")+
ylab("mpg")+
theme_bw() The effect of lambda on the smoothness of the curve can be seen below:
model_sms_lambda_auto <- data.frame(lambda=seq(0,5,by=0.5)) %>%
group_by(lambda) %>%
do(broom::augment(with(data = AutoData,smooth.spline(horsepower,mpg,lambda = .$lambda))))
p <- ggplot(model_sms_lambda_auto,
aes(x=x,y=y))+
geom_line(aes(y=.fitted),
col="blue",
lty=1
)+
facet_wrap(~lambda)
p Fixing df = 7 gives a much smoother curve that still captures the overall trend:
model_sms_auto <- with(data = AutoData,smooth.spline(horsepower,mpg,df=7))
model_sms_auto## Call:
## smooth.spline(x = horsepower, y = mpg, df = 7)
##
## Smoothing Parameter spar= 0.7927002 lambda= 0.003320238 (12 iterations)
## Equivalent Degrees of Freedom (Df): 6.999082
## Penalized Criterion (RSS): 2424.819
## GCV: 18.84972
pred_data <- broom::augment(model_sms_auto)
ggplot(pred_data,aes(x=x,y=y))+
geom_point(alpha=0.55, color="black")+
geom_line(aes(y=.fitted),col="blue",
lty=1)+
xlab("horsepower")+
ylab("mpg")+
theme_bw()
4.9 LOESS
model_loess_auto <- loess(mpg~horsepower,
data = AutoData)
summary(model_loess_auto)## Call:
## loess(formula = mpg ~ horsepower, data = AutoData)
##
## Number of Observations: 392
## Equivalent Number of Parameters: 4.97
## Residual Standard Error: 4.32
## Trace of smoother matrix: 5.43 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
model_loess_span_auto <- data.frame(span=seq(0.1,5,by=0.5)) %>%
group_by(span) %>%
do(broom::augment(loess(mpg~horsepower,
data = AutoData,span=.$span)))## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 89
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
p2 <- ggplot(model_loess_span_auto,
aes(x=horsepower,y=mpg))+
geom_line(aes(y=.fitted),
col="blue",
lty=1
)+
facet_wrap(~span)
p2library(ggplot2)
ggplot(AutoData, aes(horsepower,mpg)) +
geom_point(alpha=0.5,color="black") +
stat_smooth(method='loess',
formula=y~x,
span = 0.75,
col="blue",
lty=1,
se=T)
4.10 Model Comparison
# define the data-driven knots (same df-based approach as used for the triceps data)
knot_value_pc_df_auto <- attr(splines::bs(AutoData$horsepower, df = 6), "knots")
ggplot(AutoData, aes(x=horsepower, y=mpg)) + geom_point(color="black") +
  stat_smooth(method="lm", formula=y~poly(x,7,raw=T), col="blue", lwd=1.2, se=F) +
  stat_smooth(method="lm", formula=y~cut(x,8), col="green", lwd=1.2, se=F) +
  stat_smooth(method="lm", formula=y~splines::bs(x, knots=knot_value_pc_df_auto), col="yellow", lwd=1.2, se=F) +
  stat_smooth(method="lm", formula=y~splines::ns(x, knots=knot_value_pc_df_auto), col="darkgreen", lwd=1.2, se=F) +
  stat_smooth(method="loess", formula=y~x, col="orange", lwd=1.2, se=F) +
  stat_smooth(method="lm", formula=y~x, col="red", lwd=1.2, se=F)
MSE = function(pred,actual){
mean((pred-actual)^2)
}
MSE_Autogab<- rbind(MSE(predict(lin.mod2),AutoData$mpg),
MSE(predict(pol.mod2),AutoData$mpg),
MSE(predict(step.mod2),AutoData$mpg),
MSE(predict(mod_spline_ns_auto),AutoData$mpg),
MSE(predict(model_loess_auto),AutoData$mpg))
Model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)","Natural Spline Regression", "Loess Function")
model.autogab <- data.frame(Model, MSE_Autogab)
knitr::kable(model.autogab)| Model | MSE_Autogab |
|---|---|
| Linear | 7987.55223 |
| Poly (ordo=2) | 8153.09082 |
| Tangga (breaks=3) | 7805.23895 |
| Natural Spline Regression | 23.94366 |
| Loess Function | 18.37932 |
Based on the output above, the LOESS model has the smallest MSE. Note, however, that lin.mod2, pol.mod2, and step.mod2 were fitted with horsepower as the response, while the natural-spline and LOESS models predict mpg; measuring all of them against AutoData$mpg therefore inflates the MSE of the first three models, so the table is not a like-for-like comparison.
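A like-for-like comparison would refit every model with mpg as the response before computing the MSEs; a minimal sketch (the model names below are ours):
# all models predict mpg from horsepower, so the in-sample MSEs are comparable
lin_mpg   <- lm(mpg ~ horsepower, data = AutoData)
pol_mpg   <- lm(mpg ~ horsepower + I(horsepower^2), data = AutoData)
step_mpg  <- lm(mpg ~ cut(horsepower, breaks = 3), data = AutoData)
MSE_mpg <- c(Linear    = MSE(predict(lin_mpg),  AutoData$mpg),
             Poly2     = MSE(predict(pol_mpg),  AutoData$mpg),
             Step3     = MSE(predict(step_mpg), AutoData$mpg),
             NatSpline = MSE(predict(mod_spline_ns_auto), AutoData$mpg),
             LOESS     = MSE(predict(model_loess_auto),   AutoData$mpg))
MSE_mpg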