Non-Linear Regression Practice
Pengantar Sains Data Kelas Paralel 1
Herdian Partawijaya (G1401201022)
Packages
library(ISLR)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(purrr)
library(rsample)
library(splines)
Non-linear Regression Week 8
Simulated Data
1. Linear Regression
An independent variable (x) is generated as 1000 random draws from a normal distribution with mean mu = 1 and variance 1. The response (y) is then obtained from a non-linear equation with beta0 = 0.5, beta1 = 3, and beta2 = 1.5, i.e. y = 0.5 + 3x + 1.5x^2 + e, where the errors e are another 1000 normally distributed draws.
set.seed(123)
datax <- rnorm(1000,1,1)
e <- rnorm(1000)
y <- 0.5+3*datax+1.5*datax^2+e
plot(datax,y)
linear <- lm(y~datax)
lines(datax, linear$fitted.values, col="red")
summary(linear)
##
## Call:
## lm(formula = y ~ datax)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2121 -1.4314 -0.5609 0.8585 13.3717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.27929 0.10270 2.719 0.00665 **
## datax 6.23351 0.07235 86.155 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.268 on 998 degrees of freedom
## Multiple R-squared: 0.8815, Adjusted R-squared: 0.8814
## F-statistic: 7423 on 1 and 998 DF, p-value: < 2.2e-16
The scatter plot shows that the data are not linear when compared with the red linear regression line. The fitted linear equation is estimated_mean_y = 0.27929 + 6.23351(x).
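A quick way to confirm the curvature is a residual plot; a minimal sketch (not part of the original output), reusing the linear fit from above:
# Residuals vs fitted values for the linear fit; a pronounced U-shape
# signals the quadratic term the model is missing.
plot(linear$fitted.values, linear$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)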
2. Polynomial Regression
Since the data are not linear, plain linear regression is inappropriate. We can instead use one of several non-linear regression methods, one of which is polynomial regression.
polym <- lm(y~datax+I(datax^2))
ix <- sort(datax,index.return=T)$ix # indices that order datax from smallest to largest, so the fitted curve is drawn left to right
plot(datax,y,xlim=c(-2,5), ylim=c(-10,70))
lines(datax[ix],polym$fitted.values[ix], col="yellow",lwd=2)
summary(polym)
##
## Call:
## lm(formula = y ~ datax + I(datax^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0319 -0.6942 0.0049 0.7116 3.2855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.45193 0.04568 9.894 <2e-16 ***
## datax 3.10732 0.05861 53.017 <2e-16 ***
## I(datax^2) 1.49081 0.02338 63.769 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 997 degrees of freedom
## Multiple R-squared: 0.9767, Adjusted R-squared: 0.9766
## F-statistic: 2.086e+04 on 2 and 997 DF, p-value: < 2.2e-16
The scatter plot matches the polynomial regression line (yellow) well. The fitted polynomial equation is estimated_mean_y = 0.45193 + 3.10732(x) + 1.49081(x^2). Orders 3 and 4 were also tried, and order 2 remained the best choice; since raw R-squared can never decrease as terms are added, adjusted R-squared is the fairer comparison (a quick version is sketched below).
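A minimal sketch (not in the original output) of that order comparison, reusing datax and y from above:
# Refit orders 2-4 and compare adjusted R-squared; the cubic and quartic
# terms should add essentially nothing, since y was generated from a quadratic.
fits <- lapply(2:4, function(d) lm(y ~ poly(datax, d, raw = TRUE)))
sapply(fits, function(m) summary(m)$adj.r.squared)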
3. Step Function
range(datax)
## [1] -1.809775 4.241040
c1 <- as.factor(ifelse(datax<=0,1,0))
c2 <- as.factor(ifelse(datax<=2 & datax>0,1,0))
c3 <- as.factor(ifelse(datax>2,1,0))
cbaru <- data.frame(c1,c2,c3)
head(cbaru) # preview the indicator columns instead of printing all 1000 rows
##   c1 c2 c3
## 1  0  1  0
## 2  0  1  0
## 3  0  0  1
## 4  0  1  0
## 5  0  1  0
## 6  0  0  1
4. Model Comparison Plot
tangga <- lm(y~c1+c2+c3)
plot(datax,y,xlim=c(-2,5), ylim=c(-10,70))
lines(datax[ix],tangga$fitted.values[ix],col="blue",lwd=2)
lines(datax[ix],polym$fitted.values[ix], col="yellow",lwd=2)
lines(datax,linear$fitted.values,col="red")
5. Comparing Model Fit
nilai_AIC <- rbind(AIC(linear),
AIC(polym),
AIC(tangga))
nama_model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)")
data.frame(nama_model,nilai_AIC)
## nama_model nilai_AIC
## 1 Linear 4479.528
## 2 Poly (ordo=2) 2856.470
## 3 Tangga (breaks=3) 5392.009
MSE <- function(pred,actual){
mean((pred-actual)^2)
}
MSE(predict(linear),y)
## [1] 5.132797
MSE(predict(polym),y)
## [1] 1.010649
MSE(predict(tangga),y)
## [1] 12.75766
The best regression model is the order-2 polynomial regression, since its AIC and MSE are the smallest of the three.
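As a sanity check, R's AIC for an lm fit can be reproduced by hand from the residual sum of squares; a sketch using the polym fit and y from above:
# AIC(lm) = -2*logLik + 2*df, where df counts the coefficients plus the
# error variance; logLik = -(n/2) * (log(2*pi) + log(RSS/n) + 1).
n   <- length(y)
rss <- sum(residuals(polym)^2)
df  <- length(coef(polym)) + 1   # 3 betas + sigma^2
n * (log(2 * pi) + log(rss / n) + 1) + 2 * df   # should match AIC(polym)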
The Auto Data from the ISLR Package
1. Data
The data used are the Auto data, a vehicle dataset taken from the ISLR package. Auto consists of 392 observations and 9 variables.
library(ISLR)
data.auto <- Auto
data.auto <- data.auto[,-c(7,8,9)] #drop the year, origin, and name variables
head(data.auto)
## mpg cylinders displacement horsepower weight acceleration
## 1 18 8 307 130 3504 12.0
## 2 15 8 350 165 3693 11.5
## 3 18 8 318 150 3436 11.0
## 4 16 8 304 150 3433 12.0
## 5 17 8 302 140 3449 10.5
## 6 15 8 429 198 4341 10.0
2. Pairwise Plot of the Variables
plot(data.auto)
3. Regression Plot: Mpg vs Horsepower
plot(data.auto$mpg, data.auto$horsepower, col="black", lwd=1, type="p",xlab="mpg",ylab="Horsepower",main="Scatter Plot Mpg dan Horsepower")
abline(lm(data.auto$horsepower ~ data.auto$mpg), col = "red", lwd = 2)
It is now clearer that mpg and horsepower have a negative, non-linear relationship.
4. Linear Model Summary
linear.auto <- lm(data.auto$horsepower~data.auto$mpg)
summary(linear.auto)
##
## Call:
## lm(formula = data.auto$horsepower ~ data.auto$mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.892 -15.716 -2.094 13.108 96.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.4756 3.8732 50.21 <2e-16 ***
## data.auto$mpg -3.8389 0.1568 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.19 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
aic.lin.auto <- AIC(linear.auto); aic.lin.auto
## [1] 3614.324
The fitted linear model is estimated_mean_y = 194.4756 - 3.8389(mpg): the estimated mean horsepower drops by 3.8389 units for each one-unit increase in mpg, even though the actual relationship between the two variables is non-linear and negative. As mpg increases, horsepower decreases; in other words, more powerful cars consume fuel less efficiently.
5. Order-2 Polynomial Regression
polym.2 = lm(horsepower ~ poly(mpg,2,raw = T),
data=data.auto)
summary(polym.2)
##
## Call:
## lm(formula = horsepower ~ poly(mpg, 2, raw = T), data = data.auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.52 -11.24 -0.11 9.47 92.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 302.06824 9.25689 32.63 <2e-16 ***
## poly(mpg, 2, raw = T)1 -13.32406 0.77456 -17.20 <2e-16 ***
## poly(mpg, 2, raw = T)2 0.18804 0.01513 12.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.49 on 389 degrees of freedom
## Multiple R-squared: 0.718, Adjusted R-squared: 0.7165
## F-statistic: 495.1 on 2 and 389 DF, p-value: < 2.2e-16
aic.polym2 <- AIC(polym.2); aic.polym2
## [1] 3485.217
ggplot(data.auto,aes(x=mpg, y=horsepower)) +
geom_point(color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,2,raw=T),
lty = 1, col = "yellow", se=F)+
labs(title = "Plot Regresi Polynomial Ordo 2")+
theme(plot.title=element_text(hjust = 0.5))
6. Order-3 Polynomial Regression
polym.3 = lm(horsepower ~ poly(mpg,3,raw = T),
data=data.auto)
summary(polym.3)
##
## Call:
## lm(formula = horsepower ~ poly(mpg, 3, raw = T), data = data.auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.935 -11.251 -1.338 9.324 95.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 429.581461 23.823018 18.032 < 2e-16 ***
## poly(mpg, 3, raw = T)1 -30.131244 3.006538 -10.022 < 2e-16 ***
## poly(mpg, 3, raw = T)2 0.866793 0.118533 7.313 1.51e-12 ***
## poly(mpg, 3, raw = T)3 -0.008506 0.001474 -5.770 1.62e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.69 on 388 degrees of freedom
## Multiple R-squared: 0.7403, Adjusted R-squared: 0.7382
## F-statistic: 368.6 on 3 and 388 DF, p-value: < 2.2e-16
aic.polym3 <- AIC(polym.3); aic.polym3
## [1] 3454.949
ggplot(data.auto,aes(x=mpg, y=horsepower)) +
geom_point(color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,3,raw=T),
lty = 1, col = "yellow", se=F)+
labs(title = "Plot Regresi Polynomial Ordo 3")+
theme(plot.title=element_text(hjust = 0.5))
7. Order-4 Polynomial Regression
polym.4 = lm(horsepower ~ poly(mpg,4,raw = T),
data=data.auto)
summary(polym.4)
##
## Call:
## lm(formula = horsepower ~ poly(mpg, 4, raw = T), data = data.auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.661 -11.465 -1.145 9.143 95.783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 450.665773 62.343875 7.229 2.62e-12 ***
## poly(mpg, 4, raw = T)1 -33.885837 10.689919 -3.170 0.00165 **
## poly(mpg, 4, raw = T)2 1.100819 0.650271 1.693 0.09129 .
## poly(mpg, 4, raw = T)3 -0.014589 0.016684 -0.874 0.38244
## poly(mpg, 4, raw = T)4 0.000056 0.000153 0.366 0.71454
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.71 on 387 degrees of freedom
## Multiple R-squared: 0.7403, Adjusted R-squared: 0.7377
## F-statistic: 275.9 on 4 and 387 DF, p-value: < 2.2e-16
aic.polym4 <- AIC(polym.4); aic.polym4
## [1] 3456.813
ggplot(data.auto,aes(x=mpg, y=horsepower)) +
geom_point(color="black") +
stat_smooth(method = "lm",
formula = y~poly(x,4,raw=T),
lty = 1, col = "yellow", se=F)+
labs(title = "Plot Regresi Polynomial Ordo 4")+
theme(plot.title=element_text(hjust = 0.5))
8. Step Function (5 bins)
tangga5 = lm(horsepower ~ cut(mpg,5),data=data.auto)
summary(tangga5)
##
## Call:
## lm(formula = horsepower ~ cut(mpg, 5), data = data.auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.242 -11.378 -3.000 8.832 71.758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 158.242 2.260 70.03 <2e-16 ***
## cut(mpg, 5)(16.5,24] -55.242 2.942 -18.78 <2e-16 ***
## cut(mpg, 5)(24,31.6] -77.073 3.116 -24.74 <2e-16 ***
## cut(mpg, 5)(31.6,39.1] -85.971 3.603 -23.86 <2e-16 ***
## cut(mpg, 5)(39.1,46.6] -98.542 7.182 -13.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.56 on 387 degrees of freedom
## Multiple R-squared: 0.6896, Adjusted R-squared: 0.6863
## F-statistic: 214.9 on 4 and 387 DF, p-value: < 2.2e-16
aic.tangga5 <- AIC(tangga5); aic.tangga5
## [1] 3526.845
ggplot(data.auto,aes(x=mpg, y=horsepower)) +
geom_point(color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,5),
col = "blue",se = F)+
theme_bw()+
labs(title = "Plot Regresi Fungsi Tangga Cut 5")+
theme(plot.title=element_text(hjust = 0.5))
9. Step Function (7 bins)
tangga7 = lm(horsepower ~ cut(mpg,7),data=data.auto)
summary(tangga7)
##
## Call:
## lm(formula = horsepower ~ cut(mpg, 7), data = data.auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.48 -12.96 -2.25 10.31 105.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 169.692 2.960 57.32 <2e-16 ***
## cut(mpg, 7)(14.4,19.7] -45.208 3.669 -12.32 <2e-16 ***
## cut(mpg, 7)(19.7,25.1] -74.908 3.734 -20.06 <2e-16 ***
## cut(mpg, 7)(25.1,30.5] -88.317 3.885 -22.73 <2e-16 ***
## cut(mpg, 7)(30.5,35.9] -96.865 4.186 -23.14 <2e-16 ***
## cut(mpg, 7)(35.9,41.2] -100.442 5.268 -19.07 <2e-16 ***
## cut(mpg, 7)(41.2,46.6] -111.978 8.594 -13.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.35 on 385 degrees of freedom
## Multiple R-squared: 0.6972, Adjusted R-squared: 0.6924
## F-statistic: 147.7 on 6 and 385 DF, p-value: < 2.2e-16
aic.tangga7 <- AIC(tangga7); aic.tangga7
## [1] 3521.117
ggplot(data.auto,aes(x=mpg, y=horsepower)) +
geom_point(color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,7),
col = "blue",se = F)+
theme_bw()+
labs(title = "Plot Regresi Fungsi Tangga Cut 7")+
theme(plot.title=element_text(hjust = 0.5))
10. Step Function (9 bins)
tangga9 = lm(horsepower ~ cut(mpg,9),data=data.auto)
summary(tangga9)
##
## Call:
## lm(formula = horsepower ~ cut(mpg, 9), data = data.auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.288 -10.349 -1.355 7.581 81.712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 170.697 3.583 47.643 < 2e-16 ***
## cut(mpg, 9)(13.2,17.4] -22.409 4.388 -5.107 5.18e-07 ***
## cut(mpg, 9)(17.4,21.5] -65.348 4.236 -15.428 < 2e-16 ***
## cut(mpg, 9)(21.5,25.7] -78.256 4.474 -17.491 < 2e-16 ***
## cut(mpg, 9)(25.7,29.9] -88.964 4.461 -19.944 < 2e-16 ***
## cut(mpg, 9)(29.9,34.1] -97.472 4.635 -21.030 < 2e-16 ***
## cut(mpg, 9)(34.1,38.2] -100.342 5.148 -19.491 < 2e-16 ***
## cut(mpg, 9)(38.2,42.4] -104.097 9.877 -10.539 < 2e-16 ***
## cut(mpg, 9)(42.4,46.6] -116.030 9.135 -12.702 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.58 on 383 degrees of freedom
## Multiple R-squared: 0.7199, Adjusted R-squared: 0.7141
## F-statistic: 123.1 on 8 and 383 DF, p-value: < 2.2e-16
aic.tangga9 <- AIC(tangga9); aic.tangga9
## [1] 3494.485
ggplot(data.auto,aes(x=mpg, y=horsepower)) +
geom_point(color="black") +
stat_smooth(method = "lm",
formula = y~cut(x,9),
col = "blue",se = F)+
theme_bw()+
labs(title = "Plot Regresi Fungsi Tangga Cut 9")+
theme(plot.title=element_text(hjust = 0.5))
11. Model Comparison by MSE
MSE = function(pred,actual){
mean((pred-actual)^2)
}
nilai_MSE <- rbind(MSE(predict(linear.auto),data.auto$horsepower),
MSE(predict(polym.2),data.auto$horsepower),
MSE(predict(polym.3),data.auto$horsepower),
MSE(predict(polym.4),data.auto$horsepower),
MSE(predict(tangga5),data.auto$horsepower),
MSE(predict(tangga7),data.auto$horsepower),
MSE(predict(tangga9),data.auto$horsepower))
nama_model <- c("Linear","Poly (ordo=2)", "Poly (ordo=3)", "Poly (ordo=4)",
"Tangga (breaks=5)", "Tangga (breaks=7)", "Tangga (breaks=9)")
data.frame(nama_model,nilai_MSE)
## nama_model nilai_MSE
## 1 Linear 582.3257
## 2 Poly (ordo=2) 416.7871
## 3 Poly (ordo=3) 383.8523
## 4 Poly (ordo=4) 383.7195
## 5 Tangga (breaks=5) 458.7770
## 6 Tangga (breaks=7) 447.5318
## 7 Tangga (breaks=9) 413.8924
Judged by MSE, the best regression model is the order-4 polynomial, with the smallest MSE of 383.7195. Exploratively, the order-4 regression line does follow the scatter plot most closely, which is why its error is smallest. Keep in mind that this is training MSE, which always favors the more flexible model.
12. Model Comparison by R-square
nilai_Rsquare <- rbind(summary(linear.auto)$r.square,
summary(polym.2)$r.square,
summary(polym.3)$r.square,
summary(polym.4)$r.square,
summary(tangga5)$r.square,
summary(tangga7)$r.square,
summary(tangga9)$r.square)
nama_model <- c("Linear","Poly (ordo=2)", "Poly (ordo=3)", "Poly (ordo=4)",
"Tangga (breaks=5)", "Tangga (breaks=7)", "Tangga (breaks=9)")
data.frame(nama_model,nilai_Rsquare)
## nama_model nilai_Rsquare
## 1 Linear 0.6059483
## 2 Poly (ordo=2) 0.7179659
## 3 Poly (ordo=3) 0.7402524
## 4 Poly (ordo=4) 0.7403423
## 5 Tangga (breaks=5) 0.6895519
## 6 Tangga (breaks=7) 0.6971614
## 7 Tangga (breaks=9) 0.7199248
Judged by R-square, the best regression model is again the order-4 polynomial, with the largest R-square of 0.7403423. Note, however, that multiple R-squared can only increase as terms are added, so adjusted R-squared (sketched below) is the fairer comparison.
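A minimal sketch (not in the original output), reusing the fitted models from the sections above:
# Adjusted R-squared penalizes the extra polynomial terms, unlike raw R-squared.
sapply(list(linear = linear.auto, poly2 = polym.2,
            poly3 = polym.3, poly4 = polym.4),
       function(m) summary(m)$adj.r.squared)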
13. Model Evaluation with Cross Validation
Linear Model
set.seed(123)
cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
metric_linear <- map_dfr(cross_val$splits,
function(x){
mod <- lm(horsepower ~ mpg,data=data.auto[x$in_id,])
pred <- predict(mod,newdata=data.auto[-x$in_id,])
truth <- data.auto[-x$in_id,]$horsepower
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_linear
## # A tibble: 10 x 2
## rmse mae
## <dbl> <dbl>
## 1 25.3 20.3
## 2 29.3 22.2
## 3 22.4 16.9
## 4 26.8 19.9
## 5 23.4 18.9
## 6 24.9 20.6
## 7 18.1 15.0
## 8 20.3 13.6
## 9 22.6 18.2
## 10 26.9 18.3
# average over the 10 folds
mean_metric_linear <- colMeans(metric_linear)
mean_metric_linear
## rmse mae
## 24.00403 18.39118
Order-2 Polynomial Model
set.seed(123)
cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
metric_poly2 <- map_dfr(cross_val$splits,
function(x){
mod <- lm(horsepower ~ poly(mpg,2,raw = T),data=data.auto[x$in_id,])
pred <- predict(mod,newdata=data.auto[-x$in_id,])
truth <- data.auto[-x$in_id,]$horsepower
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly2
## # A tibble: 10 x 2
## rmse mae
## <dbl> <dbl>
## 1 23.1 18.9
## 2 25.1 17.2
## 3 18.4 14.3
## 4 22.1 15.1
## 5 21.5 15.6
## 6 19.7 14.8
## 7 15.6 12.2
## 8 16.5 12.3
## 9 18.9 14.6
## 10 22.5 13.8
# average over the 10 folds
mean_metric_poly2 <- colMeans(metric_poly2)
mean_metric_poly2
## rmse mae
## 20.35171 14.88587
Order-3 Polynomial Model
set.seed(123)
cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
metric_poly3 <- map_dfr(cross_val$splits,
function(x){
mod <- lm(horsepower ~ poly(mpg,3,raw = T),data=data.auto[x$in_id,])
pred <- predict(mod,newdata=data.auto[-x$in_id,])
truth <- data.auto[-x$in_id,]$horsepower
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly3
## # A tibble: 10 x 2
## rmse mae
## <dbl> <dbl>
## 1 20.9 17.0
## 2 24.1 15.8
## 3 17.2 13.2
## 4 22.5 15.2
## 5 21.3 15.6
## 6 18.6 14.7
## 7 15.8 12.0
## 8 16.2 11.9
## 9 17.4 14.2
## 10 22.0 13.5
# average over the 10 folds
mean_metric_poly3 <- colMeans(metric_poly3)
mean_metric_poly3
## rmse mae
## 19.59566 14.30442
Order-4 Polynomial Model
set.seed(123)
cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
metric_poly4 <- map_dfr(cross_val$splits,
function(x){
mod <- lm(horsepower ~ poly(mpg,4,raw = T),data=data.auto[x$in_id,])
pred <- predict(mod,newdata=data.auto[-x$in_id,])
truth <- data.auto[-x$in_id,]$horsepower
rmse <- mlr3measures::rmse(truth = truth,response = pred)
mae <- mlr3measures::mae(truth = truth,response = pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_poly4
## # A tibble: 10 x 2
## rmse mae
## <dbl> <dbl>
## 1 20.9 16.9
## 2 24.1 15.8
## 3 17.2 13.2
## 4 22.5 15.2
## 5 21.3 15.6
## 6 18.6 14.6
## 7 15.8 12.0
## 8 16.2 11.9
## 9 17.4 14.2
## 10 22.2 13.6
# average over the 10 folds
mean_metric_poly4 <- colMeans(metric_poly4)
mean_metric_poly4
## rmse mae
## 19.61659 14.29458
Step Function
set.seed(123)
cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
breaks <- 3:10
best_tangga <- map_dfr(breaks, function(i){
metric_tangga <- map_dfr(cross_val$splits,
function(x){
training <- data.auto[1:314,] # NB: a fixed split (rows 1-314) is reused for every fold here instead of x$in_id
training$mpg <- cut(training$mpg,i)
mod <- lm(horsepower ~ mpg,data=training)
labs_x <- levels(mod$model[,2]) # training factor levels of the form "(lower,upper]"
labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))
testing <- data.auto[-1:-314,]
mpg_new <- cut(testing$mpg,c(labs_x_breaks[1,1],labs_x_breaks[,2])) # re-cut the test mpg with the training break points
pred <- predict(mod,newdata=list(mpg=mpg_new))
truth <- testing$horsepower
data_eval <- na.omit(data.frame(truth,pred))
rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
metric <- c(rmse,mae)
names(metric) <- c("rmse","mae")
return(metric)
}
)
metric_tangga
# average over the 10 folds
mean_metric_tangga <- colMeans(metric_tangga)
mean_metric_tangga
}
)
best_tangga <- cbind(breaks=breaks,best_tangga)
# show the results for all breaks
best_tangga
## breaks rmse mae
## 1 3 18.64373 12.670291
## 2 4 13.63102 9.966213
## 3 5 15.47493 10.495278
## 4 6 16.53925 11.688612
## 5 7 15.63135 10.666628
## 6 8 15.91060 11.440924
## 7 9 15.43487 10.933407
## 8 10 15.76456 10.947742
# best by rmse
best_tangga %>% slice_min(rmse)
## breaks rmse mae
## 1 4 13.63102 9.966213
# best by mae
best_tangga %>% slice_min(mae)
## breaks rmse mae
## 1 4 13.63102 9.966213
Model Comparison from Cross Validation
nilai_metric <- rbind(mean_metric_linear,
mean_metric_poly2,
mean_metric_poly3,
mean_metric_poly4,
best_tangga %>% select(-1) %>% slice_min(mae))
nama_model <- c("Linear","Poly2","Poly3","Poly4","Tangga (breaks=4)")
data.frame(nama_model,nilai_metric)
## nama_model rmse mae
## 1 Linear 24.00403 18.391184
## 2 Poly2 20.35171 14.885874
## 3 Poly3 19.59566 14.304419
## 4 Poly4 19.61659 14.294577
## 5 Tangga (breaks=4) 13.63102 9.966213
From the cross-validation evaluation, the best model is the step function (breaks=4), with the smallest RMSE and MAE; the closer the predictions are to the actual test data, the smaller the error. One caveat: the step-function loop above trains on a fixed 314-row split rather than the fold indices, so its scores are not strictly comparable to the polynomial folds; a fold-based version is sketched below.
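A minimal fold-based sketch (an assumption about how the corrected loop might look, not the original code), reusing cross_val and data.auto from above:
# Train on each fold's rows (split$in_id), re-cut the held-out mpg with
# the training break points, and report the fold RMSE.
cv_tangga <- function(split, n_bins) {
  train <- data.auto[split$in_id, ]
  test  <- data.auto[-split$in_id, ]
  brks  <- seq(min(train$mpg), max(train$mpg), length.out = n_bins + 1)
  brks[c(1, n_bins + 1)] <- c(-Inf, Inf)  # cover test values outside the training range
  train$mpg_cut <- cut(train$mpg, brks)
  test$mpg_cut  <- cut(test$mpg, brks)
  mod <- lm(horsepower ~ mpg_cut, data = train)
  sqrt(mean((test$horsepower - predict(mod, newdata = test))^2))
}
mean(sapply(cross_val$splits, cv_tangga, n_bins = 4))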
Non-linear Regression Week 9
Simulated Data
set.seed(123)
datax <- rnorm(1000,1,1)
e <- rnorm(1000,0,5)
y <- 0.5+3*datax+1.5*datax^2+e
plot(datax,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="black")
abline(v=1, col="red", lty=2)
Piecewise cubic polynomial
dt.all <- cbind(y,datax)
##knots=1
dt1 <- dt.all[datax <=1,] #select the observations with x <= 1
dim(dt1)
## [1] 495 2
There are 495 observations with x <= 1.
dt2 <- dt.all[datax >1,] #select the observations with x > 1
dim(dt2)
## [1] 505 2
There are 505 observations with x > 1.
plot(datax,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="black")
cub.mod1 <- lm(dt1[,1]~dt1[,2]+I(dt1[,2]^2)+I(dt1[,2]^3))
ix <- sort(dt1[,2], index.return=T)$ix
lines(dt1[ix,2],cub.mod1$fitted.values[ix],col="orange", lwd=2)
cub.mod2 <- lm(dt2[,1]~dt2[,2]+I(dt2[,2]^2)+I(dt2[,2]^3))
ix <- sort(dt2[,2], index.return=T)$ix
lines(dt2[ix,2],cub.mod2$fitted.values[ix],col="orange", lwd=2)
There is a break at x = 1 because we fit separate cubic polynomials to the observations with x <= 1 and those with x > 1.
Truncated power basis
Because the fit has a discontinuity at x = 1, we now join the two pieces using a truncated power basis.
#using the truncated power basis
plot(datax,y,xlim=c( -2,5), ylim=c( -10,70), pch=16,col="orange")
abline(v=1,col="red", lty=2)
hx <- ifelse( datax>1,(datax-1)^3,0)
cubspline.mod <- lm( y~datax+I(datax^2)+I(datax^3)+hx)
ix <- sort( datax,index.return=T)$ix
lines( datax[ix], cubspline.mod$fitted.values[ix],col="green", lwd=2)
Comparison with k-fold CV
Next we consider two knots, at x=0 and x=2, so we will have 3 trend functions.
plot( datax,y,xlim=c(-2,5), ylim=c( -10,70), pch=16,col="orange")
abline(v=0,col="red", lty=2)
abline(v=2,col="red", lty=2)
hx1 <- ifelse( datax>0,(datax-0)^3,0)
hx2 <- ifelse( datax>2,(datax-2)^3,0)
cubspline.mod2 <- lm( y~datax+I(datax^2)+I(datax^3)+hx1+hx2)
ix <- sort( datax,index.return=T)$ix
lines( datax[ix],cubspline.mod2$fitted.values[ix],col="green", lwd=2)
Comparing 1 knot (x=1) and 2 knots (x=0, x=2) with 5-fold Cross Validation
##1 knot
data.all <- cbind( y,datax,hx)
set.seed(456)
ind <- sample(1:5,length( datax),replace=T)
res <- c()
for( i in 1:5){
dt.train <- data.all[ ind!=i,]
x.test <- data.all[ ind==i, -1]
y.test <- data.all[ ind==i,1]
mod1 <- lm( dt.train[,1]~dt.train[,2]+I( dt.train[,2]^2)+I( dt.train[,2]^3)+dt.train[,3])
x.test.olah <- cbind(1,x.test[,1], x.test[,1]^2,x.test[,1]^3,x.test[,2])
beta <- coefficients(mod1)
prediksi <- x.test.olah%*%beta
res <- c( res,mean(( y.test-prediksi)^2))
}
res
## [1] 22.08767 21.39598 29.47677 29.68683 25.18070
mean(res) # average test MSE over the 5 folds
## [1] 25.56559
##2 knot (knot = 0, 2)
data.all2 <- cbind(y,datax,hx1,hx2)
set.seed(456)
ind2 <- sample(1:5,length( datax),replace=T)
res2 <- c()
for( i in 1:5){
dt.train2 <- data.all2[ind2!=i,]
x.test2 <- data.all2[ind2==i,-1]
y.test2 <- data.all2[ind2==i,1]
mod2 <- lm(dt.train2[,1]~dt.train2[,2]+I(dt.train2[,2]^2)+I(dt.train2[,2]^3)+dt.train2[,3]+dt.train2[,4])
x.test.olah2 <- cbind(1,x.test2[,1],x.test2[,1]^2,x.test2[,1]^3,x.test2[,2],x.test2[,3])
beta2 <- coefficients(mod2)
prediksi2 <- x.test.olah2%*%beta2
res2 <- c(res2,mean((y.test2-prediksi2)^2))
}
res2
## [1] 22.32898 21.33868 29.40459 29.79913 25.22047
mean(res2) # average test MSE over the 5 folds
## [1] 25.61837
The 1-knot model is better than the 2-knot model, because its average cross-validated MSE is smaller.
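The same one-knot cubic spline can be written more compactly with splines::bs(); a sketch (an equivalent basis, not the original code), reusing datax, y, and cubspline.mod from above:
# A cubic B-spline basis with one interior knot at x = 1 spans the same
# function space as {x, x^2, x^3, (x-1)^3_+}, so the fits coincide.
bs.mod <- lm(y ~ bs(datax, knots = 1, degree = 3))
max(abs(fitted(bs.mod) - fitted(cubspline.mod)))  # ~0 up to rounding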
Smoothing Splines
ss1 <- smooth.spline(datax,y,all.knots = T); ss1
## Call:
## smooth.spline(x = datax, y = y, all.knots = T)
##
## Smoothing Parameter spar= 1.499938 lambda= 0.0002745903 (26 iterations)
## Equivalent Degrees of Freedom (Df): 14.84845
## Penalized Criterion (RSS): 24848.74
## GCV: 121.2927
plot(datax,y,xlim=c(-2,5), ylim=c(-10,70), pch=16,col="orange")
lines(ss1,col="black", lwd=2)
The smooth.spline function automatically selects the best smoothing parameter via cross validation or generalized cross validation. The best parameter obtained from the call above is lambda = 0.000275.
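By default smooth.spline() picks lambda by generalized cross validation; setting cv = TRUE switches to ordinary leave-one-out CV. A minimal sketch:
# Ordinary leave-one-out CV instead of GCV; with continuous simulated x
# there are no ties, so no warning is expected.
ss.cv <- smooth.spline(datax, y, cv = TRUE)
ss.cv$lambda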
The Auto Data from the ISLR Package
Scatter Plot
ggplot(data.auto,aes(x=mpg, y=horsepower)) +
geom_point(alpha=0.55, color="black") +
theme_bw()
Regression Splines
#Determine the number of basis functions and knots
dim(bs(data.auto$mpg, knots = c(15, 25,35)))
## [1] 392 6
#knot values chosen automatically by the computer
attr(bs(data.auto$mpg, df=6),"knots")
## 25% 50% 75%
## 17.00 22.75 29.00
With automatic knot selection we get 3 knots placed at the 25%, 50%, and 75% quantiles of the data: knot 1 at x=17, knot 2 at x=22.75, and knot 3 at x=29.
B-Splines
Using the knots chosen automatically above
knot_value_pc_df_6 = attr(bs(data.auto$mpg, df=6),"knots")
mod_spline_1 = lm(horsepower ~ bs(mpg, knots =knot_value_pc_df_6 ),data=data.auto)
summary(mod_spline_1)
##
## Call:
## lm(formula = horsepower ~ bs(mpg, knots = knot_value_pc_df_6),
## data = data.auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.043 -10.103 -0.818 8.075 95.778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 195.012 13.181 14.795 < 2e-16 ***
## bs(mpg, knots = knot_value_pc_df_6)1 2.792 18.879 0.148 0.882
## bs(mpg, knots = knot_value_pc_df_6)2 -80.488 12.649 -6.363 5.62e-10 ***
## bs(mpg, knots = knot_value_pc_df_6)3 -103.774 14.645 -7.086 6.62e-12 ***
## bs(mpg, knots = knot_value_pc_df_6)4 -126.115 14.600 -8.638 < 2e-16 ***
## bs(mpg, knots = knot_value_pc_df_6)5 -127.143 18.836 -6.750 5.45e-11 ***
## bs(mpg, knots = knot_value_pc_df_6)6 -141.413 18.226 -7.759 7.77e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.56 on 385 degrees of freedom
## Multiple R-squared: 0.7457, Adjusted R-squared: 0.7417
## F-statistic: 188.1 on 6 and 385 DF, p-value: < 2.2e-16
ggplot(data.auto,aes(x=mpg, y=horsepower)) +
geom_point(color="black") +
stat_smooth(method = "lm",
formula = y~bs(x, knots = knot_value_pc_df_6),
lty = 1,se = F)
Smoothing Spline
model_sms <- with(data = data.auto,smooth.spline(mpg,horsepower))
model_sms
## Call:
## smooth.spline(x = mpg, y = horsepower)
##
## Smoothing Parameter spar= 0.8528728 lambda= 0.003422279 (12 iterations)
## Equivalent Degrees of Freedom (Df): 6.989153
## Penalized Criterion (RSS): 35962.53
## GCV: 386.3516
The smooth.spline function automatically selects the best smoothing parameter via cross validation or generalized cross validation. The best parameter obtained from the call above is lambda = 0.003422279, found in 12 iterations.
pred_data <- broom::augment(model_sms)
ggplot(pred_data,aes(x=x,y=y))+
geom_point(alpha=0.55, color="black")+
geom_line(aes(y=.fitted),col="blue",
lty=1)+
xlab("mpg")+
ylab("horsepower")+
theme_bw()
Smoothing Splines for Various Lambda Values
We now examine how different values of the lambda parameter affect the smoothing spline fit.
model_sms_lambda <- data.frame(lambda=seq(0,5,by=0.5)) %>%
group_by(lambda) %>%
do(broom::augment(with(data = data.auto,smooth.spline(mpg,horsepower,lambda = .$lambda))))
p <- ggplot(model_sms_lambda,
aes(x=x,y=y))+
geom_line(aes(y=.fitted),
col="blue",
lty=1
)+
facet_wrap(~lambda)
p
The larger the smoothing parameter lambda, the straighter (closer to linear) the fitted curve becomes.
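One way to see this numerically: the effective degrees of freedom shrink toward 2 (a straight line) as lambda grows. A minimal sketch reusing data.auto:
# Effective df for increasing lambda; df near 2 means an almost linear fit.
sapply(c(0.001, 0.1, 1, 5), function(l)
  with(data.auto, smooth.spline(mpg, horsepower, lambda = l))$df)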
LOESS
model_loess <- loess(horsepower~mpg,
data = data.auto)
summary(model_loess)
## Call:
## loess(formula = horsepower ~ mpg, data = data.auto)
##
## Number of Observations: 392
## Equivalent Number of Parameters: 4.8
## Residual Standard Error: 19.58
## Trace of smoother matrix: 5.24 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
The span parameter obtained is 0.75.
LOESS for Various Span Values
We now examine how the span parameter affects the LOESS smoothing. The code below shows the fits for span values from 0.1 to 5.
model_loess_span <- data.frame(span=seq(0.1,5,by=0.5)) %>%
group_by(span) %>%
do(broom::augment(loess(horsepower~mpg,
data = data.auto,span=.$span)))
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 14
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 6.6352e-016
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1
p2 <- ggplot(model_loess_span,
aes(x=mpg,y=horsepower))+
geom_line(aes(y=.fitted),
col="blue",
lty=1
)+
facet_wrap(~span)
p2
LOESS with Span = 0.75
library(ggplot2)
ggplot(data.auto, aes(mpg,horsepower)) +
geom_point(alpha=0.5,color="black") +
stat_smooth(method='loess',
formula=y~x,
span = 0.75,
col="blue",
lty=1,
se=F)