Non-Linear Regression Practice

Introduction to Data Science, Parallel Class 1

Herdian Partawijaya (G1401201022)

Packages

library(ISLR)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(purrr)
library(rsample)
library(splines)

Non-linear Regression, Week 8

Simulated Data

1. Linear Regression

The independent variable (x) is generated as 1000 random draws from a normal distribution with mean = 1 and variance = 1. The response (y) is computed from a non-linear equation with beta0 = 0.5, beta1 = 3, and beta2 = 1.5. The errors are another 1000 draws from a standard normal distribution.

set.seed(123)
datax <- rnorm(1000,1,1) # x: 1000 draws from N(mean = 1, sd = 1)
e <- rnorm(1000)         # errors: 1000 standard normal draws
y <- 0.5+3*datax+1.5*datax^2+e # quadratic data-generating process
plot(datax,y)
linear <- lm(y~datax)
lines(datax, linear$fitted.values, col="red")

summary(linear)
## 
## Call:
## lm(formula = y ~ datax)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2121 -1.4314 -0.5609  0.8585 13.3717 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.27929    0.10270   2.719  0.00665 ** 
## datax        6.23351    0.07235  86.155  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.268 on 998 degrees of freedom
## Multiple R-squared:  0.8815, Adjusted R-squared:  0.8814 
## F-statistic:  7423 on 1 and 998 DF,  p-value: < 2.2e-16

The scatter plot above shows that the data do not follow the red linear regression line; the relationship is clearly not linear. The fitted linear equation is estimated mean y = 0.27929 + 6.23351(x).

2. Polynomial Regression

Because the data are not linear, linear regression is not appropriate. We can instead use one of several non-linear regression methods; one option is polynomial regression.

polym <- lm(y~datax+I(datax^2))
ix <- sort(datax,index.return=T)$ix # index that orders datax from smallest to largest, so the fitted curve is drawn left to right
plot(datax,y,xlim=c(-2,5), ylim=c(-10,70))
lines(datax[ix],polym$fitted.values[ix], col="yellow",lwd=2)

summary(polym)
## 
## Call:
## lm(formula = y ~ datax + I(datax^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0319 -0.6942  0.0049  0.7116  3.2855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.45193    0.04568   9.894   <2e-16 ***
## datax        3.10732    0.05861  53.017   <2e-16 ***
## I(datax^2)   1.49081    0.02338  63.769   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 997 degrees of freedom
## Multiple R-squared:  0.9767, Adjusted R-squared:  0.9766 
## F-statistic: 2.086e+04 on 2 and 997 DF,  p-value: < 2.2e-16

The scatter plot matches the polynomial regression fit well, as shown by the yellow regression line. The fitted polynomial equation is estimated mean y = 0.45193 + 3.10732(x) + 1.49081(x^2). Orders 3 and 4 were also tried; order 2 remains the best choice, as the higher orders give no meaningful improvement in (adjusted) R-squared.
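
The order-3 and order-4 fits mentioned above are not shown; a minimal sketch of that comparison, reusing the datax and y generated earlier, could look like this:

orders <- 2:4
fits <- lapply(orders, function(k) lm(y ~ poly(datax, k, raw = TRUE)))
# adjusted R-squared penalizes the extra terms, unlike raw R-squared
data.frame(order = orders,
           adj.r.sq = sapply(fits, function(m) summary(m)$adj.r.squared),
           AIC = sapply(fits, AIC))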

3. Step Function

range(datax)
## [1] -1.809775  4.241040
c1 <- as.factor(ifelse(datax<=0,1,0)) # indicator: x <= 0
c2 <- as.factor(ifelse(datax<=2 & datax>0,1,0)) # indicator: 0 < x <= 2
c3 <- as.factor(ifelse(datax>2,1,0)) # indicator: x > 2
cbaru <- data.frame(c1,c2,c3)
head(cbaru)
##   c1 c2 c3
## 1  0  1  0
## 2  0  1  0
## 3  0  0  1
## 4  0  1  0
## 5  0  1  0
## 6  0  0  1
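
As an aside, the three indicator columns can be built in one step with cut(); a sketch using the same cut points (the fit below is equivalent up to the dummy coding):

interval <- cut(datax, breaks = c(-Inf, 0, 2, Inf)) # (-Inf,0], (0,2], (2,Inf]
tangga_cut <- lm(y ~ interval)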

4. Model Comparison Plot

tangga <- lm(y~c1+c2+c3) # note: c3 is collinear with the intercept, c1, and c2, so lm() drops it
plot(datax,y,xlim=c(-2,5), ylim=c(-10,70))
lines(datax[ix],tangga$fitted.values[ix],col="blue",lwd=2)
lines(datax[ix],polym$fitted.values[ix], col="yellow",lwd=2)
lines(datax,linear$fitted.values,col="red")
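
A legend helps distinguish the three fitted lines; for example:

legend("topleft", legend = c("Step function", "Polynomial (order 2)", "Linear"),
       col = c("blue", "yellow", "red"), lwd = 2)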

5. Model Goodness-of-Fit Comparison

nilai_AIC <- rbind(AIC(linear),
                   AIC(polym),
                   AIC(tangga))

nama_model <- c("Linear","Poly (ordo=2)","Tangga (breaks=3)")
data.frame(nama_model,nilai_AIC)
##          nama_model nilai_AIC
## 1            Linear  4479.528
## 2     Poly (ordo=2)  2856.470
## 3 Tangga (breaks=3)  5392.009
MSE <- function(pred,actual){
  mean((pred-actual)^2)
}
MSE(predict(linear),y)
## [1] 5.132797
MSE(predict(polym),y)
## [1] 1.010649
MSE(predict(tangga),y)
## [1] 12.75766

The best model is the order-2 polynomial regression, since it has the smallest AIC and MSE of the three.

Auto Data From the ISLR Package

1. Data

The data used are the Auto data, a dataset on vehicles taken from the ISLR package. It consists of 392 observations and 9 variables:
  • mpg : miles per gallon
  • cylinders : Number of cylinders between 4 and 8
  • displacement : Engine displacement (cu. inches)
  • horsepower : Engine horsepower
  • weight : Vehicle weight (lbs.)
  • acceleration : Time to accelerate from 0 to 60 mph (sec.)
  • year : Model year
  • origin : Origin of car (1. American, 2. European, 3. Japanese)
  • name : Vehicle name

    library(ISLR)
    data.auto <- Auto
    data.auto <- data.auto[,-c(7,8,9)] # drop the year, origin, and name variables
    head(data.auto)
    ##   mpg cylinders displacement horsepower weight acceleration
    ## 1  18         8          307        130   3504         12.0
    ## 2  15         8          350        165   3693         11.5
    ## 3  18         8          318        150   3436         11.0
    ## 4  16         8          304        150   3433         12.0
    ## 5  17         8          302        140   3449         10.5
    ## 6  15         8          429        198   4341         10.0

    2. Correlation Plot Between Variables

    plot(data.auto)
    The plot above shows which pairs of variables have a non-linear relationship. From here on we focus on the variables mpg (x) and horsepower (y).

    The mpg variable, used as the independent variable (x), measures fuel consumption in miles per gallon: how many miles a vehicle can travel on one gallon of fuel.

    The horsepower variable, used as the dependent variable (y), measures engine power in horsepower; its value describes how many horses the vehicle's power is equivalent to.

    3. Regression Plot of Mpg vs Horsepower

    plot(data.auto$mpg, data.auto$horsepower, col="black", lwd=1, type="p",xlab="mpg",ylab="Horsepower",main="Scatter Plot Mpg dan Horsepower")
    abline(lm(data.auto$horsepower ~ data.auto$mpg), col = "red", lwd = 2)

    It is now even clearer that mpg and horsepower have a negative, non-linear relationship.

    4. Linear Model Summary

    linear.auto <- lm(data.auto$horsepower~data.auto$mpg)
    summary(linear.auto)
    ## 
    ## Call:
    ## lm(formula = data.auto$horsepower ~ data.auto$mpg)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -64.892 -15.716  -2.094  13.108  96.947 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   194.4756     3.8732   50.21   <2e-16 ***
    ## data.auto$mpg  -3.8389     0.1568  -24.49   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 24.19 on 390 degrees of freedom
    ## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
    ## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
    aic.lin.auto <- AIC(linear.auto); aic.lin.auto
    ## [1] 3614.324

    The fitted linear model is estimated mean y = 194.4756 - 3.8389(mpg). The estimated mean horsepower decreases by 3.8389 units for each one-unit increase in mpg; as seen above, the two variables have a negative, non-linear relationship. The higher the mpg, the lower the horsepower, meaning that more powerful cars consume fuel at a higher rate.

    5. Polynomial Regression, Order 2

    polym.2 = lm(horsepower ~ poly(mpg,2,raw = T),
                         data=data.auto)
    summary(polym.2)
    ## 
    ## Call:
    ## lm(formula = horsepower ~ poly(mpg, 2, raw = T), data = data.auto)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -72.52 -11.24  -0.11   9.47  92.98 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            302.06824    9.25689   32.63   <2e-16 ***
    ## poly(mpg, 2, raw = T)1 -13.32406    0.77456  -17.20   <2e-16 ***
    ## poly(mpg, 2, raw = T)2   0.18804    0.01513   12.43   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 20.49 on 389 degrees of freedom
    ## Multiple R-squared:  0.718,  Adjusted R-squared:  0.7165 
    ## F-statistic: 495.1 on 2 and 389 DF,  p-value: < 2.2e-16
    aic.polym2 <- AIC(polym.2); aic.polym2
    ## [1] 3485.217
    ggplot(data.auto,aes(x=mpg, y=horsepower)) + 
      geom_point(color="black") +
      stat_smooth(method = "lm", 
                   formula = y~poly(x,2,raw=T), 
                   lty = 1, col = "yellow", se=F)+
      labs(title = "Plot Regresi Polynomial Ordo 2")+
      theme(plot.title=element_text(hjust = 0.5))

    6. Polynomial Regression, Order 3

    polym.3 = lm(horsepower ~ poly(mpg,3,raw = T),
                         data=data.auto)
    summary(polym.3)
    ## 
    ## Call:
    ## lm(formula = horsepower ~ poly(mpg, 3, raw = T), data = data.auto)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -71.935 -11.251  -1.338   9.324  95.459 
    ## 
    ## Coefficients:
    ##                          Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            429.581461  23.823018  18.032  < 2e-16 ***
    ## poly(mpg, 3, raw = T)1 -30.131244   3.006538 -10.022  < 2e-16 ***
    ## poly(mpg, 3, raw = T)2   0.866793   0.118533   7.313 1.51e-12 ***
    ## poly(mpg, 3, raw = T)3  -0.008506   0.001474  -5.770 1.62e-08 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 19.69 on 388 degrees of freedom
    ## Multiple R-squared:  0.7403, Adjusted R-squared:  0.7382 
    ## F-statistic: 368.6 on 3 and 388 DF,  p-value: < 2.2e-16
    aic.polym3 <- AIC(polym.3); aic.polym3
    ## [1] 3454.949
    ggplot(data.auto,aes(x=mpg, y=horsepower)) + 
      geom_point(color="black") +
      stat_smooth(method = "lm", 
                   formula = y~poly(x,3,raw=T), 
                   lty = 1, col = "yellow", se=F)+
      labs(title = "Plot Regresi Polynomial Ordo 3")+
      theme(plot.title=element_text(hjust = 0.5))

    7. Polynomial Regression, Order 4

    polym.4 = lm(horsepower ~ poly(mpg,4,raw = T),
                         data=data.auto)
    summary(polym.4)
    ## 
    ## Call:
    ## lm(formula = horsepower ~ poly(mpg, 4, raw = T), data = data.auto)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -71.661 -11.465  -1.145   9.143  95.783 
    ## 
    ## Coefficients:
    ##                          Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            450.665773  62.343875   7.229 2.62e-12 ***
    ## poly(mpg, 4, raw = T)1 -33.885837  10.689919  -3.170  0.00165 ** 
    ## poly(mpg, 4, raw = T)2   1.100819   0.650271   1.693  0.09129 .  
    ## poly(mpg, 4, raw = T)3  -0.014589   0.016684  -0.874  0.38244    
    ## poly(mpg, 4, raw = T)4   0.000056   0.000153   0.366  0.71454    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 19.71 on 387 degrees of freedom
    ## Multiple R-squared:  0.7403, Adjusted R-squared:  0.7377 
    ## F-statistic: 275.9 on 4 and 387 DF,  p-value: < 2.2e-16
    aic.polym4 <- AIC(polym.4); aic.polym4
    ## [1] 3456.813
    ggplot(data.auto,aes(x=mpg, y=horsepower)) + 
      geom_point(color="black") +
      stat_smooth(method = "lm", 
                   formula = y~poly(x,4,raw=T), 
                   lty = 1, col = "yellow", se=F)+
      labs(title = "Plot Regresi Polynomial Ordo 4")+
      theme(plot.title=element_text(hjust = 0.5))

    8. Step Function (5 bins)

    tangga5 = lm(horsepower ~ cut(mpg,5),data=data.auto)
    summary(tangga5)
    ## 
    ## Call:
    ## lm(formula = horsepower ~ cut(mpg, 5), data = data.auto)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -86.242 -11.378  -3.000   8.832  71.758 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)             158.242      2.260   70.03   <2e-16 ***
    ## cut(mpg, 5)(16.5,24]    -55.242      2.942  -18.78   <2e-16 ***
    ## cut(mpg, 5)(24,31.6]    -77.073      3.116  -24.74   <2e-16 ***
    ## cut(mpg, 5)(31.6,39.1]  -85.971      3.603  -23.86   <2e-16 ***
    ## cut(mpg, 5)(39.1,46.6]  -98.542      7.182  -13.72   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 21.56 on 387 degrees of freedom
    ## Multiple R-squared:  0.6896, Adjusted R-squared:  0.6863 
    ## F-statistic: 214.9 on 4 and 387 DF,  p-value: < 2.2e-16
    aic.tangga5 <- AIC(tangga5); aic.tangga5
    ## [1] 3526.845
    ggplot(data.auto,aes(x=mpg, y=horsepower)) +
           geom_point(color="black") +
           stat_smooth(method = "lm", 
                   formula = y~cut(x,5), 
                   col = "blue",se = F)+
      theme_bw()+
      labs(title = "Plot Regresi Fungsi Tangga Cut 5")+
      theme(plot.title=element_text(hjust = 0.5))

    9. Step Function (7 bins)

    tangga7 = lm(horsepower ~ cut(mpg,7),data=data.auto)
    summary(tangga7)
    ## 
    ## Call:
    ## lm(formula = horsepower ~ cut(mpg, 7), data = data.auto)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -52.48 -12.96  -2.25  10.31 105.52 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)             169.692      2.960   57.32   <2e-16 ***
    ## cut(mpg, 7)(14.4,19.7]  -45.208      3.669  -12.32   <2e-16 ***
    ## cut(mpg, 7)(19.7,25.1]  -74.908      3.734  -20.06   <2e-16 ***
    ## cut(mpg, 7)(25.1,30.5]  -88.317      3.885  -22.73   <2e-16 ***
    ## cut(mpg, 7)(30.5,35.9]  -96.865      4.186  -23.14   <2e-16 ***
    ## cut(mpg, 7)(35.9,41.2] -100.442      5.268  -19.07   <2e-16 ***
    ## cut(mpg, 7)(41.2,46.6] -111.978      8.594  -13.03   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 21.35 on 385 degrees of freedom
    ## Multiple R-squared:  0.6972, Adjusted R-squared:  0.6924 
    ## F-statistic: 147.7 on 6 and 385 DF,  p-value: < 2.2e-16
    aic.tangga7 <- AIC(tangga7); aic.tangga7
    ## [1] 3521.117
    ggplot(data.auto,aes(x=mpg, y=horsepower)) +
           geom_point(color="black") +
           stat_smooth(method = "lm", 
                   formula = y~cut(x,7), 
                   col = "blue",se = F)+
      theme_bw()+
      labs(title = "Plot Regresi Fungsi Tangga Cut 7")+
      theme(plot.title=element_text(hjust = 0.5))

    10. Step Function (9 bins)

    tangga9 = lm(horsepower ~ cut(mpg,9),data=data.auto)
    summary(tangga9)
    ## 
    ## Call:
    ## lm(formula = horsepower ~ cut(mpg, 9), data = data.auto)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -76.288 -10.349  -1.355   7.581  81.712 
    ## 
    ## Coefficients:
    ##                        Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)             170.697      3.583  47.643  < 2e-16 ***
    ## cut(mpg, 9)(13.2,17.4]  -22.409      4.388  -5.107 5.18e-07 ***
    ## cut(mpg, 9)(17.4,21.5]  -65.348      4.236 -15.428  < 2e-16 ***
    ## cut(mpg, 9)(21.5,25.7]  -78.256      4.474 -17.491  < 2e-16 ***
    ## cut(mpg, 9)(25.7,29.9]  -88.964      4.461 -19.944  < 2e-16 ***
    ## cut(mpg, 9)(29.9,34.1]  -97.472      4.635 -21.030  < 2e-16 ***
    ## cut(mpg, 9)(34.1,38.2] -100.342      5.148 -19.491  < 2e-16 ***
    ## cut(mpg, 9)(38.2,42.4] -104.097      9.877 -10.539  < 2e-16 ***
    ## cut(mpg, 9)(42.4,46.6] -116.030      9.135 -12.702  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 20.58 on 383 degrees of freedom
    ## Multiple R-squared:  0.7199, Adjusted R-squared:  0.7141 
    ## F-statistic: 123.1 on 8 and 383 DF,  p-value: < 2.2e-16
    aic.tangga9 <- AIC(tangga9); aic.tangga9
    ## [1] 3494.485
    ggplot(data.auto,aes(x=mpg, y=horsepower)) +
           geom_point(color="black") +
           stat_smooth(method = "lm", 
                   formula = y~cut(x,9), 
                   col = "blue",se = F)+
      theme_bw()+
      labs(title = "Plot Regresi Fungsi Tangga Cut 9")+
      theme(plot.title=element_text(hjust = 0.5))

    11. Model Comparison by MSE

    MSE = function(pred,actual){
      mean((pred-actual)^2)
    }
    
    nilai_MSE <- rbind(MSE(predict(linear.auto),data.auto$horsepower),
                  MSE(predict(polym.2),data.auto$horsepower),
                  MSE(predict(polym.3),data.auto$horsepower),
                  MSE(predict(polym.4),data.auto$horsepower),
                  MSE(predict(tangga5),data.auto$horsepower),
                  MSE(predict(tangga7),data.auto$horsepower),
                  MSE(predict(tangga9),data.auto$horsepower))
    
    nama_model <- c("Linear","Poly (ordo=2)", "Poly (ordo=3)", "Poly (ordo=4)",
                    "Tangga (breaks=5)", "Tangga (breaks=7)", "Tangga (breaks=9)")
    data.frame(nama_model,nilai_MSE)
    ##          nama_model nilai_MSE
    ## 1            Linear  582.3257
    ## 2     Poly (ordo=2)  416.7871
    ## 3     Poly (ordo=3)  383.8523
    ## 4     Poly (ordo=4)  383.7195
    ## 5 Tangga (breaks=5)  458.7770
    ## 6 Tangga (breaks=7)  447.5318
    ## 7 Tangga (breaks=9)  413.8924

    Judged by MSE, the best model is the order-4 polynomial regression, with the smallest MSE, 383.7195 (only marginally below order 3; note that in-sample MSE can never increase as the polynomial order grows). Exploratively, the order-4 regression line does follow the scatter of the data most closely, which is why its error is smallest.

    12. Model Comparison by R-square

    nilai_Rsquare <- rbind(summary(linear.auto)$r.square,
                           summary(polym.2)$r.square,
                           summary(polym.3)$r.square,
                           summary(polym.4)$r.square,
                           summary(tangga5)$r.square,
                           summary(tangga7)$r.square,
                           summary(tangga9)$r.square)
    
    nama_model <- c("Linear","Poly (ordo=2)", "Poly (ordo=3)", "Poly (ordo=4)",
                    "Tangga (breaks=5)", "Tangga (breaks=7)", "Tangga (breaks=9)")
    data.frame(nama_model,nilai_Rsquare)
    ##          nama_model nilai_Rsquare
    ## 1            Linear     0.6059483
    ## 2     Poly (ordo=2)     0.7179659
    ## 3     Poly (ordo=3)     0.7402524
    ## 4     Poly (ordo=4)     0.7403423
    ## 5 Tangga (breaks=5)     0.6895519
    ## 6 Tangga (breaks=7)     0.6971614
    ## 7 Tangga (breaks=9)     0.7199248

    Judged by R-square, the best model is the order-4 polynomial regression, with the largest R-square, 0.7403423 (again only marginally above order 3).
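
    Because R-squared never decreases when terms are added, adjusted R-squared is a fairer in-sample comparison; a minimal sketch using the models fitted above:

    models <- list(linear.auto, polym.2, polym.3, polym.4, tangga5, tangga7, tangga9)
    # adjusted R-squared trades fit against the number of parameters
    data.frame(nama_model,
               adj.r.square = sapply(models, function(m) summary(m)$adj.r.squared))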

    13. Model Evaluation With Cross-Validation

    Linear Model

    set.seed(123)
    cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
    metric_linear <- map_dfr(cross_val$splits,
        function(x){
        mod <- lm(horsepower ~ mpg,data=data.auto[x$in_id,])
        pred <- predict(mod,newdata=data.auto[-x$in_id,])
        truth <- data.auto[-x$in_id,]$horsepower
        rmse <- mlr3measures::rmse(truth = truth,response = pred)
        mae <- mlr3measures::mae(truth = truth,response = pred)
        metric <- c(rmse,mae)
        names(metric) <- c("rmse","mae")
        return(metric)
      }
    )
    metric_linear
    ## # A tibble: 10 x 2
    ##     rmse   mae
    ##    <dbl> <dbl>
    ##  1  25.3  20.3
    ##  2  29.3  22.2
    ##  3  22.4  16.9
    ##  4  26.8  19.9
    ##  5  23.4  18.9
    ##  6  24.9  20.6
    ##  7  18.1  15.0
    ##  8  20.3  13.6
    ##  9  22.6  18.2
    ## 10  26.9  18.3
    # average the metrics over the 10 folds
    mean_metric_linear <- colMeans(metric_linear)
    mean_metric_linear
    ##     rmse      mae 
    ## 24.00403 18.39118
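
    The same fold loop is repeated below for each polynomial order; a small helper that takes the model formula would remove the duplication. A sketch under the same packages (purrr, rsample, mlr3measures):

    # hypothetical helper: k-fold metrics for any lm formula on data.auto
    cv_metrics <- function(formula, splits = cross_val$splits) {
      map_dfr(splits, function(x) {
        mod <- lm(formula, data = data.auto[x$in_id, ])
        pred <- predict(mod, newdata = data.auto[-x$in_id, ])
        truth <- data.auto[-x$in_id, ]$horsepower
        c(rmse = mlr3measures::rmse(truth, pred),
          mae = mlr3measures::mae(truth, pred))
      })
    }
    # usage: colMeans(cv_metrics(horsepower ~ poly(mpg, 2, raw = TRUE)))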

    Order-2 Polynomial Model

    set.seed(123)
    cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
    metric_poly2 <- map_dfr(cross_val$splits,
    function(x){
      mod <- lm(horsepower ~ poly(mpg,2,raw = T),data=data.auto[x$in_id,])
      pred <- predict(mod,newdata=data.auto[-x$in_id,])
      truth <- data.auto[-x$in_id,]$horsepower
      rmse <- mlr3measures::rmse(truth = truth,response = pred)
      mae <- mlr3measures::mae(truth = truth,response = pred)
      metric <- c(rmse,mae)
      names(metric) <- c("rmse","mae")
      return(metric)
      }
    )
    metric_poly2
    ## # A tibble: 10 x 2
    ##     rmse   mae
    ##    <dbl> <dbl>
    ##  1  23.1  18.9
    ##  2  25.1  17.2
    ##  3  18.4  14.3
    ##  4  22.1  15.1
    ##  5  21.5  15.6
    ##  6  19.7  14.8
    ##  7  15.6  12.2
    ##  8  16.5  12.3
    ##  9  18.9  14.6
    ## 10  22.5  13.8
    # average the metrics over the 10 folds
    mean_metric_poly2 <- colMeans(metric_poly2)
    mean_metric_poly2
    ##     rmse      mae 
    ## 20.35171 14.88587

    Order-3 Polynomial Model

    set.seed(123)
    cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
    metric_poly3 <- map_dfr(cross_val$splits,
    function(x){
      mod <- lm(horsepower ~ poly(mpg,3,raw = T),data=data.auto[x$in_id,])
      pred <- predict(mod,newdata=data.auto[-x$in_id,])
      truth <- data.auto[-x$in_id,]$horsepower
      rmse <- mlr3measures::rmse(truth = truth,response = pred)
      mae <- mlr3measures::mae(truth = truth,response = pred)
      metric <- c(rmse,mae)
      names(metric) <- c("rmse","mae")
      return(metric)
      }
    )
    metric_poly3
    ## # A tibble: 10 x 2
    ##     rmse   mae
    ##    <dbl> <dbl>
    ##  1  20.9  17.0
    ##  2  24.1  15.8
    ##  3  17.2  13.2
    ##  4  22.5  15.2
    ##  5  21.3  15.6
    ##  6  18.6  14.7
    ##  7  15.8  12.0
    ##  8  16.2  11.9
    ##  9  17.4  14.2
    ## 10  22.0  13.5
    # average the metrics over the 10 folds
    mean_metric_poly3 <- colMeans(metric_poly3)
    mean_metric_poly3
    ##     rmse      mae 
    ## 19.59566 14.30442

    Order-4 Polynomial Model

    set.seed(123)
    cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
    metric_poly4 <- map_dfr(cross_val$splits,
    function(x){
      mod <- lm(horsepower ~ poly(mpg,4,raw = T),data=data.auto[x$in_id,])
      pred <- predict(mod,newdata=data.auto[-x$in_id,])
      truth <- data.auto[-x$in_id,]$horsepower
      rmse <- mlr3measures::rmse(truth = truth,response = pred)
      mae <- mlr3measures::mae(truth = truth,response = pred)
      metric <- c(rmse,mae)
      names(metric) <- c("rmse","mae")
      return(metric)
      }
    )
    metric_poly4
    ## # A tibble: 10 x 2
    ##     rmse   mae
    ##    <dbl> <dbl>
    ##  1  20.9  16.9
    ##  2  24.1  15.8
    ##  3  17.2  13.2
    ##  4  22.5  15.2
    ##  5  21.3  15.6
    ##  6  18.6  14.6
    ##  7  15.8  12.0
    ##  8  16.2  11.9
    ##  9  17.4  14.2
    ## 10  22.2  13.6
    # average the metrics over the 10 folds
    mean_metric_poly4 <- colMeans(metric_poly4)
    mean_metric_poly4
    ##     rmse      mae 
    ## 19.61659 14.29458

    Step Function

    set.seed(123)
    cross_val <- vfold_cv(data.auto,v=10,strata = "horsepower")
    breaks <- 3:10
    best_tangga <- map_dfr(breaks, function(i){
        metric_tangga <- map_dfr(cross_val$splits,
        function(x){
            # NOTE: a fixed 314/78 train/test split is reused for every fold,
            # not the CV folds in cross_val, so these errors are not directly
            # comparable with the k-fold results above
            training <- data.auto[1:314,]
            training$mpg <- cut(training$mpg,i)
            mod <- lm(horsepower ~ mpg,data=training)
            labs_x <- levels(mod$model[,2])
            # recover the numeric bin edges from the factor labels "(a,b]"
            labs_x_breaks <- cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs_x) ),
                      upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs_x) ))
            testing <- data.auto[-1:-314,]
            mpg_new <- cut(testing$mpg,c(labs_x_breaks[1,1],labs_x_breaks[,2]))
            pred <- predict(mod,newdata=list(mpg=mpg_new))
            truth <- testing$horsepower
            data_eval <- na.omit(data.frame(truth,pred))
            rmse <- mlr3measures::rmse(truth = data_eval$truth,response = data_eval$pred)
            mae <- mlr3measures::mae(truth = data_eval$truth,response = data_eval$pred)
            metric <- c(rmse,mae)
            names(metric) <- c("rmse","mae")
            return(metric)
          }
        )
      # average the metrics over the 10 repetitions
      mean_metric_tangga <- colMeans(metric_tangga)
      mean_metric_tangga
      }
    )
    
    best_tangga <- cbind(breaks=breaks,best_tangga)
    # show the results for all break counts
    best_tangga
    ##   breaks     rmse       mae
    ## 1      3 18.64373 12.670291
    ## 2      4 13.63102  9.966213
    ## 3      5 15.47493 10.495278
    ## 4      6 16.53925 11.688612
    ## 5      7 15.63135 10.666628
    ## 6      8 15.91060 11.440924
    ## 7      9 15.43487 10.933407
    ## 8     10 15.76456 10.947742
    # best by rmse
    best_tangga %>% slice_min(rmse)
    ##   breaks     rmse      mae
    ## 1      4 13.63102 9.966213
    # best by mae
    best_tangga %>% slice_min(mae)
    ##   breaks     rmse      mae
    ## 1      4 13.63102 9.966213

    Model Comparison From Cross-Validation

    nilai_metric <- rbind(mean_metric_linear,
                          mean_metric_poly2,
                          mean_metric_poly3,
                          mean_metric_poly4,
                          best_tangga %>% select(-1) %>% slice_min(mae))
    
    nama_model <- c("Linear","Poly2","Poly3","Poly4","Tangga (breaks=4)")
    data.frame(nama_model,nilai_metric)
    ##          nama_model     rmse       mae
    ## 1            Linear 24.00403 18.391184
    ## 2             Poly2 20.35171 14.885874
    ## 3             Poly3 19.59566 14.304419
    ## 4             Poly4 19.61659 14.294577
    ## 5 Tangga (breaks=4) 13.63102  9.966213

    Cross-validation suggests that the best model is the step function (breaks=4), which has the smallest rmse and mae. The more closely the predictions match the actual/test data, the smaller these errors are. Note, however, that the step-function errors were computed on a fixed hold-out split rather than on the same CV folds, so this comparison is only indicative.

    Non-linear Regression, Week 9

    Simulated Data

    set.seed(123)
    datax <- rnorm(1000,1,1)
    e <- rnorm(1000,0,5) # errors now have sd = 5 (noisier than in week 8)
    y <- 0.5+3*datax+1.5*datax^2+e
    plot(datax,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="black")
    abline(v=1, col="red", lty=2) # mark the knot location x = 1 used below

    Piecewise cubic polynomial

    dt.all <- cbind(y,datax)
    
    ## knot = 1
    dt1 <- dt.all[datax <=1,] # keep the observations with x <= 1
    dim(dt1)
    ## [1] 495   2

    There are 495 observations with x <= 1.

    dt2 <- dt.all[datax >1,] # keep the observations with x > 1
    dim(dt2)
    ## [1] 505   2

    There are 505 observations with x > 1.

    plot(datax,y,xlim=c(-2,5), ylim=c(-10,70), pch=16, col="black")
    
    cub.mod1 <- lm(dt1[,1]~dt1[,2]+I(dt1[,2]^2)+I(dt1[,2]^3))
    ix <- sort(dt1[,2], index.return=T)$ix
    lines(dt1[ix,2],cub.mod1$fitted.values[ix],col="orange", lwd=2)
    
    cub.mod2 <- lm(dt2[,1]~dt2[,2]+I(dt2[,2]^2)+I(dt2[,2]^3))
    ix <- sort(dt2[,2], index.return=T)$ix
    lines(dt2[ix,2],cub.mod2$fitted.values[ix],col="orange", lwd=2)

    The fit has a break at x = 1 because we fit separate cubic polynomials to the observations with x <= 1 and x > 1.
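
    The size of the jump at x = 1 can be checked directly by evaluating both fitted cubics there; a quick sketch using their coefficient vectors:

    # both models have coefficients (intercept, x, x^2, x^3); evaluate at x = 1
    basis_at_1 <- c(1, 1, 1, 1)
    sum(coefficients(cub.mod1) * basis_at_1)
    sum(coefficients(cub.mod2) * basis_at_1) # differs: the fit is discontinuous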

    Truncated power basis

    Because the previous fit has a break at x = 1, we now join the two pieces using a truncated power basis.
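
    Formally, with a single knot at x = 1, the cubic spline adds one truncated cubic term to a global cubic polynomial:

    y = beta0 + beta1(x) + beta2(x^2) + beta3(x^3) + beta4(x-1)+^3 + e, where (x-1)+^3 = (x-1)^3 if x > 1 and 0 otherwise.

    Because the truncated term and its first two derivatives vanish at the knot, the fitted curve and its first two derivatives are continuous there.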

    # using the truncated power basis
    plot(datax,y,xlim=c( -2,5), ylim=c( -10,70), pch=16,col="orange")
    abline(v=1,col="red", lty=2)
    
    hx <- ifelse( datax>1,(datax-1)^3,0) # truncated cubic basis: (x-1)^3 for x > 1, else 0
    
    cubspline.mod <- lm( y~datax+I(datax^2)+I(datax^3)+hx)
    ix <- sort( datax,index.return=T)$ix
    lines( datax[ix], cubspline.mod$fitted.values[ix],col="green", lwd=2)

    Comparison with k-fold CV

    Next we use two knots, at x = 0 and x = 2, so the fit consists of three trend segments.

    plot( datax,y,xlim=c(-2,5), ylim=c( -10,70), pch=16,col="orange")
    abline(v=0,col="red", lty=2)
    abline(v=2,col="red", lty=2)
    
    hx1 <- ifelse( datax>0,(datax-0)^3,0)
    hx2 <- ifelse( datax>2,(datax-2)^3,0)
    
    cubspline.mod2 <- lm( y~datax+I(datax^2)+I(datax^3)+hx1+hx2)
    ix <- sort( datax,index.return=T)$ix
    lines( datax[ix],cubspline.mod2$fitted.values[ix],col="green", lwd=2)

    Comparing 1 knot (x=1) and 2 knots (x=0, x=2) with 5-fold Cross-Validation

    ##1 knot
    data.all <- cbind( y,datax,hx)
    set.seed(456)
    ind <- sample(1:5,length( datax),replace=T)
    res <- c()
    for( i in 1:5){
      dt.train <- data.all[ ind!=i,]
      x.test <- data.all[ ind==i, -1]
      y.test <- data.all[ ind==i,1]
      mod1 <- lm( dt.train[,1]~dt.train[,2]+I( dt.train[,2]^2)+I( dt.train[,2]^3)+dt.train[,3])
      x.test.olah <- cbind(1,x.test[,1], x.test[,1]^2,x.test[,1]^3,x.test[,2])
      beta <- coefficients(mod1)
      prediksi <- x.test.olah%*%beta
      res <- c( res,mean(( y.test-prediksi)^2))
    }
    res
    ## [1] 22.08767 21.39598 29.47677 29.68683 25.18070
    mean(res) # average CV MSE over the 5 folds
    ## [1] 25.56559
    ## 2 knots (at x = 0 and x = 2)
    data.all2 <- cbind(y,datax,hx1,hx2)
    set.seed(456)
    ind2 <- sample(1:5,length( datax),replace=T)
    res2 <- c()
    for( i in 1:5){
      dt.train2 <- data.all2[ind2!=i,]
      x.test2 <- data.all2[ind2==i,-1]
      y.test2 <- data.all2[ind2==i,1]
      mod2 <- lm(dt.train2[,1]~dt.train2[,2]+I(dt.train2[,2]^2)+I(dt.train2[,2]^3)+dt.train2[,3]+dt.train2[,4])
      x.test.olah2 <- cbind(1,x.test2[,1],x.test2[,1]^2,x.test2[,1]^3,x.test2[,2],x.test2[,3])
      beta2 <- coefficients(mod2)
      prediksi2 <- x.test.olah2%*%beta2
      res2 <- c(res2,mean((y.test2-prediksi2)^2))
    }
    res2
    ## [1] 22.32898 21.33868 29.40459 29.79913 25.22047
    mean(res2) # average CV MSE over the 5 folds
    ## [1] 25.61837

    The 1-knot model does slightly better than the 2-knot model, since its average CV error is smaller (25.57 vs 25.62), although the difference is negligible.

    Smoothing Splines

    ss1 <- smooth.spline(datax,y,all.knots = T); ss1
    ## Call:
    ## smooth.spline(x = datax, y = y, all.knots = T)
    ## 
    ## Smoothing Parameter  spar= 1.499938  lambda= 0.0002745903 (26 iterations)
    ## Equivalent Degrees of Freedom (Df): 14.84845
    ## Penalized Criterion (RSS): 24848.74
    ## GCV: 121.2927
    plot(datax,y,xlim=c(-2,5), ylim=c(-10,70), pch=16,col="orange")
    lines(ss1,col="black", lwd=2)

    The smooth.spline function automatically selects the best smoothing parameter using (generalized) cross-validation. The parameter chosen above is lambda = 0.000275.
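
    By default smooth.spline uses generalized cross-validation; ordinary leave-one-out cross-validation can be requested instead. A small sketch:

    # cv = TRUE switches the criterion from GCV to leave-one-out CV
    ss1_cv <- smooth.spline(datax, y, all.knots = T, cv = TRUE)
    ss1_cv$lambda # may differ slightly from the GCV choice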

    Auto Data From the ISLR Package

    Scatter Plot

    ggplot(data.auto,aes(x=mpg, y=horsepower)) +
                     geom_point(alpha=0.55, color="black") + 
                     theme_bw() 

    Regression Splines

    # basis dimension when the knots are specified by hand
    dim(bs(data.auto$mpg, knots = c(15, 25,35)))
    ## [1] 392   6
    # knot locations chosen automatically from df
    attr(bs(data.auto$mpg, df=6),"knots")
    ##   25%   50%   75% 
    ## 17.00 22.75 29.00

    Letting the computer choose the knots gives 3 knots located at the 25%, 50%, and 75% quantiles of the data: knot 1 at x=17, knot 2 at x=22.75, and knot 3 at x=29.
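
    For a cubic B-spline basis without an intercept column, df = number of interior knots + degree, so df = 6 with the default degree 3 leaves exactly 3 knots, which bs() places at the quartiles shown above. A quick check:

    # df = length(knots) + degree for bs() without an intercept
    length(attr(bs(data.auto$mpg, df = 6), "knots")) # 3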

    B-Splines

    Using the knots chosen automatically above:

    knot_value_pc_df_6 = attr(bs(data.auto$mpg, df=6),"knots")
    mod_spline_1 = lm(horsepower ~ bs(mpg, knots =knot_value_pc_df_6 ),data=data.auto)
    summary(mod_spline_1)
    ## 
    ## Call:
    ## lm(formula = horsepower ~ bs(mpg, knots = knot_value_pc_df_6), 
    ##     data = data.auto)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -74.043 -10.103  -0.818   8.075  95.778 
    ## 
    ## Coefficients:
    ##                                      Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)                           195.012     13.181  14.795  < 2e-16 ***
    ## bs(mpg, knots = knot_value_pc_df_6)1    2.792     18.879   0.148    0.882    
    ## bs(mpg, knots = knot_value_pc_df_6)2  -80.488     12.649  -6.363 5.62e-10 ***
    ## bs(mpg, knots = knot_value_pc_df_6)3 -103.774     14.645  -7.086 6.62e-12 ***
    ## bs(mpg, knots = knot_value_pc_df_6)4 -126.115     14.600  -8.638  < 2e-16 ***
    ## bs(mpg, knots = knot_value_pc_df_6)5 -127.143     18.836  -6.750 5.45e-11 ***
    ## bs(mpg, knots = knot_value_pc_df_6)6 -141.413     18.226  -7.759 7.77e-14 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 19.56 on 385 degrees of freedom
    ## Multiple R-squared:  0.7457, Adjusted R-squared:  0.7417 
    ## F-statistic: 188.1 on 6 and 385 DF,  p-value: < 2.2e-16
    ggplot(data.auto,aes(x=mpg, y=horsepower)) +
                     geom_point(color="black") +
      stat_smooth(method = "lm", 
                   formula = y~bs(x, knots = knot_value_pc_df_6), 
                   lty = 1,se = F)

    Smoothing Spline

    model_sms <- with(data = data.auto,smooth.spline(mpg,horsepower))
    model_sms 
    ## Call:
    ## smooth.spline(x = mpg, y = horsepower)
    ## 
    ## Smoothing Parameter  spar= 0.8528728  lambda= 0.003422279 (12 iterations)
    ## Equivalent Degrees of Freedom (Df): 6.989153
    ## Penalized Criterion (RSS): 35962.53
    ## GCV: 386.3516

    The smooth.spline function again selects the best smoothing parameter by (generalized) cross-validation. The chosen parameter is lambda = 0.003422279, found in 12 iterations.

    pred_data <- broom::augment(model_sms)
    
    ggplot(pred_data,aes(x=x,y=y))+
      geom_point(alpha=0.55, color="black")+ 
      geom_line(aes(y=.fitted),col="blue",
                lty=1)+
      xlab("mpg")+
      ylab("horsepower")+
      theme_bw()

    Smoothing Spline for Various Lambda Values

    We will look at how different values of the penalty parameter lambda affect the smoothing-spline fit.

    model_sms_lambda <- data.frame(lambda=seq(0,5,by=0.5)) %>% 
      group_by(lambda) %>% 
      do(broom::augment(with(data = data.auto,smooth.spline(mpg,horsepower,lambda = .$lambda))))
    
    p <- ggplot(model_sms_lambda,
           aes(x=x,y=y))+
      geom_line(aes(y=.fitted),
                col="blue",
                lty=1
                )+
      facet_wrap(~lambda)
    p

    The larger the penalty parameter, the straighter (more linear) the fitted curve becomes; in the limit it approaches the least-squares straight line.

    LOESS

    model_loess <- loess(horsepower~mpg,
                         data = data.auto)
    summary(model_loess)
    ## Call:
    ## loess(formula = horsepower ~ mpg, data = data.auto)
    ## 
    ## Number of Observations: 392 
    ## Equivalent Number of Parameters: 4.8 
    ## Residual Standard Error: 19.58 
    ## Trace of smoother matrix: 5.24  (exact)
    ## 
    ## Control settings:
    ##   span     :  0.75 
    ##   degree   :  2 
    ##   family   :  gaussian
    ##   surface  :  interpolate      cell = 0.2
    ##   normalize:  TRUE
    ##  parametric:  FALSE
    ## drop.square:  FALSE

    The fitted span parameter is 0.75, the loess default.
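
    Rather than keeping the default, the span could also be tuned; a minimal sketch that scores a few candidate spans by 5-fold RMSE (the fold assignment here is an assumption, not part of the original analysis):

    set.seed(123)
    fold <- sample(rep(1:5, length.out = nrow(data.auto)))
    cv_rmse <- sapply(c(0.3, 0.5, 0.75, 1), function(s) {
      fold_err <- sapply(1:5, function(k) {
        # surface = "direct" lets predict() handle mpg values outside the fold
        fit <- loess(horsepower ~ mpg, data = data.auto[fold != k, ], span = s,
                     control = loess.control(surface = "direct"))
        pred <- predict(fit, newdata = data.auto[fold == k, ])
        sqrt(mean((data.auto$horsepower[fold == k] - pred)^2, na.rm = TRUE))
      })
      mean(fold_err)
    })
    cv_rmse # smaller is better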

    LOESS for Various Span Values

    We will look at how the span parameter affects the LOESS smoother. The code below shows the fits for span values from 0.1 to 5.

    model_loess_span <- data.frame(span=seq(0.1,5,by=0.5)) %>% 
      group_by(span) %>% 
      do(broom::augment(loess(horsepower~mpg,
                         data = data.auto,span=.$span)))
    ## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
    ## parametric, : pseudoinverse used at 14
    ## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
    ## parametric, : neighborhood radius 1
    ## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
    ## parametric, : reciprocal condition number 6.6352e-016
    ## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
    ## parametric, : There are other near singularities as well. 1
    p2 <- ggplot(model_loess_span,
           aes(x=mpg,y=horsepower))+
      geom_line(aes(y=.fitted),
                col="blue",
                lty=1
                )+
      facet_wrap(~span)
    p2

    LOESS With Span = 0.75

    library(ggplot2)
    ggplot(data.auto, aes(mpg,horsepower)) +
      geom_point(alpha=0.5,color="black") +
      stat_smooth(method='loess',
                   formula=y~x,
                  span = 0.75,
                  col="blue",
                  lty=1,
                  se=F)