8.1 Recreate the simulated data

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

A.

## This is taken from the textbook.
set.seed(10)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
model1 <- randomForest(y~.,data = simulated,importance = TRUE,ntree = 1000)

rfImp1 <- varImp(model1,scale = FALSE)
rfImp1
##         Overall
## V1   8.81746858
## V2   6.70058089
## V3   0.67828171
## V4   7.83871409
## V5   2.11431959
## V6   0.15229799
## V7   0.07501809
## V8  -0.09120051
## V9  -0.05293163
## V10  0.06457985

Looking at the variable importance table, we see that the random forest assigned essentially no importance to the uninformative predictors V6-V10: their scores hover around zero (some slightly negative), so they contributed almost nothing to the model.


B. Now add an additional predictor

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9422561
model2 <- randomForest(y~.,data = simulated,importance = TRUE,ntree = 1000)
rfImp2 <- varImp(model2,scale = FALSE)
rfImp2
##                Overall
## V1          5.97053713
## V2          6.22467245
## V3          0.58560929
## V4          7.07090410
## V5          2.08891249
## V6          0.26342318
## V7         -0.06009788
## V8         -0.10104364
## V9         -0.06050339
## V10        -0.04303861
## duplicate1  3.81324659

The importance score for V1 did change: adding a predictor that is highly correlated with V1 decreased V1's importance score from about 8.8 to 6.0. The two correlated predictors effectively split the credit between them, so each looks less important than V1 did on its own.
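
As a quick sketch of where this dilution leads (simulated3, duplicate2, and model3 are hypothetical names, and the run uses a copy so the original frame keeps only duplicate1): stacking a second correlated copy of V1 should split the credit three ways and lower V1's score further.

simulated3 <- simulated
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1
model3 <- randomForest(y ~ ., data = simulated3, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)  ## V1's importance typically drops again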


C.

set.seed(200)
simulated2 <- mlbench.friedman1(200, sd = 1)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"
library(party)
## Warning: package 'party' was built under R version 4.3.3
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
set.seed(10)
rand_cf <- cforest(y ~ .,data = simulated2,control = cforest_unbiased(ntree = 1000))
rand_cf
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  1000 
## 
## Response:  y 
## Inputs:  V1, V2, V3, V4, V5, V6, V7, V8, V9, V10 
## Number of observations:  200
varimp(rand_cf,conditional = TRUE)
##            V1            V2            V3            V4            V5 
##  5.7101609863  5.2220075065  0.0096108046  6.5164006098  1.2366535535 
##            V6            V7            V8            V9           V10 
##  0.0244136339  0.0003022317 -0.0183686880 -0.0123724422 -0.0072053403
varimp(rand_cf,conditional = FALSE)
##           V1           V2           V3           V4           V5           V6 
##  8.948066291  6.517743826  0.015936205  8.383193023  1.935979111  0.007571961 
##           V7           V8           V9          V10 
##  0.038976165 -0.047041308 -0.053014190 -0.029579844

For the traditional random forest model, the important variables were V1, V2, and V4. The conditional inference forest shows the same pattern: with varimp(conditional = FALSE) and with varimp(conditional = TRUE), the same three variables top the list, although the conditional scores are uniformly smaller in magnitude.
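
As a sketch of how conditional importance handles the correlated copy from part B (rand_cf_dup is a hypothetical name; note that simulated still contains duplicate1 at this point), one could refit the conditional inference forest on that frame. The conditional = TRUE permutation scheme is designed to discount importance that a predictor merely borrows from a correlated neighbor, so V1 and duplicate1 should share less credit than under the unconditional scheme.

set.seed(10)
rand_cf_dup <- cforest(y ~ ., data = simulated,
                       control = cforest_unbiased(ntree = 1000))
sort(varimp(rand_cf_dup, conditional = FALSE), decreasing = TRUE)
sort(varimp(rand_cf_dup, conditional = TRUE), decreasing = TRUE)  ## slower to compute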


D. Repeat This Process

Boosted Trees:

gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10)
set.seed(10)
simulateX <- simulated2[1:10]
simulateY <- simulated2$y
gbmTune <- train(simulateX,simulateY,method = "gbm",tuneGrid = gbmGrid,verbose = FALSE)
summary(gbmTune,plot = FALSE)
##     var    rel.inf
## V4   V4 27.7540918
## V1   V1 24.9977002
## V2   V2 22.3201472
## V5   V5 10.7289426
## V3   V3  9.5273894
## V7   V7  1.2994823
## V6   V6  1.2851214
## V9   V9  1.1144768
## V10 V10  0.5050145
## V8   V8  0.4676339
plot(gbmTune)

gbmTune
## Stochastic Gradient Boosting 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   0.01       1                   100     4.203561  0.5846443  3.447156
##   0.01       1                   150     3.905186  0.6308288  3.211024
##   0.01       1                   200     3.652878  0.6668283  3.006751
##   0.01       1                   250     3.445676  0.6920470  2.840386
##   0.01       1                   300     3.266789  0.7140282  2.694462
##   0.01       1                   350     3.114278  0.7311043  2.567584
##   0.01       1                   400     2.983546  0.7471146  2.456264
##   0.01       1                   450     2.868025  0.7604846  2.357151
##   0.01       1                   500     2.767807  0.7728716  2.270184
##   0.01       1                   550     2.677897  0.7831939  2.190451
##   0.01       1                   600     2.595946  0.7935902  2.119074
##   0.01       1                   650     2.519680  0.8034706  2.053290
##   0.01       1                   700     2.454690  0.8108801  1.997272
##   0.01       1                   750     2.392771  0.8178658  1.942705
##   0.01       1                   800     2.339343  0.8235337  1.895576
##   0.01       1                   850     2.290333  0.8288759  1.853332
##   0.01       1                   900     2.246419  0.8333399  1.815319
##   0.01       1                   950     2.207280  0.8371232  1.780585
##   0.01       1                  1000     2.171128  0.8408760  1.748630
##   0.01       3                   100     3.651480  0.6921300  3.001797
##   0.01       3                   150     3.258980  0.7255947  2.680939
##   0.01       3                   200     2.974293  0.7529463  2.443834
##   0.01       3                   250     2.755733  0.7756204  2.259825
##   0.01       3                   300     2.587262  0.7943690  2.116550
##   0.01       3                   350     2.451036  0.8097066  1.999107
##   0.01       3                   400     2.345812  0.8208261  1.909308
##   0.01       3                   450     2.261539  0.8300047  1.838026
##   0.01       3                   500     2.194790  0.8367955  1.780541
##   0.01       3                   550     2.141735  0.8421377  1.734621
##   0.01       3                   600     2.099014  0.8463967  1.697518
##   0.01       3                   650     2.065256  0.8497224  1.667345
##   0.01       3                   700     2.037333  0.8524046  1.642812
##   0.01       3                   750     2.014664  0.8546810  1.621556
##   0.01       3                   800     1.997036  0.8563893  1.605104
##   0.01       3                   850     1.981980  0.8577726  1.591561
##   0.01       3                   900     1.970256  0.8588537  1.581245
##   0.01       3                   950     1.959607  0.8599272  1.571737
##   0.01       3                  1000     1.950853  0.8607409  1.564126
##   0.01       5                   100     3.463549  0.7155799  2.842018
##   0.01       5                   150     3.064422  0.7460058  2.515455
##   0.01       5                   200     2.789200  0.7710465  2.283280
##   0.01       5                   250     2.587972  0.7919578  2.110914
##   0.01       5                   300     2.437734  0.8083317  1.981188
##   0.01       5                   350     2.322488  0.8213919  1.881829
##   0.01       5                   400     2.237881  0.8304865  1.809276
##   0.01       5                   450     2.175823  0.8367943  1.755319
##   0.01       5                   500     2.126811  0.8418072  1.712096
##   0.01       5                   550     2.091984  0.8453483  1.681577
##   0.01       5                   600     2.063489  0.8482199  1.655831
##   0.01       5                   650     2.042880  0.8501954  1.637580
##   0.01       5                   700     2.025447  0.8518778  1.622050
##   0.01       5                   750     2.012324  0.8532482  1.610768
##   0.01       5                   800     2.001580  0.8543233  1.601442
##   0.01       5                   850     1.992926  0.8552844  1.594041
##   0.01       5                   900     1.984919  0.8561693  1.587337
##   0.01       5                   950     1.978046  0.8569003  1.581040
##   0.01       5                  1000     1.972501  0.8575203  1.576358
##   0.01       7                   100     3.398634  0.7231207  2.797676
##   0.01       7                   150     3.007412  0.7500725  2.473698
##   0.01       7                   200     2.743256  0.7725846  2.253930
##   0.01       7                   250     2.553572  0.7924050  2.090985
##   0.01       7                   300     2.411496  0.8079564  1.969247
##   0.01       7                   350     2.305866  0.8200020  1.876823
##   0.01       7                   400     2.231801  0.8280040  1.812280
##   0.01       7                   450     2.174362  0.8341191  1.762257
##   0.01       7                   500     2.132072  0.8387669  1.724166
##   0.01       7                   550     2.101576  0.8420252  1.696429
##   0.01       7                   600     2.078159  0.8445659  1.675085
##   0.01       7                   650     2.059786  0.8465432  1.658277
##   0.01       7                   700     2.045797  0.8479963  1.645352
##   0.01       7                   750     2.034986  0.8491369  1.635576
##   0.01       7                   800     2.025573  0.8501313  1.627067
##   0.01       7                   850     2.018402  0.8508315  1.620419
##   0.01       7                   900     2.012229  0.8515335  1.615295
##   0.01       7                   950     2.006096  0.8522187  1.609913
##   0.01       7                  1000     2.001171  0.8527190  1.605586
##   0.10       1                   100     2.189385  0.8341195  1.760321
##   0.10       1                   150     2.001107  0.8514459  1.594128
##   0.10       1                   200     1.937149  0.8573451  1.538530
##   0.10       1                   250     1.925201  0.8579859  1.529163
##   0.10       1                   300     1.912157  0.8584244  1.517905
##   0.10       1                   350     1.918202  0.8569159  1.521494
##   0.10       1                   400     1.920321  0.8567113  1.522698
##   0.10       1                   450     1.922088  0.8561339  1.523206
##   0.10       1                   500     1.925364  0.8555053  1.522996
##   0.10       1                   550     1.930539  0.8545162  1.527451
##   0.10       1                   600     1.934568  0.8536659  1.530653
##   0.10       1                   650     1.944269  0.8523735  1.538471
##   0.10       1                   700     1.948734  0.8518310  1.541398
##   0.10       1                   750     1.950977  0.8515822  1.544159
##   0.10       1                   800     1.955597  0.8508896  1.548440
##   0.10       1                   850     1.956037  0.8506911  1.550606
##   0.10       1                   900     1.962257  0.8496280  1.553193
##   0.10       1                   950     1.963255  0.8495775  1.554913
##   0.10       1                  1000     1.970251  0.8485101  1.561382
##   0.10       3                   100     2.004742  0.8489628  1.597464
##   0.10       3                   150     1.970263  0.8522357  1.570298
##   0.10       3                   200     1.958929  0.8537991  1.563142
##   0.10       3                   250     1.956258  0.8537412  1.558696
##   0.10       3                   300     1.952966  0.8540826  1.555873
##   0.10       3                   350     1.951263  0.8542494  1.555888
##   0.10       3                   400     1.949015  0.8544113  1.554264
##   0.10       3                   450     1.947614  0.8544992  1.553188
##   0.10       3                   500     1.945879  0.8546927  1.552158
##   0.10       3                   550     1.946022  0.8546597  1.552832
##   0.10       3                   600     1.946515  0.8546513  1.553333
##   0.10       3                   650     1.946098  0.8546738  1.553084
##   0.10       3                   700     1.946457  0.8545577  1.553465
##   0.10       3                   750     1.946259  0.8545705  1.553356
##   0.10       3                   800     1.946544  0.8545029  1.553759
##   0.10       3                   850     1.946461  0.8545013  1.553727
##   0.10       3                   900     1.946540  0.8544851  1.553848
##   0.10       3                   950     1.946702  0.8544701  1.554158
##   0.10       3                  1000     1.946835  0.8544401  1.554290
##   0.10       5                   100     2.074212  0.8375517  1.645462
##   0.10       5                   150     2.049447  0.8406543  1.627639
##   0.10       5                   200     2.041039  0.8416720  1.621600
##   0.10       5                   250     2.038160  0.8419322  1.618610
##   0.10       5                   300     2.034409  0.8424350  1.615651
##   0.10       5                   350     2.032116  0.8426642  1.613103
##   0.10       5                   400     2.031221  0.8427163  1.612068
##   0.10       5                   450     2.031812  0.8425815  1.612404
##   0.10       5                   500     2.030656  0.8427324  1.611574
##   0.10       5                   550     2.030682  0.8427099  1.611231
##   0.10       5                   600     2.030567  0.8427053  1.610970
##   0.10       5                   650     2.030627  0.8426798  1.610852
##   0.10       5                   700     2.030428  0.8426981  1.610657
##   0.10       5                   750     2.030506  0.8426760  1.610680
##   0.10       5                   800     2.030466  0.8426739  1.610586
##   0.10       5                   850     2.030488  0.8426569  1.610547
##   0.10       5                   900     2.030382  0.8426763  1.610528
##   0.10       5                   950     2.030322  0.8426798  1.610457
##   0.10       5                  1000     2.030436  0.8426616  1.610506
##   0.10       7                   100     2.081389  0.8366214  1.672606
##   0.10       7                   150     2.053107  0.8402639  1.647968
##   0.10       7                   200     2.046112  0.8409539  1.641513
##   0.10       7                   250     2.040473  0.8417242  1.636271
##   0.10       7                   300     2.038768  0.8417870  1.634833
##   0.10       7                   350     2.036455  0.8420851  1.632789
##   0.10       7                   400     2.034801  0.8422914  1.631696
##   0.10       7                   450     2.033867  0.8424272  1.631193
##   0.10       7                   500     2.033370  0.8424558  1.630903
##   0.10       7                   550     2.033124  0.8424821  1.630516
##   0.10       7                   600     2.032925  0.8424802  1.630483
##   0.10       7                   650     2.032753  0.8424914  1.630503
##   0.10       7                   700     2.032728  0.8424840  1.630479
##   0.10       7                   750     2.032603  0.8424892  1.630396
##   0.10       7                   800     2.032492  0.8424946  1.630299
##   0.10       7                   850     2.032456  0.8424974  1.630331
##   0.10       7                   900     2.032283  0.8425105  1.630244
##   0.10       7                   950     2.032201  0.8425214  1.630223
##   0.10       7                  1000     2.032240  0.8425151  1.630284
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 300, interaction.depth =
##  1, shrinkage = 0.1 and n.minobsinnode = 10.

Looking at the summary of the gbmTune model, the most important variables are again V4, V1, and V2. The final values used for the model were n.trees = 300, interaction.depth = 1, and shrinkage = 0.1, which gave the lowest RMSE. Note that I did not tune n.minobsinnode.
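
Rather than reading the chosen settings off the long table, they can also be pulled from the train object directly (a small sketch using caret's bestTune slot):

gbmTune$bestTune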


Cubist

Now we fit a Cubist model

## neighbors must be less than 10 and committees must start at 1
cubistGrid <- expand.grid(.committees = seq(1,100,by = 10), .neighbors = c(1:9))
set.seed(10)
cubistTune <- train(simulateX,simulateY,method = "cubist",tuneGrid = cubistGrid)
cubistTune
## Cubist 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE     
##    1          1          3.230127  0.6373371  2.385714
##    1          2          3.146872  0.6513648  2.305151
##    1          3          3.106911  0.6575103  2.270835
##    1          4          3.077654  0.6619837  2.248509
##    1          5          3.059315  0.6646654  2.230855
##    1          6          3.052870  0.6656409  2.220812
##    1          7          3.046871  0.6666078  2.210989
##    1          8          3.043294  0.6671375  2.206758
##    1          9          3.041489  0.6673762  2.204806
##   11          1          2.235859  0.8063775  1.763933
##   11          2          2.177485  0.8165337  1.708932
##   11          3          2.146532  0.8218619  1.680067
##   11          4          2.129843  0.8245148  1.664558
##   11          5          2.119062  0.8262737  1.651548
##   11          6          2.111690  0.8275447  1.642773
##   11          7          2.107824  0.8283442  1.637122
##   11          8          2.104077  0.8290377  1.631975
##   11          9          2.101851  0.8293954  1.629472
##   21          1          2.159341  0.8187959  1.709508
##   21          2          2.094262  0.8300286  1.649370
##   21          3          2.064394  0.8351712  1.616786
##   21          4          2.047922  0.8378382  1.598240
##   21          5          2.036577  0.8396537  1.583655
##   21          6          2.029147  0.8409403  1.575016
##   21          7          2.025356  0.8417179  1.570579
##   21          8          2.021269  0.8424427  1.566644
##   21          9          2.018200  0.8429489  1.564393
##   31          1          2.135175  0.8224874  1.696539
##   31          2          2.068588  0.8340052  1.635833
##   31          3          2.038067  0.8392887  1.601190
##   31          4          2.020050  0.8421911  1.582604
##   31          5          2.008462  0.8439871  1.568463
##   31          6          2.000321  0.8453193  1.559932
##   31          7          1.995620  0.8462118  1.554214
##   31          8          1.991326  0.8469512  1.548734
##   31          9          1.987786  0.8475225  1.545216
##   41          1          2.105264  0.8281054  1.678557
##   41          2          2.036020  0.8399086  1.617547
##   41          3          2.004538  0.8452660  1.584213
##   41          4          1.986087  0.8481919  1.564608
##   41          5          1.974210  0.8500324  1.550436
##   41          6          1.965682  0.8514296  1.541675
##   41          7          1.960820  0.8523427  1.535793
##   41          8          1.956340  0.8531004  1.530298
##   41          9          1.952545  0.8536972  1.526572
##   51          1          2.086555  0.8310425  1.663049
##   51          2          2.015438  0.8430979  1.602322
##   51          3          1.983381  0.8484792  1.569729
##   51          4          1.963966  0.8515496  1.551144
##   51          5          1.951406  0.8535013  1.536947
##   51          6          1.942628  0.8549512  1.527626
##   51          7          1.937865  0.8558485  1.522302
##   51          8          1.933408  0.8565990  1.516528
##   51          9          1.929520  0.8572060  1.512889
##   61          1          2.088505  0.8307583  1.665213
##   61          2          2.016160  0.8429915  1.602027
##   61          3          1.983230  0.8484894  1.568509
##   61          4          1.963964  0.8515423  1.551182
##   61          5          1.951474  0.8535049  1.536848
##   61          6          1.942529  0.8549954  1.526621
##   61          7          1.937913  0.8558718  1.521622
##   61          8          1.933730  0.8565789  1.516443
##   61          9          1.929736  0.8571981  1.513110
##   71          1          2.085890  0.8310235  1.664558
##   71          2          2.014040  0.8432065  1.599841
##   71          3          1.981510  0.8486101  1.566693
##   71          4          1.961784  0.8517243  1.548588
##   71          5          1.949342  0.8536962  1.533751
##   71          6          1.939896  0.8552920  1.523328
##   71          7          1.935068  0.8562139  1.517616
##   71          8          1.930832  0.8569342  1.512457
##   71          9          1.926663  0.8575889  1.509286
##   81          1          2.079881  0.8320253  1.660123
##   81          2          2.009984  0.8439264  1.596155
##   81          3          1.977573  0.8493317  1.563916
##   81          4          1.958099  0.8524154  1.545082
##   81          5          1.945713  0.8543883  1.531313
##   81          6          1.935980  0.8560337  1.521106
##   81          7          1.931172  0.8569500  1.516079
##   81          8          1.926945  0.8576715  1.510875
##   81          9          1.922880  0.8583187  1.507894
##   91          1          2.074699  0.8328649  1.656400
##   91          2          2.005625  0.8445872  1.593872
##   91          3          1.973647  0.8499237  1.561008
##   91          4          1.953884  0.8530686  1.542364
##   91          5          1.941827  0.8549991  1.529580
##   91          6          1.932442  0.8565946  1.520127
##   91          7          1.927759  0.8574994  1.514923
##   91          8          1.923625  0.8582107  1.509741
##   91          9          1.919571  0.8588558  1.506626
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 91 and neighbors = 9.
plot(cubistTune)

plot(varImp(cubistTune,scale = TRUE),top = 10)

I wasn’t sure what the optimal parameters were for the Cubist model, so I simply searched a grid of 1 to 100 committees (in steps of 10) and 1 to 9 neighbors. The tuning chose committees = 91 and neighbors = 9 to get the lowest RMSE. The variable importance for the Cubist model puts V1, V2, and V4 at the top, just like the other models.
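
The same programmatic check applies here (a small sketch); bestTune should report committees = 91 and neighbors = 9, matching the printed output above.

cubistTune$bestTune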


8.2 Use a simulation to show tree bias

With some help, I understand the problem to be referring to tree-based models' selection bias toward predictors with many distinct values: all else being equal, a predictor that offers more candidate split points is more likely to be chosen, so it is granted greater importance than a predictor with few distinct values.

## Let's create some simulated data.

set.seed(123)
v1 <- runif(500,1,250)
v2 <- runif(500,1,10)
v3 <- runif(500,1,500)
y <- v1+v3
df <- data.frame(v1,v2,v3,y)

head(df)
##          v1       v2        v3        y
## 1  72.60680 4.182455 137.53774 210.1445
## 2 197.28798 4.297973 297.33960 494.6276
## 3 102.83525 3.583901  80.93222 183.7675
## 4 220.87133 1.719756 426.86169 647.7330
## 5 235.17635 4.289088 424.02184 659.1982
## 6  12.34357 2.602124 239.46552 251.8091
## now we can try to run a model with our simulated data

simMod <- randomForest(y~.,data = df,importance = TRUE,ntree = 100)
varImp(simMod,scale = TRUE)
##      Overall
## v1 45.694763
## v2 -2.104113
## v3 83.356770

Here V3 receives a higher importance than V1 even though y is the sum of V1 and V3: V3 spans a wider range (1-500 versus 1-250), so it offers more widely spread split points and explains more of y's variance, while the pure-noise V2 scores lowest. One caveat: since all three predictors are continuous draws from runif, each has about 500 distinct values, so this run shows the effect of scale more than of granularity; a demonstration that isolates the bias toward many distinct values restricts the cardinality directly, as in the sketch below.
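
Below is a cleaner sketch of the granularity bias under stated assumptions: noise_many, noise_few, and biasFit are hypothetical names, neither predictor is related to the response, and cp = 0 forces rpart to split so the importance slot is populated. Because noise_many offers roughly 500 candidate split points versus 3 for noise_few, a CART-style tree will typically prefer it even though both are pure noise.

library(rpart)
set.seed(321)
n <- 500
noise_many <- runif(n)                        ## ~500 distinct values
noise_few <- sample(1:4, n, replace = TRUE)   ## only 4 distinct values
biasDf <- data.frame(noise_many, noise_few, y_rand = rnorm(n))
biasFit <- rpart(y_rand ~ ., data = biasDf, control = rpart.control(cp = 0))
biasFit$variable.importance  ## noise_many typically dominates despite being uninformative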


8.3 In stochastic gradient boosting

A.

Reading from the textbook, the learning rate (lambda, between 0 and 1) is the fraction of each new tree's predicted value that gets added to the previous iteration's prediction, and smaller values generally give better predictive accuracy. The bagging fraction is the fraction of the training data randomly sampled to fit each tree, which tends to improve predictive performance while reducing computational cost.

In this case, the left-hand plot has both the learning rate and the bagging fraction set to 0.1, and the importance is spread across many predictors. With a learning rate of 0.1, each tree makes only a small adjustment to the model's prediction, and a bagging fraction of 0.1 means only about 10% of the training data is randomly sampled for each tree. That introduces a high level of randomness into the training process, which promotes diversity among the individual trees in the ensemble and helps prevent overfitting.

On the other hand, the right-hand plot has the learning rate and bagging fraction set to 0.9. Each tree's contribution to the ensemble is then relatively large, which speeds convergence but can lead to overfitting, and a bagging fraction of 0.9 means each tree sees nearly all of the training data, so there is far less randomness than with 0.1. The trees end up looking alike and repeatedly split on the same strong predictors, concentrating the importance on a few of them, whereas the model on the left spreads importance across more predictors because its trees are more diverse. A quick sketch of the contrast follows.
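
A rough sketch of the two panels' settings (gbmLow and gbmHigh are hypothetical names, and the figure in the book used different data, so this only mirrors the idea): fit gbm directly with (shrinkage, bag.fraction) of (0.1, 0.1) versus (0.9, 0.9) and compare how concentrated the relative influence is. With bag.fraction = 0.1 each tree sees only about 20 of the 200 rows, so gbm may warn about small node sizes.

library(gbm)
set.seed(10)
gbmLow <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbmHigh <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbmLow, plotit = FALSE)    ## influence spread across more predictors
summary(gbmHigh, plotit = FALSE)   ## influence concentrated on a few predictors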

B.

The model on the left would likely perform better on new samples, at the cost of computation time. It should overfit less and generalize better, since the higher level of randomness in the training process keeps the trees diverse and each tree makes only a small contribution to the ensemble.

C.

Interaction depth is the maximum depth of each tree; each subsequent split can be thought of as a higher-level interaction term with the previously split predictors.

If we increase the interaction depth, I would expect a few predictors to dominate, similar to the plot on the right: increasing the number of splits makes the model more complex, which can lead to overfitting and poor generalization to new data, and may cause the variable importance to concentrate on a select few predictors.


8.7 Refer to Exercises 6.3 and 7.6

As before, I will employ the same data-cleaning and imputation process from the other exercises.

Data Preprocessing

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
data("ChemicalManufacturingProcess")
## we have to find the columns with missing values

na_counts <- colSums(is.na(ChemicalManufacturingProcess))

cols_w_na <- names(na_counts[na_counts > 0])

cols_w_na
##  [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
##  [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"
##  [7] "ManufacturingProcess07" "ManufacturingProcess08" "ManufacturingProcess10"
## [10] "ManufacturingProcess11" "ManufacturingProcess12" "ManufacturingProcess14"
## [13] "ManufacturingProcess22" "ManufacturingProcess23" "ManufacturingProcess24"
## [16] "ManufacturingProcess25" "ManufacturingProcess26" "ManufacturingProcess27"
## [19] "ManufacturingProcess28" "ManufacturingProcess29" "ManufacturingProcess30"
## [22] "ManufacturingProcess31" "ManufacturingProcess33" "ManufacturingProcess34"
## [25] "ManufacturingProcess35" "ManufacturingProcess36" "ManufacturingProcess40"
## [28] "ManufacturingProcess41"

It appears the missing values are all in the ManufacturingProcess columns.

## Check each column and impute it 

trans <- preProcess(ChemicalManufacturingProcess,method = "knnImpute")

We use the preProcess function and apply knnImpute, following section 3.9 of the textbook.

## Applying predict() is the only way I have found to actually view the imputed values
imp <- predict(trans,newdata = ChemicalManufacturingProcess)

head(imp)
##        Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 -1.1792673           -0.2261036           -1.5140979          -2.68303622
## 2  1.2263678            2.2391498            1.3089960          -0.05623504
## 3  1.0042258            2.2391498            1.3089960          -0.05623504
## 4  0.6737219            2.2391498            1.3089960          -0.05623504
## 5  1.2534583            1.4827653            1.8939391           1.13594780
## 6  1.8386128           -0.4081962            0.6620886          -0.59859075
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1            0.2201765            0.4941942           -1.3828880
## 2            1.2964386            0.4128555            1.1290767
## 3            1.2964386            0.4128555            1.1290767
## 4            1.2964386            0.4128555            1.1290767
## 5            0.9414412           -0.3734185            1.5348350
## 6            1.5894524            1.7305423            0.6192092
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1           -0.1313107            -1.233131           -3.3962895
## 2           -0.1313107             2.282619           -0.7227225
## 3           -0.1313107             2.282619           -0.7227225
## 4           -0.1313107             2.282619           -0.7227225
## 5           -0.1313107             1.071310           -0.1205678
## 6           -0.1313107             1.189487           -1.7343424
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1            1.1005296            -1.838655           -1.7709224
## 2            1.1005296             1.393395            1.0989855
## 3            1.1005296             1.393395            1.0989855
## 4            1.1005296             1.393395            1.0989855
## 5            0.4162193             0.136256            1.0989855
## 6            1.6346255             1.022062            0.7240877
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1              0.2154105              0.5662872              0.3765810
## 2             -6.1497028             -1.9692525              0.1979962
## 3             -6.1497028             -1.9692525              0.1087038
## 4             -6.1497028             -1.9692525              0.4658734
## 5             -0.2784345             -1.9692525              0.1087038
## 6              0.4348971             -1.9692525              0.5551658
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1              0.5655598            -0.44593467             -0.5414997
## 2             -2.3669726             0.99933318              0.9625383
## 3             -3.1638563             0.06246417             -0.1117745
## 4             -3.3232331             0.42279841              2.1850322
## 5             -2.2075958             0.84537219             -0.6304083
## 6             -1.2513352             0.49486525              0.5550403
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1             -0.1596700             -0.3095182             -1.7201524
## 2             -0.9580199              0.8941637              0.5883746
## 3              1.0378549              0.8941637             -0.3815947
## 4             -0.9580199             -1.1119728             -0.4785917
## 5              1.0378549              0.8941637             -0.4527258
## 6              1.0378549              0.8941637             -0.2199332
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1            -0.07700901            -0.09157342             -0.4806937
## 2             0.52297397             1.08204765             -0.4806937
## 3             0.31428424             0.55112383             -0.4806937
## 4            -0.02483658             0.80261406             -0.4806937
## 5            -0.39004361             0.10403009             -0.4806937
## 6             0.28819802             1.41736795             -0.4806937
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1             0.97711512              0.8093999              1.1846438
## 2            -0.50030980              0.2775205              0.9617071
## 3             0.28765016              0.4425865              0.8245152
## 4             0.28765016              0.7910592              1.0817499
## 5             0.09066017              2.5334227              3.3282665
## 6            -0.50030980              2.4050380              3.1396277
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1              0.3303945              0.9263296              0.1505348
## 2              0.1455765             -0.2753953              0.1559773
## 3              0.1455765              0.3655246              0.1831898
## 4              0.1967569              0.3655246              0.1695836
## 5              0.4754056             -0.3555103              0.2076811
## 6              0.6261033             -0.7560852              0.1423710
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1              0.4563798              0.3109942              0.2109804
## 2              1.5095063              0.1849230              0.2109804
## 3              1.0926437              0.1849230              0.2109804
## 4              0.9829430              0.1562704              0.2109804
## 5              1.6192070              0.2938027             -0.6884239
## 6              1.9044287              0.3998171             -0.5599376
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1             0.05833309              0.8317688              0.8907291
## 2            -0.72230090             -1.8147683             -1.0060115
## 3            -0.42205706             -1.2132826             -0.8335805
## 4            -0.12181322             -0.6117969             -0.6611496
## 5             0.77891831              0.5911745              1.5804530
## 6             1.07916216             -1.2132826             -1.3508734
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1              0.1200183              0.1256347              0.3460352
## 2              0.1093082              0.1966227              0.1906613
## 3              0.1842786              0.2159831              0.2104362
## 4              0.1708910              0.2052273              0.1906613
## 5              0.2726365              0.2912733              0.3432102
## 6              0.1146633              0.2417969              0.3516852
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1              0.7826636              0.5943242              0.7566948
## 2              0.8779201              0.8347250              0.7566948
## 3              0.8588688              0.7746248              0.2444430
## 4              0.8588688              0.7746248              0.2444430
## 5              0.8969714              0.9549255             -0.1653585
## 6              0.9160227              1.0150257              0.9615956
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1             -0.1952552             -0.4568829              0.9890307
## 2             -0.2672523              1.9517531              0.9890307
## 3             -0.1592567              2.6928719              0.9890307
## 4             -0.1592567              2.3223125              1.7943843
## 5             -0.1412574              2.3223125              2.5997378
## 6             -0.3572486              2.6928719              2.5997378
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1             -1.7202722            -0.88694718             -0.6557774
## 2              1.9568096             1.14638329             -0.6557774
## 3              1.9568096             1.23880740             -1.8000420
## 4              0.1182687             0.03729394             -1.8000420
## 5              0.1182687            -2.55058120             -2.9443066
## 6              0.1182687            -0.51725073             -1.8000420
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1             -1.1540243              0.7174727              0.2317270
## 2              2.2161351             -0.8224687              0.2317270
## 3             -0.7046697             -0.8224687              0.2317270
## 4              0.4187168             -0.8224687              0.2317270
## 5             -1.8280562             -0.8224687              0.2981503
## 6             -1.3787016             -0.8224687              0.2317270
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1             0.05969714            -0.06900773             0.20279570
## 2             2.14909691             2.34626280            -0.05472265
## 3            -0.46265281            -0.44058781             0.40881037
## 4            -0.46265281            -0.44058781            -0.31224099
## 5            -0.46265281            -0.44058781            -0.10622632
## 6            -0.46265281            -0.44058781             0.15129203
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1             2.40564734            -0.01588055             0.64371849
## 2            -0.01374656             0.29467248             0.15220242
## 3             0.10146268            -0.01588055             0.39796046
## 4             0.21667191            -0.01588055            -0.09355562
## 5             0.21667191            -0.32643359            -0.09355562
## 6             1.48397347            -0.01588055            -0.33931365

All of the values were transformed (knnImpute in preProcess also centers and scales the predictors), and the missing values were imputed with KNN; the default is k = 5.
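
Two quick sanity checks (a sketch): no missing values should remain after the imputation, and the preProcess object records every transformation it applied.

sum(is.na(imp))   ## expect 0 after imputation
trans$method      ## knnImpute plus the implied centering and scaling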

## Separate the response (Yield) from the predictors
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:party':
## 
##     where
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
impnoY <- imp %>%
  select(-Yield)
set.seed(1)
trainRow <- createDataPartition(imp$Yield, p=0.8, list=FALSE)
imp.train <- impnoY[trainRow, ]
Yield.train <- imp[trainRow,]$Yield
imp.test <- impnoY[-trainRow, ]
Yield.test <- imp[-trainRow,]$Yield

A. Train several models

RandomF

I am going to use the same settings the textbook used

set.seed(100)
rpT <- train(imp.train,Yield.train,method = "rf",tuneLength = 10,trControl = trainControl(method = "cv"))
rpT
## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.6916734  0.6155629  0.5524264
##    8    0.6340945  0.6511338  0.4977362
##   14    0.6307533  0.6457011  0.4890549
##   20    0.6283848  0.6374897  0.4872593
##   26    0.6316846  0.6319565  0.4869175
##   32    0.6358197  0.6203305  0.4855248
##   38    0.6367755  0.6177617  0.4873467
##   44    0.6396269  0.6146241  0.4850697
##   50    0.6433851  0.6085142  0.4890040
##   57    0.6499582  0.5963269  0.4960372
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
plot(varImp(rpT,scale = TRUE),top = 25)

The variable importance plot shows that ManufacturingProcess32 played by far the biggest part in the model, followed by BiologicalMaterial12 and BiologicalMaterial03.

Test Data

rpTpred <- predict(rpT,imp.test)
postResample(rpTpred,Yield.test)
##      RMSE  Rsquared       MAE 
## 0.5052724 0.7014780 0.4111476

It appears that the random forest model performed quite well on the test set. As the variable importance plot showed, ManufacturingProcess32 dominated its predictions.


Boosted Trees

Likewise I will follow the textbook but tweak some parameters

## The textbook suggests between 100 and 1000 trees with shrinkage as low as 0.01 for accurate results; for some reason n.minobsinnode won't accept anything besides 10
## NOTE: I got a warning that BiologicalMaterial07 has no variation
gbMGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 100),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10)
set.seed(10)
gbmModel <- train(imp.train,Yield.train,method = "gbm",tuneGrid = gbMGrid,verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   0.01       1                   100     0.8064048  0.4720621  0.6356502
##   0.01       1                   200     0.7389415  0.5072909  0.5741904
##   0.01       1                   300     0.7148848  0.5176540  0.5536391
##   0.01       1                   400     0.7057699  0.5223404  0.5457539
##   0.01       1                   500     0.7023519  0.5250374  0.5436225
##   0.01       1                   600     0.6998363  0.5281716  0.5432279
##   0.01       1                   700     0.6987112  0.5299070  0.5431603
##   0.01       1                   800     0.6974424  0.5323634  0.5432976
##   0.01       1                   900     0.6966755  0.5332840  0.5433620
##   0.01       1                  1000     0.6962344  0.5346120  0.5434701
##   0.01       3                   100     0.7686338  0.4970681  0.6023272
##   0.01       3                   200     0.7138938  0.5198361  0.5522335
##   0.01       3                   300     0.6990372  0.5290918  0.5399262
##   0.01       3                   400     0.6910796  0.5373960  0.5345625
##   0.01       3                   500     0.6853474  0.5442075  0.5308698
##   0.01       3                   600     0.6816615  0.5488776  0.5291221
##   0.01       3                   700     0.6789491  0.5525289  0.5275279
##   0.01       3                   800     0.6764236  0.5557242  0.5261585
##   0.01       3                   900     0.6741767  0.5590358  0.5247783
##   0.01       3                  1000     0.6723232  0.5613554  0.5236432
##   0.01       5                   100     0.7649127  0.4963699  0.5982528
##   0.01       5                   200     0.7120267  0.5194987  0.5499892
##   0.01       5                   300     0.6950794  0.5334203  0.5355699
##   0.01       5                   400     0.6859930  0.5432696  0.5288009
##   0.01       5                   500     0.6793451  0.5512589  0.5240791
##   0.01       5                   600     0.6751921  0.5561559  0.5214576
##   0.01       5                   700     0.6714907  0.5608709  0.5190162
##   0.01       5                   800     0.6690672  0.5640189  0.5172337
##   0.01       5                   900     0.6669269  0.5668241  0.5157023
##   0.01       5                  1000     0.6652818  0.5689316  0.5146214
##   0.01       7                   100     0.7638564  0.4987634  0.5981485
##   0.01       7                   200     0.7144239  0.5163737  0.5516890
##   0.01       7                   300     0.6986248  0.5291540  0.5392048
##   0.01       7                   400     0.6892415  0.5397417  0.5319934
##   0.01       7                   500     0.6821143  0.5479971  0.5262743
##   0.01       7                   600     0.6779614  0.5532363  0.5237892
##   0.01       7                   700     0.6743706  0.5577751  0.5216182
##   0.01       7                   800     0.6717896  0.5611722  0.5199110
##   0.01       7                   900     0.6696070  0.5640257  0.5187355
##   0.01       7                  1000     0.6677799  0.5664430  0.5177671
##   0.10       1                   100     0.7045709  0.5258319  0.5489964
##   0.10       1                   200     0.7061660  0.5305659  0.5540292
##   0.10       1                   300     0.7074474  0.5323695  0.5572586
##   0.10       1                   400     0.7081509  0.5346540  0.5582108
##   0.10       1                   500     0.7088288  0.5360446  0.5589189
##   0.10       1                   600     0.7106043  0.5355472  0.5605557
##   0.10       1                   700     0.7116717  0.5355486  0.5612190
##   0.10       1                   800     0.7126189  0.5353060  0.5616922
##   0.10       1                   900     0.7130504  0.5354639  0.5618976
##   0.10       1                  1000     0.7128102  0.5362698  0.5614193
##   0.10       3                   100     0.6851424  0.5463236  0.5332492
##   0.10       3                   200     0.6762695  0.5593214  0.5282412
##   0.10       3                   300     0.6726042  0.5644515  0.5255678
##   0.10       3                   400     0.6710785  0.5665157  0.5246876
##   0.10       3                   500     0.6697268  0.5682110  0.5236838
##   0.10       3                   600     0.6693151  0.5688296  0.5234091
##   0.10       3                   700     0.6690860  0.5692138  0.5232357
##   0.10       3                   800     0.6688882  0.5695239  0.5230945
##   0.10       3                   900     0.6687854  0.5696785  0.5230290
##   0.10       3                  1000     0.6687501  0.5697466  0.5230221
##   0.10       5                   100     0.6728541  0.5594173  0.5205407
##   0.10       5                   200     0.6705948  0.5647082  0.5208694
##   0.10       5                   300     0.6688556  0.5671566  0.5204614
##   0.10       5                   400     0.6684876  0.5681810  0.5206057
##   0.10       5                   500     0.6684861  0.5682461  0.5207427
##   0.10       5                   600     0.6685479  0.5682903  0.5208857
##   0.10       5                   700     0.6686819  0.5681903  0.5210624
##   0.10       5                   800     0.6687102  0.5681644  0.5211203
##   0.10       5                   900     0.6688525  0.5680708  0.5212386
##   0.10       5                  1000     0.6689022  0.5680245  0.5212259
##   0.10       7                   100     0.6786065  0.5554930  0.5269564
##   0.10       7                   200     0.6715366  0.5662973  0.5246063
##   0.10       7                   300     0.6701918  0.5690034  0.5243619
##   0.10       7                   400     0.6701049  0.5698053  0.5245933
##   0.10       7                   500     0.6698702  0.5702582  0.5247019
##   0.10       7                   600     0.6698670  0.5705163  0.5249064
##   0.10       7                   700     0.6699607  0.5705703  0.5251088
##   0.10       7                   800     0.6700546  0.5705302  0.5252277
##   0.10       7                   900     0.6700758  0.5705695  0.5253010
##   0.10       7                  1000     0.6701524  0.5705440  0.5254000
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  5, shrinkage = 0.01 and n.minobsinnode = 10.
plot(gbmModel)

Test Data

gbmPred <- predict(gbmModel,imp.test)
postResample(gbmPred,Yield.test)
##      RMSE  Rsquared       MAE 
## 0.5051136 0.7034703 0.4053230
## For some reason varImp did not work on this model, so I use summary() instead
summary(gbmModel,plot = FALSE)
##                                           var      rel.inf
## ManufacturingProcess32 ManufacturingProcess32 29.430239206
## BiologicalMaterial12     BiologicalMaterial12  6.506883140
## ManufacturingProcess17 ManufacturingProcess17  5.172638931
## ManufacturingProcess09 ManufacturingProcess09  4.719923604
## ManufacturingProcess13 ManufacturingProcess13  4.495867321
## BiologicalMaterial03     BiologicalMaterial03  3.379207708
## ManufacturingProcess06 ManufacturingProcess06  2.677323118
## BiologicalMaterial09     BiologicalMaterial09  2.519894323
## BiologicalMaterial05     BiologicalMaterial05  2.102527204
## ManufacturingProcess31 ManufacturingProcess31  2.081904623
## BiologicalMaterial06     BiologicalMaterial06  1.545249054
## ManufacturingProcess15 ManufacturingProcess15  1.450036115
## ManufacturingProcess39 ManufacturingProcess39  1.443892164
## ManufacturingProcess20 ManufacturingProcess20  1.438013030
## ManufacturingProcess05 ManufacturingProcess05  1.396929167
## ManufacturingProcess04 ManufacturingProcess04  1.311863932
## ManufacturingProcess37 ManufacturingProcess37  1.309760338
## BiologicalMaterial11     BiologicalMaterial11  1.263095877
## ManufacturingProcess14 ManufacturingProcess14  1.251347047
## BiologicalMaterial08     BiologicalMaterial08  1.138289395
## ManufacturingProcess01 ManufacturingProcess01  1.137858067
## ManufacturingProcess22 ManufacturingProcess22  1.055807797
## ManufacturingProcess21 ManufacturingProcess21  1.027038257
## BiologicalMaterial01     BiologicalMaterial01  1.021839032
## ManufacturingProcess30 ManufacturingProcess30  1.005055326
## ManufacturingProcess29 ManufacturingProcess29  0.982536554
## ManufacturingProcess18 ManufacturingProcess18  0.949178393
## ManufacturingProcess27 ManufacturingProcess27  0.946929203
## ManufacturingProcess16 ManufacturingProcess16  0.895659998
## ManufacturingProcess11 ManufacturingProcess11  0.893156421
## ManufacturingProcess28 ManufacturingProcess28  0.881865464
## ManufacturingProcess03 ManufacturingProcess03  0.874777308
## ManufacturingProcess19 ManufacturingProcess19  0.833697841
## BiologicalMaterial10     BiologicalMaterial10  0.820607514
## ManufacturingProcess36 ManufacturingProcess36  0.809658556
## ManufacturingProcess26 ManufacturingProcess26  0.791116593
## ManufacturingProcess24 ManufacturingProcess24  0.790534012
## ManufacturingProcess43 ManufacturingProcess43  0.768802269
## ManufacturingProcess02 ManufacturingProcess02  0.749025987
## BiologicalMaterial02     BiologicalMaterial02  0.737091519
## ManufacturingProcess42 ManufacturingProcess42  0.663055127
## BiologicalMaterial04     BiologicalMaterial04  0.612922969
## ManufacturingProcess35 ManufacturingProcess35  0.514487025
## ManufacturingProcess34 ManufacturingProcess34  0.503078695
## ManufacturingProcess45 ManufacturingProcess45  0.486996410
## ManufacturingProcess33 ManufacturingProcess33  0.482358747
## ManufacturingProcess23 ManufacturingProcess23  0.474051804
## ManufacturingProcess25 ManufacturingProcess25  0.408049054
## ManufacturingProcess44 ManufacturingProcess44  0.345235347
## ManufacturingProcess07 ManufacturingProcess07  0.321921189
## ManufacturingProcess10 ManufacturingProcess10  0.184711113
## ManufacturingProcess38 ManufacturingProcess38  0.174111928
## ManufacturingProcess12 ManufacturingProcess12  0.169258144
## ManufacturingProcess08 ManufacturingProcess08  0.046092293
## ManufacturingProcess40 ManufacturingProcess40  0.006548748
## BiologicalMaterial07     BiologicalMaterial07  0.000000000
## ManufacturingProcess41 ManufacturingProcess41  0.000000000

The variable importance once again shows ManufacturingProcess32 as by far the most influential predictor, at about 29% relative influence. Overall, the ManufacturingProcess predictors played a bigger role in the gbm model than the BiologicalMaterial predictors.
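
For what it's worth, caret does ship a varImp method for gbm-based train objects, so depending on the installed versions this may work as an alternative to summary(); treat it as an untested sketch here, since the call failed in my session.

plot(varImp(gbmModel, scale = TRUE), top = 10)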


Cubist

## neighbors must be less than 10 and committees must start at 1
## I got a warning about NA coercion, which I have ignored
cubistGrid2 <- expand.grid(.committees = seq(1,100,by = 10), .neighbors = c(1:9))
set.seed(10)
cubistTune2 <- train(imp.train,Yield.train,method = "cubist",tuneGrid = cubistGrid2)
cubistTune2
## Cubist 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##    1          1          0.9901747  0.3517646  0.7019520
##    1          2          0.9767733  0.3587294  0.6917103
##    1          3          0.9775609  0.3565926  0.6928085
##    1          4          0.9772473  0.3561552  0.6928682
##    1          5          0.9809024  0.3540057  0.6954490
##    1          6          0.9836019  0.3504534  0.6989831
##    1          7          0.9860281  0.3478040  0.7010217
##    1          8          0.9870537  0.3469707  0.7006243
##    1          9          0.9888389  0.3446509  0.7023363
##   11          1          0.6726770  0.5779530  0.5069547
##   11          2          0.6693583  0.5802986  0.5078205
##   11          3          0.6713463  0.5780573  0.5091482
##   11          4          0.6727348  0.5765396  0.5116252
##   11          5          0.6762010  0.5725706  0.5159387
##   11          6          0.6788873  0.5690574  0.5178690
##   11          7          0.6808974  0.5664567  0.5193507
##   11          8          0.6821534  0.5650557  0.5204960
##   11          9          0.6839219  0.5630240  0.5220857
##   21          1          0.6457204  0.6041539  0.4911921
##   21          2          0.6432198  0.6062964  0.4905463
##   21          3          0.6457996  0.6034255  0.4932120
##   21          4          0.6469975  0.6024796  0.4955241
##   21          5          0.6501749  0.5986685  0.4991011
##   21          6          0.6526930  0.5954945  0.5007140
##   21          7          0.6545790  0.5931616  0.5018465
##   21          8          0.6558545  0.5918792  0.5032987
##   21          9          0.6574068  0.5901986  0.5047408
##   31          1          0.6415487  0.6088933  0.4909079
##   31          2          0.6372401  0.6135743  0.4892726
##   31          3          0.6409720  0.6100015  0.4919199
##   31          4          0.6425445  0.6086123  0.4939111
##   31          5          0.6446844  0.6060659  0.4969715
##   31          6          0.6471710  0.6028612  0.4988821
##   31          7          0.6492807  0.6002639  0.5002502
##   31          8          0.6505461  0.5990120  0.5018511
##   31          9          0.6524279  0.5970152  0.5036007
##   41          1          0.6374353  0.6161319  0.4860259
##   41          2          0.6338339  0.6200631  0.4855146
##   41          3          0.6371555  0.6166169  0.4877380
##   41          4          0.6385867  0.6153186  0.4897771
##   41          5          0.6415992  0.6119155  0.4931891
##   41          6          0.6440656  0.6087349  0.4948693
##   41          7          0.6458245  0.6067236  0.4965678
##   41          8          0.6471298  0.6054700  0.4980097
##   41          9          0.6489159  0.6034554  0.4996487
##   51          1          0.6324312  0.6210343  0.4822490
##   51          2          0.6294971  0.6244339  0.4824047
##   51          3          0.6323635  0.6212642  0.4847729
##   51          4          0.6339649  0.6197914  0.4872719
##   51          5          0.6369684  0.6164150  0.4905297
##   51          6          0.6394431  0.6132018  0.4923868
##   51          7          0.6412083  0.6111035  0.4940972
##   51          8          0.6425062  0.6098737  0.4953493
##   51          9          0.6441022  0.6079745  0.4967655
##   61          1          0.6283824  0.6251717  0.4792889
##   61          2          0.6257165  0.6281235  0.4797443
##   61          3          0.6289791  0.6246697  0.4827440
##   61          4          0.6306025  0.6231638  0.4853647
##   61          5          0.6336645  0.6196729  0.4887697
##   61          6          0.6361127  0.6164481  0.4902736
##   61          7          0.6380433  0.6141326  0.4921060
##   61          8          0.6392057  0.6130291  0.4931808
##   61          9          0.6411548  0.6107932  0.4948025
##   71          1          0.6261533  0.6263781  0.4776903
##   71          2          0.6239535  0.6290588  0.4778727
##   71          3          0.6272521  0.6252707  0.4804417
##   71          4          0.6289605  0.6236763  0.4831259
##   71          5          0.6318414  0.6207077  0.4870008
##   71          6          0.6343505  0.6174583  0.4885677
##   71          7          0.6358722  0.6155179  0.4898814
##   71          8          0.6370792  0.6143337  0.4909451
##   71          9          0.6389271  0.6121181  0.4926474
##   81          1          0.6247571  0.6275089  0.4767110
##   81          2          0.6221596  0.6305620  0.4765588
##   81          3          0.6256985  0.6266827  0.4794807
##   81          4          0.6271057  0.6254188  0.4823223
##   81          5          0.6302172  0.6219933  0.4858486
##   81          6          0.6326209  0.6188245  0.4873722
##   81          7          0.6342693  0.6168432  0.4888876
##   81          8          0.6355154  0.6156301  0.4900435
##   81          9          0.6374468  0.6134548  0.4917374
##   91          1          0.6244149  0.6280700  0.4766628
##   91          2          0.6217945  0.6309046  0.4763706
##   91          3          0.6253583  0.6270347  0.4791456
##   91          4          0.6268662  0.6255992  0.4818895
##   91          5          0.6297599  0.6223026  0.4853003
##   91          6          0.6321944  0.6190515  0.4869444
##   91          7          0.6339079  0.6169692  0.4884489
##   91          8          0.6350832  0.6158384  0.4896520
##   91          9          0.6368461  0.6138311  0.4912611
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 91 and neighbors = 2.
plot(cubistTune2)

plot(varImp(cubistTune2,scale = TRUE),top =20)

Test Data

cubPred <- predict(cubistTune2,imp.test)
postResample(cubPred,Yield.test)
##      RMSE  Rsquared       MAE 
## 0.4278131 0.7769360 0.3338987

Looking at the Cubist model we created from our tuning, it appears to have performed very well on the test data. The variable importance plot shows that ManufacturingProcess32, 13, 17, and 19 played the biggest roles in the model. The final parameters were committees = 91 and neighbors = 2, which gave the lowest RMSE.


Model-Comparison

Tablee3 <- bind_rows(Random_Forest = postResample(rpTpred, Yield.test),
                     cubist = postResample(cubPred, Yield.test),
                     boosted_tree = postResample(gbmPred, Yield.test))

Tablee3$id = c('Random_Forest','Cubist','Boosted Tree')

Tablee3
## # A tibble: 3 × 4
##    RMSE Rsquared   MAE id           
##   <dbl>    <dbl> <dbl> <chr>        
## 1 0.505    0.701 0.411 Random_Forest
## 2 0.428    0.777 0.334 Cubist       
## 3 0.505    0.703 0.405 Boosted Tree

Looking at the three models we have built (Cubist, random forest, and boosted tree), the Cubist model performed exceptionally well on the test data, with an RMSE of about 0.43 and an R-squared of about 0.78. The random forest and the boosted tree have nearly identical RMSE and R-squared values. In this case we would go with the Cubist model for the better fit.
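
A small sketch using the dplyr verbs already loaded: sorting the comparison table by RMSE puts the best model on top.

Tablee3 %>% arrange(RMSE)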


B.

Looking at the variable importance for each model, the number-one predictor in every one of them was ManufacturingProcess32. For the Cubist model, ManufacturingProcess32, 13, and 17 were the top three predictors, whereas for the random forest and boosted tree the top of the list is a mix of ManufacturingProcess and BiologicalMaterial predictors. The top-10 predictors look similar to those from the nonlinear regression models in the previous homework, but quite different from the seventh homework, where ManufacturingProcess predictors dominated the top 10.
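
To make that comparison concrete, here is a sketch (top10 is a hypothetical helper; it assumes varImp behaves for the rf and Cubist train objects and reuses the summary() call from above for the gbm) that lines up each model's top-10 predictors side by side.

top10 <- function(tr) {
  vi <- varImp(tr)$importance
  rownames(vi)[order(vi$Overall, decreasing = TRUE)][1:10]
}
data.frame(RandomForest = top10(rpT),
           Cubist = top10(cubistTune2),
           BoostedTree = as.character(head(summary(gbmModel, plot = FALSE)$var, 10)))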


C.

For some reason the partykit package is not available for the newest version of R, so I resorted to the rpart.plot package to plot the decision tree. Most of the splits the tree made were on ManufacturingProcess predictors.

library(rpart)
rpartTree <- rpart(Yield ~ ., data = imp)
rpart.plot::rpart.plot(rpartTree)
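
For completeness, if partykit does install on a given R version, the same fitted rpart tree can be converted and drawn with party-style node plots (left commented out here since the package was unavailable in my session):

## library(partykit)
## plot(partykit::as.party(rpartTree))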