Exercise 8.1 Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200,sd=1)
simulated <- cbind(simulated$x,simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a) Fit a random forest model to all of the predictors, then estimate the variable importance scores.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
model1 <- randomForest(y~., data=simulated,
                       importance=TRUE,
                       ntree=1000)
rfImp1 <- varImp(model1,scale=FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1,simulated$V1)
## [1] 0.9460206

Fit another random forest model to this data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y~., data=simulated,
                       importance=TRUE,
                       ntree=1000)
rfImp2 <- varImp(model2,scale=FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

When two predictors are highly correlated, the importance is split between them. In this example, the importance of V1 drops from 8.732 to 5.691, a decrease of 34.825%.

The importance of duplicate1 is 4.283, which is 49.052% of the original V1 value (that is, 50.948% below it).

The two scores together (5.691 + 4.283 = 9.975) exceed the original V1 importance of 8.732, so the correlated copy appears to add some value of its own. To really know whether value was added, we would need to see how much the accuracy of the predictions changed.
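The percentages quoted above can be checked directly from the printed importance scores. This is a minimal sketch in base R; the three scores are copied by hand from the rfImp1 and rfImp2 output, and the variable names are introduced here only for illustration.

```r
# Importance scores copied from the printed rfImp1 and rfImp2 output
v1_before <- 8.732235404   # V1 in model1 (no duplicate)
v1_after  <- 5.69119973    # V1 in model2 (with duplicate1)
dup1      <- 4.28331581    # duplicate1 in model2

# Percentage change in V1 importance after adding duplicate1
pct_change_v1 <- (v1_after - v1_before) / v1_before * 100

# duplicate1 importance as a share of the original V1 importance,
# and the same figure expressed as a gap relative to 100%
dup_share <- dup1 / v1_before * 100
dup_gap   <- dup_share - 100

# Combined importance of V1 and its correlated copy
combined <- v1_after + dup1

round(c(pct_change_v1 = pct_change_v1,
        dup_share     = dup_share,
        dup_gap       = dup_gap,
        combined      = combined), 3)
```

If the combined score is larger than the original V1 score, the two correlated predictors together carry more importance than V1 did alone.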

c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Yes, it shows a similar pattern: the importance levels are all fairly close to those of the original random forest model (model1). Variables V7 through V10 are negative in model3, just as they are in the original model. The only sign differences are with model2, the model that includes duplicate1, where V9 and V10 are slightly positive.

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
model3 <- cforest(y ~ ., data = simulated[, c(1:11)])

rfImp3 <- varimp(model3, conditional = TRUE)

rfImp3
##           V1           V2           V3           V4           V5           V6 
##  5.471457361  5.166826657  0.020994281  6.689072245  1.256076719  0.004925215 
##           V7           V8           V9          V10 
## -0.008184439 -0.009141850 -0.012594617 -0.013200299
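The exercise mentions that the conditional argument of varimp toggles between the two importance measures, while the fit above reports only the conditional version. The following is a hedged sketch of how both could be put side by side. It rebuilds the simulated data with the same seed; the object names (sim_df, cf_fit, imp_compare) and the second set.seed call are assumptions made for illustration, and no results are claimed here.

```r
library(mlbench)
library(party)

# Recreate the simulated data used above (same seed and settings)
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
sim_df <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim_df)[ncol(sim_df)] <- "y"

# Fit one conditional inference forest with the package defaults.
# The party:: prefix avoids picking up a masked cforest/varimp
# if partykit is also attached.
set.seed(200)  # assumption: seed chosen only to make the forest repeatable
cf_fit <- party::cforest(y ~ ., data = sim_df)

# Traditional (unconditional) permutation importance
imp_traditional <- party::varimp(cf_fit, conditional = FALSE)

# Modified (conditional) importance; this call can be slow
imp_conditional <- party::varimp(cf_fit, conditional = TRUE)

# Side-by-side comparison, one row per predictor
imp_compare <- cbind(traditional = imp_traditional,
                     conditional = imp_conditional)
round(imp_compare, 3)
```

Comparing the two columns row by row shows whether the ranking of V1 through V10 changes when the conditional measure is used.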

d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

In general the pattern holds, more so with the gradient boosting model than with Cubist. The gradient boosting model ranks V1, V2 and V4 as the most impactful variables, with V3 and V5 less impactful, just as in the original random forest model. The remaining variables are considerably less important, as they are in the original model.

The Cubist model puts the same five variables at the top, although V3 edges out V5 there, whereas random forest ranks V5 above V3. The main difference is that the weakest variables, V7 through V10, have no effect at all.

library(gbm)
## Loaded gbm 2.1.8.1
# Tuning grid for the boosted trees (named gbm_grid_sim so it does not shadow the gbm function)
gbm_grid_sim <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                            n.trees = seq(100, 1000, by = 50),
                            shrinkage = c(0.01, 0.1),
                            n.minobsinnode = 10)
set.seed(100)

gbm_tune <- train(y ~ ., data = simulated[, c(1:11)],
                  method = "gbm",
                  tuneGrid = gbm_grid_sim,
                  verbose = FALSE)

gbmImp <- varImp(gbm_tune$finalModel, scale = FALSE)

gbmImp
##       Overall
## V1  4634.0234
## V2  4316.6044
## V3  1310.7178
## V4  4287.0793
## V5  1844.2147
## V6   416.3966
## V7   368.9383
## V8   224.4769
## V9   246.1400
## V10  250.6263
library(Cubist)
cube <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
cubeImp <- varImp(cube$finalModel, scale = FALSE)

cubeImp
##     Overall
## V1     72.0
## V3     42.0
## V2     54.5
## V4     49.0
## V5     40.0
## V6     11.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0
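To make the comparison of patterns concrete, the printed scores can be turned into ranks. This is a small base R sketch; the numbers are copied by hand from the rfImp1, gbmImp and cubeImp output (the Cubist rows are reordered to V1 through V10), and the object names are introduced here for illustration.

```r
# Importance scores copied from the printed output, in V1..V10 order
imp <- data.frame(
  rf     = c(8.732235404, 6.415369387, 0.763591825, 7.615118809,
             2.023524577, 0.165111172, -0.005961659, -0.166362581,
             -0.095292651, -0.074944788),
  gbm    = c(4634.0234, 4316.6044, 1310.7178, 4287.0793, 1844.2147,
             416.3966, 368.9383, 224.4769, 246.1400, 250.6263),
  cubist = c(72.0, 54.5, 42.0, 49.0, 40.0, 11.0, 0, 0, 0, 0),
  row.names = paste0("V", 1:10)
)

# Rank 1 = most important; tied scores share the smallest rank
ranks <- sapply(imp, function(score) rank(-score, ties.method = "min"))
rownames(ranks) <- rownames(imp)
ranks

# Names of the three top-ranked predictors for each model
top3 <- lapply(as.data.frame(ranks), function(r) rownames(ranks)[r <= 3])
top3

# Rank agreement between the three models
round(cor(imp, method = "spearman"), 3)
```

Ranks are easier to compare than raw scores here because the three models report importance on very different scales.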

Exercise 8.2 Use a simulation to show tree bias with different granularities.

Variable a is a set of repeating values that run from -1 to 1 in steps of 0.1. Variable b is drawn from a normal distribution with a mean of 120 and a standard deviation of 3, which gives it a rough range of about 110 to 130. Variable c is drawn from a Poisson distribution with a mean of 5 and here ranges from 0 through 17. The y variable is drawn from a normal distribution with a mean of 0 and a standard deviation of 10. Variables a, b and c have no relation to y. Each variable has 500 values.

Variables a and c will have many repeating values, while variable b should have none, or very few at most.

As the book states, variable b, the most granular predictor, is favored as the most important variable, while variables c and a are used as top-level splits.

library(rpart)
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
set.seed(200)
a <- sample(-10:10 / 10, 500, replace = TRUE)
b <- rnorm(500,120,3)
c <- rpois(500,5)

y <- rnorm(500,0,10)

fake_data <- data.frame(a,b,c,y)
unique(a)
##  [1] -0.5  0.7  0.4 -0.3  0.1  1.0  0.9 -0.7  0.2 -0.8  0.0 -0.6 -1.0 -0.4  0.8
## [16]  0.3 -0.9 -0.2  0.5 -0.1  0.6
unique(b)
##   [1] 121.2644 117.7821 118.3786 119.9664 119.2032 118.4965 119.2895 120.3332
##   [9] 115.1853 126.5368 121.6179 117.7442 121.1537 115.5922 119.2543 120.9924
##  [17] 119.7116 114.1832 118.8436 119.9735 118.2044 118.5860 122.2504 122.8145
##  [25] 117.0631 121.0549 120.9987 121.9927 116.1554 121.3171 122.5931 119.9219
##  [33] 124.4802 117.7479 115.3083 120.0362 124.6008 119.5719 121.3335 121.6768
##  [41] 119.1036 122.5548 118.1500 123.5260 116.0520 122.9401 125.5281 117.1701
##  [49] 120.3592 116.0841 123.2489 118.2568 120.7307 119.2570 124.7413 121.4215
##  [57] 119.5398 119.9999 122.6617 118.4564 118.9659 122.7720 118.7420 116.8065
##  [65] 120.9888 122.7800 120.0940 116.3969 119.7837 114.8651 121.9600 115.9902
##  [73] 121.4466 122.5025 118.2534 118.8524 120.0631 123.3830 123.1343 120.0662
##  [81] 124.4277 116.7585 117.8183 119.9209 119.5314 125.4641 124.0997 116.9761
##  [89] 119.1092 117.9840 115.7384 121.6968 121.3943 122.3756 114.2110 117.1068
##  [97] 126.7006 120.4231 121.1606 118.9389 122.6776 119.9178 116.4342 119.5296
## [105] 122.9126 124.0010 120.6323 120.6109 116.6626 118.9231 118.9855 126.8395
## [113] 121.3126 121.7990 121.0170 119.0515 118.0837 120.2544 119.9852 120.3526
## [121] 119.3673 120.0810 120.7186 120.4392 119.6462 120.1882 119.5246 121.2542
## [129] 118.3806 121.3105 119.8240 120.1268 117.1983 120.0807 118.6093 120.1455
## [137] 126.3939 120.4437 120.5537 116.2128 119.5614 122.6959 123.9105 120.0992
## [145] 120.6488 118.3940 120.5646 116.3789 124.6579 125.0272 113.7232 122.3703
## [153] 116.2398 117.1314 123.1173 116.4215 122.4837 117.4764 118.1851 119.2880
## [161] 120.5655 120.5758 120.5465 120.6573 120.7398 118.9934 122.2492 119.1204
## [169] 116.9839 119.4699 123.6378 116.1787 119.5487 120.7418 129.2595 118.2690
## [177] 120.8695 118.7221 118.2299 120.0914 113.5300 123.5540 119.6951 124.0861
## [185] 119.3348 123.0083 119.9046 121.3695 122.8711 123.4193 119.2149 121.1296
## [193] 120.1695 118.9895 119.7808 122.6669 116.6057 119.1188 113.6657 123.6981
## [201] 118.8154 111.7238 123.6136 118.7529 121.0959 120.0899 126.2977 116.7338
## [209] 118.0963 120.0353 121.0833 119.3301 119.2198 123.0570 118.7204 118.1748
## [217] 122.1246 118.7618 119.9627 119.4565 117.2897 122.5208 126.5057 115.8777
## [225] 121.1475 120.1005 118.7299 119.7912 118.6206 122.1370 119.3948 121.5230
## [233] 117.2252 117.0838 117.6838 121.3226 117.9976 115.4534 118.6741 117.5300
## [241] 118.8446 120.9810 116.0758 116.1336 119.4361 116.1448 118.7374 117.0627
## [249] 120.0336 123.2684 116.8286 120.2896 117.5904 117.6725 118.2444 121.0318
## [257] 115.4300 113.4194 121.1821 116.9307 126.2024 113.0207 118.9651 120.4523
## [265] 123.6250 122.2845 126.2800 117.0009 117.5870 114.8986 114.3732 115.3840
## [273] 121.9953 119.9068 109.8045 119.9743 115.8297 116.7926 117.4554 124.1450
## [281] 121.3815 116.3859 117.2442 125.7620 126.4193 117.8887 119.5220 116.3957
## [289] 117.8584 120.5175 123.3530 118.3875 120.0794 114.3935 120.8726 121.8943
## [297] 123.6087 118.6477 122.4272 117.0328 120.2302 120.1216 120.8927 114.7128
## [305] 120.0974 118.3116 118.0641 127.3272 125.1477 120.8492 116.9704 121.5472
## [313] 117.8402 122.6367 121.2250 117.3399 121.9487 121.3579 121.5065 117.2972
## [321] 121.9819 121.2900 118.0402 121.2599 123.0409 122.4931 121.1472 119.0964
## [329] 125.2629 120.3834 118.2735 124.1703 121.7860 128.1766 120.3277 118.3018
## [337] 121.0300 117.2320 114.8274 122.3690 121.0390 121.1702 118.1774 119.2305
## [345] 120.3296 121.8053 117.1518 119.7378 121.4426 118.0569 125.0981 114.6757
## [353] 118.9499 118.0641 121.0161 127.6252 116.9357 121.2659 121.3298 121.5042
## [361] 122.7368 118.3905 122.6406 116.4232 115.3832 123.1103 124.7948 120.6225
## [369] 123.5881 119.7034 118.6357 122.4012 122.4103 116.5186 119.1537 121.0492
## [377] 125.0884 120.5386 129.9802 119.2034 121.5412 123.2948 122.9103 123.0091
## [385] 120.6477 119.6028 118.8013 116.2165 118.5311 118.5311 122.0715 121.4610
## [393] 118.6884 121.1908 121.1699 117.2222 121.4714 122.6445 117.5660 121.1950
## [401] 112.2081 119.7501 119.9951 113.5598 120.8665 117.6818 119.7108 124.3895
## [409] 121.5218 122.8697 119.7220 121.0059 119.1259 122.9723 118.3178 122.5042
## [417] 119.5266 126.5930 113.1847 117.5826 123.6255 118.1372 118.5065 125.2444
## [425] 124.3874 121.4864 120.2346 122.1071 124.3031 115.9098 120.9635 118.7543
## [433] 116.8019 120.5475 124.2079 123.0988 123.4544 120.5197 123.3594 115.0563
## [441] 123.7806 121.5065 124.1004 121.6138 120.5117 120.0845 120.6682 121.3387
## [449] 117.8196 125.3068 121.6161 120.8853 117.6992 119.2284 124.2829 121.3436
## [457] 117.7003 121.3657 121.2028 121.2212 123.0032 123.9063 116.3816 118.8100
## [465] 126.0443 117.1490 121.4038 115.1327 117.3419 120.6244 126.3222 118.6443
## [473] 117.2887 115.8204 126.0547 122.4935 115.3411 122.0174 120.6019 121.0263
## [481] 122.1106 125.0379 117.5639 124.5606 116.8561 121.2758 119.4098 117.0782
## [489] 117.9868 116.3367 117.0289 117.8229 118.6401 124.3295 116.6051 114.5334
## [497] 122.1051 119.1777 115.1889 121.2974
unique(c)
##  [1]  5  7  3  4  2  9  6 10  1  8 11  0 12 17 13
paste0("Maximum a value: ",max(a),"  Minimum a value: ",min(a))
## [1] "Maximum a value: 1  Minimum a value: -1"
paste0("Maximum b value: ",max(b),"  Minimum b value: ",min(b))
## [1] "Maximum b value: 129.980161400066  Minimum b value: 109.804469598582"
paste0("Maximum c value: ",max(c),"  Minimum c value: ",min(c))
## [1] "Maximum c value: 17  Minimum c value: 0"
paste0("Maximum y value: ",max(y),"  Minimum y value: ",min(y))
## [1] "Maximum y value: 26.4734477760858  Minimum y value: -32.39599994589"
plot(fake_data$a)

plot(fake_data$b)

plot(fake_data$c)

plot(fake_data$y)

rptree <- rpart(y ~ ., data = fake_data)

plot(as.party(rptree), gp = gpar(fontsize = 5))

varImp(rptree)
##      Overall
## a 0.05648335
## b 0.10929376
## c 0.06266866
set.seed(160)
a <- sample(-20:20 / 10, 500, replace = TRUE)
b <- rnorm(500,80,3)
c <- rpois(500,7)

y <- rnorm(500,0,10)

fake_data <- data.frame(a,b,c,y) 

unique(a)
##  [1]  0.9  1.1 -1.2  0.7 -0.6 -2.0  0.2  0.1 -1.6 -0.7  1.3  1.8 -1.9 -1.5 -0.4
## [16]  0.5 -1.4 -1.3  2.0  0.4 -1.7  1.6 -0.8  0.6  1.9  0.3  1.7 -1.8 -0.3  1.5
## [31] -0.1 -0.2  0.8  1.0 -0.9  0.0 -1.0 -0.5  1.4 -1.1  1.2
unique(b)
##   [1] 74.47251 81.68123 85.15812 84.63671 73.53506 76.07734 75.85338 73.52391
##   [9] 82.62325 71.21904 74.52155 83.08312 77.59434 82.62463 82.95394 84.40269
##  [17] 86.35832 81.84968 77.38727 76.29015 82.55760 78.40541 80.67450 82.70363
##  [25] 83.00534 78.09892 81.40635 79.50853 78.61462 79.50403 75.73572 84.34068
##  [33] 82.44276 77.18083 84.16547 79.74133 79.50502 84.65542 88.26146 79.14957
##  [41] 82.30673 77.20610 79.43534 78.62886 78.89600 85.07339 81.78086 78.92085
##  [49] 80.47994 83.62927 78.21933 73.68532 79.47395 79.43044 83.86069 77.59273
##  [57] 80.47963 76.87070 82.04667 84.69640 85.15042 79.96098 82.51401 76.85285
##  [65] 78.69055 80.42776 81.48330 79.28597 77.47127 79.42142 82.76758 83.95402
##  [73] 82.93269 79.14256 82.71644 77.29274 80.58037 86.94624 78.18525 76.90406
##  [81] 79.02055 87.31286 77.72341 81.35637 83.32777 83.60916 78.75871 83.87612
##  [89] 81.42562 78.33334 81.98633 80.71327 82.81676 83.39901 82.10382 80.39856
##  [97] 80.40726 78.36679 79.20082 79.40217 85.18187 79.44480 82.37798 82.82427
## [105] 83.36213 81.14804 79.25685 81.02656 78.79265 84.95578 80.34366 83.45189
## [113] 81.42910 82.32241 85.89088 77.71959 83.28015 82.49752 79.25394 78.04126
## [121] 76.22278 76.33327 82.18459 76.64944 71.85515 81.18499 78.07892 78.58395
## [129] 77.84861 75.95569 82.07099 82.11535 72.44467 77.92276 78.81015 78.78072
## [137] 85.55053 79.19878 77.04864 78.79523 77.59207 79.35289 84.10733 83.33944
## [145] 73.57414 81.07930 84.27688 79.27632 77.62306 81.22630 86.80024 80.80004
## [153] 82.75197 81.36777 74.89148 83.56216 75.92537 80.93432 81.92881 78.40255
## [161] 79.61148 80.03229 77.38526 83.53552 78.32307 81.71223 82.04817 76.62304
## [169] 79.93137 84.42090 84.61246 83.48981 80.44320 79.15910 83.25581 79.46437
## [177] 82.74975 72.50642 79.15799 76.46781 81.13894 80.55749 75.16694 83.41141
## [185] 75.93880 84.48156 76.74422 77.46050 83.50106 80.71346 81.85651 79.89011
## [193] 79.84937 84.38911 82.84415 85.98936 83.92595 79.59059 78.29533 82.73115
## [201] 85.36473 75.27750 85.74152 80.55794 79.66851 81.40213 79.14967 80.57108
## [209] 74.21620 81.06509 84.68494 80.65558 79.50543 76.05028 76.63961 82.48477
## [217] 84.07911 77.06854 82.24457 82.80553 87.12988 83.42626 77.83954 83.88954
## [225] 76.11348 77.00294 84.17039 76.12993 81.22282 79.70778 75.03466 76.04622
## [233] 78.67128 78.99134 79.97609 79.12697 82.94800 78.77644 78.09752 77.25856
## [241] 79.53259 78.75039 86.39893 84.08157 79.07243 79.45344 71.07975 78.49334
## [249] 81.49783 78.86872 78.50647 81.24371 79.49664 80.24154 81.44088 80.31344
## [257] 77.12409 83.23638 75.02876 80.78075 78.72552 84.70699 81.74886 81.81747
## [265] 85.87421 88.11692 76.22579 82.23504 79.54224 76.20487 84.95786 80.98043
## [273] 78.05446 77.04769 80.07805 82.17420 79.39113 81.56852 84.35586 71.01606
## [281] 76.53906 80.93690 76.77600 81.02707 84.84970 74.76363 81.05884 82.19960
## [289] 79.64566 72.58488 80.55347 80.80824 79.56770 76.87116 81.11638 84.30366
## [297] 85.44116 78.71789 82.29515 86.37625 81.77784 78.57825 79.26836 78.12342
## [305] 77.33113 80.94059 80.06381 83.43784 83.97255 80.44776 77.93746 80.04402
## [313] 85.16297 80.25339 82.12858 79.35334 73.54175 77.84002 76.54910 79.56279
## [321] 77.50943 80.40087 79.24122 82.21533 80.10460 84.24807 77.87128 79.51538
## [329] 76.64066 83.46379 76.94690 80.02969 75.88348 77.36123 81.16461 79.65956
## [337] 82.27495 83.95038 78.38096 79.86427 79.79035 78.93721 79.57385 78.06220
## [345] 82.83927 83.09998 78.17479 81.08427 75.97628 79.39883 83.75504 74.49062
## [353] 78.54058 86.44976 77.10286 77.83122 84.98486 78.96433 82.88856 76.85650
## [361] 81.24266 80.31769 83.56133 78.43028 81.95635 76.64171 75.11633 79.91733
## [369] 76.58452 79.17525 76.85955 81.53960 82.98782 79.21200 80.62234 78.25466
## [377] 78.08363 75.13685 79.83480 75.01492 79.03600 81.16265 81.34666 75.11188
## [385] 82.54922 80.19654 78.58401 83.17156 75.00108 81.00737 82.99497 78.57853
## [393] 80.55920 83.47679 72.52562 78.73311 81.90625 84.60287 81.33154 81.51540
## [401] 84.77601 73.59922 78.91789 77.71055 78.43505 80.25789 81.16984 76.89900
## [409] 82.52889 79.82632 81.11141 79.52570 78.38947 80.02306 82.49742 77.22440
## [417] 78.61309 80.65303 86.09755 78.89038 78.63949 84.68840 82.73631 76.68678
## [425] 75.66440 79.81753 76.73121 80.81371 88.57067 75.93562 85.64394 81.48114
## [433] 82.76935 78.37220 77.15936 83.39975 79.84114 80.20254 82.81077 80.12230
## [441] 80.48470 78.16646 78.64657 81.30551 83.22406 80.05117 78.18661 82.21377
## [449] 77.09481 78.87357 76.55586 75.41194 80.07460 84.04900 77.26301 77.00596
## [457] 82.73616 78.60423 81.07704 80.10059 84.53656 77.08313 86.36747 79.13314
## [465] 83.19324 83.31822 83.63428 80.67060 81.33669 81.54256 84.66085 83.57408
## [473] 85.84153 78.75534 82.64884 81.94813 83.36174 81.75706 79.06567 81.74352
## [481] 83.18011 81.53077 76.89065 81.46497 82.06632 83.09080 80.74807 75.30499
## [489] 82.16478 81.78787 75.05382 82.60260 86.36382 80.04424 85.12828 81.45698
## [497] 77.56274 73.89246 79.30560 78.81670
unique(c)
##  [1]  6  4  7 11  8  5  9 12  2  3 10 13 14  1 15
paste0("Maximum a value: ",max(a),"  Minimum a value: ",min(a))
## [1] "Maximum a value: 2  Minimum a value: -2"
paste0("Maximum b value: ",max(b),"  Minimum b value: ",min(b))
## [1] "Maximum b value: 88.5706709281885  Minimum b value: 71.0160631000418"
paste0("Maximum c value: ",max(c),"  Minimum c value: ",min(c))
## [1] "Maximum c value: 15  Minimum c value: 1"
paste0("Maximum y value: ",max(y),"  Minimum y value: ",min(y))
## [1] "Maximum y value: 30.5350979355139  Minimum y value: -26.5386044656447"
plot(fake_data$a)

plot(fake_data$b)

plot(fake_data$c)

plot(fake_data$y)

rpt2 <- rpart(y ~ ., data = fake_data)

plot(as.party(rpt2), gp = gpar(fontsize = 5))

varImp(rpt2)
##      Overall
## a 0.05345298
## b 0.09253418
## c 0.03849219
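The two fits above are single draws, so a repeated simulation gives a steadier picture of which predictor a tree tends to pick first. The following is a sketch under stated assumptions: 100 repetitions, the same generating settings as the first data set, and a one-split rpart tree (cp = 0, maxdepth = 1) so that the root split can be read off directly. The names n_sims and first_split are introduced here for illustration, and no counts are claimed.

```r
library(rpart)

set.seed(200)
n_sims <- 100                      # assumption: number of repetitions
first_split <- character(n_sims)   # variable used at the root in each run

for (i in seq_len(n_sims)) {
  # Three uninformative predictors with different granularities
  d <- data.frame(
    a = sample(-10:10 / 10, 500, replace = TRUE),  # 21 possible values
    b = rnorm(500, 120, 3),                        # effectively all distinct
    c = rpois(500, 5),                             # a handful of counts
    y = rnorm(500, 0, 10)                          # unrelated response
  )

  # Force a single split so the root variable is the only decision made
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(cp = 0, maxdepth = 1))

  # The first row of the frame is the root node; "<leaf>" means no split
  first_split[i] <- as.character(fit$frame$var[1])
}

# How often each predictor was chosen for the first split
table(first_split)
```

If the bias described in the book holds, the predictor with the most distinct values would be expected to win the root split most often, even though none of the predictors is related to y.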

Exercise 8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of those parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (.1 and .9) and the learning rate (.1 and .9) for the solubility data. The left-hand plot has both parameters set to .1, and the right-hand plot has both set to .9:


Fig. 8.24: A comparison of variable importance magnitudes for differing values of the bagging fraction and shrinkage parameters. Both tuning parameters are set to .1 in the left figure. Both are set to .9 in the right figure.


a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

It’s all about the sample size being used. The model on the right uses 90% of the observations for each tree, while the model on the left uses only 10%. Because the trees on the right see nearly the same data each time, they keep splitting on the same few strongest predictors, so importance concentrates on those. The high learning rate adds to this, since each tree's contribution is barely shrunk and the earliest trees dominate the fit. The model on the left draws much smaller samples each time, which creates more variation in which predictors are chosen and spreads importance across more of them.
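The effect of the two settings can be explored with a rough sketch like the one below. It assumes the solubility data from the AppliedPredictiveModeling package and fits gbm directly; the number of trees, the interaction depth, the seed, and the helper names fit_extreme and rel_inf are assumptions chosen for illustration, not the settings behind Fig. 8.24.

```r
library(gbm)
library(AppliedPredictiveModeling)

data(solubility)  # provides solTrainXtrans and solTrainY, among others
sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

# Fit a boosted model with the bagging fraction and the learning rate
# both set to the same value (0.1 or 0.9)
fit_extreme <- function(value, seed = 100) {
  set.seed(seed)
  gbm(Solubility ~ ., data = sol_train,
      distribution = "gaussian",
      n.trees = 100,            # assumption: not taken from the book
      interaction.depth = 1,    # assumption: not taken from the book
      bag.fraction = value,
      shrinkage = value)
}

# Relative influence as a data frame (var, rel.inf), sorted descending
rel_inf <- function(fit) summary(fit, plotit = FALSE)

imp_low  <- rel_inf(fit_extreme(0.1))
imp_high <- rel_inf(fit_extreme(0.9))

# Share of total relative influence held by the top five predictors
c(low  = sum(head(imp_low$rel.inf, 5)),
  high = sum(head(imp_high$rel.inf, 5)))
```

A larger top-five share for one setting would indicate that importance is more concentrated under that setting.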

b) Which model do you think would be more predictive of other samples?

I think the model with the bagging fraction of .9 would be more predictive of other samples since each sample is comprised of 90% of the observations. However, I do not think it would do that well with new unseen data as the model will more than likely overfit the data.

c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth lets each tree split on more predictors, so more predictors contribute to the model. This spreads importance across more predictors and makes the importance slope less steep for either model.

Exercise 8.7 Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
require(doParallel)
## Loading required package: doParallel
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
data(ChemicalManufacturingProcess)

only_yield <- subset(ChemicalManufacturingProcess, select = c(Yield))
no_yield <- subset(ChemicalManufacturingProcess, select = -c(Yield))
imputed_data <- preProcess(no_yield, "knnImpute")
fixed_cmp <- predict(imputed_data, no_yield)

cl2<-makeCluster(detectCores())
registerDoParallel(cl2)

set.seed(791)
training2 <- createDataPartition(only_yield$Yield, p=0.7, list=FALSE)
X_training2 <- fixed_cmp[training2, ]
y_training2 <- only_yield$Yield[training2]
X_testing2 <- fixed_cmp[-training2, ]
y_testing2 <- only_yield$Yield[-training2]


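# caret expects one integer vector of seeds per resample (10 folds here) plus a
# single seed for the final model fit; each vector must be at least as long as
# the number of tuning combinations evaluated
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- sample.int(n=1000, 54)
# single seed for the final model
seeds[[11]] <- sample.int(1000, 1)
myControl <- trainControl(method='cv', seeds=seeds)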

a) Which tree-based regression model gives the optimal resampling and test set performance?

Recursive Partitioning
rp_grid <- expand.grid(maxdepth= seq(1,10,by=1))
rp <- train(x =X_training2, y = y_training2, method = "rpart2",metric = "Rsquared", tuneGrid = rp_grid,
                       trControl = myControl)

rp
## CART 
## 
## 124 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 112, 112, 111, 112, 111, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##    1        1.471562  0.3869829  1.120412
##    2        1.483639  0.3753083  1.184351
##    3        1.498507  0.3944570  1.188671
##    4        1.525093  0.3735759  1.220650
##    5        1.515595  0.3774816  1.212819
##    6        1.563395  0.3576300  1.242848
##    7        1.558246  0.3627158  1.201699
##    8        1.576811  0.3583433  1.208042
##    9        1.589551  0.3521703  1.218510
##   10        1.589551  0.3521703  1.218510
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 3.
rp_pred <- predict(rp, newdata=X_testing2)
postResample(pred=rp_pred, obs=y_testing2)
##      RMSE  Rsquared       MAE 
## 1.5934308 0.2969394 1.3090651
Random Forest
rf_grid <- expand.grid(mtry=seq(2,38,by=3))

rf <- train(x =X_training2, y = y_training2, method = "rf", tuneGrid = rf_grid, metric = "Rsquared", importance = TRUE, 
                  trControl = myControl)
rf
## Random Forest 
## 
## 124 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 112, 112, 111, 111, 111, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.268762  0.6086269  1.0285112
##    5    1.203598  0.6272184  0.9649833
##    8    1.184852  0.6343871  0.9447936
##   11    1.188911  0.6220674  0.9401748
##   14    1.184664  0.6249541  0.9331198
##   17    1.184801  0.6167741  0.9303671
##   20    1.172648  0.6267543  0.9163328
##   23    1.181276  0.6169985  0.9235268
##   26    1.198358  0.6043296  0.9297546
##   29    1.193400  0.6015570  0.9240494
##   32    1.193178  0.6038257  0.9248578
##   35    1.187984  0.6079066  0.9170179
##   38    1.218809  0.5815756  0.9362464
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
rf_pred <- predict(rf, newdata=X_testing2)
rf_pr <- postResample(pred=rf_pred, obs=y_testing2)
rf_pr[2]
## Rsquared 
##  0.70783
Gradient Boosting
gbm_grid <- expand.grid(interaction.depth=seq(1,6,by=1),
                       n.trees=c(25,50,100,200),
                       shrinkage=c(0.01,0.05,0.1,0.2),
                       n.minobsinnode=5)

gbm_model <- train(x =X_training2, y = y_training2, method = "gbm", tuneGrid = gbm_grid, metric = "Rsquared",verbose = FALSE,
                  trControl = myControl)
gbm_model
## Stochastic Gradient Boosting 
## 
## 124 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 111, 112, 112, 112, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE      
##   0.01       1                   25      1.702742  0.4915171  1.3699868
##   0.01       1                   50      1.602792  0.5369543  1.2759925
##   0.01       1                  100      1.465112  0.5618667  1.1593513
##   0.01       1                  200      1.327978  0.5758923  1.0508738
##   0.01       2                   25      1.676906  0.5242055  1.3531046
##   0.01       2                   50      1.554626  0.5473122  1.2410069
##   0.01       2                  100      1.401124  0.5707124  1.1093616
##   0.01       2                  200      1.269217  0.5966355  0.9997110
##   0.01       3                   25      1.669362  0.5467351  1.3487609
##   0.01       3                   50      1.541724  0.5659575  1.2394585
##   0.01       3                  100      1.382837  0.5830333  1.1022648
##   0.01       3                  200      1.249787  0.6069828  0.9918095
##   0.01       4                   25      1.656615  0.5546487  1.3359848
##   0.01       4                   50      1.526143  0.5577493  1.2198965
##   0.01       4                  100      1.376578  0.5756605  1.0894704
##   0.01       4                  200      1.250679  0.6029759  0.9827305
##   0.01       5                   25      1.661928  0.5237027  1.3420720
##   0.01       5                   50      1.522994  0.5688462  1.2243777
##   0.01       5                  100      1.359429  0.5936962  1.0835057
##   0.01       5                  200      1.233284  0.6148762  0.9815729
##   0.01       6                   25      1.641772  0.5742925  1.3258048
##   0.01       6                   50      1.505627  0.5941762  1.2057756
##   0.01       6                  100      1.346076  0.6086654  1.0639567
##   0.01       6                  200      1.223195  0.6300503  0.9557685
##   0.05       1                   25      1.432195  0.5364834  1.1354427
##   0.05       1                   50      1.298209  0.5672434  1.0247579
##   0.05       1                  100      1.274043  0.5599415  1.0092571
##   0.05       1                  200      1.264459  0.5582021  1.0013875
##   0.05       2                   25      1.367565  0.5741275  1.0942566
##   0.05       2                   50      1.258982  0.5952298  1.0029881
##   0.05       2                  100      1.210326  0.6123900  0.9713814
##   0.05       2                  200      1.182850  0.6244167  0.9574168
##   0.05       3                   25      1.334963  0.5738235  1.0645274
##   0.05       3                   50      1.231267  0.6116190  0.9744521
##   0.05       3                  100      1.206427  0.6271019  0.9582795
##   0.05       3                  200      1.189466  0.6337255  0.9463041
##   0.05       4                   25      1.336768  0.5810752  1.0649079
##   0.05       4                   50      1.249647  0.5970875  0.9910742
##   0.05       4                  100      1.215676  0.6070590  0.9723121
##   0.05       4                  200      1.199773  0.6094163  0.9607200
##   0.05       5                   25      1.335550  0.5536692  1.0529251
##   0.05       5                   50      1.242705  0.5832735  0.9804756
##   0.05       5                  100      1.210336  0.5915882  0.9736762
##   0.05       5                  200      1.187062  0.6100458  0.9497239
##   0.05       6                   25      1.326083  0.5806105  1.0639531
##   0.05       6                   50      1.224667  0.6001990  0.9853851
##   0.05       6                  100      1.167466  0.6321122  0.9357453
##   0.05       6                  200      1.154111  0.6388920  0.9254460
##   0.10       1                   25      1.291054  0.5774584  1.0292840
##   0.10       1                   50      1.271516  0.5580917  1.0119226
##   0.10       1                  100      1.260507  0.5652451  1.0001383
##   0.10       1                  200      1.266969  0.5660138  1.0056881
##   0.10       2                   25      1.286086  0.5726373  1.0089118
##   0.10       2                   50      1.254791  0.5830419  0.9948803
##   0.10       2                  100      1.229475  0.5954654  0.9887814
##   0.10       2                  200      1.237718  0.5911786  1.0123412
##   0.10       3                   25      1.264308  0.5636646  0.9950674
##   0.10       3                   50      1.218990  0.5943221  0.9742976
##   0.10       3                  100      1.207283  0.6050684  0.9712299
##   0.10       3                  200      1.183576  0.6254670  0.9531824
##   0.10       4                   25      1.214783  0.6076393  0.9626600
##   0.10       4                   50      1.157176  0.6449272  0.9208540
##   0.10       4                  100      1.158829  0.6427303  0.9173196
##   0.10       4                  200      1.161172  0.6428421  0.9256104
##   0.10       5                   25      1.212519  0.5819046  0.9698003
##   0.10       5                   50      1.177711  0.6069250  0.9373107
##   0.10       5                  100      1.133601  0.6365795  0.9050780
##   0.10       5                  200      1.134479  0.6369831  0.9097169
##   0.10       6                   25      1.240088  0.5834427  1.0059568
##   0.10       6                   50      1.212076  0.5813037  0.9868013
##   0.10       6                  100      1.186199  0.5970314  0.9634316
##   0.10       6                  200      1.161009  0.6175945  0.9479195
##   0.20       1                   25      1.346099  0.5266810  1.0719809
##   0.20       1                   50      1.328170  0.5451538  1.0709660
##   0.20       1                  100      1.363680  0.5350679  1.0957853
##   0.20       1                  200      1.365127  0.5419245  1.1029006
##   0.20       2                   25      1.238894  0.5814430  0.9808205
##   0.20       2                   50      1.260834  0.5642065  1.0093714
##   0.20       2                  100      1.237900  0.5749814  0.9872285
##   0.20       2                  200      1.256712  0.5695153  1.0058358
##   0.20       3                   25      1.301990  0.5473124  1.0282612
##   0.20       3                   50      1.343866  0.5211000  1.0674277
##   0.20       3                  100      1.323148  0.5486445  1.0449505
##   0.20       3                  200      1.323795  0.5504975  1.0405489
##   0.20       4                   25      1.287848  0.5679624  1.0503679
##   0.20       4                   50      1.295123  0.5793168  1.0420585
##   0.20       4                  100      1.286296  0.5842801  1.0383649
##   0.20       4                  200      1.286568  0.5867000  1.0397929
##   0.20       5                   25      1.239467  0.5895346  1.0028998
##   0.20       5                   50      1.227213  0.6028140  0.9748339
##   0.20       5                  100      1.209051  0.6128716  0.9706494
##   0.20       5                  200      1.195128  0.6212591  0.9647568
##   0.20       6                   25      1.336147  0.5282460  1.0655574
##   0.20       6                   50      1.307272  0.5552212  1.0390391
##   0.20       6                  100      1.291658  0.5659785  1.0192570
##   0.20       6                  200      1.290980  0.5655476  1.0183565
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth =
##  4, shrinkage = 0.1 and n.minobsinnode = 5.
gbm_pred <- predict(gbm_model, newdata=X_testing2)
postResample(pred=gbm_pred, obs=y_testing2)
##      RMSE  Rsquared       MAE 
## 1.0856288 0.6493653 0.8083657
Cubist
cubist_grid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100), 
                          neighbors = c(0, 1, 3, 5, 7))

cubist_model <- train(x = X_training2, y = y_training2,
                      method = "cubist",
                      tuneGrid = cubist_grid,
                      metric = "Rsquared",
                      verbose = FALSE,
                      trControl = myControl)

cubist_model
## Cubist 
## 
## 124 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 112, 112, 111, 111, 112, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##     1         0          1.3719916  0.5446855  1.0505129
##     1         1          1.4310478  0.5177002  1.0447485
##     1         3          1.2629137  0.6080926  0.9794625
##     1         5          1.3014485  0.5870984  1.0202697
##     1         7          1.3308707  0.5712087  1.0300811
##     5         0          1.3454461  0.5599896  1.0321340
##     5         1          1.2815873  0.5990047  0.9158336
##     5         3          1.2086098  0.6351033  0.9084863
##     5         5          1.2522828  0.6145714  0.9531207
##     5         7          1.2982097  0.5884430  0.9728913
##    10         0          1.2752147  0.5701001  1.0184069
##    10         1          1.1686579  0.6365841  0.8732971
##    10         3          1.1147511  0.6666262  0.8592205
##    10         5          1.1682821  0.6373414  0.9281511
##    10         7          1.2161200  0.6093643  0.9515386
##    20         0          1.2007485  0.5963158  0.9888749
##    20         1          1.0976469  0.6607546  0.8305242
##    20         3          1.0385975  0.6953742  0.8440363
##    20         5          1.0945312  0.6663173  0.9108772
##    20         7          1.1426899  0.6345153  0.9285680
##    50         0          1.1521513  0.6209698  0.9490717
##    50         1          1.0685522  0.6773100  0.8057271
##    50         3          0.9901910  0.7215584  0.8115809
##    50         5          1.0430126  0.6934100  0.8805174
##    50         7          1.0856360  0.6656967  0.8902986
##   100         0          1.1326517  0.6385594  0.9472680
##   100         1          1.0687959  0.6896180  0.7994641
##   100         3          0.9793031  0.7362285  0.7955938
##   100         5          1.0251158  0.7139864  0.8588931
##   100         7          1.0633445  0.6865955  0.8768017
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were committees = 100 and neighbors = 3.
cube_pred <- predict(cubist_model, newdata=X_testing2)
postResample(pred=cube_pred, obs=y_testing2)
##      RMSE  Rsquared       MAE 
## 1.1632558 0.6288877 0.7267937

The Cubist model gives the best resampling performance, with an R-squared of 0.7362285 and an RMSE of 0.9793031 (committees = 100, neighbors = 3). This is better than the random forest model, which had an R-squared of 0.6343871 and an RMSE of 1.1848524.

However, the random forest model performed best on the testing data, with an R-squared of 0.70783 and an RMSE of 1.0786422, compared with test RMSEs of 1.0856288 for the GBM model and 1.1632558 for the Cubist model. Overall, I would judge the random forest model to be the best model.
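The test-set comparison above can be lined up in a small table. The sketch below uses base R only and simply re-enters the test RMSE and R-squared values reported in this write-up; the data frame name `test_perf` is introduced here for illustration, and no model is refit.

```r
# Test-set metrics copied from the postResample() results reported in this write-up.
test_perf <- data.frame(
  model    = c("Random forest", "GBM", "Cubist"),
  RMSE     = c(1.0786422, 1.0856288, 1.1632558),
  Rsquared = c(0.70783, 0.6493653, 0.6288877)
)

# Order the models from lowest to highest test RMSE.
test_perf <- test_perf[order(test_perf$RMSE), ]
print(test_perf, row.names = FALSE)

# Model with the lowest test RMSE and model with the highest test R-squared.
best_rmse <- test_perf$model[which.min(test_perf$RMSE)]
best_r2   <- test_perf$model[which.max(test_perf$Rsquared)]
```

Both criteria point to the same model here, which is why the test results are weighted more heavily than the resampling results in the judgment above.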

b) Which predictors are most important in the optimal tree-based regression model? Do either biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The variable importance list of the random forest model is very similar to the list from the support vector machine model, which was the optimal nonlinear model. Manufacturing process variables dominate the random forest list: seven of the top 10 predictors are process variables and three are biological materials.

Both models have the same three biological materials in their top 10, although they are ranked slightly differently. The manufacturing processes differ a little more: ManufacturingProcess11 and ManufacturingProcess31 appear in the random forest top 10, while the support vector machine model has ManufacturingProcess06 and ManufacturingProcess29 instead.

varImp(rf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   63.03
## BiologicalMaterial06     59.39
## BiologicalMaterial03     57.24
## ManufacturingProcess36   56.34
## BiologicalMaterial12     55.13
## ManufacturingProcess17   52.42
## ManufacturingProcess09   50.42
## ManufacturingProcess11   50.36
## ManufacturingProcess31   49.84
## ManufacturingProcess29   49.07
## ManufacturingProcess39   45.18
## ManufacturingProcess06   44.92
## BiologicalMaterial02     44.63
## ManufacturingProcess28   44.44
## ManufacturingProcess33   43.52
## ManufacturingProcess26   41.94
## BiologicalMaterial04     37.47
## ManufacturingProcess01   37.47
## ManufacturingProcess18   36.67
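The overlap between the two top-10 lists can be checked with set operations. In the sketch below, `rf_top10` is copied from the output above, while `svm_top10` is an assumption: it is reconstructed from the description in the text (the same predictors, with ManufacturingProcess06 and ManufacturingProcess29 in place of 11 and 31), so its ordering carries no meaning.

```r
# Top 10 predictors from the random forest importance output above.
rf_top10 <- c("ManufacturingProcess32", "ManufacturingProcess13",
              "BiologicalMaterial06",   "BiologicalMaterial03",
              "ManufacturingProcess36", "BiologicalMaterial12",
              "ManufacturingProcess17", "ManufacturingProcess09",
              "ManufacturingProcess11", "ManufacturingProcess31")

# ASSUMPTION: SVM top 10 reconstructed from the prose description only;
# the order of this vector is not the SVM importance ranking.
svm_top10 <- c(setdiff(rf_top10, c("ManufacturingProcess11",
                                   "ManufacturingProcess31")),
               "ManufacturingProcess06", "ManufacturingProcess29")

# Count biological vs. manufacturing predictors in the random forest top 10.
type_counts <- table(ifelse(grepl("^Biological", rf_top10),
                            "Biological", "Manufacturing"))
print(type_counts)

# Predictors shared by both lists, and those unique to each model.
shared   <- intersect(rf_top10, svm_top10)
only_rf  <- setdiff(rf_top10, svm_top10)
only_svm <- setdiff(svm_top10, rf_top10)
print(only_rf)
print(only_svm)
```

With real model objects, the same vectors could be taken from the row names of each model's importance table instead of being typed in by hand.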

c) Plot the optimal single tree with distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

It does give some additional information, since the tree shows the value at which each biological material or manufacturing process splits the data. That is all the additional information I can see, since I am not certain how these manufacturing processes are linked together to produce the yield.

However, I do not find this information as useful as, say, a plot of yield against each of the most important variables. Such plots would show outliers, which could have a strong positive or negative relationship with yield. Those outlying points, especially the positive ones, would be the points to focus on to improve yield.

library(rpart)
library(partykit)  # as.party()
library(grid)      # gpar()
recur <- rpart(y_training2 ~ ., data = X_training2)

plot(as.party(recur), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))
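
The split values mentioned above can also be read off the fitted tree as text rather than from the plot. The following is a minimal sketch, assuming the rpart package is installed; it uses the built-in `mtcars` data as a stand-in so that it runs on its own, and the names `fit`, `split_vars` and `split_rules` are introduced here for illustration.

```r
library(rpart)

# Stand-in data: mtcars is used only so the sketch is self-contained.
fit <- rpart(mpg ~ ., data = mtcars)

# Rows of fit$frame whose variable is not "<leaf>" are internal nodes,
# i.e. the splits actually used in the tree.
internal   <- fit$frame$var != "<leaf>"
split_vars <- as.character(fit$frame$var[internal])

# labels() gives the split rule leading into each node; the first entry is "root".
split_rules <- labels(fit)

print(split_vars)
print(split_rules[-1])
```

Applied to `recur`, the same calls would list the predictors and threshold values that define each branch of the yield tree.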