## V1 V2 V3 V4 V5 V6 V7
## 1 0.1965959 0.8897369 0.4034173 0.9335958 0.7343655 0.3080857 0.7300751
## 2 0.7164260 0.1839942 0.8771072 0.5623151 0.1027748 0.2279233 0.6644855
## 3 0.3620857 0.7163158 0.4601120 0.2225171 0.8524531 0.3392558 0.9341949
## 4 0.3910775 0.2375733 0.5848327 0.2497158 0.7292472 0.3881883 0.5560306
## 5 0.8133072 0.3541920 0.6959593 0.5953801 0.7285362 0.9964300 0.6814503
## 6 0.4279599 0.1889772 0.3961759 0.3743723 0.5802106 0.2530935 0.3974592
## V8 V9 V10 y
## 1 0.24125356 0.9239784 0.4639425 18.210088
## 2 0.98578585 0.4530842 0.4231528 14.829633
## 3 0.21314727 0.7413460 0.7615259 13.886334
## 4 0.02176553 0.6763241 0.2649187 9.044441
## 5 0.94450272 0.9189753 0.1579674 19.844821
## 6 0.09120397 0.6593991 0.2463080 8.882691
As we can see in the variable importance chart, only variables 1-5 significantly affected the model.
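The importance scores discussed here could be produced along the following lines. This is a minimal sketch that assumes the data were simulated with `mlbench.friedman1`. The ten uniform predictors and the `y` column shown above are consistent with that generator, but the report does not name it, and the seed and `ntree` value are illustrative.

```r
library(mlbench)
library(randomForest)

# Assumption: Friedman 1 benchmark data, in which only V1-V5 drive the response.
set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(sim$x)   # columns are named V1..V10
simulated$y <- sim$y

# Fit a random forest and keep the permutation-based importance.
rf_fit <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

# type = 1 is the permutation importance (%IncMSE); scale = FALSE keeps raw values.
# The namespace prefix avoids the masking of importance() by other packages.
rf_imp <- randomForest::importance(rf_fit, type = 1, scale = FALSE)
print(round(rf_imp[order(rf_imp[, 1], decreasing = TRUE), , drop = FALSE], 3))
```

Plotting `rf_imp` with `barplot()` would give a chart of the kind the text refers to.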
Next, I added an 11th variable (V11) that was highly correlated with V1. As the graph shows, V11 is itself a significant predictor. It also reduced the importance of V1, because the two variables now share the variance that V1 previously explained on its own.
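A sketch of this step, under the same assumption that the data come from `mlbench.friedman1`. The way V11 is built here (V1 plus a small amount of Gaussian noise) is a hypothetical construction; the report does not show how the correlated variable was created.

```r
library(mlbench)
library(randomForest)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(sim$x)
simulated$y <- sim$y

# Hypothetical construction of V11: V1 plus a little Gaussian noise.
simulated$V11 <- simulated$V1 + rnorm(200) * 0.1
cor_v1_v11 <- cor(simulated$V1, simulated$V11)

# Fit once without V11 and once with it, then compare the importance of V1.
without_v11 <- simulated[, names(simulated) != "V11"]
rf_base <- randomForest(y ~ ., data = without_v11, importance = TRUE, ntree = 1000)
rf_dup  <- randomForest(y ~ ., data = simulated,   importance = TRUE, ntree = 1000)

imp_base <- randomForest::importance(rf_base, type = 1, scale = FALSE)
imp_dup  <- randomForest::importance(rf_dup,  type = 1, scale = FALSE)

print(round(cor_v1_v11, 3))
print(c(V1_without_V11 = imp_base["V1", 1],
        V1_with_V11    = imp_dup["V1", 1],
        V11            = imp_dup["V11", 1]))
```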
Using conditional trees, we find that fewer predictors are necessary. Namely, variable 3 becomes basically irrelevant.
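For reference, a conditional inference forest of this kind can be fitted with `cforest` from the `party` package. The sketch below assumes that package was used and again assumes Friedman 1 data with a hypothetical V11. The small `ntree` is chosen only to keep the conditional importance calculation quick.

```r
library(mlbench)
library(party)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simulated <- as.data.frame(sim$x)
simulated$y <- sim$y
simulated$V11 <- simulated$V1 + rnorm(200) * 0.1   # hypothetical correlated predictor

# Conditional inference forest with unbiased variable selection.
cf_fit <- cforest(y ~ ., data = simulated,
                  controls = cforest_unbiased(ntree = 100))

# Marginal importance versus importance conditional on correlated predictors.
imp_marginal    <- varimp(cf_fit)
imp_conditional <- varimp(cf_fit, conditional = TRUE)
print(round(cbind(marginal = imp_marginal, conditional = imp_conditional), 3))
```

Setting `conditional = TRUE` permutes each predictor within groups defined by the predictors it is correlated with, which is why correlated variables such as V1 and V11 are treated differently from the marginal case.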
Bagging seems to reduce the importance of all the previously important variables and, correspondingly, to boost the influence of the previously discarded variables (because the shares of explained variance must sum to 1).
Cubist tree building, however, gives the results we’d expect from above, with the additional benefit of eliminating the importance of the redundant predictor (V11).
## Overall
## High 5.1044426
## Low 0.1013638
## Middle 0.7867465
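We can see that importance tends to increase with granularity: the predictor labelled High has by far the largest importance (5.10), compared with 0.79 for Middle and 0.10 for Low. That is, as the range of values a predictor can take increases, so does its importance.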
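The effect can be checked with a small simulation. In this sketch the three predictors and their numbers of distinct values are hypothetical (the report does not show how High, Middle and Low were defined), and the response is pure noise, so any importance a predictor receives is selection bias rather than signal.

```r
library(rpart)

# One simulated data set in which no predictor is related to y.
granularity_importance <- function(n = 200) {
  d <- data.frame(
    Low    = sample(0:1, n, replace = TRUE),                  # 2 distinct values
    Middle = sample(seq(0, 1, by = 0.1), n, replace = TRUE),  # 11 distinct values
    High   = runif(n),                                        # about n distinct values
    y      = rnorm(n)                                         # pure noise response
  )
  fit <- rpart(y ~ ., data = d, control = rpart.control(cp = 0, xval = 0))
  out <- c(Low = 0, Middle = 0, High = 0)
  imp <- fit$variable.importance       # NULL when the tree has no splits
  if (!is.null(imp)) out[names(imp)] <- imp
  out
}

set.seed(1)
# Average the importance over repeated simulations.
avg_imp <- rowMeans(replicate(50, granularity_importance()))
print(round(avg_imp, 3))
```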
Reducing the bagging fraction means each tree is built on a smaller random subset of the data, so variables with less explanatory power tend to be modelled separately from the more important variables. Conversely, there are fewer opportunities for trees to be constructed from the traditionally important predictors.

As the learning rate increases, the marginal effect of each new tree on the model increases, which leads to a higher correlation factor. With a learning rate of 0.1, each additional tree has less effect on the model than with a learning rate of 0.2. This means more trees are needed, but it reduces the likelihood of overfitting. In other words, the relationship between the learning rate and the number of trees is inverse.
The model using a learning rate of .9 would have a tendency to overfit as each additional tree has a large marginal effect.
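The role of the learning rate can be illustrated with a hand-rolled boosting loop. This is a simplified sketch, not the `gbm` implementation: it boosts depth-1 `rpart` trees on squared error, uses no bagging, and runs on hypothetical simulated data. It only tracks training error, so it shows how much each tree moves the fit, not how well the model generalises.

```r
library(rpart)

# Boost depth-1 regression trees on squared error and record training RMSE.
boost_rmse <- function(x, y, learning_rate, n_trees) {
  pred <- rep(mean(y), length(y))          # start from the mean response
  rmse <- numeric(n_trees)
  for (i in seq_len(n_trees)) {
    d <- data.frame(x, resid = y - pred)   # current residuals become the target
    stump <- rpart(resid ~ ., data = d,
                   control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
    pred <- pred + learning_rate * predict(stump, d)   # shrunken update
    rmse[i] <- sqrt(mean((y - pred)^2))
  }
  rmse
}

set.seed(1)
x <- data.frame(a = runif(200), b = runif(200))        # hypothetical predictors
y <- 10 * x$a + 5 * sin(2 * pi * x$b) + rnorm(200)

slow <- boost_rmse(x, y, learning_rate = 0.1, n_trees = 50)
fast <- boost_rmse(x, y, learning_rate = 0.9, n_trees = 50)

# Training RMSE after 10 trees under each learning rate.
print(round(c(slow_10 = slow[10], fast_10 = fast[10]), 3))
```

With the larger learning rate the training error falls much faster per tree, which is the same mechanism that makes a rate of 0.9 prone to chasing noise.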
Interaction depth refers to the tree depth and the number of leaf nodes. As tree depth increases, the number of leaf nodes tends to increase leading to more data vectors coming into play. In both cases, we’d prefer a more uniform distribution of importance.
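To compare how importance is spread at different interaction depths, one could fit two `gbm` models that differ only in `interaction.depth`. The data below are hypothetical and the tuning values are illustrative; the report's own data and settings are not shown.

```r
library(gbm)

set.seed(1)
n <- 500
d <- data.frame(matrix(runif(n * 6), ncol = 6))   # columns X1..X6
d$y <- 10 * d$X1 + 5 * d$X2 * d$X3 + 2 * d$X4 + rnorm(n)

# Fit a boosted model at a given interaction depth, all else held fixed.
fit_gbm <- function(depth) {
  gbm(y ~ ., data = d, distribution = "gaussian",
      n.trees = 200, interaction.depth = depth,
      shrinkage = 0.1, bag.fraction = 0.5, verbose = FALSE)
}

# summary() returns the relative influence of each predictor (in percent).
shallow <- summary(fit_gbm(1), plotit = FALSE)
deep    <- summary(fit_gbm(5), plotit = FALSE)

print(shallow)
print(deep)
```

Comparing the two `rel.inf` columns shows whether the deeper trees distribute influence over more predictors.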
##
## Call:
## summary.resamples(object = resampling)
##
## Models: SingleTree, RandomForest, Cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.8349514 0.9570279 1.1142687 1.1233198 1.2185423 1.508726
## RandomForest 0.5873856 0.8306705 0.8875993 0.8703800 0.9222134 1.208150
## Cubist 0.4685895 0.6423180 0.6659518 0.7688176 0.9578148 1.053937
## NA's
## SingleTree 0
## RandomForest 0
## Cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SingleTree 1.0154013 1.1992917 1.3977413 1.408004 1.484750 2.003905 0
## RandomForest 0.7767842 1.0243063 1.0933312 1.106106 1.209931 1.593887 0
## Cubist 0.6580057 0.7795592 0.8716059 0.972782 1.233067 1.294626 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.1411297 0.3028180 0.4234355 0.4348887 0.5120323 0.7737295
## RandomForest 0.4366632 0.5791650 0.6848531 0.6716643 0.7870188 0.9048642
## Cubist 0.4150874 0.6130897 0.7349109 0.7137045 0.8201955 0.9359714
## NA's
## SingleTree 0
## RandomForest 0
## Cubist 0
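A resampling summary like the one above can be produced with `caret`. The sketch below uses hypothetical stand-in data and assumes 10-fold cross-validation (the output reports 10 resamples but does not name the scheme). The model labels and the `resampling` object name follow the output. The `rf` and `cubist` methods need the `randomForest` and `Cubist` packages installed.

```r
library(caret)

# Hypothetical stand-in data; the report's predictors and response are not shown here.
set.seed(1)
n <- 150
dat <- data.frame(matrix(rnorm(n * 5), ncol = 5))   # columns X1..X5
dat$Yield <- 40 + 2 * dat$X1 - dat$X2 + rnorm(n)

# Share one set of folds so every model is evaluated on identical resamples.
folds <- createFolds(dat$Yield, k = 10, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

single_tree <- train(Yield ~ ., data = dat, method = "rpart2", trControl = ctrl)
rf_model    <- train(Yield ~ ., data = dat, method = "rf",     trControl = ctrl)
cubist_mod  <- train(Yield ~ ., data = dat, method = "cubist", trControl = ctrl)

resampling <- resamples(list(SingleTree   = single_tree,
                             RandomForest = rf_model,
                             Cubist       = cubist_mod))
print(summary(resampling))
```

Calling `varImp()` on any of the fitted `train` objects gives scaled importance tables of the kind shown in part b.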
As we can see, the Cubist model performs best when using RMSE as the indicator (mean RMSE of 0.973), but it does not beat the random forest model (1.106) by much.

### b
## rpart2 variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess17 100.00
## ManufacturingProcess09 89.24
## ManufacturingProcess11 77.08
## BiologicalMaterial12 58.89
## ManufacturingProcess32 49.93
## BiologicalMaterial06 46.26
## BiologicalMaterial02 45.43
## ManufacturingProcess18 43.02
## ManufacturingProcess02 40.35
## BiologicalMaterial05 40.21
## ManufacturingProcess21 36.75
## BiologicalMaterial04 35.67
## ManufacturingProcess31 29.90
## ManufacturingProcess06 29.62
## BiologicalMaterial03 28.87
## ManufacturingProcess25 24.66
## ManufacturingProcess13 24.51
## BiologicalMaterial10 21.63
## BiologicalMaterial01 20.68
## BiologicalMaterial11 17.67
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial12 65.58
## BiologicalMaterial03 59.69
## ManufacturingProcess09 58.64
## BiologicalMaterial06 55.15
## ManufacturingProcess13 46.94
## BiologicalMaterial02 46.66
## ManufacturingProcess17 46.58
## ManufacturingProcess31 45.96
## BiologicalMaterial11 43.85
## ManufacturingProcess36 43.65
## BiologicalMaterial04 42.10
## BiologicalMaterial08 40.86
## BiologicalMaterial01 40.21
## ManufacturingProcess06 39.78
## BiologicalMaterial05 39.64
## BiologicalMaterial09 37.94
## ManufacturingProcess01 36.37
## ManufacturingProcess33 35.96
## ManufacturingProcess11 35.10
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 49.19
## ManufacturingProcess17 46.77
## BiologicalMaterial03 39.52
## ManufacturingProcess29 30.65
## BiologicalMaterial02 30.65
## BiologicalMaterial06 27.42
## ManufacturingProcess22 25.81
## ManufacturingProcess04 23.39
## ManufacturingProcess34 18.55
## ManufacturingProcess27 15.32
## BiologicalMaterial08 14.52
## ManufacturingProcess26 14.52
## ManufacturingProcess01 14.52
## BiologicalMaterial01 13.71
## ManufacturingProcess24 12.90
## BiologicalMaterial10 12.10
## ManufacturingProcess45 11.29
## BiologicalMaterial12 11.29
## BiologicalMaterial04 10.48
By looking at the above summaries, we can see that the rpart and Cubist models have similar slopes of the importance curve, whereas the random forest model has a much shallower one. Additionally, manufacturing process predictors dominated the models over biological ones.
Below, the optimal tree is described. Its root split is on ManufacturingProcess32 at 159.5, and the yval column shows a mean yield of 39.27 below that point and 41.52 at or above it. The check of subset means that follows the tree appears to have used a cutoff of .006: it returns NaN for the 'less than' subset and the overall mean (40.18) for the 'more than' subset, which indicates that no observations fall below that cutoff.
## n= 140
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 140 454.300300 40.18379
## 2) ManufacturingProcess32< 159.5 83 171.874100 39.26651
## 4) BiologicalMaterial12< 19.975 45 74.609320 38.57711
## 8) BiologicalMaterial05>=19.705 8 5.948750 36.99750 *
## 9) BiologicalMaterial05< 19.705 37 44.383230 38.91865
## 18) BiologicalMaterial05< 19.07 30 19.951230 38.63700
## 36) BiologicalMaterial09>=12.985 8 1.241487 37.94875 *
## 37) BiologicalMaterial09< 12.985 22 13.542240 38.88727
## 74) ManufacturingProcess09< 44.995 8 0.640800 38.23000 *
## 75) ManufacturingProcess09>=44.995 14 7.470486 39.26286 *
## 19) BiologicalMaterial05>=19.07 7 11.853170 40.12571 *
## 5) BiologicalMaterial12>=19.975 38 50.551180 40.08289
## 10) ManufacturingProcess17>=33.85 24 9.976196 39.41542 *
## 11) ManufacturingProcess17< 33.85 14 11.552090 41.22714 *
## 3) ManufacturingProcess32>=159.5 57 110.898300 41.51947
## 6) ManufacturingProcess06< 208.1 33 63.766020 40.96515
## 12) ManufacturingProcess04< 933.5 23 30.529530 40.53826
## 24) ManufacturingProcess02>=18.5 16 8.787344 40.02687 *
## 25) ManufacturingProcess02< 18.5 7 7.993943 41.70714 *
## 13) ManufacturingProcess04>=933.5 10 19.404810 41.94700 *
## 7) ManufacturingProcess06>=208.1 24 23.049730 42.28167
## 14) ManufacturingProcess17>=33.45 15 9.368773 41.79867 *
## 15) ManufacturingProcess17< 33.45 9 4.349400 43.08667 *
## [1] "Mean of 'less than' yield"
## [1] NaN
## [1] "Mean of 'more than' yield"
## [1] 40.18379
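The NaN above is what `mean()` returns for an empty vector. The sketch below reproduces that behaviour on a small hypothetical data frame (the column name `Yield` and all values are made up for illustration) and shows the same check with the cutoff taken from the printed tree.

```r
# Hypothetical stand-in for the training data; the values are illustrative only.
train_df <- data.frame(
  ManufacturingProcess32 = c(150, 155, 158, 161, 165, 170),
  Yield                  = c(38.9, 39.4, 39.5, 41.2, 41.6, 41.8)
)

# Mean yield on each side of a cutoff on ManufacturingProcess32.
subset_means <- function(df, cutoff) {
  below <- df$Yield[df$ManufacturingProcess32 <  cutoff]
  above <- df$Yield[df$ManufacturingProcess32 >= cutoff]
  c(less_than = mean(below), more_than = mean(above))
}

print(subset_means(train_df, 159.5))   # both subsets are non-empty
print(subset_means(train_df, 0.006))   # nothing is below the cutoff, so less_than is NaN
```

Checking `sum(df$ManufacturingProcess32 < cutoff)` before taking the mean is a quick way to catch a cutoff that is on the wrong scale.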