Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.
# Required R packages
library(tidyverse)
library(mlbench)
library(randomForest)
library(caret)
library(party)
library(kableExtra)
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated = mlbench.friedman1(200, sd = 1)
simulated = cbind(simulated$x, simulated$y)
simulated = as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] = "y"
model1 = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 = varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Based on the importance scores above, the random forest model did not significantly use the uninformative predictors (V6 – V10): their importance values are all close to zero, and several are negative.
set.seed(525)
simulated$duplicate1 = simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9214171
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 = varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.069974498
## V2 6.251421110
## V3 0.576015993
## V4 6.872802659
## V5 1.903860827
## V6 0.163886738
## V7 0.091403979
## V8 0.038224086
## V9 0.050789286
## V10 -0.009303006
## duplicate1 3.762100696
After adding a predictor that is highly correlated with V1, the importance of V1 dropped by nearly half (from about 8.7 to 6.1). Because V1 and duplicate1 carry nearly the same information, the choice of which one to use at any given split is somewhat random; each is used in some of the trees, so the importance credit is divided between them. As a result, more predictors may appear to be needed than actually are, and the importance values of correlated predictors are diluted.
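The refit with a second V1-correlated predictor is not shown in the write-up; below is a minimal sketch, working on a copy of the data so the later chunks are unaffected (sim2, duplicate2, and model2b are hypothetical names). The expectation is that V1's importance drops further as three correlated columns now share the credit.
# Sketch only: add a second predictor highly correlated with V1 and refit.
set.seed(525)
sim2 = simulated
sim2$duplicate2 = sim2$V1 + rnorm(200) * .1
model2b = randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)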
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Does this importance show the same pattern as the traditional random forest model?
model3 = cforest(y ~ ., data = simulated)
# Conditional variable importance
cfImp3 = varimp(model3, conditional = TRUE)
# Un-conditional variable importance
cfImp4 = varimp(model3, conditional = FALSE)
## Overall Condition Uncondition
## V1 6.069974498 3.282262865 6.749896825
## V2 6.251421110 5.333140929 6.608794026
## V3 0.576015993 -0.011290827 -0.005387238
## V4 6.872802659 6.208000862 7.968862415
## V5 1.903860827 1.104881893 1.800410315
## V6 0.163886738 -0.008741740 0.014506722
## V7 0.091403979 0.042938165 0.066003046
## V8 0.038224086 -0.008163298 -0.010127230
## V9 0.050789286 0.002996186 -0.002605593
## V10 -0.009303006 -0.020603819 -0.027945528
## duplicate1 3.762100696 0.990731273 3.013624719
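The code that assembled the three-column table above is not shown; one way it could have been built from the rfImp2, cfImp3, and cfImp4 objects of the previous chunks is sketched below (imp.compare is a hypothetical name).
# Sketch only: bind the traditional RF importance with the conditional and
# unconditional cforest importances for a side-by-side comparison.
imp.compare = data.frame(Overall = rfImp2$Overall,
                         Condition = cfImp3[rownames(rfImp2)],
                         Uncondition = cfImp4[rownames(rfImp2)],
                         row.names = rownames(rfImp2))
imp.compare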
The conditional importance scores differ from those of the traditional random forest model. When importance is calculated conditionally, the correlation between V1 and duplicate1 is taken into account and their importance is adjusted downward relative to the unconditional and traditional measures. Regardless of the measure, however, the uninformative predictors remain near the bottom of the ranking and V4 remains the most important predictor. The main caveat is that correlated predictors can be misleading under the traditional and unconditional measures.
# boosting regression trees via stochastic gradient boosting machines
library(gbm)
gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2),
n.trees = seq(100, 1000, by = 100),
shrinkage = 0.1,
n.minobsinnode = 5)
gbm.model = train(y ~ ., data = simulated,
tuneGrid = gbmGrid, verbose = FALSE,
method = 'gbm')
# Variable importance
gbm.Imp3 = varImp(gbm.model)
# Cubist
library(Cubist)
cubist.model = cubist(x = simulated[, names(simulated)[names(simulated) != 'y']],
y = simulated[,c('y')])
# Variable importance
cube.Imp3 = varImp(cubist.model)
df = as.data.frame(cbind(cfImp4, cfImp3, gbm.Imp3$importance, cube.Imp3))
names(df) = c("cf.uncond", "cf.cond", "boosted", "cubist")
df
## cf.uncond cf.cond boosted cubist
## V1 6.749896825 3.282262865 73.608966 50
## V2 6.608794026 5.333140929 72.899028 50
## V3 -0.005387238 -0.011290827 32.829245 50
## V4 7.968862415 6.208000862 100.000000 50
## V5 1.800410315 1.104881893 33.129076 50
## V6 0.014506722 -0.008741740 4.206332 0
## V7 0.066003046 0.042938165 4.169578 0
## V8 -0.010127230 -0.008163298 1.638842 0
## V9 -0.002605593 0.002996186 0.000000 0
## V10 -0.027945528 -0.020603819 1.145871 0
## duplicate1 3.013624719 0.990731273 18.878971 0
Comparing the results: as with cforest, the uninformative predictors V6 – V10 are still ranked among the lowest, and V4 is still the top predictor for the boosted trees. The boosted model also ranks V1 well above duplicate1; because the trees in boosting are grown sequentially and depend on one another, the importance tends to concentrate on one member of a correlated pair. The Cubist importance pattern differs from both the random forest and the boosted trees, but it likewise ranks V1 higher than duplicate1.
Use a simulation to show tree bias with different granularities.
Tree models exhibit a selection bias related to granularity: predictors with a larger number of distinct values are favored over predictors with fewer distinct values.
Below, five variables are simulated with different granularities. The setup is such that v1 should receive the highest importance because it has the most distinct values. The response is the sum of v1 and v4 (chosen at random) plus some noise.
set.seed(525)
df = data.frame(v1 = sample(1:5000 / 5000, 252, replace = TRUE), # 5000 distinct values
v2 = sample(1:50 / 50, 252, replace = TRUE), # 50 distinct values
v3 = sample(1:500 / 500, 252, replace = TRUE), # 500 distinct values
v4 = sample(1:5 / 5, 252, replace = TRUE), # 5 distinct values
v5 = rep(1, 252)) # no distinct values
df$y = df$v1 + df$v4 + rnorm(252)
head(df)
## v1 v2 v3 v4 v5 y
## 1 0.5500 0.54 0.408 0.6 1 2.2573334
## 2 0.7532 0.90 0.612 0.8 1 1.3306134
## 3 0.2298 0.56 0.158 0.8 1 -0.5837182
## 4 0.6712 0.40 0.170 1.0 1 1.9485260
## 5 0.2732 0.92 0.538 0.6 1 1.4062075
## 6 0.3006 0.66 0.316 0.2 1 0.8879699
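The model fit that produced the importance scores discussed below is not shown in the write-up; a minimal sketch, assuming a single regression tree fit through caret (tree.sim is a hypothetical name, and the actual model and scaling used may differ):
# Sketch only: fit a single regression tree to the simulated data and inspect
# the unscaled variable importance.
set.seed(525)
tree.sim = train(y ~ ., data = df, method = "rpart",
                 trControl = trainControl(method = "cv"))
varImp(tree.sim, scale = FALSE)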
As expected, v1, which has the most distinct values, is the most important variable, followed by v3, v2, and then v4. Note that v5, which takes only a single value, receives no importance at all. The simulation therefore illustrates the selection bias of tree models toward predictors with more distinct values.
## Overall
## v1 0.8031078
## v2 0.5261687
## v3 0.7027381
## v4 0.4392937
## v5 0.0000000
In stochastic gradient boosting, the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance.
Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Gradient boosting builds a prediction model as an ensemble of decision trees fit in sequence. Regularization, governed largely by the learning rate and the bagging fraction, reduces overfitting. Learning rates below 0.1 typically yield significant improvements in a model's generalization. The bagging fraction is the proportion of the training data used at each iteration; small values inject randomness into the algorithm, which reduces the chance of overfitting.
From the plots, the right-hand model concentrates its importance on a few variables because both the learning rate and the bagging fraction are high: a large portion of the data is used at each iteration, which strengthens the correlation structure across iterations, so only a few variables end up looking important. The left-hand model, with a low learning rate and a low bagging fraction, spreads importance more evenly across the variables; only a small portion of the data is used at each iteration and the fits are less dependent on one another. The variable importance plots under the two extreme settings are therefore quite different.
For the reasons above, the model with the smaller learning rate and bagging fraction is more likely to generalize well to new samples; the model with the larger learning rate and bagging fraction will most likely overfit.
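The solubility data are not loaded in this write-up, so the sketch below illustrates the same effect on the simulated Friedman data from earlier, fitting gbm directly with the two extreme settings (gbm.low and gbm.high are hypothetical names; the exact relative influences will differ from Figure 8.24).
# Sketch only: compare variable importance under low versus high values of the
# learning rate (shrinkage) and bagging fraction (bag.fraction).
set.seed(525)
gbm.low = gbm(y ~ ., data = simulated, distribution = "gaussian",
              n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1,
              interaction.depth = 3, n.minobsinnode = 5)
set.seed(525)
gbm.high = gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 3, n.minobsinnode = 5)
# Relative influence: the high-shrinkage/high-bagging fit should concentrate
# importance on fewer predictors than the low-shrinkage/low-bagging fit.
summary(gbm.low, plotit = FALSE)
summary(gbm.high, plotit = FALSE)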
Increasing the interaction depth (tree depth) allows more predictors to enter the model, which spreads the importance scores over more variables. The slope of the predictor importance plot would therefore become flatter.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
From Homework #7 and #8, the ChemicalManufacturingProcess matrix, containing 57 predictors (12 describing the input biological material and 45 describing the process) for the 176 manufacturing runs, is used again.
The variable summary below shows that some variables have n < 176; these are the variables with missing values that must be imputed. In addition, quite a few variables are heavily skewed, which suggests a transformation toward normality is needed.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
psych::describe(ChemicalManufacturingProcess)[,-c(1,5,6,10,13)] %>%
kable() %>%
kable_styling(bootstrap_options = "striped") %>%
scroll_box(width = "100%", height = "300px")

|  | n | mean | sd | mad | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| Yield | 176 | 40.1765341 | 1.8456664 | 1.9718580 | 35.250 | 46.340 | 0.3109596 | -0.1132944 |
| BiologicalMaterial01 | 176 | 6.4114205 | 0.7139225 | 0.6745830 | 4.580 | 8.810 | 0.2733165 | 0.4567758 |
| BiologicalMaterial02 | 176 | 55.6887500 | 4.0345806 | 4.5812340 | 46.870 | 64.750 | 0.2441269 | -0.7050911 |
| BiologicalMaterial03 | 176 | 67.7050000 | 4.0010641 | 4.2773010 | 56.970 | 78.250 | 0.0285108 | -0.1235203 |
| BiologicalMaterial04 | 176 | 12.3492614 | 1.7746607 | 1.3714050 | 9.380 | 23.090 | 1.7323153 | 7.0564614 |
| BiologicalMaterial05 | 176 | 18.5986364 | 1.8441408 | 1.8829020 | 13.240 | 24.850 | 0.3040053 | 0.2198005 |
| BiologicalMaterial06 | 176 | 48.9103977 | 3.7460718 | 3.9437160 | 40.600 | 59.380 | 0.3685344 | -0.3654933 |
| BiologicalMaterial07 | 176 | 100.0141477 | 0.1077423 | 0.0000000 | 100.000 | 100.830 | 7.3986642 | 53.0417012 |
| BiologicalMaterial08 | 176 | 17.4947727 | 0.6769536 | 0.5930400 | 15.880 | 19.140 | 0.2200539 | 0.0627721 |
| BiologicalMaterial09 | 176 | 12.8500568 | 0.4151757 | 0.4225410 | 11.440 | 14.080 | -0.2684177 | 0.2927765 |
| BiologicalMaterial10 | 176 | 2.8006250 | 0.5991433 | 0.4003020 | 1.770 | 6.870 | 2.4023783 | 11.6471845 |
| BiologicalMaterial11 | 176 | 146.9531818 | 4.8204704 | 4.1142150 | 135.810 | 158.730 | 0.3588211 | 0.0162456 |
| BiologicalMaterial12 | 176 | 20.1998864 | 0.7735440 | 0.6671700 | 18.350 | 22.210 | 0.3038443 | 0.0146595 |
| ManufacturingProcess01 | 175 | 11.2074286 | 1.8224342 | 1.0378200 | 0.000 | 14.100 | -3.9201855 | 21.8688069 |
| ManufacturingProcess02 | 173 | 16.6826590 | 8.4715694 | 1.4826000 | 0.000 | 22.500 | -1.4307675 | 0.1062466 |
| ManufacturingProcess03 | 161 | 1.5395652 | 0.0223983 | 0.0148260 | 1.470 | 1.600 | -0.4799447 | 1.7280557 |
| ManufacturingProcess04 | 175 | 931.8514286 | 6.2744406 | 5.9304000 | 911.000 | 946.000 | -0.6979357 | 0.0631282 |
| ManufacturingProcess05 | 175 | 1001.6931429 | 30.5272134 | 17.3464200 | 923.000 | 1175.300 | 2.5872769 | 11.7446904 |
| ManufacturingProcess06 | 174 | 207.4017241 | 2.6993999 | 1.9273800 | 203.000 | 227.400 | 3.0419007 | 17.3764864 |
| ManufacturingProcess07 | 175 | 177.4800000 | 0.5010334 | 0.0000000 | 177.000 | 178.000 | 0.0793788 | -2.0050587 |
| ManufacturingProcess08 | 175 | 177.5542857 | 0.4984706 | 0.0000000 | 177.000 | 178.000 | -0.2165645 | -1.9642262 |
| ManufacturingProcess09 | 176 | 45.6601136 | 1.5464407 | 1.2157320 | 38.890 | 49.360 | -0.9406685 | 3.2701986 |
| ManufacturingProcess10 | 167 | 9.1790419 | 0.7666884 | 0.5930400 | 7.500 | 11.600 | 0.6492504 | 0.6317264 |
| ManufacturingProcess11 | 166 | 9.3855422 | 0.7157336 | 0.6671700 | 7.500 | 11.500 | -0.0193109 | 0.3227966 |
| ManufacturingProcess12 | 175 | 857.8114286 | 1784.5282624 | 0.0000000 | 0.000 | 4549.000 | 1.5786729 | 0.4951353 |
| ManufacturingProcess13 | 176 | 34.5079545 | 1.0152800 | 0.8895600 | 32.100 | 38.600 | 0.4802776 | 1.9593883 |
| ManufacturingProcess14 | 175 | 4853.8685714 | 54.5236412 | 40.0302000 | 4701.000 | 5055.000 | -0.0109687 | 1.0781378 |
| ManufacturingProcess15 | 176 | 6038.9204545 | 58.3125023 | 40.7715000 | 5904.000 | 6233.000 | 0.6743478 | 1.2162163 |
| ManufacturingProcess16 | 176 | 4565.8011364 | 351.6973215 | 42.9954000 | 0.000 | 4852.000 | -12.4202248 | 158.3981993 |
| ManufacturingProcess17 | 176 | 34.3437500 | 1.2482059 | 1.1860800 | 31.300 | 40.000 | 1.1629715 | 4.6626982 |
| ManufacturingProcess18 | 176 | 4809.6818182 | 367.4777364 | 34.8411000 | 0.000 | 4971.000 | -12.7361378 | 163.7375845 |
| ManufacturingProcess19 | 176 | 6028.1988636 | 45.5785689 | 36.3237000 | 5890.000 | 6146.000 | 0.2973414 | 0.2962151 |
| ManufacturingProcess20 | 176 | 4556.4602273 | 349.0089784 | 42.9954000 | 0.000 | 4759.000 | -12.6383268 | 162.0663905 |
| ManufacturingProcess21 | 176 | -0.1642045 | 0.7782930 | 0.4447800 | -1.800 | 3.600 | 1.7291140 | 5.0274763 |
| ManufacturingProcess22 | 175 | 5.4057143 | 3.3306262 | 4.4478000 | 0.000 | 12.000 | 0.3148909 | -1.0175458 |
| ManufacturingProcess23 | 175 | 3.0171429 | 1.6625499 | 1.4826000 | 0.000 | 6.000 | 0.1967985 | -0.9975572 |
| ManufacturingProcess24 | 175 | 8.8342857 | 5.7994224 | 7.4130000 | 0.000 | 23.000 | 0.3593200 | -1.0207362 |
| ManufacturingProcess25 | 171 | 4828.1754386 | 373.4810865 | 34.0998000 | 0.000 | 4990.000 | -12.6310220 | 160.3293620 |
| ManufacturingProcess26 | 171 | 6015.5964912 | 464.8674900 | 38.5476000 | 0.000 | 6161.000 | -12.6694398 | 160.9849144 |
| ManufacturingProcess27 | 171 | 4562.5087719 | 353.9848679 | 35.5824000 | 0.000 | 4710.000 | -12.5174778 | 158.3931091 |
| ManufacturingProcess28 | 171 | 6.5918129 | 5.2489823 | 1.0378200 | 0.000 | 11.500 | -0.4556335 | -1.7907822 |
| ManufacturingProcess29 | 171 | 20.0111111 | 1.6638879 | 0.4447800 | 0.000 | 22.000 | -10.0848133 | 119.4378857 |
| ManufacturingProcess30 | 171 | 9.1614035 | 0.9760824 | 0.7413000 | 0.000 | 11.200 | -4.7557268 | 43.0848842 |
| ManufacturingProcess31 | 171 | 70.1847953 | 5.5557816 | 0.8895600 | 0.000 | 72.500 | -11.8231008 | 146.0094297 |
| ManufacturingProcess32 | 176 | 158.4659091 | 5.3972456 | 4.4478000 | 143.000 | 173.000 | 0.2112252 | 0.0602714 |
| ManufacturingProcess33 | 171 | 63.5438596 | 2.4833813 | 1.4826000 | 56.000 | 70.000 | -0.1310030 | 0.2740324 |
| ManufacturingProcess34 | 171 | 2.4935673 | 0.0543910 | 0.0000000 | 2.300 | 2.600 | -0.2634497 | 1.0013075 |
| ManufacturingProcess35 | 171 | 495.5964912 | 10.8196874 | 8.8956000 | 463.000 | 522.000 | -0.1556154 | 0.4130958 |
| ManufacturingProcess36 | 171 | 0.0195731 | 0.0008739 | 0.0014826 | 0.017 | 0.022 | 0.1453141 | -0.0557822 |
| ManufacturingProcess37 | 176 | 1.0136364 | 0.4450828 | 0.4447800 | 0.000 | 2.300 | 0.3783578 | 0.0698597 |
| ManufacturingProcess38 | 176 | 2.5340909 | 0.6493753 | 0.0000000 | 0.000 | 3.000 | -1.6818052 | 3.9189211 |
| ManufacturingProcess39 | 176 | 6.8511364 | 1.5054943 | 0.1482600 | 0.000 | 7.500 | -4.2691214 | 16.4987895 |
| ManufacturingProcess40 | 175 | 0.0177143 | 0.0382885 | 0.0000000 | 0.000 | 0.100 | 1.6768073 | 0.8164458 |
| ManufacturingProcess41 | 175 | 0.0237143 | 0.0538242 | 0.0000000 | 0.000 | 0.200 | 2.1686898 | 3.6290714 |
| ManufacturingProcess42 | 176 | 11.2062500 | 1.9416092 | 0.2965200 | 0.000 | 12.100 | -5.4500082 | 28.5288867 |
| ManufacturingProcess43 | 176 | 0.9119318 | 0.8679860 | 0.2965200 | 0.000 | 11.000 | 9.0548747 | 101.0332345 |
| ManufacturingProcess44 | 176 | 1.8051136 | 0.3220062 | 0.1482600 | 0.000 | 2.100 | -4.9703552 | 25.0876065 |
| ManufacturingProcess45 | 176 | 2.1380682 | 0.4069043 | 0.1482600 | 0.000 | 2.600 | -4.0779411 | 18.7565001 |
A small percentage of cells in the predictor set contain missing values. Because these proportions are not extreme for most of the variables, imputation by k-nearest neighbors is used. In caret's knnImpute, the neighbors are the k closest training samples in the space of the centered and scaled predictors, and the imputed value is the average of those neighbors' values. As a result, the data set is now complete.
pre.process = preProcess(ChemicalManufacturingProcess[, -c(1)], method = "knnImpute")
chemical = predict(pre.process, ChemicalManufacturingProcess[, -c(1)])
To build a smaller model without predictors that are extremely highly correlated, the predictor set is reduced so that no absolute pairwise correlation exceeds 0.90. The list below shows the ten largest absolute pairwise correlations (after removing the trivial self-correlations of 1); all ten are 0.97 or higher. Afterward, the data are pre-processed toward normality using the Yeo-Johnson transformation (Yeo & Johnson, 2000). This technique finds the value of lambda that minimizes the Kullback-Leibler distance between the transformed distribution and the normal distribution, and it has the advantage of not being restricted by the domain of x.
corr = cor(chemical)
corr[corr == 1] = NA
corr[abs(corr) < 0.85] = NA
corr = na.omit(reshape::melt(corr))
head(corr[order(-abs(corr$value)),], 10)
## X1 X2 value
## 2090 ManufacturingProcess26 ManufacturingProcess25 0.9975339
## 2146 ManufacturingProcess25 ManufacturingProcess26 0.9975339
## 2148 ManufacturingProcess27 ManufacturingProcess26 0.9960721
## 2204 ManufacturingProcess26 ManufacturingProcess27 0.9960721
## 2091 ManufacturingProcess27 ManufacturingProcess25 0.9934932
## 2203 ManufacturingProcess25 ManufacturingProcess27 0.9934932
## 1685 ManufacturingProcess20 ManufacturingProcess18 0.9917474
## 1797 ManufacturingProcess18 ManufacturingProcess20 0.9917474
## 2095 ManufacturingProcess31 ManufacturingProcess25 0.9706780
## 2431 ManufacturingProcess25 ManufacturingProcess31 0.9706780
tooHigh = findCorrelation(cor(chemical), 0.90)
chemical = chemical[, -tooHigh]
(pre.process = preProcess(chemical, method = c("YeoJohnson", "center", "scale")))
## Created from 176 samples and 47 variables
##
## Pre-processing:
## - centered (47)
## - ignored (0)
## - scaled (47)
## - Yeo-Johnson transformation (41)
##
## Lambda estimates for Yeo-Johnson transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.8404 0.7085 0.8795 0.9698 1.2484 2.7992
Next, the data are split into training and test sets in a ratio of 4:1, and several tree-based models are fitted.
set.seed(525)
intrain = createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
train.p = chemical[intrain, ]
train.r = ChemicalManufacturingProcess$Yield[intrain]
test.p = chemical[-intrain, ]
test.r = ChemicalManufacturingProcess$Yield[-intrain]
There will be four tree-based models, and RMSE is used to determine the best fit.
First, a single regression tree is fit with rpart.
set.seed(525)
library(rpart) # rpart.control() below requires the rpart package to be attached
sig.model = train(x = train.p,
y = train.r,
method = "rpart",
tuneLength = 10,
control = rpart.control(maxdepth=2),
trControl = trainControl(method = "cv",
repeats = 5)) # note: repeats is ignored unless method = "repeatedcv"
## cp
## 6 0.04669955
data.frame(rsquared = sig.model[["results"]][["Rsquared"]][as.numeric(rownames(sig.model$bestTune))],
rmse = sig.model[["results"]][["RMSE"]][as.numeric(rownames(sig.model$bestTune))])
## rsquared rmse
## 1 0.3693208 1.524235
RMSE was used to select the optimal model using the smallest value. The best tune for the single regression tree is a complexity parameter of 0.047, giving RMSE = 1.52 and \(R^2\) = 0.37. This model explains the smallest portion of the variability of the four models considered, but within the single-tree candidates this tune produces the smallest cross-validated error and is therefore the best fit of its family.
set.seed(525)
rf.model = train(x = train.p,
y = train.r,
method = "rf",
tuneLength = 10,
trControl = trainControl(method = "cv",
repeats = 5))
## mtry
## 3 12
data.frame(rsquared = rf.model[["results"]][["Rsquared"]][as.numeric(rownames(rf.model$bestTune))],
rmse = rf.model[["results"]][["RMSE"]][as.numeric(rownames(rf.model$bestTune))])
## rsquared rmse
## 1 0.6547554 1.164854
RMSE was used to select the optimal model using the smallest value. The best tune for the random forest model has mtry = 12, i.e., 12 randomly selected predictors considered at each split, giving RMSE = 1.16 and \(R^2\) = 0.65. This tune explains the largest portion of the variability among the candidate values of mtry and produces the smallest cross-validated error, making it the best fit of its family.
set.seed(525)
gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2),
n.trees = seq(100, 1000, by = 100),
shrinkage = 0.1,
n.minobsinnode = 5)
gbm.model = train(x = train.p,
y = train.r,
method = "gbm",
tuneGrid = gbmGrid,
trControl = trainControl(method = "cv",
repeats = 5),
verbose = FALSE)
## n.trees interaction.depth shrinkage n.minobsinnode
## 28 800 5 0.1 5
data.frame(rsquared = gbm.model[["results"]][["Rsquared"]][as.numeric(rownames(gbm.model$bestTune))],
rmse = gbm.model[["results"]][["RMSE"]][as.numeric(rownames(gbm.model$bestTune))])
## rsquared rmse
## 1 0.5454097 1.31391
RMSE was used to select the optimal model using the smallest value. The boosted-tree model is tuned over the interaction depth, the number of trees, and the shrinkage. The combination with the smallest root mean squared error is n.trees = 800, an interaction depth of 5, a shrinkage of 0.1, and a minimum of 5 observations in the trees' terminal nodes, giving RMSE = 1.31 and \(R^2\) = 0.55. Within the boosted-tree candidates, this tune explains the most variability and produces the smallest cross-validated error, making it the best fit of its family.
set.seed(525)
cub.model = train(x = train.p,
y = train.r,
method = "cubist",
tuneLength = 10,
trControl = trainControl(method = "cv",
repeats = 5))
## committees neighbors
## 8 20 5
data.frame(rsquared = cub.model[["results"]][["Rsquared"]][as.numeric(rownames(cub.model$bestTune))],
rmse = cub.model[["results"]][["RMSE"]][as.numeric(rownames(cub.model$bestTune))])
## rsquared rmse
## 1 0.6375071 1.119134
RMSE was used to select the optimal model using the smallest value. The best tune for the Cubist model uses 20 committees and corrects the predictions using the 5 nearest neighbors, giving RMSE = 1.12 and \(R^2\) = 0.64. Within the Cubist candidates, this tune explains the most variability and produces the smallest cross-validated error, making it the best fit of its family.
Which tree-based regression model gives the optimal resampling and test set performance?
By resampling, performance metrics were collected and compared to determine which model best fits the training data. The results below show that the random forest model has the largest mean \(R^2\) (0.65) across the 10 cross-validation resamples, but it does not produce the smallest RMSE; the Cubist model does, with a mean RMSE of 1.12. Based on RMSE, the Cubist model fits the training data better than the single regression tree, random forest, and gradient boosted models.
set.seed(525)
summary(resamples(list(single = sig.model,
rf = rf.model,
gbm = gbm.model,
cubist = cub.model)))
##
## Call:
## summary.resamples(object = resamples(list(single = sig.model, rf =
## rf.model, gbm = gbm.model, cubist = cub.model)))
##
## Models: single, rf, gbm, cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## single 0.8754220 1.0015262 1.1175626 1.1888780 1.2945343 1.791230 0
## rf 0.6989707 0.7392571 0.8427794 0.8979355 0.9241703 1.301306 0
## gbm 0.6161708 0.7073535 0.8270680 0.8473638 0.9965386 1.063495 0
## cubist 0.5802682 0.7123826 0.8557653 0.8795848 0.9494055 1.314607 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## single 1.0650430 1.3215967 1.454610 1.524235 1.725709 2.019552 0
## rf 0.8124008 0.9694383 1.108072 1.164854 1.283113 1.646847 0
## gbm 0.7384911 1.0101713 1.181483 1.147954 1.292906 1.482990 0
## cubist 0.8063886 0.9010233 1.091564 1.119134 1.285865 1.629093 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## single 0.06236162 0.2588163 0.3713675 0.3693208 0.4806552 0.6449245 0
## rf 0.56833761 0.6047194 0.6376973 0.6547554 0.6763671 0.8709030 0
## gbm 0.39771434 0.5571460 0.6073452 0.6428870 0.7997377 0.8563244 0
## cubist 0.16351491 0.5963975 0.7143895 0.6375071 0.7776114 0.8110698 0
Now, each model is used to make predictions on the test-set predictors, and the accuracy is compared against the actual test responses.
# Compute test-set performance (RMSE, R-squared, MAE) for each fitted model
accuracy = function(models, predictors, response){
acc = list()
i = 1
for (model in models){
predictions = predict(model, newdata = predictors)
acc[[i]] = postResample(pred = predictions, obs = response)
i = i + 1
}
names(acc) = c("single", "rf", "gbm", "cubist")
return(acc)
}
models = list(sig.model, rf.model, gbm.model, cub.model)
accuracy(models, test.p, test.r)
## $single
## RMSE Rsquared MAE
## 1.2617150 0.5821762 1.0029291
##
## $rf
## RMSE Rsquared MAE
## 1.0408299 0.7612483 0.8464121
##
## $gbm
## RMSE Rsquared MAE
## 0.9563292 0.7458271 0.7447816
##
## $cubist
## RMSE Rsquared MAE
## 0.7715339 0.8322406 0.6559078
From the results, it can be concluded that the Cubist model predicted the test response with the best accuracy. It has \(R^2\) = 0.83 and RMSE = 0.77.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The 20 most important variables out of 47 are ranked below. For a Cubist model, caret's varImp is based on the model's variable usage statistics: how often each predictor appears in a rule condition and in the linear models within the rules, combined into a single relative measure. The list shows that the most influential variable is ManufacturingProcess32, and that the manufacturing process variables dominate the list.
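The call that produced the ranking below is not shown; presumably it is simply caret's varImp applied to the tuned Cubist model, as sketched here.
# Variable importance for the tuned Cubist model (the print method shows the
# top 20 predictors by default)
varImp(cub.model)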
## cubist variable importance
##
## only 20 most important variables shown (out of 47)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09 39.837
## ManufacturingProcess33 37.398
## ManufacturingProcess17 34.146
## ManufacturingProcess04 25.203
## ManufacturingProcess13 21.138
## ManufacturingProcess14 17.886
## ManufacturingProcess21 16.260
## ManufacturingProcess19 13.821
## BiologicalMaterial08 12.195
## BiologicalMaterial10 11.382
## ManufacturingProcess37 10.569
## ManufacturingProcess15 10.569
## BiologicalMaterial05 8.943
## BiologicalMaterial06 8.130
## BiologicalMaterial11 8.130
## ManufacturingProcess02 7.317
## ManufacturingProcess39 7.317
## ManufacturingProcess43 5.691
## ManufacturingProcess28 4.878
Among the non-linear models, it was the SVM model that fit the data best. While the manufacturing process variables still dominate its importance list, their ranks differ from those of the tree-based model.
set.seed(525)
svm.model = train(x = train.p,
y = train.r,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
Among the linear models, it was the elastic net model that fit the data best, and it needed only 15 of the 47 variables. While the manufacturing process variables still dominate its importance list, their ranks differ from those of the non-linear model and the regression tree model.
# Elastic Net Regression
set.seed(525)
elastic.model = train(x = train.p, y = train.r, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10)
p1 = plot(varImp(elastic.model), top = 10, main = "Linear Model: Elastic Net")
p2 = plot(varImp(svm.model), top = 10, main = "Non-Linear Model: SVM")
p3 = plot(varImp(cub.model), top = 10, main = "Regression Tree: Cubist")
gridExtra::grid.arrange(p1, p2, p3, ncol = 3)
From the importance plots above, the three models rank the variables differently, except that ManufacturingProcess32 is the top variable for all of them. The non-linear model (SVM) distributes importance to a few more biological material variables than the other two models. Interestingly, ManufacturingProcess09 is the second most important variable for both the regression tree and the linear model, and BiologicalMaterial06 is the top-ranking biological material variable for both the linear and non-linear models.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 484.98730 40.18951
## 2) ManufacturingProcess32< 0.2324036 81 141.73460 39.18975
## 4) BiologicalMaterial11< -0.3311184 41 44.02209 38.57317 *
## 5) BiologicalMaterial11>=-0.3311184 40 66.14858 39.82175 *
## 3) ManufacturingProcess32>=0.2324036 63 158.19840 41.47492
## 6) ManufacturingProcess06< 0.5762191 35 71.77247 40.79457 *
## 7) ManufacturingProcess06>=0.5762191 28 49.97450 42.32536 *
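The code behind the tree print-out above and the plot discussed below is not shown; a minimal sketch, assuming the partykit package (not loaded earlier) and the hypothetical names tree.dat and final.tree, refits the selected tree and plots it with the distribution of yield in each terminal node (partykit draws boxplots for a numeric response). The refit may differ slightly from caret's internal final model.
library(rpart)
library(partykit)
# Sketch only: refit the chosen single tree on the training data so that the
# data travel with the model, then plot yield distributions per terminal node.
tree.dat = data.frame(train.p, Yield = train.r)
final.tree = rpart(Yield ~ ., data = tree.dat,
                   control = rpart.control(maxdepth = 2, cp = sig.model$bestTune$cp))
final.tree
plot(as.party(final.tree))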
The optimal single tree splits first on ManufacturingProcess32 at 0.232 (on the standardized scale). When ManufacturingProcess32 is below 0.232 the mean yield is 39.19; when it is at or above 0.232 the mean yield is 41.47. The tree then branches to terminal nodes on BiologicalMaterial11 and ManufacturingProcess06, where the yield can improve further. This view of the data provides additional knowledge about the process predictors and their relationship with yield: higher values of ManufacturingProcess32, and of ManufacturingProcess06 on the right branch, are associated with higher yield.