CUNY 624 Week 12 Homework 9

Joel Park

8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

A. Fit a random forest model to all of the predictors, then estimate the variable importance score.

Random forests can be thought of as an improvement on bagged trees. Because every predictor is considered at every split in bagging, the individual trees tend to look alike and become correlated with one another, which reduces the effectiveness of the ensemble. By randomly selecting a subset of the predictors at each split instead of using the full set, we decrease the chance of tree correlation and produce a more stable and reliable model. Below, we fit the random forest model and estimate the variable importance scores.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
model1
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.582687
##                     % Var explained: 73
rfImp1
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

varImp is a function in caret that calculates variable importance for objects produced by the train function (and for many fitted model objects directly). According to “Applied Predictive Modeling”, the ensemble nature of random forests makes it impossible to gain an understanding of the relationship between the predictors and the response. However, because trees are the typical base learner for this method, it is possible to quantify the impact of predictors in the ensemble. Breiman (2000) originally proposed randomly permuting the values of each predictor, one predictor at a time, in the out-of-bag sample for each tree. The difference in predictive performance between the non-permuted sample and the permuted sample for each predictor is recorded and aggregated across the entire forest. Another approach is to measure the improvement in node purity based on the performance metric for each predictor at each occurrence of that predictor across the forest. These individual improvement values for each predictor are then aggregated across the forest to determine the overall importance for each predictor.
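To make the permutation idea concrete, here is a simplified sketch applied to model1. It scores on the full training set rather than on each tree’s out-of-bag sample (which is what randomForest itself does), so the number will not match rfImp1 exactly, but it illustrates the principle.

set.seed(200)
# baseline error of the fitted forest on the training data
baseline_mse <- mean((predict(model1, simulated) - simulated$y)^2)
# permute V1 to break its relationship with y, then re-score
permuted <- simulated
permuted$V1 <- sample(permuted$V1)
permuted_mse <- mean((predict(model1, permuted) - simulated$y)^2)
# the increase in error is a crude importance measure for V1
permuted_mse - baseline_mse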

Unfortunately, these methods showed that the correlations between predictors can have a significant impact on the importance values. For example, uninformative predictors with high correlations to informative predictors have abnormally large importance values. In some cases, their importance was greater than or equal to weakly important variables. They also demonstrated that the \(m_{try}\) tuning parameter has a serious effect on the importance values.

Another impact of between-predictor correlations is to dilute the importances of key predictors. For example, suppose a critical predictor had an importance of X. If another predictor is just as critical but is almost perfectly correlated with the first, the importance of these two predictors will be roughly X/2 each.

Let us evaluate the correlations between the predictor variables.

library(corrplot)
## corrplot 0.84 loaded
a <- cor(simulated[,1:10])
a
##               V1          V2           V3           V4          V5
## V1   1.000000000 0.089816473 -0.001873298  0.050332672  0.02107599
## V2   0.089816473 1.000000000  0.034013427  0.002802895  0.14764291
## V3  -0.001873298 0.034013427  1.000000000 -0.042024494  0.05408663
## V4   0.050332672 0.002802895 -0.042024494  1.000000000 -0.02206722
## V5   0.021075993 0.147642909  0.054086632 -0.022067217  1.00000000
## V6   0.144762406 0.074387220  0.132125076  0.045085240 -0.01534334
## V7  -0.037410293 0.006573036  0.003555705 -0.080802962 -0.09186847
## V8  -0.134497500 0.114404202 -0.082519941  0.026964742 -0.07242580
## V9  -0.019268646 0.068952243  0.155301002  0.096962108 -0.03853264
## V10 -0.081282284 0.047265415  0.029153287 -0.004946379 -0.02985258
##              V6           V7          V8          V9          V10
## V1   0.14476241 -0.037410293 -0.13449750 -0.01926865 -0.081282284
## V2   0.07438722  0.006573036  0.11440420  0.06895224  0.047265415
## V3   0.13212508  0.003555705 -0.08251994  0.15530100  0.029153287
## V4   0.04508524 -0.080802962  0.02696474  0.09696211 -0.004946379
## V5  -0.01534334 -0.091868466 -0.07242580 -0.03853264 -0.029852578
## V6   1.00000000 -0.056462537 -0.18690846  0.06565457 -0.128188393
## V7  -0.05646254  1.000000000  0.07225545 -0.04132251  0.044428186
## V8  -0.18690846  0.072255451  1.00000000 -0.01170511  0.062549483
## V9   0.06565457 -0.041322515 -0.01170511  1.00000000  0.072239311
## V10 -0.12818839  0.044428186  0.06254948  0.07223931  1.000000000
corrplot(a, type = "lower")

There do not appear to be any strong correlations among the predictors, which helps avoid the issues listed above. Let us plot the variable importance.

varImpPlot(model1, type = 2)

The variable importance plot above confirms that the random forest model does not consider V6 - V10 as important.

B. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

There is ~94% correlation with V1.

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

model2
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 7.063163
##                     % Var explained: 71.03
rfImp2
##                Overall
## V1          6.29780744
## V2          6.08038134
## V3          0.58410718
## V4          6.93924427
## V5          2.03104094
## V6          0.07947642
## V7         -0.02566414
## V8         -0.11007435
## V9         -0.08839463
## V10        -0.00715093
## duplicate1  3.56411581
varImpPlot(model2, type = 2)

V1’s variable importance score decreased substantially, with part of its importance shifting to duplicate1. As mentioned in part A, between-predictor correlations dilute the importance of key predictors, and that is exactly what happened to V1 here. Adding yet another predictor that is highly correlated with V1 would dilute V1’s importance even further, spreading it across the group of correlated predictors.

C. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

According to “Applied Predictive Modeling”, the conditional inference tree is a unified framework for unbiased tree-based models for regression and classification. In this model, statistical hypothesis tests are used to do an exhaustive search across the predictors and their possible split points. For a candidate split, a statistical test is used to evaluate the difference between the means of the two groups created by the split, and a p-value can be computed for the test.

Utilizing the test statistic p-value has several advantages. First, predictors that are on disparate scales can be compared, since the p-values are on the same scale. Second, multiple comparison corrections can be applied to the raw p-values within a predictor to reduce the bias resulting from a large number of split candidates. These corrections attempt to reduce the number of false-positive test results incurred by conducting a large number of statistical hypothesis tests. Thus, predictors are increasingly penalized by multiple comparison procedures as the number of splits increases.

This algorithm does not use pruning; as the data are further split, the decrease in the number of samples reduces the power of the hypothesis tests. This results in higher p-values and a lower likelihood of a new split, which in turn limits over-fitting.
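As a small illustration of this framework (a sketch, assuming the party package), a single conditional inference tree can be fit and printed; each inner node in the printed tree reports the variable that was selected along with the test statistic and criterion used to choose the split.

library(party)
# one conditional inference tree on the simulated data (test-based splits)
single_ctree <- ctree(y ~ ., data = simulated)
single_ctree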

Let us see how the cforest and varimp functions perform in this scenario.

simulated <- subset(simulated, select = -duplicate1)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
model3 <- cforest(y ~ ., data = simulated)
b <- varimp(model3)
b
##           V1           V2           V3           V4           V5 
##  8.435713169  6.722758384  0.027508376  8.193627405  2.060390994 
##           V6           V7           V8           V9          V10 
##  0.007814551  0.019529045 -0.039020603  0.013479110 -0.012547405
library(ggplot2)
library(data.table)

b <- as.data.frame(b)
setDT(b, keep.rownames = TRUE)
names(b)[names(b) == "b"] <- "VariableImportance"
names(b)[names(b) == "rn"] <- "Variables"

ggplot(b, aes(Variables, VariableImportance)) + geom_col() + coord_flip()

Interestingly, the cforest conditional inference forest shows a similar variable importance pattern to the traditional random forest model: V1, V2, V4, and V5 dominate, while V6–V10 contribute essentially nothing.
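The varimp function also accepts a conditional argument, which toggles to the modified importance measure described in Strobl et al. (2007). A minimal sketch (the conditional version can be considerably slower to compute):

# conditional permutation importance of Strobl et al. (2007)
b_conditional <- varimp(model3, conditional = TRUE)
b_conditional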

D. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

According to StatSoft, the general idea of boosting is to compute a sequence of (very) simple trees, where each successive tree is built on the prediction residuals of the preceding tree.
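To make that idea concrete, here is a toy, hand-rolled sketch using shallow rpart trees on the simulated data from 8.1. This only illustrates the principle; it is not what gbm does internally in full detail.

library(rpart)
shrinkage  <- 0.1
# start from the mean of the response
boost_pred <- rep(mean(simulated$y), nrow(simulated))
for (i in 1:50) {
  # fit a shallow tree to the current residuals
  resid_df <- cbind(simulated[, 1:10], r = simulated$y - boost_pred)
  stump    <- rpart(r ~ ., data = resid_df, maxdepth = 2)
  # add a shrunken fraction of its predictions to the running prediction
  boost_pred <- boost_pred + shrinkage * predict(stump, resid_df)
}
# rough in-sample fit of the toy ensemble
cor(boost_pred, simulated$y)^2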

Let us build a boosted tree model.

# Boosted Trees
library(gbm)
## Loaded gbm 2.1.5
set.seed(100)
model4 <- gbm(y ~ ., data = simulated, distribution = "gaussian")

model4
## gbm(formula = y ~ ., distribution = "gaussian", data = simulated)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 10 predictors of which 7 had non-zero influence.
summary(model4)

##     var    rel.inf
## V4   V4 29.7926789
## V1   V1 26.0431191
## V2   V2 23.7400606
## V5   V5 11.1100184
## V3   V3  8.7680640
## V6   V6  0.3369801
## V8   V8  0.2090789
## V7   V7  0.0000000
## V9   V9  0.0000000
## V10 V10  0.0000000

According to “Applied Predictive Modeling”, Cubist is a rule-based model that is an amalgamation of several methodologies published some time ago, but it has continued to evolve. It differs from the previously described model trees and their rule-based variants in a few specific ways: the techniques used for linear model smoothing, rule creation, and pruning are different; an optional boosting-like procedure called committees can be used; and the predictions generated by the model rules can be adjusted using nearby points (neighbors) from the training set.
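A brief hedged sketch of those last two features, using the simulated data (committees builds an ensemble of rule-based models, and neighbors applies an instance-based correction at prediction time):

library(Cubist)
# a committee model with 5 members
cubist_cmt <- cubist(simulated[, -11], simulated$y, committees = 5)
# predictions adjusted using the 5 nearest training-set neighbors
head(predict(cubist_cmt, simulated[, -11], neighbors = 5))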

Let’s see how the Cubist model works.

# Cubist
library(Cubist)
model5 <- cubist(simulated[,-11], simulated$y)
model5
## 
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
## 
## Number of samples: 200 
## Number of predictors: 10 
## 
## Number of committees: 1 
## Number of rules: 1
summary(model5)
## 
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sat Apr 27 21:12:11 2019
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (11 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
## 
##  outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           2.224012
##     Relative |error|               0.55
##     Correlation coefficient        0.84
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    V1
##           100%    V2
##           100%    V4
##           100%    V5
## 
## 
## Time: 0.0 secs

Interestingly, both the boosted tree and Cubist models rely on the same highly important variables as the previous models (the only difference being that Cubist does not use V3).

8.2

Use a simulation to show tree bias with different granularities.

According to the textbook, single regression trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (those with fewer distinct values). According to Loh and Shih, “the danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce a tree with misleading structure or no tree at all.”

Therefore, for this simulation, we will create two predictors, X1 and X2, each with 200 values. X1 is a granular predictor with only two distinct values that is genuinely related to the response, whereas X2 is a noise predictor with many distinct values (drawn from a normal distribution) and no relationship to the response.

library(rpart)
library(caret)
set.seed(100)
X1 <- rep(1:2,each=100)
X2 <- rnorm(200,mean=0,sd=2)
Y <- X1 + rnorm(200,mean=0,sd=4)

df1 <- data.frame(Y=Y, X1=X1, X2=X2)

mod <- rpart(Y ~ ., data = df1)
varImp(mod)
##      Overall
## X1 0.4226645
## X2 0.8863903

As we can see, even though X1 is the predictor that is truly related to the response, the noise variable X2, which has many more distinct values (and therefore many more potential split points), receives the higher importance score. This illustrates the selection bias of single regression trees toward predictors with more distinct values.

8.3

In stochastic gradient boosting, the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.

A. Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

According to “Applied Predictive Modeling”, these tree models can be susceptible to over-fitting. Despite using weak learners, boosting still employs the greedy strategy of choosing the optimal weak learner at each stage. Although this strategy generates an optimal solution at the current stage, it has the drawbacks of not finding the optimal global model as well as over-fitting the training data. A remedy is to constrain the learning process by employing regularization, or shrinkage. Instead of adding the full predicted value for a sample to the previous iteration’s predicted value, only a fraction of the current predicted value is added. This fraction is called the learning rate, a parameter that takes values between 0 and 1 and becomes a tuning parameter for the model. Small values of the learning rate (< 0.01) tend to work best, but they are more computationally expensive.

Likewise, the bagging fraction is another tuning parameter for the model. It borrows from the bagging technique: the random sampling nature of bagging reduces prediction variance. Here, only a randomly selected fraction of the training data is used to construct each tree; Friedman suggests using a bagging fraction of around 0.5.

Therefore, looking at Fig. 8.24, setting both the bagging fraction and the learning rate to 0.9 pushes the model back toward the “greedy” strategy described above: each tree is built on nearly all of the data and contributes a large fraction of its prediction, so the same few strong predictors are chosen over and over, concentrating the importance on just those predictors (and increasing the risk of over-fitting and higher prediction variance). With both parameters set to 0.1, each tree sees only a small random sample of the data and contributes only a small amount, so more predictors have a chance to enter the model and the importance is spread across many more of them.
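To connect these two parameters to code, a small hedged sketch (with the simulated data from 8.1 standing in for the solubility data): gbm exposes the learning rate as shrinkage and the bagging fraction as bag.fraction.

library(gbm)
set.seed(100)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9)
# compare how the relative influence is distributed under the two settings
head(summary(gbm_low, plotit = FALSE))
head(summary(gbm_high, plotit = FALSE))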

B. Which model do you think would be more predictive of other samples?

As noted in the textbook, boosting models that are less prone to over-fitting tend to generalize better and therefore be more predictive of new samples. In other words, the model on the left (bagging fraction and learning rate of 0.1) should be more predictive of other samples. It also draws on more predictors, whereas the model on the right relies on only a few.

C. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

According to the textbook, variable importance for boosting is a function of the reduction in squared error. Specifically, the improvement in squared error due to each predictor is summed within each tree in the ensemble (i.e., each predictor gets an improvement value for each tree). The improvement values for each predictor are then averaged across the entire ensemble to yield an overall importance value.

Because of the way variable importance is calculated, increasing the interaction depth allows many more predictors to appear in each tree and therefore to enter the importance calculation. This spreads the reduction in squared error across more predictors, so the slope of the predictor importance plot would flatten for either model.
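A hedged sketch of the effect (again using the 8.1 simulated data as a stand-in): interaction.depth controls how many splits, and hence how many predictors, each tree can use.

library(gbm)
set.seed(100)
gbm_depth1  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                   n.trees = 500, interaction.depth = 1)
gbm_depth10 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                   n.trees = 500, interaction.depth = 10)
# compare the relative influence rankings for the two depths
summary(gbm_depth1, plotit = FALSE)
summary(gbm_depth10, plotit = FALSE)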

8.7

Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

# Loading the Chemical Manufacturing Process Dataset
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")

# Loading the DMwR library for KNN imputation
library(DMwR)
df <- knnImputation(ChemicalManufacturingProcess[, 1:57], k = 3, meth = "weighAvg")

# Removing zero or near zero variables
near_zero <- nearZeroVar(df)
df <- df[,-near_zero]

# Standardizing and scaling the predictors
df[,2:(ncol(df))] <- scale(df[,2:(ncol(df))])

# Splitting the data into training and testing data sets.
set.seed(1)
inTraining <- createDataPartition(df$Yield, p = 0.80, list=FALSE)
training <- df[ inTraining,]
testing <- df[-inTraining,]

X_train <- training[,2:(length(training))]
Y_train <- training$Yield

X_test <- testing[,2:(length(testing))]
Y_test <- testing$Yield

# Now the data is successfully imputed, cleaned, pre-processed and split
head(df)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00           -0.2261036           -1.5140979          -2.68303622
## 2 42.44            2.2391498            1.3089960          -0.05623504
## 3 42.03            2.2391498            1.3089960          -0.05623504
## 4 41.42            2.2391498            1.3089960          -0.05623504
## 5 42.49            1.4827653            1.8939391           1.13594780
## 6 43.57           -0.4081962            0.6620886          -0.59859075
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1            0.2201765            0.4941942           -1.3828880
## 2            1.2964386            0.4128555            1.1290767
## 3            1.2964386            0.4128555            1.1290767
## 4            1.2964386            0.4128555            1.1290767
## 5            0.9414412           -0.3734185            1.5348350
## 6            1.5894524            1.7305423            0.6192092
##   BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1            -1.233131           -3.3962895            1.1005296
## 2             2.282619           -0.7227225            1.1005296
## 3             2.282619           -0.7227225            1.1005296
## 4             2.282619           -0.7227225            1.1005296
## 5             1.071310           -0.1205678            0.4162193
## 6             1.189487           -1.7343424            1.6346255
##   BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1            -1.838655           -1.7709224              0.2006342
## 2             1.393395            1.0989855             -6.1677821
## 3             1.393395            1.0989855             -6.1677821
## 4             1.393395            1.0989855             -6.1677821
## 5             0.136256            1.0989855             -0.2803476
## 6             1.022062            0.7240877              0.4349481
##   ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1              0.5551856              0.3045192              0.2349609
## 2             -1.9906511              0.5708010             -2.3747305
## 3             -1.9906511              0.1938106             -3.1737741
## 4             -1.9906511              0.5427779             -3.3335828
## 5             -1.9906511              0.4262861             -2.2149218
## 6             -1.9906511              0.4262861             -1.2560694
##   ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1            -0.43413255             -0.4693269             -0.4256500
## 2             1.00413856              0.9730078             -0.9578363
## 3             0.06509023             -0.1046607              1.0427233
## 4             0.42626267              2.1993201             -0.9578363
## 5             0.84981943             -0.6249144              1.0427233
## 6             0.49849715              0.5642370              1.0427233
##   ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1             -0.5761169             -1.7201524            -0.29888818
## 2              0.8991659              0.5883746             0.48941917
## 3              0.8991659             -0.3815947             0.19393012
## 4             -1.1108074             -0.4785917             0.09437624
## 5              0.8991659             -0.4527258            -0.37534711
## 6              0.8991659             -0.2199332            -0.32443045
##   ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1             -0.3232328             -0.4790178             0.97711512
## 2              1.1432407             -0.4790178            -0.50030980
## 3              0.2640909             -0.4790178             0.28765016
## 4              1.0545956             -0.4790178             0.28765016
## 5              0.5336292             -0.4790178             0.09066017
## 6              0.5765004             -0.4790178            -0.50030980
##   ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1              0.8141279              1.1846438              0.3303945
## 2              0.2811695              0.9617071              0.1455765
## 3              0.4465704              0.8245152              0.1455765
## 4              0.7957500              1.0817499              0.1967569
## 5              2.5416480              3.3282665              0.4754056
## 6              2.4130029              3.1396277              0.6261033
##   ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1              0.9263296              0.1505348              0.4563798
## 2             -0.2753953              0.1559773              1.5095063
## 3              0.3655246              0.1831898              1.0926437
## 4              0.3655246              0.1695836              0.9829430
## 5             -0.3555103              0.2076811              1.6192070
## 6             -0.7560852              0.1423710              1.9044287
##   ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1              0.3109942              0.2109804              0.3855865
## 2              0.1849230              0.2109804             -0.7262673
## 3              0.1849230              0.2109804             -0.4252906
## 4              0.1562704              0.2109804             -0.1243139
## 5              0.2938027             -0.6884239              0.7786162
## 6              0.3998171             -0.5599376              1.0795929
##   ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1              0.7815967              0.4717009              0.1216354
## 2             -1.8212444             -1.0109484              0.1107691
## 3             -1.2190925             -0.8381332              0.1868337
## 4             -0.6169407             -0.6653181              0.1732507
## 5              0.5873630              1.5812789              0.2764812
## 6             -1.2190925             -1.3565787              0.1162023
##   ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1              0.1274268              0.3500191              0.8103308
## 2              0.1994510              0.1923811              0.9052417
## 3              0.2190939              0.2124441              0.8862595
## 4              0.2081812              0.1923811              0.8862595
## 5              0.2954832              0.3471529              0.9242238
## 6              0.2452845              0.3557513              0.9432060
##   ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1              0.6071537              0.7568194             -0.2010964
## 2              0.8509759              0.7568194             -0.2741327
## 3              0.7900203              0.2379678             -0.1645783
## 4              0.7900203              0.2379678             -0.1645783
## 5              0.9728869             -0.1771134             -0.1463192
## 6              1.0338425              0.9643601             -0.3654280
##   ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1             -0.4568829              0.9999809             -1.7011967
## 2              1.9517531              0.9999809              1.9880597
## 3              2.6928719              0.9999809              1.9880597
## 4              2.3223125              1.8158588              0.1434315
## 5              2.3223125              2.6317367              0.1434315
## 6              2.6928719              2.6317367              0.1434315
##   ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1            -0.87382216             -0.6515638             -1.1540243
## 2             1.17282515             -0.6515638              2.2161351
## 3             1.26585457             -1.8089817             -0.7046697
## 4             0.05647207             -1.8089817              0.4187168
## 5            -2.54835178             -2.9663997             -1.8280562
## 6            -0.50170447             -1.8089817             -1.3787016
##   ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1              0.7174727              0.2317270             -0.4610622
## 2             -0.8224687              0.2317270              2.1565813
## 3             -0.8224687              0.2317270             -0.4610622
## 4             -0.8224687              0.2317270             -0.4610622
## 5             -0.8224687              0.2981503             -0.4610622
## 6             -0.8224687              0.2317270             -0.4610622
##   ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1             -0.4390981             0.20279570             2.40564734
## 2              2.3542004            -0.05472265            -0.01374656
## 3             -0.4390981             0.40881037             0.10146268
## 4             -0.4390981            -0.31224099             0.21667191
## 5             -0.4390981            -0.10622632             0.21667191
## 6             -0.4390981             0.15129203             1.48397347
##   ManufacturingProcess44
## 1            -0.01588055
## 2             0.29467248
## 3            -0.01588055
## 4            -0.01588055
## 5            -0.32643359
## 6            -0.01588055

We will use the test set RMSE to determine the optimal tree-based regression model.

A. Which tree-based regression model gives the optimal resampling and test set performance?

We will implement several of the tree-based regression models discussed in the textbook, ranging from a single tree to more sophisticated models such as boosted trees and Cubist. Because we are comparing different trees, we will use the RMSE on the test set to choose the optimal model. Each model has its own pros and cons, and we will soon find out which one performs best on this particular data set.

Single Tree

library(caret)
library(rpart)
set.seed(1)
rpartTune <- train(X_train, Y_train, method = "rpart2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
rpartTune
## CART 
## 
## 144 samples
##  55 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 130, 131, 129, 130, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE      
##    1        1.308352  0.5130147  1.0569095
##    2        1.352595  0.5029501  1.1032161
##    3        1.330961  0.5208352  1.0847034
##    4        1.295385  0.5340336  1.0397327
##    5        1.310353  0.5120659  1.0278080
##    6        1.324231  0.4993690  1.0023312
##    7        1.321880  0.5002141  0.9965908
##    8        1.314831  0.5017745  1.0038798
##    9        1.315171  0.5110080  0.9942259
##   10        1.287015  0.5330928  0.9692818
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 10.
# Checking RMSE for single tree based model
y_pred <- predict(rpartTune, X_test)
RMSE(y_pred, Y_test)
## [1] 1.746993

Bagged Trees

library(ipred)
library(party)
set.seed(1)
train_df <- cbind(X_train, Y_train)

baggedTree <- bagging(Y_train ~ ., data = train_df)
baggedTree
## 
## Bagging regression trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Y_train ~ ., data = train_df)
summary(baggedTree)
##        Length Class      Mode   
## y      144    -none-     numeric
## X       55    data.frame list   
## mtrees  25    -none-     list   
## OOB      1    -none-     logical
## comb     1    -none-     logical
## call     3    -none-     call
# Checking RMSE for bagged tree based model
y_pred <- predict(baggedTree, X_test)
RMSE(y_pred, Y_test)
## [1] 1.574695

Random Forest

library(randomForest)
# Note: "ntress" below is a misspelling of "ntree"; randomForest ignores the
# unrecognized argument, so the default of 500 trees was used (see the output).
rfModel <- randomForest(X_train, Y_train,
                        importance = TRUE,
                        ntress = 1000)
rfModel
## 
## Call:
##  randomForest(x = X_train, y = Y_train, importance = TRUE, ntress = 1000) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 18
## 
##           Mean of squared residuals: 1.384861
##                     % Var explained: 58.85
# Checking RMSE for random forest based model
y_pred <- predict(rfModel, X_test)
RMSE(y_pred, Y_test)
## [1] 1.213179

Boosted Trees

library(gbm)
gbmModel <- gbm.fit(X_train, Y_train, distribution = "gaussian")
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        3.3621             nan     0.0010    0.0024
##      2        3.3590             nan     0.0010    0.0032
##      3        3.3556             nan     0.0010    0.0033
##      4        3.3520             nan     0.0010    0.0030
##      5        3.3487             nan     0.0010    0.0031
##      6        3.3456             nan     0.0010    0.0031
##      7        3.3426             nan     0.0010    0.0027
##      8        3.3398             nan     0.0010    0.0027
##      9        3.3366             nan     0.0010    0.0032
##     10        3.3332             nan     0.0010    0.0031
##     20        3.3018             nan     0.0010    0.0029
##     40        3.2454             nan     0.0010    0.0022
##     60        3.1882             nan     0.0010    0.0030
##     80        3.1346             nan     0.0010    0.0027
##    100        3.0804             nan     0.0010    0.0026
gbmModel
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 55 predictors of which 1 had non-zero influence.
# Checking RMSE for boosted tree based model
y_pred <- predict(gbmModel, X_test, n.trees=100)
RMSE(y_pred, Y_test)
## [1] 1.826119

Cubist

library(Cubist)
cubistMod <- cubist(X_train, Y_train)
cubistMod
## 
## Call:
## cubist.default(x = X_train, y = Y_train)
## 
## Number of samples: 144 
## Number of predictors: 55 
## 
## Number of committees: 1 
## Number of rules: 1
# Checking RMSE for the Cubist model
y_pred <- predict(cubistMod, X_test)
RMSE(y_pred, Y_test)
## [1] 2.043035
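For easier comparison, the test set RMSEs computed above can be collected into a single sorted vector (a small convenience sketch using the models already fit):

rmse_results <- c(
  SingleTree   = RMSE(predict(rpartTune, X_test), Y_test),
  BaggedTree   = RMSE(predict(baggedTree, X_test), Y_test),
  RandomForest = RMSE(predict(rfModel, X_test), Y_test),
  BoostedTree  = RMSE(predict(gbmModel, X_test, n.trees = 100), Y_test),
  Cubist       = RMSE(predict(cubistMod, X_test), Y_test)
)
sort(rmse_results)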

It appears that the random forest had the smallest test set RMSE, so we will choose the random forest as the optimal model. (Note that the boosted tree and Cubist models were fit with default settings here, so their performance would likely improve with tuning.)

B. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

We will be using the random forest model.

a <- varImp(rfModel)
a$Variables <- rownames(a)
rownames(a) <- 1:nrow(a)

head(a[order(a$Overall, decreasing = TRUE),],10)
##      Overall              Variables
## 43 24.999668 ManufacturingProcess32
## 6  10.727534   BiologicalMaterial06
## 11  9.366823   BiologicalMaterial12
## 3   8.571881   BiologicalMaterial03
## 42  7.813420 ManufacturingProcess31
## 28  7.570996 ManufacturingProcess17
## 10  7.289205   BiologicalMaterial11
## 47  6.836107 ManufacturingProcess36
## 2   6.806431   BiologicalMaterial02
## 4   6.511863   BiologicalMaterial04

There are 6 biological material predictors and 4 manufacturing process predictors in the top 10, so the list is fairly mixed, with the biological materials slightly outnumbering the process variables.

Compared to the optimal linear model “Lasso Model” from my previous [homework assignment](http://www.rpubs.com/jcp9010/483992):

library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
lassoModel <- enet(x = as.matrix(X_train), y = Y_train,
                   lambda = 0.0, normalize = TRUE)

lassoCoef <- predict(lassoModel, newx = as.matrix(X_test),
                     s=.1, mode = "fraction", type = "coefficients")

list_coef <- lassoCoef$coefficients
head(sort(list_coef[list_coef != 0], decreasing = TRUE),10)
## ManufacturingProcess32 ManufacturingProcess09 ManufacturingProcess04 
##              1.0170398              0.6084445              0.2672812 
##   BiologicalMaterial05   BiologicalMaterial06 ManufacturingProcess06 
##              0.2310542              0.1874771              0.1535055 
## ManufacturingProcess43 ManufacturingProcess39 ManufacturingProcess34 
##              0.1404246              0.1357536              0.1202239 
## ManufacturingProcess42 
##              0.1178699

The top 10 lists are for the most part different between the linear model and the random forest model, though two variables are consistent across both: ManufacturingProcess32 and BiologicalMaterial06.
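The overlap between the two top 10 lists can also be computed directly (a quick sketch using the objects created above):

rf_top10    <- head(a[order(a$Overall, decreasing = TRUE), "Variables"], 10)
lasso_top10 <- names(head(sort(list_coef[list_coef != 0], decreasing = TRUE), 10))
intersect(rf_top10, lasso_top10)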

The optimal nonlinear model “Support Vector Regression” from my previous [homework assignment](http://www.rpubs.com/jcp9010/486614):

set.seed(1)
svmRTuned1 <- train(x = X_train, 
                   y = Y_train,
                   method = "svmRadial",
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))
varImp(svmRTuned1)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 55)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     84.75
## ManufacturingProcess36   71.84
## BiologicalMaterial03     70.69
## ManufacturingProcess13   68.69
## BiologicalMaterial02     64.16
## ManufacturingProcess31   63.17
## BiologicalMaterial12     60.11
## ManufacturingProcess17   57.19
## ManufacturingProcess09   55.42
## ManufacturingProcess33   54.25
## BiologicalMaterial04     51.35
## ManufacturingProcess29   47.17
## BiologicalMaterial11     46.74
## ManufacturingProcess06   46.73
## BiologicalMaterial01     41.19
## BiologicalMaterial08     38.75
## ManufacturingProcess11   29.49
## ManufacturingProcess26   29.35
## BiologicalMaterial09     26.01

Out of the top 10, ManufacturingProcess32 and BiologicalMaterial06 are once again at the top. Overall, however, the SVM model’s top 10 is much closer to the random forest’s list than to the linear model’s.

C. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

# rtree
# Reference: http://www.milbo.org/rpart-plot/prp.pdf
library(rpart.plot)

rpartTree <- rpart(Y_train ~ ., data = train_df)
summary(rpartTree)
## Call:
## rpart(formula = Y_train ~ ., data = train_df)
##   n= 144 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.47674176      0 1.0000000 1.0170264 0.11797821
## 2  0.07019415      1 0.5232582 0.5464058 0.07156368
## 3  0.06127500      2 0.4530641 0.5905480 0.07422087
## 4  0.04999254      3 0.3917891 0.5778638 0.07326127
## 5  0.02805096      4 0.3417965 0.5981311 0.08011925
## 6  0.02550828      5 0.3137456 0.6025261 0.08043490
## 7  0.02456980      6 0.2882373 0.5940701 0.08036269
## 8  0.02156477      7 0.2636675 0.5738613 0.07914386
## 9  0.01294122      8 0.2421027 0.5533322 0.08158744
## 10 0.01259667      9 0.2291615 0.5337561 0.07629500
## 11 0.01102070     10 0.2165649 0.5295310 0.07638671
## 12 0.01000000     11 0.2055442 0.5266983 0.07647938
## 
## Variable importance
## ManufacturingProcess32   BiologicalMaterial02   BiologicalMaterial03 
##                     17                     12                     11 
## ManufacturingProcess33 ManufacturingProcess36   BiologicalMaterial04 
##                     11                     10                      8 
##   BiologicalMaterial06   BiologicalMaterial11   BiologicalMaterial12 
##                      5                      4                      3 
##   BiologicalMaterial05 ManufacturingProcess09   BiologicalMaterial01 
##                      3                      2                      2 
##   BiologicalMaterial08 ManufacturingProcess25 ManufacturingProcess18 
##                      2                      2                      1 
## ManufacturingProcess17 ManufacturingProcess30 ManufacturingProcess11 
##                      1                      1                      1 
## ManufacturingProcess01   BiologicalMaterial10 
##                      1                      1 
## 
## Node number 1: 144 observations,    complexity param=0.4767418
##   mean=40.18903, MSE=3.365156 
##   left son=2 (85 obs) right son=3 (59 obs)
##   Primary splits:
##       ManufacturingProcess32 < 0.191596      to the left,  improve=0.4767418, (0 missing)
##       ManufacturingProcess36 < -0.5116931    to the right, improve=0.2569147, (0 missing)
##       BiologicalMaterial06   < -0.4405676    to the left,  improve=0.2538506, (0 missing)
##       BiologicalMaterial03   < -0.8735176    to the left,  improve=0.2355991, (0 missing)
##       ManufacturingProcess31 < 0.04540095    to the right, improve=0.2334344, (0 missing)
##   Surrogate splits:
##       ManufacturingProcess33 < 0.2272012     to the left,  agree=0.847, adj=0.627, (0 split)
##       ManufacturingProcess36 < -0.5116931    to the right, agree=0.833, adj=0.593, (0 split)
##       BiologicalMaterial02   < 0.0201384     to the left,  agree=0.757, adj=0.407, (0 split)
##       BiologicalMaterial03   < 0.3736506     to the left,  agree=0.757, adj=0.407, (0 split)
##       BiologicalMaterial04   < 0.222994      to the left,  agree=0.757, adj=0.407, (0 split)
## 
## Node number 2: 85 observations,    complexity param=0.061275
##   mean=39.13376, MSE=1.561656 
##   left son=4 (32 obs) right son=5 (53 obs)
##   Primary splits:
##       BiologicalMaterial12   < -0.6009825    to the left,  improve=0.2236900, (0 missing)
##       BiologicalMaterial11   < -0.3896263    to the left,  improve=0.2182804, (0 missing)
##       ManufacturingProcess17 < -0.6359127    to the right, improve=0.1888019, (0 missing)
##       BiologicalMaterial06   < -0.8783595    to the left,  improve=0.1704131, (0 missing)
##       ManufacturingProcess06 < -0.4019485    to the left,  improve=0.1625716, (0 missing)
##   Surrogate splits:
##       BiologicalMaterial06 < -0.8783595    to the left,  agree=0.918, adj=0.781, (0 split)
##       BiologicalMaterial03 < -0.8735176    to the left,  agree=0.906, adj=0.750, (0 split)
##       BiologicalMaterial02 < -0.8833012    to the left,  agree=0.894, adj=0.719, (0 split)
##       BiologicalMaterial11 < -0.7391772    to the left,  agree=0.882, adj=0.687, (0 split)
##       BiologicalMaterial08 < -0.4797562    to the left,  agree=0.871, adj=0.656, (0 split)
## 
## Node number 3: 59 observations,    complexity param=0.07019415
##   mean=41.70932, MSE=2.047813 
##   left son=6 (34 obs) right son=7 (25 obs)
##   Primary splits:
##       BiologicalMaterial06   < 0.7206488     to the left,  improve=0.2815310, (0 missing)
##       BiologicalMaterial03   < 1.020978      to the left,  improve=0.2716172, (0 missing)
##       ManufacturingProcess31 < 0.1001782     to the right, improve=0.2289845, (0 missing)
##       BiologicalMaterial02   < 0.7575137     to the left,  improve=0.2230087, (0 missing)
##       BiologicalMaterial05   < 0.1444378     to the left,  improve=0.1683230, (0 missing)
##   Surrogate splits:
##       BiologicalMaterial02 < 0.7575137     to the left,  agree=0.966, adj=0.92, (0 split)
##       BiologicalMaterial01 < 0.6913629     to the left,  agree=0.831, adj=0.60, (0 split)
##       BiologicalMaterial03 < 1.020978      to the left,  agree=0.814, adj=0.56, (0 split)
##       BiologicalMaterial05 < 0.21222       to the left,  agree=0.814, adj=0.56, (0 split)
##       BiologicalMaterial11 < 0.6942929     to the left,  agree=0.780, adj=0.48, (0 split)
## 
## Node number 4: 32 observations,    complexity param=0.02805096
##   mean=38.37312, MSE=1.141753 
##   left son=8 (7 obs) right son=9 (25 obs)
##   Primary splits:
##       BiologicalMaterial05   < 0.5429974     to the right, improve=0.3720432, (0 missing)
##       BiologicalMaterial01   < -0.9614775    to the right, improve=0.2317783, (0 missing)
##       ManufacturingProcess24 < 0.5080444     to the left,  improve=0.2200246, (0 missing)
##       ManufacturingProcess34 < -0.7788826    to the left,  improve=0.2109260, (0 missing)
##       BiologicalMaterial10   < -0.802187     to the right, improve=0.1861732, (0 missing)
##   Surrogate splits:
##       BiologicalMaterial12   < -0.6203737    to the right, agree=0.844, adj=0.286, (0 split)
##       ManufacturingProcess06 < -0.9965242    to the left,  agree=0.844, adj=0.286, (0 split)
##       ManufacturingProcess17 < 1.006445      to the right, agree=0.844, adj=0.286, (0 split)
##       ManufacturingProcess18 < 0.233533      to the right, agree=0.844, adj=0.286, (0 split)
##       ManufacturingProcess19 < 1.092644      to the right, agree=0.844, adj=0.286, (0 split)
## 
## Node number 5: 53 observations,    complexity param=0.04999254
##   mean=39.59302, MSE=1.254942 
##   left son=10 (39 obs) right son=11 (14 obs)
##   Primary splits:
##       ManufacturingProcess25 < -0.0006111902 to the right, improve=0.3642281, (0 missing)
##       ManufacturingProcess17 < -0.5958552    to the right, improve=0.3536997, (0 missing)
##       ManufacturingProcess18 < 0.03352089    to the right, improve=0.3004055, (0 missing)
##       ManufacturingProcess11 < -0.2181852    to the left,  improve=0.2952740, (0 missing)
##       ManufacturingProcess30 < 0.3152981     to the left,  improve=0.2836663, (0 missing)
##   Surrogate splits:
##       ManufacturingProcess30 < 0.3152981     to the left,  agree=0.887, adj=0.571, (0 split)
##       ManufacturingProcess18 < 0.01311149    to the right, agree=0.868, adj=0.500, (0 split)
##       ManufacturingProcess09 < 0.3652816     to the left,  agree=0.849, adj=0.429, (0 split)
##       ManufacturingProcess11 < 0.7624321     to the left,  agree=0.849, adj=0.429, (0 split)
##       ManufacturingProcess17 < -0.7961427    to the right, agree=0.830, adj=0.357, (0 split)
## 
## Node number 6: 34 observations,    complexity param=0.02550828
##   mean=41.05824, MSE=1.543815 
##   left son=12 (11 obs) right son=13 (23 obs)
##   Primary splits:
##       ManufacturingProcess01 < -0.2803476    to the left,  improve=0.2354913, (0 missing)
##       BiologicalMaterial01   < -0.03700745   to the right, improve=0.1936169, (0 missing)
##       ManufacturingProcess06 < 0.3970126     to the left,  improve=0.1813063, (0 missing)
##       ManufacturingProcess43 < -0.1865604    to the right, improve=0.1709492, (0 missing)
##       BiologicalMaterial02   < -0.6366833    to the right, improve=0.1571672, (0 missing)
##   Surrogate splits:
##       BiologicalMaterial11   < 0.1767085     to the right, agree=0.794, adj=0.364, (0 split)
##       ManufacturingProcess20 < -0.08297846   to the left,  agree=0.794, adj=0.364, (0 split)
##       ManufacturingProcess27 < -0.08850105   to the left,  agree=0.794, adj=0.364, (0 split)
##       BiologicalMaterial01   < 0.593313      to the right, agree=0.765, adj=0.273, (0 split)
##       BiologicalMaterial03   < 0.9522467     to the right, agree=0.735, adj=0.182, (0 split)
## 
## Node number 7: 25 observations,    complexity param=0.0245698
##   mean=42.5948, MSE=1.372657 
##   left son=14 (18 obs) right son=15 (7 obs)
##   Primary splits:
##       ManufacturingProcess09 < 1.021628      to the left,  improve=0.3469503, (0 missing)
##       ManufacturingProcess21 < -0.3672081    to the right, improve=0.2725999, (0 missing)
##       BiologicalMaterial03   < 1.153443      to the left,  improve=0.2562389, (0 missing)
##       ManufacturingProcess17 < -0.4756828    to the right, improve=0.2253699, (0 missing)
##       ManufacturingProcess12 < 0.798363      to the left,  improve=0.1869572, (0 missing)
##   Surrogate splits:
##       BiologicalMaterial06   < 1.915234      to the left,  agree=0.84, adj=0.429, (0 split)
##       BiologicalMaterial05   < 1.798325      to the left,  agree=0.80, adj=0.286, (0 split)
##       BiologicalMaterial09   < -1.445308     to the right, agree=0.80, adj=0.286, (0 split)
##       ManufacturingProcess11 < 1.268035      to the left,  agree=0.80, adj=0.286, (0 split)
##       ManufacturingProcess17 < -1.997868     to the right, agree=0.80, adj=0.286, (0 split)
## 
## Node number 8: 7 observations
##   mean=37.14143, MSE=0.7487265 
## 
## Node number 9: 25 observations,    complexity param=0.01259667
##   mean=38.718, MSE=0.70808 
##   left son=18 (14 obs) right son=19 (11 obs)
##   Primary splits:
##       ManufacturingProcess02 < 0.5306407     to the right, improve=0.3448270, (0 missing)
##       BiologicalMaterial05   < 0.2556007     to the left,  improve=0.3129484, (0 missing)
##       ManufacturingProcess26 < 0.02812072    to the left,  improve=0.2353783, (0 missing)
##       ManufacturingProcess15 < -0.5474033    to the left,  improve=0.1915047, (0 missing)
##       ManufacturingProcess30 < -0.02145796   to the right, improve=0.1717073, (0 missing)
##   Surrogate splits:
##       ManufacturingProcess24 < 1.667686      to the left,  agree=0.84, adj=0.636, (0 split)
##       BiologicalMaterial10   < -0.802187     to the right, agree=0.80, adj=0.545, (0 split)
##       BiologicalMaterial05   < -0.0941557    to the left,  agree=0.76, adj=0.455, (0 split)
##       BiologicalMaterial06   < -1.070561     to the left,  agree=0.76, adj=0.455, (0 split)
##       BiologicalMaterial01   < -0.9334632    to the right, agree=0.72, adj=0.364, (0 split)
## 
## Node number 10: 39 observations,    complexity param=0.01294122
##   mean=39.18795, MSE=0.7695958 
##   left son=20 (11 obs) right son=21 (28 obs)
##   Primary splits:
##       BiologicalMaterial12   < -0.2777946    to the left,  improve=0.2089371, (0 missing)
##       ManufacturingProcess27 < 0.05400293    to the right, improve=0.1963069, (0 missing)
##       BiologicalMaterial11   < -0.2879764    to the left,  improve=0.1909250, (0 missing)
##       BiologicalMaterial04   < -0.2982324    to the left,  improve=0.1640964, (0 missing)
##       ManufacturingProcess42 < 0.1255402     to the right, improve=0.1591866, (0 missing)
##   Surrogate splits:
##       BiologicalMaterial02 < -0.5672832    to the left,  agree=0.897, adj=0.636, (0 split)
##       BiologicalMaterial06 < -0.4846137    to the left,  agree=0.872, adj=0.545, (0 split)
##       BiologicalMaterial03 < -0.3324116    to the left,  agree=0.846, adj=0.455, (0 split)
##       BiologicalMaterial08 < -0.420668     to the left,  agree=0.846, adj=0.455, (0 split)
##       BiologicalMaterial11 < -0.3087213    to the left,  agree=0.846, adj=0.455, (0 split)
## 
## Node number 11: 14 observations
##   mean=40.72143, MSE=0.8765837 
## 
## Node number 12: 11 observations
##   mean=40.18636, MSE=0.326405 
## 
## Node number 13: 23 observations,    complexity param=0.02156477
##   mean=41.47522, MSE=1.588625 
##   left son=26 (16 obs) right son=27 (7 obs)
##   Primary splits:
##       BiologicalMaterial04   < -0.1714476    to the right, improve=0.2859982, (0 missing)
##       ManufacturingProcess19 < 0.2369784     to the right, improve=0.2666275, (0 missing)
##       ManufacturingProcess17 < -0.03505031   to the right, improve=0.2281862, (0 missing)
##       ManufacturingProcess20 < 0.06458221    to the right, improve=0.2123225, (0 missing)
##       ManufacturingProcess30 < -0.2289986    to the left,  improve=0.2110616, (0 missing)
##   Surrogate splits:
##       BiologicalMaterial02   < -0.6862547    to the right, agree=0.826, adj=0.429, (0 split)
##       BiologicalMaterial10   < -0.3765793    to the right, agree=0.826, adj=0.429, (0 split)
##       ManufacturingProcess09 < 0.775902      to the left,  agree=0.826, adj=0.429, (0 split)
##       ManufacturingProcess12 < 0.798363      to the left,  agree=0.826, adj=0.429, (0 split)
##       BiologicalMaterial03   < -0.7610475    to the right, agree=0.783, adj=0.286, (0 split)
## 
## Node number 14: 18 observations
##   mean=42.16444, MSE=0.7041247 
## 
## Node number 15: 7 observations
##   mean=43.70143, MSE=1.390869 
## 
## Node number 18: 14 observations
##   mean=38.28, MSE=0.3700286 
## 
## Node number 19: 11 observations
##   mean=39.27545, MSE=0.5834066 
## 
## Node number 20: 11 observations
##   mean=38.54818, MSE=0.4510512 
## 
## Node number 21: 28 observations,    complexity param=0.0110207
##   mean=39.43929, MSE=0.6707709 
##   left son=42 (11 obs) right son=43 (17 obs)
##   Primary splits:
##       BiologicalMaterial12   < 0.6788414     to the right, improve=0.2843443, (0 missing)
##       BiologicalMaterial11   < 0.4194234     to the right, improve=0.2577418, (0 missing)
##       ManufacturingProcess19 < -0.2895849    to the right, improve=0.2419446, (0 missing)
##       ManufacturingProcess04 < 0.4219222     to the left,  improve=0.2340126, (0 missing)
##       ManufacturingProcess01 < 0.07730024    to the left,  improve=0.2327076, (0 missing)
##   Surrogate splits:
##       BiologicalMaterial11 < 0.4194234     to the right, agree=0.964, adj=0.909, (0 split)
##       BiologicalMaterial02 < 0.7686673     to the right, agree=0.893, adj=0.727, (0 split)
##       BiologicalMaterial06 < 0.5404601     to the right, agree=0.857, adj=0.636, (0 split)
##       BiologicalMaterial03 < 1.343393      to the right, agree=0.821, adj=0.545, (0 split)
##       BiologicalMaterial08 < 0.5912181     to the right, agree=0.821, adj=0.545, (0 split)
## 
## Node number 26: 16 observations
##   mean=41.02937, MSE=1.316068 
## 
## Node number 27: 7 observations
##   mean=42.49429, MSE=0.7187673 
## 
## Node number 42: 11 observations
##   mean=38.89636, MSE=0.3381686 
## 
## Node number 43: 17 observations
##   mean=39.79059, MSE=0.5718408
prp(rpartTree)
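The prp plot shows the tree structure and node means, but not the full distribution of yield in each terminal node. A hedged sketch of one way to show that (assuming the partykit package is installed): converting the rpart tree to a party object produces boxplots of the response in every terminal node.

library(partykit)
# boxplots of the response (Yield) in each terminal node of the single tree
plot(as.party(rpartTree))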

As with the earlier models, this single tree confirms the importance of ManufacturingProcess32, which forms the root split and then branches into splits on BiologicalMaterial12 and BiologicalMaterial06. A nice aspect of the tree visualization is that we can see how the predictors interact with one another. Even though the top 10 importance list mixes biological materials and manufacturing processes, the tree shows that after the initial ManufacturingProcess32 split, the biological materials drive most of the early splits and are therefore highly influential on yield.