Overview

Exercises from Kuhn and Johnson Applied Predictive Modeling, Chapter 8 Regression Trees and Rule-Based Models. All R code is displayed at the end for clarity.

Exercise 8.1

Recreate the simulated data from Exercise 7.2.

set.seed(200)
simulated = mlbench.friedman1( 200, sd = 1 )
simulated = cbind( simulated$x, simulated$y)

simulated = as.data.frame( simulated)

colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Response 8.1a

We apply the code suggested by the textbook.

model1  <- randomForest( y ~ . , data = simulated, importance = TRUE , ntree = 1000 )

rfImp1 = varImp(  model1, scale = FALSE )

The variable importance plot below shows that variables V6-V10 have minimal importance in predicting y.

Response 8.1b

Now add an additional predictor that is highly correlated with one of the informative predictors. The correlation of the duplicate1 variable is 94.6%.

set.seed(2010)
simulated$duplicate1 = simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1 )

## [1] 0.9457663

The plot below shows that the duplicate1 variable is regarded as significant by the variable importance score.

When we add a second duplicate2 variable, V1 and its two correlates are all regarded as significant by the variable importance score as shown below.

## [1] 0.9429767

Plotting all three variable importance plots together, we conclude that:

each correlate of V1 is treated as significant but less so than the original variable.
the sum of the variable importance of all three in Model 3 appears to add up to the variable importance of V1 is Model 1. Thus, variable importance is not created from thin air when new variables are added.

Response 8.1c

If we run the conditional random forest model using the same number of trees, the result is similar.

set.seed( 10203)

cond_forest1 = cforest( y ~ ., data = simulated[, 1:11], control = cforest_unbiased(ntree = 1000) )

cond_forest2 = cforest( y ~ ., data = simulated[, 1:12], control = cforest_unbiased(ntree = 1000) )

cond_forest3 = cforest( y ~ ., data = simulated[, 1:13], control = cforest_unbiased(ntree = 1000) )

For conditional inference trees as base learners in a random forest, the variable importance plots show the same pattern and order: V1 and its correlates both spread out their variable importance. The relative importance of the informative predictors remains the same.

set.seed( 10203)

cond_impVar1 = varimp( cond_forest1, conditional = TRUE )
cond_impVar2 = varimp( cond_forest2, conditional = TRUE )
cond_impVar3 = varimp( cond_forest3, conditional = TRUE )

cf_combined = tibble( varName = c(labels(cond_impVar1), labels(cond_impVar2), labels( cond_impVar3)) ,
                      score = c( cond_impVar1, cond_impVar2, cond_impVar3) ,
                      Model = c( rep("Model 1", length(cond_impVar1)),
                                 rep("Model 2", length(cond_impVar2)),
                                 rep("Model 3", length(cond_impVar3))
                                 )
                      )

cdf3 = tibble( varName = labels(cond_impVar3), score = cond_impVar3 )

# Sort the variable name for the graph to show in score size order based on the third and biggest model.
cf_combined$varName = factor( cf_combined$varName, levels = cdf3$varName[ order( cdf3$score, decreasing = TRUE ) ] )

The variable importance plot using the conditional permutation importance method of Strobl shows some change in rank of informative predictors:

V5 is more important than duplicate2 under the conditional permutation importance measure.
We observe a slight decline in the importance of V1 relative to the other informative predictors V4 and V2.

Response 8.1d

We consider the same dataset and variable importance problem using gradient boosted and Cubist trees. The gbm library and variable importance measures are run with 2000 trees and default learning rates.

In the plot below, we consider the same 2 correlates of V1 and show the variable importance plots of the 3 models sorted by the largest model’s variable importance scores.

We observe similar results for gradient boosted tree variable importance as for random forests.

Informative predictors uniformly outrank uninformative predictors in models 1 and 2.
Duplicate predictors (the correlates of V1) are less importance. For GBM, duplicates are much less useful than V1.
V1 importance declines as additional correlates are introduced.

Lastly, we use the Cubist tree approach to fitting models to the same 3 versions of the dataset. Here we configure the models to use 15 committees.

The resulting variable importance plot for Cubist models show that:

The informative predictors V1-V5 are important. However, in the first model, the uninformative predictor V6 also shows some importance.
The uninformative predictors V8, V6 show some importance in model 3.
The duplicate2 predictor shows no importance in model 3.
The ordering of informative predictors is different. V1 retains its importance even with the addition of correlates.

Exercise 8.2

Use a simulation to show tree bias with different granularities.

8.2 Response

We consider a random variable \(y=sin(x) + \epsilon\) over a short range of

\[x \in { 0, \pi/6, \pi/3, \pi/2 }\]

where \(\epsilon\) is a uniform white noise distribution.

By altering the variance of the noise \(\epsilon\), thereby creating low and high volatility versions of \(y\) as \(y_{low}\) and \(y_{hi}\), we can study how noise impacts on the variable importance of uninformative predictors.

We define an uninformative predictor \(w\) to be an uncorrelated normal variable with parameters \(N( \mu(y), 5 )\). We then try to predict \(y_{low}\) in terms of \(x, w\) using a CART tree model, repeat the process with \(y_{hi}\) in place of \(y_{low}\) and compare the resulting variable importances.

Next, we build a tree model. The \(y_{hi}\) dataframe shows that the uninformative predictor is more important.

rpart = rpart(y_hi ~ . , data = df)

varImp(rpart)

##     Overall
## w 0.7388358
## x 0.3167823

In contrast,the \(y_{low}\) dataframe shows that informative predictor \(x\) is more important although \(w\) still has a small impact.

rpart2 = rpart(y_low ~ . , data = df2)

varImp(rpart2)

##      Overall
## w 0.05436432
## x 0.84365298

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient.
Although the optimal values of these parameters should be obtained through the tuning process - it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data.

The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.

Response 8.3a

The model on the right focuses on the first few predictors because it is a greedier algorithm. It seeks to learn using nearly all of the data. Thus, if multiple correlated predictors explain the response, the strongest predictor will be preferred in the tree model (if the depth is shallow). The incremental predictive power of the other predictors will be omitted. Increased bagging fraction causes infrequent data to be ignored for model building.

Response 8.3b

The model performance should be better for a small bagging fraction and lower learning rate. Thus, the model on the left side (10% learning rate) should perform better provided that sufficient number of trees are allowed to run, say, ntrees=1000.

Response 8.3c

To answer this question, we fit a gbm model to the solubility data with 2000 trees and tree depth of 1. Then we vary the bagging fraction and learning rate from 0.1 to 0.9. Afterwards, we compare their variable importance plots to those obtained when changing the tree depth to 7.

The baseline models are plotted below. We are able to replicate the pattern in Figure 8.24, but the exact order of predictor importance changes. For example, the textbook produces a top 3 rank of MolWeight, Hydrophilic Factor, NumAromaticBonds whereas we produce a top 3 rank of MolWeight, NumCarbon, SurfaceArea1. These differences don’t appear to be very significant. There are likely differences in training parameters or random seed that could explain these variations.

## Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
## Please use `tibble::rownames_to_column()` instead.

## Selecting by Overall
## Selecting by Overall

The plots below compares the impact of increased tree depth from 1 to 7 for different learning and bagging settings. The left plot shows bagging fraction and learning rate are small: 10%, The right plot shows bagging fraction and learning rate are large: 90%.

The left plot shows that increasing tree depth allows more variables to be used per tree. The variable importance increases for informative predictors. We conclude that a deeper tree flattens the variable importance curve.

The right plot shows that increasing tree depth reorders the importance of variables. But we cannot conclude that it flattens the variable importance curve based on this plot.

Instead, for the learning rate and bagging fraction of 90%, we need to plot variable importance for depth 1 and depth 7 separately sorting each plot by its own variable importance scores. We use the same y-scale in both plots as well.

The top 1 and 2 variables in the right plot (depth 7) have lower scores. Their importance is re-assigned to the variables of rank 4,5,6 in the right plot and to the binary predictors.

We conclude that increasing depth from 1 to 7 does flatten the curve both when bagging fraction and learning rate are high (90%) and when those rates are low (10%).

Exercise 8.7

Response 8.7a

Let’s tune and training a CART tree model, a random forest using CART trees, a stochastic gradient boosted tree and a Cubist tree. We’ll repeat the data imputation and splitting logic from exercise 6.3 and 7.5 below first.

For ease of replication, we reproduce the entire code block and the preceding text in the subsection below entitled Preprocessing the data from Assignment 8.

Preprocessing the data

To re-use the same data splitting, data imputation and pre-processing steps as before in Exercise 6.3, we replicate and run the code below and briefly recall the steps taken to obtain the prepared datasets. However, we omit the detailed analysis and diagnostic plots used to justify those steps.

Rename the lengthy variable names:
- predictors beginning with BiologicalMaterial are shortened to B
- predictors beginning with ManufacturingProcess are shortened to M
Split the data into a 75-25 train/test split based on stratified random sampling using the response Yield to define 5 quantiles.
Drop near zero variance predictors: B07 is dropped from both test and train data sets.
Remove outlier zero values for predictors where all other values are positive. Replace the zeros by the median of the population.
Center and scale the predictors.
Use knn imputation to fill in missing values in the data set.

data("ChemicalManufacturingProcess")

chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]

chemX %>% 
  rename_with(~ str_replace(., "ManufacturingProcess", "M"),  starts_with("Manufacturing") ) %>%
  rename_with(~ str_replace(., "BiologicalMaterial", "B"),  starts_with("Biological") ) -> chemX3

set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )

chemX_train_bv %>% mutate( 
  across(c(M18, M20, M16, M26, M25, M27, M31, 
           M29, M43, M30, M01, M42, M44, M45, 
           M39, M38 ) ,   # Run a function across the columns listed in across()
  ~if_else( condition = ( .x == 0 )  ,  # anonymous function applied to each column in across()  .x refers to current column argument
            true = median( . ) ,  # we use median without removal of NA
            false = as.numeric(.) ) ) # keep the existing value of it is nonzero
  
  ) -> chemX_train_nz

# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))

# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc,  chemX_train_nz )

# Becomes the final version of test data for prediction
chemX_test_pp  = predict( preProcFunc,  chemX_test_nz )

CART Tree

Using the rpart library and its training methods, we consider the best model using RMSE as the metric on the best selectionFunction.

set.seed(10393)
library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 15, by = 1 )  )

#cartGrid = data.frame( maxdepth = seq(1, 15, by = 1), .model = TRUE )

#cartControl <- trainControl(method = "cv", number = 10,  selectionFunction = "best")
cartControl <- trainControl(method = "boot", number = 30,  selectionFunction = "best")

(rpartTune = train( x= chemX_train_pp, 
                   y = chemY_train,
                   method = "rpart2" ,
                   metric = "RMSE" ,
                   tuneGrid = cartGrid ,
                   trControl = cartControl ) )

## CART 
## 
## 132 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (30 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##    1        1.531248  0.2883396  1.220869
##    2        1.540336  0.3066603  1.219017
##    3        1.545803  0.3158142  1.222760
##    4        1.531157  0.3327614  1.204190
##    5        1.521892  0.3447464  1.192122
##    6        1.517168  0.3517898  1.190026
##    7        1.516941  0.3572683  1.188099
##    8        1.525483  0.3569762  1.196125
##    9        1.540100  0.3493954  1.207663
##   10        1.547503  0.3460554  1.213469
##   11        1.546986  0.3463922  1.213032
##   12        1.547563  0.3461687  1.211958
##   13        1.547563  0.3461687  1.211958
##   14        1.547563  0.3461687  1.211958
##   15        1.547563  0.3461687  1.211958
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 7.

Random Forest

We train a random forest using the rf method using the best selectionFunction on RMSE.

set.seed(1929)
randomForestControl <- trainControl(method = "boot", number = 30,  selectionFunction = "best")

(randomForestTune = caret::train( x = chemX_train_pp, 
                          y = chemY_train, 
                          method = "rf",
                          tuneGrid = expand.grid( mtry = seq( 2, 30, by = 2) ) ,
                          metric = "RMSE" ,
                          trControl = randomForestControl ,
                          ntree = 1000 ) )

## Random Forest 
## 
## 132 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (30 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.331580  0.5744982  1.0704786
##    4    1.280646  0.5915410  1.0243153
##    6    1.257237  0.5996109  1.0058057
##    8    1.247368  0.6004864  0.9932623
##   10    1.242392  0.6006760  0.9880305
##   12    1.235027  0.6014757  0.9794181
##   14    1.233911  0.5994134  0.9773687
##   16    1.229192  0.6001594  0.9723560
##   18    1.233912  0.5943630  0.9752514
##   20    1.230791  0.5954681  0.9711008
##   22    1.231127  0.5923629  0.9699665
##   24    1.233065  0.5899367  0.9715160
##   26    1.234727  0.5872798  0.9717896
##   28    1.232454  0.5869972  0.9683762
##   30    1.233838  0.5850364  0.9693608
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 16.

Stochastic Gradient Boosted Trees

We run the stochastic gradient boosted tree using the gbm library with the best selectionFunction on RMSE.

set.seed( 1027)
gbmControl <- trainControl(method = "boot", number = 30,  selectionFunction = "best")

(gbmTune = caret::train( x = chemX_train_pp, 
                          y = chemY_train, 
                          method = "gbm",
                          tuneGrid = expand.grid( n.trees = c(300, 500), 
                                                  shrinkage = c( 0.005, 0.02,  0.1 , 0.15 ) ,
                                                  interaction.depth = c( 1,  5, 7) ,
                                                  n.minobsinnode = 6
                                                  ) ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = gbmControl ) )

## Stochastic Gradient Boosting 
## 
## 132 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (30 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE      
##   0.005      1                  300      1.415166  0.4720228  1.1299013
##   0.005      1                  500      1.330934  0.4975995  1.0525428
##   0.005      5                  300      1.298819  0.5259602  1.0259547
##   0.005      5                  500      1.240000  0.5420020  0.9752577
##   0.005      7                  300      1.292105  0.5274668  1.0199990
##   0.005      7                  500      1.231573  0.5460870  0.9674665
##   0.020      1                  300      1.250333  0.5286831  0.9800080
##   0.020      1                  500      1.233148  0.5383720  0.9647773
##   0.020      5                  300      1.196448  0.5624542  0.9386733
##   0.020      5                  500      1.184833  0.5708035  0.9290373
##   0.020      7                  300      1.197474  0.5619141  0.9380596
##   0.020      7                  500      1.185182  0.5702181  0.9280550
##   0.100      1                  300      1.237243  0.5399428  0.9689625
##   0.100      1                  500      1.241025  0.5393010  0.9722572
##   0.100      5                  300      1.197560  0.5634529  0.9439971
##   0.100      5                  500      1.197152  0.5639139  0.9438898
##   0.100      7                  300      1.199562  0.5612095  0.9374184
##   0.100      7                  500      1.199081  0.5616638  0.9372710
##   0.150      1                  300      1.263423  0.5224939  0.9855267
##   0.150      1                  500      1.267867  0.5222623  0.9899760
##   0.150      5                  300      1.203555  0.5563481  0.9554399
##   0.150      5                  500      1.203216  0.5566581  0.9551805
##   0.150      7                  300      1.230460  0.5429069  0.9687746
##   0.150      7                  500      1.229839  0.5434100  0.9683315
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 6
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 500, interaction.depth =
##  5, shrinkage = 0.02 and n.minobsinnode = 6.

Cubist Trees

Now we train a Cubist tree using the best selectionFunction on RMSE.

set.seed( 1027)
cubistControl <- trainControl(method = "boot", number = 30,  selectionFunction = "best")

(cubistTune = caret::train( x = chemX_train_pp, 
                          y = chemY_train, 
                          method = "cubist",
                          tuneGrid = expand.grid( committees = c( 1, 10, 20, 50 ) ,
                                                  neighbors = c( 0, 1, 5, 8)
                                                  ) ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )

## Cubist 
## 
## 132 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (30 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.672066  0.3667729  1.2738195
##    1          1          1.640418  0.3993203  1.2220316
##    1          5          1.632140  0.3906536  1.2385862
##    1          8          1.639775  0.3857359  1.2487970
##   10          0          1.280367  0.5140740  1.0068784
##   10          1          1.221138  0.5555331  0.9461957
##   10          5          1.242895  0.5402182  0.9709419
##   10          8          1.251116  0.5349056  0.9803797
##   20          0          1.217417  0.5528367  0.9628783
##   20          1          1.147448  0.6005078  0.8840467
##   20          5          1.178367  0.5795569  0.9216508
##   20          8          1.186654  0.5738764  0.9316537
##   50          0          1.177035  0.5758809  0.9347189
##   50          1          1.106673  0.6223372  0.8561449
##   50          5          1.138743  0.6018281  0.8931441
##   50          8          1.147362  0.5962340  0.9040053
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 1.

Summarizing Model Comparison

Based on the above results, we summarize their training and test RMSE and \(R^2\) below. \(\color{red}{\text{ We also include the results from Assignment 7 (the Partial Least Squares Model)}}\) \(\color{red}{\text{and Assignment 8 (MARS).}}\) The joint results are tabulated below.

# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()

model_list = list( rpartTune, randomForestTune, gbmTune, cubistTune)

model_stats = data.frame()

for( modelObj in model_list )
{
    pred_data <- predict( modelObj, newdata=chemX_test_pp )
    
    output <- data.frame( modelName = modelObj$method ,
                     trainRMSE = getTrainPerf(modelObj)[1,1] ,
                     trainR2   = getTrainPerf(modelObj)[1,2] ,
                     testRMSE  = caret::RMSE( pred_data,  chemY_test) ,
                     testR2    = caret::R2(   pred_data,  chemY_test) ,
                     modelType = "Tree/Rule"
                     )
    model_stats <- rbind(model_stats, output)
}

Model Results Comparison

modelName	trainRMSE	trainR2	testRMSE	testR2	modelType
cubist	1.11	0.62	0.94	0.75	Tree/Rule
gbm	1.18	0.57	1.03	0.71	Tree/Rule
rf	1.23	0.60	1.15	0.65	Tree/Rule
Mars	1.22	0.58	1.20	0.60	Non-Linear Regression
pls	1.20	0.57	1.23	0.59	Linear Regression
rpart2	1.52	0.36	1.56	0.38	Tree/Rule

We conclude that the best model is cubist. The order of model preference from best to worst is Cubist, stochastic gradient boosted trees, random forests, MARS, partial least squares and CART trees.

For the tree/rule based models, rankings is consistent across both RMSE and \(R^2\) metrics and across both test and training data sets. Moreover, the test and cross-validated performances are quite comparable which gives us some assurance that the model findings are robust.
For the middle of the pack, (rf, Mars, pls) the results are slightly conflicted. For MARS and PLS, the results are not consistent across all metrics because their training RMSE and training \(R^2\) would yield a different ordering. They would be preferred to random forests.

Response 8.7b

## Selecting by Overall

The top 3 predictors are manufacturing variables: M32, M17, M13. Seven of the top 10 predictors are manufacturing. This is promising because it suggests that improvements to operational processes will increase yield. The top 3 biological predictors are B03 at rank 4 and B12 at rank 7 and B02 at rank 10.

While the exact fraction of biological predictors in the top 10 will vary with the model building process, the conclusion that biological predictors are less important has now been validated across 3 types of models: Linear regression, non-linear regression and tree-based models.

When we compare top 10 variables from pls, MARS and Cubist, we see some commonalities and differences.

Predictor Importance Across 3 Model Types

rank	PLS	MARS	Cubist
1	M32	M32	M32
2	M13	M09	M17
3	M36		M13
4	M09		B03
5	M17		M29
6	B02		M39
7	M33		B12
8	M06		M33
9	B06		M09
10	B03		B02

M32 is the top predictor across all models. This should be a robust finding and efforts to improve it should be the top recommendation.
M13 is a top 3 predictor in two of the 3 model categories and should be improved.
M17 is a top 5 and top 2 predictor in two of the 3 model categories and should be improved.
The biological predictors make the top 10 list but they make up only 3 of the top 10 predictors and never higher than top 4 in the PLS and Cubist models. However, there is significant instability in the importance of these variables due to modeling variation, so our conclusion should be limited to stating that improvements in biological predictors is of secondary importance based on available data.

Response 8.7c

We plot the optimal CART tree from 8.7a using the rpart.plot and partykit packages. These produce related but complementary information on the optimal tree which has maxdepth 7. We gain several insights from inspecting both diagrams below.

## Loading required package: libcoin

## 
## Attaching package: 'partykit'

## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp

M32 is a crucial predictor to manage in the manufacturing process. If we can make M32 go down the right branch at node 1 (i.e. exceed its threshold of 0.195), then the worst case (at node 14) exceeds a yield of 39. This alone boosts the conditional mean yield to 42 - a net increase of 3 over the alternative branch.
B11 is a crucial predictor to avoid extreme low yield and to improve extreme high yields. It is relevant at nodes 2 and 17. At Node 2, we want to avoid worst case yields (Nodes 4,5) of 37 and 39. At Node 17, we want to a median yield above 43. This means that the biological ingredient associated with B11 needs to be high quality. We note that the role of B11 is somewhat inconsistent with the model findings of PLS, MARS and Cubist.

We conclude that , based on the RPART model, these are the key recommendations to business managers on how to improve yields or investigate the underlying cost/benefits.

Code

We summarize all the R code used in this project in this appendix for ease of reading.

knitr::opts_chunk$set(echo = FALSE)

library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(caret)
library(randomForest)
library(party)
library(caret)
library(gbm)
library(Cubist)
library(rpart)

library(AppliedPredictiveModeling)
library(mlbench)  # data for Friedman non-linear dataset

.code-bg {
  background-color: lightgray;
}
set.seed(200)
simulated = mlbench.friedman1( 200, sd = 1 )
simulated = cbind( simulated$x, simulated$y)

simulated = as.data.frame( simulated)

colnames(simulated)[ncol(simulated)] <- "y"

model1  <- randomForest( y ~ . , data = simulated, importance = TRUE , ntree = 1000 )

rfImp1 = varImp(  model1, scale = FALSE ) 

df = tibble(  varName = rownames(rfImp1), score = rfImp1$Overall)

df$varName = factor( df$varName, levels = df$varName[ order( df$score)])

df %>% ggplot(aes( x = varName, y = score ) ) + geom_bar(stat = "identity")  + labs(title = "Variable Importance Plot - Model 1")

set.seed(2010)
simulated$duplicate1 = simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1 )

model2  <- randomForest( y ~ . , data = simulated, importance = TRUE , ntree = 1000 )

rfImp2 = varImp(  model2, scale = FALSE ) 

df2 = tibble(  varName = rownames(rfImp2), score = rfImp2$Overall)

df2$varName = factor( df2$varName, levels = df2$varName[ order( df2$score)])

df2 %>% ggplot(aes( x = varName, y = score ) ) + geom_bar(stat = "identity")  + labs(title = "Variable Importance Plot - Model 2")


simulated$duplicate2 = simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate2, simulated$V1 )

model3  <- randomForest( y ~ . , data = simulated, importance = TRUE , ntree = 1000 )

rfImp3 = varImp(  model3, scale = FALSE ) 

df3 = tibble(  varName = rownames(rfImp3), score = rfImp3$Overall)

df3$varName = factor( df3$varName, levels = df3$varName[ order( df3$score)])

df3 %>% ggplot(aes( x = varName, y = score ) ) + geom_bar(stat = "identity")  + labs(title = "Variable Importance Plot - Model 3")


df_combined = rbind( df, df2, df3)

df_combined$Model = c( rep("Model1", nrow(df)), rep("Model2", nrow(df2)), rep("Model3", nrow(df3)))


df_combined %>% ggplot(aes( x = varName, y = score ) ) + 
  geom_bar(stat = "identity")  + 
  facet_wrap(~Model, ncol = 1) +
  labs(title = "Variable Importance Plot with Extra Correlated Predictors")



set.seed( 10203)

cond_forest1 = cforest( y ~ ., data = simulated[, 1:11], control = cforest_unbiased(ntree = 1000) )

cond_forest2 = cforest( y ~ ., data = simulated[, 1:12], control = cforest_unbiased(ntree = 1000) )

cond_forest3 = cforest( y ~ ., data = simulated[, 1:13], control = cforest_unbiased(ntree = 1000) )
cond_impVar1 = varimp( cond_forest1)
cond_impVar2 = varimp( cond_forest2)
cond_impVar3 = varimp( cond_forest3)

cf_combined = tibble( varName = c(labels(cond_impVar1), labels(cond_impVar2), labels( cond_impVar3)) ,
                      score = c( cond_impVar1, cond_impVar2, cond_impVar3) ,
                      Model = c( rep("Model 1", length(cond_impVar1)),
                                 rep("Model 2", length(cond_impVar2)),
                                 rep("Model 3", length(cond_impVar3))
                                 )
                      )

cdf3 = tibble( varName = labels(cond_impVar3), score = cond_impVar3 )

# Sort the variable name for the graph to show in score size order based on the third and biggest model.
cf_combined$varName = factor( cf_combined$varName, levels = cdf3$varName[ order( cdf3$score, decreasing = TRUE ) ] )


cf_combined %>% ggplot(aes( x = varName, y = score ) ) + 
  geom_bar(stat = "identity")  + 
  facet_wrap(~Model, ncol = 1) +
  labs(title = "Variable Importance Plot (Conditional Random Forest) w Correlates")

set.seed( 10203)

cond_impVar1 = varimp( cond_forest1, conditional = TRUE )
cond_impVar2 = varimp( cond_forest2, conditional = TRUE )
cond_impVar3 = varimp( cond_forest3, conditional = TRUE )

cf_combined = tibble( varName = c(labels(cond_impVar1), labels(cond_impVar2), labels( cond_impVar3)) ,
                      score = c( cond_impVar1, cond_impVar2, cond_impVar3) ,
                      Model = c( rep("Model 1", length(cond_impVar1)),
                                 rep("Model 2", length(cond_impVar2)),
                                 rep("Model 3", length(cond_impVar3))
                                 )
                      )

cdf3 = tibble( varName = labels(cond_impVar3), score = cond_impVar3 )

# Sort the variable name for the graph to show in score size order based on the third and biggest model.
cf_combined$varName = factor( cf_combined$varName, levels = cdf3$varName[ order( cdf3$score, decreasing = TRUE ) ] )


cf_combined %>% ggplot(aes( x = varName, y = score ) ) + 
  geom_bar(stat = "identity")  + 
  facet_wrap(~Model, ncol = 1) +
  labs(title = "Variable Importance Plot (Conditional Random Forest) STROBL metric")

set.seed(10238)

gbmModel1 = gbm( y ~ . , data = simulated[, 1:11], distribution = "gaussian" , n.trees = 2000)

gbm_varimp1 = varImp(gbmModel1, numTrees = 2000 )


gbmModel2 = gbm( y ~ . , data = simulated[, 1:12], distribution = "gaussian" , n.trees = 2000)

gbm_varimp2 = varImp(gbmModel2, numTrees = 2000 )

gbmModel3 = gbm( y ~ . , data = simulated[, 1:13], distribution = "gaussian" , n.trees = 2000)

gbm_varimp3 = varImp(gbmModel3, numTrees = 2000 )
gbmdf3 = tibble( varName = labels(gbm_varimp3)[[1]] , score = gbm_varimp3$Overall )

gbm_combined = tibble( varName = c(labels(gbm_varimp1)[[1]], 
                                   labels(gbm_varimp2)[[1]], 
                                   labels(gbm_varimp3)[[1]] )  ,
                    score = c( gbm_varimp1$Overall, gbm_varimp2$Overall, gbm_varimp3$Overall) ,
                      
                      Model = c( rep("Model 1", length(gbm_varimp1$Overall)),
                                 rep("Model 2", length(gbm_varimp2$Overall)),
                                 rep("Model 3", length(gbm_varimp3$Overall)) ) )

gbm_combined$varName = factor( gbm_combined$varName, levels = gbmdf3$varName[ order( gbmdf3$score, decreasing = TRUE ) ] )
  


gbm_combined %>% ggplot(aes( x = varName, y = score ) ) + 
  geom_bar(stat = "identity")  + 
  facet_wrap(~Model, ncol = 1) +
  labs(title = "Variable Importance Plot (Gradient Boosted Trees) Correlated Predictors")

set.seed(10738)

cubistModel1 = cubist( x=simulated[, 1:10], y = simulated[,11] , committees = 15 )

cubist_varimp1 = caret::varImp(cubistModel1 )

cubistModel2 = cubist( x=simulated[, c(1:10,12)], y = simulated[,11], committees = 15  )

cubist_varimp2 = caret::varImp(cubistModel2 )

cubistModel3 = cubist( x=simulated[, c(1:10,12,13)], y = simulated[,11] , committees = 15 )

cubist_varimp3 = caret::varImp(cubistModel3 )



cubistdf3 = tibble( varName = labels(cubist_varimp3)[[1]] , score = cubist_varimp3$Overall )

cubist_combined = tibble( varName = c(labels(cubist_varimp1)[[1]], 
                                      labels(cubist_varimp2)[[1]], 
                                      labels(cubist_varimp3)[[1]] )  ,
                    score = c( cubist_varimp1$Overall, cubist_varimp2$Overall, cubist_varimp3$Overall) ,
                      
                      Model = c( rep("Model 1", length(cubist_varimp1$Overall)),
                                 rep("Model 2", length(cubist_varimp2$Overall)),
                                 rep("Model 3", length(cubist_varimp3$Overall)) ) )

cubist_combined$varName = factor( cubist_combined$varName, levels = cubistdf3$varName[ order( cubistdf3$score, decreasing = TRUE ) ] )



cubist_combined %>% ggplot(aes( x = varName, y = score ) ) + 
  geom_bar(stat = "identity")  + 
  facet_wrap(~Model, ncol = 1) +
  labs(title = "Variable Importance Plot (Cubist) Correlated Predictors")

set.seed( 102303)

x = pi/6 * seq(0,3) 

rep_x = rep(x, 50 )


func_x = sin(rep_x) + 2.0 * runif(200)

func_x_low = sin(rep_x) + 1.0 * runif(200)


white_noise = rnorm(200, mean = mean(sin(rep_x)), sd = 5  )

df = tibble( y_hi = func_x, x = rep_x, w = white_noise )

df2 = tibble( y_low = func_x_low, x = rep_x, w = white_noise )

pairs(df)

pairs(df2)

rpart = rpart(y_hi ~ . , data = df)

varImp(rpart)
rpart2 = rpart(y_low ~ . , data = df2)

varImp(rpart2)



# response 8.3c
# ----------------------------
data(solubility)

set.seed(1029)
gbmModel10 = gbm.fit( x=solTrainXtrans, y = solTrainY, 
                     distribution = "gaussian" , 
                     n.trees = 2000, 
                     interaction.depth = 1,
                     bag.fraction = 0.1,
                     shrinkage = 0.1 ,
                     verbose = FALSE )

vImpModel10 = varImp(gbmModel10, numTrees = 2000 )

set.seed(1029)
gbmModel90 = gbm.fit( x=solTrainXtrans, y = solTrainY, 
                     distribution = "gaussian" , 
                     n.trees = 2000, 
                     interaction.depth = 1,
                     bag.fraction = 0.9,
                     shrinkage = 0.9 ,
                     verbose = FALSE )

vImpModel90 = varImp(gbmModel90, numTrees = 2000 )

vImpModel10 %>% arrange(desc(Overall)) %>% top_n(n=30) %>% 
  add_rownames(var = "Var") %>%
  ggplot(aes(x=reorder(Var, Overall) ,y=Overall)) + 
  geom_bar(stat="identity") + coord_flip() +
  labs(title="Learning .1 ,Bag Frac .1",
        x = "Variable",
       y = "Importance") -> p10


vImpModel90 %>% arrange(desc(Overall)) %>% top_n(n=30) %>% 
  add_rownames(var = "Var") %>%
  ggplot(aes(x=reorder(Var, Overall) ,y=Overall)) + 
  geom_bar(stat="identity") + coord_flip() +
  labs(title="Learning .9 ,Bag Frac .9", 
       x = "Variable",
       y = "Importance") -> p90


plot_grid(p10, p90)

data(solubility)

set.seed(1029)
gbmModel10d7 = gbm.fit( x=solTrainXtrans, y = solTrainY, 
                     distribution = "gaussian" , 
                     n.trees = 2000, 
                     interaction.depth = 7,
                     bag.fraction = 0.1,
                     shrinkage = 0.1 ,
                     verbose = FALSE )

vImpModel10d7 = varImp(gbmModel10d7, numTrees = 2000 )

set.seed(1029)
gbmModel90d7 = gbm.fit( x=solTrainXtrans, y = solTrainY, 
                     distribution = "gaussian" , 
                     n.trees = 2000, 
                     interaction.depth = 7,
                     bag.fraction = 0.9,
                     shrinkage = 0.9 ,
                     verbose = FALSE )

vImpModel90d7 = varImp(gbmModel90d7, numTrees = 2000 )

vImpModel10 %>% add_rownames(var = "Var") -> vim10
vImpModel90 %>% add_rownames(var = "Var") -> vim90
vImpModel90d7 %>% add_rownames( var= "Var") -> vim90d7
vImpModel10d7 %>% add_rownames( var= "Var") -> vim10d7
vim10 %>% inner_join(vim10d7, by = "Var") %>% rename(Depth7 = Overall.y, Depth1 = Overall.x) -> vi10
vim90 %>% inner_join(vim90d7, by = "Var") %>% rename(Depth7 = Overall.y, Depth1 = Overall.x) -> vi90

vi10$Var = factor( vi10$Var, levels = vi10$Var[ order( vi10$Depth1, decreasing = FALSE ) ] )


vi90$Var = factor( vi90$Var, levels = vi90$Var[ order( vi90$Depth1, decreasing = FALSE ) ] )


vi10 %>% 
  arrange(desc(Depth1)) %>%
  top_n(n=25) %>%
  pivot_longer(!Var, names_to = "Depth", values_to = "Score") %>%
  ggplot(aes(x=Var, y = Score , fill = Depth )) + 
  geom_bar(stat= "identity", alpha = 0.8 , position = "dodge") +
  coord_flip() +
  theme(legend.position = c(.8, 0.2)) +
  labs(title = "Increased tree depth flattens slope",
       subtitle = "Bagging 10%, Learning 10%",
       y = "Importance") -> p10d7


vi90 %>% 
  arrange(desc(Depth1)) %>%
  top_n(n=25) %>%
  pivot_longer(!Var, names_to = "Depth", values_to = "Score") %>%
  ggplot(aes(x=Var, y = Score , fill = Depth )) + 
  geom_bar(stat= "identity", alpha = 0.8 , position = "dodge") +
  coord_flip() +
  theme(legend.position = c(.8, 0.2)) +
  labs(title = "Increased tree depth flattens slope",
       subtitle = "Bagging 90%, Learning 90%",
       y = "Importance") -> p90d7


plot_grid(p10d7, p90d7)

vi90$Var = factor( vi90$Var, levels = vi90$Var[ order( vi90$Depth7, decreasing = FALSE ) ] )

vi90 %>% 
  arrange(desc(Depth7)) %>%
  top_n(n=25) %>%
  ggplot(aes(x=Var, y = Depth7  )) + 
  geom_bar(stat= "identity", alpha = 0.8 , position = "dodge") +
  coord_flip() +
  scale_y_continuous(limits = c(0, 1600)) +
  labs(title = "Depth 7",
       subtitle = "Bagging 90%, Learning 90%",
       y = "Importance") -> p90d7x

vi90$Var = factor( vi90$Var, levels = vi90$Var[ order( vi90$Depth1, decreasing = FALSE ) ] )


vi90 %>% 
  arrange(desc(Depth1)) %>%
  top_n(n=25) %>%
  ggplot(aes(x=Var, y = Depth1  )) + 
  geom_bar(stat= "identity", alpha = 0.8 , position = "dodge") +
  coord_flip() +
  scale_y_continuous(limits = c(0, 1600)) +
  labs(title = "Depth 1",
       subtitle = "Bagging 90%, Learning 90%",
       y = "Importance") -> p90d1x 

plot_grid(p90d1x, p90d7x)


data("ChemicalManufacturingProcess")

chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]

chemX %>% 
  rename_with(~ str_replace(., "ManufacturingProcess", "M"),  starts_with("Manufacturing") ) %>%
  rename_with(~ str_replace(., "BiologicalMaterial", "B"),  starts_with("Biological") ) -> chemX3


set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )
chemX_test   = chemX3[-train_indices , ]
chemX_train  = chemX3[ train_indices , ]
chemY_test   = chemY[-train_indices , ]
chemY_train  = chemY[ train_indices , ]


# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
chemX_train_bv = chemX_train[, -nearZeroVar(chemX_train) ]

# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
chemX_test_bv = chemX_test[, -nearZeroVar(chemX_test) ]


chemX_train_bv %>% mutate( 
  across(c(M18, M20, M16, M26, M25, M27, M31, 
           M29, M43, M30, M01, M42, M44, M45, 
           M39, M38 ) ,   # Run a function across the columns listed in across()
  ~if_else( condition = ( .x == 0 )  ,  # anonymous function applied to each column in across()  .x refers to current column argument
            true = median( . ) ,  # we use median without removal of NA
            false = as.numeric(.) ) ) # keep the existing value of it is nonzero
  
  ) -> chemX_train_nz
  
# Replace the outlier zeros with the median value for the variables below for test
chemX_test_bv %>% mutate(
    across(c(M39, M42, M44, M45, M01 ) ,
  ~if_else( condition = ( .x == 0 )  ,
            true = median( . ) ,
            false = as.numeric(.) ) ) )  -> chemX_test_nz 
                           

# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))

# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc,  chemX_train_nz )

# Becomes the final version of test data for prediction
chemX_test_pp  = predict( preProcFunc,  chemX_test_nz )


predictors = subset(ChemicalManufacturingProcess, select = -Yield)
yield = subset( ChemicalManufacturingProcess, select = "Yield")
set.seed(517)

trainingRows = createDataPartition( yield$Yield, p = 0.7, list = FALSE)
trainPredictors = predictors[trainingRows,]
trainYield = yield[trainingRows,]

testPredictors = predictors[-trainingRows, ]
testYield = yield[-trainingRows,]

pp <- preProcess(trainPredictors, method = c("BoxCox", "center", "scale", "knnImpute"))
ppTrainPredictors = predict(pp, trainPredictors)
set.seed(10393)
library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 15, by = 1 )  )

#cartGrid = data.frame( maxdepth = seq(1, 15, by = 1), .model = TRUE )

#cartControl <- trainControl(method = "cv", number = 10,  selectionFunction = "best")
cartControl <- trainControl(method = "boot", number = 30,  selectionFunction = "best")

(rpartTune = train( x= chemX_train_pp, 
                   y = chemY_train,
                   method = "rpart2" ,
                   metric = "RMSE" ,
                   tuneGrid = cartGrid ,
                   trControl = cartControl ) )

set.seed(1929)
randomForestControl <- trainControl(method = "boot", number = 30,  selectionFunction = "best")

(randomForestTune = caret::train( x = chemX_train_pp, 
                          y = chemY_train, 
                          method = "rf",
                          tuneGrid = expand.grid( mtry = seq( 2, 30, by = 2) ) ,
                          metric = "RMSE" ,
                          trControl = randomForestControl ,
                          ntree = 1000 ) )



set.seed( 1027)
gbmControl <- trainControl(method = "boot", number = 30,  selectionFunction = "best")

(gbmTune = caret::train( x = chemX_train_pp, 
                          y = chemY_train, 
                          method = "gbm",
                          tuneGrid = expand.grid( n.trees = c(300, 500), 
                                                  shrinkage = c( 0.005, 0.02,  0.1 , 0.15 ) ,
                                                  interaction.depth = c( 1,  5, 7) ,
                                                  n.minobsinnode = 6
                                                  ) ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = gbmControl ) )



set.seed( 1027)
cubistControl <- trainControl(method = "boot", number = 30,  selectionFunction = "best")

(cubistTune = caret::train( x = chemX_train_pp, 
                          y = chemY_train, 
                          method = "cubist",
                          tuneGrid = expand.grid( committees = c( 1, 10, 20, 50 ) ,
                                                  neighbors = c( 0, 1, 5, 8)
                                                  ) ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )



# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()

model_list = list( rpartTune, randomForestTune, gbmTune, cubistTune)

model_stats = data.frame()

for( modelObj in model_list )
{
    pred_data <- predict( modelObj, newdata=chemX_test_pp )
    
    output <- data.frame( modelName = modelObj$method ,
                     trainRMSE = getTrainPerf(modelObj)[1,1] ,
                     trainR2   = getTrainPerf(modelObj)[1,2] ,
                     testRMSE  = caret::RMSE( pred_data,  chemY_test) ,
                     testR2    = caret::R2(   pred_data,  chemY_test) ,
                     modelType = "Tree/Rule"
                     )
    model_stats <- rbind(model_stats, output)
}

model_stats %>% 
  add_row(modelName = "pls", trainRMSE = 1.203 , trainR2 =   .566 , testRMSE = 1.22817 , testR2 = .5872  ,
                        modelType = "Linear Regression") %>%
  add_row(modelName = "Mars", trainRMSE = 1.22 , trainR2 =   .580 , testRMSE = 1.20 , testR2 = .600  ,
                        modelType = "Non-Linear Regression") ->model_stats2

model_stats2 %>% as_tibble() %>% 
  arrange( testRMSE) %>%
  kable( digits = 2, caption = "Model Results Comparison") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

cubist_varimp = varImp(cubistTune, scale = TRUE)

cubist_varimp$importance %>% arrange(desc(Overall)) %>% top_n(n=20) %>% 
  add_rownames(var = "Var") %>%
  ggplot(aes(x=reorder(Var, Overall) ,y=Overall)) + 
  geom_bar(stat="identity") + coord_flip() +
  labs(title="Cubist Top 20 Variables", 
       x = "Variable",
       y = "Importance") 



predictor_rankings = tibble( 
  rank = c(1:10) ,
  PLS  = c( "M32", "M13", "M36" , "M09", "M17", "B02" , "M33", "M06" , "B06" , "B03" ) ,
  MARS = c("M32", "M09", rep("", 8)) ,
  Cubist = c("M32", "M17", "M13", "B03", "M29", "M39", "B12", "M33", "M09", "B02") )

predictor_rankings %>% kable(caption = "Predictor Importance Across 3 Model Types") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left" )
library(rpart.plot)
library(partykit)

rpart.plot(rpartTune$finalModel)
rpart_pt = as.party(rpartTune$finalModel)
plot(rpart_pt, gp = gpar(fontsize = 7 , col = "blue"), 
     inner_panel = node_inner( rpart_pt,  fill = c("lightgreen", "tan"  ) ) )

Data 624 HW 9 (Week 12) Regression Trees and Rule-Based Models

Alexander Ng

Due 11/21/2021