HW9 DATA624

Author

Semyon Toybis

Assignment

We are required to complete questions 8.1, 8.2, 8.3, and 8.7 from chapter 8 of “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.
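The code in this document calls functions from several packages that are presumably loaded in a setup chunk not shown in the rendered output; a minimal sketch of what that chunk would need to include:

# Packages assumed to be loaded before the chunks below are run
library(mlbench)                    # mlbench.friedman1(), mlbench.friedman2()
library(randomForest)               # randomForest()
library(caret)                      # train(), varImp(), createDataPartition(), postResample()
library(party)                      # cforest(), varimp()
library(partykit)                   # as.party() for plotting the rpart tree
library(gbm)                        # gbm() boosted trees
library(Cubist)                     # backend for caret's method = 'cubist'
library(rpart)                      # backend for caret's method = 'rpart2'
library(VIM)                        # kNN() imputation
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data
library(tidyr)                      # pivot_longer()
library(ggplot2)                    # ggplot()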

8.1

First, I recreate the simulated data as instructed in the textbook:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- 'y'

A - Random forest model

I fit a random forest model to the data as instructed in the textbook:

set.seed(150)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
        Overall
V1   8.86263006
V2   6.38343366
V3   0.85484513
V4   7.59799658
V5   2.17914269
V6   0.13072356
V7   0.07284997
V8  -0.11366310
V9  -0.08409527
V10 -0.01795819

The uninformative predictors (V6-V10) have the lowest importance scores, all close to zero, indicating that the random forest made little to no use of them.
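For reference, because the model was fit with importance = TRUE, the underlying importance measures can also be pulled straight from the randomForest object; a quick check (not part of the original output):

# Permutation importance (%IncMSE) and node-purity importance (IncNodePurity)
importance(model1)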

B - Adding a correlated predictor

Next, I add a predictor that is highly correlated with one of the informative predictors (V1), as instructed in the textbook:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9368739

Next, I re-fit the random forest model and check the variable importance:

set.seed(200)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

head(rfImp2, 11)
                Overall
V1          6.683611876
V2          6.362177789
V3          0.638976625
V4          7.049935214
V5          2.055229560
V6          0.179210946
V7          0.007933688
V8         -0.106945928
V9         -0.088110476
V10        -0.083458370
duplicate1  3.186932572

The variable importance of V1 dropped to about 6.7 (from 8.9), and the duplicate variable received an importance of about 3.2 - higher than the uninformative variables and even higher than V3 and V5. The correlated copy siphons off part of the importance that previously accrued to V1.

C - Conditional inference trees

Next, I use the cforest function from the party package to fit a random forest model using conditional inference trees.

set.seed(300)
cond_rfmodel <- cforest(y ~ ., data = simulated)
cond_rfImp <- varimp(cond_rfmodel)
head(as.data.frame(cond_rfImp),11)
            cond_rfImp
V1          7.27090006
V2          5.77200180
V3          0.34689431
V4          6.55147742
V5          2.14523322
V6         -0.05943921
V7          0.09158284
V8         -0.22262047
V9          0.03835449
V10        -0.17671655
duplicate1  4.55764007

The conditional inference random forest shows a variable importance pattern similar to the traditional model: V1, V2, V4, and V5 have comparable importance in both models, V3 ranks near the bottom in both, and the duplicate variable again draws a meaningful share of importance away from V1.

comp_varImp <- cbind(rfImp2, as.data.frame(cond_rfImp))
comp_varImp$variable <- rownames(comp_varImp)
rownames(comp_varImp) <- NULL
colnames(comp_varImp) <- c('traditional', 'conditional', 'variable')

comp_varImp <- comp_varImp |> pivot_longer(cols = c('traditional', 'conditional'), names_to = 'model', values_to = 'importance')

# Side-by-side bars comparing importance from the two random forest models
comp_varImp |> ggplot(aes(x = importance, y = reorder(variable, importance), fill = model)) + geom_col(position = 'dodge')
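As an aside, the varimp function in party also supports conditional permutation importance, which adjusts for correlations among predictors; a sketch of that call on the same model, with cond_rfImp_adj as an illustrative name (it can be slow to run):

# Conditional permutation importance accounts for correlated predictors
cond_rfImp_adj <- varimp(cond_rfmodel, conditional = TRUE)
head(as.data.frame(cond_rfImp_adj), 11)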

D - Boosted tree

Next, I try fitting a boosted tree model to compare the variable importance:

set.seed(400)
boosted_tree <- gbm(y ~ ., data = simulated, distribution = 'gaussian')
summary(boosted_tree)

                  var    rel.inf
V4                 V4 29.6335481
V1                 V1 26.4971701
V2                 V2 22.9947906
V5                 V5  9.9479564
V3                 V3  9.0559530
duplicate1 duplicate1  1.2754878
V6                 V6  0.5950939
V7                 V7  0.0000000
V8                 V8  0.0000000
V9                 V9  0.0000000
V10               V10  0.0000000

The boosted tree appears better able to account for the correlated predictor. Although the summary shows normalized relative influence (the raw, non-normalized values would be on a different scale than the other models' importances), the duplicate variable clearly has minimal influence on the model. The boosted model otherwise resembles the other models: V4 is the most important variable, followed by V1 and V2 at similar magnitudes.
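If the raw, non-normalized influence values were needed, gbm exposes them via relative.influence; a short sketch (the resulting scale differs from the random forest importances):

# Raw (non-normalized) relative influence from the boosted tree
relative.influence(boosted_tree, n.trees = boosted_tree$n.trees)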

8.2

We are tasked with using simulated data to show tree bias with different granularities.

First, I create a data-set of 200 observations using the Friedman 2 function from the mlbench package. This will be my granular data-set.

set.seed(1000)
granular_sim <- mlbench.friedman2(200)
granular_sim <- cbind(granular_sim$x, granular_sim$y)
granular_sim <- as.data.frame(granular_sim)
colnames(granular_sim)[ncol(granular_sim)] <- 'y'

head(granular_sim)
         V1        V2         V3       V4        y
1 32.787871  265.2331 0.82076792 8.352731 138.0365
2 75.884649  667.4839 0.20936290 3.385597 137.1029
3 11.393640 1048.2510 0.76743166 2.457987 798.2568
4 69.075515  940.6604 0.67099722 5.186370 643.5930
5 51.640240  868.8430 0.04625896 6.182549 131.0724
6  6.773796  688.3886 0.75360079 1.275275 572.1818

Next, I will create a low granularity data-set that has the values from the above data-set rounded to whole numbers.

low_granular_sim <- round(granular_sim)

head(low_granular_sim)
  V1   V2 V3 V4   y
1 33  265  1  8 138
2 76  667  0  3 137
3 11 1048  1  2 798
4 69  941  1  5 644
5 52  869  0  6 131
6  7  688  1  1 572

Next, I fit a random forest model to both data-sets and compare the variable importance.

set.seed(1100)


rf_granular <- train(x = granular_sim[,1:4], y = granular_sim$y, method = 'rf', 
                           preProcess = c('center','scale'))
rf_granular
Random Forest 

200 samples
  4 predictor

Pre-processing: centered (4), scaled (4) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
  2     151.8388  0.8656250  119.6727
  3     143.1225  0.8715739  113.5702
  4     144.5391  0.8677560  115.1115

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 3.
plot(rf_granular)

plot(varImp(rf_granular, scale = FALSE))

Next, I try fitting the same model on the low granular data:

set.seed(1101)


rf_low_granular <- train(x = low_granular_sim[,1:4], y = low_granular_sim$y, method = 'rf', 
                           preProcess = c('center','scale'))
rf_low_granular
Random Forest 

200 samples
  4 predictor

Pre-processing: centered (4), scaled (4) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
  2     210.1909  0.7384075  165.4654
  3     211.8551  0.7354334  168.2194
  4     217.1665  0.7266993  173.2270

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
plot(rf_low_granular)

plot(varImp(rf_low_granular, scale = FALSE))

The variable importance profiles of the two models are similar. However, the model fit to the less granular data has a noticeably higher RMSE and lower R-squared than the model fit to the more granular data. Rounding leaves the trees far fewer distinct values to split on, so the less granular model cannot fit the response as closely; this suggests that the model fit on the more granular data has lower bias.
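One way to make the granularity difference concrete is to count the distinct values each predictor takes in the two data-sets; rounding collapses the continuous predictors onto far fewer unique split points. A quick check using the objects created above:

# Number of unique values per predictor before and after rounding
sapply(granular_sim[, 1:4], function(x) length(unique(x)))
sapply(low_granular_sim[, 1:4], function(x) length(unique(x)))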

8.3

We are tasked with analyzing two variable importance plots from gradient boosting models: one from a model with the bagging fraction and learning rate set to 0.1, and one from a model with both parameters set to 0.9.

A

The model with the bagging fraction and learning rate set to 0.9 concentrates its importance on the first few predictors, whereas the model with those parameters set to 0.1 spreads importance across many more predictors.

The learning rate controls the greediness of the model: at each boosting iteration, only a fraction of the current step's predicted values is added to the prediction from the previous step in order to reduce the error. Typically, smaller values of the learning rate work best, at the cost of requiring more trees.

The bagging fraction controls the proportion of the training data that is randomly sampled to build each tree. Typically, a value of around 0.5 is recommended.

Because the model with the bagging fraction set to 0.9 uses most of the training set for every tree, there is little variation between the samples, and the model consistently identifies the same few most important predictors. The model with the lower bagging fraction, in contrast, sees much more varied subsamples, in some of which other predictors appear relatively more important, which spreads the importance across more variables. Additionally, the higher learning rate makes the model greedier: it finds early on that the top few variables are most effective at reducing the error and keeps focusing on them without giving other variables much weight.
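To make the two settings concrete, the sketch below shows how they enter a gbm call, using the simulated data from 8.1 rather than the data behind the textbook figure; gbm_low, gbm_high, and the small n.minobsinnode value are illustrative choices, not part of the original analysis.

# Illustrative only: the two parameter combinations discussed above
gbm_low  <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 1000,
                shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)  # small node size so the 10% bags can still be split
gbm_high <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 1000,
                shrinkage = 0.9, bag.fraction = 0.9)

# Compare how concentrated the relative influence is in each fit
summary(gbm_low, plotit = FALSE)
summary(gbm_high, plotit = FALSE)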

B

The model with the lower learning rate and bagging fraction would likely be more predictive of new samples, as it is less likely to be over-fit to the training data than the model with the higher values.

C

Increasing the interaction depth allows each tree to use more predictors, so more variables would receive non-zero importance in both models. However, the model with the lower parameter values would still spread its importance across more variables, whereas the model with the higher values would still concentrate it in the first few most impactful predictors. Thus, the model with the lower values would continue to have the more gradual importance slope.
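Similarly, a quick sketch of the interaction depth effect on the same simulated data (again purely illustrative, with gbm_shallow and gbm_deep as made-up names):

# Deeper trees let more predictors enter each boosting iteration
gbm_shallow <- gbm(y ~ ., data = simulated, distribution = 'gaussian',
                   n.trees = 1000, interaction.depth = 1)
gbm_deep    <- gbm(y ~ ., data = simulated, distribution = 'gaussian',
                   n.trees = 1000, interaction.depth = 7)

summary(gbm_shallow, plotit = FALSE)
summary(gbm_deep, plotit = FALSE)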

8.7

We are tasked with fitting different tree models to the chemical data from prior questions.

Below I load the data, impute missing values, and break the data out into a train and test set as done previously.

Pre-processing

data("ChemicalManufacturingProcess")

Below I impute missing values using the k-nearest neighbors approach, via the kNN function from the VIM package. The function appends indicator columns to the right of the original data that flag whether each value was imputed; I drop these columns from the data frame.

chem_data <- kNN(ChemicalManufacturingProcess)
chem_data <- chem_data[,1:ncol(ChemicalManufacturingProcess)]
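Alternatively, if I recall the VIM interface correctly, the indicator columns can be suppressed up front via the imp_var argument, which would make the manual column drop unnecessary; a sketch:

# Equivalent imputation without the appended *_imp indicator columns (assumes imp_var behaves as described)
chem_data <- kNN(ChemicalManufacturingProcess, imp_var = FALSE)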

Next, I split the data into a training and test set:

set.seed(10)
chem_trainIndex <- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)

chem_trainData <- chem_data[chem_trainIndex,]
chem_testData <- chem_data[-chem_trainIndex,]

Random Forest model

First, I will fit a random forest model to the data:

set.seed(600)
chem_rf_model <- train(x = chem_trainData[,-1], y = chem_trainData$Yield, method = 'rf', 
                           preProcess = c('center','scale'))
chem_rf_model
Random Forest 

144 samples
 57 predictor

Pre-processing: centered (57), scaled (57) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE      
   2    1.337594  0.5706017  1.0552181
  29    1.259343  0.5723492  0.9626985
  57    1.287118  0.5437091  0.9775779

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 29.
plot(chem_rf_model)

The final model (mtry = 29) has a resampled RMSE of about 1.26 and an R-squared of 0.57.

Next, I try predicting the test-set with this same model:

chem_rf_predict <- predict(chem_rf_model, newdata = chem_testData[,-1])
postResample(pred = chem_rf_predict, obs = chem_testData$Yield)
     RMSE  Rsquared       MAE 
0.9332407 0.6968260 0.7710356 

The model has an RMSE of 0.93 and an R-squared of 0.7 for the test set.

The model performs better on the test set than the training set.

Below is the variable importance plot for the model.

plot(varImp(chem_rf_model))

Cubist

Next, I try fitting a Cubist model to the data.

set.seed(500)
chem_cubist_model <- train(x = chem_trainData[,-1], y = chem_trainData$Yield, method = 'cubist', 
                           preProcess = c('center','scale'))
chem_cubist_model
Cubist 

144 samples
 57 predictor

Pre-processing: centered (57), scaled (57) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE      Rsquared   MAE      
   1          0          1.955199  0.3096608  1.4058838
   1          5          1.909067  0.3394789  1.3534423
   1          9          1.922811  0.3305395  1.3711970
  10          0          1.284828  0.5420604  0.9737343
  10          5          1.251636  0.5635881  0.9380624
  10          9          1.268959  0.5524196  0.9542267
  20          0          1.224458  0.5833360  0.9410085
  20          5          1.189862  0.6051725  0.9023676
  20          9          1.208171  0.5936076  0.9192820

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were committees = 20 and neighbors = 5.
plot(chem_cubist_model)

The final model had 20 committees and 5 neighbors. This model had an RMSE of 1.19 and an R-squared of 0.6.

Next, I try predicting the test-set with this same model:

chem_cubist_predict <- predict(chem_cubist_model, newdata = chem_testData[,-1])
postResample(pred = chem_cubist_predict, obs = chem_testData$Yield)
     RMSE  Rsquared       MAE 
0.9772636 0.6445684 0.7720810 

The model has an RMSE of 0.97 and an R-squared of 0.64.

The model performs better on the test set than the training set.

Below is the variable importance plot for the model.

plot(varImp(chem_cubist_model))

Boosted Tree

Last, I will try fitting a Boosted Tree model to the data.

set.seed(600)
gbmGrid <- expand.grid(interaction.depth = seq(1,7, by = 2),
                       shrinkage = c(0.01,0.1),
                       n.trees = seq(100,1000, by = 50),
                       n.minobsinnode = seq(5,30, by = 5))

chem_boostedTree_model <- train(x = chem_trainData[,-1], y = chem_trainData$Yield, method = 'gbm', 
                           preProcess = c('center','scale'),
                           tuneGrid = gbmGrid, verbose = FALSE)
chem_boostedTree_model$finalModel
A gradient boosted model with gaussian loss function.
1000 iterations were performed.
There were 57 predictors of which 56 had non-zero influence.

The model with the lowest resampled RMSE used 1,000 trees, an interaction depth of 5, a shrinkage of 0.01, and a minimum of 5 observations per terminal node.
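The selected parameter combination can be confirmed directly from the caret object; a quick check:

# Tuning parameters caret chose by lowest resampled RMSE
chem_boostedTree_model$bestTune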

The lowest resampled RMSE is:

chem_boostedTree_model$results$RMSE[which.min(chem_boostedTree_model$results$RMSE)]
[1] 1.221159

And the highest resampled R-squared across the tuning grid (not necessarily from the same parameter combination) is:

chem_boostedTree_model$results$Rsquared[which.max(chem_boostedTree_model$results$Rsquared)]
[1] 0.585094
plot(chem_boostedTree_model)

Next, I try predicting the test-set with this same model:

chem_boosted_predict <- predict(chem_boostedTree_model, newdata = chem_testData[,-1])
postResample(pred = chem_boosted_predict, obs = chem_testData$Yield)
     RMSE  Rsquared       MAE 
0.8859718 0.7193019 0.7288205 

The model has an RMSE of 0.89 and an R-squared of 0.72 on the test set.

The model performs better on the test set than the training set.

Below is the variable importance plot for the model.

plot(varImp(chem_boostedTree_model))

A

As seen in the output above, the Cubist model had slightly better resampling performance, but the Boosted Tree model had the best test-set performance. Thus, of the three models tested, I would recommend the Boosted Tree model.

B

The most important predictor in the Boosted Tree model was Manufacturing Process 32, which is consistent with the previous models fit to this data. While manufacturing process variables account for four of the top five predictors, the top ten predictors are evenly split between manufacturing and biological variables.
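A quick way to tally that split is to count manufacturing versus biological variables among the ten highest-ranked predictors; a sketch using the varImp output above, with imp and top10 as illustrative names:

# Count manufacturing vs biological variables among the top ten predictors
imp <- varImp(chem_boostedTree_model)$importance
top10 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl('Manufacturing', top10), 'Manufacturing', 'Biological'))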

C

The boosted tree model is an ensemble that aggregates predictions across many trees, so there is no single final tree that can be plotted.
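Individual trees from the gbm ensemble can still be inspected in tabular form, though not as a diagram; a sketch using gbm's pretty.gbm.tree on the first tree of the boosted model fit above:

# Node-by-node structure (split variable, split value, child nodes) of the first tree
pretty.gbm.tree(chem_boostedTree_model$finalModel, i.tree = 1)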

Below I tune a single tree model and plot the diagram for visualization purposes:

set.seed(700)
chem_singletree_model <- train(x = chem_trainData[,-1], y = chem_trainData$Yield, method = 'rpart2', 
                           preProcess = c('center','scale'),
                           tuneLength = 10,
                           trControl = trainControl(method = 'cv'))
chem_singletree_model
CART 

144 samples
 57 predictor

Pre-processing: centered (57), scaled (57) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 131, 130, 128, 131, 129, 129, ... 
Resampling results across tuning parameters:

  maxdepth  RMSE      Rsquared   MAE     
   1        1.426980  0.4418136  1.154294
   2        1.496596  0.4054703  1.186517
   3        1.498126  0.4179266  1.194233
   4        1.513466  0.4211162  1.235792
   5        1.515828  0.4342953  1.228781
   6        1.522847  0.4339160  1.239924
   7        1.460151  0.4796659  1.169364
   8        1.425329  0.5011166  1.147199
   9        1.432890  0.4973246  1.153072
  10        1.437989  0.4946859  1.149669

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was maxdepth = 8.
rpartTree <- as.party(chem_singletree_model$finalModel)
plot(rpartTree)

The root node and the second level of splits use the same predictors that top the variable importance plots from some of the models above. While the nodes further down may differ, both the tree above and the variable importance plots indicate that Manufacturing Process 32 is the most important predictor, followed by Biological Material 12 and Manufacturing Process 09.