HW9 DATA624
Assignment
We are required to complete questions 8.1, 8.2, 8.3, and 8.7 from chapter 8 of “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.
8.1
First, I recreate the simulated data as instructed in the textbook:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- 'y'
A - Random forest model
I fit a random forest model to the data as instructed per the text book
set.seed(150)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
Overall
V1 8.86263006
V2 6.38343366
V3 0.85484513
V4 7.59799658
V5 2.17914269
V6 0.13072356
V7 0.07284997
V8 -0.11366310
V9 -0.08409527
V10 -0.01795819
The uninformative predictors (V6-V10) had the lowest importance in the tree model - the model did not use them significantly.
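B - Correlated predictor
The comparison in part C below relies on a duplicate1 column and an rfImp2 importance object created in this step. A minimal sketch of that step, following the textbook's formulation for adding a predictor highly correlated with V1 (the seed and the repeated randomForest/varImp calls mirror part A and are assumptions):
set.seed(150)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
With the highly correlated duplicate added, the importance previously credited to V1 is typically split between V1 and duplicate1, so V1's score drops relative to part A.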
C - conditional inference trees
Next, I use the cforest function from the party package to fit a random forest model using conditional inference trees.
set.seed(300)
cond_rfmodel <- cforest(y ~ ., data = simulated)
cond_rfImp <- varimp(cond_rfmodel)
head(as.data.frame(cond_rfImp), 11)
cond_rfImp
V1 7.27090006
V2 5.77200180
V3 0.34689431
V4 6.55147742
V5 2.14523322
V6 -0.05943921
V7 0.09158284
V8 -0.22262047
V9 0.03835449
V10 -0.17671655
duplicate1 4.55764007
The conditional inference forest shows a similar variable importance pattern to the traditional random forest: V1 and V2 have similar importance across the models, V3 has low importance in both, V4 and V5 are comparable between the models, and the same holds for the duplicate variable.
comp_varImp <- cbind(rfImp2, as.data.frame(cond_rfImp))
comp_varImp$variable <- rownames(comp_varImp)
rownames(comp_varImp) <- NULL
colnames(comp_varImp) <- c('traditional', 'conditional', 'variable')
comp_varImp <- comp_varImp |> pivot_longer(cols = c('traditional', 'conditional'),
                                           names_to = 'model', values_to = 'importance')
comp_varImp <- droplevels(comp_varImp)
comp_varImp |> ggplot(aes(x = importance, y = reorder(variable, importance), fill = model)) +
  geom_col(position = 'dodge')
D - boosted tree
Next, I try fitting a boosted tree model to compare the variable importance
set.seed(400)
boosted_tree <- gbm(y ~ ., data = simulated, distribution = 'gaussian')
summary(boosted_tree)
var rel.inf
V4 V4 29.6335481
V1 V1 26.4971701
V2 V2 22.9947906
V5 V5 9.9479564
V3 V3 9.0559530
duplicate1 duplicate1 1.2754878
V6 V6 0.5950939
V7 V7 0.0000000
V8 V8 0.0000000
V9 V9 0.0000000
V10 V10 0.0000000
The boosted tree handles the correlated predictor better: although these are normalized relative-influence values (the unnormalized importances would be on a different scale than the other models), the duplicate variable clearly has minimal influence on the model. The boosted model otherwise shows a similar pattern to the other models, with V4, V1, and V2 as the top predictors at comparable magnitudes and V6-V10 contributing little or nothing.
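For reference, the unnormalized relative influence can also be inspected directly; a one-line sketch, assuming gbm's normalize argument to summary() (these absolute values are on a model-specific scale and are not comparable to the varImp numbers above):
summary(boosted_tree, plotit = FALSE, normalize = FALSE)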
8.2
We are tasked with using simulated data to show tree bias with different granularities.
First, I create a data-set of 200 observations using the Friedman 2 function from the mlbench package. This will be my granular data-set.
set.seed(1000)
granular_sim <- mlbench.friedman2(200)
granular_sim <- cbind(granular_sim$x, granular_sim$y)
granular_sim <- as.data.frame(granular_sim)
colnames(granular_sim)[ncol(granular_sim)] <- 'y'
head(granular_sim)
V1 V2 V3 V4 y
1 32.787871 265.2331 0.82076792 8.352731 138.0365
2 75.884649 667.4839 0.20936290 3.385597 137.1029
3 11.393640 1048.2510 0.76743166 2.457987 798.2568
4 69.075515 940.6604 0.67099722 5.186370 643.5930
5 51.640240 868.8430 0.04625896 6.182549 131.0724
6 6.773796 688.3886 0.75360079 1.275275 572.1818
Next, I will create a low granularity data-set that has the values from the above data-set rounded to whole numbers.
low_granular_sim <- round(granular_sim)
head(low_granular_sim)
V1 V2 V3 V4 y
1 33 265 1 8 138
2 76 667 0 3 137
3 11 1048 1 2 798
4 69 941 1 5 644
5 52 869 0 6 131
6 7 688 1 1 572
Next, I fit a random forest model to both data-sets and compare the variable importance.
set.seed(1100)
rf_granular <- train(x = granular_sim[,1:4], y = granular_sim$y, method = 'rf',
                     preProcess = c('center','scale'))
rf_granular
Random Forest
200 samples
4 predictor
Pre-processing: centered (4), scaled (4)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 151.8388 0.8656250 119.6727
3 143.1225 0.8715739 113.5702
4 144.5391 0.8677560 115.1115
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 3.
plot(rf_granular)
plot(varImp(rf_granular, scale = FALSE))
Next, I try fitting the same model on the low granular data:
set.seed(1101)
rf_low_granular <- train(x = low_granular_sim[,1:4], y = low_granular_sim$y, method = 'rf',
                         preProcess = c('center','scale'))
rf_low_granular
Random Forest
200 samples
4 predictor
Pre-processing: centered (4), scaled (4)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 210.1909 0.7384075 165.4654
3 211.8551 0.7354334 168.2194
4 217.1665 0.7266993 173.2270
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
plot(rf_low_granular)
plot(varImp(rf_low_granular, scale = FALSE))
The variable importance between the models is similar. However, we can see that the model fit to the less granular data has a higher RMSE and a lower R-squared than the model fit to the more granular data. This suggests that the model fit on the more granular data has lower bias.
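As a cross-check, the two sets of bootstrap results can be compared side by side; a minimal sketch, assuming caret's resamples() helper (note that the two fits used different seeds, so the bootstrap samples themselves differ):
# Collect the resampling results from both train objects into one comparison.
granularity_resamples <- resamples(list(granular = rf_granular, low_granularity = rf_low_granular))
summary(granularity_resamples)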
8.3
We are tasked with analyzing two variable importance plots. One shows the output of a gradient boosting model with bagging fraction and learning rate set to 0.1 and one shows the output of a gradient boosting model with bagging fraction and learning rate set to 0.9.
A
The model with the bagging fraction and learning rate set to 0.9 concentrates its importance on the first few predictors, whereas the model with those parameters set to 0.1 spreads importance across many more predictors.
The learning rate controls the greediness of the model: at each iteration, only a fraction of the new tree's predictions is added to the running prediction from the previous iterations in order to reduce the error. Typically, smaller values of the learning rate work best.
The bagging fraction controls what proportion of the training data is randomly sampled to fit each tree. Typically, a value of around 0.5 is recommended.
Because the model with the bagging fraction set to 0.9 uses most of the training set for every tree, there is little variation from tree to tree, so the model consistently identifies the same few most important predictors. The model with the lower bagging fraction sees more varied samples, and different predictors appear important in different samples, so importance is spread more widely. Additionally, the higher learning rate makes the model greedier: it finds that the top few variables reduce the error most effectively in the early iterations and keeps relying on them without giving other variables much weight.
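To make this concrete, the two tuning extremes can be mimicked on the simulated data from 8.1; a minimal sketch, assuming gbm's shrinkage and bag.fraction arguments stand in for the learning rate and bagging fraction in the figure (the text's models were fit to the solubility data, which is not loaded here, and the other settings are assumptions chosen so the small bagged samples can still be split):
set.seed(401)
gbm_low <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 100,
               interaction.depth = 2, n.minobsinnode = 5, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(402)
gbm_high <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 100,
                interaction.depth = 2, n.minobsinnode = 5, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_low, plotit = FALSE)   # expect importance spread over more predictors
summary(gbm_high, plotit = FALSE)  # expect importance concentrated in the top few predictors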
B
The model with the lower values for the learning rate and bagging fraction would likely be more predictive of new samples, as it is less prone to over-fitting the training data than the other model.
C
Increasing the interaction depth would allow more variables to enter each tree, but the model with the lower parameter values would still spread its importance across more predictors, whereas the model with the higher values would still concentrate on the first few most impactful predictors. Thus, the importance slope for the model with the lower values would remain more gradual.
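Continuing the sketch above (same assumptions), refitting the two models with a larger interaction depth shows whether the importance profile flattens or stays concentrated:
set.seed(403)
gbm_low_deep <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 100,
                    interaction.depth = 10, n.minobsinnode = 5, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(404)
gbm_high_deep <- gbm(y ~ ., data = simulated, distribution = 'gaussian', n.trees = 100,
                     interaction.depth = 10, n.minobsinnode = 5, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_low_deep, plotit = FALSE)
summary(gbm_high_deep, plotit = FALSE)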
8.7
We are tasked with fitting different tree models to the chemical data from prior questions.
Below I load the data, impute missing values, and break the data out into a train and test set as done previously.
Pre-processing
data("ChemicalManufacturingProcess")
Below I impute missing values using the k-nearest neighbors approach via the kNN function from the VIM package. The function appends indicator columns, to the right of the original data, that flag whether each value was imputed; I drop these columns from the data frame.
chem_data <- kNN(ChemicalManufacturingProcess)
chem_data <- chem_data[, 1:ncol(ChemicalManufacturingProcess)]
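Equivalently, the indicator columns can be suppressed at the call site; a one-line alternative, assuming VIM's imp_var argument:
chem_data <- kNN(ChemicalManufacturingProcess, imp_var = FALSE)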
Next, I split the data into a training and test set:
set.seed(10)
chem_trainIndex <- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)

chem_trainData <- chem_data[chem_trainIndex,]
chem_testData <- chem_data[-chem_trainIndex,]
Random Forest model
First, I fit a random forest model to the data:
set.seed(600)
chem_rf_model <- train(x = chem_trainData[,-1], y = chem_trainData$Yield, method = 'rf',
                       preProcess = c('center','scale'))
chem_rf_model
Random Forest
144 samples
57 predictor
Pre-processing: centered (57), scaled (57)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 1.337594 0.5706017 1.0552181
29 1.259343 0.5723492 0.9626985
57 1.287118 0.5437091 0.9775779
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 29.
plot(chem_rf_model)
The final model (mtry = 29) has a resampled RMSE of 1.26 and an R-squared of 0.57.
Next, I try predicting the test-set with this same model:
chem_rf_predict <- predict(chem_rf_model, newdata = chem_testData[,-1])
postResample(pred = chem_rf_predict, obs = chem_testData$Yield)
RMSE Rsquared MAE
0.9332407 0.6968260 0.7710356
The model has an RMSE of 0.93 and an R-squared of 0.70 for the test set.
The model performs better on the test set than on the resampled training data.
Below is the variable importance plot for the model.
plot(varImp(chem_rf_model))
Cubist
Next, I try fitting a Cubist model to the data.
set.seed(500)
chem_cubist_model <- train(x = chem_trainData[,-1], y = chem_trainData$Yield, method = 'cubist',
                           preProcess = c('center','scale'))
chem_cubist_model
Cubist
144 samples
57 predictor
Pre-processing: centered (57), scaled (57)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
Resampling results across tuning parameters:
committees neighbors RMSE Rsquared MAE
1 0 1.955199 0.3096608 1.4058838
1 5 1.909067 0.3394789 1.3534423
1 9 1.922811 0.3305395 1.3711970
10 0 1.284828 0.5420604 0.9737343
10 5 1.251636 0.5635881 0.9380624
10 9 1.268959 0.5524196 0.9542267
20 0 1.224458 0.5833360 0.9410085
20 5 1.189862 0.6051725 0.9023676
20 9 1.208171 0.5936076 0.9192820
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were committees = 20 and neighbors = 5.
plot(chem_cubist_model)
The final model had 20 committees and 5 neighbors. This model had an RMSE of 1.19 and an R-squared of 0.6.
Next, I try predicting the test-set with this same model:
chem_cubist_predict <- predict(chem_cubist_model, newdata = chem_testData[,-1])
postResample(pred = chem_cubist_predict, obs = chem_testData$Yield)
RMSE Rsquared MAE
0.9772636 0.6445684 0.7720810
The model has an RMSE of 0.98 and an R-squared of 0.64 for the test set.
The model again performs better on the test set than on the resampled training data.
Below is the variable importance plot for the model.
plot(varImp(chem_cubist_model))
Boosted Tree
Last, I will try fitting a Boosted Tree model to the data.
set.seed(600)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       shrinkage = c(0.01, 0.1),
                       n.trees = seq(100, 1000, by = 50),
                       n.minobsinnode = seq(5, 30, by = 5))
chem_boostedTree_model <- train(x = chem_trainData[,-1], y = chem_trainData$Yield, method = 'gbm',
                                preProcess = c('center','scale'),
                                tuneGrid = gbmGrid, verbose = FALSE)
chem_boostedTree_model$finalModel
A gradient boosted model with gaussian loss function.
1000 iterations were performed.
There were 57 predictors of which 56 had non-zero influence.
The model with the lowest RMSE used 1000 trees, an interaction depth of 5, a shrinkage of 0.01, and a minimum of 5 observations in the terminal nodes.
The RMSE is:
chem_boostedTree_model$results$RMSE[which.min(chem_boostedTree_model$results$RMSE)]
[1] 1.221159
And the R-squared is:
chem_boostedTree_model$results$Rsquared[which.max(chem_boostedTree_model$results$Rsquared)]
[1] 0.585094
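Since the row with the best RMSE and the row with the best R-squared are not guaranteed to be the same tuning combination, both metrics for the selected model can also be pulled in one step; a small sketch using the train object's bestTune and results components:
# Look up the resampling row that matches the selected tuning parameters.
merge(chem_boostedTree_model$bestTune, chem_boostedTree_model$results)[, c('RMSE', 'Rsquared', 'MAE')]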
plot(chem_boostedTree_model)
Next, I try predicting the test-set with this same model:
chem_boosted_predict <- predict(chem_boostedTree_model, newdata = chem_testData[,-1])
postResample(pred = chem_boosted_predict, obs = chem_testData$Yield)
RMSE Rsquared MAE
0.8859718 0.7193019 0.7288205
The model has an RMSE of 0.89 and an R-squared of 0.72 for the test set.
The model performs better on the test set than on the resampled training data.
Below is the variable importance plot for the model.
plot(varImp(chem_boostedTree_model))
A
As seen in the output above, the Cubist model had slightly better resampling performance, but the boosted tree model had the best test-set performance. Thus, of the three models tested, I would recommend the boosted tree model.
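For a side-by-side view, the test-set metrics computed above can be collected into one table (a small sketch reusing the prediction objects from each section):
rbind(random_forest = postResample(chem_rf_predict, chem_testData$Yield),
      cubist = postResample(chem_cubist_predict, chem_testData$Yield),
      boosted_tree = postResample(chem_boosted_predict, chem_testData$Yield))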
B
The most important predictor in the boosted tree model was manufacturing process 32, which is consistent with the previous models fit to this data. While manufacturing process variables account for four of the top five predictors, the top ten predictors are evenly split between manufacturing process and biological material variables.
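That split can be checked directly from the importance scores; a small sketch, assuming varImp() on the caret train object and the BiologicalMaterial/ManufacturingProcess naming used in the data:
# Rank predictors by the boosted model's importance and tally the types in the top 10.
boosted_imp <- varImp(chem_boostedTree_model)$importance
top10 <- rownames(boosted_imp)[order(boosted_imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl('Biological', top10), 'BiologicalMaterial', 'ManufacturingProcess'))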
C
The boosted tree model is an ensemble that aggregates predictions across many trees, so there is no single final tree that can be plotted.
Below, I tune a single-tree model and plot its diagram for visualization purposes:
set.seed(700)
chem_singletree_model <- train(x = chem_trainData[,-1], y = chem_trainData$Yield, method = 'rpart2',
                               preProcess = c('center','scale'),
                               tuneLength = 10,
                               trControl = trainControl(method = 'cv'))
chem_singletree_model
CART
144 samples
57 predictor
Pre-processing: centered (57), scaled (57)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 131, 130, 128, 131, 129, 129, ...
Resampling results across tuning parameters:
maxdepth RMSE Rsquared MAE
1 1.426980 0.4418136 1.154294
2 1.496596 0.4054703 1.186517
3 1.498126 0.4179266 1.194233
4 1.513466 0.4211162 1.235792
5 1.515828 0.4342953 1.228781
6 1.522847 0.4339160 1.239924
7 1.460151 0.4796659 1.169364
8 1.425329 0.5011166 1.147199
9 1.432890 0.4973246 1.153072
10 1.437989 0.4946859 1.149669
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was maxdepth = 8.
rpartTree <- as.party(chem_singletree_model$finalModel)
plot(rpartTree)
The root node and the second level of splits use the same predictors that top the variable importance plots from the models above. While the nodes further down may differ, both the tree and the variable importance plots indicate that manufacturing process 32 is the most important predictor, followed by biological material 12 and manufacturing process 9.