Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.4.3
library(party)
## Warning: package 'party' was built under R version 4.4.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.4.3
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.4.3
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
#Problem 8.1 “Recreate the simulated data from Exercise 7.2:”
> library(mlbench)
> set.seed(200)
> simulated <- mlbench.friedman1(200, sd = 1)
> simulated <- cbind(simulated$x, simulated$y)
> simulated <- as.data.frame(simulated)
> colnames(simulated)[ncol(simulated)] <- "y"
set.seed(200)
simulated<-mlbench.friedman1(200,sd = 1)
simulated<-cbind(simulated$x, simulated$y)
simulated<-as.data.frame(simulated)
colnames(simulated)[ncol(simulated)]<-"y"
colnames(simulated)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "y"
##Problem 8.1 (a) “Fit a random forest model to all of the predictors, then estimate the variable importance scores:”
> library(randomForest)
> library(caret)
> model1 <- randomForest(y ~ ., data = simulated,
+                        importance = TRUE,
+                        ntree = 1000)
> rfImp1 <- varImp(model1, scale = FALSE)
“Did the random forest model significantly use the uninformative predictors (V6–V10)?”
model1<-randomForest(y~.,
data=simulated,
importance = TRUE,
ntree = 1000
)
rfIm1 <- importance(model1, scale = FALSE)
rfIm1
## %IncMSE IncNodePurity
## V1 8.62743275 1110.2060
## V2 6.27437240 923.7456
## V3 0.72305459 283.4993
## V4 7.50258584 1033.8895
## V5 2.13575650 493.2582
## V6 0.12395003 190.7885
## V7 0.02927888 201.2613
## V8 -0.11724317 151.4477
## V9 -0.10344797 151.6630
## V10 0.04312556 180.6207
No, the random forest did not make significant use of the uninformative predictors (V6–V10): their %IncMSE values are all close to zero (a few are even slightly negative), so permuting them barely changes the model’s error and they contribute essentially nothing to the fit.
##Problem 8.1 (b) “Now add an additional predictor that is highly correlated with one of the informative predictors. For example:”
simulated$duplicate1<-simulated$V1+rnorm(200)* 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201
The duplicate is highly correlated with the original V1 (r ≈ 0.95).
Refitting the random forest with the duplicate included:
print(names(simulated))
## [1] "V1" "V2" "V3" "V4" "V5"
## [6] "V6" "V7" "V8" "V9" "V10"
## [11] "y" "duplicate1"
model2 <- randomForest(y~.,
data=simulated,
importance = TRUE,
ntree = 1000
)
importance(model2)
## %IncMSE IncNodePurity
## V1 37.0333953 851.7258
## V2 47.2816516 856.9051
## V3 10.1512628 259.2217
## V4 49.4699119 963.8986
## V5 24.7131905 458.7097
## V6 3.3767346 169.6796
## V7 2.6899419 181.1686
## V8 -1.6045221 134.0665
## V9 -2.0953735 137.8951
## V10 -0.1934903 163.4766
## duplicate1 24.2144018 518.3576
With the two highly correlated columns in the model, V1’s IncNodePurity drops noticeably from model 1 to model 2 (roughly 1110 down to 852) while duplicate1 picks up a sizeable share (about 518). Because the trees can split on either column more or less interchangeably, the importance that V1 carried on its own is now diluted across V1 and duplicate1, so each of the correlated predictors looks less important individually than V1 did alone.
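To see the dilution more directly, a short sketch (not run above, following the book’s suggestion): add a second copy correlated with V1 on a working copy of the data and refit; V1’s importance should drop again as it is now shared across three columns. The names sim2, duplicate2, and model2b are placeholders introduced here.
set.seed(200)
sim2 <- simulated                                # work on a copy so later chunks are unaffected
sim2$duplicate2 <- sim2$V1 + rnorm(200) * 0.1    # second predictor correlated with V1
model2b <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
importance(model2b, scale = FALSE)[c("V1", "duplicate1", "duplicate2"), ]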
##Problem 8.1 (c) “Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?”
Fitting the model using cforest
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
model3 <- cforest(y ~ ., data = simulated, controls = bagCtrl)
traditional importance measure
varimp(model3, conditional = FALSE)
## V1 V2 V3 V4 V5
## 8.6452529553 7.8177828620 0.0004641803 10.4805998332 2.2437163429
## V6 V7 V8 V9 V10
## -0.0194358771 -0.0008392866 -0.0451147246 -0.0226129822 0.0266089808
## duplicate1
## 0.7146433068
conditional importance
varimp(model3, conditional = TRUE)
## V1 V2 V3 V4 V5
## 2.149642e+00 4.540293e+00 -4.902451e-03 6.007182e+00 8.307402e-01
## V6 V7 V8 V9 V10
## -3.192562e-03 1.087743e-03 7.522085e-04 7.187361e-04 4.007129e-05
## duplicate1
## 5.105282e-02
When the conditional importance is used, the importance attributed to V1 and especially to duplicate1 drops sharply. The conditional permutation scheme adjusts for correlation between predictors, so duplicate1, which carries essentially no information beyond V1, is stripped of nearly all of its apparent importance. The uninformative predictors V6–V10 stay near zero under both measures, so the broad ranking matches the traditional random forest, but the conditional version handles the correlated pair much more sensibly.
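A small convenience view (a sketch, not run above) putting the two measures side by side; note that varimp() on a cforest is itself based on random permutations, so the numbers will not exactly match the printouts above.
cbind(traditional = varimp(model3, conditional = FALSE),
      conditional = varimp(model3, conditional = TRUE))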
##Problem 8.1 (d) “Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?”
gbmModel <- gbm(y ~ .,
data = simulated,
distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 29.8985297
## V1 V1 26.5068407
## V2 V2 24.7617753
## V5 V5 9.9939413
## V3 V3 7.2348280
## duplicate1 duplicate1 1.2583726
## V6 V6 0.1880807
## V7 V7 0.1576317
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
The relative influence pattern differs from the random forest and looks more like the conditional cforest results, in that duplicate1’s importance is pushed nearly to zero. This is likely a consequence of boosting fitting trees sequentially: once earlier trees have used V1 to explain the signal, the highly correlated duplicate has little left to contribute, so it receives very little relative influence.
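A quick robustness check (a sketch with illustrative, untuned settings, not run above): refit the gbm with more trees and a smaller learning rate and confirm that duplicate1 still receives little relative influence.
set.seed(200)
gbmModel2 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.01)   # more, smaller boosting steps
summary(gbmModel2, plotit = FALSE)                   # relative influence table only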
Separating the predictors from the response for the Cubist model.
x <- simulated[, !names(simulated) %in% "y"]
y <- simulated$y
cubistModel <- train(x,y, method = "cubist")
varImp(cubistModel)
## cubist variable importance
##
## Overall
## V1 100.000
## V2 84.028
## V4 65.278
## V3 54.167
## V5 51.389
## V6 14.583
## V8 4.167
## duplicate1 0.000
## V10 0.000
## V9 0.000
## V7 0.000
The Cubist importances are fairly similar to the boosted model but not to the random forest: the informative predictors V1–V5 dominate, and duplicate1 is driven all the way to zero, whereas the boosted and conditional cforest models only pushed it close to zero.
#Problem 8.2: “Use a simulation to show tree bias with different granularities.”
set.seed(200)
V1 <- runif(1000, 2, 1000)   # wide continuous range: many distinct values (high granularity)
V2 <- runif(1000, 50, 500)   # wide continuous range: many distinct values
V3 <- rnorm(100, 500, 10)    # only 100 draws; data.frame() recycles them, so V3 has far fewer distinct values (low granularity)
y <- rnorm(1000)             # response is pure noise, unrelated to every predictor
df <- data.frame(V1, V2, V3, y)
simulation_bias <- randomForest(y ~ .,
                                data = df,
                                importance = TRUE,
                                ntree = 1000
)
Inspecting the variable importance of the fitted model:
importance(simulation_bias)
## %IncMSE IncNodePurity
## V1 -8.589150 301.9919
## V2 -5.660444 301.9514
## V3 -3.251921 255.8702
V1 and V2 have many more unique values because they are drawn over wide continuous ranges, while V3 is recycled from only 100 draws of a tight distribution and so has far fewer distinct values. Even though y is pure noise and none of the predictors are truly informative, the node-purity importance comes out higher for the high-granularity predictors V1 and V2 than for V3. That is the bias: trees favour predictors that offer more candidate split points, not predictors that are actually more useful.
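A more direct demonstration of the same bias (a sketch, not run above, assuming the rpart package is installed): fit single-split trees to pure noise many times and count which predictor is chosen at the root. Here V1 has many distinct values and V3 only ten, yet neither is related to y, so any preference for V1 is pure granularity bias. The name root_var and the simulation settings are illustrative.
library(rpart)

set.seed(200)
root_var <- replicate(200, {
  V1 <- runif(100, 2, 1000)                 # high granularity: ~100 distinct values
  V3 <- sample(1:10, 100, replace = TRUE)   # low granularity: 10 distinct values
  y  <- rnorm(100)                          # response is pure noise
  fit <- rpart(y ~ V1 + V3, data = data.frame(V1, V3, y),
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])            # variable used at the root split
})
table(root_var)                             # V1 should be chosen far more often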
#Problem 8.3: “In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:”
##Problem 8.3 (a): “Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?”
Based on the figure, the model on the left spreads its importance across many more predictors. With the learning rate at 0.1, each tree contributes only a small step toward the fit, so many trees (and, across them, many different predictors) are needed to build up the model; with the bagging fraction at 0.1, each tree also sees a different small subsample, which further varies which predictors get used. With both parameters at 0.9, each tree takes large greedy steps on nearly the full data set, so the model quickly commits to the strongest few predictors and keeps reusing them, concentrating the importance on just those first few.
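The solubility data from the book is not loaded here, so as a stand-in the same effect can be sketched on the simulated data from Problem 8.1 (a sketch, not run above): two gbm fits with both parameters at 0.1 versus both at 0.9, where the second fit is expected to concentrate relative influence on fewer predictors. Settings and object names are illustrative.
set.seed(200)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_low,  plotit = FALSE)   # importance spread over more predictors
summary(gbm_high, plotit = FALSE)   # importance concentrated on the top few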
##Problem 8.3 (b): “Which model do you think would be more predictive of other samples?”
The left model, most likely. The right model, with its high learning rate and high bagging fraction, is probably over-fitting: it puts nearly all of its importance on a few predictors and essentially none on the rest. The left model’s more evenly distributed importance reflects a more cautious fit that should generalize better to new samples.
##Problem 8.3 (c): “How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?”
I think increasing the interaction depth will amplify what each model is already doing (a rough check is sketched below): for the plot on the left, deeper trees would spread the importance across even more predictors, while on the right the extra depth would likely add still more weight to the few variables already at the top.
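A rough check of that claim, again using the Problem 8.1 simulated data as a stand-in for the solubility data (a sketch, not run above): compare relative influence at interaction.depth = 1 versus 10.
set.seed(200)
gbm_shallow <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                   n.trees = 500, shrinkage = 0.1, interaction.depth = 1)
gbm_deep    <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                   n.trees = 500, shrinkage = 0.1, interaction.depth = 10)
summary(gbm_shallow, plotit = FALSE)
summary(gbm_deep,    plotit = FALSE)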
#Problem 8.7 “Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:”
Loading the data package used in Chapters 6 and 7.
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176 58
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
sum(is.na(ChemicalManufacturingProcess))
## [1] 106
##Prepping data for modeling
Imputing the missing values with k-nearest neighbors before fitting any models. (With method = "knnImpute", preProcess also centers and scales the columns, so the modeled values, including Yield, end up on a standardized scale.)
imputer <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
computed_data <- predict(imputer, ChemicalManufacturingProcess)
sum(is.na(computed_data))
## [1] 0
set.seed(200)
x <- computed_data[, -1]
y <- computed_data$Yield
trainIndex <- sample(1:nrow(computed_data), size = 0.8 * nrow(computed_data))
X_train <- x[trainIndex, ]
y_train <- y[trainIndex]
X_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]
Checking the number of rows in each split:
nrow(X_train)
## [1] 140
length(y_train)
## [1] 140
##Problem 8.7 (a): “Which tree-based regression model gives the optimal resampling and test set performance?”
library(randomForest)
rfModel <- randomForest(X_train, y_train,
importance = TRUE,
ntree = 1000)
rfModel
##
## Call:
## randomForest(x = X_train, y = y_train, ntree = 1000, importance = TRUE)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 19
##
## Mean of squared residuals: 0.4107485
## % Var explained: 55.62
Generating predictions
rf_pred <- predict(rfModel, X_test)
Comparing predictions
postResample(rf_pred, y_test)
## RMSE Rsquared MAE
## 0.6101305 0.7451196 0.5070163
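The random forest above only reports its out-of-bag error and test-set performance. For a resampling estimate directly comparable to the cross-validated gbm below, it could also be trained through caret; this is a sketch that was not run here, with the fold setup and tuneLength chosen only for illustration.
set.seed(200)
rf_ctrl <- trainControl(method = "cv", number = 10)
rfCV <- train(X_train, y_train,
              method = "rf",
              tuneLength = 3,          # tries a few mtry values
              ntree = 1000,            # passed through to randomForest
              trControl = rf_ctrl)
rfCV$results                           # cross-validated RMSE / Rsquared / MAE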
library(gbm)
# Note: seq(100, 500, 1000) evaluates to just 100, so n.trees is effectively
# fixed at 100 in this grid (which matches the tuning output below).
gbmGrid <- expand.grid(interaction.depth = c(1, 3),
                       n.trees = seq(100, 500, 1000),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
ctrl <- trainControl(method = "cv", number = 10)
set.seed(200)
gbmModel <- train(X_train, y_train,
method = "gbm",
tuneGrid = gbmGrid,
trControl = ctrl,
verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting
##
## 140 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 127, 125, 125, 127, 127, 127, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth RMSE Rsquared MAE
## 0.01 1 0.7853273 0.5032062 0.6205900
## 0.01 3 0.7390219 0.5323814 0.5788667
## 0.10 1 0.6849711 0.5115510 0.5192905
## 0.10 3 0.6340027 0.5845414 0.4886388
##
## Tuning parameter 'n.trees' was held constant at a value of 100
## Tuning
## parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
gbm_pred <- predict(gbmModel, X_test)
postResample(gbm_pred, y_test)
## RMSE Rsquared MAE
## 0.5440288 0.7649182 0.4368690
The boosted trees do better than the random forest here: on the test set the GBM has a lower RMSE (about 0.54 vs 0.61) and a slightly higher R-squared (about 0.76 vs 0.75), so its predictions are closer to the actual values and it explains a bit more of the variation in yield. Its cross-validated RMSE (0.63 for the best tuning) is also about the same as or slightly better than the random forest’s out-of-bag RMSE (√0.41 ≈ 0.64), so of the tree-based models tried, the boosted tree model gives the best resampling and test set performance.
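For an easier side-by-side view, the two sets of test-set metrics computed above can be stacked into one small table (same postResample calls as before).
rbind(RandomForest = postResample(rf_pred, y_test),
      GBM          = postResample(gbm_pred, y_test))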