Homework 9

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson

Libraries

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.4.3
library(party)
## Warning: package 'party' was built under R version 4.4.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.4.3
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.4.3
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

#Problem 8.1 “Recreate the simulated data from Exercise 7.2:

> library(mlbench)
> set.seed(200)
> simulated <- mlbench.friedman1(200, sd = 1)
> simulated <- cbind(simulated$x, simulated$y)
> simulated <- as.data.frame(simulated)
> colnames(simulated)[ncol(simulated)] <- "y"”
              
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
colnames(simulated)
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "y"

##Problem 8.1 (a) “Fit a random forest model to all of the predictors, then estimate the variable importance scores:

> library(randomForest)
> library(caret)
> model1 <- randomForest(y ~ ., data = simulated,
+                        importance = TRUE,
+                        ntree = 1000)
> rfImp1 <- varImp(model1, scale = FALSE)”

“Did the random forest model significantly use the uninformative predictors (V6 – V10)?”

model1<-randomForest(y~.,
                     data=simulated,
                     importance = TRUE, 
                     ntree = 1000
)
rfIm1 <- importance(model1, scale = FALSE)
rfIm1 
##         %IncMSE IncNodePurity
## V1   8.62743275     1110.2060
## V2   6.27437240      923.7456
## V3   0.72305459      283.4993
## V4   7.50258584     1033.8895
## V5   2.13575650      493.2582
## V6   0.12395003      190.7885
## V7   0.02927888      201.2613
## V8  -0.11724317      151.4477
## V9  -0.10344797      151.6630
## V10  0.04312556      180.6207

No, the random forest did not significantly use the uninformative predictors: the %IncMSE scores for V6 through V10 are all near zero (some slightly negative), which marks them as insignificant to the model.
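This is expected by construction: mlbench.friedman1 simulates Friedman's benchmark, y = 10*sin(pi*V1*V2) + 20*(V3 - 0.5)^2 + 10*V4 + 5*V5 + noise, so V6 through V10 carry no signal at all. As a quick visual check, randomForest's built-in importance plot can be drawn from the fitted model above (a minimal sketch):

varImpPlot(model1, type = 1, scale = FALSE)  # type 1 = permutation importance (%IncMSE); V6-V10 should cluster near zero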

##Problem 8.1 (b) “Now add an additional predictor that is highly correlated with one of the informative predictors. For example:”

simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201

With a correlation of about 0.95, duplicate1 is nearly interchangeable with V1.

Refitting the random forest with the duplicate included:

print(names(simulated))
##  [1] "V1"         "V2"         "V3"         "V4"         "V5"        
##  [6] "V6"         "V7"         "V8"         "V9"         "V10"       
## [11] "y"          "duplicate1"
model2 <- randomForest(y~.,
                     data=simulated,
                     importance = TRUE, 
                     ntree = 1000
)
importance(model2)  # NB: scale defaults to TRUE here, so %IncMSE is standardized, unlike the unscaled model 1 scores
##               %IncMSE IncNodePurity
## V1         37.0333953      851.7258
## V2         47.2816516      856.9051
## V3         10.1512628      259.2217
## V4         49.4699119      963.8986
## V5         24.7131905      458.7097
## V6          3.3767346      169.6796
## V7          2.6899419      181.1686
## V8         -1.6045221      134.0665
## V9         -2.0953735      137.8951
## V10        -0.1934903      163.4766
## duplicate1 24.2144018      518.3576

With the two highly correlated columns in the data, V1's importance drops (its IncNodePurity falls from about 1110 in model 1 to about 852 in model 2) and is now shared with duplicate1. Because the two predictors carry nearly the same information, the trees split on them interchangeably, diluting the measured importance of each.
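To compare the two fits on the same footing, both importance matrices can be requested unscaled (a minimal sketch; importance() defaults to scale = TRUE, which is why the %IncMSE magnitudes above differ from model 1):

# Unscaled node-purity importance for the shared predictors, side by side
rfIm2 <- importance(model2, scale = FALSE)
round(cbind(model1 = rfIm1[, "IncNodePurity"],
            model2 = rfIm2[rownames(rfIm1), "IncNodePurity"]), 1)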

##Problem 8.1 (c) “Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?”

Fitting the model using cforest

# mtry equal to the number of predictors makes this bagged conditional inference trees
bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
model3 <- cforest(y ~ ., data = simulated, controls = bagCtrl)

Traditional (unconditional) importance measure:

varimp(model3, conditional = FALSE)
##            V1            V2            V3            V4            V5 
##  8.6452529553  7.8177828620  0.0004641803 10.4805998332  2.2437163429 
##            V6            V7            V8            V9           V10 
## -0.0194358771 -0.0008392866 -0.0451147246 -0.0226129822  0.0266089808 
##    duplicate1 
##  0.7146433068

Conditional importance, as described in Strobl et al. (2007):

varimp(model3, conditional = TRUE)
##            V1            V2            V3            V4            V5 
##  2.149642e+00  4.540293e+00 -4.902451e-03  6.007182e+00  8.307402e-01 
##            V6            V7            V8            V9           V10 
## -3.192562e-03  1.087743e-03  7.522085e-04  7.187361e-04  4.007129e-05 
##    duplicate1 
##  5.105282e-02

Applying the conditional measure greatly reduces the importance of V1 and strips duplicate1 of nearly all of its importance, which is appropriate since it carries almost the same information as V1. Otherwise the pattern matches the traditional random forest: the informative V1, V2, V4, and V5 rank highest while V6 through V10 stay near zero.
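For a direct comparison, the two measures can be tabulated together (a minimal sketch; varimp relies on random permutations, so rerunning it will give slightly different numbers than those printed above):

# Traditional vs. conditional importance, sorted by the conditional measure
imp_trad <- varimp(model3, conditional = FALSE)
imp_cond <- varimp(model3, conditional = TRUE)
round(cbind(traditional = imp_trad,
            conditional = imp_cond)[order(-imp_cond), ], 3)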

##Problem 8.1 (d) “Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?”

Boosted Tree Model

gbmModel <- gbm(y ~ .,
                data = simulated,
                distribution = "gaussian")  # gbm defaults: n.trees = 100, interaction.depth = 1, shrinkage = 0.1

summary(gbmModel)

##                   var    rel.inf
## V4                 V4 29.8985297
## V1                 V1 26.5068407
## V2                 V2 24.7617753
## V5                 V5  9.9939413
## V3                 V3  7.2348280
## duplicate1 duplicate1  1.2583726
## V6                 V6  0.1880807
## V7                 V7  0.1576317
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

The pattern of relative influence differs from the random forest and looks more like the conditional cforest results, with duplicate1's importance dropping nearly to zero. This likely happens because boosting is sequential: once earlier trees have exploited V1, the duplicate brings little new information, so later trees rarely select it.

Cubist Model

Separating the predictors from the response:

x <- simulated[, !names(simulated) %in% "y"]
y <- simulated$y
cubistModel <- train(x,y, method = "cubist")
varImp(cubistModel)
## cubist variable importance
## 
##            Overall
## V1         100.000
## V2          84.028
## V4          65.278
## V3          54.167
## V5          51.389
## V6          14.583
## V8           4.167
## duplicate1   0.000
## V10          0.000
## V9           0.000
## V7           0.000

The Cubist rankings are broadly similar to the boosted model but not to the random forest. Like the other models that discounted the redundant predictor, Cubist drives duplicate1's importance all the way to zero.
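For reference, the committees/neighbors combination that caret selected can be inspected (a small usage note; the chosen values depend on the resampling run):

cubistModel$bestTune  # tuning parameters picked by caret's default Cubist grid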

#Problem 8.2: “Use a simulation to show tree bias with different granularities.”

set.seed(200)
V1 <- runif(1000, 2, 1000)   # wide range: many unique values (high granularity)
V2 <- runif(1000, 50, 500)   # narrower, but still continuous
V3 <- rnorm(100, 500, 10)    # only 100 values, recycled to length 1000 by data.frame(): low granularity
y <- rnorm(1000)             # pure noise: no predictor is truly informative

df <- data.frame(V1, V2, V3, y)

simulation_bias <- randomForest(y ~ .,
                                data = df,
                                importance = TRUE,
                                ntree = 1000)

Checking the variable importance:

importance(simulation_bias)
##      %IncMSE IncNodePurity
## V1 -8.589150      301.9919
## V2 -5.660444      301.9514
## V3 -3.251921      255.8702

V1 and V2 have many more unique values because of their wide continuous ranges, while V3 is tightly clustered and has only 100 distinct values. Even though y is pure noise and none of the predictors is truly informative, the higher-granularity V1 and V2 receive higher node-purity importance than V3. That gap is the bias: V3 is penalized only because it offers fewer candidate split points, not because it is any less related to y.
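The bias is easier to see with single CART trees: across repeated noise-only simulations, the root split should favor the high-granularity predictors. A minimal sketch, assuming the rpart package is available:

library(rpart)

# Count which predictor wins the root split across 200 noise-only datasets;
# cp = 0 forces a split even though no split genuinely reduces error.
set.seed(200)
root_var <- replicate(200, {
  d <- data.frame(V1 = runif(200, 2, 1000),         # high granularity
                  V2 = runif(200, 50, 500),         # high granularity
                  V3 = round(rnorm(200, 500, 10)),  # coarse: few unique values
                  y  = rnorm(200))
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(cp = 0, maxdepth = 1))
  as.character(fit$frame$var[1])
})
table(root_var)  # V1 and V2 should be chosen far more often than V3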

#Problem 8.3: “In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:”

##Problem 8.3 (a): “Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?”

Based on the figure, the model on the left spreads importance across more predictors. With a learning rate of 0.1, each tree contributes only a small step toward the gradient, so many trees, and therefore many predictors, share the work of fitting the signal. The bagging fraction of 0.1 also means each tree sees a different small random subsample, which pushes different trees toward different predictors. With both parameters at 0.9, each tree takes large greedy steps on nearly the full data set, so the same few strong predictors dominate tree after tree and the importance concentrates on them; a sketch of fitting the two extremes follows.
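To reproduce the setup behind Fig. 8.24, the two extreme settings could be fit to the solubility data; a minimal sketch, assuming the solubility objects (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package, with gbmLeft/gbmRight as illustrative names:

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # loads solTrainXtrans, solTrainY, among others

# Left-hand panel: small steps on small subsamples -> importance spreads out
gbmLeft <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                   verbose = FALSE)

# Right-hand panel: large steps on most of the data -> a few predictors dominate
gbmRight <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                    verbose = FALSE)

summary(gbmLeft, plotit = FALSE)[1:10, ]   # top relative influences, left
summary(gbmRight, plotit = FALSE)[1:10, ]  # top relative influences, right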

##Problem 8.3 (b): “Which model do you think would be more predictive of other samples?”

The left model should be more predictive of other samples. The right model's high learning rate and large bagging fraction make it overfit: it concentrates nearly all of its importance on a handful of predictors and gives the rest almost none, while the left model distributes importance more evenly and is less dependent on any single predictor.

##Problem 8.3 (c): “How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?”

Increasing the interaction depth lets each tree split on more predictors, so importance would spread across more variables and the importance slope would flatten for both models. The effect should be most visible on the left-hand plot, which is already diffuse; the right-hand model would likely still concentrate on its top few predictors, just somewhat less steeply.

#Problem 8.7 “Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:”

Loading the chemical manufacturing data used in Chapters 6 and 7:

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176  58
head(ChemicalManufacturingProcess)
##   Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00                 6.25                49.58                56.97
## 2 42.44                 8.01                60.97                67.48
## 3 42.03                 8.01                60.97                67.48
## 4 41.42                 8.01                60.97                67.48
## 5 42.49                 7.47                63.33                72.25
## 6 43.57                 6.12                58.36                65.31
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1                12.74                19.51                43.73
## 2                14.65                19.36                53.14
## 3                14.65                19.36                53.14
## 4                14.65                19.36                53.14
## 5                14.02                17.91                54.66
## 6                15.17                21.79                51.23
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1                  100                16.66                11.44
## 2                  100                19.04                12.55
## 3                  100                19.04                12.55
## 4                  100                19.04                12.55
## 5                  100                18.22                12.80
## 6                  100                18.30                12.13
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1                 3.46               138.09                18.83
## 2                 3.46               153.67                21.05
## 3                 3.46               153.67                21.05
## 4                 3.46               153.67                21.05
## 5                 3.05               147.61                21.05
## 6                 3.78               151.88                20.76
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1                     NA                     NA                     NA
## 2                    0.0                      0                     NA
## 3                    0.0                      0                     NA
## 4                    0.0                      0                     NA
## 5                   10.7                      0                     NA
## 6                   12.0                      0                     NA
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1                     NA                     NA                     NA
## 2                    917                 1032.2                  210.0
## 3                    912                 1003.6                  207.1
## 4                    911                 1014.6                  213.3
## 5                    918                 1027.5                  205.7
## 6                    924                 1016.8                  208.9
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1                     NA                     NA                  43.00
## 2                    177                    178                  46.57
## 3                    178                    178                  45.07
## 4                    177                    177                  44.92
## 5                    178                    178                  44.96
## 6                    178                    178                  45.32
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1                     NA                     NA                     NA
## 2                     NA                     NA                      0
## 3                     NA                     NA                      0
## 4                     NA                     NA                      0
## 5                     NA                     NA                      0
## 6                     NA                     NA                      0
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1                   35.5                   4898                   6108
## 2                   34.0                   4869                   6095
## 3                   34.8                   4878                   6087
## 4                   34.8                   4897                   6102
## 5                   34.6                   4992                   6233
## 6                   34.0                   4985                   6222
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1                   4682                   35.5                   4865
## 2                   4617                   34.0                   4867
## 3                   4617                   34.8                   4877
## 4                   4635                   34.8                   4872
## 5                   4733                   33.9                   4886
## 6                   4786                   33.4                   4862
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1                   6049                   4665                    0.0
## 2                   6097                   4621                    0.0
## 3                   6078                   4621                    0.0
## 4                   6073                   4611                    0.0
## 5                   6102                   4659                   -0.7
## 6                   6115                   4696                   -0.6
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1                     NA                     NA                     NA
## 2                      3                      0                      3
## 3                      4                      1                      4
## 4                      5                      2                      5
## 5                      8                      4                     18
## 6                      9                      1                      1
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1                   4873                   6074                   4685
## 2                   4869                   6107                   4630
## 3                   4897                   6116                   4637
## 4                   4892                   6111                   4630
## 5                   4930                   6151                   4684
## 6                   4871                   6128                   4687
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1                   10.7                   21.0                    9.9
## 2                   11.2                   21.4                    9.9
## 3                   11.1                   21.3                    9.4
## 4                   11.1                   21.3                    9.4
## 5                   11.3                   21.6                    9.0
## 6                   11.4                   21.7                   10.1
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1                   69.1                    156                     66
## 2                   68.7                    169                     66
## 3                   69.3                    173                     66
## 4                   69.3                    171                     68
## 5                   69.4                    171                     70
## 6                   68.2                    173                     70
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1                    2.4                    486                  0.019
## 2                    2.6                    508                  0.019
## 3                    2.6                    509                  0.018
## 4                    2.5                    496                  0.018
## 5                    2.5                    468                  0.017
## 6                    2.5                    490                  0.018
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1                    0.5                      3                    7.2
## 2                    2.0                      2                    7.2
## 3                    0.7                      2                    7.2
## 4                    1.2                      2                    7.2
## 5                    0.2                      2                    7.3
## 6                    0.4                      2                    7.2
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1                     NA                     NA                   11.6
## 2                    0.1                   0.15                   11.1
## 3                    0.0                   0.00                   12.0
## 4                    0.0                   0.00                   10.6
## 5                    0.0                   0.00                   11.0
## 6                    0.0                   0.00                   11.5
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1                    3.0                    1.8                    2.4
## 2                    0.9                    1.9                    2.2
## 3                    1.0                    1.8                    2.3
## 4                    1.1                    1.8                    2.1
## 5                    1.1                    1.7                    2.1
## 6                    2.2                    1.8                    2.0
sum(is.na(ChemicalManufacturingProcess))
## [1] 106

##Prepping data for modeling

Imputing the missing values with k-nearest neighbors before fitting any models:

imputer <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
computed_data <- predict(imputer, ChemicalManufacturingProcess)
sum(is.na(computed_data))
## [1] 0
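Note that knnImpute in caret also triggers centering and scaling (k-nearest-neighbor distances need standardized predictors), so the models below are trained on transformed values. The applied steps can be listed as a quick check:

names(imputer$method)  # preprocessing methods applied by preProcess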

Splitting data set into training and test sets

set.seed(200)
x <- computed_data[, -1]
y <- computed_data$Yield

trainIndex <- sample(1:nrow(computed_data), size = 0.8 * nrow(computed_data))

X_train <- x[trainIndex, ]
y_train <- y[trainIndex]
X_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]

Checking the number of rows in each:

nrow(X_train)
## [1] 140
length(y_train)
## [1] 140
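The split above is a simple random sample; an alternative worth noting is caret's createDataPartition, which stratifies the split on the response. A minimal sketch (the *_2 names are illustrative and not used below):

# Stratified 80/20 split on the transformed Yield values
set.seed(200)
inTrain <- createDataPartition(y, p = 0.8, list = FALSE)
X_train2 <- x[inTrain, ]
y_train2 <- y[inTrain]
X_test2  <- x[-inTrain, ]
y_test2  <- y[-inTrain]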

##Problem 8.7 (a): “Which tree-based regression model gives the optimal resampling and test set performance?”

Random Forest Model

library(randomForest)

rfModel <- randomForest(X_train, y_train, 
                        importance = TRUE,  
                        ntree = 1000)
rfModel
## 
## Call:
##  randomForest(x = X_train, y = y_train, ntree = 1000, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 19
## 
##           Mean of squared residuals: 0.4107485
##                     % Var explained: 55.62

Generating predictions

rf_pred <- predict(rfModel, X_test)

Comparing predictions

postResample(rf_pred, y_test)
##      RMSE  Rsquared       MAE 
## 0.6101305 0.7451196 0.5070163

Boosted Trees Model

library(gbm)


gbmGrid <- expand.grid(interaction.depth = c(1, 3),
                       n.trees = seq(100, 500, 1000),  # NB: the step exceeds the range, so this evaluates to just 100
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

ctrl <- trainControl(method = "cv", number = 10)

set.seed(200)

gbmModel <- train(X_train, y_train,
                  method = "gbm",
                  tuneGrid = gbmGrid, 
                  trControl = ctrl,
                  verbose = FALSE)

gbmModel
## Stochastic Gradient Boosting 
## 
## 140 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 125, 125, 127, 127, 127, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  RMSE       Rsquared   MAE      
##   0.01       1                  0.7853273  0.5032062  0.6205900
##   0.01       3                  0.7390219  0.5323814  0.5788667
##   0.10       1                  0.6849711  0.5115510  0.5192905
##   0.10       3                  0.6340027  0.5845414  0.4886388
## 
## Tuning parameter 'n.trees' was held constant at a value of 100
## Tuning
##  parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
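Note that n.trees was held constant at 100 because seq(100, 500, 1000) evaluates to a single value. A grid that actually varies the number of trees might look like this (a hypothetical alternative, not the grid used above):

gbmGrid2 <- expand.grid(interaction.depth = c(1, 3),
                        n.trees = c(100, 500, 1000),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)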
gbm_pred <- predict(gbmModel, X_test)
postResample(gbm_pred, y_test)
##      RMSE  Rsquared       MAE 
## 0.5440288 0.7649182 0.4368690

The boosted trees actually outperform the random forest on the test set: a smaller RMSE (0.544 vs. 0.610) and a higher R-squared (0.765 vs. 0.745). Its predictions are closer to the actual values, and it explains more of the variance in the held-out yields than the random forest model.
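One caveat: the random forest was fit directly with randomForest while the GBM was tuned with 10-fold cross-validation, so the resampling comparison is not like-for-like. A minimal sketch of tuning the random forest under the same control (rfCaret is an illustrative name, not fit above):

# Tune mtry for the random forest with the same 10-fold CV used for the GBM
set.seed(200)
rfCaret <- train(X_train, y_train,
                 method = "rf",
                 ntree = 1000,
                 importance = TRUE,
                 trControl = ctrl)
rfCaret$results  # resampled RMSE/Rsquared by mtry, comparable to gbmModel's table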