8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
library(randomForest)
library(caret)
library(dplyr)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)%>%
  arrange(-Overall)

rfImp1
##          Overall
## V1   8.732235404
## V4   7.615118809
## V2   6.415369387
## V5   2.023524577
## V3   0.763591825
## V6   0.165111172
## V7  -0.005961659
## V10 -0.074944788
## V9  -0.095292651
## V8  -0.166362581

Did the random forest model significantly use the uninformative predictors (V6 – V10)? No. The importance scores for V6-V10 are near zero (some slightly negative), well below those of the informative predictors V1-V5.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change?

The importance score for V1 decreased when the duplicate was added, since the two correlated predictors now split the credit, but it is still more important than most of the other variables.

model2 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)%>%
  arrange(-Overall)

rfImp2
##                Overall
## V4          7.04752238
## V2          6.06896061
## V1          5.69119973
## duplicate1  4.28331581
## V5          1.87238438
## V3          0.62970218
## V6          0.13569065
## V10         0.02894814
## V9          0.00840438
## V7         -0.01345645
## V8         -0.04370565

What happens when you add another predictor that is also highly correlated with V1?

The importance score for V1 decreases again, as the signal is now shared across V1 and its two correlated copies.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)%>%
  arrange(-Overall)

rfImp3
##                Overall
## V4          7.04870917
## V2          6.52816504
## V1          4.91687329
## duplicate1  3.80068234
## V5          2.03115561
## duplicate2  1.87721959
## V3          0.58711552
## V6          0.14213148
## V7          0.10991985
## V10         0.09230576
## V9         -0.01075028
## V8         -0.08405687
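
To make the dilution effect explicit, V1's score from the three fits can be collected side by side (a quick sketch; the importance tables keep the predictor names as row names, as shown above):

data.frame(duplicates = 0:2,
           V1_importance = c(rfImp1["V1", "Overall"],
                             rfImp2["V1", "Overall"],
                             rfImp3["V1", "Overall"]))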

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The conditional inference forest shows a similar, though not identical, pattern to the traditional random forest: V4, V2, and the V1 group still rank at the top, while V3 and the uninformative predictors V6-V10 sit near zero.

party_model = party::cforest(y~., data=simulated)

data.frame("importance" = party::varimp(party_model))%>%
  arrange(-importance)
##             importance
## V4          7.16991462
## V2          5.81149273
## duplicate1  4.62957645
## V1          4.11910478
## V5          1.62054023
## duplicate2  1.08104915
## V3          0.01133118
## V6         -0.01818532
## V7         -0.02795336
## V10        -0.03454443
## V8         -0.03663807
## V9         -0.04360946
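
The modified importance from Strobl et al. (2007) can be requested through the conditional argument; it adjusts for correlation among predictors, so the duplicated copies of V1 would typically score lower than under the default measure (output omitted here, since conditional permutation importance is slow):

data.frame("importance" = party::varimp(party_model, conditional = TRUE))%>%
  arrange(-importance)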

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Cubist shows a similar pattern: V1, V2, V4, V5, and the first duplicate are used by the model, while V3, the second duplicate, and the uninformative predictors V6-V10 are not used at all.

cube_model = Cubist::cubist(x = simulated%>%select(-'y'), y = simulated$y)
varImp(cube_model, scale=FALSE)%>%arrange(-Overall)
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## duplicate1      50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0
## duplicate2       0
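
These scores reflect, roughly, how often each predictor appears in the rule conditions and in the rule-level linear models; the fitted rules themselves can be inspected directly (output omitted):

summary(cube_model)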

The GBM model follows broadly the same pattern: most of the relative influence goes to V4, V2, V5, and the V1 group (with the duplicates absorbing much of V1's influence), while V6-V10 contribute little or nothing.

gbm_model = gbm::gbm(y~., data = simulated, distribution = "gaussian")
gbm_model%>%summary()

##                   var    rel.inf
## V4                 V4 29.8563503
## V2                 V2 24.1467176
## duplicate1 duplicate1 15.2165605
## V5                 V5  9.6655870
## V3                 V3  8.0779617
## V1                 V1  6.3502504
## duplicate2 duplicate2  6.1045134
## V8                 V8  0.3338332
## V6                 V6  0.2482259
## V7                 V7  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

8.2. Use a simulation to show tree bias with different granularities.

# five predictors on (0, 1] with the same range but increasing granularity:
# 10, 100, 1,000, 10,000, and 100,000 possible distinct values
lower  = sample(10, 1000, replace = TRUE)/10
low    = sample(100, 1000, replace = TRUE)/100
mid    = sample(1000, 1000, replace = TRUE)/1000
high   = sample(10000, 1000, replace = TRUE)/10000
higher = sample(100000, 1000, replace = TRUE)/100000

# each predictor contributes equally to the response
y = lower + low + mid + high + higher

sample_df = data.frame(lower, low, mid, high, higher, y)
sample_df%>%head()
##   lower  low   mid   high  higher       y
## 1   0.6 0.91 0.735 0.4083 0.36563 3.01893
## 2   0.3 0.46 0.319 0.5948 0.28874 1.96254
## 3   0.5 0.03 0.957 0.3667 0.21795 2.07165
## 4   0.4 0.39 0.978 0.5062 0.11128 2.38548
## 5   0.5 0.85 0.639 0.7238 0.97019 3.68299
## 6   0.7 0.35 0.608 0.2734 0.88541 2.81681

In this random forest, where every predictor contributes equally to y, granularity has little effect on the permutation importance scores: all five predictors receive similar values. Selection bias toward the finer-grained predictors is usually easier to see in a single tree; see the sketch after the output below.

sample_model = randomForest(y ~ ., data = sample_df,
                      importance = TRUE,
                      ntree = 1000)
sample_model%>%varImp(scale=FALSE)%>%arrange(-Overall)
##          Overall
## mid    0.1273238
## lower  0.1243791
## higher 0.1175226
## high   0.1122531
## low    0.1115792
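
As a further check (a minimal sketch, assuming the rpart package is installed), a single CART tree can be fit to the same data; selection bias toward the finer-grained predictors is usually easier to see in a single tree than in a random forest:

library(rpart)

single_tree = rpart(y ~ ., data = sample_df)
# split-based importance for the single tree
single_tree$variable.importance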

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Because the model on the left has both a low learning rate and a low bagging fraction: each tree contributes only a small step toward the fit and is built on a different 10% subsample, so many different predictors get a chance to be selected and importance is spread out. The model on the right takes large steps on nearly the same data each iteration, so the first few strong predictors dominate the fit and absorb most of the importance.

(b) Which model do you think would be more predictive of other samples?

I think the left model will generalize better, even though each of its trees sees only 10% of the samples. The small learning rate and small bagging fraction act as regularization, while the aggressive settings on the right tend to overfit the training data.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would let each tree split on more predictors, spreading importance across a larger set of variables and flattening the slope of the importance plot for both models. The left model, with its low learning rate, is more likely to benefit from the added depth; for the right model, adding complexity on top of an already aggressive learning rate mainly increases the risk of overfitting.
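
For reference, a minimal sketch of how the two extreme settings in Fig. 8.24 could be refit, assuming the solubility data (objects solTrainXtrans and solTrainY from the AppliedPredictiveModeling package) and leaving the tree count and depth at illustrative defaults:

data(solubility, package = "AppliedPredictiveModeling")
sol_train = cbind(solTrainXtrans, Solubility = solTrainY)

gbm_low  = gbm::gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high = gbm::gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# per Fig. 8.24, relative influence should concentrate on far fewer predictors for gbm_high
head(summary(gbm_low, plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)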

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)
chem_manuf = ChemicalManufacturingProcess
knn = preProcess(chem_manuf, method = c('knnImpute'))
chem_imputed = predict(knn, chem_manuf)

set.seed(2022)
# random 70/30 split of the rows into training and test sets
index = sample(nrow(chem_imputed), round(nrow(chem_imputed) * .70))
chem_train = chem_imputed[index,]
chem_test = chem_imputed[-index,]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Random forest

random_forest = randomForest(Yield~., data = chem_train, importance = TRUE, ntree = 100)
RMSE(predict(random_forest,chem_test),chem_test$Yield)
## [1] 0.8389114

Gradient boosted trees

chem_gbm = gbm::gbm(Yield~., data = chem_train, n.trees = 25, distribution = 'gaussian')
RMSE(predict(chem_gbm,chem_test),chem_test$Yield)
## Using 25 trees...
## [1] 0.7689781

Gradient boosted trees give better test set performance than the random forest based on test set RMSE.
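
Since the question also asks about resampling performance, a hedged sketch of how cross-validated RMSE could be compared with caret::train (tuning grids left at their defaults) is:

ctrl = trainControl(method = "cv", number = 10)

set.seed(2022)
rf_cv  = train(Yield ~ ., data = chem_train, method = "rf", trControl = ctrl)
set.seed(2022)
gbm_cv = train(Yield ~ ., data = chem_train, method = "gbm",
               trControl = ctrl, verbose = FALSE)

summary(resamples(list(rf = rf_cv, gbm = gbm_cv)))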

(b) - Which predictors are most important in the optimal tree-based regression model? - Do either the biological or process variables dominate the list? - How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Manufacturing process variables dominate the list. The exact top predictors differ somewhat from the earlier models, but the ratio of manufacturing process to biological material variables among the top 10 is similar to the optimal linear and nonlinear models.

chem_gbm%>%summary()

##                                           var   rel.inf
## ManufacturingProcess32 ManufacturingProcess32 33.843683
## ManufacturingProcess13 ManufacturingProcess13 24.663755
## ManufacturingProcess09 ManufacturingProcess09 12.149901
## BiologicalMaterial12     BiologicalMaterial12  5.210949
## BiologicalMaterial11     BiologicalMaterial11  4.861641
## BiologicalMaterial06     BiologicalMaterial06  2.672616
## ManufacturingProcess06 ManufacturingProcess06  2.575845
## ManufacturingProcess21 ManufacturingProcess21  2.573006
## ManufacturingProcess11 ManufacturingProcess11  2.504523
## ManufacturingProcess31 ManufacturingProcess31  2.432286
## BiologicalMaterial03     BiologicalMaterial03  1.827177
## ManufacturingProcess28 ManufacturingProcess28  1.728321
## ManufacturingProcess36 ManufacturingProcess36  1.528567
## ManufacturingProcess04 ManufacturingProcess04  1.427731
## BiologicalMaterial01     BiologicalMaterial01  0.000000
## BiologicalMaterial02     BiologicalMaterial02  0.000000
## BiologicalMaterial04     BiologicalMaterial04  0.000000
## BiologicalMaterial05     BiologicalMaterial05  0.000000
## BiologicalMaterial07     BiologicalMaterial07  0.000000
## BiologicalMaterial08     BiologicalMaterial08  0.000000
## BiologicalMaterial09     BiologicalMaterial09  0.000000
## BiologicalMaterial10     BiologicalMaterial10  0.000000
## ManufacturingProcess01 ManufacturingProcess01  0.000000
## ManufacturingProcess02 ManufacturingProcess02  0.000000
## ManufacturingProcess03 ManufacturingProcess03  0.000000
## ManufacturingProcess05 ManufacturingProcess05  0.000000
## ManufacturingProcess07 ManufacturingProcess07  0.000000
## ManufacturingProcess08 ManufacturingProcess08  0.000000
## ManufacturingProcess10 ManufacturingProcess10  0.000000
## ManufacturingProcess12 ManufacturingProcess12  0.000000
## ManufacturingProcess14 ManufacturingProcess14  0.000000
## ManufacturingProcess15 ManufacturingProcess15  0.000000
## ManufacturingProcess16 ManufacturingProcess16  0.000000
## ManufacturingProcess17 ManufacturingProcess17  0.000000
## ManufacturingProcess18 ManufacturingProcess18  0.000000
## ManufacturingProcess19 ManufacturingProcess19  0.000000
## ManufacturingProcess20 ManufacturingProcess20  0.000000
## ManufacturingProcess22 ManufacturingProcess22  0.000000
## ManufacturingProcess23 ManufacturingProcess23  0.000000
## ManufacturingProcess24 ManufacturingProcess24  0.000000
## ManufacturingProcess25 ManufacturingProcess25  0.000000
## ManufacturingProcess26 ManufacturingProcess26  0.000000
## ManufacturingProcess27 ManufacturingProcess27  0.000000
## ManufacturingProcess29 ManufacturingProcess29  0.000000
## ManufacturingProcess30 ManufacturingProcess30  0.000000
## ManufacturingProcess33 ManufacturingProcess33  0.000000
## ManufacturingProcess34 ManufacturingProcess34  0.000000
## ManufacturingProcess35 ManufacturingProcess35  0.000000
## ManufacturingProcess37 ManufacturingProcess37  0.000000
## ManufacturingProcess38 ManufacturingProcess38  0.000000
## ManufacturingProcess39 ManufacturingProcess39  0.000000
## ManufacturingProcess40 ManufacturingProcess40  0.000000
## ManufacturingProcess41 ManufacturingProcess41  0.000000
## ManufacturingProcess42 ManufacturingProcess42  0.000000
## ManufacturingProcess43 ManufacturingProcess43  0.000000
## ManufacturingProcess44 ManufacturingProcess44  0.000000
## ManufacturingProcess45 ManufacturingProcess45  0.000000
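
A quick tally of predictor types among the ten most influential GBM predictors makes the dominance of the process variables explicit (a sketch, assuming dplyr is loaded):

gbm_imp = summary(chem_gbm, plotit = FALSE)
head(gbm_imp, 10)%>%
  mutate(type = ifelse(grepl("^Biological", var), "Biological", "Process"))%>%
  count(type)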

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Even in a single tree, the manufacturing process variables provide most of the separation in Yield: a few biological material predictors give useful splits, but the bulk of the structure comes from the process features. This is in line with what the other models have shown throughout these exercises.

library(rpart); library(rpart.plot)
rpart(Yield~., data = chem_imputed, maxdepth = 3)%>%rpart.plot()
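
An alternative view of the same tree (a sketch, assuming the partykit package is installed) plots boxplots of Yield in each terminal node, which shows the distribution of yield per node directly:

single_chem_tree = rpart(Yield~., data = chem_imputed, maxdepth = 3)
plot(partykit::as.party(single_chem_tree))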