Data 624 - HW 9

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.1.2

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.1.2

## randomForest 4.7-1

## Type rfNews() to see new features/changes/bug fixes.

library(caret)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

## Loading required package: lattice

model1 <- randomForest(y ~ ., data = simulated,
mportance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

rfImp1

##       Overall
## V1  1101.5125
## V2   918.2002
## V3   292.9121
## V4  1050.8320
## V5   489.9815
## V6   177.9841
## V7   199.3523
## V8   146.3390
## V9   156.1152
## V10  178.7002

The random forest model significantly did not use the uninformative predictors. The predictors V6 ~ V10 have very small importance, compare to other predictors such as V1, V2, V4, and V5. (b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9312879

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

m2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(m2, scale = FALSE)
rfImp2

##                Overall
## V1          5.62411205
## V2          6.18242103
## V3          0.66445442
## V4          6.69869172
## V5          2.01545359
## V6          0.16952216
## V7         -0.03423950
## V8         -0.11680402
## V9         -0.04139870
## V10         0.01472598
## duplicate1  3.91374379

Yes, it did. The importance for V1 is reduced, splitted between V1 and duplicated1.It happened because the trees in the random forest are just as likely to pick V1 for a split than to pick duplicated1, because there is a correlation. As a result, the total number of splits that originally belonged to V1 is “shared” with duplicated1.

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(data.table)

## Warning: package 'data.table' was built under R version 4.1.2

library(party)

## Warning: package 'party' was built under R version 4.1.3

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Warning: package 'strucchange' was built under R version 4.1.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.1.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 4.1.3

frstmodel <- cforest(y ~ ., data=simulated)

sort(varimp(frstmodel, conditional = FALSE), decreasing = TRUE)

##           V4           V2           V1   duplicate1           V5           V7 
##  7.469697868  6.286805180  6.102506073  3.339546539  1.697256639  0.016261856 
##           V3           V6           V9          V10           V8 
##  0.013701626  0.002575282 -0.008888066 -0.014897148 -0.045566285

sort(varimp(frstmodel, conditional = TRUE), decreasing = TRUE)

##           V4           V2           V1   duplicate1           V5           V3 
##  5.910739023  5.026287021  2.863024722  1.110339875  1.019342990  0.016134445 
##           V6           V9           V7          V10           V8 
##  0.011933896  0.004576673  0.003943168 -0.005913247 -0.008058880

After calculating conditionally the variable importance it shows that it takes into account correlation between variable V1 and duplicate1. It adjust importance score of these two variables accordingly. While the variable importance is calculated un-conditionally then it treats both highly correlated variables (V1 and duplicate1) with equal importance which can be mispresented.

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)

## Warning: package 'gbm' was built under R version 4.1.3

## Loaded gbm 2.1.8

gbmModel <- gbm(y ~ ., data = simulated, distribution = 'gaussian')   
gbmModel

## gbm(formula = y ~ ., distribution = "gaussian", data = simulated)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 11 predictors of which 7 had non-zero influence.

summary(gbmModel)

##                   var    rel.inf
## V4                 V4 29.6311732
## V2                 V2 23.9753767
## V1                 V1 23.3175633
## V5                 V5 10.1218535
## V3                 V3  8.4820347
## duplicate1 duplicate1  4.3161926
## V6                 V6  0.1558059
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

For Boosted Trees,the same patten occurs where V4 is still the highest and V6 thru V10 are still low.

library(Cubist)

## Warning: package 'Cubist' was built under R version 4.1.3

cmdl <- cubist(simulated[,-11], simulated$y)
cmdl

## 
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
## 
## Number of samples: 200 
## Number of predictors: 11 
## 
## Number of committees: 1 
## Number of rules: 1

summary(cmdl)

## 
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sun May 01 22:44:27 2022
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (12 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
## 
##  outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           2.181831
##     Relative |error|               0.54
##     Correlation coefficient        0.84
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    V1
##           100%    V2
##           100%    V4
##           100%    V5
## 
## 
## Time: 0.0 secs

It can be assumed that Cubist model is the only one which uses highly important variables.

V1, V2, V4 and V5 are the most important features that were used by Cubist model.

8.2. Use a simulation to show tree bias with different granularities.

set.seed(999)
x<- sample(0:10000 / 10000, 200, replace = T)
y<- sample(0:1000 / 1000, 200, replace = T)
w <- sample(0:100 / 100, 200, replace = T)
z <- sample(0:10 / 10, 200, replace = T)

p <- x + z + rnorm(200) 

df <- data.frame(x, y, w, z, p)
str(df)

## 'data.frame':    200 obs. of  5 variables:
##  $ x: num  0.911 0.544 0.62 0.673 0.241 ...
##  $ y: num  0.569 0.398 0.99 0.82 0.692 0.635 0.556 0.839 0.282 0.763 ...
##  $ w: num  0.99 0.23 0.31 0.79 0.13 0.96 0.68 0.47 0.68 0.22 ...
##  $ z: num  0.6 0.7 0.4 0.7 0.7 0.6 0.8 0.9 0.5 0.7 ...
##  $ p: num  2.09 1.16 1.6 1.36 2.09 ...

#using rpart for fitting

library(rpart)

tmodel <- rpart(y ~ ., data=df)
varImp(tmodel)

##     Overall
## p 0.5814526
## w 0.8588503
## x 0.5198500
## z 0.4592129

The tree uses x mostly to split and it uses z the least often. On the other hand it also uses y and w to split, while they are not used to produce the model. Also, y and w have more different values compared to z. That shows that there ia a selection bias in the model.

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudesof these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors? Increasing a learning rate is intended to slow down the adaptation of the model to the training data, lower adaptation results in fewer predictors. The model on the right has a higher learning rate and this is closer to optimal. There are more variables with lower weight because this model effectively predicts more about the data. It will take longer to generate such a model.Additionally, the model on the right was also trained with more data which suppose to reduce the importance of marginal variables. On the other hand, the model on the left is less optimized, and with a smaller data set. It does not have enough information to reduce variable importance in the same way as the model on the right.
Which model do you think would be more predictive of other samples? It would be expected that the performance of the model on the right to be better on analogous data sets due to the level of tuning. But it could likely also generate more sensitive differences in the predicted data. However, since this involves a trade off between bias-variance, I believe that using cross validation or a test set.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? The approach which should be taken to increase interaction depth affect the slope of predictor importance for either model, which would means that increasing interaction will provide more dense trees, what would mean more predictors will be taken into account, so varimp function will show allocation of importance across more variables. Therefore, increasing the depth reduce the slope of the importance plot

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.1.3

data(ChemicalManufacturingProcess)

(cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute')))

## Created from 152 samples and 57 variables
## 
## Pre-processing:
##   - bagged tree imputation (57)
##   - ignored (0)

cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])

set.seed(999)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
trainx <- cmp[trainRow, ]
trainy <- ChemicalManufacturingProcess$Yield[trainRow]
testx <- cmp[-trainRow, ]
testy <- ChemicalManufacturingProcess$Yield[-trainRow]

rpmodel <- train(x = trainx,
                    y = trainy,
                    method = "rpart",
                    tuneLength = 10,
                    control = rpart.control(maxdepth=2))

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

rpmodel

## CART 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.00771655  1.545751  0.3337652  1.230786
##   0.01452393  1.545751  0.3337652  1.230786
##   0.02597846  1.547909  0.3321186  1.233044
##   0.03316588  1.548980  0.3304596  1.232825
##   0.03583795  1.547007  0.3313855  1.230445
##   0.03633812  1.547007  0.3313855  1.230445
##   0.05190902  1.555853  0.3243507  1.240100
##   0.06399225  1.549841  0.3256107  1.234475
##   0.08270155  1.556993  0.3164081  1.235518
##   0.40695439  1.628733  0.3026192  1.308738
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01452393.

rmodel <- train(x = trainx,
                 y = trainy,
                 method = 'rf',
                 tuneLength = 10)
rmodel

## Random Forest 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.361659  0.4927028  1.0777609
##    8    1.274503  0.5375705  1.0033760
##   14    1.262533  0.5388912  0.9916761
##   20    1.258331  0.5380994  0.9897108
##   26    1.256970  0.5363426  0.9853907
##   32    1.257390  0.5343093  0.9834924
##   38    1.261075  0.5301549  0.9857780
##   44    1.265822  0.5254258  0.9892238
##   50    1.274982  0.5194139  0.9952460
##   57    1.278971  0.5156588  0.9989663
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.

Which tree-based regression model gives the optimal resampling and test set performance?

resamp <- resamples(list(SingleTree=rpmodel, RandomForest=rmodel))
summary(resamp)

## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: SingleTree, RandomForest 
## Number of resamples: 25 
## 
## MAE 
##                   Min.   1st Qu.    Median      Mean  3rd Qu.     Max. NA's
## SingleTree   1.0589102 1.1703070 1.2109348 1.2307862 1.264816 1.506939    0
## RandomForest 0.7779966 0.8766132 0.9768654 0.9853907 1.098811 1.241988    0
## 
## RMSE 
##                  Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## SingleTree   1.240733 1.457838 1.529595 1.545751 1.637663 1.871156    0
## RandomForest 1.009419 1.139510 1.226476 1.256970 1.405009 1.548129    0
## 
## Rsquared 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SingleTree   0.04207691 0.2758174 0.3378588 0.3337652 0.4020415 0.4716263    0
## RandomForest 0.37073833 0.4672912 0.5365569 0.5363426 0.5926409 0.7533917    0

Random’s forecast mean R^2 is higher than single tree.

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictorsfrom the optimal linear and nonlinear models?

varImp(rpmodel)

## rpart variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess13  100.00
## BiologicalMaterial03     69.25
## BiologicalMaterial06     67.96
## ManufacturingProcess32   63.24
## ManufacturingProcess36   39.39
## ManufacturingProcess06   35.81
## ManufacturingProcess17   34.45
## ManufacturingProcess11   30.80
## BiologicalMaterial11     30.37
## BiologicalMaterial12     30.15
## BiologicalMaterial02     28.26
## ManufacturingProcess30    0.00
## BiologicalMaterial10      0.00
## ManufacturingProcess27    0.00
## ManufacturingProcess26    0.00
## ManufacturingProcess16    0.00
## ManufacturingProcess18    0.00
## ManufacturingProcess08    0.00
## ManufacturingProcess33    0.00
## ManufacturingProcess12    0.00

varImp(rmodel)

## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial03    24.652
## ManufacturingProcess13  22.916
## BiologicalMaterial06    18.003
## ManufacturingProcess36  14.298
## BiologicalMaterial12    12.785
## ManufacturingProcess09  12.177
## ManufacturingProcess17  12.146
## ManufacturingProcess31  11.506
## ManufacturingProcess28   9.131
## BiologicalMaterial11     8.747
## ManufacturingProcess06   8.242
## BiologicalMaterial02     7.908
## BiologicalMaterial04     7.104
## BiologicalMaterial05     6.723
## ManufacturingProcess20   6.010
## ManufacturingProcess33   5.430
## BiologicalMaterial09     5.140
## ManufacturingProcess11   4.883
## ManufacturingProcess24   4.534

ManuFacturing materials are highest ranking but in top 10 they are equal.

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

plot(rpmodel$finalModel)
text(rpmodel$finalModel)

Yes, it does show that that ManuFacturing materials are in higher importance compare to biological. Higher value of ManufacturingProcess32 leads to higher yield, and vice versa. The assign values are the means of the groups splitted by ManufacturingProcess32 at 159.5, as demonstrated below:

Data 624 - HW 9

Dominika Markowska-Desvallons

5/1/2022