8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)   # provides mlbench.friedman1()
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No, the random forest did not make significant use of the uninformative predictors: V6–V10 all have importance scores near zero, while the model concentrates on V1–V5.

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                            importance = TRUE,
                            ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

Yes, the importance score for V1 dropped. V1 and duplicate1 each receive lower importance than V1's original score, which makes sense: because they carry nearly the same information, the importance that V1 alone held is now split between them. Adding a second predictor that is also highly correlated with V1 would dilute V1's importance even further (see the sketch after the output below).

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
model1 <- randomForest(y ~ ., data = simulated,
                            importance = TRUE,
                            ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##                Overall
## V1          6.29780744
## V2          6.08038134
## V3          0.58410718
## V4          6.93924427
## V5          2.03104094
## V6          0.07947642
## V7         -0.02566414
## V8         -0.11007435
## V9         -0.08839463
## V10        -0.00715093
## duplicate1  3.56411581
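
To address the second part of the question explicitly, another highly correlated copy of V1 can be added and the forest refit. A minimal sketch (duplicate2 is a name introduced here; results not shown):

library(randomForest)  # already loaded above
set.seed(200)
# Add a second predictor highly correlated with V1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
varImp(model2, scale = FALSE)
# Expectation: the importance originally carried by V1 is now split across
# V1, duplicate1, and duplicate2, lowering each individual score further.
# Drop duplicate2 again so the later fits match the output shown below.
simulated$duplicate2 <- NULL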

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The conditional-forest importances show a broadly similar pattern, with V1, V2, V4, and V5 the most important, but they differ in that V3 drops to essentially zero. The cforest measure emphasizes the strongest predictors and de-emphasizes the weaker and correlated ones, effectively pointing toward a simpler model.

library(party)
model1 <- cforest(y ~ ., data = simulated)
rfImp1 <- varImp(model1)
rfImp1
##                 Overall
## V1          6.571632689
## V2          6.110131511
## V3          0.010450332
## V4          7.485796796
## V5          1.889117623
## V6         -0.006751589
## V7          0.007081381
## V8         -0.036022685
## V9          0.008976088
## V10         0.005356254
## duplicate1  2.722380892
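
The modified importance measure described in Strobl et al. (2007) can be requested through the conditional argument of party's varimp function; a sketch (output not shown, and the conditional version can be slow):

# Conditional permutation importance adjusts for correlated predictors such as
# V1 and duplicate1, so their scores typically shrink further than shown above.
party::varimp(model1, conditional = TRUE)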

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Both models follow a similar pattern, identifying the same predictors as the most influential. The boosted tree ranks V4, V2, V1, and the duplicate highest, while Cubist assigns the same usage score of 50 to its top four predictors (V1, V2, V4, V5) rather than ranking them, and gives no credit to V3 or duplicate1.

library(gbm)
## Loaded gbm 2.1.8
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

##                   var    rel.inf
## V4                 V4 31.2071181
## V2                 V2 22.4595980
## V1                 V1 14.1867814
## duplicate1 duplicate1 12.7266608
## V5                 V5 10.5518956
## V3                 V3  8.2074258
## V6                 V6  0.5274133
## V7                 V7  0.1331071
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000
library(Cubist)

# Column 11 of `simulated` is the response y; all other columns (including duplicate1) are predictors.
cube <- cubist(simulated[, -11], simulated[, 11])
c_imp <- varImp(cube)
c_imp
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0
## duplicate1       0
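
Cubist's importance is based on how often each predictor appears in rule conditions and in the rule-level linear models. Whether the flat 50/50 pattern persists with a larger ensemble can be checked by adding committees; a sketch with an arbitrary committee count (results not shown):

# Refit with 100 committees; varImp() averages predictor usage across committees.
cube100 <- cubist(simulated[, -11], simulated[, 11], committees = 100)
varImp(cube100)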

8.2. Use a simulation to show tree bias with different granularities.

In this run V3–V5 receive the highest importance scores, and these predictors have moderate granularity; importance falls off as the granularity becomes extreme at either end. A more direct demonstration of the underlying bias follows the output below.

library(rpart)
set.seed(10101)
# Predictors with decreasing granularity: V1 can take ~10^6 distinct values, V6 only 11.
V1 <- sample(0:1000000 / 1000000, 100, replace = TRUE)
V2 <- sample(0:100000 / 100000, 100, replace = TRUE)
V3 <- sample(0:10000 / 10000, 100, replace = TRUE)
V4 <- sample(0:1000 / 1000, 100, replace = TRUE)
V5 <- sample(0:100 / 100, 100, replace = TRUE)
V6 <- sample(0:10 / 10, 100, replace = TRUE)

# y depends on V1, V2, V5, and V6 plus noise (100 observations to match the predictors)
y <- V1 + V2 + V6 + V5 + rnorm(100, mean = 0, sd = 20.5)

df <- data.frame(V1, V2, V3, V4, V5, V6, y)

r <- rpart(y ~., data=df)
varImp(r)
##        Overall
## V1 0.007390939
## V2 0.002166609
## V3 0.012655076
## V4 0.030256099
## V5 0.014603885
## V6 0.009416652
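
A more direct way to isolate the granularity bias is to make the predictors carry no information at all, so any preference must come from the number of distinct values alone, and then count which variable the tree chooses for its first split. A minimal sketch along those lines (names are illustrative):

library(rpart)  # already loaded above
set.seed(2020)
firstSplit <- replicate(200, {
  n <- 100
  x_coarse <- sample(1:10 / 10, n, replace = TRUE)        # 10 distinct values
  x_fine   <- sample(1:10000 / 10000, n, replace = TRUE)  # ~10,000 distinct values
  y_noise  <- rnorm(n)                                    # pure noise: neither predictor is informative
  fit <- rpart(y_noise ~ x_coarse + x_fine,
               data = data.frame(x_coarse, x_fine, y_noise),
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])
})
table(firstSplit)  # the high-granularity predictor tends to be chosen far more often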

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right has both the bagging fraction and the learning rate set to 0.9. With a high learning rate, each tree's contribution is added with a large weight, so the first few trees, and the predictors they split on, dominate the final fit. With a high bagging fraction, nearly all of the data is used at every iteration, so successive trees see very similar samples and keep reusing the same strong predictors. Together these concentrate the importance on just a few predictors and make the model prone to overfitting, whereas the low settings on the left add randomness and spread the importance across many more predictors.
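
The effect can be illustrated directly on the solubility data; a minimal sketch, assuming the solTrainXtrans and solTrainY objects from the AppliedPredictiveModeling package and using n.trees = 100 rather than the figure's exact setup:

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)

# Low shrinkage and low bagging fraction (left-hand plot in Fig. 8.24)
gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# High shrinkage and high bagging fraction (right-hand plot)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbmLow,  plotit = FALSE), 10)   # influence spread across many predictors
head(summary(gbmHigh, plotit = FALSE), 10)   # influence concentrated on a few predictors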

(b) Which model do you think would be more predictive of other samples?

The model on the left would likely be more predictive of other samples: its small learning rate and bagging fraction slow the learning and inject randomness, so it is much less prone to overfitting than the model on the right.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would flatten the slope of the predictor importance plot for either model, making it look more like the left-hand profile. Deeper trees split on more predictors, allowing additional variables to contribute, although the extra complexity can also increase the risk of overfitting.

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# Note: median imputation (and a correlation filter) is also applied inside each
# train() call below via preProcess = c("corr", "medianImpute").
chem <- preProcess(ChemicalManufacturingProcess, method = c("medianImpute"))

set.seed(6354)
partition <- createDataPartition(ChemicalManufacturingProcess[, 1], p = 0.75, list = FALSE)
training   <- ChemicalManufacturingProcess[partition, -1]
y_training <- ChemicalManufacturingProcess[partition, 1]
testing    <- ChemicalManufacturingProcess[-partition, -1]
y_testing  <- ChemicalManufacturingProcess[-partition, 1]

grid <- expand.grid(n.trees=c(50, 100), 
                    interaction.depth=c(1, 5, 10), 
                    shrinkage=c(0.01, 0.1), 
                    n.minobsinnode=c(5, 10))

r <- train(x=training, y= y_training, method="rpart", preProcess=c( "corr",  "medianImpute"), tuneLength=10)
## Warning in cor(x[, !(colnames(x) %in% c(method$ignore, method$remove)), : the
## standard deviation is zero
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
rf <- train(x=training, y= y_training, method="rf", preProcess=c( "corr",  "medianImpute"), tuneLength=10,verbose=FALSE)
## Warning in cor(x[, !(colnames(x) %in% c(method$ignore, method$remove)), : the
## standard deviation is zero
gbm<-train(x = training,y = y_training, method = 'gbm', preProcess=c( "corr",  "medianImpute"),tuneGrid = grid, verbose = FALSE)

rPred <- predict(r, newdata = testing)
rfPred <- predict(rf, newdata = testing)
gbmPred <- predict(gbm, newdata = testing)

postResample(pred = rPred, obs = y_testing)
##     RMSE Rsquared      MAE 
## 2.136208       NA 1.703349
postResample(pred = rfPred, obs = y_testing)
##     RMSE Rsquared      MAE 
## 2.095278       NA 1.690864
postResample(pred = gbmPred, obs = y_testing)
##     RMSE Rsquared      MAE 
## 2.239082       NA 1.833380
varImp(rf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13  34.132
## BiologicalMaterial12    27.319
## BiologicalMaterial03    19.035
## BiologicalMaterial06    16.731
## ManufacturingProcess17  15.675
## ManufacturingProcess09  10.543
## ManufacturingProcess06  10.453
## BiologicalMaterial11     9.587
## ManufacturingProcess36   7.775
## ManufacturingProcess31   5.599
## ManufacturingProcess05   4.920
## BiologicalMaterial04     4.808
## BiologicalMaterial02     4.781
## BiologicalMaterial09     4.458
## BiologicalMaterial05     4.307
## ManufacturingProcess11   4.215
## ManufacturingProcess01   3.777
## ManufacturingProcess04   3.656
## ManufacturingProcess15   3.631

(a) Which tree-based regression model gives the optimal resampling and test set performance?

The random forest gave the lowest RMSE and the lowest MAE on the test set, making it the best of the three tree-based models.
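
Resampling performance can also be compared directly from the three train objects; a minimal sketch, assuming the fits r, rf, and gbm from above (for a strict comparison the three models would ideally share one trainControl with fixed resampling seeds):

# Collect the default bootstrap resampling results and summarize RMSE, R-squared, and MAE
resamps <- resamples(list(rpart = r, rf = rf, gbm = gbm))
summary(resamps)
dotplot(resamps, metric = "RMSE")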

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Neither the biological nor the process variables completely dominate the list, which differs from some of the earlier linear and nonlinear models, whose top predictors were mostly of one type (usually the manufacturing process variables). ManufacturingProcess32 is far and away the most important predictor for the random forest model, with a mix of process variables (13, 17, 09, 06, 36) and biological materials (12, 03, 06, 11) rounding out the top 10.
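
For a visual comparison against the top-10 lists from the earlier linear and nonlinear models, the random forest importances can be plotted straight from the caret fit; a small sketch:

# Plot the 10 highest-ranked predictors from the random forest model
plot(varImp(rf), top = 10)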

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Using the rpart.plot package makes it easier to see how the variables relate through the single tree. Once again ManufacturingProcess32 is clearly the most important predictor, forming the first split, and it combines strongly with BiologicalMaterial06 further down the tree. Throughout the exercises on this data set, these have consistently been among the most influential variables.

library("rpart.plot")
model = rpart(y_training ~ ., data = training)
rpart.plot(model)