8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

The importance scores for V6 through V10 are all close to zero (some slightly negative), so the random forest did not make meaningful use of these uninformative predictors.
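One way to check this more formally is to compare each unscaled permutation importance with its between-tree standard error stored in the fitted forest; the scores for V6 through V10 are within noise of zero. This is a minimal sketch using the model1 object above (imp_mean, imp_se, and z are just illustrative names):

# Sketch: compare each unscaled permutation importance with its standard
# error across trees (importanceSD); V6-V10 are within noise of zero
imp_mean <- importance(model1, type = 1, scale = FALSE)[, 1]
imp_se   <- model1$importanceSD
round(cbind(importance = imp_mean, z = imp_mean / imp_se), 2)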

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., 
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

After adding the new predictor, the importance of V1 decreased (from roughly 8.7 to 5.7). Because duplicate1 is highly correlated with V1, the splits that V1 alone would have received are now shared between V1 and duplicate1, which dilutes both importance scores.
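To answer the second part of the question, the sketch below adds a second predictor that is also highly correlated with V1 and refits the forest; the importance of V1 should be diluted further because the splits are now shared among three correlated predictors. The copy simulated2 and the names duplicate2 and model2b are hypothetical, and the copy is used so the later models in this exercise are unaffected:

# Sketch: add a second predictor highly correlated with V1 (to a copy of the
# data, so the models fit later in this exercise are unchanged) and refit
simulated2 <- simulated
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate2, simulated2$V1)

model2b <- randomForest(y ~ .,
                        data = simulated2,
                        importance = TRUE,
                        ntree = 1000)
varImp(model2b, scale = FALSE)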

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
model3 <- cforest(y ~ ., data = simulated)
cf_simulated <- varimp(model3)
cf_simulated
##            V1            V2            V3            V4            V5 
##  4.6171158805  6.0579730772  0.0003116115  7.6223892727  1.7161194047 
##            V6            V7            V8            V9           V10 
## -0.0289427183  0.0465374951 -0.0380965511  0.0046062409 -0.0310326410 
##    duplicate1 
##  5.0941897280
plot(cf_simulated)

Based on the results for model3 and the plot, duplicate1 again picks up a sizeable share of the importance and the uninformative predictors V6 through V10 stay near zero, so the cforest importances show the same general pattern as the traditional random forest model.
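The exercise also points to the conditional argument of varimp, which adjusts the importances for correlations among predictors (Strobl et al. 2007). A sketch of that call is below; it is noticeably slower than the unconditional version, and under it V1 and duplicate1 would be expected to receive lower, less inflated scores:

# Sketch: conditional permutation importance (Strobl et al. 2007);
# this can take much longer to run than the unconditional version
set.seed(200)
cf_conditional <- varimp(model3, conditional = TRUE)
cf_conditional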

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Tree Model

library(gbm)
## Loaded gbm 2.1.8
set.seed(200)
gbm_model <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbm_model)

##                   var    rel.inf
## V4                 V4 30.9675963
## V2                 V2 22.0642239
## V1                 V1 14.7119711
## duplicate1 duplicate1 13.7237851
## V5                 V5 10.4092910
## V3                 V3  7.7070181
## V6                 V6  0.2890616
## V8                 V8  0.1270530
## V7                 V7  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

The ordering of the predictors changes in the boosted tree model: V4 is now the most important variable, followed by V2, V1, and duplicate1. V6 through V10 still have very low relative influence and are effectively unused, and V3 again ranks well below the other informative predictors, consistent with the earlier models.
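As a quick robustness check, the sketch below refits the boosted model with more trees, a smaller shrinkage, and a deeper interaction depth (these settings and the name gbm_model2 are arbitrary illustrations, not tuned values) to see whether the same ranking holds:

# Sketch: refit with more trees, smaller shrinkage, and deeper trees
set.seed(200)
gbm_model2 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 1000, shrinkage = 0.01, interaction.depth = 3)
summary(gbm_model2, plotit = FALSE)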

Cubist Model

library(Cubist)
set.seed(200)
cubist_model <- cubist(simulated[, -11], simulated[, 11])
varImp(cubist_model)
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## duplicate1      50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0

Like the other models, Cubist relies on V1, V2, V4, V5, and the correlated duplicate1, while V3 is not used at all. V6 through V10 have zero importance and contribute nothing.
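The fitted rules themselves can also be inspected, which shows where each predictor appears in the rule conditions and in the linear model attached to each rule. A quick sketch using the object above (the usage component is how varImp derives its scores for Cubist):

# Sketch: inspect the fitted rules and how often each predictor is used
summary(cubist_model)   # prints the rules and per-rule linear models
cubist_model$usage      # percentage usage in conditions vs. models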

8.2. Use a simulation to show tree bias with different granularities.

library(rpart)
set.seed(200)
a1 <- sample(1:10, 100, replace = TRUE)       # 10 distinct integer values
a2 <- rnorm(100)                              # continuous, ~100 distinct values
a3 <- rnorm(100, mean = 0, sd = 4)            # continuous, larger spread
a4 <- sample(1:10 / 10, 100, replace = TRUE)  # 10 distinct values on a finer scale
b <- a1 + rnorm(100, mean = 0, sd = .5)       # response driven by a1

df <- data.frame(a1, a2, a3, a4, b)
new_df <- rpart(b ~ ., data = df)
varImp(new_df)
##      Overall
## a1 3.2849079
## a2 0.3554106
## a3 0.3784541
## a4 0.3295456

In this simulation a1 was given low granularity (only ten distinct integer values) and is the variable that actually drives the response b, and the tree assigns it a much higher importance than the other predictors. To isolate the selection bias itself, the sketch below fits stumps to a pure-noise response and shows that rpart tends to prefer the predictor with more distinct values.
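A minimal sketch of that bias (x_coarse, x_fine, and y_noise are hypothetical variables; the response is unrelated to both predictors):

# Sketch: with a pure-noise response, an unbiased procedure would pick either
# predictor about equally often for the first split, but rpart tends to favour
# the predictor with more distinct values (x_fine)
set.seed(300)
first_split <- replicate(200, {
  x_coarse <- sample(1:2, 100, replace = TRUE)  # 2 distinct values
  x_fine <- rnorm(100)                          # ~100 distinct values
  y_noise <- rnorm(100)                         # unrelated to both predictors
  fit <- rpart(y_noise ~ x_coarse + x_fine,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])                # variable used at the root
})
table(first_split)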

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Figure 8.24: variable importance plots for boosting on the solubility data, with bagging fraction and learning rate both set to 0.1 (left) and both set to 0.9 (right).

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The learning rate and bagging fraction both shape how the ensemble is built. A high learning rate means each tree takes a large step toward the minimum, so a few early trees carry most of the signal, and a high bagging fraction means each tree sees nearly the same data. The right-hand model uses large values (0.9) for both, so its importance concentrates on the first few predictors. The left-hand model, with a low learning rate and a low bagging fraction (0.1), accumulates many small contributions fit to different random subsamples, which spreads the importance across more predictors.
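To see the effect directly, the sketch below fits gbm to the solubility data with the two extreme settings and compares the top relative influences (solTrain, gbm_low, and gbm_high are illustrative names, and n.trees = 100 is an arbitrary choice, not the book's value):

# Sketch: boosting on the solubility data with the two extreme settings
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)   # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# importance should concentrate on fewer predictors in the high-parameter fit
head(summary(gbm_low, plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)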

(b) Which model do you think would be more predictive of other samples?

The right-hand model is more likely to overfit: with both parameters set to 0.9 it relies on only a handful of predictors and effectively drops the rest. The left-hand model, which spreads importance across more predictors, should be more predictive of other samples.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth allows each tree to grow deeper, so more predictors get used in splits and importance is distributed over a larger set of variables. For either model in Fig. 8.24 this would flatten (reduce) the slope of the predictor importance profile.
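A sketch of that effect, reusing the solTrain data frame built in the sketch above (the depth values and the names gbm_shallow and gbm_deep are arbitrary illustrations):

# Sketch: same settings except interaction depth 1 vs. 10;
# deeper trees spread the relative influence over more predictors
set.seed(100)
gbm_shallow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.1, bag.fraction = 0.5,
                   interaction.depth = 1)
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.5,
                interaction.depth = 10)

head(summary(gbm_shallow, plotit = FALSE), 10)
head(summary(gbm_deep, plotit = FALSE), 10)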

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2020 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(missForest)
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
library(kableExtra)

Use the missmap function to visualize the missing values

data(ChemicalManufacturingProcess)
missmap(ChemicalManufacturingProcess, col = c("red", "lightgreen"))

Use the missForest function to impute the missing values

Original_df <- ChemicalManufacturingProcess
Imputed_df <- missForest(Original_df)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
df <- Imputed_df$ximp

Split the data into training and test sets

data <- df[, 2:58]   # predictors
target <- df[, 1]    # Yield
train <- createDataPartition(target, p = 0.75)
train_pred <- data[train$Resample1, ]
train_target <- target[train$Resample1]
test_pred <- data[-train$Resample1, ]
test_target <- target[-train$Resample1]
# 10-fold CV control; note the models below are not passed this object via
# trControl, so they use train()'s default bootstrap resampling
control <- trainControl(method = "cv", number = 10)

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Regression Tree Model

set.seed(1)
rt_Model <- train(x = train_pred,
                  y = train_target,
                  method = "rpart2",
                  tuneLength = 10)
rt_Model
## CART 
## 
## 132 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##    1        1.549204  0.2971345  1.212403
##    2        1.554881  0.3060266  1.223812
##    3        1.543639  0.3195709  1.219891
##    4        1.543254  0.3214862  1.212584
##    5        1.541306  0.3309388  1.199296
##    6        1.572511  0.3169023  1.216852
##    7        1.573681  0.3236931  1.208675
##    8        1.582067  0.3227783  1.219734
##    9        1.582248  0.3223702  1.218682
##   10        1.585317  0.3225424  1.218959
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 5.
rt.Predict <- predict(rt_Model, newdata = test_pred)
# Evaluate test set performance with postResample
postResample(pred = rt.Predict, obs = test_target)
##     RMSE Rsquared      MAE 
## 1.549507 0.410666 1.170624
rtAccuracy <- postResample(pred = rt.Predict, obs = test_target)

Random Forest Model

set.seed(1)
rf_Model <- train(x = train_pred,
                  y = train_target,
                  method = "rf",
                  tuneLength = 10)
rf_Model
## Random Forest 
## 
## 132 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.362282  0.5061519  1.0653492
##    8    1.287270  0.5366514  0.9976655
##   14    1.271212  0.5381600  0.9808907
##   20    1.265893  0.5347276  0.9755612
##   26    1.264688  0.5316547  0.9714118
##   32    1.272871  0.5208577  0.9777948
##   38    1.271139  0.5196967  0.9736169
##   44    1.274348  0.5153325  0.9766654
##   50    1.282146  0.5094456  0.9821527
##   57    1.289583  0.5032978  0.9840231
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
rf.Predict <- predict(rf_Model, newdata = test_pred)
# Evaluate test set performance with postResample
postResample(pred = rf.Predict, obs = test_target)
##      RMSE  Rsquared       MAE 
## 1.1166471 0.7615895 0.8424019
rfAccuracy <- postResample(pred = rf.Predict, obs = test_target)

Cubist Model

set.seed(1)
cubist_Model <- train(x = train_pred,
                  y = train_target,
                  method = "cubist",
                  tuneLength = 10)
cubist_Model
## Cubist 
## 
## 132 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          2.020186  0.2635118  1.4458780
##    1          5          1.986481  0.2788826  1.4094603
##    1          9          1.996060  0.2735182  1.4217784
##   10          0          1.334883  0.4804142  1.0114764
##   10          5          1.312066  0.4973527  0.9776325
##   10          9          1.322466  0.4896068  0.9914823
##   20          0          1.251818  0.5296744  0.9561366
##   20          5          1.226397  0.5489871  0.9208038
##   20          9          1.237730  0.5401037  0.9355957
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
cubist.Predict <- predict(cubist_Model, newdata = test_pred)
# Evaluate test set performance with postResample
postResample(pred = cubist.Predict, obs = test_target)
##      RMSE  Rsquared       MAE 
## 0.9226027 0.8018563 0.7530944
cubistAccuracy <- postResample(pred = cubist.Predict, obs = test_target)

Compare Models

accuracies <- rbind(rtAccuracy, rfAccuracy, cubistAccuracy)
rownames(accuracies) <- c("REGRESSION TREE", "RANDOM FOREST", "CUBIST")
accuracies %>%
  kable() %>%
  kable_styling()
                     RMSE   Rsquared        MAE
REGRESSION TREE 1.5495073  0.4106660  1.1706239
RANDOM FOREST   1.1166471  0.7615895  0.8424019
CUBIST          0.9226027  0.8018563  0.7530944

Based on the results above, Cubist is the best model: it has the lowest test set RMSE and MAE and the highest R-squared of the three tree-based models.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Important Predictors

plot(varImp(cubist_Model), top=10)

ManufacturingProcess32 and ManufacturingProcess28, followed by ManufacturingProcess17, are the top three most important predictors. The top 10 contains 8 manufacturing process variables and 2 biological material variables, so the process variables dominate the list. Compared with the optimal linear (lasso) and nonlinear (SVM) models from the earlier exercises, the Cubist model places even more emphasis on the manufacturing process predictors.
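The split between process and biological variables in the top 10 can be checked programmatically. A quick sketch using the fitted cubist_Model (cubist_imp and top10 are illustrative names; it assumes the ManufacturingProcess/BiologicalMaterial naming convention of this data set):

# Sketch: count how many of the top-10 predictors are manufacturing process
# variables versus biological material variables
cubist_imp <- varImp(cubist_Model)$importance
top10 <- head(rownames(cubist_imp)[order(-cubist_imp$Overall)], 10)
top10
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))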

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
plot(as.party(rt_Model$finalModel),gp=gpar(fontsize=8))

Yes. The single-tree view shows which predictors define the splits and how the distribution of yield shifts across the terminal nodes, so it reveals which combinations of process and biological variable values are associated with higher or lower yields, something the importance scores alone do not show.
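To make the terminal-node yields concrete, a sketch below uses the node assignments that rpart stores for the training samples and summarises the yield within each terminal node (this reuses rt_Model and train_target from above; tree_nodes is an illustrative name):

# Sketch: yield distribution within each terminal node of the single tree,
# using the training-sample node assignments stored in the rpart fit
tree_nodes <- rt_Model$finalModel$where
tapply(train_target, tree_nodes, summary)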