HW 9 A

Question 8.1

A

  • The RF model correctly identifies that V6-V10 are not important
# Core packages for this section (gbm, rpart, party, and rattle are used later)
library(mlbench)
library(randomForest)
library(caret)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"




model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

B

  • The duplicate, highly correlated variable clearly takes away from V1's importance
    • At split time the model sometimes chooses the correlated duplicate instead of the original V1, so the importance is split between them
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)  # note: this re-checks duplicate1; duplicate2 is built the same way and is similarly correlated with V1
## [1] 0.9460206
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

C

  • All of the variable importance measures follow the same general trend, properly identifying V1-V5 as the informative predictors
# Re-generate the original simulated data (without the duplicate columns)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788
baggedTree <- party::cforest(y ~ ., data = simulated)

# Traditional (unconditional) vs conditional permutation importance
non_conditional <- party::varimp(baggedTree, conditional = FALSE)
conditional <- party::varimp(baggedTree, conditional = TRUE)

cbind(rfImp1, non_conditional, conditional)
##          Overall non_conditional  conditional
## V1   8.732235404    8.7468487424  5.361033209
## V2   6.415369387    6.6111386740  4.921941000
## V3   0.763591825    0.0251324481  0.020360250
## V4   7.615118809    8.3216311809  6.385159554
## V5   2.023524577    2.0053334352  1.326730935
## V6   0.165111172    0.0001915275  0.006724338
## V7  -0.005961659   -0.0017520711 -0.032085548
## V8  -0.166362581   -0.0178123815 -0.003736865
## V9  -0.095292651   -0.0290695242 -0.014184225
## V10 -0.074944788   -0.0454872070 -0.029568133
plot(conditional)

D

  • The Cubist model identifies V1-V5 but also gives some importance to V6
    • Overall it performs well, never picking V7-V10
  • The GBM model correctly identifies V1-V5
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

##     var    rel.inf
## V4   V4 31.5263562
## V1   V1 25.6582158
## V2   V2 22.6288332
## V5   V5 12.3172094
## V3   V3  7.6762241
## V9   V9  0.1931612
## V6   V6  0.0000000
## V7   V7  0.0000000
## V8   V8  0.0000000
## V10 V10  0.0000000

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 5)
set.seed(100)

gbmTune <- train(simulated[, 1:10], simulated$y, method = "gbm",
                 tuneGrid = gbmGrid, verbose = FALSE)
varImp(gbmTune)
## gbm variable importance
## 
##      Overall
## V4  100.0000
## V1   92.1317
## V2   86.8359
## V5   37.4263
## V3   31.9971
## V6    2.7878
## V7    2.3789
## V8    0.7181
## V9    0.1236
## V10   0.0000
cubistTuned <- train(simulated[, 1:10], simulated$y, method = "cubist", importance = TRUE)
varImp(cubistTuned)
## cubist variable importance
## 
##     Overall
## V1   100.00
## V2    75.69
## V4    68.06
## V3    58.33
## V5    55.56
## V6    15.28
## V9     0.00
## V8     0.00
## V10    0.00
## V7     0.00

8.2

  • Below I run several simulations.
  • The theme is that trees tend to favor predictors with many possible split points, i.e., more granular distributions.
    • I define a target that is a direct function of a predictor X1, add some white noise to the target, and create a second predictor X2 of pure, highly granular white noise (which should not be used in any split). Across the three simulations, the more granular X1 is, the higher its importance and the more cleanly the model ignores the pure-noise predictor.
    • The first simulation has the most granular X1; by the last simulation X1 takes on only 2 integer values plus the white noise in the target. Accordingly, X1's importance improperly drops from simulation 1 through 3.
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)  # defined for tuning but not used below



rpartTree <- rpart(y ~ ., data = simulated)

varImp(rpartTree)
##        Overall
## V1  1.61675805
## V10 0.52544906
## V2  2.15453588
## V3  0.79147160
## V4  3.02099652
## V5  2.45400266
## V6  0.54723044
## V7  0.51139268
## V8  0.09454648
## V9  0.20541040


# Simulation 1: X1 takes 10 distinct integer values (most granular)
set.seed(624)
X1 <- rep(11:20, each = 20)
Y <- X1 + rnorm(200, mean = 0, sd = 4)

set.seed(624)
X2 <- rnorm(200, mean = 0, sd = 2)  # pure noise, but highly granular
simData <- data.frame(Y = Y, X1 = X1, X2 = X2)
fit <- rpart(Y ~ ., data = simData)
varImp(fit)
##     Overall
## X1 3.543150
## X2 4.789933
# Simulation 2: X1 takes 5 distinct integer values
set.seed(624)
X1 <- rep(16:20, each = 40)
Y <- X1 + rnorm(200, mean = 0, sd = 4)
set.seed(624)
X2 <- rnorm(200, mean = 0, sd = 2)

simData <- data.frame(Y = Y, X1 = X1, X2 = X2)
fit2 <- rpart(Y ~ ., data = simData)
varImp(fit2)
##     Overall
## X1 3.803285
## X2 3.652489
# Simulation 3: X1 takes only 2 distinct integer values (least granular)
set.seed(624)
X1 <- rep(19:20, each = 100)
Y <- X1 + rnorm(200, mean = 0, sd = 4)
set.seed(624)
X2 <- rnorm(200, mean = 0, sd = 2)

simData <- data.frame(Y = Y, X1 = X1, X2 = X2)
fit3 <- rpart(Y ~ ., data = simData)
varImp(fit3)
##      Overall
## X1 0.8849928
## X2 4.7209459
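  • To make the granularity claim concrete, the number of distinct values of X1 (and hence the number of candidate split points) in each simulation can be counted directly; this is a small check of my own, not part of the assignment.
# Distinct X1 values available to split on in each simulation
sapply(list(sim1 = rep(11:20, each = 20),
            sim2 = rep(16:20, each = 40),
            sim3 = rep(19:20, each = 100)),
       function(x) length(unique(x)))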

8.3

A

  • The model on the right is built on samples containing 90% of the data and updates the residuals with a learning rate of 0.9. The effect of sampling such a large proportion of the data is that the weak learners share a majority of their observations. The high learning rate means larger steps toward the minimum, so the overall model needs fewer trees: if we assume an average of about 5 steps to the minimum, the model on the right reaches it in around 5 trees, whereas with the learning rate of 0.1 on the left we would need to build nearly 50 trees. The model on the left therefore has many more trees, each trained on different samples, so we would expect its predictor importance to be spread across more variables, while the model on the right concentrates importance in just a few. A rough sketch of this is shown below.
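  • As a rough, self-contained illustration (my own sketch, not the solubility example from the book's figure), gbm can be fit to simulated Friedman data under the two settings and the spread of relative influence compared. I simulate a larger set (n = 1000) so the 0.1 bagging fraction still leaves enough observations per tree.
library(gbm)
library(mlbench)

set.seed(200)
sim <- mlbench.friedman1(1000, sd = 1)
sim <- as.data.frame(cbind(sim$x, sim$y))
colnames(sim)[ncol(sim)] <- "y"

# Left-hand settings: learning rate and bagging fraction both 0.1
gbmLeft <- gbm(y ~ ., data = sim, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1)

# Right-hand settings: learning rate and bagging fraction both 0.9
gbmRight <- gbm(y ~ ., data = sim, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9)

# Expect a flatter relative-influence profile on the left, a steeper one on the right
summary(gbmLeft, plotit = FALSE)
summary(gbmRight, plotit = FALSE)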

B

  • The stochastic sampling in the model on the left should act much the way bagging does, reducing overall variance and making it better at predicting unseen samples. However, I don't believe the answer is as easy as it sounds: with a bagging fraction of 0.1 the model on the left may not fit the underlying population structure well enough, while the model on the right may capture it more easily. Hyperparameter tuning would be the best way to decide, with the working assumption that the model on the left is likely better for prediction. A quick holdout check is sketched below.
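  • Continuing that sketch, a quick check on freshly simulated holdout data gives a rough read on which setting predicts better (caret's RMSE helper is used; gbmLeft and gbmRight are the fits from the part A sketch).
set.seed(201)
testSim <- mlbench.friedman1(2000, sd = 1)
testSim <- as.data.frame(cbind(testSim$x, testSim$y))
colnames(testSim)[ncol(testSim)] <- "y"

# Test-set RMSE for the two settings fit above
RMSE(predict(gbmLeft, testSim, n.trees = 1000), testSim$y)
RMSE(predict(gbmRight, testSim, n.trees = 1000), testSim$y)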

C

  • Interaction depth is the depth of splits the model allows, and variable importance is driven by which predictors are used in those splits. Logically, the deeper each tree goes, the more predictors get chosen and contribute to the overall importance. So I would expect the importance levels to even out as interaction depth increases: the model on the right would start to look more like the model on the left, and the model on the left would show even more variables with non-zero importance. A sketch of this effect follows.
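  • A quick sketch of the depth effect (again my own, reusing the simulated data sim from the part A sketch): refit at several interaction depths and compare how spread out the relative influence becomes.
# Deeper trees should spread the relative influence over more predictors
for (depth in c(1, 5, 10)) {
  fit <- gbm(y ~ ., data = sim, distribution = "gaussian",
             n.trees = 500, shrinkage = 0.1, interaction.depth = depth)
  cat("interaction.depth =", depth, "\n")
  print(summary(fit, plotit = FALSE))
}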

8.7

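  • The completedData predictors and yield response come from the imputation and preprocessing of the chemical manufacturing data done in earlier homeworks; that chunk is not shown here. A minimal sketch of what it might have looked like (my assumption: the ChemicalManufacturingProcess data from AppliedPredictiveModeling, k-NN imputation via caret, plus some predictor filtering, since the tuning output below reports 36 predictors):
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)

# Separate the response from the predictors and impute the missing values
# (knnImpute also centers and scales the predictors)
yield <- ChemicalManufacturingProcess$Yield
predictors <- ChemicalManufacturingProcess[, -1]
imputeModel <- preProcess(predictors, method = "knnImpute")
completedData <- predict(imputeModel, predictors)

# Only 36 predictors appear in the tuning output below, so some filtering was
# evidently applied as well; dropping highly correlated columns is one possibility
highCorr <- findCorrelation(cor(completedData), cutoff = 0.90)
if (length(highCorr) > 0) completedData <- completedData[, -highCorr]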

GBM

  • The model identifies BiologicalMaterial03 as the most important predictor (scaled importance of 100)
  • All of the biological material predictors carry some importance, with BiologicalMaterial09, 11, and 05 each above 20
  • ManufacturingProcess17, 28, 06, and 36 are highly important, all with importance scores of roughly 50 or above
set.seed(100)
# Note: fitControl is defined here but never passed to train() below, so the
# default bootstrap resampling (25 reps) is what actually gets used
fitControl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           preProcOptions = list(thresh = 0.95))

gbmGrid <- expand.grid(interaction.depth = seq(4, 6, by = 1),
                       n.trees = seq(2000, 4000, by = 500),
                       shrinkage = 0.01,
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(completedData, yield, method = "gbm", tuneGrid = gbmGrid, verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting 
## 
## 176 samples
##  36 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 176, 176, 176, 176, 176, 176, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE      
##   4                  2000     1.247578  0.5502498  0.9505465
##   4                  2500     1.244986  0.5521889  0.9487557
##   4                  3000     1.243571  0.5532701  0.9478211
##   4                  3500     1.242632  0.5540278  0.9475109
##   4                  4000     1.241774  0.5547013  0.9471427
##   5                  2000     1.236111  0.5591455  0.9447760
##   5                  2500     1.233683  0.5607636  0.9434386
##   5                  3000     1.231927  0.5620530  0.9423792
##   5                  3500     1.230670  0.5629623  0.9416365
##   5                  4000     1.229745  0.5636486  0.9412084
##   6                  2000     1.237204  0.5582136  0.9438385
##   6                  2500     1.234706  0.5599288  0.9422316
##   6                  3000     1.233168  0.5610440  0.9412871
##   6                  3500     1.231818  0.5619718  0.9404083
##   6                  4000     1.231046  0.5626175  0.9399537
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.01
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 4000,
##  interaction.depth = 5, shrinkage = 0.01 and n.minobsinnode = 10.
varImp(gbmTune)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   70.56
## ManufacturingProcess28   53.58
## ManufacturingProcess06   51.16
## ManufacturingProcess36   46.89
## BiologicalMaterial09     41.96
## BiologicalMaterial11     34.49
## ManufacturingProcess39   23.31
## ManufacturingProcess33   22.31
## BiologicalMaterial05     21.90
## ManufacturingProcess20   19.40
## ManufacturingProcess27   17.58
## ManufacturingProcess11   16.90
## ManufacturingProcess30   16.57
## ManufacturingProcess43   15.25
## ManufacturingProcess05   15.15
## ManufacturingProcess01   15.12
## BiologicalMaterial10     14.30
## ManufacturingProcess24   13.48
## ManufacturingProcess16   12.62

Cubist

  • ManufacturingProcess17 is identified as the most important predictor (scaled importance of 100)
  • BiologicalMaterial03 is highly important, as are BiologicalMaterial11 and 09
  • ManufacturingProcess33, 39, and 02 are all selected with importance of 40 or more
set.seed(100)
cubistTuned <- train(completedData, yield, method = "cubist", importance = TRUE)
varImp(cubistTuned)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## ManufacturingProcess17     100
## BiologicalMaterial03        79
## ManufacturingProcess33      53
## ManufacturingProcess39      48
## ManufacturingProcess02      40
## BiologicalMaterial11        40
## ManufacturingProcess06      37
## ManufacturingProcess24      32
## ManufacturingProcess19      31
## BiologicalMaterial09        30
## ManufacturingProcess37      30
## ManufacturingProcess28      27
## ManufacturingProcess21      24
## ManufacturingProcess10      20
## ManufacturingProcess04      20
## ManufacturingProcess36      20
## ManufacturingProcess34      14
## ManufacturingProcess11      10
## ManufacturingProcess35       9
## BiologicalMaterial05         9

Random forest

  • Similar to the GBM and Cubist models, BiologicalMaterial03 is identified with a scaled importance of 100, and ManufacturingProcess17 is again highly important (over 75)
  • BiologicalMaterial11, 09, 05, and 10 all have importance over 20
  • ManufacturingProcess28, 36, and 39 all have importance of roughly 50 or more
set.seed(100)
# Random search over mtry with repeated 10-fold CV
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "random")
# Note: this grid is never passed to train() below, and ntree is not a tuning
# parameter that caret's method = "rf" accepts in any case
tunegrid <- expand.grid(mtry = c(1:15), .ntree = c(1000, 1500, 2000, 2500))

rf_model <- train(completedData, yield, method = "rf",
                  tuneLength = 3, trControl = control, importance = TRUE)
summary(rf_model)
##                 Length Class      Mode     
## call              5    -none-     call     
## type              1    -none-     character
## predicted       176    -none-     numeric  
## mse             500    -none-     numeric  
## rsq             500    -none-     numeric  
## oob.times       176    -none-     numeric  
## importance       72    -none-     numeric  
## importanceSD     36    -none-     numeric  
## localImportance   0    -none-     NULL     
## proximity         0    -none-     NULL     
## ntree             1    -none-     numeric  
## mtry              1    -none-     numeric  
## forest           11    -none-     list     
## coefs             0    -none-     NULL     
## y               176    -none-     numeric  
## test              0    -none-     NULL     
## inbag             0    -none-     NULL     
## xNames           36    -none-     character
## problemType       1    -none-     character
## tuneValue         1    data.frame list     
## obsLevels         1    -none-     logical  
## param             1    -none-     list
plot(rf_model)

varImp(rf_model)
## rf variable importance
## 
##   only 20 most important variables shown (out of 36)
## 
##                        Overall
## BiologicalMaterial03    100.00
## ManufacturingProcess17   75.28
## ManufacturingProcess28   66.20
## BiologicalMaterial11     55.39
## ManufacturingProcess39   54.32
## ManufacturingProcess36   53.94
## BiologicalMaterial09     50.68
## ManufacturingProcess06   48.35
## ManufacturingProcess30   42.17
## ManufacturingProcess27   41.34
## ManufacturingProcess11   38.77
## ManufacturingProcess33   38.73
## BiologicalMaterial05     37.60
## ManufacturingProcess02   35.26
## ManufacturingProcess24   32.15
## ManufacturingProcess20   31.61
## ManufacturingProcess01   31.38
## ManufacturingProcess43   31.12
## ManufacturingProcess12   30.96
## BiologicalMaterial10     29.19

Model comparison

  • GBM: best resampled RMSE of about 1.23
  • Cubist: resampled RMSE of about 1.32 for the setting shown below (committees = 20, neighbors = 5)
  • Random forest: resampled RMSE of about 1.17, the lowest of the three
gbmTune$results[14,]
##    shrinkage interaction.depth n.minobsinnode n.trees     RMSE  Rsquared
## 10      0.01                 5             10    4000 1.229745 0.5636486
##          MAE    RMSESD RsquaredSD      MAESD
## 10 0.9412084 0.1272344 0.06063059 0.08256482
cubistTuned$results[8,]
##   committees neighbors     RMSE  Rsquared      MAE    RMSESD RsquaredSD
## 8         20         5 1.318315 0.5036683 1.004599 0.1782006   0.102998
##      MAESD
## 8 0.131135
rf_model$results
##   mtry     RMSE  Rsquared       MAE    RMSESD RsquaredSD     MAESD
## 1    3 1.253362 0.6205473 1.0115401 0.2253348  0.1318919 0.1619490
## 2   10 1.180145 0.6436069 0.9458280 0.2272004  0.1314893 0.1644716
## 3   20 1.168974 0.6384828 0.9312232 0.2322809  0.1355641 0.1666012
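  • Rather than hard-coding the row indices above, the lowest resampled RMSE can be pulled from each fit directly (bestRMSE is a small helper of my own, not from the assignment).
# Lowest resampled RMSE across each model's tuning grid
bestRMSE <- function(fit) min(fit$results$RMSE)

c(gbm = bestRMSE(gbmTune),
  cubist = bestRMSE(cubistTuned),
  rf = bestRMSE(rf_model))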

B Looking at Random forest

  • I described, in each model's section above, which predictors it found important. The tree-based models all find BiologicalMaterial03 to be the top predictor, and ManufacturingProcess17 is identified as highly important in every model as well
    • Our nonlinear models also identified BiologicalMaterial03 as being very important
  • Our MARS model chose ManufacturingProcess17 and ManufacturingProcess28 as its only predictors, and both of those show up strongly in our random forest model; the SVM showed them as important as well
  • The radial SVM showed similar variable importance overall
  • It is difficult to compare train/test-split measurements, as I did not split off a holdout set in this homework (a possible setup is sketched below)
    • Overall the resampled RMSE seems lower for the RF and GBM models, and it was in fact lower when I ran XGBoost and RF with a holdout in last week's homework
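  • For completeness, a holdout comparison could be set up along these lines (a sketch only, using the same completedData and yield objects; I did not run this for the write-up).
set.seed(100)
inTrain <- createDataPartition(yield, p = 0.75, list = FALSE)

# Fit on the training portion and score the held-out rows
rfHold <- train(completedData[inTrain, ], yield[inTrain], method = "rf", importance = TRUE)
RMSE(predict(rfHold, completedData[-inTrain, ]), yield[-inTrain])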

C plot tree

  • The tree uses BiologicalMaterial03 for the top split. It then splits on BiologicalMaterial11 and 13, followed by splits on the manufacturing processes
    • I think this makes sense. We know from the past several homeworks that the biological material has a significant additive effect on yield, so partitioning on biological material first should give a strong gain in prediction, followed by the manufacturing processes.
library(rattle)
## Warning: package 'rattle' was built under R version 3.5.3
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## 
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
## 
##     importance
plot_df <- cbind(completedData, yield)
x <- rpart(yield ~ ., data = plot_df)
fancyRpartPlot(x)