Question 8.1
A
- The RF model correctly identifies that V6-V10 are unimportant (their permutation importances are near zero)
library(mlbench)        # mlbench.friedman1 simulation
library(randomForest)
library(caret)          # varImp

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
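For reference (a minimal sketch, reusing the model1 fit above), caret's varImp for a random forest wraps randomForest::importance; calling it directly gives the same permutation-based scores:
# type = 1 is the %IncMSE permutation importance; scale = FALSE matches varImp above
randomForest::importance(model1, type = 1, scale = FALSE)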
B
- The correlated duplicates clearly dilute V1's variable importance
- Because the trees sometimes split on a near-duplicate instead of the original V1, the importance V1 carried alone in part A is now shared across V1, duplicate1, and duplicate2
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
# note: this re-checks duplicate1's correlation; duplicate2 is constructed the same way
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
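As a sanity check (a sketch against the rfImp1 object just printed), the importance of the V1 signal is split across its copies rather than destroyed; summing the three rows recovers a value comparable to V1's score in part A:
# total importance attributed to V1 and its two near-duplicates
sum(rfImp1[c("V1", "duplicate1", "duplicate2"), "Overall"])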
C
- All of the variable importance measures follow the same general trend, properly identifying V1-V5 as the informative predictors; the conditional cforest scores are smaller but still separate V1-V5 from the noise variables
# rebuild the original simulated data so the duplicate columns from part B are dropped
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
baggedTree <- party::cforest(y ~ ., data = simulated)
non_conditional <- party::varimp(baggedTree, conditional = FALSE)
conditional <- party::varimp(baggedTree, conditional = TRUE)
cbind(rfImp1, non_conditional, conditional)
## Overall non_conditional conditional
## V1 8.732235404 8.7468487424 5.361033209
## V2 6.415369387 6.6111386740 4.921941000
## V3 0.763591825 0.0251324481 0.020360250
## V4 7.615118809 8.3216311809 6.385159554
## V5 2.023524577 2.0053334352 1.326730935
## V6 0.165111172 0.0001915275 0.006724338
## V7 -0.005961659 -0.0017520711 -0.032085548
## V8 -0.166362581 -0.0178123815 -0.003736865
## V9 -0.095292651 -0.0290695242 -0.014184225
## V10 -0.074944788 -0.0454872070 -0.029568133
plot(conditional)

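To see whether the correlated-predictor pattern from part B recurs with cforest, a minimal sketch (reusing the duplicate construction from part B) would add a near-copy of V1, refit, and compare the two importance flavors:
# sketch: conditional importance should discount the near-duplicate of V1
# more aggressively than the traditional (unconditional) measure
simDup <- simulated
simDup$duplicate1 <- simDup$V1 + rnorm(200) * .1
cfDup <- party::cforest(y ~ ., data = simDup)
cbind(non_conditional = party::varimp(cfDup, conditional = FALSE),
      conditional = party::varimp(cfDup, conditional = TRUE))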
D
- The Cubist model identifies V1-V5 but also assigns some importance to V6
- Overall it does well at never picking V7-V10
- The GBM model correctly identifies V1-V5
library(gbm)
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

## var rel.inf
## V4 V4 31.5263562
## V1 V1 25.6582158
## V2 V2 22.6288332
## V5 V5 12.3172094
## V3 V3 7.6762241
## V9 V9 0.1931612
## V6 V6 0.0000000
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
# an alternative, permutation-based importance for the same fit:
# summary(gbmModel, method = permutation.test.gbm)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 5)
set.seed(100)
gbmTune <- train(simulated[, 1:10], simulated$y,
                 method = "gbm", tuneGrid = gbmGrid, verbose = FALSE)
varImp(gbmTune)
## gbm variable importance
##
## Overall
## V4 100.0000
## V1 92.1317
## V2 86.8359
## V5 37.4263
## V3 31.9971
## V6 2.7878
## V7 2.3789
## V8 0.7181
## V9 0.1236
## V10 0.0000
cubistTuned <- train(simulated[, 1:10], simulated$y, method = "cubist", importance = TRUE)
varImp(cubistTuned)
## cubist variable importance
##
## Overall
## V1 100.00
## V2 75.69
## V4 68.06
## V3 58.33
## V5 55.56
## V6 15.28
## V9 0.00
## V8 0.00
## V10 0.00
## V7 0.00
8.2
- Below I run several simulations.
- The theme is that trees tend to favor predictors with many possible split points, i.e., more granular distributions.
- The target is driven directly by a predictor X1 plus white noise, and a second predictor X2 is pure, highly granular white noise that should never be used in any split. Across the three simulations X1 becomes progressively less granular (see the distinct-value counts sketched below).
- The first simulation has the most granular X1 (10 distinct integers); by the last, X1 takes only 2 integer values plus the noise in Y. As X1 gets coarser, its importance improperly falls from simulation 1 through 3, while the uninformative but granular X2 stays competitive.
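A quick sketch of what granularity means here (the rep() calls mirror the three simulations below): CART evaluates one candidate split per distinct predictor value, so a predictor with more distinct values gets more chances to win a split by chance.
# distinct values of X1 in each of the three simulations below
sapply(list(sim1 = rep(11:20, each = 20),
            sim2 = rep(16:20, each = 40),
            sim3 = rep(19:20, each = 100)),
       function(x) length(unique(x)))
# sim1 sim2 sim3  ->  10, 5, 2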
library(rpart)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
rpartTree <- rpart(y ~ ., data = simulated)
varImp(rpartTree)
## Overall
## V1 1.61675805
## V10 0.52544906
## V2 2.15453588
## V3 0.79147160
## V4 3.02099652
## V5 2.45400266
## V6 0.54723044
## V7 0.51139268
## V8 0.09454648
## V9 0.20541040
# simulation 1: X1 takes 10 distinct integer values (11-20)
set.seed(624)
X1 <- rep(11:20, each = 20)
Y <- X1 + rnorm(200, mean = 0, sd = 4)
set.seed(624)
X2 <- rnorm(200, mean = 0, sd = 2)  # pure noise, ~200 distinct values
simData <- data.frame(Y = Y, X1 = X1, X2 = X2)
fit <- rpart(Y ~ ., data = simData)
varImp(fit)
## Overall
## X1 3.543150
## X2 4.789933
# simulation 2: X1 takes 5 distinct integer values (16-20)
set.seed(624)
X1 <- rep(16:20, each = 40)
Y <- X1 + rnorm(200, mean = 0, sd = 4)
set.seed(624)
X2 <- rnorm(200, mean = 0, sd = 2)
simData <- data.frame(Y = Y, X1 = X1, X2 = X2)
fit2 <- rpart(Y ~ ., data = simData)
varImp(fit2)
## Overall
## X1 3.803285
## X2 3.652489
# simulation 3: X1 takes only 2 distinct integer values (19-20)
set.seed(624)
X1 <- rep(19:20, each = 100)
Y <- X1 + rnorm(200, mean = 0, sd = 4)
set.seed(624)
X2 <- rnorm(200, mean = 0, sd = 2)
simData <- data.frame(Y = Y, X1 = X1, X2 = X2)
fit3 <- rpart(Y ~ ., data = simData)
varImp(fit3)
## Overall
## X1 0.8849928
## X2 4.7209459
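One standard remedy for this selection bias (a hedged sketch, reusing simData from the third simulation): conditional inference trees pick the split variable with a permutation test rather than raw impurity gain, which removes the advantage of having many split points.
# party::ctree selects split variables via significance tests, so the
# granular pure-noise X2 should no longer have an edge over the coarse X1
ctreeFit <- party::ctree(Y ~ ., data = simData)
plot(ctreeFit)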
8.7
GBM
- The model ranks BiologicalMaterial03 as the most important predictor (scaled importance of 100)
- All of the biological materials carry some importance, with 09, 11, and 05 each scoring above 20
- ManufacturingProcess17, 28, 06, and 36 are highly important, each with a score near 50 or above
set.seed(100)
# note: completedData and yield are the imputed predictors and response carried
# over from earlier homework; fitControl is defined here but the train() call
# below does not use it, so caret falls back to its default bootstrap resampling
fitControl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           preProcOptions = list(thresh = 0.95))
gbmGrid <- expand.grid(interaction.depth = seq(4, 6, by = 1),
                       n.trees = seq(2000, 4000, by = 500),
                       shrinkage = c(0.01),
                       n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(completedData, yield, method = "gbm",
                 tuneGrid = gbmGrid, verbose = FALSE)
gbmTune
## Stochastic Gradient Boosting
##
## 176 samples
## 36 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 176, 176, 176, 176, 176, 176, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 4 2000 1.247578 0.5502498 0.9505465
## 4 2500 1.244986 0.5521889 0.9487557
## 4 3000 1.243571 0.5532701 0.9478211
## 4 3500 1.242632 0.5540278 0.9475109
## 4 4000 1.241774 0.5547013 0.9471427
## 5 2000 1.236111 0.5591455 0.9447760
## 5 2500 1.233683 0.5607636 0.9434386
## 5 3000 1.231927 0.5620530 0.9423792
## 5 3500 1.230670 0.5629623 0.9416365
## 5 4000 1.229745 0.5636486 0.9412084
## 6 2000 1.237204 0.5582136 0.9438385
## 6 2500 1.234706 0.5599288 0.9422316
## 6 3000 1.233168 0.5610440 0.9412871
## 6 3500 1.231818 0.5619718 0.9404083
## 6 4000 1.231046 0.5626175 0.9399537
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.01
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 4000,
## interaction.depth = 5, shrinkage = 0.01 and n.minobsinnode = 10.
varImp(gbmTune)
## gbm variable importance
##
## only 20 most important variables shown (out of 36)
##
## Overall
## BiologicalMaterial03 100.00
## ManufacturingProcess17 70.56
## ManufacturingProcess28 53.58
## ManufacturingProcess06 51.16
## ManufacturingProcess36 46.89
## BiologicalMaterial09 41.96
## BiologicalMaterial11 34.49
## ManufacturingProcess39 23.31
## ManufacturingProcess33 22.31
## BiologicalMaterial05 21.90
## ManufacturingProcess20 19.40
## ManufacturingProcess27 17.58
## ManufacturingProcess11 16.90
## ManufacturingProcess30 16.57
## ManufacturingProcess43 15.25
## ManufacturingProcess05 15.15
## ManufacturingProcess01 15.12
## BiologicalMaterial10 14.30
## ManufacturingProcess24 13.48
## ManufacturingProcess16 12.62
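To make the biological-vs-process comparison concrete (a sketch against the varImp(gbmTune) object above; grouping on the "Biological" name prefix is an assumption about the column names):
# sum the scaled importances by predictor group
imp <- varImp(gbmTune)$importance
grp <- ifelse(grepl("^Biological", rownames(imp)), "Biological", "Process")
tapply(imp$Overall, grp, sum)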
Cubist
- ManufacturingProcess17 is the top predictor (scaled importance of 100)
- BiologicalMaterial03 is highly important, as are BiologicalMaterial11 and 09
- ManufacturingProcess33, 39, and 02 are all selected with importance of 40 or more
set.seed(100)
cubistTuned <- train(completedData, yield, method = "cubist", importance = TRUE)
varImp(cubistTuned)
## cubist variable importance
##
## only 20 most important variables shown (out of 36)
##
## Overall
## ManufacturingProcess17 100
## BiologicalMaterial03 79
## ManufacturingProcess33 53
## ManufacturingProcess39 48
## ManufacturingProcess02 40
## BiologicalMaterial11 40
## ManufacturingProcess06 37
## ManufacturingProcess24 32
## ManufacturingProcess19 31
## BiologicalMaterial09 30
## ManufacturingProcess37 30
## ManufacturingProcess28 27
## ManufacturingProcess21 24
## ManufacturingProcess10 20
## ManufacturingProcess04 20
## ManufacturingProcess36 20
## ManufacturingProcess34 14
## ManufacturingProcess11 10
## ManufacturingProcess35 9
## BiologicalMaterial05 9
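The rules behind these scores can be inspected directly (a sketch; cubistTuned$finalModel is the underlying Cubist fit, and its summary prints the rule conditions and per-rule linear models where ManufacturingProcess17 appears):
# print the fitted Cubist rules and their linear models
summary(cubistTuned$finalModel)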
Random forest
- As in the GBM and Cubist models, BiologicalMaterial03 tops the list (scaled importance of 100), and ManufacturingProcess17 is again highly important (over 75)
- BiologicalMaterial11, 09, 05, and 10 are all identified with importance above 25
- ManufacturingProcess28, 36, and 39 all land near an importance of 50 or above
set.seed(100)
# train model
# note: ntree is not a caret tuning parameter for method = "rf", so a grid over
# it cannot be passed via tuneGrid; tuneLength drives the mtry candidates instead
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "random")
rf_model <- train(completedData, yield, method = "rf",
                  tuneLength = 3, trControl = control, importance = TRUE)
summary(rf_model)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 176 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 176 -none- numeric
## importance 72 -none- numeric
## importanceSD 36 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 176 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 36 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 1 -none- list
plot(rf_model)

varImp(rf_model)
## rf variable importance
##
## only 20 most important variables shown (out of 36)
##
## Overall
## BiologicalMaterial03 100.00
## ManufacturingProcess17 75.28
## ManufacturingProcess28 66.20
## BiologicalMaterial11 55.39
## ManufacturingProcess39 54.32
## ManufacturingProcess36 53.94
## BiologicalMaterial09 50.68
## ManufacturingProcess06 48.35
## ManufacturingProcess30 42.17
## ManufacturingProcess27 41.34
## ManufacturingProcess11 38.77
## ManufacturingProcess33 38.73
## BiologicalMaterial05 37.60
## ManufacturingProcess02 35.26
## ManufacturingProcess24 32.15
## ManufacturingProcess20 31.61
## ManufacturingProcess01 31.38
## ManufacturingProcess43 31.12
## ManufacturingProcess12 30.96
## BiologicalMaterial10 29.19
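The same ranking can be visualized with the randomForest package's own plot (a sketch; rf_model$finalModel is the underlying randomForest object):
# dotchart of permutation (%IncMSE) and node-purity importance
randomForest::varImpPlot(rf_model$finalModel)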
Model comparison
- GBM RMSE of 1.23
- Cubist RMSE of 1.32
- Random forest RMSE of 1.17, the lowest of the three
gbmTune$results[14,]
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 10 0.01 5 10 4000 1.229745 0.5636486
## MAE RMSESD RsquaredSD MAESD
## 10 0.9412084 0.1272344 0.06063059 0.08256482
cubistTuned$results[8,]
## committees neighbors RMSE Rsquared MAE RMSESD RsquaredSD
## 8 20 5 1.318315 0.5036683 1.004599 0.1782006 0.102998
## MAESD
## 8 0.131135
rf_model$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 3 1.253362 0.6205473 1.0115401 0.2253348 0.1318919 0.1619490
## 2 10 1.180145 0.6436069 0.9458280 0.2272004 0.1314893 0.1644716
## 3 20 1.168974 0.6384828 0.9312232 0.2322809 0.1355641 0.1666012
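Since the GBM and Cubist fits both used caret's default 25-rep bootstrap, their resampling distributions can be compared head-to-head (a sketch; the random forest used repeated CV, so it would need refitting with a matching trControl to join this comparison):
# compare bootstrap resampling distributions of the two models
res <- resamples(list(GBM = gbmTune, Cubist = cubistTuned))
summary(res)
bwplot(res, metric = "RMSE")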
B Looking at Random forest
- I described what each model found important in its own section above. All of the tree-based models rank BiologicalMaterial03 at or near the top, and ManufacturingProcess17 is highly important across models as well
- Our nonlinear models also identified BiologicalMaterial03 as very important
- Our MARS model chose ManufacturingProcess17 and ManufacturingProcess28 as its only predictors, and both show up strongly in the random forest here; the SVM flagged them as important too
- The radial SVM showed similar variable importance overall
- It is difficult to compare train/test measurements, since I did not split off a holdout set in this homework
- Overall the RMSE appears lower for the random forest and GBM models, and it was in fact lower when I ran XGBoost and random forest with a holdout in last week's homework
C plot tree
- The tree splits on BiologicalMaterial03 at the top. It then splits on BiologicalMaterial11 and 13, followed by splits on the manufacturing processes
- I think this makes sense: we know from the past several homeworks that the biological materials have a significant additive effect on yield, so splitting on biological material first should give a strong gain in prediction, followed by the manufacturing processes
library(rattle)
## Warning: package 'rattle' was built under R version 3.5.3
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
plot_df <- cbind(completedData, yield)
x <- rpart(yield ~ ., data = plot_df)  # fit on the combined frame so yield is found
fancyRpartPlot(x)

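The split structure can also be read in text form (a sketch against the x fit above):
# text dump of the tree's splits and node predictions
print(x)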