8.1
A.
library(mlbench)
library(randomForest)
library(caret)
set.seed(34)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
The random forest model assigned the lowest importance to the uninformative predictors and correctly identified them as the least informative. V1-V5 received the highest importance scores.
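As a quick check on that ranking, the scores can be sorted from largest to smallest (varImp() on a randomForest fit returns a data frame with an Overall column, as the later code assumes):
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]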
B.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9522047
model2 <- randomForest(y~.,data=simulated, importance = TRUE,ntree = 1000)
rfImp2 <- varImp(model2,scale=FALSE)
rfImp2
rfImp1$Overall - rfImp2$Overall[c(1:10)]
##  [1]  1.81444682  0.36371951  0.19863027  0.41740669 -0.09017871  0.07638838
##  [7] -0.03785006  0.02020768 -0.01461263 -0.01496501
With duplicate1 included, the random forest model still ranked V6-V10 as the least important predictors. The scores for V1 and V2 decreased the most, V3 and V4 dropped slightly, V5 was essentially unchanged, and the importance of V7, V9, and V10 increased very slightly.
In short, the inclusion of duplicate1 led to less reliance on V1-V4, with the importance previously credited to V1 now split between V1 and duplicate1.
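Adding a second highly correlated copy of V1 should dilute its importance even further; a minimal sketch, run on a copy of the data so the later parts are unaffected (sim2, duplicate2, and model2b are my own names):
sim2 <- simulated
sim2$duplicate2 <- sim2$V1 + rnorm(200) * .1
model2b <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)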
C.
library(party)
set.seed(34)
model3 <- cforest(y~.,data=simulated)
rfImp3 <- varimp(model3)
rfImp3
##          V1          V2          V3          V4          V5          V6
##  3.29619544  5.68140016  0.29022230 10.59336091  1.93889836  0.01792445
##          V7          V8          V9         V10  duplicate1
## -0.02943824  0.03869927 -0.03467306 -0.05347390  2.21814946
rfImp4 <- varimp(model3,conditional=TRUE)
rfImp4
##           V1           V2           V3           V4           V5           V6
##  1.583121472  3.555937565  0.243850602  7.448150562  1.339510438 -0.018823510
##           V7           V8           V9          V10   duplicate1
## -0.034318511 -0.017020530 -0.005379022 -0.010609322  1.103798987
Both the standard and conditional variable importance tables show a preference for V1-V5 (plus duplicate1) over the uninformative predictors, though the unconditional approach assigns larger importance values to every variable.
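Since both varimp() calls return named numeric vectors, the two measures can be compared side by side in one table:
data.frame(unconditional = rfImp3, conditional = rfImp4)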
D.
library(Cubist)
model4 <- cubist(simulated[,c(1:10,12)],simulated$y,committees=100)
varImp(model4)
The Cubist model also weighed V1-V5 overwhelmingly higher than V6-V10 and duplicate1. Despite being nearly identical to V1, duplicate1 received very little weight.
Given the complexity of the Cubist algorithm it is difficult to say exactly why duplicate1 scored so low, but it appears that V1 was chosen as the more predictive of the pair, so the redundant duplicate1 was largely filtered out.
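One way to dig a little deeper is the fitted model's usage table, which reports how often each predictor appears in rule conditions and in the rule-level linear models (caret's varImp() summarizes this same information for Cubist fits):
model4$usage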
library(gbm)
## Loaded gbm 2.1.8.1
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)
The same pattern holds for the gbm boosted-trees model: V1-V5 are weighted higher than the other predictors, although V4 is the highest-weighted variable here, whereas V1 was the most important variable in the Cubist model.
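For the relative-influence numbers behind that summary without drawing the bar chart, summary.gbm accepts a plotit argument:
summary(gbmModel, plotit = FALSE)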
8.2
I will create a data frame of five predictors and a response y defined as a polynomial combination of them. Two of the predictors (V4 and V5) take only a few distinct values (non-granular), while the other three (V1, V2, V3) are highly granular.
dummyframe <- data.frame(V1 = sample(0:10000/10000,1000,replace=TRUE),
V2 = sample(0:1000/1000,1000,replace=TRUE),
V3 = sample(0:100000/100000,1000,replace=TRUE),
V4 = sample(0:1/1,1000,replace=TRUE),
V5 = sample(0:10/10,1000,replace=TRUE))
dummyframe$y <- 100*dummyframe$V5 + dummyframe$V4^2 + dummyframe$V3 + dummyframe$V2^3 + 25*dummyframe$V1
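The granularity difference can be made explicit by counting the distinct values each predictor takes:
sapply(dummyframe[, 1:5], function(x) length(unique(x)))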
set.seed(34)
dummymodel <- cubist(dummyframe[,1:5],dummyframe$y,committees=100)
varImp(dummymodel)
The Cubist model does not appear to fall victim to selection bias, as it did not rank the more granular predictors above the less granular ones. Next, I will use the rpart package for comparison.
library(rpart)
set.seed(34)
rpartmodel <- rpart(y~.,dummyframe)
varImp(rpartmodel)
The rpart model disagreed with the Cubist model in some respects, but it also did not rank the more granular variables above the less granular ones, and it correctly identified the two largest contributors to the response y.
Last, instead of defining y from the predictors, I will make y purely random to see whether selection bias appears.
For this demonstration I will use cforest rather than Cubist, since Cubist filters out unused variables, which would hide the effect we are looking for.
dummyframe2 <- data.frame(V1 = sample(0:10000/10000,1000,replace=TRUE),
V2 = sample(0:1000/1000,1000,replace=TRUE),
V3 = sample(0:100000/100000,1000,replace=TRUE),
V4 = sample(0:1/1,1000,replace=TRUE),
V5 = sample(0:10/10,1000,replace=TRUE),
y = sample(0:1000/1000,1000,replace=TRUE))
set.seed(34)
dummymodel2 <- cforest(y~.,dummyframe2)
varImp(dummymodel2)
The cforest model did not find any variable particularly important, and no conclusive evidence of selection bias was obtained.
Below is a Cubist model run on the dummy data frame with the random response, where V1, V2, and V3 are the granular predictors and V4 and V5 the less granular ones.
dummyframe2 <- data.frame(V1 = sample(0:10000/10000,1000,replace=TRUE),
V2 = sample(0:1000/1000,1000,replace=TRUE),
V3 = sample(0:100000/100000,1000,replace=TRUE),
V4 = sample(0:1/1,1000,replace=TRUE),
V5 = sample(0:10/10,1000,replace=TRUE),
y = sample(0:1000/1000,1000,replace=TRUE))
set.seed(34)
dummymodel2 <- cubist(dummyframe2[,1:5],dummyframe2$y,committees=100)
varImp(dummymodel2)
As in the Loh and Shih excerpt quoted in the textbook, fitting to a purely random response produces a model with essentially no structure at all.
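To connect this back to single trees, where selection bias is most visible, a quick sketch (rpartNoise is my own name) grows an unpruned rpart tree on the same noise data and checks which predictors it splits on; a biased splitting criterion would favor the highly granular predictors:
rpartNoise <- rpart(y ~ ., data = dummyframe2, control = rpart.control(cp = 0))
rpartNoise$variable.importance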
8.3
A.
The model on the right concentrates its importance on far fewer predictors.
The learning rate of 0.9 means each tree's contribution is added almost in full, so the ensemble quickly commits to the strongest predictors and leaves little residual signal for the weaker ones.
The bagging fraction of 0.9 is also much higher on the right than on the left: with most of the data used in every iteration, the same dominant predictors are selected again and again rather than weaker predictors occasionally getting a chance. NumCarbon is far more influential in the right-hand model than in the left-hand one.
In short, the high learning rate and high bagging fraction together concentrate the importance on a handful of predictors.
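The text's figure uses the solubility data, which is not loaded here, so as a rough stand-in the sketch below refits boosted trees on a larger Friedman simulation (n = 1000, so a 0.1 bagging fraction still leaves enough rows per tree) with the two extreme parameter combinations from the figure (0.1 and 0.9); sim_big, gbm_left, and gbm_right are my own names, and the exact influence values will differ from the solubility results.
set.seed(34)
sim_big <- mlbench.friedman1(1000, sd = 1)
sim_big <- as.data.frame(cbind(sim_big$x, sim_big$y))
colnames(sim_big)[ncol(sim_big)] <- "y"
# left-hand settings: learning rate 0.1, bagging fraction 0.1
gbm_left <- gbm(y ~ ., data = sim_big, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# right-hand settings: learning rate 0.9, bagging fraction 0.9
gbm_right <- gbm(y ~ ., data = sim_big, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_left, plotit = FALSE)   # influence spread across more predictors
summary(gbm_right, plotit = FALSE)  # influence concentrated on a few predictors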
B.
At first glance, one might expect the model on the right to predict new samples better, since it has limited the potential noise contributed by less important predictors.
With that said, the 0.9 learning rate and 0.9 bagging fraction are likely to overfit the training sample. Further, NumCarbon dominates every other predictor in the right-hand model, which leads me to believe that the aggressively tuned model will not perform well on a blind test or on evaluation data; I would expect the model on the left to generalize better.
C.
Increasing the interaction depth allows deeper, more complex trees, which would likely improve performance on the training data further, but the model would struggle when presented with an evaluation data set because it would be even more prone to overfitting the training sample.
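A minimal check of that idea, reusing the sim_big data from the sketch above with the right-hand settings but deeper trees (interaction.depth = 10); the result is only suggestive, since these data differ from the solubility set:
gbm_deep <- gbm(y ~ ., data = sim_big, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
summary(gbm_deep, plotit = FALSE)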
8.7
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
#summary(ChemicalManufacturingProcess)
gaps <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
ChemicalManufacturingProcess_cleaned <- predict(gaps, ChemicalManufacturingProcess)
#summary(ChemicalManufacturingProcess_cleaned)
cmp_c <- ChemicalManufacturingProcess_cleaned
6.3 Models:
set.seed(34)
train_idx <- sample(c(TRUE,FALSE), nrow(cmp_c),
replace=TRUE, prob=c(0.7,0.3))
train_set <- cmp_c[train_idx,]
test_set <- cmp_c[!train_idx,]
train_yield <- data.frame(cmp_c[train_idx,1])
test_yield <- data.frame(cmp_c[!train_idx,1])
colnames(train_yield) <- c('Yield')
colnames(test_yield) <- c('Yield')
# note: caret's argument is tuneLength (capital L); as written, `tunelength` is
# ignored and the default grid of three components is evaluated below
fit <- train(Yield ~ ., train_set, method = "pls", tunelength = 25, trControl = trainControl(method = "cv"))
plot(fit)
fit
## Partial Least Squares
##
## 115 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 104, 103, 103, 103, 104, 104, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.649402 0.2094404 1.350919
## 2 1.620935 0.2558763 1.313781
## 3 1.644062 0.2310858 1.334369
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
A.
Single Tree
set.seed(34)
rpartTune <- train(train_set[, -c(1)], train_set$Yield, method = "rpart2", tuneLength = 10, trControl = trainControl(method = "cv"))
## note: only 8 possible values of the max tree depth from the initial fit.
## Truncating the grid to 8 .
rpartTune_predictions <- predict(rpartTune,test_set[,-c(1)])
postResample(rpartTune_predictions, test_yield$Yield)
##      RMSE  Rsquared       MAE
## 1.6436107 0.3376015 1.2866580
I could not get the RWeka package to function, so I skipped the model trees step.
cforest also failed with an "invalid 'x' type in 'x && y'" error, so it is omitted as well.
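For completeness, here is a sketch of the cforest call I would have tried via party's formula interface with unbiased controls; it is untested here and may hit the same error depending on package versions (cforestTune is my own name):
cforestTune <- party::cforest(Yield ~ ., data = train_set,
                              controls = party::cforest_unbiased(ntree = 100))
postResample(as.numeric(predict(cforestTune, newdata = test_set)), test_yield$Yield)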
Cubist
set.seed(34)
cubistTune <- cubist(train_set[,-c(1)],train_set$Yield,committees=100)
varImp(cubistTune)
cubistTune_predictions <- predict(cubistTune, test_set[, -c(1)])
postResample(cubistTune_predictions, test_yield$Yield)
##      RMSE  Rsquared       MAE
## 1.1119040 0.7148358 0.9291487
Finally, a model has produced an R^2 (roughly 0.71 on the test set) that indicates a clearly better-than-random fit.
GBM
Last, I will check the GBM Model from above.
set.seed(34)
gbmTune <- gbm(Yield~.,data=train_set, distribution = "gaussian")
gbmTune_predictions <- predict(gbmTune, test_set)
## Using 100 trees...
postResample(gbmTune_predictions, test_yield$Yield)
##      RMSE  Rsquared       MAE
## 1.3541800 0.5542933 1.0843371
The GBM model yielded a better R^2 than any of the models in Assignment 7 or 8, but it is not as strong as the R^2 of the Cubist model.
B.
library(dplyr)
library(ggplot2)
v_imp <- varImp(cubistTune)
v_imp <- v_imp %>%
  arrange(desc(Overall))
v_imp_10 <- head(v_imp, 10)
v_imp_10
v_imp_10$Process <- rownames(v_imp_10)
rownames(v_imp_10) <- c(1:10)
#v_imp_10 <- data.frame(v_imp_10$Overall[c(1:10)])
ggplot(v_imp_10, aes(y = Process, x = Overall)) +
  geom_col()
The optimal model, the Cubist model, ranked seven manufacturing process variables and three biological material variables among the ten most important predictors.
The single most important predictor is ManufacturingProcess32, which was also the most important variable in the nonlinear model, and seven of the top ten here were likewise ranked in the nonlinear model's top ten.
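A quick programmatic check of the seven-to-three split between process and biological predictors in the top ten (relying on the ChemicalManufacturingProcess column-name prefixes):
table(grepl("^Manufacturing", v_imp_10$Process))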
C.
library(partykit)
plot(as.party(rpartTune$finalModel))
The resulting plot shows that ManufacturingProcess32 drives the first split and is the single most influential variable, with the subsequent splits feeding into one another below it.
Each edge is labeled with the split value that determines which branch an observation follows, and the terminal nodes show the distribution of Yield for the observations that fall into them.
Visualizing the tree in this way is very helpful for understanding the model conceptually.
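As a final sanity check on the dominant predictor, a simple scatterplot of ManufacturingProcess32 against Yield in the training set (base graphics, nothing model-specific):
plot(train_set$ManufacturingProcess32, train_set$Yield,
     xlab = "ManufacturingProcess32", ylab = "Yield")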