Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
## 8.1 Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
# Simulate the Friedman 1 benchmark data: y depends on V1-V5; V6-V10 are noise
my_simu <- mlbench.friedman1(200, sd = 1)
my_simu <- cbind(my_simu$x, my_simu$y)
my_simu <- as.data.frame(my_simu)
colnames(my_simu)[ncol(my_simu)] <- "y"
library(randomForest)
library(caret)
library(dplyr)    # for %>% and mutate(), used in the importance plots below
library(ggplot2)
# Random forest with permutation-based variable importance
model1 <- randomForest(y ~ ., data = my_simu,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6-V10)? The graph below shows that it did not: V6-V10 receive very little importance.
rfImp1 %>%
  mutate(var = rownames(rfImp1)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'blue') +
  labs(title = 'Variable Importance', y = 'Variable')
# Add a predictor that is highly correlated with V1
my_simu$duplicate1 <- my_simu$V1 + rnorm(200) * .1
cor(my_simu$duplicate1, my_simu$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1? As we observe below, V1's importance drops to roughly fourth position, and it now sits next to its duplicate: the two correlated predictors split the importance that V1 alone previously held.
model2 <- randomForest(y ~ ., data = my_simu,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 %>%
  mutate(var = rownames(rfImp2)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'blue') +
  labs(title = 'Variable Importance', y = 'Variable')
Next, use the cforest function in the partykit package to fit a random forest of conditional inference trees and compute the conditional variable importance:
library(partykit)
model3 <- cforest(y ~ ., data = my_simu)
rfImp3 <- varimp(model3, conditional = TRUE) %>% as.data.frame()
rfImp3 %>%
  rename(Overall = '.') %>%
  mutate(var = rownames(rfImp3)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'blue') +
  labs(title = 'Variable Importance', y = 'Variable')
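For comparison, a quick hedged sketch of the unconditional importance from the same cforest fit (the conditional version above adjusts for correlated predictors such as duplicate1, so the two rankings may differ):
imp_uncond <- varimp(model3, conditional = FALSE)
barplot(sort(imp_uncond), horiz = TRUE, las = 1,
        main = 'cforest Variable Importance (Unconditional)')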
Finally, repeat the process with a Cubist model to see whether the same pattern occurs:
library(Cubist)
model4 <- cubist(my_simu[, colnames(my_simu)[colnames(my_simu) != 'y']],
my_simu$y)
rfImp4 <- varImp(model4, scale = FALSE)
rfImp4 %>%
  mutate(var = rownames(rfImp4)) %>%
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'blue') +
  labs(title = 'Variable Importance', y = 'Variable')
## 8.2 Use a simulation to show tree bias with different granularities.
set.seed(88)
V1 <- runif(500, 2, 500)    # uninformative predictor
V2 <- rnorm(500, 2, 10)     # informative, moderate variance
V3 <- rnorm(500, 1, 1000)   # informative, very large variance
y <- V2 + V3
df <- data.frame(V1, V2, V3, y)
test_model <- cforest(y ~ ., data = df, ntree = 10)
test_model_imp <- varimp(test_model, conditional = FALSE)
barplot(sort(test_model_imp), horiz = TRUE, main = 'Un-Conditional', col = rainbow(5))
As the plot above shows, the random forest assigns almost all of the importance to V3. This is expected: y was generated as y <- V2 + V3, and V3 has by far the largest variance, while the uninformative V1 receives essentially none.
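To target granularity directly, here is a minimal hedged sketch (x_coarse and x_fine are illustrative names): even when the response is pure noise, a single regression tree tends to pick the predictor with more distinct values for its first split.
library(rpart)
set.seed(123)
first_split <- replicate(200, {
  d <- data.frame(x_coarse = sample(1:4, 500, replace = TRUE),  # 4 distinct values
                  x_fine = rnorm(500),                          # ~500 distinct values
                  y = rnorm(500))                               # y unrelated to both
  # cp = -1 with maxdepth = 1 forces exactly one split
  fit <- rpart(y ~ ., data = d, control = rpart.control(maxdepth = 1, cp = -1))
  as.character(fit$frame$var[1])
})
table(first_split)  # x_fine is typically chosen far more often than x_coarse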
## 8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors? With the bagging fraction at 0.9, nearly the whole training set is used at every iteration, and with a learning rate of 0.9 each tree's contribution is barely shrunk, so the same few strong predictors are selected over and over. Importance therefore concentrates on those few variables, while the 0.1/0.1 model samples small, varied subsets and learns slowly, spreading importance across many more predictors.
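To see this empirically, here is a minimal hedged sketch on the solubility data (data(solubility) from AppliedPredictiveModeling supplies solTrainXtrans and solTrainY); it approximates the two extremes with 100 trees and default settings otherwise, rather than reproducing Fig. 8.24 exactly:
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
sol <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
# Left-hand extreme: small learning rate and bagging fraction
gbm_left <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand extreme: large learning rate and bagging fraction
gbm_right <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# Relative influence should be spread out on the left, concentrated on the right
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)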
Which model do you think would be more predictive of other samples? The model with the larger learning rate and bagging fraction is more likely to overfit, since it leans heavily on just a few predictors. The left model (0.1/0.1) therefore has a better chance of being predictive of other samples.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? As the interaction depth increases, the deeper trees involve more predictors, so importance is spread across a larger set of variables; the slope of the predictor-importance plot would therefore flatten for both models.
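A similarly hedged sketch of the depth effect, reusing sol and library(gbm) from the sketch above: compare how many predictors receive any relative influence at depth 1 versus depth 10.
set.seed(100)
gbm_d1 <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, interaction.depth = 1)
gbm_d10 <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, interaction.depth = 10)
# Deeper trees touch more predictors, spreading (flattening) the importance profile
sum(summary(gbm_d1, plotit = FALSE)$rel.inf > 0)
sum(summary(gbm_d10, plotit = FALSE)$rel.inf > 0)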
## 8.7 Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Impute the missing values using bagImpute
cmp_impute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
# Replace
cmp <- predict(cmp_impute, ChemicalManufacturingProcess[,-c(1)])
# Splitting the data into training and test datasets
set.seed(480)
train_r <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- cmp[train_r,]
y_train <- ChemicalManufacturingProcess$Yield[train_r]
X_test <- cmp[-train_r,]
y_test <- ChemicalManufacturingProcess$Yield[-train_r]
GRADIENT BOOSTING
set.seed(44)
grid <- expand.grid(n.trees=c(50, 100, 150, 200),
interaction.depth=c(1, 5, 10, 15),
shrinkage=c(0.01, 0.1, 0.5),
n.minobsinnode=c(5, 10, 15))
model_gbm1 <- train(x = X_train,y = y_train, method = 'gbm',tuneGrid = grid, verbose = FALSE)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
model_gbm1$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 76 200 10 0.1 5
RANDOM FOREST
set.seed(77)
model_rf3 <- train(X_train, y_train, method = 'rf', tuneLength = 10)
model_rf3
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.334811 0.5276895 1.0719915
## 8 1.254487 0.5561364 0.9842596
## 14 1.237085 0.5602251 0.9599156
## 20 1.238068 0.5549427 0.9565511
## 26 1.236422 0.5521014 0.9517747
## 32 1.240043 0.5464730 0.9510833
## 38 1.244768 0.5418790 0.9518807
## 44 1.251081 0.5360229 0.9552270
## 50 1.258535 0.5292140 0.9582833
## 57 1.268117 0.5218864 0.9645965
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
SINGLE TREE
library(rpart)
set.seed(66)
model_rpart <- train(x = X_train, y = y_train, method = "rpart", tuneLength = 10, control = rpart.control(maxdepth = 2))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
model_rpart
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01684157 1.516456 0.3578700 1.162605
## 0.02137409 1.516456 0.3578700 1.162605
## 0.02341864 1.516456 0.3578700 1.162605
## 0.02532733 1.516456 0.3578700 1.162605
## 0.02890208 1.516456 0.3578700 1.162605
## 0.03111746 1.516456 0.3578700 1.162605
## 0.04559790 1.517792 0.3567906 1.165215
## 0.06391370 1.532447 0.3473686 1.174197
## 0.09164355 1.543158 0.3396923 1.183760
## 0.39332469 1.687773 0.2730084 1.329875
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.03111746.
summary(resamples(list(Single_Tree = model_rpart, Random_Forest = model_rf3,
                       Gradient_Boosting = model_gbm1)))
##
## Call:
## summary.resamples(object = resamples(list(Single_Tree =
## model_rpart, Random_Forest = model_rf3, Gradient_Boosting = model_gbm1)))
##
## Models: Single_Tree, Random_Forest, Gradient_Boosting
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Single_Tree 0.9721591 1.0916421 1.1544688 1.1626050 1.2438397 1.411900
## Random_Forest 0.8413057 0.9003406 0.9357733 0.9517747 0.9861940 1.187571
## Gradient_Boosting 0.8318969 0.8904796 0.9198531 0.9398576 0.9738667 1.112387
## NA's
## Single_Tree 0
## Random_Forest 0
## Gradient_Boosting 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_Tree 1.271720 1.426333 1.524448 1.516456 1.614609 1.732767 0
## Random_Forest 1.066881 1.148314 1.237786 1.236422 1.328218 1.504047 0
## Gradient_Boosting 1.066207 1.164416 1.230343 1.231950 1.269784 1.430956 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Single_Tree 0.1885859 0.2973369 0.3602365 0.3578700 0.4259184 0.5371485
## Random_Forest 0.3948908 0.5083560 0.5547292 0.5521014 0.6138288 0.6830667
## Gradient_Boosting 0.3111637 0.5177063 0.5533567 0.5660252 0.6416362 0.7151902
## NA's
## Single_Tree 0
## Random_Forest 0
## Gradient_Boosting 0
# Function to evaluate each model on the test data
test_performance <- function(models, testData, testTarget) {
method <- c()
res <- data.frame()
for(model in models){
method <- c(method, model$method)
pred <- predict(model, newdata=testData)
res <- rbind(res, t(postResample(pred=pred, obs=testTarget)))
}
row.names(res) <- method
return(res)
}
# List the models
models <- list(model_rpart, model_rf3, model_gbm1)
#Run the function
performance <- test_performance(models, X_test, y_test)
performance
## RMSE Rsquared MAE
## rpart 1.5647270 0.3765960 1.2801338
## rf 1.1313936 0.7267739 0.8847341
## gbm 0.8727583 0.8195122 0.6368033
On the held-out test set, gradient boosting performs best (RMSE 0.87, R-squared 0.82), followed by the random forest; the single tree trails both. Gradient boosting is therefore the optimal tree-based model for these data.
library(gbm)
To compare the important predictors with those from the optimal linear and nonlinear models (Exercises 6.3 and 7.5), refit the PLS and SVM models and extract their variable importance:
model_pls <- train(x = X_train, y = y_train, method = 'pls', metric = 'RMSE',
                   tuneLength = 20, trControl = trainControl(method = 'cv'))
(pls_imp = varImp(model_pls))
## pls variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess14 38.56
## BiologicalMaterial03 37.72
## ManufacturingProcess27 35.46
## ManufacturingProcess35 35.22
## ManufacturingProcess24 34.14
## ManufacturingProcess15 32.40
## ManufacturingProcess28 31.96
## ManufacturingProcess25 30.59
## ManufacturingProcess26 29.40
## ManufacturingProcess06 29.10
## BiologicalMaterial06 28.36
## BiologicalMaterial02 27.71
## ManufacturingProcess04 27.69
## ManufacturingProcess02 21.97
## ManufacturingProcess05 18.61
## ManufacturingProcess33 17.73
## ManufacturingProcess19 15.84
## ManufacturingProcess17 14.69
## ManufacturingProcess09 13.91
set.seed(222)
svm_model <- train(x = X_train,y = y_train,
method = "svmRadial",
tuneLength=10,
preProc = c("center", "scale"))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
(svm_imp = varImp(svm_model))
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## BiologicalMaterial06 100.00
## ManufacturingProcess13 98.34
## ManufacturingProcess32 96.95
## BiologicalMaterial03 81.49
## ManufacturingProcess17 75.31
## BiologicalMaterial02 73.06
## BiologicalMaterial12 71.59
## ManufacturingProcess09 70.96
## ManufacturingProcess31 67.04
## ManufacturingProcess36 66.24
## ManufacturingProcess06 55.48
## BiologicalMaterial04 54.37
## BiologicalMaterial11 53.94
## ManufacturingProcess33 49.60
## BiologicalMaterial08 48.82
## BiologicalMaterial01 46.53
## ManufacturingProcess02 44.43
## ManufacturingProcess29 40.72
## ManufacturingProcess11 36.51
## ManufacturingProcess12 32.98
p1<-plot(svm_imp, top=10, main='SVM')
p2<-plot(pls_imp, top=10, main='PLS')
gbm_imp<-varImp(model_gbm1)
p3<-plot(gbm_imp, top=10, main='GBM')
gridExtra::grid.arrange(p1, p2,p3, ncol = 3)
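As a quick hedged check of how the top-10 lists overlap (using the varImp objects computed above; top10 is an illustrative helper, not a caret function):
top10 <- function(imp) {
  # varImp() objects store a data frame of scores in $importance
  rownames(imp$importance)[order(imp$importance$Overall, decreasing = TRUE)][1:10]
}
intersect(top10(gbm_imp), top10(pls_imp))  # predictors shared by GBM and PLS
intersect(top10(gbm_imp), top10(svm_imp))  # predictors shared by GBM and SVM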
Finally, inspect the structure of the optimal single tree:
model_rpart$finalModel
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 472.70180 40.17111
## 2) ManufacturingProcess32< 159.5 83 142.65450 39.19699
## 4) BiologicalMaterial11< 145.075 39 42.89152 38.55615 *
## 5) BiologicalMaterial11>=145.075 44 69.55090 39.76500 *
## 3) ManufacturingProcess32>=159.5 61 144.12200 41.49656
## 6) BiologicalMaterial06< 51.61 34 60.34664 40.74559 *
## 7) BiologicalMaterial06>=51.61 27 40.45527 42.44222 *
The tree view adds some insight: the first split is on ManufacturingProcess32, with larger values leading to higher average yield (41.5 vs 39.2), and the biological predictors (BiologicalMaterial11, BiologicalMaterial06) further refine the prediction within each branch. Both process and biological variables therefore carry information about yield.
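To visualize this tree with the yield distribution in the terminal nodes, a minimal sketch assuming the rpart.plot package is available:
library(rpart.plot)
# Plot the final CART model; each node shows the mean yield and sample share
rpart.plot(model_rpart$finalModel, main = 'Optimal Single Tree for Yield')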