library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
The importance scores for V6 through V10 are essentially zero (some slightly negative), indicating that the random forest did not make meaningful use of these uninformative predictors.
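The outputs below include a predictor duplicate1 that is never created in the code above. The original exercise adds a highly correlated copy of V1; a minimal sketch of that assumed step (the 0.1 noise scale is taken from the exercise, not from this document):
# Assumed step (not shown above): add a near-duplicate of V1 so the later
# models see the `duplicate1` column that appears in their outputs.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)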
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
model3 <- cforest(y ~ ., data = simulated)
cf_simulated <- varimp(model3)
cf_simulated
## V1 V2 V3 V4 V5
## 4.6171158805 6.0579730772 0.0003116115 7.6223892727 1.7161194047
## V6 V7 V8 V9 V10
## -0.0289427183 0.0465374951 -0.0380965511 0.0046062409 -0.0310326410
## duplicate1
## 5.0941897280
plot(cf_simulated)
Based on the model3 results and the plot, the new predictor duplicate1 receives substantial importance, splitting credit with V1 as expected for two highly correlated predictors. Otherwise the cforest importances follow the same pattern as the traditional random forest: V1, V2, V4, and V5 are informative, V3 is weak, and V6 through V10 sit near zero.
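party's varimp also offers a conditional importance measure that adjusts for correlations among predictors such as V1 and duplicate1; a sketch (not part of the run above):
# Conditional permutation importance: permutes each predictor within strata
# of correlated predictors, shrinking the shared credit between V1 and
# duplicate1. Slow to compute, but instructive here.
cf_conditional <- varimp(model3, conditional = TRUE)
cf_conditional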
library(gbm)
## Loaded gbm 2.1.8
set.seed(200)
gbm_model <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbm_model)
## var rel.inf
## V4 V4 30.9675963
## V2 V2 22.0642239
## V1 V1 14.7119711
## duplicate1 duplicate1 13.7237851
## V5 V5 10.4092910
## V3 V3 7.7070181
## V6 V6 0.2890616
## V8 V8 0.1270530
## V7 V7 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
The boosted tree model reorders the important predictors: V4 now has the largest relative influence, followed by V2 and V1, and duplicate1 again draws influence away from V1. V6 through V10 still contribute little or nothing, and V3 again ranks below the other informative predictors, consistent with the earlier models.
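The fit above relies on gbm's defaults (100 trees, shrinkage 0.1, bag fraction 0.5). A quick check of whether 100 boosting iterations is enough, using the out-of-bag estimate (a sketch):
# Estimate the optimal number of boosting iterations from the OOB curve;
# a value close to 100 suggests the default n.trees is too small.
gbm.perf(gbm_model, method = "OOB", plot.it = FALSE)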
library(Cubist)
set.seed(200)
cubist_model <- cubist(simulated[, -11], simulated[, 11])
varImp(cubist_model)
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
Like the other models, Cubist relies on V1, V2, V4, V5, and duplicate1; V3 drops out entirely here, and V6 through V10 are unused.
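varImp for a Cubist model averages how often each predictor appears in rule conditions and in the rule-level linear models, which is why every used predictor lands on 50 here. The raw usage table separates the two components (a sketch):
# Percentage of rule conditions vs. linear-model terms each predictor
# appears in; varImp's Overall score is the mean of these two columns.
cubist_model$usage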
library(rpart)
set.seed(200)
a1 <- sample(1:10, 100, replace = TRUE)       # coarse: only 10 distinct integer values
a2 <- rnorm(100)                              # continuous noise
a3 <- rnorm(100, mean = 0, sd = 4)            # continuous noise with larger spread
a4 <- sample(1:10 / 10, 100, replace = TRUE)  # coarse: 10 distinct values in (0, 1]
b <- a1 + rnorm(100, mean = 0, sd = .5)       # response depends only on a1
df <- data.frame(a1, a2, a3, a4, b)
tree_model <- rpart(b ~ ., data = df)
varImp(tree_model)
## Overall
## a1 3.2849079
## a2 0.3554106
## a3 0.3784541
## a4 0.3295456
Although a1 was simulated with coarse granularity (only ten distinct values), it dominates the importance ranking because the response b was generated directly from it; the uninformative predictors a2, a3, and a4 receive only small scores.
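To isolate the granularity bias itself, one can fit a tree to pure noise and compare equally uninformative coarse and fine predictors; a sketch under assumed settings (cp = 0 and a small minsplit force the tree to split):
# Neither predictor is related to the response, but the fine-grained one
# offers far more candidate split points and tends to collect more splits.
set.seed(200)
noise_df <- data.frame(coarse = sample(1:2, 100, replace = TRUE),  # 2 distinct values
                       fine   = rnorm(100),                        # ~100 distinct values
                       y      = rnorm(100))
noise_fit <- rpart(y ~ ., data = noise_df,
                   control = rpart.control(cp = 0, minsplit = 2))
varImp(noise_fit)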
8.24
The learning rate controls how much each tree contributes to the ensemble. With a high learning rate, each tree takes a large step toward the minimum, so fewer trees are needed and the first few trees, which split on the strongest predictors, dominate. The right-hand model uses a large learning rate and bagging fraction compared to the left-hand model; the high bagging fraction also means every tree sees nearly the same data, so the same few predictors are selected again and again. With the low learning rate and small bagging fraction of the left-hand model, many trees are trained on varied samples, which spreads importance across more predictors.
The right-hand model is therefore more likely to overfit: its aggressive settings concentrate on a handful of predictors and effectively drop the rest. The left-hand model, which distributes importance more evenly, should be more predictive of new samples (a sketch of the two settings follows after this discussion).
Increasing the interaction depth lets each tree grow deeper and draw in more predictors, so importance spreads across more of them; this flattens, i.e. reduces the slope of, the predictor-importance profile in the plot.
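A sketch of the contrast described above, refitting the simulated data with the two settings from the figure (the 0.1 versus 0.9 values for shrinkage and bagging fraction are assumptions here; n.minobsinnode is lowered so the small bags can still split):
set.seed(200)
gbm_left  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
                 n.minobsinnode = 5)
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9,
                 n.minobsinnode = 5)
# The aggressive settings should concentrate relative influence on a few
# predictors, while the conservative settings spread it out.
head(summary(gbm_left,  plotit = FALSE))
head(summary(gbm_right, plotit = FALSE))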
library(AppliedPredictiveModeling)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2020 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(missForest)
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
library(kableExtra)
data(ChemicalManufacturingProcess)
missmap(ChemicalManufacturingProcess, col = c("red", "lightgreen"))
Original_df <- ChemicalManufacturingProcess
Imputed_df <- missForest(Original_df)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
df <- Imputed_df$ximp
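missForest also returns an out-of-bag estimate of the imputation error (NRMSE for continuous variables), which is worth checking before trusting the imputed values; a sketch:
# Normalized RMSE of the imputation, estimated out-of-bag by missForest.
Imputed_df$OOBerror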
data <- df[, 2:58]   # 57 predictor columns: biological and process measurements
target <- df[, 1]    # response: Yield
train <- createDataPartition(target, p = 0.75)
train_pred <- data[train$Resample1, ]
train_target <- target[train$Resample1]
test_pred <- data[-train$Resample1, ]
test_target <- target[-train$Resample1]
control <- trainControl(method = "cv", number=10)
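Note that control is never passed to the train() calls below, which is why their outputs report 25-rep bootstrap resampling rather than 10-fold CV. Using it would look like this (a sketch, not what produced the outputs below; rt_Model_cv is a hypothetical name):
# Pass the resampling scheme explicitly via trControl.
rt_Model_cv <- train(x = train_pred, y = train_target,
                     method = "rpart2", tuneLength = 10,
                     trControl = control)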
set.seed(1)
rt_Model <- train(x = train_pred,
                  y = train_target,
                  method = "rpart2",
                  tuneLength = 10)
rt_Model
## CART
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.549204 0.2971345 1.212403
## 2 1.554881 0.3060266 1.223812
## 3 1.543639 0.3195709 1.219891
## 4 1.543254 0.3214862 1.212584
## 5 1.541306 0.3309388 1.199296
## 6 1.572511 0.3169023 1.216852
## 7 1.573681 0.3236931 1.208675
## 8 1.582067 0.3227783 1.219734
## 9 1.582248 0.3223702 1.218682
## 10 1.585317 0.3225424 1.218959
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 5.
rt.Predict <- predict(rt_Model, newdata = test_pred)
# Evaluate test-set performance with postResample
postResample(pred = rt.Predict, obs = test_target)
## RMSE Rsquared MAE
## 1.549507 0.410666 1.170624
rtAccuracy <- postResample(pred = rt.Predict, obs = test_target)
set.seed(1)
rf_Model <- train(x = train_pred,
                  y = train_target,
                  method = "rf",
                  tuneLength = 10)
rf_Model
## Random Forest
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.362282 0.5061519 1.0653492
## 8 1.287270 0.5366514 0.9976655
## 14 1.271212 0.5381600 0.9808907
## 20 1.265893 0.5347276 0.9755612
## 26 1.264688 0.5316547 0.9714118
## 32 1.272871 0.5208577 0.9777948
## 38 1.271139 0.5196967 0.9736169
## 44 1.274348 0.5153325 0.9766654
## 50 1.282146 0.5094456 0.9821527
## 57 1.289583 0.5032978 0.9840231
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
rf.Predict <- predict(rf_Model, newdata = test_pred)
# Evaluate test-set performance with postResample
postResample(pred = rf.Predict, obs = test_target)
## RMSE Rsquared MAE
## 1.1166471 0.7615895 0.8424019
rfAccuracy <- postResample(pred = rf.Predict, obs = test_target)
set.seed(1)
cubist_Model <- train(x = train_pred,
                      y = train_target,
                      method = "cubist",
                      tuneLength = 10)
cubist_Model
## Cubist
##
## 132 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 2.020186 0.2635118 1.4458780
## 1 5 1.986481 0.2788826 1.4094603
## 1 9 1.996060 0.2735182 1.4217784
## 10 0 1.334883 0.4804142 1.0114764
## 10 5 1.312066 0.4973527 0.9776325
## 10 9 1.322466 0.4896068 0.9914823
## 20 0 1.251818 0.5296744 0.9561366
## 20 5 1.226397 0.5489871 0.9208038
## 20 9 1.237730 0.5401037 0.9355957
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
cubist.Predict <- predict(cubist_Model, newdata = test_pred)
# Evaluate test-set performance with postResample
postResample(pred = cubist.Predict, obs = test_target)
## RMSE Rsquared MAE
## 0.9226027 0.8018563 0.7530944
cubistAccuracy <- postResample(pred = cubist.Predict, obs = test_target)
accuracies <- rbind(rtAccuracy, rfAccuracy, cubistAccuracy)
rownames(accuracies) <- c("REGRESSION TREE", "RANDOM FOREST", "CUBIST")
accuracies %>%
  kable() %>%
  kable_styling()
|                 |      RMSE |  Rsquared |       MAE |
|-----------------|----------:|----------:|----------:|
| REGRESSION TREE | 1.5495073 | 0.4106660 | 1.1706239 |
| RANDOM FOREST   | 1.1166471 | 0.7615895 | 0.8424019 |
| CUBIST          | 0.9226027 | 0.8018563 | 0.7530944 |
Based on the results above, Cubist is the best of the three models, with the lowest test-set RMSE and MAE and the highest R-squared.
plot(varImp(cubist_Model), top=10)
ManufacturingProcess32 and ManufacturingProcess28, followed by ManufacturingProcess17, are the top three predictors. Eight of the top ten are manufacturing process variables and two are biological, so the manufacturing processes dominate. The Cubist model also leans more heavily on manufacturing predictors than the lasso and SVM models from the earlier assignments did.
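The same top-ten list can be pulled programmatically rather than read off the plot (a sketch):
# Extract and sort caret's importance scores for the Cubist model.
imp <- varImp(cubist_Model)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)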
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
plot(as.party(rt_Model$finalModel),gp=gpar(fontsize=8))
Yes. The tree view exposes the actual split variables and cutpoints, so it provides additional knowledge about how specific biological and process measurements relate to yield.