You are given three datasets:
the Iris dataset,
the Pima Indians Diabetes dataset, and
a dataset that you create and name yourself in Excel and then save as a .csv file (the Excel .csv dataset).
You are to run the Excel .csv dataset, plus one of the two remaining datasets, through the random forest model. The performance of each of the two models should also be improved.
Remember:
Use a resampling method to split the dataset into subsets.
Implement, or call, the random forest model (use as many arguments as possible).
Evaluate the performance of the model using metrics such as accuracy.
If the model is not performing well, go back and tune the random forest parameters.
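A minimal sketch of that checklist, assuming a generic data frame df with a factor response y (both placeholders, not part of the assignment):
library(caret)
library(randomForest)
set.seed(2022)                                            # for a reproducible split
idx <- createDataPartition(df$y, p = 0.80, list = FALSE)  # stratified resampling split
train_df <- df[idx, ]
test_df  <- df[-idx, ]
rf <- randomForest(y ~ ., data = train_df,
                   ntree = 500,                           # number of trees
                   mtry = 2,                              # predictors tried at each split
                   importance = TRUE)                     # track variable importance
confusionMatrix(predict(rf, test_df), test_df$y)          # accuracy, kappa, per-class stats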
library(randomForest)
library(mlbench)
library(RCurl)
library(caret)
library(rpart)
Michael <- read.csv("Butros.csv") # the Excel-created dataset, saved as .csv
str(Michael)
## 'data.frame': 10 obs. of 4 variables:
## $ Left : int 1 0 1 0 0 0 0 1 1 0
## $ Right: int 45 0 92 18 26 48 41 52 64 80
## $ Up : int 24 26 32 41 80 76 92 39 46 50
## $ Down : int 100 69 46 24 0 32 86 71 65 48
set.seed(2022)
rf <- randomForest(Left~.,data=Michael)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
rf
##
## Call:
## randomForest(formula = Left ~ ., data = Michael)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.2774899
## % Var explained: -15.62
Michael$Left <- as.factor(Michael$Left)
str(Michael)
## 'data.frame': 10 obs. of 4 variables:
## $ Left : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 2 2 1
## $ Right: int 45 0 92 18 26 48 41 52 64 80
## $ Up : int 24 26 32 41 80 76 92 39 46 50
## $ Down : int 100 69 46 24 0 32 86 71 65 48
set.seed(2022)
rf <- randomForest(Left~.,data=Michael)
rf
##
## Call:
## randomForest(formula = Left ~ ., data = Michael)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 70%
## Confusion matrix:
## 0 1 class.error
## 0 3 3 0.5
## 1 4 0 1.0
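A 70% OOB error calls for tuning. Before handing the model to caret below, randomForest's own tuneRF() can search mtry directly; a minimal sketch on the factor-converted data (with only 10 rows the search is unstable, and the stepFactor and improve values are illustrative):
set.seed(2022)
tuned <- tuneRF(x = Michael[, -1], y = Michael$Left, # predictors, then the factor response
                ntreeTry = 500,  # trees grown per candidate mtry
                stepFactor = 2,  # multiply/divide mtry by this each step
                improve = 0.01,  # keep stepping only while OOB error improves this much
                doBest = FALSE)  # return the OOB-error table rather than a fitted forest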
control <- trainControl(method = "cv", number = 3)
grid_rf <- expand.grid(mtry=3)
m_rf <- train(Left~., data=Michael, method = "rf", importance=TRUE,
trControl=control, tuneGrid = grid_rf)
m_rf
## Random Forest
##
## 10 samples
## 3 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 7, 7, 6
## Resampling results:
##
## Accuracy Kappa
## 0.7777778 0.6
##
## Tuning parameter 'mtry' was held constant at a value of 3
fitControl <- trainControl(method="repeatedcv",number=5,repeats = 5)
grid_rf <- expand.grid(mtry=3)
m_rf <- train(Left~., data=Michael, method = "rf", importance=TRUE,
trControl=control, tuneGrid = grid_rf)
m_rf
## Random Forest
##
## 10 samples
## 3 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 7, 6, 7
## Resampling results:
##
## Accuracy Kappa
## 0.6388889 0.3
##
## Tuning parameter 'mtry' was held constant at a value of 3
pred <- predict(m_rf, Michael)
table(pred,Michael$Left)
##
## pred 0 1
## 0 6 0
## 1 0 4
The accuracy for this dataset was not as high as we would like from a random forest. Correlated trees are one possible culprit, but with only 10 observations the OOB and cross-validated estimates are extremely noisy, so small-sample variance is the more likely explanation.
We first ran a regression because the response Left was stored as an integer; the warning and the negative % variance explained (the model did worse than simply predicting the mean) confirmed that Left should be converted to a factor and the forest rerun as a classifier, which we did above.
Although a repeated cross-validation control (fitControl) was defined for the second run, the train() call still passed the original control object, so both runs actually used 3-fold cross-validation; the accuracy difference between them reflects resampling randomness rather than the resampling method.
The prediction table has nonzero entries only on the main diagonal, that is, no misclassifications, but this is unsurprising: the predictions were made on the same 10 rows the model was trained on, and a random forest almost always fits its training data perfectly.
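A corrected second run, sketched below, passes fitControl explicitly and, given only 10 rows, estimates performance with leave-one-out CV instead of predicting on the training data (m_rf2 and m_loo are hypothetical names, and the numbers will differ from the output above):
set.seed(2022)
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
m_rf2 <- train(Left ~ ., data = Michael, method = "rf", importance = TRUE,
               trControl = fitControl, # the repeated-CV control, not the 3-fold one
               tuneGrid = expand.grid(mtry = 3))
m_loo <- train(Left ~ ., data = Michael, method = "rf",
               trControl = trainControl(method = "LOOCV"), # honest, if noisy, on 10 rows
               tuneGrid = expand.grid(mtry = 3))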
library(randomForest)
library(caret)
data("iris")
Index <- createDataPartition(iris$Species, p = 0.80, list = FALSE) # stratified 80/20 split; no seed is set, so the split varies between runs
training <- iris[Index, ]
testing <- iris[-Index, ]
model <- randomForest(Species~., data=training)
print(model)
##
## Call:
## randomForest(formula = Species ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 40 0 0 0.000
## versicolor 0 37 3 0.075
## virginica 0 3 37 0.075
pred <- predict(model, testing)
cm <- confusionMatrix(pred, testing$Species) # predictions first, then the reference labels
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 4.857e-15
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000 1.0000
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3) # avoid shadowing caret's trainControl()
tmodel <- train(Species ~ ., data = training, method = "rf", trControl = control)
print(tmodel)
## Random Forest
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9472222 0.9208333
## 3 0.9527778 0.9291667
## 4 0.9500000 0.9250000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
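Above, caret picked the three mtry candidates on its own. An explicit tuneGrid makes the search reproducible and easy to widen, and extra arguments such as ntree are passed through to randomForest(); a sketch (tmodel2 is a hypothetical refit, not the model used for the predictions below):
set.seed(2022)
grid <- expand.grid(mtry = 1:4) # every possible mtry for the 4 iris predictors
tmodel2 <- train(Species ~ ., data = training, method = "rf",
                 trControl = control, tuneGrid = grid,
                 ntree = 1000)  # forwarded to randomForest()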
tpred <- predict(tmodel, testing)
tcm <- confusionMatrix(tpred, testing$Species) # again: predictions, then reference
print(tcm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 4.857e-15
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000 1.0000
The accuracy of the model was high before tuning and remained high after tuning the parameters of the random forest algorithm: the held-out test accuracy was 100% in both cases, so tuning confirmed a good configuration rather than improving one.
Parameters were tuned using repeated cross-validation with 10 folds repeated 3 times.
Optimal configuration after tuning:
mtry = 3
accuracy = 0.9527778
kappa = 0.9291667
Predictions made after tuning yielded no misclassifications on the 30 test observations, that is, an accuracy of 100%.
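The assignment asks for as many arguments as possible, and ntree is one worth inspecting: plotting a fitted randomForest object shows the OOB error stabilizing as trees are added, which justifies (or shortens) the default of 500. A sketch using the model object fit on the iris training split above:
plot(model, main = "OOB error vs. number of trees") # one curve per class plus the overall OOB error
legend("topright", legend = colnames(model$err.rate), lty = 1:4, col = 1:4)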
library(mlbench)
data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
head(PimaIndiansDiabetes, n=5)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 0 33.6 0.627 50 pos
## 2 1 85 66 29 0 26.6 0.351 31 neg
## 3 8 183 64 0 0 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
trainIndex <- createDataPartition(PimaIndiansDiabetes$diabetes, p=0.80, list=FALSE)
PimaIndiansDiabetes$Outcome <- as.factor(PimaIndiansDiabetes$diabetes) # adds an exact copy of the response as a new column
diabetes.training <- PimaIndiansDiabetes[trainIndex, ]
diabetes.testing <- PimaIndiansDiabetes[-trainIndex, ]
prop.table(table(diabetes.training$Outcome))
##
## neg pos
## 0.6504065 0.3495935
prop.table(table(diabetes.testing$diabetes))
##
## neg pos
## 0.6535948 0.3464052
model <- randomForest(diabetes~., data=diabetes.training)
print(model)
##
## Call:
## randomForest(formula = diabetes ~ ., data = diabetes.training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## neg pos class.error
## neg 400 0 0
## pos 0 215 0
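A 0% OOB error on real clinical data is usually too good to be true, and a quick variable-importance check makes the cause obvious, since a leaked copy of the response will dwarf every real predictor. A sketch using the fitted model above:
importance(model) # mean decrease in Gini per predictor
varImpPlot(model) # the duplicated Outcome column should tower over the rest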
control <- trainControl(method ="repeatedcv", number = 10, repeats = 10)
grid <- expand.grid(mtry =c(3,4,5))
model.random.forest <- train(diabetes~., data=diabetes.training, method="rf",
                             tuneGrid = grid, trConrtol=control)
# note: trConrtol is misspelled, so caret never receives the control object and
# falls back to its default bootstrap resampling, as the output below shows
model.random.forest
## Random Forest
##
## 615 samples
## 9 predictor
## 2 classes: 'neg', 'pos'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 615, 615, 615, 615, 615, 615, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 3 1 1
## 4 1 1
## 5 1 1
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
plot(model.random.forest)
# Evaluate model performance
pred <- predict(model.random.forest,diabetes.testing)
confusionMatrix(pred,diabetes.testing$diabetes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 100 0
## pos 0 53
##
## Accuracy : 1
## 95% CI : (0.9762, 1)
## No Information Rate : 0.6536
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6536
## Detection Rate : 0.6536
## Detection Prevalence : 0.6536
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : neg
##
The random forest yielded a 0% OOB error rate before parameter tuning. On a real clinical dataset this is too good to be true: the Outcome column created before the split is an exact copy of the response diabetes, so the label leaked into the predictors and the forest could classify every case perfectly.
Parameters were meant to be tuned using repeated cross-validation with 10 folds and 10 repetitions over several values of mtry, but because the control argument was misspelled (trConrtol), caret fell back to its default bootstrap resampling (25 reps), as the output shows.
All values of mtry yielded 100% accuracy and kappa values, and the predictions on the test set had no misclassifications; both are further symptoms of the leakage rather than evidence of genuine performance.
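A corrected run, sketched below, reloads the data (dropping the added Outcome column) and spells trControl correctly; with the leak removed, accuracy should land well below 100% (idx, train_df, test_df, ctrl, and fit are hypothetical names):
data(PimaIndiansDiabetes) # a clean copy, without the duplicated Outcome column
set.seed(2022)
idx <- createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.80, list = FALSE)
train_df <- PimaIndiansDiabetes[idx, ]
test_df  <- PimaIndiansDiabetes[-idx, ]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fit <- train(diabetes ~ ., data = train_df, method = "rf",
             tuneGrid = expand.grid(mtry = c(3, 4, 5)),
             trControl = ctrl) # spelled correctly, so repeated CV is actually used
confusionMatrix(predict(fit, test_df), test_df$diabetes)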