library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
cars_data<-mtcars %>% mutate(T_FACTOR=ifelse(am==1,"Automatic","Manual"),
T_FACTOR=as.factor(T_FACTOR))
set.seed(123)
train_index<-createDataPartition(cars_data$T_FACTOR,p=0.7, list=FALSE)
train<-cars_data[train_index,]
test<-cars_data[-train_index,]
rf_model1<-randomForest(
T_FACTOR~.,
data=train %>% select(-am),
ntree=500,
mtry=5,
importance=TRUE
)
rf_model1
##
## Call:
## randomForest(formula = T_FACTOR ~ ., data = train %>% select(-am), ntree = 500, mtry = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 16.67%
## Confusion matrix:
## Automatic Manual class.error
## Automatic 8 2 0.2000000
## Manual 2 12 0.1428571
Random Forest models have a kind of "internal test" built in: each tree is grown on a bootstrap sample of the training rows, and the rows left out of that sample (the "out-of-bag", or OOB, rows) are used to check how well the tree predicts data it never saw while being grown. Higher OOB error is bad, lower is good.
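The per-tree OOB estimates are stored in the model's err.rate matrix (one row per number of trees grown, with a column for the overall OOB error plus one per class). A quick peek at the first few rows, using the rf_model1 object fit above:
# first few rows of the OOB error matrix: overall OOB plus one column per class
head(rf_model1$err.rate)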
plot(rf_model1)
legend("topright",legend=colnames(rf_model1$err.rate)[-1],
col=c("red","green"),
lty=2,
bty="n")
plot(rf_model1) shows the out-of-bag (OOB) error versus the number of trees grown. The solid black line represents the overall OOB misclassification error, while the red and green lines show the class-specific errors (red = Automatic, green = Manual). The plot reveals that the model converges well before 500 trees have been grown; in this model, it's around 20-50 trees.
When the lines flatten out, that's the final error rate. The lower a class line sits, the easier that class is to predict. In this case, the green line (Manual) sits lower overall than the red line (Automatic), which carries more error along the y axis.
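A rough way to see where the error flattens (a quick check, not a formal stopping rule) is to ask at how many trees the overall OOB error first reaches its minimum:
# number of trees at which the overall OOB error first hits its minimum
which.min(rf_model1$err.rate[, "OOB"])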
# we are getting the last row of OOB errors from the model using nrow
final_oob<-rf_model1$err.rate[nrow(rf_model1$err.rate),]
final_oob
## OOB Automatic Manual
## 0.1666667 0.2000000 0.1428571
When the model finally settles, automatic has a bit more error than manual.
Because the model converges well before 500 trees, we can lower ntree to 100 for a faster fit and an easier-to-read error plot:
rf_model2<-randomForest(
T_FACTOR~.,
data=train %>% select(-am),
ntree=100,
mtry=5,
importance=TRUE
)
rf_model2
##
## Call:
## randomForest(formula = T_FACTOR ~ ., data = train %>% select(-am), ntree = 100, mtry = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 16.67%
## Confusion matrix:
## Automatic Manual class.error
## Automatic 8 2 0.2000000
## Manual 2 12 0.1428571
plot(rf_model2)
legend("topright",legend=colnames(rf_model1$err.rate)[-1],
col=c("red","green"),
lty=2,
bty="n")
The errors for all classes are initially high, but begin to settle after about 20 trees and flatten definitively after about 40. Overall, the model has somewhat more trouble predicting automatics than manuals.
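As with the first model, we can pull the last row of the OOB error matrix to confirm where the 100-tree model settles:
# final OOB and per-class error rates for the 100-tree model
rf_model2$err.rate[nrow(rf_model2$err.rate), ]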
importance(rf_model2)
## Automatic Manual MeanDecreaseAccuracy MeanDecreaseGini
## mpg 2.038962 -0.3934634 1.4901502 0.4032592
## cyl 1.005038 -1.0050378 0.3431991 0.0300000
## disp 2.277245 2.6243950 3.2183435 1.1119672
## hp 0.433963 -0.8192319 0.3844318 0.5639278
## drat 4.039595 1.9355191 3.8409012 1.4877908
## wt 7.391460 5.0719410 7.3249943 3.5286032
## qsec 1.027985 2.6734099 2.0990390 1.0098321
## vs 1.353881 1.0050378 1.2982571 0.1046667
## gear 6.072988 3.7593282 6.0763236 2.6369689
## carb -0.323399 1.0050378 0.2000400 0.2671508
Each predictor is listed on the left-hand side. The degree to which each predictor helps to predict each class is listed beneath "Automatic" and "Manual". We see that wt (weight) is pretty good at predicting both automatic and manual, gear is better at predicting automatic than manual, and qsec is somewhat better at predicting manual than automatic. The key here is looking for asymmetry between the two class columns.
MeanDecreaseAccuracy is how much the model's accuracy drops when that variable's values are randomly permuted (scrambled), which effectively removes its information. The more the accuracy drops, the more "important" the variable is in predicting the outcome.
MeanDecreaseGini is harder to interpret, but it's basically "how much does splitting on this variable produce cleaner groups at each node", totaled across all the trees. Example: if you had a basket of apples and oranges, what would be the most effective variable with which to separate them? Probably color: red vs. orange gives the "cleanest split" between apples and oranges. Something like weight or size probably wouldn't split the groups as cleanly. Here, we see that wt (weight) creates the cleanest splits between automatic and manual.
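If you only want one of the two measures, importance() can return them separately (type = 1 for the permutation-based mean decrease in accuracy, type = 2 for mean decrease in Gini); sorting makes the ranking easier to scan. A small sketch:
# type = 1: permutation (accuracy-based) importance; type = 2: Gini-based importance
acc_imp <- importance(rf_model2, type = 1)
gini_imp <- importance(rf_model2, type = 2)
# rank predictors from most to least important on each measure
acc_imp[order(acc_imp[, 1], decreasing = TRUE), , drop = FALSE]
gini_imp[order(gini_imp[, 1], decreasing = TRUE), , drop = FALSE]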
These values can also be visualized as a plot:
varImpPlot(rf_model2)
Now we can use the random forest model to predict manual vs. automatic for the “test” group (created earlier):
rf_pred_class <- predict(rf_model2, newdata = test, type = "response")
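If you'd rather have the underlying class probabilities than hard labels (useful for ROC curves or custom cutoffs), predict() also accepts type = "prob" for classification forests:
# predicted probability of each class for every test row
rf_pred_prob <- predict(rf_model2, newdata = test, type = "prob")
head(rf_pred_prob)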
We can now run a confusion matrix on the prediction (rf_pred_class) vs. the reference (test$T_FACTOR):
confusionMatrix(data=rf_pred_class,
reference=test$T_FACTOR,
positive="Manual")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Automatic Manual
## Automatic 3 1
## Manual 0 4
##
## Accuracy : 0.875
## 95% CI : (0.4735, 0.9968)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.135
##
## Kappa : 0.75
##
## Mcnemar's Test P-Value : 1.000
##
## Sensitivity : 0.800
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 0.750
## Prevalence : 0.625
## Detection Rate : 0.500
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.900
##
## 'Positive' Class : Manual
##
The top of the confusion matrix is a prediction-by-reference table: the columns are the reference (what the cars actually are) and the rows are the prediction (what the model said). The model predicted that 4 cars were automatic, but in reality (reference), 3 of those were automatic and 1 was manual.
The model predicted 4 manual cars, and all 4 of those really were manual.
Shorter version, with Manual as the positive class:
TRUE NEGATIVE: the model predicted Automatic, and the actual was Automatic (happened 3 times)
FALSE NEGATIVE: the model predicted Automatic, but the actual was Manual (happened once)
FALSE POSITIVE: the model predicted Manual, but the actual was Automatic (never happened)
TRUE POSITIVE: the model predicted Manual, and the actual was Manual (happened 4 times)
Accuracy (what percent of the model’s predictions were correct): this is the true positives plus true negatives divided by the total number of observations. The model above was correct 7 times out of 8, and 7 divided by 8 is 0.875, or 87.5% of the time.
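You can verify this by hand from a simple prediction-by-reference table (a quick sanity check, not part of caret's output):
# rows = prediction, columns = reference (actual)
cm <- table(Prediction = rf_pred_class, Reference = test$T_FACTOR)
# accuracy = correct predictions (the diagonal) divided by total observations
sum(diag(cm)) / sum(cm)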
No Information Rate: essentially the “baseline” of the model. It represents how often you’d be correct if you always predicted the most common class (i.e., if you just labeled everything “Manual”, how often would you be right by chance?). A good model should have an accuracy higher than the No Information Rate, and the P-Value [Acc > NIR] tells you whether that improvement is statistically significant; here it isn’t (p = 0.135), largely because the test set has only 8 cars.
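The No Information Rate is just the proportion of the most frequent class in the test set, which you can confirm directly:
# the "always guess the most common class" baseline
max(prop.table(table(test$T_FACTOR)))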
Kappa: how much better the model is than agreement expected purely by chance. A common interpretation scale (with a hand calculation sketched after it):
0.0–0.2 = slight
0.2–0.4 = fair
0.4–0.6 = moderate
0.6–0.8 = substantial
0.8–1.0 = almost perfect
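As a rough illustration (reusing the cm table built above), Kappa compares the observed accuracy to the accuracy you'd expect from the row and column totals alone:
# observed accuracy vs. the accuracy expected by chance from the marginal totals
observed <- sum(diag(cm)) / sum(cm)
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
(observed - expected) / (1 - expected) # Kappa; matches the 0.75 reported above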
Sensitivity: of all the observations that actually belong to the positive class (here, Manual), what fraction did the model correctly identify? In other words, how often do we catch the thing we care about?
High sensitivity means the model rarely misses the correct target. Lower sensitivity means it misses more often.
Specificity: of all the observations that actually belong to the negative class (here, Automatic), what fraction did the model correctly identify? In other words, how often do we avoid false alarms?
High specificity means very few false alarms. Lower specificity means more false alarms.
Balanced accuracy: the average of sensitivity and specificity
Positive Predictive Value: When the model predicts the positive class (in this case, “manual”), how often is it correct?
Negative Predictive Value: when the model predicts the negative class (in this case, “automatic”), how often is it correct?
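The same cm table makes these definitions concrete (with Manual as the positive class):
# counts pulled from the prediction-by-reference table built earlier
TP <- cm["Manual", "Manual"]       # predicted Manual, actually Manual
TN <- cm["Automatic", "Automatic"] # predicted Automatic, actually Automatic
FP <- cm["Manual", "Automatic"]    # predicted Manual, actually Automatic
FN <- cm["Automatic", "Manual"]    # predicted Automatic, actually Manual
TP / (TP + FN)   # sensitivity
TN / (TN + FP)   # specificity
TP / (TP + FP)   # positive predictive value
TN / (TN + FN)   # negative predictive value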
The importance metrics can be influenced by correlation among the predictor variables:
cor(train %>% select(-T_FACTOR))
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8309449 -0.7881596 -0.77360581 0.611397470 -0.8019581
## cyl -0.8309449 1.0000000 0.9051132 0.83268741 -0.605986937 0.7889396
## disp -0.7881596 0.9051132 1.0000000 0.79772145 -0.620828132 0.8534729
## hp -0.7736058 0.8326874 0.7977215 1.00000000 -0.356904082 0.6445489
## drat 0.6113975 -0.6059869 -0.6208281 -0.35690408 1.000000000 -0.7037722
## wt -0.8019581 0.7889396 0.8534729 0.64454894 -0.703772153 1.0000000
## qsec 0.5901966 -0.6901128 -0.5675108 -0.83177905 0.031077460 -0.3277591
## vs 0.6537116 -0.8414757 -0.7051113 -0.74407701 0.392527856 -0.5471696
## am 0.5082539 -0.4315685 -0.5127435 -0.13483450 0.688453869 -0.6621306
## gear 0.2707460 -0.2914072 -0.3991829 0.06063522 0.632182264 -0.4414508
## carb -0.5308384 0.5277102 0.3490279 0.73689995 -0.005918978 0.3639306
## qsec vs am gear carb
## mpg 0.59019661 0.65371158 0.50825395 0.27074596 -0.530838356
## cyl -0.69011277 -0.84147572 -0.43156847 -0.29140724 0.527710196
## disp -0.56751076 -0.70511129 -0.51274346 -0.39918289 0.349027924
## hp -0.83177905 -0.74407701 -0.13483450 0.06063522 0.736899951
## drat 0.03107746 0.39252786 0.68845387 0.63218226 -0.005918978
## wt -0.32775908 -0.54716958 -0.66213064 -0.44145077 0.363930591
## qsec 1.00000000 0.82744857 -0.21794602 -0.32596435 -0.799830660
## vs 0.82744857 1.00000000 0.07067535 0.08112739 -0.609945670
## am -0.21794602 0.07067535 1.00000000 0.77892406 0.167935500
## gear -0.32596435 0.08112739 0.77892406 1.00000000 0.481927898
## carb -0.79983066 -0.60994567 0.16793550 0.48192790 1.000000000
When predictors are correlated, Random Forest tends to “split” the predictive signal between them: different trees pick different members of a correlated group, so each variable ends up looking less important than it would on its own, and the ranking among correlated variables can be unstable.
There are ways to mitigate this issue: drop correlated predictors that only partially represent what is actually being measured, or replace a correlated group with a single composite variable (a “car size index”, perhaps?).
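One way to screen for the worst offenders is caret's findCorrelation(), which suggests columns to drop from a correlation matrix above a chosen cutoff. A minimal sketch, using the same predictors the model saw and an arbitrary cutoff of 0.9:
# names of predictors findCorrelation() suggests dropping at |r| > 0.9
cor_mat <- cor(train %>% select(-T_FACTOR, -am))
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
colnames(cor_mat)[drop_idx]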
CITATIONS
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32. https://doi.org/10.1023/A:1010933404324
Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3), 18-22. https://CRAN.R-project.org/doc/Rnews/