First, we created a random forest model designed to identify pattern differences among three school districts: Hanover, Salem, and WJCC. The output of the modeling process is shown below.
## randomForest(formula = site1 ~ ., data = jex_train, ntree = 196,
## mtry = 30, replace = TRUE, sample = 100, node = 5, importance = TRUE,
## proximity = TRUE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE,
## keep.inbag = TRUE, na.action = na.omit)
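A minimal sketch of how this fitting step might look in code is given below. The data frame `jex_train` and outcome `site1` are taken from the call above; `sampsize` and `nodesize` are the standard `randomForest` argument names, used here for the `sample` and `node` values shown in the printed call, and the seed is hypothetical.

```r
# Sketch of the fitting step, assuming jex_train holds the predictors plus a
# site1 factor labeling the district (Hanover, Salem, WJCC).
library(randomForest)

set.seed(2021)  # hypothetical seed for reproducibility
rf_fit <- randomForest(
  site1 ~ .,
  data        = jex_train,
  ntree       = 196,
  mtry        = 30,
  replace     = TRUE,
  sampsize    = 100,
  nodesize    = 5,
  importance  = TRUE,
  proximity   = TRUE,
  norm.votes  = TRUE,
  do.trace    = TRUE,
  keep.forest = TRUE,
  keep.inbag  = TRUE,
  na.action   = na.omit
)
rf_fit  # prints the call, OOB error rate, and training confusion matrix
```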
The model learning process is visualized below, with the classification of Salem showing a markedly lower error rate. We theorize that this is a result of the larger sample size, but more investigation is necessary to validate this theory.
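The error-rate plot itself is not reproduced here, but a minimal sketch of how the per-class and out-of-bag error curves could be drawn from the fitted object follows (`rf_fit` is the assumed name for the model fit above):

```r
# Plot OOB and per-class error against the number of trees grown.
plot(rf_fit, main = "OOB error by number of trees")
legend("topright",
       legend = colnames(rf_fit$err.rate),
       col    = seq_len(ncol(rf_fit$err.rate)),
       lty    = seq_len(ncol(rf_fit$err.rate)))
```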
An important output of a random forest model is an ordered list of variable importance, as seen below, with staaccess_1 and access_2 being critical along with several others.
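The ranked importance list appears as a figure in the original; a sketch of how it could be produced from the fitted object (`rf_fit` again assumed):

```r
# Extract the importance matrix and order variables by mean decrease in Gini.
imp <- importance(rf_fit)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]

# Dot chart of the top 10 variables by importance.
varImpPlot(rf_fit, n.var = 10)
```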
Lastly, after running the model on unseen data, the overall performance is summarized below.
## Confusion Matrix and Statistics
##
##           Actual
## Prediction Hanover Salem WJCC
##    Hanover       7     0    3
##    Salem         0    10    1
##    WJCC          6     0   15
##
## Overall Statistics
##
## Accuracy : 0.7619
## 95% CI : (0.6055, 0.8795)
## No Information Rate : 0.4524
## P-Value [Acc > NIR] : 4.45e-05
##
## Kappa : 0.6267
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Hanover Class: Salem Class: WJCC
## Sensitivity 0.5385 1.0000 0.7895
## Specificity 0.8966 0.9688 0.7391
## Pos Pred Value 0.7000 0.9091 0.7143
## Neg Pred Value 0.8125 1.0000 0.8095
## Prevalence 0.3095 0.2381 0.4524
## Detection Rate 0.1667 0.2381 0.3571
## Detection Prevalence 0.2381 0.2619 0.5000
## Balanced Accuracy 0.7175 0.9844 0.7643
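A sketch of how this held-out evaluation might be reproduced with the caret package follows; `jex_test` is an assumed name for the unseen data split, with the same `site1` outcome column, and the Prediction/Actual dimension names mirror the table above.

```r
library(caret)

# Predict districts for the held-out observations and tabulate against truth.
preds  <- predict(rf_fit, newdata = jex_test)
cm_tab <- table(Prediction = preds, Actual = jex_test$site1)

# Accuracy, kappa, and per-class sensitivity/specificity as reported above.
confusionMatrix(cm_tab)
```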