First, we created a random forest model to identify pattern differences among three school districts: Hanover, Salem, and WJCC. The call used to fit the model is shown below.

## randomForest(formula = site1 ~ ., data = jex_train, ntree = 196, 
##     mtry = 30, replace = TRUE, sample = 100, node = 5, importance = TRUE, 
##     proximity = TRUE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, 
##     keep.inbag = TRUE, na.action = na.omit)
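For readers who want to try a comparable fit, here is a minimal sketch. It is not the report's actual code: `jex_train` and `site1` are not available here, so the built-in `iris` data stands in, and `demo_train` and `rf_demo` are hypothetical names. The sketch uses `sampsize` and `nodesize`, which are the documented argument names in the randomForest package; the printed call above passes `sample` and `node`, which may not have the intended effect.

```r
library(randomForest)

set.seed(42)  # make the bootstrap samples reproducible

# Stand-in data (assumption): iris replaces the report's jex_train
train_idx  <- sample(nrow(iris), 100)
demo_train <- iris[train_idx, ]

rf_demo <- randomForest(
  Species ~ .,            # stand-in for site1 ~ .
  data       = demo_train,
  ntree      = 196,
  mtry       = 2,         # iris has only 4 predictors, so 30 is not possible
  replace    = TRUE,
  sampsize   = 60,        # documented name for the per-tree sample size
  nodesize   = 5,         # documented name for the minimum terminal node size
  importance = TRUE,
  proximity  = TRUE,
  keep.inbag = TRUE,
  na.action  = na.omit
)

print(rf_demo)  # shows the call, the OOB error estimate and a confusion matrix
```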

The model's learning process is visualized below; the classification of Salem yields markedly lower error rates. We theorize that this results from the larger sample size, but more investigation is needed to validate this theory.
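One way to inspect per-class error rates as trees are added is to read the `err.rate` matrix stored on a fitted classification forest. The sketch below is illustrative only: it fits a stand-in forest on `iris` (an assumption, since the report's data is not available) and uses base graphics rather than whatever plotting code the report used. The class counts are printed as well, because they are the first thing to check when testing a sample-size explanation.

```r
library(randomForest)

set.seed(42)

# Stand-in model (assumption): iris replaces the report's jex_train
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 196)

# err.rate has one row per tree; the first column is the overall OOB error,
# followed by one column per class
err <- rf_demo$err.rate

matplot(seq_len(nrow(err)), err, type = "l", lty = 1,
        col = seq_len(ncol(err)),
        xlab = "Number of trees", ylab = "Error rate")
legend("topright", legend = colnames(err),
       col = seq_len(ncol(err)), lty = 1)

# Class counts in the training data, to compare against per-class error
class_counts <- table(iris$Species)
print(class_counts)
print(tail(err, 1))  # error rates after the final tree
```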

An important output of a random forest model is an ordered list of variable importance. As seen below, staaccess_1 and access_2 are critical, along with several others.
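A ranking like this can be obtained from a forest fitted with `importance = TRUE`. The following sketch again uses `iris` as a stand-in (an assumption), so the variable names differ from the report's, and the choice to sort by mean decrease in accuracy is illustrative rather than the report's stated criterion.

```r
library(randomForest)

set.seed(42)

# Stand-in model (assumption): iris replaces the report's jex_train
rf_demo <- randomForest(Species ~ ., data = iris,
                        ntree = 196, importance = TRUE)

# importance() returns a matrix: one row per predictor, with per-class
# columns plus MeanDecreaseAccuracy and MeanDecreaseGini
imp <- importance(rf_demo)

# Order predictors from most to least important by mean decrease in accuracy
imp_sorted <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
print(imp_sorted)

# Dot chart of both importance measures
varImpPlot(rf_demo)
```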

Lastly, we ran the model on unseen data. Its overall performance is shown below: accuracy is about 76%, well above the no-information rate of about 45%, and Salem is classified most reliably.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction Hanover Salem WJCC
##    Hanover       7     0    3
##    Salem         0    10    1
##    WJCC          6     0   15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7619          
##                  95% CI : (0.6055, 0.8795)
##     No Information Rate : 0.4524          
##     P-Value [Acc > NIR] : 4.45e-05        
##                                           
##                   Kappa : 0.6267          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Hanover Class: Salem Class: WJCC
## Sensitivity                  0.5385       1.0000      0.7895
## Specificity                  0.8966       0.9688      0.7391
## Pos Pred Value               0.7000       0.9091      0.7143
## Neg Pred Value               0.8125       1.0000      0.8095
## Prevalence                   0.3095       0.2381      0.4524
## Detection Rate               0.1667       0.2381      0.3571
## Detection Prevalence         0.2381       0.2619      0.5000
## Balanced Accuracy            0.7175       0.9844      0.7643
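To make the statistics above easier to interpret, the sketch below recomputes several of them in base R directly from the printed confusion matrix. The counts are copied from the output; the variable names are illustrative. Most of the misclassifications are between Hanover and WJCC, which is why their sensitivities are lower than Salem's.

```r
# Confusion matrix copied from the output above (rows = predicted, columns = actual)
districts <- c("Hanover", "Salem", "WJCC")
cm <- matrix(
  c(7,  0,  3,
    0, 10,  1,
    6,  0, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = districts, Actual = districts)
)

n        <- sum(cm)        # total test cases
correct  <- diag(cm)       # true positives per class
actual   <- colSums(cm)    # actual cases per class
predicted <- rowSums(cm)   # predicted cases per class

accuracy    <- sum(correct) / n
nir         <- max(actual) / n          # no-information rate: always guess the largest class
sensitivity <- correct / actual
ppv         <- correct / predicted      # positive predictive value
true_neg    <- n - predicted - actual + correct
specificity <- true_neg / (n - actual)

# Cohen's kappa: agreement beyond what the row and column totals predict by chance
expected    <- sum(predicted * actual) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, NIR = nir, Kappa = cohen_kappa), 4))
print(round(rbind(Sensitivity = sensitivity,
                  Specificity = specificity,
                  PosPredValue = ppv), 4))
```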