The CHAID model below is the result of tuning a subset of the following hyperparameters on a randomly selected \(70\%\) of the sample:
alpha2: Level of significance used for merging of predictor categories (step 2).
alpha3: If set to a positive value \(< 1\), level of significance used for the splitting of previously merged categories of the predictor (step 3). Otherwise, step 3 is omitted (the default).
alpha4: Level of significance used for splitting of a node in the most significant predictor (step 5).
minsplit: Number of observations in the split response at which no further split is desired.
minbucket: Minimum number of observations in terminal nodes.
minprob: Minimum frequency of observations in terminal nodes.
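As a rough illustration of the tuning setup described above, the sketch below draws a random 70/30 split and enumerates a small grid of candidate settings. The hyperparameter names come from the list above; the candidate values, the seed, the row count of 7200 (chosen only so that the 30% share equals the 2160 test cases reported later), and the helper names `split_indices` and `candidate_settings` are all assumptions for illustration. The model-fitting step is deliberately omitted, since the original analysis code is not shown.

```python
import itertools
import random


def split_indices(n_rows, train_fraction=0.7, seed=0):
    """Shuffle row indices and cut them into a training and a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_rows * train_fraction))
    return indices[:cut], indices[cut:]


def candidate_settings(grid):
    """Yield one dict per combination of the candidate values in the grid."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical candidate values; the values actually searched are not given.
grid = {
    "alpha2": [0.01, 0.05],
    "alpha4": [0.01, 0.05],
    "minsplit": [20, 50],
    "minbucket": [7, 15],
}

# 7200 rows is an illustrative assumption, not a figure from the report.
train_idx, test_idx = split_indices(7200)
settings = list(candidate_settings(grid))

print(len(train_idx), len(test_idx), len(settings))
```

Each dictionary in `settings` would be passed to the model-fitting routine on the training rows only, with the test rows left untouched until the final evaluation.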
A visualization of the resulting optimized model is shown below:
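The predictions on the remaining \(30\%\) of the sample, held out for testing, are shown below as a confusion matrix. Treating class 1 as the event of interest, the upper-right cell holds the false negatives and the lower-left cell holds the false positives. Note that the reported statistics instead take class 0 as the 'Positive' class, so the reported sensitivity refers to class 0.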
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 1454  132
##          1  489   85
##
## Accuracy : 0.7125
## 95% CI : (0.6929, 0.7315)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0809
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7483
## Specificity : 0.3917
## Pos Pred Value : 0.9168
## Neg Pred Value : 0.1481
## Prevalence : 0.8995
## Detection Rate : 0.6731
## Detection Prevalence : 0.7343
## Balanced Accuracy : 0.5700
##
## 'Positive' Class : 0
##
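The resulting model has \(71.25\%\) accuracy, with a higher rate of false negatives (\(132/217 \approx 60.8\%\) of actual class 1 cases) than false positives (\(489/1943 \approx 25.2\%\) of actual class 0 cases).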
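The headline statistics in the output above can be re-derived from the four cell counts alone. The sketch below is a minimal check, assuming the table is stored as `table[prediction][reference]` with labels 0 and 1; the function name `binary_metrics` is hypothetical and is not the routine that produced the report.

```python
def binary_metrics(table, positive=0):
    """Compute summary statistics from a 2x2 table indexed [prediction][reference]."""
    negative = 1 - positive
    tp = table[positive][positive]  # predicted positive, actually positive
    fp = table[positive][negative]  # predicted positive, actually negative
    fn = table[negative][positive]  # predicted negative, actually positive
    tn = table[negative][negative]  # predicted negative, actually negative
    n = tp + fp + fn + tn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }


# Counts copied from the confusion matrix above (rows: prediction, columns: reference).
first_model = [[1454, 132], [489, 85]]

as_reported = binary_metrics(first_model, positive=0)
event_view = binary_metrics(first_model, positive=1)

for name, value in as_reported.items():
    print(f"{name}: {value:.4f}")
```

Switching to `positive=1` swaps sensitivity and specificity, which is why the model can look strong in the reported output while still missing most class 1 cases.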
Does adding exposure to neighborhood violence (Homicide rate \(\geq\) Mean vs. \(<\) Mean) increase the predictive ability of the model?
Optimizing the CHAID model with the homicide rate per 100k variable included, we get the following model:
Applying the model to the \(30\%\) held out test set results in the following:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 1843  192
##          1  100   25
##
## Accuracy : 0.8648
## 95% CI : (0.8497, 0.879)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0785
##
## Mcnemar's Test P-Value : 1.007e-07
##
## Sensitivity : 0.9485
## Specificity : 0.1152
## Pos Pred Value : 0.9057
## Neg Pred Value : 0.2000
## Prevalence : 0.8995
## Detection Rate : 0.8532
## Detection Prevalence : 0.9421
## Balanced Accuracy : 0.5319
##
## 'Positive' Class : 0
##
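This model has \(86.48\%\) accuracy with, again, a higher rate of false negatives than false positives. Compared with the previous model without the homicide rate, it produces false positives at a much lower rate (\(100/1943 \approx 5.1\%\) vs. \(489/1943 \approx 25.2\%\)) but false negatives at a higher rate (\(192/217 \approx 88.5\%\) vs. \(132/217 \approx 60.8\%\)). Accuracy rises by approximately 15 percentage points, although both models remain below the no-information rate of \(89.95\%\), and kappa and balanced accuracy are slightly lower for the second model (0.0785 vs. 0.0809 and 0.5319 vs. 0.5700).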
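The comparison between the two models can be checked directly from the two confusion matrices. The sketch below treats class 1 as the event of interest and assumes tables stored as `table[prediction][reference]`; the names `error_rates`, `baseline`, and `with_homicide` are hypothetical labels introduced here.

```python
def error_rates(table):
    """Error rates for a 2x2 table indexed [prediction][reference], event = class 1."""
    tn, fn = table[0][0], table[0][1]  # predicted 0: correct rejections, missed events
    fp, tp = table[1][0], table[1][1]  # predicted 1: false alarms, detected events
    n = tn + fn + fp + tp
    return {
        "accuracy": (tp + tn) / n,
        "fn_rate": fn / (fn + tp),  # share of actual class 1 cases predicted as 0
        "fp_rate": fp / (fp + tn),  # share of actual class 0 cases predicted as 1
        # Accuracy of always predicting the more common reference class.
        "no_information_rate": max(tn + fp, fn + tp) / n,
    }


# Counts copied from the two confusion matrices in this section.
baseline = error_rates([[1454, 132], [489, 85]])
with_homicide = error_rates([[1843, 192], [100, 25]])

accuracy_gain = with_homicide["accuracy"] - baseline["accuracy"]

for label, rates in (("baseline", baseline), ("with homicide rate", with_homicide)):
    print(label, {name: round(value, 4) for name, value in rates.items()})
print("accuracy gain:", round(accuracy_gain, 4))
```

Reading the two error rates side by side makes the trade-off explicit: the accuracy gain comes from fewer false positives among the large class 0 group, not from better detection of class 1.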