7. Initial CHAID Model Predicting Malignant Lung Cancer

  1. In a predictive model including all patients who do not meet eligibility for LDCT, what is the overall predictive ability of the following combined data for predicting a lung cancer diagnosis:
  1. Age
  2. Race/ethnicity
  3. Gender
  4. Smoking status (former vs. current)

The CHAID model below is the result of optimizing over some of the following hyperparameters using 70% of the sample selected at random:

A visualization of the resulting optimized model is shown below:

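The predictions on the remaining \(30\%\) of the sample, held out for testing, are shown below as a confusion matrix. Rows are predictions and columns are reference labels. Taking class 1 (a lung cancer diagnosis) as the event of interest, the upper right cell holds the False Negatives and the lower left cell holds the False Positives. Note that the output itself reports class 0 as the 'Positive' class, so its Sensitivity and Specificity values describe class 0.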

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1454  132
##          1  489   85
##                                           
##                Accuracy : 0.7125          
##                  95% CI : (0.6929, 0.7315)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0809          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7483          
##             Specificity : 0.3917          
##          Pos Pred Value : 0.9168          
##          Neg Pred Value : 0.1481          
##              Prevalence : 0.8995          
##          Detection Rate : 0.6731          
##    Detection Prevalence : 0.7343          
##       Balanced Accuracy : 0.5700          
##                                           
##        'Positive' Class : 0               
## 

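The resulting model has \(71.25\%\) accuracy, which is below the no-information rate of \(89.95\%\). With class 1 as the event of interest, the false negative rate (\(132/217 \approx 60.8\%\)) is higher than the false positive rate (\(489/1943 \approx 25.2\%\)).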
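The comparison of false negative and false positive rates can be checked directly from the four cell counts of the confusion matrix. The sketch below is a minimal illustration, not the code used for the analysis: the function name `rates` and its argument names are hypothetical, and it assumes class 1 (lung cancer) is the event of interest, which is the opposite of the 'Positive' class reported in the output.

```python
def rates(tn, fn, fp, tp):
    """Summarize a 2x2 confusion matrix, treating class 1 as the event of interest.

    tn: predicted 0, reference 0    fn: predicted 0, reference 1
    fp: predicted 1, reference 0    tp: predicted 1, reference 1
    """
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        # Share of true class-1 cases that the model labels as class 0.
        "false_negative_rate": fn / (fn + tp),
        # Share of true class-0 cases that the model labels as class 1.
        "false_positive_rate": fp / (fp + tn),
    }


# Cell counts copied from the confusion matrix of the initial model.
initial = rates(tn=1454, fn=132, fp=489, tp=85)

for name, value in initial.items():
    print(f"{name}: {value:.4f}")
```

In absolute counts there are more false positives (489) than false negatives (132). The ordering reverses for rates because only 217 of the 2160 test cases belong to class 1.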

8. CHAID Model with Homicide Rate Predicting Malignant Lung Cancer

  1. Does adding exposure to neighborhood violence (Homicide rate >= Mean vs. < Mean) increase the predictive ability of the model?

    1. Expectation = Yes

Optimizing the CHAID model with the homicide rate per 100k variable added, we get the following model:

Applying the model to the \(30\%\) held out test set results in the following:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1843  192
##          1  100   25
##                                          
##                Accuracy : 0.8648         
##                  95% CI : (0.8497, 0.879)
##     No Information Rate : 0.8995         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0785         
##                                          
##  Mcnemar's Test P-Value : 1.007e-07      
##                                          
##             Sensitivity : 0.9485         
##             Specificity : 0.1152         
##          Pos Pred Value : 0.9057         
##          Neg Pred Value : 0.2000         
##              Prevalence : 0.8995         
##          Detection Rate : 0.8532         
##    Detection Prevalence : 0.9421         
##       Balanced Accuracy : 0.5319         
##                                          
##        'Positive' Class : 0              
## 

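This model has \(86.48\%\) accuracy with, again, a higher rate of false negatives (\(192/217 \approx 88.5\%\)) than false positives (\(100/1943 \approx 5.1\%\)). Compared with the previous model without the homicide rate, it produces false positives at a lower rate but false negatives at a higher rate. The gain in accuracy is approximately \(15\) percentage points, although accuracy remains below the no-information rate of \(89.95\%\) and balanced accuracy falls from \(0.5700\) to \(0.5319\).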
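To put the accuracy of both models in context, the sketch below recomputes accuracy, the no-information rate, and Cohen's kappa from the two confusion matrices printed above. It is an illustrative check under stated assumptions rather than the original analysis code: the helper `summarize` and the variable names are hypothetical, and the matrices are laid out as in the output (rows are predictions, columns are reference labels).

```python
def summarize(matrix):
    """Return (accuracy, no_information_rate, kappa) for a 2x2 matrix.

    matrix[i][j] is the number of cases with prediction i and reference j.
    """
    n = sum(sum(row) for row in matrix)
    accuracy = (matrix[0][0] + matrix[1][1]) / n

    row_totals = [sum(row) for row in matrix]
    col_totals = [matrix[0][j] + matrix[1][j] for j in range(2)]

    # Agreement expected by chance, given the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    kappa = (accuracy - expected) / (1 - expected)

    # Accuracy obtained by always predicting the most common reference class.
    no_information_rate = max(col_totals) / n
    return accuracy, no_information_rate, kappa


# Counts copied from the two confusion matrices in the report.
without_homicide = [[1454, 132], [489, 85]]
with_homicide = [[1843, 192], [100, 25]]

acc_a, nir_a, kappa_a = summarize(without_homicide)
acc_b, nir_b, kappa_b = summarize(with_homicide)

print(f"without homicide rate: acc={acc_a:.4f} nir={nir_a:.4f} kappa={kappa_a:.4f}")
print(f"with homicide rate:    acc={acc_b:.4f} nir={nir_b:.4f} kappa={kappa_b:.4f}")
print(f"accuracy gain: {100 * (acc_b - acc_a):.2f} percentage points")
```

Because roughly \(90\%\) of the test cases are class 0, a model can raise its accuracy by predicting class 0 more often. Kappa corrects for that chance agreement, which is why it is useful to read it alongside accuracy when comparing the two models.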