7. Initial CHAID Model Predicting Malignant Lung Cancer

In a predictive model that includes all patients who do not meet eligibility for low-dose computed tomography (LDCT), what is the overall ability of the following data, combined, to predict a lung cancer diagnosis?

Age
Race/ethnicity
Gender
Smoking status (former vs. current)

The CHAID model below is the result of optimizing over some of the following hyperparameters using 70% of the sample selected at random:

alpha2: Level of significance used for merging of predictor categories (step 2).
alpha3: If set to a positive value \(< 1\), level of significance used for the splitting of formerly merged categories of the predictor (step 3). Otherwise, step 3 is omitted (the default).
alpha4: Level of significance used for splitting of a node in the most significant predictor (step 5).
minsplit: Number of observations in a node below which no further split is attempted.
minbucket: Minimum number of observations in terminal nodes.
minprob: Minimum relative frequency of observations in terminal nodes.
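
As a rough illustration of how the random split and a search over these hyperparameters could be set up, the sketch below uses base R plus `chaid_control()` and `chaid()` from the CHAID package. The data frame `cohort`, its column names, the seed, and the grid values are all hypothetical stand-ins; the actual data and the tuned values are not shown in this report. The fitting step runs only if the CHAID package is installed.

```r
# Hypothetical stand-in data: the real cohort is not shown in this report.
set.seed(42)  # assumed seed, used only to make this illustration reproducible
n <- 500
cohort <- data.frame(
  age_group = factor(sample(c("<50", "50-64", "65+"), n, replace = TRUE)),
  race_eth  = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gender    = factor(sample(c("F", "M"), n, replace = TRUE)),
  smoking   = factor(sample(c("former", "current"), n, replace = TRUE)),
  cancer    = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.9, 0.1)))
)

# 70% of rows, selected at random, for training; the remaining 30% for testing.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- cohort[train_idx, ]
test  <- cohort[-train_idx, ]

# Candidate hyperparameter values (illustrative, not the values used here).
grid <- expand.grid(
  alpha2    = c(0.01, 0.05),
  alpha4    = c(0.01, 0.05),
  minsplit  = c(20, 50),
  minbucket = c(7, 20)
)

# Fit one tree per grid row, but only if the CHAID package is available.
if (requireNamespace("CHAID", quietly = TRUE)) {
  fits <- lapply(seq_len(nrow(grid)), function(i) {
    ctrl <- CHAID::chaid_control(
      alpha2    = grid$alpha2[i],
      alpha4    = grid$alpha4[i],
      minsplit  = grid$minsplit[i],
      minbucket = grid$minbucket[i]
    )
    CHAID::chaid(cancer ~ age_group + race_eth + gender + smoking,
                 data = train, control = ctrl)
  })
}
```

Each fitted tree would then be scored on a validation criterion of choice to pick the best grid row; which criterion was used for the model in this report is not stated.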

A visualization of the resulting optimized model is shown below:

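The predictions on the remaining \(30\%\) of the sample, held out for testing, are shown below as a confusion matrix. Rows are predictions and columns are reference labels, so the upper-right cell counts false negatives (cancer cases predicted as 0) and the lower-left cell counts false positives (non-cases predicted as 1). Note that the printed statistics treat 0 as the 'Positive' class, so Sensitivity and Specificity are stated relative to the no-cancer class.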

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1454  132
##          1  489   85
##                                           
##                Accuracy : 0.7125          
##                  95% CI : (0.6929, 0.7315)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0809          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7483          
##             Specificity : 0.3917          
##          Pos Pred Value : 0.9168          
##          Neg Pred Value : 0.1481          
##              Prevalence : 0.8995          
##          Detection Rate : 0.6731          
##    Detection Prevalence : 0.7343          
##       Balanced Accuracy : 0.5700          
##                                           
##        'Positive' Class : 0               
##

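The resulting model has \(71.25\%\) accuracy, which is below the no-information rate of \(89.95\%\). Its false negative rate (\(132/217 \approx 60.8\%\)) is higher than its false positive rate (\(489/1943 \approx 25.2\%\)).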
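
The headline statistics in the printout can be reproduced directly from the four cell counts. The sketch below is base R only; the variable names are illustrative, and it assumes, as the printout states, that class 0 is the 'Positive' class.

```r
# Counts copied from the confusion matrix above.
# matrix() fills column by column, so each pair below is one Reference column.
cm <- matrix(c(1454, 489,   # Reference = 0: predicted 0, predicted 1
               132,  85),   # Reference = 1: predicted 0, predicted 1
             nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n <- sum(cm)

# Accuracy: share of all cases on the diagonal.
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy of always predicting the larger reference class.
nir <- max(colSums(cm)) / n

# With class 0 as 'Positive', sensitivity is computed within Reference = 0
# and specificity within Reference = 1.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, nir = nir,
              sensitivity = sensitivity, specificity = specificity,
              balanced_accuracy = balanced_accuracy), 4))
```

Because class 0 is the 'Positive' class here, the clinically relevant miss rate for cancer cases is `1 - specificity`, not `1 - sensitivity`.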

8. CHAID Model with Homicide Rate Predicting Malignant Lung Cancer

Does adding exposure to neighborhood violence (homicide rate \(\geq\) mean vs. \(<\) mean) increase the predictive ability of the model?
1. Expectation = Yes

Optimizing the CHAID model with the homicide rate (per 100k) variable included, we get the following model:

Applying the model to the \(30\%\) held out test set results in the following:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1843  192
##          1  100   25
##                                          
##                Accuracy : 0.8648         
##                  95% CI : (0.8497, 0.879)
##     No Information Rate : 0.8995         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0785         
##                                          
##  Mcnemar's Test P-Value : 1.007e-07      
##                                          
##             Sensitivity : 0.9485         
##             Specificity : 0.1152         
##          Pos Pred Value : 0.9057         
##          Neg Pred Value : 0.2000         
##              Prevalence : 0.8995         
##          Detection Rate : 0.8532         
##    Detection Prevalence : 0.9421         
##       Balanced Accuracy : 0.5319         
##                                          
##        'Positive' Class : 0              
##

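This model has \(86.48\%\) accuracy, a gain of approximately \(15\) percentage points over the previous model, although it still falls below the no-information rate of \(89.95\%\). Again, the false negative rate is higher than the false positive rate. Relative to the model without the homicide rate, false positives drop (\(100\) vs. \(489\)) but false negatives rise (\(192\) vs. \(132\)), so the accuracy gain comes from predicting class 0 more often. Kappa (\(0.0785\) vs. \(0.0809\)) and balanced accuracy (\(0.5319\) vs. \(0.5700\)) do not improve.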
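
To compare the two confusion matrices on the same footing, the sketch below recomputes accuracy, Cohen's kappa, and the two error rates from the printed counts. It is base R only; the helper names `make_cm` and `summarise_cm` are hypothetical, and treating class 1 as the event of interest (cancer) is an assumption based on how the report labels false negatives.

```r
# Build a 2x2 matrix (rows = prediction, columns = reference) from counts
# given column by column: (pred 0, pred 1) for Reference 0, then Reference 1.
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(Prediction = c("0", "1"),
                         Reference  = c("0", "1")))
}

# Summarise one confusion matrix.
# Class "1" is treated as the event of interest (an assumption, see above).
summarise_cm <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  c(accuracy = po,
    kappa    = (po - pe) / (1 - pe),
    fn_rate  = cm["0", "1"] / sum(cm[, "1"]),  # events predicted as 0
    fp_rate  = cm["1", "0"] / sum(cm[, "0"]))  # non-events predicted as 1
}

# Counts copied from the two printouts in this report.
baseline      <- make_cm(c(1454, 489, 132, 85))
with_homicide <- make_cm(c(1843, 100, 192, 25))

comparison <- rbind(baseline      = summarise_cm(baseline),
                    with_homicide = summarise_cm(with_homicide))
print(round(comparison, 4))

# Change in accuracy, in percentage points.
accuracy_gain <- 100 * (comparison["with_homicide", "accuracy"] -
                        comparison["baseline", "accuracy"])
print(round(accuracy_gain, 2))
```

Reporting kappa and the per-class error rates alongside accuracy is a design choice suited to imbalanced outcomes like this one, where a model can gain accuracy simply by predicting the majority class more often.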

Malignant Lung Cancer Prediction

Alexis Kwan

2023-04-19
