The data used for analysis and modeling in this study were obtained from electronic medical records (EMR) at a large academic medical center in Chicago, using a standardized form that collected data on demographics, smoking history, and lung cancer screening eligibility.
We examine several cross sections of low-dose CT (LDCT) screening eligibility across demographics, environmental violence measures, and smoking rates to establish baselines for the characteristics of our sample, that is, its distributions across demographic attributes including race/ethnicity, gender, and age. From those statistics we set expectations for which recommendations are sensible and which segments of the sample have potential for additional screening to reduce cancer rates and mortality. This analysis is also intended to reveal potentially significant differences between sample subgroups.
A variety of decision trees are used to dissect the distribution of cancer rates across demographics, smoking packyears, and environmental violence. Trees are chosen for their interpretability, and therefore their transparency, in guiding decisions and policy making for lung cancer screening. Initially a chi-square automatic interaction detection (CHAID) tree is used to model the data; conditional inference trees and other models are considered later because of performance improvements. Each model is then checked against a combination of patient screening eligibility, smoking rates, and rates of malignant cancer.
| Smoking Status | Count | Percentage |
|---|---|---|
| current | 2,252 | 31.29% |
| former | 1,886 | 26.20% |
| never | 2,217 | 30.80% |
| NA | 843 | 11.71% |
| Race/Ethnicity | Smoking Status | Count | Percentage within Race |
|---|---|---|---|
| Black | current | 1,668 | 36.09% |
| Black | former | 1,182 | 25.57% |
| Black | never | 1,268 | 27.43% |
| Black | NA | 504 | 10.90% |
| Latinx | current | 271 | 16.74% |
| Latinx | former | 446 | 27.55% |
| Latinx | never | 692 | 42.74% |
| Latinx | NA | 210 | 12.97% |
| White | current | 313 | 32.71% |
| White | former | 258 | 26.96% |
| White | never | 257 | 26.85% |
| White | NA | 129 | 13.48% |
| Gender | Smoking Status | Count | Percentage within Gender |
|---|---|---|---|
| FEMALE | current | 1,135 | 28.07% |
| FEMALE | former | 941 | 23.27% |
| FEMALE | never | 1,549 | 38.30% |
| FEMALE | NA | 419 | 10.36% |
| MALE | current | 1,117 | 35.45% |
| MALE | former | 944 | 29.96% |
| MALE | never | 667 | 21.17% |
| MALE | NA | 423 | 13.42% |
| UNKNOWN | former | 1 | 33.33% |
| UNKNOWN | never | 1 | 33.33% |
| UNKNOWN | NA | 1 | 33.33% |
| Smoking Status | Homicide Rate Exposure | Count | Percentage within Smoking Group |
|---|---|---|---|
| current | <mean | 1,308 | 58.08% |
| current | >=mean | 938 | 41.65% |
| current | NA | 6 | 0.27% |
| former | <mean | 1,169 | 61.98% |
| former | >=mean | 717 | 38.02% |
| never | <mean | 1,473 | 66.44% |
| never | >=mean | 741 | 33.42% |
| never | NA | 3 | 0.14% |
| NA | <mean | 554 | 65.72% |
| NA | >=mean | 287 | 34.05% |
| NA | NA | 2 | 0.24% |
| Smoking Status | Mean Packyears |
|---|---|
| current | 16.36 |
| former | 22.27 |
| Smoking Status | Race/Ethnicity | Mean Packyears |
|---|---|---|
| current | Black | 15.24 |
| current | Latinx | 17.30 |
| current | White | 21.48 |
| former | Black | 22.62 |
| former | Latinx | 16.95 |
| former | White | 29.07 |
| Smoking Status | Gender | Mean Packyears |
|---|---|---|
| current | FEMALE | 14.94 |
| current | MALE | 17.75 |
| former | FEMALE | 22.85 |
| former | MALE | 21.62 |
| former | UNKNOWN | NaN |
| Smoking Status | Homicide Rate Exposure | Mean Packyears |
|---|---|---|
| current | <mean | 17.83 |
| current | >=mean | 14.11 |
| current | NA | 52.00 |
| former | <mean | 21.61 |
| former | >=mean | 23.28 |
Eligibility criteria: Aged 50 to 80, 20 pack year smoking history, and no prior history of lung cancer
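The three criteria above can be written as a single check. The following is a minimal sketch, not the code used in this analysis: the function name `ldct_eligible` and its argument names are hypothetical, the age bounds are assumed to be inclusive, "20 pack year smoking history" is assumed to mean at least 20 packyears, and a missing packyear value is assumed to count as not eligible.

```python
def ldct_eligible(age, packyears, prior_lung_cancer):
    """Return True only when all three screening criteria are met.

    Assumptions (not stated in the source): age bounds are inclusive,
    the packyear criterion means >= 20, and a missing packyear value
    (None) is treated as not meeting the criterion.
    """
    age_ok = 50 <= age <= 80
    packyear_ok = packyears is not None and packyears >= 20
    no_prior_ok = not prior_lung_cancer
    return age_ok and packyear_ok and no_prior_ok


# Hypothetical patients, for illustration only.
print(ldct_eligible(age=62, packyears=25.0, prior_lung_cancer=False))  # all criteria met
print(ldct_eligible(age=62, packyears=13.8, prior_lung_cancer=False))  # fails packyear criterion
```

Keeping the three criteria as separate named conditions mirrors the eligibility tables that follow, which report age, packyear, and prior-diagnosis eligibility in separate columns.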
| Race/Ethnicity | Age Eligibility | Count | Percent | Packyear Eligibility | Count | Percent | No Prior Diagnosis Eligibility | Count | Percent |
|---|---|---|---|---|---|---|---|---|---|
| Black | No | 1,046 | 22.63% | No | 4,384 | 94.85% | No | 4,075 | 88.17% |
| Black | Yes | 3,576 | 77.37% | Yes | 238 | 5.15% | Yes | 547 | 11.83% |
| Latinx | No | 446 | 27.55% | No | 1,575 | 97.28% | No | 1,469 | 90.74% |
| Latinx | Yes | 1,173 | 72.45% | Yes | 44 | 2.72% | Yes | 150 | 9.26% |
| White | No | 204 | 21.32% | No | 891 | 93.10% | No | 836 | 87.36% |
| White | Yes | 753 | 78.68% | Yes | 66 | 6.90% | Yes | 121 | 12.64% |
| Gender | Age Eligibility | Count | Percent | Packyear Eligibility | Count | Percent | No Prior Diagnosis Eligibility | Count | Percent |
|---|---|---|---|---|---|---|---|---|---|
| FEMALE | No | 1,049 | 25.94% | No | 3,877 | 95.87% | No | 3,617 | 89.44% |
| FEMALE | Yes | 2,995 | 74.06% | Yes | 167 | 4.13% | Yes | 427 | 10.56% |
| MALE | No | 645 | 20.47% | No | 2,970 | 94.26% | No | 2,760 | 87.59% |
| MALE | Yes | 2,506 | 79.53% | Yes | 181 | 5.74% | Yes | 391 | 12.41% |
| UNKNOWN | No | 2 | 66.67% | No | 3 | 100.00% | No | 3 | 100.00% |
| UNKNOWN | Yes | 1 | 33.33% | Yes | 0 | 0.00% | Yes | 0 | 0.00% |
| Race/Ethnicity | Gender | Age Eligibility | Count | Percent within Race/Gender | Packyear Eligibility | Count | Percent within Race/Gender | No Prior Diagnosis Eligibility | Count | Percent within Race/Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| Black | FEMALE | No | 711 | 25.32% | No | 2,681 | 95.48% | No | 2,500 | 89.03% |
| Black | FEMALE | Yes | 2,097 | 74.68% | Yes | 127 | 4.52% | Yes | 308 | 10.97% |
| Black | MALE | No | 334 | 18.43% | No | 1,701 | 93.87% | No | 1,573 | 86.81% |
| Black | MALE | Yes | 1,478 | 81.57% | Yes | 111 | 6.13% | Yes | 239 | 13.19% |
| Black | UNKNOWN | No | 1 | 50.00% | No | 2 | 100.00% | No | 2 | 100.00% |
| Black | UNKNOWN | Yes | 1 | 50.00% | Yes | 0 | 0.00% | Yes | 0 | 0.00% |
| Latinx | FEMALE | No | 241 | 29.94% | No | 793 | 98.51% | No | 737 | 91.55% |
| Latinx | FEMALE | Yes | 564 | 70.06% | Yes | 12 | 1.49% | Yes | 68 | 8.45% |
| Latinx | MALE | No | 204 | 25.09% | No | 781 | 96.06% | No | 731 | 89.91% |
| Latinx | MALE | Yes | 609 | 74.91% | Yes | 32 | 3.94% | Yes | 82 | 10.09% |
| Latinx | UNKNOWN | No | 1 | 100.00% | No | 1 | 100.00% | No | 1 | 100.00% |
| White | FEMALE | No | 97 | 22.51% | No | 403 | 93.50% | No | 380 | 88.17% |
| White | FEMALE | Yes | 334 | 77.49% | Yes | 28 | 6.50% | Yes | 51 | 11.83% |
| White | MALE | No | 107 | 20.34% | No | 488 | 92.78% | No | 456 | 86.69% |
| White | MALE | Yes | 419 | 79.66% | Yes | 38 | 7.22% | Yes | 70 | 13.31% |
In a predictive model including all patients who do not meet eligibility for LDCT, what is the overall predictive ability of the following combined data for predicting a lung cancer diagnosis:
## [1] "[ 0, 5) packs/year" "[ 5, 10) packs/year" "[ 10, 15) packs/year"
## [4] "[ 15, 20) packs/year" "[ 20,250] packs/year" "never smoked"
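The ordinal packyear categories listed above can be produced by a small binning helper. This is a sketch under stated assumptions rather than the code behind the output: the function name `packyear_bin`, the `never_smoked` flag, and the simplified label format (without the padding or the "packs/year" suffix) are all hypothetical. Intervals are closed on the left and open on the right, except the last, which is closed on both ends.

```python
from bisect import bisect_right


def packyear_bin(packyears, edges=(0, 5, 10, 15, 20, 250), never_smoked=False):
    """Map a packyear value to an ordinal interval label.

    Never-smokers get their own category, as in the listing above.
    """
    if never_smoked:
        return "never smoked"
    if not edges[0] <= packyears <= edges[-1]:
        raise ValueError(f"packyears {packyears} outside [{edges[0]}, {edges[-1]}]")
    # Index of the interval whose left edge is <= packyears; the clamp
    # puts the maximum value into the last (right-closed) interval.
    i = min(bisect_right(edges, packyears) - 1, len(edges) - 2)
    closing = "]" if i == len(edges) - 2 else ")"
    return f"[{edges[i]}, {edges[i + 1]}{closing}"


print(packyear_bin(13.8))
print(packyear_bin(20))
print(packyear_bin(0, never_smoked=True))
```

Passing a different `edges` tuple gives other binnings without changing the function, for example unit-width bins over part of the range.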
We examine the initial possibilities for recommendations and the differences across demographic splits using a decision tree that uses the chi-squared statistic to choose the splits that determine how subsets are classified. The CHAID model below is the result of optimizing over some of the following hyperparameters, using 70% of the sample selected at random:
alpha2: Level of significance used for merging of predictor categories (step 2).
alpha3: If set to a positive value \(< 1\), level of significance used for the splitting of formerly merged categories of the predictor (step 3). Otherwise, step 3 is omitted (the default).
alpha4: Level of significance used for splitting of a node in the most significant predictor (step 5).
minsplit: Number of observations in the split response at which no further split is desired.
minbucket: Minimum number of observations in terminal nodes.
minprob: Minimum frequency of observations in terminal nodes.
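The significance levels above are applied to chi-squared tests of association between a predictor and the outcome. As a rough illustration of the quantity involved, the sketch below computes the Pearson chi-squared statistic for a contingency table of predictor category by outcome. The function name and the example counts are hypothetical and are not taken from this study, and the sketch stops at the statistic: converting it to a p-value and applying any multiplicity adjustment is left to the tree-fitting software.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a table of observed counts.

    `table` is a list of rows, one row per predictor category and one
    column per outcome class.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if predictor and outcome were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two predictor categories,
# columns are (no malignancy, malignancy).
example = [[90, 10], [70, 30]]
print(chi_squared_statistic(example))
```

A larger statistic means a stronger association, so a candidate split with a larger statistic (a smaller p-value) is more likely to pass a threshold such as alpha4.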
A visualization of the resulting optimized model is shown below:
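The predictions on the remaining \(30\%\) of the sample, held out for testing, are shown below as a confusion matrix. Treating a malignant diagnosis (class 1) as the outcome of interest, the upper right cell counts false negatives and the lower left cell counts false positives. Note that the printed statistics treat class 0 as the 'Positive' class, so the reported sensitivity is the share of non-malignant cases that are correctly classified.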
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1404 106
## 1 539 111
##
## Accuracy : 0.7014
## 95% CI : (0.6816, 0.7206)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1241
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7226
## Specificity : 0.5115
## Pos Pred Value : 0.9298
## Neg Pred Value : 0.1708
## Prevalence : 0.8995
## Detection Rate : 0.6500
## Detection Prevalence : 0.6991
## Balanced Accuracy : 0.6171
##
## 'Positive' Class : 0
##
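The resulting model has an accuracy of \(0.7014\) on the test set, with a higher false negative rate (106 of 217 malignant cases missed, or 48.8%) than false positive rate (539 of 1,943 non-malignant cases flagged, or 27.7%). This accuracy is below the no-information rate of \(0.8995\).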
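The statistics in the confusion matrix output above can be recomputed by hand from the four cell counts. The sketch below does this for the first CHAID model; the function name `summarize` and the dictionary keys are hypothetical, and the layout (rows are predictions, columns are reference classes, both in the order 0, 1) follows the printed matrix.

```python
def summarize(matrix):
    """Recompute summary statistics from a 2x2 confusion matrix.

    `matrix` is [[pred0_ref0, pred0_ref1], [pred1_ref0, pred1_ref1]].
    """
    (p0_r0, p0_r1), (p1_r0, p1_r1) = matrix
    total = p0_r0 + p0_r1 + p1_r0 + p1_r1
    ref0, ref1 = p0_r0 + p1_r0, p0_r1 + p1_r1
    pred0, pred1 = p0_r0 + p0_r1, p1_r0 + p1_r1

    accuracy = (p0_r0 + p1_r1) / total
    recall_class0 = p0_r0 / ref0  # printed as Sensitivity ('Positive' class is 0)
    recall_class1 = p1_r1 / ref1  # printed as Specificity
    # Agreement expected by chance, used for Cohen's kappa.
    expected = (pred0 * ref0 + pred1 * ref1) / total**2
    return {
        "accuracy": accuracy,
        "no_information_rate": max(ref0, ref1) / total,
        "recall_class0": recall_class0,
        "recall_class1": recall_class1,
        "missed_malignant_rate": 1 - recall_class1,
        "balanced_accuracy": (recall_class0 + recall_class1) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


chaid_without_homicide = [[1404, 106], [539, 111]]
for name, value in summarize(chaid_without_homicide).items():
    print(f"{name}: {value:.4f}")
```

Because the sample is dominated by class 0, accuracy alone is a weak yardstick here; comparing it with the no-information rate and looking at recall for class 1 gives a clearer picture of how many malignant cases the model finds.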
Does adding exposure to neighborhood violence (homicide rate >= mean vs. < mean) increase the predictive ability of the model?
a. Expectation = Yes
Remodeling with the homicide rate per 100k variable, we get the following model:
Applying the model to the \(30\%\) held out test set results in the following:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1709 157
## 1 234 60
##
## Accuracy : 0.819
## 95% CI : (0.8021, 0.835)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 1.0000000
##
## Kappa : 0.1348
##
## Mcnemar's Test P-Value : 0.0001213
##
## Sensitivity : 0.8796
## Specificity : 0.2765
## Pos Pred Value : 0.9159
## Neg Pred Value : 0.2041
## Prevalence : 0.8995
## Detection Rate : 0.7912
## Detection Prevalence : 0.8639
## Balanced Accuracy : 0.5780
##
## 'Positive' Class : 0
##
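This model has an accuracy of \(0.819\) and, again, a higher false negative rate than false positive rate. Compared with the previous model without the homicide rate, it produces false positives at a lower rate (234 of 1,943, or 12.0%, versus 27.7%) but false negatives at a higher rate (157 of 217, or 72.4%, versus 48.8%). Accuracy increases by approximately \(0.118\) (from \(0.7014\) to \(0.819\)), although balanced accuracy falls from \(0.6171\) to \(0.5780\) and accuracy remains below the no-information rate of \(0.8995\).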
Using intervals of size 1 between 5 and 15, with ordinal categories as below:
## [1] "[ 0, 5)" "[ 5, 6)" "[ 6, 7)" "[ 7, 8)" "[ 8, 9)"
## [6] "[ 9, 10)" "[ 10, 11)" "[ 11, 12)" "[ 12, 13)" "[ 13, 14)"
## [11] "[ 14, 15)" "[ 15, 20)" "[ 20,250]" "never smoked"
## [1] "Lung Cancer Model With Narrower Bins on Total Sample"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1696 154
## 1 247 63
##
## Accuracy : 0.8144
## 95% CI : (0.7973, 0.8305)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1371
##
## Mcnemar's Test P-Value : 4.343e-06
##
## Sensitivity : 0.8729
## Specificity : 0.2903
## Pos Pred Value : 0.9168
## Neg Pred Value : 0.2032
## Prevalence : 0.8995
## Detection Rate : 0.7852
## Detection Prevalence : 0.8565
## Balanced Accuracy : 0.5816
##
## 'Positive' Class : 0
##
Using intervals of size 5 between 0 and 10 and then size 1 between 10 and 15, with ordinal categories as below:
## [1] "[ 0, 5)" "[ 5, 10)" "[ 10, 11)" "[ 11, 12)" "[ 12, 13)"
## [6] "[ 13, 14)" "[ 14, 15)" "[ 15, 20)" "[ 20,250]" "never smoked"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1696 154
## 1 247 63
##
## Accuracy : 0.8144
## 95% CI : (0.7973, 0.8305)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1371
##
## Mcnemar's Test P-Value : 4.343e-06
##
## Sensitivity : 0.8729
## Specificity : 0.2903
## Pos Pred Value : 0.9168
## Neg Pred Value : 0.2032
## Prevalence : 0.8995
## Detection Rate : 0.7852
## Detection Prevalence : 0.8565
## Balanced Accuracy : 0.5816
##
## 'Positive' Class : 0
##
## [1] "Race/Ethnicity Subset with Bin Size 5: Black"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 974 86
## 1 263 64
##
## Accuracy : 0.7484
## 95% CI : (0.7247, 0.771)
## No Information Rate : 0.8919
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.141
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7874
## Specificity : 0.4267
## Pos Pred Value : 0.9189
## Neg Pred Value : 0.1957
## Prevalence : 0.8919
## Detection Rate : 0.7022
## Detection Prevalence : 0.7642
## Balanced Accuracy : 0.6070
##
## 'Positive' Class : 0
##
## [1] "Race/Ethnicity Subset with Bin Size 5: Latinx"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 448 30
## 1 8 0
##
## Accuracy : 0.9218
## 95% CI : (0.8943, 0.9441)
## No Information Rate : 0.9383
## P-Value [Acc > NIR] : 0.9412190
##
## Kappa : -0.0267
##
## Mcnemar's Test P-Value : 0.0006577
##
## Sensitivity : 0.9825
## Specificity : 0.0000
## Pos Pred Value : 0.9372
## Neg Pred Value : 0.0000
## Prevalence : 0.9383
## Detection Rate : 0.9218
## Detection Prevalence : 0.9835
## Balanced Accuracy : 0.4912
##
## 'Positive' Class : 0
##
## [1] "Race/Ethnicity Subset with Bin Size 5: White"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 77 2
## 1 181 28
##
## Accuracy : 0.3646
## 95% CI : (0.3089, 0.4231)
## No Information Rate : 0.8958
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0637
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.2984
## Specificity : 0.9333
## Pos Pred Value : 0.9747
## Neg Pred Value : 0.1340
## Prevalence : 0.8958
## Detection Rate : 0.2674
## Detection Prevalence : 0.2743
## Balanced Accuracy : 0.6159
##
## 'Positive' Class : 0
##
Eligibility Versus Prediction for Black Subset

| Malignant | LDCT Eligible | Count | Percent within Malignant Group | Predicted Class | Count | Percent within Malignant Group |
|---|---|---|---|---|---|---|
| yes | No | 484 | 95.09% | 1 | 202 | 39.69% |
| yes | Yes | 25 | 4.91% | 0 | 307 | 60.31% |
| no | No | 3,938 | 95.75% | 1 | 811 | 19.72% |
| no | Yes | 175 | 4.25% | 0 | 3,302 | 80.28% |
## [1] "Race/Ethnicity Subset with Narrower Bins: Black"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 974 86
## 1 263 64
##
## Accuracy : 0.7484
## 95% CI : (0.7247, 0.771)
## No Information Rate : 0.8919
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.141
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7874
## Specificity : 0.4267
## Pos Pred Value : 0.9189
## Neg Pred Value : 0.1957
## Prevalence : 0.8919
## Detection Rate : 0.7022
## Detection Prevalence : 0.7642
## Balanced Accuracy : 0.6070
##
## 'Positive' Class : 0
##
Eligibility Versus Prediction for Latinx Subset

| Malignant | LDCT Eligible | Count | Percent within Malignant Group | Predicted Class | Count | Percent within Malignant Group |
|---|---|---|---|---|---|---|
| yes | No | 96 | 96.97% | 1 | 8 | 8.08% |
| yes | Yes | 3 | 3.03% | 0 | 91 | 91.92% |
| no | No | 1,490 | 98.03% | 1 | 34 | 2.24% |
| no | Yes | 30 | 1.97% | 0 | 1,486 | 97.76% |
## [1] "Race/Ethnicity Subset with Narrower Bins: Latinx"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 446 30
## 1 10 0
##
## Accuracy : 0.9177
## 95% CI : (0.8896, 0.9406)
## No Information Rate : 0.9383
## P-Value [Acc > NIR] : 0.971979
##
## Kappa : -0.0318
##
## Mcnemar's Test P-Value : 0.002663
##
## Sensitivity : 0.9781
## Specificity : 0.0000
## Pos Pred Value : 0.9370
## Neg Pred Value : 0.0000
## Prevalence : 0.9383
## Detection Rate : 0.9177
## Detection Prevalence : 0.9794
## Balanced Accuracy : 0.4890
##
## 'Positive' Class : 0
##
Eligibility Versus Prediction for White Subset

| Malignant | LDCT Eligible | Count | Percent within Malignant Group | Predicted Class | Count | Percent within Malignant Group |
|---|---|---|---|---|---|---|
| yes | No | 97 | 97.98% | 1 | 64 | 64.65% |
| yes | Yes | 2 | 2.02% | 0 | 35 | 35.35% |
| no | No | 812 | 94.64% | 1 | 493 | 57.46% |
| no | Yes | 46 | 5.36% | 0 | 365 | 42.54% |
## [1] "Race/Ethnicity Subset with Narrower Bins: White"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 114 14
## 1 144 16
##
## Accuracy : 0.4514
## 95% CI : (0.3929, 0.5108)
## No Information Rate : 0.8958
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0085
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.4419
## Specificity : 0.5333
## Pos Pred Value : 0.8906
## Neg Pred Value : 0.1000
## Prevalence : 0.8958
## Detection Rate : 0.3958
## Detection Prevalence : 0.4444
## Balanced Accuracy : 0.4876
##
## 'Positive' Class : 0
##
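Let’s compare the CHAID trees to a logistic regression model. The model includes an interaction term between smokingstatus and packyear, since packyear is only applicable to patients who have ever smoked. The interaction coefficient is reported as NA in the summary below because it could not be estimated separately from the other terms (a singularity).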
##
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + smokingstatus *
## packyear + homicidegtmean2, family = "binomial", data = lung_emr_lr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3859 -0.5568 -0.4072 -0.2726 3.1039
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.388645 0.154065 -21.995 < 2e-16 ***
## agecat.L 1.109462 0.162683 6.820 9.12e-12 ***
## agecat.Q -0.247364 0.121573 -2.035 0.04188 *
## genderMALE -0.002696 0.097786 -0.028 0.97801
## genderUNKNOWN -9.745831 221.541247 -0.044 0.96491
## raceethnicLatinx -0.461841 0.152911 -3.020 0.00253 **
## raceethnicWhite -0.016659 0.156549 -0.106 0.91525
## smokingstatus1 1.009121 0.140372 7.189 6.53e-13 ***
## packyear 0.008041 0.003893 2.065 0.03888 *
## homicidegtmean2.L 0.231492 0.076816 3.014 0.00258 **
## smokingstatus1:packyear NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3276.4 on 5037 degrees of freedom
## Residual deviance: 3066.6 on 5028 degrees of freedom
## AIC: 3086.6
##
## Number of Fisher Scoring iterations: 11
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1942 217
## 1 1 0
##
## Accuracy : 0.8991
## 95% CI : (0.8856, 0.9115)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 0.5465
##
## Kappa : -9e-04
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9995
## Specificity : 0.0000
## Pos Pred Value : 0.8995
## Neg Pred Value : 0.0000
## Prevalence : 0.8995
## Detection Rate : 0.8991
## Detection Prevalence : 0.9995
## Balanced Accuracy : 0.4997
##
## 'Positive' Class : 0
##
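The coefficients in the logistic regression summary above are on the log-odds scale, so exponentiating a coefficient gives an odds ratio. The short sketch below does this for the packyear estimate (0.008041) taken from that summary; the helper name `odds_ratio` is hypothetical, and the calculation holds the other predictors fixed and ignores the uncertainty expressed by the standard error.

```python
import math


def odds_ratio(coefficient, units=1.0):
    """Odds ratio implied by a log-odds coefficient over `units` of the predictor."""
    return math.exp(coefficient * units)


packyear_coefficient = 0.008041  # estimate from the summary above

per_packyear = odds_ratio(packyear_coefficient)
per_ten_packyears = odds_ratio(packyear_coefficient, units=10)

print(f"odds ratio per additional packyear: {per_packyear:.4f}")
print(f"odds ratio per 10 additional packyears: {per_ten_packyears:.4f}")
```

Read this way, the packyear effect in the fitted model is small per unit and compounds multiplicatively over larger differences in smoking history.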
Modeling only the part of the sample that smoked at any time in their lives produces the following model:
##
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + packyear +
## homicidegtmean2, family = "binomial", data = lung_emr_smokers_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2919 -0.5593 -0.4731 -0.3285 2.6975
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.53318 0.12487 -20.287 < 2e-16 ***
## agecat.L 1.13337 0.18594 6.095 1.09e-09 ***
## agecat.Q -0.20435 0.13757 -1.485 0.137417
## genderMALE -0.02743 0.10604 -0.259 0.795884
## genderUNKNOWN -10.61529 365.20689 -0.029 0.976812
## raceethnicLatinx -0.43241 0.17321 -2.496 0.012543 *
## raceethnicWhite 0.17289 0.16132 1.072 0.283839
## packyear 0.01537 0.00455 3.378 0.000729 ***
## homicidegtmean2.L 0.28555 0.08469 3.372 0.000747 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2588.6 on 3485 degrees of freedom
## Residual deviance: 2480.0 on 3477 degrees of freedom
## AIC: 2498
##
## Number of Fisher Scoring iterations: 12
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1303 191
## 1 1 0
##
## Accuracy : 0.8716
## 95% CI : (0.8535, 0.8881)
## No Information Rate : 0.8722
## P-Value [Acc > NIR] : 0.55
##
## Kappa : -0.0013
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9992
## Specificity : 0.0000
## Pos Pred Value : 0.8722
## Neg Pred Value : 0.0000
## Prevalence : 0.8722
## Detection Rate : 0.8716
## Detection Prevalence : 0.9993
## Balanced Accuracy : 0.4996
##
## 'Positive' Class : 0
##
Our binning of the packyear variable into 5-packyear segments might be imprecise. A tree model that accepts continuous variables might therefore produce different results regarding which packyear split best predicts malignant lung cancer.
## 'data.frame': 4981 obs. of 13 variables:
## $ gender.FEMALE : num 1 1 1 1 1 0 0 1 1 1 ...
## $ gender.MALE : num 0 0 0 0 0 1 1 0 0 0 ...
## $ gender.UNKNOWN : num 0 0 0 0 0 0 0 0 0 0 ...
## $ raceethnic.Black : num 1 1 1 1 1 1 1 1 1 1 ...
## $ raceethnic.Latinx: num 0 0 0 0 0 0 0 0 0 0 ...
## $ raceethnic.White : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 40-50 : num 0 0 0 0 1 0 1 0 0 0 ...
## $ 50-60 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 60+ : num 1 1 1 1 0 1 0 1 1 1 ...
## $ <mean : num 1 0 0 1 0 1 0 0 0 1 ...
## $ >=mean : num 0 1 1 0 1 0 1 1 1 0 ...
## $ packyear : num 9.19 21 3.25 13.32 11.6 ...
## $ malignanto : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## [1] "Optimal cutoff: 0.172222222222222"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1943 217
## 1 0 0
##
## Accuracy : 0.8995
## 95% CI : (0.8861, 0.9119)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 0.5181
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8995
## Neg Pred Value : NaN
## Prevalence : 0.8995
## Detection Rate : 0.8995
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
## 6 10
## "packyear > 13.8" "packyear <= 30"
## 6 10
## "13.8" "30"
Based on the conditional inference tree fitted to the entire sample, which splits on packyear at \(13.8\) and \(30\), the packyear eligibility threshold ought to be revised to around \(13.8\).
Should the recommended packyear requirements differ across race/ethnicity subsets?
## [1] 0.09
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4113 509
## 1 0 0
##
## Accuracy : 0.8899
## 95% CI : (0.8805, 0.8988)
## No Information Rate : 0.8899
## P-Value [Acc > NIR] : 0.5118
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8899
## Neg Pred Value : NaN
## Prevalence : 0.8899
## Detection Rate : 0.8899
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
## [1] 0.05571429
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1520 99
## 1 0 0
##
## Accuracy : 0.9389
## 95% CI : (0.9261, 0.95)
## No Information Rate : 0.9389
## P-Value [Acc > NIR] : 0.5267
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9389
## Neg Pred Value : NaN
## Prevalence : 0.9389
## Detection Rate : 0.9389
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
## [1] 0.04714286
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 858 99
## 1 0 0
##
## Accuracy : 0.8966
## 95% CI : (0.8755, 0.9151)
## No Information Rate : 0.8966
## P-Value [Acc > NIR] : 0.5267
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8966
## Neg Pred Value : NaN
## Prevalence : 0.8966
## Detection Rate : 0.8966
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
The packyear splits differ across race/ethnicity subsets. Although not all of the splits are statistically significant, they suggest that different packyear requirements for each subset may assign individuals to lung cancer screening more optimally.
This section is a WIP