1 Methods

1.1 Data Source and Study Population

The data used for analysis and modeling in this study were obtained from electronic medical records (EMR) from a large academic medical center in Chicago using a standardized form collecting data on demographics, smoking history, lung cancer screening eligibility,

1.2 Statistical Analysis

We examine several cross sections of low-dose CT (LDCT) screening eligibility across demographics, environmental violence measures, and smoking rates to establish baselines for the characteristics of our sample, including its distribution across race/ethnicity, gender, and age. From these statistics we set expectations for which recommendations are sensible and which segments of the sample could benefit from additional screening to reduce cancer rates and mortality. The cross sections also reveal potentially significant differences between sample subgroups.

1.3 Modeling

A variety of decision trees are used to dissect the distribution of cancer rates across demographics, smoking pack-years, and environmental violence. Trees are chosen for their interpretability, and therefore their transparency, in guiding decisions and policy making for lung cancer screening. A Chi-square automatic interaction detection (CHAID) tree is used to model the data initially; Conditional Inference trees and other models are chosen later because of performance improvements. Each model is then checked against a combination of patient screening eligibility, smoking rates, and rates of malignant cancer.

2 Patient Sample Characteristics

2.1 Percentage of cases in the total sample based on smoking status

Smoking Status of Total Sample
Smoking Status Count Percentage
current 2,252 31.29%
former 1,886 26.20%
never 2,217 30.80%
NA 843 11.71%
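
As a quick arithmetic check, each percentage in the table above is the group count divided by the sample total. The sketch below is illustrative Python, not the code that produced this report; the variable names are assumptions.

```python
# Counts copied from the table above; the names are illustrative.
smoking_counts = {"current": 2252, "former": 1886, "never": 2217, "NA": 843}

total = sum(smoking_counts.values())  # size of the full sample

# Share of the total sample in each smoking-status group, in percent.
percentages = {
    status: round(100 * count / total, 2)
    for status, count in smoking_counts.items()
}

for status, pct in percentages.items():
    print(f"{status:8s}{smoking_counts[status]:>6,d}{pct:>8.2f}%")
```

The same pattern, applied within each race, gender, or homicide-exposure group, gives the "percentage within" columns of the tables that follow.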

2.2 Percentage of cases that were current, former, or never smokers by race

Smoking Status of Total Sample by Race
Race/Ethnicity Smoking Status Count Percentage within Race
Black current 1,668 36.09%
Black former 1,182 25.57%
Black never 1,268 27.43%
Black NA 504 10.90%
Latinx current 271 16.74%
Latinx former 446 27.55%
Latinx never 692 42.74%
Latinx NA 210 12.97%
White current 313 32.71%
White former 258 26.96%
White never 257 26.85%
White NA 129 13.48%

2.3 Percentage of cases that were current, former, or never smokers by gender

Smoking Status of Total Sample by Gender
Gender Smoking Status Count Percentage within Gender
FEMALE current 1,135 28.07%
FEMALE former 941 23.27%
FEMALE never 1,549 38.30%
FEMALE NA 419 10.36%
MALE current 1,117 35.45%
MALE former 944 29.96%
MALE never 667 21.17%
MALE NA 423 13.42%
UNKNOWN former 1 33.33%
UNKNOWN never 1 33.33%
UNKNOWN NA 1 33.33%

2.4 Percentage of cases that were current, former, or never smokers by homicide exposure

Smoking Status of Total Sample by Homicide Rate Exposure
Smoking Status Homicide Rate Exposure Count Percentage within Smoking Group
current <mean 1,308 58.08%
current >=mean 938 41.65%
current NA 6 0.27%
former <mean 1,169 61.98%
former >=mean 717 38.02%
never <mean 1,473 66.44%
never >=mean 741 33.42%
never NA 3 0.14%
NA <mean 554 65.72%
NA >=mean 287 34.05%
NA NA 2 0.24%

3 Descriptive Statistics on Patients Who Smoked

3.1 Mean pack years smoked for current and former smokers

Mean Packyear of Smokers
Smoking Status Mean
current 16.36
former 22.27

3.2 Mean pack years smoked for current and former smokers by Race

Mean Packyear of Smokers by Race
Smoking Status Race/Ethnicity Mean
current Black 15.24
current Latinx 17.30
current White 21.48
former Black 22.62
former Latinx 16.95
former White 29.07

3.3 Mean pack years smoked for current and former smokers by Gender

Mean Packyear of Smokers by Gender
Smoking Status Gender Mean Packyears
current FEMALE 14.94
current MALE 17.75
former FEMALE 22.85
former MALE 21.62
former UNKNOWN NaN

3.4 Mean pack years smoked for current and former smokers by homicide exposure

Mean Packyear of Smokers by Homicide Rate Exposure
Smoking Status Homicide Rate Exposure Mean Packyears
current <mean 17.83
current >=mean 14.11
current NA 52.00
former <mean 21.61
former >=mean 23.28

4 LDCT Eligibility Breakdowns

Eligibility criteria: aged 50 to 80, a smoking history of at least 20 pack-years, and no prior history of lung cancer.
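
The three criteria can be written as a small rule. This is a minimal Python sketch, not the code used for the report: the `Patient` fields are hypothetical names, and the boundaries (ages 50 and 80 inclusive, exactly 20 pack-years counting as eligible) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age: int                 # age in years
    pack_years: float        # cumulative smoking exposure in pack-years
    prior_lung_cancer: bool  # True if the record shows a prior diagnosis


def ldct_criteria(p: Patient) -> dict:
    """Evaluate each criterion separately so each can be tabulated on its own."""
    return {
        "age": 50 <= p.age <= 80,           # assumed inclusive at both ends
        "pack_years": p.pack_years >= 20,   # assumed to mean "at least 20"
        "no_prior_diagnosis": not p.prior_lung_cancer,
    }


def is_ldct_eligible(p: Patient) -> bool:
    """A patient is eligible only when every criterion holds."""
    return all(ldct_criteria(p).values())


print(is_ldct_eligible(Patient(age=62, pack_years=25.0, prior_lung_cancer=False)))
print(ldct_criteria(Patient(age=49, pack_years=40.0, prior_lung_cancer=False)))
```

Keeping the per-criterion results separate makes it possible to see which criterion excludes a patient, which is how the breakdowns below are organized.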

4.1 Eligibility by Race

LDCT Eligibility by Race
Race/Ethnicity Age Eligibility Count Percent Packyear Eligibility Count Percent No Prior Diagnosis Eligibility Count Percent
Black No 1,046 22.63% No 4,384 94.85% No 4,075 88.17%
Black Yes 3,576 77.37% Yes 238 5.15% Yes 547 11.83%
Latinx No 446 27.55% No 1,575 97.28% No 1,469 90.74%
Latinx Yes 1,173 72.45% Yes 44 2.72% Yes 150 9.26%
White No 204 21.32% No 891 93.10% No 836 87.36%
White Yes 753 78.68% Yes 66 6.90% Yes 121 12.64%

4.2 Eligibility by Gender

LDCT Eligibility by Gender
Gender Age Eligibility Count Percent Packyear Eligibility Count Percent No Prior Diagnosis Eligibility Count Percent
FEMALE No 1,049 25.94% No 3,877 95.87% No 3,617 89.44%
FEMALE Yes 2,995 74.06% Yes 167 4.13% Yes 427 10.56%
MALE No 645 20.47% No 2,970 94.26% No 2,760 87.59%
MALE Yes 2,506 79.53% Yes 181 5.74% Yes 391 12.41%
UNKNOWN No 2 66.67% No 3 100.00% No 3 100.00%
UNKNOWN Yes 1 33.33% Yes 0 0.00% Yes 0 0.00%

4.3 Eligibility by Race/Gender

LDCT Eligibility by Race and Gender
Race/Ethnicity Gender Age Eligibility Count Percent within Race/Gender Packyear Eligibility Count Percent within Race/Gender No Prior Diagnosis Eligibility Count Percent within Race/Gender
Black FEMALE No 711 25.32% No 2,681 95.48% No 2,500 89.03%
Black FEMALE Yes 2,097 74.68% Yes 127 4.52% Yes 308 10.97%
Black MALE No 334 18.43% No 1,701 93.87% No 1,573 86.81%
Black MALE Yes 1,478 81.57% Yes 111 6.13% Yes 239 13.19%
Black UNKNOWN No 1 50.00% No 2 100.00% No 2 100.00%
Black UNKNOWN Yes 1 50.00% Yes 0 0.00% Yes 0 0.00%
Latinx FEMALE No 241 29.94% No 793 98.51% No 737 91.55%
Latinx FEMALE Yes 564 70.06% Yes 12 1.49% Yes 68 8.45%
Latinx MALE No 204 25.09% No 781 96.06% No 731 89.91%
Latinx MALE Yes 609 74.91% Yes 32 3.94% Yes 82 10.09%
Latinx UNKNOWN No 1 100.00% No 1 100.00% No 1 100.00%
White FEMALE No 97 22.51% No 403 93.50% No 380 88.17%
White FEMALE Yes 334 77.49% Yes 28 6.50% Yes 51 11.83%
White MALE No 107 20.34% No 488 92.78% No 456 86.69%
White MALE Yes 419 79.66% Yes 38 7.22% Yes 70 13.31%

5 Modeling and Predicting Malignant Lung Cancer Prevalence

In a predictive model including all patients who do not meet eligibility for LDCT, what is the overall ability of the following data, combined, to predict a lung cancer diagnosis:

  1. Age
  2. Race/ethnicity
  3. Gender
  4. Smoking status (never vs former/current)
  5. Pack-years binned in widths of 5 up to the current pack-year eligibility threshold (see below)
## [1] "[  0,  5) packs/year" "[  5, 10) packs/year" "[ 10, 15) packs/year"
## [4] "[ 15, 20) packs/year" "[ 20,250] packs/year" "never smoked"
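
The binning in item 5 can be sketched as follows. This is an illustrative Python version, not the R code behind the output above; the function name and the `ever_smoked` flag are assumptions, while the labels (including the "packs/year" wording) are copied verbatim from the printed categories.

```python
import bisect

# Breakpoints for the width-5 scheme; the last bin is closed on the right.
EDGES = [0, 5, 10, 15, 20, 250]


def packyear_bin(pack_years, ever_smoked):
    """Return the category label for one patient."""
    if not ever_smoked:
        return "never smoked"
    if not 0 <= pack_years <= EDGES[-1]:
        raise ValueError(f"pack_years out of range: {pack_years}")
    # Index of the bin whose left edge is <= pack_years; the clamp keeps
    # the maximum value (250) inside the top bin.
    i = min(bisect.bisect_right(EDGES, pack_years) - 1, len(EDGES) - 2)
    lo, hi = EDGES[i], EDGES[i + 1]
    closing = "]" if i == len(EDGES) - 2 else ")"
    return f"[{lo:3d},{hi:3d}{closing} packs/year"


for value in (0, 5, 19.9, 20, 250):
    print(value, "->", packyear_bin(value, ever_smoked=True))
print("never ->", packyear_bin(0, ever_smoked=False))
```

Never-smokers get their own category rather than falling into the lowest bin, so the tree can separate "never smoked" from light smoking.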

5.1 Initial CHAID Model Predicting Malignant Lung Cancer

We examine the initial possibilities for recommendations and differences in demographic splits using a decision tree that uses the Chi-squared statistic to choose the splits that determine how subsets are classified. The CHAID model below is the result of optimizing over some of the following hyperparameters using 70% of the sample selected at random:

  • alpha2: Level of significance used for merging of predictor categories (step 2).

  • alpha3: If set to a positive value \(< 1\), level of significance used for the splitting of formerly merged categories of the predictor (step 3). Otherwise, step 3 is omitted (the default).

  • alpha4: Level of significance used for splitting of a node in the most significant predictor (step 5).

  • minsplit: Number of observations in the split response at which no further split is desired.

  • minbucket: Minimum number of observations in terminal nodes.

  • minprob: Minimum frequency of observations in terminal nodes.
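
The significance levels above are thresholds for chi-squared tests on contingency tables of predictor categories against the outcome. The sketch below computes only the Pearson chi-squared statistic for one such table; it is illustrative Python with made-up counts, and it omits the p-value and the multiplicity adjustment that a full CHAID implementation applies.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and outcome were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts for illustration only (not from the study):
# rows = two candidate predictor categories, columns = outcome 0 / 1.
toy_table = [[30, 10], [20, 40]]
print(round(chi_squared(toy_table), 4))
```

A large statistic (small p-value) argues for keeping two categories apart or for splitting a node on that predictor; a small one argues for merging the categories.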

A visualization of the resulting optimized model is shown below:

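The predictions on the remaining \(30\%\) of the sample, held out for testing, are shown below as a confusion matrix in which the upper right cell counts false negatives (malignant cases predicted as non-malignant) and the lower left cell counts false positives. Note that the output treats class 0 (non-malignant) as the 'Positive' class, so the reported Sensitivity is the share of non-malignant cases classified correctly and the reported Specificity is the share of malignant cases classified correctly.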

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1404  106
##          1  539  111
##                                           
##                Accuracy : 0.7014          
##                  95% CI : (0.6816, 0.7206)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1241          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7226          
##             Specificity : 0.5115          
##          Pos Pred Value : 0.9298          
##          Neg Pred Value : 0.1708          
##              Prevalence : 0.8995          
##          Detection Rate : 0.6500          
##    Detection Prevalence : 0.6991          
##       Balanced Accuracy : 0.6171          
##                                           
##        'Positive' Class : 0               
## 

The resulting model has an accuracy of 70.14%, with a higher rate of false negatives (106 of 217 malignant cases, 48.8%) than false positives (539 of 1,943 non-malignant cases, 27.7%).
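
The statistics printed with the confusion matrix above can be reproduced from its four cell counts. The following Python sketch is for checking the arithmetic only and is not the R code that generated the output; the function and argument names are assumptions. It follows the output in treating class 0 as the positive class.

```python
def metrics_positive_class_0(n00, n01, n10, n11):
    """n<p><r> is the count with prediction p and reference r."""
    total = n00 + n01 + n10 + n11
    sensitivity = n00 / (n00 + n10)  # share of reference-0 cases predicted 0
    specificity = n11 / (n01 + n11)  # share of reference-1 cases predicted 1
    return {
        "accuracy": (n00 + n11) / total,
        # Accuracy obtained by always predicting the majority class.
        "no_information_rate": max(n00 + n10, n01 + n11) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": n00 / (n00 + n01),
        "neg_pred_value": n11 / (n10 + n11),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Cell counts from the initial CHAID model's confusion matrix.
m = metrics_positive_class_0(n00=1404, n01=106, n10=539, n11=111)
for name, value in m.items():
    print(f"{name:22s}{value:.4f}")
```

Because the accuracy falls below the no-information rate, balanced accuracy and the per-class rates are the more informative figures for comparing the models in this section.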

5.2 CHAID Model with Homicide Rate Predicting Malignant Lung Cancer

Does adding exposure to neighborhood violence (homicide rate >= mean vs. < mean) increase the predictive ability of the model?

a.  Expectation = Yes

Remodeling with the homicide rate per 100k variable, we get the following model:

Applying the model to the \(30\%\) held out test set results in the following:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1709  157
##          1  234   60
##                                          
##                Accuracy : 0.819          
##                  95% CI : (0.8021, 0.835)
##     No Information Rate : 0.8995         
##     P-Value [Acc > NIR] : 1.0000000      
##                                          
##                   Kappa : 0.1348         
##                                          
##  Mcnemar's Test P-Value : 0.0001213      
##                                          
##             Sensitivity : 0.8796         
##             Specificity : 0.2765         
##          Pos Pred Value : 0.9159         
##          Neg Pred Value : 0.2041         
##              Prevalence : 0.8995         
##          Detection Rate : 0.7912         
##    Detection Prevalence : 0.8639         
##       Balanced Accuracy : 0.5780         
##                                          
##        'Positive' Class : 0              
## 

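This model has an accuracy of 81.90% with, again, a higher rate of false negatives (157 of 217 malignant cases, 72.4%) than false positives (234 of 1,943 non-malignant cases, 12.0%). Compared with the previous model without the homicide rate, it produces false positives at a lower rate (12.0% versus 27.7%) but false negatives at a higher rate (72.4% versus 48.8%). Accuracy improves by approximately 11.8 percentage points (81.90% versus 70.14%), although both models remain below the no-information rate of 89.95%.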
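
The comparison between the two CHAID models can be recomputed from the two confusion matrices above. The sketch below is illustrative Python with assumed names; it treats malignant (class 1) as the clinically positive class, so a false negative is a malignant case predicted as 0.

```python
def clinical_rates(n00, n01, n10, n11):
    """n<p><r> is the count with prediction p and reference r."""
    return {
        "false_negative_rate": n01 / (n01 + n11),  # malignant cases predicted 0
        "false_positive_rate": n10 / (n00 + n10),  # non-malignant cases predicted 1
        "accuracy": (n00 + n11) / (n00 + n01 + n10 + n11),
    }


baseline = clinical_rates(1404, 106, 539, 111)      # model without homicide rate
with_homicide = clinical_rates(1709, 157, 234, 60)  # model with homicide rate

for name in baseline:
    print(f"{name:22s}{baseline[name]:.4f} -> {with_homicide[name]:.4f}")

accuracy_gain = with_homicide["accuracy"] - baseline["accuracy"]
print(f"accuracy gain: {accuracy_gain:.4f}")
```

Reading the rates side by side shows where the accuracy gain comes from: fewer non-malignant cases flagged, at the cost of more malignant cases missed.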

5.3 CHAID Model with Narrower Bins

Using intervals of size 1 between 5 and 15, with ordinal categories as below:

##  [1] "[  0,  5)"    "[  5,  6)"    "[  6,  7)"    "[  7,  8)"    "[  8,  9)"   
##  [6] "[  9, 10)"    "[ 10, 11)"    "[ 11, 12)"    "[ 12, 13)"    "[ 13, 14)"   
## [11] "[ 14, 15)"    "[ 15, 20)"    "[ 20,250]"    "never smoked"
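
The category labels listed above follow directly from a list of breakpoints. This is a hedged Python sketch of that construction, not the R code used for the report; `interval_labels` is a hypothetical helper name.

```python
def interval_labels(edges):
    """Build ordered labels from breakpoints; only the top bin is closed."""
    labels = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        closing = "]" if i == len(edges) - 2 else ")"
        labels.append(f"[{lo:3d},{hi:3d}{closing}")
    return labels + ["never smoked"]


# Width-1 bins between 5 and 15, then [15, 20) and [20, 250].
narrow_edges = [0] + list(range(5, 16)) + [20, 250]
narrow_labels = interval_labels(narrow_edges)

print(len(narrow_labels))
for label in narrow_labels:
    print(label)
```

Passing a different breakpoint list, for example `[0, 5] + list(range(10, 16)) + [20, 250]`, yields the second scheme described later in this section.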

## [1] "Lung Cancer Model With Narrower Bins on Total Sample"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1696  154
##          1  247   63
##                                           
##                Accuracy : 0.8144          
##                  95% CI : (0.7973, 0.8305)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1371          
##                                           
##  Mcnemar's Test P-Value : 4.343e-06       
##                                           
##             Sensitivity : 0.8729          
##             Specificity : 0.2903          
##          Pos Pred Value : 0.9168          
##          Neg Pred Value : 0.2032          
##              Prevalence : 0.8995          
##          Detection Rate : 0.7852          
##    Detection Prevalence : 0.8565          
##       Balanced Accuracy : 0.5816          
##                                           
##        'Positive' Class : 0               
## 

Using intervals of size 5 between 0 and 10 and then size 1 between 10 and 15, with ordinal categories as below:

##  [1] "[  0,  5)"    "[  5, 10)"    "[ 10, 11)"    "[ 11, 12)"    "[ 12, 13)"   
##  [6] "[ 13, 14)"    "[ 14, 15)"    "[ 15, 20)"    "[ 20,250]"    "never smoked"

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1696  154
##          1  247   63
##                                           
##                Accuracy : 0.8144          
##                  95% CI : (0.7973, 0.8305)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1371          
##                                           
##  Mcnemar's Test P-Value : 4.343e-06       
##                                           
##             Sensitivity : 0.8729          
##             Specificity : 0.2903          
##          Pos Pred Value : 0.9168          
##          Neg Pred Value : 0.2032          
##              Prevalence : 0.8995          
##          Detection Rate : 0.7852          
##    Detection Prevalence : 0.8565          
##       Balanced Accuracy : 0.5816          
##                                           
##        'Positive' Class : 0               
## 

5.4 CHAID Model on Subsets

5.4.1 Model on Race Subsets Using Size 5 Bins

## [1] "Race/Ethnicity Subset with Bin Size 5: Black"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 974  86
##          1 263  64
##                                          
##                Accuracy : 0.7484         
##                  95% CI : (0.7247, 0.771)
##     No Information Rate : 0.8919         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.141          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7874         
##             Specificity : 0.4267         
##          Pos Pred Value : 0.9189         
##          Neg Pred Value : 0.1957         
##              Prevalence : 0.8919         
##          Detection Rate : 0.7022         
##    Detection Prevalence : 0.7642         
##       Balanced Accuracy : 0.6070         
##                                          
##        'Positive' Class : 0              
## 

## [1] "Race/Ethnicity Subset with Bin Size 5: Latinx"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 448  30
##          1   8   0
##                                           
##                Accuracy : 0.9218          
##                  95% CI : (0.8943, 0.9441)
##     No Information Rate : 0.9383          
##     P-Value [Acc > NIR] : 0.9412190       
##                                           
##                   Kappa : -0.0267         
##                                           
##  Mcnemar's Test P-Value : 0.0006577       
##                                           
##             Sensitivity : 0.9825          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9372          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.9383          
##          Detection Rate : 0.9218          
##    Detection Prevalence : 0.9835          
##       Balanced Accuracy : 0.4912          
##                                           
##        'Positive' Class : 0               
## 

## [1] "Race/Ethnicity Subset with Bin Size 5: White"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  77   2
##          1 181  28
##                                           
##                Accuracy : 0.3646          
##                  95% CI : (0.3089, 0.4231)
##     No Information Rate : 0.8958          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0637          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.2984          
##             Specificity : 0.9333          
##          Pos Pred Value : 0.9747          
##          Neg Pred Value : 0.1340          
##              Prevalence : 0.8958          
##          Detection Rate : 0.2674          
##    Detection Prevalence : 0.2743          
##       Balanced Accuracy : 0.6159          
##                                           
##        'Positive' Class : 0               
## 

5.4.2 Model on Race Subsets Using Custom Sized Bins

Eligibility Versus Prediction for Black Subset
Malignant LDCT Eligible LDCT Count LDCT Percentage Prediction Prediction Count Prediction Percentage
yes No 484 95.09% 1 202 39.69%
yes Yes 25 4.91% 0 307 60.31%
no No 3,938 95.75% 1 811 19.72%
no Yes 175 4.25% 0 3,302 80.28%

## [1] "Race/Ethnicity Subset with Narrower Bins: Black"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 974  86
##          1 263  64
##                                          
##                Accuracy : 0.7484         
##                  95% CI : (0.7247, 0.771)
##     No Information Rate : 0.8919         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.141          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7874         
##             Specificity : 0.4267         
##          Pos Pred Value : 0.9189         
##          Neg Pred Value : 0.1957         
##              Prevalence : 0.8919         
##          Detection Rate : 0.7022         
##    Detection Prevalence : 0.7642         
##       Balanced Accuracy : 0.6070         
##                                          
##        'Positive' Class : 0              
## 

Eligibility Versus Prediction for Latinx Subset
Malignant LDCT Eligible LDCT Count LDCT Percentage Prediction Prediction Count Prediction Percentage
yes No 96 96.97% 1 8 8.08%
yes Yes 3 3.03% 0 91 91.92%
no No 1,490 98.03% 1 34 2.24%
no Yes 30 1.97% 0 1,486 97.76%

## [1] "Race/Ethnicity Subset with Narrower Bins: Latinx"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 446  30
##          1  10   0
##                                           
##                Accuracy : 0.9177          
##                  95% CI : (0.8896, 0.9406)
##     No Information Rate : 0.9383          
##     P-Value [Acc > NIR] : 0.971979        
##                                           
##                   Kappa : -0.0318         
##                                           
##  Mcnemar's Test P-Value : 0.002663        
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9370          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.9383          
##          Detection Rate : 0.9177          
##    Detection Prevalence : 0.9794          
##       Balanced Accuracy : 0.4890          
##                                           
##        'Positive' Class : 0               
## 

Eligibility Versus Prediction for White Subset
Malignant LDCT Eligible LDCT Count LDCT Percentage Prediction Prediction Count Prediction Percentage
yes No 97 97.98% 1 64 64.65%
yes Yes 2 2.02% 0 35 35.35%
no No 812 94.64% 1 493 57.46%
no Yes 46 5.36% 0 365 42.54%

## [1] "Race/Ethnicity Subset with Narrower Bins: White"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 114  14
##          1 144  16
##                                           
##                Accuracy : 0.4514          
##                  95% CI : (0.3929, 0.5108)
##     No Information Rate : 0.8958          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0085         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4419          
##             Specificity : 0.5333          
##          Pos Pred Value : 0.8906          
##          Neg Pred Value : 0.1000          
##              Prevalence : 0.8958          
##          Detection Rate : 0.3958          
##    Detection Prevalence : 0.4444          
##       Balanced Accuracy : 0.4876          
##                                           
##        'Positive' Class : 0               
## 

5.5 Generalized Linear Model, Logistic Regression

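Let's compare the CHAID trees to a logistic regression model. The model includes an interaction term between smokingstatus and packyear, since packyear is only applicable to patients with a "not never" smoking status. In the fitted output below, the coefficient for this interaction is reported as NA (not defined because of singularities), so it does not contribute to the fit.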

## 
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + smokingstatus * 
##     packyear + homicidegtmean2, family = "binomial", data = lung_emr_lr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3859  -0.5568  -0.4072  -0.2726   3.1039  
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -3.388645   0.154065 -21.995  < 2e-16 ***
## agecat.L                  1.109462   0.162683   6.820 9.12e-12 ***
## agecat.Q                 -0.247364   0.121573  -2.035  0.04188 *  
## genderMALE               -0.002696   0.097786  -0.028  0.97801    
## genderUNKNOWN            -9.745831 221.541247  -0.044  0.96491    
## raceethnicLatinx         -0.461841   0.152911  -3.020  0.00253 ** 
## raceethnicWhite          -0.016659   0.156549  -0.106  0.91525    
## smokingstatus1            1.009121   0.140372   7.189 6.53e-13 ***
## packyear                  0.008041   0.003893   2.065  0.03888 *  
## homicidegtmean2.L         0.231492   0.076816   3.014  0.00258 ** 
## smokingstatus1:packyear         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3276.4  on 5037  degrees of freedom
## Residual deviance: 3066.6  on 5028  degrees of freedom
## AIC: 3086.6
## 
## Number of Fisher Scoring iterations: 11
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1942  217
##          1    1    0
##                                           
##                Accuracy : 0.8991          
##                  95% CI : (0.8856, 0.9115)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 0.5465          
##                                           
##                   Kappa : -9e-04          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9995          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8995          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.8995          
##          Detection Rate : 0.8991          
##    Detection Prevalence : 0.9995          
##       Balanced Accuracy : 0.4997          
##                                           
##        'Positive' Class : 0               
## 
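
The summary above reports NA for the smokingstatus1:packyear coefficient ("1 not defined because of singularities"). One plausible cause, assuming packyear is coded as 0 for every never-smoker (the report does not confirm this), is that the interaction column is then identical to the packyear column, so the design matrix loses a rank. A minimal Python sketch with made-up toy rows illustrates the effect; the report's own output appears to come from R, and Python is used here only for illustration:

```python
import numpy as np

# Hypothetical toy data: smoker is 1 for ever-smokers and 0 for never-smokers,
# and packyear is 0 whenever smoker is 0 (an assumption about the coding).
smoker = np.array([0, 0, 1, 1, 1, 0, 1, 1])
packyear = np.array([0.0, 0.0, 9.19, 21.0, 3.25, 0.0, 13.32, 11.6])

# Interaction term as it would enter the design matrix.
interaction = smoker * packyear

# Design matrix columns: intercept, smoker, packyear, smoker:packyear.
X = np.column_stack([np.ones_like(packyear), smoker, packyear, interaction])

rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)
print("interaction identical to packyear:", np.array_equal(interaction, packyear))
```

If that assumption holds, the interaction carries no information beyond packyear itself, and a model without the interaction term would give the same fit.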

Modeling only the part of the sample that smoked at any time in their lives produces the following model:

## 
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + packyear + 
##     homicidegtmean2, family = "binomial", data = lung_emr_smokers_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2919  -0.5593  -0.4731  -0.3285   2.6975  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.53318    0.12487 -20.287  < 2e-16 ***
## agecat.L            1.13337    0.18594   6.095 1.09e-09 ***
## agecat.Q           -0.20435    0.13757  -1.485 0.137417    
## genderMALE         -0.02743    0.10604  -0.259 0.795884    
## genderUNKNOWN     -10.61529  365.20689  -0.029 0.976812    
## raceethnicLatinx   -0.43241    0.17321  -2.496 0.012543 *  
## raceethnicWhite     0.17289    0.16132   1.072 0.283839    
## packyear            0.01537    0.00455   3.378 0.000729 ***
## homicidegtmean2.L   0.28555    0.08469   3.372 0.000747 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2588.6  on 3485  degrees of freedom
## Residual deviance: 2480.0  on 3477  degrees of freedom
## AIC: 2498
## 
## Number of Fisher Scoring iterations: 12
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1303  191
##          1    1    0
##                                           
##                Accuracy : 0.8716          
##                  95% CI : (0.8535, 0.8881)
##     No Information Rate : 0.8722          
##     P-Value [Acc > NIR] : 0.55            
##                                           
##                   Kappa : -0.0013         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9992          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8722          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.8722          
##          Detection Rate : 0.8716          
##    Detection Prevalence : 0.9993          
##       Balanced Accuracy : 0.4996          
##                                           
##        'Positive' Class : 0               
## 
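
Both logistic models classify almost every patient as class 0, which the headline accuracy hides. As a check, the statistics above can be recomputed from the four cells of the smokers-only confusion matrix. This Python sketch is illustrative only: the function name `metrics_from_cells` is hypothetical, and class 0 is treated as the positive class to match the output.

```python
def metrics_from_cells(tp, fp, fn, tn):
    """Return basic classification metrics from confusion-matrix cells."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # share of true class-0 cases predicted as 0
    specificity = tn / (tn + fp)  # share of true class-1 cases predicted as 1
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Cells read from the smokers-only matrix, with class 0 as the positive class:
# predicted 0 / reference 0 = 1303, predicted 0 / reference 1 = 191,
# predicted 1 / reference 0 = 1,    predicted 1 / reference 1 = 0.
smokers = metrics_from_cells(tp=1303, fp=191, fn=1, tn=0)

for name, value in smokers.items():
    print(f"{name}: {value:.4f}")
```

A specificity of 0 means that none of the 191 class-1 cases in the test set was identified, so the balanced accuracy of about 0.5 is the more informative summary.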

5.6 Other Tree Based Models

Because our binning of the packyear variable into 5-packyear segments might be imprecise, a tree model that accepts continuous variables might identify a different packyear split for predicting malignant lung cancer.
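
To see why binning can hide a split, consider a minimal Python sketch. It is illustrative only: the left-closed bins [0, 5), [5, 10), and so on, and the threshold of 12, are assumptions rather than the report's actual binning or a fitted split. The packyear values are the first five printed in the data listing in Section 5.6.1.

```python
import math

BIN_WIDTH = 5  # 5-packyear segments, as described in the text

def packyear_bin(packyear):
    """Return the (lower, upper) bounds of the left-closed bin containing packyear."""
    lower = BIN_WIDTH * math.floor(packyear / BIN_WIDTH)
    return (lower, lower + BIN_WIDTH)

packyears = [9.19, 21.0, 3.25, 13.32, 11.6]
hypothetical_threshold = 12.0  # assumed split point, for illustration only

for p in packyears:
    side = "above" if p > hypothetical_threshold else "at or below"
    print(f"packyear {p:>5}: bin {packyear_bin(p)}, {side} {hypothetical_threshold}")
```

Here 13.32 and 11.6 share the bin (10, 15) yet fall on opposite sides of the threshold, so a tree fit to the binned variable can only place splits at multiples of 5.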

5.6.1 Gradient Boosted Trees

## 'data.frame':    4981 obs. of  13 variables:
##  $ gender.FEMALE    : num  1 1 1 1 1 0 0 1 1 1 ...
##  $ gender.MALE      : num  0 0 0 0 0 1 1 0 0 0 ...
##  $ gender.UNKNOWN   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ raceethnic.Black : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ raceethnic.Latinx: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ raceethnic.White : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 40-50            : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ 50-60            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 60+              : num  1 1 1 1 0 1 0 1 1 1 ...
##  $ <mean            : num  1 0 0 1 0 1 0 0 0 1 ...
##  $ >=mean           : num  0 1 1 0 1 0 1 1 1 0 ...
##  $ packyear         : num  9.19 21 3.25 13.32 11.6 ...
##  $ malignanto       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## [250 node-pair labels of the form "i->j" omitted, "1->2" through "297->300"]

5.6.2 Logistic Model Trees

5.6.3 Conditional Inference Tree

## [1] "Optimal cutoff:  0.172222222222222"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1943  217
##          1    0    0
##                                           
##                Accuracy : 0.8995          
##                  95% CI : (0.8861, 0.9119)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 0.5181          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8995          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8995          
##          Detection Rate : 0.8995          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 
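
The Kappa of 0 and the NaN negative predictive value above both follow from the empty second row of the matrix: the tree never predicts class 1 at this cutoff. A short Python sketch recomputes both from the matrix cells. It is illustrative only, and the function name `kappa_and_npv` is hypothetical.

```python
import math

def kappa_and_npv(matrix):
    """matrix[i][j] = count with prediction i and reference j (classes 0 and 1)."""
    n = sum(sum(row) for row in matrix)
    observed = (matrix[0][0] + matrix[1][1]) / n
    # Agreement expected by chance, from the row (prediction) and column (reference) totals.
    expected = sum(
        (sum(matrix[k]) / n) * ((matrix[0][k] + matrix[1][k]) / n) for k in range(2)
    )
    kappa = (observed - expected) / (1 - expected)
    # With class 0 as positive, NPV is the share of class-1 predictions that are correct.
    predicted_one = sum(matrix[1])
    npv = matrix[1][1] / predicted_one if predicted_one else math.nan
    return kappa, npv

# Cells read from the conditional inference tree matrix above.
ctree_matrix = [[1943, 217], [0, 0]]
kappa, npv = kappa_and_npv(ctree_matrix)
print("kappa:", kappa, "npv:", npv)
```

Because the tree makes no class-1 predictions on this test set, its accuracy equals the no-information rate, and the agreement with the reference labels is no better than chance.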

##                 6                10 
## "packyear > 13.8"  "packyear <= 30"
##      6     10 
## "13.8"   "30"

Based on the conditional inference tree fit to the entire sample, which splits packyear at 13.8 and 30, the packyear eligibility threshold ought to be revised to around \(13.8\).

5.6.3.1 CTree Modeling of Race/Ethnic Subsets

Should the packyear requirements differ across race/ethnicity subsets?

## [1] 0.09
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4113  509
##          1    0    0
##                                           
##                Accuracy : 0.8899          
##                  95% CI : (0.8805, 0.8988)
##     No Information Rate : 0.8899          
##     P-Value [Acc > NIR] : 0.5118          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8899          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8899          
##          Detection Rate : 0.8899          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

## [1] 0.05571429
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1520   99
##          1    0    0
##                                         
##                Accuracy : 0.9389        
##                  95% CI : (0.9261, 0.95)
##     No Information Rate : 0.9389        
##     P-Value [Acc > NIR] : 0.5267        
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 1.0000        
##             Specificity : 0.0000        
##          Pos Pred Value : 0.9389        
##          Neg Pred Value :    NaN        
##              Prevalence : 0.9389        
##          Detection Rate : 0.9389        
##    Detection Prevalence : 1.0000        
##       Balanced Accuracy : 0.5000        
##                                         
##        'Positive' Class : 0             
## 

## [1] 0.04714286
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 858  99
##          1   0   0
##                                           
##                Accuracy : 0.8966          
##                  95% CI : (0.8755, 0.9151)
##     No Information Rate : 0.8966          
##     P-Value [Acc > NIR] : 0.5267          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8966          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8966          
##          Detection Rate : 0.8966          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

The packyear splits differ across races. Although not all of the splits are statistically significant, they suggest that a different packyear requirement for each race may more optimally assign individuals to lung cancer screening.

5.7 Conditional Inference Tree Forest

This section is a work in progress.