1 Methods

1.1 Data Source and Study Population

The data used for analysis and modeling in this study were obtained from electronic medical records (EMR) from a large academic medical center in Chicago using a standardized form collecting data on demographics, smoking history, lung cancer screening eligibility,

1.2 Statistical Analysis

1.3 Modeling

2 Patient Sample Characteristics

2.1 Percentage of cases in the total sample based on smoking status

Smoking Status of Total Sample
Smoking Status Count Percentage
current 2,252 31.29%
former 1,886 26.20%
never 2,217 30.80%
NA 843 11.71%
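
As a quick check on the table above, the percentages can be reproduced from the counts. This is a minimal sketch in base R; the vector name `counts` and the label `missing` (standing in for the NA row) are illustrative and not taken from the original analysis code.

```r
# Counts copied from the "Smoking Status of Total Sample" table
counts <- c(current = 2252, former = 1886, never = 2217, missing = 843)

# Total sample size and percentage of the total for each status
total <- sum(counts)
pct <- round(100 * counts / total, 2)

print(total)
print(pct)
```

The counts sum to 7,198 patients, which is the denominator for every percentage in this table.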

2.2 Percentage of cases that were current, former, or never smokers by race

Smoking Status of Total Sample by Race
Race/Ethnicity Smoking Status Count Percentage within Race
Black current 1,668 36.09%
Black former 1,182 25.57%
Black never 1,268 27.43%
Black NA 504 10.90%
Latinx current 271 16.74%
Latinx former 446 27.55%
Latinx never 692 42.74%
Latinx NA 210 12.97%
White current 313 32.71%
White former 258 26.96%
White never 257 26.85%
White NA 129 13.48%

2.3 Percentage of cases that were current, former, or never smokers by gender

Smoking Status of Total Sample by Gender
Gender Smoking Status Count Percentage within Gender
FEMALE current 1,135 28.07%
FEMALE former 941 23.27%
FEMALE never 1,549 38.30%
FEMALE NA 419 10.36%
MALE current 1,117 35.45%
MALE former 944 29.96%
MALE never 667 21.17%
MALE NA 423 13.42%
UNKNOWN former 1 33.33%
UNKNOWN never 1 33.33%
UNKNOWN NA 1 33.33%

2.4 Percentage of cases that were current, former, or never smokers by homicide exposure

Smoking Status of Total Sample by Homicide Rate Exposure
Smoking Status Homicide Rate Exposure Count Percentage within Smoking Group
current <mean 1,308 58.08%
current >=mean 938 41.65%
current NA 6 0.27%
former <mean 1,169 61.98%
former >=mean 717 38.02%
never <mean 1,473 66.44%
never >=mean 741 33.42%
never NA 3 0.14%
NA <mean 554 65.72%
NA >=mean 287 34.05%
NA NA 2 0.24%

3 Descriptive Statistics on Patients Who Smoked

3.1 Mean pack years smoked for current and former smokers

Mean Packyear of Smokers
Smoking Status Mean
current 16.36
former 22.27

3.2 Mean pack years smoked for current and former smokers by Race

Mean Packyear of Smokers by Race
Smoking Status Race/Ethnicity Mean
current Black 15.24
current Latinx 17.30
current White 21.48
former Black 22.62
former Latinx 16.95
former White 29.07

3.3 Mean pack years smoked for current and former smokers by Gender

Mean Packyear of Smokers by Gender
Smoking Status Gender Mean Packyears
current FEMALE 14.94
current MALE 17.75
former FEMALE 22.85
former MALE 21.62
former UNKNOWN NaN

3.4 Mean pack years smoked for current and former smokers by homicide exposure

Mean Packyear of Smokers by Homicide Rate Exposure
Smoking Status Homicide Rate Exposure Mean Packyears
current <mean 17.83
current >=mean 14.11
current NA 52.00
former <mean 21.61
former >=mean 23.28

4 LDCT Eligibility Breakdowns

Eligibility criteria: Aged 50 to 80, 20 pack year smoking history, and no prior history of lung cancer
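
The three criteria can be expressed as a single logical rule. The sketch below is illustrative only: the function name `ldct_eligible` and the argument names are hypothetical, "20 pack year smoking history" is read as at least 20 pack years, and patients with missing values are treated as not eligible, which is an assumption rather than something stated in this report.

```r
# Hypothetical helper: TRUE when a patient meets all three LDCT criteria
ldct_eligible <- function(age, packyear, prior_lung_cancer) {
  age_ok  <- !is.na(age) & age >= 50 & age <= 80        # aged 50 to 80
  pack_ok <- !is.na(packyear) & packyear >= 20           # at least 20 pack years (assumed reading)
  dx_ok   <- !is.na(prior_lung_cancer) & !prior_lung_cancer  # no prior lung cancer
  age_ok & pack_ok & dx_ok
}

# Made-up patients for illustration
patients <- data.frame(
  age = c(49, 55, 67, 81, 60, 65),
  packyear = c(30, 20, 12, 40, NA, 25),
  prior_lung_cancer = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)
patients$eligible <- with(patients, ldct_eligible(age, packyear, prior_lung_cancer))
print(patients)
```

Only the second patient satisfies all three conditions; each of the others fails exactly one criterion. The tables below report each criterion separately rather than the combined rule.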

4.1 Eligibility by Race

LDCT Eligibility by Race
Race/Ethnicity  Age Eligibility  Count  Percent  Packyear Eligibility  Count  Percent  No Prior Diagnosis Eligibility  Count  Percent
Black           No               1,046  22.63%   No                    4,384  94.85%   No                              4,075  88.17%
Black           Yes              3,576  77.37%   Yes                   238    5.15%    Yes                             547    11.83%
Latinx          No               446    27.55%   No                    1,575  97.28%   No                              1,469  90.74%
Latinx          Yes              1,173  72.45%   Yes                   44     2.72%    Yes                             150    9.26%
White           No               204    21.32%   No                    891    93.10%   No                              836    87.36%
White           Yes              753    78.68%   Yes                   66     6.90%    Yes                             121    12.64%

4.2 Eligibility by Gender

LDCT Eligibility by Gender
Gender   Age Eligibility  Count  Percent  Packyear Eligibility  Count  Percent  No Prior Diagnosis Eligibility  Count  Percent
FEMALE   No               1,049  25.94%   No                    3,877  95.87%   No                              3,617  89.44%
FEMALE   Yes              2,995  74.06%   Yes                   167    4.13%    Yes                             427    10.56%
MALE     No               645    20.47%   No                    2,970  94.26%   No                              2,760  87.59%
MALE     Yes              2,506  79.53%   Yes                   181    5.74%    Yes                             391    12.41%
UNKNOWN  No               2      66.67%   No                    3      100.00%  No                              3      100.00%
UNKNOWN  Yes              1      33.33%   Yes                   0      0.00%    Yes                             0      0.00%

4.3 Eligibility by Race/Gender

LDCT Eligibility by Race and Gender
Race/Ethnicity  Gender   Age Eligibility  Count  Percent within Race/Gender  Packyear Eligibility  Count  Percent within Race/Gender  No Prior Diagnosis Eligibility  Count  Percent within Race/Gender
Black           FEMALE   No               711    25.32%                      No                    2,681  95.48%                      No                              2,500  89.03%
Black           FEMALE   Yes              2,097  74.68%                      Yes                   127    4.52%                       Yes                             308    10.97%
Black           MALE     No               334    18.43%                      No                    1,701  93.87%                      No                              1,573  86.81%
Black           MALE     Yes              1,478  81.57%                      Yes                   111    6.13%                       Yes                             239    13.19%
Black           UNKNOWN  No               1      50.00%                      No                    2      100.00%                     No                              2      100.00%
Black           UNKNOWN  Yes              1      50.00%                      Yes                   0      0.00%                       Yes                             0      0.00%
Latinx          FEMALE   No               241    29.94%                      No                    793    98.51%                      No                              737    91.55%
Latinx          FEMALE   Yes              564    70.06%                      Yes                   12     1.49%                       Yes                             68     8.45%
Latinx          MALE     No               204    25.09%                      No                    781    96.06%                      No                              731    89.91%
Latinx          MALE     Yes              609    74.91%                      Yes                   32     3.94%                       Yes                             82     10.09%
Latinx          UNKNOWN  No               1      100.00%                     No                    1      100.00%                     No                              1      100.00%
White           FEMALE   No               97     22.51%                      No                    403    93.50%                      No                              380    88.17%
White           FEMALE   Yes              334    77.49%                      Yes                   28     6.50%                       Yes                             51     11.83%
White           MALE     No               107    20.34%                      No                    488    92.78%                      No                              456    86.69%
White           MALE     Yes              419    79.66%                      Yes                   38     7.22%                       Yes                             70     13.31%

5 Modeling and Predicting Malignant Lung Cancer Prevalence

In a predictive model that includes all patients who do not meet eligibility for LDCT, how well do the following variables, taken together, predict a lung cancer diagnosis?

  1. Age
  2. Race/ethnicity
  3. Gender
  4. Smoking status (never vs former/current)
  5. Packyear, binned in intervals of 5 up to the current packyear eligibility threshold (see below)
## [1] "[  0,  5)" "[  5, 10)" "[ 10, 15)" "[ 15, 20)" "[ 20,250]"
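
Bins of this shape can be produced with base R's `cut()`. This is a sketch under assumptions: the break points are read off the labels above, the example values are made up, and base `cut()` prints labels without the padding shown in the output (which suggests the original used a different binning helper).

```r
# Break points read from the bin labels above
breaks <- c(0, 5, 10, 15, 20, 250)

# Made-up packyear values, including the bin edges
packyear <- c(0, 4.9, 5, 19.99, 20, 250)

# right = FALSE gives left-closed intervals; include.lowest = TRUE closes the last bin at 250
packyear_bin <- cut(packyear, breaks = breaks, right = FALSE, include.lowest = TRUE)

print(levels(packyear_bin))
print(data.frame(packyear, packyear_bin))
```

With this coding, everyone at or above the 20 pack year threshold falls in a single top bin, so the tree can only split on packyear below the eligibility cutoff.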

5.1 Initial CHAID Model Predicting Malignant Lung Cancer

The CHAID model below is the result of optimizing over some of the following hyperparameters using 70% of the sample selected at random:

  • alpha2: Level of significance used for merging of predictor categories (step 2).

  • alpha3: If set to a positive value \(< 1\), level of significance used for the splitting of formerly merged categories of the predictor (step 3). Otherwise, step 3 is omitted (the default).

  • alpha4: Level of significance used for splitting of a node in the most significant predictor (step 5).

  • minsplit: Number of observations in a split response at which no further split is desired.

  • minbucket: Minimum number of observations in terminal nodes.

  • minprob: Minimum frequency of observations in terminal nodes.
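
The random 70/30 split and a candidate grid over these hyperparameters might be set up as follows. This is a sketch only: the seed, the candidate values in `grid`, and the object names are hypothetical and are not the values used to produce the model in this report. The sample size of 7,198 is the total from Section 2.1.

```r
set.seed(42)  # arbitrary seed, for reproducibility of the illustration

# Randomly assign 70% of row indices to training and the rest to testing
n <- 7198
train_idx <- sample(n, size = floor(0.7 * n))
test_idx <- setdiff(seq_len(n), train_idx)

# Hypothetical candidate values for a subset of the hyperparameters listed above
grid <- expand.grid(
  alpha2    = c(0.01, 0.05),
  alpha4    = c(0.01, 0.05),
  minsplit  = c(20, 50),
  minbucket = c(7, 20),
  minprob   = c(0.01, 0.05)
)

print(c(train = length(train_idx), test = length(test_idx)))
print(nrow(grid))
```

A split of this kind gives 5,038 training rows and 2,160 test rows, which agrees with the degrees of freedom in the logistic regression output and the totals in the confusion matrices in this section. Each row of `grid` would then be supplied to the tree's control settings and scored, keeping the best-performing combination.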

A visualization of the resulting optimized model is shown below:

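The predictions on the remaining \(30\%\) of the sample, held out for testing, are shown below as a confusion matrix. Rows are predictions and columns are reference labels, so the upper right cell counts false negatives (cancers predicted as 0) and the lower left cell counts false positives. Note that the summary statistics treat class 0 (no malignant lung cancer) as the 'Positive' class, so the reported Specificity, not the Sensitivity, is the proportion of true cancer cases that the model detected.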

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1404  106
##          1  539  111
##                                           
##                Accuracy : 0.7014          
##                  95% CI : (0.6816, 0.7206)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1241          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7226          
##             Specificity : 0.5115          
##          Pos Pred Value : 0.9298          
##          Neg Pred Value : 0.1708          
##              Prevalence : 0.8995          
##          Detection Rate : 0.6500          
##    Detection Prevalence : 0.6991          
##       Balanced Accuracy : 0.6171          
##                                           
##        'Positive' Class : 0               
## 
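
Because the output above takes class 0 as the 'Positive' class, it can help to restate the metrics with malignant lung cancer (class 1) as the positive class. Using the four cell counts from the matrix (TP = 111, FN = 106, FP = 539, TN = 1404), and writing the subscript 1 to mark class 1 as positive (our notation, not the output's):

\[
\text{Sensitivity}_{1} = \frac{TP}{TP + FN} = \frac{111}{111 + 106} \approx 0.5115,
\qquad
\text{Specificity}_{1} = \frac{TN}{TN + FP} = \frac{1404}{1404 + 539} \approx 0.7226
\]

\[
\text{PPV}_{1} = \frac{TP}{TP + FP} = \frac{111}{111 + 539} \approx 0.1708,
\qquad
\text{NPV}_{1} = \frac{TN}{TN + FN} = \frac{1404}{1404 + 106} \approx 0.9298
\]

These are the same four numbers reported above with the roles of sensitivity and specificity, and of the two predictive values, swapped. Read this way, the model detects about half of the cancers in the test set, and about 17% of its positive predictions are true cancers.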

5.2 CHAID Model with Homicide Rate Predicting Malignant Lung Cancer

Does adding exposure to neighborhood violence (homicide rate >= mean vs. < mean) increase the predictive ability of the model?

a.  Expectation = Yes

Optimizing the CHAID with the homicide rate per 100k variable, we get the following model:

Applying the model to the \(30\%\) held out test set results in the following:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1709  157
##          1  234   60
##                                          
##                Accuracy : 0.819          
##                  95% CI : (0.8021, 0.835)
##     No Information Rate : 0.8995         
##     P-Value [Acc > NIR] : 1.0000000      
##                                          
##                   Kappa : 0.1348         
##                                          
##  Mcnemar's Test P-Value : 0.0001213      
##                                          
##             Sensitivity : 0.8796         
##             Specificity : 0.2765         
##          Pos Pred Value : 0.9159         
##          Neg Pred Value : 0.2041         
##              Prevalence : 0.8995         
##          Detection Rate : 0.7912         
##    Detection Prevalence : 0.8639         
##       Balanced Accuracy : 0.5780         
##                                          
##        'Positive' Class : 0              
## 
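
To compare the two CHAID models on the quantity of interest (detecting malignant lung cancer, class 1), the two confusion matrices can be re-scored directly. This is a sketch in base R; the matrix and function names are illustrative, and the counts are copied from the two outputs in Sections 5.1 and 5.2.

```r
# Rows = prediction (0, 1), columns = reference (0, 1); matrix() fills column by column
dims <- list(pred = c("0", "1"), ref = c("0", "1"))
base_cm     <- matrix(c(1404, 539, 106, 111), nrow = 2, dimnames = dims)
homicide_cm <- matrix(c(1709, 234, 157, 60),  nrow = 2, dimnames = dims)

# Metrics with class 1 (malignant lung cancer) treated as the class of interest
cancer_metrics <- function(cm) {
  recall_cancer    <- cm["1", "1"] / sum(cm[, "1"])  # cancers correctly flagged
  recall_no_cancer <- cm["0", "0"] / sum(cm[, "0"])  # non-cancers correctly cleared
  c(accuracy = sum(diag(cm)) / sum(cm),
    recall_cancer = recall_cancer,
    recall_no_cancer = recall_no_cancer,
    balanced_accuracy = mean(c(recall_cancer, recall_no_cancer)))
}

comparison <- round(rbind(base = cancer_metrics(base_cm),
                          with_homicide = cancer_metrics(homicide_cm)), 4)
print(comparison)
```

On this test split, overall accuracy rises when the homicide variable is added, but the number of cancers detected falls from 111 to 60 of 217 and balanced accuracy falls as well. Both accuracies remain below the no-information rate of 0.8995, so the gain in accuracy reflects more patients being predicted as class 0 rather than better detection.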

5.3 Generalized Linear Model, Logistic Regression

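Let’s compare the CHAID trees to a logistic regression model. The model includes an interaction between smokingstatus and packyear, since packyear is only meaningful for patients who have ever smoked. In the fit below, the interaction coefficient is reported as NA (one coefficient is "not defined because of singularities"), so the fitted model is effectively additive in smokingstatus and packyear.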

## 
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + smokingstatus * 
##     packyear + homicidegtmean2, family = "binomial", data = lung_emr_lr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5916  -0.5373  -0.3780  -0.2587   3.0812  
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -3.576165   0.161319 -22.168  < 2e-16 ***
## agecat.L                  1.007990   0.159713   6.311 2.77e-10 ***
## agecat.Q                 -0.217804   0.121246  -1.796  0.07243 .  
## genderMALE                0.067734   0.099753   0.679  0.49713    
## genderUNKNOWN           -10.377146 293.865798  -0.035  0.97183    
## raceethnicLatinx         -0.387736   0.150875  -2.570  0.01017 *  
## raceethnicWhite          -0.005654   0.155604  -0.036  0.97101    
## smokingstatus1            1.130454   0.149487   7.562 3.96e-14 ***
## packyear                  0.010387   0.004021   2.583  0.00979 ** 
## homicidegtmean2.L         0.227574   0.078797   2.888  0.00388 ** 
## smokingstatus1:packyear         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3169.7  on 5037  degrees of freedom
## Residual deviance: 2967.8  on 5028  degrees of freedom
## AIC: 2987.8
## 
## Number of Fisher Scoring iterations: 12

Predictions from this model on the held-out test set:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1925  215
##          1   18    2
##                                           
##                Accuracy : 0.8921          
##                  95% CI : (0.8783, 0.9049)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 0.8805          
##                                           
##                   Kappa : -1e-04          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.990736        
##             Specificity : 0.009217        
##          Pos Pred Value : 0.899533        
##          Neg Pred Value : 0.100000        
##              Prevalence : 0.899537        
##          Detection Rate : 0.891204        
##    Detection Prevalence : 0.990741        
##       Balanced Accuracy : 0.499976        
##                                           
##        'Positive' Class : 0               
## 
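
One plausible explanation for the NA interaction coefficient is that packyear is coded as 0 for every never smoker. In that case the interaction column equals the packyear column exactly, and `glm()` drops it. The simulation below illustrates the mechanism; the data are simulated, the variable names are illustrative, and the coding of never smokers as 0 is an assumption about the EMR data rather than something confirmed here.

```r
set.seed(1)
n <- 500

# 0 = never smoker, 1 = former/current smoker
smoke <- factor(rbinom(n, 1, 0.6))

# Assumption: never smokers always have packyear = 0
packyear <- ifelse(smoke == "1", rexp(n, rate = 1 / 20), 0)

# Simulated binary outcome
y <- rbinom(n, 1, plogis(-2 + 0.8 * (smoke == "1") + 0.01 * packyear))

fit <- glm(y ~ smoke * packyear, family = binomial)
print(coef(fit))

# The interaction column duplicates packyear, so its coefficient is not estimable
interaction_aliased <- is.na(coef(fit)[["smoke1:packyear"]])
print(interaction_aliased)
```

If this is what happened in the real data, the interaction term can be dropped from the formula without changing the fitted model.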

Modeling only the part of the sample that smoked at any time in their lives produces the following model:

## 
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + packyear + 
##     homicidegtmean2, family = "binomial", data = lung_emr_smokers_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3775  -0.5584  -0.5023  -0.3476   2.6972  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.379463   0.114595 -20.764  < 2e-16 ***
## agecat.L            0.950875   0.164517   5.780 7.48e-09 ***
## agecat.Q           -0.198720   0.125780  -1.580   0.1141    
## genderMALE          0.084623   0.104531   0.810   0.4182    
## genderUNKNOWN      -9.755752 222.696888  -0.044   0.9651    
## raceethnicLatinx   -0.401831   0.166772  -2.409   0.0160 *  
## raceethnicWhite     0.153039   0.158846   0.963   0.3353    
## packyear            0.008300   0.003962   2.095   0.0362 *  
## homicidegtmean2.L   0.243531   0.082344   2.957   0.0031 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2635.6  on 3485  degrees of freedom
## Residual deviance: 2554.9  on 3477  degrees of freedom
## AIC: 2572.9
## 
## Number of Fisher Scoring iterations: 11

Predictions from this model on the held-out test set of patients who ever smoked:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1315  179
##          1    1    0
##                                          
##                Accuracy : 0.8796         
##                  95% CI : (0.862, 0.8957)
##     No Information Rate : 0.8803         
##     P-Value [Acc > NIR] : 0.5515         
##                                          
##                   Kappa : -0.0013        
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9992         
##             Specificity : 0.0000         
##          Pos Pred Value : 0.8802         
##          Neg Pred Value : 0.0000         
##              Prevalence : 0.8803         
##          Detection Rate : 0.8796         
##    Detection Prevalence : 0.9993         
##       Balanced Accuracy : 0.4996         
##                                          
##        'Positive' Class : 0              
## 
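
The near absence of predicted positives can be understood from the coefficients themselves. The sketch below computes the linear predictor for a high-risk profile and asks how many pack years would be needed to reach a predicted probability of 0.5. Several things here are assumptions: that predictions were classified at a 0.5 cutoff, that agecat is an ordered factor with three levels and homicidegtmean2 an ordered factor with two levels using R's default polynomial contrasts, and that the last level of each is the highest-risk one.

```r
# Coefficients copied from the ever-smoker model output above
b <- c(intercept = -2.379463, age_L = 0.950875, age_Q = -0.198720,
       male = 0.084623, white = 0.153039, packyear = 0.008300, hom_L = 0.243531)

# Contrast scores for the last level of each ordered factor (assumed coding)
age_top <- unname(contr.poly(3)[3, ])   # (.L, .Q) scores for the third age category
hom_top <- unname(contr.poly(2)[2, 1])  # .L score for the second homicide level

# Linear predictor for a White male, top age category, homicide >= mean, 0 pack years
eta0 <- unname(b["intercept"] + b["age_L"] * age_top[1] + b["age_Q"] * age_top[2] +
               b["male"] + b["white"] + b["hom_L"] * hom_top)

# Predicted probability at the 20 pack year eligibility threshold
p_at_20 <- plogis(eta0 + b[["packyear"]] * 20)

# Pack years required for the predicted probability to reach 0.5 (linear predictor of 0)
packyear_needed <- -eta0 / b[["packyear"]]

print(c(eta0 = eta0, p_at_20 = p_at_20, packyear_needed = packyear_needed))
```

Under these assumptions, even this profile needs roughly 166 pack years before the model predicts cancer at a 0.5 cutoff, which is consistent with only one predicted positive in the test set. A lower cutoff, or a metric that does not depend on a single cutoff, would give a fairer comparison with the CHAID trees.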

5.4 Other Tree Based Models

Because our binning of the packyear variable into 5-packyear segments may be imprecise, a tree model that accepts continuous variables might choose different packyear split points when predicting malignant lung cancer.

5.4.1 CART Tree

This algorithm partitions a dataset for classification using an impurity measure such as the Gini index or information gain (entropy), both of which quantify how mixed the classes are within a node. We continue to use the subset of data with only past and current smokers, since a packyear of 0 does not necessarily indicate a “never smoker”.

Within the description of each leaf of the tree, the left value is the probability of no detection of malignant lung cancer and the right value is the probability of a positive detection of lung cancer.
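
The two impurity measures, and the improvement a candidate split achieves, can be written in a few lines. This is a sketch: the function names and the class counts in the example split are made up for illustration and do not come from the fitted tree.

```r
# Gini impurity of a node with class proportions p
gini <- function(p) 1 - sum(p^2)

# Entropy (in bits) of a node; zero proportions contribute nothing
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Impurity decrease when a parent node is split into left and right children.
# Arguments are class counts, e.g. c(no_cancer, cancer).
split_gain <- function(parent, left, right, impurity) {
  n <- sum(parent)
  impurity(parent / n) -
    (sum(left) / n) * impurity(left / sum(left)) -
    (sum(right) / n) * impurity(right / sum(right))
}

# Hypothetical split of 1,000 smokers at some packyear value
parent <- c(900, 100)
left   <- c(600, 20)   # below the split point
right  <- c(300, 80)   # at or above the split point

print(c(gini_gain = split_gain(parent, left, right, gini),
        info_gain = split_gain(parent, left, right, entropy)))
```

The tree evaluates candidate split points on the continuous packyear variable in this way and keeps the one with the largest impurity decrease, which is why it is not tied to the 5-packyear bins used for CHAID.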

5.4.2 Gradient Boosted Trees

5.4.3 Logistic Model Trees

5.4.4 Conditional Inference Tree