The data used for analysis and modeling in this study were obtained from electronic medical records (EMR) from a large academic medical center in Chicago using a standardized form collecting data on demographics, smoking history, lung cancer screening eligibility,
| Smoking Status | Count | Percentage |
|---|---|---|
| current | 2,252 | 31.29% |
| former | 1,886 | 26.20% |
| never | 2,217 | 30.80% |
| NA | 843 | 11.71% |
| Race/Ethnicity | Smoking Status | Count | Percentage within Race |
|---|---|---|---|
| Black | current | 1,668 | 36.09% |
| Black | former | 1,182 | 25.57% |
| Black | never | 1,268 | 27.43% |
| Black | NA | 504 | 10.90% |
| Latinx | current | 271 | 16.74% |
| Latinx | former | 446 | 27.55% |
| Latinx | never | 692 | 42.74% |
| Latinx | NA | 210 | 12.97% |
| White | current | 313 | 32.71% |
| White | former | 258 | 26.96% |
| White | never | 257 | 26.85% |
| White | NA | 129 | 13.48% |
| Gender | Smoking Status | Count | Percentage within Gender |
|---|---|---|---|
| FEMALE | current | 1,135 | 28.07% |
| FEMALE | former | 941 | 23.27% |
| FEMALE | never | 1,549 | 38.30% |
| FEMALE | NA | 419 | 10.36% |
| MALE | current | 1,117 | 35.45% |
| MALE | former | 944 | 29.96% |
| MALE | never | 667 | 21.17% |
| MALE | NA | 423 | 13.42% |
| UNKNOWN | former | 1 | 33.33% |
| UNKNOWN | never | 1 | 33.33% |
| UNKNOWN | NA | 1 | 33.33% |
| Smoking Status | Homicide Rate Exposure | Count | Percentage within Smoking Group |
|---|---|---|---|
| current | <mean | 1,308 | 58.08% |
| current | >=mean | 938 | 41.65% |
| current | NA | 6 | 0.27% |
| former | <mean | 1,169 | 61.98% |
| former | >=mean | 717 | 38.02% |
| never | <mean | 1,473 | 66.44% |
| never | >=mean | 741 | 33.42% |
| never | NA | 3 | 0.14% |
| NA | <mean | 554 | 65.72% |
| NA | >=mean | 287 | 34.05% |
| NA | NA | 2 | 0.24% |
| Smoking Status | Mean Packyears |
|---|---|
| current | 16.36 |
| former | 22.27 |
| Smoking Status | Race/Ethnicity | Mean Packyears |
|---|---|---|
| current | Black | 15.24 |
| current | Latinx | 17.30 |
| current | White | 21.48 |
| former | Black | 22.62 |
| former | Latinx | 16.95 |
| former | White | 29.07 |
| Smoking Status | Gender | Mean Packyears |
|---|---|---|
| current | FEMALE | 14.94 |
| current | MALE | 17.75 |
| former | FEMALE | 22.85 |
| former | MALE | 21.62 |
| former | UNKNOWN | NaN |
| Smoking Status | Homicide Rate Exposure | Mean Packyears |
|---|---|---|
| current | <mean | 17.83 |
| current | >=mean | 14.11 |
| current | NA | 52.00 |
| former | <mean | 21.61 |
| former | >=mean | 23.28 |
Eligibility criteria: Aged 50 to 80, 20 pack year smoking history, and no prior history of lung cancer
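The three criteria above can be combined into a single logical flag. The sketch below is a minimal illustration in base R; the column names (`age`, `packyear`, `prior_lung_cancer`) and the toy rows are hypothetical rather than the study's variables, and treating the age bounds as inclusive is an assumption.

```r
# Hypothetical toy data; column names are assumptions for illustration only.
patients <- data.frame(
  age               = c(45, 55, 67, 81, 72),
  packyear          = c(30, 25, 10, 40, 20),
  prior_lung_cancer = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

# One flag per criterion, mirroring the three eligibility columns in the tables.
patients$age_ok      <- patients$age >= 50 & patients$age <= 80  # aged 50 to 80 (assumed inclusive)
patients$packyear_ok <- patients$packyear >= 20                  # at least 20 pack-years
patients$no_prior_ok <- !patients$prior_lung_cancer              # no prior lung cancer

# A patient is eligible only when all three criteria hold.
patients$ldct_eligible <- patients$age_ok & patients$packyear_ok & patients$no_prior_ok

patients
```

Keeping the three component flags separate makes it possible to tabulate each criterion on its own before combining them.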
| Race/Ethnicity | Age Eligibility | Count | Percent | Packyear Eligibility | Count | Percent | No Prior Diagnosis Eligibility | Count | Percent |
|---|---|---|---|---|---|---|---|---|---|
| Black | No | 1,046 | 22.63% | No | 4,384 | 94.85% | No | 4,075 | 88.17% |
| Black | Yes | 3,576 | 77.37% | Yes | 238 | 5.15% | Yes | 547 | 11.83% |
| Latinx | No | 446 | 27.55% | No | 1,575 | 97.28% | No | 1,469 | 90.74% |
| Latinx | Yes | 1,173 | 72.45% | Yes | 44 | 2.72% | Yes | 150 | 9.26% |
| White | No | 204 | 21.32% | No | 891 | 93.10% | No | 836 | 87.36% |
| White | Yes | 753 | 78.68% | Yes | 66 | 6.90% | Yes | 121 | 12.64% |
| Gender | Age Eligibility | Count | Percent | Packyear Eligibility | Count | Percent | No Prior Diagnosis Eligibility | Count | Percent |
|---|---|---|---|---|---|---|---|---|---|
| FEMALE | No | 1,049 | 25.94% | No | 3,877 | 95.87% | No | 3,617 | 89.44% |
| FEMALE | Yes | 2,995 | 74.06% | Yes | 167 | 4.13% | Yes | 427 | 10.56% |
| MALE | No | 645 | 20.47% | No | 2,970 | 94.26% | No | 2,760 | 87.59% |
| MALE | Yes | 2,506 | 79.53% | Yes | 181 | 5.74% | Yes | 391 | 12.41% |
| UNKNOWN | No | 2 | 66.67% | No | 3 | 100.00% | No | 3 | 100.00% |
| UNKNOWN | Yes | 1 | 33.33% | Yes | 0 | 0.00% | Yes | 0 | 0.00% |
| Race/Ethnicity | Gender | Age Eligibility | Count | Percent within Race/Gender | Packyear Eligibility | Count | Percent within Race/Gender | No Prior Diagnosis Eligibility | Count | Percent within Race/Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| Black | FEMALE | No | 711 | 25.32% | No | 2,681 | 95.48% | No | 2,500 | 89.03% |
| Black | FEMALE | Yes | 2,097 | 74.68% | Yes | 127 | 4.52% | Yes | 308 | 10.97% |
| Black | MALE | No | 334 | 18.43% | No | 1,701 | 93.87% | No | 1,573 | 86.81% |
| Black | MALE | Yes | 1,478 | 81.57% | Yes | 111 | 6.13% | Yes | 239 | 13.19% |
| Black | UNKNOWN | No | 1 | 50.00% | No | 2 | 100.00% | No | 2 | 100.00% |
| Black | UNKNOWN | Yes | 1 | 50.00% | | | | | | |
| Latinx | FEMALE | No | 241 | 29.94% | No | 793 | 98.51% | No | 737 | 91.55% |
| Latinx | FEMALE | Yes | 564 | 70.06% | Yes | 12 | 1.49% | Yes | 68 | 8.45% |
| Latinx | MALE | No | 204 | 25.09% | No | 781 | 96.06% | No | 731 | 89.91% |
| Latinx | MALE | Yes | 609 | 74.91% | Yes | 32 | 3.94% | Yes | 82 | 10.09% |
| Latinx | UNKNOWN | No | 1 | 100.00% | No | 1 | 100.00% | No | 1 | 100.00% |
| White | FEMALE | No | 97 | 22.51% | No | 403 | 93.50% | No | 380 | 88.17% |
| White | FEMALE | Yes | 334 | 77.49% | Yes | 28 | 6.50% | Yes | 51 | 11.83% |
| White | MALE | No | 107 | 20.34% | No | 488 | 92.78% | No | 456 | 86.69% |
| White | MALE | Yes | 419 | 79.66% | Yes | 38 | 7.22% | Yes | 70 | 13.31% |
| White | UNKNOWN | | | | No | 0 | NA | No | 0 | NA |
In a predictive model including all patients who do not meet eligibility for LDCT, what is the overall predictive ability of the following combined data to predict lung cancer diagnosis:
## [1] "[ 0, 5)" "[ 5, 10)" "[ 10, 15)" "[ 15, 20)" "[ 20,250]"
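The interval labels printed above correspond to 5-packyear bins with a wide top category. A comparable binning can be sketched with base R's `cut()`; this is an assumption about how such bins could be produced, the input values are hypothetical, and base `cut()` formats its labels without the padding seen in the output above.

```r
# Hypothetical pack-year values for illustration.
packyear <- c(0, 4.9, 5, 12, 19.5, 20, 60, 250)

# Left-closed bins; include.lowest = TRUE also closes the last bin on the right,
# so the top category covers 20 through 250 inclusive.
packyear_cat <- cut(
  packyear,
  breaks = c(0, 5, 10, 15, 20, 250),
  right = FALSE,
  include.lowest = TRUE,
  ordered_result = TRUE
)

levels(packyear_cat)
table(packyear_cat)
```

With `right = FALSE`, a value of exactly 5 falls in the second bin and a value of exactly 20 falls in the top bin.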
The CHAID model below is the result of optimizing over some of the following hyperparameters using 70% of the sample selected at random:
alpha2: Level of significance used for merging of predictor categories (step 2).
alpha3: If set to a positive value \(< 1\), level of significance used for the splitting of formerly merged categories of the predictor (step 3). Otherwise, step 3 is omitted (the default).
alpha4: Level of significance used for splitting of a node in the most significant predictor (step 5).
minsplit: Number of observations in the split response at which no further split is desired.
minbucket: Minimum number of observations in terminal nodes.
minprob: Minimum frequency of observations in terminal nodes.
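The tuning setup described above (a random 70% training sample and a search over hyperparameter values) can be sketched in base R as follows. The seed and the candidate values are illustrative assumptions, not the values used in this analysis, and the model fitting step is left out; each row of the grid would be passed to the tree-fitting routine (for example through the CHAID package's `chaid_control()`, whose arguments the list above appears to describe).

```r
set.seed(2024)  # arbitrary seed, used only to make this sketch reproducible

n <- 7198                                       # sum of the counts in the smoking status table
train_idx <- sample(n, size = floor(0.7 * n))   # 70% of row indices, sampled at random
test_idx  <- setdiff(seq_len(n), train_idx)     # remaining 30%, held out for testing

# Candidate hyperparameter values (assumed for illustration).
grid <- expand.grid(
  alpha2    = c(0.01, 0.05),
  alpha4    = c(0.01, 0.05),
  minsplit  = c(20, 50),
  minbucket = c(7, 15)
)

nrow(grid)         # number of candidate settings to evaluate
length(train_idx)  # number of training rows
length(test_idx)   # number of held-out rows
```

Evaluating every row of `grid` on the training sample only keeps the held-out 30% untouched until the final comparison.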
A visualization of the resulting optimized model is shown below:
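The predictions on the remaining \(30\%\) of the sample, held out for testing, are shown below as a confusion matrix. Taking a lung cancer diagnosis (class 1) as the outcome of interest, the upper right cell counts false negatives and the lower left cell counts false positives. Note that the statistics below treat class 0 as the 'Positive' class, so the reported sensitivity refers to class 0 and the reported specificity refers to class 1.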
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1404 106
## 1 539 111
##
## Accuracy : 0.7014
## 95% CI : (0.6816, 0.7206)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1241
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7226
## Specificity : 0.5115
## Pos Pred Value : 0.9298
## Neg Pred Value : 0.1708
## Prevalence : 0.8995
## Detection Rate : 0.6500
## Detection Prevalence : 0.6991
## Balanced Accuracy : 0.6171
##
## 'Positive' Class : 0
##
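The headline statistics in the output above can be recomputed directly from the four cell counts. The sketch below does this in base R, following the output's convention that class 0 is the 'Positive' class; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(1404, 539, 106, 111), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# With class 0 as the "positive" class:
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # share of true class 0 predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # share of true class 1 (lung cancer) predicted as 1
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced), 4)
```

Because class 0 is the positive class, the reported specificity is the quantity that describes how many lung cancer cases the tree identifies.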
Does adding exposure to neighborhood violence (Homicide rate >= Mean vs. < Mean) increase the predictive ability of the model?
a. Expectation = Yes
Optimizing the CHAID with the homicide rate per 100k variable, we get the following model:
Applying the model to the \(30\%\) held out test set results in the following:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1709 157
## 1 234 60
##
## Accuracy : 0.819
## 95% CI : (0.8021, 0.835)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 1.0000000
##
## Kappa : 0.1348
##
## Mcnemar's Test P-Value : 0.0001213
##
## Sensitivity : 0.8796
## Specificity : 0.2765
## Pos Pred Value : 0.9159
## Neg Pred Value : 0.2041
## Prevalence : 0.8995
## Detection Rate : 0.7912
## Detection Prevalence : 0.8639
## Balanced Accuracy : 0.5780
##
## 'Positive' Class : 0
##
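Since roughly 90% of the held-out patients are in class 0, accuracy is best read alongside the No Information Rate and Kappa reported in the output. The sketch below recomputes those quantities for the two CHAID confusion matrices shown so far, so the models can be compared on the same footing; the function and variable names are illustrative.

```r
# Cohen's kappa from a square confusion matrix.
kappa_from_cm <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Counts copied from the two CHAID outputs (rows = prediction, columns = reference).
cm_base     <- matrix(c(1404, 539, 106, 111), nrow = 2)
cm_homicide <- matrix(c(1709, 234, 157, 60), nrow = 2)

# No Information Rate: accuracy of always predicting the most common reference class.
nir <- max(colSums(cm_homicide)) / sum(cm_homicide)

comparison <- data.frame(
  model       = c("CHAID", "CHAID + homicide rate"),
  accuracy    = c(sum(diag(cm_base)) / sum(cm_base),
                  sum(diag(cm_homicide)) / sum(cm_homicide)),
  kappa       = c(kappa_from_cm(cm_base), kappa_from_cm(cm_homicide)),
  # Share of true class 1 (lung cancer) cases predicted as class 1.
  class1_rate = c(cm_base[2, 2] / sum(cm_base[, 2]),
                  cm_homicide[2, 2] / sum(cm_homicide[, 2]))
)

nir
comparison
```

Placing the three measures side by side shows how a gain in accuracy relates to the share of lung cancer cases each tree identifies.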
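Let’s compare the CHAID trees to a logistic regression model. The model includes an interaction between smokingstatus and packyear, since packyear only applies to patients whose status is “not never”. In the fit below, the interaction coefficient is reported as NA because it could not be estimated (a singularity).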
##
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + smokingstatus *
## packyear + homicidegtmean2, family = "binomial", data = lung_emr_lr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5916 -0.5373 -0.3780 -0.2587 3.0812
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.576165 0.161319 -22.168 < 2e-16 ***
## agecat.L 1.007990 0.159713 6.311 2.77e-10 ***
## agecat.Q -0.217804 0.121246 -1.796 0.07243 .
## genderMALE 0.067734 0.099753 0.679 0.49713
## genderUNKNOWN -10.377146 293.865798 -0.035 0.97183
## raceethnicLatinx -0.387736 0.150875 -2.570 0.01017 *
## raceethnicWhite -0.005654 0.155604 -0.036 0.97101
## smokingstatus1 1.130454 0.149487 7.562 3.96e-14 ***
## packyear 0.010387 0.004021 2.583 0.00979 **
## homicidegtmean2.L 0.227574 0.078797 2.888 0.00388 **
## smokingstatus1:packyear NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3169.7 on 5037 degrees of freedom
## Residual deviance: 2967.8 on 5028 degrees of freedom
## AIC: 2987.8
##
## Number of Fisher Scoring iterations: 12
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1925 215
## 1 18 2
##
## Accuracy : 0.8921
## 95% CI : (0.8783, 0.9049)
## No Information Rate : 0.8995
## P-Value [Acc > NIR] : 0.8805
##
## Kappa : -1e-04
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.990736
## Specificity : 0.009217
## Pos Pred Value : 0.899533
## Neg Pred Value : 0.100000
## Prevalence : 0.899537
## Detection Rate : 0.891204
## Detection Prevalence : 0.990741
## Balanced Accuracy : 0.499976
##
## 'Positive' Class : 0
##
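The coefficients in the logistic regression summary above are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below converts two of them. It assumes that `homicidegtmean2` is a two-level ordered factor fitted with R's default polynomial contrasts (suggested by the `.L` suffix), in which case the coefficient must be scaled by the distance between the two contrast values before exponentiating.

```r
# Estimates copied from the logistic regression summary above.
b_packyear <- 0.010387
b_homicide <- 0.227574   # coefficient on homicidegtmean2.L

# Odds ratio associated with a 10 pack-year increase.
or_packyear_10 <- exp(10 * b_packyear)

# Assumption: two-level ordered factor with default polynomial contrasts,
# so the two levels are coded -1/sqrt(2) and +1/sqrt(2).
lin <- contr.poly(2)[, ".L"]
or_homicide <- exp(b_homicide * (lin[2] - lin[1]))

round(c(packyear_per_10 = or_packyear_10, homicide_ge_mean = or_homicide), 3)
```

Under that contrast assumption, the odds ratio for the higher homicide rate group is larger than `exp(0.227574)` because the two levels sit `sqrt(2)` apart on the contrast scale.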
Modeling only the part of the sample that smoked at any time in their lives produces the following model:
##
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + packyear +
## homicidegtmean2, family = "binomial", data = lung_emr_smokers_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3775 -0.5584 -0.5023 -0.3476 2.6972
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.379463 0.114595 -20.764 < 2e-16 ***
## agecat.L 0.950875 0.164517 5.780 7.48e-09 ***
## agecat.Q -0.198720 0.125780 -1.580 0.1141
## genderMALE 0.084623 0.104531 0.810 0.4182
## genderUNKNOWN -9.755752 222.696888 -0.044 0.9651
## raceethnicLatinx -0.401831 0.166772 -2.409 0.0160 *
## raceethnicWhite 0.153039 0.158846 0.963 0.3353
## packyear 0.008300 0.003962 2.095 0.0362 *
## homicidegtmean2.L 0.243531 0.082344 2.957 0.0031 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2635.6 on 3485 degrees of freedom
## Residual deviance: 2554.9 on 3477 degrees of freedom
## AIC: 2572.9
##
## Number of Fisher Scoring iterations: 11
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1315 179
## 1 1 0
##
## Accuracy : 0.8796
## 95% CI : (0.862, 0.8957)
## No Information Rate : 0.8803
## P-Value [Acc > NIR] : 0.5515
##
## Kappa : -0.0013
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9992
## Specificity : 0.0000
## Pos Pred Value : 0.8802
## Neg Pred Value : 0.0000
## Prevalence : 0.8803
## Detection Rate : 0.8796
## Detection Prevalence : 0.9993
## Balanced Accuracy : 0.4996
##
## 'Positive' Class : 0
##
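One possible reading of why this model assigns nearly every held-out patient to class 0 is that the fitted probabilities rarely reach the classification cutoff. The cutoff is not stated here, so the sketch below assumes 0.5. It also assumes that `agecat` has three ordered levels and `homicidegtmean2` has two, both with R's default polynomial contrasts, and it evaluates a single high-risk profile (oldest age category, male, White, homicide rate at or above the mean).

```r
# Estimates copied from the smokers-only summary above.
b <- c(intercept = -2.379463, age_L = 0.950875, age_Q = -0.198720,
       male = 0.084623, white = 0.153039, packyear = 0.008300,
       homicide_L = 0.243531)

# Contrast values under the stated assumptions.
age3 <- contr.poly(3)   # rows = age categories, columns = .L and .Q
hom2 <- contr.poly(2)   # rows = homicide groups, column = .L

# Linear predictor for the profile, excluding the pack-year term.
eta_without_packyear <- b[["intercept"]] +
  b[["age_L"]] * age3[3, ".L"] + b[["age_Q"]] * age3[3, ".Q"] +
  b[["male"]] + b[["white"]] +
  b[["homicide_L"]] * hom2[2, ".L"]

# Predicted probability of malignancy at 20 pack-years.
p_20 <- plogis(eta_without_packyear + b[["packyear"]] * 20)

# Pack-years at which the predicted probability would reach 0.5 (log-odds of 0).
packyear_for_half <- -eta_without_packyear / b[["packyear"]]

c(p_at_20_packyears = p_20, packyears_for_p_0.5 = packyear_for_half)
```

If these assumptions hold, even this profile stays well below a 0.5 cutoff at typical pack-year values, which would be consistent with the near-absence of class 1 predictions; a lower cutoff would trade sensitivity for specificity.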
Because our binning of the packyear variable into 5-packyear segments might be imprecise, a tree model that accepts continuous variables might identify different packyear split points for predicting malignant lung cancer.
This algorithm partitions a dataset for classification using the Gini impurity index and the information gain measure, both of which quantify how mixed the classes are within each node. We continue to use the subset of data with only past and current smokers, since a packyear of 0 does not necessarily indicate a “never smoker”.
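The two splitting measures can be illustrated with a small base R sketch. The parent counts are the held-out smoker totals from the confusion matrix above (1,316 without and 179 with a malignant diagnosis); the child counts are hypothetical, chosen only so that they add up to the parent, and the function names are illustrative.

```r
# Gini impurity: 1 minus the sum of squared class proportions.
gini_impurity <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Entropy in bits; empty classes contribute zero.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Reduction in a given impurity measure when a parent node is split in two.
split_gain <- function(left, right, impurity) {
  parent <- left + right
  n <- sum(parent)
  impurity(parent) -
    (sum(left) / n) * impurity(left) -
    (sum(right) / n) * impurity(right)
}

# Hypothetical split of the parent node c(1316, 179) on a pack-year threshold.
left  <- c(no_cancer = 900, cancer = 60)
right <- c(no_cancer = 416, cancer = 119)

c(gini_gain = split_gain(left, right, gini_impurity),
  information_gain = split_gain(left, right, entropy))
```

A tree-growing algorithm would evaluate such a gain for every candidate threshold of a continuous variable and keep the split with the largest value.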
Within the description of each leaf of the tree, the left value is the probability of no detection of malignant lung cancer, while the right value is the probability of a positive detection of lung cancer.