1 Methods

1.1 Data Source and Study Population

The data analyzed and modeled in this study were obtained from the electronic medical records (EMR) of a large academic medical center in Chicago, using a standardized form that collected data on demographics, smoking history, lung cancer screening eligibility,

1.2 Statistical Analysis

We examine several cross sections of low-dose CT (LDCT) screening eligibility across demographics, environmental violence measures, and smoking rates to establish baselines for the characteristics of our sample, that is, its distributions across demographic qualities including race/ethnicity, gender, and age. Based on those statistics, we set expectations for which recommendations are sensible and which segments of the sample have potential for additional screening to reduce mortality and cancer rates. These cross sections also serve to reveal potentially significant differences between sample subgroups.

1.3 Modeling

A variety of decision trees are used to dissect the distribution of cancer rates across demographics, smoking pack-years, and environmental violence. Trees are chosen for their interpretability, and therefore their transparency, in guiding decision and policy making for lung cancer screening. Initially, a chi-square automatic interaction detection (CHAID) tree is used to model the data; later, conditional inference trees and others are chosen because of performance improvements. Each model is then checked against a combination of patient screening eligibility, smoking rates, and rates of malignant cancer.

2 Patient Sample Characteristics

2.1 Percentage of cases in the total sample based on smoking status

Smoking Status of Total Sample
Smoking Status	Count	Percentage
current	2,252	31.29%
former	1,886	26.20%
never	2,217	30.80%
NA	843	11.71%

2.2 Percentage of cases that were current, former, or never smokers by race

Smoking Status of Total Sample by Race
Race/Ethnicity	Smoking Status	Count	Percentage within Race
Black	current	1,668	36.09%
Black	former	1,182	25.57%
Black	never	1,268	27.43%
Black	NA	504	10.90%
Latinx	current	271	16.74%
Latinx	former	446	27.55%
Latinx	never	692	42.74%
Latinx	NA	210	12.97%
White	current	313	32.71%
White	former	258	26.96%
White	never	257	26.85%
White	NA	129	13.48%

2.3 Percentage of cases that were current, former, or never smokers by gender

Smoking Status of Total Sample by Gender
Gender	Smoking Status	Count	Percentage within Gender
FEMALE	current	1,135	28.07%
FEMALE	former	941	23.27%
FEMALE	never	1,549	38.30%
FEMALE	NA	419	10.36%
MALE	current	1,117	35.45%
MALE	former	944	29.96%
MALE	never	667	21.17%
MALE	NA	423	13.42%
UNKNOWN	former	1	33.33%
UNKNOWN	never	1	33.33%
UNKNOWN	NA	1	33.33%

2.4 Percentage of cases that were current, former, or never smokers by homicide exposure

Smoking Status of Total Sample by Homicide Rate Exposure
Smoking Status	Homicide Rate Exposure	Count	Percentage within Smoking Group
current	<mean	1,308	58.08%
current	>=mean	938	41.65%
current	NA	6	0.27%
former	<mean	1,169	61.98%
former	>=mean	717	38.02%
never	<mean	1,473	66.44%
never	>=mean	741	33.42%
never	NA	3	0.14%
NA	<mean	554	65.72%
NA	>=mean	287	34.05%
NA	NA	2	0.24%

3 Descriptive Statistics on Patients Who Smoked

3.1 Mean pack years smoked for current and former smokers

Mean Packyear of Smokers
Smoking Status	Mean Packyears
current	16.36
former	22.27

3.2 Mean pack years smoked for current and former smokers by Race

Mean Packyear of Smokers by Race
Smoking Status	Race/Ethnicity	Mean Packyears
current	Black	15.24
current	Latinx	17.30
current	White	21.48
former	Black	22.62
former	Latinx	16.95
former	White	29.07

3.3 Mean pack years smoked for current and former smokers by Gender

Mean Packyear of Smokers by Gender
Smoking Status	Gender	Mean Packyears
current	FEMALE	14.94
current	MALE	17.75
former	FEMALE	22.85
former	MALE	21.62
former	UNKNOWN	NaN

3.4 Mean pack years smoked for current and former smokers by homicide exposure

Mean Packyear of Smokers by Homicide Rate Exposure
Smoking Status	Homicide Rate Exposure	Mean Packyears
current	<mean	17.83
current	>=mean	14.11
current	NA	52.00
former	<mean	21.61
former	>=mean	23.28

4 LDCT Eligibility Breakdowns

Eligibility criteria: aged 50 to 80, a smoking history of at least 20 pack years, and no prior history of lung cancer.
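
A minimal sketch of the three criteria as a single check. The function name, argument names, inclusive age bounds, and the reading of the smoking criterion as "at least 20 pack years" are assumptions made for illustration; they are not taken from the report's own code.

```python
def ldct_eligible(age, pack_years, prior_lung_cancer):
    """Return True only when all three screening criteria are met.

    Assumptions (hypothetical, for illustration): ages 50 and 80 are
    included, the smoking threshold is >= 20 pack years, and a missing
    pack-year value (None) counts as not meeting the smoking criterion.
    """
    age_ok = 50 <= age <= 80
    smoking_ok = pack_years is not None and pack_years >= 20
    history_ok = not prior_lung_cancer
    return age_ok and smoking_ok and history_ok


print(ldct_eligible(62, 25.0, False))  # meets all three criteria
print(ldct_eligible(62, 12.0, False))  # fails the pack-year criterion
print(ldct_eligible(45, 30.0, False))  # fails the age criterion
```

The tables in this section report each criterion separately, so a patient can satisfy one column and still be ineligible overall.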

4.1 Eligibility by Race

LDCT Eligibility Criteria by Race
Race/Ethnicity	Age Eligibility	Count	Percent	Packyear Eligibility	Count	Percent	No Prior Diagnosis Eligibility	Count	Percent
Black	No	1,046	22.63%	No	4,384	94.85%	No	4,075	88.17%
Black	Yes	3,576	77.37%	Yes	238	5.15%	Yes	547	11.83%
Latinx	No	446	27.55%	No	1,575	97.28%	No	1,469	90.74%
Latinx	Yes	1,173	72.45%	Yes	44	2.72%	Yes	150	9.26%
White	No	204	21.32%	No	891	93.10%	No	836	87.36%
White	Yes	753	78.68%	Yes	66	6.90%	Yes	121	12.64%

4.2 Eligibility by Gender

LDCT Eligibility Criteria by Gender
Gender	Age Eligibility	Count	Percent	Packyear Eligibility	Count	Percent	No Prior Diagnosis Eligibility	Count	Percent
FEMALE	No	1,049	25.94%	No	3,877	95.87%	No	3,617	89.44%
FEMALE	Yes	2,995	74.06%	Yes	167	4.13%	Yes	427	10.56%
MALE	No	645	20.47%	No	2,970	94.26%	No	2,760	87.59%
MALE	Yes	2,506	79.53%	Yes	181	5.74%	Yes	391	12.41%
UNKNOWN	No	2	66.67%	No	3	100.00%	No	3	100.00%
UNKNOWN	Yes	1	33.33%	Yes	0	0.00%	Yes	0	0.00%

4.3 Eligibility by Race/Gender

LDCT Eligibility Criteria by Race and Gender
Race/Ethnicity	Gender	Age Eligibility	Count	Percent within Race/Gender	Packyear Eligibility	Count	Percent within Race/Gender	No Prior Diagnosis Eligibility	Count	Percent within Race/Gender
Black	FEMALE	No	711	25.32%	No	2,681	95.48%	No	2,500	89.03%
Black	FEMALE	Yes	2,097	74.68%	Yes	127	4.52%	Yes	308	10.97%
Black	MALE	No	334	18.43%	No	1,701	93.87%	No	1,573	86.81%
Black	MALE	Yes	1,478	81.57%	Yes	111	6.13%	Yes	239	13.19%
Black	UNKNOWN	No	1	50.00%	No	2	100.00%	No	2	100.00%
Black	UNKNOWN	Yes	1	50.00%	Yes	0	0.00%	Yes	0	0.00%
Latinx	FEMALE	No	241	29.94%	No	793	98.51%	No	737	91.55%
Latinx	FEMALE	Yes	564	70.06%	Yes	12	1.49%	Yes	68	8.45%
Latinx	MALE	No	204	25.09%	No	781	96.06%	No	731	89.91%
Latinx	MALE	Yes	609	74.91%	Yes	32	3.94%	Yes	82	10.09%
Latinx	UNKNOWN	No	1	100.00%	No	1	100.00%	No	1	100.00%
White	FEMALE	No	97	22.51%	No	403	93.50%	No	380	88.17%
White	FEMALE	Yes	334	77.49%	Yes	28	6.50%	Yes	51	11.83%
White	MALE	No	107	20.34%	No	488	92.78%	No	456	86.69%
White	MALE	Yes	419	79.66%	Yes	38	7.22%	Yes	70	13.31%

5 Modeling and Predicting Malignant Lung Cancer Prevalence

In a predictive model including all patients who do not meet eligibility for LDCT, how well do the following data, combined, predict a lung cancer diagnosis?

Age
Race/ethnicity
Gender
Smoking status (never vs former/current)
Packyear binned in intervals of 5 up to the current packyear eligibility threshold (see below)

## [1] "[  0,  5) packs/year" "[  5, 10) packs/year" "[ 10, 15) packs/year"
## [4] "[ 15, 20) packs/year" "[ 20,250] packs/year" "never smoked"
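
The binning above can be sketched as follows. This is a minimal illustration, not the report's own code: the function name, its arguments, and the unpadded label format are hypothetical, and the handling of missing or out-of-range values (returning `None`) is an assumption.

```python
def packyear_bin(pack_years, ever_smoked=True, edges=(0, 5, 10, 15, 20), upper=250):
    """Map a pack-year value to an ordinal bin label.

    Bins are half-open [lo, hi) between consecutive edges; the last bin
    [edges[-1], upper] is closed on both sides, mirroring the listed
    "[ 20,250]" category. Never-smokers get their own category.
    """
    if not ever_smoked:
        return "never smoked"
    if pack_years is None:
        return None  # assumption: missing values stay missing
    for lo, hi in zip(edges, edges[1:]):
        if lo <= pack_years < hi:
            return f"[{lo}, {hi})"
    if edges[-1] <= pack_years <= upper:
        return f"[{edges[-1]}, {upper}]"
    return None  # assumption: values outside [0, upper] are left unbinned


print(packyear_bin(4.9))
print(packyear_bin(20))
print(packyear_bin(0, ever_smoked=False))
```

Passing a different `edges` tuple, for example one with unit-width intervals between 5 and 15, gives the narrower binnings explored in Section 5.3.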

5.1 Initial CHAID Model Predicting Malignant Lung Cancer

We examine the initial possibilities for recommendations and demographic split differences using a decision tree that splits on the Chi-squared statistic to determine how subsets are classified. The CHAID model below is the result of optimizing over some of the following hyperparameters, using 70% of the sample selected at random:

alpha2: Level of significance used for merging of predictor categories (step 2).
alpha3: If set to a positive value \(< 1\), level of significance used for the splitting of formerly merged categories of the predictor (step 3). Otherwise, step 3 is omitted (the default).
alpha4: Level of significance used for splitting of a node in the most significant predictor (step 5).
minsplit: Number of observations in a split response at which no further split is desired.
minbucket: Minimum number of observations in terminal nodes.
minprob: Minimum frequency of observations in terminal nodes.
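
The significance levels above are thresholds on Chi-squared tests computed from contingency tables. A minimal sketch of the Pearson statistic is given below; the function name is hypothetical, and the full CHAID procedure (category merging, p-value adjustment, and testing each predictor against the malignancy response) is omitted. Because the report does not tabulate predictors against malignancy at this point, the example reuses the smoking-status-by-gender counts from Section 2.3 purely to show the calculation.

```python
def chi_squared(table):
    """Pearson Chi-squared statistic and degrees of freedom
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Rows: current, former, never; columns: FEMALE, MALE (Section 2.3 counts).
smoking_by_gender = [[1135, 1117], [941, 944], [1549, 667]]
stat, dof = chi_squared(smoking_by_gender)
print(f"Chi-squared = {stat:.1f} on {dof} degrees of freedom")
```

A larger statistic for a given number of degrees of freedom corresponds to a smaller p-value, which is what alpha2, alpha3, and alpha4 are compared against.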

A visualization of the resulting optimized model is shown below:

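The predictions on the remaining \(30\%\) of the set, held out for testing, are shown below as a confusion matrix, where the upper right represents false negatives and the lower left represents false positives. Note that the output treats class 0 (no malignancy) as the 'Positive' class, so the reported Sensitivity is the recall for non-malignant cases and the reported Specificity is the recall for malignant cases.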

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1404  106
##          1  539  111
##                                           
##                Accuracy : 0.7014          
##                  95% CI : (0.6816, 0.7206)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1241          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7226          
##             Specificity : 0.5115          
##          Pos Pred Value : 0.9298          
##          Neg Pred Value : 0.1708          
##              Prevalence : 0.8995          
##          Detection Rate : 0.6500          
##    Detection Prevalence : 0.6991          
##       Balanced Accuracy : 0.6171          
##                                           
##        'Positive' Class : 0               
##
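
The statistics in the output above follow directly from the four cell counts. The sketch below recomputes them and also expresses the two error rates with malignancy (class 1) as the event of interest. The function name and dictionary keys are hypothetical and chosen for illustration only.

```python
def summarize(pred0_ref0, pred0_ref1, pred1_ref0, pred1_ref1):
    """Recompute summary statistics from a 2x2 table laid out as
    rows = prediction, columns = reference (observed class)."""
    ref0 = pred0_ref0 + pred1_ref0  # observed non-malignant
    ref1 = pred0_ref1 + pred1_ref1  # observed malignant
    total = ref0 + ref1
    recall_class0 = pred0_ref0 / ref0  # printed as Sensitivity ('Positive' class is 0)
    recall_class1 = pred1_ref1 / ref1  # printed as Specificity
    return {
        "accuracy": (pred0_ref0 + pred1_ref1) / total,
        "no_information_rate": max(ref0, ref1) / total,
        "recall_class0": recall_class0,
        "recall_class1": recall_class1,
        "balanced_accuracy": (recall_class0 + recall_class1) / 2,
        # Error rates with malignancy treated as the event of interest:
        "missed_malignant_rate": pred0_ref1 / ref1,  # upper right cell
        "false_alarm_rate": pred1_ref0 / ref0,       # lower left cell
    }


metrics = summarize(1404, 106, 539, 111)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Comparing `accuracy` with `no_information_rate` shows whether a model beats always predicting the majority class, which matters here because roughly 90% of the test cases are non-malignant.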

The resulting model has an accuracy of \(0.7014\), with a higher rate of false negatives (106 of 217 malignant cases, 48.8%) than false positives (539 of 1,943 non-malignant cases, 27.7%).

5.2 CHAID Model with Homicide Rate Predicting Malignant Lung Cancer

Does adding exposure to neighborhood violence (homicide rate >= mean vs. < mean) increase the predictive ability of the model?

a.  Expectation = Yes

Remodeling with the homicide rate per 100k variable, we get the following model:

Applying the model to the \(30\%\) held out test set results in the following:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1709  157
##          1  234   60
##                                          
##                Accuracy : 0.819          
##                  95% CI : (0.8021, 0.835)
##     No Information Rate : 0.8995         
##     P-Value [Acc > NIR] : 1.0000000      
##                                          
##                   Kappa : 0.1348         
##                                          
##  Mcnemar's Test P-Value : 0.0001213      
##                                          
##             Sensitivity : 0.8796         
##             Specificity : 0.2765         
##          Pos Pred Value : 0.9159         
##          Neg Pred Value : 0.2041         
##              Prevalence : 0.8995         
##          Detection Rate : 0.7912         
##    Detection Prevalence : 0.8639         
##       Balanced Accuracy : 0.5780         
##                                          
##        'Positive' Class : 0              
##

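This model has an accuracy of \(0.819\) with, again, a higher rate of false negatives (157 of 217 malignant cases, 72.4%) than false positives (234 of 1,943 non-malignant cases, 12.0%). Compared with the previous model without the homicide rate, the false positive rate is lower (12.0% versus 27.7%), but the false negative rate is higher (72.4% versus 48.8%). Accuracy rises by approximately \(0.118\) (from \(0.7014\) to \(0.819\)), although both values remain below the no-information rate of \(0.8995\).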

5.3 CHAID Model with Narrower Bins

Using intervals of size 1 between 5 and 15, with ordinal categories as below:

##  [1] "[  0,  5)"    "[  5,  6)"    "[  6,  7)"    "[  7,  8)"    "[  8,  9)"   
##  [6] "[  9, 10)"    "[ 10, 11)"    "[ 11, 12)"    "[ 12, 13)"    "[ 13, 14)"   
## [11] "[ 14, 15)"    "[ 15, 20)"    "[ 20,250]"    "never smoked"

## [1] "Lung Cancer Model With Narrower Bins on Total Sample"

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1696  154
##          1  247   63
##                                           
##                Accuracy : 0.8144          
##                  95% CI : (0.7973, 0.8305)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1371          
##                                           
##  Mcnemar's Test P-Value : 4.343e-06       
##                                           
##             Sensitivity : 0.8729          
##             Specificity : 0.2903          
##          Pos Pred Value : 0.9168          
##          Neg Pred Value : 0.2032          
##              Prevalence : 0.8995          
##          Detection Rate : 0.7852          
##    Detection Prevalence : 0.8565          
##       Balanced Accuracy : 0.5816          
##                                           
##        'Positive' Class : 0               
##

Using intervals of size 5 between 0 and 10 and then size 1 between 10 and 15, with ordinal categories as below:

##  [1] "[  0,  5)"    "[  5, 10)"    "[ 10, 11)"    "[ 11, 12)"    "[ 12, 13)"   
##  [6] "[ 13, 14)"    "[ 14, 15)"    "[ 15, 20)"    "[ 20,250]"    "never smoked"

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1696  154
##          1  247   63
##                                           
##                Accuracy : 0.8144          
##                  95% CI : (0.7973, 0.8305)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1371          
##                                           
##  Mcnemar's Test P-Value : 4.343e-06       
##                                           
##             Sensitivity : 0.8729          
##             Specificity : 0.2903          
##          Pos Pred Value : 0.9168          
##          Neg Pred Value : 0.2032          
##              Prevalence : 0.8995          
##          Detection Rate : 0.7852          
##    Detection Prevalence : 0.8565          
##       Balanced Accuracy : 0.5816          
##                                           
##        'Positive' Class : 0               
##

5.4 CHAID Model on Subsets

5.4.1 Model on Race Subsets Using Size 5 Bins

## [1] "Race/Ethnicity Subset with Bin Size 5: Black"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 974  86
##          1 263  64
##                                          
##                Accuracy : 0.7484         
##                  95% CI : (0.7247, 0.771)
##     No Information Rate : 0.8919         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.141          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7874         
##             Specificity : 0.4267         
##          Pos Pred Value : 0.9189         
##          Neg Pred Value : 0.1957         
##              Prevalence : 0.8919         
##          Detection Rate : 0.7022         
##    Detection Prevalence : 0.7642         
##       Balanced Accuracy : 0.6070         
##                                          
##        'Positive' Class : 0              
##

## [1] "Race/Ethnicity Subset with Bin Size 5: Latinx"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 448  30
##          1   8   0
##                                           
##                Accuracy : 0.9218          
##                  95% CI : (0.8943, 0.9441)
##     No Information Rate : 0.9383          
##     P-Value [Acc > NIR] : 0.9412190       
##                                           
##                   Kappa : -0.0267         
##                                           
##  Mcnemar's Test P-Value : 0.0006577       
##                                           
##             Sensitivity : 0.9825          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9372          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.9383          
##          Detection Rate : 0.9218          
##    Detection Prevalence : 0.9835          
##       Balanced Accuracy : 0.4912          
##                                           
##        'Positive' Class : 0               
##

## [1] "Race/Ethnicity Subset with Bin Size 5: White"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  77   2
##          1 181  28
##                                           
##                Accuracy : 0.3646          
##                  95% CI : (0.3089, 0.4231)
##     No Information Rate : 0.8958          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0637          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.2984          
##             Specificity : 0.9333          
##          Pos Pred Value : 0.9747          
##          Neg Pred Value : 0.1340          
##              Prevalence : 0.8958          
##          Detection Rate : 0.2674          
##    Detection Prevalence : 0.2743          
##       Balanced Accuracy : 0.6159          
##                                           
##        'Positive' Class : 0               
##

5.4.2 Model on Race Subsets Using Custom Sized Bins

Eligibility Versus Prediction for Black Subset
Malignant	LDCT Eligible	Count	Percentage	Prediction	Count	Percentage
yes	No	484	95.09%	1	202	39.69%
yes	Yes	25	4.91%	0	307	60.31%
no	No	3,938	95.75%	1	811	19.72%
no	Yes	175	4.25%	0	3,302	80.28%

## [1] "Race/Ethnicity Subset with Narrower Bins: Black"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 974  86
##          1 263  64
##                                          
##                Accuracy : 0.7484         
##                  95% CI : (0.7247, 0.771)
##     No Information Rate : 0.8919         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.141          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7874         
##             Specificity : 0.4267         
##          Pos Pred Value : 0.9189         
##          Neg Pred Value : 0.1957         
##              Prevalence : 0.8919         
##          Detection Rate : 0.7022         
##    Detection Prevalence : 0.7642         
##       Balanced Accuracy : 0.6070         
##                                          
##        'Positive' Class : 0              
##

Eligibility Versus Prediction for Latinx Subset
Malignant	LDCT Eligible	Count	Percentage	Prediction	Count	Percentage
yes	No	96	96.97%	1	8	8.08%
yes	Yes	3	3.03%	0	91	91.92%
no	No	1,490	98.03%	1	34	2.24%
no	Yes	30	1.97%	0	1,486	97.76%

## [1] "Race/Ethnicity Subset with Narrower Bins: Latinx"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 446  30
##          1  10   0
##                                           
##                Accuracy : 0.9177          
##                  95% CI : (0.8896, 0.9406)
##     No Information Rate : 0.9383          
##     P-Value [Acc > NIR] : 0.971979        
##                                           
##                   Kappa : -0.0318         
##                                           
##  Mcnemar's Test P-Value : 0.002663        
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9370          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.9383          
##          Detection Rate : 0.9177          
##    Detection Prevalence : 0.9794          
##       Balanced Accuracy : 0.4890          
##                                           
##        'Positive' Class : 0               
##

Eligibility Versus Prediction for White Subset
Malignant	LDCT Eligible	Count	Percentage	Prediction	Count	Percentage
yes	No	97	97.98%	1	64	64.65%
yes	Yes	2	2.02%	0	35	35.35%
no	No	812	94.64%	1	493	57.46%
no	Yes	46	5.36%	0	365	42.54%

## [1] "Race/Ethnicity Subset with Narrower Bins: White"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 114  14
##          1 144  16
##                                           
##                Accuracy : 0.4514          
##                  95% CI : (0.3929, 0.5108)
##     No Information Rate : 0.8958          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0085         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4419          
##             Specificity : 0.5333          
##          Pos Pred Value : 0.8906          
##          Neg Pred Value : 0.1000          
##              Prevalence : 0.8958          
##          Detection Rate : 0.3958          
##    Detection Prevalence : 0.4444          
##       Balanced Accuracy : 0.4876          
##                                           
##        'Positive' Class : 0               
##

5.5 Generalized Linear Model, Logistic Regression

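We now compare the CHAID trees to a logistic regression model. The model includes an interaction term between smokingstatus and packyear, since packyear applies only to patients with a “not never” smoking status. In the fitted output below, the coefficient for this interaction is reported as NA because of singularities.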

## 
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + smokingstatus * 
##     packyear + homicidegtmean2, family = "binomial", data = lung_emr_lr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3859  -0.5568  -0.4072  -0.2726   3.1039  
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -3.388645   0.154065 -21.995  < 2e-16 ***
## agecat.L                  1.109462   0.162683   6.820 9.12e-12 ***
## agecat.Q                 -0.247364   0.121573  -2.035  0.04188 *  
## genderMALE               -0.002696   0.097786  -0.028  0.97801    
## genderUNKNOWN            -9.745831 221.541247  -0.044  0.96491    
## raceethnicLatinx         -0.461841   0.152911  -3.020  0.00253 ** 
## raceethnicWhite          -0.016659   0.156549  -0.106  0.91525    
## smokingstatus1            1.009121   0.140372   7.189 6.53e-13 ***
## packyear                  0.008041   0.003893   2.065  0.03888 *  
## homicidegtmean2.L         0.231492   0.076816   3.014  0.00258 ** 
## smokingstatus1:packyear         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3276.4  on 5037  degrees of freedom
## Residual deviance: 3066.6  on 5028  degrees of freedom
## AIC: 3086.6
## 
## Number of Fisher Scoring iterations: 11
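
The NA for smokingstatus1:packyear ("1 not defined because of singularities") has a plausible explanation, sketched here in Python with made-up rows. The sketch assumes that packyear is recorded as 0 for never smokers; the report does not state how packyear is coded for that group, so treat this as an illustration rather than a diagnosis. Under that assumption the interaction column is identical to the packyear column, and only one of the two coefficients can be estimated.

```python
# Hypothetical rows: (smokingstatus, packyear).
# Assumption: never smokers (smokingstatus == 0) have packyear recorded as 0.
rows = [(0, 0.0), (0, 0.0), (1, 9.19), (1, 21.0), (1, 3.25)]

# Main-effect column and interaction column of the design matrix.
packyear_col = [packyear for _, packyear in rows]
interaction_col = [status * packyear for status, packyear in rows]

# Identical columns make the design matrix rank deficient, so one of the
# two coefficients cannot be estimated and is reported as NA.
columns_identical = packyear_col == interaction_col
print(columns_identical)
```

If the real data follow this coding, dropping the interaction term would leave the fitted model unchanged.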

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1942  217
##          1    1    0
##                                           
##                Accuracy : 0.8991          
##                  95% CI : (0.8856, 0.9115)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 0.5465          
##                                           
##                   Kappa : -9e-04          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9995          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8995          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.8995          
##          Detection Rate : 0.8991          
##    Detection Prevalence : 0.9995          
##       Balanced Accuracy : 0.4997          
##                                           
##        'Positive' Class : 0               
##
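
The summary statistics in the confusion matrix above can be recomputed from its four cell counts. The following Python sketch does so, treating class 0 (no malignancy) as the positive class, as the output states; the function name and the dictionary layout are illustrative only.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return accuracy, sensitivity, specificity and balanced accuracy."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }

# Counts from the confusion matrix above, with class 0 as "positive":
# tp = predicted 0 / reference 0, fp = predicted 0 / reference 1,
# fn = predicted 1 / reference 0, tn = predicted 1 / reference 1.
metrics = confusion_metrics(tp=1942, fn=1, fp=217, tn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The results round to the reported values (accuracy 0.8991, sensitivity 0.9995, specificity 0.0000, balanced accuracy 0.4997). Because class 0 is the positive class, a specificity of 0 means that none of the 217 malignant cases in the test set were predicted as malignant.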

Restricting the sample to patients who smoked at any time in their lives (current and former smokers) produces the following model:

## 
## Call:
## glm(formula = malignanto ~ agecat + gender + raceethnic + packyear + 
##     homicidegtmean2, family = "binomial", data = lung_emr_smokers_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2919  -0.5593  -0.4731  -0.3285   2.6975  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.53318    0.12487 -20.287  < 2e-16 ***
## agecat.L            1.13337    0.18594   6.095 1.09e-09 ***
## agecat.Q           -0.20435    0.13757  -1.485 0.137417    
## genderMALE         -0.02743    0.10604  -0.259 0.795884    
## genderUNKNOWN     -10.61529  365.20689  -0.029 0.976812    
## raceethnicLatinx   -0.43241    0.17321  -2.496 0.012543 *  
## raceethnicWhite     0.17289    0.16132   1.072 0.283839    
## packyear            0.01537    0.00455   3.378 0.000729 ***
## homicidegtmean2.L   0.28555    0.08469   3.372 0.000747 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2588.6  on 3485  degrees of freedom
## Residual deviance: 2480.0  on 3477  degrees of freedom
## AIC: 2498
## 
## Number of Fisher Scoring iterations: 12
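
The coefficients above are on the log-odds scale, so exponentiating an estimate gives the corresponding odds ratio. As a small worked example in Python, the sketch below converts the packyear estimate from the smokers-only model; the helper name `odds_ratio` is illustrative.

```python
import math

def odds_ratio(coefficient, units=1.0):
    """Odds ratio implied by a log-odds coefficient over `units` of change."""
    return math.exp(coefficient * units)

# packyear estimate from the smokers-only model above.
packyear_coef = 0.01537

or_per_packyear = odds_ratio(packyear_coef)
or_per_10_packyears = odds_ratio(packyear_coef, units=10)
print(round(or_per_packyear, 4), round(or_per_10_packyears, 4))
```

Holding the other covariates fixed, this corresponds to roughly a 1.5% increase in the odds of malignancy per additional packyear, or about 17% per 10 packyears.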

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1303  191
##          1    1    0
##                                           
##                Accuracy : 0.8716          
##                  95% CI : (0.8535, 0.8881)
##     No Information Rate : 0.8722          
##     P-Value [Acc > NIR] : 0.55            
##                                           
##                   Kappa : -0.0013         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9992          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8722          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.8722          
##          Detection Rate : 0.8716          
##    Detection Prevalence : 0.9993          
##       Balanced Accuracy : 0.4996          
##                                           
##        'Positive' Class : 0               
##

5.6 Other Tree Based Models

Our binning of the packyear variable into 5-packyear segments might be too coarse. A tree model that accepts continuous variables can split packyear at any value, so it might identify a different packyear threshold for predicting malignant lung cancer.
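
To illustrate the loss of precision, the Python sketch below assigns packyear values to 5-packyear bins. The exact bin edges used in the analysis are not shown in this section, so half-open bins starting at 0 are an assumption, and the example values are hypothetical.

```python
import math

BIN_WIDTH = 5  # 5-packyear segments, as described in the text

def packyear_bin(packyear, width=BIN_WIDTH):
    """Return the half-open [lower, upper) bin that contains `packyear`."""
    lower = math.floor(packyear / width) * width
    return (lower, lower + width)

# Two hypothetical patients on either side of a continuous threshold
# such as 13.8 fall into the same bin and cannot be separated.
low, high = 11.0, 14.5
print(packyear_bin(low), packyear_bin(high))
```

A model fitted on the binned variable can only split at multiples of 5, whereas a model using the continuous variable can place a split between these two patients.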

5.6.1 Gradient Boosted Trees

## 'data.frame':    4981 obs. of  13 variables:
##  $ gender.FEMALE    : num  1 1 1 1 1 0 0 1 1 1 ...
##  $ gender.MALE      : num  0 0 0 0 0 1 1 0 0 0 ...
##  $ gender.UNKNOWN   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ raceethnic.Black : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ raceethnic.Latinx: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ raceethnic.White : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 40-50            : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ 50-60            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 60+              : num  1 1 1 1 0 1 0 1 1 1 ...
##  $ <mean            : num  1 0 0 1 0 1 0 0 0 1 ...
##  $ >=mean           : num  0 1 1 0 1 0 1 1 1 0 ...
##  $ packyear         : num  9.19 21 3.25 13.32 11.6 ...
##  $ malignanto       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
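
The data frame above appears to hold the categorical predictors as 0/1 indicator columns (one column per level), with packyear kept continuous. A minimal Python sketch of that encoding follows; the helper `one_hot` is hypothetical, and the level lists are taken from the column names in the output.

```python
def one_hot(value, levels, prefix=""):
    """Map a categorical value to a dict of 0/1 indicator columns."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {f"{prefix}{level}": float(level == value) for level in levels}

GENDER_LEVELS = ["FEMALE", "MALE", "UNKNOWN"]
RACE_LEVELS = ["Black", "Latinx", "White"]

# Encode a patient resembling the first row of the output above.
row = {}
row.update(one_hot("FEMALE", GENDER_LEVELS, prefix="gender."))
row.update(one_hot("Black", RACE_LEVELS, prefix="raceethnic."))
row["packyear"] = 9.19  # continuous predictor, left unbinned
print(row)
```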

## [250 tree edge labels of the form "parent->child", from "1->2" to "297->300"; full listing omitted]

5.6.2 Logistic Model Trees

5.6.3 Conditional Inference Tree

## [1] "Optimal cutoff:  0.172222222222222"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1943  217
##          1    0    0
##                                           
##                Accuracy : 0.8995          
##                  95% CI : (0.8861, 0.9119)
##     No Information Rate : 0.8995          
##     P-Value [Acc > NIR] : 0.5181          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8995          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8995          
##          Detection Rate : 0.8995          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##
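
The "Optimal cutoff" line above refers to the probability threshold used to turn predicted probabilities into class labels. The report does not say how the cutoff was chosen. One common criterion is Youden's J (sensitivity + specificity - 1), sketched below in Python on hypothetical data; this is an illustration of the general idea, not necessarily the method used here.

```python
def classify(probabilities, cutoff):
    """Label a case 1 (malignant) when its predicted probability reaches the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

def youden_j(labels, predictions):
    """Sensitivity + specificity - 1, treating class 1 as the event of interest."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    return tp / (tp + fn) + tn / (tn + fp) - 1

def best_cutoff(labels, probabilities, candidates):
    """Return the candidate cutoff with the highest Youden's J."""
    return max(candidates, key=lambda c: youden_j(labels, classify(probabilities, c)))

# Hypothetical labels and predicted probabilities, for illustration only.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
probabilities = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.40]
candidates = [0.10, 0.15, 0.20, 0.30]

cutoff = best_cutoff(labels, probabilities, candidates)
print(cutoff)
```

With a rare outcome, a cutoff well below 0.5 is typical, since few cases receive a predicted probability above one half.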

##                 6                10 
## "packyear > 13.8"  "packyear <= 30"

##      6     10 
## "13.8"   "30"

Based on the Conditional Inference Tree fitted to the entire sample, whose packyear splits fall at 13.8 and 30, the packyear requirement for screening ought to be revised to around \(13.8\).

5.6.3.1 CTree Modeling of Race/Ethnic Subsets

Should the recommended packyear requirements differ across race/ethnicity subsets?

## [1] 0.09
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4113  509
##          1    0    0
##                                           
##                Accuracy : 0.8899          
##                  95% CI : (0.8805, 0.8988)
##     No Information Rate : 0.8899          
##     P-Value [Acc > NIR] : 0.5118          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8899          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8899          
##          Detection Rate : 0.8899          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

## [1] 0.05571429
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1520   99
##          1    0    0
##                                         
##                Accuracy : 0.9389        
##                  95% CI : (0.9261, 0.95)
##     No Information Rate : 0.9389        
##     P-Value [Acc > NIR] : 0.5267        
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 1.0000        
##             Specificity : 0.0000        
##          Pos Pred Value : 0.9389        
##          Neg Pred Value :    NaN        
##              Prevalence : 0.9389        
##          Detection Rate : 0.9389        
##    Detection Prevalence : 1.0000        
##       Balanced Accuracy : 0.5000        
##                                         
##        'Positive' Class : 0             
##

## [1] 0.04714286
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 858  99
##          1   0   0
##                                           
##                Accuracy : 0.8966          
##                  95% CI : (0.8755, 0.9151)
##     No Information Rate : 0.8966          
##     P-Value [Acc > NIR] : 0.5267          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8966          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8966          
##          Detection Rate : 0.8966          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

The packyear splits differ across races. Although not all of the splits are statistically significant, they suggest that a different packyear requirement for each race may assign individuals to lung cancer screening more effectively.

5.7 Conditional Inference Tree Forest

This section is a work in progress.

Malignant Lung Cancer Prediction

Alexis Kwan

2023-09-01