More than 200,000 new cases of lung cancer are diagnosed in the United States each year, making it one of the most common cancers in the country. The purpose of our investigation was to determine which factors were most associated with different levels of lung cancer severity (low or high). We looked at a dataset of lung cancer statistics and, based on our background knowledge and outside research, we picked the factors that we believed impacted the severity of lung cancer the most. We then ran a kNN model with an optimized value of k to see if it could accurately predict the severity of lung cancer, and ran the kNN again with a different smoking variable to see if that yielded stronger results than our original model.
We’re going to start with some summaries of our data.
69.7% of the lung cancer patients in this dataset had a severe level of the disease, while 30.3% of the patients had a milder form of lung cancer.
Before looking at these correlations, we recoded our outcome variable to indicate low or high severity of lung cancer (for simplicity, we treated medium severity as high). We also restricted our analysis to environmental and genetic risk factors, meaning that we did not include symptoms of lung cancer as predictors.
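A minimal sketch of this recoding step, assuming the data frame is called `cancer`, the severity column is `Level` with values "Low", "Medium", and "High", and the listed risk-factor column names (all of these names are our assumptions, not taken from the original code):

```r
library(dplyr)

# Treat "Medium" severity as "High" so Level becomes a binary outcome:
# 0 = low severity, 1 = high severity
cancer <- cancer %>%
  mutate(Level = ifelse(Level == "Low", 0, 1))

# Keep the outcome plus environmental/genetic risk factors only
# (these column names are assumed for illustration)
risk_factors <- cancer %>%
  select(Level, Gender, `Genetic Risk`, Obesity, Smoking, `Passive Smoker`)

# Class balance: roughly 70% high severity vs. 30% low severity
prop.table(table(risk_factors$Level))
```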
We made a correlation plot to observe the relationships between the variables, knowing that Level was our variable of interest. We used this, along with background research on common lung cancer risk factors, to pick the variables to study with our kNN models. Level indicated the severity of the cancer in each patient, and based on Level's correlation values, we selected Obesity and Genetic Risk. Finally, we wanted to compare Smoking with Passive Smoking (secondhand smoke exposure) to see whether environmental factors could impact the severity of lung cancer observed.
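A sketch of this variable-selection step using base `cor()` plus the `corrplot` package, assuming all retained columns in `risk_factors` are numeric codes:

```r
library(corrplot)

# Correlation matrix of the binary outcome and the candidate risk factors
corr_mat <- cor(risk_factors)

# Visualize; the Level row/column shows which predictors track severity
corrplot(corr_mat, method = "color", type = "upper", tl.cex = 0.8)

# Correlations with the outcome, sorted by strength
sort(abs(corr_mat["Level", ]), decreasing = TRUE)
```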
We will begin by making a plot of k vs. accuracy to pick the best k-value.
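A sketch of how such a k-vs-accuracy plot can be built with `class::knn()`. The 80/20 split, the seed, and all object names (`idx`, `feats`, `train_x`, `test_x`, `train_y`, `test_y`) are our own illustrative choices; scaling before splitting is a simplification:

```r
library(class)
set.seed(123)   # knn breaks ties at random

# 80/20 train/test split on the scaled, selected predictors and the 0/1 labels
idx     <- sample(nrow(risk_factors), size = floor(0.8 * nrow(risk_factors)))
feats   <- scale(risk_factors[, c("Genetic Risk", "Obesity", "Smoking")])
train_x <- feats[idx, ];  test_x <- feats[-idx, ]
train_y <- factor(risk_factors$Level[idx])
test_y  <- factor(risk_factors$Level[-idx])

# Test-set accuracy for a range of k values
ks  <- 1:20
acc <- sapply(ks, function(k) {
  mean(knn(train = train_x, test = test_x, cl = train_y, k = k) == test_y)
})
plot(ks, acc, type = "b", xlab = "k (number of neighbors)", ylab = "Test accuracy")
```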
Based on the results of this plot, we selected a k value of 7; several k values were essentially tied for the highest accuracy, and we went with the largest of them, since a larger neighborhood helps smooth out random noise.
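The confusion matrix below can be produced with `caret::confusionMatrix()`; a sketch continuing from the objects above, with high severity ("1") as the positive class:

```r
library(caret)

# Final model with k = 7; rows are predictions, columns are actual labels
pred_7 <- knn(train = train_x, test = test_x, cl = train_y, k = 7)
confusionMatrix(table(Prediction = pred_7, Actual = test_y), positive = "1")
```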
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 56 0
## 1 13 131
##
## Accuracy : 0.935
## 95% CI : (0.8914, 0.9649)
## No Information Rate : 0.655
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8495
##
## Mcnemar's Test P-Value : 0.0008741
##
## Sensitivity : 1.0000
## Specificity : 0.8116
## Pos Pred Value : 0.9097
## Neg Pred Value : 1.0000
## Prevalence : 0.6550
## Detection Rate : 0.6550
## Detection Prevalence : 0.7200
## Balanced Accuracy : 0.9058
##
## 'Positive' Class : 1
##
This model has an accuracy of 0.935, a Kappa of 0.8495, and a sensitivity of 1.0, which indicates that it did a fairly good job of predicting the severity of lung cancer within the testing dataset.
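As a quick sanity check, the headline statistics can be recomputed directly from the counts in the table above:

```r
TP <- 131; FN <- 0    # high-severity cases predicted correctly / missed
TN <- 56;  FP <- 13   # low-severity cases predicted correctly / missed

(TP + TN) / (TP + TN + FP + FN)   # accuracy       = 187 / 200 = 0.935
TP / (TP + FN)                    # sensitivity    = 131 / 131 = 1.000
TN / (TN + FP)                    # specificity    = 56 / 69   = 0.8116
TP / (TP + FP)                    # pos pred value = 131 / 144 = 0.9097
```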
Now, we want to run our kNN algorithm again, this time replacing Smoking with Passive Smoking (patients who inhale secondhand smoke), to see whether that variable is any different in terms of predicting severity.
The elbow plot shows that swapping the smoking variable does not change the k value we will use, so we will go with k = 7 for the secondhand-smoking model.
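A sketch of the second model, identical to the first except that the Smoking column is swapped for Passive Smoker (continuing from the objects defined in the earlier sketches; the column name is assumed):

```r
# Same pipeline, with secondhand smoke exposure replacing smoking status
feats2 <- scale(risk_factors[, c("Genetic Risk", "Obesity", "Passive Smoker")])

pred_passive <- knn(train = feats2[idx, ], test = feats2[-idx, ],
                    cl = train_y, k = 7)
confusionMatrix(table(Prediction = pred_passive, Actual = test_y), positive = "1")
```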
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 58 0
## 1 11 131
##
## Accuracy : 0.945
## 95% CI : (0.9037, 0.9722)
## No Information Rate : 0.655
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8735
##
## Mcnemar's Test P-Value : 0.002569
##
## Sensitivity : 1.0000
## Specificity : 0.8406
## Pos Pred Value : 0.9225
## Neg Pred Value : 1.0000
## Prevalence : 0.6550
## Detection Rate : 0.6550
## Detection Prevalence : 0.7100
## Balanced Accuracy : 0.9203
##
## 'Positive' Class : 1
##
This model has an accuracy of 0.945, a Kappa of 0.8735, and a sensitivity of 1.0, meaning that it is slightly better than the smoking model at predicting the severity of lung cancer.
Because the passive-smoking model had the higher accuracy, we will run our ML evaluation (error rate, ROC/AUC, and LogLoss) on that model.
This model has an error rate of 5.5% (1 − 0.945 accuracy), which means that our predictions matched the actual labels in most instances.
Our TPR vs. FPR graph shows that the AUC was very close to 0.5 (0.51), which indicates that the model is not as good at predicting the severity of lung cancer in patients as we initially believed. An AUC of 0.5 corresponds to an essentially random classifier, meaning that, by this measure, the model ranks cases little better than randomly guessing the severity of lung cancer based on our selected metrics.
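The ROC/AUC step can be reproduced with the `pROC` package; a sketch continuing from the objects above. One detail worth flagging: `class::knn()` reports the vote share of the winning class, and using that raw share as the score (rather than converting it to the probability of the positive class) is a common pitfall that can push the AUC toward 0.5. We cannot tell from the output alone whether that is what happened here.

```r
library(pROC)

# knn with prob = TRUE returns the vote share of the *winning* class;
# convert it to P(high severity) before building the ROC curve
pred_prob  <- knn(train = feats2[idx, ], test = feats2[-idx, ],
                  cl = train_y, k = 7, prob = TRUE)
vote_share <- attr(pred_prob, "prob")
p_high     <- ifelse(pred_prob == "1", vote_share, 1 - vote_share)

roc_obj <- roc(response = test_y, predictor = p_high)
plot(roc_obj)   # TPR vs. FPR curve
auc(roc_obj)
```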
We obtained a LogLoss of 10.12, which is a very poor result, since a well-calibrated model's LogLoss should be close to 0. Because LogLoss heavily penalizes confident misclassifications, we can infer that the model was not able to assign well-calibrated probabilities of having low- or high-level lung cancer to patients.
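For reference, LogLoss is the average negative log of the probability assigned to the true class; a manual sketch using the probabilities from above. Values this far from 0 generally mean the true class was given a probability near zero for a nontrivial share of patients:

```r
# Clip probabilities away from exactly 0 and 1 so log() stays finite
eps <- 1e-15
p   <- pmin(pmax(p_high, eps), 1 - eps)
y   <- as.numeric(as.character(test_y))   # 0/1 numeric labels

log_loss <- -mean(y * log(p) + (1 - y) * log(1 - p))
log_loss
```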
Now, we will evaluate the fairness of our model with respect to our protected class, gender.
Looking at the bar graph of proportional parity, we see that men (column 1) had a proportional parity of 1, which suggests they serve as the reference group; women had a proportional parity of roughly 0.8, meaning the model flags high severity for women at about 80% of the rate it does for men. This could mean that, in diagnostics and in determining severity, there is some gender bias in how those judgments are made. Additionally, this dataset contains more observations for men, which could be another reason for the difference in proportional parity.
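Proportional parity compares the rate at which each group is predicted to have high-severity cancer, scaled so the reference group equals 1. A manual sketch, assuming a `gender_test` vector (with values such as "Male"/"Female") aligned with the test-set predictions; the fairness tooling used for the actual plot may compute this slightly differently:

```r
# Positive-prediction (high severity) rate within each gender group
pos_rate <- tapply(pred_passive == "1", gender_test, mean)

# Ratio to the reference (male) rate: the base group is 1, others are ratios
pos_rate / pos_rate["Male"]
```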
Looking at the predicted-probability density plot, we can see that there is a slight advantage to being female below the 50% threshold, but males have a larger advantage once the threshold reaches 75%. Again, we could attribute this to not having enough information on women in this dataset.
In trying to predict lung cancer severity, we chose to focus on genetic risk, obesity, and smoking as our variables, based on a correlation matrix and outside research. From there, we created kNN models to compare smoking with secondhand smoking, and observed a higher accuracy in the secondhand-smoking model. This suggests that secondhand smoke exposure is at least as informative a predictor of severity as smoking status, although the higher accuracy alone does not establish that passive smokers develop more severe lung cancer.
Overall, although our accuracy and Kappa values indicated the model was a fair fit, the LogLoss and AUC analysis presented contradictory evidence.
This can be attributed in part to the number of neighbors used in the kNN model. Although choosing a higher k value helps smooth out random error, it can also overlook smaller patterns. Our elbow plot indicated that k values of 3 through 7 all had an accuracy of roughly 0.94; the model could be improved by finding a better balance between overfitting and underfitting. Additionally, the fairness analysis indicated a slight gender bias. This limitation could be addressed by including more observations for female patients.
For additional analysis, we would recommend gathering more data that would allow us to look deeper into passive smoking and how that relates to the severity of lung cancer. If more data on passive smoking could be obtained, we may be able to expand the training data for our model which could improve its performance and pinpoint passive smoking’s impact on cancer severity. Additionally, we would recommend having more data on female patients to limit the gender bias in our algorithm development, which could potentially improve the model’s accuracy as well.