Lung cancer is a major health concern worldwide, and early detection and accurate prediction of the disease can substantially improve patient outcomes. In this report, we explore the use of different machine learning models to predict the occurrence of lung cancer based on various factors and symptoms.
The dataset used in this analysis contains information from 309 individuals, including their gender, age, smoking status, and several other categorical features related to symptoms and health conditions. The target variable is “LUNG_CANCER,” which indicates whether an individual has lung cancer or not.
To gain insight into how the variables relate to the occurrence of lung cancer, we used the summary function to inspect the distribution of each variable and bar plots to compare the categorical variables against the target.
knitr::opts_chunk$set(echo = TRUE)
# Load the necessary libraries
library(caTools)
library(randomForest)
library(rpart)
library(rpart.plot)
library(e1071)
library(caret)
library(ggplot2)
# Load the dataset
pd <- read.csv("survey lung cancer.csv")
# Handling missing values (if any)
if (sum(is.na(pd)) > 0) {
  pd <- na.omit(pd) # Remove rows with NAs if present
}
str(pd)
## 'data.frame': 309 obs. of 16 variables:
## $ GENDER : chr "M" "M" "F" "M" ...
## $ AGE : int 69 74 59 63 63 75 52 51 68 53 ...
## $ SMOKING : int 1 2 1 2 1 1 2 2 2 2 ...
## $ YELLOW_FINGERS : int 2 1 1 2 2 2 1 2 1 2 ...
## $ ANXIETY : int 2 1 1 2 1 1 1 2 2 2 ...
## $ PEER_PRESSURE : int 1 1 2 1 1 1 1 2 1 2 ...
## $ CHRONIC.DISEASE : int 1 2 1 1 1 2 1 1 1 2 ...
## $ FATIGUE : int 2 2 2 1 1 2 2 2 2 1 ...
## $ ALLERGY : int 1 2 1 1 1 2 1 2 1 2 ...
## $ WHEEZING : int 2 1 2 1 2 2 2 1 1 1 ...
## $ ALCOHOL.CONSUMING : int 2 1 1 2 1 1 2 1 1 2 ...
## $ COUGHING : int 2 1 2 1 2 2 2 1 1 1 ...
## $ SHORTNESS.OF.BREATH : int 2 2 2 1 2 2 2 2 1 1 ...
## $ SWALLOWING.DIFFICULTY: int 2 2 1 2 1 1 1 2 1 2 ...
## $ CHEST.PAIN : int 2 2 2 2 1 1 2 1 1 2 ...
## $ LUNG_CANCER : chr "YES" "YES" "NO" "NO" ...
summary(pd)
## GENDER AGE SMOKING YELLOW_FINGERS
## Length:309 Min. :21.00 Min. :1.000 Min. :1.00
## Class :character 1st Qu.:57.00 1st Qu.:1.000 1st Qu.:1.00
## Mode :character Median :62.00 Median :2.000 Median :2.00
## Mean :62.67 Mean :1.563 Mean :1.57
## 3rd Qu.:69.00 3rd Qu.:2.000 3rd Qu.:2.00
## Max. :87.00 Max. :2.000 Max. :2.00
## ANXIETY PEER_PRESSURE CHRONIC.DISEASE FATIGUE
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :2.000 Median :2.000 Median :2.000
## Mean :1.498 Mean :1.502 Mean :1.505 Mean :1.673
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## ALLERGY WHEEZING ALCOHOL.CONSUMING COUGHING
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :2.000 Median :2.000
## Mean :1.557 Mean :1.557 Mean :1.557 Mean :1.579
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## SHORTNESS.OF.BREATH SWALLOWING.DIFFICULTY CHEST.PAIN LUNG_CANCER
## Min. :1.000 Min. :1.000 Min. :1.000 Length:309
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 Class :character
## Median :2.000 Median :1.000 Median :2.000 Mode :character
## Mean :1.641 Mean :1.469 Mean :1.557
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :2.000 Max. :2.000
# Function to create a bar plot of one categorical variable against the target
create_bar_plot <- function(data, x_var, target_var) {
  # Treat the 1/2 codes as categories; .data[[ ]] replaces the deprecated aes_string()
  ggplot(data, aes(x = factor(.data[[x_var]]), fill = .data[[target_var]])) +
    geom_bar(position = "dodge") +
    labs(title = paste("Bar Plot of", x_var, "vs.", target_var),
         x = x_var, y = "Count", fill = target_var) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
# Create one bar plot per categorical variable.
# No column is stored as a factor (see str(pd) above), so an is.factor() check
# would skip them all; select every column except AGE and the target instead.
for (col in setdiff(names(pd), c("AGE", "LUNG_CANCER"))) {
  print(create_bar_plot(pd, col, "LUNG_CANCER"))
}
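The counts behind each dodged bar plot can also be read off as a cross-tabulation. The following is a minimal sketch on an invented toy data frame (the `toy` object and its values are hypothetical; only the column names mirror the dataset):

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  COUGHING    = c(2, 2, 1, 2, 1, 1, 2, 2),
  LUNG_CANCER = c("YES", "YES", "NO", "YES", "NO", "YES", "YES", "NO")
)

# Counts of each COUGHING code within each target class
counts <- table(Coughing = toy$COUGHING, LungCancer = toy$LUNG_CANCER)
print(counts)

# Row-wise proportions: share of YES/NO within each COUGHING code
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

Row-wise proportions make groups of different sizes comparable, which raw bar heights do not.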
GENDER:
The gender of each individual, recorded as “M” or “F” and stored as a character vector.
AGE:
Age ranges from 21 to 87, with a mean of approximately 62.67 and a median of 62. The closeness of the mean and median, together with quartiles of 57 and 69, suggests that the bulk of the distribution is roughly symmetric, although the minimum of 21 lies much further from the median than the maximum of 87.
SMOKING:
Smoking status. As with every binary feature in this dataset, the Kaggle description codes 1 as “no” and 2 as “yes,” so the mean minus 1 gives the share of “yes” responses. The mean of 1.563 therefore indicates that about 56% of individuals are smokers.
YELLOW_FINGERS:
Presence (2) or absence (1) of yellow fingers. The mean of 1.57 indicates that about 57% of individuals have yellow fingers.
ANXIETY:
Presence (2) or absence (1) of anxiety. The mean of 1.498 indicates that almost exactly half of the individuals report anxiety.
PEER_PRESSURE:
Whether individuals experienced peer pressure (2) or not (1). The mean of 1.502 indicates that about half of the individuals report peer pressure.
CHRONIC.DISEASE:
Presence (2) or absence (1) of chronic disease. The mean of 1.505 indicates that about half of the individuals have a chronic disease.
FATIGUE:
Presence (2) or absence (1) of fatigue. The mean of 1.673 indicates that about 67% of individuals experience fatigue, the highest share among the binary features.
ALLERGY:
Presence (2) or absence (1) of allergies. The mean of 1.557 indicates that about 56% of individuals have allergies.
WHEEZING:
Presence (2) or absence (1) of wheezing. The mean of 1.557 indicates that about 56% of individuals experience wheezing.
ALCOHOL.CONSUMING:
Whether individuals consume alcohol (2) or not (1). The mean of 1.557 indicates that about 56% of individuals consume alcohol.
COUGHING:
Presence (2) or absence (1) of coughing. The mean of 1.579 indicates that about 58% of individuals report coughing.
SHORTNESS.OF.BREATH:
Presence (2) or absence (1) of shortness of breath. The mean of 1.641 indicates that about 64% of individuals report shortness of breath.
SWALLOWING.DIFFICULTY:
Presence (2) or absence (1) of swallowing difficulty. The mean of 1.469 indicates that about 47% of individuals report swallowing difficulty, the lowest share among the binary features.
CHEST.PAIN:
Presence (2) or absence (1) of chest pain. The mean of 1.557 indicates that about 56% of individuals report chest pain.
LUNG_CANCER:
The target variable, recording the presence (“YES”) or absence (“NO”) of lung cancer. It is stored as a character vector.
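Because each binary feature takes only the values 1 and 2, its mean maps directly onto the share of respondents coded 2: share = mean - 1. A minimal sketch with made-up values (the vector `x` is hypothetical and not taken from the dataset):

```r
# Hypothetical 1/2-coded feature, invented for illustration
x <- c(1, 2, 2, 1, 2, 2, 2, 1, 2, 1)

mean_x    <- mean(x)       # average of the codes
share_two <- mean_x - 1    # fraction of entries equal to 2
direct    <- mean(x == 2)  # same quantity, computed directly

cat("Mean code:", mean_x, "\n")
cat("Share coded 2 (from mean):", share_two, "\n")
cat("Share coded 2 (direct):", direct, "\n")
```

Reading the means this way avoids guessing at prevalence from phrases such as “somewhat prevalent.”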
We employed four different machine learning models for lung cancer prediction:
Logistic Regression
Decision Tree
Random Forest
Support Vector Machine (SVM)
To evaluate the performance of each model, we calculated the following metrics on the test set:
Accuracy:
The proportion of all test cases that the model classified correctly.
Precision:
The proportion of true positive predictions out of all positive predictions made by the model.
Recall:
The proportion of true positive predictions out of all actual positive cases in the test data.
F1-Score:
The harmonic mean of precision and recall, providing a balanced measure between the two.
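These four definitions can be worked through by hand from the cells of a confusion matrix. The sketch below uses invented counts (`tp`, `fp`, `fn`, and `tn` are hypothetical and unrelated to the results in this report):

```r
# Hypothetical confusion-matrix counts, for illustration only
tp <- 40  # predicted positive, actually positive
fp <- 10  # predicted positive, actually negative
fn <- 5   # predicted negative, actually positive
tn <- 45  # predicted negative, actually negative

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1_score  <- 2 * precision * recall / (precision + recall)

cat("Accuracy:", round(accuracy, 2), "\n")
cat("Precision:", round(precision, 2), "\n")
cat("Recall:", round(recall, 2), "\n")
cat("F1-Score:", round(f1_score, 2), "\n")
```

Note that precision, recall, and F1-score all depend on which class is labelled “positive,” whereas accuracy does not.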
# Split the data into training and testing sets
set.seed(4325) # Set seed for reproducibility
split <- sample.split(pd$LUNG_CANCER, SplitRatio = 0.7)
train_data <- subset(pd, split == TRUE)
test_data <- subset(pd, split == FALSE)
# Encode the target as 1 ("YES") and 0 ("NO") for logistic regression
train_data$LUNG_CANCER <- ifelse(train_data$LUNG_CANCER == "YES", 1, 0)
test_data$LUNG_CANCER <- ifelse(test_data$LUNG_CANCER == "YES", 1, 0)
# Train and Evaluate Models for Lung Cancer Prediction
# Model 1: Logistic Regression
model_logistic <- glm(LUNG_CANCER ~ ., data = train_data, family = binomial)
logistic_predictions <- predict(model_logistic, newdata = test_data, type = "response")
# Convert the predicted probabilities to binary (0 and 1) based on a threshold (e.g., 0.5)
logistic_predictions <- ifelse(logistic_predictions >= 0.5, 1, 0)
# Convert both predicted and actual values to factors with the same levels (0 and 1)
logistic_predictions <- factor(logistic_predictions, levels = c(0, 1))
test_data$LUNG_CANCER <- factor(test_data$LUNG_CANCER, levels = c(0, 1))
# Model 2: Decision Trees
model_decision_tree <- rpart(LUNG_CANCER ~ ., data = train_data, method = "class")
decision_tree_predictions <- predict(model_decision_tree, newdata = test_data, type = "class")
# Convert the training target to a factor so that randomForest and svm fit classifiers
# (test_data$LUNG_CANCER is already a factor from the logistic regression step)
train_data$LUNG_CANCER <- factor(train_data$LUNG_CANCER, levels = c(0, 1))
# Model 3: Random Forest
model_random_forest <- randomForest(LUNG_CANCER ~ ., data = train_data, ntree = 100)
random_forest_predictions <- predict(model_random_forest, newdata = test_data)
# Align the prediction levels with those of the actual values
levels_actual <- levels(test_data$LUNG_CANCER)
random_forest_predictions <- factor(random_forest_predictions, levels = levels_actual)
# Model 4: Support Vector Machines (SVM)
model_svm <- svm(LUNG_CANCER ~ ., data = train_data)
svm_predictions <- predict(model_svm, newdata = test_data)
# Set factor levels explicitly for both predicted and actual values
svm_predictions <- factor(svm_predictions, levels = levels_actual)
# Evaluation of Models
# Function to calculate and print evaluation metrics.
# confusionMatrix() treats the first factor level as the positive class, so with
# levels c(0, 1) the precision, recall, and F1 below describe class 0 (no lung cancer).
calculate_metrics <- function(predictions, actual) {
  confusion <- confusionMatrix(predictions, actual)
  accuracy <- confusion$overall["Accuracy"]
  precision <- confusion$byClass["Pos Pred Value"]
  recall <- confusion$byClass["Sensitivity"]
  f1_score <- confusion$byClass["F1"]
  cat("Accuracy:", round(accuracy, 2), "\n")
  cat("Precision:", round(precision, 2), "\n")
  cat("Recall:", round(recall, 2), "\n")
  cat("F1-Score:", round(f1_score, 2), "\n")
}
# Evaluate Logistic Regression Model
cat("Logistic Regression Model:\n")
## Logistic Regression Model:
calculate_metrics(logistic_predictions, test_data$LUNG_CANCER)
## Accuracy: 0.94
## Precision: 0.75
## Recall: 0.75
## F1-Score: 0.75
# Evaluate Decision Tree Model
cat("Decision Tree Model:\n")
## Decision Tree Model:
calculate_metrics(decision_tree_predictions, test_data$LUNG_CANCER)
## Accuracy: 0.9
## Precision: 0.71
## Recall: 0.42
## F1-Score: 0.53
# Plot the Decision Tree
rpart.plot(model_decision_tree, box.palette = "RdBu", shadow.col = "gray", nn = TRUE)
# Evaluate Random Forest Model
cat("Random Forest Model:\n")
## Random Forest Model:
calculate_metrics(random_forest_predictions, test_data$LUNG_CANCER)
## Accuracy: 0.96
## Precision: 0.83
## Recall: 0.83
## F1-Score: 0.83
# Evaluate SVM Model
cat("Support Vector Machines (SVM) Model:\n")
## Support Vector Machines (SVM) Model:
calculate_metrics(svm_predictions, test_data$LUNG_CANCER)
## Accuracy: 0.94
## Precision: 0.75
## Recall: 0.75
## F1-Score: 0.75
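The evaluation metrics for each model are as follows:
Logistic Regression Model:
Accuracy 0.94, Precision 0.75, Recall 0.75, F1-Score 0.75
Decision Tree Model:
Accuracy 0.90, Precision 0.71, Recall 0.42, F1-Score 0.53
Random Forest Model:
Accuracy 0.96, Precision 0.83, Recall 0.83, F1-Score 0.83
Support Vector Machines (SVM) Model:
Accuracy 0.94, Precision 0.75, Recall 0.75, F1-Score 0.75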
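Based on these metrics, the Random Forest model scored highest on all four measures (accuracy 0.96; precision, recall, and F1-score 0.83), making it the most effective of the four models on this test split. Logistic regression and the SVM tied at an accuracy of 0.94, and the decision tree trailed mainly on recall (0.42). These figures come from a single 70/30 split of 309 observations, so the gaps between models rest on a small number of test cases. Because confusionMatrix() was called without a positive argument, caret used the first factor level, 0 (no lung cancer), as the positive class; the precision, recall, and F1-score values therefore describe how well each model identifies individuals without lung cancer.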
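caret's confusionMatrix() takes the first factor level as the positive class unless the `positive` argument is supplied; with levels c(0, 1), as in the code above, that is “0” (no lung cancer). The following is a minimal sketch on invented labels (the `actual` and `predicted` vectors are hypothetical) showing how passing `positive = "1"` changes which class the recall refers to:

```r
library(caret)

# Hypothetical labels: 1 = lung cancer, 0 = no lung cancer
actual    <- factor(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 1, 1, 0, 0, 0, 1, 1), levels = c(0, 1))

# Default: the first level ("0") is treated as the positive class
cm_default <- confusionMatrix(predicted, actual)

# Explicitly score the lung cancer class instead
cm_cancer <- confusionMatrix(predicted, actual, positive = "1")

cat("Positive class (default):", cm_default$positive, "\n")
cat("Recall for that class:", round(cm_default$byClass["Sensitivity"], 2), "\n")
cat("Positive class (explicit):", cm_cancer$positive, "\n")
cat("Recall for that class:", round(cm_cancer$byClass["Sensitivity"], 2), "\n")
```

Accuracy is identical in both calls; only the class-specific metrics move, which is why the choice of positive class should be stated alongside any precision or recall figure.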
REFERENCES
Data source: https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer
Kuhn, M. (2020). caret: Classification and Regression Training. R package version 6.0-86. https://CRAN.R-project.org/package=caret
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Muller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12(1), 77.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Healy, K. (2018). Data Visualization: A Practical Introduction. Princeton University Press.
Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media.
Wilke, C. O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media.