Introduction:

This report details an analysis of lung cancer risk factors and the development of a predictive logistic regression model. We will explore the data, address class imbalance, fit a model to identify significant associations, assess its predictive performance, and apply it to new test samples. Key insights into the impact of smoking status and its interactions with other variables will be a central focus.

Setup and Data Loading:

This section prepares the environment by loading necessary R libraries and importing the dataset, ensuring variables are in the correct format for analysis.

Load Libraries:

We begin by loading all required packages for data manipulation, visualization, statistical modeling, and model evaluation.

library(ggplot2)   # for plots
library(dplyr)     # for data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(caret)     # for confusion matrix

## Loading required package: lattice

library(pROC)      # for ROC curve and AUC

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(ROSE)      # for advanced balancing

## Loaded ROSE 0.0-4

library(precrec)   # for Precision-Recall Curve

## 
## Attaching package: 'precrec'

## The following object is masked from 'package:pROC':
## 
##     auc

# Libraries for calibration plot
library(scales)              # for percent format
library(ResourceSelection)  # for optional calibration test

## ResourceSelection 0.3-6   2023-06-27

Load and Prepare Data:

The primary dataset (lung cancer.csv) is loaded. The days_to_cancer variable is used to create a binary has_cancer outcome (1 if cancer, 0 otherwise). Categorical variables (gender, race, smoker) are converted into factor types with specified levels to ensure correct interpretation by the model.

train <- read.csv("C:/Users/ameen/OneDrive - Higher Education Commission/Desktop/Internship task Russia ITMO/lung cancer.csv")

train <- train %>%
  mutate(
    has_cancer = factor(ifelse(!is.na(days_to_cancer), 1, 0)),
    gender = factor(gender, levels = c("Female", "Male")),
    race = factor(race),
    smoker = factor(smoker, levels = c("Never", "Former", "Current"))
  )

cat("\nClass Distribution Before Balancing:\n")

## 
## Class Distribution Before Balancing:

print(table(train$has_cancer))

## 
##     0     1 
## 51394  2033

0 (51394): There are 51,394 individuals in the original training dataset without a recorded lung cancer diagnosis (the days_to_cancer field was missing, which I defined as “no cancer”).

1 (2033): There are 2,033 individuals in the original training dataset with a lung cancer diagnosis (the days_to_cancer field was present). Cancer cases make up only about 3.8% of the records, so the raw data show a severe class imbalance.
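The size of the imbalance can be checked directly from the two counts printed above. This is a minimal sketch using only those counts; the variable names are illustrative and not part of the original analysis.

```r
# Class counts copied from the table above
counts <- c(no_cancer = 51394, cancer = 2033)

# Share of cancer cases in the original training data
prevalence <- counts[["cancer"]] / sum(counts)

# Number of non-cancer records per cancer record
imbalance_ratio <- counts[["no_cancer"]] / counts[["cancer"]]

cat(sprintf("Prevalence: %.2f%%\n", 100 * prevalence))
cat(sprintf("Non-cancer records per cancer record: %.1f\n", imbalance_ratio))
```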

EDA on ORIGINAL DATA:

This section provides an initial exploration of the training data. We visualize key variable distributions and their relationships with the lung cancer outcome to gain insights into potential risk factors.

Summary Statistics:

cat("\nSummary Statistics of Original Training Data:\n")

## 
## Summary Statistics of Original Training Data:

summary(train[, c("age", "gender", "smoker", "has_cancer")])

##       age           gender          smoker      has_cancer
##  Min.   :43.00   Female:21910   Never  :    0   0:51394   
##  1st Qu.:57.00   Male  :31517   Former :27680   1: 2033   
##  Median :60.00                  Current:25747             
##  Mean   :61.42                                            
##  3rd Qu.:65.00                                            
##  Max.   :79.00

Interpretation:

age (Numerical Variable):

Min. :43.00: The youngest person in my dataset is 43 years old.

1st Qu.:57.00: 25% of the people in my dataset are 57 years old or younger.

Median :60.00: The middle age in my dataset is 60 years old. Half the people are younger than 60, and half are older.

Mean :61.42: The average age of people in my dataset is approximately 61.42 years. (Notice it’s slightly higher than the median, which can suggest a slight skew towards older ages).

3rd Qu.:65.00: 75% of the people in my dataset are 65 years old or younger. (This also means 25% are older than 65).

Max. :79.00: The oldest person in my dataset is 79 years old.

Overall for age: This tells me the range and distribution of ages in my data. It looks like my dataset contains adults, mostly in their late 50s to mid-60s.

gender (Categorical Variable/Factor):

Female:21910: There are 21,910 female individuals in my dataset.

Male :31517: There are 31,517 male individuals in my dataset.

Overall for gender: This shows me the count of individuals for each gender category. In this dataset, there are more males than females.

smoker (Categorical Variable/Factor):

Never : 0: There are zero individuals categorized as “Never” smokers in my dataset. This matters later: without never smokers, the analysis cannot compare smokers against people who never smoked.

Former :27680: There are 27,680 individuals categorized as “Former” smokers.

Current:25747: There are 25,747 individuals categorized as “Current” smokers.

Overall for smoker: This tells me the distribution of smoking statuses. The vast majority of my dataset comprises former and current smokers, with no ‘Never’ smokers represented.

has_cancer (Categorical Variable/Factor):

0:51394: There are 51,394 individuals in my dataset who are categorized as 0 (meaning they do not have cancer).

1: 2033: There are 2,033 individuals in my dataset who are categorized as 1 (meaning they do have cancer).

Overall for has_cancer: This clearly highlights the class imbalance we discussed earlier. The number of individuals without cancer is significantly higher than those with cancer.

Cancer Proportion by Smoking Status:

This bar plot visualizes the proportion of individuals with and without cancer for each smoking status category. Setting position = "fill" ensures that the bars represent proportions summing to 1 within each smoking group.

ggplot(train, aes(x = smoker, fill = factor(has_cancer))) +
  geom_bar(position = "fill") +
  labs(title = "Cancer Proportion by Smoking Status (Original Data)",
       subtitle = "No 'Never' smokers in this dataset",
       y = "Proportion")

### Interpretation:

Bar Graph:

The bar chart shows the proportion of individuals with and without lung cancer within each smoking status in the original dataset. Each bar represents 100% of the individuals in that smoking category. The blue segment is the proportion who have cancer, and the pink/red segment is the proportion who do not. Because the dataset contains no “Never” smokers, the chart can only compare “Former” with “Current” smokers; it cannot show how either group compares with people who never smoked.

Age Distribution by Cancer Status:

This density plot illustrates the distribution of age for both cancer and non-cancer groups, allowing us to see if age plays a role in lung cancer risk. alpha = 0.5 makes the densities semi-transparent so they can overlap.

ggplot(train, aes(x = age, fill = factor(has_cancer))) +
  geom_density(alpha = 0.5) +
  labs(title = "Age Distribution by Cancer Status (Original Data)")

### Interpretation:

Density Plot:

The density plot shows the age distribution for individuals grouped by lung cancer status in the original dataset. The pink/red curve (0) represents individuals without cancer, while the blue/teal curve (1) represents individuals with cancer.

The age distribution for individuals with cancer (blue/teal curve) is shifted towards older ages compared to those without cancer (pink/red curve). The peak density for the cancer group occurs at a higher age, and the curve extends further into the older age ranges. This suggests that advancing age is associated with an increased likelihood of having lung cancer.

Data Balancing and Logistic Regression Model:

This section details the balancing of the training dataset to mitigate class imbalance, followed by fitting a logistic regression model, including interaction terms, to predict lung cancer.

Data Balancing with ROSE:

To address the observed class imbalance, we apply the ROSE (Random OverSampling Examples) technique. ROSE draws a new synthetic dataset from a smoothed bootstrap of the original records. With its default settings it keeps the total sample size and generates each class with probability 0.5, so the minority class is oversampled and the majority class is undersampled, resulting in a roughly balanced dataset.

set.seed(123)
train_balanced <- ROSE(has_cancer ~ age + gender + race + smoker,
                       data = train,
                       seed = 123)$data

cat("\nClass Distribution After Balancing with ROSE:\n")

## 
## Class Distribution After Balancing with ROSE:

print(table(train_balanced$has_cancer))

## 
##     0     1 
## 27020 26407

cat("\nFactor Levels for Smoker in Balanced Data:\n")

## 
## Factor Levels for Smoker in Balanced Data:

print(levels(train_balanced$smoker))

## [1] "Never"   "Former"  "Current"

Interpretation:

Class Distribution After Balancing with ROSE:

This shows that after applying the ROSE technique, my dataset has 27,020 individuals without cancer (0) and 26,407 individuals with cancer (1). The total is still 53,427, the same size as the original data, but the classes are now close to a 50/50 split, so the model sees a comparable number of cancer and non-cancer records during training.

Factor Levels for Smoker in Balanced Data:

This confirms that the smoker variable in my balanced dataset keeps the levels “Never,” “Former,” and “Current” in the intended order. Note that “Never” remains an empty level, since the data contain no never smokers.
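An empty factor level affects how the model coefficients should be read: glm() drops unused levels when it builds its model frame, so the first level that actually has observations becomes the reference category. The sketch below shows this on made-up toy data; the eight rows are hypothetical and only illustrate the behaviour.

```r
# Hypothetical toy data: "Never" is a declared level with no observations
toy <- data.frame(
  y = c(0, 1, 0, 1, 1, 0, 0, 1),
  smoker = factor(rep(c("Former", "Current"), times = 4),
                  levels = c("Never", "Former", "Current"))
)

table(toy$smoker)                 # "Never" has a count of 0
levels(droplevels(toy$smoker))    # only the levels that occur remain

# glm() drops the unused level, so "Former" acts as the reference category
fit <- glm(y ~ smoker, data = toy, family = binomial)
coef_names <- names(coef(fit))
print(coef_names)
```

The fitted model has a smokerCurrent coefficient but no smokerFormer coefficient, the same pattern as a model fitted to data without never smokers.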

Logistic Regression Model Fitting:

A logistic regression model is fitted using the balanced training data. We include age, gender, race, and smoker as primary predictors. Importantly, we add interaction terms (age:smoker and gender:smoker) to explore how the effect of age and gender on cancer risk might differ based on smoking status.

model <- glm(has_cancer ~ age + gender + race + smoker + age:smoker + gender:smoker,
             data = train_balanced,
             family = "binomial")

Model Summary and Odds Ratios:

The summary() function provides coefficients, standard errors, z-values, and p-values for each predictor in the model, indicating their statistical significance. Odds ratios (exponentiated coefficients) are also calculated, which are easier to interpret in logistic regression as they represent the multiplicative change in odds of the outcome for a one-unit increase in the predictor.

cat("\nLogistic Regression Model Summary (trained on balanced data):\n")

## 
## Logistic Regression Model Summary (trained on balanced data):

summary(model)

## 
## Call:
## glm(formula = has_cancer ~ age + gender + race + smoker + age:smoker + 
##     gender:smoker, family = "binomial", data = train_balanced)
## 
## Coefficients:
##                                                Estimate Std. Error z value
## (Intercept)                                   -6.170861   0.209253 -29.490
## age                                            0.099615   0.002533  39.331
## genderMale                                     0.038159   0.027359   1.395
## raceAsian                                     -0.620220   0.150463  -4.122
## raceBlack or African-American                 -0.275570   0.142838  -1.929
## raceMore than one race                        -0.617272   0.160759  -3.840
## raceN/A                                       -0.253281   0.183381  -1.381
## raceNative Hawaiian or Other Pacific Islander -1.000773   0.220009  -4.549
## raceParticipant refused to answer             -1.734132   0.317765  -5.457
## raceWhite                                     -0.460236   0.136903  -3.362
## smokerCurrent                                  1.494646   0.219977   6.795
## age:smokerCurrent                             -0.013101   0.003504  -3.739
## genderMale:smokerCurrent                      -0.039541   0.036843  -1.073
##                                               Pr(>|z|)    
## (Intercept)                                    < 2e-16 ***
## age                                            < 2e-16 ***
## genderMale                                    0.163097    
## raceAsian                                     3.75e-05 ***
## raceBlack or African-American                 0.053701 .  
## raceMore than one race                        0.000123 ***
## raceN/A                                       0.167226    
## raceNative Hawaiian or Other Pacific Islander 5.40e-06 ***
## raceParticipant refused to answer             4.83e-08 ***
## raceWhite                                     0.000774 ***
## smokerCurrent                                 1.09e-11 ***
## age:smokerCurrent                             0.000185 ***
## genderMale:smokerCurrent                      0.283167    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 74059  on 53426  degrees of freedom
## Residual deviance: 70052  on 53414  degrees of freedom
## AIC: 70078
## 
## Number of Fisher Scoring iterations: 4

cat("\nOdds Ratios and 95% Confidence Intervals:\n")

## 
## Odds Ratios and 95% Confidence Intervals:

exp(cbind(OR = coef(model), confint(model)))

## Waiting for profiling to be done...

##                                                        OR       2.5 %
## (Intercept)                                   0.002089435 0.001386555
## age                                           1.104745363 1.099286400
## genderMale                                    1.038895964 0.984665924
## raceAsian                                     0.537826190 0.399737956
## raceBlack or African-American                 0.759139273 0.572629459
## raceMore than one race                        0.539413868 0.392972042
## raceN/A                                       0.776249890 0.541400427
## raceNative Hawaiian or Other Pacific Islander 0.367595352 0.238066535
## raceParticipant refused to answer             0.176553344 0.092192801
## raceWhite                                     0.631134470 0.481551573
## smokerCurrent                                 4.457756949 2.896786341
## age:smokerCurrent                             0.986984003 0.980228736
## genderMale:smokerCurrent                      0.961230362 0.894254694
##                                                    97.5 %
## (Intercept)                                   0.003149604
## age                                           1.110254886
## genderMale                                    1.096140251
## raceAsian                                     0.721324890
## raceBlack or African-American                 1.002952958
## raceMore than one race                        0.738274329
## raceN/A                                       1.111458031
## raceNative Hawaiian or Other Pacific Islander 0.564402509
## raceParticipant refused to answer             0.322393510
## raceWhite                                     0.824091326
## smokerCurrent                                 6.861395057
## age:smokerCurrent                             0.993784583
## genderMale:smokerCurrent                      1.033197987

Interpretation of Model Coefficients and Odds Ratios:

Age: For each additional year of age, the odds of having lung cancer are multiplied by about 1.10 (roughly a 10% increase), holding other factors constant. Because the model includes an age:smoker interaction, this estimate applies to the reference smoking group. This is a highly statistically significant finding (p<2e-16), indicating age is a strong risk factor.
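As a worked example of the exponentiation, the age odds ratio can be recomputed from the age estimate (0.099615) and standard error (0.002533) reported by summary(model). This sketch uses a Wald interval (estimate plus or minus about 1.96 standard errors), which is an approximation; the intervals printed above come from confint(), which profiles the likelihood, so the limits can differ slightly.

```r
# Age coefficient and standard error copied from summary(model)
beta_age <- 0.099615
se_age   <- 0.002533

# Odds ratio per additional year of age
or_age <- exp(beta_age)

# Approximate 95% Wald confidence interval on the odds-ratio scale
wald_ci <- exp(beta_age + c(-1, 1) * qnorm(0.975) * se_age)

# Odds ratio for a 10-year age difference (reference smoking group)
or_10yr <- exp(10 * beta_age)

round(c(OR = or_age, lower = wald_ci[1], upper = wald_ci[2], OR_10yr = or_10yr), 3)
```

The 10-year figure shows why a per-year odds ratio close to 1 can still describe a strong effect over the age range in the data.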

Gender (Male vs. Female): In the reference smoking group, being male multiplies the odds of lung cancer by about 1.04 compared to being female. This finding is not statistically significant (p=0.163; the 95% confidence interval of 0.98 to 1.10 includes 1), so the data do not show a clear gender difference.

Race: (Compared to the model’s baseline race category, which is the first level of the race factor and therefore does not appear in the coefficient table.)

Individuals in the Asian race group have significantly lower odds of cancer (Odds Ratio: 0.538, p=3.75e-05).

Black or African-American individuals have lower estimated odds (Odds Ratio: 0.759), but the difference is not significant at the 0.05 level (p=0.0537; the 95% confidence interval of 0.57 to 1.00 just includes 1).

More than one race individuals have significantly lower odds (Odds Ratio: 0.539, p=0.000123).

The N/A category has lower estimated odds (Odds Ratio: 0.776), but the difference is not statistically significant (p=0.167).

Native Hawaiian or Other Pacific Islander individuals have significantly lower odds (Odds Ratio: 0.368, p=5.40e-06).

Participant refused to answer category has significantly lower odds (Odds Ratio: 0.177, p=4.83e-08).

White individuals have significantly lower odds (Odds Ratio: 0.631, p=0.000774). Compared to the reference race, every listed category has an odds ratio below 1, and five of the seven differences are statistically significant.

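Smoker Status (Current vs. Former): This is a very important finding, but it needs careful reading. The data contain no “Never” smokers, so glm() drops that empty level and “Former” becomes the reference category; this is why the output has a smokerCurrent row but no smokerFormer row. The smokerCurrent odds ratio of about 4.46 (p=1.09e-11) therefore compares Current with Former smokers, not with never smokers. Because the model also includes age:smoker and gender:smoker interactions, this value applies to a female at age 0, far outside the age range of 43 to 79 in the original data. At realistic ages the main effect must be combined with the age interaction, which gives a smaller odds ratio.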

Interaction Term (age:smokerCurrent): This interaction (Odds Ratio: 0.987, p=0.000185) means that for Current smokers, the per-year effect of age on cancer odds is slightly weaker (the age odds ratio is multiplied by 0.987, about 1.3% lower per year) than for Former smokers, the reference group. This is a small but statistically significant modification.

Interaction Term (genderMale:smokerCurrent): This interaction (Odds Ratio: 0.961, p=0.283167) suggests a slight decrease in cancer odds for males who are current smokers compared to what’s expected from their individual effects. However, this finding is not statistically significant, so we cannot be confident in this specific interaction.
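Because smoking status interacts with age, the main-effect odds ratios describe only the reference combination of the other variables. The sketch below combines coefficients copied from summary(model) to obtain the per-year age effect among current smokers and the odds ratio for current smoking (relative to the reference smoking group) at a given age for females. The helper name or_current_vs_reference is illustrative, and the example ages are arbitrary values inside the observed range.

```r
# Coefficients copied from summary(model)
b_age         <- 0.099615    # age (reference smoking group)
b_current     <- 1.494646    # smokerCurrent
b_age_current <- -0.013101   # age:smokerCurrent

# Per-year odds ratio for age among current smokers
or_age_current <- exp(b_age + b_age_current)

# Odds ratio of current smoking vs. the reference smoking group,
# for a female of the given age (gender interaction term is zero for females)
or_current_vs_reference <- function(age) exp(b_current + b_age_current * age)

round(or_age_current, 3)
round(or_current_vs_reference(c(50, 60, 70)), 2)
```

Evaluated at ages in the observed range, the smoking odds ratio is considerably smaller than the value of the main effect alone, which corresponds to age 0.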

Summary of Most Important Risk Factors: Based on this analysis, the clearest and most statistically significant risk factors for lung cancer are age and current smoking (relative to former smoking, since the data contain no never smokers). Several race categories show significantly lower odds than the reference race category. The interaction between age and current smoking indicates a small but significant weakening of the age effect among current smokers.

Model Evaluation:

This section evaluates the predictive performance of the logistic regression model on the balanced training data using various metrics and visualizations.

Optimal Threshold Selection:

For binary classification, a threshold is needed to convert predicted probabilities into class labels (0 or 1). We select a threshold from the ROC curve using the “closest.topleft” method, which picks the point on the curve closest to the top-left corner (sensitivity = 1, specificity = 1). This treats sensitivity and specificity equally rather than favouring one of them.
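The idea behind the closest-to-top-left rule can be sketched in base R: compute sensitivity and specificity at each candidate threshold and keep the threshold that minimizes the squared distance (1 - sensitivity)^2 + (1 - specificity)^2. The data below are simulated and hypothetical, and pROC chooses its candidate thresholds differently, so this is an illustration of the criterion rather than a reproduction of coords().

```r
set.seed(42)

# Hypothetical toy data: binary labels and noisy scores in (0, 1)
labels <- rep(c(0, 1), each = 100)
scores <- plogis(rnorm(200, mean = ifelse(labels == 1, 0.5, -0.5)))

# Candidate thresholds: every distinct score
thresholds <- sort(unique(scores))

# Sensitivity and specificity when predicting 1 for scores above the threshold
sens <- sapply(thresholds, function(t) mean(scores[labels == 1] > t))
spec <- sapply(thresholds, function(t) mean(scores[labels == 0] <= t))

# Squared distance from each ROC point to the top-left corner
dist2 <- (1 - sens)^2 + (1 - spec)^2
best  <- which.min(dist2)

c(threshold = thresholds[best], sensitivity = sens[best], specificity = spec[best])
```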

train_balanced$pred_prob <- predict(model, type = "response")

roc_obj <- roc(as.numeric(as.character(train_balanced$has_cancer)), train_balanced$pred_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

optimal_threshold <- coords(roc_obj, x = "best",
                            best.method = "closest.topleft",
                            ret = "threshold")$threshold

cat(paste("\nOptimal Threshold (prioritizing sensitivity):", round(optimal_threshold, 4), "\n"))

## 
## Optimal Threshold (prioritizing sensitivity): 0.489

train_balanced$pred_class <- ifelse(train_balanced$pred_prob > optimal_threshold, 1, 0)

Interpretation:

An optimal threshold of 0.489 was determined. Any predicted probability of lung cancer strictly above this value is classified as ‘1’ (has cancer), and any value at or below it as ‘0’ (no cancer), matching the `>` comparison in the code. Although the printed label says “prioritizing sensitivity”, the closest.topleft criterion weights sensitivity and specificity equally, so the threshold aims for a balance between correctly identifying actual cancer cases (sensitivity) and correctly identifying actual non-cancer cases (specificity).

Confusion Matrix:

The confusion matrix provides a comprehensive breakdown of the model’s classification performance, showing true positives, true negatives, false positives, and false negatives. From this, key metrics like Accuracy, Sensitivity, and Specificity are derived.

cat("\nConfusion Matrix on Balanced Training Data:\n")

## 
## Confusion Matrix on Balanced Training Data:

confusionMatrix(factor(train_balanced$pred_class, levels = c("0", "1")),
                factor(train_balanced$has_cancer, levels = c("0", "1")))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16404  9954
##          1 10616 16453
##                                           
##                Accuracy : 0.615           
##                  95% CI : (0.6108, 0.6191)
##     No Information Rate : 0.5057          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2301          
##                                           
##  Mcnemar's Test P-Value : 4.051e-06       
##                                           
##             Sensitivity : 0.6071          
##             Specificity : 0.6231          
##          Pos Pred Value : 0.6224          
##          Neg Pred Value : 0.6078          
##              Prevalence : 0.5057          
##          Detection Rate : 0.3070          
##    Detection Prevalence : 0.4933          
##       Balanced Accuracy : 0.6151          
##                                           
##        'Positive' Class : 0               
##

Interpretation:

True Negatives (TN): 16,404 individuals were correctly predicted as not having cancer (0).

False Positives (FP): 10,616 individuals were incorrectly predicted as having cancer (1) when they did not.

False Negatives (FN): 9,954 individuals were incorrectly predicted as not having cancer (0) when they actually did.

True Positives (TP): 16,453 individuals were correctly predicted as having cancer (1).

From these counts, the key performance metrics follow. Note that caret treated 0 (no cancer) as the 'Positive' class in the output above, so its Sensitivity and Specificity labels refer to the no-cancer class. The values below are restated with cancer (1) as the positive class.

Accuracy: 0.615 (61.5%). This is the overall percentage of correct predictions.

Sensitivity (Recall): 0.6231 (62.31%). The model correctly identified 62.31% of all actual lung cancer cases (16,453 of 26,407).

Specificity: 0.6071 (60.71%). The model correctly identified 60.71% of all actual non-cancer cases (16,404 of 27,020).

Positive Predictive Value (Precision): 0.6078 (60.78%). When the model predicts cancer, it is correct 60.78% of the time.

Negative Predictive Value: 0.6224 (62.24%). When the model predicts no cancer, it is correct 62.24% of the time.

Balanced Accuracy: 0.6151. This confirms similar performance across both classes. The model’s accuracy is significantly better than the no-information rate (p-value < 2.2e-16), though the Kappa statistic of 0.2301 indicates only fair agreement beyond what would be expected by chance.
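The metrics can be recomputed by hand from the four cell counts, which makes it explicit which class each metric refers to. This is a minimal sketch with cancer (1) taken as the positive class; the variable names are illustrative.

```r
# Cell counts copied from the confusion matrix, with cancer (1) as the positive class
TP <- 16453; FN <- 9954
TN <- 16404; FP <- 10616

metrics <- c(
  accuracy    = (TP + TN) / (TP + TN + FP + FN),
  sensitivity = TP / (TP + FN),   # share of actual cancer cases flagged
  specificity = TN / (TN + FP),   # share of actual non-cancer cases cleared
  ppv         = TP / (TP + FP),   # how often a cancer prediction is right
  npv         = TN / (TN + FN)    # how often a no-cancer prediction is right
)
round(metrics, 4)
```

caret's confusionMatrix() accepts a positive argument; passing positive = "1" would make its Sensitivity and Specificity labels refer to the cancer class directly.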

ROC Curve and AUC:

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) across various threshold settings. The Area Under the Curve (AUC) summarizes the model’s overall discriminative ability; a higher AUC (closer to 1) indicates better performance.

plot(roc_obj, main = paste("ROC Curve (Balanced Data) - AUC =", round(roc_obj$auc, 3)))

### Interpretation:

The Area Under the ROC Curve (AUC) is 0.655. An AUC value of 0.655 suggests that the model has fair to moderate ability to distinguish between individuals with and without lung cancer. A value closer to 1 would indicate excellent discrimination, while 0.5 suggests performance no better than random chance.
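One way to read the AUC is as the probability that a randomly chosen cancer case receives a higher predicted score than a randomly chosen non-cancer case. The sketch below computes it from ranks on a hypothetical four-observation example; the function name auc_rank and the toy values are illustrative, not taken from the analysis.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties receive average ranks
auc_rank <- function(scores, labels) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical toy example: 3 of the 4 case/control pairs are ordered correctly
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
toy_labels <- c(0, 0, 1, 1)
auc_rank(toy_scores, toy_labels)   # 0.75
```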

Precision-Recall Curve:

The Precision-Recall (PR) curve is particularly useful for imbalanced datasets, as it focuses on the performance of the positive class (cancer cases). Precision is the proportion of positive predictions that were actually correct, and Recall (Sensitivity) is the proportion of actual positives that were correctly identified.

precrec_obj <- evalmod(scores = train_balanced$pred_prob, labels = as.numeric(as.character(train_balanced$has_cancer)))
autoplot(precrec_obj, "PRC") + ggtitle("Precision-Recall Curve (Balanced Data)")

### Interpretation:

The Precision-Recall (PR) curve shows the trade-off between precision (the proportion of positive predictions that are correct) and recall (the proportion of actual positive cases that are correctly identified). As seen in the plot, when recall is low (meaning the model is only identifying a small fraction of actual cancer cases), the precision is relatively high (around 0.75), meaning most of those few predictions are correct. However, as the model attempts to identify more actual cancer cases (as recall increases), the precision gradually decreases. This curve indicates that achieving very high recall without a significant drop in precision is challenging for this model.

Calibration Plot:

A calibration plot assesses how well the model’s predicted probabilities align with the observed event rates. A perfectly calibrated model’s predictions would fall along the diagonal line (y=x).

# Create deciles of predicted probabilities
train_balanced$prob_bin <- cut(train_balanced$pred_prob,
                               breaks = quantile(train_balanced$pred_prob, probs = seq(0, 1, 0.1)),
                               include.lowest = TRUE)

# Calculate observed vs predicted
calib_data <- train_balanced %>%
  group_by(prob_bin) %>%
  summarise(
    mean_pred = mean(pred_prob),
    obs_rate = mean(as.numeric(as.character(has_cancer)))
  )

# Plot calibration curve
ggplot(calib_data, aes(x = mean_pred, y = obs_rate)) +
  geom_line(color = "blue") +
  geom_point(size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Calibration Plot (Balanced Data)",
    x = "Predicted Probability",
    y = "Observed Cancer Rate"
  )

### Interpretation:

The calibration plot visually assesses how well the model’s predicted probabilities align with the actual observed rates of lung cancer. The blue line represents the observed cancer rate for given predicted probabilities (grouped into deciles), while the dashed red line represents ideal calibration (where predicted probability perfectly matches observed rate). In this plot, the blue line generally follows the dashed red line quite closely, indicating that the model is reasonably well-calibrated on the balanced data. Because about half of the balanced records are cancer cases, this agreement applies to the balanced sample rather than to the original population, in which about 3.8% of individuals have cancer.
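Predicted probabilities from a model trained on balanced data are scaled to a population in which about half the people have cancer. A common adjustment, sketched below, rescales the predicted odds by the ratio of the original class odds to the training class odds. This is an approximation and not part of the original analysis: the correction is exact only when rebalancing selects records purely by outcome, whereas ROSE generates synthetic records. The function name adjust_to_prevalence is hypothetical.

```r
# Rescale a probability from the training prevalence to a target prevalence
adjust_to_prevalence <- function(p, train_prev, target_prev) {
  odds       <- p / (1 - p)
  odds_shift <- (target_prev / (1 - target_prev)) / (train_prev / (1 - train_prev))
  adj_odds   <- odds * odds_shift
  adj_odds / (1 + adj_odds)
}

train_prev  <- 26407 / (27020 + 26407)   # cancer share after ROSE
target_prev <- 2033 / (51394 + 2033)     # cancer share in the original data

# Example: a balanced-data prediction of 0.50
p_adjusted <- adjust_to_prevalence(0.50, train_prev, target_prev)
round(p_adjusted, 3)
```

Under this assumption, a balanced-data probability of 0.50 corresponds to a probability close to the original prevalence, which shows how strongly the balancing step inflates the raw predictions.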

Summary of Model Performance:

Overall, the logistic regression model demonstrates fair to moderate predictive performance on the balanced training data. The AUC of 0.655 indicates a fair to moderate discriminative ability, and the confusion matrix shows balanced performance in terms of sensitivity (62.31% of cancer cases detected) and specificity (60.71% of non-cancer cases correctly identified). The Kappa statistic of 0.2301 suggests a fair agreement between predictions and actual outcomes. All of these figures are computed on the same balanced data used to fit the model.

Test Predictions and Risk Categorization:

This section applies the trained model to a new, unseen test dataset to infer the outcome probability for individuals and categorize their risk levels.

# 6. Test Predictions
# ----------------------------
test <- read.csv("C:/Users/ameen/OneDrive - Higher Education Commission/Desktop/Internship task Russia ITMO/lung cancer_test.csv") %>%
  mutate(
    gender = factor(gender, levels = c("Female", "Male")),
    smoker = factor(smoker, levels = c("Never", "Former", "Current"))
  )

## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on 'C:/Users/ameen/OneDrive -
## Higher Education Commission/Desktop/Internship task Russia ITMO/lung
## cancer_test.csv'

test$high_risk <- ifelse(test$age > 60 & test$smoker %in% c("Current", "Former"), 1, 0)

test$pred_prob <- predict(model, newdata = test, type = "response")
test$pred_class <- ifelse(test$pred_prob > optimal_threshold, 1, 0)

test <- test %>%
  mutate(risk_level = case_when(
    pred_prob > 0.3 ~ "High",
    pred_prob > 0.1 ~ "Medium",
    TRUE ~ "Low"
  ))

cat("\nTest Predictions with Risk Categorization:\n")

## 
## Test Predictions with Risk Categorization:

test %>% select(age, gender, smoker, high_risk, pred_prob, risk_level, pred_class) %>% head(10)

##   age gender  smoker high_risk pred_prob risk_level pred_class
## 1  70 Female Current         1 0.7149377       High          1
## 2  64   Male  Former         1 0.4917704       High          1
## 3  86   Male Current         1 0.9090690       High          1

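Interpretation:

The model was applied to the unseen test dataset to predict lung cancer probabilities for individuals. Each individual is assigned a pred_prob (predicted probability of having cancer), a pred_class (binary classification as 0 or 1 based on the optimal threshold of 0.489), and a risk_level (Low, Medium, or High) based on these probabilities. The three test records have predicted probabilities of 0.71, 0.49, and 0.91 respectively, all categorized as “High” risk and classified as ‘1’ (has cancer). These probabilities come from a model trained on balanced data, in which about half the records are cancer cases, so they are much higher than the risk in the original population, where about 3.8% of individuals have cancer. The third record (age 86) is also older than anyone in the original training data (maximum age 79), so its prediction is an extrapolation. The high_risk column provides a simple rule-based risk classification for comparison.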
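The three risk bands can also be expressed with cut(), which makes the cut points explicit. The sketch below assumes the same 0.1 and 0.3 cut points as the case_when() call above and uses a few hypothetical probabilities, including values that sit exactly on the boundaries.

```r
# Hypothetical probabilities, including values exactly on the cut points
probs <- c(0.05, 0.10, 0.20, 0.30, 0.7149377)

# Intervals are (-Inf, 0.1], (0.1, 0.3], (0.3, Inf), matching the strict ">" comparisons
risk_level <- cut(probs,
                  breaks = c(-Inf, 0.1, 0.3, Inf),
                  labels = c("Low", "Medium", "High"))

data.frame(probs, risk_level)
```

The result is a factor whose levels are ordered Low, Medium, High, so tables and plots list the bands in that order rather than alphabetically.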

Answers to Specific Case Questions: Smoking and Lung Cancer

How does smoking status impact the risk of lung cancer? What is the odds ratio?

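Smoking status has a significant impact on the risk of lung cancer. The dataset contains no “Never” smokers, so the model drops that level and uses “Former” smokers as the reference; this is why no separate odds ratio for “Former” smokers appears in the output. The smokerCurrent odds ratio is about 4.46 (95% CI 2.90 to 6.86, p=1.09e−11), comparing Current with Former smokers. Because of the interaction terms, this value applies to a female at age 0; at the ages present in the data the current-versus-former odds ratio is smaller once the age:smokerCurrent term is included.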

How does smoking interact with other covariates? Infer the interaction effects.

Age and Smoking Interaction (age:smokerCurrent): The odds ratio for this interaction is 0.987 (p=0.000185). This indicates that for Current smokers, the per-year effect of age on cancer odds is slightly weaker (about 1.3% lower per year) than for Former smokers, the reference group. This is a small but statistically significant modification.

Gender and Smoking Interaction (genderMale:smokerCurrent): The odds ratio for this interaction is 0.961 (p=0.283167). This suggests a slight decrease in cancer odds for males who are current smokers compared to what’s expected from their individual effects. However, this finding is not statistically significant, meaning we cannot be confident in this specific interaction.

“Lung Cancer Prediction Model Report”

Ameen Talha

2025-06-26
