Introduction

Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide. Early diagnosis and effective treatment strategies are essential for improving patient survival rates. However, predicting individual patient outcomes remains a major challenge due to the complex interplay of clinical and biological factors.

Fig.1

This project explores how machine learning models can predict survival status in breast cancer patients using a dataset of demographic, histopathological, and treatment-related variables. The goal is to identify the most influential predictors of survival and to demonstrate how models such as Random Forest and Decision Tree can support clinical decision-making and help improve patient outcomes.

Aim of the Study

The aim of this study is to predict survival outcomes in breast cancer patients and identify factors that influence prognosis.

Research Questions

Which clinical and molecular factors most strongly influence survival outcomes in breast cancer patients?
Which predictive model provides the highest accuracy and reliability for predicting breast cancer survival outcomes?

Research Objectives

The objectives of this study are to:

Identify and rank the clinical and molecular factors that most significantly influence survival outcomes in breast cancer patients.
Evaluate and compare two predictive models to determine the most accurate and reliable model for predicting breast cancer survival outcomes.

Research Methodology

This study uses a quantitative, predictive research design based on supervised machine learning. The design is appropriate because the goal of the study is to predict breast cancer survival outcomes and identify the factors that influence prognosis.

Data Source

The dataset used in this study was obtained from Kaggle, a public data repository. It contains anonymized clinical variables, molecular biomarkers, and patient survival outcomes.

Data Preprocessing

The dataset was imported into R and prepared for analysis. The preprocessing steps included:

Removing rows with missing or inconsistent entries
Exploratory data analysis (univariate and bivariate)
Encoding categorical variables
Balancing the imbalanced target variable using random oversampling
Splitting the data into training and testing sets

These steps ensured that the data was clean and suitable for model development.

Methods Aligned With Research Objectives

Objective 1:

Identify and rank the clinical and molecular factors that most significantly influence survival outcomes in breast cancer patients.

Method:

Feature importance analysis was performed using the odds ratio of the variables. This technique helped determine and rank the most influential predictors of survival.
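One common way to obtain odds ratios is to fit a logistic regression and exponentiate its coefficients. The sketch below illustrates that idea on simulated data. The data, the variable names (age, protein1, dead) and the ranking rule are assumptions made for illustration, not the study's actual code or results.

```r
# Simulated stand-in data; not the study's dataset
set.seed(1)
n <- 200
toy <- data.frame(
  age      = round(rnorm(n, mean = 58, sd = 12)),
  protein1 = rnorm(n)
)

# Simulate an outcome where higher protein1 lowers the odds of death
log_odds <- -1 + 0.02 * (toy$age - 58) - 0.8 * toy$protein1
toy$dead <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: coefficients are on the log-odds scale
fit <- glm(dead ~ age + protein1, data = toy, family = binomial)

# Exponentiate to get odds ratios (intercept dropped),
# then rank by how far each one is from 1 on the log scale
odds_ratios <- exp(coef(fit))[-1]
ranked <- odds_ratios[order(abs(log(odds_ratios)), decreasing = TRUE)]
print(round(ranked, 3))
```

An odds ratio for a continuous variable is expressed per unit of that variable, so a ranking like this depends on measurement scale; standardising predictors first is one way to make the comparison fairer.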

Objective 2:

Evaluate and compare two predictive models to determine the most accurate and reliable model for predicting breast cancer survival outcomes.

Method:

Two machine learning algorithms, Random Forest and Decision Tree, were trained on the dataset. Each model was evaluated using accuracy, sensitivity, specificity, and the confusion matrix to identify the best-performing model.

Model Selection and Interpretation

The model with the highest performance on the test dataset was selected as the best predictive model. Feature importance plots and other visualizations were used to interpret how the model made predictions and which variables contributed most to survival outcomes.

Deployment of the Predictive Model

The selected model was deployed as an interactive Shiny web application to make predictions easily accessible.
The trained model was saved as an .rds file.
A Shiny interface was created to allow users to input patient features.
The server logic processed these inputs and generated survival predictions in real time.
The application was deployed on shinyapps.io, allowing users to access it from any device with internet access.
This deployment transformed the machine learning model into a practical decision-support tool.
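The deployment steps above can be sketched as a minimal Shiny app. This is an illustrative outline only: the file name survival_model.rds, the two inputs, and the stand-in logistic model are assumptions chosen so the example runs on its own, whereas the study deployed its selected model with the full set of patient features.

```r
library(shiny)

# Stand-in model so the sketch is self-contained (simulated data, hypothetical names)
set.seed(1)
toy <- data.frame(Age = round(runif(100, 30, 90)), Protein1 = rnorm(100))
toy$Dead <- rbinom(100, 1, plogis(-1 - 0.8 * toy$Protein1))
model_path <- file.path(tempdir(), "survival_model.rds")   # hypothetical file name
saveRDS(glm(Dead ~ Age + Protein1, data = toy, family = binomial), model_path)

# Load the saved model once, when the app starts
model <- readRDS(model_path)

# User interface: one input per patient feature, one output for the prediction
ui <- fluidPage(
  titlePanel("Survival prediction (sketch)"),
  numericInput("age", "Age (years)", value = 58, min = 18, max = 100),
  numericInput("protein1", "Protein1", value = 0, step = 0.1),
  verbatimTextOutput("prediction")
)

# Server logic: rebuild a one-row data frame from the inputs and predict
server <- function(input, output, session) {
  output$prediction <- renderPrint({
    new_patient <- data.frame(Age = input$age, Protein1 = input$protein1)
    p_dead <- predict(model, newdata = new_patient, type = "response")
    cat("Estimated probability of death:", round(unname(p_dead), 3), "\n")
  })
}

app <- shinyApp(ui, server)
# runApp(app)   # uncomment to launch locally
```

The column names in new_patient must match the names the model was trained on, otherwise predict() cannot find the variables.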

Tools and Software

The study used:

R and RStudio for data analysis and model development
Libraries such as caret, randomForest, gbm, e1071, and shiny
shinyapps.io for deployment

Data Description

The dataset comprises clinical and molecular information on breast cancer patients, aimed at predicting survival outcomes.

Age: Age of the patient at the time of diagnosis.
Gender: Patient’s gender (majority are female).
Tumour_Stage: Clinical stage of the tumour (I–III), indicating how far the cancer has progressed.
Histology: Microscopic classification of the tumour type, such as Infiltrating Ductal or Lobular Carcinoma.
HER2_status: Human Epidermal Growth Factor Receptor 2 expression; Positive indicates over-expression of HER2 proteins, often linked to more aggressive tumour behaviour.
Surgery_type: The surgical procedure performed, such as Lumpectomy, Mastectomy, or Modified Radical Mastectomy.
Protein1 – Protein4: Quantitative molecular markers representing different protein expression levels potentially associated with tumour growth or treatment response.
Patient_Status: Survival outcome of the patient, categorized as Alive or Dead.
ER status (Estrogen Receptor): A hormone receptor that indicates a tumour’s ability to respond to estrogen.
PR status (Progesterone Receptor): Another hormone receptor, often regulated by ER activity.

Loading the Libraries

# Load required libraries
library(readr)       # For fast and easy reading of CSV files
library(tidyverse)   # For data manipulation and visualization tools
library(ggplot2)
library(ROSE)
library(caret)
library(broom)
library(reshape2)
library(randomForest)
library(rpart)        # For building the decision tree model
library(rpart.plot)   # For visualizing the decision tree
library(plotly)       # For interactive visualization

# Import the dataset
breast_cancer <- read_csv("C:\\Users\\USER\\Documents\\R\\Breast cancer data.csv")   # Read CSV file into R

## Rows: 341 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Patient_ID, Gender, Tumour_Stage, Histology, ER status, PR status,...
## dbl  (5): Age, Protein1, Protein2, Protein3, Protein4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(breast_cancer)

## # A tibble: 6 × 16
##   Patient_ID     Age Gender Protein1 Protein2 Protein3 Protein4 Tumour_Stage
##   <chr>        <dbl> <chr>     <dbl>    <dbl>    <dbl>    <dbl> <chr>       
## 1 TCGA-D8-A1XD    36 FEMALE   0.0804    0.426   0.547    0.274  III         
## 2 TCGA-EW-A1OX    43 FEMALE  -0.420     0.578   0.614   -0.0315 II          
## 3 TCGA-A8-A079    69 FEMALE   0.214     1.31   -0.327   -0.234  III         
## 4 TCGA-D8-A1XR    56 FEMALE   0.345    -0.211  -0.193    0.124  II          
## 5 TCGA-BH-A0BF    56 FEMALE   0.222     1.91    0.520   -0.312  II          
## 6 TCGA-AO-A1KQ    84 MALE    -0.0819    1.72   -0.0573   0.0430 III         
## # ℹ 8 more variables: Histology <chr>, `ER status` <chr>, `PR status` <chr>,
## #   `HER2 status` <chr>, Surgery_type <chr>, Date_of_Surgery <chr>,
## #   Date_of_Last_Visit <chr>, Patient_Status <chr>

# Get a quick statistical summary of the dataset
summary(breast_cancer)

##   Patient_ID             Age           Gender             Protein1        
##  Length:341         Min.   :29.00   Length:341         Min.   :-2.340900  
##  Class :character   1st Qu.:49.00   Class :character   1st Qu.:-0.358888  
##  Mode  :character   Median :58.00   Mode  :character   Median : 0.006129  
##                     Mean   :58.89                      Mean   :-0.029991  
##                     3rd Qu.:68.00                      3rd Qu.: 0.343598  
##                     Max.   :90.00                      Max.   : 1.593600  
##                     NA's   :7                          NA's   :7          
##     Protein2          Protein3          Protein4         Tumour_Stage      
##  Min.   :-0.9787   Min.   :-1.6274   Min.   :-2.025500   Length:341        
##  1st Qu.: 0.3622   1st Qu.:-0.5137   1st Qu.:-0.377090   Class :character  
##  Median : 0.9928   Median :-0.1732   Median : 0.041768   Mode  :character  
##  Mean   : 0.9469   Mean   :-0.0902   Mean   : 0.009819                     
##  3rd Qu.: 1.6279   3rd Qu.: 0.2784   3rd Qu.: 0.425630                     
##  Max.   : 3.4022   Max.   : 2.1934   Max.   : 1.629900                     
##  NA's   :7         NA's   :7         NA's   :7                             
##   Histology          ER status          PR status         HER2 status       
##  Length:341         Length:341         Length:341         Length:341        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Surgery_type       Date_of_Surgery    Date_of_Last_Visit Patient_Status    
##  Length:341         Length:341         Length:341         Length:341        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##

# Check the dimensions of the dataset (rows and columns)
dim(breast_cancer)

## [1] 341  16

# Check total number of missing values in the dataset
sum(is.na(breast_cancer))

## [1] 142

# Check number of missing values in each column
colSums(is.na(breast_cancer))

##         Patient_ID                Age             Gender           Protein1 
##                  7                  7                  7                  7 
##           Protein2           Protein3           Protein4       Tumour_Stage 
##                  7                  7                  7                  7 
##          Histology          ER status          PR status        HER2 status 
##                  7                  7                  7                  7 
##       Surgery_type    Date_of_Surgery Date_of_Last_Visit     Patient_Status 
##                  7                  7                 24                 20

# Check how many rows have at least one missing value
sum(!complete.cases(breast_cancer))

## [1] 24

# Remove rows with missing values
breast_clean <- na.omit(breast_cancer)

# Check the new dimensions of the cleaned dataset
dim(breast_clean)

## [1] 317  16

# Generate a summary of the cleaned dataset
summary(breast_clean)

##   Patient_ID             Age           Gender             Protein1        
##  Length:317         Min.   :29.00   Length:317         Min.   :-2.144600  
##  Class :character   1st Qu.:49.00   Class :character   1st Qu.:-0.350600  
##  Mode  :character   Median :58.00   Mode  :character   Median : 0.005649  
##                     Mean   :58.73                      Mean   :-0.027232  
##                     3rd Qu.:67.00                      3rd Qu.: 0.336260  
##                     Max.   :90.00                      Max.   : 1.593600  
##     Protein2          Protein3          Protein4         Tumour_Stage      
##  Min.   :-0.9787   Min.   :-1.6274   Min.   :-2.025500   Length:317        
##  1st Qu.: 0.3688   1st Qu.:-0.5314   1st Qu.:-0.382240   Class :character  
##  Median : 0.9971   Median :-0.1930   Median : 0.038522   Mode  :character  
##  Mean   : 0.9496   Mean   :-0.0951   Mean   : 0.006713                     
##  3rd Qu.: 1.6120   3rd Qu.: 0.2512   3rd Qu.: 0.436250                     
##  Max.   : 3.4022   Max.   : 2.1934   Max.   : 1.629900                     
##   Histology          ER status          PR status         HER2 status       
##  Length:317         Length:317         Length:317         Length:317        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Surgery_type       Date_of_Surgery    Date_of_Last_Visit Patient_Status    
##  Length:317         Length:317         Length:317         Length:317        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##

===============================

UNIVARIATE ANALYSIS

===============================

Patient Status

############Patient Status############
# Check the distribution of patient outcomes (e.g., Alive vs Deceased)
table(breast_clean$Patient_Status)

## 
## Alive  Dead 
##   255    62

# Create a bar plot showing the number of patients by their status
ggplot(breast_clean, aes(x = Patient_Status, fill = Patient_Status)) +
  geom_bar() +                                   # Create bars for each category
  theme_minimal() +                              # Use a clean minimal theme
  labs(
    title = "Distribution of Patient Status",    # Add title
    x = "Patient Status",                        # X-axis label
    y = "Count"                                  # Y-axis label
  ) +
  scale_fill_brewer(palette = "Set2")            # Add soft color palette
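The counts above imply a marked class imbalance: roughly 80% of patients are Alive. The short sketch below works through that arithmetic using the printed counts; the variable names are illustrative.

```r
# Counts taken from the table(breast_clean$Patient_Status) output
status_counts <- c(Alive = 255, Dead = 62)

# Share of each class
class_share <- status_counts / sum(status_counts)
print(round(class_share, 3))

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy <- max(class_share)

# Imbalance ratio (majority : minority)
imbalance_ratio <- max(status_counts) / min(status_counts)
```

A model trained on the raw data would therefore need to beat the majority-class baseline of about 80% accuracy to be informative, which is the motivation for rebalancing the target variable later on.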

Age Distribution

###########Age Distribution#############
# Create a frequency table for Age
age_counts <- table(breast_clean$Age)

# Create a histogram to show the age distribution of patients
hist(
  breast_clean$Age,              # Column to plot
  ylim = c(0, 60),               # Set y-axis limit
  col = "skyblue",               # Fill color for bars
  border = "black",              # Outline color for bars
  main = "Age Distribution of Patients",  # Title of the histogram
  xlab = "Age (years)",          # Label for x-axis
  ylab = "Frequency"             # Label for y-axis
)

Gender Distribution of Patients

############Gender##########
# Create a frequency table for Gender
gender_counts <- table(breast_clean$Gender)

# Create the bar plot and store the x positions of the bars
bar_positions <- barplot(
  gender_counts,                       # Data for the bars
  col = c("lightpink", "lightblue"),   # Colors for each gender
  border = "black",                    # Outline color for bars
  ylim = c(0, max(gender_counts) + 50),# Add buffer to top of y-axis
  main = "Gender Distribution of Patients",  # Title of the plot
  xlab = "Gender",                     # Label for x-axis
  ylab = "Number of Patients"          # Label for y-axis
)

# Add text labels showing the counts above each bar
text(
  x = bar_positions,                   # Use bar positions for correct placement
  y = gender_counts + 10,              # Lift labels slightly above bars
  labels = gender_counts,              # Text to display (the counts)
  cex = 0.8,                           # Font size
  col = "black"                        # Text color
)

Distribution of Tumour Stages

###########Tumor Stage###########
# Create a frequency table for Tumour Stage
tumour_counts <- table(breast_clean$Tumour_Stage)

# Create the bar plot and store x-axis positions
bar_positions <- barplot(
  tumour_counts,                          # Data for the bars
  col = c("lightgreen", "gold", "orange"),# Colors for each stage
  border = "black",                       # Outline color for bars
  ylim = c(0, max(tumour_counts) + 20),   # Add buffer to the top of the y-axis
  main = "Distribution of Tumour Stages", # Plot title
  xlab = "Tumour Stage",                  # Label for x-axis
  ylab = "Number of Patients"             # Label for y-axis
)

This bar chart shows the distribution of tumour stages among breast cancer patients. It reveals that Stage II tumours are the most common, followed by Stage III, while Stage I cases are the least frequent. This suggests that many patients in the dataset were diagnosed at more advanced stages, which may have implications for treatment outcomes and survival analysis.

Distribution of Histology Types

###########Histology##########
# Frequency table for Histology
histology_counts <- table(breast_clean$Histology)
histology_counts

## 
##  Infiltrating Ductal Carcinoma Infiltrating Lobular Carcinoma 
##                            224                             81 
##             Mucinous Carcinoma 
##                             12

# Bar plot for Histology distribution
library(ggplot2)

# Convert named vector to data frame
histology_df <- data.frame(
  Histology = names(histology_counts),
  Count = as.numeric(histology_counts)
)

ggplot(histology_df, aes(x = Histology, y = Count)) +
  geom_bar(
    stat = "identity",
    fill = "orchid",
    color = "black"
  ) +
  labs(
    title = "Distribution of Histology Types",
    y = "Number of Patients",
    x = NULL
  ) +
  ylim(0, max(histology_df$Count) + 50) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(
      angle = 45,
      hjust = 1,
      size = 10
    ),
    plot.title = element_text(hjust = 0.5)
  )

This chart shows that Infiltrating (Invasive) Ductal Carcinoma (IDC) is the most common histologic type (224 patients), followed by Infiltrating Lobular Carcinoma (ILC, 81 patients), while Mucinous Carcinoma is the least frequent (12 patients). IDC arises in the milk ducts and is the most common form of invasive breast cancer in general, so its dominance here is expected. ILC is less common and often harder to identify on imaging due to its diffuse growth pattern, while Mucinous Carcinoma is rare, slow-growing, and typically detected at an early stage.

=============================================

BIVARIATE ANALYSIS

=============================================

AGE vs PATIENT STATUS

# Boxplot of Age by Patient Status
boxplot(
  Age ~ Patient_Status,              # Compare Age across Patient Status groups
  data = breast_clean,               # Use the cleaned dataset
  col = c("lightblue", "red"), # Colors for each group
  main = "Distribution of Age by Patient Status",  # Title of the plot
  xlab = "Patient Status",           # Label for the x-axis
  ylab = "Age (years)",              # Label for the y-axis
  notch = FALSE,                     # Disable notches
  border = "black"                   # Outline color for clarity
)

The boxplot shows that patients who are alive tend to be slightly younger than those who are deceased.
Both groups have a similar age range, but the higher median age among the deceased suggests that older patients are more likely to have poorer outcomes, possibly due to age-related factors affecting prognosis.

GENDER vs PATIENT_STATUS

# Bar plot to show the relationship between Gender and Patient Status
ggplot(breast_clean, aes(x = Gender, fill = Patient_Status)) +
  geom_bar(position = "dodge") +   # Side-by-side bars
  labs(
    title = "Gender vs Patient Status",
    x = "Gender",
    y = "Number of Patients"
  ) +
  # Use distinct, professional colors for clarity
  scale_fill_manual(values = c("#6A5ACD", "#FFD700")) +  
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Center and bold title
    axis.text.x = element_text(angle = 15, hjust = 1),      # Gentle tilt for x-axis labels
    legend.title = element_text(face = "bold")              # Bold legend title
  )

The bar plot shows that female patients make up the majority of the dataset, which aligns with the higher prevalence of breast cancer among women.

TUMOUR STAGE vs PATIENT_STATUS

# Bar plot for Tumour Stage vs Patient Status with distinct colors
ggplot(breast_clean, aes(x = Tumour_Stage, fill = Patient_Status)) +
  geom_bar(position = "dodge") +   # Place bars side by side
  labs(
    title = "Tumour Stage vs Patient Status",
    x = "Tumour Stage",
    y = "Number of Patients"
  ) +
  scale_fill_manual(values = c("steelblue", "red")) +  
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),  # Center and bold title
    axis.text.x = element_text(angle = 20, hjust = 1),      # Slight tilt for x-axis labels
    legend.title = element_text(face = "bold")              # Bold legend title
  )

The bar plot shows that Stage II has the largest number of patients and the largest number of deaths in absolute terms. The proportion of deaths is slightly higher in Stage II and Stage III than in Stage I, suggesting worse outcomes with more advanced stages.
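Because the stages differ in size, absolute counts and proportions can tell different stories. One way to obtain the proportion of deaths within each stage is a row-wise prop.table(), sketched here on made-up counts that are not the study's figures.

```r
# Made-up counts for illustration only; not the study's data
toy <- data.frame(
  Tumour_Stage   = rep(c("I", "II", "III"), times = c(10, 20, 10)),
  Patient_Status = c(rep(c("Alive", "Dead"), times = c(9, 1)),
                     rep(c("Alive", "Dead"), times = c(16, 4)),
                     rep(c("Alive", "Dead"), times = c(7, 3)))
)

# Cross-tabulate stage by status
counts <- table(toy$Tumour_Stage, toy$Patient_Status)

# margin = 1 converts each row (stage) to proportions that sum to 1
death_rate <- prop.table(counts, margin = 1)[, "Dead"]
print(round(death_rate, 2))
```

With the real data, the same two calls would be applied to breast_clean$Tumour_Stage and breast_clean$Patient_Status.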

HISTOLOGY vs PATIENT STATUS

# A bar plot for Histologic Type vs Patient Status

ggplot(breast_clean, aes(x = Histology, fill = Patient_Status)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Histologic Type vs Patient Status",
    x = "Histologic Type",
    y = "Number of Patients",
    fill = "Patient Status"
  ) +
  scale_fill_manual(values = c("#1E90FF", "#FF6347")) +  # Blue & Tomato red for clear contrast
  ylim(0, 200) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

This chart compares histologic type with patient status. The majority of patients with Infiltrating Ductal Carcinoma were alive, though it also had the highest number of deaths, reflecting its high overall occurrence. Infiltrating Lobular Carcinoma showed fewer cases, with most patients surviving. Mucinous Carcinoma had the lowest frequency and the fewest deaths, consistent with its typically less aggressive and slow-growing nature. Overall, survivors outnumber deaths in every histologic type, and deaths are concentrated among patients with Infiltrating Ductal Carcinoma because it is the predominant type.

Checking Variable Uniformity and Feature Reduction

To ensure all predictors contribute meaningful information, variables with no variation were identified and excluded. The ER status and PR status columns contained only “Positive” values, so they carry no discriminatory power for classification. They were therefore removed from the dataset prior to modelling.

# Check unique values in ER_status and PR_status to confirm uniformity
unique(breast_clean$`ER status`)

## [1] "Positive"

unique(breast_clean$`PR status`)

## [1] "Positive"

# since both are only "Positive", drop them from the dataset
breast_clean <- subset(breast_clean, select = -c(`ER status`, `PR status`))

# Confirm removal
names(breast_clean)

##  [1] "Patient_ID"         "Age"                "Gender"            
##  [4] "Protein1"           "Protein2"           "Protein3"          
##  [7] "Protein4"           "Tumour_Stage"       "Histology"         
## [10] "HER2 status"        "Surgery_type"       "Date_of_Surgery"   
## [13] "Date_of_Last_Visit" "Patient_Status"

Encoding Target Variable and Preparing Predictors for Modeling

# Encode Patient_Status numerically and label both classes
breast_clean$Patient_Status <- ifelse(breast_clean$Patient_Status == "Alive", 1, 0)

# Convert to factor with descriptive labels
breast_clean$Patient_Status <- factor(breast_clean$Patient_Status,
                                      levels = c(0, 1),
                                      labels = c("Dead", "Alive"))


names(breast_clean)[names(breast_clean) == "HER2 status"] <- "HER2_status"

# Convert categorical variables to factors
breast_clean$Tumour_Stage <- as.factor(breast_clean$Tumour_Stage)
breast_clean$Histology <- as.factor(breast_clean$Histology)
breast_clean$HER2_status <- as.factor(breast_clean$HER2_status)
breast_clean$Surgery_type <- as.factor(breast_clean$Surgery_type)

Feature Selection and Handling Class Imbalance

# Select relevant predictors for the model
breast_model_data <- breast_clean %>%
  select(-Patient_ID, -Date_of_Surgery, -Date_of_Last_Visit)


# Perform random oversampling on the entire dataset based on the target variable
set.seed(123)
oversampled_data <- ovun.sample(Patient_Status ~ ., 
                                data = breast_model_data, 
                                method = "over", 
                                N = max(table(breast_model_data$Patient_Status)) * 2)$data

table(oversampled_data$Patient_Status)

## 
## Alive  Dead 
##   255   255

prop.table(table(oversampled_data$Patient_Status))

## 
## Alive  Dead 
##   0.5   0.5

After cleaning and encoding, non-predictive variables (Patient_ID, Date_of_Surgery, Date_of_Last_Visit) were removed to retain key predictors. To correct the class imbalance in Patient_Status, random oversampling was performed using the ovun.sample() function, ensuring both classes were equally represented and reducing model bias.
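To make the mechanism concrete, random oversampling can be sketched in base R: minority-class rows are drawn with replacement and appended until both classes have the same size. The example uses made-up data and is not the internal implementation of ovun.sample().

```r
set.seed(123)

# Made-up imbalanced data for illustration
toy <- data.frame(
  x      = rnorm(50),
  status = factor(rep(c("Alive", "Dead"), times = c(40, 10)))
)

# How many extra minority rows are needed to match the majority class
minority_rows <- which(toy$status == "Dead")
n_needed <- sum(toy$status == "Alive") - length(minority_rows)

# Draw extra minority rows with replacement and append them
extra_rows <- sample(minority_rows, size = n_needed, replace = TRUE)
toy_over <- rbind(toy, toy[extra_rows, ])

print(table(toy_over$status))

# Every added row is an exact copy of an existing Dead row
n_duplicates <- sum(duplicated(toy_over))
```

The balanced data contains no new information about the minority class, only repeated copies of existing rows, which matters for how the data is later split and evaluated.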

MODEL TRAINING & TESTING

# Set a random seed for reproducibility
set.seed(123)

# ---- Create a partition: 80% training, 20% testing ----
train_index <- createDataPartition(oversampled_data$Patient_Status, p = 0.8, list = FALSE)

# ---- Split the data ----
train_data <- oversampled_data[train_index, ]   # 80% training data
test_data  <- oversampled_data[-train_index, ]  # 20% testing data

# ---- Check the dimensions of each ----
dim(train_data)

## [1] 408  11

dim(test_data)

## [1] 102  11

# ---- Check class distribution ----
prop.table(table(train_data$Patient_Status))

## 
## Alive  Dead 
##   0.5   0.5

prop.table(table(test_data$Patient_Status))

## 
## Alive  Dead 
##   0.5   0.5

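The balanced dataset was split into 80% training and 20% testing subsets using createDataPartition(), which samples within each class so that both subsets keep the 50/50 balance; set.seed(123) makes the split reproducible. The model learns patterns from the training data and is evaluated on the held-out portion. Because oversampling was applied before the split, copies of the same Dead patient can appear in both subsets, so the test-set performance reported below is likely to be optimistic.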
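An alternative ordering, sketched below under stated assumptions, splits the original rows first and oversamples only the training portion, so that no copy of a test patient can end up in the training set. The data, the id column and the base-R split are illustrative; this is not the procedure used in this study.

```r
set.seed(123)

# Made-up imbalanced data with a patient identifier
toy <- data.frame(
  id     = 1:100,
  x      = rnorm(100),
  status = factor(rep(c("Alive", "Dead"), times = c(80, 20)))
)

# 1. Stratified 80/20 split on the original (not yet oversampled) rows
test_ids <- unlist(lapply(
  split(toy$id, toy$status),
  function(ids) sample(ids, size = round(0.2 * length(ids)))
))
test_set  <- toy[toy$id %in% test_ids, ]
train_set <- toy[!toy$id %in% test_ids, ]

# 2. Oversample the minority class inside the training set only
dead_rows <- which(train_set$status == "Dead")
n_extra   <- sum(train_set$status == "Alive") - length(dead_rows)
train_bal <- rbind(train_set,
                   train_set[sample(dead_rows, n_extra, replace = TRUE), ])

# 3. Check that no patient appears in both sets
overlap <- intersect(train_bal$id, test_set$id)
print(length(overlap))
```

With this ordering the test set keeps the original class proportions, so metrics such as sensitivity and specificity are more informative than raw accuracy.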

RANDOM FOREST MODEL

# convert Patient_Status to a factor
train_data$Patient_Status <- as.factor(train_data$Patient_Status)
test_data$Patient_Status  <- as.factor(test_data$Patient_Status)

# Base predictors
predictors <- c("Age", "Tumour_Stage", "Histology", "HER2_status", "Surgery_type",
                "Protein1", "Protein2", "Protein3", "Protein4")

formula <- reformulate(predictors, response = "Patient_Status")



# Train the Random Forest model
rf_model <- randomForest(
  formula,
  data = train_data,
  ntree = 600,
  mtry = 3,
  importance = TRUE,
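  seed = 123          # Not a documented randomForest() argument and appears to be ignored; reproducibility relies on the earlier set.seed(123)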
)

print(rf_model)

## 
## Call:
##  randomForest(formula = formula, data = train_data, ntree = 600,      mtry = 3, importance = TRUE, seed = 123) 
##                Type of random forest: classification
##                      Number of trees: 600
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 5.15%
## Confusion matrix:
##       Alive Dead class.error
## Alive   188   16  0.07843137
## Dead      5  199  0.02450980

The Random Forest model performed well on the oversampled training data, with an out-of-bag (OOB) error rate of 5.15%, meaning about 95% of OOB predictions were correct. The class error was lower for Dead (2.5%) than for Alive (7.8%), so the model made fewer mistakes on patients who did not survive. Because the Dead class in the oversampled data consists largely of duplicated rows, these figures are best read as an optimistic estimate rather than as evidence of clinical usefulness on new patients.
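The OOB figures can be reproduced by hand from the printed confusion matrix, where rows are actual classes and columns are predicted classes. The sketch below only re-derives the numbers already shown in the output.

```r
# OOB confusion matrix from the printed output (rows = actual, columns = predicted)
oob <- matrix(c(188,  16,
                  5, 199),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual    = c("Alive", "Dead"),
                              predicted = c("Alive", "Dead")))

# Overall OOB error: misclassified / total = (16 + 5) / 408
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: misclassified in a row / row total
class_error <- 1 - diag(oob) / rowSums(oob)

print(round(100 * oob_error, 2))
print(round(class_error, 4))
```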

# Predict on test set
rf_pred <- predict(rf_model, newdata = test_data)

# Confusion matrix
conf_matrix_rf <- confusionMatrix(rf_pred, test_data$Patient_Status)
print(conf_matrix_rf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive Dead
##      Alive    46    0
##      Dead      5   51
##                                           
##                Accuracy : 0.951           
##                  95% CI : (0.8893, 0.9839)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.902           
##                                           
##  Mcnemar's Test P-Value : 0.07364         
##                                           
##             Sensitivity : 0.9020          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9107          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4510          
##    Detection Prevalence : 0.4510          
##       Balanced Accuracy : 0.9510          
##                                           
##        'Positive' Class : Alive           
##

The model achieved 95% accuracy on the test set, identifying all 51 non-survivors (specificity 100%) and 46 of 51 survivors (sensitivity 90%), with 5 alive patients misclassified as Dead. Note that caret treats Alive as the positive class here, so sensitivity refers to survivors and specificity to non-survivors.
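As a worked check, the headline metrics follow directly from the four cells of the test-set confusion matrix, with Alive as the positive class. The variable names below are illustrative.

```r
# Test-set counts from the confusion matrix above (positive class = Alive)
tp <- 46   # actual Alive, predicted Alive
fn <- 5    # actual Alive, predicted Dead
fp <- 0    # actual Dead,  predicted Alive
tn <- 51   # actual Dead,  predicted Dead

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # 97 / 102
sensitivity <- tp / (tp + fn)                    # 46 / 51
specificity <- tn / (tn + fp)                    # 51 / 51
ppv         <- tp / (tp + fp)                    # positive predictive value
npv         <- tn / (tn + fn)                    # negative predictive value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv), 4))
```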

cm <- as.data.frame(conf_matrix_rf$table)

# Rename columns
colnames(cm) <- c("Predicted", "Actual", "Freq")

# Plot confusion matrix 
ggplot(cm, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "black", linewidth = 1.2) +
  geom_text(aes(label = Freq), color = "white", size = 6, fontface = "bold") +
  scale_fill_gradient(low = "#FF9999", high = "#3366CC") +
  labs(
    title = "Confusion Matrix - Random Forest Model",
    x = "Actual Class",
    y = "Predicted Class",
    fill = "Count"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text = element_text(color = "black", face = "bold"),
    panel.grid = element_blank(),
    legend.position = "right"
  )

The plot shows 46 survivors correctly predicted as Alive and 51 non-survivors correctly predicted as Dead, with 5 survivors misclassified as Dead. This yields 95% accuracy, sensitivity ≈ 90%, and specificity = 100%. On this test set the model errs on the conservative side: it flags a few survivors as high-risk but misses no non-survivors, which is the safer direction of error for risk detection.

# --- Plot Feature Importance ---
varImpPlot(
  rf_model,
  type = 1,              # 1 = Mean Decrease Accuracy
  main = "Feature Importance",
  pch = 19,
  col = "darkblue",
  cex = 0.9
)

The Random Forest model shows that Protein1, Protein4, and Protein2 are the strongest predictors of patient survival, with Age having moderate influence. Traditional clinical features (Surgery type, Tumour Stage, Histology, and HER2 status) contribute minimally. This suggests that protein biomarkers account for most of the model’s predictive power in distinguishing survivors from non-survivors.
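The ranking shown in the plot can also be extracted as a table with importance(), which is convenient for reporting exact values. The sketch below uses the built-in iris data as a stand-in, so the variables and numbers are not those of this study.

```r
library(randomForest)

set.seed(123)

# iris is used as a stand-in dataset; importance = TRUE is required
# for permutation-based (Mean Decrease Accuracy) importance
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

# type = 1 selects Mean Decrease Accuracy, matching varImpPlot(type = 1)
imp <- importance(rf_demo, type = 1)

# Sort predictors from most to least important
imp_ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(round(imp_ranked, 2))
```

Applied to the study's model, the same call would be importance(rf_model, type = 1).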

DECISION TREE MODEL

# Base predictors
predictors <- c("Age", "Tumour_Stage", "Histology", "HER2_status", "Surgery_type",
                "Protein1", "Protein2", "Protein3", "Protein4")

formula <- reformulate(predictors, response = "Patient_Status")

# --- Build the Decision Tree model with limited depth ---
# method = "class" fits a classification tree (not regression);
# rpart and rpart.plot were already loaded at the top of the script
tree_model <- rpart(
  formula, 
  data = train_data, 
  method = "class",
  control = rpart.control(
    maxdepth = 3,    # Limit tree depth to 3 levels
    minsplit = 10,   # Minimum observations required to attempt a split
    cp = 0.01        # Complexity parameter to avoid overfitting
  )
)

# --- Plot the tree ---
#rpart.plot(
  #tree_model,
  #type = 2,                    # Shows split variable names
  #extra = 104,                 # Adds predicted class, probability, and % of observations
  #fallen.leaves = TRUE,        # Neatly aligns leaves at bottom
  #cex = 0.7,                   # Adjusts font size for readability
  #box.palette = "GnBu",        # Softer blue-green color palette
  #shadow.col = "gray",         # Adds subtle shadow for contrast
  #branch.lty = 3,              # Dashed branch lines
  #nn = TRUE,                   # Displays node numbers
  #main = "Decision Tree - Breast Cancer Survival (Depth-Controlled View)"
#)

# --- Make Predictions on Training and Test Data ---
# type = "class" → returns the predicted class (e.g., Alive or Dead)
train_pred_tree <- predict(tree_model, train_data, type = "class")
test_pred_tree  <- predict(tree_model, test_data,  type = "class")


# --- Generate Confusion Matrices ---
conf_matrix_train_tree <- confusionMatrix(
  factor(train_pred_tree),
  factor(train_data$Patient_Status)
)

conf_matrix_test_tree <- confusionMatrix(
  factor(test_pred_tree),
  factor(test_data$Patient_Status)
)

# --- Print Results ---
cat("Confusion Matrix - Train Data\n")

## Confusion Matrix - Train Data

print(conf_matrix_train_tree)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive Dead
##      Alive   172   91
##      Dead     32  113
##                                           
##                Accuracy : 0.6985          
##                  95% CI : (0.6514, 0.7427)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 3.273e-16       
##                                           
##                   Kappa : 0.3971          
##                                           
##  Mcnemar's Test P-Value : 1.698e-07       
##                                           
##             Sensitivity : 0.8431          
##             Specificity : 0.5539          
##          Pos Pred Value : 0.6540          
##          Neg Pred Value : 0.7793          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4216          
##    Detection Prevalence : 0.6446          
##       Balanced Accuracy : 0.6985          
##                                           
##        'Positive' Class : Alive           
##

cat("\nConfusion Matrix - Test Data\n")

## 
## Confusion Matrix - Test Data

print(conf_matrix_test_tree)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive Dead
##      Alive    32   23
##      Dead     19   28
##                                           
##                Accuracy : 0.5882          
##                  95% CI : (0.4864, 0.6848)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.04592         
##                                           
##                   Kappa : 0.1765          
##                                           
##  Mcnemar's Test P-Value : 0.64343         
##                                           
##             Sensitivity : 0.6275          
##             Specificity : 0.5490          
##          Pos Pred Value : 0.5818          
##          Neg Pred Value : 0.5957          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3137          
##    Detection Prevalence : 0.5392          
##       Balanced Accuracy : 0.5882          
##                                           
##        'Positive' Class : Alive           
##

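The model performs modestly on the test set, with 59% accuracy, down from 70% on the training data. It detects survivors reasonably well (sensitivity 63%) but has limited ability to identify non-survivors (specificity 55%). The 95% confidence interval for accuracy (0.49 to 0.68) includes the 50% no-information rate, so the tree is only marginally better than chance.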
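
As a rough check on these figures, the test accuracy, its exact binomial confidence interval, and the train-test gap can be recomputed from the printed cell counts alone. This is a minimal base R sketch; the variable names are illustrative, and the interval from `binom.test()` should closely match the one caret reports.

```r
# Counts from the test confusion matrix printed above
correct <- 32 + 28
total   <- 32 + 23 + 19 + 28

test_acc <- correct / total

# Exact (Clopper-Pearson) interval for accuracy, and a one-sided test
# against the 0.5 no-information rate
acc_ci   <- binom.test(correct, total)$conf.int
acc_test <- binom.test(correct, total, p = 0.5, alternative = "greater")

# Training accuracy from the training matrix, for the generalisation gap
train_acc <- (172 + 113) / (172 + 91 + 32 + 113)
gap <- train_acc - test_acc

print(round(c(train = train_acc, test = test_acc, gap = gap), 4))
print(round(acc_ci, 4))
print(acc_test$p.value)
```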

# --- Check Feature Importance ---
print(tree_model$variable.importance)

##     Protein1     Protein4 Surgery_type     Protein3          Age     Protein2 
##   14.8067564   14.2067583   11.1412137    8.7072081    7.6605589    7.4109578 
## Tumour_Stage  HER2_status    Histology 
##    1.6240075    1.1381540    0.3961879

# Extract importance values
importance_values <- tree_model$variable.importance

# Create barplot with extended y-limit
bp <- barplot(
  importance_values,
  main = "Feature Importance (Decision Tree)",
  xlab = "Features",
  ylab = "Importance Score",
  ylim = c(0, max(importance_values) + 5),   # Increase y-axis limit
  col = "lightblue",
  names.arg = FALSE                          # Remove labels first
)

# Add slanted x-axis labels
text(
  x = bp,
  y = par("usr")[3] - 0.5,                    # Position below x-axis
  labels = names(importance_values),
  srt = 45,                                   # Slant labels 45 degrees
  adj = 1,
  xpd = TRUE,                                 # Allow drawing outside plot area
  cex = 0.8
)

In this plot, the decision tree ranks Protein1 and Protein4 as the strongest predictors of survival, followed by Surgery type. Protein3, Age, and Protein2 make smaller contributions, while Tumour Stage, HER2 status, and Histology contribute very little. The prominence of Protein1 and Protein4 may reflect tumour aggressiveness and hormone receptor pathways, and it points to their potential as prognostic biomarkers in breast cancer.
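
The raw importance scores are easier to compare when expressed as shares of the total. The sketch below rescales the values printed above to percentages; the vector is typed in from that output so the block runs on its own, and `importance_share` and `ranked` are illustrative names.

```r
# Variable importance values copied from the printed output above
importance_values <- c(
  Protein1 = 14.8067564, Protein4 = 14.2067583, Surgery_type = 11.1412137,
  Protein3 = 8.7072081,  Age = 7.6605589,       Protein2 = 7.4109578,
  Tumour_Stage = 1.6240075, HER2_status = 1.1381540, Histology = 0.3961879
)

# Rescale so the shares sum to 100, then rank from largest to smallest
importance_share <- 100 * importance_values / sum(importance_values)
ranked <- sort(importance_share, decreasing = TRUE)

print(round(ranked, 1))
```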

Comparison of the Model Accuracies

# Test-set accuracies taken from the confusion matrices above
model_accuracy <- data.frame(
  Model = c("Decision Tree", "Random Forest"),
  Accuracy = c(0.5882, 0.951)
)

# Base R barplot; keep the bar midpoints so the labels line up with the bars
bp <- barplot(
  model_accuracy$Accuracy,
  names.arg = model_accuracy$Model,
  col = c("tomato", "steelblue"),
  ylim = c(0, 1),
  main = "Comparison of Model Accuracies",
  ylab = "Accuracy",
  cex.names = 1.1
)

# Add accuracy labels above each bar
text(
  x = bp,
  y = model_accuracy$Accuracy + 0.03,
  labels = round(model_accuracy$Accuracy, 2)
)

# Save the trained model
saveRDS(rf_model, file = "random_forest_model.rds")
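
A saved model can be restored with `readRDS()` and used for prediction without retraining, which is how an application would typically load it. The sketch below shows the round trip with a stand-in model: it fits a small rpart tree on the `kyphosis` data that ships with rpart, because `rf_model` and the study data are not available in a standalone block. The temporary file path and all object names are hypothetical.

```r
library(rpart)

# Stand-in model; in the report this would be rf_model
demo_model <- rpart(
  Kyphosis ~ Age + Number + Start,
  data = kyphosis,
  method = "class"
)

# Save to a temporary file (hypothetical location) and read it back
model_path <- tempfile(fileext = ".rds")
saveRDS(demo_model, file = model_path)
restored_model <- readRDS(model_path)

# The restored object should give the same predictions as the original
pred_before <- predict(demo_model, kyphosis, type = "class")
pred_after  <- predict(restored_model, kyphosis, type = "class")
same_predictions <- identical(pred_before, pred_after)

print(same_predictions)
```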

Conclusion:

This study showed that Random Forest was the more accurate of the two models for predicting breast cancer survival, reaching about 95% test accuracy against 59% for the Decision Tree. Among all factors, Protein1 was the strongest predictor of patient outcomes.

The model was deployed as a Shiny app, making survival predictions easy and accessible.

Recommendation:

Protein1 emerged as an important predictive feature in the model and may warrant closer monitoring in breast cancer patients. While specific clinical cutoff values for Protein1–Protein4 are not currently established, their relative expression patterns provided useful prognostic signals within the model. Tools like this application can therefore support faster, data-informed decision-making when used alongside established clinical and pathological indicators.

Predictive Modelling of Breast Cancer Survival Outcomes Using Machine Learning Models

Joy Emmanuel

2025-12-17
