Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide. Early diagnosis and effective treatment strategies are essential for improving patient survival rates. However, predicting individual patient outcomes remains a major challenge due to the complex interplay of clinical and biological factors.
This project explores how machine learning models can be used to predict survival status in breast cancer patients using a dataset containing demographic, histopathological, and treatment-related variables. The goal is to identify the most influential predictors of survival and to demonstrate how models such as Random Forest and Decision Tree can aid clinical decision-making and help improve patient outcomes.
The aim of this study is to predict survival outcomes in breast cancer patients and identify factors that influence prognosis.
Which clinical and molecular factors most strongly influence survival outcomes in breast cancer patients?
Which predictive model provides the highest accuracy and reliability for predicting breast cancer survival outcomes?
The objectives of this study are to:
Identify and rank the clinical and molecular factors that most significantly influence survival outcomes in breast cancer patients.
Evaluate and compare two predictive models to determine the more accurate and reliable model for predicting breast cancer survival outcomes.
This study uses a quantitative, predictive research design based on supervised machine learning. The design is appropriate because the goal of the study is to predict breast cancer survival outcomes and identify the factors that influence prognosis.
Data Source
The dataset used in this study was obtained from Kaggle, a public data repository. It contains anonymized clinical variables, molecular biomarkers, and patient survival outcomes.
Data Preprocessing
The dataset was imported into R and prepared for analysis. The preprocessing steps included:
Removing missing or inconsistent entries
Exploratory data analysis (univariate and bivariate)
Encoding categorical variables
Balancing the imbalanced target variable using random oversampling
Splitting the data into training and testing sets
These steps ensured that the data was clean and suitable for model development.
Methods Aligned With Research Objectives
Objective 1:
Identify and rank the clinical and molecular factors that most significantly influence survival outcomes in breast cancer patients.
Method:
Feature importance analysis was performed using the Random Forest mean decrease in accuracy and the Decision Tree variable importance scores. These measures were used to rank the most influential predictors of survival.
Objective 2:
Evaluate and compare two predictive models to determine the more accurate and reliable model for predicting breast cancer survival outcomes.
Method:
Two machine learning algorithms, Random Forest and Decision Tree, were trained on the dataset. Each model was evaluated using accuracy, sensitivity, specificity, and the confusion matrix to identify the better-performing model.
Model Selection and Interpretation
The model with the highest performance on the test dataset was selected as the best predictive model. Feature importance plots and other visualizations were used to interpret how the model made predictions and which variables contributed most to survival outcomes.
Deployment of the Predictive Model
The selected model was deployed as an interactive Shiny web application to make predictions easily accessible.
The trained model was saved as an .rds file.
A Shiny interface was created to allow users to input patient features.
The server logic processed these inputs and generated survival predictions in real time.
The application was deployed on shinyapps.io, allowing users to access it from any device with internet access.
This deployment transformed the machine learning model into a practical decision-support tool.
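The save, load, and predict steps listed above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the deployed application: the data frame `toy` holds made-up values, a logistic regression stands in for the Random Forest, and `predict_status()` is a hypothetical helper that plays the role of the Shiny server logic.

```r
# Toy stand-in for the training data (hypothetical values, not the study data)
set.seed(1)
toy <- data.frame(
  Age = round(runif(60, 30, 85)),
  Tumour_Stage = factor(sample(c("I", "II", "III"), 60, replace = TRUE)),
  Protein1 = rnorm(60)
)
toy$Patient_Status <- factor(sample(c("Alive", "Dead"), 60, replace = TRUE))

# Stand-in model (assumption): logistic regression instead of a random forest
fit <- glm(Patient_Status ~ Age + Tumour_Stage + Protein1,
           data = toy, family = binomial)

# Save the fitted model to an .rds file and load it back, as an app would at start-up
model_file <- tempfile(fileext = ".rds")
saveRDS(fit, model_file)
loaded <- readRDS(model_file)

# Hypothetical helper: turn user inputs into a one-row data frame and predict
predict_status <- function(model, age, stage, protein1) {
  new_patient <- data.frame(
    Age = age,
    # Reuse the factor levels seen in training so predict() accepts the input
    Tumour_Stage = factor(stage, levels = model$xlevels$Tumour_Stage),
    Protein1 = protein1
  )
  # With a factor response, glm models the probability of the second level ("Dead")
  p_dead <- predict(model, newdata = new_patient, type = "response")
  unname(ifelse(p_dead >= 0.5, "Dead", "Alive"))
}

result <- predict_status(loaded, age = 62, stage = "II", protein1 = 0.1)
result
```

The main design point is that user inputs must be coerced to the factor levels used in training before calling predict(); otherwise a deployed model may reject or misread new observations.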
Tools and Software
The study used:
R and RStudio for data analysis and model development
Libraries such as caret, randomForest, gbm, e1071, and shiny
shinyapps.io for deployment
The dataset comprises clinical and molecular information on breast cancer patients, aimed at predicting survival outcomes.
# Load required libraries
library(readr) # For fast and easy reading of CSV files
library(tidyverse) # For data manipulation and visualization tools
library(ggplot2)
library(ROSE)
library(caret)
library(broom)
library(reshape2)
library(randomForest)
library(rpart) # For building the decision tree model
library(rpart.plot) # For visualizing the decision tree
library(plotly) # For interactive visualization
# Import the dataset
breast_cancer <- read_csv("C:\\Users\\USER\\Documents\\R\\Breast cancer data.csv") # Read CSV file into R
## Rows: 341 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Patient_ID, Gender, Tumour_Stage, Histology, ER status, PR status,...
## dbl (5): Age, Protein1, Protein2, Protein3, Protein4
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(breast_cancer)
## # A tibble: 6 × 16
## Patient_ID Age Gender Protein1 Protein2 Protein3 Protein4 Tumour_Stage
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 TCGA-D8-A1XD 36 FEMALE 0.0804 0.426 0.547 0.274 III
## 2 TCGA-EW-A1OX 43 FEMALE -0.420 0.578 0.614 -0.0315 II
## 3 TCGA-A8-A079 69 FEMALE 0.214 1.31 -0.327 -0.234 III
## 4 TCGA-D8-A1XR 56 FEMALE 0.345 -0.211 -0.193 0.124 II
## 5 TCGA-BH-A0BF 56 FEMALE 0.222 1.91 0.520 -0.312 II
## 6 TCGA-AO-A1KQ 84 MALE -0.0819 1.72 -0.0573 0.0430 III
## # ℹ 8 more variables: Histology <chr>, `ER status` <chr>, `PR status` <chr>,
## # `HER2 status` <chr>, Surgery_type <chr>, Date_of_Surgery <chr>,
## # Date_of_Last_Visit <chr>, Patient_Status <chr>
# Get a quick statistical summary of the dataset
summary(breast_cancer)
## Patient_ID Age Gender Protein1
## Length:341 Min. :29.00 Length:341 Min. :-2.340900
## Class :character 1st Qu.:49.00 Class :character 1st Qu.:-0.358888
## Mode :character Median :58.00 Mode :character Median : 0.006129
## Mean :58.89 Mean :-0.029991
## 3rd Qu.:68.00 3rd Qu.: 0.343598
## Max. :90.00 Max. : 1.593600
## NA's :7 NA's :7
## Protein2 Protein3 Protein4 Tumour_Stage
## Min. :-0.9787 Min. :-1.6274 Min. :-2.025500 Length:341
## 1st Qu.: 0.3622 1st Qu.:-0.5137 1st Qu.:-0.377090 Class :character
## Median : 0.9928 Median :-0.1732 Median : 0.041768 Mode :character
## Mean : 0.9469 Mean :-0.0902 Mean : 0.009819
## 3rd Qu.: 1.6279 3rd Qu.: 0.2784 3rd Qu.: 0.425630
## Max. : 3.4022 Max. : 2.1934 Max. : 1.629900
## NA's :7 NA's :7 NA's :7
## Histology ER status PR status HER2 status
## Length:341 Length:341 Length:341 Length:341
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Surgery_type Date_of_Surgery Date_of_Last_Visit Patient_Status
## Length:341 Length:341 Length:341 Length:341
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
# Check the dimensions of the dataset (rows and columns)
dim(breast_cancer)
## [1] 341 16
# Check total number of missing values in the dataset
sum(is.na(breast_cancer))
## [1] 142
# Check number of missing values in each column
colSums(is.na(breast_cancer))
## Patient_ID Age Gender Protein1
## 7 7 7 7
## Protein2 Protein3 Protein4 Tumour_Stage
## 7 7 7 7
## Histology ER status PR status HER2 status
## 7 7 7 7
## Surgery_type Date_of_Surgery Date_of_Last_Visit Patient_Status
## 7 7 24 20
# Check how many rows have at least one missing value
sum(!complete.cases(breast_cancer))
## [1] 24
# Remove rows with missing values
breast_clean <- na.omit(breast_cancer)
# Check the new dimensions of the cleaned dataset
dim(breast_clean)
## [1] 317 16
# Generate a summary of the cleaned dataset
summary(breast_clean)
## Patient_ID Age Gender Protein1
## Length:317 Min. :29.00 Length:317 Min. :-2.144600
## Class :character 1st Qu.:49.00 Class :character 1st Qu.:-0.350600
## Mode :character Median :58.00 Mode :character Median : 0.005649
## Mean :58.73 Mean :-0.027232
## 3rd Qu.:67.00 3rd Qu.: 0.336260
## Max. :90.00 Max. : 1.593600
## Protein2 Protein3 Protein4 Tumour_Stage
## Min. :-0.9787 Min. :-1.6274 Min. :-2.025500 Length:317
## 1st Qu.: 0.3688 1st Qu.:-0.5314 1st Qu.:-0.382240 Class :character
## Median : 0.9971 Median :-0.1930 Median : 0.038522 Mode :character
## Mean : 0.9496 Mean :-0.0951 Mean : 0.006713
## 3rd Qu.: 1.6120 3rd Qu.: 0.2512 3rd Qu.: 0.436250
## Max. : 3.4022 Max. : 2.1934 Max. : 1.629900
## Histology ER status PR status HER2 status
## Length:317 Length:317 Length:317 Length:317
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Surgery_type Date_of_Surgery Date_of_Last_Visit Patient_Status
## Length:317 Length:317 Length:317 Length:317
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
############Patient Status############
# Check the distribution of patient outcomes (e.g., Alive vs Deceased)
table(breast_clean$Patient_Status)
##
## Alive Dead
## 255 62
# Create a bar plot showing the number of patients by their status
ggplot(breast_clean, aes(x = Patient_Status, fill = Patient_Status)) +
geom_bar() + # Create bars for each category
theme_minimal() + # Use a clean minimal theme
labs(
title = "Distribution of Patient Status", # Add title
x = "Patient Status", # X-axis label
y = "Count" # Y-axis label
) +
scale_fill_brewer(palette = "Set2") # Add soft color palette
###########Age Distribution#############
# Create a frequency table for Age
age_counts <- table(breast_clean$Age)
# Create a histogram to show the age distribution of patients
hist(
breast_clean$Age, # Column to plot
ylim = c(0, 60), # Set y-axis limit
col = "skyblue", # Fill color for bars
border = "black", # Outline color for bars
main = "Age Distribution of Patients", # Title of the histogram
xlab = "Age (years)", # Label for x-axis
ylab = "Frequency" # Label for y-axis
)
############Gender##########
# Create a frequency table for Gender
gender_counts <- table(breast_clean$Gender)
# Create the bar plot and store the x positions of the bars
bar_positions <- barplot(
gender_counts, # Data for the bars
col = c("lightpink", "lightblue"), # Colors for each gender
border = "black", # Outline color for bars
ylim = c(0, max(gender_counts) + 50),# Add buffer to top of y-axis
main = "Gender Distribution of Patients", # Title of the plot
xlab = "Gender", # Label for x-axis
ylab = "Number of Patients" # Label for y-axis
)
# Add text labels showing the counts above each bar
text(
x = bar_positions, # Use bar positions for correct placement
y = gender_counts + 10, # Lift labels slightly above bars
labels = gender_counts, # Text to display (the counts)
cex = 0.8, # Font size
col = "black" # Text color
)
###########Tumor Stage###########
# Create a frequency table for Tumour Stage
tumour_counts <- table(breast_clean$Tumour_Stage)
# Create the bar plot and store x-axis positions
bar_positions <- barplot(
tumour_counts, # Data for the bars
col = c("lightgreen", "gold", "orange"),# Colors for each stage
border = "black", # Outline color for bars
ylim = c(0, max(tumour_counts) + 20), # Add buffer to the top of the y-axis
main = "Distribution of Tumour Stages", # Plot title
xlab = "Tumour Stage", # Label for x-axis
ylab = "Number of Patients" # Label for y-axis
)
This bar chart shows the distribution of tumour stages among breast cancer patients. It reveals that Stage II tumours are the most common, followed by Stage III, while Stage I cases are the least frequent. This suggests that many patients in the dataset were diagnosed at more advanced stages, which may have implications for treatment outcomes and survival analysis.
###########Histology##########
# Frequency table for Histology
histology_counts <- table(breast_clean$Histology)
histology_counts
##
## Infiltrating Ductal Carcinoma Infiltrating Lobular Carcinoma
## 224 81
## Mucinous Carcinoma
## 12
# Bar plot for Histology distribution (ggplot2 is already loaded above)
# Convert named vector to data frame
histology_df <- data.frame(
Histology = names(histology_counts),
Count = as.numeric(histology_counts)
)
ggplot(histology_df, aes(x = Histology, y = Count)) +
geom_bar(
stat = "identity",
fill = "orchid",
color = "black"
) +
labs(
title = "Distribution of Histology Types",
y = "Number of Patients",
x = NULL
) +
ylim(0, max(histology_df$Count) + 50) +
theme_minimal() +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 1,
size = 10
),
plot.title = element_text(hjust = 0.5)
)
This chart shows that Infiltrating (Invasive) Ductal Carcinoma (IDC) is the most common histologic type with 224 patients, followed by Infiltrating Lobular Carcinoma (ILC) with 81, while Mucinous Carcinoma is the least frequent with 12. IDC arises in the milk ducts and is the most commonly diagnosed type of breast cancer, which is consistent with its dominance here. ILC is less common and is often harder to identify on imaging because of its diffuse growth pattern, while Mucinous Carcinoma is rare, slow-growing, and typically detected at an early stage.
# Boxplot of Age by Patient Status
boxplot(
Age ~ Patient_Status, # Compare Age across Patient Status groups
data = breast_clean, # Use the cleaned dataset
col = c("lightblue", "red"), # Colors for each group
main = "Distribution of Age by Patient Status", # Title of the plot
xlab = "Patient Status", # Label for the x-axis
ylab = "Age (years)", # Label for the y-axis
notch = FALSE, # Disable notches
border = "black" # Outline color for clarity
)
# Bar plot to show the relationship between Gender and Patient Status
ggplot(breast_clean, aes(x = Gender, fill = Patient_Status)) +
geom_bar(position = "dodge") + # Side-by-side bars
labs(
title = "Gender vs Patient Status",
x = "Gender",
y = "Number of Patients"
) +
# Use distinct, professional colors for clarity
scale_fill_manual(values = c("#6A5ACD", "#FFD700")) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"), # Center and bold title
axis.text.x = element_text(angle = 15, hjust = 1), # Gentle tilt for x-axis labels
legend.title = element_text(face = "bold") # Bold legend title
)
# Bar plot for Tumour Stage vs Patient Status with distinct colors
ggplot(breast_clean, aes(x = Tumour_Stage, fill = Patient_Status)) +
geom_bar(position = "dodge") + # Place bars side by side
labs(
title = "Tumour Stage vs Patient Status",
x = "Tumour Stage",
y = "Number of Patients"
) +
scale_fill_manual(values = c("steelblue", "red")) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"), # Center and bold title
axis.text.x = element_text(angle = 20, hjust = 1), # Slight tilt for x-axis labels
legend.title = element_text(face = "bold") # Bold legend title
)
The bar plot shows that Stage II has the largest number of patients and the largest number of deaths in absolute terms. The proportion of deaths is slightly higher in Stage II and Stage III than in Stage I, suggesting worse outcomes with more advanced stages.
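Because the bar plot displays counts, the proportion of deaths within each stage is easier to read from a row-wise proportion table. The sketch below shows the calculation on hypothetical counts (the `toy` data frame is made up for illustration and does not reproduce the study's figures).

```r
# Hypothetical counts for illustration only (not the study's actual figures)
toy <- data.frame(
  Tumour_Stage = rep(c("I", "II", "III"), times = c(10, 20, 10)),
  Patient_Status = c(
    rep(c("Alive", "Dead"), times = c(9, 1)),   # Stage I
    rep(c("Alive", "Dead"), times = c(16, 4)),  # Stage II
    rep(c("Alive", "Dead"), times = c(7, 3))    # Stage III
  )
)

# Cross-tabulate stage by status, then convert each row to proportions
counts <- table(toy$Tumour_Stage, toy$Patient_Status)
death_rate <- prop.table(counts, margin = 1)[, "Dead"]
round(death_rate, 2)
```

Applied to the study data, the same idea would be prop.table(table(breast_clean$Tumour_Stage, breast_clean$Patient_Status), margin = 1).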
# A bar plot for Histologic Type vs Patient Status
ggplot(breast_clean, aes(x = Histology, fill = Patient_Status)) +
geom_bar(position = "dodge") +
labs(
title = "Histologic Type vs Patient Status",
x = "Histologic Type",
y = "Number of Patients",
fill = "Patient Status"
) +
scale_fill_manual(values = c("#1E90FF", "#FF6347")) + # Blue & Tomato red for clear contrast
ylim(0, 200) +
theme_minimal(base_size = 13) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold", hjust = 0.5)
)
This chart compares histologic type with patient status. The majority of patients with Infiltrating Ductal Carcinoma were alive, though it also had the highest number of deaths, reflecting its high overall occurrence. Infiltrating Lobular Carcinoma showed fewer cases, with most patients surviving. Mucinous Carcinoma had the lowest frequency and the fewest deaths, consistent with its typically less aggressive and slow-growing nature. Overall, survivors outnumber deaths in every histologic type, and the absolute number of deaths is highest for Infiltrating Ductal Carcinoma because it is the most common type.
To ensure that all predictors contribute meaningful information, variables with no variation were identified and excluded. Both ER status and PR status contained only “Positive” values in the cleaned data, so they have no discriminatory power for classification and were removed from the dataset prior to modelling.
# Check unique values in ER_status and PR_status to confirm uniformity
unique(breast_clean$`ER status`)
## [1] "Positive"
unique(breast_clean$`PR status`)
## [1] "Positive"
# since both are only "Positive", drop them from the dataset
breast_clean <- subset(breast_clean, select = -c(`ER status`, `PR status`))
# Confirm removal
names(breast_clean)
## [1] "Patient_ID" "Age" "Gender"
## [4] "Protein1" "Protein2" "Protein3"
## [7] "Protein4" "Tumour_Stage" "Histology"
## [10] "HER2 status" "Surgery_type" "Date_of_Surgery"
## [13] "Date_of_Last_Visit" "Patient_Status"
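The check above inspects two columns by hand. As a rough sketch, the same idea can be applied to every column at once; the data frame `toy` below is a hypothetical stand-in with made-up rows, and the variable names are illustrative.

```r
# Hypothetical stand-in data: ER_status is constant, the other columns vary
toy <- data.frame(
  Age = c(36, 43, 69, 56),
  ER_status = c("Positive", "Positive", "Positive", "Positive"),
  HER2_status = c("Negative", "Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Flag columns that contain a single distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1, logical(1))
constant_cols <- names(toy)[is_constant]
constant_cols

# Keep only the columns that carry some variation
toy_reduced <- toy[, !is_constant, drop = FALSE]
names(toy_reduced)
```

The caret package, already used in this analysis, also offers nearZeroVar() for flagging constant and near-constant predictors.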
# Encode Patient_Status numerically and label both classes
breast_clean$Patient_Status <- ifelse(breast_clean$Patient_Status == "Alive", 1, 0)
# Convert to factor with descriptive labels
breast_clean$Patient_Status <- factor(breast_clean$Patient_Status,
levels = c(0, 1),
labels = c("Dead", "Alive"))
names(breast_clean)[names(breast_clean) == "HER2 status"] <- "HER2_status"
# Convert categorical variables to factors
breast_clean$Tumour_Stage <- as.factor(breast_clean$Tumour_Stage)
breast_clean$Histology <- as.factor(breast_clean$Histology)
breast_clean$HER2_status <- as.factor(breast_clean$HER2_status)
breast_clean$Surgery_type <- as.factor(breast_clean$Surgery_type)
# Select relevant predictors for the model
breast_model_data <- breast_clean %>%
select(-Patient_ID, -Date_of_Surgery, -Date_of_Last_Visit)
# Perform random oversampling on the entire dataset based on the target variable
set.seed(123)
oversampled_data <- ovun.sample(Patient_Status ~ .,
data = breast_model_data,
method = "over",
N = max(table(breast_model_data$Patient_Status)) * 2)$data
table(oversampled_data$Patient_Status)
##
## Alive Dead
## 255 255
prop.table(table(oversampled_data$Patient_Status))
##
## Alive Dead
## 0.5 0.5
After cleaning and encoding, the identifier and date variables (Patient_ID, Date_of_Surgery, Date_of_Last_Visit) were removed because they are not used as predictors. To correct the class imbalance in Patient_Status (255 Alive versus 62 Dead), random oversampling was performed with the ovun.sample() function from the ROSE package, which resampled minority-class rows until both classes had 255 observations, reducing bias toward the majority class.
# Set a random seed for reproducibility
set.seed(123)
# ---- Create a partition: 80% training, 20% testing ----
train_index <- createDataPartition(oversampled_data$Patient_Status, p = 0.8, list = FALSE)
# ---- Split the data ----
train_data <- oversampled_data[train_index, ] # 80% training data
test_data <- oversampled_data[-train_index, ] # 20% testing data
# ---- Check the dimensions of each ----
dim(train_data)
## [1] 408 11
dim(test_data)
## [1] 102 11
# ---- Check class distribution ----
prop.table(table(train_data$Patient_Status))
##
## Alive Dead
## 0.5 0.5
prop.table(table(test_data$Patient_Status))
##
## Alive Dead
## 0.5 0.5
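The balanced dataset was split into 80% training and 20% testing subsets using the createDataPartition() function, which stratifies the split by class, with set.seed(123) for reproducibility. This lets the model learn patterns from the training data while a portion is held back for evaluation. Because oversampling was applied before the split, duplicated minority-class rows can appear in both subsets, so the test-set metrics reported below are likely optimistic and should not be read as a fully independent assessment.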
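One common way to keep the test set free of duplicated rows is to split first and oversample only the training portion. The sketch below illustrates that ordering on made-up data; it swaps in base-R sampling for createDataPartition() and ovun.sample(), and the `toy` data frame and its `id` column are assumptions for illustration only.

```r
set.seed(123)

# Toy imbalanced data (hypothetical values, not the study data)
toy <- data.frame(
  id = 1:100,
  Protein1 = rnorm(100),
  Patient_Status = factor(rep(c("Alive", "Dead"), times = c(80, 20)))
)

# 1. Split first, stratified by class (80% train, 20% test)
train_idx <- unlist(lapply(split(toy$id, toy$Patient_Status), function(ids) {
  sample(ids, size = round(0.8 * length(ids)))
}))
train <- toy[toy$id %in% train_idx, ]
test <- toy[!toy$id %in% train_idx, ]

# 2. Oversample the minority class within the training set only
minority <- train[train$Patient_Status == "Dead", ]
n_extra <- sum(train$Patient_Status == "Alive") - nrow(minority)
extra <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]
train_balanced <- rbind(train, extra)

# Training classes are now balanced, and no patient appears in both sets
table(train_balanced$Patient_Status)
length(intersect(train_balanced$id, test$id))
```

With this ordering the test set keeps the original class ratio, so sensitivity and specificity are usually more informative than raw accuracy when judging the model.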
# convert Patient_Status to a factor
train_data$Patient_Status <- as.factor(train_data$Patient_Status)
test_data$Patient_Status <- as.factor(test_data$Patient_Status)
# Base predictors
predictors <- c("Age", "Tumour_Stage", "Histology", "HER2_status", "Surgery_type",
"Protein1", "Protein2", "Protein3", "Protein4")
formula <- reformulate(predictors, response = "Patient_Status")
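# Train the Random Forest model
# Note: randomForest() has no `seed` argument; the value below is absorbed by `...`
# and ignored, so reproducibility relies on the set.seed(123) call made before the split.
rf_model <- randomForest(
  formula,
  data = train_data,
  ntree = 600,       # Number of trees in the forest
  mtry = 3,          # Variables tried at each split
  importance = TRUE, # Store permutation importance for varImpPlot()
  seed = 123
)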
print(rf_model)
##
## Call:
## randomForest(formula = formula, data = train_data, ntree = 600, mtry = 3, importance = TRUE, seed = 123)
## Type of random forest: classification
## Number of trees: 600
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 5.15%
## Confusion matrix:
## Alive Dead class.error
## Alive 188 16 0.07843137
## Dead 5 199 0.02450980
The Random Forest model had an out-of-bag (OOB) error rate of 5.15%, meaning it classified about 95% of the training observations correctly. The class error was lower for Dead (2.5%) than for Alive (7.8%), so the model made fewer mistakes on patients who did not survive. These figures suggest that the model separates higher-risk patients from likely survivors well on the oversampled data.
# Predict on test set
rf_pred <- predict(rf_model, newdata = test_data)
# Confusion matrix
conf_matrix_rf <- confusionMatrix(rf_pred, test_data$Patient_Status)
print(conf_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive Dead
## Alive 46 0
## Dead 5 51
##
## Accuracy : 0.951
## 95% CI : (0.8893, 0.9839)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.902
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 0.9020
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9107
## Prevalence : 0.5000
## Detection Rate : 0.4510
## Detection Prevalence : 0.4510
## Balanced Accuracy : 0.9510
##
## 'Positive' Class : Alive
##
The model achieved 95% accuracy on the test set (95% CI: 0.889 to 0.984). It correctly identified all 51 non-survivors (specificity 100%) and 46 of the 51 survivors (sensitivity 90%); the 5 errors were survivors predicted as Dead.
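As a worked check, the headline metrics can be recomputed by hand from the four counts in the confusion matrix above. The sketch uses only base R; the object names are illustrative.

```r
# Counts from the Random Forest test-set confusion matrix reported above
# (rows = predicted, columns = actual; "Alive" is the positive class)
cm <- matrix(
  c(46, 5, 0, 51),
  nrow = 2,
  dimnames = list(Predicted = c("Alive", "Dead"), Actual = c("Alive", "Dead"))
)

tp <- cm["Alive", "Alive"] # survivors predicted Alive
fn <- cm["Dead", "Alive"]  # survivors predicted Dead
fp <- cm["Alive", "Dead"]  # non-survivors predicted Alive
tn <- cm["Dead", "Dead"]   # non-survivors predicted Dead

accuracy <- (tp + tn) / sum(cm) # share of all predictions that are correct
sensitivity <- tp / (tp + fn)   # share of actual survivors detected
specificity <- tn / (tn + fp)   # share of actual non-survivors detected

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

These values agree with the Accuracy, Sensitivity, and Specificity lines printed by confusionMatrix().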
cm <- as.data.frame(conf_matrix_rf$table)
# Rename columns
colnames(cm) <- c("Predicted", "Actual", "Freq")
# Plot confusion matrix
ggplot(cm, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "black", linewidth = 1.2) +
geom_text(aes(label = Freq), color = "white", size = 6, fontface = "bold") +
scale_fill_gradient(low = "#FF9999", high = "#3366CC") +
labs(
title = "Confusion Matrix - Random Forest Model",
x = "Actual Class",
y = "Predicted Class",
fill = "Count"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.text = element_text(color = "black", face = "bold"),
panel.grid = element_blank(),
legend.position = "right"
)
The plot shows 46 survivors correctly predicted as Alive and 51 non-survivors correctly predicted as Dead, with 5 survivors misclassified as Dead. This yields 95% accuracy, sensitivity ≈ 90%, and specificity = 100%. Clinically, the model errs on the side of caution: it flagged a few survivors as high risk but did not miss any non-survivor in this test set, which is the preferable direction of error for risk detection.
# --- Plot Feature Importance ---
varImpPlot(
rf_model,
type = 1, # 1 = Mean Decrease Accuracy
main = "Feature Importance",
pch = 19,
col = "darkblue",
cex = 0.9
)
The Random Forest model shows that Protein1, Protein4, and Protein2 are the strongest predictors of patient survival, with Age having moderate influence. The traditional clinical features (Surgery type, Tumour Stage, Histology, and HER2 status) contribute minimally. This indicates that the protein biomarkers account for most of the model's predictive power in distinguishing survivors from non-survivors.
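Mean Decrease Accuracy is a permutation-based measure: a feature's values are shuffled and the resulting loss of accuracy is recorded. randomForest computes this per tree on out-of-bag samples; the sketch below is a simplified version of the same idea using a single rpart tree and a holdout set. The data and the names `signal`, `noise`, and `accuracy()` are hypothetical and chosen only for illustration.

```r
library(rpart)

set.seed(42)
n <- 400
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
# Outcome depends only on `signal` (a made-up rule for illustration)
toy$status <- factor(ifelse(toy$signal > 0, "Dead", "Alive"))

train <- toy[1:300, ]
holdout <- toy[301:400, ]
fit <- rpart(status ~ signal + noise, data = train, method = "class")

# Share of correct class predictions on a data set
accuracy <- function(model, data) {
  mean(predict(model, data, type = "class") == data$status)
}
baseline <- accuracy(fit, holdout)

# Shuffle one feature at a time and record the drop in holdout accuracy
perm_importance <- sapply(c("signal", "noise"), function(feature) {
  shuffled <- holdout
  shuffled[[feature]] <- sample(shuffled[[feature]])
  baseline - accuracy(fit, shuffled)
})
perm_importance
```

A large drop means the model relies on that feature; a drop near zero means the feature can be scrambled without hurting predictions.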
# --- Build the Decision Tree model ---
# Reuses `formula` (Patient_Status ~ the nine base predictors) defined for the Random Forest.
# method = "class" fits a classification tree (not regression).
tree_model <- rpart(
  formula,
  data = train_data,
  method = "class",
  control = rpart.control(
    maxdepth = 3,  # Limit tree depth to 3 levels
    minsplit = 10, # Minimum observations required to attempt a split
    cp = 0.01      # Complexity parameter to limit overfitting
  )
)
# --- Plot the tree ---
#rpart.plot(
#tree_model,
#type = 2, # Shows split variable names
#extra = 104, # Adds predicted class, probability, and % of observations
#fallen.leaves = TRUE, # Neatly aligns leaves at bottom
#cex = 0.7, # Adjusts font size for readability
#box.palette = "GnBu", # Softer blue-green color palette
#shadow.col = "gray", # Adds subtle shadow for contrast
#branch.lty = 3, # Dashed branch lines
#nn = TRUE, # Displays node numbers
#main = "Decision Tree - Breast Cancer Survival (Depth-Controlled View)"
#)
# --- Make Predictions on Training and Test Data ---
# type = "class" → returns the predicted class (e.g., Alive or Dead)
train_pred_tree <- predict(tree_model, train_data, type = "class")
test_pred_tree <- predict(tree_model, test_data, type = "class")
# --- Generate Confusion Matrices ---
conf_matrix_train_tree <- confusionMatrix(
factor(train_pred_tree),
factor(train_data$Patient_Status)
)
conf_matrix_test_tree <- confusionMatrix(
factor(test_pred_tree),
factor(test_data$Patient_Status)
)
# --- Print Results ---
cat("Confusion Matrix - Train Data\n")
## Confusion Matrix - Train Data
print(conf_matrix_train_tree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive Dead
## Alive 172 91
## Dead 32 113
##
## Accuracy : 0.6985
## 95% CI : (0.6514, 0.7427)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 3.273e-16
##
## Kappa : 0.3971
##
## Mcnemar's Test P-Value : 1.698e-07
##
## Sensitivity : 0.8431
## Specificity : 0.5539
## Pos Pred Value : 0.6540
## Neg Pred Value : 0.7793
## Prevalence : 0.5000
## Detection Rate : 0.4216
## Detection Prevalence : 0.6446
## Balanced Accuracy : 0.6985
##
## 'Positive' Class : Alive
##
cat("\nConfusion Matrix - Test Data\n")
##
## Confusion Matrix - Test Data
print(conf_matrix_test_tree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive Dead
## Alive 32 23
## Dead 19 28
##
## Accuracy : 0.5882
## 95% CI : (0.4864, 0.6848)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.04592
##
## Kappa : 0.1765
##
## Mcnemar's Test P-Value : 0.64343
##
## Sensitivity : 0.6275
## Specificity : 0.5490
## Pos Pred Value : 0.5818
## Neg Pred Value : 0.5957
## Prevalence : 0.5000
## Detection Rate : 0.3137
## Detection Prevalence : 0.5392
## Balanced Accuracy : 0.5882
##
## 'Positive' Class : Alive
##
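The decision tree performs modestly: accuracy falls from 70% on the training set to 59% on the test set, only slightly above the 50% no-information rate. On the test set it detects survivors reasonably well (sensitivity 63%) but has limited ability to identify non-survivors (specificity 55%).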
# --- Check Feature Importance ---
print(tree_model$variable.importance)
## Protein1 Protein4 Surgery_type Protein3 Age Protein2
## 14.8067564 14.2067583 11.1412137 8.7072081 7.6605589 7.4109578
## Tumour_Stage HER2_status Histology
## 1.6240075 1.1381540 0.3961879
# Extract importance values
importance_values <- tree_model$variable.importance
# Create barplot with extended y-limit
bp <- barplot(
importance_values,
main = "Feature Importance (Decision Tree)",
xlab = "Features",
ylab = "Importance Score",
ylim = c(0, max(importance_values) + 5), # Increase y-axis limit
col = "lightblue",
names.arg = FALSE # Remove labels first
)
# Add slanted x-axis labels
text(
x = bp,
y = par("usr")[3] - 0.5, # Position below x-axis
labels = names(importance_values),
srt = 45, # Slant labels 45 degrees
adj = 1,
xpd = TRUE, # Allow drawing outside plot area
cex = 0.8
)
From this plot, the decision tree model suggests that Protein1 and Protein4 are the strongest predictors of survival, followed by Surgery type. Protein3, Age, and Protein2 make smaller contributions, while Tumour Stage, HER2 status, and Histology contribute very little. This points to the potential of Protein1 and Protein4 as prognostic biomarkers in breast cancer, although the tree's low test accuracy means this ranking should be interpreted with caution.
# Create a data frame for plotting
model_accuracy <- data.frame(
  Model = c("Decision Tree", "Random Forest"),
  Accuracy = c(0.5882, 0.951) # Test-set accuracies reported above
)
# Base R barplot; barplot() returns the bar midpoints, reused to place the labels
bar_mid <- barplot(
  model_accuracy$Accuracy,
  names.arg = model_accuracy$Model,
  col = c("tomato", "steelblue"),
  ylim = c(0, 1.1), # Headroom so the labels are not clipped
  main = "Comparison of Model Accuracies",
  ylab = "Accuracy",
  cex.names = 1.1
)
# Add text labels on top of bars
text(
  x = bar_mid,
  y = model_accuracy$Accuracy + 0.03,
  labels = round(model_accuracy$Accuracy, 2)
)
# Save the trained model
saveRDS(rf_model, file = "random_forest_model.rds")
This study showed that Random Forest was the more accurate of the two models for predicting breast cancer survival, with a test accuracy of 95% compared with 59% for the Decision Tree. Among all factors, Protein1 was the strongest predictor of patient outcomes in both models.
The model was deployed as a Shiny app, making survival predictions easy and accessible.
Protein1 emerged as an important predictive feature in the model and may warrant closer monitoring in breast cancer patients. While specific clinical cutoff values for Protein1–Protein4 are not currently established, their relative expression patterns provided useful prognostic signals within the model. Tools like this application can therefore support faster, data-informed decision-making when used alongside established clinical and pathological indicators.