Breast Cancer Survival Prediction

Author

Richmond Silvanus Baye

Published

April 11, 2025

Introduction

This report performs survival analysis on breast cancer patients using clinical and demographic data. We model time-to-event outcomes using Cox Proportional Hazards, accounting for censoring in patient survival.

Loading the libraries

Code

# Load all required packages with pacman
pacman::p_load(
  lubridate, ggsurvfit, gtsummary, tidycmprsk, tidyverse, cardx, broom.helpers,
  dplyr, gt, survival, broom, survminer, survRM2, ggpubr, caret, randomForestSRC
)

Loading the dataset.

You can download this dataset from the following website.

Code

data <- read.csv("Breast_Cancer.csv")

The dataset consists of 4,024 observations and 16 variables related to breast cancer patient characteristics. These include: - Demographics: Age, Race, Marital Status - Tumor Staging and Grading: T Stage, N Stage, 6th Stage, A Stage, Grade, differentiate - Tumor Properties: Tumor Size, Estrogen Status, Progesterone Status - Lymph Node Information: Regional Node Examined, Reginol Node Positive Outcomes: - Survival Months (numeric survival duration) - Status (binary: Alive or Dead) — We will use this as the target for classification.

Data Cleaning and Pre-processing

Code

# Data preprocessing
data <- distinct(data)  # Remove duplicates

# Convert relevant variables to factors
cat_cols <- sapply(data, is.character)
data[cat_cols] <- lapply(data[cat_cols], factor)

# Create survival object
data$Status <- factor(data$Status, levels = c("Alive", "Dead"))
data$event <- as.numeric(data$Status == "Dead")
surv_object <- Surv(time = data$Survival.Months, event = data$event)

# Fix typo in variable name (Reginol → Regional)
names(data)[names(data) == "Reginol.Node.Positive"] <- "Regional.Node.Positive"

We removed duplicate rows to ensure unique patient records, converted character columns to factors for proper model handling, and created a Surv object using survival time and event status. We also corrected a variable name typo to prevent formula errors in modeling.

Cox Proportional Hazards Model

Code

# Fit Cox proportional hazards model
cox_model <- coxph(
  surv_object ~ Age + Race + Tumor.Size + T.Stage + N.Stage + 
  Estrogen.Status + Progesterone.Status + 
  Regional.Node.Examined + Regional.Node.Positive, 
  data = data
)
cox_model

Call:
coxph(formula = surv_object ~ Age + Race + Tumor.Size + T.Stage + 
    N.Stage + Estrogen.Status + Progesterone.Status + Regional.Node.Examined + 
    Regional.Node.Positive, data = data)

                                 coef exp(coef)  se(coef)      z        p
Age                          0.017846  1.018006  0.004697  3.800 0.000145
RaceOther                   -0.879055  0.415175  0.210775 -4.171 3.04e-05
RaceWhite                   -0.480338  0.618574  0.126338 -3.802 0.000144
Tumor.Size                   0.002263  1.002266  0.003111  0.727 0.466949
T.StageT2                    0.388800  1.475210  0.113547  3.424 0.000617
T.StageT3                    0.431219  1.539133  0.218913  1.970 0.048859
T.StageT4                    0.911015  2.486846  0.231115  3.942 8.09e-05
N.StageN2                    0.458247  1.581300  0.109581  4.182 2.89e-05
N.StageN3                    0.640475  1.897382  0.178862  3.581 0.000342
Estrogen.StatusPositive     -0.782999  0.457033  0.132835 -5.895 3.76e-09
Progesterone.StatusPositive -0.528910  0.589247  0.106342 -4.974 6.57e-07
Regional.Node.Examined      -0.032523  0.968000  0.006492 -5.010 5.45e-07
Regional.Node.Positive       0.061905  1.063861  0.011284  5.486 4.11e-08

Likelihood ratio test=456.4  on 13 df, p=< 2.2e-16
n= 4023, number of events= 616

Race differences: Compared to the reference group, patients identified as White or from Other racial backgrounds had better survival outcomes.

Tumor stage is critical: Patients with larger or more advanced tumors (T2, T3, or T4) had a higher risk of dying than those with smaller tumors (T1). For example, those with T4 tumors had more than double the risk.

Lymph node involvement: The more lymph nodes that were affected by cancer (N2 or N3), the higher the risk. This means cancer that has spread to many lymph nodes is more dangerous.

Hormone receptor status offers protection: Patients whose tumors were sensitive to estrogen or progesterone had significantly better chances of survival.

Lymph node testing helps: Having more lymph nodes examined during diagnosis was linked to better outcomes, likely because it helps guide effective treatment. However, having more positive (cancer-affected) nodes was linked to worse outcomes.

Tumor size alone wasn’t a strong predictor of survival when other factors were considered, such as stage and lymph node involvement.

Code

# Build survival object
data$event <- ifelse(data$Status == "Dead", 1, 0)
surv_object <- Surv(time = data$Survival.Months, event = data$event)

# Fit Cox model
cox_model <- coxph(surv_object ~ Age + Race + T.Stage + N.Stage + 
                   Estrogen.Status + Progesterone.Status + Tumor.Size, 
                   data = data)

# Generate styled summary table
cox_tbl <- tbl_regression(cox_model, exp = TRUE) %>%
  add_global_p() %>%
  add_significance_stars() %>%
  bold_p(t = 0.05) %>%
  bold_labels()

# Display it
cox_tbl

Characteristic	HR¹	SE
Age	1.02***	0.005
Race	***
Black	—	—
Other	0.42	0.211
White	0.63	0.126
T.Stage	***
T1	—	—
T2	1.42	0.114
T3	1.40	0.220
T4	2.37	0.232
N.Stage	***
N1	—	—
N2	1.84	0.101
N3	3.42	0.102
Estrogen.Status	***
Negative	—	—
Positive	0.46	0.132
Progesterone.Status	***
Negative	—	—
Positive	0.59	0.106
Tumor.Size	1.00	0.003
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio, SE = Standard Error
¹ p<0.05; p<0.01; **p<0.001

Testing proportional hazards assumption

Code

ph_test <- cox.zph(cox_model)
print(ph_test)

                     chisq df       p
Age                  0.208  1    0.65
Race                 1.463  2    0.48
T.Stage              0.247  3    0.97
N.Stage              2.158  2    0.34
Estrogen.Status     32.015  1 1.5e-08
Progesterone.Status 33.527  1 7.0e-09
Tumor.Size           1.161  1    0.28
GLOBAL              47.059 11 2.1e-06

To ensure the validity of our survival analysis, we tested whether each variable in the model met the proportional hazards assumption—a key requirement of the Cox model that assumes the effect of a variable on survival remains consistent over time.

Most variables, including age, race, tumor size, tumor and nodal stage, and number of lymph nodes, did not violate this assumption. Their p-values were well above 0.05, indicating no significant time-dependent effects.
Although estrogen and progesterone receptor status showed statistically significant test results (p < 0.001), this could suggest a potential violation of the proportional hazards assumption for these variables. It’s worth exploring this further with time-varying models or stratification.
The global test for the entire model was also significant (p = 6.0e-06), suggesting that overall, some violation of the proportional hazards assumption exists in the model.

While most predictors behaved consistently over time, hormone receptor status may have effects that change during the follow-up period. Further checks or model adjustments (like stratification or adding time-varying covariates) might be necessary.

Forest plot of key variables

Code

# Create forest plot of hazard ratios
forest_plot <- ggforest(cox_model, data = data, fontsize = 0.6)
print(forest_plot)

What is contributing to survival

The Wald chi-square tests indicate that only estrogen and progesterone receptor status significantly contribute to the model (p < 0.001). Other variables such as age, stage, tumor size, and node involvement do not show statistically significant effects when tested individually, possibly due to multicollinearity or overlapping effects. However, the global test is highly significant (p = 6.0e-06), suggesting the model as a whole explains survival differences effectively.

Code

fit_km <- survfit(surv_object ~ 1, data = data)
km_plot <- ggsurvplot(
  fit_km, data = data, 
  conf.int = TRUE, 
  risk.table = TRUE,
  pval = FALSE,
  surv.median.line = "hv",
  ggtheme = theme_bw(),
  palette = "jco",
  legend.labs = "Overall",
  title = "Overall Survival Curve"
)

Warning in .add_surv_median(p, fit, type = surv.median.line, fun = fun, :
Median survival not reached.

Code

print(km_plot)

The curve begins at 100%, meaning all patients were alive at the start of observation. As time progresses, the curve gradually declines, reflecting patients who experienced the event (death). By the end of the observation period (100 units of time), the estimated survival probability remains above 70%, indicating that the majority of patients survived through most of the follow-up period. The table below the graph shows the number of patients still being followed (“at risk”) at each time point. For example, while 4,023 patients were at risk at the beginning, only 444 remained under observation at the final time point—this is expected due to both events (deaths) and censoring (patients lost to follow-up or study end). Together with our previous analysis, this curve provides a clear, visual summary of overall patient survival and supports the use of the Cox model for understanding what influences differences in survival.

Survival by T-stage

Next, we stratify by the stages of the progression and we observe that patients in T1 and T2 had survival probability above 0.75 over the study period. While those at T3 stage had a survival probability above 0.65. T4 patients had a survival probability slightly above 0.5 by the end of the study period. These results show that survival probability is strongly associated with the Tumor Stage.

Code

# Fit model or plot
fit <- survfit(surv_object ~ T.Stage, data = data)

# Plot survival curve only
surv_curve <- ggsurvplot(
  fit, 
  data = data,
  pval = TRUE,
  conf.int = FALSE,
  risk.table = FALSE,  # <--- turn off risk table
  palette = "jco",
  ggtheme = theme_bw(base_size = 14),
  title = "Survival Probability by Tumor Stage (T Stage)",
  xlab = "Time (months)",
  ylab = "Survival Probability",
  legend.title = "Tumor Stage"
)

# Plot risk table only
risk_tbl <- ggrisktable(
  fit,
  data = data,
  palette = "jco",
  title = "Number at Risk by Tumor Stage",
  xlab = "Time (months)"
)

# Display separately
print(surv_curve)

Code

print(risk_tbl)

Survival Outcomes by Tumor Stage (T Stage)

The Kaplan-Meier survival analysis revealed clear differences in survival probability over time based on tumor stage at diagnosis (T1–T4). Patients with T1 tumors showed the highest survival probability throughout the follow-up period, while those with T4 tumors experienced the steepest decline, indicating significantly poorer outcomes. This trend demonstrates a consistent gradient where survival worsens with advancing tumor stage.

The log-rank test confirmed that these differences are statistically significant (p < 0.0001), suggesting that tumor stage is a strong predictor of survival. The accompanying number-at-risk table shows how many patients remained under observation at each time point. Notably, the T4 group had substantially fewer patients and experienced faster attrition, reflecting higher event rates in this group. These results reinforce the prognostic importance of early tumor stage detection and support stage-based stratification in clinical decision-making and patient monitoring.