The dataset I’m working with is the 2017 SEER (Surveillance, Epidemiology, and End Results) Program dataset, provided by the National Cancer Institute (NCI). It contains 4,024 observations and 16 variables, offering population-based cancer statistics for breast cancer patients.
Observations are categorized by key demographic and clinical factors, such as race, age, marital status, tumor size, hormone receptor status (estrogen and progesterone), T-stage, N-stage, and A-stage. The data appeared well organized, so I did not need to clean it. What truly drew me to this dataset was the inclusion of race categorization. Having lost a family member to breast cancer, I feel a personal connection to this topic. This dataset has sparked a curiosity about whether breast cancer, like fibroids, shows race-specific trends. Exploring these patterns could help highlight disparities and contribute to better understanding and interventions for affected communities.
To prepare the dataset for visualization, I used the dplyr filter() function to remove observations in the race column for patients recorded as neither White nor Black. I then used the mutate() function to create a new column that groups tumors by size, and finally I grouped the data by tumor_size_group and race to display a bar chart. While this topic is unfamiliar to me, I found the National Cancer Institute’s page on cancer staging (https://www.cancer.gov/about-cancer/diagnosis-staging/staging) very helpful in better understanding cancer staging.
Load the tidyverse library to access the packages needed to work with the dataset.
library(tidyverse)
Warning: package 'readr' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Load the dataset from the working directory into the global environment, then convert the column headers to lowercase and replace spaces with underscores.
setwd("C:/Users/Owner/OneDrive/Desktop/Data110")
# Suppress all messages when reading the CSV file
breast_cancer <- suppressMessages(read_csv("breast_cancer_project_2.csv", show_col_types = FALSE))
names(breast_cancer) <- gsub(" ", "_", tolower(names(breast_cancer)))
head(breast_cancer)
# A tibble: 6 × 16
age race marital_status t_stage n_stage `6th_stage` differentiate grade
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 68 White Married T1 N1 IIA Poorly different… 3
2 50 White Married T2 N2 IIIA Moderately diffe… 2
3 58 White Divorced T3 N3 IIIC Moderately diffe… 2
4 58 White Married T1 N1 IIA Poorly different… 3
5 47 White Married T2 N1 IIB Poorly different… 3
6 51 White Single T1 N1 IIA Moderately diffe… 2
# ℹ 8 more variables: a_stage <chr>, tumor_size <dbl>, estrogen_status <chr>,
# progesterone_status <chr>, regional_node_examined <dbl>,
# reginol_node_positive <dbl>, survival_months <dbl>, status <chr>
Understanding the variables and summary statistics.
summary(breast_cancer)
age race marital_status t_stage
Min. :30.00 Length:4024 Length:4024 Length:4024
1st Qu.:47.00 Class :character Class :character Class :character
Median :54.00 Mode :character Mode :character Mode :character
Mean :53.97
3rd Qu.:61.00
Max. :69.00
n_stage 6th_stage differentiate grade
Length:4024 Length:4024 Length:4024 Length:4024
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
a_stage tumor_size estrogen_status progesterone_status
Length:4024 Min. : 1.00 Length:4024 Length:4024
Class :character 1st Qu.: 16.00 Class :character Class :character
Mode :character Median : 25.00 Mode :character Mode :character
Mean : 30.47
3rd Qu.: 38.00
Max. :140.00
regional_node_examined reginol_node_positive survival_months
Min. : 1.00 Min. : 1.000 Min. : 1.0
1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 56.0
Median :14.00 Median : 2.000 Median : 73.0
Mean :14.36 Mean : 4.158 Mean : 71.3
3rd Qu.:19.00 3rd Qu.: 5.000 3rd Qu.: 90.0
Max. :61.00 Max. :46.000 Max. :107.0
status
Length:4024
Class :character
Mode :character
Use the dplyr filter() function to keep only the patients recorded as White or Black in the categorical variable “race”.
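The code for this filtering step does not appear in the rendered output. A minimal sketch of how it could look, assuming the race column holds the values "White", "Black", and "Other" (inferred from the legend colors defined in the plotting code later in this report). The data frame toy_cancer and its values are hypothetical and exist only so the example runs on its own:

```r
library(dplyr)

# Hypothetical toy data standing in for breast_cancer; the values are made up
toy_cancer <- data.frame(
  race = c("White", "Black", "Other", "White", "Other"),
  tumor_size = c(12, 35, 60, 22, 8)
)

# Keep only the rows where race is recorded as White or Black
toy_cancer_race <- toy_cancer |>
  filter(race %in% c("White", "Black"))

nrow(toy_cancer_race)
```

Applied to the real data, the same kind of filter would produce breast_cancer_race, the data frame passed to lm() in the regression step. The regression output reports 3,700 residual degrees of freedom with 4 estimated coefficients, which is consistent with 3,704 patients remaining after the filter.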
Include a linear or multiple linear regression analysis of 2 or more quantitative variables. Write the equation for your model and analyze your model based on p-values, adjusted R^2 values, and diagnostic plots. Then analyze what those values mean for your model in the context of your variables
# Fit the model
fit <- lm(survival_months ~ tumor_size + regional_node_examined + reginol_node_positive, data = breast_cancer_race)
# View model summary
summary(fit)
Call:
lm(formula = survival_months ~ tumor_size + regional_node_examined +
reginol_node_positive, data = breast_cancer_race)
Residuals:
Min 1Q Median 3Q Max
-73.561 -15.238 1.265 18.374 44.849
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 73.59848 0.88864 82.822 < 2e-16 ***
tumor_size -0.06101 0.01809 -3.373 0.000752 ***
regional_node_examined 0.14494 0.05010 2.893 0.003841 **
reginol_node_positive -0.64680 0.08173 -7.914 3.28e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.63 on 3700 degrees of freedom
Multiple R-squared: 0.02411, Adjusted R-squared: 0.02332
F-statistic: 30.48 on 3 and 3700 DF, p-value: < 2.2e-16
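Written out from the coefficient estimates in the summary above (rounded to three decimals), the fitted model is:

```latex
\[
\widehat{\text{survival\_months}} = 73.598
  - 0.061 \cdot \text{tumor\_size}
  + 0.145 \cdot \text{regional\_node\_examined}
  - 0.647 \cdot \text{reginol\_node\_positive}
\]
```

As an illustration only, for a hypothetical patient with a 30 mm tumor, 14 nodes examined, and 2 positive nodes, the model predicts roughly 73.598 − 0.061(30) + 0.145(14) − 0.647(2) ≈ 72.5 months.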
Obtain predicted and residual values for diagnostic plots.
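The plotting code in this section relies on two columns, predicted and residuals, whose creation is not shown in the rendered output. A minimal sketch of one way to add them, in the spirit of the residual-visualization post listed in the bibliography. It uses the built-in mtcars data as a stand-in, so toy_fit and toy_data are hypothetical names, not the objects used in this report:

```r
# Hypothetical stand-in: built-in mtcars instead of breast_cancer_race
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

toy_data <- mtcars
toy_data$predicted <- predict(toy_fit)    # fitted values from the model
toy_data$residuals <- residuals(toy_fit)  # observed value minus fitted value

head(toy_data[, c("mpg", "predicted", "residuals")])
```

With the real objects, the same two assignments on breast_cancer_race using fit would provide the columns the faceted residual plot expects.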
breast_cancer_race |>
  # Get data into shape: gather() reshapes the dataset from wide to long format
  gather(key = "iv", value = "x", tumor_size, regional_node_examined, reginol_node_positive) |>
  ggplot(aes(x = x, y = survival_months)) +  # Note use of `x` here and next line
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = "none") +  # "none" replaces the deprecated FALSE (ggplot2 3.3.4 and later)
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +  # Split panels here by `iv`
  theme_bw()
Tumor Size: Each 1 mm increase in tumor size is linked to a 0.061 month decrease in survival, holding the other predictors constant. p-value = 0.00075 → Significant
Regional Node Examined: Each additional node examined is linked to a 0.145 month increase in survival, holding the other predictors constant. p-value = 0.0038 → Significant
Regional Node Positive: Each additional positive node is linked to a 0.647 month decrease in survival, holding the other predictors constant. p-value = 3.28e-15 → Highly significant
Model Fit:
Adjusted R² = 2.3% → These 3 predictors explain only a small portion of the variation in survival months.
Residual standard error = 22.63 months → Observed survival typically differs from the model’s prediction by about 23 months.
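As a check on the model summary, the adjusted R-squared can be recovered from the multiple R-squared. Here n = 3704 is the number of patients (3,700 residual degrees of freedom plus 4 estimated coefficients) and p = 3 is the number of predictors:

```latex
\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
  = 1 - (1 - 0.02411)\,\frac{3704 - 1}{3704 - 3 - 1}
  \approx 0.0233
\]
```

This agrees with the 0.02332 reported by summary(fit). With so many observations and only three predictors, the adjustment barely changes the value.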
Conclusion:
All three variables are significantly associated with survival. Most notably, positive lymph nodes have the strongest negative effect per unit. The model is statistically significant overall (F-statistic p-value < 2.2e-16), but it explains little of the variation in survival, which is influenced by many other factors I didn’t include in this analysis, e.g. age, N-stage, T-stage, and hormone receptor status.
Explanation of quantitative variables in the data
Age: Age of each of the 3,704 patients,
Tumor Size: The size of the tumor,
Regional Node Examined: Number of lymph nodes examined,
Regional Node Positive: Number of lymph nodes that were positive,
Survival Months: Duration of survival for each patient.
Explore both quantitative and categorical variables with simple plots/facets to determine what you want to focus on for your final visualization
After analyzing the regression model, in which tumor size, regional nodes examined, and regional nodes positive were examined against survival months, I wish to visualize the data.
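The code that builds the summary table for the chart is not shown in the rendered output. A minimal sketch of how it might be built, assuming cutoffs of under 20 mm, 20 to 49 mm, and 50 mm or more (taken from the group labels used in the interpretation section). The column names tumor_size_group and avg_survival match the plotting code, while toy_cancer_race, toy_summary, and their values are hypothetical:

```r
library(dplyr)

# Hypothetical toy rows standing in for breast_cancer_race; the values are made up
toy_cancer_race <- data.frame(
  race = c("White", "White", "Black", "Black", "White", "Black"),
  tumor_size = c(10, 30, 15, 55, 60, 25),
  survival_months = c(80, 70, 66, 60, 64, 68)
)

toy_summary <- toy_cancer_race |>
  # Bin tumor size (mm) into three groups; the cutoffs are an assumption
  mutate(tumor_size_group = case_when(
    tumor_size < 20 ~ "<20mm",
    tumor_size < 50 ~ "20-49mm",
    TRUE ~ "50+mm"
  )) |>
  group_by(tumor_size_group, race) |>
  # .groups = "drop" returns an ungrouped table and silences the grouping message
  summarise(avg_survival = mean(survival_months), .groups = "drop")

toy_summary
```

When the .groups argument is left out, summarise() keeps the first grouping variable and prints a message saying the output is grouped by tumor_size_group.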
`summarise()` has grouped output by 'tumor_size_group'. You can override using
the `.groups` argument.
Step 2: Create a ggplot bar chart and make it interactive with plotly. The bar chart helps compare survival across race and tumor size groups.
plot4 <- ggplot(breast_summary, aes(x = tumor_size_group, y = avg_survival, fill = race)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
  geom_text(aes(label = round(avg_survival, 1)), position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(
    title = "Average Survival by Tumor Size Group and Race",
    caption = "source: SEER Program of the NCI:\nhttp://seer.cancer.gov/seerstat/variables/seer/lrd-stage",
    x = "Tumor Size Group",
    y = "Average Survival (months)",
    fill = "Race"
  ) +
  theme_minimal(base_size = 16) +
  scale_fill_brewer(palette = "Set1") +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    axis.title = element_text(face = "italic")
  )
# Convert to interactive plot
ggplotly(plot4)
Plot 4 without the interactivity, to display the caption and source.
plot4 <- ggplot(breast_summary, aes(x = tumor_size_group, y = avg_survival, fill = race)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
  geom_text(aes(label = round(avg_survival, 1)), position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(
    title = "Average Survival by Tumor Size Group and Race",
    caption = "source: SEER Program of the NCI:\nhttp://seer.cancer.gov/seerstat/variables/seer/lrd-stage",
    x = "Tumor Size Group",
    y = "Average Survival (months)",
    fill = "Race"
  ) +
  theme_minimal(base_size = 16) +
  scale_fill_manual(values = c("Black" = "#377eb8", "White" = "#4daf4a", "Other" = "#e41a1c")) +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    axis.title = element_text(face = "italic")
  )
plot4
Visualization interpretation.
The multiple regression analysis shows that tumor size, regional nodes examined, and regional nodes positive are all significantly associated with breast cancer survival months. Among these, positive lymph nodes have the strongest negative association with survival. Although the model is statistically significant overall, it explains only about 2.3% of the variation in survival, so outcomes are also shaped by variables not included in the model, such as age, N-stage, T-stage, hormone receptor status, and marital status. Incorporating these variables could improve the model’s predictive power and provide a more comprehensive understanding of patient outcomes.
Reflecting on the influence of marital status, I initially underestimated its potential role in survival. However, personal experiences—particularly supporting my spouse through medical challenges—have shifted my perspective. This underscores the importance of exploring how relationship status (e.g., single, married, divorced) might affect not only survival rates but also recovery trajectories and support systems.
The accompanying bar chart visualizes average survival (in months) for breast cancer patients, categorized by tumor size group and race (Black and White).
Tumor Size and Survival
Among White patients, smaller tumors (<20mm) are associated with the longest average survival; among Black patients, the <20mm and 20–49mm groups are nearly equal.
Larger tumors (50+mm) correspond to the shortest average survival, indicating the importance of early detection.
Race-Based Disparities
White patients consistently demonstrate higher average survival across all tumor size groups:
<20mm: Black – 66.9 months | White – 73.0 months
20–49mm: Black – 67.7 months | White – 70.9 months
50+mm: Black – 62.4 months | White – 68.8 months
Implications
This visualization highlights persistent racial disparities in breast cancer outcomes, potentially driven by systemic factors such as inequities in healthcare access, early screening, and treatment quality. The findings reinforce the need for equitable healthcare policies, community outreach, and tailored interventions to reduce survival gaps and improve outcomes for all patients.
Bibliography
National Cancer Institute: https://www.cancer.gov/about-cancer/diagnosis-staging
Dr. Simon Jackson: https://drsimonj.svbtle.com/visualising-residuals