Project 2

Author

Paul D-O

Introduction

The dataset I’m working with is the 2017 SEER (Surveillance, Epidemiology, and End Results) Program dataset, provided by the National Cancer Institute (NCI). It contains 4,024 observations and 16 variables, offering population-based cancer statistics for breast cancer patients.

Observations are categorized by key demographic and clinical factors, such as race, age, marital status, tumor size, hormone receptor status (estrogen and progesterone), T-stage, N-stage, and A-stage. The data appeared well organized, so I didn’t need to clean it. What truly drew me to this dataset was the inclusion of race categorization. Having lost a family member to breast cancer, I feel a personal connection to this topic. This dataset has sparked a curiosity about whether breast cancer, like fibroids, shows race-specific trends. Exploring these patterns could help highlight disparities and contribute to better understanding and interventions for affected communities.

To prepare the dataset for visualization, I used dplyr's filter() function to remove observations in the race column for patients who identify as neither White nor Black. I then used the mutate() function to create a new column that groups tumors by size, and finally I grouped the data by tumor_size_group and race to display a bar chart. While this topic is unfamiliar to me, I found the National Cancer Institute's page on cancer staging (https://www.cancer.gov/about-cancer/diagnosis-staging/staging) very helpful in better understanding it.

Load the tidyverse library to access the packages needed to work with the dataset.

library(tidyverse)

Warning: package 'readr' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)

Warning: package 'plotly' was built under R version 4.4.3


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Load the dataset from the working directory into the global environment, convert the headers to lower case, and replace spaces with underscores.

setwd("C:/Users/Owner/OneDrive/Desktop/Data110")
# Suppress all messages when reading the CSV file
breast_cancer <- suppressMessages(read_csv("breast_cancer_project_2.csv", show_col_types = FALSE))
names(breast_cancer)<- gsub( " ","_",tolower(names(breast_cancer)))
head(breast_cancer)

# A tibble: 6 × 16
    age race  marital_status t_stage n_stage `6th_stage` differentiate     grade
  <dbl> <chr> <chr>          <chr>   <chr>   <chr>       <chr>             <chr>
1    68 White Married        T1      N1      IIA         Poorly different… 3    
2    50 White Married        T2      N2      IIIA        Moderately diffe… 2    
3    58 White Divorced       T3      N3      IIIC        Moderately diffe… 2    
4    58 White Married        T1      N1      IIA         Poorly different… 3    
5    47 White Married        T2      N1      IIB         Poorly different… 3    
6    51 White Single         T1      N1      IIA         Moderately diffe… 2    
# ℹ 8 more variables: a_stage <chr>, tumor_size <dbl>, estrogen_status <chr>,
#   progesterone_status <chr>, regional_node_examined <dbl>,
#   reginol_node_positive <dbl>, survival_months <dbl>, status <chr>
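As a small illustration of the renaming step, the sketch below applies the same tolower() and gsub() calls to a hand-typed vector of headers copied from the original CSV. The vector names raw_names and clean_names are illustrative only and are not used elsewhere in the analysis.

```r
# Illustrative header vector; raw_names is a hypothetical name, not part of the analysis
raw_names <- c("Age", "Marital Status", "6th Stage", "Reginol Node Positive")

# Lower-case first, then replace every space with an underscore
clean_names <- gsub(" ", "_", tolower(raw_names))
print(clean_names)
```

Because 6th_stage starts with a digit, it is not a syntactic R name, which is why the tibble output prints it wrapped in backticks.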

Understanding the variables and summary statistics.

summary(breast_cancer)

      age            race           marital_status       t_stage         
 Min.   :30.00   Length:4024        Length:4024        Length:4024       
 1st Qu.:47.00   Class :character   Class :character   Class :character  
 Median :54.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :53.97                                                           
 3rd Qu.:61.00                                                           
 Max.   :69.00                                                           
   n_stage           6th_stage         differentiate         grade          
 Length:4024        Length:4024        Length:4024        Length:4024       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   a_stage            tumor_size     estrogen_status    progesterone_status
 Length:4024        Min.   :  1.00   Length:4024        Length:4024        
 Class :character   1st Qu.: 16.00   Class :character   Class :character   
 Mode  :character   Median : 25.00   Mode  :character   Mode  :character   
                    Mean   : 30.47                                         
                    3rd Qu.: 38.00                                         
                    Max.   :140.00                                         
 regional_node_examined reginol_node_positive survival_months
 Min.   : 1.00          Min.   : 1.000        Min.   :  1.0  
 1st Qu.: 9.00          1st Qu.: 1.000        1st Qu.: 56.0  
 Median :14.00          Median : 2.000        Median : 73.0  
 Mean   :14.36          Mean   : 4.158        Mean   : 71.3  
 3rd Qu.:19.00          3rd Qu.: 5.000        3rd Qu.: 90.0  
 Max.   :61.00          Max.   :46.000        Max.   :107.0  
    status         
 Length:4024       
 Class :character  
 Mode  :character

Use dplyr's filter() function to keep only the White and Black categories of the categorical variable “race”.

breast_cancer_race <- breast_cancer |>
  filter(race %in% c("White", "Black"))
head(breast_cancer_race)

# A tibble: 6 × 16
    age race  marital_status t_stage n_stage `6th_stage` differentiate     grade
  <dbl> <chr> <chr>          <chr>   <chr>   <chr>       <chr>             <chr>
1    68 White Married        T1      N1      IIA         Poorly different… 3    
2    50 White Married        T2      N2      IIIA        Moderately diffe… 2    
3    58 White Divorced       T3      N3      IIIC        Moderately diffe… 2    
4    58 White Married        T1      N1      IIA         Poorly different… 3    
5    47 White Married        T2      N1      IIB         Poorly different… 3    
6    51 White Single         T1      N1      IIA         Moderately diffe… 2    
# ℹ 8 more variables: a_stage <chr>, tumor_size <dbl>, estrogen_status <chr>,
#   progesterone_status <chr>, regional_node_examined <dbl>,
#   reginol_node_positive <dbl>, survival_months <dbl>, status <chr>
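One way to confirm what a filter like this drops is to count rows per category before and after. The sketch below does so on a small made-up data frame (toy and toy_race are hypothetical names, and the values are not from the SEER data); in the real data the same step reduces 4,024 rows to 3,704.

```r
library(dplyr)

# Toy data standing in for the real table; the values are made up for illustration
toy <- data.frame(
  race = c("White", "Black", "Other", "White", "Other"),
  tumor_size = c(12, 30, 25, 41, 8)
)

# Keep only rows whose race is one of the listed values
toy_race <- toy |>
  filter(race %in% c("White", "Black"))

# Count rows per category before and after to see what was dropped
print(count(toy, race))
print(count(toy_race, race))
```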

View the structure of the dataset.

str(breast_cancer_race)

spc_tbl_ [3,704 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ age                   : num [1:3704] 68 50 58 58 47 51 51 40 40 69 ...
 $ race                  : chr [1:3704] "White" "White" "White" "White" ...
 $ marital_status        : chr [1:3704] "Married" "Married" "Divorced" "Married" ...
 $ t_stage               : chr [1:3704] "T1" "T2" "T3" "T1" ...
 $ n_stage               : chr [1:3704] "N1" "N2" "N3" "N1" ...
 $ 6th_stage             : chr [1:3704] "IIA" "IIIA" "IIIC" "IIA" ...
 $ differentiate         : chr [1:3704] "Poorly differentiated" "Moderately differentiated" "Moderately differentiated" "Poorly differentiated" ...
 $ grade                 : chr [1:3704] "3" "2" "2" "3" ...
 $ a_stage               : chr [1:3704] "Regional" "Regional" "Regional" "Regional" ...
 $ tumor_size            : num [1:3704] 4 35 63 18 41 20 8 30 103 32 ...
 $ estrogen_status       : chr [1:3704] "Positive" "Positive" "Positive" "Positive" ...
 $ progesterone_status   : chr [1:3704] "Positive" "Positive" "Positive" "Positive" ...
 $ regional_node_examined: num [1:3704] 24 14 14 2 3 18 11 9 20 21 ...
 $ reginol_node_positive : num [1:3704] 1 5 7 1 1 2 1 1 18 12 ...
 $ survival_months       : num [1:3704] 60 62 75 84 50 89 54 14 70 92 ...
 $ status                : chr [1:3704] "Alive" "Alive" "Alive" "Alive" ...
 - attr(*, "spec")=
  .. cols(
  ..   Age = col_double(),
  ..   Race = col_character(),
  ..   `Marital Status` = col_character(),
  ..   `T Stage` = col_character(),
  ..   `N Stage` = col_character(),
  ..   `6th Stage` = col_character(),
  ..   differentiate = col_character(),
  ..   Grade = col_character(),
  ..   `A Stage` = col_character(),
  ..   `Tumor Size` = col_double(),
  ..   `Estrogen Status` = col_character(),
  ..   `Progesterone Status` = col_character(),
  ..   `Regional Node Examined` = col_double(),
  ..   `Reginol Node Positive` = col_double(),
  ..   `Survival Months` = col_double(),
  ..   Status = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

Include a linear or multiple linear regression analysis of 2 or more quantitative variables. Write the equation for your model and analyze your model based on p-values, adjusted R^2 values, and diagnostic plots. Then analyze what those values mean for your model in the context of your variables

# Fit the model
fit <- lm(survival_months ~ tumor_size + regional_node_examined + reginol_node_positive, data = breast_cancer_race)

# View model summary
summary(fit)


Call:
lm(formula = survival_months ~ tumor_size + regional_node_examined + 
    reginol_node_positive, data = breast_cancer_race)

Residuals:
    Min      1Q  Median      3Q     Max 
-73.561 -15.238   1.265  18.374  44.849 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            73.59848    0.88864  82.822  < 2e-16 ***
tumor_size             -0.06101    0.01809  -3.373 0.000752 ***
regional_node_examined  0.14494    0.05010   2.893 0.003841 ** 
reginol_node_positive  -0.64680    0.08173  -7.914 3.28e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.63 on 3700 degrees of freedom
Multiple R-squared:  0.02411,   Adjusted R-squared:  0.02332 
F-statistic: 30.48 on 3 and 3700 DF,  p-value: < 2.2e-16

Obtain predicted and residual values for diagnostic plots.

breast_cancer_race$predicted <- predict(fit)
breast_cancer_race$residuals <- residuals(fit)
head(breast_cancer_race)

# A tibble: 6 × 18
    age race  marital_status t_stage n_stage `6th_stage` differentiate     grade
  <dbl> <chr> <chr>          <chr>   <chr>   <chr>       <chr>             <chr>
1    68 White Married        T1      N1      IIA         Poorly different… 3    
2    50 White Married        T2      N2      IIIA        Moderately diffe… 2    
3    58 White Divorced       T3      N3      IIIC        Moderately diffe… 2    
4    58 White Married        T1      N1      IIA         Poorly different… 3    
5    47 White Married        T2      N1      IIB         Poorly different… 3    
6    51 White Single         T1      N1      IIA         Moderately diffe… 2    
# ℹ 10 more variables: a_stage <chr>, tumor_size <dbl>, estrogen_status <chr>,
#   progesterone_status <chr>, regional_node_examined <dbl>,
#   reginol_node_positive <dbl>, survival_months <dbl>, status <chr>,
#   predicted <dbl>, residuals <dbl>

Let's create a plot of actual and predicted survival using one of our predictors, tumor_size.

ggplot(breast_cancer_race,aes(x = tumor_size, y = survival_months)) +
  geom_segment(aes(xend = tumor_size, yend = predicted), alpha = .2) + # Lines connecting actual to predicted values
  geom_point() + # Points of actual values
  geom_point(aes(y = predicted), shape = 1) +  # Points of predicted values
  theme_bw()

Refine the plot by coloring each point by its residual value.

ggplot(breast_cancer_race, aes(x = tumor_size , y = survival_months)) +
  geom_segment(aes(xend = tumor_size, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = "none") +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()

The same residual plot for all three predictors, one panel each.

breast_cancer_race |>
  # Reshape from wide to long: one row per patient and predictor,
  # with the predictor name in `iv` and its value in `x`
  pivot_longer(cols = c(tumor_size, regional_node_examined, reginol_node_positive),
               names_to = "iv", values_to = "x") |>
  ggplot(aes(x = x, y = survival_months)) +  # Note use of `x` here and next line
  geom_segment(aes(xend = x, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = "none") +
  geom_point(aes(y = predicted), shape = 1) +
  facet_grid(~ iv, scales = "free_x") +  # Split panels here by `iv`
  theme_bw()
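The residual plots above follow one predictor at a time. The standard base-R diagnostics for an lm object could be sketched as follows. Simulated data are used so the block runs on its own, and the names sim and sim_fit are hypothetical; with the real model, the same plot() call would take fit instead.

```r
set.seed(1)

# Simulated stand-in data (not the SEER data): one predictor with a noisy linear response
sim <- data.frame(x = runif(200, 1, 140))
sim$y <- 70 - 0.06 * sim$x + rnorm(200, sd = 20)

sim_fit <- lm(y ~ x, data = sim)

# Four standard diagnostics in a 2 x 2 grid:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
old_par <- par(mfrow = c(2, 2))
plot(sim_fit)
par(old_par)
```

A curved trend in the residuals-vs-fitted panel or strong departures from the line in the Q-Q panel would suggest that the linearity or normality assumptions deserve a closer look.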

Analysis of the linear regression

Regression Formula: survival_months = 73.60 − 0.061(tumor_size) + 0.145(regional_node_examined) − 0.647(reginol_node_positive)
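To make the equation concrete, here is a minimal worked example that plugs in the second patient shown in the str() output (35 mm tumor, 14 nodes examined, 5 positive nodes, 62 months observed survival). The coefficients are copied from the summary(fit) output; the variable names b, x, predicted, observed, and residual are illustrative.

```r
# Coefficients copied from the summary(fit) output
b <- c(intercept = 73.59848, tumor_size = -0.06101,
       regional_node_examined = 0.14494, reginol_node_positive = -0.64680)

# Second patient in the filtered data: 35 mm tumor, 14 nodes examined, 5 positive
x <- c(1, 35, 14, 5)

predicted <- sum(b * x)            # about 70.26 months
observed  <- 62                    # survival_months for that patient
residual  <- observed - predicted  # about -8.26 months
print(c(predicted = predicted, residual = residual))
```

The negative residual means the model overestimates this patient's survival by roughly eight months.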

Significant Predictors:

Tumor Size: Effect: Each 1 mm increase in tumor size is linked to a 0.061 month decrease in survival, holding the other predictors fixed • p-value = 0.00075 → Significant
Regional Node Examined: Effect: Each additional node examined is linked to a 0.145 month increase in survival, holding the other predictors fixed • p-value = 0.0038 → Significant
Regional Node Positive: Effect: Each additional positive node is linked to a 0.647 month decrease in survival, holding the other predictors fixed • p-value < 0.001 (3.28e-15 in the model summary) → Highly significant

Model Fit:

Adjusted R² = 2.3% → These 3 predictors explain only a small portion of the variation in survival. • Residual Standard Error = ~22.6 months → Predictions typically differ from actual survival by about 23 months.
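The adjusted R² can be recomputed from the multiple R², the number of observations, and the number of predictors. The sketch below uses the values printed by summary(fit) and str(); the variable names n, p, r2, and adj_r2 are illustrative.

```r
n  <- 3704      # rows in breast_cancer_race
p  <- 3         # predictors in the model
r2 <- 0.02411   # multiple R-squared from summary(fit)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 5))
```

With so many observations and only three predictors, the penalty is tiny, which is why the adjusted value barely differs from the multiple R².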

Conclusion:

All three variables are significantly associated with survival. • Most notably, positive lymph nodes have the largest negative effect per unit. • The model is statistically significant overall (F-test p-value < 2.2e-16) but explains only about 2.3% of the variation, so survival is influenced by many other factors I didn’t consider in this analysis, e.g. age, N-stage, T-stage, and hormone receptor status.

Explanation of quantitative variables in the data

Age: Age of each patient (3,704 patients after filtering).

Tumor Size: The size of the tumor, in mm.

Regional Node Examined: Number of lymph nodes examined.

Regional Node Positive: Number of lymph nodes that were positive.

Survival Months: Duration of survival for each patient, in months.

Explore both quantitative and categorical variables with simple plots/facets to determine what you want to focus on for your final visualization

After analyzing the regression model, in which tumor size, regional nodes examined, and regional nodes positive were examined against survival months, I wish to visualize how average survival varies with tumor size and race.

Step 1: Bin tumor size and group the data using dplyr

breast_summary <- breast_cancer|>
  mutate(tumor_size_group = case_when(
    tumor_size < 20 ~ "<20mm",
    tumor_size < 50 ~ "20-49mm",
    TRUE ~ "50+mm"
  )) |>
  group_by(tumor_size_group, race) |>
  summarise(
    avg_survival = mean(survival_months, na.rm = TRUE),
    count = n()
  )

`summarise()` has grouped output by 'tumor_size_group'. You can override using
the `.groups` argument.
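The binning rule in Step 1 relies on case_when() checking its conditions from top to bottom. A minimal sketch on a few hypothetical tumor sizes placed at the bin edges (the vectors sizes and groups are illustrative, not part of the analysis) shows where each value lands:

```r
library(dplyr)

# Hypothetical tumor sizes in mm, chosen to sit on the bin edges
sizes <- c(4, 19, 20, 49, 50, 140)

# case_when() checks conditions top to bottom and uses the first that is TRUE,
# so 20 fails `< 20` and falls into the second bin
groups <- case_when(
  sizes < 20 ~ "<20mm",
  sizes < 50 ~ "20-49mm",
  TRUE ~ "50+mm"
)
print(data.frame(sizes, groups))
```

Because the order matters, swapping the first two conditions would send every value under 50 into the "20-49mm" bin.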

Step 2: Create a ggplot bar chart and make it interactive with plotly. A bar chart helps compare average survival across race and tumor size groups.

plot4 <- ggplot(breast_summary, aes(x = tumor_size_group, y = avg_survival, fill = race)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
  geom_text(aes(label = round(avg_survival, 1)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5) +
  labs(
    title = "Average Survival by Tumor Size Group and Race",
    caption = "source: SEER Program of the NCI:\nhttp://seer.cancer.gov/seerstat/variables/seer/lrd-stage",
    x = "Tumor Size Group",
    y = "Average Survival (months)",
    fill = "Race"
  ) +
  theme_minimal(base_size = 16) +  
  scale_fill_brewer(palette = "Set1") +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    axis.title = element_text(face = "italic")
  )

# Convert to interactive plot
ggplotly(plot4)

Plot 4 without the interactivity, to display the caption and source.

plot4 <- ggplot(breast_summary, aes(x = tumor_size_group, y = avg_survival, fill = race)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
  geom_text(aes(label = round(avg_survival, 1)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5) +
  labs(
    title = "Average Survival by Tumor Size Group and Race",
    caption = "source: SEER Program of the NCI:\nhttp://seer.cancer.gov/seerstat/variables/seer/lrd-stage",
    x = "Tumor Size Group",
    y = "Average Survival (months)",
    fill = "Race"
  ) +
  theme_minimal(base_size = 16) +  
  scale_fill_manual(values = c("Black" = "#377eb8", "White" = "#4daf4a", "Other" = "#e41a1c")) +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    axis.title = element_text(face = "italic")
  )
plot4

Visualization interpretation.

The multiple regression analysis shows that tumor size, regional nodes examined, and regional nodes positive are each significantly associated with breast cancer survival months. Among these, positive lymph nodes have the strongest negative association with survival. Although the model is statistically significant overall, it explains only about 2.3% of the variation in survival, so outcomes are also shaped by variables not included in the model, such as age, N-stage, T-stage, hormone receptor status, and marital status. Incorporating these variables could improve the model’s predictive power and provide a more comprehensive understanding of patient outcomes.

Reflecting on the influence of marital status, I initially underestimated its potential role in survival. However, personal experiences—particularly supporting my spouse through medical challenges—have shifted my perspective. This underscores the importance of exploring how relationship status (e.g., single, married, divorced) might affect not only survival rates but also recovery trajectories and support systems.

The accompanying bar chart visualizes average survival (in months) for breast cancer patients, categorized by tumor size group and race (Black and White).

Tumor Size and Survival

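Among White patients, smaller tumors (<20mm) are associated with the longest average survival (73.0 months). Among Black patients, the <20mm and 20-49mm groups are similar (66.9 and 67.7 months).

In both groups, larger tumors (50+mm) correspond to the shortest average survival, indicating the importance of early detection.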

Race-Based Disparities

White patients consistently demonstrate higher average survival across all tumor size groups:

<20mm: Black – 66.9 months | White – 73.0 months

20–49mm: Black – 67.7 months | White – 70.9 months

50+mm: Black – 62.4 months | White – 68.8 months

Implications

This visualization points to racial disparities in breast cancer outcomes, potentially driven by systemic factors such as inequities in healthcare access, early screening, and treatment quality. Because the chart compares unadjusted group averages, the size of these gaps should be read as descriptive. The findings reinforce the need for equitable healthcare policies, community outreach, and tailored interventions to reduce survival gaps and improve outcomes for all patients.

Bibliography

National Cancer Institute: https://www.cancer.gov/about-cancer/diagnosis-staging

Dr. Simon Jackson: https://drsimonj.svbtle.com/visualising-residuals