DATA1001 Project 2

Author

SID 541011183

1. Client Bio:

LinkedIn – Cancer Council Australia


Cancer Council Australia’s main objective is to reduce the impact of cancer and improve wellbeing. Its goals include addressing disparities, improving health outcomes, supporting scientific advances, and advising the government. A critical focus of its work is the Australian Cancer Plan, along with supporting priority populations to achieve exemplary cancer outcomes.

2. Recommendation:


In this data analysis, we examined how delays in detecting tumor size can be detrimental to the timely treatment of cancer patients. We recommend that Cancer Council Australia promote early detection of tumor size in breast cancer, as it is crucial for enabling prompt and adequate treatment planning.

3. Evidence


3.1 Tumor Size Distribution across Cancer Stages

This boxplot illustrates how tumor size varies across different stages of breast cancer (IIA, IIB, IIIA, IIIB).

Code
library(dplyr)
library(ggplot2)
library(readr)
library(plotly)
library(reactable)
library(kableExtra)
library(gt)

# Load and clean data
Data2 <- read.csv("breast_cancer (1).csv")
Data2 <- Data2 %>% filter(X6th.Stage != "IIIC")

# Create boxplot
p <- ggplot(Data2, aes(x = X6th.Stage, y = Tumor.Size, fill = X6th.Stage)) +
  geom_boxplot(aes(color = X6th.Stage), linewidth = 0.8) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  labs(
    title = "Tumor Size vs 6th Stage",
    x = "Stage",
    y = "Tumor Size (mm)"
  ) +
  scale_fill_manual(values = c(
    "IIA" = "pink1",
    "IIB" = "lightskyblue1",
    "IIIA" = "lightyellow1",
    "IIIB" = "plum1"
  )) +
  scale_color_manual(values = c(
    "IIA" = "palevioletred1",
    "IIB" = "deepskyblue1",
    "IIIA" = "khaki1",
    "IIIB" = "magenta1"
  )) +
  theme_minimal()

# Render interactive plot
ggplotly(p)
The plot shows a clear increasing trend in tumor size as the stage increases from IIA to IIIB. Lower stages such as IIA and IIB tend to have smaller tumors, while higher stages such as IIIA and IIIB have larger tumors.

3.2 Scatterplot: Tumor Size vs Survival Months

The scatter plot below shows the relationship between tumor size and survival months, coloured by cancer stage (IIA to IIIB).

Code
library(dplyr)
library(ggplot2)
library(readr)
library(plotly)
library(reactable)
library(kableExtra)
library(gt)


Data2 <- read.csv("breast_cancer (1).csv")
Data2 <- Data2 %>% filter(X6th.Stage != "IIIC")



ggplot(Data2, aes(x = Tumor.Size, y = Survival.Months, color = X6th.Stage)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  scale_color_manual(
    values = c(
      "IIA" = "palevioletred1",
      "IIB" = "deepskyblue1",
      "IIIA" = "khaki1",
      "IIIB" = "magenta1"
    )
  ) +
  labs(
    title = "Tumor Size vs Survival Months by 6th Stage",
    x = "Tumor Size (mm)",
    y = "Survival Months",
    color = "6th Stage"
  ) +
  theme_minimal()

[1] 0.0001472146

Visual Interpretation:

The scatter plot shows that smaller tumors (<50 mm) and earlier cancer stages (IIA, IIB) are typically associated with longer survival. In contrast, larger tumors (>50 mm) and later stages (IIIA, IIIB) are typically associated with shorter survival (less than 60 months).

Correlation Test:

The correlation test (p ≈ 0.0001) shows a weak but statistically significant negative relationship: larger tumors are associated with shorter survival, which emphasizes the need for early detection.
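
The code that produced this p-value is not shown in the report. A minimal sketch of how such a test could be run with base R's `cor.test()` follows. It uses synthetic data (the vectors `tumor_size` and `survival_months` are made up for illustration), and it assumes a Pearson correlation, which the report does not confirm.

```r
# Synthetic stand-in for the real dataset (assumption: illustrative values only)
set.seed(42)
n <- 2000
tumor_size <- runif(n, min = 1, max = 100)                    # mm
survival_months <- 80 - 0.1 * tumor_size + rnorm(n, sd = 20)  # weak negative trend

# Pearson correlation test (assumed method; cor.test() defaults to Pearson)
test_result <- cor.test(tumor_size, survival_months)

r_estimate <- unname(test_result$estimate)  # correlation coefficient r
p_value <- test_result$p.value              # p-value for H0: r = 0

print(r_estimate)
print(p_value)
```

With the real data, the same call would take `Data2$Tumor.Size` and `Data2$Survival.Months`. Reporting r alongside the p-value lets a reader check the "weak" part of the claim, since a small p-value alone says nothing about the strength of the relationship.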

Statistical Summary

Code
library(dplyr)
library(gt)

Data2 <- read.csv("breast_cancer (1).csv")
Data2 <- Data2 %>% filter(X6th.Stage != "IIIC")

summary_table <- Data2 %>%
  group_by(X6th.Stage) %>%
  summarise(
    Count = n(),
    Mean_Tumor_Size = round(mean(Tumor.Size, na.rm = TRUE), 2),
    Median_Tumor_Size = round(median(Tumor.Size, na.rm = TRUE), 2),
    Mean_Survival = round(mean(Survival.Months, na.rm = TRUE), 2),
    Median_Survival = round(median(Survival.Months, na.rm = TRUE), 2)
  ) %>%
  gt() %>%
  cols_label(
    X6th.Stage = "6th Stage",
    Mean_Tumor_Size = "Mean",
    Median_Tumor_Size = "Median",
    Mean_Survival = "Mean",
    Median_Survival = "Median"
  ) %>%
  tab_spanner(
    label = "Tumor Size",
    columns = c(Mean_Tumor_Size, Median_Tumor_Size)
  ) %>%
  tab_spanner(
    label = "Survival Months",
    columns = c(Mean_Survival, Median_Survival)
  ) %>%
  tab_style(
    style = cell_fill(color = "deeppink"),
    locations = cells_column_labels(columns = c(Mean_Tumor_Size, Median_Tumor_Size))
  ) %>%
  tab_style(
    style = cell_fill(color = "dodgerblue"),
    locations = cells_column_labels(columns = c(Mean_Survival, Median_Survival))
  ) %>%
  tab_style(
    style = list(cell_fill(color = "lightpink"), cell_text(weight = "bold")),
    locations = cells_body(
      columns = c(Mean_Tumor_Size, Median_Tumor_Size),
      rows = Mean_Tumor_Size == max(Mean_Tumor_Size) | Median_Tumor_Size == max(Median_Tumor_Size)
    )
  ) %>%
  tab_style(
    style = list(cell_fill(color = "lightblue"), cell_text(weight = "bold")),
    locations = cells_body(
      columns = c(Mean_Survival, Median_Survival),
      rows = Mean_Survival == max(Mean_Survival) | Median_Survival == max(Median_Survival)
    )
  ) %>%
  tab_options(
    column_labels.padding = px(12),   
    table.width = pct(100)           
  ) %>%
  opt_table_font(font = list(
    google_font("Open Sans"), default_fonts()
  )) %>%
  opt_row_striping() %>%  
  opt_css(
    css = "
      tbody tr:hover {
        background-color: #e6f7ff !important;
      }
    "
  ) %>%
  tab_header(title = ("STATISTICAL ANALYSIS"))

summary_table
STATISTICAL ANALYSIS

                    Tumor Size        Survival Months
6th Stage   Count   Mean    Median    Mean    Median
IIA         1305    14.18   15.0      74.41   76.0
IIB         1130    30.63   30.0      72.22   74.0
IIIA        1050    43.62   36.5      70.19   70.5
IIIB          67    54.31   51.0      69.42   74.0

FURTHER ANALYSIS:

3.3 Understanding Estrogen Status Across Different Tumor Groups

Code
library(dplyr)
library(ggplot2)
library(plotly)


# Load data and bin tumor size into three groups
Data2 <- read.csv("breast_cancer (1).csv")
Data2 <- Data2 %>%
  mutate(Tumor_Size_Group = cut(Tumor.Size,
                                breaks = c(0, 20, 50, Inf),
                                labels = c("Small (<=20mm)", "Medium (21-50mm)", "Large (>50mm)"),
                                right = TRUE))

# Average survival for each tumor size group and estrogen status
Data2 <- Data2 %>%
  group_by(Tumor_Size_Group, Estrogen.Status) %>%
  summarise(Average_Survival = mean(Survival.Months, na.rm = TRUE), .groups = "drop") %>%
  na.omit()

# Grouped bar chart of average survival
bar <- ggplot(Data2, aes(x = Tumor_Size_Group, y = Average_Survival,
                         fill = Estrogen.Status, color = Estrogen.Status)) +
  geom_col(width = 0.7, position = position_dodge()) +
  scale_fill_manual(values = c(
    "Positive" = "paleturquoise1",
    "Negative" = "thistle1",
    "Unknown"  = "#FFB6C1"
  )) +
  scale_color_manual(values = c(
    "Positive" = "darkturquoise",
    "Negative" = "orchid1",
    "Unknown"  = "purple"
  )) +
  labs(
    title = "Tumor Size vs Survival Months by Estrogen Status",
    x = "Tumor Size Group",
    y = "Average Survival Months",
    fill = "Estrogen Status",
    color = "Estrogen Status"
  ) +
  theme_minimal()

# Render interactive plot
ggplotly(bar)
Code
anova_data <- read.csv("breast_cancer (1).csv") %>%
  filter(!is.na(Tumor.Size) & !is.na(Survival.Months) & !is.na(Estrogen.Status)) %>%
  mutate(Tumor_Size_Group = cut(Tumor.Size,
                                breaks = c(0, 20, 50, Inf),
                                labels = c("Small (<=20mm)", "Medium (21-50mm)", "Large (>50mm)"),
                                right = TRUE))

fit <- aov(Survival.Months ~ Tumor_Size_Group * Estrogen.Status, data = anova_data)

aov_summary <- summary(fit)[[1]]


# Row names from summary.aov() are padded with trailing spaces, so trim them
# before filtering out the residuals row
results <- data.frame(
  Variable = trimws(rownames(aov_summary)),
  P_Value = aov_summary$`Pr(>F)`
)

results <- results[results$Variable != "Residuals", ]

print(results)
                          Variable      P_Value
1                 Tumor_Size_Group 1.656193e-06
2                  Estrogen.Status 1.294992e-15
3 Tumor_Size_Group:Estrogen.Status 3.208499e-05

Statistical Test: Two-Way ANOVA

Tumor size (\(p = 1.66 \times 10^{-6}\)), estrogen status (\(p = 1.29 \times 10^{-15}\)), and their interaction (\(p = 3.21 \times 10^{-5}\)) all have a significant effect on survival months. Survival decreases with larger tumors and negative estrogen status, and the impact of tumor size varies by estrogen status.
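
To see what an interaction means in practice, one can compare mean survival in each combination of tumor size group and estrogen status. The sketch below uses synthetic data (the `toy` data frame and all of its values are assumptions for illustration, not the report's dataset) with the same `cut()` breaks as the analysis, and it deliberately builds in a stronger size effect for negative estrogen status.

```r
# Synthetic data (assumption: illustrative only, not the breast cancer dataset)
set.seed(1)
n <- 600
toy <- data.frame(
  Tumor.Size = sample(1:100, n, replace = TRUE),
  Estrogen.Status = sample(c("Positive", "Negative"), n, replace = TRUE)
)

# Same binning as the analysis: (0, 20], (20, 50], (50, Inf)
toy$Tumor_Size_Group <- cut(
  toy$Tumor.Size,
  breaks = c(0, 20, 50, Inf),
  labels = c("Small (<=20mm)", "Medium (21-50mm)", "Large (>50mm)"),
  right = TRUE
)

# Built-in interaction: tumor size lowers survival more for negative status
slope <- ifelse(toy$Estrogen.Status == "Negative", -0.3, -0.05)
toy$Survival.Months <- 80 + slope * toy$Tumor.Size + rnorm(n, sd = 5)

# Mean survival for each size group x estrogen status cell
cell_means <- tapply(
  toy$Survival.Months,
  list(toy$Tumor_Size_Group, toy$Estrogen.Status),
  mean
)
print(round(cell_means, 1))

# Drop in mean survival from small to large tumors, per estrogen status
drop_by_status <- cell_means["Small (<=20mm)", ] - cell_means["Large (>50mm)", ]
print(round(drop_by_status, 1))
```

If the drop from small to large tumors differs between the two estrogen status columns, the effect of tumor size depends on estrogen status, which is what a significant interaction term indicates.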

Conclusion:


Tumor size increases with stage and is associated with shorter survival, while estrogen status significantly affects outcomes. Together, these findings highlight the importance of timely detection of tumor size so that treatment can be planned appropriately.

Limitations

Incomplete data

– Some patient records may lack information

Independent Analysis

– The analysis is based on a single dataset, so its findings may not apply to all individuals
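
One way to quantify the incomplete-data limitation is to count missing values per column before analysis. A minimal base R sketch follows, using a small made-up data frame (`toy` and its values are hypothetical) that borrows the report's column names:

```r
# Hypothetical records with some missing values (assumption: illustrative only)
toy <- data.frame(
  Tumor.Size = c(12, NA, 45, 60),
  Survival.Months = c(70, 65, NA, NA),
  Estrogen.Status = c("Positive", "Negative", NA, "Positive")
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Number of rows with no missing values at all
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Note that `read.csv()` reads blank fields in text columns as empty strings rather than `NA` unless `na.strings` is set, so a check like this may undercount missing categorical values in a real file.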

External Evidence

  1. Cancer Center – Tumor Size Chart: explains tumor size at each stage.
  2. ScienceDirect – Study on Tumor Characteristics: describes how tumor characteristics relate to treatment outcomes and survival, supporting this analysis.

4. Ethics statement

I conducted this analysis honestly, interpreting the data transparently and respectfully. I aimed to remain unbiased and mindful of how the results could be used, and I am committed to using statistics to support important and ethical decisions in cancer treatment.

5. AI usage statement

  1. Tool used: ChatGPT 4.1 mini

  2. Dates used: 11/05/2025, 13/05/2025, 14/05/2025

  3. ChatGPT Analysis 1
    ChatGPT Analysis 2
    ChatGPT Analysis 3

  4. This AI tool showed me how to manually fill colors in bar plots and how to choose appropriate statistical tests; both are used in the Evidence section. It also helped me find my errors while rendering.

6. Acknowledgement

Drop in -
9/05/2025
  • Clarified my doubts with the tutor regarding word count, formatting, and which variables would give the best graphical visualization for my idea.
13/05/2025
  • The tutor suggested using a scatterplot instead of a stacked barplot for better visual analysis of my idea.

ED - post -
ED Post 1
ED Post 2