Breast cancer diagnosis relies heavily on examining cellular characteristics observed in biopsy samples, such as clump thickness, bare nuclei count, cell size, and cell shape. The Breast Cancer Wisconsin dataset provides measurements of these features that can help distinguish benign from malignant tumors. This exploratory data analysis (EDA) investigates the relationships between these cellular characteristics and tumor classification using statistical tests and interactive visualizations.
This EDA will explore the following questions:
Is there a relationship between clump thickness and tumor malignancy?
Does bare nuclei count differ significantly between benign and malignant tumors?
Are there significant correlations between different cellular characteristics?
library(tidyverse)
library(ggstatsplot)
library(plotly)
library(mlbench)
data("BreastCancer")
dataset <- BreastCancer
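# Clean the dataset
bc_data <- dataset %>%
  # Feature columns are factors: convert via character so the labels, not the level codes, become numbers
  mutate(across(.cols = -c(Id, Class), ~ as.numeric(as.character(.x)))) %>%
  select(-Id) %>%
  # Drop rows with missing values (Bare.nuclei has 16 NAs in the raw data)
  drop_na()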
# Preview the first few rows
head(bc_data)
##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
tail(bc_data)
##     Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 694            3         1          1             1            2           1
## 695            3         1          1             1            3           2
## 696            2         1          1             1            2           1
## 697            5        10         10             3            7           3
## 698            4         8          6             4            3           4
## 699            4         8          8             5            4           5
##     Bl.cromatin Normal.nucleoli Mitoses     Class
## 694           2               1       2    benign
## 695           1               1       1    benign
## 696           1               1       1    benign
## 697           8              10       2 malignant
## 698          10               6       1 malignant
## 699          10               4       1 malignant
# Check structure
str(bc_data)
## 'data.frame': 683 obs. of 10 variables:
## $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
# Summary of the raw dataset (before cleaning), including the 16 missing Bare.nuclei values
summary(dataset)
## Id Cl.thickness Cell.size Cell.shape Marg.adhesion
## Length:699 1 :145 1 :384 1 :353 1 :407
## Class :character 5 :130 10 : 67 2 : 59 2 : 58
## Mode :character 3 :108 3 : 52 10 : 58 3 : 58
## 4 : 80 2 : 45 3 : 56 10 : 55
## 10 : 69 4 : 40 4 : 44 4 : 33
## 2 : 50 5 : 30 5 : 34 8 : 25
## (Other):117 (Other): 81 (Other): 95 (Other): 63
## Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses
## 2 :386 1 :402 2 :166 1 :443 1 :579
## 3 : 72 10 :132 3 :165 10 : 61 2 : 35
## 4 : 48 2 : 30 1 :152 3 : 44 3 : 33
## 1 : 47 5 : 30 7 : 73 2 : 36 10 : 14
## 6 : 41 3 : 28 4 : 40 8 : 24 4 : 12
## 5 : 39 (Other): 61 5 : 34 6 : 22 7 : 9
## (Other): 66 NA's : 16 (Other): 69 (Other): 69 (Other): 17
## Class
## benign :458
## malignant:241
##
##
##
##
##
# Check distribution of diagnosis
table(bc_data$Class)
##
## benign malignant
## 444 239
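The feature columns in the raw BreastCancer data are factors (the summary() output above lists counts per level), so the conversion to numbers in the cleaning step is worth a check. In R, as.numeric() applied to a factor returns the internal level codes, and ifelse() also drops the factor attributes and returns codes. The codes only equal the printed labels when the levels happen to be exactly 1, 2, 3, ... in order. The sketch below uses a small made-up factor (not the real data) whose levels skip a value, to show how the two can diverge and how converting through character avoids the problem.

```r
# Made-up factor whose levels skip values, so level codes and labels disagree
# (illustrative only, not taken from BreastCancer)
scores <- factor(c("1", "10", "2", "10"), levels = c("1", "2", "10"))

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(scores)

# ifelse() strips the factor attributes and also returns the codes
via_ifelse <- ifelse(scores == "?", NA, scores)

# Converting through character recovers the printed labels
labels_as_numbers <- as.numeric(as.character(scores))

print(codes)              # level codes
print(via_ifelse)         # level codes again
print(labels_as_numbers)  # the actual scores
```

On the real data, a quick way to confirm the conversion is to compare levels(dataset$Mitoses) (or any other feature) with the range of values in the cleaned column.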
ggbetweenstats(
data = bc_data,
x = Class,
y = Cl.thickness,
type = "parametric",
title = "Comparison of Clump Thickness by Tumor Type",
xlab = "Tumor Type",
ylab = "Clump Thickness",
plot.type = "box",
pairwise.comparisons = TRUE
)
The box plot produced by ggbetweenstats shows that malignant tumors generally have higher clump thickness than benign ones, and the accompanying test reports a statistically significant difference between the two classes. Clump thickness is therefore a useful feature for distinguishing malignant tumors: samples with thicker clumps are more likely to be malignant.
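For two groups, the parametric test behind a plot like this is a two-sample t-test. By default it should be Welch's variant, which does not assume equal variances, although this may depend on the ggstatsplot version. The same kind of test can be run directly with base R's t.test(). The sketch below uses a small made-up data frame named toy (an assumption for illustration, not values from bc_data) so that it runs on its own.

```r
# Illustrative scores only (made-up values, not taken from bc_data)
toy <- data.frame(
  Class = factor(rep(c("benign", "malignant"), each = 6)),
  Cl.thickness = c(1, 2, 3, 3, 4, 5, 5, 7, 8, 8, 9, 10)
)

# Welch's two-sample t-test; var.equal = FALSE is also the t.test() default
welch <- t.test(Cl.thickness ~ Class, data = toy, var.equal = FALSE)

# Group means and the p-value for the difference in means
print(welch$estimate)
print(welch$p.value)
```

Replacing toy with bc_data in the t.test() call would run the same test on the cleaned dataset.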
ggbetweenstats(
data = bc_data,
x = Class,
y = Bare.nuclei,
type = "parametric",
var.equal = FALSE,
title = "Bare Nuclei Count: Benign vs. Malignant Tumors",
xlab = "Tumor Type",
ylab = "Bare Nuclei Count",
plot.type = "boxviolin", # Boxplot + violin
pairwise.comparisons = FALSE
)
The plot of bare nuclei count by tumor type also reveals a notable difference, with malignant tumors tending to have higher counts. The difference is statistically significant, suggesting that bare nuclei count is a strong indicator of malignancy.
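Because the features in this dataset are integer scores from 1 to 10 rather than continuous measurements, a rank-based test is a reasonable robustness check alongside the parametric one. The sketch below runs a Wilcoxon rank-sum (Mann-Whitney) test with base R on made-up scores (the toy data frame is an assumption for illustration, not values from bc_data). In ggbetweenstats, the comparable option would be type = "nonparametric".

```r
# Illustrative scores only (made-up values, not taken from bc_data)
toy <- data.frame(
  Class = factor(rep(c("benign", "malignant"), each = 6)),
  Bare.nuclei = c(1, 1, 1, 2, 1, 3, 10, 8, 10, 5, 10, 9)
)

# Wilcoxon rank-sum test; exact = FALSE uses the normal approximation,
# which avoids the warning R gives for exact p-values when there are ties
rank_test <- wilcox.test(Bare.nuclei ~ Class, data = toy, exact = FALSE)

print(rank_test$p.value)
```

If the parametric and rank-based tests agree, the conclusion does not hinge on treating the scores as interval-scaled.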
# Correlation plots
# For individual correlation
ggstatsplot::ggscatterstats(
data = bc_data,
x = Cl.thickness,
y = Cell.size,
title = "Correlation Between Clump Thickness and Cell Size",
xlab = "Clump Thickness",
ylab = "Cell Size",
point.color = "#0072B2",
point.alpha = 0.5,
line.color = "#D55E00",
marginal = FALSE
)
# For correlation matrix
ggstatsplot::ggcorrmat(
data = bc_data,
type = "parametric", # Pearson's r
colors = c("#6D9EC1", "white", "#E46726"),
title = "Correlation Matrix of Cellular Characteristics",
subtitle = "Pairwise Pearson correlations",
matrix.type = "lower",
p.adjust.method = "none",
hc.order = TRUE,
lab = TRUE
)
The correlation matrix shows strong positive correlations among cellular characteristics such as cell size, cell shape, bare nuclei count, and clump thickness. For instance, clump thickness and cell size have a moderately strong positive correlation, which would be consistent with larger cells forming thicker clumps, although a correlation alone cannot establish that. Other pairs of features are also highly correlated. These findings suggest that many of the cellular features tend to increase together and may collectively inform diagnosis.
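With nine numeric features there are 36 pairwise correlations, and the matrix above is drawn with p.adjust.method = "none", so each p-value is reported without correction for multiple testing. The sketch below shows, on simulated data (the variables x1, x2, and x3 are made up for illustration), how raw pairwise p-values from cor.test() compare with Holm-adjusted ones from p.adjust().

```r
# Simulated data: x1 and x2 share a common component, x3 is independent
set.seed(1)
n <- 50
shared <- rnorm(n)
toy <- data.frame(
  x1 = shared + rnorm(n, sd = 0.5),
  x2 = shared + rnorm(n, sd = 0.5),
  x3 = rnorm(n)
)

# All unordered pairs of columns
pairs <- combn(names(toy), 2, simplify = FALSE)

# Raw p-value of the Pearson correlation test for each pair
raw_p <- sapply(pairs, function(p) cor.test(toy[[p[1]]], toy[[p[2]]])$p.value)
names(raw_p) <- sapply(pairs, paste, collapse = " ~ ")

# Holm adjustment controls the family-wise error rate across all pairs
holm_p <- p.adjust(raw_p, method = "holm")

print(cbind(raw_p, holm_p))
```

In ggcorrmat, passing p.adjust.method = "holm" instead of "none" would apply the same kind of correction to the plotted matrix.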
ggbetweenstats(
data = bc_data,
x = Class,
y = Cell.size,
type = "parametric",
title = "Comparison of Cell Size by Tumor Type",
xlab = "Tumor Type",
ylab = "Cell Size",
plot.type = "violin",
pairwise.comparisons = TRUE
)
The violin plot shows a large difference in cell size between benign and malignant cases. Malignant tumors have much higher and more variable cell size scores, while benign tumors are tightly clustered at low scores. This confirms that cell size differs significantly between the two tumor classes.
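Statements about how high and how variable a score is within each class can be backed by per-class summaries. The sketch below computes the median and interquartile range by class with base R, using made-up scores (the toy data frame and its score column are assumptions for illustration, not values from bc_data).

```r
# Illustrative scores only (made-up values, not taken from bc_data)
toy <- data.frame(
  Class = factor(rep(c("benign", "malignant"), each = 6)),
  score = c(1, 1, 1, 2, 1, 3, 10, 8, 10, 5, 10, 9)
)

# Median (typical level) and interquartile range (spread) within each class
medians <- tapply(toy$score, toy$Class, median)
iqrs <- tapply(toy$score, toy$Class, IQR)

print(rbind(median = medians, IQR = iqrs))
```

The same two tapply() calls can be pointed at any feature column of bc_data to check what the violin shapes suggest.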
# Create a ggplot scatter plot with additional tooltip information
p_bc <- ggplot(bc_data, aes(x = Cl.thickness, y = Bare.nuclei, color = Class,
text = paste("Class: ", Class, "<br>",
"Clump Thickness: ", Cl.thickness, "<br>",
"Bare Nuclei: ", Bare.nuclei, "<br>",
"Cell Shape: ", Cell.shape, "<br>",
"Epith. Cell Size: ", Epith.c.size))) +
geom_point(size = 2, alpha = 0.8) +
labs(title = "Interactive Plot: Clump Thickness vs. Bare Nuclei",
subtitle = "Colored by Tumor Class",
x = "Clump Thickness",
y = "Bare Nuclei Count",
color = "Tumor Class") +
theme_minimal()
# Convert to an interactive Plotly object with custom tooltip and hidden modebar
interactive_plot <- ggplotly(p_bc, tooltip = "text") %>%
layout(modebar = list(orientation = "h", visible = FALSE))
# Display the interactive plot
interactive_plot
Malignant cases cluster toward higher values of both clump thickness and bare nuclei count, while benign cases cluster at lower values. Taken together, the two variables separate the classes well, which makes their joint behavior relevant to diagnostic accuracy.
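Both axes in this scatter plot take only integer values from 1 to 10, so the 683 observations fall on at most 100 distinct positions and many points sit exactly on top of one another. Counting observations per position makes that overlap explicit. The sketch below does this with base R on a few made-up points (the toy data frame is an assumption for illustration, not values from bc_data).

```r
# Illustrative points only (made-up values, not taken from bc_data)
toy <- data.frame(
  Cl.thickness = c(1, 1, 1, 5, 5, 10, 10, 10),
  Bare.nuclei  = c(1, 1, 1, 1, 1, 10, 10, 10)
)

# Cross-tabulate the two scores and keep only positions that actually occur
cell_counts <- as.data.frame(
  table(toy$Cl.thickness, toy$Bare.nuclei),
  stringsAsFactors = FALSE
)
names(cell_counts) <- c("Cl.thickness", "Bare.nuclei", "n")
cell_counts <- cell_counts[cell_counts$n > 0, ]

print(cell_counts)
```

In the plot itself, ggplot2's geom_count() (point size proportional to the count) or geom_jitter() are two possible ways to make the stacked points visible.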
# Other cell characteristics (cell size and cell shape)
p_scatter <- ggplot(bc_data,
aes(x = Cell.size, y = Cell.shape, color = Class,
text = paste("Cell Size:", Cell.size, "<br>",
"Cell Shape:", Cell.shape, "<br>",
"Class:", Class))) +
geom_point(alpha = 0.7) +
labs(title = "Cell Size vs. Cell Shape by Tumor Type",
x = "Cell Size",
y = "Cell Shape") +
scale_color_manual(values = c("benign" = "#00BA38", "malignant" = "#F8766D")) +
theme_minimal()
ggplotly(p_scatter, tooltip = "text") %>%
layout(modebar = list(visible = FALSE))
The plot shows a clear positive trend, especially among malignant tumors, which generally have larger and more irregularly shaped cells than benign cases and form distinct clusters at higher values. The malignant class contains more cases with high cell size and cell shape scores, underscoring that these traits co-occur in malignant samples. Abnormal cell size and shape are typical characteristics of malignancy.
In conclusion, this EDA shows that clump thickness, bare nuclei count, cell size, and cell shape are strong indicators for distinguishing benign from malignant tumors: malignant cases typically have higher values of each. Moreover, the strong correlations among these variables suggest that a common biological pattern of tumor growth may underlie them. Combining statistical tests with interactive visualizations gave a fuller picture of how cellular characteristics relate to tumor classification. These findings can inform more accurate and explainable models for cancer diagnosis and clearer diagnostic reports.