This exploratory data analysis (EDA) addresses the following questions:
Do malignant tumors have significantly greater clump thickness than benign ones?
Does the bare nuclei score differ significantly between tumor classes?
Are there notable relationships among the different cellular characteristics?
library(tidyverse)
library(ggstatsplot)
library(plotly)
library(mlbench)
data("BreastCancer")
dataset <- BreastCancer
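# Clean the dataset: convert the factor-coded scores to numbers and drop incomplete rows
bc_data <- dataset %>%
  # Go through as.character() so values come from the factor labels, not the
  # internal level codes; na_if() turns any literal "?" into NA first
  mutate(across(.cols = -c(Id, Class), ~ as.numeric(na_if(as.character(.x), "?")))) %>%
  select(-Id) %>%
  drop_na()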
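Because the BreastCancer columns are stored as factors, the numeric conversion deserves a second look. The toy vector below uses hypothetical values, not rows from the dataset. It is a minimal sketch of the pitfall: as.numeric() on a factor returns the internal level codes, while going through as.character() first recovers the recorded values.

```r
# Hypothetical scores stored as a factor; the levels sort as text: "1", "10", "2"
scores <- factor(c("10", "2", "1"))

codes  <- as.numeric(scores)                # internal level codes, not the scores
values <- as.numeric(as.character(scores))  # the recorded scores

print(codes)
print(values)
```

The two results differ whenever the factor levels are not already in numeric order, so the character round trip is the safer habit.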
# Preview the dataset
head(bc_data)
##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
tail(bc_data)
##     Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 694            3         1          1             1            2           1
## 695            3         1          1             1            3           2
## 696            2         1          1             1            2           1
## 697            5        10         10             3            7           3
## 698            4         8          6             4            3           4
## 699            4         8          8             5            4           5
##     Bl.cromatin Normal.nucleoli Mitoses     Class
## 694           2               1       2    benign
## 695           1               1       1    benign
## 696           1               1       1    benign
## 697           8              10       2 malignant
## 698          10               6       1 malignant
## 699          10               4       1 malignant
str(bc_data)
## 'data.frame': 683 obs. of 10 variables:
## $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
summary(bc_data)
## Cl.thickness Cell.size Cell.shape Marg.adhesion
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.00
## Mean : 4.442 Mean : 3.151 Mean : 3.215 Mean : 2.83
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.00
## Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.00
## Mean : 3.234 Mean : 3.545 Mean : 3.445 Mean : 2.87
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## Mitoses Class
## Min. :1.000 benign :444
## 1st Qu.:1.000 malignant:239
## Median :1.000
## Mean :1.583
## 3rd Qu.:1.000
## Max. :9.000
table(bc_data$Class)
##
## benign malignant
## 444 239
ggbetweenstats(
  data = bc_data,
  x = Class,
  y = Cl.thickness,
  type = "parametric",
  title = "Clump Thickness by Tumor Class",
  xlab = "Tumor Class",
  ylab = "Clump Thickness",
  plot.type = "box",
  pairwise.comparisons = TRUE
)
# Analysis
The boxplot indicates that malignant tumors generally have greater clump
thickness than benign tumors. The parametric comparison reported with the
plot (Welch's t-test, since var.equal defaults to FALSE) shows that the
difference is statistically significant, which makes clump thickness a
plausible indicator of malignancy.
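To read the test result as numbers rather than from the plot annotation, the comparison can be sketched in base R. This is an illustration under assumptions, not the exact computation ggstatsplot performs: it rebuilds the cleaned data with base functions (the names features, bc, and welch are introduced here) and calls t.test(), which applies Welch's correction by default.

```r
library(mlbench)  # provides the BreastCancer data used throughout

data("BreastCancer")
# Same cleaning idea as above, in base R: factor -> character -> numeric,
# then drop rows with missing values
features <- setdiff(names(BreastCancer), c("Id", "Class"))
bc <- BreastCancer[, c(features, "Class")]
bc[features] <- lapply(bc[features], function(col) as.numeric(as.character(col)))
bc <- na.omit(bc)

# Welch's t-test (unequal variances) is the default for t.test()
welch <- t.test(Cl.thickness ~ Class, data = bc)
print(welch)

# Per-class mean and standard deviation, to judge the size of the difference
print(aggregate(Cl.thickness ~ Class, data = bc,
                FUN = function(x) c(mean = mean(x), sd = sd(x))))
```

Looking at the group means alongside the p-value helps separate "statistically significant" from "large enough to matter".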
ggbetweenstats(
  data = bc_data,
  x = Class,
  y = Bare.nuclei,
  type = "parametric",
  var.equal = FALSE,
  title = "Bare Nuclei Count by Tumor Class",
  xlab = "Tumor Class",
  ylab = "Bare Nuclei",
  plot.type = "box",
  pairwise.comparisons = FALSE
)
# Analysis
The boxplot shows a clear separation between benign and malignant tumors
in bare nuclei scores. Malignant tumors tend to have higher scores, and
the difference is statistically significant, highlighting this feature’s
diagnostic relevance.
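The bare nuclei scores are integers from 1 to 10 and strongly skewed (median 1, mean 3.545 in the summary above), so a rank-based check is a reasonable complement to the parametric test. The sketch below is one way to do that with base R's wilcox.test(); the names bc and rank_test are introduced for illustration. Within ggstatsplot, type = "nonparametric" would be the analogous option.

```r
library(mlbench)

data("BreastCancer")
bc <- data.frame(
  Bare.nuclei = as.numeric(as.character(BreastCancer$Bare.nuclei)),
  Class = BreastCancer$Class
)
bc <- na.omit(bc)  # drops the rows with a missing bare nuclei score

# Wilcoxon rank-sum (Mann-Whitney) test; exact = FALSE because the
# integer scores contain many ties
rank_test <- wilcox.test(Bare.nuclei ~ Class, data = bc, exact = FALSE)
print(rank_test)

# Median score per class
print(tapply(bc$Bare.nuclei, bc$Class, median))
```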
# Correlation plots
# Individual correlation
ggstatsplot::ggscatterstats(
  data = bc_data,
  x = Cl.thickness,
  y = Cell.shape,
  title = "Correlation Between Clump Thickness and Cell Shape",
  xlab = "Clump Thickness",
  ylab = "Cell Shape",
  point.color = "#2E86AB",
  point.alpha = 0.6,
  line.color = "#D55E00",
  marginal = FALSE
)
# Correlation matrix
ggstatsplot::ggcorrmat(
  data = bc_data,
  type = "parametric",
  colors = c("#B3E2CD", "white", "#FDCDAC"),
  title = "Correlation Matrix of Cellular Features",
  subtitle = "Pearson Correlations of Numeric Variables",
  matrix.type = "lower",
  p.adjust.method = "none",
  hc.order = TRUE,
  lab = TRUE
)
# Analysis
The scatter plot shows a moderate positive correlation between clump
thickness and cell shape, indicating that the two scores tend to increase
together. The correlation matrix supports this and reveals additional
strong associations among features such as cell size, cell shape, and
clump thickness. Because p.adjust.method is set to "none", the p-values
behind the matrix are not adjusted for multiple comparisons.
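The coefficients behind these plots can also be inspected directly. The following sketch uses base R's cor.test() and cor() on the same cleaned features. The object names (num, pair_test, r, pair_list) are introduced here for illustration, and the ranking step is just one convenient way to list the strongest pairs.

```r
library(mlbench)

data("BreastCancer")
features <- setdiff(names(BreastCancer), c("Id", "Class"))
num <- as.data.frame(lapply(BreastCancer[features],
                            function(col) as.numeric(as.character(col))))
num <- na.omit(num)

# Pearson correlation for the pair shown in the scatter plot
pair_test <- cor.test(num$Cl.thickness, num$Cell.shape)
print(pair_test$estimate)

# Full Pearson matrix, reshaped to one row per feature pair
r <- cor(num)
pair_list <- as.data.frame(as.table(r), responseName = "r")
# Keep each unordered pair once and drop the diagonal
pair_list <- pair_list[as.integer(pair_list$Var1) < as.integer(pair_list$Var2), ]

# Five strongest associations by absolute correlation
print(head(pair_list[order(-abs(pair_list$r)), ], 5))
```

Replacing the default with method = "spearman" in cor() gives a rank-based version, which may suit the ordinal 1 to 10 scores better.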
# Interactive scatterplot: Clump Thickness vs. Bare Nuclei
p_bc <- ggplot(bc_data, aes(
  x = Cl.thickness,
  y = Bare.nuclei,
  color = Class,
  # Tooltip text picked up by ggplotly(); plain ggplot2 ignores this aesthetic
  text = paste("Class: ", Class, "<br>",
               "Clump Thickness: ", Cl.thickness, "<br>",
               "Bare Nuclei: ", Bare.nuclei, "<br>",
               "Cell Size: ", Cell.size, "<br>",
               "Marginal Adhesion: ", Marg.adhesion)
)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "Interactive: Clump Thickness vs. Bare Nuclei",
       subtitle = "Grouped by Tumor Class",
       x = "Clump Thickness",
       y = "Bare Nuclei Count",
       color = "Tumor Class") +
  theme_minimal()
# Hide the plotly toolbar with config(); layout(modebar = ...) has no `visible` setting
interactive_plot <- ggplotly(p_bc, tooltip = "text") %>%
  config(displayModeBar = FALSE)
interactive_plot
This scatterplot highlights the tendency for malignant tumors to show higher values of both clump thickness and bare nuclei. The interactive format allows for easy inspection of individual observations and patterns.
# Additional scatterplot: Cell Size vs Marginal Adhesion
p_scatter <- ggplot(bc_data,
                    aes(x = Cell.size, y = Marg.adhesion, color = Class,
                        text = paste("Cell Size:", Cell.size, "<br>",
                                     "Marginal Adhesion:", Marg.adhesion, "<br>",
                                     "Class:", Class))) +
  geom_point(alpha = 0.7) +
  labs(title = "Cell Size vs. Marginal Adhesion by Tumor Type",
       x = "Cell Size",
       y = "Marginal Adhesion") +
  scale_color_manual(values = c("benign" = "#00AFBB", "malignant" = "#E7B800")) +
  theme_minimal()
# Hide the plotly toolbar with config(); layout(modebar = ...) has no `visible` setting
ggplotly(p_scatter, tooltip = "text") %>%
  config(displayModeBar = FALSE)
This plot shows that malignant tumors tend to have both larger cell size and greater marginal adhesion scores. Although the two classes overlap, these variables may be useful for diagnosis in combination with others.
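Since the scores are integers, many points in the scatterplot sit on top of each other, and a small table can make both the tendency and the overlap easier to judge. The sketch below is one possible summary in base R; bc and class_means are names introduced here, and the choice of means plus a cross-tabulation is an assumption about what is useful, not part of the original analysis.

```r
library(mlbench)

data("BreastCancer")
features <- setdiff(names(BreastCancer), c("Id", "Class"))
bc <- BreastCancer[, c(features, "Class")]
bc[features] <- lapply(bc[features], function(col) as.numeric(as.character(col)))
bc <- na.omit(bc)

# Mean cell size and marginal adhesion score per tumor class
class_means <- aggregate(cbind(Cell.size, Marg.adhesion) ~ Class,
                         data = bc, FUN = mean)
print(class_means)

# Counts of each cell size score by class; score values that have
# observations in both rows are where the classes overlap
print(table(bc$Class, bc$Cell.size))
```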
This EDA showed clear differences in cellular characteristics between benign and malignant tumors. Clump thickness and bare nuclei both differed significantly between classes. Correlation analyses revealed strong interrelationships among variables, particularly cell size, cell shape, and clump thickness. Interactive plots helped to identify these patterns visually.
The results suggest that several cellular features are associated with tumor malignancy, supporting their consideration in diagnostic tools.