This Exploratory Data Analysis (EDA) focuses on the Breast Cancer Wisconsin dataset. The main objective is to understand the structure of the data, detect patterns, and uncover insights using both statistical and graphical approaches.
if (!require("mlbench")) install.packages("mlbench", dependencies = TRUE)
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE)
if (!require("plotly")) install.packages("plotly", dependencies = TRUE)
if (!require("ggstatsplot")) install.packages("ggstatsplot", dependencies = TRUE)
if (!requireNamespace("ggcorrplot", quietly = TRUE)) install.packages("ggcorrplot", dependencies = TRUE)
library(mlbench)
library(tidyverse)
library(plotly)
library(ggstatsplot)
data("BreastCancer")
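df <- BreastCancer %>%
  select(-Id) %>%                      # drop the sample identifier
  # No character columns should remain once Id is dropped, so this step is only a safeguard
  mutate(across(where(is.character), ~ as.numeric(as.character(.)))) %>%
  # Note: as.numeric() on a factor returns level codes, not labels. The codes match
  # the labels only when the levels are exactly 1..10 in order; the Mitoses maximum
  # of 9 in the summary below suggests they do not for that column.
  mutate(across(-Class, ~ as.numeric(.))) %>%
  mutate(Class = factor(Class)) %>%
  drop_na()                            # remove rows with missing values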
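One detail worth checking in the conversion above: `as.numeric()` applied to a factor returns the internal level codes rather than the labels. The sketch below uses a small made-up factor (`f` is illustrative, not a column of the dataset) to show the difference and the usual `as.character()` workaround.

```r
# Illustrative factor whose levels skip the value 9 (hypothetical data)
f <- factor(c("1", "2", "10", "8"),
            levels = c("1", "2", "3", "4", "5", "6", "7", "8", "10"))

# as.numeric() on a factor returns level codes: "10" is the 9th level
codes <- as.numeric(f)

# Converting through character recovers the labels as numbers
values <- as.numeric(as.character(f))

print(codes)   # 1 2 9 8
print(values)  # 1 2 10 8
```

If a column of `BreastCancer` behaves the same way, using `as.numeric(as.character(.))` inside the `across()` call would preserve the original scores.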
summary(df)
## Cl.thickness Cell.size Cell.shape Marg.adhesion
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.00
## Mean : 4.442 Mean : 3.151 Mean : 3.215 Mean : 2.83
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.00
## Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.00
## Mean : 3.234 Mean : 3.545 Mean : 3.445 Mean : 2.87
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## Mitoses Class
## Min. :1.000 benign :444
## 1st Qu.:1.000 malignant:239
## Median :1.000
## Mean :1.583
## 3rd Qu.:1.000
## Max. :9.000
str(df)
## 'data.frame': 683 obs. of 10 variables:
## $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
glimpse(df)
## Rows: 683
## Columns: 10
## $ Cl.thickness <dbl> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, 4, …
## $ Cell.size <dbl> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1, 7, 4, 1, 1,…
## $ Cell.shape <dbl> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1, 1,…
## $ Marg.adhesion <dbl> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1, 1,…
## $ Epith.c.size <dbl> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, 7, 6, 2, 2, …
## $ Bare.nuclei <dbl> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, 3, 9, 1, 1, …
## $ Bl.cromatin <dbl> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, 5, 4, 2, 3, …
## $ Normal.nucleoli <dbl> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 5, 3, 1, 1, …
## $ Mitoses <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, 1, 1, 1, …
## $ Class <fct> benign, benign, benign, benign, benign, malignant, ben…
df %>%
  pivot_longer(cols = -Class, names_to = "Feature", values_to = "Value") %>%
  drop_na(Value) %>%
  ggplot(aes(x = Value)) +
  # Features are integer scores, so one bin per score avoids empty gaps between bars
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black") +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal()
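Interpretation: Most features are right-skewed, with values concentrated at the low end of the 1 to 10 scale (six of the nine features have a median of 1). Standardization or another transformation may be worth considering before scale-sensitive modeling.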
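To put a number on the skew described above, one option is the moment-based sample skewness: the third central moment divided by the variance raised to the power 1.5. The helper `sample_skewness()` below is a hypothetical name introduced here, and the example vector is made up to mimic a score concentrated at 1.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
# Positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^1.5
}

# Made-up scores on a 1 to 10 scale, mostly at the low end
scores <- c(1, 1, 1, 1, 1, 2, 2, 3, 5, 10)
print(sample_skewness(scores) > 0)        # TRUE: right-skewed
print(sample_skewness(c(1, 2, 3, 4, 5)))  # 0: symmetric

# Applied to the analysis data (assuming `df` as built earlier):
# sapply(df[setdiff(names(df), "Class")], sample_skewness)
```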
ggbetweenstats(
  data = df,
  x = Class,
  y = Cl.thickness,
  title = "Comparison of Clump Thickness by Class",
  messages = FALSE
)
df_numeric <- df %>% select(-Class)
cor_matrix <- cor(df_numeric)
ggplotly(
  ggcorrplot::ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower", lab = TRUE)
)
Interpretation: Certain features, such as Bare.nuclei and Cl.thickness, show clear separation between benign and malignant tumors.
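The class comparison plotted earlier covers only `Cl.thickness`. One way the same check could be extended to every feature is to compute per-class medians with base R's `aggregate()`. The small data frame `toy` below is made up for illustration; with the real data, `df` would take its place.

```r
# Made-up stand-in for the analysis data (values are illustrative only)
toy <- data.frame(
  Cl.thickness = c(1, 2, 3, 8, 9, 10),
  Bare.nuclei  = c(1, 1, 2, 10, 9, 10),
  Class        = factor(c("benign", "benign", "benign",
                          "malignant", "malignant", "malignant"))
)

# Median of every feature within each class; `.` means "all other columns"
class_medians <- aggregate(. ~ Class, data = toy, FUN = median)
print(class_medians)

# With the real data (assuming `df` as built earlier):
# aggregate(. ~ Class, data = df, FUN = median)
```

A large gap between the two rows for a given feature is a quick, informal signal that the feature separates the classes.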
plot_ly(data = df, x = ~Cl.thickness, y = ~Cell.size, color = ~Class, type = "scatter", mode = "markers")
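Interpretation: In the Cl.thickness versus Cell.size scatter plot, benign and malignant cases form visibly distinct groups. Because both features take only integer values from 1 to 10, many points overlap, so the plot understates how many observations sit at each position.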
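Since both axes of this scatter plot take only integer scores, many points coincide. One way to see how many observations share each position, sketched here on made-up data (`toy` is illustrative, not the real dataset), is to cross-tabulate the two features with `table()`.

```r
# Made-up stand-in for two of the features (values are illustrative only)
toy <- data.frame(
  Cl.thickness = c(1, 1, 1, 5, 5, 10),
  Cell.size    = c(1, 1, 1, 4, 4, 10)
)

# Count observations at each (Cl.thickness, Cell.size) position
position_counts <- as.data.frame(table(toy$Cl.thickness, toy$Cell.size),
                                 stringsAsFactors = FALSE)
names(position_counts) <- c("Cl.thickness", "Cell.size", "n")

# Keep only positions that actually occur
position_counts <- position_counts[position_counts$n > 0, ]
print(position_counts)
```

For the real data, `dplyr::count(df, Cl.thickness, Cell.size, Class)` would give the same kind of tally split by class, and the counts could then be mapped to marker size.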
The EDA highlights key differences in feature distributions between the malignant and benign classes. These differences can inform machine learning models that classify tumors as benign or malignant. Next steps include feature selection and modeling.