Introduction

This Exploratory Data Analysis (EDA) focuses on the Breast Cancer Wisconsin dataset. The main objective is to understand the structure of the data, detect patterns, and uncover insights using both statistical and graphical approaches.

Load Required Libraries

if (!require("mlbench")) install.packages("mlbench", dependencies = TRUE)
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE)
if (!require("plotly")) install.packages("plotly", dependencies = TRUE)
if (!require("ggstatsplot")) install.packages("ggstatsplot", dependencies = TRUE)

library(mlbench)
library(tidyverse)
library(plotly)
library(ggstatsplot)

Load and Prepare Dataset

data("BreastCancer")
df <- BreastCancer %>%
  select(-Id) %>%
  mutate(across(where(is.character), ~ as.numeric(as.character(.)))) %>%
  mutate(across(-Class, ~ as.numeric(.))) %>%
  mutate(Class = factor(Class)) %>%
  drop_na()

Dataset Overview

summary(df)

##   Cl.thickness      Cell.size        Cell.shape     Marg.adhesion  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##   Epith.c.size     Bare.nuclei      Bl.cromatin     Normal.nucleoli
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     Mitoses            Class    
##  Min.   :1.000   benign   :444  
##  1st Qu.:1.000   malignant:239  
##  Median :1.000                  
##  Mean   :1.583                  
##  3rd Qu.:1.000                  
##  Max.   :9.000

str(df)

## 'data.frame':    683 obs. of  10 variables:
##  $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : num  1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

glimpse(df)

## Rows: 683
## Columns: 10
## $ Cl.thickness    <dbl> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, 4, …
## $ Cell.size       <dbl> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1, 7, 4, 1, 1,…
## $ Cell.shape      <dbl> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1, 1,…
## $ Marg.adhesion   <dbl> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1, 1,…
## $ Epith.c.size    <dbl> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, 7, 6, 2, 2, …
## $ Bare.nuclei     <dbl> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, 3, 9, 1, 1, …
## $ Bl.cromatin     <dbl> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, 5, 4, 2, 3, …
## $ Normal.nucleoli <dbl> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 5, 3, 1, 1, …
## $ Mitoses         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, 1, 1, 1, …
## $ Class           <fct> benign, benign, benign, benign, benign, malignant, ben…

Univariate Analysis

df %>%
  pivot_longer(cols = -Class, names_to = "Feature", values_to = "Value") %>%
  drop_na(Value) %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  facet_wrap(~ Feature, scales = "free") +
  theme_minimal()

Interpretation: Most features are right-skewed and need standardization before any advanced modeling.

Bivariate Analysis

Comparing Features by Class

ggbetweenstats(
  data = df,
  x = Class,
  y = Cl.thickness,
  title = "Comparison of Clump Thickness by Class",
  messages = FALSE
)

df_numeric <- df %>% select(-Class)
cor_matrix <- cor(df_numeric)
ggplotly(
  ggcorrplot::ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower", lab = TRUE)
)

Interpretation: Certain features like Bare.nuclei and Cl.thickness show clear separation between benign and malignant tumors.

Interactive Visualization

plot_ly(data = df, x = ~Cl.thickness, y = ~Cell.size, color = ~Class, type = "scatter", mode = "markers")

Interpretation: Clear clustering can be seen visually based on cell characteristics.

Conclusion

The EDA highlights key differences in features between malignant and benign classes. This insight is helpful for developing machine learning models that classify cancerous tumors. Further steps include modeling and feature selection.

Breast Cancer Wisconsin Dataset EDA

Angelo Cawa

2025-04-12