This exploratory data analysis (EDA) addresses the following questions:

Do malignant tumors have significantly greater clump thickness than benign ones?

Do bare nuclei scores differ significantly between tumor classes?

Are there notable relationships among the different cellular characteristics?

Load the required libraries

library(tidyverse)
library(ggstatsplot)
library(plotly)
library(mlbench)

Load the dataset

data("BreastCancer")
dataset <- BreastCancer

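# Clean the dataset
# Note: as.numeric() on a factor returns its integer level codes, not its labels.
# The codes equal the 1-10 scores only when all ten levels are present; Mitoses
# has no level "9", so a score of 10 is stored as 9 here (see the summary below).
bc_data <- dataset %>%
  mutate(across(.cols = -c(Id, Class), ~ ifelse(. == "?", NA, .))) %>%  # "?" -> NA; factors become integer codes
  mutate(across(.cols = -c(Id, Class), as.numeric)) %>%
  select(-Id) %>%   # drop the sample identifier
  drop_na()         # keep complete cases only (683 of 699 rows)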

# Preview the dataset
head(bc_data)
##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
tail(bc_data)
##     Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 694            3         1          1             1            2           1
## 695            3         1          1             1            3           2
## 696            2         1          1             1            2           1
## 697            5        10         10             3            7           3
## 698            4         8          6             4            3           4
## 699            4         8          8             5            4           5
##     Bl.cromatin Normal.nucleoli Mitoses     Class
## 694           2               1       2    benign
## 695           1               1       1    benign
## 696           1               1       1    benign
## 697           8              10       2 malignant
## 698          10               6       1 malignant
## 699          10               4       1 malignant
str(bc_data)
## 'data.frame':    683 obs. of  10 variables:
##  $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : num  1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
summary(bc_data)
##   Cl.thickness      Cell.size        Cell.shape     Marg.adhesion  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##   Epith.c.size     Bare.nuclei      Bl.cromatin     Normal.nucleoli
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     Mitoses            Class    
##  Min.   :1.000   benign   :444  
##  1st Qu.:1.000   malignant:239  
##  Median :1.000                  
##  Mean   :1.583                  
##  3rd Qu.:1.000                  
##  Max.   :9.000
table(bc_data$Class)
## 
##    benign malignant 
##       444       239
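
The Mitoses maximum of 9 in the summary above is a side effect of converting factors through their level codes. The following is a minimal sketch of the difference, using a toy factor whose level set is assumed to mirror Mitoses (levels 1 to 8 and 10, with no 9); the names `mitoses`, `codes`, and `scores` are illustrative only.

```r
# Toy factor mimicking Mitoses: the level "9" is absent (assumed level set).
mitoses <- factor(c("1", "2", "10", "1"),
                  levels = c("1", "2", "3", "4", "5", "6", "7", "8", "10"))

# as.numeric() on a factor returns the position of each level, not its label.
codes <- as.numeric(mitoses)

# Going through as.character() first parses the labels themselves.
scores <- as.numeric(as.character(mitoses))

print(codes)   # 1 2 9 1
print(scores)  # 1 2 10 1
```

If the label values are wanted in `bc_data`, one option would be to replace the two `mutate()` steps with `mutate(across(-c(Id, Class), ~ as.numeric(as.character(.x))))`. This would change the Mitoses statistics printed above, so the outputs would need to be regenerated.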

Part 1: EDA with ggstatsplot for Statistical Tests (Breast Cancer Wisconsin Dataset)

ggbetweenstats(
  data = bc_data,
  x = Class,
  y = Cl.thickness,
  type = "parametric",
  title = "Clump Thickness by Tumor Class",
  xlab = "Tumor Class",
  ylab = "Clump Thickness",
  plot.type = "box",
  pairwise.comparisons = TRUE
)

Analysis

The boxplot indicates that malignant tumors generally have higher clump thickness compared to benign tumors. The statistical comparison confirms this difference is significant, suggesting clump thickness is a strong indicator of malignancy.
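
For two groups, the parametric comparison reported by ggbetweenstats is a Welch's t-test when `var.equal = FALSE` (its default). The sketch below shows the same kind of test in base R, alongside a rank-based check that may suit ordinal 1-10 scores. The `toy` data frame is invented for illustration and does not contain BreastCancer values.

```r
# Hypothetical 1-10 scores, invented for illustration only.
toy <- data.frame(
  Class = factor(rep(c("benign", "malignant"), each = 6)),
  Cl.thickness = c(1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10)
)

# Welch's t-test: two-sample t-test that does not assume equal variances.
welch <- t.test(Cl.thickness ~ Class, data = toy, var.equal = FALSE)

# Rank-based alternative (Wilcoxon rank-sum / Mann-Whitney).
# exact = FALSE requests the normal approximation, which tolerates ties.
rank_test <- wilcox.test(Cl.thickness ~ Class, data = toy, exact = FALSE)

print(welch$estimate)     # group means
print(welch$p.value)      # Welch p-value
print(rank_test$p.value)  # rank-sum p-value
```

On the real data, the equivalent call would be `t.test(Cl.thickness ~ Class, data = bc_data)`.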

ggbetweenstats(
  data = bc_data,
  x = Class,          
  y = Bare.nuclei,    
  type = "parametric", 
  var.equal = FALSE,   
  title = "Bare Nuclei Count by Tumor Class",
  xlab = "Tumor Class",
  ylab = "Bare Nuclei",
  plot.type = "box",
  pairwise.comparisons = FALSE 
)

Analysis

The boxplot shows a clear separation between benign and malignant tumors in bare nuclei scores. Malignant tumors tend to score higher, and the difference is statistically significant, highlighting this feature’s diagnostic relevance.

# Correlation plots
# Individual correlation
ggstatsplot::ggscatterstats(
  data = bc_data,  
  x = Cl.thickness,
  y = Cell.shape,
  title = "Correlation Between Clump Thickness and Cell Shape",
  xlab = "Clump Thickness",
  ylab = "Cell Shape",
  point.color = "#2E86AB",
  point.alpha = 0.6,
  line.color = "#D55E00",
  marginal = FALSE
)

# Correlation matrix
ggstatsplot::ggcorrmat(
  data = bc_data,
  type = "parametric",
  colors = c("#B3E2CD", "white", "#FDCDAC"),
  title = "Correlation Matrix of Cellular Features",
  subtitle = "Pearson Correlations of Numeric Variables",
  matrix.type = "lower",
  p.adjust.method = "none",
  hc.order = TRUE,
  lab = TRUE
)

Analysis

The scatter plot shows a moderate positive correlation between clump thickness and cell shape, indicating that these two features may increase together. The correlation matrix supports this and reveals additional strong associations among features such as Cell Size, Cell Shape, and Clump Thickness.
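
The coefficients behind a Pearson correlation matrix can be reproduced with base R. The sketch below uses a small invented data frame (`toy`, not BreastCancer values) and also computes Spearman correlations, which may be a reasonable cross-check for ordinal scores.

```r
# Hypothetical scores, invented for illustration only.
toy <- data.frame(
  Cl.thickness = c(1, 3, 4, 5, 8, 10),
  Cell.shape   = c(1, 1, 2, 4, 7, 10),
  Mitoses      = c(1, 1, 1, 2, 1, 3)
)

# Full correlation matrices across all numeric columns.
pearson  <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")

# A single pair, with a significance test and confidence interval.
single <- cor.test(toy$Cl.thickness, toy$Cell.shape, method = "pearson")

print(round(pearson, 2))
print(round(spearman, 2))
print(single$estimate)
```

On the real data, `cor(bc_data[sapply(bc_data, is.numeric)])` would give the matrix of coefficients without the plot.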

Part 2: Interactive Visualizations with plotly

# Interactive scatterplot: Clump Thickness vs. Bare Nuclei
p_bc <- ggplot(bc_data, aes(
  x = Cl.thickness,
  y = Bare.nuclei,
  color = Class,
  text = paste("Class: ", Class, "<br>",
               "Clump Thickness: ", Cl.thickness, "<br>",
               "Bare Nuclei: ", Bare.nuclei, "<br>",
               "Cell Size: ", Cell.size, "<br>",
               "Marginal Adhesion: ", Marg.adhesion)
)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "Interactive: Clump Thickness vs. Bare Nuclei",
       subtitle = "Grouped by Tumor Class",
       x = "Clump Thickness",
       y = "Bare Nuclei Count",
       color = "Tumor Class") +
  theme_minimal()

interactive_plot <- ggplotly(p_bc, tooltip = "text") %>%
  config(displayModeBar = FALSE)  # hide the plotly toolbar

interactive_plot

Analysis

This scatterplot highlights the tendency for malignant tumors to show higher values of both clump thickness and bare nuclei. The interactive format allows for easy inspection of individual observations and patterns.
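
Because both axes take integer scores from 1 to 10, the 683 observations can occupy at most 100 distinct positions, so many points sit on top of one another. The sketch below counts observations per position in base R; the `toy` data frame is invented for illustration.

```r
# Hypothetical scores, invented for illustration only.
toy <- data.frame(
  Cl.thickness = c(1, 1, 1, 5, 5, 10),
  Bare.nuclei  = c(1, 1, 1, 1, 10, 10)
)

# Cross-tabulate the two columns, then keep only occupied positions.
counts <- as.data.frame(table(toy), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

print(counts)  # one row per occupied (Cl.thickness, Bare.nuclei) position
```

In ggplot2, `geom_count()` or `geom_jitter()` in place of `geom_point()` would make such stacked points visible.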

# Additional scatterplot: Cell Size vs Marginal Adhesion
p_scatter <- ggplot(bc_data,
       aes(x = Cell.size, y = Marg.adhesion, color = Class,
           text = paste("Cell Size:", Cell.size, "<br>",
                        "Marginal Adhesion:", Marg.adhesion, "<br>",
                        "Class:", Class))) +
  geom_point(alpha = 0.7) +
  labs(title = "Cell Size vs. Marginal Adhesion by Tumor Type",
       x = "Cell Size",
       y = "Marginal Adhesion") +
  scale_color_manual(values = c("benign" = "#00AFBB", "malignant" = "#E7B800")) +
  theme_minimal()

ggplotly(p_scatter, tooltip = "text") %>%
  config(displayModeBar = FALSE)  # hide the plotly toolbar

Analysis

This plot reveals that malignant tumors tend to have higher cell size and marginal adhesion scores. Although the two classes overlap, these variables may be useful for diagnosis in combination with others.

Conclusion

This EDA demonstrated key differences in cellular characteristics between benign and malignant tumors. Clump thickness and bare nuclei showed statistically significant variation across classes. Correlation analyses revealed strong interrelationships between variables, particularly cell size, shape, and clump thickness. Interactive plots helped in visually identifying important patterns in the data.

The results suggest that several cellular features are associated with tumor malignancy, which supports considering them in diagnostic tools.