Introduction

This EDA will explore the following questions:

1. Is there a relationship between clump thickness and tumor malignancy?
2. Does bare nuclei count differ significantly between benign and malignant tumors?
3. Are there significant correlations between different cellular characteristics?

Load the required libraries

library(tidyverse)    # attaches dplyr, tidyr, purrr, and ggplot2
library(ggstatsplot)
library(plotly)
library(mlbench)      # provides the BreastCancer dataset

Load the dataset

data("BreastCancer")
dataset <- BreastCancer

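# Clean the dataset: the feature columns are stored as factors, so convert
# them through as.character() to get the labelled values rather than the
# internal level codes, then drop the Id column and any incomplete rows
bc_data <- dataset %>%
  mutate(across(.cols = -c(Id, Class), ~ as.numeric(as.character(.x)))) %>%
  select(-Id) %>%
  drop_na()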

# Preview the first few rows
head(bc_data)
##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
tail(bc_data)
##     Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 694            3         1          1             1            2           1
## 695            3         1          1             1            3           2
## 696            2         1          1             1            2           1
## 697            5        10         10             3            7           3
## 698            4         8          6             4            3           4
## 699            4         8          8             5            4           5
##     Bl.cromatin Normal.nucleoli Mitoses     Class
## 694           2               1       2    benign
## 695           1               1       1    benign
## 696           1               1       1    benign
## 697           8              10       2 malignant
## 698          10               6       1 malignant
## 699          10               4       1 malignant
# Check structure
str(bc_data)
## 'data.frame':    683 obs. of  10 variables:
##  $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : num  1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
# Summary statistics of the raw dataset (before cleaning; note the 16 NA's in Bare.nuclei)
summary(dataset)
##       Id             Cl.thickness   Cell.size     Cell.shape  Marg.adhesion
##  Length:699         1      :145   1      :384   1      :353   1      :407  
##  Class :character   5      :130   10     : 67   2      : 59   2      : 58  
##  Mode  :character   3      :108   3      : 52   10     : 58   3      : 58  
##                     4      : 80   2      : 45   3      : 56   10     : 55  
##                     10     : 69   4      : 40   4      : 44   4      : 33  
##                     2      : 50   5      : 30   5      : 34   8      : 25  
##                     (Other):117   (Other): 81   (Other): 95   (Other): 63  
##   Epith.c.size  Bare.nuclei   Bl.cromatin  Normal.nucleoli    Mitoses   
##  2      :386   1      :402   2      :166   1      :443     1      :579  
##  3      : 72   10     :132   3      :165   10     : 61     2      : 35  
##  4      : 48   2      : 30   1      :152   3      : 44     3      : 33  
##  1      : 47   5      : 30   7      : 73   2      : 36     10     : 14  
##  6      : 41   3      : 28   4      : 40   8      : 24     4      : 12  
##  5      : 39   (Other): 61   5      : 34   6      : 22     7      :  9  
##  (Other): 66   NA's   : 16   (Other): 69   (Other): 69     (Other): 17  
##        Class    
##  benign   :458  
##  malignant:241  
##                 
##                 
##                 
##                 
## 
# Check distribution of diagnosis
table(bc_data$Class)
## 
##    benign malignant 
##       444       239
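
Factor columns need care when they are converted to numbers, because as.numeric() applied directly to a factor returns the internal level codes rather than the labelled values. The sketch below uses a made-up factor whose levels skip some values (the values are hypothetical, not taken from the dataset) to show how the two conversions can disagree.

```r
# Toy factor whose levels skip several values (hypothetical, for illustration)
scores <- factor(c("1", "10", "3", "10"), levels = c("1", "2", "3", "10"))

# as.numeric() on a factor returns the position of each level, not its label
codes <- as.numeric(scores)

# Going through as.character() recovers the values written in the labels
values <- as.numeric(as.character(scores))

print(codes)
print(values)
```

Running levels() on a column, for example levels(dataset$Mitoses), shows whether its levels form an unbroken 1 to 10 sequence; if they do not, the two conversions give different results for that column.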

Part 1: EDA with ggstatsplot for Statistical Tests (Breast Cancer Wisconsin Dataset)

ggbetweenstats(
  data = bc_data,
  x = Class,
  y = Cl.thickness,
  type = "parametric",
  title = "Comparison of Clump Thickness by Tumor Type",
  xlab = "Tumor Type",
  ylab = "Clump Thickness",
  plot.type = "box",
  pairwise.comparisons = TRUE
)

This box plot reveals a distinct difference in clump thickness between benign and malignant tumors. On average, malignant tumors exhibit a considerably greater clump thickness (mean ≈ 7.19) than benign ones (mean ≈ 2.96), highlighting a potential diagnostic indicator.

Welch’s t-test indicates that the difference in clump thickness is statistically significant (p < 0.001), so it is unlikely to arise from random variation alone. The effect size is large (Hedges’ g = -2.02; the sign is negative because the benign group, which comes first, has the lower mean), and a strong Bayes Factor favoring the alternative hypothesis further supports this finding. Together, these results point to clump thickness as a useful characteristic for differentiating malignant tumors from benign ones in this dataset.
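
To make the test statistics above more concrete, here is a minimal base R sketch of Welch's t-test and Hedges' g. The two score vectors are made up for illustration, and hedges_g() is a hypothetical helper that uses one common definition (Cohen's d with a pooled standard deviation, times a small-sample correction); ggstatsplot may compute its effect size slightly differently.

```r
# Hypothetical scores for two groups (illustrative only, not the study data)
benign    <- c(1, 2, 3, 4, 5)
malignant <- c(6, 7, 8, 9, 10)

# Welch's t-test: does not assume equal variances in the two groups
welch <- t.test(benign, malignant, var.equal = FALSE)

# Hedges' g: standardized mean difference with a small-sample correction
hedges_g <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  pooled_sd <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  d <- (mean(x) - mean(y)) / pooled_sd          # Cohen's d (first minus second)
  correction <- 1 - 3 / (4 * (n1 + n2 - 2) - 1) # approximate bias correction
  d * correction
}

g <- hedges_g(benign, malignant)
print(welch$statistic)
print(g)
```

Because the difference is taken as first group minus second group, g is negative whenever the first group (here the benign scores) has the lower mean.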

ggbetweenstats(
  data = bc_data,
  x = Class,          
  y = Bare.nuclei,   
  type = "parametric", 
  var.equal = FALSE,   
  title = "Bare Nuclei Count: Benign vs. Malignant Tumors",
  xlab = "Tumor Type",
  ylab = "Bare Nuclei Count",
  plot.type = "boxviolin", # Boxplot + violin
  pairwise.comparisons = FALSE 
)

The violin plot highlights a pronounced contrast in Bare Nuclei Count between benign and malignant tumor samples. Malignant tumors not only show significantly higher counts (mean ≈ 7.63) but also display greater variability, whereas benign tumors are consistently associated with lower, tightly grouped counts (mean ≈ 1.35). This suggests that Bare Nuclei Count may serve as a strong indicator for tumor malignancy.

Welch’s t-test shows that the difference in Bare Nuclei Count between benign and malignant tumors is highly significant (p ≈ 8.55e-89), with a very large effect size (Hedges’ g = -2.66). The Bayesian analysis yields a log Bayes Factor of -379.82. Assuming this is the natural log of BF01 (evidence for the null relative to the alternative), which is how ggstatsplot reports it, the large negative value is strong evidence for a true difference between the two groups. These findings mark Bare Nuclei Count as an important cytological marker of malignancy, possibly more informative than Clump Thickness.
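
A negative log Bayes Factor can look counterintuitive, so the short calculation below converts the quoted value to a more familiar scale. It assumes the reported number is the natural log of BF01 (null over alternative); if the plot caption labels it differently, the sign convention changes accordingly.

```r
# Value quoted in the text; assumed here to be log_e(BF01)
log_bf01 <- -379.82

# BF10 is the reciprocal of BF01, so its log simply flips sign
log_bf10 <- -log_bf01

# Convert from natural log to base 10 for readability
log10_bf10 <- log_bf10 / log(10)

cat(sprintf("BF10 is roughly 10^%.1f in favor of a group difference\n", log10_bf10))
```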

ggbetweenstats(
  data = bc_data,
  x = Class,
  y = Cell.size,
  type = "parametric",
  title = "Comparison of Cell Size by Tumor Type",
  xlab = "Tumor Type",
  ylab = "Cell Size",
  plot.type = "violin",
  pairwise.comparisons = TRUE
)

The violin plot compares Cell Size between benign and malignant tumor samples. Malignant tumors generally exhibit larger cell sizes than benign tumors, which tend to cluster at low values.

Welch’s t-test indicates that this difference is statistically significant; the exact test statistic, effect size, and Bayes Factor are reported in the subtitle and caption of the plot. This supports the idea that cellular morphology is a key factor in distinguishing between tumor types.
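
Per-class means and standard deviations for any feature can be checked directly in base R. The sketch below uses a small made-up data frame that borrows the column names of bc_data; the values are hypothetical and only illustrate the calculation.

```r
# Toy data frame with the same column names as bc_data (values are made up)
toy <- data.frame(
  Class     = factor(c("benign", "benign", "benign",
                       "malignant", "malignant", "malignant")),
  Cell.size = c(1, 1, 2, 6, 8, 10)
)

# Mean and standard deviation of Cell.size within each class
class_means <- tapply(toy$Cell.size, toy$Class, mean)
class_sds   <- tapply(toy$Cell.size, toy$Class, sd)

print(round(class_means, 2))
print(round(class_sds, 2))
```

Replacing toy with bc_data and Cell.size with another column gives the corresponding summaries for the real data.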

library(plotly)

# Reshape to long format
bc_long <- bc_data %>%
  pivot_longer(cols = -Class, names_to = "Feature", values_to = "Value")

# Generate plot list per feature
plot_list <- bc_long %>%
  split(.$Feature) %>%
  map(function(df) {
    plot_ly(data = df,
            x = ~Value,
            color = ~Class,
            colors = c("benign" = "#800080", "malignant" = "#FF69B4"),
            type = "histogram",
            nbinsx = 20,
            name = unique(df$Feature)) %>%
      layout(
        title = list(text = paste("Feature:", unique(df$Feature)), x = 0.5),
        xaxis = list(title = unique(df$Feature)),  # axis titles survive subplot(); per-plot titles do not
        yaxis = list(title = "Count")
      )
  })

# Combine into subplot layout
subplot(
  plot_list,
  nrows = ceiling(length(plot_list)/2),
  shareX = FALSE,
  shareY = FALSE,
  titleX = TRUE,
  titleY = TRUE
) %>%
  layout(title = "Interactive Histograms of Cellular Features by Tumor Class")

The histograms show clear distributional differences in the cellular features between benign tumors (purple bars) and malignant tumors (pink bars). Several features shift noticeably toward higher values in malignant tumors. The degree of overlap between the two histograms for a feature indicates its discriminative power: features with little overlap are likely more effective at distinguishing tumor types, while features with similar distributions in both classes are less useful for classification.

The width of each distribution reflects the heterogeneity within a class: narrow distributions mean the feature is consistent, while broad ones indicate greater variability. In the interactive version of the plot, hovering over a bar shows its count, which allows a more precise comparison of these differences.
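
The idea of histogram overlap as a measure of discriminative power can be made quantitative. The sketch below computes an overlap coefficient for two sets of integer scores: the sum, over all score values, of the smaller of the two class proportions. The function name overlap_coefficient() and the example scores are hypothetical, and this is only one of several possible overlap measures.

```r
# Overlap coefficient for integer scores on a fixed scale:
# 0 means the two histograms never overlap, 1 means they are identical
overlap_coefficient <- function(x, y, scale = 1:10) {
  p_x <- as.numeric(prop.table(table(factor(x, levels = scale))))
  p_y <- as.numeric(prop.table(table(factor(y, levels = scale))))
  sum(pmin(p_x, p_y))
}

# Made-up scores for illustration
benign_scores    <- c(1, 1, 1, 2, 2, 3)
malignant_scores <- c(3, 8, 9, 10, 10, 10)

overlap <- overlap_coefficient(benign_scores, malignant_scores)
print(overlap)
```

Applied per feature to the benign and malignant rows of the real data, lower values would flag the features whose histograms separate the classes most cleanly.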

# Correlation plots
# For individual correlation
ggstatsplot::ggscatterstats(
  data = bc_data,  
  x = Cl.thickness,
  y = Cell.size,
  title = "Correlation Between Clump Thickness and Cell Size",
  xlab = "Clump Thickness",
  ylab = "Cell Size",
  point.color = "#FF69B4",
  point.alpha = 0.5,
  line.color = "#800080",
  marginal = FALSE
)

# For correlation matrix
ggstatsplot::ggcorrmat(
  data = bc_data,
  type = "parametric",  # Pearson's r
  colors = c("#800080", "white", "#FF69B4"),
  title = "Correlation Matrix of Cellular Characteristics",
  subtitle = "Pairwise Pearson correlations",
  matrix.type = "lower",
  p.adjust.method = "none",
  hc.order = TRUE,
  lab = TRUE
)

Analysis (Clump Thickness vs. Cell Size): The scatter plot reveals a statistically significant, moderately strong positive linear correlation between Clump Thickness and Cell Size (r = 0.64, p < 0.001). As Clump Thickness increases, Cell Size also tends to grow larger. This suggests a possible biological connection, where greater cellular density is linked to an increase in cell size. This relationship may offer valuable insights for predictive modeling in tumor classification.

Analysis (Correlation Matrix):

The correlation matrix uncovers several significant positive linear relationships among the cellular features. In particular, strong associations are observed between Cell Size and Cell Shape, as well as between Bare Nuclei and Bland Chromatin. Additionally, both Cell Shape and Cell Size exhibit robust correlations with Marginal Adhesion and Bare Nuclei. Moderate positive relationships link Clump Thickness and Epithelial Cell Size with other features, while Mitoses shows only relatively weak positive connections. These strong inter-correlations suggest shared underlying biological mechanisms, which might imply redundancy in the information some of these features provide for modeling purposes. All the correlations presented are statistically significant, although the p-values are not adjusted for multiple comparisons (p.adjust.method = "none").
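
Strongly correlated pairs can also be listed programmatically instead of being read off the plot. The sketch below uses base R's cor() on a small made-up data frame; the column names, values, and the 0.8 cutoff are all illustrative assumptions.

```r
# Made-up feature values (illustrative only)
toy <- data.frame(
  size    = c(1, 2, 3, 4, 5, 6),
  shape   = c(1, 2, 3, 5, 5, 6),
  mitoses = c(1, 1, 2, 1, 1, 1)
)

r <- cor(toy)  # Pearson correlation matrix

# Use the upper triangle so each pair appears once, then apply the cutoff
idx <- which(upper.tri(r) & abs(r) > 0.8, arr.ind = TRUE)

strong_pairs <- data.frame(
  feature_1 = rownames(r)[idx[, "row"]],
  feature_2 = colnames(r)[idx[, "col"]],
  r         = r[idx]
)
print(strong_pairs)
```

With the real data, passing the numeric columns of bc_data to cor() in place of toy would list the feature pairs most likely to carry redundant information.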

Part 2: Interactive Visualizations with plotly

# Create a ggplot scatter plot with additional tooltip information
p_bc <- ggplot(bc_data, aes(x = Cl.thickness, y = Bare.nuclei, color = Class,
                            text = paste("Class: ", Class, "<br>",
                                         "Clump Thickness: ", Cl.thickness, "<br>",
                                         "Bare Nuclei: ", Bare.nuclei, "<br>",
                                         "Cell Shape: ", Cell.shape, "<br>",
                                         "Epith. Cell Size: ", Epith.c.size))) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "Interactive Plot: Clump Thickness vs. Bare Nuclei",
       subtitle = "Colored by Tumor Class",
       x = "Clump Thickness",
       y = "Bare Nuclei Count",
       color = "Tumor Class") +
  theme_minimal()

# Convert to an interactive Plotly object with custom tooltip and hidden modebar
interactive_plot <- ggplotly(p_bc, tooltip = "text") %>%
  layout(modebar = list(orientation = "h", visible = FALSE))

# Display the interactive plot
interactive_plot

The scatter plot illustrates the relationship between Clump Thickness and Bare Nuclei Count, with points color-coded by Tumor Class (pink for benign and blue for malignant). The plot reveals a tendency for higher Clump Thickness to correspond with a greater Bare Nuclei Count, especially within the malignant tumor class. Benign tumors are predominantly concentrated at lower values for both Clump Thickness and Bare Nuclei Count. While there is some overlap, particularly at lower to moderate values, the higher concentration of blue (malignant) points in the upper-right quadrant suggests a positive correlation between these features and malignancy.

The concentration of the two tumor classes in different regions of the scatter plot suggests that both Clump Thickness and Bare Nuclei Count are useful features for differentiating between benign and malignant tumors. Higher values tend to indicate malignancy, while lower values are more commonly associated with benign tumors, although some malignant tumors also display lower values. The interactive version of the plot allows a more detailed examination of individual points for specific combinations of Clump Thickness and Bare Nuclei Count.
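
Because both features are integer scores, many points in a scatter plot like this sit exactly on top of each other, so the visible dots understate the true density. A simple cross-tabulation shows how many samples share each combination. The data frame below is made up and only borrows the column names of bc_data.

```r
# Toy data with the same column names as bc_data (values are made up)
toy <- data.frame(
  Cl.thickness = c(1, 1, 1, 8, 8, 10),
  Bare.nuclei  = c(1, 1, 1, 10, 10, 10),
  Class        = c("benign", "benign", "benign",
                   "malignant", "malignant", "malignant")
)

# Count the samples at each (thickness, nuclei, class) combination
counts <- as.data.frame(table(toy), responseName = "n")

# Keep only the combinations that actually occur
counts <- counts[counts$n > 0, ]
print(counts)
```

In a static ggplot2 figure, geom_count() or a small amount of jitter are common ways to make such overlapping points visible.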

# Other cell characteristics: Cell Size vs. Cell Shape
p_scatter <- ggplot(bc_data,
       aes(x = Cell.size, y = Cell.shape, color = Class,
           text = paste("Cell Size:", Cell.size, "<br>",
                        "Cell Shape:", Cell.shape, "<br>",
                        "Class:", Class))) +
  geom_point(alpha = 0.7) +
  labs(title = "Cell Size vs. Cell Shape by Tumor Type",
       x = "Cell Size",
       y = "Cell Shape") +
  scale_color_manual(values = c("benign" = "#00BA38", "malignant" = "#F8766D")) +
  theme_minimal()

ggplotly(p_scatter, tooltip = "text") %>%
  layout(modebar = list(visible = FALSE))

The scatter plot highlights the relationship between Cell Size and Cell Shape, with points color-coded by Tumor Type (green for benign and pink for malignant). A noticeable trend emerges: malignant tumors (pink) are mostly linked to higher values for both Cell Size and Cell Shape, whereas benign tumors (green) tend to cluster at lower values for these features. Although some overlap is observed in the lower ranges, the upper-right quadrant is densely populated with malignant cases, suggesting a positive correlation between these features and malignancy.

The clear clustering of tumor types in different areas of the scatter plot suggests that Cell Size and Cell Shape are key indicators for differentiating between benign and malignant tumors. Higher values for both features are strongly indicative of malignancy. While lower values are more frequently associated with benign tumors, they do not rule out the possibility of malignancy entirely. The positive correlation and distinct separation of the clusters emphasize the potential of these features for use in classification models.

Conclusion

This Exploratory Data Analysis of the Breast Cancer Wisconsin dataset has offered valuable insights into the connections between different cellular characteristics and tumor malignancy.

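Clump Thickness differed clearly between the two tumor classes. Malignant tumors had a considerably higher mean clump thickness (≈ 7.19) than benign tumors (≈ 2.96), and Welch’s t-test showed the difference to be statistically significant (p < 0.001) with a large effect size (Hedges’ g = -2.02).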

Similarly, the Bare Nuclei count revealed a more pronounced and statistically significant difference between the two tumor classes. Malignant tumors exhibited notably higher and more variable Bare Nuclei counts, emphasizing the importance of this feature as a key differentiator, potentially even surpassing clump thickness in its diagnostic value.

Analysis of Cell Size by tumor type also revealed a significant difference, with malignant tumors generally exhibiting larger cell sizes compared to benign tumors. This further supports the idea that cellular morphology is a key factor in distinguishing between tumor types.

The interactive histograms of all cellular features, categorized by tumor class, visually confirmed distinct distributional differences. Several features exhibited a noticeable shift towards higher values in malignant tumors, highlighting their potential as effective discriminators. The extent of overlap in the distributions indicates the varying levels of discriminatory power each feature holds.

The correlation analysis uncovered significant positive linear relationships among several cellular characteristics. Strong correlations between features such as Cell Size and Cell Shape, as well as Bare Nuclei and Bland Chromatin, suggest potential underlying biological connections or redundancy. The correlation matrix offers valuable insight into how these features co-vary, which can guide future modeling efforts and feature selection.

The interactive scatter plots examining the relationships between Clump Thickness and Bare Nuclei, as well as Cell Size and Cell Shape, further highlighted the distinction between benign and malignant tumors based on these cellular features. Malignant tumors were predominantly concentrated at higher values for both pairs of characteristics, reinforcing their potential as distinguishing factors.

In summary, this EDA shows that several cellular characteristics, specifically Clump Thickness, Bare Nuclei count, Cell Size, and Cell Shape, are strongly associated with tumor malignancy in the Breast Cancer Wisconsin dataset. These findings highlight the role of these cytological features in differentiating between benign and malignant tumors, as well as their potential use in developing diagnostic or prognostic models. The observed correlations between features should also be taken into account in future analyses, both to avoid problems caused by multicollinearity and to guide feature engineering.
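
One common way to quantify the multicollinearity concern is the variance inflation factor (VIF). As a rough sketch, the VIFs of a set of predictors equal the diagonal of the inverse of their correlation matrix. The helper vif_from_cor() and the data below are hypothetical, and the interpretation thresholds people use for VIF vary by field.

```r
# VIF for each column of a numeric data frame, taken from the inverse
# of the correlation matrix (requires that no column is an exact linear
# combination of the others)
vif_from_cor <- function(x) {
  r <- cor(x)
  setNames(diag(solve(r)), colnames(r))
}

# Made-up feature values (illustrative only)
toy <- data.frame(
  size    = c(1, 2, 3, 4, 5, 6),
  shape   = c(1, 2, 3, 5, 5, 6),
  mitoses = c(1, 1, 2, 1, 1, 1)
)

vifs <- vif_from_cor(toy)
print(round(vifs, 2))
```

A VIF of 1 means a feature is uncorrelated with the others; large values indicate that a feature is nearly a linear combination of the rest, as size and shape are in this toy example.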