Introduction

Introduction to the Exploratory Data Analysis

This document presents an Exploratory Data Analysis (EDA) of the Breast Cancer Wisconsin dataset. The primary goal of this EDA is to gain initial insights into the characteristics of cell nuclei and their potential relationship with tumor malignancy. By examining the distributions of individual features and the relationships among them, we aim to identify key factors that may differentiate benign from malignant tumors. This preliminary investigation will help inform subsequent statistical modeling and feature selection.

This EDA will explore the following questions:

  1. Is there a statistically significant and practically meaningful relationship between clump thickness and the likelihood of a tumor being malignant?

  2. Does the count of bare nuclei exhibit a statistically significant difference in its distribution between benign and malignant tumors?

  3. Are there statistically significant correlations between different quantitative cellular characteristics, such as size, shape, and uniformity?

  4. Is there a significant difference in the frequency of mitosis observed in benign versus malignant tumor samples?

  5. Do features related to cell shape and uniformity, such as Cell Shape and Uniformity of Cell Shape, show distinct patterns or distributions when comparing benign and malignant tumors?

Load the required libraries

library(tidyverse)
library(ggstatsplot)
library(plotly)
library(mlbench)
library(dplyr)
library(reshape2) 

Load the dataset

data("BreastCancer")
dataset <- BreastCancer

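# Clean the dataset: the feature columns are stored as factors, so convert
# them through as.character() to use the level labels rather than the
# internal integer codes, then drop the Id column and any incomplete rows
bc_data <- dataset %>%
  mutate(across(.cols = -c(Id, Class), ~ as.numeric(as.character(.x)))) %>%
  select(-Id) %>%
  drop_na()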

# Preview the first few rows
head(bc_data)
##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
tail(bc_data)
##     Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 694            3         1          1             1            2           1
## 695            3         1          1             1            3           2
## 696            2         1          1             1            2           1
## 697            5        10         10             3            7           3
## 698            4         8          6             4            3           4
## 699            4         8          8             5            4           5
##     Bl.cromatin Normal.nucleoli Mitoses     Class
## 694           2               1       2    benign
## 695           1               1       1    benign
## 696           1               1       1    benign
## 697           8              10       2 malignant
## 698          10               6       1 malignant
## 699          10               4       1 malignant
# Check structure
str(bc_data)
## 'data.frame':    683 obs. of  10 variables:
##  $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : num  1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
# Summary statistics of the raw dataset (before cleaning, so NA's are still listed)
summary(dataset)
##       Id             Cl.thickness   Cell.size     Cell.shape  Marg.adhesion
##  Length:699         1      :145   1      :384   1      :353   1      :407  
##  Class :character   5      :130   10     : 67   2      : 59   2      : 58  
##  Mode  :character   3      :108   3      : 52   10     : 58   3      : 58  
##                     4      : 80   2      : 45   3      : 56   10     : 55  
##                     10     : 69   4      : 40   4      : 44   4      : 33  
##                     2      : 50   5      : 30   5      : 34   8      : 25  
##                     (Other):117   (Other): 81   (Other): 95   (Other): 63  
##   Epith.c.size  Bare.nuclei   Bl.cromatin  Normal.nucleoli    Mitoses   
##  2      :386   1      :402   2      :166   1      :443     1      :579  
##  3      : 72   10     :132   3      :165   10     : 61     2      : 35  
##  4      : 48   2      : 30   1      :152   3      : 44     3      : 33  
##  1      : 47   5      : 30   7      : 73   2      : 36     10     : 14  
##  6      : 41   3      : 28   4      : 40   8      : 24     4      : 12  
##  5      : 39   (Other): 61   5      : 34   6      : 22     7      :  9  
##  (Other): 66   NA's   : 16   (Other): 69   (Other): 69     (Other): 17  
##        Class    
##  benign   :458  
##  malignant:241  
##                 
##                 
##                 
##                 
## 
# Check distribution of diagnosis
table(bc_data$Class)
## 
##    benign malignant 
##       444       239
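The cleaning step drops incomplete rows, which is why the row count falls from 699 in the raw data to 683 in `bc_data`. As a minimal sketch, the location of the missing values can be checked before they are dropped. `count_missing` is a hypothetical helper name introduced here for illustration, not a function from any package.

```r
# Hypothetical helper: number of NA values in each column of a data frame.
count_missing <- function(df) {
  vapply(df, function(col) sum(is.na(col)), integer(1))
}

# Usage with the raw data, if it has been loaded as in the steps above.
if (exists("dataset") && is.data.frame(dataset)) {
  print(count_missing(dataset))
  # Number of rows that drop_na() would remove
  print(sum(!complete.cases(dataset)))
}
```

Checking this first makes it easier to judge whether dropping rows is acceptable or whether a single feature accounts for most of the loss.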

Part 1: EDA with ggstatsplot for Statistical Tests (Breast Cancer Wisconsin Dataset)

ggbetweenstats(
  data = bc_data,
  x = Class,
  y = Cl.thickness,
  type = "parametric",
  title = "Comparison of Clump Thickness by Tumor Type",
  xlab = "Tumor Type",
  ylab = "Clump Thickness",
  plot.type = "box",
  pairwise.comparisons = TRUE
)

This box plot illustrates a clear difference in clump thickness between benign and malignant tumors. The average clump thickness for malignant tumors is markedly higher (mean ≈ 7.19) than for benign tumors (mean ≈ 2.96).

Welch’s t-test indicates that this difference is statistically significant (p < 0.001), with a large effect size (Hedges’ g = -2.02; the sign is negative because benign, the group with the lower mean, is the first factor level) and a Bayes factor that strongly favors the alternative hypothesis. This supports the clinical relevance of clump thickness as a distinguishing feature for tumor malignancy.
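The test statistic and effect size reported in the plot subtitle can be approximated in base R. The sketch below is an illustration under stated assumptions, not the exact computation ggstatsplot performs: `welch_with_g` is a hypothetical helper, and it standardizes the mean difference by the root mean of the two group variances, so its value may differ slightly from the one shown in the figure.

```r
# Hypothetical helper: Welch's t-test plus a bias-corrected standardized
# mean difference (Hedges' g) for two numeric vectors x and y.
welch_with_g <- function(x, y) {
  test <- t.test(x, y, var.equal = FALSE)  # Welch's unequal-variance t-test
  # Assumption: standardize by the root mean of the two group variances.
  sd_avg <- sqrt((var(x) + var(y)) / 2)
  d <- (mean(x) - mean(y)) / sd_avg
  # Small-sample correction factor (Hedges' approximation)
  j <- 1 - 3 / (4 * (length(x) + length(y) - 2) - 1)
  list(
    t = unname(test$statistic),
    df = unname(test$parameter),
    p = test$p.value,
    g = d * j
  )
}

# Usage with the cleaned data, if it exists in the session.
if (exists("bc_data") && is.data.frame(bc_data)) {
  print(with(bc_data, welch_with_g(
    Cl.thickness[Class == "benign"],
    Cl.thickness[Class == "malignant"]
  )))
}
```

Passing the benign group first reproduces the sign convention of the plot, where a negative g means the benign mean is lower.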

ggbetweenstats(
  data = bc_data,
  x = Class,          
  y = Bare.nuclei,   
  type = "parametric", 
  var.equal = FALSE,   
  title = "Bare Nuclei Count: Benign vs. Malignant Tumors",
  xlab = "Tumor Type",
  ylab = "Bare Nuclei Count",
  plot.type = "boxviolin", # Boxplot + violin
  pairwise.comparisons = FALSE 
)

This combined box and violin plot reveals a striking difference in Bare Nuclei Count between benign and malignant tumor samples. Malignant tumors exhibit a much higher and more variable count (mean ≈ 7.63), while benign tumors remain tightly clustered at low counts (mean ≈ 1.35).

Welch’s t-test indicates that this difference is statistically significant (p ≈ 8.55e-89) with an exceptionally large effect size (Hedges’ g = -2.66). The Bayesian result points the same way: the log Bayes factor of -379.82 is reported in favor of the null hypothesis, so a strongly negative value is overwhelming evidence for a difference between the groups. This result highlights Bare Nuclei Count as a critical cytological feature in identifying malignant tumors; its effect size is larger in magnitude than that of Clump Thickness (-2.66 versus -2.02).
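Because the features are scored on a 1 to 10 scale, a rank-based comparison is a reasonable complement to the parametric test. The following is a minimal sketch using base R's `wilcox.test`; `rank_compare` is a hypothetical helper name, and this check is an addition for illustration rather than part of the original analysis.

```r
# Hypothetical helper: Wilcoxon rank-sum (Mann-Whitney) test for a numeric
# score across exactly two groups, plus the group medians.
rank_compare <- function(values, groups) {
  groups <- droplevels(as.factor(groups))
  stopifnot(nlevels(groups) == 2)
  # exact = FALSE: the scores contain many ties, so use the normal approximation
  test <- wilcox.test(values ~ groups, exact = FALSE)
  list(
    W = unname(test$statistic),
    p = test$p.value,
    medians = tapply(values, groups, median)
  )
}

# Usage with the cleaned data, if it exists in the session.
if (exists("bc_data") && is.data.frame(bc_data)) {
  print(rank_compare(bc_data$Bare.nuclei, bc_data$Class))
}
```

If the rank-based test and the Welch test agree, the conclusion does not hinge on treating the scores as interval-scaled.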

ggbetweenstats(
  data = bc_data,
  x = Class,
  y = Cell.size,
  type = "parametric",
  title = "Comparison of Cell Size by Tumor Type",
  xlab = "Tumor Type",
  ylab = "Cell Size",
  plot.type = "violin",
  pairwise.comparisons = TRUE
)

This violin plot compares Cell Size between benign and malignant tumor samples. Malignant tumors tend to have larger cell sizes than benign tumors, and the difference is statistically significant.

library(plotly)

# Reshape to long format
bc_long <- bc_data %>%
  pivot_longer(cols = -Class, names_to = "Feature", values_to = "Value")

# Generate plot list per feature
plot_list <- bc_long %>%
  split(.$Feature) %>%
  map(function(df) {
    plot_ly(data = df,
            x = ~Value,
            color = ~Class,
            colors = c("benign" = "#00BA38", "malignant" = "#F8766D"),
            type = "histogram",
            nbinsx = 20,
            name = unique(df$Feature)) %>%
      layout(
        title = list(text = paste("Feature:", unique(df$Feature)), x = 0.5),
        xaxis = list(title = ""),
        yaxis = list(title = "Count")
      )
  })

# Combine into subplot layout
subplot(
  plot_list,
  nrows = ceiling(length(plot_list)/2),
  shareX = FALSE,
  shareY = FALSE,
  titleX = TRUE,
  titleY = TRUE
) %>%
  layout(title = "Interactive Histograms of Cellular Features by Tumor Class")

The histograms show how each cellular feature is distributed within the two tumor classes (green bars for benign, pink bars for malignant). For several features, the malignant distribution is shifted toward higher scores, while benign samples concentrate at the low end of the scale. The degree of overlap between the two histograms indicates how well a feature discriminates between the classes: features with minimal overlap are likely the stronger discriminators. The spread and shape of each histogram show how variable that characteristic is within each class.

Comparing each pair of histograms identifies the features with the most pronounced distributional differences. If one class sits predominantly at high scores while the other is concentrated at low scores, that feature is likely important in distinguishing the two tumor types. Narrower distributions suggest more consistent feature values within a class, while wider distributions indicate greater variability. Features with similar distributions across both classes may have limited utility in classification. The interactive version of the figure supports a closer look at these differences, for example by hovering over individual bars to read their counts.
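The visual notion of overlap between the two class histograms can be turned into a number. One simple option, sketched below, is the overlap coefficient: the sum over all scores of the smaller of the two class proportions. `overlap_coef` is a hypothetical helper name, and the choice of this particular measure is an assumption made for illustration.

```r
# Hypothetical helper: overlap coefficient between the score distributions
# of two groups. 0 means no shared scores, 1 means identical distributions.
overlap_coef <- function(values, groups) {
  groups <- droplevels(as.factor(groups))
  stopifnot(nlevels(groups) == 2)
  # Proportion of each class at every observed score
  props <- prop.table(table(groups, values), margin = 1)
  # Shared probability mass across all scores
  sum(pmin(props[1, ], props[2, ]))
}

# Usage with the cleaned data, if it exists in the session.
if (exists("bc_data") && is.data.frame(bc_data)) {
  features <- setdiff(names(bc_data), "Class")
  overlaps <- vapply(
    features,
    function(f) overlap_coef(bc_data[[f]], bc_data$Class),
    numeric(1)
  )
  print(sort(overlaps))  # smallest overlap (strongest separation) first
}
```

Sorting the result gives a rough ranking of features by how cleanly they separate the classes, which can be compared against the impression from the histograms.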

# Correlation plots
# For individual correlation
ggstatsplot::ggscatterstats(
  data = bc_data,
  x = Cl.thickness,
  y = Cell.size,
  title = "Correlation Between Clump Thickness and Cell Size",
  xlab = "Clump Thickness",
  ylab = "Cell Size",
  # Recent ggstatsplot releases take point and line styling as lists
  point.args = list(color = "#0072B2", alpha = 0.5),
  smooth.line.args = list(color = "#D55E00", method = "lm", formula = y ~ x),
  marginal = FALSE
)

# For correlation matrix
ggstatsplot::ggcorrmat(
  data = bc_data,
  type = "parametric",  # Pearson's r
  colors = c("#6D9EC1", "white", "#E46726"),
  title = "Correlation Matrix of Cellular Characteristics",
  subtitle = "Pairwise Pearson correlations",
  matrix.type = "lower",
  p.adjust.method = "none",
  hc.order = TRUE,
  lab = TRUE
)

Analysis (Clump Thickness vs. Cell Size): The scatter plot shows a statistically significant, moderately strong positive linear relationship between Clump Thickness and Cell Size (r = 0.64, p < 0.001). Higher Clump Thickness tends to be associated with larger Cell Size. One possible interpretation is that increased cellular crowding goes together with cell enlargement, although a correlation alone does not establish this. The relationship is also relevant for predictive modeling, since the two features carry partly overlapping information.

Analysis (Correlation Matrix):

The correlation matrix reveals significant positive linear relationships among several cellular features. Strong correlations exist between Cell Size and Cell Shape, and between Bare Nuclei and Bland Chromatin. Cell Shape and Cell Size also correlate strongly with Marginal Adhesion and Bare Nuclei. Clump Thickness and Epithelial Cell Size show moderate positive correlations with the other features, and Mitoses shows weaker positive correlations. The strong inter-correlations among certain features point to potential shared biological underpinnings and suggest that these features may provide redundant information for modeling. All displayed correlations are statistically significant; note that no multiple-comparison adjustment was applied (p.adjust.method = "none").
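To make the reading of the matrix reproducible, the strongly correlated pairs can also be listed as a table. The sketch below uses base R's `cor`; `top_correlations` is a hypothetical helper name and the 0.7 cutoff is an arbitrary illustrative threshold, not a value taken from the analysis.

```r
# Hypothetical helper: list feature pairs whose absolute Pearson correlation
# is at least `threshold`, strongest first. Non-numeric columns are ignored.
top_correlations <- function(df, threshold = 0.7) {
  num <- df[vapply(df, is.numeric, logical(1))]
  cm <- cor(num, method = "pearson")
  # Keep each pair once by looking only at the upper triangle
  idx <- which(upper.tri(cm) & abs(cm) >= threshold, arr.ind = TRUE)
  out <- data.frame(
    feature_1 = rownames(cm)[idx[, "row"]],
    feature_2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), , drop = FALSE]
}

# Usage with the cleaned data, if it exists in the session.
if (exists("bc_data") && is.data.frame(bc_data)) {
  print(top_correlations(bc_data, threshold = 0.7))
}
```

A table like this is a convenient starting point for deciding which features to examine for redundancy before modeling.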

Part 2: Interactive Visualizations with plotly

# Create a ggplot scatter plot with additional tooltip information
p_bc <- ggplot(bc_data, aes(x = Cl.thickness, y = Bare.nuclei, color = Class,
                            text = paste("Class: ", Class, "<br>",
                                         "Clump Thickness: ", Cl.thickness, "<br>",
                                         "Bare Nuclei: ", Bare.nuclei, "<br>",
                                         "Cell Shape: ", Cell.shape, "<br>",
                                         "Epith. Cell Size: ", Epith.c.size))) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "Interactive Plot: Clump Thickness vs. Bare Nuclei",
       subtitle = "Colored by Tumor Class",
       x = "Clump Thickness",
       y = "Bare Nuclei Count",
       color = "Tumor Class") +
  theme_minimal()

# Convert to an interactive Plotly object with custom tooltip and hidden modebar
interactive_plot <- ggplotly(p_bc, tooltip = "text") %>%
  plotly::config(displayModeBar = FALSE)

# Display the interactive plot
interactive_plot

The scatter plot displays the relationship between Clump Thickness and Bare Nuclei Count, with points colored by Tumor Class (pink for benign, blue for malignant). The plot indicates a tendency for higher Clump Thickness to be associated with a greater Bare Nuclei Count, particularly within the malignant tumor class. Benign tumors appear to be concentrated at lower values for both Clump Thickness and Bare Nuclei Count. There is some overlap, especially at lower to moderate values, but the higher density of blue (malignant) points in the upper-right quadrant suggests a positive association between these two features and malignancy.

The visual separation of the two tumor classes across the scatter plot suggests that both Clump Thickness and Bare Nuclei Count are potentially useful features for distinguishing between benign and malignant tumors. Higher values of these features appear more indicative of malignancy. While lower values are more frequently associated with benign tumors, some malignant cases also present with lower values. In the interactive version, hovering over a point shows its class, clump thickness, bare nuclei count, cell shape, and epithelial cell size, which helps when inspecting specific combinations of Clump Thickness and Bare Nuclei Count.
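Because both features take whole-number scores from 1 to 10, many samples share exactly the same coordinates and are drawn on top of each other in the scatter plot. A minimal sketch for counting the samples at each combination is shown below; `combo_counts` is a hypothetical helper name introduced for illustration.

```r
# Hypothetical helper: count samples at each observed (x, y, group)
# combination, most frequent first. Column names are passed as strings.
combo_counts <- function(df, x, y, group) {
  counts <- as.data.frame(
    table(df[[x]], df[[y]], df[[group]]),
    stringsAsFactors = FALSE
  )
  names(counts) <- c(x, y, group, "n")
  counts <- counts[counts$n > 0, , drop = FALSE]  # drop empty combinations
  counts[order(-counts$n), , drop = FALSE]
}

# Usage with the cleaned data, if it exists in the session.
if (exists("bc_data") && is.data.frame(bc_data)) {
  print(head(combo_counts(bc_data, "Cl.thickness", "Bare.nuclei", "Class"), 10))
}
```

The counts could also be mapped to point size, or the points jittered, so that dense combinations are visible in the plot itself.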

# For other cell characteristics (Cell Size and Cell Shape)
p_scatter <- ggplot(bc_data,
       aes(x = Cell.size, y = Cell.shape, color = Class,
           text = paste("Cell Size:", Cell.size, "<br>",
                        "Cell Shape:", Cell.shape, "<br>",
                        "Class:", Class))) +
  geom_point(alpha = 0.7) +
  labs(title = "Cell Size vs. Cell Shape by Tumor Type",
       x = "Cell Size",
       y = "Cell Shape") +
  scale_color_manual(values = c("benign" = "#00BA38", "malignant" = "#F8766D")) +
  theme_minimal()

ggplotly(p_scatter, tooltip = "text") %>%
  plotly::config(displayModeBar = FALSE)

The scatter plot illustrates the relationship between Cell Size and Cell Shape, with points colored by Tumor Type (green for benign, pink for malignant). A clear trend emerges: malignant tumors (pink) are predominantly associated with higher values of both Cell Size and Cell Shape. Benign tumors (green), in contrast, are largely clustered at lower values for both characteristics. While some overlap exists in the lower ranges, the upper-right quadrant of the plot is heavily populated by malignant cases, indicating a positive correlation between these features and malignancy.

The distinct clustering of tumor types in different regions of the scatter plot suggests that Cell Size and Cell Shape are strong indicators for distinguishing between benign and malignant tumors. Higher values for both features are strongly suggestive of malignancy. Although lower values are more common in benign cases, they do not entirely exclude the possibility of malignancy. This positive association and the separation of clusters highlight the potential of these features for classification models.

Conclusion

This Exploratory Data Analysis of the Breast Cancer Wisconsin dataset has provided valuable insights into the relationships between various cellular characteristics and tumor malignancy.

The comparison of Clump Thickness by tumor type revealed a statistically significant and substantial difference, with malignant tumors showing a higher average clump thickness than benign tumors. This suggests that clump thickness is a strong indicator of malignancy.

Similarly, the Bare Nuclei count showed an even more pronounced and statistically significant difference between the two tumor classes. Malignant tumors displayed considerably higher and more variable bare nuclei counts, highlighting this feature as a critical distinguishing factor, potentially even more influential than clump thickness.

Examination of Cell Size by tumor type also demonstrated a significant difference, with malignant tumors tending to have larger cell sizes compared to benign ones. This further reinforces the notion that cellular morphology plays a crucial role in differentiating tumor types.

The interactive histograms of all cellular features, categorized by tumor class, visually confirmed distributional differences. Several features showed a clear shift towards higher values in malignant tumors, indicating their potential as discriminators. The degree of overlap in the distributions suggests varying levels of discriminatory power for each feature.

Correlation analysis revealed significant positive linear associations among several cellular characteristics. Strong correlations between features like Cell Size and Cell Shape, and Bare Nuclei and Bland Chromatin, suggest potential underlying biological connections or redundancy. The correlation matrix provides a crucial understanding of how these features co-vary, which can inform subsequent modeling efforts and feature selection.

Interactive scatter plots exploring the relationships between Clump Thickness and Bare Nuclei, and Cell Size and Cell Shape, further illustrated the separation between benign and malignant tumors based on these cellular features. Malignant tumors tended to cluster at higher values for these pairs of characteristics.

In summary, this EDA strongly suggests that several cellular characteristics, particularly Clump Thickness, Bare Nuclei count, Cell Size, and Cell Shape, are significantly associated with tumor malignancy in the Breast Cancer Wisconsin dataset. These findings underscore the importance of these cytological features in distinguishing between benign and malignant tumors and their potential utility in developing diagnostic or prognostic models. The identified correlations between features also warrant consideration in future analyses to avoid multicollinearity and to potentially inform feature engineering.
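As a possible next step on the multicollinearity point, variance inflation factors (VIFs) can be computed with base R alone. The sketch below regresses each numeric feature on the others and applies VIF = 1 / (1 - R²). `vif_table` is a hypothetical helper name, and this check is a suggestion for follow-up work rather than part of the analysis above.

```r
# Hypothetical helper: variance inflation factor for every numeric column,
# obtained by regressing each column on all the other numeric columns.
vif_table <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  vifs <- vapply(names(num), function(target) {
    others <- setdiff(names(num), target)
    fit <- lm(reformulate(others, response = target), data = num)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
  sort(vifs, decreasing = TRUE)
}

# Usage with the cleaned data, if it exists in the session.
if (exists("bc_data") && is.data.frame(bc_data)) {
  print(round(vif_table(bc_data), 2))
}
```

Features with the largest values are the ones most fully explained by the rest, and so are natural candidates for removal or combination during feature selection.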