Introduction

This EDA will explore the following questions:

  1. Is there a difference in Clump Thickness between benign and malignant tumors?
  2. Is there a correlation between Clump Thickness and Cell Shape by tumor type?
  3. Is there a correlation between Cell Size and Cell Shape by tumor type?
  4. Is there a correlation between Clump Thickness and Cell Size?

Load Necessary Libraries

library(mlbench)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggstatsplot)
## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
data("BreastCancer")

Breast_Cancer <- BreastCancer %>%
  mutate(across(.cols = -c(Id, Class), ~ifelse(. == "?", NA, .))) %>%
  mutate(across(.cols = -c(Id, Class), as.numeric)) %>%
  select(-Id) %>%
  drop_na()

# Check the structure of the dataset
str(Breast_Cancer)
## 'data.frame':    683 obs. of  10 variables:
##  $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : num  1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
#Check the dimensions and column names
dim(Breast_Cancer)
## [1] 683  10
names(Breast_Cancer)
##  [1] "Cl.thickness"    "Cell.size"       "Cell.shape"      "Marg.adhesion"  
##  [5] "Epith.c.size"    "Bare.nuclei"     "Bl.cromatin"     "Normal.nucleoli"
##  [9] "Mitoses"         "Class"
#Check for missing values
colSums(is.na(Breast_Cancer))
##    Cl.thickness       Cell.size      Cell.shape   Marg.adhesion    Epith.c.size 
##               0               0               0               0               0 
##     Bare.nuclei     Bl.cromatin Normal.nucleoli         Mitoses           Class 
##               0               0               0               0               0
table(Breast_Cancer$Class)
## 
##    benign malignant 
##       444       239
#Summary statistics
summary(Breast_Cancer)
##   Cl.thickness      Cell.size        Cell.shape     Marg.adhesion  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##   Epith.c.size     Bare.nuclei      Bl.cromatin     Normal.nucleoli
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     Mitoses            Class    
##  Min.   :1.000   benign   :444  
##  1st Qu.:1.000   malignant:239  
##  Median :1.000                  
##  Mean   :1.583                  
##  3rd Qu.:1.000                  
##  Max.   :9.000
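
One detail of the cleaning step is worth checking. The measurement columns in BreastCancer are stored as factors, and both ifelse() and as.numeric() work on a factor's internal integer codes rather than on its labels. The two agree only when the labels are the consecutive integers starting at 1. A maximum of 9 for Mitoses in the summary above would be consistent with a skipped label, though that is an assumption to confirm with levels(BreastCancer$Mitoses). A minimal sketch on a made-up factor (hypothetical values, not taken from BreastCancer) shows the difference:

```r
# Toy factor whose labels skip values (hypothetical, for illustration only)
f <- factor(c("1", "8", "10", NA), levels = c("1", "8", "10"))

# as.numeric() on a factor returns the internal codes, not the labels
codes <- as.numeric(f)

# Going through as.character() recovers the values written in the labels
values <- as.numeric(as.character(f))

# ifelse() drops the factor class, so it also hands back the codes
via_ifelse <- ifelse(f == "?", NA, f)

print(data.frame(label = as.character(f), codes, values, via_ifelse))
```

If the levels turn out not to be consecutive, converting with as.numeric(as.character(.x)) inside across() would keep the original scores.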

Part 1: EDA with ggstatsplot for Statistical Tests

ggbetweenstats(
  data = Breast_Cancer, x = Class, y = Cl.thickness, type = "parametric",
  title = "Clump Thickness by Tumor Class", xlab = "Tumor Class", ylab = "Clump Thickness"
)

Analysis

Welch’s t-test shows that malignant tumors have a substantially higher mean Clump Thickness (mean = 7.19) than benign tumors (mean = 2.96), with a t-value of -23.93 and a p-value of 1.66e-76. A p-value this small means that a difference of this size would be extremely unlikely if the two classes had the same mean, so Clump Thickness is a useful feature for distinguishing benign from malignant tumors in this dataset.
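
The statistic reported by ggbetweenstats() with type = "parametric" can be cross-checked with base R's t.test(), which applies the Welch correction by default (var.equal = FALSE). The sketch below uses a small made-up data frame, toy, so that it runs on its own; the values are hypothetical and only the column names mirror Breast_Cancer.

```r
# Hypothetical toy scores (not from BreastCancer); column names mirror Breast_Cancer
toy <- data.frame(
  Class = factor(rep(c("benign", "malignant"), each = 5)),
  Cl.thickness = c(1, 2, 3, 3, 4, 6, 7, 8, 9, 10)
)

# Welch's two-sample t-test: the variances are not assumed to be equal
welch <- t.test(Cl.thickness ~ Class, data = toy)

welch$method     # name of the test that was run
welch$estimate   # group means, in factor-level order (benign, then malignant)
welch$statistic  # t-value; negative when the first level has the lower mean
welch$p.value
```

With the real data, the same call with data = Breast_Cancer should give the means and t-value quoted above. The t-value is negative because benign is the first factor level and has the lower mean.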

ggbetweenstats(
  data = Breast_Cancer, x = Class, y = Bare.nuclei, type = "nonparametric",
  title = "Bare Nuclei by Tumor Class", xlab = "Tumor Class", ylab = "Bare Nuclei"
)

Analysis

The Wilcoxon-Mann-Whitney test indicates a significant difference in Bare Nuclei between benign and malignant tumors. Malignant tumors have much higher Bare Nuclei scores, making this a helpful feature for distinguishing between tumor types.
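
The nonparametric comparison can be reproduced in base R with wilcox.test(), and group medians give a sense of the direction of the difference. This is a minimal sketch on a made-up data frame, toy, with hypothetical values. exact = FALSE is passed because scores on a 1 to 10 scale contain many ties, for which an exact p-value cannot be computed.

```r
# Hypothetical toy scores (not from BreastCancer); column names mirror Breast_Cancer
toy <- data.frame(
  Class = factor(rep(c("benign", "malignant"), each = 6)),
  Bare.nuclei = c(1, 1, 1, 2, 1, 3, 10, 8, 10, 5, 10, 7)
)

# Wilcoxon rank-sum (Mann-Whitney) test using the normal approximation
mw <- wilcox.test(Bare.nuclei ~ Class, data = toy, exact = FALSE)

# Median score per class, to see which group scores higher
medians <- tapply(toy$Bare.nuclei, toy$Class, median)

mw$p.value
medians
```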

# Visualize the correlation between clump thickness and cell size
ggscatterstats(
  data = Breast_Cancer, x = Cl.thickness, y = Cell.size,
  title = "Correlation Between Clump Thickness and Cell Size",
  xlab = "Clump Thickness", ylab = "Cell Size"
)
## Registered S3 method overwritten by 'ggside':
##   method from   
##   +.gg   ggplot2
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.

Analysis

This plot shows the relationship between Clump Thickness and Cell Size across all tumors in the dataset. The correlation is positive and moderately strong, with a Pearson correlation coefficient of 0.64, which means that as clump thickness increases, cell size tends to increase too.
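
The coefficient can be checked directly with cor.test(). Because both variables are scores on a 1 to 10 scale, a rank-based (Spearman) coefficient is a reasonable companion to Pearson's r; that comparison is a suggestion, not part of the original analysis. The sketch uses a made-up data frame, toy, with hypothetical values.

```r
# Hypothetical toy scores (not from BreastCancer); column names mirror Breast_Cancer
toy <- data.frame(
  Cl.thickness = c(1, 2, 3, 4, 5, 6, 8, 10),
  Cell.size    = c(1, 1, 2, 1, 4, 3, 8, 10)
)

# Pearson correlation with its significance test
pearson <- cor.test(toy$Cl.thickness, toy$Cell.size, method = "pearson")

# Spearman rank correlation; exact = FALSE because the scores contain ties
spearman <- cor.test(toy$Cl.thickness, toy$Cell.size,
                     method = "spearman", exact = FALSE)

pearson$estimate
spearman$estimate
```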

Part 2: Creating Interactive Plots

plot1 <- ggbetweenstats(
  data = Breast_Cancer, x = Class, y = Cl.thickness, type = "parametric",
  title = "Clump Thickness by Tumor Class", xlab = "Tumor Class", ylab = "Clump Thickness"
)

#Convert the ggplot2 object to an interactive plotly object, specifying the tooltip
plot_1 <- ggplotly(plot1, tooltip = "text") %>%
  layout(modebar = list(visible = FALSE))
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomLabelRepel() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues
#Display
plot_1

Analysis

Clump Thickness is a good indicator of whether a tumor is benign or malignant in this dataset. Benign tumors generally have lower Clump Thickness scores, while malignant tumors tend to have much higher ones. The plot shows the distribution of this characteristic for each tumor type and how the two distributions differ.

plot2 <- ggplot(Breast_Cancer, aes(x = Cl.thickness, y = Cell.shape)) +
  geom_point(aes(
    color = Class,
    text = paste("Clump Thickness:", Cl.thickness, "<br>",
                 "Cell Shape:", Cell.shape, "<br>",
                 "Class:", Class)
  )) +
  labs(
    title = "Correlation Between Clump Thickness and Cell Shape by Tumor Type",
    x = "Clump Thickness",
    y = "Cell Shape",
    color = "Tumor Type"
  ) +
  theme_minimal()
## Warning in geom_point(aes(color = Class, text = paste("Clump Thickness:", :
## Ignoring unknown aesthetics: text
#Convert the ggplot2 object to an interactive plotly object, specifying the tooltip
plot_2 <- ggplotly(plot2, tooltip = "text") %>%
  layout(modebar = list(visible = TRUE))

#Display
plot_2

Analysis

The plot shows that higher Clump Thickness and higher Cell Shape scores both go along with malignancy: as the two characteristics increase, the chances of the tumor being malignant appear to go up. However, the overlap between the classes suggests that a diagnosis likely needs to consider more than just these two factors.
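
Since the question is about correlation by tumor type, it can help to compute the coefficient within each class as well as for the pooled data. A pooled correlation can be driven largely by the gap between the classes. This is a minimal sketch on a made-up data frame, toy, with hypothetical values.

```r
# Hypothetical toy scores (not from BreastCancer); column names mirror Breast_Cancer
toy <- data.frame(
  Class        = factor(rep(c("benign", "malignant"), each = 5)),
  Cl.thickness = c(1, 2, 3, 4, 5, 5, 6, 8, 9, 10),
  Cell.shape   = c(1, 1, 2, 1, 2, 4, 7, 6, 9, 10)
)

# Pearson correlation within each tumor class
by_class <- sapply(
  split(toy, toy$Class),
  function(d) cor(d$Cl.thickness, d$Cell.shape)
)

# Pearson correlation ignoring class
overall <- cor(toy$Cl.thickness, toy$Cell.shape)

by_class
overall
```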

plot3 <- ggplot(Breast_Cancer, aes(x = Cl.thickness, y = Cell.size)) +
  geom_point(aes(
    text = paste("Clump Thickness:", Cl.thickness, "<br>",
                 "Cell Size:", Cell.size, "<br>",
                 "Class:", Class)
  ), color = "pink") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Correlation Between Clump Thickness and Cell Size",
    x = "Clump Thickness",
    y = "Cell Size"
  ) +
  theme_minimal()
## Warning in geom_point(aes(text = paste("Clump Thickness:", Cl.thickness, :
## Ignoring unknown aesthetics: text
# Convert to interactive plotly plot
plot_3 <- ggplotly(plot3, tooltip = "text") %>%
  layout(modebar = list(visible = FALSE))
## `geom_smooth()` using formula = 'y ~ x'
#Display
plot_3
# Cell Size vs. Cell Shape by tumor type
plot4 <- ggplot(Breast_Cancer,
       aes(x = Cell.size, y = Cell.shape, color = Class,
           text = paste("Cell Size:", Cell.size, "<br>",
                        "Cell Shape:", Cell.shape, "<br>",
                        "Class:", Class))) +
  geom_point() +
  labs(title = "Cell Size vs. Cell Shape by Tumor Type",
       x = "Cell Size",
       y = "Cell Shape") +
  scale_color_manual(values = c("benign" = "red", "malignant" = "pink")) +
  theme_minimal()

#Convert the ggplot2 object to an interactive plotly object, specifying the tooltip
plot_4 <- ggplotly(plot4, tooltip = "text") %>%
  layout(modebar = list(visible = FALSE))

#Display
plot_4

Analysis

The plots show a clear positive trend. Malignant tumors generally have larger and more irregularly shaped cells than benign cases and form distinct clusters at higher values.
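
Because both axes take only the integer scores 1 to 10, many tumors fall on the same point and a scatter plot can hide how dense each cluster is. One way to check the clusters is to count the tumors at each combination of scores. This is a sketch on a made-up data frame, toy, with hypothetical values.

```r
# Hypothetical toy scores (not from BreastCancer); column names mirror Breast_Cancer
toy <- data.frame(
  Class      = factor(c("benign", "benign", "benign", "benign",
                        "malignant", "malignant", "malignant")),
  Cell.size  = c(1, 1, 1, 2, 8, 10, 10),
  Cell.shape = c(1, 1, 2, 1, 7, 10, 10)
)

# Number of tumors at each (Cell.size, Cell.shape, Class) combination
counts <- as.data.frame(
  table(Cell.size = toy$Cell.size, Cell.shape = toy$Cell.shape, Class = toy$Class)
)

# Keep only the combinations that actually occur, most frequent first
counts <- counts[counts$Freq > 0, ]
counts <- counts[order(-counts$Freq), ]
counts
```

In ggplot2, geom_count() or geom_jitter() in place of geom_point() would show the same information graphically.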

Conclusion

This exploratory data analysis aimed to understand the characteristics of breast tumors, specifically looking at clump thickness, cell size, and cell shape, and how these relate to whether a tumor is benign or malignant.

First, there is a difference in clump thickness between benign and malignant tumors. Malignant tumors tend to have a much higher clump thickness, and the difference is statistically significant, meaning it is very unlikely to be due to chance alone. This makes clump thickness a useful indicator for telling the two types of tumors apart. Clump thickness is also correlated with cell size: tumors with higher clump thickness also tend to have larger cells.

The plots of cell size against cell shape show that malignant tumors often have larger and more irregularly shaped cells than benign tumors. Malignant tumors form almost distinct groupings at the higher ends of cell size and cell shape.

In conclusion, the interactive plots and visualizations indicate that characteristics such as clump thickness, cell size, and cell shape are all important factors in distinguishing between benign and malignant breast tumors. These findings could be helpful in the diagnosis of breast tumors.