This EDA will explore the following questions:
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggstatsplot)
## Warning: package 'ggstatsplot' was built under R version 4.4.3
## You can cite this package as:
## Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
## Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
data("BreastCancer")
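Breast_Cancer <- BreastCancer %>%
  # Recode any "?" placeholders as NA. ifelse() drops the factor class and
  # returns the underlying integer level codes, which equal the 1-10 scores
  # only when a column's levels run 1..10 without gaps.
  mutate(across(.cols = -c(Id, Class), ~ifelse(. == "?", NA, .))) %>%
  mutate(across(.cols = -c(Id, Class), as.numeric)) %>%
  select(-Id) %>%   # drop the sample identifier
  drop_na()         # keep complete cases only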
# Check the structure of the dataset
str(Breast_Cancer)
## 'data.frame': 683 obs. of 10 variables:
## $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
#Check the dimensions and column names
dim(Breast_Cancer)
## [1] 683 10
names(Breast_Cancer)
## [1] "Cl.thickness" "Cell.size" "Cell.shape" "Marg.adhesion"
## [5] "Epith.c.size" "Bare.nuclei" "Bl.cromatin" "Normal.nucleoli"
## [9] "Mitoses" "Class"
#Check for missing values
colSums(is.na(Breast_Cancer))
## Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 0 0 0 0 0
## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
## 0 0 0 0 0
table(Breast_Cancer$Class)
##
## benign malignant
## 444 239
#Summary statistics
summary(Breast_Cancer)
## Cl.thickness Cell.size Cell.shape Marg.adhesion
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.00
## Mean : 4.442 Mean : 3.151 Mean : 3.215 Mean : 2.83
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.00
## Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.00
## Mean : 3.234 Mean : 3.545 Mean : 3.445 Mean : 2.87
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## Mitoses Class
## Min. :1.000 benign :444
## 1st Qu.:1.000 malignant:239
## Median :1.000
## Mean :1.583
## 3rd Qu.:1.000
## Max. :9.000
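A maximum of 9 for Mitoses, when every other feature reaches 10, would be consistent with a factor whose levels skip a value being converted through its level codes rather than its labels. The sketch below uses a small made-up factor (not the dataset itself) to show the difference between the two conversions; whether this applies to the Mitoses column here is an assumption worth checking with `levels(BreastCancer$Mitoses)`.

```r
# Hypothetical factor whose levels skip "9", standing in for a score column
scores <- factor(c("1", "10", "3", "10"),
                 levels = c("1", "2", "3", "4", "5", "6", "7", "8", "10"))

# Conversion via level codes: "10" is the 9th level, so it becomes 9
codes <- as.numeric(scores)

# Conversion via labels: the text "10" is parsed as the number 10
values <- as.numeric(as.character(scores))

print(codes)
print(values)
```

Going through `as.character()` first keeps the recorded score regardless of how the levels are ordered or whether any are missing.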
ggbetweenstats(data = Breast_Cancer, x = Class, y = Cl.thickness, type = "parametric", title = "Clump Thickness by Tumor Class", xlab = "Tumor Class", ylab = "Clump Thickness")
# Analysis
Welch’s t-test shows that malignant tumors have a much higher mean Clump Thickness (7.19) than benign tumors (2.96), with t = -23.93 and p = 1.66e-76. A p-value this small means a difference of this size is very unlikely to arise by chance alone, so Clump Thickness is a useful feature for distinguishing benign from malignant tumors.
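As a rough illustration of the test reported above, the following sketch runs Welch's t-test directly with base R's `t.test()` on two small made-up groups (the numbers are hypothetical, not drawn from the dataset). It also shows why the t statistic is negative when the lower-scoring group is listed first.

```r
# Hypothetical scores for two groups (illustration only)
benign    <- c(1, 2, 3, 3, 4, 5)
malignant <- c(5, 7, 8, 8, 10, 10)

# var.equal = FALSE (the default) requests Welch's unequal-variance t-test
res <- t.test(benign, malignant, var.equal = FALSE)

print(res$method)     # name of the test that was run
print(res$statistic)  # t statistic: mean(benign) - mean(malignant), scaled
print(res$parameter)  # Welch-Satterthwaite degrees of freedom
print(res$p.value)
```

The sign of t depends only on the order of the groups, so a negative value here simply says the first group has the lower mean.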
ggbetweenstats(data = Breast_Cancer, x = Class, y = Bare.nuclei, type = "nonparametric", title = "Bare Nuclei by Tumor Class", xlab = "Tumor Class", ylab = "Bare Nuclei")
# Analysis
The Wilcoxon-Mann-Whitney test shows a significant difference in Bare Nuclei between benign and malignant tumors. Malignant tumors have much higher Bare Nuclei scores (on the 1 to 10 scale), making this a useful feature for distinguishing between tumor types.
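For readers who want to see the rank-based test on its own, here is a minimal sketch using base R's `wilcox.test()` on made-up scores (hypothetical values, not the dataset). Because the scores contain ties, `exact = FALSE` is set so the normal approximation is used.

```r
# Hypothetical 1-10 scores for two groups (illustration only)
benign    <- c(1, 1, 1, 2, 1, 3)
malignant <- c(10, 8, 10, 5, 10, 9)

# Two-sample Wilcoxon rank-sum (Mann-Whitney) test; ties rule out an exact p-value
res <- wilcox.test(benign, malignant, exact = FALSE)

print(res$statistic)  # W: number of (benign, malignant) pairs where benign is larger
print(res$p.value)
```

The test compares ranks rather than means, which suits skewed, ordinal scores such as these.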
# Visualize the correlation between clump thickness and cell size
ggscatterstats(data = Breast_Cancer, x = Cl.thickness, y = Cell.size, title = "Correlation Between Clump Thickness and Cell Size", xlab = "Clump Thickness", ylab = "Cell Size")
## Registered S3 method overwritten by 'ggside':
## method from
## +.gg ggplot2
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.
# Analysis
This plot shows the relationship between Clump Thickness and Cell Size in breast tumors. The correlation is positive and fairly strong (Pearson’s r = 0.64), which means that as clump thickness increases, cell size tends to increase too.
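To make the coefficient concrete, the sketch below computes Pearson's r by hand from its definition and checks it against `cor.test()`. The two vectors are made-up values chosen for illustration, not rows from the dataset.

```r
# Hypothetical paired scores (illustration only)
x <- c(1, 2, 3, 4, 5, 6, 8, 10)
y <- c(1, 1, 2, 1, 4, 3, 8, 7)

# Pearson's r: covariance of x and y divided by the product of their spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Same coefficient from base R, along with a significance test
res <- cor.test(x, y, method = "pearson")

print(r_manual)
print(res$estimate)
print(res$p.value)
```

Pearson's r measures linear association only, so it is worth looking at the scatter plot alongside the number.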
plot1 <- ggbetweenstats(data = Breast_Cancer, x = Class, y = Cl.thickness, type = "parametric", title = "Clump Thickness by Tumor Class", xlab = "Tumor Class", ylab = "Clump Thickness")
#Convert the ggplot2 object to an interactive plotly object, specifying the tooltip
plot_1 <- ggplotly(plot1, tooltip = "text") %>%
layout(modebar = list(visible = FALSE))
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomLabelRepel() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues
#Display
plot_1
Clump Thickness is a good indicator of whether a tumor is benign or malignant in this dataset. Benign tumors generally have lower Clump Thickness, while malignant tumors tend to have much higher values. This plot shows the distribution of Clump Thickness for each tumor type and how the two distributions differ.
plot2 <- ggplot(Breast_Cancer, aes(x = Cl.thickness, y = Cell.shape)) +
  geom_point(aes(
    color = Class,
    text = paste("Clump Thickness:", Cl.thickness, "<br>",
                 "Cell Shape:", Cell.shape, "<br>",
                 "Class:", Class)
  )) +
  labs(
    title = "Correlation Between Clump Thickness and Cell Shape by Tumor Type",
    x = "Clump Thickness",
    y = "Cell Shape",
    color = "Tumor Type"
  ) +
  theme_minimal()
## Warning in geom_point(aes(color = Class, text = paste("Clump Thickness:", :
## Ignoring unknown aesthetics: text
#Convert the ggplot2 object to an interactive plotly object, specifying the tooltip
plot_2 <- ggplotly(plot2, tooltip = "text") %>%
layout(modebar = list(visible = TRUE))
#Display
plot_2
The plot indicates that both Clump Thickness and Cell Shape are positively associated with malignancy: as these two scores increase, the proportion of malignant tumors appears to go up. However, the overlap between the classes suggests that a diagnosis likely needs to consider more than just these two factors.
plot3 <- ggplot(Breast_Cancer, aes(x = Cl.thickness, y = Cell.size)) +
  geom_point(aes(
    text = paste("Clump Thickness:", Cl.thickness, "<br>",
                 "Cell Size:", Cell.size, "<br>",
                 "Class:", Class)
  ), color = "pink") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Correlation Between Clump Thickness and Cell Size",
    x = "Clump Thickness",
    y = "Cell Size"
  ) +
  theme_minimal()
## Warning in geom_point(aes(text = paste("Clump Thickness:", Cl.thickness, :
## Ignoring unknown aesthetics: text
# Convert to interactive plotly plot
plot_3 <- ggplotly(plot3, tooltip = "text") %>%
layout(modebar = list(visible = FALSE))
## `geom_smooth()` using formula = 'y ~ x'
#Display
plot_3
# Cell Size vs. Cell Shape
plot4 <- ggplot(Breast_Cancer,
                aes(x = Cell.size, y = Cell.shape, color = Class,
                    text = paste("Cell Size:", Cell.size, "<br>",
                                 "Cell Shape:", Cell.shape, "<br>",
                                 "Class:", Class))) +
  geom_point() +
  labs(title = "Cell Size vs. Cell Shape by Tumor Type",
       x = "Cell Size",
       y = "Cell Shape") +
  scale_color_manual(values = c("benign" = "red", "malignant" = "pink")) +
  theme_minimal()
#Convert the ggplot2 object to an interactive plotly object, specifying the tooltip
plot_4 <- ggplotly(plot4, tooltip = "text") %>%
layout(modebar = list(visible = FALSE))
#Display
plot_4
The plots show a clear positive trend, most visibly among malignant tumors, which generally have larger and more irregularly shaped cells than benign cases and form distinct clusters at the higher values.
This exploratory data analysis aimed to understand the characteristics of breast tumors, specifically looking at clump thickness, cell size, and cell shape, and how these relate to whether a tumor is benign or malignant.
First, there is a clear difference in clump thickness between benign and malignant tumors. Malignant tumors tend to have much higher clump thickness, and the difference is statistically significant (Welch’s t-test, p = 1.66e-76), so it is very unlikely to be due to chance alone. This makes clump thickness a useful indicator for telling the two types of tumors apart. Clump thickness is also positively correlated with cell size (Pearson’s r = 0.64): tumors with higher clump thickness scores tend to have higher cell size scores.
The plots of cell size against cell shape show that malignant tumors often have larger and more irregularly shaped cells than benign tumors. The malignant cases form fairly distinct groupings at the higher ends of both cell size and cell shape.
In conclusion, the interactive plots and statistical visualizations indicate that characteristics such as clump thickness, cell size, and cell shape are all important for distinguishing between benign and malignant breast tumors. These findings could support the diagnosis of breast tumors.