The goal of this analysis is to enhance our understanding of the Breast Cancer Wisconsin dataset through Exploratory Data Analysis (EDA), advanced statistical testing, and interactive data visualization. By leveraging R packages such as ggstatsplot for visual statistics and plotly for interactivity, we aim to gain deeper insights into the differences and relationships among cellular characteristics that distinguish benign from malignant tumors.
library(tidyverse)
library(ggstatsplot)
library(plotly)
df <- read.csv("C:/Users/Nadinne/OneDrive/Desktop/Breast Cancer Wisconsin/breast-cancer-wisconsin.csv", stringsAsFactors = TRUE)
str(df)
## 'data.frame': 699 obs. of 11 variables:
## $ id : int 1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
## $ clump_thickness : int 5 5 3 6 4 8 1 2 2 4 ...
## $ size_uniformity : int 1 4 1 8 1 10 1 1 1 2 ...
## $ shape_uniformity : int 1 4 1 8 1 10 1 2 1 1 ...
## $ marginal_adhesion: int 1 5 1 1 3 8 1 1 1 1 ...
## $ epithelial_size : int 2 7 2 3 2 7 2 2 2 2 ...
## $ bare_nucleoli : Factor w/ 11 levels "?","1","10","2",..: 2 3 4 6 2 3 3 2 2 2 ...
## $ bland_chromatin : int 3 3 3 3 3 9 3 3 1 2 ...
## $ normal_nucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
## $ mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
## $ class : int 2 2 2 2 2 4 2 2 2 2 ...
dim(df)
## [1] 699 11
names(df)
## [1] "id" "clump_thickness" "size_uniformity"
## [4] "shape_uniformity" "marginal_adhesion" "epithelial_size"
## [7] "bare_nucleoli" "bland_chromatin" "normal_nucleoli"
## [10] "mitoses" "class"
colSums(is.na(df))
## id clump_thickness size_uniformity shape_uniformity
## 0 0 0 0
## marginal_adhesion epithelial_size bare_nucleoli bland_chromatin
## 0 0 0 0
## normal_nucleoli mitoses class
## 0 0 0
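Note that `is.na()` reports zero missing values even though the `str(df)` output lists a `"?"` level in `bare_nucleoli`. That symbol looks like a missing-value placeholder that was read in as text, which would also explain why the column became a factor instead of an integer. Below is a minimal sketch of how one might count and recode it, using a small made-up vector (`toy` and its values are illustrative, not taken from the dataset):

```r
# Toy stand-in for the bare_nucleoli column (values are made up for illustration)
toy <- data.frame(bare_nucleoli = factor(c("1", "10", "?", "2", "?", "1")))

# is.na() does not see "?" as missing, so count the placeholder explicitly
n_placeholder <- sum(toy$bare_nucleoli == "?")

# Go through character first: as.integer() on a factor would return level codes
as_text <- as.character(toy$bare_nucleoli)
toy$bare_nucleoli_num <- as.integer(replace(as_text, as_text == "?", NA))

n_placeholder
colSums(is.na(toy))
```

When loading the file, passing `na.strings = "?"` to `read.csv()` should achieve the same recoding up front, so that `colSums(is.na(df))` reflects the gaps.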
summary(df)
## id clump_thickness size_uniformity shape_uniformity
## Min. : 61634 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 870688 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 1171710 Median : 4.000 Median : 1.000 Median : 1.000
## Mean : 1071704 Mean : 4.418 Mean : 3.134 Mean : 3.207
## 3rd Qu.: 1238298 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000
## Max. :13454352 Max. :10.000 Max. :10.000 Max. :10.000
##
## marginal_adhesion epithelial_size bare_nucleoli bland_chromatin
## Min. : 1.000 Min. : 1.000 1 :402 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 2.000 10 :132 1st Qu.: 2.000
## Median : 1.000 Median : 2.000 2 : 30 Median : 3.000
## Mean : 2.807 Mean : 3.216 5 : 30 Mean : 3.438
## 3rd Qu.: 4.000 3rd Qu.: 4.000 3 : 28 3rd Qu.: 5.000
## Max. :10.000 Max. :10.000 8 : 21 Max. :10.000
## (Other): 56
## normal_nucleoli mitoses class
## Min. : 1.000 Min. : 1.000 Min. :2.00
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.:2.00
## Median : 1.000 Median : 1.000 Median :2.00
## Mean : 2.867 Mean : 1.589 Mean :2.69
## 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.:4.00
## Max. :10.000 Max. :10.000 Max. :4.00
##
df$diagnosis <- ifelse(df$class == 2, "Benign", "Malignant")
names(df)
## [1] "id" "clump_thickness" "size_uniformity"
## [4] "shape_uniformity" "marginal_adhesion" "epithelial_size"
## [7] "bare_nucleoli" "bland_chromatin" "normal_nucleoli"
## [10] "mitoses" "class" "diagnosis"
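The `ifelse()` recoding labels every value other than 2 as "Malignant", including any unexpected code. As a hedged alternative, `factor()` with explicit `levels` and `labels` turns unknown codes into `NA` so they can be spotted. The sketch below uses a made-up vector (`class_codes`, with a deliberately invalid 9) rather than the real `class` column:

```r
# Made-up class codes; 9 is deliberately invalid to show the difference
class_codes <- c(2, 2, 4, 2, 4, 9)

# Explicit mapping: 2 -> Benign, 4 -> Malignant, anything else -> NA
diagnosis_strict <- factor(class_codes,
                           levels = c(2, 4),
                           labels = c("Benign", "Malignant"))

# The ifelse() approach silently treats the invalid code as Malignant
diagnosis_loose <- ifelse(class_codes == 2, "Benign", "Malignant")

table(diagnosis_strict, useNA = "ifany")
table(diagnosis_loose)
```

This also yields a factor directly, so a later `as.factor()` conversion would not be needed.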
Is there a significant difference in clump thickness between benign and malignant breast tumors?
t_test_standard <- t.test(clump_thickness ~ diagnosis, data = df, var.equal = TRUE)
print(t_test_standard)
##
## Two Sample t-test
##
## data: clump_thickness by diagnosis
## t = -27.078, df = 697, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Benign and group Malignant is not equal to 0
## 95 percent confidence interval:
## -4.546030 -3.931347
## sample estimates:
## mean in group Benign mean in group Malignant
## 2.956332 7.195021
The p-value is far below 0.05 (p < 2.2e-16), so we reject the null hypothesis of equal means. Under the equal-variance assumption, there is a statistically significant difference in average clump thickness between benign and malignant tumors. Malignant tumors average 7.20 versus 2.96 for benign tumors, and the 95% confidence interval puts the gap at roughly 3.9 to 4.5 points.
t_test_welch <- t.test(clump_thickness ~ diagnosis, data = df, var.equal = FALSE)
print(t_test_welch)
##
## Welch Two Sample t-test
##
## data: clump_thickness by diagnosis
## t = -24.231, df = 363.11, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Benign and group Malignant is not equal to 0
## 95 percent confidence interval:
## -4.582685 -3.894693
## sample estimates:
## mean in group Benign mean in group Malignant
## 2.956332 7.195021
The p-value is again far below 0.05 (p < 2.2e-16), so we reject the null hypothesis. Welch’s test, which does not assume equal variances, gives a slightly wider confidence interval and fewer degrees of freedom (363.11 instead of 697) but reaches the same conclusion: malignant tumors have significantly greater clump thickness than benign tumors.
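To make the difference between the two tests concrete, here is a small sketch that computes the Welch statistic and its Welch-Satterthwaite degrees of freedom by hand and checks them against `t.test()`. The data are simulated: the group sizes mirror the report (458 and 241), but the means and standard deviations are assumptions chosen only for illustration.

```r
set.seed(42)
# Simulated scores, NOT the real data; means and SDs are illustrative guesses
benign    <- rnorm(458, mean = 3, sd = 1.5)
malignant <- rnorm(241, mean = 7, sd = 2.5)

n1 <- length(benign)
n2 <- length(malignant)
v1 <- var(benign) / n1      # squared standard error of the benign mean
v2 <- var(malignant) / n2   # squared standard error of the malignant mean

# Welch statistic: each group keeps its own variance (no pooling)
t_manual <- (mean(benign) - mean(malignant)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

welch <- t.test(benign, malignant, var.equal = FALSE)
all.equal(t_manual, unname(welch$statistic))
all.equal(df_manual, unname(welch$parameter))
```

When the group variances differ, the approximated degrees of freedom fall below `n1 + n2 - 2`, which is why the Welch output reports a fractional value.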
library(ggstatsplot)
## Warning: package 'ggstatsplot' was built under R version 4.4.3
## You can cite this package as:
## Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
## Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
df$diagnosis <- as.factor(df$diagnosis)
df$clump_thickness <- as.numeric(df$clump_thickness)
ggbetweenstats(
data = df,
x = diagnosis,
y = clump_thickness,
type = "parametric",
title = "Comparison of Clump Thickness by Tumor Diagnosis",
xlab = "Tumor Diagnosis",
ylab = "Clump Thickness"
)
The comparison of clump thickness by diagnosis shows a very strong difference between the groups. Welch’s t-test gave a highly significant result (p far below 0.05), so the difference is very unlikely to be due to chance. The effect size (Hedges’ g = -2.03) is large and negative, meaning the benign group had much lower clump thickness than the malignant group. The benign group averaged 2.96, based on 458 of the 699 samples.
A Bayesian analysis also strongly supported these findings, with the data favoring a real difference between groups. The estimated difference in clump thickness was large (-4.23 points, benign minus malignant) with a narrow credible interval, further confirming the result. In simple terms, clump thickness clearly differs by tumor type, and the difference is both statistically and practically meaningful.
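For readers unfamiliar with Hedges’ g, the sketch below shows one common textbook form: a standardized mean difference multiplied by an approximate small-sample correction. The exact variant that ggstatsplot applies is not shown in this report, so the choice of denominator here (the average of the two group variances) and the approximation for the correction factor `J` are assumptions. The data are simulated, not the real scores.

```r
set.seed(7)
# Simulated scores, NOT the real data; means and SDs are illustrative guesses
benign    <- rnorm(458, mean = 3, sd = 1.5)
malignant <- rnorm(241, mean = 7, sd = 2.5)
n1 <- length(benign)
n2 <- length(malignant)

# Standardizer: square root of the average group variance (unpooled form, assumed)
sd_avg <- sqrt((var(benign) + var(malignant)) / 2)

# Standardized mean difference (benign minus malignant, so negative here)
d <- (mean(benign) - mean(malignant)) / sd_avg

# Approximate small-sample correction; close to 1 for samples this large
J <- 1 - 3 / (4 * (n1 + n2 - 2) - 1)
g <- d * J

c(d = d, J = J, g = g)
```

With nearly 700 observations the correction barely changes the estimate, so g and the uncorrected d are practically the same.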
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- ggplot(df, aes(x = clump_thickness,
y = size_uniformity,
color = diagnosis,
text = paste("Diagnosis:", diagnosis,
"<br>Clump Thickness:", clump_thickness,
"<br>Cell Size Uniformity:", size_uniformity))) +
geom_point() +
labs(title = "Clump Thickness vs Uniformity of Cell Size",
x = "Clump Thickness",
y = "Uniformity of Cell Size") +
theme_minimal()
ggplotly(p, tooltip = "text") %>%
layout(modebar = list(remove = c("zoom2d", "pan2d", "select2d", "lasso2d", "zoomIn2d", "zoomOut2d", "autoScale2d", "resetScale2d")))
This graph plots clump thickness (x-axis) against cell size uniformity (y-axis) for benign (non-cancerous) and malignant (cancerous) tumors. Malignant tumors tend to have less uniform cell sizes, with scores ranging up to the maximum of 10, where higher numbers mean the cells vary more in size. Consistent with the t-tests above, malignant tumors also tend to have thicker clumps than benign ones.
The clear difference between the two diagnosis groups suggests that doctors can use these two features, cell size irregularity and clump thickness, as warning signs when checking for cancer. The more uneven the cells and the thicker the clumps, the more likely the tumor is to be malignant. These visual patterns help medical professionals quickly assess potential cancer cases during examinations.
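One caveat when reading the scatterplot: both variables are integer scores from 1 to 10, so many observations share the same coordinates and are drawn on top of each other. A rough sketch of how one might count the points per coordinate pair and map that count to point size is shown below. The data frame `toy` is simulated, and the optional plot is only built if ggplot2 is installed.

```r
set.seed(1)
# Simulated integer scores on the 1-10 scale (illustrative, NOT the real data)
toy <- data.frame(
  clump_thickness = sample(1:10, 200, replace = TRUE),
  size_uniformity = sample(1:10, 200, replace = TRUE)
)

# Count how many observations fall on each (x, y) pair
pair_counts <- aggregate(n ~ clump_thickness + size_uniformity,
                         data = transform(toy, n = 1L),
                         FUN = sum)

# 200 points on at most 100 grid cells: some cells must hold several points
max(pair_counts$n)

# Optional: show the count as point size so stacked points stay visible
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_counts <- ggplot2::ggplot(
    pair_counts,
    ggplot2::aes(x = clump_thickness, y = size_uniformity, size = n)
  ) +
    ggplot2::geom_point(alpha = 0.6)
}
```

A small amount of jitter (for example `geom_jitter()` in place of `geom_point()`) is another common way to reveal stacked points, at the cost of slightly blurring the exact scores.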
The analysis of tumor characteristics in the Breast Cancer Wisconsin dataset reveals important patterns. Statistical tests confirm that malignant tumors have significantly greater clump thickness (average 7.20) than benign tumors (average 2.96), a difference that is extremely unlikely to be due to chance (p < 0.001). The large effect size (Hedges’ g = -2.03) further emphasizes how strongly clump thickness distinguishes cancerous from non-cancerous growths. The relationship between tumor features is also evident in the scatterplot, which shows malignant tumors tending to have both thicker clumps and less uniform cell sizes than benign cases. These findings indicate that clump thickness and cell size uniformity are reliable indicators of malignancy: when cells appear more irregular and more clumped together, the tumor is more likely to be cancerous. Together, the statistical results and visual trends provide clinicians with measurable criteria to help identify dangerous tumors efficiently.