Introduction

The goal of this analysis is to enhance our understanding of the Breast Cancer Wisconsin dataset through Exploratory Data Analysis (EDA), advanced statistical testing, and interactive data visualization. By leveraging R packages such as ggstatsplot for visual statistics and plotly for interactivity, we aim to gain deeper insights into the differences and relationships among cellular characteristics that distinguish benign from malignant tumors.

Part 1: EDA and Statistical Tests using ggstatsplot

Load packages and Data

library(tidyverse)
library(ggstatsplot)
library(plotly)

Loading the Dataset

# The path below is specific to the author's machine; point it at your own copy of the CSV.
df <- read.csv("C:/Users/Nadinne/OneDrive/Desktop/Breast Cancer Wisconsin/breast-cancer-wisconsin.csv", stringsAsFactors = TRUE)

Exploring the Data

str(df)

## 'data.frame':    699 obs. of  11 variables:
##  $ id               : int  1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
##  $ clump_thickness  : int  5 5 3 6 4 8 1 2 2 4 ...
##  $ size_uniformity  : int  1 4 1 8 1 10 1 1 1 2 ...
##  $ shape_uniformity : int  1 4 1 8 1 10 1 2 1 1 ...
##  $ marginal_adhesion: int  1 5 1 1 3 8 1 1 1 1 ...
##  $ epithelial_size  : int  2 7 2 3 2 7 2 2 2 2 ...
##  $ bare_nucleoli    : Factor w/ 11 levels "?","1","10","2",..: 2 3 4 6 2 3 3 2 2 2 ...
##  $ bland_chromatin  : int  3 3 3 3 3 9 3 3 1 2 ...
##  $ normal_nucleoli  : int  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitoses          : int  1 1 1 1 1 1 1 1 5 1 ...
##  $ class            : int  2 2 2 2 2 4 2 2 2 2 ...

Basic Structure

dim(df)

## [1] 699  11

names(df)

##  [1] "id"                "clump_thickness"   "size_uniformity"  
##  [4] "shape_uniformity"  "marginal_adhesion" "epithelial_size"  
##  [7] "bare_nucleoli"     "bland_chromatin"   "normal_nucleoli"  
## [10] "mitoses"           "class"

Check for Missing Values

colSums(is.na(df))

##                id   clump_thickness   size_uniformity  shape_uniformity 
##                 0                 0                 0                 0 
## marginal_adhesion   epithelial_size     bare_nucleoli   bland_chromatin 
##                 0                 0                 0                 0 
##   normal_nucleoli           mitoses             class 
##                 0                 0                 0
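One caveat on this check: is.na() only counts true NA values. The str(df) output earlier shows that bare_nucleoli was read as a factor whose levels include "?", which suggests that missing entries in this file are coded as "?" rather than NA and therefore go uncounted here. Below is a minimal sketch of how such placeholders could be counted and recoded. The data frame toy and its values are made up for illustration and are not the real data.

```r
# Made-up stand-in for df; "?" marks a missing entry, as in bare_nucleoli.
toy <- data.frame(
  bare_nucleoli = c("1", "10", "?", "2", "?"),
  mitoses = c(1, 1, 5, 1, 1),
  stringsAsFactors = TRUE
)

# is.na() does not flag "?" because it is an ordinary factor level.
na_counts <- colSums(is.na(toy))

# Count "?" placeholders per column instead.
placeholder_counts <- sapply(toy, function(col) sum(as.character(col) == "?"))

# Recode "?" to NA, then convert the labels (not the factor codes) to numbers.
values <- as.character(toy$bare_nucleoli)
values[values == "?"] <- NA
toy$bare_nucleoli <- as.numeric(values)
na_after <- sum(is.na(toy$bare_nucleoli))

print(na_counts)
print(placeholder_counts)
print(na_after)
```

Applied to df, the same recoding would turn bare_nucleoli into a numeric column with NA wherever "?" appeared. Passing na.strings = "?" to read.csv() is another way to handle this at load time.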

Summary Statistics

summary(df)

##        id           clump_thickness  size_uniformity  shape_uniformity
##  Min.   :   61634   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:  870688   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 1171710   Median : 4.000   Median : 1.000   Median : 1.000  
##  Mean   : 1071704   Mean   : 4.418   Mean   : 3.134   Mean   : 3.207  
##  3rd Qu.: 1238298   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000  
##  Max.   :13454352   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                                                       
##  marginal_adhesion epithelial_size  bare_nucleoli bland_chromatin 
##  Min.   : 1.000    Min.   : 1.000   1      :402   Min.   : 1.000  
##  1st Qu.: 1.000    1st Qu.: 2.000   10     :132   1st Qu.: 2.000  
##  Median : 1.000    Median : 2.000   2      : 30   Median : 3.000  
##  Mean   : 2.807    Mean   : 3.216   5      : 30   Mean   : 3.438  
##  3rd Qu.: 4.000    3rd Qu.: 4.000   3      : 28   3rd Qu.: 5.000  
##  Max.   :10.000    Max.   :10.000   8      : 21   Max.   :10.000  
##                                     (Other): 56                   
##  normal_nucleoli     mitoses           class     
##  Min.   : 1.000   Min.   : 1.000   Min.   :2.00  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:2.00  
##  Median : 1.000   Median : 1.000   Median :2.00  
##  Mean   : 2.867   Mean   : 1.589   Mean   :2.69  
##  3rd Qu.: 4.000   3rd Qu.: 1.000   3rd Qu.:4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :4.00  
##

Convert class to readable labels

df$diagnosis <- ifelse(df$class == 2, "Benign", "Malignant")

Check names

names(df)

##  [1] "id"                "clump_thickness"   "size_uniformity"  
##  [4] "shape_uniformity"  "marginal_adhesion" "epithelial_size"  
##  [7] "bare_nucleoli"     "bland_chromatin"   "normal_nucleoli"  
## [10] "mitoses"           "class"             "diagnosis"

Statistical Questions

Is there a significant difference in clump thickness between benign and malignant breast tumors?

Standard two-sample t-test

t_test_standard <- t.test(clump_thickness ~ diagnosis, data = df, var.equal = TRUE)
print(t_test_standard)

## 
##  Two Sample t-test
## 
## data:  clump_thickness by diagnosis
## t = -27.078, df = 697, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Benign and group Malignant is not equal to 0
## 95 percent confidence interval:
##  -4.546030 -3.931347
## sample estimates:
##    mean in group Benign mean in group Malignant 
##                2.956332                7.195021

The p-value is far below 0.05 (p < 2.2e-16), so we reject the null hypothesis of equal means. There is a statistically significant difference in average clump thickness between benign and malignant tumors. Malignant tumors average 7.20 compared with 2.96 for benign tumors, and the 95% confidence interval for the difference (benign minus malignant) runs from -4.55 to -3.93.

Welch Two Sample t-test

t_test_welch <- t.test(clump_thickness ~ diagnosis, data = df, var.equal = FALSE)
print(t_test_welch)

## 
##  Welch Two Sample t-test
## 
## data:  clump_thickness by diagnosis
## t = -24.231, df = 363.11, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Benign and group Malignant is not equal to 0
## 95 percent confidence interval:
##  -4.582685 -3.894693
## sample estimates:
##    mean in group Benign mean in group Malignant 
##                2.956332                7.195021

The p-value is again far below 0.05 (p < 2.2e-16), so we reject the null hypothesis. Welch’s test, which does not assume equal variances, reaches the same conclusion as the standard test: malignant tumors have significantly greater clump thickness than benign tumors. Compared with the standard test, the degrees of freedom drop from 697 to 363.11 and the confidence interval is slightly wider (-4.58 to -3.89), reflecting the unpooled variance estimate.
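To show where the Welch statistic and its fractional degrees of freedom come from, here is a small sketch that computes both by hand and compares them with t.test(). The two vectors are made-up toy scores, not values from the dataset, so the numbers will not match the output above.

```r
# Hypothetical toy scores for illustration only.
benign    <- c(1, 2, 3, 3, 4, 5, 2, 1)
malignant <- c(5, 8, 10, 7, 9, 4, 10, 6)

# Squared standard error of each group mean, kept separate (no pooling).
se2_b <- var(benign) / length(benign)
se2_m <- var(malignant) / length(malignant)

# Welch t statistic: difference in means over the combined standard error.
t_manual <- (mean(benign) - mean(malignant)) / sqrt(se2_b + se2_m)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (se2_b + se2_m)^2 /
  (se2_b^2 / (length(benign) - 1) + se2_m^2 / (length(malignant) - 1))

# Reference result from base R.
welch <- t.test(benign, malignant, var.equal = FALSE)

print(c(t_manual = t_manual, df_manual = df_manual))
print(welch)
```

Because the variances are not pooled, the degrees of freedom are generally not a whole number, which is why the Welch output reports a value such as 363.11.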

Statistical Testing with ggstatsplot

library(ggstatsplot)

## Warning: package 'ggstatsplot' was built under R version 4.4.3

## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167

df$diagnosis <- as.factor(df$diagnosis)  
df$clump_thickness <- as.numeric(df$clump_thickness)

ggbetweenstats(
  data = df,
  x = diagnosis,
  y = clump_thickness,
  type = "parametric",  
  title = "Comparison of Clump Thickness by Tumor Diagnosis",
  xlab = "Tumor Diagnosis",
  ylab = "Clump Thickness"
)

The ggbetweenstats plot shows a very strong difference in clump thickness between the two diagnosis groups. The statistical test (Welch’s t-test) is highly significant (p far below 0.05), meaning a difference this large would be very unlikely if the two groups had the same mean. The effect size (Hedges’ g = -2.03) is large and negative, indicating that benign tumors have much lower clump thickness than malignant tumors. The benign group, 458 of the 699 samples, has a mean clump thickness of 2.96.

A Bayesian analysis also strongly supports these findings, with the data favoring a real difference between groups. The estimated difference in clump thickness is large (-4.23) with a narrow credible interval, further supporting the result. In simple terms, clump thickness clearly differs by tumor type, and the difference is both statistically and practically meaningful.
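To make the effect size more concrete, here is a rough sketch of the textbook Hedges' g: the difference in means divided by the pooled standard deviation, multiplied by a small-sample correction. The vectors are made-up toy scores, and the pooled-SD form with the approximate correction factor is an assumption. ggstatsplot may compute its value with a different variant, so this is not expected to reproduce -2.03.

```r
# Hypothetical toy scores for illustration only.
benign    <- c(1, 2, 3, 3, 4, 5, 2, 1)
malignant <- c(5, 8, 10, 7, 9, 4, 10, 6)
n1 <- length(benign)
n2 <- length(malignant)

# Pooled standard deviation across the two groups.
pooled_sd <- sqrt(
  ((n1 - 1) * var(benign) + (n2 - 1) * var(malignant)) / (n1 + n2 - 2)
)

# Cohen's d: standardized difference in means (benign minus malignant).
cohens_d <- (mean(benign) - mean(malignant)) / pooled_sd

# Approximate small-sample correction that turns d into Hedges' g.
correction <- 1 - 3 / (4 * (n1 + n2) - 9)
hedges_g <- cohens_d * correction

print(c(cohens_d = cohens_d, hedges_g = hedges_g))
```

A negative value means the first group (benign) has the lower mean. The correction shrinks the estimate slightly, and its effect fades as the sample grows.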

How does Clump Thickness relate to Uniformity of Cell Size across different tumor types?

Create Interactive Plot using Plotly

library(ggplot2)
library(plotly)

## Warning: package 'plotly' was built under R version 4.4.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

p <- ggplot(df, aes(x = clump_thickness, 
                     y = size_uniformity, 
                     color = diagnosis,
                     text = paste("Diagnosis:", diagnosis,
                                 "<br>Clump Thickness:", clump_thickness,
                                 "<br>Cell Size Uniformity:", size_uniformity))) +
  geom_point() +
  labs(title = "Clump Thickness vs Uniformity of Cell Size",
       x = "Clump Thickness",
       y = "Uniformity of Cell Size") +
  theme_minimal()

ggplotly(p, tooltip = "text") %>%
  layout(modebar = list(remove = c("zoom2d", "pan2d", "select2d", "lasso2d", "zoomIn2d", "zoomOut2d", "autoScale2d", "resetScale2d")))

This graph plots clump thickness (x-axis) against cell size uniformity (y-axis), colored by benign (non-cancerous) or malignant (cancerous) diagnosis. Both features are integer scores from 1 to 10, as the summary statistics above show. Malignant tumors tend to have less uniform cell sizes, that is, higher scores, where higher numbers mean the cells vary more in size. They also tend to have thicker clumps than benign tumors, consistent with the t-test results above (mean clump thickness 7.20 versus 2.96).
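Because str(df) shows both features as integer scores, many observations share the same pair of values and are drawn on top of each other in a scatter plot. A minimal sketch of counting how many rows fall on each point, using a made-up data frame (toy and its values are illustrative assumptions, not the real data):

```r
# Made-up stand-in for df with repeated score combinations.
toy <- data.frame(
  clump_thickness = c(1, 1, 1, 5, 5, 8, 8, 8),
  size_uniformity = c(1, 1, 1, 4, 4, 10, 10, 7),
  diagnosis = c("Benign", "Benign", "Benign", "Benign",
                "Malignant", "Malignant", "Malignant", "Malignant")
)

# Give every row a weight of 1, then sum the weights per unique combination.
combo_counts <- aggregate(
  n ~ clump_thickness + size_uniformity + diagnosis,
  data = transform(toy, n = 1),
  FUN = sum
)

print(combo_counts)
```

In a ggplot2 chart, geom_count() or geom_jitter() in place of geom_point() are two ways to make such overlap visible.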

The clear separation between the two diagnosis groups suggests that these two features, cell size irregularity and clump thickness, could serve as warning signs when checking for cancer. The more uneven the cells and the thicker the clumps, the more likely the tumor is to be malignant. These visual patterns could help medical professionals quickly flag potential cancer cases during examinations.
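The scatter plot shows the pattern visually but does not put a number on it. One way to quantify how the two features move together within each tumor type is a rank correlation per group. The sketch below assumes made-up toy scores. Spearman's method is chosen here because the scores are ordinal; that choice is an assumption and not something the analysis above used.

```r
# Made-up stand-in for df.
toy <- data.frame(
  clump_thickness = c(1, 2, 3, 4, 5, 5, 7, 8, 9, 10),
  size_uniformity = c(1, 1, 2, 1, 3, 4, 6, 9, 8, 10),
  diagnosis = rep(c("Benign", "Malignant"), each = 5)
)

# Spearman rank correlation computed separately within each diagnosis group.
rho_by_group <- sapply(
  split(toy, toy$diagnosis),
  function(g) cor(g$clump_thickness, g$size_uniformity, method = "spearman")
)

print(rho_by_group)
```

Values near 1 would indicate that thicker clumps go with less uniform cell sizes within that group, while values near 0 would indicate little monotonic relationship.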

Interactive Exploratory Data Analysis

Nadinne C. Calaor

2025-04-23