Introduction

This report presents an exploratory data analysis of the Breast Cancer Wisconsin dataset, as distributed in the mlbench R package under the name BreastCancer. The dataset contains 699 observations of breast tumor samples. Each sample is described by cytological characteristics such as clump thickness, uniformity of cell size, uniformity of cell shape, and marginal adhesion, each scored on a scale from 1 to 10. Crucially, each sample is also classified with a diagnosis of either benign or malignant. The primary objective of this analysis is to investigate the relationships between these cellular characteristics and the diagnosis, and so to gain insight into the factors that differentiate cancerous from non-cancerous tumors. To achieve this, we combine statistical tests with data visualizations. Specifically, we use Welch’s t-tests to compare the means of key cellular features between benign and malignant tumors, Pearson correlation to explore the association between cellular measurements, and an interactive plot to allow a more detailed examination of individual observations.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggstatsplot)
## You can cite this package as:
##      Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach.
##      Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(knitr)
library(mlbench)
data("BreastCancer")
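# Make sure we are working with a plain data frame
BreastCancer <- as.data.frame(BreastCancer)
# Fix the order of the diagnosis levels: benign first, malignant second
BreastCancer <- BreastCancer %>%
  mutate(
    Class = factor(Class, levels = c("benign", "malignant"))
  )
# Convert the factor-coded scores to numeric. Going through as.character()
# keeps the printed scores rather than the internal factor codes.
# Mitoses is left as a factor.
BreastCancer <- BreastCancer %>%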
  mutate(
    `Cl.thickness` = as.numeric(as.character(`Cl.thickness`)),
    `Cell.size` = as.numeric(as.character(`Cell.size`)),
    `Cell.shape` = as.numeric(as.character(`Cell.shape`)),
    `Marg.adhesion` = as.numeric(as.character(`Marg.adhesion`)),
    `Epith.c.size` = as.numeric(as.character(`Epith.c.size`)),
    `Bare.nuclei` = as.numeric(as.character(`Bare.nuclei`)),
    `Bl.cromatin`  = as.numeric(as.character(`Bl.cromatin`)),
    `Normal.nucleoli` = as.numeric(as.character(`Normal.nucleoli`))
  )
knitr::kable(head(BreastCancer), caption = "First Few Rows of Breast Cancer Data")
First Few Rows of Breast Cancer Data
Id       Cl.thickness  Cell.size  Cell.shape  Marg.adhesion  Epith.c.size  Bare.nuclei  Bl.cromatin  Normal.nucleoli  Mitoses  Class
1000025  5             1          1           1              2             1            3            1                1        benign
1002945  5             4          4           5              7             10           3            2                1        benign
1015425  3             1          1           1              2             2            3            1                1        benign
1016277  6             8          8           1              3             4            3            7                1        benign
1017023  4             1          1           3              2             1            3            1                1        benign
1017122  8             10         10          8              7             10           9            7                1        malignant
str(BreastCancer)
## 'data.frame':    699 obs. of  11 variables:
##  $ Id             : chr  "1000025" "1002945" "1015425" "1016277" ...
##  $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
summary(BreastCancer)
##       Id             Cl.thickness      Cell.size        Cell.shape    
##  Length:699         Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  Class :character   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Mode  :character   Median : 4.000   Median : 1.000   Median : 1.000  
##                     Mean   : 4.418   Mean   : 3.134   Mean   : 3.207  
##                     3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000  
##                     Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                                                       
##  Marg.adhesion     Epith.c.size     Bare.nuclei      Bl.cromatin    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median : 1.000   Median : 2.000   Median : 1.000   Median : 3.000  
##  Mean   : 2.807   Mean   : 3.216   Mean   : 3.545   Mean   : 3.438  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                    NA's   :16                       
##  Normal.nucleoli     Mitoses          Class    
##  Min.   : 1.000   1      :579   benign   :458  
##  1st Qu.: 1.000   2      : 35   malignant:241  
##  Median : 1.000   3      : 33                  
##  Mean   : 2.867   10     : 14                  
##  3rd Qu.: 4.000   4      : 12                  
##  Max.   :10.000   7      :  9                  
##                   (Other): 17

Part 1: EDA with ggstatsplot for Statistical Tests (Breast Cancer Wisconsin Dataset)

Question 1: Is there a difference in clump thickness between benign and malignant tumors?

plot1 <- ggstatsplot::ggbetweenstats(
  data = BreastCancer,
  x = Class,
  y = Cl.thickness,
  type = "parametric", # parametric comparison of two groups: Welch's t-test
  title = "Clump Thickness vs. Tumor Class",
  xlab = "Tumor Class",
  ylab = "Clump Thickness"
)
plot1

Explanation of the plot: This graph shows how clump thickness varies between benign and malignant tumors. Each tumor class is drawn as its own distribution, placed side by side. The shaded areas represent the density of the data, so I can see where most of the data points fall.

I also see the individual data points scattered around, giving me a sense of the actual measurements. The red dots and dashed lines mark the average clump thickness for each group. It’s clear that the average clump thickness is much higher for malignant tumors than for benign ones.

To back this up statistically, the graph provides results from a Welch’s t-test. The t-statistic is -24.23, which, combined with the very small p-value of 7.43e-78, tells me there’s a highly significant difference in mean clump thickness between the two tumor classes. The Hedges’ g value of -2.03 indicates a large effect size, meaning this isn’t just a statistically significant difference; it’s a substantial one. The negative signs simply reflect the direction of the comparison: the benign mean is lower than the malignant mean.
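As a rough illustration of what the Welch’s t-test reported above computes, the following base R sketch derives the statistic and its degrees of freedom by hand and compares them with t.test(). The two vectors are made-up toy scores, not values from BreastCancer, and all variable names are illustrative.

```r
# Toy scores on a 1-10 scale (hypothetical values, not taken from BreastCancer)
benign    <- c(1, 2, 3, 3, 4, 5, 1, 2)
malignant <- c(5, 7, 8, 10, 6, 9)

# Squared standard error of each group mean (variances are NOT pooled)
se2_b <- var(benign) / length(benign)
se2_m <- var(malignant) / length(malignant)

# Welch's t: difference in means divided by the unpooled standard error
t_manual <- (mean(benign) - mean(malignant)) / sqrt(se2_b + se2_m)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_b + se2_m)^2 /
  (se2_b^2 / (length(benign) - 1) + se2_m^2 / (length(malignant) - 1))

# t.test() defaults to var.equal = FALSE, which is Welch's test
welch <- t.test(benign, malignant)

c(manual = t_manual, t.test = unname(welch$statistic))
c(manual = df_manual, t.test = unname(welch$parameter))
```

On the report’s data, the analogous call would be t.test(Cl.thickness ~ Class, data = BreastCancer), which should correspond to the statistic shown in the plot subtitle.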

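The confidence interval, shown as [-2.25, -1.81], is the 95% interval for this effect size (Hedges’ g), not for the raw difference in average clump thickness. Because the whole interval lies far from zero, I can be fairly confident that the standardized difference between the groups is large. The sample sizes for each group are also provided: 458 for benign and 241 for malignant.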
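Hedges’ g, reported above, is a standardized mean difference with a small-sample bias correction. The sketch below uses the common textbook form (pooled standard deviation and an approximate correction factor) on made-up toy scores. ggstatsplot obtains its value from the effectsize package, whose choices (for example how the standard deviations are combined under Welch’s test, or the exact form of the correction) may differ, so this is an illustration of the idea rather than a reproduction of the reported number.

```r
# Toy scores on a 1-10 scale (hypothetical values, not taken from BreastCancer)
benign    <- c(1, 2, 3, 3, 4, 5, 1, 2)
malignant <- c(5, 7, 8, 10, 6, 9)

n1 <- length(benign)
n2 <- length(malignant)

# Pooled standard deviation across the two groups
pooled_sd <- sqrt(((n1 - 1) * var(benign) + (n2 - 1) * var(malignant)) /
                    (n1 + n2 - 2))

# Cohen's d: mean difference expressed in pooled-SD units
d <- (mean(benign) - mean(malignant)) / pooled_sd

# Approximate small-sample correction factor; it approaches 1 as n grows
J <- 1 - 3 / (4 * (n1 + n2) - 9)

# Hedges' g is Cohen's d shrunk slightly toward zero
g <- J * d

c(cohens_d = d, correction = J, hedges_g = g)
```

With several hundred observations per group, as in this report, the correction factor is very close to 1, so g and d are nearly identical.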

Additionally, the graph includes some Bayesian analysis results. The reported quantity is the natural log of the Bayes factor in favor of the null hypothesis, loge(BF01) = -246.12. Such a large negative value means the data overwhelmingly favor the alternative hypothesis that the group means differ. The estimated difference in means is -4.23 (benign minus malignant), and its credible interval is [-4.54, -3.92].
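A log Bayes factor of this size is hard to read directly. The short sketch below converts the reported loge(BF01) value into evidence for the alternative on a base-10 scale. The only input is the number quoted above; the variable names are illustrative.

```r
# Value read off the plot: natural log of the Bayes factor for H0 over H1
log_bf01 <- -246.12

# Flipping the sign gives the log Bayes factor for H1 over H0
log_bf10 <- -log_bf01

# Express the same evidence as a power of ten for easier reading
log10_bf10 <- log_bf10 / log(10)
round(log10_bf10, 1)  # about 106.9, i.e. BF10 is on the order of 10^107
```

Working on the log scale avoids forming the Bayes factor itself, which for even more extreme values could exceed what a double-precision number can represent.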

Overall, this graph gives me a very clear picture: malignant tumors tend to have a much greater clump thickness than benign tumors, and this difference is both statistically significant and practically meaningful.

Question 2: Is there a difference in uniformity of cell size between benign and malignant tumors?

plot2 <- ggstatsplot::ggbetweenstats(
  data = BreastCancer,
  x = Class,
  y = Cell.size,
  type = "parametric", # parametric comparison of two groups: Welch's t-test
  title = "Uniformity of Cell Size vs. Tumor Class",
  xlab = "Tumor Class",
  ylab = "Uniformity of Cell Size"
)
plot2

Explanation of the plot: Looking at this graph, I can see how the uniformity of cell size score differs between benign and malignant tumors. The visualization combines a violin-like representation with the actual data points. The shaded areas show the distribution of the score for each tumor class.

The x-axis separates the data by ‘Tumor Class,’ showing ‘benign’ and ‘malignant’ categories. The y-axis represents ‘Uniformity of Cell Size,’ displaying the range of values for this measurement.

The red dots with dashed lines indicate the mean uniformity of cell size for each group. The mean cell size uniformity is shown as ‘μmean = 1.33’ for benign tumors and ‘μmean = 6.57’ for malignant tumors. This immediately suggests a noticeable difference.

To statistically analyze this difference, a Welch’s t-test was conducted. The results are displayed at the top of the graph. The t-value is ‘tWelch(268.48) = -29.11,’ and the p-value is ‘p = 4.83e-85.’ This extremely small p-value indicates a highly significant difference in cell size uniformity between the two tumor classes.

The Hedges’ g value is ‘-2.58,’ representing a very large effect size. This means the difference in cell size uniformity is not only statistically significant but also practically meaningful. The 95% confidence interval for this effect size (not for the raw difference in means) is ‘CI95% [-2.86, -2.30].’

The sample sizes for each group are provided: ‘n = 458’ for benign and ‘n = 241’ for malignant.

The graph also includes Bayesian analysis results. The log of the Bayes Factor (‘loge(BF01) = -380.49’) strongly supports the alternative hypothesis, indicating substantial evidence for a difference in cell size uniformity. The estimated difference is ‘-5.25,’ with a 95% credible interval of ‘CI95% [-5.52, -4.97].’

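In conclusion, this graph provides compelling evidence that malignant tumors have markedly higher scores on the uniformity of cell size feature than benign tumors. Despite the feature’s name, a higher score on this 1 to 10 scale should not be read as more uniform cells; the scale is generally described as running from most normal (1) to most abnormal (10). The difference is both statistically significant and large in effect size.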

Question 3: Is there a relationship between clump thickness and uniformity of cell size?

plot3 <- ggscatterstats(
  data = BreastCancer,
  x = Cl.thickness,
  y = Cell.size,
  type = "parametric", # parametric correlation: Pearson's r
  title = "Relationship between Clump Thickness and Cell Size",
  xlab = "Clump Thickness",
  ylab = "Uniformity of Cell Size"
)
## Registered S3 method overwritten by 'ggside':
##   method from   
##   +.gg   ggplot2
plot3
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.

Explanation of the plot: Looking at this plot, I can see the relationship between ‘Clump Thickness’ and ‘Uniformity of Cell Size’. It’s primarily a scatter plot, showing how these two measurements vary together.

The x-axis represents ‘Clump Thickness’, and the y-axis represents ‘Uniformity of Cell Size’. Each point on the plot represents a single observation (presumably a tumor), showing its values for both of these characteristics.

I notice a general upward trend in the scatter plot. As ‘Clump Thickness’ increases, ‘Uniformity of Cell Size’ also tends to increase. This suggests a positive relationship between these two variables.

A blue line is drawn through the data points. This is a regression line, which represents the best linear fit to the data. It visually emphasizes the positive relationship. The shaded area around the blue line represents the confidence interval, indicating the uncertainty in the estimated regression line.

To get a statistical measure of this relationship, the plot reports Pearson’s correlation coefficient together with a t-test of whether that correlation differs from zero. The t-value is ‘t(697) = 22.28’, and the p-value is ‘p = 1.94e-83’. This extremely small p-value indicates a highly significant linear relationship between clump thickness and cell size uniformity.

The Pearson correlation coefficient is ‘r_Pearson = 0.64’. This value tells me the strength and direction of the linear relationship. A value of 0.64 indicates a moderate positive linear relationship.
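The t-value and the correlation coefficient above are two views of the same quantity: the t statistic for a Pearson correlation is a function of r and the sample size. The base R sketch below shows that relationship on made-up toy pairs (not values from BreastCancer) and checks it against cor.test(). Variable names are illustrative.

```r
# Toy paired scores (hypothetical values, not taken from BreastCancer)
thickness <- c(1, 2, 3, 4, 5, 6, 8, 10)
cell_size <- c(1, 1, 2, 4, 3, 7, 6, 9)

n <- length(thickness)
r <- cor(thickness, cell_size)  # Pearson's r

# t statistic for H0: correlation = 0, with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df = n - 2, lower.tail = FALSE)

# cor.test() uses method = "pearson" by default
ct <- cor.test(thickness, cell_size)

c(manual = t_manual, cor.test = unname(ct$statistic))
c(manual = p_manual, cor.test = ct$p.value)
```

This also explains the degrees of freedom in the plot: with 699 observations, n - 2 gives the 697 shown in ‘t(697)’.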

The plot also includes some Bayesian analysis results. The log of the Bayes Factor (‘logₑ(BF₀₁) = -183.79’) provides strong evidence in favor of the alternative hypothesis (that there is a correlation).

In summary, this plot shows a clear positive relationship between clump thickness and cell size uniformity. Tumors with higher clump thickness scores tend to have higher uniformity of cell size scores, and this relationship is both statistically significant and moderately strong.

Question 4: How do the distributions of cell shape vary across tumor classes?

plot4 <- ggplot(BreastCancer, aes(x = Class, y = `Cell.shape`)) +
  geom_boxplot() +
  labs(
    title = "Cell Shape Distribution by Tumor Class",
    x = "Tumor Class",
    y = "Cell Shape"
  )
plot4

Explanation of the plot: In this plot, I’m looking at how ‘Cell Shape’ varies between benign and malignant tumors. It’s a boxplot, which is a great way to visualize the distribution of a numerical variable (‘Cell Shape’) across different categories (‘Tumor Class’).

The x-axis shows the ‘Tumor Class,’ with ‘benign’ and ‘malignant’ as the two categories. The y-axis represents ‘Cell Shape,’ showing the range of cell shape values.

For each tumor class, the boxplot gives me a summary of the cell shape distribution:

The box itself: The bottom of the box represents the first quartile (25th percentile), the top of the box represents the third quartile (75th percentile), and the line inside the box represents the median (50th percentile). So, the box shows the middle 50% of the data.

The whiskers: The lines extending from the box (the whiskers) show the range of the data, excluding outliers. By default, geom_boxplot() draws each whisker out to the most extreme value that lies within 1.5 times the interquartile range of the box edge.

The points: Any points beyond the whiskers are plotted individually and are considered outliers.

Looking at the plot, here’s what I observe:

Benign tumors: The box for benign tumors is generally lower on the Cell Shape scale. The median cell shape is quite low, and the box itself is compressed towards the lower end, indicating that most benign tumors have relatively low cell shape values. There are a few outliers with higher cell shape values.

Malignant tumors: The box for malignant tumors is much higher on the Cell Shape scale. The median cell shape is considerably higher than in benign tumors, and the box is larger, suggesting a wider range of cell shape values.

In essence, this plot tells me that malignant tumors typically have a higher cell shape value than benign tumors, and their cell shape values are more variable.
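The box, whisker, and outlier rules described above can be checked numerically with base R’s boxplot.stats(). The sketch below uses made-up toy scores, not values from BreastCancer. Note that boxplot.stats() uses Tukey’s hinges for the box edges, which may differ slightly from the quartiles ggplot2 computes, so this illustrates the rule rather than reproducing the plot exactly.

```r
# Toy scores (hypothetical values, not taken from BreastCancer)
x <- c(1, 1, 1, 1, 2, 2, 3, 10)

# coef = 1.5 is the usual whisker multiplier
bs <- boxplot.stats(x, coef = 1.5)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values beyond the whiskers, drawn as individual points

# The same upper fence, computed by hand from the hinges
hinge_lo <- bs$stats[2]
hinge_hi <- bs$stats[4]
upper_fence <- hinge_hi + 1.5 * (hinge_hi - hinge_lo)

x[x > upper_fence]  # 10 is the only value beyond the upper fence
```

In this toy example the upper whisker stops at 3, the largest value inside the fence, and the value 10 is flagged as an outlier, which mirrors how a few high-scoring benign tumors appear as isolated points.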

Part 2: Creating Interactive Plots with plotly (Breast Cancer Wisconsin Dataset)

p_cancer <- ggplot(BreastCancer, aes(x = `Cl.thickness`, y = `Cell.size`,
                           color = Class,
                           text = paste("Clump Thickness: ", `Cl.thickness`, "<br>",
                                        "Cell Size: ", `Cell.size`, "<br>",
                                        "Class: ", Class))) +
  geom_point() +
  labs(title = "Clump Thickness vs. Cell Size (Interactive)",
       x = "Clump Thickness",
       y = "Uniformity of Cell Size",
       color = "Tumor Class") +
  theme_minimal()

fig_cancer <- ggplotly(p_cancer, tooltip = "text") %>%
  config(displayModeBar = FALSE) # hide the plotly mode bar (toolbar)
fig_cancer

Explanation: Looking at this plot, I can see the relationship between ‘Clump Thickness’ and ‘Uniformity of Cell Size’, and it’s interactive.

The x-axis represents ‘Clump Thickness’, and the y-axis represents ‘Uniformity of Cell Size’. Each point on the plot represents a single observation (a tumor), showing its values for both of these characteristics.

The points are colored differently, and there’s a legend in the top right corner that tells me what the colors mean. ‘benign’ tumors are shown in one color, and ‘malignant’ tumors are shown in another. This helps me see if there’s a pattern in how these two measurements relate to the tumor class.

The plot is interactive, which means I can hover my mouse cursor over the points. When I do that, a little box (a tooltip) appears, showing me the exact values of ‘Clump Thickness’ and ‘Uniformity of Cell Size’ for that particular point, and which ‘Tumor Class’ it belongs to. This is very helpful for getting more detailed information about individual data points.

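From the plot, I can observe the general trend between clump thickness and cell size uniformity, and how it differs by tumor class. Consistent with the group comparisons in Questions 1 and 2, benign tumors should cluster at low values of both measurements, while malignant tumors spread toward higher values. Because both variables take only whole-number scores from 1 to 10, many observations share the same coordinates and are drawn on top of one another, so the number of visible points understates the number of tumors.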

Conclusion: This exploratory data analysis of the Breast Cancer Wisconsin dataset has revealed several important relationships between cellular characteristics and tumor diagnosis. Notably, we found that clump thickness and uniformity of cell size scores are significantly higher in malignant tumors than in benign tumors, with large effect sizes (Hedges’ g of -2.03 and -2.58, respectively), suggesting that higher scores on both features are associated with malignancy. The boxplots showed a similar pattern for cell shape, with malignant tumors having higher and more variable values. Furthermore, our analysis demonstrated a moderate positive correlation (r = 0.64) between clump thickness and uniformity of cell size, indicating that these two characteristics tend to increase in tandem. The interactive plot complemented these results by letting us inspect individual tumors and see how the two measurements vary together within each diagnosis. Overall, these findings underscore the potential of clump thickness and uniformity of cell size as important factors in distinguishing between benign and malignant breast tumors. Future research may build upon these findings to explore these characteristics in greater depth and potentially contribute to the development of improved diagnostic methodologies.