This Exploratory Data Analysis (EDA) explores patterns in the Breast Cancer Wisconsin dataset. The goal is to understand how certain cellular features differ between benign and malignant tumors. In particular, this EDA aims to answer the following questions:
Through visualizations and statistical tests, it aims to highlight which features are most useful in identifying whether a tumor is likely to be benign or malignant.
library(tidyverse)
library(ggstatsplot)
library(plotly)
library(mlbench)
library(dplyr)
library(reshape2)
data("BreastCancer")
dataset <- BreastCancer
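# Clean the dataset. mlbench already stores missing values as NA, and the
# feature columns are factors, so convert them through their labels
# (as.character) rather than through their internal integer codes.
bc_data <- dataset %>%
mutate(across(.cols = -c(Id, Class), ~ as.numeric(as.character(.x)))) %>%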
select(-Id) %>%
drop_na()
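A detail worth verifying in this cleaning step: the feature columns in `BreastCancer` are stored as factors, and calling `as.numeric()` directly on a factor returns its internal level codes, not its labels. The sketch below uses a small made-up factor (not taken from the dataset) whose levels skip a value, to show how the two conversions can disagree.

```r
# Toy factor with gaps in its levels (illustrative values only)
f <- factor(c("1", "8", "10"), levels = c("1", "8", "10"))

codes  <- as.numeric(f)                # internal level codes: 1 2 3
labels <- as.numeric(as.character(f))  # labels read as numbers: 1 8 10

print(codes)
print(labels)
```

If the levels of a column run from 1 to 10 without gaps, both conversions agree. Checking `levels()` on each column before converting is a cheap safeguard.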
# Preview the first few rows
head(bc_data)
## Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1 5 1 1 1 2 1
## 2 5 4 4 5 7 10
## 3 3 1 1 1 2 2
## 4 6 8 8 1 3 4
## 5 4 1 1 3 2 1
## 6 8 10 10 8 7 10
## Bl.cromatin Normal.nucleoli Mitoses Class
## 1 3 1 1 benign
## 2 3 2 1 benign
## 3 3 1 1 benign
## 4 3 7 1 benign
## 5 3 1 1 benign
## 6 9 7 1 malignant
tail(bc_data)
## Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 694 3 1 1 1 2 1
## 695 3 1 1 1 3 2
## 696 2 1 1 1 2 1
## 697 5 10 10 3 7 3
## 698 4 8 6 4 3 4
## 699 4 8 8 5 4 5
## Bl.cromatin Normal.nucleoli Mitoses Class
## 694 2 1 2 benign
## 695 1 1 1 benign
## 696 1 1 1 benign
## 697 8 10 2 malignant
## 698 10 6 1 malignant
## 699 10 4 1 malignant
# Check structure
str(bc_data)
## 'data.frame': 683 obs. of 10 variables:
## $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
# Summary statistics of the raw data (before cleaning)
summary(dataset)
## Id Cl.thickness Cell.size Cell.shape Marg.adhesion
## Length:699 1 :145 1 :384 1 :353 1 :407
## Class :character 5 :130 10 : 67 2 : 59 2 : 58
## Mode :character 3 :108 3 : 52 10 : 58 3 : 58
## 4 : 80 2 : 45 3 : 56 10 : 55
## 10 : 69 4 : 40 4 : 44 4 : 33
## 2 : 50 5 : 30 5 : 34 8 : 25
## (Other):117 (Other): 81 (Other): 95 (Other): 63
## Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses
## 2 :386 1 :402 2 :166 1 :443 1 :579
## 3 : 72 10 :132 3 :165 10 : 61 2 : 35
## 4 : 48 2 : 30 1 :152 3 : 44 3 : 33
## 1 : 47 5 : 30 7 : 73 2 : 36 10 : 14
## 6 : 41 3 : 28 4 : 40 8 : 24 4 : 12
## 5 : 39 (Other): 61 5 : 34 6 : 22 7 : 9
## (Other): 66 NA's : 16 (Other): 69 (Other): 69 (Other): 17
## Class
## benign :458
## malignant:241
##
##
##
##
##
# Check distribution of diagnosis
table(bc_data$Class)
##
## benign malignant
## 444 239
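The class counts above are easier to compare as proportions. A minimal sketch, re-entering the two counts printed by `table(bc_data$Class)` rather than reloading the data:

```r
# Class counts as printed by table(bc_data$Class) above
class_counts <- c(benign = 444, malignant = 239)

# Share of each class among the 683 complete cases
class_share <- round(prop.table(class_counts), 3)
print(class_share)
```

About 65% of the complete cases are benign and 35% malignant, so the two groups are unequal in size but both are well represented.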
ggbetweenstats(
data = bc_data,
x = Class,
y = Cl.thickness,
type = "parametric",
title = "Comparison of Clump Thickness by Tumor Type",
xlab = "Tumor Type",
ylab = "Clump Thickness",
plot.type = "box",
pairwise.comparisons = TRUE
)
This box plot shows a clear difference in clump thickness between benign and malignant tumors. On average, malignant tumors have much thicker clumps (around 7.19) than benign ones (around 2.96).
Welch’s t-test shows that this difference is very unlikely to be due to chance (p < 0.001). The effect is also large (Hedges’ g = -2.02; the sign is negative because the benign group, which comes first, has the lower mean), and the Bayes factor strongly supports a real difference between the two groups. Clump thickness is therefore an important feature for telling whether a tumor is cancerous.
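To make the two statistics quoted here more concrete, the following sketch computes Welch's t-test with base R's `t.test()` and a hand-written Hedges' g. The helper `hedges_g()` is a hypothetical name introduced for illustration; it uses a pooled standard deviation and the common approximate small-sample correction, so it may not reproduce the value reported by `ggbetweenstats()` exactly. The input vectors are made up and are not from the dataset.

```r
# Hedges' g using a pooled standard deviation and the approximate
# small-sample correction factor J (assumption: not ggstatsplot's exact method)
hedges_g <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  d <- (mean(x) - mean(y)) / s_pooled  # Cohen's d
  j <- 1 - 3 / (4 * (n1 + n2) - 9)     # bias correction
  d * j
}

# Made-up scores for illustration only
group_a <- c(1, 2, 3, 4, 5)
group_b <- c(3, 4, 5, 6, 7)

welch <- t.test(group_a, group_b, var.equal = FALSE)  # Welch's t-test
g <- hedges_g(group_a, group_b)

print(welch$p.value)
print(g)
```

On the cleaned data, the same helper could be called with `bc_data$Cl.thickness[bc_data$Class == "benign"]` as the first argument and the malignant values as the second. The result is negative whenever the first group has the lower mean.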
ggbetweenstats(
data = bc_data,
x = Class,
y = Bare.nuclei,
type = "parametric",
var.equal = FALSE,
title = "Bare Nuclei Count: Benign vs. Malignant Tumors",
xlab = "Tumor Type",
ylab = "Bare Nuclei Count",
plot.type = "boxviolin", # Boxplot + violin
pairwise.comparisons = FALSE
)
This box-violin plot shows a big difference in the Bare Nuclei Count between benign and malignant tumors. Malignant tumors have a much higher and more spread-out count (average around 7.63), while benign tumors mostly stay at low counts (average around 1.35).
Welch’s t-test shows this difference is very unlikely to be due to chance (p ≈ 8.55e-89), with a very large effect (Hedges’ g = -2.66). A Bayesian analysis also strongly supports this result, with a very high Bayes factor. This shows that the Bare Nuclei Count is a very important feature for spotting cancerous tumors, possibly even more useful than clump thickness.
ggbetweenstats(
data = bc_data,
x = Class,
y = Cell.size,
type = "parametric",
title = "Comparison of Cell Size by Tumor Type",
xlab = "Tumor Type",
ylab = "Cell Size",
plot.type = "violin",
pairwise.comparisons = TRUE
)
This violin plot compares Cell Size scores between benign and malignant tumors. Malignant tumors generally have larger cell-size scores, while benign tumors are concentrated at low values.
The plot reports the group means and the Welch’s t-test results for this comparison.
Together with Clump Thickness and Bare Nuclei Count, Cell Size is therefore another feature that helps identify cancerous tumors.
library(plotly)
# Reshape to long format
bc_long <- bc_data %>%
pivot_longer(cols = -Class, names_to = "Feature", values_to = "Value")
# Generate plot list per feature
plot_list <- bc_long %>%
split(.$Feature) %>%
map(function(df) {
plot_ly(data = df,
x = ~Value,
color = ~Class,
colors = c("benign" = "#00BA38", "malignant" = "#F8766D"),
type = "histogram",
nbinsx = 20,
name = unique(df$Feature)) %>%
layout(
title = list(text = paste("Feature:", unique(df$Feature)), x = 0.5),
xaxis = list(title = ""),
yaxis = list(title = "Count")
)
})
# Combine into subplot layout
subplot(
plot_list,
nrows = ceiling(length(plot_list)/2),
shareX = FALSE,
shareY = FALSE,
titleX = TRUE,
titleY = TRUE
) %>%
layout(title = "Interactive Histograms of Cellular Features by Tumor Class")
The histograms show how each cell feature is distributed for benign tumors (green bars) and malignant tumors (pink bars). For some features, one tumor type has mostly higher values, which appears as a shift to the right; for others, its values are lower and shift to the left.
How much the two histograms overlap shows how well that feature can help tell the two tumor types apart. If there’s little overlap, that feature is probably better at telling the difference. The shape and spread of each histogram also tell us how much variety there is in that feature for each tumor type.
By looking at each pair of histograms, we can spot which features show the biggest differences. For example, if one tumor type mostly has high values for a certain feature and the other type has low values, that feature is likely useful for telling them apart. If a histogram is narrow, it means the feature values are quite similar within that group. If it’s wide, there’s more variety. Features that look the same for both tumor types may not be very helpful for classification.
With the original interactive charts, we could explore these differences in more detail and get more exact information.
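The notion of overlap described above can also be expressed as a single number. The sketch below defines a hypothetical helper, `overlap_coef()`, which sums the smaller of the two class proportions at each score value. It assumes the feature takes whole-number scores from 1 to 10, as the cleaned data appear to, and it is demonstrated on made-up vectors rather than the dataset.

```r
# Overlap coefficient for two samples of integer scores:
# at each score value take the smaller of the two class proportions, then sum.
# 0 means the samples share no values; 1 means identical distributions.
overlap_coef <- function(x, y, values = 1:10) {
  px <- as.numeric(table(factor(x, levels = values))) / length(x)
  py <- as.numeric(table(factor(y, levels = values))) / length(y)
  sum(pmin(px, py))
}

# Made-up scores for illustration only
a <- c(1, 1, 2, 2)
b <- c(2, 2, 3, 3)

print(overlap_coef(a, b))  # half of each sample sits on the shared value 2
```

Applied per feature to the benign and malignant values, a lower coefficient would point to a feature that separates the two classes more cleanly.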
# Correlation plots
# For individual correlation
ggstatsplot::ggscatterstats(
data = bc_data,
x = Cl.thickness,
y = Cell.size,
title = "Correlation Between Clump Thickness and Cell Size",
xlab = "Clump Thickness",
ylab = "Cell Size",
point.color = "#0072B2",
point.alpha = 0.5,
line.color = "#D55E00",
marginal = FALSE
)
# For correlation matrix
ggstatsplot::ggcorrmat(
data = bc_data,
type = "parametric", # Pearson's r
colors = c("#6D9EC1", "white", "#E46726"),
title = "Correlation Matrix of Cellular Characteristics",
subtitle = "Pairwise Pearson correlations",
matrix.type = "lower",
p.adjust.method = "none",
hc.order = TRUE,
lab = TRUE
)
Analysis (Clump Thickness vs. Cell Size): The scatter plot shows a statistically significant, moderately strong positive linear relationship between Clump Thickness and Cell Size (r = 0.64, p < 0.001). Higher Clump Thickness tends to be associated with larger Cell Size. This suggests a potential biological link where increased cellular crowding correlates with cell enlargement, and this relationship could be valuable for predictive modeling.
Analysis (Correlation Matrix):
The correlation matrix reveals significant positive linear relationships among several cellular features. Strong correlations exist between Cell Size and Cell Shape, and between Bare Nuclei and Bland Chromatin. Cell Shape and Cell Size also strongly correlate with Marginal Adhesion and Bare Nuclei. Moderate positive correlations are present for Clump Thickness and Epithelial Cell Size with other features. Mitoses shows weaker positive correlations. The strong inter-correlations among certain features indicate potential shared biological underpinnings and suggest that these features might provide redundant information for modeling purposes. All displayed correlations are statistically significant; note that the p-values are unadjusted for multiple comparisons, since the matrix was drawn with p.adjust.method = "none".
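To turn the redundancy observation into a concrete list, one option is to extract the feature pairs whose absolute Pearson correlation exceeds a chosen cutoff. The helper name `high_cor_pairs()` and the cutoff of 0.8 are assumptions made for this sketch, and the data frame is a made-up toy example rather than the dataset.

```r
# List feature pairs whose absolute Pearson correlation reaches a cutoff
high_cor_pairs <- function(df, cutoff = 0.8) {
  num <- df[vapply(df, is.numeric, logical(1))]  # keep numeric columns only
  r <- cor(num, use = "pairwise.complete.obs")
  # Look only above the diagonal so each pair is reported once
  idx <- which(upper.tri(r) & abs(r) >= cutoff, arr.ind = TRUE)
  data.frame(
    feature_1 = rownames(r)[idx[, "row"]],
    feature_2 = colnames(r)[idx[, "col"]],
    r = r[idx],
    stringsAsFactors = FALSE
  )
}

# Made-up data for illustration only: f2 is an exact multiple of f1
toy <- data.frame(
  f1 = c(1, 2, 3, 4, 5),
  f2 = c(2, 4, 6, 8, 10),
  f3 = c(5, 1, 4, 2, 3)
)

pairs <- high_cor_pairs(toy, cutoff = 0.8)
print(pairs)
```

Calling `high_cor_pairs(bc_data)` would skip the `Class` factor automatically and return the candidate redundant pairs among the nine numeric features.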
Part 2: Interactive Visualizations with plotly
# Create a ggplot scatter plot with additional tooltip information
p_bc <- ggplot(bc_data, aes(x = Cl.thickness, y = Bare.nuclei, color = Class,
text = paste("Class: ", Class, "<br>",
"Clump Thickness: ", Cl.thickness, "<br>",
"Bare Nuclei: ", Bare.nuclei, "<br>",
"Cell Shape: ", Cell.shape, "<br>",
"Epith. Cell Size: ", Epith.c.size))) +
geom_point(size = 2, alpha = 0.8) +
labs(title = "Interactive Plot: Clump Thickness vs. Bare Nuclei",
subtitle = "Colored by Tumor Class",
x = "Clump Thickness",
y = "Bare Nuclei Count",
color = "Tumor Class") +
theme_minimal()
# Convert to an interactive Plotly object with custom tooltip and hidden modebar
interactive_plot <- ggplotly(p_bc, tooltip = "text") %>%
config(displayModeBar = FALSE)
# Display the interactive plot
interactive_plot
The scatter plot displays the relationship between Clump Thickness and Bare Nuclei Count, with points colored by Tumor Class (pink for benign, blue for malignant). The plot indicates a tendency for higher Clump Thickness to be associated with a greater Bare Nuclei Count, particularly within the malignant tumor class. Benign tumors appear to be concentrated at lower values for both Clump Thickness and Bare Nuclei Count. There is some overlap, especially at lower to moderate values, but the higher density of blue (malignant) points in the upper-right quadrant suggests a positive association between these two features and malignancy.
The visual separation of the two tumor classes across the scatter plot suggests that both Clump Thickness and Bare Nuclei Count are potentially useful features for distinguishing between benign and malignant tumors. Higher values of these features appear more indicative of malignancy. While lower values are more frequently associated with benign tumors, there are instances where malignant cases also present with lower values. The interactive nature of the original plot likely allowed for closer examination of point density and distribution for each tumor class at specific combinations of Clump Thickness and Bare Nuclei Count.
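Because each feature takes whole-number scores, many observations land on exactly the same point of the scatter plot, which makes point density hard to judge from a static image. One way to make it explicit is to count the observations at each combination of scores. The sketch below does this on made-up data with hypothetical column names; the resulting counts could then be mapped to point size, or jittering could be used instead.

```r
# Made-up observations for illustration only (hypothetical column names)
toy <- data.frame(
  thickness = c(1, 1, 1, 5, 5, 10),
  nuclei    = c(1, 1, 2, 5, 5, 10),
  class     = c("benign", "benign", "benign",
                "malignant", "malignant", "malignant")
)

# Count how many observations share each (thickness, nuclei, class) combination
toy$n <- 1
point_counts <- aggregate(n ~ thickness + nuclei + class, data = toy, FUN = sum)
print(point_counts)
```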
# Other cell characteristics (Cell Size and Cell Shape)
p_scatter <- ggplot(bc_data,
aes(x = Cell.size, y = Cell.shape, color = Class,
text = paste("Cell Size:", Cell.size, "<br>",
"Cell Shape:", Cell.shape, "<br>",
"Class:", Class))) +
geom_point(alpha = 0.7) +
labs(title = "Cell Size vs. Cell Shape by Tumor Type",
x = "Cell Size",
y = "Cell Shape") +
scale_color_manual(values = c("benign" = "#00BA38", "malignant" = "#F8766D")) +
theme_minimal()
ggplotly(p_scatter, tooltip = "text") %>%
config(displayModeBar = FALSE)
The scatter plot illustrates the relationship between Cell Size and Cell Shape, with points colored by Tumor Type (green for benign, pink for malignant). A clear trend emerges: malignant tumors (pink) are predominantly associated with higher values of both Cell Size and Cell Shape. Benign tumors (green), in contrast, are largely clustered at lower values for both characteristics. While some overlap exists in the lower ranges, the upper-right quadrant of the plot is heavily populated by malignant cases, indicating a positive correlation between these features and malignancy.
The distinct clustering of tumor types in different regions of the scatter plot suggests that Cell Size and Cell Shape are strong indicators for distinguishing between benign and malignant tumors. Higher values for both features are strongly suggestive of malignancy. Although lower values are more common in benign cases, they do not entirely exclude the possibility of malignancy. This positive association and the separation of clusters highlight the potential of these features for classification models.
This Exploratory Data Analysis of the Breast Cancer Wisconsin dataset revealed several meaningful patterns in how certain cell characteristics relate to whether a tumor is malignant or benign.
One of the standout findings was the link between clump thickness and tumor type. Malignant tumors tended to have noticeably thicker clumps of cells, and this difference was not just visually obvious; it was also statistically significant. This means clump thickness could be a strong sign of malignancy.
Even more striking was the difference in bare nuclei counts. Malignant tumors had much higher and more varied counts compared to benign ones. This suggests that the number of bare nuclei might be an even more powerful indicator than clump thickness.
The cell size comparison showed that malignant tumors generally had larger cells. This adds to the idea that differences in how cells look, in both shape and size, are key to telling the two types of tumors apart.
To make the comparison more visual, interactive histograms were drawn for all the cell features, split by tumor type. These charts clearly showed that many features tend to have higher values in malignant tumors, which supports their potential as useful markers. However, some features showed more overlap between benign and malignant cases, suggesting they may not be as strong on their own.
The analysis also explored how the different features relate to each other. There were strong positive correlations between several pairs, for example cell size and cell shape, as well as bare nuclei and bland chromatin. This means they often increase together, possibly because of some biological link or because they capture similar information. These insights can be helpful when choosing which features to use in future models.
Finally, interactive scatter plots showed how certain pairs of features, such as clump thickness and bare nuclei, or cell size and cell shape, work together to separate tumor types. Again, malignant tumors tended to group toward the higher ends of these scales, making these combinations especially useful for identifying them.
In summary, this analysis shows that some cell features, especially Clump Thickness, Bare Nuclei Count, Cell Size, and Cell Shape, are strongly linked to whether a tumor is cancerous in the Breast Cancer Wisconsin dataset. These features are important for distinguishing benign from malignant tumors, and they may also be helpful when creating tools that help doctors predict or diagnose cancer. The correlations between some features should also be kept in mind in future work, to avoid using features that are too similar to each other.