The Breast Cancer Wisconsin dataset contains cytological features recorded from fine needle aspirates (FNA) of breast masses. The features describe characteristics of the cells and cell nuclei in each sample, and in this version of the data each feature is scored as an integer from 1 to 10. This analysis explores relationships between these features and tumor diagnosis (benign vs. malignant).
Let’s examine the structure and characteristics of our dataset before proceeding with analysis.
## 'data.frame': 699 obs. of 11 variables:
## $ id : int 1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
## $ clumpthickness : int 5 5 3 6 4 8 1 2 2 4 ...
## $ uniformcellsize : int 1 4 1 8 1 10 1 1 1 2 ...
## $ uniformcellshape: int 1 4 1 8 1 10 1 2 1 1 ...
## $ margadhesion : int 1 5 1 1 3 8 1 1 1 1 ...
## $ epithelial : int 2 7 2 3 2 7 2 2 2 2 ...
## $ barenuclei : chr "1" "10" "2" "4" ...
## $ blandchromatin : int 3 3 3 3 3 9 3 3 1 2 ...
## $ normalnucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
## $ mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
## $ benormal : int 2 2 2 2 2 4 2 2 2 2 ...
##        id           clumpthickness   uniformcellsize  uniformcellshape
##  Min.   :   61634   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000
##  1st Qu.:  870688   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000
##  Median : 1171710   Median : 4.000   Median : 1.000   Median : 1.000
##  Mean   : 1071704   Mean   : 4.418   Mean   : 3.134   Mean   : 3.207
##  3rd Qu.: 1238298   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000
##  Max.   :13454352   Max.   :10.000   Max.   :10.000   Max.   :10.000
##   margadhesion      epithelial       barenuclei         blandchromatin
##  Min.   : 1.000   Min.   : 1.000   Length:699         Min.   : 1.000
##  1st Qu.: 1.000   1st Qu.: 2.000   Class :character   1st Qu.: 2.000
##  Median : 1.000   Median : 2.000   Mode  :character   Median : 3.000
##  Mean   : 2.807   Mean   : 3.216                      Mean   : 3.438
##  3rd Qu.: 4.000   3rd Qu.: 4.000                      3rd Qu.: 5.000
##  Max.   :10.000   Max.   :10.000                      Max.   :10.000
##  normalnucleoli      mitoses          benormal
##  Min.   : 1.000   Min.   : 1.000   Min.   :2.00
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:2.00
##  Median : 1.000   Median : 1.000   Median :2.00
##  Mean   : 2.867   Mean   : 1.589   Mean   :2.69
##  3rd Qu.: 4.000   3rd Qu.: 1.000   3rd Qu.:4.00
##  Max.   :10.000   Max.   :10.000   Max.   :4.00
##               id   clumpthickness  uniformcellsize uniformcellshape
##                0                0                0                0
##     margadhesion       epithelial       barenuclei   blandchromatin
##                0                0                0                0
##   normalnucleoli          mitoses         benormal
##                0                0                0
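The missing-value counts above are all zero, yet `barenuclei` is stored as character while every other feature is an integer, which usually means some entries are not numeric. The following is a minimal sketch of how such hidden missing values could be surfaced. It uses a small made-up data frame (`toy`) instead of the real data, and it assumes that the non-numeric entries are a placeholder such as `"?"` and that the class codes mean 2 = benign and 4 = malignant; both assumptions should be checked against the actual file.

```r
# Toy stand-in for the real data; values are illustrative only.
toy <- data.frame(
  barenuclei = c("1", "10", "?", "4", "1"),
  benormal   = c(2L, 4L, 2L, 4L, 2L),
  stringsAsFactors = FALSE
)

# is.na() does not recognize the placeholder, so the count is zero.
na_before <- sum(is.na(toy$barenuclei))

# Assumption: "?" marks a missing value. Replace it with NA, then convert.
toy$barenuclei[toy$barenuclei == "?"] <- NA
toy$barenuclei <- as.integer(toy$barenuclei)
na_after <- sum(is.na(toy$barenuclei))

# Assumption: 2 = benign, 4 = malignant.
toy$diagnosis <- factor(toy$benormal, levels = c(2, 4),
                        labels = c("Benign", "Malignant"))

print(c(before = na_before, after = na_after))
print(table(toy$diagnosis))
```

Recoding the class as a labeled factor also makes group labels in later plots and tables easier to read than the raw codes 2 and 4.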
Based on clinical relevance and initial data exploration, we can formulate the following research questions:

Question 1: Do clump thickness values differ between benign and malignant tumors?
Hypothesis: Malignant tumors will show significantly higher clump thickness values.

Question 2: Is there a correlation between uniform cell size and uniform cell shape?
Hypothesis: There will be a strong positive correlation between these features.

Question 3: Does mitoses count vary significantly between benign and malignant tumors?
Hypothesis: Malignant tumors will show significantly higher mitoses counts.
# Comparing clump thickness between benign and malignant tumors
library(ggstatsplot)
ggbetweenstats(
  data = data,
  x = benormal,
  y = clumpthickness,
  type = "parametric", # Welch's t-test by default; scores are ordinal (1-10), so normality is only approximate
  messages = FALSE,
  title = "Clump Thickness by Tumor Type"
)

Interpretation: The analysis shows a highly significant difference in clump thickness between benign and malignant tumors (p = 7.43e-78, far below the 0.05 threshold). Malignant tumors have substantially greater clump thickness (mean = 7.20) than benign tumors (mean = 2.96), which suggests that clump thickness could be a valuable diagnostic indicator for distinguishing between tumor types.
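As a rough base-R counterpart to the comparison above, the sketch below computes group means and runs a Welch two-sample t-test. The data frame `toy` and its values are invented for illustration and are not the real dataset, so the numbers it prints will not match the reported p-value or means.

```r
# Toy scores on the 1-10 scale; illustrative values, not the real data.
toy <- data.frame(
  clumpthickness = c(1, 3, 2, 4, 3, 5, 8, 7, 10, 6, 9, 8),
  diagnosis      = factor(rep(c("Benign", "Malignant"), each = 6))
)

# Group means, analogous to the means reported alongside the plot.
group_means <- tapply(toy$clumpthickness, toy$diagnosis, mean)

# t.test() with a formula runs Welch's test by default (var.equal = FALSE).
welch <- t.test(clumpthickness ~ diagnosis, data = toy)

print(group_means)
print(welch$method)
print(welch$p.value)
```

Welch's test does not assume equal variances in the two groups, which is a sensible default when the benign and malignant scores are spread very differently.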
# Examining correlation between uniform cell size and shape
ggscatterstats(
  data = data,
  x = uniformcellsize,
  y = uniformcellshape,
  title = "Correlation between Uniform Cell Size and Shape"
)
Interpretation: The scatter plot reveals a strong positive correlation between uniform cell size and uniform cell shape (r ≈ 0.9, p < 0.001). The two scores rise and fall together: samples that score high on cell size also tend to score high on cell shape. This strong relationship suggests that the features measure related aspects of cellular abnormality, so a diagnostic model may gain little from including both.
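The correlation reported above can be cross-checked with base R. The sketch below uses invented scores (`toy`), not the real data, and computes both a Pearson coefficient and a rank-based Spearman coefficient. Spearman is included as an assumption-light alternative because the features are ordinal scores with many ties.

```r
# Toy paired scores on the 1-10 scale; illustrative values only.
toy <- data.frame(
  uniformcellsize  = c(1, 1, 2, 3, 4, 5, 7, 8, 10, 10),
  uniformcellshape = c(1, 2, 1, 3, 5, 4, 8, 7, 10, 9)
)

# Pearson correlation: linear association between the raw scores.
pearson <- cor.test(toy$uniformcellsize, toy$uniformcellshape,
                    method = "pearson")

# Spearman correlation: association between ranks.
# exact = FALSE avoids the exact-p-value warning that ties would trigger.
spearman <- cor.test(toy$uniformcellsize, toy$uniformcellshape,
                     method = "spearman", exact = FALSE)

print(unname(pearson$estimate))
print(unname(spearman$estimate))
```

If the two coefficients agree closely, the conclusion does not hinge on treating the 1-10 scores as interval data.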
# Comparing mitoses counts between tumor types
ggbetweenstats(
  data = data,
  x = benormal,
  y = mitoses,
  type = "nonparametric", # Mann-Whitney U (Wilcoxon rank-sum) test; mitoses scores are heavily skewed toward 1
  messages = FALSE,
  title = "Mitoses Count by Tumor Type"
)

Interpretation: The nonparametric test indicates a statistically significant difference in mitoses counts between benign and malignant tumors (p < 0.05). Malignant tumors typically show higher mitoses counts, which aligns with clinical knowledge that increased mitotic activity is associated with malignancy. Although the difference is significant, the overlap between the two distributions suggests that mitoses count alone may not be as strong a predictor as other features.
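The sketch below shows a base-R version of this kind of comparison, with a simple way to quantify the overlap mentioned above. The values in `toy` are made up for illustration; only the pattern (most scores equal to 1 in both groups) mirrors the summary statistics shown earlier.

```r
# Toy mitoses scores; illustrative values, not the real data.
toy <- data.frame(
  mitoses   = c(1, 1, 1, 1, 2, 1, 1, 3, 2, 7, 1, 10),
  diagnosis = factor(rep(c("Benign", "Malignant"), each = 6))
)

# Medians per group; the rank-sum test compares locations without assuming normality.
medians <- tapply(toy$mitoses, toy$diagnosis, median)

# Share of samples with the minimum score in each group, as a rough overlap measure.
share_one <- tapply(toy$mitoses == 1, toy$diagnosis, mean)

# Wilcoxon rank-sum (Mann-Whitney U) test; exact = FALSE because the data contain ties.
mw <- wilcox.test(mitoses ~ diagnosis, data = toy, exact = FALSE)

print(medians)
print(share_one)
print(mw$p.value)
```

When a sizable share of malignant samples also has the minimum score, the feature can differ significantly between groups and still separate individual cases poorly.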
| Question | Statistical test | Visualization | Key findings | Interpretation |
|---|---|---|---|---|
| Do clump thickness values differ between benign and malignant tumors? | Independent Samples t-test (parametric) | ggbetweenstats() | Statistically significant p-value (p < 0.05). Mean clump thickness is higher in malignant tumors. | Clump thickness tends to be greater in malignant tumors, making it a potentially useful feature for classification. |
| Is there a correlation between uniform cell size and uniform cell shape? | Pearson correlation test | ggscatterstats() | Strong positive correlation (r ≈ 0.9). p-value was highly significant (p < 0.001). | These two variables move together, possibly reflecting a similar underlying pathological feature. |
| Does mitoses count vary significantly between benign and malignant tumors? | Wilcoxon Rank-Sum Test (nonparametric) | ggbetweenstats() | Statistically significant difference (p < 0.05). Mitoses values are generally higher in malignant tumors. | Malignant tumors tend to have more mitotic activity, reinforcing mitoses as an important diagnostic feature. |
Creating an interactive plot to explore relationships between key variables:
# Create a ggplot2 scatter plot with enhanced tooltips
library(ggplot2)
library(plotly)
# benormal is an integer code, so convert it to a factor to get a discrete legend
# (assumption: 2 = benign, 4 = malignant)
p_bc <- ggplot(data, aes(x = clumpthickness, y = uniformcellsize,
                         color = factor(benormal, levels = c(2, 4),
                                        labels = c("Benign", "Malignant")),
                         text = paste("Clump Thickness: ", clumpthickness, "<br>",
                                      "Uniform Cell Size: ", uniformcellsize, "<br>",
                                      "Uniform Cell Shape: ", uniformcellshape, "<br>",
                                      "Bare Nuclei: ", barenuclei))) +
  geom_point(alpha = 0.8, size = 2) +
  labs(title = "Clump Thickness vs Uniform Cell Size",
       x = "Clump Thickness",
       y = "Uniform Cell Size",
       color = "Tumor Type") +
  theme_minimal()
# Convert to an interactive plotly visualization and hide the mode bar
fig_bc <- ggplotly(p_bc, tooltip = "text") %>%
  config(displayModeBar = FALSE)
# Display the interactive plot
fig_bc

Analysis of the Breast Cancer Wisconsin dataset reveals:
Clear differentiation in cellular characteristics: Significant differences exist between benign and malignant tumors across multiple features, particularly clump thickness.
Feature relationships: Strong correlation between uniform cell size and shape suggests potential redundancy in these features for diagnostic purposes.
Diagnostic indicators: The combination of clump thickness, cell uniformity metrics, and mitoses counts provides a comprehensive view of cellular abnormalities associated with malignancy.
Limitations:
- This analysis is exploratory and does not establish causal relationships.
- The dataset may contain sampling biases that limit how well these findings generalize.

Future directions:
- Machine learning models could be built using these features to predict tumor type.
- Correlations among the remaining features could be explored.
- Clinical validation of these findings with newer datasets would strengthen their utility.
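As one possible starting point for the modeling direction, the sketch below fits a single-predictor logistic regression with base R's `glm()`. The data frame `toy`, its values, and the 0.5 classification threshold are assumptions made for illustration. The accuracy is computed on the same toy rows used for fitting, so it says nothing about performance on real data.

```r
# Toy data: clump thickness scores with overlapping classes (illustrative only).
toy <- data.frame(
  clumpthickness = c(1, 2, 3, 3, 4, 5, 5, 6,    # benign samples
                     4, 5, 6, 7, 8, 8, 9, 10),  # malignant samples
  malignant      = c(rep(0, 8), rep(1, 8))
)

# Logistic regression: log-odds of malignancy as a linear function of the score.
fit <- glm(malignant ~ clumpthickness, data = toy, family = binomial)

# Predicted probability of malignancy for each row.
prob <- predict(fit, type = "response")

# In-sample accuracy at an assumed 0.5 threshold.
accuracy <- mean((prob > 0.5) == toy$malignant)

print(coef(fit))
print(accuracy)
```

A real analysis would use several features, handle the missing `barenuclei` values, and evaluate the model on held-out data or with cross-validation.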