1. Introduction and Setup

The Breast Cancer Wisconsin (Original) dataset contains 699 fine needle aspirate (FNA) samples of breast masses. Each sample is described by nine cytological features (such as clump thickness, uniformity of cell size, and mitoses), each recorded as an integer score from 1 to 10. This analysis explores relationships between these features and tumor diagnosis (benign vs. malignant).
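The analysis below assumes a data frame named data and three packages (ggstatsplot, ggplot2, plotly), none of which this report loads explicitly. A minimal setup sketch follows. The file name is hypothetical, the headerless comma-separated layout is an assumption about the raw file, and the three inline records are copied from the str() output in Section 2 only to show the expected column layout.

```r
# Packages used later in the report; report any that are not installed
required <- c("ggstatsplot", "ggplot2", "plotly")
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Install before running the plots: ",
          paste(required[!installed], collapse = ", "))
}

# Column names as they appear in this report
col_names <- c("id", "clumpthickness", "uniformcellsize", "uniformcellshape",
               "margadhesion", "epithelial", "barenuclei", "blandchromatin",
               "normalnucleoli", "mitoses", "benormal")

# Hypothetical path (assumption): adjust to wherever the raw file is stored
# data <- read.csv("breast-cancer-wisconsin.data", header = FALSE, col.names = col_names)

# Three records taken from the str() output, used here only to check the layout
sample_rows <- read.csv(
  text = "1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2",
  header = FALSE, col.names = col_names
)
str(sample_rows)
```

In this small sample barenuclei is read as integer because every entry is numeric; in the full data it is read as character, which is examined in Section 2.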

2. Data Exploration and Preprocessing

Let’s examine the structure and characteristics of our dataset before proceeding with analysis.

# Check structure
str(data)

## 'data.frame':    699 obs. of  11 variables:
##  $ id              : int  1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
##  $ clumpthickness  : int  5 5 3 6 4 8 1 2 2 4 ...
##  $ uniformcellsize : int  1 4 1 8 1 10 1 1 1 2 ...
##  $ uniformcellshape: int  1 4 1 8 1 10 1 2 1 1 ...
##  $ margadhesion    : int  1 5 1 1 3 8 1 1 1 1 ...
##  $ epithelial      : int  2 7 2 3 2 7 2 2 2 2 ...
##  $ barenuclei      : chr  "1" "10" "2" "4" ...
##  $ blandchromatin  : int  3 3 3 3 3 9 3 3 1 2 ...
##  $ normalnucleoli  : int  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitoses         : int  1 1 1 1 1 1 1 1 5 1 ...
##  $ benormal        : int  2 2 2 2 2 4 2 2 2 2 ...

# Summary statistics
summary(data)

##        id           clumpthickness   uniformcellsize  uniformcellshape
##  Min.   :   61634   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:  870688   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 1171710   Median : 4.000   Median : 1.000   Median : 1.000  
##  Mean   : 1071704   Mean   : 4.418   Mean   : 3.134   Mean   : 3.207  
##  3rd Qu.: 1238298   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000  
##  Max.   :13454352   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##   margadhesion      epithelial      barenuclei        blandchromatin  
##  Min.   : 1.000   Min.   : 1.000   Length:699         Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 2.000   Class :character   1st Qu.: 2.000  
##  Median : 1.000   Median : 2.000   Mode  :character   Median : 3.000  
##  Mean   : 2.807   Mean   : 3.216                      Mean   : 3.438  
##  3rd Qu.: 4.000   3rd Qu.: 4.000                      3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000                      Max.   :10.000  
##  normalnucleoli      mitoses          benormal   
##  Min.   : 1.000   Min.   : 1.000   Min.   :2.00  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:2.00  
##  Median : 1.000   Median : 1.000   Median :2.00  
##  Mean   : 2.867   Mean   : 1.589   Mean   :2.69  
##  3rd Qu.: 4.000   3rd Qu.: 1.000   3rd Qu.:4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :4.00

# Check for missing values
colSums(is.na(data))

##               id   clumpthickness  uniformcellsize uniformcellshape 
##                0                0                0                0 
##     margadhesion       epithelial       barenuclei   blandchromatin 
##                0                0                0                0 
##   normalnucleoli          mitoses         benormal 
##                0                0                0
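The zero counts above deserve a second look: str() reports barenuclei as character while every other feature is integer, which suggests at least one non-numeric entry. is.na() only detects true NA values, so a placeholder string would be counted as present. The sketch below uses a toy vector, and the "?" placeholder is an assumption (it is the missing-value code in the UCI distribution of this dataset, but this file may differ).

```r
# Toy stand-in for data$barenuclei; the "?" entry is an assumed placeholder code
barenuclei_raw <- c("1", "10", "2", "4", "?", "1")

# is.na() sees no missing values because "?" is an ordinary string
n_na_before <- sum(is.na(barenuclei_raw))

# Flag entries that are not whole numbers
not_numeric <- !grepl("^[0-9]+$", barenuclei_raw)
table(barenuclei_raw[not_numeric])

# Recode the flagged entries to NA, then convert to integer
barenuclei_clean <- barenuclei_raw
barenuclei_clean[not_numeric] <- NA
barenuclei_num <- as.integer(barenuclei_clean)

n_na_after <- sum(is.na(barenuclei_num))
```

Applied to the real column, the same steps would show how many observations lack a bare nuclei score before any analysis uses that feature.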

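# Convert 'benormal' (2 = benign, 4 = malignant) to a labelled factor;
# stating the levels explicitly keeps the label order from depending on sort order
data$benormal <- factor(data$benormal, levels = c(2, 4), labels = c("Benign", "Malignant"))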
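The class balance can be read off the summary above: because benormal is coded 2 (benign) and 4 (malignant), its mean of 2.69 implies the share of malignant cases. A quick sketch of the arithmetic follows; the results are approximate because the reported mean is rounded to two decimals.

```r
# Values taken from the summary() and str() output in this section
mean_code <- 2.69   # mean of benormal while still coded 2 / 4
n_obs <- 699        # number of observations

# Share of malignant cases implied by the mean of a 2/4 coded variable
share_malignant <- (mean_code - 2) / (4 - 2)
round(share_malignant, 3)

# Approximate number of malignant and benign cases
approx_n_malignant <- round(share_malignant * n_obs)
approx_n_benign <- n_obs - approx_n_malignant

# With the factor in place, the exact figures would come from:
# prop.table(table(data$benormal))
```

Roughly one third of the samples are malignant, so the two groups compared in Section 4 are unequal in size.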

3. Statistical Questions and Hypotheses

Based on clinical relevance and initial data exploration, we can formulate the following research questions:

Question 1: Do clump thickness values differ between benign and malignant tumors?
Hypothesis: Malignant tumors will show significantly higher clump thickness values.

Question 2: Is there a correlation between uniform cell size and uniform cell shape?
Hypothesis: There will be a strong positive correlation between these features.

Question 3: Does mitoses count vary significantly between benign and malignant tumors?
Hypothesis: Malignant tumors will show significantly higher mitoses counts.

4. Statistical Analysis with ggstatsplot

4.1 Clump Thickness vs Tumor Type

# Comparing clump thickness between benign and malignant tumors
ggbetweenstats(
  data = data,
  x = benormal,
  y = clumpthickness,
  type = "parametric",  # Welch's t-test (var.equal = FALSE by default); scores are 1-10 integers, so normality is approximate
  messages = FALSE,
  title = "Clump Thickness by Tumor Type"
)

Interpretation: The analysis shows a highly statistically significant difference in clump thickness between benign and malignant tumors (p = 7.43e-78, well below the 0.05 threshold). Malignant tumors demonstrate substantially greater clump thickness (mean = 7.20) compared to benign tumors (mean = 2.96), suggesting that clump thickness could be a valuable diagnostic indicator for distinguishing between tumor types.
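The comparison above can be cross-checked in base R. The sketch below runs a Welch t-test and, because clump thickness is an integer score rather than a continuous measurement, a rank-based test alongside it. The toy data frame is made up for illustration and is not drawn from the dataset; only the column names match the report.

```r
# Illustrative toy data (NOT from the dataset); column names follow the report
toy <- data.frame(
  benormal = factor(rep(c("Benign", "Malignant"), times = c(6, 5))),
  clumpthickness = c(1, 2, 3, 3, 4, 5, 5, 7, 8, 10, 10)
)

# Group means, the quantity quoted in the interpretation
group_means <- tapply(toy$clumpthickness, toy$benormal, mean)

# Welch's t-test: t.test() does not assume equal variances by default
welch <- t.test(clumpthickness ~ benormal, data = toy)

# Rank-based cross-check; exact = FALSE because the scores contain ties
rank_sum <- wilcox.test(clumpthickness ~ benormal, data = toy, exact = FALSE)

c(welch = welch$p.value, rank_sum = rank_sum$p.value)
```

If both tests point the same way on the real data, the conclusion does not hinge on the normality assumption.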

4.2 Uniform Cell Size vs Shape Correlation

# Examining correlation between uniform cell size and shape
ggscatterstats(
  data = data,
  x = uniformcellsize,
  y = uniformcellshape,
  title = "Correlation between Uniform Cell Size and Shape"
)

## Registered S3 method overwritten by 'ggside':
##   method from   
##   +.gg   ggplot2

## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation: The scatter plot reveals a strong positive correlation between uniform cell size and uniform cell shape (r ≈ 0.9, p < 0.001). The two scores rise together consistently: samples that score higher on cell size uniformity also tend to score higher on cell shape uniformity. This strong relationship suggests the features capture related aspects of cellular abnormality and could potentially be combined, or one could stand in for the other, in diagnostic models.
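Because both features are integer scores with many tied values, a rank-based coefficient is a reasonable cross-check on the Pearson estimate. The sketch below compares the two on toy vectors that are made up for illustration, not taken from the dataset.

```r
# Illustrative toy scores (NOT from the dataset)
size  <- c(1, 1, 1, 2, 4, 8, 10, 10)
shape <- c(1, 1, 2, 1, 4, 8, 10, 8)

# Pearson measures linear association on the raw scores
pearson <- cor(size, shape, method = "pearson")

# Spearman works on ranks, so it only assumes a monotonic relationship
spearman <- cor(size, shape, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 2)
```

On the real columns, close agreement between the two coefficients would indicate that the reported correlation is not driven by a handful of extreme scores.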

4.3 Mitoses vs Tumor Type

# Comparing mitoses counts between tumor types
ggbetweenstats(
  data = data,
  x = benormal,
  y = mitoses,
  type = "nonparametric",  # Mann-Whitney U (Wilcoxon rank-sum) test; mitoses is strongly right-skewed (median and 3rd quartile are both 1)
  messages = FALSE,
  title = "Mitoses Count by Tumor Type"
)

Interpretation: The nonparametric test indicates a statistically significant difference in mitoses counts between benign and malignant tumors (p < 0.05). Malignant tumors typically show higher mitoses counts, which aligns with clinical knowledge that increased mitotic activity is associated with malignancy. While the difference is significant, the overlap in distributions suggests mitoses count alone may not be as strong a predictor as other features.
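The remark about overlapping distributions can be quantified with an effect size such as the rank-biserial correlation, which ranges from -1 to 1 and equals 0 when neither group tends to score higher. The helper below is a sketch: rank_biserial is a hypothetical name introduced here, and the toy vectors are made up for illustration.

```r
# Hypothetical helper: rank-biserial correlation from the Wilcoxon W statistic.
# W counts the (x, y) pairs where x > y, with ties counted as one half.
rank_biserial <- function(x, y) {
  w <- wilcox.test(x, y, exact = FALSE)$statistic
  unname(2 * w / (length(x) * length(y)) - 1)
}

# Illustrative toy mitoses scores (NOT from the dataset)
malignant <- c(1, 1, 2, 3, 10)
benign    <- c(1, 1, 1, 1, 2)

effect <- rank_biserial(malignant, benign)
effect
```

A value well below 1 alongside a small p-value matches the pattern described above: the groups differ, yet many individual scores overlap.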

5. Summary of Statistical Findings

Question	Statistical test	Visualization	Key findings	Interpretation
Do clump thickness values differ between benign and malignant tumors?	Welch's t-test (parametric)	ggbetweenstats()	Statistically significant difference (p = 7.43e-78). Mean clump thickness is higher in malignant tumors (7.20) than in benign tumors (2.96).	Clump thickness tends to be greater in malignant tumors, making it a potentially useful feature for classification.
Is there a correlation between uniform cell size and uniform cell shape?	Pearson correlation test	ggscatterstats()	Strong positive correlation (r ≈ 0.9), highly significant (p < 0.001).	These two variables move together, possibly reflecting a similar underlying pathological feature.
Does mitoses count vary significantly between benign and malignant tumors?	Wilcoxon rank-sum test (nonparametric)	ggbetweenstats()	Statistically significant difference (p < 0.05). Mitoses values are generally higher in malignant tumors.	Malignant tumors tend to show more mitotic activity, but the overlap between groups suggests mitoses alone is a weaker predictor than other features.

6. Interactive Visualizations with Plotly

Creating an interactive plot to explore relationships between key variables:

# Create a ggplot2 scatter plot with enhanced tooltips
p_bc <- ggplot(data, aes(x = clumpthickness, y = uniformcellsize, color = benormal,
                         text = paste("Clump Thickness: ", clumpthickness, "<br>",
                                      "Uniform Cell Size: ", uniformcellsize, "<br>",
                                      "Uniform Cell Shape: ", uniformcellshape, "<br>",
                                      "Bare Nuclei: ", barenuclei))) +
  geom_point(alpha = 0.8, size = 2) +
  labs(title = "Clump Thickness vs Uniform Cell Size",
       x = "Clump Thickness",
       y = "Uniform Cell Size",
       color = "Tumor Type") +
  theme_minimal()

# Convert to interactive plotly visualization and hide the mode bar
fig_bc <- ggplotly(p_bc, tooltip = "text") %>%
  plotly::config(displayModeBar = FALSE)

# Display the interactive plot
fig_bc

7. Conclusions and Next Steps

Analysis of the Breast Cancer Wisconsin dataset reveals:

Clear differentiation in cellular characteristics: Significant differences exist between benign and malignant tumors across multiple features, particularly clump thickness.
Feature relationships: Strong correlation between uniform cell size and shape suggests potential redundancy in these features for diagnostic purposes.
Diagnostic indicators: The combination of clump thickness, cell uniformity metrics, and mitoses counts provides a comprehensive view of cellular abnormalities associated with malignancy.

Limitations:
- This analysis is exploratory and does not establish causal relationships
- Dataset may contain sampling biases that could affect applicability

Future directions:
- Machine learning models could be built using these features to predict tumor type
- Additional correlations between other features could be explored
- Clinical validation of these findings with newer datasets would strengthen their utility

Analysis of Breast Cancer Wisconsin Dataset

Riza June Espino

2025-04-15