Load Breast Cancer dataset

breast_cancer <- read.csv("breast_cancer.csv", stringsAsFactors = TRUE)
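The later chunks call ggplot(), ggbetweenstats(), ggscatterstats(), ggcorrmat(), ggpiestats(), ggplotly() and %>%, so they presumably depend on a setup chunk that is not shown. A minimal sketch of such a chunk follows, assuming the three packages below are the ones intended. The package list is inferred from the function names and is not stated in the report.

```r
# Packages the later chunks appear to rely on (an assumption inferred from
# the functions they call; the original setup chunk is not shown).
required_pkgs <- c("ggplot2", "ggstatsplot", "plotly")

# requireNamespace() returns FALSE instead of stopping when a package is absent.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install before knitting: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach the packages so the plotting functions and %>% resolve.
  library(ggplot2)
  library(ggstatsplot)
  library(plotly)
}
```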

Introduction

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset contains 569 records and, as loaded here, 33 columns: an ID, the diagnosis (B = benign, M = malignant), 30 numeric features (ten measurements, each reported as a mean, a standard error, and a "worst" value), and one entirely empty column, X. The features describe characteristics of the cell nuclei in digitized images of fine needle aspirate (FNA) samples of breast masses. The dataset is widely used in machine learning research on breast cancer diagnosis.

Data Structure and Quality

Dimensions

dim(breast_cancer)
## [1] 569  33

Column Names

names(breast_cancer)
##  [1] "id"                      "diagnosis"              
##  [3] "radius_mean"             "texture_mean"           
##  [5] "perimeter_mean"          "area_mean"              
##  [7] "smoothness_mean"         "compactness_mean"       
##  [9] "concavity_mean"          "concave.points_mean"    
## [11] "symmetry_mean"           "fractal_dimension_mean" 
## [13] "radius_se"               "texture_se"             
## [15] "perimeter_se"            "area_se"                
## [17] "smoothness_se"           "compactness_se"         
## [19] "concavity_se"            "concave.points_se"      
## [21] "symmetry_se"             "fractal_dimension_se"   
## [23] "radius_worst"            "texture_worst"          
## [25] "perimeter_worst"         "area_worst"             
## [27] "smoothness_worst"        "compactness_worst"      
## [29] "concavity_worst"         "concave.points_worst"   
## [31] "symmetry_worst"          "fractal_dimension_worst"
## [33] "X"

Missing Values

colSums(is.na(breast_cancer))
##                      id               diagnosis             radius_mean 
##                       0                       0                       0 
##            texture_mean          perimeter_mean               area_mean 
##                       0                       0                       0 
##         smoothness_mean        compactness_mean          concavity_mean 
##                       0                       0                       0 
##     concave.points_mean           symmetry_mean  fractal_dimension_mean 
##                       0                       0                       0 
##               radius_se              texture_se            perimeter_se 
##                       0                       0                       0 
##                 area_se           smoothness_se          compactness_se 
##                       0                       0                       0 
##            concavity_se       concave.points_se             symmetry_se 
##                       0                       0                       0 
##    fractal_dimension_se            radius_worst           texture_worst 
##                       0                       0                       0 
##         perimeter_worst              area_worst        smoothness_worst 
##                       0                       0                       0 
##       compactness_worst         concavity_worst    concave.points_worst 
##                       0                       0                       0 
##          symmetry_worst fractal_dimension_worst                       X 
##                       0                       0                     569
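The only column with missing values is X, and it is missing in all 569 rows. A column like this often appears when every row of the CSV ends with a trailing delimiter. That is an assumption here, since the file itself is not shown. The sketch below shows one way to drop all-NA columns before modeling. The data frame `toy` is made up for illustration and stands in for breast_cancer.

```r
# Small made-up data frame with one entirely missing column, mimicking X.
toy <- data.frame(
  id          = 1:4,
  radius_mean = c(12.1, 17.9, 11.4, 20.6),
  X           = NA
)

# Flag columns in which every value is NA.
all_na <- vapply(toy, function(col) all(is.na(col)), logical(1))

# Keep only the columns that contain at least one observed value.
toy_clean <- toy[, !all_na, drop = FALSE]
names(toy_clean)
```

Applied to the real data, the same two lines would remove X and leave the other 32 columns untouched.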

Summary Statistics

summary(breast_cancer)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.06154   Median :0.03350     Median :0.1792   Median :0.06154       
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040   Max.   :0.09744       
##    radius_se        texture_se      perimeter_se       area_se       
##  Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
##  1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
##  Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
##  Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
##  3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
##  Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
##  smoothness_se      compactness_se      concavity_se     concave.points_se 
##  Min.   :0.001713   Min.   :0.002252   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638  
##  Median :0.006380   Median :0.020450   Median :0.02589   Median :0.010930  
##  Mean   :0.007041   Mean   :0.025478   Mean   :0.03189   Mean   :0.011796  
##  3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710  
##  Max.   :0.031130   Max.   :0.135400   Max.   :0.39600   Max.   :0.052790  
##   symmetry_se       fractal_dimension_se  radius_worst   texture_worst  
##  Min.   :0.007882   Min.   :0.0008948    Min.   : 7.93   Min.   :12.02  
##  1st Qu.:0.015160   1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08  
##  Median :0.018730   Median :0.0031870    Median :14.97   Median :25.41  
##  Mean   :0.020542   Mean   :0.0037949    Mean   :16.27   Mean   :25.68  
##  3rd Qu.:0.023480   3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72  
##  Max.   :0.078950   Max.   :0.0298400    Max.   :36.04   Max.   :49.54  
##  perimeter_worst    area_worst     smoothness_worst  compactness_worst
##  Min.   : 50.41   Min.   : 185.2   Min.   :0.07117   Min.   :0.02729  
##  1st Qu.: 84.11   1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720  
##  Median : 97.66   Median : 686.5   Median :0.13130   Median :0.21190  
##  Mean   :107.26   Mean   : 880.6   Mean   :0.13237   Mean   :0.25427  
##  3rd Qu.:125.40   3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910  
##  Max.   :251.20   Max.   :4254.0   Max.   :0.22260   Max.   :1.05800  
##  concavity_worst  concave.points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.0000   Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.1145   1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2267   Median :0.09993      Median :0.2822   Median :0.08004        
##  Mean   :0.2722   Mean   :0.11461      Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3829   3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :1.2520   Max.   :0.29100      Max.   :0.6638   Max.   :0.20750        
##     X          
##  Mode:logical  
##  NA's:569      
##                
##                
##                
## 

Bivariate Analysis

Do malignant tumors have a significantly larger mean radius than benign tumors?

ggbetweenstats(data = breast_cancer, x = diagnosis, y = radius_mean, title = "Do Malignant Tumors Have a Larger Mean Radius?", xlab = "Diagnosis", ylab = "Mean Radius", messages = FALSE)

Malignant tumors have a significantly larger mean radius than benign tumors. Both the classical test and the Bayesian analysis reported in the plot point to a large, highly significant difference in radius_mean between the two diagnosis groups. This strong evidence suggests that mean radius is a useful feature for distinguishing malignant from benign tumors.
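As a rough cross-check outside ggstatsplot, the same question can be put to a one-sided Welch t-test in base R. To my knowledge this is the parametric test ggbetweenstats reports for two groups, but treat that as an assumption. The sketch below runs on simulated data: the group sizes, means, and standard deviations are made up for illustration and are not the WDBC values.

```r
set.seed(42)

# Simulated stand-in for breast_cancer with hypothetical group parameters.
sim <- data.frame(
  diagnosis   = factor(rep(c("B", "M"), times = c(60, 40))),
  radius_mean = c(rnorm(60, mean = 12, sd = 1.8),
                  rnorm(40, mean = 17.5, sd = 3.2))
)

# Welch t-test (var.equal = FALSE is the default). With factor levels B, M,
# alternative = "less" tests whether mean(B) - mean(M) < 0, that is, whether
# malignant tumors have the larger mean radius.
welch <- t.test(radius_mean ~ diagnosis, data = sim, alternative = "less")

welch$estimate
welch$p.value
```

On the real data, the call would be the same with `data = breast_cancer`.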

Is there a significant association between tumor type and whether radius is above average?

breast_cancer$radius_cat <- ifelse(breast_cancer$radius_mean > mean(breast_cancer$radius_mean, na.rm = TRUE), "Above Avg", "Below Avg")
chisq.test(table(breast_cancer$radius_cat, breast_cancer$diagnosis), simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  table(breast_cancer$radius_cat, breast_cancer$diagnosis)
## X-squared = 288.15, df = NA, p-value = 0.0004998
ggplot(breast_cancer, aes(x = diagnosis, fill = radius_cat)) +
  geom_bar(position = "fill") +
  labs(title = "Tumor Type by Radius Category", y = "Proportion", fill = "Radius") +
  theme_minimal()

There is a significant association between tumor type and whether the radius is above average (X-squared = 288.15, simulated p-value ≈ 0.0005, the smallest value attainable with 2000 replicates). Malignant tumors are more likely to have an above-average radius, while benign tumors tend to fall below the average. The bar chart shows the same pattern, which supports the idea that radius size is meaningfully linked to tumor diagnosis.
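To make the test statistic less of a black box, the sketch below recomputes Pearson's X-squared by hand from a 2 x 2 table and compares it with chisq.test(). It also shows why a simulated p-value based on 2000 replicates cannot go below 1/2001. The counts are hypothetical and are not the table from this dataset.

```r
# Hypothetical 2 x 2 table (filled column by column): rows are radius
# categories, columns are diagnoses.
counts <- matrix(
  c(20, 80, 70, 30), nrow = 2,
  dimnames = list(radius_cat = c("Above Avg", "Below Avg"),
                  diagnosis  = c("B", "M"))
)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected.
stat_manual <- sum((counts - expected)^2 / expected)

# chisq.test() without continuity correction gives the same statistic
# (no correction is applied when simulate.p.value = TRUE either).
res <- chisq.test(counts, correct = FALSE)
all.equal(unname(res$statistic), stat_manual)

# Smallest simulated p-value with B = 2000 replicates: (0 + 1) / (B + 1).
min_sim_p <- 1 / (2000 + 1)
signif(min_sim_p, 4)
```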

Is there a correlation between radius and perimeter (mean)?

ggscatterstats(data = breast_cancer, x = radius_mean, y = perimeter_mean, title = "Correlation Between Radius and Perimeter (Mean)", xlab = "Radius Mean", ylab = "Perimeter Mean", messages = FALSE)
## Registered S3 method overwritten by 'ggside':
##   method from   
##   +.gg   ggplot2
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.

There is a near-perfect positive linear correlation between radius_mean and perimeter_mean in this dataset. As the radius increases, the perimeter increases almost proportionally, which is geometrically intuitive (for a circle, the perimeter is 2π times the radius). Given the statistical significance and near-perfect correlation coefficient, radius_mean is a strong predictor of perimeter_mean, and the two features carry largely the same information in modeling or diagnostic contexts.
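The geometric argument can be checked with a little arithmetic. For an ideal circle the perimeter-to-radius ratio is 2π ≈ 6.28, and the means in the summary output above give 91.97 / 14.127 ≈ 6.51, which is close. The sketch below computes that ratio, then simulates roughly circular shapes to show how small deviations from a circle still leave a correlation just short of 1. The simulated radii and the 2% noise level are assumptions chosen for illustration.

```r
# Ratio of the reported means (perimeter_mean / radius_mean) versus 2 * pi.
ratio_observed <- 91.97 / 14.127
ratio_circle   <- 2 * pi
c(observed = ratio_observed, circle = ratio_circle)

# Simulated near-circular shapes: perimeter = 2 * pi * r, perturbed by a
# hypothetical 2% multiplicative noise term.
set.seed(1)
radius    <- runif(100, min = 7, max = 28)
perimeter <- 2 * pi * radius * (1 + rnorm(100, sd = 0.02))

# Pearson correlation is very high but not exactly 1.
r_sim <- cor(radius, perimeter)
r_sim
```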

Are smoothness_mean and compactness_mean correlated?

ggcorrmat(data = breast_cancer, cor.vars = c("smoothness_mean", "compactness_mean", "concavity_mean", "symmetry_mean"), title = "Correlation Matrix of Cell Characteristics", colors = c("red", "white", "pink"), messages = FALSE)

The correlation between smoothness_mean and compactness_mean is moderately strong and positive, with a Pearson correlation coefficient of 0.66. Higher smoothness_mean values therefore tend to go with higher compactness_mean values. The relationship is statistically significant, as shown by the absence of an “X” mark in the correlation matrix, meaning the p-value is below 0.05 after Holm adjustment for multiple comparisons. In the WDBC feature definitions, smoothness measures the local variation in radius lengths, so higher values indicate more irregular contours. The positive correlation thus suggests that nuclei with more irregular contours also tend to have higher compactness values, which could have implications for understanding tumor structure and behavior.
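Four variables give choose(4, 2) = 6 pairwise correlations, which is why the p-values are adjusted before deciding which cells get crossed out. The Holm adjustment is available in base R through p.adjust(). The sketch below applies it to six hypothetical p-values, which are not the ones behind the matrix above, to show how the adjustment changes which comparisons stay below 0.05.

```r
# Number of pairwise correlations among four variables.
n_pairs <- choose(4, 2)

# Hypothetical raw p-values, one per pair (made up for illustration).
p_raw <- c(0.001, 0.02, 0.03, 0.04, 0.2, 0.6)

# Holm: multiply the i-th smallest p-value by (n - i + 1), then enforce
# monotonicity with a running maximum.
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(
  p_raw      = p_raw,
  p_holm     = p_holm,
  sig_before = p_raw < 0.05,
  sig_after  = p_holm < 0.05
)
```

In this made-up example four raw p-values are below 0.05, but only one remains significant after adjustment.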

What proportion of benign vs malignant tumors exist in the dataset?

ggpiestats(data = breast_cancer, x = diagnosis, title = "Proportion of Tumor Diagnoses in the Dataset", messages = FALSE)

In the dataset, 63% of the tumors are benign (B, n = 357) and 37% are malignant (M, n = 212), out of 569 observations, as shown in the pie chart. The chi-squared goodness-of-fit test indicates that the observed proportions differ significantly from an equal split between the two classes, confirming that benign cases are more common in this dataset.
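The proportions and the test can be reproduced in base R from the class counts in the summary output (B: 357, M: 212). This is a sketch of the calculation. It assumes that the test shown in the pie chart is a goodness-of-fit test against equal proportions.

```r
# Class counts taken from the summary() output above.
counts <- c(B = 357, M = 212)

# Observed proportions, rounded to two decimals.
props <- counts / sum(counts)
round(props, 2)

# Goodness-of-fit test against equal expected proportions (the default
# when chisq.test() is given a single vector of counts).
gof <- chisq.test(counts)
gof$statistic
gof$p.value
```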

Interactive Plots with plotly

# Create a ggplot2 scatter plot using breast_cancer data
p_breast_cancer <- ggplot(
  breast_cancer,
  aes(
    x = radius_mean, y = texture_mean, color = diagnosis,
    text = paste(
      "Radius Mean: ", radius_mean, "<br>",
      "Texture Mean: ", texture_mean, "<br>",
      "Perimeter Mean: ", perimeter_mean, "<br>",
      "Area Mean: ", area_mean
    )
  )
) +
  geom_point(alpha = 0.7) +
  labs(title = "Mean Radius vs Texture by Diagnosis", x = "Radius Mean", y = "Texture Mean", color = "Diagnosis") +
  theme_minimal()

# Convert to interactive plotly object with custom tooltip and hidden modebar
fig_breast_cancer <- ggplotly(p_breast_cancer, tooltip = "text") %>% layout(modebar = list(visible = FALSE))

# Display the interactive plot
fig_breast_cancer

The interactive scatter plot shows a visible separation between benign and malignant tumors in terms of radius_mean and texture_mean. Malignant tumors tend to have higher values for both features, which suggests that samples with larger nuclei and more variable texture are more likely to be malignant. This pattern reinforces the diagnostic value of these cellular characteristics.
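The tooltip in the plot above pastes the raw values into the label, so the number of decimals shown depends on the data. One possible variation, which is not the original code, is to format the numbers with sprintf() before passing the string to the text aesthetic. The sketch below builds such labels for a small made-up data frame `pts`.

```r
# Two made-up points standing in for rows of breast_cancer.
pts <- data.frame(
  radius_mean  = c(14.2, 20.573),
  texture_mean = c(19.5, 17.77)
)

# sprintf() is vectorized, so this yields one HTML-formatted label per row,
# with every value rounded to two decimals.
pts$tooltip <- sprintf(
  "Radius Mean: %.2f<br>Texture Mean: %.2f",
  pts$radius_mean, pts$texture_mean
)

pts$tooltip
```

The resulting column could then be mapped with `aes(text = tooltip)` and shown through `ggplotly(..., tooltip = "text")`, as in the chunk above.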