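library(ggplot2)      # ggplot(), geom_bar(), geom_point()
library(ggstatsplot)  # ggbetweenstats(), ggscatterstats(), ggcorrmat(), ggpiestats()
library(plotly)       # ggplotly(); also re-exports %>%
breast_cancer <- read.csv("breast_cancer.csv", stringsAsFactors = TRUE)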
The Wisconsin Diagnostic Breast Cancer (WDBC) dataset contains 569 records and 33 columns related to breast cancer diagnosis: an id, the diagnosis (B for benign, M for malignant), 30 tumor characteristics, and an empty trailing column X. The characteristics are the mean, standard error, and worst value of ten measurements extracted from digitized images of fine needle aspirate (FNA) samples of breast masses. This dataset is highly relevant for medical diagnosis and machine learning research in breast cancer detection.
dim(breast_cancer)
## [1] 569 33
names(breast_cancer)
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave.points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave.points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave.points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
## [33] "X"
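The column names follow a regular pattern: ten measurements, each reported as a mean, a standard error (se), and a worst value, plus id, diagnosis, and X. As a small sketch, the expected names can be rebuilt and compared with names(breast_cancer); the helper vectors features, stats, and expected_names are introduced here for illustration only.

```r
# Ten base measurements, each reported three ways
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave.points", "symmetry",
              "fractal_dimension")
stats <- c("mean", "se", "worst")

# outer() builds a 10 x 3 matrix of names; as.vector() reads it column by
# column, so all *_mean names come first, then *_se, then *_worst
measure_cols <- as.vector(outer(features, stats, paste, sep = "_"))

expected_names <- c("id", "diagnosis", measure_cols, "X")
length(expected_names)
```

With the data loaded, identical(names(breast_cancer), expected_names) would confirm that the file has the layout shown above.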
colSums(is.na(breast_cancer))
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave.points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave.points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave.points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst X
## 0 0 569
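The only missing values are in X, which is NA in all 569 rows, so the column carries no information and can be dropped before any modeling. A minimal sketch on a made-up toy data frame (toy and its values are illustrative and not taken from the dataset):

```r
# Toy stand-in for breast_cancer; the values are invented
toy <- data.frame(
  id          = c(1, 2, 3),
  diagnosis   = factor(c("B", "M", "B")),
  radius_mean = c(11.2, 18.5, 12.9),
  X           = c(NA, NA, NA)
)

# Flag the columns in which every value is missing
all_na <- vapply(toy, function(col) all(is.na(col)), logical(1))

# Keep only the columns that carry data
toy_clean <- toy[, !all_na, drop = FALSE]
names(toy_clean)
```

The same two lines applied to breast_cancer would remove X and leave the other 32 columns untouched.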
summary(breast_cancer)
## id diagnosis radius_mean texture_mean
## Min. : 8670 B:357 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:212 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638
## Median :0.006380 Median :0.020450 Median :0.02589 Median :0.010930
## Mean :0.007041 Mean :0.025478 Mean :0.03189 Mean :0.011796
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710
## Max. :0.031130 Max. :0.135400 Max. :0.39600 Max. :0.052790
## symmetry_se fractal_dimension_se radius_worst texture_worst
## Min. :0.007882 Min. :0.0008948 Min. : 7.93 Min. :12.02
## 1st Qu.:0.015160 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08
## Median :0.018730 Median :0.0031870 Median :14.97 Median :25.41
## Mean :0.020542 Mean :0.0037949 Mean :16.27 Mean :25.68
## 3rd Qu.:0.023480 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72
## Max. :0.078950 Max. :0.0298400 Max. :36.04 Max. :49.54
## perimeter_worst area_worst smoothness_worst compactness_worst
## Min. : 50.41 Min. : 185.2 Min. :0.07117 Min. :0.02729
## 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720
## Median : 97.66 Median : 686.5 Median :0.13130 Median :0.21190
## Mean :107.26 Mean : 880.6 Mean :0.13237 Mean :0.25427
## 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910
## Max. :251.20 Max. :4254.0 Max. :0.22260 Max. :1.05800
## concavity_worst concave.points_worst symmetry_worst fractal_dimension_worst
## Min. :0.0000 Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.1145 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2267 Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.2722 Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3829 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :1.2520 Max. :0.29100 Max. :0.6638 Max. :0.20750
## X
## Mode:logical
## NA's:569
##
##
##
##
Do malignant tumors have a significantly larger mean radius than benign tumors?
ggbetweenstats(data = breast_cancer, x = diagnosis, y = radius_mean, title = "Do Malignant Tumors Have a Larger Mean Radius?", xlab = "Diagnosis", ylab = "Mean Radius", messages = FALSE)
Malignant tumors have a significantly larger mean radius than benign
tumors. This is supported by both classical and Bayesian statistical
analyses, showing a large and highly significant difference in mean
tumor size between the two diagnosis groups. The strong statistical
evidence suggests that mean radius can be a useful feature for
distinguishing between malignant and benign tumors.
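For a two-group parametric comparison, ggbetweenstats is understood to default to a Welch t-test. Assuming that default, a rough base-R counterpart looks like the sketch below. The data are simulated: the group sizes, means, and standard deviations are invented for illustration and are not the dataset's values.

```r
set.seed(1)

# Simulated stand-in for the two diagnosis groups (not the real data)
sim <- data.frame(
  diagnosis   = factor(rep(c("B", "M"), times = c(60, 40))),
  radius_mean = c(rnorm(60, mean = 12, sd = 2),
                  rnorm(40, mean = 17, sd = 3))
)

# Welch t-test (unequal variances is the t.test() default).
# With levels B then M, alternative = "less" tests mean(B) < mean(M),
# which is the one-sided form of "malignant tumors are larger".
welch <- t.test(radius_mean ~ diagnosis, data = sim, alternative = "less")
welch$method
welch$p.value
```

Replacing sim with breast_cancer runs the same one-sided test on the real measurements.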
Is there a significant association between tumor type and whether radius is above average?
breast_cancer$radius_cat <- ifelse(breast_cancer$radius_mean > mean(breast_cancer$radius_mean, na.rm = TRUE), "Above Avg", "Below Avg")
chisq.test(table(breast_cancer$radius_cat, breast_cancer$diagnosis), simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: table(breast_cancer$radius_cat, breast_cancer$diagnosis)
## X-squared = 288.15, df = NA, p-value = 0.0004998
ggplot(breast_cancer, aes(x = diagnosis, fill = radius_cat)) + geom_bar(position = "fill") + labs(title = "Tumor Type by Radius Category", y = "Proportion", fill = "Radius") + theme_minimal()
There is a significant association between tumor type and whether the
radius is above average (X-squared = 288.15, simulated p-value =
0.0004998, the smallest value that 2000 replicates can produce). The
stacked bar chart shows the direction of the association: malignant
tumors are more likely to have an above-average radius, while benign
tumors tend to fall below the average. Radius size is therefore
meaningfully linked to tumor diagnosis.
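Two details of the test output deserve a note. With simulate.p.value = TRUE the p-value comes from Monte Carlo replicates, so df is reported as NA and the p-value cannot fall below 1 / (B + 1). The sketch below checks that floor and, on a made-up 2 x 2 table (toy_tab and its counts are illustrative only), shows that the asymptotic test reports one degree of freedom.

```r
# Smallest p-value a Monte Carlo test with B replicates can report
B <- 2000
min_p <- 1 / (B + 1)
signif(min_p, 4)  # 0.0004998, the value printed by chisq.test() above

# Made-up 2 x 2 table; the counts are illustrative, not from the dataset
toy_tab <- matrix(
  c(40, 10, 8, 42), nrow = 2,
  dimnames = list(radius_cat = c("Above Avg", "Below Avg"),
                  diagnosis  = c("B", "M"))
)

# Asymptotic Pearson test without continuity correction
asym <- chisq.test(toy_tab, correct = FALSE)
asym$parameter  # degrees of freedom: (2 - 1) * (2 - 1) = 1
```

The reported p-value of 0.0004998 therefore means that none of the 2000 simulated tables produced a statistic as large as the observed 288.15.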
Is there a correlation between radius and perimeter (mean)?
ggscatterstats(data = breast_cancer, x = radius_mean, y = perimeter_mean, title = "Correlation Between Radius and Perimeter (Mean)", xlab = "Radius Mean", ylab = "Perimeter Mean", messages = FALSE)
## Registered S3 method overwritten by 'ggside':
## method from
## +.gg ggplot2
## `stat_xsidebin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_ysidebin()` using `bins = 30`. Pick better value with `binwidth`.
There is a near-perfect linear correlation between Radius Mean and
Perimeter Mean in this dataset. As the radius of a tumor increases, the
perimeter increases almost proportionally, which is also geometrically
intuitive (since perimeter depends on radius in circular-like shapes).
Given the statistical significance and near-perfect correlation
coefficient, Radius Mean can be considered a strong predictor of
Perimeter Mean in modeling or diagnostic contexts.
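The geometric argument can be made concrete. For exact circles the perimeter is 2 * pi * r, so the correlation with the radius is exactly 1. The sketch below uses an arbitrary vector of radii for that check, then compares the circle formula with the mean radius (14.127) and mean perimeter (91.97) from the summary output above.

```r
# Arbitrary radii; for perfect circles perimeter is a linear function of r
r <- c(7, 10, 14, 20, 28)
p <- 2 * pi * r
cor(r, p)  # 1, up to floating-point error

# Means taken from summary(breast_cancer)
mean_radius    <- 14.127
mean_perimeter <- 91.97

# Perimeter of a circle with the mean radius, and the observed ratio
circle_perimeter <- 2 * pi * mean_radius
ratio <- mean_perimeter / circle_perimeter
ratio
```

The ratio comes out slightly above 1, which is consistent with nuclei whose contours are close to circular but somewhat irregular.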
Are smoothness_mean and compactness_mean correlated?
ggcorrmat(data = breast_cancer, cor.vars = c("smoothness_mean", "compactness_mean", "concavity_mean", "symmetry_mean"), title = "Correlation Matrix of Cell Characteristics", colors = c("red", "white", "pink"), messages = FALSE)
The correlation between smoothness_mean and compactness_mean is
moderately strong and positive, with a Pearson correlation coefficient
of 0.66. This indicates that as smoothness_mean increases,
compactness_mean tends to increase as well. The relationship is
statistically significant, as shown by the absence of an “X” mark in
the correlation matrix, meaning the p-value is less than 0.05 after Holm
adjustment for multiple comparisons. This suggests that nuclei with
higher smoothness_mean values also tend to have higher compactness_mean
values, which could have implications for understanding tumor structure
and behavior.
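The Holm adjustment mentioned above can be reproduced with p.adjust() in base R. Four variables give choose(4, 2) = 6 pairwise correlations, so six p-values are adjusted together. The raw p-values in this sketch are hypothetical and chosen only to show how the adjustment behaves.

```r
# Number of pairwise correlations among four variables
n_tests <- choose(4, 2)

# Hypothetical raw p-values, one per pair (not taken from the analysis)
p_raw <- c(0.001, 0.01, 0.02, 0.03, 0.04, 0.2)

# Holm: multiply the i-th smallest p-value by (m - i + 1), then enforce
# monotonicity with a running maximum
p_holm <- p.adjust(p_raw, method = "holm")
p_holm  # 0.006 0.050 0.080 0.090 0.090 0.200

# Pairs that would remain significant at the 0.05 level
p_holm < 0.05
```

In this made-up example five raw p-values are below 0.05 but only one survives the adjustment, which is why an unmarked cell in the matrix is stronger evidence than an unadjusted p-value below 0.05.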
What proportion of benign vs malignant tumors exist in the dataset?
ggpiestats(data = breast_cancer, x = diagnosis, title = "Proportion of Tumor Diagnoses in the Dataset", messages = FALSE)
In the dataset, 63% of the tumors are benign (B) and 37% are
malignant (M): 357 and 212 of the 569 observations, as shown in the
pie chart. The chi-squared goodness-of-fit test indicates that the
observed proportions differ significantly from equal proportions,
confirming that benign cases are more common in this dataset.
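The percentages and the goodness-of-fit test can be checked in base R from the diagnosis counts in the summary output (B: 357, M: 212). The sketch assumes the test compares the counts with equal expected proportions, which is the default for chisq.test() on a vector of counts.

```r
# Diagnosis counts from summary(breast_cancer)
counts <- c(B = 357, M = 212)

# Percentages, rounded to whole numbers
pct <- round(100 * counts / sum(counts))
pct  # B 63, M 37

# Goodness-of-fit test against equal proportions (the default p)
gof <- chisq.test(counts)
gof$statistic  # about 36.95
gof$parameter  # df = 1
gof$p.value
```

Each group has an expected count of 284.5 under equal proportions, so the statistic is 2 * 72.5^2 / 284.5.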
# Create a ggplot2 scatter plot using breast_cancer data
p_breast_cancer <- ggplot(breast_cancer, aes(x = radius_mean, y = texture_mean, color = diagnosis, text = paste("Radius Mean: ", radius_mean, "<br>", "Texture Mean: ", texture_mean, "<br>", "Perimeter Mean: ", perimeter_mean, "<br>", "Area Mean: ", area_mean))) + geom_point(alpha = 0.7) + labs(title = "Mean Radius vs Texture by Diagnosis", x = "Radius Mean", y = "Texture Mean", color = "Diagnosis") + theme_minimal()
# Convert to an interactive plotly object with a custom tooltip;
# config(displayModeBar = FALSE) hides the modebar
fig_breast_cancer <- ggplotly(p_breast_cancer, tooltip = "text") %>% config(displayModeBar = FALSE)
# Display the interactive plot
fig_breast_cancer
The interactive scatter plot reveals a clear distinction between benign and malignant tumors based on their radius_mean and texture_mean. Malignant tumors tend to exhibit higher values for both features, implying that tumors with larger nuclei and more variable texture are more likely to be malignant. This pattern reinforces the diagnostic value of these cellular characteristics.