Loading the Necessary Libraries and Data
# Load necessary packages
library(tidyverse)
library(ggstatsplot)
library(plotly)
# Load the dataset
df <- read.csv("D:/School Related/Third Year/Analytics and Tools/breastcancerdataset.csv", stringsAsFactors = TRUE)
Exploring the Data
#Structure of the Data
str(df)
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
#Column names
colnames(df)
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave.points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave.points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave.points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
## [33] "X"
#Descriptive Stats
summary(df)
## id diagnosis radius_mean texture_mean
## Min. : 8670 B:357 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:212 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638
## Median :0.006380 Median :0.020450 Median :0.02589 Median :0.010930
## Mean :0.007041 Mean :0.025478 Mean :0.03189 Mean :0.011796
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710
## Max. :0.031130 Max. :0.135400 Max. :0.39600 Max. :0.052790
## symmetry_se fractal_dimension_se radius_worst texture_worst
## Min. :0.007882 Min. :0.0008948 Min. : 7.93 Min. :12.02
## 1st Qu.:0.015160 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08
## Median :0.018730 Median :0.0031870 Median :14.97 Median :25.41
## Mean :0.020542 Mean :0.0037949 Mean :16.27 Mean :25.68
## 3rd Qu.:0.023480 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72
## Max. :0.078950 Max. :0.0298400 Max. :36.04 Max. :49.54
## perimeter_worst area_worst smoothness_worst compactness_worst
## Min. : 50.41 Min. : 185.2 Min. :0.07117 Min. :0.02729
## 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720
## Median : 97.66 Median : 686.5 Median :0.13130 Median :0.21190
## Mean :107.26 Mean : 880.6 Mean :0.13237 Mean :0.25427
## 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910
## Max. :251.20 Max. :4254.0 Max. :0.22260 Max. :1.05800
## concavity_worst concave.points_worst symmetry_worst fractal_dimension_worst
## Min. :0.0000 Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.1145 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2267 Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.2722 Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3829 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :1.2520 Max. :0.29100 Max. :0.6638 Max. :0.20750
## X
## Mode:logical
## NA's:569
##
##
##
##
#Checking for missing values
colSums(is.na(df))
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave.points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave.points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave.points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst X
## 0 0 569
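The missing-value counts show that `X` is empty in all 569 rows, so it carries no information for the analysis. A minimal sketch of one way to drop such columns follows; the `toy` data frame and the `drop_empty_cols()` helper are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for df: the values are made up for illustration only
toy <- data.frame(
  id = 1:4,
  diagnosis = factor(c("B", "M", "B", "M")),
  area_mean = c(400, 1000, 450, 950),
  X = NA
)

# Hypothetical helper: keep only columns with at least one non-missing value
drop_empty_cols <- function(data) {
  data[, colSums(!is.na(data)) > 0, drop = FALSE]
}

toy_clean <- drop_empty_cols(toy)
colnames(toy_clean)
```

Applied to the real data, the same call would be `df <- drop_empty_cols(df)`.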
#View First Few Rows
head(df)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## area_se smoothness_se compactness_se concavity_se concave.points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.03003 0.006193 25.38 17.33 184.60
## 2 0.01389 0.003532 24.99 23.41 158.80
## 3 0.02250 0.004571 23.57 25.53 152.50
## 4 0.05963 0.009208 14.91 26.50 98.87
## 5 0.01756 0.005115 22.54 16.67 152.20
## 6 0.02165 0.005082 15.47 23.75 103.40
## area_worst smoothness_worst compactness_worst concavity_worst
## 1 2019.0 0.1622 0.6656 0.7119
## 2 1956.0 0.1238 0.1866 0.2416
## 3 1709.0 0.1444 0.4245 0.4504
## 4 567.7 0.2098 0.8663 0.6869
## 5 1575.0 0.1374 0.2050 0.4000
## 6 741.6 0.1791 0.5249 0.5355
## concave.points_worst symmetry_worst fractal_dimension_worst X
## 1 0.2654 0.4601 0.11890 NA
## 2 0.1860 0.2750 0.08902 NA
## 3 0.2430 0.3613 0.08758 NA
## 4 0.2575 0.6638 0.17300 NA
## 5 0.1625 0.2364 0.07678 NA
## 6 0.1741 0.3985 0.12440 NA
1. Is there a significant difference in the mean size of the core tumor between malignant and benign tumors?
# Relabel diagnosis once, then run a two-sample t-test (Welch by default in t.test)
df$diagnosis <- factor(df$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
t_test_result <- t.test(area_mean ~ diagnosis, data = df)
t_test_result
##
## Welch Two Sample t-test
##
## data: area_mean by diagnosis
## t = -19.641, df = 244.79, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Benign and group Malignant is not equal to 0
## 95 percent confidence interval:
## -567.2919 -463.8805
## sample estimates:
## mean in group Benign mean in group Malignant
## 462.7902 978.3764
The Welch two-sample t-test indicates a statistically significant difference in mean tumor area (area_mean) between benign and malignant tumors. The average area for benign tumors is approximately 462.79, while the average for malignant tumors is considerably higher at 978.38. The test yielded t = -19.64 on about 244.79 degrees of freedom with a p-value below 2.2e-16, far below the conventional significance level of 0.05, so a difference this large is very unlikely to arise from chance alone. The 95% confidence interval for the difference in means (benign minus malignant) runs from -567.29 to -463.88. Because the interval excludes zero and is entirely negative, malignant tumors have, on average, substantially larger areas than benign tumors.
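The numbers quoted above can be read directly from the object returned by `t.test()`, and a standardized effect size is a useful companion to the p-value. The sketch below uses simulated data as a stand-in for `df`; the `cohens_d()` helper is an assumption for illustration (pooled-SD Cohen's d), not something the original analysis computed.

```r
# Simulated stand-in for df: values are drawn at random for illustration
set.seed(1)
sim <- data.frame(
  diagnosis = factor(rep(c("Benign", "Malignant"), times = c(30, 20))),
  area_mean = c(rnorm(30, mean = 460, sd = 130),
                rnorm(20, mean = 980, sd = 370))
)

res <- t.test(area_mean ~ diagnosis, data = sim)  # Welch test by default

# Pull individual pieces out of the "htest" object
group_means <- res$estimate   # mean of each group
ci_diff     <- res$conf.int   # CI for (first level - second level)
p_value     <- res$p.value

# Hypothetical helper: Cohen's d with a pooled SD (second level minus first)
cohens_d <- function(x, g) {
  groups <- split(x, g)
  n <- lengths(groups)
  m <- vapply(groups, mean, numeric(1))
  v <- vapply(groups, var, numeric(1))
  pooled_sd <- sqrt(sum((n - 1) * v) / (sum(n) - 2))
  unname((m[2] - m[1]) / pooled_sd)
}

d <- cohens_d(sim$area_mean, sim$diagnosis)
```

On the real data the equivalent calls would be `t_test_result$conf.int` and `cohens_d(df$area_mean, df$diagnosis)`.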
# Reuse df, whose diagnosis column was already relabeled above,
# instead of re-reading the CSV under a second name
p <- ggbetweenstats(
  data = df,
  x = diagnosis,
  y = area_mean,
  type = "parametric",  # Welch's t-test for two groups
  title = "Comparison of Mean Tumor Area (area_mean) by Diagnosis",
  xlab = "Diagnosis",
  ylab = "Mean Tumor Area",
  results.subtitle = TRUE,
  ggtheme = ggplot2::theme_minimal()
)
ggplotly(p)
The plot of mean tumor area (area_mean) by diagnosis shows a clear separation between the two groups. Malignant tumors sit higher on the area axis, with noticeably higher median and mean values, and their values are more spread out, indicating greater variability in size among malignant cases. Benign tumors form a more compact distribution at lower values. The Welch's t-test reported with the plot agrees with this visual impression: the difference in mean area between the two groups is statistically significant. Tumor area therefore appears to be a useful feature for distinguishing benign from malignant breast tumors, with larger mean areas pointing toward malignancy.
2. Is there a significant relationship between mean compactness and mean concavity in tumors?
scatter_plot <- ggplot(df, aes(x = compactness_mean, y = concavity_mean)) +
geom_point(aes(text = paste("Diagnosis:", diagnosis)), alpha = 0.6, color = "#2C3E50") +
geom_smooth(method = "lm", se = TRUE, color = "#E74C3C") +
labs(
x = "Mean Compactness",
y = "Mean Concavity"
) +
theme_minimal()
ggplotly(scatter_plot, tooltip = "text")
The scatter plot above shows the relationship between mean compactness and mean concavity. The points follow a clear upward trend, which suggests a strong positive linear association: as the compactness of the cell nuclei increases, their concavity tends to increase as well. The fitted regression line and its confidence band are consistent with this, although the plot alone does not establish statistical significance; a formal test such as Pearson's correlation test would be needed to confirm it. The association is biologically plausible, since both features describe the shape and structure of cell nuclei, which tend to become more irregular and distorted in malignant tumors. If the relationship holds, mean compactness could be a good predictor of mean concavity, and vice versa, which may help in differentiating between tumor types.
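The question asks about significance, which a scatter plot cannot answer on its own. One way to quantify the relationship is Pearson's correlation test via `stats::cor.test()`. The sketch below runs it on simulated data as a stand-in for the two columns; the simulated values and the slope of 0.8 are assumptions chosen only to produce a positive trend.

```r
# Simulated stand-in for compactness_mean and concavity_mean (illustrative only)
set.seed(42)
compactness <- runif(100, min = 0.02, max = 0.35)
concavity   <- 0.8 * compactness + rnorm(100, sd = 0.03)

# Pearson correlation with a test of H0: true correlation equals 0
ct <- cor.test(compactness, concavity, method = "pearson")

r_estimate <- unname(ct$estimate)  # sample correlation coefficient
r_ci       <- ct$conf.int          # 95% confidence interval for r
r_p        <- ct$p.value
```

On the real data the call would be `cor.test(df$compactness_mean, df$concavity_mean)`. If the scatter shows curvature or heavy outliers, `method = "spearman"` is a rank-based alternative.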
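3. Is there a noticeable difference in the distribution of mean perimeter between benign and malignant tumors?
# diagnosis was already relabeled to "Benign"/"Malignant" for the t-test above.
# Re-applying factor() with levels c("B", "M") at this point would match none of
# the current values and turn the whole column into NA, so df is used as-is.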
perimeter_plot <- ggplot(df, aes(x = diagnosis, y = perimeter_mean, fill = diagnosis)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.4, color = "black") +
labs(
title = "Distribution of Mean Perimeter by Tumor Type",
x = "Diagnosis",
y = "Mean Perimeter"
) +
theme_minimal() +
scale_fill_manual(values = c("Benign" = "#00BFC4", "Malignant" = "#F8766D")) +
theme(legend.position = "none")
ggplotly(perimeter_plot, tooltip = c("x", "y"))
The box plot displays the distribution of mean perimeter for benign and malignant tumors. Malignant tumors tend to have higher perimeter values, with a higher median and upper quartile (Q3) than benign tumors, while benign tumors have a more compressed distribution at lower values. The figures quoted for this plot (a median of approximately 86.24, an upper fence of 147.30, and a maximum of 188.50) should be read with care: 86.24 and 188.50 match the overall median and maximum of perimeter_mean in the summary output above, so they describe all 569 tumors pooled rather than either group. The visual difference suggests that mean perimeter is a meaningful feature for distinguishing tumor types, with larger values associated with malignancy. The visualization supports the hypothesis that malignant tumors are generally larger in perimeter, and it invites a formal statistical test to confirm the significance of this observation.
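The follow-up this paragraph calls for could look like the sketch below: per-group quartiles, so that the quoted numbers refer to each diagnosis separately, and a Wilcoxon rank-sum test, which compares the two distributions without assuming normality. The data are simulated as a stand-in for `df`, and the choice of the Wilcoxon test is an assumption; a Welch t-test, as used for area_mean, would be an equally reasonable option.

```r
# Simulated stand-in for df: values are drawn at random for illustration
set.seed(7)
sim <- data.frame(
  diagnosis = factor(rep(c("Benign", "Malignant"), each = 40)),
  perimeter_mean = c(rnorm(40, mean = 78, sd = 12),
                     rnorm(40, mean = 115, sd = 22))
)

# Quartiles per diagnosis group (Q1, median, Q3)
group_quartiles <- tapply(
  sim$perimeter_mean, sim$diagnosis,
  quantile, probs = c(0.25, 0.5, 0.75)
)

# Median per group as a small data frame
group_medians <- aggregate(perimeter_mean ~ diagnosis, data = sim, FUN = median)

# Wilcoxon rank-sum test for a shift between the two distributions
wt <- wilcox.test(perimeter_mean ~ diagnosis, data = sim)
wt$p.value
```

On the real data, replacing `sim` with `df` gives the group-specific medians and quartiles that the box plot is meant to show.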