According to Kalshtein Yael (2017), breast cancer is the most common malignancy among women, accounting for nearly one in three cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer; tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.
This analysis aims to classify breast tumors as benign or malignant. To achieve this, the objectives are to identify which features are most helpful in predicting a malignant or benign diagnosis, and to use visualizations to reveal general trends that may help in classifying the diagnosis.
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave.points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave.points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave.points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst X
## 0 0 569
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave.points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave.points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave.points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
## [33] "X"
The data file contains 569 rows and 33 columns (variables), including id, diagnosis, radius_mean, perimeter_mean, radius_se, and radius_worst. None of the columns has missing values except X, which is missing (NA) in all 569 rows.
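The checks that produce output like the above can be sketched as follows. This is a minimal illustration, not the exact code of this report: the data frame name `cancer`, the file name in the comment, and the step that drops the empty column are assumptions, and the four rows are rounded values copied from the `str()` output.

```r
# Small excerpt with rounded values taken from the str() output above.
# In the full analysis the data would be read from file, e.g.
# cancer <- read.csv("data.csv")   # file name is an assumption
cancer <- data.frame(
  id           = c(842302, 842517, 84300903, 84348301),
  diagnosis    = c("M", "M", "M", "M"),
  radius_mean  = c(18, 20.6, 19.7, 11.4),
  texture_mean = c(10.4, 17.8, 21.2, 20.4),
  X            = NA  # recycled to every row, mimicking the all-NA column
)

dim(cancer)                          # number of rows and columns
na_counts <- colSums(is.na(cancer))  # missing values per column
na_counts
str(cancer)                          # column types and first values
names(cancer)                        # column names

# Optional cleanup: drop columns that are missing in every row
all_na <- na_counts == nrow(cancer)
cancer_clean <- cancer[, !all_na, drop = FALSE]
names(cancer_clean)
```

Dropping `X` is a suggestion only; a column with no observed values carries no information for the classification.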
From this preview, we can see that the feature columns fall into three groups according to their suffix: mean, se, and worst.
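One way to split the columns into these three groups is to select on the suffix, keeping `diagnosis` in each subset. The sketch below assumes dplyr is loaded; the name `mean_data` is chosen by analogy with `se_data` and `worst_data`, and the values are toy numbers for illustration, not rows of the real dataset.

```r
library(dplyr)

# Toy data frame with one feature per suffix (illustrative values only)
cancer <- data.frame(
  diagnosis    = c("M", "M", "B", "B"),
  radius_mean  = c(18.0, 20.6, 7.0, 7.7),
  radius_se    = c(1.10, 0.54, 0.11, 0.12),
  radius_worst = c(25.4, 25.0, 7.9, 8.7)
)

# Keep the diagnosis label plus every column ending in the given suffix
mean_data  <- select(cancer, diagnosis, ends_with("_mean"))
se_data    <- select(cancer, diagnosis, ends_with("_se"))
worst_data <- select(cancer, diagnosis, ends_with("_worst"))

names(mean_data)
names(se_data)
names(worst_data)
```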
## `summarise()` has grouped output by 'diagnosis', 'radius_mean', 'texture_mean',
## 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean',
## 'concavity_mean', 'concave.points_mean', 'symmetry_mean'. You can override
## using the `.groups` argument.
## # A tibble: 569 × 11
## # Groups: diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean,
## # smoothness_mean, compactness_mean, concavity_mean, concave.points_mean,
## # symmetry_mean [569]
## diagnosis radius_mean textu…¹ perim…² area_…³ smoot…⁴ compa…⁵ conca…⁶ conca…⁷
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B 6.98 13.4 43.8 144. 0.117 0.0757 0 0
## 2 B 7.69 25.4 48.3 170. 0.0867 0.120 0.0925 0.0136
## 3 B 7.73 25.5 48.0 179. 0.0810 0.0488 0 0
## 4 B 7.76 24.5 47.9 181 0.0526 0.0436 0 0
## 5 B 8.20 16.8 51.7 202. 0.086 0.0594 0.0159 0.00592
## 6 B 8.22 20.7 53.3 204. 0.0940 0.130 0.132 0.0217
## 7 B 8.57 13.1 54.5 221. 0.104 0.0763 0.0256 0.0151
## 8 B 8.60 18.6 54.1 221. 0.107 0.0585 0 0
## 9 B 8.60 21.0 54.7 222. 0.124 0.0896 0.03 0.00926
## 10 B 8.62 11.8 54.3 224. 0.0975 0.0527 0.0206 0.00780
## # … with 559 more rows, 2 more variables: symmetry_mean <dbl>,
## # fractal_dimension_mean <dbl>, and abbreviated variable names ¹texture_mean,
## # ²perimeter_mean, ³area_mean, ⁴smoothness_mean, ⁵compactness_mean,
## # ⁶concavity_mean, ⁷concave.points_mean
## `summarise()` has grouped output by 'diagnosis', 'radius_se', 'texture_se',
## 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se',
## 'concave.points_se', 'symmetry_se'. You can override using the `.groups`
## argument.
## # A tibble: 569 × 11
## # Groups: diagnosis, radius_se, texture_se, perimeter_se, area_se,
## # smoothness_se, compactness_se, concavity_se, concave.points_se, symmetry_se
## # [569]
## diagnosis radius_se texture…¹ perim…² area_se smoot…³ compa…⁴ conca…⁵ conca…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B 0.112 1.23 2.36 7.23 0.00850 0.0764 0.154 0.0292
## 2 B 0.114 1.02 0.989 7.33 0.0103 0.0308 0.0261 0.0110
## 3 B 0.115 0.674 0.757 9.01 0.00326 0.00493 0.00649 0.00376
## 4 B 0.117 0.496 0.771 8.96 0.00368 0.00917 0.00873 0.00574
## 5 B 0.119 1.18 1.17 6.80 0.00552 0.0267 0.0374 0.00513
## 6 B 0.119 1.43 1.78 9.55 0.00504 0.0456 0.0430 0.0167
## 7 B 0.120 0.894 0.848 9.23 0.00346 0.0105 0.0117 0.00556
## 8 B 0.121 0.893 1.06 8.60 0.00365 0.0165 0.0163 0.00312
## 9 B 0.127 0.679 1.07 7.25 0.00790 0.0176 0.0180 0.00732
## 10 B 0.130 0.720 0.844 10.8 0.00349 0.00371 0.00483 0.00361
## # … with 559 more rows, 2 more variables: symmetry_se <dbl>,
## # fractal_dimension_se <dbl>, and abbreviated variable names ¹texture_se,
## # ²perimeter_se, ³smoothness_se, ⁴compactness_se, ⁵concavity_se,
## # ⁶concave.points_se
## `summarise()` has grouped output by 'diagnosis', 'radius_worst',
## 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst',
## 'compactness_worst', 'concavity_worst', 'concave.points_worst',
## 'symmetry_worst'. You can override using the `.groups` argument.
## # A tibble: 569 × 11
## # Groups: diagnosis, radius_worst, texture_worst, perimeter_worst,
## # area_worst, smoothness_worst, compactness_worst, concavity_worst,
## # concave.points_worst, symmetry_worst [569]
## diagnosis radius_wo…¹ textu…² perim…³ area_…⁴ smoot…⁵ compa…⁶ conca…⁷ conca…⁸
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B 7.93 19.5 50.4 185. 0.158 0.120 0 0
## 2 B 8.68 31.9 54.5 224. 0.160 0.306 0.339 0.05
## 3 B 8.95 22.4 56.6 240. 0.135 0.0777 0 0
## 4 B 8.96 22.0 57.3 242. 0.130 0.136 0.0688 0.0256
## 5 B 9.08 30.9 57.2 248 0.126 0.0834 0 0
## 6 B 9.09 29.7 58.1 250. 0.163 0.431 0.538 0.0788
## 7 B 9.26 17.0 58.4 259. 0.116 0.0706 0 0
## 8 B 9.41 17.1 63.3 270 0.118 0.188 0.154 0.0385
## 9 B 9.46 30.4 59.2 269. 0.0900 0.0644 0 0
## 10 B 9.47 18.4 63.3 276. 0.164 0.224 0.175 0.0851
## # … with 559 more rows, 2 more variables: symmetry_worst <dbl>,
## # fractal_dimension_worst <dbl>, and abbreviated variable names
## # ¹radius_worst, ²texture_worst, ³perimeter_worst, ⁴area_worst,
## # ⁵smoothness_worst, ⁶compactness_worst, ⁷concavity_worst,
## # ⁸concave.points_worst
One of the main objectives of visualizing this data is to identify the features that are most helpful in distinguishing benign from malignant tumors. For this data, I decided to use two kinds of plots (histograms and bar charts) to compare features and understand which have greater predictive value and which add little, in case we later aim to build a model that predicts whether a tumor is benign or malignant.
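A faceted histogram of this kind can be sketched as below. It assumes ggplot2 and reshape2 are available; the object names `mean_data` and `long`, the bin count, and the styling are illustrative choices rather than the settings used for the figures in this report. The eight rows are rounded values copied from the outputs above.

```r
library(ggplot2)
library(reshape2)

# Rounded values copied from the printed outputs above (4 malignant, 4 benign)
mean_data <- data.frame(
  diagnosis    = c("M", "M", "M", "M", "B", "B", "B", "B"),
  radius_mean  = c(18, 20.6, 19.7, 11.4, 6.98, 7.69, 7.73, 7.76),
  texture_mean = c(10.4, 17.8, 21.2, 20.4, 13.4, 25.4, 25.5, 24.5)
)

# Reshape to long format: one row per (observation, feature) pair,
# with columns diagnosis, variable, value
long <- melt(mean_data, id.vars = "diagnosis")

# One histogram panel per feature, colored by diagnosis
p <- ggplot(long, aes(x = value, fill = diagnosis)) +
  geom_histogram(bins = 10, alpha = 0.6, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  labs(x = "Feature value", y = "Count", fill = "Diagnosis")
p
```

Features whose benign and malignant histograms overlap little are the ones likely to carry more predictive value.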
The histogram and bar graph compare the diagnoses across the columns grouped in the mean data frame. perimeter_mean, area_mean, concavity_mean, and concave.points_mean all tend to show higher counts of benign (not cancerous) diagnoses, with only a small rise in malignant diagnoses. Meanwhile, texture_mean, smoothness_mean, and to some extent concave.points_mean also show a slightly higher spike of malignant (cancerous) diagnoses.
In this plot, radius_se, perimeter_se, area_se, and fractal_dimension_se all tend to show the highest peaks of benign (not cancerous) diagnoses compared with the rest of the columns, while concavity_se and fractal_dimension_se also tend to show a higher spike of malignant (cancerous) diagnoses. Overall, the se_data data frame seems to show fewer malignant diagnoses.
In this plot, perimeter_worst, area_worst, symmetry_worst, and fractal_dimension_worst likewise tend to show the highest peaks of benign (not cancerous) diagnoses compared with the rest of the columns, while symmetry_worst, concavity_worst, and fractal_dimension_worst also tend to show a high spike of malignant (cancerous) diagnoses. Overall, the worst_data data frame generally seems to show fewer malignant diagnoses.
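Readings taken from plots by eye can be cross-checked with a numeric summary per diagnosis. The sketch below is a complementary check, not part of the original analysis: it assumes dplyr 1.0 or later, groups by `diagnosis` only, and uses the median as an arbitrary choice of summary. The rows are rounded values copied from the outputs above.

```r
library(dplyr)

# Rounded values copied from the printed outputs above (4 malignant, 4 benign)
mean_data <- data.frame(
  diagnosis    = c("M", "M", "M", "M", "B", "B", "B", "B"),
  radius_mean  = c(18, 20.6, 19.7, 11.4, 6.98, 7.69, 7.73, 7.76),
  texture_mean = c(10.4, 17.8, 21.2, 20.4, 13.4, 25.4, 25.5, 24.5)
)

# Group by the label only, so the result has one row per diagnosis
by_diagnosis <- mean_data %>%
  group_by(diagnosis) %>%
  summarise(across(where(is.numeric), median), .groups = "drop")

by_diagnosis
```

Grouping by `diagnosis` alone collapses the data to one row per class, which makes differences between benign and malignant cases easier to compare feature by feature.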
The analysis of the Breast Cancer Wisconsin (Diagnostic) dataset shows that a few features have more predictive value for diagnosis than the others. About 63% of the observations indicate the absence of cancer (B = benign), while about 37% indicate the presence of cancer (M = malignant). These observations were also supported by the bar charts of the data frames, in which the same features stand out as the main ones within each group. In conclusion, most of the cases diagnosed in this dataset are benign (negative for a malignant tumor), while a smaller number are malignant (positive for a malignant tumor).
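The class split quoted above can be reproduced with a frequency table. In this sketch the label vector is rebuilt from the published class counts for this dataset (357 benign, 212 malignant, 569 in total), which is an assumption consistent with the 63% and 37% figures; in the analysis itself the `diagnosis` column of the data frame would be used instead.

```r
# Rebuild the diagnosis labels from the published class counts
# (assumption: 357 benign and 212 malignant cases)
diagnosis <- rep(c("B", "M"), times = c(357, 212))

counts <- table(diagnosis)                    # cases per class
props  <- round(100 * prop.table(counts), 1)  # percentage per class

counts
props
```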