Breast Cancer Wisconsin (Diagnostic) Dataset

Ask

Problem task

According to Kalshtein Yael (2017), breast cancer is the most common malignancy among women, accounting for nearly one in three cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor doesn’t necessarily mean cancer; tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Various tests such as MRI, mammogram, ultrasound and biopsy are commonly used to diagnose breast cancer.

About the data

Data Source: The dataset used for this analysis was obtained from Kaggle. It was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, Wisconsin, USA. The data were generated using a curve-fitting program that computes ten features from each of the cells in the sample.
Data Integrity: The data consist of a single file with 30 features, which can be sub-grouped into three major categories: the ‘mean’ data, the ‘se’ data and the ‘worst’ data. Information in the dataset file:
ID number
Diagnosis (M = malignant, B = benign)
Sub-grouped features

mean (radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave.points_mean, symmetry_mean, fractal_dimension_mean)
se (radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave.points_se, symmetry_se, fractal_dimension_se)
worst (radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave.points_worst, symmetry_worst, fractal_dimension_worst)

Aim of this Analysis

This analysis aims to classify whether the breast cancer is benign or malignant. In order to achieve this, the objectives are to observe which features are most helpful in predicting malignant or benign cancer and to see, through visualizations, the general trends that may help us in the diagnosis classification.

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2

## Warning: package 'purrr' was built under R version 4.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

## Warning: package 'reshape2' was built under R version 4.2.2

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

##                      id               diagnosis             radius_mean 
##                       0                       0                       0 
##            texture_mean          perimeter_mean               area_mean 
##                       0                       0                       0 
##         smoothness_mean        compactness_mean          concavity_mean 
##                       0                       0                       0 
##     concave.points_mean           symmetry_mean  fractal_dimension_mean 
##                       0                       0                       0 
##               radius_se              texture_se            perimeter_se 
##                       0                       0                       0 
##                 area_se           smoothness_se          compactness_se 
##                       0                       0                       0 
##            concavity_se       concave.points_se             symmetry_se 
##                       0                       0                       0 
##    fractal_dimension_se            radius_worst           texture_worst 
##                       0                       0                       0 
##         perimeter_worst              area_worst        smoothness_worst 
##                       0                       0                       0 
##       compactness_worst         concavity_worst    concave.points_worst 
##                       0                       0                       0 
##          symmetry_worst fractal_dimension_worst                       X 
##                       0                       0                     569
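The table above appears to be a per-column count of missing values: every column is complete except the trailing column X, which is missing in all 569 rows. A minimal sketch of how such a check might be run, and the empty column dropped, is given below in base R. The original code is not shown in this report, so the approach is an assumption; the names `toy`, `na_counts` and `clean` are hypothetical, and the toy values are copied (rounded, as printed) from the structure listing further down.

```r
# Toy stand-in for the full data set: the first four rows of three columns,
# with values copied from the printed structure listing in this report.
toy <- data.frame(
  id          = c(842302, 842517, 84300903, 84348301),
  diagnosis   = c("M", "M", "M", "M"),
  radius_mean = c(18, 20.6, 19.7, 11.4),
  X           = NA  # empty trailing column, mimicking the one in the file
)

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns that are not entirely missing.
clean <- toy[, na_counts < nrow(toy), drop = FALSE]
print(names(clean))
```

On the full data the same filter would be expected to drop only X, leaving id, diagnosis and the 30 feature columns.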

## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X                      : logi  NA NA NA NA NA NA ...

##  [1] "id"                      "diagnosis"              
##  [3] "radius_mean"             "texture_mean"           
##  [5] "perimeter_mean"          "area_mean"              
##  [7] "smoothness_mean"         "compactness_mean"       
##  [9] "concavity_mean"          "concave.points_mean"    
## [11] "symmetry_mean"           "fractal_dimension_mean" 
## [13] "radius_se"               "texture_se"             
## [15] "perimeter_se"            "area_se"                
## [17] "smoothness_se"           "compactness_se"         
## [19] "concavity_se"            "concave.points_se"      
## [21] "symmetry_se"             "fractal_dimension_se"   
## [23] "radius_worst"            "texture_worst"          
## [25] "perimeter_worst"         "area_worst"             
## [27] "smoothness_worst"        "compactness_worst"      
## [29] "concavity_worst"         "concave.points_worst"   
## [31] "symmetry_worst"          "fractal_dimension_worst"
## [33] "X"

The data file contains 569 rows and 33 columns (variables), with column names such as id, radius_mean, perimeter_mean, radius_se and radius_worst. The last column, X, contains only missing values (NA) in all 569 rows.

From the preview of this dataset, we can observe that the feature columns fall into three sub-groups by name suffix: mean, se, and worst.

Creating a data frame for each of the three column sub-groups

## `summarise()` has grouped output by 'diagnosis', 'radius_mean', 'texture_mean',
## 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean',
## 'concavity_mean', 'concave.points_mean', 'symmetry_mean'. You can override
## using the `.groups` argument.

## # A tibble: 569 × 11
## # Groups:   diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean,
## #   smoothness_mean, compactness_mean, concavity_mean, concave.points_mean,
## #   symmetry_mean [569]
##    diagnosis radius_mean textu…¹ perim…² area_…³ smoot…⁴ compa…⁵ conca…⁶ conca…⁷
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 B                6.98    13.4    43.8    144.  0.117   0.0757  0      0      
##  2 B                7.69    25.4    48.3    170.  0.0867  0.120   0.0925 0.0136 
##  3 B                7.73    25.5    48.0    179.  0.0810  0.0488  0      0      
##  4 B                7.76    24.5    47.9    181   0.0526  0.0436  0      0      
##  5 B                8.20    16.8    51.7    202.  0.086   0.0594  0.0159 0.00592
##  6 B                8.22    20.7    53.3    204.  0.0940  0.130   0.132  0.0217 
##  7 B                8.57    13.1    54.5    221.  0.104   0.0763  0.0256 0.0151 
##  8 B                8.60    18.6    54.1    221.  0.107   0.0585  0      0      
##  9 B                8.60    21.0    54.7    222.  0.124   0.0896  0.03   0.00926
## 10 B                8.62    11.8    54.3    224.  0.0975  0.0527  0.0206 0.00780
## # … with 559 more rows, 2 more variables: symmetry_mean <dbl>,
## #   fractal_dimension_mean <dbl>, and abbreviated variable names ¹texture_mean,
## #   ²perimeter_mean, ³area_mean, ⁴smoothness_mean, ⁵compactness_mean,
## #   ⁶concavity_mean, ⁷concave.points_mean

## `summarise()` has grouped output by 'diagnosis', 'radius_se', 'texture_se',
## 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se',
## 'concave.points_se', 'symmetry_se'. You can override using the `.groups`
## argument.

## # A tibble: 569 × 11
## # Groups:   diagnosis, radius_se, texture_se, perimeter_se, area_se,
## #   smoothness_se, compactness_se, concavity_se, concave.points_se, symmetry_se
## #   [569]
##    diagnosis radius_se texture…¹ perim…² area_se smoot…³ compa…⁴ conca…⁵ conca…⁶
##    <chr>         <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 B             0.112     1.23    2.36     7.23 0.00850 0.0764  0.154   0.0292 
##  2 B             0.114     1.02    0.989    7.33 0.0103  0.0308  0.0261  0.0110 
##  3 B             0.115     0.674   0.757    9.01 0.00326 0.00493 0.00649 0.00376
##  4 B             0.117     0.496   0.771    8.96 0.00368 0.00917 0.00873 0.00574
##  5 B             0.119     1.18    1.17     6.80 0.00552 0.0267  0.0374  0.00513
##  6 B             0.119     1.43    1.78     9.55 0.00504 0.0456  0.0430  0.0167 
##  7 B             0.120     0.894   0.848    9.23 0.00346 0.0105  0.0117  0.00556
##  8 B             0.121     0.893   1.06     8.60 0.00365 0.0165  0.0163  0.00312
##  9 B             0.127     0.679   1.07     7.25 0.00790 0.0176  0.0180  0.00732
## 10 B             0.130     0.720   0.844   10.8  0.00349 0.00371 0.00483 0.00361
## # … with 559 more rows, 2 more variables: symmetry_se <dbl>,
## #   fractal_dimension_se <dbl>, and abbreviated variable names ¹texture_se,
## #   ²perimeter_se, ³smoothness_se, ⁴compactness_se, ⁵concavity_se,
## #   ⁶concave.points_se

## `summarise()` has grouped output by 'diagnosis', 'radius_worst',
## 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst',
## 'compactness_worst', 'concavity_worst', 'concave.points_worst',
## 'symmetry_worst'. You can override using the `.groups` argument.

## # A tibble: 569 × 11
## # Groups:   diagnosis, radius_worst, texture_worst, perimeter_worst,
## #   area_worst, smoothness_worst, compactness_worst, concavity_worst,
## #   concave.points_worst, symmetry_worst [569]
##    diagnosis radius_wo…¹ textu…² perim…³ area_…⁴ smoot…⁵ compa…⁶ conca…⁷ conca…⁸
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 B                7.93    19.5    50.4    185.  0.158   0.120   0       0     
##  2 B                8.68    31.9    54.5    224.  0.160   0.306   0.339   0.05  
##  3 B                8.95    22.4    56.6    240.  0.135   0.0777  0       0     
##  4 B                8.96    22.0    57.3    242.  0.130   0.136   0.0688  0.0256
##  5 B                9.08    30.9    57.2    248   0.126   0.0834  0       0     
##  6 B                9.09    29.7    58.1    250.  0.163   0.431   0.538   0.0788
##  7 B                9.26    17.0    58.4    259.  0.116   0.0706  0       0     
##  8 B                9.41    17.1    63.3    270   0.118   0.188   0.154   0.0385
##  9 B                9.46    30.4    59.2    269.  0.0900  0.0644  0       0     
## 10 B                9.47    18.4    63.3    276.  0.164   0.224   0.175   0.0851
## # … with 559 more rows, 2 more variables: symmetry_worst <dbl>,
## #   fractal_dimension_worst <dbl>, and abbreviated variable names
## #   ¹radius_worst, ²texture_worst, ³perimeter_worst, ⁴area_worst,
## #   ⁵smoothness_worst, ⁶compactness_worst, ⁷concavity_worst,
## #   ⁸concave.points_worst
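Each of the three tibbles above holds diagnosis plus the ten columns of one sub-group. The `summarise()` messages suggest they were built through grouping; a simpler route, assumed here and not taken from the original code, is to pick columns by their name suffix. The helper `subgroup` is hypothetical, and the toy values are copied (rounded, as printed) from the structure listing earlier in this report.

```r
# Toy data: first four rows, with only two features per sub-group included.
toy <- data.frame(
  diagnosis     = c("M", "M", "M", "M"),
  radius_mean   = c(18, 20.6, 19.7, 11.4),
  texture_mean  = c(10.4, 17.8, 21.2, 20.4),
  radius_se     = c(1.095, 0.543, 0.746, 0.496),
  texture_se    = c(0.905, 0.734, 0.787, 1.156),
  radius_worst  = c(25.4, 25, 23.6, 14.9),
  texture_worst = c(17.3, 23.4, 25.5, 26.5)
)

# Hypothetical helper: keep `diagnosis` plus every column whose name
# ends in "_<suffix>" ("mean", "se" or "worst").
subgroup <- function(df, suffix) {
  keep <- grepl(paste0("_", suffix, "$"), names(df))
  df[, c("diagnosis", names(df)[keep]), drop = FALSE]
}

mean_data  <- subgroup(toy, "mean")
se_data    <- subgroup(toy, "se")
worst_data <- subgroup(toy, "worst")

print(names(mean_data))
```

Selecting by suffix keeps the rows in their original order and avoids the grouped-output message, at the cost of relying on the column naming convention.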

Visualization of the dataset

One of the main objectives of visualizing this data is to observe the features that are most helpful in predicting benign and malignant cancer. For this data, I used two kinds of plot (histograms and bar charts) to compare the features and understand which have larger predictive value and which add little, in case we aim to build a model that predicts whether a tumor is benign or malignant.
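A sketch of how per-feature histograms split by diagnosis might be drawn is given below. It assumes `reshape2::melt()` for the wide-to-long reshape and `ggplot2` for plotting, both of which appear in the package messages earlier in this report; the original plotting code is not shown, so the layout, bin count and labels are guesses. The toy rows are copied (rounded, as printed) from the outputs above: four malignant rows from the structure listing and four benign rows from the ‘mean’ tibble.

```r
library(ggplot2)
library(reshape2)

# Toy data: far too few rows for real conclusions; illustration only.
toy <- data.frame(
  diagnosis    = c("M", "M", "M", "M", "B", "B", "B", "B"),
  radius_mean  = c(18, 20.6, 19.7, 11.4, 6.98, 7.69, 7.73, 7.76),
  texture_mean = c(10.4, 17.8, 21.2, 20.4, 13.4, 25.4, 25.5, 24.5)
)

# Wide to long: one row per (observation, feature) pair.
long <- melt(toy, id.vars = "diagnosis",
             variable.name = "feature", value.name = "value")

# One histogram panel per feature, with bars coloured by diagnosis.
p <- ggplot(long, aes(x = value, fill = diagnosis)) +
  geom_histogram(bins = 10, alpha = 0.6, position = "identity") +
  facet_wrap(~ feature, scales = "free") +
  labs(x = "Feature value", y = "Count", fill = "Diagnosis")

print(p)
```

Free scales per panel matter here because the features live on very different ranges (area is in the hundreds, smoothness below 1).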

The histograms and bar charts compare the distribution of diagnoses across each column grouped under the mean data frame. perimeter_mean, area_mean, concavity_mean and concave.points_mean all tend to show more benign (i.e., not cancerous) diagnoses, with only a small rise in malignant (cancerous) diagnoses. Meanwhile, texture_mean, smoothness_mean and, to some extent, concave.points_mean show a slightly higher spike of malignant (cancerous) diagnoses.

In this plot, radius_se, perimeter_se, area_se and fractal_dimension_se all tend to show the highest peaks of benign (not cancerous) diagnoses compared to the rest of the columns, while concavity_se and fractal_dimension_se also tend to show a higher spike of malignant (cancerous) diagnoses. Overall, se_data seems to show fewer malignant diagnoses.

In this plot, perimeter_worst, area_worst, symmetry_worst and fractal_dimension_worst likewise tend to show the highest peaks of benign (not cancerous) diagnoses compared to the rest of the columns, while symmetry_worst, concavity_worst and fractal_dimension_worst also tend to show a high spike of malignant (cancerous) diagnoses. Overall, the data frame ‘worst_data’ seems to show fewer malignant diagnoses.
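The readings above come from inspecting the plots by eye. One way to cross-check them numerically, offered as an assumption and not as the method used in this analysis, is to compare each feature's per-class average: a feature whose benign and malignant averages sit far apart relative to its overall spread is more likely to help classification. The sketch uses base R on toy rows copied (rounded, as printed) from the outputs earlier in this report. The benign rows come from a table sorted by radius, so the gap shown for radius_mean is exaggerated and not representative of the full data.

```r
# Toy data: four malignant and four benign rows; illustration only.
toy <- data.frame(
  diagnosis    = c("M", "M", "M", "M", "B", "B", "B", "B"),
  radius_mean  = c(18, 20.6, 19.7, 11.4, 6.98, 7.69, 7.73, 7.76),
  texture_mean = c(10.4, 17.8, 21.2, 20.4, 13.4, 25.4, 25.5, 24.5)
)

features <- setdiff(names(toy), "diagnosis")

# Per-class mean of every feature.
class_means <- aggregate(toy[features],
                         by = list(diagnosis = toy$diagnosis),
                         FUN = mean)
print(class_means)

# Standardised gap between class means: |mean_M - mean_B| / overall sd.
gap <- sapply(features, function(f) {
  m <- tapply(toy[[f]], toy$diagnosis, mean)
  abs(unname(m["M"] - m["B"])) / sd(toy[[f]])
})

# Features with the largest gap separate the two classes best.
print(sort(gap, decreasing = TRUE))
```

Run on the full mean, se and worst data frames, a ranking like this would give a second opinion on which columns the plots single out.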

Conclusion

The analysis of the Breast Cancer Wisconsin (Diagnostic) Dataset shows that a few features have more predictive value for diagnosis than the rest. About 63% of all observations indicate the absence of cancer (B = benign), while about 37% indicate the presence of cancer (M = malignant). These observations were also reflected in the bar chart plots of the data frames, which show the same features standing out as the principal ones in each data frame. In conclusion, as far as this analysis is concerned, the majority of diagnosed cases are benign (not cancerous), while a smaller number are malignant (cancerous).
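The 63% and 37% figures are class proportions and can be reproduced with `table()` and `prop.table()`. The sketch below assumes the class counts commonly documented for this dataset (357 benign and 212 malignant, 569 in total); these counts are not printed anywhere in this report, so treat them as an assumption and not as output of the analysis.

```r
# Assumed class counts (not printed in this report): 357 benign, 212 malignant.
diagnosis <- c(rep("B", 357), rep("M", 212))

# Frequency of each class, then its share of all observations in percent.
counts <- table(diagnosis)
pct <- round(100 * prop.table(counts))

print(counts)
print(pct)
```

On the real data frame the same two calls would be applied to the diagnosis column directly.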