Objective

Create a statistical binary classification model that accurately predicts whether a breast mass is malignant, using the given cytopathology data.

Dataset Information

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].

Data Source

The dataset can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information

  1. ID number
  2. Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

Notes about Dataset

Load Data and Libraries

Preview Data

##          id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1    842302         M       17.99        10.38         122.80    1001.0
## 2    842517         M       20.57        17.77         132.90    1326.0
## 3  84300903         M       19.69        21.25         130.00    1203.0
## 4  84348301         M       11.42        20.38          77.58     386.1
## 5  84358402         M       20.29        14.34         135.10    1297.0
## 6    843786         M       12.45        15.70          82.57     477.1
## 7    844359         M       18.25        19.98         119.60    1040.0
## 8  84458202         M       13.71        20.83          90.20     577.9
## 9    844981         M       13.00        21.82          87.50     519.8
## 10 84501001         M       12.46        24.04          83.97     475.9
##    smoothness_mean compactness_mean concavity_mean concave_points_mean
## 1          0.11840          0.27760        0.30010             0.14710
## 2          0.08474          0.07864        0.08690             0.07017
## 3          0.10960          0.15990        0.19740             0.12790
## 4          0.14250          0.28390        0.24140             0.10520
## 5          0.10030          0.13280        0.19800             0.10430
## 6          0.12780          0.17000        0.15780             0.08089
## 7          0.09463          0.10900        0.11270             0.07400
## 8          0.11890          0.16450        0.09366             0.05985
## 9          0.12730          0.19320        0.18590             0.09353
## 10         0.11860          0.23960        0.22730             0.08543
##    symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1         0.2419                0.07871    1.0950     0.9053        8.589
## 2         0.1812                0.05667    0.5435     0.7339        3.398
## 3         0.2069                0.05999    0.7456     0.7869        4.585
## 4         0.2597                0.09744    0.4956     1.1560        3.445
## 5         0.1809                0.05883    0.7572     0.7813        5.438
## 6         0.2087                0.07613    0.3345     0.8902        2.217
## 7         0.1794                0.05742    0.4467     0.7732        3.180
## 8         0.2196                0.07451    0.5835     1.3770        3.856
## 9         0.2350                0.07389    0.3063     1.0020        2.406
## 10        0.2030                0.08243    0.2976     1.5990        2.039
##    area_se smoothness_se compactness_se concavity_se concave_points_se
## 1   153.40      0.006399        0.04904      0.05373           0.01587
## 2    74.08      0.005225        0.01308      0.01860           0.01340
## 3    94.03      0.006150        0.04006      0.03832           0.02058
## 4    27.23      0.009110        0.07458      0.05661           0.01867
## 5    94.44      0.011490        0.02461      0.05688           0.01885
## 6    27.19      0.007510        0.03345      0.03672           0.01137
## 7    53.91      0.004314        0.01382      0.02254           0.01039
## 8    50.96      0.008805        0.03029      0.02488           0.01448
## 9    24.32      0.005731        0.03502      0.03553           0.01226
## 10   23.94      0.007149        0.07217      0.07743           0.01432
##    symmetry_se fractal_dimension_se radius_worst texture_worst
## 1      0.03003             0.006193        25.38         17.33
## 2      0.01389             0.003532        24.99         23.41
## 3      0.02250             0.004571        23.57         25.53
## 4      0.05963             0.009208        14.91         26.50
## 5      0.01756             0.005115        22.54         16.67
## 6      0.02165             0.005082        15.47         23.75
## 7      0.01369             0.002179        22.88         27.66
## 8      0.01486             0.005412        17.06         28.14
## 9      0.02143             0.003749        15.49         30.73
## 10     0.01789             0.010080        15.09         40.68
##    perimeter_worst area_worst smoothness_worst compactness_worst
## 1           184.60     2019.0           0.1622            0.6656
## 2           158.80     1956.0           0.1238            0.1866
## 3           152.50     1709.0           0.1444            0.4245
## 4            98.87      567.7           0.2098            0.8663
## 5           152.20     1575.0           0.1374            0.2050
## 6           103.40      741.6           0.1791            0.5249
## 7           153.20     1606.0           0.1442            0.2576
## 8           110.60      897.0           0.1654            0.3682
## 9           106.20      739.3           0.1703            0.5401
## 10           97.65      711.4           0.1853            1.0580
##    concavity_worst concave_points_worst symmetry_worst
## 1           0.7119               0.2654         0.4601
## 2           0.2416               0.1860         0.2750
## 3           0.4504               0.2430         0.3613
## 4           0.6869               0.2575         0.6638
## 5           0.4000               0.1625         0.2364
## 6           0.5355               0.1741         0.3985
## 7           0.3784               0.1932         0.3063
## 8           0.2678               0.1556         0.3196
## 9           0.5390               0.2060         0.4378
## 10          1.1050               0.2210         0.4366
##    fractal_dimension_worst
## 1                  0.11890
## 2                  0.08902
## 3                  0.08758
## 4                  0.17300
## 5                  0.07678
## 6                  0.12440
## 7                  0.08368
## 8                  0.11510
## 9                  0.10720
## 10                 0.20750

Create Training and Testing Datasets

We will split the dataset into a training dataset and a test dataset. We will perform our analyses and create models using the training dataset. Then, we will evaluate the model performance in the test dataset.

We want to make sure the values of the outcome variable are distributed similarly in the training and test datasets. We can check this by computing the proportion of each value of the outcome variable (diagnosis) in each dataset. The first table below is for the training dataset and the second is for the test dataset.

## 
##         B         M 
## 0.6285714 0.3714286
## 
##        B        M 
## 0.622807 0.377193

Since the proportions are roughly equal (about 63% benign and 37% malignant in both datasets), we will proceed to the next step.
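A stratified split of this kind can be sketched in base R. This is a minimal sketch, not the code that produced the output above: the `toy` data frame is simulated (only the class counts of 357 benign and 212 malignant mirror the real data), and the 80/20 ratio is an assumption inferred from the 455 training and 114 test rows reported in the confusion matrices later in this document.

```r
set.seed(42)  # make the random split reproducible

# Simulated stand-in for the real data (hypothetical values, real class counts)
toy <- data.frame(
  diagnosis   = factor(rep(c("B", "M"), times = c(357, 212))),
  radius_mean = c(rnorm(357, mean = 12.2, sd = 1.8),
                  rnorm(212, mean = 17.5, sd = 3.2))
)

# Stratified 80/20 split: sample 80% of the row indices within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$diagnosis)
train_idx <- unname(unlist(lapply(
  rows_by_class,
  function(idx) sample(idx, size = round(0.8 * length(idx)))
)))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Class proportions should be close in the two sets
train_props <- prop.table(table(train$diagnosis))
test_props  <- prop.table(table(test$diagnosis))
print(train_props)
print(test_props)
```

Sampling within each class, rather than from all rows at once, is what keeps the two proportions close. If the caret package is in use, `caret::createDataPartition()` provides a similar stratified split.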

Difference In Mean by Outcome Variable

We want to find the variables that have the most predictive power. In our case, all of the predictor variables are continuous and the outcome variable is categorical. Hence, we can compute the average value of each attribute, grouped by the outcome variable, and observe the difference.

For example, the outcome variable takes two values: B (benign) or M (malignant). We can calculate the mean radius for the malignant tumor cells and compare it to the mean radius of the benign tumor cells. If the difference between these two means is large relative to the spread of the values, the attribute (radius) is likely to be a useful predictor.

##        B        M 
## 12.15995 17.52302

The average mean radius of benign cell nuclei is about 12.16, while that of malignant nuclei is about 17.52.

We can do this for each of the attributes. (The id column is an identifier, so its group mean carries no meaning.)

## # A tibble: 2 x 32
##   diagnosis     id radius_mean texture_mean perimeter_mean area_mean
##   <fct>      <dbl>       <dbl>        <dbl>          <dbl>     <dbl>
## 1 B         2.32e7        12.2         17.9           78.2      464.
## 2 M         4.26e7        17.5         21.6          116.       987.
## # … with 26 more variables: smoothness_mean <dbl>, compactness_mean <dbl>,
## #   concavity_mean <dbl>, concave_points_mean <dbl>, symmetry_mean <dbl>,
## #   fractal_dimension_mean <dbl>, radius_se <dbl>, texture_se <dbl>,
## #   perimeter_se <dbl>, area_se <dbl>, smoothness_se <dbl>,
## #   compactness_se <dbl>, concavity_se <dbl>, concave_points_se <dbl>,
## #   symmetry_se <dbl>, fractal_dimension_se <dbl>, radius_worst <dbl>,
## #   texture_worst <dbl>, perimeter_worst <dbl>, area_worst <dbl>,
## #   smoothness_worst <dbl>, compactness_worst <dbl>,
## #   concavity_worst <dbl>, concave_points_worst <dbl>,
## #   symmetry_worst <dbl>, fractal_dimension_worst <dbl>
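The group-wise comparison above can be sketched in base R as follows. This is only an illustration under stated assumptions: the `toy` data frame is simulated, and the standardized difference (difference in class means divided by the attribute's overall standard deviation) is an addition of this sketch, included because raw mean differences are not comparable across attributes measured on different scales.

```r
set.seed(1)

# Simulated stand-in for the training data (hypothetical values)
toy <- data.frame(
  diagnosis              = factor(rep(c("B", "M"), each = 50)),
  radius_mean            = c(rnorm(50, 12.2, 1.8), rnorm(50, 17.5, 3.2)),
  fractal_dimension_mean = rnorm(100, 0.063, 0.007)
)

# Mean of a single attribute for each diagnosis
radius_by_class <- tapply(toy$radius_mean, toy$diagnosis, mean)

# Mean of every attribute for each diagnosis
group_means <- aggregate(. ~ diagnosis, data = toy, FUN = mean)

# Difference in class means, in units of the attribute's overall SD
features <- setdiff(names(toy), "diagnosis")
std_diff <- sapply(features, function(f) {
  m <- tapply(toy[[f]], toy$diagnosis, mean)
  (m[["M"]] - m[["B"]]) / sd(toy[[f]])
})

print(group_means)
print(sort(abs(std_diff), decreasing = TRUE))
```

Attributes at the top of the sorted output separate the two classes most strongly on this simple measure; it looks at one attribute at a time and ignores interactions between attributes.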

Visualizing the Distribution Difference with Histogram

We can visualize the above result with distribution plots. The more “separated out” the distributions are, the more predictive power the feature has.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Visualizing the Distribution Difference with Barplots

Finding Correlated Variables

Having many correlated attributes often leads to model overfitting. To avoid that pitfall, we do not want to include too many features that are highly correlated with one another when creating our models. To examine which features, if any, are highly correlated, we produce the following graph.

As evident, many of the feature variables are highly correlated with other variables and therefore redundant.
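Highly correlated pairs can also be listed numerically rather than read off a graph. The following is a minimal sketch in base R under stated assumptions: the `toy` features are simulated so that perimeter and area are functions of radius, and the 0.9 cutoff is an arbitrary illustrative threshold, not a value taken from this analysis.

```r
set.seed(7)

# Simulated features: perimeter and area are derived from radius,
# texture is independent (hypothetical values)
radius <- rnorm(200, mean = 14, sd = 3.5)
toy <- data.frame(
  radius_mean    = radius,
  perimeter_mean = 2 * pi * radius + rnorm(200, 0, 2),
  area_mean      = pi * radius^2 + rnorm(200, 0, 20),
  texture_mean   = rnorm(200, mean = 19, sd = 4)
)

# Pairwise Pearson correlations between all features
cor_mat <- cor(toy)

# Keep each pair once (upper triangle) when |r| exceeds the cutoff
cutoff <- 0.9
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1 = rownames(cor_mat)[high[, "row"]],
  feature_2 = colnames(cor_mat)[high[, "col"]],
  r         = cor_mat[high]
)
print(high_pairs)
```

In this toy example the radius, perimeter, and area columns all pair up, which mirrors the geometric redundancy one would expect among the size-related features of the real dataset.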

Dimensionality Reduction: Principal Component Analysis

Because we have many redundant, highly correlated features in our dataset, we would like to reduce the number of feature variables. One method we can apply is principal component analysis (PCA), a very popular dimensionality reduction technique.

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6
## Standard deviation     3.6665 2.3917 1.65918 1.39033 1.28019 1.0996
## Proportion of Variance 0.4481 0.1907 0.09176 0.06443 0.05463 0.0403
## Cumulative Proportion  0.4481 0.6388 0.73054 0.79498 0.84960 0.8899
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.80160 0.68717 0.65478 0.58253 0.54029 0.50711
## Proportion of Variance 0.02142 0.01574 0.01429 0.01131 0.00973 0.00857
## Cumulative Proportion  0.91132 0.92706 0.94136 0.95267 0.96240 0.97097
##                           PC13   PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.48177 0.4024 0.29294 0.28644 0.24030 0.23584
## Proportion of Variance 0.00774 0.0054 0.00286 0.00273 0.00192 0.00185
## Cumulative Proportion  0.97871 0.9841 0.98697 0.98970 0.99162 0.99348
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.20220 0.17887 0.16663 0.15888 0.14154 0.13167
## Proportion of Variance 0.00136 0.00107 0.00093 0.00084 0.00067 0.00058
## Cumulative Proportion  0.99484 0.99591 0.99683 0.99768 0.99834 0.99892
##                          PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.1229 0.09183 0.08237 0.03696 0.02361 0.01136
## Proportion of Variance 0.0005 0.00028 0.00023 0.00005 0.00002 0.00000
## Cumulative Proportion  0.9994 0.99971 0.99993 0.99998 1.00000 1.00000
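A summary of this form is what `summary()` prints for a `prcomp` object. The sketch below shows how the three rows are derived, using simulated data rather than the real features. Setting `scale. = TRUE` is an assumption, although it is consistent with the output above, where the component variances sum to roughly 30 (the number of features). The 85% cutoff is purely illustrative.

```r
set.seed(3)

# Simulated training features (hypothetical values)
n <- 150
radius <- rnorm(n, mean = 14, sd = 3.5)
train_x <- data.frame(
  radius_mean    = radius,
  perimeter_mean = 2 * pi * radius + rnorm(n, 0, 2),
  area_mean      = pi * radius^2 + rnorm(n, 0, 20),
  texture_mean   = rnorm(n, mean = 19, sd = 4)
)

# PCA on centered, unit-variance features
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# "Proportion of Variance" and "Cumulative Proportion" rows of the summary
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching an illustrative 85% of the variance
n_keep <- which(cum_var >= 0.85)[1]

print(summary(pca))
print(n_keep)
```

Because the three size-related toy features are nearly collinear, most of the variance falls on the first component, which is the same effect that lets a handful of components cover most of the variance in the real data.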

Visualize Variance Explained by Principal Components

Effectiveness of First Principal Components in Separating Out Data Points

##            PC1       PC2        PC3        PC4        PC5        PC6
## 129 -1.3161432 -1.298799  0.6362512  1.9286064  0.1527150 -0.4583164
## 509  1.2291376  1.601287 -0.7205266  1.8258560 -0.7471148 -0.8341412
## 471  2.6627577 -1.394929  0.2782799 -0.3293326  0.9021914  0.8810544
## 299  2.7856293  2.498263 -0.8437551 -0.3002056 -1.6154444  0.8424754
## 270  0.8148387 -3.069724  1.5253843 -0.3360296 -0.7738540 -0.4785682
## 187 -0.2945659  3.498859 -2.1654255  0.2387659 -0.9226097  0.4171881
##            PC7         PC8         PC9       PC10       PC11        PC12
## 129 -1.1554734 -0.36953772 -0.44423390  0.3959139  1.5154993  0.42925567
## 509 -1.3156708  0.36162608  0.12976703  0.6582973  0.0923487 -0.01735643
## 471  0.5809932 -0.39689831  1.17929674  0.7854529 -0.2228470  0.68258717
## 299  0.7445195  0.11701140  0.05602298  0.4538755 -0.1293102  0.48797305
## 270 -0.3731491  0.03439504  0.08585404 -0.3194456  0.6013498 -0.68208920
## 187 -0.6186967  0.24644361 -0.19065865  0.2966028 -0.3576314 -0.24618479
##             PC13        PC14        PC15        PC16          PC17
## 129  0.041236860  0.26019609  0.07328204  0.13121570 -0.0924210679
## 509  0.322709210 -0.17963353 -0.21570085 -0.37502036 -0.3325836550
## 471  0.116939814 -0.14114442 -0.05992310  0.08107342 -0.0136590961
## 299 -0.033697334 -0.23996721  0.16061233 -0.02408015 -0.0003508203
## 270  0.003409216  0.70607398  0.13587239  0.04430825  0.0239678302
## 187  0.109699478 -0.01555519  0.23156169  0.16456625  0.1628610489
##            PC18        PC19         PC20        PC21        PC22
## 129 -0.13995513  0.12271246  0.408426271  0.02018819  0.15594383
## 509 -0.15397185  0.20384521 -0.056650831  0.08773525  0.04537708
## 471 -0.06485303 -0.06851374 -0.166331501 -0.19113485  0.04571498
## 299  0.04418318 -0.01268881 -0.002053593 -0.15120228 -0.04346294
## 270 -0.04297632 -0.11088803  0.212471581  0.09292700  0.12864791
## 187  0.08683317 -0.10262787 -0.130109622  0.27366184 -0.20232836
##            PC23          PC24         PC25        PC26         PC27
## 129  0.15039131 -0.2237835486  0.152552119  0.08729756 -0.138855825
## 509 -0.11156990 -0.0082444344  0.045367252 -0.05915807  0.021252677
## 471  0.01086144  0.0134687722 -0.007700176  0.14192225  0.083548837
## 299 -0.07035428 -0.0114456437 -0.121228608 -0.10780136  0.051912772
## 270 -0.03531285  0.0658494089  0.122016733  0.14529810  0.064477196
## 187 -0.16077978 -0.0002624782 -0.059705528  0.08564050 -0.006007364
##             PC28          PC29          PC30 diagnosis
## 129 -0.020590844 -0.0067291299  0.0241993432         B
## 509  0.015111608  0.0064330734  0.0008658221         B
## 471  0.001140486 -0.0022162740 -0.0119609197         B
## 299  0.001586351 -0.0028392557 -0.0021066943         B
## 270 -0.003498468 -0.0147883926  0.0105659925         B
## 187  0.032235851 -0.0005438972  0.0052511327         M

Biplot of Principal Components

Transform Test Dataset Using Train Dataset’s PCA
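The point of this step is that the test data are projected using the centering, scaling, and rotation learned from the training data alone, so that nothing from the test set influences the transformation. A minimal sketch with simulated data follows; `make_toy()` is a hypothetical helper defined only for this example.

```r
set.seed(11)

# Hypothetical helper that simulates a small feature table
make_toy <- function(n) {
  radius <- rnorm(n, mean = 14, sd = 3.5)
  data.frame(
    radius_mean    = radius,
    perimeter_mean = 2 * pi * radius + rnorm(n, 0, 2),
    texture_mean   = rnorm(n, mean = 19, sd = 4)
  )
}
train_x <- make_toy(200)
test_x  <- make_toy(50)

# Fit PCA on the training features only
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Project the test features with the training-set parameters
test_pcs <- predict(pca, newdata = test_x)

# The same projection written out by hand
manual <- scale(test_x, center = pca$center, scale = pca$scale) %*% pca$rotation

print(head(test_pcs))
print(all.equal(unname(test_pcs), unname(manual)))
```

Running a separate `prcomp()` on the test set would produce components with different directions and scales, so scores from the two sets would no longer be comparable.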

Applying Machine Learning Method: Logistic Regression

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 282   6
##          M   4 163
##                                         
##                Accuracy : 0.978         
##                  95% CI : (0.96, 0.9894)
##     No Information Rate : 0.6286        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9528        
##                                         
##  Mcnemar's Test P-Value : 0.7518        
##                                         
##             Sensitivity : 0.9860        
##             Specificity : 0.9645        
##          Pos Pred Value : 0.9792        
##          Neg Pred Value : 0.9760        
##              Prevalence : 0.6286        
##          Detection Rate : 0.6198        
##    Detection Prevalence : 0.6330        
##       Balanced Accuracy : 0.9753        
##                                         
##        'Positive' Class : B             
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  1
##          M  1 42
##                                           
##                Accuracy : 0.9825          
##                  95% CI : (0.9381, 0.9979)
##     No Information Rate : 0.6228          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9627          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9859          
##             Specificity : 0.9767          
##          Pos Pred Value : 0.9859          
##          Neg Pred Value : 0.9767          
##              Prevalence : 0.6228          
##          Detection Rate : 0.6140          
##    Detection Prevalence : 0.6228          
##       Balanced Accuracy : 0.9813          
##                                           
##        'Positive' Class : B               
## 
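The headline statistics in these reports follow directly from the four cell counts. As a worked example, the sketch below recomputes them for the second logistic regression matrix above (114 observations, which corresponds to the test set). Note that the reports treat B as the positive class, so "Sensitivity" is the share of benign cases predicted benign, and "Specificity" is the share of malignant cases predicted malignant.

```r
# Counts copied from the second logistic regression matrix above
cm <- matrix(
  c(70, 1,    # Reference = B: predicted B, predicted M
    1, 42),   # Reference = M: predicted B, predicted M
  nrow = 2,
  dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M"))
)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["B", "B"] / sum(cm[, "B"])  # benign cases predicted benign
specificity <- cm["M", "M"] / sum(cm[, "M"])  # malignant cases predicted malignant
balanced    <- (sensitivity + specificity) / 2

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, balanced_accuracy = balanced), 4)
print(metrics)
```

The results match the Accuracy (0.9825), Sensitivity (0.9859), Specificity (0.9767), and Balanced Accuracy (0.9813) lines of the report. The same arithmetic applies to every confusion matrix in the sections that follow.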

Applying Machine Learning Method: Regularized Generalized Linear Models

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 283  12
##          M   3 157
##                                           
##                Accuracy : 0.967           
##                  95% CI : (0.9462, 0.9814)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9286          
##                                           
##  Mcnemar's Test P-Value : 0.03887         
##                                           
##             Sensitivity : 0.9895          
##             Specificity : 0.9290          
##          Pos Pred Value : 0.9593          
##          Neg Pred Value : 0.9812          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6220          
##    Detection Prevalence : 0.6484          
##       Balanced Accuracy : 0.9593          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  4
##          M  1 39
##                                           
##                Accuracy : 0.9561          
##                  95% CI : (0.9006, 0.9856)
##     No Information Rate : 0.6228          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9053          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9859          
##             Specificity : 0.9070          
##          Pos Pred Value : 0.9459          
##          Neg Pred Value : 0.9750          
##              Prevalence : 0.6228          
##          Detection Rate : 0.6140          
##    Detection Prevalence : 0.6491          
##       Balanced Accuracy : 0.9464          
##                                           
##        'Positive' Class : B               
## 

Applying Machine Learning Method: rpart

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 279   4
##          M   7 165
##                                           
##                Accuracy : 0.9758          
##                  95% CI : (0.9572, 0.9879)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9484          
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.9755          
##             Specificity : 0.9763          
##          Pos Pred Value : 0.9859          
##          Neg Pred Value : 0.9593          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6132          
##    Detection Prevalence : 0.6220          
##       Balanced Accuracy : 0.9759          
##                                           
##        'Positive' Class : B               
## 

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 67  2
##          M  4 41
##                                          
##                Accuracy : 0.9474         
##                  95% CI : (0.889, 0.9804)
##     No Information Rate : 0.6228         
##     P-Value [Acc > NIR] : 5.203e-16      
##                                          
##                   Kappa : 0.889          
##                                          
##  Mcnemar's Test P-Value : 0.6831         
##                                          
##             Sensitivity : 0.9437         
##             Specificity : 0.9535         
##          Pos Pred Value : 0.9710         
##          Neg Pred Value : 0.9111         
##              Prevalence : 0.6228         
##          Detection Rate : 0.5877         
##    Detection Prevalence : 0.6053         
##       Balanced Accuracy : 0.9486         
##                                          
##        'Positive' Class : B              
## 

Applying Machine Learning Method: Random Forest (RF)

##              B         M MeanDecreaseAccuracy MeanDecreaseGini
## PC1 108.547924 91.707746           117.112479       133.462240
## PC2  26.297583 10.779132            27.088647        23.294656
## PC3  20.296545 14.382881            23.325427        24.599899
## PC4   9.394303  5.163344            10.056945         9.248541
## PC5  17.359557  6.664207            18.126283        11.634863
## PC6   7.663011  1.633911             7.265714         9.460126

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 279  11
##          M   7 158
##                                           
##                Accuracy : 0.9604          
##                  95% CI : (0.9382, 0.9764)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9149          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.9755          
##             Specificity : 0.9349          
##          Pos Pred Value : 0.9621          
##          Neg Pred Value : 0.9576          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6132          
##    Detection Prevalence : 0.6374          
##       Balanced Accuracy : 0.9552          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 69  6
##          M  2 37
##                                           
##                Accuracy : 0.9298          
##                  95% CI : (0.8664, 0.9692)
##     No Information Rate : 0.6228          
##     P-Value [Acc > NIR] : 4.081e-14       
##                                           
##                   Kappa : 0.8478          
##                                           
##  Mcnemar's Test P-Value : 0.2888          
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.8605          
##          Pos Pred Value : 0.9200          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.6228          
##          Detection Rate : 0.6053          
##    Detection Prevalence : 0.6579          
##       Balanced Accuracy : 0.9161          
##                                           
##        'Positive' Class : B               
## 

Applying Machine Learning Method: Support Vector Machine (SVM)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 284   9
##          M   2 160
##                                           
##                Accuracy : 0.9758          
##                  95% CI : (0.9572, 0.9879)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9478          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9930          
##             Specificity : 0.9467          
##          Pos Pred Value : 0.9693          
##          Neg Pred Value : 0.9877          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6242          
##    Detection Prevalence : 0.6440          
##       Balanced Accuracy : 0.9699          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  6
##          M  1 37
##                                          
##                Accuracy : 0.9386         
##                  95% CI : (0.8776, 0.975)
##     No Information Rate : 0.6228         
##     P-Value [Acc > NIR] : 4.947e-15      
##                                          
##                   Kappa : 0.8662         
##                                          
##  Mcnemar's Test P-Value : 0.1306         
##                                          
##             Sensitivity : 0.9859         
##             Specificity : 0.8605         
##          Pos Pred Value : 0.9211         
##          Neg Pred Value : 0.9737         
##              Prevalence : 0.6228         
##          Detection Rate : 0.6140         
##    Detection Prevalence : 0.6667         
##       Balanced Accuracy : 0.9232         
##                                          
##        'Positive' Class : B              
## 

Applying Machine Learning Method: Linear Discriminant Analysis (LDA)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 286  21
##          M   0 148
##                                           
##                Accuracy : 0.9538          
##                  95% CI : (0.9303, 0.9712)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8986          
##                                           
##  Mcnemar's Test P-Value : 1.275e-05       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8757          
##          Pos Pred Value : 0.9316          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6286          
##    Detection Prevalence : 0.6747          
##       Balanced Accuracy : 0.9379          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  8
##          M  0 35
##                                           
##                Accuracy : 0.9298          
##                  95% CI : (0.8664, 0.9692)
##     No Information Rate : 0.6228          
##     P-Value [Acc > NIR] : 4.081e-14       
##                                           
##                   Kappa : 0.845           
##                                           
##  Mcnemar's Test P-Value : 0.01333         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8140          
##          Pos Pred Value : 0.8987          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6228          
##          Detection Rate : 0.6228          
##    Detection Prevalence : 0.6930          
##       Balanced Accuracy : 0.9070          
##                                           
##        'Positive' Class : B               
## 

Applying Machine Learning Method: C5.0

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 283   4
##          M   3 165
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9686, 0.9938)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.967           
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9895          
##             Specificity : 0.9763          
##          Pos Pred Value : 0.9861          
##          Neg Pred Value : 0.9821          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6220          
##    Detection Prevalence : 0.6308          
##       Balanced Accuracy : 0.9829          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 63  7
##          M  8 36
##                                           
##                Accuracy : 0.8684          
##                  95% CI : (0.7923, 0.9244)
##     No Information Rate : 0.6228          
##     P-Value [Acc > NIR] : 5.354e-09       
##                                           
##                   Kappa : 0.7212          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8873          
##             Specificity : 0.8372          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.8182          
##              Prevalence : 0.6228          
##          Detection Rate : 0.5526          
##    Detection Prevalence : 0.6140          
##       Balanced Accuracy : 0.8623          
##                                           
##        'Positive' Class : B               
## 
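
The statistics printed under each matrix follow directly from the four cell counts. As a minimal sketch in base R, the counts of the second C5.0 matrix above (the 114-case one) are typed in by hand and the headline figures are recomputed; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the 114-case C5.0 matrix above
# (rows = prediction, columns = reference); matrix() fills column by column
cm <- matrix(c(63, 8, 7, 36), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))

positive <- "B"   # the output reports 'Positive' Class : B
negative <- "M"

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm[positive, positive] / sum(cm[, positive])  # benign cases predicted benign
specificity <- cm[negative, negative] / sum(cm[, negative])  # malignant cases predicted malignant
ppv         <- cm[positive, positive] / sum(cm[positive, ])
npv         <- cm[negative, negative] / sum(cm[negative, ])

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, PosPredValue = ppv,
        NegPredValue = npv), 4)
```

Because benign is treated as the positive class, the share of malignant cases that the model catches is the value reported as Specificity (36 of 43 here), not Sensitivity.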

Applying Machine Learning Method: ctree

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 276  19
##          M  10 150
##                                           
##                Accuracy : 0.9363          
##                  95% CI : (0.9097, 0.9569)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.862           
##                                           
##  Mcnemar's Test P-Value : 0.1374          
##                                           
##             Sensitivity : 0.9650          
##             Specificity : 0.8876          
##          Pos Pred Value : 0.9356          
##          Neg Pred Value : 0.9375          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6066          
##    Detection Prevalence : 0.6484          
##       Balanced Accuracy : 0.9263          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 66  7
##          M  5 36
##                                           
##                Accuracy : 0.8947          
##                  95% CI : (0.8233, 0.9444)
##     No Information Rate : 0.6228          
##     P-Value [Acc > NIR] : 5.965e-11       
##                                           
##                   Kappa : 0.7739          
##                                           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.9296          
##             Specificity : 0.8372          
##          Pos Pred Value : 0.9041          
##          Neg Pred Value : 0.8780          
##              Prevalence : 0.6228          
##          Detection Rate : 0.5789          
##    Detection Prevalence : 0.6404          
##       Balanced Accuracy : 0.8834          
##                                           
##        'Positive' Class : B               
## 

Decision Tree with Original Variables

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 284   5
##          M   2 164
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9686, 0.9938)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9669          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9930          
##             Specificity : 0.9704          
##          Pos Pred Value : 0.9827          
##          Neg Pred Value : 0.9880          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6242          
##    Detection Prevalence : 0.6352          
##       Balanced Accuracy : 0.9817          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 67  4
##          M  4 39
##                                           
##                Accuracy : 0.9298          
##                  95% CI : (0.8664, 0.9692)
##     No Information Rate : 0.6228          
##     P-Value [Acc > NIR] : 4.081e-14       
##                                           
##                   Kappa : 0.8506          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9437          
##             Specificity : 0.9070          
##          Pos Pred Value : 0.9437          
##          Neg Pred Value : 0.9070          
##              Prevalence : 0.6228          
##          Detection Rate : 0.5877          
##    Detection Prevalence : 0.6228          
##       Balanced Accuracy : 0.9253          
##                                           
##        'Positive' Class : B               
## 
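
Each method above prints two matrices. Assuming the 455-case matrix is the training split and the 114-case matrix is the held-out split (the two sum to the 569 cases in the dataset, but the output itself does not label them), the gap between the two accuracies gives a rough indication of overfitting. The sketch below recomputes the accuracies from the diagonal counts shown above; the data frame and its column names are illustrative.

```r
# Correctly classified cases (diagonal of each matrix) copied from the output above
results <- data.frame(
  model         = c("C5.0", "ctree", "decision tree (original variables)"),
  train_correct = c(283 + 165, 276 + 150, 284 + 164),
  test_correct  = c(63 + 36, 66 + 36, 67 + 39)
)

# Assumption: 455 cases in the first matrix of each pair, 114 in the second
n_train <- 455
n_test  <- 114

results$train_acc <- results$train_correct / n_train
results$test_acc  <- results$test_correct / n_test
results$gap       <- results$train_acc - results$test_acc

# Largest gap first
results <- results[order(results$gap, decreasing = TRUE), ]
print(round(results[, c("train_acc", "test_acc", "gap")], 4))
print(results$model)
```

Under that assumption, C5.0 shows the widest gap (about 0.98 versus 0.87), while ctree has the lowest accuracy on the larger split but loses the least on the smaller one.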

Variable Importance from Original Attributes

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 278  11
##          M   8 158
##                                           
##                Accuracy : 0.9582          
##                  95% CI : (0.9356, 0.9747)
##     No Information Rate : 0.6286          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9102          
##                                           
##  Mcnemar's Test P-Value : 0.6464          
##                                           
##             Sensitivity : 0.9720          
##             Specificity : 0.9349          
##          Pos Pred Value : 0.9619          
##          Neg Pred Value : 0.9518          
##              Prevalence : 0.6286          
##          Detection Rate : 0.6110          
##    Detection Prevalence : 0.6352          
##       Balanced Accuracy : 0.9535          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  3
##          M  1 40
##                                           
##                Accuracy : 0.9649          
##                  95% CI : (0.9126, 0.9904)
##     No Information Rate : 0.6228          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9246          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.9859          
##             Specificity : 0.9302          
##          Pos Pred Value : 0.9589          
##          Neg Pred Value : 0.9756          
##              Prevalence : 0.6228          
##          Detection Rate : 0.6140          
##    Detection Prevalence : 0.6404          
##       Balanced Accuracy : 0.9581          
##                                           
##        'Positive' Class : B               
## 
##                                  B         M MeanDecreaseAccuracy
## radius_mean              8.6527319  6.896798             9.833006
## texture_mean             7.7099781  8.766658            10.727110
## perimeter_mean           7.1613020  5.781739             8.986680
## area_mean               10.2158662  5.781764            10.937731
## smoothness_mean          2.7963816  7.421065             7.586682
## compactness_mean         3.4147638  4.042742             5.786704
## concavity_mean           8.0424461  8.993043            11.802227
## concave_points_mean      9.6108765 12.608901            15.357859
## symmetry_mean            3.3168255  4.673737             5.956139
## fractal_dimension_mean   2.9844852  1.175255             3.223094
## radius_se                7.7903726  5.777717            10.102631
## texture_se               0.6853298  2.260062             1.777194
## perimeter_se             7.9842918  4.370677             8.979239
## area_se                 10.8453466  6.801102            13.443781
## smoothness_se            1.4640143  2.541501             2.510664
## compactness_se           4.5420048  2.113043             5.075881
## concavity_se             4.2157821  4.347335             6.047456
## concave_points_se        3.9192328  1.230971             3.729495
## symmetry_se              2.4697857  1.997448             3.034253
## fractal_dimension_se     2.1379745  1.187316             2.433550
## radius_worst            12.1393035 10.412392            15.073391
## texture_worst            8.6701377 10.272282            13.089697
## perimeter_worst         11.9250872  9.740643            14.580801
## area_worst              12.9900006 11.160080            16.079525
## smoothness_worst         8.3417432  9.045569            11.738711
## compactness_worst        6.0717922  5.812304             8.217070
## concavity_worst          7.9109109 11.017735            13.972965
## concave_points_worst    13.7695949 13.070010            18.745524
## symmetry_worst           6.7408593  6.203105             8.357989
## fractal_dimension_worst  4.2066268  4.331695             5.998680
##                         MeanDecreaseGini
## radius_mean                   10.6020859
## texture_mean                   2.8024499
## perimeter_mean                 8.9217099
## area_mean                      9.9635087
## smoothness_mean                1.3050828
## compactness_mean               2.8038881
## concavity_mean                13.3169990
## concave_points_mean           24.6556323
## symmetry_mean                  0.8241397
## fractal_dimension_mean         0.7265690
## radius_se                      3.6198822
## texture_se                     0.9270440
## perimeter_se                   2.7343202
## area_se                        7.0384945
## smoothness_se                  0.8453041
## compactness_se                 0.9689220
## concavity_se                   1.9207891
## concave_points_se              1.1152920
## symmetry_se                    0.9416348
## fractal_dimension_se           1.0631534
## radius_worst                  20.4196276
## texture_worst                  3.5414397
## perimeter_worst               20.6397563
## area_worst                    19.5676588
## smoothness_worst               2.7628274
## compactness_worst              3.1209972
## concavity_worst                7.7192475
## concave_points_worst          32.4215476
## symmetry_worst                 2.8691094
## fractal_dimension_worst        1.9204473
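
The importance table is easier to read once it is sorted. The column names `MeanDecreaseAccuracy` and `MeanDecreaseGini` match what `randomForest::importance()` returns, so the table was presumably produced by a random forest, although the fitting code is not shown here. The sketch below does not refit any model: it copies six rows from the table above into a data frame and ranks them by each measure. The ranks are therefore relative to this hand-picked subset only.

```r
# Six rows copied from the importance table above (values unchanged)
imp <- data.frame(
  variable = c("concave_points_worst", "area_worst", "radius_worst",
               "perimeter_worst", "concave_points_mean", "texture_mean"),
  MeanDecreaseAccuracy = c(18.745524, 16.079525, 15.073391,
                           14.580801, 15.357859, 10.727110),
  MeanDecreaseGini     = c(32.4215476, 19.5676588, 20.4196276,
                           20.6397563, 24.6556323, 2.8024499)
)

# Rank 1 = most important within this subset
imp$rank_accuracy <- rank(-imp$MeanDecreaseAccuracy)
imp$rank_gini     <- rank(-imp$MeanDecreaseGini)

imp[order(imp$rank_gini), ]
```

Both measures put `concave_points_worst` first, but they disagree further down: `area_worst` is second by accuracy decrease and fifth by Gini decrease within this subset, so conclusions about individual variables should not rest on a single measure.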

Radar Plot: Benign vs. Malignant

## perform aggregation: get mean of each metric
mean.agg = data %>%

  dplyr::group_by(diagnosis) %>%
  
  dplyr::select(-id) %>%
  
  # mean of every remaining column, computed per diagnosis group
  dplyr::summarise_all(mean)

## perform aggregation: get max of each metric
max.agg = data %>%
  
  dplyr::select(-id, -diagnosis) %>%
  
  # maximum of every remaining column across all cases
  dplyr::summarise_all(max)


## create data frames in the row layout that radarchart() expects

# for mean metrics
mean_metrics_df = rbind(

  # values for outer radarchart edges (maximum values of mean metrics)
  max.agg %>% dplyr::select(ends_with("_mean")),
  
  # values for inner radarchart edges (0)
  rep(0, 10), 
  
  # values for radarchart lines (first row for benign, second row for malignant)
  mean.agg %>% dplyr::select(ends_with("_mean"))
)
rownames(mean_metrics_df) = c(1, 2, as.character(mean.agg$diagnosis))

# for se metrics
se_metrics_df = rbind(

  # values for outer radarchart edges (maximum values of se metrics)
  max.agg %>% dplyr::select(ends_with("_se")),
  
  # values for inner radarchart edges (0)
  rep(0, 10), 
  
  # values for radarchart lines (first row for benign, second row for malignant)
  mean.agg %>% dplyr::select(ends_with("_se"))
)
rownames(se_metrics_df) = c(1, 2, as.character(mean.agg$diagnosis))

# for worst metrics
worst_metrics_df = rbind(

  # values for outer radarchart edges (maximum values of worst metrics)
  max.agg %>% dplyr::select(ends_with("_worst")),
  
  # values for inner radarchart edges (0)
  rep(0, 10), 
  
  # values for radarchart lines (first row for benign, second row for malignant)
  mean.agg %>% dplyr::select(ends_with("_worst"))
)
rownames(worst_metrics_df) = c(1, 2, as.character(mean.agg$diagnosis))


## default radar chart
radarchart(mean_metrics_df, axistype=1, title="Mean Metrics")
legend(x=1, y=1, legend = mean.agg$diagnosis, bty = "n", pch=20 , col=mean.agg$diagnosis, cex=1.2, pt.cex=3)
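
The data frames above rely on the row layout that `fmsb::radarchart()` reads: row 1 holds the maximum of each axis, row 2 the minimum, and every following row is drawn as one polygon. The sketch below shows that layout in isolation. All numbers in it are hypothetical illustration values, not figures from the dataset, and the colours and object names are likewise assumptions.

```r
# Hypothetical group means for three metrics (illustration only)
group_means <- data.frame(
  radius_mean  = c(12, 17),
  texture_mean = c(18, 22),
  area_mean    = c(460, 980),
  row.names    = c("B", "M")
)

# Hypothetical axis maxima (illustration only)
axis_max <- data.frame(radius_mean = 28, texture_mean = 40, area_mean = 2500)

# Row 1 = axis maximum, row 2 = axis minimum, remaining rows = one polygon each
radar_df <- rbind(axis_max, rep(0, ncol(axis_max)), group_means)
rownames(radar_df) <- c("max", "min", rownames(group_means))
print(radar_df)

# Draw the chart only if fmsb is installed
if (requireNamespace("fmsb", quietly = TRUE)) {
  line_cols <- c("steelblue", "firebrick")
  fmsb::radarchart(radar_df, axistype = 1, pcol = line_cols,
                   title = "Illustrative layout")
  legend("topright", legend = rownames(radar_df)[-(1:2)],
         col = line_cols, pch = 20, bty = "n")
}
```

Passing the colours explicitly through `pcol` and reusing the same vector in `legend()` keeps the legend in step with the polygons, rather than relying on the factor codes of the grouping column.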