Create a statistical, binary classification model that can accurately predict malignancy of breast cancer, with given cytopathology data.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].
The dataset can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
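Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.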
The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
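As a quick check of the field layout described above, the 30 feature names can be generated from the ten base features and the three statistics. This is a minimal sketch in base R. The snake_case names follow the column headers of the data preview, and the variable names (`base_features`, `stat_suffixes`, `all_columns`) are illustrative assumptions, not part of the original analysis.

```r
# Ten base features measured for each cell nucleus
base_features <- c("radius", "texture", "perimeter", "area", "smoothness",
                   "compactness", "concavity", "concave_points", "symmetry",
                   "fractal_dimension")
# Three summary statistics computed per image
stat_suffixes <- c("mean", "se", "worst")

# outer() pairs every feature with every statistic; as.vector() reads the
# result column by column, so all "_mean" names come first, then "_se",
# then "_worst"
feature_columns <- as.vector(outer(base_features, stat_suffixes, paste, sep = "_"))

# Fields 1 and 2 are the sample id and the diagnosis label
all_columns <- c("id", "diagnosis", feature_columns)

length(feature_columns)      # 30 features
all_columns[c(3, 13, 23)]    # "radius_mean" "radius_se" "radius_worst"
```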
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## 7 844359 M 18.25 19.98 119.60 1040.0
## 8 84458202 M 13.71 20.83 90.20 577.9
## 9 844981 M 13.00 21.82 87.50 519.8
## 10 84501001 M 12.46 24.04 83.97 475.9
## smoothness_mean compactness_mean concavity_mean concave_points_mean
## 1 0.11840 0.27760 0.30010 0.14710
## 2 0.08474 0.07864 0.08690 0.07017
## 3 0.10960 0.15990 0.19740 0.12790
## 4 0.14250 0.28390 0.24140 0.10520
## 5 0.10030 0.13280 0.19800 0.10430
## 6 0.12780 0.17000 0.15780 0.08089
## 7 0.09463 0.10900 0.11270 0.07400
## 8 0.11890 0.16450 0.09366 0.05985
## 9 0.12730 0.19320 0.18590 0.09353
## 10 0.11860 0.23960 0.22730 0.08543
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## 7 0.1794 0.05742 0.4467 0.7732 3.180
## 8 0.2196 0.07451 0.5835 1.3770 3.856
## 9 0.2350 0.07389 0.3063 1.0020 2.406
## 10 0.2030 0.08243 0.2976 1.5990 2.039
## area_se smoothness_se compactness_se concavity_se concave_points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## 7 53.91 0.004314 0.01382 0.02254 0.01039
## 8 50.96 0.008805 0.03029 0.02488 0.01448
## 9 24.32 0.005731 0.03502 0.03553 0.01226
## 10 23.94 0.007149 0.07217 0.07743 0.01432
## symmetry_se fractal_dimension_se radius_worst texture_worst
## 1 0.03003 0.006193 25.38 17.33
## 2 0.01389 0.003532 24.99 23.41
## 3 0.02250 0.004571 23.57 25.53
## 4 0.05963 0.009208 14.91 26.50
## 5 0.01756 0.005115 22.54 16.67
## 6 0.02165 0.005082 15.47 23.75
## 7 0.01369 0.002179 22.88 27.66
## 8 0.01486 0.005412 17.06 28.14
## 9 0.02143 0.003749 15.49 30.73
## 10 0.01789 0.010080 15.09 40.68
## perimeter_worst area_worst smoothness_worst compactness_worst
## 1 184.60 2019.0 0.1622 0.6656
## 2 158.80 1956.0 0.1238 0.1866
## 3 152.50 1709.0 0.1444 0.4245
## 4 98.87 567.7 0.2098 0.8663
## 5 152.20 1575.0 0.1374 0.2050
## 6 103.40 741.6 0.1791 0.5249
## 7 153.20 1606.0 0.1442 0.2576
## 8 110.60 897.0 0.1654 0.3682
## 9 106.20 739.3 0.1703 0.5401
## 10 97.65 711.4 0.1853 1.0580
## concavity_worst concave_points_worst symmetry_worst
## 1 0.7119 0.2654 0.4601
## 2 0.2416 0.1860 0.2750
## 3 0.4504 0.2430 0.3613
## 4 0.6869 0.2575 0.6638
## 5 0.4000 0.1625 0.2364
## 6 0.5355 0.1741 0.3985
## 7 0.3784 0.1932 0.3063
## 8 0.2678 0.1556 0.3196
## 9 0.5390 0.2060 0.4378
## 10 1.1050 0.2210 0.4366
## fractal_dimension_worst
## 1 0.11890
## 2 0.08902
## 3 0.08758
## 4 0.17300
## 5 0.07678
## 6 0.12440
## 7 0.08368
## 8 0.11510
## 9 0.10720
## 10 0.20750
We will split the dataset into a training dataset and a test dataset. We will perform our analyses and create models using the training dataset. Then, we will evaluate the model performance in the test dataset.
# get the number of rows
nrows = nrow(data)
# set random seed for reproducibility
set.seed(1)
# get indices
indices = sample(1:nrows, 0.8 * nrows)
# obtain train and test datasets
train = data[indices, ]
test = data[-indices, ]
We want to make sure the values of the outcome variable are similarly distributed in the training and test datasets. We can check this by counting the values of the outcome variable (diagnosis) in each dataset and comparing the proportions.
##
## B M
## 0.6285714 0.3714286
##
## B M
## 0.622807 0.377193
Since the proportions are roughly equal, we will proceed to the next step.
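The proportion check above can be sketched in base R with `table()` and `prop.table()`. The example uses a synthetic label vector with the same class counts as the confusion matrices in this report imply (357 benign, 212 malignant) rather than the real data, so its printed proportions will not match the output above exactly. The names `labels`, `train_labels`, and `test_labels` are hypothetical.

```r
# Synthetic outcome vector (assumption: stands in for data$diagnosis)
labels <- factor(rep(c("B", "M"), times = c(357, 212)))

set.seed(1)
n <- length(labels)
# floor() makes the truncation of 0.8 * n explicit
train_idx <- sample(seq_len(n), floor(0.8 * n))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class proportions in each split
train_props <- prop.table(table(train_labels))
test_props  <- prop.table(table(test_labels))
print(round(train_props, 3))
print(round(test_props, 3))

# Largest absolute gap between the two splits; a small value suggests
# the random split preserved the class balance reasonably well
max(abs(as.numeric(train_props) - as.numeric(test_props)))
```

If the gap were large, a stratified split (sampling within each class separately) would be one way to reduce it.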
We want to find the variables that have the most predictive power. In our case, all of the predictor variables are continuous and the outcome variable is categorical. Hence, we can compute the average value of each attribute, grouped by the outcome variable, and observe the difference between the groups.
For example, the outcome variable takes two values: B (benign) or M (malignant). We can calculate the mean radius for the malignant tumor cells and compare it to the mean radius of the benign tumor cells. If the difference between these two means is large relative to the spread of the values, the attribute (radius) is likely to be a useful predictor.
## B M
## 12.15995 17.52302
The average mean radius of benign cell nuclei is 12.16, while that of malignant nuclei is 17.52.
We can do this for each attribute.
## # A tibble: 2 x 32
## diagnosis id radius_mean texture_mean perimeter_mean area_mean
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 B 2.32e7 12.2 17.9 78.2 464.
## 2 M 4.26e7 17.5 21.6 116. 987.
## # … with 26 more variables: smoothness_mean <dbl>, compactness_mean <dbl>,
## # concavity_mean <dbl>, concave_points_mean <dbl>, symmetry_mean <dbl>,
## # fractal_dimension_mean <dbl>, radius_se <dbl>, texture_se <dbl>,
## # perimeter_se <dbl>, area_se <dbl>, smoothness_se <dbl>,
## # compactness_se <dbl>, concavity_se <dbl>, concave_points_se <dbl>,
## # symmetry_se <dbl>, fractal_dimension_se <dbl>, radius_worst <dbl>,
## # texture_worst <dbl>, perimeter_worst <dbl>, area_worst <dbl>,
## # smoothness_worst <dbl>, compactness_worst <dbl>,
## # concavity_worst <dbl>, concave_points_worst <dbl>,
## # symmetry_worst <dbl>, fractal_dimension_worst <dbl>
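A grouped comparison of this kind can be written in base R with `aggregate()`. The toy data frame below is made up for illustration (its values are not from the dataset), and the standardized difference is one possible way to compare separation across features measured on different scales, not necessarily what the original analysis used.

```r
# Toy data (assumption: invented values, two features, two classes)
toy <- data.frame(
  diagnosis = factor(c("B", "B", "B", "M", "M", "M")),
  radius    = c(11, 12, 13, 17, 18, 19),
  texture   = c(18, 20, 22, 19, 21, 23)
)

# Mean of every feature within each diagnosis group
group_means <- aggregate(. ~ diagnosis, data = toy, FUN = mean)
print(group_means)

# Difference in group means divided by the overall standard deviation,
# so that features on different scales can be compared
features <- setdiff(names(toy), "diagnosis")
std_diff <- sapply(features, function(f) {
  m <- tapply(toy[[f]], toy$diagnosis, mean)
  unname((m["M"] - m["B"]) / sd(toy[[f]]))
})
print(round(std_diff, 2))
```

In this toy example the radius separates the two groups far more clearly than the texture does, which is the pattern the comparison is meant to surface.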
We can visualize the above result with distribution plots. The more “separated out” the distributions are, the more predictive power the feature has.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Because we have many redundant, highly-correlated features in our dataset, we would like to reduce the number of feature variables. One method we can apply is principal component analysis (PCA), a very popular dimension reduction technique.
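A minimal sketch of this step with base R's `prcomp()`, shown on the built-in `iris` measurements rather than the breast cancer data. The split indices and the names `train_x`, `test_x`, and `pca_fit` are illustrative assumptions. The components are fitted on the training rows only, and the test rows are then projected with the same centering, scaling, and rotation.

```r
# Numeric predictors only (assumption: iris stands in for the 30 features)
x <- iris[, 1:4]

set.seed(1)
train_idx <- sample(seq_len(nrow(x)), floor(0.8 * nrow(x)))
train_x <- x[train_idx, ]
test_x  <- x[-train_idx, ]

# Center and scale so that features on large scales do not dominate
pca_fit <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Standard deviation, proportion of variance, cumulative proportion
print(summary(pca_fit)$importance)

# Scores for the training rows, and the test rows projected onto the
# components learned from the training rows
train_scores <- as.data.frame(pca_fit$x)
test_scores  <- as.data.frame(predict(pca_fit, newdata = test_x))
head(test_scores, 3)
```

Fitting on the training rows alone keeps information from the test rows out of the components, so the later test-set evaluation stays independent of the fitting step.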
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 3.6665 2.3917 1.65918 1.39033 1.28019 1.0996
## Proportion of Variance 0.4481 0.1907 0.09176 0.06443 0.05463 0.0403
## Cumulative Proportion 0.4481 0.6388 0.73054 0.79498 0.84960 0.8899
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.80160 0.68717 0.65478 0.58253 0.54029 0.50711
## Proportion of Variance 0.02142 0.01574 0.01429 0.01131 0.00973 0.00857
## Cumulative Proportion 0.91132 0.92706 0.94136 0.95267 0.96240 0.97097
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.48177 0.4024 0.29294 0.28644 0.24030 0.23584
## Proportion of Variance 0.00774 0.0054 0.00286 0.00273 0.00192 0.00185
## Cumulative Proportion 0.97871 0.9841 0.98697 0.98970 0.99162 0.99348
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.20220 0.17887 0.16663 0.15888 0.14154 0.13167
## Proportion of Variance 0.00136 0.00107 0.00093 0.00084 0.00067 0.00058
## Cumulative Proportion 0.99484 0.99591 0.99683 0.99768 0.99834 0.99892
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.1229 0.09183 0.08237 0.03696 0.02361 0.01136
## Proportion of Variance 0.0005 0.00028 0.00023 0.00005 0.00002 0.00000
## Cumulative Proportion 0.9994 0.99971 0.99993 0.99998 1.00000 1.00000
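The proportions in this summary follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The sketch below recomputes the first six columns from the printed values, assuming the 30 features were standardized before PCA so that the total variance is 30 (an assumption, though it is consistent with the printed proportions).

```r
# Standard deviations of PC1..PC6 copied from the summary output above
sdev <- c(PC1 = 3.6665, PC2 = 2.3917, PC3 = 1.65918,
          PC4 = 1.39033, PC5 = 1.28019, PC6 = 1.0996)

# Assumption: 30 standardized features, each contributing variance 1
total_variance <- 30

prop_var <- sdev^2 / total_variance
cum_prop <- cumsum(prop_var)
print(round(rbind(prop_var, cum_prop), 4))

# Smallest number of components reaching a chosen threshold
# (85% here is an arbitrary illustrative choice)
threshold <- 0.85
n_components <- which(cum_prop >= threshold)[1]
n_components
```

With these values six components are needed to pass 85%, which happens to coincide with the six components used in the models that follow.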
fviz_eig(train.pca, addlabels=TRUE, ylim=c(0,60), geom = c("bar", "line"), barfill="pink", barcolor="grey", linecolor="red", ncp=10) +
  labs(title = "Variance Explained By Each Principal Component",
       x = "Principal Components", y = "% of Variance")
## PC1 PC2 PC3 PC4 PC5 PC6
## 129 -1.3161432 -1.298799 0.6362512 1.9286064 0.1527150 -0.4583164
## 509 1.2291376 1.601287 -0.7205266 1.8258560 -0.7471148 -0.8341412
## 471 2.6627577 -1.394929 0.2782799 -0.3293326 0.9021914 0.8810544
## 299 2.7856293 2.498263 -0.8437551 -0.3002056 -1.6154444 0.8424754
## 270 0.8148387 -3.069724 1.5253843 -0.3360296 -0.7738540 -0.4785682
## 187 -0.2945659 3.498859 -2.1654255 0.2387659 -0.9226097 0.4171881
## PC7 PC8 PC9 PC10 PC11 PC12
## 129 -1.1554734 -0.36953772 -0.44423390 0.3959139 1.5154993 0.42925567
## 509 -1.3156708 0.36162608 0.12976703 0.6582973 0.0923487 -0.01735643
## 471 0.5809932 -0.39689831 1.17929674 0.7854529 -0.2228470 0.68258717
## 299 0.7445195 0.11701140 0.05602298 0.4538755 -0.1293102 0.48797305
## 270 -0.3731491 0.03439504 0.08585404 -0.3194456 0.6013498 -0.68208920
## 187 -0.6186967 0.24644361 -0.19065865 0.2966028 -0.3576314 -0.24618479
## PC13 PC14 PC15 PC16 PC17
## 129 0.041236860 0.26019609 0.07328204 0.13121570 -0.0924210679
## 509 0.322709210 -0.17963353 -0.21570085 -0.37502036 -0.3325836550
## 471 0.116939814 -0.14114442 -0.05992310 0.08107342 -0.0136590961
## 299 -0.033697334 -0.23996721 0.16061233 -0.02408015 -0.0003508203
## 270 0.003409216 0.70607398 0.13587239 0.04430825 0.0239678302
## 187 0.109699478 -0.01555519 0.23156169 0.16456625 0.1628610489
## PC18 PC19 PC20 PC21 PC22
## 129 -0.13995513 0.12271246 0.408426271 0.02018819 0.15594383
## 509 -0.15397185 0.20384521 -0.056650831 0.08773525 0.04537708
## 471 -0.06485303 -0.06851374 -0.166331501 -0.19113485 0.04571498
## 299 0.04418318 -0.01268881 -0.002053593 -0.15120228 -0.04346294
## 270 -0.04297632 -0.11088803 0.212471581 0.09292700 0.12864791
## 187 0.08683317 -0.10262787 -0.130109622 0.27366184 -0.20232836
## PC23 PC24 PC25 PC26 PC27
## 129 0.15039131 -0.2237835486 0.152552119 0.08729756 -0.138855825
## 509 -0.11156990 -0.0082444344 0.045367252 -0.05915807 0.021252677
## 471 0.01086144 0.0134687722 -0.007700176 0.14192225 0.083548837
## 299 -0.07035428 -0.0114456437 -0.121228608 -0.10780136 0.051912772
## 270 -0.03531285 0.0658494089 0.122016733 0.14529810 0.064477196
## 187 -0.16077978 -0.0002624782 -0.059705528 0.08564050 -0.006007364
## PC28 PC29 PC30 diagnosis
## 129 -0.020590844 -0.0067291299 0.0241993432 B
## 509 0.015111608 0.0064330734 0.0008658221 B
## 471 0.001140486 -0.0022162740 -0.0119609197 B
## 299 0.001586351 -0.0028392557 -0.0021066943 B
## 270 -0.003498468 -0.0147883926 0.0105659925 B
## 187 0.032235851 -0.0005438972 0.0052511327 M
## first and second principal components
ggplot(train.pc) +
  geom_point(aes(x=PC1, y=PC2, color=diagnosis))
## second and third principal components
ggplot(train.pc) +
  geom_point(aes(x=PC2, y=PC3, color=diagnosis))
fviz_pca_biplot(train.pca, col.ind = train$diagnosis, col="black",
                palette = "jco", geom = "point", repel=TRUE,
                legend.title="Diagnosis", addEllipses = TRUE)
## create model
glm_pca_slim_model = glm(formula = diagnosis ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
                         data = train.pc,
                         family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## create prediction probabilities (on train dataset)
glm_pca_slim_train_pred_probs = predict(glm_pca_slim_model, type="response")
## create predictions (on train dataset)
glm_pca_slim_train_preds = as.factor(ifelse(glm_pca_slim_train_pred_probs > 0.5, "M", "B"))
## evaluate performance (on train dataset)
confusionMatrix(glm_pca_slim_train_preds, train.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 282 6
## M 4 163
##
## Accuracy : 0.978
## 95% CI : (0.96, 0.9894)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9528
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.9860
## Specificity : 0.9645
## Pos Pred Value : 0.9792
## Neg Pred Value : 0.9760
## Prevalence : 0.6286
## Detection Rate : 0.6198
## Detection Prevalence : 0.6330
## Balanced Accuracy : 0.9753
##
## 'Positive' Class : B
##
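The statistics in this output can be reproduced by hand from the four cell counts. Note that caret reports `'Positive' Class : B`, so sensitivity here is the fraction of benign cases classified correctly and specificity is the fraction of malignant cases classified correctly. A minimal sketch in base R; the matrix layout and variable names are illustrative.

```r
# Counts copied from the training confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(282, 4, 6, 163), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))

accuracy <- sum(diag(cm)) / sum(cm)
# Positive class is "B": benign cases predicted benign / all benign cases
sensitivity <- cm["B", "B"] / sum(cm[, "B"])
# Negative class is "M": malignant cases predicted malignant / all malignant cases
specificity <- cm["M", "M"] / sum(cm[, "M"])
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 4)
```

If malignant should be treated as the positive class instead, caret's `confusionMatrix()` accepts a `positive` argument (for example `positive = "M"`), which swaps the roles of sensitivity and specificity.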
## create prediction probabilities (on test dataset)
glm_pca_slim_test_pred_probs = predict(glm_pca_slim_model, type="response", newdata=test.pc)
## create predictions (on test dataset)
glm_pca_slim_test_preds = as.factor(ifelse(glm_pca_slim_test_pred_probs > 0.5, "M", "B"))
## evaluate performance (on test dataset)
confusionMatrix(glm_pca_slim_test_preds, test.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 1
## M 1 42
##
## Accuracy : 0.9825
## 95% CI : (0.9381, 0.9979)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9627
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9859
## Specificity : 0.9767
## Pos Pred Value : 0.9859
## Neg Pred Value : 0.9767
## Prevalence : 0.6228
## Detection Rate : 0.6140
## Detection Prevalence : 0.6228
## Balanced Accuracy : 0.9813
##
## 'Positive' Class : B
##
## save first 6 PC variable column names
pc_colnames = c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6")
## create model
glmnet_pca_slim_model = glmnet(x = as.matrix(train.pc[ , pc_colnames]),
y = train.pc$diagnosis,
family = "binomial")
## create prediction probabilities (on train dataset)
glmnet_pca_slim_train_pred_probs = predict(glmnet_pca_slim_model,
s=0.01,
newx=as.matrix(train.pc[ , pc_colnames]),
type="response")
## create predictions (on train dataset)
glmnet_pca_slim_train_preds = as.factor(ifelse(glmnet_pca_slim_train_pred_probs > 0.5, "M", "B"))
## evaluate performance (on train dataset)
confusionMatrix(glmnet_pca_slim_train_preds, train.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 283 12
## M 3 157
##
## Accuracy : 0.967
## 95% CI : (0.9462, 0.9814)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9286
##
## Mcnemar's Test P-Value : 0.03887
##
## Sensitivity : 0.9895
## Specificity : 0.9290
## Pos Pred Value : 0.9593
## Neg Pred Value : 0.9812
## Prevalence : 0.6286
## Detection Rate : 0.6220
## Detection Prevalence : 0.6484
## Balanced Accuracy : 0.9593
##
## 'Positive' Class : B
##
## create prediction probabilities (on test dataset)
glmnet_pca_slim_test_pred_probs = predict(glmnet_pca_slim_model,
s=0.01,
newx=as.matrix(test.pc[ , pc_colnames]),
type="response")
## create predictions (on test dataset)
glmnet_pca_slim_test_preds = as.factor(ifelse(glmnet_pca_slim_test_pred_probs > 0.5, "M", "B"))
## evaluate performance (on test dataset)
confusionMatrix(glmnet_pca_slim_test_preds, test.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 4
## M 1 39
##
## Accuracy : 0.9561
## 95% CI : (0.9006, 0.9856)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9053
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.9859
## Specificity : 0.9070
## Pos Pred Value : 0.9459
## Neg Pred Value : 0.9750
## Prevalence : 0.6228
## Detection Rate : 0.6140
## Detection Prevalence : 0.6491
## Balanced Accuracy : 0.9464
##
## 'Positive' Class : B
##
## create model
rpart_pca_slim_model = rpart(diagnosis ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train.pc,
method = 'class',
control=rpart.control(minsplit=2))
## create prediction probabilities (on train dataset)
rpart_pca_slim_train_pred_probs = predict(rpart_pca_slim_model)
## create predictions (on train dataset)
rpart_pca_slim_train_preds = as.factor(ifelse(rpart_pca_slim_train_pred_probs[ , 2] > 0.5, "M", "B"))
## evaluate performance (on train dataset)
confusionMatrix(rpart_pca_slim_train_preds, train.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 279 4
## M 7 165
##
## Accuracy : 0.9758
## 95% CI : (0.9572, 0.9879)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9484
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.9755
## Specificity : 0.9763
## Pos Pred Value : 0.9859
## Neg Pred Value : 0.9593
## Prevalence : 0.6286
## Detection Rate : 0.6132
## Detection Prevalence : 0.6220
## Balanced Accuracy : 0.9759
##
## 'Positive' Class : B
##
## create prediction probabilities (on test dataset)
rpart_pca_slim_test_pred_probs = predict(rpart_pca_slim_model, newdata=test.pc)
## create predictions (on test dataset)
rpart_pca_slim_test_preds = as.factor(ifelse(rpart_pca_slim_test_pred_probs[ , 2] > 0.5, "M", "B"))
## evaluate performance (on test dataset)
confusionMatrix(rpart_pca_slim_test_preds, test.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 67 2
## M 4 41
##
## Accuracy : 0.9474
## 95% CI : (0.889, 0.9804)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : 5.203e-16
##
## Kappa : 0.889
##
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9437
## Specificity : 0.9535
## Pos Pred Value : 0.9710
## Neg Pred Value : 0.9111
## Prevalence : 0.6228
## Detection Rate : 0.5877
## Detection Prevalence : 0.6053
## Balanced Accuracy : 0.9486
##
## 'Positive' Class : B
##
## create model
rf_pca_slim_model = randomForest(diagnosis ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data=train.pc, ntree=500, proximity=T, importance=T)
## see variable importance
importance(rf_pca_slim_model)
## B M MeanDecreaseAccuracy MeanDecreaseGini
## PC1 108.547924 91.707746 117.112479 133.462240
## PC2 26.297583 10.779132 27.088647 23.294656
## PC3 20.296545 14.382881 23.325427 24.599899
## PC4 9.394303 5.163344 10.056945 9.248541
## PC5 17.359557 6.664207 18.126283 11.634863
## PC6 7.663011 1.633911 7.265714 9.460126
## visualize variable importance
varImpPlot(rf_pca_slim_model, pch=18, col="red", main="Variable Importance")
## create predictions (on train dataset; without newdata, randomForest returns out-of-bag predictions)
rf_pca_slim_train_preds = predict(rf_pca_slim_model)
## evaluate performance (on train dataset)
confusionMatrix(rf_pca_slim_train_preds, train.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 279 11
## M 7 158
##
## Accuracy : 0.9604
## 95% CI : (0.9382, 0.9764)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9149
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9755
## Specificity : 0.9349
## Pos Pred Value : 0.9621
## Neg Pred Value : 0.9576
## Prevalence : 0.6286
## Detection Rate : 0.6132
## Detection Prevalence : 0.6374
## Balanced Accuracy : 0.9552
##
## 'Positive' Class : B
##
## create predictions (on test dataset)
rf_pca_slim_test_preds = predict(rf_pca_slim_model, newdata=test.pc[ , pc_colnames])
## evaluate performance (on test dataset)
confusionMatrix(rf_pca_slim_test_preds, test.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 69 6
## M 2 37
##
## Accuracy : 0.9298
## 95% CI : (0.8664, 0.9692)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : 4.081e-14
##
## Kappa : 0.8478
##
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.9718
## Specificity : 0.8605
## Pos Pred Value : 0.9200
## Neg Pred Value : 0.9487
## Prevalence : 0.6228
## Detection Rate : 0.6053
## Detection Prevalence : 0.6579
## Balanced Accuracy : 0.9161
##
## 'Positive' Class : B
##
## create model
svm_pca_slim_model = svm(diagnosis ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train.pc)
## create predictions (on train dataset)
svm_pca_slim_train_preds = predict(svm_pca_slim_model)
## evaluate performance (on train dataset)
confusionMatrix(svm_pca_slim_train_preds, train.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 284 9
## M 2 160
##
## Accuracy : 0.9758
## 95% CI : (0.9572, 0.9879)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9478
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9930
## Specificity : 0.9467
## Pos Pred Value : 0.9693
## Neg Pred Value : 0.9877
## Prevalence : 0.6286
## Detection Rate : 0.6242
## Detection Prevalence : 0.6440
## Balanced Accuracy : 0.9699
##
## 'Positive' Class : B
##
## create predictions (on test dataset)
svm_pca_slim_test_preds = predict(svm_pca_slim_model, newdata=test.pc)
## evaluate performance (on test dataset)
confusionMatrix(svm_pca_slim_test_preds, test.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 6
## M 1 37
##
## Accuracy : 0.9386
## 95% CI : (0.8776, 0.975)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : 4.947e-15
##
## Kappa : 0.8662
##
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.9859
## Specificity : 0.8605
## Pos Pred Value : 0.9211
## Neg Pred Value : 0.9737
## Prevalence : 0.6228
## Detection Rate : 0.6140
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.9232
##
## 'Positive' Class : B
##
## create model
lda_pca_slim_model = lda(diagnosis ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data=train.pc)
## create predictions (on train dataset)
lda_pca_slim_train_preds = predict(lda_pca_slim_model)$class
## evaluate performance (on train dataset)
confusionMatrix(lda_pca_slim_train_preds, train.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 286 21
## M 0 148
##
## Accuracy : 0.9538
## 95% CI : (0.9303, 0.9712)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8986
##
## Mcnemar's Test P-Value : 1.275e-05
##
## Sensitivity : 1.0000
## Specificity : 0.8757
## Pos Pred Value : 0.9316
## Neg Pred Value : 1.0000
## Prevalence : 0.6286
## Detection Rate : 0.6286
## Detection Prevalence : 0.6747
## Balanced Accuracy : 0.9379
##
## 'Positive' Class : B
##
## create predictions (on test dataset)
lda_pca_slim_test_preds = predict(lda_pca_slim_model, newdata=test.pc)$class
## evaluate performance (on test dataset)
confusionMatrix(lda_pca_slim_test_preds, test.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 8
## M 0 35
##
## Accuracy : 0.9298
## 95% CI : (0.8664, 0.9692)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : 4.081e-14
##
## Kappa : 0.845
##
## Mcnemar's Test P-Value : 0.01333
##
## Sensitivity : 1.0000
## Specificity : 0.8140
## Pos Pred Value : 0.8987
## Neg Pred Value : 1.0000
## Prevalence : 0.6228
## Detection Rate : 0.6228
## Detection Prevalence : 0.6930
## Balanced Accuracy : 0.9070
##
## 'Positive' Class : B
##
## create model
c5.0_pca_slim_model = C5.0(train.pc[ , pc_colnames], train.pc$diagnosis)
## create predictions (on train dataset)
c5.0_pca_slim_train_preds = predict(c5.0_pca_slim_model, newdata=train.pc[ , pc_colnames])
## evaluate performance (on train dataset)
confusionMatrix(c5.0_pca_slim_train_preds, train.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 283 4
## M 3 165
##
## Accuracy : 0.9846
## 95% CI : (0.9686, 0.9938)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.967
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9895
## Specificity : 0.9763
## Pos Pred Value : 0.9861
## Neg Pred Value : 0.9821
## Prevalence : 0.6286
## Detection Rate : 0.6220
## Detection Prevalence : 0.6308
## Balanced Accuracy : 0.9829
##
## 'Positive' Class : B
##
## create predictions (on test dataset)
c5.0_pca_slim_test_preds = predict(c5.0_pca_slim_model, newdata=test.pc[ , pc_colnames])
## evaluate performance (on test dataset)
confusionMatrix(c5.0_pca_slim_test_preds, test.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 63 7
## M 8 36
##
## Accuracy : 0.8684
## 95% CI : (0.7923, 0.9244)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : 5.354e-09
##
## Kappa : 0.7212
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8873
## Specificity : 0.8372
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.8182
## Prevalence : 0.6228
## Detection Rate : 0.5526
## Detection Prevalence : 0.6140
## Balanced Accuracy : 0.8623
##
## 'Positive' Class : B
##
## create model
ctree_pca_slim_model = ctree(diagnosis ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train.pc,
controls = ctree_control(maxdepth=2))
## create predictions (on train dataset)
ctree_pca_slim_train_preds = predict(ctree_pca_slim_model)
## evaluate performance (on train dataset)
confusionMatrix(ctree_pca_slim_train_preds, train.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 276 19
## M 10 150
##
## Accuracy : 0.9363
## 95% CI : (0.9097, 0.9569)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.862
##
## Mcnemar's Test P-Value : 0.1374
##
## Sensitivity : 0.9650
## Specificity : 0.8876
## Pos Pred Value : 0.9356
## Neg Pred Value : 0.9375
## Prevalence : 0.6286
## Detection Rate : 0.6066
## Detection Prevalence : 0.6484
## Balanced Accuracy : 0.9263
##
## 'Positive' Class : B
##
## create predictions (on test dataset)
ctree_pca_slim_test_preds = predict(ctree_pca_slim_model, newdata=test.pc)
## evaluate performance (on test dataset)
confusionMatrix(ctree_pca_slim_test_preds, test.pc$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 66 7
## M 5 36
##
## Accuracy : 0.8947
## 95% CI : (0.8233, 0.9444)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : 5.965e-11
##
## Kappa : 0.7739
##
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.9296
## Specificity : 0.8372
## Pos Pred Value : 0.9041
## Neg Pred Value : 0.8780
## Prevalence : 0.6228
## Detection Rate : 0.5789
## Detection Prevalence : 0.6404
## Balanced Accuracy : 0.8834
##
## 'Positive' Class : B
##
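To compare the PCA-based models side by side, the test-set accuracies can be recomputed from the confusion matrices above. The sketch below copies the number of correctly classified test cases (the matrix diagonals) for each model; the short model labels are illustrative names.

```r
# Correctly classified test cases (diagonal sums) copied from the
# test-set confusion matrices above; the test set has 114 cases
n_test <- 114
correct <- c(glm = 70 + 42, glmnet = 70 + 39, rpart = 67 + 41,
             randomForest = 69 + 37, svm = 70 + 37, lda = 71 + 35,
             C5.0 = 63 + 36, ctree = 66 + 36)

# Accuracy per model, best first
test_accuracy <- sort(correct / n_test, decreasing = TRUE)

comparison <- data.frame(model = names(test_accuracy),
                         test_accuracy = round(unname(test_accuracy), 4))
print(comparison)
```

On this single split, logistic regression on the first six components has the highest test accuracy. With only 114 test cases, the gaps between neighbouring models amount to a few cases, so the ranking should be read with caution.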
## create model
rpart_model = rpart(diagnosis ~ .,
data = train[ , c("diagnosis", predictor_variables)],
method = 'class',
control = rpart.control(minsplit=2),
model = TRUE)
## create prediction probabilities (on train dataset)
rpart_train_pred_probs = predict(rpart_model)
## create predictions (on train dataset)
rpart_train_preds = as.factor(ifelse(rpart_train_pred_probs[ , 2] > 0.5, "M", "B"))
## evaluate performance (on train dataset)
confusionMatrix(rpart_train_preds, train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 284 5
## M 2 164
##
## Accuracy : 0.9846
## 95% CI : (0.9686, 0.9938)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9669
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9930
## Specificity : 0.9704
## Pos Pred Value : 0.9827
## Neg Pred Value : 0.9880
## Prevalence : 0.6286
## Detection Rate : 0.6242
## Detection Prevalence : 0.6352
## Balanced Accuracy : 0.9817
##
## 'Positive' Class : B
##
## create prediction probabilities (on test dataset)
rpart_test_pred_probs = predict(rpart_model, newdata=test)
## create predictions (on test dataset)
rpart_test_preds = as.factor(ifelse(rpart_test_pred_probs[ , 2] > 0.5, "M", "B"))
## evaluate performance (on test dataset)
confusionMatrix(rpart_test_preds, test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 67 4
## M 4 39
##
## Accuracy : 0.9298
## 95% CI : (0.8664, 0.9692)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : 4.081e-14
##
## Kappa : 0.8506
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9437
## Specificity : 0.9070
## Pos Pred Value : 0.9437
## Neg Pred Value : 0.9070
## Prevalence : 0.6228
## Detection Rate : 0.5877
## Detection Prevalence : 0.6228
## Balanced Accuracy : 0.9253
##
## 'Positive' Class : B
##
## create model
rf_all_model = randomForest(diagnosis ~ .,
data = train[ , c("diagnosis", predictor_variables)],
ntree = 500, proximity = T, importance = T)
## create predictions (on train dataset; without newdata, randomForest returns out-of-bag predictions)
rf_train_preds = predict(rf_all_model)
## evaluate performance (on train dataset)
confusionMatrix(rf_train_preds, train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 278 11
## M 8 158
##
## Accuracy : 0.9582
## 95% CI : (0.9356, 0.9747)
## No Information Rate : 0.6286
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9102
##
## Mcnemar's Test P-Value : 0.6464
##
## Sensitivity : 0.9720
## Specificity : 0.9349
## Pos Pred Value : 0.9619
## Neg Pred Value : 0.9518
## Prevalence : 0.6286
## Detection Rate : 0.6110
## Detection Prevalence : 0.6352
## Balanced Accuracy : 0.9535
##
## 'Positive' Class : B
##
## create predictions (on test dataset)
rf_test_preds = predict(rf_all_model, newdata=test)
## evaluate performance (on test dataset)
confusionMatrix(rf_test_preds, test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 3
## M 1 40
##
## Accuracy : 0.9649
## 95% CI : (0.9126, 0.9904)
## No Information Rate : 0.6228
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9246
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9859
## Specificity : 0.9302
## Pos Pred Value : 0.9589
## Neg Pred Value : 0.9756
## Prevalence : 0.6228
## Detection Rate : 0.6140
## Detection Prevalence : 0.6404
## Balanced Accuracy : 0.9581
##
## 'Positive' Class : B
##
## B M MeanDecreaseAccuracy
## radius_mean 8.6527319 6.896798 9.833006
## texture_mean 7.7099781 8.766658 10.727110
## perimeter_mean 7.1613020 5.781739 8.986680
## area_mean 10.2158662 5.781764 10.937731
## smoothness_mean 2.7963816 7.421065 7.586682
## compactness_mean 3.4147638 4.042742 5.786704
## concavity_mean 8.0424461 8.993043 11.802227
## concave_points_mean 9.6108765 12.608901 15.357859
## symmetry_mean 3.3168255 4.673737 5.956139
## fractal_dimension_mean 2.9844852 1.175255 3.223094
## radius_se 7.7903726 5.777717 10.102631
## texture_se 0.6853298 2.260062 1.777194
## perimeter_se 7.9842918 4.370677 8.979239
## area_se 10.8453466 6.801102 13.443781
## smoothness_se 1.4640143 2.541501 2.510664
## compactness_se 4.5420048 2.113043 5.075881
## concavity_se 4.2157821 4.347335 6.047456
## concave_points_se 3.9192328 1.230971 3.729495
## symmetry_se 2.4697857 1.997448 3.034253
## fractal_dimension_se 2.1379745 1.187316 2.433550
## radius_worst 12.1393035 10.412392 15.073391
## texture_worst 8.6701377 10.272282 13.089697
## perimeter_worst 11.9250872 9.740643 14.580801
## area_worst 12.9900006 11.160080 16.079525
## smoothness_worst 8.3417432 9.045569 11.738711
## compactness_worst 6.0717922 5.812304 8.217070
## concavity_worst 7.9109109 11.017735 13.972965
## concave_points_worst 13.7695949 13.070010 18.745524
## symmetry_worst 6.7408593 6.203105 8.357989
## fractal_dimension_worst 4.2066268 4.331695 5.998680
## MeanDecreaseGini
## radius_mean 10.6020859
## texture_mean 2.8024499
## perimeter_mean 8.9217099
## area_mean 9.9635087
## smoothness_mean 1.3050828
## compactness_mean 2.8038881
## concavity_mean 13.3169990
## concave_points_mean 24.6556323
## symmetry_mean 0.8241397
## fractal_dimension_mean 0.7265690
## radius_se 3.6198822
## texture_se 0.9270440
## perimeter_se 2.7343202
## area_se 7.0384945
## smoothness_se 0.8453041
## compactness_se 0.9689220
## concavity_se 1.9207891
## concave_points_se 1.1152920
## symmetry_se 0.9416348
## fractal_dimension_se 1.0631534
## radius_worst 20.4196276
## texture_worst 3.5414397
## perimeter_worst 20.6397563
## area_worst 19.5676588
## smoothness_worst 2.7628274
## compactness_worst 3.1209972
## concavity_worst 7.7192475
## concave_points_worst 32.4215476
## symmetry_worst 2.8691094
## fractal_dimension_worst 1.9204473
## visualize variable importance
varImpPlot(rf_all_model, pch=18, col="red", main="Variable Importance")
## perform aggregation: get mean of each metric
## (summarise_all() replaces the deprecated summarize_each())
mean.agg = data %>%
  dplyr::group_by(diagnosis) %>%
  dplyr::select(-id) %>%
  dplyr::summarise_all(mean)
## perform aggregation: get max of each metric
max.agg = data %>%
  dplyr::select(-id, -diagnosis) %>%
  dplyr::summarise_all(max)
## create data frames that can be used to radarchart()
# for mean metrics
mean_metrics_df = rbind(
# values for outer radarchart edges (maximum values of mean metrics)
max.agg %>% dplyr::select(ends_with("_mean")),
# values for inner radarchart edges (0)
rep(0, 10),
# values for radarchart lines (first row for benign, second row for malignant)
mean.agg %>% dplyr::select(ends_with("_mean"))
)
rownames(mean_metrics_df) = c(1, 2, as.character(mean.agg$diagnosis))
# for se metrics
se_metrics_df = rbind(
# values for outer radarchart edges (maximum values of se metrics)
max.agg %>% dplyr::select(ends_with("_se")),
# values for inner radarchart edges (0)
rep(0, 10),
# values for radarchart lines (first row for benign, second row for malignant)
mean.agg %>% dplyr::select(ends_with("_se"))
)
rownames(se_metrics_df) = c(1, 2, as.character(mean.agg$diagnosis))
# for worst metrics
worst_metrics_df = rbind(
# values for outer radarchart edges (maximum values of worst metrics)
max.agg %>% dplyr::select(ends_with("_worst")),
# values for inner radarchart edges (0)
rep(0, 10),
# values for radarchart lines (first row for benign, second row for malignant)
mean.agg %>% dplyr::select(ends_with("_worst"))
)
rownames(worst_metrics_df) = c(1, 2, as.character(mean.agg$diagnosis))
## default radar chart
radarchart(mean_metrics_df, axistype=1, title="Mean Metrics")
legend(x=1, y=1, legend = mean.agg$diagnosis, bty = "n", pch=20 , col=mean.agg$diagnosis, cex=1.2, pt.cex=3)
radarchart(se_metrics_df, axistype=1, title="SE Metrics")
legend(x=1, y=1, legend = mean.agg$diagnosis, bty = "n", pch=20 , col=mean.agg$diagnosis, cex=1.2, pt.cex=3)
radarchart(worst_metrics_df, axistype=1, title="Worst Metrics")
legend(x=1, y=1, legend = mean.agg$diagnosis, bty = "n", pch=20 , col=mean.agg$diagnosis, cex=1.2, pt.cex=3)
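The data frames passed to `radarchart()` above follow the layout that `fmsb::radarchart()` expects by default: the first row holds the maximum of each axis, the second row the minimum, and the remaining rows the series to draw. A minimal base R sketch of that layout with made-up numbers (the values and the name `toy_radar_df` are illustrative assumptions, and the plotting call itself is left out so the example needs no extra packages):

```r
# Toy per-group means (assumption: invented values, three axes)
group_means <- data.frame(radius  = c(12, 17),
                          texture = c(18, 22),
                          area    = c(460, 980),
                          row.names = c("B", "M"))

# Axis limits: outer edge (max) and inner edge (0)
axis_max <- data.frame(radius = 30, texture = 40, area = 2500)
axis_min <- data.frame(radius = 0, texture = 0, area = 0)

# Row 1 = max, row 2 = min, rows 3+ = one row per series
toy_radar_df <- rbind(axis_max, axis_min, group_means)
rownames(toy_radar_df) <- c("max", "min", rownames(group_means))
print(toy_radar_df)
```

Because each axis is scaled between its own maximum and minimum, features with very different units can share one chart.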