In this project, the objective is to build a supervised learning model that predicts whether a breast mass is malignant or benign based on 30 medical features. The data set contains 569 patient records and was originally provided by the University of Wisconsin.
Data Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
library(dplyr)      # data manipulation
library(class)      # k-nearest neighbors
library(gmodels)    # CrossTable and related utilities
library(caret)      # data partitioning, cross-validation, model training, confusionMatrix
library(MLmetrics)  # additional classification metrics
library(pROC)       # ROC curves and AUC
# read in the data
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
# drop the first column (patient id), which has no predictive value
wbcd <- wbcd[-1]
colnames(wbcd)
## [1] "diagnosis" "radius_mean" "texture_mean"
## [4] "perimeter_mean" "area_mean" "smoothness_mean"
## [7] "compactness_mean" "concavity_mean" "points_mean"
## [10] "symmetry_mean" "dimension_mean" "radius_se"
## [13] "texture_se" "perimeter_se" "area_se"
## [16] "smoothness_se" "compactness_se" "concavity_se"
## [19] "points_se" "symmetry_se" "dimension_se"
## [22] "radius_worst" "texture_worst" "perimeter_worst"
## [25] "area_worst" "smoothness_worst" "compactness_worst"
## [28] "concavity_worst" "points_worst" "symmetry_worst"
## [31] "dimension_worst"
How many malignant vs benign?
table(wbcd$diagnosis)
##
## B M
## 357 212
Percent malignant vs benign?
round(prop.table(table(wbcd$diagnosis))*100, digits = 1)
##
## B M
## 62.7 37.3
Reclassify diagnosis as a factor and add descriptive labels (caret and most R classifiers expect a factor outcome)
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
labels = c("Benign", "Malignant"))
Summarize the features to understand the scale of each feature
summary(wbcd)
## diagnosis radius_mean texture_mean perimeter_mean
## Benign :357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## Malignant:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## points_mean symmetry_mean dimension_mean radius_se
## Min. :0.00000 Min. :0.1060 Min. :0.04996 Min. :0.1115
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770 1st Qu.:0.2324
## Median :0.03350 Median :0.1792 Median :0.06154 Median :0.3242
## Mean :0.04892 Mean :0.1812 Mean :0.06280 Mean :0.4052
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612 3rd Qu.:0.4789
## Max. :0.20120 Max. :0.3040 Max. :0.09744 Max. :2.8730
## texture_se perimeter_se area_se smoothness_se
## Min. :0.3602 Min. : 0.757 Min. : 6.802 Min. :0.001713
## 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850 1st Qu.:0.005169
## Median :1.1080 Median : 2.287 Median : 24.530 Median :0.006380
## Mean :1.2169 Mean : 2.866 Mean : 40.337 Mean :0.007041
## 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190 3rd Qu.:0.008146
## Max. :4.8850 Max. :21.980 Max. :542.200 Max. :0.031130
## compactness_se concavity_se points_se symmetry_se
## Min. :0.002252 Min. :0.00000 Min. :0.000000 Min. :0.007882
## 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638 1st Qu.:0.015160
## Median :0.020450 Median :0.02589 Median :0.010930 Median :0.018730
## Mean :0.025478 Mean :0.03189 Mean :0.011796 Mean :0.020542
## 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710 3rd Qu.:0.023480
## Max. :0.135400 Max. :0.39600 Max. :0.052790 Max. :0.078950
## dimension_se radius_worst texture_worst perimeter_worst
## Min. :0.0008948 Min. : 7.93 Min. :12.02 Min. : 50.41
## 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11
## Median :0.0031870 Median :14.97 Median :25.41 Median : 97.66
## Mean :0.0037949 Mean :16.27 Mean :25.68 Mean :107.26
## 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40
## Max. :0.0298400 Max. :36.04 Max. :49.54 Max. :251.20
## area_worst smoothness_worst compactness_worst concavity_worst
## Min. : 185.2 Min. :0.07117 Min. :0.02729 Min. :0.0000
## 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145
## Median : 686.5 Median :0.13130 Median :0.21190 Median :0.2267
## Mean : 880.6 Mean :0.13237 Mean :0.25427 Mean :0.2722
## 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829
## Max. :4254.0 Max. :0.22260 Max. :1.05800 Max. :1.2520
## points_worst symmetry_worst dimension_worst
## Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.29100 Max. :0.6638 Max. :0.20750
Because the features are on very different scales, with some values less than 1 and others greater than 100, we need to normalize the data so that every feature lies on the same scale (from 0 to 1).
Create a normalize function and test it
# min-max scaling: rescales a numeric vector to the range [0, 1]
norm <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
norm(c(1,2,3,4,5))
## [1] 0.00 0.25 0.50 0.75 1.00
norm(c(100,200,300,400,500))
## [1] 0.00 0.25 0.50 0.75 1.00
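The same min-max scaling can also be done with caret's preProcess (already loaded above); a minimal sketch, shown only as an alternative and assuming the default rescaling range of 0 to 1 (range_prep and wbcd_norm_alt are placeholder names):
# caret alternative: method = "range" rescales every column to [0, 1],
# matching the norm() function above
range_prep <- preProcess(wbcd[2:31], method = "range")
wbcd_norm_alt <- predict(range_prep, wbcd[2:31])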
Normalize the features by applying the normalize function to the 30 feature columns using lapply
wbcd_norm <- as.data.frame(lapply(wbcd[2:31], norm))
# rejoin the diagnosis column after the features are normalized
wbcd_norm <- cbind(wbcd_norm, wbcd[1])
Check a few features to make sure the normalization worked
summary(wbcd_norm[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2233 1st Qu.:0.1174 1st Qu.:0.3046
## Median :0.3024 Median :0.1729 Median :0.3904
## Mean :0.3382 Mean :0.2169 Mean :0.3948
## 3rd Qu.:0.4164 3rd Qu.:0.2711 3rd Qu.:0.4755
## Max. :1.0000 Max. :1.0000 Max. :1.0000
Use an 80/20 split for training versus test data
set.seed(1)
train.index <- createDataPartition(wbcd$diagnosis, times = 1, p = 0.80, list = FALSE)
train <- wbcd_norm[train.index,]
test <- wbcd_norm[-train.index,]
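createDataPartition samples within each level of the outcome, so the 62.7/37.3 class balance seen earlier should be roughly preserved in both splits; a quick sketch to verify, using only the objects created above:
# class proportions in the training and test splits should be close to 62.7/37.3
round(prop.table(table(train$diagnosis)) * 100, digits = 1)
round(prop.table(table(test$diagnosis)) * 100, digits = 1)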
Use k-fold cross-validation because the data set is limited in size. To validate the model further, cross-validation is set to repeat 10 times; this gives a more reliable estimate of how the model would perform on unseen data, leading to a better selection of k.
knn_fit_cv <- train(diagnosis~., data = train, method = "knn",
trControl = trainControl(method = "repeatedcv",
repeats = 10),
tuneLength = 10
)
plot(knn_fit_cv)
knn_fit_cv
## k-Nearest Neighbors
##
## 456 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 410, 410, 410, 410, 411, 411, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9620966 0.9180562
## 7 0.9649614 0.9239214
## 9 0.9649469 0.9237819
## 11 0.9649372 0.9237508
## 13 0.9618551 0.9168279
## 15 0.9594348 0.9114049
## 17 0.9558986 0.9036386
## 19 0.9548164 0.9012128
## 21 0.9554783 0.9027573
## 23 0.9554734 0.9028074
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
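Because repeated cross-validation produces 100 held-out accuracy estimates (10 folds x 10 repeats) for the selected k, the spread of those estimates hints at how stable the model is; a short sketch, assuming the fitted train object keeps its resample table (caret's default):
# distribution of held-out accuracy across the 10 x 10 resamples for the final k
summary(knn_fit_cv$resample$Accuracy)
sd(knn_fit_cv$resample$Accuracy)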
Apply the model to the test data for prediction
knn_predict <- predict(knn_fit_cv, newdata = test)
View the confusion matrix to evaluate performance. In cancer prediction, false negatives are the most important errors to minimize; false positives certainly carry serious consequences as well, but undetected cancer can be fatal. Let’s look at accuracy broken out by the rates of incorrect benign and malignant predictions.
Accuracy
confusionMatrix(knn_predict, test$diagnosis)$overall["Accuracy"]
## Accuracy
## 0.9911504
Rate of malignant cases among predicted-benign results (1 - positive predictive value; the positive class here is Benign)
1 - confusionMatrix(knn_predict, test$diagnosis)$byClass["Pos Pred Value"]
## Pos Pred Value
## 0.01388889
Rate of benign cases among predicted-malignant results (1 - negative predictive value)
1 - confusionMatrix(knn_predict, test$diagnosis)$byClass["Neg Pred Value"]
## Neg Pred Value
## 0
Precision & Recall
confusionMatrix(knn_predict, test$diagnosis, mode = "prec_recall")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 71 1
## Malignant 0 41
##
## Accuracy : 0.9912
## 95% CI : (0.9517, 0.9998)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.981
##
## Mcnemar's Test P-Value : 1
##
## Precision : 0.9861
## Recall : 1.0000
## F1 : 0.9930
## Prevalence : 0.6283
## Detection Rate : 0.6283
## Detection Prevalence : 0.6372
## Balanced Accuracy : 0.9881
##
## 'Positive' Class : Benign
##
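Note that caret treats Benign as the positive class by default (the first factor level), so the metrics above are oriented around benign predictions. Since missed malignancies are the errors we most want to minimize, here is a short sketch re-computing the key rates with Malignant as the positive class (cm_mal is a placeholder name):
# with Malignant as the positive class, Sensitivity is the share of malignant
# cases detected, and 1 - Sensitivity is the false negative rate of interest
cm_mal <- confusionMatrix(knn_predict, test$diagnosis, positive = "Malignant")
cm_mal$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")]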
Model accuracy on the test data was 99%: every case the model labeled malignant was in fact malignant, and only one of the 42 malignant cases was misclassified as benign. However, performance on a single test set does not guarantee that performance would consistently be this high; more unseen data would be needed to assess the variance in the model’s accuracy. That said, with k = 7 we can expect this model to be very strong at predicting both malignant and benign cases.
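As a possible next step, pROC (loaded above but not yet used) can summarize threshold-independent performance; a rough sketch, assuming the underlying knn model returns class probabilities via type = "prob" (knn_prob and roc_knn are placeholder names):
# predicted probability of the Malignant class on the test set
knn_prob <- predict(knn_fit_cv, newdata = test, type = "prob")
# ROC curve and AUC; levels and direction are given explicitly to avoid ambiguity
roc_knn <- roc(response = test$diagnosis,
               predictor = knn_prob[, "Malignant"],
               levels = c("Benign", "Malignant"), direction = "<")
plot(roc_knn)
auc(roc_knn)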