Overview

In this project the objective is to create a supervised learning model that predicts whether a patient's tumor is malignant or benign, based on 30 medical features. The data set contains 569 patient records, provided by the University of Wisconsin.

Data Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Packages

library(dplyr)
library(class)
library(gmodels)
library(caret)
library(MLmetrics)
library(pROC)
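
If any of these packages are missing, they can be installed from CRAN first, e.g.:

install.packages(c("dplyr", "class", "gmodels", "caret", "MLmetrics", "pROC"))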

Features

#read in data and drop the non-predictive id column
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
wbcd <- wbcd[-1]
colnames(wbcd)
##  [1] "diagnosis"         "radius_mean"       "texture_mean"     
##  [4] "perimeter_mean"    "area_mean"         "smoothness_mean"  
##  [7] "compactness_mean"  "concavity_mean"    "points_mean"      
## [10] "symmetry_mean"     "dimension_mean"    "radius_se"        
## [13] "texture_se"        "perimeter_se"      "area_se"          
## [16] "smoothness_se"     "compactness_se"    "concavity_se"     
## [19] "points_se"         "symmetry_se"       "dimension_se"     
## [22] "radius_worst"      "texture_worst"     "perimeter_worst"  
## [25] "area_worst"        "smoothness_worst"  "compactness_worst"
## [28] "concavity_worst"   "points_worst"      "symmetry_worst"   
## [31] "dimension_worst"

Distribution of Diagnoses

How many malignant vs benign?

table(wbcd$diagnosis)
## 
##   B   M 
## 357 212

Percent malignant vs benign?

round(prop.table(table(wbcd$diagnosis))*100, digits = 1)
## 
##    B    M 
## 62.7 37.3
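
Since dplyr is loaded, the same counts and percentages can be produced in a single pipeline:

#counts and percentages in one dplyr chain
wbcd %>%
  count(diagnosis) %>%
  mutate(pct = round(100 * n / sum(n), 1))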

Prepping Data

Reclassify diagnosis as a factor with descriptive labels (the classification tools below require a factor outcome)

wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

Summarize the features to understand the scale of each one

summary(wbcd)
##      diagnosis    radius_mean      texture_mean   perimeter_mean  
##  Benign   :357   Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
##  Malignant:212   1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
##                  Median :13.370   Median :18.84   Median : 86.24  
##                  Mean   :14.127   Mean   :19.29   Mean   : 91.97  
##                  3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
##                  Max.   :28.110   Max.   :39.28   Max.   :188.50  
##    area_mean      smoothness_mean   compactness_mean  concavity_mean   
##  Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.00000  
##  1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956  
##  Median : 551.1   Median :0.09587   Median :0.09263   Median :0.06154  
##  Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.08880  
##  3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070  
##  Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.42680  
##   points_mean      symmetry_mean    dimension_mean      radius_se     
##  Min.   :0.00000   Min.   :0.1060   Min.   :0.04996   Min.   :0.1115  
##  1st Qu.:0.02031   1st Qu.:0.1619   1st Qu.:0.05770   1st Qu.:0.2324  
##  Median :0.03350   Median :0.1792   Median :0.06154   Median :0.3242  
##  Mean   :0.04892   Mean   :0.1812   Mean   :0.06280   Mean   :0.4052  
##  3rd Qu.:0.07400   3rd Qu.:0.1957   3rd Qu.:0.06612   3rd Qu.:0.4789  
##  Max.   :0.20120   Max.   :0.3040   Max.   :0.09744   Max.   :2.8730  
##    texture_se      perimeter_se       area_se        smoothness_se     
##  Min.   :0.3602   Min.   : 0.757   Min.   :  6.802   Min.   :0.001713  
##  1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850   1st Qu.:0.005169  
##  Median :1.1080   Median : 2.287   Median : 24.530   Median :0.006380  
##  Mean   :1.2169   Mean   : 2.866   Mean   : 40.337   Mean   :0.007041  
##  3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190   3rd Qu.:0.008146  
##  Max.   :4.8850   Max.   :21.980   Max.   :542.200   Max.   :0.031130  
##  compactness_se      concavity_se       points_se         symmetry_se      
##  Min.   :0.002252   Min.   :0.00000   Min.   :0.000000   Min.   :0.007882  
##  1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638   1st Qu.:0.015160  
##  Median :0.020450   Median :0.02589   Median :0.010930   Median :0.018730  
##  Mean   :0.025478   Mean   :0.03189   Mean   :0.011796   Mean   :0.020542  
##  3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710   3rd Qu.:0.023480  
##  Max.   :0.135400   Max.   :0.39600   Max.   :0.052790   Max.   :0.078950  
##   dimension_se        radius_worst   texture_worst   perimeter_worst 
##  Min.   :0.0008948   Min.   : 7.93   Min.   :12.02   Min.   : 50.41  
##  1st Qu.:0.0022480   1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11  
##  Median :0.0031870   Median :14.97   Median :25.41   Median : 97.66  
##  Mean   :0.0037949   Mean   :16.27   Mean   :25.68   Mean   :107.26  
##  3rd Qu.:0.0045580   3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40  
##  Max.   :0.0298400   Max.   :36.04   Max.   :49.54   Max.   :251.20  
##    area_worst     smoothness_worst  compactness_worst concavity_worst 
##  Min.   : 185.2   Min.   :0.07117   Min.   :0.02729   Min.   :0.0000  
##  1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145  
##  Median : 686.5   Median :0.13130   Median :0.21190   Median :0.2267  
##  Mean   : 880.6   Mean   :0.13237   Mean   :0.25427   Mean   :0.2722  
##  3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829  
##  Max.   :4254.0   Max.   :0.22260   Max.   :1.05800   Max.   :1.2520  
##   points_worst     symmetry_worst   dimension_worst  
##  Min.   :0.00000   Min.   :0.1565   Min.   :0.05504  
##  1st Qu.:0.06493   1st Qu.:0.2504   1st Qu.:0.07146  
##  Median :0.09993   Median :0.2822   Median :0.08004  
##  Mean   :0.11461   Mean   :0.2901   Mean   :0.08395  
##  3rd Qu.:0.16140   3rd Qu.:0.3179   3rd Qu.:0.09208  
##  Max.   :0.29100   Max.   :0.6638   Max.   :0.20750
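
A quick range check on two of these features makes the scale disparity explicit:

#area_mean spans the hundreds while smoothness_mean stays below 1
sapply(wbcd[c("area_mean", "smoothness_mean")], range)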

Normalize Data

Because the features are on very different scales, with some values less than 1 while others exceed 100, we will need to normalize the data so that every feature lies on the same scale (from 0 to 1)

Create a min-max normalize function and test it

#rescale x to the range [0, 1]
norm <- function(x){
  return((x - min(x)) / (max(x) - min(x)))}

norm(c(1,2,3,4,5))
## [1] 0.00 0.25 0.50 0.75 1.00
norm(c(100,200,300,400,500))
## [1] 0.00 0.25 0.50 0.75 1.00

Normalize all 30 feature columns by applying the normalize function across the dataframe using lapply

wbcd_norm <- as.data.frame(lapply(wbcd[2:31], norm))
#rejoin diagnosis column after data is normalized
wbcd_norm <- cbind(wbcd_norm, wbcd[1])

Check a few features to make sure the normalization worked

summary(wbcd_norm[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2233   1st Qu.:0.1174   1st Qu.:0.3046  
##  Median :0.3024   Median :0.1729   Median :0.3904  
##  Mean   :0.3382   Mean   :0.2169   Mean   :0.3948  
##  3rd Qu.:0.4164   3rd Qu.:0.2711   3rd Qu.:0.4755  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
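
One caveat: min() and max() above are computed on the full data set before the train/test split in the next section, which leaks a small amount of test-set information into the scaling. A stricter variant, sketched here with a hypothetical helper norm_by and hypothetical unscaled splits train_raw/test_raw, would rescale each test column by the training column's min and max:

#hypothetical helper: rescale x using the min/max of a reference vector
norm_by <- function(x, ref){
  return((x - min(ref)) / (max(ref) - min(ref)))}
#e.g., after splitting the raw features one could apply:
#test_norm <- as.data.frame(Map(norm_by, test_raw[2:31], train_raw[2:31]))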

Create Train & Test Data

Using an 80/20 split for train versus test, stratified on diagnosis

set.seed(1)
train.index <- createDataPartition(wbcd$diagnosis, times = 1, p = 0.80, list = FALSE)
train <- wbcd_norm[train.index, ]
test <- wbcd_norm[-train.index, ]
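
Because createDataPartition stratifies on the outcome, the class balance should carry over to both splits; a quick check:

#class proportions in each split should mirror the full data (~63/37)
round(prop.table(table(train$diagnosis)) * 100, 1)
round(prop.table(table(test$diagnosis)) * 100, 1)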

Train and Tune KNN Model to Determine K

K-fold cross-validation is used because the data set is limited in size. To validate the model further, cross-validation is set to repeat 10 times; this yields a more reliable estimate of how the model would perform on unseen data, leading to a better selection of K.

#tuneLength = 10 evaluates 10 candidate values of k
knn_fit_cv <- train(diagnosis ~ ., data = train, method = "knn",
                    trControl = trainControl(method = "repeatedcv",
                                             repeats = 10),
                    tuneLength = 10)

View Performance of Model

plot(knn_fit_cv)

knn_fit_cv
## k-Nearest Neighbors 
## 
## 456 samples
##  30 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 410, 410, 410, 410, 411, 411, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9620966  0.9180562
##    7  0.9649614  0.9239214
##    9  0.9649469  0.9237819
##   11  0.9649372  0.9237508
##   13  0.9618551  0.9168279
##   15  0.9594348  0.9114049
##   17  0.9558986  0.9036386
##   19  0.9548164  0.9012128
##   21  0.9554783  0.9027573
##   23  0.9554734  0.9028074
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
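
The selected k can also be read programmatically from the fitted object rather than off the printout:

#tuning parameter chosen by resampling (k = 7 here)
knn_fit_cv$bestTune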

Model Performance on Test Data

Apply model to test data for prediction

knn_predict <- predict(knn_fit_cv, newdata = test)

View the confusion matrix to assess performance. In cancer prediction, false negatives (missed cancers) are the most important errors to minimize. False positives certainly carry serious consequences as well, but undetected cancer can be fatal. Let’s break accuracy out by error type below.

Accuracy

confusionMatrix(knn_predict, test$diagnosis)$overall["Accuracy"] 
##  Accuracy 
## 0.9911504

Share of “Benign” predictions that were actually malignant (1 - PPV; since “Benign” is the positive class, this is a false discovery rate, not a true false positive rate)

1 - confusionMatrix(knn_predict, test$diagnosis)$byClass["Pos Pred Value"] 
## Pos Pred Value 
##     0.01388889

Share of “Malignant” predictions that were actually benign (1 - NPV, the false omission rate; here it is zero)

1 - confusionMatrix(knn_predict, test$diagnosis)$byClass["Neg Pred Value"] 
## Neg Pred Value 
##              0
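
Both rates can be verified by hand from the confusion matrix table (Prediction in rows, Reference in columns):

cm <- confusionMatrix(knn_predict, test$diagnosis)$table
#share of "Benign" predictions that were actually Malignant: 1/72
cm["Benign", "Malignant"] / sum(cm["Benign", ])
#share of "Malignant" predictions that were actually Benign: 0/41
cm["Malignant", "Benign"] / sum(cm["Malignant", ])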

Precision & Recall

confusionMatrix(knn_predict, test$diagnosis, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Benign Malignant
##   Benign        71         1
##   Malignant      0        41
##                                           
##                Accuracy : 0.9912          
##                  95% CI : (0.9517, 0.9998)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.981           
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##               Precision : 0.9861          
##                  Recall : 1.0000          
##                      F1 : 0.9930          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6283          
##    Detection Prevalence : 0.6372          
##       Balanced Accuracy : 0.9881          
##                                           
##        'Positive' Class : Benign          
## 
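
Since pROC is loaded but otherwise unused, the model’s class probabilities can also feed an ROC curve; a minimal sketch (caret’s knn method supports class probabilities via type = "prob"):

knn_prob <- predict(knn_fit_cv, newdata = test, type = "prob")
#ROC/AUC using the predicted probability of the Malignant class
roc_knn <- roc(response = test$diagnosis, predictor = knn_prob$Malignant,
               levels = c("Benign", "Malignant"))
auc(roc_knn)
plot(roc_knn)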

Summary

Model accuracy on the test data was 99%: every case the model flagged as malignant was in fact malignant, and only one of the 42 malignant cases was misclassified as benign. However, performance on a single held-out test set does not guarantee the accuracy would consistently be this high; more unseen data would be needed to measure the variance in the model’s accuracy. That said, we can expect this model, with a K of 7, to be very strong at predicting both malignant and benign cases.