Introduction
Step 1: First, let’s load the data
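library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe
library(Amelia)      # missmap()
library(caret)       # createDataPartition(), train(), confusionMatrix(); method 'nb' also needs klaR installed
data<-read.csv(file=file.choose(), header=TRUE, stringsAsFactors=FALSE,strip.white=TRUE, sep=",")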
data$ID<-NULL
kable(head(data, n = 10), format = 'html') %>%
kable_styling(bootstrap_options = c('striped', 'hover'))
| diagnosis | radius_mean | radius_sd_error | radius_worst | texture_mean | texture_sd_error | texture_worst | perimeter_mean | perimeter_sd_error | perimeter_worst | area_mean | area_sd_error | area_worst | smoothness_mean | smoothness_sd_error | smoothness_worst | compactness_mean | compactness_sd_error | compactness_worst | concavity_mean | concavity_sd_error | concavity_worst | concave_points_mean | concave_points_sd_error | concave_points_worst | symmetry_mean | symmetry_sd_error | symmetry_worst | fractal_dimension_mean | fractal_dimension_sd_error | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.15780 | 0.08089 | 0.2087 | 0.07613 | 0.3345 | 0.8902 | 2.217 | 27.19 | 0.007510 | 0.03345 | 0.03672 | 0.01137 | 0.02165 | 0.005082 | 15.47 | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 |
| M | 18.25 | 19.98 | 119.60 | 1040.0 | 0.09463 | 0.10900 | 0.11270 | 0.07400 | 0.1794 | 0.05742 | 0.4467 | 0.7732 | 3.180 | 53.91 | 0.004314 | 0.01382 | 0.02254 | 0.01039 | 0.01369 | 0.002179 | 22.88 | 27.66 | 153.20 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 |
| M | 13.71 | 20.83 | 90.20 | 577.9 | 0.11890 | 0.16450 | 0.09366 | 0.05985 | 0.2196 | 0.07451 | 0.5835 | 1.3770 | 3.856 | 50.96 | 0.008805 | 0.03029 | 0.02488 | 0.01448 | 0.01486 | 0.005412 | 17.06 | 28.14 | 110.60 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.11510 |
| M | 13.00 | 21.82 | 87.50 | 519.8 | 0.12730 | 0.19320 | 0.18590 | 0.09353 | 0.2350 | 0.07389 | 0.3063 | 1.0020 | 2.406 | 24.32 | 0.005731 | 0.03502 | 0.03553 | 0.01226 | 0.02143 | 0.003749 | 15.49 | 30.73 | 106.20 | 739.3 | 0.1703 | 0.5401 | 0.5390 | 0.2060 | 0.4378 | 0.10720 |
| M | 12.46 | 24.04 | 83.97 | 475.9 | 0.11860 | 0.23960 | 0.22730 | 0.08543 | 0.2030 | 0.08243 | 0.2976 | 1.5990 | 2.039 | 23.94 | 0.007149 | 0.07217 | 0.07743 | 0.01432 | 0.01789 | 0.010080 | 15.09 | 40.68 | 97.65 | 711.4 | 0.1853 | 1.0580 | 1.1050 | 0.2210 | 0.4366 | 0.20750 |
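file.choose() opens an interactive dialog, which makes the script awkward to rerun unattended. As a minimal sketch, the same read.csv() arguments can be pointed at a fixed path instead. The file below is a hypothetical three-row stand-in written to a temporary location, not the real data set, and its column names are assumptions chosen for illustration.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical data, for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,diagnosis,radius_mean",
             "101, M ,17.99",
             "102, B ,11.42",
             "103, M ,20.57"), csv_path)

# Same arguments as in Step 1, but with an explicit path instead of file.choose()
toy <- read.csv(file = csv_path, header = TRUE, stringsAsFactors = FALSE,
                strip.white = TRUE, sep = ",")

# Drop the identifier column, which carries no predictive information
toy$ID <- NULL

str(toy)
```

Here strip.white = TRUE removes the stray spaces around the diagnosis codes, so they are read as plain "M" and "B".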
Step 2: Data cleaning: convert ‘0’ values into NA (optional; the line below is commented out, so it is not applied in this run)
#data[, 2:31][data[, 2:31] == 0] <- NA
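To see what the commented-out line would do, here is a small sketch of the same indexing pattern on a hypothetical toy data frame (the column names and values are made up for illustration). Whether a zero is a genuine measurement or a placeholder for a missing value depends on the data set, which may be why the step is left disabled here.

```r
# Hypothetical toy data frame: column 1 is the label, columns 2:3 are numeric features
toy <- data.frame(diagnosis = c("M", "B", "B"),
                  concavity = c(0.30, 0.00, 0.09),
                  symmetry  = c(0.24, 0.18, 0.00),
                  stringsAsFactors = FALSE)

# Replace every exact zero in the numeric columns with NA;
# the label column is left untouched because it is outside 2:3
toy[, 2:3][toy[, 2:3] == 0] <- NA

print(toy)
```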
Step 3: Visualize the missing data (if any)
missmap(data)
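missmap() comes from the Amelia package and draws the missingness pattern as a plot. A numeric check can complement the picture. The sketch below uses only base R on a hypothetical toy data frame, so the names and values are assumptions rather than part of the real data.

```r
# Hypothetical toy data with two missing entries
toy <- data.frame(radius  = c(17.99, NA, 19.69, 11.42),
                  texture = c(10.38, 17.77, NA, 20.38),
                  area    = c(1001.0, 1326.0, 1203.0, 386.1))

na_per_column <- colSums(is.na(toy))       # missing values in each column
n_complete    <- sum(complete.cases(toy))  # rows without any NA

print(na_per_column)
print(n_complete)
```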
Step 4: Summary of the data set
summary(data)
## diagnosis radius_mean radius_sd_error radius_worst
## Length:569 Min. : 6.981 Min. : 9.71 Min. : 43.79
## Class :character 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Mode :character Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## texture_mean texture_sd_error texture_worst perimeter_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## perimeter_sd_error perimeter_worst area_mean area_sd_error
## Min. :0.00000 Min. :0.1060 Min. :0.04996 Min. :0.1115
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770 1st Qu.:0.2324
## Median :0.03350 Median :0.1792 Median :0.06154 Median :0.3242
## Mean :0.04892 Mean :0.1812 Mean :0.06280 Mean :0.4052
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612 3rd Qu.:0.4789
## Max. :0.20120 Max. :0.3040 Max. :0.09744 Max. :2.8730
## area_worst smoothness_mean smoothness_sd_error smoothness_worst
## Min. :0.3602 Min. : 0.757 Min. : 6.802 Min. :0.001713
## 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850 1st Qu.:0.005169
## Median :1.1080 Median : 2.287 Median : 24.530 Median :0.006380
## Mean :1.2169 Mean : 2.866 Mean : 40.337 Mean :0.007041
## 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190 3rd Qu.:0.008146
## Max. :4.8850 Max. :21.980 Max. :542.200 Max. :0.031130
## compactness_mean compactness_sd_error compactness_worst concavity_mean
## Min. :0.002252 Min. :0.00000 Min. :0.000000 Min. :0.007882
## 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638 1st Qu.:0.015160
## Median :0.020450 Median :0.02589 Median :0.010930 Median :0.018730
## Mean :0.025478 Mean :0.03189 Mean :0.011796 Mean :0.020542
## 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710 3rd Qu.:0.023480
## Max. :0.135400 Max. :0.39600 Max. :0.052790 Max. :0.078950
## concavity_sd_error concavity_worst concave_points_mean
## Min. :0.0008948 Min. : 7.93 Min. :12.02
## 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08
## Median :0.0031870 Median :14.97 Median :25.41
## Mean :0.0037949 Mean :16.27 Mean :25.68
## 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72
## Max. :0.0298400 Max. :36.04 Max. :49.54
## concave_points_sd_error concave_points_worst symmetry_mean
## Min. : 50.41 Min. : 185.2 Min. :0.07117
## 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660
## Median : 97.66 Median : 686.5 Median :0.13130
## Mean :107.26 Mean : 880.6 Mean :0.13237
## 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600
## Max. :251.20 Max. :4254.0 Max. :0.22260
## symmetry_sd_error symmetry_worst fractal_dimension_mean
## Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :1.05800 Max. :1.2520 Max. :0.29100
## fractal_dimension_sd_error fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
Step 5: Because each variable is measured in different units, it is sometimes necessary to “normalize” the numerical data. We can easily accomplish this using R’s scale() function, which is applied in Step 7.
If we want to detect whether a patient may have breast cancer, it is important to inspect the “diagnosis” variable first. In this variable, B stands for benign (i.e. no breast cancer) and M stands for malignant (i.e. breast cancer). The table below shows the proportion of each class.
prop.table(table(data$diagnosis))
##
## B M
## 0.6274165 0.3725835
Step 6: Many R machine learning classification algorithms require the target feature to be coded as a factor, so we need to convert the diagnosis variable from a character vector to a factor.
data$diagnosis <- factor(data$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
str(data$diagnosis)
## Factor w/ 2 levels "Benign","Malignant": 2 2 2 2 2 2 2 2 2 2 ...
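The str() output lists integer codes rather than names. The following sketch, on a hypothetical vector of codes, illustrates how levels and labels map onto those integers.

```r
# Hypothetical character vector of diagnosis codes
codes <- c("M", "B", "B", "M", "B")

# levels fixes the order of the categories; labels gives them readable names
diagnosis <- factor(codes, levels = c("B", "M"), labels = c("Benign", "Malignant"))

levels(diagnosis)      # category names, in the order given by levels
as.integer(diagnosis)  # underlying integer codes: 1 = Benign, 2 = Malignant
table(diagnosis)       # counts per category
```

Because "B" is listed first in levels, Benign is stored as 1 and Malignant as 2, which is why the first rows of the real data print as 2.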
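Step 7: Because the numerical variables are measured in different units, we “normalize” each relevant numerical variable to put them all on a common scale. We can easily achieve this using R’s scale() function, which converts each numerical value to its Z-score equivalent (the value minus the column mean, divided by the column standard deviation). Note that the column range 2:30 used below covers 29 of the 30 numerical variables; fractal_dimension_worst (column 31) is not included.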
data.norm <- scale(data[c(2:30)])
summary(data.norm)
## radius_mean radius_sd_error radius_worst texture_mean
## Min. :-2.0279 Min. :-2.2273 Min. :-1.9828 Min. :-1.4532
## 1st Qu.:-0.6888 1st Qu.:-0.7253 1st Qu.:-0.6913 1st Qu.:-0.6666
## Median :-0.2149 Median :-0.1045 Median :-0.2358 Median :-0.2949
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4690 3rd Qu.: 0.5837 3rd Qu.: 0.4992 3rd Qu.: 0.3632
## Max. : 3.9678 Max. : 4.6478 Max. : 3.9726 Max. : 5.2459
## texture_sd_error texture_worst perimeter_mean perimeter_sd_error
## Min. :-3.10935 Min. :-1.6087 Min. :-1.1139 Min. :-1.2607
## 1st Qu.:-0.71034 1st Qu.:-0.7464 1st Qu.:-0.7431 1st Qu.:-0.7373
## Median :-0.03486 Median :-0.2217 Median :-0.3419 Median :-0.3974
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.63564 3rd Qu.: 0.4934 3rd Qu.: 0.5256 3rd Qu.: 0.6464
## Max. : 4.76672 Max. : 4.5644 Max. : 4.2399 Max. : 3.9245
## perimeter_worst area_mean area_sd_error area_worst
## Min. :-2.74171 Min. :-1.8183 Min. :-1.0590 Min. :-1.5529
## 1st Qu.:-0.70262 1st Qu.:-0.7220 1st Qu.:-0.6230 1st Qu.:-0.6942
## Median :-0.07156 Median :-0.1781 Median :-0.2920 Median :-0.1973
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.53031 3rd Qu.: 0.4706 3rd Qu.: 0.2659 3rd Qu.: 0.4661
## Max. : 4.48081 Max. : 4.9066 Max. : 8.8991 Max. : 6.6494
## smoothness_mean smoothness_sd_error smoothness_worst compactness_mean
## Min. :-1.0431 Min. :-0.7372 Min. :-1.7745 Min. :-1.2970
## 1st Qu.:-0.6232 1st Qu.:-0.4943 1st Qu.:-0.6235 1st Qu.:-0.6923
## Median :-0.2864 Median :-0.3475 Median :-0.2201 Median :-0.2808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2428 3rd Qu.: 0.1067 3rd Qu.: 0.3680 3rd Qu.: 0.3893
## Max. : 9.4537 Max. :11.0321 Max. : 8.0229 Max. : 6.1381
## compactness_sd_error compactness_worst concavity_mean concavity_sd_error
## Min. :-1.0566 Min. :-1.9118 Min. :-1.5315 Min. :-1.0960
## 1st Qu.:-0.5567 1st Qu.:-0.6739 1st Qu.:-0.6511 1st Qu.:-0.5846
## Median :-0.1989 Median :-0.1404 Median :-0.2192 Median :-0.2297
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3365 3rd Qu.: 0.4722 3rd Qu.: 0.3554 3rd Qu.: 0.2884
## Max. :12.0621 Max. : 6.6438 Max. : 7.0657 Max. : 9.8429
## concavity_worst concave_points_mean concave_points_sd_error
## Min. :-1.7254 Min. :-2.22204 Min. :-1.6919
## 1st Qu.:-0.6743 1st Qu.:-0.74797 1st Qu.:-0.6890
## Median :-0.2688 Median :-0.04348 Median :-0.2857
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.5216 3rd Qu.: 0.65776 3rd Qu.: 0.5398
## Max. : 4.0906 Max. : 3.88249 Max. : 4.2836
## concave_points_worst symmetry_mean symmetry_sd_error symmetry_worst
## Min. :-1.2213 Min. :-2.6803 Min. :-1.4426 Min. :-1.3047
## 1st Qu.:-0.6416 1st Qu.:-0.6906 1st Qu.:-0.6805 1st Qu.:-0.7558
## Median :-0.3409 Median :-0.0468 Median :-0.2693 Median :-0.2180
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3573 3rd Qu.: 0.5970 3rd Qu.: 0.5392 3rd Qu.: 0.5307
## Max. : 5.9250 Max. : 3.9519 Max. : 5.1084 Max. : 4.6965
## fractal_dimension_mean fractal_dimension_sd_error
## Min. :-1.7435 Min. :-2.1591
## 1st Qu.:-0.7557 1st Qu.:-0.6413
## Median :-0.2233 Median :-0.1273
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7119 3rd Qu.: 0.4497
## Max. : 2.6835 Max. : 6.0407
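The zero means in the summary follow directly from the Z-score formula. As a small sketch on hypothetical toy values (the column names are assumptions), the result of scale() can be compared with the formula written out by hand:

```r
# Hypothetical features on very different scales
toy <- data.frame(radius = c(17.99, 20.57, 19.69, 11.42, 20.29),
                  area   = c(1001.0, 1326.0, 1203.0, 386.1, 1297.0))

# scale() centers each column on its mean and divides by its standard deviation
scaled <- scale(toy)

# The same Z-scores computed by hand for one column
manual_radius <- (toy$radius - mean(toy$radius)) / sd(toy$radius)

all.equal(as.numeric(scaled[, "radius"]), manual_radius)  # same values
round(colMeans(scaled), 10)                               # column means are 0
apply(scaled, 2, sd)                                      # column standard deviations are 1
```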
Dividing Our Data Into Training and Test Sets
Step 8: Before we can fit our model, we first have to divide our data into training and test sets. This allows us to assess the accuracy of our model on data it has not seen during training.
indxTrain <- createDataPartition(y = data$diagnosis,p = 0.75,list = FALSE)
training <- data[indxTrain,]
testing <- data[-indxTrain,]
Step 9: Check the class proportions in the full, training, and testing sets
prop.table(table(data$diagnosis)) * 100
##
## Benign Malignant
## 62.74165 37.25835
prop.table(table(training$diagnosis)) * 100
##
## Benign Malignant
## 62.76347 37.23653
prop.table(table(testing$diagnosis)) * 100
##
## Benign Malignant
## 62.67606 37.32394
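The proportions stay almost identical across the three sets because createDataPartition() samples within each level of a factor outcome. The sketch below reproduces that idea in base R on hypothetical labels. It is an illustration of stratified sampling, not caret's actual implementation, and the seed and class counts are assumptions.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Hypothetical labels with roughly the class balance reported above
labels <- factor(rep(c("Benign", "Malignant"), times = c(63, 37)))

# Sample 75% of the row indices separately within each class
idx_by_class <- split(seq_along(labels), labels)
train_idx <- sort(unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.75 * length(i)))
}), use.names = FALSE))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

prop.table(table(train_labels)) * 100
prop.table(table(test_labels)) * 100
```

Setting a seed before the split is also worth considering in the main script, since without one the training and testing rows change on every run.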
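Step 10: To compare the outcomes of the training and testing phases, let’s create separate variables for the predictors and for the response variable. The diagnosis column is the first column of training, so that is the column dropped from the predictors:
x = training[, -1]      # predictors: every column except diagnosis (column 1)
y = training$diagnosis  # response variable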
Step 11: Create a Naive Bayes model using the training data set:
model = train(x,y,'nb',trControl=trainControl(method='cv',number=30))
model
## Naive Bayes
##
## 427 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## No pre-processing
## Resampling: Cross-Validated (30 fold)
## Summary of sample sizes: 413, 413, 412, 413, 412, 412, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9722222 0.9393870
## TRUE 0.9744444 0.9450779
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
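To make the model less of a black box, here is a minimal sketch of a Gaussian Naive Bayes classifier written in base R. It corresponds in spirit to the usekernel = FALSE row, where each predictor is modeled with a normal density per class. It is not the code that caret or klaR runs, the helper predict_gnb is a hypothetical name, and the built-in iris data stands in for the tumor data.

```r
# Two-class subset of the built-in iris data, standing in for the tumor data
d <- droplevels(iris[iris$Species != "setosa", ])
X <- as.matrix(d[, 1:4])
y <- d$Species

# "Training": class priors plus the per-class mean and sd of every predictor
classes <- levels(y)
priors  <- as.numeric(table(y)) / length(y)
mu      <- sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE]))
sigma   <- sapply(classes, function(k) apply(X[y == k, , drop = FALSE], 2, sd))

# Prediction: log prior + sum of log normal densities, assuming the
# predictors are independent within each class (the "naive" assumption).
# Hypothetical helper; expects a matrix with more than one row.
predict_gnb <- function(newX) {
  scores <- sapply(seq_along(classes), function(j) {
    log_dens <- sapply(seq_len(ncol(newX)), function(p) {
      dnorm(newX[, p], mean = mu[p, j], sd = sigma[p, j], log = TRUE)
    })
    log(priors[j]) + rowSums(log_dens)
  })
  # Pick the class with the highest score for each row
  factor(classes[max.col(scores, ties.method = "first")], levels = classes)
}

pred <- predict_gnb(X)
mean(pred == y)                          # accuracy on the data used for fitting
table(Prediction = pred, Reference = y)  # confusion table
```

Working with log densities avoids numerical underflow when many small probabilities are multiplied together.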
Step 12: Model Evaluation
To check the performance of the model, we now run it on the testing data set and then evaluate its predictions with a confusion matrix.
Predict the testing set:
Predict <- predict(model,newdata = testing )
Step 13: Get the confusion matrix to see the accuracy and the other evaluation metrics.
The Accuracy line gives the overall share of correct predictions. Balanced Accuracy, further down, is the average of Sensitivity and Specificity.
confusionMatrix(Predict, testing$diagnosis )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 87 4
## Malignant 2 49
##
## Accuracy : 0.9577
## 95% CI : (0.9103, 0.9843)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.909
##
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9775
## Specificity : 0.9245
## Pos Pred Value : 0.9560
## Neg Pred Value : 0.9608
## Prevalence : 0.6268
## Detection Rate : 0.6127
## Detection Prevalence : 0.6408
## Balanced Accuracy : 0.9510
##
## 'Positive' Class : Benign
##
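The headline statistics can be recomputed by hand from the four cell counts. The sketch below copies the counts from the confusion matrix above and applies the standard definitions; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(87, 2, 4, 49), nrow = 2,
             dimnames = list(Prediction = c("Benign", "Malignant"),
                             Reference  = c("Benign", "Malignant")))

# "Benign" is the positive class in this output
tp <- cm["Benign", "Benign"]        # predicted Benign, actually Benign
fn <- cm["Malignant", "Benign"]     # predicted Malignant, actually Benign
fp <- cm["Benign", "Malignant"]     # predicted Benign, actually Malignant
tn <- cm["Malignant", "Malignant"]  # predicted Malignant, actually Malignant

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, BalancedAccuracy = balanced), 4)
```

Because Benign is the positive class here, the 4 malignant tumors predicted as Benign lower the Specificity rather than the Sensitivity. In a screening context those are the costlier errors, so Specificity deserves particular attention in this output.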