In this tutorial, we will go over a machine learning technique called the k-NN (k-nearest neighbors) algorithm, a supervised learning algorithm used for classification.
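To make the idea concrete before touching the real data: k-NN classifies a new observation by finding the k closest training observations (typically by Euclidean distance) and taking a majority vote of their labels. Below is a minimal, self-contained sketch of that idea for a single new observation; the toy data and the helper function knn_predict_one are purely illustrative, and later we will rely on the ready-made implementation in the class package instead.
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums((train_x - matrix(new_x, nrow = nrow(train_x), ncol = ncol(train_x), byrow = TRUE))^2))
  # labels of the k nearest neighbours, then a simple majority vote
  nearest <- train_y[order(dists)[1:k]]
  names(which.max(table(nearest)))
}

set.seed(1)
train_x <- matrix(rnorm(20), ncol = 2)   # 10 toy observations, 2 features
train_y <- rep(c("A", "B"), each = 5)    # 10 toy class labels
knn_predict_one(train_x, train_y, new_x = c(0.5, -0.2), k = 3)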
Load the data.
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe

wbcd <- read.csv('/Users/czar.yobero/Desktop/wbcd.csv', stringsAsFactors = F)
kable(head(wbcd, n = 10), format = 'html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave.points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
843786 | M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.15780 | 0.08089 | 0.2087 | 0.07613 | 0.3345 | 0.8902 | 2.217 | 27.19 | 0.007510 | 0.03345 | 0.03672 | 0.01137 | 0.02165 | 0.005082 | 15.47 | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 |
844359 | M | 18.25 | 19.98 | 119.60 | 1040.0 | 0.09463 | 0.10900 | 0.11270 | 0.07400 | 0.1794 | 0.05742 | 0.4467 | 0.7732 | 3.180 | 53.91 | 0.004314 | 0.01382 | 0.02254 | 0.01039 | 0.01369 | 0.002179 | 22.88 | 27.66 | 153.20 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 |
84458202 | M | 13.71 | 20.83 | 90.20 | 577.9 | 0.11890 | 0.16450 | 0.09366 | 0.05985 | 0.2196 | 0.07451 | 0.5835 | 1.3770 | 3.856 | 50.96 | 0.008805 | 0.03029 | 0.02488 | 0.01448 | 0.01486 | 0.005412 | 17.06 | 28.14 | 110.60 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.11510 |
844981 | M | 13.00 | 21.82 | 87.50 | 519.8 | 0.12730 | 0.19320 | 0.18590 | 0.09353 | 0.2350 | 0.07389 | 0.3063 | 1.0020 | 2.406 | 24.32 | 0.005731 | 0.03502 | 0.03553 | 0.01226 | 0.02143 | 0.003749 | 15.49 | 30.73 | 106.20 | 739.3 | 0.1703 | 0.5401 | 0.5390 | 0.2060 | 0.4378 | 0.10720 |
84501001 | M | 12.46 | 24.04 | 83.97 | 475.9 | 0.11860 | 0.23960 | 0.22730 | 0.08543 | 0.2030 | 0.08243 | 0.2976 | 1.5990 | 2.039 | 23.94 | 0.007149 | 0.07217 | 0.07743 | 0.01432 | 0.01789 | 0.010080 | 15.09 | 40.68 | 97.65 | 711.4 | 0.1853 | 1.0580 | 1.1050 | 0.2210 | 0.4366 | 0.20750 |
A summary of the data set.
summary(wbcd)
## id diagnosis radius_mean texture_mean
## Min. : 8670 Length:569 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 Class :character 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Mode :character Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619
## Median :0.06154 Median :0.03350 Median :0.1792
## Mean :0.08880 Mean :0.04892 Mean :0.1812
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957
## Max. :0.42680 Max. :0.20120 Max. :0.3040
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
## area_se smoothness_se compactness_se concavity_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589
## Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
str(wbcd)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
If we want to detect whether a patient may have breast cancer, it is important to inspect the diagnosis variable. In this variable, B stands for benign (i.e., no breast cancer) and M stands for malignant (i.e., breast cancer).
prop.table(table(wbcd$diagnosis))
##
## B M
## 0.6274165 0.3725835
Many R machine learning classification algorithms require the target feature to be coded as a factor, so we will need to convert the diagnosis variable from a character vector to a factor.
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
str(wbcd$diagnosis)
## Factor w/ 2 levels "Benign","Malignant": 2 2 2 2 2 2 2 2 2 2 ...
Because we are dealing with different units of measurement for each numerical variable, we will need to “normalize” each relevant numerical variable to ensure the accuracy of our model. We can easily achieve this using R’s \(\text{scale()}\) function, which converts each numerical value to its Z-score equivalent.
wbcd.norm <- scale(wbcd[c(3:32)])
summary(wbcd.norm)
## radius_mean texture_mean perimeter_mean area_mean
## Min. :-2.0279 Min. :-2.2273 Min. :-1.9828 Min. :-1.4532
## 1st Qu.:-0.6888 1st Qu.:-0.7253 1st Qu.:-0.6913 1st Qu.:-0.6666
## Median :-0.2149 Median :-0.1045 Median :-0.2358 Median :-0.2949
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4690 3rd Qu.: 0.5837 3rd Qu.: 0.4992 3rd Qu.: 0.3632
## Max. : 3.9678 Max. : 4.6478 Max. : 3.9726 Max. : 5.2459
## smoothness_mean compactness_mean concavity_mean
## Min. :-3.10935 Min. :-1.6087 Min. :-1.1139
## 1st Qu.:-0.71034 1st Qu.:-0.7464 1st Qu.:-0.7431
## Median :-0.03486 Median :-0.2217 Median :-0.3419
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.63564 3rd Qu.: 0.4934 3rd Qu.: 0.5256
## Max. : 4.76672 Max. : 4.5644 Max. : 4.2399
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :-1.2607 Min. :-2.74171 Min. :-1.8183
## 1st Qu.:-0.7373 1st Qu.:-0.70262 1st Qu.:-0.7220
## Median :-0.3974 Median :-0.07156 Median :-0.1781
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6464 3rd Qu.: 0.53031 3rd Qu.: 0.4706
## Max. : 3.9245 Max. : 4.48081 Max. : 4.9066
## radius_se texture_se perimeter_se area_se
## Min. :-1.0590 Min. :-1.5529 Min. :-1.0431 Min. :-0.7372
## 1st Qu.:-0.6230 1st Qu.:-0.6942 1st Qu.:-0.6232 1st Qu.:-0.4943
## Median :-0.2920 Median :-0.1973 Median :-0.2864 Median :-0.3475
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2659 3rd Qu.: 0.4661 3rd Qu.: 0.2428 3rd Qu.: 0.1067
## Max. : 8.8991 Max. : 6.6494 Max. : 9.4537 Max. :11.0321
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :-1.7745 Min. :-1.2970 Min. :-1.0566 Min. :-1.9118
## 1st Qu.:-0.6235 1st Qu.:-0.6923 1st Qu.:-0.5567 1st Qu.:-0.6739
## Median :-0.2201 Median :-0.2808 Median :-0.1989 Median :-0.1404
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3680 3rd Qu.: 0.3893 3rd Qu.: 0.3365 3rd Qu.: 0.4722
## Max. : 8.0229 Max. : 6.1381 Max. :12.0621 Max. : 6.6438
## symmetry_se fractal_dimension_se radius_worst
## Min. :-1.5315 Min. :-1.0960 Min. :-1.7254
## 1st Qu.:-0.6511 1st Qu.:-0.5846 1st Qu.:-0.6743
## Median :-0.2192 Median :-0.2297 Median :-0.2688
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3554 3rd Qu.: 0.2884 3rd Qu.: 0.5216
## Max. : 7.0657 Max. : 9.8429 Max. : 4.0906
## texture_worst perimeter_worst area_worst smoothness_worst
## Min. :-2.22204 Min. :-1.6919 Min. :-1.2213 Min. :-2.6803
## 1st Qu.:-0.74797 1st Qu.:-0.6890 1st Qu.:-0.6416 1st Qu.:-0.6906
## Median :-0.04348 Median :-0.2857 Median :-0.3409 Median :-0.0468
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.65776 3rd Qu.: 0.5398 3rd Qu.: 0.3573 3rd Qu.: 0.5970
## Max. : 3.88249 Max. : 4.2836 Max. : 5.9250 Max. : 3.9519
## compactness_worst concavity_worst concave.points_worst
## Min. :-1.4426 Min. :-1.3047 Min. :-1.7435
## 1st Qu.:-0.6805 1st Qu.:-0.7558 1st Qu.:-0.7557
## Median :-0.2693 Median :-0.2180 Median :-0.2233
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5392 3rd Qu.: 0.5307 3rd Qu.: 0.7119
## Max. : 5.1084 Max. : 4.6965 Max. : 2.6835
## symmetry_worst fractal_dimension_worst
## Min. :-2.1591 Min. :-1.6004
## 1st Qu.:-0.6413 1st Qu.:-0.6913
## Median :-0.1273 Median :-0.2163
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4497 3rd Qu.: 0.4504
## Max. : 6.0407 Max. : 6.8408
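As a quick sanity check that \(\text{scale()}\) really does produce Z-scores, i.e. \(z = \frac{x - \bar{x}}{s}\), we can reproduce one column by hand. This step is only a verification and isn’t required for the analysis.
# manual Z-score for radius_mean; should match the scale() output
z.manual <- (wbcd$radius_mean - mean(wbcd$radius_mean)) / sd(wbcd$radius_mean)
all.equal(as.numeric(wbcd.norm[, "radius_mean"]), z.manual)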
Before we can fit our model, we first have to divide our data into training and test sets. This allows us to assess the accuracy of our model.
wbcd.train <- wbcd.norm[1:469, ]
wbcd.test <- wbcd.norm[470:569, ]
When we constructed our normalized training and test sets, we excluded the target variable diagnosis. To train the k-NN algorithm, we will need to store these class labels in factor vectors, split between the training and test sets.
wbcd.train.labels <- wbcd[1:469, 2]
wbcd.test.labels <- wbcd[470:569, 2]
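Note that this split simply takes the first 469 rows for training and the last 100 for testing, which is fine as long as the rows of the file are in no particular order. If there is any chance the rows are sorted (for example by diagnosis), a random split is safer. A sketch of such a split is shown below; the .rand object names and the seed value are hypothetical, and the rest of this tutorial keeps the sequential split so that the output shown matches.
set.seed(123)                             # arbitrary seed for reproducibility
train.idx <- sample(nrow(wbcd.norm), 469) # 469 random row indices

wbcd.train.rand        <- wbcd.norm[train.idx, ]
wbcd.test.rand         <- wbcd.norm[-train.idx, ]
wbcd.train.labels.rand <- wbcd$diagnosis[train.idx]
wbcd.test.labels.rand  <- wbcd$diagnosis[-train.idx]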
Now that we’ve taken care of the little things, it’s finally time to see how well our model works. An implementation of the k-NN algorithm can be found in the “class” package, which you can install as you would any other R package.
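If the package isn’t installed yet, a one-time install is all that’s needed:
install.packages("class")   # run once; afterwards library(class) is enough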
library(class)
For now, we’ll simply choose \(k = 21\) to start. An odd value avoids tied votes between the two classes, and 21 is roughly the square root of the number of training observations, a common rule of thumb.
wbcd.test.pred <- knn(train = wbcd.train, test = wbcd.test, cl = wbcd.train.labels, k = 21)
prop.table(table(wbcd.test.pred))
## wbcd.test.pred
## Benign Malignant
## 0.79 0.21
Based on the features we used, our model predicted that 79% of the patients in the test set are benign and 21% are malignant. How does that compare to our actual data?
prop.table(table(wbcd$diagnosis))
##
## Benign Malignant
## 0.6274165 0.3725835
Our initial model’s predictions aren’t perfectly in line with the actual proportions (keep in mind that the test set contains only the last 100 patients, so an exact match isn’t expected), but fortunately there are many ways we can improve the model that are beyond the scope of this tutorial.
Let’s compare the overall performance.
library(gmodels)
CrossTable(x = wbcd.test.labels, y = wbcd.test.pred, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd.test.pred
## wbcd.test.labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
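From the cross table, 98 of the 100 test cases were classified correctly; the two errors are malignant tumors that were predicted to be benign. The overall accuracy can also be computed directly from the predictions:
mean(wbcd.test.pred == wbcd.test.labels)   # proportion of test cases classified correctly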
Although our model could definitely use some improvements, it didn’t do a bad job of correctly guessing the diagnosis of breast cancer.
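One natural next step, left as an exercise, is to tune \(k\) itself. The rough sketch below compares test-set accuracy for a few candidate values; strictly speaking, a separate validation set or cross-validation would be preferable to reusing the test set, but it conveys the idea.
# compare test-set accuracy across a few candidate values of k
for (k in c(1, 5, 11, 21, 31)) {
  pred <- knn(train = wbcd.train, test = wbcd.test, cl = wbcd.train.labels, k = k)
  cat("k =", k, " accuracy =", mean(pred == wbcd.test.labels), "\n")
}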