In this tutorial, we will go over a machine learning technique called the k-NN (k-nearest neighbors) algorithm, a supervised learning algorithm used for classification.
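To make the idea concrete before touching the real data: k-NN classifies a new observation by finding the k closest training observations (typically by Euclidean distance) and taking a majority vote of their labels. Below is a minimal, self-contained sketch of that idea for a single new observation; the toy data and the helper function knn_predict_one are purely illustrative, and later we will rely on the ready-made implementation in the class package instead.
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums((train_x - matrix(new_x, nrow = nrow(train_x), ncol = ncol(train_x), byrow = TRUE))^2))
  # labels of the k nearest neighbours, then a simple majority vote
  nearest <- train_y[order(dists)[1:k]]
  names(which.max(table(nearest)))
}

set.seed(1)
train_x <- matrix(rnorm(20), ncol = 2)   # 10 toy observations, 2 features
train_y <- rep(c("A", "B"), each = 5)    # 10 toy class labels
knn_predict_one(train_x, train_y, new_x = c(0.5, -0.2), k = 3)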
Load the data.
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe

wbcd <- read.csv('/Users/czar.yobero/Desktop/wbcd.csv', stringsAsFactors = F)
kable(head(wbcd, n = 10), format = 'html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave.points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
843786 | M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.15780 | 0.08089 | 0.2087 | 0.07613 | 0.3345 | 0.8902 | 2.217 | 27.19 | 0.007510 | 0.03345 | 0.03672 | 0.01137 | 0.02165 | 0.005082 | 15.47 | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 |
844359 | M | 18.25 | 19.98 | 119.60 | 1040.0 | 0.09463 | 0.10900 | 0.11270 | 0.07400 | 0.1794 | 0.05742 | 0.4467 | 0.7732 | 3.180 | 53.91 | 0.004314 | 0.01382 | 0.02254 | 0.01039 | 0.01369 | 0.002179 | 22.88 | 27.66 | 153.20 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 |
84458202 | M | 13.71 | 20.83 | 90.20 | 577.9 | 0.11890 | 0.16450 | 0.09366 | 0.05985 | 0.2196 | 0.07451 | 0.5835 | 1.3770 | 3.856 | 50.96 | 0.008805 | 0.03029 | 0.02488 | 0.01448 | 0.01486 | 0.005412 | 17.06 | 28.14 | 110.60 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.11510 |
844981 | M | 13.00 | 21.82 | 87.50 | 519.8 | 0.12730 | 0.19320 | 0.18590 | 0.09353 | 0.2350 | 0.07389 | 0.3063 | 1.0020 | 2.406 | 24.32 | 0.005731 | 0.03502 | 0.03553 | 0.01226 | 0.02143 | 0.003749 | 15.49 | 30.73 | 106.20 | 739.3 | 0.1703 | 0.5401 | 0.5390 | 0.2060 | 0.4378 | 0.10720 |
84501001 | M | 12.46 | 24.04 | 83.97 | 475.9 | 0.11860 | 0.23960 | 0.22730 | 0.08543 | 0.2030 | 0.08243 | 0.2976 | 1.5990 | 2.039 | 23.94 | 0.007149 | 0.07217 | 0.07743 | 0.01432 | 0.01789 | 0.010080 | 15.09 | 40.68 | 97.65 | 711.4 | 0.1853 | 1.0580 | 1.1050 | 0.2210 | 0.4366 | 0.20750 |
A summary of the data set.
summary(wbcd)
## id diagnosis radius_mean texture_mean
## Min. : 8670 Length:569 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 Class :character 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Mode :character Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619
## Median :0.06154 Median :0.03350 Median :0.1792
## Mean :0.08880 Mean :0.04892 Mean :0.1812
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957
## Max. :0.42680 Max. :0.20120 Max. :0.3040
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
## area_se smoothness_se compactness_se concavity_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589
## Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
str(wbcd)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
If we want to detect whether a patient may have breast cancer, it is important to inspect the diagnosis variable. In this variable, B stands for benign (i.e., no breast cancer) and M stands for malignant (i.e., breast cancer).
prop.table(table(wbcd$diagnosis))
##
## B M
## 0.6274165 0.3725835
Many R machine learning classification algorithms require the target feature to be coded as a factor, so we will need to convert the diagnosis variable from a character vector to a factor.
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
str(wbcd$diagnosis)
## Factor w/ 2 levels "Benign","Malignant": 2 2 2 2 2 2 2 2 2 2 ...
Because we are dealing with different units of measurement for each numerical variable, we will need to “normalize” each relevant numerical variable to ensure the accuracy of our model. We can easily achieve this using R’s \(\text{scale()}\) function, which converts each numerical value to its Z-score equivalent.
wbcd.norm <- scale(wbcd[c(3:32)])
summary(wbcd.norm)
## radius_mean texture_mean perimeter_mean area_mean
## Min. :-2.0279 Min. :-2.2273 Min. :-1.9828 Min. :-1.4532
## 1st Qu.:-0.6888 1st Qu.:-0.7253 1st Qu.:-0.6913 1st Qu.:-0.6666
## Median :-0.2149 Median :-0.1045 Median :-0.2358 Median :-0.2949
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4690 3rd Qu.: 0.5837 3rd Qu.: 0.4992 3rd Qu.: 0.3632
## Max. : 3.9678 Max. : 4.6478 Max. : 3.9726 Max. : 5.2459
## smoothness_mean compactness_mean concavity_mean
## Min. :-3.10935 Min. :-1.6087 Min. :-1.1139
## 1st Qu.:-0.71034 1st Qu.:-0.7464 1st Qu.:-0.7431
## Median :-0.03486 Median :-0.2217 Median :-0.3419
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.63564 3rd Qu.: 0.4934 3rd Qu.: 0.5256
## Max. : 4.76672 Max. : 4.5644 Max. : 4.2399
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :-1.2607 Min. :-2.74171 Min. :-1.8183
## 1st Qu.:-0.7373 1st Qu.:-0.70262 1st Qu.:-0.7220
## Median :-0.3974 Median :-0.07156 Median :-0.1781
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6464 3rd Qu.: 0.53031 3rd Qu.: 0.4706
## Max. : 3.9245 Max. : 4.48081 Max. : 4.9066
## radius_se texture_se perimeter_se area_se
## Min. :-1.0590 Min. :-1.5529 Min. :-1.0431 Min. :-0.7372
## 1st Qu.:-0.6230 1st Qu.:-0.6942 1st Qu.:-0.6232 1st Qu.:-0.4943
## Median :-0.2920 Median :-0.1973 Median :-0.2864 Median :-0.3475
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2659 3rd Qu.: 0.4661 3rd Qu.: 0.2428 3rd Qu.: 0.1067
## Max. : 8.8991 Max. : 6.6494 Max. : 9.4537 Max. :11.0321
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :-1.7745 Min. :-1.2970 Min. :-1.0566 Min. :-1.9118
## 1st Qu.:-0.6235 1st Qu.:-0.6923 1st Qu.:-0.5567 1st Qu.:-0.6739
## Median :-0.2201 Median :-0.2808 Median :-0.1989 Median :-0.1404
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3680 3rd Qu.: 0.3893 3rd Qu.: 0.3365 3rd Qu.: 0.4722
## Max. : 8.0229 Max. : 6.1381 Max. :12.0621 Max. : 6.6438
## symmetry_se fractal_dimension_se radius_worst
## Min. :-1.5315 Min. :-1.0960 Min. :-1.7254
## 1st Qu.:-0.6511 1st Qu.:-0.5846 1st Qu.:-0.6743
## Median :-0.2192 Median :-0.2297 Median :-0.2688
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3554 3rd Qu.: 0.2884 3rd Qu.: 0.5216
## Max. : 7.0657 Max. : 9.8429 Max. : 4.0906
## texture_worst perimeter_worst area_worst smoothness_worst
## Min. :-2.22204 Min. :-1.6919 Min. :-1.2213 Min. :-2.6803
## 1st Qu.:-0.74797 1st Qu.:-0.6890 1st Qu.:-0.6416 1st Qu.:-0.6906
## Median :-0.04348 Median :-0.2857 Median :-0.3409 Median :-0.0468
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.65776 3rd Qu.: 0.5398 3rd Qu.: 0.3573 3rd Qu.: 0.5970
## Max. : 3.88249 Max. : 4.2836 Max. : 5.9250 Max. : 3.9519
## compactness_worst concavity_worst concave.points_worst
## Min. :-1.4426 Min. :-1.3047 Min. :-1.7435
## 1st Qu.:-0.6805 1st Qu.:-0.7558 1st Qu.:-0.7557
## Median :-0.2693 Median :-0.2180 Median :-0.2233
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5392 3rd Qu.: 0.5307 3rd Qu.: 0.7119
## Max. : 5.1084 Max. : 4.6965 Max. : 2.6835
## symmetry_worst fractal_dimension_worst
## Min. :-2.1591 Min. :-1.6004
## 1st Qu.:-0.6413 1st Qu.:-0.6913
## Median :-0.1273 Median :-0.2163
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4497 3rd Qu.: 0.4504
## Max. : 6.0407 Max. : 6.8408
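As a quick sanity check that \(\text{scale()}\) really does produce Z-scores, i.e. \(z = \frac{x - \bar{x}}{s}\), we can reproduce one column by hand. This step is only a verification and isn’t required for the analysis.
# manual Z-score for radius_mean; should match the scale() output
z.manual <- (wbcd$radius_mean - mean(wbcd$radius_mean)) / sd(wbcd$radius_mean)
all.equal(as.numeric(wbcd.norm[, "radius_mean"]), z.manual)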
Before we can fit our model, we first have to divide our data into training and test sets. This allows us to assess the accuracy of our model.
wbcd.train <- wbcd.norm[1:469, ]
wbcd.test <- wbcd.norm[470:569, ]
When we constructed our normalized training and test sets, we excluded the target variable diagnosis. To train the k-NN algorithm, we will need to store these class labels in factor vectors, split between the training and test sets.
wbcd.train.labels <- wbcd[1:469, 2]
wbcd.test.labels <- wbcd[470:569, 2]
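Note that this split simply takes the first 469 rows for training and the last 100 for testing, which is fine as long as the rows of the file are in no particular order. If there is any chance the rows are sorted (for example by diagnosis), a random split is safer. A sketch of such a split is shown below; the .rand object names and the seed value are hypothetical, and the rest of this tutorial keeps the sequential split so that the output shown matches.
set.seed(123)                             # arbitrary seed for reproducibility
train.idx <- sample(nrow(wbcd.norm), 469) # 469 random row indices

wbcd.train.rand        <- wbcd.norm[train.idx, ]
wbcd.test.rand         <- wbcd.norm[-train.idx, ]
wbcd.train.labels.rand <- wbcd$diagnosis[train.idx]
wbcd.test.labels.rand  <- wbcd$diagnosis[-train.idx]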
Now that we’ve taken care of the little things, it’s finally time to see how well our model works. An implementation of the k-NN algorithm can be found in the “class” package, which you can install as you would any other R package.
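If the package isn’t installed yet, a one-time install is all that’s needed:
install.packages("class")   # run once; afterwards library(class) is enough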
library(class)
For now, we’ll simply choose \(k = 21\) to start. An odd value avoids tied votes between the two classes, and 21 is roughly the square root of the number of training observations, a common rule of thumb.
wbcd.test.pred <- knn(train = wbcd.train, test = wbcd.test, cl = wbcd.train.labels, k = 21)
prop.table(table(wbcd.test.pred))
## wbcd.test.pred
## Benign Malignant
## 0.79 0.21
Based on the features we used, our model predicted that 79% of the patients in the test set are benign and 21% are malignant. How does that compare to our actual data?
prop.table(table(wbcd$diagnosis))
##
## Benign Malignant
## 0.6274165 0.3725835
Our initial model’s predictions aren’t perfectly in line with the actual proportions (keep in mind that the test set contains only the last 100 patients, so an exact match isn’t expected), but fortunately there are many ways we can improve the model that are beyond the scope of this tutorial.
Let’s compare the overall performance.
library(gmodels)
CrossTable(x = wbcd.test.labels, y = wbcd.test.pred, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd.test.pred
## wbcd.test.labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
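From the cross table, 98 of the 100 test cases were classified correctly; the two errors are malignant tumors that were predicted to be benign. The overall accuracy can also be computed directly from the predictions:
mean(wbcd.test.pred == wbcd.test.labels)   # proportion of test cases classified correctly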
Although our model could definitely use some improvements, it didn’t do a bad job of correctly guessing the diagnosis of breast cancer.
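One natural next step, left as an exercise, is to tune \(k\) itself. The rough sketch below compares test-set accuracy for a few candidate values; strictly speaking, a separate validation set or cross-validation would be preferable to reusing the test set, but it conveys the idea.
# compare test-set accuracy across a few candidate values of k
for (k in c(1, 5, 11, 21, 31)) {
  pred <- knn(train = wbcd.train, test = wbcd.test, cl = wbcd.train.labels, k = k)
  cat("k =", k, " accuracy =", mean(pred == wbcd.test.labels), "\n")
}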