This tutorial covers k-Nearest Neighbours (kNN), which we'll apply to a breast cancer dataset.

The idea

We can measure how similar any two points are by computing some kind of distance between them (a dist()-type function). The smaller the distance, presumably the more similar the two points are.

So, the algorithm is simple: for any point, identify its k nearest neighbours, and the majority class among those k neighbours determines the class assigned to the point in question.

R's class package comes with a built-in knn() function. Other, more sophisticated versions exist; an interesting one weights the neighbours (with more weight going to the closest points).
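To make the idea concrete, here's a minimal hand-rolled sketch (not the class::knn() implementation; knn_one and its arguments are invented for illustration):

# classify a single query point by majority vote among its k nearest training rows
knn_one = function(train, labels, query, k = 3) {
  # Euclidean distance from the query point to every training row
  dists = sqrt(rowSums(sweep(as.matrix(train), 2, query)^2))
  # labels of the k closest points (a weighted variant would scale votes by 1/dists)
  nearest = labels[order(dists)[1:k]]
  # majority vote decides the class
  names(which.max(table(nearest)))
}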

The data

The data is taken from this link.

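The loading step isn't shown here; assuming the CSV from the link has been saved locally, something like the following would produce the frame d (the filename is a placeholder):

d = read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE) # keep diagnosis as chr

Let's see what the dataset looks like: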
str(d)
## 'data.frame':    569 obs. of  32 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...

Alright, so we have an ID, a diagnosis, and mean, standard-error ('se'), and 'worst' values for 10 different physical measurements which only mean something to an oncologist. Let's see how they range:

summary(d)
##        id             diagnosis          radius_mean      texture_mean  
##  Min.   :     8670   Length:569         Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   Class :character   1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024   Mode  :character   Median :13.370   Median :18.84  
##  Mean   : 30371831                      Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129                      3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502                      Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean   
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
##  Median :0.06154   Median :0.03350     Median :0.1792  
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##     area_se        smoothness_se      compactness_se      concavity_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst
##  Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2822   Median :0.08004        
##  Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.6638   Max.   :0.20750

Data cleaning

The ID presumably uniquely identifies each record. Let’s verify:

length(unique(d$id))
## [1] 569

Looks fine. Note that we'll need to exclude this column from the model, since an ID can be used to uniquely 'predict' each record, and such a model would be massively overfitted. Let's do this now:

d = d[-1] # drop column 1 (the ID column)

The diagnosis is currently stored as a character vector; let's examine it:

table(d$diagnosis)
## 
##   B   M 
## 357 212

OK, these look like B for benign and M for malignant. They should really be converted into a factor, since that's what R's modelling functions expect; we'll also give them nicer labels:

d$diagnosis = factor(d$diagnosis, levels=c("B", "M"), labels=c("Benign", "Malignant"))
# this is useful for showing table output as percentages:
prop.table(table(d$diagnosis)) * 100
## 
##    Benign Malignant 
##  62.74165  37.25835
# to round:
round(prop.table(table(d$diagnosis)) * 100, digits=1)
## 
##    Benign Malignant 
##      62.7      37.3

For the purposes of this illustration, we won't be examining all 30 variables, just 3:

* radius_mean
* area_mean
* smoothness_mean

Let’s have a look:

# useful way of summarising only a few vars we care about:
summary(d[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean  
##  Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
##  1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
##  Median :13.370   Median : 551.1   Median :0.09587  
##  Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
##  3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
##  Max.   :28.110   Max.   :2501.0   Max.   :0.16340

Looking at their ranges, there's clearly going to be a scale problem: area_mean runs into the thousands while smoothness_mean never exceeds 0.2. Recall that the distance calculation in kNN is heavily influenced by the measurement scales, so the larger-scaled features would dominate. We're going to need to normalise. Let's write a function to do this!

# normalise(x): normalises a vector based on the range of elements
# input: vector x of numeric values
# returns: x as a vector of numeric values scaled to [0, 1]
normalise = function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
# test it out on two vectors which are 10x different
v = c(1,2,3,4,5)
v10 = v*10
v
## [1] 1 2 3 4 5
v10
## [1] 10 20 30 40 50
# v10 is 10x bigger than v
# are the normalised versions of v and v10 identical?
all(normalise(v) == normalise(v10))
## [1] TRUE
# the outputs are identical, so normalise() is good to go
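One caveat worth flagging (it doesn't bite with this dataset): if a column is constant, max(x) - min(x) is zero and normalise() returns NaNs. A guarded variant, purely as a sketch:

normalise_safe = function(x) {
  rng = max(x) - min(x)
  if (rng == 0) return(rep(0, length(x))) # constant column: map everything to 0
  (x - min(x)) / rng
}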

Right, we have our normalisation function; let's apply it to all the numeric columns of the frame. We could do them one by one, but that's a waste of time - lapply() to the rescue!

# lapply applies a function to each element of a list; a data frame is a list of
# columns, so this works directly. lapply RETURNS a list, which we'll have to
# convert back into a frame.
tmpList = lapply(d[2:31], normalise)
d.n = as.data.frame(tmpList)
# so let's see what the summary for our key vars now looks like:
summary(d.n[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2233   1st Qu.:0.1174   1st Qu.:0.3046  
##  Median :0.3024   Median :0.1729   Median :0.3904  
##  Mean   :0.3382   Mean   :0.2169   Mean   :0.3948  
##  3rd Qu.:0.4164   3rd Qu.:0.2711   3rd Qu.:0.4755  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
# yay

Remember that lapply() returns a list corresponding to its input: above, we didn't pass it the first column of d, only the second onwards. The first column of d is the diagnosis, so that's now stripped out of d.n.

Training preparation

Now that our data is normalised, we can analyse it. Given that we already know the diagnosis for every record, it's not interesting to predict what we already know. Instead, we'll split the dataset into two groups: training (469 records) and test (the remaining 100). The former will be used to build our model, and the latter to test it.

We'll simply take the first 469 rows for training and the remaining 100 for testing. For the model to be any good, the training set must be a representative sample - this naive splitting method only works because this dataset arrives in random order. Beware of datasets which are already sorted according to some criterion!

# split up our data
d.train = d.n[1:469, ] # remember R is 1-based, not 0!!
d.test = d.n[470:569, ]

Finally, note that neither d.train nor d.test contains the diagnosis column - it was stripped out during normalisation by lapply(), remember? We still need that information though:

d_labels.train = d[1:469, 1]
d_labels.test = d[470:569, 1]
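Incidentally, had the rows not come pre-shuffled, a random split would be the safer route. A sketch (the seed is an arbitrary choice):

set.seed(42)                  # arbitrary seed, just for reproducibility
idx = sample(nrow(d.n), 469)  # pick 469 random row indices for training
d.train = d.n[idx, ]
d.test  = d.n[-idx, ]
d_labels.train = d[idx, 1]
d_labels.test  = d[-idx, 1]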

Training

Here we'll use the class package's knn() function to do all the work (others exist; check CRAN for more sophisticated or efficient implementations). It takes four arguments: the training frame, the test frame, the vector of training class labels, and the desired value of k. It returns a factor vector of predicted labels for the test set.

library(class) # provides knn()
d_labels.test.pred = knn(train = d.train,
                         test  = d.test,
                         cl    = d_labels.train,
                         k     = 21) # an odd number, roughly sqrt(469) ≈ 21.7
# let's make sure the predictions look roughly right in the first few lines:
head(d_labels.test)
## [1] Benign Benign Benign Benign Benign Benign
## Levels: Benign Malignant
head(d_labels.test.pred)
## [1] Benign Benign Benign Benign Benign Benign
## Levels: Benign Malignant

Evaluating model performance

We can use the gmodels package's CrossTable() function to show a cross-tabulation of the agreement:

#install.packages("gmodels")
library(gmodels)
CrossTable(x = d_labels.test, y = d_labels.test.pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##               | d_labels.test.pred 
## d_labels.test |    Benign | Malignant | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |        77 |         0 |        77 | 
##               |     1.000 |     0.000 |     0.770 | 
##               |     0.975 |     0.000 |           | 
##               |     0.770 |     0.000 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |         2 |        21 |        23 | 
##               |     0.087 |     0.913 |     0.230 | 
##               |     0.025 |     1.000 |           | 
##               |     0.020 |     0.210 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |        79 |        21 |       100 | 
##               |     0.790 |     0.210 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

This shows that all 77 of the actually benign tumours were correctly classified, but only 21 of the 23 malignant ones were. In other words, the algorithm mislabelled 2 malignant tumours as benign - false negatives, the serious kind of error in this context.
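As an aside, base R can give the same counts plus the raw error rate without gmodels:

table(actual = d_labels.test, predicted = d_labels.test.pred)
mean(d_labels.test != d_labels.test.pred) # proportion misclassified: 0.02 here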

These results give a 2% error rate, not bad for a few lines of R code. Let’s see what we can do to improve.

Improving the model

We'll explore two ways. The first addresses a weakness of our normalise() function: min-max scaling compresses outliers in towards the rest of the data, which is not hugely useful here - we can expect plenty of outliers with something like malignant tumours, whose measurements can get pretty wild.

The other popular way of standardising numeric variables is the z-score transform:

z_i = (x_i - mean(X)) / sd(X)
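Mirroring normalise(), a hand-rolled version would look like this:

z_score = function(x) {
  (x - mean(x)) / sd(x)
}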

As luck would have it, R comes with such a function - scale() - which can even work directly with frames, so no need to faff with lapply():

d.z = as.data.frame(scale(d[-1]))
summary(d.z[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean        area_mean       smoothness_mean   
##  Min.   :-2.0279   Min.   :-1.4532   Min.   :-3.10935  
##  1st Qu.:-0.6888   1st Qu.:-0.6666   1st Qu.:-0.71034  
##  Median :-0.2149   Median :-0.2949   Median :-0.03486  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.4690   3rd Qu.: 0.3632   3rd Qu.: 0.63564  
##  Max.   : 3.9678   Max.   : 5.2459   Max.   : 4.76672
# compare this to the summary for d.n earlier

Note that the mean of z-score standardised data is 0 and the range should be fairly compact. Recall that z-scores beyond +3 or -3 indicate very rare values, so it looks like we've got some interesting outliers here.
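A quick way to see how much of the data sits out in those tails:

sum(abs(d.z) > 3) # total count of |z| > 3 across all 30 standardised columns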

So, let’s repeat the previous routine and see what knn() yields:

d.train = d.z[1:469, ]
d.test  = d.z[470:569, ]
d_labels.train = d[1:469, 1]
d_labels.test  = d[470:569, 1]
d_labels.test.pred = knn(train = d.train,
                         test  = d.test,
                         cl    = d_labels.train,
                         k     = 21)
CrossTable(x = d_labels.test, y = d_labels.test.pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##               | d_labels.test.pred 
## d_labels.test |    Benign | Malignant | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |        77 |         0 |        77 | 
##               |     1.000 |     0.000 |     0.770 | 
##               |     0.975 |     0.000 |           | 
##               |     0.770 |     0.000 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |         2 |        21 |        23 | 
##               |     0.087 |     0.913 |     0.230 | 
##               |     0.025 |     1.000 |           | 
##               |     0.020 |     0.210 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |        79 |        21 |       100 | 
##               |     0.790 |     0.210 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Unfortunately it seems this has made no difference. This sometimes happens.

The other improvement is to try different values of k; we may get a 1-2% improvement, but nothing startling - and it's important not to overfit the model to our test data. A sketch follows.
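As a sketch, one could sweep a few candidate values of k on the current split (the particular values below are arbitrary odd numbers; a separate validation set would be more principled than tuning on the test set):

for (k in c(1, 5, 11, 15, 21, 27)) {
  pred = knn(train = d.train, test = d.test, cl = d_labels.train, k = k)
  cat("k =", k, "->", sum(pred != d_labels.test), "errors out of", length(pred), "\n")
}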

Summary

kNN isn't really a learning system - it just stores the training data verbatim. Unlabelled test examples are then matched to the most similar records in the training set using some kind of distance function, and each unlabelled example is assigned the majority label of its k nearest neighbours.

Despite its simplicity (some would say stupidity), it's pretty powerful - it correctly predicted whether a tumour was benign or malignant 98% of the time, in just a few lines of R.