This tutorial covers k-Nearest Neighbours (kNN), and we’ll apply it to a breast cancer dataset.
We can measure how similar any two points are with some kind of distance function (think dist()): the smaller the distance, presumably the more alike the two points are.
So, the algorithm is simple: for any point, identify its k nearest neighbours, and the majority class among those neighbours determines the class assigned to the point in question.
R’s class package comes with a built-in knn() function. More sophisticated versions exist, an interesting one being weighted neighbours (greater weight given to the nearest points).
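To make the idea concrete, here’s a minimal from-scratch sketch with made-up points (a toy illustration only - we’ll use the class package’s knn() for the real work below):
# toy kNN: classify a single point by majority vote of its k nearest training points
toy_knn = function(train, labels, point, k) {
  dists = sqrt(rowSums(sweep(train, 2, point)^2)) # Euclidean distance to each training row
  nearest = order(dists)[1:k]                     # indices of the k closest rows
  names(which.max(table(labels[nearest])))        # majority label among them
}
# e.g. with two tiny 2-D clusters:
tr = rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
toy_knn(tr, c("A", "A", "B", "B"), c(4.5, 5.2), k = 3) # -> "B"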
The data is taken from this link. Let’s see what the dataset looks like:
str(d)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
Alright, so we have an ID, a diagnosis, and then mean (_mean), standard-error (_se), and ‘worst’ (_worst) values for 10 different physical measurements which only mean something to an oncologist. Let’s see how they range:
summary(d)
## id diagnosis radius_mean texture_mean
## Min. : 8670 Length:569 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 Class :character 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Mode :character Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619
## Median :0.06154 Median :0.03350 Median :0.1792
## Mean :0.08880 Mean :0.04892 Mean :0.1812
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957
## Max. :0.42680 Max. :0.20120 Max. :0.3040
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
## area_se smoothness_se compactness_se concavity_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589
## Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
The ID presumably uniquely identifies each record. Let’s verify:
length(unique(d$id))
## [1] 569
Looks fine. Note that we’ll need to exclude this column from the model, since an ID can be used to uniquely ‘predict’ each record - such models suffer from massive overfitting! Let’s drop it now:
d = d[-1] # <-- this REMOVES column 1 (the ID column)
The diagnosis is apparently stored as a character vector; let’s examine it:
table(d$diagnosis)
##
## B M
## 357 212
OK, these look like B for benign and M for malignant. They should really be converted to a factor, since that’s what R’s modelling functions expect; we’ll also give ’em nicer labels:
d$diagnosis = factor(d$diagnosis, levels=c("B", "M"), labels=c("Benign", "Malignant"))
# this is useful for showing table output as percentages:
prop.table(table(d$diagnosis)) * 100
##
## Benign Malignant
## 62.74165 37.25835
# to round:
round(prop.table(table(d$diagnosis)) * 100, digits=1)
##
## Benign Malignant
## 62.7 37.3
For the purposes of this illustration, we won’t be examining all 30 variables, just these 3:

* radius_mean
* area_mean
* smoothness_mean
Let’s have a look:
# useful way of summarising only a few vars we care about:
summary(d[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. : 6.981 Min. : 143.5 Min. :0.05263
## 1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
## Median :13.370 Median : 551.1 Median :0.09587
## Mean :14.127 Mean : 654.9 Mean :0.09636
## 3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
## Max. :28.110 Max. :2501.0 Max. :0.16340
So looking at their ranges, we can see there’s likely gonna be a scale issue: the relative magnitudes of these numbers differ wildly. Recall that the distance calculation in kNN is heavily influenced by measurement scales, so we’re gonna need to normalise. Let’s write a function to do this!
# normalise(x): normalises a vector based on the range of elements
# input: vector x of numeric values
# returns: x as a vector of numeric values scaled to [0, 1]
normalise = function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
# test it out on two vectors which are 10x different
v = c(1,2,3,4,5)
v10 = v*10
v
## [1] 1 2 3 4 5
v10
## [1] 10 20 30 40 50
# v10 is 10x bigger than v
# are v and v10's normalised versions completely identical?
all(normalise(v) == normalise(v10))
## [1] TRUE
# the outputs are identical, normalise() is GTG
Right, we have our normalisation function; let’s apply it to all of the frame’s numeric columns. We could do them one by one, but that’s a waste of time - this is what lapply() is for!
# lapply takes a list and applies a function to each element of that list. It also
# RETURNS a list, which we'll have to re-convert to a frame.
tmpList = lapply(d[2:31], normalise)
d.n = as.data.frame(tmpList)
# so let's see what the summary for our key vars now looks like:
summary(d.n[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2233 1st Qu.:0.1174 1st Qu.:0.3046
## Median :0.3024 Median :0.1729 Median :0.3904
## Mean :0.3382 Mean :0.2169 Mean :0.3948
## 3rd Qu.:0.4164 3rd Qu.:0.2711 3rd Qu.:0.4755
## Max. :1.0000 Max. :1.0000 Max. :1.0000
# yay
Remember that lapply() returns a list corresponding to its input: in the above, we didn’t pass it the first column of d, only the second onwards. The first column of d is the diagnosis, so that’s now stripped out of d.n.
So now our data’s normalised, we can analyse it. Given that we already know what the diagnosis is for all our data, it’s not interesting to predict what we already know. Instead, we’ll split our dataset into two groups: training (469 records), and test (remaining 100). The former will be used to build our model, and the latter will be used to test it.
We’ll use the first 469 records for training and the remaining 100 for testing. For the model to be any good, the training set must be well sampled - this simple head/tail split only works because the dataset is already in random order. Beware of datasets which have been sorted according to some criterion!
# split up our data
d.train = d.n[1:469, ] # remember R is 1-based, not 0!!
d.test = d.n[470:569, ]
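As an aside: if the rows were sorted, we could randomise the split ourselves before carving it up. A quick sketch (not needed here, and the seed is arbitrary):
# shuffle the row order, then take the head/tail split as before
set.seed(42)                     # arbitrary seed, for reproducibility
idx = sample(nrow(d.n))          # random permutation of 1..569
d.train.shuf = d.n[idx[1:469], ]
d.test.shuf  = d.n[idx[470:569], ]
# the label vectors below would then need the same reordering, e.g. d[idx[1:469], 1]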
Finally, note that neither d.train nor d.test contains the diagnosis column - it was stripped out during normalisation by lapply(), remember? We still need that information though:
d_labels.train = d[1:469, 1]
d_labels.test = d[470:569, 1]
Here we’ll use the class package’s knn() function to do all the work (others exist; check out CRAN for more sophisticated / efficient implementations). It takes 4 arguments: the training frame, the test frame, the vector of training class labels, and the desired value of k. It returns a factor vector of predicted labels for the test set.
library(class) # provides knn()
d_labels.test.pred = knn(train = d.train,
                         test = d.test,
                         cl = d_labels.train,
                         k = 21) # an odd number (avoids tied votes), roughly equal to sqrt(469)
# let's make sure the predictions look roughly right in the first few lines:
head(d_labels.test)
## [1] Benign Benign Benign Benign Benign Benign
## Levels: Benign Malignant
head(d_labels.test.pred)
## [1] Benign Benign Benign Benign Benign Benign
## Levels: Benign Malignant
We can use the gmodels package’s CrossTable() function to show a cross-tabulation of the agreement:
#install.packages("gmodels")
library(gmodels)
CrossTable(x = d_labels.test, y = d_labels.test.pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | d_labels.test.pred
## d_labels.test | Benign | Malignant | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## --------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## --------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## --------------|-----------|-----------|-----------|
##
##
This shows that all 77 of the actually benign tumours were correctly classified, but only 21 of the 23 malignant ones were. The algorithm mis-labelled 2 malignant tumours as benign - false negatives, a serious issue in this context.
These results give a 2% error rate, not bad for a few lines of R code. Let’s see what we can do to improve.
We’ll explore two ways. The first is to improve on the normalise() function, which is not hugely robust: outliers compress all the ordinary values into a narrow band - and we can expect plenty of outliers with something like malignant tumours, whose measurements can get pretty wild.
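A quick illustration of that compression problem, with made-up numbers:
# one extreme value squashes the ordinary ones into a tiny corner of [0, 1]
normalise(c(1, 2, 3, 100))
# -> roughly 0.00, 0.01, 0.02, 1.00: the first three are nearly indistinguishable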
The other popular way of standardising numeric types is z-score transforms:
z-score = ( x_i - mean(X) ) / stdev(X)
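In R, that’s a one-liner per column; just to make the formula concrete, here it is computed by hand for a single column:
# the z-score transform by hand, for one column:
z_radius = (d$radius_mean - mean(d$radius_mean)) / sd(d$radius_mean)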
As luck would have it, R comes with such a function - scale() - which can even work directly with frames, so no need to faff with lapply():
d.z = as.data.frame(scale(d[-1]))
summary(d.z[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :-2.0279 Min. :-1.4532 Min. :-3.10935
## 1st Qu.:-0.6888 1st Qu.:-0.6666 1st Qu.:-0.71034
## Median :-0.2149 Median :-0.2949 Median :-0.03486
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4690 3rd Qu.: 0.3632 3rd Qu.: 0.63564
## Max. : 3.9678 Max. : 5.2459 Max. : 4.76672
# compare this to the summary for d.n earlier
Note that the mean is 0 for z-score standardised data, and the range should be fairly compact. Recall that z-scores beyond ±3 are very rare values, so we’ve got some interesting outliers here (area_mean maxes out at 5.25).
So, let’s repeat the previous routine and see what knn() yields:
d.train = d.z[1:469, ]
d.test = d.z[470:569, ]
d_labels.train = d[1:469, 1]
d_labels.test = d[470:569, 1]
d_labels.test.pred = knn(train = d.train,
test = d.test,
cl = d_labels.train,
k = 21)
CrossTable(x = d_labels.test, y = d_labels.test.pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | d_labels.test.pred
## d_labels.test | Benign | Malignant | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## --------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## --------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## --------------|-----------|-----------|-----------|
##
##
Unfortunately it seems this has made no difference. This sometimes happens.
The other improvement is to try varying values of k; we may get 1 or 2% improvements, but nothing startling - and it’s important not to overfit the model to our test data. A sketch of such a scan follows below.
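Here we score each candidate k against the test labels purely for illustration - in practice you’d tune k via cross-validation or a separate validation set, precisely to avoid that overfitting trap:
# try a handful of odd k values and report the error rate on the held-out set
for (k in c(1, 5, 11, 15, 21, 27)) {
  pred = knn(train = d.train, test = d.test, cl = d_labels.train, k = k)
  err = mean(pred != d_labels.test) * 100
  cat("k =", k, "-> error rate:", err, "%\n")
}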
kNN isn’t really a learning system; it simply stores the training data verbatim. Unlabelled test examples are then matched to the most similar records in the training set using some kind of distance function, and each is assigned the majority label of its neighbours.
Despite its simplicity (stupidity, even), it’s pretty powerful - here it correctly predicted whether a tumour was benign or malignant 98% of the time, in just a few lines of R.