Introduction

The goal of this work is to experiment with the classical kNN algorithm and with distance-weighted kNN and to compare their accuracy. The idea of the kNN algorithm is to measure the distance between a given test example and all the examples in the dataset, choose the k closest examples, and then predict the class (or value) based on these nearest neighbours. There are several modifications of this algorithm, for example distance-weighted kNN and attribute-weighted kNN. In this work we only experiment with classical kNN and distance-weighted kNN.

Load datasets

A standard step: loading the datasets. The iris dataset is built into R, so only wine and winequality need to be read from files.

wine <- read.csv("~/Studies/Year3/DataMining/wine.data", header=FALSE)
winequality = read.csv("~/Studies/Year3/Machine Learning/winequality-red.csv",sep=";")

All three target variables are discrete. Iris and wine have 3 class labels each, while winequality grades wines on a 0-10 quality scale.
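
As a quick check we can look at the class distributions directly (the column names below are assumed from the standard UCI files: the class label is the first column of wine.data and the quality column is the last column of winequality-red.csv):

table(iris$Species)         # 3 classes, 50 examples each
table(wine$V1)              # 3 classes
table(winequality$quality)  # quality scores on the 0-10 scale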

Normalization

We will later compare three rescaling options: no rescaling, min-max normalization (each column mapped to [0, 1]) and standardization (each column rescaled to zero mean and unit standard deviation).

# min-max normalization: rescale every column of M to the [0, 1] range
normalize <- function(M) {
    mins = t(apply(M, 2, min))
    maxs = t(apply(M, 2, max))
    ranges = maxs-mins
    Xnorm = t(apply(M,1,function(x){ (x-mins)/(ranges) }))
    Xnorm
}

# standardization: rescale every column of M to zero mean and unit standard deviation
standardize <- function(M) {
    means = t(apply(M,2,mean))
    sds = t(apply(M,2,sd))
    Xnorm = t(apply(M,1,function(x) {(x-means)/sds}))
    Xnorm
}
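
As a quick sanity check (illustrative only), min-max normalization should map every column into [0, 1], while standardization should give every column zero mean and unit standard deviation:

apply(normalize(iris[, 1:4]), 2, range)          # every column spans 0..1
round(apply(standardize(iris[, 1:4]), 2, sd), 6) # every column has sd 1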

Instance-based methods introduction

Training in instance-based methods simply means storing the examples. Then, each time we get a new query, we compare it to the stored examples to decide the value of the target function for this new example. The advantage of this approach is that the target function is estimated locally (which is good when the target function is complex); we never approximate it over the entire instance space, so we are not trying to find a single function that fits all examples. The downside is that the computational cost at query time can be quite high. In addition, these algorithms consider all available attributes.

k Nearest Neighbors

Each example in a dataset is a point in n-dimensional space, where the number of dimensions is equal to the number of features. The target value is the label attached to this point. Nearest neighbours are defined in terms of the standard Euclidean distance. Since training is just storing the examples in this family of algorithms, the first stage has already been performed at the beginning of this document.
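
As a quick illustration (not part of the implementation below), the Euclidean distance between two numeric feature vectors in R is just:

euclid <- function(a, b) sqrt(sum((a - b)^2))
euclid(c(1, 2), c(4, 6))
## [1] 5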

We’ll start with a helper function that simply gets the most frequent value in a vector. This function will be useful for implementing kNN for discrete-valued target functions.

# calculates the mode of a vector
get_mode <- function(v) {
    # build a frequency table of the values
    freqs = table(v)
    # return the value with the highest frequency (as a character string)
    mode = names(freqs)[which.max(freqs)]
    return(mode)
}

Example:

a = c(1,1,2,2)
get_mode(a)
## [1] "1"
b = c("cat", "dog", "dog", "ant")
get_mode(b)
## [1] "dog"

The algorithm itself is so simple that we don’t need any other helper functions. The idea is to compute the Euclidean distance between the query and each training example, pick the k closest examples and return the most frequent label among them.

# predicts the label of the query based on its k nearest neighbours
# query - vector, example to be classified
# examples - matrix of observations, target column included
# k - number of neighbours to be used
# target_column_index - the index of the column with the labels
get_k_nearest <- function(query, examples, k, target_column_index) {
    X = examples[,-target_column_index]
    y = examples[,target_column_index]
    # calculate the Euclidean distance between the query and each training example
    dists = apply(X, 1, function(x) { sqrt(sum((x-query)^2)) })
    # sort the distances and get the indices of this sorting
    sorted_dists = sort(dists, index.return = T)$ix
    # choose indices of k nearest
    k_nearest = sorted_dists[1:k]
    candidates = y[k_nearest]
    # get the most frequent answer
    result = get_mode(candidates)
    return(result)
}

Detailed example

Let’s take the iris dataset and split it into a train and a test set in a 70/30 proportion:

library(caTools)
set.seed(3711)
split = sample.split(iris$Species, SplitRatio = 0.7)
train = subset(iris, split==T)
test = subset(iris, split==F)
testX = test[,1:4]
testy = test[,5]
tst = testX[1,]

Now we can apply our kNN algorithm. We’ll use k=5, run the algorithm for each test example and then calculate accuracy as the proportion of correct predictions.

predictions = apply(testX, 1, get_k_nearest, train, 5, 5)
dif = predictions == testy
sum(dif)/length(predictions)
## [1] 0.9555556

The next step is to pack everything into a single function that allows us to run experiments with an arbitrary number of neighbours and an arbitrary train/test ratio.

# dataset - original dataset
# target_column_index - the index of the column with the target function values
# k - number of neighbours
# train_ratio - the proportion of training examples
# knnf - kNN variant to be applied (get_k_nearest or get_weighted_k_nearest)
# normz - rescaling function to be applied (identity, normalize or standardize)
# n_times - number of random splits to average the accuracy over
library(caTools)
run_experiment <- function(dataset, target_column_index, k, train_ratio, knnf, normz, n_times=1) {
    accuracy = 0
    for (i in 1:n_times) {
        # split the dataset
        target_column = dataset[,target_column_index]
        split = sample.split(target_column, SplitRatio = train_ratio)
        train = subset(dataset, split==T)
        trainX = train[,-target_column_index]
        trainy = train[,target_column_index]
        test = subset(dataset, split==F)
        testX = test[,-target_column_index]
        testy = test[,target_column_index]
        trainX = data.frame(normz(trainX))
        testX = data.frame(normz(testX))
        train = cbind(trainX,trainy)
        predictions = apply(testX, 1, knnf, train, k, ncol(train))
        dif = predictions == testy
        accuracy = accuracy + sum(dif)/length(predictions)
    }
    accuracy = accuracy/n_times
    accuracy
}
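
For example, the detailed iris experiment above can now be run with a single call (the exact accuracy will vary with the random split; identity means no rescaling):

run_experiment(iris, 5, 5, 0.7, get_k_nearest, identity)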

Distance-Weighted Nearest Neighbour Algorithm

The idea is to weight the contribution of each of the k neighbours according to its distance to the query, so that closer neighbours count more. In the implementation below each neighbour is weighted by the inverse of its squared distance to the query; a small illustration follows. Apart from that, the implementation is very similar to classical kNN.
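
A tiny numeric illustration of these weights (not part of the implementation):

sq_dists = c(0.25, 1, 1)   # squared distances of three neighbours to the query
1/sq_dists                 # weights: the closest neighbour outvotes the other two
## [1] 4 1 1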

# sums the weights per label and returns the label with the largest total weight
get_weighted <- function(labels,weights) {
    dset = data.frame(labels=factor(labels), 
                      weights=weights)
    scores = aggregate(. ~ labels, dset, sum)
    result = as.character(scores$labels[which.max(scores$weights)])
    return(result)
}
s = as.numeric(c(0.1,0.2,0.1,0.4))
l = c("d","c","c","m")
get_weighted(l,s)
## [1] "m"
# distance-weighted kNN: predicts the label of the query, weighting each neighbour by 1/d^2
get_weighted_k_nearest <- function(query, examples, k, target_column_index) {
    X = examples[,-target_column_index]
    y = examples[,target_column_index]
    # calculate squared Euclidean distances between the query and each training example
    dists = apply(X, 1, function(x) { sum((x-query)^2) })
    # exact matches (distance 0) would get infinite weight, so handle them separately
    if (any(dists==0)) {
        ind = which(dists==0)
        res = get_mode(y[ind])
        return(res)
    }
    else {
        # weights are the inverses of the squared distances
        proximities = 1/dists
        # sort the distances and get the indices of this sorting
        sorted_dists = sort(dists, index.return = T)$ix
        # choose indices of k nearest
        k_nearest = sorted_dists[1:k]
        candidates = y[k_nearest]
        weights = proximities[k_nearest]
        
        # choose the label with the largest total weight
        result = get_weighted(candidates,weights)
        return(result)
    }
}

To check the algorithm we’ll take exactly the same dataset and the same options:

run_experiment(iris, 5, 5, 0.7, get_weighted_k_nearest,identity)
## [1] 0.9555556

Build accuracy tables

To keep the results we need to create a multidimensional array (basically a tensor). First we’ll prepare the properties of each dimension: the names, the possible values (that will be passed as function arguments) and the sizes.

prop_params = list(n = 9, 
                   names = c("10","20","30","40","50","60","70","80","90"), 
                   vals = seq(0.1,0.9,0.1))
k_params = list(n = 4, 
                names = c("k1","k3","k5","k7"), 
                vals = seq(1,7,2))
norm_params = list(n = 3, 
                   names = c("none","normalization","standardization"), 
                   vals = list(identity,normalize,standardize))
knn_params = list(n=2, 
                  names= c("classical","weighted"), 
                  vals = list(get_k_nearest, get_weighted_k_nearest))

Now we can build a function that will work with any custom properties, provided they are declared in the same way: named lists with the fields n, names and vals.
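
For example (a purely hypothetical extension, not used in the experiments below), an extra value of k could be declared like this:

k_params_ext = list(n = 5, 
                    names = c("k1","k3","k5","k7","k9"), 
                    vals = seq(1,9,2))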

# runs the experiment for every combination of train/test proportion, k,
# normalization type and kNN variant, and stores the accuracies in a 4-dimensional array
get_accuracy_tables <- function(dataset, 
                                target_column_index, 
                                prop_params, 
                                k_params, 
                                norm_params,
                                knn_params,
                                n_times=1) {
    # create tensor to keep results
    resultsTensor <- array(rep(0, prop_params$n*k_params$n*norm_params$n*knn_params$n),
                           dim=c(prop_params$n,
                                 k_params$n,
                                 norm_params$n,
                                 knn_params$n),
                           dimnames = list(prop_params$names,
                                           k_params$names,
                                           norm_params$names,
                                           knn_params$names))
    
    for (n in 1:prop_params$n) {
        train_ratio = prop_params$vals[n]
        for (o in 1:k_params$n) {
            k = k_params$vals[o]
            for (p in 1:norm_params$n) {
                normz = norm_params$vals[[p]]
                for (q in 1:knn_params$n) {
                    knn = knn_params$vals[[q]]
                    #print(paste("Proportion:",n,", neib:",o,"norm type", p, "knn type", q))
                    z = run_experiment(dataset, 
                                       target_column_index, 
                                       k, 
                                       train_ratio, 
                                       knn,
                                       normz,
                                       n_times)
                    #print(z)
                    resultsTensor[n,o,p,q] = z
                    #print(resultsTensor[n,o,p,q])
                }
            }
        }
    }
    resultsTensor
}

Results

First we’ll build the full accuracy tables for all three datasets.

irisResults = get_accuracy_tables(iris,5,prop_params,k_params,norm_params,knn_params,10)
wineResults = get_accuracy_tables(wine,
                                  1,
                                  prop_params,
                                  k_params,
                                  norm_params,
                                  knn_params,
                                  1)
winequalityResults = get_accuracy_tables(winequality,
                                ncol(winequality),
                                prop_params,
                                k_params,
                                norm_params,
                                knn_params,1)

Visualization

The graphs are actually the most complicated part of this work. Now that we have the accuracy results for all possible combinations, we can plot them.

Iris dataset

Winequality dataset

Results

Now let’s plot accuracy against the training proportion and compare it across normalization types. For these plots we’ll take a fixed value of k (k=3) and classical kNN.

irisC3 = irisResults[,"k3",,"classical"]
wineC3 = winequalityResults[,"k3",,"classical"]
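
A minimal sketch of how these accuracy curves could be plotted with base R graphics:

# accuracy vs. training proportion for each normalization type (classical kNN, k=3)
matplot(seq(10, 90, 10), irisC3, type = "b", pch = 1:3, col = 1:3, lty = 1:3,
        xlab = "training set proportion (%)", ylab = "accuracy",
        main = "iris, classical kNN, k=3")
legend("bottomright", legend = colnames(irisC3), pch = 1:3, col = 1:3, lty = 1:3)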

Accuracy tables

In the tables below, rows correspond to the training set proportion (in %) and columns to the type of rescaling (none, normalization, standardization).

Iris

classical kNN, k=1
none normalization standardization
10 0.9296 0.8956 0.9052
20 0.9492 0.9217 0.9433
30 0.9562 0.9286 0.9238
40 0.9433 0.9278 0.9400
50 0.9533 0.9373 0.9387
60 0.9683 0.9367 0.9400
70 0.9578 0.9578 0.9422
80 0.9567 0.9400 0.9300
90 0.9667 0.9200 0.9533
weighted kNN, k=1
none normalization standardization
10 0.9274 0.8726 0.9111
20 0.9383 0.9225 0.9192
30 0.9524 0.9400 0.9410
40 0.9567 0.9311 0.9300
50 0.9533 0.9400 0.9613
60 0.9683 0.9383 0.9517
70 0.9511 0.9533 0.9533
80 0.9600 0.9333 0.9533
90 0.9733 0.9000 0.9467
classical kNN, k=3
none normalization standardization
10 0.9156 0.8837 0.8815
20 0.9500 0.9375 0.9367
30 0.9590 0.9419 0.9229
40 0.9633 0.9489 0.9456
50 0.9587 0.9480 0.9400
60 0.9567 0.9433 0.9567
70 0.9667 0.9533 0.9378
80 0.9633 0.9600 0.9467
90 0.9667 0.8867 0.9600
weighted kNN, k=3
none normalization standardization
10 0.9393 0.8696 0.9200
20 0.9467 0.9442 0.9350
30 0.9629 0.9305 0.9333
40 0.9589 0.9500 0.9489
50 0.9573 0.9573 0.9547
60 0.9583 0.9533 0.9417
70 0.9644 0.9356 0.9467
80 0.9533 0.9300 0.9533
90 0.9867 0.9400 0.9733
classical kNN, k=5
none normalization standardization
10 0.9193 0.8422 0.8948
20 0.9450 0.9258 0.9192
30 0.9581 0.9181 0.9486
40 0.9567 0.9456 0.9522
50 0.9640 0.9520 0.9560
60 0.9617 0.9583 0.9683
70 0.9600 0.9644 0.9422
80 0.9767 0.9400 0.9667
90 0.9733 0.9267 0.9867
weighted kNN, k=5
none normalization standardization
10 0.9348 0.9022 0.9037
20 0.9483 0.9342 0.9408
30 0.9581 0.9476 0.9362
40 0.9633 0.9578 0.9511
50 0.9493 0.9387 0.9600
60 0.9700 0.9550 0.9517
70 0.9578 0.9578 0.9422
80 0.9567 0.9500 0.9600
90 0.9467 0.9400 0.9667
classical kNN, k=7
none normalization standardization
10 0.9200 0.8481 0.8481
20 0.9408 0.9250 0.8992
30 0.9638 0.9543 0.9286
40 0.9489 0.9511 0.9433
50 0.9680 0.9560 0.9507
60 0.9617 0.9533 0.9433
70 0.9622 0.9533 0.9689
80 0.9700 0.9567 0.9733
90 0.9667 0.9200 0.9600
weighted kNN, k=7
none normalization standardization
10 0.9333 0.8800 0.9030
20 0.9583 0.9225 0.9425
30 0.9514 0.9419 0.9486
40 0.9622 0.9533 0.9544
50 0.9653 0.9493 0.9560
60 0.9617 0.9517 0.9500
70 0.9467 0.9489 0.9600
80 0.9700 0.9333 0.9500
90 0.9733 0.9267 0.9733

Wine

classical kNN, k=1
none normalization standardization
10 0.6688 0.9000 0.9375
20 0.6901 0.9014 0.9225
30 0.7360 0.9200 0.9520
40 0.6822 0.9533 0.9252
50 0.6818 0.9773 0.9773
60 0.7887 0.9577 1.0000
70 0.6792 0.9057 0.9434
80 0.8611 0.9722 0.9722
90 0.7778 0.8889 0.8889
weighted kNN, k=1
none normalization standardization
10 0.6938 0.9438 0.9750
20 0.7183 0.9014 0.9296
30 0.6640 0.9200 0.9360
40 0.6636 0.9533 0.9065
50 0.6932 0.9545 0.9659
60 0.7042 0.9014 0.9437
70 0.7170 0.9245 0.9434
80 0.6944 0.9167 0.9444
90 0.7778 0.9444 0.8889
classical kNN, k=3
none normalization standardization
10 0.7312 0.9312 0.9375
20 0.7183 0.9225 0.9296
30 0.7040 0.9280 0.9600
40 0.6542 0.9159 0.9533
50 0.7273 0.9545 0.9432
60 0.7042 0.9718 0.9859
70 0.6792 1.0000 0.9245
80 0.7222 1.0000 0.9722
90 0.8333 1.0000 0.9444
weighted kNN, k=3
none normalization standardization
10 0.7062 0.9438 0.9375
20 0.6831 0.9577 0.9296
30 0.6400 0.9280 0.9680
40 0.7290 0.9159 0.9626
50 0.7386 0.9659 0.9545
60 0.7042 0.9155 0.9577
70 0.8113 0.9623 0.9811
80 0.8333 0.9722 1.0000
90 0.7778 0.8889 0.9444
classical kNN, k=5
none normalization standardization
10 0.7062 0.9188 0.9500
20 0.7254 0.9085 0.9225
30 0.7200 0.9760 0.9680
40 0.7290 0.9439 0.9813
50 0.6364 0.9659 0.9886
60 0.7042 0.9859 1.0000
70 0.6792 0.9434 0.9623
80 0.6389 0.9722 1.0000
90 0.5556 1.0000 1.0000
weighted kNN, k=5
none normalization standardization
10 0.6438 0.8875 0.9375
20 0.6901 0.8944 0.9437
30 0.6960 0.9440 0.9760
40 0.7477 0.9533 0.9533
50 0.7273 1.0000 0.9545
60 0.6338 0.9437 0.9577
70 0.7736 0.9811 0.9623
80 0.8333 0.8611 0.9444
90 0.8333 0.9444 0.9444
classical kNN, k=7
none normalization standardization
10 0.6688 0.9625 0.9562
20 0.6972 0.9507 0.9507
30 0.6640 0.9440 0.9440
40 0.6636 0.9159 0.9626
50 0.6818 0.9886 0.9432
60 0.6761 0.9437 0.9718
70 0.6792 0.9623 0.9811
80 0.6667 1.0000 0.9444
90 0.7222 0.9444 1.0000
weighted kNN, k=7
none normalization standardization
10 0.6562 0.9188 0.9562
20 0.6901 0.9507 0.9507
30 0.6960 0.9520 0.9760
40 0.6916 0.9720 0.9439
50 0.7500 0.9545 0.9545
60 0.7042 0.9718 1.0000
70 0.7170 0.9434 0.9623
80 0.7222 1.0000 1.0000
90 0.8889 0.9444 1.0000

Wine Quality

classical kNN, k=1
none normalization standardization
10 0.4302 0.4628 0.4941
20 0.4773 0.4859 0.5149
30 0.4839 0.4768 0.5455
40 0.4948 0.4469 0.5375
50 0.5575 0.5762 0.5875
60 0.5462 0.5399 0.5806
70 0.5407 0.5157 0.5908
80 0.5670 0.5981 0.6293
90 0.6062 0.5562 0.6812
weighted kNN, k=1
none normalization standardization
10 0.4267 0.4906 0.4997
20 0.4679 0.5282 0.5282
30 0.4830 0.4955 0.5723
40 0.5042 0.5323 0.5646
50 0.5238 0.5462 0.5588
60 0.5743 0.5509 0.6135
70 0.5741 0.5324 0.6033
80 0.5452 0.6044 0.6386
90 0.6125 0.4375 0.6000
classical kNN, k=3
none normalization standardization
10 0.4329 0.4990 0.5205
20 0.4444 0.4844 0.5297
30 0.4500 0.5062 0.5205
40 0.4708 0.5156 0.5375
50 0.4800 0.5062 0.5250
60 0.4742 0.5352 0.5477
70 0.4635 0.5136 0.5678
80 0.5514 0.5016 0.5234
90 0.5562 0.5188 0.5125
weighted kNN, k=3
none normalization standardization
10 0.4510 0.4705 0.4927
20 0.4789 0.5313 0.5086
30 0.4821 0.5170 0.5661
40 0.4958 0.5125 0.5802
50 0.5350 0.5025 0.5850
60 0.5524 0.6041 0.6260
70 0.5699 0.5616 0.6409
80 0.6449 0.5607 0.6355
90 0.6438 0.5125 0.6625
classical kNN, k=5
none normalization standardization
10 0.4580 0.5288 0.5170
20 0.4937 0.5196 0.5180
30 0.4857 0.5571 0.5429
40 0.4688 0.5646 0.5438
50 0.4888 0.5387 0.5562
60 0.4883 0.5399 0.5524
70 0.5282 0.5386 0.5449
80 0.4984 0.5701 0.5483
90 0.4875 0.5500 0.6125
weighted kNN, k=5
none normalization standardization
10 0.4475 0.4691 0.5274
20 0.4679 0.5305 0.5665
30 0.5330 0.4875 0.5714
40 0.5521 0.5729 0.6281
50 0.5562 0.5750 0.6038
60 0.5822 0.5665 0.6244
70 0.5616 0.5929 0.6493
80 0.6012 0.6137 0.6604
90 0.6500 0.5750 0.6875
classical kNN, k=7
none normalization standardization
10 0.4809 0.5024 0.5198
20 0.4648 0.5321 0.5563
30 0.4946 0.5062 0.5786
40 0.4833 0.5333 0.5427
50 0.4938 0.5362 0.5800
60 0.5055 0.5399 0.5790
70 0.4990 0.5595 0.5887
80 0.5389 0.5358 0.5732
90 0.5625 0.6000 0.6625
weighted kNN, k=7
none normalization standardization
10 0.4566 0.5087 0.5525
20 0.5000 0.5086 0.5485
30 0.5357 0.5598 0.5795
40 0.5375 0.5427 0.6135
50 0.5512 0.5562 0.6000
60 0.5603 0.5775 0.5962
70 0.5678 0.5950 0.6889
80 0.5919 0.6262 0.6511
90 0.6000 0.6250 0.6562

Since the accuracy on the winequality dataset is quite low, it would be interesting to compare this performance to a baseline: what accuracy would we get by always predicting the most frequent class?

wqbaseline = get_mode(winequality[,ncol(winequality)])
sum(winequality[,ncol(winequality)]==wqbaseline)/nrow(winequality)
## [1] 0.4258912
winebaseline = get_mode(wine[,1])
sum(wine[,1]==winebaseline)/nrow(wine)
## [1] 0.3988764
irisbaseline = get_mode(iris[,ncol(iris)])
sum(iris[,ncol(iris)]==irisbaseline)/nrow(iris)
## [1] 0.3333333

So we have a significant improvement over the baseline for the iris and wine datasets, and a more modest improvement for the winequality dataset.

Sanity check

It would also be interesting to compare our implementation to the knn function from R’s class package on the winequality dataset.

library(class)
target_column_index = ncol(winequality)
target_column = winequality[,ncol(winequality)]
split = sample.split(target_column, SplitRatio = 0.7)
train = subset(winequality, split==T)
trainX = train[,-target_column_index]
trainy = train[,target_column_index]
test = subset(winequality, split==F)
testX = test[,-target_column_index]
testy = test[,target_column_index]
res = knn(trainX, testX, trainy, k = 3)
sum(testy==res)/length(res)
## [1] 0.5135699

Summary

We tried two simple versions of the kNN algorithm: classical kNN and distance-weighted kNN. Both algorithms perform quite well on the classical small datasets, iris and wine. But on a bigger dataset with a larger number of classes the performance is only slightly better than the baseline.

There is also the curse of dimensionality to consider. It is quite possible that in the classical datasets (iris and wine) most of the attributes have some relevance, while this may not be the case for the winequality dataset. That could be why we see only a roughly 10% improvement there.

As for the number of neighbours to choose, k=5 or k=7 looks like the optimal choice because it shows the most stable growth in accuracy. The type of normalization also has some impact on the results, but this impact varies depending on the dataset. Standardization (vs. normalization) looks like the better choice (moreover, it is easier to think about the data in terms of numbers of standard deviations).