The goal of this work is to experiment with the classical kNN algorithm and distance-weighted kNN and compare their accuracy. The idea of the kNN algorithm is to measure the distance between a given test example and all examples in the dataset, choose the k closest examples, and then predict the class (or value) based on these nearest neighbours. There are several modifications of this algorithm, for example distance-weighted kNN and attribute-weighted kNN. In this work we only experiment with classical kNN and distance-weighted kNN.
The standard first step is loading the datasets; iris is built into R, so only wine and winequality are read from files.
wine <- read.csv("~/Studies/Year3/DataMining/wine.data", header=FALSE)
winequality = read.csv("~/Studies/Year3/Machine Learning/winequality-red.csv",sep=";")
All the datasets are discrete-valued: iris has 3 class labels, wine has 3, and winequality has 10.
# min-max normalization: rescale every column to the [0, 1] range
normalize <- function(M) {
  mins = t(apply(M, 2, min))
  maxs = t(apply(M, 2, max))
  ranges = maxs - mins
  # rescale each row: (x - min) / (max - min), column-wise
  Xnorm = t(apply(M, 1, function(x) { (x - mins) / ranges }))
  Xnorm
}
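As a quick sanity check (a throwaway example, not used in the experiments below), every column of a normalized matrix should end up spanning the [0, 1] range:
# tiny made-up matrix: two columns on very different scales
M = matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
# after normalization both columns range from 0 to 1
apply(normalize(M), 2, range)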
# standardization (z-scores): subtract the column mean, divide by the column standard deviation
standardize <- function(M) {
  means = t(apply(M, 2, mean))
  sds = t(apply(M, 2, sd))
  Xnorm = t(apply(M, 1, function(x) { (x - means) / sds }))
  Xnorm
}
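Similarly (a small optional check on the built-in iris features), standardized columns should have mean close to 0 and standard deviation 1:
Z = standardize(as.matrix(iris[, 1:4]))
# means are ~0 (up to floating-point noise) and standard deviations are 1
round(apply(Z, 2, mean), 10)
apply(Z, 2, sd)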
Training in instance-based methods means just storing the examples. Each time we get a new query we compare it to the stored examples to decide the value of the target function for this new example. The advantage of this approach is that we can estimate the target function locally (which is good when the target function is complex); we never approximate over the entire instance space, so we are not trying to find a single target function that fits all examples. The downside is that the computational cost of answering a query can be quite high. In addition, these algorithms consider all available attributes.
Each example in a dataset is a point in n-dimensional space, where the number of dimensions equals the number of features. The target value is the label attached to this point. Nearest neighbours are defined in terms of the standard Euclidean distance. Since training is just storing in these algorithms, the first stage of the algorithm has already been performed at the beginning of this document.
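For reference, the Euclidean distance between two feature vectors can be computed in one line of R (a small illustration with two made-up 4-dimensional points; the helper euclidean exists only for this snippet):
# distance between two points in n-dimensional feature space
euclidean = function(a, b) { sqrt(sum((a - b)^2)) }
euclidean(c(5.1, 3.5, 1.4, 0.2), c(4.9, 3.0, 1.4, 0.2))  # ~0.539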
We’ll start with a helper function that simply gets the most frequent value in a vector. This function will be useful for implementing kNN for discrete-valued target functions.
# calculates the mode of a vector
get_mode <- function(v) {
  # build a frequency table and sort it in decreasing order
  freqs = sort(table(v), decreasing = TRUE)
  # the first entry now has the highest frequency; return its name
  mode = names(freqs)[1]
  return(mode)
}
Example:
a = c(1,1,2,2)
get_mode(a)
## [1] "1"
b = c("cat", "dog", "dog", "ant")
get_mode(b)
## [1] "dog"
The algorithm itself is so simple that we don’t need any other helper functions. The idea is to take the query, compare it to each training example by computing the Euclidean distance, choose the k closest examples and then pick the most frequent label among them.
# returns the predicted label based on the k nearest neighbours
# examples - matrix/data frame of observations, target column included
# query - vector, the example to be classified
# k - number of neighbours to be used
# target_column_index - the index of the column with the labels
get_k_nearest <- function(query, examples, k, target_column_index) {
  X = examples[, -target_column_index]
  y = examples[, target_column_index]
  # calculate the distance between the query and each example
  dists = apply(X, 1, function(x) { sqrt(sum((x - query)^2)) })
  # sort the distances and keep the ordering of the indices
  sorted_dists = sort(dists, index.return = TRUE)$ix
  # choose the indices of the k nearest
  k_nearest = sorted_dists[1:k]
  candidates = y[k_nearest]
  # return the most frequent label among the k nearest
  result = get_mode(candidates)
  return(result)
}
Let’s take the iris dataset and split it into train and test sets in a 70/30 proportion:
library(caTools)
set.seed(3711)
split = sample.split(iris$Species, SplitRatio = 0.7)
train = subset(iris, split==T)
test = subset(iris, split==F)
testX = test[,1:4]
testy = test[,5]
tst = testX[1,]
Now we can apply our kNN algorithm. We’ll use k=5, run the algorithm for each test example, and then calculate accuracy as the proportion of correct predictions.
predictions = apply(testX, 1, get_k_nearest, train, 5, 5)
dif = predictions == testy
sum(dif)/length(predictions)
## [1] 0.9555556
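Beyond a single accuracy number, a confusion table (an optional quick check, not part of the original experiments) shows which species get mixed up:
table(predicted = predictions, actual = testy)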
The next step is to pack it all into one function that allows us to run experiments with an arbitrary number of neighbours and an arbitrary train/test ratio.
# dataset - original dataset
# target_column_index - the index of the column with the target function values
# k - number of neighbours
# train_ratio - the proportion of training examples
# knnf - the kNN variant to be applied
# normz - the rescaling function applied to the features (identity, normalize or standardize)
# n_times - how many times to repeat the experiment (accuracy is averaged)
library(caTools)
run_experiment <- function(dataset, target_column_index, k, train_ratio, knnf, normz, n_times=1) {
  accuracy = 0
  for (i in 1:n_times) {
    # split the dataset
    target_column = dataset[, target_column_index]
    split = sample.split(target_column, SplitRatio = train_ratio)
    train = subset(dataset, split == TRUE)
    trainX = train[, -target_column_index]
    trainy = train[, target_column_index]
    test = subset(dataset, split == FALSE)
    testX = test[, -target_column_index]
    testy = test[, target_column_index]
    # rescale the features (identity means no rescaling)
    trainX = data.frame(normz(trainX))
    testX = data.frame(normz(testX))
    train = cbind(trainX, trainy)
    # classify every test example and accumulate the accuracy
    predictions = apply(testX, 1, knnf, train, k, ncol(train))
    dif = predictions == testy
    accuracy = accuracy + sum(dif) / length(predictions)
  }
  accuracy = accuracy / n_times
  accuracy
}
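For example (the exact number will depend on the random split), a single run with classical kNN, no rescaling and the same 70/30 split as above can be launched like this:
run_experiment(iris, 5, 5, 0.7, get_k_nearest, identity)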
The idea of distance-weighted kNN is to weight the contribution of each of the k neighbours according to its distance from the query, so the closer the neighbour, the more important its vote. The implementation is very simple and very similar to classical kNN.
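Concretely, with the inverse-square weighting used in the implementation below, a single very close neighbour can outvote two more distant ones (a toy illustration with made-up distances):
lbls = c("a", "b", "b")
d = c(0.5, 2, 2)           # made-up Euclidean distances to the query
w = 1 / d^2                # inverse-square weighting
tapply(w, lbls, sum)       # "a" gets 4, "b" gets 0.5, so "a" wins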
# sums the weights per label and returns the label with the highest total weight
get_weighted <- function(labels, weights) {
  dset = data.frame(labels = factor(labels),
                    weights = weights)
  # total weight for each label
  scores = aggregate(. ~ labels, dset, sum)
  result = as.character(scores$labels[which.max(scores$weights)])
  return(result)
}
s = as.numeric(c(0.1,0.2,0.1,0.4))
l = c("d","c","c","m")
get_weighted(l,s)
## [1] "m"
get_weighted_k_nearest <- function(query, examples, k, target_column_index) {
  X = examples[, -target_column_index]
  y = examples[, target_column_index]
  # squared Euclidean distances between the query and each example
  dists = apply(X, 1, function(x) { sum((x - query)^2) })
  if (any(dists == 0)) {
    # the query coincides with one or more stored examples: return their mode
    ind = which(dists == 0)
    res = get_mode(y[ind])
    return(res)
  }
  else {
    # inverse-square weighting: weight = 1 / d^2
    proximities = 1 / dists
    # sort the distances and keep the ordering of the indices
    sorted_dists = sort(dists, index.return = TRUE)$ix
    # choose the indices of the k nearest
    k_nearest = sorted_dists[1:k]
    candidates = y[k_nearest]
    weights = proximities[k_nearest]
    # the label with the highest total weight wins
    result = get_weighted(candidates, weights)
    return(result)
  }
}
To check the algorithm we’ll run it on the same dataset with the same options (k=5, 70/30 split, no rescaling):
run_experiment(iris, 5, 5, 0.7, get_weighted_k_nearest,identity)
## [1] 0.9555556
First we need to create a multidimensional array (basically a tensor) to keep the results. We’ll prepare the properties for each dimension: names, possible values (that will be passed as function arguments) and sizes.
prop_params = list(n = 9,
                   names = c("10","20","30","40","50","60","70","80","90"),
                   vals = seq(0.1, 0.9, 0.1))
k_params = list(n = 4,
                names = c("k1","k3","k5","k7"),
                vals = seq(1, 7, 2))
norm_params = list(n = 3,
                   names = c("none","normalization","standardization"),
                   vals = list(identity, normalize, standardize))
knn_params = list(n = 2,
                  names = c("classical","weighted"),
                  vals = list(get_k_nearest, get_weighted_k_nearest))
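To see how these names will be used for indexing (a throwaway illustration; demo is not used anywhere else), an array built from the parameter lists above can be sliced by dimension names:
# a named 4-dimensional array filled with zeros, indexed by name
demo = array(0,
             dim = c(prop_params$n, k_params$n, norm_params$n, knn_params$n),
             dimnames = list(prop_params$names, k_params$names,
                             norm_params$names, knn_params$names))
demo["70", "k3", "none", "classical"]  # one cell, selected entirely by name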
Now we can build a function that will work with any custom properties, provided they have been declared properly (that is, they are named lists with the standard fields).
# runs the full grid of experiments and stores the accuracies in a 4-dimensional array
get_accuracy_tables <- function(dataset,
                                target_column_index,
                                prop_params,
                                k_params,
                                norm_params,
                                knn_params,
                                n_times=1) {
  # create a tensor to keep the results
  resultsTensor <- array(rep(0, prop_params$n * k_params$n * norm_params$n * knn_params$n),
                         dim = c(prop_params$n,
                                 k_params$n,
                                 norm_params$n,
                                 knn_params$n),
                         dimnames = list(prop_params$names,
                                         k_params$names,
                                         norm_params$names,
                                         knn_params$names))
  # for every proportion of train and test data
  for (n in 1:prop_params$n) {
    train_ratio = prop_params$vals[n]
    # for every number of neighbours
    for (o in 1:k_params$n) {
      k = k_params$vals[o]
      # for every rescaling option
      for (p in 1:norm_params$n) {
        normz = norm_params$vals[[p]]
        # for every kNN variant
        for (q in 1:knn_params$n) {
          knn = knn_params$vals[[q]]
          z = run_experiment(dataset,
                             target_column_index,
                             k,
                             train_ratio,
                             knn,
                             normz,
                             n_times)
          resultsTensor[n, o, p, q] = z
        }
      }
    }
  }
  resultsTensor
}
First we’ll compute the full accuracy tables for all three datasets.
irisResults = get_accuracy_tables(iris,5,prop_params,k_params,norm_params,knn_params,10)
wineResults = get_accuracy_tables(wine,
1,
prop_params,
k_params,
norm_params,
knn_params,
1)
winequalityResults = get_accuracy_tables(winequality,
                                         ncol(winequality),
                                         prop_params,
                                         k_params,
                                         norm_params,
                                         knn_params,
                                         1)
The graphs are actually the most complicated part of this work. Now that we have accuracy results for all possible combinations, we can plot them.
Let’s plot accuracy against the training proportion and compare accuracy by normalization type. For these plots we’ll take a fixed value of k (k=3) and classical kNN. The tables below list the accuracies for every combination: rows are the training-set percentages, columns are the normalization types, and the tables are grouped by dataset (iris first, then wine, then winequality).
irisC3 = irisResults[,"k3",,"classical"]
wineC3 = winequalityResults[,"k3",,"classical"]
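One simple way to draw such a plot (a sketch using base R’s matplot; the original figures may have been produced differently) is:
# accuracy vs. training proportion for each normalization type (iris, k = 3, classical kNN)
matplot(prop_params$vals, irisC3, type = "b", pch = 1:3, lty = 1,
        xlab = "training proportion", ylab = "accuracy")
legend("bottomright", legend = colnames(irisC3), pch = 1:3, lty = 1, col = 1:3)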
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.9296 | 0.8956 | 0.9052 |
| 20 | 0.9492 | 0.9217 | 0.9433 |
| 30 | 0.9562 | 0.9286 | 0.9238 |
| 40 | 0.9433 | 0.9278 | 0.9400 |
| 50 | 0.9533 | 0.9373 | 0.9387 |
| 60 | 0.9683 | 0.9367 | 0.9400 |
| 70 | 0.9578 | 0.9578 | 0.9422 |
| 80 | 0.9567 | 0.9400 | 0.9300 |
| 90 | 0.9667 | 0.9200 | 0.9533 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.9274 | 0.8726 | 0.9111 |
| 20 | 0.9383 | 0.9225 | 0.9192 |
| 30 | 0.9524 | 0.9400 | 0.9410 |
| 40 | 0.9567 | 0.9311 | 0.9300 |
| 50 | 0.9533 | 0.9400 | 0.9613 |
| 60 | 0.9683 | 0.9383 | 0.9517 |
| 70 | 0.9511 | 0.9533 | 0.9533 |
| 80 | 0.9600 | 0.9333 | 0.9533 |
| 90 | 0.9733 | 0.9000 | 0.9467 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.9156 | 0.8837 | 0.8815 |
| 20 | 0.9500 | 0.9375 | 0.9367 |
| 30 | 0.9590 | 0.9419 | 0.9229 |
| 40 | 0.9633 | 0.9489 | 0.9456 |
| 50 | 0.9587 | 0.9480 | 0.9400 |
| 60 | 0.9567 | 0.9433 | 0.9567 |
| 70 | 0.9667 | 0.9533 | 0.9378 |
| 80 | 0.9633 | 0.9600 | 0.9467 |
| 90 | 0.9667 | 0.8867 | 0.9600 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.9393 | 0.8696 | 0.9200 |
| 20 | 0.9467 | 0.9442 | 0.9350 |
| 30 | 0.9629 | 0.9305 | 0.9333 |
| 40 | 0.9589 | 0.9500 | 0.9489 |
| 50 | 0.9573 | 0.9573 | 0.9547 |
| 60 | 0.9583 | 0.9533 | 0.9417 |
| 70 | 0.9644 | 0.9356 | 0.9467 |
| 80 | 0.9533 | 0.9300 | 0.9533 |
| 90 | 0.9867 | 0.9400 | 0.9733 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.9193 | 0.8422 | 0.8948 |
| 20 | 0.9450 | 0.9258 | 0.9192 |
| 30 | 0.9581 | 0.9181 | 0.9486 |
| 40 | 0.9567 | 0.9456 | 0.9522 |
| 50 | 0.9640 | 0.9520 | 0.9560 |
| 60 | 0.9617 | 0.9583 | 0.9683 |
| 70 | 0.9600 | 0.9644 | 0.9422 |
| 80 | 0.9767 | 0.9400 | 0.9667 |
| 90 | 0.9733 | 0.9267 | 0.9867 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.9348 | 0.9022 | 0.9037 |
| 20 | 0.9483 | 0.9342 | 0.9408 |
| 30 | 0.9581 | 0.9476 | 0.9362 |
| 40 | 0.9633 | 0.9578 | 0.9511 |
| 50 | 0.9493 | 0.9387 | 0.9600 |
| 60 | 0.9700 | 0.9550 | 0.9517 |
| 70 | 0.9578 | 0.9578 | 0.9422 |
| 80 | 0.9567 | 0.9500 | 0.9600 |
| 90 | 0.9467 | 0.9400 | 0.9667 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.9200 | 0.8481 | 0.8481 |
| 20 | 0.9408 | 0.9250 | 0.8992 |
| 30 | 0.9638 | 0.9543 | 0.9286 |
| 40 | 0.9489 | 0.9511 | 0.9433 |
| 50 | 0.9680 | 0.9560 | 0.9507 |
| 60 | 0.9617 | 0.9533 | 0.9433 |
| 70 | 0.9622 | 0.9533 | 0.9689 |
| 80 | 0.9700 | 0.9567 | 0.9733 |
| 90 | 0.9667 | 0.9200 | 0.9600 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.9333 | 0.8800 | 0.9030 |
| 20 | 0.9583 | 0.9225 | 0.9425 |
| 30 | 0.9514 | 0.9419 | 0.9486 |
| 40 | 0.9622 | 0.9533 | 0.9544 |
| 50 | 0.9653 | 0.9493 | 0.9560 |
| 60 | 0.9617 | 0.9517 | 0.9500 |
| 70 | 0.9467 | 0.9489 | 0.9600 |
| 80 | 0.9700 | 0.9333 | 0.9500 |
| 90 | 0.9733 | 0.9267 | 0.9733 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.6688 | 0.9000 | 0.9375 |
| 20 | 0.6901 | 0.9014 | 0.9225 |
| 30 | 0.7360 | 0.9200 | 0.9520 |
| 40 | 0.6822 | 0.9533 | 0.9252 |
| 50 | 0.6818 | 0.9773 | 0.9773 |
| 60 | 0.7887 | 0.9577 | 1.0000 |
| 70 | 0.6792 | 0.9057 | 0.9434 |
| 80 | 0.8611 | 0.9722 | 0.9722 |
| 90 | 0.7778 | 0.8889 | 0.8889 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.6938 | 0.9438 | 0.9750 |
| 20 | 0.7183 | 0.9014 | 0.9296 |
| 30 | 0.6640 | 0.9200 | 0.9360 |
| 40 | 0.6636 | 0.9533 | 0.9065 |
| 50 | 0.6932 | 0.9545 | 0.9659 |
| 60 | 0.7042 | 0.9014 | 0.9437 |
| 70 | 0.7170 | 0.9245 | 0.9434 |
| 80 | 0.6944 | 0.9167 | 0.9444 |
| 90 | 0.7778 | 0.9444 | 0.8889 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.7312 | 0.9312 | 0.9375 |
| 20 | 0.7183 | 0.9225 | 0.9296 |
| 30 | 0.7040 | 0.9280 | 0.9600 |
| 40 | 0.6542 | 0.9159 | 0.9533 |
| 50 | 0.7273 | 0.9545 | 0.9432 |
| 60 | 0.7042 | 0.9718 | 0.9859 |
| 70 | 0.6792 | 1.0000 | 0.9245 |
| 80 | 0.7222 | 1.0000 | 0.9722 |
| 90 | 0.8333 | 1.0000 | 0.9444 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.7062 | 0.9438 | 0.9375 |
| 20 | 0.6831 | 0.9577 | 0.9296 |
| 30 | 0.6400 | 0.9280 | 0.9680 |
| 40 | 0.7290 | 0.9159 | 0.9626 |
| 50 | 0.7386 | 0.9659 | 0.9545 |
| 60 | 0.7042 | 0.9155 | 0.9577 |
| 70 | 0.8113 | 0.9623 | 0.9811 |
| 80 | 0.8333 | 0.9722 | 1.0000 |
| 90 | 0.7778 | 0.8889 | 0.9444 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.7062 | 0.9188 | 0.9500 |
| 20 | 0.7254 | 0.9085 | 0.9225 |
| 30 | 0.7200 | 0.9760 | 0.9680 |
| 40 | 0.7290 | 0.9439 | 0.9813 |
| 50 | 0.6364 | 0.9659 | 0.9886 |
| 60 | 0.7042 | 0.9859 | 1.0000 |
| 70 | 0.6792 | 0.9434 | 0.9623 |
| 80 | 0.6389 | 0.9722 | 1.0000 |
| 90 | 0.5556 | 1.0000 | 1.0000 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.6438 | 0.8875 | 0.9375 |
| 20 | 0.6901 | 0.8944 | 0.9437 |
| 30 | 0.6960 | 0.9440 | 0.9760 |
| 40 | 0.7477 | 0.9533 | 0.9533 |
| 50 | 0.7273 | 1.0000 | 0.9545 |
| 60 | 0.6338 | 0.9437 | 0.9577 |
| 70 | 0.7736 | 0.9811 | 0.9623 |
| 80 | 0.8333 | 0.8611 | 0.9444 |
| 90 | 0.8333 | 0.9444 | 0.9444 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.6688 | 0.9625 | 0.9562 |
| 20 | 0.6972 | 0.9507 | 0.9507 |
| 30 | 0.6640 | 0.9440 | 0.9440 |
| 40 | 0.6636 | 0.9159 | 0.9626 |
| 50 | 0.6818 | 0.9886 | 0.9432 |
| 60 | 0.6761 | 0.9437 | 0.9718 |
| 70 | 0.6792 | 0.9623 | 0.9811 |
| 80 | 0.6667 | 1.0000 | 0.9444 |
| 90 | 0.7222 | 0.9444 | 1.0000 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.6562 | 0.9188 | 0.9562 |
| 20 | 0.6901 | 0.9507 | 0.9507 |
| 30 | 0.6960 | 0.9520 | 0.9760 |
| 40 | 0.6916 | 0.9720 | 0.9439 |
| 50 | 0.7500 | 0.9545 | 0.9545 |
| 60 | 0.7042 | 0.9718 | 1.0000 |
| 70 | 0.7170 | 0.9434 | 0.9623 |
| 80 | 0.7222 | 1.0000 | 1.0000 |
| 90 | 0.8889 | 0.9444 | 1.0000 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.4302 | 0.4628 | 0.4941 |
| 20 | 0.4773 | 0.4859 | 0.5149 |
| 30 | 0.4839 | 0.4768 | 0.5455 |
| 40 | 0.4948 | 0.4469 | 0.5375 |
| 50 | 0.5575 | 0.5762 | 0.5875 |
| 60 | 0.5462 | 0.5399 | 0.5806 |
| 70 | 0.5407 | 0.5157 | 0.5908 |
| 80 | 0.5670 | 0.5981 | 0.6293 |
| 90 | 0.6062 | 0.5562 | 0.6812 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.4267 | 0.4906 | 0.4997 |
| 20 | 0.4679 | 0.5282 | 0.5282 |
| 30 | 0.4830 | 0.4955 | 0.5723 |
| 40 | 0.5042 | 0.5323 | 0.5646 |
| 50 | 0.5238 | 0.5462 | 0.5588 |
| 60 | 0.5743 | 0.5509 | 0.6135 |
| 70 | 0.5741 | 0.5324 | 0.6033 |
| 80 | 0.5452 | 0.6044 | 0.6386 |
| 90 | 0.6125 | 0.4375 | 0.6000 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.4329 | 0.4990 | 0.5205 |
| 20 | 0.4444 | 0.4844 | 0.5297 |
| 30 | 0.4500 | 0.5062 | 0.5205 |
| 40 | 0.4708 | 0.5156 | 0.5375 |
| 50 | 0.4800 | 0.5062 | 0.5250 |
| 60 | 0.4742 | 0.5352 | 0.5477 |
| 70 | 0.4635 | 0.5136 | 0.5678 |
| 80 | 0.5514 | 0.5016 | 0.5234 |
| 90 | 0.5562 | 0.5188 | 0.5125 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.4510 | 0.4705 | 0.4927 |
| 20 | 0.4789 | 0.5313 | 0.5086 |
| 30 | 0.4821 | 0.5170 | 0.5661 |
| 40 | 0.4958 | 0.5125 | 0.5802 |
| 50 | 0.5350 | 0.5025 | 0.5850 |
| 60 | 0.5524 | 0.6041 | 0.6260 |
| 70 | 0.5699 | 0.5616 | 0.6409 |
| 80 | 0.6449 | 0.5607 | 0.6355 |
| 90 | 0.6438 | 0.5125 | 0.6625 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.4580 | 0.5288 | 0.5170 |
| 20 | 0.4937 | 0.5196 | 0.5180 |
| 30 | 0.4857 | 0.5571 | 0.5429 |
| 40 | 0.4688 | 0.5646 | 0.5438 |
| 50 | 0.4888 | 0.5387 | 0.5562 |
| 60 | 0.4883 | 0.5399 | 0.5524 |
| 70 | 0.5282 | 0.5386 | 0.5449 |
| 80 | 0.4984 | 0.5701 | 0.5483 |
| 90 | 0.4875 | 0.5500 | 0.6125 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.4475 | 0.4691 | 0.5274 |
| 20 | 0.4679 | 0.5305 | 0.5665 |
| 30 | 0.5330 | 0.4875 | 0.5714 |
| 40 | 0.5521 | 0.5729 | 0.6281 |
| 50 | 0.5562 | 0.5750 | 0.6038 |
| 60 | 0.5822 | 0.5665 | 0.6244 |
| 70 | 0.5616 | 0.5929 | 0.6493 |
| 80 | 0.6012 | 0.6137 | 0.6604 |
| 90 | 0.6500 | 0.5750 | 0.6875 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.4809 | 0.5024 | 0.5198 |
| 20 | 0.4648 | 0.5321 | 0.5563 |
| 30 | 0.4946 | 0.5062 | 0.5786 |
| 40 | 0.4833 | 0.5333 | 0.5427 |
| 50 | 0.4938 | 0.5362 | 0.5800 |
| 60 | 0.5055 | 0.5399 | 0.5790 |
| 70 | 0.4990 | 0.5595 | 0.5887 |
| 80 | 0.5389 | 0.5358 | 0.5732 |
| 90 | 0.5625 | 0.6000 | 0.6625 |
| train % | none | normalization | standardization |
|---|---|---|---|
| 10 | 0.4566 | 0.5087 | 0.5525 |
| 20 | 0.5000 | 0.5086 | 0.5485 |
| 30 | 0.5357 | 0.5598 | 0.5795 |
| 40 | 0.5375 | 0.5427 | 0.6135 |
| 50 | 0.5512 | 0.5562 | 0.6000 |
| 60 | 0.5603 | 0.5775 | 0.5962 |
| 70 | 0.5678 | 0.5950 | 0.6889 |
| 80 | 0.5919 | 0.6262 | 0.6511 |
| 90 | 0.6000 | 0.6250 | 0.6562 |
Since the accuracy on winequality is quite low, it would be interesting to compare this performance to a baseline: what if we always predict the most frequent class?
wqbaseline = get_mode(winequality[,ncol(winequality)])
sum(winequality[,ncol(winequality)]==wqbaseline)/nrow(winequality)
## [1] 0.4258912
winebaseline = get_mode(wine[,1])
sum(wine[,1]==winebaseline)/nrow(wine)
## [1] 0.3988764
irisbaseline = get_mode(iris[,ncol(iris)])
sum(iris[,ncol(iris)]==irisbaseline)/nrow(iris)
## [1] 0.3333333
So we have a significant improvement over the baseline for the iris and wine datasets and some improvement for the winequality dataset; even on winequality the kNN results are still better than always predicting the most frequent class.
It would also be interesting to compare our implementation to the standard knn function from R’s class package on the winequality dataset.
library(class)
target_column_index = ncol(winequality)
target_column = winequality[,ncol(winequality)]
split = sample.split(target_column, SplitRatio = 0.7)
train = subset(winequality, split==T)
trainX = train[,-target_column_index]
trainy = train[,target_column_index]
test = subset(winequality, split==F)
testX = test[,-target_column_index]
testy = test[,target_column_index]
res = knn(trainX, testX, trainy, k = 3)
sum(testy==res)/length(res)
## [1] 0.5135699
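As an extra optional check (not part of the original comparison), the same call can be repeated on standardized features; based on the tables above we would expect the result to land closer to the standardization column:
# standardize the training features and apply the same centering/scaling to the test set
trainXs = scale(trainX)
testXs = scale(testX, center = attr(trainXs, "scaled:center"),
               scale = attr(trainXs, "scaled:scale"))
res_std = knn(trainXs, testXs, trainy, k = 3)
sum(testy == res_std) / length(res_std)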
We tried two simple versions of the kNN algorithm: classical kNN and distance-weighted kNN. Both algorithms perform quite well on the classical small datasets, iris and wine. But when it comes to a bigger dataset with a larger number of classes, the performance is only slightly better than the baseline.
There is also the curse of dimensionality to consider. It is quite possible that in the classical datasets (iris and wine) most of the attributes are relevant, while this may not be the case for the winequality dataset. That would explain why we see only a ~10% improvement there.
As for the number of neighbours, k=5 or k=7 looks like the best choice because it shows the most stable growth in accuracy. The type of rescaling also has some impact on the results, but this impact varies depending on the dataset. Standardization (versus min-max normalization) looks like the better choice (moreover, it is easier to think about the data in terms of the number of standard deviations).