rm(list = ls())
library(class)
distance <- function(a, b) sqrt(sum((a - b)^2))
KNN <- function(train_data, train_labels, test_data, k) {
# Create a empty vector to store the output
predictions <- vector("character", nrow(test_data))
# For each output in the test set
for (i in 1:nrow(test_data)) {
# Get the ith output
new_input <- test_data[i, ]
# 1. Calculate distances between new input and all points in train dataset
distances <- apply(train_data, 1, function(x) distance(x, new_input))
# 2. Create a data frame of distances and labels
DL_table <- data.frame(dist = distances, label = train_labels)
# 3. Sort the data frame by distance from closest to the farthest
sorted_DL_table <- DL_table[order(DL_table$dist), ]
# 4. Get the k nearest neighbors
neighbors <- head(sorted_DL_table, k)
# 5. Get most common label from the closest neighbors
predictions[i] <- names(which.max(table(neighbors$label)))
}
# Return the vector of predictions
return(predictions)
}
df <- read.csv("D:/Study/AI1/Assignment 2/knn_example.csv")
df <- df[,-1]
set.seed(123)
train_index <-sample(x = nrow(df), size = round(0.8 * nrow(df)))
train <- df[train_index, ]
test <- df[-train_index, ]
To define K, I will use the rule of thumb method for simple KNN.
k <- round(sqrt(nrow(train)))
model <- KNN(train_data = train, train_labels = train$y, test_data = test, k)
table <- table(model, test$y)
table
##
## model 0 1 2 3 4
## 0 11 0 0 0 0
## 1 0 7 0 0 0
## 2 0 0 11 0 0
## 3 0 0 0 14 0
## 4 0 0 0 0 7
KNN_P <- knn(train = train, test = test, cl = train$y, k = k )
table2 <- table(KNN_P, test$y)
table2
##
## KNN_P 0 1 2 3 4
## 0 11 0 0 0 0
## 1 0 7 0 0 0
## 2 0 0 11 0 0
## 3 0 0 0 14 0
## 4 0 0 0 0 7
In the class we treats all neighbors the same which is the uniform weighting KNN and there is another more common used method called inverse distance weighting.
In IDW, the prediction for an input vector is calculated by taking a weighted average of the target values of its K nearest neighbors. The weights are determined based on the inverse of the distances between the input vector and its neighbors. The closer neighbors are assigned higher weights, indicating their greater influence on the prediction. But we may need to consider the impact of outliers or very distant neighbors. One way to mitigate their influence is by using a decay function or a cutoff distance to limit the weights of extremely distant neighbors.
####The formula for IDW is:
predicted_value = sum(weights * values) / sum(weights)
where weight = 1 / (distance ^ p)
That ‘P’ is a power parameter which determines the significance of observed points on the prediction as a function of distance.
In the real world, we can use the IDW for predict air quality. Because air quality is heavily based on geographic location so that given more weight on the air quality sensor that is close by to each other make a lot sense. If we want to predict a new location’s air quality, its much better to use the nearby sensors reading.
The key to the IDW method is the assumption that things that are close to one another are more alike than those that are farther apart - this is known as the First Law of Geography.
This method is often used in spatial interpolation but can be used in any situation where you want to make a prediction based on nearby observations.