Set Up

Clean the Environment

rm(list = ls())

Prepare Packages

library(class)

Establish KNN Function

Set up the distance function

distance <- function(a, b) sqrt(sum((a - b)^2))

KNN Function

KNN <- function(train_data, train_labels, test_data, k) {
  
  # Create a empty vector to store the output
  predictions <- vector("character", nrow(test_data))
  
  # For each output in the test set
  for (i in 1:nrow(test_data)) {
    
    # Get the ith output
    new_input <- test_data[i, ]
    
    # 1. Calculate distances between new input and all points in train dataset
    distances <- apply(train_data, 1, function(x) distance(x, new_input))
    
    # 2. Create a data frame of distances and labels
    DL_table <- data.frame(dist = distances, label = train_labels)
    
    # 3. Sort the data frame by distance from closest to the farthest
    sorted_DL_table <- DL_table[order(DL_table$dist), ]
    
    # 4. Get the k nearest neighbors
    neighbors <- head(sorted_DL_table, k)
    
    # 5. Get most common label from the closest neighbors
    predictions[i] <- names(which.max(table(neighbors$label)))
  }
  
  # Return the vector of predictions
  return(predictions)
}

Get sample Dataset

df <- read.csv("D:/Study/AI1/Assignment 2/knn_example.csv")

df <- df[,-1]

Split the data into 80/20

set.seed(123)

train_index <-sample(x = nrow(df),  size = round(0.8 * nrow(df)))

train <- df[train_index, ]

test <- df[-train_index, ]

Define K

To define K, I will use the rule of thumb method for simple KNN.

k <- round(sqrt(nrow(train)))

Apply simple KNN

model <- KNN(train_data = train, train_labels = train$y, test_data = test, k)

Create performance table

table <- table(model, test$y)

table
##      
## model  0  1  2  3  4
##     0 11  0  0  0  0
##     1  0  7  0  0  0
##     2  0  0 11  0  0
##     3  0  0  0 14  0
##     4  0  0  0  0  7

Check if KNN works by using packages

KNN_P <- knn(train = train, test = test, cl = train$y, k = k )

Performance table

table2 <- table(KNN_P, test$y)

table2
##      
## KNN_P  0  1  2  3  4
##     0 11  0  0  0  0
##     1  0  7  0  0  0
##     2  0  0 11  0  0
##     3  0  0  0 14  0
##     4  0  0  0  0  7

relative distance

In the class we treats all neighbors the same which is the uniform weighting KNN and there is another more common used method called inverse distance weighting.

inverse distance weighting.

In IDW, the prediction for an input vector is calculated by taking a weighted average of the target values of its K nearest neighbors. The weights are determined based on the inverse of the distances between the input vector and its neighbors. The closer neighbors are assigned higher weights, indicating their greater influence on the prediction. But we may need to consider the impact of outliers or very distant neighbors. One way to mitigate their influence is by using a decay function or a cutoff distance to limit the weights of extremely distant neighbors.

####The formula for IDW is:

predicted_value = sum(weights * values) / sum(weights)

where weight = 1 / (distance ^ p)

That ‘P’ is a power parameter which determines the significance of observed points on the prediction as a function of distance.

How to implement

In the real world, we can use the IDW for predict air quality. Because air quality is heavily based on geographic location so that given more weight on the air quality sensor that is close by to each other make a lot sense. If we want to predict a new location’s air quality, its much better to use the nearby sensors reading.

The key to the IDW method is the assumption that things that are close to one another are more alike than those that are farther apart - this is known as the First Law of Geography.

This method is often used in spatial interpolation but can be used in any situation where you want to make a prediction based on nearby observations.