Installing Necessary Packages in R

To use KNN in R, you can use the class package, which includes a function called knn. Here's how you might install and load that package:

# install.packages("class")  # run once if the package is not yet installed
library(class)

Using KNN for Classification

Here's a simplified example of how you might use the knn function for a classification task:

Load the data: For this example, let’s use the well-known Iris dataset.

data(iris)
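
Optionally, take a quick look at the data to confirm what you are working with: 150 observations, four numeric predictors, and the Species label.

str(iris)   # 150 obs. of 5 variables: four numeric features plus the Species factor
head(iris)  # first few rows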

Prepare the data: Before you run KNN, you usually want to normalize your data and split it into training and testing sets.

set.seed(123)  # Setting seed for reproducibility
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
iris.train <- iris[ind==1,]
iris.test <- iris[ind==2,]

Note: ind is a vector of randomly drawn 1s and 2s; it is used to split the data into training (~70%) and testing (~30%) sets.
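
As a quick sanity check, you can confirm the split came out roughly 70/30:

table(ind)                         # counts of 1s (train) and 2s (test)
nrow(iris.train); nrow(iris.test)  # sizes of the two sets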

Normalize the Data: KNN is a distance-based algorithm, so scaling/normalizing the data is crucial. To keep both sets on the same scale, compute the scaling parameters on the training set and apply them to the training and testing sets alike.

normalize <- function(x, min_x = min(x), max_x = max(x)) {
  (x - min_x) / (max_x - min_x)
}

# Compute min/max on the training set only, then apply them to both sets
# so train and test features live on the same scale.
train_mins <- sapply(iris.train[, 1:4], min)
train_maxs <- sapply(iris.train[, 1:4], max)
iris.train[, 1:4] <- as.data.frame(Map(normalize, iris.train[, 1:4], train_mins, train_maxs))
iris.test[, 1:4] <- as.data.frame(Map(normalize, iris.test[, 1:4], train_mins, train_maxs))
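
As a quick check, each training column should now span exactly [0, 1]; test values may fall slightly outside that range when they exceed the training minima or maxima.

summary(iris.train[, 1:4])  # min 0 and max 1 in every column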

Run KNN: Now you can run the KNN algorithm. The code below classifies the test set using k = 3.

knn.pred <- knn(train = iris.train[, 1:4], 
                test = iris.test[, 1:4], 
                cl = iris.train[, 5], 
                k = 3)

Here, iris.train[, 1:4] and iris.test[, 1:4] are the predictor variables for the training and testing sets, respectively. iris.train[, 5] represents the class labels for the training set, and k = 3 specifies that the three nearest neighbors should be considered.
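
If you also want to know how decisive each prediction was, knn() accepts prob = TRUE and attaches the winning vote proportion to the result as an attribute:

knn.pred.p <- knn(train = iris.train[, 1:4],
                  test = iris.test[, 1:4],
                  cl = iris.train[, 5],
                  k = 3,
                  prob = TRUE)
head(attr(knn.pred.p, "prob"))  # fraction of the 3 neighbors that voted for the winning class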

Evaluate the Model: Finally, you might want to assess the model's performance, e.g., by creating a confusion matrix:

confusion <- table(pred = knn.pred, true = iris.test[, 5])
print(confusion)
##             true
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         11         2
##   virginica       0          3        13
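
From the confusion matrix you can also compute the overall accuracy: correct predictions sit on the diagonal, so divide their sum by the total number of test cases.

accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy  # for the matrix above: (15 + 11 + 13) / 44 ≈ 0.886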

Notes:

Choose an appropriate k value: A small k (like k = 1 or k = 2) is the most flexible, so it tends to have low bias but high variance. A larger k produces smoother, more stable predictions (lower variance at the cost of some bias) and is more resistant to outliers. Cross-validation is a common way to pick k, as sketched below.
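
As a minimal sketch of such tuning, knn.cv() from the same class package runs leave-one-out cross-validation on the training set, so you can score a range of k values and keep the best one:

k_values <- 1:15
cv_accuracy <- sapply(k_values, function(k) {
  pred <- knn.cv(train = iris.train[, 1:4], cl = iris.train[, 5], k = k)
  mean(pred == iris.train[, 5])  # leave-one-out accuracy for this k
})
best_k <- k_values[which.max(cv_accuracy)]
best_k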

Consider feature scaling: KNN uses distances between data points to determine similarity, so features measured on larger scales dominate the distance calculation. Scale your features so they carry comparable weight.
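
As an alternative to the min-max normalize() above, base R's scale() performs z-score standardization. A sketch, again fitting the parameters on the training set only:

train_scaled <- scale(iris.train[, 1:4])  # center and scale using training statistics
test_scaled <- scale(iris.test[, 1:4],
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))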

Handling Categorical Variables: KNN's distance computation does not handle categorical variables directly, so you may want to explore strategies such as one-hot encoding.
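
For instance, base R's model.matrix() can one-hot encode a factor. A minimal sketch using a small hypothetical data frame df with a factor column color (not part of the Iris example):

df <- data.frame(color = factor(c("red", "green", "blue", "red")))  # hypothetical data
onehot <- model.matrix(~ color - 1, data = df)  # one 0/1 indicator column per level
onehot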