IE575_Lesson-9_KNN

Think about Figure 1 shown in Lecture 9 (K-Nearest Neighbors section). For the following questions, write about how you would expect the model to perform. You should consider accuracy and training time.

1. What would happen if k was set to one?

ANSWER: If K=1,
- the solid dot will be classified as “red circle”, because its single closest neighbor is a red circle.
- the solid square will be classified as “red circle”, because its single closest neighbor is a red circle.

In general, with a lower value of K:
- The KNN model follows the training data very closely and picks up very local patterns.
- It generates highly non-linear decision boundaries.
- Bias is low, so the fit to the training set is very good. With K=1 each training point is its own nearest neighbor, so the training error is zero (barring duplicate points with different labels).
- Accuracy can be high when the true decision boundary is non-linear, but the extra flexibility leads to high variance (variability in prediction results).
- The risk of overfitting increases (low training error, but high test error).
- Training time is low and essentially independent of K: KNN only stores the training data, and the computational cost is paid at prediction time.
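The gap between training and test error at K=1 can be illustrated with a minimal sketch. The data below are simulated toy points, not the points from Figure 1, and the seed, sample sizes, and variable names are assumptions made only for this illustration.

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed, used only to make this sketch reproducible

# Toy two-class data (NOT Figure 1): two overlapping Gaussian clouds of 50 points each
n <- 100
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
           matrix(rnorm(n, mean = 1.5), ncol = 2))
y <- factor(rep(c("circle", "triangle"), each = n / 2))

# Hold out 30 points for testing
test.idx <- sample(nrow(x), 30)

# k = 1 on the training set: each point is its own nearest neighbor
pred.train <- knn(train = x[-test.idx, ], test = x[-test.idx, ], cl = y[-test.idx], k = 1)
train.error.k1 <- mean(pred.train != y[-test.idx])

# k = 1 on the held-out points: error is typically above zero because the classes overlap
pred.test <- knn(train = x[-test.idx, ], test = x[test.idx, ], cl = y[-test.idx], k = 1)
test.error.k1 <- mean(pred.test != y[test.idx])

c(train = train.error.k1, test = test.error.k1)
```

The training error is zero by construction, so it says nothing about how well the K=1 model generalizes; only the held-out error is informative.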

2. Assume there are 100 points. What would happen if k was set to 100?

ANSWER: If K=100, all 100 points are neighbors of every query point, so each prediction is simply the majority class of the whole training set:
- the solid dot will be classified as “blue triangle”, because the estimated class probability of blue triangle (its share among the 100 neighbors) is higher.
- the solid square will be classified as “blue triangle”, for the same reason.

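In general, with a higher value of K:
- The KNN model fits the data with less flexibility and less non-linearity.
- It generates smoother, less non-linear decision boundaries. In the extreme case K=100 there is no boundary at all, because every point receives the majority class.
- Accuracy is lower, especially for data with non-linear class boundaries. A moderately large K can still work when the true decision boundary is close to linear.
- Less flexibility leads to low variance (variability in prediction results) but higher bias.
- Overfitting is less of a problem; underfitting becomes the risk instead.
- Training time is unchanged, since KNN only stores the data. Prediction is somewhat slower because more neighbors must be found and tallied.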
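The extreme case where K equals the number of training points can be checked directly. This is a sketch on simulated data (not Figure 1); the 60/40 class split and the three query points are assumptions chosen so that the majority class is unambiguous.

```r
library(class)  # provides knn()

set.seed(2)  # arbitrary seed for reproducibility of this sketch

# Toy data (NOT Figure 1): 100 points, 60 triangles and 40 circles
x <- matrix(rnorm(200), ncol = 2)
y <- factor(rep(c("triangle", "circle"), times = c(60, 40)))

# Three query points spread across the plane
query <- matrix(c(-3, -3,
                   0,  0,
                   3,  3), ncol = 2, byrow = TRUE)

# With k = 100 every vote counts all 100 labels, so location no longer matters
pred.k100 <- knn(train = x, test = query, cl = y, k = 100)
table(pred.k100)
```

Every query point receives the majority label regardless of where it lies, which is the maximum-bias, minimum-variance end of the trade-off described above.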

3. Imagine that all of the blue triangles (class 2) were removed from the training set except for 5 triangles. How would the model perform at predicting the remaining 5 blue triangles shown in the plot?

ANSWER:

The model would perform poorly: each removed blue triangle would likely be closer to a retained red circle than to any of the 5 retained blue triangles, so most of its nearest neighbors would be red circles and it would be classified as a red circle.
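The effect of such severe class imbalance can be sketched with a constructed layout. The grid of circles and the five triangle positions below are hypothetical and are not the points from Figure 1; they are chosen so that each triangle is surrounded by circles. Here the prediction is made at the locations of the retained triangles themselves, with k=5.

```r
library(class)  # provides knn()

# Toy layout (NOT Figure 1): 100 circles on an integer grid
circles <- as.matrix(expand.grid(x1 = 0:9, x2 = 0:9))

# Only 5 triangles are kept, each sitting between four grid points and far from the others
triangles <- rbind(c(2.5, 2.5), c(6.5, 2.5), c(4.5, 4.5), c(2.5, 6.5), c(6.5, 6.5))

train.x <- rbind(circles, triangles)
train.y <- factor(c(rep("circle", nrow(circles)), rep("triangle", nrow(triangles))))

# Predict at the triangle locations: the 5 nearest neighbors of each triangle are
# the triangle itself plus the 4 surrounding circles, so circles win the vote 4 to 1
pred <- knn(train = train.x, test = triangles, cl = train.y, k = 5)
table(pred)
```

In this layout even the retained triangles are outvoted by their neighbors, so the minority class is never predicted.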

Split the iris dataset into training (80%) and testing (20%) sets and train a k-NN classification model. Also calculate the misclassification rate. Assume k=5 and use set.seed(10). Use the knn function from the class package or the knn3 function from the caret package.

set.seed(10)

data("iris")

names(iris) = tolower(names(iris))

library(class)

# preparing training/test sets.
n = nrow(iris)
index = sample(1:n, n*0.80, replace=FALSE)
train = iris[index,]
test = iris[-index,]

dim(iris)

## [1] 150   5

dim(train)

## [1] 120   5

dim(test)

## [1] 30  5

# preparing inputs for the knn()
train.species = train$species
train.x = train[,-5]
test.x = test[,-5]

# fit the model
knn.pred = knn(train=train.x, test=test.x, cl=train.species, k=5)

# checking results
table(knn.pred, test$species)

##             
## knn.pred     setosa versicolor virginica
##   setosa          8          0         0
##   versicolor      0         10         1
##   virginica       0          2         9

accuracy.k5 = mean(knn.pred == test$species)
accuracy.k5

## [1] 0.9

error.rate.k5 = mean(knn.pred != test$species)
error.rate.k5

## [1] 0.1
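As a cross-check, the misclassification rate can also be read off the confusion matrix: it is one minus the share of counts on the diagonal. The sketch below re-enters the counts from the table printed above by hand; the names `conf.mat` and `error.rate` are introduced only for this illustration.

```r
# Confusion matrix copied from the printed table (rows: predicted, columns: actual)
species <- c("setosa", "versicolor", "virginica")
conf.mat <- matrix(c(8,  0, 0,
                     0, 10, 1,
                     0,  2, 9),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(predicted = species, actual = species))

# Correct predictions lie on the diagonal; everything else is a misclassification
error.rate <- 1 - sum(diag(conf.mat)) / sum(conf.mat)
error.rate
```

The three off-diagonal counts (one virginica predicted as versicolor, two versicolor predicted as virginica) out of 30 test rows give the same 10% error rate as the direct comparison.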

The Pennsylvania State University’s Applied Data Mining and Statistical Learning course, offered by the Statistics Department, gives an example of k-NN using the diabetes data from the UCI machine learning database. Your task is to follow their code and approach to partitioning the data and performing k-NN to generate the following figures.
- Figure 1. Divide the dataset into two sets (80% training and 20% testing). Plot the training and testing error with k varying from 1 to 30. Find the k for which the test error is lowest. Use set.seed(10).
- Figure 2. Generate classification results for the testing dataset with the k found in the last part.

library(class)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

set.seed(10)

# loading the data
RawData <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data', header=FALSE)

# creating matrices for Xs and Y
responseY <- as.matrix(RawData[,dim(RawData)[2]])
predictorX <- as.matrix(RawData[,1:(dim(RawData)[2]-1)])

# PCs from PCA
pca <- princomp(predictorX, cor=T) # principal components analysis using correlation matrix
pc.comp <- pca$scores
pc.comp1 <- -1*pc.comp[,1]
pc.comp2 <- -1*pc.comp[,2]

# data partition for train/test sets.
trainIndex <- createDataPartition(responseY, times=1, p = 0.8, list = F)
X = cbind(pc.comp1, pc.comp2)

# fitting models for 30 different k-values (one for test and one for train set for each K)
train.error = rep(0,30)
test.error = rep(0,30)
for(k in 1:30){
    model.knn.train <- knn(train=X[trainIndex,], test=X[trainIndex,], cl=responseY[trainIndex], k=k, prob=F)
    train.error[k] <- sum(model.knn.train!=responseY[trainIndex])/length(responseY[trainIndex])
    model.knn.test <- knn(train=X[trainIndex,], test=X[-trainIndex,], cl=responseY[trainIndex], k=k, prob=F)
    test.error[k] <- sum(model.knn.test!=responseY[-trainIndex])/length(responseY[-trainIndex])
}

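# Figure 1: training error (red) and test error (blue) against k
plot(1:30, train.error, col='red', type = 'b', ylim = c(0,0.4), xlab = 'k', ylab = 'Error rate')
points(1:30, test.error, col='blue', type = 'b')
legend('bottomright', legend = c('train', 'test'), col = c('red', 'blue'), lty = 1, pch = 1)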

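Figure 2. Generate classification results for the testing dataset with the k found in the part above.

model.knn.best <- knn(train=X[trainIndex,], test=X[-trainIndex,], cl=responseY[trainIndex], k=which.min(test.error)) # refit at the k with the lowest test error; after the loop model.knn.test holds the k=30 fit
plot(pc.comp1[-trainIndex], pc.comp2[-trainIndex], col = model.knn.best)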

Alternatively, to choose k by 10-fold cross-validation instead of a single train/test split:

idx <- createFolds(responseY, k=10)
train.error = c()
test.error = c()
for(k in 1:30){
  test.error.tmp = c()
  train.error.tmp = c()
  for(i in 1:10){
    pred <- knn(train = X[-idx[[i]],],test = X[idx[[i]],], cl = responseY[-idx[[i]]], k=k)
    test.error.tmp = c(test.error.tmp,mean(responseY[idx[[i]],] != pred))
    pred <- knn(train = X[-idx[[i]],],test = X[-idx[[i]],], cl = responseY[-idx[[i]]], k=k)
    train.error.tmp = c(train.error.tmp,mean(responseY[-idx[[i]],] != pred))
  }
  test.error = rbind(test.error,test.error.tmp)
  train.error = rbind(train.error,train.error.tmp)
}
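# Note the colors are swapped relative to the earlier plot: training error in blue, cross-validated test error in red
plot(1:30, rowMeans(train.error), col='blue', type='b', ylim = c(0,0.4), xlab='k', ylab='Error rate')
points(1:30, rowMeans(test.error), col='red', type='b')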
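Once the cross-validated error is available for each k, the final step is to pick the k that minimizes it. The sketch below shows that selection end to end on the built-in iris data, used here only as a stand-in so the example runs without downloading the diabetes file. The fold assignment with `sample()` is a base R substitute for `createFolds`, and the names `fold.id`, `cv.error`, and `best.k` are assumptions for this illustration.

```r
library(class)  # provides knn()

set.seed(10)

# Built-in iris data as a stand-in for the diabetes predictors and response
x <- as.matrix(datasets::iris[, 1:4])
y <- datasets::iris[, 5]

# Assign each row to one of 10 folds at random (base R stand-in for caret::createFolds)
n.folds <- 10
fold.id <- sample(rep(1:n.folds, length.out = nrow(x)))

# Mean held-out error over the 10 folds, for each candidate k
k.values <- 1:15
cv.error <- sapply(k.values, function(k) {
  fold.error <- sapply(1:n.folds, function(i) {
    held.out <- fold.id == i
    pred <- knn(train = x[!held.out, ], test = x[held.out, ], cl = y[!held.out], k = k)
    mean(pred != y[held.out])
  })
  mean(fold.error)
})

# k with the lowest cross-validated error (the smallest such k if several tie)
best.k <- k.values[which.min(cv.error)]
best.k
```

Averaging over folds makes the chosen k less sensitive to one particular split than the single 80/20 partition used for Figure 1.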

IE575_Lesson-9_KNN_ANN

Maulik Patel

October 25, 2016