. What would happen if k was set to one?
ANSWER: If K=1, each point is assigned the class of its single nearest neighbor.
- The solid dot will be classified as “red circle”, because its one closest neighbor is a red circle.
- The solid square will be classified as “red circle”, because its one closest neighbor is a red circle.
In general, with a lower K value:
- The KNN model follows the training data very closely and picks up very local patterns.
- The decision boundaries are highly non-linear.
- The fit to the training set is very good because bias is low. With K=1 each training point is its own nearest neighbor, so the training error is zero (barring identical points with different labels).
- Accuracy can be high when the true decision boundary is non-linear.
- The extra flexibility leads to high variance (predictions change a lot from one training sample to another).
- The risk of overfitting increases (low training error, but high test error).
- Prediction is marginally cheaper. KNN has no real training phase (the training set is simply stored), so training time does not depend on K.
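The low-K behavior described above can be checked on a small simulated example. This is only an illustrative sketch: the two-class data set, the names `x`, `y`, `train.err` and the three K values are assumptions made here, not part of the assignment. It uses `knn()` from the `class` package to predict the training points themselves.

```r
library(class)

set.seed(1)
n <- 100
# two continuous predictors (simulated, for illustration only)
x <- matrix(rnorm(2 * n), ncol = 2)
# noisy class labels that depend on the first predictor
y <- factor(ifelse(x[, 1] + rnorm(n, sd = 0.8) > 0, "blue", "red"))

# training error: predict the training points from themselves
train.err <- sapply(c(1, 5, 25), function(k) mean(knn(x, x, cl = y, k = k) != y))
names(train.err) <- c("k=1", "k=5", "k=25")
train.err
```

With K=1 every training point is its own nearest neighbor, so the first entry is exactly zero. That perfect training fit is the overfitting signature listed above; it says nothing about test error.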
. Assume there are 100 points, what would happen if k was set to 100?
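ANSWER: If K=100, all 100 training points vote on every prediction, so every new point is assigned the overall majority class.
- The solid dot will be classified as “blue triangle”, because the estimated conditional probability of blue triangle is higher (blue triangles are the majority class).
- The solid square will be classified as “blue triangle”, for the same reason.
In general, with a higher K value:
- The KNN model fits the data with less flexibility.
- The decision boundaries are smoother and less non-linear.
- Accuracy is lower when the true decision boundary is highly non-linear; a large K works better when the true boundary is close to linear.
- Less flexibility leads to low variance (less variability in prediction results) but higher bias.
- There is less risk of overfitting, but more risk of underfitting.
- Prediction is somewhat more expensive, because more neighbors must be found and counted. As with a small K, there is no real training phase, so training time does not depend on K.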
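The extreme case can be illustrated with a short sketch. The data are simulated and the 60/40 class split, the three query points and all variable names are assumptions made for this example. The predictors carry no information about the class, and K is set to the number of training points.

```r
library(class)

set.seed(2)
# 100 simulated training points with two predictors
x <- matrix(rnorm(200), ncol = 2)
# assumed class counts: 60 blue triangles, 40 red circles
y <- factor(rep(c("blue triangle", "red circle"), times = c(60, 40)))

# three arbitrary query points spread across the plane
new.points <- matrix(c(-3, -3,
                        0,  0,
                        3,  3), ncol = 2, byrow = TRUE)

# K equals the training set size, so every training point votes every time
pred <- knn(train = x, test = new.points, cl = y, k = nrow(x))
pred
```

Because all 100 points vote on every prediction, the location of the query point is irrelevant and each one receives the majority class.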
. Imagine that all of the blue triangles (class 2) were removed from the training set except for 5 triangles. How would the model perform at predicting the remaining 5 blue triangles shown in the plot?
ANSWER:
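The model would perform poorly. With only 5 blue triangles left in the training set, each removed blue triangle would likely be closer to an included red circle than to any of the 5 included blue triangles, so most of them would be classified as red circles.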
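A rough simulation of this effect is sketched below. Everything in it is an assumption made for illustration: the two overlapping Gaussian classes, the class sizes, K=5 and the variable names. It compares how many of the removed blue triangles are recovered when the training set contains all blue triangles versus only 5 of them.

```r
library(class)

set.seed(3)
# simulated classes: 100 red circles around (0, 0), 100 blue triangles around (1, 1)
red  <- matrix(rnorm(200, mean = 0), ncol = 2)
blue <- matrix(rnorm(200, mean = 1), ncol = 2)

kept <- 1:5                      # the 5 blue triangles that stay in the training set
removed.blue <- blue[-kept, ]    # the 95 blue triangles that are removed

# training set with every blue triangle
x.full <- rbind(red, blue)
y.full <- factor(rep(c("red circle", "blue triangle"), each = 100))

# training set with only 5 blue triangles
x.few <- rbind(red, blue[kept, ])
y.few <- factor(rep(c("red circle", "blue triangle"), times = c(100, 5)))

# share of the removed blue triangles that are classified as blue
recall.full <- mean(knn(x.full, removed.blue, cl = y.full, k = 5) == "blue triangle")
recall.few  <- mean(knn(x.few,  removed.blue, cl = y.few,  k = 5) == "blue triangle")
c(full = recall.full, few = recall.few)
```

The `full` figure is optimistic, since the removed points are themselves part of that training set. It serves only as a baseline for the drop that follows once the red circles outnumber the blue triangles in every neighborhood.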
set.seed(10)
data("iris")
names(iris) = tolower(names(iris))
library(class)
# preparing training/test sets.
n = nrow(iris)
index = sample(1:n, n*0.80, replace=FALSE)
train = iris[index,]
test = iris[-index,]
dim(iris)
## [1] 150 5
dim(train)
## [1] 120 5
dim(test)
## [1] 30 5
# preparing inputs for the knn()
train.species = train$species
train.x = train[,-5]
test.x = test[,-5]
# fit the model
knn.pred = knn(train=train.x, test=test.x, cl=train.species, k=5)
# checking results
table(knn.pred, test$species)
##
## knn.pred setosa versicolor virginica
## setosa 8 0 0
## versicolor 0 10 1
## virginica 0 2 9
accuracy.k5 = mean(knn.pred == test$species)
accuracy.k5
## [1] 0.9
error.rate.k5 = mean(knn.pred != test$species)
error.rate.k5
## [1] 0.1
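The accuracy above can also be read off the confusion table, together with per-class figures. In this sketch the matrix is typed in from the printed table (rows are predictions, columns are true species); the names `conf`, `recall` and `precision` are chosen here for illustration.

```r
# confusion matrix copied from the table() output above
species <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(8,  0, 0,
                 0, 10, 1,
                 0,  2, 9),
               nrow = 3, byrow = TRUE,
               dimnames = list(predicted = species, actual = species))

# overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)

# recall: share of each true species that was predicted correctly (column-wise)
recall <- diag(conf) / colSums(conf)

# precision: share of each predicted species that was correct (row-wise)
precision <- diag(conf) / rowSums(conf)

accuracy
round(recall, 3)
round(precision, 3)
```

In the session above the same matrix could be obtained with `conf <- table(knn.pred, test$species)` instead of being typed in. All three errors are confusions between versicolor and virginica; setosa is classified perfectly.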
library(class)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(10)
# loading the data
RawData <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data', header=FALSE)
# creating matrices for Xs and Y
responseY <- as.matrix(RawData[,dim(RawData)[2]])
predictorX <- as.matrix(RawData[,1:(dim(RawData)[2]-1)])
# PCs from PCA
pca <- princomp(predictorX, cor=TRUE) # principal components analysis using the correlation matrix
pc.comp <- pca$scores
pc.comp1 <- -1*pc.comp[,1]
pc.comp2 <- -1*pc.comp[,2]
# data partition for train/test sets.
trainIndex <- createDataPartition(factor(responseY), times=1, p = 0.8, list = FALSE) # response passed as a factor so the 80/20 split is stratified by class
X = cbind(pc.comp1, pc.comp2)
# fit KNN for k = 1..30, recording the training and test error rate for each k
train.error = rep(0,30)
test.error = rep(0,30)
for(k in 1:30){
  # predictions on the training set itself
  model.knn.train <- knn(train=X[trainIndex,], test=X[trainIndex,], cl=responseY[trainIndex], k=k, prob=FALSE)
  train.error[k] <- mean(model.knn.train != responseY[trainIndex])
  # predictions on the held-out test set
  model.knn.test <- knn(train=X[trainIndex,], test=X[-trainIndex,], cl=responseY[trainIndex], k=k, prob=FALSE)
  test.error[k] <- mean(model.knn.test != responseY[-trainIndex])
}
# plot both error curves against k (red = train, blue = test)
plot(1:30, train.error, col='red', type = 'b', ylim = c(0,0.4), xlab = 'k', ylab = 'error rate')
points(1:30, test.error, col='blue', type = 'b')
legend('bottomright', legend = c('train', 'test'), col = c('red', 'blue'), lty = 1, pch = 1)
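Figure 2. Classification results for the test set, using the k found in the part above.
# model.knn.test left over from the loop is the k=30 fit, so refit with the chosen k
# (assumed here to be the k with the lowest test error)
best.k <- which.min(test.error)
model.knn.best <- knn(train=X[trainIndex,], test=X[-trainIndex,], cl=responseY[trainIndex], k=best.k)
plot(pc.comp1[-trainIndex], pc.comp2[-trainIndex], col = model.knn.best)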
To choose k by 10-fold cross-validation instead:
idx <- createFolds(factor(responseY), k=10) # list of 10 held-out index vectors, stratified by class
# error matrices: one row per k value, one column per fold
train.error = matrix(0, nrow=30, ncol=10)
test.error = matrix(0, nrow=30, ncol=10)
for(k in 1:30){
  for(i in 1:10){
    held.out = idx[[i]]
    # error on the held-out fold
    pred <- knn(train = X[-held.out,], test = X[held.out,], cl = responseY[-held.out], k=k)
    test.error[k,i] = mean(responseY[held.out] != pred)
    # error on the nine folds used for fitting
    pred <- knn(train = X[-held.out,], test = X[-held.out,], cl = responseY[-held.out], k=k)
    train.error[k,i] = mean(responseY[-held.out] != pred)
  }
}
# average over the folds; same colors as the earlier plot (red = train, blue = test)
plot(1:30, rowMeans(train.error), col='red', type='b', ylim = c(0,0.4))
points(1:30, rowMeans(test.error), col='blue', type='b')
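A lighter-weight variant of the same idea is leave-one-out cross-validation, which the `class` package provides directly as `knn.cv()`. The sketch below swaps it in for the 10-fold scheme above. It uses the built-in `iris` data so that it runs without a download; the grid `k.grid`, the scaling step and the variable names are assumptions made for illustration.

```r
library(class)

data("iris")
# standardize the four predictors so each contributes equally to the distance
x <- scale(iris[, 1:4])
cl <- iris$Species

k.grid <- 1:30
set.seed(10)  # knn.cv() breaks voting ties at random

# leave-one-out error for each k: every point is predicted from all the others
loo.error <- sapply(k.grid, function(k) mean(knn.cv(train = x, cl = cl, k = k) != cl))

# k with the lowest leave-one-out error (first one if several tie)
best.k <- k.grid[which.min(loo.error)]
best.k
plot(k.grid, loo.error, type = 'b', xlab = 'k', ylab = 'leave-one-out error')
```

Leave-one-out needs no fold bookkeeping, but it fits once per observation, so the explicit 10-fold loop is usually the cheaper option on larger data sets.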