First we load all of the data into R:
train = read.table("digits_train.csv", header = TRUE, sep = ',')
test = read.table("digits_test.csv", header = TRUE, sep = ',')
validation = read.table("digits_valid.csv", header = TRUE, sep = ',')
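Before going further, it is worth a quick sanity check that each set loaded with the expected shape (each image is 16 x 16 = 256 pixel columns, with the label in the last column of the labeled sets):
dim(train)
dim(validation)
dim(test)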
# plotDigit draws row k of a data set as a 16 x 16 grayscale image.
# Columns 1:256 hold the pixel values; column 257 holds the digit label.
plotDigit <- function(k, dat)
{
  # reshape the 256 pixel values into a 16 x 16 matrix
  p <- matrix(as.numeric(dat[k,1:256]),16,16)
  # reverse the columns so the image is drawn right side up
  image(x=1:16, y=1:16, p[,16:1], xlab="", ylab="",
        main=paste("Row: ", k, " | Digit: ", dat[k,257]))
}
Part 1 - Which digits will be the hardest to classify?
I think one of the digits that will be the hardest to classify is the number 1, for two different reasons:
1) The first reason is that 1s are written in many different styles, so two examples of the same digit can look quite different from each other:
plotDigit(348, train)
plotDigit(156, train)
2) The second reason why 1 might be hard to classify is that it might often get confused with the number 7: both the 1 and the 7 can be drawn as a vertical line with a nearly horizontal line at the top.
plotDigit(156, train)
plotDigit(621, train)
Another digit that might often be misclassified is 8, because it is similar to 6 and 9. Some examples of eights that might be mistaken for a 6 or a 9:
plotDigit(84, train)
plotDigit(129, train)
plotDigit(137, train)
plotDigit(139, train)
Part 2:
Train a k-nearest-neighbor classifier and use the validation set to find the best k:
library(class)
library(gmodels)
# y holds the labels (column 257) of the training set
y = train[257]
# tr holds the 256 pixel columns of the training set
tr = train[1:256]
# vy holds the labels of the validation set
vy = validation[257]
# v holds the 256 pixel columns of the validation set
v = validation[1:256]
# knn() expects the labels as a vector, not a one-column data frame
yy = y[,1]
knn_predict1 <- knn(tr, v, yy, k = 1)
knn_predict2 <- knn(tr, v, yy, k = 2)
knn_predict3 <- knn(tr, v, yy, k = 3)
knn_predict4 <- knn(tr, v, yy, k = 4)
knn_predict5 <- knn(tr, v, yy, k = 5)
knn_predict6 <- knn(tr, v, yy, k = 6)
knn_predict7 <- knn(tr, v, yy, k = 7)
knn_predict8 <- knn(tr, v, yy, k = 8)
knn_predict9 <- knn(tr, v, yy, k = 9)
knn_predict10 <- knn(tr, v, yy, k = 10)
knn_predict11 <- knn(tr, v, yy, k = 11)
knn_predict12 <- knn(tr, v, yy, k = 12)
mean(knn_predict1 != vy[,1])
## [1] 0.1132075
mean(knn_predict2 != vy[,1])
## [1] 0.1383648
mean(knn_predict3 != vy[,1])
## [1] 0.1226415
mean(knn_predict4 != vy[,1])
## [1] 0.1226415
mean(knn_predict5 != vy[,1])
## [1] 0.1132075
mean(knn_predict6 != vy[,1])
## [1] 0.1163522
mean(knn_predict7 != vy[,1])
## [1] 0.1226415
mean(knn_predict8 != vy[,1])
## [1] 0.1320755
mean(knn_predict9 != vy[,1])
## [1] 0.1320755
mean(knn_predict10 != vy[,1])
## [1] 0.1289308
mean(knn_predict11 != vy[,1])
## [1] 0.1194969
mean(knn_predict12 != vy[,1])
## [1] 0.1415094
It appears that the best value of k on the validation set is k = 5 (tied with k = 1), with an error rate of about 11.3%.
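As an aside, the twelve separate calls above can be collapsed into a single sweep over k. A minimal sketch using the same objects as above (because knn() breaks voting ties at random, the exact error rates can vary slightly between runs):
# sweep k = 1..12 and collect the validation error rate for each
ks <- 1:12
err <- sapply(ks, function(k) mean(knn(tr, v, yy, k = k) != vy[,1]))
names(err) <- ks
# the k with the smallest validation error
ks[which.min(err)]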
Create models using knn and LDA and run them on the validation data to see which one works best:
library("MASS")
test_y_knn_val = knn(tr, v, yy, k = 5)
lda_model = lda(tr, yy)
test_y_lda_val = predict(lda_model,v)$class
print ("error percentage on knn and its confusion matrix")
## [1] "error percentage on knn and its confusion matrix"
mean(test_y_knn_val != vy[,1])
## [1] 0.1163522
table(predicted = test_y_knn_val, actual = vy[,1])
## actual
## predicted 0 1 2 3 4 5 6 7 8 9
## 0 40 0 0 0 0 0 3 0 0 1
## 1 0 25 1 0 6 0 0 4 1 0
## 2 0 0 29 0 0 0 0 0 3 0
## 3 0 0 0 24 0 0 0 0 0 4
## 4 0 0 1 0 26 0 0 0 0 0
## 5 0 0 0 0 1 38 0 0 0 3
## 6 0 0 0 0 0 0 33 0 1 0
## 7 0 1 0 2 0 0 0 26 0 0
## 8 0 0 1 0 0 0 0 0 19 0
## 9 0 0 0 1 0 0 0 0 3 21
print ("error percentage on lda and its confusion matrix:")
## [1] "error percentage on lda and its confusion matrix:"
mean(test_y_lda_val != vy[,1])
## [1] 0.1981132
table(predicted = test_y_lda_val, actual = vy[,1])
## actual
## predicted 0 1 2 3 4 5 6 7 8 9
## 0 37 0 0 0 0 0 1 0 1 1
## 1 0 21 1 0 7 0 1 2 2 0
## 2 0 2 25 0 0 0 0 0 1 2
## 3 0 0 0 24 0 6 0 0 0 4
## 4 0 0 0 0 24 0 0 0 0 1
## 5 0 1 0 0 0 28 1 1 0 0
## 6 1 0 2 0 1 1 31 0 0 0
## 7 0 1 0 3 0 2 0 27 2 0
## 8 1 0 3 0 1 1 2 0 18 1
## 9 1 1 1 0 0 0 0 0 3 20
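Note that the knn error here (0.1163522) differs slightly from the 0.1132075 obtained for the same k = 5 in the sweep above: knn() breaks voting ties at random, so repeated runs can give slightly different predictions. Fixing the random seed before the call makes a run reproducible; a minimal sketch (the seed value itself is arbitrary):
# fix the RNG seed so knn()'s random tie-breaking is reproducible
set.seed(1)
test_y_knn_val = knn(tr, v, yy, k = 5)
mean(test_y_knn_val != vy[,1])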
Run KNN and LDA on the test data and save the predictions to a CSV file:
#create test predictions using knn and the best k = 5
test_y_knn = knn(tr, test, yy, k = 5)
#create test predictions using LDA, reusing the model fit above
test_y_lda = predict(lda_model, test)$class
#build the data frame directly; concatenating factors with c() can
#silently turn them into integer level codes in older versions of R
df = data.frame(knn_pred = test_y_knn, lda_pred = test_y_lda)
#write the results to a csv file
write.csv(df, file = "HW3_njs48.csv", row.names = FALSE)
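As a quick sanity check on the output, the file can be read back and inspected:
# read the predictions back and look at the first few rows
pred_check = read.csv("HW3_njs48.csv")
head(pred_check)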
2) Examine a confusion matrix for your best model. Which digit(s) are relatively difficult to classify? How does this compare to your initial guesses from Part 1?
KNN:
          actual
predicted  0  1  2  3  4  5  6  7  8  9
        0 40  0  0  0  0  0  2  0  0  2
        1  0 25  1  0  7  0  0  4  1  0
        2  0  0 29  0  0  0  0  0  3  0
        3  0  0  0 24  0  0  0  0  0  4
        4  0  0  1  0 26  0  0  0  0  0
        5  0  0  0  0  0 37  0  0  0  4
        6  0  0  1  0  0  1 34  0  1  0
        7  0  1  0  2  0  0  0 26  0  0
        8  0  0  0  0  0  0  0  0 19  0
        9  0  0  0  1  0  0  0  0  3 19
LDA:
          actual
predicted  0  1  2  3  4  5  6  7  8  9
        0 37  0  0  0  0  0  1  0  1  1
        1  0 21  1  0  7  0  1  2  2  0
        2  0  2 25  0  0  0  0  0  1  2
        3  0  0  0 24  0  6  0  0  0  4
        4  0  0  0  0 24  0  0  0  0  1
        5  0  1  0  0  0 28  1  1  0  0
        6  1  0  2  0  1  1 31  0  0  0
        7  0  1  0  3  0  2  0 27  2  0
        8  1  0  3  0  1  1  2  0 18  1
        9  1  1  1  0  0  0  0  0  3 20
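A quick way to quantify which digits are hardest is the per-digit error rate: one minus the diagonal of the confusion matrix divided by its column sums. A minimal sketch using the validation objects from above (values will vary slightly between runs because of the random tie-breaking):
# per-digit error rate: fraction of each actual digit that was mispredicted
cm = table(predicted = test_y_knn_val, actual = vy[,1])
per_digit_err = 1 - diag(cm) / colSums(cm)
round(sort(per_digit_err, decreasing = TRUE), 3)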
It appears that both models have difficulty classifying the number 1. As I guessed, they often confused it with the number 7, but in addition to 7 they also often confused it with the number 4. Another digit that both models had difficulty predicting was the number 9. Surprisingly, they didn't confuse it with the number 8; they most often confused it with the number 3, which makes sense because the only difference is the top-left line.