First, we load all of the data into R and define a helper function for plotting a digit:

# The training and validation files have 256 pixel columns (a 16 x 16 image) plus the digit label in column 257.
train = read.table("digits_train.csv", header = TRUE, sep = ',')
test = read.table("digits_test.csv", header = TRUE, sep = ',')
validation = read.table("digits_valid.csv", header = TRUE, sep = ',')
# Plot row k of dat as a 16 x 16 grayscale image, titled with its row number and label.
plotDigit <- function(k, dat) 
{
  p <- matrix(as.numeric(dat[k,1:256]),16,16)
  image(x=1:16, y=1:16, p[,16:1], xlab="", ylab="",
    main=paste("Row: ", k, " | Digit: ", dat[k,257]))
}

Part 1 - Which digits will be the hardest to classify?

I think one of the digits that will be hardest to classify is the number 1, for two different reasons:

  1. The first is that a 1 is made up of only a single stroke, and these algorithms look for images that are similar, here in the sense of having similar pixel values in the same positions. So even if two 1s are drawn identically, but one has its pixels in column 6 and the other in column 7, most algorithms will consider them completely different from one another. Consider the two images of a 1 below, and the small distance check sketched after this list:
plotDigit(348, train)

plotDigit(156, train)

  2. The second reason why 1 might be hard to classify is that it can easily be confused with the number 7: both can be drawn as a vertical line with a nearly horizontal stroke at the top. Compare the 1 and the 7 below:

plotDigit(156, train)

plotDigit(621, train)
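To make the first point concrete, here is a small sketch (not part of the original assignment code) that compares one of the 1s plotted above with a copy of itself shifted one column to the right; even though the two images show exactly the same shape, the pixel-wise Euclidean distance between them can be large.

# Sketch: translation sensitivity of pixel-wise comparison.
# Row 348 is one of the 1s plotted above; any 1 would do.
x  <- matrix(as.numeric(train[348, 1:256]), 16, 16)  # the original 16 x 16 image
xs <- cbind(0, x[, 1:15])                            # the same image shifted right by one column
sqrt(sum((x - xs)^2))                                # large, even though the digits look identical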

Another digit that might often be misclassified is 8, because it is similar to 6 and 9. Some examples of eights that might be mistaken for a 6 or a 9:

plotDigit(84, train)

plotDigit(129, train)

plotDigit(137, train)

plotDigit(139, train)

Part 2:

Train k-nearest-neighbors (kNN) classifiers and use the validation set to find the best value of k:

library(class)
library(gmodels)
# y holds the labels (column 257) of the training set
y = train[257]
# tr holds the pixel values (columns 1-256) of the training set
tr = train[1:256]
# vy holds the labels of the validation set
vy = validation[257]
# v holds the pixel values of the validation set
v = validation[1:256]
# yy is the training labels as a plain vector, as knn() expects
yy = y[,1]


knn_predict1 <- knn(tr, v, yy, k = 1)
knn_predict2 <- knn(tr, v, yy, k = 2)
knn_predict3 <- knn(tr, v, yy, k = 3)
knn_predict4 <- knn(tr, v, yy, k = 4)
knn_predict5 <- knn(tr, v, yy, k = 5)
knn_predict6 <- knn(tr, v, yy, k = 6)
knn_predict7 <- knn(tr, v, yy, k = 7)
knn_predict8 <- knn(tr, v, yy, k = 8)
knn_predict9 <- knn(tr, v, yy, k = 9)
knn_predict10 <- knn(tr, v, yy, k = 10)
knn_predict11 <- knn(tr, v, yy, k = 11)
knn_predict12 <- knn(tr, v, yy, k = 12)

mean(knn_predict1[1:318] != t(vy[1]))
## [1] 0.1132075
mean(knn_predict2[1:318] != t(vy[1]))
## [1] 0.1383648
mean(knn_predict3[1:318] != t(vy[1]))
## [1] 0.1226415
mean(knn_predict4[1:318] != t(vy[1]))
## [1] 0.1226415
mean(knn_predict5[1:318] != t(vy[1]))
## [1] 0.1132075
mean(knn_predict6[1:318] != t(vy[1]))
## [1] 0.1163522
mean(knn_predict7[1:318] != t(vy[1]))
## [1] 0.1226415
mean(knn_predict8[1:318] != t(vy[1]))
## [1] 0.1320755
mean(knn_predict9[1:318] != t(vy[1]))
## [1] 0.1320755
mean(knn_predict10[1:318] != t(vy[1]))
## [1] 0.1289308
mean(knn_predict11[1:318] != t(vy[1]))
## [1] 0.1194969
mean(knn_predict12[1:318] != t(vy[1]))
## [1] 0.1415094

It appears that the lowest validation error, about 11.3%, is achieved at k = 5 (tied with k = 1), so we take k = 5 as the best value.
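As a side note, the same sweep can be written much more compactly as a loop; here is a sketch, assuming tr, v, yy, and vy are defined as above (knn() breaks ties at random, so repeated runs can give slightly different error rates):

# Sketch: validation error rate for k = 1, ..., 12 in a single loop.
errs <- sapply(1:12, function(k) mean(knn(tr, v, yy, k = k) != vy[, 1]))
names(errs) <- 1:12
errs
which.min(errs)  # the k with the lowest validation error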

Fit models using kNN and LDA and run them on the validation data to see which one works best:

library("MASS")
test_y_knn_val = knn(tr, v, yy, k = 5)
lda_model = lda(tr, yy)
test_y_lda_val = predict(lda_model,v)$class

print ("error percentage on knn and its confusion matrix")
## [1] "error percentage on knn and its confusion matrix"
mean(test_y_knn_val[1:318] != t(vy[1]))
## [1] 0.1163522
table(predicted = test_y_knn_val, actual = t(vy[1]))
##          actual
## predicted  0  1  2  3  4  5  6  7  8  9
##         0 40  0  0  0  0  0  3  0  0  1
##         1  0 25  1  0  6  0  0  4  1  0
##         2  0  0 29  0  0  0  0  0  3  0
##         3  0  0  0 24  0  0  0  0  0  4
##         4  0  0  1  0 26  0  0  0  0  0
##         5  0  0  0  0  1 38  0  0  0  3
##         6  0  0  0  0  0  0 33  0  1  0
##         7  0  1  0  2  0  0  0 26  0  0
##         8  0  0  1  0  0  0  0  0 19  0
##         9  0  0  0  1  0  0  0  0  3 21
print ("error percentage on lda and its confusion matrix:")
## [1] "error percentage on lda and its confusion matrix:"
mean(test_y_lda_val != t(vy[1]))
## [1] 0.1981132
table(predicted = test_y_lda_val, actual = t(vy[1]))
##          actual
## predicted  0  1  2  3  4  5  6  7  8  9
##         0 37  0  0  0  0  0  1  0  1  1
##         1  0 21  1  0  7  0  1  2  2  0
##         2  0  2 25  0  0  0  0  0  1  2
##         3  0  0  0 24  0  6  0  0  0  4
##         4  0  0  0  0 24  0  0  0  0  1
##         5  0  1  0  0  0 28  1  1  0  0
##         6  1  0  2  0  1  1 31  0  0  0
##         7  0  1  0  3  0  2  0 27  2  0
##         8  1  0  3  0  1  1  2  0 18  1
##         9  1  1  1  0  0  0  0  0  3 20

Run kNN and LDA on the test data and save the predictions to a CSV file:

# Create the test predictions using kNN with the best k found above (k = 5)
test_y_knn = knn(tr, test, yy, k = 5)

# Create the test predictions using LDA
l = lda(tr, yy)
test_y_lda = predict(l, test)$class

# Put both sets of predictions side by side in a data frame
df = data.frame(knn_pred = test_y_knn, lda_pred = test_y_lda)

# Write the results to a CSV file
write.csv(df, file = "HW3_njs48.csv", row.names = FALSE)
  1. Summarize the performance of the 2 approaches. What error rates do you achieve from each?

I tested both approaches by training on the training data and then evaluating on the validation set. The error rates were: kNN error rate: 0.1257862; LDA error rate: 0.1981132. So it appears that kNN does a better job of classifying this type of data.

  2. Examine a confusion matrix for your best model. Which digit(s) is (are) relatively difficult to classify? How does this compare to your initial guesses from Part 1?

KNN confusion matrix (rows = predicted, columns = actual):

          actual
predicted  0  1  2  3  4  5  6  7  8  9
        0 40  0  0  0  0  0  2  0  0  2
        1  0 25  1  0  7  0  0  4  1  0
        2  0  0 29  0  0  0  0  0  3  0
        3  0  0  0 24  0  0  0  0  0  4
        4  0  0  1  0 26  0  0  0  0  0
        5  0  0  0  0  0 37  0  0  0  4
        6  0  0  1  0  0  1 34  0  1  0
        7  0  1  0  2  0  0  0 26  0  0
        8  0  0  0  0  0  0  0  0 19  0
        9  0  0  0  1  0  0  0  0  3 19

LDA confusion matrix (rows = predicted, columns = actual):

          actual
predicted  0  1  2  3  4  5  6  7  8  9
        0 37  0  0  0  0  0  1  0  1  1
        1  0 21  1  0  7  0  1  2  2  0
        2  0  2 25  0  0  0  0  0  1  2
        3  0  0  0 24  0  6  0  0  0  4
        4  0  0  0  0 24  0  0  0  0  1
        5  0  1  0  0  0 28  1  1  0  0
        6  1  0  2  0  1  1 31  0  0  0
        7  0  1  0  3  0  2  0 27  2  0
        8  1  0  3  0  1  1  2  0 18  1
        9  1  1  1  0  0  0  0  0  3 20

It appears that both models have difficulty classifying the number 1. As I guessed, it is often confused with the number 7, but in addition to 7 it is also often confused with the number 4. Another digit that both models had difficulty predicting was the number 9. Surprisingly, it was not confused with the number 8; it was most often confused with the number 3, which makes sense because the main difference between them is the line at the top left.
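To quantify which digits are hardest for the better model, here is a small sketch (not part of the original write-up) that computes a per-digit error rate from the kNN validation confusion matrix, assuming test_y_knn_val and vy from Part 2 are still in the workspace:

# Per-digit error rate on the validation set for kNN:
# 1 - (correct predictions for a digit) / (number of validation cases of that digit)
tab <- table(predicted = test_y_knn_val, actual = vy[, 1])
per_digit_err <- 1 - diag(tab) / colSums(tab)
names(per_digit_err) <- colnames(tab)
round(per_digit_err, 3)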

  3. Why would it be difficult (or impossible) to apply multinomial logistic regression to this problem?

First, there are algorithms better suited than multinomial logistic regression to multi-class problems like this one. Second, multinomial logistic regression works well when there are clearly defined, separable classes, but with pixel values for handwritten digits the same pixels are often active in different classes, so the classes are not clearly separable.
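For completeness, here is a hedged sketch (not part of the assignment) of how one might still attempt multinomial logistic regression with nnet::multinom. With 256 pixel predictors and 10 classes the model has roughly (256 + 1) x 10 coefficients, so nnet's default weight limit has to be raised explicitly, and the fit is slow and prone to the overlap/separability issues described above. It assumes tr, yy, v, and vy from Part 2, and that the validation columns share the training column names.

# Sketch only: multinomial logistic regression on the raw pixels via nnet::multinom.
library(nnet)
train_df <- data.frame(tr, digit = factor(yy))
# ~2,570 weights exceed multinom's default MaxNWts of 1000, so raise the limit.
mlr <- multinom(digit ~ ., data = train_df, MaxNWts = 5000, maxit = 200)
mlr_pred <- predict(mlr, newdata = v)
mean(mlr_pred != vy[, 1])  # validation error rate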