Homework 3 - Digit Recognition with NBC, Random Forest, and SVM

Section 1: Introduction

We are attempting to classify images of handwritten digits 0-9. Each row of the data is one image, and each pixel is a single feature. Since the full data set is extremely large, we will use a sample of 1,400 images as our training data. There is also a sample of 1,000 images whose labels we want to predict.

We start by reading in the data.

df <- read.csv("trainSample.csv", header = TRUE)     # training data frame
predDf <- read.csv("testSample.csv", header = TRUE)  # data we want to predict

We hold out 20% of the training sample as a test set and fit the models on the remaining 80%.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(e1071)
index <- createDataPartition(df$label, p = 0.2, list = FALSE)
test <- df[index, ]
train <- df[-index, ]
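
The split above depends on the random number generator, so the accuracies reported below will change from run to run. A minimal sketch of a repeatable, per-digit stratified split follows. The toy data frame, its column names, and the seed are assumptions for illustration only. Passing the label as a factor asks caret to sample within each digit class; for a numeric vector, caret stratifies on quantile groups instead.

```r
library(caret)

set.seed(42)  # assumption: any fixed seed works; it only makes the split repeatable

# Hypothetical stand-in for the digit data: 200 rows, labels 0-9, 4 pixel columns
toy <- data.frame(
  label = rep(0:9, each = 20),
  matrix(sample(0:255, 200 * 4, replace = TRUE), nrow = 200)
)

# A factor outcome makes createDataPartition sample within each digit class
index <- createDataPartition(factor(toy$label), p = 0.2, list = FALSE)
toyTest  <- toy[index, ]
toyTrain <- toy[-index, ]

table(toyTest$label)                          # digits present in the held-out set
nrow(toyTrain) + nrow(toyTest) == nrow(toy)   # every row lands in exactly one set
```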

None of the models needs any further preprocessing up front, so any model-specific processing is done where it is needed.

Section 2: Naive Bayes

We will use the naivebayes package to fit this model and the caret package to evaluate the predictions. Naive Bayes predicts categories, not numbers, so we start by converting the labels to characters.

library(naivebayes)

## naivebayes 0.9.6 loaded

trainChar = train
trainChar$label <- as.character(trainChar$label)

Now we will run the model with the default parameters.

nbModel <- naive_bayes(label ~., trainChar)

predTrain = predict(nbModel, train)
predTest = predict(nbModel, test)

#predTrain <- as.factor(round(predTrain))
#PredTest<- as.factor(round(predTest))

trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)
confusionMatrix(predTrain, trainLabel)$overall['Accuracy']

## Accuracy 
## 0.559034

confusionMatrix(predTest, testLabel)$overall['Accuracy']

##  Accuracy 
## 0.5248227
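
For reference, the accuracy that confusionMatrix reports is the number of correctly classified images divided by the total. A small base R sketch with made-up labels (the six values below are hypothetical, not taken from the data) shows the same calculation by hand.

```r
# Hypothetical true labels and predictions for six images
truth <- factor(c(3, 3, 7, 7, 1, 1), levels = 0:9)
pred  <- factor(c(3, 7, 7, 7, 1, 3), levels = 0:9)

# Rows are predictions, columns are true labels (the orientation caret uses)
cm <- table(Prediction = pred, Reference = truth)

# Accuracy = correctly classified / total = diagonal sum / grand total
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 4 of the 6 images are classified correctly
```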

While the accuracy is over 50%, setting usepoisson = TRUE gives a much better result. With this option, the integer-valued pixel features are modeled with a Poisson distribution instead of the default Gaussian. A rather high value for the Laplace smoothing also raises the accuracy, so I settled on a value of 1000 for the laplace parameter.

nbModel <- naive_bayes(label ~., trainChar, laplace = 1000, usepoisson = T)
predTrain = predict(nbModel, train)
predTest = predict(nbModel, test)
trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)

#Accuracy of the Training Set
confusionMatrix(predTrain, trainLabel)$overall['Accuracy']

##  Accuracy 
## 0.8202147

#Accuracy of the Testing Set
confusionMatrix(predTest, testLabel)$overall['Accuracy']

##  Accuracy 
## 0.7836879
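
To make the Poisson option more concrete, here is a minimal base R sketch of a Poisson naive Bayes classifier. It is an illustration of the idea, not the naivebayes package's implementation: the toy data, the function name predict_pois_nb, and the pseudo-count eps are all assumptions. Each class gets one Poisson rate per feature (the class mean), and a new image is assigned to the class with the highest log prior plus summed Poisson log-likelihood.

```r
# Hypothetical toy data: two classes, three integer-valued "pixel" features
X <- rbind(c(0, 5, 9), c(1, 4, 8), c(0, 6, 7),   # class "a"
           c(7, 0, 1), c(9, 1, 0), c(8, 0, 2))   # class "b"
y <- factor(c("a", "a", "a", "b", "b", "b"))

# Per-class Poisson rate for every feature: the class mean of the counts.
# eps is an assumed pseudo-count (a stand-in for smoothing) that avoids lambda = 0.
eps <- 0.5
lambda <- t(sapply(levels(y), function(k)
  (colSums(X[y == k, , drop = FALSE]) + eps) / sum(y == k)))
prior <- sapply(levels(y), function(k) mean(y == k))

# Score = log prior + sum of per-feature Poisson log-likelihoods; pick the best class
predict_pois_nb <- function(x) {
  score <- sapply(levels(y), function(k)
    log(prior[[k]]) + sum(dpois(x, lambda[k, ], log = TRUE)))
  names(which.max(score))
}

predict_pois_nb(c(0, 5, 8))  # resembles the class "a" rows
predict_pois_nb(c(8, 1, 1))  # resembles the class "b" rows
```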

Section 3: Random Forest

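Running the random forest with default settings is too computationally expensive for my PC, so we will start by setting the number of trees to 2. Because label is numeric, caret fits a regression forest, so the predictions are rounded to the nearest digit before they are compared with the labels.

rfModel <- train(label ~ ., data = train, method = "rf", ntree = 2)

#rfModel <- train(label ~ ., data = train, method = "rf", ntree = 2, tuneGrid = data.frame(mtry = 2))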
predTrain = predict(rfModel, train)
predTest = predict(rfModel, test)

predTrain <- as.factor(round(predTrain))
predTest <- as.factor(round(predTest))

trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)



confusionMatrix(predTrain, trainLabel)$overall['Accuracy']

##  Accuracy 
## 0.7298748

confusionMatrix(predTest, testLabel)$overall['Accuracy']

##  Accuracy 
## 0.4219858

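The two parameters that are important to tune are ntree (the number of trees) and mtry (the number of features tried at each split). In caret, mtry is a tuning parameter and is set through tuneGrid. After trying different values, the best combination I found was an ntree of 2 and an mtry of 200.

rfModel <- train(label ~ ., data = train, method = "rf", tuneGrid = data.frame(mtry = 200), ntree = 2)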
predTrain = predict(rfModel, train)
predTest = predict(rfModel, test)

predTrain <- as.factor(round(predTrain))
predTest <- as.factor(round(predTest))

trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)



confusionMatrix(predTrain, trainLabel)$overall['Accuracy']

##  Accuracy 
## 0.7245081

confusionMatrix(predTest, testLabel)$overall['Accuracy']

## Accuracy 
##      0.5
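
As a possible alternative to the regression-and-round approach, the sketch below treats the label as a factor so that caret fits a classification forest, and lets caret compare several mtry values by cross-validation. This is not the setup that produced the numbers above: the toy data, the mtry grid, the 50 trees, and the 3 folds are assumptions chosen to keep the example small.

```r
library(caret)
library(randomForest)

set.seed(1)
# Hypothetical stand-in data: 3 classes separated along 2 of 6 features
n <- 150
toy <- data.frame(matrix(rnorm(n * 6), nrow = n))
toy$label <- factor(rep(c("0", "1", "2"), each = n / 3))
toy$X1 <- toy$X1 + 3 * as.integer(toy$label)
toy$X2 <- toy$X2 - 3 * as.integer(toy$label)

# A factor outcome makes caret fit a classification forest (no rounding needed).
# mtry goes through tuneGrid; ntree is passed on to randomForest().
rfFit <- train(
  label ~ ., data = toy, method = "rf",
  ntree = 50,
  tuneGrid = expand.grid(mtry = c(1, 2, 4)),
  trControl = trainControl(method = "cv", number = 3)
)

rfFit$bestTune              # mtry chosen by cross-validated accuracy
predict(rfFit, head(toy))   # predictions come back as factor levels
```

With this setup the candidate mtry values are compared on held-out folds of the training data, so the test set is not used for tuning.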

Section 4: Support Vector Machine

For the support vector machine, we will use the e1071 package. We will start by building a model with default settings, apart from turning off feature scaling.

library(e1071)
svmModel = svm(label ~ ., data = train, scale = FALSE)

We will then predict the training and testing data sets.

predTrain = predict(svmModel, train)
predTest = predict(svmModel, test)

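Because label is numeric, svm() fits a regression model by default. We therefore round the predictions and convert both the predictions and the labels to factors so that we can compare them in a confusion matrix.

predTrain <- as.factor(round(predTrain))
predTest <- as.factor(round(predTest))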
trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)

Finally, we look at the confusion matrices to get the accuracy.

confusionMatrix(predTrain, trainLabel)$overall['Accuracy']

##  Accuracy 
## 0.2871199

confusionMatrix(predTest, testLabel)$overall['Accuracy']

##   Accuracy 
## 0.08510638
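
Part of the problem is likely that a numeric label makes svm() solve a regression problem. The sketch below, on hypothetical toy data with made-up column names, shows how the type of the response changes what svm() fits and what predict() returns.

```r
library(e1071)

set.seed(11)
# Hypothetical stand-in data: digit-like labels 0-2 and two numeric features
toy <- data.frame(label = rep(0:2, each = 30), x1 = rnorm(90), x2 = rnorm(90))
toy$x1 <- toy$x1 + 3 * toy$label

# Numeric response: svm() defaults to eps-regression and predicts real numbers
regFit <- svm(label ~ ., data = toy)
head(predict(regFit, toy))

# Factor response: svm() defaults to C-classification and predicts class labels
toyF <- transform(toy, label = factor(label))
clsFit <- svm(label ~ ., data = toyF)
head(predict(clsFit, toyF))
```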

The accuracy is not very good. If we simply switch to C-classification, we get excellent accuracy on the training data and very poor accuracy on the testing data. But if we also set gamma to a very small value, we get good results.

svmModel = svm(label ~ ., data = train, scale = FALSE, type = 'C-classification', gamma = 0.000001)
predTrain = predict(svmModel, train)
predTest = predict(svmModel, test)

confusionMatrix(predTrain, trainLabel)$overall['Accuracy']

## Accuracy 
##        1

confusionMatrix(predTest, testLabel)$overall['Accuracy']

##  Accuracy 
## 0.8652482
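
A rough way to see why a small gamma helps with unscaled pixels is to evaluate the radial basis kernel by hand. The sketch below uses two random vectors as stand-ins for images and assumes 784 pixel columns with intensities from 0 to 255; neither assumption comes from the data itself. With the e1071 default of gamma = 1 / (number of features), the kernel value between two different images underflows to zero, so every training image looks similar only to itself, which is consistent with a model that memorizes the training set.

```r
set.seed(7)
# Two hypothetical 784-pixel images with raw intensities in 0..255 (unscaled)
a <- sample(0:255, 784, replace = TRUE)
b <- sample(0:255, 784, replace = TRUE)

# Radial basis kernel: exp(-gamma * ||u - v||^2)
rbf <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

sq_dist <- sum((a - b)^2)     # squared distances are in the millions on this scale
rbf(a, b, gamma = 1 / 784)    # default-sized gamma: underflows to 0
rbf(a, b, gamma = 1e-6)       # small gamma: a value strictly between 0 and 1
```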

Section 5: Algorithm Performance Comparison

Ranked by test accuracy, random forest was the worst (0.50), naive Bayes was next (0.78), and the support vector machine was the best (0.87). Although the support vector machine gave the best accuracy, I am rather skeptical of the result, since the model fits the training data perfectly (accuracy of 1) but scores noticeably lower on the test data. I would definitely rerun the models with more data and a more rigorous validation process. As for the computational power required, random forest was extremely expensive, naive Bayes was very cheap, and the support vector machine was in the middle.
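
One form the more rigorous validation could take is k-fold cross-validation. The sketch below is a minimal version on hypothetical toy data; the data, the choice of 5 folds, and the default SVM settings are assumptions, not results from this homework.

```r
library(e1071)

set.seed(3)
# Hypothetical stand-in data: two classes, four numeric features
n <- 120
toy <- data.frame(matrix(rnorm(n * 4), nrow = n))
toy$label <- factor(rep(c("0", "1"), each = n / 2))
toy$X1 <- toy$X1 + 2 * (toy$label == "1")

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold id for every row

# Fit on k - 1 folds, score on the held-out fold, repeat for every fold
foldAcc <- sapply(seq_len(k), function(i) {
  fit <- svm(label ~ ., data = toy[fold != i, ], type = "C-classification")
  mean(predict(fit, toy[fold == i, ]) == toy$label[fold == i])
})

mean(foldAcc)  # cross-validated accuracy estimate
```

Averaging over the folds gives an estimate that does not depend on a single 80/20 split, and the final test set can be kept aside until the parameters are fixed.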