We are attempting to classify images of handwritten digits 0-9. Each image is a single data point, with one feature per pixel. Since the full data set is extremely large, we will use a sample of 1,400 points as our training data. There is also a sample of 1,000 points whose labels we want to predict.
We start by reading in the data.
df <- read.csv("trainSample.csv", header = TRUE)      # data frame used for training
predDf <- read.csv("testSample.csv", header = TRUE)   # data we want to predict
We will hold out 20% of the training sample as a test set and fit the models on the remaining 80%.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
index <- createDataPartition(df$label, p = .2, list = FALSE)
test <- df[index, ]
train <- df[-index, ]
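Because df$label is numeric, createDataPartition stratifies on percentile groups of the label rather than on each digit, and the split changes from run to run. A minimal sketch of a seeded, per-class split follows. It uses the built-in iris data as a stand-in for the digit data, and the seed value and object names are arbitrary.

```r
library(caret)

set.seed(123)                        # arbitrary seed so the split can be reproduced
toy <- iris                          # stand-in data; Species plays the role of label

# A factor input makes createDataPartition sample within each class
idx <- createDataPartition(toy$Species, p = 0.2, list = FALSE)
toyTest <- toy[idx, ]
toyTrain <- toy[-idx, ]

table(toyTest$Species)               # class counts in the held-out 20%
```

For the digit data, the analogous call would pass as.factor(df$label) as the first argument.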
No other preprocessing is shared by all of the models, so any further processing will be done as each model needs it.
We will use the naivebayes package to fit this model and the caret package to evaluate its predictions. Naive Bayes predicts categories rather than numbers, so we start by converting the labels to characters.
library(naivebayes)
## naivebayes 0.9.6 loaded
trainChar = train
trainChar$label <- as.character(trainChar$label)
Now we will run the model with the default parameters.
nbModel <- naive_bayes(label ~., trainChar)
predTrain = predict(nbModel, train)
predTest = predict(nbModel, test)
#predTrain <- as.factor(round(predTrain))
#PredTest<- as.factor(round(predTest))
trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)
confusionMatrix(predTrain, trainLabel)$overall['Accuracy']
## Accuracy
## 0.559034
confusionMatrix(predTest, testLabel)$overall['Accuracy']
## Accuracy
## 0.5248227
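The accuracy that confusionMatrix reports is the share of predictions that match the true labels. A small base R sketch with made-up vectors (not the digit data) shows the equivalent calculation. The name accuracyFromLabels is a hypothetical helper, not part of caret.

```r
# Hypothetical helper: share of predictions that equal the true label
accuracyFromLabels <- function(pred, truth) {
  stopifnot(length(pred) == length(truth))
  mean(as.character(pred) == as.character(truth))
}

# Made-up labels for illustration only
truth <- factor(c(0, 1, 2, 2, 1, 0))
pred  <- factor(c(0, 1, 2, 1, 1, 2), levels = levels(truth))

accuracyFromLabels(pred, truth)      # 4 of the 6 predictions match
```

Comparing the values as characters avoids any dependence on the two factors sharing the same set of levels.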
While the accuracy is over 50%, we get a much better result if we set usepoisson to TRUE. This tells naive_bayes to model the integer-valued predictors (here, the pixel intensities) with a Poisson distribution instead of a Gaussian. A rather high value for the Laplace smoothing also raises the accuracy, so I settled on a value of 1000 for the laplace parameter.
nbModel <- naive_bayes(label ~ ., trainChar, laplace = 1000, usepoisson = TRUE)
predTrain = predict(nbModel, train)
predTest = predict(nbModel, test)
trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)
#Accuracy of the Training Set
confusionMatrix(predTrain, trainLabel)$overall['Accuracy']
## Accuracy
## 0.8202147
#Accuracy of the Testing Set
confusionMatrix(predTest, testLabel)$overall['Accuracy']
## Accuracy
## 0.7836879
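The idea behind the Poisson variant can be illustrated by hand. The sketch below is a simplified Poisson naive Bayes in base R on a made-up toy data set: each class gets one Poisson rate per feature, and a new observation is assigned to the class with the highest log prior plus summed Poisson log-densities. The pseudo-count smoothing is an assumption chosen for illustration and is not necessarily how the naivebayes package applies its laplace argument.

```r
# Toy data: 6 observations, 2 integer features, 2 classes (made-up values)
x <- matrix(c(0L, 1L, 0L, 9L, 8L, 10L,
              5L, 4L, 6L, 0L, 1L, 0L), ncol = 2)
y <- factor(c("a", "a", "a", "b", "b", "b"))

pseudo <- 1  # assumed pseudo-count that keeps every rate strictly positive

# One Poisson rate per class (rows) and feature (columns)
lambda <- t(sapply(levels(y), function(cl) {
  rows <- y == cl
  (colSums(x[rows, , drop = FALSE]) + pseudo) / (sum(rows) + pseudo)
}))

# Class priors estimated from the class frequencies
logPrior <- log(setNames(as.numeric(table(y)) / length(y), levels(y)))

# Score a new observation: log prior + sum of Poisson log-densities
poissonNbPredict <- function(newX) {
  scores <- sapply(levels(y), function(cl) {
    logPrior[[cl]] + sum(dpois(newX, lambda[cl, ], log = TRUE))
  })
  names(which.max(scores))
}

poissonNbPredict(c(1L, 5L))   # low first feature, high second: resembles class "a"
poissonNbPredict(c(9L, 0L))   # high first feature, low second: resembles class "b"
```

A rate of exactly zero would give any nonzero pixel value a probability of zero for that class, which is one reason smoothing can matter so much on image data with many blank pixels.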
Running the random forest with the default settings is too computationally expensive for my PC, so we will start by setting the number of trees to 2.
rfModel <- train(label ~ ., data = train, method = "rf", ntree = 2)
#rfModel <- train(label ~ ., data = train, method = "rf", ntree = 2, tuneGrid = data.frame(mtry = 2))
predTrain = predict(rfModel, train)
predTest = predict(rfModel, test)
# the predictions are continuous, so round them to the nearest digit before comparing
predTrain <- as.factor(round(predTrain))
predTest <- as.factor(round(predTest))
trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)
confusionMatrix(predTrain, trainLabel)$overall['Accuracy']
## Accuracy
## 0.7298748
confusionMatrix(predTest, testLabel)$overall['Accuracy']
## Accuracy
## 0.4219858
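Part of the cost probably comes from caret rather than from the forest itself: by default, train refits the model on 25 bootstrap resamples for each candidate value of its tuning parameter. The sketch below, on the built-in iris data rather than the digit data, fits a single forest with resampling switched off. It also uses a factor response, so the forest is a classifier and its predictions need no rounding. The seed, ntree, and mtry values are illustrative assumptions.

```r
library(caret)   # method = "rf" also needs the randomForest package installed

set.seed(1)                                   # arbitrary seed
toy <- iris                                   # stand-in data; Species is already a factor

fitOnce <- train(
  Species ~ ., data = toy, method = "rf",
  trControl = trainControl(method = "none"),  # fit one model, no resampling
  tuneGrid = data.frame(mtry = 2),            # exactly one candidate is required without resampling
  ntree = 50                                  # passed through to randomForest
)

predToy <- predict(fitOnce, toy)              # factor predictions
mean(predToy == toy$Species)                  # training accuracy of the single fit
```

For the digit data, the same pattern would need the label converted with as.factor before fitting.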
The two parameters that are important to tune are ntree and mtry. After trying different values of these, the best settings I found were an ntree of 2 and an mtry of 200.
# mtry is a caret tuning parameter, so it is supplied through tuneGrid
rfModel <- train(label ~ ., data = train, method = "rf", tuneGrid = data.frame(mtry = 200), ntree = 2)
predTrain = predict(rfModel, train)
predTest = predict(rfModel, test)
predTrain <- as.factor(round(predTrain))
predTest<- as.factor(round(predTest))
trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)
confusionMatrix(predTrain, trainLabel)$overall['Accuracy']
## Accuracy
## 0.7245081
confusionMatrix(predTest, testLabel)$overall['Accuracy']
## Accuracy
## 0.5
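In caret, mtry (the randomForest argument for the number of variables tried at each split) is a tuning parameter of method = "rf", so candidate values are supplied through tuneGrid and compared by resampling. A hedged sketch of such a search on the built-in iris data follows. The candidate values, fold count, ntree, and seed are assumptions made for illustration.

```r
library(caret)   # method = "rf" also needs the randomForest package installed

set.seed(2)                                   # arbitrary seed
grid <- data.frame(mtry = c(1, 2, 4))         # candidate values; iris has only 4 predictors

cvFit <- train(
  Species ~ ., data = iris, method = "rf",
  trControl = trainControl(method = "cv", number = 3),  # 3-fold cross-validation
  tuneGrid = grid,
  ntree = 50
)

cvFit$results[, c("mtry", "Accuracy")]        # cross-validated accuracy per candidate
cvFit$bestTune                                # the mtry value that train selected
```

Choosing mtry from cross-validated accuracy, rather than from the accuracy on a single held-out set, keeps the test set free for a final check.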
For the support vector machine, we will use the e1071 package. We will start by building a model with the default settings.
library(e1071)
svmModel = svm(label ~ ., data = train, scale = FALSE)
We will then predict the training and testing data sets.
predTrain = predict(svmModel, train)
predTest = predict(svmModel, test)
The predictions are continuous, so we round them and convert both the predictions and the labels to factors before comparing them in a confusion matrix.
predTrain <- as.factor(round(predTrain))
predTest<- as.factor(round(predTest))
trainLabel <- as.factor(train$label)
testLabel <- as.factor(test$label)
Finally we look at the confusion matrices to get the accuracy.
confusionMatrix(predTrain, trainLabel)$overall['Accuracy']
## Accuracy
## 0.2871199
confusionMatrix(predTest, testLabel)$overall['Accuracy']
## Accuracy
## 0.08510638
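The low accuracy is likely tied to the model type: when the response is numeric, svm defaults to eps-regression, and it defaults to C-classification only when the response is a factor. A small sketch on the built-in iris data (a stand-in, with a numeric label column created for the purpose) shows the difference in what predict returns.

```r
library(e1071)

toy <- iris
toy$label <- as.integer(toy$Species)          # numeric label 1..3, mimicking a numeric digit label
toy$Species <- NULL

regFit <- svm(label ~ ., data = toy)          # numeric response: regression by default

toyF <- transform(toy, label = factor(label)) # same data with a factor label
clsFit <- svm(label ~ ., data = toyF)         # factor response: classification by default

is.numeric(predict(regFit, toy))              # continuous predictions
is.factor(predict(clsFit, toyF))              # class predictions
```

Converting the label to a factor is therefore an alternative to passing type = 'C-classification' explicitly.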
We do not get a very good accuracy. If we just switch to C-classification, we get incredible accuracy on the training data and incredibly poor accuracy on the testing data. But if we also set a very small gamma, we get good results.
svmModel = svm(label ~ ., data = train, scale = FALSE, type = 'C-classification', gamma = 0.000001)
predTrain = predict(svmModel, train)
predTest = predict(svmModel, test)
confusionMatrix(predTrain, trainLabel)$overall['Accuracy']
## Accuracy
## 1
confusionMatrix(predTest, testLabel)$overall['Accuracy']
## Accuracy
## 0.8652482
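The sensitivity to gamma can be reasoned about from the radial kernel itself, exp(-gamma * ||u - v||^2). With scale = FALSE the pixel values keep their original range, so squared distances between images are very large, and the e1071 default of gamma = 1 / (number of columns) drives the kernel toward zero for any two distinct images. The base R sketch below uses two random vectors as stand-ins for images. The 784-pixel size and the 0-255 intensity range are assumptions, since the source does not state them.

```r
set.seed(1)                                   # arbitrary seed
nPixels <- 784                                # assumed 28 x 28 images
u <- sample(0:255, nPixels, replace = TRUE)   # random stand-in for one image
v <- sample(0:255, nPixels, replace = TRUE)   # random stand-in for another image

sqDist <- sum((u - v)^2)                      # squared Euclidean distance, unscaled
rbf <- function(gamma) exp(-gamma * sqDist)   # radial kernel value for this pair

rbf(1 / nPixels)                              # default-sized gamma: kernel is effectively zero
rbf(1e-6)                                     # much smaller gamma: kernel stays above zero
```

When every kernel value between distinct images is effectively zero, each training image mostly resembles only itself, which is consistent with a model that memorizes the training set and generalizes poorly.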
If I were to rank the models by test accuracy, random forest was the worst (0.50), naive Bayes was next (0.78), and the support vector machine was the best (0.87). While the support vector machine gave the best accuracy, I am rather skeptical of the results, given the perfect training accuracy and the small test set. I would definitely rerun the models on more data and use a more rigorous validation process. As for the computational power each model required, random forest was extremely expensive, naive Bayes was very cheap, and the support vector machine was in the middle.
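One form a more rigorous validation process could take is k-fold cross-validation, which averages the accuracy over several held-out folds instead of relying on one split. A minimal base R loop around e1071's svm is sketched below on the built-in iris data. The number of folds and the seed are arbitrary choices for illustration.

```r
library(e1071)

set.seed(42)                                  # arbitrary seed
k <- 5                                        # assumed number of folds
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))  # random fold id per row

foldAcc <- sapply(seq_len(k), function(i) {
  fit <- svm(Species ~ ., data = iris[folds != i, ])       # train on the other folds
  pred <- predict(fit, iris[folds == i, ])                 # predict the held-out fold
  mean(pred == iris$Species[folds == i])                   # accuracy on that fold
})

foldAcc                                       # one accuracy per fold
mean(foldAcc)                                 # cross-validated estimate
```

The spread of the per-fold accuracies also gives a rough sense of how much a single train/test split can vary.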