The full tutorial of this analysis, which reads in the raw .JPG images, can be found at
http://kyle-stahl-mn.com/computer-vision-1

Step 0: Read in training data

traindata <- read.csv("data_train.csv")
traindata$label <- as.factor(traindata$label)

Step 1: Clean data

The MNIST data set is well organized as downloaded. The challenge is that the data is sparse and high-dimensional. Principal Component Analysis was used to reduce the dimensionality: the pixel values were first rescaled to a 0-1 scale, and then a singular value decomposition was performed on the covariance matrix.

## Principal Component Analysis for Variable Reduction ----

# Transform data to be on a 0 - 1 scale
traindata[,-1] <- (traindata[,-1]/255)

# Create covariance matrix, then zero out its diagonal
Sigma <- var(as.matrix(traindata[,-1]))
diag(Sigma) <- 0

# Singular Value Decomposition (computed once, then the pieces extracted)
decomposition <- svd(Sigma)
D <- decomposition$d
U <- decomposition$u

By reducing the dimensionality of the data set, some of the variation is inevitably lost. We can either directly choose how many variables to keep and sum over that many indices of the vector 'D' from the singular value decomposition, or we can choose an amount of variation to retain and determine how many variables need to be kept to maintain that percentage of variation within the data. Here the retained variation is set to .99. This cut the number of variables almost in half, from 784 down to 399.

#Value deciding how much variation to retain in the PCA
variation <- .99

#Figure out how many variables (k) to include to keep that level of variation
for (k in seq(length(D)))
{
  if( (sum(D[1:k])/sum(D)) >= variation){break}
}
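
The same value of 'k' can also be found without the loop, since the cumulative proportion of the singular values gives the retained variation directly. A minimal equivalent sketch in base R:

# Vectorized alternative: cumulative proportion of variation retained
cumulative <- cumsum(D) / sum(D)
k <- which(cumulative >= variation)[1]  # smallest 'k' that reaches the threshold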

After we know how many variables 'k' are being kept, we take the first 'k' columns of the 'U' matrix given by the singular value decomposition. The coefficients in that matrix define 'k' different linear combinations of the original variables. As stated before, in this instance k = 399, and those 399 variables capture 99% of the variation within the 784 original variables.

#Reduce U matrix to down to 'k' columns
U <- U[,1:k]

# Use U matrix and data to map data to lower dimension
ReducedData <- as.matrix(traindata[,-1]) %*% U

# Replace Transformed Variables into Training Data Set
traindata <- data.frame(traindata[,1],ReducedData)
names(traindata)[1] <- "label"
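
As a quick sanity check, the reduced training frame should now hold the label plus 'k' principal component scores (the exact row count depends on the training file used):

# Verify the new dimensionality: 1 label column + k = 399 component columns
dim(traindata)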

Step 2: Building the model

Ten different logistic regression models are made, one for each possible digit 0-9. To do this, a new training set is created for each digit. Instead of a 'label' indicating which digit a particular sample is, each new set has a binary label: in "isone" the label equals 1 if the observation is a 1 and 0 otherwise, and likewise for each of the digits 0-9.

## Create data frame to train logistic regressions for each number ----
iszero <- as.numeric(traindata$label == 0)
isone <- as.numeric(traindata$label == 1)
istwo <- as.numeric(traindata$label == 2)
isthree <- as.numeric(traindata$label == 3)
isfour <- as.numeric(traindata$label == 4)
isfive <- as.numeric(traindata$label == 5)
issix <- as.numeric(traindata$label == 6)
isseven <- as.numeric(traindata$label == 7)
iseight <- as.numeric(traindata$label == 8)
isnine <- as.numeric(traindata$label == 9)

# Add each new binary 'label' column to the transformed data
trainzero <- cbind(iszero, traindata[,-1])
trainone <- cbind(isone, traindata[,-1])
traintwo <- cbind(istwo, traindata[,-1])
trainthree <- cbind(isthree, traindata[,-1])
trainfour <- cbind(isfour, traindata[,-1])
trainfive <- cbind(isfive, traindata[,-1])
trainsix <- cbind(issix, traindata[,-1])
trainseven <- cbind(isseven, traindata[,-1])
traineight <- cbind(iseight, traindata[,-1])
trainnine <- cbind(isnine, traindata[,-1])
rm(iszero, isone, istwo, isthree, isfour, isfive, issix, isseven, iseight, isnine)
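
The ten frames above can also be built programmatically in a list, which avoids the near-duplicate lines; a minimal sketch (the 'trainsets' name and the uniform 'isdigit' column are just for illustration):

# Build one binary-labelled training frame per digit
trainsets <- lapply(0:9, function(d) {
  cbind(isdigit = as.numeric(traindata$label == d), traindata[,-1])
})
names(trainsets) <- paste0("train", 0:9)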

Binomial logistic regression models were used to predict the probability that each observation is a 0, a 1, a 2, and so on. We will use whichever model is most confident that the observation equals its particular digit. So if 'probseven' predicts an 80% chance that an observation is a seven, and the rest of the models predict less than a 50% chance, then we guess that the observation is a seven.

## Train Models ----
probone <- glm(isone ~ . , data = trainone, family = "binomial")
probtwo <- glm(istwo ~ . , data = traintwo, family = "binomial")
probthree <- glm(isthree ~ . , data = trainthree, family = "binomial")
probfour <- glm(isfour ~ . , data = trainfour, family = "binomial")
probfive <- glm(isfive ~ . , data = trainfive, family = "binomial")
probsix <- glm(issix ~ . , data = trainsix, family = "binomial")
probseven <- glm(isseven ~ . , data = trainseven, family = "binomial")
probeight <- glm(iseight ~ . , data = traineight, family = "binomial")
probnine <- glm(isnine ~ . , data = trainnine, family = "binomial")
probzero <- glm(iszero ~ . , data = trainzero, family = "binomial")
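
Using the 'trainsets' list sketched in the previous step, the ten fits could likewise be collapsed into a single call; again, this is only an illustration alongside the explicit models above:

# Fit one binomial logistic regression per digit from the 'trainsets' list
models <- lapply(trainsets, function(df) glm(isdigit ~ ., data = df, family = "binomial"))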

The test set is imported and the same transformation that was applied to the training data (rescaling to 0-1, then projecting onto the 'U' matrix) is now applied to the test data. The predictions for each observation in the test set are assembled into a data frame, deliberately placed in numeric order.

## Import and transform test data ----
testdata <- read.csv("data_test.csv")
# Rescale to 0 - 1 as was done for the training data, then project onto U
testdata <- as.matrix(testdata/255) %*% U
testdata <- data.frame(testdata)

# type = "response" returns predicted probabilities rather than log-odds
ProbabilityOfEachValue <- data.frame(predict(probone, testdata, type = "response"),
                                     predict(probtwo, testdata, type = "response"),
                                     predict(probthree, testdata, type = "response"),
                                     predict(probfour, testdata, type = "response"),
                                     predict(probfive, testdata, type = "response"),
                                     predict(probsix, testdata, type = "response"),
                                     predict(probseven, testdata, type = "response"),
                                     predict(probeight, testdata, type = "response"),
                                     predict(probnine, testdata, type = "response"),
                                     predict(probzero, testdata, type = "response"))

Step 3: Make predictions and output results

For each observation, the index with the highest predicted probability (the max in that row) corresponds to the model that is most confident, so that model's digit is used as our guess. The 0's are in the 10th position, so any guess of 10 is changed to a 0; this is the reason the predictions were put in numeric order.

# Find the guess by taking the highest value across the ten models
Label <- rep(NA, nrow(ProbabilityOfEachValue))
for (i in seq(nrow(ProbabilityOfEachValue)))
{
  Label[i] <- which.max(ProbabilityOfEachValue[i,])
}
Label[Label == 10] <- 0
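
Base R can do the same row-wise argmax without the loop; a minimal equivalent sketch:

# Vectorized alternative: column index of each row's maximum
Label <- max.col(ProbabilityOfEachValue, ties.method = "first")
Label[Label == 10] <- 0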

The results are formatted so that each observation is given the correct ImageId, in the same order the observations were imported. The result is a two-column .csv file: one column has the ImageId, the other has the guessed digit.

ImageId <- seq(length(Label))
submission <- data.frame(ImageId,Label)
write.csv(submission, file = "ImageRecognition1.csv",row.names = FALSE)

Step 4: Conclusion

This method is similar to a neural network with zero hidden layers: all the inputs feed directly into the outputs, and the weight on each connection is given directly by the corresponding logistic regression coefficient. Kaggle graded this submission at 81% accuracy. On the surface this seems pretty good, but it is actually toward the bottom of the leaderboard. There are much more sophisticated prediction algorithms that can be used and combined to give better accuracy, and this is a fairly simple problem for image recognition and predictive analytics. But sometimes you have to crawl before you can walk, and this is a good start.
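
To make the zero-hidden-layer analogy concrete, a single model's predictions can be reproduced by hand from its fitted coefficients: a bias term plus a weighted sum of the inputs, passed through the logistic (sigmoid) function. A minimal sketch using 'probseven' and the transformed test data (assuming the fit produced no NA coefficients):

# Reproduce predict(probseven, testdata, type = "response") manually
w <- coef(probseven)                 # intercept (bias) plus one weight per component
X <- cbind(1, as.matrix(testdata))   # prepend a column of 1s for the intercept
manual <- 1 / (1 + exp(-(X %*% w)))  # sigmoid of the weighted sums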