Introduction

OCR software divides the page into a rectangular grid so that each cell contains a single glyph, which could be a symbol, letter, or number. It then attempts to match each glyph to a character in a known character set. Finally, the recognized characters are combined into words, which can be spell-checked against a dictionary for greater accuracy.

In this project, I’m making the assumption that the document contains only alphabetic characters in the English language.

Collecting Data

I’m using a dataset donated to the UCI Machine Learning Repository by D.J. Slate and W. Frey.

Description of the dataset

The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15.

The dataset can be downloaded from the UCI repository; the URL appears in the download code below.

How are glyphs represented mathematically?

According to Slate and Frey, the pixelated glyphs can be represented mathematically by 16 statistical attributes. The attributes measure characteristics such as the horizontal and vertical dimensions of the glyph, the average horizontal and vertical position of the pixels, and the proportion of black versus white pixels. According to them, these attributes provide a way to differentiate among the letters of the English alphabet.
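To make the idea concrete, here is a small sketch that computes a few attributes of this kind from a binary pixel matrix. The glyph matrix and the function name glyph_features are made up for illustration, and the formulas are simplified stand-ins, not Slate and Frey's exact definitions or their 0 to 15 scaling.

```r
# A toy 7x5 binary glyph (1 = "on" pixel), roughly an "L" shape.
# Hypothetical example data, not taken from the dataset.
glyph <- matrix(c(
  1, 0, 0, 0, 0,
  1, 0, 0, 0, 0,
  1, 0, 0, 0, 0,
  1, 0, 0, 0, 0,
  1, 0, 0, 0, 0,
  1, 0, 0, 0, 0,
  1, 1, 1, 1, 0
), nrow = 7, byrow = TRUE)

glyph_features <- function(g) {
  on <- which(g == 1, arr.ind = TRUE)  # row/column index of each on pixel
  rows <- on[, "row"]
  cols <- on[, "col"]
  c(
    width  = max(cols) - min(cols) + 1,  # bounding-box width
    height = max(rows) - min(rows) + 1,  # bounding-box height
    onpix  = nrow(on),                   # number of on pixels
    xbar   = mean(cols),                 # mean horizontal position
    ybar   = mean(rows),                 # mean vertical position
    onfrac = mean(g == 1)                # share of on pixels in the image
  )
}

features <- glyph_features(glyph)
print(features)
```

Each glyph is reduced to a short numeric vector like this one, which is the form a classifier can work with.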

Data reading and preparation

Downloading the data

if (!file.exists("ocrdata.csv")){
        link <- "https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data"
        download.file(link, destfile = "ocrdata.csv", method = "curl")
}
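# The file has no header row. With the default header = TRUE, read.csv() uses the
# first record as column names, which is why only 19,999 rows are reported below.
# Passing header = FALSE would keep all 20,000 records.
# stringsAsFactors = TRUE keeps the letter column a factor on R >= 4.0.
ocrdata <- read.csv("ocrdata.csv", stringsAsFactors = TRUE)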
str(ocrdata)
## 'data.frame':    19999 obs. of  17 variables:
##  $ T   : Factor w/ 26 levels "A","B","C","D",..: 9 4 14 7 19 2 1 10 13 24 ...
##  $ X2  : int  5 4 7 2 4 4 1 2 11 3 ...
##  $ X8  : int  12 11 11 1 11 2 1 2 15 9 ...
##  $ X3  : int  3 6 6 3 5 5 3 4 13 5 ...
##  $ X5  : int  7 8 6 1 8 4 2 4 9 7 ...
##  $ X1  : int  2 6 3 1 3 4 1 2 7 4 ...
##  $ X8.1: int  10 10 5 8 8 8 8 10 13 8 ...
##  $ X13 : int  5 6 9 6 8 7 2 6 2 7 ...
##  $ X0  : int  5 2 4 6 6 6 2 2 6 3 ...
##  $ X6  : int  4 6 6 6 9 6 2 6 2 8 ...
##  $ X6.1: int  13 10 4 6 5 7 8 12 12 5 ...
##  $ X10 : int  3 3 4 5 6 6 2 4 1 6 ...
##  $ X8.2: int  9 7 10 9 6 6 8 8 9 8 ...
##  $ X0.1: int  2 3 6 1 0 2 1 1 8 2 ...
##  $ X8.3: int  8 7 10 7 8 8 6 6 1 8 ...
##  $ X0.2: int  4 3 2 5 9 7 2 1 1 6 ...
##  $ X8.4: int  10 9 8 10 7 10 7 7 8 7 ...

Renaming the columns to meaningful names

Right now, the column headings do not make sense, because R took the first record of the file as the header. Let’s rename them.

# Before:
names(ocrdata)
##  [1] "T"    "X2"   "X8"   "X3"   "X5"   "X1"   "X8.1" "X13"  "X0"   "X6"  
## [11] "X6.1" "X10"  "X8.2" "X0.1" "X8.3" "X0.2" "X8.4"
# After:
names(ocrdata) <- c("letter", "xbox", "ybox", "width", "height", "onpix", "xbar", "ybar", "x2bar", "y2bar", "xybar", "x2ybar",
                    "xy2bar", "xedge", "xedgey", "yedge", "yedgex")
names(ocrdata)
##  [1] "letter" "xbox"   "ybox"   "width"  "height" "onpix"  "xbar"  
##  [8] "ybar"   "x2bar"  "y2bar"  "xybar"  "x2ybar" "xy2bar" "xedge" 
## [15] "xedgey" "yedge"  "yedgex"
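The odd names above come from read.csv() treating the first record as a header. As a hedged alternative, the sketch below reads headerless data with header = FALSE and assigns the names in one step through col.names. The two sample records are the first two rows implied by the output above; the variable names sample_text, col_names and sample_data are only for illustration.

```r
# First two records of the file, inlined so the example runs on its own.
sample_text <- "T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8
I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10"

col_names <- c("letter", "xbox", "ybox", "width", "height", "onpix", "xbar",
               "ybar", "x2bar", "y2bar", "xybar", "x2ybar", "xy2bar",
               "xedge", "xedgey", "yedge", "yedgex")

# header = FALSE keeps the first record as data instead of column names.
# colClasses fixes the types explicitly: one factor, then 16 integers.
sample_data <- read.csv(text = sample_text, header = FALSE,
                        col.names = col_names,
                        colClasses = c("factor", rep("integer", 16)))
str(sample_data)
```

Read this way, the full file would keep all 20,000 records, so the row counts and results would differ slightly from the ones shown in this post.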

Since I’m going to use the SVM algorithm, two requirements should be kept in mind:

  1. SVM learners require all variables to be numeric.
  2. Each variable should be scaled to a fairly small interval.

All my variables are already numeric, so the first requirement is met. The second would normally require me to normalize the data, but the R package I will use to fit the SVM model rescales the data for me by default.
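For reference, here is a minimal sketch of what such rescaling looks like when done by hand in base R. The small data frame toy and the helper min_max are invented for illustration; kernlab's ksvm() handles scaling internally when its scaled argument is left at the default TRUE, so this step is not needed in the workflow here.

```r
# Toy stand-in for two of the feature columns.
toy <- data.frame(xbox = c(2, 5, 4, 7), ybox = c(8, 12, 11, 1))

# Standardize: zero mean and unit variance per column.
standardized <- as.data.frame(scale(toy))

# Min-max normalization: map each column to the interval [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
normalized <- as.data.frame(lapply(toy, min_max))

print(round(colMeans(standardized), 10))
print(sapply(normalized, range))
```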

Dividing the big dataset into training and test data

The dataset is already in random order (although I could reshuffle it), so I’m now going to use the first 16,000 rows (80%) as training data and the remaining rows (20%) as test data to evaluate the model.

ocrTrain <- ocrdata[1:16000,]
# Note: ocrdata has 19,999 rows, so index 20000 does not exist and the last row of ocrTest is all NA.
ocrTest <- ocrdata[16001:20000,]
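If the row order could not be trusted, the split could be redone at random. A minimal sketch, assuming an arbitrary seed and a toy data frame toy_data standing in for ocrdata:

```r
set.seed(42)  # arbitrary seed, only to make the split reproducible

# Toy stand-in for ocrdata: 100 rows, one label and one feature.
toy_data <- data.frame(
  letter = factor(sample(LETTERS, 100, replace = TRUE)),
  xbox   = sample(0:15, 100, replace = TRUE)
)

n <- nrow(toy_data)
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of row indices, no repeats

toy_train <- toy_data[train_idx, ]
toy_test  <- toy_data[-train_idx, ]            # the remaining 20%

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Deriving the test rows from the training indices, rather than from hard-coded row numbers, keeps the two sets disjoint whatever the size of the data.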

Training the model on the data

Package Options available

  1. e1071 package, which provides an R interface for the widely used LIBSVM library written in C++
  2. klaR package, which provides an R interface to the SVMlight implementation
  3. kernlab package (good for beginners), written natively in R

I’m going to use the kernlab package, primarily because it is beginner-friendly and works seamlessly with the caret package.

Training a simple linear SVM classifier

I will use the ksvm function from the kernlab library with the linear kernel (vanilladot).

library(kernlab)
classifier <- ksvm(letter ~ ., data = ocrTrain, kernel = "vanilladot")
##  Setting default kernel parameters
classifier
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 7039 
## 
## Objective Function Value : -14.1747 -20.007 -23.5629 -6.2007 -7.5523 -32.7693 -49.9788 -18.1824 -62.111 -32.7284 -16.221 -32.2839 -28.9776 -51.2192 -13.276 -35.6223 -30.8612 -16.5255 -14.681 -32.7472 -30.3216 -7.7959 -11.814 -32.3455 -13.126 -9.2693 -153.165 -52.9678 -76.7743 -119.2073 -165.4435 -54.6247 -41.9818 -67.2686 -25.1959 -27.6368 -26.41 -35.5578 -41.26 -122.1636 -187.9174 -222.0861 -21.4765 -10.3749 -56.3682 -12.2279 -49.4902 -9.3371 -19.2099 -11.1776 -100.2194 -29.14 -238.0507 -77.1985 -8.334 -4.5309 -139.8544 -80.8849 -20.3643 -13.0243 -82.515 -14.5037 -26.7516 -18.5709 -23.9512 -27.3041 -53.273 -11.4773 -5.1202 -13.9501 -4.4981 -3.5754 -8.4912 -40.971 -49.8188 -190.0265 -43.8604 -44.868 -45.258 -13.5555 -17.767 -87.4103 -107.1064 -37.025 -30.713 -112.3208 -32.9635 -27.2966 -35.5832 -17.8585 -5.1394 -43.4089 -7.7841 -16.6797 -58.51 -159.9932 -49.0779 -37.8439 -32.801 -74.5254 -133.3417 -11.164 -5.3575 -12.4375 -30.9902 -141.6928 -54.2953 -179.012 -99.8894 -10.288 -15.1555 -3.7818 -67.612 -7.6958 -88.9304 -47.6447 -94.3718 -70.2735 -71.5066 -21.7856 -12.7654 -7.4383 -23.5023 -13.1052 -239.9699 -30.4194 -25.211 -136.2793 -140.9563 -9.812 -34.4584 -6.304 -60.8422 -66.5785 -27.282 -214.3225 -34.7801 -16.7631 -135.7818 -160.627 -45.2949 -25.1021 -144.9052 -82.2355 -327.7157 -142.0611 -158.8819 -32.2184 -32.8889 -52.9638 -25.4942 -47.9924 -6.8991 -9.7296 -36.4361 -70.3911 -187.7606 -46.9366 -89.8108 -143.4214 -624.3642 -119.2205 -145.4432 -327.7745 -33.3256 -64.0603 -145.4829 -116.5903 -36.2988 -66.3768 -44.8241 -7.509 -217.9246 -12.971 -30.5035 -2.0371 -6.1261 -14.4445 -21.6334 -57.3084 -20.6923 -184.3623 -20.105 -4.1485 -4.5347 -0.8281 -121.4429 -7.9484 -58.5602 -21.4882 -13.5474 -5.6465 -15.6294 -28.9573 -20.5961 -76.7112 -27.0123 -94.7105 -15.1714 -10.0223 -7.6397 -1.5785 -87.6952 -6.2237 -99.3707 -101.0906 -45.6639 -24.0721 -61.7692 -24.1578 -52.2364 -234.326 -39.9757 -48.8561 -34.1458 -20.9665 -11.4524 -123.0291 -6.4901 -5.1868 -8.8018 -9.4612 
-21.7736 -24.2361 -123.3978 -31.4396 -88.3897 -30.0912 -13.8194 -9.2701 -3.0825 -87.9616 -6.3842 -13.9679 -65.0712 -105.5232 -13.7404 -13.7627 -50.4226 -2.9331 -8.429 -80.9508 -36.4142 -112.7479 -4.1714 -7.8989 -1.2678 -90.8033 -21.4921 -7.2235 -47.9551 -3.3832 -20.433 -64.6126 -45.5778 -56.1314 -6.1347 -18.6305 -2.3742 -72.2553 -111.188 -106.765 -23.1321 -19.3763 -54.9815 -34.2944 -64.4748 -20.4109 -6.6886 -4.3781 -59.1414 -34.2461 -58.1506 -33.8664 -10.6902 -53.1394 -13.7482 -20.1987 -55.092 -3.8058 -60.0373 -235.484 -12.6837 -11.7408 -17.3059 -9.7171 -65.8491 -17.1047 -42.8136 -53.1058 -25.0432 -15.3018 -44.0747 -16.9584 -62.9777 -5.2037 -5.2966 -86.1709 -3.7209 -6.3449 -1.1265 -122.5773 -23.904 -355.0149 -31.1009 -32.6198 -4.9668 -84.1037 -134.5943 -72.8374 -23.9003 -35.5893 -11.7117 -22.2889 -1.8598 -59.2178 -8.8997 -150.7441 -1.8536 -1.9713 -9.9677 -0.5208 -26.9227 -30.4291 -5.6286 
## Training error : 0.13025

Now let’s see how well the model will perform with the test dataset.

Testing the model and evaluation

I can use the predict() function to use the model to make predictions on the test dataset.

ocrPredictions <- predict(classifier, ocrTest)

This now returns the predicted character of each row of the test dataset. Let’s see what I have.

c(as.character(head(ocrPredictions)), as.character(tail(ocrPredictions)))
##  [1] "N" "V" "X" "N" "H" "E" "T" "D" "C" "T" "S" "A"

Let’s make a table and compare the original letters against the predicted letters.

table(ocrPredictions, ocrTest$letter, dnn = c("Prediction","Original"))[,1:16]
##           Original
## Prediction   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P
##          A 144   0   0   0   0   0   0   0   0   1   0   0   1   2   2   0
##          B   0 121   0   5   2   0   1   2   0   0   1   0   1   0   0   2
##          C   0   0 120   0   4   0  10   2   2   0   1   3   0   0   2   0
##          D   2   2   0 156   0   1   3  10   4   3   4   3   0   5   5   3
##          E   0   0   5   0 127   3   1   1   0   0   3   4   0   0   0   0
##          F   0   0   0   0   0 138   2   2   6   0   0   0   0   0   0  16
##          G   1   1   2   1   9   2 123   2   0   0   1   2   1   0   1   2
##          H   0   0   0   1   0   1   0 102   0   2   3   2   3   4  20   0
##          I   0   1   0   0   0   1   0   0 141   8   0   0   0   0   0   1
##          J   0   1   0   0   0   1   0   2   5 128   0   0   0   0   1   1
##          K   1   1   9   0   0   0   2   5   0   0 118   0   0   2   0   1
##          L   0   0   0   0   2   0   1   1   0   0   0 134   0   0   0   0
##          M   0   0   1   1   0   0   1   1   0   0   0   0 135   4   0   0
##          N   0   0   0   0   0   1   0   1   0   0   0   0   0 145   0   0
##          O   1   0   2   1   0   0   1   2   0   1   0   0   0   1  99   3
##          P   0   0   0   1   0   2   1   0   0   0   0   0   0   0   2 130
##          Q   0   0   0   0   0   0   8   2   0   0   0   3   0   0   3   1
##          R   0   7   0   0   1   0   3   8   0   0  13   0   0   1   1   1
##          S   1   1   0   0   1   0   3   0   1   1   0   1   0   0   0   0
##          T   0   0   0   0   3   2   0   0   0   0   1   0   0   0   0   0
##          U   1   0   3   1   0   0   0   2   0   0   0   0   0   0   1   0
##          V   0   0   0   0   0   1   3   4   0   0   0   0   1   2   1   0
##          W   0   0   0   0   0   0   1   0   0   0   0   0   2   0   0   0
##          X   0   1   0   0   2   0   0   1   3   0   1   5   0   0   1   0
##          Y   3   0   0   0   0   0   0   1   0   0   0   0   0   0   0   7
##          Z   2   0   0   0   1   0   0   0   3   4   0   0   0   0   0   0

Interpreting the table

The diagonal values of the table are the numbers of letters the model predicted correctly, whereas every off-diagonal cell counts a particular kind of mistake. For example, the cell in the fourth row and first column (row D, column A) holds 2: two letters that were originally A were classified as D.

The table reveals several patterns. For example:

  1. 5 Cs were classified as Es (not very surprising).
  2. 18 Ss were classified as Zs (column S lies beyond the 16 columns shown above).
  3. 13 Ks were classified as Rs.
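Reading a 26 x 26 table by eye is tedious, so the largest off-diagonal cells can also be listed programmatically. Below is a minimal sketch on a toy three-letter example; the vectors predicted and actual and the name confusions are made up for illustration, and the same steps should apply to the table built from ocrPredictions and ocrTest$letter.

```r
lv <- c("C", "E", "K")
actual    <- factor(c("C", "C", "C", "E", "E", "K", "K", "K", "K"), levels = lv)
predicted <- factor(c("C", "E", "E", "E", "E", "K", "C", "K", "K"), levels = lv)

tab <- table(Prediction = predicted, Original = actual)

# Flatten to one row per (Prediction, Original) pair with its count.
confusions <- as.data.frame(tab, stringsAsFactors = FALSE)

# Keep only the mistakes and sort them from most to least frequent.
confusions <- confusions[confusions$Prediction != confusions$Original &
                         confusions$Freq > 0, ]
confusions <- confusions[order(-confusions$Freq), ]
print(confusions)
```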

Calculating the overall accuracy

Going through all 26 letters one by one would show where the model struggles most, but a single overall accuracy figure is also a useful summary.

matches <- ocrPredictions == ocrTest$letter
table(matches)
## matches
## FALSE  TRUE 
##   642  3357
# The model accuracy, in percent, over the non-missing comparisons:
mean(matches, na.rm = TRUE) * 100
## [1] 83.94599

So the model was almost 84% accurate and 16% wrong.
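A single overall figure hides which letters are hardest. As a hedged sketch, the per-letter accuracy (the share of each original letter that was predicted correctly) can be read off the confusion table. The toy vectors actual and predicted below are invented for illustration.

```r
lv <- c("A", "B", "C")
actual    <- factor(c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C"), levels = lv)
predicted <- factor(c("A", "A", "A", "B", "B", "B", "C", "A", "C", "C"), levels = lv)

# Overall accuracy: share of predictions that match the original letter.
overall <- mean(predicted == actual)

# Per-letter accuracy: correct predictions divided by the number of originals.
tab <- table(Prediction = predicted, Original = actual)
per_letter <- diag(tab) / colSums(tab)

print(overall)
print(round(per_letter, 2))
```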

Improving the model with a different kernel function

So far I have used the simple linear kernel function (vanilladot), but several other kernels can map the data into a higher-dimensional space and may therefore give a better model fit.

The challenge is that I do not know in advance which kernel to pick. Based on some trial and error (and the popular convention of starting with the Gaussian RBF kernel), I chose an RBF-based SVM. The kernel for this is rbfdot.
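To show what the two kernels compute, here is a hand-written sketch in base R. The function names linear_kernel and rbf_kernel and the value sigma = 0.05 are illustrative assumptions. The RBF form exp(-sigma * ||x - y||^2) follows the parameterization documented for kernlab's rbfdot; by default, ksvm() estimates sigma from the data rather than using a fixed value.

```r
# Linear kernel: the plain dot product of two feature vectors.
linear_kernel <- function(x, y) sum(x * y)

# Gaussian RBF kernel: similarity decays with squared Euclidean distance.
rbf_kernel <- function(x, y, sigma = 0.05) exp(-sigma * sum((x - y)^2))

# Two made-up feature vectors on the dataset's 0-15 scale.
a <- c(2, 8, 3, 5)
b <- c(5, 12, 3, 7)

print(linear_kernel(a, b))
print(rbf_kernel(a, b))
print(rbf_kernel(a, a))  # identical vectors give the maximum similarity, 1
```

Because the RBF value depends on distance rather than on a dot product, the resulting decision boundaries can be non-linear in the original feature space.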

classifierRbf <- ksvm(letter ~ ., data = ocrTrain, kernel = "rbfdot")
ocrPredictionsRbf <- predict(classifierRbf, ocrTest)
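# ocrTest has 4000 rows but the last one is all NA (ocrdata has only 19,999 rows);
# pad the predictions with a trailing NA so both vectors have the same length.
ocrPredictionsRbf[4000] <- NA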
table(ocrPredictionsRbf, ocrTest$letter, dnn=c("Prediction","Original"))[,1:16]   
##           Original
## Prediction   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P
##          A 151   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##          B   0 127   0   3   0   1   0   2   0   0   0   1   2   1   0   2
##          C   0   0 132   0   3   0   1   0   2   0   0   1   0   0   0   0
##          D   1   1   0 161   0   0   2   9   2   3   1   0   0   1   1   3
##          E   0   0   3   0 137   2   0   0   0   1   0   4   0   0   0   1
##          F   0   0   0   0   0 148   0   0   3   0   0   0   0   0   0  11
##          G   0   0   2   0   8   0 155   2   0   0   0   2   2   0   2   1
##          H   0   1   0   1   0   0   1 124   0   1   2   1   1   3   0   1
##          I   0   0   0   0   0   0   0   0 151   3   0   0   0   0   0   0
##          J   0   0   0   0   0   0   0   0   3 136   0   0   0   0   0   0
##          K   0   0   1   0   0   0   0   5   0   0 132   0   0   1   0   0
##          L   0   0   0   0   0   0   1   0   0   0   0 141   0   0   0   0
##          M   0   0   0   0   0   0   1   1   0   0   0   0 138   1   0   0
##          N   0   0   0   0   0   2   0   0   0   0   0   0   0 150   0   0
##          O   0   0   2   0   0   0   0   0   0   1   0   0   0   5 129   2
##          P   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0 140
##          Q   0   0   0   0   0   0   0   1   0   0   0   0   0   0   3   3
##          R   0   4   1   1   0   0   2   5   0   0   9   1   0   3   2   1
##          S   0   2   0   0   0   0   0   0   1   2   0   2   0   0   0   0
##          T   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##          U   0   0   1   1   0   0   0   1   0   0   0   0   0   0   0   0
##          V   0   0   0   0   0   0   0   0   0   0   0   0   1   1   0   0
##          W   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0
##          X   0   1   0   0   1   0   0   0   0   0   2   4   0   0   0   0
##          Y   4   0   0   0   0   0   0   1   0   0   0   0   0   0   0   3
##          Z   0   0   0   0   3   0   0   0   2   1   0   0   0   0   0   0

The pattern of mistakes differs from the previous model, and every diagonal count shown is higher than before. Now let’s measure the accuracy.

Measuring the accuracy of the new model

matchesNew <- ocrPredictionsRbf == ocrTest$letter
table(matchesNew)
## matchesNew
## FALSE  TRUE 
##   281  3718
# The model accuracy, in percent, over the non-missing comparisons:
mean(matchesNew, na.rm = TRUE) * 100
## [1] 92.97324

This is a big improvement over the previous model: I now have about 93% accuracy!

Self-caveat: I did not mean to use the test set twice. Choosing between models by their test-set accuracy risks overfitting to the test set. I shall update the page with results from cross-validation done on my training set with different SVM kernels as well as other classification algorithms.
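As a rough outline of what such a cross-validation could look like, the sketch below runs k-fold cross-validation in base R. Everything here is illustrative: the helper kfold_accuracy, its fit_fn/predict_fn interface, and the nearest-centroid classifier on R's built-in iris data, which stands in for an SVM so the example runs without extra packages. On the real data, the fit function would wrap a ksvm() call on the training rows only.

```r
# k-fold cross-validation: returns the accuracy on each held-out fold.
kfold_accuracy <- function(data, label, k, fit_fn, predict_fn) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold ids
  sapply(seq_len(k), function(i) {
    train <- data[folds != i, ]
    held  <- data[folds == i, ]
    model <- fit_fn(train)
    pred  <- predict_fn(model, held)
    mean(as.character(pred) == as.character(held[[label]]))
  })
}

# Stand-in classifier: one centroid (mean feature vector) per class.
fit_centroid <- function(train) {
  features <- train[, names(train) != "Species"]
  aggregate(features, by = list(Species = train$Species), FUN = mean)
}

# Assign each row to the class whose centroid is nearest.
predict_centroid <- function(model, newdata) {
  centers <- as.matrix(model[, -1])
  x <- as.matrix(newdata[, colnames(centers)])
  # Squared Euclidean distance from every row to every centroid.
  d <- vapply(seq_len(nrow(centers)),
              function(j) rowSums(sweep(x, 2, centers[j, ])^2),
              numeric(nrow(x)))
  d <- matrix(d, nrow = nrow(x))
  model$Species[apply(d, 1, which.min)]
}

set.seed(1)  # arbitrary seed for reproducible folds
acc <- kfold_accuracy(iris, "Species", k = 5, fit_centroid, predict_centroid)
print(round(acc, 3))
print(mean(acc))
```

Comparing kernels by their mean fold accuracy on the training set leaves the test set untouched until a final model has been chosen. kernlab's ksvm() also accepts a cross argument for built-in k-fold cross-validation, which may be simpler in practice.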
