OCR software divides the page into a grid of cells such that each cell contains a single glyph: a symbol, letter, or digit. It then attempts to match each glyph against a known character set. Finally, the recognized characters are combined into words, which can be spell-checked against a dictionary for better accuracy.
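As a toy illustration of the glyph-matching step (hypothetical 3x3 bitmaps, nothing like a production OCR engine), a glyph can be classified by picking the template it differs from in the fewest pixels:
#Toy matcher: choose the template with the smallest Hamming distance
matchGlyph <- function(glyph, templates) {
  dists <- sapply(templates, function(tpl) sum(glyph != tpl))
  names(which.min(dists))
}
templates <- list(I = c(0,1,0, 0,1,0, 0,1,0), #hypothetical bitmap templates
                  L = c(1,0,0, 1,0,0, 1,1,1))
matchGlyph(c(0,1,0, 0,1,0, 0,1,0), templates) #returns "I"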
In this project, I’m making the assumption that the document contains only alphabetic characters in the English language.
I’m using a dataset donated to the UCI Machine Learning Repository by D. J. Slate and P. W. Frey.
The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15.
According to Slate and Frey, each pixelized glyph can be represented mathematically by 16 statistical attributes. The attributes measure characteristics such as the horizontal and vertical dimensions of the glyph, the average horizontal and vertical positions of its pixels, and the proportion of black versus white pixels. These attributes, according to them, provide a way to differentiate among the letters of the English alphabet.
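To make that concrete, here is a minimal sketch, with a made-up 0/1 bitmap rather than Frey and Slate’s actual procedure, of how three of the sixteen attributes could be computed:
#Hypothetical 0/1 glyph bitmap; rows are y positions, columns are x positions
glyph <- matrix(c(0,1,0,
                  1,1,1,
                  1,0,1), nrow = 3, byrow = TRUE)
on    <- which(glyph == 1, arr.ind = TRUE) #coordinates of the "on" pixels
onpix <- sum(glyph) / length(glyph)        #proportion of on (black) pixels
xbar  <- mean(on[, "col"])                 #mean horizontal pixel position
ybar  <- mean(on[, "row"])                 #mean vertical pixel position
c(onpix = onpix, xbar = xbar, ybar = ybar)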
#Download the raw data file once, if it is not already present locally
if (!file.exists("ocrdata.csv")) {
  link <- "https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data"
  download.file(link, destfile = "ocrdata.csv", method = "curl")
}
ocrdata <- read.csv("ocrdata.csv")
str(ocrdata)
## 'data.frame': 19999 obs. of 17 variables:
## $ T : Factor w/ 26 levels "A","B","C","D",..: 9 4 14 7 19 2 1 10 13 24 ...
## $ X2 : int 5 4 7 2 4 4 1 2 11 3 ...
## $ X8 : int 12 11 11 1 11 2 1 2 15 9 ...
## $ X3 : int 3 6 6 3 5 5 3 4 13 5 ...
## $ X5 : int 7 8 6 1 8 4 2 4 9 7 ...
## $ X1 : int 2 6 3 1 3 4 1 2 7 4 ...
## $ X8.1: int 10 10 5 8 8 8 8 10 13 8 ...
## $ X13 : int 5 6 9 6 8 7 2 6 2 7 ...
## $ X0 : int 5 2 4 6 6 6 2 2 6 3 ...
## $ X6 : int 4 6 6 6 9 6 2 6 2 8 ...
## $ X6.1: int 13 10 4 6 5 7 8 12 12 5 ...
## $ X10 : int 3 3 4 5 6 6 2 4 1 6 ...
## $ X8.2: int 9 7 10 9 6 6 8 8 9 8 ...
## $ X0.1: int 2 3 6 1 0 2 1 1 8 2 ...
## $ X8.3: int 8 7 10 7 8 8 6 6 1 8 ...
## $ X0.2: int 4 3 2 5 9 7 2 1 1 6 ...
## $ X8.4: int 10 9 8 10 7 10 7 7 8 7 ...
Right now, the column headings do not make sense: the raw file has no header row, so read.csv() consumed the first record (the letter T and its 16 attribute values) as column names, which is also why str() reports 19,999 rather than 20,000 observations. Let’s rename the columns.
#Before:
names(ocrdata)
## [1] "T" "X2" "X8" "X3" "X5" "X1" "X8.1" "X13" "X0" "X6"
## [11] "X6.1" "X10" "X8.2" "X0.1" "X8.3" "X0.2" "X8.4"
#After:
names(ocrdata) <- c("letter", "xbox", "ybox", "width", "height", "onpix", "xbar",
                    "ybar", "x2bar", "y2bar", "xybar", "x2ybar", "xy2bar",
                    "xedge", "xedgey", "yedge", "yedgex")
names(ocrdata)
## [1] "letter" "xbox" "ybox" "width" "height" "onpix" "xbar"
## [8] "ybar" "x2bar" "y2bar" "xybar" "x2ybar" "xy2bar" "xedge"
## [15] "xedgey" "yedge" "yedgex"
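A cleaner alternative, sketched below, would be to tell read.csv() up front that the file has no header row. This keeps the record that was silently consumed as column names above, restoring the full 20,000 observations (I keep the original data frame for the rest of this post so the outputs match what I ran).
ocrdata2 <- read.csv("ocrdata.csv", header = FALSE,
                     col.names = c("letter", "xbox", "ybox", "width", "height",
                                   "onpix", "xbar", "ybar", "x2bar", "y2bar",
                                   "xybar", "x2ybar", "xy2bar", "xedge",
                                   "xedgey", "yedge", "yedgex"))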
Since I’m going to use the SVM algorithm, two things should be known: (1) SVMs need every feature to be numeric, and (2) the features should be scaled to a similar range.
I already have all variables as numbers, so step 1 is done. Step 2, however, would normally demand some work: I would want to normalize the data, but the R package I will use to fit the SVM model does that job for me.
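For reference, here is a minimal sketch of normalizing manually; it is not needed here because ksvm()’s scaled argument defaults to TRUE, which standardizes the features automatically.
#Min-max rescale every attribute column (all but the first) to the [0, 1] range
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
ocrNorm <- ocrdata
ocrNorm[-1] <- lapply(ocrdata[-1], normalize)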
The dataset I have is already randomized (although I could redo it; see the sketch after the split below). I am now going to use 80% of the data as training data and the remaining 20% as test data to evaluate the model.
ocrTrain <- ocrdata[1:16000,]
ocrTest <- ocrdata[16001:20000,]
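If I did want to re-randomize the rows first, a sketch (the seed value is arbitrary):
set.seed(123)                                #arbitrary seed for reproducibility
shuffled <- ocrdata[sample(nrow(ocrdata)), ] #shuffle the row order
ocrTrain <- shuffled[1:16000, ]
ocrTest  <- shuffled[16001:nrow(shuffled), ]
Using nrow() as the upper bound also avoids indexing one row past the 19,999 available, which is what gives ocrTest above a trailing all-NA row.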
I’m going to use the kernlab package, primarily because it is beginner-friendly and works seamlessly with the caret package.
I will use the ksvm() function from the kernlab library, starting with the linear kernel (vanilladot).
library(kernlab)
classifier <- ksvm(letter ~ ., data = ocrTrain, kernel = "vanilladot")
## Setting default kernel parameters
classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 7039
##
## Objective Function Value : -14.1747 -20.007 -23.5629 -6.2007 -7.5523 -32.7693 -49.9788 -18.1824 -62.111 -32.7284 -16.221 -32.2839 -28.9776 -51.2192 -13.276 -35.6223 -30.8612 -16.5255 -14.681 -32.7472 -30.3216 -7.7959 -11.814 -32.3455 -13.126 -9.2693 -153.165 -52.9678 -76.7743 -119.2073 -165.4435 -54.6247 -41.9818 -67.2686 -25.1959 -27.6368 -26.41 -35.5578 -41.26 -122.1636 -187.9174 -222.0861 -21.4765 -10.3749 -56.3682 -12.2279 -49.4902 -9.3371 -19.2099 -11.1776 -100.2194 -29.14 -238.0507 -77.1985 -8.334 -4.5309 -139.8544 -80.8849 -20.3643 -13.0243 -82.515 -14.5037 -26.7516 -18.5709 -23.9512 -27.3041 -53.273 -11.4773 -5.1202 -13.9501 -4.4981 -3.5754 -8.4912 -40.971 -49.8188 -190.0265 -43.8604 -44.868 -45.258 -13.5555 -17.767 -87.4103 -107.1064 -37.025 -30.713 -112.3208 -32.9635 -27.2966 -35.5832 -17.8585 -5.1394 -43.4089 -7.7841 -16.6797 -58.51 -159.9932 -49.0779 -37.8439 -32.801 -74.5254 -133.3417 -11.164 -5.3575 -12.4375 -30.9902 -141.6928 -54.2953 -179.012 -99.8894 -10.288 -15.1555 -3.7818 -67.612 -7.6958 -88.9304 -47.6447 -94.3718 -70.2735 -71.5066 -21.7856 -12.7654 -7.4383 -23.5023 -13.1052 -239.9699 -30.4194 -25.211 -136.2793 -140.9563 -9.812 -34.4584 -6.304 -60.8422 -66.5785 -27.282 -214.3225 -34.7801 -16.7631 -135.7818 -160.627 -45.2949 -25.1021 -144.9052 -82.2355 -327.7157 -142.0611 -158.8819 -32.2184 -32.8889 -52.9638 -25.4942 -47.9924 -6.8991 -9.7296 -36.4361 -70.3911 -187.7606 -46.9366 -89.8108 -143.4214 -624.3642 -119.2205 -145.4432 -327.7745 -33.3256 -64.0603 -145.4829 -116.5903 -36.2988 -66.3768 -44.8241 -7.509 -217.9246 -12.971 -30.5035 -2.0371 -6.1261 -14.4445 -21.6334 -57.3084 -20.6923 -184.3623 -20.105 -4.1485 -4.5347 -0.8281 -121.4429 -7.9484 -58.5602 -21.4882 -13.5474 -5.6465 -15.6294 -28.9573 -20.5961 -76.7112 -27.0123 -94.7105 -15.1714 -10.0223 -7.6397 -1.5785 -87.6952 -6.2237 -99.3707 -101.0906 -45.6639 -24.0721 -61.7692 -24.1578 -52.2364 -234.326 -39.9757 -48.8561 -34.1458 -20.9665 -11.4524 -123.0291 -6.4901 -5.1868 -8.8018 -9.4612 -21.7736 -24.2361 -123.3978 -31.4396 -88.3897 -30.0912 -13.8194 -9.2701 -3.0825 -87.9616 -6.3842 -13.9679 -65.0712 -105.5232 -13.7404 -13.7627 -50.4226 -2.9331 -8.429 -80.9508 -36.4142 -112.7479 -4.1714 -7.8989 -1.2678 -90.8033 -21.4921 -7.2235 -47.9551 -3.3832 -20.433 -64.6126 -45.5778 -56.1314 -6.1347 -18.6305 -2.3742 -72.2553 -111.188 -106.765 -23.1321 -19.3763 -54.9815 -34.2944 -64.4748 -20.4109 -6.6886 -4.3781 -59.1414 -34.2461 -58.1506 -33.8664 -10.6902 -53.1394 -13.7482 -20.1987 -55.092 -3.8058 -60.0373 -235.484 -12.6837 -11.7408 -17.3059 -9.7171 -65.8491 -17.1047 -42.8136 -53.1058 -25.0432 -15.3018 -44.0747 -16.9584 -62.9777 -5.2037 -5.2966 -86.1709 -3.7209 -6.3449 -1.1265 -122.5773 -23.904 -355.0149 -31.1009 -32.6198 -4.9668 -84.1037 -134.5943 -72.8374 -23.9003 -35.5893 -11.7117 -22.2889 -1.8598 -59.2178 -8.8997 -150.7441 -1.8536 -1.9713 -9.9677 -0.5208 -26.9227 -30.4291 -5.6286
## Training error : 0.13025
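The printout also shows cost C = 1, kernlab’s default penalty for each misclassified training point; raising it fits the training data more aggressively at the risk of overfitting. A sketch of one alternative (the value 5 is arbitrary and not used further here):
classifierC5 <- ksvm(letter ~ ., data = ocrTrain, kernel = "vanilladot", C = 5)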
Now let’s see how well the model performs on the test dataset.
The predict() function applies the fitted model to new data.
ocrPredictions <- predict(classifier, ocrTest)
This returns the predicted letter for each row of the test dataset. Let’s look at the first and last few predictions.
c(as.character(head(ocrPredictions)), as.character(tail(ocrPredictions)))
## [1] "N" "V" "X" "N" "H" "E" "T" "D" "C" "T" "S" "A"
Let’s make a table and compare the original letters against the predicted letters.
#Show only the first 16 original-letter columns (A through P) so the table fits
table(ocrPredictions, ocrTest$letter, dnn = c("Prediction","Original"))[,1:16]
## Original
## Prediction A B C D E F G H I J K L M N O P
## A 144 0 0 0 0 0 0 0 0 1 0 0 1 2 2 0
## B 0 121 0 5 2 0 1 2 0 0 1 0 1 0 0 2
## C 0 0 120 0 4 0 10 2 2 0 1 3 0 0 2 0
## D 2 2 0 156 0 1 3 10 4 3 4 3 0 5 5 3
## E 0 0 5 0 127 3 1 1 0 0 3 4 0 0 0 0
## F 0 0 0 0 0 138 2 2 6 0 0 0 0 0 0 16
## G 1 1 2 1 9 2 123 2 0 0 1 2 1 0 1 2
## H 0 0 0 1 0 1 0 102 0 2 3 2 3 4 20 0
## I 0 1 0 0 0 1 0 0 141 8 0 0 0 0 0 1
## J 0 1 0 0 0 1 0 2 5 128 0 0 0 0 1 1
## K 1 1 9 0 0 0 2 5 0 0 118 0 0 2 0 1
## L 0 0 0 0 2 0 1 1 0 0 0 134 0 0 0 0
## M 0 0 1 1 0 0 1 1 0 0 0 0 135 4 0 0
## N 0 0 0 0 0 1 0 1 0 0 0 0 0 145 0 0
## O 1 0 2 1 0 0 1 2 0 1 0 0 0 1 99 3
## P 0 0 0 1 0 2 1 0 0 0 0 0 0 0 2 130
## Q 0 0 0 0 0 0 8 2 0 0 0 3 0 0 3 1
## R 0 7 0 0 1 0 3 8 0 0 13 0 0 1 1 1
## S 1 1 0 0 1 0 3 0 1 1 0 1 0 0 0 0
## T 0 0 0 0 3 2 0 0 0 0 1 0 0 0 0 0
## U 1 0 3 1 0 0 0 2 0 0 0 0 0 0 1 0
## V 0 0 0 0 0 1 3 4 0 0 0 0 1 2 1 0
## W 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0
## X 0 1 0 0 2 0 0 1 3 0 1 5 0 0 1 0
## Y 3 0 0 0 0 0 0 1 0 0 0 0 0 0 0 7
## Z 2 0 0 0 1 0 0 0 3 4 0 0 0 0 0 0
The diagonal of the table counts the letters the model predicted correctly, while every off-diagonal cell counts a particular kind of mistake. For example, the cell in the fourth row (Prediction D), first column (Original A) holds the value 2: two glyphs that were originally A were classified as D.
It reveals a lot. For example, visually similar letter pairs account for many of the errors: 20 of the original O’s were predicted as H, 16 P’s as F, and 13 K’s as R.
Although going through all 26 characters like this would show exactly where the model fails most, it is also insightful to calculate the overall accuracy of the model.
matches <- ocrPredictions == ocrTest$letter
table(matches)
## matches
## FALSE TRUE
## 642 3357
#The model accuracy, in percentage, therefore is:
(sum(matches, na.rm = TRUE) / nrow(ocrTest)) * 100
## [1] 83.925
So the model was correct on almost 84% of the test letters and wrong on about 16%.
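For a compact view of where the errors concentrate, per-letter accuracy can be read off the full confusion matrix; a minimal sketch:
confusion <- table(ocrPredictions, ocrTest$letter)
perLetter <- diag(confusion) / colSums(confusion) #fraction correct per original letter
round(sort(perLetter)[1:5], 2)                    #the five hardest letters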
I previously used the simple linear kernel function (vanilladot), but there are several other kernels that map the data into a higher-dimensional space and can therefore give a better model fit.
The challenge is that I do not know in advance which kernel to pick. After some trial and error (and following the popular convention of reaching for the Gaussian RBF kernel first), I tried an RBF-based SVM. The kernlab kernel for this is rbfdot.
classifierRbf <- ksvm(letter ~ ., data = ocrTrain, kernel = "rbfdot")
ocrPredictionsRbf <- predict(classifierRbf, ocrTest)
#Row 4000 of ocrTest is all NA (the split indexed one row past the 19,999 available), so its prediction is set to NA before tabulating
ocrPredictionsRbf[4000] <- NA
table(ocrPredictionsRbf, ocrTest$letter, dnn=c("Prediction","Original"))[,1:16]
## Original
## Prediction A B C D E F G H I J K L M N O P
## A 151 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## B 0 127 0 3 0 1 0 2 0 0 0 1 2 1 0 2
## C 0 0 132 0 3 0 1 0 2 0 0 1 0 0 0 0
## D 1 1 0 161 0 0 2 9 2 3 1 0 0 1 1 3
## E 0 0 3 0 137 2 0 0 0 1 0 4 0 0 0 1
## F 0 0 0 0 0 148 0 0 3 0 0 0 0 0 0 11
## G 0 0 2 0 8 0 155 2 0 0 0 2 2 0 2 1
## H 0 1 0 1 0 0 1 124 0 1 2 1 1 3 0 1
## I 0 0 0 0 0 0 0 0 151 3 0 0 0 0 0 0
## J 0 0 0 0 0 0 0 0 3 136 0 0 0 0 0 0
## K 0 0 1 0 0 0 0 5 0 0 132 0 0 1 0 0
## L 0 0 0 0 0 0 1 0 0 0 0 141 0 0 0 0
## M 0 0 0 0 0 0 1 1 0 0 0 0 138 1 0 0
## N 0 0 0 0 0 2 0 0 0 0 0 0 0 150 0 0
## O 0 0 2 0 0 0 0 0 0 1 0 0 0 5 129 2
## P 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 140
## Q 0 0 0 0 0 0 0 1 0 0 0 0 0 0 3 3
## R 0 4 1 1 0 0 2 5 0 0 9 1 0 3 2 1
## S 0 2 0 0 0 0 0 0 1 2 0 2 0 0 0 0
## T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## U 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0
## V 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
## W 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0
## X 0 1 0 0 1 0 0 0 0 0 2 4 0 0 0 0
## Y 4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 3
## Z 0 0 0 0 3 0 0 0 2 1 0 0 0 0 0 0
I can see a pattern of mistakes different from the previous model’s, and the off-diagonal counts already look smaller. Now let’s measure the accuracy.
matchesNew <- ocrPredictionsRbf == ocrTest$letter
table(matchesNew)
## matchesNew
## FALSE TRUE
## 281 3718
#The model accuracy, in percentage, therefore is:
(sum(matchesNew, na.rm = TRUE) / nrow(ocrTest)) * 100
## [1] 92.95
This is a big improvement over the previous model: I now have about 93% accuracy!
Self-caveat: I did not mean to use the test set twice; picking a kernel by its test-set score amounts to overfitting the test set, so this 93% is optimistically biased. I shall update the page with results from cross-validation on my training set with different SVM kernels as well as other classification algorithms.
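As a preview, kernlab can estimate that without touching the test set: ksvm() takes a cross argument that runs k-fold cross-validation on the training data. A sketch (five folds is an arbitrary choice):
#5-fold cross-validation on the training set only; cross() returns the CV error
classifierCV <- ksvm(letter ~ ., data = ocrTrain, kernel = "rbfdot", cross = 5)
cross(classifierCV)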