Image processing is a difficult task for many types of machine learning algorithms. The relationships linking patterns of pixels to higher concepts are extremely complex and hard to define. For instance, it’s easy for a human being to recognize a face, a cat, or the letter “A”, but defining these patterns in strict rules is difficult. Furthermore, image data is often noisy. There can be many slight variations in how the image was captured depending on the lighting, orientation, and positioning of the subject.
SVMs are well suited to tackle the challenges of image data. Capable of learning complex patterns without being overly sensitive to noise, they are able to recognize visual patterns with a high degree of accuracy.
We will develop a model similar to those used at the core of the optical character recognition (OCR) software often bundled with desktop document scanners or in smartphone applications. The purpose of such software is to process paper-based documents by converting printed or handwritten text into an electronic form to be saved in a database.
When OCR software first processes a document, it divides the paper into a matrix such that each cell in the grid contains a single glyph, which is a term referring to a letter, symbol, or number. Next, for each cell, the software will attempt to match the glyph to a set of all characters it recognizes. Finally, the individual characters can be combined into words, which optionally could be spell-checked against a dictionary in the document’s language.
We’ll assume that we have already developed the algorithm to partition the document into rectangular regions each consisting of a single glyph. We will also assume the document contains only alphabetic characters in English. Therefore, we’ll simulate a process that involves matching glyphs to one of the 26 letters, A to Z.
We’ll use a dataset donated to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) by W. Frey and D. J. Slate. The dataset contains 20,000 examples of 26 English alphabet capital letters as printed using 20 different randomly reshaped and distorted black-and-white fonts.
The following figure, published by Frey and Slate, provides an example of some of the printed glyphs. Distorted in this way, the letters are challenging for a computer to identify, yet are easily recognized by a human being:
Examples of glyphs the SVM algorithm will attempt to identify
For ease to acces the dataset I have hosted a public repository and included the code that directly downloads the data from the repo and loads the data. The path can be canged to anythig according to your preference of the working directory.
set.seed(12345)
path <- "A:/Project/OCR"
setwd(path)
url <- "https://raw.githubusercontent.com/shreyaskhadse/data_files/master/letterdata.csv"
datafile <- "./letterdata.csv"
if (!file.exists(datafile)) {
download.file(url, datafile ,method="auto") }
letters <- read.csv("letterdata.csv")
When the glyphs are scanned into the computer, they are converted into pixels and 16 statistical attributes are recorded.
The attributes measure such characteristics as the horizontal and vertical dimensions of the glyph; the proportion of black (versus white) pixels; and the average horizontal and vertical position of the pixels. Presumably, differences in the concentration of black pixels across various areas of the box should provide a way to differentiate among the 26 letters of the alphabet.
str(letters)
## 'data.frame': 20000 obs. of 17 variables:
## $ letter: chr "T" "I" "D" "N" ...
## $ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
## $ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
## $ width : int 3 3 6 6 3 5 5 3 4 13 ...
## $ height: int 5 7 8 6 1 8 4 2 4 9 ...
## $ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
## $ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
## $ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
## $ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
## $ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
## $ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
## $ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
## $ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
## $ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
## $ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
## $ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
## $ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
SVM learners require all features to be numeric, and moreover, that each feature is scaled to a fairly small interval. In this case, every feature is an integer, so we do not need to convert any factors into numbers. On the other hand, some of the ranges for these integer variables appear fairly wide. This indicates that we need to normalize or standardize the data. The R package that we will use for fitting the SVM model will perform the rescaling automatically.
We now split the data into a 80:20 train-test dataset.
letters_train <- letters[1:16000, ]
letters_test <- letters[16001:20000, ]
The e1071 package from the Department of Statistics at
the Vienna University of Technology (TU Wien) provides an R interface to
the award-winning LIBSVM library, a widely used open-source
SVM program written in C++.
SVMlight algorithm, the klaR package from
the Department of Statistics at the Dortmund University of Technology
(TU Dortmund) provides functions to work with this SVM implementation
directly from R.
It is perhaps best to begin with the SVM functions in the
kernlab package. An interesting advantage of this package
is that it was developed natively in R rather than C or C++, which
allows it to be easily customized; none of the internals are hidden
behind the scenes. Perhaps even more importantly, unlike the other
options, kernlab can be used with the caret
package, which allows SVM models to be trained and evaluated using a
variety of automated methods.
#install.packages("kernlab")
library(kernlab)
letter_classifier <- ksvm(letter ~ ., data = letters_train, kernel = "vanilladot")
## Setting default kernel parameters
## Warning in .local(x, ...): NAs introduced by coercion
## Error in .local(x, ...): No Support Vectors found. You may want to change your parameters
letter_classifier
## Error in eval(expr, envir, enclos): object 'letter_classifier' not found
letter_predictions <- predict(object = letter_classifier, letters_test, type = "response")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'predict': object 'letter_classifier' not found
head(letter_predictions)
## Error in head(letter_predictions): object 'letter_predictions' not found
To examine how well our classifier performed, we need to compare the
predicted letter to the true letter in the testing dataset. We’ll use
the table() function for this purpose (only a portion of
the full table is shown here):
table(letter_predictions, letters_test$letter)
## Error in table(letter_predictions, letters_test$letter): object 'letter_predictions' not found
The diagonal values of 144, 121, 120, 156, and 127 indicate the total number of records where the predicted letter matches the true value. Similarly, the number of mistakes is also listed.
agreement <- letter_predictions == letters_test$letter
## Error in eval(expr, envir, enclos): object 'letter_predictions' not found
table(agreement)
## Error in table(agreement): object 'agreement' not found
prop.table(table(agreement))
## Error in table(agreement): object 'agreement' not found
We see that the classifier correctly identified the letter in 3,357 out of the 4,000 test records: In percentage terms, the accuracy is about 84 percent.
Our previous SVM model used the simple linear kernel function. By using a more complex kernel function, we can map the data into a higher dimensional space, and potentially obtain a better model fit.
We begin with the Gaussian RBF (radial basis function) kernel, which
has been shown to perform well for many types of data. We can train an
RBF-based SVM using the ksvm() function:
letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot")
## Warning in .local(x, ...): NAs introduced by coercion
## Error in .local(x, ...): No Support Vectors found. You may want to change your parameters
letter_classifier_rbf
## Error in eval(expr, envir, enclos): object 'letter_classifier_rbf' not found
letter_predictions_rbf <- predict(letter_classifier_rbf, letters_test)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'predict': object 'letter_classifier_rbf' not found
agreement_rbf <- letter_predictions_rbf == letters_test$letter
## Error in eval(expr, envir, enclos): object 'letter_predictions_rbf' not found
table(agreement_rbf)
## Error in table(agreement_rbf): object 'agreement_rbf' not found
prop.table(table(agreement_rbf))
## Error in table(agreement_rbf): object 'agreement_rbf' not found
Simply by changing the kernel function, we were able to increase the accuracy of our character recognition model from 84 percent to 93 percent.
Another fruitful approach is to vary the cost parameter, which modifies the width of the SVM decision boundary. This governs the model’s balance between overfitting and underfitting the training data—the larger the cost value, the harder the learner will try to perfectly classify every training instance, as there is a higher penalty for each mistake.
We use the sapply() function to apply a custom function
to a vector of potential cost values. We begin by using the
seq() function to generate this vector as a sequence
counting from five to 40 by five. Then, as shown in the following code,
the custom function trains the model as before, each time using the cost
value and making predictions on the test dataset. Each model’s accuracy
is computed as the number of predictions that match the actual values
divided by the total number of predictions. The result is visualized
using the plot() function:
start.time <- Sys.time()
cost_values <- c(1, seq(from = 5, to = 40, by = 5))
accuracy_values <- sapply(cost_values, function(x) {
set.seed(12345)
m <- ksvm(letter ~ ., data = letters_train,
kernel = "rbfdot", C = x)
pred <- predict(m, letters_test)
agree <- ifelse(pred == letters_test$letter, 1, 0)
accuracy <- sum(agree) / nrow(letters_test)
return (accuracy)
})
## Warning in .local(x, ...): NAs introduced by coercion
## Error in .local(x, ...): No Support Vectors found. You may want to change your parameters
end.time <- Sys.time()
print(round(end.time - start.time))
## Time difference of 0 secs
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
df <- data.frame(cost_values, accuracy_values)
## Error in data.frame(cost_values, accuracy_values): object 'accuracy_values' not found
ggplot(df, aes(x = cost_values, y = accuracy_values))+
geom_line(color = "steelblue")+
geom_point(color = "red")+
labs(title = "Accuracy vs Cost Values", x = "Cost Values", y = "Accuracy")
## Error in `ggplot()`:
## ! `data` cannot be a function.
## ℹ Have you misspelled the `data` argument in `ggplot()`
As depicted in the visualization, with an accuracy of 93 percent, the default SVM cost parameter of C = 1 resulted in by far the least accurate model among the nine models evaluated. Instead, setting C to a value of 10 or higher results in an accuracy of around 97 percent.