SVMs are well suited to tackle the challenges of image data. Capable of learning complex patterns without being overly sensitive to noise, they are able to recognize visual patterns with a high degree of accuracy.

In this notebook we will develop a model similar to those used at the core of the optical character recognition (OCR) software often bundled with desktop document scanners or in smartphone applications. The purpose of such software is to process paper-based documents by converting printed or handwritten text into an electronic form to be saved in a database.

data set : https://s3.us-east-2.amazonaws.com/artificium.us/datasets/letterdata.csv

library('kernlab')
letterdata <- read.csv("https://s3.us-east-2.amazonaws.com/artificium.us/datasets/letterdata.csv")
head(letterdata)
summary(letterdata)
##     letter               xbox             ybox            width       
##  Length:20000       Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  Class :character   1st Qu.: 3.000   1st Qu.: 5.000   1st Qu.: 4.000  
##  Mode  :character   Median : 4.000   Median : 7.000   Median : 5.000  
##                     Mean   : 4.024   Mean   : 7.035   Mean   : 5.122  
##                     3rd Qu.: 5.000   3rd Qu.: 9.000   3rd Qu.: 6.000  
##                     Max.   :15.000   Max.   :15.000   Max.   :15.000  
##      height           onpix             xbar             ybar     
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.: 4.000   1st Qu.: 2.000   1st Qu.: 6.000   1st Qu.: 6.0  
##  Median : 6.000   Median : 3.000   Median : 7.000   Median : 7.0  
##  Mean   : 5.372   Mean   : 3.506   Mean   : 6.898   Mean   : 7.5  
##  3rd Qu.: 7.000   3rd Qu.: 5.000   3rd Qu.: 8.000   3rd Qu.: 9.0  
##  Max.   :15.000   Max.   :15.000   Max.   :15.000   Max.   :15.0  
##      x2bar            y2bar            xybar            x2ybar      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.: 7.000   1st Qu.: 5.000  
##  Median : 4.000   Median : 5.000   Median : 8.000   Median : 6.000  
##  Mean   : 4.629   Mean   : 5.179   Mean   : 8.282   Mean   : 6.454  
##  3rd Qu.: 6.000   3rd Qu.: 7.000   3rd Qu.:10.000   3rd Qu.: 8.000  
##  Max.   :15.000   Max.   :15.000   Max.   :15.000   Max.   :15.000  
##      xy2bar           xedge            xedgey           yedge       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 7.000   1st Qu.: 1.000   1st Qu.: 8.000   1st Qu.: 2.000  
##  Median : 8.000   Median : 3.000   Median : 8.000   Median : 3.000  
##  Mean   : 7.929   Mean   : 3.046   Mean   : 8.339   Mean   : 3.692  
##  3rd Qu.: 9.000   3rd Qu.: 4.000   3rd Qu.: 9.000   3rd Qu.: 5.000  
##  Max.   :15.000   Max.   :15.000   Max.   :15.000   Max.   :15.000  
##      yedgex      
##  Min.   : 0.000  
##  1st Qu.: 7.000  
##  Median : 8.000  
##  Mean   : 7.801  
##  3rd Qu.: 9.000  
##  Max.   :15.000
str(letterdata)
## 'data.frame':    20000 obs. of  17 variables:
##  $ letter: chr  "T" "I" "D" "N" ...
##  $ xbox  : int  2 5 4 7 2 4 4 1 2 11 ...
##  $ ybox  : int  8 12 11 11 1 11 2 1 2 15 ...
##  $ width : int  3 3 6 6 3 5 5 3 4 13 ...
##  $ height: int  5 7 8 6 1 8 4 2 4 9 ...
##  $ onpix : int  1 2 6 3 1 3 4 1 2 7 ...
##  $ xbar  : int  8 10 10 5 8 8 8 8 10 13 ...
##  $ ybar  : int  13 5 6 9 6 8 7 2 6 2 ...
##  $ x2bar : int  0 5 2 4 6 6 6 2 2 6 ...
##  $ y2bar : int  6 4 6 6 6 9 6 2 6 2 ...
##  $ xybar : int  6 13 10 4 6 5 7 8 12 12 ...
##  $ x2ybar: int  10 3 3 4 5 6 6 2 4 1 ...
##  $ xy2bar: int  8 9 7 10 9 6 6 8 8 9 ...
##  $ xedge : int  0 2 3 6 1 0 2 1 1 8 ...
##  $ xedgey: int  8 8 7 10 7 8 8 6 6 1 ...
##  $ yedge : int  0 4 3 2 5 9 7 2 1 1 ...
##  $ yedgex: int  8 10 9 8 10 7 10 7 7 8 ...

Key finding

The data is all numerical with 20,000 observatiobs and 17 variables

No data preparaton needed, lets go strainght into splitting the data

letterhead_train <- letterdata[1:16000, ]
letterhead_test <- letterdata[16001:20000, ]
dim(letterhead_train)
## [1] 16000    17
dim(letterhead_test)
## [1] 4000   17

train a model on the data

SVM expects all variables to be numerical so for the variable ‘letter’ which is the target variable, we cannot change it into numerical but we can change it to be a factor for classification.

letterhead_train$letter <- as.factor(letterhead_train$letter)
letterhead_test$letter <- as.factor(letterhead_test$letter)

# build a model 
letter_classifier <- ksvm(letter ~.,data = letterhead_train, kernel = 'vanilladot')
##  Setting default kernel parameters
str(letterhead_train)
## 'data.frame':    16000 obs. of  17 variables:
##  $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
##  $ xbox  : int  2 5 4 7 2 4 4 1 2 11 ...
##  $ ybox  : int  8 12 11 11 1 11 2 1 2 15 ...
##  $ width : int  3 3 6 6 3 5 5 3 4 13 ...
##  $ height: int  5 7 8 6 1 8 4 2 4 9 ...
##  $ onpix : int  1 2 6 3 1 3 4 1 2 7 ...
##  $ xbar  : int  8 10 10 5 8 8 8 8 10 13 ...
##  $ ybar  : int  13 5 6 9 6 8 7 2 6 2 ...
##  $ x2bar : int  0 5 2 4 6 6 6 2 2 6 ...
##  $ y2bar : int  6 4 6 6 6 9 6 2 6 2 ...
##  $ xybar : int  6 13 10 4 6 5 7 8 12 12 ...
##  $ x2ybar: int  10 3 3 4 5 6 6 2 4 1 ...
##  $ xy2bar: int  8 9 7 10 9 6 6 8 8 9 ...
##  $ xedge : int  0 2 3 6 1 0 2 1 1 8 ...
##  $ xedgey: int  8 8 7 10 7 8 8 6 6 1 ...
##  $ yedge : int  0 4 3 2 5 9 7 2 1 1 ...
##  $ yedgex: int  8 10 9 8 10 7 10 7 7 8 ...

Evaluate the performance of the model

letter_predictions <- predict(letter_classifier, letterhead_test)
head(letter_predictions)
## [1] U N V X N H
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
table(letter_predictions, letterhead_test$letter)
##                   
## letter_predictions   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
##                  A 144   0   0   0   0   0   0   0   0   1   0   0   1   2   2
##                  B   0 121   0   5   2   0   1   2   0   0   1   0   1   0   0
##                  C   0   0 120   0   4   0  10   2   2   0   1   3   0   0   2
##                  D   2   2   0 156   0   1   3  10   4   3   4   3   0   5   5
##                  E   0   0   5   0 127   3   1   1   0   0   3   4   0   0   0
##                  F   0   0   0   0   0 138   2   2   6   0   0   0   0   0   0
##                  G   1   1   2   1   9   2 123   2   0   0   1   2   1   0   1
##                  H   0   0   0   1   0   1   0 102   0   2   3   2   3   4  20
##                  I   0   1   0   0   0   1   0   0 141   8   0   0   0   0   0
##                  J   0   1   0   0   0   1   0   2   5 128   0   0   0   0   1
##                  K   1   1   9   0   0   0   2   5   0   0 118   0   0   2   0
##                  L   0   0   0   0   2   0   1   1   0   0   0 133   0   0   0
##                  M   0   0   1   1   0   0   1   1   0   0   0   0 135   4   0
##                  N   0   0   0   0   0   1   0   1   0   0   0   0   0 145   0
##                  O   1   0   2   1   0   0   1   2   0   1   0   0   0   1  99
##                  P   0   0   0   1   0   2   1   0   0   0   0   0   0   0   2
##                  Q   0   0   0   0   0   0   8   2   0   0   0   3   0   0   3
##                  R   0   7   0   0   1   0   3   8   0   0  13   0   0   1   1
##                  S   1   1   0   0   1   0   3   0   1   1   0   1   0   0   0
##                  T   0   0   0   0   3   2   0   0   0   0   1   0   0   0   0
##                  U   1   0   3   1   0   0   0   2   0   0   0   0   0   0   1
##                  V   0   0   0   0   0   1   3   4   0   0   0   0   1   2   1
##                  W   0   0   0   0   0   0   1   0   0   0   0   0   2   0   0
##                  X   0   1   0   0   2   0   0   1   3   0   1   6   0   0   1
##                  Y   3   0   0   0   0   0   0   1   0   0   0   0   0   0   0
##                  Z   2   0   0   0   1   0   0   0   3   4   0   0   0   0   0
##                   
## letter_predictions   P   Q   R   S   T   U   V   W   X   Y   Z
##                  A   0   5   0   1   1   1   0   1   0   0   1
##                  B   2   2   3   5   0   0   2   0   1   0   0
##                  C   0   0   0   0   0   0   0   0   0   0   0
##                  D   3   1   4   0   0   0   0   0   3   3   1
##                  E   0   2   0  10   0   0   0   0   2   0   3
##                  F  16   0   0   3   0   0   1   0   1   2   0
##                  G   2   8   2   4   3   0   0   0   1   0   0
##                  H   0   2   3   0   3   0   2   0   0   1   0
##                  I   1   0   0   3   0   0   0   0   5   1   1
##                  J   1   3   0   2   0   0   0   0   1   0   6
##                  K   1   0   7   0   1   3   0   0   5   0   0
##                  L   0   1   0   5   0   0   0   0   0   0   1
##                  M   0   0   0   0   0   3   0   8   0   0   0
##                  N   0   0   3   0   0   1   0   2   0   0   0
##                  O   3   3   0   0   0   3   0   0   0   0   0
##                  P 130   0   0   0   0   0   0   0   0   1   0
##                  Q   1 124   0   5   0   0   0   0   0   2   0
##                  R   1   0 138   0   1   0   1   0   0   0   0
##                  S   0  14   0 101   3   0   0   0   2   0  10
##                  T   0   0   0   3 133   1   0   0   0   2   2
##                  U   0   0   0   0   0 152   0   0   1   1   0
##                  V   0   3   1   0   0   0 126   1   0   4   0
##                  W   0   0   0   0   0   4   4 127   0   0   0
##                  X   0   0   0   1   0   0   0   0 137   1   1
##                  Y   7   0   0   0   3   0   0   0   0 127   0
##                  Z   0   0   0  18   3   0   0   0   0   0 132

The diagonal row with 144, 121, 120 etc indicates the total number of records where the predicted letter matches the actual letter. The mistakes such as 4 indicates the number of records where E was classified as C, 20 indicates the number of times the O was predicted as H, 16 indicates the number of records that P was predicted as F.

Looking at each type of mistake individually may reveal some interesting patterns about the specific types of letters the model has trouble with, but this is time consuming. We can simplify our evaluation by instead calculating the overall accuracy. This considers only whether the prediction was correct or incorrect, and ignores the type of error.

# use agreement to chack for accuracy
agreement <- letter_predictions == letterhead_test$letter
table(agreement) 
## agreement
## FALSE  TRUE 
##   643  3357
prop.table(table(agreement))
## agreement
##   FALSE    TRUE 
## 0.16075 0.83925

The model correctly predicted 3,357 out of 4000 test records

# lets use a more complex kernel function
letter_classifier_rbf <- ksvm(letter ~., data = letterhead_train, kernel = 'rbfdot')
# make predictions
letter_predictions_rbf <- predict(letter_classifier_rbf,letterhead_test)
# use agreement to check for accuracy
agreement_rbf <- letter_predictions_rbf == letterhead_test$letter
table(agreement_rbf)
## agreement_rbf
## FALSE  TRUE 
##   278  3722
prop.table(table(agreement_rbf))
## agreement_rbf
##  FALSE   TRUE 
## 0.0695 0.9305

The more complex kernel ‘rbfdot’ predicted 3722 records(93%) true out of 4000 records.

Identifying the best SVM cost parameter It is possible to test additional kernels. One way is to vary the cost parameter, which modifies the width of the SVM decision boundary.

This governs the model’s balance between overfitting and underfitting the training data—the larger the cost value, the harder the learner will try to perfectly classify every training instance, as there is a higher penalty for each mistake. On the one hand, a high cost can lead the learner to overfit the training data. On the other hand, a cost parameter set too small can cause the learner to miss important, subtle patterns in the training data and underfit the true pattern.

There is no rule of thumb to know the ideal value beforehand, so instead we will examine how the model performs for various values of C, the cost parameter. Rather than repeating the training and evaluation process repeatedly, we can use the sapply() function to apply a custom function to a vector of potential cost values. We begin by using the seq() function to generate this vector as a sequence counting from five to 40 by five.

cost_values <- c(1, seq(from = 5, to = 40, by = 5))

accuracy_values <- sapply(cost_values, function(x) {
    set.seed(12345)
    m <- ksvm(letter ~ ., data = letterhead_train, kernel = "rbfdot", C=x )

    pred <- predict(m, letterhead_test)

    agree <- ifelse(pred == letterhead_test$letter, 1, 0)

    accuracy <- sum(agree) / nrow(letterhead_test)
    return (accuracy)})
plot(cost_values, accuracy_values, type="b")

As depicted in the visualization, with an accuracy of 93 percent, the default SVM cost parameter of C = 1 resulted in by far the least accurate model among the nine models evaluated. Instead, setting C to a value of 10 or higher results in an accuracy of around 97 percent, which is quite an improvement in performance! Perhaps this is close enough to perfect for the model to be deployed in a real-world environment, though it may still be worth experimenting further with various kernels to see if it is possible to get even closer to 100 percent accuracy. Each additional improvement in accuracy will result in fewer mistakes for the OCR software and a better overall experience for the end user.