SVMs are well suited to tackle the challenges of image data. Capable of learning complex patterns without being overly sensitive to noise, they are able to recognize visual patterns with a high degree of accuracy.
In this notebook we will develop a model similar to those used at the core of the optical character recognition (OCR) software often bundled with desktop document scanners or in smartphone applications. The purpose of such software is to process paper-based documents by converting printed or handwritten text into an electronic form to be saved in a database.
data set : https://s3.us-east-2.amazonaws.com/artificium.us/datasets/letterdata.csv
library('kernlab')
letterdata <- read.csv("https://s3.us-east-2.amazonaws.com/artificium.us/datasets/letterdata.csv")
head(letterdata)
summary(letterdata)
## letter xbox ybox width
## Length:20000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 3.000 1st Qu.: 5.000 1st Qu.: 4.000
## Mode :character Median : 4.000 Median : 7.000 Median : 5.000
## Mean : 4.024 Mean : 7.035 Mean : 5.122
## 3rd Qu.: 5.000 3rd Qu.: 9.000 3rd Qu.: 6.000
## Max. :15.000 Max. :15.000 Max. :15.000
## height onpix xbar ybar
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.: 4.000 1st Qu.: 2.000 1st Qu.: 6.000 1st Qu.: 6.0
## Median : 6.000 Median : 3.000 Median : 7.000 Median : 7.0
## Mean : 5.372 Mean : 3.506 Mean : 6.898 Mean : 7.5
## 3rd Qu.: 7.000 3rd Qu.: 5.000 3rd Qu.: 8.000 3rd Qu.: 9.0
## Max. :15.000 Max. :15.000 Max. :15.000 Max. :15.0
## x2bar y2bar xybar x2ybar
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.: 7.000 1st Qu.: 5.000
## Median : 4.000 Median : 5.000 Median : 8.000 Median : 6.000
## Mean : 4.629 Mean : 5.179 Mean : 8.282 Mean : 6.454
## 3rd Qu.: 6.000 3rd Qu.: 7.000 3rd Qu.:10.000 3rd Qu.: 8.000
## Max. :15.000 Max. :15.000 Max. :15.000 Max. :15.000
## xy2bar xedge xedgey yedge
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.000 1st Qu.: 1.000 1st Qu.: 8.000 1st Qu.: 2.000
## Median : 8.000 Median : 3.000 Median : 8.000 Median : 3.000
## Mean : 7.929 Mean : 3.046 Mean : 8.339 Mean : 3.692
## 3rd Qu.: 9.000 3rd Qu.: 4.000 3rd Qu.: 9.000 3rd Qu.: 5.000
## Max. :15.000 Max. :15.000 Max. :15.000 Max. :15.000
## yedgex
## Min. : 0.000
## 1st Qu.: 7.000
## Median : 8.000
## Mean : 7.801
## 3rd Qu.: 9.000
## Max. :15.000
str(letterdata)
## 'data.frame': 20000 obs. of 17 variables:
## $ letter: chr "T" "I" "D" "N" ...
## $ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
## $ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
## $ width : int 3 3 6 6 3 5 5 3 4 13 ...
## $ height: int 5 7 8 6 1 8 4 2 4 9 ...
## $ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
## $ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
## $ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
## $ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
## $ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
## $ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
## $ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
## $ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
## $ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
## $ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
## $ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
## $ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
The data is all numerical with 20,000 observatiobs and 17 variables
No data preparaton needed, lets go strainght into splitting the data
letterhead_train <- letterdata[1:16000, ]
letterhead_test <- letterdata[16001:20000, ]
dim(letterhead_train)
## [1] 16000 17
dim(letterhead_test)
## [1] 4000 17
SVM expects all variables to be numerical so for the variable ‘letter’ which is the target variable, we cannot change it into numerical but we can change it to be a factor for classification.
letterhead_train$letter <- as.factor(letterhead_train$letter)
letterhead_test$letter <- as.factor(letterhead_test$letter)
# build a model
letter_classifier <- ksvm(letter ~.,data = letterhead_train, kernel = 'vanilladot')
## Setting default kernel parameters
str(letterhead_train)
## 'data.frame': 16000 obs. of 17 variables:
## $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
## $ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
## $ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
## $ width : int 3 3 6 6 3 5 5 3 4 13 ...
## $ height: int 5 7 8 6 1 8 4 2 4 9 ...
## $ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
## $ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
## $ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
## $ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
## $ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
## $ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
## $ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
## $ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
## $ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
## $ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
## $ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
## $ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
Evaluate the performance of the model
letter_predictions <- predict(letter_classifier, letterhead_test)
head(letter_predictions)
## [1] U N V X N H
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
table(letter_predictions, letterhead_test$letter)
##
## letter_predictions A B C D E F G H I J K L M N O
## A 144 0 0 0 0 0 0 0 0 1 0 0 1 2 2
## B 0 121 0 5 2 0 1 2 0 0 1 0 1 0 0
## C 0 0 120 0 4 0 10 2 2 0 1 3 0 0 2
## D 2 2 0 156 0 1 3 10 4 3 4 3 0 5 5
## E 0 0 5 0 127 3 1 1 0 0 3 4 0 0 0
## F 0 0 0 0 0 138 2 2 6 0 0 0 0 0 0
## G 1 1 2 1 9 2 123 2 0 0 1 2 1 0 1
## H 0 0 0 1 0 1 0 102 0 2 3 2 3 4 20
## I 0 1 0 0 0 1 0 0 141 8 0 0 0 0 0
## J 0 1 0 0 0 1 0 2 5 128 0 0 0 0 1
## K 1 1 9 0 0 0 2 5 0 0 118 0 0 2 0
## L 0 0 0 0 2 0 1 1 0 0 0 133 0 0 0
## M 0 0 1 1 0 0 1 1 0 0 0 0 135 4 0
## N 0 0 0 0 0 1 0 1 0 0 0 0 0 145 0
## O 1 0 2 1 0 0 1 2 0 1 0 0 0 1 99
## P 0 0 0 1 0 2 1 0 0 0 0 0 0 0 2
## Q 0 0 0 0 0 0 8 2 0 0 0 3 0 0 3
## R 0 7 0 0 1 0 3 8 0 0 13 0 0 1 1
## S 1 1 0 0 1 0 3 0 1 1 0 1 0 0 0
## T 0 0 0 0 3 2 0 0 0 0 1 0 0 0 0
## U 1 0 3 1 0 0 0 2 0 0 0 0 0 0 1
## V 0 0 0 0 0 1 3 4 0 0 0 0 1 2 1
## W 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0
## X 0 1 0 0 2 0 0 1 3 0 1 6 0 0 1
## Y 3 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## Z 2 0 0 0 1 0 0 0 3 4 0 0 0 0 0
##
## letter_predictions P Q R S T U V W X Y Z
## A 0 5 0 1 1 1 0 1 0 0 1
## B 2 2 3 5 0 0 2 0 1 0 0
## C 0 0 0 0 0 0 0 0 0 0 0
## D 3 1 4 0 0 0 0 0 3 3 1
## E 0 2 0 10 0 0 0 0 2 0 3
## F 16 0 0 3 0 0 1 0 1 2 0
## G 2 8 2 4 3 0 0 0 1 0 0
## H 0 2 3 0 3 0 2 0 0 1 0
## I 1 0 0 3 0 0 0 0 5 1 1
## J 1 3 0 2 0 0 0 0 1 0 6
## K 1 0 7 0 1 3 0 0 5 0 0
## L 0 1 0 5 0 0 0 0 0 0 1
## M 0 0 0 0 0 3 0 8 0 0 0
## N 0 0 3 0 0 1 0 2 0 0 0
## O 3 3 0 0 0 3 0 0 0 0 0
## P 130 0 0 0 0 0 0 0 0 1 0
## Q 1 124 0 5 0 0 0 0 0 2 0
## R 1 0 138 0 1 0 1 0 0 0 0
## S 0 14 0 101 3 0 0 0 2 0 10
## T 0 0 0 3 133 1 0 0 0 2 2
## U 0 0 0 0 0 152 0 0 1 1 0
## V 0 3 1 0 0 0 126 1 0 4 0
## W 0 0 0 0 0 4 4 127 0 0 0
## X 0 0 0 1 0 0 0 0 137 1 1
## Y 7 0 0 0 3 0 0 0 0 127 0
## Z 0 0 0 18 3 0 0 0 0 0 132
The diagonal row with 144, 121, 120 etc indicates the total number of records where the predicted letter matches the actual letter. The mistakes such as 4 indicates the number of records where E was classified as C, 20 indicates the number of times the O was predicted as H, 16 indicates the number of records that P was predicted as F.
Looking at each type of mistake individually may reveal some interesting patterns about the specific types of letters the model has trouble with, but this is time consuming. We can simplify our evaluation by instead calculating the overall accuracy. This considers only whether the prediction was correct or incorrect, and ignores the type of error.
# use agreement to chack for accuracy
agreement <- letter_predictions == letterhead_test$letter
table(agreement)
## agreement
## FALSE TRUE
## 643 3357
prop.table(table(agreement))
## agreement
## FALSE TRUE
## 0.16075 0.83925
The model correctly predicted 3,357 out of 4000 test records
# lets use a more complex kernel function
letter_classifier_rbf <- ksvm(letter ~., data = letterhead_train, kernel = 'rbfdot')
# make predictions
letter_predictions_rbf <- predict(letter_classifier_rbf,letterhead_test)
# use agreement to check for accuracy
agreement_rbf <- letter_predictions_rbf == letterhead_test$letter
table(agreement_rbf)
## agreement_rbf
## FALSE TRUE
## 278 3722
prop.table(table(agreement_rbf))
## agreement_rbf
## FALSE TRUE
## 0.0695 0.9305
The more complex kernel ‘rbfdot’ predicted 3722 records(93%) true out of 4000 records.
Identifying the best SVM cost parameter It is possible to test additional kernels. One way is to vary the cost parameter, which modifies the width of the SVM decision boundary.
This governs the model’s balance between overfitting and underfitting the training data—the larger the cost value, the harder the learner will try to perfectly classify every training instance, as there is a higher penalty for each mistake. On the one hand, a high cost can lead the learner to overfit the training data. On the other hand, a cost parameter set too small can cause the learner to miss important, subtle patterns in the training data and underfit the true pattern.
There is no rule of thumb to know the ideal value beforehand, so instead we will examine how the model performs for various values of C, the cost parameter. Rather than repeating the training and evaluation process repeatedly, we can use the sapply() function to apply a custom function to a vector of potential cost values. We begin by using the seq() function to generate this vector as a sequence counting from five to 40 by five.
cost_values <- c(1, seq(from = 5, to = 40, by = 5))
accuracy_values <- sapply(cost_values, function(x) {
set.seed(12345)
m <- ksvm(letter ~ ., data = letterhead_train, kernel = "rbfdot", C=x )
pred <- predict(m, letterhead_test)
agree <- ifelse(pred == letterhead_test$letter, 1, 0)
accuracy <- sum(agree) / nrow(letterhead_test)
return (accuracy)})
plot(cost_values, accuracy_values, type="b")
As depicted in the visualization, with an accuracy of 93 percent, the
default SVM cost parameter of C = 1 resulted in by far the least
accurate model among the nine models evaluated. Instead, setting C to a
value of 10 or higher results in an accuracy of around 97 percent, which
is quite an improvement in performance! Perhaps this is close enough to
perfect for the model to be deployed in a real-world environment, though
it may still be worth experimenting further with various kernels to see
if it is possible to get even closer to 100 percent accuracy. Each
additional improvement in accuracy will result in fewer mistakes for the
OCR software and a better overall experience for the end user.