We will use a dataset donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by W. Frey and D. J. Slate. The dataset contains 20,000 examples of the 26 capital letters of the English alphabet, printed using 20 different randomly reshaped and distorted black-and-white fonts.
Reading the data into R, we confirm that it contains the 16 features that define each example of the letter class. (On R 4.0 and later, the stringsAsFactors = TRUE argument ensures that the letter column is read as a factor rather than a character vector.) As expected, the letter factor has 26 levels:
letters <- read.csv("letterdata.csv", stringsAsFactors = TRUE)
str(letters)
## 'data.frame': 20000 obs. of 17 variables:
## $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
## $ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
## $ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
## $ width : int 3 3 6 6 3 5 5 3 4 13 ...
## $ height: int 5 7 8 6 1 8 4 2 4 9 ...
## $ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
## $ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
## $ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
## $ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
## $ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
## $ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
## $ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
## $ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
## $ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
## $ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
## $ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
## $ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
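Before splitting the data, it may also be worth confirming that all 26 letters appear in roughly similar numbers; a quick check (output omitted here) is:
table(letters$letter)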
The data has already been randomized by the authors, so we can simply use the first 16,000 records (80 percent) to build the model and the next 4,000 records (20 percent) to test it. Following their lead, we can create the training and testing data frames as follows:
letters_train <- letters[1:16000, ]
letters_test <- letters[16001:20000, ]
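Had the data not arrived pre-randomized, a random split could be created instead. The following is a minimal sketch (the seed and the object names are purely illustrative):
set.seed(123)                                # arbitrary seed for reproducibility
train_index <- sample(nrow(letters), 16000)  # randomly select 80 percent of the row numbers
letters_train_random <- letters[train_index, ]
letters_test_random <- letters[-train_index, ]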
To provide a baseline measure of SVM performance, let’s begin by training a simple linear SVM classifier. To do so, we first need to install and load the kernlab package, then call the ksvm() function on the training data and specify the linear (that is, vanilla) kernel using the vanilladot option, as shown in the following code.
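If kernlab is not already installed, it can be obtained from CRAN first:
install.packages("kernlab")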
library(kernlab)
letter_classifier <- ksvm(letter ~ ., data = letters_train, kernel = "vanilladot")
## Setting default kernel parameters
letter_classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 7037
##
## Objective Function Value : -14.1746 -20.0072 -23.5628 -6.2009 -7.5524 -32.7694 -49.9786 -18.1824 -62.1111 -32.7284 -16.2209 -32.2837 -28.9777 -51.2195 -13.276 -35.6217 -30.8612 -16.5256 -14.6811 -32.7475 -30.3219 -7.7956 -11.8138 -32.3463 -13.1262 -9.2692 -153.1654 -52.9678 -76.7744 -119.2067 -165.4437 -54.6237 -41.9809 -67.2688 -25.1959 -27.6371 -26.4102 -35.5583 -41.2597 -122.164 -187.9178 -222.0856 -21.4765 -10.3752 -56.3684 -12.2277 -49.4899 -9.3372 -19.2092 -11.1776 -100.2186 -29.1397 -238.0516 -77.1985 -8.3339 -4.5308 -139.8534 -80.8854 -20.3642 -13.0245 -82.5151 -14.5032 -26.7509 -18.5713 -23.9511 -27.3034 -53.2731 -11.4773 -5.12 -13.9504 -4.4982 -3.5755 -8.4914 -40.9716 -49.8182 -190.0269 -43.8594 -44.8667 -45.2596 -13.5561 -17.7664 -87.4105 -107.1056 -37.0245 -30.7133 -112.3218 -32.9619 -27.2971 -35.5836 -17.8586 -5.1391 -43.4094 -7.7843 -16.6785 -58.5103 -159.9936 -49.0782 -37.8426 -32.8002 -74.5249 -133.3423 -11.1638 -5.3575 -12.438 -30.9907 -141.6924 -54.2953 -179.0114 -99.8896 -10.288 -15.1553 -3.7815 -67.6123 -7.696 -88.9304 -47.6448 -94.3718 -70.2733 -71.5057 -21.7854 -12.7657 -7.4383 -23.502 -13.1055 -239.9708 -30.4193 -25.2113 -136.2795 -140.9565 -9.8122 -34.4584 -6.3039 -60.8421 -66.5793 -27.2816 -214.3225 -34.7796 -16.7631 -135.7821 -160.6279 -45.2949 -25.1023 -144.9059 -82.2352 -327.7154 -142.0613 -158.8821 -32.2181 -32.8887 -52.9641 -25.4937 -47.9936 -6.8991 -9.7293 -36.436 -70.3907 -187.7611 -46.9371 -89.8103 -143.4213 -624.3645 -119.2204 -145.4435 -327.7748 -33.3255 -64.0607 -145.4831 -116.5903 -36.2977 -66.3762 -44.8248 -7.5088 -217.9246 -12.9699 -30.504 -2.0369 -6.126 -14.4448 -21.6337 -57.3084 -20.6915 -184.3625 -20.1052 -4.1484 -4.5344 -0.828 -121.4411 -7.9486 -58.5604 -21.4878 -13.5476 -5.646 -15.629 -28.9576 -20.5959 -76.7111 -27.0119 -94.7101 -15.1713 -10.0222 -7.6394 -1.5784 -87.6952 -6.2239 -99.3711 -101.0906 -45.6639 -24.0725 -61.7702 -24.1583 -52.2368 -234.3264 -39.9749 -48.8556 -34.1464 -20.9664 -11.4525 -123.0277 -6.4903 -5.1865 -8.8016 -9.4618 -21.7742 -24.2361 -123.3984 -31.4404 -88.3901 -30.0924 -13.8198 -9.2701 -3.0823 -87.9624 -6.3845 -13.968 -65.0702 -105.523 -13.7403 -13.7625 -50.4223 -2.933 -8.4289 -80.3381 -36.4147 -112.7485 -4.1711 -7.8989 -1.2676 -90.8037 -21.4919 -7.2235 -47.9557 -3.383 -20.433 -64.6138 -45.5781 -56.1309 -6.1345 -18.6307 -2.374 -72.2553 -111.1885 -106.7664 -23.1323 -19.3765 -54.9819 -34.2953 -64.4756 -20.4115 -6.689 -4.378 -59.141 -34.2468 -58.1509 -33.8665 -10.6902 -53.1387 -13.7478 -20.1987 -55.0923 -3.8058 -60.0382 -235.4841 -12.6837 -11.7407 -17.3058 -9.7167 -65.8498 -17.1051 -42.8131 -53.1054 -25.0437 -15.302 -44.0749 -16.9582 -62.9773 -5.204 -5.2963 -86.1704 -3.7209 -6.3445 -1.1264 -122.5771 -23.9041 -355.0145 -31.1013 -32.619 -4.9664 -84.1048 -134.5957 -72.8371 -23.9002 -35.3077 -11.7119 -22.2889 -1.8598 -59.2174 -8.8994 -150.742 -1.8533 -1.9711 -9.9676 -0.5207 -26.9229 -30.429 -5.6289
## Training error : 0.130062
The above information tells us very little about how well the model will perform in the real world. We’ll need to examine its performance on the testing dataset to know whether it generalizes well to unseen data.
The following predict() function allows us to use the letter classification model to make predictions on the testing dataset:
letter_predictions <- predict(letter_classifier, letters_test)
Since we didn’t specify the type parameter, the default type = "response" was used. This returns a vector containing a predicted letter for each row of values in the test data. Using the head() function, we can see that the first six predicted letters are U, N, V, X, N, and H:
head(letter_predictions)
## [1] U N V X N H
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
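The kernlab documentation also describes other values for the type parameter; for example, a sketch using type = "votes" (under the package’s one-against-one multiclass scheme) would return the number of pairwise votes each of the 26 letters received for each test record:
letter_votes <- predict(letter_classifier, letters_test, type = "votes")
dim(letter_votes)  # a matrix of vote counts over the 26 letter classes and 4,000 test records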
Next, to examine how well our classifier performed, we need to compare the predicted letters to the true letters in the testing dataset. We will use the table() function for this purpose:
table(letter_predictions, letters_test$letter)
##
## letter_predictions A B C D E F G H I J K L M N
## A 144 0 0 0 0 0 0 0 0 1 0 0 1 2
## B 0 121 0 5 2 0 1 2 0 0 1 0 1 0
## C 0 0 120 0 4 0 10 2 2 0 1 3 0 0
## D 2 2 0 156 0 1 3 10 4 3 4 3 0 5
## E 0 0 5 0 127 3 1 1 0 0 3 4 0 0
## F 0 0 0 0 0 138 2 2 6 0 0 0 0 0
## G 1 1 2 1 9 2 123 2 0 0 1 2 1 0
## H 0 0 0 1 0 1 0 102 0 2 3 2 3 4
## I 0 1 0 0 0 1 0 0 141 8 0 0 0 0
## J 0 1 0 0 0 1 0 2 5 128 0 0 0 0
## K 1 1 9 0 0 0 2 5 0 0 118 0 0 2
## L 0 0 0 0 2 0 1 1 0 0 0 133 0 0
## M 0 0 1 1 0 0 1 1 0 0 0 0 135 4
## N 0 0 0 0 0 1 0 1 0 0 0 0 0 145
## O 1 0 2 1 0 0 1 2 0 1 0 0 0 1
## P 0 0 0 1 0 2 1 0 0 0 0 0 0 0
## Q 0 0 0 0 0 0 8 2 0 0 0 3 0 0
## R 0 7 0 0 1 0 3 8 0 0 13 0 0 1
## S 1 1 0 0 1 0 3 0 1 1 0 1 0 0
## T 0 0 0 0 3 2 0 0 0 0 1 0 0 0
## U 1 0 3 1 0 0 0 2 0 0 0 0 0 0
## V 0 0 0 0 0 1 3 4 0 0 0 0 1 2
## W 0 0 0 0 0 0 1 0 0 0 0 0 2 0
## X 0 1 0 0 2 0 0 1 3 0 1 6 0 0
## Y 3 0 0 0 0 0 0 1 0 0 0 0 0 0
## Z 2 0 0 0 1 0 0 0 3 4 0 0 0 0
##
## letter_predictions O P Q R S T U V W X Y Z
## A 2 0 5 0 1 1 1 0 1 0 0 1
## B 0 2 2 3 5 0 0 2 0 1 0 0
## C 2 0 0 0 0 0 0 0 0 0 0 0
## D 5 3 1 4 0 0 0 0 0 3 3 1
## E 0 0 2 0 10 0 0 0 0 2 0 3
## F 0 16 0 0 3 0 0 1 0 1 2 0
## G 1 2 8 2 4 3 0 0 0 1 0 0
## H 20 0 2 3 0 3 0 2 0 0 1 0
## I 0 1 0 0 3 0 0 0 0 5 1 1
## J 1 1 3 0 2 0 0 0 0 1 0 6
## K 0 1 0 7 0 1 3 0 0 5 0 0
## L 0 0 1 0 5 0 0 0 0 0 0 1
## M 0 0 0 0 0 0 3 0 8 0 0 0
## N 0 0 0 3 0 0 1 0 2 0 0 0
## O 99 3 3 0 0 0 3 0 0 0 0 0
## P 2 130 0 0 0 0 0 0 0 0 1 0
## Q 3 1 124 0 5 0 0 0 0 0 2 0
## R 1 1 0 138 0 1 0 1 0 0 0 0
## S 0 0 14 0 101 3 0 0 0 2 0 10
## T 0 0 0 0 3 133 1 0 0 0 2 2
## U 1 0 0 0 0 0 152 0 0 1 1 0
## V 1 0 3 1 0 0 0 126 1 0 4 0
## W 0 0 0 0 0 0 4 4 127 0 0 0
## X 1 0 0 0 1 0 0 0 0 137 1 1
## Y 0 7 0 0 0 3 0 0 0 0 127 0
## Z 0 0 0 0 18 3 0 0 0 0 0 132
The diagonal values of 144, 121, 120, 156, and 127 indicate the total number of records where the predicted letter matches the true value. Similarly, the number of mistakes is also listed. For example, the value of 5 in row B and column D indicates that there were five cases where the letter D was misidentified as a B.
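As an aside, a compact per-letter view can be obtained by dividing the diagonal of the confusion matrix by its column totals (the object names here are illustrative):
confusion <- table(letter_predictions, letters_test$letter)
per_letter_accuracy <- diag(confusion) / colSums(confusion)  # proportion of each true letter predicted correctly
round(sort(per_letter_accuracy), 2)                          # letters the model struggles with appear first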
Looking at each type of mistake individually may reveal interesting patterns about the specific letters the model has trouble with, but this is time consuming. Instead, we can simplify our evaluation by calculating the overall accuracy, which considers only whether the prediction was correct or incorrect and ignores the type of error.
The following command returns a vector of TRUE or FALSE values, indicating whether the model’s predicted letter agrees with (that is, matches) the actual letter in the test dataset:
agreement <- letter_predictions == letters_test$letter
Using the table() function, we see that the classifier correctly identified the letter in 3,357 out of the 4,000 test records:
table(agreement)
## agreement
## FALSE TRUE
## 643 3357
In percentage terms, the accuracy is about 84 percent:
prop.table(table(agreement))
## agreement
## FALSE TRUE
## 0.16075 0.83925
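Because R treats TRUE as 1 and FALSE as 0, the same figure can also be obtained directly with mean():
mean(agreement)  # 0.83925, matching the proportion shown above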
Our previous SVM model used the simple linear kernel function. By using a more complex kernel function, we can map the data into a higher dimensional space, and potentially obtain a better model fit.
It can be challenging, however, to choose from the many different kernel functions. A popular convention is to begin with the Gaussian RBF kernel, which has been shown to perform well for many types of data. We can train an RBF-based SVM, using the ksvm() function as shown here:
letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train,
kernel = "rbfdot")
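By default, ksvm() estimates a value for the RBF kernel’s sigma parameter automatically from the training data; if you would like to experiment, it can also be set explicitly through the kpar argument (the sigma value and object name below are purely illustrative):
letter_classifier_rbf2 <- ksvm(letter ~ ., data = letters_train,
                               kernel = "rbfdot", kpar = list(sigma = 0.05))
For the comparison that follows, we stick with the automatic default.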
Next, we make predictions as done earlier:
letter_predictions_rbf <- predict(letter_classifier_rbf,
letters_test)
Finally, we’ll compare the accuracy to our linear SVM:
agreement_rbf <- letter_predictions_rbf == letters_test$letter
table(agreement_rbf)
## agreement_rbf
## FALSE TRUE
## 278 3722
prop.table(table(agreement_rbf))
## agreement_rbf
## FALSE TRUE
## 0.0695 0.9305
By simply changing the kernel function, we were able to increase the accuracy of our character recognition model from 84 percent to 93 percent. If this level of performance is still unsatisfactory for the OCR program, other kernels could be tested, or the cost of constraints parameter C could be varied to modify the width of the decision boundary. As an exercise, you should experiment with these parameters to see how they impact the success of the final model.
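One possible sketch of that exercise loops over a few candidate cost values (the values are arbitrary, and fitting several SVMs on 16,000 records may take some time):
cost_values <- c(1, 5, 10, 20)
accuracy_by_cost <- sapply(cost_values, function(cost) {
  model <- ksvm(letter ~ ., data = letters_train,
                kernel = "rbfdot", C = cost)
  predictions <- predict(model, letters_test)
  mean(predictions == letters_test$letter)  # proportion of test letters identified correctly
})
accuracy_by_cost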
Lantz, Brett. Machine Learning with R. 2nd ed. Birmingham: Packt Publishing, 2015. Print.