In this exercise, images of various hand-written alphabetical letters are recorded as numerical features describing pixels, lighting, shading, and the horizontal and vertical dimensions of each letter. With the outcome classified into one of the 26 possible letters, a Support Vector Machine (SVM) analysis is performed so that a learner can learn the characteristics of the 26 letters and predict the letter classification of new data. The data come from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) and can also be found on Prof. Suess's Statistical Learning with R course website: http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml11/letterdata.csv
The dataset contains 20000 observations and 17 variables: 16 numerical features describing characteristics of each letter image, plus the letter classification outcome itself. All feature values are numerical, which SVM requires. SVM also assumes the features are on similar ranges, so normalizing or standardizing them is strongly recommended so that each feature carries similar weight and no feature biases the model because of its range. However, the function used later (ksvm()) scales the data automatically, so this step can be skipped here; a sketch of manual normalization follows the str() output below.
letters <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml11/letterdata.csv")
str(letters)
## 'data.frame': 20000 obs. of 17 variables:
## $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
## $ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
## $ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
## $ width : int 3 3 6 6 3 5 5 3 4 13 ...
## $ height: int 5 7 8 6 1 8 4 2 4 9 ...
## $ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
## $ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
## $ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
## $ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
## $ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
## $ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
## $ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
## $ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
## $ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
## $ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
## $ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
## $ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
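For reference, ksvm() centers and scales the predictors internally (its scaled argument defaults to TRUE), so the following manual min-max normalization is only a sketch; normalize and letters_norm are names introduced here for illustration.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))  # rescale a feature to [0, 1]
letters_norm <- as.data.frame(lapply(letters[2:17], normalize))  # columns 2:17 are the 16 numeric features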
The dataset is pre-randomized, so the first 80% of the rows is split off as the training dataset and the remaining 20% is held out as the test dataset.
letters_train <- letters[1:16000, ]
letters_test <- letters[16001:20000, ]
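The sequential split above is safe only because the rows are already in random order; if they were not, a random split could be drawn first. A minimal sketch (the seed value is arbitrary):
set.seed(123)  # arbitrary seed, for reproducibility only
train_idx <- sample(nrow(letters), 16000)  # draw 16000 random row numbers
letters_train <- letters[train_idx, ]
letters_test <- letters[-train_idx, ]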
The kernlab package is installed and loaded for its kernel-based support vector machine function, ksvm(). Using the letter variable as the outcome and the remaining 16 features as numerical predictors on the training dataset, and setting the kernel to 'vanilladot' (the linear kernel), an SVM model is built as letter_classifier. In general, a kernel maps the existing predictors into a higher-dimensional space, based on relationships among them, so that the classes can be separated linearly by a maximum margin hyperplane (MMH).
library(kernlab)
letter_classifier <- ksvm(letter ~ ., data = letters_train,
kernel = "vanilladot")
## Setting default kernel parameters
A glance at the SVM object is shown below, including the cost value, the kernel type/function, and the number of support vectors used to define the maximum margin hyperplane.
letter_classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 7037
##
## Objective Function Value : -14.1746 -20.0072 -23.5628 -6.2009 -7.5524 -32.7694 -49.9786 -18.1824 -62.1111 -32.7284 -16.2209 -32.2837 -28.9777 -51.2195 -13.276 -35.6217 -30.8612 -16.5256 -14.6811 -32.7475 -30.3219 -7.7956 -11.8138 -32.3463 -13.1262 -9.2692 -153.1654 -52.9678 -76.7744 -119.2067 -165.4437 -54.6237 -41.9809 -67.2688 -25.1959 -27.6371 -26.4102 -35.5583 -41.2597 -122.164 -187.9178 -222.0856 -21.4765 -10.3752 -56.3684 -12.2277 -49.4899 -9.3372 -19.2092 -11.1776 -100.2186 -29.1397 -238.0516 -77.1985 -8.3339 -4.5308 -139.8534 -80.8854 -20.3642 -13.0245 -82.5151 -14.5032 -26.7509 -18.5713 -23.9511 -27.3034 -53.2731 -11.4773 -5.12 -13.9504 -4.4982 -3.5755 -8.4914 -40.9716 -49.8182 -190.0269 -43.8594 -44.8667 -45.2596 -13.5561 -17.7664 -87.4105 -107.1056 -37.0245 -30.7133 -112.3218 -32.9619 -27.2971 -35.5836 -17.8586 -5.1391 -43.4094 -7.7843 -16.6785 -58.5103 -159.9936 -49.0782 -37.8426 -32.8002 -74.5249 -133.3423 -11.1638 -5.3575 -12.438 -30.9907 -141.6924 -54.2953 -179.0114 -99.8896 -10.288 -15.1553 -3.7815 -67.6123 -7.696 -88.9304 -47.6448 -94.3718 -70.2733 -71.5057 -21.7854 -12.7657 -7.4383 -23.502 -13.1055 -239.9708 -30.4193 -25.2113 -136.2795 -140.9565 -9.8122 -34.4584 -6.3039 -60.8421 -66.5793 -27.2816 -214.3225 -34.7796 -16.7631 -135.7821 -160.6279 -45.2949 -25.1023 -144.9059 -82.2352 -327.7154 -142.0613 -158.8821 -32.2181 -32.8887 -52.9641 -25.4937 -47.9936 -6.8991 -9.7293 -36.436 -70.3907 -187.7611 -46.9371 -89.8103 -143.4213 -624.3645 -119.2204 -145.4435 -327.7748 -33.3255 -64.0607 -145.4831 -116.5903 -36.2977 -66.3762 -44.8248 -7.5088 -217.9246 -12.9699 -30.504 -2.0369 -6.126 -14.4448 -21.6337 -57.3084 -20.6915 -184.3625 -20.1052 -4.1484 -4.5344 -0.828 -121.4411 -7.9486 -58.5604 -21.4878 -13.5476 -5.646 -15.629 -28.9576 -20.5959 -76.7111 -27.0119 -94.7101 -15.1713 -10.0222 -7.6394 -1.5784 -87.6952 -6.2239 -99.3711 -101.0906 -45.6639 -24.0725 -61.7702 -24.1583 -52.2368 -234.3264 -39.9749 -48.8556 -34.1464 -20.9664 -11.4525 -123.0277 -6.4903 -5.1865 -8.8016 -9.4618 -21.7742 -24.2361 -123.3984 -31.4404 -88.3901 -30.0924 -13.8198 -9.2701 -3.0823 -87.9624 -6.3845 -13.968 -65.0702 -105.523 -13.7403 -13.7625 -50.4223 -2.933 -8.4289 -80.3381 -36.4147 -112.7485 -4.1711 -7.8989 -1.2676 -90.8037 -21.4919 -7.2235 -47.9557 -3.383 -20.433 -64.6138 -45.5781 -56.1309 -6.1345 -18.6307 -2.374 -72.2553 -111.1885 -106.7664 -23.1323 -19.3765 -54.9819 -34.2953 -64.4756 -20.4115 -6.689 -4.378 -59.141 -34.2468 -58.1509 -33.8665 -10.6902 -53.1387 -13.7478 -20.1987 -55.0923 -3.8058 -60.0382 -235.4841 -12.6837 -11.7407 -17.3058 -9.7167 -65.8498 -17.1051 -42.8131 -53.1054 -25.0437 -15.302 -44.0749 -16.9582 -62.9773 -5.204 -5.2963 -86.1704 -3.7209 -6.3445 -1.1264 -122.5771 -23.9041 -355.0145 -31.1013 -32.619 -4.9664 -84.1048 -134.5957 -72.8371 -23.9002 -35.3077 -11.7119 -22.2889 -1.8598 -59.2174 -8.8994 -150.742 -1.8533 -1.9711 -9.9676 -0.5207 -26.9229 -30.429 -5.6289
## Training error : 0.130062
A vector of letter predictions for the test dataset is produced by applying the SVM model to the test predictors with the predict() function. Calling head() on the predicted vector shows the first few predicted letter classifications.
letter_predictions <- predict(letter_classifier, letters_test)
head(letter_predictions)
## [1] U N V X N H
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
The table() function compares the predicted letter vector with the actual letter values from the test dataset. The rows of the table list the predicted letters, while the columns list the actual letters. Values on the diagonal are letters that were correctly predicted, while off-diagonal values are misclassifications. For example, row B, column D contains 5 observations in which the actual letter D was incorrectly classified as a B, and row G, column C contains 2 observations in which a C was misclassified as a G.
table(letter_predictions, letters_test$letter)
##
## letter_predictions A B C D E F G H I J K L M N
## A 144 0 0 0 0 0 0 0 0 1 0 0 1 2
## B 0 121 0 5 2 0 1 2 0 0 1 0 1 0
## C 0 0 120 0 4 0 10 2 2 0 1 3 0 0
## D 2 2 0 156 0 1 3 10 4 3 4 3 0 5
## E 0 0 5 0 127 3 1 1 0 0 3 4 0 0
## F 0 0 0 0 0 138 2 2 6 0 0 0 0 0
## G 1 1 2 1 9 2 123 2 0 0 1 2 1 0
## H 0 0 0 1 0 1 0 102 0 2 3 2 3 4
## I 0 1 0 0 0 1 0 0 141 8 0 0 0 0
## J 0 1 0 0 0 1 0 2 5 128 0 0 0 0
## K 1 1 9 0 0 0 2 5 0 0 118 0 0 2
## L 0 0 0 0 2 0 1 1 0 0 0 133 0 0
## M 0 0 1 1 0 0 1 1 0 0 0 0 135 4
## N 0 0 0 0 0 1 0 1 0 0 0 0 0 145
## O 1 0 2 1 0 0 1 2 0 1 0 0 0 1
## P 0 0 0 1 0 2 1 0 0 0 0 0 0 0
## Q 0 0 0 0 0 0 8 2 0 0 0 3 0 0
## R 0 7 0 0 1 0 3 8 0 0 13 0 0 1
## S 1 1 0 0 1 0 3 0 1 1 0 1 0 0
## T 0 0 0 0 3 2 0 0 0 0 1 0 0 0
## U 1 0 3 1 0 0 0 2 0 0 0 0 0 0
## V 0 0 0 0 0 1 3 4 0 0 0 0 1 2
## W 0 0 0 0 0 0 1 0 0 0 0 0 2 0
## X 0 1 0 0 2 0 0 1 3 0 1 6 0 0
## Y 3 0 0 0 0 0 0 1 0 0 0 0 0 0
## Z 2 0 0 0 1 0 0 0 3 4 0 0 0 0
##
## letter_predictions O P Q R S T U V W X Y Z
## A 2 0 5 0 1 1 1 0 1 0 0 1
## B 0 2 2 3 5 0 0 2 0 1 0 0
## C 2 0 0 0 0 0 0 0 0 0 0 0
## D 5 3 1 4 0 0 0 0 0 3 3 1
## E 0 0 2 0 10 0 0 0 0 2 0 3
## F 0 16 0 0 3 0 0 1 0 1 2 0
## G 1 2 8 2 4 3 0 0 0 1 0 0
## H 20 0 2 3 0 3 0 2 0 0 1 0
## I 0 1 0 0 3 0 0 0 0 5 1 1
## J 1 1 3 0 2 0 0 0 0 1 0 6
## K 0 1 0 7 0 1 3 0 0 5 0 0
## L 0 0 1 0 5 0 0 0 0 0 0 1
## M 0 0 0 0 0 0 3 0 8 0 0 0
## N 0 0 0 3 0 0 1 0 2 0 0 0
## O 99 3 3 0 0 0 3 0 0 0 0 0
## P 2 130 0 0 0 0 0 0 0 0 1 0
## Q 3 1 124 0 5 0 0 0 0 0 2 0
## R 1 1 0 138 0 1 0 1 0 0 0 0
## S 0 0 14 0 101 3 0 0 0 2 0 10
## T 0 0 0 0 3 133 1 0 0 0 2 2
## U 1 0 0 0 0 0 152 0 0 1 1 0
## V 1 0 3 1 0 0 0 126 1 0 4 0
## W 0 0 0 0 0 0 4 4 127 0 0 0
## X 1 0 0 0 1 0 0 0 0 137 1 1
## Y 0 7 0 0 0 3 0 0 0 0 127 0
## Z 0 0 0 0 18 3 0 0 0 0 0 132
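Because the correct predictions fall on the diagonal of this table, the overall accuracy can also be computed directly from it. A quick sketch (confusion is a name introduced here for illustration):
confusion <- table(letter_predictions, letters_test$letter)
sum(diag(confusion)) / sum(confusion)  # proportion on the diagonal, about 0.84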
An agreement object is created containing TRUE/FALSE values by comparing each position of the predicted letter vector with the same position of the actual letter vector. A TRUE indicates a correctly classified observation, while a FALSE indicates an incorrectly classified one. Using the table() and prop.table() functions, the absolute counts and percentages of correctly and incorrectly classified letters are shown: the model with the vanilladot kernel and cost C = 1 yields an accuracy of about 84%.
agreement <- letter_predictions == letters_test$letter
head(agreement)
## [1] TRUE TRUE TRUE FALSE TRUE TRUE
table(agreement)
## agreement
## FALSE TRUE
## 643 3357
prop.table(table(agreement))
## agreement
## FALSE TRUE
## 0.16075 0.83925
Model improvement for an SVM in this package can be achieved by changing the kernel type or the cost value. Different kernels perform best depending on the amount of training data and the relationships among the features, so trial and error across kernel types is recommended to evaluate and validate the best model. In this case, the Gaussian RBF kernel is used while keeping the default cost value (C = 1).
set.seed(12345)
letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot")
letter_predictions_rbf <- predict(letter_classifier_rbf, letters_test)
As before, an agreement_rbf object is created to compare the predicted letter vector with the actual letter vector from the test dataset; a TRUE indicates a correctly classified observation, while a FALSE indicates an incorrectly classified one. Using table() and prop.table(), this improved model shows an accuracy of about 93%, so changing the kernel from vanilladot to rbfdot increased accuracy by roughly 9 percentage points.
agreement_rbf <- letter_predictions_rbf == letters_test$letter
table(agreement_rbf)
## agreement_rbf
## FALSE TRUE
## 275 3725
prop.table(table(agreement_rbf))
## agreement_rbf
## FALSE TRUE
## 0.06875 0.93125
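Besides swapping kernels, the cost parameter C of ksvm() can also be tuned; larger values penalize margin violations more heavily. A minimal trial-and-error sketch, with candidate cost values chosen arbitrarily for illustration:
cost_values <- c(1, 5, 10)  # candidate costs, arbitrary choices
rbf_accuracy <- sapply(cost_values, function(cost) {
  set.seed(12345)
  m <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot", C = cost)
  mean(predict(m, letters_test) == letters_test$letter)  # test-set accuracy
})
data.frame(C = cost_values, accuracy = rbf_accuracy)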
Conclusion: A Support Vector Machine (SVM) was used in this exercise to recognize alphabetical letters from their numerical characteristics. The SVM analysis works by finding a maximum margin hyperplane (MMH) that optimally separates the classes linearly. When the observations are not linearly separable, slack variables create a soft margin that allows some points to fall on the incorrect side of the margin, with a cost value that penalizes such violations. A kernel can map the existing features into a higher-dimensional space, based on the relationships among them, so that a separating hyperplane can be drawn more easily. Finally, SVM is a supervised learning algorithm applied to numerical features, which should be normalized to similar ranges, for either classification or numerical prediction.