For this assignment, I will conduct an SVM analysis on the letter dataset. The dataset contains 20,000 examples of the 26 capital letters of the English alphabet, printed using 20 different randomly reshaped and distorted black-and-white fonts. Below, I read the data into R and use the str() function to get a descriptive look at the data types in the dataset.
## Example: Optical Character Recognition ----
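# read in data and examine structure
# stringsAsFactors = TRUE keeps letter as a factor (R >= 4.0 reads strings as character by default)
letters <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml11/letterdata.csv", stringsAsFactors = TRUE)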
str(letters)
'data.frame': 20000 obs. of 17 variables:
$ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
$ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
$ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
$ width : int 3 3 6 6 3 5 5 3 4 13 ...
$ height: int 5 7 8 6 1 8 4 2 4 9 ...
$ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
$ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
$ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
$ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
$ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
$ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
$ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
$ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
$ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
$ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
$ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
$ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
head(letters)
From the output above, we see that the data has 16 features that define each example of the letter class. We also know that all the letters are accounted for, because the variable letter has 26 levels.
Below is a side-by-side boxplot of the variable onpix for each letter.
plot(letters$letter, letters$onpix, xlab = "Letters", ylab="Onpix")
Next, we use a simple holdout method. The data are described as already normalized, and ksvm() also scales the features by default, so no further rescaling is done here. We hold out 20% of the data (4,000 rows) for testing and keep the remaining 80% (16,000 rows) for training. Once the data is split, we train the SVM classifier on the training dataset, using the ksvm() function from the kernlab package.
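The split used in this report takes the first 16,000 rows in order, which is only a fair holdout if the rows are already in random order. If that cannot be assumed, a random split is one alternative. The following is a minimal sketch on a small made-up data frame; the names toy, train_idx, toy_train and toy_test, and the seed, are hypothetical and not part of the original analysis.

```r
# Sketch of a random 80/20 holdout split on toy data (not the letter data)
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in data frame: a factor class and one integer feature
toy <- data.frame(
  letter = factor(sample(LETTERS[1:4], 100, replace = TRUE)),
  onpix  = sample(0:15, 100, replace = TRUE)
)

# Draw 80% of the row indices at random, without replacement
train_idx <- sample(nrow(toy), size = round(0.8 * nrow(toy)))

# Rows drawn go to training; all remaining rows go to testing
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```

With the real data, the same indexing pattern would apply to the letters data frame in place of toy.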
# divide into training and test data
letters_train <- letters[1:16000, ]
letters_test <- letters[16001:20000, ]
# begin by training a simple linear SVM
#install.packages("kernlab")
library(kernlab)
letter_classifier <- ksvm(letter ~ ., data = letters_train,
kernel = "vanilladot")
Setting default kernel parameters
# look at basic information about the model
letter_classifier
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Linear (vanilla) kernel function.
Number of Support Vectors : 7037
Objective Function Value : -14.1746 -20.0072 -23.5628 -6.2009 -7.5524 -32.7694 -49.9786 -18.1824 -62.1111 -32.7284 -16.2209 -32.2837 -28.9777 -51.2195 -13.276 -35.6217 -30.8612 -16.5256 -14.6811 -32.7475 -30.3219 -7.7956 -11.8138 -32.3463 -13.1262 -9.2692 -153.1654 -52.9678 -76.7744 -119.2067 -165.4437 -54.6237 -41.9809 -67.2688 -25.1959 -27.6371 -26.4102 -35.5583 -41.2597 -122.164 -187.9178 -222.0856 -21.4765 -10.3752 -56.3684 -12.2277 -49.4899 -9.3372 -19.2092 -11.1776 -100.2186 -29.1397 -238.0516 -77.1985 -8.3339 -4.5308 -139.8534 -80.8854 -20.3642 -13.0245 -82.5151 -14.5032 -26.7509 -18.5713 -23.9511 -27.3034 -53.2731 -11.4773 -5.12 -13.9504 -4.4982 -3.5755 -8.4914 -40.9716 -49.8182 -190.0269 -43.8594 -44.8667 -45.2596 -13.5561 -17.7664 -87.4105 -107.1056 -37.0245 -30.7133 -112.3218 -32.9619 -27.2971 -35.5836 -17.8586 -5.1391 -43.4094 -7.7843 -16.6785 -58.5103 -159.9936 -49.0782 -37.8426 -32.8002 -74.5249 -133.3423 -11.1638 -5.3575 -12.438 -30.9907 -141.6924 -54.2953 -179.0114 -99.8896 -10.288 -15.1553 -3.7815 -67.6123 -7.696 -88.9304 -47.6448 -94.3718 -70.2733 -71.5057 -21.7854 -12.7657 -7.4383 -23.502 -13.1055 -239.9708 -30.4193 -25.2113 -136.2795 -140.9565 -9.8122 -34.4584 -6.3039 -60.8421 -66.5793 -27.2816 -214.3225 -34.7796 -16.7631 -135.7821 -160.6279 -45.2949 -25.1023 -144.9059 -82.2352 -327.7154 -142.0613 -158.8821 -32.2181 -32.8887 -52.9641 -25.4937 -47.9936 -6.8991 -9.7293 -36.436 -70.3907 -187.7611 -46.9371 -89.8103 -143.4213 -624.3645 -119.2204 -145.4435 -327.7748 -33.3255 -64.0607 -145.4831 -116.5903 -36.2977 -66.3762 -44.8248 -7.5088 -217.9246 -12.9699 -30.504 -2.0369 -6.126 -14.4448 -21.6337 -57.3084 -20.6915 -184.3625 -20.1052 -4.1484 -4.5344 -0.828 -121.4411 -7.9486 -58.5604 -21.4878 -13.5476 -5.646 -15.629 -28.9576 -20.5959 -76.7111 -27.0119 -94.7101 -15.1713 -10.0222 -7.6394 -1.5784 -87.6952 -6.2239 -99.3711 -101.0906 -45.6639 -24.0725 -61.7702 -24.1583 -52.2368 -234.3264 -39.9749 -48.8556 -34.1464 -20.9664 -11.4525 -123.0277 -6.4903 -5.1865 
-8.8016 -9.4618 -21.7742 -24.2361 -123.3984 -31.4404 -88.3901 -30.0924 -13.8198 -9.2701 -3.0823 -87.9624 -6.3845 -13.968 -65.0702 -105.523 -13.7403 -13.7625 -50.4223 -2.933 -8.4289 -80.3381 -36.4147 -112.7485 -4.1711 -7.8989 -1.2676 -90.8037 -21.4919 -7.2235 -47.9557 -3.383 -20.433 -64.6138 -45.5781 -56.1309 -6.1345 -18.6307 -2.374 -72.2553 -111.1885 -106.7664 -23.1323 -19.3765 -54.9819 -34.2953 -64.4756 -20.4115 -6.689 -4.378 -59.141 -34.2468 -58.1509 -33.8665 -10.6902 -53.1387 -13.7478 -20.1987 -55.0923 -3.8058 -60.0382 -235.4841 -12.6837 -11.7407 -17.3058 -9.7167 -65.8498 -17.1051 -42.8131 -53.1054 -25.0437 -15.302 -44.0749 -16.9582 -62.9773 -5.204 -5.2963 -86.1704 -3.7209 -6.3445 -1.1264 -122.5771 -23.9041 -355.0145 -31.1013 -32.619 -4.9664 -84.1048 -134.5957 -72.8371 -23.9002 -35.3077 -11.7119 -22.2889 -1.8598 -59.2174 -8.8994 -150.742 -1.8533 -1.9711 -9.9676 -0.5207 -26.9229 -30.429 -5.6289
Training error : 0.130062
The training error of the model above is about 0.13, but we need to evaluate the performance of the model on the test dataset.
# predictions on testing dataset
letter_predictions <- predict(letter_classifier, letters_test)
head(letter_predictions)
[1] U N V X N H
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
table(letters_test$letter, letter_predictions)
letter_predictions
A B C D E F G H I J K L M N O P Q R S T U V
A 144 0 0 2 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0
B 0 121 0 2 0 0 1 0 1 1 1 0 0 0 0 0 0 7 1 0 0 0
C 0 0 120 0 5 0 2 0 0 0 9 0 1 0 2 0 0 0 0 0 3 0
D 0 5 0 156 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0
E 0 2 4 0 127 0 9 0 0 0 0 2 0 0 0 0 0 1 1 3 0 0
F 0 0 0 1 3 138 2 1 1 1 0 0 0 1 0 2 0 0 0 2 0 1
G 0 1 10 3 1 2 123 0 0 0 2 1 1 0 1 1 8 3 3 0 0 3
H 0 2 2 10 1 2 2 102 0 2 5 1 1 1 2 0 2 8 0 0 2 4
I 0 0 2 4 0 6 0 0 141 5 0 0 0 0 0 0 0 0 1 0 0 0
J 1 0 0 3 0 0 0 2 8 128 0 0 0 0 1 0 0 0 1 0 0 0
K 0 1 1 4 3 0 1 3 0 0 118 0 0 0 0 0 0 13 0 1 0 0
L 0 0 3 3 4 0 2 2 0 0 0 133 0 0 0 0 3 0 1 0 0 0
M 1 1 0 0 0 0 1 3 0 0 0 0 135 0 0 0 0 0 0 0 0 1
N 2 0 0 5 0 0 0 4 0 0 2 0 4 145 1 0 0 1 0 0 0 2
O 2 0 2 5 0 0 1 20 0 1 0 0 0 0 99 2 3 1 0 0 1 1
P 0 2 0 3 0 16 2 0 1 1 1 0 0 0 3 130 1 1 0 0 0 0
Q 5 2 0 1 2 0 8 2 0 3 0 1 0 0 3 0 124 0 14 0 0 3
R 0 3 0 4 0 0 2 3 0 0 7 0 0 3 0 0 0 138 0 0 0 1
S 1 5 0 0 10 3 4 0 3 2 0 5 0 0 0 0 5 0 101 3 0 0
T 1 0 0 0 0 0 3 3 0 0 1 0 0 0 0 0 0 1 3 133 0 0
U 1 0 0 0 0 0 0 0 0 0 3 0 3 1 3 0 0 0 0 1 152 0
V 0 2 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 126
W 1 0 0 0 0 0 0 0 0 0 0 0 8 2 0 0 0 0 0 0 0 1
X 0 1 0 3 2 1 1 0 5 1 5 0 0 0 0 0 0 0 2 0 1 0
Y 0 0 0 3 0 2 0 1 1 0 0 0 0 0 0 1 2 0 0 2 1 4
Z 1 0 0 1 3 0 0 0 1 6 0 1 0 0 0 0 0 0 10 2 0 0
letter_predictions
W X Y Z
A 0 0 3 2
B 0 1 0 0
C 0 0 0 0
D 0 0 0 0
E 0 2 0 1
F 0 0 0 0
G 1 0 0 0
H 0 1 1 0
I 0 3 0 3
J 0 0 0 4
K 0 1 0 0
L 0 6 0 0
M 2 0 0 0
N 0 0 0 0
O 0 1 0 0
P 0 0 7 0
Q 0 0 0 0
R 0 0 0 0
S 0 1 0 18
T 0 0 3 3
U 4 0 0 0
V 4 0 0 0
W 127 0 0 0
X 0 137 0 0
Y 0 1 127 0
Z 0 1 0 132
The output above shows the first six predicted letters, obtained with the head() function. Using the table() function, I compared the predicted letters (columns) with the true letters (rows) in the test dataset. The diagonal values of the table give the number of correct predictions for each letter. The remaining values count the cases where the model misclassified the letter.
Since the full 26 x 26 table is hard to read, we will create a vector of TRUE and FALSE values that makes it easy to calculate the accuracy of the model.
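The diagonal of such a table can also be used to see how well each individual class is recognized. The following is a small sketch on made-up labels with only three classes; the vectors actual and predicted are hypothetical and are not taken from the letter data.

```r
# Toy true and predicted labels (hypothetical, three classes only)
actual    <- factor(c("A", "A", "A", "B", "B", "C", "C", "C", "C", "B"),
                    levels = c("A", "B", "C"))
predicted <- factor(c("A", "A", "B", "B", "B", "C", "C", "A", "C", "C"),
                    levels = c("A", "B", "C"))

# Rows are true classes, columns are predicted classes
conf_mat <- table(actual, predicted)

# Correct predictions sit on the diagonal; row sums count the true examples,
# so this ratio is the share of each class that was predicted correctly
recall <- diag(conf_mat) / rowSums(conf_mat)
recall

# Overall accuracy is the diagonal total divided by the grand total
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Applied to the real confusion table, the same two lines would show which letters the classifier struggles with most.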
# look only at agreement vs. non-agreement
# construct a vector of TRUE/FALSE indicating correct/incorrect predictions
agreement <- letter_predictions == letters_test$letter
table(agreement)
agreement
FALSE TRUE
643 3357
prop.table(table(agreement))
agreement
FALSE TRUE
0.16075 0.83925
The accuracy of the model is 0.83925. This is acceptable, but let's see if we can do better.
I will try to improve the model using a more complex kernel function. This maps the data into a higher-dimensional space and will hopefully produce a more effective model. The kernel used is the Gaussian RBF kernel, which is selected by passing kernel = "rbfdot" to the ksvm() function.
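To make the kernel itself concrete, here is a hand-written sketch of the Gaussian RBF form k(x, y) = exp(-sigma * ||x - y||^2), which is the form the kernlab documentation gives for rbfdot. The function rbf_kernel, the two vectors, and the sigma value are all hypothetical and chosen only for illustration. When no sigma is supplied, ksvm() estimates one from the data, so the value below is not the one used by the model in this report.

```r
# Gaussian RBF kernel written out by hand: k(x, y) = exp(-sigma * ||x - y||^2)
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

x <- c(2, 8, 3)   # hypothetical feature vectors
y <- c(5, 12, 3)
sigma <- 0.05     # hypothetical value, for illustration only

rbf_kernel(x, x, sigma)  # identical points have similarity 1
rbf_kernel(x, y, sigma)  # similarity decays as the squared distance grows
```

The kernel value acts as a similarity score between two examples, which is what lets the SVM draw non-linear boundaries in the original feature space.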
## Step 5: Improving model performance ----
set.seed(12345)
letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot")
I will now use the predict() function as earlier to produce the predictions from the model.
letter_predictions_rbf <- predict(letter_classifier_rbf, letters_test)
#table(letters_test$letter, letter_predictions_rbf)
agreement_rbf <- letter_predictions_rbf == letters_test$letter
table(agreement_rbf)
agreement_rbf
FALSE TRUE
275 3725
prop.table(table(agreement_rbf))
agreement_rbf
FALSE TRUE
0.06875 0.93125
We see that this model outperformed the previous model: test accuracy rose from 0.83925 with the linear kernel to 0.93125 with the RBF kernel.
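The improvement can also be expressed in terms of misclassified test examples, using the counts reported in the two agreement tables (643 errors for the linear kernel, 275 for the RBF kernel, out of 4,000 test rows). The variable names below are hypothetical; the arithmetic is a sketch, not part of the original analysis.

```r
# Error counts taken from the two agreement tables in this report
errors_linear <- 643
errors_rbf    <- 275
n_test        <- 4000

# Accuracy recomputed from the error counts
acc_linear <- 1 - errors_linear / n_test
acc_rbf    <- 1 - errors_rbf / n_test

# Share of the linear model's errors that the RBF model removed
relative_error_reduction <- (errors_linear - errors_rbf) / errors_linear

acc_linear
acc_rbf
relative_error_reduction
```

Looking at the relative reduction in errors, rather than only the accuracy difference, gives a clearer picture of how much of the remaining mistakes the second model eliminated.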