Support Vector Machines (SVM) in R

Support vector machines (SVMs) are binary classifiers. For more than two classes, classification is achieved by creatively running binary classification several times. A key advantage of SVMs is that they can perform non-linear classification, which makes them more flexible.

Classification methodology:

The first step is performed via kernel functions. These are equations that project the dataset (whose elements are to be classified) into a new feature space. In the new space, observations that were previously not linearly separable become linearly separable.

Classification is then achieved using a hyperplane that separates the various classes.

Therefore, an SVM is trained by looking for the optimal hyperplane that separates the two classes. The optimal hyperplane, by definition, is the plane that maximizes the margin between the closest points of the two classes. The points that lie on the margin are called the support vectors, and the line passing through the midpoint of the margin is the optimal hyperplane.
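To make the kernel idea concrete, here is a minimal sketch on toy data (not this session’s dataset, and using the e1071 package introduced below): a circular class boundary is not linearly separable, but a radial kernel handles it.

#Toy illustration: compare a linear and a radial kernel on data with a circular boundary
library(e1071)
set.seed(1)
toy <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
toy$class <- factor(ifelse(toy$x1^2 + toy$x2^2 < 0.4, "inside", "outside"))

fit_linear <- svm(class ~ x1 + x2, data = toy, kernel = "linear")
#mean(predict(fit_linear, toy) == toy$class) #training accuracy with a linear kernel

fit_radial <- svm(class ~ x1 + x2, data = toy, kernel = "radial")
#mean(predict(fit_radial, toy) == toy$class) #training accuracy with a radial kernel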

For this session, we are going to be using the packages tidyverse and e1071. e1071 is a package that provides an R interface to the widely used LIBSVM library written in C++.

If you don’t have the packages installed already just run the commands below:

#suppressMessages(install.packages("tidyverse")) #installs the packages
suppressMessages(library(tidyverse)) #loads the installed package
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'stringr' was built under R version 4.0.2
#suppressMessages(install.packages("e071"))
suppressMessages(library(e1071))
## Warning: package 'e1071' was built under R version 3.6.3

After loading the necessary packages, we load the dataset which we’ll be using in the session.

#LOAD THE DATASET INTO OUR R SESSION
tree_data <-read.csv("E:/datasets/trees.csv",header = TRUE)
str(tree_data)
## 'data.frame':    200 obs. of  5 variables:
##  $ leaf_width  : num  5.13 7.49 9.22 6.98 3.46 4.55 4.95 7.64 8.69 7.21 ...
##  $ leaf_length : num  6.18 4.02 4.16 11.1 5.19 5.15 10.4 2.58 4.35 3.62 ...
##  $ trunk_girth : num  8.26 8.07 5.46 6.96 8.72 9.01 6.33 9.73 4.37 8.71 ...
##  $ trunk_height: num  8.74 6.78 8.45 4.06 10.4 9.64 4.49 7.75 8.82 7.43 ...
##  $ tree_type   : int  0 0 1 2 0 0 2 0 1 0 ...
#we check to see what is contained in the dataset
view(tree_data)

From the result, we see that the dataset has four features and three classification categories, i.e. 0, 1 and 2 (under the tree_type column). The features of this dataset are: leaf_width, leaf_length, trunk_girth and trunk_height.

Features are basically the characteristics we use to classify the type of a tree. Therefore, we may have different trees with almost similar feature combinations belonging to one class. For classification to work, all the features need to be scaled over a fairly small interval, which implies that if the range of a numeric variable is wide, the data needs to be normalized/standardized.

Basically, for trees to be classified into the same type, they need to have feature combinations that more or less lie in a given range.

The good news is that the R package used for fitting the SVM model performs this normalization automatically.
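For reference, this normalization is controlled by the scale argument of e1071’s svm(), which defaults to TRUE. A manual standardization (shown only for illustration) could use base R’s scale():

#svm() scales numeric columns to zero mean and unit variance by default (scale = TRUE),
#so no manual step is needed here. For illustration only:
scaled_leaf <- scale(tree_data[, c("leaf_width", "leaf_length")])
#head(scaled_leaf) #inspect the centred and scaled leaf features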

To get a better understanding of the dataset we’ll be using for this session, let’s visualize it using the ggplot2 package. We look at the leaf features and trunk features separately using scatter plots and colour the points based on the label tree_type.

# Plot of leaf features, where `x = leaf_width` and `y = leaf_length`

tree_data %>% ggplot(aes(x = leaf_width, y = leaf_length, color = as.factor(tree_type))) +
  geom_point() +
  ggtitle("Leaf length against leaf width coloured by tree type") +
  labs(x = "Leaf width", y = "Leaf length", colour = "Tree type") +
  theme(plot.title = element_text(hjust = 0.5))

From the plot, we see that, based on leaf width and leaf length, there are three groups present in this dataset, divided according to tree type.

Similarly, using the trunk features we get the plot below.

# Plot of trunk features, where `x = trunk_girth` and `y = trunk_height`
tree_data %>% ggplot(aes(x = trunk_girth, y = trunk_height, color = as.factor(tree_type))) +
  geom_point() +
  ggtitle("Trunk height against trunk girth coloured by tree type") +
  labs(x = "Trunk girth", y = "Trunk height", colour = "Tree type") +
  theme(plot.title = element_text(hjust = 0.5))

We can see three groups that separate based on tree_type: 0, 1, and 2 (coloured red, green, and blue, respectively). There are a few stray points, but for the most part the features trunk girth and trunk height allow you to predict tree type.

TRAINING GOAL/OBJECTIVE

Suppose we have a new tree specimen and we want to figure out its tree type based on its leaf and trunk measurements. We could draw boundary lines separating the 3 classes on the plots we just made and, based on the regions the tree’s data points fall into in the two scatter plots, approximate its tree type.

Alternatively, using these same leaf and trunk measurements, SVMs can predict the tree type for us. SVMs will use the features and labels we provide for known tree types to create hyperplanes for tree type. These hyperplanes allow us to predict which tree type a new tree specimen belongs to, given their leaf and trunk measurements.

To achieve this, we can use either of the two approaches below:

  1. Divide the tree_data dataset into a training set and a test set and evaluate how the model performs.

  2. Use the tree_data dataset for training, then use a randomly generated dataset containing leaf features to test the SVM model we just created and evaluate how correctly it predicts the new tree specimens.

IMPLEMENTATION

Let’s begin with the classification. We create one SVM based on the leaf features; the other SVM can be implemented in a similar fashion using the trunk features.

The svm function is found in the package e1071. To see its input arguments, structure and syntax, we use the package documentation as shown below:

 ?svm
## starting httpd help server ... done

For this session, we use the syntax: svm(x = x, y = y, type = type, kernel = kernel). The type argument selects the classification mode and kernel selects the kernel function.

x represents the features, i.e. the predictor variables in the dataset used for modelling. In this case, the features are leaf width and leaf length (of class matrix).

y represents the labels, i.e. the outcome in the dataset that we want to predict. In this example, the label is the tree type (of class factor).

It is vital to note that the svm function requires two specific data structures here: a matrix for x and a factor for y.

A factor is used to categorize data, where the names of the categories are known as levels. For example, “air” could be a factor, with levels including “oxygen”, “nitrogen”, and “carbon dioxide”; here, “tree type” is our factor and its levels are “type 0”, “type 1” and “type 2”.
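As a quick toy illustration (the objects air and tree_type_example below are just for demonstration, not part of the dataset):

#Toy illustration of a factor and its levels
air <- factor(c("oxygen", "nitrogen", "carbon dioxide", "oxygen"))
#levels(air) returns the category names: "carbon dioxide" "nitrogen" "oxygen"
tree_type_example <- factor(c(0, 0, 1, 2))
#class(tree_type_example) is "factor", with levels "0", "1" and "2"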

To check the types of the elements in the dataset, we run the code below:

str(tree_data)
## 'data.frame':    200 obs. of  5 variables:
##  $ leaf_width  : num  5.13 7.49 9.22 6.98 3.46 4.55 4.95 7.64 8.69 7.21 ...
##  $ leaf_length : num  6.18 4.02 4.16 11.1 5.19 5.15 10.4 2.58 4.35 3.62 ...
##  $ trunk_girth : num  8.26 8.07 5.46 6.96 8.72 9.01 6.33 9.73 4.37 8.71 ...
##  $ trunk_height: num  8.74 6.78 8.45 4.06 10.4 9.64 4.49 7.75 8.82 7.43 ...
##  $ tree_type   : int  0 0 1 2 0 0 2 0 1 0 ...
#we notice that tree type values are integers
#the data is stored in a data frame

The SVM will be based on the leaf features, i.e. leaf_width and leaf_length. We need to create a new variable that contains only these two features, then convert it from a data.frame to a matrix, so it can be the input to the x argument of the svm function.

We also need to convert the labels in tree_type into a factor for the y argument of the svm function, as they are currently stored as integers.

The x and y variables for the leaf features in tree_data are converted to the appropriate data types as shown below.

#create a subset from the original dataset excluding the trunk features and converting it to a matrix

x_leaf_data <- tree_data %>% select(leaf_width,leaf_length) %>% as.matrix()


#confirmation of the class of the new dataset we created
class(x_leaf_data)
## [1] "matrix"
#view the new subset we just created
head(x_leaf_data)
##      leaf_width leaf_length
## [1,]       5.13        6.18
## [2,]       7.49        4.02
## [3,]       9.22        4.16
## [4,]       6.98       11.10
## [5,]       3.46        5.19
## [6,]       4.55        5.15
#Convert the tree type from integer to factor
tree_data <- tree_data %>% mutate(tree_type = as.factor(tree_type))


#Confirmation of our y variable input to svm

class(tree_data$tree_type)
## [1] "factor"
head(tree_data$tree_type)
## [1] 0 0 1 2 0 0
## Levels: 0 1 2

Finally, we are now ready to run the svm function based on the leaf features stored in the new variable ‘x_leaf_data’ and the labels saved in the variable ‘tree_data$tree_type’.

TRAINING

TRAINING APPROACH 1: DIVIDING THE DATASET INTO TRAINING AND TEST SUBSETS

First we divide the dataset into a training subset and a test subset. It is important to note that for this approach to work, our dataset needs to be randomized (the rows must not be ordered by class). Luckily for us, our dataset is already randomized, so we can skip this step; a sketch of how shuffling could be done is shown below for reference.
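If the rows had instead been ordered by class, one possible way to shuffle them (a sketch using base R’s sample(); the seed value is arbitrary) would be:

#Hypothetical shuffling step, only needed if the rows were ordered by class
#set.seed(123)
#shuffled_idx <- sample(nrow(tree_data))
#tree_data   <- tree_data[shuffled_idx, ]
#x_leaf_data <- x_leaf_data[shuffled_idx, ]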

x_leaf_data_train <- x_leaf_data[1:150, ] #Takes the first 150 entries of our x_leaf_data dataset which will be used as our training dataset

x_leaf_data_test <- x_leaf_data[151:200, ] #Takes the 151st to the 200th entries of our x_leaf_data dataset which will be used as our test dataset
tree_data_train <- tree_data[1:150, ] #Takes the first 150 entries of our tree_data dataset which will provide the training labels
tree_data_test <- tree_data[151:200, ] #Takes the last 50 entries of our tree_data dataset which will provide the test labels

Then we build our SVM model following the aforementioned syntax:

svm_leaf_data <- svm(x = x_leaf_data_train, y = tree_data_train$tree_type, type = "C-classification", kernel = "radial")

print("Our SVM model dubbed svm_leaf_data is ready.")
## [1] "Our SVM model dubbed svm_leaf_data is ready."

Then we make predictions using the predict function. The general syntax of the predict function can be obtained by running ?predict.

svm_leaf_data_prediction<- predict(svm_leaf_data, newdata = x_leaf_data_test, type = "decision")
view(svm_leaf_data_prediction) # shows the predictions made

Let’s take a minute and reflect on what we have done so far:

  1. We trained the SVM model using the first 150 entries of our original dataset.

  2. We built a prediction model and named it svm_leaf_data.

  3. We used the model we just created to predict the tree type of the remaining 50 entries of our original dataset.

Sounds interesting right?? :)

Now we evaluate the performance of our SVM model. To achieve this, we compare the predicted values obtained from our model with the actual values contained in our tree_data_test dataset, using the table function:

table(svm_leaf_data_prediction, tree_data_test$tree_type)
##                         
## svm_leaf_data_prediction  0  1  2
##                        0 16  2  0
##                        1  0 14  0
##                        2  0  1 17

From the output, we see that 16 type 0 trees were correctly classified, but there was some misclassification: 2 trees that were actually type 1 were erroneously classified as type 0, and 1 type 1 tree was classified as type 2. The table function gives a detailed breakdown of the misclassification. For simplicity, we can generalize the errors by considering only whether each prediction was correct or incorrect, ignoring the type of error, using TRUE/FALSE.

correct_pred <- svm_leaf_data_prediction == tree_data_test$tree_type
table(correct_pred)
## correct_pred
## FALSE  TRUE 
##     3    47
#GETTING THE RESULTS IN PERCENTAGE
prop.table(table(correct_pred))
## correct_pred
## FALSE  TRUE 
##  0.06  0.94

Thus the model we just created has an accuracy of 94%.

TRAINING APPROACH 2: USING THE TREE_DATA DATASET FOR TRAINING AND A RANDOMLY GENERATED DATASET CONTAINING LEAF FEATURES FOR TESTING

At the beginning, I mentioned something to do with hyperplanes. To help illustrate the hyperplane, we will create a fine grid of data points within the feature space to represent different combinations of leaf width and leaf length, and colour the new data points based on the predictions of our SVM.

# Create a fine grid of the feature space
leaf_width <- seq(from = min(tree_data$leaf_width), to = max(tree_data$leaf_width), length = 100) #creates a sequence of 100 leaf width values from the min to the max value


leaf_length <- seq(from = min(tree_data$leaf_length), to = max(tree_data$leaf_length), length = 100)

fine_grid_leaf <- as.data.frame(expand.grid(leaf_width, leaf_length)) #creates a data frame with every combination of the two sequences (columns Var1 and Var2)


fine_grid_leaf <- fine_grid_leaf %>%
                  dplyr::rename(leaf_width = "Var1", leaf_length = "Var2")
# Check output
view(fine_grid_leaf)

#Build SVM model using all 200 observations
svm_leaf_data_2 <- svm(x = x_leaf_data, y = tree_data$tree_type, type = "C-classification", kernel = "radial")

# For every new point in `fine_grid_leaf`, predict its tree type based on the SVM `svm_leaf_data_2`

fine_grid_leaf$tree_pred <- predict(svm_leaf_data_2, newdata = fine_grid_leaf, type = "decision")

# Check output
head(fine_grid_leaf)
##   leaf_width leaf_length tree_pred
## 1   2.060000        1.11         0
## 2   2.150303        1.11         0
## 3   2.240606        1.11         0
## 4   2.330909        1.11         0
## 5   2.421212        1.11         0
## 6   2.511515        1.11         0
table(fine_grid_leaf$tree_pred) #gives entries per prediction
## 
##    0    1    2 
## 3245 3235 3520

Now we can create a scatter plot that contains the new fine grid of points we created above, layered together with the original tree data, to see which group the different trees fall into based on the SVM svm_leaf_data_2.

# Create scatter plot  with original leaf features layered over the fine grid of data points(i.e values predicted in the feature space)

ggplot() +
geom_point(data = fine_grid_leaf, aes(x = leaf_width, y = leaf_length, colour = tree_pred), alpha = 0.25) + #plotting the predicted values
stat_contour(data = fine_grid_leaf, aes(x = leaf_width, y = leaf_length, z = as.integer(tree_pred)),
             lineend = "round", linejoin = "round", linemitre = 1, size = 0.25, colour = "black") + #plots the boundaries (hyperplane)
geom_point(data = tree_data, aes(x = leaf_width, y = leaf_length, colour = tree_type, shape = tree_type)) +
ggtitle("SVM decision boundaries for leaf length vs. leaf width") +
labs(x = "Leaf width", y = "Leaf length", colour = "Actual tree type", shape = "Actual tree type" ) +#plots actual classification value
theme(plot.title = element_text(hjust = 0.5))

From the graph, the three faintly coloured zones are the SVM’s classification predictions based on the leaf features. The decision boundaries (the hyperplanes) are represented by the thick black lines.

We can use these coloured zones and hyperplanes to observe which tree type the SVM has chosen to place our original data points into. In the graph above, our original data points are represented by both colour and shape.

Also remember that the tree type of the fine grid of data points is based on the SVM model where we used the leaf features as input.

Referring to the graph above, we observe two different classification scenarios:

  1. Our original data points are classified correctly by the SVM, as the data point falls into the zone of the same colour, e.g. a green triangle data point (type 1 tree) falls into the green zone (the SVM predicted the tree as type 1).

  2. Our original data points are misclassified by the SVM, as the data point falls into the zone of a different colour, e.g. a red circle data point ( type 0 tree) falls into the green zone (the SVM predicted the tree as type 1).

However, we cannot tell precisely how accurate our classification is just by looking at the graph. To test the level of accuracy, we need to determine the misclassification rate. To do this, we will need to run the predict function again, but this time using our original data points as input for comparison. Therefore we use the original dataset, tree_data, as our test set.

pred_leaf_data <- tree_data %>% select(leaf_width, leaf_length)

# Predict the tree type of our original data based on the SVM `svm_leaf_data`
pred_leaf_data$tree_pred <- predict(svm_leaf_data, newdata = pred_leaf_data, type = "decision")

# Check output
head(pred_leaf_data)
##   leaf_width leaf_length tree_pred
## 1       5.13        6.18         0
## 2       7.49        4.02         1
## 3       9.22        4.16         1
## 4       6.98       11.10         2
## 5       3.46        5.19         0
## 6       4.55        5.15         0
# Add tree_data$tree_type to pred_leaf_data
pred_leaf_data <- inner_join(pred_leaf_data, tree_data, by = c("leaf_width", "leaf_length")) %>%
select(-trunk_girth, -trunk_height)

# Check output
head(pred_leaf_data)
##   leaf_width leaf_length tree_pred tree_type
## 1       5.13        6.18         0         0
## 2       7.49        4.02         1         0
## 3       9.22        4.16         1         1
## 4       6.98       11.10         2         2
## 5       3.46        5.19         0         0
## 6       4.55        5.15         0         0
# Create a table of predictions to show mis-classification rate
table(pred_leaf_data$tree_pred, pred_leaf_data$tree_type)
##    
##      0  1  2
##   0 62  3  3
##   1  4 63  0
##   2  2  1 62
# Mis-classification rate: proportion of misclassified observations
mean(pred_leaf_data$tree_pred != pred_leaf_data$tree_type)
## [1] 0.065
#Accuracy will therefore be about 93.5%
mean(pred_leaf_data$tree_pred == pred_leaf_data$tree_type)
## [1] 0.935

Thus we get a misclassification rate of about 6.5%, which can actually be preferable to a misclassification rate of 0%, as the latter might indicate that the model has overfit the training data.

NOTE: To further increase the accuracy, we can use more complex kernels (and tune their parameters, as sketched below) and compare their results. The SVM model can also be created based on the trunk features, using a procedure similar to the one we’ve used in this discussion.
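As one possible follow-up (a sketch only, not run in this session), e1071’s tune.svm() can grid-search the radial kernel’s cost and gamma parameters via cross-validation; the parameter grids below are illustrative choices, not values from this session.

#Sketch: grid-search cost and gamma for the radial kernel (10-fold CV by default)
#tuned <- tune.svm(x = x_leaf_data, y = tree_data$tree_type,
#                  gamma = 10^(-2:1), cost = 10^(0:2))
#summary(tuned)               #error for each parameter combination
#best_svm <- tuned$best.model #the SVM refitted with the best parameters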