Machine Learning (ML) sometimes seems pretty difficult as it assumes that you are comfortable with some broad ranging concepts from linear algebra, calculus, statistics, and probability. It borrows techniques from all of these disciplines, mashing them together in all kinds of different ways, and using them in problems that are often concerned with some kind of function optimisation. THe purpose of this tutorial is to provide an overview of some of the ideas that you may run into in ML, in the context of exploring Support Vector Machines. Along the way we will also explore some of the really nice things you can do in R Markdown notebooks, like formatting for math, and embedding images and videos.
There are, of course, all kinds of different algorithms that are used in ML to solve problems of classification (things like Decision Trees, KNN etc). The one we will explore here is called a Support vector machines (SVM). SVMs can actually be used for both classification or regression problems, but most of the time you will find them used for classification. If, for example, you have data that has 2 or more labeled classes, you can use an SVM as a discriminative classifier, and can define a decision boundary (sometimes called an optimal hyperplane) that provides a function which seperates the classes. When any new data points are then mapped onto the same space of the others, they will be categorized into the correct class (i.e. be on the correct side of the line or plane that is the function).
An SVM is an example of what’s called a constrained optimisation problem . It optimises the space of the margin between classes, but also seeks to constrain the extent of this margin in relation to data points. A nice visualisation to present the idea provided below. You can think of the problem in terms of how you can construct a street between the points, that is as wide as possible.
To provide another another intuition, of the three lines you can see in the above visualisation, the dotted ones (sometimes called the “gutters” of the street) are at the extremes of the margin, or width of the street. We want to maximisise the margin (or space between the two gutters) so the classes (represented by green ovals and blue rectangles) as as far away as possible. The middle line is called the decision boundary (or optimal hyperplane) that is a centre line. The image below shows an alternative visualisation, and you can see the decision boundary / hyperplane in both two and three dimensional space.
One of the nice things about many ML algorithms is you can extend them into arbitrary dimensions using the power of linear algebra. So you might have 400 classes, and if so, you can create a 399 dimenssion hyperplane, or decision boardary that seperates them. You can then provide a further data point to your model, and your SVM will be put it in the right space on the 399 dimensions.
You can also do this neat thing (often called the “Kernal trick”) to deal with data that does not lend itself to these straight (i.e. linear) widest road type decision boundaries (essentially by pushing the data into higher dimensions). But for now, lets keep things really simple, and just start with a few data points in two dimensions that we could divide with a linear (i.e. straight line) SVM.
To explore this, the first thing to do is create a data frame. It will have 5 rows, with each row having two features (f1 and f2), and a target. Note that I have also put something in called the biasTerm. The bias terms is kind of complicated to explain but it gives the whole model a little more wiggle room to move as it trys to fit the best decision boundary, and doesn’t constrain the function to passing through the origin (you might like to think of it as a way to move the y-intecept of a function). The code to build the data frame and the frame itself is listed below.
To put the problem as simply as possible, the question we are really trying to solve here asks, “what are the most optimal weights I might use for f1 and f2 which will mean I will predict the target correctly?”.
Let’s start by looking at the plotted points, f1, and f2:
ggplot(a,
aes(x = f1,
y = f2,
color = factor(target))) +
geom_point()
Before we start thinking about constructing a optimal hyperplane or decision boundary, note that we could of course just provide any random decision boundary (i.e. apply no learning at all). Here we are only in two dimensions, so this is would be a one-dimensional hyperplane, or just a line that we could use. And being random, as to be expected, it is not very good. So our challenge is how to move this line so it becomes an optimal hyperplane or decision boundary for these points.
ggplot(a,
aes(x = f1,
y = f2,
color = factor(target))) +
geom_point() +
geom_segment(aes(x = -2.1,
y = -2,
xend = 6,
yend = 6))
Time to check out the underlying Math
Like many algorithms you might find in ML, the basic idea here is that we want to create weights or co-efficients for our variables, then check how wrong they are, and then iterate in some way to make it more correct. To do this, we will use a few mathematical ingredients: a loss function, an objective function, an optimisation function, and we will also provide some things to track how our function is performing given different weights of f1 and f2.
Details around each of these are provided below.
Loss function
In machine learning its critical that we can keep track of how well our function predicts outcomes. If it is a bad predictor it has a lots of loss, or a high cost to us. We want low cost and low loss. The c() function below (and think of c as standing for “cost”) is the loss function. It takes into account our input vector (f1, f2 and bias term), the output from the wieghts we have chosen, and the true output (the target). So below you can see x as the sample input vector, y as the true label or target, and f(x) the predicted label. There are many different loss functions and this one is called a “hinge” loss function. Loss functions turn up in ML algorithms as a way of measuring performance while we search for optimal weights.
Our objective function is what we need to get to the optimal hyperplane. This means that, when we optimse the function below, it will give us the best weights so we can create our hyperplane or decision boundary.
\[ min_w\lambda || w ||^2 +
\sum_{i=1}^n (1 - y_i \langle x_i, w \rangle) \]
Right about now in most explanations of SVM, people start getting confused. Why would this be a function to use, that explains the picture of the problem above? Why is not some other function. And notice too we snuck in a \(\lambda\) symbol.TO deal first the the \(\lambda\), this is a regularization parameter and serves as a way to take into account the degree of importance that is given to misclassifications. Its kind of like a tuning knob we can change during the process of creating the decision boundaries, and also helps us create a model that works on both training and test data well. This idea turns up alot in ML. It addresses the fact that the SVM is trying to both maximise the margin between both classes and minimizing the amount of misclassifications.
Now in terms of precisely why the function works, I will defer the explanation to Professor Patrick Winston SVM derivation in his amazing MIT Artificial Intelligence Course (in lecture 16) which is provided below. Winston starts from an intuitiion of the image at the top of this tutorial (the idea of the largest street) and then derives the function using some really elegant mathematical moves. Note also that Vladimir Vapnick originally came up the SVM and I encourage you to read up him as an amazing thinker in this space. And when you go through all that math, you will indeed find that this function is the one that can be used to create a optimal hyperplane or decision boundary between an arbitraty number fo classes.
Optimising the objective function through its partial derivatives
Now whenever we start talking about ideas of optimisation, we find ourselves heading into the world of calculus (the branch of math that explores rates of change). You can see that our objective function consists of two terms. The first is that regularizer thing, and the second tells us about the loss. The regularizer provides a balance between margin maximization and loss, to ensure our hyperplane is optimal (i.e.that is maximally far away from any data points).
So lets provide some derivations of these terms. It turns out this is not complicated (doing derivatives on vectors is just like doing derivatives on scalars, which is one of the really nice things about linear algebra). Here is the first one:
\[\frac{\delta}{\delta w_k} \lambda || w ||^2 = 2\lambda w_k \]
This a condition which we can think of as tracking the function’s ability to predict correctly. If our given weights for f1 and f2 are not good, it will classify wrongly, and we will get an error, and we will track this. You will see this in action in the code below.
\[ y_i \langle x_i, w \rangle \lt 1 \]
A way to update the weights
You will also see this in action in the code below. Whether we make an error or not, we will still update our weights, but will update them slightly differently depending on if there is an error.
\[ w = w + \eta (y_i x_i - 2 \lambda w ) \]
So there is quiet a bit going on here: a regularizer controls the trade off between the achieving a low training error and a low testing error that is the ability to generalize your classifier to unseen data. As a regulizing parameter we choose 1/epochs, so this parameter will decrease, as the number of epochs increases. Note that a regularizer too high will overfit (large testing error) and regularizer being too low will underfit (large training error).
Code
weights <- base::rep(0, 3)
eta = 1
epochs = 10000
errors <- c()
for (epoch in 1:epochs) {
# set current error to 0
error <- 0
# Print current weight values
# print("Current weights:")
# print(weights)
for(i in 1:nrow(a)) {
featureVector <- as.double(a[i, 1:3])
targets <- as.double(a$target)
currentError <- targets[i] * featureVector %*% weights
# print("Error with current weights:")
# print(currentError < 1)
if (targets[i] * featureVector %*% weights < 1) {
# misclassification, set error
weights = weights + eta * ( (featureVector * targets[i]) + (-2 *(1/epoch)* weights) )
error = 1
} else {
# correct prediction
weights <- weights + eta * (-2 *(1/epoch)* weights)
}
errors <- c(errors, error)
}
}
For each of our iterations, sometimes we predicted correctly, and sometimes not correctly. Then we used our derived objective function to update the weights for f1 and f2. Note how simple the actual code was, and this is largely because the math itself is doing the heavy lifting. Once we have that objective function, its just a matter of coding it up. Now Let’s plot the errors
Plot of errors
t <- data.frame(errors)
ggplot(t,
aes(y = errors,
x = seq(1:50000),
color = "errors"
)) +
geom_point()
Let’s also check out the final weights
print(paste("Optimal f1 weight: ", weights[1]))
[1] "Optimal f1 weight: 1.75353189546445"
print(paste("Optimal f2 weight: ", weights[2]))
[1] "Optimal f2 weight: 3.47498302933517"
print(paste("Optimal bias: ", weights[3]))
[1] "Optimal bias: 12.2452630147726"
So where to next?
We covered alot of ideas here but it should at least serve to give a bit of an intro into some of the main ideas that come up alot in machine learning. To get into it further here are some next steps:
Data camp has a cool SVM tutorial using the E1071 package. Just paste in the code to get a further intuition how it works. They also extend the idea into non-linear SVMs which are, not difficult to implment but give you alot more flexibility when you classes are not seperable in a linear fashion
Here is another tutorial that implements SVMs using the popular ML library in R, caret, using medical data that predicts heart disease. Again, just paste in to get some intuititions
You will soon notice that in ML, things seem like a bit of a black box. You put your features and targets in some kind of ML library function. Proceeding in this way means that you may be able to do cool stuff, but its risky as you don’t really know how it works. To get around this, get your math up to date. I recommend starting with the Khan Academy playlist on linear algebra, calculus and probability and go from there.
This tutorial started out as me just building an R version of another tutorial from Ravel Siraj who presents a similiar idea in Python (all his stuff is great so be sure to check it out). It is of course great to know how to do this in Python (and you can also use the scikitlearn lirary), and a great way to learn about ML is convert Python jupyter notebook tutorials to R Markdown
Finally, check out this amazing lecture from Vladimir Vapnick, who came up with the idea of SVMs in the early 90s. He is an amazing mathematician and thinker and this lecture covers a this thing which he calls, “SVM Plus”. The first 20 minutes or so of math is hard going, but stick with it as its a mind blowing lecture.