NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me, or reach out via LinkedIn.

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

library(tidyverse)
library(latex2exp) # latex labels
library(ggforce) # geom_circle
library(kableExtra)
library(e1071) # svm()
library(scales)
library(ISLR) # Auto dataset
library(gridExtra)

RNGkind(kind = "Mersenne-Twister", 
        normal.kind = "Inversion", 
        sample.kind = "Rejection") # diff versions of R/RMD using different sampling types was annoying me

theme_set(theme_light())




1. Sketching Hyperplanes in 2D

This problem involves hyperplanes in two dimensions.


(a) Hyperplane Sketch - 1

Q: Sketch the hyperplane \(1 + 3X_1 − X_2 = 0\). Indicate the set of points for which \(1 + 3X_1 − X_2 > 0\), as well as the set of points for which \(1 + 3X_1 − X_2 < 0\).

A:

Plotting the line:

\[1 + 3X_1 − X_2 = 0 \\ \iff X_2 = 3X_1 + 1\]

Rearranging for \(X_2\) makes it clear which side of the hyperplane is which:

  • \(1 + 3X_1 − X_2 < 0 \iff X_2 > 1 + 3X_1\)
  • \(1 + 3X_1 − X_2 > 0 \iff X_2 < 1 + 3X_1\)
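For example, the point \((0, 5)\) lies above the line (since \(5 > 1 + 3(0) = 1\)), and indeed \(1 + 3(0) - 5 = -4 < 0\), so the region above the line is where \(1 + 3X_1 - X_2 < 0\).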
X1 <- -10:10
X2 <- 3*X1 + 1

df1 <- data.frame(X1, X2, line = "a")

grid <- expand.grid(X1 = seq(min(df1$X1), max(df1$X1), length.out = 200), 
                    X2 = seq(min(df1$X2), max(df1$X2), length.out = 200)) %>%
  mutate(region = case_when(1 + 3*X1 - X2 < 0 ~ "lt 0", 
                            1 + 3*X1 - X2 > 0 ~ "gt 0"))

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = region), alpha = 0.5) + 
  geom_line(data = df1, aes(x = X1, y = X2), size = 1, col = "#7CAE00") + 
        annotate("text", x = -5, y = 10, label = TeX("$1 + 3X_1 − X_2 < 0$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "blue") + 
          annotate("text", x = 5, y = -10, label = TeX("$1 + 3X_1 − X_2 > 0$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "red") + 
  scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
  scale_x_continuous(expand = c(0.01,0.01), breaks = seq(-10, 10, 2)) +
  scale_y_continuous(expand = c(0.01,0.01), breaks = seq(-30, 30, 10)) +
  theme(legend.position = "none") + 
  labs(title = TeX(r'(Hyperplane Plot: $1 + 3X_1 − X_2 = 0$)'), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))


(b) Hyperplane Sketch - 2

Q: On the same plot, sketch the hyperplane \(−2 + X_1 + 2X_2 = 0\). Indicate the set of points for which \(−2 + X_1 + 2X_2 > 0\), as well as the set of points for which \(−2 + X_1 + 2X_2 < 0\).

A:

Plotting the line:

\[−2 + X_1 + 2X_2 = 0 \\ \iff X_2 = 1 -\frac{1}{2} X_1\]

As before:

  • \(−2 + X_1 + 2X_2 > 0 \iff X_2 > 1 -\frac{1}{2} X_1\)
  • \(−2 + X_1 + 2X_2 < 0 \iff X_2 < 1 -\frac{1}{2} X_1\)
X1 <- -10:10
X2 <- 1 - 0.5*X1

df2 <- data.frame(X1, X2, line = "b")

grid <- grid %>%
  mutate(region = case_when(-2 + X1 + 2*X2 < 0 ~ "lt 0", 
                            -2 + X1 + 2*X2 > 0 ~ "gt 0"))

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = region), alpha = 0.5) + 
  geom_line(data = df2, aes(x = X1, y = X2), size = 1, col = "#C77CFF") + 
        annotate("text", x = 2.5, y = 15, label = TeX("$-2 + X_1 + 2X_2 > 0$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "blue") + 
          annotate("text", x = -2.5, y = -15, label = TeX("$-2 + X_1 + 2X_2 < 0$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "red") + 
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  scale_x_continuous(expand = c(0.01,0.01), breaks = seq(-10, 10, 2)) +
  scale_y_continuous(expand = c(0.01,0.01), breaks = seq(-30, 30, 10)) +
  theme(legend.position = "none") + 
  labs(title = TeX(r'(Hyperplane Plot: $−2 + X_1 + 2X_2 = 0$)'), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

Since the question asked for them on the same plot:

bind_rows(df1, df2) %>%
  ggplot(aes(x = X1, y = X2, col = line)) + 
  geom_line(size = 1) + 
  scale_color_manual(values = c("#7CAE00", "#C77CFF"), 
                     labels = unname(TeX(c("$1 + 3X_1 - X_2 = 0$", "$-2 + X_1 + 2X_2 = 0$")))) + 
  scale_x_continuous(breaks = seq(-10, 10, 2)) +
  scale_y_continuous(breaks = seq(-30, 30, 10)) +
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'), 
       col = "Hyperplane:") + 
  theme(legend.position = "bottom")




2. Non-Linear Decision Boundaries

We have seen that in \(p = 2\) dimensions, a linear decision boundary takes the form \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\). We now investigate a non-linear decision boundary.


(a) Sketch - 1

Q: Sketch the curve

\[(1 + X_1)^2 + (2 - X_2)^2 = 4\]

A:

We can see with a little rearranging that the curve is a circle with center \((-1, 2)\) and radius \(2\):

\[\begin{align} & (1 + X_1)^2 + (2 - X_2)^2 = 4 \\ \implies & (X_1 - (-1))^2 + ((-1)(X_2 - 2))^2 = 4 \\ \implies & (X_1 - (-1))^2 + (X_2 - 2)^2 = 2^2 \\ \end{align}\]

ggplot() + 
  geom_circle(data = data.frame(X1 = -1, X2 = 2, r = 2), 
              aes(x0 = X1, y0 = X2, r = r), col = "mediumseagreen", size = 1) + 
  scale_x_continuous(expand = c(0.01,0.01), limits = c(-3.5, 1.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(-0.5, 4.5)) + 
  labs(title = TeX(r'(Curve Plot: $(1 + X_1)^2 + (2 - X_2)^2 = 4$)'), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()


(b) Sketch - 2

Q: On your sketch, indicate the set of points for which

\[(1 + X_1)^2 + (2 - X_2)^2 > 4\]

as well as the set of points for which

\[(1 + X_1)^2 + (2 - X_2)^2 \le 4\]

A:

grid <- expand.grid(X1 = seq(-3.5, 1.5, length.out = 200), 
                    X2 = seq(-0.5, 4.5, length.out = 200)) %>%
  mutate(region = case_when((1 + X1)^2 + (2 - X2)^2 > 4 ~ "gt 4", 
                            TRUE ~ "le 4"))

ggplot() + 
  geom_tile(data = grid, aes(x = X1, y = X2, fill = region), alpha = 0.5) + 
  geom_circle(data = data.frame(X1 = -1, X2 = 2, r = 2), 
              aes(x0 = X1, y0 = X2, r = r), col = "mediumseagreen", size = 1) + 
  annotate("text", x = -1, y = 2, label = TeX("$(1 + X_1)^2 + (2 - X_2)^2 \\leq 4$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "red") + 
  annotate("text", x = 0.5, y = 0, label = TeX("$(1 + X_1)^2 + (2 - X_2)^2 > 4$", output = "character"),
           hjust = 0.5, size = 4, parse = TRUE, col = "blue") + 
  scale_x_continuous(expand = c(0.01,0.01), limits = c(-3.5, 1.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(-0.5, 4.5)) + 
    scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  theme(legend.position = "none") + 
  labs(title = TeX(r'(Curve Plot: $(1 + X_1)^2 + (2 - X_2)^2 = 4$)'), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()


(c) Classifying Observations

Q: Suppose that a classifier assigns an observation to the blue class if

\[(1 + X_1)^2 + (2 - X_2)^2 > 4\]

and to the red class otherwise. To what class is the observation \((0, 0)\) classified? \((−1, 1)\)? \((2, 2)\)? \((3, 8)\)?

A:

It’s fairly obvious from the previous plot how the observations will be classified:

new_obs <- data.frame(X1 = c(0, -1, 2, 3), X2 = c(0, 1, 2, 8)) %>%
  mutate(region = case_when((1 + X1)^2 + (2 - X2)^2 > 4 ~ "gt 4", 
                            TRUE ~ "le 4"))


grid <- expand.grid(X1 = seq(-3.5, 3.5, length.out = 200), 
                    X2 = seq(-0.5, 8.5, length.out = 200)) %>%
  mutate(region = case_when((1 + X1)^2 + (2 - X2)^2 > 4 ~ "gt 4", 
                            TRUE ~ "le 4"))

ggplot() + 
  geom_tile(data = grid, aes(x = X1, y = X2, fill = region), alpha = 0.5) + 
  geom_circle(data = data.frame(X1 = -1, X2 = 2, r = 2), 
              aes(x0 = X1, y0 = X2, r = r), col = "mediumseagreen", size = 1) + 
  geom_point(data = new_obs, aes(x = X1, y = X2, col = region)) + 
  scale_x_continuous(expand = c(0.01,0.01), limits = c(-3.5, 3.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(-0.5, 8.5)) + 
  scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
  scale_color_manual(values = c("blue", "red")) +
  theme(legend.position = "none") + 
  labs(title = TeX(r'(Classifier Plot: $f(X) = (1 + X_1)^2 + (2 - X_2)^2 - 4$)'), 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
    coord_fixed()

In the context of support vector machines, we can think of this as a classifier of the form

\[f(X) = (1 + X_1)^2 + (2 - X_2)^2 - 4,\]

where a new observation \(X^*\) is assigned to the blue class if \(f(X^*) > 0\) and the red class if \(f(X^*) \le 0\):

  • \(f(0, 0) = 1^2 + 2^2 - 4 = 1 > 0 \implies\) blue
  • \(f(−1, 1) = (1 - 1)^2 + (2 - 1)^2 - 4 = -3 < 0 \implies\) red
  • \(f(2, 2) = (1 + 2)^2 + (2 - 2)^2 - 4 = 5 > 0 \implies\) blue
  • \(f(3, 8) = (1 + 3)^2 + (2 - 8)^2 - 4 = 48 > 0 \implies\) blue
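As a quick numerical check (a throwaway sketch; f below is just a helper for this check, not something used elsewhere):

f <- function(X1, X2) (1 + X1)^2 + (2 - X2)^2 - 4
f(c(0, -1, 2, 3), c(0, 1, 2, 8)) # 1 -3 5 48, i.e. blue, red, blue, blue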


(d) Nonlinear/Linear Decision Boundary

Q: Argue that while the decision boundary in (c) is not linear in terms of \(X_1\) and \(X_2\), it is linear in terms of \(X_1, X_1^2, X_2\) and \(X_2^2\)

A:

The decision boundary is simply the boundary described by \(f(X) = 0\) (as this is the boundary where the prediction changes):

\[\begin{align} f(X) & = (1 + X_1)^2 + (2 - X_2)^2 - 4 \\ & = X_1^2 + 2X_1 + 1 + X_2^2 - 4X_2 + 4 - 4 \\ & = X_1^2 + X_2^2 + 2X_1 - 4X_2 + 1 \\ & = (1)(X_1^2) + (1)(X_2^2) + (2)(X_1) + (-4)(X_2) + 1, \end{align}\]

so the decision boundary \(f(X) = 0\) is given by \(X_1^2 + X_2^2 + 2X_1 - 4X_2 + 1 = 0\).

From the final form you can see how \(f(X) = 0\) is expressed in the form \(a_1 Z_1 + a_2 Z_2 + a_3 Z_3 + a_4 Z_4 + b = 0\) for \((Z_1, Z_2, Z_3, Z_4) = (X_1^2, X_2^2, X_1, X_2)\), and is therefore linear in terms of \(X_1, X_1^2, X_2\) and \(X_2^2\).




3. Maximal Margin Classifier

Here we explore the maximal margin classifier on a toy data set.


(a) Plot Observations

Q: We are given \(n = 7\) observations in \(p = 2\) dimensions. For each observation, there is an associated class label.

df <- data.frame(obs = 1:7, 
                 X1 = c(3, 2, 4, 1, 2, 4, 4), 
                 X2 = c(4, 2, 4, 4, 1, 3, 1), 
                 Y = c("Red", "Red", "Red", "Red", "Blue", "Blue", "Blue"))

df %>%
  kable(row.names = F, col.names = c("Obs.", "$X_1$", "$X_2$", "$Y$")) %>%
  kable_styling("bordered", full_width = F)
Obs. \(X_1\) \(X_2\) \(Y\)
1 3 4 Red
2 2 2 Red
3 4 4 Red
4 1 4 Red
5 2 1 Blue
6 4 3 Blue
7 4 1 Blue

Sketch the observations.

A:

ggplot() + 
  geom_point(data = df, aes(x = X1, y = X2, col = Y), size = 2) + 
  scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()


(b) Optimal Separating Hyperplane

Q: Sketch the optimal separating hyperplane, and provide the equation for this hyperplane (of the form (9.1)).

A:

Visual inspection shows that the line separating the blue and red points with the largest margin will be given by the line

\[X_2 - X_1 + 0.5 = 0\]

Simple geometry shows that the margin is \(M = \frac{\sqrt{2}}{4}\), the minimum perpendicular distance from a training observation to the hyperplane.
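This is just the point-to-line distance formula applied to the closest training observations, e.g. observation 5 at \((2, 1)\):

\[M = \frac{|(-1)(2) + (1)(1) + 0.5|}{\sqrt{(-1)^2 + 1^2}} = \frac{0.5}{\sqrt{2}} = \frac{\sqrt{2}}{4}\]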

## 'simple geometry', taking (x_0, y_0) = (2, 1):
# a = -1
# b = 1
# c = 0.5
# x_0 = 2
# y_0 = 1
# abs((-1)*(2) + (1)*(1) + 0.5) / sqrt((-1)^2 + (1)^2)

perp_lines <- data.frame(X1 = c(2, 1.75, 2, 2.25, 4, 3.75, 4, 4.25), 
                         X2 = c(1, 1.25, 2, 1.75, 3, 3.25, 4, 3.75), 
                         group = c("a", "a", "b", "b", "c", "c", "d", "d"))

ggplot() + 
  geom_line(data = perp_lines, aes(x = X1, y = X2, group = group), linetype = "dotted", col = "grey40") +
  geom_point(data = df, aes(x = X1, y = X2, col = Y), size = 2) + 
  geom_abline(slope = 1, intercept = -0.5) +
  scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()


(c) Maximal Margin Classifier

Q: Describe the classification rule for the maximal margin classifier. It should be something along the lines of "Classify to Red if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0\), and classify to Blue otherwise." Provide the values for \(\beta_0\), \(\beta_1\) and \(\beta_2\).

A:

The optimal separating hyperplane from the previous question has the equation \(X_2 = X_1 - 0.5\), and visual inspection makes it clear that the prediction for the maximal margin classifier will be Red when \(X_2 > X_1 - 0.5 \iff X_2 - X_1 + 0.5 > 0\) and Blue otherwise.

The maximal margin classifier will therefore take the form

\[f(X) = X_2 - X_1 + 0.5\]

where a new observation \(X^*\) is assigned to the Red class if \(f(X^*) > 0\) and the Blue class if \(f(X^*) \le 0\).

We therefore have \((\beta_0, \beta_1, \beta_2) = (0.5, -1, 1)\).
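We can sanity-check these values with e1071 (a rough sketch, not part of the exercise: a support vector classifier with a very large cost approximates the maximal margin classifier, and the weight vector can be recovered from the fitted object; coef_from_svm is just a throwaway helper):

# hard(ish)-margin SVM as an approximation to the maximal margin classifier
fit_mmc <- svm(factor(Y) ~ X1 + X2, data = df, kernel = "linear", cost = 1e5, scale = FALSE)

# recover (beta0, beta1, beta2); up to a positive scaling (and possibly a sign
# flip from the internal class coding) this should be proportional to (0.5, -1, 1)
coef_from_svm <- function(fit) {
  w <- t(fit$coefs) %*% fit$SV
  c(beta0 = -fit$rho, beta1 = w[1, "X1"], beta2 = w[1, "X2"])
}
coef_from_svm(fit_mmc)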

grid <- expand.grid(X1 = seq(0.5, 4.5,length.out = 200), X2 = seq(0.5, 4.5,length.out = 200)) %>%
  mutate(region = case_when(X2 - X1 + 0.5 < 0 ~ "lt 0", 
                            X2 - X1 + 0.5 > 0 ~ "gt 0"))

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = region), alpha = 0.5) + 
  geom_point(data = df, aes(x = X1, y = X2, col = Y), size = 2) + 
  geom_abline(slope = 1, intercept = -0.5) +
  scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_color_manual(values = c("blue", "red")) + 
  scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()


(d) Margin

Q: On your sketch, indicate the margin for the maximal margin hyperplane.

A:

The maximal margin hyperplane is the solid line, and the margin is the distance from the solid line to either of the dashed lines:

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = region), alpha = 0.5) + 
  geom_abline(slope = 1, intercept = -0.5) +
  geom_abline(slope = 1, intercept = -1, linetype = 'dashed', col = 'grey40') +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed', col = 'grey40') +
  geom_point(data = df, aes(x = X1, y = X2, col = Y), size = 2) + 
  scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_color_manual(values = c("blue", "red")) + 
  scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()

This is the maximal margin hyperplane because it’s the hyperplane for which the margin (minimum perpendicular distance between the hyperplane and the training observations) is maximized.

The book provides another useful way of phrasing this:

In a sense, the maximal margin hyperplane represents the mid-line of the widest “slab” that we can insert between the two classes.


(e) Support Vectors

Q: Indicate the support vectors for the maximal margin classifier.

A:

I label each point with its observation number and add an indicator for whether it is a support vector:

df$support <- c("No", "Yes", "Yes", "No", "Yes", "Yes", "No")

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = region), alpha = 0.5) + 
  geom_abline(slope = 1, intercept = -0.5) +
  geom_abline(slope = 1, intercept = -1, linetype = 'dashed', col = 'grey40') +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed', col = 'grey40') +
  geom_point(data = df, aes(x = X1, y = X2, col = Y, shape = support), size = 2) + 
  geom_text(data = df, aes(label = obs, vjust = -1), col = 'grey40') +
  scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) + 
  scale_color_manual(values = c("red", "blue"), limits = c("Red", "Blue")) + 
  scale_fill_manual(values = c("#F8766D", "#00BFC4"), labels = c("Red", "Blue")) +
  labs(fill = "Predicted:", 
       col = "Actual:", 
       shape = "Support Vector:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed() + 
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(order = 1),
         col = guide_legend(order = 2), 
         shape = guide_legend(order = 3))


(f) Moving an Observation

Q: Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.

A:

We can see above that observation 7 is not a support vector. Any small movement of this observation in any direction is not going to change the fact that \(X_2 - X_1 + 0.5 = 0\) provides the best separation of observations with the largest margin of \(M = \frac{\sqrt{2}}{4}\) (it will still be the mid-line of the ‘widest slab’ we can fit between the classes).

In this case, observation 7 would have to move inside the margin to start influencing the position of the maximal margin hyperplane.
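We can demonstrate this with a quick sketch, reusing the coef_from_svm() helper from (c): nudge observation 7 (a non-support vector) a little and refit, and the fitted hyperplane is essentially unchanged.

df_moved <- df
df_moved[df_moved$obs == 7, c("X1", "X2")] <- c(4.2, 0.8) # small move, still well outside the margin

fit_moved <- svm(factor(Y) ~ X1 + X2, data = df_moved, kernel = "linear", cost = 1e5, scale = FALSE)
coef_from_svm(fit_moved) # still ~ proportional to (0.5, -1, 1)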


(g) Non-Optimal Hyperplane

Q: Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.

A:

Here I produce a hyperplane that is still a separating hyperplane, but isn’t the optimal separating hyperplane (the maximal margin hyperplane used in the maximal margin classifier).

The hyperplane is given by the line

\[X_2 - 0.75 \cdot X_1 - 0.3 = 0\]

You can see below that the margin \(M\) is considerably smaller here.
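The closest training observation to this hyperplane is observation 2 at \((2, 2)\), giving

\[M = \frac{|2 - 0.75 \cdot 2 - 0.3|}{\sqrt{1 + 0.75^2}} = \frac{0.2}{1.25} = 0.16 < \frac{\sqrt{2}}{4} \approx 0.354\]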

grid <- expand.grid(X1 = seq(0.5, 4.5,length.out = 200), X2 = seq(0.5, 4.5,length.out = 200)) %>%
  mutate(region = case_when(X2 - 0.75*X1 - 0.3 < 0 ~ "lt 0", 
                            X2 - 0.75*X1 - 0.3 > 0 ~ "gt 0"))

df$support <- ifelse(df$obs == 2, "Yes", "No")

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = region), alpha = 0.5) + 
  geom_abline(slope = 0.75, intercept = 0.3) +
  geom_abline(slope = 0.75, intercept = 0.1, linetype = 'dashed', col = 'grey40') +
  geom_abline(slope = 0.75, intercept = 0.5, linetype = 'dashed', col = 'grey40') +
  geom_point(data = df, aes(x = X1, y = X2, col = Y, shape = support), size = 2) + 
  scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  geom_text(data = df, aes(label = obs, vjust = -1), col = 'grey40') +
  scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) + 
  scale_color_manual(values = c("red", "blue"), limits = c("Red", "Blue")) + 
  scale_fill_manual(values = c("#F8766D", "#00BFC4"), labels = c("Red", "Blue")) +
  labs(fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed() + 
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(order = 1),
         col = guide_legend(order = 2), 
         shape = guide_legend(order = 3))


(h) Non-Separable Data

Q: Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.

A:

See below, where I added observation 8 at \((X_1, X_2) = (2, 3)\). This Blue point sits inside the convex hull of the Red points, so the two classes can no longer be separated by a hyperplane in two dimensions.

df %>%
  select(obs, X1, X2, Y) %>%
  bind_rows(data.frame(obs = 8, X1 = 2, X2 = 3, Y = "Blue")) %>%
  ggplot(aes(x = X1, y = X2, col = Y)) + 
  geom_point(size = 2) + 
  geom_text(aes(label = obs, vjust = -1), col = 'grey40') +
  scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)')) + 
  coord_fixed()




4. APPLIED: Generated Data (Linear/Polynomial/Radial Kernel SVMs)

Q: Generate a simulated two-class data set with 100 observations and two features in which there is a visible but non-linear separation between the two classes. Show that in this setting, a support vector machine with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector classifier on the training data. Which technique performs best on the test data? Make plots and report training and test error rates in order to back up your assertions.

A:

I generate the data, perform a 70/30 train/test split and visualize train below.

set.seed(5)

X1 <- runif(100, min = -6, max = 6)
X2 <- 0.1*X1 + sin(X1) + rnorm(100, sd = 0.15)
Y <- sample(x = c(-1, 1), size = 100, replace = T)
X2[Y == 1] <- X2[Y == 1] + 1

df <- data.frame(Y = factor(Y), X1, X2)

train_index <- sample(1:100, 70)

train <- df[train_index, ]
test <- df[-train_index, ]

ggplot(train, aes(x = X1, y = X2, col = Y)) + 
  geom_point() + 
  scale_color_manual(values = c("red", "blue")) + 
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))


First I fit an SVM with a linear kernel, selecting the best cost parameter via 10-fold cross-validation:

set.seed(1)
svm_linear_tune <- tune(svm, Y ~ ., data = train, kernel = "linear", scale = TRUE, ranges = list(cost = c(10^c(-3:3))))
svm_linear <- svm_linear_tune$best.model

svm_linear_tune$performances %>%
  rename(CV_error = error) %>%
  mutate(min_CV_error = as.numeric(cost == svm_linear_tune$best.parameters$cost)) %>%
  ggplot(aes(x = cost, y = CV_error)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_error))) + 
  scale_x_continuous(trans = 'log10', breaks = 10^c(-3:3), minor_breaks = NULL, labels = paste0("10^", c(-3:3))) + 
  scale_y_continuous(limits = c(0, 0.65), labels = percent_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none", 
        axis.text.x = ggtext::element_markdown()) + 
  labs(title = "Generated Data - SVM (Linear Kernel)", 
       subtitle = "Selecting cost parameter using cross-validation",
       x = "Cost", 
       y = "CV Error")

Visualizing the prediction regions and actual classes on train:

grid <- expand.grid(X1 = seq(min(df$X1), max(df$X1), length.out = 200), X2 = seq(min(df$X2), max(df$X2), length.out = 200))
grid$class_pred <- predict(svm_linear, grid)

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = train, aes(col = Y)) +
  scale_color_manual(values = c("red", "blue")) + 
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))


I next fit an SVM with a polynomial kernel, using cross-validation to select the degree parameter:

set.seed(2)
svm_poly_tune <- tune(svm, Y ~ ., data = train, kernel = "polynomial", scale = TRUE, ranges = list(degree = 2:7))
svm_poly <- svm_poly_tune$best.model


svm_poly_tune$performances %>%
  rename(CV_error = error) %>%
  mutate(min_CV_error = as.numeric(degree == svm_poly_tune$best.parameters$degree)) %>%
  ggplot(aes(x = degree, y = CV_error)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_error))) + 
  scale_x_continuous(breaks = 2:7, minor_breaks = NULL, labels = 2:7) + 
  scale_y_continuous(limits = c(0, 0.65), labels = percent_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none") + 
  labs(title = "Generated Data - SVM (Polynomial Kernel)", 
       subtitle = "Selecting the degree using cross-validation",
       x = "Degree", 
       y = "CV Error")

Visualizing on train:

grid <- expand.grid(X1 = seq(min(df$X1), max(df$X1), length.out = 200), X2 = seq(min(df$X2), max(df$X2), length.out = 200))
grid$class_pred <- predict(svm_poly, grid)

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = train, aes(col = Y)) +
  scale_color_manual(values = c('red', 'blue')) + 
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))


Finally I fit an SVM with a radial kernel, using cross-validation to select two parameters (cost and gamma):

set.seed(2)
svm_radial_tune <- tune(svm, Y ~ ., data = train, kernel = "radial", scale = TRUE, ranges = list(cost = c(10^c(-3:5)), gamma = 10^c(-2:2)))
svm_radial <- svm_radial_tune$best.model

svm_radial_tune$performances %>%
  rename(CV_error = error) %>%
  mutate(min_CV_error = case_when(cost == svm_radial_tune$best.parameters$cost & gamma == svm_radial_tune$best.parameters$gamma ~ 1, T ~ 0)) %>%
  ggplot(aes(x = cost, y = CV_error, col = factor(gamma))) + 
  geom_line() + 
  geom_point(aes(shape = factor(min_CV_error)), show.legend = F, size = 3) + 
  scale_shape_manual(values = c(20, 19)) +
  scale_x_continuous(trans = 'log10', breaks = 10^c(-3:6), minor_breaks = NULL, labels = paste0("10^", (-3:6))) + 
  scale_y_continuous(labels = percent_format()) +
  theme(legend.position = "bottom", 
        axis.text.x = ggtext::element_markdown()) +
  labs(title = "Generated Data - SVM (Radial Kernel)", 
       subtitle = "Selecting cost & gamma parameters using cross-validation",
       x = "Cost", 
       y = "CV Error", 
       col = "Gamma")

Visualizing on train:

grid <- expand.grid(X1 = seq(min(df$X1), max(df$X1), length.out = 200), X2 = seq(min(df$X2), max(df$X2), length.out = 200))
grid$class_pred <- predict(svm_radial, grid)

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = train, aes(col = Y)) +
  scale_color_manual(values = c('red', 'blue')) + 
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

This looks much better. If I were tasked with drawing the decision boundary by hand, it would probably look very similar to the one produced by the radial kernel SVM above.


Here’s a summary of the results:

data.frame(kernel = c("Linear", "Polynomial", "Radial"), 
           CV_error = c(min(svm_linear_tune$performances$error), min(svm_poly_tune$performances$error), min(svm_radial_tune$performances$error)), 
           test_error = c(mean(predict(svm_linear, test) != test$Y), mean(predict(svm_poly, test) != test$Y), mean(predict(svm_radial, test) != test$Y)))
##       kernel   CV_error test_error
## 1     Linear 0.18571429  0.2000000
## 2 Polynomial 0.14285714  0.2333333
## 3     Radial 0.01428571  0.0000000


The CV and test errors aren’t perfectly correlated, but this is to be expected when we are dealing with a total dataset size of 100 observations. The radial SVM unambiguously performed best, with the lowest CV error and a test error of 0.

Here I produce the same plot for the radial SVM, showing the model's decision boundary and actual response values (but this time for the test dataset), where we can see that the test observations were indeed all correctly classified:

ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = test, aes(col = Y)) +
  scale_color_manual(values = c('red', 'blue')) + 
  labs(fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))




5. APPLIED: Generated Data (Logistic Regression vs SVMs)

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.


(a) Generate Data

Q: Generate a data set with \(n = 500\) and \(p = 2\), such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:

x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1*(x1^2 - x2^2 > 0)

A:

I’ll stick with the example data provided by the question here.

set.seed(1)

x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1*(x1^2 - x2^2 > 0)

df <- data.frame(x1, x2, y = factor(y))


(b) Plot Observations

Q: Plot the observations, colored according to their class labels. Your plot should display \(X_1\) on the \(x\)-axis, and \(X_2\) on the \(y\)-axis.

A:

The relationship is quite interesting and we can already foresee the issues that a linear classifier might have in separating the classes:

ggplot(df, aes(x = x1, y = x2, col = y)) + 
  geom_point() + 
  scale_x_continuous(limits = c(-0.6, 0.6)) +
  scale_y_continuous(limits = c(-0.6, 0.6)) +
  scale_color_manual(values = c("red", "blue")) + 
  theme(legend.position = "none") + 
  labs(x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))


(c) Logistic Regression - Linear

Q: Fit a logistic regression model to the data, using \(X_1\) and \(X_2\) as predictors.

A:

I save the model as model_logistic_linear. The model summary can be seen below.

model_logistic_linear <- glm(y ~ x1 + x2, family = "binomial")
df$pred_logistic_linear <- ifelse(predict(model_logistic_linear, type = "response") > 0.5, 1, 0)
summary(model_logistic_linear)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial")
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.179  -1.139  -1.112   1.206   1.257  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.087260   0.089579  -0.974    0.330
## x1           0.196199   0.316864   0.619    0.536
## x2          -0.002854   0.305712  -0.009    0.993
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 692.18  on 499  degrees of freedom
## Residual deviance: 691.79  on 497  degrees of freedom
## AIC: 697.79
## 
## Number of Fisher Scoring iterations: 3

Neither \(X_1\) nor \(X_2\) is significant in this model.


(d) Logistic Regression - Linear - Results

Q: Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.

A:

Fitting a logistic regression model with just \(X_1\) and \(X_2\) as predictors results in very few positive class predictions:

table(df$pred_logistic_linear)
## 
##   0   1 
## 470  30

I produce a slightly different plot to the one the question asks for; instead of just colour-coding according to the predicted class labels, this graph shows the model's decision boundary along with the actual class labels (as produced previously). I think this shows the linear decision boundary more clearly:

grid <- expand.grid(x1 = seq(-0.6, 0.6, length.out = 200), x2 = seq(-0.6, 0.6, length.out = 200))
grid$class_pred <- ifelse(predict(model_logistic_linear, newdata = grid, type = "response") > 0.5, 1, 0)

ggplot(grid, aes(x = x1, y = x2)) + 
  geom_tile(aes(fill = factor(class_pred)), alpha = 0.5) +
  geom_point(data = df, aes(col = y)) +
  scale_color_manual(values = c('red', 'blue')) + 
  labs(fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

It’s obvious that a linear decision boundary is not appropriate here, but as far as linear decision boundaries go this one actually makes a lot of logical sense; most straight lines through the center of the observations will hover around 50% training accuracy, but this model correctly identifies that observations with a high \(X_1\) value will probably be of class \(1\):

sum(df$y == df$pred_logistic_linear) / nrow(df)
## [1] 0.57
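The implied decision boundary can also be backed out from the fitted coefficients (a quick sketch, using b as a throwaway name for the coefficient vector): it is almost a vertical line, since the \(X_2\) coefficient is tiny, which is why the class \(1\) predictions concentrate at high \(X_1\) values.

# solve beta0 + beta1*x1 + beta2*x2 = 0 for x1, ignoring the negligible x2 term
b <- coef(model_logistic_linear)
unname(-b["(Intercept)"] / b["x1"]) # X1 threshold of roughly 0.44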


(e) Logistic Regression - Non-Linear

Q: Now fit a logistic regression model to the data using non-linear functions of \(X_1\) and \(X_2\) as predictors (e.g. \(X_1^2\), \(X_1 \times X_2\), \(\log(X_2)\), and so forth).

A:

I fit the same logistic regression model with the addition of two quadratic terms for \(X_1\) and \(X_2\) (\(X_1^2\) and \(X_2^2\)), saving the model as model_logistic_nonlinear. The model summary is again provided below.

model_logistic_nonlinear <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2), family = "binomial")
df$pred_logistic_nonlinear <- ifelse(predict(model_logistic_nonlinear, type = "response") > 0.5, 1, 0)
summary(model_logistic_nonlinear)
## 
## Call:
## glm(formula = y ~ x1 + x2 + I(x1^2) + I(x2^2), family = "binomial")
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -1.079e-03  -2.000e-08  -2.000e-08   2.000e-08   1.297e-03  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -10.530    526.853  -0.020    0.984
## x1             115.895   6067.885   0.019    0.985
## x2              -1.604   4002.215   0.000    1.000
## I(x1^2)      18538.679 528515.760   0.035    0.972
## I(x2^2)     -18235.099 520182.819  -0.035    0.972
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6.9218e+02  on 499  degrees of freedom
## Residual deviance: 4.2881e-06  on 495  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25


(f) Logistic Regression - Non-Linear - Results

Q: Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

A:

We have obtained an obviously non-linear decision boundary as required by the question:

grid <- expand.grid(x1 = seq(-0.6, 0.6, length.out = 200), x2 = seq(-0.6, 0.6, length.out = 200))
grid$class_pred <- ifelse(predict(model_logistic_nonlinear, newdata = grid, type = "response") > 0.5, 1, 0)

ggplot(grid, aes(x = x1, y = x2)) + 
  geom_tile(aes(fill = factor(class_pred)), alpha = 0.5) +
  geom_point(data = df, aes(col = y)) +
  scale_color_manual(values = c('red', 'blue')) + 
  labs(fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

Unsurprisingly these quadratic transformations of \(X_1\) and \(X_2\) allow us to perfectly separate the two classes using logistic regression. I say this is unsurprising because we generated the data and have access to the equation that generated the class labels (it is quadratic in terms of \(X_1\) and \(X_2\)).

Verifying this perfect separation by checking the accuracy:

sum(df$y == df$pred_logistic_nonlinear) / nrow(df)
## [1] 1

Did you notice the huge standard errors and non-significant p-values in the model summary, despite the fact that this logistic model separates the classes perfectly? This is a symptom of complete separation: the maximum likelihood estimates diverge, so the coefficients and standard errors blow up and the Wald tests become uninformative (note the 25 Fisher Scoring iterations). The resulting non-significant Wald p-values are closely related to the Hauck-Donner effect.
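A quick way to see the separation directly (a small sketch): the fitted probabilities have collapsed to be numerically 0 or 1.

# with complete separation the fitted probabilities collapse towards 0 and 1
range(predict(model_logistic_nonlinear, type = "response"))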


(g) Support Vector Classifier (linear kernel)

Q: Fit a support vector classifier to the data with \(X_1\) and \(X_2\) as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

A:

In this case (with a linear kernel), the cost parameter defines the cost of violating the margin:

  • A small value of cost means the margins are wide and many support vectors can be on the margin or violate it
  • A large cost results in a narrow margin with few support vectors on the margin or violating it

In this particular case the support vector classifier seems to be the same regardless of the cost parameter (the result is a naive classifier that predicts 0 for all observations).

svm_linear <- svm(y ~ x1 + x2, data = df, kernel = "linear", scale = FALSE, cost = 0.001)
df$pred_svm_linear <- predict(svm_linear, df)

grid <- expand.grid(x1 = seq(-0.6, 0.6, length.out = 200), x2 = seq(-0.6, 0.6, length.out = 200))
grid$class_pred <- predict(svm_linear, grid)

ggplot(grid, aes(x = x1, y = x2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = df, aes(col = y)) +
  scale_color_manual(values = c("red", "blue")) + 
  labs(fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

There were some exceptions to this (e.g. setting cost = 10^7 results in a warning message and a worse classifier for reasons I don’t understand). Note that I didn’t scale the data in this example since it doesn’t seem necessary (\(X_1\) and \(X_2\) are sampled from a uniform distribution over the same range).
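A quick way to check this insensitivity to cost (a sketch comparing the predicted class counts over a few cost values; as noted above, extreme values like cost = 10^7 are the exception):

sapply(c(0.001, 1, 1000), function(C) {
  fit <- svm(y ~ x1 + x2, data = df, kernel = "linear", scale = FALSE, cost = C)
  table(predict(fit, df))
})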

The prediction accuracy is obviously the same as the proportion of observations of class 0 in the dataset:

sum(df$y == df$pred_svm_linear) / nrow(df)
## [1] 0.522


(h) Support Vector Machine (non-linear kernel)

Q: Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

A:

Finally I fit an SVM with a radial kernel, using cross-validation to select cost and gamma:

set.seed(2)
svm_radial_tune <- tune(svm, y ~ x1 + x2, data = df, kernel = "radial", scale = FALSE, ranges = list(cost = c(10^c(-3:8)), gamma = 10^c(-2:2)))
svm_radial <- svm_radial_tune$best.model
df$pred_svm_nonlinear <- predict(svm_radial, df)

svm_radial_tune$performances %>%
  rename(CV_error = error) %>%
  mutate(min_CV_error = case_when(cost == svm_radial_tune$best.parameters$cost & gamma == svm_radial_tune$best.parameters$gamma ~ 1, T ~ 0)) %>%
  ggplot(aes(x = cost, y = CV_error, col = factor(gamma))) + 
  geom_line() + 
  geom_point(aes(shape = factor(min_CV_error)), show.legend = F, size = 3) + 
  scale_shape_manual(values = c(20, 19)) +
  scale_x_continuous(trans = 'log10', breaks = 10^c(-3:8), minor_breaks = NULL, labels = paste0("10^", c(-3:8))) + 
  scale_y_continuous(trans = 'log10', labels = percent_format()) +
  theme(legend.position = "bottom", 
        axis.text.x = ggtext::element_markdown()) +
  labs(title = "Generated Data - SVM (Radial Kernel)", 
       subtitle = "Selecting cost & gamma parameters using cross-validation",
       x = "Cost", 
       y = "CV Error", 
       col = "Gamma:")

Like with non-linear logistic regression, we have achieved perfect separation of the classes:

grid <- expand.grid(x1 = seq(-0.6, 0.6, length.out = 200), x2 = seq(-0.6, 0.6, length.out = 200))
grid$class_pred <- predict(svm_radial, grid)

ggplot(grid, aes(x = x1, y = x2)) + 
  geom_tile(aes(fill = factor(class_pred)), alpha = 0.5) +
  geom_point(data = df, aes(col = factor(y))) +
  scale_color_manual(values = c("red", "blue")) + 
  labs(fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

Verifying this by checking the accuracy:

sum(df$y == df$pred_svm_nonlinear) / nrow(df)
## [1] 1


(i) Comments

Q: Comment on your results.

A:

The thing to appreciate here is the ability of an SVM with a radial kernel to automatically model the non-linear relationship without requiring anything more than selection of the cost and gamma parameters through cross-validation. Logistic regression can clearly achieve the same performance here, but only when the model is well-specified; a model without the quadratic transformations clearly performed much worse, and a logistic model that was more complex but with the ‘wrong’ transformations (e.g. adding an interaction term) performed equally poorly.

This is illustrated on generated data, but anecdotally I have found similar things in work projects and when working with real data. Often a well-specified linear or logistic regression model, with appropriate interaction terms and transformations of the response/predictors, can perform competitively with more flexible methods like SVMs, smoothing splines and tree ensembles. However, these flexible methods tend to achieve that performance with less manual effort (aside from tuning hyper-parameters), at the cost of interpretability.




6. APPLIED: Generated Data (Support Vector Classifiers)

At the end of Section 9.6.1, it is claimed that in the case of data that is just barely linearly separable, a support vector classifier with a small value of cost that misclassifies a couple of training observations may perform better on test data than one with a huge value of cost that does not misclassify any training observations. You will now investigate this claim.


(a) Generate Data

Q: Generate two-class data with \(p\) = 2 in such a way that the classes are just barely linearly separable.

A:

Here’s the initial training dataset I generated, with the true decision boundary given by the line \(X_2 = 2X_1 + 0.3\)

set.seed(1)
X1 <- c(rnorm(50, mean = -1, sd = 0.5), rnorm(50, mean = 1, sd = 0.5))
X2 <- runif(100)
Y <- ifelse(X2 > 2*X1 + 0.3, -1, 1)
train <- data.frame(X1, X2, Y = factor(Y))

ggplot(train) + 
  geom_point(aes(x = X1, y = X2, col = factor(Y))) + 
  geom_abline(aes(intercept = 0.3, slope = 2, col = "True Decision Boundary"), size = 1) + 
  scale_color_manual(values = c('red', 'blue', 'black'),
                     guide = guide_legend(override.aes = list(pch = c(16, 16, NA), linetype = c(0, 0, 1)))) + 
  theme(legend.position = "bottom") + 
  labs(title = "Training Data", 
       col = "Y:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

I introduce a single outlier to this training dataset, ensuring that the data is still (just barely) linearly separable:

outlier <- data.frame(X1 = -0.15, X2 = 0.75, Y = factor(1, levels = c("-1", "1")))
train[100, ] <- outlier


ggplot(train) + 
  geom_point(aes(x = X1, y = X2, col = factor(Y))) + 
  geom_abline(aes(intercept = 0.3, slope = 2, col = "True Decision Boundary"), size = 1) + 
  scale_color_manual(values = c('red', 'blue', 'black'),
                     guide = guide_legend(override.aes = list(pch = c(16, 16, NA), linetype = c(0, 0, 1)))) + 
  geom_abline(intercept = -8.3, slope = -50, linetype = 'dashed', col = 'grey40') + 
  theme(legend.position = "bottom") + 
  labs(title = "Training Data", 
       col = "Y:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

The theory is that a classifier with a high value of cost will not want to misclassify this observation, resulting in a support vector classifier with a decision boundary close to the dashed line. A smaller value of cost, which accepts some training misclassifications, might produce a lower test error.


(b) Varying cost - Training & CV Errors

Q: Compute the cross-validation error rates for support vector classifiers with a range of cost values. How many training errors are misclassified for each value of cost considered, and how does this relate to the cross-validation errors obtained?

A:

With this small dataset there was a huge amount of variability in the selected cost parameter when using cross-validation. I would usually use repeated cross-validation in caret, but the SVM implementations there seem to be built around the kernlab package, which uses slightly different tuning parameters for its methods.

I implement my own repeated cross-validation to select cost below (sticking to the e1071 package):

# 3 repeats of 10-fold cross-validation:
power_range <- seq(-2.6, 3, 0.2)
cost_range <- 10^power_range
number <- 10
repeats <- 3

cv_matrix <- matrix(nrow = length(cost_range), ncol = repeats)

set.seed(1)

for (i in 1:repeats) {
  svm_linear_tune <- tune(svm, Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
  cv_matrix[ ,i] <- svm_linear_tune$performances$error
}


data.frame(cost = cost_range, CV_error = rowMeans(cv_matrix)) %>%
  mutate(min_CV_error = as.numeric(CV_error == min(CV_error))) %>%
  ggplot(aes(x = cost, y = CV_error)) + 
  geom_line(col = "deepskyblue3") + 
  geom_point(size = 2, aes(col = factor(min_CV_error))) +
  scale_x_continuous(trans = 'log10', breaks = cost_range[seq(1, length(cost_range), 2)], labels = paste0("10^", round(power_range[seq(1, length(power_range), 2)], 1))) + 
  scale_y_continuous(labels = percent_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  coord_cartesian(ylim = c(0, 0.1)) +
  theme(legend.position = "none", 
        axis.text.x = ggtext::element_markdown()) + 
  labs(title = "Generated Data - SVM (Linear Kernel)", 
       subtitle = "Selecting cost parameter using cross-validation",
       x = "Cost", 
       y = "CV Error")

Below I calculate the training error rate for this same sequence of cost parameters, overlaying this onto the graph above:

train_error <- c()

for (i in 1:length(cost_range)) {
  svm_linear <- svm(Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, cost = cost_range[i])
  train_error[i] <- sum(predict(svm_linear, train) != train$Y) / nrow(train)
}


data.frame(cost = cost_range, CV_error = rowMeans(cv_matrix), train_error) %>%
  pivot_longer(cols = c("CV_error", "train_error")) %>%
  mutate(name = factor(name, ordered = T, levels = c("train_error", "CV_error"))) %>%
  ggplot(aes(x = cost, y = value, group = name, col = name)) + 
  geom_line() + 
  geom_point(size = 2) + 
  scale_x_continuous(trans = 'log10', breaks = cost_range[seq(1, length(cost_range), 2)], labels = paste0("10^", round(power_range[seq(1, length(power_range), 2)], 1))) + 
  scale_y_continuous(labels = percent_format()) +
  scale_color_manual(values = c("mediumseagreen", "deepskyblue3"), labels = c("Training", "Cross-Validation")) + 
  coord_cartesian(ylim = c(0, 0.1)) +
  theme(legend.position = "bottom", 
        axis.text.x = ggtext::element_markdown()) + 
  labs(title = "Generated Data - SVM (Linear Kernel)", 
       subtitle = "Training & cross-validation errors",
       x = "Cost", 
       y = "Error", 
       col = "Error:")

We can observe that higher values of cost decrease the training error, with the training error eventually reaching zero. This makes sense: a large cost results in a narrow margin with few support vectors on the margin or violating it, so eventually (for linearly separable data such as this) the cost of a violation will be sufficiently high that all training observations are correctly classified.

It is also worth noting that the cost parameters associated with the lowest (zero) training errors don’t have the lowest cross-validation error.


(c) Varying cost - Test Error

Q: Generate an appropriate test data set, and compute the test errors corresponding to each of the values of cost considered. Which value of cost leads to the fewest test errors, and how does this compare to the values of cost that yield the fewest training errors and the fewest cross-validation errors?

A:

I generate a test dataset with the same number of rows as train (100), using the same principles and decision boundary:

set.seed(2)
X1 <- c(rnorm(50, mean = -1, sd = 0.5), rnorm(50, mean = 1, sd = 0.5))
X2 <- runif(100)
Y <- ifelse(X2 > 2*X1 + 0.3, -1, 1)
test <- data.frame(X1, X2, Y = factor(Y))

ggplot(test) + 
  geom_point(aes(x = X1, y = X2, col = factor(Y))) + 
  geom_abline(aes(intercept = 0.3, slope = 2, col = "True Decision Boundary"), size = 1) + 
  scale_color_manual(values = c('red', 'blue', 'black'),
                     guide = guide_legend(override.aes = list(pch = c(16, 16, NA), linetype = c(0, 0, 1)))) + 
  theme(legend.position = "bottom") + 
  labs(title = "Test Data", 
       col = "Y:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

Here I calculate the test errors for each model in the previous section, overlaying it on the same graph again:

test_error <- c()

for (i in 1:length(cost_range)) {
  svm_linear <- svm(Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, cost = cost_range[i])
  test_error[i] <- sum(predict(svm_linear, test) != test$Y) / nrow(test)
}

train_test_cv_df <- data.frame(cost = cost_range, power = power_range, CV_error = rowMeans(cv_matrix), train_error, test_error)


train_test_cv_df %>%
  pivot_longer(cols = c("CV_error", "train_error", "test_error")) %>%
  mutate(name = factor(name, ordered = T, levels = c("train_error", "CV_error", "test_error"))) %>%
  ggplot(aes(x = cost, y = value, group = name, col = name)) + 
  geom_line() + 
  geom_point(size = 2) + 
  scale_x_continuous(trans = 'log10', breaks = cost_range[seq(1, length(cost_range), 2)], labels = paste0("10^", round(power_range[seq(1, length(power_range), 2)], 1))) + 
  scale_y_continuous(labels = percent_format()) +
  coord_cartesian(ylim = c(0, 0.1)) +
  scale_color_manual(values = c("mediumseagreen", "deepskyblue3", "#BB8FCE"), labels = c("Training", "Cross-Validation", "Test")) + 
  theme(legend.position = "bottom", 
        axis.text.x = ggtext::element_markdown()) + 
  labs(title = "Generated Data - SVM (Linear Kernel)", 
       subtitle = "Training, cross-validation & test errors",
       x = "Cost", 
       y = "Error", 
       col = "Error:")

We can see that the model with the lowest training error performed worse on the test dataset than any of the models with the lowest cross-validation errors.

Looking at the cross-validation error, taking cost = \(10^{-0.6}\) (the middle of several cost values with identical CV errors, a logical choice) produces a model that also has the lowest error on the test dataset.
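For reference, the relevant rows can be pulled straight from train_test_cv_df (a quick lookup sketch):

# all cost values tied for the lowest CV error, with their training/test errors
subset(train_test_cv_df, CV_error == min(CV_error))

# and the cost value(s) achieving the lowest test error
subset(train_test_cv_df, test_error == min(test_error))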


(d) Comments

Q: Comment on your results.

A:

Going back to the question:

… it is claimed that in the case of data that is just barely linearly separable, a support vector classifier with a small value of cost that misclassifies a couple of training observations may perform better on test data than one with a huge value of cost that does not misclassify any training observations.

It seems my example has shown this successfully. Let’s take a look at the decision boundaries of some of these support vector classifiers to drive home this point.


First, let’s visualize the classification using a cost value that minimizes the training error (cost = \(10^{2.2}\)):

svm_linear <- svm(Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, cost = 10^2.2)

grid <- expand.grid(X1 = seq(-2.3, 2.3, length.out = 200), X2 = seq(0, 1, length.out = 200))
grid$class_pred <- predict(svm_linear, grid)

g1 <- ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = train, aes(col = Y)) +
  geom_abline(intercept = 0.3, slope = 2, size = 1) + 
  scale_color_manual(values = c("red", "blue")) + 
  labs(title = "Training Data", 
       subtitle = "SVM optimizing training error: cost = 10^2.2", 
       fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

g2 <- ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = test, aes(col = Y)) +
  geom_abline(intercept = 0.3, slope = 2, size = 1) + 
  scale_color_manual(values = c("red", "blue")) + 
  labs(title = "Test Data", 
       subtitle = "SVM optimizing training error: cost = 10^2.2", 
       fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))


grid.arrange(g1, g2, nrow = 2)

Observe how every observation in the train dataset is correctly classified, but 7 are misclassified on the test dataset (hence the test error of 7%).


Now let’s visualize the classification using a cost value that minimizes the cross-validation error (cost = \(10^{-0.6}\)):

svm_linear <- svm(Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, cost = 10^-0.6)

grid <- expand.grid(X1 = seq(-2.3, 2.3, length.out = 200), X2 = seq(0, 1, length.out = 200))
grid$class_pred <- predict(svm_linear, grid)

g1 <- ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = train, aes(col = Y)) +
  geom_abline(intercept = 0.3, slope = 2, size = 1) + 
  scale_color_manual(values = c("red", "blue")) + 
  labs(title = "Training Data", 
       subtitle = "SVM optimizing cross-validation error: cost = 10^-0.6", 
       fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

g2 <- ggplot(grid, aes(x = X1, y = X2)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = test, aes(col = Y)) +
  geom_abline(intercept = 0.3, slope = 2, size = 1) + 
  scale_color_manual(values = c("red", "blue")) + 
  labs(title = "Test Data", 
       subtitle = "SVM optimizing cross-validation error: cost = 10^-0.6", 
       fill = "Predicted:", 
       shape = "Support Vector:", 
       col = "Actual:", 
       x = TeX(r'($X_1$)'), 
       y = TeX(r'($X_2$)'))

grid.arrange(g1, g2, nrow = 2)

Optimizing for cross-validation error meant accepting one training misclassification, but resulted in far fewer test errors (2, i.e. a test error of 2%).

Notice how the second model also has a decision boundary closer to the true decision boundary. The model selected using cross-validation is not perfect, but it’s a significant improvement given the limited data (and the fact that I introduced an intentionally difficult outlier to throw the model off!)




7. APPLIED: The Auto Dataset (Linear/Polynomial/Radial Kernel SVMs)

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.


(a) New Variable: mpg01

Q: Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.

A:

I call this variable mpg01 (I created this variable before in the chapter 4 solutions):

Auto <- ISLR::Auto


### same pre-processing used in previous chapters:
Auto$brand <- sapply(strsplit(as.character(Auto$name), split = " "),
                     function(x) x[1]) # extract the first item from each list element

Auto$brand <- factor(ifelse(Auto$brand %in% c("vokswagen", "vw"), "volkswagen", 
                     ifelse(Auto$brand == "toyouta", "toyota", 
                            ifelse(Auto$brand %in% c("chevroelt", "chevy"), "chevrolet", 
                                   ifelse(Auto$brand == "maxda", "mazda", 
                                           Auto$brand))))) # fixing typos in brand names

Auto$brand <- fct_lump(Auto$brand, n = 9, other_level = "uncommon") # collapse into 10 categories

Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
### same pre-processing used in previous chapters^


Auto$mpg01 <- factor(as.numeric(Auto$mpg > median(Auto$mpg)))

Auto <- Auto %>%
  select(-c(mpg, name)) %>% # removing unneeded variables
  select(mpg01, everything())

glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg01        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <fct> American, American, American, American, American, Amer...
## $ brand        <fct> chevrolet, buick, plymouth, amc, ford, ford, chevrolet...

I’ve removed mpg (it wouldn’t make much sense to use it as a predictor when predicting mpg01).


(b) Support Vector Classifier - CV Errors

Q: Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.

A:

Linear Kernel:

I fit a wide range of support vector classifiers with cost values on a \(\log_{10}\) scale. I again use repeated cross-validation to provide an out-of-sample error estimate:

# 5 repeats of 10-fold cross-validation:
power_range <- seq(-3, 3, 0.25)
cost_range <- 10^power_range
number <- 10
repeats <- 5

cv_matrix <- matrix(nrow = length(cost_range), ncol = repeats)

set.seed(202)

for (i in 1:repeats) {
  svm_linear_tune <- tune(svm, mpg01 ~ ., data = Auto, kernel = "linear", scale = TRUE, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
  cv_matrix[ ,i] <- svm_linear_tune$performances$error
}

svm_linear_df <- data.frame(cost = cost_range, CV_error = rowMeans(cv_matrix)) %>%
  mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))

ggplot(svm_linear_df, aes(x = cost, y = CV_error)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_error))) +
  scale_x_continuous(trans = 'log10', breaks = cost_range[seq(1, length(cost_range), 2)], labels = paste0("10^", power_range[seq(1, length(power_range), 2)])) + 
  scale_y_continuous(labels = percent_format(), breaks = seq(0.08, 0.2, 0.01)) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none", 
        axis.text.x = ggtext::element_markdown()) + 
  labs(title = "Auto Dataset - SVM (Linear Kernel)", 
       subtitle = "Selecting cost parameter using cross-validation",
       x = "Cost", 
       y = "CV Error")

Here is the cost parameter and cross-validation error associated with the best-performing model:

svm_linear_df %>% filter(min_CV_error == 1) %>% select(-min_CV_error) # 0.08615385
##   cost   CV_error
## 1 0.01 0.08615385


(c) SVM (Polynomial/Radial Kernel) - CV Errors

Q: Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.

A:

Polynomial Kernel:

The book shows tuning degree for an SVM with a polynomial kernel, but you can actually tune both degree and cost together.

I test different combinations of these hyper-parameters, using repeated cross-validation to select the best model:

power_range <- seq(1, 7, 0.5)
cost_range <- 10^power_range
degree_range <- 2:7
number <- 10
repeats <- 2

cv_matrix <- matrix(nrow = length(cost_range)*length(degree_range), ncol = repeats)

set.seed(151)
for (i in 1:repeats) {
  svm_poly_tune <- tune(svm, mpg01 ~ ., data = Auto, kernel = "polynomial", scale = TRUE, ranges = list(degree = degree_range, cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
  cv_matrix[ ,i] <- svm_poly_tune$performances$error
}

svm_poly_df <- cbind(svm_poly_tune$performances[ ,c("degree", "cost")], CV_error = rowMeans(cv_matrix)) %>%
  mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))

ggplot(svm_poly_df, aes(x = cost, y = CV_error, col = factor(degree))) + 
  geom_line() + 
  geom_point(aes(shape = factor(min_CV_error)), show.legend = F, size = 3) + 
  scale_shape_manual(values = c(20, 19)) +
  scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", power_range)) + 
  scale_y_continuous(labels = percent_format()) +
  coord_cartesian(ylim = c(0.05, 0.3)) +
  theme(axis.text.x = ggtext::element_markdown(), 
        legend.position = "bottom") +
  labs(title = "Auto Dataset - SVM (Polynomial Kernel)", 
       subtitle = "Selecting cost & degree parameters using cross-validation",
       x = "Cost", 
       y = "CV Error", 
       col = "Degree")

The selected model outperforms the linear kernel model:

svm_poly_df %>% filter(min_CV_error == 1) %>% select(-min_CV_error) # 0.07137821
##   degree     cost   CV_error
## 1      3 316.2278 0.07137821


Radial Kernel:

I test different combinations of gamma and cost, using repeated cross-validation to select the best model:

cost_power_range <- seq(-2, 5, 0.5)
cost_range <- 10^cost_power_range
gamma_power_range <- -4:1
gamma_range <- 10^gamma_power_range
number <- 10
repeats <- 3

cv_matrix <- matrix(nrow = length(cost_power_range)*length(gamma_power_range), ncol = repeats)

set.seed(666)
for (i in 1:repeats) {
  svm_radial_tune <- tune(svm, mpg01 ~ ., data = Auto, kernel = "radial", scale = TRUE, ranges = list(gamma = gamma_range, cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
  cv_matrix[ ,i] <- svm_radial_tune$performances$error
}

svm_radial_df <- cbind(svm_radial_tune$performances[ ,c("gamma", "cost")], CV_error = rowMeans(cv_matrix)) %>%
  mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))

ggplot(svm_radial_df, aes(x = cost, y = CV_error, col = factor(gamma))) + 
  geom_line() + 
  geom_point(aes(shape = factor(min_CV_error)), show.legend = F, size = 3) + 
  scale_shape_manual(values = c(20, 19)) +
  scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", cost_power_range)) + 
  scale_y_continuous(labels = percent_format()) +
  scale_color_discrete(labels = paste0("10^", gamma_power_range)) +
  coord_cartesian(ylim = c(0.05, 0.3)) +
  theme(axis.text.x = ggtext::element_markdown(), 
        legend.text = ggtext::element_markdown(), 
        legend.position = "bottom") +
  labs(title = "Auto Dataset - SVM (Radial Kernel)", 
       subtitle = "Selecting cost & gamma parameters using cross-validation",
       x = "Cost", 
       y = "CV Error", 
       col = "Gamma")

The selected model has the best performance so far:

svm_radial_df %>% filter(min_CV_error == 1) %>% select(-min_CV_error) # 0.07055556
##   gamma     cost   CV_error
## 1   0.1 31.62278 0.07055556


(d) SVM Plots

Q: Make some plots to back up your assertions in (b) and (c). Hint: In the lab, we used the plot() function for svm objects only in cases with p = 2. When p > 2, you can use the plot() function to create plots displaying pairs of variables at a time. Essentially, instead of typing

plot(svmfit, dat)

where svmfit contains your fitted model and dat is a data frame containing your data, you can type

plot(svmfit, dat, x1 ~ x4)

in order to plot just the first and fourth variables. However, you must replace x1 and x4 with the correct variable names. To find out more, type ?plot.svm

A:

I create svm_radial, a fit of the best-performing model on the full Auto dataset - an SVM with a radial kernel, gamma = 0.1 and cost = 10^1.5:

svm_radial <- svm(mpg01 ~ ., data = Auto, kernel = "radial", scale = T, gamma = 0.1, cost = 10^1.5)

I will attempt to reproduce these plots in ggplot2, as I don't think the default plot.svm output is great.

First, let's look at the default output from plotting svm_radial. Note that we must choose two variables to plot or an error is returned. I chose weight and horsepower:

plot(svm_radial, Auto, weight ~ horsepower)

All the plot above really does is plot weight against horsepower (colour-coded by mpg01, with crosses marking the support vectors - these symbols can be changed via the svSymbol argument). In fact, if you didn't care about identifying which points were support vectors, you could produce the plot from the underlying data alone (no need for the SVM).
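For instance, a quick base-R sketch of that idea - the scatter itself needs only the raw data, with svm_radial$index used solely to mark which points happen to be support vectors (the plotting choices here are my own):

plot(weight ~ horsepower, data = Auto, 
     col = ifelse(Auto$mpg01 == 1, "blue", "red"), # colour by actual mpg01
     pch = ifelse(seq_len(nrow(Auto)) %in% svm_radial$index, 4, 1)) # cross = support vector
legend("topright", legend = c("mpg01 = 1", "mpg01 = 0"), col = c("blue", "red"), pch = 1)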

It’s easy to produce a nicer equivalent in ggplot2:

svm_radial_plot_df <- Auto %>%
  select(mpg01, weight, horsepower) %>%
  mutate(support = case_when(row_number() %in% svm_radial$index ~ "Yes", T ~ "No"))

ggplot(svm_radial_plot_df, aes(x = horsepower, y = weight, col = mpg01, shape = support)) + 
  geom_point(size = 2) + 
  scale_color_manual(values = c("red", "blue")) + 
  scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) + 
  scale_y_continuous(labels = comma_format()) +
  labs(shape = "Support Vector:", 
       col = "Actual mpg01:")

Note that the decision boundary can't really be shown here because the data is too high-dimensional: the SVM's predictions depend on far more than these two variables, so there is no single accurate boundary between predicted high/low mpg in this 2D space.

We can get around this by fixing the other predictors at a certain value with the slice argument. I try fixing the numeric predictors at their median value and the categorical predictors at their modal category:

# helper to find the modal (most frequent) value of a vector
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))] # count occurrences of each unique value, return the most common
}

plot(svm_radial, 
     Auto, 
     weight ~ horsepower, 
     slice = list(cylinders = median(Auto$cylinders), 
                  displacement = median(Auto$displacement),
                  acceleration = median(Auto$acceleration),
                  year = median(Auto$year), 
                  origin = Mode(Auto$origin), 
                  brand = Mode(Auto$brand)))

Again reproducing this in ggplot2:

grid <- expand.grid(horsepower = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 200), 
                    weight = seq(min(Auto$weight), max(Auto$weight), length.out = 200), 
                    cylinders = median(Auto$cylinders), 
                    displacement = median(Auto$displacement),
                    acceleration = median(Auto$acceleration),
                    year = median(Auto$year), 
                    origin = Mode(Auto$origin), # levels(Auto$origin)[1]
                    brand = Mode(Auto$brand)) # levels(Auto$brand)[1]

grid$class_pred <- predict(svm_radial, grid)


ggplot(grid, aes(x = horsepower, y = weight)) + 
  geom_tile(aes(fill = class_pred), alpha = 0.5) +
  geom_point(data = svm_radial_plot_df, aes(col = mpg01, shape = support), size = 2) +
  scale_color_manual(values = c("red", "blue")) + 
  scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) + 
  scale_y_continuous(labels = comma_format()) +
  labs(shape = "Support Vector:", 
       fill = "Predicted mpg01:", 
       col = "Actual mpg01:") + 
  guides(fill = guide_legend(order = 1),
         col = guide_legend(order = 2), 
         shape = guide_legend(order = 3))

This doesn't necessarily mean that the cars falling in the blue region were actually predicted 1 and those in the red region were predicted 0; each car's prediction depends on all 8 predictors in the dataset (to varying degrees), not just horsepower and weight at the slice values.

This reminds me in some ways of a partial dependence plot, as we are trying to isolate the effect of two variables (horsepower and weight) on the class prediction. In a PDP we do this by ‘integrating out’ other predictors, whereas here we do it by varying just these two predictors while holding all others constant.
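To make that comparison concrete, here is a rough PDP-style sketch (my own addition: pdp_grid and avg_pred_1 are hypothetical names, and this averages hard class predictions rather than probabilities, so it's only a crude approximation of a true partial dependence plot). For each point on a coarse horsepower/weight grid, those two values are written into every row of Auto and the resulting class predictions are averaged:

pdp_grid <- expand.grid(horsepower = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 25), 
                        weight = seq(min(Auto$weight), max(Auto$weight), length.out = 25))

# average the SVM's prediction over the observed values of the other predictors
pdp_grid$avg_pred_1 <- apply(pdp_grid, 1, function(gp) {
  temp <- Auto
  temp$horsepower <- gp[["horsepower"]]
  temp$weight <- gp[["weight"]]
  mean(predict(svm_radial, temp) == "1") # fraction of cars predicted mpg01 = 1 at this grid point
})

ggplot(pdp_grid, aes(x = horsepower, y = weight, fill = avg_pred_1)) + 
  geom_tile() + 
  scale_y_continuous(labels = comma_format()) +
  labs(fill = "Fraction predicted\nmpg01 = 1")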

Some insights we can take from these plots:

  • Higher values of horsepower and weight are associated with lower values of mpg, i.e. where mpg01 = 0 (visible from the first plot - no SVM required)
  • With everything else held constant, higher values of horsepower and weight result in the SVM predicting mpg01 = 0 (i.e. that the vehicle's mpg is below the median of 22.75 - checked below)
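(The 22.75 figure comes from the original data - mpg itself was dropped from Auto earlier, so we go back to ISLR::Auto:)

median(ISLR::Auto$mpg)
## [1] 22.75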

For comparison, here’s the same plot for the selected model when using a linear kernel:

svm_linear <- svm(mpg01 ~ ., data = Auto, kernel = "linear", scale = T, cost = 10^-2)

svm_linear_plot_df <- Auto %>%
  select(mpg01, weight, horsepower) %>%
  mutate(support = case_when(row_number() %in% svm_linear$index ~ "Yes", T ~ "No"))

grid$class_pred_linear <- predict(svm_linear, grid)


ggplot(grid, aes(x = horsepower, y = weight)) + 
  geom_tile(aes(fill = class_pred_linear), alpha = 0.5) +
  geom_point(data = svm_linear_plot_df, aes(col = mpg01, shape = support), size = 2) +
  scale_color_manual(values = c("red", "blue")) + 
  scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) + 
  scale_y_continuous(labels = comma_format()) +
  labs(shape = "Support Vector:", 
       fill = "Predicted mpg01:", 
       col = "Actual mpg01:") + 
  guides(fill = guide_legend(order = 1),
         col = guide_legend(order = 2), 
         shape = guide_legend(order = 3))




8. APPLIED: The OJ Dataset (Linear/Polynomial/Radial Kernel SVMs)

This problem involves the OJ data set which is part of the ISLR package.


(a) train/test Split

Q: Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations

A:

I create train and test below. Note that I use the same split that I used in my Chapter 8 solutions, for comparison purposes.

glimpse(OJ)
## Rows: 1,070
## Columns: 18
## $ Purchase       <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <dbl> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID        <dbl> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH        <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM        <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH         <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM         <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH      <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM      <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH        <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM    <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH    <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff      <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7         <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM      <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH      <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff  <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE          <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...
set.seed(5)
train_index <- sample(1:nrow(OJ), 800)

train <- OJ[train_index, ]
test <- OJ[-train_index, ]


(b) Support Vector Classifier

Q: Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

A:

The summary() output isn’t particularly detailed:

svm_linear <- svm(Purchase ~ ., data = train, kernel = "linear", scale = T, cost = 0.01)

summary(svm_linear)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01, 
##     scale = T)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  441
## 
##  ( 219 222 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

According to the svm() documentation, C-classification is the default type when the response is a factor (as Purchase is here).

There does appear to be quite a large number of support vectors (441) relative to the number of training observations (800). This is expected with such a small cost: a low cost gives a wide margin, so many observations lie on or inside the margin (or violate it) and therefore become support vectors.
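A quick sketch of this effect (counting support vectors via $index for a few cost values; the exact counts aren't shown here and will depend on the training split, but they typically shrink as cost increases):

sapply(c(0.01, 1, 100), function(C) {
  length(svm(Purchase ~ ., data = train, kernel = "linear", scale = T, cost = C)$index)
})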

Note that the printed output is a bit deceptive: "( 219 222 )" does not line up with the Purchase levels (CH MM) printed below it. Here are the classes of the 441 support vectors to show what I mean:

table(train$Purchase[svm_linear$index])
## 
##  CH  MM 
## 222 219

It would arguably make more sense if "( 219 222 )" instead read "( 222 219 )", so that the counts lined up with the printed levels.


(c) train/test Error Rates

Q: What are the training and test error rates?

A:

Here are the training and test error rates respectively:

data.frame(train_error = mean(predict(svm_linear, train) != train$Purchase), 
           test_error = mean(predict(svm_linear, test) != test$Purchase))
##   train_error test_error
## 1     0.16625  0.1666667


(d) Cross-Validation - Selecting cost

Q: Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

A:

I implement repeated cross-validation again, but even so, the selected model seems to depend largely on the seed used - probably because (looking at the y-axis scale) varying cost over this range doesn't make a huge difference to the error estimate:

power_range <- seq(-2, 1, 0.2)
cost_range <- 10^power_range
number <- 10
repeats <- 3

cv_matrix <- matrix(nrow = length(cost_range), ncol = repeats)

set.seed(10)

for (i in 1:repeats) {
  svm_linear_tune <- tune(svm, Purchase ~ ., data = train, kernel = "linear", scale = T, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
  cv_matrix[ ,i] <- svm_linear_tune$performances$error
}


svm_linear_df <- data.frame(cost = cost_range, CV_error = rowMeans(cv_matrix)) %>%
  mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))

ggplot(svm_linear_df, aes(x = cost, y = CV_error)) + 
  geom_line(col = "deepskyblue3") + 
  geom_point(size = 2, aes(col = factor(min_CV_error))) +
  scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", power_range)) + 
  scale_y_continuous(labels = percent_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none", 
        axis.text.x = ggtext::element_markdown()) + 
  labs(title = "OJ Data - SVM (Linear Kernel)", 
       subtitle = "Selecting cost parameter using cross-validation",
       x = "Cost", 
       y = "CV Error")

Here is the cost parameter and cross-validation error associated with the best-performing model:

svm_linear_df %>% filter(min_CV_error == 1) %>% select(-min_CV_error) # 0.1704167
##        cost  CV_error
## 1 0.1584893 0.1704167
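To illustrate the seed-dependence mentioned above, here is a quick sketch (a single run of 10-fold CV per seed, over the same cost grid; the selected cost can easily differ between runs, and the results aren't shown here):

set.seed(1)
tune(svm, Purchase ~ ., data = train, kernel = "linear", scale = T, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = 10))$best.parameters

set.seed(2)
tune(svm, Purchase ~ ., data = train, kernel = "linear", scale = T, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = 10))$best.parameters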


(e) Chosen Model - train/test Error Rates

Q: Compute the training and test error rates using this new value for cost.

A:

The test error has improved slightly:

svm_linear <- svm(Purchase ~ ., data = train, kernel = "linear", scale = T, cost = 10^-0.8)

data.frame(train_error = mean(predict(svm_linear, train) != train$Purchase), 
           test_error = mean(predict(svm_linear, test) != test$Purchase))
##   train_error test_error
## 1     0.16375  0.1555556


(f) Radial Kernel SVM

Q: Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

A:

I test different values of cost (leaving gamma at its default value), using repeated cross-validation to select the best model:

cost_power_range <- seq(-2, 1, 0.2)
cost_range <- 10^cost_power_range
number <- 10
repeats <- 3

cv_matrix <- matrix(nrow = length(cost_power_range), ncol = repeats)

set.seed(360)
for (i in 1:repeats) {
  svm_radial_tune <- tune(svm, Purchase ~ ., data = train, kernel = "radial", scale = T, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
  cv_matrix[ ,i] <- svm_radial_tune$performances$error
}

svm_radial_df <- data.frame(cost = svm_radial_tune$performances$cost, CV_error = rowMeans(cv_matrix)) %>%
  mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))

ggplot(svm_radial_df, aes(x = cost, y = CV_error)) + 
  geom_line(col = "deepskyblue3") + 
  geom_point(size = 2, aes(col = factor(min_CV_error))) +
  scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", cost_power_range)) + 
  scale_y_continuous(labels = percent_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none", 
        axis.text.x = ggtext::element_markdown()) + 
  labs(title = "OJ Data - SVM (Radial Kernel)", 
       subtitle = "Selecting cost parameter using cross-validation",
       x = "Cost", 
       y = "CV Error")

Below is the repeated cross-validation error, training error and test error for cost = 0.01 and the optimal value over the range tested (cost = 10^-0.2):

svm_radial_1 <- svm(Purchase ~ ., data = train, kernel = "radial", scale = T, cost = 0.01)
svm_radial_2 <- svm(Purchase ~ ., data = train, kernel = "radial", scale = T, cost = 10^-0.2)


svm_radial_df %>% filter(min_CV_error == 1 | cost == 0.01) %>% select(-min_CV_error) %>%
  cbind(data.frame(train_error = c(mean(predict(svm_radial_1, train) != train$Purchase), 
                                   mean(predict(svm_radial_2, train) != train$Purchase)), 
                   test_error = c(mean(predict(svm_radial_1, test) != test$Purchase), 
                                  mean(predict(svm_radial_2, test) != test$Purchase))))
##        cost  CV_error train_error test_error
## 1 0.0100000 0.3875000     0.38750  0.3962963
## 2 0.6309573 0.1720833     0.15375  0.1629630


(g) Polynomial Kernel SVM

Q: Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.

A:

I test different values of cost (leaving degree fixed at 2), using repeated cross-validation to select the best model:

cost_power_range <- seq(-2, 1, 0.2)
cost_range <- 10^cost_power_range
number <- 10
repeats <- 3

cv_matrix <- matrix(nrow = length(cost_power_range), ncol = repeats)

set.seed(720)
for (i in 1:repeats) {
  svm_polynomial_tune <- tune(svm, Purchase ~ ., data = train, kernel = "polynomial", scale = T, ranges = list(degree = 2, cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
  cv_matrix[ ,i] <- svm_polynomial_tune$performances$error
}

svm_polynomial_df <- data.frame(cost = svm_polynomial_tune$performances$cost, CV_error = rowMeans(cv_matrix)) %>%
  mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))

ggplot(svm_polynomial_df, aes(x = cost, y = CV_error)) + 
  geom_line(col = "deepskyblue3") + 
  geom_point(size = 2, aes(col = factor(min_CV_error))) +
  scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", cost_power_range)) + 
  scale_y_continuous(labels = percent_format()) +
  scale_color_manual(values = c("deepskyblue3", "green")) + 
  theme(legend.position = "none", 
        axis.text.x = ggtext::element_markdown()) + 
  labs(title = "OJ Data - SVM (Polynomial Kernel)", 
       subtitle = "Selecting cost parameter using cross-validation",
       x = "Cost", 
       y = "CV Error")

Below is the repeated cross-validation error, training error and test error for cost = 0.01, alongside the optimal value (cost = 10^1) over the range tested:

svm_polynomial_1 <- svm(Purchase ~ ., data = train, kernel = "polynomial", scale = T, cost = 0.01, degree = 2)
svm_polynomial_2 <- svm(Purchase ~ ., data = train, kernel = "polynomial", scale = T, cost = 10, degree = 2)


svm_polynomial_df %>% filter(cost == 10 | cost == 0.01) %>% select(-min_CV_error) %>%
  cbind(data.frame(train_error = c(mean(predict(svm_polynomial_1, train) != train$Purchase), 
                                   mean(predict(svm_polynomial_2, train) != train$Purchase)), 
                   test_error = c(mean(predict(svm_polynomial_1, test) != test$Purchase), 
                                  mean(predict(svm_polynomial_2, test) != test$Purchase))))
##    cost  CV_error train_error test_error
## 1  0.01 0.3666667     0.36125  0.3925926
## 2 10.00 0.1725000     0.14125  0.1666667

I also tested costs beyond the range suggested by the question ('0.01 to 10'), since it appeared that increasing cost further (with degree = 2) might continue to reduce the cross-validation error; however, any improvement was marginal (the CV error remained above 0.17).
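As a sketch of what that extension looks like (not part of the results shown above; cost_range_wide and svm_poly_wide_tune are my own names, and the exact errors will depend on the folds):

cost_range_wide <- 10^seq(1, 4, 0.5) # extending beyond cost = 10, degree fixed at 2

set.seed(1)
svm_poly_wide_tune <- tune(svm, Purchase ~ ., data = train, kernel = "polynomial", scale = T, ranges = list(degree = 2, cost = cost_range_wide), tunecontrol = tune.control(sampling = "cross", cross = 10))

svm_poly_wide_tune$performances[ , c("cost", "error")]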


(h) Best Approach?

Q: Overall, which approach seems to give the best results on this data?

A:

Interestingly the linear, polynomial and radial kernel SVMs all seemed to perform fairly similarly.

From the selected models using the three approaches we would favor the linear kernel:

data.frame(kernel = c("linear", "radial", "polynomial"),
           CV_error = c(min(svm_linear_df$CV_error), 
                        min(svm_radial_df$CV_error), 
                        min(svm_polynomial_df$CV_error)), 
           test_error = c(mean(predict(svm_linear, test) != test$Purchase), 
                          mean(predict(svm_radial_2, test) != test$Purchase), 
                          mean(predict(svm_polynomial_2, test) != test$Purchase)))
##       kernel  CV_error test_error
## 1     linear 0.1704167  0.1555556
## 2     radial 0.1720833  0.1629630
## 3 polynomial 0.1725000  0.1666667

Note that using identical cross-validation folds for each model would have provided the fairest comparison, although this is slightly less important with repeated CV.
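For what it's worth, here is a sketch of how identical folds could be used: a manual 10-fold CV that shares one fold assignment across the three selected models (folds and cv_error are my own helpers; the hyper-parameters are the ones selected above):

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(train))) # one shared fold assignment

cv_error <- function(...) {
  fold_errors <- sapply(1:10, function(k) {
    fit <- svm(Purchase ~ ., data = train[folds != k, ], scale = T, ...)
    mean(predict(fit, train[folds == k, ]) != train$Purchase[folds == k])
  })
  mean(fold_errors)
}

c(linear = cv_error(kernel = "linear", cost = 10^-0.8), 
  radial = cv_error(kernel = "radial", cost = 10^-0.2), 
  polynomial = cv_error(kernel = "polynomial", degree = 2, cost = 10))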

It is also worth noting that the question restricts us to degree = 2 for the polynomial SVM and the default value of gamma for the radial SVM, so it's possible (indeed likely) that tuning these hyper-parameters as well (along with testing a broader range of cost values) could have led to better performance from the radial/polynomial SVMs.