NOTE: There are no official solutions for these questions. These are my solutions and could be incorrect. If you spot any mistakes/inconsistencies, please contact me on Liam95morgan@gmail.com, or via LinkedIn.
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
library(tidyverse)
library(latex2exp) # latex labels
library(ggforce) # geom_circle
library(kableExtra)
library(e1071) # svm()
library(scales)
library(ISLR) # Auto dataset
library(gridExtra)
RNGkind(kind = "Mersenne-Twister",
normal.kind = "Inversion",
sample.kind = "Rejection") # diff versions of R/RMD using different sampling types was annoying me
theme_set(theme_light())
This problem involves hyperplanes in two dimensions.
Q: Sketch the hyperplane \(1 + 3X_1 − X_2 = 0\). Indicate the set of points for which \(1 + 3X_1 − X_2 > 0\), as well as the set of points for which \(1 + 3X_1 − X_2 < 0\).
A:
Plotting the line:
\[1 + 3X_1 − X_2 = 0 \\ \iff X_2 = 3X_1 + 1\]
Rearranging for \(X_2\) makes it clear which side of the hyperplane is which:
X1 <- -10:10
X2 <- 3*X1 + 1
df1 <- data.frame(X1, X2, line = "a")
grid <- expand.grid(X1 = seq(min(df1$X1), max(df1$X1), length.out = 200),
X2 = seq(min(df1$X2), max(df1$X2), length.out = 200)) %>%
mutate(region = case_when(1 + 3*X1 - X2 < 0 ~ "lt 0",
1 + 3*X1 - X2 > 0 ~ "gt 0"))
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = region), alpha = 0.5) +
geom_line(data = df1, aes(x = X1, y = X2), size = 1, col = "#7CAE00") +
annotate("text", x = -5, y = 10, label = TeX("$1 + 3X_1 − X_2 < 0$", output = "character"),
hjust = 0.5, size = 4, parse = TRUE, col = "blue") +
annotate("text", x = 5, y = -10, label = TeX("$1 + 3X_1 − X_2 > 0$", output = "character"),
hjust = 0.5, size = 4, parse = TRUE, col = "red") +
scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
scale_x_continuous(expand = c(0.01,0.01), breaks = seq(-10, 10, 2)) +
scale_y_continuous(expand = c(0.01,0.01), breaks = seq(-30, 30, 10)) +
theme(legend.position = "none") +
labs(title = TeX(r'(Hyperplane Plot: $1 + 3X_1 − X_2 = 0$)'),
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
Q: On the same plot, sketch the hyperplane \(−2 + X_1 + 2X_2 = 0\). Indicate the set of points for which \(−2 + X_1 + 2X_2 > 0\), as well as the set of points for which \(−2 + X_1 + 2X_2 < 0\).
A:
Plotting the line:
\[−2 + X_1 + 2X_2 = 0 \\ \iff X_2 = 1 -\frac{1}{2} X_1\]
As before:
X1 <- -10:10
X2 <- 1 - 0.5*X1
df2 <- data.frame(X1, X2, line = "b")
grid <- grid %>%
mutate(region = case_when(-2 + X1 + 2*X2 < 0 ~ "lt 0",
-2 + X1 + 2*X2 > 0 ~ "gt 0"))
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = region), alpha = 0.5) +
geom_line(data = df2, aes(x = X1, y = X2), size = 1, col = "#C77CFF") +
annotate("text", x = 2.5, y = 15, label = TeX("$-2 + X_1 + 2X_2 > 0$", output = "character"),
hjust = 0.5, size = 4, parse = TRUE, col = "blue") +
annotate("text", x = -2.5, y = -15, label = TeX("$-2 + X_1 + 2X_2 < 0$", output = "character"),
hjust = 0.5, size = 4, parse = TRUE, col = "red") +
scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
scale_x_continuous(expand = c(0.01,0.01), breaks = seq(-10, 10, 2)) +
scale_y_continuous(expand = c(0.01,0.01), breaks = seq(-30, 30, 10)) +
theme(legend.position = "none") +
labs(title = TeX(r'(Hyperplane Plot: $−2 + X_1 + 2X_2 = 0$)'),
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
Since the question asked for them on the same plot:
bind_rows(df1, df2) %>%
ggplot(aes(x = X1, y = X2, col = line)) +
geom_line(size = 1) +
scale_color_manual(values = c("#7CAE00", "#C77CFF"),
labels = unname(TeX(c("$1 + 3X_1 - X_2 = 0$", "$-2 + X_1 + 2X_2 = 0$")))) +
scale_x_continuous(breaks = seq(-10, 10, 2)) +
scale_y_continuous(breaks = seq(-30, 30, 10)) +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'),
col = "Hyperplane:") +
theme(legend.position = "bottom")
We have seen that in \(p = 2\) dimensions, a linear decision boundary takes the form \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\). We now investigate a non-linear decision boundary.
Q: Sketch the curve
\[(1 + X_1)^2 + (2 - X_2)^2 = 4\]
A:
We can see with a little rearranging that the curve is a circle with center \((-1, 2)\) and radius \(2\):
\[\begin{align} & (1 + X_1)^2 + (2 - X_2)^2 = 4 \\ \implies & (X_1 - (-1))^2 + ((-1)(X_2 - 2))^2 = 4 \\ \implies & (X_1 - (-1))^2 + (X_2 - 2)^2 = 2^2 \\ \end{align}\]
ggplot() +
geom_circle(data = data.frame(X1 = -1, X2 = 2, r = 2),
aes(x0 = X1, y0 = X2, r = r), col = "mediumseagreen", size = 1) +
scale_x_continuous(expand = c(0.01,0.01), limits = c(-3.5, 1.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(-0.5, 4.5)) +
labs(title = TeX(r'(Curve Plot: $(1 + X_1)^2 + (2 - X_2)^2 = 4$)'),
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed()
Q: On your sketch, indicate the set of points for which
\[(1 + X_1)^2 + (2 - X_2)^2 > 4\]
as well as the set of points for which
\[(1 + X_1)^2 + (2 - X_2)^2 \le 4\]
A:
grid <- expand.grid(X1 = seq(-3.5, 1.5, length.out = 200),
X2 = seq(-0.5, 4.5, length.out = 200)) %>%
mutate(region = case_when((1 + X1)^2 + (2 - X2)^2 > 4 ~ "gt 4",
TRUE ~ "le 4"))
ggplot() +
geom_tile(data = grid, aes(x = X1, y = X2, fill = region), alpha = 0.5) +
geom_circle(data = data.frame(X1 = -1, X2 = 2, r = 2),
aes(x0 = X1, y0 = X2, r = r), col = "mediumseagreen", size = 1) +
annotate("text", x = -1, y = 2, label = TeX("$(1 + X_1)^2 + (2 - X_2)^2 \\leq 4$", output = "character"),
hjust = 0.5, size = 4, parse = TRUE, col = "red") +
annotate("text", x = 0.5, y = 0, label = TeX("$(1 + X_1)^2 + (2 - X_2)^2 > 4$", output = "character"),
hjust = 0.5, size = 4, parse = TRUE, col = "blue") +
scale_x_continuous(expand = c(0.01,0.01), limits = c(-3.5, 1.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(-0.5, 4.5)) +
scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
theme(legend.position = "none") +
labs(title = TeX(r'(Curve Plot: $(1 + X_1)^2 + (2 - X_2)^2 = 4$)'),
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed()
Q: Suppose that a classifier assigns an observation to the blue class if
\[(1 + X_1)^2 + (2 - X_2)^2 > 4\]
and to the red class otherwise. To what class is the observation \((0, 0)\) classified? \((−1, 1)\)? \((2, 2)\)? \((3, 8)\)?
A:
It’s fairly obvious from the previous plot how the observations will be classified: \((0, 0)\), \((2, 2)\) and \((3, 8)\) lie outside the circle and are assigned to the blue class, while \((-1, 1)\) lies inside it and is assigned to the red class.
new_obs <- data.frame(X1 = c(0, -1, 2, 3), X2 = c(0, 1, 2, 8)) %>%
mutate(region = case_when((1 + X1)^2 + (2 - X2)^2 > 4 ~ "gt 4",
TRUE ~ "le 4"))
grid <- expand.grid(X1 = seq(-3.5, 3.5, length.out = 200),
X2 = seq(-0.5, 8.5, length.out = 200)) %>%
mutate(region = case_when((1 + X1)^2 + (2 - X2)^2 > 4 ~ "gt 4",
TRUE ~ "le 4"))
ggplot() +
geom_tile(data = grid, aes(x = X1, y = X2, fill = region), alpha = 0.5) +
geom_circle(data = data.frame(X1 = -1, X2 = 2, r = 2),
aes(x0 = X1, y0 = X2, r = r), col = "mediumseagreen", size = 1) +
geom_point(data = new_obs, aes(x = X1, y = X2, col = region)) +
scale_x_continuous(expand = c(0.01,0.01), limits = c(-3.5, 3.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(-0.5, 8.5)) +
scale_fill_manual(values = c("#00BFC4", "#F8766D")) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(title = TeX(r'(Classifier Plot: $f(X) = (1 + X_1)^2 + (2 - X_2)^2 - 4$)'),
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed()
In the context of support vector machines, we can think of this as a classifier of the form
\[f(X) = (1 + X_1)^2 + (2 - X_2)^2 - 4,\]
where a new observation \(X^*\) is assigned to the blue class if \(f(X^*) > 0\) and the red class if \(f(X^*) \le 0\).
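Plugging the four observations into this classifier confirms the assignments above (this quick check is my addition, reusing the new_obs data frame created earlier; the f and class columns are just for illustration):
new_obs %>%
mutate(f = (1 + X1)^2 + (2 - X2)^2 - 4, # f(X) for each observation: 1, -3, 5, 48
class = ifelse(f > 0, "Blue", "Red")) # (0, 0), (2, 2), (3, 8) -> Blue; (-1, 1) -> Red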
Q: Argue that while the decision boundary in (c) is not linear in terms of \(X_1\) and \(X_2\), it is linear in terms of \(X_1, X_1^2, X_2\) and \(X_2^2\)
A:
The decision boundary is simply the boundary described by \(f(X) = 0\) (as this is the boundary where the prediction changes):
\[\begin{align} f(X) & = (1 + X_1)^2 + (2 - X_2)^2 - 4 \\ & = X_1^2 + 2X_1 + 1 + X_2^2 - 4X_2 + 4 - 4 \\ & = X_1^2 + X_2^2 + 2X_1 - 4X_2 + 1 \\ & = (1)(X_1^2) + (1)(X_2^2) + (2)(X_1) + (-4)(X_2) + 1 \end{align}\]
From the final form you can see how \(f(X) = 0\) is expressed in the form \(a_1 Z_1 + a_2 Z_2 + a_3 Z_3 + a_4 Z_4 + b = 0\) for \((Z_1, Z_2, Z_3, Z_4) = (X_1^2, X_2^2, X_1, X_2)\), and is therefore linear in terms of \(X_1, X_1^2, X_2\) and \(X_2^2\).
Here we explore the maximal margin classifier on a toy data set.
Q: We are given \(n = 7\) observations in \(p = 2\) dimensions. For each observation, there is an associated class label.
df <- data.frame(obs = 1:7,
X1 = c(3, 2, 4, 1, 2, 4, 4),
X2 = c(4, 2, 4, 4, 1, 3, 1),
Y = c("Red", "Red", "Red", "Red", "Blue", "Blue", "Blue"))
df %>%
kable(row.names = F, col.names = c("Obs.", "$X_1$", "$X_2$", "$Y$")) %>%
kable_styling("bordered", full_width = F)
Obs. | \(X_1\) | \(X_2\) | \(Y\) |
---|---|---|---|
1 | 3 | 4 | Red |
2 | 2 | 2 | Red |
3 | 4 | 4 | Red |
4 | 1 | 4 | Red |
5 | 2 | 1 | Blue |
6 | 4 | 3 | Blue |
7 | 4 | 1 | Blue |
Sketch the observations.
A:
ggplot() +
geom_point(data = df, aes(x = X1, y = X2, col = Y), size = 2) +
scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed()
Q: Sketch the optimal separating hyperplane, and provide the equation for this hyperplane (of the form (9.1)).
A:
Visual inspection shows that the line separating the blue and red points with the largest margin will be given by the line
\[X_2 - X_1 + 0.5 = 0\]
Simple geometry shows that the margin is \(M = \frac{\sqrt{2}}{4}\) (the minimum perpendicular distance from a training observation to the hyperplane, via the standard point-to-line distance formula).
## 'simple geometry': perpendicular distance from a closest point (x_0, y_0) = (2, 1) (observation 5)
## to the line -X1 + X2 + 0.5 = 0, i.e. |a*x_0 + b*y_0 + c| / sqrt(a^2 + b^2) with (a, b, c) = (-1, 1, 0.5):
# abs((-1)*(2) + (1)*(1) + 0.5) / sqrt((-1)^2 + (1)^2) # = sqrt(2)/4 ~ 0.354
perp_lines <- data.frame(X1 = c(2, 1.75, 2, 2.25, 4, 3.75, 4, 4.25),
X2 = c(1, 1.25, 2, 1.75, 3, 3.25, 4, 3.75),
group = c("a", "a", "b", "b", "c", "c", "d", "d"))
ggplot() +
geom_line(data = perp_lines, aes(x = X1, y = X2, group = group), linetype = "dotted", col = "grey40") +
geom_point(data = df, aes(x = X1, y = X2, col = Y), size = 2) +
geom_abline(slope = 1, intercept = -0.5) +
scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed()
Q: Describe the classification rule for the maximal margin classifier. It should be something along the lines of "Classify to Red if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0\), and classify to Blue otherwise." Provide the values for \(\beta_0\), \(\beta_1\) and \(\beta_2\).
A:
The optimal separating hyperplane from the previous question has the equation \(X_2 = X_1 - 0.5\), and visual inspection makes it clear that the prediction for the maximal margin classifier will be Red when \(X_2 > X_1 - 0.5 \iff X_2 - X_1 + 0.5 > 0\) and Blue otherwise.
The maximal margin classifier will therefore take the form
\[f(X) = X_2 - X_1 + 0.5\]
where a new observation \(X^*\) is assigned to the Red class if \(f(X^*) > 0\) and the Blue class if \(f(X^*) \le 0\).
We therefore have \((\beta_0, \beta_1, \beta_2) = (0.5, -1, 1)\).
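As a sanity check (my addition, not part of the exercise), we could fit a near-hard-margin support vector classifier to these seven points with e1071 and recover the hyperplane coefficients; svm_toy is an illustrative name, and the recovered vector is only defined up to sign and scale:
svm_toy <- svm(factor(Y) ~ X1 + X2, data = df, kernel = "linear", cost = 1e5, scale = FALSE)
beta <- drop(t(svm_toy$coefs) %*% svm_toy$SV) # w = sum_i (alpha_i * y_i) * x_i
beta_0 <- -svm_toy$rho # intercept (e1071 decision function is w'x - rho)
c(beta_0, beta) # should be proportional (up to sign) to (0.5, -1, 1)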
grid <- expand.grid(X1 = seq(0.5, 4.5,length.out = 200), X2 = seq(0.5, 4.5,length.out = 200)) %>%
mutate(region = case_when(X2 - X1 + 0.5 < 0 ~ "lt 0",
X2 - X1 + 0.5 > 0 ~ "gt 0"))
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = region), alpha = 0.5) +
geom_point(data = df, aes(x = X1, y = X2, col = Y), size = 2) +
geom_abline(slope = 1, intercept = -0.5) +
scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_color_manual(values = c("blue", "red")) +
scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed()
Q: On your sketch, indicate the margin for the maximal margin hyperplane.
A:
The maximal margin hyperplane is the solid line, and the margin is the distance from the solid line to either of the dashed lines:
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = region), alpha = 0.5) +
geom_abline(slope = 1, intercept = -0.5) +
geom_abline(slope = 1, intercept = -1, linetype = 'dashed', col = 'grey40') +
geom_abline(slope = 1, intercept = 0, linetype = 'dashed', col = 'grey40') +
geom_point(data = df, aes(x = X1, y = X2, col = Y), size = 2) +
scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_color_manual(values = c("blue", "red")) +
scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed()
This is the maximal margin hyperplane because it’s the hyperplane for which the margin (minimum perpendicular distance between the hyperplane and the training observations) is maximized.
The book provides another useful way of phrasing this:
In a sense, the maximal margin hyperplane represents the mid-line of the widest “slab” that we can insert between the two classes.
Q: Indicate the support vectors for the maximal margin classifier.
A:
I label each training observation with its number, and add an indicator for the support vectors:
df$support <- c("No", "Yes", "Yes", "No", "Yes", "Yes", "No")
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = region), alpha = 0.5) +
geom_abline(slope = 1, intercept = -0.5) +
geom_abline(slope = 1, intercept = -1, linetype = 'dashed', col = 'grey40') +
geom_abline(slope = 1, intercept = 0, linetype = 'dashed', col = 'grey40') +
geom_point(data = df, aes(x = X1, y = X2, col = Y, shape = support), size = 2) +
geom_text(data = df, aes(label = obs, vjust = -1), col = 'grey40') +
scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) +
scale_color_manual(values = c("red", "blue"), limits = c("Red", "Blue")) +
scale_fill_manual(values = c("#F8766D", "#00BFC4"), labels = c("Red", "Blue")) +
labs(fill = "Predicted:",
col = "Actual:",
shape = "Support Vector:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed() +
theme(legend.position = "bottom") +
guides(fill = guide_legend(order = 1),
col = guide_legend(order = 2),
shape = guide_legend(order = 3))
Q: Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.
A:
We can see above that observation 7 is not a support vector. Any small movement of this observation in any direction is not going to change the fact that \(X_2 - X_1 + 0.5 = 0\) provides the best separation of observations with the largest margin of \(M = \frac{\sqrt{2}}{4}\) (it will still be the mid-line of the ‘widest slab’ we can fit between the classes).
In this case, observation 7 would have to move inside the margin to start influencing the position of the maximal margin hyperplane.
Q: Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.
A:
Here I produce a hyperplane that is still a separating hyperplane, but isn’t the optimal separating hyperplane (the maximal margin hyperplane used in the maximal margin classifier).
The hyperplane is given by the line
\[X_2 - 0.75 \cdot X_1 - 0.3 = 0\]
You can see below that the margin \(M\) is considerably smaller here.
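As a quick numerical check (my addition), the same point-to-line distance formula as before applied to observation 2, the training point closest to this hyperplane:
# distance from (2, 2) to X2 - 0.75*X1 - 0.3 = 0, i.e. (a, b, c) = (-0.75, 1, -0.3):
abs((-0.75)*2 + (1)*2 + (-0.3)) / sqrt((-0.75)^2 + 1^2) # = 0.16, vs sqrt(2)/4 ~ 0.354 for the optimal hyperplane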
grid <- expand.grid(X1 = seq(0.5, 4.5,length.out = 200), X2 = seq(0.5, 4.5,length.out = 200)) %>%
mutate(region = case_when(X2 - 0.75*X1 - 0.3 < 0 ~ "lt 0",
X2 - 0.75*X1 - 0.3 > 0 ~ "gt 0"))
df$support <- ifelse(df$obs == 2, "Yes", "No")
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = region), alpha = 0.5) +
geom_abline(slope = 0.75, intercept = 0.3) +
geom_abline(slope = 0.75, intercept = 0.1, linetype = 'dashed', col = 'grey40') +
geom_abline(slope = 0.75, intercept = 0.5, linetype = 'dashed', col = 'grey40') +
geom_point(data = df, aes(x = X1, y = X2, col = Y, shape = support), size = 2) +
scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
geom_text(data = df, aes(label = obs, vjust = -1), col = 'grey40') +
scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) +
scale_color_manual(values = c("red", "blue"), limits = c("Red", "Blue")) +
scale_fill_manual(values = c("#F8766D", "#00BFC4"), labels = c("Red", "Blue")) +
labs(fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed() +
theme(legend.position = "bottom") +
guides(fill = guide_legend(order = 1),
col = guide_legend(order = 2),
shape = guide_legend(order = 3))
Q: Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.
A:
See below where I added observation 8, given by \((X_1, X_2) = (2, 3)\), meaning the two classes can no longer be separated by a hyperplane in two dimensions.
df %>%
select(obs, X1, X2, Y) %>%
bind_rows(data.frame(obs = 8, X1 = 2, X2 = 3, Y = "Blue")) %>%
ggplot(aes(x = X1, y = X2, col = Y), size = 2) +
geom_point(size = 2) +
geom_text(aes(label = obs, vjust = -1), col = 'grey40') +
scale_x_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_y_continuous(expand = c(0.01,0.01), limits = c(0.5, 4.5)) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)')) +
coord_fixed()
Q: Generate a simulated two-class data set with 100 observations and two features in which there is a visible but non-linear separation between the two classes. Show that in this setting, a support vector machine with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector classifier on the training data. Which technique performs best on the test data? Make plots and report training and test error rates in order to back up your assertions.
A:
I generate the data, perform a 70/30 train/test split and visualize train below.
set.seed(5)
X1 <- runif(100, min = -6, max = 6)
X2 <- 0.1*X1 + sin(X1) + rnorm(100, sd = 0.15)
Y <- sample(x = c(-1, 1), size = 100, replace = T)
X2[Y == 1] <- X2[Y == 1] + 1
df <- data.frame(Y = factor(Y), X1, X2)
train_index <- sample(1:100, 70)
train <- df[train_index, ]
test <- df[-train_index, ]
ggplot(train, aes(x = X1, y = X2, col = Y)) +
geom_point() +
scale_color_manual(values = c("red", "blue")) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
First I fit an SVM with a linear kernel, selecting the best cost parameter via 10-fold cross-validation:
set.seed(1)
svm_linear_tune <- tune(svm, Y ~ ., data = train, kernel = "linear", scale = TRUE, ranges = list(cost = c(10^c(-3:3))))
svm_linear <- svm_linear_tune$best.model
svm_linear_tune$performances %>%
rename(CV_error = error) %>%
mutate(min_CV_error = as.numeric(cost == svm_linear_tune$best.parameters$cost)) %>%
ggplot(aes(x = cost, y = CV_error)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_error))) +
scale_x_continuous(trans = 'log10', breaks = 10^c(-3:3), minor_breaks = NULL, labels = paste0("10^", c(-3:3))) +
scale_y_continuous(limits = c(0, 0.65), labels = percent_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none",
axis.text.x = ggtext::element_markdown()) +
labs(title = "Generated Data - SVM (Linear Kernel)",
subtitle = "Selecting cost parameter using cross-validation",
x = "Cost",
y = "CV Error")
Visualizing the prediction regions and actual classes on train:
grid <- expand.grid(X1 = seq(min(df$X1), max(df$X1), length.out = 200), X2 = seq(min(df$X2), max(df$X2), length.out = 200))
grid$class_pred <- predict(svm_linear, grid)
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = train, aes(col = Y)) +
scale_color_manual(values = c("red", "blue")) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
I next fit an SVM with a polynomial kernel, using cross-validation to select the degree parameter:
set.seed(2)
svm_poly_tune <- tune(svm, Y ~ ., data = train, kernel = "polynomial", scale = TRUE, ranges = list(degree = 2:7))
svm_poly <- svm_poly_tune$best.model
svm_poly_tune$performances %>%
rename(CV_error = error) %>%
mutate(min_CV_error = as.numeric(degree == svm_poly_tune$best.parameters$degree)) %>%
ggplot(aes(x = degree, y = CV_error)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_error))) +
scale_x_continuous(breaks = 2:7, minor_breaks = NULL, labels = 2:7) +
scale_y_continuous(limits = c(0, 0.65), labels = percent_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none") +
labs(title = "Generated Data - SVM (Polynomial Kernel)",
subtitle = "Selecting the degree using cross-validation",
x = "Degree",
y = "CV Error")
Visualizing on train:
grid <- expand.grid(X1 = seq(min(df$X1), max(df$X1), length.out = 200), X2 = seq(min(df$X2), max(df$X2), length.out = 200))
grid$class_pred <- predict(svm_poly, grid)
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = train, aes(col = Y)) +
scale_color_manual(values = c('red', 'blue')) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
Finally I fit an SVM with a radial kernel, using cross-validation to select two parameters here (cost & gamma):
set.seed(2)
svm_radial_tune <- tune(svm, Y ~ ., data = train, kernel = "radial", scale = TRUE, ranges = list(cost = c(10^c(-3:5)), gamma = 10^c(-2:2)))
svm_radial <- svm_radial_tune$best.model
svm_radial_tune$performances %>%
rename(CV_error = error) %>%
mutate(min_CV_error = case_when(cost == svm_radial_tune$best.parameters$cost & gamma == svm_radial_tune$best.parameters$gamma ~ 1, T ~ 0)) %>%
ggplot(aes(x = cost, y = CV_error, col = factor(gamma))) +
geom_line() +
geom_point(aes(shape = factor(min_CV_error)), show.legend = F, size = 3) +
scale_shape_manual(values = c(20, 19)) +
scale_x_continuous(trans = 'log10', breaks = 10^c(-3:6), minor_breaks = NULL, labels = paste0("10^", (-3:6))) +
scale_y_continuous(labels = percent_format()) +
theme(legend.position = "bottom",
axis.text.x = ggtext::element_markdown()) +
labs(title = "Generated Data - SVM (Radial Kernel)",
subtitle = "Selecting cost & gamma parameters using cross-validation",
x = "Cost",
y = "CV Error",
col = "Gamma")
Visualizing on train
:
grid <- expand.grid(X1 = seq(min(df$X1), max(df$X1), length.out = 200), X2 = seq(min(df$X2), max(df$X2), length.out = 200))
grid$class_pred <- predict(svm_radial, grid)
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = train, aes(col = Y)) +
scale_color_manual(values = c('red', 'blue')) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
This looks much better. If I was tasked with drawing by hand what I thought the decision boundary should be, it would probably look very similar to that produced by the radial kernel SVM above.
Here’s a summary of the results:
data.frame(kernel = c("Linear", "Polynomial", "Radial"),
CV_error = c(min(svm_linear_tune$performances$error), min(svm_poly_tune$performances$error), min(svm_radial_tune$performances$error)),
test_error = c(mean(predict(svm_linear, test) != test$Y), mean(predict(svm_poly, test) != test$Y), mean(predict(svm_radial, test) != test$Y)))
## kernel CV_error test_error
## 1 Linear 0.18571429 0.2000000
## 2 Polynomial 0.14285714 0.2333333
## 3 Radial 0.01428571 0.0000000
The CV and test errors aren’t perfectly correlated, but this is to be expected when we are dealing with a total dataset size of 100 observations. The radial SVM unambiguously performed best, with the lowest CV error and a test error of 0.
Here I produce the same plot for the radial SVM, showing the model’s decision boundary and actual response values (but this time for the test dataset), where we can see that the test observations were indeed all correctly classified:
ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = test, aes(col = Y)) +
scale_color_manual(values = c('red', 'blue')) +
labs(fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
Q: Generate a data set with \(n = 500\) and \(p = 2\), such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1*(x1^2 - x2^2 > 0)
A:
I’ll stick with the example data provided by the question here.
set.seed(1)
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1*(x1^2 - x2^2 > 0)
df <- data.frame(x1, x2, y = factor(y))
Q: Plot the observations, colored according to their class labels. Your plot should display \(X_1\) on the \(x\)-axis, and \(X_2\) on the \(y\)-axis.
A:
The relationship is quite interesting and we can already foresee the issues that a linear classifier might have in separating the classes:
ggplot(df, aes(x = x1, y = x2, col = y)) +
geom_point() +
scale_x_continuous(limits = c(-0.6, 0.6)) +
scale_y_continuous(limits = c(-0.6, 0.6)) +
scale_color_manual(values = c("red", "blue")) +
theme(legend.position = "none") +
labs(x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
Q: Fit a logistic regression model to the data, using \(X_1\) and \(X_2\) as predictors.
A:
I save the model as model_logistic_linear. The model summary can be seen below.
model_logistic_linear <- glm(y ~ x1 + x2, family = "binomial")
df$pred_logistic_linear <- ifelse(predict(model_logistic_linear, type = "response") > 0.5, 1, 0)
summary(model_logistic_linear)
##
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.179 -1.139 -1.112 1.206 1.257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.087260 0.089579 -0.974 0.330
## x1 0.196199 0.316864 0.619 0.536
## x2 -0.002854 0.305712 -0.009 0.993
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.18 on 499 degrees of freedom
## Residual deviance: 691.79 on 497 degrees of freedom
## AIC: 697.79
##
## Number of Fisher Scoring iterations: 3
Neither \(X_1\) nor \(X_2\) is significant in this model.
Q: Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
A:
Fitting a logistic regression model with just \(X_1\) and \(X_2\) as predictors results in very few positive class predictions:
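(The chunk producing the counts below isn’t echoed; a one-liner along these lines, using the pred_logistic_linear column created above, would reproduce it.)
table(df$pred_logistic_linear)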
##
## 0 1
## 470 30
I produce a slightly different plot to the one the question asks for; instead of just colour-coding according to the predicted class labels, this graph shows the model’s decision boundary along with the actual class labels (as we have produced previously). I think this shows the linear decision boundary more clearly:
grid <- expand.grid(x1 = seq(-0.6, 0.6, length.out = 200), x2 = seq(-0.6, 0.6, length.out = 200))
grid$class_pred <- ifelse(predict(model_logistic_linear, newdata = grid, type = "response") > 0.5, 1, 0)
ggplot(grid, aes(x = x1, y = x2)) +
geom_tile(aes(fill = factor(class_pred)), alpha = 0.5) +
geom_point(data = df, aes(col = y)) +
scale_color_manual(values = c('red', 'blue')) +
labs(fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
It’s obvious that a linear decision boundary is not appropriate here, but as far as linear decision boundaries go this one actually makes a lot of logical sense; most straight lines through the center of the observations will hover around 50% training accuracy, but this model correctly identifies that those with a high \(X_1\) value will probably be of class \(1\):
## [1] 0.57
Q: Now fit a logistic regression model to the data using non-linear functions of \(X_1\) and \(X_2\) as predictors (e.g. \(X_1^2\), \(X_1 \times X_2\), \(\log(X_2)\), and so forth).
A:
I fit the same logistic regression model with the addition of two quadratic terms for \(X_1\) and \(X_2\) (\(X_1^2\) and \(X_2^2\)), saving the model as model_logistic_nonlinear. The model summary is again provided below.
model_logistic_nonlinear <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2), family = "binomial")
df$pred_logistic_nonlinear <- ifelse(predict(model_logistic_nonlinear, type = "response") > 0.5, 1, 0)
summary(model_logistic_nonlinear)
##
## Call:
## glm(formula = y ~ x1 + x2 + I(x1^2) + I(x2^2), family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.079e-03 -2.000e-08 -2.000e-08 2.000e-08 1.297e-03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.530 526.853 -0.020 0.984
## x1 115.895 6067.885 0.019 0.985
## x2 -1.604 4002.215 0.000 1.000
## I(x1^2) 18538.679 528515.760 0.035 0.972
## I(x2^2) -18235.099 520182.819 -0.035 0.972
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9218e+02 on 499 degrees of freedom
## Residual deviance: 4.2881e-06 on 495 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
Q: Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
A:
We have obtained an obviously non-linear decision boundary as required by the question:
grid <- expand.grid(x1 = seq(-0.6, 0.6, length.out = 200), x2 = seq(-0.6, 0.6, length.out = 200))
grid$class_pred <- ifelse(predict(model_logistic_nonlinear, newdata = grid, type = "response") > 0.5, 1, 0)
ggplot(grid, aes(x = x1, y = x2)) +
geom_tile(aes(fill = factor(class_pred)), alpha = 0.5) +
geom_point(data = df, aes(col = y)) +
scale_color_manual(values = c('red', 'blue')) +
labs(fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
Unsurprisingly these quadratic transformations of \(X_1\) and \(X_2\) allow us to perfectly separate the two classes using logistic regression. I say this is unsurprising because we generated the data and have access to the equation that generated the class labels (it is quadratic in terms of \(X_1\) and \(X_2\)).
Verifying this perfect separation by checking the accuracy:
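(The accuracy calculation isn’t echoed; presumably something along the lines of the following, using the pred_logistic_nonlinear column created above.)
mean(df$pred_logistic_nonlinear == df$y)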
## [1] 1
Did you notice the strange (large) p-values in the model summary, despite the fact that this logistic model provides perfect separation of the classes? A Google search suggests this is the Hauck-Donner effect.
Q: Fit a support vector classifier to the data with \(X_1\) and \(X_2\) as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
A:
In this case (with a linear kernel), the cost parameter defines the cost of violating the margin: a small cost means the margins are wide and many support vectors can be on the margin or violate it, while a large cost means narrow margins with few support vectors on the margin or violating it. In this particular case the support vector classifier seems to be the same regardless of the cost parameter (the result is a naive classifier that predicts 0 for all observations).
svm_linear <- svm(y ~ x1 + x2, data = df, kernel = "linear", scale = FALSE, cost = 0.001)
df$pred_svm_linear <- predict(svm_linear, df)
grid <- expand.grid(x1 = seq(-0.6, 0.6, length.out = 200), x2 = seq(-0.6, 0.6, length.out = 200))
grid$class_pred <- predict(svm_linear, grid)
ggplot(grid, aes(x = x1, y = x2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = df, aes(col = y)) +
scale_color_manual(values = c("red", "blue")) +
labs(fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
There were some exceptions to this (e.g. setting cost = 10^7 results in a warning message and a worse classifier for reasons I don’t understand). Note that I didn’t scale the data in this example since it doesn’t seem necessary (\(X_1\) and \(X_2\) are sampled from a uniform distribution over the same range).
The prediction accuracy is obviously the same as the proportion of observations of class 0 in the dataset:
## [1] 0.522
Q: Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
A:
Finally I fit an SVM with a radial kernel, using cross-validation to select cost & gamma:
set.seed(2)
svm_radial_tune <- tune(svm, y ~ x1 + x2, data = df, kernel = "radial", scale = FALSE, ranges = list(cost = c(10^c(-3:8)), gamma = 10^c(-2:2)))
svm_radial <- svm_radial_tune$best.model
df$pred_svm_nonlinear <- predict(svm_radial, df)
svm_radial_tune$performances %>%
rename(CV_error = error) %>%
mutate(min_CV_error = case_when(cost == svm_radial_tune$best.parameters$cost & gamma == svm_radial_tune$best.parameters$gamma ~ 1, T ~ 0)) %>%
ggplot(aes(x = cost, y = CV_error, col = factor(gamma))) +
geom_line() +
geom_point(aes(shape = factor(min_CV_error)), show.legend = F, size = 3) +
scale_shape_manual(values = c(20, 19)) +
scale_x_continuous(trans = 'log10', breaks = 10^c(-3:8), minor_breaks = NULL, labels = paste0("10^", c(-3:8))) +
scale_y_continuous(trans = 'log10', labels = percent_format()) +
theme(legend.position = "bottom",
axis.text.x = ggtext::element_markdown()) +
labs(title = "Generated Data - SVM (Radial Kernel)",
subtitle = "Selecting cost & gamma parameters using cross-validation",
x = "Cost",
y = "CV Error",
col = "Gamma:")
Like with non-linear logistic regression, we have achieved perfect separation of the classes:
grid <- expand.grid(x1 = seq(-0.6, 0.6, length.out = 200), x2 = seq(-0.6, 0.6, length.out = 200))
grid$class_pred <- predict(svm_radial, grid)
ggplot(grid, aes(x = x1, y = x2)) +
geom_tile(aes(fill = factor(class_pred)), alpha = 0.5) +
geom_point(data = df, aes(col = factor(y))) +
scale_color_manual(values = c("red", "blue")) +
labs(fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
Verifying this by checking the accuracy:
## [1] 1
Q: Comment on your results.
A:
The thing to appreciate here is the ability of an SVM with a radial kernel to automatically model the non-linear relationship, without requiring anything more than selection of the cost and gamma parameters through cross-validation. Logistic regression can clearly achieve the same performance here, but only when the model is well-specified; a model without the quadratic transformations clearly performed much worse, and a logistic model that was more complex but with the ‘wrong’ transformations (e.g. adding an interaction term) performed equally poorly.
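As a rough illustration of that last point (this fit is my addition, not part of the original analysis, and model_logistic_interaction is an illustrative name):
model_logistic_interaction <- glm(y ~ x1 * x2, family = "binomial") # interaction term, but no quadratic terms
mean(ifelse(predict(model_logistic_interaction, type = "response") > 0.5, 1, 0) == y) # training accuracy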
This is illustrated on generated data, but anecdotally I have found similar things in work projects and when working with real data. Often a well-specified linear or logistic regression model, with appropriate interaction terms and transformations of the response/predictors, can perform competitively with more flexible methods like SVMs, smoothing splines and tree ensembles. However, these flexible methods will tend to perform best with less effort (aside from tuning hyper-parameters), at the cost of interpretability.
At the end of Section 9.6.1, it is claimed that in the case of data that is just barely linearly separable, a support vector classifier with a small value of cost that misclassifies a couple of training observations may perform better on test data than one with a huge value of cost that does not misclassify any training observations. You will now investigate this claim.
Q: Generate two-class data with \(p\) = 2 in such a way that the classes are just barely linearly separable.
A:
Here’s the initial training dataset I generated, with the true decision boundary given by the line \(X_2 = 2X_1 + 0.3\):
set.seed(1)
X1 <- c(rnorm(50, mean = -1, sd = 0.5), rnorm(50, mean = 1, sd = 0.5))
X2 <- runif(100)
Y <- ifelse(X2 > 2*X1 + 0.3, -1, 1)
train <- data.frame(X1, X2, Y = factor(Y))
ggplot(train) +
geom_point(aes(x = X1, y = X2, col = factor(Y))) +
geom_abline(aes(intercept = 0.3, slope = 2, col = "True Decision Boundary"), size = 1) +
scale_color_manual(values = c('red', 'blue', 'black'),
guide = guide_legend(override.aes = list(pch = c(16, 16, NA), linetype = c(0, 0, 1)))) +
theme(legend.position = "bottom") +
labs(title = "Training Data",
col = "Y:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
I introduce a single outlier to this training dataset, ensuring that the data is still (just barely) linearly separable:
outlier <- data.frame(X1 = -0.15, X2 = 0.75, Y = factor(1, levels = c("-1", "1")))
train[100, ] <- outlier
ggplot(train) +
geom_point(aes(x = X1, y = X2, col = factor(Y))) +
geom_abline(aes(intercept = 0.3, slope = 2, col = "True Decision Boundary"), size = 1) +
scale_color_manual(values = c('red', 'blue', 'black'),
guide = guide_legend(override.aes = list(pch = c(16, 16, NA), linetype = c(0, 0, 1)))) +
geom_abline(intercept = -8.3, slope = -50, linetype = 'dashed', col = 'grey40') +
theme(legend.position = "bottom") +
labs(title = "Training Data",
col = "Y:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
The theory is that a classifier with a high value for cost will not want to classify this observation incorrectly, resulting in a support vector classifier with a decision boundary close to the dashed line. A smaller value of cost, which accepts some training misclassifications, might produce a better test error.
cost - Training & CV Errors
Q: Compute the cross-validation error rates for support vector classifiers with a range of cost values. How many training errors are misclassified for each value of cost considered, and how does this relate to the cross-validation errors obtained?
A:
With this small dataset there was a huge amount of variability in the selected cost parameter when using cross-validation. I would usually use repeated cross-validation in caret, but the SVM implementations there seem to be focused around the kernlab package, which seems to use slightly different tuning parameters for its methods. I implement my own repeated cross-validation to select cost below (sticking to the e1071 package):
# 3 repeats of 10-fold cross-validation:
power_range <- seq(-2.6, 3, 0.2)
cost_range <- 10^power_range
number <- 10
repeats <- 3
cv_matrix <- matrix(nrow = length(cost_range), ncol = repeats)
set.seed(1)
for (i in 1:repeats) {
svm_linear_tune <- tune(svm, Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
cv_matrix[ ,i] <- svm_linear_tune$performances$error
}
data.frame(cost = cost_range, CV_error = rowMeans(cv_matrix)) %>%
mutate(min_CV_error = as.numeric(CV_error == min(CV_error))) %>%
ggplot(aes(x = cost, y = CV_error)) +
geom_line(col = "deepskyblue3") +
geom_point(size = 2, aes(col = factor(min_CV_error))) +
scale_x_continuous(trans = 'log10', breaks = cost_range[seq(1, length(cost_range), 2)], labels = paste0("10^", round(power_range[seq(1, length(power_range), 2)], 1))) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
coord_cartesian(ylim = c(0, 0.1)) +
theme(legend.position = "none",
axis.text.x = ggtext::element_markdown()) +
labs(title = "Generated Data - SVM (Linear Kernel)",
subtitle = "Selecting cost parameter using cross-validation",
x = "Cost",
y = "CV Error")
Below I calculate the training error rate for this same sequence of cost parameters, overlaying this onto the graph above:
train_error <- c()
for (i in 1:length(cost_range)) {
svm_linear <- svm(Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, cost = cost_range[i])
train_error[i] <- sum(predict(svm_linear, train) != train$Y) / nrow(train)
}
data.frame(cost = cost_range, CV_error = rowMeans(cv_matrix), train_error) %>%
pivot_longer(cols = c("CV_error", "train_error")) %>%
mutate(name = factor(name, ordered = T, levels = c("train_error", "CV_error"))) %>%
ggplot(aes(x = cost, y = value, group = name, col = name)) +
geom_line() +
geom_point(size = 2) +
scale_x_continuous(trans = 'log10', breaks = cost_range[seq(1, length(cost_range), 2)], labels = paste0("10^", round(power_range[seq(1, length(power_range), 2)], 1))) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c("mediumseagreen", "deepskyblue3"), labels = c("Training", "Cross-Validation")) +
coord_cartesian(ylim = c(0, 0.1)) +
theme(legend.position = "bottom",
axis.text.x = ggtext::element_markdown()) +
labs(title = "Generated Data - SVM (Linear Kernel)",
subtitle = "Training & cross-validation errors",
x = "Cost",
y = "Error",
col = "Error:")
We can observe that higher values of cost decrease the training error, with the training error eventually reaching zero. This makes sense, as a large cost results in a narrow margin with few support vectors on the margin or violating it, so eventually (for linearly separable data such as this) the cost of a violation will be sufficiently high that all training observations will be correctly classified.
It is also worth noting that the cost parameters associated with the lowest (zero) training errors don’t have the lowest cross-validation error.
cost - Test Error
Q: Generate an appropriate test data set, and compute the test errors corresponding to each of the values of cost considered. Which value of cost leads to the fewest test errors, and how does this compare to the values of cost that yield the fewest training errors and the fewest cross-validation errors?
A:
I generate a test dataset with the same number of rows as train (100), using the same principles and decision boundary:
set.seed(2)
X1 <- c(rnorm(50, mean = -1, sd = 0.5), rnorm(50, mean = 1, sd = 0.5))
X2 <- runif(100)
Y <- ifelse(X2 > 2*X1 + 0.3, -1, 1)
test <- data.frame(X1, X2, Y = factor(Y))
ggplot(test) +
geom_point(aes(x = X1, y = X2, col = factor(Y))) +
geom_abline(aes(intercept = 0.3, slope = 2, col = "True Decision Boundary"), size = 1) +
scale_color_manual(values = c('red', 'blue', 'black'),
guide = guide_legend(override.aes = list(pch = c(16, 16, NA), linetype = c(0, 0, 1)))) +
theme(legend.position = "bottom") +
labs(title = "Test Data",
col = "Y:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
Here I calculate the test errors for each model in the previous section, overlaying it on the same graph again:
test_error <- c()
for (i in 1:length(cost_range)) {
svm_linear <- svm(Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, cost = cost_range[i])
test_error[i] <- sum(predict(svm_linear, test) != test$Y) / nrow(test)
}
train_test_cv_df <- data.frame(cost = cost_range, power = power_range, CV_error = rowMeans(cv_matrix), train_error, test_error)
train_test_cv_df %>%
pivot_longer(cols = c("CV_error", "train_error", "test_error")) %>%
mutate(name = factor(name, ordered = T, levels = c("train_error", "CV_error", "test_error"))) %>%
ggplot(aes(x = cost, y = value, group = name, col = name)) +
geom_line() +
geom_point(size = 2) +
scale_x_continuous(trans = 'log10', breaks = cost_range[seq(1, length(cost_range), 2)], labels = paste0("10^", round(power_range[seq(1, length(power_range), 2)], 1))) +
scale_y_continuous(labels = percent_format()) +
coord_cartesian(ylim = c(0, 0.1)) +
scale_color_manual(values = c("mediumseagreen", "deepskyblue3", "#BB8FCE"), labels = c("Training", "Cross-Validation", "Test")) +
theme(legend.position = "bottom",
axis.text.x = ggtext::element_markdown()) +
labs(title = "Generated Data - SVM (Linear Kernel)",
subtitle = "Training, cross-validation & test errors",
x = "Cost",
y = "Error",
col = "Error:")
We can see that the model with the lowest training error performed worse on the test dataset than any of those with the lowest cross-validation errors. Looking at the cross-validation error, taking cost = \(10^{-0.6}\) (the middle value of cost for several models with identical CV errors - a logical choice) produces a model which had the lowest error on the test dataset.
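To see this numerically, we can pull the row(s) of the train_test_cv_df summary created above that achieve the lowest test error (a quick check I’ve added, not part of the original write-up):
train_test_cv_df %>%
filter(test_error == min(test_error)) %>%
select(power, CV_error, train_error, test_error)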
Q: Comment on your results.
A:
Going back to the question:
… it is claimed that in the case of data that is just barely linearly separable, a support vector classifier with a small value of cost that misclassifies a couple of training observations may perform better on test data than one with a huge value of cost that does not misclassify any training observations.
It seems my example has shown this successfully. Let’s take a look at the decision boundaries of some of these support vector classifiers to drive home this point.
First, let’s visualize the classification using a cost value that minimizes the training error (cost = \(10^{2.2}\)):
svm_linear <- svm(Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, cost = 10^2.2)
grid <- expand.grid(X1 = seq(-2.3, 2.3, length.out = 200), X2 = seq(0, 1, length.out = 200))
grid$class_pred <- predict(svm_linear, grid)
g1 <- ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = train, aes(col = Y)) +
geom_abline(intercept = 0.3, slope = 2, size = 1) +
scale_color_manual(values = c("red", "blue")) +
labs(title = "Training Data",
subtitle = "SVM optimizing training error: cost = 10^2.2",
fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
g2 <- ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = test, aes(col = Y)) +
geom_abline(intercept = 0.3, slope = 2, size = 1) +
scale_color_manual(values = c("red", "blue")) +
labs(title = "Test Data",
subtitle = "SVM optimizing training error: cost = 10^2.2",
fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
grid.arrange(g1, g2, nrow = 2)
Observe how every observation in the train dataset is correctly classified, but 7 are misclassified on the test dataset (hence the test error of 7%).
Now let’s visualize the classification using a cost value that minimizes the cross-validation error (cost = \(10^{-0.6}\)):
svm_linear <- svm(Y ~ X1 + X2, data = train, kernel = "linear", scale = FALSE, cost = 10^-0.6)
grid <- expand.grid(X1 = seq(-2.3, 2.3, length.out = 200), X2 = seq(0, 1, length.out = 200))
grid$class_pred <- predict(svm_linear, grid)
g1 <- ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = train, aes(col = Y)) +
geom_abline(intercept = 0.3, slope = 2, size = 1) +
scale_color_manual(values = c("red", "blue")) +
labs(title = "Training Data",
subtitle = "SVM optimizing cross-validation error: cost = 10^-0.6",
fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
g2 <- ggplot(grid, aes(x = X1, y = X2)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = test, aes(col = Y)) +
geom_abline(intercept = 0.3, slope = 2, size = 1) +
scale_color_manual(values = c("red", "blue")) +
labs(title = "Test Data",
subtitle = "SVM optimizing cross-validation error: cost = 10^-0.6",
fill = "Predicted:",
shape = "Support Vector:",
col = "Actual:",
x = TeX(r'($X_1$)'),
y = TeX(r'($X_2$)'))
grid.arrange(g1, g2, nrow = 2)
Optimizing for cross-validation error meant accepting one training misclassification, but resulted in far fewer test errors (2, so a test error of 2%).
Notice how the second model also has a decision boundary closer to the true decision boundary. The model selected using cross-validation is not perfect, but it’s a significant improvement given the limited data (and the fact that I introduced an intentionally difficult outlier to throw the model off!).
Auto Dataset (Linear/Polynomial/Radial Kernel SVMs)
In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
mpg01
Q: Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
A:
I call this variable mpg01 (I created this variable before in the chapter 4 solutions):
Auto <- ISLR::Auto
### same pre-processing used in previous chapters:
Auto$brand <- sapply(strsplit(as.character(Auto$name), split = " "),
function(x) x[1]) # extract the first item from each list element
Auto$brand <- factor(ifelse(Auto$brand %in% c("vokswagen", "vw"), "volkswagen",
ifelse(Auto$brand == "toyouta", "toyota",
ifelse(Auto$brand %in% c("chevroelt", "chevy"), "chevrolet",
ifelse(Auto$brand == "maxda", "mazda",
Auto$brand))))) # fixing typos
Auto$brand <- fct_lump(Auto$brand, n = 9, other_level = "uncommon") # collapse into 10 categories
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
### same pre-processing used in previous chapters^
Auto$mpg01 <- factor(as.numeric(Auto$mpg > median(Auto$mpg)))
Auto <- Auto %>%
select(-c(mpg, name)) %>% # removing unneeded variables
select(mpg01, everything())
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg01 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin <fct> American, American, American, American, American, Amer...
## $ brand <fct> chevrolet, buick, plymouth, amc, ford, ford, chevrolet...
I’ve removed mpg (it wouldn’t make much sense to use it as a predictor when predicting mpg01).
Q: Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.
A:
Linear Kernel:
I fit a wide range of support vector classifiers with cost values on a \(\log_{10}\) scale. I again use repeated cross-validation to provide an out-of-sample error estimate:
# 5 repeats of 10-fold cross-validation:
power_range <- seq(-3, 3, 0.25)
cost_range <- 10^power_range
number <- 10
repeats <- 5
cv_matrix <- matrix(nrow = length(cost_range), ncol = repeats)
set.seed(202)
for (i in 1:repeats) {
svm_linear_tune <- tune(svm, mpg01 ~ ., data = Auto, kernel = "linear", scale = TRUE, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
cv_matrix[ ,i] <- svm_linear_tune$performances$error
}
svm_linear_df <- data.frame(cost = cost_range, CV_error = rowMeans(cv_matrix)) %>%
mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))
ggplot(svm_linear_df, aes(x = cost, y = CV_error)) +
geom_line(col = "grey55") +
geom_point(size = 2, aes(col = factor(min_CV_error))) +
scale_x_continuous(trans = 'log10', breaks = cost_range[seq(1, length(cost_range), 2)], labels = paste0("10^", power_range[seq(1, length(power_range), 2)])) +
scale_y_continuous(labels = percent_format(), breaks = seq(0.08, 0.2, 0.01)) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none",
axis.text.x = ggtext::element_markdown()) +
labs(title = "Auto Dataset - SVM (Linear Kernel)",
subtitle = "Selecting cost parameter using cross-validation",
x = "Cost",
y = "CV Error")
Here is the cost parameter and cross-validation error associated with the best-performing model:
## cost CV_error
## 1 0.01 0.08615385
Q: Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.
A:
Polynomial Kernel:
The book shows tuning degree for an SVM with a polynomial kernel, but you can actually tune both degree and cost together. I test different combinations of these hyper-parameters, using repeated cross-validation to select the best model:
power_range <- seq(1, 7, 0.5)
cost_range <- 10^power_range
degree_range <- 2:7
number <- 10
repeats <- 2
cv_matrix <- matrix(nrow = length(cost_range)*length(degree_range), ncol = repeats)
set.seed(151)
for (i in 1:repeats) {
svm_poly_tune <- tune(svm, mpg01 ~ ., data = Auto, kernel = "polynomial", scale = TRUE, ranges = list(degree = degree_range, cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
cv_matrix[ ,i] <- svm_poly_tune$performances$error
}
svm_poly_df <- cbind(svm_poly_tune$performances[ ,c("degree", "cost")], CV_error = rowMeans(cv_matrix)) %>%
mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))
ggplot(svm_poly_df, aes(x = cost, y = CV_error, col = factor(degree))) +
geom_line() +
geom_point(aes(shape = factor(min_CV_error)), show.legend = F, size = 3) +
scale_shape_manual(values = c(20, 19)) +
scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", power_range)) +
scale_y_continuous(labels = percent_format()) +
coord_cartesian(ylim = c(0.05, 0.3)) +
theme(axis.text.x = ggtext::element_markdown(),
legend.position = "bottom") +
labs(title = "Auto Dataset - SVM (Polynomial Kernel)",
subtitle = "Selecting cost & degree parameters using cross-validation",
x = "Cost",
y = "CV Error",
col = "Degree")
The selected model outperforms the linear kernel model:
## degree cost CV_error
## 1 3 316.2278 0.07137821
Radial Kernel:
I test different combinations of gamma and cost, using repeated cross-validation to select the best model:
cost_power_range <- seq(-2, 5, 0.5)
cost_range <- 10^cost_power_range
gamma_power_range <- -4:1
gamma_range <- 10^gamma_power_range
number <- 10
repeats <- 3
cv_matrix <- matrix(nrow = length(cost_power_range)*length(gamma_power_range), ncol = repeats)
set.seed(666)
for (i in 1:repeats) {
svm_radial_tune <- tune(svm, mpg01 ~ ., data = Auto, kernel = "radial", scale = TRUE, ranges = list(gamma = gamma_range, cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
cv_matrix[ ,i] <- svm_radial_tune$performances$error
}
svm_radial_df <- cbind(svm_radial_tune$performances[ ,c("gamma", "cost")], CV_error = rowMeans(cv_matrix)) %>%
mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))
ggplot(svm_radial_df, aes(x = cost, y = CV_error, col = factor(gamma))) +
geom_line() +
geom_point(aes(shape = factor(min_CV_error)), show.legend = F, size = 3) +
scale_shape_manual(values = c(20, 19)) +
scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", cost_power_range)) +
scale_y_continuous(labels = percent_format()) +
scale_color_discrete(labels = paste0("10^", gamma_power_range)) +
coord_cartesian(ylim = c(0.05, 0.3)) +
theme(axis.text.x = ggtext::element_markdown(),
legend.text = ggtext::element_markdown(),
legend.position = "bottom") +
labs(title = "Auto Dataset - SVM (Radial Kernel)",
subtitle = "Selecting cost & gamma parameters using cross-validation",
x = "Cost",
y = "CV Error",
col = "Gamma")
The selected model has the best performance so far:
## gamma cost CV_error
## 1 0.1 31.62278 0.07055556
Q: Make some plots to back up your assertions in (b) and (c). Hint: In the lab, we used the plot() function for svm objects only in cases with p = 2. When p > 2, you can use the plot() function to create plots displaying pairs of variables at a time. Essentially, instead of typing
plot(svmfit, dat)
where svmfit contains your fitted model and dat is a data frame containing your data, you can type
plot(svmfit, dat, x1 ~ x4)
in order to plot just the first and fourth variables. However, you must replace x1 and x4 with the correct variable names. To find out more, type ?plot.svm
A:
I create svm_radial, which is a fit of the best-performing model on the full Auto dataset - an SVM with radial kernel, gamma = 0.1 and cost = 10^1.5:
svm_radial <- svm(mpg01 ~ ., data = Auto, kernel = "radial", scale = T, gamma = 0.1, cost = 10^1.5)
I will attempt to reproduce these plots in ggplot2, as I feel like the default plot.svm output is not great.
First, let's look at the default output from plotting svm_radial. Note that we must choose two variables to plot or an error is returned; I chose weight and horsepower:
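The call producing that default plot isn't shown in the text; presumably it was something like the following (the svSymbol variant is my addition, anticipating the point below):
plot(svm_radial, Auto, weight ~ horsepower)
# variant: draw the support vectors with 'o' rather than crosses
# plot(svm_radial, Auto, weight ~ horsepower, svSymbol = 'o')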
All the plot above ends up doing is plotting weight against horsepower (with colour-coding for mpg01, and crosses for the support vectors, which we can suppress by setting the argument svSymbol = 'o'). In fact, if you didn't care about identifying which points were support vectors, you could produce the plot with just the underlying data (no need for the SVM).
It's easy to produce a nicer equivalent in ggplot2:
svm_radial_plot_df <- Auto %>%
select(mpg01, weight, horsepower) %>%
mutate(support = case_when(row_number() %in% svm_radial$index ~ "Yes", T ~ "No"))
ggplot(svm_radial_plot_df, aes(x = horsepower, y = weight, col = mpg01, shape = support)) +
geom_point(size = 2) +
scale_color_manual(values = c("red", "blue")) +
scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) +
scale_y_continuous(labels = comma_format()) +
labs(shape = "Support Vector:",
col = "Actual mpg01:")
Note that the contours can't really be shown because the data is too high-dimensional. By this I mean that we can't show accurate decision boundaries (where the SVM predicts high/low mpg) in 2D space, because the prediction depends on far more than 2 variables.
We can get around this by fixing the other predictors at a certain value with the slice argument. I try fixing the numeric predictors at their median value and the categorical predictors at their modal category:
Mode <- function(x) {
# returns the modal (most frequent) value of x
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
plot(svm_radial,
Auto,
weight ~ horsepower,
slice = list(cylinders = median(Auto$cylinders),
displacement = median(Auto$displacement),
acceleration = median(Auto$acceleration),
year = median(Auto$year),
origin = Mode(Auto$origin),
brand = Mode(Auto$brand)))
Again, reproducing this in ggplot2:
grid <- expand.grid(horsepower = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 200),
weight = seq(min(Auto$weight), max(Auto$weight), length.out = 200),
cylinders = median(Auto$cylinders),
displacement = median(Auto$displacement),
acceleration = median(Auto$acceleration),
year = median(Auto$year),
origin = Mode(Auto$origin), # levels(Auto$origin)[1]
brand = Mode(Auto$brand)) # levels(Auto$brand)[1]
grid$class_pred <- predict(svm_radial, grid)
ggplot(grid, aes(x = horsepower, y = weight)) +
geom_tile(aes(fill = class_pred), alpha = 0.5) +
geom_point(data = svm_radial_plot_df, aes(col = mpg01, shape = support), size = 2) +
scale_color_manual(values = c("red", "blue")) +
scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) +
scale_y_continuous(labels = comma_format()) +
labs(shape = "Support Vector:",
fill = "Predicted mpg01:",
col = "Actual mpg01:") +
guides(fill = guide_legend(order = 1),
col = guide_legend(order = 2),
shape = guide_legend(order = 3))
This doesn't necessarily mean that the observations in the blue region had a prediction of 1 and those in the red region had a prediction of 0 (each observation's prediction depends on all 8 predictors in the dataset, to varying degrees).
This reminds me in some ways of a partial dependence plot (PDP), as we are trying to isolate the effect of two variables (horsepower and weight) on the class prediction. In a PDP we do this by 'integrating out' the other predictors, whereas here we do it by varying just these two predictors while holding all others constant.
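To make that contrast concrete, here is a minimal sketch of a crude one-dimensional partial dependence curve for horsepower, which averages predictions over the observed rows instead of fixing them at a single slice; the grid size and the assumption that mpg01 has levels "0"/"1" are mine, not part of the original analysis.
hp_grid <- seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 50)
pd_horsepower <- sapply(hp_grid, function(h) {
tmp <- Auto
tmp$horsepower <- h # overwrite horsepower for every row; keep the other predictors as observed
mean(predict(svm_radial, tmp) == "1") # proportion of rows predicted as high-mpg (assumed level "1")
})
ggplot(data.frame(horsepower = hp_grid, pd = pd_horsepower), aes(x = horsepower, y = pd)) +
geom_line(col = "deepskyblue3") +
labs(y = "Proportion predicted mpg01 = 1")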
Some insights we can get from these plots would be:
- Higher values of horsepower and weight are associated with lower values of mpg, i.e. where mpg01 = 0 (visible from the first plot - no SVM required)
- Higher values of horsepower and weight will result in the SVM predicting mpg01 = 0 (that the vehicle will have an mpg prediction below the median of 22.75)
For comparison, here's the same plot for the selected model when using a linear kernel:
svm_linear <- svm(mpg01 ~ ., data = Auto, kernel = "linear", scale = T, cost = 10^-2)
svm_linear_plot_df <- Auto %>%
select(mpg01, weight, horsepower) %>%
mutate(support = case_when(row_number() %in% svm_linear$index ~ "Yes", T ~ "No"))
grid$class_pred_linear <- predict(svm_linear, grid)
ggplot(grid, aes(x = horsepower, y = weight)) +
geom_tile(aes(fill = class_pred_linear), alpha = 0.5) +
geom_point(data = svm_linear_plot_df, aes(col = mpg01, shape = support), size = 2) +
scale_color_manual(values = c("red", "blue")) +
scale_shape_manual(values = c('Yes' = 16, 'No' = 1)) +
scale_y_continuous(labels = comma_format()) +
labs(shape = "Support Vector:",
fill = "Predicted mpg01:",
col = "Actual mpg01:") +
guides(fill = guide_legend(order = 1),
col = guide_legend(order = 2),
shape = guide_legend(order = 3))
OJ Dataset (Linear/Polynomial/Radial Kernel SVMs)
This problem involves the OJ data set, which is part of the ISLR package.
train/test Split
Q: Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
A:
I create train and test below. Note that I use the same split that I used in my Chapter 8 solutions, for comparison purposes.
## Rows: 1,070
## Columns: 18
## $ Purchase <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <dbl> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID <dbl> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7 <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...
set.seed(5)
train_index <- sample(1:nrow(OJ), 800)
train <- OJ[train_index, ]
test <- OJ[-train_index, ]
Q: Fit a support vector classifier to the training data using cost = 0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
A:
The summary() output isn't particularly detailed:
svm_linear <- svm(Purchase ~ ., data = train, kernel = "linear", scale = T, cost = 0.01)
summary(svm_linear)
##
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01,
## scale = T)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 441
##
## ( 219 222 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
According to the documentation, C-classification is the default type for svm() when the response is a factor.
There does appear to be quite a large number of support vectors (441) relative to the number of training observations (800).
Note that the printed output is a bit deceptive: "( 219 222 )" does not correspond to the Purchase levels (CH MM) printed below it. Here are the classes of the 441 support vectors to show what I mean:
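The generating code isn't shown; presumably it uses the index component of the fitted model, something along the lines of:
table(train$Purchase[svm_linear$index]) # classes of the 441 support vectors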
##
## CH MM
## 222 219
Really, it would make more sense if "( 219 222 )" instead said "( 222 219 )", so these lined up.
train/test Error Rates
Q: What are the training and test error rates?
A:
Here are the training and test error rates respectively:
data.frame(train_error = mean(predict(svm_linear, train) != train$Purchase),
test_error = mean(predict(svm_linear, test) != test$Purchase))
## train_error test_error
## 1 0.16625 0.1666667
cost
Q: Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
A:
I implement repeated cross-validation again, but even with this, the selected model seems to depend largely on the seed used (probably because, looking at the y-axis scale, varying cost over this range doesn't make a huge difference to the error estimate):
power_range <- seq(-2, 1, 0.2)
cost_range <- 10^power_range
number <- 10
repeats <- 3
cv_matrix <- matrix(nrow = length(cost_range), ncol = repeats)
set.seed(10)
for (i in 1:repeats) {
svm_linear_tune <- tune(svm, Purchase ~ ., data = train, kernel = "linear", scale = T, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
cv_matrix[ ,i] <- svm_linear_tune$performances$error
}
svm_linear_df <- data.frame(cost = cost_range, CV_error = rowMeans(cv_matrix)) %>%
mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))
ggplot(svm_linear_df, aes(x = cost, y = CV_error)) +
geom_line(col = "deepskyblue3") +
geom_point(size = 2, aes(col = factor(min_CV_error))) +
scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", power_range)) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none",
axis.text.x = ggtext::element_markdown()) +
labs(title = "OJ Data - SVM (Linear Kernel)",
subtitle = "Selecting cost parameter using cross-validation",
x = "Cost",
y = "CV Error")
Here is the cost parameter and cross-validation error associated with the best-performing model:
## cost CV_error
## 1 0.1584893 0.1704167
train/test Error Rates
Q: Compute the training and test error rates using this new value for cost.
A:
The test error has improved slightly:
svm_linear <- svm(Purchase ~ ., data = train, kernel = "linear", scale = T, cost = 10^-0.8)
data.frame(train_error = mean(predict(svm_linear, train) != train$Purchase),
test_error = mean(predict(svm_linear, test) != test$Purchase))
## train_error test_error
## 1 0.16375 0.1555556
Q: Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
A:
I test different values of cost (leaving gamma at its default value), using repeated cross-validation to select the best model:
cost_power_range <- seq(-2, 1, 0.2)
cost_range <- 10^cost_power_range
number <- 10
repeats <- 3
cv_matrix <- matrix(nrow = length(cost_power_range), ncol = repeats)
set.seed(360)
for (i in 1:repeats) {
svm_radial_tune <- tune(svm, Purchase ~ ., data = train, kernel = "radial", scale = T, ranges = list(cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
cv_matrix[ ,i] <- svm_radial_tune$performances$error
}
svm_radial_df <- data.frame(cost = svm_radial_tune$performances$cost, CV_error = rowMeans(cv_matrix)) %>%
mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))
ggplot(svm_radial_df, aes(x = cost, y = CV_error)) +
geom_line(col = "deepskyblue3") +
geom_point(size = 2, aes(col = factor(min_CV_error))) +
scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", cost_power_range)) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none",
axis.text.x = ggtext::element_markdown()) +
labs(title = "OJ Data - SVM (Radial Kernel)",
subtitle = "Selecting cost parameter using cross-validation",
x = "Cost",
y = "CV Error")
Below is the repeated cross-validation error, training error and test error for cost = 0.01 and for the optimal value over the range tested (cost = 10^-0.2):
svm_radial_1 <- svm(Purchase ~ ., data = train, kernel = "radial", scale = T, cost = 0.01)
svm_radial_2 <- svm(Purchase ~ ., data = train, kernel = "radial", scale = T, cost = 10^-0.2)
svm_radial_df %>% filter(min_CV_error == 1 | cost == 0.01) %>% select(-min_CV_error) %>%
cbind(data.frame(train_error = c(mean(predict(svm_radial_1, train) != train$Purchase),
mean(predict(svm_radial_2, train) != train$Purchase)),
test_error = c(mean(predict(svm_radial_1, test) != test$Purchase),
mean(predict(svm_radial_2, test) != test$Purchase))))
## cost CV_error train_error test_error
## 1 0.0100000 0.3875000 0.38750 0.3962963
## 2 0.6309573 0.1720833 0.15375 0.1629630
Q: Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree = 2.
A:
I test different values of cost (leaving degree fixed at 2), using repeated cross-validation to select the best model:
cost_power_range <- seq(-2, 1, 0.2)
cost_range <- 10^cost_power_range
number <- 10
repeats <- 3
cv_matrix <- matrix(nrow = length(cost_power_range), ncol = repeats)
set.seed(720)
for (i in 1:repeats) {
svm_polynomial_tune <- tune(svm, Purchase ~ ., data = train, kernel = "polynomial", scale = T, ranges = list(degree = 2, cost = cost_range), tunecontrol = tune.control(sampling = "cross", cross = number))
cv_matrix[ ,i] <- svm_polynomial_tune$performances$error
}
svm_polynomial_df <- data.frame(cost = svm_polynomial_tune$performances$cost, CV_error = rowMeans(cv_matrix)) %>%
mutate(min_CV_error = as.numeric(CV_error == min(CV_error)))
ggplot(svm_polynomial_df, aes(x = cost, y = CV_error)) +
geom_line(col = "deepskyblue3") +
geom_point(size = 2, aes(col = factor(min_CV_error))) +
scale_x_continuous(trans = 'log10', breaks = cost_range, minor_breaks = NULL, labels = paste0("10^", cost_power_range)) +
scale_y_continuous(labels = percent_format()) +
scale_color_manual(values = c("deepskyblue3", "green")) +
theme(legend.position = "none",
axis.text.x = ggtext::element_markdown()) +
labs(title = "OJ Data - SVM (Polynomial Kernel)",
subtitle = "Selecting cost parameter using cross-validation",
x = "Cost",
y = "CV Error")
Below is the repeated cross-validation error, training error and test error for cost = 0.01, alongside the optimal value over the range tested (cost = 10^1):
svm_polynomial_1 <- svm(Purchase ~ ., data = train, kernel = "polynomial", scale = T, cost = 0.01, degree = 2)
svm_polynomial_2 <- svm(Purchase ~ ., data = train, kernel = "polynomial", scale = T, cost = 10, degree = 2)
svm_polynomial_df %>% filter(cost == 10 | cost == 0.01) %>% select(-min_CV_error) %>%
cbind(data.frame(train_error = c(mean(predict(svm_polynomial_1, train) != train$Purchase),
mean(predict(svm_polynomial_2, train) != train$Purchase)),
test_error = c(mean(predict(svm_polynomial_1, test) != test$Purchase),
mean(predict(svm_polynomial_2, test) != test$Purchase))))
## cost CV_error train_error test_error
## 1 0.01 0.3666667 0.36125 0.3925926
## 2 10.00 0.1725000 0.14125 0.1666667
I also tested beyond the cost range provided by the question ('0.01 to 10'), since it appears that increasing cost further (for degree = 2) might continue decreasing the cross-validation error, but this decrease was marginal (the CV error remained above 0.17).
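That wider search isn't included here; a sketch of what it might look like, reusing the tuning pattern above (the exact range and seed are my assumptions, not the original code):
wide_cost_range <- 10^seq(-2, 3, 0.5) # extends past the question's upper limit of 10
set.seed(720)
svm_polynomial_wide <- tune(svm, Purchase ~ ., data = train, kernel = "polynomial", scale = T, ranges = list(degree = 2, cost = wide_cost_range), tunecontrol = tune.control(sampling = "cross", cross = 10))
svm_polynomial_wide$performances[, c("cost", "error")]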
Q: Overall, which approach seems to give the best results on this data?
A:
Interestingly, the linear, polynomial and radial kernel SVMs all performed fairly similarly. Based on the selected models from the three approaches, we would favor the linear kernel:
data.frame(kernel = c("linear", "radial", "polynomial"),
CV_error = c(min(svm_linear_df$CV_error),
min(svm_radial_df$CV_error),
min(svm_polynomial_df$CV_error)),
test_error = c(mean(predict(svm_linear, test) != test$Purchase),
mean(predict(svm_radial_2, test) != test$Purchase),
mean(predict(svm_polynomial_2, test) != test$Purchase)))
## kernel CV_error test_error
## 1 linear 0.1704167 0.1555556
## 2 radial 0.1720833 0.1629630
## 3 polynomial 0.1725000 0.1666667
Note that using identical cross-validation folds for each model would have provided the fairest comparison, although this is slightly less important with repeated CV.
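A minimal sketch of what that fairer comparison could look like, scoring all three selected kernels on one shared set of 10 folds (the seed, fold assignment and helper function are my assumptions, not part of the original analysis):
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(train))) # one shared fold assignment for all kernels
cv_error <- function(...) {
mean(sapply(1:10, function(k) {
fit <- svm(Purchase ~ ., data = train[folds != k, ], scale = T, ...)
mean(predict(fit, train[folds == k, ]) != train$Purchase[folds == k])
}))
}
data.frame(kernel = c("linear", "radial", "polynomial"),
CV_error = c(cv_error(kernel = "linear", cost = 10^-0.8),
cv_error(kernel = "radial", cost = 10^-0.2),
cv_error(kernel = "polynomial", degree = 2, cost = 10)))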
It is also worth considering the fact that the question restricts us to degree = 2 for the polynomial SVM and the default value of gamma for the radial SVM, so it's possible (likely) that making use of these hyper-parameters (along with testing a broader range for cost) could have led to greater performance with the radial/polynomial SVMs.