Here’s the same presentation outline with code snippets converted to R. Each slide includes R code for generating plots, along with the necessary equations in LaTeX format where appropriate.
Motivation for SVM:
SVM Basics: Support Vector Machines (SVMs) are used in supervised learning, primarily for classification and regression. They find an optimal boundary, or “hyperplane,” that maximizes the distance from the boundary to the nearest points in each class.
Key Feature: Support vectors are the critical data points that define the boundary.
Mathematical Formulation: Given training data \(\{(x_i, y_i)\}_{i=1}^n\), where \(x_i \in \mathbb{R}^d\) and \(y_i \in \{-1, 1\}\):
\(\{(x_i, y_i)\}_{i=1}^n\): Represents a set of training data points, where \(x_i\) is a vector in \(d\)-dimensional space (\(\mathbb{R}^d\)), and \(y_i\) is the class label which can be either -1 or 1.
The SVM objective is to find the hyperplane \(w^T x + b = 0\) that maximizes the margin. This is subject to the constraint:
\[ y_i (w^T x_i + b) \geq 1, \quad \forall i \]
This can be formulated as an optimization problem:
\[ \min_{w, b} \frac{1}{2} \|w\|^2 \]
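Why minimizing \(\|w\|^2\) maximizes the margin: the training points closest to the boundary satisfy \(y_i (w^T x_i + b) = 1\), and the distance between the two margin hyperplanes \(w^T x + b = 1\) and \(w^T x + b = -1\) is
\[ \text{margin} = \frac{2}{\|w\|}, \]
so a smaller \(\|w\|\) gives a wider margin, which is exactly what the minimization above achieves.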
# Load libraries
library(ggplot2)
# Generate example data points
point1 <- c(1, 2)
point2 <- c(2, 3)
point3 <- c(3, 1)
# Create a data frame of the points
data_points <- data.frame(
  x = c(point1[1], point2[1], point3[1]),
  y = c(point1[2], point2[2], point3[2]),
  label = c("Point 1", "Point 2", "Point 3")
)
# Plot the points to visualize their locations
ggplot(data_points, aes(x = x, y = y, label = label)) +
  geom_point(size = 4, color = "blue") +
  geom_text(vjust = c(1, 0.5, -1.5), hjust = c(-1, 1.5, 0.5)) + # per-point label offsets
  labs(title = "Data Points for Kernel Function Visualization", x = "X-axis", y = "Y-axis") +
  theme_minimal()
# Calculate similarities
linear_kernel <- function(x1, x2) {
  return(sum(x1 * x2))
}
similarity12 <- linear_kernel(point1, point2)
similarity13 <- linear_kernel(point1, point3)
# Print similarities to console
cat("Similarity between Point 1 and Point 2:", similarity12, "\n")
## Similarity between Point 1 and Point 2: 8
cat("Similarity between Point 1 and Point 3:", similarity13, "\n")
## Similarity between Point 1 and Point 3: 5
The plot shows the locations of the data points, and the console output gives their pairwise similarities under the linear kernel \(K(x_i, u) = x_i^T u\); a larger value indicates that the two points are more similar under this kernel.
Optimization Problem: The linear SVM solves the following quadratic programming problem:
\[ \min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \]
Subject to: \[ y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \]
Explanation: The term \(\frac{1}{2} \|w\|^2\) keeps the margin wide, while \(C \sum_{i=1}^n \xi_i\) penalizes points that violate the margin or are misclassified. The parameter \(C\) controls how heavily this penalty is weighted (a short sketch after the plot below illustrates its effect).
Graphic: Linear SVM Classification Plot
# Load libraries
library(e1071)
library(ggplot2)
# Generate linearly separable data and shift to positive quadrant
set.seed(1)
X <- rbind(matrix(rnorm(40), ncol = 2) + 5, matrix(rnorm(40), ncol = 2) + 2) # Shifted by adding a constant to make all values positive
y <- factor(c(rep(1, 20), rep(2, 20)))
# Train linear SVM classifier
linear_svm <- svm(X, y, kernel = "linear", cost = 1)
# Identify support vectors
support_vectors <- as.data.frame(X[linear_svm$index, , drop = FALSE])
# Create a grid to plot decision boundaries
grid <- expand.grid(X1 = seq(min(X[, 1]) - 1, max(X[, 1]) + 1, length.out = 100),
                    X2 = seq(min(X[, 2]) - 1, max(X[, 2]) + 1, length.out = 100))
# Compute signed decision values on the grid and store them as a plain numeric column
pred <- predict(linear_svm, newdata = grid, decision.values = TRUE)
grid$decision <- as.vector(attr(pred, "decision.values"))
# Plot using ggplot2 with annotations
plot_df <- data.frame(X1 = X[, 1], X2 = X[, 2], Class = y)
ggplot() +
  geom_point(data = plot_df, aes(x = X1, y = X2, color = Class)) +
  geom_point(data = support_vectors, aes(x = V1, y = V2),
             shape = 21, size = 3, color = "black", fill = NA) + # Highlight support vectors
  geom_contour(data = grid, aes(x = X1, y = X2, z = decision),
               breaks = 0, color = "blue", linetype = "solid") + # Main separating hyperplane
  geom_contour(data = grid, aes(x = X1, y = X2, z = decision),
               breaks = c(-1, 1), color = "blue", linetype = "dashed") + # Margin lines at -1 and +1
  labs(title = "SVM with Separating Hyperplane, Support Vectors, and Margin",
       x = "X1", y = "X2") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black") +
  geom_vline(xintercept = 0, linetype = "solid", color = "black") +
  annotate("text", x = 2.5, y = 5.5, label = "Separating Hyperplane",
           color = "blue", angle = -30, hjust = 0)
Problem: Linear SVMs can’t handle non-linearly separable data. Here’s an example where no straight line can separate the classes perfectly.
Why Non-linearity Matters: Many real-world problems require non-linear boundaries, motivating the need for non-linear SVMs.
Mathematical Insight: In the case of non-linear data, the decision boundary is not a simple hyperplane. We need a transformation function \(\phi(x)\) that maps the input features into a higher-dimensional space where a hyperplane can separate the classes.
The transformation function \(\phi(x)\) is used to project the original data into a feature space where it becomes linearly separable.
Graphic: Non-linearly Separable Data Plot
# Load required libraries
library(mlbench)
library(ggplot2)
# Generate non-linear data
set.seed(42)
data <- mlbench.circle(100, d = 2) # Set d = 2 for two-dimensional data
X <- data$x
y <- as.factor(data$classes)
# Plot
ggplot(data = data.frame(X1 = X[, 1], X2 = X[, 2], Class = y), aes(x = X1, y = X2, color = Class)) +
  geom_point(size = 3) +
  ggtitle("Non-Linearly Separable Data")
Concept of Kernel Trick: The kernel trick projects data into a higher-dimensional space to make it separable without explicit transformations. Instead of computing \(\phi(x)\) explicitly, we use kernel functions \(K(x, x') = \langle \phi(x), \phi(x') \rangle\) to operate in the higher-dimensional space implicitly.
Common Kernels: linear \(K(x, x') = x^T x'\); polynomial \(K(x, x') = (x^T x' + c)^d\); radial basis function (RBF) \(K(x, x') = \exp(-\gamma \|x - x'\|^2)\); and sigmoid \(K(x, x') = \tanh(\kappa\, x^T x' + c)\).
Mathematical Explanation: The key idea of the kernel trick is that we never need to compute the mapping \(\phi(x)\) explicitly. Instead, we work directly with the kernel function \(K(x, x')\), which represents the dot product in the transformed feature space.
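As a quick numerical check of this identity, here is a minimal sketch using a degree-2 polynomial kernel \(K(x, x') = (x^T x')^2\) (an illustrative choice, not one of the slides' kernels) together with its explicit feature map \(\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)\); both routes give the same number, but only the kernel avoids computing \(\phi\).
# Sketch: the kernel trick as an identity, K(x, x') = <phi(x), phi(x')>
poly2_kernel <- function(x1, x2) {
  (sum(x1 * x2))^2
}
# Explicit feature map corresponding to the degree-2 polynomial kernel in 2D
phi <- function(x) {
  c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
}
x_a <- c(1, 2)
x_b <- c(2, 3)
cat("Kernel value:        ", poly2_kernel(x_a, x_b), "\n")
cat("Explicit dot product:", sum(phi(x_a) * phi(x_b)), "\n")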
Visual Explanation: Left panel shows non-separable data in 2D; right panel shows data transformed into 3D, where it becomes linearly separable.
Equation with Kernel: \[ D(u) = \beta_0 + \sum_{i=1}^n \alpha_i y_i K(x_i, u) \]
Mathematical Detail: The optimization problem for SVM with kernels becomes:
\[ \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j) \]
Subject to: \[ 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^n \alpha_i y_i = 0 \]
Explanation: The dual formulation allows us to work with the kernel function \(K(x_i, x_j)\) directly, which simplifies computation in the high-dimensional space.
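To see that the dual only ever touches the data through pairwise kernel values, here is a small sketch (a hand-rolled RBF kernel with \(\gamma = 1\), an illustrative choice) that builds the Gram matrix \(K(x_i, x_j)\) for three points; this matrix is all the optimizer needs.
# Sketch: the Gram matrix of pairwise kernel values is all the dual problem uses
rbf_kernel <- function(x1, x2, gamma = 1) {
  exp(-gamma * sum((x1 - x2)^2))
}
pts <- list(c(1, 2), c(2, 3), c(3, 1))
# Build the 3 x 3 Gram matrix K[i, j] = K(x_i, x_j); the diagonal is always 1 for RBF
K <- sapply(pts, function(xj) sapply(pts, function(xi) rbf_kernel(xi, xj)))
print(round(K, 4))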
Graphic: Original vs. Transformed Data
# Load required libraries
library(scatterplot3d)
library(mlbench)
# Generate non-linear data
set.seed(42)
data <- mlbench.circle(100, d = 2)
X <- data$x
y <- as.factor(data$classes)
# Transformation: Add Z axis as x^2 + y^2
Z <- rowSums(X^2)
# Plot in 3D
s3d <- scatterplot3d(X[, 1], X[, 2], Z, color = ifelse(y == 1, "blue", "red"),
                     pch = 16, main = "Transformed Data in 3D with Separating Hyperplane")
# Add a manually created separating plane at Z = 1
x_vals <- seq(min(X[, 1]), max(X[, 1]), length.out = 10)
y_vals <- seq(min(X[, 2]), max(X[, 2]), length.out = 10)
# Use expand.grid to create the grid of (x, y) locations for the plane
grid_points <- expand.grid(x_vals, y_vals)
# Draw the plane by plotting each grid point at a constant height of Z = 1
s3d$points3d(grid_points[, 1], grid_points[, 2], rep(1, nrow(grid_points)),
             col = "gray", pch = ".", cex = 2)
Non-linear SVM in Action: This example shows an RBF kernel capturing a non-linear boundary. Support vectors define the decision boundary’s shape.
Mathematical Insight: The RBF kernel is effective for capturing complex, non-linear relationships in data. The kernel function \[ K(x, x') = \exp(-\gamma \|x - x'\|^2) \]
allows SVM to create flexible decision boundaries that adapt to the structure of the data. Here, \(\gamma\) controls the influence of each training example, with larger values leading to more localized decision boundaries.
Graphic: Non-linear SVM Classification Plot with RBF Kernel
# Load required libraries
if (!requireNamespace("e1071", quietly = TRUE)) {
  install.packages("e1071")
}
library(e1071)
library(ggplot2)
# Fit SVM with RBF kernel (reusing the circle data X and y generated above)
svm_rbf <- svm(X, y, kernel = "radial", gamma = 1, cost = 1)
# Create a grid for plotting the decision boundary
x_min <- min(X[, 1]) - 1
x_max <- max(X[, 1]) + 1
y_min <- min(X[, 2]) - 1
y_max <- max(X[, 2]) + 1
# Generate a grid of points
grid <- expand.grid(X1 = seq(x_min, x_max, length.out = 100),
                    X2 = seq(y_min, y_max, length.out = 100))
# Predict the class for each point in the grid
grid$pred <- predict(svm_rbf, newdata = grid)
# Plot the results
ggplot() +
  geom_point(data = data.frame(X1 = X[, 1], X2 = X[, 2], Class = y), aes(x = X1, y = X2, color = Class)) +
  geom_contour(data = grid, aes(x = X1, y = X2, z = as.numeric(pred)),
               breaks = 1.5, color = "black") + # contour at the boundary between the two predicted classes
  labs(title = "Non-linear SVM Classification with RBF Kernel",
       x = "X1", y = "X2") +
  theme_minimal()
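To connect this to the \(\gamma\) value used in the fit above, a minimal sketch (a hand-rolled RBF evaluation, not part of the original slides) shows how quickly the kernel value decays for two points a unit distance apart as \(\gamma\) grows; that decay is what makes large-\(\gamma\) decision boundaries more localized.
# Sketch: larger gamma makes the RBF kernel's influence more localized
rbf_value <- function(dist, gamma) exp(-gamma * dist^2)
for (g in c(0.1, 1, 10)) {
  cat("gamma =", g, "-> kernel value at unit distance:", signif(rbf_value(1, g), 3), "\n")
}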
SVM for Regression: Support vector regression (SVR) fits a function that captures the continuous relationship between inputs and targets. Its ε-insensitive loss ignores errors smaller than a predefined tolerance ε and penalizes only deviations that fall outside it.
Mathematical Formulation: Given training data \(\{(x_i, y_i)\}_{i=1}^n\), SVR aims to find a function \(f(x) = w^T x + b\) that has at most \(\epsilon\) deviation from the actual target \(y_i\) for each training point, while also ensuring that the model is as flat as possible:
\[ \min_{w, b, \xi, \xi^*} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n (\xi_i + \xi_i^*) \]
\(\frac{1}{2} \|w\|^2\): Controls the flatness of the regression function.
\(C \sum_{i=1}^n (\xi_i + \xi_i^*)\): Penalizes deviations greater than \(\epsilon\).
Subject to: \[ y_i - (w^T x_i + b) \leq \epsilon + \xi_i, \quad (w^T x_i + b) - y_i \leq \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \geq 0 \]
These constraints ensure that the regression function fits the data within an allowable deviation of \(\epsilon\); the slack variables \(\xi_i\) and \(\xi_i^*\) account for data points that fall outside this tube.
Visualizing SVM Regression: This example shows an SVM regression with an ε-tube (light blue), where only points outside the tube (support vectors) influence the regression line.
Graphic: SVM Regression with ε-Tube
# Generate data for regression
set.seed(42)
X <- sort(runif(40, min = 0, max = 5))
y <- sin(X) + rnorm(40, sd = 0.2)
# Fit SVM regression model
svr_model <- svm(X, y, kernel = "radial", cost = 100, epsilon = 0.1)
# Prediction
y_pred <- predict(svr_model, X)
# Plot
plot(X, y, col = "darkorange", pch = 16, xlab = "Input Feature", ylab = "Target Output",
     main = "SVM Regression with ε-Tube")
lines(X, y_pred, col = "navy", lwd = 2)
lines(X, y_pred + 0.1, col = "lightblue", lty = 2) # upper edge of the ε-tube (epsilon = 0.1)
lines(X, y_pred - 0.1, col = "lightblue", lty = 2) # lower edge of the ε-tube
points(X[svr_model$index], y[svr_model$index], pch = 4, col = "red", lwd = 2) # support vectors
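A quick way to see the role of ε is to refit the same model (reusing X and y from the chunk above) with narrower and wider tubes and compare the number of support vectors; points inside the tube do not become support vectors, so a larger ε generally leaves fewer of them.
# Sketch: effect of the epsilon tube width on the number of support vectors
# (assumes X and y from the regression chunk above are still in the workspace)
for (eps in c(0.05, 0.1, 0.3)) {
  fit <- svm(X, y, kernel = "radial", cost = 100, epsilon = eps)
  cat("epsilon =", eps, "-> support vectors:", length(fit$index), "\n")
}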
This completes the Non-linear Support Vector Machines presentation using R. Each code block can be run to generate the graphics required for each concept, offering a comprehensive view of SVM in non-linear settings. Let me know if any additional details or refinements are needed!