Here’s the same presentation outline with code snippets converted to R. Each slide includes R code for generating plots, along with the necessary equations in LaTeX format where appropriate.


Slide 1: Title Slide

Non-linear Support Vector Machines

  • Presented by: Fomba Kassoh
  • Overview: Exploring the versatility of Support Vector Machines, focusing on non-linear applications in classification and regression.

Slide 2: Introduction to Support Vector Machines

  • Motivation for SVM:

    • SVM was developed to overcome the shortcomings of linear regression and perceptron-based models when dealing with complex, non-linearly separable data.
    • Linear regression struggles with classification tasks, especially when classes are not linearly separable.
    • SVM aims to find an optimal boundary (hyperplane) that maximizes the margin between classes, which helps in improving generalization performance.
  • SVM Basics: Support Vector Machines (SVMs) are used in supervised learning, primarily for classification and regression. They find an optimal boundary, or “hyperplane,” that maximizes the distance from the boundary to the nearest points in each class.

  • Key Feature: Support vectors are the critical data points that define the boundary.

  • Mathematical Formulation: Given training data \(\{(x_i, y_i)\}_{i=1}^n\), where \(x_i \in \mathbb{R}^d\) and \(y_i \in \{-1, 1\}\):

    • \(\{(x_i, y_i)\}_{i=1}^n\): Represents a set of training data points, where \(x_i\) is a vector in \(d\)-dimensional space (\(\mathbb{R}^d\)), and \(y_i\) is the class label which can be either -1 or 1.

    • The SVM objective is to find the hyperplane \(w^T x + b = 0\) that maximizes the margin. This is subject to the constraint:

    \[ y_i (w^T x_i + b) \geq 1, \quad \forall i \]

    • \(y_i (w^T x_i + b) \geq 1\): This condition ensures that each data point is correctly classified and lies on or beyond the margin on its class's side of the hyperplane. The symbol \(\forall i\) means “for all data points \(i\)”.
    • \(w\) is the weight vector that defines the orientation of the hyperplane.
    • \(b\) is the bias term, shifting the hyperplane.

    This can be formulated as an optimization problem:

    \[ \min_{w, b} \frac{1}{2} \|w\|^2 \]

    • The goal is to minimize \(\frac{1}{2} \|w\|^2\), which is equivalent to maximizing the margin \(2/\|w\|\) between the hyperplane and the closest data points.
    • The \(\frac{1}{2}\) factor is added for mathematical convenience, especially when deriving the dual formulation of the optimization problem.
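  • Worked Example: as a quick numerical check, the sketch below (with hand-picked, hypothetical values for \(w\), \(b\), and the data) evaluates \(y_i (w^T x_i + b)\) for four toy points; a value of exactly 1 means the point lies on the margin.

# Toy (hypothetical) hyperplane and data, chosen only to illustrate the margin constraint
w_toy <- c(1, 1)                                    # assumed weight vector
b_toy <- -3                                         # assumed bias term
X_toy <- rbind(c(1, 1), c(0, 1), c(4, 4), c(3, 5))  # two points per class
y_toy <- c(-1, -1, 1, 1)                            # class labels

# Margin values y_i * (w^T x_i + b); the constraint requires all of them to be >= 1
margins <- y_toy * (X_toy %*% w_toy + b_toy)
data.frame(margin = as.vector(margins), satisfied = as.vector(margins) >= 1)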

Slide 3: Linear SVM for Classification

  • Objective of Linear SVMs:
    • Find the hyperplane that maximizes the margin between two classes.
    • Importance:
      • Maximizes Confidence: Larger margin = more confident predictions.
      • Better Generalization: Reduces overfitting, leading to improved performance on unseen data.
      • Efficient: Only support vectors influence the decision boundary, making the model computationally efficient.
  • Equation: \[ D(u) = \beta_0 + \sum_{i=1}^n \alpha_i y_i K(x_i, u) \]
    • \(D(u)\): The decision function used to classify a data point \(u\).
    • \(\beta_0\): The bias term.
    • \(\alpha_i\): Lagrange multipliers, representing the influence of each training point on the decision boundary.
    • \(y_i\): Class labels, which are either -1 or 1.
    • \(K(x_i, u)\): The kernel function, which measures the similarity between the training data point \(x_i\) and the new point \(u\). For linear SVM, \(K(x_i, u) = x_i^T u\).
  • Kernel Function Visualization:
  # Load libraries
  library(ggplot2)

  # Generate example data points
  point1 <- c(1, 2)
  point2 <- c(2, 3)
  point3 <- c(3, 1)

  # Create a data frame of the points
  data_points <- data.frame(
    x = c(point1[1], point2[1], point3[1]),
    y = c(point1[2], point2[2], point3[2]),
    label = c("Point 1", "Point 2", "Point 3")
  )

  # Plot the points to visualize their locations
  ggplot(data_points, aes(x = x, y = y, label = label)) +
    geom_point(size = 4, color = "blue") +
    geom_text(vjust = c(1, 0.5, -1.5), hjust = c(-1, 1.5, 0.5)) +  # Offset each label so it does not overlap its point
    labs(title = "Data Points for Kernel Function Visualization", x = "X-axis", y = "Y-axis") +
    theme_minimal()

  # Calculate similarities
  linear_kernel <- function(x1, x2) {
    return(sum(x1 * x2))
  }
  
  similarity12 <- linear_kernel(point1, point2)
  similarity13 <- linear_kernel(point1, point3)

  # Print similarities to console
  cat("Similarity between Point 1 and Point 2:", similarity12, "\n")
## Similarity between Point 1 and Point 2: 8
  cat("Similarity between Point 1 and Point 3:", similarity13, "\n")
## Similarity between Point 1 and Point 3: 5
  • This plot shows the locations of the data points, and the console output shows the similarity between data points calculated using the linear kernel function \(K(x_i, u) = x_i^T u\). A higher value indicates more similarity between the two points.

  • Optimization Problem: The linear SVM solves the following quadratic programming problem:

    \[ \min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \]

    • This objective function aims to minimize the combination of two components:
      • \(\frac{1}{2} \|w\|^2\): Maximizes the margin between classes.
      • \(C \sum_{i=1}^n \xi_i\): A penalty for misclassified data points, where \(C\) is a regularization parameter controlling the trade-off between margin size and classification error.

    Subject to: \[ y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \]

    • \(\xi_i\): Slack variables allowing for some points to be within the margin or misclassified.
    • \(y_i (w^T x_i + b) \geq 1 - \xi_i\): This ensures that each point is either correctly classified or within a certain tolerance (\(\xi_i\)), allowing some flexibility in the classification.
  • Explanation: The objective function \(\frac{1}{2} \|w\|^2\) aims to maximize the margin, while the term \(C \sum_{i=1}^n \xi_i\) penalizes misclassifications. The parameter \(C\) determines how much weight is given to the misclassification penalty (the short sketch after the plot code below shows how the number of support vectors changes with \(C\)).

  • Graphic: Linear SVM Classification Plot

# Load libraries
library(e1071)
library(ggplot2)

# Generate linearly separable data and shift to positive quadrant
set.seed(1)
X <- rbind(matrix(rnorm(40), ncol = 2) + 5, matrix(rnorm(40), ncol = 2) + 2)  # Shifted by adding a constant to make all values positive
y <- factor(c(rep(1, 20), rep(2, 20)))

# Train linear SVM classifier
linear_svm <- svm(X, y, kernel = "linear", cost = 1)

# Identify support vectors
support_vectors <- as.data.frame(X[linear_svm$index, , drop = FALSE])

# Create a grid to plot decision boundaries
grid <- expand.grid(X1 = seq(min(X[,1]) - 1, max(X[,1]) + 1, length.out = 100),
                    X2 = seq(min(X[,2]) - 1, max(X[,2]) + 1, length.out = 100))
pred <- predict(linear_svm, newdata = grid, decision.values = TRUE)
grid$decision <- as.vector(attr(pred, "decision.values"))

# Put the training data into a data frame for ggplot
train_df <- data.frame(X1 = X[,1], X2 = X[,2], Class = y)

# Plot using ggplot2 with annotations
ggplot() +
  geom_point(data = train_df, aes(x = X1, y = X2, color = Class)) +
  geom_point(data = support_vectors, aes(x = V1, y = V2), shape = 21, size = 3, color = "black", fill = NA) +  # Highlight support vectors
  geom_contour(data = grid, aes(x = X1, y = X2, z = decision),
               breaks = 0, color = "blue", linetype = "solid") +     # Separating hyperplane: decision value = 0
  geom_contour(data = grid, aes(x = X1, y = X2, z = decision),
               breaks = c(-1, 1), color = "blue", linetype = "dashed") + # Margin lines at decision values -1 and +1
  labs(title = "SVM with Separating Hyperplane, Support Vectors, and Margin",
       x = "X1", y = "X2") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black") +
  geom_vline(xintercept = 0, linetype = "solid", color = "black") +
  annotate("text", x = 2.5, y = 5.5, label = "Separating Hyperplane", color = "blue", angle = -30, hjust = 0)


Slide 4: Limitations of Linear SVMs

  • Problem: Linear SVMs can’t handle non-linearly separable data. Here’s an example where no straight line can separate the classes perfectly.

  • Why Non-linearity Matters: Many real-world problems require non-linear boundaries, motivating the need for non-linear SVMs.

  • Mathematical Insight: In the case of non-linear data, the decision boundary is not a simple hyperplane. We need a transformation function \(\phi(x)\) that maps the input features into a higher-dimensional space where a hyperplane can separate the classes.

    The transformation function \(\phi(x)\) is used to project the original data into a feature space where it becomes linearly separable.

  • Graphic: Non-linearly Separable Data Plot

# Load required libraries
library(mlbench)
library(ggplot2)

# Generate non-linear data
set.seed(42)
data <- mlbench.circle(100, d = 2)  # Set d = 2 for two-dimensional data
X <- data$x
y <- as.factor(data$classes)

# Plot
ggplot(data = data.frame(X1 = X[,1], X2 = X[,2], Class = y), aes(x = X1, y = X2, color = Class)) +
  geom_point(size = 3) +
  ggtitle("Non-Linearly Separable Data")


Slide 5: The Kernel Trick

  • Concept of Kernel Trick: The kernel trick projects data into a higher-dimensional space to make it separable without explicit transformations. Instead of computing \(\phi(x)\) explicitly, we use kernel functions \(K(x, x') = \langle \phi(x), \phi(x') \rangle\) to operate in the higher-dimensional space implicitly.

  • Common Kernels:

    • Polynomial: \[ K(x, u) = (\text{scale} \cdot (x^T u) + 1)^{\text{degree}} \]
      • \(K(x, u)\): The polynomial kernel function.
      • \(\text{scale}\): A scaling factor to control the contribution of the inner product.
      • \(\text{degree}\): The degree of the polynomial, which determines the complexity of the decision boundary.
    • Radial Basis Function (RBF): \[ K(x, u) = \exp(-\gamma \|x - u\|^2) \]
      • \(K(x, u)\): The RBF kernel function, commonly used for non-linear classification.
      • \(\gamma\): A parameter that controls the spread of the RBF. A higher value of \(\gamma\) creates a more complex decision boundary.
      • \(\|x - u\|^2\): The squared Euclidean distance between vectors \(x\) and \(u\), which measures similarity.
    • Hyperbolic Tangent (Sigmoid): \[ K(x, u) = \tanh(\text{scale} \cdot (x^T u) + 1) \]
      • This kernel is inspired by neural networks and is useful for certain types of data.
  • Mathematical Explanation: The key idea of the kernel trick is that we never need to compute the mapping \(\phi(x)\) explicitly. Instead, we work directly with the kernel function \(K(x, x')\), which represents the dot product in the transformed feature space.
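  • Kernels in R: a minimal sketch (parameter values are illustrative assumptions) that implements the three kernels as plain R functions and, for the degree-2 polynomial kernel, verifies \(K(x, u) = \langle \phi(x), \phi(u) \rangle\) against an explicit feature map, which is the kernel trick in miniature.

# Plain-R versions of the kernels listed above
poly_kernel    <- function(x, u, scale = 1, degree = 2) (scale * sum(x * u) + 1)^degree
rbf_kernel     <- function(x, u, gamma = 1)             exp(-gamma * sum((x - u)^2))
sigmoid_kernel <- function(x, u, scale = 1)             tanh(scale * sum(x * u) + 1)

x <- c(1, 2); u <- c(2, 1)
c(polynomial = poly_kernel(x, u), rbf = rbf_kernel(x, u), sigmoid = sigmoid_kernel(x, u))

# For scale = 1 and degree = 2, the implicit feature map is
# phi(v) = (1, sqrt(2)*v1, sqrt(2)*v2, v1^2, sqrt(2)*v1*v2, v2^2),
# so K(x, u) equals the dot product in that 6-dimensional space
phi <- function(v) c(1, sqrt(2) * v[1], sqrt(2) * v[2], v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)
all.equal(poly_kernel(x, u), sum(phi(x) * phi(u)))  # TRUE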


Slide 6: Non-linear SVM with Kernel Transformation

  • Visual Explanation: Left panel shows non-separable data in 2D; right panel shows data transformed into 3D, where it becomes linearly separable.

  • Equation with Kernel: \[ D(u) = \beta_0 + \sum_{i=1}^n \alpha_i y_i K(x_i, u) \]

    • This equation is similar to the linear version but uses a kernel function \(K(x_i, u)\), allowing SVM to create non-linear decision boundaries in the original feature space.
  • Mathematical Detail: The optimization problem for SVM with kernels becomes:

    \[ \max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j) \]

    Subject to: \[ 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^n \alpha_i y_i = 0 \]

    • The dual formulation focuses on optimizing the Lagrange multipliers \(\alpha_i\).
    • \(K(x_i, x_j)\): Kernel function that allows the optimization to take place in the transformed feature space without explicitly computing \(\phi(x)\).
  • Explanation: The dual formulation allows us to work with the kernel function \(K(x_i, x_j)\) directly, which simplifies computation in the high-dimensional space (a small Gram-matrix sketch follows the 3-D plot code below).

  • Graphic: Original vs. Transformed Data

# Load required libraries
library(scatterplot3d)
library(mlbench)

# Generate non-linear data
set.seed(42)
data <- mlbench.circle(100, d = 2)
X <- data$x
y <- as.factor(data$classes)

# Transformation: Add Z axis as x^2 + y^2
Z <- rowSums(X^2)

# Plot in 3D
s3d <- scatterplot3d(X[,1], X[,2], Z, color = ifelse(y == 1, "blue", "red"), 
                     pch = 16, main = "Transformed Data in 3D with Separating Hyperplane")

# Add a manually created separating plane at Z = 1
x_vals <- seq(min(X[,1]), max(X[,1]), length.out = 10)
y_vals <- seq(min(X[,2]), max(X[,2]), length.out = 10)

# Use expand.grid to create a grid of (x, y) points for the plane
grid_points <- expand.grid(x_vals, y_vals)

# Draw the plane by plotting each grid point at constant Z = 1
s3d$points3d(grid_points[,1], grid_points[,2], rep(1, nrow(grid_points)),
             col = "gray", pch = ".", cex = 2)


Slide 7: Example of Non-linear SVM Classification

  • Non-linear SVM in Action: This example shows an RBF kernel capturing a non-linear boundary. Support vectors define the decision boundary’s shape.

  • Mathematical Insight: The RBF kernel is effective for capturing complex, non-linear relationships in data. The kernel function \[ K(x, x') = \exp(-\gamma \|x - x'\|^2) \]

    allows SVM to create flexible decision boundaries that adapt to the structure of the data. Here, \(\gamma\) controls the influence of each training example, with larger values leading to more localized decision boundaries (compare the gamma settings in the short sketch after the plot code below).

  • Graphic: Non-linear SVM Classification Plot with RBF Kernel

# Load required libraries
if (!requireNamespace("e1071", quietly = TRUE)) {
  install.packages("e1071")
}
library(e1071)
library(ggplot2)

# Fit SVM with RBF kernel (reuses the circular X and y generated on the previous slide)
svm_rbf <- svm(X, y, kernel = "radial", gamma = 1, cost = 1)

# Create a grid for plotting the decision boundary
x_min <- min(X[, 1]) - 1
x_max <- max(X[, 1]) + 1
y_min <- min(X[, 2]) - 1
y_max <- max(X[, 2]) + 1

# Generate a grid of points
grid <- expand.grid(X1 = seq(x_min, x_max, length.out = 100),
                    X2 = seq(y_min, y_max, length.out = 100))

# Predict the class for each point in the grid
grid$pred <- predict(svm_rbf, newdata = grid)

# Plot the results
ggplot() +
  geom_point(data = data.frame(X1 = X[,1], X2 = X[,2], Class = y), aes(x = X1, y = X2, color = Class)) +
  geom_contour(data = grid, aes(x = X1, y = X2, z = as.numeric(pred)), breaks = 1.5, color = "black") +  # Single contour between predicted classes 1 and 2
  labs(title = "Non-linear SVM Classification with RBF Kernel",
       x = "X1", y = "X2") +
  theme_minimal()
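  • Effect of Gamma: a short follow-up sketch (reusing the same X and y, with assumed gamma values); larger \(\gamma\) makes the RBF kernel more local, which typically increases the number of support vectors and wraps the boundary more tightly around the data.

# Compare how the number of support vectors changes with gamma
for (g in c(0.1, 1, 10)) {
  m <- svm(X, y, kernel = "radial", gamma = g, cost = 1)
  cat("gamma =", g, "-> number of support vectors:", length(m$index), "\n")
}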


Slide 8: SVM Regression Overview

  • SVM for Regression: SVMs for regression (SVR) fit a function that captures the continuous relationship between inputs and targets. The ε-insensitive loss function ignores deviations smaller than a predefined tolerance ε and penalizes only those that fall outside it.

  • Mathematical Formulation: Given training data \(\{(x_i, y_i)\}_{i=1}^n\), SVR aims to find a function \(f(x) = w^T x + b\) that has at most \(\epsilon\) deviation from the actual target \(y_i\) for each training point, while also ensuring that the model is as flat as possible:

    \[ \min_{w, b, \xi, \xi^*} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n (\xi_i + \xi_i^*) \]

    Subject to: \[ y_i - (w^T x_i + b) \leq \epsilon + \xi_i, \quad (w^T x_i + b) - y_i \leq \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \geq 0 \]

    • \(\frac{1}{2} \|w\|^2\): Controls the flatness of the regression function.

    • \(C \sum_{i=1}^n (\xi_i + \xi_i^*)\): Penalizes deviations greater than \(\epsilon\).

    • These constraints ensure that the regression function fits the data within an allowable deviation of \(\epsilon\). The slack variables \(\xi_i\) and \(\xi_i^*\) account for data points that fall above or below the \(\epsilon\)-tube.
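  • ε-Insensitive Loss Sketch: a minimal illustration (the residual and ε values below are assumptions) of the loss \(\max(0, |r| - \epsilon)\); residuals inside the tube incur zero loss, which is why points inside the tube do not influence the fit.

# Epsilon-insensitive loss: max(0, |residual| - epsilon)
eps_insensitive_loss <- function(residual, epsilon = 0.1) {
  pmax(0, abs(residual) - epsilon)
}

residuals <- c(-0.30, -0.05, 0.00, 0.08, 0.25)
data.frame(residual = residuals, loss = eps_insensitive_loss(residuals))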


Slide 9: Example of SVM Regression with ε-Tube

  • Visualizing SVM Regression: This example shows an SVM regression with an ε-tube (light blue), where only points on or outside the tube (the support vectors) influence the fitted function.

  • Graphic: SVM Regression with ε-Tube

# Generate data for regression
set.seed(42)
X <- sort(runif(40, min = 0, max = 5))
y <- sin(X) + rnorm(40, sd = 0.2)

# Fit SVM regression model
svr_model <- svm(X, y, kernel = "radial", cost = 100, epsilon = 0.1)

# Prediction
y_pred <- predict(svr_model, X)

# Plot
plot(X, y, col = "darkorange", pch = 16, xlab = "Input Feature", ylab = "Target Output", main = "SVM Regression with ε-Tube")
lines(X, y_pred, col = "navy", lwd = 2)
lines(X, y_pred + 0.1, col = "lightblue", lty = 2)  # upper edge of the ε-tube (epsilon = 0.1)
lines(X, y_pred - 0.1, col = "lightblue", lty = 2)  # lower edge of the ε-tube
points(X[svr_model$index], y[svr_model$index], pch = 4, col = "red", lwd = 2)  # support vectors


Slide 10: Conclusion and Key Takeaways

  • Summary:
    • Versatility: SVMs can handle both classification and regression tasks.
    • Kernel Trick: Kernel transformations allow SVMs to handle non-linear relationships effectively.
    • Mathematical Rigor: The SVM optimization problem involves maximizing the margin while controlling classification errors through regularization.
    • Robustness: SVMs are resistant to overfitting, thanks to margin maximization and ε-insensitive loss in regression.

This completes the Non-linear Support Vector Machines presentation using R. Each code block can be run to generate the graphics required for each concept, offering a comprehensive view of SVM in non-linear settings. Let me know if any additional details or refinements are needed!