Support Vector Machines (SVMs) are powerful supervised learning algorithms primarily used for classification tasks. The goal of an SVM is to find the separating hyperplane that divides the classes with the largest possible margin.
A basic grasp of vector algebra, in particular dot products, norms, and the equation of a hyperplane, is essential for understanding SVMs.
A hyperplane in \(n\)-dimensional space is defined as: \[ \beta_0 + \beta^T x = 0 \] where:
- \(\beta\) is the normal vector (perpendicular to the hyperplane),
- \(x\) represents a feature vector,
- \(\beta_0\) is the bias term.
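As a quick illustration, the sketch below uses made-up values for \(\beta\), \(\beta_0\), and a point \(x\) to evaluate the hyperplane function and the signed distance \((\beta^T x + \beta_0)/\|\beta\|\), the quantity the margin is built from.
import numpy as np
# Hypothetical 2-D hyperplane: beta and beta_0 are illustrative values only
beta = np.array([2.0, -1.0])   # normal vector
beta_0 = 0.5                   # bias term
x = np.array([1.0, 3.0])       # an example point
value = beta @ x + beta_0                       # sign gives the predicted side of the hyperplane
signed_distance = value / np.linalg.norm(beta)  # signed distance from x to the hyperplane
print(f"beta^T x + beta_0 = {value:.2f}, signed distance = {signed_distance:.2f}")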
For two classes labeled \(y_i \in \{-1, 1\}\), the SVM requires every training point to satisfy the margin constraint: \[ y_i (\beta^T x_i + \beta_0) \geq M \] where \(M\) is the margin.
The objective is to maximize the margin. Fixing the scale of \(\beta\) so that the closest points satisfy \(y_i (\beta^T x_i + \beta_0) = 1\), this is equivalent to minimizing \(||\beta||\) subject to: \[ y_i (\beta^T x_i + \beta_0) \geq 1 \] This is a convex optimization problem, typically solved using Lagrange multipliers.
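For reference, forming the Lagrangian with multipliers \(\alpha_i \geq 0\) and eliminating \(\beta\) and \(\beta_0\) gives the standard hard-margin dual problem: \[ \max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0 \] with \(\beta = \sum_i \alpha_i y_i x_i\); the points with \(\alpha_i > 0\) are the support vectors.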
SVM introduces slack variables \(\xi_i\) to allow some misclassification: \[ y_i (\beta^T x_i + \beta_0) \geq 1 - \xi_i \] where \(\xi_i \geq 0\) represents the margin violation amount.
The optimization objective then becomes: \[ \min_{\beta, \beta_0, \xi} \frac{1}{2} ||\beta||^2 + C \sum \xi_i \] where \(C\) is a regularization parameter controlling the trade-off between margin maximization and classification errors.
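As a rough illustration of the role of \(C\), the sketch below (assuming a toy dataset from sklearn's make_blobs, not one used elsewhere in this guide) fits a linear SVM at a small and a large value of \(C\) and compares the number of support vectors; a smaller \(C\) tolerates more margin violations and so typically keeps more support vectors.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# Toy, roughly separable 2-D dataset (illustrative only)
X_toy, y_toy = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C)
    clf.fit(X_toy, y_toy)
    # n_support_ holds the number of support vectors per class
    print(f"C={C}: support vectors per class = {clf.n_support_}")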
If a dataset is not linearly separable, we implicitly map it to a higher-dimensional feature space via a mapping \(\phi\); a kernel computes the inner product in that space: \[ K(x_i, x_j) = \phi(x_i)^T \phi(x_j) \]
The kernel trick enables SVM to operate in higher-dimensional spaces without explicit transformations.
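To make the kernel trick concrete, the short numpy sketch below checks, for two made-up 2-D points, that the degree-2 polynomial kernel \(K(x, z) = (x^T z + 1)^2\) equals the inner product of an explicit feature map \(\phi\), even though the kernel never forms \(\phi\) itself.
import numpy as np
def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel (x^T z + 1)^2 in 2-D
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])
x = np.array([1.0, 2.0])   # made-up example points
z = np.array([3.0, -1.0])
kernel_value = (x @ z + 1) ** 2     # computed directly in the input space
explicit_value = phi(x) @ phi(z)    # computed in the mapped feature space
print(kernel_value, explicit_value) # the two values agree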
Hard-margin (linearly separable) formulation: \[ \min_{\beta, \beta_0} \frac{1}{2} ||\beta||^2 \] subject to: \[ y_i (\beta^T x_i + \beta_0) \geq 1 \]
Soft-margin formulation: \[ \min_{\beta, \beta_0, \xi} \frac{1}{2} ||\beta||^2 + C \sum \xi_i \] subject to: \[ y_i (\beta^T x_i + \beta_0) \geq 1 - \xi_i, \quad \xi_i \geq 0 \]
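For completeness, the corresponding soft-margin dual only changes the bounds on the multipliers, and it is here that a kernel \(K(x_i, x_j)\) can replace the inner product \(x_i^T x_j\): \[ \max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0 \]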
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_redundant=0, random_state=42)
y = np.where(y == 0, -1, 1) # Convert labels to {-1, 1}
# Split into train-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train SVM classifier
svm = SVC(kernel='linear', C=1.0)
svm.fit(X_train, y_train)
# Predictions
y_pred = svm.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
# Plot decision boundary
def plot_svm_decision_boundary(X, y, model):
    # Scatter the data points, colored by class
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolors='k')
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # Evaluate the decision function on a grid covering the plot area
    xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),
                         np.linspace(ylim[0], ylim[1], 50))
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Draw the decision boundary (solid) and the margins (dashed)
    ax.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])
    plt.title("SVM Decision Boundary")
    plt.show()
plot_svm_decision_boundary(X_train, y_train, svm)
Study Guide: Support Vector Machines (SVMs)
Support Vector Machines (SVMs) are supervised learning models used for classification and regression tasks. They work by identifying the optimal hyperplane that best separates classes in a given dataset.
Given a dataset \((x_i, y_i)\) where \(x_i\) represents the feature vectors and \(y_i\) the class labels (\(\pm1\)), the decision boundary is defined as: \[ w \cdot x + b = 0 \] where \(w\) is the weight vector and \(b\) is the bias term.
The margin is given by: \[ \frac{2}{\|w\|} \] which should be maximized while ensuring correct classification.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate a simple 2-D, two-class dataset (n_redundant=0 is required when n_features=2)
X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a linear SVM and predict on the held-out data
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
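Since the fitted model above uses a linear kernel, the margin \(2/\|w\|\) from the formula earlier can be read off directly from the learned coefficients; a minimal follow-on to the code above:
import numpy as np
# For a linear kernel, coef_ holds w and intercept_ holds b
w = model.coef_[0]
b = model.intercept_[0]
margin = 2 / np.linalg.norm(w)
print(f"Margin width: {margin:.3f}")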
The hinge loss function is defined as: \[ L(y, f(x)) = \max(0, 1 - y f(x)) \] This loss penalizes misclassified points and points that fall inside the margin (those with \(y f(x) < 1\)), which encourages better generalization.
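A direct way to see the hinge loss is to compute it by hand for a few made-up decision values \(f(x)\) (purely illustrative numbers):
import numpy as np
y_true = np.array([1, 1, -1, -1])        # true labels
f_x = np.array([2.0, 0.3, -0.5, 0.8])    # hypothetical decision values f(x)
hinge = np.maximum(0, 1 - y_true * f_x)
print(hinge)  # [0.  0.7 0.5 1.8] -- zero only for points correctly classified beyond the margin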
For non-linearly separable data, SVMs use the kernel trick to transform input space into a higher-dimensional feature space where a linear decision boundary can be applied.
# Example of using an RBF kernel
model = SVC(kernel='rbf', gamma=0.1, C=1.0)
model.fit(X_train, y_train)
Training a kernel SVM typically scales between \(O(n^2)\) and \(O(n^3)\) in the number of samples, making it inefficient for large datasets. Performance optimization techniques include:
- Using LinearSVC, a solver specialized for the linear SVM problem that scales much better with sample size (example below)
- Reducing the number of features via PCA
- Using approximate, gradient-based methods such as SGDClassifier with hinge loss (see the sketch after the code below)
from sklearn.svm import LinearSVC
# LinearSVC uses a solver specialized for the linear case and scales better with sample size
model = LinearSVC()
model.fit(X_train, y_train)
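For very large datasets, SGDClassifier with hinge loss (mentioned above) approximates a linear SVM using stochastic gradient updates; a minimal sketch reusing the same training split:
from sklearn.linear_model import SGDClassifier
# loss='hinge' makes SGDClassifier optimize a linear-SVM-style objective
sgd = SGDClassifier(loss='hinge', alpha=1e-4, max_iter=1000, random_state=42)
sgd.fit(X_train, y_train)
print(sgd.score(X_test, y_test))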
While SVMs optimize a margin, Naive Bayes estimates class probabilities from per-feature distributions.
- SVMs often perform better in high-dimensional spaces and with correlated features.
- Naive Bayes is computationally faster and works well when its feature-independence assumption roughly holds.
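As a rough comparison on the same small dataset used above (results will vary with the data), the sketch below fits both models side by side:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
for name, clf in [("SVM (linear)", SVC(kernel='linear', C=1.0)),
                  ("Gaussian Naive Bayes", GaussianNB())]:
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.2f}")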