Support Vector Machines for Predicting Whether a Student Passed or Failed
Introduction to Support Vector Machines (SVM)
Support Vector Machines (SVM) are powerful supervised learning models used for classification tasks. While they can seem complicated due to the terminology associated with them, the key ideas behind SVMs can be broken down into easy-to-understand concepts. In this post, we will explore SVMs and their components using an academic performance example. We assume you are familiar with the bias-variance tradeoff and cross-validation concepts.
The bias-variance tradeoff is a fundamental concept in statistics and machine learning that describes the balance between two sources of prediction error. Bias is the error introduced by approximating a real-world problem with a simplified model, which leads to systematic errors in predictions. High bias often results in oversimplified models that cannot capture the underlying patterns in the data (underfitting). Variance, on the other hand, is the model’s sensitivity to fluctuations in the training data. High variance leads to models that capture noise and perform well on training data but poorly on new, unseen data (overfitting). The tradeoff is that increasing model complexity tends to reduce bias but increase variance, and vice versa.
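To make the tradeoff concrete, here is a minimal sketch that is not part of the student example. It fits polynomials of increasing degree to noisy synthetic data and compares training and test error. The data-generating function, the noise level, and the degrees (1, 3, 9) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a sine curve plus Gaussian noise
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.3f}, "
          f"test MSE={results[degree][1]:.3f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Test error typically behaves differently: the straight line underfits (high bias), while a very flexible polynomial may start fitting noise (high variance).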
Cross-validation is a technique used in machine learning to evaluate the performance of a model by dividing the dataset into multiple subsets (folds). The model is trained on all folds but one and tested on the held-out fold, and the process is repeated so that each fold serves as the test set once. This shows how well the model generalizes to unseen data, helps detect overfitting, and provides a more reliable estimate of the model’s performance than a single train/test split.
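As a rough illustration, the sketch below runs 5-fold cross-validation with scikit-learn's cross_val_score on a linear SVM. The data is synthetic: two made-up clusters with two features each, loosely imitating study hours and previous results. It is a stand-in, not the student dataset used later.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 50 points per class, two features each
n_per_class = 50
X = np.vstack([
    rng.normal(loc=[2, 40], scale=[1, 8], size=(n_per_class, 2)),  # class 0
    rng.normal(loc=[7, 75], scale=[1, 8], size=(n_per_class, 2)),  # class 1
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# cv=5 trains and evaluates the model five times, once per held-out fold
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

The spread of the per-fold scores gives a sense of how much the estimate depends on which rows happen to land in the test set.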
A Simple Classification Task
Imagine we are working with an Excel dataset containing students’ StudyHours, PreviousResult, and whether they Passed. The goal is to build an SVM model that classifies students into “Pass” or “Fail” categories based on their study habits and previous results.
Let’s load the dataset using Python and explore how we can classify students using SVM.
Once the model is trained, we can use it to classify new students based on their study hours and previous results.
[Download the dataset](https://docs.google.com/spreadsheets/d/1hrNmeq4qKoqnBilZCGehrhGIVJwYr-IFA-_LcMIk7KQ/edit?usp=sharing)
Steps:
- Load the dataset from an Excel file.
- Prepare the features (StudyHours, PreviousResult) and the target (Passed).
- Split the data into training and test sets.
- Train an SVM classifier with a linear kernel.
- Evaluate the model using accuracy.
Step 1: Load the dataset from an Excel file.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Step 1: Load the dataset from Excel
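data = pd.read_excel('student_data.xlsx')  # Assuming the file is named 'student_data.xlsx'
Step 2: Prepare features and target
X = data[['StudyHours', 'PreviousResult']]  # Feature columns
y = data['Passed']  # Target column
Step 3: Split the data into training and test sets
# The original split settings are not shown; test_size=0.2 and random_state=42 are assumed here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)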
Step 4: Initialize and train the SVM classifier (using linear kernel)
svm_model = SVC(kernel='linear') # Linear kernel since data is linearly separable
svm_model.fit(X_train, y_train)
Step 5: Make predictions and evaluate the model
y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the SVM model: {accuracy:.2f}')
## Accuracy of the SVM model: 1.00
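Once a model is trained, classifying new students is a single predict call. The sketch below is self-contained, so it uses a small made-up table in place of student_data.xlsx. The rows, the 0/1 encoding of Passed, and the two new students are all hypothetical.

```python
import pandas as pd
from sklearn.svm import SVC

# Hypothetical stand-in for the Excel file (values invented for illustration)
data = pd.DataFrame({
    'StudyHours':     [1, 2, 3, 4, 6, 7, 8, 9],
    'PreviousResult': [35, 40, 45, 50, 65, 70, 80, 85],
    'Passed':         [0, 0, 0, 0, 1, 1, 1, 1],  # assumed encoding: 1 = pass, 0 = fail
})

X = data[['StudyHours', 'PreviousResult']]
y = data['Passed']
svm_model = SVC(kernel='linear').fit(X, y)

# New students must use the same column names and order as the training features
new_students = pd.DataFrame({'StudyHours': [2, 8], 'PreviousResult': [38, 82]})
predictions = svm_model.predict(new_students)
print(new_students.assign(PredictedPassed=predictions))
```

Passing a DataFrame with the same column names as the training data avoids scikit-learn's feature-name mismatch warning.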
Step 6: Plot the training data
plt.scatter(X_train['StudyHours'], X_train['PreviousResult'], c=y_train, cmap='coolwarm', s=50, label="Training Data")
Step 7: Visualize the decision boundary
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 500),
                     np.linspace(ylim[0], ylim[1], 500))
# Get decision boundary and margin
Z = svm_model.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot decision boundary and margins
ax.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=['--', '-', '--'], colors='black')
# Highlight the support vectors
ax.scatter(svm_model.support_vectors_[:, 0], svm_model.support_vectors_[:, 1], s=100,
           linewidth=1, facecolors='none', edgecolors='k', label='Support Vectors')
# Step 8: Add labels and titles
plt.xlabel('Study Hours')
plt.ylabel('Previous Result')
plt.title('SVM Decision Boundary with Support Vectors')
plt.legend(loc='best')
plt.show()
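The contour levels -1, 0, and 1 used in the plot correspond to the two margins and the decision boundary of the linear model f(x) = w·x + b. For a linear kernel, scikit-learn exposes w and b as coef_ and intercept_, so the boundary and the margin width 2/||w|| can be read off directly. The sketch below uses a small hypothetical dataset rather than the Excel file.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: columns are (StudyHours, PreviousResult)
X = np.array([[1, 35], [2, 40], [3, 45], [4, 50],
              [6, 65], [7, 70], [8, 80], [9, 85]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = SVC(kernel='linear').fit(X, y)

w = model.coef_[0]        # weight vector (only available for the linear kernel)
b = model.intercept_[0]   # bias term

# For a linear kernel, decision_function(x) is w·x + b
manual_scores = X @ w + b

# Distance between the two margin lines f(x) = -1 and f(x) = +1
margin_width = 2 / np.linalg.norm(w)

print(f"Boundary: {w[0]:.3f} * StudyHours + {w[1]:.3f} * PreviousResult + {b:.3f} = 0")
print(f"Margin width: {margin_width:.2f}")
print("Support vectors:\n", model.support_vectors_)
```

Because the two features are on different scales, the margin width here is measured in mixed units. Standardizing the features before fitting is a common way to make it easier to interpret.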