This project leverages the renowned Iris dataset to demonstrate the power of Support Vector Machines (SVM) in classifying species of Iris flowers based on petal and sepal measurements. The Iris dataset, a classic in machine learning, provides an ideal foundation for exploring classification techniques due to its well-defined features and clearly separable classes. By employing Grid Search for hyperparameter tuning, we optimized the SVM model to achieve perfect accuracy on the held-out test set, outperforming the baseline model. This analysis highlights the impact of careful parameter tuning on model performance. The results showcase the effectiveness of SVM in distinguishing between flower species and offer insight into how machine learning can be applied to broader classification problems in real-world scenarios.
Keywords: Data Analysis, Python, Pandas, Seaborn, NumPy, Machine Learning, Support Vector Machines (SVM), Scikit-learn, Classification
In this project, we analyze the well-known Iris dataset using machine learning techniques to classify different species of Iris flowers. This dataset is a classic in the field of data science and machine learning, often used as an introductory example for classification algorithms. The dataset was first introduced by Sir Ronald Fisher in 1936 and remains a benchmark for evaluating classification models (Fisher 1936).
We employ Support Vector Machines (SVM) to classify the Iris species based on flower measurements. Additionally, we use Grid Search to fine-tune the model’s hyperparameters for better performance. By comparing the performance of a baseline model with a tuned model, we demonstrate the effectiveness of hyperparameter optimization.
The Iris flower dataset is publicly available, for example from the UCI Machine Learning Repository. It consists of 150 samples from three species of Iris: Iris setosa, Iris versicolor, and Iris virginica.
Each sample has four features:

1. Sepal length (in cm)
2. Sepal width (in cm)
3. Petal length (in cm)
4. Petal width (in cm)
These features are used to predict the species of the flower.
Here are the pictures of the three species of Iris (James et al. 2013).
# Display images of the three Iris species
from IPython.display import Image, display

# display() is needed so that all three images render from a single cell
display(Image('http://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg', width=300))
display(Image('http://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg', width=300))
display(Image('http://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg', width=300))
We begin by loading the dataset using Python’s seaborn library, which includes the Iris dataset by default:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
# Load the Iris dataset
iris = sns.load_dataset("iris")
iris.head()
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
We encode the species labels into numerical values for model training:
# Encode species labels as integers
label_encoder = LabelEncoder()
iris['species'] = label_encoder.fit_transform(iris["species"])
iris.head()
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
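LabelEncoder assigns integer codes in alphabetical order of the class names. As a quick sanity check (an addition for illustration, not part of the original notebook), the learned mapping can be inspected directly:

# Map each integer code back to its species name
print(dict(enumerate(label_encoder.classes_)))
# {0: 'setosa', 1: 'versicolor', 2: 'virginica'}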
Next, we examine the dataset's structure and summary statistics:

iris.info()
iris.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
|       | sepal_length | sepal_width | petal_length | petal_width | species |
|-------|--------------|-------------|--------------|-------------|---------|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean  | 5.843333 | 3.057333 | 3.758000 | 1.199333 | 1.000000 |
| std   | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min   | 4.300000 | 2.000000 | 1.000000 | 0.100000 | 0.000000 |
| 25%   | 5.100000 | 2.800000 | 1.600000 | 0.300000 | 0.000000 |
| 50%   | 5.800000 | 3.000000 | 4.350000 | 1.300000 | 1.000000 |
| 75%   | 6.400000 | 3.300000 | 5.100000 | 1.800000 | 2.000000 |
| max   | 7.900000 | 4.400000 | 6.900000 | 2.500000 | 2.000000 |
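Note that the summary statistics for the encoded species column are not meaningful in themselves, since the column is categorical. A quick check (added here for illustration) confirms that the three classes are perfectly balanced:

# Count samples per encoded species (0 = setosa, 1 = versicolor, 2 = virginica)
print(iris['species'].value_counts())
# Each of the three classes contains exactly 50 samples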
To visualize relationships between the features, we create a pair plot:
="species", palette="rocket", corner=True)
sns.pairplot(iris, hue"Pairs Plot of Variables")
plt.title( plt.show()
The pair plot reveals patterns in the data that can help differentiate between the species, particularly using petal length and petal width.
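To put numbers on this visual impression, one can compare per-species averages of the petal measurements (a small addition, not in the original analysis):

# Mean petal measurements for each encoded species
print(iris.groupby('species')[['petal_length', 'petal_width']].mean())
# setosa (0) has markedly smaller petals, which explains its clean separation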
We split the dataset into training and testing sets to evaluate our models:
X = iris.drop(columns="species")
y = iris['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
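With random_state=101 this split happens to place 13, 20, and 12 samples of the three classes in the test set (visible in the support column of the reports below). As an alternative sketch (not used in the original analysis), a stratified split would preserve the exact 50/50/50 class balance in both subsets:

# Alternative: stratify on y so each class appears in equal proportion
# in the training and test sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y)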
We start by tuning the SVM with Grid Search, optimizing two key hyperparameters: the regularization strength C and the RBF kernel coefficient gamma:
= {"C": [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
parameters "gamma": [1, 100, 1000, 10000, 100000]}
= GridSearchCV(SVC(), param_grid=parameters)
svm_tuned svm_tuned.fit(X_train, y_train)
GridSearchCV(estimator=SVC(),
             param_grid={'C': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1],
                         'gamma': [1, 100, 1000, 10000, 100000]})
After running the grid search, we find the best parameters:
best_params = svm_tuned.best_params_
print(f"Best parameters: {best_params}")
Best parameters: {'C': 1, 'gamma': 1}
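Note that the selected values sit at the edges of the grid (the largest C and the smallest gamma tried), so widening the grid could be worthwhile. GridSearchCV picks the combination with the best cross-validated accuracy on the training set (5-fold by default); that score can be inspected as well (an addition for illustration):

# Mean cross-validated accuracy of the best parameter combination
print(f"Best CV accuracy: {svm_tuned.best_score_:.3f}")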
With the optimal parameters (C=1 and gamma=1), we use the refitted best estimator to generate predictions on the test set:
svm_pred = svm_tuned.predict(X_test)
print(classification_report(y_test, svm_pred))

sns.heatmap(confusion_matrix(y_test, svm_pred), cmap="Blues", fmt=".0f", annot=True)
plt.title("Confusion Matrix - Tuned SVM")
plt.show()
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 1.00 1.00 20
2 1.00 1.00 1.00 12
accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45
The tuned model classifies every test sample correctly, achieving perfect precision, recall, and F1-score across all three classes.
For comparison, we fit a baseline SVM model with default parameters:
baseline = SVC()
baseline.fit(X_train, y_train)

baseline_preds = baseline.predict(X_test)
print(classification_report(y_test, baseline_preds))

sns.heatmap(confusion_matrix(y_test, baseline_preds), cmap="Blues", fmt=".0f", annot=True)
plt.title("Confusion Matrix - Baseline SVM")
plt.show()
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 0.95 0.97 20
2 0.92 1.00 0.96 12
accuracy 0.98 45
macro avg 0.97 0.98 0.98 45
weighted avg 0.98 0.98 0.98 45
The baseline model performs slightly worse than the tuned model: it misclassifies one versicolor sample (class 1 recall drops to 0.95 and class 2 precision to 0.92), for an overall test accuracy of 0.98.
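For a direct side-by-side comparison (added here, not in the original notebook), the test accuracies of the two models can be computed explicitly:

from sklearn.metrics import accuracy_score

# Compare test-set accuracy of the tuned and baseline models
print(f"Tuned SVM accuracy:    {accuracy_score(y_test, svm_pred):.3f}")
print(f"Baseline SVM accuracy: {accuracy_score(y_test, baseline_preds):.3f}")
# Expected: 1.000 for the tuned model vs. 0.978 (44 of 45) for the baseline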
In this project, we successfully utilized Support Vector Machines (SVM) to classify different species of the Iris flower based on measurements of their features. By fine-tuning hyperparameters with Grid Search, we improved the model's test accuracy from 98% to 100% relative to the baseline model (Muddana and Vinayakam 2024).
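One natural extension: the imports at the top include StandardScaler, but it is never applied. RBF-kernel SVMs are sensitive to feature scale, so a common refinement (a sketch of a possible next step, not part of the analysis above) is to chain scaling and classification in a single Pipeline and tune it with the same grid search, prefixing each parameter with its step name:

from sklearn.pipeline import Pipeline

# Scale features before the SVM inside one estimator
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100],
              "svc__gamma": [0.001, 0.01, 0.1, 1, 10]}

svm_scaled = GridSearchCV(pipe, param_grid=param_grid)
svm_scaled.fit(X_train, y_train)
print(svm_scaled.best_params_)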