Predicting Iris Species Using Support Vector Machines (SVM): A Detailed Classification Approach in Python

Independent Data Analysis Project

Author

John Karuitha, PhD

Published

November 19, 2024

Modified

November 19, 2024

Executive Summary

This project leverages the renowned Iris dataset to demonstrate the power of Support Vector Machines (SVM) in classifying species of Iris flowers based on petal and sepal measurements. The Iris dataset, a classic in machine learning, provides an ideal foundation for exploring classification techniques due to its well-defined features and clearly separable classes. By employing Grid Search for hyperparameter tuning, we optimized the SVM model to achieve perfect accuracy on the held-out test set, edging out an already strong baseline. This analysis highlights the impact of parameter tuning on model performance. The results showcase the effectiveness of SVM in distinguishing between flower species and offer insights into how machine learning can be applied to broader classification problems in real-world settings.

Keywords

Data Analysis, Python, Pandas, Seaborn, Numpy, Machine Learning, Support Vector Machines (SVM), Scikit-learn, Classification

1 Background

In this project, we analyze the well-known Iris dataset using machine learning techniques to classify different species of Iris flowers. This dataset is a classic in the field of data science and machine learning, often used as an introductory example for classification algorithms. The dataset was first introduced by Sir Ronald Fisher in 1936 and remains a benchmark for evaluating classification models (Fisher 1936).

We employ Support Vector Machines (SVM) to classify the Iris species based on flower measurements. Additionally, we use Grid Search to fine-tune the model’s hyperparameters for better performance. By comparing the performance of a baseline model with a tuned model, we demonstrate the effectiveness of hyperparameter optimization.

2 Key Insights

  • The Iris dataset is well-suited for classification tasks, particularly with algorithms like SVM due to the clear separation between species based on petal measurements.
  • Hyperparameter tuning played a crucial role in optimizing the model’s performance, reducing the risk of overfitting while maximizing accuracy.
  • The tuned SVM model achieved perfect classification results, demonstrating the effectiveness of SVM for datasets with relatively small sample sizes and distinct class separations.

3 The Data

The Iris flower dataset is publicly available from the UCI Machine Learning Repository and also ships with the seaborn library, which is how we load it below. The dataset consists of 150 samples from three species of Iris:

  • Iris-setosa (n=50)
  • Iris-versicolor (n=50)
  • Iris-virginica (n=50)

Each sample has four features:

  1. Sepal length (in cm)
  2. Sepal width (in cm)
  3. Petal length (in cm)
  4. Petal width (in cm)

These features are used to predict the species of the flower.

3.1 Visual Representation of the Three Iris Species

Below are pictures of the three species of Iris (James et al. 2013).

Code
# Display images of the three Iris species
from IPython.display import Image

Image('http://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg', width=300)

Setosa
Code
Image('http://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg', width=300)

Versicolor
Code
Image('http://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg', width=300)

Virginica

4 Loading the Data

We begin by loading the dataset using Python’s seaborn library, which includes the Iris dataset by default:

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV

# Load the Iris dataset
iris = sns.load_dataset("iris")
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

We encode the species labels into numerical values for model training:

Code
# Encode species labels as integers for model training
label_encoder = LabelEncoder()
iris['species'] = label_encoder.fit_transform(iris["species"])
iris.head()
   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0
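
LabelEncoder assigns integer codes in alphabetical order of the labels, so setosa maps to 0, versicolor to 1, and virginica to 2. To confirm the mapping programmatically, here is a quick sketch (an addition, not part of the original analysis):

Code
# classes_ lists the original labels in the order of their integer codes
dict(enumerate(label_encoder.classes_))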

5 Data Exploration

5.1 Summary Statistics

We begin by examining the dataset to understand its structure:

Code
iris.info()
iris.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
       sepal_length  sepal_width  petal_length  petal_width     species
count    150.000000   150.000000    150.000000   150.000000  150.000000
mean       5.843333     3.057333      3.758000     1.199333    1.000000
std        0.828066     0.435866      1.765298     0.762238    0.819232
min        4.300000     2.000000      1.000000     0.100000    0.000000
25%        5.100000     2.800000      1.600000     0.300000    0.000000
50%        5.800000     3.000000      4.350000     1.300000    1.000000
75%        6.400000     3.300000      5.100000     1.800000    2.000000
max        7.900000     4.400000      6.900000     2.500000    2.000000
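
To see how the features differ across species, one natural follow-up is a per-class summary. A short sketch (an addition, not in the original analysis):

Code
# Mean feature values per species (0 = setosa, 1 = versicolor, 2 = virginica)
iris.groupby("species").mean()

This tabulates class-conditional means, complementing the pooled statistics above.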

5.1.1 Data Visualization

To visualize relationships between the features, we create a pair plot:

Code
# Pair plot of all pairwise feature relationships, colored by species
sns.pairplot(iris, hue="species", palette="rocket", corner=True)
plt.suptitle("Pairs Plot of Variables", y=1.02)  # plt.title would label only the last subplot
plt.show()

The pair plot reveals patterns in the data that can help differentiate between the species, particularly using petal length and petal width.
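
Since the petal measurements appear to drive the separation, a focused scatter plot can make this concrete. A minimal sketch (an addition to the original analysis):

Code
# Petal length vs. petal width, colored by species
sns.scatterplot(data=iris, x="petal_length", y="petal_width",
                hue="species", palette="rocket")
plt.title("Petal Length vs. Petal Width by Species")
plt.show()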


6 Train-Test Split

We split the dataset into training and testing sets to evaluate our models:

Code
# Features (X) and encoded target (y)
X = iris.drop(columns="species")
y = iris['species']

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
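
Because this split is not stratified by class, it is worth checking how the species are distributed across the two sets; a quick sketch:

Code
# Class counts in the training and test sets
print(y_train.value_counts().sort_index())
print(y_test.value_counts().sort_index())

With random_state=101, the test set contains 13 setosa, 20 versicolor, and 12 virginica samples, which matches the support column in the classification reports below.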

7 Support Vector Machines (SVM) Model

7.1 Tuned Model

We start by tuning the SVM model using Grid Search to optimize two important parameters:

  1. C (Regularization Parameter): Controls the trade-off between achieving a low error on the training data and minimizing the complexity of the model.
  2. Gamma: Defines how far the influence of a single training example reaches; low values mean a far reach (smoother decision boundaries), high values a close reach. Very high gamma values can lead to overfitting.
Code
parameters = {"C": [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1], 
              "gamma": [1, 100, 1000, 10000, 100000]}

svm_tuned = GridSearchCV(SVC(), param_grid=parameters)
svm_tuned.fit(X_train, y_train)
GridSearchCV(estimator=SVC(),
             param_grid={'C': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1],
                         'gamma': [1, 100, 1000, 10000, 100000]})
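
GridSearchCV stores the per-combination cross-validation results in its cv_results_ attribute. To see how the candidate settings compared, we can tabulate the mean validation scores, as in this sketch (an addition to the original analysis):

Code
# Mean cross-validated accuracy for each (C, gamma) combination tried
cv_results = pd.DataFrame(svm_tuned.cv_results_)
cv_results[["param_C", "param_gamma", "mean_test_score"]].sort_values(
    "mean_test_score", ascending=False).head()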

7.1.1 Best Parameters

After running the grid search, we find the best parameters:

Code
best_params = svm_tuned.best_params_
print(f"Best parameters: {best_params}")
Best parameters: {'C': 1, 'gamma': 1}

Grid Search selects C=1 and gamma=1. Both values lie at the edge of the search grid (C at the top, gamma at the bottom), which hints that a wider grid could be worth exploring; here we proceed with these settings. Because GridSearchCV refits the best estimator on the full training set by default (refit=True), we can evaluate it directly on the test data:

Code
# Evaluate the refitted best estimator on the held-out test set
svm_pred = svm_tuned.predict(X_test)
print(classification_report(y_test, svm_pred))
sns.heatmap(confusion_matrix(y_test, svm_pred), cmap="Blues", fmt=".0f", annot=True)
plt.title("Confusion Matrix - Tuned SVM")
plt.show()
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      1.00      1.00        20
           2       1.00      1.00      1.00        12

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

The tuned model classifies every test sample correctly, achieving perfect precision, recall, and F1-score across all three classes.
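
A perfect score on a 45-sample test set can be fragile, so a cross-validation check on the full dataset offers some reassurance. A minimal sketch using the tuned settings (an addition, not part of the reported results):

Code
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the tuned configuration
scores = cross_val_score(SVC(C=1, gamma=1), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")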


7.2 Baseline Model Without Parameter Tuning

For comparison, we fit a baseline SVM model with default parameters:

Code
# Fit an SVM with scikit-learn's defaults (C=1.0, kernel="rbf", gamma="scale")
baseline = SVC()
baseline.fit(X_train, y_train)
baseline_preds = baseline.predict(X_test)
print(classification_report(y_test, baseline_preds))
sns.heatmap(confusion_matrix(y_test, baseline_preds), cmap="Blues", fmt=".0f", annot=True)
plt.title("Confusion Matrix - Baseline SVM")
plt.show()
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      0.95      0.97        20
           2       0.92      1.00      0.96        12

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

The baseline model performs slightly worse than the tuned model: it misclassifies one versicolor sample as virginica, giving 98% accuracy against the tuned model's 100%.
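
One caveat: although StandardScaler is imported above, both models are fit on unscaled features, and SVMs are sensitive to feature scales. A pipeline that standardizes before fitting is a natural variant to try; a sketch (its results are not part of this report):

Code
from sklearn.pipeline import make_pipeline

# Standardize the features, then fit an SVM with default parameters
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_train, y_train)
print(classification_report(y_test, scaled_svm.predict(X_test)))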


8 Future Work

  • Experimenting with other classification algorithms such as Random Forest and K-Nearest Neighbors to compare performance.
  • Exploring dimensionality reduction techniques like PCA to visualize the data in two dimensions (a brief sketch follows this list).
  • Applying these techniques to other datasets with more complex structures to assess the generalizability of the models.
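
As a taste of the PCA idea mentioned above, the four features can be projected onto two principal components for plotting. A minimal sketch (PCA is not used elsewhere in this analysis; standardizing first is a common convention, not something the original code does):

Code
from sklearn.decomposition import PCA

# Standardize, then project the four features onto two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(StandardScaler().fit_transform(X))
sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1],
                hue=iris["species"], palette="rocket")
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")
plt.title("Iris Data Projected onto Two Principal Components")
plt.show()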

9 Conclusion

In this project, we successfully utilized Support Vector Machines (SVM) to classify different species of the Iris flower based on measurements of their features. By fine-tuning hyperparameters with Grid Search, we raised test-set accuracy from 98% for the baseline model to 100% (Muddana and Vinayakam 2024).


References

Fisher, Ronald A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Muddana, A Lakshmi, and Sandhya Vinayakam. 2024. Python for Data Science. Springer.