Support Vector Machine in R Using the Iris Dataset

04/18/2025

Introduction

This report demonstrates how to implement a Support Vector Machine (SVM) model in R using the iris dataset. It covers data loading, train/test splitting, model training, evaluation, and hyperparameter tuning.

SVM is a supervised machine learning algorithm used for classification and regression. For classification, it looks for the hyperplane that separates the classes in feature space with the largest possible margin.
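For reference, the standard two-class soft-margin formulation can be written as below. The symbols are conventional notation rather than anything defined in this report: x_i are feature vectors, y_i in {-1, +1} are class labels, xi_i are slack variables that allow margin violations, and C is the penalty that e1071 exposes as the cost argument.

```latex
\min_{w,\,b,\,\xi} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

The margin width is 2 / ||w||, so minimizing ||w|| widens the margin, while a larger C puts more weight on avoiding violations.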

Install & Load Required Packages

We use the e1071 package in R, which includes the svm() function to build and tune SVM models.

Install the e1071 package if it is not already installed

if (!requireNamespace("e1071", quietly = TRUE)) {
  install.packages("e1071", dependencies = TRUE)
}

Load the e1071 package

library(e1071)

Load & Explore the Dataset

Load built-in iris dataset

data(iris)

The iris dataset contains 150 flower observations with 4 numeric features and one categorical target (Species).

Display the first six rows of the dataset

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Display the statistical summary of the dataset

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##
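Before modelling, it can help to confirm the structure described above. The following is a minimal sketch using only base R; the names `feature_classes` and `class_counts` are illustrative and do not appear elsewhere in this report.

```r
# Load the built-in iris data
data(iris)

# Column types: four numeric features plus one factor target
feature_classes <- sapply(iris, class)
print(feature_classes)

# Check for missing values, which svm() would otherwise drop via its na.action
print(anyNA(iris))

# Class balance: each species should appear 50 times
class_counts <- table(iris$Species)
print(class_counts)
```

No manual standardization is needed afterwards, because svm() scales numeric features to zero mean and unit variance by default (scale = TRUE).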

Data Splitting

Split the data: 70% for training, 30% for testing. This allows evaluation of model performance on unseen data.

Set seed for reproducibility

set.seed(123)

Split 70% for training and 30% for testing

index <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[index, ]
test_data <- iris[-index, ]

Display the training dataset dimensions

dim(train_data)

## [1] 105   5

Display the testing dataset dimensions

dim(test_data)

## [1] 45  5
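A purely random split does not guarantee equal class counts in each subset; with this seed the test set ends up with 14, 18, and 13 flowers per species. If balanced subsets are preferred, one possible alternative is a stratified split, sketched here in base R. The names `rows_by_species`, `strat_index`, `strat_train`, and `strat_test` are hypothetical and are not used in the rest of the report.

```r
data(iris)
set.seed(123)

# Sample 70% of the row indices within each species separately
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
strat_index <- unlist(lapply(rows_by_species, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

strat_train <- iris[strat_index, ]
strat_test  <- iris[-strat_index, ]

# Each species now contributes 35 training rows and 15 test rows
table(strat_train$Species)
table(strat_test$Species)
```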

Train SVM Model

We train an SVM model using a linear kernel. The Species ~ . formula means we use all other variables to predict the species.

model <- svm(Species ~ ., data = train_data, kernel = "linear")

Display the statistical summary of the SVM model

summary(model)

## 
## Call:
## svm(formula = Species ~ ., data = train_data, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  24
## 
##  ( 2 10 12 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  setosa versicolor virginica
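To make the idea of a separating hyperplane concrete, the sketch below recovers the weight vector and intercept of a linear SVM from the fitted object. It deliberately departs from the model above in two ways, both assumptions made for illustration: it keeps only two species so that there is a single hyperplane, and it sets scale = FALSE so the weights are in the original feature units. The names `two_class`, `lin_model`, `w`, `b`, and `X` are illustrative.

```r
library(e1071)
data(iris)

# Keep two species so there is a single separating hyperplane
two_class <- droplevels(subset(iris, Species != "setosa"))

# scale = FALSE keeps the weights in the original feature units
lin_model <- svm(Species ~ ., data = two_class,
                 kernel = "linear", scale = FALSE)

# Linear kernel: w is the coefficient-weighted sum of the support vectors
w <- t(lin_model$coefs) %*% lin_model$SV
b <- -lin_model$rho

# Decision value for each row: f(x) = w . x + b
X <- as.matrix(two_class[, colnames(lin_model$SV)])
manual_dv <- as.numeric(X %*% t(w) + b)

# Compare with the decision values reported by predict()
pred <- predict(lin_model, two_class, decision.values = TRUE)
svm_dv <- as.numeric(attr(pred, "decision.values"))
all.equal(manual_dv, svm_dv)
```

The sign of the decision value determines which of the two classes is predicted; rows with values near zero lie close to the hyperplane.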

Make Predictions

Use the trained model to predict flower species in the test dataset.

Make predictions

predictions <- predict(model, test_data)

Display the first six predictions

head(predictions)

##      1      2      3      5     11     18 
## setosa setosa setosa setosa setosa setosa 
## Levels: setosa versicolor virginica
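Besides the class labels, predict() can also return the underlying decision values, which show how confidently each flower was assigned. The sketch below repeats the split and model from this report so that it runs on its own; `pred_dv` and `dv` are illustrative names.

```r
library(e1071)
data(iris)
set.seed(123)

# Same 70/30 split and linear model as in the report
index <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[index, ]
test_data  <- iris[-index, ]
model <- svm(Species ~ ., data = train_data, kernel = "linear")

# Request the pairwise decision values alongside the class labels
pred_dv <- predict(model, test_data, decision.values = TRUE)
dv <- attr(pred_dv, "decision.values")

# One column per pair of species (3 classes give 3 pairwise classifiers)
dim(dv)
head(dv, 3)
```

For multiclass problems svm() trains one binary classifier per pair of classes and combines them by voting, which is why three species produce three columns.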

Evaluate Model Performance

A confusion matrix cross-tabulates predicted against actual classes, showing exactly where the model makes mistakes. We also calculate accuracy, the share of test observations that were classified correctly.

conf_mat <- table(Predicted = predictions, Actual = test_data$Species)
conf_mat

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0         17         0
##   virginica       0          1        13

Display the accuracy

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
paste("Accuracy:", round(accuracy * 100, 2), "%")

## [1] "Accuracy: 97.78 %"
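Accuracy alone hides which classes are confused. As a rough sketch, per-class precision and recall can be derived from the same table; the matrix below is copied from the output above so the example runs on its own, and the names `species`, `precision`, and `recall` are illustrative.

```r
# Confusion matrix copied from the report (rows = predicted, columns = actual)
species <- c("setosa", "versicolor", "virginica")
conf_mat <- matrix(c(14,  0,  0,
                      0, 17,  0,
                      0,  1, 13),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Predicted = species, Actual = species))

# Precision: of the flowers predicted as a class, the share that truly belong to it
precision <- diag(conf_mat) / rowSums(conf_mat)

# Recall: of the flowers that truly belong to a class, the share predicted correctly
recall <- diag(conf_mat) / colSums(conf_mat)

round(rbind(precision, recall), 4)
```

The single error is a versicolor flower predicted as virginica, so it lowers recall for versicolor (17 of 18) and precision for virginica (13 of 14).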

Tune the Model

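Use tune() to grid-search the cost and gamma parameters for a radial kernel. By default, it scores each parameter combination with 10-fold cross-validation on the training data and keeps the combination with the lowest error.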

tuned_model <- tune(svm, Species ~ ., data = train_data,
                    kernel = "radial",
                    ranges = list(cost = c(0.1, 1, 10),
                                  gamma = c(0.5, 1, 2)))

Display the statistical summary of the tuned model

summary(tuned_model)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.5
## 
## - best performance: 0.04727273 
## 
## - Detailed performance results:
##   cost gamma      error dispersion
## 1  0.1   0.5 0.11363636 0.08487554
## 2  1.0   0.5 0.04727273 0.04994028
## 3 10.0   0.5 0.05727273 0.04943204
## 4  0.1   1.0 0.19181818 0.13001730
## 5  1.0   1.0 0.05636364 0.04863613
## 6 10.0   1.0 0.07545455 0.05861111
## 7  0.1   2.0 0.33545455 0.14928357
## 8  1.0   2.0 0.06545455 0.04531291
## 9 10.0   2.0 0.07545455 0.03998393
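The cross-validation folds are assigned randomly, so the error values can shift between runs. A possible variation, sketched below, fixes the seed immediately before tuning and sets the number of folds explicitly through tune.control(). The seed value 42, the choice of 5 folds, and the names `tuned_5fold` and `perf` are arbitrary assumptions for illustration.

```r
library(e1071)
data(iris)
set.seed(123)

index <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[index, ]

# Fix the seed immediately before tuning so the fold assignment is repeatable
set.seed(42)
tuned_5fold <- tune(svm, Species ~ ., data = train_data,
                    kernel = "radial",
                    ranges = list(cost = c(0.1, 1, 10),
                                  gamma = c(0.5, 1, 2)),
                    tunecontrol = tune.control(sampling = "cross", cross = 5))

# Best hyperparameters and the associated cross-validated error
tuned_5fold$best.parameters
tuned_5fold$best.performance

# Full grid, sorted from lowest to highest error
perf <- tuned_5fold$performances
perf[order(perf$error), ]
```

Sorting the grid makes it easier to see whether several combinations perform almost equally well, in which case the choice among them matters little.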

Predict Using Tuned Model

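Evaluate performance again after tuning. Tuning selects the parameters with the lowest cross-validated error on the training data, which usually helps generalization but does not guarantee higher accuracy on a small test set.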

Select the best model from the tuned model

best_model <- tuned_model$best.model

Make predictions

tuned_predictions <- predict(best_model, test_data)

Create the confusion matrix

tuned_conf_matrix <- table(Predicted = tuned_predictions, 
                           Actual = test_data$Species)

Calculate and display the accuracy

tuned_accuracy <- sum(diag(tuned_conf_matrix)) / 
  sum(tuned_conf_matrix)
paste("Tuned Accuracy:", round(tuned_accuracy * 100, 2), "%")

## [1] "Tuned Accuracy: 97.78 %"

Accuracy Comparison

accuracy_data <- data.frame(
  Model = c("Original", "Tuned"),
  Accuracy = c(round(accuracy, 4), round(tuned_accuracy, 4))
)

# Display the accuracy comparison in a table
# (kable() comes from knitr, which is not attached above, so call it by namespace)
knitr::kable(accuracy_data)

Model      Accuracy
Original   0.9778
Tuned      0.9778
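Both models misclassify one of the 45 test flowers, and with a test set this small a single error changes accuracy by more than two percentage points. One way to express that uncertainty, offered here as a sketch rather than as part of the original analysis, is an exact binomial confidence interval from base R's binom.test(). The names `correct`, `n_test`, and `ci` are illustrative.

```r
# 44 of the 45 test flowers were classified correctly by both models
correct <- 44
n_test  <- 45

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(correct, n_test)$conf.int
round(as.numeric(ci), 4)
```

An interval this wide means the identical scores do not show that the two models are equally good in general, only that this test set cannot tell them apart.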

Key Parameters Explained

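The following svm() arguments have the largest effect on the fitted model, so they are the usual targets for tuning.

kernel – the function used to measure similarity between observations; it implicitly maps the data into a feature space where a separating hyperplane is sought (e.g., “linear”, “radial”)
cost – the penalty for margin violations; larger values penalize misclassified training points more heavily and give a narrower margin (default: 1)
gamma – the kernel coefficient that sets how far the influence of a single training point reaches; larger values mean a shorter reach (used by all kernels except “linear”; default: 1 / number of features)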
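The effect of gamma is easiest to see by evaluating the radial kernel directly. The sketch below implements the formula given in the e1071 documentation, exp(-gamma * |u - v|^2), as a hypothetical helper called `rbf_kernel`. It uses unscaled feature values for simplicity, whereas svm() scales features by default, so the numbers are illustrative only.

```r
# Radial kernel as documented for e1071::svm: exp(-gamma * |u - v|^2)
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Two actual iris rows: a setosa flower and a versicolor flower
data(iris)
u <- as.numeric(iris[1, 1:4])
v <- as.numeric(iris[51, 1:4])

# Similarity shrinks as gamma grows, so each point influences a smaller region
gammas <- c(0.5, 1, 2)
similarity <- sapply(gammas, function(g) rbf_kernel(u, v, g))
data.frame(gamma = gammas, similarity = similarity)
```

A very large gamma lets the model fit individual training points closely, which is consistent with the higher cross-validated errors seen for gamma = 2 in the tuning results.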

Summary

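The same workflow applies to small datasets like iris and to larger real-world datasets, such as those used in healthcare analytics.

Loaded the data and trained a linear-kernel SVM model
Evaluated the model with a confusion matrix and accuracy (97.78% on the test set)
Tuned cost and gamma for a radial kernel with 10-fold cross-validation; test accuracy was unchanged at 97.78%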