Here we will look at the famous iris dataset and try to classify the species of each flower based on the given parameters. The dataset has been split into two groups, a training set and a test set, with a split ratio of 80:20.

Loading the required libraries and dataset

library(caTools)
library(ggplot2)
library(GGally)
library(e1071)

We will use caTools to split the dataset into a training set and a test set.

ggplot2 and GGally will be used for Visualization.

e1071 will be used for Support Vector Classification.

dataset = iris

Exploring the dataset

str(dataset)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(dataset)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

We will use sepal length, sepal width, petal length and petal width to predict the species of the flower.

Splitting data into training set and test set

split = sample.split(dataset$Species, SplitRatio = .8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

nrow(training_set)
## [1] 120
nrow(test_set)
## [1] 30

We have 120 data points to train the model on and 30 data points to test it on.
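To see what an 80:20 split that keeps the species balanced looks like in practice, here is a minimal base-R sketch that draws 80% of the rows within each species. It is only an illustration and not how sample.split is implemented internally; the seed value and the names train_idx, train_demo and test_demo are assumptions made up for this example.

```r
set.seed(42)  # assumed seed, only so the illustration is repeatable

# Group the row numbers by species, then sample 80% of the rows in each group
rows_by_species = split(seq_len(nrow(iris)), iris$Species)
train_idx = unlist(lapply(rows_by_species,
                          function(idx) sample(idx, size = round(0.8 * length(idx)))))

train_demo = iris[train_idx, ]
test_demo = iris[-train_idx, ]

# Count how many rows of each species ended up in each part
table(train_demo$Species)
table(test_demo$Species)
```

Each species contributes 40 rows to the training part and 10 rows to the test part, which adds up to the same 120 and 30 rows seen above.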

Exploratory Visualization

Let's take a closer look at the parameters and judge beforehand whether a good model can be created.

ggpairs(training_set, ggplot2::aes(colour = Species, alpha = 0.4))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I personally like this graph because you can deduce so much information from a single chart. Let's have a look!

The histograms of Petal.Length and Petal.Width show that we can separate out the setosa species with very high confidence.

However, the versicolor and virginica species overlap. If we look at the scatterplots of Sepal.Length vs Petal.Length and Petal.Width vs Petal.Length, we can distinctly see a separator that can be drawn between the groups of species.
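The separation described above can also be checked numerically. The following sketch uses only base R on the full iris data (not the training set from the split) and prints the minimum and maximum of each petal measurement per species; the names petal_length_range and petal_width_range are made up for this example.

```r
# Minimum and maximum of each petal measurement per species (full iris data)
petal_length_range = sapply(split(iris$Petal.Length, iris$Species), range)
petal_width_range = sapply(split(iris$Petal.Width, iris$Species), range)
rownames(petal_length_range) = c("min", "max")
rownames(petal_width_range) = c("min", "max")
petal_length_range
petal_width_range

# TRUE when the longest setosa petal is shorter than the shortest petal of the other two species
petal_length_range["max", "setosa"] <
  min(petal_length_range["min", c("versicolor", "virginica")])

# TRUE when the versicolor and virginica petal lengths overlap
petal_length_range["max", "versicolor"] > petal_length_range["min", "virginica"]
```

Both comparisons come out TRUE: setosa sits in its own range, while versicolor and virginica share part of theirs.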

It looks like we can use just Petal.Width and Petal.Length as parameters and still come up with a good model. SVM seems to be a good fit for this type of data. Let's create two models, one that uses all the parameters and a second that uses only Petal.Width and Petal.Length, and compare their performance.

Feature Scaling and Model Fitting

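# Scale both sets with the training set's mean and standard deviation
train_scaled = scale(training_set[,1:4])
training_set[,1:4] = train_scaled
test_set[,1:4] = scale(test_set[,1:4], center = attr(train_scaled, "scaled:center"), scale = attr(train_scaled, "scaled:scale"))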

classifier1 = svm(formula = Species~., data = training_set, type = 'C-classification', kernel = 'radial')
classifier2 = svm(formula = Species~ Petal.Width + Petal.Length, data = training_set, type = 'C-classification', kernel = 'radial')

Here classifier1 uses all the parameters to build the model, while classifier2 uses only the petal's length and width.
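As a side note on the scaling step in this section, the sketch below shows what scale() computes for a single column: it subtracts the column mean and divides by the column standard deviation, and it stores both values as attributes on the result. It uses the full iris data for simplicity, and the names x, z_manual and z_scale are made up for this example.

```r
x = iris$Petal.Length

z_manual = (x - mean(x)) / sd(x)   # z-score computed by hand
z_scale = scale(x)                 # the same standardization via scale()

# TRUE when both approaches give the same values
all.equal(z_manual, as.vector(z_scale))

attr(z_scale, "scaled:center")   # the mean that was subtracted
attr(z_scale, "scaled:scale")    # the standard deviation that was divided by
```

Because the mean and standard deviation are kept on the scaled object, they can be read back later and applied to other data that should go through the same transformation.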

Prediction and Evaluation

test_pred1 = predict(classifier1, type = 'response', newdata = test_set[-5])
test_pred2 = predict(classifier2, type = 'response', newdata = test_set[-5])

# Making Confusion Matrix
cm1 = table(test_set[,5], test_pred1)
cm2 = table(test_set[,5], test_pred2)
cm1 # Confusion Matrix for all parameters
##             test_pred1
##              setosa versicolor virginica
##   setosa          9          1         0
##   versicolor      0         10         0
##   virginica       0          1         9
cm2 # Confusion Matrix for parameters being Petal Length and Petal Width
##             test_pred2
##              setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          2         8

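Wow, the accuracy of both models looks solid: each one classifies 28 of the 30 test flowers correctly (about 93%). Also notice that, as we had deduced, the second classifier, which uses only Petal Length and Width, does just as well as the one that uses all four parameters. On this test set, the petal measurements alone are enough to make the model accurate.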
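To put numbers on the two confusion matrices, here is a small base-R sketch that re-enters the counts printed above and computes overall accuracy and the fraction of each species that was predicted correctly. The names cm1_shown, cm2_shown and accuracy are made up for this example; with the objects from the session you could pass cm1 and cm2 to the same function.

```r
species = c("setosa", "versicolor", "virginica")

# Counts copied from the printed confusion matrices (rows = actual, columns = predicted)
cm1_shown = matrix(c(9,  1, 0,
                     0, 10, 0,
                     0,  1, 9),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = species, predicted = species))
cm2_shown = matrix(c(10,  0, 0,
                      0, 10, 0,
                      0,  2, 8),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = species, predicted = species))

# Accuracy = correctly classified flowers (the diagonal) / all test flowers
accuracy = function(cm) sum(diag(cm)) / sum(cm)

accuracy(cm1_shown)   # 28 / 30
accuracy(cm2_shown)   # 28 / 30

# Fraction of each actual species that was predicted correctly
diag(cm1_shown) / rowSums(cm1_shown)
diag(cm2_shown) / rowSums(cm2_shown)
```

Both models reach the same overall accuracy and differ only in which flowers they get wrong.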