Here we will look at the famous iris Dataset and try to classify the its Species based on the given parameters. The dataset has been broken into two groups: training set and test set with split ratio being 80:20.
library(caTools)
library(ggplot2)
library(GGally)
library(e1071)
We will use caTools for Dataset spliting into training & test set.
ggplot2 and GGally will be used for Visualization.
e1071 will be used for Support Vector Classification.
dataset = iris
str(dataset)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(dataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
We will use sepal length, sepal width, petal length and petal width to predict the species of Flower.
split = sample.split(dataset$Species, SplitRatio = .8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
nrow(training_set)
## [1] 120
nrow(test_set)
## [1] 30
We have 120 data points on which we will train our model and then we will use 30 data points to test the model on.
Lets have a closer look at the parameters and judge before hand if a good model can be created or not.
ggpairs(training_set, ggplot2::aes(colour = Species, alpha = 0.4))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I personally like this graph because you can deduce so much information from one single chart. Lets have a look!
We can clearly see from the Histograms of Petal.length and Petal.width that we can clearly seperate out Setosa species with very high confidence.
However, Versicolor and Virginica Species are overlapped. If we look at the scatterplot of Sepal.Length vs Petal.Length and Petal.Width vs Petal.Length, we can distintly see a seperator that can be draw between the groups of Species.
Looks like we can just use Petal.Width and Petal.Length as parameters and come with a good model. SVM seems to be a very good model for this type of data. Lets create two model, one contains all parameter and second contain just Petal.Width and Petal.Length as parameter and compare their individual performances.
training_set[,1:4] = scale(training_set[,1:4])
test_set[,1:4] = scale(test_set[,1:4])
classifier1 = svm(formula = Species~., data = training_set, type = 'C-classification', kernel = 'radial')
classifier2 = svm(formula = Species~ Petal.Width + Petal.Length, data = training_set, type = 'C-classification', kernel = 'radial')
Here classifier1 uses all the parameter to make model while classifier2 just uses Petal’s Legth and Width to generate model.
test_pred1 = predict(classifier1, type = 'response', newdata = test_set[-5])
test_pred2 = predict(classifier2, type = 'response', newdata = test_set[-5])
# Making Confusion Matrix
cm1 = table(test_set[,5], test_pred1)
cm2 = table(test_set[,5], test_pred2)
cm1 # Confusion Matrix for all parameters
## test_pred1
## setosa versicolor virginica
## setosa 9 1 0
## versicolor 0 10 0
## virginica 0 1 9
cm2 # Confusion Matrix for parameters being Petal Length and Petal Width
## test_pred2
## setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 2 8
Wow, the accuracy for both model looks solid. Also notice that as we had deduced, only Petal Length and Width is important to make this model accurate and our second classifier proves it!