Execute by Neha Raut
This program shows the classification of Iris data using Support Vector Machines classifier.
#e1071 will be used for Support Vector Classification.
library("e1071")
library(GGally)
library(ggplot2)
data(iris)
#first 4 are numeric and 5th is factor
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
We will use sepal length, sepal width, petal length and petal width to predict the species of Flower.
svm_model <- svm(Species ~ ., data=iris,
kernel="radial") #linear/polynomial/sigmoid
Since we are dealing with species which is char type, you will get svm type as a classsification defaulty, If data is quantitative that is continuous, you will get as regression
Lets have a closer look at the parameters and judge before hand if a good model can be created or not.
ggpairs(iris, ggplot2::aes(colour = Species, alpha = 0.4))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I personally like this graph because you can deduce so much information from one single chart. Lets have a look!
We can clearly see from the Histograms of Petal.length and Petal.width that we can clearly seperate out Setosa species with very high confidence.
However, Versicolor and Virginica Species are overlapped. If we look at the scatterplot of Sepal.Length vs Petal.Length and Petal.Width vs Petal.Length, we can distintly see a seperator that can be draw between the groups of Species.
Looks like we can just use Petal.Width and Petal.Length as parameters and come with a good model. SVM seems to be a very good model for this type of data.
plot(svm_model, data=iris,
Petal.Width~Petal.Length,
slice = list(Sepal.Width=3, Sepal.Length=4)
)
from graph you can see data, support vector(represented by cross sign) and decision boundry, belong to 3 types of species
White color represented predicted class for second species(versicolor)
Pink color represented predicted class for third species(virginica)
Also we have 52 Support vector, 8 of them belongs to first species(You can see 8 cross in first class), 22 of them belongs to second species, 21 of them belongs to third species.
Confusion matrix and missclasscation Error
pred = predict(svm_model,iris)
tab = table(Predicted=pred, Actual = iris$Species)
tab
## Actual
## Predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 2 48
Out of the total data points, 50 data points belongs to first species and model also predicted 50 belongs to first species, same for second and third category.
Here, 2 data points belong to third category(virginica) but model predicted them to be from second category(versicolor) Similarly 2 data points belong to second category(versicolor) but model predicted them to be from third category(virginica).So here is missclassification
Get missclassification rate
1-sum(diag(tab)/sum(tab))
## [1] 0.02666667
I have find missclassification rate for remaining three kernal,
For linear,I get missclassification rate: 0.03333333 i.e% 3.3% (higher than radial)
For polynomial, missclassification rate: 0.04666667 i.e 4.7%
for sigmoid,missclassification rate: 0.1133333 i.e 11.3% <- worst case
Hence Radial One is best
It help you to select best model
eps= start with 0 and end with 1 so 11 different values(0.1,0.2,…..1.0)
cost capture cost of constraint voilation and default values of cost is 1 If cost is too high this means high penalty for non seperable points [Model may store too many support vectors], which will lead to overfitting Similarly too small cost value will lead to underfitting
Here i am trying 11*7=77 different combinations
set.seed(123)
tmodel=tune(svm,Species~., data=iris,
ranges=list(epsilon= seq(0,1,0.1), cost = 2^(2:7)))
plot(tmodel)
Plot gives us performance evauation of SVM for 2 parameter we have used.Dark region means Lower missclassification range
summary(tmodel)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## epsilon cost
## 0 8
##
## - best performance: 0.03333333
##
## - Detailed performance results:
## epsilon cost error dispersion
## 1 0.0 4 0.04666667 0.06324555
## 2 0.1 4 0.04666667 0.06324555
## 3 0.2 4 0.04666667 0.06324555
## 4 0.3 4 0.04666667 0.06324555
## 5 0.4 4 0.04666667 0.06324555
## 6 0.5 4 0.04666667 0.06324555
## 7 0.6 4 0.04666667 0.06324555
## 8 0.7 4 0.04666667 0.06324555
## 9 0.8 4 0.04666667 0.06324555
## 10 0.9 4 0.04666667 0.06324555
## 11 1.0 4 0.04666667 0.06324555
## 12 0.0 8 0.03333333 0.06478835
## 13 0.1 8 0.03333333 0.06478835
## 14 0.2 8 0.03333333 0.06478835
## 15 0.3 8 0.03333333 0.06478835
## 16 0.4 8 0.03333333 0.06478835
## 17 0.5 8 0.03333333 0.06478835
## 18 0.6 8 0.03333333 0.06478835
## 19 0.7 8 0.03333333 0.06478835
## 20 0.8 8 0.03333333 0.06478835
## 21 0.9 8 0.03333333 0.06478835
## 22 1.0 8 0.03333333 0.06478835
## 23 0.0 16 0.04000000 0.06440612
## 24 0.1 16 0.04000000 0.06440612
## 25 0.2 16 0.04000000 0.06440612
## 26 0.3 16 0.04000000 0.06440612
## 27 0.4 16 0.04000000 0.06440612
## 28 0.5 16 0.04000000 0.06440612
## 29 0.6 16 0.04000000 0.06440612
## 30 0.7 16 0.04000000 0.06440612
## 31 0.8 16 0.04000000 0.06440612
## 32 0.9 16 0.04000000 0.06440612
## 33 1.0 16 0.04000000 0.06440612
## 34 0.0 32 0.04666667 0.07062333
## 35 0.1 32 0.04666667 0.07062333
## 36 0.2 32 0.04666667 0.07062333
## 37 0.3 32 0.04666667 0.07062333
## 38 0.4 32 0.04666667 0.07062333
## 39 0.5 32 0.04666667 0.07062333
## 40 0.6 32 0.04666667 0.07062333
## 41 0.7 32 0.04666667 0.07062333
## 42 0.8 32 0.04666667 0.07062333
## 43 0.9 32 0.04666667 0.07062333
## 44 1.0 32 0.04666667 0.07062333
## 45 0.0 64 0.04666667 0.07062333
## 46 0.1 64 0.04666667 0.07062333
## 47 0.2 64 0.04666667 0.07062333
## 48 0.3 64 0.04666667 0.07062333
## 49 0.4 64 0.04666667 0.07062333
## 50 0.5 64 0.04666667 0.07062333
## 51 0.6 64 0.04666667 0.07062333
## 52 0.7 64 0.04666667 0.07062333
## 53 0.8 64 0.04666667 0.07062333
## 54 0.9 64 0.04666667 0.07062333
## 55 1.0 64 0.04666667 0.07062333
## 56 0.0 128 0.06000000 0.07981460
## 57 0.1 128 0.06000000 0.07981460
## 58 0.2 128 0.06000000 0.07981460
## 59 0.3 128 0.06000000 0.07981460
## 60 0.4 128 0.06000000 0.07981460
## 61 0.5 128 0.06000000 0.07981460
## 62 0.6 128 0.06000000 0.07981460
## 63 0.7 128 0.06000000 0.07981460
## 64 0.8 128 0.06000000 0.07981460
## 65 0.9 128 0.06000000 0.07981460
## 66 1.0 128 0.06000000 0.07981460
mymodel=tmodel$best.model
summary(mymodel)
##
## Call:
## best.tune(method = svm, train.x = Species ~ ., data = iris, ranges = list(epsilon = seq(0,
## 1, 0.1), cost = 2^(2:7)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 8
## gamma: 0.25
##
## Number of Support Vectors: 35
##
## ( 6 15 14 )
##
##
## Number of Classes: 3
##
## Levels:
## setosa versicolor virginica
mymodel=tmodel$best.model
summary(mymodel)
##
## Call:
## best.tune(method = svm, train.x = Species ~ ., data = iris, ranges = list(epsilon = seq(0,
## 1, 0.1), cost = 2^(2:7)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 8
## gamma: 0.25
##
## Number of Support Vectors: 35
##
## ( 6 15 14 )
##
##
## Number of Classes: 3
##
## Levels:
## setosa versicolor virginica
plot(mymodel, data=iris,
Petal.Width~Petal.Length,
slice = list(Sepal.Width=3, Sepal.Length=4)
)
Look at confusion matrix and missclassification rate using best parameter
pred1 = predict(mymodel,iris)
tab1 = table(Predicted=pred1, Actual = iris$Species)
tab1
## Actual
## Predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 48 0
## virginica 0 2 50
You can see out of 150 data points, only two are missclassified
1-sum(diag(tab1)/sum(tab1))
## [1] 0.01333333