Introduction :-
In this report, I attempt to classify the sex of cats based on the observations in the cats data set.
Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.
Structure of given dataset :-
The given dataset describes 144 cats, and each cat is observed on 3 attributes.
A sample of the dataset is as follows.
## Sex Bwt Hwt
## 1 F 2.0 7.0
## 2 F 2.0 7.4
## 3 F 2.0 9.5
## 4 F 2.1 7.2
## 5 F 2.1 7.3
## 6 F 2.1 7.6
Explanation of all the variables :-
Sex: Sex of the cat, with two classes ( F = Female / M = Male )
Bwt: Body weight of the cat ( in kg )
Hwt: Heart weight of the cat ( in g )
The detailed structure is as follows.
## 'data.frame': 144 obs. of 3 variables:
## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
In the input, the type of each attribute is as follows.
## Sex Bwt Hwt
## "factor" "numeric" "numeric"
For our classification analysis these types are suitable: the response is a factor and both predictors are numeric. We can proceed further.
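The inspection steps above can be sketched as follows. This is a minimal sketch that assumes the data is the `cats` data set shipped with the MASS package, which matches the dimensions and column names shown here; the report itself does not state where the data was loaded from.

```r
# Assumption: the data is MASS::cats (144 rows; columns Sex, Bwt, Hwt).
library(MASS)
data(cats)

head(cats)                            # first six rows
str(cats)                             # dimensions and column types
col_types <- sapply(cats, class)      # class of each column
col_types
```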
- Dealing with missing ( NULL ) values :-
The number of missing values in each column is as follows.
## Sex Bwt Hwt
## 0 0 0
As there are no missing values, we can proceed further.
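One way to produce such a per-column count is sketched below. It assumes the data is `MASS::cats` and counts `NA` entries, which is how R usually represents missing observations; the report does not show the exact check it used.

```r
# Assumption: the data is MASS::cats; missing values are coded as NA.
library(MASS)
data(cats)

# Number of missing entries in each column
na_counts <- colSums(is.na(cats))
na_counts
```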
- Summary :-
The overall summary of all the attributes is as follows.
## Sex Bwt Hwt
## F:47 Min. :2.000 Min. : 6.30
## M:97 1st Qu.:2.300 1st Qu.: 8.95
## Median :2.700 Median :10.10
## Mean :2.724 Mean :10.63
## 3rd Qu.:3.025 3rd Qu.:12.12
## Max. :3.900 Max. :20.50
The distribution of all continuous variables is as follows.
The distribution of all continuous variables in each category is as follows.
- Bwt:-
The correlation between the continuous variables is as follows.
## Bwt Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000
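The correlation matrix above, and per-class distribution plots, can be sketched as follows. The sketch assumes the data is `MASS::cats`; the choice of box plots is an illustration, since the plot type used in the report is not recoverable from the text.

```r
# Assumption: the data is MASS::cats.
library(MASS)
data(cats)

# Pearson correlation between the two continuous variables
cor_mat <- cor(cats[, c("Bwt", "Hwt")])
cor_mat

# Distribution of each continuous variable within each sex (illustrative choice)
boxplot(Bwt ~ Sex, data = cats, ylab = "Bwt")
boxplot(Hwt ~ Sex, data = cats, ylab = "Hwt")
```

A correlation of about 0.80 means the two predictors carry largely overlapping information, which limits how much a second predictor can add to the classifier.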
Description of EDA :-
In our data set,
Support Vector Machine ( Classifier ) :-
Support vector machines are supervised learning models, with associated learning algorithms, that analyze data for classification and regression analysis. With a linear kernel they fit a linear decision boundary; with a non-linear kernel, such as the radial basis function, they can also solve non-linear problems.
Fitting SVM - Classifier :-
With Default Parameters :-
As a first step, I fit a Support Vector Machine classifier with the default values of the hyperparameters Cost ( C ) and Gamma (\(\gamma\)).
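For reference, the radial basis function kernel on which \(\gamma\) acts has the standard form below. The symbols \(x\) and \(x'\) (two observations) are notation introduced here for illustration and do not come from the report.

\[ K(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert^{2}\right) \]

A larger \(\gamma\) makes the kernel more local, so the decision boundary can bend more closely around individual points. A larger C penalizes margin violations more heavily, which tends to reduce training errors at the cost of a narrower margin.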
The summary of the fitted default SVM_classifier :-
##
## Call:
## svm(formula = Sex ~ ., data = cats)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 84
##
## ( 39 45 )
##
##
## Number of Classes: 2
##
## Levels:
## F M
We can observe that,
- Number of support vectors ( the data points that define the decision boundary ) : 84
- 39 Female Data Points
- 45 Male Data Points
- The algorithm uses the default C-classification type.
- Kernel selected : Radial basis function (RBF)
- Default Cost Value ( C ) is 1
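The default fit summarized above can be sketched as follows. The `svm()` call matches the one printed in the summary and comes from the e1071 package; loading the data from MASS is an assumption. The summary does not print \(\gamma\), so the sketch reads it back from the fitted object rather than assuming a value.

```r
# Assumption: the data is MASS::cats; svm() is provided by e1071.
library(MASS)
library(e1071)
data(cats)

# Default fit: C-classification with a radial kernel and cost = 1
svm_default <- svm(Sex ~ ., data = cats)
summary(svm_default)

svm_default$tot.nSV   # total number of support vectors
svm_default$nSV       # support vectors per class
svm_default$gamma     # gamma actually used by the default fit

# Decision regions over the two predictors
plot(svm_default, cats)
```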
Plotting on training Data :-
From this plot, we can see at a glance that the classification is not very accurate. But what is the exact number of misclassified data points?
Confusion Matrix on training data :-
## predicted
## true F M
## F 33 14
## M 14 83
Correct classification rate:-
The accuracy of this classifier on Training Data is 80.56 %.
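The accuracy follows from the confusion matrix: the correctly classified cats are on the diagonal, so (33 + 83) / 144 = 116 / 144, which is about 80.56 %. A minimal sketch of the computation, assuming the data is `MASS::cats` and the default e1071 fit; the variable names are illustrative:

```r
# Assumption: the data is MASS::cats and the model is the default e1071 fit.
library(MASS)
library(e1071)
data(cats)

svm_default <- svm(Sex ~ ., data = cats)

# Predict on the training data and cross-tabulate against the true labels
pred <- predict(svm_default, newdata = cats)
conf_mat <- table(true = cats$Sex, predicted = pred)
conf_mat

# Correct classification rate = diagonal total / overall total
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
round(100 * accuracy, 2)
```

Because this rate is measured on the same data the model was fitted to, it is likely to be an optimistic estimate of performance on new cats.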
Parameters Tuning ( grid search ):-
As we can see, the accuracy of the SVM model with default parameters is not very good. We can tune the parameters C and Gamma (\(\gamma\)), which changes the smoothness of the fitted decision boundary, to try to classify the data points more accurately than before.
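A grid search of this kind can be sketched with `tune()` from e1071, switching the resampling scheme through `tune.control()`. The grid below is hypothetical, since the grid used in the report is not shown, and the seed is arbitrary; results from resampling will vary with both.

```r
# Assumption: the data is MASS::cats; the parameter grid is hypothetical.
library(MASS)
library(e1071)
data(cats)

set.seed(1)  # arbitrary seed; resampling is random

grid <- list(gamma = seq(0.1, 1, by = 0.1), cost = 2^(0:4))

# Helper (illustrative name): run the same grid under a given sampling scheme
tune_with <- function(sampling) {
  tune(svm, Sex ~ ., data = cats, ranges = grid,
       tunecontrol = tune.control(sampling = sampling))
}

tuned_boot  <- tune_with("bootstrap")  # bootstrapping
tuned_fix   <- tune_with("fix")        # fixed training/validation split
tuned_cross <- tune_with("cross")      # 10-fold cross-validation (default)

summary(tuned_cross)
plot(tuned_cross)                      # error surface over the grid

# Refit on the full data with the selected parameters
best <- tuned_cross$best.parameters
svm_cross <- svm(Sex ~ ., data = cats, gamma = best$gamma, cost = best$cost)
```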
- The tuning summary using the bootstrap sampling method is as follows.
##
## Parameter tuning of 'svm':
##
## - sampling method: bootstrapping
##
## - best parameters:
## gamma cost
## 0.6 4
##
## - best performance: 0.2241626
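We can observe that the best parameters recommended by the bootstrap sampling method are
\(\gamma\) : 0.6
C : 4
The best performance value is an error estimate ( by default in tune, the misclassification rate ), so 0.2241626 corresponds to an estimated accuracy of about 77.58 %.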
The tuning Plot is as follows :-
Fitting SVM_classifier with new parameters :-
The summary of the fitted classifier is as follows.
##
## Call:
## svm(formula = Sex ~ ., data = cats, gamma = gam, cost = cos)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 4
##
## Number of Support Vectors: 79
##
## ( 36 43 )
##
##
## Number of Classes: 2
##
## Levels:
## F M
We can observe that,
- Number of support vectors ( the data points that define the decision boundary ) : 79
- 36 Female Data Points
- 43 Male Data Points
- The algorithm uses the default C-classification type.
- Kernel selected : Radial basis function (RBF)
- Cost Value ( C ) is 4 ( Controlled by us)
- Gamma Value ( \(\gamma\) ) is 0.6 ( Controlled by us)
Plotting on training Data :-
From this plot, we can see at a glance that the classification is still not very accurate. But what is the exact number of misclassified data points?
Confusion Matrix on training data :-
## predicted
## true F M
## F 32 15
## M 15 82
Correct classification rate:-
The accuracy of this classifier on Training Data is 79.17 %.
The accuracy of Default SVM classifier on Training Data is 80.56 %.
Interestingly, the tuned model is slightly less accurate on the training data than the default model.
- The tuning summary using the fixed training/validation sampling method is as follows.
##
## Parameter tuning of 'svm':
##
## - sampling method: fixed training/validation set
##
## - best parameters:
## gamma cost
## 0.2 15.6
##
## - best performance: 0.2708333
We can observe that the best parameters recommended by the fixed sampling method are
\(\gamma\) : 0.2
C : 15.6
The tuning Plot is as follows :-
Fitting SVM_classifier with new parameters :-
The summary of the fitted classifier is as follows.
##
## Call:
## svm(formula = Sex ~ ., data = cats, gamma = gam, cost = cos)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 15.6
##
## Number of Support Vectors: 76
##
## ( 36 40 )
##
##
## Number of Classes: 2
##
## Levels:
## F M
We can observe that,
- Number of support vectors ( the data points that define the decision boundary ) : 76
- 36 Female Data Points
- 40 Male Data Points
- The algorithm uses the default C-classification type.
- Kernel selected : Radial basis function (RBF)
- Cost Value ( C ) is 15.6 ( Controlled by us)
- Gamma Value ( \(\gamma\) ) is 0.2 ( Controlled by us)
Plotting on training Data :-
From this plot, we can see at a glance that the classification is still not very accurate. But what is the exact number of misclassified data points?
Confusion Matrix on training data :-
## predicted
## true F M
## F 31 16
## M 14 83
Correct classification rate:-
The accuracy of this classifier on Training Data is 79.17 %.
The accuracy of Default SVM classifier on Training Data is 80.56 %.
Interestingly, the tuned model is slightly less accurate on the training data than the default model.
- The tuning summary using the 10-fold cross-validation sampling method is as follows.
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.1 4.8
##
## - best performance: 0.2028571
We can observe that the best parameters recommended by the cross-validation sampling method are
\(\gamma\) : 0.1
C : 4.8
The tuning Plot is as follows :-
Fitting SVM_classifier with new parameters :-
The summary of the fitted classifier is as follows.
##
## Call:
## svm(formula = Sex ~ ., data = cats, gamma = gam, cost = cos)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 4.8
##
## Number of Support Vectors: 81
##
## ( 40 41 )
##
##
## Number of Classes: 2
##
## Levels:
## F M
We can observe that,
- Number of support vectors ( the data points that define the decision boundary ) : 81
- 40 Female Data Points
- 41 Male Data Points
- The algorithm uses the default C-classification type.
- Kernel selected : Radial basis function (RBF)
- Cost Value ( C ) is 4.8 ( Controlled by us)
- Gamma Value ( \(\gamma\) ) is 0.1 ( Controlled by us)
Plotting on training Data :-
From this plot, we can see at a glance that the classification is still not very accurate. But what is the exact number of misclassified data points?
Confusion Matrix on training data :-
## predicted
## true F M
## F 30 17
## M 14 83
Correct classification rate:-
The accuracy of this classifier on Training Data is 78.47 %.
The accuracy of Default SVM classifier on Training Data is 80.56 %.
Interestingly, the tuned model is slightly less accurate on the training data than the default model.
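The four training accuracies can be compared side by side directly from the confusion matrices reported above. The sketch below uses only those reported counts; the list and variable names are illustrative.

```r
# Confusion matrices as reported in this document (rows = true F, M; columns = predicted F, M)
conf_mats <- list(
  default   = matrix(c(33, 14, 14, 83), nrow = 2, byrow = TRUE),
  bootstrap = matrix(c(32, 15, 15, 82), nrow = 2, byrow = TRUE),
  fixed     = matrix(c(31, 16, 14, 83), nrow = 2, byrow = TRUE),
  cross     = matrix(c(30, 17, 14, 83), nrow = 2, byrow = TRUE)
)

# Correct classification rate in percent = diagonal total / overall total
accuracy_pct <- sapply(conf_mats, function(m) 100 * sum(diag(m)) / sum(m))
round(accuracy_pct, 2)
# default 80.56, bootstrap 79.17, fixed 79.17, cross 78.47
```

The differences amount to two or three cats out of 144, so on training data alone the four models are close to indistinguishable.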
Predicting New Unseen Data:-
For this prediction, I use a new observation with the following values.
- Hwt=18,
- Bwt=3.5
The SEX of the cat ( As per default SVM classifier):-
## 1
## M
## Levels: F M
The SEX of the cat ( As per bootstrap-tuned SVM classifier):-
## 1
## M
## Levels: F M
The SEX of the cat ( As per fixed-sampling-tuned SVM classifier):-
## 1
## M
## Levels: F M
The SEX of the cat ( As per cross-validation-tuned SVM classifier):-
## 1
## M
## Levels: F M
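Predictions like those above can be sketched as follows, shown here for the default classifier only. It assumes the data is `MASS::cats`; the new observation must be a data frame with the same predictor names as the training data.

```r
# Assumption: the data is MASS::cats and the model is the default e1071 fit.
library(MASS)
library(e1071)
data(cats)

svm_default <- svm(Sex ~ ., data = cats)

# New, unseen observation with the values used in this document
new_cat <- data.frame(Bwt = 3.5, Hwt = 18)

new_pred <- predict(svm_default, newdata = new_cat)
new_pred
```

The same call applies to each tuned classifier by swapping in the corresponding fitted model.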
- Fitting a Support Vector Machine to this data set is possible.
Correct classification rate:-
- The maximum training accuracy achieved with the SVM classifiers above is 80.56 %.
- The best SVM classifier on the training data is the one with default parameters.