Introduction :-

In this report, I attempt to classify the sex of cats based on the observations in the cats data set.


Exploratory Data Analysis :-

Exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often with visual methods.

Structure of given dataset :-

The given data set contains observations on 144 cats, with 3 attributes recorded for each cat.

A sample of the data set is as follows.

##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
  • Explanation of all the variables :-

    1. Sex: Sex of the cat, with two classes ( Male / Female )

    2. Bwt: Body weight of the cat (in kg)

    3. Hwt: Heart weight of the cat (in g)

The detailed structure is as follows.

## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

The type of each attribute in the input is as follows.

##       Sex       Bwt       Hwt 
##  "factor" "numeric" "numeric"

These types are suitable for our classification analysis, so we can proceed further.
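A minimal sketch of how this structure can be inspected, assuming the data come from the cats data set in the MASS package (the object names and calls below are assumptions, not the report's original code):

library(MASS)            # provides the cats data set
data(cats)

str(cats)                # 144 obs. of 3 variables: Sex, Bwt, Hwt
sapply(cats, class)      # attribute types: factor, numeric, numeric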


  • Dealing with missing (NA) values :-

The number of missing values in each column is as follows.

## Sex Bwt Hwt 
##   0   0   0

As there are no missing values, we can proceed further.
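A short sketch of this missing-value check, assuming the data frame is named cats:

colSums(is.na(cats))     # number of NA values in each column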


  • Summary :-

The overall summary of all the attributes is as follows.

##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50

The distribution of all continuous variables is as follows.


The distribution of all continuous variables in each category is as follows.

  1. Bwt:-
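The plots themselves are not reproduced here; a minimal sketch of how such distribution plots could be drawn in base R (the plot titles are illustrative assumptions) is:

# Overall distributions of the continuous variables
hist(cats$Bwt, main = "Body weight (Bwt)", xlab = "Bwt")
hist(cats$Hwt, main = "Heart weight (Hwt)", xlab = "Hwt")

# Distributions of each continuous variable within each Sex category
boxplot(Bwt ~ Sex, data = cats, main = "Bwt by Sex")
boxplot(Hwt ~ Sex, data = cats, main = "Hwt by Sex")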

The correlation between the continuous variables is as follows.

##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000
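This correlation matrix can be reproduced with, for example:

cor(cats[, c("Bwt", "Hwt")])   # Pearson correlation between Bwt and Hwt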


Description of EDA :-

In our data set, there are 144 observations on 47 female and 97 male cats, with no missing values. Body weight (Bwt) and heart weight (Hwt) are strongly positively correlated (about 0.80), so the two predictors carry overlapping information for classifying Sex.


Support Vector Machines ( Classifiers ) :-

Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. With suitable kernel functions, they can solve both linear and non-linear problems.


Fitting the SVM Classifier :-

With Default Parameters :-

As a first step, I fit a support vector machine classifier with the default values of the hyperparameters cost ( C ) and gamma (\(\gamma\)).
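A minimal sketch of this fit, assuming the e1071 package and that the fitted model is stored as svm_default (the object name is an assumption):

library(e1071)                            # provides svm() and tune()
svm_default <- svm(Sex ~ ., data = cats)  # defaults: RBF kernel, cost = 1
summary(svm_default)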

The summary of the fitted default SVM classifier is as follows :-

## 
## Call:
## svm(formula = Sex ~ ., data = cats)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  84
## 
##  ( 39 45 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  F M

We can observe that,

  • Number of support vectors (data points defining the margin): 84
    • 39 female data points
    • 45 male data points
  • The algorithm uses the default C-classification type.
  • The kernel selected is the radial basis function (RBF).
  • The default cost value ( C ) is 1.

Plotting on training Data :-
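The decision-region plot referred to below can be produced with the plot method from e1071, again assuming the fitted model object is named svm_default:

plot(svm_default, cats)   # decision regions over Bwt and Hwt, support vectors marked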

From this plot, we can roughly see that the classification is not very accurate. But what is the exact number of misclassified data points?

Confusion Matrix on training data :-

##     predicted
## true  F  M
##    F 33 14
##    M 14 83

Correct classification rate:-

The accuracy of this classifier on Training Data is 80.56 %.
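A sketch of how this confusion matrix and accuracy can be computed, again assuming the model object svm_default:

pred_default <- predict(svm_default, cats)        # predictions on the training data
table(true = cats$Sex, predicted = pred_default)  # confusion matrix
mean(pred_default == cats$Sex)                    # correct classification rate (~0.8056)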

Parameter Tuning ( grid search ) :-

As we can see, the accuracy of the SVM model with default parameters is not very good. We can tune the parameters C and gamma (\(\gamma\)) to change the smoothness of the fitted decision boundary ( tuning the hyperplane ) so that the data points are classified more accurately than before.
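A sketch of such a grid search with e1071::tune follows. The exact grid of gamma and cost values used in this report is not shown, so the ranges below are assumptions; the sampling argument of tune.control selects between the bootstrapping, fixed-split, and cross-validation variants discussed below.

set.seed(1)   # tuning involves random resampling
tuned_boot <- tune(svm, Sex ~ ., data = cats,
                   ranges = list(gamma = seq(0.1, 1.0, by = 0.1),
                                 cost  = seq(0.4, 20, by = 0.4)),
                   tunecontrol = tune.control(sampling = "bootstrap"))
summary(tuned_boot)          # best gamma / cost and best performance
tuned_boot$best.parameters   # e.g. gamma = 0.6, cost = 4
plot(tuned_boot)             # tuning plot over the (gamma, cost) grid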

  • The tuning summary obtained using the bootstrapping sampling method is as follows.
## 
## Parameter tuning of 'svm':
## 
## - sampling method: bootstrapping 
## 
## - best parameters:
##  gamma cost
##    0.6    4
## 
## - best performance: 0.2241626

We can observe that the recommended best parameters from the bootstrapping sampling method are :-

  • \(\gamma\) : 0.6

  • C : 4

  • The tuning Plot is as follows :-

Fitting the SVM classifier with new parameters :-

The summary of the fitted classifier is as follows.

## 
## Call:
## svm(formula = Sex ~ ., data = cats, gamma = gam, cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  4 
## 
## Number of Support Vectors:  79
## 
##  ( 36 43 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  F M

We can observe that,

  • Number of support vectors (data points defining the margin): 79
    • 36 female data points
    • 43 male data points
  • The algorithm uses the default C-classification type.
  • The kernel selected is the radial basis function (RBF).
  • The cost value ( C ) is 4 ( set by us ).
  • The gamma value ( \(\gamma\) ) is 0.6 ( set by us ).

Plotting on training Data :-

From this plot, we can roughly see that the classification is not noticeably more accurate. But what is the exact number of misclassified data points?

Confusion Matrix on training data :-

##     predicted
## true  F  M
##    F 32 15
##    M 15 82

Correct classification rate:-

The accuracy of this classifier on Training Data is 79.17 %.

The accuracy of Default SVM classifier on Training Data is 80.56 %.

Interestingly, the accuracy is no better than that of the default model; it is in fact slightly lower.

  • The tuning summary obtained using the fixed training/validation set sampling method is as follows.
## 
## Parameter tuning of 'svm':
## 
## - sampling method: fixed training/validation set 
## 
## - best parameters:
##  gamma cost
##    0.2 15.6
## 
## - best performance: 0.2708333

We can observe that the recommended best parameters from the fixed sampling method are :-

  • \(\gamma\) : 0.2

  • C : 15.6

  • The tuning Plot is as follows :-

Fitting the SVM classifier with new parameters :-

The summary of the fitted classifier is as follows.
## 
## Call:
## svm(formula = Sex ~ ., data = cats, gamma = gam, cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  15.6 
## 
## Number of Support Vectors:  76
## 
##  ( 36 40 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  F M

We can observe that,

  • Number of support vectors (data points defining the margin): 76
    • 36 female data points
    • 40 male data points
  • The algorithm uses the default C-classification type.
  • The kernel selected is the radial basis function (RBF).
  • The cost value ( C ) is 15.6 ( set by us ).
  • The gamma value ( \(\gamma\) ) is 0.2 ( set by us ).

Plotting on training Data :-

From this plot, we can roughly see that the classification is not noticeably more accurate. But what is the exact number of misclassified data points?

Confusion Matrix on training data :-

##     predicted
## true  F  M
##    F 31 16
##    M 14 83

Correct classification rate:-

The accuracy of this classifier on Training Data is 79.17 %.

The accuracy of Default SVM classifier on Training Data is 80.56 %.

Interestingly, the accuracy is no better than that of the default model; it is in fact slightly lower.

  • The tuning summary obtained using the 10-fold cross-validation sampling method is as follows.
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##    0.1  4.8
## 
## - best performance: 0.2028571

We can observe that the recommended best parameters from the cross-validation sampling method are :-

  • \(\gamma\) : 0.1

  • C : 4.8

  • The tuning Plot is as follows :-

Fitting the SVM classifier with new parameters :-

The summary of the fitted classifier is as follows.
## 
## Call:
## svm(formula = Sex ~ ., data = cats, gamma = gam, cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  4.8 
## 
## Number of Support Vectors:  81
## 
##  ( 40 41 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  F M

We can observe that,

  • Number of support vectors (data points defining the margin): 81
    • 40 female data points
    • 41 male data points
  • The algorithm uses the default C-classification type.
  • The kernel selected is the radial basis function (RBF).
  • The cost value ( C ) is 4.8 ( set by us ).
  • The gamma value ( \(\gamma\) ) is 0.1 ( set by us ).

Plotting on training Data :-

From this plot, we can roughly see that the classification is not noticeably more accurate. But what is the exact number of misclassified data points?

Confusion Matrix on training data :-

##     predicted
## true  F  M
##    F 30 17
##    M 14 83

Correct classification rate:-

The accuracy of this classifier on Training Data is 78.47 %.

The accuracy of Default SVM classifier on Training Data is 80.56 %.

Interestingly, the accuracy is no better than that of the default model; it is in fact lower.

Predicting New Unseen Data:-

For this prediction, I take a new data set with the following values.
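The new observation itself is not listed above; as a sketch, using hypothetical values Bwt = 3.0 and Hwt = 12.0 (both assumptions) and the model objects named as in the earlier sketches, the predictions are obtained as follows:

new_cat <- data.frame(Bwt = 3.0, Hwt = 12.0)   # hypothetical new cat

predict(svm_default, new_cat)                  # default SVM classifier
predict(tuned_boot$best.model, new_cat)        # bootstrap-tuned SVM classifier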

The sex of the cat ( as per the default SVM classifier ) :-

## 1 
## M 
## Levels: F M

The sex of the cat ( as per the bootstrap-tuned SVM classifier ) :-

## 1 
## M 
## Levels: F M

The sex of the cat ( as per the fixed-sample-tuned SVM classifier ) :-

## 1 
## M 
## Levels: F M

The sex of the cat ( as per the cross-validation-tuned SVM classifier ) :-

## 1 
## M 
## Levels: F M
  • Fitting a support vector machine classifier to this data set is possible.


————————————————————- THANK YOU ————————————————————-