Introduction :-

In this report, I perform a complete analysis of the Wine dataset.


Exploratory Data Analysis :-

Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

The given dataset has 178 observations, each with 14 attributes. The first few rows of the dataset are as follows.
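
A minimal sketch of how the data could be loaded and inspected (the file name wine.csv and the data-frame name df are assumptions, not taken from the original source):

df <- read.csv("wine.csv")   # assumed file name
head(df)                     # first six rows, shown below

The detailed structure shown further below corresponds to str(df).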

##   class Alcohol Malic.acid  Ash Alcalinity.of.ash Magnesium Total.phenols
## 1     1   14.23       1.71 2.43              15.6       127          2.80
## 2     1   13.20       1.78 2.14              11.2       100          2.65
## 3     1   13.16       2.36 2.67              18.6       101          2.80
## 4     1   14.37       1.95 2.50              16.8       113          3.85
## 5     1   13.24       2.59 2.87              21.0       118          2.80
## 6     1   14.20       1.76 2.45              15.2       112          3.27
##   Flavanoids Nonflavanoid.phenols Proanthocyanins Color.intensity  Hue
## 1       3.06                 0.28            2.29            5.64 1.04
## 2       2.76                 0.26            1.28            4.38 1.05
## 3       3.24                 0.30            2.81            5.68 1.03
## 4       3.49                 0.24            2.18            7.80 0.86
## 5       2.69                 0.39            1.82            4.32 1.04
## 6       3.39                 0.34            1.97            6.75 1.05
##   OD280.OD315.of.diluted.wines Proline
## 1                         3.92    1065
## 2                         3.40    1050
## 3                         3.17    1185
## 4                         3.45    1480
## 5                         2.93     735
## 6                         2.85    1450

The detailed structure is as follows.

## 'data.frame':    178 obs. of  14 variables:
##  $ class                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol                     : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic.acid                  : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash                         : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity.of.ash           : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium                   : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Total.phenols               : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids                  : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoid.phenols        : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins             : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color.intensity             : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue                         : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ OD280.OD315.of.diluted.wines: num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline                     : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

In the input, the type of each attribute is as follows.

##                        class                      Alcohol 
##                    "integer"                    "numeric" 
##                   Malic.acid                          Ash 
##                    "numeric"                    "numeric" 
##            Alcalinity.of.ash                    Magnesium 
##                    "numeric"                    "integer" 
##                Total.phenols                   Flavanoids 
##                    "numeric"                    "numeric" 
##         Nonflavanoid.phenols              Proanthocyanins 
##                    "numeric"                    "numeric" 
##              Color.intensity                          Hue 
##                    "numeric"                    "numeric" 
## OD280.OD315.of.diluted.wines                      Proline 
##                    "numeric"                    "integer"

Note that class is not really an integer value: it is a factor with levels 1 / 2 / 3.

I therefore convert class to a factor. After the conversion, the type of each attribute is as follows.
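
A minimal sketch of the conversion (assuming the data frame is named df):

df$class <- as.factor(df$class)   # treat the wine class as a categorical label
sapply(df, class)                 # re-check the attribute types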

##                        class                      Alcohol 
##                     "factor"                    "numeric" 
##                   Malic.acid                          Ash 
##                    "numeric"                    "numeric" 
##            Alcalinity.of.ash                    Magnesium 
##                    "numeric"                    "integer" 
##                Total.phenols                   Flavanoids 
##                    "numeric"                    "numeric" 
##         Nonflavanoid.phenols              Proanthocyanins 
##                    "numeric"                    "numeric" 
##              Color.intensity                          Hue 
##                    "numeric"                    "numeric" 
## OD280.OD315.of.diluted.wines                      Proline 
##                    "numeric"                    "integer"

The number of null values in each column is as follows.
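
The counts below can be obtained with a one-liner such as (a sketch, again assuming the data frame df):

colSums(is.na(df))   # number of missing values per column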

##                        class                      Alcohol 
##                            0                            0 
##                   Malic.acid                          Ash 
##                            0                            0 
##            Alcalinity.of.ash                    Magnesium 
##                            0                            0 
##                Total.phenols                   Flavanoids 
##                            0                            0 
##         Nonflavanoid.phenols              Proanthocyanins 
##                            0                            0 
##              Color.intensity                          Hue 
##                            0                            0 
## OD280.OD315.of.diluted.wines                      Proline 
##                            0                            0

As there are no null values, we can proceed further.


The overall summary of all the attributes is as follows.

##  class     Alcohol        Malic.acid         Ash        Alcalinity.of.ash
##  1:59   Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60    
##  2:71   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20    
##  3:48   Median :13.05   Median :1.865   Median :2.360   Median :19.50    
##         Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49    
##         3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50    
##         Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00    
##    Magnesium      Total.phenols     Flavanoids    Nonflavanoid.phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700      
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400      
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color.intensity       Hue         OD280.OD315.of.diluted.wines
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270               
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938               
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780               
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612               
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170               
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000               
##     Proline      
##  Min.   : 278.0  
##  1st Qu.: 500.5  
##  Median : 673.5  
##  Mean   : 746.9  
##  3rd Qu.: 985.0  
##  Max.   :1680.0

The distribution of all continuous variables is as follows.
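
A minimal base-R sketch that produces one histogram per continuous variable:

# plot the distribution of every continuous variable (all columns except class)
par(mfrow = c(4, 4))
for (v in names(df)[-1]) {
  hist(df[[v]], main = v, xlab = v)
}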


The distribution of each continuous variable within each class category is as follows (a sketch of the plotting code appears after the list).

  1. Alcohol:-

  2. Malic.acid:-

  3. Ash:-

  4. Alcalinity.of.ash:-

  5. Magnesium:-

  6. Total.phenols:-

  7. Flavanoids:-

  8. Nonflavanoid.phenols:-

  9. Proanthocyanins:-

  10. Color.intensity:-

  11. Hue:-

  12. OD280.OD315.of.diluted.wines:-

  13. Proline:-
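
A minimal base-R sketch that produces the per-class boxplots listed above:

# distribution of each continuous variable within each class category
par(mfrow = c(4, 4))
for (v in names(df)[-1]) {
  boxplot(df[[v]] ~ df$class, main = v, xlab = "class", ylab = v)
}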


The correlation between the continuous variables is as follows.
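
The matrix below can be computed by dropping the factor column and calling cor():

cor(df[, -1])   # pairwise correlations of the 13 continuous variables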

##                                  Alcohol  Malic.acid          Ash
## Alcohol                       1.00000000  0.09439694  0.211544596
## Malic.acid                    0.09439694  1.00000000  0.164045470
## Ash                           0.21154460  0.16404547  1.000000000
## Alcalinity.of.ash            -0.31023514  0.28850040  0.443367187
## Magnesium                     0.27079823 -0.05457510  0.286586691
## Total.phenols                 0.28910112 -0.33516700  0.128979538
## Flavanoids                    0.23681493 -0.41100659  0.115077279
## Nonflavanoid.phenols         -0.15592947  0.29297713  0.186230446
## Proanthocyanins               0.13669791 -0.22074619  0.009651935
## Color.intensity               0.54636420  0.24898534  0.258887259
## Hue                          -0.07174720 -0.56129569 -0.074666889
## OD280.OD315.of.diluted.wines  0.07234319 -0.36871043  0.003911231
## Proline                       0.64372004 -0.19201056  0.223626264
##                              Alcalinity.of.ash   Magnesium Total.phenols
## Alcohol                            -0.31023514  0.27079823    0.28910112
## Malic.acid                          0.28850040 -0.05457510   -0.33516700
## Ash                                 0.44336719  0.28658669    0.12897954
## Alcalinity.of.ash                   1.00000000 -0.08333309   -0.32111332
## Magnesium                          -0.08333309  1.00000000    0.21440123
## Total.phenols                      -0.32111332  0.21440123    1.00000000
## Flavanoids                         -0.35136986  0.19578377    0.86456350
## Nonflavanoid.phenols                0.36192172 -0.25629405   -0.44993530
## Proanthocyanins                    -0.19732684  0.23644061    0.61241308
## Color.intensity                     0.01873198  0.19995001   -0.05513642
## Hue                                -0.27395522  0.05539820    0.43368134
## OD280.OD315.of.diluted.wines       -0.27676855  0.06600394    0.69994936
## Proline                            -0.44059693  0.39335085    0.49811488
##                              Flavanoids Nonflavanoid.phenols Proanthocyanins
## Alcohol                       0.2368149           -0.1559295     0.136697912
## Malic.acid                   -0.4110066            0.2929771    -0.220746187
## Ash                           0.1150773            0.1862304     0.009651935
## Alcalinity.of.ash            -0.3513699            0.3619217    -0.197326836
## Magnesium                     0.1957838           -0.2562940     0.236440610
## Total.phenols                 0.8645635           -0.4499353     0.612413084
## Flavanoids                    1.0000000           -0.5378996     0.652691769
## Nonflavanoid.phenols         -0.5378996            1.0000000    -0.365845099
## Proanthocyanins               0.6526918           -0.3658451     1.000000000
## Color.intensity              -0.1723794            0.1390570    -0.025249931
## Hue                           0.5434786           -0.2626396     0.295544253
## OD280.OD315.of.diluted.wines  0.7871939           -0.5032696     0.519067096
## Proline                       0.4941931           -0.3113852     0.330416700
##                              Color.intensity         Hue
## Alcohol                           0.54636420 -0.07174720
## Malic.acid                        0.24898534 -0.56129569
## Ash                               0.25888726 -0.07466689
## Alcalinity.of.ash                 0.01873198 -0.27395522
## Magnesium                         0.19995001  0.05539820
## Total.phenols                    -0.05513642  0.43368134
## Flavanoids                       -0.17237940  0.54347857
## Nonflavanoid.phenols              0.13905701 -0.26263963
## Proanthocyanins                  -0.02524993  0.29554425
## Color.intensity                   1.00000000 -0.52181319
## Hue                              -0.52181319  1.00000000
## OD280.OD315.of.diluted.wines     -0.42881494  0.56546829
## Proline                           0.31610011  0.23618345
##                              OD280.OD315.of.diluted.wines    Proline
## Alcohol                                       0.072343187  0.6437200
## Malic.acid                                   -0.368710428 -0.1920106
## Ash                                           0.003911231  0.2236263
## Alcalinity.of.ash                            -0.276768549 -0.4405969
## Magnesium                                     0.066003936  0.3933508
## Total.phenols                                 0.699949365  0.4981149
## Flavanoids                                    0.787193902  0.4941931
## Nonflavanoid.phenols                         -0.503269596 -0.3113852
## Proanthocyanins                               0.519067096  0.3304167
## Color.intensity                              -0.428814942  0.3161001
## Hue                                           0.565468293  0.2361834
## OD280.OD315.of.diluted.wines                  1.000000000  0.3127611
## Proline                                       0.312761075  1.0000000


Findings from the EDA :-

In our data set,

  • The distributions of all the numerical variables look reasonable; there is nothing strange in them.

  • If we consider the numerical variables within each class category, there are a few outliers, but we can proceed without removing them.

  • Many of the numerical variables are correlated with each other, so dimensionality-reduction techniques could reduce the number of variables (see the sketch after this list).

  • Overall, there are no conspicuous anomalies in the input data.
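
As a sketch of the dimensionality reduction mentioned above (not part of the original analysis), principal component analysis on the scaled continuous variables could look like this:

# PCA on the 13 continuous variables; scaling puts them on a common footing
pca <- prcomp(df[, -1], scale. = TRUE)
summary(pca)   # proportion of variance explained by each component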


Models to predict Alcohol:-

Now I build various models to predict the Alcohol content from a subset of the other attributes (Malic.acid, Alcalinity.of.ash, Magnesium, Proanthocyanins, Color.intensity and Proline).

Fitting Linear Model:-

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
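
A minimal sketch of the fit; regression_form matches the call in the output below, and the RMSE line is an assumption about how the reported values were computed:

# formula with the six chosen predictors
regression_form <- Alcohol ~ Malic.acid + Alcalinity.of.ash + Magnesium +
  Proanthocyanins + Color.intensity + Proline
lm_fit <- lm(regression_form, data = df)
summary(lm_fit)
# training RMSE
sqrt(mean((df$Alcohol - predict(lm_fit, df))^2))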

The summary of the fitted Linear Regression Model :-

## 
## Call:
## lm(formula = regression_form, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48450 -0.37609 -0.00201  0.36816  1.78214 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       11.8544978  0.4139789  28.636  < 2e-16 ***
## Malic.acid         0.1008448  0.0403089   2.502   0.0133 *  
## Alcalinity.of.ash -0.0338236  0.0141062  -2.398   0.0176 *  
## Magnesium          0.0001504  0.0031527   0.048   0.9620    
## Proanthocyanins   -0.0248777  0.0774548  -0.321   0.7485    
## Color.intensity    0.1242727  0.0198923   6.247 3.21e-09 ***
## Proline            0.0012932  0.0001717   7.532 2.77e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5407 on 171 degrees of freedom
## Multiple R-squared:  0.5714, Adjusted R-squared:  0.5564 
## F-statistic: 37.99 on 6 and 171 DF,  p-value: < 2.2e-16

We can observe that,

  • The residual median is almost zero (which is good).

  • The residual minimum and maximum are close in magnitude with opposite signs, so the data points are roughly evenly distributed on both sides of the fitted line.

  • The intercept is 11.8544978 and is statistically significant.

  • The Malic.acid coefficient is 0.1008448 and is mildly statistically significant.

  • The Alcalinity.of.ash coefficient is -0.0338236 and is mildly statistically significant.

  • The Magnesium coefficient is 0.0001504 and is not statistically significant.

  • The Proanthocyanins coefficient is -0.0248777 and is not statistically significant.

  • The Color.intensity coefficient is 0.1242727 and is statistically significant.

  • The Proline coefficient is 0.0012932 and is statistically significant.

  • 57.14 % of the variance in Alcohol is explained by the explanatory variables (which is not great).

  • The model p-value is < 2.2e-16, so the linear model is statistically significant.

  • The root mean square error (RMSE) of this model is 0.5299928.

Next I reduce the model to the most statistically significant explanatory variables.

Fitting Updated Linear Model:-

## 
## Call:
## lm(formula = Alcohol ~ Malic.acid + Alcalinity.of.ash + Color.intensity + 
##     Proline, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47459 -0.36917  0.00056  0.36430  1.77082 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       11.8295611  0.3287319  35.985  < 2e-16 ***
## Malic.acid         0.1024480  0.0397664   2.576   0.0108 *  
## Alcalinity.of.ash -0.0337045  0.0139570  -2.415   0.0168 *  
## Color.intensity    0.1249396  0.0196251   6.366 1.68e-09 ***
## Proline            0.0012811  0.0001561   8.209 4.96e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5378 on 173 degrees of freedom
## Multiple R-squared:  0.5711, Adjusted R-squared:  0.5612 
## F-statistic:  57.6 on 4 and 173 DF,  p-value: < 2.2e-16

We can observe that,

  • The residual median is almost zero (which is good).

  • The residual minimum and maximum are close in magnitude with opposite signs, so the data points are roughly evenly distributed on both sides of the fitted line.

  • The intercept is 11.8295611 and is statistically significant.

  • The Malic.acid coefficient is 0.1024480 and is mildly statistically significant.

  • The Alcalinity.of.ash coefficient is -0.0337045 and is mildly statistically significant.

  • The Color.intensity coefficient is 0.1249396 and is statistically significant.

  • The Proline coefficient is 0.0012811 and is statistically significant.

  • 57.11 % of the variance in Alcohol is explained by the explanatory variables (which is not great).

  • The model p-value is < 2.2e-16, so the linear model is statistically significant.

  • The root mean square error (RMSE) of this model is 0.5301526.

Support Vector Machines ( Regressors ) :-

Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. With suitable kernels they can solve both linear and non-linear problems.


Fitting SVM - regressor :-

With Default Parameters :-

As a first step, I fit a Support Vector Machine regressor with the default hyperparameter values for cost ( C ) and gamma (\(\gamma\)).
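
A minimal sketch using the e1071 package (the package choice is inferred from the output format below):

library(e1071)
# eps-regression with the radial kernel and default cost / gamma
svm_fit <- svm(regression_form, data = df)
summary(svm_fit)
sqrt(mean((df$Alcohol - predict(svm_fit, df))^2))   # training RMSE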

The summary of the fitted default SVM regressor :-

## 
## Call:
## svm(formula = regression_form, data = df)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1666667 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  155

We can observe that,

  • Number of support vectors (data points defining the margin): 155

  • The default eps-regression type is used.

  • The selected kernel is the radial basis function (RBF).

  • The default cost value ( C ) is 1.

  • The default gamma value ( \(\gamma\) ) is 0.1666667.

  • The root mean square error (RMSE) of this model is 0.4149909.

Parameter Tuning ( grid search ) :-

As we can see, the RMSE of the SVM model with default parameters is not great. We can tune the parameters C and gamma (\(\gamma\)), slightly changing the smoothness of the fitted curve, to predict the data points more accurately than before.
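
A sketch of the grid search with e1071's tune(); the exact grid is an assumption, while the bootstrapping sampler matches the output below:

# grid search over gamma and cost, resampled by bootstrapping
tuned <- tune(svm, regression_form, data = df,
              ranges = list(gamma = seq(0.1, 1, by = 0.1), cost = 1:8),
              tunecontrol = tune.control(sampling = "bootstrap"))
summary(tuned)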

  • The tuning summary using the bootstrapping sampling method is as follows.
## 
## Parameter tuning of 'svm':
## 
## - sampling method: bootstrapping 
## 
## - best parameters:
##  gamma cost
##    0.3    4
## 
## - best performance: 0.3807862

We can observe that the recommended best parameters from the bootstrapping sampling method are:

  • \(\gamma\) : 0.3
  • C : 4

Fitting the SVM regressor with the tuned parameters :-

## 
## Call:
## svm(formula = regression_form, data = df, gamma = gam, cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  4 
##       gamma:  0.3 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  161

We can observe that,

  • Number of support vectors (data points defining the margin): 161

  • The default eps-regression type is used.

  • The selected kernel is the radial basis function (RBF).

  • The cost value ( C ) is 4 (which we set).

  • The gamma value ( \(\gamma\) ) is 0.3 (which we set).

  • The root mean square error (RMSE) of this model is 0.2584476.

Binary Decision Trees :-

A binary decision tree is a structure based on a sequential decision process. Starting from the root, a feature is evaluated and one of the two branches is selected. This procedure is repeated until a final leaf is reached, which represents the prediction target.
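
A minimal sketch using ctree() from the party package (the package is inferred from the "Conditional inference tree" output below):

library(party)
# conditional inference regression tree on the same six predictors
ct_fit <- ctree(regression_form, data = df)
print(ct_fit)
sqrt(mean((df$Alcohol - predict(ct_fit, df))^2))   # training RMSE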


Fitting a CTREE Binary Decision Tree :-

The summary of the fitted model is as follows.

## 
##   Conditional inference tree with 5 terminal nodes
## 
## Response:  Alcohol 
## Inputs:  Malic.acid, Alcalinity.of.ash, Magnesium, Proanthocyanins, Color.intensity, Proline 
## Number of observations:  178 
## 
## 1) Proline <= 720; criterion = 1, statistic = 73.344
##   2) Color.intensity <= 3.3; criterion = 1, statistic = 42.688
##     3)*  weights = 46 
##   2) Color.intensity > 3.3
##     4) Color.intensity <= 7.65; criterion = 0.994, statistic = 10.868
##       5)*  weights = 43 
##     4) Color.intensity > 7.65
##       6)*  weights = 15 
## 1) Proline > 720
##   7) Proline <= 1020; criterion = 1, statistic = 16.348
##     8)*  weights = 33 
##   7) Proline > 1020
##     9)*  weights = 41

The same structure can be seen in the following plot.

  • The root mean square error (RMSE) of this model is 0.4926241.

Conclusion on Regression Models :-

The summary of all the fitted models and their performance on the training data is as follows.

REGRESSION MODELS SUMMARY

S.No   Model Name                     RMSE Value
1      Linear Model                   0.5299928
2      Reduced Linear Model           0.5301526
3      SVM Regressor                  0.4149909
4      Tuned SVM Regressor            0.2584476
5      Binary Decision Tree           0.4926241

As the tuned SVM regressor has the lowest RMSE, I conclude that it is the best of these models for predicting the Alcohol content.



Models to predict Class of Wine:-

Now I build various models to predict the class of wine from all the other attributes in the given dataset.

For these classification models, I use 70% of the given data for training and the remaining 30% for testing (a sketch of the split follows the counts below).

Total number of observations in the given dataset: 178

Total number of observations in the training dataset: 125

Total number of observations in the test dataset: 53
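
A minimal sketch of the 70/30 split (the seed is an assumption, added for reproducibility):

set.seed(42)                                         # assumed seed
train_idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
train_df  <- df[train_idx, ]                         # 125 observations
test_df   <- df[-train_idx, ]                        # 53 observations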

Linear discriminant analysis (LDA):-

LDA is used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes.
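
A minimal sketch using lda() from the MASS package; classification_form matches the call in the output below and is assumed to be class ~ . :

library(MASS)
classification_form <- class ~ .   # predict class from all other attributes
lda_fit <- lda(classification_form, data = train_df)
lda_fit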

Fitting LDA:-

The summary of the fitted LDA model is as follows.

## Call:
## lda(classification_form, data = train_df)
## 
## Prior probabilities of groups:
##     1     2     3 
## 0.328 0.400 0.272 
## 
## Group means:
##    Alcohol Malic.acid      Ash Alcalinity.of.ash Magnesium Total.phenols
## 1 13.76976   1.853659 2.458780          16.94146 106.31707      2.860976
## 2 12.23080   1.844600 2.285400          20.40800  94.28000      2.292200
## 3 13.16000   3.168235 2.455882          21.60294  99.32353      1.698824
##   Flavanoids Nonflavanoid.phenols Proanthocyanins Color.intensity       Hue
## 1  3.0363415            0.2907317        1.900976        5.549024 1.0675610
## 2  2.1602000            0.3688000        1.657400        3.111000 1.0751200
## 3  0.7947059            0.4523529        1.188529        7.625000 0.6782353
##   OD280.OD315.of.diluted.wines   Proline
## 1                     3.167805 1129.2683
## 2                     2.837400  531.1200
## 3                     1.672941  645.7353
## 
## Coefficients of linear discriminants:
##                                       LD1          LD2
## Alcohol                      -0.521870355  1.008139140
## Malic.acid                    0.305737407  0.294471054
## Ash                          -0.212029283  2.061604035
## Alcalinity.of.ash             0.185736095 -0.140807844
## Magnesium                     0.007890352  0.002354910
## Total.phenols                 0.698923759 -0.072032371
## Flavanoids                   -1.832854947 -0.412573641
## Nonflavanoid.phenols         -2.135147371 -1.337869122
## Proanthocyanins               0.490050381 -0.572153820
## Color.intensity               0.257510411  0.258671867
## Hue                          -1.390960425 -1.275777480
## OD280.OD315.of.diluted.wines -1.317639986 -0.161337269
## Proline                      -0.003813797  0.002768756
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7283 0.2717

Confusion Matrix on training data :-

##     predicted
## true  1  2  3
##    1 41  0  0
##    2  0 50  0
##    3  0  0 34

Correct classification rate:-

The accuracy of this LDA classifier on Training Data is 100 %.

Confusion Matrix on testing data :-

##     predicted
## true  1  2  3
##    1 16  2  0
##    2  0 20  1
##    3  0  0 14

Correct classification rate:-

The accuracy of this LDA classifier on Testing Data is 94.34 %.
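
The confusion matrices and accuracies above can be computed with a sketch like the following (object names are assumptions):

# confusion matrix and correct classification rate on the test set
pred <- predict(lda_fit, test_df)$class
table(true = test_df$class, predicted = pred)
mean(pred == test_df$class)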

Support Vector Machines ( Classifiers ) :-

Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. With suitable kernels they can solve both linear and non-linear problems.


Fitting SVM - Classifier :-

With Default Parameters :-

As a first step, I fit a Support Vector Machine classifier with the default hyperparameter values for cost ( C ) and gamma (\(\gamma\)).

The summary of the fitted default SVM classifier :-

## 
## Call:
## svm(formula = classification_form, data = train_df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  57
## 
##  ( 25 19 13 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  1 2 3

We can observe that,

  • Number of support vectors (data points defining the margin): 57
    • 25 class-1 data points
    • 19 class-2 data points
    • 13 class-3 data points
  • The default C-classification type is used.
  • The selected kernel is the radial basis function (RBF).
  • The default cost value ( C ) is 1.

Confusion Matrix on training data :-

##     predicted
## true  1  2  3
##    1 41  0  0
##    2  0 50  0
##    3  0  0 34

Correct classification rate:-

The accuracy of this SVM classifier on Training Data is 100 %.

Confusion Matrix on testing data :-

##     predicted
## true  1  2  3
##    1 17  1  0
##    2  0 20  1
##    3  0  0 14

Correct classification rate:-

The accuracy of this SVM classifier on Testing Data is 96.23 %.

Parameter Tuning ( grid search ) :-

Although the accuracy of the SVM model with default parameters is already high, we can tune the parameters C and gamma (\(\gamma\)), slightly changing the smoothness of the decision boundary, to try to classify the data points more accurately than before.
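
A sketch of the grid search, this time resampled by 10-fold cross-validation (the grid itself is an assumption):

tuned_cls <- tune(svm, classification_form, data = train_df,
                  ranges = list(gamma = seq(0.1, 1, by = 0.1), cost = 1:8),
                  tunecontrol = tune.control(sampling = "cross", cross = 10))
summary(tuned_cls)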

  • The tuning summary using 10-fold cross-validation is as follows.
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##    0.3    4
## 
## - best performance: 0.03888889

We can observe that the recommended best parameters from the cross-validation are:

  • \(\gamma\) : 0.3

  • C : 4

  • The tuning Plot is as follows :-

Fitting the SVM classifier with the tuned parameters :-

The summary of the fitted model is as follows.

## 
## Call:
## svm(formula = classification_form, data = train_df, gamma = gam, 
##     cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  4 
## 
## Number of Support Vectors:  103
## 
##  ( 44 28 31 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  1 2 3

We can observe that,

  • Number of support vectors (data points defining the margin): 103
    • 44 class-1 data points
    • 28 class-2 data points
    • 31 class-3 data points
  • The default C-classification type is used.
  • The selected kernel is the radial basis function (RBF).
  • The cost value ( C ) is 4 (which we set).
  • The gamma value ( \(\gamma\) ) is 0.3 (which we set).

Confusion Matrix on training data :-

##     predicted
## true  1  2  3
##    1 41  0  0
##    2  0 50  0
##    3  0  0 34

Correct classification rate:-

The accuracy of this tuned SVM classifier on Training Data is 100 %.

Confusion Matrix on testing data :-

##     predicted
## true  1  2  3
##    1 13  5  0
##    2  0 20  1
##    3  0  1 13

Correct classification rate:-

The accuracy of this tuned SVM classifier on Testing Data is 86.79 %.

Classification Trees:-

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning.


Fitting a Classification Tree on the training data :-

The summary of the fitted Decision tree is as follows.

## 
##   Conditional inference tree with 5 terminal nodes
## 
## Response:  class 
## Inputs:  Alcohol, Malic.acid, Ash, Alcalinity.of.ash, Magnesium, Total.phenols, Flavanoids, Nonflavanoid.phenols, Proanthocyanins, Color.intensity, Hue, OD280.OD315.of.diluted.wines, Proline 
## Number of observations:  125 
## 
## 1) Flavanoids <= 1.39; criterion = 1, statistic = 91.347
##   2) Hue <= 0.85; criterion = 1, statistic = 18.711
##     3)*  weights = 31 
##   2) Hue > 0.85
##     4)*  weights = 7 
## 1) Flavanoids > 1.39
##   5) Proline <= 750; criterion = 1, statistic = 64.099
##     6)*  weights = 43 
##   5) Proline > 750
##     7) Alcohol <= 13.07; criterion = 0.999, statistic = 15.804
##       8)*  weights = 7 
##     7) Alcohol > 13.07
##       9)*  weights = 37

The same summary can be visualized as follows.

Confusion Matrix on training data :-

##     predicted
## true  1  2  3
##    1 41  0  0
##    2  3 47  0
##    3  0  3 31

Correct classification rate:-

The accuracy of this Decision tree classifier on Training Data is 95.2 %.

Confusion Matrix on testing data :-

##     predicted
## true  1  2  3
##    1 16  2  0
##    2  0 20  1
##    3  0  3 11

Correct classification rate:-

The accuracy of this Decision tree classifier on Testing Data is 88.68 %.

Conclusion on Classification Models:-

The summary of all the fitted models and their performance on the training and testing data is as follows.

CLASSIFICATION MODELS SUMMARY

S.No   Model Name                     Acc. (Train Data)   Acc. (Test Data)
1      LDA Model                      100 %               94.34 %
2      SVM Classifier                 100 %               96.23 %
3      Tuned SVM Classifier           100 %               86.79 %
4      Decision Tree (Classifier)     95.2 %              88.68 %

As the default SVM classifier has the highest test accuracy, I conclude that it is the best of these models for predicting the class of wine.