Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
- Structure of the given dataset :-
The given dataset has 178 observations and each observation has 14 attributes. The header of the dataset is as follows.
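A minimal sketch of how the dataset might be loaded and previewed (the file name `wine.csv` is an assumption; the column names follow the output below):

```r
# Load the wine dataset; the file name is assumed for illustration.
df <- read.csv("wine.csv")

# 178 observations of 14 attributes.
dim(df)

# Preview the first six rows.
head(df)
```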
## class Alcohol Malic.acid Ash Alcalinity.of.ash Magnesium Total.phenols
## 1 1 14.23 1.71 2.43 15.6 127 2.80
## 2 1 13.20 1.78 2.14 11.2 100 2.65
## 3 1 13.16 2.36 2.67 18.6 101 2.80
## 4 1 14.37 1.95 2.50 16.8 113 3.85
## 5 1 13.24 2.59 2.87 21.0 118 2.80
## 6 1 14.20 1.76 2.45 15.2 112 3.27
## Flavanoids Nonflavanoid.phenols Proanthocyanins Color.intensity Hue
## 1 3.06 0.28 2.29 5.64 1.04
## 2 2.76 0.26 1.28 4.38 1.05
## 3 3.24 0.30 2.81 5.68 1.03
## 4 3.49 0.24 2.18 7.80 0.86
## 5 2.69 0.39 1.82 4.32 1.04
## 6 3.39 0.34 1.97 6.75 1.05
## OD280.OD315.of.diluted.wines Proline
## 1 3.92 1065
## 2 3.40 1050
## 3 3.17 1185
## 4 3.45 1480
## 5 2.93 735
## 6 2.85 1450
Explanation of all the variables :-
- Alcohol : the alcohol content of the wine
- Malic acid : the amount of malic acid in the wine
- Ash : the ash content of the wine
- Alcalinity of ash : the alkalinity of the wine's ash
- Magnesium : the magnesium content of the wine
- Total phenols : the total phenol content of the wine
- Flavanoids : the flavanoid content of the wine
- Nonflavanoid phenols : the nonflavanoid phenol content of the wine
- Proanthocyanins : the proanthocyanin content of the wine
- Color intensity : the color intensity of the wine
- Hue : the hue of the wine
- OD280/OD315 of diluted wines : the OD280/OD315 absorbance ratio of the diluted wine
- Proline : the proline content of the wine
- Class : the class category of the wine (1, 2, or 3)
The detailed structure is as follows.
## 'data.frame': 178 obs. of 14 variables:
## $ class : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic.acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Alcalinity.of.ash : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Total.phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoid.phenols : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color.intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ OD280.OD315.of.diluted.wines: num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
As read in, the type of each attribute is as follows.
## class Alcohol
## "integer" "numeric"
## Malic.acid Ash
## "numeric" "numeric"
## Alcalinity.of.ash Magnesium
## "numeric" "integer"
## Total.phenols Flavanoids
## "numeric" "numeric"
## Nonflavanoid.phenols Proanthocyanins
## "numeric" "numeric"
## Color.intensity Hue
## "numeric" "numeric"
## OD280.OD315.of.diluted.wines Proline
## "numeric" "integer"
However, class is not really an integer value; it is a categorical label taking the values 1, 2, or 3.
I am therefore converting class to a factor. After the conversion, the type of each attribute is as follows.
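A sketch of the conversion, assuming the data frame is named `df` as in the later model calls:

```r
# Convert the class label from integer to factor.
df$class <- as.factor(df$class)

# Re-check the type of each attribute.
sapply(df, class)
```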
## class Alcohol
## "factor" "numeric"
## Malic.acid Ash
## "numeric" "numeric"
## Alcalinity.of.ash Magnesium
## "numeric" "integer"
## Total.phenols Flavanoids
## "numeric" "numeric"
## Nonflavanoid.phenols Proanthocyanins
## "numeric" "numeric"
## Color.intensity Hue
## "numeric" "numeric"
## OD280.OD315.of.diluted.wines Proline
## "numeric" "integer"
- Dealing with missing values :-
The number of missing values in each column is as follows.
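A minimal check, assuming missing entries are coded as NA:

```r
# Count the missing values in each column.
colSums(is.na(df))
```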
## class Alcohol
## 0 0
## Malic.acid Ash
## 0 0
## Alcalinity.of.ash Magnesium
## 0 0
## Total.phenols Flavanoids
## 0 0
## Nonflavanoid.phenols Proanthocyanins
## 0 0
## Color.intensity Hue
## 0 0
## OD280.OD315.of.diluted.wines Proline
## 0 0
As there are no missing values, we can proceed further.
- Summary :-
The overall summary of all the attributes is as follows.
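The summary below can be produced with a single call:

```r
# Quartiles and means for the numeric columns;
# level counts for the factor column.
summary(df)
```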
## class Alcohol Malic.acid Ash Alcalinity.of.ash
## 1:59 Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 2:71 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## 3:48 Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total.phenols Flavanoids Nonflavanoid.phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color.intensity Hue OD280.OD315.of.diluted.wines
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
The distributions of the continuous variables, both overall and within each class category, are shown in the plots below for each attribute:
- Alcohol:-
- Malic.acid:-
- Ash:-
- Alcalinity.of.ash:-
- Magnesium:-
- Total.phenols:-
- Flavanoids:-
- Nonflavanoid.phenols:-
- Proanthocyanins:-
- Color.intensity:-
- Hue:-
- OD280.OD315.of.diluted.wines:-
- Proline:-
The correlation between the continuous variables is as follows.
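A sketch of how the correlation matrix below could be computed:

```r
# Pearson correlation between the continuous variables;
# the factor column class is excluded.
cor(df[, sapply(df, is.numeric)])
```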
## Alcohol Malic.acid Ash
## Alcohol 1.00000000 0.09439694 0.211544596
## Malic.acid 0.09439694 1.00000000 0.164045470
## Ash 0.21154460 0.16404547 1.000000000
## Alcalinity.of.ash -0.31023514 0.28850040 0.443367187
## Magnesium 0.27079823 -0.05457510 0.286586691
## Total.phenols 0.28910112 -0.33516700 0.128979538
## Flavanoids 0.23681493 -0.41100659 0.115077279
## Nonflavanoid.phenols -0.15592947 0.29297713 0.186230446
## Proanthocyanins 0.13669791 -0.22074619 0.009651935
## Color.intensity 0.54636420 0.24898534 0.258887259
## Hue -0.07174720 -0.56129569 -0.074666889
## OD280.OD315.of.diluted.wines 0.07234319 -0.36871043 0.003911231
## Proline 0.64372004 -0.19201056 0.223626264
## Alcalinity.of.ash Magnesium Total.phenols
## Alcohol -0.31023514 0.27079823 0.28910112
## Malic.acid 0.28850040 -0.05457510 -0.33516700
## Ash 0.44336719 0.28658669 0.12897954
## Alcalinity.of.ash 1.00000000 -0.08333309 -0.32111332
## Magnesium -0.08333309 1.00000000 0.21440123
## Total.phenols -0.32111332 0.21440123 1.00000000
## Flavanoids -0.35136986 0.19578377 0.86456350
## Nonflavanoid.phenols 0.36192172 -0.25629405 -0.44993530
## Proanthocyanins -0.19732684 0.23644061 0.61241308
## Color.intensity 0.01873198 0.19995001 -0.05513642
## Hue -0.27395522 0.05539820 0.43368134
## OD280.OD315.of.diluted.wines -0.27676855 0.06600394 0.69994936
## Proline -0.44059693 0.39335085 0.49811488
## Flavanoids Nonflavanoid.phenols Proanthocyanins
## Alcohol 0.2368149 -0.1559295 0.136697912
## Malic.acid -0.4110066 0.2929771 -0.220746187
## Ash 0.1150773 0.1862304 0.009651935
## Alcalinity.of.ash -0.3513699 0.3619217 -0.197326836
## Magnesium 0.1957838 -0.2562940 0.236440610
## Total.phenols 0.8645635 -0.4499353 0.612413084
## Flavanoids 1.0000000 -0.5378996 0.652691769
## Nonflavanoid.phenols -0.5378996 1.0000000 -0.365845099
## Proanthocyanins 0.6526918 -0.3658451 1.000000000
## Color.intensity -0.1723794 0.1390570 -0.025249931
## Hue 0.5434786 -0.2626396 0.295544253
## OD280.OD315.of.diluted.wines 0.7871939 -0.5032696 0.519067096
## Proline 0.4941931 -0.3113852 0.330416700
## Color.intensity Hue
## Alcohol 0.54636420 -0.07174720
## Malic.acid 0.24898534 -0.56129569
## Ash 0.25888726 -0.07466689
## Alcalinity.of.ash 0.01873198 -0.27395522
## Magnesium 0.19995001 0.05539820
## Total.phenols -0.05513642 0.43368134
## Flavanoids -0.17237940 0.54347857
## Nonflavanoid.phenols 0.13905701 -0.26263963
## Proanthocyanins -0.02524993 0.29554425
## Color.intensity 1.00000000 -0.52181319
## Hue -0.52181319 1.00000000
## OD280.OD315.of.diluted.wines -0.42881494 0.56546829
## Proline 0.31610011 0.23618345
## OD280.OD315.of.diluted.wines Proline
## Alcohol 0.072343187 0.6437200
## Malic.acid -0.368710428 -0.1920106
## Ash 0.003911231 0.2236263
## Alcalinity.of.ash -0.276768549 -0.4405969
## Magnesium 0.066003936 0.3933508
## Total.phenols 0.699949365 0.4981149
## Flavanoids 0.787193902 0.4941931
## Nonflavanoid.phenols -0.503269596 -0.3113852
## Proanthocyanins 0.519067096 0.3304167
## Color.intensity -0.428814942 0.3161001
## Hue 0.565468293 0.2361834
## OD280.OD315.of.diluted.wines 1.000000000 0.3127611
## Proline 0.312761075 1.0000000
Description of EDA :-
In our data set,
The distributions of all the numerical variables look reasonable; there is nothing unusual in them.
Within each class category, a few of the numerical variables show outliers, but we can proceed without removing them.
Many of the numerical variables are correlated with each other (for example, Flavanoids and Total.phenols have a correlation of about 0.86), so dimensionality reduction techniques could be applied to reduce the number of variables.
Overall, there are no conspicuous anomalies in the input data.
Models to predict Alcohol:-
Now I am planning to build various models to predict the alcohol content from the following attributes.
- Malic.acid
- Alcalinity.of.ash
- Magnesium
- Proanthocyanins
- Color.intensity
- Proline
Fitting Linear Model:-
Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
The summary of the fitted Linear Regression Model :-
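A sketch of the fit; the formula matches the call shown in the output:

```r
# Regression formula with the six chosen explanatory variables.
regression_form <- Alcohol ~ Malic.acid + Alcalinity.of.ash + Magnesium +
  Proanthocyanins + Color.intensity + Proline

# Fit an ordinary least squares model.
fit_lm <- lm(regression_form, data = df)
summary(fit_lm)
```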
##
## Call:
## lm(formula = regression_form, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48450 -0.37609 -0.00201 0.36816 1.78214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.8544978 0.4139789 28.636 < 2e-16 ***
## Malic.acid 0.1008448 0.0403089 2.502 0.0133 *
## Alcalinity.of.ash -0.0338236 0.0141062 -2.398 0.0176 *
## Magnesium 0.0001504 0.0031527 0.048 0.9620
## Proanthocyanins -0.0248777 0.0774548 -0.321 0.7485
## Color.intensity 0.1242727 0.0198923 6.247 3.21e-09 ***
## Proline 0.0012932 0.0001717 7.532 2.77e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5407 on 171 degrees of freedom
## Multiple R-squared: 0.5714, Adjusted R-squared: 0.5564
## F-statistic: 37.99 on 6 and 171 DF, p-value: < 2.2e-16
We can observe that,
- The residual median is almost zero (which is good).
- The residual minimum and maximum are roughly comparable in magnitude with opposite signs, so the data points are distributed fairly evenly on both sides of the fitted line.
- The intercept is 11.8544978 and is highly statistically significant.
- The Malic.acid coefficient is 0.1008448 and is significant at the 5 % level.
- The Alcalinity.of.ash coefficient is -0.0338236 and is significant at the 5 % level.
- The Magnesium coefficient is 0.0001504 and is not statistically significant.
- The Proanthocyanins coefficient is -0.0248777 and is not statistically significant.
- The Color.intensity coefficient is 0.1242727 and is highly statistically significant.
- The Proline coefficient is 0.0012932 and is highly statistically significant.
- 57.14 % of the variation in Alcohol is explained by the explanatory variables (which is not great).
- The p-value of the F-statistic is < 2.2e-16, so the model as a whole is statistically significant.
The root mean square error (RMSE) of this model is 0.5299928.
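This RMSE is computed on the training data; one way it could be obtained (a sketch, assuming the fitted object is named `fit_lm`):

```r
# Training RMSE: square root of the mean squared residual.
sqrt(mean(residuals(fit_lm)^2))
```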
Next, I prune the model, keeping only the most significant explanatory variables.
Fitting Updated Linear Model:-
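A sketch of the pruned fit, matching the call shown below:

```r
# Refit using only the four significant predictors.
fit_lm2 <- lm(Alcohol ~ Malic.acid + Alcalinity.of.ash +
                Color.intensity + Proline, data = df)
summary(fit_lm2)
```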
##
## Call:
## lm(formula = Alcohol ~ Malic.acid + Alcalinity.of.ash + Color.intensity +
## Proline, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47459 -0.36917 0.00056 0.36430 1.77082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.8295611 0.3287319 35.985 < 2e-16 ***
## Malic.acid 0.1024480 0.0397664 2.576 0.0108 *
## Alcalinity.of.ash -0.0337045 0.0139570 -2.415 0.0168 *
## Color.intensity 0.1249396 0.0196251 6.366 1.68e-09 ***
## Proline 0.0012811 0.0001561 8.209 4.96e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5378 on 173 degrees of freedom
## Multiple R-squared: 0.5711, Adjusted R-squared: 0.5612
## F-statistic: 57.6 on 4 and 173 DF, p-value: < 2.2e-16
We can observe that,
- The residual median is almost zero (which is good).
- The residual minimum and maximum are roughly comparable in magnitude with opposite signs, so the data points are distributed fairly evenly on both sides of the fitted line.
- The intercept is 11.8295611 and is highly statistically significant.
- The Malic.acid coefficient is 0.1024480 and is significant at the 5 % level.
- The Alcalinity.of.ash coefficient is -0.0337045 and is significant at the 5 % level.
- The Color.intensity coefficient is 0.1249396 and is highly statistically significant.
- The Proline coefficient is 0.0012811 and is highly statistically significant.
- 57.11 % of the variation in Alcohol is explained by the explanatory variables (which is not great).
- The p-value of the F-statistic is < 2.2e-16, so the model as a whole is statistically significant.
The root mean square error (RMSE) of this model is 0.5301526.
Support Vector Machines ( Regressors ) :-
Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression. Thanks to kernel functions, they can solve both linear and non-linear problems.
Fitting SVM - regressor :-
With Default Parameters :-
As a first step, I fit a support vector machine regressor with the default values of the hyperparameters cost ( C ) and gamma ( \(\gamma\) ).
The summary of the fitted default SVM regressor is as follows :-
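The output format below matches the e1071 package; a sketch of the fit under that assumption:

```r
library(e1071)

# Radial-kernel SVM regressor with default cost, gamma and epsilon.
fit_svm <- svm(regression_form, data = df)
summary(fit_svm)
```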
##
## Call:
## svm(formula = regression_form, data = df)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1666667
## epsilon: 0.1
##
##
## Number of Support Vectors: 155
We can observe that,
- Number of support vectors (data points defining the margin): 155
- The default eps-regression type is used.
- The kernel selected is the radial basis function (RBF).
- The default cost value ( C ) is 1.
- The default gamma value ( \(\gamma\) ) is 0.1666667.
The root mean square error (RMSE) of this model is 0.4149909.
Parameter Tuning ( grid search ) :-
The RMSE of the SVM model with default parameters leaves room for improvement. We can tune the parameters C and gamma ( \(\gamma\) ), slightly changing the smoothness of the fitted curve (tuning the hyperplane) so that the model fits the data points more accurately than before.
- The tuning summary obtained with the bootstrap sampling method is as follows.
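A sketch of the grid search; the grid values are illustrative assumptions, but the bootstrap resampling matches the output below:

```r
# Grid search over gamma and cost with bootstrap resampling.
tuned <- tune(svm, regression_form, data = df,
              ranges = list(gamma = seq(0.1, 0.5, 0.1), cost = 1:8),
              tunecontrol = tune.control(sampling = "bootstrap"))
summary(tuned)
```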
##
## Parameter tuning of 'svm':
##
## - sampling method: bootstrapping
##
## - best parameters:
## gamma cost
## 0.3 4
##
## - best performance: 0.3807862
We can observe that the recommended best parameters from the bootstrap sampling method are
- \(\gamma\) : 0.3
- C : 4
Fitting the SVM regressor with the new parameters :-
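A sketch, matching the call shown in the output (`gam` and `cos` hold the tuned values):

```r
# Refit the SVM regressor with the recommended hyperparameters.
gam <- 0.3
cos <- 4
fit_svm_tuned <- svm(regression_form, data = df, gamma = gam, cost = cos)
summary(fit_svm_tuned)
```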
##
## Call:
## svm(formula = regression_form, data = df, gamma = gam, cost = cos)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 4
## gamma: 0.3
## epsilon: 0.1
##
##
## Number of Support Vectors: 161
We can observe that,
- Number of support vectors (data points defining the margin): 161
- The default eps-regression type is used.
- The kernel selected is the radial basis function (RBF).
- The cost value ( C ) is 4 (set by us).
- The gamma value ( \(\gamma\) ) is 0.3 (set by us).
The root mean square error (RMSE) of this model is 0.2584476.
Binary Decision Trees :-
A binary decision tree is a structure based on a sequential decision process. Starting from the root, a feature is evaluated and one of the two branches is selected. This procedure is repeated until a final leaf is reached, which represents the prediction target.
Fitting a CTREE binary decision tree :-
The summary of the fitted model is as follows.
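The output format matches party::ctree; a sketch under that assumption:

```r
library(party)

# Conditional inference tree for the regression task.
fit_ctree <- ctree(regression_form, data = df)
print(fit_ctree)

# Visualize the tree (the plot referred to below).
plot(fit_ctree)
```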
##
## Conditional inference tree with 5 terminal nodes
##
## Response: Alcohol
## Inputs: Malic.acid, Alcalinity.of.ash, Magnesium, Proanthocyanins, Color.intensity, Proline
## Number of observations: 178
##
## 1) Proline <= 720; criterion = 1, statistic = 73.344
## 2) Color.intensity <= 3.3; criterion = 1, statistic = 42.688
## 3)* weights = 46
## 2) Color.intensity > 3.3
## 4) Color.intensity <= 7.65; criterion = 0.994, statistic = 10.868
## 5)* weights = 43
## 4) Color.intensity > 7.65
## 6)* weights = 15
## 1) Proline > 720
## 7) Proline <= 1020; criterion = 1, statistic = 16.348
## 8)* weights = 33
## 7) Proline > 1020
## 9)* weights = 41
The same structure can be seen in the following plot.
The root mean square error (RMSE) of this model is 0.4926241.
Conclusion on Regression Models :-
The summary of all the fitted models and their performance on the training data is as follows.
REGRESSION MODELS SUMMARY
| S No | Model Name | RMSE Value |
|---|---|---|
| 1. | Linear Model | 0.5299928 |
| 2. | Pruned Linear Model | 0.5301526 |
| 3. | SVM - Regressor | 0.4149909 |
| 4. | Tuned SVM - Regressor | 0.2584476 |
| 5. | Binary Decision Tree | 0.4926241 |
As the tuned SVM regressor has by far the lowest RMSE, I conclude that it is the best of these models for predicting the alcohol content. Note, however, that these RMSE values are computed on the training data, so part of the tuned model's advantage may come from fitting the training set more closely.
Models to predict Class of Wine:-
Now I am planning to build various models to predict the class of wine, using all the attributes in the given dataset.
For these classifier models, I use 70 % of the given data as a training set and the remaining 30 % for testing.
Total number of observations in the given dataset : 178
Total number of observations in the train dataset : 125
Total number of observations in the test dataset : 53
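A sketch of the split (the random seed is an arbitrary assumption; round(0.7 * 178) = 125):

```r
# 70/30 train-test split.
set.seed(123)  # assumed seed, for reproducibility
train_idx <- sample(seq_len(nrow(df)), size = round(0.7 * nrow(df)))
train_df  <- df[train_idx, ]   # 125 observations
test_df   <- df[-train_idx, ]  #  53 observations
```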
Linear discriminant analysis (LDA):-
LDA is used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes.
Fitting LDA:-
The summary of the fitted LDA model is as follows.
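A sketch of the fit; using all 13 attributes as predictors (`class ~ .`) is an assumption consistent with the group means shown below:

```r
library(MASS)

# LDA of class on all remaining attributes.
classification_form <- class ~ .
fit_lda <- lda(classification_form, data = train_df)
fit_lda
```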
## Call:
## lda(classification_form, data = train_df)
##
## Prior probabilities of groups:
## 1 2 3
## 0.328 0.400 0.272
##
## Group means:
## Alcohol Malic.acid Ash Alcalinity.of.ash Magnesium Total.phenols
## 1 13.76976 1.853659 2.458780 16.94146 106.31707 2.860976
## 2 12.23080 1.844600 2.285400 20.40800 94.28000 2.292200
## 3 13.16000 3.168235 2.455882 21.60294 99.32353 1.698824
## Flavanoids Nonflavanoid.phenols Proanthocyanins Color.intensity Hue
## 1 3.0363415 0.2907317 1.900976 5.549024 1.0675610
## 2 2.1602000 0.3688000 1.657400 3.111000 1.0751200
## 3 0.7947059 0.4523529 1.188529 7.625000 0.6782353
## OD280.OD315.of.diluted.wines Proline
## 1 3.167805 1129.2683
## 2 2.837400 531.1200
## 3 1.672941 645.7353
##
## Coefficients of linear discriminants:
## LD1 LD2
## Alcohol -0.521870355 1.008139140
## Malic.acid 0.305737407 0.294471054
## Ash -0.212029283 2.061604035
## Alcalinity.of.ash 0.185736095 -0.140807844
## Magnesium 0.007890352 0.002354910
## Total.phenols 0.698923759 -0.072032371
## Flavanoids -1.832854947 -0.412573641
## Nonflavanoid.phenols -2.135147371 -1.337869122
## Proanthocyanins 0.490050381 -0.572153820
## Color.intensity 0.257510411 0.258671867
## Hue -1.390960425 -1.275777480
## OD280.OD315.of.diluted.wines -1.317639986 -0.161337269
## Proline -0.003813797 0.002768756
##
## Proportion of trace:
## LD1 LD2
## 0.7283 0.2717
Confusion Matrix on training data :-
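A sketch of how the confusion matrix and the classification rate below could be computed:

```r
# Predicted classes on the training data vs. the true labels.
lda_pred <- predict(fit_lda, train_df)$class
table(true = train_df$class, predicted = lda_pred)

# Correct classification rate (accuracy).
mean(lda_pred == train_df$class)
```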
## predicted
## true 1 2 3
## 1 41 0 0
## 2 0 50 0
## 3 0 0 34
Correct classification rate:-
The accuracy of this LDA classifier on Training Data is 100 %.
Confusion Matrix on testing data :-
## predicted
## true 1 2 3
## 1 16 2 0
## 2 0 20 1
## 3 0 0 14
Correct classification rate:-
The accuracy of this LDA classifier on Testing Data is 94.34 %.
Support Vector Machines ( Classifiers ) :-
Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression. Thanks to kernel functions, they can solve both linear and non-linear problems.
Fitting SVM - Classifier :-
With Default Parameters :-
As a first step, I fit a support vector machine classifier with the default values of the hyperparameters cost ( C ) and gamma ( \(\gamma\) ).
The summary of the fitted default SVM classifier is as follows :-
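A sketch of the fit, again assuming the e1071 package:

```r
# Radial-kernel SVM classifier with default cost and gamma;
# C-classification is chosen automatically for a factor response.
fit_svc <- svm(classification_form, data = train_df)
summary(fit_svc)
```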
##
## Call:
## svm(formula = classification_form, data = train_df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 57
##
## ( 25 19 13 )
##
##
## Number of Classes: 3
##
## Levels:
## 1 2 3
We can observe that,
- Number of support vectors (data points defining the margins): 57
- 25 from Class 1
- 19 from Class 2
- 13 from Class 3
- The default C-classification type is used.
- The kernel selected is the radial basis function (RBF).
- The default cost value ( C ) is 1.
Confusion Matrix on training data :-
## predicted
## true 1 2 3
## 1 41 0 0
## 2 0 50 0
## 3 0 0 34
Correct classification rate:-
The accuracy of this SVM classifier on Training Data is 100 %.
Confusion Matrix on testing data :-
## predicted
## true 1 2 3
## 1 17 1 0
## 2 0 20 1
## 3 0 0 14
Correct classification rate:-
The accuracy of this SVM classifier on Testing Data is 96.23 %.
Parameter Tuning ( grid search ) :-
The accuracy of the SVM model with default parameters may still be improvable. We can tune the parameters C and gamma ( \(\gamma\) ), slightly changing the smoothness of the fitted curve (tuning the hyperplane) to classify the data points more accurately than before.
- The tuning summary obtained with 10-fold cross-validation is as follows.
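A sketch of the grid search; the grid values are illustrative assumptions, but the 10-fold cross-validation matches the output below:

```r
# Grid search over gamma and cost with 10-fold cross-validation.
tuned_cls <- tune(svm, classification_form, data = train_df,
                  ranges = list(gamma = seq(0.1, 0.5, 0.1), cost = 1:8),
                  tunecontrol = tune.control(sampling = "cross", cross = 10))
summary(tuned_cls)
plot(tuned_cls)  # the tuning plot referred to below
```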
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.3 4
##
## - best performance: 0.03888889
We can observe that the recommended best parameters from 10-fold cross-validation are
- \(\gamma\) : 0.3
- C : 4
The tuning plot is as follows :-
Fitting the SVM classifier with the new parameters :-
The summary of the fitted model is as follows.
##
## Call:
## svm(formula = classification_form, data = train_df, gamma = gam,
## cost = cos)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 4
##
## Number of Support Vectors: 103
##
## ( 44 28 31 )
##
##
## Number of Classes: 3
##
## Levels:
## 1 2 3
We can observe that,
- Number of support vectors (data points defining the margins): 103
- 44 from Class 1
- 28 from Class 2
- 31 from Class 3
- The default C-classification type is used.
- The kernel selected is the radial basis function (RBF).
- The cost value ( C ) is 4 (set by us).
- The gamma value ( \(\gamma\) ) is 0.3 (set by us).
Confusion Matrix on training data :-
## predicted
## true 1 2 3
## 1 41 0 0
## 2 0 50 0
## 3 0 0 34
Correct classification rate:-
The accuracy of this tuned SVM classifier on Training Data is 100 %.
Confusion Matrix on testing data :-
## predicted
## true 1 2 3
## 1 13 5 0
## 2 0 20 1
## 3 0 1 13
Correct classification rate:-
The accuracy of this tuned SVM classifier on Testing Data is 86.79 %.
Classification Trees:-
Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning.
Fitting Classification Tree on training data :-
The summary of the fitted Decision tree is as follows.
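A sketch of the fit, again assuming party::ctree:

```r
# Conditional inference tree for the classification task.
fit_ctree_cls <- ctree(classification_form, data = train_df)
print(fit_ctree_cls)
plot(fit_ctree_cls)  # the visualization referred to below
```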
##
## Conditional inference tree with 5 terminal nodes
##
## Response: class
## Inputs: Alcohol, Malic.acid, Ash, Alcalinity.of.ash, Magnesium, Total.phenols, Flavanoids, Nonflavanoid.phenols, Proanthocyanins, Color.intensity, Hue, OD280.OD315.of.diluted.wines, Proline
## Number of observations: 125
##
## 1) Flavanoids <= 1.39; criterion = 1, statistic = 91.347
## 2) Hue <= 0.85; criterion = 1, statistic = 18.711
## 3)* weights = 31
## 2) Hue > 0.85
## 4)* weights = 7
## 1) Flavanoids > 1.39
## 5) Proline <= 750; criterion = 1, statistic = 64.099
## 6)* weights = 43
## 5) Proline > 750
## 7) Alcohol <= 13.07; criterion = 0.999, statistic = 15.804
## 8)* weights = 7
## 7) Alcohol > 13.07
## 9)* weights = 37
The same summary can be visualized as follows.
Confusion Matrix on training data :-
## predicted
## true 1 2 3
## 1 41 0 0
## 2 3 47 0
## 3 0 3 31
Correct classification rate:-
The accuracy of this Decision tree classifier on Training Data is 95.2 %.
Confusion Matrix on testing data :-
## predicted
## true 1 2 3
## 1 16 2 0
## 2 0 20 1
## 3 0 3 11
Correct classification rate:-
The accuracy of this Decision tree classifier on Testing Data is 88.68 %.
Conclusion on Classification Models:-
The summary of all the fitted models and their performance on the training and testing data is as follows.
CLASSIFICATION MODELS SUMMARY
| S No | Model Name | ACC. (TRAIN DATA) | ACC. (TEST DATA) |
|---|---|---|---|
| 1. | LDA Model | 100 % | 94.34 % |
| 2. | SVM Classifier | 100 % | 96.23 % |
| 3. | Tuned SVM Classifier | 100 % | 86.79 % |
| 4. | Decision Tree ( Classifier ) | 95.2 % | 88.68 % |
As the default SVM classifier has the highest test accuracy, I conclude that it is the best of these models for predicting the class of wine. Notably, the tuned SVM classifier performs worse on the test set than the default one despite perfect training accuracy, which suggests that the tuned parameters overfit the training data.