
ABOUT THE DATA :

OBJECTIVE :

Prediction of Quality ranking from the chemical properties of the wines

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality.

Number of Instances and observations :

red wine - 1599 ; white wine - 4898

Number of Attributes :

11 predictors and 1 output attribute

Missing Attribute Values :

None

ATTRIBUTE INFORMATION:

Input variables (based on physicochemical tests):

Output variable (based on sensory data):

DISTRIBUTION OF EACH VARIABLE (I.E., HISTOGRAM) :

BOXPLOT AS ANOTHER INDICATOR OF SPREAD :

Observations on the Datasets

  1. Quality has most values concentrated in the categories 5, 6 and 7. Only a small proportion is in the categories [3, 4] and [8, 9] and none in the categories [1, 2] and 10.

  2. Fixed acidity, volatile acidity and citric acid have outliers. If those outliers are eliminated, the distribution of each of these variables may be taken to be symmetric.

  3. Residual sugar has a positively skewed distribution; even after eliminating the outliers, the distribution will remain skewed.

  4. Some of the variables, e.g. free sulphur dioxide and density, have only a few outliers, but these are very different from the rest of the values.

  5. Most outliers are on the high side.

  6. Alcohol has an irregularly shaped distribution but does not have pronounced outliers.

  7. The classes are ordered and not balanced, e.g. there are many more normal wines than excellent or poor ones.

  8. Outlier detection algorithms could be used to detect the few excellent or poor wines.

  9. Several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.
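As a rough illustration of observation 9, the sketch below flags pairs of predictors whose Pearson correlation exceeds a threshold. It is written in Python rather than the R used for this report. The column values, the `correlated_pairs` helper and the 0.8 threshold are all assumptions made for the example, not results from the wine data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Made-up values for illustration only (not taken from the wine data).
columns = {
    "residual_sugar": [1, 2, 3, 4, 5],
    "density": [2.1, 3.9, 6.2, 8.0, 9.8],
    "alcohol": [3, 1, 4, 1, 5],
}

for name_a, name_b, r in correlated_pairs(columns):
    print(name_a, name_b, r)
```

One predictor from each flagged pair is a candidate for removal, which is one simple form of feature selection.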

Data Preparation :

1. Transformation of Predictors (i.e., removing outliers) :

Possibly the most important step in data preparation is to identify outliers. Since the data is multivariate, we retain only those observations for which no predictor value falls outside the limits constructed from the boxplots. The following rule is applied:

A predictor value is considered to be an outlier only if it is greater than Q3 + 1.5 × IQR, where Q3 is the third quartile and IQR is the interquartile range. The rationale behind this rule is that the extreme outliers are all on the higher end of the values and the distributions are all positively skewed. Application of this rule reduces the data size from 4898 to 4061 observations.
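The upper-fence rule can be sketched as follows. This is a minimal Python illustration, not the report's own R code. The helper names and sample values are hypothetical. Quartile conventions also differ between tools, so the fence computed here may differ slightly from the one an R boxplot would draw.

```python
import statistics


def upper_fence(values):
    """Return Q3 + 1.5 * IQR for a sequence of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def drop_high_outliers(rows, predictors):
    """Keep only rows in which no predictor exceeds its upper fence."""
    fences = {p: upper_fence([row[p] for row in rows]) for p in predictors}
    return [row for row in rows
            if all(row[p] <= fences[p] for p in predictors)]


# Made-up values for illustration only (not taken from the wine data).
sugar = [1, 2, 3, 4, 5, 6, 7, 100]
so2 = [40, 42, 41, 39, 43, 40, 41, 42]
rows = [{"residual_sugar": s, "free_sulfur_dioxide": f}
        for s, f in zip(sugar, so2)]

kept = drop_high_outliers(rows, ["residual_sugar", "free_sulfur_dioxide"])
print(len(rows), "->", len(kept))
```

Only the upper fence is checked, in line with the rule stated above. A row is dropped as soon as any one of its predictors exceeds its fence.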

The resulting data is divided equally: 50% as training data and 50% as test data.
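A 50/50 split of this kind might look like the sketch below (Python, not the report's R code). The seed and the `split_half` helper are assumptions. The odd row is given to the training set because the confusion matrices later in this report each sum to 2030 test rows, which suggests that 2031 of the 4061 rows were used for training.

```python
import random


def split_half(rows, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) halves.

    When the row count is odd, the extra row goes to the training set.
    """
    rng = random.Random(seed)      # fixed seed: an assumption, for repeatability
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = (len(shuffled) + 1) // 2
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 4061 observations left after outlier removal.
train, test = split_half(range(4061))
print(len(train), len(test))
```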

2. Transformation of the Output Variable “Quality” :

The output variable quality, which ranges from 1 to 10, has been grouped into four categories as given below :

  1. Excellent - quality greater than 6

  2. Good - quality equal to 6

  3. Ok - quality equal to 5

  4. Worst - quality less than 5
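The grouping can be written as a small mapping function. This is a Python sketch rather than the report's R code, and the function name `quality_group` is hypothetical.

```python
def quality_group(score):
    """Map a numeric quality score to one of the four categories."""
    if score > 6:
        return "Excellent"
    if score == 6:
        return "Good"
    if score == 5:
        return "Ok"
    return "Worst"          # any score below 5


for score in (3, 5, 6, 8):
    print(score, quality_group(score))
```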

Histogram after Data Preparation :

Boxplot after Data Preparation

Splitting of Data into Training and Testing Data

ANALYTICAL APPROACHES :

  1. Multinomial

  2. Rpart

  3. C50 Tree

  4. C50 Rule

Multinomial

## # weights:  52 (36 variable)
## initial  value 2815.563847 
## iter  10 value 2236.347560
## iter  20 value 2062.966781
## iter  30 value 1975.526144
## iter  40 value 1972.944437
## iter  50 value 1971.711423
## iter  60 value 1970.691410
## iter  70 value 1970.536506
## iter  80 value 1966.587966
## iter  90 value 1965.897435
## iter 100 value 1964.567235
## iter 110 value 1964.539460
## iter 120 value 1964.527259
## iter 130 value 1963.747029
## iter 140 value 1962.719187
## final  value 1962.621782 
## converged
##            mnPredict1
##             Excellent Good  Ok Worst
##   Excellent       161  301  22     0
##   Good             96  681 166     5
##   Ok                8  262 276     3
##   Worst             4   22  22     1
## # weights:  8 (3 variable)
## initial  value 2815.563847 
## iter  10 value 2331.297522
## final  value 2331.296577 
## converged
##            mnPredict2
##             Excellent Good  Ok Worst
##   Excellent         0  484   0     0
##   Good              0  948   0     0
##   Ok                0  549   0     0
##   Worst             0   49   0     0
## # weights:  48 (33 variable)
## initial  value 2815.563847 
## iter  10 value 2236.278400
## iter  20 value 2062.606009
## iter  30 value 1976.510861
## iter  40 value 1973.769476
## iter  50 value 1971.901471
## iter  60 value 1971.043069
## final  value 1970.962444 
## converged
##            mnPredict3
##             Excellent Good  Ok Worst
##   Excellent       150  310  24     0
##   Good             99  677 167     5
##   Ok                5  262 279     3
##   Worst             2   24  22     1
## # weights:  44 (30 variable)
## initial  value 2815.563847 
## iter  10 value 2233.470605
## iter  20 value 2015.913973
## iter  30 value 1991.152188
## iter  40 value 1989.208260
## iter  50 value 1987.876358
## iter  60 value 1987.286160
## final  value 1987.285260 
## converged
##            mnPredict4
##             Excellent Good  Ok Worst
##   Excellent       143  301  40     0
##   Good             88  681 175     4
##   Ok                5  264 276     4
##   Worst             2   25  21     1
## # weights:  48 (33 variable)
## initial  value 2815.563847 
## iter  10 value 2285.762588
## iter  20 value 2192.831657
## iter  30 value 2135.902568
## iter  40 value 2120.939196
## iter  50 value 2103.319052
## iter  60 value 2092.611884
## iter  70 value 2032.776296
## iter  80 value 2032.774405
## iter  90 value 2030.823628
## iter 100 value 2005.872430
## iter 110 value 2004.866730
## iter 120 value 2003.601517
## iter 130 value 2001.055510
## iter 140 value 1983.010864
## iter 150 value 1982.405806
## iter 160 value 1981.993796
## iter 170 value 1981.627691
## iter 180 value 1981.081926
## iter 190 value 1980.639647
## iter 200 value 1976.405511
## final  value 1976.389947 
## converged
##            mnPredict5
##             Excellent Good  Ok Worst
##   Excellent       160  301  23     0
##   Good             97  696 149     6
##   Ok                8  294 245     2
##   Worst             3   22  22     2

Recursive Partitioning (Rpart)

C50 Tree Based Approach

C50 Rule Based Algorithm

Inferences :

Accuracy is used as the metric for comparing the models and algorithms on the given data. The following table shows the accuracy of the different models.

Model                 Multinomial   Rpart    C50 Tree   C50 Rule
Model 1 (Full)        0.5512        0.5453   0.5695     0.5665
Model 2 (Null)        0.4670        0.5453   0.5217     0.5281
Model 3 (-Density)    0.5453        0.5453   0.5507     0.5606
Model 4               0.5424        0.4946   0.5695     0.5714
Model 5               0.5433        0.4946   0.5635     0.5685
Model 6               NA            0.5463   0.5837     0.5906
Model 7               NA            0.5463   0.5389     0.5483
Model 8               NA            0.5409   NA         NA
Model 9               NA            0.5463   NA         NA
Model 10              NA            0.5453   NA         NA
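The multinomial accuracies in the table can be re-derived from the confusion matrices printed in the Multinomial section: accuracy is the sum of the diagonal divided by the total count. The Python sketch below (not the report's R code) does this for mnPredict1 and mnPredict2. It assumes rows are actual classes and columns are predicted classes, which is consistent with the row totals being identical across all five matrices.

```python
LABELS = ["Excellent", "Good", "Ok", "Worst"]

# Rows = actual class, columns = predicted class (assumed orientation).
mn_predict1 = [
    [161, 301, 22, 0],
    [96, 681, 166, 5],
    [8, 262, 276, 3],
    [4, 22, 22, 1],
]
mn_predict2 = [
    [0, 484, 0, 0],
    [0, 948, 0, 0],
    [0, 549, 0, 0],
    [0, 49, 0, 0],
]


def accuracy(matrix):
    """Fraction of observations on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_by_class(matrix):
    """Share of each actual class that was predicted correctly."""
    return {LABELS[i]: row[i] / sum(row) for i, row in enumerate(matrix)}


print(round(accuracy(mn_predict1), 4))   # Model 1, multinomial
print(round(accuracy(mn_predict2), 4))   # Model 2, multinomial
print(recall_by_class(mn_predict1))
```

The per-class figures are worth reading alongside accuracy. In mnPredict1 only 1 of the 49 wines in the Worst class is classified correctly, which overall accuracy alone does not reveal.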

Conclusion :

Sources :

Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009 Dataset : http://www3.dsi.uminho.pt/pcortez/wine/

References :

http://www.statmethods.net/advstats/cart.html

http://scg.sdsu.edu/ctrees_r/

https://cran.r-project.org/doc/manuals/r-release/R-lang.html#List-objects

https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html