Prediction of Quality ranking from the chemical properties of the wines
A predictive model developed on this data is expected to provide guidance to vineyards regarding quality.
Number of instances: red wine - 1599; white wine - 4898
Number of attributes: 11 predictors and 1 output attribute
Missing attribute values: None
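Input variables (based on physicochemical tests): fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol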
Output variable (based on sensory data):
Quality (score between 0 and 10)
| | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | 3.80 | 0.08 | 0.00 | 0.60 | 0.01 | 2.00 | 9.0 | 0.99 | 2.72 | 0.22 | 8.00 | 3.00 |
| Q1 Val | 6.30 | 0.21 | 0.27 | 1.70 | 0.04 | 23.00 | 108.0 | 0.99 | 3.09 | 0.41 | 9.50 | 5.00 |
| Median | 6.80 | 0.26 | 0.32 | 5.20 | 0.04 | 34.00 | 134.0 | 0.99 | 3.18 | 0.47 | 10.40 | 6.00 |
| Mean | 6.86 | 0.28 | 0.33 | 6.39 | 0.05 | 35.31 | 138.4 | 0.99 | 3.19 | 0.49 | 10.51 | 5.88 |
| Q3 Val | 7.30 | 0.32 | 0.39 | 9.90 | 0.05 | 46.00 | 167.0 | 1.00 | 3.28 | 0.55 | 11.40 | 6.00 |
| Max | 14.20 | 1.10 | 1.66 | 65.80 | 0.35 | 289.00 | 440.0 | 1.04 | 3.82 | 1.08 | 14.20 | 9.00 |
Quality has most values concentrated in the categories 5, 6 and 7. Only a small proportion is in the categories [3, 4] and [8, 9] and none in the categories [1, 2] and 10.
Fixed acidity, volatile acidity and citric acid have outliers. If those outliers are eliminated, the distributions of these variables may be taken to be symmetric.
Residual sugar has a positively skewed distribution; even after eliminating the outliers, the distribution will remain skewed.
Some of the variables, e.g. free sulphur dioxide and density, have only a few outliers, but these are very different from the rest.
Most outliers are on the larger side.
Alcohol has an irregular shaped distribution but it does not have pronounced outliers.
The classes are ordered and not balanced, e.g. there are many more normal wines than excellent or poor ones.
Outlier detection algorithms could be used to detect the few excellent or poor wines.
Several of the attributes may be correlated, so it makes sense to apply some form of feature selection.
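To make the correlation remark concrete, here is a minimal Python sketch (the report's own output comes from R, so this is an independent illustration, not the author's code). It computes Pearson correlations between columns and lists the pairs above a cutoff. The helper names, the 0.8 threshold and the toy values are all assumptions made for the example; they are not taken from the wine data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Hypothetical toy values, NOT rows from the wine dataset.
toy_columns = {
    "density": [0.990, 0.992, 0.994, 0.996, 0.998],
    "residual_sugar": [1.0, 3.0, 5.5, 8.0, 12.0],
    "pH": [3.2, 3.0, 3.3, 3.1, 3.25],
}
pairs = correlated_pairs(toy_columns)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.2f}")
```

When a pair is flagged, one of the two predictors is a candidate for removal, which is the kind of feature selection the observation above suggests.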
Possibly the most important step in data preparation is identifying outliers. Because the data are multivariate, we retain only those observations in which no predictor value falls outside the limits constructed from boxplots. The following rule is applied:
A predictor value is considered an outlier only if it is greater than Q3 + 1.5 × IQR. The rationale behind this rule is that the extreme outliers are all at the higher end of the values and the distributions are all positively skewed. Applying this rule reduces the data size from 4898 to 4061.
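The rule above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code behind the report: the function names are hypothetical, the quartiles use the `inclusive` method of `statistics.quantiles` (boxplot routines may compute hinges slightly differently), and the toy rows are made up.

```python
from statistics import quantiles

def upper_fences(rows, k=1.5):
    """Return the upper boxplot fence Q3 + k * IQR for each column of `rows`."""
    fences = []
    for column in zip(*rows):
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        fences.append(q3 + k * (q3 - q1))
    return fences

def drop_high_outliers(rows, k=1.5):
    """Keep only rows in which no predictor exceeds its column's upper fence."""
    fences = upper_fences(rows, k)
    return [row for row in rows
            if all(value <= fence for value, fence in zip(row, fences))]

# Hypothetical toy data: two predictors, with one extreme value in the last row.
toy = [(6.3, 1.7), (6.8, 5.2), (7.3, 9.9), (6.5, 3.0), (7.0, 65.8)]
kept = drop_high_outliers(toy)
print(len(toy), "->", len(kept))
```

Only the upper fence is checked, mirroring the one-sided rule: a row is dropped as soon as any single predictor lies above its fence.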
The resulting data is split into two halves: 50% is used as training data and 50% as test data.
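A 50/50 split of this kind might look like the sketch below. The shuffling, the seed and the choice to give the odd row to the training half are assumptions for illustration; the report does not state how rows were assigned. With 4061 rows this yields a test half of 2030, which agrees with the totals of the confusion matrices.

```python
import random

def split_half(rows, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) halves."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seed value is arbitrary
    cut = (len(shuffled) + 1) // 2         # the odd row, if any, goes to training
    return shuffled[:cut], shuffled[cut:]

# Row indices stand in for the 4061 observations left after outlier removal.
train, test = split_half(range(4061))
print(len(train), len(test))  # prints "2031 2030"
```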
The output variable quality, scored on a 0 to 10 scale, has been grouped into four categories:
1. Excellent - Quality greater than 6
2. Good - Quality equal to 6
3. Ok - Quality equal to 5
4. Worst - Quality less than 5
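The grouping can be written as a small mapping function. The sketch below is in Python and the function name `quality_class` is a hypothetical choice; only the four thresholds come from the text.

```python
def quality_class(score):
    """Map a numeric quality score to one of the four categories."""
    if score > 6:
        return "Excellent"
    if score == 6:
        return "Good"
    if score == 5:
        return "Ok"
    return "Worst"  # any score below 5

# The observed scores range from 3 to 9.
labels = [quality_class(s) for s in range(3, 10)]
print(labels)
```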
# Multinomial
## # weights: 52 (36 variable)
## initial value 2815.563847
## iter 10 value 2236.347560
## iter 20 value 2062.966781
## iter 30 value 1975.526144
## iter 40 value 1972.944437
## iter 50 value 1971.711423
## iter 60 value 1970.691410
## iter 70 value 1970.536506
## iter 80 value 1966.587966
## iter 90 value 1965.897435
## iter 100 value 1964.567235
## iter 110 value 1964.539460
## iter 120 value 1964.527259
## iter 130 value 1963.747029
## iter 140 value 1962.719187
## final value 1962.621782
## converged
| Actual class (rows) / mnPredict1 (columns) | Excellent | Good | Ok | Worst |
|---|---|---|---|---|
| Excellent | 161 | 301 | 22 | 0 |
| Good | 96 | 681 | 166 | 5 |
| Ok | 8 | 262 | 276 | 3 |
| Worst | 4 | 22 | 22 | 1 |
## # weights: 8 (3 variable)
## initial value 2815.563847
## iter 10 value 2331.297522
## final value 2331.296577
## converged
| Actual class (rows) / mnPredict2 (columns) | Excellent | Good | Ok | Worst |
|---|---|---|---|---|
| Excellent | 0 | 484 | 0 | 0 |
| Good | 0 | 948 | 0 | 0 |
| Ok | 0 | 549 | 0 | 0 |
| Worst | 0 | 49 | 0 | 0 |
## # weights: 48 (33 variable)
## initial value 2815.563847
## iter 10 value 2236.278400
## iter 20 value 2062.606009
## iter 30 value 1976.510861
## iter 40 value 1973.769476
## iter 50 value 1971.901471
## iter 60 value 1971.043069
## final value 1970.962444
## converged
| Actual class (rows) / mnPredict3 (columns) | Excellent | Good | Ok | Worst |
|---|---|---|---|---|
| Excellent | 150 | 310 | 24 | 0 |
| Good | 99 | 677 | 167 | 5 |
| Ok | 5 | 262 | 279 | 3 |
| Worst | 2 | 24 | 22 | 1 |
## # weights: 44 (30 variable)
## initial value 2815.563847
## iter 10 value 2233.470605
## iter 20 value 2015.913973
## iter 30 value 1991.152188
## iter 40 value 1989.208260
## iter 50 value 1987.876358
## iter 60 value 1987.286160
## final value 1987.285260
## converged
| Actual class (rows) / mnPredict4 (columns) | Excellent | Good | Ok | Worst |
|---|---|---|---|---|
| Excellent | 143 | 301 | 40 | 0 |
| Good | 88 | 681 | 175 | 4 |
| Ok | 5 | 264 | 276 | 4 |
| Worst | 2 | 25 | 21 | 1 |
## # weights: 48 (33 variable)
## initial value 2815.563847
## iter 10 value 2285.762588
## iter 20 value 2192.831657
## iter 30 value 2135.902568
## iter 40 value 2120.939196
## iter 50 value 2103.319052
## iter 60 value 2092.611884
## iter 70 value 2032.776296
## iter 80 value 2032.774405
## iter 90 value 2030.823628
## iter 100 value 2005.872430
## iter 110 value 2004.866730
## iter 120 value 2003.601517
## iter 130 value 2001.055510
## iter 140 value 1983.010864
## iter 150 value 1982.405806
## iter 160 value 1981.993796
## iter 170 value 1981.627691
## iter 180 value 1981.081926
## iter 190 value 1980.639647
## iter 200 value 1976.405511
## final value 1976.389947
## converged
| Actual class (rows) / mnPredict5 (columns) | Excellent | Good | Ok | Worst |
|---|---|---|---|---|
| Excellent | 160 | 301 | 23 | 0 |
| Good | 97 | 696 | 149 | 6 |
| Ok | 8 | 294 | 245 | 2 |
| Worst | 3 | 22 | 22 | 2 |
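Because the classes are unbalanced, it can help to read each matrix row by row. The Python sketch below computes per-class recall (the share of each actual class that was predicted correctly) for the first matrix, mnPredict1, with rows taken as actual classes and columns as predictions. The function name is hypothetical and recall is not a metric the report itself uses; it is offered only as a way to read the counts.

```python
CLASSES = ["Excellent", "Good", "Ok", "Worst"]

# Counts copied from the mnPredict1 matrix: rows = actual, columns = predicted.
confusion = [
    [161, 301, 22, 0],
    [96, 681, 166, 5],
    [8, 262, 276, 3],
    [4, 22, 22, 1],
]

def per_class_recall(matrix, labels):
    """Return {label: correctly predicted / actual count} for each class."""
    return {label: row[i] / sum(row)
            for i, (label, row) in enumerate(zip(labels, matrix))}

recall = per_class_recall(confusion, CLASSES)
for label, value in recall.items():
    print(f"{label}: {value:.3f}")
```

On these counts only 1 of the 49 actual "Worst" wines is recovered, which overall accuracy alone would not reveal.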
Accuracy is used as the metric for comparing the models and algorithms on the given data. The following table shows the accuracy of the different models.
| Model | Multinomial | Rpart | C50 Tree | C50 Rule |
|---|---|---|---|---|
| Model 1 (full) | 0.5512 | 0.5453 | 0.5695 | 0.5665 |
| Model 2 (null) | 0.4670 | 0.5453 | 0.5217 | 0.5281 |
| Model 3 (without density) | 0.5453 | 0.5453 | 0.5507 | 0.5606 |
| Model 4 | 0.5424 | 0.4946 | 0.5695 | 0.5714 |
| Model 5 | 0.5433 | 0.4946 | 0.5635 | 0.5685 |
| Model 6 | NA | 0.5463 | 0.5837 | 0.5906 |
| Model 7 | NA | 0.5463 | 0.5389 | 0.5483 |
| Model 8 | NA | 0.5409 | NA | NA |
| Model 9 | NA | 0.5463 | NA | NA |
| Model 10 | NA | 0.5453 | NA | NA |
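The Multinomial column can be reproduced from the five confusion matrices: accuracy is the sum of the diagonal (correct predictions) divided by the total count. The Python sketch below does this, assuming rows are actual classes and columns are predictions in the order Excellent, Good, Ok, Worst; the variable and function names are illustrative.

```python
# Counts copied from the mnPredict1..mnPredict5 matrices.
matrices = {
    "mnPredict1": [[161, 301, 22, 0], [96, 681, 166, 5],
                   [8, 262, 276, 3], [4, 22, 22, 1]],
    "mnPredict2": [[0, 484, 0, 0], [0, 948, 0, 0],
                   [0, 549, 0, 0], [0, 49, 0, 0]],
    "mnPredict3": [[150, 310, 24, 0], [99, 677, 167, 5],
                   [5, 262, 279, 3], [2, 24, 22, 1]],
    "mnPredict4": [[143, 301, 40, 0], [88, 681, 175, 4],
                   [5, 264, 276, 4], [2, 25, 21, 1]],
    "mnPredict5": [[160, 301, 23, 0], [97, 696, 149, 6],
                   [8, 294, 245, 2], [3, 22, 22, 2]],
}

def accuracy(matrix):
    """Share of observations on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

accuracies = {name: accuracy(m) for name, m in matrices.items()}
for name, value in accuracies.items():
    print(f"{name}: {value:.4f}")
# prints 0.5512, 0.4670, 0.5453, 0.5424 and 0.5433, matching the table
```

The null model's 0.4670 is simply the share of "Good" wines in the test half (948 of 2030), so it serves as the majority-class baseline for the other rows.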
Source: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV), 2009. Dataset: http://www3.dsi.uminho.pt/pcortez/wine/