—————————– Standard Approach ———— Phase 1 ————–

This dataset was obtained from the UCI Machine Learning Repository, where it remains available:

https://archive.ics.uci.edu/ml/datasets/wine+quality

Objective: Use machine learning to determine which physicochemical

properties make a wine ‘good’.

For Phase 1, regression analysis will be used.

For Phase 2, machine learning will be used.

The results of the two phases will then be compared against each other.

Definition of physicochemical: of or pertaining to both physical and chemical properties, changes, and reactions; of or according to physical chemistry.

Wine quality in this data set is classified as follows:

a score of “6” = Good, a score of “7” = Good Plus,

and a score of “8” = Very Good.

The highest quality score in this data set is “8”.
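The classification above can be written as a small lookup in R. This is only a sketch: the function name quality_label is my own, and scores below 6 are returned as NA because this report does not give them a label.

```r
# Map a numeric quality score to the labels used in this report.
# Scores without a label in the report (anything other than 6, 7, 8) become NA.
quality_label <- function(score) {
  labels <- c("6" = "Good", "7" = "Good Plus", "8" = "Very Good")
  unname(labels[as.character(score)])
}

quality_label(c(5, 6, 7, 8))
```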

Please Note: Acknowledgements are at the End of this Presentation.

Content (the name of each column). For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

What follows is an attempt to come to an understanding of the data,

and hopefully of how the columns relate to one another.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Plots and various attempts at understanding the inter-relationship of the columns


## The following objects are masked from Wine (pos = 3):
## 
##     alcohol, chlorides, citric.acid, density, fixed.acidity,
##     free.sulfur.dioxide, pH, quality, residual.sugar, sulphates,
##     total.sulfur.dioxide, volatile.acidity

## # A tibble: 11 x 2
##    rowname              quality
##    <chr>                  <dbl>
##  1 fixed.acidity         0.124 
##  2 volatile.acidity     -0.391 
##  3 citric.acid           0.226 
##  4 residual.sugar        0.0137
##  5 chlorides            -0.129 
##  6 free.sulfur.dioxide  -0.0507
##  7 total.sulfur.dioxide -0.185 
##  8 density              -0.175 
##  9 pH                   -0.0577
## 10 sulphates             0.251 
## 11 alcohol               0.476
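The table above lists the correlation of each input column with quality. A table of this shape can be produced in base R roughly as follows. This is a sketch and not necessarily the code that produced the output above; the function name cor_with_quality is my own, and the small toy data frame holds made-up values purely so the example runs on its own.

```r
# Pearson correlation of every predictor column with the target column.
cor_with_quality <- function(df, target = "quality") {
  predictors <- setdiff(names(df), target)
  r <- vapply(predictors,
              function(col) cor(df[[col]], df[[target]]),
              numeric(1))
  data.frame(rowname = predictors, quality = unname(r), row.names = NULL)
}

# Toy data (made-up values, NOT taken from the wine data set).
toy <- data.frame(
  alcohol          = c(9.0, 10.0, 11.0, 12.0),
  volatile.acidity = c(0.8, 0.6, 0.5, 0.3),
  quality          = c(4, 5, 6, 7)
)
cor_with_quality(toy)
```

With the real data, the call would be cor_with_quality(Wine), assuming Wine is the numeric data frame summarised earlier.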


The section below (fit) is as yet unfinished. I intend to test the regression model in

the equation below against a random selection from the actual data set, and from

this determine whether the equation is sufficient for prediction purposes.

I then intend to see whether the model derived by machine learning

would be a better fit.
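One way the planned test against a random selection could look is a simple holdout split: fit the regression on part of the rows and measure the prediction error on the rest. The sketch below is an assumption about how this might be done, not the finished method of this report; the function name holdout_rmse, the 30% test fraction, and the simulated data are all my own choices for illustration.

```r
# Fit lm() on a random training subset and return the RMSE on the held-out rows.
holdout_rmse <- function(df, formula, test_frac = 0.3, seed = 1) {
  set.seed(seed)                                   # reproducible split
  n <- nrow(df)
  test_idx <- sample(n, size = floor(test_frac * n))
  fit <- lm(formula, data = df[-test_idx, , drop = FALSE])
  pred <- predict(fit, newdata = df[test_idx, , drop = FALSE])
  actual <- df[test_idx, all.vars(formula)[1]]     # response column
  sqrt(mean((actual - pred)^2))
}

# Simulated data (NOT the wine data) so the example runs on its own.
set.seed(42)
sim <- data.frame(alcohol = runif(200, 8, 15))
sim$quality <- 2 + 0.3 * sim$alcohol + rnorm(200, sd = 0.5)
holdout_rmse(sim, quality ~ alcohol)
```

With the real data the call would be holdout_rmse(Wine, quality ~ alcohol + sulphates + citric.acid + volatile.acidity). The same held-out rows could later be scored with the Phase 2 model so that both are compared on identical data.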

## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + citric.acid + volatile.acidity, 
##     data = Wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71408 -0.38590 -0.06402  0.46657  2.20393 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.64592    0.20106  13.160  < 2e-16 ***
## alcohol           0.30908    0.01581  19.553  < 2e-16 ***
## sulphates         0.69552    0.10311   6.746 2.12e-11 ***
## citric.acid      -0.07913    0.10381  -0.762    0.446    
## volatile.acidity -1.26506    0.11266 -11.229  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6588 on 1594 degrees of freedom
## Multiple R-squared:  0.3361, Adjusted R-squared:  0.3345 
## F-statistic: 201.8 on 4 and 1594 DF,  p-value: < 2.2e-16

## Quality = 2.646 + 0.309 x alcohol + 0.696 x sulphates - 0.079 x citric.acid - 1.265 x volatile.acidity
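As a worked check, the coefficient estimates from the lm() summary above can be applied by hand to a single wine. The sketch below uses the first data row that appears later in this report (alcohol 9.4, sulphates 0.56, citric acid 0, volatile acidity 0.7, recorded quality 5). The function name predict_quality is my own.

```r
# Coefficient estimates copied from the lm() summary in this report.
coefs <- c("(Intercept)"    = 2.64592,
           alcohol          = 0.30908,
           sulphates        = 0.69552,
           citric.acid      = -0.07913,
           volatile.acidity = -1.26506)

# Intercept plus each coefficient times its variable.
predict_quality <- function(alcohol, sulphates, citric.acid, volatile.acidity) {
  unname(coefs["(Intercept)"] +
           coefs["alcohol"] * alcohol +
           coefs["sulphates"] * sulphates +
           coefs["citric.acid"] * citric.acid +
           coefs["volatile.acidity"] * volatile.acidity)
}

# 2.64592 + 2.905352 + 0.3894912 - 0 - 0.885542 = 5.0552212, about 5.06
predict_quality(alcohol = 9.4, sulphates = 0.56,
                citric.acid = 0, volatile.acidity = 0.7)
```

For this wine the equation gives about 5.06 against a recorded quality of 5. A single row says little about overall fit, which the R-squared of 0.3361 suggests is modest.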

##                       2.5 %     97.5 %
## (Intercept)       2.2515575  3.0402782
## alcohol           0.2780728  0.3400835
## sulphates         0.4932780  0.8977542
## citric.acid      -0.2827462  0.1244961
## volatile.acidity -1.4860436 -1.0440733

## Analysis of Variance Table
## 
## Response: quality
##                    Df Sum Sq Mean Sq F value    Pr(>F)    
## alcohol             1 236.29 236.295 544.413 < 2.2e-16 ***
## sulphates           1  44.98  44.977 103.624 < 2.2e-16 ***
## citric.acid         1  14.32  14.318  32.987 1.109e-08 ***
## volatile.acidity    1  54.72  54.724 126.081 < 2.2e-16 ***
## Residuals        1594 691.85   0.434                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##                   (Intercept)       alcohol     sulphates  citric.acid
## (Intercept)       0.040423190 -2.717029e-03 -6.257782e-03 -0.004778726
## alcohol          -0.002717029  2.498724e-04 -7.273176e-05  0.000019016
## sulphates        -0.006257782 -7.273176e-05  1.063090e-02 -0.002245641
## citric.acid      -0.004778726  1.901600e-05 -2.245641e-03  0.010776802
## volatile.acidity -0.012160812  2.942844e-04  1.189162e-03  0.005945653
##                  volatile.acidity
## (Intercept)         -0.0121608117
## alcohol              0.0002942844
## sulphates            0.0011891619
## citric.acid          0.0059456529
## volatile.acidity     0.0126931742

—————————– Attempt at Machine Learning ———— Phase 2 ————–

##                    name      type na mean      disp median mad min  max
## 1         fixed.acidity character  0   NA 0.9581119     NA  NA   1  134
## 2      volatile.acidity character  0   NA 0.9706158     NA  NA   1   94
## 3           citric.acid character  0   NA 0.9174742     NA  NA   1  264
## 4        residual.sugar character  0   NA 0.9024695     NA  NA   1  312
## 5             chlorides character  0   NA 0.9587371     NA  NA   1  132
## 6   free.sulfur.dioxide character  0   NA 0.9137230     NA  NA   1  276
## 7  total.sulfur.dioxide character  0   NA 0.9731166     NA  NA   1   86
## 8               density character  0   NA 0.9774930     NA  NA   1   72
## 9                    pH character  0   NA 0.9643639     NA  NA   1  114
## 10            sulphates character  0   NA 0.9568615     NA  NA   1  138
## 11              alcohol character  0   NA 0.9130978     NA  NA   1  278
## 12              quality character  0   NA 0.5742420     NA  NA   1 1362
##    nlevs
## 1     97
## 2    144
## 3     81
## 4     92
## 5    154
## 6     61
## 7    145
## 8    437
## 9     90
## 10    97
## 11    66
## 12     7

## 'data.frame':    3199 obs. of  12 variables:
##  $ fixed.acidity       : chr  "fixed acidity" "7.4" "7.8" "7.8" ...
##  $ volatile.acidity    : chr  "volatile acidity" "0.7" "0.88" "0.76" ...
##  $ citric.acid         : chr  "citric acid" "0" "0" "0.04" ...
##  $ residual.sugar      : chr  "residual sugar" "1.9" "2.6" "2.3" ...
##  $ chlorides           : chr  "chlorides" "0.076" "0.098" "0.092" ...
##  $ free.sulfur.dioxide : chr  "free sulfur dioxide" "11" "25" "15" ...
##  $ total.sulfur.dioxide: chr  "total sulfur dioxide" "34" "67" "54" ...
##  $ density             : chr  "density" "0.9978" "0.9968" "0.997" ...
##  $ pH                  : chr  "pH" "3.51" "3.2" "3.26" ...
##  $ sulphates           : chr  "sulphates" "0.56" "0.68" "0.65" ...
##  $ alcohol             : chr  "alcohol" "9.4" "9.8" "9.8" ...
##  $ quality             : chr  "quality" "5" "5" "5" ...
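The structure output above shows every column as character, with the column titles ("fixed acidity", "volatile acidity", ...) appearing as the first data value. This suggests the header line of the file was read as data. A sketch of reading the file so that the header is used for column names and the values stay numeric follows. The function name read_wine and the comma separator are assumptions (the UCI copy of the file is semicolon-separated, so sep would need changing there). The temporary file holds only the first two data rows visible above so the example runs on its own.

```r
# Read a wine CSV, using the first line as column names.
# read.csv() turns names such as "fixed acidity" into "fixed.acidity".
read_wine <- function(path, sep = ",") {
  read.csv(path, header = TRUE, sep = sep, stringsAsFactors = FALSE)
}

# Small temporary file built from the first two rows shown in the output above.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  paste("fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,",
        "free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,",
        "alcohol,quality", sep = ""),
  "7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5",
  "7.8,0.88,0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5"
), tmp)

wine_check <- read_wine(tmp)
str(wine_check)   # columns should now be num/int rather than chr
```

Once the columns are numeric, the summary table should report means, medians and ranges instead of NA.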

Acknowledgements

This dataset is also available from the UCI machine learning repository,

https://archive.ics.uci.edu/ml/datasets/wine+quality . I just shared it to Kaggle

for convenience. (If I am mistaken and the public license type disallows me from doing

so, I will take this down at first request.) I am not the owner of this dataset.

Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira,

F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from

physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.


MATH2319_Phase1

s3686502 Dan Enoka

2 April 2018