Wine Quality Prediction Model

12/04/2017

Summary

This data product uses available data sets for red and white Portuguese wine. The data were analysed and a random forest model was created to predict the quality of wine from the results of physicochemical tests.

A Shiny application was developed based on these results using a simpler (faster) linear regression model.

Wine quality data is available from Kaggle
More details are available at UCI Machine Learning Repository
Citation is available here

Ideas for extending this data product are:

Include data from other wine reviews to improve the accuracy
Improve the model (perhaps by using only the random forest)

Application details:

The application is available here.
Git repo: https://github.com/greigar/ddp-wine-quality

Data Processing

The data were loaded from two semicolon-delimited files (one file for red, one for white)
The data sets were combined with rbind() and an extra column, colour, was added to indicate the type of wine
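
The two steps above can be sketched in a few lines. This is a Python illustration of the idea, not the R code behind this data product: the inline sample rows are hypothetical stand-ins for the two files, only three of the columns are shown, and the helper name load_wine is made up for the example.

```python
import csv
import io

# Hypothetical two-row extracts standing in for the red and white files.
RED_CSV = '''"fixed.acidity";"alcohol";"quality"
7.4;9.4;5
7.8;9.8;5
'''
WHITE_CSV = '''"fixed.acidity";"alcohol";"quality"
7.0;8.8;6
6.3;9.5;6
'''


def load_wine(text, colour):
    """Parse semicolon-delimited text and tag every row with its colour."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [dict(row, colour=colour) for row in reader]


# Row-wise combination, analogous to rbind() on two data frames.
wines = load_wine(RED_CSV, "red") + load_wine(WHITE_CSV, "white")

for wine in wines:
    print(wine)
```

Adding the colour column before combining keeps the origin of each row after the two sets are merged. Note that csv returns every value as a string, so numeric columns would still need converting before modelling.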

The available columns are:

fixed.acidity
volatile.acidity
citric.acid
residual.sugar
chlorides
free.sulfur.dioxide
total.sulfur.dioxide
density
pH
sulphates
alcohol
quality
colour

Quality is scored on a scale of 0 (bad) to 10 (very good).

Model - random forest

A random forest model was trained and tested on the data set, which was partitioned into training and testing sets. The overall statistics from the confusion matrix are:

Statistic	Value
Accuracy	0.6527094
Kappa	0.4575751
AccuracyLower	0.6289858
AccuracyUpper	0.6758796
AccuracyNull	0.4365764
AccuracyPValue	0.0000000
McnemarPValue	NaN

The accuracy is quite low: about 65%, compared with a no-information rate (AccuracyNull) of about 44%.
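
To make these statistics easier to read, the Python sketch below computes accuracy, the no-information rate and Cohen's kappa from a confusion matrix using their usual definitions. The 3-class matrix is hypothetical (it is not the matrix from this model), and the function name overall_stats is made up for the example.

```python
# Hypothetical 3-class confusion matrix: rows = predicted, columns = actual.
confusion = [
    [50, 10, 0],
    [15, 80, 10],
    [0, 10, 25],
]


def overall_stats(matrix):
    """Return accuracy, no-information rate and Cohen's kappa."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Share of cases on the diagonal, i.e. predicted class == actual class.
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    # No-information rate: always guess the most common actual class.
    null_accuracy = max(col_totals) / total
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"Accuracy": accuracy, "AccuracyNull": null_accuracy, "Kappa": kappa}


stats = overall_stats(confusion)
for name, value in stats.items():
    print(name, round(value, 4))
```

Kappa rescales accuracy so that 0 means no better than chance agreement and 1 means perfect agreement, which is why it is lower than the raw accuracy.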

Model - linear regression

This model was created to:

Predict the quality of wine from the other variables
Determine the three main predictors of quality (using varImp()) for use in a smaller model

A smaller linear regression model was used because it was much faster than the random forest model. For an interactive data product, some model accuracy was traded for responsiveness.
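
As a rough illustration of such a smaller model, the Python sketch below fits an ordinary least-squares regression on three predictors. It is not the R code behind the Shiny application. The choice of alcohol, volatile.acidity and free.sulfur.dioxide as the three predictors, the sample rows, the coefficients in TRUE_COEFS and the helper names fit_ols and predict are all assumptions made for the example.

```python
# Hypothetical rows: (alcohol, volatile.acidity, free.sulfur.dioxide).
ROWS = [
    (9.4, 0.70, 11.0),
    (9.8, 0.88, 25.0),
    (10.5, 0.28, 30.0),
    (12.0, 0.30, 45.0),
    (11.2, 0.45, 17.0),
    (13.0, 0.22, 52.0),
]
# Made-up coefficients (intercept first), used only to generate quality scores.
TRUE_COEFS = [2.0, 0.4, -1.5, 0.01]


def predict(coefs, row):
    """Intercept plus the weighted sum of the predictors."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


def fit_ols(rows, targets):
    """Least-squares fit by solving the normal equations (X'X) b = X'y."""
    X = [[1.0, *row] for row in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


quality = [predict(TRUE_COEFS, row) for row in ROWS]
coefs = fit_ols(ROWS, quality)
print([round(c, 4) for c in coefs])
```

Once the coefficients are fitted, each prediction is a single weighted sum, which is why a model like this responds quickly in an interactive application.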

Variable importance, as reported by varImp():

Predictor	Importance
alcohol	87.95623
volatile.acidity	82.55369
free.sulfur.dioxide	72.11978
sulphates	64.73579
pH	60.80405
total.sulfur.dioxide	53.44822
residual.sugar	52.63394
citric.acid	50.50613
fixed.acidity	49.06257
chlorides	48.69321
density	41.07213
colour	12.23574
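
Picking the three main predictors from the values above amounts to sorting by importance and keeping the first three. A minimal Python sketch, with the values copied from the table and the variable names chosen for the example:

```python
# Importance values copied from the table above.
importance = {
    "alcohol": 87.95623,
    "volatile.acidity": 82.55369,
    "free.sulfur.dioxide": 72.11978,
    "sulphates": 64.73579,
    "pH": 60.80405,
    "total.sulfur.dioxide": 53.44822,
    "residual.sugar": 52.63394,
    "citric.acid": 50.50613,
    "fixed.acidity": 49.06257,
    "chlorides": 48.69321,
    "density": 41.07213,
    "colour": 12.23574,
}

# Sort predictor names by importance, highest first, and keep three.
top_three = sorted(importance, key=importance.get, reverse=True)[:3]
print(top_three)  # ['alcohol', 'volatile.acidity', 'free.sulfur.dioxide']
```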
