12/04/2017

Summary

This data product uses available data sets for red and white Portuguese wine. The data were analysed and a random forest model was created to predict wine quality from the results of physicochemical tests.

A Shiny application was developed based on these results using a simpler (faster) linear regression model.

Ideas for extending this data product are:

  • Include data from other wine reviews to improve the accuracy
  • Improve the model (perhaps by using only the random forest)

Application details:

Data Processing

  • The data were loaded from two semicolon-delimited files (one file for red, one for white)
  • The data sets were combined with rbind(), and an extra column (colour) indicating the wine colour was added
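
The loading and combining steps above could look roughly like the sketch below. The file names and the handful of rows are made up so that the example runs on its own; the real files are not named in this document. The colour column is added to each set before rbind() here, which is one of several equivalent orderings.

```r
# Hypothetical file names; written to a temp directory so the sketch is self-contained
red_path   <- file.path(tempdir(), "wine-red.csv")
white_path <- file.path(tempdir(), "wine-white.csv")

# Tiny made-up samples in the same semicolon-delimited layout
writeLines(c("fixed.acidity;alcohol;quality",
             "7.4;9.4;5",
             "7.8;9.8;5"), red_path)
writeLines(c("fixed.acidity;alcohol;quality",
             "7.0;8.8;6",
             "6.3;9.5;6",
             "8.1;10.1;6"), white_path)

# sep = ";" reads semicolon-delimited files that still use "." as the decimal mark
red   <- read.csv(red_path, sep = ";")
white <- read.csv(white_path, sep = ";")

# Tag each data set with its colour, then stack the rows
red$colour   <- "red"
white$colour <- "white"
wine <- rbind(red, white)
wine$colour <- factor(wine$colour)

str(wine)
```

read.csv2() is avoided in this sketch because it assumes a comma as the decimal mark, which would misread values such as 9.4.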

The available columns are:

fixed.acidity
volatile.acidity
citric.acid
residual.sugar
chlorides
free.sulfur.dioxide
total.sulfur.dioxide
density
pH
sulphates
alcohol
quality
colour


Quality is scored on a scale of 0 (bad) to 10 (very good).

Model - random forest

The data set was partitioned into training and testing sets; a random forest model was trained on the former and tested against the latter. The overall statistics from the confusion matrix are:


Accuracy 0.6527094
Kappa 0.4575751
AccuracyLower 0.6289858
AccuracyUpper 0.6758796
AccuracyNull 0.4365764
AccuracyPValue 0.0000000
McnemarPValue NaN

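The accuracy is quite low: about 65%, compared with a no-information rate (AccuracyNull) of about 44%, with a Kappa of about 0.46.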
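
To make the statistics listed above more concrete, the sketch below computes accuracy, the no-information rate and Cohen's kappa from a small confusion matrix in base R. The observed and predicted classes are invented for illustration and are not the wine results. The binomial interval at the end is assumed to correspond to AccuracyLower and AccuracyUpper.

```r
# Made-up observed and predicted quality classes, for illustration only
levels_q  <- c("5", "6", "7")
observed  <- factor(c("5", "5", "5", "6", "6", "6", "6", "7", "7", "6"), levels = levels_q)
predicted <- factor(c("5", "5", "6", "6", "6", "5", "6", "7", "6", "6"), levels = levels_q)

# Rows are predictions, columns are observed classes
cm <- table(Predicted = predicted, Observed = observed)
n  <- sum(cm)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy from always guessing the most common observed class
null_rate <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the row and column totals would give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Exact binomial confidence interval for the accuracy (assumed analogue of the bounds above)
accuracy_ci <- binom.test(sum(diag(cm)), n)$conf.int

print(c(Accuracy = accuracy, Kappa = cohen_kappa, AccuracyNull = null_rate))
```

Kappa is lower than accuracy whenever some of the agreement could be explained by the class frequencies alone, which is why both figures are worth reading together.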

Model - linear regression

This model was created to:

  • Predict the quality of wine based on the other variables
  • Determine the three main predictors of quality for use in a smaller model (using varImp())

A smaller linear regression model was used because it was much faster than the random forest model. For an interactive data product, some model accuracy was traded for speed.

Variable Importance
alcohol 87.95623
volatile.acidity 82.55369
free.sulfur.dioxide 72.11978
sulphates 64.73579
pH 60.80405
total.sulfur.dioxide 53.44822
residual.sugar 52.63394
citric.acid 50.50613
fixed.acidity 49.06257
chlorides 48.69321
density 41.07213
colour 12.23574
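
A smaller linear regression model of the kind described above might be fitted as in the sketch below. It assumes that the three highest-ranked variables in the table (alcohol, volatile.acidity and free.sulfur.dioxide) are the predictors used. The data frame is synthetic, with invented coefficients, so that the example runs without the wine files; it says nothing about the real fit.

```r
set.seed(42)  # reproducible made-up data

# Synthetic stand-in for the wine data; ranges and coefficients are invented
n <- 200
wine <- data.frame(
  alcohol             = runif(n, 8, 14),
  volatile.acidity    = runif(n, 0.1, 1.2),
  free.sulfur.dioxide = runif(n, 1, 70)
)
wine$quality <- 3 + 0.3 * wine$alcohol - 1.5 * wine$volatile.acidity +
  0.005 * wine$free.sulfur.dioxide + rnorm(n, sd = 0.5)

# Three-predictor linear model (assumed choice of predictors)
fit <- lm(quality ~ alcohol + volatile.acidity + free.sulfur.dioxide, data = wine)

# Predict for a single wine, as an interactive application might for user-chosen inputs
new_wine <- data.frame(alcohol = 11, volatile.acidity = 0.4, free.sulfur.dioxide = 30)
pred <- predict(fit, newdata = new_wine)

# Round and clamp to the 0 to 10 quality scale for display
pred_clamped <- min(max(round(pred), 0), 10)

print(coef(fit))
print(pred)
```

Fitting with lm() and calling predict() on one row is cheap enough to rerun on every input change, which is the speed advantage the text refers to.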