This data product uses publicly available data sets for red and white Portuguese wine. The data were analysed and a random forest model was created to predict wine quality from the results of physicochemical tests.
A Shiny application was developed based on these results using a simpler (faster) linear regression model.
- Wine quality data is available from Kaggle
- More details are available at UCI Machine Learning Repository
- Citation is available here
Ideas for extending this data product are:
- Include data from other wine reviews to improve the accuracy
- Improve the model (perhaps only use random forest)
Application details:
- The application is available here.
- Git repo: https://github.com/greigar/ddp-wine-quality
Data Processing
- The data were loaded from two semi-colon delimited files (one file for red, one for white)
- The data sets were combined with rbind() and an extra column indicating colour was added
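The loading and combining steps above can be sketched in base R as follows. This is a minimal illustration, not the project's actual script: the inline text stands in for the two files, the rows are made-up sample values, and only three of the columns are shown. With real files, the `text =` argument would be replaced by a file path.

```r
# Tiny inline stand-ins for the two semicolon-delimited files.
# The rows are illustrative values only, not taken from the real data.
red_txt <- '"fixed acidity";"alcohol";"quality"
7.4;9.4;5
7.8;9.8;6'
white_txt <- '"fixed acidity";"alcohol";"quality"
7.0;8.8;6
6.3;9.5;6
8.1;10.1;7'

# sep = ";" handles the delimiter; read.csv() turns header names such as
# "fixed acidity" into syntactic names such as fixed.acidity.
red   <- read.csv(text = red_txt,   sep = ";")
white <- read.csv(text = white_txt, sep = ";")

# Tag each data set with its colour before stacking them.
red$colour   <- "red"
white$colour <- "white"

wine <- rbind(red, white)
wine$colour <- factor(wine$colour)
str(wine)
```

Note that `read.csv2()` would be the wrong choice for files like these, since it also assumes a decimal comma.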
The available columns are:
| fixed.acidity |
| volatile.acidity |
| citric.acid |
| residual.sugar |
| chlorides |
| free.sulfur.dioxide |
| total.sulfur.dioxide |
| density |
| pH |
| sulphates |
| alcohol |
| quality |
| colour |
Quality is scored on a scale of 0 (bad) to 10 (very good).
Model - random forest
The data set was partitioned into training and testing sets. A random forest model was trained on the training set and evaluated on the testing set. The overall results from the confusion matrix are:
| Name | Value |
|---|---|
| Accuracy | 0.6527094 |
| Kappa | 0.4575751 |
| AccuracyLower | 0.6289858 |
| AccuracyUpper | 0.6758796 |
| AccuracyNull | 0.4365764 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | NaN |
The accuracy is quite low: about 65%, compared with a no-information rate (AccuracyNull) of about 44%.
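To make the Accuracy, Kappa and AccuracyNull figures easier to interpret, the sketch below computes them from a confusion matrix in base R. The statistic names above match those reported by caret's `confusionMatrix()`, which is presumably what produced them; that is an assumption. The helper `overall_stats()` and the toy label vectors are hypothetical and unrelated to the real results.

```r
# Hypothetical helper: overall statistics from predicted and actual labels.
overall_stats <- function(predicted, actual) {
  levs <- sort(unique(c(predicted, actual)))
  # Rows are predictions, columns are the reference (actual) classes.
  cm <- table(factor(predicted, levels = levs),
              factor(actual,    levels = levs))
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  c(Accuracy     = po,
    Kappa        = (po - pe) / (1 - pe),
    AccuracyNull = max(colSums(cm)) / n)       # always predict the largest class
}

# Toy quality scores, for illustration only.
actual    <- c(5, 5, 5, 6, 6, 6, 6, 7, 7, 5)
predicted <- c(5, 5, 6, 6, 6, 6, 5, 7, 6, 5)

stats <- overall_stats(predicted, actual)
print(round(stats, 4))
```

Kappa rescales accuracy so that 0 corresponds to chance-level agreement and 1 to perfect agreement, which is why it is lower than the raw accuracy.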
Model - linear regression
The random forest model was created to:
- Predict the quality of wine from the other variables
- Determine the three main predictors of quality (using varImp()) for use in a smaller model
A smaller linear regression model was used in the application because it was much faster than the random forest model. For an interactive data product, some model accuracy was traded for speed.
| Variable | Importance |
|---|---|
| alcohol | 87.95623 |
| volatile.acidity | 82.55369 |
| free.sulfur.dioxide | 72.11978 |
| sulphates | 64.73579 |
| pH | 60.80405 |
| total.sulfur.dioxide | 53.44822 |
| residual.sugar | 52.63394 |
| citric.acid | 50.50613 |
| fixed.acidity | 49.06257 |
| chlorides | 48.69321 |
| density | 41.07213 |
| colour | 12.23574 |
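As a rough sketch of how the three highest-ranked variables in the table above could feed a small linear model, the base R code below picks the top three names and fits `lm()` on them. The importance values are copied from the table. The training data are synthetic and generated only so the example runs; the coefficients used to simulate quality are arbitrary assumptions, not estimates from the wine data.

```r
# Values copied from the table above.
importance <- c(
  alcohol = 87.95623, volatile.acidity = 82.55369,
  free.sulfur.dioxide = 72.11978, sulphates = 64.73579,
  pH = 60.80405, total.sulfur.dioxide = 53.44822,
  residual.sugar = 52.63394, citric.acid = 50.50613,
  fixed.acidity = 49.06257, chlorides = 48.69321,
  density = 41.07213, colour = 12.23574
)

# Keep the three highest-ranked variables and build the model formula.
top3 <- names(sort(importance, decreasing = TRUE))[1:3]
model_formula <- reformulate(top3, response = "quality")

# Synthetic stand-in for the wine data (assumption: arbitrary ranges and effects).
set.seed(42)
n <- 200
wine <- data.frame(
  alcohol             = runif(n, 8, 14),
  volatile.acidity    = runif(n, 0.1, 1.2),
  free.sulfur.dioxide = runif(n, 1, 70)
)
wine$quality <- 3 + 0.3 * wine$alcohol - 1.2 * wine$volatile.acidity +
  0.005 * wine$free.sulfur.dioxide + rnorm(n, sd = 0.5)

fit <- lm(model_formula, data = wine)

# Predict quality for one set of inputs, as an interactive app might.
new_wine <- data.frame(alcohol = 11, volatile.acidity = 0.4,
                       free.sulfur.dioxide = 30)
predicted_quality <- predict(fit, newdata = new_wine)
print(predicted_quality)
```

Fitting and predicting with a three-term linear model takes a negligible amount of time, which fits the speed-over-accuracy trade-off described for the interactive application.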