Wine Quality Data Analysis
About
This dataset is related to the red variant of the Portuguese “Vinho Verde” wine. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).
Input variables (based on physicochemical tests):
1. fixed acidity g(tartaric acid)/dm³
2. volatile acidity g(acetic acid)/dm³
3. citric acid g/dm³
4. residual sugar g/dm³
5. chlorides g(sodium chloride)/dm³
6. free sulfur dioxide mg/dm³
7. total sulfur dioxide mg/dm³
8. density g/cm³
9. pH
10. sulphates g(potassium sulphate)/dm³
11. alcohol % vol
Output variable (based on sensory data):
12. quality (score between 0 and 10)
Analysis question/approach
- Classify wine quality from physicochemical properties
- Build a model to predict wine quality (regression)
- Identify attributes differentiating low- vs high-quality wine

Candidate methods:

- MR (multiple regression)
- NN (neural network)
- SVM (support vector machine)
- Random forest classifier
- imbalanced-learn (sklearn-compatible resampling)

Notes: the quality classes are ordered, so that order must be preserved, and the class imbalance must be addressed.
Initialize package libraries
#install.packages("usethis")
#install.packages("rmdformats")
library(usethis)
library(magrittr)
library(highcharter)
library(explore)
library(dplyr)
library(DataExplorer)
library(skimr)
library(rmdformats)
library(readr)
library(DT)
library(ggplot2)
library(pastecs)
library(corrplot)
# Connect RStudio to Git client
# usethis::create_from_github(
# "https://github.com/dezzygc/WineData.git", destdir = "./wine_repo")# Import dataset
wine_df <- read_csv("red_wine_data.csv")
#as_tibble(wine_df)
datatable(wine_df)

Data diagnostics
- Check for NA values
- Check for highly correlated feature variables (see the corrplot sketch below)
- Assess class balance and distributions
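A minimal sketch of the correlation check using the already-loaded corrplot package (all 12 columns are numeric, so the data frame can be passed to cor() directly):

```r
# Flag highly correlated (collinear) feature pairs before modelling
corr_mat <- cor(wine_df)
corrplot(corr_mat, method = "number", type = "upper")
```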
# wine_df %>% explore()
# wine_df %>% report(output_file = "report.html", output_dir = getwd())# explore structure of dataframe, compute summary descriptives and examine quantiles/distribution
skim(wine_df)| Name | wine_df |
| Number of rows | 1599 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| fixedAcidity | 0 | 1 | 8.32 | 1.74 | 4.60 | 7.10 | 7.90 | 9.20 | 15.90 | ▂▇▂▁▁ |
| volatileAcidity | 0 | 1 | 0.53 | 0.18 | 0.12 | 0.39 | 0.52 | 0.64 | 1.58 | ▅▇▂▁▁ |
| citricAcid | 0 | 1 | 0.27 | 0.19 | 0.00 | 0.09 | 0.26 | 0.42 | 1.00 | ▇▆▅▁▁ |
| residualSugar | 0 | 1 | 2.54 | 1.41 | 0.90 | 1.90 | 2.20 | 2.60 | 15.50 | ▇▁▁▁▁ |
| chlorides | 0 | 1 | 0.09 | 0.05 | 0.01 | 0.07 | 0.08 | 0.09 | 0.61 | ▇▁▁▁▁ |
| freeSulfurDioxide | 0 | 1 | 15.87 | 10.46 | 1.00 | 7.00 | 14.00 | 21.00 | 72.00 | ▇▅▁▁▁ |
| totalSulfurDioxide | 0 | 1 | 46.47 | 32.90 | 6.00 | 22.00 | 38.00 | 62.00 | 289.00 | ▇▂▁▁▁ |
| density | 0 | 1 | 1.00 | 0.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▃▇▂▁ |
| pH | 0 | 1 | 3.31 | 0.15 | 2.74 | 3.21 | 3.31 | 3.40 | 4.01 | ▁▅▇▂▁ |
| sulphates | 0 | 1 | 0.66 | 0.17 | 0.33 | 0.55 | 0.62 | 0.73 | 2.00 | ▇▅▁▁▁ |
| alcohol | 0 | 1 | 10.42 | 1.07 | 8.40 | 9.50 | 10.20 | 11.10 | 14.90 | ▇▇▃▁▁ |
| quality | 0 | 1 | 5.64 | 0.81 | 3.00 | 5.00 | 6.00 | 6.00 | 8.00 | ▁▇▇▂▁ |
Check for Missing Values
No missing values were found for any variable; every column has a complete rate of 1.
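As a quick visual check, the already-loaded DataExplorer package can confirm this (a minimal sketch):

```r
# Plot the percentage of missing values per column (all should be 0%)
plot_missing(wine_df)
```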
ggplot(wine_df, aes(quality)) +
geom_histogram(color = "#000000", fill = "#0099F8", bins = 8) +
labs(
title = "Histogram of Wine Quality Rating",
caption = "Source: Wine dataset",
x = "Quality Rating",
y = "Count") +
theme_classic() +
theme(
plot.title = element_text(color = "#0099F8", size = 14, face = "bold"),
plot.subtitle = element_text(size = 10, face = "bold"),
plot.caption = element_text(face = "italic")
)

Calculate the bin width by dividing the specification tolerance or range (USL-LSL, or max-min value) by the number of bins and rounding up.
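For the quality rating this arithmetic gives a bin width of 1 (a quick check, assuming the 8 bins used in the histogram above):

```r
# Bin width = range / number of bins, rounded up: (8 - 3) / 8 rounds up to 1
bin_width <- ceiling((max(wine_df$quality) - min(wine_df$quality)) / 8)
```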
boxplot(wine_df$quality, horizontal=TRUE, col=rgb(0.8,0.8,0,0.5), main="Wine Quality")

# Tabulate quality rating frequencies
quality_df <- wine_df$quality
table_count <- table(quality_df)
print("Frequency table")
table_count
print("Cumulative frequency table")
cumsum(table_count)

## [1] "Frequency table"
## quality_df
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## [1] "Cumulative frequency table"
## 3 4 5 6 7 8
## 10 63 744 1382 1581 1599

ratings.vec <- as.vector(t(table_count))  # store rating frequencies
qual_rating <- rownames(table_count)      # store rating values

Check distribution of target and feature variables
| | fixedAcidity | volatileAcidity | citricAcid | residualSugar | chlorides | freeSulfurDioxide | totalSulfurDioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| skewness | 0.981 | 0.670 | 0.318 | 4.532 | 5.670 | 1.248 | 1.513 | 0.071 | 0.193 | 2.424 | 0.859 | 0.217 |
| skew.2SE | 8.014 | 5.477 | 2.596 | 37.028 | 46.322 | 10.198 | 12.359 | 0.581 | 1.579 | 19.805 | 7.020 | 1.776 |
| kurtosis | 1.120 | 1.213 | -0.793 | 28.485 | 41.526 | 2.007 | 3.786 | 0.923 | 0.796 | 11.662 | 0.192 | 0.288 |
| kurt.2SE | 4.577 | 4.957 | -3.242 | 116.435 | 169.740 | 8.205 | 15.474 | 3.771 | 3.253 | 47.667 | 0.783 | 1.177 |
| normtest.W | 0.942 | 0.974 | 0.955 | 0.566 | 0.484 | 0.902 | 0.873 | 0.991 | 0.993 | 0.833 | 0.929 | 0.858 |
| normtest.p | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
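The table above can be reproduced along these lines (a sketch using the already-loaded pastecs package; exact rounding assumed):

```r
# Skewness/kurtosis (with their ratio to 2*SE) and Shapiro-Wilk normality stats
norm_stats <- round(stat.desc(wine_df, basic = FALSE, norm = TRUE), 3)
norm_stats[c("skewness", "skew.2SE", "kurtosis", "kurt.2SE", "normtest.W", "normtest.p"), ]
```

Absolute values of skew.2SE or kurt.2SE above 1 indicate significant skew or kurtosis at p < .05; residualSugar, chlorides and sulphates are the most extreme here.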
Data preparation
Binning Wine Quality Rating
Recoding quality to reduce the number of bins (equal widths but unequal frequencies): ratings 3-4 = low quality (1), 5-6 = medium quality (2), 7-8 = high quality (3).
wine_df$quality2 <- recode(wine_df$quality, `3` = 1, `4` = 1, `5` = 2, `6` = 2, `7`= 3, `8`= 3)
wine_df$quality_ordered <- factor(wine_df$quality2, ordered = TRUE,
levels = c("1", "2", "3"))
table(wine_df$quality2)

##
## 1 2 3
## 63 1319 217
Addressing class imbalance
- Undersampling
- Oversampling
- Algorithmic approaches
Imbalanced data can lead to overfitting of the majority class and underfitting of the minority classes, producing misleadingly high accuracy (the accuracy paradox). The classification analysis that follows highlights the impact of class imbalance on accuracy estimates; a minimal resampling sketch is shown below.
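As an illustration of the resampling options above, a minimal sketch of random oversampling in base R (illustrative only; resampling for the fitted models is handled in the Python notebook):

```r
# Randomly oversample each minority class up to the majority-class size
set.seed(42)
n_max <- max(table(wine_df$quality2))
balanced_df <- do.call(rbind, lapply(
  split(wine_df, wine_df$quality2),
  function(g) g[sample(nrow(g), n_max, replace = TRUE), ]
))
table(balanced_df$quality2)  # all three classes now have 1319 rows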
Fitting a Classifier
Export csv for modelling in Jupyter
write.csv(wine_df, "./ready_wine.csv", row.names=FALSE)

Classifier selection in Python with scikit-learn

Please see the Jupyter notebook (‘wine.ipynb’) for the classifier selection analyses (ensemble and boosting algorithms), executable through Google Colaboratory. The notebook is accessible through the GitHub repo (https://github.com/dezzygc/WineData.git).
The following classifiers were examined:
1. Decision tree classifier
2. Random forest classifier
3. Gradient boosting
4. XGBoost
5. Support vector machine (SVM) - weighted and unweighted
Step 1 - Normalize feature variables
Step 2 - Split data into test and training sets
Step 3 - Fit classifier and report classification metrics
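A hedged R sketch of the same three steps (the actual fits were done with scikit-learn in wine.ipynb; the randomForest package is an assumed stand-in here):

```r
library(randomForest)
set.seed(123)

feats <- scale(wine_df[, 1:11])                            # Step 1: normalize the 11 features
idx   <- sample(nrow(wine_df), floor(0.8 * nrow(wine_df))) # Step 2: 80/20 train/test split
x_tr  <- feats[idx, ];  y_tr <- factor(wine_df$quality2[idx])
x_te  <- feats[-idx, ]; y_te <- factor(wine_df$quality2[-idx])

fit  <- randomForest(x_tr, y_tr)                           # Step 3: fit the classifier
pred <- predict(fit, x_te)
cm   <- table(predicted = pred, actual = y_te)             # confusion matrix for the metrics below
```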
Metrics include precision, recall, f1-score, support and accuracy. The metrics focused upon here are defined below.
Precision: the ability of the classifier to avoid labeling a negative sample as positive (i.e. a false positive). This is calculated as true positives / (true positives + false positives).

Recall: the ability of the classifier to identify all positive samples in each class. This is calculated as true positives / (true positives + false negatives).

F1-score: the harmonic mean of precision and recall. It is a good approximation of accuracy for imbalanced datasets (consider the precision-recall trade-off) because it accounts for both type I and type II errors. It gives equal weight to recall (identifying all truly positive cases) and precision (labeling only truly positive cases as positive). If precision and recall are both high, the F1-score will also be high (close to 1.0); if one is low and the other high, the F1-score will be middling.

Accuracy and averaging: two averaging options are reported for the per-class metrics. The macro average is the unweighted mean of a metric across classes, while the weighted average weights each class's metric by its support (the number of true instances in that class), making it better suited to imbalanced multi-class problems.
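As a concrete illustration of these definitions, a hypothetical helper computing them from a confusion matrix such as `cm` above (rows = predicted, columns = actual):

```r
# Per-class precision/recall/F1 plus accuracy, macro and support-weighted averages
class_metrics <- function(cm) {
  tp <- diag(cm)
  precision <- tp / rowSums(cm)                  # TP / (TP + FP)
  recall    <- tp / colSums(cm)                  # TP / (TP + FN)
  f1        <- 2 * precision * recall / (precision + recall)
  support   <- colSums(cm)
  list(per_class   = data.frame(precision, recall, f1, support),
       accuracy    = sum(tp) / sum(cm),
       macro_f1    = mean(f1),                   # unweighted mean across classes
       weighted_f1 = weighted.mean(f1, support)) # weighted by class support
}
class_metrics(cm)
```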
Classifier Summary
Please see Jupyter notebook.
Model 1 (decision tree) feature importance
Decision Tree feature importance.
wine_df %>% explain_tree(target=quality)Random Forest feature importance
Random forest feature importance.
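The reported importances come from the Python notebook; an equivalent R sketch, reusing the `fit` random forest from the earlier split/fit sketch:

```r
# Rank features by mean decrease in Gini impurity
imp <- importance(fit)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
varImpPlot(fit, main = "Random forest feature importance")
```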
XGB feature importance
XGB Classifier feature importance.
Gradient boosting classifier feature importance
Gradient boosting feature importance.