Wine Quality Data Analysis

About

This dataset relates to the red variant of the Portuguese “Vinho Verde” wine. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

Input variables (based on physicochemical tests):
1. fixed acidity g(tartaric acid)/dm³
2. volatile acidity g(acetic acid)/dm³
3. citric acid g/dm³
4. residual sugar g/dm³
5. chlorides g(sodium chloride)/dm³
6. free sulfur dioxide mg/dm³
7. total sulfur dioxide mg/dm³
8. density g/cm³
9. pH
10. sulphates g(potassium sulphate)/dm³
11. alcohol % vol

Output variable (based on sensory data):
12. quality (score between 0 and 10)

Analysis question/approach
  1. Classify wine quality from physicochemical properties
  2. Build a model to predict wine quality (regression)
  3. Identify attributes differentiating low- vs high-quality wines

Candidate methods:

  1. MR (multiple regression)
  2. NN (neural network)
  3. SVM (support vector machine)
  4. Random Forest classifier
  5. imbalanced-learn (scikit-learn)

The ordinal nature of the quality ratings needs to be preserved, and the class imbalance addressed.

Initialize package libraries

#install.packages("usethis")
#install.packages("rmdformats")
library(usethis)
library(magrittr)
library(highcharter)
library(explore)
library(dplyr)
library(DataExplorer)
library(skimr)
library(rmdformats)
library(readr)
library(DT)
library(ggplot2)
library(pastecs)
library(corrplot)

# Connect RStudio to Git client
# usethis::create_from_github(
#  "https://github.com/dezzygc/WineData.git", destdir = "./wine_repo")
# Import dataset
wine_df <- read_csv("red_wine_data.csv")
#as_tibble(wine_df)
datatable(wine_df)

Data diagnostics

  1. Check NA values
  2. Check highly correlated feature variables
  3. Assess balance and distribution
# wine_df %>% explore()
# wine_df %>% report(output_file = "report.html", output_dir = getwd())
# explore structure of dataframe, compute summary descriptives and examine quantiles/distribution
skim(wine_df)
Data summary
Name wine_df
Number of rows 1599
Number of columns 12
_______________________
Column type frequency:
numeric 12
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
fixedAcidity 0 1 8.32 1.74 4.60 7.10 7.90 9.20 15.90 ▂▇▂▁▁
volatileAcidity 0 1 0.53 0.18 0.12 0.39 0.52 0.64 1.58 ▅▇▂▁▁
citricAcid 0 1 0.27 0.19 0.00 0.09 0.26 0.42 1.00 ▇▆▅▁▁
residualSugar 0 1 2.54 1.41 0.90 1.90 2.20 2.60 15.50 ▇▁▁▁▁
chlorides 0 1 0.09 0.05 0.01 0.07 0.08 0.09 0.61 ▇▁▁▁▁
freeSulfurDioxide 0 1 15.87 10.46 1.00 7.00 14.00 21.00 72.00 ▇▅▁▁▁
totalSulfurDioxide 0 1 46.47 32.90 6.00 22.00 38.00 62.00 289.00 ▇▂▁▁▁
density 0 1 1.00 0.00 0.99 1.00 1.00 1.00 1.00 ▁▃▇▂▁
pH 0 1 3.31 0.15 2.74 3.21 3.31 3.40 4.01 ▁▅▇▂▁
sulphates 0 1 0.66 0.17 0.33 0.55 0.62 0.73 2.00 ▇▅▁▁▁
alcohol 0 1 10.42 1.07 8.40 9.50 10.20 11.10 14.90 ▇▇▃▁▁
quality 0 1 5.64 0.81 3.00 5.00 6.00 6.00 8.00 ▁▇▇▂▁

Check for Missing Values

There are no missing values for any variable (n_missing = 0 across all columns).
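
As a quick verification (one possible approach, using base R plus the DataExplorer package loaded above):

colSums(is.na(wine_df))   # per-column count of missing values; all zeros here
plot_missing(wine_df)     # DataExplorer plot of the missing-value percentage per column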

ggplot(wine_df, aes(quality)) +
  geom_histogram(color = "#000000", fill = "#0099F8", bins = 8) +
  labs(
    title = "Histogram of Wine Quality Rating",
    caption = "Source: Wine dataset",
    x = "Quality Rating",
    y = "Count") +
  theme_classic() +
  theme(
    plot.title = element_text(color = "#0099F8", size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic")
  )

Calculate the bin width by dividing the specification tolerance or range (USL - LSL, or the max - min value) by the number of bins, then rounding up.
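
A minimal sketch of that calculation for the quality rating (the variable names below are illustrative):

n_bins <- 8                                                    # number of bins used in the histogram above
quality_range <- max(wine_df$quality) - min(wine_df$quality)   # max - min = 8 - 3 = 5
bin_width <- ceiling(quality_range / n_bins)                   # round up; gives a width of 1 rating unit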

boxplot(wine_df$quality, horizontal=TRUE, col=rgb(0.8,0.8,0,0.5), main="Wine Quality")

## [1] "Frequency table"
## quality_df
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
## [1] "Cumulative frequency table"
##    3    4    5    6    7    8 
##   10   63  744 1382 1581 1599
table_count <- table(wine_df$quality)       # (assumed) recreate the frequency table printed above
ratings.vec <- as.vector(t(table_count))    # store rating frequencies
qual_rating <- rownames(table_count)        # store rating values

Check distribution of target and feature variables

fixedAcidity volatileAcidity citricAcid residualSugar chlorides freeSulfurDioxide totalSulfurDioxide density pH sulphates alcohol quality
skewness 0.981 0.670 0.318 4.532 5.670 1.248 1.513 0.071 0.193 2.424 0.859 0.217
skew.2SE 8.014 5.477 2.596 37.028 46.322 10.198 12.359 0.581 1.579 19.805 7.020 1.776
kurtosis 1.120 1.213 -0.793 28.485 41.526 2.007 3.786 0.923 0.796 11.662 0.192 0.288
kurt.2SE 4.577 4.957 -3.242 116.435 169.740 8.205 15.474 3.771 3.253 47.667 0.783 1.177
normtest.W 0.942 0.974 0.955 0.566 0.484 0.902 0.873 0.991 0.993 0.833 0.929 0.858
normtest.p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
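
The table above can be reproduced with pastecs::stat.desc() (loaded earlier); a hedged sketch, assuming the descriptives were computed along these lines:

desc_stats <- stat.desc(wine_df, norm = TRUE)   # includes skewness, kurtosis and Shapiro-Wilk statistics
round(desc_stats[c("skewness", "skew.2SE", "kurtosis", "kurt.2SE", "normtest.W", "normtest.p"), ], 3)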

Check highly correlated features

Caption: Correlogram of the feature and target variables. Fixed acidity is highly correlated with several other feature variables and was removed prior to classifier selection.
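
The correlogram itself is not reproduced here; a minimal sketch of how it can be generated with corrplot (loaded above), assuming Pearson correlations over the numeric columns:

corr_mat <- cor(wine_df)                           # all columns are numeric and complete
corrplot(corr_mat, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45,            # variable-label colour and rotation
         addCoef.col = "black", number.cex = 0.6)  # print the coefficient in each cell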

Data preparation

Binning Wine Quality Rating

Recoding quality to reduce the number of bins (equal widths but unequal frequencies): ratings 3-4 = low quality (1), 5-6 = medium quality (2), 7-8 = high quality (3).

wine_df$quality2 <- recode(wine_df$quality, `3` = 1, `4` = 1, `5` = 2, `6` = 2, `7`= 3, `8`= 3)

wine_df$quality_ordered <- factor(wine_df$quality2, ordered = TRUE, 
                                levels = c("1", "2", "3"))
table(wine_df$quality2)
## 
##    1    2    3 
##   63 1319  217

Addressing class imbalance

  1. Undersampling
  2. Oversampling
  3. Algorithmic approaches

Imbalanced data can lead to overfitting and underfitting issues (the accuracy paradox). The classification analysis that follows highlights the impact of class imbalance on accuracy estimates.
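
As one illustration of the resampling options listed above, a minimal sketch of random oversampling with dplyr (loaded earlier); the resampling used for modelling is done in the Python notebook, so this is indicative only:

set.seed(42)
majority_n <- max(table(wine_df$quality2))           # size of the largest class
balanced_df <- wine_df %>%
  group_by(quality2) %>%
  slice_sample(n = majority_n, replace = TRUE) %>%   # duplicate minority-class rows at random
  ungroup()
table(balanced_df$quality2)                          # every class now has the same count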

Fitting a Classifier

Export csv for modelling in Jupyter

write.csv(wine_df, "./ready_wine.csv", row.names=FALSE)

Classifier selection in Python with scikit-learn.

Please see the Jupyter notebook (‘wine.ipynb’) for the classifier selection analyses (ensemble and boosting algorithms), executable through Google Colaboratory. The notebook is accessible through the GitHub repo (https://github.com/dezzygc/WineData.git).

The following classifiers were examined:
1. Decision Tree classifier
2. Random Forest classifier
3. Gradient Boosting
4. XGBoost
5. Support Vector Machine (SVM) - weighted and unweighted

Step 1 - Normalize feature variables

Step 2 - Split data into test and training sets

Step 3 - Fit classifier and report classification metrics
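
For illustration, a minimal R sketch of steps 1 and 2 (the actual modelling is performed in the Python notebook; the 80/20 split proportion here is an assumption):

set.seed(123)
features <- setdiff(names(wine_df), c("quality", "quality2", "quality_ordered"))
model_df <- wine_df
model_df[features] <- scale(model_df[features])      # step 1: z-score normalise the feature columns

train_idx <- sample(seq_len(nrow(model_df)), size = floor(0.8 * nrow(model_df)))
train_set <- model_df[train_idx, ]                   # step 2: 80% training set
test_set  <- model_df[-train_idx, ]                  # remaining 20% test set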

Metrics include precision, recall, f1-score, support and accuracy. The metrics focused upon here are defined below.

Precision: the ability of the classifier to avoid labeling a negative sample as positive (i.e., a false positive). This is calculated as true positives / (true positives + false positives).

Recall: the ability of the classifier to identify all positive samples in each class. This is calculated as true positives / (true positives + false negatives).

F1-Score: the harmonic mean of precision and recall. It is a good approximation of overall performance for imbalanced datasets (consider the precision-recall trade-off) because it accounts for both type I and type II errors. It gives equal weight to recall and precision, i.e. identifying all positive cases (recall) versus identifying only truly positive cases (precision). If precision and recall are both high, the F1 score will also be high (close to 1.0); if one is low and the other high, the model obtains only a middling F1 score.

Accuracy and averages: the classification report gives overall accuracy plus two averaging options for the per-class metrics. The macro average is the unweighted mean of the per-class scores, treating all classes equally, while the weighted average weights each class's score by its support (the number of true samples in that class), making it more informative for imbalanced multi-class problems.
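
To make these definitions concrete, a small hand-worked example (the counts are illustrative only, not taken from the notebook):

tp <- 40; fp <- 10; fn <- 20                                 # toy confusion-matrix counts for one class
precision <- tp / (tp + fp)                                  # 40 / 50 = 0.80
recall    <- tp / (tp + fn)                                  # 40 / 60 ≈ 0.67
f1        <- 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.73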

Classifier Summary

Please see Jupyter notebook.

Model 1 (decision tree) feature importance

Decision Tree feature importance.

wine_df %>% explain_tree(target=quality)

Random Forest feature importance

Random Forest feature importance.

XGB feature importance

XGB Classifier feature importance.

Gradient boost classifier feature importance

Gradient Boosting feature importance.