Wine Quality Data Analysis

About

This dataset relates to the red variant of the Portuguese “Vinho Verde” wine. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

Input variables (based on physicochemical tests):
1. fixed acidity g(tartaric acid)/dm³
2. volatile acidity g(acetic acid)/dm³
3. citric acid g/dm³
4. residual sugar g/dm³
5. chlorides g(sodium chloride)/dm³
6. free sulfur dioxide mg/dm³
7. total sulfur dioxide mg/dm³
8. density g/cm³
9. pH
10. sulphates g(potassium sulphate)/dm³
11. alcohol % vol

Output variable (based on sensory data):
12. quality (score between 0 and 10)

Analysis question/approach
  1. Classify wine quality from physicochemical properties
  2. Build a model to predict wine quality (regression)
  3. Identify attributes differentiating low- vs high-quality wines

Candidate methods:

  1. MR (multiple regression)
  2. NN (neural network)
  3. SVM (support vector machine)
  4. Random Forest classifier
  5. imbalanced-learn (scikit-learn)

The ordinal nature of the quality ratings needs to be preserved, and the class imbalance addressed.

Initialize package libraries

#install.packages("usethis")
#install.packages("rmdformats")
library(usethis)
library(magrittr)
library(highcharter)
library(explore)
library(dplyr)
library(DataExplorer)
library(skimr)
library(rmdformats)
library(readr)
library(DT)
library(ggplot2)
library(pastecs)
library(corrplot)

# Connect RStudio to Git client
# usethis::create_from_github(
#  "https://github.com/dezzygc/WineData.git", destdir = "./wine_repo")
# Import dataset
wine_df <- read_csv("red_wine_data.csv")
#as_tibble(wine_df)
datatable(wine_df)

Data diagnostics

  1. Check NA values
  2. Check highly correlated feature variables
  3. Assess balance and distribution
# wine_df %>% explore()
# wine_df %>% report(output_file = "report.html", output_dir = getwd())
# explore structure of dataframe, compute summary descriptives and examine quantiles/distribution
skim(wine_df)
Data summary
Name wine_df
Number of rows 1599
Number of columns 12
_______________________
Column type frequency:
numeric 12
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
fixedAcidity 0 1 8.32 1.74 4.60 7.10 7.90 9.20 15.90 ▂▇▂▁▁
volatileAcidity 0 1 0.53 0.18 0.12 0.39 0.52 0.64 1.58 ▅▇▂▁▁
citricAcid 0 1 0.27 0.19 0.00 0.09 0.26 0.42 1.00 ▇▆▅▁▁
residualSugar 0 1 2.54 1.41 0.90 1.90 2.20 2.60 15.50 ▇▁▁▁▁
chlorides 0 1 0.09 0.05 0.01 0.07 0.08 0.09 0.61 ▇▁▁▁▁
freeSulfurDioxide 0 1 15.87 10.46 1.00 7.00 14.00 21.00 72.00 ▇▅▁▁▁
totalSulfurDioxide 0 1 46.47 32.90 6.00 22.00 38.00 62.00 289.00 ▇▂▁▁▁
density 0 1 1.00 0.00 0.99 1.00 1.00 1.00 1.00 ▁▃▇▂▁
pH 0 1 3.31 0.15 2.74 3.21 3.31 3.40 4.01 ▁▅▇▂▁
sulphates 0 1 0.66 0.17 0.33 0.55 0.62 0.73 2.00 ▇▅▁▁▁
alcohol 0 1 10.42 1.07 8.40 9.50 10.20 11.10 14.90 ▇▇▃▁▁
quality 0 1 5.64 0.81 3.00 5.00 6.00 6.00 8.00 ▁▇▇▂▁

Check for Missing Values

There are no missing values for any variable (n_missing = 0 across all columns).
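
As a quick verification (one possible approach, using base R plus the DataExplorer package loaded above):

colSums(is.na(wine_df))   # per-column count of missing values; all zeros here
plot_missing(wine_df)     # DataExplorer plot of the missing-value percentage per column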

ggplot(wine_df, aes(quality)) +
  geom_histogram(color = "#000000", fill = "#0099F8", bins = 8) +
  labs(
    title = "Histogram of Wine Quality Rating",
    caption = "Source: Wine dataset",
    x = "Quality Rating",
    y = "Count") +
  theme_classic() +
  theme(
    plot.title = element_text(color = "#0099F8", size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic")
  )

Calculate the bin width by dividing the specification tolerance or range (USL - LSL, or the max - min value) by the number of bins, then rounding up.
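
A minimal sketch of that calculation for the quality rating (the variable names below are illustrative):

n_bins <- 8                                                    # number of bins used in the histogram above
quality_range <- max(wine_df$quality) - min(wine_df$quality)   # max - min = 8 - 3 = 5
bin_width <- ceiling(quality_range / n_bins)                   # round up; gives a width of 1 rating unit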

boxplot(wine_df$quality, horizontal=TRUE, col=rgb(0.8,0.8,0,0.5), main="Wine Quality")

## [1] "Frequency table"
## quality_df
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
## [1] "Cumulative frequency table"
##    3    4    5    6    7    8 
##   10   63  744 1382 1581 1599
table_count <- table(wine_df$quality)       # (assumed) recreate the frequency table printed above
ratings.vec <- as.vector(t(table_count))    # store rating frequencies
qual_rating <- rownames(table_count)        # store rating values

Check distribution of target and feature variables

fixedAcidity volatileAcidity citricAcid residualSugar chlorides freeSulfurDioxide totalSulfurDioxide density pH sulphates alcohol quality
skewness 0.981 0.670 0.318 4.532 5.670 1.248 1.513 0.071 0.193 2.424 0.859 0.217
skew.2SE 8.014 5.477 2.596 37.028 46.322 10.198 12.359 0.581 1.579 19.805 7.020 1.776
kurtosis 1.120 1.213 -0.793 28.485 41.526 2.007 3.786 0.923 0.796 11.662 0.192 0.288
kurt.2SE 4.577 4.957 -3.242 116.435 169.740 8.205 15.474 3.771 3.253 47.667 0.783 1.177
normtest.W 0.942 0.974 0.955 0.566 0.484 0.902 0.873 0.991 0.993 0.833 0.929 0.858
normtest.p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
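
The table above can be reproduced with pastecs::stat.desc() (loaded earlier); a hedged sketch, assuming the descriptives were computed along these lines:

desc_stats <- stat.desc(wine_df, norm = TRUE)   # includes skewness, kurtosis and Shapiro-Wilk statistics
round(desc_stats[c("skewness", "skew.2SE", "kurtosis", "kurt.2SE", "normtest.W", "normtest.p"), ], 3)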

Check highly correlated features

Caption: Correlogram of the feature and target variables. Fixed acidity is highly correlated with several other feature variables and was removed prior to classifier selection.
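
The correlogram itself is not reproduced here; a minimal sketch of how it can be generated with corrplot (loaded above), assuming Pearson correlations over the numeric columns:

corr_mat <- cor(wine_df)                           # all columns are numeric and complete
corrplot(corr_mat, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45,            # variable-label colour and rotation
         addCoef.col = "black", number.cex = 0.6)  # print the coefficient in each cell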

Data preparation

Binning Wine Quality Rating

Recoding quality to reduce the number of bins (equal widths but unequal frequencies): ratings 3-4 = low quality (1), 5-6 = medium quality (2), 7-8 = high quality (3).

wine_df$quality2 <- recode(wine_df$quality, `3` = 1, `4` = 1, `5` = 2, `6` = 2, `7`= 3, `8`= 3)

wine_df$quality_ordered <- factor(wine_df$quality2, ordered = TRUE, 
                                levels = c("1", "2", "3"))
table(wine_df$quality2)
## 
##    1    2    3 
##   63 1319  217

Addressing class imbalance

  1. Undersampling
  2. Oversampling
  3. Algorithmic approaches

Imbalanced data can lead to overfitting and underfitting issues (the accuracy paradox). The classification analysis that follows highlights the impact of class imbalance on accuracy estimates.
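
As one illustration of the resampling options listed above, a minimal sketch of random oversampling with dplyr (loaded earlier); the resampling used for modelling is done in the Python notebook, so this is indicative only:

set.seed(42)
majority_n <- max(table(wine_df$quality2))           # size of the largest class
balanced_df <- wine_df %>%
  group_by(quality2) %>%
  slice_sample(n = majority_n, replace = TRUE) %>%   # duplicate minority-class rows at random
  ungroup()
table(balanced_df$quality2)                          # every class now has the same count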

Fitting a Classifier

Export csv for modelling in Jupyter

write.csv(wine_df, "./ready_wine.csv", row.names=FALSE)

Classifier selection in Python with scikit-learn.

Please see the Jupyter notebook (‘wine.ipynb’) for the classifier selection analyses (ensemble and boosting algorithms), executable through Google Colaboratory. The notebook is accessible through the GitHub repo (https://github.com/dezzygc/WineData.git).

The following classifiers were examined:
1. Decision Tree classifier
2. Random Forest classifier
3. Gradient Boosting
4. XGBoost
5. Support Vector Machine (SVM) - weighted and unweighted

Step 1 - Normalize feature variables

Step 2 - Split data into test and training sets

Step 3 - Fit classifier and report classification metrics
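
For illustration, a minimal R sketch of steps 1 and 2 (the actual modelling is performed in the Python notebook; the 80/20 split proportion here is an assumption):

set.seed(123)
features <- setdiff(names(wine_df), c("quality", "quality2", "quality_ordered"))
model_df <- wine_df
model_df[features] <- scale(model_df[features])      # step 1: z-score normalise the feature columns

train_idx <- sample(seq_len(nrow(model_df)), size = floor(0.8 * nrow(model_df)))
train_set <- model_df[train_idx, ]                   # step 2: 80% training set
test_set  <- model_df[-train_idx, ]                  # remaining 20% test set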

Metrics include precision, recall, f1-score, support and accuracy. The metrics focused upon here are defined below.

Precision: the ability of the classifier to avoid labeling a negative sample as positive (i.e., a false positive). This is calculated as true positives / (true positives + false positives).

Recall: the ability of the classifier to identify all positive samples in each class. This is calculated as true positives / (true positives + false negatives).

F1-Score: the harmonic mean of precision and recall. It is a good approximation of overall performance for imbalanced datasets (consider the precision-recall trade-off) because it accounts for both type I and type II errors. It gives equal weight to recall and precision, i.e. identifying all positive cases (recall) versus identifying only truly positive cases (precision). If precision and recall are both high, the F1 score will also be high (close to 1.0); if one is low and the other high, the model obtains only a middling F1 score.

Accuracy and averages: the classification report gives overall accuracy plus two averaging options for the per-class metrics. The macro average is the unweighted mean of the per-class scores, treating all classes equally, while the weighted average weights each class's score by its support (the number of true samples in that class), making it more informative for imbalanced multi-class problems.
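
To make these definitions concrete, a small hand-worked example (the counts are illustrative only, not taken from the notebook):

tp <- 40; fp <- 10; fn <- 20                                 # toy confusion-matrix counts for one class
precision <- tp / (tp + fp)                                  # 40 / 50 = 0.80
recall    <- tp / (tp + fn)                                  # 40 / 60 ≈ 0.67
f1        <- 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.73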

Classifier Summary

Please see Jupyter notebook.

Model 1 (decision tree) feature importance

Decision Tree feature importance.

wine_df %>% explain_tree(target=quality)

Random Forest feature importance

Random Forest feature importance.

XGB feature importance

XGB Classifier feature importance.

Gradient boost classifier feature importance

Gradient Boosting feature importance.