Importing Libraries

library(dplyr)
library(tidyverse)
library(vtable)
library(psych)
library(reshape)
library(ggplot2)
library(plotly)
library(tidyr)
library(PerformanceAnalytics)
library(corrplot)
library(jtools)
library(gtsummary)
library(pander)
library(kableExtra)

Data Structure

# Load the wine dataset and produce an overall summary table
df = read.csv("datasets/wine.csv", header = TRUE)
attach(df)
tbl_summary(df)
Characteristic N = 6,497¹
id 3,249 (1,625, 4,873)
fixed.acidity 7.00 (6.40, 7.70)
volatile.acidity 0.29 (0.23, 0.40)
citric.acid 0.31 (0.25, 0.39)
residual.sugar 3.0 (1.8, 8.1)
chlorides 0.047 (0.038, 0.065)
free.sulfur.dioxide 29 (17, 41)
total.sulfur.dioxide 118 (77, 156)
density 0.9949 (0.9923, 0.9970)
pH 3.21 (3.11, 3.32)
sulphates 0.51 (0.43, 0.60)
alcohol 10.30 (9.50, 11.30)
quality
3 30 (0.5%)
4 216 (3.3%)
5 2,138 (33%)
6 2,836 (44%)
7 1,079 (17%)
8 193 (3.0%)
9 5 (<0.1%)
wine.type
Red 1,599 (25%)
White 4,898 (75%)

¹ Median (IQR); n (%)

As the summary table above shows, the dataset contains 14 columns; 11 of them are chemical properties of the wines, which will be used as the variables to determine the wine rating. The dataset contains 1,599 red wines and 4,898 white wines. The minimum quality rating is 3 and the maximum is 9, with only 5 wines receiving a rating of 9.
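
A couple of quick dplyr checks reproduce the counts quoted above (a minimal sketch using the already-loaded dplyr verbs and the df object created earlier):

# Wines per type and per quality rating, matching the summary table
df %>% count(wine.type)
df %>% count(quality)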

Exploratory Data Analysis

Let’s split the data into a white wine dataset and a red wine dataset and explore them separately.

dfWhite = filter(df, wine.type == "White")
dfRed = filter(df, wine.type == "Red")

White Wine Summary Statistics

# Coefficient of variation: standard deviation relative to the mean
cv <- function(variable) { sd(variable) / mean(variable) }
sumtable(dfWhite[, !(names(dfWhite) %in% c("wine.type", "id"))],
         summ=c(
                'mean(x)',
                'median(x)',
                'sd(x)',
                'cv(x)',
                'propNA(x)'), 
         title = "White Wines Summary Statistics",
         factor.counts=FALSE)
White Wines Summary Statistics
Variable Mean Median Sd Cv PropNA
fixed.acidity 6.9 6.8 0.844 0.123 0
volatile.acidity 0.3 0.26 0.101 0.362 0
citric.acid 0.3 0.32 0.121 0.362 0
residual.sugar 6.4 5.2 5.072 0.794 0
chlorides 0 0.043 0.022 0.477 0
free.sulfur.dioxide 35.3 34 17.007 0.482 0
total.sulfur.dioxide 138.4 134 42.498 0.307 0
density 1 0.994 0.003 0.003 0
pH 3.2 3.18 0.151 0.047 0
sulphates 0.5 0.47 0.114 0.233 0
alcohol 10.5 10.4 1.231 0.117 0
quality 5.9 6 0.886 0.151 0
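
The Cv column above is the coefficient of variation (standard deviation divided by the mean) produced by the cv() helper defined before the sumtable() call. A quick check against one row of the table:

# Coefficient of variation for residual.sugar in white wines (should match the table, ~0.79)
cv(dfWhite$residual.sugar)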

Red Wine Summary Statistics

sumtable(dfRed[, !(names(dfRed) %in% c("wine.type", "id"))],
         summ=c(
                'mean(x)',
                'median(x)',
                'sd(x)',
                'cv(x)',
                'propNA(x)'), 
         title = "Red Wines Summary Statistics",
         factor.counts=FALSE)
Red Wines Summary Statistics
Variable Mean Median Sd Cv PropNA
fixed.acidity 8.3 7.9 1.741 0.209 0
volatile.acidity 0.5 0.52 0.179 0.339 0
citric.acid 0.3 0.26 0.195 0.719 0
residual.sugar 2.5 2.2 1.41 0.555 0
chlorides 0.1 0.079 0.047 0.538 0
free.sulfur.dioxide 15.9 14 10.46 0.659 0
total.sulfur.dioxide 46.5 38 32.895 0.708 0
density 1 0.997 0.002 0.002 0
pH 3.3 3.31 0.154 0.047 0
sulphates 0.7 0.62 0.17 0.258 0
alcohol 10.4 10.2 1.066 0.102 0
quality 5.6 6 0.808 0.143 0

Data Distribution Plots

White Wine

melted <- melt(dfWhite[, !(names(dfWhite) %in% c("id"))])
p = ggplot(melted, aes(value)) + 
  geom_histogram() + 
  facet_wrap(~variable, scales = "free", shrink = FALSE) + 
  theme(panel.spacing = unit(4, "lines")) +
  xlab("") +
  ylab("") 
ggplotly(p)

The histograms above show the distribution of the white wine data. Almost all variables reflect some type of normal distribution, with a few exceptions: alcohol is skewed to the right, with most values concentrated at the lower end of the range, and residual sugar shows a similar shape, with values concentrated towards the left and a clear peak around 2.4.
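
To put a number on the skewness mentioned above, the skew() helper from the already-loaded psych package can be applied to the two variables discussed (a small sketch; positive values indicate a right-skewed distribution):

# Skewness of alcohol and residual sugar for white wines
sapply(dfWhite[, c("alcohol", "residual.sugar")], psych::skew)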

Red Wine

melted <- melt(dfRed[, !(names(dfRed) %in% c("id"))])
p = ggplot(melted, aes(value)) + 
  geom_histogram() + 
  facet_wrap(~variable, scales = "free", shrink = FALSE) + 
  theme(panel.spacing = unit(4, "lines")) +
  xlab("") +
  ylab("") 
ggplotly(p)

The distribution of values for red wines looks quite different from white wines in some cases. For example, citric.acid in white wine is narrowly distributed between 0.1 and 0.5, while citric.acid in red wine shows much more variety in its distribution. For a better comparison, let’s see the histograms overlaid in one plot.

Distribution Comparison By Wine Type

melted <- melt(df[, !(names(df) %in% c("id"))])
p = ggplot(melted, aes(value, fill = wine.type)) + 
  geom_histogram(alpha = 0.5) + 
  theme(panel.spacing = unit(3, "lines")) +
  facet_wrap(~variable, scales = "free", shrink = FALSE) + 
  xlab("") +
  ylab("") +
  labs(fill = "")
ggplotly(p)

The plot above gives an easier way to view the distribution of each variable by wine type and allows us to compare the two types to better understand the data.
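
As a numeric complement to the overlaid histograms (a small sketch), the citric.acid difference noted earlier can be summarized by wine type:

# Five-number summaries of citric.acid for red and white wines
tapply(df$citric.acid, df$wine.type, summary)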

Correlation Analysis

For the remaining sections of this analysis we will use the white wine data only. In this section we explore whether there is any correlation between the wine rating and the chemical composition of the wine, and we identify the strongest and weakest correlations.

# Correlation matrix of the white wine variables, displayed as numbers
corr = cor(dfWhite[, !(names(dfWhite) %in% c("wine.type", "id"))])
corrplot(corr, method = "number")

The correlation matrix above shows the correlation coefficient for each pair of variables. For example, quality has a correlation coefficient of 0.44 with alcohol level. The correlation coefficient lies between -1.0 and 1.0; a coefficient of -1 means a strong negative correlation and a value of 1 means a strong positive correlation. If we consider anything above 0.30 in absolute value a medium correlation, the only two variables that have a medium correlation with quality are alcohol at 0.44 and density at -0.31. Can we improve this correlation? One thing to consider when looking at correlation is the influence of outliers: “In most practical circumstances an outlier decreases the value of a correlation coefficient and weakens the regression relationship, but it’s also possible that in some circumstances an outlier may increase a correlation value and improve regression.” (Penn State, n.d.) With the above in mind, and knowing that only 5 out of the 4,898 white wines received a rating of 9, as shown in the Exploratory Data Analysis summary table and distribution histograms, we can check whether these are considered outliers “statistically speaking”.
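
Before the formal box-plot check, a quick illustration (not part of the original workflow) of how much the five quality-9 wines alone move the alcohol-quality coefficient:

# Alcohol-quality correlation with and without the five wines rated 9
cor(dfWhite$alcohol, dfWhite$quality)
with(subset(dfWhite, quality < 9), cor(alcohol, quality))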

dfWhiteVar = dfWhite[, !(names(dfWhite) %in% c("wine.type", "id"))]

melted <- melt(dfWhiteVar)
p = ggplot(melted, aes(factor(variable), value)) + 
  geom_boxplot() + 
  facet_wrap(~variable, scales = "free", shrink = FALSE) + 
  theme(panel.spacing = unit(3, "lines"), strip.text.x = element_blank()) +
  xlab("") +
  ylab("")

ggplotly(p)

Let’s remove the outliers observed in the box plots above.

# Approach retrieved from https://www.statology.org/remove-outliers-from-multiple-columns-in-r/
# Flag values that fall more than 1.5 * IQR beyond the first or third quartile
outliers <- function(x) {

  Q1 <- quantile(x, probs = .25)
  Q3 <- quantile(x, probs = .75)
  iqr <- Q3 - Q1

  upper_limit <- Q3 + (iqr * 1.5)
  lower_limit <- Q1 - (iqr * 1.5)

  x > upper_limit | x < lower_limit
}

# Drop any row flagged as an outlier in one of the given columns
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    df <- df[!outliers(df[[col]]), ]
  }
  df
}

dfWhiteVarClean = remove_outliers(dfWhiteVar, c('fixed.acidity', 'volatile.acidity', 'citric.acid', 
                                                'chlorides', 'quality', 'free.sulfur.dioxide', 'pH'))
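
As a quick check (not part of the original output), we can see how many observations the IQR-based filter drops:

# Number of white wine rows removed by the outlier filter
nrow(dfWhiteVar) - nrow(dfWhiteVarClean)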

corr1 = cor(dfWhiteVarClean)
corrplot(corr1, method = "number")

The correlation coefficients do not change much after removing the outliers, so we will keep using the dataset without removing the outliers.
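
The quality row of the two matrices can also be compared side by side to make this point explicit (a small sketch built from the corr and corr1 objects computed above):

# Quality's correlation with each variable, before and after outlier removal
round(cbind(with_outliers = corr[, "quality"],
            without_outliers = corr1[, "quality"]), 2)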

Regression Model

To build a regression model formula, we first identify the variables that have a statistically significant correlation with quality from the chart below.

testRes = cor.mtest(dfWhiteVar, conf.level = 0.95)

## leave blank on no significant coefficient
corrplot(corr, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
         addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)

This chart is similar to the correlation coefficient chart described in the correlation analysis section above, but this time any coefficient that is not statistically significant is left blank. To build our first regression model, we take all variables that have values in the quality row of the chart above and include them in the model.

First Model

model1 = lm(quality ~ pH + alcohol + fixed.acidity + volatile.acidity + chlorides +
              density + residual.sugar + total.sulfur.dioxide + sulphates
            , data = dfWhiteVar)
summ(model1)
Observations 4898
Dependent variable quality
Type OLS linear regression
F(9,4888) 210.14
R² 0.28
Adj. R² 0.28
Est. S.E. t val. p
(Intercept) 162.79 18.57 8.77 0.00
pH 0.71 0.11 6.73 0.00
alcohol 0.18 0.02 7.64 0.00
fixed.acidity 0.07 0.02 3.20 0.00
volatile.acidity -1.97 0.11 -17.87 0.00
chlorides -0.15 0.54 -0.28 0.78
density -162.94 18.84 -8.65 0.00
residual.sugar 0.09 0.01 11.78 0.00
total.sulfur.dioxide 0.00 0.00 2.35 0.02
sulphates 0.64 0.10 6.35 0.00
Standard errors: OLS

The model has an R-squared value of 0.28, which means it explains only 28% of the variance in quality. The p-values are much less than 0.05 for all predictors except chlorides, which has a p-value of 0.78, so for those predictors we reject the null hypothesis that β = 0. Hence there is a relationship between those variables and quality in the linear regression model.
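
Since chlorides is the only term that is not significant, a natural follow-up (a sketch, not part of the original workflow; model1_reduced is just a name used here) is to refit the model without it and confirm the fit is essentially unchanged:

# Refit model 1 without chlorides and compare the two nested models
model1_reduced = update(model1, . ~ . - chlorides)
summary(model1_reduced)$adj.r.squared   # expected to be essentially the same as model 1
anova(model1_reduced, model1)           # partial F-test for the dropped chlorides term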

# Diagnostic plots for model 1 (residuals vs fitted, Q-Q, scale-location, leverage)
par(mfrow=c(2,2))
plot(model1)

Second Model

# Second model: drop pH and chlorides, add interaction terms
model2 = lm(quality ~ alcohol + fixed.acidity + volatile.acidity + 
              density + residual.sugar + total.sulfur.dioxide + sulphates +
              alcohol:density + density:residual.sugar +
              free.sulfur.dioxide:total.sulfur.dioxide
            , data = dfWhiteVar)
summ(model2)
Observations 4898
Dependent variable quality
Type OLS linear regression
F(10,4887) 185.72
R² 0.28
Adj. R² 0.27
Est. S.E. t val. p
(Intercept) -31.54 41.63 -0.76 0.45
alcohol 12.83 3.30 3.88 0.00
fixed.acidity -0.02 0.02 -1.45 0.15
volatile.acidity -2.12 0.11 -19.21 0.00
density 34.91 41.87 0.83 0.40
residual.sugar -0.63 0.31 -2.02 0.04
total.sulfur.dioxide 0.00 0.00 3.11 0.00
sulphates 0.59 0.10 5.86 0.00
alcohol:density -12.66 3.33 -3.81 0.00
density:residual.sugar 0.69 0.32 2.18 0.03
total.sulfur.dioxide:free.sulfur.dioxide -0.00 0.00 -2.08 0.04
Standard errors: OLS

Prediction

The second model, with its interaction terms, does not improve the fit (the adjusted R² drops slightly from 0.28 to 0.27), so let’s predict with model 1.
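
One quick way to back this choice numerically (a sketch, not part of the original output) is an information-criterion comparison of the two fits:

# Lower AIC indicates the better-fitting model after penalizing extra terms
AIC(model1, model2)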

set.seed(3068)
random = sample_n(dfWhite, 10)
predicted_model1 = predict(model1,list(pH   = random$pH,alcohol=random$alcohol,
                                fixed.acidity=random$fixed.acidity,
                                volatile.acidity=random$volatile.acidity,
                                chlorides=random$chlorides,
                                density=random$density,
                                residual.sugar=random$residual.sugar,
                                total.sulfur.dioxide=random$total.sulfur.dioxide,
                                sulphates=random$sulphates))
wine_id = c(random$id) 
Actual = c(random$quality) 
Predicted = c(round(predicted_model1))
predicted_values = data.frame(wine_id, Actual, Predicted)
predicted_values %>% 
  kbl() %>% 
  kable_styling() %>%
  row_spec(which(predicted_values$Predicted == predicted_values$Actual), bold = T, color = "white", background = "green") %>%
  row_spec(which(predicted_values$Predicted != predicted_values$Actual), bold = T, color = "white", background = "red")
wine_id Actual Predicted
6283 6 6
3692 5 4
5843 6 6
5355 6 6
4284 6 6
4533 6 6
4699 7 6
3583 8 6
2543 7 6
5977 6 7

As we can see from the table above, the first model predicted the wine quality correctly (rows colored in green) 50% of the time for our 10 random wines. We also notice that the prediction was almost always accurate for wines rated 6 compared to other ratings; this is because the majority of the wines in the dataset are rated 6.
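
To go beyond ten sampled wines, the same rounding rule can be scored over all white wines using the in-sample fitted values (a sketch; these numbers are not part of the original output):

# Share of exactly correct ratings and RMSE over all 4,898 white wines
mean(round(fitted(model1)) == dfWhiteVar$quality)
sqrt(mean((fitted(model1) - dfWhiteVar$quality)^2))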

Conclusion

In this analysis, we explored a dataset containing expert ratings of both white and red wines and examined the chemical composition of each type of wine. We then built a linear regression model to predict the wine quality rating from the wine’s chemical composition. The linear model, with an R-squared of only 0.28, predicted the rating exactly 50% of the time on a random sample of 10 wines and was mostly within 1 point of the true rating when the prediction was wrong. The model could be more accurate if the dataset contained more wines rated at both ends of the distribution to better balance it; because most of the wines are rated 5 or 6, the linear model predicts these values more accurately than ratings such as 3 or 8, which have very few wines in the dataset.