Wine Quality Predicition

Background

Wine, everyone knows what this is: an alcoholic drink made from fermenting grapes, but do you know that this drink is considered so important that there are actual professionals dedicating themselves to taste test wine? These people are called sommelier, and you usually find them in high-end restaurants that also serve wine. They are intended to guide restaurant visitors regarding what kind of wine will pair well with the food they ordered. While there are reports that wine testing might not be as legitimate as it seems to be, we probably need to remember that they evaluate wine based on their sense of smell and taste. What if there is another way where we can predict wine quality based on their contents?

That’s where this report comes in. Credits to Ghassen Khaled, he compiled a dataset that contains the properties from multiple wine, along with their colour and rating. You can find the dataset here: https://www.kaggle.com/datasets/ghassenkhaled/wine-quality-data?resource=download.

About Data

Let’s read the data

wine <- read_csv("Wine_Quality_Data.csv")

## Rows: 6497 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): color
## dbl (12): fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(wine)

## Rows: 6,497
## Columns: 13
## $ fixed_acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile_acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric_acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual_sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free_sulfur_dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality              <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
## $ color                <chr> "red", "red", "red", "red", "red", "red", "red", …

Data Wrangling

Changing Data Types

There are two variables that we immediately need to change, “quality” and “color”.

“quality”: The scale of wine quality ranges from 3 to 9, with higher being better. This will also be dependent variable.
“color” : There are two colors of wine: red and white.

Additionally, since we have to check the The rest of the variables represent the chemical components of the wine, therefore we don’t need to change the data type.

wine <- wine %>% 
  mutate_at(.vars = vars(quality, color),
            as.factor) %>% 
  mutate(quality = factor(quality, levels = c("3","4","5","6","7","8","9"))) # we need to reorder our categories from lowest to highest

glimpse(wine)

## Rows: 6,497
## Columns: 13
## $ fixed_acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile_acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric_acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual_sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free_sulfur_dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality              <fct> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
## $ color                <fct> red, red, red, red, red, red, red, red, red, red,…

Check for NA

Next step is to check for NA. We need to make sure that we don’t have any missing values in our data.

colSums(is.na(wine))

##        fixed_acidity     volatile_acidity          citric_acid 
##                    0                    0                    0 
##       residual_sugar            chlorides  free_sulfur_dioxide 
##                    0                    0                    0 
## total_sulfur_dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0 
##                color 
##                    0

As we can see, our dataframe is free from missing values.

Exploratory Data Analysis

Let’s see the descriptive statistics of our wine dataset, this is so we can take a look into where each variable usually resides and also see how spread out they are. We can find this simply by using the summary() function.

summary(wine)

##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0        1st Qu.:0.9923  
##  Median :0.04700   Median : 29.00      Median :118.0        Median :0.9949  
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7        Mean   :0.9947  
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0        3rd Qu.:0.9970  
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##                                                                             
##        pH          sulphates         alcohol      quality    color     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   3:  30   red  :1599  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   4: 216   white:4898  
##  Median :3.210   Median :0.5100   Median :10.30   5:2138               
##  Mean   :3.219   Mean   :0.5313   Mean   :10.49   6:2836               
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30   7:1079               
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 193               
##                                                   9:   5

From a glance, the variables “residual_sugar”, “free_sulfur_dioxide”, and “total_sulfur_dioxide” have very high maximum values compared to the rest of the variables, so we can expect them to have a lot of outliers too. But let’s see whether that’s really the case using the boxplot() function.

boxplot(wine)

The plot is a bit too small to look at, but we can see that the variables “residual_sugar”, “free_sulfur_dioxide”, and “total_sulfur_dioxide” have a lot of outliers just as the author suspected.

Let’s also see the correlation between the numerical variables. This is so we can see the behaviour we should expect when these variables are added together.

wine <- wine %>% 
  rename("total_so2" = total_sulfur_dioxide, "free_so2" = free_sulfur_dioxide, "vol_acid" = volatile_acidity, "fix_acid" = fixed_acidity, "res_sugar" = residual_sugar) #the author wanted a tidier plot and decided to rename some of the columns
ggcorr(wine, label = T, size = 2.5, layout.exp=1)

## Warning in ggcorr(wine, label = T, size = 2.5, layout.exp = 1): data in
## column(s) 'quality', 'color' are not numeric and were ignored

After doing 4 experiments in forming the model, the author decided to drop 2 variables: “density” and “fix_acid”. “density” is dropped due to how many variables it is strongly correlated to: “res_sugar”, “fix_acid”, and “alcohol”. “fix_acid” is only strongly correlated to “density”, however causes multicollinearity problems in the model, therefore was removed. The section below explains the process the author went through.

Modelling with Logistic Regression

Logistic Regression is a type of machine learning model where our computer tries to predict data based on the components it has. In our case, it would be predicting wine scores based on the chemical compositions of each wine.

Drop Density + Fixed Acidity

Before forming our logistic regression model, the author dropped the two variables previously stated. Dropping variables can be done easily with the select() function.

wine_clean<- wine %>% 
  select(-c(density,fix_acid))
glimpse(wine_clean)

## Rows: 6,497
## Columns: 11
## $ vol_acid    <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, 0.650, 0.…
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36…
## $ res_sugar   <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8, 6.1…
## $ chlorides   <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.…
## $ free_so2    <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16, 9, 52, …
## $ total_so2   <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102, 59, 29, …
## $ pH          <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35…
## $ sulphates   <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80…
## $ alcohol     <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5, 9.2, 1…
## $ quality     <fct> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7, 5, 4, 6…
## $ color       <fct> red, red, red, red, red, red, red, red, red, red, red, red…

Splitting The Data

Now that we have confirmed that our data is clean, we are now ready to create a logistic regression model. Firstly, let’s split the data into two, training data and test data. We’ll be using 80% of the data for training, while the remaining 20% will be used for testing how well our model works.

index <- sample(x=nrow(wine_clean), size = nrow(wine_clean)*0.8)
wine_train <-wine_clean[index,] 
wine_test <- wine_clean[-index,]

Balancing

While experimenting with the data, the author found that proportion of scored given to wine isn’t balanced. It showed that a lot of the wine that was sampled gets a score of 5 or 6 in terms of quality. We don’t want our data to be imbalanced because it will affect the regression model’s ability in predicting other values other than 5 or 6. Let’s balance it by using the upSample() function, where we increase the amount of samples we have.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(148)

wine_train <- upSample(x=wine_train %>% select(-quality),
                       y=wine_train$quality,
                       yname = "quality")
prop.table(table(wine_train$quality))

## 
##         3         4         5         6         7         8         9 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571

Building the Model

It’s time to form the model. It may be worth reminding now that there is a lot of variables that can be used in forming the prediction model, which could be confusing as there is there is no way in determining which variables are actually important, and which variables are unnecessary (aside from the two variables dropped previously).

While the author believes that every variable in the dataset could be significant in deciding the quality of wine, we can use the method called stepwise regression to be extra sure. With this method, variables are dropped or included based on the variables’ statistical significance. As there are three methods of stepwise regression (backward, forward, or both), we will be forming three models based on each method.

Side note, the output of the code below is lengthy, therefore the author to chose hide the output.

# forming the initial model
wine_all <- multinom(formula= quality ~ .,
                   data=wine_train)

wine_none <- multinom(formula= quality ~ 1,
                   data=wine_train)

#stepwise regression

wine_bck <- step(object = wine_all,
                 direction = "backward",
                 scope = list(upper=wine_all,
                              lower=wine_none),
                 trace = F)
wine_fwd <- step(object = wine_none,
                 direction = "forward",
                 scope = list(upper=wine_all,
                              lower=wine_none),
                 trace = F)
wine_bth <- step(object = wine_all,
                 direction = "both",
                 scope = list(upper=wine_all,
                              lower=wine_none),
                 trace = F)

Which Model to Use?

As there are three models available, we’ll have to decide which one to move forward with. We can decide it based on their AIC values, which measures how much data we lost after performing stepwise regression.The lower the score, the better our model.

Backward

summary(wine_bck)

## Call:
## multinom(formula = quality ~ vol_acid + citric_acid + res_sugar + 
##     chlorides + free_so2 + total_so2 + pH + sulphates + alcohol + 
##     color, data = wine_train)
## 
## Coefficients:
##   (Intercept)   vol_acid citric_acid   res_sugar   chlorides     free_so2
## 4    5.954897  -0.756587   -1.660368 -0.09974528   -8.104164 -0.049419750
## 5   11.648802  -3.971477   -1.625586 -0.05567012   -7.904641 -0.021148948
## 6    3.911871  -7.918927   -2.439814  0.01579415   -8.333396 -0.012165585
## 7   -3.278056 -10.817833   -2.375664  0.04927643  -12.720788 -0.011504846
## 8   -9.776945 -10.098634   -2.065539  0.12072419   -8.910770 -0.001774726
## 9  -31.881950 -11.169793    9.174830  0.13501180 -212.916750 -0.017206932
##       total_so2         pH sulphates    alcohol colorwhite
## 4  3.739643e-03 -0.6805702  3.536294 -0.3537828  1.1301152
## 5  6.989898e-03 -1.2892876  2.583995 -0.4786599 -1.5718442
## 6  1.598251e-05 -1.0942407  4.329335  0.3090728 -1.6349000
## 7 -1.765892e-03 -0.9271179  5.542250  0.9531321 -1.6548241
## 8 -4.281086e-03 -0.5972112  5.128498  1.2977306 -0.7313697
## 9  9.419447e-03  3.7149046  5.315737  1.1688837 10.2938548
## 
## Std. Errors:
##   (Intercept)  vol_acid citric_acid   res_sugar  chlorides    free_so2
## 4   0.5920897 0.1818768   0.2413384 0.008955264 0.84837415 0.002490200
## 5   0.6017877 0.2263131   0.2288343 0.008173552 0.80598207 0.001705450
## 6   0.5338112 0.2891945   0.2597608 0.008253636 0.85601566 0.001670238
## 7   0.5529676 0.3494669   0.2965729 0.009108730 0.41515012 0.001872787
## 8   0.5766600 0.3612904   0.3141938 0.009144153 0.24160412 0.001695915
## 9   0.4960036 0.5000370   0.4338575 0.013533217 0.01730289 0.002748372
##      total_so2        pH sulphates    alcohol colorwhite
## 4 0.0009338075 0.1778417 0.3089805 0.03403011  0.1452230
## 5 0.0009099039 0.1743575 0.3068492 0.03461758  0.1475399
## 6 0.0009947422 0.1627847 0.2948416 0.03224683  0.1567115
## 7 0.0011031077 0.1693312 0.3023052 0.03486521  0.1691955
## 8 0.0011182111 0.1745404 0.3102527 0.03707487  0.1767860
## 9 0.0014925845 0.2711588 0.3849466 0.05133128  0.4960489
## 
## Residual Deviance: 45314.18 
## AIC: 45446.18

Forward

summary(wine_fwd)

## Call:
## multinom(formula = quality ~ alcohol + vol_acid + chlorides + 
##     citric_acid + free_so2 + color + res_sugar + sulphates + 
##     pH + total_so2, data = wine_train)
## 
## Coefficients:
##   (Intercept)    alcohol   vol_acid   chlorides citric_acid     free_so2
## 4    5.954897 -0.3537828  -0.756587   -8.104164   -1.660368 -0.049419750
## 5   11.648802 -0.4786599  -3.971477   -7.904641   -1.625586 -0.021148948
## 6    3.911871  0.3090728  -7.918927   -8.333396   -2.439814 -0.012165585
## 7   -3.278056  0.9531321 -10.817833  -12.720788   -2.375664 -0.011504846
## 8   -9.776945  1.2977306 -10.098634   -8.910770   -2.065539 -0.001774726
## 9  -31.881950  1.1688837 -11.169793 -212.916750    9.174830 -0.017206932
##   colorwhite   res_sugar sulphates         pH     total_so2
## 4  1.1301152 -0.09974528  3.536294 -0.6805702  3.739643e-03
## 5 -1.5718442 -0.05567012  2.583995 -1.2892876  6.989898e-03
## 6 -1.6349000  0.01579415  4.329335 -1.0942407  1.598251e-05
## 7 -1.6548241  0.04927643  5.542250 -0.9271179 -1.765892e-03
## 8 -0.7313697  0.12072419  5.128498 -0.5972112 -4.281086e-03
## 9 10.2938548  0.13501180  5.315737  3.7149046  9.419447e-03
## 
## Std. Errors:
##   (Intercept)    alcohol  vol_acid  chlorides citric_acid    free_so2
## 4   0.5920897 0.03403011 0.1818768 0.84837415   0.2413384 0.002490200
## 5   0.6017877 0.03461758 0.2263131 0.80598207   0.2288343 0.001705450
## 6   0.5338112 0.03224683 0.2891945 0.85601566   0.2597608 0.001670238
## 7   0.5529676 0.03486521 0.3494669 0.41515012   0.2965729 0.001872787
## 8   0.5766600 0.03707487 0.3612904 0.24160412   0.3141938 0.001695915
## 9   0.4960036 0.05133128 0.5000370 0.01730289   0.4338575 0.002748372
##   colorwhite   res_sugar sulphates        pH    total_so2
## 4  0.1452230 0.008955264 0.3089805 0.1778417 0.0009338075
## 5  0.1475399 0.008173552 0.3068492 0.1743575 0.0009099039
## 6  0.1567115 0.008253636 0.2948416 0.1627847 0.0009947422
## 7  0.1691955 0.009108730 0.3023052 0.1693312 0.0011031077
## 8  0.1767860 0.009144153 0.3102527 0.1745404 0.0011182111
## 9  0.4960489 0.013533217 0.3849466 0.2711588 0.0014925845
## 
## Residual Deviance: 45314.18 
## AIC: 45446.18

Both

summary(wine_bth)

## Call:
## multinom(formula = quality ~ vol_acid + citric_acid + res_sugar + 
##     chlorides + free_so2 + total_so2 + pH + sulphates + alcohol + 
##     color, data = wine_train)
## 
## Coefficients:
##   (Intercept)   vol_acid citric_acid   res_sugar   chlorides     free_so2
## 4    5.954897  -0.756587   -1.660368 -0.09974528   -8.104164 -0.049419750
## 5   11.648802  -3.971477   -1.625586 -0.05567012   -7.904641 -0.021148948
## 6    3.911871  -7.918927   -2.439814  0.01579415   -8.333396 -0.012165585
## 7   -3.278056 -10.817833   -2.375664  0.04927643  -12.720788 -0.011504846
## 8   -9.776945 -10.098634   -2.065539  0.12072419   -8.910770 -0.001774726
## 9  -31.881950 -11.169793    9.174830  0.13501180 -212.916750 -0.017206932
##       total_so2         pH sulphates    alcohol colorwhite
## 4  3.739643e-03 -0.6805702  3.536294 -0.3537828  1.1301152
## 5  6.989898e-03 -1.2892876  2.583995 -0.4786599 -1.5718442
## 6  1.598251e-05 -1.0942407  4.329335  0.3090728 -1.6349000
## 7 -1.765892e-03 -0.9271179  5.542250  0.9531321 -1.6548241
## 8 -4.281086e-03 -0.5972112  5.128498  1.2977306 -0.7313697
## 9  9.419447e-03  3.7149046  5.315737  1.1688837 10.2938548
## 
## Std. Errors:
##   (Intercept)  vol_acid citric_acid   res_sugar  chlorides    free_so2
## 4   0.5920897 0.1818768   0.2413384 0.008955264 0.84837415 0.002490200
## 5   0.6017877 0.2263131   0.2288343 0.008173552 0.80598207 0.001705450
## 6   0.5338112 0.2891945   0.2597608 0.008253636 0.85601566 0.001670238
## 7   0.5529676 0.3494669   0.2965729 0.009108730 0.41515012 0.001872787
## 8   0.5766600 0.3612904   0.3141938 0.009144153 0.24160412 0.001695915
## 9   0.4960036 0.5000370   0.4338575 0.013533217 0.01730289 0.002748372
##      total_so2        pH sulphates    alcohol colorwhite
## 4 0.0009338075 0.1778417 0.3089805 0.03403011  0.1452230
## 5 0.0009099039 0.1743575 0.3068492 0.03461758  0.1475399
## 6 0.0009947422 0.1627847 0.2948416 0.03224683  0.1567115
## 7 0.0011031077 0.1693312 0.3023052 0.03486521  0.1691955
## 8 0.0011182111 0.1745404 0.3102527 0.03707487  0.1767860
## 9 0.0014925845 0.2711588 0.3849466 0.05133128  0.4960489
## 
## Residual Deviance: 45314.18 
## AIC: 45446.18

The three of our models have the same value of AIC, therefore using any of them would be fine. The author decided to pick the model produced from backwards regression for this model (wine_bck).

Here is the author’s interpretation of the model:

Lower volatile acidity (vol_acid) in wine may lead to higher wine scores
Higher citric acidity (citric_acid) in wine may lead to higher wine scores
Higher residual sugar (res_sugar) in wine may lead to higher wine scores
Lower chlorides (chlorides) in wine may lead to higher wine scores
Higher free sulphur dioxide (free_so2) in wine may lead to higher wine scores
Lower total sulphur dioxide (total_so2) in wine may lead to higher wine scores
Higher pH levels (pH) in wine may lead to higher wine scores
Higher sulphate levels (sulphates) in wine may lead to higher wine scores
Higher alcohol levels (alcohol) in wine may lead to higher wine scores
Red wine (colorwhite = 0) may have higher scores compared to white wine (colorwhite=1)

Check for Multicollinearity

Before we start using the model for predicting wine quality, let’s check whether our chosen variables are free from the issue of multicollinearity. Normally, we can use the vif() function to check for multicollinearity. However, we can’t use it for this report because the package we use for multinomial regression, “nnet”, doesn’t work with the vif() function. The author found an alternative to measuring VIF here : https://stackoverflow.com/questions/67077015/vif-no-intercept-vifs-may-not-be-sensible

labels = rownames(coefficients(wine_bck))
ref = setdiff(wine_bck$lab,labels)
t(sapply(labels,function(i){
       dat = wine_train
       dat$quality = as.numeric(dat$quality == i)
       vif(glm(quality ~ .,data=dat,family="binomial"))
}))

##   vol_acid citric_acid res_sugar chlorides free_so2 total_so2       pH
## 4 2.685474    1.640055  1.527775  1.820203 1.998086  3.489344 1.541121
## 5 2.360805    1.453608  1.620573  1.677046 2.231591  4.374169 1.503363
## 6 2.107683    1.376495  1.586424  1.786312 2.325326  4.124478 1.361211
## 7 1.733950    1.264242  1.538528  1.804936 2.027057  3.360825 1.281000
## 8 1.483200    1.209206  1.699436  1.753131 2.351602  3.351056 1.285325
## 9 1.306019    1.417164  1.917286  2.157596 1.768718  2.291256 1.293597
##   sulphates  alcohol    color
## 4  1.338732 1.457793 3.905513
## 5  1.540667 1.240650 4.198445
## 6  1.518736 1.485939 3.606910
## 7  1.405424 1.621113 3.121159
## 8  1.401476 1.886062 2.234017
## 9  1.161171 1.948435 1.000000

As we can see, all of the VIF values shown has values that are less than 10, which signifies that our model is free from the issues of multicollinearity.

Predictions With Logistic Regression

Predictions

Let’s first make the predictions on the test data we have previously prepared.

wine_test$pred <- predict(object = wine_bck,
                          newdata = wine_test, 
                          type = "class")

head(wine_test)

## # A tibble: 6 × 12
##   vol_acid citri…¹ res_s…² chlor…³ free_…⁴ total…⁵    pH sulph…⁶ alcohol quality
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>   <dbl> <fct>  
## 1     0.7     0        1.9   0.076      11      34  3.51    0.56     9.4 5      
## 2     0.88    0        2.6   0.098      25      67  3.2     0.68     9.8 5      
## 3     0.58    0.02     2     0.073       9      18  3.36    0.57     9.5 7      
## 4     0.5     0.36     6.1   0.071      17     102  3.35    0.8     10.5 5      
## 5     0.62    0.19     3.9   0.17       51     148  3.17    0.93     9.2 5      
## 6     0.56    0.28     1.7   0.368      16      56  3.11    1.28     9.3 5      
## # … with 2 more variables: color <fct>, pred <fct>, and abbreviated variable
## #   names ¹citric_acid, ²res_sugar, ³chlorides, ⁴free_so2, ⁵total_so2,
## #   ⁶sulphates

Evaluation

Predictions have been made. The only thing we need to do now is to see how well our model actually performed.

confusionMatrix(data = wine_test$pred,
                reference = wine_test$quality)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   1   9  67  64   2   1   0
##          4   1  14  87  56   7   0   0
##          5   0   5 162 120  19   2   0
##          6   1   1  65 124  29   3   0
##          7   0   2  21 108  65   7   0
##          8   0   2  16  73  58  18   0
##          9   0   1  12  43  28   6   0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.2954         
##                  95% CI : (0.2707, 0.321)
##     No Information Rate : 0.4523         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1321         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                       Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.3333333  0.41176   0.3767  0.21088   0.3125  0.48649
## Specificity          0.8897456  0.88073   0.8322  0.86096   0.8736  0.88203
## Pos Pred Value       0.0069444  0.08485   0.5260  0.55605   0.3202  0.10778
## Neg Pred Value       0.9982699  0.98238   0.7298  0.56917   0.8696  0.98323
## Prevalence           0.0023077  0.02615   0.3308  0.45231   0.1600  0.02846
## Detection Rate       0.0007692  0.01077   0.1246  0.09538   0.0500  0.01385
## Detection Prevalence 0.1107692  0.12692   0.2369  0.17154   0.1562  0.12846
## Balanced Accuracy    0.6115395  0.64625   0.6045  0.53592   0.5931  0.68426
##                      Class: 9
## Sensitivity                NA
## Specificity           0.93077
## Pos Pred Value             NA
## Neg Pred Value             NA
## Prevalence            0.00000
## Detection Rate        0.00000
## Detection Prevalence  0.06923
## Balanced Accuracy          NA

Judging from the accuracy statistic, looks like our model performed poorly here. It only has an accuracy of 30%, meaning that it gets its predictions wrong 70% of the time.

Looking at the sensitivity statistic, our model is generally not very good at making true positive predictions as it only ranges from 20.8% to 48.6%. An illustration of true positive in this context: Our model tends to fail at predicting a score of 5 to a wine that is rated 5 in quality.

The specifity statistic for our model is the opposite, it’s very good at making true negative predictions as it ranges from 85% to 94.7%. An illustration of true negative: Our model is very good at avoiding giving a score of 4 to a wine that is actually scored 9 in quality.

Modelling with kNN

The next thing we’re gonna do now is creating another model using a different method, called the kNN (k-Nearest Neighbour). This is a method where variables are classified based on their numerical characteristics. A simple illustration would be to imagine the difference between a chair, table, and bed. Each of those items have their own characteristics and dimensions. The kNN method will learn about the characteristics of those items, and will make predictions of new incoming items based on what it has learned. The difference between the previous model is that we won’t be able to form interpretations of the model (We won’t understand which variables affect the quality of wine, we just see the prediction made by the model).

Reforming the Data

Before we perform the kNN method however, we have to make sure that all of our independent variables are numerical, while our dependent variable is categorical. Let’s check our data types again.

glimpse(wine) # we're using the raw data

## Rows: 6,497
## Columns: 13
## $ fix_acid    <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7, 7.…
## $ vol_acid    <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, 0.650, 0.…
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36…
## $ res_sugar   <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8, 6.1…
## $ chlorides   <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.…
## $ free_so2    <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16, 9, 52, …
## $ total_so2   <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102, 59, 29, …
## $ density     <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0.9964, 0.…
## $ pH          <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35…
## $ sulphates   <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80…
## $ alcohol     <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5, 9.2, 1…
## $ quality     <fct> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7, 5, 4, 6…
## $ color       <fct> red, red, red, red, red, red, red, red, red, red, red, red…

One of our independent variable, color, is categorical. For the kNN model, we will have to remove the color variable from our data frame.

wine_wo_col <- wine %>% 
  select(-color)

glimpse(wine_wo_col)

## Rows: 6,497
## Columns: 12
## $ fix_acid    <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7, 7.…
## $ vol_acid    <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, 0.650, 0.…
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36…
## $ res_sugar   <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8, 6.1…
## $ chlorides   <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.…
## $ free_so2    <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16, 9, 52, …
## $ total_so2   <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102, 59, 29, …
## $ density     <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0.9964, 0.…
## $ pH          <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35…
## $ sulphates   <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80…
## $ alcohol     <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5, 9.2, 1…
## $ quality     <fct> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7, 5, 4, 6…

Splitting and Balancing

Just like the previous method, we need to divide the data into two: training data and test data.

index_2 <- sample(x = nrow(wine_wo_col), size = nrow(wine_wo_col)*0.8)
knn_train <- wine_wo_col[index_2,]
knn_test <- wine_wo_col[-index_2,]

Check whether the data is balanced or not,

nrow(knn_train) # to check the data left after splitting

## [1] 5197

prop.table(table(knn_train$quality)) # before balancing

## 
##            3            4            5            6            7            8 
## 0.0048104676 0.0327111795 0.3269193766 0.4339041755 0.1714450645 0.0294400616 
##            9 
## 0.0007696748

rebalance it again,

knn_train <- upSample(x = knn_train %>% select(c(-quality)),
                      y = knn_train$quality,
                      yname = "quality")
prop.table(table(knn_train$quality))

## 
##         3         4         5         6         7         8         9 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571

Preparing the Data

Now the difference between the previous method and the kNN method is that we need to separate the dependent variable from the independent variables.

# Predictors
knn_train_x <- knn_train %>% select_if(is.numeric)
knn_test_x <- knn_test %>% select_if(is.numeric)

# Dependent
knn_train_y <- knn_train[,"quality"]
knn_test_y <- knn_test[,"quality"]

Prediction with kNN

Our data is ready to be used to form the model. There are only a few steps left.

Finding the Optimum k Value

We can find the optimum k value by simply finding the square root of our training data.

nrow(knn_train_x)

## [1] 15785

sqrt(nrow(knn_train_x))

## [1] 125.6384

However, we also have to consider the rules in picking the k value:

k value must be even if the number of target variables is odd
k value must be odd if the number of target variables is even
k value must not be a multiple of the number of target variables

We have 7 target variables: 3, 4, 5, 6, 7, 8, 9. Therefore we have to pick an even number. Since 126 is a multiple of 7, the author will use the number 128 instead.

Forming the Model

We can form the kNN model by simply using the knn() function

knn_model <- knn(train = knn_train_x,
                 test = knn_test_x,
                 cl = knn_train_y,
                 k = 128)

head(knn_model)

## [1] 5 5 4 5 8 4
## Levels: 3 4 5 6 7 8 9

Model Evaluation

Let’s now check how well the model performed compared to our previous logistic regression model

confusionMatrix(data = knn_model,
                reference = knn_test$quality)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0  11  78  95  22   4   0
##          4   2  15  91  80  20   3   0
##          5   1   6  89  85  14   4   0
##          6   0   4  24  48  11   3   0
##          7   1   4  51  87  40   3   0
##          8   1   5  84 166  69  20   1
##          9   0   1  22  20  12   3   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.1631          
##                  95% CI : (0.1434, 0.1843)
##     No Information Rate : 0.4469          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0512          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.32609  0.20273  0.08262  0.21277  0.50000
## Specificity          0.837838  0.84370  0.87224  0.94159  0.86871  0.74127
## Pos Pred Value       0.000000  0.07109  0.44724  0.53333  0.21505  0.05780
## Neg Pred Value       0.995413  0.97153  0.68211  0.55950  0.86715  0.97904
## Prevalence           0.003846  0.03538  0.33769  0.44692  0.14462  0.03077
## Detection Rate       0.000000  0.01154  0.06846  0.03692  0.03077  0.01538
## Detection Prevalence 0.161538  0.16231  0.15308  0.06923  0.14308  0.26615
## Balanced Accuracy    0.418919  0.58489  0.53749  0.51210  0.54074  0.62063
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          0.9553503
## Pos Pred Value       0.0000000
## Neg Pred Value       0.9991948
## Prevalence           0.0007692
## Detection Rate       0.0000000
## Detection Prevalence 0.0446154
## Balanced Accuracy    0.4776751

Taking a look at the model’s accuracy, it is much lower than the other model, where this one only can only correctly predict 18% of time.

It also has lower sensitivity and specificity levels compared to our previous model. Overall the author deems the model created from the kNN model worse than the model formed with logistic regression.

Discussion

One possible explanation of why our model performed poorly is because the previously dropped variables. It could be the case that all of the variables have to be concluded for the model to perform well. But at the same time, that is impossible for our logistic regression model since including all variables causes multicollinearity. Our kNN model also cannot include all variables as one of our predictor is categorical.

Another possible explanation may be that similar wines are given different ratings (one wine is rated 5 and another is rated lower or higher, despite of how close they are in the levels of their chemical properties). This leads back to the idea of how wine tasting is questionable in terms of how legitimate it is. Here are some example articles that question the legitimacy of wine tasting. In short, preferences of taste and inconsistencies in how wine is tasted and valued from by testers seem to heavily affect the rating they give to wine.

Conclusion and Suggestions

In this report the author tries to find out whether it is possible to create a model that can predict the quality of wine based on their chemical components. The author created two models, one with logistic regression and the other with kNN. After the results are out, the author founds that while the logistic regression model performed better than the kNN model, the author deemed the performance of both models to be poor. A possible cause may be that the rating given to the wine themselves may be biased by the reviewer, resulting in scores that are varied despite the possibility of one wine having very similar characteristics with another.

A note for readers who are interested in making a similar report: Since wine tasting (or tasting in general) relates to each individual’s preferences, it may be best to instead create an unsupervised machine learning model to categorize wine based on their characteristics instead of trying to create a supervised model that categorizes scores for food and beverages. If a reader still wishes to create a model that predicts scores, then the author suggests to simplify the scoring system: Instead of categorizing wine from a scale of 3 to 9, future reports should categorize ratings as “below average” or “above average”. This may result in a better performing model.