Wine Quality Predicition
Background
Wine, everyone knows what this is: an alcoholic drink made from fermenting grapes, but do you know that this drink is considered so important that there are actual professionals dedicating themselves to taste test wine? These people are called sommelier, and you usually find them in high-end restaurants that also serve wine. They are intended to guide restaurant visitors regarding what kind of wine will pair well with the food they ordered. While there are reports that wine testing might not be as legitimate as it seems to be, we probably need to remember that they evaluate wine based on their sense of smell and taste. What if there is another way where we can predict wine quality based on their contents?
That’s where this report comes in. Credits to Ghassen Khaled, he compiled a dataset that contains the properties from multiple wine, along with their colour and rating. You can find the dataset here: https://www.kaggle.com/datasets/ghassenkhaled/wine-quality-data?resource=download.
About Data
Let’s read the data
wine <- read_csv("Wine_Quality_Data.csv")## Rows: 6497 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): color
## dbl (12): fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlo...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(wine)## Rows: 6,497
## Columns: 13
## $ fixed_acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile_acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free_sulfur_dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
## $ color <chr> "red", "red", "red", "red", "red", "red", "red", …
Data Wrangling
Changing Data Types
There are two variables that we immediately need to change, “quality” and “color”.
- “quality”: The scale of wine quality ranges from 3 to 9, with higher being better. This will also be dependent variable.
- “color” : There are two colors of wine: red and white.
Additionally, since we have to check the The rest of the variables represent the chemical components of the wine, therefore we don’t need to change the data type.
wine <- wine %>%
mutate_at(.vars = vars(quality, color),
as.factor) %>%
mutate(quality = factor(quality, levels = c("3","4","5","6","7","8","9"))) # we need to reorder our categories from lowest to highest
glimpse(wine)## Rows: 6,497
## Columns: 13
## $ fixed_acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile_acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free_sulfur_dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality <fct> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
## $ color <fct> red, red, red, red, red, red, red, red, red, red,…
Check for NA
Next step is to check for NA. We need to make sure that we don’t have any missing values in our data.
colSums(is.na(wine))## fixed_acidity volatile_acidity citric_acid
## 0 0 0
## residual_sugar chlorides free_sulfur_dioxide
## 0 0 0
## total_sulfur_dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
## color
## 0
As we can see, our dataframe is free from missing values.
Exploratory Data Analysis
Let’s see the descriptive statistics of our wine dataset, this is so we can take a look into where each variable usually resides and also see how spread out they are. We can find this simply by using the summary() function.
summary(wine)## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
##
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0390
##
## pH sulphates alcohol quality color
## Min. :2.720 Min. :0.2200 Min. : 8.00 3: 30 red :1599
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 4: 216 white:4898
## Median :3.210 Median :0.5100 Median :10.30 5:2138
## Mean :3.219 Mean :0.5313 Mean :10.49 6:2836
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 7:1079
## Max. :4.010 Max. :2.0000 Max. :14.90 8: 193
## 9: 5
From a glance, the variables “residual_sugar”, “free_sulfur_dioxide”, and “total_sulfur_dioxide” have very high maximum values compared to the rest of the variables, so we can expect them to have a lot of outliers too. But let’s see whether that’s really the case using the boxplot() function.
boxplot(wine)The plot is a bit too small to look at, but we can see that the variables “residual_sugar”, “free_sulfur_dioxide”, and “total_sulfur_dioxide” have a lot of outliers just as the author suspected.
Let’s also see the correlation between the numerical variables. This is so we can see the behaviour we should expect when these variables are added together.
wine <- wine %>%
rename("total_so2" = total_sulfur_dioxide, "free_so2" = free_sulfur_dioxide, "vol_acid" = volatile_acidity, "fix_acid" = fixed_acidity, "res_sugar" = residual_sugar) #the author wanted a tidier plot and decided to rename some of the columns
ggcorr(wine, label = T, size = 2.5, layout.exp=1)## Warning in ggcorr(wine, label = T, size = 2.5, layout.exp = 1): data in
## column(s) 'quality', 'color' are not numeric and were ignored
After doing 4 experiments in forming the model, the author decided to
drop 2 variables: “density” and “fix_acid”. “density” is dropped due to
how many variables it is strongly correlated to: “res_sugar”,
“fix_acid”, and “alcohol”. “fix_acid” is only strongly correlated to
“density”, however causes multicollinearity problems in the model,
therefore was removed. The section below explains the process the author
went through.
Modelling with Logistic Regression
Logistic Regression is a type of machine learning model where our computer tries to predict data based on the components it has. In our case, it would be predicting wine scores based on the chemical compositions of each wine.
Drop Density + Fixed Acidity
Before forming our logistic regression model, the author dropped the two variables previously stated. Dropping variables can be done easily with the select() function.
wine_clean<- wine %>%
select(-c(density,fix_acid))
glimpse(wine_clean)## Rows: 6,497
## Columns: 11
## $ vol_acid <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, 0.650, 0.…
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36…
## $ res_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8, 6.1…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.…
## $ free_so2 <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16, 9, 52, …
## $ total_so2 <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102, 59, 29, …
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5, 9.2, 1…
## $ quality <fct> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7, 5, 4, 6…
## $ color <fct> red, red, red, red, red, red, red, red, red, red, red, red…
Splitting The Data
Now that we have confirmed that our data is clean, we are now ready to create a logistic regression model. Firstly, let’s split the data into two, training data and test data. We’ll be using 80% of the data for training, while the remaining 20% will be used for testing how well our model works.
index <- sample(x=nrow(wine_clean), size = nrow(wine_clean)*0.8)
wine_train <-wine_clean[index,]
wine_test <- wine_clean[-index,]Balancing
While experimenting with the data, the author found that proportion of scored given to wine isn’t balanced. It showed that a lot of the wine that was sampled gets a score of 5 or 6 in terms of quality. We don’t want our data to be imbalanced because it will affect the regression model’s ability in predicting other values other than 5 or 6. Let’s balance it by using the upSample() function, where we increase the amount of samples we have.
RNGkind(sample.kind = "Rounding")## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(148)
wine_train <- upSample(x=wine_train %>% select(-quality),
y=wine_train$quality,
yname = "quality")
prop.table(table(wine_train$quality))##
## 3 4 5 6 7 8 9
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
Building the Model
It’s time to form the model. It may be worth reminding now that there is a lot of variables that can be used in forming the prediction model, which could be confusing as there is there is no way in determining which variables are actually important, and which variables are unnecessary (aside from the two variables dropped previously).
While the author believes that every variable in the dataset could be significant in deciding the quality of wine, we can use the method called stepwise regression to be extra sure. With this method, variables are dropped or included based on the variables’ statistical significance. As there are three methods of stepwise regression (backward, forward, or both), we will be forming three models based on each method.
Side note, the output of the code below is lengthy, therefore the author to chose hide the output.
# forming the initial model
wine_all <- multinom(formula= quality ~ .,
data=wine_train)
wine_none <- multinom(formula= quality ~ 1,
data=wine_train)
#stepwise regression
wine_bck <- step(object = wine_all,
direction = "backward",
scope = list(upper=wine_all,
lower=wine_none),
trace = F)
wine_fwd <- step(object = wine_none,
direction = "forward",
scope = list(upper=wine_all,
lower=wine_none),
trace = F)
wine_bth <- step(object = wine_all,
direction = "both",
scope = list(upper=wine_all,
lower=wine_none),
trace = F)Which Model to Use?
As there are three models available, we’ll have to decide which one to move forward with. We can decide it based on their AIC values, which measures how much data we lost after performing stepwise regression.The lower the score, the better our model.
Backward
summary(wine_bck)## Call:
## multinom(formula = quality ~ vol_acid + citric_acid + res_sugar +
## chlorides + free_so2 + total_so2 + pH + sulphates + alcohol +
## color, data = wine_train)
##
## Coefficients:
## (Intercept) vol_acid citric_acid res_sugar chlorides free_so2
## 4 5.954897 -0.756587 -1.660368 -0.09974528 -8.104164 -0.049419750
## 5 11.648802 -3.971477 -1.625586 -0.05567012 -7.904641 -0.021148948
## 6 3.911871 -7.918927 -2.439814 0.01579415 -8.333396 -0.012165585
## 7 -3.278056 -10.817833 -2.375664 0.04927643 -12.720788 -0.011504846
## 8 -9.776945 -10.098634 -2.065539 0.12072419 -8.910770 -0.001774726
## 9 -31.881950 -11.169793 9.174830 0.13501180 -212.916750 -0.017206932
## total_so2 pH sulphates alcohol colorwhite
## 4 3.739643e-03 -0.6805702 3.536294 -0.3537828 1.1301152
## 5 6.989898e-03 -1.2892876 2.583995 -0.4786599 -1.5718442
## 6 1.598251e-05 -1.0942407 4.329335 0.3090728 -1.6349000
## 7 -1.765892e-03 -0.9271179 5.542250 0.9531321 -1.6548241
## 8 -4.281086e-03 -0.5972112 5.128498 1.2977306 -0.7313697
## 9 9.419447e-03 3.7149046 5.315737 1.1688837 10.2938548
##
## Std. Errors:
## (Intercept) vol_acid citric_acid res_sugar chlorides free_so2
## 4 0.5920897 0.1818768 0.2413384 0.008955264 0.84837415 0.002490200
## 5 0.6017877 0.2263131 0.2288343 0.008173552 0.80598207 0.001705450
## 6 0.5338112 0.2891945 0.2597608 0.008253636 0.85601566 0.001670238
## 7 0.5529676 0.3494669 0.2965729 0.009108730 0.41515012 0.001872787
## 8 0.5766600 0.3612904 0.3141938 0.009144153 0.24160412 0.001695915
## 9 0.4960036 0.5000370 0.4338575 0.013533217 0.01730289 0.002748372
## total_so2 pH sulphates alcohol colorwhite
## 4 0.0009338075 0.1778417 0.3089805 0.03403011 0.1452230
## 5 0.0009099039 0.1743575 0.3068492 0.03461758 0.1475399
## 6 0.0009947422 0.1627847 0.2948416 0.03224683 0.1567115
## 7 0.0011031077 0.1693312 0.3023052 0.03486521 0.1691955
## 8 0.0011182111 0.1745404 0.3102527 0.03707487 0.1767860
## 9 0.0014925845 0.2711588 0.3849466 0.05133128 0.4960489
##
## Residual Deviance: 45314.18
## AIC: 45446.18
Forward
summary(wine_fwd)## Call:
## multinom(formula = quality ~ alcohol + vol_acid + chlorides +
## citric_acid + free_so2 + color + res_sugar + sulphates +
## pH + total_so2, data = wine_train)
##
## Coefficients:
## (Intercept) alcohol vol_acid chlorides citric_acid free_so2
## 4 5.954897 -0.3537828 -0.756587 -8.104164 -1.660368 -0.049419750
## 5 11.648802 -0.4786599 -3.971477 -7.904641 -1.625586 -0.021148948
## 6 3.911871 0.3090728 -7.918927 -8.333396 -2.439814 -0.012165585
## 7 -3.278056 0.9531321 -10.817833 -12.720788 -2.375664 -0.011504846
## 8 -9.776945 1.2977306 -10.098634 -8.910770 -2.065539 -0.001774726
## 9 -31.881950 1.1688837 -11.169793 -212.916750 9.174830 -0.017206932
## colorwhite res_sugar sulphates pH total_so2
## 4 1.1301152 -0.09974528 3.536294 -0.6805702 3.739643e-03
## 5 -1.5718442 -0.05567012 2.583995 -1.2892876 6.989898e-03
## 6 -1.6349000 0.01579415 4.329335 -1.0942407 1.598251e-05
## 7 -1.6548241 0.04927643 5.542250 -0.9271179 -1.765892e-03
## 8 -0.7313697 0.12072419 5.128498 -0.5972112 -4.281086e-03
## 9 10.2938548 0.13501180 5.315737 3.7149046 9.419447e-03
##
## Std. Errors:
## (Intercept) alcohol vol_acid chlorides citric_acid free_so2
## 4 0.5920897 0.03403011 0.1818768 0.84837415 0.2413384 0.002490200
## 5 0.6017877 0.03461758 0.2263131 0.80598207 0.2288343 0.001705450
## 6 0.5338112 0.03224683 0.2891945 0.85601566 0.2597608 0.001670238
## 7 0.5529676 0.03486521 0.3494669 0.41515012 0.2965729 0.001872787
## 8 0.5766600 0.03707487 0.3612904 0.24160412 0.3141938 0.001695915
## 9 0.4960036 0.05133128 0.5000370 0.01730289 0.4338575 0.002748372
## colorwhite res_sugar sulphates pH total_so2
## 4 0.1452230 0.008955264 0.3089805 0.1778417 0.0009338075
## 5 0.1475399 0.008173552 0.3068492 0.1743575 0.0009099039
## 6 0.1567115 0.008253636 0.2948416 0.1627847 0.0009947422
## 7 0.1691955 0.009108730 0.3023052 0.1693312 0.0011031077
## 8 0.1767860 0.009144153 0.3102527 0.1745404 0.0011182111
## 9 0.4960489 0.013533217 0.3849466 0.2711588 0.0014925845
##
## Residual Deviance: 45314.18
## AIC: 45446.18
Both
summary(wine_bth)## Call:
## multinom(formula = quality ~ vol_acid + citric_acid + res_sugar +
## chlorides + free_so2 + total_so2 + pH + sulphates + alcohol +
## color, data = wine_train)
##
## Coefficients:
## (Intercept) vol_acid citric_acid res_sugar chlorides free_so2
## 4 5.954897 -0.756587 -1.660368 -0.09974528 -8.104164 -0.049419750
## 5 11.648802 -3.971477 -1.625586 -0.05567012 -7.904641 -0.021148948
## 6 3.911871 -7.918927 -2.439814 0.01579415 -8.333396 -0.012165585
## 7 -3.278056 -10.817833 -2.375664 0.04927643 -12.720788 -0.011504846
## 8 -9.776945 -10.098634 -2.065539 0.12072419 -8.910770 -0.001774726
## 9 -31.881950 -11.169793 9.174830 0.13501180 -212.916750 -0.017206932
## total_so2 pH sulphates alcohol colorwhite
## 4 3.739643e-03 -0.6805702 3.536294 -0.3537828 1.1301152
## 5 6.989898e-03 -1.2892876 2.583995 -0.4786599 -1.5718442
## 6 1.598251e-05 -1.0942407 4.329335 0.3090728 -1.6349000
## 7 -1.765892e-03 -0.9271179 5.542250 0.9531321 -1.6548241
## 8 -4.281086e-03 -0.5972112 5.128498 1.2977306 -0.7313697
## 9 9.419447e-03 3.7149046 5.315737 1.1688837 10.2938548
##
## Std. Errors:
## (Intercept) vol_acid citric_acid res_sugar chlorides free_so2
## 4 0.5920897 0.1818768 0.2413384 0.008955264 0.84837415 0.002490200
## 5 0.6017877 0.2263131 0.2288343 0.008173552 0.80598207 0.001705450
## 6 0.5338112 0.2891945 0.2597608 0.008253636 0.85601566 0.001670238
## 7 0.5529676 0.3494669 0.2965729 0.009108730 0.41515012 0.001872787
## 8 0.5766600 0.3612904 0.3141938 0.009144153 0.24160412 0.001695915
## 9 0.4960036 0.5000370 0.4338575 0.013533217 0.01730289 0.002748372
## total_so2 pH sulphates alcohol colorwhite
## 4 0.0009338075 0.1778417 0.3089805 0.03403011 0.1452230
## 5 0.0009099039 0.1743575 0.3068492 0.03461758 0.1475399
## 6 0.0009947422 0.1627847 0.2948416 0.03224683 0.1567115
## 7 0.0011031077 0.1693312 0.3023052 0.03486521 0.1691955
## 8 0.0011182111 0.1745404 0.3102527 0.03707487 0.1767860
## 9 0.0014925845 0.2711588 0.3849466 0.05133128 0.4960489
##
## Residual Deviance: 45314.18
## AIC: 45446.18
The three of our models have the same value of AIC, therefore using any of them would be fine. The author decided to pick the model produced from backwards regression for this model (wine_bck).
Here is the author’s interpretation of the model:
- Lower volatile acidity (vol_acid) in wine may lead to higher wine scores
- Higher citric acidity (citric_acid) in wine may lead to higher wine scores
- Higher residual sugar (res_sugar) in wine may lead to higher wine scores
- Lower chlorides (chlorides) in wine may lead to higher wine scores
- Higher free sulphur dioxide (free_so2) in wine may lead to higher wine scores
- Lower total sulphur dioxide (total_so2) in wine may lead to higher wine scores
- Higher pH levels (pH) in wine may lead to higher wine scores
- Higher sulphate levels (sulphates) in wine may lead to higher wine scores
- Higher alcohol levels (alcohol) in wine may lead to higher wine scores
- Red wine (colorwhite = 0) may have higher scores compared to white wine (colorwhite=1)
Check for Multicollinearity
Before we start using the model for predicting wine quality, let’s check whether our chosen variables are free from the issue of multicollinearity. Normally, we can use the vif() function to check for multicollinearity. However, we can’t use it for this report because the package we use for multinomial regression, “nnet”, doesn’t work with the vif() function. The author found an alternative to measuring VIF here : https://stackoverflow.com/questions/67077015/vif-no-intercept-vifs-may-not-be-sensible
labels = rownames(coefficients(wine_bck))
ref = setdiff(wine_bck$lab,labels)
t(sapply(labels,function(i){
dat = wine_train
dat$quality = as.numeric(dat$quality == i)
vif(glm(quality ~ .,data=dat,family="binomial"))
}))## vol_acid citric_acid res_sugar chlorides free_so2 total_so2 pH
## 4 2.685474 1.640055 1.527775 1.820203 1.998086 3.489344 1.541121
## 5 2.360805 1.453608 1.620573 1.677046 2.231591 4.374169 1.503363
## 6 2.107683 1.376495 1.586424 1.786312 2.325326 4.124478 1.361211
## 7 1.733950 1.264242 1.538528 1.804936 2.027057 3.360825 1.281000
## 8 1.483200 1.209206 1.699436 1.753131 2.351602 3.351056 1.285325
## 9 1.306019 1.417164 1.917286 2.157596 1.768718 2.291256 1.293597
## sulphates alcohol color
## 4 1.338732 1.457793 3.905513
## 5 1.540667 1.240650 4.198445
## 6 1.518736 1.485939 3.606910
## 7 1.405424 1.621113 3.121159
## 8 1.401476 1.886062 2.234017
## 9 1.161171 1.948435 1.000000
As we can see, all of the VIF values shown has values that are less than 10, which signifies that our model is free from the issues of multicollinearity.
Predictions With Logistic Regression
Predictions
Let’s first make the predictions on the test data we have previously prepared.
wine_test$pred <- predict(object = wine_bck,
newdata = wine_test,
type = "class")
head(wine_test)## # A tibble: 6 × 12
## vol_acid citri…¹ res_s…² chlor…³ free_…⁴ total…⁵ pH sulph…⁶ alcohol quality
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 0.7 0 1.9 0.076 11 34 3.51 0.56 9.4 5
## 2 0.88 0 2.6 0.098 25 67 3.2 0.68 9.8 5
## 3 0.58 0.02 2 0.073 9 18 3.36 0.57 9.5 7
## 4 0.5 0.36 6.1 0.071 17 102 3.35 0.8 10.5 5
## 5 0.62 0.19 3.9 0.17 51 148 3.17 0.93 9.2 5
## 6 0.56 0.28 1.7 0.368 16 56 3.11 1.28 9.3 5
## # … with 2 more variables: color <fct>, pred <fct>, and abbreviated variable
## # names ¹citric_acid, ²res_sugar, ³chlorides, ⁴free_so2, ⁵total_so2,
## # ⁶sulphates
Evaluation
Predictions have been made. The only thing we need to do now is to see how well our model actually performed.
confusionMatrix(data = wine_test$pred,
reference = wine_test$quality)## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 1 9 67 64 2 1 0
## 4 1 14 87 56 7 0 0
## 5 0 5 162 120 19 2 0
## 6 1 1 65 124 29 3 0
## 7 0 2 21 108 65 7 0
## 8 0 2 16 73 58 18 0
## 9 0 1 12 43 28 6 0
##
## Overall Statistics
##
## Accuracy : 0.2954
## 95% CI : (0.2707, 0.321)
## No Information Rate : 0.4523
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1321
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.3333333 0.41176 0.3767 0.21088 0.3125 0.48649
## Specificity 0.8897456 0.88073 0.8322 0.86096 0.8736 0.88203
## Pos Pred Value 0.0069444 0.08485 0.5260 0.55605 0.3202 0.10778
## Neg Pred Value 0.9982699 0.98238 0.7298 0.56917 0.8696 0.98323
## Prevalence 0.0023077 0.02615 0.3308 0.45231 0.1600 0.02846
## Detection Rate 0.0007692 0.01077 0.1246 0.09538 0.0500 0.01385
## Detection Prevalence 0.1107692 0.12692 0.2369 0.17154 0.1562 0.12846
## Balanced Accuracy 0.6115395 0.64625 0.6045 0.53592 0.5931 0.68426
## Class: 9
## Sensitivity NA
## Specificity 0.93077
## Pos Pred Value NA
## Neg Pred Value NA
## Prevalence 0.00000
## Detection Rate 0.00000
## Detection Prevalence 0.06923
## Balanced Accuracy NA
Judging from the accuracy statistic, looks like our model performed poorly here. It only has an accuracy of 30%, meaning that it gets its predictions wrong 70% of the time.
Looking at the sensitivity statistic, our model is generally not very good at making true positive predictions as it only ranges from 20.8% to 48.6%. An illustration of true positive in this context: Our model tends to fail at predicting a score of 5 to a wine that is rated 5 in quality.
The specifity statistic for our model is the opposite, it’s very good at making true negative predictions as it ranges from 85% to 94.7%. An illustration of true negative: Our model is very good at avoiding giving a score of 4 to a wine that is actually scored 9 in quality.
Modelling with kNN
The next thing we’re gonna do now is creating another model using a different method, called the kNN (k-Nearest Neighbour). This is a method where variables are classified based on their numerical characteristics. A simple illustration would be to imagine the difference between a chair, table, and bed. Each of those items have their own characteristics and dimensions. The kNN method will learn about the characteristics of those items, and will make predictions of new incoming items based on what it has learned. The difference between the previous model is that we won’t be able to form interpretations of the model (We won’t understand which variables affect the quality of wine, we just see the prediction made by the model).
Reforming the Data
Before we perform the kNN method however, we have to make sure that all of our independent variables are numerical, while our dependent variable is categorical. Let’s check our data types again.
glimpse(wine) # we're using the raw data## Rows: 6,497
## Columns: 13
## $ fix_acid <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7, 7.…
## $ vol_acid <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, 0.650, 0.…
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36…
## $ res_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8, 6.1…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.…
## $ free_so2 <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16, 9, 52, …
## $ total_so2 <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102, 59, 29, …
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0.9964, 0.…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5, 9.2, 1…
## $ quality <fct> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7, 5, 4, 6…
## $ color <fct> red, red, red, red, red, red, red, red, red, red, red, red…
One of our independent variable, color, is categorical. For the kNN model, we will have to remove the color variable from our data frame.
wine_wo_col <- wine %>%
select(-color)
glimpse(wine_wo_col)## Rows: 6,497
## Columns: 12
## $ fix_acid <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7, 7.…
## $ vol_acid <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, 0.650, 0.…
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36…
## $ res_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8, 6.1…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.065, 0.…
## $ free_so2 <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16, 9, 52, …
## $ total_so2 <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102, 59, 29, …
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0.9964, 0.…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5, 9.2, 1…
## $ quality <fct> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7, 5, 4, 6…
Splitting and Balancing
Just like the previous method, we need to divide the data into two: training data and test data.
index_2 <- sample(x = nrow(wine_wo_col), size = nrow(wine_wo_col)*0.8)
knn_train <- wine_wo_col[index_2,]
knn_test <- wine_wo_col[-index_2,]Check whether the data is balanced or not,
nrow(knn_train) # to check the data left after splitting## [1] 5197
prop.table(table(knn_train$quality)) # before balancing##
## 3 4 5 6 7 8
## 0.0048104676 0.0327111795 0.3269193766 0.4339041755 0.1714450645 0.0294400616
## 9
## 0.0007696748
rebalance it again,
knn_train <- upSample(x = knn_train %>% select(c(-quality)),
y = knn_train$quality,
yname = "quality")
prop.table(table(knn_train$quality))##
## 3 4 5 6 7 8 9
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
Preparing the Data
Now the difference between the previous method and the kNN method is that we need to separate the dependent variable from the independent variables.
# Predictors
knn_train_x <- knn_train %>% select_if(is.numeric)
knn_test_x <- knn_test %>% select_if(is.numeric)
# Dependent
knn_train_y <- knn_train[,"quality"]
knn_test_y <- knn_test[,"quality"]Prediction with kNN
Our data is ready to be used to form the model. There are only a few steps left.
Finding the Optimum k Value
We can find the optimum k value by simply finding the square root of our training data.
nrow(knn_train_x)## [1] 15785
sqrt(nrow(knn_train_x))## [1] 125.6384
However, we also have to consider the rules in picking the k value:
- k value must be even if the number of target variables is odd
- k value must be odd if the number of target variables is even
- k value must not be a multiple of the number of target variables
We have 7 target variables: 3, 4, 5, 6, 7, 8, 9. Therefore we have to pick an even number. Since 126 is a multiple of 7, the author will use the number 128 instead.
Forming the Model
We can form the kNN model by simply using the knn() function
knn_model <- knn(train = knn_train_x,
test = knn_test_x,
cl = knn_train_y,
k = 128)
head(knn_model)## [1] 5 5 4 5 8 4
## Levels: 3 4 5 6 7 8 9
Model Evaluation
Let’s now check how well the model performed compared to our previous logistic regression model
confusionMatrix(data = knn_model,
reference = knn_test$quality)## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 11 78 95 22 4 0
## 4 2 15 91 80 20 3 0
## 5 1 6 89 85 14 4 0
## 6 0 4 24 48 11 3 0
## 7 1 4 51 87 40 3 0
## 8 1 5 84 166 69 20 1
## 9 0 1 22 20 12 3 0
##
## Overall Statistics
##
## Accuracy : 0.1631
## 95% CI : (0.1434, 0.1843)
## No Information Rate : 0.4469
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0512
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.32609 0.20273 0.08262 0.21277 0.50000
## Specificity 0.837838 0.84370 0.87224 0.94159 0.86871 0.74127
## Pos Pred Value 0.000000 0.07109 0.44724 0.53333 0.21505 0.05780
## Neg Pred Value 0.995413 0.97153 0.68211 0.55950 0.86715 0.97904
## Prevalence 0.003846 0.03538 0.33769 0.44692 0.14462 0.03077
## Detection Rate 0.000000 0.01154 0.06846 0.03692 0.03077 0.01538
## Detection Prevalence 0.161538 0.16231 0.15308 0.06923 0.14308 0.26615
## Balanced Accuracy 0.418919 0.58489 0.53749 0.51210 0.54074 0.62063
## Class: 9
## Sensitivity 0.0000000
## Specificity 0.9553503
## Pos Pred Value 0.0000000
## Neg Pred Value 0.9991948
## Prevalence 0.0007692
## Detection Rate 0.0000000
## Detection Prevalence 0.0446154
## Balanced Accuracy 0.4776751
Taking a look at the model’s accuracy, it is much lower than the other model, where this one only can only correctly predict 18% of time.
It also has lower sensitivity and specificity levels compared to our previous model. Overall the author deems the model created from the kNN model worse than the model formed with logistic regression.
Discussion
One possible explanation of why our model performed poorly is because the previously dropped variables. It could be the case that all of the variables have to be concluded for the model to perform well. But at the same time, that is impossible for our logistic regression model since including all variables causes multicollinearity. Our kNN model also cannot include all variables as one of our predictor is categorical.
Another possible explanation may be that similar wines are given different ratings (one wine is rated 5 and another is rated lower or higher, despite of how close they are in the levels of their chemical properties). This leads back to the idea of how wine tasting is questionable in terms of how legitimate it is. Here are some example articles that question the legitimacy of wine tasting. In short, preferences of taste and inconsistencies in how wine is tasted and valued from by testers seem to heavily affect the rating they give to wine.
Conclusion and Suggestions
In this report the author tries to find out whether it is possible to create a model that can predict the quality of wine based on their chemical components. The author created two models, one with logistic regression and the other with kNN. After the results are out, the author founds that while the logistic regression model performed better than the kNN model, the author deemed the performance of both models to be poor. A possible cause may be that the rating given to the wine themselves may be biased by the reviewer, resulting in scores that are varied despite the possibility of one wine having very similar characteristics with another.
A note for readers who are interested in making a similar report: Since wine tasting (or tasting in general) relates to each individual’s preferences, it may be best to instead create an unsupervised machine learning model to categorize wine based on their characteristics instead of trying to create a supervised model that categorizes scores for food and beverages. If a reader still wishes to create a model that predicts scores, then the author suggests to simplify the scoring system: Instead of categorizing wine from a scale of 3 to 9, future reports should categorize ratings as “below average” or “above average”. This may result in a better performing model.