Red Wine Quality Analysis
1 Background
1.1 About Red Wine
Figure 1.1: Photo by Jep Gambardella from Pexels
Wine is an alcoholic drink typically made from fermented grapes. Yeast consumes the sugar in the grapes and converts it to ethanol and carbon dioxide, releasing heat in the process. Different varieties of grapes and strains of yeasts are major factors in different styles of wine. These differences result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the grape's growing environment, and the wine production process. Wines not made from grapes involve fermentation of other crops, including rice wine and other fruit wines such as plum, cherry, pomegranate, currant and elderberry.
Wine is a popular and important drink that accompanies and enhances a wide range of cuisines, from the simple and traditional stews to the most sophisticated and complex haute cuisines. Wine is often served with dinner. Sweet dessert wines may be served with the dessert course. In fine restaurants in Western countries, wine typically accompanies dinner. At a restaurant, patrons are helped to make good food-wine pairings by the restaurant’s sommelier or wine waiter. Individuals dining at home may use wine guides to help make food–wine pairings. Wine is also drunk without the accompaniment of a meal in wine bars or with a selection of cheeses (at a wine and cheese party). Wines are also used as a theme for organizing various events such as festivals around the world; the city of Kuopio in North Savonia, Finland is known for its annual Kuopio Wine Festivals (Kuopion viinijuhlat).
Wine is important in cuisine not just for its value as a drink, but as a flavor agent, primarily in stocks and braising, since its acidity lends balance to rich savory or sweet dishes. Wine sauce is an example of a culinary sauce that uses wine as a primary ingredient. Natural wines may exhibit a broad range of alcohol content, from below 9% to above 16% ABV, with most wines being in the 12.5–14.5% range. Fortified wines (usually with brandy) may contain 20% alcohol or more.
Red wine is a type of wine made from dark-colored grape varieties. The actual color of the wine can range from intense violet, typical of young wines, through to brick red for mature wines and brown for older red wines. The juice from most purple grapes is greenish-white, the red color coming from anthocyan pigments (also called anthocyanins) present in the skin of the grape; exceptions are the relatively uncommon teinturier varieties, which produce a red-colored juice. Much of the red-wine production process therefore involves extraction of color and flavor components from the grape skin. Red wine is a delicacy around the world.
Wine tasting is the sensory examination and evaluation of wine. While the practice of wine tasting is as ancient as its production, a more formalized methodology has slowly become established from the 14th century onwards. Modern, professional wine tasters (such as sommeliers or buyers for retailers) use a constantly evolving specialized terminology which is used to describe the range of perceived flavors, aromas and general characteristics of a wine. More informal, recreational tasting may use similar terminology, usually involving a much less analytical process for a more general, personal appreciation.
Results that have surfaced through scientific blind wine tasting suggest the unreliability of wine tasting in both experts and consumers, such as inconsistency in identifying wines based on region and price.
1.2 Business Questions
After encountering this dataset, which tries to predict red wine quality from a set of numerical predictors, I learned a little of its history and background as summarized above. It must be noted that I am not an avid drinker myself; I was simply drawn to the dataset. So what can I do with it?
Some questions arose after researching red wine a little, while going through the Supervised Machine Learning courses at Algoritma Data Science School, covering both Regression and Classification models. Can I apply those models to the red wine dataset?
So, I summed up some important questions that I’d like to find the answer to:
- Can I apply these numerical predictors to the Supervised Machine Learning models that I have just studied?
- Some models (especially classification models) are said to work best with mostly categorical predictors, while my dataset has none, only numerical ones. Will this significantly affect the performance of the prediction models?
- In class we learned that Random Forest is one of the most widely used methods due to its accuracy and robustness. Will this also hold true for a dataset that I happened to find on Kaggle.com?
- Some interpretable models might favor the same predictors over and over. Which predictors are the most significant across most models, such that they keep appearing in one model after another?
2 About the Dataset
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
Input variables (based on physicochemical tests):
- 1 - fixed acidity (g/L): most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
- 2 - volatile acidity (g/L): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar-like taste
- 3 - citric acid (g/L): found in small quantities, citric acid can add 'freshness' and flavor to wines
- 4 - residual sugar (g/L): the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 g/L, and wines with more than 45 g/L are considered sweet
- 5 - chlorides (g/L): the amount of salt in the wine
- 6 - free sulfur dioxide (mg/L): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- 7 - total sulfur dioxide (mg/L): the amount of free and bound forms of SO2; at low concentrations SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- 8 - density (g/cm3): the density of wine is close to that of water, depending on the percent alcohol and sugar content
- 9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
- 10 - sulphates (g/L): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- 11 - alcohol (% by volume): the percent alcohol content of the wine
Output variable:
- 12 - quality (0-10, based on sensory data)
3 Data Pre-processing
3.1 Enabling Libraries
First and foremost, let's load the libraries that we are going to work with. I am thinking of at least dplyr, tidyr, and glue for an easier time pre-processing the data, plus ggplot2 and plotly for helpful data visualization.
library(dplyr) #for easier time of tidying the data
library(tidyr) #same as above
library(glue) #to put pop-up label for interactive plotting
library(ggplot2) #for modern plotting
library(plotly) #for interactive plotting
library(GGally) #for making a simple heatmap of correlations between columns
library(rsample) #for train-test splitting
library(performance) #for comparing classification models
library(lmtest) #for linear regression's homoscedasticity test
library(car) #for linear regression's multicollinearity test
library(gtools) #for using logit and inv-logit to interpret logistic regression
library(caret) #for using confusion matrix, making random forest model
library(class) #for kNN model
library(e1071) #for Naive Bayes model
library(ROCR) #for evaluating Naive Bayes model
library(partykit) #for decision tree model
library(randomForest) #for reviewing the random forest model
3.2 Reading the Dataset
Let's read the data and store it in a wine object, then take a look at its first few rows:
wine <- read.csv("winequality-red.csv")
head(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
Checking the data structure:
glimpse(wine)
## Rows: 1,599
## Columns: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5~
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, ~
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0~
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,~
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, ~
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16~
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,~
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0~
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3~
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0~
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.~
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7~
As explained in the About the Dataset section, the separation between target and predictor variables is clear. The target variable is quality, while the rest will be used as predictors.
The data looks fine overall: no predictor seems to have the wrong data type. However, since we are also going to try some classification models, we will need to make the quality column categorical. From the glimpse() output above we can see that quality is an integer score (nominally 0-10), so we can either factorize the column as is, or group the scores into a few categories.
3.3 Verifying No Near-Zero Variance Predictors
Let's check whether any column has near-zero variance. This matters especially for some classification models, such as Random Forest, since near-zero variance columns carry almost no information and can distort the analysis.
nearZeroVar(wine)
## integer(0)
If any such columns existed, the function above would return their column indexes. Since it returns integer(0), it found no near-zero variance predictors, which means our dataset varies enough to give us some insight.
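For the curious, caret's nearZeroVar() can also return the diagnostics it bases this decision on via its saveMetrics argument. A small optional sketch (nzv_metrics is just a helper name introduced here, not used later):
nzv_metrics <- nearZeroVar(wine, saveMetrics = TRUE)  # return per-column metrics instead of indexes
head(nzv_metrics)                # freqRatio, percentUnique, zeroVar, nzv for each column
nzv_metrics[nzv_metrics$nzv, ]   # empty here, since no predictor is flagged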
3.4 Checking Missing Values
colSums(is.na(wine))
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
It's safe! None of the columns has any missing values.
3.5 Adjusting Target Variable for Classification Models
Let’s check the current proportion of the target variable.
prop.table(table(wine$quality))
##
## 3 4 5 6 7 8
## 0.006253909 0.033145716 0.425891182 0.398999375 0.124452783 0.011257036
plot_1 <- wine %>%
  group_by(quality) %>%
  summarise(count = n()) %>%
  mutate(quality = as.factor(quality),
         label = glue("Quality = {quality}
                      Count = {count}")) %>%
  ggplot(mapping = aes(x = quality, y = count, text = label)) +
  geom_col(aes(fill = count)) +
  theme_dark() +
  labs(title = "`Quality` as Categoric Target Variable",
       x = "Quality",
       y = "Count",
       fill = "Count")

ggplotly(plot_1, tooltip = "label")
If we are going to use classification machine learning models, keeping the raw numbers as "classes" does not seem appropriate, since the numbers represent an ordered rating. Simple regression models that predict a numeric target are fine with this, but for the classification models we need to categorize the target variable.
One thing we should avoid is class imbalance. I originally wanted to label wines rated at least 7 as high quality (or 1) and the rest as low (or 0), since 7 out of 10 seems a fair cutoff for "high", but looking at the proportions above, that would not be ideal. Since the data is concentrated on the values 5 and 6, I think it is better to split these two into different classes. So for now I will label 6 and above as high and the rest as low, in a new column called quality_high that categorizes the quality column.
wine$quality_high <- as.factor(ifelse(wine$quality>=6, 1, 0))
glimpse(wine$quality_high)
## Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 2 2 1 ...
prop.table(table(wine$quality_high))
##
## 0 1
## 0.4652908 0.5347092
A roughly 53:47 split! That seems balanced enough, so we should not need any down- or up-sampling.
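For reference only: had the classes been heavily imbalanced, caret's downSample() (or upSample()) would be a simple way to rebalance them. A sketch, shown purely for illustration; wine_balanced is a throwaway name that is not used anywhere else in this analysis:
wine_balanced <- downSample(x = wine %>% select(-quality_high),  # predictors
                            y = wine$quality_high,               # class labels
                            yname = "quality_high")
table(wine_balanced$quality_high)  # both classes now have the same count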
3.6 Cross-validation / Train-Test Splitting
Next, we will split the data into training and test sets with the usual 80:20 proportion, stratifying on quality_high to keep the class proportions balanced. To make the random sampling reproducible, I will use a seed of 314, since I like the number pi.
RNGkind(sample.kind = "Rounding") # extra step needed for R 3.6 and above
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314) # lock the chosen random numbers

# index sampling
index_wine <- initial_split(wine, prop = 0.8, strata = "quality_high")

# splitting
wine_train <- training(index_wine)
wine_test <- testing(index_wine)

# checking proportions on the separated data frames
prop.table(table(wine_train$quality_high))
##
## 0 1
## 0.4652072 0.5347928
prop.table(table(wine_test$quality_high))
##
## 0 1
## 0.465625 0.534375
4 EDA and Data Visualization
4.1 Overall Summary Statistics
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality quality_high
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000 0:744
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000 1:855
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Insights:
- Some predictors like fixed.acidity, total.sulfur.dioxide, free.sulfur.dioxide, and sulphates seem to have outliers, based solely on their maximum values being much larger than the mean, median, or 3rd quartile.
- density and pH seem to have fairly good distributions, based on how close the maximum is to each of their medians and/or 3rd quartiles.
- We might need to take these outliers out. But for now, we will analyze the data as is and see how it affects our models.
4.2 Boxplot of Variables
Since the predictors have different scales, I will separate them into groups of similar scale. density and pH have fairly ordinary ranges, so to avoid crowding the plots I will exclude them here.
plot_2 <- ggplot(data = stack(wine %>% select(volatile.acidity, citric.acid, chlorides, sulphates)),
                 mapping = aes(x = ind, y = values)) +
  geom_boxplot(fill = "pink") +
  theme_dark() +
  labs(title = "Boxplot of Volatile Acidity, Citric Acid, Chlorides, Sulphates",
       x = "Predictors",
       y = "Value")

ggplotly(plot_2)
plot_3 <- ggplot(data = stack(wine %>% select(fixed.acidity, residual.sugar, alcohol)),
                 mapping = aes(x = ind, y = values)) +
  geom_boxplot(fill = "green") +
  theme_dark() +
  labs(title = "Boxplot of Fixed Acidity, Residual Sugar, Alcohol",
       x = "Predictors",
       y = "Value")

ggplotly(plot_3)
plot_4 <- ggplot(data = stack(wine %>% select(free.sulfur.dioxide, total.sulfur.dioxide)),
                 mapping = aes(x = ind, y = values)) +
  geom_boxplot(fill = "cyan") +
  theme_dark() +
  labs(title = "Boxplot of Free and Total Sulfur Dioxide",
       x = "Predictors",
       y = "Value")

ggplotly(plot_4)
Insights:
- Most of the predictors seem to have lots of outliers, so it would not be wise to remove them all at the cost of losing information.
- Two predictors, alcohol and citric.acid, seem to have fairly normal, well-behaved distributions, based on their small number of outliers and their medians sitting close to the center of the boxplot.
4.3 Checking Correlations between Predictors
ggcorr(wine_train, label = T, hjust = 0.9, label_size = 3, layout.exp = 3)
## Warning in ggcorr(wine_train, label = T, hjust = 0.9, label_size = 3, layout.exp
## = 3): data in column(s) 'quality_high' are not numeric and were ignored
Some early insights from the plot above:
- Our target variable, quality, seems to have a reasonably strong correlation with alcohol (positive) and volatile.acidity (negative).
- Conversely, quality seems to have no correlation with free.sulfur.dioxide and residual.sugar.
- There are some strong correlations that might need to be checked, based on the similarity of the names. For example, strong correlations between free.sulfur.dioxide and total.sulfur.dioxide, or between fixed.acidity and volatile.acidity, might indicate multicollinearity, which would be bad for some of our models.
4.5 Characteristics of Top-Rated vs Lowest-Rated Red Wines
Let's remind ourselves of our target variable's current composition.
table(wine$quality_high)
##
## 0 1
## 744 855
Let's see what characteristics our top-rated red wines tend to have compared to the rest. They could have some specific property that is much higher than in the others.
Since we are looking for a metric to represent each whole group for every characteristic (predictor), I think the mean is a better choice than, say, the median, since it also weighs in the outliers, which could accentuate the characteristics that make a group distinctive.
So we will group the data into two groups: red wines rated 6-8, and the rest (rated 3-5). After that, we will summarize each predictor using the mean() function.
wine_char <- wine %>%
  mutate(quality_high = ifelse(quality>=6, "high", "low")) %>%
  group_by(quality_high) %>%
  summarise_all(mean) %>%
  select(-quality)

wine_char
## # A tibble: 2 x 12
## quality_high fixed.acidity volatile.acidity citric.acid residual.sugar
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 high 8.47 0.474 0.300 2.54
## 2 low 8.14 0.590 0.238 2.54
## # ... with 7 more variables: chlorides <dbl>, free.sulfur.dioxide <dbl>,
## # total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>
To better see the difference, let's visualize it.
plot_9 <- wine_char %>%
  pivot_longer(cols = -quality_high, names_to = "names", values_to = "values") %>%
  mutate(label = glue("Red Wine Quality? {quality_high}
                      Average of {names} = {round(values,2)}")) %>%
  ggplot(mapping = aes(x = names, y = values)) +
  geom_line(aes(group = quality_high, color = quality_high)) +
  geom_jitter(mapping = aes(x = names, y = values, color = quality_high, text = label)) +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "Average Characteristics of Top-Rated vs Lower-Rated Red Wines",
       x = "Predictor",
       y = "Value",
       color = "Red Wine Quality?")

ggplotly(plot_9, tooltip = "label")
Insights:
- Top-rated red wines seem to have a noticeably lower amount of total sulfur dioxide, based on the comparison of predictor averages against the lower-rated group.
- Other predictors differ only slightly between the groups, so these are not necessarily decisive, but they could still help in choosing the right predictors to classify a higher-quality red wine:
- Alcohol, citric acid, sulphates, and fixed acidity tend to be slightly higher in higher-quality red wines.
- Meanwhile, chlorides, free sulfur dioxide, and volatile acidity tend to be slightly lower than in lower-rated red wines.
- Density, pH, and residual sugar are about the same across all red wines.
5 Regression Model: Simple Linear with Multiple Predictors
A regression model is a machine learning model that predicts a numerical target variable. It is part of supervised machine learning, since we have labeled targets to predict instead of just looking for patterns.
We are going to use the simple linear regression model taught in class. This model can be used properly if the following assumptions are fulfilled:
- Linearity
- Normality
- Homoscedasticity
- No Multicollinearity
5.1 Model with No Predictor
First, let's make a model with no predictors, only the target, quality.
model_nopred_lm <- lm(formula = quality~1, data = wine_train %>% select(-quality_high))
summary(model_nopred_lm)
##
## Call:
## lm(formula = quality ~ 1, data = wine_train %>% select(-quality_high))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6364 -0.6364 0.3636 0.3636 2.3636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.63643 0.02223 253.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7951 on 1278 degrees of freedom
5.2 Model using All Predictors (as baseline)
Then the opposite: we'll make a model with all of the predictors.
model_allpred_lm <- lm(formula = quality~., data = wine_train %>% select(-quality_high))
summary(model_allpred_lm)
##
## Call:
## lm(formula = quality ~ ., data = wine_train %>% select(-quality_high))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60102 -0.37142 -0.06258 0.45407 1.97898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.739e+01 2.330e+01 0.746 0.455659
## fixed.acidity -5.110e-03 2.883e-02 -0.177 0.859325
## volatile.acidity -1.013e+00 1.385e-01 -7.316 4.51e-13 ***
## citric.acid -1.343e-01 1.680e-01 -0.799 0.424307
## residual.sugar 1.565e-02 1.600e-02 0.978 0.328300
## chlorides -2.136e+00 4.522e-01 -4.723 2.59e-06 ***
## free.sulfur.dioxide 4.217e-03 2.418e-03 1.744 0.081371 .
## total.sulfur.dioxide -2.961e-03 8.079e-04 -3.665 0.000257 ***
## density -1.239e+01 2.378e+01 -0.521 0.602442
## pH -6.175e-01 2.154e-01 -2.868 0.004205 **
## sulphates 8.816e-01 1.304e-01 6.760 2.10e-11 ***
## alcohol 2.778e-01 2.963e-02 9.377 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6476 on 1267 degrees of freedom
## Multiple R-squared: 0.3422, Adjusted R-squared: 0.3365
## F-statistic: 59.93 on 11 and 1267 DF, p-value: < 2.2e-16
Some interpretations that we can make from this baseline model:
- The columns volatile.acidity, chlorides, total.sulfur.dioxide, sulphates, and alcohol are highly significant, judging by their p-values.
- The columns free.sulfur.dioxide and pH are not as strong, but still significant.
- Since we will most likely use multiple predictors, we will look at the Adjusted R-squared value to evaluate our model; here it is 0.3365 out of 1.0, or about 33.65%. Quite a low score, so let's try to improve it by eliminating unnecessary predictors.
- For now, we do not see any variable causing a perfect separation, i.e. a variable that could single-handedly "predict" the target variable (which would make using a machine learning model pointless); none of the p-values is anywhere near that extreme.
5.3 Feature Selection
Since I do not have much experience with red wines, I cannot put forward any variable that must be included. Therefore, I will use both a manual approach and a stepwise approach for feature selection.
5.3.1 Manual Approach: Significance of Variables
For the manual approach, I will first include the predictors with the highest significance according to the all-predictors model, then try adding the rest one by one and see how the adjusted R-squared reacts.
5.3.1.1 Including Highly Significant Predictors
model_manual_lm <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides, data = wine_train)
summary(model_manual_lm)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.65844 -0.38155 -0.07256 0.45499 1.90341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0060194 0.2285537 13.152 < 2e-16 ***
## alcohol 0.2753780 0.0187735 14.668 < 2e-16 ***
## volatile.acidity -1.0741011 0.1088476 -9.868 < 2e-16 ***
## sulphates 0.8736111 0.1253790 6.968 5.15e-12 ***
## total.sulfur.dioxide -0.0018217 0.0005633 -3.234 0.00125 **
## chlorides -1.8281563 0.4201948 -4.351 1.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6511 on 1273 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3294
## F-statistic: 126.6 on 5 and 1273 DF, p-value: < 2.2e-16
Then we'll check combinations that add the less strongly significant, but still notable, predictors.
model_manual_lm_2 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide, data = wine_train)
summary(model_manual_lm_2)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7167 -0.3809 -0.0793 0.4507 1.9261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0010993 0.2284884 13.135 < 2e-16 ***
## alcohol 0.2738784 0.0187956 14.571 < 2e-16 ***
## volatile.acidity -1.0652659 0.1089815 -9.775 < 2e-16 ***
## sulphates 0.8657254 0.1254516 6.901 8.13e-12 ***
## total.sulfur.dioxide -0.0025337 0.0007536 -3.362 0.000797 ***
## chlorides -1.8138314 0.4201475 -4.317 1.70e-05 ***
## free.sulfur.dioxide 0.0033248 0.0023393 1.421 0.155476
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6508 on 1272 degrees of freedom
## Multiple R-squared: 0.3331, Adjusted R-squared: 0.33
## F-statistic: 105.9 on 6 and 1272 DF, p-value: < 2.2e-16
model_manual_lm_3 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+pH, data = wine_train)
summary(model_manual_lm_3)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.58930 -0.36205 -0.06625 0.46591 1.92018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3345713 0.4530526 9.567 < 2e-16 ***
## alcohol 0.2886654 0.0191026 15.111 < 2e-16 ***
## volatile.acidity -0.9726581 0.1124522 -8.650 < 2e-16 ***
## sulphates 0.8466241 0.1251183 6.767 2.01e-11 ***
## total.sulfur.dioxide -0.0018725 0.0005611 -3.337 0.000872 ***
## chlorides -2.1225431 0.4273794 -4.966 7.75e-07 ***
## pH -0.4454254 0.1313337 -3.392 0.000716 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6484 on 1272 degrees of freedom
## Multiple R-squared: 0.338, Adjusted R-squared: 0.3349
## F-statistic: 108.3 on 6 and 1272 DF, p-value: < 2.2e-16
model_manual_lm_4 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH, data = wine_train)
summary(model_manual_lm_4)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.66643 -0.36177 -0.06254 0.46695 1.98981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4648450 0.4569986 9.770 < 2e-16 ***
## alcohol 0.2878711 0.0190830 15.085 < 2e-16 ***
## volatile.acidity -0.9493919 0.1128938 -8.410 < 2e-16 ***
## sulphates 0.8324306 0.1251584 6.651 4.31e-11 ***
## total.sulfur.dioxide -0.0029073 0.0007567 -3.842 0.000128 ***
## chlorides -2.1322722 0.4268792 -4.995 6.70e-07 ***
## free.sulfur.dioxide 0.0048078 0.0023622 2.035 0.042026 *
## pH -0.4914877 0.1331098 -3.692 0.000232 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6476 on 1271 degrees of freedom
## Multiple R-squared: 0.3402, Adjusted R-squared: 0.3365
## F-statistic: 93.61 on 7 and 1271 DF, p-value: < 2.2e-16
It's worth noting that adding free.sulfur.dioxide alone leaves it not particularly significant, while adding pH alone keeps it very significant, and adding both makes the model look slightly better in terms of adjusted R-squared, though strangely it is still about the same as our baseline model with all predictors.
Then, I'll try adding the other seemingly insignificant predictors one by one to see how the adjusted R-squared changes.
model_manual_lm_5 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+residual.sugar, data = wine_train)
summary(model_manual_lm_5)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + residual.sugar, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6576 -0.3567 -0.0628 0.4692 1.9999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4380410 0.4594336 9.660 < 2e-16 ***
## alcohol 0.2866100 0.0192107 14.919 < 2e-16 ***
## volatile.acidity -0.9499212 0.1129268 -8.412 < 2e-16 ***
## sulphates 0.8374677 0.1254901 6.674 3.72e-11 ***
## total.sulfur.dioxide -0.0029587 0.0007621 -3.883 0.000109 ***
## chlorides -2.1499277 0.4280675 -5.022 5.83e-07 ***
## free.sulfur.dioxide 0.0046857 0.0023721 1.975 0.048442 *
## pH -0.4842568 0.1337233 -3.621 0.000305 ***
## residual.sugar 0.0073944 0.0127104 0.582 0.560830
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6478 on 1270 degrees of freedom
## Multiple R-squared: 0.3404, Adjusted R-squared: 0.3362
## F-statistic: 81.91 on 8 and 1270 DF, p-value: < 2.2e-16
model_manual_lm_6 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+citric.acid, data = wine_train)
summary(model_manual_lm_6)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + citric.acid, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.63594 -0.36888 -0.06171 0.45082 1.98029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8330762 0.5182401 9.326 < 2e-16 ***
## alcohol 0.2930494 0.0193818 15.120 < 2e-16 ***
## volatile.acidity -1.0506079 0.1313791 -7.997 2.85e-15 ***
## sulphates 0.8509021 0.1256977 6.769 1.97e-11 ***
## total.sulfur.dioxide -0.0027113 0.0007675 -3.533 0.000426 ***
## chlorides -2.0252085 0.4325638 -4.682 3.15e-06 ***
## free.sulfur.dioxide 0.0043050 0.0023845 1.805 0.071254 .
## pH -0.5926948 0.1490904 -3.975 7.42e-05 ***
## citric.acid -0.2073729 0.1378676 -1.504 0.132792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 1270 degrees of freedom
## Multiple R-squared: 0.3413, Adjusted R-squared: 0.3372
## F-statistic: 82.27 on 8 and 1270 DF, p-value: < 2.2e-16
model_manual_lm_7 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+fixed.acidity, data = wine_train)
summary(model_manual_lm_7)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + fixed.acidity, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60344 -0.37288 -0.05657 0.46191 1.97685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2937226 0.6966774 7.599 5.79e-14 ***
## alcohol 0.2878564 0.0190719 15.093 < 2e-16 ***
## volatile.acidity -0.9607030 0.1130561 -8.498 < 2e-16 ***
## sulphates 0.8507531 0.1256249 6.772 1.93e-11 ***
## total.sulfur.dioxide -0.0030923 0.0007653 -4.040 5.66e-05 ***
## chlorides -2.2274465 0.4308856 -5.169 2.72e-07 ***
## free.sulfur.dioxide 0.0048229 0.0023608 2.043 0.04127 *
## pH -0.6792675 0.1786103 -3.803 0.00015 ***
## fixed.acidity -0.0236195 0.0149909 -1.576 0.11537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 1270 degrees of freedom
## Multiple R-squared: 0.3415, Adjusted R-squared: 0.3373
## F-statistic: 82.31 on 8 and 1270 DF, p-value: < 2.2e-16
model_manual_lm_8 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+density, data = wine_train)
summary(model_manual_lm_8)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + density, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.65102 -0.36437 -0.06408 0.46023 1.96303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.897e+01 1.197e+01 1.585 0.113119
## alcohol 2.755e-01 2.163e-02 12.738 < 2e-16 ***
## volatile.acidity -9.435e-01 1.130e-01 -8.351 < 2e-16 ***
## sulphates 8.585e-01 1.270e-01 6.762 2.08e-11 ***
## total.sulfur.dioxide -2.944e-03 7.572e-04 -3.888 0.000106 ***
## chlorides -2.149e+00 4.270e-01 -5.033 5.52e-07 ***
## free.sulfur.dioxide 4.796e-03 2.362e-03 2.031 0.042503 *
## pH -5.321e-01 1.372e-01 -3.877 0.000111 ***
## density -1.431e+01 1.179e+01 -1.213 0.225272
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6475 on 1270 degrees of freedom
## Multiple R-squared: 0.3409, Adjusted R-squared: 0.3368
## F-statistic: 82.12 on 8 and 1270 DF, p-value: < 2.2e-16
It seems like adding citric.acid or fixed.acidity to the predictors increases the adjusted R-squared a little compared to our baseline, while the other two decrease it. As a final step of the manual approach, I will include both of these predictors.
model_manual_lm_9 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+citric.acid+fixed.acidity, data = wine_train)
summary(model_manual_lm_9)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + citric.acid + fixed.acidity, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60561 -0.37695 -0.05421 0.45300 1.97536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.245442 0.699839 7.495 1.24e-13 ***
## alcohol 0.290960 0.019530 14.899 < 2e-16 ***
## volatile.acidity -1.017605 0.136675 -7.445 1.78e-13 ***
## sulphates 0.855868 0.125837 6.801 1.59e-11 ***
## total.sulfur.dioxide -0.002915 0.000802 -3.635 0.000289 ***
## chlorides -2.132534 0.449585 -4.743 2.34e-06 ***
## free.sulfur.dioxide 0.004517 0.002397 1.885 0.059725 .
## pH -0.678972 0.178642 -3.801 0.000151 ***
## citric.acid -0.124092 0.167425 -0.741 0.458720
## fixed.acidity -0.015965 0.018206 -0.877 0.380722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6474 on 1269 degrees of freedom
## Multiple R-squared: 0.3417, Adjusted R-squared: 0.3371
## F-statistic: 73.2 on 9 and 1269 DF, p-value: < 2.2e-16
It appears that the adjusted R-squared decreases a little bit.
To make sure, we will compare our baseline model model_allpred_lm, which used all predictors; model_manual_lm_4, which used only the most significant predictors; and model_manual_lm_7, which added another predictor that at first seemed insignificant but improved the adjusted R-squared.
compare_performance(model_allpred_lm, model_manual_lm_4, model_manual_lm_7)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | AIC weights | BIC | BIC weights | R2 | R2 (adj.) | RMSE | Sigma
## ---------------------------------------------------------------------------------------------------------------
## model_allpred_lm | lm | 2532.337 | 0.057 | 2599.336 | < 0.001 | 0.342 | 0.337 | 0.645 | 0.648
## model_manual_lm_4 | lm | 2528.359 | 0.413 | 2574.744 | 0.911 | 0.340 | 0.337 | 0.646 | 0.648
## model_manual_lm_7 | lm | 2527.861 | 0.530 | 2579.400 | 0.089 | 0.341 | 0.337 | 0.645 | 0.647
The adjusted R2 values look almost the same, but from the summaries we can see that model_manual_lm_7 is the best: its RMSE is slightly lower, and so is its AIC. So we will take model_manual_lm_7 as the result of our manual feature selection, with an adjusted R-squared of 0.3373.
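As a quick cross-check of the figures quoted above, the adjusted R-squared of each candidate model can also be pulled straight from its summary. A small convenience sketch (the list names are arbitrary):
# Extract the adjusted R-squared of the candidate models in one go.
sapply(list(allpred  = model_allpred_lm,
            manual_4 = model_manual_lm_4,
            manual_7 = model_manual_lm_7),
       function(m) summary(m)$adj.r.squared)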
5.3.2 Stepwise Approach
Now this is simpler than the manual approach. The algorithm automatically searches by adding or deleting predictors based on the AIC (Akaike Information Criterion) value; the lower, the better.
model_stepwise_both_lm <- step(object = model_allpred_lm, scope = list(lower = model_nopred_lm, upper = model_allpred_lm), direction = "both", trace = F)
summary(model_stepwise_both_lm)
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol + fixed.acidity,
## data = wine_train %>% select(-quality_high))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60344 -0.37288 -0.05657 0.46191 1.97685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2937226 0.6966774 7.599 5.79e-14 ***
## volatile.acidity -0.9607030 0.1130561 -8.498 < 2e-16 ***
## chlorides -2.2274465 0.4308856 -5.169 2.72e-07 ***
## free.sulfur.dioxide 0.0048229 0.0023608 2.043 0.04127 *
## total.sulfur.dioxide -0.0030923 0.0007653 -4.040 5.66e-05 ***
## pH -0.6792675 0.1786103 -3.803 0.00015 ***
## sulphates 0.8507531 0.1256249 6.772 1.93e-11 ***
## alcohol 0.2878564 0.0190719 15.093 < 2e-16 ***
## fixed.acidity -0.0236195 0.0149909 -1.576 0.11537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 1270 degrees of freedom
## Multiple R-squared: 0.3415, Adjusted R-squared: 0.3373
## F-statistic: 82.31 on 8 and 1270 DF, p-value: < 2.2e-16
If we compare it to our previously chosen model_manual_lm_7, the selected predictors are exactly the same. Therefore, we can pretty much conclude that our best model using this simple linear regression method explains only around 33.73% of the variation in the target variable; the rest is explained by variables other than the ones we used, perhaps variables not provided in the dataset, or it may simply be that this model is not well suited to this dataset.
5.4 Verifying Linear Regression Assumptions
5.4.1 Linearity
This should be checked before creating the linear regression models, and we did so in the correlation plot: some variables have reasonably strong correlations with our target variable.
This assumption also governs how the model summary is interpreted.
summary(model_stepwise_both_lm)
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol + fixed.acidity,
## data = wine_train %>% select(-quality_high))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60344 -0.37288 -0.05657 0.46191 1.97685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2937226 0.6966774 7.599 5.79e-14 ***
## volatile.acidity -0.9607030 0.1130561 -8.498 < 2e-16 ***
## chlorides -2.2274465 0.4308856 -5.169 2.72e-07 ***
## free.sulfur.dioxide 0.0048229 0.0023608 2.043 0.04127 *
## total.sulfur.dioxide -0.0030923 0.0007653 -4.040 5.66e-05 ***
## pH -0.6792675 0.1786103 -3.803 0.00015 ***
## sulphates 0.8507531 0.1256249 6.772 1.93e-11 ***
## alcohol 0.2878564 0.0190719 15.093 < 2e-16 ***
## fixed.acidity -0.0236195 0.0149909 -1.576 0.11537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 1270 degrees of freedom
## Multiple R-squared: 0.3415, Adjusted R-squared: 0.3373
## F-statistic: 82.31 on 8 and 1270 DF, p-value: < 2.2e-16
The linearity assumption also describes how each of the coefficients (the Estimates in the model summary above) can be interpreted.
For example, take the predictor volatile.acidity, whose Estimate is around -0.96. This can be interpreted as: holding the other predictors fixed, an increase of 1 unit in volatile.acidity decreases our target variable, quality, by about 0.96 (a small sketch verifying this is shown below).
The same interpretation holds for each estimate in the linear regression model, one per predictor.
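To make that interpretation concrete, here is a minimal sketch: take one training row, duplicate it, raise volatile.acidity by exactly 1 unit, and compare the two predictions (example_wine is a hypothetical helper object introduced only for this illustration):
# Duplicate the first training row and raise volatile.acidity by 1 unit.
example_wine <- wine_train[c(1, 1), ]
example_wine$volatile.acidity[2] <- example_wine$volatile.acidity[1] + 1

# The difference between the two predictions should equal the
# volatile.acidity coefficient, roughly -0.96.
diff(predict(model_stepwise_both_lm, newdata = example_wine))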
5.4.2 Normality
Checking with Shapiro Test:
shapiro.test(model_stepwise_both_lm$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_stepwise_both_lm$residuals
## W = 0.9903, p-value = 1.784e-07
The p-value is very small, lower than the default alpha of 0.05! This says that the residuals of our model are not normally distributed. Let's see if the plot can convince us otherwise.
plot(density(model_stepwise_both_lm$residuals))
Sure, it's not a textbook normal bell curve, but I would argue it is close enough that we can let this assumption pass.
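For anyone who prefers it, another common visual check is a normal Q-Q plot of the residuals; a quick optional sketch:
# Residuals of a roughly normal model should hug the reference line.
qqnorm(model_stepwise_both_lm$residuals)
qqline(model_stepwise_both_lm$residuals, col = "red")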
5.4.3 Homoscedasticity
To be objective, let's use the Breusch-Pagan test.
bptest(model_stepwise_both_lm)
##
## studentized Breusch-Pagan test
##
## data: model_stepwise_both_lm
## BP = 45.346, df = 8, p-value = 3.164e-07
Once again, the p-value is very small (< the alpha of 0.05), indicating that the residuals are heteroscedastic, i.e. their spread shows some kind of pattern, whereas we want them to be homoscedastic, spread randomly. Let's look at the plot to see whether we can still let this assumption pass and safely use the model.
plot(model_stepwise_both_lm$fitted.values, model_stepwise_both_lm$residuals)
abline(h = 0, col = "red")
Unfortunately, the residuals do seem to show a pattern here, visible as diagonal streaks of points in the plot. Visually, I agree with the BP test result: this dataset may not be a great fit for linear regression. We can still use the model, though, since it is rare to find a dataset that fulfills every assumption perfectly.
5.4.4 Multicollinearity
Though one of the assumptions above fails its test, we should still check whether the model passes this one. VIF (Variance Inflation Factor) measures how much the coefficient variances are inflated by multicollinearity, i.e. by a few predictors being strongly correlated with and affecting each other.
vif(model_stepwise_both_lm)
## volatile.acidity chlorides free.sulfur.dioxide
## 1.231478 1.415006 1.869230
## total.sulfur.dioxide pH sulphates
## 1.961983 2.259270 1.381130
## alcohol fixed.acidity
## 1.221879 2.048816
The rule of thumb is that each VIF value should be under 10. Since this holds for all of our predictors, we can safely say there is no multicollinearity in our model, so it passes this assumption.
5.5 Prediction using Test Data
Since our model passes most of the assumptions, instead of predicting the test data that we set aside at the beginning, we can simply refit the same predictors from the best model we found while comparing models and verifying assumptions, this time on the test data.
summary(lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
             total.sulfur.dioxide + pH + sulphates + alcohol + fixed.acidity,
           data = wine_test))
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol + fixed.acidity,
## data = wine_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25444 -0.39954 0.01312 0.39674 1.81988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.773247 1.259682 1.408 0.160220
## volatile.acidity -1.215154 0.223025 -5.449 1.03e-07 ***
## chlorides -0.256637 1.188869 -0.216 0.829233
## free.sulfur.dioxide 0.009265 0.004887 1.896 0.058934 .
## total.sulfur.dioxide -0.006150 0.001636 -3.759 0.000204 ***
## pH 0.112568 0.325102 0.346 0.729386
## sulphates 1.047150 0.231117 4.531 8.39e-06 ***
## alcohol 0.284235 0.035207 8.073 1.52e-14 ***
## fixed.acidity 0.075909 0.028401 2.673 0.007921 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6378 on 311 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4459
## F-statistic: 33.09 on 8 and 311 DF, p-value: < 2.2e-16
It is actually somewhat better than the result on the training data, though not by far. We can say that 44.59% of the variation in our target quality variable can be explained by the same set of predictors on the test data. As an additional check, the sketch below evaluates the trained model by predicting the held-out set directly.
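For reference, here is that alternative check: keep the model fitted on the training data, predict the held-out wine_test set, and measure the error (pred_test and rmse_test are helper names introduced only here):
# Predict the held-out test set with the model trained on wine_train.
pred_test <- predict(model_stepwise_both_lm, newdata = wine_test)

# Root mean squared error on the test set, in the same units as quality.
rmse_test <- sqrt(mean((wine_test$quality - pred_test)^2))
rmse_test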
5.6 Conclusions
- Our best simple linear regression model is found by both the manual and the stepwise approach, and involves the following predictors:
- volatile.acidity
- chlorides
- free.sulfur.dioxide
- total.sulfur.dioxide
- pH
- sulphates
- alcohol
- fixed.acidity
- The adjusted R-squared on the training data is about 33.73%, while on the test data it improves to 44.59%.
- Our model is not so perfect that it passes all of the assumptions, but it fails only one, so let's see whether another model turns out to be "objectively" better than this one.
6 Classification Model: Logistic Regression
Logistic regression is a machine learning model on the classification side of prediction: its main purpose is to predict a categorical (factor/class) target variable. It is somewhat similar to our simple linear regression model in that it also produces a regression formula, but that formula cannot be interpreted directly, because the model works on the log of odds. The prediction for the target variable is a log of odds that can be translated into a probability, to which we then apply a threshold to classify the result into one of the two possible classes, 1 or 0. A minimal sketch of this pipeline is shown below.
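The sketch below illustrates that log-of-odds pipeline; the numbers are made up for illustration, and plogis() is base R's inverse logit, equivalent to inv.logit() from gtools:
# Hypothetical log-of-odds values produced by a logistic regression.
log_odds <- c(-2.1, 0.3, 1.7)

# Convert log-of-odds to probabilities, then apply a 0.5 threshold.
probs <- plogis(log_odds)
ifelse(probs > 0.5, 1, 0)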
This model’s assumptions that would need to be fulfilled:
- Linearity of Predictor & Log of Odds
- Multicollinearity
- Independence of Observations
6.1 Feature Selection
As with the linear regression model, we would need to use the most relevant predictors, so we need to do some feature selection.
6.1.1 Model with No Predictor
model_nopred_glm <- glm(formula = quality_high~1, data = wine_train, family = "binomial")
summary(model_nopred_glm)
## 
## Call:
## glm(formula = quality_high ~ 1, family = "binomial", data = wine_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.237  -1.237   1.119   1.119   1.119  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.13940    0.05606   2.487   0.0129 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.9  on 1278  degrees of freedom
## Residual deviance: 1766.9  on 1278  degrees of freedom
## AIC: 1768.9
## 
## Number of Fisher Scoring iterations: 3
6.1.2 Model with All Predictors (as baseline)
model_allpred_glm <- glm(formula = quality_high~., data = wine_train %>% select(-quality), family = "binomial")
summary(model_allpred_glm)
## 
## Call:
## glm(formula = quality_high ~ ., family = "binomial", data = wine_train %>% 
##     select(-quality))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3271  -0.8611   0.3231   0.8499   2.2927  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            6.034168  85.658299   0.070  0.94384    
## fixed.acidity          0.044987   0.107873   0.417  0.67665    
## volatile.acidity      -3.104694   0.541488  -5.734 9.83e-09 ***
## citric.acid           -1.156493   0.633898  -1.824  0.06809 .  
## residual.sugar         0.042083   0.056802   0.741  0.45877    
## chlorides             -4.798408   1.707187  -2.811  0.00494 ** 
## free.sulfur.dioxide    0.022396   0.008999   2.489  0.01282 *  
## total.sulfur.dioxide  -0.014448   0.003074  -4.700 2.60e-06 ***
## density              -10.770445  87.471430  -0.123  0.90200    
## pH                    -1.198891   0.799158  -1.500  0.13356    
## sulphates              2.613487   0.502181   5.204 1.95e-07 ***
## alcohol                0.902744   0.115363   7.825 5.07e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.9  on 1278  degrees of freedom
## Residual deviance: 1353.5  on 1267  degrees of freedom
## AIC: 1377.5
## 
## Number of Fisher Scoring iterations: 4
For classification models, one of the metrics we can use for evaluation is the AIC value, a measure of information loss. We can see that the model using all predictors has a lower AIC than the no-predictor model.
Since we have no prior assumptions for or against any of the variables, and the highly significant predictors are much the same as the ones we found in the linear regression model, we will just use a stepwise approach in both directions.
6.1.3 Stepwise Approach
model_stepwise_both_glm <- step(object = model_allpred_glm, scope = list(lower = model_nopred_glm, upper = model_allpred_glm), direction = "both", trace = F)
summary(model_stepwise_both_glm)
## 
## Call:
## glm(formula = quality_high ~ volatile.acidity + citric.acid + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, family = "binomial", data = wine_train %>% 
##     select(-quality))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2275  -0.8637   0.3239   0.8610   2.2927  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.574704   1.871666  -1.910  0.05615 .  
## volatile.acidity     -3.006772   0.507219  -5.928 3.07e-09 ***
## citric.acid          -0.917448   0.509761  -1.800  0.07190 .  
## chlorides            -4.986193   1.620821  -3.076  0.00210 ** 
## free.sulfur.dioxide   0.023707   0.008907   2.662  0.00778 ** 
## total.sulfur.dioxide -0.014604   0.002902  -5.032 4.86e-07 ***
## pH                   -1.458359   0.541300  -2.694  0.00706 ** 
## sulphates             2.574440   0.486009   5.297 1.18e-07 ***
## alcohol               0.915012   0.083324  10.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.9  on 1278  degrees of freedom
## Residual deviance: 1354.5  on 1270  degrees of freedom
## AIC: 1372.5
## 
## Number of Fisher Scoring iterations: 4
It is different from the one we found with the linear regression model, in that the current model includes citric.acid and excludes fixed.acidity. Interesting! Let's compare this model's performance with our baseline logistic regression model using all predictors.
compare_performance(model_allpred_glm, model_stepwise_both_glm)
## # Comparison of Model Performance Indices
## 
## Name                    | Model |      AIC | AIC weights |      BIC | BIC weights | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log | Score_spherical |   PCP
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
## model_allpred_glm       |   glm | 1377.482 |       0.075 | 1439.328 |     < 0.001 |     0.290 | 0.420 | 1.034 |    0.529 |      -Inf |       9.549e-04 | 0.647
## model_stepwise_both_glm |   glm | 1372.460 |       0.925 | 1418.845 |       1.000 |     0.289 | 0.420 | 1.033 |    0.529 |      -Inf |           0.001 | 0.646
We can see that the AIC is lower, which is good (less information is lost), while the RMSE is about the same. Therefore we will use this stepwise model to represent the logistic regression approach.
6.1.4 Fitted Values vs Real Labels from Training Data
Before moving to the test data, let's look at some metrics on the training data that we can later compare against the test results and against other models.
confusionMatrix(data = as.factor(ifelse(model_stepwise_both_glm$fitted.values>0.5, 1, 0)), reference = wine_train$quality_high, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 439 178
##          1 156 506
## 
##                Accuracy : 0.7389
##                  95% CI : (0.7139, 0.7628)
##     No Information Rate : 0.5348
##     P-Value [Acc > NIR] : <2e-16
## 
##                   Kappa : 0.4764
## 
##  Mcnemar's Test P-Value : 0.2505
## 
##             Sensitivity : 0.7398
##             Specificity : 0.7378
##          Pos Pred Value : 0.7644
##          Neg Pred Value : 0.7115
##              Prevalence : 0.5348
##          Detection Rate : 0.3956
##    Detection Prevalence : 0.5176
##       Balanced Accuracy : 0.7388
## 
##        'Positive' Class : 1
## 
As we can see from the class proportions, which were quite balanced, we can safely use the Accuracy value to pre-evaluate our model before seeing whether it survives the assumption checks and the test data. The Accuracy is 73.89%, which is quite good for now.
Since this is about estimating red wine quality, I'd say it is not a preventive case, so I would choose Precision as my secondary metric. From the confusion matrix above, the Pos Pred Value (another name for Precision) is 76.44%, which is also a pretty good score!
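To make those two metrics explicit, here is a small sketch that recomputes them by hand from the same fitted values (pred_train and tab are helper names introduced only for this illustration):
# Rebuild the training confusion matrix manually.
pred_train <- as.factor(ifelse(model_stepwise_both_glm$fitted.values > 0.5, 1, 0))
tab <- table(Prediction = pred_train, Reference = wine_train$quality_high)

accuracy  <- sum(diag(tab)) / sum(tab)        # (TP + TN) / all predictions
precision <- tab["1", "1"] / sum(tab["1", ])  # TP / (TP + FP)
c(accuracy = accuracy, precision = precision)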
6.2 Verifying Assumptions
There are 3 assumptions of the logistic regression model that, when fulfilled, convince us the model is proper to use. The assumptions are:
- Linearity of Predictor & Log of Odds
- Multicollinearity
- Independence of Observations
6.2.1 Linearity of Predictor & Log of Odds
This is basically a reminder that the interpretation of the logistic regression summary is different from the linear regression one, even though it looks very similar. The biggest difference is that the numbers in the summary are on a log-of-odds scale, so we need the exp() function or the inv.logit() function to interpret them one by one.
Let's see the summary of the chosen model again.
summary(model_stepwise_both_glm)
## 
## Call:
## glm(formula = quality_high ~ volatile.acidity + citric.acid + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, family = "binomial", data = wine_train %>% 
##     select(-quality))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2275  -0.8637   0.3239   0.8610   2.2927  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.574704   1.871666  -1.910  0.05615 .  
## volatile.acidity     -3.006772   0.507219  -5.928 3.07e-09 ***
## citric.acid          -0.917448   0.509761  -1.800  0.07190 .  
## chlorides            -4.986193   1.620821  -3.076  0.00210 ** 
## free.sulfur.dioxide   0.023707   0.008907   2.662  0.00778 ** 
## total.sulfur.dioxide -0.014604   0.002902  -5.032 4.86e-07 ***
## pH                   -1.458359   0.541300  -2.694  0.00706 ** 
## sulphates             2.574440   0.486009   5.297 1.18e-07 ***
## alcohol               0.915012   0.083324  10.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.9  on 1278  degrees of freedom
## Residual deviance: 1354.5  on 1270  degrees of freedom
## AIC: 1372.5
## 
## Number of Fisher Scoring iterations: 4
For example, let's try to interpret how the predictor volatile.acidity affects the target variable quality_high:
- It has a p-value (Pr(>|z|)) that is very small (3.07e-09). R marks this predictor as highly significant: the value is not only below the default cutoff of 0.05 but very close to 0, hence the three-star code next to it. We can compare how significant each predictor is using the Signif. codes legend beneath the coefficients.
- Its Estimate is -3.006772. This is where the interpretation of logistic regression mainly comes in: an increase of 1 unit in volatile.acidity changes the log-of-odds of quality_high by -3.006772, i.e. decreases it by about 3.01. Log-of-odds are hard to reason about directly, so we can use the inv.logit() function to get a better feel for this number.
function to better understand this.inv.logit(-3.006772)
## [1] 0.04712087
The result, about 0.047, is well below 0.5, which lines up with the negative coefficient: as volatile.acidity increases, the chance of the wine being high quality (rated 6-8) goes down rather than up. As a reminder, a probability only runs from 0 to 1. We could interpret each remaining predictor the same way (using inv.logit() or exp()), but we'll skip the rest for now.
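Another way to read the same coefficient, shown here only as a quick sketch on the same fitted model, is to exponentiate it into an odds ratio, which is often easier to communicate than log-odds:

# Odds-ratio view of the volatile.acidity coefficient (sketch)
exp(coef(model_stepwise_both_glm)["volatile.acidity"])
# roughly 0.049: each 1-unit increase in volatile.acidity multiplies the odds
# of quality_high = 1 by about 0.05, i.e. it cuts the odds sharply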
6.2.2 Multicollinearity
This is the same assumption we discussed and checked for the linear regression model: we should make sure that no subset of our predictors is tightly correlated with each other.
We can use VIF (Variance Inflation Factor) test for this.
vif(model_stepwise_both_glm)
## volatile.acidity citric.acid chlorides ## 1.613645 2.194901 1.499875 ## free.sulfur.dioxide total.sulfur.dioxide pH ## 1.937301 1.952260 1.561357 ## sulphates alcohol ## 1.440349 1.168738
The rule of thumb is that each VIF value should be under 10. Since that holds for all of the predictors used, we can safely say there is no multicollinearity in our model, so it passes this assumption.
6.2.3 Independence of Observations
This assumption requires that we are not taking repeated measurements that would bias the analysis; in other words, we don't want any correlation between observations, i.e. between the rows of the tabular data.
This assumption is easily fulfilled by random sampling during data collection, or, on the analysis side, by a random cross-validation or train-test split, which we did at the beginning.
6.3 Model Evaluation
We passed the assumptions! Now we can try to evaluate our model using the test data that we have put aside from the beginning for bias prevention.
glm_pred <- as.factor(ifelse(predict(object = model_stepwise_both_glm,
                                     newdata = wine_test,
                                     type = "response") > 0.5, 1, 0))

confusionMatrix(data = glm_pred,
                reference = wine_test$quality_high,
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 112 37
## 1 37 134
##
## Accuracy : 0.7688
## 95% CI : (0.7186, 0.8138)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5353
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7836
## Specificity : 0.7517
## Pos Pred Value : 0.7836
## Neg Pred Value : 0.7517
## Prevalence : 0.5344
## Detection Rate : 0.4188
## Detection Prevalence : 0.5344
## Balanced Accuracy : 0.7677
##
## 'Positive' Class : 1
##
Our Accuracy is 76.88%, and our Precision (Pos Pred Value) is 78.36%. Somehow slightly better than our training results, but we can't complain!
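As a quick sanity check, both headline numbers can be recomputed by hand from the confusion matrix above:

# Accuracy and Precision recomputed from the test confusion matrix
c(accuracy  = (112 + 134) / (112 + 37 + 37 + 134),  # 0.7688
  precision = 134 / (134 + 37))                     # 0.7836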
6.4 Conclusions
-
Using a stepwise approach, the best model is built from predictors similar to those in our previous linear regression model, with one predictor swapped for another. Here is the full list:
- volatile.acidity
- chlorides
- free.sulfur.dioxide
- total.sulfur.dioxide
- pH
- sulphates
- alcohol
- citric.acid
- The Accuracy of our logistic regression model is 76.88%, and the Precision is 78.36%.
7 Classification Model: k-NN
k-NN, or k-Nearest Neighbours, is a machine learning classification model that compares the characteristics of a new/unseen observation against the existing/training data. The "distance" between characteristics is measured with Euclidean distance; the distances are then ordered from shortest to longest, and the k closest training observations (where k is chosen by the data scientist) each cast a "vote" for their class. The majority class among those neighbours wins and determines the predicted class.
This model is better for numerical predictors, which is a perfect fit for this dataset.
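To make the distance idea concrete, here is a tiny sketch with two hypothetical wines described by just two features (the numbers are made up, not rows from our dataset):

# Euclidean distance between two hypothetical wines
p1 <- c(alcohol = 10.2, sulphates = 0.62)
p2 <- c(alcohol = 11.1, sulphates = 0.73)
sqrt(sum((p1 - p2)^2))   # k-NN ranks training points by this kind of distance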
7.1 Pre-processing and Scaling the Dataset
Let's check the dataset summary and the range of the data (minimum and maximum values of each predictor).
summary(wine_train)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.1000 1st Qu.: 1.900
## Median : 7.900 Median :0.5200 Median :0.2600 Median : 2.200
## Mean : 8.321 Mean :0.5271 Mean :0.2729 Mean : 2.551
## 3rd Qu.: 9.300 3rd Qu.:0.6400 3rd Qu.:0.4200 3rd Qu.: 2.600
## Max. :15.900 Max. :1.3300 Max. :1.0000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 8.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08833 Mean :15.94 Mean : 46.61 Mean :0.9968
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9979
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality quality_high
## Min. :2.74 Min. :0.3700 Min. : 8.40 Min. :3.000 0:595
## 1st Qu.:3.20 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000 1:684
## Median :3.31 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.31 Mean :0.6586 Mean :10.41 Mean :5.636
## 3rd Qu.:3.40 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.01 Max. :2.0000 Max. :14.90 Max. :8.000
It's obvious that the scale of each predictor is different, and since this algorithm would be biased toward predictors with larger numbers, we need to exclude the quality column and then scale all of the numerical columns so that every predictor weighs the same when used in the k-NN method.
We also need to separate the predictors (x) from the target variable (y, quality_high).
7.1.1 Separating Predictors and Target Variables
Before scaling the data points, we need to separate the labels, i.e. the target variable, from the predictors, since the target does not need to be scaled. Don't forget to exclude the quality column too, since we no longer need it in this k-NN model or in any of the other classification models.
wine_train_x <- wine_train %>% 
  select(-c("quality", "quality_high"))

wine_train_y <- wine_train %>% 
  pull(quality_high)

wine_test_x <- wine_test %>% 
  select(-c("quality", "quality_high"))

wine_test_y <- wine_test %>% 
  pull(quality_high)
7.1.2 Scaling
For standardized and comparable results, it is best to use z-score scaling: the mean of each column is shifted to 0, and every other value is expressed as how many standard deviations it deviates from that mean, i.e. \(z = (x - \mu) / \sigma\).
This is simply done with the scale() function. We do this for the predictors in the training dataset first, but not yet for the testing ones.
wine_train_x_scaled <- scale(wine_train_x)
For the testing set, we need to reuse the centering and scaling parameters from the training data, because the test set is supposed to be unseen data that follows the same "rules" as the training data.
wine_test_x_scaled <- scale(wine_test_x,
                            center = attr(wine_train_x_scaled, "scaled:center"),
                            scale = attr(wine_train_x_scaled, "scaled:scale"))
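As a small sketch, we can double-check that scale() performs exactly the z-score computation described above, using the alcohol column as an example:

# Manual z-score for one column should match the scale() output
manual_alcohol_z <- (wine_train_x$alcohol - mean(wine_train_x$alcohol)) /
  sd(wine_train_x$alcohol)

head(manual_alcohol_z)
head(wine_train_x_scaled[, "alcohol"])   # expected to be (nearly) identical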
Let’s check the summary of the predictors, to make sure that they are quite equally scaled now.
summary(wine_train_x_scaled)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :-2.1526 Min. :-2.2905 Min. :-1.41082 Min. :-1.11466
## 1st Qu.:-0.7065 1st Qu.:-0.7712 1st Qu.:-0.89379 1st Qu.:-0.43964
## Median :-0.2437 Median :-0.0397 Median :-0.06654 Median :-0.23713
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.5661 3rd Qu.: 0.6355 3rd Qu.: 0.76071 3rd Qu.: 0.03288
## Max. : 4.3838 Max. : 4.5181 Max. : 3.75949 Max. : 8.74072
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :-1.52715 Min. :-1.4247 Min. :-1.2256
## 1st Qu.:-0.36678 1st Qu.:-0.7572 1st Qu.:-0.7427
## Median :-0.18672 Median :-0.1849 Median :-0.2599
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.03335 3rd Qu.: 0.4827 3rd Qu.: 0.4644
## Max. :10.45671 Max. : 5.3466 Max. : 7.3148
## density pH sulphates alcohol
## Min. :-3.551693 Min. :-3.739176 Min. :-1.7040 Min. :-1.9184
## 1st Qu.:-0.603171 1st Qu.:-0.720146 1st Qu.:-0.6413 1st Qu.:-0.8702
## Median :-0.002841 Median : 0.001796 Median :-0.2280 Median :-0.2031
## Mean : 0.000000 Mean : 0.000000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.594832 3rd Qu.: 0.592476 3rd Qu.: 0.4214 3rd Qu.: 0.6546
## Max. : 3.684140 Max. : 4.595972 Max. : 7.9195 Max. : 4.2757
summary(wine_test_x_scaled)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :-1.979064 Min. :-2.29046 Min. :-1.41082 Min. :-1.11466
## 1st Qu.:-0.706482 1st Qu.:-0.71493 1st Qu.:-0.99720 1st Qu.:-0.43964
## Median :-0.243725 Median :-0.09597 Median :-0.16995 Median :-0.23713
## Mean :-0.004935 Mean : 0.02149 Mean :-0.04893 Mean :-0.04211
## 3rd Qu.: 0.450410 3rd Qu.: 0.63552 3rd Qu.: 0.76071 3rd Qu.: 0.03288
## Max. : 4.210310 Max. : 5.92479 Max. : 2.51862 Max. : 5.70310
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :-0.98698 Min. :-1.42475 Min. :-1.2256 Min. :-3.55169
## 1st Qu.:-0.36678 1st Qu.:-0.85252 1st Qu.:-0.7503 1st Qu.:-0.65630
## Median :-0.18672 Median :-0.23261 Median :-0.3202 Median :-0.02409
## Mean :-0.08663 Mean :-0.03054 Mean :-0.0216 Mean :-0.02301
## 3rd Qu.: 0.03335 3rd Qu.: 0.57804 3rd Qu.: 0.4946 3rd Qu.: 0.50717
## Max. : 6.53544 Max. : 3.62991 Max. : 3.1502 Max. : 3.42382
## pH sulphates alcohol
## Min. :-2.951603 Min. :-1.94019 Min. :-1.91839
## 1st Qu.:-0.605291 1st Qu.:-0.58227 1st Qu.:-0.87016
## Median : 0.001796 Median :-0.22803 Median :-0.20309
## Mean : 0.045482 Mean :-0.01402 Mean : 0.04695
## 3rd Qu.: 0.608883 3rd Qu.: 0.42140 3rd Qu.: 0.67838
## Max. : 4.595972 Max. : 7.80139 Max. : 3.41810
Now the medians and means are all quite similar to each other, and the ranges of the predictors are similar too, so we are ready to use this dataset.
7.2 Finding Optimum k
Finding a good number 'k' is a factor in how well the model performs. A common rule of thumb is to start with the square root of the number of observations (the number of rows in tabular data).
Let's compute that square root to get a candidate value for k.
sqrt(nrow(wine_test_x_scaled))
## [1] 17.88854
The target variable has 2 classes, so it is best to pick an odd k to avoid any tie in the majority vote of the classification algorithm.
The candidate k is around 17.89, so we will try k = 17, and also tune the model with the two odd numbers around it, k = 15 and k = 19.
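As a small illustrative helper (not part of the original workflow), the square-root candidate can be rounded to the nearest odd integer programmatically:

# Round the sqrt(n) candidate to the nearest odd k to avoid ties in a 2-class vote
k_candidate <- sqrt(nrow(wine_test_x_scaled))   # ~17.89, as computed above
k_odd <- 2 * round((k_candidate - 1) / 2) + 1
k_odd                                           # 17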
7.3 Fitting the Model
k-NN is often categorized as a "black-box" style model: there is no fitted equation or component summary to interpret while it works, so we can only use the results after the predictions have been made.
wine_knn_pred_k17 <- knn(train = wine_train_x_scaled,
                         test = wine_test_x_scaled,
                         cl = wine_train_y,
                         k = 17)

wine_knn_pred_k15 <- knn(train = wine_train_x_scaled,
                         test = wine_test_x_scaled,
                         cl = wine_train_y,
                         k = 15)

wine_knn_pred_k19 <- knn(train = wine_train_x_scaled,
                         test = wine_test_x_scaled,
                         cl = wine_train_y,
                         k = 19)
7.4 Model Evaluation
One of the most generally interpretable ways to evaluate a classification model is the confusion matrix. We have already decided, with the logistic regression model, to use Accuracy and Precision: the target's classes are balanced quite nicely (roughly 50:50), and a model that precisely predicts a good-quality wine from its measured properties is more valuable here than one that recalls as many good-quality wines as possible.
7.4.1 Using k=17 (closest number to optimum)
confusionMatrix(data = wine_knn_pred_k17,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 101 30
## 1 48 141
##
## Accuracy : 0.7562
## 95% CI : (0.7054, 0.8023)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5063
##
## Mcnemar's Test P-Value : 0.05425
##
## Sensitivity : 0.8246
## Specificity : 0.6779
## Pos Pred Value : 0.7460
## Neg Pred Value : 0.7710
## Prevalence : 0.5344
## Detection Rate : 0.4406
## Detection Prevalence : 0.5906
## Balanced Accuracy : 0.7512
##
## 'Positive' Class : 1
##
Accuracy: 75.62%
Precision / Pos Pred Value: 74.60%
Not bad!
7.4.2 Using k=15
confusionMatrix(data = wine_knn_pred_k15,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 101 32
## 1 48 139
##
## Accuracy : 0.75
## 95% CI : (0.6988, 0.7965)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 1.56e-15
##
## Kappa : 0.4941
##
## Mcnemar's Test P-Value : 0.09353
##
## Sensitivity : 0.8129
## Specificity : 0.6779
## Pos Pred Value : 0.7433
## Neg Pred Value : 0.7594
## Prevalence : 0.5344
## Detection Rate : 0.4344
## Detection Prevalence : 0.5844
## Balanced Accuracy : 0.7454
##
## 'Positive' Class : 1
##
Accuracy: 75.00%
Precision / Pos Pred Value: 74.33%
Slightly lower compared to that of our original and optimum k.
7.4.3 Using k=19
confusionMatrix(data = wine_knn_pred_k19,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 98 31
## 1 51 140
##
## Accuracy : 0.7438
## 95% CI : (0.6922, 0.7907)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 1.044e-14
##
## Kappa : 0.4806
##
## Mcnemar's Test P-Value : 0.03589
##
## Sensitivity : 0.8187
## Specificity : 0.6577
## Pos Pred Value : 0.7330
## Neg Pred Value : 0.7597
## Prevalence : 0.5344
## Detection Rate : 0.4375
## Detection Prevalence : 0.5969
## Balanced Accuracy : 0.7382
##
## 'Positive' Class : 1
##
Accuracy: 74.38%
Precision / Pos Pred Value: 73.30%
Even lower compared to that of our original and optimum k.
7.5 Conclusions
- Our best k-NN model uses the k suggested by the square-root rule of thumb, rounded to the nearest odd number, which is k = 17.
- The Accuracy of our model is 75.62%, and the Precision is 74.60%.
8 Classification Model: Naive Bayes
Naive Bayes is a classification machine learning model that uses Bayes' Theorem on dependent and independent events to estimate the probability of each observation's class.
The model carries some distinctive assumptions: the predictors depend on the target variable (which is obvious; if they didn't, why would we try to predict at all?), the predictors are independent of each other, and each predictor weighs the same in the final probability calculation.
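Written out, the "naive" independence assumption means the class probability factorizes as the standard naive Bayes rule:
\[
P(\text{class} \mid x_1, \dots, x_n) \;\propto\; P(\text{class}) \prod_{i=1}^{n} P(x_i \mid \text{class})
\]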
It should be noted that this model is known to work best with categorical predictors. Even so, we can still use numerical predictors with it, and we'll see how well it performs compared to the other models.
8.1 Model Fitting
It should also be noted that Naive Bayes has a known weakness when data are scarce: if any specific segment of the dataset has 0 observations, the whole probability calculation is badly skewed. We will therefore use Laplace smoothing with a value of 1, which simply adds 1 dummy observation to each segment of the dataset. This offsets the data by an insignificant amount, just as \(1000/2000\) is not too far from \(1001/2001\).
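A tiny numeric sketch with hypothetical counts (not taken from the wine data) shows what laplace = 1 does to a zero count:

# Laplace smoothing keeps a zero count from collapsing the probability product
counts   <- c(high = 0, low = 20)                          # hypothetical counts
prop_raw <- counts / sum(counts)                           # 0.000 and 1.000
prop_lap <- (counts + 1) / (sum(counts) + length(counts))  # ~0.045 and ~0.955
rbind(prop_raw, prop_lap)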
8.1.1 Using All Predictors
model_nb_all <- naiveBayes(formula = quality_high ~ .,
                           data = wine_train %>% select(-quality),
                           laplace = 1)

wine_nb_all_pred_class <- predict(object = model_nb_all,
                                  newdata = wine_test,
                                  type = "class")

wine_nb_all_pred_raw <- predict(object = model_nb_all,
                                newdata = wine_test,
                                type = "raw")
head(wine_nb_all_pred_class)
## [1] 0 0 0 0 0 0
## Levels: 0 1
head(wine_nb_all_pred_raw)
## 0 1
## [1,] 0.8333146 0.166685388
## [2,] 0.8148565 0.185143539
## [3,] 0.6065141 0.393485924
## [4,] 0.7671096 0.232890388
## [5,] 0.6810443 0.318955749
## [6,] 0.9922018 0.007798212
8.1.2 Using Predictors found in Linear Regression Model
model_nb_fromlm <- naiveBayes(formula = quality_high ~ volatile.acidity + chlorides +
                                free.sulfur.dioxide + total.sulfur.dioxide +
                                pH + sulphates + alcohol + fixed.acidity,
                              data = wine_train %>% select(-quality),
                              laplace = 1)

wine_nb_fromlm_pred_class <- predict(object = model_nb_fromlm,
                                     newdata = wine_test,
                                     type = "class")

wine_nb_fromlm_pred_raw <- predict(object = model_nb_fromlm,
                                   newdata = wine_test,
                                   type = "raw")
head(wine_nb_fromlm_pred_class)
## [1] 0 0 0 0 0 0
## Levels: 0 1
head(wine_nb_fromlm_pred_raw)
## 0 1
## [1,] 0.7018541 0.29814590
## [2,] 0.6748204 0.32517962
## [3,] 0.5730255 0.42697452
## [4,] 0.6412530 0.35874704
## [5,] 0.5533299 0.44667009
## [6,] 0.9955356 0.00446441
8.1.3 Using Predictors found in Logistic Regression Model
model_nb_fromglm <- naiveBayes(formula = quality_high ~ volatile.acidity + citric.acid +
                                 chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
                                 pH + sulphates + alcohol,
                               data = wine_train %>% select(-quality),
                               laplace = 1)

wine_nb_fromglm_pred_class <- predict(object = model_nb_fromglm,
                                      newdata = wine_test,
                                      type = "class")

wine_nb_fromglm_pred_raw <- predict(object = model_nb_fromglm,
                                    newdata = wine_test,
                                    type = "raw")
head(wine_nb_fromglm_pred_class)
## [1] 0 0 0 0 0 0
## Levels: 0 1
head(wine_nb_fromglm_pred_raw)
## 0 1
## [1,] 0.7434616 0.256538449
## [2,] 0.7186884 0.281311635
## [3,] 0.5038296 0.496170395
## [4,] 0.6704487 0.329551297
## [5,] 0.6073191 0.392680893
## [6,] 0.9963817 0.003618292
8.2 Model Evaluation
In addition to the confusion matrix for a general evaluation, we will evaluate the models with ROC and AUC metrics for an internal comparison between all 3 of the Naive Bayes models.
8.2.1 Using All Predictors
confusionMatrix(data = wine_nb_all_pred_class,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 103 35
## 1 46 136
##
## Accuracy : 0.7469
## 95% CI : (0.6955, 0.7936)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 4.068e-15
##
## Kappa : 0.4889
##
## Mcnemar's Test P-Value : 0.2665
##
## Sensitivity : 0.7953
## Specificity : 0.6913
## Pos Pred Value : 0.7473
## Neg Pred Value : 0.7464
## Prevalence : 0.5344
## Detection Rate : 0.4250
## Detection Prevalence : 0.5687
## Balanced Accuracy : 0.7433
##
## 'Positive' Class : 1
##
Accuracy: 74.69%
Precision / Pos Pred Value: 74.73%
Pretty good, and similar to our previous classification models.
wine_nb_all_roc <- prediction(predictions = wine_nb_all_pred_raw[, 2],
                              labels = wine_test_y)

plot(performance(prediction.obj = wine_nb_all_roc,
                 measure = "tpr",
                 x.measure = "fpr"))
performance(prediction.obj = wine_nb_all_roc, measure = "auc")@y.values
## [[1]]
## [1] 0.8299384
The AUC value is 82.99%, which is pretty good!
8.2.2 Using Predictors found in Linear Regression Model
confusionMatrix(data = wine_nb_fromlm_pred_class,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 26
## 1 52 145
##
## Accuracy : 0.7562
## 95% CI : (0.7054, 0.8023)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5046
##
## Mcnemar's Test P-Value : 0.004645
##
## Sensitivity : 0.8480
## Specificity : 0.6510
## Pos Pred Value : 0.7360
## Neg Pred Value : 0.7886
## Prevalence : 0.5344
## Detection Rate : 0.4531
## Detection Prevalence : 0.6156
## Balanced Accuracy : 0.7495
##
## 'Positive' Class : 1
##
Accuracy: 75.62%
Precision / Pos Pred Value: 73.60%
Pretty good performance so far.
wine_nb_fromlm_roc <- prediction(predictions = wine_nb_fromlm_pred_raw[, 2],
                                 labels = wine_test_y)
plot(performance(prediction.obj = wine_nb_fromlm_roc,
measure = "tpr",
x.measure = "fpr"))
performance(prediction.obj = wine_nb_fromlm_roc, measure = "auc")@y.values
## [[1]]
## [1] 0.8363358
The AUC value is 83.63%, which is pretty good and better than the all-predictors model.
8.2.3 Using Predictors found in Logistic Regression Model
confusionMatrix(data = wine_nb_fromglm_pred_class,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 98 27
## 1 51 144
##
## Accuracy : 0.7562
## 95% CI : (0.7054, 0.8023)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5051
##
## Mcnemar's Test P-Value : 0.009208
##
## Sensitivity : 0.8421
## Specificity : 0.6577
## Pos Pred Value : 0.7385
## Neg Pred Value : 0.7840
## Prevalence : 0.5344
## Detection Rate : 0.4500
## Detection Prevalence : 0.6094
## Balanced Accuracy : 0.7499
##
## 'Positive' Class : 1
##
Accuracy: 75.62%
Precision / Pos Pred Value: 73.85%
Pretty good, and similar to the previous model that used the predictors selected by the linear regression.
wine_nb_fromglm_roc <- prediction(predictions = wine_nb_fromglm_pred_raw[, 2],
                                  labels = wine_test_y)
plot(performance(prediction.obj = wine_nb_fromglm_roc,
measure = "tpr",
x.measure = "fpr"))
performance(prediction.obj = wine_nb_fromglm_roc, measure = "auc")@y.values
## [[1]]
## [1] 0.832411
The AUC value is 83.24%, which is not bad, but lower than our second model, so we will take that second model as our representative.
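For convenience, the three AUC values reported above can be lined up side by side:

# Recap of the three Naive Bayes AUC values reported above
data.frame(model = c("all predictors", "lm-selected predictors", "glm-selected predictors"),
           AUC   = c(0.8299, 0.8363, 0.8324))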
8.3 Conclusions
-
Our best Naive Bayes model uses the predictors that were found to be significant during the linear regression model's feature selection, involving the following predictors:
- volatile.acidity
- chlorides
- free.sulfur.dioxide
- total.sulfur.dioxide
- pH
- sulphates
- alcohol
- fixed.acidity
- The AUC value of our model is 83.63%, which is better than the other Naive Bayes models built with different sets of predictors.
-
The Accuracy of our model is 75.62%, and the Precision is 73.60%.
9 Classification Model: Decision Tree
Decision Tree is a fairly simple tree-based classification model that is nevertheless robust and powerful for prediction, and it produces a tree visualization that is easy to interpret. Starting from the root node at the top, each branch applies a splitting variable and criterion, and the algorithm follows these splits to categorize each observation, so although not ideal, this model is workable for numerical predictors like the ones in our dataset.
Additional characteristics of a Decision Tree: the predictors are allowed to be dependent on each other, so multicollinearity is not a problem, and the model is not sensitive to outliers in numerical predictors (for example, if an outlying value sits at 10,000 while the rest of the data is around 100-200, the algorithm can simply bucket it under a rule like "> 200" anyway).
Although a decision tree can also be used for regression, that is usually not recommended, so let's see whether trying it is a mistake or a breakthrough.
9.1 Model Fitting
Since the Decision Tree algorithm already uses entropy / information gain to choose the best predictors and split criteria, we don't need to do much feature selection ourselves. Later we might need to prune the tree if it becomes too crowded to read.
9.1.1 Using it as a Regression Model
model_allpred_tree_reg <- ctree(formula = quality ~ .,
                                data = wine_train %>% select(-quality_high))

plot(model_allpred_tree_reg, type = "simple")
The plot is very crowded! This model and plot only use the default parameters of a decision tree, so we could tune them further if needed. Looking at the leaves/terminal nodes, the nodes that no longer branch into another decision (the greyed-out boxes near the bottom of the tree), we can see a predicted value together with a possible error for each. One leaf node around the middle of the tree has a predicted value of 5.672 with a possible error of 37.4, which does not make much sense when the quality rating only runs from 0-10. A lot of other leaf nodes show similarly extreme errors, so I don't think pruning this regression decision tree would help it much.
Therefore, I think it is best to go back to using the Decision Tree as a classification model, with quality_high as the target variable instead.
9.1.2 Using it as a Classification Model
9.1.2.1 As It Is
model_allpred_tree_cla <- ctree(formula = quality_high ~ .,
                                data = wine_train %>% select(-quality))

plot(model_allpred_tree_cla)
plot(model_allpred_tree_cla, type = "simple")
I was tempted to use type = "simple" again, but since the results still overlap each other, I prefer the default plot instead; we can classify each leaf node against the default threshold of 0.5 to read off its prediction. Note that this tree has 15 leaves.
As we can see from the tree, some decisions are quite polarizing and leave little error to worry about, for example Node 14 (probability very close to 0) or Node 28 (very close to 1). Others are less polarized, and a few nodes sit close to our 0.5 threshold (like Node 11 and Node 29); these are likely the nodes that will cause most of the errors (false negatives or false positives) later on.
Predictors that are highly relevant/significant (excluding duplicates and branches that classify entirely to one class):
- alcohol
- volatile.acidity
- chlorides
- sulphates
- total.sulfur.dioxide
9.1.2.2 Pruning the Model
There are a few hyperparameters available for pruning the model:
- mincriterion. Default is 0.95. This parameter controls each split's p-value check, so a node branches only if its p-value < (1 - mincriterion). A higher number prunes the tree, making a node harder to split because less error is tolerated in its decision.
- minsplit. Default is 20. This parameter sets the minimum number of observations a node must hold for a split to be attempted. A higher number also prunes the tree, since the algorithm only looks for a splitting decision when enough observations reach that node.
- minbucket. Default is 7. This parameter sets the minimum number of observations in a leaf/terminal node. Similar to minsplit, a higher value prunes the tree, as more observations are required before a branch may terminate into a leaf.
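For reference, these defaults can be spelled out explicitly; a call like the illustrative sketch below should be equivalent to running ctree() without a control argument:

# The ctree defaults discussed above, written out explicitly (sketch)
default_ctrl <- ctree_control(mincriterion = 0.95,
                              minsplit     = 20,
                              minbucket    = 7)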
There is no rule of thumb for these numbers. After trying several values, the setting below prunes the leaf count quite a lot, to around half of the original, so this could be another model to consider.
model_allpred_tree_cla_prune <- ctree(formula = quality_high ~ .,
                                      data = wine_train %>% select(-quality),
                                      control = ctree_control(mincriterion = 0.95,
                                                              minsplit = 100,
                                                              minbucket = 100))

plot(model_allpred_tree_cla_prune, type = "simple")
But looking more closely at the right side of the root node, whenever an observation has alcohol > 10.3, every decision underneath it ends up classified as "1". Although the number of observations (n) and the error estimate (err) differ between those leaves, the extra splits add no real classification value there. So even though this prunes the tree from 15 leaves down to 8 and makes it much more readable, I don't think it is a good enough model yet, so I will keep searching.
After adjusting the numbers some more, I found the following decision tree:
model_allpred_tree_cla_prune2 <- ctree(formula = quality_high ~ .,
                                       data = wine_train %>% select(-quality),
                                       control = ctree_control(mincriterion = 0.98,
                                                               minsplit = 100,
                                                               minbucket = 20))

plot(model_allpred_tree_cla_prune2, type = "simple")
It is still not perfect: the bottom-right internal node branches into two leaves that both classify to "1", so although the leaf count is 11, it is effectively 10. Even so, I am happier with this one.
When adjusting the parameters for this model, my guiding idea was that the "precious" leaf nodes classifying to "0" on the right side of the root node must be preserved. The n values of the "0" leaf nodes in the default model are sometimes very low (18, 29, or 61), so I was careful not to increase the minbucket parameter too much. I therefore accepted "pruning away" at least one of those right-side "0" leaves and played with the other thresholds until I reached this model.
Predictors that are highly relevant/significant (excluding duplicates and branches that classify entirely to one class):
- alcohol
- volatile.acidity
- sulphates
- total.sulfur.dioxide
This uses one predictor fewer than the tree with default parameters, excluding chlorides.
9.2 Model Evaluation
There is no better way to evaluate the decision tree models than to predict the testing data and build a confusion matrix.
9.2.1 Using Default Parameters
wine_tree_def_pred <- predict(object = model_allpred_tree_cla,
                              newdata = wine_test,
                              type = "response")
confusionMatrix(data = wine_tree_def_pred,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 48
## 1 32 123
##
## Accuracy : 0.75
## 95% CI : (0.6988, 0.7965)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 1.56e-15
##
## Kappa : 0.5011
##
## Mcnemar's Test P-Value : 0.09353
##
## Sensitivity : 0.7193
## Specificity : 0.7852
## Pos Pred Value : 0.7935
## Neg Pred Value : 0.7091
## Prevalence : 0.5344
## Detection Rate : 0.3844
## Detection Prevalence : 0.4844
## Balanced Accuracy : 0.7523
##
## 'Positive' Class : 1
##
Accuracy: 75.00%
Precision / Pos Pred Value: 79.35%
One of the better results we've had compared to the other models, although with this many leaves the tree itself is less readable.
9.2.2 Using Customized Parameters
wine_tree_cust_pred <- predict(object = model_allpred_tree_cla_prune2,
                               newdata = wine_test,
                               type = "response")
confusionMatrix(data = wine_tree_cust_pred,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 115 48
## 1 34 123
##
## Accuracy : 0.7438
## 95% CI : (0.6922, 0.7907)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 1.044e-14
##
## Kappa : 0.4882
##
## Mcnemar's Test P-Value : 0.1511
##
## Sensitivity : 0.7193
## Specificity : 0.7718
## Pos Pred Value : 0.7834
## Neg Pred Value : 0.7055
## Prevalence : 0.5344
## Detection Rate : 0.3844
## Detection Prevalence : 0.4906
## Balanced Accuracy : 0.7456
##
## 'Positive' Class : 1
##
Accuracy: 74.38%
Precision / Pos Pred Value: 78.34%
A little bit less than that of the default parameters, but this one is more readable.
9.3 Conclusions
-
Our Decision Tree model with the default parameters shows the best metrics compared to the pruned model with customized parameters, and it mostly relies on the following predictors:
- alcohol
- volatile.acidity
- chlorides
- sulphates
- total.sulfur.dioxide
- We also made a pruned, alternative model using customized parameters, which is more readable since it contains fewer leaf nodes.
- The Accuracy of our best Decision Tree model is 75.00%, and the Precision is 79.35%.
10 Classification Model: Random Forest
Random Forest is a machine learning model built on the concept of ensemble methods (combining multiple models through majority voting or averaging), with Decision Trees as its members. Each tree has its own characteristics and is independent of the others. In simpler terms, the model builds many Decision Trees on random subsets of predictors/variables, then uses majority voting for classification cases or the mean of the targets for regression ones.
This model is reputed to give some of the most accurate predictions among commonly used models. The general downside is that fitting many models needs resources: processors, RAM and, most noticeably, time to compute the predicted result.
10.1 Model Fitting
Since Random Forest can be used for both regression and classification, let's try both. I will attach the code in the chunks below, but it is commented out, as fitting a Random Forest can sometimes take hours, depending on the number of folds and repetitions in the cross-validation and on the number of observations and variables. I originally ran the code and saved the models as RDS files, so they can be loaded and evaluated quickly.
10.1.1 Fitting as a Regression Model
# set.seed(314)
#
# ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) # repetition
#
# wine_forest_reg <- train(quality ~ .,
# data = wine_train %>% select(-quality_high),
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(wine_forest_reg, "wine_forest_reg.RDS") # saving model
Running this model takes me about 2-4 minutes. You wouldn't want to wait that long for a webpage to open, would you?
10.1.2 Fitting as a Classification Model
# set.seed(314)
#
# wine_forest_cla <- train(quality_high ~ .,
# data = wine_train %>% select(-quality),
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(wine_forest_cla, "wine_forest_cla.RDS") # saving model
For this one, somehow it took only around 1 minute to finish.
10.2 Model Evaluation
10.2.1 Regression Model
First, let’s read the saved model.
<- readRDS("wine_forest_reg.RDS")
wine_forest_reg wine_forest_reg
## Random Forest
##
## 1279 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 1024, 1023, 1023, 1023, 1023, 1024, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.5873358 0.4661201 0.4397804
## 6 0.5855157 0.4614288 0.4328461
## 11 0.5876692 0.4563827 0.4317041
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
mtry is the number of predictors randomly sampled as split candidates at each node of each Decision Tree. From the model summary above, after some iterations the algorithm chose mtry = 6, as it had the lowest RMSE (basically an error estimate).
Random Forest has its own built-in validation technique called the Out-Of-Bag (OOB) error: the algorithm sets some data aside at random and evaluates the model on that "unseen" data, much like what we did manually. This is why an extra cross-validation or train-test split is not strictly necessary when using Random Forest.
wine_forest_reg$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.3317218
## % Var explained: 47.49
The model explains 47.49% of the variance in the target, which is quite low, but compared to our first linear regression model it is actually a slight improvement.
Even though the algorithm has already tested the model on the Out-Of-Bag samples, we still have a testing dataset put aside, so we could also try predicting those values…
10.2.2 Classification Model
Again, let’s read the saved model.
<- readRDS("wine_forest_cla.RDS")
wine_forest_cla wine_forest_cla
## Random Forest
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 1024, 1023, 1023, 1023, 1023, 1023, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7969812 0.5920936
## 6 0.7909875 0.5804291
## 11 0.7847324 0.5678683
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
The mtry value the algorithm settled on, based on the repeated resampling, is 2. At each node of each Decision Tree it therefore randomly picks 2 candidate predictors to split on, and so on until the tree is complete.
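For context, a roughly equivalent direct call, shown only as an illustrative sketch of what caret does under the hood with these settings (and kept commented out, like the training code above):

# Sketch only: a direct randomForest call with the tuned mtry, bypassing caret
# library(randomForest)
# set.seed(314)
# randomForest(quality_high ~ .,
#              data = wine_train %>% select(-quality),
#              ntree = 500,   # default number of trees
#              mtry = 2)      # the value selected above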
wine_forest_cla$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 18.69%
## Confusion matrix:
## 0 1 class.error
## 0 478 117 0.1966387
## 1 122 562 0.1783626
As explained before, OOB (Out-Of-Bag) is Random Forest's term for the randomly sampled observations that are treated as unseen data and used to evaluate the model. Here the OOB error rate is only 18.69%, which means an accuracy of 81.31%, the highest of all the models so far.
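As a quick arithmetic check, the same accuracy falls out of the OOB confusion matrix printed above:

# 1 - OOB error, recomputed from the OOB confusion matrix
(478 + 562) / (478 + 117 + 122 + 562)   # ~0.8131, i.e. 1 - 0.1869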
Since we have our own testing data set aside, let's evaluate the model on that dataset too. We first predict the classes, then use a confusion matrix to check the accuracy and precision.
wine_forest_cla_pred <- predict(object = wine_forest_cla,
                                newdata = wine_test,
                                type = "raw")
confusionMatrix(data = wine_forest_cla_pred,
reference = wine_test$quality_high,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 114 19
## 1 35 152
##
## Accuracy : 0.8312
## 95% CI : (0.7856, 0.8706)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6585
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 0.8889
## Specificity : 0.7651
## Pos Pred Value : 0.8128
## Neg Pred Value : 0.8571
## Prevalence : 0.5344
## Detection Rate : 0.4750
## Detection Prevalence : 0.5844
## Balanced Accuracy : 0.8270
##
## 'Positive' Class : 1
##
Accuracy: 83.12%
Precision / Pos Pred Value: 81.28%
Even better! This is the first time any of our prediction models, regression or classification, has touched the 80% accuracy mark. As many have said, this is surely one of the best models we can use to predict the target variable.
10.3 Conclusions
- Random Forest is objectively the best performer out of all the models.
- As a regression model, it explains 47.49% of the variance in the target variable (the remaining 52.51% would only be explained by predictors outside those used).
- As a classification model, its Accuracy was 83.12% and its Precision 81.28%, objectively the best of all models.
11 Comparing Models and Conclusions
11.1 Performance of Regression Models
-
The Linear Regression Model was chosen using our manual approach and a stepwise approach in combined forward and backward directions, resulting in model_stepwise_both_lm, which achieves 44.59% on the adjusted R-squared metric. This model failed 1 of the 4 required assumptions, so although it can be used to predict the numerical quality target, it must be used with the caveat that the residuals are spread in a pattern (heteroscedasticity instead of homoscedasticity).
-
The Decision Tree Model, used for regression, produced many leaf nodes whose possible estimated errors fall outside the possible range of the quality rating, which should only be 0-10. Therefore, this model is not fit for estimating the numerical value of red wine quality.
-
The Random Forest Model, essentially an ensemble of Decision Trees fitted over and over again, improved the score to 47.49% in how well it explains the target variable quality. This is an improvement, as expected from Random Forest, though the resources required might not be worthwhile if the dataset were much larger, had many more predictors, or if the business question were urgent. Otherwise, this model can be deemed the best of all the regression models tried in this report.
11.2 Performance of Classification Models
- The Logistic Regression Model, usually strong with numerical predictors, has an Accuracy of 76.88% and a Precision of 78.36%.
- The k-Nearest Neighbour Model works best with scaled, numerical predictors, which we can provide easily, although interpreting the model's components is not one of its strengths. Its Accuracy is 75.62% and its Precision 74.60%.
- The Naive Bayes Model is one of the simplest models and is at its best with categorical predictors. Given that all of our predictors are numerical, it still performs similarly to k-NN, with an Accuracy of 75.62% and a Precision of 73.60%.
- The Decision Tree Model is the one to look for if you want a rule-based model, as its tree visualization is easily interpretable if you set the plot width wide enough. The Accuracy of our best Decision Tree model is 75.00% and its Precision 79.35%; after some pruning/tuning of the default parameters we also have an alternative model with fewer leaf nodes, which is much simpler but sacrifices a little accuracy and precision.
- The Random Forest Model is still the best performer of all the classification methods tried in this report, with an Accuracy of 83.12% and a Precision of 81.28%. Its only downside is the time and computational resources it consumes, so it is usable in most business cases except those that are urgent or under-resourced.
11.3 Overall Conclusions
- Random Forest is one of the most robust machine learning methods if you want accuracy and precision in exchange for time and resources, as our model comparison above has shown.
-
Across all of the models whose components we can interpret, these predictors appear the most:
- Alcohol
- Volatile Acidity
- Sulphates
- Total Sulfur Dioxides
Therefore we can conclude that they must be among the most decisive factors in determining red wine quality.
12 References and About Me
12.1 References
- Kaggle.com: Red Wine Quality
- P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
-
Wikipedia.org: Wine, in which cites:
- “Kuopion Viinijuhlat » Kuopio Wine Festival” (in Finnish). Kuopio Wine Festival. Retrieved 25 July 2020.
- “6 Secrets of Cooking With Wine” . WebMD.
- Parker, Robert M. (2008). Parker’s Wine Buyer’s Guide , 7th Edition. Simon and Schuster. p. 15. ISBN 978-1-4391-3997-4.
- Jancis Robinson (2006). The Oxford Companion to Wine (3rd ed.) . Oxford University Press. ISBN 978-0-19-860990-2. See alcoholic strength at p. 10.
- Wikipedia.org: Red Wine
-
Wikipedia.org: Wine tasting, in which cites:
- Peynaud, Émile (1996) The Taste of Wine: The Art and Science of Wine Appreciation, London: Macdonald Orbis, p1
- Hodgson, Robert T., "How Expert are 'Expert' Wine Judges?", Journal of Wine Economics, Vol. 4, Issue 02 (Winter 2009), pp. 233–241.
12.2 About Me
Hi! My name is Calvin, and I am from Jakarta, Indonesia. I am looking forward to becoming a full-time data analyst and/or data scientist. I have a background in Mathematics and Computer Science from my Bachelor's Degrees, and I love playing with numbers and data. I made this report to enhance my Data Science portfolio (constructive criticism is very much welcomed!) and as part of a Learn-By-Building assignment at Algoritma Data Science School.
You can reach me at my LinkedIn for more discussion. Thank you!