Red Wine Quality Analysis
1 Background
1.1 About Red Wine
Figure 1.1: Photo by Jep Gambardella from Pexels
Wine is an alcoholic drink typically made from fermented grapes. Yeast consumes the sugar in the grapes and converts it to ethanol and carbon dioxide, releasing heat in the process. Different varieties of grapes and strains of yeasts are major factors in different styles of wine. These differences result from the complex interactions between the biochemical development of the grape, the reactions involved in fermentation, the grape's growing environment, and the wine production process. Wines not made from grapes involve fermentation of other crops, including rice wine and other fruit wines such as plum, cherry, pomegranate, currant and elderberry.
Wine is a popular and important drink that accompanies and enhances a wide range of cuisines, from the simple and traditional stews to the most sophisticated and complex haute cuisines. Wine is often served with dinner. Sweet dessert wines may be served with the dessert course. In fine restaurants in Western countries, wine typically accompanies dinner. At a restaurant, patrons are helped to make good food-wine pairings by the restaurant’s sommelier or wine waiter. Individuals dining at home may use wine guides to help make food–wine pairings. Wine is also drunk without the accompaniment of a meal in wine bars or with a selection of cheeses (at a wine and cheese party). Wines are also used as a theme for organizing various events such as festivals around the world; the city of Kuopio in North Savonia, Finland is known for its annual Kuopio Wine Festivals (Kuopion viinijuhlat).
Wine is important in cuisine not just for its value as a drink, but as a flavor agent, primarily in stocks and braising, since its acidity lends balance to rich savory or sweet dishes. Wine sauce is an example of a culinary sauce that uses wine as a primary ingredient. Natural wines may exhibit a broad range of alcohol content, from below 9% to above 16% ABV, with most wines being in the 12.5–14.5% range. Fortified wines (usually with brandy) may contain 20% alcohol or more.
Red wine is a type of wine made from dark-colored grape varieties. The actual color of the wine can range from intense violet, typical of young wines, through to brick red for mature wines and brown for older red wines. The juice from most purple grapes is greenish-white, the red color coming from anthocyan pigments (also called anthocyanins) present in the skin of the grape; exceptions are the relatively uncommon teinturier varieties, which produce a red-colored juice. Much of the red-wine production process therefore involves extraction of color and flavor components from the grape skin. Red wine is a delicacy around the world.
Wine tasting is the sensory examination and evaluation of wine. While the practice of wine tasting is as ancient as its production, a more formalized methodology has slowly become established from the 14th century onwards. Modern, professional wine tasters (such as sommeliers or buyers for retailers) use a constantly evolving specialized terminology which is used to describe the range of perceived flavors, aromas and general characteristics of a wine. More informal, recreational tasting may use similar terminology, usually involving a much less analytical process for a more general, personal appreciation.
Results that have surfaced through scientific blind wine tasting suggest the unreliability of wine tasting in both experts and consumers, such as inconsistency in identifying wines based on region and price.
1.2 Business Questions
After encountering this dataset, which tries to predict red wine quality from a set of numerical predictors, I learned a little of its history and background as summarized above. It must be noted that I am not an avid drinker myself; I was simply drawn to the dataset. So what can I do with it?
Some questions arose after researching red wine a little, while going through the Supervised Machine Learning courses at Algoritma Data Science School, covering both Regression and Classification models. Can I apply those models to the red wine dataset?
So, I summed up some important questions that I’d like to find the answer to:
- Can I apply these numerical predictors to the Supervised Machine Learning models that I have just studied?
- Some models (especially classification models) are said to work best with mostly categorical predictors, while my dataset has none, only numerical ones. Will this significantly affect the performance of the prediction models?
- In class we learned that Random Forest is one of the most widely used methods due to its accuracy and robustness. Will this also hold true for a dataset that I happened to find on Kaggle.com?
- Some interpretable models might favor the same predictors over and over. Which predictors are the most significant across most models, such that they keep appearing in one model after another?
2 About the Dataset
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
Input variables (based on physicochemical tests):
- 1 - fixed acidity (g/L): most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
- 2 - volatile acidity (g/L): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar-like taste
- 3 - citric acid (g/L): found in small quantities, citric acid can add 'freshness' and flavor to wines
- 4 - residual sugar (g/L): the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 g/L, and wines with more than 45 g/L are considered sweet
- 5 - chlorides (g/L): the amount of salt in the wine
- 6 - free sulfur dioxide (mg/L): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- 7 - total sulfur dioxide (mg/L): the amount of free and bound forms of SO2; at low concentrations SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- 8 - density (g/cm3): the density of wine is close to that of water, depending on the percent alcohol and sugar content
- 9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
- 10 - sulphates (g/L): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- 11 - alcohol (% by volume): the percent alcohol content of the wine
Output variable:
- 12 - quality (0-10, based on sensory data)
3 Data Pre-processing
3.1 Enabling Libraries
First and foremost, let's load the libraries that we are going to work with. I am thinking of at least dplyr, tidyr, and glue for an easier time pre-processing the data, plus ggplot2 and plotly for helpful data visualization.
library(dplyr) #for easier time of tidying the data
library(tidyr) #same as above
library(glue) #to put pop-up label for interactive plotting
library(ggplot2) #for modern plotting
library(plotly) #for interactive plotting
library(GGally) #for making a simple heatmap of correlations between columns
library(rsample) #for train-test splitting
library(performance) #for comparing classification models
library(lmtest) #for linear regression's homoscedasticity test
library(car) #for linear regression's multicollinearity test
library(gtools) #for using logit and inv-logit to interpret logistic regression
library(caret) #for using confusion matrix, making random forest model
library(class) #for kNN model
library(e1071) #for Naive Bayes model
library(ROCR) #for evaluating Naive Bayes model
library(partykit) #for decision tree model
library(randomForest) #for reviewing the random forest model
3.2 Reading the Dataset
Let's read the data and store it in a wine object, then take a look at its first few rows:
wine <- read.csv("winequality-red.csv")
head(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
Checking the data structure:
glimpse(wine)
## Rows: 1,599
## Columns: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5~
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, ~
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0~
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,~
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, ~
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16~
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,~
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0~
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3~
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0~
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.~
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7~
As explained in the About the Dataset section, the separation between target and predictor variables is clear. The target variable is quality, while the rest will be used as predictors.
The data looks fine overall: no predictor seems to have the wrong data type. However, since we are also going to try some classification models, we will need to make the quality column categorical. From the glimpse() output above we can see that quality is an integer score (nominally 0-10), so we can either factorize the column as is, or group the scores into a few categories.
3.3 Verifying No Near-Zero Variance Predictors
Let's check whether any column has near-zero variance. This matters especially for some classification models, such as Random Forest, since near-zero variance columns carry almost no information and can distort the analysis.
nearZeroVar(wine)
## integer(0)
If any such columns existed, the function above would return their column indexes. Since it returns integer(0), it found no near-zero variance predictors, which means our dataset varies enough to give us some insight.
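For the curious, caret's nearZeroVar() can also return the diagnostics it bases this decision on via its saveMetrics argument. A small optional sketch (nzv_metrics is just a helper name introduced here, not used later):
nzv_metrics <- nearZeroVar(wine, saveMetrics = TRUE)  # return per-column metrics instead of indexes
head(nzv_metrics)                # freqRatio, percentUnique, zeroVar, nzv for each column
nzv_metrics[nzv_metrics$nzv, ]   # empty here, since no predictor is flagged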
3.4 Checking Missing Values
colSums(is.na(wine))
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
It's safe! None of the columns has any missing values.
3.5 Adjusting Target Variable for Classification Models
Let’s check the current proportion of the target variable.
prop.table(table(wine$quality))
##
## 3 4 5 6 7 8
## 0.006253909 0.033145716 0.425891182 0.398999375 0.124452783 0.011257036
plot_1 <- wine %>%
  group_by(quality) %>%
  summarise(count = n()) %>%
  mutate(quality = as.factor(quality),
         label = glue("Quality = {quality}
                      Count = {count}")) %>%
  ggplot(mapping = aes(x = quality, y = count, text = label)) +
  geom_col(aes(fill = count)) +
  theme_dark() +
  labs(title = "`Quality` as Categoric Target Variable",
       x = "Quality",
       y = "Count",
       fill = "Count")

ggplotly(plot_1, tooltip = "label")
If we are going to use classification machine learning models, keeping the raw numbers as "classes" does not seem appropriate, since the numbers represent an ordered rating. Simple regression models that predict a numeric target are fine with this, but for the classification models we need to categorize the target variable.
One thing we should avoid is class imbalance. I originally wanted to label wines rated at least 7 as high quality (or 1) and the rest as low (or 0), since 7 out of 10 seems a fair cutoff for "high", but looking at the proportions above, that would not be ideal. Since the data is concentrated on the values 5 and 6, I think it is better to split these two into different classes. So for now I will label 6 and above as high and the rest as low, in a new column called quality_high that categorizes the quality column.
wine$quality_high <- as.factor(ifelse(wine$quality>=6, 1, 0))
glimpse(wine$quality_high)
## Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 2 2 1 ...
prop.table(table(wine$quality_high))
##
## 0 1
## 0.4652908 0.5347092
A roughly 53:47 split! That seems balanced enough, so we should not need any down- or up-sampling.
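For reference only: had the classes been heavily imbalanced, caret's downSample() (or upSample()) would be a simple way to rebalance them. A sketch, shown purely for illustration; wine_balanced is a throwaway name that is not used anywhere else in this analysis:
wine_balanced <- downSample(x = wine %>% select(-quality_high),  # predictors
                            y = wine$quality_high,               # class labels
                            yname = "quality_high")
table(wine_balanced$quality_high)  # both classes now have the same count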
3.6 Cross-validation / Train-Test Splitting
Next, we will split the data into training and test sets with the usual 80:20 proportion, stratifying on quality_high to keep the class proportions balanced. To make the random sampling reproducible, I will use a seed of 314, since I like the number pi.
RNGkind(sample.kind = "Rounding") # extra step needed for R 3.6 and above
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314) # lock the chosen random numbers

# index sampling
index_wine <- initial_split(wine, prop = 0.8, strata = "quality_high")

# splitting
wine_train <- training(index_wine)
wine_test <- testing(index_wine)

# checking proportions on the separated data frames
prop.table(table(wine_train$quality_high))
##
## 0 1
## 0.4652072 0.5347928
prop.table(table(wine_test$quality_high))
##
## 0 1
## 0.465625 0.534375
4 EDA and Data Visualization
4.1 Overall Summary Statistics
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality quality_high
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000 0:744
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000 1:855
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Insights:
- Some predictors like fixed.acidity, total.sulfur.dioxide, free.sulfur.dioxide, and sulphates seem to have outliers, based solely on their maximum values being much larger than the mean, median, or 3rd quartile.
- density and pH seem to have fairly good distributions, based on how close the maximum is to each of their medians and/or 3rd quartiles.
- We might need to take these outliers out. But for now, we will analyze the data as is and see how it affects our models.
4.2 Boxplot of Variables
Since the predictors have different scales, I will separate them into groups of similar scale. density and pH have fairly ordinary ranges, so to avoid crowding the plots I will exclude them here.
plot_2 <- ggplot(data = stack(wine %>% select(volatile.acidity, citric.acid, chlorides, sulphates)),
                 mapping = aes(x = ind, y = values)) +
  geom_boxplot(fill = "pink") +
  theme_dark() +
  labs(title = "Boxplot of Volatile Acidity, Citric Acid, Chlorides, Sulphates",
       x = "Predictors",
       y = "Value")

ggplotly(plot_2)
plot_3 <- ggplot(data = stack(wine %>% select(fixed.acidity, residual.sugar, alcohol)),
                 mapping = aes(x = ind, y = values)) +
  geom_boxplot(fill = "green") +
  theme_dark() +
  labs(title = "Boxplot of Fixed Acidity, Residual Sugar, Alcohol",
       x = "Predictors",
       y = "Value")

ggplotly(plot_3)
plot_4 <- ggplot(data = stack(wine %>% select(free.sulfur.dioxide, total.sulfur.dioxide)),
                 mapping = aes(x = ind, y = values)) +
  geom_boxplot(fill = "cyan") +
  theme_dark() +
  labs(title = "Boxplot of Free and Total Sulfur Dioxide",
       x = "Predictors",
       y = "Value")

ggplotly(plot_4)
Insights:
- Most of the predictors seem to have lots of outliers, so it would not be wise to remove them all at the cost of losing information.
- Two predictors, alcohol and citric.acid, seem to have fairly normal, well-behaved distributions, based on their small number of outliers and their medians sitting close to the center of the boxplot.
4.3 Checking Correlations between Predictors
ggcorr(wine_train, label = T, hjust = 0.9, label_size = 3, layout.exp = 3)
## Warning in ggcorr(wine_train, label = T, hjust = 0.9, label_size = 3, layout.exp
## = 3): data in column(s) 'quality_high' are not numeric and were ignored
Some early insights from the plot above:
- Our target variable, quality, seems to have a reasonably strong correlation with alcohol (positive) and volatile.acidity (negative).
- Conversely, quality seems to have no correlation with free.sulfur.dioxide and residual.sugar.
- There are some strong correlations that might need to be checked, based on the similarity of the names. For example, strong correlations between free.sulfur.dioxide and total.sulfur.dioxide, or between fixed.acidity and volatile.acidity, might indicate multicollinearity, which would be bad for some of our models.
4.5 Characteristics of Top-Rated vs Lowest-Rated Red Wines
Let's remind ourselves of our target variable's current composition.
table(wine$quality_high)
##
## 0 1
## 744 855
Let's see what characteristics our top-rated red wines tend to have compared to the rest. They could have some specific property that is much higher than in the others.
Since we are looking for a metric to represent each whole group for every characteristic (predictor), I think the mean is a better choice than, say, the median, since it also weighs in the outliers, which could accentuate the characteristics that make a group distinctive.
So we will group the data into two groups: red wines rated 6-8, and the rest (rated 3-5). After that, we will summarize each predictor using the mean() function.
wine_char <- wine %>%
  mutate(quality_high = ifelse(quality>=6, "high", "low")) %>%
  group_by(quality_high) %>%
  summarise_all(mean) %>%
  select(-quality)

wine_char
## # A tibble: 2 x 12
## quality_high fixed.acidity volatile.acidity citric.acid residual.sugar
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 high 8.47 0.474 0.300 2.54
## 2 low 8.14 0.590 0.238 2.54
## # ... with 7 more variables: chlorides <dbl>, free.sulfur.dioxide <dbl>,
## # total.sulfur.dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>
To better see the difference, let's visualize it.
plot_9 <- wine_char %>%
  pivot_longer(cols = -quality_high, names_to = "names", values_to = "values") %>%
  mutate(label = glue("Red Wine Quality? {quality_high}
                      Average of {names} = {round(values,2)}")) %>%
  ggplot(mapping = aes(x = names, y = values)) +
  geom_line(aes(group = quality_high, color = quality_high)) +
  geom_jitter(mapping = aes(x = names, y = values, color = quality_high, text = label)) +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "Average Characteristics of Top-Rated vs Lower-Rated Red Wines",
       x = "Predictor",
       y = "Value",
       color = "Red Wine Quality?")

ggplotly(plot_9, tooltip = "label")
Insights:
- Top-rated red wines seem to have a noticeably lower amount of total sulfur dioxide, based on the comparison of predictor averages against the lower-rated group.
- Other predictors differ only slightly between the groups, so these are not necessarily decisive, but they could still help in choosing the right predictors to classify a higher-quality red wine:
- Alcohol, citric acid, sulphates, and fixed acidity tend to be slightly higher in higher-quality red wines.
- Meanwhile, chlorides, free sulfur dioxide, and volatile acidity tend to be slightly lower than in lower-rated red wines.
- Density, pH, and residual sugar are about the same across all red wines.
5 Regression Model: Simple Linear with Multiple Predictors
A regression model is a machine learning model that predicts a numerical target variable. It is part of supervised machine learning, since we have labeled targets to predict instead of just looking for patterns.
We are going to use the simple linear regression model taught in class. This model can be used properly if the following assumptions are fulfilled:
- Linearity
- Normality
- Homoscedasticity
- No Multicollinearity
5.1 Model with No Predictor
First, let's make a model with no predictors, only the target, quality.
model_nopred_lm <- lm(formula = quality~1, data = wine_train %>% select(-quality_high))
summary(model_nopred_lm)
##
## Call:
## lm(formula = quality ~ 1, data = wine_train %>% select(-quality_high))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6364 -0.6364 0.3636 0.3636 2.3636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.63643 0.02223 253.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7951 on 1278 degrees of freedom
5.2 Model using All Predictors (as baseline)
Then the opposite: we'll make a model with all of the predictors.
model_allpred_lm <- lm(formula = quality~., data = wine_train %>% select(-quality_high))
summary(model_allpred_lm)
##
## Call:
## lm(formula = quality ~ ., data = wine_train %>% select(-quality_high))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60102 -0.37142 -0.06258 0.45407 1.97898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.739e+01 2.330e+01 0.746 0.455659
## fixed.acidity -5.110e-03 2.883e-02 -0.177 0.859325
## volatile.acidity -1.013e+00 1.385e-01 -7.316 4.51e-13 ***
## citric.acid -1.343e-01 1.680e-01 -0.799 0.424307
## residual.sugar 1.565e-02 1.600e-02 0.978 0.328300
## chlorides -2.136e+00 4.522e-01 -4.723 2.59e-06 ***
## free.sulfur.dioxide 4.217e-03 2.418e-03 1.744 0.081371 .
## total.sulfur.dioxide -2.961e-03 8.079e-04 -3.665 0.000257 ***
## density -1.239e+01 2.378e+01 -0.521 0.602442
## pH -6.175e-01 2.154e-01 -2.868 0.004205 **
## sulphates 8.816e-01 1.304e-01 6.760 2.10e-11 ***
## alcohol 2.778e-01 2.963e-02 9.377 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6476 on 1267 degrees of freedom
## Multiple R-squared: 0.3422, Adjusted R-squared: 0.3365
## F-statistic: 59.93 on 11 and 1267 DF, p-value: < 2.2e-16
Some interpretations that we can make from this baseline model:
- The columns volatile.acidity, chlorides, total.sulfur.dioxide, sulphates, and alcohol are highly significant, judging by their p-values.
- The columns free.sulfur.dioxide and pH are not as strong, but still significant.
- Since we will most likely use multiple predictors, we will look at the Adjusted R-squared value to evaluate our model; here it is 0.3365 out of 1.0, or about 33.65%. Quite a low score, so let's try to improve it by eliminating unnecessary predictors.
- For now, we do not see any variable causing a perfect separation, i.e. a variable that could single-handedly "predict" the target variable (which would make using a machine learning model pointless); none of the p-values is anywhere near that extreme.
5.3 Feature Selection
Since I do not have much experience with red wines, I cannot put forward any variable that must be included. Therefore, I will use both a manual approach and a stepwise approach for feature selection.
5.3.1 Manual Approach: Significance of Variables
For the manual approach, I will first include the predictors with the highest significance according to the all-predictors model, then try adding the rest one by one and see how the adjusted R-squared reacts.
5.3.1.1 Including Highly Significant Predictors
model_manual_lm <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides, data = wine_train)
summary(model_manual_lm)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.65844 -0.38155 -0.07256 0.45499 1.90341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0060194 0.2285537 13.152 < 2e-16 ***
## alcohol 0.2753780 0.0187735 14.668 < 2e-16 ***
## volatile.acidity -1.0741011 0.1088476 -9.868 < 2e-16 ***
## sulphates 0.8736111 0.1253790 6.968 5.15e-12 ***
## total.sulfur.dioxide -0.0018217 0.0005633 -3.234 0.00125 **
## chlorides -1.8281563 0.4201948 -4.351 1.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6511 on 1273 degrees of freedom
## Multiple R-squared: 0.332, Adjusted R-squared: 0.3294
## F-statistic: 126.6 on 5 and 1273 DF, p-value: < 2.2e-16
Then we'll check combinations that add the less strongly significant, but still notable, predictors.
model_manual_lm_2 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide, data = wine_train)
summary(model_manual_lm_2)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7167 -0.3809 -0.0793 0.4507 1.9261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0010993 0.2284884 13.135 < 2e-16 ***
## alcohol 0.2738784 0.0187956 14.571 < 2e-16 ***
## volatile.acidity -1.0652659 0.1089815 -9.775 < 2e-16 ***
## sulphates 0.8657254 0.1254516 6.901 8.13e-12 ***
## total.sulfur.dioxide -0.0025337 0.0007536 -3.362 0.000797 ***
## chlorides -1.8138314 0.4201475 -4.317 1.70e-05 ***
## free.sulfur.dioxide 0.0033248 0.0023393 1.421 0.155476
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6508 on 1272 degrees of freedom
## Multiple R-squared: 0.3331, Adjusted R-squared: 0.33
## F-statistic: 105.9 on 6 and 1272 DF, p-value: < 2.2e-16
model_manual_lm_3 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+pH, data = wine_train)
summary(model_manual_lm_3)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + pH, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.58930 -0.36205 -0.06625 0.46591 1.92018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3345713 0.4530526 9.567 < 2e-16 ***
## alcohol 0.2886654 0.0191026 15.111 < 2e-16 ***
## volatile.acidity -0.9726581 0.1124522 -8.650 < 2e-16 ***
## sulphates 0.8466241 0.1251183 6.767 2.01e-11 ***
## total.sulfur.dioxide -0.0018725 0.0005611 -3.337 0.000872 ***
## chlorides -2.1225431 0.4273794 -4.966 7.75e-07 ***
## pH -0.4454254 0.1313337 -3.392 0.000716 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6484 on 1272 degrees of freedom
## Multiple R-squared: 0.338, Adjusted R-squared: 0.3349
## F-statistic: 108.3 on 6 and 1272 DF, p-value: < 2.2e-16
model_manual_lm_4 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH, data = wine_train)
summary(model_manual_lm_4)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.66643 -0.36177 -0.06254 0.46695 1.98981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4648450 0.4569986 9.770 < 2e-16 ***
## alcohol 0.2878711 0.0190830 15.085 < 2e-16 ***
## volatile.acidity -0.9493919 0.1128938 -8.410 < 2e-16 ***
## sulphates 0.8324306 0.1251584 6.651 4.31e-11 ***
## total.sulfur.dioxide -0.0029073 0.0007567 -3.842 0.000128 ***
## chlorides -2.1322722 0.4268792 -4.995 6.70e-07 ***
## free.sulfur.dioxide 0.0048078 0.0023622 2.035 0.042026 *
## pH -0.4914877 0.1331098 -3.692 0.000232 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6476 on 1271 degrees of freedom
## Multiple R-squared: 0.3402, Adjusted R-squared: 0.3365
## F-statistic: 93.61 on 7 and 1271 DF, p-value: < 2.2e-16
It's worth noting that adding free.sulfur.dioxide alone leaves it not particularly significant, while adding pH alone keeps it very significant, and adding both makes the model look slightly better in terms of adjusted R-squared, though strangely it is still about the same as our baseline model with all predictors.
Then, I'll try adding the other seemingly insignificant predictors one by one to see how the adjusted R-squared changes.
model_manual_lm_5 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+residual.sugar, data = wine_train)
summary(model_manual_lm_5)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + residual.sugar, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6576 -0.3567 -0.0628 0.4692 1.9999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4380410 0.4594336 9.660 < 2e-16 ***
## alcohol 0.2866100 0.0192107 14.919 < 2e-16 ***
## volatile.acidity -0.9499212 0.1129268 -8.412 < 2e-16 ***
## sulphates 0.8374677 0.1254901 6.674 3.72e-11 ***
## total.sulfur.dioxide -0.0029587 0.0007621 -3.883 0.000109 ***
## chlorides -2.1499277 0.4280675 -5.022 5.83e-07 ***
## free.sulfur.dioxide 0.0046857 0.0023721 1.975 0.048442 *
## pH -0.4842568 0.1337233 -3.621 0.000305 ***
## residual.sugar 0.0073944 0.0127104 0.582 0.560830
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6478 on 1270 degrees of freedom
## Multiple R-squared: 0.3404, Adjusted R-squared: 0.3362
## F-statistic: 81.91 on 8 and 1270 DF, p-value: < 2.2e-16
model_manual_lm_6 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+citric.acid, data = wine_train)
summary(model_manual_lm_6)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + citric.acid, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.63594 -0.36888 -0.06171 0.45082 1.98029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8330762 0.5182401 9.326 < 2e-16 ***
## alcohol 0.2930494 0.0193818 15.120 < 2e-16 ***
## volatile.acidity -1.0506079 0.1313791 -7.997 2.85e-15 ***
## sulphates 0.8509021 0.1256977 6.769 1.97e-11 ***
## total.sulfur.dioxide -0.0027113 0.0007675 -3.533 0.000426 ***
## chlorides -2.0252085 0.4325638 -4.682 3.15e-06 ***
## free.sulfur.dioxide 0.0043050 0.0023845 1.805 0.071254 .
## pH -0.5926948 0.1490904 -3.975 7.42e-05 ***
## citric.acid -0.2073729 0.1378676 -1.504 0.132792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 1270 degrees of freedom
## Multiple R-squared: 0.3413, Adjusted R-squared: 0.3372
## F-statistic: 82.27 on 8 and 1270 DF, p-value: < 2.2e-16
model_manual_lm_7 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+fixed.acidity, data = wine_train)
summary(model_manual_lm_7)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + fixed.acidity, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60344 -0.37288 -0.05657 0.46191 1.97685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2937226 0.6966774 7.599 5.79e-14 ***
## alcohol 0.2878564 0.0190719 15.093 < 2e-16 ***
## volatile.acidity -0.9607030 0.1130561 -8.498 < 2e-16 ***
## sulphates 0.8507531 0.1256249 6.772 1.93e-11 ***
## total.sulfur.dioxide -0.0030923 0.0007653 -4.040 5.66e-05 ***
## chlorides -2.2274465 0.4308856 -5.169 2.72e-07 ***
## free.sulfur.dioxide 0.0048229 0.0023608 2.043 0.04127 *
## pH -0.6792675 0.1786103 -3.803 0.00015 ***
## fixed.acidity -0.0236195 0.0149909 -1.576 0.11537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 1270 degrees of freedom
## Multiple R-squared: 0.3415, Adjusted R-squared: 0.3373
## F-statistic: 82.31 on 8 and 1270 DF, p-value: < 2.2e-16
model_manual_lm_8 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+density, data = wine_train)
summary(model_manual_lm_8)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + density, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.65102 -0.36437 -0.06408 0.46023 1.96303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.897e+01 1.197e+01 1.585 0.113119
## alcohol 2.755e-01 2.163e-02 12.738 < 2e-16 ***
## volatile.acidity -9.435e-01 1.130e-01 -8.351 < 2e-16 ***
## sulphates 8.585e-01 1.270e-01 6.762 2.08e-11 ***
## total.sulfur.dioxide -2.944e-03 7.572e-04 -3.888 0.000106 ***
## chlorides -2.149e+00 4.270e-01 -5.033 5.52e-07 ***
## free.sulfur.dioxide 4.796e-03 2.362e-03 2.031 0.042503 *
## pH -5.321e-01 1.372e-01 -3.877 0.000111 ***
## density -1.431e+01 1.179e+01 -1.213 0.225272
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6475 on 1270 degrees of freedom
## Multiple R-squared: 0.3409, Adjusted R-squared: 0.3368
## F-statistic: 82.12 on 8 and 1270 DF, p-value: < 2.2e-16
It seems like adding citric.acid or fixed.acidity to the predictors increases the adjusted R-squared a little compared to our baseline, while the other two decrease it. As a final step of the manual approach, I will include both of these predictors.
model_manual_lm_9 <- lm(formula = quality~alcohol+volatile.acidity+sulphates+total.sulfur.dioxide+chlorides+free.sulfur.dioxide+pH+citric.acid+fixed.acidity, data = wine_train)
summary(model_manual_lm_9)
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + chlorides + free.sulfur.dioxide +
## pH + citric.acid + fixed.acidity, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60561 -0.37695 -0.05421 0.45300 1.97536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.245442 0.699839 7.495 1.24e-13 ***
## alcohol 0.290960 0.019530 14.899 < 2e-16 ***
## volatile.acidity -1.017605 0.136675 -7.445 1.78e-13 ***
## sulphates 0.855868 0.125837 6.801 1.59e-11 ***
## total.sulfur.dioxide -0.002915 0.000802 -3.635 0.000289 ***
## chlorides -2.132534 0.449585 -4.743 2.34e-06 ***
## free.sulfur.dioxide 0.004517 0.002397 1.885 0.059725 .
## pH -0.678972 0.178642 -3.801 0.000151 ***
## citric.acid -0.124092 0.167425 -0.741 0.458720
## fixed.acidity -0.015965 0.018206 -0.877 0.380722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6474 on 1269 degrees of freedom
## Multiple R-squared: 0.3417, Adjusted R-squared: 0.3371
## F-statistic: 73.2 on 9 and 1269 DF, p-value: < 2.2e-16
It appears that the adjusted R-squared decreases a little bit.
To make sure, we will compare our baseline model model_allpred_lm, which used all predictors; model_manual_lm_4, which used only the most significant predictors; and model_manual_lm_7, which added another predictor that at first seemed insignificant but improved the adjusted R-squared.
compare_performance(model_allpred_lm, model_manual_lm_4, model_manual_lm_7)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | AIC weights | BIC | BIC weights | R2 | R2 (adj.) | RMSE | Sigma
## ---------------------------------------------------------------------------------------------------------------
## model_allpred_lm | lm | 2532.337 | 0.057 | 2599.336 | < 0.001 | 0.342 | 0.337 | 0.645 | 0.648
## model_manual_lm_4 | lm | 2528.359 | 0.413 | 2574.744 | 0.911 | 0.340 | 0.337 | 0.646 | 0.648
## model_manual_lm_7 | lm | 2527.861 | 0.530 | 2579.400 | 0.089 | 0.341 | 0.337 | 0.645 | 0.647
The adjusted R2 values look almost the same, but from the summaries we can see that model_manual_lm_7 is the best: its RMSE is slightly lower, and so is its AIC. So we will take model_manual_lm_7 as the result of our manual feature selection, with an adjusted R-squared of 0.3373.
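As a quick cross-check of the figures quoted above, the adjusted R-squared of each candidate model can also be pulled straight from its summary. A small convenience sketch (the list names are arbitrary):
# Extract the adjusted R-squared of the candidate models in one go.
sapply(list(allpred  = model_allpred_lm,
            manual_4 = model_manual_lm_4,
            manual_7 = model_manual_lm_7),
       function(m) summary(m)$adj.r.squared)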
5.3.2 Stepwise Approach
Now this is simpler than the manual approach. The algorithm automatically searches by adding or deleting predictors based on the AIC (Akaike Information Criterion) value; the lower, the better.
model_stepwise_both_lm <- step(object = model_allpred_lm, scope = list(lower = model_nopred_lm, upper = model_allpred_lm), direction = "both", trace = F)
summary(model_stepwise_both_lm)
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol + fixed.acidity,
## data = wine_train %>% select(-quality_high))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60344 -0.37288 -0.05657 0.46191 1.97685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2937226 0.6966774 7.599 5.79e-14 ***
## volatile.acidity -0.9607030 0.1130561 -8.498 < 2e-16 ***
## chlorides -2.2274465 0.4308856 -5.169 2.72e-07 ***
## free.sulfur.dioxide 0.0048229 0.0023608 2.043 0.04127 *
## total.sulfur.dioxide -0.0030923 0.0007653 -4.040 5.66e-05 ***
## pH -0.6792675 0.1786103 -3.803 0.00015 ***
## sulphates 0.8507531 0.1256249 6.772 1.93e-11 ***
## alcohol 0.2878564 0.0190719 15.093 < 2e-16 ***
## fixed.acidity -0.0236195 0.0149909 -1.576 0.11537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 1270 degrees of freedom
## Multiple R-squared: 0.3415, Adjusted R-squared: 0.3373
## F-statistic: 82.31 on 8 and 1270 DF, p-value: < 2.2e-16
If we compare it to our previously chosen model_manual_lm_7, the selected predictors are exactly the same. Therefore, we can pretty much conclude that our best model using this simple linear regression method explains only around 33.73% of the variation in the target variable; the rest is explained by variables other than the ones we used, perhaps variables not provided in the dataset, or it may simply be that this model is not well suited to this dataset.
5.4 Verifying Linear Regression Assumptions
5.4.1 Linearity
This should be checked before creating the linear regression models, and we did so in the correlation plot: some variables have reasonably strong correlations with our target variable.
This assumption also governs how the model summary is interpreted.
summary(model_stepwise_both_lm)
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol + fixed.acidity,
## data = wine_train %>% select(-quality_high))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.60344 -0.37288 -0.05657 0.46191 1.97685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2937226 0.6966774 7.599 5.79e-14 ***
## volatile.acidity -0.9607030 0.1130561 -8.498 < 2e-16 ***
## chlorides -2.2274465 0.4308856 -5.169 2.72e-07 ***
## free.sulfur.dioxide 0.0048229 0.0023608 2.043 0.04127 *
## total.sulfur.dioxide -0.0030923 0.0007653 -4.040 5.66e-05 ***
## pH -0.6792675 0.1786103 -3.803 0.00015 ***
## sulphates 0.8507531 0.1256249 6.772 1.93e-11 ***
## alcohol 0.2878564 0.0190719 15.093 < 2e-16 ***
## fixed.acidity -0.0236195 0.0149909 -1.576 0.11537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6473 on 1270 degrees of freedom
## Multiple R-squared: 0.3415, Adjusted R-squared: 0.3373
## F-statistic: 82.31 on 8 and 1270 DF, p-value: < 2.2e-16
The linearity assumption also describes how each of the coefficients (the Estimates in the model summary above) can be interpreted.
For example, take the predictor volatile.acidity, whose Estimate is around -0.96. This can be interpreted as: holding the other predictors fixed, an increase of 1 unit in volatile.acidity decreases our target variable, quality, by about 0.96 (a small sketch verifying this is shown below).
The same interpretation holds for each estimate in the linear regression model, one per predictor.
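To make that interpretation concrete, here is a minimal sketch: take one training row, duplicate it, raise volatile.acidity by exactly 1 unit, and compare the two predictions (example_wine is a hypothetical helper object introduced only for this illustration):
# Duplicate the first training row and raise volatile.acidity by 1 unit.
example_wine <- wine_train[c(1, 1), ]
example_wine$volatile.acidity[2] <- example_wine$volatile.acidity[1] + 1

# The difference between the two predictions should equal the
# volatile.acidity coefficient, roughly -0.96.
diff(predict(model_stepwise_both_lm, newdata = example_wine))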
5.4.2 Normality
Checking with Shapiro Test:
shapiro.test(model_stepwise_both_lm$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_stepwise_both_lm$residuals
## W = 0.9903, p-value = 1.784e-07
The p-value is very small, lower than the default alpha of 0.05! This says that the residuals of our model are not normally distributed. Let's see if the plot can convince us otherwise.
plot(density(model_stepwise_both_lm$residuals))
Sure, it's not a textbook normal bell curve, but I would argue it is close enough that we can let this assumption pass.
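For anyone who prefers it, another common visual check is a normal Q-Q plot of the residuals; a quick optional sketch:
# Residuals of a roughly normal model should hug the reference line.
qqnorm(model_stepwise_both_lm$residuals)
qqline(model_stepwise_both_lm$residuals, col = "red")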
5.4.3 Homoscedasticity
To be objective, let's use the Breusch-Pagan test.
bptest(model_stepwise_both_lm)
##
## studentized Breusch-Pagan test
##
## data: model_stepwise_both_lm
## BP = 45.346, df = 8, p-value = 3.164e-07
Once again, the p-value is very small (< the alpha of 0.05), indicating that the residuals are heteroscedastic, i.e. their spread shows some kind of pattern, whereas we want them to be homoscedastic, spread randomly. Let's look at the plot to see whether we can still let this assumption pass and safely use the model.
plot(model_stepwise_both_lm$fitted.values, model_stepwise_both_lm$residuals)
abline(h = 0, col = "red")
Unfortunately, the residuals do seem to show a pattern here, visible as diagonal streaks of points in the plot. Visually, I agree with the BP test result: this dataset may not be a great fit for linear regression. We can still use the model, though, since it is rare to find a dataset that fulfills every assumption perfectly.
5.4.4 Multicollinearity
Though one of the assumptions above fails its test, we should still check whether the model passes this one. VIF (Variance Inflation Factor) measures how much the coefficient variances are inflated by multicollinearity, i.e. by a few predictors being strongly correlated with and affecting each other.
vif(model_stepwise_both_lm)
## volatile.acidity chlorides free.sulfur.dioxide
## 1.231478 1.415006 1.869230
## total.sulfur.dioxide pH sulphates
## 1.961983 2.259270 1.381130
## alcohol fixed.acidity
## 1.221879 2.048816
The rule of thumb is that each VIF value should be under 10. Since this holds for all of our predictors, we can safely say there is no multicollinearity in our model, so it passes this assumption.
5.5 Prediction using Test Data
Since our model passes most of the assumptions, instead of predicting the test data that we set aside at the beginning, we can simply refit the same predictors from the best model we found while comparing models and verifying assumptions, this time on the test data.
summary(lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
             total.sulfur.dioxide + pH + sulphates + alcohol + fixed.acidity,
           data = wine_test))
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol + fixed.acidity,
## data = wine_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25444 -0.39954 0.01312 0.39674 1.81988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.773247 1.259682 1.408 0.160220
## volatile.acidity -1.215154 0.223025 -5.449 1.03e-07 ***
## chlorides -0.256637 1.188869 -0.216 0.829233
## free.sulfur.dioxide 0.009265 0.004887 1.896 0.058934 .
## total.sulfur.dioxide -0.006150 0.001636 -3.759 0.000204 ***
## pH 0.112568 0.325102 0.346 0.729386
## sulphates 1.047150 0.231117 4.531 8.39e-06 ***
## alcohol 0.284235 0.035207 8.073 1.52e-14 ***
## fixed.acidity 0.075909 0.028401 2.673 0.007921 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6378 on 311 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4459
## F-statistic: 33.09 on 8 and 311 DF, p-value: < 2.2e-16
It is actually somewhat better than the result on the training data, though not by far. We can say that 44.59% of the variation in our target quality variable can be explained by the same set of predictors on the test data. As an additional check, the sketch below evaluates the trained model by predicting the held-out set directly.
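For reference, here is that alternative check: keep the model fitted on the training data, predict the held-out wine_test set, and measure the error (pred_test and rmse_test are helper names introduced only here):
# Predict the held-out test set with the model trained on wine_train.
pred_test <- predict(model_stepwise_both_lm, newdata = wine_test)

# Root mean squared error on the test set, in the same units as quality.
rmse_test <- sqrt(mean((wine_test$quality - pred_test)^2))
rmse_test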
5.6 Conclusions
- Our best simple linear regression model is found by both the manual and the stepwise approach, and involves the following predictors:
- volatile.acidity
- chlorides
- free.sulfur.dioxide
- total.sulfur.dioxide
- pH
- sulphates
- alcohol
- fixed.acidity
- The adjusted R-squared on the training data is about 33.73%, while on the test data it improves to 44.59%.
- Our model is not so perfect that it passes all of the assumptions, but it fails only one, so let's see whether another model turns out to be "objectively" better than this one.
6 Classification Model: Logistic Regression
Logistic regression is a machine learning model on the classification side of prediction: its main purpose is to predict a categorical (factor/class) target variable. It is somewhat similar to our simple linear regression model in that it also produces a regression formula, but that formula cannot be interpreted directly, because the model works on the log of odds. The prediction for the target variable is a log of odds that can be translated into a probability, to which we then apply a threshold to classify the result into one of the two possible classes, 1 or 0. A minimal sketch of this pipeline is shown below.
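The sketch below illustrates that log-of-odds pipeline; the numbers are made up for illustration, and plogis() is base R's inverse logit, equivalent to inv.logit() from gtools:
# Hypothetical log-of-odds values produced by a logistic regression.
log_odds <- c(-2.1, 0.3, 1.7)

# Convert log-of-odds to probabilities, then apply a 0.5 threshold.
probs <- plogis(log_odds)
ifelse(probs > 0.5, 1, 0)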
This model’s assumptions that would need to be fulfilled:
- Linearity of Predictor & Log of Odds
- Multicollinearity
- Independence of Observations
6.1 Feature Selection
As with the linear regression model, we would need to use the most relevant predictors, so we need to do some feature selection.
6.1.1 Model with No Predictor
model_nopred_glm <- glm(formula = quality_high~1, data = wine_train, family = "binomial")
summary(model_nopred_glm)
## 
## Call:
## glm(formula = quality_high ~ 1, family = "binomial", data = wine_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.237  -1.237   1.119   1.119   1.119  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.13940    0.05606   2.487   0.0129 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.9  on 1278  degrees of freedom
## Residual deviance: 1766.9  on 1278  degrees of freedom
## AIC: 1768.9
## 
## Number of Fisher Scoring iterations: 3
6.1.2 Model with All Predictors (as baseline)
model_allpred_glm <- glm(formula = quality_high~., data = wine_train %>% select(-quality), family = "binomial")
summary(model_allpred_glm)
## 
## Call:
## glm(formula = quality_high ~ ., family = "binomial", data = wine_train %>% 
##     select(-quality))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3271  -0.8611   0.3231   0.8499   2.2927  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            6.034168  85.658299   0.070  0.94384    
## fixed.acidity          0.044987   0.107873   0.417  0.67665    
## volatile.acidity      -3.104694   0.541488  -5.734 9.83e-09 ***
## citric.acid           -1.156493   0.633898  -1.824  0.06809 .  
## residual.sugar         0.042083   0.056802   0.741  0.45877    
## chlorides             -4.798408   1.707187  -2.811  0.00494 ** 
## free.sulfur.dioxide    0.022396   0.008999   2.489  0.01282 *  
## total.sulfur.dioxide  -0.014448   0.003074  -4.700 2.60e-06 ***
## density              -10.770445  87.471430  -0.123  0.90200    
## pH                    -1.198891   0.799158  -1.500  0.13356    
## sulphates              2.613487   0.502181   5.204 1.95e-07 ***
## alcohol                0.902744   0.115363   7.825 5.07e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.9  on 1278  degrees of freedom
## Residual deviance: 1353.5  on 1267  degrees of freedom
## AIC: 1377.5
## 
## Number of Fisher Scoring iterations: 4
For classification models, one of the metrics we can use for evaluation is the AIC value, a measure of information loss. We can see that the model using all predictors has a lower AIC than the no-predictor model.
Since we have no prior assumptions for or against any of the variables, and the highly significant predictors are much the same as the ones we found in the linear regression model, we will just use a stepwise approach in both directions.
6.1.3 Stepwise Approach
model_stepwise_both_glm <- step(object = model_allpred_glm, scope = list(lower = model_nopred_glm, upper = model_allpred_glm), direction = "both", trace = F)
summary(model_stepwise_both_glm)
## 
## Call:
## glm(formula = quality_high ~ volatile.acidity + citric.acid + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, family = "binomial", data = wine_train %>% 
##     select(-quality))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2275  -0.8637   0.3239   0.8610   2.2927  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.574704   1.871666  -1.910  0.05615 .  
## volatile.acidity     -3.006772   0.507219  -5.928 3.07e-09 ***
## citric.acid          -0.917448   0.509761  -1.800  0.07190 .  
## chlorides            -4.986193   1.620821  -3.076  0.00210 ** 
## free.sulfur.dioxide   0.023707   0.008907   2.662  0.00778 ** 
## total.sulfur.dioxide -0.014604   0.002902  -5.032 4.86e-07 ***
## pH                   -1.458359   0.541300  -2.694  0.00706 ** 
## sulphates             2.574440   0.486009   5.297 1.18e-07 ***
## alcohol               0.915012   0.083324  10.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.9  on 1278  degrees of freedom
## Residual deviance: 1354.5  on 1270  degrees of freedom
## AIC: 1372.5
## 
## Number of Fisher Scoring iterations: 4
It is different from the one we found with the linear regression model, in that the current model includes citric.acid and excludes fixed.acidity. Interesting! Let's compare this model's performance with our baseline logistic regression model using all predictors.
compare_performance(model_allpred_glm, model_stepwise_both_glm)
## # Comparison of Model Performance Indices
## 
## Name                    | Model |      AIC | AIC weights |      BIC | BIC weights | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log | Score_spherical |   PCP
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
## model_allpred_glm       |   glm | 1377.482 |       0.075 | 1439.328 |     < 0.001 |     0.290 | 0.420 | 1.034 |    0.529 |      -Inf |       9.549e-04 | 0.647
## model_stepwise_both_glm |   glm | 1372.460 |       0.925 | 1418.845 |       1.000 |     0.289 | 0.420 | 1.033 |    0.529 |      -Inf |           0.001 | 0.646
We can see that the AIC is lower, which is good (less information is lost), while the RMSE is about the same. Therefore we will use this stepwise model to represent the logistic regression approach.
6.1.4 Fitted Values vs Real Labels from Training Data
Before moving to the test data, let's look at some metrics on the training data that we can later compare against the test results and against other models.
confusionMatrix(data = as.factor(ifelse(model_stepwise_both_glm$fitted.values>0.5, 1, 0)), reference = wine_train$quality_high, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 439 178
##          1 156 506
## 
##                Accuracy : 0.7389
##                  95% CI : (0.7139, 0.7628)
##     No Information Rate : 0.5348
##     P-Value [Acc > NIR] : <2e-16
## 
##                   Kappa : 0.4764
## 
##  Mcnemar's Test P-Value : 0.2505
## 
##             Sensitivity : 0.7398
##             Specificity : 0.7378
##          Pos Pred Value : 0.7644
##          Neg Pred Value : 0.7115
##              Prevalence : 0.5348
##          Detection Rate : 0.3956
##    Detection Prevalence : 0.5176
##       Balanced Accuracy : 0.7388
## 
##        'Positive' Class : 1
## 
As we can see from the class proportions, which were quite balanced, we can safely use the Accuracy value to pre-evaluate our model before seeing whether it survives the assumption checks and the test data. The Accuracy is 73.89%, which is quite good for now.
Since this is about estimating red wine quality, I'd say it is not a preventive case, so I would choose Precision as my secondary metric. From the confusion matrix above, the Pos Pred Value (another name for Precision) is 76.44%, which is also a pretty good score!
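To make those two metrics explicit, here is a small sketch that recomputes them by hand from the same fitted values (pred_train and tab are helper names introduced only for this illustration):
# Rebuild the training confusion matrix manually.
pred_train <- as.factor(ifelse(model_stepwise_both_glm$fitted.values > 0.5, 1, 0))
tab <- table(Prediction = pred_train, Reference = wine_train$quality_high)

accuracy  <- sum(diag(tab)) / sum(tab)        # (TP + TN) / all predictions
precision <- tab["1", "1"] / sum(tab["1", ])  # TP / (TP + FP)
c(accuracy = accuracy, precision = precision)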
6.2 Verifying Assumptions
There are 3 assumptions of the logistic regression model that, when fulfilled, convince us the model is proper to use. The assumptions are:
- Linearity of Predictor & Log of Odds
- Multicollinearity
- Independence of Observations
6.2.1 Linearity of Predictor & Log of Odds
This is basically a reminder that the interpretation of the logistic regression summary is different from the linear regression one, even though it looks very similar. The biggest difference is that the numbers in the summary are on a log-of-odds scale, so we need the exp() function or the inv.logit() function to interpret them one by one.
Let's see the summary of the chosen model again.
summary(model_stepwise_both_glm)
## 
## Call:
## glm(formula = quality_high ~ volatile.acidity + citric.acid + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, family = "binomial", data = wine_train %>% 
##     select(-quality))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2275  -0.8637   0.3239   0.8610   2.2927  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.574704   1.871666  -1.910  0.05615 .  
## volatile.acidity     -3.006772   0.507219  -5.928 3.07e-09 ***
## citric.acid          -0.917448   0.509761  -1.800  0.07190 .  
## chlorides            -4.986193   1.620821  -3.076  0.00210 ** 
## free.sulfur.dioxide   0.023707   0.008907   2.662  0.00778 ** 
## total.sulfur.dioxide -0.014604   0.002902  -5.032 4.86e-07 ***
## pH                   -1.458359   0.541300  -2.694  0.00706 ** 
## sulphates             2.574440   0.486009   5.297 1.18e-07 ***
## alcohol               0.915012   0.083324  10.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.9  on 1278  degrees of freedom
## Residual deviance: 1354.5  on 1270  degrees of freedom
## AIC: 1372.5
## 
## Number of Fisher Scoring iterations: 4
For example, let's try to interpret how the predictor volatile.acidity affects the target variable quality_high:
- It has a p-value (Pr(>|z|)) that is very small (3.07e-09). R marks this predictor as highly significant: the value is not only below the default cutoff of 0.05 but very close to 0, hence the three-star code next to it. We can compare how significant each predictor is using the Signif. codes legend beneath the coefficients.
- Its Estimate is -3.006772. This is where the interpretation of logistic regression mainly comes in: an increase of 1 unit in volatile.acidity changes the log-of-odds of quality_high by -3.006772, i.e. decreases it by about 3.01. Log-of-odds are hard to reason about directly, so we can use the inv.logit() function to get a better feel for this number.
function to better understand this.inv.logit(-3.006772)
## [1] 0.04712087
The result, about 0.047, is well below 0.5, which lines up with the negative coefficient: as volatile.acidity increases, the chance of the wine being high quality (rated 6-8) goes down rather than up. As a reminder, a probability only runs from 0 to 1. We could interpret each remaining predictor the same way (using inv.logit() or exp()), but we'll skip the rest for now.
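Another way to read the same coefficient, shown here only as a quick sketch on the same fitted model, is to exponentiate it into an odds ratio, which is often easier to communicate than log-odds:

# Odds-ratio view of the volatile.acidity coefficient (sketch)
exp(coef(model_stepwise_both_glm)["volatile.acidity"])
# roughly 0.049: each 1-unit increase in volatile.acidity multiplies the odds
# of quality_high = 1 by about 0.05, i.e. it cuts the odds sharply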
6.2.2 Multicollinearity
This is the same assumption we discussed and checked for the linear regression model: we should make sure that no subset of our predictors is tightly correlated with each other.
We can use VIF (Variance Inflation Factor) test for this.
vif(model_stepwise_both_glm)
## volatile.acidity citric.acid chlorides ## 1.613645 2.194901 1.499875 ## free.sulfur.dioxide total.sulfur.dioxide pH ## 1.937301 1.952260 1.561357 ## sulphates alcohol ## 1.440349 1.168738
The rule of thumb is that each VIF value should be under 10. Since that holds for all of the predictors used, we can safely say there is no multicollinearity in our model, so it passes this assumption.
6.2.3 Independence of Observations
This assumption requires that we are not taking repeated measurements that would bias the analysis; in other words, we don't want any correlation between observations, i.e. between the rows of the tabular data.
This assumption is easily fulfilled by random sampling during data collection, or, on the analysis side, by a random cross-validation or train-test split, which we did at the beginning.
6.3 Model Evaluation
We passed the assumptions! Now we can try to evaluate our model using the test data that we have put aside from the beginning for bias prevention.
glm_pred <- as.factor(ifelse(predict(object = model_stepwise_both_glm,
                                     newdata = wine_test,
                                     type = "response") > 0.5, 1, 0))

confusionMatrix(data = glm_pred,
                reference = wine_test$quality_high,
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 112 37
## 1 37 134
##
## Accuracy : 0.7688
## 95% CI : (0.7186, 0.8138)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5353
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7836
## Specificity : 0.7517
## Pos Pred Value : 0.7836
## Neg Pred Value : 0.7517
## Prevalence : 0.5344
## Detection Rate : 0.4188
## Detection Prevalence : 0.5344
## Balanced Accuracy : 0.7677
##
## 'Positive' Class : 1
##
Our Accuracy is 76.88%, and our Precision (Pos Pred Value) is 78.36%. Somehow slightly better than our training results, but we can't complain!
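As a quick sanity check, both headline numbers can be recomputed by hand from the confusion matrix above:

# Accuracy and Precision recomputed from the test confusion matrix
c(accuracy  = (112 + 134) / (112 + 37 + 37 + 134),  # 0.7688
  precision = 134 / (134 + 37))                     # 0.7836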
6.4 Conclusions
-
Using a stepwise approach, the best model is built from predictors similar to those in our previous linear regression model, with one predictor swapped for another. Here is the full list:
- volatile.acidity
- chlorides
- free.sulfur.dioxide
- total.sulfur.dioxide
- pH
- sulphates
- alcohol
- citric.acid
- The Accuracy of our logistic regression model is 76.88%, and the Precision is 78.36%.
7 Classification Model: k-NN
k-NN, or k-Nearest Neighbours, is a machine learning classification model that compares the characteristics of a new/unseen observation against the existing/training data. The "distance" between characteristics is measured with Euclidean distance; the distances are then ordered from shortest to longest, and the k closest training observations (where k is chosen by the data scientist) each cast a "vote" for their class. The majority class among those neighbours wins and determines the predicted class.
This model is better for numerical predictors, which is a perfect fit for this dataset.
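To make the distance idea concrete, here is a tiny sketch with two hypothetical wines described by just two features (the numbers are made up, not rows from our dataset):

# Euclidean distance between two hypothetical wines
p1 <- c(alcohol = 10.2, sulphates = 0.62)
p2 <- c(alcohol = 11.1, sulphates = 0.73)
sqrt(sum((p1 - p2)^2))   # k-NN ranks training points by this kind of distance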
7.1 Pre-processing and Scaling the Dataset
Let's check the dataset summary and the range of the data (minimum and maximum values of each predictor).
summary(wine_train)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.1000 1st Qu.: 1.900
## Median : 7.900 Median :0.5200 Median :0.2600 Median : 2.200
## Mean : 8.321 Mean :0.5271 Mean :0.2729 Mean : 2.551
## 3rd Qu.: 9.300 3rd Qu.:0.6400 3rd Qu.:0.4200 3rd Qu.: 2.600
## Max. :15.900 Max. :1.3300 Max. :1.0000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 8.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08833 Mean :15.94 Mean : 46.61 Mean :0.9968
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9979
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality quality_high
## Min. :2.74 Min. :0.3700 Min. : 8.40 Min. :3.000 0:595
## 1st Qu.:3.20 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000 1:684
## Median :3.31 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.31 Mean :0.6586 Mean :10.41 Mean :5.636
## 3rd Qu.:3.40 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.01 Max. :2.0000 Max. :14.90 Max. :8.000
It's obvious that the scale of each predictor is different, and since this algorithm would be biased toward predictors with larger numbers, we need to exclude the quality column and then scale all of the numerical columns so that every predictor weighs the same when used in the k-NN method.
We also need to separate the predictors (x) from the target variable (y, quality_high).
7.1.1 Separating Predictors and Target Variables
Before scaling the data points, we need to separate the labels, i.e. the target variable, from the predictors, since the target does not need to be scaled. Don't forget to exclude the quality column too, since we no longer need it in this k-NN model or in any of the other classification models.
wine_train_x <- wine_train %>% 
  select(-c("quality", "quality_high"))

wine_train_y <- wine_train %>% 
  pull(quality_high)

wine_test_x <- wine_test %>% 
  select(-c("quality", "quality_high"))

wine_test_y <- wine_test %>% 
  pull(quality_high)
7.1.2 Scaling
For standardized and comparable results, it is best to use z-score scaling: the mean of each column is shifted to 0, and every other value is expressed as how many standard deviations it deviates from that mean, i.e. \(z = (x - \mu) / \sigma\).
This is simply done with the scale() function. We do this for the predictors in the training dataset first, but not yet for the testing ones.
wine_train_x_scaled <- scale(wine_train_x)
For the testing set, we need to reuse the centering and scaling parameters from the training data, because the test set is supposed to be unseen data that follows the same "rules" as the training data.
wine_test_x_scaled <- scale(wine_test_x,
                            center = attr(wine_train_x_scaled, "scaled:center"),
                            scale = attr(wine_train_x_scaled, "scaled:scale"))
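As a small sketch, we can double-check that scale() performs exactly the z-score computation described above, using the alcohol column as an example:

# Manual z-score for one column should match the scale() output
manual_alcohol_z <- (wine_train_x$alcohol - mean(wine_train_x$alcohol)) /
  sd(wine_train_x$alcohol)

head(manual_alcohol_z)
head(wine_train_x_scaled[, "alcohol"])   # expected to be (nearly) identical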
Let’s check the summary of the predictors, to make sure that they are quite equally scaled now.
summary(wine_train_x_scaled)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :-2.1526 Min. :-2.2905 Min. :-1.41082 Min. :-1.11466
## 1st Qu.:-0.7065 1st Qu.:-0.7712 1st Qu.:-0.89379 1st Qu.:-0.43964
## Median :-0.2437 Median :-0.0397 Median :-0.06654 Median :-0.23713
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.5661 3rd Qu.: 0.6355 3rd Qu.: 0.76071 3rd Qu.: 0.03288
## Max. : 4.3838 Max. : 4.5181 Max. : 3.75949 Max. : 8.74072
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :-1.52715 Min. :-1.4247 Min. :-1.2256
## 1st Qu.:-0.36678 1st Qu.:-0.7572 1st Qu.:-0.7427
## Median :-0.18672 Median :-0.1849 Median :-0.2599
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.03335 3rd Qu.: 0.4827 3rd Qu.: 0.4644
## Max. :10.45671 Max. : 5.3466 Max. : 7.3148
## density pH sulphates alcohol
## Min. :-3.551693 Min. :-3.739176 Min. :-1.7040 Min. :-1.9184
## 1st Qu.:-0.603171 1st Qu.:-0.720146 1st Qu.:-0.6413 1st Qu.:-0.8702
## Median :-0.002841 Median : 0.001796 Median :-0.2280 Median :-0.2031
## Mean : 0.000000 Mean : 0.000000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.594832 3rd Qu.: 0.592476 3rd Qu.: 0.4214 3rd Qu.: 0.6546
## Max. : 3.684140 Max. : 4.595972 Max. : 7.9195 Max. : 4.2757
summary(wine_test_x_scaled)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :-1.979064 Min. :-2.29046 Min. :-1.41082 Min. :-1.11466
## 1st Qu.:-0.706482 1st Qu.:-0.71493 1st Qu.:-0.99720 1st Qu.:-0.43964
## Median :-0.243725 Median :-0.09597 Median :-0.16995 Median :-0.23713
## Mean :-0.004935 Mean : 0.02149 Mean :-0.04893 Mean :-0.04211
## 3rd Qu.: 0.450410 3rd Qu.: 0.63552 3rd Qu.: 0.76071 3rd Qu.: 0.03288
## Max. : 4.210310 Max. : 5.92479 Max. : 2.51862 Max. : 5.70310
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :-0.98698 Min. :-1.42475 Min. :-1.2256 Min. :-3.55169
## 1st Qu.:-0.36678 1st Qu.:-0.85252 1st Qu.:-0.7503 1st Qu.:-0.65630
## Median :-0.18672 Median :-0.23261 Median :-0.3202 Median :-0.02409
## Mean :-0.08663 Mean :-0.03054 Mean :-0.0216 Mean :-0.02301
## 3rd Qu.: 0.03335 3rd Qu.: 0.57804 3rd Qu.: 0.4946 3rd Qu.: 0.50717
## Max. : 6.53544 Max. : 3.62991 Max. : 3.1502 Max. : 3.42382
## pH sulphates alcohol
## Min. :-2.951603 Min. :-1.94019 Min. :-1.91839
## 1st Qu.:-0.605291 1st Qu.:-0.58227 1st Qu.:-0.87016
## Median : 0.001796 Median :-0.22803 Median :-0.20309
## Mean : 0.045482 Mean :-0.01402 Mean : 0.04695
## 3rd Qu.: 0.608883 3rd Qu.: 0.42140 3rd Qu.: 0.67838
## Max. : 4.595972 Max. : 7.80139 Max. : 3.41810
Now the medians and means are all quite similar to each other, and the ranges of the predictors are similar too, so we are ready to use this dataset.
7.2 Finding Optimum k
Finding a good number 'k' is a factor in how well the model performs. A common rule of thumb is to start with the square root of the number of observations (the number of rows in tabular data).
Let's compute that square root to get a candidate value for k.
sqrt(nrow(wine_test_x_scaled))
## [1] 17.88854
The target variable has 2 classes, so it is best to pick an odd k to avoid any tie in the majority vote of the classification algorithm.
The candidate k is around 17.89, so we will try k = 17, and also tune the model with the two odd numbers around it, k = 15 and k = 19.
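As a small illustrative helper (not part of the original workflow), the square-root candidate can be rounded to the nearest odd integer programmatically:

# Round the sqrt(n) candidate to the nearest odd k to avoid ties in a 2-class vote
k_candidate <- sqrt(nrow(wine_test_x_scaled))   # ~17.89, as computed above
k_odd <- 2 * round((k_candidate - 1) / 2) + 1
k_odd                                           # 17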
7.3 Fitting the Model
k-NN is often categorized as a "black-box" style model: there is no fitted equation or component summary to interpret while it works, so we can only use the results after the predictions have been made.
wine_knn_pred_k17 <- knn(train = wine_train_x_scaled,
                         test = wine_test_x_scaled,
                         cl = wine_train_y,
                         k = 17)

wine_knn_pred_k15 <- knn(train = wine_train_x_scaled,
                         test = wine_test_x_scaled,
                         cl = wine_train_y,
                         k = 15)

wine_knn_pred_k19 <- knn(train = wine_train_x_scaled,
                         test = wine_test_x_scaled,
                         cl = wine_train_y,
                         k = 19)
7.4 Model Evaluation
One of the most generally interpretable ways to evaluate a classification model is the confusion matrix. We have already decided, with the logistic regression model, to use Accuracy and Precision: the target's classes are balanced quite nicely (roughly 50:50), and a model that precisely predicts a good-quality wine from its measured properties is more valuable here than one that recalls as many good-quality wines as possible.
7.4.1 Using k=17 (closest number to optimum)
confusionMatrix(data = wine_knn_pred_k17,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 101 30
## 1 48 141
##
## Accuracy : 0.7562
## 95% CI : (0.7054, 0.8023)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5063
##
## Mcnemar's Test P-Value : 0.05425
##
## Sensitivity : 0.8246
## Specificity : 0.6779
## Pos Pred Value : 0.7460
## Neg Pred Value : 0.7710
## Prevalence : 0.5344
## Detection Rate : 0.4406
## Detection Prevalence : 0.5906
## Balanced Accuracy : 0.7512
##
## 'Positive' Class : 1
##
Accuracy: 75.62%
Precision / Pos Pred Value: 74.60%
Not bad!
7.4.2 Using k=15
confusionMatrix(data = wine_knn_pred_k15,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 101 32
## 1 48 139
##
## Accuracy : 0.75
## 95% CI : (0.6988, 0.7965)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 1.56e-15
##
## Kappa : 0.4941
##
## Mcnemar's Test P-Value : 0.09353
##
## Sensitivity : 0.8129
## Specificity : 0.6779
## Pos Pred Value : 0.7433
## Neg Pred Value : 0.7594
## Prevalence : 0.5344
## Detection Rate : 0.4344
## Detection Prevalence : 0.5844
## Balanced Accuracy : 0.7454
##
## 'Positive' Class : 1
##
Accuracy: 75.00%
Precision / Pos Pred Value: 74.33%
Slightly lower compared to that of our original and optimum k.
7.4.3 Using k=19
confusionMatrix(data = wine_knn_pred_k19,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 98 31
## 1 51 140
##
## Accuracy : 0.7438
## 95% CI : (0.6922, 0.7907)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 1.044e-14
##
## Kappa : 0.4806
##
## Mcnemar's Test P-Value : 0.03589
##
## Sensitivity : 0.8187
## Specificity : 0.6577
## Pos Pred Value : 0.7330
## Neg Pred Value : 0.7597
## Prevalence : 0.5344
## Detection Rate : 0.4375
## Detection Prevalence : 0.5969
## Balanced Accuracy : 0.7382
##
## 'Positive' Class : 1
##
Accuracy: 74.38%
Precision / Pos Pred Value: 73.30%
Even lower compared to that of our original and optimum k.
7.5 Conclusions
- Our best k-NN model uses the k suggested by the square-root rule of thumb, rounded to the nearest odd number, which is k = 17.
- The Accuracy of our model is 75.62%, and the Precision is 74.60%.
8 Classification Model: Naive Bayes
Naive Bayes is a classification machine learning model that uses Bayes' Theorem on dependent and independent events to estimate the probability of each observation's class.
The model carries some distinctive assumptions: the predictors depend on the target variable (which is obvious; if they didn't, why would we try to predict at all?), the predictors are independent of each other, and each predictor weighs the same in the final probability calculation.
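Written out, the "naive" independence assumption means the class probability factorizes as the standard naive Bayes rule:
\[
P(\text{class} \mid x_1, \dots, x_n) \;\propto\; P(\text{class}) \prod_{i=1}^{n} P(x_i \mid \text{class})
\]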
It should be noted that this model is known to work best with categorical predictors. Even so, we can still use numerical predictors with it, and we'll see how well it performs compared to the other models.
8.1 Model Fitting
It should also be noted that Naive Bayes has a known weakness when data are scarce: if any specific segment of the dataset has 0 observations, the whole probability calculation is badly skewed. We will therefore use Laplace smoothing with a value of 1, which simply adds 1 dummy observation to each segment of the dataset. This offsets the data by an insignificant amount, just as \(1000/2000\) is not too far from \(1001/2001\).
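A tiny numeric sketch with hypothetical counts (not taken from the wine data) shows what laplace = 1 does to a zero count:

# Laplace smoothing keeps a zero count from collapsing the probability product
counts   <- c(high = 0, low = 20)                          # hypothetical counts
prop_raw <- counts / sum(counts)                           # 0.000 and 1.000
prop_lap <- (counts + 1) / (sum(counts) + length(counts))  # ~0.045 and ~0.955
rbind(prop_raw, prop_lap)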
8.1.1 Using All Predictors
model_nb_all <- naiveBayes(formula = quality_high ~ .,
                           data = wine_train %>% select(-quality),
                           laplace = 1)

wine_nb_all_pred_class <- predict(object = model_nb_all,
                                  newdata = wine_test,
                                  type = "class")

wine_nb_all_pred_raw <- predict(object = model_nb_all,
                                newdata = wine_test,
                                type = "raw")
head(wine_nb_all_pred_class)
## [1] 0 0 0 0 0 0
## Levels: 0 1
head(wine_nb_all_pred_raw)
## 0 1
## [1,] 0.8333146 0.166685388
## [2,] 0.8148565 0.185143539
## [3,] 0.6065141 0.393485924
## [4,] 0.7671096 0.232890388
## [5,] 0.6810443 0.318955749
## [6,] 0.9922018 0.007798212
8.1.2 Using Predictors found in Linear Regression Model
model_nb_fromlm <- naiveBayes(formula = quality_high ~ volatile.acidity + chlorides +
                                free.sulfur.dioxide + total.sulfur.dioxide +
                                pH + sulphates + alcohol + fixed.acidity,
                              data = wine_train %>% select(-quality),
                              laplace = 1)

wine_nb_fromlm_pred_class <- predict(object = model_nb_fromlm,
                                     newdata = wine_test,
                                     type = "class")

wine_nb_fromlm_pred_raw <- predict(object = model_nb_fromlm,
                                   newdata = wine_test,
                                   type = "raw")
head(wine_nb_fromlm_pred_class)
## [1] 0 0 0 0 0 0
## Levels: 0 1
head(wine_nb_fromlm_pred_raw)
## 0 1
## [1,] 0.7018541 0.29814590
## [2,] 0.6748204 0.32517962
## [3,] 0.5730255 0.42697452
## [4,] 0.6412530 0.35874704
## [5,] 0.5533299 0.44667009
## [6,] 0.9955356 0.00446441
8.1.3 Using Predictors found in Logistic Regression Model
model_nb_fromglm <- naiveBayes(formula = quality_high ~ volatile.acidity + citric.acid +
                                 chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
                                 pH + sulphates + alcohol,
                               data = wine_train %>% select(-quality),
                               laplace = 1)

wine_nb_fromglm_pred_class <- predict(object = model_nb_fromglm,
                                      newdata = wine_test,
                                      type = "class")

wine_nb_fromglm_pred_raw <- predict(object = model_nb_fromglm,
                                    newdata = wine_test,
                                    type = "raw")
head(wine_nb_fromglm_pred_class)
## [1] 0 0 0 0 0 0
## Levels: 0 1
head(wine_nb_fromglm_pred_raw)
## 0 1
## [1,] 0.7434616 0.256538449
## [2,] 0.7186884 0.281311635
## [3,] 0.5038296 0.496170395
## [4,] 0.6704487 0.329551297
## [5,] 0.6073191 0.392680893
## [6,] 0.9963817 0.003618292
8.2 Model Evaluation
In addition to the confusion matrix for a general evaluation, we will evaluate the models with ROC and AUC metrics for an internal comparison between all 3 of the Naive Bayes models.
8.2.1 Using All Predictors
confusionMatrix(data = wine_nb_all_pred_class,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 103 35
## 1 46 136
##
## Accuracy : 0.7469
## 95% CI : (0.6955, 0.7936)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 4.068e-15
##
## Kappa : 0.4889
##
## Mcnemar's Test P-Value : 0.2665
##
## Sensitivity : 0.7953
## Specificity : 0.6913
## Pos Pred Value : 0.7473
## Neg Pred Value : 0.7464
## Prevalence : 0.5344
## Detection Rate : 0.4250
## Detection Prevalence : 0.5687
## Balanced Accuracy : 0.7433
##
## 'Positive' Class : 1
##
Accuracy: 74.69%
Precision / Pos Pred Value: 74.73%
Pretty good, and similar to our previous classification models.
wine_nb_all_roc <- prediction(predictions = wine_nb_all_pred_raw[, 2],
                              labels = wine_test_y)

plot(performance(prediction.obj = wine_nb_all_roc,
                 measure = "tpr",
                 x.measure = "fpr"))
performance(prediction.obj = wine_nb_all_roc, measure = "auc")@y.values
## [[1]]
## [1] 0.8299384
The AUC value is 82.99%, which is pretty good!
8.2.2 Using Predictors found in Linear Regression Model
confusionMatrix(data = wine_nb_fromlm_pred_class,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 26
## 1 52 145
##
## Accuracy : 0.7562
## 95% CI : (0.7054, 0.8023)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5046
##
## Mcnemar's Test P-Value : 0.004645
##
## Sensitivity : 0.8480
## Specificity : 0.6510
## Pos Pred Value : 0.7360
## Neg Pred Value : 0.7886
## Prevalence : 0.5344
## Detection Rate : 0.4531
## Detection Prevalence : 0.6156
## Balanced Accuracy : 0.7495
##
## 'Positive' Class : 1
##
Accuracy: 75.62%
Precision / Pos Pred Value: 73.60%
Pretty good performance so far.
wine_nb_fromlm_roc <- prediction(predictions = wine_nb_fromlm_pred_raw[, 2],
                                 labels = wine_test_y)
plot(performance(prediction.obj = wine_nb_fromlm_roc,
measure = "tpr",
x.measure = "fpr"))
performance(prediction.obj = wine_nb_fromlm_roc, measure = "auc")@y.values
## [[1]]
## [1] 0.8363358
The AUC value is 83.63%, which is pretty good and better than the all-predictors model.
8.2.3 Using Predictors found in Logistic Regression Model
confusionMatrix(data = wine_nb_fromglm_pred_class,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 98 27
## 1 51 144
##
## Accuracy : 0.7562
## 95% CI : (0.7054, 0.8023)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5051
##
## Mcnemar's Test P-Value : 0.009208
##
## Sensitivity : 0.8421
## Specificity : 0.6577
## Pos Pred Value : 0.7385
## Neg Pred Value : 0.7840
## Prevalence : 0.5344
## Detection Rate : 0.4500
## Detection Prevalence : 0.6094
## Balanced Accuracy : 0.7499
##
## 'Positive' Class : 1
##
Accuracy: 75.62%
Precision / Pos Pred Value: 73.85%
Pretty good, and similar to the previous model that used the predictors selected by the linear regression.
wine_nb_fromglm_roc <- prediction(predictions = wine_nb_fromglm_pred_raw[, 2],
                                  labels = wine_test_y)
plot(performance(prediction.obj = wine_nb_fromglm_roc,
measure = "tpr",
x.measure = "fpr"))
performance(prediction.obj = wine_nb_fromglm_roc, measure = "auc")@y.values
## [[1]]
## [1] 0.832411
The AUC value is 83.24%, which is not bad, but lower than our second model, so we will take that second model as our representative.
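For convenience, the three AUC values reported above can be lined up side by side:

# Recap of the three Naive Bayes AUC values reported above
data.frame(model = c("all predictors", "lm-selected predictors", "glm-selected predictors"),
           AUC   = c(0.8299, 0.8363, 0.8324))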
8.3 Conclusions
-
Our best Naive Bayes model uses the predictors that were found to be significant during the linear regression model's feature selection, involving the following predictors:
- volatile.acidity
- chlorides
- free.sulfur.dioxide
- total.sulfur.dioxide
- pH
- sulphates
- alcohol
- fixed.acidity
- The AUC value of our model is 83.63%, which is better than the other Naive Bayes models built with different sets of predictors.
-
The Accuracy of our model is 75.62%, and the Precision is 73.60%.
9 Classification Model: Decision Tree
Decision Tree is a fairly simple tree-based classification model that is nevertheless robust and powerful for prediction, and it produces a tree visualization that is easy to interpret. Starting from the root node at the top, each branch applies a splitting variable and criterion, and the algorithm follows these splits to categorize each observation, so although not ideal, this model is workable for numerical predictors like the ones in our dataset.
Additional characteristics of a Decision Tree: the predictors are allowed to be dependent on each other, so multicollinearity is not a problem, and the model is not sensitive to outliers in numerical predictors (for example, if an outlying value sits at 10,000 while the rest of the data is around 100-200, the algorithm can simply bucket it under a rule like "> 200" anyway).
Although a decision tree can also be used for regression, that is usually not recommended, so let's see whether trying it is a mistake or a breakthrough.
9.1 Model Fitting
Since the Decision Tree algorithm already uses entropy / information gain to choose the best predictors and split criteria, we don't need to do much feature selection ourselves. Later we might need to prune the tree if it becomes too crowded to read.
9.1.1 Using it as a Regression Model
model_allpred_tree_reg <- ctree(formula = quality ~ .,
                                data = wine_train %>% select(-quality_high))

plot(model_allpred_tree_reg, type = "simple")
The plot is very crowded! This model and plot only use the default parameters of a decision tree, so we could tune them further if needed. Looking at the leaves/terminal nodes, the nodes that no longer branch into another decision (the greyed-out boxes near the bottom of the tree), we can see a predicted value together with a possible error for each. One leaf node around the middle of the tree has a predicted value of 5.672 with a possible error of 37.4, which does not make much sense when the quality rating only runs from 0-10. A lot of other leaf nodes show similarly extreme errors, so I don't think pruning this regression decision tree would help it much.
Therefore, I think it is best to go back to using the Decision Tree as a classification model, with quality_high as the target variable instead.
9.1.2 Using it as a Classification Model
9.1.2.1 As It Is
model_allpred_tree_cla <- ctree(formula = quality_high ~ .,
                                data = wine_train %>% select(-quality))

plot(model_allpred_tree_cla)
plot(model_allpred_tree_cla, type = "simple")
I was tempted to use type = "simple" again, but since the results still overlap each other, I prefer the default plot instead; we can classify each leaf node against the default threshold of 0.5 to read off its prediction. Note that this tree has 15 leaves.
As we can see from the tree, some decisions are quite polarizing and leave little error to worry about, for example Node 14 (probability very close to 0) or Node 28 (very close to 1). Others are less polarized, and a few nodes sit close to our 0.5 threshold (like Node 11 and Node 29); these are likely the nodes that will cause most of the errors (false negatives or false positives) later on.
Predictors that are highly relevant/significant (excluding duplicates and branches that classify entirely to one class):
- alcohol
- volatile.acidity
- chlorides
- sulphates
- total.sulfur.dioxide
9.1.2.2 Pruning the Model
There are a few hyperparameters available for pruning the model:
- mincriterion. Default is 0.95. This parameter controls each split's p-value check, so a node branches only if its p-value < (1 - mincriterion). A higher number prunes the tree, making a node harder to split because less error is tolerated in its decision.
- minsplit. Default is 20. This parameter sets the minimum number of observations a node must hold for a split to be attempted. A higher number also prunes the tree, since the algorithm only looks for a splitting decision when enough observations reach that node.
- minbucket. Default is 7. This parameter sets the minimum number of observations in a leaf/terminal node. Similar to minsplit, a higher value prunes the tree, as more observations are required before a branch may terminate into a leaf.
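For reference, these defaults can be spelled out explicitly; a call like the illustrative sketch below should be equivalent to running ctree() without a control argument:

# The ctree defaults discussed above, written out explicitly (sketch)
default_ctrl <- ctree_control(mincriterion = 0.95,
                              minsplit     = 20,
                              minbucket    = 7)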
There is no rule of thumb for these numbers. After trying several values, the setting below prunes the leaf count quite a lot, to around half of the original, so this could be another model to consider.
model_allpred_tree_cla_prune <- ctree(formula = quality_high ~ .,
                                      data = wine_train %>% select(-quality),
                                      control = ctree_control(mincriterion = 0.95,
                                                              minsplit = 100,
                                                              minbucket = 100))

plot(model_allpred_tree_cla_prune, type = "simple")
But looking more closely at the right side of the root node, whenever an observation has alcohol > 10.3, every decision underneath it ends up classified as "1". Although the number of observations (n) and the error estimate (err) differ between those leaves, the extra splits add no real classification value there. So even though this prunes the tree from 15 leaves down to 8 and makes it much more readable, I don't think it is a good enough model yet, so I will keep searching.
After adjusting the numbers some more, I found the following decision tree:
model_allpred_tree_cla_prune2 <- ctree(formula = quality_high ~ .,
                                       data = wine_train %>% select(-quality),
                                       control = ctree_control(mincriterion = 0.98,
                                                               minsplit = 100,
                                                               minbucket = 20))

plot(model_allpred_tree_cla_prune2, type = "simple")
It is still not perfect: the bottom-right internal node branches into two leaves that both classify to "1", so although the leaf count is 11, it is effectively 10. Even so, I am happier with this one.
When adjusting the parameters for this model, my guiding idea was that the "precious" leaf nodes classifying to "0" on the right side of the root node must be preserved. The n values of the "0" leaf nodes in the default model are sometimes very low (18, 29, or 61), so I was careful not to increase the minbucket parameter too much. I therefore accepted "pruning away" at least one of those right-side "0" leaves and played with the other thresholds until I reached this model.
Predictors that are highly relevant/significant (excluding duplicates and branches that classify entirely to one class):
- alcohol
- volatile.acidity
- sulphates
- total.sulfur.dioxide
This uses one predictor fewer than the tree with default parameters, excluding chlorides.
9.2 Model Evaluation
There is no better way to evaluate the decision tree models than to predict the testing data and build a confusion matrix.
9.2.1 Using Default Parameters
wine_tree_def_pred <- predict(object = model_allpred_tree_cla,
                              newdata = wine_test,
                              type = "response")
confusionMatrix(data = wine_tree_def_pred,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 48
## 1 32 123
##
## Accuracy : 0.75
## 95% CI : (0.6988, 0.7965)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 1.56e-15
##
## Kappa : 0.5011
##
## Mcnemar's Test P-Value : 0.09353
##
## Sensitivity : 0.7193
## Specificity : 0.7852
## Pos Pred Value : 0.7935
## Neg Pred Value : 0.7091
## Prevalence : 0.5344
## Detection Rate : 0.3844
## Detection Prevalence : 0.4844
## Balanced Accuracy : 0.7523
##
## 'Positive' Class : 1
##
Accuracy: 75.00%
Precision / Pos Pred Value: 79.35%
One of the better results we've had compared to the other models, although with this many leaves the tree itself is less readable.
9.2.2 Using Customized Parameters
wine_tree_cust_pred <- predict(object = model_allpred_tree_cla_prune2,
                               newdata = wine_test,
                               type = "response")
confusionMatrix(data = wine_tree_cust_pred,
reference = wine_test_y,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 115 48
## 1 34 123
##
## Accuracy : 0.7438
## 95% CI : (0.6922, 0.7907)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : 1.044e-14
##
## Kappa : 0.4882
##
## Mcnemar's Test P-Value : 0.1511
##
## Sensitivity : 0.7193
## Specificity : 0.7718
## Pos Pred Value : 0.7834
## Neg Pred Value : 0.7055
## Prevalence : 0.5344
## Detection Rate : 0.3844
## Detection Prevalence : 0.4906
## Balanced Accuracy : 0.7456
##
## 'Positive' Class : 1
##
Accuracy: 74.38%
Precision / Pos Pred Value: 78.34%
A little bit less than that of the default parameters, but this one is more readable.
9.3 Conclusions
-
Our Decision Tree model with the default parameters shows the best metrics compared to the pruned model with customized parameters, and it mostly relies on the following predictors:
- alcohol
- volatile.acidity
- chlorides
- sulphates
- total.sulfur.dioxide
- We also made a pruned, alternative model using customized parameters, which is more readable since it contains fewer leaf nodes.
- The Accuracy of our best Decision Tree model is 75.00%, and the Precision is 79.35%.
10 Classification Model: Random Forest
Random Forest is a machine learning model built on the concept of ensemble methods (combining multiple models through majority voting or averaging), with Decision Trees as its members. Each tree has its own characteristics and is independent of the others. In simpler terms, the model builds many Decision Trees on random subsets of predictors/variables, then uses majority voting for classification cases or the mean of the targets for regression ones.
This model is reputed to give some of the most accurate predictions among commonly used models. The general downside is that fitting many models needs resources: processors, RAM and, most noticeably, time to compute the predicted result.
10.1 Model Fitting
Since Random Forest can be used for both regression and classification, let's try both. I will attach the code in the chunks below, but it is commented out, as fitting a Random Forest can sometimes take hours, depending on the number of folds and repetitions in the cross-validation and on the number of observations and variables. I originally ran the code and saved the models as RDS files, so they can be loaded and evaluated quickly.
10.1.1 Fitting as a Regression Model
# set.seed(314)
#
# ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) # repetition
#
# wine_forest_reg <- train(quality ~ .,
# data = wine_train %>% select(-quality_high),
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(wine_forest_reg, "wine_forest_reg.RDS") # saving model
Running this model takes me about 2-4 minutes. You wouldn't want to wait that long for a webpage to open, would you?
10.1.2 Fitting as a Classification Model
# set.seed(314)
#
# wine_forest_cla <- train(quality_high ~ .,
# data = wine_train %>% select(-quality),
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(wine_forest_cla, "wine_forest_cla.RDS") # saving model
For this one, somehow it took only around 1 minute to finish.
10.2 Model Evaluation
10.2.1 Regression Model
First, let’s read the saved model.
<- readRDS("wine_forest_reg.RDS")
wine_forest_reg wine_forest_reg
## Random Forest
##
## 1279 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 1024, 1023, 1023, 1023, 1023, 1024, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.5873358 0.4661201 0.4397804
## 6 0.5855157 0.4614288 0.4328461
## 11 0.5876692 0.4563827 0.4317041
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
mtry is the number of predictors randomly sampled as split candidates at each node of each Decision Tree. From the model summary above, after some iterations the algorithm chose mtry = 6, as it had the lowest RMSE (basically an error estimate).
Random Forest has its own built-in validation technique called the Out-Of-Bag (OOB) error: the algorithm sets some data aside at random and evaluates the model on that "unseen" data, much like what we did manually. This is why an extra cross-validation or train-test split is not strictly necessary when using Random Forest.
wine_forest_reg$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.3317218
## % Var explained: 47.49
The model explains 47.49% of the variance in the target, which is quite low, but compared to our first linear regression model it is actually a slight improvement.
Even though the algorithm has already tested the model on the Out-Of-Bag samples, we still have a testing dataset put aside, so we could also try predicting those values…
10.2.2 Classification Model
Again, let’s read the saved model.
<- readRDS("wine_forest_cla.RDS")
wine_forest_cla wine_forest_cla
## Random Forest
##
## 1279 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 1024, 1023, 1023, 1023, 1023, 1023, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7969812 0.5920936
## 6 0.7909875 0.5804291
## 11 0.7847324 0.5678683
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
The mtry value the algorithm settled on, based on the repeated resampling, is 2. At each node of each Decision Tree it therefore randomly picks 2 candidate predictors to split on, and so on until the tree is complete.
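For context, a roughly equivalent direct call, shown only as an illustrative sketch of what caret does under the hood with these settings (and kept commented out, like the training code above):

# Sketch only: a direct randomForest call with the tuned mtry, bypassing caret
# library(randomForest)
# set.seed(314)
# randomForest(quality_high ~ .,
#              data = wine_train %>% select(-quality),
#              ntree = 500,   # default number of trees
#              mtry = 2)      # the value selected above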
wine_forest_cla$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 18.69%
## Confusion matrix:
## 0 1 class.error
## 0 478 117 0.1966387
## 1 122 562 0.1783626
As explained before, OOB (Out-Of-Bag) is Random Forest's term for the randomly sampled observations that are treated as unseen data and used to evaluate the model. Here the OOB error rate is only 18.69%, which means an accuracy of 81.31%, the highest of all the models so far.
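As a quick arithmetic check, the same accuracy falls out of the OOB confusion matrix printed above:

# 1 - OOB error, recomputed from the OOB confusion matrix
(478 + 562) / (478 + 117 + 122 + 562)   # ~0.8131, i.e. 1 - 0.1869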
Since we have our own testing data set aside, let's evaluate the model on that dataset too. We first predict the classes, then use a confusion matrix to check the accuracy and precision.
wine_forest_cla_pred <- predict(object = wine_forest_cla,
                                newdata = wine_test,
                                type = "raw")
confusionMatrix(data = wine_forest_cla_pred,
reference = wine_test$quality_high,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 114 19
## 1 35 152
##
## Accuracy : 0.8312
## 95% CI : (0.7856, 0.8706)
## No Information Rate : 0.5344
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6585
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 0.8889
## Specificity : 0.7651
## Pos Pred Value : 0.8128
## Neg Pred Value : 0.8571
## Prevalence : 0.5344
## Detection Rate : 0.4750
## Detection Prevalence : 0.5844
## Balanced Accuracy : 0.8270
##
## 'Positive' Class : 1
##
Accuracy: 83.12%
Precision / Pos Pred Value: 81.28%
Even better! This is the first time any of our prediction models, regression or classification, has touched the 80% accuracy mark. As many have said, this is surely one of the best models we can use to predict the target variable.
10.3 Conclusions
- Random Forest is objectively the best performer out of all the models.
- As a regression model, it explains 47.49% of the variance in the target variable (the remaining 52.51% would only be explained by predictors outside those used).
- As a classification model, its Accuracy was 83.12% and its Precision 81.28%, objectively the best of all models.
11 Comparing Models and Conclusions
11.1 Performance of Regression Models
-
The Linear Regression Model was chosen using our manual approach and a stepwise approach in combined forward and backward directions, resulting in model_stepwise_both_lm, which achieves 44.59% on the adjusted R-squared metric. This model failed 1 of the 4 required assumptions, so although it can be used to predict the numerical quality target, it must be used with the caveat that the residuals are spread in a pattern (heteroscedasticity instead of homoscedasticity).
-
The Decision Tree Model, used for regression, produced many leaf nodes whose possible estimated errors fall outside the possible range of the quality rating, which should only be 0-10. Therefore, this model is not fit for estimating the numerical value of red wine quality.
-
The Random Forest Model, essentially an ensemble of Decision Trees fitted over and over again, improved the score to 47.49% in how well it explains the target variable quality. This is an improvement, as expected from Random Forest, though the resources required might not be worthwhile if the dataset were much larger, had many more predictors, or if the business question were urgent. Otherwise, this model can be deemed the best of all the regression models tried in this report.
11.2 Performance of Classification Models
- The Logistic Regression Model, usually strong with numerical predictors, has an Accuracy of 76.88% and a Precision of 78.36%.
- The k-Nearest Neighbour Model works best with scaled, numerical predictors, which we can provide easily, although interpreting the model's components is not one of its strengths. Its Accuracy is 75.62% and its Precision 74.60%.
- The Naive Bayes Model is one of the simplest models and is at its best with categorical predictors. Given that all of our predictors are numerical, it still performs similarly to k-NN, with an Accuracy of 75.62% and a Precision of 73.60%.
- The Decision Tree Model is the one to look for if you want a rule-based model, as its tree visualization is easily interpretable if you set the plot width wide enough. The Accuracy of our best Decision Tree model is 75.00% and its Precision 79.35%; after some pruning/tuning of the default parameters we also have an alternative model with fewer leaf nodes, which is much simpler but sacrifices a little accuracy and precision.
- The Random Forest Model is still the best performer of all the classification methods tried in this report, with an Accuracy of 83.12% and a Precision of 81.28%. Its only downside is the time and computational resources it consumes, so it is usable in most business cases except those that are urgent or under-resourced.
11.3 Overall Conclusions
- Random Forest is one of the most robust machine learning methods if you want accuracy and precision in exchange for time and resources, as our model comparison above has shown.
-
Across all of the models whose components we can interpret, these predictors appear the most:
- Alcohol
- Volatile Acidity
- Sulphates
- Total Sulfur Dioxides
Therefore we can conclude that they must be among the most decisive factors in determining red wine quality.
12 References and About Me
12.1 References
- Kaggle.com: Red Wine Quality
- P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
-
Wikipedia.org: Wine, in which cites:
- “Kuopion Viinijuhlat » Kuopio Wine Festival” (in Finnish). Kuopio Wine Festival. Retrieved 25 July 2020.
- “6 Secrets of Cooking With Wine” . WebMD.
- Parker, Robert M. (2008). Parker’s Wine Buyer’s Guide , 7th Edition. Simon and Schuster. p. 15. ISBN 978-1-4391-3997-4.
- Jancis Robinson (2006). The Oxford Companion to Wine (3rd ed.) . Oxford University Press. ISBN 978-0-19-860990-2. See alcoholic strength at p. 10.
- Wikipedia.org: Red Wine
-
Wikipedia.org: Wine tasting, in which cites:
- Peynaud, Émile (1996) The Taste of Wine: The Art and Science of Wine Appreciation, London: Macdonald Orbis, p1
- Hodgson, Robert T., "How Expert are 'Expert' Wine Judges?", Journal of Wine Economics, Vol. 4, Issue 02 (Winter 2009), pp. 233–241.
12.2 About Me
Hi! My name is Calvin, and I am from Jakarta, Indonesia. I am looking forward to becoming a full-time data analyst and/or data scientist. I have a background in Mathematics and Computer Science from my Bachelor's Degrees, and I love playing with numbers and data. I made this report to enhance my Data Science portfolio (constructive criticism is very much welcomed!) and as part of a Learn-By-Building assignment at Algoritma Data Science School.
You can reach me at my LinkedIn for more discussion. Thank you!