Introduction

Inspiration and Motive

My inspiration and motive for this project largely stem from my complete and utter ignorance about the subject of wine. When to drink it, when not to drink it, how to drink it... I can't answer even these simple questions. I have enjoyed it when it has been served to me on some occasions, but not always, and if I were buying a bottle at the store for myself I would have no clue what to get, since you must remember not only labels and brands but also specific categories and sub-categories. All this is beside the point other than to show that I would be wholly out of my depth if asked to judge a wine's quality. With the advent of predictive data analytics, however, could I devise a way to rate wines just as well as the critics? This is where the idea for this final project sprang from; two questions in particular interested me:

(1) Do the quality rankings of wine follow some sort of objective pattern based on the make-up of their contents? Surely if a wine's contents can be broken down to x% alcohol, y% sugar, and so on, then another wine of very similar contents should score the same if the ratings are in fact objective, and...

(2) If this rating can be predicted, then surely that means it is objective and following some sort of pattern, but what is the pattern and is it even an interpretable one?

People who are knowledgeable about wine tend to be sophisticated and high-status
Me on the other hand

Description of Data

For this final project, I am working with the file "winequality-red.csv", found in /data. The file is from Kaggle.

From the dataset description:

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The source of wine is a vineyard, the source of our data is Kaggle

This dataset is also available from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/wine+quality

Relevant publication

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Project Outline

My aim is to ultimately answer both my questions laid out earlier:

I would like to fit multiple models that can predict a wine’s rating and…

if I'm successful, I hope to then choose which model I think is best and successfully explain why it is the best.

Exploratory Data Analysis

Prior to building our models we will first take a look at our data. By examining the characteristics of each variable (the range of values it takes on, its distribution, and its class), and then by visualizing the data with graphics, we can gain some insights into the data set before building our models.

Load Our Libraries and Data

First let’s load in all the libraries we will be using for this project, then we’ll load in our data set winequality-red.csv using read.csv().

set.seed(3333)

library(tidyverse)            # load in our libraries we will need
library(tidymodels)
library(ggplot2)              # for visual plots
library(ggthemes)             # for plot themes/customization
library(corrplot)             # for a correlation plot
library(corrr)
library(kknn)                 # for KNN modeling
library(visdat)               # visualize missing data
library(tibble)               # creating tibbles
library(astsa)
library(janitor)              # for cleaning names
library(knitr)
library(kableExtra)
library(discrim)
tidymodels_prefer()
# read in our csv data
wine_quant <- "data/winequality-red.csv" |> read.csv()

# create a copy of the data set to work with (it's always good to keep an 
# original copy intact and a new copy to edit/work with)
wine <- wine_quant

Clean up Variable Names

Using the janitor package we clean up our data set’s variable names.

Our original variable names

wine |> names()                               # read out the variable names
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
wine <- wine |> 
  as_tibble() |> clean_names()                # clean up the variable names 


wine_quant <- wine_quant |> 
  as_tibble() |> clean_names()                # ditto for our original data set

Our new variable names

wine |> names()                               # inspect the new variable names
##  [1] "fixed_acidity"        "volatile_acidity"     "citric_acid"         
##  [4] "residual_sugar"       "chlorides"            "free_sulfur_dioxide" 
##  [7] "total_sulfur_dioxide" "density"              "p_h"                 
## [10] "sulphates"            "alcohol"              "quality"

Study the Variables

Before we can begin we must first take a look at what's in the data set. We'll start by looking at the first 10 observations.

# glimpse the data set
wine |> as_tibble() |> head(10)
## # A tibble: 10 × 12
##    fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7.4             0.7         0               1.9     0.076
##  2           7.8             0.88        0               2.6     0.098
##  3           7.8             0.76        0.04            2.3     0.092
##  4          11.2             0.28        0.56            1.9     0.075
##  5           7.4             0.7         0               1.9     0.076
##  6           7.4             0.66        0               1.8     0.075
##  7           7.9             0.6         0.06            1.6     0.069
##  8           7.3             0.65        0               1.2     0.065
##  9           7.8             0.58        0.02            2       0.073
## 10           7.5             0.5         0.36            6.1     0.071
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## #   total_sulfur_dioxide <dbl>, density <dbl>, p_h <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <int>

Notes On Variable Classes

All of our data is numerical: all but one variable is floating point (i.e. has decimals), and only quality, the variable we are seeking to predict, is an integer.

Treating quality as a Factor

The first difficult decision I had to make in this project was whether to perform a regression or a classification. Since all of my data was numerical, the simple solution would have been regression: the quality score would simply be a quantity calculated from each of the other variables' quantities, and it would not have been difficult to round the predicted value up or down to ensure the rating ended up as an integer. Ultimately, however, I concluded that quality should be treated as a factor and that I should perform classification, since another wine data set could very well have used ratings of good, great, exceptional, or have been rated out of 4 or 5 stars, etc., and it would make the most sense to keep my code easily reproducible on other data sets if I ever wanted to truly put it to the test.

Transform quality to a Factor

Here, with a few lines of code, we change our variable quality from an integer to a factor.

wine$quality <- wine$quality |> as.factor(); wine$quality |> class()
## [1] "factor"

Summary and Structure of Data

Prior to visualizing our data we can glean information quickly by looking at each variable's quartiles to gain insight into its distribution and range.

wine |> summary()     # look at the quartiles/distribution of each variable
##  fixed_acidity   volatile_acidity  citric_acid    residual_sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##       p_h          sulphates         alcohol      quality
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18
wine |> str()         # look at the structure of the data set
## tibble [1,599 × 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity       : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile_acidity    : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric_acid         : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual_sugar      : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free_sulfur_dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
##  $ total_sulfur_dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
##  $ p_h                 : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...

Ultimately we're looking at a data set of 1,599 observations containing 11 predictor variables and 1 response variable.

Summary and Structure of quality

Additionally we want to quickly glean some information about the distribution of our response variable quality in particular.

wine$quality |> str() # look at the structure of quality scores in particular
##  Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
wine$quality |> 
  summary() # look at the distribution of quality scores in particular
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Two things are immediately noticeable after looking at the summary and structure of quality:

1 quality takes on only six unique values: 3, 4, 5, 6, 7, and 8.

2 the vast majority of observations take on a quality score of 5 or 6 (so-so), quite a few take on a score of 4 or 7 (not good/good), and very few take on a score of 3 or 8 (very bad/very good).

Produce our Codebook

Now that we have made all the changes we intend to make to our variables (aside from quality, it made no sense to change any of the other variables to non-numeric classes), and we've studied each variable enough to reasonably guess what it represents, we can write out our Codebook, which will be a reference tool throughout the remainder of the project.

Note: the codebook is also available in a separate file in our zip package titled wine_codebook.rtf.

Input variables (based on physicochemical tests):

1 - fixed_acidity - fixed acidity - The predominant fixed acids found in wines are tartaric, malic, citric, and succinic. Their respective levels found in wine can vary greatly but in general one would expect to see 1,000 to 4,000 mg/L tartaric acid, 0 to 8,000 mg/L malic acid, 0 to 500 mg/L citric acid, and 500 to 2,000 mg/L succinic acid.

2 - volatile_acidity - volatile acidity - A measure of the wine’s volatile (or gaseous) acids. The primary volatile acid in wine is acetic acid, which is also the primary acid associated with the smell and taste of vinegar.

3 - citric_acid - citric acid - Citric acid is often added to wines to increase acidity, complement a specific flavor or prevent ferric hazes. It can be added to finished wines to increase acidity and give a “fresh” flavor. The disadvantage of adding citric acid is its microbial instability.

4 - residual_sugar - residual sugar - Sweetness in wine is called residual sugar and is usually measured in grams per litre (g/L). Residual sugar or ‘RS’ is from the natural grape sugars left in a wine after the alcoholic fermentation finishes. The more residual sugar remaining in a wine, the sweeter the wine is.

5 - chlorides - chlorides - In wines, the concentration of chloride ions is generally indicative of the presence of sodium chloride. Sodium chloride adds to the saltiness of a wine, which can contribute to or detract from the overall taste and quality of the wine.

6 - free_sulfur_dioxide - free sulfur dioxide - Two classes of sulfites are found in wine: free and bound. The free sulfites are those available to react and thus exhibit both germicidal and antioxidant properties.

7 - total_sulfur_dioxide - total sulfur dioxide - Total Sulfur Dioxide (TSO2) is the portion of SO2 that is free in the wine plus the portion that is bound to other chemicals in the wine such as aldehydes, pigments, or sugars.

8 - density - density - Density is defined as the mass, or weight, per volume of a material. In the case of liquids, density is often measured in units of g/mL, which is the weight in grams (g) of each milliliter (mL) of liquid. The density of wine is primarily determined by the concentration of alcohol, sugar, glycerol, and other dissolved solids.

9 - p_h - pH - pH is a measurement of acidity. Acidity gives a wine its tart and sour taste. Fundamentally speaking, all wines lie on the acidic side of the pH spectrum, and most range from 2.5 to about 4.5 pH (7 is neutral).

10 - sulphates - sulphates - Sulphites act as both a wine’s preservative and enhancer, many vintners purposely add sulphites at key moments of the winemaking process to quickly halt on-going fermentation or to help protect the wine against potential oxidation or bacterial exposure which could occur at various stages of the winemaking process.

11 - alcohol - alcohol - percentage volume of the wine that is ethanol, a type of alcohol that is produced by fermentation of grains, fruits, or other sources of sugar (in this case grapes).

Output variable (based on sensory data):

12 - quality - quality (score between 0 and 10) - the rating given to the wines in our data set.

Partition the Data

An important step in predictive analysis is to partition the data so that we can isolate part of the data for training our models and use the remainder of our data set for testing the models. Intuitively we refer to these as the training set and testing set.

Training and Testing Data Sets

We will partition our data into two sets.

We use the first partition as our training data set. This is the data set we will use to train our models (our models will learn from it). I've chosen to use 70% of the data set for training. It is typical to use a majority of the data for training and a minority for testing, and striking the right balance between the two proportions is largely left to our own discretion; I chose 70% because, given the number of observations available, it ensured there were 2-3 observations at the extreme values of quality (namely 3 and 8) in the testing data set.

Then we will use our second partition to test the models by seeing how accurately they can make predictions on it.

We stratify the partition on quality to ensure that the distribution of quality ratings is similar in each data set.

data_split <- initial_split(wine, prop = 0.70, strata = quality)
data_train <- training(data_split)
data_test <- testing(data_split)
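
As a quick check (not part of the original output), we can confirm that stratifying on quality kept the distribution of ratings similar in the two partitions:

# compare the proportion of each quality rating in the training and testing sets
data_train |> count(quality) |> mutate(prop = n / sum(n))
data_test |> count(quality) |> mutate(prop = n / sum(n))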

Cross-Validation

Similar to our training/testing partition, cross-validation trains and tests our models even more rigorously by creating training/testing splits (folds) within the training set itself, which we can train and evaluate our models on before eventually testing them for real on the testing data set.

It has been typical to use 5-fold cross-validation in our class up until this point. Choosing the number of folds here was difficult, but ultimately I decided on 2 folds total, since that was the only way to ensure that there were at least 2 observations with a quality rating of 8 in both folds (any higher number of folds would have left some folds with 1 or no observations with a rating of 8).

wine_fold <- vfold_cv(data_train, v = 2, strata = "quality")   

Visualize the Data

We'll have to dig into the data and inspect it just as wine producers inspect their grapes

We have loaded our data and made adjustments to the variables accordingly; the next step is to visualize the data, which can give us greater insights than just looking at its summary and structure.

Check for Missing Values

One of the first things we usually want to do is look at how complete the data set is. That is: how many observations and for which variables are those observations missing inputs?

wine |> vis_dat()       # note: we have no missing values

We've lucked out in this case, as this data set has no missing values. That being said, had there been missing values it would have been reasonable to impute some of them using knowledge about which variables are correlated with which others, or in other cases to remove the observation entirely if an accurate way to impute the missing value did not present itself.
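
For illustration only (our data set has no missing values, so this step would be a no-op here), a recipe step such as step_impute_knn() from the recipes package could have imputed missing predictor values using information from the other predictors:

# hypothetical sketch: a recipe step that would impute missing predictors via
# K-nearest neighbors if any values were missing
impute_sketch <- recipe(quality ~ ., data = wine) |>
  step_impute_knn(all_predictors())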

Histogram and Bar Chart

Our histogram and bar charts are two rudimentary ways to show the distribution of our ratings for quality as well as the frequency at which those ratings occur.

                          # histogram of quality scores
hist <- ggplot(           
  wine_quant,aes(x=quality)) +  
  geom_histogram(binwidth=1, 
                 color="darkblue", 
                 fill="lightblue");hist

# bar chart (similar to the histogram above but now ordered by frequency)
bar <- ggplot(wine, aes(y=fct_rev(fct_infreq(quality)))) + 
  geom_bar(color="darkblue", fill="lightblue") + ylab("quality");bar

As we already confirmed earlier, the most common values for quality are 5 and 6, and the least common are 3 and 8.

Scatterplot

Since there are so many variables in our data set it would be cumbersome to show a scatterplot or a box plot of every single predictor plotted against quality. Keeping with the spirit of this project (namely that I am an ignoramus when it comes to wine), I will arbitrarily choose alcohol as a good starting point to plot against quality with a scatterplot (maybe it's plausible that the more drunk the wine critic gets, the higher the rating he will give to the wine). Note: at the end I will show a very concise and quick way to visualize all the possible scatterplots of the entire data set at once using the pairs() function, but for the time being let's keep things simple by plotting quality rating vs. alcohol percentage.

scatter <- ggplot(wine, aes(x=quality,y=alcohol)) + 
  geom_point(color="darkblue");scatter

As expected this scatterplot is quite useless, seeing as quality only takes on integer values and the visual is 2-dimensional with no way of accounting for overlapping values (i.e. the clusters at quality values of 5, 6, and 7 could be thousands of observations or just dozens). Overall this is not a very helpful plot, so we'll move on and look at some better ones.

Box Plot

Keeping with my hunch that alcohol is a good predictor of quality I’ll make a box plot once again using alcohol and quality.

box <- ggplot(wine,aes(x=quality,y= alcohol, group=quality)) +
  geom_boxplot(aes(fill = quality))+
  theme_minimal()+
  # geom_jitter(alpha=0.5) +
  theme(legend.position = "top");box

Here it becomes very evident that the highest-rated wines have the highest alcohol content on average. Whereas ratings 6, 7, and 8 show an upward trend, ratings 3, 4, and 5 show a downward trend, creating an overall U-shaped pattern for alcohol content.

Box Plot (with Jitter)

But if we recall from our Histogram and Bar Charts, the graphs might not be telling the whole story since there are so few values at the extremes of 3 and 8 and there are so many values in the middle at 5 and 6. We can account for this by using geom_jitter.

boxjitter <- ggplot(wine,aes(x=quality,y= alcohol, group=quality)) +
  geom_boxplot(aes(fill = quality))+
  theme_minimal()+
  geom_jitter(alpha=0.5) +
  theme(legend.position = "top");boxjitter

It is apparent now that there is a conditional probability to consider. While the higher-rated wines have the highest alcohol content on average, there are many more high-alcohol-content wines overall with a quality rating of 5, 6, or 7. Additionally, the range of alcohol content overlaps across each rating of quality (i.e. there are wines with a quality of 3 with higher alcohol content than wines with a quality of 8). Ultimately this tells me that I will likely not be able to find a good predictor of quality by naively choosing a single predictor variable based on a hunch. I will need to look at graphs that can efficiently visualize multiple variables at once rather than go pair by pair (or use the pairs() function to visualize all the scatterplots quickly).

Correlation Matrix

The correlation matrix is a quick tool to see how each variable correlates with one another.

# create a correlation matrix of the training data and visualize it with rplot()

winecorr <- data_train |> correlate()
winecorr |> rplot()

Our correlation matrix confirms that there are some variables that are correlated with one another (fixed acidity is positively correlated with citric acid and density, and negatively correlated with pH; the rest of the variables seem to have weak correlations), but how strong are these correlations?

Triangular Correlation Matrix

We'll use two triangular correlation matrices: one to quickly assess which variables are correlated with one another, and a second, customized so that we are given quantities rather than colors to measure correlation by.

m <- data_train[,sapply(data_train, is.numeric)]


corrplot(cor(m), type = "lower",method = 'color', tl.cex = .5, 
         col = COL2('RdBu', 10))

corrplot(cor(m), type = "lower",method = 'number', tl.cex = .5, 
         col = COL2('RdBu', 10))

Now we can confirm just how correlated fixed acidity is with citric acid (0.68 correlation) and density (0.68 correlation), along with its negative correlation with pH (-0.68 correlation). Just as we expected earlier no other variables are strongly correlated with one another (no other variables have > 0.6 correlation).

The Pair Function

Lastly, as alluded to earlier, there is a quick way to visualize a scatterplot of every variable against every other using the pairs() function.

quality <- wine$quality
l <- length(unique(quality))

pairs(wine, main = "Pairs Matrix", row1attop = FALSE, upper.panel = NULL,  
      col = hcl.colors(l, "temps")[quality])

By making each value of quality a different color, we would ideally have been able to see in this plot a clear case for either simple logistic regression (one variable predicting quality) or multiple logistic regression (two variables predicting quality), since the ratings for quality would have been separated into clusters. Instead they all seem to be indistinguishable. The correlations that pop up between the predictor variables are not of much concern to us, since we already know from the correlation matrix how correlated the variables are with one another. In this case we were simply focused on looking for a pattern in the distribution of quality within each scatterplot.

As it stands, though, no single pair of variables is a good metric on its own for predicting quality. What this tells us is that quality is likely predicted by a multiple logistic regression that takes into account more than two predictor variables at a time, or, even more likely, that quality is very difficult to discern from an interpretable model that simply uses a linear combination (plus some sort of penalty in some cases) to make a prediction. Models such as K-Nearest Neighbors, Ensemble Random Forests, and Boosted Trees are what come to mind.

Model Building

Criteria for Judging Models

We will be balancing the ROC AUC values of our tested models with model complexity/interpretability to determine which model is the best.

ROC AUC

The ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1; by analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and without the disease. The ROC curve is plotted with TPR against FPR, where TPR is on the y-axis and FPR is on the x-axis. I will be using ROC AUC values as the main metric for determining model goodness for this project.

ROC Terminology

TPR is the True Positive Rate: \(\frac {TP}{TP + FN}\) where \(TP\) represents the number of True Positives and \(FN\) represents the number of False Negatives

Specificity: \(\frac {TN}{TN + FP}\) where \(TN\) represents the number of True Negatives

FPR is the False Positive Rate: \(FPR = 1 - Specificity = 1 - \frac {TN}{TN + FP} = \frac {FP}{TN + FP}\)

source https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
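
To make the terminology concrete, here is a tiny numerical sketch (the counts are hypothetical, purely for illustration) of how the three quantities above are computed:

# hypothetical confusion-matrix counts, purely for illustration
tp <- 40; fn <- 10; tn <- 35; fp <- 15

tpr <- tp / (tp + fn)           # true positive rate (sensitivity)
specificity <- tn / (tn + fp)   # specificity
fpr <- 1 - specificity          # false positive rate, i.e. fp / (tn + fp)

c(TPR = tpr, Specificity = specificity, FPR = fpr)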

Model Complexity

There is a very basic tradeoff in data science: the more complex the model, the less interpretable it is. In situations where we want to understand why the predictors (or certain predictors) are causing a specific response, we will want a more interpretable model. In cases where all we care about is accuracy and precision, we will opt for the less interpretable and more complex models. As for how this balancing act pertains to this project: first and foremost I want the most accurate model possible, since I want to compete with the pros (the wine critics), but at the same time I am personally interested in what is causing certain wines to receive certain ratings. Ultimately I will weigh complexity and interpretability roughly equally when choosing my model (with the tie going to the more accurate model, which will typically be the more complex one).

Now that we know what data we are dealing with and what we wish to accomplish with it, we can move on to the next step, which is building models that can help us achieve what we've set out to do: namely, predict the quality ratings of wine given the physicochemical contents of the wine.

Create Recipe

Model building is as much an art form as it is a science (much like producing wine)

Perhaps the most important step in our model building is the recipe step. Here we choose one variable to predict (in our case quality) and then specify which variables from our data set we'd like to use as predictors. Since there are only 11 predictor variables total to choose from and they all appear to be relevant physicochemical properties (who am I to say that citric acid or chlorides are irrelevant to predicting wine quality?), I will leave all of the predictor variables in the recipe, so that we will have 1 response variable quality and 11 predictor variables: fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, p_h, sulphates, and alcohol.

Note: Had we been asking a different sort of question (namely, which predictor variables had the greatest effect overall, or which predictor variables had the most significant effect on another predictor variable), we would go about our model much differently. We would likely perform a global test in which we went one by one adding to the null model (or subtracting from the full model) predictor variables that satisfied a certain threshold. As it stands, though, we are coming into this project completely ignorant about wine and what makes wine good, so our approach, to put it indelicately, is essentially to throw everything at the wall and see what sticks. We are not trying to create the most agile model that picks only the bare minimum number of predictors, or to work out which predictors interact with each other (this took hours of coding, mixing and matching different models, and trying different pairs of interaction terms, all to no avail since my model hardly improved after each of these tinkerings). While certain variables appear on the surface, to a layman like me, to be highly correlated with one another (think free_sulfur_dioxide and total_sulfur_dioxide), it did my model no great benefit to remove one or the other of them. Similarly, I tried adding interaction terms that made sense to my naive understanding of what makes wine taste good; the obvious examples were interactions between opposites like residual_sugar and citric_acid, since the great taste of many things is often attributed to striking the right balance of opposites (e.g. donuts usually have to strike the right balance between carbs and sugars to be tasty).

Thus, I assumed some sort of balance of opposites in wine might be the key to making it taste good, but these attempts at interaction terms between seemingly opposite components all failed to improve my model, which left me back at square one with the simplest-to-interpret and easiest-to-explain recipe: we use all 11 predictor variables to predict the 1 response variable, making sure to center and scale all the predictors using step_normalize().

wine_recipe <- recipe(quality ~ ., data = data_train) |>
  step_normalize(all_predictors())
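
As a quick sanity check (not shown in the original write-up), we can prep and bake the recipe to confirm the predictors really have been centered and scaled:

# prep the recipe on the training data and bake it to preview the normalized
# predictors (each should now have mean ~0 and standard deviation ~1)
wine_recipe |> prep() |> bake(new_data = NULL) |> head()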

After creating my recipe, the next step is to create, tune, and then test my models. Referring to each model as a single model, however, is a bit deceiving, since for 3 of my models I will be tweaking (tuning) parameters and then choosing the best of the resulting set. Finally I will compare the best model of each model type at the end before summarizing my findings.

Linear Discriminant Analysis

Background on LDA

Linear discriminant analysis is used as a tool for classification, dimension reduction, and data visualization. It has been around for quite some time now. Despite its simplicity, LDA often produces robust, decent, and interpretable classification results. When tackling real-world classification problems, LDA is often the benchmarking method before other more complicated and flexible ones are employed.

source: https://towardsdatascience.com/linear-discriminant-analysis-explained-f88be6c1e00b

Performing Linear Discriminant Analysis is a three-step process.

1 Calculate the ‘separability’ between the classes. Known as the between-class variance, it is defined as the distance between the mean of different classes, and allows for the algorithm to put a quantitative measure on ‘how difficult’ the problem is (closer means = harder problem). This separability is kept in a ‘between-class scatter matrix’.

2 Compute the within-class variance, or the distance between the mean and the sample of every class. This is another factor in the difficulty of separation — higher variance within a class makes clean separation more difficult.

3 Construct a lower-dimensional space that maximizes the between-class variance (‘separability’) and minimizes the within-class variance. Known as Fisher’s Criterion, Linear Discriminant Analysis can be computed using singular value decomposition, eigenvalues, or using a least squares method.
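
As a small illustration of the idea (separate from the tidymodels workflow built below), MASS::lda() can be fit directly on a couple of predictors to inspect the linear discriminants it finds; the choice of alcohol and volatile_acidity here is arbitrary:

# illustrative only: fit an LDA on two predictors and look at the coefficients
# of the linear discriminants (Fisher's directions)
lda_sketch <- MASS::lda(quality ~ alcohol + volatile_acidity, data = data_train)
lda_sketch$scaling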

Building the Model

We will build our model using the discrim_linear() function from the discrim library. We set the mode to “classification” since we are seeking to classify a factor, namely the rating for quality.

lda_mod <- discrim_linear() |>
  set_mode("classification") |>
  set_engine("MASS")

lda_wkflow <- workflow() |>
  add_model(lda_mod) |>
  add_recipe(wine_recipe)

Fit Our Model

Next we will fit our model on the training data set data_train then take that fitted model and test it out on our testing data set (i.e. we give the model some data to learn from and then test how well that model has learned by introducing it to brand new data to make predictions on).

lda_fit <- fit(lda_wkflow, data_train)


lda_acc <- augment(lda_fit, new_data = data_train) |>
  accuracy(truth = quality, estimate = .pred_class)

lda_test_acc <- augment(lda_fit, new_data = data_test) |>
  accuracy(truth = quality, estimate = .pred_class)

lda_roc <- augment(lda_fit, new_data = data_train) |>
  roc_auc(truth = quality, estimate = .pred_3:.pred_8)

lda_test_roc <- augment(lda_fit, new_data = data_test) |>
  roc_auc(truth = quality, estimate = .pred_3:.pred_8)
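
The chunk above computes accuracy and ROC AUC on both the training and testing sets but does not print them; a quick way (not in the original output) to surface the testing-set values referenced in the summary below is:

# print the testing-set accuracy and ROC AUC computed above
bind_rows(lda_test_acc, lda_test_roc)
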
Evaluating the ROC AUC Values
augment(lda_fit, new_data = data_test) |>
  roc_curve(quality, .pred_3:.pred_8) |>
  autoplot()

From our ROC curves we see that for quality = 3 and quality = 8 we are getting very good results; for ratings of 4 and 6 our results are okay but not great, and for ratings of 5 and 7 they are quite good but not great.

Confusion Matrix
# augment(lda_fit, new_data = data_train) |>
#   conf_mat(truth = quality, estimate = .pred_class)

augment(lda_fit, new_data = data_test) |>
  conf_mat(truth = quality, estimate = .pred_class)
##           Truth
## Prediction   3   4   5   6   7   8
##          3   0   1   0   0   0   0
##          4   0   0   3   2   0   0
##          5   3  10 156  61   4   0
##          6   0   4  43 113  31   2
##          7   0   1   3  16  22   3
##          8   0   0   1   2   0   0

From our confusion matrix we can see how accurate our predictions are by looking at the values on the diagonal (from the top left corner down to the bottom right corner): these represent correctly predicted testing observations. For a given rating x, the other values in x's column represent observations that truly were x but were incorrectly predicted to be some other rating, and the other values in x's row represent observations of other ratings that were incorrectly predicted to be x.

In our Confusion Matrix we got okay predictions of 5 and 6 but poor predictions at every other rating for quality.
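
As an aside (not part of the original output), yardstick can also turn the confusion matrix into per-class summary statistics, which makes the "okay on 5 and 6, poor elsewhere" pattern easier to quantify:

# summarise the testing-set confusion matrix into accuracy, sensitivity,
# specificity, etc. (macro-averaged over the six quality classes)
augment(lda_fit, new_data = data_test) |>
  conf_mat(truth = quality, estimate = .pred_class) |>
  summary()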

Summary of LDA

Ultimately the LDA model has an accuracy of 0.6049896 and an average roc_auc of 0.7854594. Are these good values? It's hard to say off the bat, other than that if I were trying to rate the wine on my own I would certainly not do as well as this model (since I'd be completely guessing and have only a 1/6 chance of guessing each rating correctly). To determine how good this model really is we'll have to build more models and see how it stacks up. Next up is K-Nearest Neighbors.

K-Nearest Neighbors

Background on KNN

The K-Nearest Neighbor method for classification is a simple, easy-to-implement supervised machine learning algorithm that takes a given observation and classifies it based on the class of its nearest neighbors: if two of the three nearest observations are blue, as pictured below, then it will guess that the observation is blue, and vice-versa if the majority of its nearest neighbors are yellow.

Choosing K

The key to KNN classification is choosing a suitable value of K (K is the number of nearest neighbors we base our classification on, hence the name K-nearest neighbors). If we choose too high a value for K then the decision boundary will follow a fairly smooth line, while a small value of K will create very jagged boundaries as well as some islands within the boundaries.

Overfitting

Here we see that past a certain point (1/K = 0.10, i.e. K = 10) the training error keeps diminishing but we have overfit our data, and so the testing error will begin to increase, since our model is no longer fitting the actual population but only the training set, which is merely a sample of that population.

Building the Model

Since I do not know what the best value for K will be, I will simply set neighbors = tune() and hope to find the best model by evaluating multiple different values of K and comparing them.

KNN <- nearest_neighbor(neighbors = tune()) |>  # K-nearest neighbor model (KNN)
                                                # tuning neighbors
  set_mode("classification")  |>                # set for classification
  set_engine("kknn")                            # using the kknn engine


knn_workflow <- workflow() |>                   # Create a Workflow
  add_model(KNN) |>                             # add the KNN model we created
  add_recipe(wine_recipe)                                # add our recipe

neighbors_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)
Tuning the Model

Using tune_grid() we can tune our KNN model then use collect_metrics() on our tuned model to gather our performance metrics.

# tune the knn model
tune_knn <- tune_grid(
  object = knn_workflow,
  resamples = wine_fold, 
  grid = neighbors_grid,
  control = control_grid(verbose = TRUE)
)
Find the Best KNN Model (Based on roc_auc)

We can use show_best() to find which KNN model is best with regard to roc_auc.

tune_knn |> 
  show_best(metric = "roc_auc")
## # A tibble: 5 × 7
##   neighbors .metric .estimator  mean     n std_err .config              
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1        10 roc_auc hand_till  0.712     2  0.0173 Preprocessor1_Model10
## 2         9 roc_auc hand_till  0.703     2  0.0132 Preprocessor1_Model09
## 3         8 roc_auc hand_till  0.701     2  0.0156 Preprocessor1_Model08
## 4         7 roc_auc hand_till  0.701     2  0.0165 Preprocessor1_Model07
## 5         6 roc_auc hand_till  0.700     2  0.0197 Preprocessor1_Model06

Here we only see the 5 best models according to their roc_auc values, but how do we know which one is best, since oftentimes we want to balance model complexity against the performance metric?

Determining K

We use desc(model_parameter) if larger values of the parameter would lead to a simpler model (in the case of KNN, a smaller value of K leads to a more bendy, i.e. more complex, decision boundary, so the less complex model will be the one with a larger value of K).

best_neighbors <- select_by_one_std_err(
  tune_knn, 
  desc(neighbors),
  # neighbors,
  metric = "roc_auc")

We've determined the best value of K to be 10 when taking model complexity into account alongside the roc_auc performance metric.

Fitting our Best KNN Model

Using best_neighbors, which corresponds to our K = 10 model, we now fit it to our training data set using finalize_workflow(). We then use augment() to assess the performance of our chosen model on our testing data set.

# Use `finalize_workflow()` and `fit()` to fit your chosen model to the entire
# **training set**.
en_final_knn <- finalize_workflow(knn_workflow,
                                 best_neighbors)
en_final_knn <- fit(en_final_knn, 
                        data = data_train)


# test the best model's predictions on quality using the testing data
final_knn_test <- augment(en_final_knn,
                               data_test) |>
  select(quality, starts_with(".pred"))
Evaluating the ROC AUC Values

We plot the roc_auc values for each quality rating.

# plot the roc curves of the different qualities of wine
roc_curves_knn <- roc_curve(
  final_knn_test, truth = quality, .pred_3:.pred_8) |>
  autoplot(); roc_curves_knn

Heat Map

In addition to ROC AUC, the heat map is a useful visual tool for quickly seeing how our predictions did at the 0.5 threshold (the roc_auc plots measure performance across thresholds; you can think of a heat map as a snapshot at the exact center point of the roc_auc plot).

conf_matrix <- conf_mat(
  final_knn_test, truth = quality,
         .pred_class) |>
  autoplot(type = "heatmap"); conf_matrix

best_roc_knn <- roc_auc(
  final_knn_test, truth = quality, .pred_3:.pred_8)
Compare Training vs Testing ROC AUC
# best_roc_knn$.estimate

# best_neighbors$mean

Finally, our training and testing ROC AUC values are fairly similar, with the training value being 0.7119268 and the testing value being 0.6669486.

Summary of KNN

What is quickly evident from our roc_auc plots is that all of our models did much better than I could ever have done on my own (which isn't saying much). With this KNN model with K set equal to 10 we can predict wines with a rating of 5, 6, and 7 quite well, and only at the extremes of 3, 4, and 8 do these predictions work badly, which was to be expected. The bad predictions are likely due to the fact that the observations at those extremities do not have very many neighbors to begin with, so the KNN model is going to have a hard time predicting them. Still, this has been helpful insofar as we know we are on the right track: I am not a professional wine critic yet, but using KNN I've gotten good at rating the wines of medium quality; now I just need to figure out how to rate the great wines and the bad wines. Most likely we'll have to move to a less interpretable but more powerful machine learning model now to get better predictions (hopefully) for these observations that are hard to reach with KNN.

Random Forests

The next model we will be building is a Random Forest Ensemble.

Background on Random Forests

Random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. But what is a decision tree?

Decision Tree Recap

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
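
To make the recap concrete, a single decision tree could be fit to our training data with the rpart package (this sketch is illustrative only and not part of the tidymodels pipeline; it assumes rpart is installed):

library(rpart)

# fit one classification tree on the training data and print its splits
single_tree <- rpart(quality ~ ., data = data_train, method = "class")
single_tree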

Recap of Random Forests

The fundamental concept behind random forest is a simple but powerful one — the wisdom of crowds. In data science speak, the reason that the random forest model works so well is: A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

Source: https://towardsdatascience.com/understanding-random-forest-58381e0602d2

Note on Random Forest vs Bagging

Random forests provide an improvement over bagged trees by way of a random small tweak that decorrelates the trees. Source ISL: pg. 343

Building the Model

Since I do not know what the best value for mtry, trees, or min_n should be, I will tune all three and hope to find the best model by evaluating models at multiple different values of these parameters.

rf_wine <- rand_forest(mtry = tune(),     # create our random forest
                       trees = tune(),    # tune mtry/trees/min_n
                       min_n = tune()) |>
  
  set_engine("ranger",                    # use the `ranger` engine
             importance = "impurity") |>  # set importance = impurity
  
  set_mode("classification")              # we are doing classification 
                                          # for `quality`
  



rf_wine_wf <- workflow() |>             # create our workflow
  add_model(rf_wine) |>                 # add model
  add_recipe(wine_recipe)               # add our recipe



# create our rf grid; mtry ranges from 1 to 11, trees from 100 to 500, and 
# min_n from 10 to 20, with 8 levels of each.
rf_grid <- grid_regular(mtry(range = c(1,11)),
                        trees(range = c(100,500)),
                        min_n(range = c(10,20)),
                        levels = 8)
Tuning the Model

We have already tuned the random forest and saved it to a file in our zip package entitled tune_class.rds. We do not need to re-run this tuning step and can save time by just pre-loading it.

# tune our random forest..... this part takes forever
tune_class <- tune_grid(
  rf_wine_wf,
  resamples = wine_fold,
  grid = rf_grid
)
# save this to an .rds file so we can re-load it and save time
save(tune_class, file = "tune_class.rds")
Load-in tune_class.rds
# load our tuned random forest to save time
load("tune_class.rds")

Fitting Our Best RF Model

Using best_rf_class, the best-performing random forest from our tuning results, we now fit it to our training data set using finalize_workflow(). We then use augment() to assess the performance of our chosen model on our testing data set.
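
The selection of best_rf_class is not shown in this write-up; a minimal sketch, assuming we simply take the parameter combination with the highest cross-validated roc_auc, would be:

# select the mtry/trees/min_n combination with the best cross-validated roc_auc
best_rf_class <- select_best(tune_class, metric = "roc_auc")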

Evaluating the ROC AUC Values
# fit the best rf model using finalize_workflow()
final_rf <- finalize_workflow(rf_wine_wf,
                                 best_rf_class)
final_rf <- fit(final_rf, 
                        data = data_train)

# test the best model's predictions on quality using the testing data
final_rf_model_test <- augment(final_rf,
                               data_test) |>
  select(quality, starts_with(".pred"))


# save the roc_auc value of the best model
best_roc_rf <- roc_auc(
  final_rf_model_test, truth = quality, .pred_3:.pred_8) 


# plot the roc curves of the different wine quality ratings
roc_curves_rf <- roc_curve(
  final_rf_model_test, truth = quality, .pred_3:.pred_8) |>
  autoplot(); roc_curves_rf

What we see here is a massive improvement (relative to KNN) in the roc_auc values at 3, 4, and 8, and a substantial improvement at 5, 6, and 7.

Variable Importance Plot

The variable importance plot is helpful in showing how important the model recognized each variable to be when making its splits (i.e. decision tree branches).

From ISL page 344: Variable importance is computed using the mean decrease in Gini index, and expressed relative to the maximum

library(vip)
# create a variable importance plot `vip()`
final_rf |> extract_fit_parsnip() |>
  vip(aesthetics = list(color = "dark blue", fill = "light blue")) +
  theme_minimal()

Funnily enough, our best random forest model found alcohol to be the most important variable, followed by sulphates, total sulfur dioxide, and volatile acidity. Alcohol blew the other three away with a score of 150 in importance. The next three all scored above 50 in importance, with all the remaining variables falling closely together between 25 and 50.

Heat Map

Create a heat map of our predictions at the 0.5 probability threshold (just like we did with K-Nearest Neightbors).

conf_matrix <- conf_mat(
  final_rf_model_test, truth = quality,
         .pred_class) |>
  autoplot(type = "heatmap"); conf_matrix

Two things stick out again with this confusion matrix: (1) we still cannot get good predictions of 3, 4, and 8; (2) our predictions for 6 and 7 are much more accurate, while 5 is slightly less accurate.

Compare Training vs Testing ROC AUC
best_rf <- tune_class |>
  show_best(metric = "roc_auc")

rftrain <- best_rf$mean |> max()

# best_roc_rf$.estimate

Finally, our testing ROC AUC actually outperformed our training ROC AUC, with scores of 0.8321943 and 0.7796423, respectively.

Summary

From looking at the roc_auc plots, our random forest model has substantially improved across every quality rating (with the exception of 5, which only marginally improved). This tells us that the job of determining the correct wine quality rating is much more complex than the KNN approach, which you can think of as a single decision tree (are the majority of the K nearest neighbors of quality x/y/z, etc.). The random forest is much more granular than KNN, obviously, because there is more than one decision branch. Quantity is not always better than quality though: the best number of trees was shown to be 100, which was the minimum we had specified. Past a certain number, the decision branches generated are likely redundant, and it would be worth lowering the lower end of our trees range if we were to re-run this model and see whether the optimal number of trees is even lower.

Boosted Trees

Background on Boosted Trees

Recap on Decision Trees/Random Forests

When we want to create non-linear models, we can try creating tree-based models. First, we can start with decision trees. The main drawback of decision trees is overfitting the training data. A great alternative to decision trees is random forests. We can create a random forest by combining multiple decision trees via a technique called Bagging (bootstrap aggregating). Random forests have much better performance than decision trees.

Random forests also have a drawback. They can’t deal with mistakes (if any) created by their individual decision trees. Because of parallel learning, if one decision tree makes a mistake, the whole random forest model will make that mistake. A great alternative to random forests is boosted-tree models. The main objective of such models is to outperform decision trees and random forests by avoiding the above drawbacks.

What are Boosted Trees

Like bagging, boosting is an ensemble method in which boosted trees are created with a group of decision trees. It is useful to distinguish between bagging and boosting.

Boosting deals with errors created by previous decision trees. In boosting, new trees are formed by considering the errors of trees in previous rounds. Therefore, new trees are created one after another. Each tree is dependent on the previous tree. This type of learning is called sequential learning where parallel computing is not ideal to perform.

Some of the key considerations of boosting are:

  • Boosting transforms weak decision trees (called weak learners) into strong learners.

  • Each new tree is built considering the errors of previous trees.

  • In both bagging and boosting, the algorithms use a group (ensemble) of decision trees. Bagging and boosting are known as ensemble meta-algorithms.

  • Boosting is an iterative process. Each tree is dependent on the previous one. Therefore, it is hard to parallelize the training process of boosting algorithms. The training time will be higher. This is the main drawback of boosting algorithms.

  • The trees modified from the boosting process are called boosted trees.

Source https://towardsdatascience.com/introduction-to-boosted-trees-2692b6653b53

Building the Model

Since I do not know what the best value for mtry, trees, or learn_rate should be, I will tune all three and hope to find the best model by evaluating models at multiple different values of these parameters.

bt_wine <- boost_tree(mtry = tune(),                # create our boosted tree
                           trees = tune(),          # tune mtry/trees/learn_rate
                           learn_rate = tune()) |>
  set_engine("xgboost") |>                          # use the `xgboost` engine
  set_mode("classification")                  # we are performing classification

bt_wf <- workflow() |>                              # create our workflow
  add_model(bt_wine) |>                             # add model
  add_recipe(wine_recipe)                           # add our recipe



# create our bt grid, determine the range for trees on our own, our mtry 
# ranges from 1 to 11, trees from 100 to 500, and learn_rate from -10 to -1, 
# 5 levels of each.
bt_grid <- grid_regular(mtry(range = c(1, 11)),
                        trees(range = c(100, 500)),
                        learn_rate(range = c(-10, -1)),
                        levels = 5)
Tuning the Model

We have already tuned the boosted tree and saved it to a file in our zip package entitled tune_bt_wine.rds. We do not need to re-run this tuning step and can save time by just pre-loading it.

# tune our boosted tree..... this part takes forever
tune_bt_wine <- tune_grid(
  bt_wf,
  resamples = wine_fold,
  grid = bt_grid
)

# save this to an .rds file so we can re-load it and save time
save(tune_bt_wine, file = "tune_bt_wine.rds")
Load-in tune_bt_wine.rds
# load our tuned boosted tree to save time
load("tune_bt_wine.rds")

Fitting Our Best BT Model
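
As with the random forest, the selection of best_bt_class does not appear in the write-up; a minimal sketch, assuming we again take the combination with the highest cross-validated roc_auc, would be:

# select the mtry/trees/learn_rate combination with the best cross-validated roc_auc
best_bt_class <- select_best(tune_bt_wine, metric = "roc_auc")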

Evaluating the ROC AUC Values
# fit the best bt model using finalize_workflow()
final_bt_model <- finalize_workflow(bt_wf, best_bt_class)
final_bt_model <- fit(final_bt_model, data_train)


# test the best model's predictions on quality using the testing data
final_bt_model_test <- augment(final_bt_model, 
                               data_test) |> 
  select(quality, starts_with(".pred"))

# save the roc_auc value of the best model
best_roc_bt <- roc_auc(
  final_bt_model_test, truth = quality, .pred_3:.pred_8)


# plot the roc curves of the different wine quality ratings
roc_curves_bt <- roc_curve(
  final_bt_model_test, truth = quality, .pred_3:.pred_8) |>
  autoplot(); roc_curves_bt

Once again this is a large improvement upon the KNN roc_auc values. Additionally, we seem to have done worse than the Random Forest model for quality = 3 but improved at quality = 4. Seeing as there are 5 times as many observations at quality = 4 as there are at quality = 3, I think it's fair to say that the Boosted Tree model is an improvement over the Random Forest model (although both are quite similar, and both are an improvement over KNN).

Variable Importance Plot
# Here is the VIP plot:

final_bt_model |> extract_fit_parsnip() |>
  vip(aesthetics = list(color = "dark blue", fill = "light blue")) +
  theme_minimal()

According to our Variable Importance plot, alcohol is still the most important variable, followed by sulphates, then volatile_acidity, etc. What's noticeably different is that there are no steep dropoffs in the level of importance between the top few and the rest, as there were with the Random Forest variable importance plot. The importance declines steadily from the most to the least important variable.

Heat Map
conf_mat(final_bt_model_test, truth = quality, 
         .pred_class) |> 
  autoplot(type = "heatmap")

One thing that instantly pops out is that we have now successfully predicted a wine quality of 8 at the 0.5 probability threshold. This means that with zero knowledge of wine we can build a model that correctly predicts a wine's quality to be 8 with a 20% hit rate (there were 4 other predictions of 8 that were wrong). While this sounds really bad, keep in mind that there are 481 observations in our testing set and only 5 total observations with quality = 8, so if we were guessing at random there would be roughly a 1.039% chance of guessing quality = 8. In that light, getting a single quality = 8 prediction right is pretty good. Unfortunately we still did not correctly predict a quality = 3 (there were only 3 observations of quality = 3 out of the 481 observations in our testing set, so this was an even harder task).

Compare Training vs Testing ROC AUC

Training ROC AUC

best_bt <- tune_bt_wine |>
  show_best(metric = "roc_auc")

bttrain <- best_bt$mean |> max();bttrain
## [1] 0.7394401

Testing ROC AUC

best_roc_bt$.estimate
## [1] 0.8321943

As with our random forest model, the ROC AUC value did not decline when we moved from the training data to the testing data. The training data produced an roc_auc value equal to 0.7394401 while the testing data set produced an roc_auc value equal to 0.8321943.

Summary

Surprisingly, the Boosted Tree with mtry = 1 worked the best, meaning that only one randomly selected predictor variable was made available to each split.

Model Comparison and Conclusion

Note: When comparing our models we will strictly be focusing on how these models performed on the testing data set. It is of little concern how well they did on the training data set because it could very well be the case that they were overfitted. The testing data set is what is relevant to us; it is the gold standard for measuring how well the model will perform on new data in the future.

ROC AUC

The Receiver operating characteristic Area Under the Curve was the primary metric we measured our models by. Taking the best performing model of each of the 4 model types we have the following roc_auc values:

best_rocs <- data.frame("LDA" = lda_test_roc$.estimate,
               "KNN" = best_roc_knn$.estimate,
               "RF" = best_roc_rf$.estimate,
               "BT" = best_roc_bt$.estimate)

best_rocs |> print()
##         LDA       KNN        RF        BT
## 1 0.7854594 0.6669486 0.8321943 0.8321943

The Random Forest and Boosted Tree models had the best roc_auc values. The Linear Discriminant Analysis model was not too far off, but the K-Nearest Neighbors model was off by enough to be considered the worst of the bunch.

Model Complexity, Interpretability, and Explainability

Ultimately the Linear Discriminant Analysis model is more interpretable and less complex than the Random Forest/Boosted Tree, but at the same time it can be fairly hard to explain to a layman. Though there is a lot more going on in the Random Forest and Boosted Tree models (e.g. they are much more computationally expensive to fit), they are not too difficult to explain, since they are essentially made up of decision trees as their building blocks and everyone can understand the workings of a decision tree. For this reason I've narrowed the two best models down to the Random Forest and the Boosted Tree. Once again model complexity comes into my thinking, and I've chosen the Random Forest model as the best since, all things being equal (and their roc_auc values are in fact equal), the less complex model is the Random Forest (boosted trees are in many ways an extension of the random forest method and more complex). Since there is no improvement in roc_auc when using the Boosted Tree, it is the Random Forest model in the end that wins out.

A last consideration in support of the Random Forest model over the Boosted Tree model is that the Random Forest's roc_auc was actually better on the testing data than on the training data. This is evidence against the possibility that our Random Forest model is overfitted, and it is another consideration in my decision that it is the best performing model.

Conclusion

Ultimately I wanted to answer the following questions:

(1) Do the quality rankings of wine follow some sort of objective pattern based on the make-up of their contents? Surely if a wine’s contents can be broken down to x% of alcohol and y% of sugar etc. then another wine of very similar contents should score the same if the ratings are in fact objective and..

(2) If this rating can be predicted than surely that means it is objective and following some sort of pattern, but what is the pattern and is it even an interpretable one?

  1. It is clear that there is a pattern at play in regards to wine quality ratings. All of our models performed better than if we were just randomly guessing which suggests there is some clear relationship between the objectively measured predictor variables and the subjective response variable. Wine critics are following some sort of methodology in their wine ratings that can be predicted if we know the physicochemical characteristics of the wine we were provided with.

  2. Now that we know there is a pattern, is it interpretable? Unfortunately the more interpretable models we tried (logistic regression, generalized additive models) were not included in this project because they performed so poorly. The models I did include were all of at least moderate complexity. This tells us that to get our predictions we have to use fairly complex models that will give us a prediction but will not give us a good reason why that prediction was made.

Note: We know the underlying workings of LDA/KNN/Random Forests/Boosted Trees, but they do not provide the clear interpretations that logistic regression would (i.e. if logistic regression had performed well we might be able to say that if a wine has alcohol percentage x and grams of sugar y then its log-odds of receiving a rating of z are \(0.1 + 0.25x + 0.4y\), to give a very rudimentary example; in such a scenario we would know exactly what effect each predictor variable has on the response variable). As it happens, though, the models we have been left with are not so easy to interpret and draw clear conclusions from. Each rating happens to be determined by a very particular balance of each predictor variable, and more importantly it is likely determined by variables that are not available to us, which leads me to my final point...

Why Weren’t Our Predictions Better?

Most of our models performed in the green range

It would have been very satisfying to perfectly predict the quality ratings of our wines but this did not end up being the case even when utilizing complex machine learning models to make our predictions and eschewing interpretable models altogether.

Here are a few of my hunches on why the predictions were not better:

1 - We did not have enough data.

While 1,600 observations was certainly enough to perform the task we set out to do, it was the distribution of those observations that was the bottleneck to achieving greater predictive accuracy. There were so few observations with a quality rating of 3 or 8 that we had to constrain our cross-validation to only 2 folds (note: this only affected KNN/RF/BT, not LDA). A greater number of observations, or a more evenly distributed data set, would likely have resulted in better predictions.

2 - We did not have the right predictor variables

As is always the case in data science, the amount of data we have available is limited. Our models are performing the best they can given the data on hand, but it is entirely possible that a few physicochemical properties missing from this data set could have provided the missing key to a perfectly predictive model. Lastly, because human judgement can never truly be objective, we can think of the wine quality ratings as inherently containing some random noise, so that a wine critic may give the same exact wine a different rating one day than he would the next, simply due to the random error inherent in human judgement. This is irreducible error that no model can correct for, but it is error nonetheless that helps explain why our model was not as accurate as we would have liked.

It was a long journey from start to finish!
The finished Product