My inspiration and motive for this project largely stem from my complete and utter ignorance about the subject of wine. When to drink it, when not to drink it, how to drink it: I can't even answer these simple questions. I have enjoyed it when it has been served to me on some occasions, but not always, and if I were buying a bottle of wine at the store for myself I would have no clue what to get, since you must remember not only labels and brands but also specific categories and sub-categories. All this is beside the point other than to show that I would be wholly out of my depth if I were asked to judge a wine's quality. With the advent of predictive data analytics, however, could I devise a way to rate wines just as well as the critics? This is where the idea for this final project sprang from; two questions in particular arose that interested me:
(1) Do the quality rankings of wine follow some sort of objective pattern based on the make-up of their contents? Surely if a wine's contents can be broken down into x% alcohol, y% sugar, etc., then another wine of very similar contents should score the same if the ratings are in fact objective; and
(2) If this rating can be predicted, then surely that means it is objective and follows some sort of pattern, but what is the pattern, and is it even an interpretable one?
For this final project, I am working with the file
"winequality-red.csv", found in /data. The
file is from Kaggle.
From the dataset description:
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
This dataset is also available from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/wine+quality
Relevant publication
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
My aim is to ultimately answer both of the questions laid out earlier:
I would like to fit multiple models that can predict a wine's rating and…
if I'm successful, I hope to then choose which model I think is best and explain why it is the best.
Prior to building our models we will first take a look at our data. By looking at the characteristics of each variable (the range of values it takes on, its distribution, and its class), and then by visualizing the data graphically, we can gain some insights on the data set before building our models.
First let’s load in all the libraries we will be using for this
project, then we’ll load in our data set
winequality-red.csv using read.csv().
set.seed(3333)
# load in the libraries we will need
library(tidyverse)
library(tidymodels)
library(ggplot2) # for visual plots
library(ggthemes) # for plot themes/customization
library(corrplot) # for a correlation plot
library(corrr)
library(kknn) # for KNN modeling
library(visdat) # visualize missing data
library(tibble) # creating tibbles
library(astsa)
library(janitor) # for cleaning variable names
library(knitr)
library(kableExtra)
library(discrim)
tidymodels_prefer()
# read in our csv data
wine_quant <- read.csv("data/winequality-red.csv")
# create a copy of the data set to work with (it's always good to keep the
# original copy intact and edit/work with a new copy)
wine <- wine_quant

Using the janitor package we clean up our data set's variable names.
Our original variable names
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
wine <- wine |>
as_tibble() |> clean_names() # clean up the variable names
wine_quant <- wine_quant |>
as_tibble() |> clean_names() # ditto for our original data set

Our new variable names
## [1] "fixed_acidity" "volatile_acidity" "citric_acid"
## [4] "residual_sugar" "chlorides" "free_sulfur_dioxide"
## [7] "total_sulfur_dioxide" "density" "p_h"
## [10] "sulphates" "alcohol" "quality"
Before we can begin we must first take a look at what's in the data set. We'll start by looking at the first 10 observations.
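The preview below was presumably produced with something along these lines:

wine |> head(10) # look at the first 10 observations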
## # A tibble: 10 × 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## 7 7.9 0.6 0.06 1.6 0.069
## 8 7.3 0.65 0 1.2 0.065
## 9 7.8 0.58 0.02 2 0.073
## 10 7.5 0.5 0.36 6.1 0.071
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, p_h <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <int>
All of our data is numerical: every variable but one is floating point (i.e. has decimals), and only quality, the variable we are seeking to predict, is an integer.
quality as a Factor
The first difficult decision I had to make in this project was whether to perform regression or classification. Since all of my data was numerical, the simple solution would have been regression: the quality score would simply be a quantity calculated from each of the other variables' quantities, and it would not have been difficult to round the predicted value up or down to ensure the rating ended up as an integer. Ultimately, however, I came to the conclusion that quality should in fact be treated as a factor and that I should perform classification, since another wine data set could very well use ratings of good, great, exceptional, or be rated out of 4 or 5 stars, etc., and it makes the most sense to keep my code easily reproducible on other data sets and truly put it to the test if I wanted to do so.
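The conversion itself is not shown above; a minimal sketch of what it presumably looks like (the str() output further down confirms quality is stored as a factor with six levels):

# treat quality as a factor rather than an integer so that we can classify it
wine <- wine |>
  mutate(quality = factor(quality))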
Prior to visualizing our data we can quickly glean information by looking at each variable's quartiles to gain insight into its distribution and range.
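The summary and structure output below was presumably generated by calls along these lines:

summary(wine) # five-number summaries (and counts for the quality factor)
str(wine)     # class and a preview of each variable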
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## p_h sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 53
## Median :3.310 Median :0.6200 Median :10.20 5:681
## Mean :3.311 Mean :0.6581 Mean :10.42 6:638
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 7:199
## Max. :4.010 Max. :2.0000 Max. :14.90 8: 18
## tibble [1,599 × 12] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile_acidity : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric_acid : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual_sugar : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free_sulfur_dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## $ total_sulfur_dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## $ p_h : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
Ultimately we're looking at a data set of 1,599 observations containing 11 predictor variables and 1 response variable.
quality
Additionally, we want to quickly glean some information about the distribution of our response variable quality in particular.
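The output below was presumably produced with something like:

str(wine$quality)     # structure of the quality factor
summary(wine$quality) # counts at each quality rating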
## Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Two things are immediately noticeable after looking at the summary and structure of quality:
1 quality takes on only six unique values: 3, 4, 5, 6, 7, and 8.
2 the vast majority of observations take on a quality score of 5 or 6 (so-so), quite a few take on a score of 4 or 7 (not good/good), and very few take on a score of 3 or 8 (very bad/very good).
Now that we have made all the changes we seek to make to our variables (aside from quality, it made no sense to change any of the other variables to non-numeric classes), and we've studied each variable enough to reasonably guess what it represents, we can now write out our Codebook, which will be a reference tool throughout the remainder of the project.
Note: the codebook is also available in a separate
file in our zip package titled wine_codebook.rtf.
Input variables (based on physicochemical tests):
1 - fixed_acidity - fixed acidity - The predominant
fixed acids found in wines are tartaric, malic, citric, and succinic.
Their respective levels found in wine can vary greatly but in general
one would expect to see 1,000 to 4,000 mg/L tartaric acid, 0 to 8,000
mg/L malic acid, 0 to 500 mg/L citric acid, and 500 to 2,000 mg/L
succinic acid.
2 - volatile_acidity - volatile acidity - A measure of
the wine’s volatile (or gaseous) acids. The primary volatile acid in
wine is acetic acid, which is also the primary acid associated with the
smell and taste of vinegar.
3 - citric_acid - citric acid - Citric acid is often
added to wines to increase acidity, complement a specific flavor or
prevent ferric hazes. It can be added to finished wines to increase
acidity and give a “fresh” flavor. The disadvantage of adding citric
acid is its microbial instability.
4 - residual_sugar - residual sugar - Sweetness in wine
is called residual sugar and is usually measured in grams per litre
(g/L). Residual sugar or ‘RS’ is from the natural grape sugars left in a
wine after the alcoholic fermentation finishes. The more residual sugar
remaining in a wine, the sweeter the wine is.
5 - chlorides - chlorides - In wines, the concentration of chloride ions is generally indicative of the presence of sodium chloride. Sodium chloride adds to the saltiness of a wine, which can contribute to or detract from the overall taste and quality of the wine.
6 - free_sulfur_dioxide - free sulfur dioxide - Two
classes of sulfites are found in wine: free and bound. The free sulfites
are those available to react and thus exhibit both germicidal and
antioxidant properties.
7 - total_sulfur_dioxide - total sulfur dioxide - Total
Sulfur Dioxide (TSO2) is the portion of SO2 that is free in the wine
plus the portion that is bound to other chemicals in the wine such as
aldehydes, pigments, or sugars.
8 - density - density - Density is defined as the mass,
or weight, per volume of a material. In the case of liquids, density is
often measured in units of g/mL, which is the weight in grams (g) of
each milliliter (mL) of liquid. The density of wine is primarily
determined by the concentration of alcohol, sugar, glycerol, and other
dissolved solids.
9 - p_h - pH - pH is a measurement of acidity. Acidity
gives a wine its tart and sour taste. Fundamentally speaking, all wines
lie on the acidic side of the pH spectrum, and most range from 2.5 to
about 4.5 pH (7 is neutral).
10 - sulphates - sulphates - Sulphites act as both a wine's preservative and enhancer; many vintners purposely add sulphites at key moments of the winemaking process to quickly halt ongoing fermentation or to help protect the wine against potential oxidation or bacterial exposure, which could occur at various stages of the winemaking process.
11 - alcohol - alcohol - percentage volume of the wine
that is ethanol, a type of alcohol that is produced by fermentation of
grains, fruits, or other sources of sugar (in this case grapes).
Output variable (based on sensory data):
12 - quality - quality (score between 0 and 10) - the
rating given to the wines in our data set.
An important step in predictive analysis is to partition the data so that we can isolate part of the data for training our models and use the remainder of our data set for testing the models. Intuitively we refer to these as the training set and testing set.
We will partition our data into two.
We use the first partition as our training data set. This is the data
set we will use to train our models (our models will learn from it).
I've chosen 70% of the data set to use for training (it is typical to use a majority of the data set for training and a minority for testing; striking the right balance in their respective proportions is largely left up to our own discretion. I chose 70% since I thought it was appropriate given the number of observations available: it ensured there were 2-3 observations at the extreme values of quality, namely 3 and 8, in the testing data set).
Then we will use our second partition to test the models by seeing how accurately they can make predictions on it.
We stratify the partition on quality to ensure that the
distribution of quality ratings is similar in each data set.
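The split itself is not shown in the text; a minimal sketch of how it was presumably created with rsample, keeping the data_train and data_test names used throughout the rest of the project:

# 70/30 training/testing split, stratified on quality
wine_split <- initial_split(wine, prop = 0.70, strata = quality)
data_train <- training(wine_split)
data_test  <- testing(wine_split)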
Similar to our training/testing partition, the cross-validation fold method trains and tests our models even more rigorously by creating testing/training sets within the training set itself, which we can train and evaluate our models on before eventually testing them for real on the held-out testing data set.
It has been typical to use 5-fold cross-validation in our class up until this point. For my folds it was very difficult to decide, but ultimately I settled on 2 folds total, since that was the only way to ensure there were at least 2 observations with a quality rating of 8 in both folds (any higher number of folds would have left some folds with 1 or no observations with a rating of 8).
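A sketch of the fold creation, again stratified on quality; the name wine_folds is an assumption since the fold object is never named in the text:

# 2-fold cross-validation on the training set, stratified on quality
wine_folds <- vfold_cv(data_train, v = 2, strata = quality)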
We have loaded our data in and made adjustments to the variables accordingly; the next step is to visualize our data, which can give us greater insights than just looking at the data's summary and structure.
One of the first things we usually want to do is look at how complete the data set is. That is: how many observations are missing inputs, and for which variables?
We've lucked out in this case, as this data set has no missing values. That being said, it would have been reasonable to impute some missing values using knowledge about which variables are correlated with which others, or in other cases to remove an observation entirely if an accurate way to impute the missing value did not present itself.
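Since the visdat package was loaded for exactly this purpose, the completeness check was presumably something like:

vis_miss(wine) # visualize missing values (none are present in this data set)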
Our histogram and bar charts are two rudimentary ways to show the distribution of our ratings for quality as well as the frequency at which those ratings occur.
# histogram of quality scores
hist <- ggplot(
wine_quant, aes(x = quality)) +
geom_histogram(binwidth = 1,
color = "darkblue",
fill = "lightblue"); hist

# bar chart (similar to the histogram above but now ordered by frequency)
bar <- ggplot(wine, aes(y = fct_rev(fct_infreq(quality)))) +
geom_bar(color = "darkblue", fill = "lightblue") + ylab("quality"); bar

As we already confirmed earlier, the most common values for quality are 5 and 6, and the least common are 3 and 8.
Since there are so many variables in our data set it would be cumbersome to show a scatterplot or a box plot of every single predictor plotted against quality. Keeping with the spirit of this project (namely that I am an ignoramus when it comes to wine), I will arbitrarily choose alcohol as a good starting point to plot against quality with a scatterplot (maybe it's plausible that the more drunk the wine critic gets, the higher the rating he will give to the wine). Note: At the end I will show a very concise and quick way to visualize all the possible scatterplots of the entire data set at once using the pairs() function, but for the time being let's keep things simple by plotting quality rating vs. alcohol percentage.
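The scatterplot code itself is not shown; a minimal sketch of the plot being described, using the numeric copy of the data:

# scatterplot of quality rating vs. alcohol percentage
scatter <- ggplot(wine_quant, aes(x = alcohol, y = quality)) +
  geom_point(color = "darkblue") +
  theme_minimal(); scatter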
As expected this scatterplot is quite useless, seeing as quality only takes on integer values and the visual is 2-dimensional with no way of accounting for overlapping values (i.e. the clusters at quality values of 5, 6, and 7 could be thousands of observations or just dozens). Overall this is not a very helpful plot, so we'll move on and look at some better ones.
Keeping with my hunch that alcohol is a good predictor of quality
I’ll make a box plot once again using alcohol and
quality.
box <- ggplot(wine,aes(x=quality,y= alcohol, group=quality)) +
geom_boxplot(aes(fill = quality))+
theme_minimal()+
# geom_jitter(alpha=0.5) +
theme(legend.position = "top");box
Here it becomes very evident that the highest-rated wines have the highest average alcohol content, but whereas ratings 6, 7, and 8 show an upward trend, ratings 3, 4, and 5 show a downward trend, creating an overall U-shaped trend for alcohol content.
But if we recall from our Histogram and Bar Charts, the graphs might
not be telling the whole story since there are so few values at
the extremes of 3 and 8 and there are so
many values in the middle at 5 and 6.
We can account for this by using geom_jitter.
boxjitter <- ggplot(wine,aes(x=quality,y= alcohol, group=quality)) +
geom_boxplot(aes(fill = quality))+
theme_minimal()+
geom_jitter(alpha=0.5) +
theme(legend.position = "top");boxjitter
It is apparent now that there is a conditional probability to consider. While the higher-rated wines have the highest alcohol content on average, there are many more high-alcohol-content wines overall with a quality rating of 5, 6, and 7. Additionally, the range of alcohol content overlaps across each rating of quality (i.e. there are wines with a quality of 3 with higher alcohol content than wines with a quality of 8). Ultimately this tells me that I will likely not be able to find a good predictor of quality by naively choosing a single predictor variable based off of a hunch. I will need to look at graphs that can efficiently visualize multiple variables at once rather than go pair by pair (or use the pairs() function to visualize all the scatterplots quickly).
The correlation matrix is a quick tool to see how each variable correlates with one another.
# create an alternative correlation matrix that gives numerical values for
# correlation
winecorr <- data_train |> correlate()
winecorr |> rplot()

Our correlation matrix confirms that some variables are correlated with one another (fixed acidity is positively correlated with citric acid and density, and negatively correlated with pH; the rest of the variables seem to have weak correlations), but how strong are these correlations?
We'll use two triangular correlation matrices: one to quickly assess which variables are correlated with one another, and a second, customized one that gives us quantities rather than colors to measure correlation by.
m <- data_train[,sapply(data_train, is.numeric)]
corrplot(cor(m), type = "lower",method = 'color', tl.cex = .5,
col = COL2('RdBu', 10))
Now we can confirm just how correlated fixed acidity is with citric acid
(0.68 correlation) and density (0.68
correlation), along with its negative correlation with pH
(-0.68 correlation). Just as we expected earlier no
other variables are strongly correlated with one another (no other
variables have > 0.6 correlation).
Lastly, as alluded to earlier, there is a quick way to visualize a scatterplot of every variable against every other variable using the pairs() function.
quality <- wine$quality
l <- length(unique(quality))
pairs(wine, main = "Pairs Matrix", row1attop = FALSE, upper.panel = NULL,
col = hcl.colors(l, "temps")[quality])

By making each value of quality a different color, we should ideally have been able to read from this plot a clear case of either single logistic regression of one variable with quality or multiple logistic regression with two variables and quality, since the ratings for quality would have been separated into clusters. Instead they all seem to be indistinguishable. The correlations that pop up between the predictor variables are not of much concern to us since we already know from the correlation matrix how correlated the variables are with one another. In this case we were simply focused on looking for a pattern in the distribution of quality within each scatterplot.
As it stands, though, no single pair of variables is a good metric on its own for predicting quality. What this tells us is that quality is likely predicted by a multiple logistic regression that takes into account more than two predictor variables at a time, or, even more likely, that quality is very difficult to discern from an interpretable model that simply uses a linear combination (plus some sort of penalty in some cases) to make a prediction. Models such as K-Nearest Neighbors, Ensemble Random Forests, and Boosted Trees are what come to mind.
We will be balancing the ROC AUC values of our tested models with model complexity/interpretability to determine which model is the best.
The ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability; it tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and without the disease. The ROC curve is plotted with TPR against FPR, where TPR is on the y-axis and FPR is on the x-axis. I will be using ROC AUC values as the main metric for determining model goodness for this project.
TPR is the True Positive Rate: \(\frac {TP}{TP + FN}\) where \(TP\) represents the number of True Positives and \(FN\) represents the number of False Negatives
Specificity: \(\frac {TN}{TN + FP}\) where \(TN\) represents the number of True Negatives and \(FP\) represents the number of False Positives
FPR is the False Positive Rate: \(FPR = 1 - Specificity = 1 - \frac {TN}{TN + FP} = \frac {FP}{TN + FP}\)
source https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
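To make these definitions concrete with some purely hypothetical counts: if a classifier produced \(TP = 80\), \(FN = 20\), \(FP = 10\), and \(TN = 90\) at a given threshold, then \(TPR = \frac{80}{80 + 20} = 0.8\) and \(FPR = \frac{10}{90 + 10} = 0.1\), which gives one point on the ROC curve; sweeping the probability threshold from 0 to 1 traces out the full curve, and the AUC is the area beneath it.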
There is a very basic tradeoff in data science: the more complex the model, the less interpretable it is. In situations where we want to understand why the predictors (or certain predictors) are causing a specific response, we will want a more interpretable model. In cases where all we care about is accuracy and precision, we will opt for less interpretable and more complex models. As for how this balancing act pertains to this project: first and foremost I want the most accurate model possible since I want to compete with the pros (the wine critics), but at the same time I am personally interested in what is causing certain wines to receive certain ratings. Ultimately I will weigh both complexity and interpretability roughly equally when determining my model (with the tie going to the more accurate model, which will typically be the more complex one).
Now that we know what data we are dealing with and what we wish to accomplish with it, we can move on to the next step: building models that can help us achieve what we've set out to do, namely predict the quality ratings of wine given the physicochemical contents of the wine.
Perhaps the most important step in our model building is the recipe
step. Here we choose one variable to predict (in our case
quality) and then specify from our data set which variables
we’d like to use as predictors. Since there are only 11 predictor
variables total to choose from and they all appear to be relevant
physicochemical properties (who am I to say that citric acid or
chlorides are irrelevant to predicting wine quality?), I will
leave all of the predictor variables in the recipe so that we will have
1 response variable quality and 11 predictor variables
fixed_acidity, volatile_acidity,
citric_acid, residual_sugar,
chlorides, free_sulfur_dioxide,
total_sulfur_dioxide, density,
p_h, sulphates, and alcohol.
Note Had we been asking a different sort of question, namely if we wanted to know which predictor variables had the greatest effect of all or which predictor variables had the most significant effect on another predictor variable, we would go about our model much differently. We would likely perform a global test in which we went one by one adding to the null model (or subtracting from the full model) predictor variables that satisfied a certain threshold. As it stands, though, we are coming into this project completely ignorant about wine and what makes wine good, and so our approach, to put it indelicately, is essentially to throw everything at the wall and see what sticks. We are not trying to create the most agile model that picks only the bare minimum number of predictors or tries to see which predictors interact with each other (this took hours of coding and mixing and matching different models together and using different pairs of interaction terms, but to no avail, since my model hardly improved after each of these tinkerings). While certain variables appear on the surface, to a layman like me, to have lots of correlation with one another (think free_sulfur_dioxide and total_sulfur_dioxide), it did my model no great benefit to remove one or the other of these correlated variables. Similarly, I tried to add interaction terms that made sense to my naive understanding of what makes wine taste good; the obvious examples were interactions between opposites like residual_sugar and citric_acid, since the great taste of many things is often attributed to striking the right balance of opposites (e.g. donuts usually have to strike the right balance between carbs and sugars to be tasty).
Thus, I assumed some sort of balance of opposites in wine might be the key to making it taste good, but these attempts at interaction terms between seemingly opposite components all failed to make any improvement to my model, which left me back at square one: the simplest-to-interpret and easiest-to-explain recipe. We use 11 predictor variables to predict 1 response variable, making sure to center and scale all the predictor variables using step_normalize().
wr <- recipe(quality ~ ., data = data_train) |>
step_normalize(all_predictors())
wine_recipe <- recipe(quality ~ ., data = data_train) |>
step_normalize(all_predictors())

After creating my recipe, the next step is to create, tune, and then test my models. To refer to each model as a single model, however, is a bit deceiving, since for 3 of my models I will be tweaking the parameters and then choosing which of the possible tweaked models is the best of the set. Finally I will compare the best model of each model type at the end before summarizing my findings.
Linear discriminant analysis is used as a tool for classification, dimension reduction, and data visualization. It has been around for quite some time now. Despite its simplicity, LDA often produces robust, decent, and interpretable classification results. When tackling real-world classification problems, LDA is often the benchmarking method before other more complicated and flexible ones are employed.
source: https://towardsdatascience.com/linear-discriminant-analysis-explained-f88be6c1e00b
Performing Linear Discriminant Analysis is a three-step process.
1 Calculate the ‘separability’ between the classes. Known as the between-class variance, it is defined as the distance between the mean of different classes, and allows for the algorithm to put a quantitative measure on ‘how difficult’ the problem is (closer means = harder problem). This separability is kept in a ‘between-class scatter matrix’.
2 Compute the within-class variance, or the distance between the mean and the sample of every class. This is another factor in the difficulty of separation — higher variance within a class makes clean separation more difficult.
3 Construct a lower-dimensional space that maximizes the between-class variance (‘separability’) and minimizes the within-class variance. Known as Fisher’s Criterion, Linear Discriminant Analysis can be computed using singular value decomposition, eigenvalues, or using a least squares method.
We will build our model using the discrim_linear()
function from the discrim library. We set the mode to
“classification” since we are seeking to classify a factor, namely the
rating for quality.
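The model specification and workflow are not shown in the text; a minimal sketch of how the lda_wkflow object fitted below was presumably built (the name lda_mod is an assumption):

lda_mod <- discrim_linear() |>  # linear discriminant analysis model
  set_mode("classification") |> # we are classifying the quality factor
  set_engine("MASS")            # the standard LDA engine

lda_wkflow <- workflow() |>     # create a workflow
  add_model(lda_mod) |>         # add the LDA model
  add_recipe(wine_recipe)       # add our recipe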
Next we will fit our model on the training data set
data_train then take that fitted model and test it out on
our testing data set (i.e. we give the model some data to learn from and
then test how well that model has learned by introducing it to brand new
data to make predictions on).
lda_fit <- fit(lda_wkflow, data_train)
lda_acc <- augment(lda_fit, new_data = data_train) |>
accuracy(truth = quality, estimate = .pred_class)
lda_test_acc <- augment(lda_fit, new_data = data_test) |>
accuracy(truth = quality, estimate = .pred_class)
lda_roc <- augment(lda_fit, new_data = data_train) |>
roc_auc(truth = quality, estimate = .pred_3:.pred_8)
lda_test_roc <- augment(lda_fit, new_data = data_test) |>
roc_auc(truth = quality, estimate = .pred_3:.pred_8)

From our ROC AUC plots we see that for quality = 3 and quality = 8 we are getting very good results, for ratings of 4 and 6 our results are okay but not great, and for ratings of 5 and 7 they are quite good but not great.
# augment(lda_fit, new_data = data_train) |>
# conf_mat(truth = quality, estimate = .pred_class)
augment(lda_fit, new_data = data_test) |>
conf_mat(truth = quality, estimate = .pred_class)

## Truth
## Prediction 3 4 5 6 7 8
## 3 0 1 0 0 0 0
## 4 0 0 3 2 0 0
## 5 3 10 156 61 4 0
## 6 0 4 43 113 31 2
## 7 0 1 3 16 22 3
## 8 0 0 1 2 0 0
From our confusion matrix we can see how accurate our predictions are by looking at the values on the diagonal (from the top left corner down to the bottom right corner): these values represent correctly predicted testing observations. For a given rating x, the other values in x's column represent observations that were truly x but were incorrectly predicted to be y, z, etc., and the other values in x's row represent observations that were truly y, z, etc. but were incorrectly predicted to be x.
In our confusion matrix we got okay predictions of 5 and 6 but poor predictions at every other rating for quality.
Ultimately the LDA model has shown an accuracy of 0.6049896 and an average roc_auc of 0.7854594. Are these good values? It's hard to say off the bat, other than that if I were trying to rate the wine on my own I would certainly not do as well as this model (since I'd be completely guessing and have only a 1/6 chance of guessing each rating correctly). To further determine how good this model is we'll have to build more models and see how it stacks up. Next up is K-Nearest Neighbors.
The K-Nearest Neighbors method for classification is a simple, easy-to-implement supervised machine learning algorithm that takes a given observation and classifies it based on the classes of its nearest neighbors; so if two of the three nearest observations are blue, as pictured below, then it will guess that the observation is blue, and vice versa if the majority of its nearest neighbors are yellow.
The key to KNN classification is choosing a suitable value of K (K is the number of nearest neighbors we base our classification on, hence the name k-nearest neighbors). If we choose too high a value for K then the decision boundary will follow a fairly smooth line, while a small value of K will create very jagged boundaries as well as even some islands within the boundaries.
Here we see that past a certain point (1/K = 0.10, i.e. K = 10) the
training error will keep diminishing but we have overfit our data and so
the testing error will begin to increase since our model is no longer
fitting the actual population but only focusing on the
training set population which is a subset of the actual population.
Since I do not know what the best value of K will be, I will simply set neighbors = tune() and hope that I can find the best model by evaluating multiple different values of K and then comparing them.
KNN <- nearest_neighbor(neighbors = tune()) |> # K-nearest neighbors model (KNN),
# tuning neighbors
set_mode("classification") |> # set for classification
set_engine("kknn") # using the kknn engine
knn_workflow <- workflow() |> # create a workflow
add_model(KNN) |> # add the KNN model we created
add_recipe(wine_recipe) # add our recipe
neighbors_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)

Using tune_grid() we can tune our KNN model, then use collect_metrics() on our tuned model to gather our performance metrics.
I’ll use collect_metrics() to print the
accuracy and roc_auc for each model across
folds.
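The tuning call itself is not shown; a sketch of what it presumably looks like, assuming the fold object is named wine_folds as in the cross-validation sketch earlier (tune_knn is the object used below with select_by_one_std_err()):

tune_knn <- tune_grid(
  knn_workflow,           # our KNN workflow
  resamples = wine_folds, # the 2-fold cross-validation object
  grid = neighbors_grid   # the grid of K values from 1 to 10
)
collect_metrics(tune_knn) # accuracy and roc_auc for each K, averaged across folds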
## # A tibble: 20 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 accuracy multiclass 0.584 2 0.00731 Preprocessor1_Model01
## 2 1 roc_auc hand_till 0.604 2 0.00284 Preprocessor1_Model01
## 3 2 accuracy multiclass 0.584 2 0.00731 Preprocessor1_Model02
## 4 2 roc_auc hand_till 0.659 2 0.00800 Preprocessor1_Model02
## 5 3 accuracy multiclass 0.584 2 0.00731 Preprocessor1_Model03
## 6 3 roc_auc hand_till 0.669 2 0.00524 Preprocessor1_Model03
## 7 4 accuracy multiclass 0.584 2 0.00731 Preprocessor1_Model04
## 8 4 roc_auc hand_till 0.673 2 0.00912 Preprocessor1_Model04
## 9 5 accuracy multiclass 0.592 2 0.0154 Preprocessor1_Model05
## 10 5 roc_auc hand_till 0.681 2 0.0124 Preprocessor1_Model05
## 11 6 accuracy multiclass 0.574 2 0.0153 Preprocessor1_Model06
## 12 6 roc_auc hand_till 0.700 2 0.0197 Preprocessor1_Model06
## 13 7 accuracy multiclass 0.583 2 0.0118 Preprocessor1_Model07
## 14 7 roc_auc hand_till 0.701 2 0.0165 Preprocessor1_Model07
## 15 8 accuracy multiclass 0.590 2 0.0154 Preprocessor1_Model08
## 16 8 roc_auc hand_till 0.701 2 0.0156 Preprocessor1_Model08
## 17 9 accuracy multiclass 0.584 2 0.00909 Preprocessor1_Model09
## 18 9 roc_auc hand_till 0.703 2 0.0132 Preprocessor1_Model09
## 19 10 accuracy multiclass 0.584 2 0.0109 Preprocessor1_Model10
## 20 10 roc_auc hand_till 0.712 2 0.0173 Preprocessor1_Model10
For the sake of time and simplicity we will be focusing only on the models with the highest roc_auc values for this project (for help understanding what roc_auc represents, consult the previous section).
We can use show_best() to find which KNN model is best with regard to roc_auc.
show_best(tune_knn, metric = "roc_auc")
## # A tibble: 5 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 roc_auc hand_till 0.712 2 0.0173 Preprocessor1_Model10
## 2 9 roc_auc hand_till 0.703 2 0.0132 Preprocessor1_Model09
## 3 8 roc_auc hand_till 0.701 2 0.0156 Preprocessor1_Model08
## 4 7 roc_auc hand_till 0.701 2 0.0165 Preprocessor1_Model07
## 5 6 roc_auc hand_till 0.700 2 0.0197 Preprocessor1_Model06
Here we only see the 5 best models according to their roc_auc values, but how do we know which one is best, since often we want to balance model complexity with the performance metric?
We use desc(model_parameter) if larger values of the parameter lead to a simpler model (in the case of KNN a smaller value of K leads to a more bendy, i.e. more complex, decision boundary, and so the less complex model will be the one with a larger value of K).
best_neighbors <- select_by_one_std_err(
tune_knn,
desc(neighbors),
# neighbors,
metric = "roc_auc")

We've determined the best value of K to be 10 when taking model complexity into account while evaluating the roc_auc performance metric.
Using best_neighbors, which is our K = 10 model, we finalize our workflow with finalize_workflow() and fit it to our training data set. We then use augment() to assess the performance of our chosen model on our testing data set.
# Use `finalize_workflow()` and `fit()` to fit your chosen model to the entire
# **training set**.
en_final_knn <- finalize_workflow(knn_workflow,
best_neighbors)
en_final_knn <- fit(en_final_knn,
data = data_train)
# test the best model's predictions on quality using the testing data
final_knn_test <- augment(en_final_knn,
data_test) |>
select(quality, starts_with(".pred"))

We plot the ROC curve for each quality rating.
# plot the roc curves of the different qualities of wine
roc_curves_knn <- roc_curve(
final_knn_test, truth = quality, .pred_3:.pred_8) |>
autoplot(); roc_curves_knn

In addition to ROC AUC, the heat map is a useful visual tool for quickly seeing how our predictions did at the 0.5 threshold (the roc_auc plots measure performance across thresholds; you can think of a heat map as a snapshot at the exact center point of the roc_auc plot).
conf_matrix <- conf_mat(
final_knn_test, truth = quality,
.pred_class) |>
autoplot(type = "heatmap"); conf_matrix

What is quickly evident from our roc_auc plots is that all of our models did much better than I could ever have done on my own (which isn't saying much). With this KNN model with K set equal to 10 we can predict wines with a rating of 5, 6, and 7 quite well, and only at the extremes of 3, 4, and 8 do these predictions work badly, which was to be expected. The bad predictions are likely due to the fact that the observations at those extremes do not have very many neighbors to begin with, so the KNN model is going to have a hard time predicting them. Still, this has been helpful insofar as we know we are on the right track: I am not a professional wine critic yet, but using KNN I've gotten good at rating the wines of medium quality; now I just need to figure out how to rate the great wines and the bad wines. Most likely we'll have to move to a less interpretable but more powerful machine learning model now to get better predictions (hopefully) for these observations that are hard to reach with KNN.
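For the model comparison at the end of the project we also save this KNN model's test-set ROC AUC; a sketch of how the best_roc_knn value referenced in that comparison was presumably computed:

# save the roc_auc value of the best KNN model on the testing data
best_roc_knn <- roc_auc(
  final_knn_test, truth = quality, .pred_3:.pred_8)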
The next model we will be building is a Random Forest Ensemble.
Background on Random Forests
Random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. But what is a decision tree?
A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
The fundamental concept behind random forest is a simple but powerful one — the wisdom of crowds. In data science speak, the reason that the random forest model works so well is: A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
Source: https://towardsdatascience.com/understanding-random-forest-58381e0602d2
Random forests provide an improvement over bagged trees by way of a random small tweak that decorrelates the trees. Source ISL: pg. 343
Since I do not know what the best values for mtry, trees, or min_n should be, I will be tuning all three and hope that I can find the best model by evaluating the models at multiple different values of these respective parameters.
rf_wine <- rand_forest(mtry = tune(), # create our random forest
trees = tune(), # tune mtry/trees/min_n
min_n = tune()) |>
set_engine("ranger", # use the `ranger` engine
importance = "impurity") |> # set importance = impurity
set_mode("classification") # we are doing classification
# for `quality`
rf_wine_wf <- workflow() |> # create our workflow
add_model(rf_wine) |> # add model
add_recipe(wine_recipe) # add our recipe
# create our rf grid; we determine the range for trees on our own, our mtry
# ranges from 1 to 11, and min_n ranges from 10 to 20, with 8 levels of each.
rf_grid <- grid_regular(mtry(range = c(1, 11)),
trees(range = c(100, 500)),
min_n(range = c(10, 20)),
levels = 8)

We have already tuned the random forest and saved it to a file in our zip package entitled tune_class.rds. We do not need to re-run this tuning step and can save time by just pre-loading it.
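A sketch of that pre-loading step, assuming the saved results are read into an object named tune_rf (both the object name and the relative path are assumptions):

# load the saved random forest tuning results instead of re-running tune_grid()
tune_rf <- read_rds("tune_class.rds")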
Since there are 512 accuracy and 512 roc_auc metrics calculated, I'll only preview the first few here to save time. Like before, we can find which of these is the best using the select_best() function later on.
## # A tibble: 10 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 100 10 accuracy multiclass 0.638 2 0.00984 Preprocessor1_Mode…
## 2 1 100 10 roc_auc hand_till 0.722 2 0.0325 Preprocessor1_Mode…
## 3 2 100 10 accuracy multiclass 0.647 2 0.00984 Preprocessor1_Mode…
## 4 2 100 10 roc_auc hand_till 0.737 2 0.0372 Preprocessor1_Mode…
## 5 3 100 10 accuracy multiclass 0.636 2 0.00447 Preprocessor1_Mode…
## 6 3 100 10 roc_auc hand_till 0.731 2 0.0116 Preprocessor1_Mode…
## 7 5 100 10 accuracy multiclass 0.636 2 0.00268 Preprocessor1_Mode…
## 8 5 100 10 roc_auc hand_till 0.723 2 0.00325 Preprocessor1_Mode…
## 9 6 100 10 accuracy multiclass 0.648 2 0.00268 Preprocessor1_Mode…
## 10 6 100 10 roc_auc hand_till 0.730 2 0.0132 Preprocessor1_Mode…
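The best parameter combination (printed below) is presumably pulled with select_best(); a sketch, assuming the tuning results live in tune_rf and keeping the best_rf_class name used when finalizing the workflow:

# select the random forest with the highest roc_auc
best_rf_class <- select_best(tune_rf, metric = "roc_auc")
best_rf_class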
## # A tibble: 1 × 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 11 100 11 Preprocessor1_Model072
According to our roc_auc criteria, the model with
mtry = 11, trees = 100, and
min_n = 11 performs the best. This is backed up by our
autoplot() function.
In the plot second from the left on the bottom row, the best random forest model sticks out as the highest orange dot; it corresponds to the model with mtry = 11, trees = 100, and min_n = 11 (the orange color represents trees = 100, the x-axis represents mtry, the number of randomly selected predictors, and the second column that the plot lies within represents min_n = 11, the minimal node size).
Using best_rf_class, which is our best random forest model, we finalize our workflow with finalize_workflow() and fit it to our training data set. We then use augment() to assess the performance of our chosen model on our testing data set.
# fit the best rf model using finalize_workflow()
final_rf <- finalize_workflow(rf_wine_wf,
best_rf_class)
final_rf <- fit(final_rf,
data = data_train)
# test the best model's predictions on quality using the testing data
final_rf_model_test <- augment(final_rf,
data_test) |>
select(quality, starts_with(".pred"))
# save the roc_auc value of the best model
best_roc_rf <- roc_auc(
final_rf_model_test, truth = quality, .pred_3:.pred_8)
# plot the roc curves of the different quality ratings
roc_curves_rf <- roc_curve(
final_rf_model_test, truth = quality, .pred_3:.pred_8) |>
autoplot(); roc_curves_rf

What we see here is a massive improvement in the roc_auc values at 3, 4, and 8, and substantial improvement at 5, 6, and 7.
The variable importance plot is helpful in showing how important the model recognized each variable to be when making its splits (i.e. decision tree branches).
From ISL page 344: Variable importance is computed using the mean decrease in Gini index, and expressed relative to the maximum
library(vip)
# create a variable importance plot with `vip()`
final_rf |> extract_fit_parsnip() |>
vip(aesthetics = list(color = "dark blue", fill = "light blue")) +
theme_minimal()

Funnily enough, our best random forest model found alcohol to be the most important variable, followed by sulphates, total sulfur dioxide, and volatile acidity. Alcohol blew the latter three away with a score of 150 in importance; the next three all scored above 50, with all the remaining variables falling closely together between 25 and 50.
Create a heat map of our predictions at the 0.5 probability threshold (just like we did with K-Nearest Neighbors).
conf_matrix <- conf_mat(
final_rf_model_test, truth = quality,
.pred_class) |>
autoplot(type = "heatmap"); conf_matrix

Two things stick out again with this confusion matrix: 1 we still cannot get good predictions of 3, 4, and 8; 2 our predictions for 6 and 7 are much more accurate, while 5 is slightly less accurate.
From looking at the roc_auc plots, our random forest model has substantially improved across every quality rating (with the exception of 5, which only marginally improved). This tells us that the job of determining the correct wine quality rating is much more complex than the KNN approach, which you can think of as a single decision tree (are the majority of the K-nearest neighbors quality x/y/z, etc.). The random forest is much more granular than KNN, obviously, because there is more than one decision branch. Quantity is not always better than quality, though: the best number of trees was shown to be 100, which was the minimum amount we had specified. Past a certain number, the decision branches generated are likely redundant, and it would be worth lowering the minimum number of trees if we were to re-run this model to see whether the optimal tree count is even lower.
When we want to create non-linear models, we can try creating tree-based models. First, we can start with decision trees. The main drawback of decision trees is overfitting the training data. A great alternative to decision trees is random forests. We can create a random forest by combining multiple decision trees via a technique called Bagging (bootstrap aggregating). Random forests have much better performance than decision trees.
Random forests also have a drawback. They can’t deal with mistakes (if any) created by their individual decision trees. Because of parallel learning, if one decision tree makes a mistake, the whole random forest model will make that mistake. A great alternative to random forests is boosted-tree models. The main objective of such models is to outperform decision trees and random forests by avoiding the above drawbacks.
Like bagging, boosting is an ensemble method in which boosted trees are created with a group of decision trees. It is useful to distinguish between bagging and boosting.
Boosting deals with errors created by previous decision trees. In boosting, new trees are formed by considering the errors of trees in previous rounds. Therefore, new trees are created one after another. Each tree is dependent on the previous tree. This type of learning is called sequential learning where parallel computing is not ideal to perform.
Some of the key considerations of boosting are:
Boosting transforms weak decision trees (called weak learners) into strong learners.
Each new tree is built considering the errors of previous trees.
In both bagging and boosting, the algorithms use a group (ensemble) of decision trees. Bagging and boosting are known as ensemble meta-algorithms.
Boosting is an iterative process. Each tree is dependent on the previous one. Therefore, it is hard to parallelize the training process of boosting algorithms. The training time will be higher. This is the main drawback of boosting algorithms.
The trees modified from the boosting process are called boosted trees.
Source https://towardsdatascience.com/introduction-to-boosted-trees-2692b6653b53
Since I do not know what the best values for mtry, trees, or learn_rate should be, I will be tuning all three and hope that I can find the best model by evaluating the models at multiple different values of these respective parameters.
bt_wine <- boost_tree(mtry = tune(), # create our boosted tree
trees = tune(), # tune mtry/trees/learn_rate
learn_rate = tune()) |>
set_engine("xgboost") |> # use the `xgboost` engine
set_mode("classification") # we are performing classification
bt_wf <- workflow() |> # create our workflow
add_model(bt_wine) |> # add model
add_recipe(wine_recipe) # add our recipe
# create our bt grid, determine the range for trees on our own, our mtry
# ranges from 1 to 11, trees from 100 to 500, and learn_rate from -10 to -1,
# 5 levels of each.
bt_grid <- grid_regular(mtry(range = c(1, 11)),
trees(range = c(100, 500)),
learn_rate(range = c(-10, -1)),
levels = 5)

We have already tuned the boosted tree and saved it to a file in our zip package entitled tune_bt_wine.rds. We do not need to re-run this tuning step and can save time by just pre-loading it.
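As with the random forest, a sketch of the pre-loading step, assuming the results are read into an object named tune_bt (object name and path are assumptions):

# load the saved boosted tree tuning results instead of re-running tune_grid()
tune_bt <- read_rds("tune_bt_wine.rds")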
Since there are 250 performance metrics measuring accuracy and roc_auc, I'll only preview the first few here to save time. Like before, we can find which of these is the best using the select_best() function later on.
## # A tibble: 10 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 100 0.0000000001 accuracy multiclass 0.00537 2 0.00179 Preproces…
## 2 1 100 0.0000000001 roc_auc hand_till 0.5 2 0 Preproces…
## 3 1 200 0.0000000001 accuracy multiclass 0.00537 2 0.00179 Preproces…
## 4 1 200 0.0000000001 roc_auc hand_till 0.5 2 0 Preproces…
## 5 1 300 0.0000000001 accuracy multiclass 0.00537 2 0.00179 Preproces…
## 6 1 300 0.0000000001 roc_auc hand_till 0.5 2 0 Preproces…
## 7 1 400 0.0000000001 accuracy multiclass 0.00537 2 0.00179 Preproces…
## 8 1 400 0.0000000001 roc_auc hand_till 0.5 2 0 Preproces…
## 9 1 500 0.0000000001 accuracy multiclass 0.00537 2 0.00179 Preproces…
## 10 1 500 0.0000000001 roc_auc hand_till 0.5 2 0 Preproces…
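Again, the best combination (printed below) is presumably pulled with select_best(); a sketch, assuming the loaded tuning results are named tune_bt and keeping the best_bt_class name used when finalizing the workflow:

# select the boosted tree with the highest roc_auc
best_bt_class <- select_best(tune_bt, metric = "roc_auc")
best_bt_class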
## # A tibble: 1 × 4
## mtry trees learn_rate .config
## <int> <int> <dbl> <chr>
## 1 1 200 0.1 Preprocessor1_Model022
Surprisingly, the best performance according to our roc_auc criteria was for the model with mtry = 1, trees = 200, and learn_rate = 0.1. This is backed up by our autoplot() function.
In the bottom-right graphic, the furthest-left green dot (representing trees = 200) is shown to have the highest roc_auc value.
# fit the best bt model using finalize_workflow()
final_bt_model <- finalize_workflow(bt_wf, best_bt_class)
final_bt_model <- fit(final_bt_model, data_train)
# test the best model's predictions on quality using the testing data
final_bt_model_test <- augment(final_bt_model,
data_test) |>
select(quality, starts_with(".pred"))
# save the roc_auc value of the best boosted tree model
best_roc_bt <- roc_auc(
final_bt_model_test, truth = quality, .pred_3:.pred_8)
# plot the roc curves of the different quality ratings
roc_curves_bt <- roc_curve(
final_bt_model_test, truth = quality, .pred_3:.pred_8) |>
autoplot(); roc_curves_bt

Once again this is a large improvement upon the KNN roc_auc values; additionally, we seem to have done worse than the Random Forest model for quality = 3 but at the same time we've improved at quality = 4. Seeing as there are 5 times as many values at quality = 4 as there are at quality = 3, I think it's fair to say that the Boosted Tree model is an improvement over and better than the Random Forest model (although both are quite similar and both are an improvement over KNN).
# Here is the VIP plot:
final_bt_model |> extract_fit_parsnip() |>
vip(aesthetics = list(color = "dark blue", fill = "light blue")) +
theme_minimal()

According to our variable importance plot, alcohol is still the most important variable, followed by sulphates, then volatile_acidity, etc. What's noticeably different is that there are no steep dropoffs in the level of importance between the top few and the rest like there were with the Random Forest variable importance plot; the importance declines steadily from the most to the least important variable.
One thing that instantly pops out is that we have now successfully predicted a wine quality of 8 at the 0.5 probability threshold. This means that with zero knowledge of wine we can build a model that correctly predicts a wine's quality to be 8 with a 20% chance (there were 4 other predictions of 8 that were wrong). While this sounds really bad, keep in mind that there are 481 observations in our testing set and only 5 total observations with quality = 8, so if we were guessing at random there would be a 1.039% chance of guessing quality = 8. In that light, getting a single quality = 8 prediction right is pretty good. Unfortunately we still did not correctly predict a quality = 3 (there were only 3 observations of quality = 3 out of the 481 observations in our testing set, so this was an even harder task).
Training ROC AUC
## [1] 0.7394401
Testing ROC AUC
## [1] 0.8321943
As with our random forest model, the ROC AUC value did not decline when we moved from the training data to the testing data: the training data produced an roc_auc value equal to 0.7394401 while the testing data set produced an roc_auc value equal to 0.8321943.
Surprisingly, the Boosted Tree with mtry = 1 worked the best, meaning that only one randomly selected predictor variable was made available at each split.
Note When comparing our models we will strictly be focusing on how they performed on the testing data set. It is of little concern how well they did on the training data set because it could very well be the case that they were overfitted. The testing data set is what is relevant to us; it is the gold standard for measuring how well a model will perform on new data in the future.
The Receiver operating characteristic Area Under the
Curve was the primary metric we measured our models by. Taking
the best performing model of each of the 4 model types we have the
following roc_auc values:
best_rocs <- data.frame("LDA" = lda_test_roc$.estimate,
"KNN" = best_roc_knn$.estimate,
"RF" = best_roc_rf$.estimate,
"BT" = best_roc_bt$.estimate)
best_rocs |> print()

## LDA KNN RF BT
## 1 0.7854594 0.6669486 0.8321943 0.8321943
The Random Forest and Boosted Tree models had the best roc_auc values. The Linear Discriminant Analysis model was not too far off, but the K-Nearest Neighbors model was off by enough to be considered the worst of the bunch.
Ultimately the Linear Discriminant Analysis model is more interpretable and less complex than the Random Forest/Boosted Tree, but at the same time it can be fairly hard to explain to a layman. Though there is a lot more going on in the Random Forest and Boosted Tree models (e.g. they are much more computationally demanding), they are not too difficult to explain since they are essentially made up of decision trees as their building blocks, and everyone can understand the workings of a decision tree. For this reason I've narrowed down the two best models to the Random Forest and the Boosted Tree. Once again model complexity comes into my thinking, and I've chosen the Random Forest model as the best since, all things being equal (and their roc_auc values are in fact equal), the less complex model is the Random Forest (boosted trees are in many ways an extension of the random forest method and more complex). Since there is no improvement in roc_auc when using the Boosted Tree, it is the Random Forest model in the end that wins out.
A last consideration in support of using the Random Forest model over the Boosted Tree model is that, in the case of the Random Forest model, our roc_auc actually performed better on the testing data than it did on the training data. This is strong evidence against the possibility that our Random Forest model is overfitted, and it is another consideration behind my decision to call it the best performing model.
Ultimately I wanted to answer the following questions:
(1) Do the quality rankings of wine follow some sort of objective pattern based on the make-up of their contents? Surely if a wine's contents can be broken down into x% alcohol, y% sugar, etc., then another wine of very similar contents should score the same if the ratings are in fact objective; and
(2) If this rating can be predicted, then surely that means it is objective and follows some sort of pattern, but what is the pattern, and is it even an interpretable one?
It is clear that there is a pattern at play in regards to wine quality ratings. All of our models performed better than if we were just randomly guessing which suggests there is some clear relationship between the objectively measured predictor variables and the subjective response variable. Wine critics are following some sort of methodology in their wine ratings that can be predicted if we know the physicochemical characteristics of the wine we were provided with.
Now that we know there is a pattern, is it interpretable? Unfortunately, the more interpretable models we tried (logistic regression / generalized additive models) were not included in this project because they performed so poorly. The models I did include were all of at least moderate complexity. This tells us that to get our predictions we have to use fairly complex models that will give us a prediction but will not give us a good reason why that prediction was made.
Note We know the underlying workings of LDA/KNN/Random Forests/Boosted Trees, but they do not provide the clear interpretations that logistic regression would (i.e. if logistic regression had performed well, we might be able to say that if a wine has x% alcohol and y g/L of sugar then the log odds of it receiving a rating of z would be \(\log \frac{\Pr(z)}{1 - \Pr(z)} = 0.1 + 0.25x + 0.4y\), to give a very rudimentary example; in such a scenario we would know exactly what effect each predictor variable has on the response variable). As it happens, though, the models we have been left with are not so easy to interpret and draw clear conclusions from. Each rating happens to be determined by a very particular balance of each predictor variable and, more importantly, it is likely determined by variables that are not available to us, which leads me to my final point…
It would have been very satisfying to perfectly predict the quality ratings of our wines but this did not end up being the case even when utilizing complex machine learning models to make our predictions and eschewing interpretable models altogether.
Here are a few of my hunches on why the predictions were not better:
While 1,599 observations was certainly enough to perform the task we set out to do, it was the distribution of those observations that was the bottleneck to achieving greater predictive accuracy. There were so few observations with a quality rating of 3 or 8 that we had to constrain our models by using only 2 folds (note: this only affected KNN/RF/BT, not LDA). A greater number of observations, or a more evenly distributed data set, would likely have resulted in better predictions.
As is always the case in data science, the amount of data we have available is limited. Our models are performing the best they can given the data on hand, but it is entirely possible that a few physicochemical properties missing from the data could have provided the missing key to a perfectly predictive model. Lastly, because human judgement can never truly be objective, we can think of the wine quality ratings as inherently having some white-noise error that is completely random in nature, so that a wine critic may give the same exact wine a different rating one day than he would the next, simply due to random error inherent in human judgement. This is irreducible error that no model can correct for, but it is error nonetheless that helps explain why our model was not as accurate as we would have liked.