wine <- read.csv("~/Documents/Intro Data Science/winequalityN.csv")

Introduction

My general research question is what variables from the “winequalityN” dataset are able to predict quality of the wine?

Data

The dataset of “winequalityN” came from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality), but I downloaded it from Kaggle. The dataset includes wines from the Portugal which is called Vinho Verde. In this dataset units is the wines and as seen in the head() command below the variables are the properties that the wine has based on physicochemical tests. There are 6497 observations in the dataset.

The variables I will be focusing are “residual sugar”, “pH”, “alcohol” and “type” to predict “quality”.

The residual sugar is measured in g/L to tell the sweetness

pH is measuring the acidity or basicity (<7 is acidic, >7 is alkaline)

Alcohol is percentage of alcohol content

Type is whether the wine is red or white

Quality is from a scale from 1 to 10 certified by lab tests

Data analysis

Models

I will be using linear regression on two models with the explanatory variables I decided to predict quality of the wine.

Model 1: \(Quality = \beta_0 + \beta_1Alcohol + \beta_2ResidualSugar + \beta_3pH + \epsilon\)

Model 2: \(Quality = \beta_0 + \beta_1Alcohol + \beta_2ResidualSugar + \beta_3pH + \beta_4Type + \epsilon\)

Preliminary exploratory data analysis

The distribution of the response variable is uniform, so I will have to convert it to a normal distribution.

ggplot(wine, aes(x= quality)) +
  geom_histogram()

The following code is converting a uniform distribution to a normal distribution.

#Uniform to normal distribution

set.seed(58493)

wine_uniform <- runif(6497, 0, 10) #create 6497 uniform numbers

wine_uniform_df <- data.frame(
  quality_normal = unif2norm(wine_uniform)  # convert uniform to normal
)

wine2 <- cbind(wine, wine_uniform_df) # combined normal numbers to the wine dataset

ggplot(wine2, aes(x = quality_normal))+
  geom_histogram()

Fitting the model

First I will be splitting the data into training and testing.

Training is 80% and testing is 20%

set.seed(5748)

wine_split <- initial_split(wine2, prop = 0.80)

wine_train <- training(wine_split)

wine_test  <- testing(wine_split)

Model 1

This is the recipe and workflow of the first model on the TRAINING data.

wine_mod <- linear_reg() %>%
  set_engine("lm")

wine_recipe <- recipe(quality_normal ~ .,data= wine2) %>% 
  step_rm(fixed.acidity, volatile.acidity, citric.acid, chlorides,free.sulfur.dioxide, total.sulfur.dioxide, density,sulphates, type, quality)

wine_wflow <- workflow() %>%
  add_model(wine_mod) %>%
  add_recipe(wine_recipe)
wine_fit <- wine_wflow %>%
  fit(data = wine_train)
tidy(wine_fit)

After fitting Model 1 the fitted model equation is:

\(\widehat{Quality} = 0.047 - 0.0002ResidualSugar - 0.036pH + 0.007Alcohol\)

In this model the ResidualSugar and pH increase it causes the Quality to go down while Alcohol causes Quality to go up.

Model 2

This is the recipe and workflow of the second model on the TRAINING data.

wine_recipe2 <- recipe(quality_normal ~ .,data= wine2) %>% 
  step_rm(fixed.acidity, volatile.acidity, citric.acid, chlorides,free.sulfur.dioxide, total.sulfur.dioxide, density,sulphates,quality)%>%
   step_dummy(all_nominal())

wine_wflow2 <- workflow() %>%
  add_model(wine_mod) %>%
  add_recipe(wine_recipe2)
wine_fit2 <- wine_wflow2 %>%
  fit(data = wine_train)

tidy(wine_fit2)

After fitting Model 2 the fitted model equation is:

\(\widehat{Quality} = 0.0532 -0.0001ResidualSugar -0.038pH + 0.007Alcohol -0.002TypeWhite\)

In this model is similar to the first model as ResidualSugar and pH increase it causes the Quality to go down while Alcohol causes Quality to go up. For Type when the wine is white the Quality decreases by 0.002

V-fold cross validation

Now after after getting the fitted models I will be using the V-fold cross validation to assess which model is the winning model.

#v-fold
set.seed(5849)
folds <- vfold_cv(wine_train, v = 5)
set.seed(8594)
wine_fit_rs <- wine_wflow %>%
  fit_resamples(folds)

collect_metrics(wine_fit_rs)

For the first fitted model the RMSE is 1.014 and the R-squared is 0.0012.

set.seed(5839)
wine_fit_rs2 <- wine_wflow2 %>%
  fit_resamples(folds)

collect_metrics(wine_fit_rs2)

For the second fitted model the RMSE is 1.015 and the R-squared is 0.0014.

After using the V-fold cross validation it shows that the both fitted models have almost the same RMSE and R-squared. This shows that both model equally can predict the quality of wine. In the end I chose the second model as the winning model for the TESTING data.

Testing data

set.seed(38829)
wine_fit3 <- wine_wflow2 %>%
  fit(data = wine_test)
tidy(wine_fit3)

After fitting the second model on the TESTING data the fitted equation is:

\(\widehat{Quality} = -1.0933 + 0.004ResidualSugar +0.4106pH -0.0316Alcohol + 0.0834TypeWhite\)

The fitted equation on the TESTING data shows that as ResidualSugar and pH increases it causes the Quality to go up, while when Alcohol increases it causes the Quality to go down. When the wine is White the Quality increases by 0.0834.

set.seed(5859)
folds2 <- vfold_cv(wine_test, v = 5)
set.seed(647)
wine_fit_rs3 <- wine_wflow2 %>%
  fit_resamples(folds2)

collect_metrics(wine_fit_rs3)

After using the V-fold cross validation on the TESTING data the RMSE is 1.0187 and R-squared is 0.002, which is very similar as the RMSE and R-squared on the TESTING data.

Conclusion

The results from these statistical methods suggest that both models can predict quality of wine similarly and that adding Type as a predictor did not make a difference.

The possible limitations of this conclusion is that I couldn’t use some of the variables because they could have been correlated, for example variables with that contained acidity are correlated with pH. Also this model can be used only for wine from Portugal and nowhere else.

A potential future work question would be now that we know which variables can affect quality I am curious which attributes can affect sales of the wine.