wine <- read.csv("~/Documents/Intro Data Science/winequalityN.csv")
My general research question is what variables from the “winequalityN” dataset are able to predict quality of the wine?
The dataset of “winequalityN” came from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality), but I downloaded it from Kaggle. The dataset includes wines from the Portugal which is called Vinho Verde. In this dataset units is the wines and as seen in the head() command below the variables are the properties that the wine has based on physicochemical tests. There are 6497 observations in the dataset.
The variables I will be focusing are “residual sugar”, “pH”, “alcohol” and “type” to predict “quality”.
The residual sugar is measured in g/L to tell the sweetness
pH is measuring the acidity or basicity (<7 is acidic, >7 is alkaline)
Alcohol is percentage of alcohol content
Type is whether the wine is red or white
Quality is from a scale from 1 to 10 certified by lab tests
I will be using linear regression on two models with the explanatory variables I decided to predict quality of the wine.
Model 1: \(Quality = \beta_0 + \beta_1Alcohol + \beta_2ResidualSugar + \beta_3pH + \epsilon\)
Model 2: \(Quality = \beta_0 + \beta_1Alcohol + \beta_2ResidualSugar + \beta_3pH + \beta_4Type + \epsilon\)
The distribution of the response variable is uniform, so I will have to convert it to a normal distribution.
ggplot(wine, aes(x= quality)) +
geom_histogram()
The following code is converting a uniform distribution to a normal
distribution.
#Uniform to normal distribution
set.seed(58493)
wine_uniform <- runif(6497, 0, 10) #create 6497 uniform numbers
wine_uniform_df <- data.frame(
quality_normal = unif2norm(wine_uniform) # convert uniform to normal
)
wine2 <- cbind(wine, wine_uniform_df) # combined normal numbers to the wine dataset
ggplot(wine2, aes(x = quality_normal))+
geom_histogram()
First I will be splitting the data into training and testing.
Training is 80% and testing is 20%
set.seed(5748)
wine_split <- initial_split(wine2, prop = 0.80)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)
This is the recipe and workflow of the first model on the TRAINING data.
wine_mod <- linear_reg() %>%
set_engine("lm")
wine_recipe <- recipe(quality_normal ~ .,data= wine2) %>%
step_rm(fixed.acidity, volatile.acidity, citric.acid, chlorides,free.sulfur.dioxide, total.sulfur.dioxide, density,sulphates, type, quality)
wine_wflow <- workflow() %>%
add_model(wine_mod) %>%
add_recipe(wine_recipe)
wine_fit <- wine_wflow %>%
fit(data = wine_train)
tidy(wine_fit)
After fitting Model 1 the fitted model equation is:
\(\widehat{Quality} = 0.047 - 0.0002ResidualSugar - 0.036pH + 0.007Alcohol\)
In this model the ResidualSugar and pH increase it causes the Quality to go down while Alcohol causes Quality to go up.
This is the recipe and workflow of the second model on the TRAINING data.
wine_recipe2 <- recipe(quality_normal ~ .,data= wine2) %>%
step_rm(fixed.acidity, volatile.acidity, citric.acid, chlorides,free.sulfur.dioxide, total.sulfur.dioxide, density,sulphates,quality)%>%
step_dummy(all_nominal())
wine_wflow2 <- workflow() %>%
add_model(wine_mod) %>%
add_recipe(wine_recipe2)
wine_fit2 <- wine_wflow2 %>%
fit(data = wine_train)
tidy(wine_fit2)
After fitting Model 2 the fitted model equation is:
\(\widehat{Quality} = 0.0532 -0.0001ResidualSugar -0.038pH + 0.007Alcohol -0.002TypeWhite\)
In this model is similar to the first model as ResidualSugar and pH increase it causes the Quality to go down while Alcohol causes Quality to go up. For Type when the wine is white the Quality decreases by 0.002
Now after after getting the fitted models I will be using the V-fold cross validation to assess which model is the winning model.
#v-fold
set.seed(5849)
folds <- vfold_cv(wine_train, v = 5)
set.seed(8594)
wine_fit_rs <- wine_wflow %>%
fit_resamples(folds)
collect_metrics(wine_fit_rs)
For the first fitted model the RMSE is 1.014 and the R-squared is 0.0012.
set.seed(5839)
wine_fit_rs2 <- wine_wflow2 %>%
fit_resamples(folds)
collect_metrics(wine_fit_rs2)
For the second fitted model the RMSE is 1.015 and the R-squared is 0.0014.
After using the V-fold cross validation it shows that the both fitted models have almost the same RMSE and R-squared. This shows that both model equally can predict the quality of wine. In the end I chose the second model as the winning model for the TESTING data.
set.seed(38829)
wine_fit3 <- wine_wflow2 %>%
fit(data = wine_test)
tidy(wine_fit3)
After fitting the second model on the TESTING data the fitted equation is:
\(\widehat{Quality} = -1.0933 + 0.004ResidualSugar +0.4106pH -0.0316Alcohol + 0.0834TypeWhite\)
The fitted equation on the TESTING data shows that as ResidualSugar and pH increases it causes the Quality to go up, while when Alcohol increases it causes the Quality to go down. When the wine is White the Quality increases by 0.0834.
set.seed(5859)
folds2 <- vfold_cv(wine_test, v = 5)
set.seed(647)
wine_fit_rs3 <- wine_wflow2 %>%
fit_resamples(folds2)
collect_metrics(wine_fit_rs3)
After using the V-fold cross validation on the TESTING data the RMSE is 1.0187 and R-squared is 0.002, which is very similar as the RMSE and R-squared on the TESTING data.
The results from these statistical methods suggest that both models can predict quality of wine similarly and that adding Type as a predictor did not make a difference.
The possible limitations of this conclusion is that I couldn’t use some of the variables because they could have been correlated, for example variables with that contained acidity are correlated with pH. Also this model can be used only for wine from Portugal and nowhere else.
A potential future work question would be now that we know which variables can affect quality I am curious which attributes can affect sales of the wine.