Data analysis

Models

I will be using linear regression on two models with the explanatory variables I decided to predict quality of the wine.

Model 1: \(Quality = \beta_0 + \beta_1Alcohol + \beta_2ResidualSugar + \beta_3pH + \epsilon\)

Model 2: \(Quality = \beta_0 + \beta_1Alcohol + \beta_2ResidualSugar + \beta_3pH + \beta_4Type + \epsilon\)

Preliminary exploratory data analysis

The distribution of the response variable is uniform, so I will have to convert it to a normal distribution.

ggplot(wine, aes(x= quality)) +
  geom_histogram()

The following code is converting a uniform distribution to a normal distribution.

#Uniform to normal distribution

set.seed(58493)

wine_uniform <- runif(6497, 0, 10) #create 6497 uniform numbers

wine_uniform_df <- data.frame(
  quality_normal = unif2norm(wine_uniform)  # convert uniform to normal
)

wine2 <- cbind(wine, wine_uniform_df) # combined normal numbers to the wine dataset

ggplot(wine2, aes(x = quality_normal))+
  geom_histogram()

Fitting the model

First I will be splitting the data into training and testing.

Training is 80% and testing is 20%

set.seed(5748)

wine_split <- initial_split(wine2, prop = 0.80)

wine_train <- training(wine_split)

wine_test  <- testing(wine_split)

Model 1

This is the recipe and workflow of the first model on the TRAINING data.

wine_mod <- linear_reg() %>%
  set_engine("lm")

wine_recipe <- recipe(quality_normal ~ .,data= wine2) %>% 
  step_rm(fixed.acidity, volatile.acidity, citric.acid, chlorides,free.sulfur.dioxide, total.sulfur.dioxide, density,sulphates, type, quality)

wine_wflow <- workflow() %>%
  add_model(wine_mod) %>%
  add_recipe(wine_recipe)

wine_fit <- wine_wflow %>%
  fit(data = wine_train)
tidy(wine_fit)

After fitting Model 1 the fitted model equation is:

\(\widehat{Quality} = 0.047 - 0.0002ResidualSugar - 0.036pH + 0.007Alcohol\)

In this model the ResidualSugar and pH increase it causes the Quality to go down while Alcohol causes Quality to go up.

Model 2

This is the recipe and workflow of the second model on the TRAINING data.

wine_recipe2 <- recipe(quality_normal ~ .,data= wine2) %>% 
  step_rm(fixed.acidity, volatile.acidity, citric.acid, chlorides,free.sulfur.dioxide, total.sulfur.dioxide, density,sulphates,quality)%>%
   step_dummy(all_nominal())

wine_wflow2 <- workflow() %>%
  add_model(wine_mod) %>%
  add_recipe(wine_recipe2)

wine_fit2 <- wine_wflow2 %>%
  fit(data = wine_train)

tidy(wine_fit2)

After fitting Model 2 the fitted model equation is:

\(\widehat{Quality} = 0.0532 -0.0001ResidualSugar -0.038pH + 0.007Alcohol -0.002TypeWhite\)

In this model is similar to the first model as ResidualSugar and pH increase it causes the Quality to go down while Alcohol causes Quality to go up. For Type when the wine is white the Quality decreases by 0.002

V-fold cross validation

Now after after getting the fitted models I will be using the V-fold cross validation to assess which model is the winning model.

#v-fold
set.seed(5849)
folds <- vfold_cv(wine_train, v = 5)

set.seed(8594)
wine_fit_rs <- wine_wflow %>%
  fit_resamples(folds)

collect_metrics(wine_fit_rs)

For the first fitted model the RMSE is 1.014 and the R-squared is 0.0012.

set.seed(5839)
wine_fit_rs2 <- wine_wflow2 %>%
  fit_resamples(folds)

collect_metrics(wine_fit_rs2)

For the second fitted model the RMSE is 1.015 and the R-squared is 0.0014.

After using the V-fold cross validation it shows that the both fitted models have almost the same RMSE and R-squared. This shows that both model equally can predict the quality of wine. In the end I chose the second model as the winning model for the TESTING data.

Testing data

set.seed(38829)
wine_fit3 <- wine_wflow2 %>%
  fit(data = wine_test)
tidy(wine_fit3)

After fitting the second model on the TESTING data the fitted equation is:

\(\widehat{Quality} = -1.0933 + 0.004ResidualSugar +0.4106pH -0.0316Alcohol + 0.0834TypeWhite\)

The fitted equation on the TESTING data shows that as ResidualSugar and pH increases it causes the Quality to go up, while when Alcohol increases it causes the Quality to go down. When the wine is White the Quality increases by 0.0834.

set.seed(5859)
folds2 <- vfold_cv(wine_test, v = 5)

set.seed(647)
wine_fit_rs3 <- wine_wflow2 %>%
  fit_resamples(folds2)

collect_metrics(wine_fit_rs3)

After using the V-fold cross validation on the TESTING data the RMSE is 1.0187 and R-squared is 0.002, which is very similar as the RMSE and R-squared on the TESTING data.

Project Report

Steve Lapa

Introduction

Data