Claire Zhang, Michael He, Aashna Banerjee
We chose our three variables through an Exploratory Data Analysis (EDA). In retrospect, it probably would have been a more informed decision to build our prediction algorithms (k-Nearest Neighbors (kNN), Linear Discriminant Analysis (LDA), and a Decision Tree) first. We also likely could have picked variables better suited to predicting the more extreme wines, the 10 best and the 10 worst, but unfortunately our variables lean toward the center of the quality scale. Based on our graphs, we decided to use residual sugar, alcohol, and volatile acidity to predict the 10 best wines and the 10 worst wines.
Residual Sugar
We chose to investigate residual sugar because we saw noticeable differences between red and white wines: residual sugar had a much wider distribution in the white wines, whereas in the red wines it was concentrated at a lower level. We also chose it because lower quality wines mostly had wider residual sugar ranges than higher quality wines, which tend to have much lower sugar content.
Alcohol
Our decision to use alcohol percentage mainly rested on the fact that many wines with quality around 7-8 had higher alcohol percentages than wines with quality around 3-4. Although the boxplots certainly differed by wine type (red or white), both red and white wines generally had lower alcohol percentages when they were low quality and higher percentages when they were high quality. We also intuitively expected that bad wines would have an abnormally high alcohol content, so at the very least, picking alcohol as a variable would help us identify the bad wines in our dataset.
Volatile Acidity
We decided to use volatile acidity to help predict quality by wine type: white wines tend to have lower volatile acidity, whereas red wines tend to have higher volatile acidity. As quality increases, volatile acidity tends to decrease. We thought this would be especially useful for identifying good wines because it draws a clearer distinction between a good wine and a bad wine.
We decided to treat quality as a qualitative/categorical variable because that is more conducive to Linear Discriminant Analysis (LDA), and because quality already takes a small set of integer values (3-9), each of which can naturally be treated as a category.
Below is our exploratory data analysis.
Volatile Acidity vs. Quality
# Exploratory Data Analysis (EDA)
# Volatile Acidity vs Quality
full_wine |>
group_by(type) |>
ggplot() +
geom_boxplot(aes(`volatile acidity`, factor(type))) + facet_wrap(~quality) +
ylab("Type of Wine") + xlab("Volatile Acidity (g/L)") +
ggtitle("Volatile Acidity vs Quality")
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Residual Sugar vs. Quality
# Residual Sugar vs. Quality
full_wine |>
group_by(type) |>
ggplot() +
geom_boxplot(aes(`residual sugar`, factor(type))) + facet_wrap(~quality)+
ylab("Type of Wine") + xlab("Residual Sugar (g/L)") +
ggtitle("Residual Sugar vs Quality")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
Alcohol Content vs. Quality
# Alcohol Content vs. Quality
full_wine |>
group_by(type) |>
ggplot() +
geom_boxplot(aes(alcohol, factor(type))) + facet_wrap(~quality)+
ylab("Type of Wine") + xlab("Alcohol Content (%)") +
ggtitle("Alcohol Content vs Quality")
Scatterplot of Alcohol Content and Volatile Acidity, Faceted by Quality
# Alcohol Content and Volatile Acidity
full_wine |>
group_by(type) |>
ggplot() +
geom_point(aes(x = `volatile acidity`, y = alcohol, color = type)) + facet_wrap(~quality)+
ylab("Alcohol Content (%)") + xlab("Volatile Acidity (g/L)") +
ggtitle("Scatter plot of Alcohol Content and Volatile Acidity, Faceted by Quality")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
Volatile Acidity vs Residual Sugar
full_wine |>
group_by(type) |>
ggplot() +
geom_point(aes(x = `volatile acidity`, y = `residual sugar`, color = type)) + facet_wrap(~quality)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
We started off by normalizing (scaling) the chosen variables in our dataset. For each variable (alcohol percentage, residual sugar, and volatile acidity) we subtracted the minimum and divided by the range, so the values fall between 0 and 1. The purpose was to put the variables on the same scale for kNN so that no single variable dominated the distances between points. Scaling is less of an issue for LDA and the Decision Tree; kNN, however, classifies each observation by the distances to its nearest neighbors, so if the variables are on very different scales, the distances give a distorted picture of how close observations really are.
# Normalizing the Dataset (just for KNN)
full_wine_scaled <- full_wine |>
na.omit() |>
mutate(scaled_alcohol =
(alcohol - min(alcohol))/(max(alcohol) - min(alcohol))) |>
mutate(scaled_residual_sugar =
(`residual sugar` - min(`residual sugar`, na.rm=TRUE))/(max(`residual sugar`, na.rm=TRUE)-min(`residual sugar`, na.rm=TRUE))) |>
mutate(scaled_volatile_acidity =
(`volatile acidity`- min(`volatile acidity`, na.rm=TRUE))/(max(`volatile acidity`, na.rm=TRUE)- min(`volatile acidity`, na.rm=TRUE)))
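# These appear to min-max scale a single example wine (alcohol 10%, residual
# sugar 30 g/L, volatile acidity 1.0 g/L) onto the same 0-1 scale as the
# variables above.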
alcohol_input <- (10 - min(full_wine$alcohol))/(max(full_wine$alcohol) - min(full_wine$alcohol))
residual_sugar_input <- (30 - min(full_wine$`residual sugar`, na.rm=TRUE))/(max(full_wine$`residual sugar`, na.rm=TRUE)-
min(full_wine$`residual sugar`, na.rm=TRUE))
volatile_acidity_input <- (1.0 - min(full_wine$`volatile acidity`, na.rm=TRUE))/(max(full_wine$`volatile acidity`, na.rm=TRUE)-
min(full_wine$`volatile acidity`, na.rm=TRUE))
We categorized quality as good, bad, or average based on the numeric quality score. Quality in the original data set spans 3-9, with 3 being the lowest quality wine and 9 being the highest. We therefore set 3-4 as a 'bad' quality wine, 5-6 as an 'average' quality wine, and 7-9 as a 'good' quality wine.
full_wine_scaled_final <- full_wine_scaled |>
mutate(quality = case_when(
quality == 3 | quality == 4 ~ "bad",
quality == 5 | quality == 6 ~ "average",
quality == 7 | quality == 8 | quality == 9 ~ "good"
)) |>
mutate(type = case_when(
type == "white" ~ 0,
type == "red" ~ 1
))
full_wine_lda_scaled <- full_wine_scaled |>
mutate(quality = case_when(
quality == 3 | quality == 4 ~ "bad",
quality == 5 | quality == 6 ~ "average",
quality == 7 | quality == 8 | quality == 9 ~ "good"
))
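The recoded classes are heavily imbalanced, which matters for the priors we set later. As a quick sanity check, the class balance can be counted directly; this is a minimal sketch, assuming the tidyverse is already loaded.
# Sketch: class balance of the recoded quality variable
full_wine_lda_scaled |>
  count(quality) |>
  mutate(prop = n / sum(n))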
We measured the accuracy of LDA and kNN, and briefly looked at the Decision Tree. We found that LDA and kNN are decent at predicting good wines and bad wines, but kNN's accuracy came almost entirely from predicting average wines, which was not useful for us. The tree-based method was also not useful for our purposes because most of the data was concentrated in a single node that could not be split further to distinguish wine quality. This is likely because our variables are only weakly related to what distinguishes good and bad wines.
Originally, though, we started off with kNN. We first split the original data set in half into training data and testing data, trained our algorithm on the training data, and then evaluated it on the test data. We then passed the three scaled variables (along with wine type) to the knn function and extracted the predicted quality. We found the optimal k to be 26; however, even at its highest accuracy the model was not helpful, since the roughly 70% of the test set it classified correctly was mainly the 'average' wines. As Alex says, accuracy means nothing without context! So in this case, even though our algorithm seemed accurate, a closer look showed it was not accurate in the way we were expecting, or in a way that seemed relevant for the assignment.
training_rows <- sample(1:nrow(full_wine_scaled_final),
size = nrow(full_wine_scaled_final)/2)
train_data <- full_wine_scaled_final[training_rows, ]
test_data <- full_wine_scaled_final[-training_rows, ] |>
na.omit()
all_possible_wines <- expand_grid(scaled_alcohol = seq(from = 0, to = 1, by = .05),
scaled_volatile_acidity = seq(from = 0, to = 1, by = .05),
scaled_residual_sugar = seq(from = 0, to = 1, by = .05))
knn1 <- knn(train = train_data |>
na.omit() |>
dplyr::select(type, scaled_alcohol, scaled_residual_sugar, scaled_volatile_acidity),
test = test_data |>
na.omit() |>
dplyr::select(type, scaled_alcohol, scaled_residual_sugar, scaled_volatile_acidity),
k = 26,
cl = train_data$quality)
test_data_predictions <- test_data |>
na.omit() |>
mutate(knn_prediction = knn1)
accuracy_KNN <- mean(test_data_predictions$quality == test_data_predictions$knn_prediction) * 100
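To see where that accuracy actually comes from, the predictions can be tabulated against the true classes, and a simple loop over candidate k values shows how an "optimal" k such as 26 could be identified. This is an illustrative sketch only, assuming the objects defined above and the class and dplyr packages.
# Sketch: confusion matrix of the kNN predictions vs. the actual quality
table(predicted = test_data_predictions$knn_prediction,
      actual = test_data_predictions$quality)
# Sketch: test-set accuracy across a range of candidate k values
train_x <- train_data |>
  dplyr::select(type, scaled_alcohol, scaled_residual_sugar, scaled_volatile_acidity)
test_x <- test_data |>
  dplyr::select(type, scaled_alcohol, scaled_residual_sugar, scaled_volatile_acidity)
k_accuracy <- sapply(1:40, function(k) {
  mean(knn(train = train_x, test = test_x, cl = train_data$quality, k = k) ==
         test_data$quality)
})
which.max(k_accuracy)  # the k with the highest overall accuracy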
Because LDA is structured to work with a categorical dependent variable, and because we treated quality as qualitative rather than quantitative, we used LDA to predict both the good wines and the bad wines. We plugged the three variables (unscaled!) into our lda function. We fit three separate LDA models to measure how accurate the predictions would be on the test data. The original model guessed the 'good' wines fairly well but guessed the bad wines very poorly. We suspect the overall predictions are simply not great for identifying the best and worst wines because, at root, the variables we chose are subpar; still, we pushed through, because that was the only option we had. When we measured the accuracy of LDA, it was honestly not as good as kNN, but when we looked at the dataset and the mutated prediction column, we realized it was much better than kNN at actually picking out good and bad wines. For the LDA model trained to predict bad wines, we preset the prior probabilities to account for class imbalance: average wines made up roughly 77% of the data set, while good and bad wines made up much smaller portions, so the models otherwise overpredicted average wines.
# Writing the algorithm using the train data and test data from the
# original full wine dataset
# Original LDA Model using full wine dataset
lda1 <- lda(quality ~ type + `residual sugar` + alcohol + `volatile acidity`,
data = train_data)
lda_predictions <- predict(lda1, newdata = test_data)$class
lda_predict <- test_data |>
mutate(lda_pred = lda_predictions)
accuracy_LDA <- mean(lda_predict$quality == lda_predict$lda_pred) * 100
#LDA Model, Tweaked priors to become better at predicting the bad wines
lda1_bad_train <- lda(quality ~ type + `residual sugar` + alcohol + `volatile acidity`,
data = train_data,
prior = c(.774,.191,.035))
lda_predictions_bad_train <- predict(lda1_bad_train, newdata = test_data)$class
lda_predict_bad_train <- test_data |>
mutate(lda_pred_bad_train = lda_predictions_bad_train)
accuracy_LDA_bad_train <- mean(lda_predict_bad_train$quality == lda_predict_bad_train$lda_pred_bad_train) * 100
#LDA Model, Tweaked priors to become better at predicting the good wines
lda1_good_train <- lda(quality ~ type + `residual sugar` + alcohol + `volatile acidity`,
data = train_data,
prior = c(.77, .2, .03))
lda_predictions_good_train <- predict(lda1_good_train, newdata = test_data)$class
lda_predict_good_train <- test_data |>
mutate(lda_pred_good_train = lda_predictions_good_train)
accuracy_LDA_good_train <- mean(lda_predict_good_train$quality == lda_predict_good_train$lda_pred_good_train) * 100
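The priors above are hard-coded; a minimal sketch of how the observed class proportions in the training data could be checked (the source of the roughly 77% average figure), together with a per-class look at the baseline LDA predictions, is below. It assumes the objects defined above; note that priors passed to lda() are interpreted in the order of the factor levels, here average, bad, good.
# Sketch: observed class proportions in the training data, a starting point
# for choosing the prior argument to lda()
prop.table(table(train_data$quality))
# Sketch: per-class confusion matrix for the baseline LDA model
table(predicted = lda_predict$lda_pred, actual = lda_predict$quality)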
We became curious about Decision Trees and tree-based methods because of the recent examples in class, so we tried one with the three variables we chose. We quickly realized that we are not very adept at tree-based methods, especially once we saw that most of our observations were concentrated in a single node, so we decided to scrap the idea. However, it was still an important part of our process, so we include it here.
#Moving to Decision Trees
# With training data
tree1 <- rpart(factor(quality) ~ `volatile acidity` + alcohol +
`residual sugar`, data = train_data, cp = 0.005)
rpart.plot(tree1)
# With full scale wine data
tree1_full <- rpart(factor(quality) ~ `volatile acidity` + alcohol +
`residual sugar`, data = full_wine_scaled_final, cp = 0.005)
rpart.plot(tree1_full)
# With the tree trained on the train data (tree1), predicting on the test data
tree_predictions <- predict(tree1, newdata = test_data, type = "class")
tree_predict <- test_data |>
  mutate(tree_pred = tree_predictions)
# With the tree trained on the full scaled wine data (tree1_full)
tree_predictions_full <- predict(tree1_full, newdata = full_wine_scaled_final, type = "class")
tree_predict_full <- full_wine_scaled_final |>
  mutate(tree_pred = tree_predictions_full)
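For comparison with kNN and LDA, the tree's test-set accuracy can be computed the same way; a minimal sketch, assuming the objects above:
# Sketch: test-set accuracy of the decision tree
mean(predict(tree1, newdata = test_data, type = "class") == test_data$quality) * 100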
train_data_final <- train_data |>
mutate(quality = case_when(
quality == 3 | quality == 4 ~ "bad",
quality == 5 | quality == 6 ~ "average",
quality == 7 | quality == 8 | quality == 9 ~ "good"
))
In this chunk of code, we get closer to our final predictions: we start predicting the good wines and the bad wines in the new wine data. We realized that the good-wine predictions were more accurate when we used the overall LDA rather than the one tailored to good wines with preset prior probabilities, while the bad-wine predictions were (slightly) more accurate when we used the priors tailored to bad wines.
new_wine <- read_csv('/Users/clairezhang/Downloads/STAT 0218/Datasets/Claire_Michael_Aashna_wine_data.csv')
## Rows: 500 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): type
## dbl (4): wine_id, alcohol, residual sugar, volatile acidity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Predicting the GOOD wines in our new wine data set, LDA algorithm
lda1_good_wines <- lda(quality ~ type + `residual sugar` + alcohol + `volatile acidity`,
data = full_wine_lda_scaled,
prior = c(.77, .04, .19))
# Note: the fitted lda object does not store posterior probabilities; they come
# from predict() below (lda_predictions_good_wines$posterior).
lda_predictions_good_wines <- predict(lda1_good_wines, newdata = new_wine)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
lda_predict_good_wines <- new_wine |>
mutate(lda_pred_good_wines = lda_predictions_good_wines$class)
# Predicting GOOD wines
good_wines <- new_wine |>
mutate(lda_pred_good_wines = lda_predictions_good_wines$class) |>
filter(lda_pred_good_wines == 'good') |>
head(10)
new_wine_predictions_good <- new_wine |>
mutate(
lda_pred_good_wines = lda_predictions_good_wines$class,
prob_average = lda_predictions_good_wines$posterior[, "average"],
prob_bad = lda_predictions_good_wines$posterior[, "bad"],
prob_good = lda_predictions_good_wines$posterior[, "good"],
)
top_good_wines <- new_wine_predictions_good |>
arrange(desc(prob_good)) |>
head(10)
top_good_wines
## # A tibble: 10 × 9
## wine_id type alcohol `residual sugar` `volatile acidity` lda_pred_good_wines
## <dbl> <chr> <dbl> <dbl> <dbl> <fct>
## 1 352 white 13.8 4.6 0.27 good
## 2 178 white 13.3 1.6 0.3 good
## 3 40 white 13.1 4.3 0.28 good
## 4 362 white 12.9 1.7 0.18 good
## 5 192 white 13.1 4.9 0.3 good
## 6 14 white 13.3 2 0.33 good
## 7 328 white 12.8 2 0.19 good
## 8 348 white 13.4 6.4 0.45 good
## 9 460 white 13 3.4 0.28 good
## 10 22 white 12.6 9 0.24 good
## # ℹ 3 more variables: prob_average <dbl>, prob_bad <dbl>, prob_good <dbl>
This chunk of code predicts the bad wines using the LDA model with priors tweaked toward the bad wines.
# To predict BAD wines
lda1_bad_wines <- lda(quality ~ `residual sugar` + `alcohol` + `volatile acidity`,
data = full_wine_scaled_final,
prior = c(.77,.19,.04))
lda_predictions_bad_wines <- predict(lda1_bad_wines, newdata = new_wine)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
lda_predict_bad_wines <- new_wine |>
mutate(lda_pred_bad_wines = lda_predictions_bad_wines$class)
# Predicting BAD wines
lda_predict_bad_wines <- new_wine |>
mutate(lda_pred_bad_wines = lda_predictions_bad_wines$class) |>
filter(lda_pred_bad_wines == 'bad') |>
na.omit() |>
head(10)
new_wine_predictions_bad <- new_wine |>
mutate(
lda_pred_bad_wines = lda_predictions_bad_wines$class,
prob_average = lda_predictions_bad_wines$posterior[, "average"],
prob_bad = lda_predictions_bad_wines$posterior[, "bad"],
prob_good = lda_predictions_bad_wines$posterior[, "good"],
)
top_bad_wines <- new_wine_predictions_bad |>
arrange(-prob_bad) |>
head(10)
top_bad_wines
## # A tibble: 10 × 9
## wine_id type alcohol `residual sugar` `volatile acidity` lda_pred_bad_wines
## <dbl> <chr> <dbl> <dbl> <dbl> <fct>
## 1 274 red 10.1 2.8 1.02 bad
## 2 384 red 9.9 6 1.03 bad
## 3 191 red 9.4 2.4 0.88 bad
## 4 232 red 11.9 1.8 0.96 bad
## 5 288 red 9.2 3.4 0.81 bad
## 6 324 red 9.4 2.6 0.795 bad
## 7 23 red 11 2.4 0.855 bad
## 8 365 red 9.5 1.9 0.775 bad
## 9 138 red 9.6 2.2 0.78 bad
## 10 177 red 10.7 2.1 0.825 bad
## # ℹ 3 more variables: prob_average <dbl>, prob_bad <dbl>, prob_good <dbl>