Wines of the PNW

Author

Your name here!

Published

January 27, 2025

Step Up Code:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
library(fastDummies)
wine <- readRDS(gzcon(url("https://github.com/cd-public/D505/raw/master/dat/wine.rds")))

Explanataion:The code imports the necessary packages for the homework and analysis. However, the fastDummies package is not utilized. Additionally, we load are data which in this case is data related to wine from github.

TODO: write your line-by-line explanation of the code here

Feature Engineering

We begin by engineering an number of features.

  1. Create a total of 10 features (including points).
  2. Remove all rows with a missing value.
  3. Ensure only log(price) and engineering features are the only columns that remain in the wino dataframe.
wino <- wine %>% 
  mutate(lprice=log(price)) %>% 
   group_by(taster_name) %>%
  mutate(
    taster_avg_score = mean(points, na.rm = TRUE),
    taster_score_dev = points - taster_avg_score,
    taster_avg_price = mean(price, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(fruity_dummy = as.integer(str_detect(tolower(description), "fruity")),
  earthy_dummy = as.integer(str_detect(tolower(description), "earth|earthy")),
  price_pp = price / points,
  expensive = ifelse(price > median(price, na.rm = TRUE), 1, 0),
  not_expensive = ifelse(price < median(price, na.rm = TRUE), 1, 0),
  review_count =  str_count(description, "\\S+")) %>% na.omit() %>% 
  select(-c(country,description,province,region_1,region_2,taster_twitter_handle,price,designation,points,taster_name,variety,winery,year,id,title))

Explanation:We begin by mutating the data set to create a new column, lprice, which represents the logarithm of the wine prices using the price column. Next, we use group_by to organize the data by taster_name, allowing us to compute statistics for each taster. Specifically, we calculate the average score by taking the mean of the points column. Additionally, we determine each taster’s deviation from the mean score. To explore potential patterns in wine preferences, we calculate the average price of wines tasted by each reviewer. This may helps show whether certain tasters review expensive or inexpensive wines. After these computations, we use the function ungroup() to remove the grouping, as the subsequent transformations should not be grouped by the taster_name column. After doing this, I investigated whether specific terms like “fruity” or “earthy” in the wine descriptions influenced log price. To do this, I applied ifelse with str_detect, though I initially attempted to use grepl but encountered issues implementing it within ifelse. Additionally, I created a price_pp(price per point) column to estimate the cost per rating point. To classify wines as expensive or inexpensive, I used the median price as a threshold—wines above the median were labeled expensive, while those below were considered inexpensive. Finally, I calculated the review length by applying str_count to the description column, counting the number of words in each review to assess its potential effect on log price. Before moving on to training and testing, we need to remove all rows containing NA values. Additionally, we must drop any columns that are not the features we generated. This helps prevent multicollinearity, which could arise if newly created columns are too closely related to their original sources. Another reason for removing certain columns is that some contain a large number of categorical variables, which can significantly slow down training. I initially overlooked this step and ended up waiting over an hour for the model to run. The country column had to be removed because, after filtering out NA values, it contained only a single remaining category. # Caret

We now use a train/test split to evaluate the features.

  1. Use the Caret library to partition the wino dataframe into an 80/20 split.
  2. Run a linear regression with bootstrap resampling.
  3. Report RMSE on the test partition of the data.
train_w <- createDataPartition(wino$lprice, p = 0.8, list = FALSE)
wino_tr <- wino[train_w, ]
wino_te <- wino[-train_w, ]
summary(wino_tr)
     lprice      taster_avg_score taster_score_dev  taster_avg_price
 Min.   :1.609   Min.   :86.68    Min.   :-9.2322   Min.   :23.29   
 1st Qu.:3.219   1st Qu.:89.11    1st Qu.:-1.7138   1st Qu.:33.94   
 Median :3.611   Median :89.11    Median : 0.7678   Median :34.66   
 Mean   :3.592   Mean   :89.25    Mean   : 0.3324   Mean   :38.39   
 3rd Qu.:3.912   3rd Qu.:89.23    3rd Qu.: 2.1922   3rd Qu.:47.28   
 Max.   :7.607   Max.   :90.09    Max.   :10.8906   Max.   :47.28   
  fruity_dummy      earthy_dummy       price_pp          expensive     
 Min.   :0.00000   Min.   :0.0000   Min.   : 0.05882   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 0.28409   1st Qu.:0.0000  
 Median :0.00000   Median :0.0000   Median : 0.41304   Median :1.0000  
 Mean   :0.03452   Mean   :0.0991   Mean   : 0.46774   Mean   :0.7318  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.: 0.56180   3rd Qu.:1.0000  
 Max.   :1.00000   Max.   :1.0000   Max.   :22.12088   Max.   :1.0000  
 not_expensive     review_count   
 Min.   :0.0000   Min.   :  7.00  
 1st Qu.:0.0000   1st Qu.: 35.00  
 Median :0.0000   Median : 43.00  
 Mean   :0.2245   Mean   : 43.14  
 3rd Qu.:0.0000   3rd Qu.: 51.00  
 Max.   :1.0000   Max.   :104.00  

Explanation: In this chunk of code, we start by creating a training partition using createDataPartition(), which splits the lprice column in wino, assigning 80% of the data to training. We set list = FALSE so that it returns row indices instead of a list. Next, we pass these indices to wino to create two separate data sets, wino_tr for training, which contains the selected 80% of the data, and wino_te for testing, which holds the remaining 20%. Lastly, we call summary(wino_tr) to get a quick statistical overview of the training data set.

m1 <- train(lprice ~ .,
  data = wino_tr,
  method = "lm",
  trControl = trainControl(method = "boot")
)
m1
Linear Regression 

17205 samples
    9 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 17205, 17205, 17205, 17205, 17205, 17205, ... 
Resampling results:

  RMSE       Rsquared   MAE     
  0.2406028  0.8237755  0.132605

Tuning parameter 'intercept' was held constant at a value of TRUE
print(m1$resample)
        RMSE  Rsquared       MAE   Resample
1  0.3468654 0.7119232 0.1045644 Resample01
2  0.2085447 0.8535962 0.1672957 Resample02
3  0.3722388 0.6952001 0.1016277 Resample03
4  0.2124066 0.8529952 0.1703658 Resample04
5  0.1691926 0.9051597 0.1373941 Resample05
6  0.1747833 0.9011792 0.1379088 Resample06
7  0.1758040 0.8974556 0.1323562 Resample07
8  0.1697248 0.9055688 0.1370592 Resample08
9  0.1784617 0.8958652 0.1338355 Resample09
10 0.3188153 0.7299461 0.1047425 Resample10
11 0.2076920 0.8571163 0.1666852 Resample11
12 0.3090816 0.7331805 0.1091555 Resample12
13 0.1927169 0.8773380 0.1541579 Resample13
14 0.1941708 0.8776911 0.1554432 Resample14
15 0.1734001 0.8980188 0.1334807 Resample15
16 0.1749378 0.9011476 0.1413828 Resample16
17 0.3068502 0.7359762 0.1087465 Resample17
18 0.3506931 0.7026095 0.1031782 Resample18
19 0.1973362 0.8708191 0.1586928 Resample19
20 0.1955634 0.8736681 0.1558174 Resample20
21 0.1769708 0.8928222 0.1331977 Resample21
22 0.3440834 0.7128434 0.1043395 Resample22
23 0.3491269 0.7111590 0.1040092 Resample23
24 0.1948734 0.8759354 0.1550228 Resample24
25 0.3207370 0.7251727 0.1046649 Resample25

Explanation:A linear regression model is trained using the train() function, with lprice as the dependent variable and all other columns in wino_tr as predictors. The argument method = “lm” specifies that a linear model is being fitted. The trControl parameter is set to trainControl(method = “boot”), which applies bootstrapping to re sample the data. After training, we print m1$resample which shows us many important statistical performance metrics but in this case were are interested mainly in RMSE. RMSE measures the average magnitude of the model’s prediction errors, with lower values indicating better model performance. However, RMSE should not be solely relied upon, as it is sensitive to large errors due to squaring each residual.

Variable selection

We now graph the importance of your 10 features.

plot(varImp(m1, scale = TRUE))

Explanation: This variable importance plot shows the relative contribution of each predictor in the trained linear regression model. The price_pp (price per point) variable is by far the most important feature, indicating that it has the strongest relationship with lprice. Other features such as expensive, taster_score_dev, and not_expensive also contribute to the model but to a much lesser extent. It is important to note that due to how I believe the varImp function operates, even if all of these variables were poor predictors of log price, price per point might still appear highly important. This would occur not because it is a strong predictor, but simply because it has the highest relative importance among the given variables.