Explanation: The code imports the packages needed for the homework and analysis, although the fastDummies package is never actually used. We also load our data, which in this case is a wine dataset hosted on GitHub.
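The setup chunk itself is not reproduced here, but a minimal sketch consistent with the description above might look like the following; the GitHub URL is a placeholder, and the exact package list is an assumption based on the functions used later.

```r
# Sketch of the setup chunk; the URL below is a placeholder, not the
# original data source.
library(tidyverse)   # wrangling (dplyr), strings (stringr), plotting (ggplot2)
library(caret)       # data partitioning, resampling, model training
library(fastDummies) # loaded in the original but never used

# Read the wine data from GitHub (placeholder path).
wine <- readRDS(gzcon(url("https://github.com/<user>/<repo>/raw/main/wine.rds")))
```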
# Feature Engineering
We begin by engineering a number of features.
- Create a total of 10 features (including points).
- Remove all rows with a missing value.
- Ensure that log(price) and the engineered features are the only columns that remain in the wino dataframe.
Explanation: We begin by mutating the data set to create a new column, lprice, the logarithm of the price column. Next, we use group_by() to organize the data by taster_name so we can compute per-taster statistics: each taster's average score (the mean of points), each taster's deviation from that average, and the average price of the wines each taster reviewed. The last of these may help show whether certain tasters tend to review expensive or inexpensive wines. After these computations, we call ungroup() to remove the grouping, since the subsequent transformations should not be grouped by taster_name.

Next, I investigated whether specific terms like "fruity" or "earthy" in the wine descriptions influenced log price. To do this, I applied ifelse() with str_detect(); I initially attempted grepl() but had trouble using it inside ifelse(). I also created a price_pp (price per point) column to estimate the cost per rating point. To classify wines as expensive or inexpensive, I used the median price as a threshold: wines above the median were labeled expensive, while those below were considered inexpensive. Finally, I calculated the review length by applying str_count() to the description column, counting the number of words in each review, to assess its potential effect on log price. A sketch of this pipeline appears below.

Before moving on to training and testing, we need to remove all rows containing NA values and drop any columns that are not the features we generated. Dropping the source columns helps prevent multicollinearity, which could arise because the newly created columns are closely related to the columns they were derived from. Another reason for removing certain columns is that some contain a large number of categorical levels, which can significantly slow down training; I initially overlooked this step and ended up waiting over an hour for the model to run. The country column had to be removed because, after filtering out NA values, it contained only a single remaining category.
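The original chunk is not shown, but a minimal sketch of the pipeline described above might look like this. The detection patterns, the strict inequalities around the median, and the final column list (taken from the summary output below) are assumptions, not the original code.

```r
# Sketch of the feature engineering described above; thresholds, patterns,
# and the final column selection are assumptions.
wino <- wine %>%
  mutate(lprice = log(price)) %>%                     # log of price
  group_by(taster_name) %>%                           # per-taster statistics
  mutate(
    taster_avg_score = mean(points, na.rm = TRUE),    # taster's average score
    taster_score_dev = points - taster_avg_score,     # deviation from that average
    taster_avg_price = mean(price, na.rm = TRUE)      # average price reviewed
  ) %>%
  ungroup() %>%                                       # later steps are ungrouped
  mutate(
    fruity_dummy  = ifelse(str_detect(description, "fruity"), 1, 0),
    earthy_dummy  = ifelse(str_detect(description, "earthy"), 1, 0),
    price_pp      = price / points,                   # price per rating point
    expensive     = ifelse(price > median(price, na.rm = TRUE), 1, 0),
    not_expensive = ifelse(price < median(price, na.rm = TRUE), 1, 0),
    review_count  = str_count(description, "\\w+")    # words in the review
  ) %>%
  drop_na() %>%                                       # remove rows with NAs
  select(lprice, taster_avg_score, taster_score_dev, taster_avg_price,
         fruity_dummy, earthy_dummy, price_pp, expensive, not_expensive,
         review_count)                                # keep only engineered columns
```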
# Caret

We now use a train/test split to evaluate the features.
- Use the Caret library to partition the wino dataframe into an 80/20 split.
- Run a linear regression with bootstrap resampling.
- Report RMSE on the test partition of the data.
```r
train_w <- createDataPartition(wino$lprice, p = 0.8, list = FALSE)
wino_tr <- wino[train_w, ]
wino_te <- wino[-train_w, ]
summary(wino_tr)
```
```
     lprice      taster_avg_score taster_score_dev  taster_avg_price
 Min.   :1.609   Min.   :86.68    Min.   :-9.2322   Min.   :23.29
 1st Qu.:3.219   1st Qu.:89.11    1st Qu.:-1.7138   1st Qu.:33.94
 Median :3.611   Median :89.11    Median : 0.7678   Median :34.66
 Mean   :3.592   Mean   :89.25    Mean   : 0.3324   Mean   :38.39
 3rd Qu.:3.912   3rd Qu.:89.23    3rd Qu.: 2.1922   3rd Qu.:47.28
 Max.   :7.607   Max.   :90.09    Max.   :10.8906   Max.   :47.28
  fruity_dummy      earthy_dummy       price_pp          expensive
 Min.   :0.00000   Min.   :0.0000   Min.   : 0.05882   Min.   :0.0000
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 0.28409   1st Qu.:0.0000
 Median :0.00000   Median :0.0000   Median : 0.41304   Median :1.0000
 Mean   :0.03452   Mean   :0.0991   Mean   : 0.46774   Mean   :0.7318
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.: 0.56180   3rd Qu.:1.0000
 Max.   :1.00000   Max.   :1.0000   Max.   :22.12088   Max.   :1.0000
 not_expensive     review_count
 Min.   :0.0000   Min.   :  7.00
 1st Qu.:0.0000   1st Qu.: 35.00
 Median :0.0000   Median : 43.00
 Mean   :0.2245   Mean   : 43.14
 3rd Qu.:0.0000   3rd Qu.: 51.00
 Max.   :1.0000   Max.   :104.00
```
Explanation: In this chunk, we first create a training partition with createDataPartition(), which splits on the lprice column of wino and assigns 80% of the rows to training. We set list = FALSE so that it returns row indices instead of a list. We then use these indices to subset wino into two data sets: wino_tr for training, which contains the selected 80% of the data, and wino_te for testing, which holds the remaining 20%. Lastly, we call summary(wino_tr) to get a quick statistical overview of the training set.
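The chunk that fits the model is not shown; a minimal sketch consistent with the printed output below and the explanation that follows (the object name m1 comes from that explanation) would be:

```r
# Fit a linear regression with bootstrap resampling (caret's default of
# 25 reps); a sketch, not the original chunk.
m1 <- train(
  lprice ~ .,                                # all remaining columns as predictors
  data = wino_tr,
  method = "lm",                             # ordinary linear regression
  trControl = trainControl(method = "boot")  # bootstrap resampling
)
m1  # printing the model shows the resampling results
```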
```
Linear Regression 

17205 samples
    9 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 17205, 17205, 17205, 17205, 17205, 17205, ... 
Resampling results:

  RMSE       Rsquared   MAE     
  0.2406028  0.8237755  0.132605

Tuning parameter 'intercept' was held constant at a value of TRUE
```
Explanation: A linear regression model is trained using the train() function, with lprice as the dependent variable and all other columns in wino_tr as predictors. The argument method = "lm" specifies that a linear model is being fitted. The trControl parameter is set to trainControl(method = "boot"), which applies bootstrapping to resample the data. After training, printing the fitted model m1 shows several performance metrics (the per-resample values are also available via m1$resample), but in this case we are mainly interested in RMSE. RMSE measures the average magnitude of the model's prediction errors, with lower values indicating better performance. However, RMSE should not be relied upon exclusively, as it is sensitive to large errors because each residual is squared.
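The RMSE above is the bootstrap resampling estimate; the assignment also asks for RMSE on the held-out test partition. A minimal sketch of that step (not shown in the original) using caret's postResample():

```r
# Evaluate on the 20% test partition; postResample() reports RMSE,
# R-squared, and MAE for the held-out predictions.
preds <- predict(m1, newdata = wino_te)
postResample(pred = preds, obs = wino_te$lprice)
```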
# Variable selection
We now graph the importance of our 10 features.
```r
plot(varImp(m1, scale = TRUE))
```
Explanation: This variable importance plot shows the relative contribution of each predictor in the trained linear regression model. The price_pp (price per point) variable is by far the most important feature, indicating that it has the strongest relationship with lprice. Other features such as expensive, taster_score_dev, and not_expensive also contribute, but to a much lesser extent. It is worth noting how varImp() works for linear models: importance is based on the absolute value of each coefficient's t-statistic, and with scale = TRUE the scores are rescaled so that the largest equals 100. Because these importances are relative, price_pp would still top the plot even if every variable were a poor predictor of log price; it would simply be the strongest of the given set, not necessarily a strong predictor in absolute terms.