The following system recommends hypothetical lunch menu items for a group of students.
library(tidyverse)
library(kableExtra)
set.seed(123) #Specify seed value to keep results reproducible
ratings <-
read_csv("ratings.csv") %>% #Import raw data from CSV file
gather(item, rating, -user) %>% # Convert layout from wide to tall
filter(!is.na(rating)) %>% #Remove missing values
sample_frac(1, replace = FALSE) %>% #Randomize row order
mutate(data_cat = if_else(row_number() < n() * 0.8, "Training", "Test", missing = NULL)) #Label 60% of the data for training.
First, we use the ratings from the training data to calculate the global average for all user/item combinations. In this case, it’s 3.11.
training_avg <- ratings %>%
filter(data_cat == 'Training') %>%
summarise(tmean = mean(rating, na.rm = TRUE)) %>%
pull %>%
print()
[1] 3.113208
With the mean, we can begin to calculate the user and item biases by subtracting the global average from each user and item average.
We begin by calculating the average rating for each user. We then subtract the global average from each value to obtain the bias. This value gives us an indication of how harsh or generous each user is when rating the menu items relative to other users.
## Using your training data, calculate the raw average (mean) rating for every user-item combination.
user_avgs <- ratings %>%
filter(data_cat == 'Training') %>% # Select the training data
group_by(user) %>% #Group by user so R knows what elements to enter into the mean calculation
summarise(user_avg = mean(rating, na.rm = TRUE)) %>% #calculate the mean by user, ignoring missing values
mutate(user_bias = user_avg - training_avg) #calculate the bias
user_avgs %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
| user | user_avg | user_bias |
|---|---|---|
| Adam | 2.20 | -0.91 |
| Elina | 2.80 | -0.31 |
| Eric | 4.00 | 0.89 |
| John | 2.00 | -1.11 |
| Kevin | 2.80 | -0.31 |
| Lisa | 3.25 | 0.14 |
| Mia | 3.33 | 0.22 |
| Mike | 4.17 | 1.05 |
| Paul | 3.67 | 0.55 |
| Sachid | 3.50 | 0.39 |
| Scott | 3.00 | -0.11 |
We complete a similar operation but this time we calculate the average rating and bias by menu item. This gives us an indication of which items are more popular relative to the rest.
item_avgs <- ratings %>%
filter(data_cat == 'Training') %>%
group_by(item) %>%
summarise(item_avg = mean(rating, na.rm = TRUE)) %>%
mutate(item_bias = item_avg - training_avg)
item_avgs %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
| item | item_avg | item_bias |
|---|---|---|
| Hamburger | 2.78 | -0.34 |
| Pasta | 3.00 | -0.11 |
| Pizza | 4.00 | 0.89 |
| Salad | 3.38 | 0.26 |
| Sandwich | 2.71 | -0.40 |
| Soup | 3.29 | 0.17 |
| Tacos | 2.40 | -0.71 |
We’ve managed to calculate an average rating for all user/item combinations and bias values for each user and item. With these value we can calculate the baseline predictor for all combinations, even those that didn’t have a rating.
## From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
bl_pred_df <- crossing(item_avgs, user_avgs) %>% # Ggenerate a dataframe with all user/item combinations
mutate(bl_predictor = item_bias + user_bias + training_avg) %>% # Create baseline predictor column
mutate(bl_predictor = pmax(pmin(bl_predictor, 5), 1)) %>% # Clip values to between 1 and 5.
select(item, user, bl_predictor) # Remove unnecesary columns
bl_pred_df %>%
spread(item, bl_predictor) %>% #rearange dataframe into a standar user/item matrix
kable(digits = 2) %>% # limit decimals places
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) #formatting
| user | Hamburger | Pasta | Pizza | Salad | Sandwich | Soup | Tacos |
|---|---|---|---|---|---|---|---|
| Adam | 1.86 | 2.09 | 3.09 | 2.46 | 1.80 | 2.37 | 1.49 |
| Elina | 2.46 | 2.69 | 3.69 | 3.06 | 2.40 | 2.97 | 2.09 |
| Eric | 3.66 | 3.89 | 4.89 | 4.26 | 3.60 | 4.17 | 3.29 |
| John | 1.66 | 1.89 | 2.89 | 2.26 | 1.60 | 2.17 | 1.29 |
| Kevin | 2.46 | 2.69 | 3.69 | 3.06 | 2.40 | 2.97 | 2.09 |
| Lisa | 2.91 | 3.14 | 4.14 | 3.51 | 2.85 | 3.42 | 2.54 |
| Mia | 3.00 | 3.22 | 4.22 | 3.60 | 2.93 | 3.51 | 2.62 |
| Mike | 3.83 | 4.05 | 5.00 | 4.43 | 3.77 | 4.34 | 3.45 |
| Paul | 3.33 | 3.55 | 4.55 | 3.93 | 3.27 | 3.84 | 2.95 |
| Sachid | 3.16 | 3.39 | 4.39 | 3.76 | 3.10 | 3.67 | 2.79 |
| Scott | 2.66 | 2.89 | 3.89 | 3.26 | 2.60 | 3.17 | 2.29 |
Now that we have both average and baseline predictions, we can calculate RMSE values to compare the accuracy of the predictions on both our training and test sets.
rmse_calcs <- ratings %>%
left_join(bl_pred_df, by = c('user','item')) %>% # Add bias values to our initial data
mutate(sq_err_bl_pred = (rating - bl_predictor)**2) %>% # Calculate the squared error for our baseline predictor
mutate(sq_err_avg_pred = (rating - training_avg)**2) # Calculate the squared error for our average predictor
rmse_calcs %>%
kable(col.names = c("User","Item","Rating","Category","Baseline","Baseline sq. error","Avg. sq. error")) %>% # Rename columns
kable_styling(bootstrap_options = c("striped", "hover"),fixed_thead = T, full_width = F) # Formatting
| User | Item | Rating | Category | Baseline | Baseline sq. error | Avg. sq. error |
|---|---|---|---|---|---|---|
| Adam | Soup | 2 | Training | 2.372507 | 0.1387613 | 1.2392310 |
| John | Salad | 3 | Training | 2.261792 | 0.5449504 | 0.0128159 |
| Elina | Soup | 5 | Training | 2.972507 | 4.1107289 | 3.5599858 |
| Elina | Salad | 2 | Training | 3.061793 | 1.1274032 | 1.2392310 |
| Lisa | Sandwich | 4 | Training | 2.851078 | 1.3200214 | 0.7864009 |
| Paul | Hamburger | 2 | Training | 3.331237 | 1.7721917 | 1.2392310 |
| John | Pizza | 2 | Training | 2.886792 | 0.7864009 | 1.2392310 |
| Mike | Salad | 3 | Training | 4.428459 | 2.0404955 | 0.0128159 |
| Adam | Sandwich | 1 | Training | 1.801078 | 0.6417262 | 4.4656461 |
| Mike | Sandwich | 4 | Training | 3.767745 | 0.0539425 | 0.7864009 |
| Eric | Salad | 3 | Training | 4.261793 | 1.5921202 | 0.0128159 |
| Mia | Soup | 3 | Training | 3.505840 | 0.2558742 | 0.0128159 |
| Kevin | Pizza | 4 | Training | 3.686793 | 0.0980990 | 0.7864009 |
| Paul | Pizza | 4 | Training | 4.553459 | 0.3063170 | 0.7864009 |
| Mike | Hamburger | 4 | Training | 3.831237 | 0.0284810 | 0.7864009 |
| Kevin | Pasta | 2 | Training | 2.686793 | 0.4716839 | 1.2392310 |
| Adam | Tacos | 1 | Training | 1.486792 | 0.2369669 | 4.4656461 |
| Paul | Sandwich | 5 | Training | 3.267745 | 3.0007080 | 3.5599858 |
| Mia | Tacos | 2 | Training | 2.620126 | 0.3845560 | 1.2392310 |
| Elina | Pasta | 2 | Training | 2.686793 | 0.4716839 | 1.2392310 |
| Sachid | Pasta | 3 | Training | 3.386792 | 0.1496084 | 0.0128159 |
| Sachid | Pizza | 5 | Training | 4.386793 | 0.3760235 | 3.5599858 |
| Lisa | Pizza | 5 | Training | 4.136793 | 0.7451273 | 3.5599858 |
| Mike | Pasta | 5 | Training | 4.053459 | 0.8959396 | 3.5599858 |
| Eric | Pasta | 5 | Training | 3.886792 | 1.2392310 | 3.5599858 |
| Adam | Pizza | 5 | Training | 3.086793 | 3.6603631 | 3.5599858 |
| John | Soup | 2 | Training | 2.172507 | 0.0297586 | 1.2392310 |
| Mike | Soup | 4 | Training | 4.339173 | 0.1150386 | 0.7864009 |
| Lisa | Tacos | 1 | Training | 2.536792 | 2.3617310 | 4.4656461 |
| Mia | Sandwich | 3 | Training | 2.934411 | 0.0043019 | 0.0128159 |
| Mia | Pizza | 4 | Training | 4.220126 | 0.0484554 | 0.7864009 |
| Scott | Sandwich | 1 | Training | 2.601078 | 2.5634513 | 4.4656461 |
| Eric | Soup | 5 | Training | 4.172507 | 0.6847451 | 3.5599858 |
| Kevin | Soup | 2 | Training | 2.972507 | 0.9457694 | 1.2392310 |
| Scott | Hamburger | 5 | Training | 2.664570 | 5.4542322 | 3.5599858 |
| Mike | Tacos | 5 | Training | 3.453459 | 2.3917887 | 3.5599858 |
| Lisa | Pasta | 3 | Training | 3.136792 | 0.0187122 | 0.0128159 |
| Eric | Hamburger | 3 | Training | 3.664570 | 0.4416536 | 0.0128159 |
| Kevin | Hamburger | 3 | Training | 2.464570 | 0.2866850 | 0.0128159 |
| Sachid | Salad | 5 | Training | 3.761792 | 1.5331579 | 3.5599858 |
| Sachid | Hamburger | 1 | Training | 3.164570 | 4.6853643 | 4.4656461 |
| Scott | Tacos | 3 | Training | 2.286792 | 0.5086650 | 0.0128159 |
| Mia | Salad | 5 | Training | 3.595126 | 1.9736716 | 3.5599858 |
| Elina | Hamburger | 2 | Training | 2.464570 | 0.2158255 | 1.2392310 |
| Kevin | Salad | 3 | Training | 3.061793 | 0.0038183 | 0.0128159 |
| Adam | Pasta | 2 | Training | 2.086793 | 0.0075329 | 1.2392310 |
| John | Hamburger | 2 | Training | 1.664570 | 0.1125131 | 1.2392310 |
| John | Pasta | 2 | Training | 1.886793 | 0.0128159 | 1.2392310 |
| John | Sandwich | 1 | Training | 1.601078 | 0.3612950 | 4.4656461 |
| Scott | Salad | 3 | Training | 3.261792 | 0.0685353 | 0.0128159 |
| Elina | Pizza | 3 | Training | 3.686793 | 0.4716839 | 0.0128159 |
| Mia | Hamburger | 3 | Training | 2.997904 | 0.0000044 | 0.0128159 |
| Scott | Pasta | 3 | Training | 2.886792 | 0.0128159 | 0.0128159 |
| Lisa | Hamburger | 4 | Test | 2.914570 | 1.1781578 | 0.7864009 |
| Kevin | Tacos | 2 | Test | 2.086793 | 0.0075329 | 1.2392310 |
| Adam | Salad | 5 | Test | 2.461793 | 6.4424976 | 3.5599858 |
| Paul | Tacos | 3 | Test | 2.953459 | 0.0021661 | 0.0128159 |
| Paul | Salad | 2 | Test | 3.928459 | 3.7189546 | 1.2392310 |
| Sachid | Sandwich | 3 | Test | 3.101078 | 0.0102168 | 0.0128159 |
| John | Tacos | 1 | Test | 1.286793 | 0.0822499 | 4.4656461 |
| Paul | Soup | 5 | Test | 3.839173 | 1.3475184 | 3.5599858 |
| Lisa | Salad | 4 | Test | 3.511792 | 0.2383466 | 0.7864009 |
| Eric | Pizza | 5 | Test | 4.886793 | 0.0128159 | 3.5599858 |
| Mike | Pizza | 5 | Test | 5.000000 | 0.0000000 | 3.5599858 |
| Elina | Sandwich | 4 | Test | 2.401078 | 2.5565510 | 0.7864009 |
| Scott | Soup | 5 | Test | 3.172507 | 3.3397316 | 3.5599858 |
| Sachid | Soup | 5 | Test | 3.672507 | 1.7622384 | 3.5599858 |
## Calculate the RMSE for raw average for both your training data and your test data.
rmse_df <- rmse_calcs %>%
gather(error_type,error_val, sq_err_bl_pred:sq_err_avg_pred) %>% # Convert from wide to tall
group_by(error_type, data_cat) %>% # Group by error type so R calculates the means correctly
summarise(rmse = sqrt(mean(error_val, na.rm = TRUE))) %>% # Calculate the square root of the mean.
type.convert() # Convert users and items to factors (for barplot below)
rmse_df %>%
kable(digits = 4) %>% # Limit to 4 decimal places
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) #formatting
| error_type | data_cat | rmse |
|---|---|---|
| sq_err_avg_pred | Test | 1.4806 |
| sq_err_avg_pred | Training | 1.3126 |
| sq_err_bl_pred | Test | 1.2159 |
| sq_err_bl_pred | Training | 0.9883 |
1-(rmse_df$rmse[3] / rmse_df$rmse[1]) #RMSE improvement Test set
[1] 0.1787335
1-(rmse_df$rmse[4] / rmse_df$rmse[2]) #RMSE improvement training set
[1] 0.2471183
ggplot(rmse_df, aes(x = error_type, y = rmse, fill = error_type)) +
geom_bar(stat = "identity") +
facet_grid( ~ data_cat) +
scale_fill_brewer(palette = "Paired") +
labs(title = "RMSE by data group and predictor type",
subtitle = "",
caption = "The RMSE for both data groups is based on the avg. and bias values of the training data.") +
ylab("RMSE") +
theme_minimal() +
theme(legend.position = "none", axis.title.x = element_blank()) +
geom_text(aes(label = round(rmse, 2)),
vjust = 1.6,
color = "white",
size = 5) +
scale_x_discrete(labels = c("Avg. Rating \n (Training Data)", "Baseline Predictor"))
As we can see from the caluculations above, using the baseline predictor vs the raw average results in a 17.9% improvement in the RMSE for the test data set. For the training data set, we observe a 24.7% improvement.