Introduction

The following system recommends hypothetical lunch menu items for a group of students.

Data acquisition & cleaning

library(tidyverse)
library(kableExtra)
set.seed(123) #Specify seed value to keep results reproducible
ratings <-
  read_csv("ratings.csv") %>% #Import raw data from CSV file
  gather(item, rating, -user) %>% # Convert layout from wide to tall
  filter(!is.na(rating)) %>% #Remove missing values
  sample_frac(1, replace = FALSE) %>% #Randomize row order
  mutate(data_cat = if_else(row_number() < n() * 0.8, "Training", "Test", missing = NULL)) #Label 60% of the data for training.

Calculation of means and biases

First, we use the ratings from the training data to calculate the global average for all user/item combinations. In this case, it’s 3.11.

training_avg <- ratings %>% 
  filter(data_cat == 'Training') %>% 
  summarise(tmean = mean(rating, na.rm = TRUE)) %>% 
  pull %>% 
  print()
[1] 3.113208

With the mean, we can begin to calculate the user and item biases by subtracting the global average from each user and item average.

We begin by calculating the average rating for each user. We then subtract the global average from each value to obtain the bias. This value gives us an indication of how harsh or generous each user is when rating the menu items relative to other users.

## Using your training data, calculate the raw average (mean) rating for every user-item combination.

user_avgs <- ratings %>% 
  filter(data_cat == 'Training') %>% # Select the training data
  group_by(user) %>% #Group by user so R knows what elements to enter into the mean calculation
  summarise(user_avg = mean(rating, na.rm = TRUE)) %>% #calculate the mean by user, ignoring missing values
  mutate(user_bias = user_avg - training_avg) #calculate the bias

user_avgs %>% 
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
user user_avg user_bias
Adam 2.20 -0.91
Elina 2.80 -0.31
Eric 4.00 0.89
John 2.00 -1.11
Kevin 2.80 -0.31
Lisa 3.25 0.14
Mia 3.33 0.22
Mike 4.17 1.05
Paul 3.67 0.55
Sachid 3.50 0.39
Scott 3.00 -0.11

We complete a similar operation but this time we calculate the average rating and bias by menu item. This gives us an indication of which items are more popular relative to the rest.

item_avgs <- ratings %>% 
  filter(data_cat == 'Training') %>% 
  group_by(item) %>% 
  summarise(item_avg = mean(rating, na.rm = TRUE)) %>% 
  mutate(item_bias = item_avg - training_avg)

item_avgs %>% 
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
item item_avg item_bias
Hamburger 2.78 -0.34
Pasta 3.00 -0.11
Pizza 4.00 0.89
Salad 3.38 0.26
Sandwich 2.71 -0.40
Soup 3.29 0.17
Tacos 2.40 -0.71

User/item matrix

We’ve managed to calculate an average rating for all user/item combinations and bias values for each user and item. With these value we can calculate the baseline predictor for all combinations, even those that didn’t have a rating.

## From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
bl_pred_df <- crossing(item_avgs, user_avgs) %>% # Ggenerate a dataframe with all user/item combinations
  mutate(bl_predictor = item_bias + user_bias + training_avg) %>% # Create baseline predictor column
  mutate(bl_predictor = pmax(pmin(bl_predictor, 5), 1)) %>% # Clip values to between 1 and 5.
  select(item, user, bl_predictor) # Remove unnecesary columns

bl_pred_df %>% 
  spread(item, bl_predictor) %>% #rearange dataframe into a standar user/item matrix
  kable(digits = 2) %>% # limit decimals places
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) #formatting
user Hamburger Pasta Pizza Salad Sandwich Soup Tacos
Adam 1.86 2.09 3.09 2.46 1.80 2.37 1.49
Elina 2.46 2.69 3.69 3.06 2.40 2.97 2.09
Eric 3.66 3.89 4.89 4.26 3.60 4.17 3.29
John 1.66 1.89 2.89 2.26 1.60 2.17 1.29
Kevin 2.46 2.69 3.69 3.06 2.40 2.97 2.09
Lisa 2.91 3.14 4.14 3.51 2.85 3.42 2.54
Mia 3.00 3.22 4.22 3.60 2.93 3.51 2.62
Mike 3.83 4.05 5.00 4.43 3.77 4.34 3.45
Paul 3.33 3.55 4.55 3.93 3.27 3.84 2.95
Sachid 3.16 3.39 4.39 3.76 3.10 3.67 2.79
Scott 2.66 2.89 3.89 3.26 2.60 3.17 2.29

RMSE Calculation

Now that we have both average and baseline predictions, we can calculate RMSE values to compare the accuracy of the predictions on both our training and test sets.

rmse_calcs <- ratings %>% 
  left_join(bl_pred_df, by = c('user','item')) %>% # Add bias values to our initial data
  mutate(sq_err_bl_pred = (rating - bl_predictor)**2) %>% # Calculate the squared error for our baseline predictor
  mutate(sq_err_avg_pred = (rating - training_avg)**2) # Calculate the squared error for our average predictor

rmse_calcs %>% 
  kable(col.names = c("User","Item","Rating","Category","Baseline","Baseline sq. error","Avg. sq. error")) %>% # Rename columns
  kable_styling(bootstrap_options = c("striped", "hover"),fixed_thead = T, full_width = F) # Formatting
User Item Rating Category Baseline Baseline sq. error Avg. sq. error
Adam Soup 2 Training 2.372507 0.1387613 1.2392310
John Salad 3 Training 2.261792 0.5449504 0.0128159
Elina Soup 5 Training 2.972507 4.1107289 3.5599858
Elina Salad 2 Training 3.061793 1.1274032 1.2392310
Lisa Sandwich 4 Training 2.851078 1.3200214 0.7864009
Paul Hamburger 2 Training 3.331237 1.7721917 1.2392310
John Pizza 2 Training 2.886792 0.7864009 1.2392310
Mike Salad 3 Training 4.428459 2.0404955 0.0128159
Adam Sandwich 1 Training 1.801078 0.6417262 4.4656461
Mike Sandwich 4 Training 3.767745 0.0539425 0.7864009
Eric Salad 3 Training 4.261793 1.5921202 0.0128159
Mia Soup 3 Training 3.505840 0.2558742 0.0128159
Kevin Pizza 4 Training 3.686793 0.0980990 0.7864009
Paul Pizza 4 Training 4.553459 0.3063170 0.7864009
Mike Hamburger 4 Training 3.831237 0.0284810 0.7864009
Kevin Pasta 2 Training 2.686793 0.4716839 1.2392310
Adam Tacos 1 Training 1.486792 0.2369669 4.4656461
Paul Sandwich 5 Training 3.267745 3.0007080 3.5599858
Mia Tacos 2 Training 2.620126 0.3845560 1.2392310
Elina Pasta 2 Training 2.686793 0.4716839 1.2392310
Sachid Pasta 3 Training 3.386792 0.1496084 0.0128159
Sachid Pizza 5 Training 4.386793 0.3760235 3.5599858
Lisa Pizza 5 Training 4.136793 0.7451273 3.5599858
Mike Pasta 5 Training 4.053459 0.8959396 3.5599858
Eric Pasta 5 Training 3.886792 1.2392310 3.5599858
Adam Pizza 5 Training 3.086793 3.6603631 3.5599858
John Soup 2 Training 2.172507 0.0297586 1.2392310
Mike Soup 4 Training 4.339173 0.1150386 0.7864009
Lisa Tacos 1 Training 2.536792 2.3617310 4.4656461
Mia Sandwich 3 Training 2.934411 0.0043019 0.0128159
Mia Pizza 4 Training 4.220126 0.0484554 0.7864009
Scott Sandwich 1 Training 2.601078 2.5634513 4.4656461
Eric Soup 5 Training 4.172507 0.6847451 3.5599858
Kevin Soup 2 Training 2.972507 0.9457694 1.2392310
Scott Hamburger 5 Training 2.664570 5.4542322 3.5599858
Mike Tacos 5 Training 3.453459 2.3917887 3.5599858
Lisa Pasta 3 Training 3.136792 0.0187122 0.0128159
Eric Hamburger 3 Training 3.664570 0.4416536 0.0128159
Kevin Hamburger 3 Training 2.464570 0.2866850 0.0128159
Sachid Salad 5 Training 3.761792 1.5331579 3.5599858
Sachid Hamburger 1 Training 3.164570 4.6853643 4.4656461
Scott Tacos 3 Training 2.286792 0.5086650 0.0128159
Mia Salad 5 Training 3.595126 1.9736716 3.5599858
Elina Hamburger 2 Training 2.464570 0.2158255 1.2392310
Kevin Salad 3 Training 3.061793 0.0038183 0.0128159
Adam Pasta 2 Training 2.086793 0.0075329 1.2392310
John Hamburger 2 Training 1.664570 0.1125131 1.2392310
John Pasta 2 Training 1.886793 0.0128159 1.2392310
John Sandwich 1 Training 1.601078 0.3612950 4.4656461
Scott Salad 3 Training 3.261792 0.0685353 0.0128159
Elina Pizza 3 Training 3.686793 0.4716839 0.0128159
Mia Hamburger 3 Training 2.997904 0.0000044 0.0128159
Scott Pasta 3 Training 2.886792 0.0128159 0.0128159
Lisa Hamburger 4 Test 2.914570 1.1781578 0.7864009
Kevin Tacos 2 Test 2.086793 0.0075329 1.2392310
Adam Salad 5 Test 2.461793 6.4424976 3.5599858
Paul Tacos 3 Test 2.953459 0.0021661 0.0128159
Paul Salad 2 Test 3.928459 3.7189546 1.2392310
Sachid Sandwich 3 Test 3.101078 0.0102168 0.0128159
John Tacos 1 Test 1.286793 0.0822499 4.4656461
Paul Soup 5 Test 3.839173 1.3475184 3.5599858
Lisa Salad 4 Test 3.511792 0.2383466 0.7864009
Eric Pizza 5 Test 4.886793 0.0128159 3.5599858
Mike Pizza 5 Test 5.000000 0.0000000 3.5599858
Elina Sandwich 4 Test 2.401078 2.5565510 0.7864009
Scott Soup 5 Test 3.172507 3.3397316 3.5599858
Sachid Soup 5 Test 3.672507 1.7622384 3.5599858
## Calculate the RMSE for raw average for both your training data and your test data.
rmse_df <- rmse_calcs %>% 
  gather(error_type,error_val, sq_err_bl_pred:sq_err_avg_pred) %>% # Convert from wide to tall
  group_by(error_type, data_cat) %>% # Group by error type so R calculates the means correctly
  summarise(rmse = sqrt(mean(error_val, na.rm = TRUE))) %>% # Calculate the square root of the mean.
  type.convert() # Convert users and items to factors (for barplot below)

rmse_df %>% 
  kable(digits = 4) %>% # Limit to 4 decimal places
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) #formatting
error_type data_cat rmse
sq_err_avg_pred Test 1.4806
sq_err_avg_pred Training 1.3126
sq_err_bl_pred Test 1.2159
sq_err_bl_pred Training 0.9883
1-(rmse_df$rmse[3] / rmse_df$rmse[1]) #RMSE improvement Test set
[1] 0.1787335
1-(rmse_df$rmse[4] / rmse_df$rmse[2]) #RMSE improvement training set
[1] 0.2471183
ggplot(rmse_df, aes(x = error_type, y = rmse, fill = error_type)) +
  geom_bar(stat = "identity") +
  facet_grid( ~ data_cat) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "RMSE by data group and predictor type",
       subtitle = "",
       caption = "The RMSE for both data groups is based on the avg. and bias values of the training data.") +
  ylab("RMSE") +
  theme_minimal() +
  theme(legend.position = "none", axis.title.x = element_blank()) +
  geom_text(aes(label = round(rmse, 2)),
            vjust = 1.6,
            color = "white",
            size = 5) +
  scale_x_discrete(labels = c("Avg. Rating \n (Training Data)", "Baseline Predictor"))

As we can see from the caluculations above, using the baseline predictor vs the raw average results in a 17.9% improvement in the RMSE for the test data set. For the training data set, we observe a 24.7% improvement.

---
title: "Project 1: Global Baseline Predictors and RMSE"
subtitle: "DATA-612, Summer 2019"
author: "Fernando Figueres Zeledón"
output: html_notebook
---

# Introduction

The following system recommends hypothetical lunch menu items for a group of students.

# Data acquisition & cleaning

```{r Library, message=FALSE, warning=FALSE}
library(tidyverse)
library(kableExtra)
```

```{r data_import, message=FALSE}
set.seed(123) #Specify seed value to keep results reproducible
ratings <-
  read_csv("ratings.csv") %>% #Import raw data from CSV file
  gather(item, rating, -user) %>% # Convert layout from wide to tall
  filter(!is.na(rating)) %>% #Remove rows with missing rating value
  sample_frac(1, replace = FALSE) %>% #Randomize row order
  mutate(data_cat = if_else(row_number() < n() * 0.8, "Training", "Test", missing = NULL)) #Label 80% of the data for training.
```

## Calculation of means and biases

First, we use the ratings from the training data to calculate the global average for all user/item combinations. In this case, it's 3.11.

```{r training_avg_calc}
training_avg <- ratings %>% 
  filter(data_cat == 'Training') %>% # Select only training data
  summarise(tmean = mean(rating, na.rm = TRUE)) %>% # Calculate the mean
  pull %>% # Extract single value from data frame
  print()
```

With the mean, we can begin to calculate the user and item biases by subtracting the global average from each user and item average.

We begin by calculating the average rating for each user. We then subtract the global average from each value to obtain the bias. This value gives us an indication of how harsh or generous each user is when rating the menu items relative to other users.

```{r user_avgs}
user_avgs <- ratings %>% 
  filter(data_cat == 'Training') %>% # Select the training data
  group_by(user) %>% #Group by user so R knows what elements to enter into the mean calculation
  summarise(user_avg = mean(rating, na.rm = TRUE)) %>% #calculate the mean by user, ignoring missing values
  mutate(user_bias = user_avg - training_avg) #calculate the bias

user_avgs %>% 
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
```

We complete a similar operation but this time we calculate the average rating and bias by menu item. This gives us an indication of which items are more popular relative to the rest.

```{r item_avgs}
item_avgs <- ratings %>% 
  filter(data_cat == 'Training') %>% 
  group_by(item) %>% 
  summarise(item_avg = mean(rating, na.rm = TRUE)) %>% 
  mutate(item_bias = item_avg - training_avg)

item_avgs %>% 
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
```

# User/item matrix

We've managed to calculate an average rating for all user/item combinations and bias values for each user and item. With these value we can calculate the baseline predictor for all combinations, even those that didn't have a rating.

```{r baseline_predictors}
## From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
bl_pred_df <- crossing(item_avgs, user_avgs) %>% # Ggenerate a dataframe with all user/item combinations
  mutate(bl_predictor = item_bias + user_bias + training_avg) %>% # Create baseline predictor column
  mutate(bl_predictor = pmax(pmin(bl_predictor, 5), 1)) %>% # Clip values to between 1 and 5.
  select(item, user, bl_predictor) # Remove unnecesary columns

bl_pred_df %>% 
  spread(item, bl_predictor) %>% #rearange dataframe into a standar user/item matrix
  kable(digits = 2) %>% # limit decimals places
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) #formatting
```

# RMSE Calculation

Now that we have both average and baseline predictions, we can calculate RMSE values to compare the accuracy of the predictions on both our training and test sets.

```{r rmse_cals}
rmse_calcs <- ratings %>% 
  left_join(bl_pred_df, by = c('user','item')) %>% # Add bias values to our initial data
  mutate(sq_err_bl_pred = (rating - bl_predictor)**2) %>% # Calculate the squared error for our baseline predictor
  mutate(sq_err_avg_pred = (rating - training_avg)**2) # Calculate the squared error for our average predictor

rmse_calcs %>% 
  kable(col.names = c("User","Item","Rating","Category","Baseline","Baseline sq. error","Avg. sq. error")) %>% # Rename columns
  kable_styling(bootstrap_options = c("striped", "hover"),fixed_thead = T, full_width = F) # Formatting
```

```{r}
rmse_df <- rmse_calcs %>% 
  gather(error_type,error_val, sq_err_bl_pred:sq_err_avg_pred) %>% # Convert from wide to tall
  group_by(error_type, data_cat) %>% # Group by error type so R calculates the means correctly
  summarise(rmse = sqrt(mean(error_val, na.rm = TRUE))) %>% # Calculate the square root of the mean.
  type.convert() # Convert users and items to factors (for barplot below)

rmse_df %>% 
  kable(digits = 4) %>% # Limit to 4 decimal places
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) #formatting
```

```{r}
1-(rmse_df$rmse[3] / rmse_df$rmse[1]) #RMSE % improvement Test set

1-(rmse_df$rmse[4] / rmse_df$rmse[2]) #RMSE % improvement training set
```


```{r}
ggplot(rmse_df, aes(x = error_type, y = rmse, fill = error_type)) +
  geom_bar(stat = "identity") +
  facet_grid( ~ data_cat) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "RMSE by data group and predictor type",
       subtitle = "",
       caption = "The RMSE for both data groups is based on the avg. and bias values of the training data.") +
  ylab("RMSE") +
  theme_minimal() +
  theme(legend.position = "none", axis.title.x = element_blank()) +
  geom_text(aes(label = round(rmse, 2)),
            vjust = 1.6,
            color = "white",
            size = 5) +
  scale_x_discrete(labels = c("Avg. Rating \n (Training Data)", "Baseline Predictor"))
```


As we can see from the caluculations above, using the baseline predictor vs the raw average results in a 17.9% improvement in the RMSE for the test data set. For the training data set, we observe a 24.7% improvement.