Instruction

In this first assignment, we will attempt to predict ratings with very little information. We will first look at just raw averages across all (training dataset) users. We will then account for “bias” by normalizing across users and across items. We will work with ratings in a user-item matrix, where each rating may be (1) assigned to a training dataset, (2) assigned to a test dataset, or (3) missing.

Introduction

In this project, we build a simple recommender system for McDonald’s Combo Meals. The system recommends Combo Meals to customers who may be interested in trying something other than their usual order. The data (ratings of Combo Meals) was collected by the three group members from themselves and their friends. The dataset contains 10 users (customers) and their ratings of 11 items (McDonald’s Combo Meals).

Load Packages

library(tidyverse)
library(rmdformats)
library(knitr)
library(kableExtra)
library(formattable)
library(caTools)
library(yardstick)
library(gplots)

Read Data

data <- read_csv('https://raw.githubusercontent.com/oggyluky11/DATA-612-2020-SUMMER/master/Project%201/McDonald%20Meal%20Friend%20Rating.csv')
data

Data Exploration

The data is in long format. The summary below shows that not all customers (users) have rated all 11 products (items); only Friend 1 has rated all 11.

After reshaping the data into a user-item matrix, we observe 23 missing values among the 110 user-item combinations (10 users × 11 items), so about 79% of the cells are filled. The data is therefore not very sparse, as the heatmap below shows.

Before reshaping:

data %>% 
  mutate_if(is.character,as.factor) %>% 
  summary(maxsum = 20)
##      Customer                               Combo Meal     Rating     
##  Frank   : 8   10 Piece Chicken McNuggets        : 8   Min.   :1.000  
##  Friend 1:11   Artisan Grilled Chicken Sandwich  : 9   1st Qu.:2.000  
##  Friend 2:10   Bacon, Egg & Cheese Biscuit       : 7   Median :3.000  
##  Friend 3: 8   Big Mac                           :10   Mean   :2.816  
##  Friend 4: 8   Buttermilk Crispy Chicken Sandwich: 7   3rd Qu.:4.000  
##  Friend 5:10   Cheese burger                     : 7   Max.   :5.000  
##  Friend 6: 7   Double Quarter Pounder w Cheese   : 7                  
##  Friend 7: 9   Egg McMuffin                      : 7                  
##  Ricki   : 8   Filet-O-Fish                      : 9                  
##  Shirley : 8   Quarter Pounder w Cheese          : 8                  
##                Sausage McMuffin w Egg            : 8

After reshaping:

data %>% 
  spread(key = `Combo Meal`, value = Rating) %>%
  gather(key = `Combo Meal`, value = Rating, - Customer) %>%
  mutate_if(is.character,as.factor) %>% 
  summary(maxsum = 20)
##      Customer                               Combo Meal     Rating     
##  Frank   :11   10 Piece Chicken McNuggets        :10   Min.   :1.000  
##  Friend 1:11   Artisan Grilled Chicken Sandwich  :10   1st Qu.:2.000  
##  Friend 2:11   Bacon, Egg & Cheese Biscuit       :10   Median :3.000  
##  Friend 3:11   Big Mac                           :10   Mean   :2.816  
##  Friend 4:11   Buttermilk Crispy Chicken Sandwich:10   3rd Qu.:4.000  
##  Friend 5:11   Cheese burger                     :10   Max.   :5.000  
##  Friend 6:11   Double Quarter Pounder w Cheese   :10   NA's   :23     
##  Friend 7:11   Egg McMuffin                      :10                  
##  Ricki   :11   Filet-O-Fish                      :10                  
##  Shirley :11   Quarter Pounder w Cheese          :10                  
##                Sausage McMuffin w Egg            :10
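
The missing-value count can be cross-checked with a little arithmetic. The sketch below uses only the dimensions and per-customer counts reported in the summaries above; the variable names are illustrative and not part of the original analysis.

```r
# Dimensions and rating count taken from the summaries above
n_users   <- 10
n_items   <- 11
n_ratings <- 8 + 11 + 10 + 8 + 8 + 10 + 7 + 9 + 8 + 8  # per-customer counts before reshaping

n_cells   <- n_users * n_items    # every possible user-item pair
n_missing <- n_cells - n_ratings  # cells with no rating
sparsity  <- n_missing / n_cells  # share of empty cells

cat(n_missing, "missing of", n_cells, "cells;",
    sprintf("%.1f%%", 100 * sparsity), "empty\n")
```

The result agrees with the `NA's   :23` entry in the reshaped summary.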

Heatmap:

data %>%
  spread(key = `Combo Meal`, value = Rating) %>%
  column_to_rownames('Customer') %>%
  as.matrix() %>%
  heatmap.2(trace = 'none',
            density.info = 'none',
            dendrogram = 'none',
            Rowv = FALSE,
            Colv = FALSE,
            col = colorRampPalette(c("grey", "deeppink4"))(n = 299))

Separate Training Dataset & Test Dataset

The ratings of McDonald’s Combo Meals are then split into a training dataset (85%) and a test dataset (15%). In the Train_Test_Split table below, the Data_Group column marks each rating as Train (highlighted in blue) or Test (highlighted in red).

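# Fix the seed so the split is reproducible
set.seed(3)
# sample.split() returns a logical vector; TRUE marks a training row
data_split <- data %>%
  mutate(Data_Group = sample.split(Rating, 0.85)) %>%
  mutate(Data_Group = if_else(Data_Group==TRUE, 'Train', 'Test'))
# Training ratings (about 85% of the rows)
data_train <- data_split %>% 
  filter(Data_Group == 'Train') %>%
  select(-Data_Group)
# Held-out test ratings (about 15% of the rows)
data_test <- data_split %>% 
  filter(Data_Group == 'Test') %>%
  select(-Data_Group)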
data_split %>%
  mutate(
  Data_Group = cell_spec(Data_Group, background = if_else(Data_Group == 'Test',
                                                    'lightpink','lightblue'))) %>%
  kable(escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "bordered"),
                full_width = FALSE,
                font_size = 12) %>%
  add_header_above(c('Train_Test_Split' = 4)) %>%
  scroll_box(width = "100%", height = "400px")
Train_Test_Split
| Customer | Combo Meal | Rating | Data_Group |
|---|---|---|---|
| Ricki | Big Mac | 3 | Train |
| Ricki | Cheese burger | 1 | Test |
| Ricki | Quarter Pounder w Cheese | 3 | Test |
| Ricki | 10 Piece Chicken McNuggets | 4 | Test |
| Ricki | Filet-O-Fish | 5 | Train |
| Ricki | Artisan Grilled Chicken Sandwich | 4 | Train |
| Ricki | Buttermilk Crispy Chicken Sandwich | 2 | Train |
| Ricki | Bacon, Egg & Cheese Biscuit | 2 | Train |
| Shirley | Big Mac | 2 | Train |
| Shirley | Cheese burger | 3 | Train |
| Shirley | Quarter Pounder w Cheese | 4 | Train |
| Shirley | 10 Piece Chicken McNuggets | 5 | Test |
| Shirley | Filet-O-Fish | 5 | Train |
| Shirley | Artisan Grilled Chicken Sandwich | 2 | Train |
| Shirley | Buttermilk Crispy Chicken Sandwich | 4 | Train |
| Shirley | Sausage McMuffin w Egg | 5 | Test |
| Frank | Big Mac | 3 | Train |
| Frank | Cheese burger | 2 | Train |
| Frank | Quarter Pounder w Cheese | 3 | Train |
| Frank | Double Quarter Pounder w Cheese | 5 | Train |
| Frank | 10 Piece Chicken McNuggets | 4 | Train |
| Frank | Egg McMuffin | 3 | Train |
| Frank | Sausage McMuffin w Egg | 4 | Train |
| Frank | Bacon, Egg & Cheese Biscuit | 2 | Train |
| Friend 1 | Big Mac | 4 | Train |
| Friend 1 | Cheese burger | 1 | Train |
| Friend 1 | Quarter Pounder w Cheese | 2 | Train |
| Friend 1 | Double Quarter Pounder w Cheese | 2 | Test |
| Friend 1 | 10 Piece Chicken McNuggets | 3 | Train |
| Friend 1 | Filet-O-Fish | 1 | Train |
| Friend 1 | Artisan Grilled Chicken Sandwich | 2 | Test |
| Friend 1 | Buttermilk Crispy Chicken Sandwich | 1 | Train |
| Friend 1 | Egg McMuffin | 3 | Train |
| Friend 1 | Sausage McMuffin w Egg | 3 | Train |
| Friend 1 | Bacon, Egg & Cheese Biscuit | 1 | Train |
| Friend 2 | Big Mac | 2 | Train |
| Friend 2 | Cheese burger | 5 | Train |
| Friend 2 | Quarter Pounder w Cheese | 1 | Train |
| Friend 2 | Double Quarter Pounder w Cheese | 2 | Train |
| Friend 2 | Filet-O-Fish | 3 | Train |
| Friend 2 | Artisan Grilled Chicken Sandwich | 2 | Train |
| Friend 2 | Buttermilk Crispy Chicken Sandwich | 5 | Train |
| Friend 2 | Egg McMuffin | 3 | Train |
| Friend 2 | Sausage McMuffin w Egg | 1 | Train |
| Friend 2 | Bacon, Egg & Cheese Biscuit | 5 | Train |
| Friend 3 | Big Mac | 2 | Train |
| Friend 3 | Quarter Pounder w Cheese | 3 | Train |
| Friend 3 | Double Quarter Pounder w Cheese | 3 | Train |
| Friend 3 | 10 Piece Chicken McNuggets | 4 | Train |
| Friend 3 | Filet-O-Fish | 3 | Train |
| Friend 3 | Artisan Grilled Chicken Sandwich | 2 | Train |
| Friend 3 | Buttermilk Crispy Chicken Sandwich | 3 | Test |
| Friend 3 | Sausage McMuffin w Egg | 4 | Train |
| Friend 4 | Big Mac | 4 | Train |
| Friend 4 | Quarter Pounder w Cheese | 2 | Train |
| Friend 4 | 10 Piece Chicken McNuggets | 5 | Train |
| Friend 4 | Filet-O-Fish | 2 | Train |
| Friend 4 | Artisan Grilled Chicken Sandwich | 4 | Train |
| Friend 4 | Egg McMuffin | 2 | Train |
| Friend 4 | Sausage McMuffin w Egg | 1 | Test |
| Friend 4 | Bacon, Egg & Cheese Biscuit | 4 | Train |
| Friend 5 | Big Mac | 1 | Train |
| Friend 5 | Cheese burger | 1 | Test |
| Friend 5 | Double Quarter Pounder w Cheese | 1 | Train |
| Friend 5 | 10 Piece Chicken McNuggets | 2 | Train |
| Friend 5 | Filet-O-Fish | 3 | Test |
| Friend 5 | Artisan Grilled Chicken Sandwich | 5 | Train |
| Friend 5 | Buttermilk Crispy Chicken Sandwich | 1 | Train |
| Friend 5 | Egg McMuffin | 4 | Train |
| Friend 5 | Sausage McMuffin w Egg | 1 | Train |
| Friend 5 | Bacon, Egg & Cheese Biscuit | 5 | Train |
| Friend 6 | Big Mac | 5 | Train |
| Friend 6 | Quarter Pounder w Cheese | 1 | Train |
| Friend 6 | Double Quarter Pounder w Cheese | 3 | Train |
| Friend 6 | Filet-O-Fish | 4 | Test |
| Friend 6 | Artisan Grilled Chicken Sandwich | 1 | Train |
| Friend 6 | Egg McMuffin | 4 | Train |
| Friend 6 | Sausage McMuffin w Egg | 5 | Train |
| Friend 7 | Big Mac | 2 | Train |
| Friend 7 | Cheese burger | 2 | Test |
| Friend 7 | Double Quarter Pounder w Cheese | 1 | Train |
| Friend 7 | 10 Piece Chicken McNuggets | 2 | Train |
| Friend 7 | Filet-O-Fish | 2 | Train |
| Friend 7 | Artisan Grilled Chicken Sandwich | 1 | Train |
| Friend 7 | Buttermilk Crispy Chicken Sandwich | 3 | Train |
| Friend 7 | Egg McMuffin | 1 | Train |
| Friend 7 | Bacon, Egg & Cheese Biscuit | 4 | Train |
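
To make the idea of a label-preserving split concrete, here is a minimal base-R sketch. It is not the caTools implementation: `stratified_split` is a hypothetical helper, and the toy rating counts are invented for illustration. It only mirrors the documented behaviour of `sample.split()`, which splits a label vector in a given ratio while keeping the relative share of each label.

```r
# Hypothetical helper: put roughly `ratio` of each label's rows in the training set
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (label in unique(y)) {
    idx     <- which(y == label)
    n_train <- round(ratio * length(idx))
    # sample.int() avoids the sample(x) pitfall when a label occurs only once
    chosen  <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(3)
ratings  <- rep(1:5, times = c(20, 22, 18, 15, 12))  # toy ratings, NOT the project data
in_train <- stratified_split(ratings, 0.85)

print(table(ratings, in_train))  # train/test counts per rating value
print(mean(in_train))            # overall training share, close to 0.85
```

Because each rating value is split separately, the training and test sets end up with a similar mix of low and high ratings, which matters when the dataset is as small as this one.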

Create a User-Item Matrix

We reshape the complete dataset from long format to wide format, with customers as users (in rows) and products as items (in columns). Training and test ratings are distinguished by color, and missing values are shown as ‘?’ in the table below.

options(knitr.kable.NA = '?')
data_split %>% 
  mutate(Rating = cell_spec(Rating,'html',
                            background  = if_else(Data_Group == 'Test','lightpink','lightblue'))) %>%
  select(-Data_Group) %>%
  spread(`Combo Meal`, Rating) %>%
  arrange(desc(Customer)) %>%
  kable(escape = FALSE, caption = 'User-Item Matrix', align = 'lccccccccccc') %>%
  kable_styling(bootstrap_options = c("striped", "bordered"),
                full_width = FALSE,
                font_size = 12) %>%
  add_header_above(c('User', 'Item' = 11)) %>%
  footnote(symbol = 'The numbers in BLUE are training set, and those in RED are test set') 
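User-Item Matrix (rows: User; columns: Item)
| Customer | 10 Piece Chicken McNuggets | Artisan Grilled Chicken Sandwich | Bacon, Egg & Cheese Biscuit | Big Mac | Buttermilk Crispy Chicken Sandwich | Cheese burger | Double Quarter Pounder w Cheese | Egg McMuffin | Filet-O-Fish | Quarter Pounder w Cheese | Sausage McMuffin w Egg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Shirley | 5 | 2 | ? | 2 | 4 | 3 | ? | ? | 5 | 4 | 5 |
| Ricki | 4 | 4 | 2 | 3 | 2 | 1 | ? | ? | 5 | 3 | ? |
| Friend 7 | 2 | 1 | 4 | 2 | 3 | 2 | 1 | 1 | 2 | ? | ? |
| Friend 6 | ? | 1 | ? | 5 | ? | ? | 3 | 4 | 4 | 1 | 5 |
| Friend 5 | 2 | 5 | 5 | 1 | 1 | 1 | 1 | 4 | 3 | ? | 1 |
| Friend 4 | 5 | 4 | 4 | 4 | ? | ? | ? | 2 | 2 | 2 | 1 |
| Friend 3 | 4 | 2 | ? | 2 | 3 | ? | 3 | ? | 3 | 3 | 4 |
| Friend 2 | ? | 2 | 5 | 2 | 5 | 5 | 2 | 3 | 3 | 1 | 1 |
| Friend 1 | 3 | 2 | 1 | 4 | 1 | 1 | 2 | 3 | 1 | 2 | 3 |
| Frank | 4 | ? | 2 | 3 | ? | 2 | 5 | 3 | ? | 3 | 4 |
* The numbers in BLUE are training set, and those in RED are test set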

Calculate Raw Average Rating

We calculate the raw average rating over all training ratings; this single value serves as the prediction for every user-item combination.

mean_train_raw <- mean(data_train$Rating)
print(str_c('Raw average rating for every user-item combination: ', as.character(mean_train_raw)))
## [1] "Raw average rating for every user-item combination: 2.82432432432432"

Calculate RMSE for Raw Average

We calculate the RMSE of the raw average rating on both the training and test datasets.

RMSE_train_raw <- (data_train$Rating - mean_train_raw)^2 %>% mean() %>% sqrt()
print(str_c('RMSE of raw average rating for training set: ', as.character(RMSE_train_raw)))
## [1] "RMSE of raw average rating for training set: 1.34925513092493"
RMSE_test_raw <- (data_test$Rating - mean_train_raw)^2 %>% mean() %>% sqrt()
print(str_c('RMSE of raw average rating for test set: ', as.character(RMSE_test_raw)))
## [1] "RMSE of raw average rating for test set: 1.3685239438972"
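
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The following is a minimal sketch with made-up ratings; `rmse_manual` is an illustrative name, not a function used elsewhere in this report.

```r
# Root mean squared error between actual ratings and predictions
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual   <- c(3, 5, 4, 2)   # toy ratings, not the project data
raw_mean <- mean(actual)    # the "raw average" predictor for this toy set

# A scalar prediction is recycled across all ratings,
# just as mean_train_raw is in the code above
toy_rmse <- rmse_manual(actual, raw_mean)
print(toy_rmse)
```

Note that the test-set RMSE above uses the training mean as the prediction, so no information from the test ratings leaks into the predictor.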

Calculate the Bias for Each User and Each Item

Using the training dataset, we calculate the bias for each user and each item.

From the bias table below, we observe that Friend 7 is the harshest rater (lowest user bias), while the most popular product is the 10 Piece Chicken McNuggets combo meal (highest item bias).

options(knitr.kable.NA = '')
user_bias_tb <- data_train %>% 
  group_by(Customer) %>%
  summarise(User_Bias = mean(Rating) - mean_train_raw)
item_bias_tb <- data_train %>%
  group_by(`Combo Meal`) %>%
  summarise(Item_Bias = mean(Rating) - mean_train_raw)
user_bias_tb %>% 
  arrange(desc(User_Bias)) %>%
  data.frame(row.names = NULL) %>%
  merge(item_bias_tb %>% 
          arrange(desc(Item_Bias)) %>%
          data.frame(row.names = NULL), by = 0, all = TRUE) %>%
  arrange(desc(User_Bias)) %>%
  select(-Row.names) %>%
  kable(escape = FALSE, caption = 'Bias') %>%
  add_header_above(c('Bias for User' = 2, 'Bias for Item' = 2)) 
Bias (left: Bias for User; right: Bias for Item; each side is ranked independently)
| Customer | User_Bias | Combo.Meal | Item_Bias |
|---|---|---|---|
| Shirley | 0.5090090 | 10 Piece Chicken McNuggets | 0.5090090 |
| Friend 4 | 0.4613900 | Bacon, Egg & Cheese Biscuit | 0.4613900 |
| Frank | 0.4256757 | Filet-O-Fish | 0.1756757 |
| Ricki | 0.3756757 | Sausage McMuffin w Egg | 0.1756757 |
| Friend 6 | 0.3423423 | Egg McMuffin | 0.0328185 |
| Friend 3 | 0.1756757 | Big Mac | -0.0243243 |
| Friend 2 | 0.0756757 | Cheese burger | -0.0743243 |
| Friend 5 | -0.3243243 | Buttermilk Crispy Chicken Sandwich | -0.1576577 |
| Friend 1 | -0.7132132 | Artisan Grilled Chicken Sandwich | -0.1993243 |
| Friend 7 | -0.8243243 | Double Quarter Pounder w Cheese | -0.3243243 |
|  |  | Quarter Pounder w Cheese | -0.5386100 |
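
As a worked check, a user's bias is that user's mean training rating minus the raw training average. The sketch below recomputes two entries of the table from the training rows listed in the Train_Test_Split table; the small data frame and variable names are constructed here for illustration only.

```r
global_mean <- 2.82432432432432  # raw training average reported earlier

# Training ratings of two customers, copied from the Train_Test_Split table
train <- data.frame(
  Customer = rep(c("Shirley", "Friend 7"), times = c(6, 8)),
  Rating   = c(2, 3, 4, 5, 2, 4,         # Shirley
               2, 1, 2, 2, 1, 3, 1, 4)   # Friend 7
)

# Mean rating per customer, centred on the global mean
user_bias <- tapply(train$Rating, train$Customer, mean) - global_mean
print(round(user_bias, 7))
```

Shirley's training ratings average 20/6, above the raw average, which gives a positive bias; Friend 7's average 2, which gives the most negative bias in the table. Item biases follow the same pattern with ratings grouped by Combo Meal instead of by Customer.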

Calculate the Baseline Predictors

From the raw average rating and the corresponding user and item biases, we calculate the baseline predictor for every user-item combination, clipping each value to the valid rating range of 1 to 5, and present the results in the table below.

# Empty long-format table that collects one prediction per user-item pair
baseline_pred <- data.frame('Customer' = character(),
                            'Combo_Meal' = character(),
                            'Baseline_Predictor' = numeric(),
                            stringsAsFactors = FALSE)
for (user in data$Customer %>% unique()){
  for(item in data$`Combo Meal` %>% unique()){
    # Look up the biases estimated from the training data
    user_bias <- user_bias_tb$User_Bias[user_bias_tb$Customer == user]
    item_bias <- item_bias_tb$Item_Bias[item_bias_tb$`Combo Meal` == item]
    # Baseline = raw average + user bias + item bias, clipped to the 1-5 rating scale
    baseline_predictor <- pmax(pmin(mean_train_raw + user_bias + item_bias,5),1) 

    baseline_pred <- add_row(baseline_pred,
                             Customer = user,
                             Combo_Meal = item,
                             Baseline_Predictor = baseline_predictor)
  }
}
baseline_pred %>%
  spread(key = 'Combo_Meal', value = 'Baseline_Predictor') %>%
  kable(caption = 'Baseline Predictors') %>%
  kable_styling(bootstrap_options = "striped", 
                full_width = FALSE,
                font_size = 12) %>%
  add_header_above(c('User', 'Item' = 11))
Baseline Predictors (rows: User; columns: Item)
| Customer | 10 Piece Chicken McNuggets | Artisan Grilled Chicken Sandwich | Bacon, Egg & Cheese Biscuit | Big Mac | Buttermilk Crispy Chicken Sandwich | Cheese burger | Double Quarter Pounder w Cheese | Egg McMuffin | Filet-O-Fish | Quarter Pounder w Cheese | Sausage McMuffin w Egg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Frank | 3.759009 | 3.050676 | 3.711390 | 3.225676 | 3.092342 | 3.175676 | 2.925676 | 3.282818 | 3.425676 | 2.711390 | 3.425676 |
| Friend 1 | 2.620120 | 1.911787 | 2.572501 | 2.086787 | 1.953453 | 2.036787 | 1.786787 | 2.143930 | 2.286787 | 1.572501 | 2.286787 |
| Friend 2 | 3.409009 | 2.700676 | 3.361390 | 2.875676 | 2.742342 | 2.825676 | 2.575676 | 2.932819 | 3.075676 | 2.361390 | 3.075676 |
| Friend 3 | 3.509009 | 2.800676 | 3.461390 | 2.975676 | 2.842342 | 2.925676 | 2.675676 | 3.032818 | 3.175676 | 2.461390 | 3.175676 |
| Friend 4 | 3.794723 | 3.086390 | 3.747104 | 3.261390 | 3.128057 | 3.211390 | 2.961390 | 3.318533 | 3.461390 | 2.747104 | 3.461390 |
| Friend 5 | 3.009009 | 2.300676 | 2.961390 | 2.475676 | 2.342342 | 2.425676 | 2.175676 | 2.532818 | 2.675676 | 1.961390 | 2.675676 |
| Friend 6 | 3.675676 | 2.967342 | 3.628057 | 3.142342 | 3.009009 | 3.092342 | 2.842342 | 3.199485 | 3.342342 | 2.628057 | 3.342342 |
| Friend 7 | 2.509009 | 1.800676 | 2.461390 | 1.975676 | 1.842342 | 1.925676 | 1.675676 | 2.032818 | 2.175676 | 1.461390 | 2.175676 |
| Ricki | 3.709009 | 3.000676 | 3.661390 | 3.175676 | 3.042342 | 3.125676 | 2.875676 | 3.232819 | 3.375676 | 2.661390 | 3.375676 |
| Shirley | 3.842342 | 3.134009 | 3.794723 | 3.309009 | 3.175676 | 3.259009 | 3.009009 | 3.366152 | 3.509009 | 2.794723 | 3.509009 |
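
The same grid of predictions can also be built without an explicit loop. The sketch below is an alternative formulation rather than the code used in this report: it takes two user biases and two item biases from the bias table and combines them with `outer()`, then clips to the 1 to 5 scale.

```r
global_mean <- 2.82432432432432  # raw training average reported earlier

# A few biases copied from the bias table (named vectors keep the labels)
user_bias <- c("Frank" = 0.4256757, "Friend 7" = -0.8243243)
item_bias <- c("10 Piece Chicken McNuggets" = 0.5090090,
               "Quarter Pounder w Cheese"   = -0.5386100)

# outer() adds every user bias to every item bias, giving a user x item matrix
raw_pred <- global_mean + outer(user_bias, item_bias, "+")

# Clip to the valid rating range, as the loop above does with pmax/pmin
baseline <- pmin(pmax(raw_pred, 1), 5)
print(round(baseline, 6))
```

For example, Frank's prediction for the 10 Piece Chicken McNuggets is 2.824324 + 0.425676 + 0.509009 = 3.759009, which matches the corresponding cell in the table.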

Calculate RMSE for Baseline Predictors

Finally, we calculate the RMSE of the baseline predictors on both the training and test datasets.

RMSE_train_baseline_pred <- data_train %>% 
  left_join(baseline_pred, by = c('Customer' = 'Customer', 'Combo Meal' = 'Combo_Meal')) %>%
  rmse(Rating, Baseline_Predictor) %>%
  .$`.estimate`
print(str_c('RMSE of baseline predictor for training set: ', RMSE_train_baseline_pred))
## [1] "RMSE of baseline predictor for training set: 1.21011708247286"
RMSE_test_baseline_pred <- data_test %>% 
  left_join(baseline_pred, by = c('Customer' = 'Customer', 'Combo Meal' = 'Combo_Meal')) %>%
  rmse(Rating, Baseline_Predictor) %>%
  .$`.estimate`
print(str_c('RMSE of baseline predictor for test set: ', RMSE_test_baseline_pred))
## [1] "RMSE of baseline predictor for test set: 1.14332066988903"

Summary

Earlier, we calculated the RMSE of the raw average rating on the training and test datasets, as well as the RMSE of the baseline predictors on both. To compare the results, we calculate the percentage change in RMSE for each dataset.

The results below show that using the baseline predictors instead of the raw average rating reduces the RMSE by 10.31% on the training dataset and by 16.46% on the test dataset.

Therefore, the baseline predictors perform better than the raw average on both datasets, although with only 13 ratings in the test set the size of the improvement should be read with caution.

data.frame(Metrics = c('RMSE_Train_Raw', 
                       'RMSE_Train_Baseline_Pred', 
                       '% Change in RMSE of training set',
                       'RMSE_Test_Raw', 
                       'RMSE_Test_Baseline_Pred',
                       '% Change in RMSE of Test Set'),
Value = c(RMSE_train_raw, 
          RMSE_train_baseline_pred, 
          (RMSE_train_baseline_pred - RMSE_train_raw)/RMSE_train_raw,
          RMSE_test_raw, 
          RMSE_test_baseline_pred,
          (RMSE_test_baseline_pred - RMSE_test_raw)/RMSE_test_raw),
stringsAsFactors = FALSE) %>%
  mutate(Value = case_when(str_detect(Metrics, '%') ~ cell_spec(str_c(as.character(round(Value*100,2)),'%'), bold = TRUE),
                           TRUE ~ if_else(str_detect(Metrics,'Train'),
                                          color_bar('lightblue')(Value),
                                          color_bar('lightpink')(Value))),
         Metrics = case_when(str_detect(Metrics, '%') ~ cell_spec(Metrics, bold = TRUE),
                             TRUE ~ Metrics)) %>%
  kable(escape = FALSE, caption = 'Summary',align = 'lr') %>%
  kable_styling('hover') %>%
  row_spec(c(1,2,4,5), background = 'white') %>%
  row_spec(c(3,6), background = 'lightgrey')
Summary
| Metrics | Value |
|---|---|
| RMSE_Train_Raw | 1.3492551 |
| RMSE_Train_Baseline_Pred | 1.2101171 |
| % Change in RMSE of training set | -10.31% |
| RMSE_Test_Raw | 1.3685239 |
| RMSE_Test_Baseline_Pred | 1.1433207 |
| % Change in RMSE of Test Set | -16.46% |
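
The percentage changes can be reproduced directly from the four RMSE values in the summary table. This is a minimal sketch; `pct_change` is an illustrative helper name.

```r
# Percentage change from an old value to a new value
pct_change <- function(new, old) 100 * (new - old) / old

# RMSE values copied from the summary table
train_change <- pct_change(new = 1.2101171, old = 1.3492551)
test_change  <- pct_change(new = 1.1433207, old = 1.3685239)

print(round(c(train = train_change, test = test_change), 2))
```

A negative value means the baseline predictors have a lower RMSE than the raw average.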