Instruction

In this first assignment, we will attempt to predict ratings with very little information. We will first look at just raw averages across all (training dataset) users. We will then account for “bias” by normalizing across users and across items. We will work with ratings in a user-item matrix, where each rating may be (1) assigned to a training dataset, (2) assigned to a test dataset, or (3) missing.

Introduction

In this project, we build a simple recommender system for McDonald’s Combo Meals. The system recommends Combo Meals to customers who may be interested in trying something other than their usual order. The data (ratings of Combo Meals) was collected by the three group members from themselves and their friends. The dataset contains 10 users (customers) and their ratings of 11 items (McDonald’s Combo Meals).

Load Packages

library(tidyverse)
library(rmdformats)
library(knitr)
library(kableExtra)
library(formattable)
library(caTools)
library(yardstick)
library(gplots)

Read Data

data <- read_csv('https://raw.githubusercontent.com/oggyluky11/DATA-612-2020-SUMMER/master/Project%201/McDonald%20Meal%20Friend%20Rating.csv')
data

Data Exploration

The data is in long format. The summary below shows that not all customers (users) have rated all 11 products (items); only Friend 1 has rated every product.

After reshaping the data into a user-item matrix, we observe 23 missing values among the 110 user-item combinations (about 21%). The data is therefore not very sparse, as the heatmap below also shows.

Before reshaping:

data %>% 
  mutate_if(is.character,as.factor) %>% 
  summary(maxsum = 20)

##      Customer                               Combo Meal     Rating     
##  Frank   : 8   10 Piece Chicken McNuggets        : 8   Min.   :1.000  
##  Friend 1:11   Artisan Grilled Chicken Sandwich  : 9   1st Qu.:2.000  
##  Friend 2:10   Bacon, Egg & Cheese Biscuit       : 7   Median :3.000  
##  Friend 3: 8   Big Mac                           :10   Mean   :2.816  
##  Friend 4: 8   Buttermilk Crispy Chicken Sandwich: 7   3rd Qu.:4.000  
##  Friend 5:10   Cheese burger                     : 7   Max.   :5.000  
##  Friend 6: 7   Double Quarter Pounder w Cheese   : 7                  
##  Friend 7: 9   Egg McMuffin                      : 7                  
##  Ricki   : 8   Filet-O-Fish                      : 9                  
##  Shirley : 8   Quarter Pounder w Cheese          : 8                  
##                Sausage McMuffin w Egg            : 8

After reshaping:

data %>% 
  spread(key = `Combo Meal`, value = Rating) %>%
  gather(key = `Combo Meal`, value = Rating, - Customer) %>%
  mutate_if(is.character,as.factor) %>% 
  summary(maxsum = 20)

##      Customer                               Combo Meal     Rating     
##  Frank   :11   10 Piece Chicken McNuggets        :10   Min.   :1.000  
##  Friend 1:11   Artisan Grilled Chicken Sandwich  :10   1st Qu.:2.000  
##  Friend 2:11   Bacon, Egg & Cheese Biscuit       :10   Median :3.000  
##  Friend 3:11   Big Mac                           :10   Mean   :2.816  
##  Friend 4:11   Buttermilk Crispy Chicken Sandwich:10   3rd Qu.:4.000  
##  Friend 5:11   Cheese burger                     :10   Max.   :5.000  
##  Friend 6:11   Double Quarter Pounder w Cheese   :10   NA's   :23     
##  Friend 7:11   Egg McMuffin                      :10                  
##  Ricki   :11   Filet-O-Fish                      :10                  
##  Shirley :11   Quarter Pounder w Cheese          :10                  
##                Sausage McMuffin w Egg            :10

Heatmap:

data %>%
  spread(key = `Combo Meal`, value = Rating) %>%
  column_to_rownames('Customer') %>%
  as.matrix() %>%
  heatmap.2(trace = 'none',
            density.info = 'none',
            dendrogram = 'none',
            Rowv = FALSE,
            Colv = FALSE,
            col = colorRampPalette(c("grey", "deeppink4"))(n = 299))
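
The sparsity figure quoted in this section can be checked by hand. The sketch below does not read the CSV; it assumes the per-customer rating counts copied from the "Before reshaping" summary, and the variable names are illustrative only.

```r
# Ratings per customer, copied from the "Before reshaping" summary
# (hard-coded here so the check runs without downloading the data).
ratings_per_customer <- c(Frank = 8, `Friend 1` = 11, `Friend 2` = 10,
                          `Friend 3` = 8, `Friend 4` = 8, `Friend 5` = 10,
                          `Friend 6` = 7, `Friend 7` = 9, Ricki = 8, Shirley = 8)
n_items <- 11

n_cells   <- length(ratings_per_customer) * n_items  # 10 users x 11 items
n_rated   <- sum(ratings_per_customer)               # observed ratings
n_missing <- n_cells - n_rated                       # empty user-item cells
sparsity  <- n_missing / n_cells                     # share of empty cells

cat(sprintf('%d of %d cells are missing (%.1f%%)\n',
            n_missing, n_cells, 100 * sparsity))
```

Roughly one cell in five is empty, which is dense compared with typical recommender datasets.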

Separate Training Dataset & Test Dataset

The ratings of McDonald’s Combo Meals are then split into a training dataset (85%) and a test dataset (15%). In the Train_Test_Split table below, the Data_Group column shows training ratings in BLUE and test ratings in RED.

set.seed(3)
data_split <- data %>%
  mutate(Data_Group = sample.split(Rating, 0.85)) %>%
  mutate(Data_Group = if_else(Data_Group==TRUE, 'Train', 'Test'))
data_train <- data_split %>% 
  filter(Data_Group == 'Train') %>%
  select(-Data_Group)
data_test <- data_split %>% 
  filter(Data_Group == 'Test') %>%
  select(-Data_Group) 
data_split %>%
  mutate(
  Data_Group = cell_spec(Data_Group, background = if_else(Data_Group == 'Test',
                                                    'lightpink','lightblue'))) %>%
  kable(escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "bordered"),
                full_width = FALSE,
                font_size = 12) %>%
  add_header_above(c('Train_Test_Split' = 4)) %>%
  scroll_box(width = "100%", height = "400px")

Train_Test_Split
Customer	Combo Meal	Rating	Data_Group
Ricki	Big Mac	3	Train
Ricki	Cheese burger	1	Test
Ricki	Quarter Pounder w Cheese	3	Test
Ricki	10 Piece Chicken McNuggets	4	Test
Ricki	Filet-O-Fish	5	Train
Ricki	Artisan Grilled Chicken Sandwich	4	Train
Ricki	Buttermilk Crispy Chicken Sandwich	2	Train
Ricki	Bacon, Egg & Cheese Biscuit	2	Train
Shirley	Big Mac	2	Train
Shirley	Cheese burger	3	Train
Shirley	Quarter Pounder w Cheese	4	Train
Shirley	10 Piece Chicken McNuggets	5	Test
Shirley	Filet-O-Fish	5	Train
Shirley	Artisan Grilled Chicken Sandwich	2	Train
Shirley	Buttermilk Crispy Chicken Sandwich	4	Train
Shirley	Sausage McMuffin w Egg	5	Test
Frank	Big Mac	3	Train
Frank	Cheese burger	2	Train
Frank	Quarter Pounder w Cheese	3	Train
Frank	Double Quarter Pounder w Cheese	5	Train
Frank	10 Piece Chicken McNuggets	4	Train
Frank	Egg McMuffin	3	Train
Frank	Sausage McMuffin w Egg	4	Train
Frank	Bacon, Egg & Cheese Biscuit	2	Train
Friend 1	Big Mac	4	Train
Friend 1	Cheese burger	1	Train
Friend 1	Quarter Pounder w Cheese	2	Train
Friend 1	Double Quarter Pounder w Cheese	2	Test
Friend 1	10 Piece Chicken McNuggets	3	Train
Friend 1	Filet-O-Fish	1	Train
Friend 1	Artisan Grilled Chicken Sandwich	2	Test
Friend 1	Buttermilk Crispy Chicken Sandwich	1	Train
Friend 1	Egg McMuffin	3	Train
Friend 1	Sausage McMuffin w Egg	3	Train
Friend 1	Bacon, Egg & Cheese Biscuit	1	Train
Friend 2	Big Mac	2	Train
Friend 2	Cheese burger	5	Train
Friend 2	Quarter Pounder w Cheese	1	Train
Friend 2	Double Quarter Pounder w Cheese	2	Train
Friend 2	Filet-O-Fish	3	Train
Friend 2	Artisan Grilled Chicken Sandwich	2	Train
Friend 2	Buttermilk Crispy Chicken Sandwich	5	Train
Friend 2	Egg McMuffin	3	Train
Friend 2	Sausage McMuffin w Egg	1	Train
Friend 2	Bacon, Egg & Cheese Biscuit	5	Train
Friend 3	Big Mac	2	Train
Friend 3	Quarter Pounder w Cheese	3	Train
Friend 3	Double Quarter Pounder w Cheese	3	Train
Friend 3	10 Piece Chicken McNuggets	4	Train
Friend 3	Filet-O-Fish	3	Train
Friend 3	Artisan Grilled Chicken Sandwich	2	Train
Friend 3	Buttermilk Crispy Chicken Sandwich	3	Test
Friend 3	Sausage McMuffin w Egg	4	Train
Friend 4	Big Mac	4	Train
Friend 4	Quarter Pounder w Cheese	2	Train
Friend 4	10 Piece Chicken McNuggets	5	Train
Friend 4	Filet-O-Fish	2	Train
Friend 4	Artisan Grilled Chicken Sandwich	4	Train
Friend 4	Egg McMuffin	2	Train
Friend 4	Sausage McMuffin w Egg	1	Test
Friend 4	Bacon, Egg & Cheese Biscuit	4	Train
Friend 5	Big Mac	1	Train
Friend 5	Cheese burger	1	Test
Friend 5	Double Quarter Pounder w Cheese	1	Train
Friend 5	10 Piece Chicken McNuggets	2	Train
Friend 5	Filet-O-Fish	3	Test
Friend 5	Artisan Grilled Chicken Sandwich	5	Train
Friend 5	Buttermilk Crispy Chicken Sandwich	1	Train
Friend 5	Egg McMuffin	4	Train
Friend 5	Sausage McMuffin w Egg	1	Train
Friend 5	Bacon, Egg & Cheese Biscuit	5	Train
Friend 6	Big Mac	5	Train
Friend 6	Quarter Pounder w Cheese	1	Train
Friend 6	Double Quarter Pounder w Cheese	3	Train
Friend 6	Filet-O-Fish	4	Test
Friend 6	Artisan Grilled Chicken Sandwich	1	Train
Friend 6	Egg McMuffin	4	Train
Friend 6	Sausage McMuffin w Egg	5	Train
Friend 7	Big Mac	2	Train
Friend 7	Cheese burger	2	Test
Friend 7	Double Quarter Pounder w Cheese	1	Train
Friend 7	10 Piece Chicken McNuggets	2	Train
Friend 7	Filet-O-Fish	2	Train
Friend 7	Artisan Grilled Chicken Sandwich	1	Train
Friend 7	Buttermilk Crispy Chicken Sandwich	3	Train
Friend 7	Egg McMuffin	1	Train
Friend 7	Bacon, Egg & Cheese Biscuit	4	Train
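
As a sanity check on the split, the counts in the Train_Test_Split table can be tallied directly. This is a minimal sketch with the counts entered by hand from the table above, not recomputed from `data_split`. It also confirms that every customer keeps at least one training rating, which the user-bias step later relies on.

```r
# Total and test-set ratings per customer, counted from the table above.
# Order: Ricki, Shirley, Frank, Friend 1, ..., Friend 7.
n_total <- c(8, 8, 8, 11, 10, 8, 8, 10, 7, 9)
n_test  <- c(3, 2, 0,  2,  0, 1, 1,  2, 1, 1)
n_train <- n_total - n_test

train_share <- sum(n_train) / sum(n_total)
cat(sprintf('train: %d, test: %d, train share: %.1f%%\n',
            sum(n_train), sum(n_test), 100 * train_share))

# Every customer needs at least one training rating to get a user bias.
all(n_train > 0)
```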

Create a User-Item Matrix

We reshape the complete dataset from long format to wide format, with customers as users (in rows) and products as items (in columns). Training and test ratings are distinguished by color, and missing values are shown as ‘?’ in the table below.

options(knitr.kable.NA = '?')
data_split %>% 
  mutate(Rating = cell_spec(Rating,'html',
                            background  = if_else(Data_Group == 'Test','lightpink','lightblue'))) %>%
  select(-Data_Group) %>%
  spread(`Combo Meal`, Rating) %>%
  arrange(desc(Customer)) %>%
  kable(escape = FALSE, caption = 'User-Item Matrix', align = 'lccccccccccc') %>%
  kable_styling(bootstrap_options = c("striped", "bordered"),
                full_width = FALSE,
                font_size = 12) %>%
  add_header_above(c('User', 'Item' = 11)) %>%
  footnote(symbol = 'The numbers in BLUE are training set, and those in RED are test set')

User-Item Matrix
User	Item
Customer	10 Piece Chicken McNuggets	Artisan Grilled Chicken Sandwich	Bacon, Egg & Cheese Biscuit	Big Mac	Buttermilk Crispy Chicken Sandwich	Cheese burger	Double Quarter Pounder w Cheese	Egg McMuffin	Filet-O-Fish	Quarter Pounder w Cheese	Sausage McMuffin w Egg
Shirley	5	2	?	2	4	3	?	?	5	4	5
Ricki	4	4	2	3	2	1	?	?	5	3	?
Friend 7	2	1	4	2	3	2	1	1	2	?	?
Friend 6	?	1	?	5	?	?	3	4	4	1	5
Friend 5	2	5	5	1	1	1	1	4	3	?	1
Friend 4	5	4	4	4	?	?	?	2	2	2	1
Friend 3	4	2	?	2	3	?	3	?	3	3	4
Friend 2	?	2	5	2	5	5	2	3	3	1	1
Friend 1	3	2	1	4	1	1	2	3	1	2	3
Frank	4	?	2	3	?	2	5	3	?	3	4
* The numbers in BLUE are training set, and those in RED are test set

Calculate Raw Average Rating

Calculate the raw average rating over all training ratings. This single value serves as the prediction for every user-item combination.

mean_train_raw <- mean(data_train$Rating)
print(str_c('Raw average rating for every user-item combination: ', as.character(mean_train_raw)))

## [1] "Raw average rating for every user-item combination: 2.82432432432432"

Calculate RMSE for Raw Average

Calculate the RMSE of the raw average rating for both the training dataset and the test dataset.

RMSE_train_raw <- (data_train$Rating - mean_train_raw)^2 %>% mean() %>% sqrt()
print(str_c('RMSE of raw average rating for training set: ', as.character(RMSE_train_raw)))

## [1] "RMSE of raw average rating for training set: 1.34925513092493"

RMSE_test_raw <- (data_test$Rating - mean_train_raw)^2 %>% mean() %>% sqrt()
print(str_c('RMSE of raw average rating for test set: ', as.character(RMSE_test_raw)))

## [1] "RMSE of raw average rating for test set: 1.3685239438972"
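
The test-set figure can be reproduced by hand. The sketch below assumes the 13 test ratings read off the Train_Test_Split table and the raw average printed earlier; `rmse_manual` is a hypothetical helper name, chosen so it does not mask `yardstick::rmse`.

```r
# Root mean squared error between observed ratings and predictions.
# `predicted` may be a single value (recycled) or a vector.
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# The 13 test-set ratings, in the order they appear in the split table.
test_ratings <- c(1, 3, 4, 5, 5, 2, 2, 3, 1, 1, 3, 4, 2)

# Raw average of the 74 training ratings, as printed in the report.
mean_train_raw <- 2.82432432432432

# Every test rating is predicted by the same constant.
rmse_test_raw <- rmse_manual(test_ratings, mean_train_raw)
round(rmse_test_raw, 4)
```

The result agrees with the test-set RMSE of about 1.3685 reported above.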

Calculate the Bias for Each User and Each Item

Using the training dataset, we calculate the bias for each user and each item.

From the bias table below, we observe that Friend 7 is the harshest customer (lowest user bias, about -0.82), while the most popular product is the 10 Piece Chicken McNuggets Combo Meal (highest item bias, about 0.51).

options(knitr.kable.NA = '')
user_bias_tb <- data_train %>% 
  group_by(Customer) %>%
  summarise(User_Bias = mean(Rating) - mean_train_raw)
item_bias_tb <- data_train %>%
  group_by(`Combo Meal`) %>%
  summarise(Item_Bias = mean(Rating) - mean_train_raw)
user_bias_tb %>% 
  arrange(desc(User_Bias)) %>%
  data.frame(row.names = NULL) %>%
  merge(item_bias_tb %>% 
          arrange(desc(Item_Bias)) %>%
          data.frame(row.names = NULL), by = 0, all = TRUE) %>%
  arrange(desc(User_Bias)) %>%
  select(-Row.names) %>%
  kable(escape = FALSE, caption = 'Bias') %>%
  add_header_above(c('Bias for User' = 2, 'Bias for Item' = 2))

Bias
Bias for User		Bias for Item
Customer	User_Bias	Combo.Meal	Item_Bias
Shirley	0.5090090	10 Piece Chicken McNuggets	0.5090090
Friend 4	0.4613900	Bacon, Egg & Cheese Biscuit	0.4613900
Frank	0.4256757	Filet-O-Fish	0.1756757
Ricki	0.3756757	Sausage McMuffin w Egg	0.1756757
Friend 6	0.3423423	Egg McMuffin	0.0328185
Friend 3	0.1756757	Big Mac	-0.0243243
Friend 2	0.0756757	Cheese burger	-0.0743243
Friend 5	-0.3243243	Buttermilk Crispy Chicken Sandwich	-0.1576577
Friend 1	-0.7132132	Artisan Grilled Chicken Sandwich	-0.1993243
Friend 7	-0.8243243	Double Quarter Pounder w Cheese	-0.3243243
		Quarter Pounder w Cheese	-0.5386100
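
To make the bias definition concrete, here is a worked example for one user and one item. It assumes the training ratings read off the Train_Test_Split table; the variable names are illustrative and not part of the project code.

```r
# Raw average of the training ratings, as printed earlier in the report.
mean_train_raw <- 2.82432432432432

# Shirley's six training ratings (Big Mac, Cheese burger,
# Quarter Pounder w Cheese, Filet-O-Fish, Artisan Grilled Chicken Sandwich,
# Buttermilk Crispy Chicken Sandwich).
shirley_train <- c(2, 3, 4, 5, 2, 4)

# Training ratings of the 10 Piece Chicken McNuggets
# (Frank, Friend 1, Friend 3, Friend 4, Friend 5, Friend 7).
mcnuggets_train <- c(4, 3, 4, 5, 2, 2)

# Bias = group mean minus the global training mean.
user_bias <- mean(shirley_train) - mean_train_raw
item_bias <- mean(mcnuggets_train) - mean_train_raw
round(c(user_bias = user_bias, item_bias = item_bias), 7)
```

Both groups happen to average 20/6, which is why Shirley and the McNuggets meal show the same bias of about 0.509 in the table.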

Calculate the Baseline Predictors

From the raw average rating and the corresponding user and item biases, we calculate the baseline predictor for every user-item combination, clipped to the valid rating range of 1 to 5, and present the results in the table below.
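
Written as a formula, the calculation looks as follows. The notation is an assumption introduced here for illustration: $\mu$ is the raw training average, $\bar{r}_u$ and $\bar{r}_i$ are the training means of user $u$ and item $i$, and $\hat{r}_{ui}$ is the baseline predictor.

```latex
b_u = \bar{r}_u - \mu, \qquad
b_i = \bar{r}_i - \mu, \qquad
\hat{r}_{ui} = \min\bigl(5,\ \max\bigl(1,\ \mu + b_u + b_i\bigr)\bigr)
```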

# Build every user-item pair, attach the training-set biases, and clip the
# prediction to the valid rating range [1, 5].
baseline_pred <- expand_grid(Customer = unique(data$Customer),
                             Combo_Meal = unique(data$`Combo Meal`)) %>%
  left_join(user_bias_tb, by = 'Customer') %>%
  left_join(item_bias_tb, by = c('Combo_Meal' = 'Combo Meal')) %>%
  mutate(Baseline_Predictor = pmax(pmin(mean_train_raw + User_Bias + Item_Bias, 5), 1)) %>%
  select(Customer, Combo_Meal, Baseline_Predictor)
baseline_pred %>%
  spread(key = 'Combo_Meal', value = 'Baseline_Predictor') %>%
  kable(caption = 'Baseline Predictors') %>%
  kable_styling(bootstrap_options = "striped", 
                full_width = FALSE,
                font_size = 12) %>%
  add_header_above(c('User', 'Item' = 11))

Baseline Predictors
User	Item
Customer	10 Piece Chicken McNuggets	Artisan Grilled Chicken Sandwich	Bacon, Egg & Cheese Biscuit	Big Mac	Buttermilk Crispy Chicken Sandwich	Cheese burger	Double Quarter Pounder w Cheese	Egg McMuffin	Filet-O-Fish	Quarter Pounder w Cheese	Sausage McMuffin w Egg
Frank	3.759009	3.050676	3.711390	3.225676	3.092342	3.175676	2.925676	3.282818	3.425676	2.711390	3.425676
Friend 1	2.620120	1.911787	2.572501	2.086787	1.953453	2.036787	1.786787	2.143930	2.286787	1.572501	2.286787
Friend 2	3.409009	2.700676	3.361390	2.875676	2.742342	2.825676	2.575676	2.932819	3.075676	2.361390	3.075676
Friend 3	3.509009	2.800676	3.461390	2.975676	2.842342	2.925676	2.675676	3.032818	3.175676	2.461390	3.175676
Friend 4	3.794723	3.086390	3.747104	3.261390	3.128057	3.211390	2.961390	3.318533	3.461390	2.747104	3.461390
Friend 5	3.009009	2.300676	2.961390	2.475676	2.342342	2.425676	2.175676	2.532818	2.675676	1.961390	2.675676
Friend 6	3.675676	2.967342	3.628057	3.142342	3.009009	3.092342	2.842342	3.199485	3.342342	2.628057	3.342342
Friend 7	2.509009	1.800676	2.461390	1.975676	1.842342	1.925676	1.675676	2.032818	2.175676	1.461390	2.175676
Ricki	3.709009	3.000676	3.661390	3.175676	3.042342	3.125676	2.875676	3.232819	3.375676	2.661390	3.375676
Shirley	3.842342	3.134009	3.794723	3.309009	3.175676	3.259009	3.009009	3.366152	3.509009	2.794723	3.509009
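
One cell of the table can be reproduced by hand to show how the pieces combine. The sketch below assumes the rounded bias values from the Bias table; `baseline_predict` is a hypothetical helper, and the two extreme bias pairs at the end are made-up inputs that only demonstrate the clipping.

```r
# Baseline predictor: global mean plus user and item bias,
# clipped to the rating scale [lo, hi].
baseline_predict <- function(mu, user_bias, item_bias, lo = 1, hi = 5) {
  pmax(pmin(mu + user_bias + item_bias, hi), lo)
}

mu             <- 2.82432432432432  # raw training average from the report
frank_bias     <- 0.4256757         # user bias for Frank (Bias table)
mcnuggets_bias <- 0.5090090         # item bias for McNuggets (Bias table)

# Frank x 10 Piece Chicken McNuggets, about 3.759 as in the table.
frank_mcnuggets <- baseline_predict(mu, frank_bias, mcnuggets_bias)
round(frank_mcnuggets, 6)

# Hypothetical extreme biases, to show the clipping at both ends.
clipped_high <- baseline_predict(mu, 1.5, 1.2)    # raw sum exceeds 5
clipped_low  <- baseline_predict(mu, -1.5, -1.0)  # raw sum falls below 1
c(clipped_high, clipped_low)
```

In this dataset no prediction reaches either bound, so the clipping never takes effect in the table above.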

Calculate RMSE for Baseline Predictors

Finally, we calculate the RMSE of the baseline predictors for both the training dataset and the test dataset.

RMSE_train_baseline_pred <- data_train %>% 
  left_join(baseline_pred, by = c('Customer' = 'Customer', 'Combo Meal' = 'Combo_Meal')) %>%
  rmse(Rating, Baseline_Predictor) %>%
  .$`.estimate`
print(str_c('RMSE of baseline predictor for training set: ', RMSE_train_baseline_pred))

## [1] "RMSE of baseline predictor for training set: 1.21011708247286"

RMSE_test_baseline_pred <- data_test %>% 
  left_join(baseline_pred, by = c('Customer' = 'Customer', 'Combo Meal' = 'Combo_Meal')) %>%
  rmse(Rating, Baseline_Predictor) %>%
  .$`.estimate`
print(str_c('RMSE of baseline predictor for test set: ', RMSE_test_baseline_pred))

## [1] "RMSE of baseline predictor for test set: 1.14332066988903"
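
The test-set result can be cross-checked without `yardstick`. The sketch below pairs each of the 13 test ratings with its value from the Baseline Predictors table; both vectors are typed in by hand from the tables above.

```r
# Observed test ratings, in the order of the Train_Test_Split table.
actual <- c(1, 3, 4,      # Ricki: Cheese burger, Quarter Pounder, McNuggets
            5, 5,         # Shirley: McNuggets, Sausage McMuffin w Egg
            2, 2,         # Friend 1: Double Quarter Pounder, Artisan Grilled
            3,            # Friend 3: Buttermilk Crispy Chicken Sandwich
            1,            # Friend 4: Sausage McMuffin w Egg
            1, 3,         # Friend 5: Cheese burger, Filet-O-Fish
            4,            # Friend 6: Filet-O-Fish
            2)            # Friend 7: Cheese burger

# Matching entries from the Baseline Predictors table.
predicted <- c(3.125676, 2.661390, 3.709009,
               3.842342, 3.509009,
               1.786787, 1.911787,
               2.842342,
               3.461390,
               2.425676, 2.675676,
               3.342342,
               1.925676)

rmse_test_baseline <- sqrt(mean((actual - predicted)^2))
round(rmse_test_baseline, 4)
```

The two largest errors come from ratings of 1 given by otherwise generous raters (Ricki and Friend 4), which a user-plus-item bias model cannot anticipate.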

Summary

We calculated the RMSE of the raw average rating and of the baseline predictors for both the training dataset and the test dataset. To compare them, we compute the percentage change in RMSE from the raw average to the baseline predictors for each dataset.

The results below show that using the baseline predictors instead of the raw average rating reduces the RMSE by 10.31% on the training dataset and by 16.46% on the test dataset.

Therefore, the baseline predictors perform better than the raw average on both datasets.

data.frame(Metrics = c('RMSE_Train_Raw', 
                       'RMSE_Train_Baseline_Pred', 
                       '% Change in RMSE of training set',
                       'RMSE_Test_Raw', 
                       'RMSE_Test_Baseline_Pred',
                       '% Change in RMSE of Test Set'),
Value = c(RMSE_train_raw, 
          RMSE_train_baseline_pred, 
          (RMSE_train_baseline_pred - RMSE_train_raw)/RMSE_train_raw,
          RMSE_test_raw, 
          RMSE_test_baseline_pred,
          (RMSE_test_baseline_pred - RMSE_test_raw)/RMSE_test_raw),
stringsAsFactors = FALSE) %>%
  mutate(Value = case_when(str_detect(Metrics, '%') ~ cell_spec(str_c(as.character(round(Value*100,2)),'%'), bold = TRUE),
                           TRUE ~ if_else(str_detect(Metrics,'Train'),
                                          color_bar('lightblue')(Value),
                                          color_bar('lightpink')(Value))),
         Metrics = case_when(str_detect(Metrics, '%') ~ cell_spec(Metrics, bold = TRUE),
                             TRUE ~ Metrics)) %>%
  kable(escape = FALSE, caption = 'Summary',align = 'lr') %>%
  kable_styling('hover') %>%
  row_spec(c(1,2,4,5), background = 'white') %>%
  row_spec(c(3,6), background = 'lightgrey')

Summary
Metrics	Value
RMSE_Train_Raw	1.3492551
RMSE_Train_Baseline_Pred	1.2101171
% Change in RMSE of training set	-10.31%
RMSE_Test_Raw	1.3685239
RMSE_Test_Baseline_Pred	1.1433207
% Change in RMSE of Test Set	-16.46%
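
The percentage rows in the table follow from a relative-change calculation. A minimal sketch, using the rounded RMSE values from the table and a hypothetical helper `pct_change`:

```r
# Relative change from `old` to `new`; negative means the error went down.
pct_change <- function(new, old) (new - old) / old

# RMSE values copied from the Summary table.
train_change <- pct_change(new = 1.2101171, old = 1.3492551)
test_change  <- pct_change(new = 1.1433207, old = 1.3685239)

round(100 * c(train = train_change, test = test_change), 2)
```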

DATA 612 Project 1 - Global Baseline Predictors and RMSE