Description

Restaurant Rating Recommender System

This is a recommender system with 20 users and their respective ratings on 5 restaurants in New York It recommends restaurants to users based on other user ratings.

Dataset

Step-1:

I scrapped some data from online to get this rating on the 5 restaurants

Load Data

Step-2:

Load the necessary libraries

library(pander)
library(ggplot2)
library(knitr)
library(dplyr)
library(reshape2)

Step-3:

Load the rest_ratings dataset and create a user-item matrix

# load csv into data variable 
data <- read.csv("restratings.csv",row.names = 1)

# convert data into a matrix
data <- as.matrix(data)
pander(data)
Table continues below
  Andanada Aquagrill Asiate Balthazar Barbetta
Jakob 3 NA 4 4 3
Helen 4 5 3 NA 2
Viktoria NA 3 3 3 3
Zackary 5 5 5 5 NA
Kloe 2 3 3 NA 4
Petra 3 4 4 3 3
Keiron 2 2 NA 2 2
Beverly 3 NA 4 4 4
Blythe 2 3 3 NA 4
Chay 3 NA 3 3 3
Whitehouse 1 1 1 NA 1
Needham NA 3 4 3 4
Beattie 3 NA 4 4 3
Livingston 4 4 3 NA 2
Dupont NA 3 4 4 4
Liu 5 5 NA 5 5
Strickland 2 NA 3 3 4
Howe 3 NA 3 3 3
Akhtar 2 2 2 NA 2
Whitworth 3 NA 4 4 4
  Bouchon Bouley Caravaggio
Jakob 5 5 5
Helen 3 3 NA
Viktoria 4 4 3
Zackary 2 NA 2
Kloe NA 4 4
Petra 3 3 NA
Keiron NA 3 3
Beverly 1 1 NA
Blythe 3 4 3
Chay NA 4 4
Whitehouse 4 3 NA
Needham 5 5 5
Beattie 3 3 NA
Livingston 4 4 3
Dupont 2 NA 2
Liu NA 4 4
Strickland 3 3 NA
Howe NA 3 3
Akhtar 1 1 NA
Whitworth 3 4 3

Training and Test datasets

Step-4:

Break your ratings into separate training and test datasets.

Lets split the restaurant dataset into two. Training and Test. I selected 12 reviews from training. I will replace those with NA in the training set. NA was used so it would be omitted from our calculations. In the test dataset I only kept values identified for testing. the others were replaced with NA.

test_rows <- c(1,3,4,5,6,7,14,13,19,20,12,14)
test_cols <- c(1,4,2,3,4,5,2,3,3,4,5,3)
test_indices <- cbind(test_rows,test_cols)

data_train <- data
data_train[test_indices] <- NA

data_test <- data
data_test[test_indices] <- 0
data_test[data_test > 0] <- NA
data_test[test_indices] <- data[test_indices]

Train Dataset

data_train
           Andanada Aquagrill Asiate Balthazar Barbetta Bouchon Bouley
Jakob            NA        NA      4         4        3       5      5
Helen             4         5      3        NA        2       3      3
Viktoria         NA         3      3        NA        3       4      4
Zackary           5        NA      5         5       NA       2     NA
Kloe              2         3     NA        NA        4      NA      4
Petra             3         4      4        NA        3       3      3
Keiron            2         2     NA         2       NA      NA      3
Beverly           3        NA      4         4        4       1      1
Blythe            2         3      3        NA        4       3      4
           Caravaggio
Jakob               5
Helen              NA
Viktoria            3
Zackary             2
Kloe                4
Petra              NA
Keiron              3
Beverly            NA
Blythe              3
 [ reached getOption("max.print") -- omitted 11 rows ]

7 Test Dataset

7

data_test
           Andanada Aquagrill Asiate Balthazar Barbetta Bouchon Bouley
Jakob             3        NA     NA        NA       NA      NA     NA
Helen            NA        NA     NA        NA       NA      NA     NA
Viktoria         NA        NA     NA         3       NA      NA     NA
Zackary          NA         5     NA        NA       NA      NA     NA
Kloe             NA        NA      3        NA       NA      NA     NA
Petra            NA        NA     NA         3       NA      NA     NA
Keiron           NA        NA     NA        NA        2      NA     NA
Beverly          NA        NA     NA        NA       NA      NA     NA
Blythe           NA        NA     NA        NA       NA      NA     NA
           Caravaggio
Jakob              NA
Helen              NA
Viktoria           NA
Zackary            NA
Kloe               NA
Petra              NA
Keiron             NA
Beverly            NA
Blythe             NA
 [ reached getOption("max.print") -- omitted 11 rows ]

Calculations

Using training data, calculate the raw average (mean) rating for every user-item combination.

This function computes the raw average of the user-item matrix

Mean rating for each user in the restaurant train dataset

user_means <- rowMeans(data_train,na.rm = TRUE)
user_means_df <-  data.frame(as.list(user_means))
# change user means from wide to long 
user_means_df <- tidyr::gather(user_means_df,"user") 
p1 <- ggplot(user_means_df,aes(x=user, y=value,fill=user))+ geom_bar(stat="identity") + labs(title="Plot of Mean User ratings",x="User",y="Avg. Rating")
colnames(user_means_df) <-c("User","Rating")
pander(user_means)
Table continues below
Jakob Helen Viktoria Zackary Kloe Petra Keiron Beverly Blythe
4.333 3.333 3.333 3.8 3.4 3.333 2.4 2.833 3.143
Table continues below
Chay Whitehouse Needham Beattie Livingston Dupont Liu
3.333 1.833 4.167 3.2 3.4 3.167 4.667
Strickland Howe Akhtar Whitworth
3 3 1.6 3.5
p1

Mean rating for each restaurant in the User_restaurant7 train dataset.

restaurant_means <- colMeans(data_train,na.rm = TRUE)
restaurant_means_df <-  data.frame(as.list(restaurant_means))
# change user means from wide to long 
restaurant_means_df <- tidyr::gather(restaurant_means_df,"restaurant") 
p2 <- ggplot(restaurant_means_df,aes(x=7, y=value,fill=restaurant))+ geom_bar(stat="identity") + labs(title="Plot of restaurant Average Rating",x="restaurant",y="Avg. Rating")
colnames(restaurant_means_df) <-c("restaurant","Rating")
pander(restaurant_means)
Andanada Aquagrill Asiate Balthazar Barbetta Bouchon Bouley Caravaggio
2.938 3.091 3.429 3.636 3.176 3.067 3.389 3.385
p2

Calculate the RMSE for raw average for both your training data and your test data.

Rating for every user-item combination, for Test and Train data sets

raw_test <- mean(data_test,na.rm = TRUE)
raw_test_mat <- data_test
raw_test_mat[] <- raw_test
raw_test
[1] 3.333333
raw_train_mat <- data_train 
raw_train <- mean(data_train,na.rm = TRUE)
raw_train_mat[] <-raw_train
raw_train
[1] 3.252174

RMSE for Test and Train data sets

#find squre difference 
squareDiff_train <- (data_train - raw_train_mat)^2
# find mean of squareDiff
squareDiff_train_mean <- mean(squareDiff_train,na.rm = TRUE)
#find square root
rmse_train <- sqrt(squareDiff_train_mean)
# train test 
squareDiff_test <- (data_test - raw_test_mat)^2
# find mean of squareDiff
squareDiff_test_mean <- mean(squareDiff_test,na.rm = TRUE)
#find square root
rmse_test <- sqrt(squareDiff_test_mean)

RMSE for train dataset

rmse_train
[1] 1.053718

RMSE for test dataset

rmse_test
[1] 0.8498366

Using your training data, calculate the bias for each user and each item.

User Bias

## user bias
user_bias <- user_means - raw_train
user_bias_df <-  data.frame(as.list(user_bias))
user_bias_df <- tidyr::gather(user_bias_df,"user")
colnames(user_bias_df) <-c("User","Bias")
pander(user_bias_df)
User Bias
Jakob 1.081
Helen 0.08116
Viktoria 0.08116
Zackary 0.5478
Kloe 0.1478
Petra 0.08116
Keiron -0.8522
Beverly -0.4188
Blythe -0.1093
Chay 0.08116
Whitehouse -1.419
Needham 0.9145
Beattie -0.05217
Livingston 0.1478
Dupont -0.08551
Liu 1.414
Strickland -0.2522
Howe -0.2522
Akhtar -1.652
Whitworth 0.2478

restaurant Bias

#restaurant bias
restaurant_bias <- restaurant_means - raw_train
restaurant_bias_df <-  data.frame(as.list(restaurant_bias))
restaurant_bias_df <- tidyr::gather(restaurant_bias_df,"restaurant")
colnames(restaurant_bias_df) <-c("restaurant","Bias")
pander(restaurant_bias_df)
restaurant Bias
Andanada -0.3147
Aquagrill -0.1613
Asiate 0.1764
Balthazar 0.3842
Barbetta -0.0757
Bouchon -0.1855
Bouley 0.1367
Caravaggio 0.1324

From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.

# raw average + user bias + restaurant bias
calBaseLine <- function(in_matrix, restaurant_bias_in,user_bias_in,raw_average)
{
  out_matrix <- in_matrix
  row_count <-1
  for(item in 1:nrow(in_matrix))
  {
    col_count <-1
    for(colItem in 1: ncol(in_matrix))
    {
      #out_matrix[row_count,col_count] <- 0
     out_matrix[row_count,col_count] <- raw_average[1] + user_bias_in[[row_count]] +  restaurant_bias_in[[col_count]]
      col_count <- col_count +1  
    }
    row_count <- row_count +1
  }
return(out_matrix)
}
base_pred <- calBaseLine(data_train,restaurant_bias,user_bias,raw_train)
pander(base_pred)
Table continues below
  Andanada Aquagrill Asiate Balthazar Barbetta
Jakob 4.019 4.172 4.51 4.718 4.258
Helen 3.019 3.172 3.51 3.718 3.258
Viktoria 3.019 3.172 3.51 3.718 3.258
Zackary 3.485 3.639 3.976 4.184 3.724
Kloe 3.085 3.239 3.576 3.784 3.324
Petra 3.019 3.172 3.51 3.718 3.258
Keiron 2.085 2.239 2.576 2.784 2.324
Beverly 2.519 2.672 3.01 3.218 2.758
Blythe 2.828 2.982 3.319 3.527 3.067
Chay 3.019 3.172 3.51 3.718 3.258
Whitehouse 1.519 1.672 2.01 2.218 1.758
Needham 3.852 4.005 4.343 4.551 4.091
Beattie 2.885 3.039 3.376 3.584 3.124
Livingston 3.085 3.239 3.576 3.784 3.324
Dupont 2.852 3.005 3.343 3.551 3.091
Liu 4.352 4.505 4.843 5.051 4.591
Strickland 2.685 2.839 3.176 3.384 2.924
Howe 2.685 2.839 3.176 3.384 2.924
Akhtar 1.285 1.439 1.776 1.984 1.524
Whitworth 3.185 3.339 3.676 3.884 3.424
  Bouchon Bouley Caravaggio
Jakob 4.148 4.47 4.466
Helen 3.148 3.47 3.466
Viktoria 3.148 3.47 3.466
Zackary 3.614 3.937 3.932
Kloe 3.214 3.537 3.532
Petra 3.148 3.47 3.466
Keiron 2.214 2.537 2.532
Beverly 2.648 2.97 2.966
Blythe 2.957 3.28 3.275
Chay 3.148 3.47 3.466
Whitehouse 1.648 1.97 1.966
Needham 3.981 4.303 4.299
Beattie 3.014 3.337 3.332
Livingston 3.214 3.537 3.532
Dupont 2.981 3.303 3.299
Liu 4.481 4.803 4.799
Strickland 2.814 3.137 3.132
Howe 2.814 3.137 3.132
Akhtar 1.414 1.737 1.732
Whitworth 3.314 3.637 3.632

Calculate the RMSE for the baseline predictors for both your training data and your test data.

## test data
# finding Error
data_err <- data_test - base_pred
# squaring error
data_err <- (data_err)^2
#finding average 
data_rmse_test<- mean(data_err[test_indices])
# square root 
data_rmse_test<- sqrt(data_rmse_test)
## training data
# finding Error
data_err_train <- data_train - base_pred
# squaring error
data_err_train <- (data_err_train)^2
#finding average 
data_rmse_train <- mean(data_err_train,na.rm = TRUE)
# square root 
data_rmse_train<- sqrt(data_rmse_train)

RMSE for test data

data_rmse_test
[1] 0.6910205

RMSE for train data

data_rmse_train
[1] 0.7791674

Summarizing results

Lets calculate the percentage improvements based on the original (simple average) and baseline predictor (including bias) RMSE numbers for both Test and Train data sets.

The results show that we see a 50% improvement in making a prediction for the ratings in the Training data set. Where as we see only 38% improvement in prediction for the Test data set. Both are positive however the Training data set yielded better prediction.

# Train data set
R1 <- rmse_train
R1_data <- data_rmse_train
Prediction_Improv_Train <- (1-(R1_data/R1))*100
Prediction_Improv_Train
[1] 26.0554
# Test data set
R2 <- rmse_test
R2_data <- data_rmse_test
Prediction_Improv_Test <- (1-(R2_data/R2))*100
Prediction_Improv_Test
[1] 18.68783