Description

Restaurant Rating Recommender System

This is a recommender system built on the ratings that 20 users gave to 8 restaurants in New York. It recommends restaurants to a user based on other users' ratings.

Dataset

Step-1:

I scraped rating data for these restaurants from online sources.

Load Data

Step-2:

Load the necessary libraries

library(pander)
library(ggplot2)
library(knitr)
library(dplyr)
library(reshape2)

Step-3:

Load the rest_ratings dataset and create a user-item matrix

# load csv into data variable 
data <- read.csv("restratings.csv",row.names = 1)

# convert data into a matrix
data <- as.matrix(data)
pander(data)

Table continues below
	Andanada	Aquagrill	Asiate	Balthazar	Barbetta
Jakob	3	NA	4	4	3
Helen	4	5	3	NA	2
Viktoria	NA	3	3	3	3
Zackary	5	5	5	5	NA
Kloe	2	3	3	NA	4
Petra	3	4	4	3	3
Keiron	2	2	NA	2	2
Beverly	3	NA	4	4	4
Blythe	2	3	3	NA	4
Chay	3	NA	3	3	3
Whitehouse	1	1	1	NA	1
Needham	NA	3	4	3	4
Beattie	3	NA	4	4	3
Livingston	4	4	3	NA	2
Dupont	NA	3	4	4	4
Liu	5	5	NA	5	5
Strickland	2	NA	3	3	4
Howe	3	NA	3	3	3
Akhtar	2	2	2	NA	2
Whitworth	3	NA	4	4	4

	Bouchon	Bouley	Caravaggio
Jakob	5	5	5
Helen	3	3	NA
Viktoria	4	4	3
Zackary	2	NA	2
Kloe	NA	4	4
Petra	3	3	NA
Keiron	NA	3	3
Beverly	1	1	NA
Blythe	3	4	3
Chay	NA	4	4
Whitehouse	4	3	NA
Needham	5	5	5
Beattie	3	3	NA
Livingston	4	4	3
Dupont	2	NA	2
Liu	NA	4	4
Strickland	3	3	NA
Howe	NA	3	3
Akhtar	1	1	NA
Whitworth	3	4	3

Training and Test datasets

Step-4:

Break your ratings into separate training and test datasets.

Let's split the restaurant dataset into two parts: training and test. I selected 12 ratings to hold out and replaced them with NA in the training set, so that they are omitted from the training calculations. The test dataset keeps only those 12 held-out values; every other cell is replaced with NA.

test_rows <- c(1,3,4,5,6,7,14,13,19,20,12,14)
test_cols <- c(1,4,2,3,4,5,2,3,3,4,5,3)
test_indices <- cbind(test_rows,test_cols)

data_train <- data
data_train[test_indices] <- NA

# keep only the held-out ratings in the test set; every other cell becomes NA
data_test <- data
data_test[] <- NA
data_test[test_indices] <- data[test_indices]
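
As a cross-check on the split, the same hold-out logic can be wrapped in a small function that also verifies each held-out cell is unique and actually contains a rating. This is only a sketch on a toy matrix; `split_ratings`, `toy`, and `held_out` are hypothetical names, not part of the original analysis.

```r
# Hypothetical helper: hold out the given (row, col) cells of a user-item matrix.
split_ratings <- function(ratings, test_indices) {
  # every held-out pair should be unique and point at an observed rating
  stopifnot(anyDuplicated(test_indices) == 0,
            !anyNA(ratings[test_indices]))

  train <- ratings
  train[test_indices] <- NA                      # hide held-out cells from training

  test <- ratings
  test[] <- NA                                   # blank everything ...
  test[test_indices] <- ratings[test_indices]    # ... then restore held-out cells

  list(train = train, test = test)
}

# toy 3 x 3 matrix (filled column by column), not the restaurant data
toy <- matrix(c(3, 4, NA, 5, 2, 4, 1, NA, 3), nrow = 3,
              dimnames = list(c("u1", "u2", "u3"), c("r1", "r2", "r3")))
held_out <- cbind(row = c(1, 3), col = c(2, 3))
parts <- split_ratings(toy, held_out)
parts$train
parts$test
```

A held-out cell that is already NA in the source matrix would silently shrink the test set, which is why the check is made before splitting.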

Train Dataset

data_train

           Andanada Aquagrill Asiate Balthazar Barbetta Bouchon Bouley
Jakob            NA        NA      4         4        3       5      5
Helen             4         5      3        NA        2       3      3
Viktoria         NA         3      3        NA        3       4      4
Zackary           5        NA      5         5       NA       2     NA
Kloe              2         3     NA        NA        4      NA      4
Petra             3         4      4        NA        3       3      3
Keiron            2         2     NA         2       NA      NA      3
Beverly           3        NA      4         4        4       1      1
Blythe            2         3      3        NA        4       3      4
           Caravaggio
Jakob               5
Helen              NA
Viktoria            3
Zackary             2
Kloe                4
Petra              NA
Keiron              3
Beverly            NA
Blythe              3
 [ reached getOption("max.print") -- omitted 11 rows ]

Test Dataset

data_test

           Andanada Aquagrill Asiate Balthazar Barbetta Bouchon Bouley
Jakob             3        NA     NA        NA       NA      NA     NA
Helen            NA        NA     NA        NA       NA      NA     NA
Viktoria         NA        NA     NA         3       NA      NA     NA
Zackary          NA         5     NA        NA       NA      NA     NA
Kloe             NA        NA      3        NA       NA      NA     NA
Petra            NA        NA     NA         3       NA      NA     NA
Keiron           NA        NA     NA        NA        2      NA     NA
Beverly          NA        NA     NA        NA       NA      NA     NA
Blythe           NA        NA     NA        NA       NA      NA     NA
           Caravaggio
Jakob              NA
Helen              NA
Viktoria           NA
Zackary            NA
Kloe               NA
Petra              NA
Keiron             NA
Beverly            NA
Blythe             NA
 [ reached getOption("max.print") -- omitted 11 rows ]

Calculations

Using training data, calculate the raw average (mean) rating for every user-item combination.

The per-user and per-restaurant means are computed first, since the bias terms later on are derived from them.

Mean rating for each user in the restaurant train dataset

user_means <- rowMeans(data_train,na.rm = TRUE)
user_means_df <-  data.frame(as.list(user_means))
# change user means from wide to long 
user_means_df <- tidyr::gather(user_means_df,"user") 
p1 <- ggplot(user_means_df,aes(x=user, y=value,fill=user))+ geom_bar(stat="identity") + labs(title="Plot of Mean User ratings",x="User",y="Avg. Rating")
colnames(user_means_df) <-c("User","Rating")
pander(user_means)

Table continues below
Jakob	Helen	Viktoria	Zackary	Kloe	Petra	Keiron	Beverly	Blythe
4.333	3.333	3.333	3.8	3.4	3.333	2.4	2.833	3.143

Table continues below
Chay	Whitehouse	Needham	Beattie	Livingston	Dupont	Liu
3.333	1.833	4.167	3.2	3.4	3.167	4.667

Strickland	Howe	Akhtar	Whitworth
3	3	1.6	3.5

p1
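
Because `rowMeans()` already returns a named vector, the long user/rating table can also be built without the wide-to-long step. The following is a minimal sketch on a toy matrix, assuming only base R; the `toy_*` names are illustrative and not part of the original analysis.

```r
# toy 2 x 3 user-item matrix (filled column by column), not the restaurant data
toy <- matrix(c(3, 4, NA, 5, 2, 4), nrow = 2,
              dimnames = list(c("u1", "u2"), c("r1", "r2", "r3")))

# per-user mean, ignoring missing ratings
toy_user_means <- rowMeans(toy, na.rm = TRUE)

# long format directly from the names and values of the vector
toy_user_means_df <- data.frame(User   = names(toy_user_means),
                                Rating = unname(toy_user_means))
toy_user_means_df
```

The same pattern applies to `colMeans()` for per-restaurant means.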

Mean rating for each restaurant in the restaurant train dataset

restaurant_means <- colMeans(data_train,na.rm = TRUE)
restaurant_means_df <-  data.frame(as.list(restaurant_means))
# change restaurant means from wide to long 
restaurant_means_df <- tidyr::gather(restaurant_means_df,"restaurant") 
p2 <- ggplot(restaurant_means_df,aes(x=restaurant, y=value,fill=restaurant))+ geom_bar(stat="identity") + labs(title="Plot of restaurant Average Rating",x="restaurant",y="Avg. Rating")
colnames(restaurant_means_df) <-c("restaurant","Rating")
pander(restaurant_means)

Andanada	Aquagrill	Asiate	Balthazar	Barbetta	Bouchon	Bouley	Caravaggio
2.938	3.091	3.429	3.636	3.176	3.067	3.389	3.385

p2

Calculate the RMSE for raw average for both your training data and your test data.

Raw average rating for the Test and Train data sets, used as the prediction for every user-item combination

raw_test <- mean(data_test,na.rm = TRUE)
raw_test_mat <- data_test
raw_test_mat[] <- raw_test
raw_test

[1] 3.333333

raw_train_mat <- data_train 
raw_train <- mean(data_train,na.rm = TRUE)
raw_train_mat[] <-raw_train
raw_train

[1] 3.252174

RMSE for Test and Train data sets

## training data
# find squared difference
squareDiff_train <- (data_train - raw_train_mat)^2
# find mean of squareDiff
squareDiff_train_mean <- mean(squareDiff_train,na.rm = TRUE)
# find square root
rmse_train <- sqrt(squareDiff_train_mean)
## test data
# find squared difference
squareDiff_test <- (data_test - raw_test_mat)^2
# find mean of squareDiff
squareDiff_test_mean <- mean(squareDiff_test,na.rm = TRUE)
# find square root
rmse_test <- sqrt(squareDiff_test_mean)
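
The three steps above (square the differences, average them, take the square root) can be folded into one reusable helper. This is a minimal sketch; the function name `rmse` and the toy vectors are assumptions for illustration only.

```r
# Hypothetical helper: root mean squared error, skipping missing ratings.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# toy example: one rating is missing and is ignored
toy_actual    <- c(3, 4, NA, 5)
toy_predicted <- c(4, 4, 4, 4)
rmse(toy_actual, toy_predicted)
```

Since R arithmetic is element-wise, the same helper works unchanged on matrices of equal shape, such as a ratings matrix and a matrix of predictions.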

RMSE for train dataset

rmse_train

[1] 1.053718

RMSE for test dataset

rmse_test

[1] 0.8498366
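
Note that the test RMSE above measures the test ratings against the test set's own mean. A common alternative, which is not what this analysis does, is to predict the held-out ratings with the training raw average instead. The sketch below uses the 12 held-out ratings read off the original table and the training average reported earlier; the variable names are illustrative.

```r
# Held-out ratings at test_indices, in order, read from the original ratings table
test_ratings <- c(3, 3, 5, 3, 3, 2, 4, 4, 2, 4, 4, 3)

# raw training average as reported in the output above (assumed exact enough here)
raw_train_reported <- 3.252174

# RMSE against the test set's own mean (the approach used in this analysis)
rmse_own_mean <- sqrt(mean((test_ratings - mean(test_ratings))^2))

# RMSE against the training mean (alternative, shown for comparison only)
rmse_train_mean <- sqrt(mean((test_ratings - raw_train_reported)^2))

c(own_mean = rmse_own_mean, train_mean = rmse_train_mean)
```

The second figure comes out slightly higher (about 0.854), because a set's own mean always minimises its squared error.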

Using your training data, calculate the bias for each user and each item.

User Bias

## user bias
user_bias <- user_means - raw_train
user_bias_df <-  data.frame(as.list(user_bias))
user_bias_df <- tidyr::gather(user_bias_df,"user")
colnames(user_bias_df) <-c("User","Bias")
pander(user_bias_df)

User	Bias
Jakob	1.081
Helen	0.08116
Viktoria	0.08116
Zackary	0.5478
Kloe	0.1478
Petra	0.08116
Keiron	-0.8522
Beverly	-0.4188
Blythe	-0.1093
Chay	0.08116
Whitehouse	-1.419
Needham	0.9145
Beattie	-0.05217
Livingston	0.1478
Dupont	-0.08551
Liu	1.414
Strickland	-0.2522
Howe	-0.2522
Akhtar	-1.652
Whitworth	0.2478

Restaurant Bias

#restaurant bias
restaurant_bias <- restaurant_means - raw_train
restaurant_bias_df <-  data.frame(as.list(restaurant_bias))
restaurant_bias_df <- tidyr::gather(restaurant_bias_df,"restaurant")
colnames(restaurant_bias_df) <-c("restaurant","Bias")
pander(restaurant_bias_df)

restaurant	Bias
Andanada	-0.3147
Aquagrill	-0.1613
Asiate	0.1764
Balthazar	0.3842
Barbetta	-0.0757
Bouchon	-0.1855
Bouley	0.1367
Caravaggio	0.1324

From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
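
In symbols (the notation is assumed here, not taken from the original): let \mu be the raw training average, \bar{r}_u the mean training rating of user u, and \bar{r}_i the mean training rating of restaurant i. The biases and the baseline predictor are then:

```latex
b_u = \bar{r}_u - \mu, \qquad
b_i = \bar{r}_i - \mu, \qquad
\hat{r}_{ui} = \mu + b_u + b_i
```

For Jakob at Andanada, the rounded values reported above give 3.252 + 1.081 + (-0.315), or about 4.02, which agrees with the 4.019 in the baseline table.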

# baseline predictor = raw average + user bias + restaurant bias
calBaseLine <- function(in_matrix, restaurant_bias_in, user_bias_in, raw_average)
{
  out_matrix <- in_matrix
  # outer() builds every (user bias + restaurant bias) pair in one step:
  # rows follow the users, columns follow the restaurants.
  # Assigning through [] keeps the dimensions and dimnames of in_matrix.
  out_matrix[] <- raw_average[1] + outer(unlist(user_bias_in), unlist(restaurant_bias_in), "+")
  return(out_matrix)
}
base_pred <- calBaseLine(data_train,restaurant_bias,user_bias,raw_train)
pander(base_pred)

Table continues below
	Andanada	Aquagrill	Asiate	Balthazar	Barbetta
Jakob	4.019	4.172	4.51	4.718	4.258
Helen	3.019	3.172	3.51	3.718	3.258
Viktoria	3.019	3.172	3.51	3.718	3.258
Zackary	3.485	3.639	3.976	4.184	3.724
Kloe	3.085	3.239	3.576	3.784	3.324
Petra	3.019	3.172	3.51	3.718	3.258
Keiron	2.085	2.239	2.576	2.784	2.324
Beverly	2.519	2.672	3.01	3.218	2.758
Blythe	2.828	2.982	3.319	3.527	3.067
Chay	3.019	3.172	3.51	3.718	3.258
Whitehouse	1.519	1.672	2.01	2.218	1.758
Needham	3.852	4.005	4.343	4.551	4.091
Beattie	2.885	3.039	3.376	3.584	3.124
Livingston	3.085	3.239	3.576	3.784	3.324
Dupont	2.852	3.005	3.343	3.551	3.091
Liu	4.352	4.505	4.843	5.051	4.591
Strickland	2.685	2.839	3.176	3.384	2.924
Howe	2.685	2.839	3.176	3.384	2.924
Akhtar	1.285	1.439	1.776	1.984	1.524
Whitworth	3.185	3.339	3.676	3.884	3.424

	Bouchon	Bouley	Caravaggio
Jakob	4.148	4.47	4.466
Helen	3.148	3.47	3.466
Viktoria	3.148	3.47	3.466
Zackary	3.614	3.937	3.932
Kloe	3.214	3.537	3.532
Petra	3.148	3.47	3.466
Keiron	2.214	2.537	2.532
Beverly	2.648	2.97	2.966
Blythe	2.957	3.28	3.275
Chay	3.148	3.47	3.466
Whitehouse	1.648	1.97	1.966
Needham	3.981	4.303	4.299
Beattie	3.014	3.337	3.332
Livingston	3.214	3.537	3.532
Dupont	2.981	3.303	3.299
Liu	4.481	4.803	4.799
Strickland	2.814	3.137	3.132
Howe	2.814	3.137	3.132
Akhtar	1.414	1.737	1.732
Whitworth	3.314	3.637	3.632
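
A few baseline values fall outside the range of the observed ratings, for example 5.051 for Liu at Balthazar. The analysis here leaves them as they are. If clipping were wanted, and assuming a 1 to 5 rating scale (an assumption based on the observed values), a minimal sketch could look like this; `clip_ratings` is a hypothetical helper.

```r
# Hypothetical helper: clamp predictions to an assumed rating scale of 1 to 5.
clip_ratings <- function(pred, lower = 1, upper = 5) {
  # pmax/pmin work element-wise and keep the matrix shape of pred
  pmin(pmax(pred, lower), upper)
}

# toy predictions, two of them outside the assumed scale
toy_pred <- matrix(c(0.8, 3.5, 5.051, 4.2), nrow = 2)
clip_ratings(toy_pred)
```

Clipping would change the RMSE figures reported below, so it is shown only as an option.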

Calculate the RMSE for the baseline predictors for both your training data and your test data.

## test data
# finding Error
data_err <- data_test - base_pred
# squaring error
data_err <- (data_err)^2
#finding average 
data_rmse_test<- mean(data_err[test_indices])
# square root 
data_rmse_test<- sqrt(data_rmse_test)
## training data
# finding Error
data_err_train <- data_train - base_pred
# squaring error
data_err_train <- (data_err_train)^2
#finding average 
data_rmse_train <- mean(data_err_train,na.rm = TRUE)
# square root 
data_rmse_train<- sqrt(data_rmse_train)

RMSE for test data

data_rmse_test

[1] 0.6910205

RMSE for train data

data_rmse_train

[1] 0.7791674

Summarizing results

Let's calculate the percentage improvement from the original (simple average) RMSE to the baseline predictor (including bias) RMSE for both the Test and Train data sets.

The results show roughly a 26% improvement in RMSE on the Training data set, whereas the Test data set improves by only about 19%. Both are positive; however, the baseline predictor yields the larger gain on the Training data set.

# Train data set
R1 <- rmse_train
R1_data <- data_rmse_train
Prediction_Improv_Train <- (1-(R1_data/R1))*100
Prediction_Improv_Train

[1] 26.0554

# Test data set
R2 <- rmse_test
R2_data <- data_rmse_test
Prediction_Improv_Test <- (1-(R2_data/R2))*100
Prediction_Improv_Test

[1] 18.68783
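
The two calculations above share one formula, so they can be expressed as a single helper. This is a sketch using the RMSE values reported earlier as plain numbers; `pct_improvement` is a hypothetical name.

```r
# Hypothetical helper: percentage reduction in RMSE relative to the raw average.
pct_improvement <- function(rmse_raw, rmse_baseline) {
  (1 - rmse_baseline / rmse_raw) * 100
}

# RMSE values copied from the outputs above
improv_train <- pct_improvement(1.053718, 0.7791674)
improv_test  <- pct_improvement(0.8498366, 0.6910205)
round(c(train = improv_train, test = improv_test), 2)
```

A positive value means the baseline predictor has a lower RMSE than the raw average.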

DATA 612 Project 1 | Global Baseline Predictors and RMSE