Restaurant Rating Recommender System
This is a recommender system with 20 users and their respective ratings on 5 restaurants in New York It recommends restaurants to users based on other user ratings.
Step-1:
I scrapped some data from online to get this rating on the 5 restaurants
Step-2:
Load the necessary libraries
library(pander)
library(ggplot2)
library(knitr)
library(dplyr)
library(reshape2)
Step-3:
Load the rest_ratings dataset and create a user-item matrix
# load csv into data variable
data <- read.csv("restratings.csv",row.names = 1)
# convert data into a matrix
data <- as.matrix(data)
pander(data)
| Andanada | Aquagrill | Asiate | Balthazar | Barbetta | |
|---|---|---|---|---|---|
| Jakob | 3 | NA | 4 | 4 | 3 |
| Helen | 4 | 5 | 3 | NA | 2 |
| Viktoria | NA | 3 | 3 | 3 | 3 |
| Zackary | 5 | 5 | 5 | 5 | NA |
| Kloe | 2 | 3 | 3 | NA | 4 |
| Petra | 3 | 4 | 4 | 3 | 3 |
| Keiron | 2 | 2 | NA | 2 | 2 |
| Beverly | 3 | NA | 4 | 4 | 4 |
| Blythe | 2 | 3 | 3 | NA | 4 |
| Chay | 3 | NA | 3 | 3 | 3 |
| Whitehouse | 1 | 1 | 1 | NA | 1 |
| Needham | NA | 3 | 4 | 3 | 4 |
| Beattie | 3 | NA | 4 | 4 | 3 |
| Livingston | 4 | 4 | 3 | NA | 2 |
| Dupont | NA | 3 | 4 | 4 | 4 |
| Liu | 5 | 5 | NA | 5 | 5 |
| Strickland | 2 | NA | 3 | 3 | 4 |
| Howe | 3 | NA | 3 | 3 | 3 |
| Akhtar | 2 | 2 | 2 | NA | 2 |
| Whitworth | 3 | NA | 4 | 4 | 4 |
| Bouchon | Bouley | Caravaggio | |
|---|---|---|---|
| Jakob | 5 | 5 | 5 |
| Helen | 3 | 3 | NA |
| Viktoria | 4 | 4 | 3 |
| Zackary | 2 | NA | 2 |
| Kloe | NA | 4 | 4 |
| Petra | 3 | 3 | NA |
| Keiron | NA | 3 | 3 |
| Beverly | 1 | 1 | NA |
| Blythe | 3 | 4 | 3 |
| Chay | NA | 4 | 4 |
| Whitehouse | 4 | 3 | NA |
| Needham | 5 | 5 | 5 |
| Beattie | 3 | 3 | NA |
| Livingston | 4 | 4 | 3 |
| Dupont | 2 | NA | 2 |
| Liu | NA | 4 | 4 |
| Strickland | 3 | 3 | NA |
| Howe | NA | 3 | 3 |
| Akhtar | 1 | 1 | NA |
| Whitworth | 3 | 4 | 3 |
Step-4:
Break your ratings into separate training and test datasets.
Lets split the restaurant dataset into two. Training and Test. I selected 12 reviews from training. I will replace those with NA in the training set. NA was used so it would be omitted from our calculations. In the test dataset I only kept values identified for testing. the others were replaced with NA.
test_rows <- c(1,3,4,5,6,7,14,13,19,20,12,14)
test_cols <- c(1,4,2,3,4,5,2,3,3,4,5,3)
test_indices <- cbind(test_rows,test_cols)
data_train <- data
data_train[test_indices] <- NA
data_test <- data
data_test[test_indices] <- 0
data_test[data_test > 0] <- NA
data_test[test_indices] <- data[test_indices]
Train Dataset
data_train
Andanada Aquagrill Asiate Balthazar Barbetta Bouchon Bouley
Jakob NA NA 4 4 3 5 5
Helen 4 5 3 NA 2 3 3
Viktoria NA 3 3 NA 3 4 4
Zackary 5 NA 5 5 NA 2 NA
Kloe 2 3 NA NA 4 NA 4
Petra 3 4 4 NA 3 3 3
Keiron 2 2 NA 2 NA NA 3
Beverly 3 NA 4 4 4 1 1
Blythe 2 3 3 NA 4 3 4
Caravaggio
Jakob 5
Helen NA
Viktoria 3
Zackary 2
Kloe 4
Petra NA
Keiron 3
Beverly NA
Blythe 3
[ reached getOption("max.print") -- omitted 11 rows ]
7 Test Dataset
7
data_test
Andanada Aquagrill Asiate Balthazar Barbetta Bouchon Bouley
Jakob 3 NA NA NA NA NA NA
Helen NA NA NA NA NA NA NA
Viktoria NA NA NA 3 NA NA NA
Zackary NA 5 NA NA NA NA NA
Kloe NA NA 3 NA NA NA NA
Petra NA NA NA 3 NA NA NA
Keiron NA NA NA NA 2 NA NA
Beverly NA NA NA NA NA NA NA
Blythe NA NA NA NA NA NA NA
Caravaggio
Jakob NA
Helen NA
Viktoria NA
Zackary NA
Kloe NA
Petra NA
Keiron NA
Beverly NA
Blythe NA
[ reached getOption("max.print") -- omitted 11 rows ]
Using training data, calculate the raw average (mean) rating for every user-item combination.
This function computes the raw average of the user-item matrix
Mean rating for each user in the restaurant train dataset
user_means <- rowMeans(data_train,na.rm = TRUE)
user_means_df <- data.frame(as.list(user_means))
# change user means from wide to long
user_means_df <- tidyr::gather(user_means_df,"user")
p1 <- ggplot(user_means_df,aes(x=user, y=value,fill=user))+ geom_bar(stat="identity") + labs(title="Plot of Mean User ratings",x="User",y="Avg. Rating")
colnames(user_means_df) <-c("User","Rating")
pander(user_means)
| Jakob | Helen | Viktoria | Zackary | Kloe | Petra | Keiron | Beverly | Blythe |
|---|---|---|---|---|---|---|---|---|
| 4.333 | 3.333 | 3.333 | 3.8 | 3.4 | 3.333 | 2.4 | 2.833 | 3.143 |
| Chay | Whitehouse | Needham | Beattie | Livingston | Dupont | Liu |
|---|---|---|---|---|---|---|
| 3.333 | 1.833 | 4.167 | 3.2 | 3.4 | 3.167 | 4.667 |
| Strickland | Howe | Akhtar | Whitworth |
|---|---|---|---|
| 3 | 3 | 1.6 | 3.5 |
p1
Mean rating for each restaurant in the User_restaurant7 train dataset.
restaurant_means <- colMeans(data_train,na.rm = TRUE)
restaurant_means_df <- data.frame(as.list(restaurant_means))
# change user means from wide to long
restaurant_means_df <- tidyr::gather(restaurant_means_df,"restaurant")
p2 <- ggplot(restaurant_means_df,aes(x=7, y=value,fill=restaurant))+ geom_bar(stat="identity") + labs(title="Plot of restaurant Average Rating",x="restaurant",y="Avg. Rating")
colnames(restaurant_means_df) <-c("restaurant","Rating")
pander(restaurant_means)
| Andanada | Aquagrill | Asiate | Balthazar | Barbetta | Bouchon | Bouley | Caravaggio |
|---|---|---|---|---|---|---|---|
| 2.938 | 3.091 | 3.429 | 3.636 | 3.176 | 3.067 | 3.389 | 3.385 |
p2
Calculate the RMSE for raw average for both your training data and your test data.
Rating for every user-item combination, for Test and Train data sets
raw_test <- mean(data_test,na.rm = TRUE)
raw_test_mat <- data_test
raw_test_mat[] <- raw_test
raw_test
[1] 3.333333
raw_train_mat <- data_train
raw_train <- mean(data_train,na.rm = TRUE)
raw_train_mat[] <-raw_train
raw_train
[1] 3.252174
RMSE for Test and Train data sets
#find squre difference
squareDiff_train <- (data_train - raw_train_mat)^2
# find mean of squareDiff
squareDiff_train_mean <- mean(squareDiff_train,na.rm = TRUE)
#find square root
rmse_train <- sqrt(squareDiff_train_mean)
# train test
squareDiff_test <- (data_test - raw_test_mat)^2
# find mean of squareDiff
squareDiff_test_mean <- mean(squareDiff_test,na.rm = TRUE)
#find square root
rmse_test <- sqrt(squareDiff_test_mean)
RMSE for train dataset
rmse_train
[1] 1.053718
RMSE for test dataset
rmse_test
[1] 0.8498366
Using your training data, calculate the bias for each user and each item.
User Bias
## user bias
user_bias <- user_means - raw_train
user_bias_df <- data.frame(as.list(user_bias))
user_bias_df <- tidyr::gather(user_bias_df,"user")
colnames(user_bias_df) <-c("User","Bias")
pander(user_bias_df)
| User | Bias |
|---|---|
| Jakob | 1.081 |
| Helen | 0.08116 |
| Viktoria | 0.08116 |
| Zackary | 0.5478 |
| Kloe | 0.1478 |
| Petra | 0.08116 |
| Keiron | -0.8522 |
| Beverly | -0.4188 |
| Blythe | -0.1093 |
| Chay | 0.08116 |
| Whitehouse | -1.419 |
| Needham | 0.9145 |
| Beattie | -0.05217 |
| Livingston | 0.1478 |
| Dupont | -0.08551 |
| Liu | 1.414 |
| Strickland | -0.2522 |
| Howe | -0.2522 |
| Akhtar | -1.652 |
| Whitworth | 0.2478 |
restaurant Bias
#restaurant bias
restaurant_bias <- restaurant_means - raw_train
restaurant_bias_df <- data.frame(as.list(restaurant_bias))
restaurant_bias_df <- tidyr::gather(restaurant_bias_df,"restaurant")
colnames(restaurant_bias_df) <-c("restaurant","Bias")
pander(restaurant_bias_df)
| restaurant | Bias |
|---|---|
| Andanada | -0.3147 |
| Aquagrill | -0.1613 |
| Asiate | 0.1764 |
| Balthazar | 0.3842 |
| Barbetta | -0.0757 |
| Bouchon | -0.1855 |
| Bouley | 0.1367 |
| Caravaggio | 0.1324 |
From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
# raw average + user bias + restaurant bias
calBaseLine <- function(in_matrix, restaurant_bias_in,user_bias_in,raw_average)
{
out_matrix <- in_matrix
row_count <-1
for(item in 1:nrow(in_matrix))
{
col_count <-1
for(colItem in 1: ncol(in_matrix))
{
#out_matrix[row_count,col_count] <- 0
out_matrix[row_count,col_count] <- raw_average[1] + user_bias_in[[row_count]] + restaurant_bias_in[[col_count]]
col_count <- col_count +1
}
row_count <- row_count +1
}
return(out_matrix)
}
base_pred <- calBaseLine(data_train,restaurant_bias,user_bias,raw_train)
pander(base_pred)
| Andanada | Aquagrill | Asiate | Balthazar | Barbetta | |
|---|---|---|---|---|---|
| Jakob | 4.019 | 4.172 | 4.51 | 4.718 | 4.258 |
| Helen | 3.019 | 3.172 | 3.51 | 3.718 | 3.258 |
| Viktoria | 3.019 | 3.172 | 3.51 | 3.718 | 3.258 |
| Zackary | 3.485 | 3.639 | 3.976 | 4.184 | 3.724 |
| Kloe | 3.085 | 3.239 | 3.576 | 3.784 | 3.324 |
| Petra | 3.019 | 3.172 | 3.51 | 3.718 | 3.258 |
| Keiron | 2.085 | 2.239 | 2.576 | 2.784 | 2.324 |
| Beverly | 2.519 | 2.672 | 3.01 | 3.218 | 2.758 |
| Blythe | 2.828 | 2.982 | 3.319 | 3.527 | 3.067 |
| Chay | 3.019 | 3.172 | 3.51 | 3.718 | 3.258 |
| Whitehouse | 1.519 | 1.672 | 2.01 | 2.218 | 1.758 |
| Needham | 3.852 | 4.005 | 4.343 | 4.551 | 4.091 |
| Beattie | 2.885 | 3.039 | 3.376 | 3.584 | 3.124 |
| Livingston | 3.085 | 3.239 | 3.576 | 3.784 | 3.324 |
| Dupont | 2.852 | 3.005 | 3.343 | 3.551 | 3.091 |
| Liu | 4.352 | 4.505 | 4.843 | 5.051 | 4.591 |
| Strickland | 2.685 | 2.839 | 3.176 | 3.384 | 2.924 |
| Howe | 2.685 | 2.839 | 3.176 | 3.384 | 2.924 |
| Akhtar | 1.285 | 1.439 | 1.776 | 1.984 | 1.524 |
| Whitworth | 3.185 | 3.339 | 3.676 | 3.884 | 3.424 |
| Bouchon | Bouley | Caravaggio | |
|---|---|---|---|
| Jakob | 4.148 | 4.47 | 4.466 |
| Helen | 3.148 | 3.47 | 3.466 |
| Viktoria | 3.148 | 3.47 | 3.466 |
| Zackary | 3.614 | 3.937 | 3.932 |
| Kloe | 3.214 | 3.537 | 3.532 |
| Petra | 3.148 | 3.47 | 3.466 |
| Keiron | 2.214 | 2.537 | 2.532 |
| Beverly | 2.648 | 2.97 | 2.966 |
| Blythe | 2.957 | 3.28 | 3.275 |
| Chay | 3.148 | 3.47 | 3.466 |
| Whitehouse | 1.648 | 1.97 | 1.966 |
| Needham | 3.981 | 4.303 | 4.299 |
| Beattie | 3.014 | 3.337 | 3.332 |
| Livingston | 3.214 | 3.537 | 3.532 |
| Dupont | 2.981 | 3.303 | 3.299 |
| Liu | 4.481 | 4.803 | 4.799 |
| Strickland | 2.814 | 3.137 | 3.132 |
| Howe | 2.814 | 3.137 | 3.132 |
| Akhtar | 1.414 | 1.737 | 1.732 |
| Whitworth | 3.314 | 3.637 | 3.632 |
Calculate the RMSE for the baseline predictors for both your training data and your test data.
## test data
# finding Error
data_err <- data_test - base_pred
# squaring error
data_err <- (data_err)^2
#finding average
data_rmse_test<- mean(data_err[test_indices])
# square root
data_rmse_test<- sqrt(data_rmse_test)
## training data
# finding Error
data_err_train <- data_train - base_pred
# squaring error
data_err_train <- (data_err_train)^2
#finding average
data_rmse_train <- mean(data_err_train,na.rm = TRUE)
# square root
data_rmse_train<- sqrt(data_rmse_train)
RMSE for test data
data_rmse_test
[1] 0.6910205
RMSE for train data
data_rmse_train
[1] 0.7791674
Lets calculate the percentage improvements based on the original (simple average) and baseline predictor (including bias) RMSE numbers for both Test and Train data sets.
The results show that we see a 50% improvement in making a prediction for the ratings in the Training data set. Where as we see only 38% improvement in prediction for the Test data set. Both are positive however the Training data set yielded better prediction.
# Train data set
R1 <- rmse_train
R1_data <- data_rmse_train
Prediction_Improv_Train <- (1-(R1_data/R1))*100
Prediction_Improv_Train
[1] 26.0554
# Test data set
R2 <- rmse_test
R2_data <- data_rmse_test
Prediction_Improv_Test <- (1-(R2_data/R2))*100
Prediction_Improv_Test
[1] 18.68783