Restaurant Recommendation System based on Yelp data

1.Introduction

As modern consumers, we greatly benefit from restaurant recommendation applications. It is so convenient to get a list of restaurants that match our preferences without much clicking, comparing, and browsing through a long list of reviews for each single business.

In this project, we apply the algorithms for building predictive models that we learned in the DATA643 course, "Current Topic of Data Science - Recommendation System", to build a restaurant recommendation system that suggests the most suitable restaurants for users.

2.Motivation

We often go out with family, friends, and coworkers at lunch or dinner time. As users of recommendation applications, what we care about most is how much we will actually like a restaurant. People tend to have happier experiences when the recommendation system's prediction matches what they actually experience. Because a large and complete data set of user and restaurant reviews is available, we want to see whether current techniques can make good predictions. The data set contains not only reviews but also related information about users and restaurants, which allows more sophisticated computation and might lead to a better model.

3.Aim

3.1 In this project, we will use collaborative filtering algorithms to build the primary recommendation system.

3.2 Location is an important factor to consider when building a restaurant recommendation system. Location will be used to filter the restaurants in a top-50 list.

3.3 The Yelp dataset contains more information than ratings alone. Each review also carries three criteria: funny, useful, and cool. These factors will be integrated with the primary ratings, in the hope of increasing the diversity and serendipity of the recommendations.

4.Dataset

In this project, we use the Yelp Dataset Challenge round 9 data from the Yelp website. The dataset has 4.1M reviews and 947K tips by 1M users for 144K businesses; 1.1M business attributes, e.g. hours, parking availability, ambience; and aggregated check-ins over time for each of the 125K businesses. The data covers a diverse set of cities: Edinburgh in the U.K.; Karlsruhe in Germany; Montreal and Waterloo in Canada; and Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland in the U.S.

install.packages("jsonlite",repos='http://cran.us.r-project.org')
devtools::install_github("sailthru/tidyjson")
install.packages("doParallel")
install.packages("BBmisc")
install.packages("DT")

Load packages

suppressWarnings(suppressMessages(library(jsonlite)))
suppressWarnings(suppressMessages(library(tidyjson)))
suppressWarnings(suppressMessages(library(plyr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(recommenderlab)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))

# user-item matrix
suppressWarnings(suppressMessages(library(stringi)))
suppressWarnings(suppressMessages(library(Matrix)))
suppressWarnings(suppressMessages(library(DT)))

4.1 Process the raw data

4.2 Explore the data

Load the pre-processed data

# read data from Github repository
business<- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/business.csv")

user <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/user_1.csv")

# append the remaining user files (user_2.csv to user_4.csv)
for (i in 2:4){
  a <- paste0('https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/user_', i, '.csv')
  user_1 <- read.csv(a)
  user <- rbind(user, user_1)
}

rating <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/rating_1.csv")

# append the remaining rating files (rating_2.csv to rating_7.csv)
for (i in 2:7){
  a <- paste0('https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/rating_', i, '.csv')
  rating_1 <- read.csv(a)
  rating <- rbind(rating, rating_1)
}

# save a copy  
rating_copy  <- rating
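The loading steps above can also be written without growing the data frame inside a loop. The following is a minimal sketch, not the project's own code: `part_url()` and `read_parts()` are hypothetical helper names, and the `reader` argument exists only so the logic can be tried without network access.

```r
# Base location of the pre-processed CSV parts (same repository as above).
base_url <- "https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/"

# Hypothetical helper: URL of one numbered part, e.g. user_2.csv.
part_url <- function(prefix, i) paste0(base_url, prefix, "_", i, ".csv")

# Hypothetical helper: read every part and bind them in a single step.
read_parts <- function(prefix, parts, reader = read.csv) {
  do.call(rbind, lapply(parts, function(i) reader(part_url(prefix, i))))
}

# Example calls (these need network access):
# user   <- read_parts("user", 1:4)
# rating <- read_parts("rating", 1:7)
```

Binding once at the end avoids copying the accumulated data frame on every iteration, which can matter for files of this size.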

View the data

#rearrange the column
rating <- rating[,c("restaurant", "business_id", "user", "user_id","stars", "useful", "funny", "cool" ,"document.id")]

kable(head(rating,n=5))

restaurant	business_id	user	user_id	stars	useful	funny	cool	document.id
Daily Kitchen Modern Eatery and Rotisserie	YCEZLECK9IToE8Mysorbhw	Monera	---1lKK3aKOuomHnwAkAow	5	3	0	2	54219
The Placenta Lady	D1PhUlkQA1ZsVe9Cx4yqOw	Monera	---1lKK3aKOuomHnwAkAow	5	1	1	0	14186
Fresh Mama	5aeR9KcboZmhDZlFscnYRA	Monera	---1lKK3aKOuomHnwAkAow	5	1	0	0	3864
Red Velvet Cafe	t6WY1IrohUecqNjd9bG42Q	Monera	---1lKK3aKOuomHnwAkAow	4	2	0	0	51335
Echo & Rig	igHYkXZMLAc9UdV5VnR_AA	Monera	---1lKK3aKOuomHnwAkAow	5	0	0	0	3774

# convert the ratings data to a realRatingMatrix for use with the recommenderlab package

# length(unique(rating[,"user"]))  [1] 63081
# length(unique(rating[,"restaurant"])) [1] 65432

       
# build the user-item matrix: give every distinct user and restaurant an integer index
udf <- data.frame(user_No = seq_along(unique(rating[,"user"])), user = unique(rating[,"user"]))
idf <- data.frame(restaurant_No = seq_along(unique(rating[,"restaurant"])), restaurant = unique(rating[,"restaurant"]))

rating <- merge(rating,udf,by.x='user',by.y='user')
rating <- merge(rating,idf,by.x='restaurant',by.y='restaurant')

rating_mx <- sparseMatrix(
  i =  rating$user_No, 
  j =  rating$restaurant_No, 
  x = rating$stars, 
  dimnames = list(levels(rating$user_No), levels(rating$restaurant_No))
)

# convert the dgCMatrix to a realRatingMatrix for use with recommenderlab
mx <- as(rating_mx,"realRatingMatrix")
#setting itemlabels
colnames(mx) <- paste("R", 1:65432, sep = "")
as(mx[1,1:10],"list")

## [[1]]
##  R1  R2  R3  R4  R5  R6  R7  R8  R9 R10 
##   5   5   5   4   5   5   5   5   5   4

#setting userlabels
rownames(mx) <- paste("U", 1:63081, sep = "")
as(mx[1,1:10], "list")

## $U1
##  R1  R2  R3  R4  R5  R6  R7  R8  R9 R10 
##   5   5   5   4   5   5   5   5   5   4

#Normalize by subtracting the row mean from all ratings in the row
mx_n <- normalize(mx)

#view the matrix
getRatingMatrix(mx)[1:10,1:5]

## 10 x 5 sparse Matrix of class "dgCMatrix"
##     R1 R2 R3 R4 R5
## U1   5  5  5  4  5
## U2   .  .  .  .  5
## U3   .  .  .  .  .
## U4   .  .  .  .  .
## U5   .  .  .  .  .
## U6   .  .  .  1  5
## U7   .  .  .  4  5
## U8   1  .  .  .  5
## U9   .  .  .  .  .
## U10  .  .  .  4  .

image(mx, main = "Yelp restaurant reviews data")

image(mx_n, main = "Normalized Yelp restaurant reviews data")
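The row-centering that `normalize()` applies by default can be illustrated on a small dense matrix. This is a base-R sketch with made-up ratings, not a call into recommenderlab; `toy` and `toy_centered` are illustrative names.

```r
# Toy ratings: rows are users, columns are restaurants, NA = not rated.
toy <- matrix(c( 5,  4, NA,
                 2, NA,  4,
                NA,  3,  3),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("U", 1:3), paste0("R", 1:3)))

# Mean of each user's observed ratings only.
user_mean <- rowMeans(toy, na.rm = TRUE)

# Subtract each row's mean from that row; unrated cells stay NA.
toy_centered <- sweep(toy, 1, user_mean, FUN = "-")
toy_centered
```

After centering, each user's observed ratings average to zero, so a user who rates everything high and a user who rates everything low become comparable.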

Statistics of ratings data

# statistics for all ratings, following the visualize_ratings function from SVDApproximation: count of each rating value, histogram of users' average ratings, histogram of items' average ratings, histogram of the number of items rated per user, and histogram of the number of ratings per item

summary(rating[, 'stars'])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.716   5.000   5.000

#distribution of ratings
rating_frq <- as.data.frame(table(rating$stars))

ggplot(rating_frq, aes(Var1, Freq)) +
  geom_col(fill = "palegreen") + labs(x = "Stars")

#calculate average reviews for each restaurant
business_mean <- data.frame(restaurant = idf$restaurant, average_stars=colMeans(mx))

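# note: par(mfrow) only affects base graphics, so the ggplot2 plots below are still drawn one per figure
par(mfrow=c(2,2))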

ggplot(user,aes(review_count)) +
  geom_histogram(binwidth = 0.05,col='red',fill="plum") + coord_cartesian(ylim=c(0,12000)) + labs(x = "User Review Count")+geom_vline(xintercept = mean(user$review_count),col='blue',size=1)

ggplot(business,aes(review_count)) +
  geom_histogram(binwidth = 0.05,col='blue',fill="sandybrown") + coord_cartesian(ylim=c(0,7000)) + labs(x = "Restaurant Review Count")+geom_vline(xintercept = mean(business$review_count),col='red',size=1)

ggplot(user,aes(average_stars)) +
  geom_histogram(binwidth = 0.03,fill="plum")  + labs(x = "User Average Review")

ggplot(business_mean,aes(average_stars)) +
  geom_histogram(binwidth = 0.03,fill="sandybrown") + labs(x = "Restaurant Average Review")

round_r <- sum(user$average_stars == 1)+sum(user$average_stars == 2)+sum(user$average_stars == 3)+sum(user$average_stars == 4)+sum(user$average_stars == 5)
print(paste("Total number of people who had rounded average ratings:",round_r))

## [1] "Total number of people who had rounded average ratings: 405551"

user_rate_1 <- sum(user$review_count == 1)
user_rate_2 <- sum(user$review_count == 2)
user_rate_3 <- sum(user$review_count == 3)
user_rate_4 <- sum(user$review_count == 4)
print(paste("Number of people who only rated one restaurant:",user_rate_1))

## [1] "Number of people who only rated one restaurant: 189809"

print(paste("Number of people who only rated twice:",user_rate_2))

## [1] "Number of people who only rated twice: 126347"

print(paste("Number of people who only rated three times:",user_rate_3))

## [1] "Number of people who only rated three times: 96815"

print(paste("Number of people who only rated four times:",user_rate_4))

## [1] "Number of people who only rated four times: 69627"

print(paste("Number of people who rated three times or fewer:",user_rate_1 + user_rate_2 +user_rate_3))

## [1] "Number of people who rated three times or fewer: 412971"

By viewing the data we see:

1.The rating distribution is not normal: the most frequent rating is the highest one, 5, and its frequency is much higher than that of the other ratings. One possible explanation is that people who write restaurant reviews on Yelp tend to read reviews and ratings online before trying a new restaurant, so they are more likely to enjoy what they chose. If so, this would suggest that existing restaurant recommendation tools already help people find food they like.

2.The distribution of user review counts is not normal, with a mean of 24. Most people wrote only a few reviews, while a very small number wrote thousands, up to a maximum of 11284. The minimum review count shows that some people did not write any reviews.

3.The distribution of restaurant review counts is also not normal, with a mean of 28. Most restaurants received only a few reviews, while a very small number received thousands, up to a maximum of 6414. The minimum review count shows that every restaurant in this data set received at least 3 reviews.

4.The users' average ratings follow a multimodal distribution. The counts at whole-number averages (1 to 5 stars) are much higher than at other values. Consistent with Figure 1, an average rating of 5 has the highest frequency. A likely reason so many people have a whole-number average is that they rated very few restaurants, or gave the same rating to every restaurant they rated. It is interesting to note that the number of people with whole-number average ratings, 405551, is close to the number of people who rated three times or fewer, 412971.

5.Similar to the user average rating, the restaurants' average ratings follow a multimodal distribution. Consistent with Figure 1, an average rating of 5 has the highest frequency. One possible reason for this pattern is that a large number of restaurants received very few ratings and those ratings were identical. Another is that many very good restaurants always received 5 stars, though whether that is really plausible is open to question.
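The link between few ratings and whole-number averages in point 4 can be checked by brute force. The sketch below assumes, purely for illustration, that every star value from 1 to 5 is equally likely and independent, which the real data does not satisfy; `whole_share()` is a hypothetical helper.

```r
# Share of all possible rating sequences of length n (values 1 to 5)
# whose mean is a whole number, i.e. whose sum is divisible by n.
whole_share <- function(n) {
  grid <- expand.grid(rep(list(1:5), n))  # every combination of n ratings
  mean(rowSums(grid) %% n == 0)
}

# Shares for users with 1, 2, 3, and 4 ratings.
sapply(1:4, whole_share)
```

Under this assumption a user with a single rating always has a whole-number average, and the share drops quickly as the number of ratings grows, which is in line with the explanation given above.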

5.Creating a Recommender

5.1 Building a User-based Collaborative Filtering Model

# check if there is abnormal ratings in the data
table(mx@data@x[] > 5)

## 
##   FALSE    TRUE 
## 1409140       1

table(mx@data@x[] < 1)

## 
##   FALSE 
## 1409141

# set the abnormal rating to the nearest valid value
mx@data@x[mx@data@x[] > 5] <- 5
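One possible source of a rating above 5 (an assumption, not verified against this data) is that `sparseMatrix()` adds up the `x` values of repeated `(i, j)` pairs, so two reviews by the same user for the same restaurant end up as their sum. The toy data and the averaging step below are illustrative only.

```r
library(Matrix)

# Two reviews by the same user (row 1) for the same restaurant (column 1).
dup <- sparseMatrix(i = c(1, 1, 2), j = c(1, 1, 2), x = c(5, 3, 4))
dup[1, 1]  # the two star values are added together

# One way to avoid this: keep a single value per (user, restaurant) pair,
# here the mean, before building the matrix.
reviews <- data.frame(u = c(1, 1, 2), r = c(1, 1, 2), stars = c(5, 3, 4))
dedup <- aggregate(stars ~ u + r, data = reviews, FUN = mean)
clean <- sparseMatrix(i = dedup$u, j = dedup$r, x = dedup$stars)
clean[1, 1]
```

Capping at 5 hides the symptom for sums above 5, but a sum such as 1 + 2 = 3 would pass the check unnoticed, so deduplicating before building the matrix is the safer option.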

# keep only users with more than 20 ratings, then only restaurants with more than 50 ratings
mx_r <- mx[rowCounts(mx) > 20,]
mx_r <- mx_r[,colCounts(mx_r) > 50]

# creating the evaluation scheme, separate the data into train set and test set
set.seed(1)
(e <- evaluationScheme(mx_r[1:1200], method = "split",train = 0.8, given = 5, goodRating = 3, k=5))

## Evaluation scheme with 5 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1200 x 5243 rating matrix of class 'realRatingMatrix' with 488248 ratings.
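The "split" scheme with `train = 0.8` and `given = 5` can be pictured as follows. This is a simplified base-R illustration with made-up sizes, not recommenderlab's internal implementation: users are divided into a training and a test group, and for each test user 5 ratings are revealed to the model ("known") while the rest are held out for scoring ("unknown").

```r
set.seed(1)

# Split 10 hypothetical users 80/20 into training and test users.
n_users   <- 10
train_idx <- sample(n_users, size = floor(0.8 * n_users))
test_idx  <- setdiff(seq_len(n_users), train_idx)

# For one test user who rated 8 restaurants, reveal 5 and hold out 3.
rated   <- paste0("R", 1:8)
known   <- sample(rated, 5)
unknown <- setdiff(rated, known)
```

This is also why only users with more than 5 ratings can be evaluated under this scheme: at least one rating must remain in the held-out part.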

# Creating a user-based collaborative filtering model using the training data.
(r_ubcf <- Recommender(getData(e, "train"), method ="UBCF", parameter = list(method = "cosine", normalize = "Z-score", nn=25)))

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 960 users.
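To make the idea behind user-based collaborative filtering concrete, here is a small base-R sketch on made-up ratings. It is a simplification rather than what `Recommender()` does internally: it skips the Z-score normalization, and `cosine_sim()` and `predict_ubcf()` are hypothetical helper names.

```r
# Toy ratings (NA = not rated); rows are users, columns are restaurants.
r <- matrix(c(5, 4, NA,  1,
              4, 5,  2,  1,
              1, 2,  5,  4,
              5, 5,  1, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("U", 1:4), paste0("R", 1:4)))

# Cosine similarity between two users over the restaurants both have rated.
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# Predict a rating as the similarity-weighted mean of the ratings given by
# the nn most similar users who rated the item.
predict_ubcf <- function(r, user, item, nn = 2) {
  others <- setdiff(rownames(r)[!is.na(r[, item])], user)
  sims   <- sapply(others, function(o) cosine_sim(r[user, ], r[o, ]))
  top    <- names(sort(sims, decreasing = TRUE))[seq_len(min(nn, length(sims)))]
  sum(sims[top] * r[top, item]) / sum(sims[top])
}

predict_ubcf(r, "U1", "R3")
```

Here U1's tastes resemble those of U2 and U4, who both rated R3 low, so the predicted rating for U1 is low as well.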

# r_ibcf <- Recommender(getData(e, "train"), "IBCF",parameter = list(k=30, method = "cosine", normalize = "Z-score", alpha=0.5))

# release memory
gc()

##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  3716520 198.5   12002346 641.0  12002346 641.0
## Vcells 55756019 425.4  112780893 860.5 112779355 860.5

# increase the memory limit (Windows only; memory.limit() is a stub from R 4.2.0 onward)
memory.limit(size=700000)

## [1] 7e+05

names(getModel(r_ubcf))

## [1] "description" "data"        "method"      "nn"          "sample"     
## [6] "normalize"   "verbose"

# evaluation (evaluate() fits UBCF with its default parameters here; `n` is only used for type = "topNList")
results <- evaluate(e, method="UBCF", type = "ratings", n=c(1,3,5,10,15,20))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.1sec/11.41sec] 
##   2  [0.05sec/11.88sec] 
##   3  [0.05sec/11.59sec] 
##   4  [0.05sec/11.34sec] 
##   5  [0.03sec/11.81sec]

avg(results)

##         RMSE      MSE      MAE
## res 1.485573 2.207031 1.221872

# making predictions on ratings
(p_rating <- predict(r_ubcf, getData(e, "known"), type="ratings",n=10))

## 240 x 5243 rating matrix of class 'realRatingMatrix' with 1246644 ratings.

# show predicted ratings
as(p_rating, "matrix")[1:10,1:10]

##           R1       R3       R4       R5       R6      R10      R11
## U11 4.377811 4.400000 4.448639 4.704822 4.400000 4.476386 4.335383
## U31 2.800000 2.802313 2.700283 3.000264 2.800000 2.800000 2.874564
## U40 3.895722 4.000000 3.981405 4.183772 4.147711 4.032290 4.304996
## U41 4.350739 4.399499 4.426552 4.561826 4.434757 4.400000 4.376478
## U43 3.800000 3.800000 3.757982 3.856515 3.830139 3.827601 3.888330
## U45 4.221984 4.200000 4.200000 4.320884 4.200000 4.200000 4.197166
## U46 2.400000 2.473057 2.219222 2.466187 2.400000 2.400000 2.473215
## U53 3.400000 3.400000 3.306760 3.672172 3.534933 3.436452 3.111497
## U54 3.614614 3.622771 3.760030 3.818249 3.600000 3.550045 3.445086
## U59 3.873293 3.800000 3.864313 4.149622 3.802304 3.874812 3.838606
##          R12      R13      R17
## U11 4.429356 4.445291 4.371860
## U31 2.732841 2.811651 2.800000
## U40 3.754724 4.148015 4.057790
## U41 4.524867 4.443738 4.316639
## U43 3.767197 3.835270 3.885991
## U45 4.244129 4.200000 4.241087
## U46 2.523405 2.412146 2.291994
## U53 3.274084 3.405942 3.270816
## U54 3.620108 3.600000 3.590404
## U59 3.803884 3.788111 3.902065

# RMSE
(error <- data.frame(calcPredictionAccuracy(p_rating, getData(e, "unknown"))))

##      calcPredictionAccuracy.p_rating..getData.e...unknown...
## RMSE                                                1.468892
## MSE                                                 2.157643
## MAE                                                 1.188701
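The three error measures reported above can be reproduced by hand. The sketch below uses made-up predictions and true ratings; only the formulas are meant to carry over.

```r
# Hypothetical predicted and true ratings for four held-out reviews.
pred  <- c(4.4, 2.8, 3.9, 4.6)
truth <- c(5, 3, 4, 3)

err  <- pred - truth
mse  <- mean(err^2)       # mean squared error
rmse <- sqrt(mse)         # root mean squared error
mae  <- mean(abs(err))    # mean absolute error

c(RMSE = rmse, MSE = mse, MAE = mae)

# Consistency check on the values reported above: RMSE is the square root of MSE.
sqrt(2.157643)
```

Because RMSE squares the errors before averaging, a few large misses (such as the last prediction here) weigh more heavily on it than on MAE.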

# evaluation of top-N lists
# (this call takes a long time to run, so it is commented out and the timings from an earlier run are kept below)
#results <- evaluate(e, method="UBCF", type = "topNList", n=c(1,3,5,10,15,20))
#UBCF run fold/sample [model time/prediction time]
     #1  [0.16sec/398.42sec] 
     #2  [0.17sec/393.06sec] 
     #3  [0.27sec/391.93sec] 
     #4  [0.09sec/393.77sec] 
     #5  [0.16sec/395.01sec] 
# making predictions on topNList
(p_topN <- predict(r_ubcf, mx_r[1201],type="topNList",n=10))

## Recommendations as 'topNList' with n = 10 for 1 users.

# show predicted top10 restaurants
pri_rec <- as(p_topN, "list")

In a practical scenario, we have to consider location when designing a restaurant recommendation system. Most of the time, people use a recommendation engine to find restaurants in a particular city.

#get city info from business data
city <- business[,c('name','city','state')]
city <- city[!duplicated(city$name),]
colnames(city) <- c('restaurant','city','state')
idf_city <- left_join(idf,city,by='restaurant')

## Warning: Column `restaurant` joining factors with different levels,
## coercing to character vector

idf_city$restaurant_id <- paste("R", 1:65432, sep = "")
idf_city$city <- as.character(idf_city$city)
idf_city$state <- as.character(idf_city$state)

# get the top 50 restaurants for user 1201 from the recommender system
(p_top50 <- predict(r_ubcf, mx_r[1201],type="topNList",n=50))

## Recommendations as 'topNList' with n = 50 for 1 users.

# filter the restaurants for user 1201 based on location
pred_restaurant <-  data.frame(as(p_top50, "list"))
colnames(pred_restaurant) <- "U1201"
pred_restaurant[] <- lapply(pred_restaurant, as.character)
pred_restaurant$restaurant_id <- pred_restaurant$U1201

pred_restaurant <- left_join(pred_restaurant,idf_city, by='restaurant_id' )
pred_restaurant$city <- as.character(pred_restaurant$city)
pred_restaurant$state <- as.character(pred_restaurant$state)

# For example, if user 1201 wants recommendations for restaurants in Las Vegas, we can pick them out of the top-50 list
(Lasvegas <- filter(pred_restaurant,city == "Las Vegas"))

##     U1201 restaurant_id restaurant_No                         restaurant
## 1   R1030         R1030          1030      Desert Wireless iPhone Repair
## 2    R478          R478           478                         SkinnyFATS
## 3   R6798         R6798          6798                         9037 Salon
## 4   R5179         R5179          5179                         Lucki Thai
## 5    R228          R228           228                       Bachi Burger
## 6   R1204         R1204          1204             The Buffet at Bellagio
## 7   R1483         R1483          1483                          The Henry
## 8    R246          R246           246                           Sake Rok
## 9    R844          R844           844           Jean Philippe Patisserie
## 10   R808          R808           808           Gangnam Asian BBQ Dining
## 11  R5370         R5370          5370              Libre Mexican Cantina
## 12  R4161         R4161          4161         El Sombrero Mexican Bistro
## 13  R1549         R1549          1549                               Cleo
## 14    R43           R43            43                      Vintner Grill
## 15   R811          R811           811        Cirque du Soleil - Zumanity
## 16  R2314         R2314          2314 Rise & Shine - A Steak & Egg Place
## 17  R1147         R1147          1147                  Soho SushiBurrito
## 18  R3879         R3879          3879              Professor Nails & Spa
## 19  R2639         R2639          2639                        Today Nails
## 20  R3688         R3688          3688                             Yassou
## 21 R10945        R10945         10945        Sun Buggy & ATV Fun Rentals
##         city state
## 1  Las Vegas    NV
## 2  Las Vegas    NV
## 3  Las Vegas    NV
## 4  Las Vegas    NV
## 5  Las Vegas    NV
## 6  Las Vegas    NV
## 7  Las Vegas    NV
## 8  Las Vegas    NV
## 9  Las Vegas    NV
## 10 Las Vegas    NV
## 11 Las Vegas    NV
## 12 Las Vegas    NV
## 13 Las Vegas    NV
## 14 Las Vegas    NV
## 15 Las Vegas    NV
## 16 Las Vegas    NV
## 17 Las Vegas    NV
## 18 Las Vegas    NV
## 19 Las Vegas    NV
## 20 Las Vegas    NV
## 21 Las Vegas    NV
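The location filter can be reduced to a lookup followed by a subset. The following base-R sketch uses made-up restaurant ids and cities; `recs` and `lookup` are illustrative names standing in for the top-N list and the city table.

```r
# Ranked recommendations for one user (best first).
recs <- data.frame(restaurant_id = c("R3", "R1", "R2"),
                   stringsAsFactors = FALSE)

# City lookup table for all restaurants.
lookup <- data.frame(restaurant_id = c("R1", "R2", "R3"),
                     city = c("Las Vegas", "Phoenix", "Las Vegas"),
                     stringsAsFactors = FALSE)

# match() keeps the recommendation order; merge() would re-sort the rows.
recs$city <- lookup$city[match(recs$restaurant_id, lookup$restaurant_id)]

# Keep only the recommendations in the requested city.
in_city <- recs[recs$city == "Las Vegas", ]
in_city
```

Keeping the original ranking matters because the filtered list is still meant to be read from best to worst.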

5.2 Multi-Criteria Recommender System

Because there are three criteria in reviews: funny, useful, and cool, the rating will be calculated as follows:

\[ R: \text{Users} \times \text{Items} \to R_{0} \times R_{1} \times \dots \times R_{k} \]

\(R_{0}\) is the set of possible overall rating values, and \(R_{i}\) represents the possible rating values for each individual criterion i (i = 1,..,k), typically on some numeric scale.
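As a worked example, take the first review shown in the data preview (user Monera, restaurant "Daily Kitchen Modern Eatery and Rotisserie": 5 stars, 3 useful, 0 funny, 2 cool). Assuming the criteria are ordered as \(R_{1}\) = useful, \(R_{2}\) = funny, and \(R_{3}\) = cool, so that \(k = 3\), the mapping gives

\[ R(u, i) = (r_{0}, r_{1}, r_{2}, r_{3}) = (5, 3, 0, 2) \]

The ordering of the criteria is a labeling choice made here for illustration; any fixed ordering works as long as it is used consistently.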

The prediction results of single-criteria collaborative filtering algorithm and multi-criteria collaborative filtering algorithms will be compared to decide which approach is better.

The implementation and evaluation will be performed in R and Apache Spark. Finally, if time permits, an application will be built with the Shiny package.

5.2.1 Building the User-item Matrix Based on Useful, Funny, and Cool Comments

Useful Matrix

# build the user-item matrix based on useful votes
useful_mx <- sparseMatrix(
  i =  rating$user_No, 
  j =  rating$restaurant_No, 
  x = rating$useful, 
  dimnames = list(levels(rating$user_No), levels(rating$restaurant_No))
)

# convert the dgCMatrix to a realRatingMatrix for use with recommenderlab
u_mx <- as(useful_mx,"realRatingMatrix")

#setting itemlabels
colnames(u_mx) <- paste("R", 1:65432, sep = "")

#setting userlabels
rownames(u_mx) <- paste("U", 1:63081, sep = "")


#view the matrix
getRatingMatrix(u_mx)[1:10,1:5]

## 10 x 5 sparse Matrix of class "dgCMatrix"
##     R1 R2 R3 R4 R5
## U1   3  1  1  2  0
## U2   .  .  .  .  0
## U3   .  .  .  .  .
## U4   .  .  .  .  .
## U5   .  .  .  .  .
## U6   .  .  .  4  0
## U7   .  .  .  1  0
## U8   1  .  .  .  3
## U9   .  .  .  .  .
## U10  .  .  .  1  .

Funny Matrix

#build the user-item matrix based on funny comments
funny_mx <- sparseMatrix(
  i =  rating$user_No, 
  j =  rating$restaurant_No, 
  x = rating$funny, 
  dimnames = list(levels(rating$user_No), levels(rating$restaurant_No))
)

# convert the dgCMatrix to a realRatingMatrix for use with recommenderlab
f_mx <- as(funny_mx,"realRatingMatrix")

#setting itemlabels
colnames(f_mx) <- paste("R", 1:65432, sep = "")

#setting userlabels
rownames(f_mx) <- paste("U", 1:63081, sep = "")

#view the matrix
getRatingMatrix(f_mx)[1:10,1:5]

## 10 x 5 sparse Matrix of class "dgCMatrix"
##     R1 R2 R3 R4 R5
## U1   0  1  0  0  0
## U2   .  .  .  .  0
## U3   .  .  .  .  .
## U4   .  .  .  .  .
## U5   .  .  .  .  .
## U6   .  .  .  0  0
## U7   .  .  .  0  0
## U8   0  .  .  .  0
## U9   .  .  .  .  .
## U10  .  .  .  0  .

Cool Matrix

# build the user-item matrix based on cool votes
cool_mx <- sparseMatrix(
  i =  rating$user_No, 
  j =  rating$restaurant_No, 
  x = rating$cool, 
  dimnames = list(levels(rating$user_No), levels(rating$restaurant_No))
)

# convert the dgCMatrix to a realRatingMatrix for use with recommenderlab
c_mx <- as(cool_mx,"realRatingMatrix")

#setting itemlabels
colnames(c_mx) <- paste("R", 1:65432, sep = "")

#setting userlabels
rownames(c_mx) <- paste("U", 1:63081, sep = "")

#view the matrix
getRatingMatrix(c_mx)[1:10,1:5]

## 10 x 5 sparse Matrix of class "dgCMatrix"
##     R1 R2 R3 R4 R5
## U1   2  0  0  0  0
## U2   .  .  .  .  1
## U3   .  .  .  .  .
## U4   .  .  .  .  .
## U5   .  .  .  .  .
## U6   .  .  .  1  0
## U7   .  .  .  1  1
## U8   0  .  .  .  0
## U9   .  .  .  .  .
## U10  .  .  .  1  .

# summary statistics of the useful, funny, and cool vote data
summary(u_mx@data@x[])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.006   1.000 500.000

summary(f_mx@data@x[])

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.4091   0.0000 287.0000

summary(c_mx@data@x[])

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.5151   0.0000 234.0000

From the summary, we can see that the useful, funny, and cool values count how many people found a review of the restaurant useful, funny, or cool. We treat a higher value as a sign that the restaurant is more popular, and consider these numbers as ratings from different aspects. Because the three factors are on different scales (the maximum values are 500, 287, and 234), we will convert them to binary values. The binary ratings for useful, funny, and cool will then be combined with the primary ratings to build the new recommender models.

# frequency of each "useful" vote count across reviews
useful_tb <- as.data.frame(table(rating$useful))
useful_tb$Var1 <- as.numeric(as.character(useful_tb$Var1))
# keep the "useful" vote counts that occur in more than 50 reviews
u_threshold <- useful_tb[useful_tb$Freq > 50,]

# frequency of each "funny" vote count across reviews
funny_tb <- as.data.frame(table(rating$funny))
funny_tb$Var1 <- as.numeric(as.character(funny_tb$Var1))
# keep the "funny" vote counts that occur in more than 50 reviews
f_threshold <- funny_tb[funny_tb$Freq > 50,]

# frequency of each "cool" vote count across reviews
cool_tb <- as.data.frame(table(rating$cool))
cool_tb$Var1 <- as.numeric(as.character(cool_tb$Var1))
# keep the "cool" vote counts that occur in more than 50 reviews
c_threshold <- cool_tb[cool_tb$Freq > 50,]

mx_b <- mx_r
# convert the primary rating matrix to binary values:
# the first line sets ratings below the mean to 1, the second then sets values above the recomputed mean to 0
mx_b@data@x[mx_b@data@x < mean(mx_b@data@x)] <- 1
mx_b@data@x[mx_b@data@x > mean(mx_b@data@x)] <- 0

# convert the useful matrix to binary values:
# vote counts below the largest common count become 1, counts above it become 0, counts equal to it are left unchanged
u_mx@data@x[u_mx@data@x < max(u_threshold$Var1)] <- 1
u_mx@data@x[u_mx@data@x > max(u_threshold$Var1)] <- 0

# convert the funny matrix to binary values in the same way
f_mx@data@x[f_mx@data@x < max(f_threshold$Var1)] <- 1
f_mx@data@x[f_mx@data@x > max(f_threshold$Var1)] <- 0

# convert the cool matrix to binary values in the same way
c_mx@data@x[c_mx@data@x < max(c_threshold$Var1)] <- 1
c_mx@data@x[c_mx@data@x > max(c_threshold$Var1)] <- 0
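Binarizing with a single comparison guarantees that every value, including one exactly equal to the cut-off, ends up as 0 or 1. The sketch below is illustrative only: the cut-off of 7 is made up, and coding high counts as 1 is an assumption about the intended direction.

```r
# Hypothetical vote counts for six reviews.
votes <- c(0, 1, 3, 7, 20, 500)

# Hypothetical cut-off; counts at or above it are coded 1, the rest 0.
threshold <- 7
binary <- as.integer(votes >= threshold)
binary
```

With two separate assignments, the second one operates on values already changed by the first, so a single vectorized comparison is easier to reason about.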


# select the users and restaurants that match the constrained user-item matrix (users with more than 20 ratings and restaurants with more than 50 reviews)
u_mx_fit <- u_mx[,c(colnames(mx_r))]
u_mx_fit <- u_mx_fit[row.names(u_mx_fit) %in% c(rownames(mx_r)),]

f_mx_fit <- f_mx[,c(colnames(mx_r))]
f_mx_fit <- f_mx_fit[row.names(f_mx_fit) %in% c(rownames(mx_r)),]

c_mx_fit <- c_mx[,c(colnames(mx_r))]
c_mx_fit <- c_mx_fit[row.names(c_mx_fit) %in% c(rownames(mx_r)),]

# combine primary ratings with useful rating by element-wise multiplication  
r0_r1 <- mx_b@data * u_mx_fit@data

# combine primary ratings with funny rating by element-wise multiplication  
r0_r1_r2 <- r0_r1 * f_mx_fit@data

# combine primary ratings with cool rating by element-wise multiplication  
r0_r1_r2_r3 <- r0_r1_r2 * c_mx_fit@data
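On binary matrices, element-wise multiplication acts as a logical AND: a cell is 1 only if it is 1 in every matrix being combined. A small dense example with made-up values shows the effect; `primary` and `useful` are illustrative stand-ins for the matrices above.

```r
labels  <- list(c("U1", "U2"), c("R1", "R2"))
primary <- matrix(c(1, 0, 1, 1), nrow = 2, dimnames = labels)
useful  <- matrix(c(1, 1, 0, 1), nrow = 2, dimnames = labels)

# Element-wise product: 1 only where both inputs are 1.
combined <- primary * useful
combined
```

Each additional criterion can therefore only turn 1s into 0s, never the reverse, so the combined matrices get sparser in 1s as more criteria are multiplied in.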

5.2.2 Building the Recommendation Systems Based on Multi-criteria Rating

There are 7 ways to integrate the useful, funny, and cool ratings with the primary ratings: primary+useful, primary+funny, primary+cool, primary+useful+funny, primary+useful+cool, primary+cool+funny, and primary+useful+funny+cool. We will use primary+useful, primary+useful+funny, and primary+useful+funny+cool to build the recommendation models.
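The count of 7 follows from choosing any non-empty subset of the three criteria (2^3 - 1 = 7). A short base-R sketch enumerates them; `combos` is an illustrative name.

```r
criteria <- c("useful", "funny", "cool")

# For every subset size k = 1, 2, 3, list all subsets and prefix "primary".
combos <- unlist(lapply(seq_along(criteria), function(k) {
  combn(criteria, k, FUN = function(s) paste(c("primary", s), collapse = "+"))
}))

combos
length(combos)
```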

Primary + Useful

combine_1 <- as(r0_r1,"realRatingMatrix")

# creating the evaluation scheme, separate the data into train set and test set
set.seed(2)
(c1_e <- evaluationScheme(combine_1[1:1200], method = "split",train = 0.8, given = 5, goodRating = 3, k=5))

## Evaluation scheme with 5 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1200 x 5243 rating matrix of class 'realRatingMatrix' with 488248 ratings.

# Creating a user-based collaborative filtering model using the training data.
(c1_ubcf <- Recommender(getData(c1_e, "train"), method ="UBCF", parameter = list(method = "cosine", normalize = "Z-score", nn=25)))

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 960 users.

# release memory
gc()

##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  3737195 199.6   12002346 641.0  12002346 641.0
## Vcells 78247412 597.0  112780893 860.5 112779960 860.5

# evaluation  
c1_results <- evaluate(c1_e, method="UBCF", type = "ratings", n=c(1,3,5,10,15,20))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.04sec/11.5sec] 
##   2  [0.04sec/12.53sec] 
##   3  [0.05sec/10.98sec] 
##   4  [0.91sec/11.21sec] 
##   5  [0.03sec/11.56sec]

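# `results` holds the single-criterion evaluation from section 5.1, so this repeats that baseline; avg(c1_results) gives the averages for this model
avg(results)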

##         RMSE      MSE      MAE
## res 1.485573 2.207031 1.221872

# making predictions on ratings
(c1_p_rating <- predict(c1_ubcf, getData(c1_e, "known"), type="ratings",n=10))

## 240 x 5243 rating matrix of class 'realRatingMatrix' with 1068552 ratings.

# show predicted ratings
as(c1_p_rating, "matrix")[1:10,1:10]

##            R1        R3        R4        R5        R6       R10       R11
## U6         NA        NA        NA        NA        NA        NA        NA
## U14 0.4525209 0.4328711 0.4502401 0.3393765 0.3895817 0.3967006 0.3460426
## U30 0.6000000 0.6000000 0.6000000 0.5603706 0.6000000 0.5865870 0.6355804
## U33 0.1702462 0.2000000 0.2188049 0.2156242 0.2000000 0.2227773 0.2068239
## U40 0.3915823 0.3798296 0.3860204 0.3717654 0.4000000 0.3760555 0.3720130
## U41 0.4138108 0.4138108 0.4138108 0.3542383 0.4000000 0.3771508 0.3928913
## U44        NA        NA        NA        NA        NA        NA        NA
## U57 0.8000000 0.7912793 0.8094813 0.7847997 0.8000000 0.7742290 0.8069384
## U58 0.8000000 0.8220852 0.7909140 0.7721165 0.8000000 0.8000000 0.7839648
## U66 0.6000000 0.5872285 0.5824837 0.5836675 0.6000000 0.5885307 0.5629996
##           R12       R13       R17
## U6         NA        NA        NA
## U14 0.4080982 0.3996191 0.3764957
## U30 0.5885492 0.6000000 0.6000000
## U33 0.2104116 0.1911478 0.2000000
## U40 0.4208804 0.4231869 0.3884721
## U41 0.4530422 0.4000000 0.4034807
## U44        NA        NA        NA
## U57 0.8156695 0.7915733 0.8062886
## U58 0.8067258 0.8000000 0.8000000
## U66 0.6616446 0.6564478 0.6617045

# RMSE
(error <- data.frame(calcPredictionAccuracy(c1_p_rating, getData(c1_e, "unknown"))))

##      calcPredictionAccuracy.c1_p_rating..getData.c1_e...unknown...
## RMSE                                                     0.5033015
## MSE                                                      0.2533124
## MAE                                                      0.4509785

# evaluation of top-N lists
# (this call takes a long time to run, so it is commented out and the timings from an earlier run are kept below)
#results <- evaluate(e, method="UBCF", type = "topNList", n=c(1,3,5,10,15,20))
#UBCF run fold/sample [model time/prediction time]
     #1  [0.16sec/398.42sec] 
     #2  [0.17sec/393.06sec] 
     #3  [0.27sec/391.93sec] 
     #4  [0.09sec/393.77sec] 
     #5  [0.16sec/395.01sec] 
# making predictions on topNList
(c1_p_topN <- predict(c1_ubcf, combine_1[1201],type="topNList",n=10))

## Recommendations as 'topNList' with n = 10 for 1 users.

# show predicted top10 restaurants
(c1_rec <- as(c1_p_topN, "list"))

## $U1827
##  [1] "R1967" "R831"  "R603"  "R1861" "R1971" "R873"  "R5580" "R294" 
##  [9] "R1622" "R5977"

Primary + Useful + Funny

combine_2 <- as(r0_r1_r2,"realRatingMatrix")

# creating the evaluation scheme, separate the data into train set and test set
set.seed(3)
(c2_e <- evaluationScheme(combine_2[1:1200], method = "split",train = 0.8, given = 5, goodRating = 3, k=5))

## Evaluation scheme with 5 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1200 x 5243 rating matrix of class 'realRatingMatrix' with 488248 ratings.

# Creating a user-based collaborative filtering model using the training data.
(c2_ubcf <- Recommender(getData(c2_e, "train"), method ="UBCF", parameter = list(method = "cosine", normalize = "Z-score", nn=25)))

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 960 users.

# release memory
gc()

##            used  (Mb) gc trigger   (Mb)  max used (Mb)
## Ncells  3737588 199.7   12002346  641.0  12002346  641
## Vcells 81950406 625.3  135417071 1033.2 135391498 1033

# evaluation  
c2_results <- evaluate(c2_e, method="UBCF", type = "ratings", n=c(1,3,5,10,15,20))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.04sec/11.49sec] 
##   2  [0.03sec/11.76sec] 
##   3  [0.04sec/11.83sec] 
##   4  [0.05sec/11.84sec] 
##   5  [0.05sec/11.94sec]

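# NOTE: this averages the earlier `results` object rather than `c2_results`, so the
# output below does not summarise the evaluation just run; avg(c2_results) would
avg(results)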

##         RMSE      MSE      MAE
## res 1.485573 2.207031 1.221872

# making predictions on ratings
(c2_p_rating <- predict(c2_ubcf, getData(c2_e, "known"), type="ratings",n=10))

## 240 x 5243 rating matrix of class 'realRatingMatrix' with 1068552 ratings.

# show predicted ratings
as(c2_p_rating, "matrix")[1:10,1:10]

##            R1        R3        R4        R5        R6       R10       R11
## U8  0.4242750 0.3862496 0.3882118 0.3499197 0.3871293 0.4219942 0.3580959
## U24 0.6000000 0.6000000 0.5808039 0.5543080 0.6000000 0.6000000 0.5769046
## U26 0.6000000 0.6000000 0.6000000 0.5680516 0.6000000 0.6000000 0.5756454
## U29 0.2320670 0.1787733 0.1581266 0.1752761 0.1845924 0.1998294 0.2152066
## U36 0.4000000 0.4000000 0.3894468 0.3783530 0.4000000 0.4000000 0.4097481
## U37 0.2017338 0.1858843 0.2024211 0.2162648 0.1776786 0.2205786 0.1484986
## U55 0.4000000 0.4000000 0.3874557 0.3344267 0.4000000 0.4000000 0.3821346
## U65 0.4000000 0.4000000 0.4427417 0.3727352 0.4230476 0.4356060 0.3717501
## U71 0.4000000 0.4000000 0.4278132 0.3971104 0.4000000 0.3751178 0.4000000
## U75 0.4239906 0.4136335 0.4000000 0.3648116 0.3872801 0.3900600 0.4315422
##           R12       R13       R17
## U8  0.3622406 0.3750624 0.3987515
## U24 0.5873950 0.5787035 0.6000000
## U26 0.5872671 0.6293759 0.6000000
## U29 0.1644822 0.1750886 0.1522962
## U36 0.4259119 0.3752954 0.4000000
## U37 0.2116399 0.1392685 0.2009791
## U55 0.4000000 0.3747020 0.3884420
## U65 0.3678212 0.3872607 0.4000000
## U71 0.4000000 0.4000000 0.4000000
## U75 0.4000000 0.4414822 0.4000000

# RMSE
(error <- data.frame(calcPredictionAccuracy(c2_p_rating, getData(c2_e, "unknown"))))

##      calcPredictionAccuracy.c2_p_rating..getData.c2_e...unknown...
## RMSE                                                     0.5203734
## MSE                                                      0.2707885
## MAE                                                      0.4677258
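The three accuracy measures reported above are simple functions of the prediction errors on the withheld ratings. The sketch below computes them in base R on made-up vectors; `predicted`, `actual`, and the helper `accuracy` are illustrative names, not part of the report's pipeline.

```r
# Hypothetical predicted and withheld ratings for five user-item pairs.
predicted <- c(0.42, 0.60, 0.23, 0.40, 0.20)
actual    <- c(0.40, 0.80, 0.20, 0.60, 0.20)

# RMSE, MSE and MAE over the pairs where both values are available.
accuracy <- function(pred, truth) {
  keep <- !is.na(pred) & !is.na(truth)
  err <- pred[keep] - truth[keep]
  c(RMSE = sqrt(mean(err^2)), MSE = mean(err^2), MAE = mean(abs(err)))
}

acc <- accuracy(predicted, actual)
print(acc)
```

RMSE is the square root of MSE, which is consistent with the output above (0.5203734 squared is approximately 0.2707885).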

# evaluation  
# (This evaluation takes a long time to run, so the command and its recorded output are shown as comments.)
#results <- evaluate(e, method="UBCF", type = "topNList", n=c(1,3,5,10,15,20))
#UBCF run fold/sample [model time/prediction time]
     #1  [0.16sec/398.42sec] 
     #2  [0.17sec/393.06sec] 
     #3  [0.27sec/391.93sec] 
     #4  [0.09sec/393.77sec] 
     #5  [0.16sec/395.01sec] 
# making predictions on topNList
(c2_p_topN <- predict(c2_ubcf, combine_2[1201],type="topNList",n=10))

## Recommendations as 'topNList' with n = 10 for 1 users.

# show predicted top10 restaurants
(c2_rec <- as(c2_p_topN, "list"))

## $U1827
##  [1] "R1967" "R603"  "R602"  "R1081" "R831"  "R2291" "R1861" "R3438"
##  [9] "R873"  "R1464"
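A top-N list is, in essence, the N items with the highest predicted ratings among those the user has not rated yet. A minimal base R sketch, using a made-up named vector `predicted` and a hypothetical helper `top_n`:

```r
# Hypothetical predicted ratings for one user; NA marks items without a prediction.
predicted <- c(R1 = 0.42, R2 = NA, R3 = 0.61, R4 = 0.35, R5 = 0.58)

# Return the IDs of the n items with the highest predicted rating.
top_n <- function(pred, n) {
  pred <- pred[!is.na(pred)]
  names(sort(pred, decreasing = TRUE))[seq_len(min(n, length(pred)))]
}

top3 <- top_n(predicted, 3)
print(top3)
```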

Primary + Useful + Funny + Cool

combine_3 <- as(r0_r1_r2_r3,"realRatingMatrix")

# creating the evaluation scheme, separate the data into train set and test set
set.seed(4)
(c3_e <- evaluationScheme(combine_3[1:1200], method = "split",train = 0.8, given = 5, goodRating = 3, k=5))

## Evaluation scheme with 5 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1200 x 5243 rating matrix of class 'realRatingMatrix' with 488248 ratings.

# Create a user-based collaborative filtering (UBCF) recommender using the training data.
(c3_ubcf <- Recommender(getData(c3_e, "train"), method ="UBCF", parameter = list(method = "cosine", normalize = "Z-score", nn=25)))

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 960 users.
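The `normalize = "Z-score"` option standardises each user's ratings so that generous and strict raters become comparable. The sketch below shows the idea in base R on a toy matrix; the helper `z_score_rows` and the choice to leave constant raters at zero instead of dividing by a zero standard deviation are assumptions of this sketch, not a description of recommenderlab's internals.

```r
# Toy user x item rating matrix (NA = not rated); the values are made up.
ratings <- matrix(
  c(5, 4, 3,
    2, 2, NA,
    1, 3, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("U", 1:3), paste0("R", 1:3))
)

# Subtract each user's mean rating and divide by that user's standard deviation.
z_score_rows <- function(m) {
  mu <- rowMeans(m, na.rm = TRUE)
  sigma <- apply(m, 1, sd, na.rm = TRUE)
  sigma[is.na(sigma) | sigma == 0] <- 1  # assumption: keep constant raters at 0
  (m - mu) / sigma                       # mu and sigma recycle down the rows
}

z <- z_score_rows(ratings)
print(z)
```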

# release memory
gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3737864 199.7   12002346  641.0  12002346  641.0
## Vcells 85643829 653.5  135417071 1033.2 135414786 1033.2

# evaluation  
c3_results <- evaluate(c3_e, method="UBCF", type = "ratings", n=c(1,3,5,10,15,20))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.04sec/11.36sec] 
##   2  [0.05sec/11.6sec] 
##   3  [0.04sec/11.59sec] 
##   4  [0.07sec/12.34sec] 
##   5  [0.03sec/11.53sec]

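# NOTE: this averages the earlier `results` object rather than `c3_results`, so the
# output below does not summarise the evaluation just run; avg(c3_results) would
avg(results)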

##         RMSE      MSE      MAE
## res 1.485573 2.207031 1.221872

# making predictions on ratings
(c3_p_rating <- predict(c3_ubcf, getData(c3_e, "known"), type="ratings",n=10))

## 240 x 5243 rating matrix of class 'realRatingMatrix' with 1073790 ratings.

# show predicted ratings
as(c3_p_rating, "matrix")[1:10,1:10]

##            R1        R3        R4        R5        R6       R10       R11
## U5         NA        NA        NA        NA        NA        NA        NA
## U15 0.6000000 0.6000000 0.6135017 0.6118739 0.6000000 0.5788825 0.6000000
## U24 0.4073642 0.4305689 0.4000000 0.3906037 0.4000000 0.4053768 0.3680601
## U26 0.6000000 0.6000000 0.6000000 0.5651377 0.6000000 0.5697705 0.5912079
## U34 0.4000000 0.3854438 0.3843342 0.3705854 0.4000000 0.4000000 0.3843342
## U36 0.6000000 0.5840123 0.6271648 0.5732234 0.5868789 0.5868789 0.5964009
## U46 0.4000000 0.4000000 0.3791149 0.4038671 0.3888094 0.3892605 0.3790034
## U58 0.2136111 0.1774667 0.2443033 0.1718200 0.1867005 0.2313888 0.2165374
## U59 0.2000000 0.1908956 0.2000000 0.2164272 0.2000000 0.2000000 0.2334704
## U71 0.3776172 0.3776605 0.4154288 0.3869664 0.3878530 0.3710440 0.3925566
##           R12       R13       R17
## U5         NA        NA        NA
## U15 0.5645125 0.6000000 0.6000000
## U24 0.3602524 0.4499659 0.4046395
## U26 0.5891219 0.6000000 0.6000000
## U34 0.3855161 0.4166493 0.3848214
## U36 0.5657257 0.5868937 0.5868789
## U46 0.4000000 0.4000000 0.3811876
## U58 0.2347603 0.1994869 0.1639893
## U59 0.2000000 0.1859640 0.2181532
## U71 0.3807868 0.3785507 0.4516377

# RMSE
(error <- data.frame(calcPredictionAccuracy(c3_p_rating, getData(c3_e, "unknown"))))

##      calcPredictionAccuracy.c3_p_rating..getData.c3_e...unknown...
## RMSE                                                     0.5295649
## MSE                                                      0.2804390
## MAE                                                      0.4686014

# evaluation  
# (This evaluation takes a long time to run, so the command and its recorded output are shown as comments.)
#results <- evaluate(e, method="UBCF", type = "topNList", n=c(1,3,5,10,15,20))
#UBCF run fold/sample [model time/prediction time]
     #1  [0.16sec/398.42sec] 
     #2  [0.17sec/393.06sec] 
     #3  [0.27sec/391.93sec] 
     #4  [0.09sec/393.77sec] 
     #5  [0.16sec/395.01sec] 
# making predictions on topNList
(c3_p_topN <- predict(c3_ubcf, combine_3[1201],type="topNList",n=10))

## Recommendations as 'topNList' with n = 10 for 1 users.

# show predicted top10 restaurants
(c3_rec <- as(c3_p_topN, "list"))

## $U1827
##  [1] "R831"  "R603"  "R1861" "R1622" "R5580" "R63"   "R294"  "R6589"
##  [9] "R1464" "R2410"

# get the top 50 restaurants for the user in row 1201 from the recommender system
(c1_p_top100 <- predict(c1_ubcf, mx_r[1201],type="topNList",n=50))

## Recommendations as 'topNList' with n = 50 for 1 users.

# filter the restaurants for the user in row 1201 by location
c1_pred_restaurant <-  data.frame(as(c1_p_top100, "list"))
colnames(c1_pred_restaurant) <- "U1201"
c1_pred_restaurant[] <- lapply(c1_pred_restaurant, as.character)
c1_pred_restaurant$restaurant_id <- c1_pred_restaurant$U1201

c1_pred_restaurant <- left_join(c1_pred_restaurant,idf_city, by='restaurant_id' )
c1_pred_restaurant$city <- as.character(c1_pred_restaurant$city)
c1_pred_restaurant$state <- as.character(c1_pred_restaurant$state)

# For example, if the user in row 1201 wants recommendations for restaurants in Las Vegas, we can filter the top 50 list
(Lasvegas <- filter(c1_pred_restaurant,city == "Las Vegas"))

##     U1201 restaurant_id restaurant_No                       restaurant
## 1   R5580         R5580          5580                         Cafe Rio
## 2    R603          R603           603   Bayside Buffet at Mandalay Bay
## 3    R142          R142           142                    Serendipity 3
## 4   R1622         R1622          1622                              FIX
## 5     R63           R63            63 Luxor Hotel and Casino Las Vegas
## 6    R873          R873           873                     Michael Mina
## 7   R1971         R1971          1971 McFadden's Restaurant and Saloon
## 8   R1967         R1967          1967                       Yama Sushi
## 9   R1317         R1317          1317               Dick's Last Resort
## 10  R2291         R2291          2291                      Wahlburgers
## 11  R3565         R3565          3565                 MGM Grand Buffet
## 12 R17330        R17330         17330                Buffet Roundtable
## 13   R602          R602           602     Mandalay Bay Resort & Casino
## 14  R5305         R5305          5305                    Haute Doggery
## 15  R1490         R1490          1490          Wet Republic Ultra Pool
## 16  R1602         R1602          1602                    China Poblano
## 17  R8677         R8677          8677                Lulu Hawaiian BBQ
## 18    R96           R96            96            The Shops at Crystals
## 19   R595          R595           595        Wolfgang Puck Bar & Grill
## 20  R1564         R1564          1564                             PT's
## 21  R4997         R4997          4997           Jose Cuervo Tequileria
## 22   R614          R614           614                Cabo Wabo Cantina
## 23   R491          R491           491                          Egg & I
## 24  R1188         R1188          1188  South Point Hotel, Casino & Spa
## 25  R4872         R4872          4872                  AMPM Nail Salon
##         city state
## 1  Las Vegas    NV
## 2  Las Vegas    NV
## 3  Las Vegas    NV
## 4  Las Vegas    NV
## 5  Las Vegas    NV
## 6  Las Vegas    NV
## 7  Las Vegas    NV
## 8  Las Vegas    NV
## 9  Las Vegas    NV
## 10 Las Vegas    NV
## 11 Las Vegas    NV
## 12 Las Vegas    NV
## 13 Las Vegas    NV
## 14 Las Vegas    NV
## 15 Las Vegas    NV
## 16 Las Vegas    NV
## 17 Las Vegas    NV
## 18 Las Vegas    NV
## 19 Las Vegas    NV
## 20 Las Vegas    NV
## 21 Las Vegas    NV
## 22 Las Vegas    NV
## 23 Las Vegas    NV
## 24 Las Vegas    NV
## 25 Las Vegas    NV
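The location filter above amounts to a join against a lookup table followed by a row filter. The same idea can be sketched in base R without dplyr; the data frames `recs` and `lookup` and their contents are invented for illustration, and the explicit `rank` column is an assumption added so that the recommender's ordering survives the join.

```r
# Hypothetical recommendation list, in the order returned by the recommender.
recs <- data.frame(restaurant_id = c("R1", "R2", "R3", "R4"),
                   stringsAsFactors = FALSE)
recs$rank <- seq_len(nrow(recs))

# Hypothetical lookup table with the city of each business.
lookup <- data.frame(
  restaurant_id = c("R1", "R2", "R3", "R4"),
  restaurant = c("A", "B", "C", "D"),
  city = c("Las Vegas", "Phoenix", "Las Vegas", "Madison"),
  stringsAsFactors = FALSE
)

# Left join, restore the recommendation order, then keep one city.
joined <- merge(recs, lookup, by = "restaurant_id", all.x = TRUE)
joined <- joined[order(joined$rank), ]
in_city <- joined[which(joined$city == "Las Vegas"), ]
print(in_city)
```

Using `which()` drops rows whose city is missing instead of returning rows of `NA`.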

Serendipity

U1827_predict <- data.frame(rbind('Primary' = unlist(pri_rec), 'Primary + Useful' = unlist(c1_rec), 'Primary + Useful + Funny' = unlist(c2_rec), 'Primary + Useful + Funny + Cool' = unlist(c3_rec)))
colnames(U1827_predict) <- paste0("No.", 1:10)
kable(U1827_predict)

	No.1	No.2	No.3	No.4	No.5	No.6	No.7	No.8	No.9	No.10
Primary	R1030	R478	R745	R1550	R1344	R6798	R5179	R228	R1204	R229
Primary + Useful	R1967	R831	R603	R1861	R1971	R873	R5580	R294	R1622	R5977
Primary + Useful + Funny	R1967	R603	R602	R1081	R831	R2291	R1861	R3438	R873	R1464
Primary + Useful + Funny + Cool	R831	R603	R1861	R1622	R5580	R63	R294	R6589	R1464	R2410

pri_rating <- predict(r_ubcf, mx_r[1201], type="ratings",n=10)
usefulness <- as(pri_rating, "matrix")
usefulness_df <- as.data.frame(usefulness) %>%
    gather(restaurant_id, predicted_rating,1:length(usefulness))

# restaurants in the primary top 10 that do not appear in the primary + useful top 10
unexpected_1 <- setdiff(pri_rec[[1]], c1_rec[[1]])
unexpected_ratings <- filter(usefulness_df, restaurant_id  %in% unexpected_1 ) %>%
  filter(predicted_rating > mean(mx_r@data@x) )
serendipity_c1 <- nrow(unexpected_ratings)/length(unlist(c1_rec))
print(paste("serendipity for user U1827 using combination of primary rating and useful rating is:",serendipity_c1*100,"%"))

## [1] "serendipity for user U1827 using combination of primary rating and useful rating is: 100 %"

By combining the primary rating and the useful rating, we get a completely different top 10 list of recommendations for user U1827.
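One way to make the serendipity measure explicit is as the share of a candidate list that is both unexpected (absent from a baseline list) and relevant (predicted rating above a threshold). The function below is a sketch of that formalisation on toy data; the name `serendipity`, its arguments, and the threshold value are assumptions, and the calculation is not necessarily identical to the one used in this report.

```r
# Share of `candidate` items that are absent from `baseline` and whose
# predicted rating exceeds `threshold`. `predicted` is a named numeric vector.
serendipity <- function(candidate, baseline, predicted, threshold) {
  unexpected <- setdiff(candidate, baseline)
  scores <- predicted[unexpected]
  relevant <- unexpected[!is.na(scores) & scores > threshold]
  length(relevant) / length(candidate)
}

# Toy example: three of four candidate items are new, two of them are relevant.
baseline  <- c("R1", "R2", "R3", "R4")
candidate <- c("R3", "R5", "R6", "R7")
predicted <- c(R5 = 4.5, R6 = 3.0, R7 = 4.2)

s <- serendipity(candidate, baseline, predicted, threshold = 3.7)
print(s)
```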

unexpected_ratings <- left_join(unexpected_ratings,idf_city,by="restaurant_id")
kable(unexpected_LasVegas <- filter(unexpected_ratings,city == "Las Vegas"))

restaurant_id	predicted_rating	restaurant_No	restaurant	city	state
R228	4.224769	228	Bachi Burger	Las Vegas	NV
R478	4.249639	478	SkinnyFATS	Las Vegas	NV
R1030	4.266748	1030	Desert Wireless iPhone Repair	Las Vegas	NV
R1204	4.224138	1204	The Buffet at Bellagio	Las Vegas	NV
R5179	4.225634	5179	Lucki Thai	Las Vegas	NV
R6798	4.232940	6798	9037 Salon	Las Vegas	NV

new_restaurant <- setdiff(unexpected_LasVegas$restaurant,Lasvegas$restaurant)
print(paste("By combining primary rating and useful rating, we found",length(new_restaurant),"relevant Las Vegas restaurants in the primary model's top 10 that are absent from the combined model's list:",paste(unlist(new_restaurant), collapse=','),"for user U1827."))

## [1] "By combining primary rating and useful rating, we found 6 relevant Las Vegas restaurants in the primary model's top 10 that are absent from the combined model's list: Bachi Burger,SkinnyFATS,Desert Wireless iPhone Repair,The Buffet at Bellagio,Lucki Thai,9037 Salon for user U1827."

# topN for test data set based on primary recommendation system
(p_topN <- predict(r_ubcf, getData(e,"unknown"),type="topNList",n=10))

## Recommendations as 'topNList' with n = 10 for 240 users.

# show predicted top10 restaurants
pri_rec <- as(p_topN, "list")

# topN for test data set based on primary+useful rating
(c1_p_topN <- predict(c1_ubcf, getData(e,"unknown"),type="topNList",n=10))

## Recommendations as 'topNList' with n = 10 for 240 users.

# show predicted top10 restaurants
c1_rec <- as(c1_p_topN, "list")

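# serendipity per test user: share of the primary top 10 that is absent from the
# primary + useful top 10 and has an above-average predicted rating
# NOTE: usefulness_df holds the predicted ratings of the single user in row 1201
# (built above), so the same ratings are reused for every user in this loop
serendipity_c1_df <- data.frame()
for (i in seq_along(pri_rec)){
  unexpected_1 <- setdiff(pri_rec[[i]], c1_rec[[i]])
  unexpected_ratings <- filter(usefulness_df, restaurant_id %in% unexpected_1 ) %>%
    filter(predicted_rating > mean(mx_r@data@x))
  serendipity_c1[i] <- nrow(unexpected_ratings)/10
  serendipity_c1_df_1 <- data.frame('user_id' = names(pri_rec[i]),'serendipity'= serendipity_c1[i])
  serendipity_c1_df <- rbind(serendipity_c1_df,serendipity_c1_df_1)
}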
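The per-user comparison of two recommendation lists can also be written without growing a data frame inside a loop. The sketch below computes only the unexpected share (items in the new list that the baseline list does not contain) and leaves out the relevance filter; the lists `pri` and `new` are toy data, not the report's results.

```r
# Hypothetical top-2 lists per user from a baseline and a new model.
pri <- list(U1 = c("R1", "R2"), U2 = c("R3", "R4"))
new <- list(U1 = c("R2", "R5"), U2 = c("R6", "R7"))

# Share of each user's new list that the baseline list does not contain.
unexpected_share <- vapply(
  names(pri),
  function(u) length(setdiff(new[[u]], pri[[u]])) / length(new[[u]]),
  numeric(1)
)

# Build the result table once, after all users have been processed.
share_df <- data.frame(user_id = names(unexpected_share),
                       unexpected_share = unname(unexpected_share),
                       stringsAsFactors = FALSE)
print(share_df)
```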

datatable(serendipity_c1_df, options = list(pageLength = 5))

kable(serendipity_c1_df)

user_id	serendipity
U11	1.0
U31	1.0
U40	1.0
U41	1.0
U43	1.0
U45	1.0
U46	1.0
U53	1.0
U54	1.0
U59	1.0
U63	1.0
U64	1.0
U67	1.0
U70	1.0
U77	1.0
U82	1.0
U92	1.0
U95	1.0
U103	1.0
U111	1.0
U114	1.0
U115	1.0
U116	1.0
U118	1.0
U122	1.0
U123	1.0
U128	1.0
U138	1.0
U140	1.0
U145	1.0
U154	1.0
U159	1.0
U166	1.0
U171	1.0
U172	1.0
U178	1.0
U186	1.0
U198	1.0
U201	1.0
U203	1.0
U209	1.0
U212	1.0
U216	1.0
U217	1.0
U220	0.0
U230	1.0
U247	1.0
U259	1.0
U263	1.0
U266	1.0
U282	1.0
U284	1.0
U285	1.0
U292	1.0
U293	1.0
U294	1.0
U319	1.0
U324	1.0
U326	1.0
U346	1.0
U348	1.0
U354	1.0
U363	1.0
U365	1.0
U371	1.0
U373	1.0
U391	1.0
U397	1.0
U402	1.0
U405	1.0
U413	1.0
U422	0.8
U429	1.0
U444	1.0
U455	1.0
U462	1.0
U482	1.0
U499	1.0
U508	1.0
U540	1.0
U545	1.0
U555	1.0
U565	1.0
U573	1.0
U577	1.0
U597	1.0
U602	1.0
U612	1.0
U645	1.0
U657	0.9
U660	1.0
U666	1.0
U668	1.0
U671	1.0
U683	1.0
U685	1.0
U686	1.0
U700	1.0
U708	1.0
U729	1.0
U735	1.0
U746	1.0
U750	1.0
U752	1.0
U776	1.0
U789	1.0
U803	1.0
U805	1.0
U806	1.0
U822	1.0
U823	1.0
U827	1.0
U841	1.0
U843	1.0
U844	1.0
U850	1.0
U851	1.0
U875	1.0
U884	1.0
U892	1.0
U899	1.0
U902	1.0
U907	1.0
U913	1.0
U920	1.0
U925	1.0
U950	1.0
U972	1.0
U973	0.9
U975	0.9
U987	1.0
U998	1.0
U1001	1.0
U1002	1.0
U1005	1.0
U1018	1.0
U1020	1.0
U1029	1.0
U1031	1.0
U1043	1.0
U1048	1.0
U1049	1.0
U1051	1.0
U1054	1.0
U1066	1.0
U1068	1.0
U1073	1.0
U1090	1.0
U1094	1.0
U1114	1.0
U1117	1.0
U1138	1.0
U1141	1.0
U1148	1.0
U1150	1.0
U1152	0.9
U1171	1.0
U1180	1.0
U1185	1.0
U1186	1.0
U1204	1.0
U1213	1.0
U1234	1.0
U1248	1.0
U1253	1.0
U1258	1.0
U1260	1.0
U1266	1.0
U1277	1.0
U1278	1.0
U1290	1.0
U1293	1.0
U1294	1.0
U1331	1.0
U1333	1.0
U1349	1.0
U1353	1.0
U1372	1.0
U1388	1.0
U1394	1.0
U1399	1.0
U1421	1.0
U1427	1.0
U1449	1.0
U1452	1.0
U1468	1.0
U1469	1.0
U1495	1.0
U1496	1.0
U1497	1.0
U1500	1.0
U1501	1.0
U1503	1.0
U1507	1.0
U1524	1.0
U1526	1.0
U1543	1.0
U1556	1.0
U1562	1.0
U1563	1.0
U1567	1.0
U1577	1.0
U1589	1.0
U1593	1.0
U1597	1.0
U1600	1.0
U1611	1.0
U1622	1.0
U1623	1.0
U1626	1.0
U1638	1.0
U1641	1.0
U1651	1.0
U1658	1.0
U1663	1.0
U1664	1.0
U1672	1.0
U1684	1.0
U1693	1.0
U1699	1.0
U1705	1.0
U1707	1.0
U1715	1.0
U1717	1.0
U1725	1.0
U1728	1.0
U1731	1.0
U1747	1.0
U1750	1.0
U1755	1.0
U1765	1.0
U1769	1.0
U1776	1.0
U1781	1.0
U1785	1.0
U1793	1.0
U1794	1.0
U1801	1.0
U1802	1.0
U1821	1.0

Conclusion and Discussion:

A restaurant recommendation system based on the user-based collaborative filtering algorithm was built with the Yelp academic data set from challenge round 9. The RMSE is 1.47.
The recommended restaurants can be further filtered by location. In the future, location information (such as longitude and latitude) or the distance between restaurants could be used to calculate similarity.
The recommendation system based on multi-criteria ratings generated an almost entirely different list of restaurants for users. It is intriguing that the serendipity of the multi-criteria system was 100% for nearly every user in the test set. At the same time, the RMSE of the multi-criteria models was about 0.5 (0.52 and 0.53 for the last two combinations), lower than with the single-criterion rating. However, the combined ratings are on a smaller scale than the 1 to 5 star ratings (the predicted values shown above lie between roughly 0.1 and 0.7), so the two RMSE values are not directly comparable.
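As a starting point for the location-based similarity mentioned above, the great-circle distance between two businesses can be computed from latitude and longitude with the haversine formula. The sketch below is one possible approach; the function names, the exponential decay, and the 5 km scale are arbitrary choices made for illustration, and the coordinates are only approximate city centres.

```r
# Great-circle distance in kilometres between two points given in degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Map a distance to a similarity in (0, 1]; the decay scale is an assumption.
distance_similarity <- function(d_km, scale_km = 5) exp(-d_km / scale_km)

# Approximate city-centre coordinates for Las Vegas and Phoenix.
d <- haversine_km(36.17, -115.14, 33.45, -112.07)
print(d)
print(distance_similarity(c(0, 1, 10)))
```

Such a similarity could then be blended with the rating-based similarity, for example as a weighted average, which would need to be tuned on held-out data.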

DATA643 Final Project: Restaurant Recommendation System

Yun Mai, Kelly Shaffer

July 16, 2017