Yun Mai, Kelly Shaffer
July 16, 2017
As modern consumers, we benefit greatly from restaurant recommendation applications. It is convenient to get a list of restaurants that match our preferences without much clicking, comparing, or browsing through a long list of reviews for each individual business.
In this project, we want to apply the algorithms and predictive-modeling techniques learned in the DATA643 course "Current Topics of Data Science - Recommendation Systems" to build a restaurant recommendation system that suggests the most suitable restaurants for users.
It is very common to go out with family, friends, or coworkers when lunch or dinner time comes. As users of recommendation applications, what people care about most is how well they will like a recommended restaurant, and they tend to have happier experiences when the system's predictions hold up in practice. Since a complete and large data set of user and restaurant reviews is available, we want to see whether we can use up-to-date techniques to make good predictions. The data set contains not only reviews but also related information about users and restaurants, which allows more sophisticated computation and might lead to a better model.
3.1 In this project, we will use collaborative filtering algorithms to build the primary recommendation system.
3.2 The location of a restaurant is an important factor to consider when building a restaurant recommendation system. Location will be used to filter restaurants from a top-50 list.
3.3 The Yelp dataset contains more information than ratings alone. Reviews carry three additional criteria (funny, useful, and cool), and these factors will be integrated with the primary ratings. We hope this increases the diversity and serendipity of the recommendation results.
In this project, we use the Round 9 dataset of the Yelp Dataset Challenge from the Yelp website. The dataset has 4.1M reviews and 947K tips by 1M users for 144K businesses; 1.1M business attributes, such as hours, parking availability, and ambience; and aggregated check-ins over time for each of the 125K businesses. The data covers a diverse set of cities: Edinburgh in the U.K.; Karlsruhe in Germany; Montreal and Waterloo in Canada; and Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland in the U.S.
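For reference, the raw Yelp data ships as newline-delimited JSON, which can be streamed into R with jsonlite's stream_in (a minimal sketch; the file name follows Yelp's distribution, while this project loads pre-processed CSVs instead):
library(jsonlite)
# stream the newline-delimited review file into a data frame
reviews <- stream_in(file("yelp_academic_dataset_review.json"), verbose = FALSE)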
install.packages("jsonlite",repos='http://cran.us.r-project.org')
devtools::install_github("sailthru/tidyjson")
install.packages("doParallel")
install.packages("BBmisc")
install.packages("DT")
Load packages
suppressWarnings(suppressMessages(library(jsonlite)))
suppressWarnings(suppressMessages(library(tidyjson)))
suppressWarnings(suppressMessages(library(plyr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(recommenderlab)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
# packages for building the user-item matrix
suppressWarnings(suppressMessages(library(stringi)))
suppressWarnings(suppressMessages(library(Matrix)))
suppressWarnings(suppressMessages(library(DT)))
Load the pre-processed data
# read data from Github repository
business<- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/business.csv")
user <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/user_1.csv")
for (i in 2:4){
a <- paste0('https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/user_',i,'.csv')
user_i <- read.csv(a)
user <- rbind(user, user_i)
}
rating <- read.csv("https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/rating_1.csv")
for (i in 2:7){
a <- paste0('https://raw.githubusercontent.com/YunMai-SPS/DA643/master/DA643_final_project/rating_',i,'.csv')
rating_i <- read.csv(a)
rating <- rbind(rating, rating_i)
}
# save a copy
rating_copy <- rating
View the data
# rearrange the columns
rating <- rating[,c("restaurant", "business_id", "user", "user_id","stars", "useful", "funny", "cool" ,"document.id")]
kable(head(rating,n=5))
restaurant | business_id | user | user_id | stars | useful | funny | cool | document.id |
---|---|---|---|---|---|---|---|---|
Daily Kitchen Modern Eatery and Rotisserie | YCEZLECK9IToE8Mysorbhw | Monera | ---1lKK3aKOuomHnwAkAow | 5 | 3 | 0 | 2 | 54219 |
The Placenta Lady | D1PhUlkQA1ZsVe9Cx4yqOw | Monera | ---1lKK3aKOuomHnwAkAow | 5 | 1 | 1 | 0 | 14186 |
Fresh Mama | 5aeR9KcboZmhDZlFscnYRA | Monera | ---1lKK3aKOuomHnwAkAow | 5 | 1 | 0 | 0 | 3864 |
Red Velvet Cafe | t6WY1IrohUecqNjd9bG42Q | Monera | ---1lKK3aKOuomHnwAkAow | 4 | 2 | 0 | 0 | 51335 |
Echo & Rig | igHYkXZMLAc9UdV5VnR_AA | Monera | ---1lKK3aKOuomHnwAkAow | 5 | 0 | 0 | 0 | 3774 |
# convert the ratings data to a realRatingMatrix for use with the recommenderlab package
# length(unique(rating[,"user"])) [1] 63081
# length(unique(rating[,"restaurant"])) [1] 65432
#build the user-item matrix
udf <- data.frame(user_No = seq_along(unique(rating[,"user"])), user = unique(rating[,"user"]))
idf <- data.frame(restaurant_No = seq_along(unique(rating[,"restaurant"])), restaurant = unique(rating[,"restaurant"]))
rating <- merge(rating,udf,by.x='user',by.y='user')
rating <- merge(rating,idf,by.x='restaurant',by.y='restaurant')
rating_mx <- sparseMatrix(
i = rating$user_No,
j = rating$restaurant_No,
x = rating$stars
) # row and column labels are assigned below
# convert the dgCMatrix to a realRatingMatrix for use with recommenderlab
mx <- as(rating_mx,"realRatingMatrix")
# set item labels
colnames(mx) <- paste("R", 1:65432, sep = "")
as(mx[1,1:10],"list")
## [[1]]
## R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
## 5 5 5 4 5 5 5 5 5 4
# set user labels
rownames(mx) <- paste("U", 1:63081, sep = "")
as(mx[1,1:10], "list")
## $U1
## R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
## 5 5 5 4 5 5 5 5 5 4
#Normalize by subtracting the row mean from all ratings in the row
mx_n <- normalize(mx)
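Here normalize() centers each user's ratings, $r'_{ui} = r_{ui} - \bar{r}_u$, where $\bar{r}_u$ is user $u$'s mean rating, so that differences in individual rating scales do not bias later similarity computations.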
#view the matrix
getRatingMatrix(mx)[1:10,1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
## R1 R2 R3 R4 R5
## U1 5 5 5 4 5
## U2 . . . . 5
## U3 . . . . .
## U4 . . . . .
## U5 . . . . .
## U6 . . . 1 5
## U7 . . . 4 5
## U8 1 . . . 5
## U9 . . . . .
## U10 . . . 4 .
image(mx, main = "Yelp Restaurant Reviews Data")
image(mx_n, main = "Normalized Yelp Restaurant Reviews Data")
Statistics of ratings data
# summarize the ratings: overall rating distribution, users' and restaurants' average ratings, and review counts per user and per restaurant
summary(rating[, 'stars'])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.716 5.000 5.000
#distribution of ratings
rating_frq <- as.data.frame(table(rating$stars))
ggplot(rating_frq,aes(Var1,Freq)) +
geom_bar(stat = "identity", fill = "palegreen") + labs(x = "Stars")
# calculate the average rating for each restaurant
business_mean <- data.frame(restaurant = idf$restaurant, average_stars=colMeans(mx))
# note: par(mfrow) does not affect ggplot2 output; the four plots below render individually
ggplot(user,aes(review_count)) +
geom_histogram(binwidth = 0.05,col='red',fill="plum") + coord_cartesian(ylim=c(0,12000)) + labs(x = "User Review Count")+geom_vline(xintercept = mean(user$review_count),col='blue',size=1)
ggplot(business,aes(review_count)) +
geom_histogram(binwidth = 0.05,col='blue',fill="sandybrown") + coord_cartesian(ylim=c(0,7000)) + labs(x = "Restaurant Review Count")+geom_vline(xintercept = mean(business$review_count),col='red',size=1)
ggplot(user,aes(average_stars)) +
geom_histogram(binwidth = 0.03,fill="plum") + labs(x = "User Average Review")
ggplot(business_mean,aes(average_stars)) +
geom_histogram(binwidth = 0.03,fill="sandybrown") + labs(x = "Restaurant Average Review")
round_r <- sum(user$average_stars %in% 1:5)
print(paste("Total number of people who had rounded average ratings:",round_r))
## [1] "Total number of people who had rounded average ratings: 405551"
user_rate_1 <- sum(user$review_count == 1)
user_rate_2 <- sum(user$review_count == 2)
user_rate_3 <- sum(user$review_count == 3)
user_rate_4 <- sum(user$review_count == 4)
print(paste("Number of people who only rated one restaurant:",user_rate_1))
## [1] "Number of people who only rated one restaurant: 189809"
print(paste("Number of people who only rated twice:",user_rate_2))
## [1] "Number of people who only rated twice: 126347"
print(paste("Number of people who only rated three times:",user_rate_3))
## [1] "Number of people who only rated three times: 96815"
print(paste("Number of people who only rated four times:",user_rate_4))
## [1] "Number of people who only rated four times: 69627"
print(paste("Number of people who only rated less than three times:",user_rate_1 + user_rate_2 +user_rate_3))
## [1] "Number of people who only rated less than three times: 412971"
From raw data we can see:
1. The rating distribution is not normal: the most frequent rating is the highest value, 5, whose frequency is much higher than that of the other ratings. One possibility is that the people who write restaurant reviews on Yelp are the same people who check reviews and ratings online before trying a new restaurant, so they are more likely to like what they chose. This suggests that current restaurant recommendation systems already work well, making it likely that people find food they like through a recommender engine or application.
2. The distribution of user review counts is not normal, with an average of 24. Most users wrote only a few reviews, and very few wrote thousands, with a maximum of 11284. The minimum review count shows that some users did not write any review.
3. The distribution of restaurant review counts is not normal, with an average of 28. Most restaurants received only a few reviews, and very few received thousands, with a maximum of 6414. The minimum review count shows that every restaurant in this data set received at least 3 reviews.
4. The distribution of users' average ratings is multimodal: the count at each whole-number value (stars) is much higher than at nearby non-integer values. Consistent with Figure 1, an average rating of 5 has the highest frequency. So many users may have a whole-number average either because they gave the same rating to every restaurant they reviewed or because they rated very few restaurants. It is interesting that the number of users with whole-number average ratings, 405551, is close to the number of users who rated three times or fewer, 412971; the sketch after this list probes that overlap.
5. Similar to the user averages, the distribution of restaurants' average ratings is multimodal. Consistent with Figure 1, an average rating of 5 has the highest frequency. One possible reason is that many restaurants received very few ratings and those ratings happened to be identical. Another is that many excellent restaurants always receive 5. But is that really plausible?
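As a quick probe of the overlap noted in point 4 (a sketch using the user data frame and the review_count and average_stars columns from above):
# users with at most three reviews vs. users whose average rating is a whole number
low_count <- user$review_count <= 3
rounded <- user$average_stars %in% 1:5
# the cross-tabulation shows how much of the multimodality comes from users with very few reviews
table(low_count, rounded)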
# check whether there are abnormal ratings in the data
table(mx@data@x[] > 5)
##
## FALSE TRUE
## 1409140 1
table(mx@data@x[] < 1)
##
## FALSE
## 1409141
# cap the abnormal rating at the closest valid value (5)
mx@data@x[mx@data@x[] > 5] <- 5
# keep only restaurants with more than 50 ratings and users with more than 20 ratings
mx_r <- mx[rowCounts(mx) > 20,]
mx_r <- mx_r[,colCounts(mx_r) > 50]
# create the evaluation scheme: split the data into training and test sets
set.seed(1)
(e <- evaluationScheme(mx_r[1:1200], method = "split",train = 0.8, given = 5, goodRating = 3, k=5))
## Evaluation scheme with 5 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1200 x 5243 rating matrix of class 'realRatingMatrix' with 488248 ratings.
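For reference, the scheme exposes three views of the data through recommenderlab's getData() (the variable names here are just for illustration):
# the 80% of users whose ratings train the model
train_set <- getData(e, "train")
# for each test user, the 5 "given" ratings shown to the recommender
known <- getData(e, "known")
# the held-out ratings of the test users, used for scoring
unknown <- getData(e, "unknown")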
# Creating a user-based collaborative filtering model using the training data.
(r_ubcf <- Recommender(getData(e, "train"), method ="UBCF", parameter = list(method = "cosine", normalize = "Z-score", nn=25)))
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 960 users.
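With these parameters, the model Z-score normalizes each user's ratings, measures user-user similarity with the cosine measure, and predicts a user's unseen ratings from the 25 nearest neighbors (nn=25).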
# release memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3714477 198.4 12002346 641.0 12002346 641.0
## Vcells 55754921 425.4 112780893 860.5 112769645 860.4
# Increasing the storage capacity
memory.limit(size=700000)
## [1] 7e+05
names(getModel(r_ubcf))
## [1] "description" "data" "method" "nn" "sample"
## [6] "normalize" "verbose"
# evaluation
results <- evaluate(e, method="UBCF", type = "ratings", n=c(1,3,5,10,15,20))
## UBCF run fold/sample [model time/prediction time]
## 1 [0.09sec/11.46sec]
## 2 [0.05sec/12.05sec]
## 3 [0.05sec/11.83sec]
## 4 [0.03sec/11.4sec]
## 5 [0.06sec/11.81sec]
avg(results)
## RMSE MSE MAE
## res 1.485573 2.207031 1.221872
# making predictions on ratings
(p_rating <- predict(r_ubcf, getData(e, "known"), type="ratings",n=10))
## 240 x 5243 rating matrix of class 'realRatingMatrix' with 1246644 ratings.
# show predicted ratings
as(p_rating, "matrix")[1:10,1:7]
## R1 R3 R4 R5 R6 R10 R11
## U11 4.377811 4.400000 4.448639 4.704822 4.400000 4.476386 4.335383
## U31 2.800000 2.802313 2.700283 3.000264 2.800000 2.800000 2.874564
## U40 3.895722 4.000000 3.981405 4.183772 4.147711 4.032290 4.304996
## U41 4.350739 4.399499 4.426552 4.561826 4.434757 4.400000 4.376478
## U43 3.800000 3.800000 3.757982 3.856515 3.830139 3.827601 3.888330
## U45 4.221984 4.200000 4.200000 4.320884 4.200000 4.200000 4.197166
## U46 2.400000 2.473057 2.219222 2.466187 2.400000 2.400000 2.473215
## U53 3.400000 3.400000 3.306760 3.672172 3.534933 3.436452 3.111497
## U54 3.614614 3.622771 3.760030 3.818249 3.600000 3.550045 3.445086
## U59 3.873293 3.800000 3.864313 4.149622 3.802304 3.874812 3.838606
# prediction accuracy of the predicted ratings
error <- data.frame(value = calcPredictionAccuracy(p_rating, getData(e, "unknown")))
kable(error, caption = "Prediction accuracy on the test set")
 | value |
---|---|
RMSE | 1.468892 |
MSE | 2.157643 |
MAE | 1.188701 |
# making top-N predictions for the user in row 1201 (the first user outside the evaluation subset)
(p_topN <- predict(r_ubcf, mx_r[1201],type="topNList",n=10))
## Recommendations as 'topNList' with n = 10 for 1 users.
# show predicted top10 restaurants
(pri_rec <- as(p_topN, "list"))
## $U1827
## [1] "R1030" "R478" "R745" "R1550" "R1344" "R6798" "R5179" "R228"
## [9] "R1204" "R229"
In practice, we have to consider location when designing a restaurant recommendation system. Most of the time, people use a recommendation engine to find a restaurant in a particular city.
#get city info from business data
city <- business[,c('name','city','state')]
city <- city[!duplicated(city$name),]
colnames(city) <- c('restaurant','city','state')
idf_city <- left_join(idf,city,by='restaurant')
## Warning: Column `restaurant` joining factors with different levels,
## coercing to character vector
idf_city$restaurant_id <- paste("R", 1:65432, sep = "")
idf_city$city <- as.character(idf_city$city)
idf_city$state <- as.character(idf_city$state)
# get the top 50 restaurants for user 1201 from the recommender system
(p_top50 <- predict(r_ubcf, mx_r[1201],type="topNList",n=50))
## Recommendations as 'topNList' with n = 50 for 1 users.
# filter the restaurants for user 1201 based on location
pred_restaurant <- data.frame(as(p_top50, "list"))
colnames(pred_restaurant) <- "U1201"
pred_restaurant[] <- lapply(pred_restaurant, as.character)
pred_restaurant$restaurant_id <- pred_restaurant$U1201
pred_restaurant <- left_join(pred_restaurant,idf_city, by='restaurant_id')
pred_restaurant$city <- as.character(pred_restaurant$city)
pred_restaurant$state <- as.character(pred_restaurant$state)
# For example, if user 1201 wants recommendations for restaurants in Las Vegas, we can filter them out of the top-50 list
Lasvegas <- filter(pred_restaurant,city == "Las Vegas")
#datatable(Lasvegas, options = list(pageLength = 5))
head(Lasvegas,n=5)
## U1201 restaurant_id restaurant_No restaurant
## 1 R1030 R1030 1030 Desert Wireless iPhone Repair
## 2 R478 R478 478 SkinnyFATS
## 3 R6798 R6798 6798 9037 Salon
## 4 R5179 R5179 5179 Lucki Thai
## 5 R228 R228 228 Bachi Burger
## city state
## 1 Las Vegas NV
## 2 Las Vegas NV
## 3 Las Vegas NV
## 4 Las Vegas NV
## 5 Las Vegas NV
Because reviews carry three additional criteria (funny, useful, and cool), the rating function will take the multi-criteria form
$R: Users \times Items \rightarrow R_0 \times R_1 \times \dots \times R_k$
where $R_0$ is the set of possible overall rating values and $R_i$ represents the possible rating values for each individual criterion $i$ ($i = 1, \dots, k$), typically on some numeric scale.
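For example, the first review in the ratings table above maps the pair (Monera, Daily Kitchen Modern Eatery and Rotisserie) to the rating vector $(5, 3, 0, 2)$: an overall rating of 5 stars plus 3 useful, 0 funny, and 2 cool votes.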
The prediction results of the single-criteria and multi-criteria collaborative filtering algorithms will be compared to decide which approach is better.
The implementation and evaluation will be performed in R and Apache Spark. Finally, if time permits, an application will be built with the Shiny package.
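A minimal sketch of what such a Shiny front end could look like (the widget names and choices here are illustrative assumptions, not part of the current implementation):
library(shiny)
ui <- fluidPage(
  selectInput("user", "User", choices = paste0("U", 1:10)),
  selectInput("city", "City", choices = c("Las Vegas", "Phoenix", "Montreal")),
  tableOutput("recs")
)
server <- function(input, output) {
  output$recs <- renderTable({
    # a real app would call predict() on the trained recommender for the
    # selected user and filter the top-N list by the selected city
    data.frame(restaurant = character(0))
  })
}
# shinyApp(ui, server)  # uncomment to launch the app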
Useful Matrix
# build the user-item matrix based on "useful" votes
useful_mx <- sparseMatrix(
i = rating$user_No,
j = rating$restaurant_No,
x = rating$useful
)
# convert the dgCMatrix to a realRatingMatrix for use with recommenderlab
u_mx <- as(useful_mx,"realRatingMatrix")
# set item labels
colnames(u_mx) <- paste("R", 1:65432, sep = "")
# set user labels
rownames(u_mx) <- paste("U", 1:63081, sep = "")
# view the matrix
getRatingMatrix(u_mx)[1:10,1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
## R1 R2 R3 R4 R5
## U1 3 1 1 2 0
## U2 . . . . 0
## U3 . . . . .
## U4 . . . . .
## U5 . . . . .
## U6 . . . 4 0
## U7 . . . 1 0
## U8 1 . . . 3
## U9 . . . . .
## U10 . . . 1 .
Funny Matrix
# build the user-item matrix based on "funny" votes
funny_mx <- sparseMatrix(
i = rating$user_No,
j = rating$restaurant_No,
x = rating$funny
)
# convert the dgCMatrix to a realRatingMatrix for use with recommenderlab
f_mx <- as(funny_mx,"realRatingMatrix")
# set item labels
colnames(f_mx) <- paste("R", 1:65432, sep = "")
# set user labels
rownames(f_mx) <- paste("U", 1:63081, sep = "")
# view the matrix
getRatingMatrix(f_mx)[1:10,1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
## R1 R2 R3 R4 R5
## U1 0 1 0 0 0
## U2 . . . . 0
## U3 . . . . .
## U4 . . . . .
## U5 . . . . .
## U6 . . . 0 0
## U7 . . . 0 0
## U8 0 . . . 0
## U9 . . . . .
## U10 . . . 0 .
Cool Matrix
# build the user-item matrix based on "cool" votes
cool_mx <- sparseMatrix(
i = rating$user_No,
j = rating$restaurant_No,
x = rating$cool
)
# convert the dgCMatrix to a realRatingMatrix for use with recommenderlab
c_mx <- as(cool_mx,"realRatingMatrix")
# set item labels
colnames(c_mx) <- paste("R", 1:65432, sep = "")
# set user labels
rownames(c_mx) <- paste("U", 1:63081, sep = "")
# view the matrix
getRatingMatrix(c_mx)[1:10,1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
## R1 R2 R3 R4 R5
## U1 2 0 0 0 0
## U2 . . . . 1
## U3 . . . . .
## U4 . . . . .
## U5 . . . . .
## U6 . . . 1 0
## U7 . . . 1 1
## U8 0 . . . 0
## U9 . . . . .
## U10 . . . 1 .
# statistics of the useful, funny, and cool vote data
summary(u_mx@data@x[])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.006 1.000 500.000
summary(f_mx@data@x[])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4091 0.0000 287.0000
summary(c_mx@data@x[])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5151 0.0000 234.0000
From the summary, we can see that the useful, funny, and cool values record how many people found a restaurant's reviews useful, funny, or cool. The higher the value, the more popular the restaurant. We can treat these numbers as ratings from different aspects. Because the three factors are on different scales (the maximum values are 500, 287, and 234), we will convert them to binary values. The binarized useful, funny, and cool ratings will then be combined with the primary ratings to build the new recommender models.
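A toy sketch of the thresholding idea (the numbers here are made up; the actual thresholds are derived from the vote-count frequency tables below):
votes <- c(0, 1, 3, 120)      # raw "useful" votes on four hypothetical reviews
threshold <- 50               # vote counts above this are treated as outliers
as.integer(votes < threshold) # 1 = within the common range, 0 = beyond it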
# frequency of each "useful" vote count across reviews
useful_tb <- as.data.frame(table(rating$useful))
useful_tb$Var1 <- as.numeric(as.character(useful_tb$Var1))
# keep vote counts shared by more than 50 reviews; the largest of these serves as the binarization threshold
u_threshold <- useful_tb[useful_tb$Freq > 50,]
# frequency of each "funny" vote count across reviews
funny_tb <- as.data.frame(table(rating$funny))
funny_tb$Var1 <- as.numeric(as.character(funny_tb$Var1))
# keep vote counts shared by more than 50 reviews; the largest of these serves as the binarization threshold
f_threshold <- funny_tb[funny_tb$Freq > 50,]
# frequency of each "cool" vote count across reviews
cool_tb <- as.data.frame(table(rating$cool))
cool_tb$Var1 <- as.numeric(as.character(cool_tb$Var1))
# keep vote counts shared by more than 50 reviews; the largest of these serves as the binarization threshold
c_threshold <- cool_tb[cool_tb$Freq > 50,]
mx_b <- mx_r
# convert the primary rating matrix to a binary matrix
mx_b@data@x[mx_b@data@x < mean(mx_b@data@x[])] <- 1
mx_b@data@x[mx_b@data@x > mean(mx_b@data@x[])] <- 0
# convert the useful matrix to a binary matrix
u_mx@data@x[u_mx@data@x < max(u_threshold$Var1)] <- 1
u_mx@data@x[u_mx@data@x > max(u_threshold$Var1)] <- 0
# convert the funny rating matrix to a binary matrix
f_mx@data@x[f_mx@data@x < max(f_threshold$Var1)] <- 1
f_mx@data@x[f_mx@data@x > max(f_threshold$Var1)] <- 0
# convert the cool rating matrix to a binary matrix
c_mx@data@x[c_mx@data@x < max(c_threshold$Var1)] <- 1
c_mx@data@x[c_mx@data@x > max(c_threshold$Var1)] <- 0
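A quick sanity check (a sketch; exact counts depend on the data, and values exactly equal to a threshold pass through unchanged, so this is worth verifying):
# the binarized matrices should now contain only 0 and 1
table(mx_b@data@x)
table(u_mx@data@x)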
# choose the users and restaurants that match the constrained user-item matrix (users who rated more than 20 times and restaurants that received more than 50 reviews)
u_mx_fit <- u_mx[,c(colnames(mx_r))]
u_mx_fit <- u_mx_fit[row.names(u_mx_fit) %in% c(rownames(mx_r)),]
f_mx_fit <- f_mx[,c(colnames(mx_r))]
f_mx_fit <- f_mx_fit[row.names(f_mx_fit) %in% c(rownames(mx_r)),]
c_mx_fit <- c_mx[,c(colnames(mx_r))]
c_mx_fit <- c_mx_fit[row.names(c_mx_fit) %in% c(rownames(mx_r)),]
# combine primary ratings with useful rating by element-wise multiplication
r0_r1 <- mx_b@data * u_mx_fit@data
# combine primary ratings with funny rating by element-wise multiplication
r0_r1_r2 <- r0_r1 * f_mx_fit@data
# combine primary ratings with cool rating by element-wise multiplication
r0_r1_r2_r3 <- r0_r1_r2 * c_mx_fit@data
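Element-wise multiplication of binary matrices acts as a logical AND across criteria: a cell keeps a 1 only when every criterion marks it 1. A toy illustration with made-up vectors:
primary <- c(1, 1, 0, 1)
useful <- c(1, 0, 1, 1)
primary * useful # yields 1 0 0 1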
There are 7 ways to integrate the useful, funny, cool, and primary ratings: primary+useful, primary+funny, primary+cool, primary+useful+funny, primary+useful+cool, primary+cool+funny, and primary+useful+funny+cool. We will use primary+useful, primary+useful+funny, and primary+useful+funny+cool to build the recommendation models.
**Primary + Useful**
combine_1 <- as(r0_r1,"realRatingMatrix")
# create the evaluation scheme: split the data into training and test sets
set.seed(2)
(c1_e <- evaluationScheme(combine_1[1:1200], method = "split",train = 0.8, given = 5, goodRating = 3, k=5))
## Evaluation scheme with 5 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1200 x 5243 rating matrix of class 'realRatingMatrix' with 488248 ratings.
# Creating a user-based collaborative filtering model using the training data.
(c1_ubcf <- Recommender(getData(c1_e, "train"), method ="UBCF", parameter = list(method = "cosine", normalize = "Z-score", nn=25)))
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 960 users.
# release memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3736387 199.6 12002346 641.0 12002346 641.0
## Vcells 78251706 597.1 112780893 860.5 112780608 860.5
# evaluation
c1_results <- evaluate(c1_e, method="UBCF", type = "ratings", n=c(1,3,5,10,15,20))
## UBCF run fold/sample [model time/prediction time]
## 1 [0.03sec/11.51sec]
## 2 [0.05sec/12.41sec]
## 3 [0.05sec/11.73sec]
## 4 [0.05sec/11.72sec]
## 5 [0.05sec/12.43sec]
avg(c1_results)
# making predictions on ratings
(c1_p_rating <- predict(c1_ubcf, getData(c1_e, "known"), type="ratings",n=10))
## 240 x 5243 rating matrix of class 'realRatingMatrix' with 1068552 ratings.
# show predicted ratings
as(c1_p_rating, "matrix")[1:10,1:7]
## R1 R3 R4 R5 R6 R10 R11
## U6 NA NA NA NA NA NA NA
## U14 0.4525209 0.4328711 0.4502401 0.3393765 0.3895817 0.3967006 0.3460426
## U30 0.6000000 0.6000000 0.6000000 0.5603706 0.6000000 0.5865870 0.6355804
## U33 0.1702462 0.2000000 0.2188049 0.2156242 0.2000000 0.2227773 0.2068239
## U40 0.3915823 0.3798296 0.3860204 0.3717654 0.4000000 0.3760555 0.3720130
## U41 0.4138108 0.4138108 0.4138108 0.3542383 0.4000000 0.3771508 0.3928913
## U44 NA NA NA NA NA NA NA
## U57 0.8000000 0.7912793 0.8094813 0.7847997 0.8000000 0.7742290 0.8069384
## U58 0.8000000 0.8220852 0.7909140 0.7721165 0.8000000 0.8000000 0.7839648
## U66 0.6000000 0.5872285 0.5824837 0.5836675 0.6000000 0.5885307 0.5629996
# RMSE
(error <- data.frame(value = calcPredictionAccuracy(c1_p_rating, getData(c1_e, "unknown"))))
##          value
## RMSE 0.5033015
## MSE  0.2533124
## MAE  0.4509785
# evaluation
# (evaluating the topNList took a long time to run, so the command and its output are kept below as comments)
#results <- evaluate(e, method="UBCF", type = "topNList", n=c(1,3,5,10,15,20))
#UBCF run fold/sample [model time/prediction time]
#1 [0.16sec/398.42sec]
#2 [0.17sec/393.06sec]
#3 [0.27sec/391.93sec]
#4 [0.09sec/393.77sec]
#5 [0.16sec/395.01sec]
# making predictions on topNList
(c1_p_topN <- predict(c1_ubcf, combine_1[1201],type="topNList",n=10))
## Recommendations as 'topNList' with n = 10 for 1 users.
# show predicted top10 restaurants
(c1_rec <- as(c1_p_topN, "list"))
## $U1827
## [1] "R1967" "R831" "R603" "R1861" "R1971" "R873" "R5580" "R294"
## [9] "R1622" "R5977"
**Primary + Useful + Funny**
combine_2 <- as(r0_r1_r2,"realRatingMatrix")
# create the evaluation scheme: split the data into training and test sets
set.seed(3)
(c2_e <- evaluationScheme(combine_2[1:1200], method = "split",train = 0.8, given = 5, goodRating = 3, k=5))
## Evaluation scheme with 5 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1200 x 5243 rating matrix of class 'realRatingMatrix' with 488248 ratings.
# Creating a user-based collaborative filtering model using the training data.
(c2_ubcf <- Recommender(getData(c2_e, "train"), method ="UBCF", parameter = list(method = "cosine", normalize = "Z-score", nn=25)))
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 960 users.
# release memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3736770 199.6 12002346 641.0 12002346 641.0
## Vcells 81954626 625.3 135417071 1033.2 135416853 1033.2
# evaluation
c2_results <- evaluate(c2_e, method="UBCF", type = "ratings", n=c(1,3,5,10,15,20))
## UBCF run fold/sample [model time/prediction time]
## 1 [0.05sec/11.61sec]
## 2 [0.04sec/11.87sec]
## 3 [0.04sec/11.63sec]
## 4 [0.04sec/11.87sec]
## 5 [0.06sec/11.92sec]
avg(c2_results)
# making predictions on ratings
(c2_p_rating <- predict(c2_ubcf, getData(c2_e, "known"), type="ratings",n=10))
## 240 x 5243 rating matrix of class 'realRatingMatrix' with 1068552 ratings.
# show predicted ratings
as(c2_p_rating, "matrix")[1:10,1:7]
## R1 R3 R4 R5 R6 R10 R11
## U8 0.4242750 0.3862496 0.3882118 0.3499197 0.3871293 0.4219942 0.3580959
## U24 0.6000000 0.6000000 0.5808039 0.5543080 0.6000000 0.6000000 0.5769046
## U26 0.6000000 0.6000000 0.6000000 0.5680516 0.6000000 0.6000000 0.5756454
## U29 0.2320670 0.1787733 0.1581266 0.1752761 0.1845924 0.1998294 0.2152066
## U36 0.4000000 0.4000000 0.3894468 0.3783530 0.4000000 0.4000000 0.4097481
## U37 0.2017338 0.1858843 0.2024211 0.2162648 0.1776786 0.2205786 0.1484986
## U55 0.4000000 0.4000000 0.3874557 0.3344267 0.4000000 0.4000000 0.3821346
## U65 0.4000000 0.4000000 0.4427417 0.3727352 0.4230476 0.4356060 0.3717501
## U71 0.4000000 0.4000000 0.4278132 0.3971104 0.4000000 0.3751178 0.4000000
## U75 0.4239906 0.4136335 0.4000000 0.3648116 0.3872801 0.3900600 0.4315422
# RMSE
(error <- data.frame(value = calcPredictionAccuracy(c2_p_rating, getData(c2_e, "unknown"))))
##          value
## RMSE 0.5203734
## MSE  0.2707885
## MAE  0.4677258
# evaluation
# (evaluating the topNList took a long time to run, so the command and its output are kept below as comments)
#results <- evaluate(e, method="UBCF", type = "topNList", n=c(1,3,5,10,15,20))
#UBCF run fold/sample [model time/prediction time]
#1 [0.16sec/398.42sec]
#2 [0.17sec/393.06sec]
#3 [0.27sec/391.93sec]
#4 [0.09sec/393.77sec]
#5 [0.16sec/395.01sec]
# making predictions on topNList
(c2_p_topN <- predict(c2_ubcf, combine_2[1201],type="topNList",n=10))
## Recommendations as 'topNList' with n = 10 for 1 users.
# show predicted top10 restaurants
(c2_rec <- as(c2_p_topN, "list"))
## $U1827
## [1] "R1967" "R603" "R602" "R1081" "R831" "R2291" "R1861" "R3438"
## [9] "R873" "R1464"
**Primary + Useful + Funny + Cool**
combine_3 <- as(r0_r1_r2_r3,"realRatingMatrix")
# create the evaluation scheme: split the data into training and test sets
set.seed(4)
(c3_e <- evaluationScheme(combine_3[1:1200], method = "split",train = 0.8, given = 5, goodRating = 3, k=5))
## Evaluation scheme with 5 items given
## Method: 'split' with 5 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 1200 x 5243 rating matrix of class 'realRatingMatrix' with 488248 ratings.
# Creating a user-based collaborative filtering model using the training data.
(c3_ubcf <- Recommender(getData(c3_e, "train"), method ="UBCF", parameter = list(method = "cosine", normalize = "Z-score", nn=25)))
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 960 users.
# release memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3737055 199.6 12002346 641.0 12002346 641.0
## Vcells 85648014 653.5 135417071 1033.2 135416854 1033.2
# evaluation
c3_results <- evaluate(c3_e, method="UBCF", type = "ratings", n=c(1,3,5,10,15,20))
## UBCF run fold/sample [model time/prediction time]
## 1 [0.03sec/11.31sec]
## 2 [0.04sec/11.53sec]
## 3 [0.06sec/11.51sec]
## 4 [0.04sec/11.74sec]
## 5 [0.05sec/11.53sec]
avg(c3_results)
# making predictions on ratings
(c3_p_rating <- predict(c3_ubcf, getData(c3_e, "known"), type="ratings",n=10))
## 240 x 5243 rating matrix of class 'realRatingMatrix' with 1073790 ratings.
# show predicted ratings
as(c3_p_rating, "matrix")[1:10,1:7]
## R1 R3 R4 R5 R6 R10 R11
## U5 NA NA NA NA NA NA NA
## U15 0.6000000 0.6000000 0.6135017 0.6118739 0.6000000 0.5788825 0.6000000
## U24 0.4073642 0.4305689 0.4000000 0.3906037 0.4000000 0.4053768 0.3680601
## U26 0.6000000 0.6000000 0.6000000 0.5651377 0.6000000 0.5697705 0.5912079
## U34 0.4000000 0.3854438 0.3843342 0.3705854 0.4000000 0.4000000 0.3843342
## U36 0.6000000 0.5840123 0.6271648 0.5732234 0.5868789 0.5868789 0.5964009
## U46 0.4000000 0.4000000 0.3791149 0.4038671 0.3888094 0.3892605 0.3790034
## U58 0.2136111 0.1774667 0.2443033 0.1718200 0.1867005 0.2313888 0.2165374
## U59 0.2000000 0.1908956 0.2000000 0.2164272 0.2000000 0.2000000 0.2334704
## U71 0.3776172 0.3776605 0.4154288 0.3869664 0.3878530 0.3710440 0.3925566
# RMSE
(error <- data.frame(value = calcPredictionAccuracy(c3_p_rating, getData(c3_e, "unknown"))))
##          value
## RMSE 0.5295649
## MSE  0.2804390
## MAE  0.4686014
# evaluation
# (evaluating the topNList took a long time to run, so the command and its output are kept below as comments)
#results <- evaluate(e, method="UBCF", type = "topNList", n=c(1,3,5,10,15,20))
#UBCF run fold/sample [model time/prediction time]
#1 [0.16sec/398.42sec]
#2 [0.17sec/393.06sec]
#3 [0.27sec/391.93sec]
#4 [0.09sec/393.77sec]
#5 [0.16sec/395.01sec]
# making predictions on topNList
(c3_p_topN <- predict(c3_ubcf, combine_3[1201],type="topNList",n=10))
## Recommendations as 'topNList' with n = 10 for 1 users.
# show predicted top10 restaurants
(c3_rec <- as(c3_p_topN, "list"))
## $U1827
## [1] "R831" "R603" "R1861" "R1622" "R5580" "R63" "R294" "R6589"
## [9] "R1464" "R2410"
# get the top 50 restaurants for user 1201 from the combined (primary + useful) recommender
(c1_p_top50 <- predict(c1_ubcf, mx_r[1201],type="topNList",n=50))
## Recommendations as 'topNList' with n = 50 for 1 users.
# filter the restaurants for user 1201 based on location
c1_pred_restaurant <- data.frame(as(c1_p_top50, "list"))
colnames(c1_pred_restaurant) <- "U1201"
c1_pred_restaurant[] <- lapply(c1_pred_restaurant, as.character)
c1_pred_restaurant$restaurant_id <- c1_pred_restaurant$U1201
c1_pred_restaurant <- left_join(c1_pred_restaurant,idf_city, by='restaurant_id')
c1_pred_restaurant$city <- as.character(c1_pred_restaurant$city)
c1_pred_restaurant$state <- as.character(c1_pred_restaurant$state)
# For example, if user 1201 wants recommendations for restaurants in Las Vegas, we can filter them out of the top-50 list
head(Lasvegas <- filter(c1_pred_restaurant,city == "Las Vegas"),n=5)
## U1201 restaurant_id restaurant_No restaurant
## 1 R5580 R5580 5580 Cafe Rio
## 2 R603 R603 603 Bayside Buffet at Mandalay Bay
## 3 R142 R142 142 Serendipity 3
## 4 R1622 R1622 1622 FIX
## 5 R63 R63 63 Luxor Hotel and Casino Las Vegas
## city state
## 1 Las Vegas NV
## 2 Las Vegas NV
## 3 Las Vegas NV
## 4 Las Vegas NV
## 5 Las Vegas NV
Serendipity
According to reference 5, serendipity will be measured as $Srdp(u) = \frac{|UNEXP(u) \cap USEFUL(u)|}{N}$,
where $USEFUL(u)$ denotes the useful (relevant) items for user $u$ and $N$ is the size of the recommendation set $RS(u)$. The unexpected set of recommendations for user $u$ is defined as $UNEXP(u) = RS(u) \setminus PM$, where $PM$ is a set of recommendations generated by a primitive model assumed to have low unexpectedness, and $RS(u)$ denotes the top-N recommendations generated by the recommender system for user $u$. When an element of $RS(u)$ does not belong to $PM$, it is considered unexpected.
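As a worked instance: with $N = 10$ recommendations, if the combined model's top-10 list shares 3 items with the primitive model's list, the remaining 7 are unexpected; if all 7 of those also count as useful, $Srdp(u) = \frac{7}{10} = 0.7$.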
U1827_predict <- data.frame(rbind('Primary' = unlist(pri_rec), 'Primary + Useful' = unlist(c1_rec), 'Primary + Useful + Funny' = unlist(c2_rec), 'Primary + Useful + Funny + Cool' = unlist(c3_rec)))
colnames(U1827_predict) <- paste0("No.", 1:10)
kable(U1827_predict)
No.1 | No.2 | No.3 | No.4 | No.5 | No.6 | No.7 | No.8 | No.9 | No.10 | |
---|---|---|---|---|---|---|---|---|---|---|
Primary | R1030 | R478 | R745 | R1550 | R1344 | R6798 | R5179 | R228 | R1204 | R229 |
Primary + Useful | R1967 | R831 | R603 | R1861 | R1971 | R873 | R5580 | R294 | R1622 | R5977 |
Primary + Useful + Funny | R1967 | R603 | R602 | R1081 | R831 | R2291 | R1861 | R3438 | R873 | R1464 |
Primary + Useful + Funny + Cool | R831 | R603 | R1861 | R1622 | R5580 | R63 | R294 | R6589 | R1464 | R2410 |
pri_rating <- predict(r_ubcf, mx_r[1201], type="ratings",n=10)
usefulness <- as(pri_rating, "matrix")
usefulness_df <- as.data.frame(usefulness) %>%
gather(restaurant_id, predicted_rating,1:length(usefulness))
unexpected_1 <- setdiff(pri_rec[[1]], c1_rec[[1]])
unexpected_ratings <- filter(usefulness_df, restaurant_id %in% unexpected_1 ) %>%
filter(predicted_rating > mean(mx_r@data@x) )
serendipity_c1 <- nrow(unexpected_ratings)/length(unlist(c1_rec))
print(paste("serendipity for user 1827 using combiantion of primary rating and useful rating is:",serendipity_c1*100,"%"))
## [1] "serendipity for user 1827 using combiantion of primary rating and useful rating is: 100 %"
By combining the primary rating and the useful rating, we get a totally different top-10 recommendation list for user 1827.
unexpected_ratings <- left_join(unexpected_ratings,idf_city,by="restaurant_id")
kable(unexpected_LasVegas <- filter(unexpected_ratings,city == "Las Vegas"))
restaurant_id | predicted_rating | restaurant_No | restaurant | city | state |
---|---|---|---|---|---|
R228 | 4.224769 | 228 | Bachi Burger | Las Vegas | NV |
R478 | 4.249639 | 478 | SkinnyFATS | Las Vegas | NV |
R1030 | 4.266748 | 1030 | Desert Wireless iPhone Repair | Las Vegas | NV |
R1204 | 4.224138 | 1204 | The Buffet at Bellagio | Las Vegas | NV |
R5179 | 4.225634 | 5179 | Lucki Thai | Las Vegas | NV |
R6798 | 4.232940 | 6798 | 9037 Salon | Las Vegas | NV |
new_restaurant <- setdiff(unexpected_LasVegas$restaurant,Lasvegas$restaurant)
print(paste("By combing primary rating and useful rating,we found",length(new_restaurant),"restaurants not recommended by the primary model by relevant:",paste(unlist(new_restaurant), collapse=','),"for user 1827."))
## [1] "By combing primary rating and useful rating,we found 6 restaurants not recommended by the primary model by relevant: Bachi Burger,SkinnyFATS,Desert Wireless iPhone Repair,The Buffet at Bellagio,Lucki Thai,9037 Salon for user 1827."
# topN for test data set based on primary recommendation system
(p_topN <- predict(r_ubcf, getData(e,"unknown"),type="topNList",n=10))
## Recommendations as 'topNList' with n = 10 for 240 users.
# show predicted top10 restaurants
pri_rec <- as(p_topN, "list")
# topN for test data set based on primary+useful rating
(c1_p_topN <- predict(c1_ubcf, getData(e,"unknown"),type="topNList",n=10))
## Recommendations as 'topNList' with n = 10 for 240 users.
# show predicted top10 restaurants
c1_rec <- as(c1_p_topN, "list")
serendipity_c1 <- numeric(length(pri_rec))
serendipity_c1_df <- data.frame()
for (i in 1:length(pri_rec)){
unexpected_1 <- setdiff(pri_rec[[i]], c1_rec[[i]])
unexpected_ratings <- filter(usefulness_df, restaurant_id %in% unexpected_1 ) %>%
filter(predicted_rating > mean(mx_r@data@x))
serendipity_c1[i] <- nrow(unexpected_ratings)/10
serendipity_c1_df_1 <- data.frame('user_id' = names(pri_rec[i]),'serendipity'= serendipity_c1[i])
serendipity_c1_df <- rbind(serendipity_c1_df,serendipity_c1_df_1)
}
head(serendipity_c1_df, n = 5)
## user_id serendipity
## 1 U11 1
## 2 U31 1
## 3 U40 1
## 4 U41 1
## 5 U43 1
A restaurant recommendation system based on the user-based collaborative filtering algorithm was built with the Yelp academic data from Challenge Round 9. The RMSE is 1.47.
Restaurant recommendation results can be further refined by location. In the future, location information (such as longitude and latitude) or the distance between restaurants could be used to calculate similarity.
The recommendation system based on multi-criteria ratings generated a totally different list of restaurants for users. It is intriguing that the serendipity of the multi-criteria recommendation system was 100% for each user. At the same time, the prediction accuracy was higher than with a single rating criterion: the RMSE dropped to 0.50.
Blanca Vargas-Govea, Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. CARS (2011).
Mengqi Yu, Meng Xue, Wenjia Ouyang. Restaurants Review Star Prediction for Yelp Dataset. Conference Proceedings (2015).
Gediminas Adomavicius, YoungOk Kwon. New Recommendation Techniques for Multi-Criteria Rating Systems. IEEE Intelligent Systems 22(3) (2017).
Jun Zeng, Feng Li, Haiyang Liu, Junhao Wen, Sachio Hirokawa. A Restaurant Recommender System Based on User Preference and Location in Mobile Environment. Advanced Applied Informatics (IIAI-AAI), 2016 5th IIAI International Congress (2016).
Qiuxia Lu, Tianqi Chen, Weinan Zhang, Diyi Yang, Yong Yu. Serendipitous Personalized Ranking for Top-N Recommendation. Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT '12), Volume 1, 258-265 (2012).