Our project used data from Kaggle’s 2013 Yelp Challenge. This challenge included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. The data includes user reviews, ratings, and check-in data for a wide range of businesses.
Each file is a collection of JSON records, stored one object per line. We use the stream_in function from the jsonlite package, which parses JSON data line by line, to read the files stored in the data folder of our repository.
# Packages used throughout this report
library(jsonlite)
library(dplyr)
library(stringr)
library(ggplot2)
library(recommenderlab)
# Business data
business <- stream_in(file("data/yelp_training_set_business.json"), verbose = FALSE)
# User data
user <- stream_in(file("data/yelp_training_set_user.json"), verbose = FALSE)
# Check-in data
checkin <- stream_in(file("data/yelp_training_set_checkin.json"), verbose = FALSE)
# Review data
review <- stream_in(file("data/yelp_training_set_review.json"), verbose = FALSE)

We chose to limit the scope of our recommender system to businesses with tags related to food and beverages. There were originally 508 unique category tags listed within our business data. We manually selected 112 targeted categories to subset our data.
We applied additional transformations to remove unnecessary data. There were 1224 businesses in our data that were permanently closed. These accounted for 9.8% of all businesses and were removed from our data. We also removed 3 businesses located outside of AZ.
As a result of our transformations, our recommender data now consists of 4828 unique businesses. This data is previewed below:
# drop open, neighborhoods (no data), full_address, review_count, type (always "business"), and stars
df_business <- business %>%
  filter(open == "TRUE", state == "AZ") %>%
  select(-open, -neighborhoods, -full_address, -review_count, -type, -stars) %>%
  mutate(categories = sapply(categories, toString))
# create category filter
filter <- 'Argentine| Burmese| Cambodian| Cocktail Bars|Laotian|Lebanese|Live/Raw Food|Russian|African|Champagne Bars|Kosher|Modern European|Scandinavian|Taiwanese|Tapas/Small Plates|Afghan|Brazilian|Food Trucks|Shaved Ice|Wineries|Dim Sum|Ethiopian|Fondue|Hookah Bars|Persian/Iranian|Peruvian| Polish| Seafood Markets| Tapas Bars|Halal|British| Cheese Shops|German|Spanish|Cheesesteaks|Cuban|Do-It-Yourself Food|Gastropubs|Salad|Creperies|Soup|Chocolatiers & Shops|Filipino|Food Stands|Fruits & Veggies|Meat Shops|Mongolian|Soul Food|Comfort Food| Irish|Fish & Chips|Cajun/Creole|Caribbean|Pakistani|Southern|Candy Stores|Vegan|Latin American|Breweries|French|Gay Bars|Korean|Gluten-Free|Hawaiian|Farmers Market|Vegetarian|Middle Eastern|Ethnic Food|Indian|Pubs|Chicken Wings|Dive Bars| Juice Bars & Smoothies|Vietnamese|Cafes|Wine Bars|Bagels|Diners|Hot Dogs|Tex-Mex|Donuts|Greek|Thai| Desserts|Mediterranean|Beer| Wine & Spirits|Seafood|Sushi Bars| Lounges|Steakhouses|Buffets|Japanese|Sports Bars|Delis|Bakeries|Specialty Food|Breakfast & Brunch|Ice Cream & Frozen Yogurt|Burgers|Italian| Chinese|Coffee & Tea|American (New)|Sandwiches|Fast Food|Pizza|American (Traditional)|Bars|Mexican|Food| Restaurants'
# filter businesses using the category filter & convert categories to a factor
df_business <- df_business %>%
  filter(str_detect(categories, filter)) %>%
  mutate(categories = as.factor(categories))

| business_id | categories | city | name | longitude | state | latitude |
|---|---|---|---|---|---|---|
| usAsSV36QmUej8--yvN-dg | Food, Grocery | Phoenix | Food City | -112.0854 | AZ | 33.39221 |
| PzOqRohWw7F7YEPBz6AubA | Food, Bagels, Delis, Restaurants | Glendale Az | Hot Bagels & Deli | -112.2003 | AZ | 33.71280 |
| qarobAbxGSHI7ygf1f7a_Q | Sandwiches, Restaurants | Gilbert | Jersey Mike’s Subs | -111.8120 | AZ | 33.37884 |
| gA5CuBxF-0CnOpGnryWJdQ | Mexican, Restaurants | Phoenix | La Paloma Mexican Food | -112.0814 | AZ | 33.48011 |
| acaBJcFEKPmmSDIO6c-ZGQ | Food, Grocery | Goodyear | Basha’s | -112.3920 | AZ | 33.46881 |
| JxVGJ9Nly2FFIs_WpJvkug | Pizza, Restaurants | Scottsdale | Sauce | -111.9263 | AZ | 33.61746 |
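
The category filter above is a regular-expression alternation, so it may be worth checking how individual alternatives behave. The sketch below uses base R's grepl on made-up category strings (not rows from our data) to show two things: unescaped parentheses act as a regex group rather than literal characters, and an alternative with a leading space only matches a tag that is not first in the list. In the full pattern, broader alternatives such as Food may still catch many of these rows, so the effect on our final counts is not known from this sketch.

```r
# Toy category strings in the same "A, B, C" form produced by toString().
categories <- c(
  "American (New), Restaurants",
  "Burmese, Restaurants",
  "Restaurants, Burmese",
  "Hair Salons, Beauty & Spas"
)

# Unescaped parentheses form a regex group, so this pattern looks for the
# text "American New" and misses the tag "American (New)".
unescaped <- grepl("American (New)", categories)

# Escaping the parentheses matches the tag as written.
escaped <- grepl("American \\(New\\)", categories)

# A leading space in an alternative only matches when the tag is not first.
leading_space <- grepl(" Burmese", categories)

# fixed = TRUE treats a single tag as a literal string, with no regex rules.
literal <- grepl("American (New)", categories, fixed = TRUE)

data.frame(categories, unescaped, escaped, leading_space, literal)
```

If exact tag matching is wanted, one option is to split each category string on ", " and compare the pieces against a vector of tag names with %in%, which avoids regex escaping entirely.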
We subset our review data to reviews of the selected food and beverage businesses. This reduced our review data from 229,907 to 165,823 reviews.
# subset reviews to the selected businesses & remove the type column
df_review <- review %>%
  filter(business_id %in% df_business$business_id) %>%
  select(-type)
# flatten the nested votes data (funny/useful/cool) into separate columns
df_review <- do.call(data.frame, df_review)
# store user_id and business_id as character rather than factor
df_review$user_id <- as.character(df_review$user_id)
df_review$business_id <- as.character(df_review$business_id)

| votes.funny | votes.useful | votes.cool | user_id | review_id | stars | date | business_id |
|---|---|---|---|---|---|---|---|
| 0 | 5 | 2 | rLtl8ZkDX5vH5nAx9C3q5Q | fWKvX83p0-ka4JS3dc6E5A | 5 | 2011-01-26 | 9yKzy9PApeiPPOUJEtnvkg |
| 0 | 0 | 0 | 0a2KyEL0d3Yb1V6aivbIuQ | IjZ33sJrzXqU-0X6U8NwyA | 5 | 2011-07-27 | ZRJwVLyzEJq1VAihDhYiow |
| 0 | 1 | 0 | 0hT2KtfLiobPvh6cDC8JQg | IESLBzqUCLdSzSqm0eCSxQ | 4 | 2012-06-14 | 6oRAC4uyJCsJl1X0WZpVSA |
| 1 | 3 | 4 | sqYN3lNgvPbPCTRsMFu27g | m2CKSsepBCoRYWxiRUsxAg | 4 | 2007-12-13 | -yxfBYGB6SEqszmxJxd97A |
| 4 | 7 | 7 | wFweIWhv2fREZV_dYkz_1g | riFQ3vxNpP4rWLk_CSri2A | 5 | 2010-02-12 | zp713qNhx8d9KCJJnrw1xA |
| 0 | 0 | 0 | Vh_DlizgGhSqQh4qfZ2h6A | XtnfnYmnJYi71yIuGsXIUA | 4 | 2012-08-17 | wNUea3IXZWD63bbOQaOH-g |
| text |
|---|
| My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better. Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I’ve ever had. I’m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing. While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best “toast” I’ve ever had. Anyway, I can’t wait to go back! |
We applied a similar filter to users, keeping only those connected to our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations.
The dataframe preview below shows aggregate user data across all reviews an individual user provided on Yelp within our data selection.
| votes.funny | votes.useful | votes.cool | user_id | name | average_stars | review_count |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | Ch6CdTR2IVaVANr-RglMOg | T | 5.00 | 2 |
| 0 | 0 | 0 | NZrLmHRyiHmyT1JrfzkCOA | Beth | 1.00 | 1 |
| 30 | 45 | 36 | mWx5Sxt_dx-sYBZg6RgJHQ | Amy | 3.79 | 19 |
| 28 | 130 | 31 | hryUDaRk7FLuDAYui2oldw | Beach | 3.83 | 207 |
| 1 | 0 | 1 | 2t6fZNLtiqsihVmeO7zggg | christine | 3.00 | 2 |
| 0 | 3 | 2 | mn6F-eP5WU37b-iLTop2mQ | Denis | 4.50 | 4 |
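
The user filter is not shown in code above. A minimal sketch of one way to do it, assuming users are kept when they wrote at least one review of a selected business: the toy tables, their values, and the names kept_reviews and kept_users are hypothetical, and only the column names follow the previews in this report.

```r
# Hypothetical toy tables; column names follow the previews in this report.
reviews <- data.frame(
  user_id = c("u1", "u2", "u2", "u4"),
  business_id = c("b1", "b1", "b2", "b9"),
  stars = c(5, 4, 3, 2),
  stringsAsFactors = FALSE
)
users <- data.frame(
  user_id = c("u1", "u2", "u3", "u4"),
  average_stars = c(5.0, 3.5, 4.0, 2.0),
  stringsAsFactors = FALSE
)
selected_businesses <- c("b1", "b2")

# Keep reviews of the selected businesses, then users who wrote any of them.
kept_reviews <- reviews[reviews$business_id %in% selected_businesses, ]
kept_users <- users[users$user_id %in% kept_reviews$user_id, ]

nrow(users)       # users before filtering
nrow(kept_users)  # users after filtering
```

Here u3 has no reviews and u4 only reviewed a business outside the selection, so both drop out.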
Lastly, we evaluated our check-in data and applied our business filter once more. This decreased our check-in observations from 8282 to 4423, which tells us that no check-in data is available for 8.4% of the businesses in our subset.
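
The 8.4% figure follows from the counts reported in the text. The short check below assumes each remaining check-in record corresponds to exactly one business in the subset; the variable names are ours, for illustration only.

```r
n_businesses <- 4828     # businesses retained after filtering (from the text)
n_with_checkins <- 4423  # check-in records remaining after filtering

# Assumption: one check-in record per business_id.
share_without <- 1 - n_with_checkins / n_businesses
round(100 * share_without, 1)  # 8.4
```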
Next, we created our main dataframe, df_main, by combining the business and review dataframes, using business_id and user_id as keys. This dataframe will be referenced later on when building our recommender matrices and algorithms.
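
The code that builds the main dataframe is not shown here. A minimal sketch of how such a combination might look, assuming an inner join of reviews to businesses on business_id: the toy rows and the name main_sketch are hypothetical and are not our actual data or object.

```r
# Hypothetical toy tables with the key columns used in this report.
businesses <- data.frame(
  business_id = c("b1", "b2"),
  name = c("Food City", "Sauce"),
  city = c("Phoenix", "Scottsdale"),
  stringsAsFactors = FALSE
)
reviews <- data.frame(
  review_id = c("r1", "r2", "r3"),
  user_id = c("u1", "u2", "u1"),
  business_id = c("b1", "b1", "b2"),
  stars = c(5, 4, 3),
  stringsAsFactors = FALSE
)

# Inner join: one row per review, with the business attributes attached.
main_sketch <- merge(reviews, businesses, by = "business_id")

# Check for repeated (user_id, business_id) pairs, since duplicate pairs
# would be ambiguous in a user-item matrix.
any(duplicated(main_sketch[, c("user_id", "business_id")]))
```

Checking for duplicate user and business pairs before building the matrix makes it explicit how repeat reviews, if any exist, should be handled.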
CLEAN UP SECTION
The graphs below show geographic trends in our data as well as our most popular categories.
# All Business City
df_business %>%
  select(business_id, city) %>%
  group_by(city) %>%
  dplyr::summarise(Count = n()) %>%
  ggplot(aes(x = city, y = Count)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab(label = "Business City") +
  geom_text(aes(label = Count), colour = "blue", fontface = "bold",
            position = position_stack(vjust = 1.1), check_overlap = TRUE)

CONSIDER: including a plot showing something like the number of reviews for each business. Review date by month is fairly consistent and not very informative on its own.
The plot below shows the distribution of our average ratings given by all users.
We tested 3 recommender algorithms to see which had the best performance metrics for our recommender system. To test the algorithms, we first had to create a user-item matrix and then split our data into training and test sets.
Matrix Building
We converted our raw ratings data into a user-item matrix to test and train our subsequent recommender system algorithms. The matrix was saved as a realRatingMatrix for processing purposes later on using the recommenderlab package.
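
To illustrate the shape of a user-item matrix before the real conversion, here is a small base-R sketch on made-up ratings. The object ui_toy and its values are hypothetical. recommenderlab's realRatingMatrix holds the same kind of information in a sparse format, which matters at our scale because most user and business pairs have no review.

```r
# Hypothetical ratings in long format: one row per (user, business) review.
ratings <- data.frame(
  user_id = c("u1", "u1", "u2", "u3"),
  business_id = c("b1", "b2", "b1", "b3"),
  stars = c(5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Rows are users, columns are businesses, cells hold the star rating.
# Pairs with no review stay NA, which is why the matrix is mostly empty.
ui_toy <- with(ratings, tapply(stars, list(user_id, business_id), mean))
ui_toy
```

Only 4 of the 9 cells are filled in this toy case.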
matrix_data <- df_main %>%
  select(user_id, business_id, stars) %>%
  mutate(user_id = as.factor(user_id),
         business_id = as.factor(business_id),
         stars = as.numeric(stars))
ui_mat <- as(matrix_data, "realRatingMatrix")

Train and Test Splits
Our data was split into training and test sets for evaluating our recommender algorithms. Using the recommenderlab package, we repeated a random split 10 times (k = 10), each time retaining 80% of the data for training and 20% for testing.
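
As a rough picture of what a single 80/20 split does, the base-R sketch below samples users into train and test groups. It assumes users (matrix rows) are the unit being split, which is our understanding of recommenderlab's split method. The user ids are hypothetical, and the sketch does not reproduce the known/unknown partition of test ratings that evaluationScheme also creates.

```r
set.seed(1000)

user_ids <- paste0("u", 1:10)  # hypothetical user ids
train_share <- 0.8

# Sample 80% of the users for training; the rest form the test set.
n_train <- floor(train_share * length(user_ids))
train_users <- sample(user_ids, n_train)
test_users <- setdiff(user_ids, train_users)

length(train_users)  # 8
length(test_users)   # 2
```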
# evaluation method with 80% of data for train and 20% for test
set.seed(1000)
evalu <- evaluationScheme(ui_mat, method = "split", train = 0.8, given = 0,
goodRating = 1, k = 10)
# Prep data
train <- getData(evalu, "train") # Training Dataset
dev_test <- getData(evalu, "known") # Test data from evaluationScheme of type KNOWN
test <- getData(evalu, "unknown") # Unknown dataset used for RMSE / model evaluation

Algo
Compare algorithm performance. Select the most effective one to build the recommender system.
Test system
Final conclusion. Explain limitations of the system. Make recommendations for future improvements.