Overview

Our project used data from Kaggle’s 2013 Yelp Challenge. This challenge included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. Our data includes user reviews, ratings, and check-in data for a wide range of businesses.

Data Acquisition

Each data file is a collection of JSON records, one record per line, rather than a single JSON document. We use the stream_in function from the jsonlite package, which parses JSON data line by line, to read the files stored in the data folder of our repository.

library(jsonlite)  # provides stream_in

# business data
business <- stream_in(file("data/yelp_training_set_business.json"), verbose = FALSE)
# user data
user <- stream_in(file("data/yelp_training_set_user.json"), verbose = FALSE)
# checkin data
checkin <- stream_in(file("data/yelp_training_set_checkin.json"), verbose = FALSE)
# review data
review <- stream_in(file("data/yelp_training_set_review.json"), verbose = FALSE)
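
To illustrate what line-by-line parsing does, here is a minimal sketch on a made-up two-record file. The file contents, the temporary path, and the name toy_business are assumptions for illustration only and are not part of the Yelp data.

```r
library(jsonlite)

# Hypothetical two-record file written to a temporary location.
tmp <- tempfile(fileext = ".json")
writeLines(c(
  '{"business_id": "b1", "stars": 4.5, "open": true}',
  '{"business_id": "b2", "stars": 3.0, "open": false}'
), tmp)

# stream_in reads one JSON record per line and binds the records
# into a single data frame.
toy_business <- stream_in(file(tmp), verbose = FALSE)

nrow(toy_business)   # one row per line of the file
names(toy_business)  # one column per JSON field
```

Each line must be a complete JSON object on its own, which is why a standard whole-file parser such as fromJSON would not be the natural choice for these files.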

Data Transformations

Business

We chose to limit the scope of our recommender system to businesses with tags related to food and beverages. There were originally 508 unique category tags listed within our business data. We manually selected 112 target categories to subset our data.

We applied additional transformations to remove unnecessary data. There were 1224 businesses in our data that were permanently closed. These businesses accounted for 9.8% of all businesses and were removed from our data. We also removed 3 businesses in our dataset that were located outside of AZ.

As a result of our transformations, our recommender data now consists of 4828 unique businesses. This data is previewed below:

# keep open businesses in AZ; drop open, neighborhoods (no data), full_address,
# review_count, type (all "business"), and stars
df_business <- business %>%
    filter(open == "TRUE", state == "AZ") %>%
    select(-open, -neighborhoods, -full_address, -review_count, -type, -stars) %>%
    mutate(categories = sapply(categories, toString))

# create category filter
filter <- 'Argentine| Burmese| Cambodian| Cocktail Bars|Laotian|Lebanese|Live/Raw Food|Russian|African|Champagne Bars|Kosher|Modern European|Scandinavian|Taiwanese|Tapas/Small Plates|Afghan|Brazilian|Food Trucks|Shaved Ice|Wineries|Dim Sum|Ethiopian|Fondue|Hookah Bars|Persian/Iranian|Peruvian| Polish| Seafood Markets| Tapas Bars|Halal|British| Cheese Shops|German|Spanish|Cheesesteaks|Cuban|Do-It-Yourself Food|Gastropubs|Salad|Creperies|Soup|Chocolatiers & Shops|Filipino|Food Stands|Fruits & Veggies|Meat Shops|Mongolian|Soul Food|Comfort Food| Irish|Fish & Chips|Cajun/Creole|Caribbean|Pakistani|Southern|Candy Stores|Vegan|Latin American|Breweries|French|Gay Bars|Korean|Gluten-Free|Hawaiian|Farmers Market|Vegetarian|Middle Eastern|Ethnic Food|Indian|Pubs|Chicken Wings|Dive Bars| Juice Bars & Smoothies|Vietnamese|Cafes|Wine Bars|Bagels|Diners|Hot Dogs|Tex-Mex|Donuts|Greek|Thai| Desserts|Mediterranean|Beer| Wine & Spirits|Seafood|Sushi Bars| Lounges|Steakhouses|Buffets|Japanese|Sports Bars|Delis|Bakeries|Specialty Food|Breakfast & Brunch|Ice Cream & Frozen Yogurt|Burgers|Italian| Chinese|Coffee & Tea|American (New)|Sandwiches|Fast Food|Pizza|American (Traditional)|Bars|Mexican|Food| Restaurants'

# filter businesses using filter & factor categories
df_business <- df_business %>% filter(str_detect(categories, filter)) %>% mutate(categories = as.factor(categories))
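
As a small illustration of how a pipe-separated pattern behaves in str_detect, the sketch below filters a made-up data frame. The toy rows and the shortened pattern are assumptions, not the project's data. It also shows one detail worth keeping in mind: parentheses are regex metacharacters, so a tag such as American (New) only matches literally when the parentheses are escaped.

```r
library(stringr)

# Hypothetical businesses with comma-separated category strings.
toy <- data.frame(
  name = c("Taco Stand", "Auto Shop", "Bistro"),
  categories = c("Mexican, Restaurants", "Automotive, Auto Repair", "American (New)"),
  stringsAsFactors = FALSE
)

# Shortened stand-in for the full category pattern; "|" means "or".
pattern <- "Mexican|Pizza|American \\(New\\)"

# TRUE for every row whose category string contains at least one alternative.
keep <- str_detect(toy$categories, pattern)
toy_food <- toy[keep, ]

# Unescaped parentheses form a regex group, so this literal tag is not matched.
unescaped_match <- str_detect("American (New)", "American (New)")
```

In the full pattern, broad alternatives such as Restaurants and Food are likely to catch most of these businesses anyway, so the escaping detail mainly matters for businesses that carry only a parenthesized tag.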

Preview Business Data
business_id	categories	city	name	longitude	state	latitude
usAsSV36QmUej8--yvN-dg	Food, Grocery	Phoenix	Food City	-112.0854	AZ	33.39221
PzOqRohWw7F7YEPBz6AubA	Food, Bagels, Delis, Restaurants	Glendale Az	Hot Bagels & Deli	-112.2003	AZ	33.71280
qarobAbxGSHI7ygf1f7a_Q	Sandwiches, Restaurants	Gilbert	Jersey Mike’s Subs	-111.8120	AZ	33.37884
gA5CuBxF-0CnOpGnryWJdQ	Mexican, Restaurants	Phoenix	La Paloma Mexican Food	-112.0814	AZ	33.48011
acaBJcFEKPmmSDIO6c-ZGQ	Food, Grocery	Goodyear	Basha’s	-112.3920	AZ	33.46881
JxVGJ9Nly2FFIs_WpJvkug	Pizza, Restaurants	Scottsdale	Sauce	-111.9263	AZ	33.61746

Review

We subset our review data to keep only reviews of the food and beverage businesses selected above. This reduced our review data from 229,907 to 165,823 reviews.

# subset reviews & remove columns (type)
df_review <- review %>% filter(business_id %in% df_business$business_id) %>% 
    select(-type)

# flatten the nested votes data frame (funny/useful/cool) into separate columns
df_review <- do.call(data.frame, df_review)

# convert user_id and business_id from factor to character
df_review$user_id <- as.character(df_review$user_id)
df_review$business_id <- as.character(df_review$business_id)
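
The flattening step can be seen on a small made-up example. Nested JSON objects such as votes arrive as a data frame stored inside a single column, and do.call(data.frame, ...) expands that column into prefixed columns. The names nested and flat and all values below are illustrative assumptions.

```r
# Hypothetical nested structure, similar in shape to what nested JSON produces.
votes <- data.frame(funny = c(0L, 1L), useful = c(5L, 3L), cool = c(2L, 4L))
nested <- data.frame(stars = c(5L, 4L))
nested$votes <- votes  # a data frame stored as one column

ncol(nested)  # the votes data frame counts as a single column here

# Passing the columns to data.frame() expands the nested data frame and
# prefixes its column names with "votes."
flat <- do.call(data.frame, nested)

names(flat)
```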

Preview Review Data (without Review Text)
votes.funny	votes.useful	votes.cool	user_id	review_id	stars	date	business_id
0	5	2	rLtl8ZkDX5vH5nAx9C3q5Q	fWKvX83p0-ka4JS3dc6E5A	5	2011-01-26	9yKzy9PApeiPPOUJEtnvkg
0	0	0	0a2KyEL0d3Yb1V6aivbIuQ	IjZ33sJrzXqU-0X6U8NwyA	5	2011-07-27	ZRJwVLyzEJq1VAihDhYiow
0	1	0	0hT2KtfLiobPvh6cDC8JQg	IESLBzqUCLdSzSqm0eCSxQ	4	2012-06-14	6oRAC4uyJCsJl1X0WZpVSA
1	3	4	sqYN3lNgvPbPCTRsMFu27g	m2CKSsepBCoRYWxiRUsxAg	4	2007-12-13	-yxfBYGB6SEqszmxJxd97A
4	7	7	wFweIWhv2fREZV_dYkz_1g	riFQ3vxNpP4rWLk_CSri2A	5	2010-02-12	zp713qNhx8d9KCJJnrw1xA
0	0	0	Vh_DlizgGhSqQh4qfZ2h6A	XtnfnYmnJYi71yIuGsXIUA	4	2012-08-17	wNUea3IXZWD63bbOQaOH-g

Preview of a Single Review Text
text
My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better. Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I’ve ever had. I’m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing. While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best “toast” I’ve ever had. Anyway, I can’t wait to go back!

User

We applied a similar filter to the user data, keeping only users connected to our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations.

The dataframe preview below shows aggregate user data for all reviews an individual user provided on Yelp within our data selection.

Preview User Data
votes.funny	votes.useful	votes.cool	user_id	name	average_stars	review_count
0	2	0	Ch6CdTR2IVaVANr-RglMOg	T	5.00	2
0	0	0	NZrLmHRyiHmyT1JrfzkCOA	Beth	1.00	1
30	45	36	mWx5Sxt_dx-sYBZg6RgJHQ	Amy	3.79	19
28	130	31	hryUDaRk7FLuDAYui2oldw	Beach	3.83	207
1	0	1	2t6fZNLtiqsihVmeO7zggg	christine	3.00	2
0	3	2	mn6F-eP5WU37b-iLTop2mQ	Denis	4.50	4
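
The code for the user filter is not shown in this section. One way it could be written, assuming users are kept when their user_id appears in the filtered reviews, is sketched below on made-up data. The toy data frames and the name toy_user_subset are hypothetical.

```r
library(dplyr)

# Hypothetical user table and already-filtered review table.
toy_user <- data.frame(
  user_id = c("u1", "u2", "u3"),
  review_count = c(2, 1, 19),
  stringsAsFactors = FALSE
)
toy_review <- data.frame(
  user_id = c("u1", "u3", "u3"),
  business_id = c("b1", "b1", "b2"),
  stringsAsFactors = FALSE
)

# Keep only users who wrote at least one of the retained reviews.
toy_user_subset <- toy_user %>%
    filter(user_id %in% toy_review$user_id)

n_distinct(toy_user_subset$user_id)
```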

Checkins

Lastly, we evaluated our checkin data and applied our business filter once more. This decreased our checkin observations from 8282 to 4423, which tells us that no checkin data is available for 8.4% of the 4828 businesses in our subset.
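
The 8.4% figure follows from the counts reported above, assuming the checkin data holds one observation per business. The arithmetic is sketched here with illustrative variable names.

```r
n_business <- 4828  # businesses remaining after our transformations
n_checkin <- 4423   # checkin observations remaining after the business filter

# Share of businesses in the subset with no checkin observation.
missing_share <- (n_business - n_checkin) / n_business

round(100 * missing_share, 1)  # 8.4
```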

Merge Data

Next, we created our main dataframe by joining the business and review dataframes on business_id, with business_id and user_id serving as the keys for each rating. This dataframe will be referenced later on when building our recommender matrices and algorithms.

df_main <- df_business %>% inner_join(df_review, by = "business_id")
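
For readers less familiar with inner_join, the sketch below shows its effect on made-up tables: only reviews whose business_id appears in the business table survive, and each surviving review gains the business columns. The toy values and names are assumptions.

```r
library(dplyr)

# Hypothetical business and review tables.
toy_business <- data.frame(
  business_id = c("b1", "b2"),
  name = c("Sauce", "Food City"),
  stringsAsFactors = FALSE
)
toy_review <- data.frame(
  business_id = c("b1", "b1", "b3"),
  user_id = c("u1", "u2", "u1"),
  stars = c(5, 4, 3),
  stringsAsFactors = FALSE
)

# The review of "b3" is dropped because "b3" is not in toy_business;
# "b2" is dropped because it has no reviews.
toy_main <- toy_business %>% inner_join(toy_review, by = "business_id")

nrow(toy_main)
```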

Visualize Data

CLEAN UP SECTION

The graphs below show geographic trends in our data as well as our most popular categories.

# All Business City
df_business %>% select(business_id, city) %>% group_by(city) %>% dplyr::summarise(Count = n()) %>% 
    ggplot(aes(x = city, y = Count)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1)) + xlab(label = "Business City") + geom_text(aes(label = Count), 
    colour = "blue", fontface = "bold", position = position_stack(vjust = 1.1), 
    check_overlap = TRUE)

# All Business City by Focus Category

## add graph later

CONSIDER including a plot showing something like the number of reviews for each business. Review date by month is pretty consistent and not telling of much on its own.

The plot below shows the distribution of our average ratings given by all users.

Recommender Algorithm

We tested 3 recommender algorithms to see which had the best performance metrics for our recommender system. To test the algorithms, we first had to create a user-item matrix and then split our data into training and test sets.

Matrix Building

We converted our raw ratings data into a user-item matrix to test and train our subsequent recommender system algorithms. The matrix was saved as a realRatingMatrix for processing purposes later on using the recommenderlab package.

matrix_data <- df_main %>% select(user_id, business_id, stars) %>% mutate(user_id = as.factor(user_id), 
    business_id = as.factor(business_id), stars = as.numeric(stars))

ui_mat <- as(matrix_data, "realRatingMatrix")
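
The same coercion can be tried on a few made-up ratings to see the resulting shape. This is a sketch under the assumption that the data frame columns are ordered user, item, rating, as in matrix_data above. The toy values are hypothetical.

```r
library(recommenderlab)

# Hypothetical long-format ratings: one row per (user, business) rating.
toy_ratings <- data.frame(
  user_id = factor(c("u1", "u1", "u2", "u3")),
  business_id = factor(c("b1", "b2", "b1", "b2")),
  stars = c(5, 3, 4, 2)
)

# Coerce to a sparse user-item matrix: users as rows, businesses as columns.
toy_mat <- as(toy_ratings, "realRatingMatrix")

dim(toy_mat)           # number of users, number of businesses
nratings(toy_mat)      # number of observed ratings
as(toy_mat, "matrix")  # dense view; unrated pairs appear as NA
```

The sparse representation matters at full scale, since most users have rated only a handful of the businesses.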

Train and Test Splits

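Our data was split into training and test sets for evaluation of all three recommender algorithms. We split our data using the evaluationScheme function from the recommenderlab package with method = "split" and k = 10, which repeats a random split 10 times rather than performing 10-fold cross-validation. In each split, 80% of the data was retained for training and 20% for testing purposes.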

# evaluation method with 80% of data for train and 20% for test
set.seed(1000)

evalu <- evaluationScheme(ui_mat, method = "split", train = 0.8, given = 0, 
    goodRating = 1, k = 10)

# Prep data
train <- getData(evalu, "train")  # training dataset
dev_test <- getData(evalu, "known")  # test-user ratings of type KNOWN, given to the recommender
test <- getData(evalu, "unknown")  # held-out test-user ratings used for RMSE / model evaluation
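
To make the idea of repeated random splits concrete, the base R sketch below draws 10 independent 80/20 partitions of a made-up set of user indices. It imitates the concept only and is not recommenderlab's internal implementation. The user count of 50 and all names are assumptions.

```r
set.seed(1000)

n_users <- 50  # hypothetical number of users
n_runs <- 10   # number of repeated random splits
train_share <- 0.8

# Each run samples 80% of the users for training; the rest form the test set.
splits <- lapply(seq_len(n_runs), function(run) {
  train_idx <- sample(n_users, size = floor(train_share * n_users))
  list(train = train_idx, test = setdiff(seq_len(n_users), train_idx))
})

length(splits)             # one train/test partition per run
length(splits[[1]]$train)  # users in the first training set
length(splits[[1]]$test)   # users in the first test set
```

Unlike k-fold cross-validation, the test sets of different runs may overlap, because every run is drawn independently.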

Algorithm 1 (Raj)

# I am planning to see how I can use each business's rating together with
# rev_funny, rev_useful, and rev_cool, and see how users rate against these
# parameters. I would check the cosine similarity of user ratings with this
# information and recommend similar businesses to the users.
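
Since the plan above relies on cosine similarity, here is a minimal base R sketch of that measure on made-up rating vectors. It illustrates the measure only and is not the planned algorithm. The function name cosine_sim and the vectors are hypothetical.

```r
# Cosine similarity: dot product divided by the product of vector lengths.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical star ratings of three businesses by three users.
user_a <- c(5, 3, 1)
user_b <- c(4, 2, 1)
user_c <- c(1, 2, 5)

sim_ab <- cosine_sim(user_a, user_b)  # similar tastes, value close to 1
sim_ac <- cosine_sim(user_a, user_c)  # different tastes, noticeably lower
```

Cosine similarity depends on the direction of the vectors rather than their magnitude, so a user who rates everything one star lower can still score as highly similar.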

Algorithm 2 (Christina)

Algo

Algorithm 3 (Juliann)

Algo

Analysis

Compare algorithm performance. Select the most effective algorithm to build the recommender system.
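
Since RMSE is named earlier as the evaluation measure, a minimal sketch of the calculation is included here for reference. The function name rmse and the example ratings are assumptions, and no results from our models are implied.

```r
# Root mean squared error between actual and predicted ratings.
# Pairs with a missing value are ignored.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Hypothetical held-out ratings and model predictions.
actual <- c(5, 4, 3, 1)
predicted <- c(4, 4, 5, 1)

toy_rmse <- rmse(actual, predicted)
```

Lower values indicate predictions closer to the held-out ratings, so the algorithm with the lowest RMSE on the test data would be the natural candidate.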

Recommender System

Test system

Conclusion

Final conclusion. Explain limitations of the system. Make recommendations for future improvements.

References

Data Overview

Final Project - Yelp Recommender System

Juliann McEachern, Rajwant Mishra, Christina Valore

July 16, 2019