Overview

Our project used data from Kaggle’s 2013 Yelp Challenge. This challenge included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. Our data includes user reviews, ratings, and check-in data for a wide range of businesses.

Data Acquisition

All of the data is stored as line-delimited JSON, meaning each file is a collection of JSON records, one per line. We use the stream_in function from the jsonlite package, which parses JSON data line by line, to read the files stored in the data folder of our repository.
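As a minimal sketch of this step, the example below writes a small line-delimited JSON file and reads it back with stream_in. The temporary file and its two records are made up for illustration and stand in for one of the Yelp files; the actual file names in the data folder are not shown in this report.

```r
library(jsonlite)

# Hypothetical two-record file standing in for one of the Yelp data files.
path <- tempfile(fileext = ".json")
writeLines(c(
  '{"business_id": "b1", "stars": 4.5, "open": true}',
  '{"business_id": "b2", "stars": 3.0, "open": false}'
), path)

# stream_in() reads one JSON object per line and returns a data frame.
business <- stream_in(file(path), verbose = FALSE)
str(business)
```

Each JSON key becomes a column, so the two records above yield a data frame with two rows and three columns.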

Data Transformations

Business

We chose to limit the scope of our recommender system to businesses with tags related to food and beverages. There were originally 508 unique category tags listed within our business data. We manually selected 112 target categories and used them to subset our data.

We applied additional transformations to remove unnecessary data. There were 1224 businesses in our data that were permanently closed. These businesses accounted for 9.8% of all businesses and were removed from our data. There were also 3 businesses in our dataset from outside of AZ that we also removed.

As a result of our transformations, our recommender data now consists of 4828 unique businesses. This data is previewed below:

library(dplyr)    # filter, select, mutate, %>%
library(stringr)  # str_detect

# keep open businesses in AZ, then drop open, neighborhoods (no data), full_address,
# review_count, type (always "business"), and stars
df_business <- business %>%
  filter(open == "TRUE", state == "AZ") %>%
  select(-open, -neighborhoods, -full_address, -review_count, -type, -stars) %>%
  mutate(categories = sapply(categories, toString))

# create category filter
filter <- 'Argentine| Burmese| Cambodian| Cocktail Bars|Laotian|Lebanese|Live/Raw Food|Russian|African|Champagne Bars|Kosher|Modern European|Scandinavian|Taiwanese|Tapas/Small Plates|Afghan|Brazilian|Food Trucks|Shaved Ice|Wineries|Dim Sum|Ethiopian|Fondue|Hookah Bars|Persian/Iranian|Peruvian| Polish| Seafood Markets| Tapas Bars|Halal|British| Cheese Shops|German|Spanish|Cheesesteaks|Cuban|Do-It-Yourself Food|Gastropubs|Salad|Creperies|Soup|Chocolatiers & Shops|Filipino|Food Stands|Fruits & Veggies|Meat Shops|Mongolian|Soul Food|Comfort Food| Irish|Fish & Chips|Cajun/Creole|Caribbean|Pakistani|Southern|Candy Stores|Vegan|Latin American|Breweries|French|Gay Bars|Korean|Gluten-Free|Hawaiian|Farmers Market|Vegetarian|Middle Eastern|Ethnic Food|Indian|Pubs|Chicken Wings|Dive Bars| Juice Bars & Smoothies|Vietnamese|Cafes|Wine Bars|Bagels|Diners|Hot Dogs|Tex-Mex|Donuts|Greek|Thai| Desserts|Mediterranean|Beer| Wine & Spirits|Seafood|Sushi Bars| Lounges|Steakhouses|Buffets|Japanese|Sports Bars|Delis|Bakeries|Specialty Food|Breakfast & Brunch|Ice Cream & Frozen Yogurt|Burgers|Italian| Chinese|Coffee & Tea|American (New)|Sandwiches|Fast Food|Pizza|American (Traditional)|Bars|Mexican|Food| Restaurants'

# filter businesses using the category filter, then convert categories to a factor
df_business <- df_business %>%
  filter(str_detect(categories, filter)) %>%
  mutate(categories = as.factor(categories))

Preview Business Data
business_id categories city name longitude state latitude
usAsSV36QmUej8--yvN-dg Food, Grocery Phoenix Food City -112.0854 AZ 33.39221
PzOqRohWw7F7YEPBz6AubA Food, Bagels, Delis, Restaurants Glendale Az Hot Bagels & Deli -112.2003 AZ 33.71280
qarobAbxGSHI7ygf1f7a_Q Sandwiches, Restaurants Gilbert Jersey Mike’s Subs -111.8120 AZ 33.37884
gA5CuBxF-0CnOpGnryWJdQ Mexican, Restaurants Phoenix La Paloma Mexican Food -112.0814 AZ 33.48011
acaBJcFEKPmmSDIO6c-ZGQ Food, Grocery Goodyear Basha’s -112.3920 AZ 33.46881
JxVGJ9Nly2FFIs_WpJvkug Pizza, Restaurants Scottsdale Sauce -111.9263 AZ 33.61746
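
One detail of the category filter is worth checking: str_detect treats its pattern as a regular expression, so the parentheses in an alternative such as American (New) form a group instead of matching literal parentheses, and a leading space in an alternative such as " Burmese" is matched literally. The sketch below illustrates both effects with base R's grepl, which also interprets its pattern as a regular expression by default. The sample category strings are made up.

```r
categories <- c("American (New), Restaurants", "Burmese, Restaurants", "Nightlife")

# Unescaped parentheses act as a regex group, so this pattern looks for
# the text "American New" and does not match "American (New)".
unescaped <- grepl("American (New)", categories)

# Escaping the parentheses matches them literally.
escaped <- grepl("American \\(New\\)", categories)

# A leading space is significant: " Burmese" only matches when the tag
# is preceded by a space, that is, when it is not the first tag listed.
with_space    <- grepl(" Burmese", categories)
without_space <- grepl("Burmese", categories)

data.frame(categories, unescaped, escaped, with_space, without_space)
```

In the filter used here, many such rows are probably still caught by broader alternatives such as Food or Restaurants, so the effect on the reported counts may be small.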

Review

We subset our review data to reviews of the food and beverage businesses selected above. This reduced our review data from 229,907 to 165,823 reviews.
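
One way to express this subsetting step is a semi-join on business_id. This is a sketch under assumptions, not necessarily the code used for this report: the toy tables and the names review and df_review are illustrative.

```r
library(dplyr)

# Toy stand-ins for the filtered business table and the raw review table.
df_business <- data.frame(business_id = c("b1", "b2"))
review <- data.frame(
  review_id   = c("r1", "r2", "r3"),
  business_id = c("b1", "b3", "b2"),
  stars       = c(5, 2, 4)
)

# semi_join() keeps the rows of review that have a match in df_business
# and adds no columns from df_business.
df_review <- semi_join(review, df_business, by = "business_id")
df_review
```

The review of business b3 is dropped because b3 is not in the business subset.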

Preview Review Data (without Review Text)
votes.funny votes.useful votes.cool user_id review_id stars date business_id
0 5 2 rLtl8ZkDX5vH5nAx9C3q5Q fWKvX83p0-ka4JS3dc6E5A 5 2011-01-26 9yKzy9PApeiPPOUJEtnvkg
0 0 0 0a2KyEL0d3Yb1V6aivbIuQ IjZ33sJrzXqU-0X6U8NwyA 5 2011-07-27 ZRJwVLyzEJq1VAihDhYiow
0 1 0 0hT2KtfLiobPvh6cDC8JQg IESLBzqUCLdSzSqm0eCSxQ 4 2012-06-14 6oRAC4uyJCsJl1X0WZpVSA
1 3 4 sqYN3lNgvPbPCTRsMFu27g m2CKSsepBCoRYWxiRUsxAg 4 2007-12-13 -yxfBYGB6SEqszmxJxd97A
4 7 7 wFweIWhv2fREZV_dYkz_1g riFQ3vxNpP4rWLk_CSri2A 5 2010-02-12 zp713qNhx8d9KCJJnrw1xA
0 0 0 Vh_DlizgGhSqQh4qfZ2h6A XtnfnYmnJYi71yIuGsXIUA 4 2012-08-17 wNUea3IXZWD63bbOQaOH-g
Preview of a Single Review Text
text

My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I’ve ever had. I’m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best “toast” I’ve ever had.

Anyway, I can’t wait to go back!

User

We applied a similar filter to our user data, keeping only users who reviewed our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations.

The dataframe preview below shows aggregate user data for all reviews that an individual user provided on Yelp within our data selection.

Preview User Data
votes.funny votes.useful votes.cool user_id name average_stars review_count
0 2 0 Ch6CdTR2IVaVANr-RglMOg T 5.00 2
0 0 0 NZrLmHRyiHmyT1JrfzkCOA Beth 1.00 1
30 45 36 mWx5Sxt_dx-sYBZg6RgJHQ Amy 3.79 19
28 130 31 hryUDaRk7FLuDAYui2oldw Beach 3.83 207
1 0 1 2t6fZNLtiqsihVmeO7zggg christine 3.00 2
0 3 2 mn6F-eP5WU37b-iLTop2mQ Denis 4.50 4

Checkins

Lastly, we evaluated our check-in data and applied our business filter once more. This decreased our check-in observations from 8282 to 4423. Since our subset contains 4828 businesses, no check-in data is available for 8.4% of them.
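
The 8.4% figure follows from the counts reported in this section, assuming the check-in data holds one row per business:

```r
n_businesses <- 4828  # businesses in the filtered subset
n_checkins   <- 4423  # check-in rows remaining after the business filter

# Share of businesses without a check-in record, assuming one row per business.
pct_missing <- round(100 * (1 - n_checkins / n_businesses), 1)
pct_missing
```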

Merge Data

Next, we created our main dataframe by merging the business and review dataframes, using business_id and user_id as keys. This dataframe will be referenced later on when building our recommender matrices and algorithms.
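
A merge of this kind could look like the sketch below. The toy tables and the name df_main are assumptions for illustration, and the report does not state which join type was used, so inner_join is only one plausible choice.

```r
library(dplyr)

# Toy stand-ins for the filtered business and review tables.
df_business <- data.frame(
  business_id = c("b1", "b2"),
  name        = c("Cafe A", "Diner B")
)
df_review <- data.frame(
  user_id     = c("u1", "u2", "u1"),
  business_id = c("b1", "b1", "b2"),
  stars       = c(5, 3, 4)
)

# Attach business attributes to every review via the shared business_id key.
df_main <- inner_join(df_review, df_business, by = "business_id")

# Count duplicated (user_id, business_id) pairs; a user-item matrix needs
# at most one rating per pair, so this should be zero.
n_duplicates <- sum(duplicated(df_main[c("user_id", "business_id")]))
df_main
```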

Visualize Data

CLEAN UP SECTION

The graphs below show geographic trends in our data as well as our most popular categories.

CONSIDER including a plot showing something like the number of reviews for each business. Review date by month is pretty consistent and not telling of much on its own.


The plot below shows the distribution of the average ratings given by all users.

Recommender Algorithm

We tested 3 recommender algorithms to see which had the best performance metrics for our recommender system. To test the algorithms, we first had to create a user-item matrix and then split our data into training and test sets.

Matrix Building

We converted our raw ratings data into a user-item matrix to train and test our subsequent recommender system algorithms. The matrix was saved as a realRatingMatrix for later processing with the recommenderlab package.
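
As a minimal sketch of this conversion, recommenderlab can coerce a data frame of (user, item, rating) triplets into a realRatingMatrix. The toy ratings and the name rating_matrix are assumptions for illustration.

```r
library(recommenderlab)

# Toy ratings in long format: one row per (user, business, stars) triplet.
ratings <- data.frame(
  user_id     = c("u1", "u1", "u2", "u3"),
  business_id = c("b1", "b2", "b1", "b3"),
  stars       = c(5, 3, 4, 2)
)

# The first three columns are read as user, item, and rating.
rating_matrix <- as(ratings, "realRatingMatrix")

# Rows are users and columns are businesses; unrated pairs stay missing.
dim(rating_matrix)
```

The result is stored as a sparse matrix, which matters here because most users review only a small fraction of the businesses.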

Train and Test Splits

Our data was split into training and test sets for evaluation of our recommender algorithms. We split our data using the recommenderlab package with k = 10. 80% of the data was retained for training and 20% for testing.
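
The report does not state which evaluationScheme method was used. One reading of the description is method = "split" with train = 0.8 and k = 10, which repeats a random 80/20 split ten times; with method = "cross-validation" the fold count would determine the split sizes instead. The sketch below follows the first reading. The toy matrix and the given and goodRating values are assumptions for illustration.

```r
library(recommenderlab)

# Toy 20 x 10 rating matrix with one missing rating per user.
m <- matrix(rep(1:5, length.out = 200), nrow = 20,
            dimnames = list(paste0("u", 1:20), paste0("b", 1:10)))
m[cbind(1:20, rep(1:10, 2))] <- NA
rating_matrix <- as(m, "realRatingMatrix")

set.seed(1)  # make the random splits reproducible

# Assumed settings: 10 repeated 80/20 splits, 3 known ratings per test
# user, and ratings of 4 or more counted as good.
scheme <- evaluationScheme(rating_matrix, method = "split", train = 0.8,
                           k = 10, given = 3, goodRating = 4)

train_set  <- getData(scheme, "train")  # users used to fit the model
test_known <- getData(scheme, "known")  # test users' ratings shown to the model
c(train = nrow(train_set), test = nrow(test_known))
```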

Algorithm 2 (Christina)

Algo

Algorithm 3 (Juliann)

Algo

Analysis

Compare algorithm performance. Select the most effective algorithm to build the recommender system.

Recommender System

Test system

Conclusion

Final conclusion. Explain limitations of the system. Make recommendations for future improvements.

References