Overview

Our project used data from Kaggle’s 2013 Yelp Challenge, which included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. The data covers user reviews, ratings, and check-in data for a wide range of businesses.

Data Acquisition & Transformations

Data was acquired and transformed in the preprocessing.R file located within our repository’s final-project folder. Our data source was provided as multi-line JSON files, meaning each file is a collection of JSON records. We used the stream_in function from the jsonlite package, which parses JSON data line by line, to read the files from the data folder of our repository. The collection included three large files covering Yelp businesses, users, and reviews.
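As a rough sketch of the ingestion step (the file names here are illustrative, not the exact names in our repository):

library(jsonlite)

# stream_in() parses newline-delimited JSON records line by line
# and returns a flattened data frame
business <- stream_in(file("data/yelp_training_set_business.json"))
users    <- stream_in(file("data/yelp_training_set_user.json"))
reviews  <- stream_in(file("data/yelp_training_set_review.json"))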

Once obtained, we prepared our data for our recommender system using the following transformations:

Business

We chose to limit the scope of our recommender system to businesses with tags related to food and beverages. There were originally 508 unique category tags listed within our business data, from which we manually selected 112 targeted categories to subset our data.

We applied additional transformations to remove unnecessary data. There were 1,224 businesses in our data that were permanently closed; these accounted for 9.8% of all businesses and were removed. We also removed 3 businesses in our data set located outside of Arizona.
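A minimal sketch of these business filters, assuming the standard Yelp schema (categories as a list-column, a logical open flag, and a state field), with food_categories standing in for our curated list of 112 tags:

library(dplyr)

food_categories <- c("Restaurants", "Bars", "Coffee & Tea")  # truncated illustration

business_subset <- business %>%
  # keep businesses tagged with at least one targeted category
  filter(sapply(categories, function(tags) any(tags %in% food_categories))) %>%
  # drop permanently closed businesses (1,224 records, 9.8% of the data)
  filter(open) %>%
  # drop the 3 businesses located outside Arizona
  filter(state == "AZ")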

As a result of our transformations, our recommender data was reduced to 4,828 unique businesses. This was further limited to 4,332 after randomly sampling our user data. The output can be previewed below:

Review

We subset our review data to include only reviews of the selected food and beverage businesses. This dropped our review data from 229,907 to 165,823 reviews. We later applied another filter to keep only reviews from 10,000 randomly sampled users, which further decreased the data to 44,494 observations. Our review data can be previewed in two parts below:
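A sketch of these review filters using dplyr semi-joins (sampled_users is the 10,000-user sample built in the User section below):

reviews_subset <- reviews %>%
  semi_join(business_subset, by = "business_id")  # 229,907 -> 165,823 reviews

reviews_final <- reviews_subset %>%
  semi_join(sampled_users, by = "user_id")        # -> 44,494 reviews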

User

Next, we applied a similar filter to users to subset our data to only our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations. Due to processing constraints in R, we chose to randomly sample 10,000 users from these unique profiles.
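A sketch of the user filter and the random sample (the seed value here is illustrative):

set.seed(123)

users_subset <- users %>%
  semi_join(reviews_subset, by = "user_id")  # 43,873 -> 35,268 users

sampled_users <- users_subset %>%
  sample_n(10000)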

The data frame preview below shows aggregate user data for all reviews an individual user provided for Yelp within our data selection.

Merged Dataframe

Lastly, we created our main data frame by merging the business and review data on Business_ID. This data frame serves as the source of data for our recommender algorithms. The unique user and business keys were simplified from character strings to numeric user/item identifiers.
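A sketch of the merge and key simplification (column names follow the Yelp schema; the exact steps live in preprocessing.R):

ratings <- reviews_final %>%
  inner_join(select(business_subset, business_id), by = "business_id") %>%
  # simplify long character keys into compact numeric identifiers
  mutate(user = as.integer(factor(user_id)),
         item = as.integer(factor(business_id))) %>%
  select(user, item, rating = stars)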

This data frame will be referenced later on when building our recommender matrices and algorithms.

Algorithm Data Preparation

Matrix Building

We converted our raw ratings data into a user-item matrix to train and test our subsequent recommender system algorithms. The matrix was saved as a realRatingMatrix using the recommenderlab package for later processing.
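recommenderlab can coerce a user/item/rating data frame directly into a sparse matrix; a minimal sketch:

library(recommenderlab)

# the first three columns are interpreted as user, item, and rating
rating_matrix <- as(ratings, "realRatingMatrix")
rating_matrix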

Recommender Algorithms

We tested recommender algorithms using recommenderlab and sparklyr to see which performed best on our data. To test the algorithms, we first created a user-item matrix and then split our data into training and test sets.
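A sketch of the split using recommenderlab’s evaluationScheme; the split proportion and the given/goodRating parameters are assumptions, not our exact settings:

scheme <- evaluationScheme(rating_matrix,
                           method     = "split",
                           train      = 0.8,  # 80/20 train/test split
                           given      = -1,   # all-but-one: withhold one rating per test user
                           goodRating = 4)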

RecommenderLab

User-based CF

In our first example, user-based CF is used to create recommendations with the recommenderlab package in R. We start by training our recommender on the training set, normalizing the data with Z-scores and using cosine similarity for comparisons.
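A sketch of the training call (the parameter names follow recommenderlab’s UBCF options):

ubcf_model <- Recommender(getData(scheme, "train"),
                          method = "UBCF",
                          parameter = list(normalize = "Z-score",
                                           method    = "Cosine"))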

We then create our predictions using the dev-test set, with ratings as our prediction output. It is important to set a floor and a ceiling, as predictions will sometimes fall outside our 1-5 rating scale.
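A sketch of the prediction and clamping step; writing to the sparse-matrix slot is one common way to apply the floor and ceiling:

ubcf_pred <- predict(ubcf_model, getData(scheme, "known"), type = "ratings")

# clamp predictions to the 1-5 star scale
ubcf_pred@data@x <- pmin(pmax(ubcf_pred@data@x, 1), 5)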

Finally, we calculated the prediction accuracy against the test data.
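A sketch of the accuracy calculation against the withheld ratings:

UB_acc <- calcPredictionAccuracy(ubcf_pred, getData(scheme, "unknown"))
UB_acc  # returns RMSE, MSE, and MAE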

Performance

After our analysis, we see that user-based CF outperforms item-based CF in all error metrics. Considering the size of our data set, the RMSE is relatively low.

##            RMSE      MSE       MAE
## UB_acc 1.337380 1.788585 0.9812829
## IB_acc 1.473754 2.171951 1.1263370
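The item-based model is built and evaluated the same way; a sketch of how a comparison table like the one above can be assembled:

ibcf_model <- Recommender(getData(scheme, "train"),
                          method = "IBCF",
                          parameter = list(normalize = "Z-score",
                                           method    = "Cosine"))
ibcf_pred <- predict(ibcf_model, getData(scheme, "known"), type = "ratings")
ibcf_pred@data@x <- pmin(pmax(ibcf_pred@data@x, 1), 5)
IB_acc <- calcPredictionAccuracy(ibcf_pred, getData(scheme, "unknown"))

rbind(UB_acc, IB_acc)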

Sparklyr

Due to the size of our data, we chose to use Spark in R to avoid input/output (I/O) bottlenecks and maximize the speed of our recommender algorithm calculations.
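A sketch of the Spark workflow, assuming a local connection (the ALS hyperparameters here are sparklyr’s defaults, not tuned values):

library(sparklyr)

sc <- spark_connect(master = "local")

# copy the merged ratings data frame into Spark
ratings_tbl <- copy_to(sc, ratings, "ratings", overwrite = TRUE)

# fit an ALS model on the numeric user/item/rating columns
als_model <- ml_als(ratings_tbl,
                    rating_col = "rating",
                    user_col   = "user",
                    item_col   = "item")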

Recommendations

The ml_recommend function allows us to see the top n recommendations for each user or item. Below, we use this function and filter our recommendations to show the top 10 restaurant recommendations for a selected user.
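A sketch of that call (the user id here is illustrative):

top_recs <- ml_recommend(als_model, type = "items", n = 10)

top_recs %>%
  filter(user == 42) %>%
  collect()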


Conclusion

Analysis

Through this project, we took an all-encompassing look at the different recommender methods we learned this semester. We built a realRatingMatrix in recommenderlab and ran several algorithms on our data. We then compared this approach to running the training and test data in Spark using sparklyr’s ALS algorithm. We found that the user-based recommender algorithm performed best, with the lowest RMSE, though our ALS calculations performed very similarly. Our item-based recommender produced the highest error scores.

Our transition to sparklyr showed us how effective distributed computing can be for large datasets. Our algorithm and prediction speeds significantly improved when using Spark, even on a local connection. ALS in sparklyr was the clear winner for efficiency.

Limitations

The size of our data significantly limited our performance when using certain packages in R. Functions in recommenderlab took ~15 minutes to run, compared to approximately 2 minutes in sparklyr. Sparklyr would have been able to handle our full data set, whereas our personal computers would have lacked the memory to rely solely on recommenderlab. However, sparklyr lacks the built-in functions that recommenderlab provides for building and evaluating recommender algorithms.

Recommendations

For future attempts, we would recommend performing natural language processing on review text sentiment and analyzing the term frequency of our categories to see how these variables could improve our recommendations. We would also benefit from using data processing engines like Spark for all future large-scale recommender calculations.