Yelp Recommender System
Yelp!
According to Wikipedia.org, Yelp is “an American multinational corporation…which publish crowd-sourced reviews about local businesses.” Often, when you or I go out with friends, family, or a significant other, the toughest decision of the night is choosing a restaurant or bar. In New York City, I have the fortunate problem of having too many choices. However, sampling every restaurant in the neighborhood is both expensive and infeasible. On Yelp.com, a large pool of users supplies reviews (explicit information) that help identify the best places to eat. Of course, like all recommender systems, Yelp.com attempts to learn your personal tastes and recommend restaurants or bars that it predicts you would like. All of this is possible with data, and lots of it.
1. Perform a scenario design analysis as described below. Consider whether it makes sense for your selected recommender system to perform scenario design twice, once for the organization (e.g. Yelp.com) and once for the organization’s customers.
The target users are the consumers on Yelp: people seeking recommendations for restaurants, bars, events, museums, etc. that suit their tastes. Yelp’s key goal is to recommend the best places for each user’s tastes (via its algorithms). This is accomplished with data. As a user contributes more ratings and information about the places they have visited, Yelp uses the collected data to assist with that user’s searches.
It may make sense to perform a scenario analysis from the user’s side as well, for when he or she is looking for a business of interest. Often, a user has particular tastes that differ from those of others, and when a scenario analysis is performed on both ends, the result may be a better personalized recommender system for the user.
2. Attempt to reverse engineer what you can about the site, from the site interface and any available information that you can find on the Internet or elsewhere.
https://www.yelpblog.com/2013/11/yelp-recommended-reviews
According to this Yelp blog post, Yelp runs “automated software that goes through more than 47 million reviews that have been submitted…to select the most useful and reliable ones to help find the businesses right for the user.” Their stance is quality over quantity, and as a result, Yelp uses approximately 75% of the reviews that it receives. Reviews that are of low quality or that come from “untrusted” users are not used. In fact, Yelp tries to weed out biased reviews, such as those from restaurant owners attempting to boost their own ratings. In addition, it attempts to avoid giving preferential treatment to businesses that advertise on its website over those that do not. All of this is an attempt to create a more accurate rating system.
Though I looked for specific algorithms on the Yelp blogs, only one was explicitly stated (this was the winner of the Kaggle competition, and his paper was published on their website). In fact, Yelp sought the help of the online data science community with its recommender system. Several years ago there was a Kaggle competition called the RecSys Challenge 2013: Yelp Business Rating Prediction, where the 1st place winner won $300. Participants were given a training data set with 11,537 businesses, 8,282 check-in sets, 43,873 users, and 229,907 reviews, and a test set of 1,205 businesses, 734 check-in sets, 5,105 users, and 22,956 reviews to predict. Numerous data scientists from different universities and organizations have taken this data from Yelp and used their own approaches to build a Yelp food recommender system.
One such example came from the University of California, Irvine, where team members used multiple algorithms to build such a system (https://www.math.uci.edu/icamp/summer/research/student_research/recommender_systems_slides.pdf). They identified several problems that plagued the Yelp dataset: sparsity (99.9% empty) was a challenge, and there were a large number of unknown users and businesses with no explicit ratings. Given these circumstances, they looked for the algorithms that would produce the most reliable results (using the RMSE score as an indicator). The methods they used were: Nearest Neighbor Method, Euclidean Distance (though they ran into problems), Weighted Similarity-Jaccard Index, Matrix Factorization, User/Business means, Weighted averages (where ‘Funny’, ‘Cool’, and ‘Useful’ votes added more weight to the reviews), and clustering. Ultimately, it appears that they went with a “blended” approach for their Kaggle competition entry.
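As a rough sanity check on the 99.9% sparsity figure, the training-set counts quoted earlier can be plugged into a few lines of Python. This is only a back-of-the-envelope sketch: it assumes each review corresponds to a distinct user-business pair, which the source does not state.

```python
# Counts quoted earlier for the Kaggle training set.
n_users = 43_873
n_businesses = 11_537
n_reviews = 229_907

# Assumption (not from the source): at most one review per
# (user, business) pair, so every review fills one matrix cell.
possible_pairs = n_users * n_businesses
density = n_reviews / possible_pairs
sparsity = 1.0 - density

print(f"possible user-business pairs: {possible_pairs:,}")
print(f"share of empty cells: {sparsity:.4%}")
```

Under that assumption, well over 99.9% of the user-business rating matrix is empty, which is consistent with the figure the UCI team reported.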
Just to review some of the above concepts: The Nearest Neighbor Method is a non-parametric method used for classification and regression. It looks for the k neighbors that appear most similar to the user being studied, on the assumption that a user is likely to behave like his or her neighbors. The Jaccard index, also known as Intersection over Union or the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. For the Weighted Averages, they created a scoring system that gave more weight to reviews that were rated as ‘Funny’, ‘Cool’, or ‘Useful’.
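To make these three ideas concrete, here is a minimal Python sketch. It is not the UCI team’s code: it uses the plain (unweighted) Jaccard index on sets of visited businesses, and the `1 + votes` weighting rule, the function names, and the toy data are all hypothetical choices made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def nearest_neighbors(target: str, visited: dict, k: int) -> list:
    """Return the k users most similar to `target`, ranked by Jaccard index."""
    scores = [
        (other, jaccard(visited[target], places))
        for other, places in visited.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


def vote_weighted_mean(reviews: list) -> float:
    """Average star rating where each review counts (1 + votes) times.

    `reviews` is a list of (stars, votes) pairs; `votes` stands in for the
    combined 'Funny', 'Cool', and 'Useful' counts (a hypothetical weighting).
    """
    total_weight = sum(1 + votes for _, votes in reviews)
    return sum(stars * (1 + votes) for stars, votes in reviews) / total_weight


# Toy data: which businesses each (made-up) user has reviewed.
visited = {
    "ann": {"pizzeria", "ramen_bar", "taqueria"},
    "bob": {"pizzeria", "ramen_bar", "diner"},
    "cat": {"steakhouse"},
}

print(nearest_neighbors("ann", visited, k=1))   # ann's closest neighbor is bob
print(vote_weighted_mean([(5, 3), (1, 0)]))     # the 5-star review counts four times
```

Here Ann and Bob share two of four distinct businesses, so their Jaccard index is 0.5, while Ann and Cat share none. In the weighted mean, the 5-star review with three votes pulls the average well above the plain mean of 3.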
Another approach was taken by Sumedh Sawant and Ginai Pai from Stanford University (http://cs229.stanford.edu/proj2013/SawantPai-YelpFoodRecommendationSystem.pdf). Again using Root Mean Squared Error and Mean Absolute Error as metrics, they evaluated and compared the performance of several different algorithms: ‘singular value decomposition’, ‘hybrid cascade of K-nearest neighbor clustering’, and ‘weighted bi-partite graph projection’.
Many more approaches exist, as this was a Kaggle competition, but I wanted to list just one more paper, which comes from the Yelp website itself: https://www.yelp.com/html/pdf/YelpDatasetChallengeWinner_NetworkEfficiency.pdf. This paper is from Felix W. (who won the Yelp Dataset Challenge). While I would like to go into the details of his approach to a better recommender system, I will defer that discussion for now, as his paper is incredibly complex and beyond the scope of this course.
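For reference, both error metrics are simple to compute. The sketch below uses made-up star ratings and predictions (not data from either paper) to show how RMSE penalizes large misses more heavily than MAE does.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical true star ratings and model predictions.
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]

print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"MAE:  {mae(actual, predicted):.4f}")
```

The errors here are 1, 0, 1, and 2 stars, so MAE is exactly 1.0, while RMSE is the square root of 1.5 (about 1.22) because the single 2-star miss is squared before averaging.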
3. Include specific recommendations about how to improve the site’s recommendation capabilities going forward.
Yelp: Holding a Kaggle competition and involving the whole online data science community is a great way to engage many very intelligent minds in creating a better, more efficient recommender system. As mentioned above, sparsity and user rating reliability seem to be two big problems for Yelp. There are issues with selection bias (upset customers may go out of their way to give a restaurant a very low rating, e.g. 1 star, while happy customers may not bother to post a review). In addition, there are many restaurants with no reviews at all. A potential way to improve the site’s recommendation capabilities could be Amazon’s Item-to-Item Collaborative Filtering (https://bbhosted.cuny.edu/bbcswebdav/pid-28524562-dt-content-rid-118278419_1/xid-118278419_1). Both Amazon and (presumably) Yelp must handle huge amounts of data. Many customers have extremely limited information, based on only a few business ratings. Customer data is volatile, and the algorithm should respond immediately to new information.
Amazon’s Item-to-Item Collaborative Filtering does not match the user to similar customers. Instead, it matches each of the user’s rated items/businesses to similar items/businesses, then combines those similar items into a recommendation list. It is computationally expensive to use traditional collaborative filtering or cluster models, which would require a significant amount of computing power and memory. However, Amazon’s method of recording which items customers purchase together (or, for Yelp, which businesses they visit) and then calculating the similarity between those items has proved to be quite successful. The Amazon paper outlines this approach in pseudocode.
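The general idea can be sketched in Python as follows. This is an illustrative toy, not Amazon’s or Yelp’s production code: the function names, the use of cosine similarity between item rating vectors, the similarity-times-stars scoring rule, and the sample ratings are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
import math


def build_item_similarities(ratings):
    """Offline step: cosine similarity for every pair of co-rated items.

    `ratings` maps user -> {item: stars}. Returns item -> {other_item: similarity}.
    """
    norms = defaultdict(float)   # squared length of each item's rating vector
    dots = defaultdict(float)    # dot product for each co-rated item pair
    for user_ratings in ratings.values():
        for item, stars in user_ratings.items():
            norms[item] += stars * stars
        # Sorting gives every pair a consistent (i, j) key with i < j.
        for (i, si), (j, sj) in combinations(sorted(user_ratings.items()), 2):
            dots[(i, j)] += si * sj
    sims = defaultdict(dict)
    for (i, j), dot in dots.items():
        sim = dot / math.sqrt(norms[i] * norms[j])
        sims[i][j] = sim
        sims[j][i] = sim
    return sims


def recommend(user, ratings, sims, top_n=3):
    """Online step: score unseen items by similarity to the user's rated items."""
    seen = ratings[user]
    scores = defaultdict(float)
    for item, stars in seen.items():
        for other, sim in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += sim * stars
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical star ratings of businesses.
ratings = {
    "ann": {"pizzeria": 5, "ramen_bar": 4},
    "bob": {"pizzeria": 4, "ramen_bar": 5, "taqueria": 4},
    "cat": {"steakhouse": 5, "taqueria": 2},
}

sims = build_item_similarities(ratings)
print(recommend("ann", ratings, sims))
```

The design point is that the expensive pairwise similarity table is built offline, while the online step only looks up neighbors of the handful of businesses a user has already rated. In this toy data, Ann is recommended the taqueria because Bob rated it alongside both places she has rated; the steakhouse is never scored for her because it shares no rater with anything she has rated.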
Given the success (and speed/efficiency) that Amazon has had, Yelp (barring patent laws) could benefit from such an elegant and fast algorithm.
4. Create your report using an R Markdown file, and create a discussion thread with a link to the GitHub repo where your Markdown file notebook resides. You are not expected to need to write code for this discussion assignment.
For further discussion, please click on the “Issues” button and comment. Thank you!