In this writeup we look at Yelp restaurant reviews and try to discover knowledge that is useful for making dining decisions. In particular, we use topic modeling to identify which topics occur together and which ones people are most interested in. Initial dataset discovery is done in R, topic modeling in Python, and visualization in D3.
The dataset consists of several files, of which yelp_academic_dataset_review.json is the most interesting for this task. It contains 1,125,458 reviews for 42,153 businesses. Because the dataset is very large, we pick a sample of 100,000 reviews for this report to save resources.
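One possible way to draw such a sample without loading the whole file into memory is reservoir sampling. The sketch below is only an illustration, not the procedure actually used for this report (the initial exploration was done in R). It assumes that the review file stores one JSON object per line; the function name sample_reviews and the fixed seed are hypothetical choices made for the example.

```python
import json
import random


def sample_reviews(lines, sample_size, seed=42):
    """Return a uniform random sample of parsed reviews (reservoir sampling).

    `lines` is any iterable of JSON strings, for example an open file
    handle. Assumption: each non-empty line holds one JSON object.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of non-empty lines processed so far
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seen += 1
        if len(reservoir) < sample_size:
            # Fill the reservoir with the first `sample_size` reviews.
            reservoir.append(json.loads(line))
        else:
            # Keep the new review with probability sample_size / seen.
            slot = rng.randrange(seen)
            if slot < sample_size:
                reservoir[slot] = json.loads(line)
    return reservoir
```

Passing an open handle on yelp_academic_dataset_review.json together with sample_size=100000 would give a sample of the size used here; the fixed seed only serves to make the draw repeatable.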
For topic modeling we use Latent Dirichlet Allocation (LDA; for more information see http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf). LDA tries to infer the hidden topic structure of the documents by computing the posterior distribution, that is, the conditional distribution of the hidden variables given the observed documents.
The modeling itself is done in Python. There are several useful packages for Python, including numpy, scipy, query_helper and nltk. There is also an existing project that analyzes the Yelp dataset and can be reused (see https://github.com/adityamarella/YelpDataMining).
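To make the inference behind Latent Dirichlet Allocation a little more concrete, here is a minimal sketch of one common way to approximate the posterior: collapsed Gibbs sampling. It is not the implementation used for this report and uses only the standard library so that it stays self-contained. The function names lda_gibbs and top_words and the default values for alpha, beta and iterations are assumptions made for illustration.

```python
import random


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling.

    `docs` is a list of token lists. Returns the sorted vocabulary and
    a k x V matrix of topic-word counts from the final sweep.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * k for _ in docs]                # tokens of doc d in topic t
    topic_word = [[0] * vocab_size for _ in range(k)]  # count of word w in topic t
    topic_total = [0] * k                              # total tokens in topic t
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            t = rng.randrange(k)
            topics.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[word]] += 1
            topic_total[t] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                # Remove the token's current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Conditional probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                new = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return vocab, topic_word


def top_words(vocab, topic_word, n=10):
    """Return the n most frequent words for every topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: counts[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result
```

A pure-Python sampler like this is far too slow for 100,000 reviews; it is meant to show the mechanics only. In practice an optimized implementation would be the reasonable choice.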
There are two tasks to be accomplished. The first is to visualize the topics found across the whole sample. For this we run the model with k = 20 topics and then choose 10 of them, each of which we try to label with an appropriate name.
For the second task we divide the dataset into high star ratings (4 or 5 stars) and low star ratings (1 to 3 stars) and mine the topics again. We use k = 10 topics for each subset and then choose 3 topics, which may be combined from several of the topics found, so that the same three topics are available for both subsets.
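The split described above can be sketched in a few lines. This is an illustration only: it assumes each review is a dictionary with a numeric "stars" field and a "text" field, and the function name split_by_rating is hypothetical.

```python
def split_by_rating(reviews, high_threshold=4):
    """Split reviews into high-rated and low-rated review texts.

    Reviews with at least `high_threshold` stars (4 or 5) count as high,
    everything below (1 to 3 stars) counts as low.
    Assumption: every review has "stars" and "text" keys.
    """
    high, low = [], []
    for review in reviews:
        if review["stars"] >= high_threshold:
            high.append(review["text"])
        else:
            low.append(review["text"])
    return high, low
```

Each of the two returned lists can then be passed to the topic model separately with k = 10.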
To visualize the topics we use D3 together with a custom library that reads the topics generated in Python from a JSON file.
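As a rough idea of what the hand-over from Python to D3 could look like, the sketch below serializes topics into a nested name/children layout, which is the shape many D3 hierarchy examples consume. The format actually expected by the custom library is not documented here, so the layout, the function name topics_to_json and the "size" key are all assumptions.

```python
import json


def topics_to_json(topics, labels=None, top_n=10):
    """Serialize topics as a nested JSON tree.

    `topics` is a list of dicts mapping word -> weight. `labels` is an
    optional list of hand-picked topic names in the same order.
    Assumption: the consumer expects a name/children hierarchy.
    """
    children = []
    for index, weights in enumerate(topics):
        # Keep only the top_n highest-weighted words of each topic.
        ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
        children.append({
            "name": labels[index] if labels else "Topic %d" % index,
            "children": [
                {"name": word, "size": weight} for word, weight in ranked[:top_n]
            ],
        })
    return json.dumps({"name": "topics", "children": children}, indent=2)
```

The resulting string can be written to a file and loaded on the D3 side; if the library expects a different shape, only the dictionary construction needs to change.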
First, let’s take a look at the topics found across the whole sample:
With k = 20, it’s straightforward to choose the ten most interesting topics and label them, as can be seen here.
It’s harder to find useful topics when splitting the dataset between low and high star ratings. We therefore use three pre-defined topics and merge some of the discovered topics into them in order to surface the interesting words users write in their reviews.
In summary, the user reviews in the Yelp dataset yield interesting findings. Clear topics can be extracted from the reviews, and when the dataset is split into low- and high-star ratings, different words are used in each group of reviews. Future tasks should offer many more interesting things to explore, as there is a lot of information hidden in this dataset.