Topic Model for Yelp Restaurant Reviews

Introduction

This report explores the Yelp restaurant review dataset to discover knowledge that can help people make dining decisions. In particular, it uses topic modeling to identify the topics that reviewers write about most. We use Latent Dirichlet Allocation (LDA) for topic modeling.

LDA tries to infer the hidden topic structure of a collection of documents by computing the posterior distribution, that is, the conditional distribution of the hidden variables given the observed documents. In LDA, the topics are $\beta_{1:K}$, where each $\beta_k$ is a distribution over the vocabulary. The topic proportions for the $d$th document are $\theta_d$, where $\theta_{d,k}$ is the proportion of topic $k$ in document $d$. The topic assignments for the $d$th document are $z_d$, where $z_{d,n}$ is the topic assignment for the $n$th word in document $d$. Finally, the observed words for document $d$ are $w_d$, where $w_{d,n}$ is the $n$th word in document $d$, an element of the fixed vocabulary.

With this notation, the generative process for LDA corresponds to the following joint distribution of the hidden and observed variables:

\[p(\beta_{1:K},\theta_{1:D},z_{1:D},w_{1:D})\]
\[=\prod_{i=1}^{K}p(\beta_i)\prod_{d=1}^{D}p(\theta_d)\left(\prod_{n=1}^{N}p(z_{d,n}\mid\theta_d)\,p(w_{d,n}\mid\beta_{1:K},z_{d,n})\right)\]
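To make the factors of this joint distribution concrete, the generative process can be simulated directly. The following is a minimal base R sketch, not code from the report: the function names `rdirichlet1` and `simulate_lda`, the symmetric Dirichlet hyperparameters `eta` and `alpha`, and the toy vocabulary are all assumptions chosen for illustration.

```r
# Draw one sample from a Dirichlet distribution by normalizing Gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

# Simulate D documents of N words each from the LDA generative process.
# eta and alpha are hypothetical symmetric Dirichlet hyperparameters.
simulate_lda <- function(D, N, K, vocab, eta = 0.1, alpha = 0.5) {
  V <- length(vocab)
  # beta[k, ] is topic k's distribution over the vocabulary: p(beta_k)
  beta <- t(replicate(K, rdirichlet1(rep(eta, V))))
  # theta[d, ] holds the topic proportions of document d: p(theta_d)
  theta <- t(replicate(D, rdirichlet1(rep(alpha, K))))
  docs <- lapply(seq_len(D), function(d) {
    # z[n] is the topic assignment of word n: p(z_dn | theta_d)
    z <- sample.int(K, N, replace = TRUE, prob = theta[d, ])
    # w[n] is drawn from the assigned topic: p(w_dn | beta, z_dn)
    w <- vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))
    list(z = z, w = w)
  })
  list(beta = beta, theta = theta, docs = docs)
}

set.seed(1)
sim <- simulate_lda(D = 3, N = 8, K = 2,
                    vocab = c("noodle", "rice", "service", "wait", "price"))
```

Topic modeling runs this process in reverse: only the words `w` are observed, and `beta`, `theta`, and `z` have to be inferred from them.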

Data Exploration

The dataset consists of five files: yelp_academic_dataset_business.json, yelp_academic_dataset_review.json, yelp_academic_dataset_user.json, yelp_academic_dataset_checkin.json, and yelp_academic_dataset_tip.json. Since Task 1 is to build a topic model for restaurant reviews, the two files of interest are yelp_academic_dataset_business.json and yelp_academic_dataset_review.json.

The dataset contains 1,125,458 reviews for 42,153 businesses. Memory and CPU constraints prevent us from mining the entire dataset, so we instead sample 50,000 reviews, which should be large enough to build a representative topic model. The business dataset has details about each business, and the reviews dataset has the reviews for the businesses present in the business dataset.

colnames(business)

##  [1] "business_id"   "full_address"  "hours"         "open"         
##  [5] "categories"    "city"          "review_count"  "name"         
##  [9] "neighborhoods" "longitude"     "state"         "stars"        
## [13] "latitude"      "attributes"    "type"

colnames(reviews)

## [1] "votes"       "user_id"     "review_id"   "stars"       "date"       
## [6] "text"        "type"        "business_id"
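Restricting the reviews to restaurants and drawing the 50,000-review sample could look roughly like the base R sketch below. It is an illustration under assumptions, not the report's code: the function name `sample_restaurant_reviews` and the seed are hypothetical, `categories` is assumed to be a list column of character vectors, and restaurants are assumed to carry the category label "Restaurants".

```r
# Keep reviews of restaurants only, then draw a reproducible random sample.
# Assumes business$categories is a list column of character vectors and that
# restaurants are labeled with the category "Restaurants" (an assumption).
sample_restaurant_reviews <- function(business, reviews, n = 50000, seed = 42) {
  is_restaurant <- vapply(business$categories,
                          function(cats) "Restaurants" %in% cats, logical(1))
  ids <- business$business_id[is_restaurant]
  # Join on business_id, the column shared by both datasets
  restaurant_reviews <- reviews[reviews$business_id %in% ids, , drop = FALSE]
  set.seed(seed)
  # Never ask for more rows than are available
  keep <- sample.int(nrow(restaurant_reviews), min(n, nrow(restaurant_reviews)))
  restaurant_reviews[keep, , drop = FALSE]
}
```

Fixing the seed makes the sample reproducible, so the fitted topics can be compared across runs.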

Data Cleaning

Before we can build the topic model, we need to clean the data, for which we use the tm package. We convert all text to lowercase, remove profanity, remove stopwords, and apply stemming. Once the text is clean, we generate a DocumentTermMatrix by applying unigram tokenization to the documents, and then remove the sparse terms from the DocumentTermMatrix.
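The cleaning steps above can be sketched with tm as follows. This is a hedged illustration rather than the report's actual code: the function name `clean_reviews`, the `profanity` word list, and the sparsity threshold of 0.99 are assumptions, and the punctuation and whitespace steps are additions not mentioned in the text.

```r
library(tm)  # stemDocument additionally requires the SnowballC package

# Clean raw review text and return a sparse-term-filtered DocumentTermMatrix.
# profanity: hypothetical character vector of words to drop.
# sparse: assumed threshold; terms absent from more than this share of
#         documents are removed.
clean_reviews <- function(text, profanity = character(0), sparse = 0.99) {
  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, content_transformer(tolower))
  if (length(profanity) > 0) {
    corpus <- tm_map(corpus, removeWords, profanity)
  }
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Not mentioned in the text: strip punctuation so "great," and "great" match
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, stripWhitespace)
  # The default tokenizer splits on whitespace, which yields unigrams
  dtm <- DocumentTermMatrix(corpus)
  removeSparseTerms(dtm, sparse)
}

dtm <- clean_reviews(c("The food was great!",
                       "Great service and great food.",
                       "Terrible noodles"))
```

Removing sparse terms shrinks the vocabulary considerably, which keeps the later model fit within memory.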

Data Modeling

We build the LDA topic model using Gibbs sampling, with the LDA implementation from the topicmodels package. We fit the model with k=10 topics, a value chosen empirically.
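A fit along these lines might look like the sketch below. The wrapper `fit_lda` is hypothetical, and the `seed`, `burnin`, and `iter` values are illustrative assumptions, since the report does not state the sampler settings it used.

```r
library(topicmodels)

# Fit LDA by Gibbs sampling. seed, burnin, and iter are illustrative values,
# not settings taken from the report.
fit_lda <- function(dtm, k = 10, seed = 2015, burnin = 100, iter = 500) {
  # LDA() rejects documents with no remaining terms, so drop empty rows first
  dtm <- dtm[slam::row_sums(dtm) > 0, ]
  LDA(dtm, k = k, method = "Gibbs",
      control = list(seed = seed, burnin = burnin, iter = iter))
}
```

Given a DocumentTermMatrix `dtm`, `model <- fit_lda(dtm, k = 10)` returns the fitted model; `terms(model, 10)` then lists the ten most probable terms per topic, and `topics(model)` gives the most likely topic for each document.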

Data Visualization

Let’s visualize the topic model to get insight into what people talk about in their reviews. We visualize the top 10 terms from each of the 10 topics.
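Extracting the top terms per topic amounts to sorting each topic's term distribution. The base R sketch below shows the idea on a toy matrix; the function name `top_terms` and the example probabilities are hypothetical and are not results from the report.

```r
# Return the n highest-probability terms for each topic.
# beta: K x V matrix of per-topic term probabilities, with terms as colnames.
top_terms <- function(beta, n = 10) {
  n <- min(n, ncol(beta))
  out <- apply(beta, 1, function(p) {
    colnames(beta)[order(p, decreasing = TRUE)[seq_len(n)]]
  })
  out <- matrix(out, nrow = n)  # keep a matrix shape even when n == 1
  colnames(out) <- paste("Topic", seq_len(nrow(beta)))
  out
}

# Toy topic-term matrix (made-up numbers, for illustration only)
beta <- rbind(c(0.5, 0.3, 0.2),
              c(0.1, 0.2, 0.7))
colnames(beta) <- c("noodl", "servic", "price")
tt <- top_terms(beta, n = 2)
```

For a model fitted with topicmodels, `posterior(model)$terms` provides a matrix of this shape, and `terms(model, 10)` returns the same kind of listing directly.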

Comparing Topic Models for High and Low Ratings

Using the same dataset, we extract all the reviews for Chinese food and compare the topic models for low and high ratings. Reviews with fewer than 3 stars go into the low-ratings topic model, and reviews with 3 or more stars go into the high-ratings topic model.
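The split described above can be sketched in base R as follows. The function name `split_by_rating` is hypothetical, and the sketch assumes that `categories` is a list column of character vectors in which Chinese restaurants carry the label "Chinese".

```r
# Split the reviews of one business category into low- and high-rating sets.
# Assumes business$categories is a list column of character vectors and that
# the category label is "Chinese" (an assumption about the data).
split_by_rating <- function(business, reviews, category = "Chinese",
                            threshold = 3) {
  in_category <- vapply(business$categories,
                        function(cats) category %in% cats, logical(1))
  ids <- business$business_id[in_category]
  subset_reviews <- reviews[reviews$business_id %in% ids, , drop = FALSE]
  low <- subset_reviews$stars < threshold  # fewer than 3 stars
  list(low = subset_reviews[low, , drop = FALSE],
       high = subset_reviews[!low, , drop = FALSE])  # 3 stars or more
}
```

Each of the two resulting review sets can then go through the same cleaning and model-fitting steps, so that the top terms of the low-rating and high-rating models can be compared side by side.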