Introduction

This report examines the Yelp restaurant review dataset to discover knowledge about cuisines. We mine the dataset to build a visual map of the different cuisine types and their similarities. The cuisine map helps users see which cuisines are available and how they relate to one another, which makes it easier to discover and explore unfamiliar cuisines.

Data Exploration

The dataset consists of five files: yelp_academic_dataset_business.json, yelp_academic_dataset_review.json, yelp_academic_dataset_user.json, yelp_academic_dataset_checkin.json, and yelp_academic_dataset_tip.json. We go through yelp_academic_dataset_business.json and keep only the businesses that are categorized as restaurants.
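The filtering step can be sketched as follows. This is a minimal illustration, not the author's code: it assumes the business file holds one JSON object per line and that each record has a "categories" field containing a list of strings, with "Restaurants" marking restaurants. The sample records are invented for the example.

```python
import json

def load_restaurants(lines):
    """Return the business records whose categories include 'Restaurants'."""
    restaurants = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        business = json.loads(line)
        # Assumption: 'categories' is a list of strings and may be missing or null.
        categories = business.get("categories") or []
        if "Restaurants" in categories:
            restaurants.append(business)
    return restaurants

# Invented sample records standing in for lines of the business file.
sample = [
    '{"business_id": "b1", "name": "Taj", "categories": ["Indian", "Restaurants"]}',
    '{"business_id": "b2", "name": "Clip Joint", "categories": ["Hair Salons"]}',
    '{"business_id": "b3", "name": "Luigi", "categories": ["Italian", "Restaurants"]}',
]
restaurants = load_restaurants(sample)
restaurant_ids = [b["business_id"] for b in restaurants]
```

For the real data, an open file handle on yelp_academic_dataset_business.json can be passed in place of the sample list, since the function only iterates over lines.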

There are a total of 14303 restaurants in the dataset. Restaurants are also tagged with their cuisines (e.g. Indian or Italian). This report looks at the cuisine map of the 50 cuisines that have the most reviews.
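One way to pick the most-reviewed cuisines is to count reviews per cuisine tag and keep the top n. The sketch below is an assumption about how this could be done, not the author's procedure: it treats every category other than "Restaurants" as a cuisine tag (the real data also contains non-cuisine tags that would need to be excluded), and the function name top_cuisines and the sample data are hypothetical.

```python
from collections import Counter

def top_cuisines(restaurants, review_business_ids, n=50):
    """Rank cuisine tags by the number of reviews of restaurants carrying them."""
    # Assumption: every category except 'Restaurants' is treated as a cuisine tag.
    cuisines_by_id = {
        b["business_id"]: [c for c in b["categories"] if c != "Restaurants"]
        for b in restaurants
    }
    counts = Counter()
    for business_id in review_business_ids:
        # Reviews of non-restaurant businesses are ignored.
        for cuisine in cuisines_by_id.get(business_id, []):
            counts[cuisine] += 1
    return [cuisine for cuisine, _ in counts.most_common(n)]

# Invented sample: one business_id per review.
restaurants = [
    {"business_id": "b1", "categories": ["Indian", "Restaurants"]},
    {"business_id": "b2", "categories": ["Italian", "Restaurants"]},
    {"business_id": "b3", "categories": ["Thai", "Restaurants"]},
]
review_business_ids = ["b1", "b2", "b2", "b3", "b2", "b1"]
top_two = top_cuisines(restaurants, review_business_ids, n=2)
```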

Data Cleaning

Before we can use the dataset to build the topic model, we need to clean up the data. The tm package is used for this. We convert all the text to lowercase, remove profanity, remove stopwords, and apply stemming. Once the text is cleaned up, we generate a DocumentTermMatrix by applying unigram tokenization to the documents. Finally, we remove all the sparse terms from the DocumentTermMatrix.
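The cleaning pipeline above can be approximated in a few lines. This sketch is written in Python for illustration and does not reproduce the tm package's behavior: the stopword and profanity lists are tiny placeholders, stemming is omitted, and the max_sparsity rule (drop a term when the fraction of documents lacking it exceeds the threshold) is only a loose analogue of sparse-term removal.

```python
import re
from collections import Counter

# Placeholder word lists; a real run would use much fuller lists.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "of", "to", "it"}
PROFANITY = {"darn"}

def clean(text):
    """Lowercase, tokenize into unigrams, and drop stopwords and profanity."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and t not in PROFANITY]

def document_term_matrix(docs, max_sparsity=0.5):
    """Build a term-count matrix (rows = documents) and drop sparse terms."""
    counts = [Counter(clean(d)) for d in docs]
    n_docs = len(docs)
    vocab = sorted({t for c in counts for t in c})
    # Keep a term only if the share of documents missing it is <= max_sparsity.
    kept = [
        t for t in vocab
        if sum(1 for c in counts if t not in c) / n_docs <= max_sparsity
    ]
    matrix = [[c[t] for t in kept] for c in counts]
    return kept, matrix

docs = [
    "The curry was spicy and the naan was fresh.",
    "Fresh pasta and a spicy sauce.",
    "The sushi was fresh.",
]
terms, dtm = document_term_matrix(docs, max_sparsity=0.5)
```

In this toy input only "fresh" and "spicy" appear in at least half of the documents, so they are the only terms that survive the sparsity filter.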

Data Visualization

The cuisine map is visualized with an attempt to answer the following questions:
1. What’s the best way of representing a cuisine?
2. What’s the best way of computing the similarity of two cuisines?
3. What’s the best way of clustering cuisines?

To represent the cuisine map, the author chose a correlogram, which displays the pairwise correlations between cuisines. The correlation coefficient of two variables in a data sample is their covariance divided by the product of their standard deviations. It is a normalized measure of how strongly the two are linearly related.
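The definition of the correlation coefficient translates directly into code. The sketch below computes it for two term-count vectors and builds the matrix that a correlogram would display; the cuisine vectors are invented for illustration, and the helper correlation_matrix is a hypothetical name.

```python
import math

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/(n-1) factors of the covariance and the variances cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(vectors):
    """Pairwise correlations between named vectors, as a nested dict."""
    return {
        a: {b: pearson(vectors[a], vectors[b]) for b in vectors}
        for a in vectors
    }

# Invented term counts over the same four terms for three cuisines.
vectors = {
    "Indian": [4, 0, 3, 1],
    "Thai": [3, 1, 4, 0],
    "Italian": [0, 5, 0, 4],
}
corr = correlation_matrix(vectors)
```

Note that pearson is undefined (division by zero) when one of the vectors is constant, so such vectors would need to be filtered out first.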

Task 2.1

Task 2.2

The author tried to improve the cuisine map by varying the text representation. In the first attempt, the term weighting was changed from raw term frequency to tf-idf.
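To make the change in weighting concrete, the sketch below converts a term-count matrix into tf-idf weights. It uses one common variant (term frequency normalized by document length, natural-log inverse document frequency); other variants with different log bases or smoothing exist, and the exact formula used for the report is not stated in the source.

```python
import math

def tf_idf(matrix):
    """Convert a term-count matrix (rows = documents) into tf-idf weights."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    weighted = []
    for row in matrix:
        total = sum(row)
        weighted.append([
            (row[j] / total) * idf[j] if total else 0.0
            for j in range(n_terms)
        ])
    return weighted

# Invented counts: the first term occurs in every document.
counts = [
    [2, 1, 0],
    [2, 0, 1],
    [2, 0, 0],
]
weights = tf_idf(counts)
```

A term that occurs in every document (the first column here) gets weight zero, which is the intended effect: words shared by all cuisines stop dominating the similarity.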

In the second attempt to improve the cuisine map, the author used an LDA topic model with k = 10 topics. The top 5 topics were used to compute the similarity matrix of the documents. Cosine similarity was used to measure the similarity of two cuisines.
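Cosine similarity between two cuisines can be sketched as follows, assuming each cuisine is represented by a vector of topic weights. The vectors below are invented and much shorter than a real LDA output; fitting the topic model itself is outside the scope of this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented topic weights per cuisine (hypothetical three-topic model).
topic_weights = {
    "Indian": [0.7, 0.2, 0.1],
    "Thai": [0.6, 0.3, 0.1],
    "Italian": [0.1, 0.1, 0.8],
}
similarity = {
    a: {b: cosine_similarity(topic_weights[a], topic_weights[b]) for b in topic_weights}
    for a in topic_weights
}
```

With these toy weights, Indian and Thai share their dominant topic and score close to 1, while Indian and Italian score much lower.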

Task 2.3

The author used the tf vector of each cuisine from Task 2.1 for clustering. Clustering was also explored using the tf-idf vectors from Task 2.2. Two clustering approaches were explored: hierarchical agglomerative clustering and k-means clustering. For both approaches, different values of k were tried, and k = 5 seemed to be the best fit empirically.
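As an illustration of the k-means approach, the sketch below implements the basic assign-and-update loop on small invented vectors. It is a simplification made for readability: the centroids are initialized from the first k points so the result is deterministic, whereas practical implementations typically use random initialization with several restarts. The toy example uses k = 2 rather than the k = 5 chosen in the report.

```python
def kmeans(points, k, iterations=100):
    """Cluster points with the basic k-means loop; returns one label per point."""
    # Simplifying assumption: the first k points serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Invented two-dimensional cuisine vectors forming two obvious groups.
points = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
labels = kmeans(points, k=2)
```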