Introduction

In this writeup we look at Yelp restaurant reviews to discover knowledge that is useful for making dining decisions. In particular, we look at similarities between different cuisines and try to create a map that represents these similarities visually. The tools used are Python and R.

Exploration

The dataset consists of several files; the most interesting one for this task is yelp_academic_dataset_business.json. The dataset contains 14303 restaurants, which are categorized into different cuisines. To keep the graphs readable, we show only the top 20 cuisines in each graph.
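As a rough illustration of this exploration step, the following Python sketch counts how often each category occurs among the restaurants. It assumes that the business file holds one JSON object per line and that each record has a "categories" field containing "Restaurants" for restaurants; the function name top_cuisines and the sample records are made up for the example.

```python
import json
from collections import Counter

def top_cuisines(lines, n=20):
    """Count categories over JSON-lines business records, restaurants only."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        business = json.loads(line)
        categories = business.get("categories") or []
        # Assumption: categories is a list of strings. Some dataset versions
        # store a comma-separated string instead, so handle that as well.
        if isinstance(categories, str):
            categories = [c.strip() for c in categories.split(",")]
        if "Restaurants" not in categories:
            continue
        # Every remaining category is treated as a "cuisine" label.
        counts.update(c for c in categories if c != "Restaurants")
    return counts.most_common(n)

# Tiny made-up sample standing in for the real file.
sample = [
    '{"name": "A", "categories": ["Restaurants", "Chinese"]}',
    '{"name": "B", "categories": ["Restaurants", "Chinese", "Korean"]}',
    '{"name": "C", "categories": ["Shopping"]}',
]
top = top_cuisines(sample, n=2)
print(top)
```

With the real data, the open file object for yelp_academic_dataset_business.json could be passed in place of the sample list.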

Visualization of the Cuisine Map

First we use Python and the provided tools to extract the text files for the different cuisines. We then convert these files to CSV format so that they can be read into R.
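The conversion could look roughly like the sketch below. The column layout (one row per cuisine with its concatenated review text) and the helper name cuisines_to_csv are assumptions made for illustration; the actual files may be organized differently.

```python
import csv
import io

def cuisines_to_csv(cuisine_texts, out):
    """Write one row per cuisine: its name and its review text on one line."""
    writer = csv.writer(out)
    writer.writerow(["cuisine", "text"])
    for cuisine in sorted(cuisine_texts):
        # Collapse line breaks and repeated spaces so each cuisine is one row.
        text = " ".join(cuisine_texts[cuisine].split())
        writer.writerow([cuisine, text])

# Made-up example texts; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
cuisines_to_csv(
    {"Korean": "spicy kimchi\nstew", "Chinese": "great dumplings"},
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, it should be opened with newline="" as recommended for the csv module. The result can then be loaded in R with read.csv.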

To create a similarity matrix we first use IDF (Inverse Document Frequency) weighting and look at the top 20 most similar cuisines. For the visualization we use R and the corrplot library, which delivers good results.
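To make the idea of an IDF-based similarity matrix more concrete, here is a minimal Python sketch. It rests on several assumptions that are not taken from the original analysis: each cuisine is treated as one document, a term gets its IDF weight if it occurs in the document at all, the IDF is the smoothed variant log(1 + N/df), and similarity is measured with the cosine. The example texts are made up.

```python
import math

def idf_vectors(docs):
    """Map each document name to {term: idf} for the terms it contains."""
    tokens = {name: set(text.lower().split()) for name, text in docs.items()}
    n_docs = len(tokens)
    doc_freq = {}
    for terms in tokens.values():
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Assumption: smoothed IDF, so terms found in every document keep a
    # small positive weight instead of dropping to zero.
    return {
        name: {t: math.log(1 + n_docs / doc_freq[t]) for t in terms}
        for name, terms in tokens.items()
    }

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Return the sorted names and the pairwise cosine similarity matrix."""
    names = sorted(vectors)
    matrix = [[cosine(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix

# Made-up miniature "cuisine documents".
docs = {
    "Chinese": "noodles dumplings rice spicy",
    "Korean": "kimchi rice spicy bbq",
    "Sports Bars": "beer wings tv burgers",
}
names, sim = similarity_matrix(idf_vectors(docs))
for name, row in zip(names, sim):
    print(name, [round(x, 2) for x in row])
```

A matrix of this shape, written to CSV with the cuisine names as row and column labels, is the kind of input that corrplot can draw.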

The resulting visualization shows the similarities quite well. The darker the blue of a cell in the matrix, the stronger the similarity. Each cuisine is of course most similar to itself, but there are also obvious similarities between different categories, for example between Chinese and Korean or between Lounges and Sports Bars.

Improving the Cuisine Map

We try to improve the cuisine map by using a different text representation. Instead of IDF, we use TF-IDF to get a better representation of the cuisines. Furthermore, we filter out common words such as ‘good’ or ‘high’ by building a list of the 50 most common words by TF and removing them from the cuisine text files.
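The two changes described here, removing the most frequent words and weighting by TF-IDF, can be sketched in Python as follows. This is only an illustration under assumptions: whitespace tokenization, raw counts as TF, the smoothed IDF log(1 + N/df), and made-up example texts. The function names are hypothetical, and k=1 is used only because the toy data is so small (the writeup uses the 50 most common words).

```python
import math
from collections import Counter

def remove_top_tf_words(docs, k=50):
    """Drop the k words with the highest total term frequency over all docs."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    totals = Counter()
    for words in tokenized.values():
        totals.update(words)
    common = {word for word, _ in totals.most_common(k)}
    filtered = {
        name: [w for w in words if w not in common]
        for name, words in tokenized.items()
    }
    return filtered, common

def tfidf_vectors(tokenized):
    """Map each document name to {term: tf * idf}."""
    n_docs = len(tokenized)
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))  # count each term once per document
    vectors = {}
    for name, words in tokenized.items():
        tf = Counter(words)
        # Assumption: raw counts as TF and smoothed IDF log(1 + N/df).
        vectors[name] = {
            t: count * math.log(1 + n_docs / doc_freq[t])
            for t, count in tf.items()
        }
    return vectors

# Made-up miniature "cuisine documents".
docs = {
    "Italian": "good pasta good pizza good wine",
    "Pizza": "good pizza good crust pizza",
    "Japanese": "good sushi good ramen",
}
filtered, common = remove_top_tf_words(docs, k=1)
vectors = tfidf_vectors(filtered)
print(common)
print(vectors["Pizza"])
```

The resulting vectors can be compared with cosine similarity in the same way as the IDF-only vectors.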

Again, we look at the 20 most similar cuisines with this new approach. The visualization resembles the one we have seen before, but the similarities are less pronounced overall, so the strongest pairs stand out more clearly. For example, Asian Fusion and Japanese as well as Italian and Pizza are very similar.

Incorporating Clustering in Cuisine Map

For this task, we cluster the cuisines in the cuisine map using a varying number of clusters. We tried different clustering algorithms, specifically hierarchical agglomerative clustering and k-means clustering. k-means clustering appeared to deliver the best results. We used k=3 and k=5 as the number of clusters.
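For readers unfamiliar with k-means, the following Python sketch shows the basic procedure applied to the rows of a similarity matrix. It is not the implementation used for the figures. The farthest-first initialization is an assumption chosen to keep the example deterministic, and the similarity values are made up; k=2 is used here only because the toy matrix has four rows.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Plain k-means (Lloyd's algorithm) with farthest-first initialization."""
    # Assumption: deterministic start. Take the first point, then repeatedly
    # add the point that is farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up similarity matrix; each row describes one cuisine.
names = ["Chinese", "Korean", "Lounges", "Sports Bars"]
sim = [
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
labels, centroids = kmeans(sim, k=2)
for name, label in zip(names, labels):
    print(name, label)
```

Because k has to be fixed in advance, rerunning the same procedure with different values of k is the natural way to compare groupings.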

First let’s look at the variant with 3 clusters.

Here we can see very clear clusters: one could be described as Asian cuisine, one as “entertainment” cuisine and the third as “exotic” cuisine.

If we use 5 clusters, the categories are less clear, possibly because the clustering is too fine-grained.

Conclusion

In summary, examining the different cuisine categories yields both expected and surprising results. It became clear that TF-IDF similarity works better than IDF alone, and that for clustering the number of clusters is very important for understanding the categories clearly.

By using other approaches, for example LDA topic model weights or BM25 similarity, we may obtain even better results.