In this writeup we will look at the cuisine files generated from Yelp restaurant data, which contain reviews grouped by cuisine. We will try to mine knowledge about one cuisine in order to discover popular dishes specific to it. The goal is to obtain a list of common dishes for a cuisine by building a dish recognizer. We will use Italian as the cuisine and try to find all of its dishes. The tools used to achieve this are Python, Java and R.
Out of a total of 14303 restaurants in the dataset, 2380 are Italian. Using the provided Python tools, we obtained a segmented file containing only reviews of Italian restaurants. This file contains around 1.5 million records, so it is important to use algorithms that are fast and efficient.
For the first assignment, a file (Italian.label) was provided in which each phrase is annotated with 1 or 0, marking it as a dish name of the Italian cuisine or not. The task was to eliminate false positives (non-dish phrases labeled 1) and to change false negatives (dish names labeled 0) to a positive label, i.e. 1. Furthermore, additional labeled phrases could be added. This label file will be important for automatic dish recognition in the next assignment.
The approach that worked best for me was simply to go through the file, look at every phrase and decide whether it is related to the Italian cuisine. I removed phrases such as “las vegas 1”. Some dishes were present but not recognized as Italian, such as “tiramisu 0”, which had to be changed to a 1. Additionally, I added some common Italian dishes such as “pizza” or “spaghetti bolognese”. To find more good phrases I referred to the Wikipedia list of Italian dishes (https://en.wikipedia.org/wiki/List_of_Italian_dishes).
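The manual corrections described above can also be expressed as a small script. The following is only a sketch: it assumes each line of the label file has the form "phrase label" with the label as the last whitespace-separated token (as in “tiramisu 0”), and the function names are made up for this illustration.

```python
def parse_label_line(line):
    """Split 'phrase label' into (phrase, int label); the label is the last token."""
    phrase, label = line.rstrip("\n").rsplit(None, 1)
    return phrase, int(label)


def apply_corrections(lines, remove=(), flip_to_positive=(), add=()):
    """Return corrected label lines.

    remove:           phrases to drop entirely (not related to the cuisine)
    flip_to_positive: phrases whose label should become 1
    add:              extra dish names to append with label 1
    """
    labels = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        phrase, label = parse_label_line(line)
        if phrase in remove:
            continue
        labels[phrase] = 1 if phrase in flip_to_positive else label
    for phrase in add:
        labels[phrase] = 1
    return [f"{phrase} {label}" for phrase, label in labels.items()]


# The example phrases are the ones mentioned in the text above.
sample = ["las vegas 1", "tiramisu 0"]
corrected = apply_corrections(
    sample,
    remove={"las vegas"},
    flip_to_positive={"tiramisu"},
    add=["pizza", "spaghetti bolognese"],
)
print(corrected)
```

For the real file, the lines would be read from Italian.label and the result written back. Deciding which phrases go into each set remains manual work.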
The goal of this assignment was to mine additional dish names using automatic algorithms. Two specific methods were mentioned: ToPMine and SegPhrase. SegPhrase takes a second input file with labels of already known dishes, while ToPMine works on the raw text alone. This is why I ran ToPMine first, so that its result could be provided as input to SegPhrase.
ToPMine is an excellent topical phrase mining toolset, which can be retrieved at http://web.engr.illinois.edu/~elkishk2/. It mines frequent phrases from the text, merges words into phrases based on a significance threshold, and groups the phrases into several topics using a phrase-based LDA topic model. Since the topics didn’t matter much here, I used the mined phrases with the highest counts. There are several parameters to tweak the output. I used the following setup:
inputFile='../rawFiles/Italian.txt'
# minimum phrase frequency
minsup=10
#maximum size of phrase (number of words)
maxPattern=8
#Two variations of phrase lda (1 and 2). Default topic model is 2
topicModel=2
numTopics=5
#set to 0 for no topic modeling and > 0 for topic modeling (around 1000)
gibbsSamplingIterations=500
#significance threshold for merging unigrams into phrases
thresh=4
#burnin before hyperparameter optimization
optimizationBurnIn=100
#alpha hyperparameter
alpha=2
#optimize hyperparameters every n iterations
optimizationInterval=50
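To give an idea of what the thresh parameter controls, here is a rough sketch of a significance score for merging two adjacent phrases. It assumes the score compares the observed count of the joined phrase with the count expected if both parts occurred independently, as described in the ToPMine paper. The toolkit's actual implementation may differ in details, and the counts in the example are made up.

```python
import math


def merge_significance(freq_left, freq_right, freq_joined, total_tokens):
    """Score how much more often two adjacent phrases co-occur than expected by chance.

    expected = total_tokens * p(left) * p(right) is the count under independence.
    The score is roughly the number of standard deviations by which the
    observed joint count exceeds that expectation.
    """
    if freq_joined == 0:
        return float("-inf")  # never seen together, never merge
    expected = freq_left * freq_right / total_tokens
    return (freq_joined - expected) / math.sqrt(freq_joined)


# Made-up counts for illustration only.
score = merge_significance(freq_left=1000, freq_right=800,
                           freq_joined=400, total_tokens=1_000_000)
should_merge = score >= 4  # compare against thresh=4 from the setup above
print(round(score, 2), should_merge)
```

With a higher threshold fewer word pairs are merged, so fewer but more reliable phrases come out.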
As a result, a file containing 17737 phrases was created, along with 5 different topics. The file “topPhrases.txt” contains the phrases ranked by their number of occurrences. Unfortunately, some more manual work was needed to pick out useful dishes as input for SegPhrase. The first ten results are shown as a sample:
food was good 2340
happy hour 1934
Italian food 1918
Italian restaurant 1905
love this place 1821
pretty good 1730
Olive Garden 1711
pasta dishes 1597
pizza was good 1566
Las Vegas 1545
Although these phrases are common, they are not very useful as Italian dishes. I looked at all phrases with 50 or more occurrences and copied the useful ones to the dishes found in task 1. The input file for the next step now contained 450 quality dishes and 150 phrases rated as not useful for the Italian cuisine.
SegPhrase is another very useful phrase mining toolset, available as open source (https://github.com/shangjingbo1226/SegPhrase). It also mines phrases, but additionally uses a label input file which designates good and bad phrases beforehand. This way it can be much more accurate for our purpose. First, some installation work had to be done, and in particular the required versions of g++ and Python had to be respected. Everything is documented very well on the GitHub page.
There are two scripts available: train.sh and parse.sh. For our purpose only train.sh is interesting, as it creates the set of phrases which could then be used as a quality set for parse.sh. Again, we have to specify some parameters to tweak the tool:
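Preselecting the phrases with 50 or more occurrences can be automated before reviewing them by hand. This is a minimal sketch which assumes that every line has the form "phrase count", as in the sample shown above. The function names are invented for this example.

```python
def parse_count_line(line):
    """Split 'some phrase 123' into ('some phrase', 123)."""
    phrase, count = line.rstrip("\n").rsplit(None, 1)
    return phrase, int(count)


def frequent_phrases(lines, min_count=50):
    """Return (phrase, count) pairs with count >= min_count, most frequent first."""
    pairs = [parse_count_line(line) for line in lines if line.strip()]
    kept = [(phrase, count) for phrase, count in pairs if count >= min_count]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# A few lines taken from the sample above.
sample = [
    "food was good 2340",
    "happy hour 1934",
    "Italian food 1918",
    "Italian restaurant 1905",
    "pasta dishes 1597",
]
# A high threshold is used here only so that the filter has a visible effect.
for phrase, count in frequent_phrases(sample, min_count=1900):
    print(phrase, count)
```

The script only narrows down the candidates. Whether a phrase is a dish or not still has to be decided manually.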
RAW_TEXT='data/Italian.txt'
AUTO_LABEL=0
WORDNET_NOUN=1
DATA_LABEL='data/Italian_dishes.label'
KNOWLEDGE_BASE='data/wiki_labels_quality.txt'
KNOWLEDGE_BASE_LARGE='data/wiki_labels_all.txt'
STOPWORD_LIST='data/stopwords.txt'
SUPPORT_THRESHOLD=10
OMP_NUM_THREADS=4
DISCARD_RATIO=0.00
MAX_ITERATION=5
NEED_UNIGRAM=0
ALPHA=0.85
All parameters are described in the provided README.md. It is especially important to specify AUTO_LABEL=0 and to provide our collected label file with Italian dishes as DATA_LABEL. After a long run time, the script finishes with several output files. Most of these are intermediate results. The most interesting file is “salient.csv”, which contains the most relevant phrases with their scores. The whole file has 68915 phrases, most of which are useless to us. I present the first ten results:
goat_cheese,0.9986587301
tomato_sauce,0.9986042020
olive_oil,0.9984624830
sea_bass,0.9984571805
chocolate_cake,0.9984488046
caesar_salad,0.9982830265
mashed_potatoes,0.9982273561
ice_cream,0.9981633901
italian_sausage,0.9978135457
anti_pasta,0.9964569575
As we can see, these results are much better than what we got out of ToPMine. Among the first 500 results there are many interesting Italian dishes, all mined automatically with the help of a manually created label file.
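To look only at the best-ranked part of the output, the file can be read and cut off after the first 500 entries. The sketch below assumes the two-column "phrase,score" layout shown in the sample above. The function name and the optional minimum score are my own additions.

```python
import csv
import io


def top_phrases(csv_text, limit=500, min_score=0.0):
    """Return up to `limit` (phrase, score) pairs, highest score first."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        phrase = row[0].replace("_", " ")  # underscores join the words of a phrase
        score = float(row[1])
        if score >= min_score:
            rows.append((phrase, score))
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:limit]


# The first three lines of the sample above.
sample = "goat_cheese,0.9986587301\ntomato_sauce,0.9986042020\nolive_oil,0.9984624830\n"
best = top_phrases(sample, limit=2)
print(best)
```

For the real output, the content of salient.csv would be read from the SegPhrase results directory and passed in as text.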
In conclusion, the presented tools are able to generate good results without much tweaking beforehand. More tuning would be possible, for example by providing a better input file or by evaluating the results, but the frameworks are of great help in automating the process of finding phrases for a cuisine. The result can, for example, be used to determine which cuisine a review belongs to.
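As an illustration of that last idea, a very simple recognizer could count how often the mined dish names occur in a review and pick the cuisine with the most matches. This is only a sketch and not something that was built for the assignment. The function names, the example review and the second cuisine list are hypothetical.

```python
import re


def count_dish_mentions(review, dishes):
    """Count how often each dish phrase occurs in the review (case-insensitive, whole words)."""
    text = review.lower()
    counts = {}
    for dish in dishes:
        pattern = r"\b" + re.escape(dish.lower()) + r"\b"
        hits = len(re.findall(pattern, text))
        if hits:
            counts[dish] = hits
    return counts


def guess_cuisine(review, dishes_by_cuisine):
    """Return the cuisine whose dish list matches the review most often, or None."""
    totals = {
        cuisine: sum(count_dish_mentions(review, dishes).values())
        for cuisine, dishes in dishes_by_cuisine.items()
    }
    best = max(totals, key=totals.get, default=None)
    return best if best is not None and totals[best] > 0 else None


# Hypothetical dish lists and review text, for illustration only.
dishes_by_cuisine = {
    "Italian": ["pizza", "tiramisu", "spaghetti bolognese"],
    "Mexican": ["taco", "burrito"],
}
review = "The tiramisu was great and the pizza was good, but the pizza crust was thin."
print(count_dish_mentions(review, dishes_by_cuisine["Italian"]))
print(guess_cuisine(review, dishes_by_cuisine))
```

Plain phrase matching like this ignores spelling variants such as “anti_pasta”, so a real recognizer would need some normalization on top.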
I will also try to answer the questions asked in the description, based on what I learned from looking through lots of mined phrases.
The reviews mainly mention food items; drinks appear only seldom. This might be because of the nature of Italian restaurants.
Although I’m a lover of the Italian cuisine, I found lots of dishes that I hadn’t heard of before. The most surprising was “spaghetti a la vodka”, which occurred quite a few times. I definitely want to try that dish.
Most dishes were commonly known ones that come to mind when thinking of the Italian cuisine, for example “Pizza Margherita” or “Spaghetti Bolognese”. Furthermore, lots of exotic and lesser-known dishes were found, and thus I think that the phrases can also be used as recommendations for users.