Introduction

This report mines the Yelp restaurant review dataset to discover the common and popular dishes of a particular cuisine. When you try a new cuisine, you typically don't know beforehand which dishes it offers. To address this, we build a dish recognizer that identifies the dishes available for a cuisine. The author chose to explore the dishes of Indian cuisine.

Data Exploration

The dataset consists of five files: yelp_academic_dataset_business.json, yelp_academic_dataset_review.json, yelp_academic_dataset_user.json, yelp_academic_dataset_checkin.json, and yelp_academic_dataset_tip.json. We go through yelp_academic_dataset_business.json and select all the businesses that are categorized as Indian restaurants.

There are a total of 14303 restaurants in the dataset, of which 202 are Indian restaurants. All the reviews of these 202 Indian restaurants are mined with the provided tools to build the dish recognizer.
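
The selection step described above can be sketched in Python. This is a minimal sketch, not the code actually used for this report. It assumes that each line of the business file is one JSON object with a business_id string and a categories list of strings, and that each line of the review file is one JSON object with business_id and text fields. The function names are hypothetical.

```python
import json


def is_indian_restaurant(business):
    """Return True if the record is tagged both 'Restaurants' and 'Indian'."""
    # Assumption: 'categories' is a list of strings; a missing or null
    # value is treated as an empty list.
    categories = business.get("categories") or []
    return "Restaurants" in categories and "Indian" in categories


def select_indian_restaurants(lines):
    """Return the business_id of every Indian restaurant in JSON-lines input."""
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        business = json.loads(line)
        if is_indian_restaurant(business):
            ids.append(business["business_id"])
    return ids


def collect_reviews(lines, business_ids):
    """Return the text of every review that belongs to one of business_ids."""
    wanted = set(business_ids)
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        review = json.loads(line)
        if review.get("business_id") in wanted:
            texts.append(review.get("text", ""))
    return texts


# Possible usage (file names as listed in this report):
# with open("yelp_academic_dataset_business.json", encoding="utf-8") as f:
#     indian_ids = select_indian_restaurants(f)
# with open("yelp_academic_dataset_review.json", encoding="utf-8") as f:
#     indian_reviews = collect_reviews(f, indian_ids)
```

Both functions read their input line by line, so the large review file never has to be loaded into memory at once.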

Data Mining

The author used textmining.jar, provided as part of https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task1Tools/task1JavaTools.zip, which, given a cuisine file and a list of phrases, computes the mutual information for the words in the cuisine file. The cuisine file in this case contains the reviews of the Indian restaurants. The text mining algorithm implemented by textmining.jar is called Comparative Text Mining. The topic modeling is done by running

java -cp textmining.jar topicmodels.CTMMultiRun <parameters_file>

The parameters file contains the following settings:

xmlfile = directory_or_file_name;
cluster =4; //number of topics
it = 10; // number of iterations
resultsFile=topics; //result_file_name

The output consists of two files:

1. topics_docprobs, which contains the document distribution, and
2. topics_termprobs, which contains the term distribution for the common topics as well as the sub-topics for each topic.

The author found some interesting dishes by looking at the term distribution in topics_termprobs. The tool can also be used to calculate word co-occurrence based on mutual information. More specifically, the mutual information algorithm takes three parameters:

1. The input folder or file
2. --sentences: An optional flag that splits the text into sentences for local-context word associations. Without it, the algorithm outputs document-level word co-occurrence.
3. The output filename

The author ran the algorithm with the optional --sentences parameter and found some relevant dishes.
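
To make the mutual information score concrete, here is a minimal Python sketch. The exact formula implemented by textmining.jar is not stated in this report, so the sketch uses one common definition as an assumption: each word is treated as a binary variable (present or absent in a segment), and the score is the mutual information of the two variables in bits. The naive sentence splitter and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def segments_from_reviews(reviews, by_sentence=True):
    """Turn reviews into sets of lowercase words, one set per segment."""
    segments = []
    for text in reviews:
        # Assumption: a naive split on '.', '!' and '?' stands in for
        # whatever sentence splitting the real tool performs.
        pieces = re.split(r"[.!?]+", text) if by_sentence else [text]
        for piece in pieces:
            words = set(re.findall(r"[a-z]+", piece.lower()))
            if words:
                segments.append(words)
    return segments


def mutual_information(segments):
    """Return {(w1, w2): MI in bits} for word pairs that co-occur at least once."""
    n = len(segments)
    word_count = Counter()
    pair_count = Counter()
    for words in segments:
        word_count.update(words)
        pair_count.update(combinations(sorted(words), 2))

    scores = {}
    for (w1, w2), both in pair_count.items():
        n1, n2 = word_count[w1], word_count[w2]
        # 2x2 contingency table: (joint count, marginal of w1 state,
        # marginal of w2 state) for each presence/absence combination.
        cells = [
            (both, n1, n2),                    # w1 present, w2 present
            (n1 - both, n1, n - n2),           # w1 present, w2 absent
            (n2 - both, n - n1, n2),           # w1 absent,  w2 present
            (n - n1 - n2 + both, n - n1, n - n2),  # both absent
        ]
        mi = 0.0
        for joint, a, b in cells:
            if joint > 0:  # empty cells contribute nothing
                mi += (joint / n) * math.log2(joint * n / (a * b))
        scores[(w1, w2)] = mi
    return scores
```

Calling segments_from_reviews with by_sentence=False gives one segment per review, which corresponds to the document-level co-occurrence described above. Pairs are keyed in alphabetical order, and the number of pairs grows quickly with segment length, so a real run would likely restrict the vocabulary first.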

The author also explored the dishes using TopMine to refine the results. TopMine likewise mines the reviews of the Indian restaurants for dish names. The algorithm uses the following parameters:

1. minsup = minimum support (the minimum number of times a phrase candidate must appear in the corpus to be considered significant)
2. maxpattern = maximum phrase length (useful for excluding overly long phrases that occur only occasionally)
3. numTopics = number of topics, as in LDA
4. Gibbs sampling iterations = number of iterations used for inference, that is, for learning the parameters (500 is usually sufficient, and more can be used)
5. thresh (significance) = the significance of a phrase, equivalent to a z-score. Values from 3 to 5 are typical. The higher the threshold, the fewer phrases are found, but they are of very high quality.
6. Topic model = which of the two PhraseLDA variants to use (1 or 2); 2 is the default.

The author ran the algorithm with the default parameters, except that the number of Gibbs sampling iterations was set to 500. The output file topPhrases.txt was used to explore the dish names.

Opinion

Here is an example of the expert model output from the topics_termprobs file produced by textmining.jar:

--- expert model (0)= indian_restaurant_reviews
naan 0.10899454898364651
masala 0.07706401521080385
tikka 0.06334130317233513
curry 0.053589882785072804
lamb 0.05251200851456515
paneer 0.044306346963164024
tandoori 0.032813962283013545
vindaloo 0.02184575875442527
vegetable 0.019508373538015343
korma 0.019357005051083124
samosas 0.0189751255667526
saag 0.01611815809731848
biryani 0.014922593020490909
goat 0.011961556394526004
aloo 0.011768176858126509
basmati 0.01172662576962215
palak 0.010316293136880574
spices 0.0101472137602951
curries 0.009940279559705872
pakora 0.00970749653781738
dal 0.00866491547334318
nan 0.008658236587192218
kofta 0.007945289043515921
peas 0.007932627859404111
makhani 0.00787733410484715
samosa 0.007568109450616106
spice 0.007547456956755993
spiced 0.0069399711019972
malai 0.006815904786392494
chana 0.006080383312996479
cauliflower 0.005652005208442147
pakoras 0.0056349196942948624
gobi 0.005256311140998758
tikki 0.005164346058865124
raita 0.005003063310014012
lentil 0.004975815539466382
daal 0.004812248402563703
lentils 0.004414069624920935
okra 0.0040624922300329675
josh 0.004057821124567791
rogan 0.003880496600104007
tamarind 0.0037852710284565784
karahi 0.003541589038347258
yogurt 0.003207123797866322
veg 0.0031486656079585972
chilli 0.0030720575796283622
tika 0.002556880089909672

The interesting thing about this output is that many dishes can be formed by combining these terms. Some of the combinations are existing dishes, but we can also be creative and invent interesting dishes of our own. Some of the dishes are:

gobi pakoras
goat karahi
goat tikka
goat biryani
chilli tikka
malai kofta
okra
samosa
dal
dal makhani
gobi
aloo gobi
gobi masala
palak
aloo palak
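
One way to check which term combinations are dishes that reviewers actually mention is to count how often two expert-model terms appear next to each other in the review text. The Python sketch below illustrates the idea; it is not part of the provided tools and was not used to produce the list above. It assumes the "term probability" line layout shown in the expert model output, and the function names are hypothetical.

```python
import re
from collections import Counter


def parse_term_probs(lines):
    """Parse 'term probability' lines into a dict, skipping any other line."""
    probs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # e.g. the '--- expert model ...' header line
        term, value = parts
        try:
            probs[term] = float(value)
        except ValueError:
            continue  # second field is not a number
    return probs


def count_term_bigrams(reviews, terms):
    """Count adjacent word pairs in which both words are expert-model terms."""
    counts = Counter()
    for text in reviews:
        # Punctuation is dropped, so a pair that spans a sentence boundary
        # is also counted; this is a simplification.
        words = re.findall(r"[a-z]+", text.lower())
        for first, second in zip(words, words[1:]):
            if first in terms and second in terms:
                counts[first + " " + second] += 1
    return counts


# Possible usage, given the lines of the term file and the review texts:
# terms = parse_term_probs(term_lines)
# for dish, freq in count_term_bigrams(review_texts, terms).most_common(20):
#     print(dish, freq)
```

Combinations with a high count are likely to be real dish names, while combinations that never occur are candidates for the invented dishes mentioned above.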

We use the textmining.jar tool to calculate word co-occurrence based on mutual information, with the --sentences option to split the text into sentences for local-context word associations. These are some of the results for the gosht-related dishes:

nalli   gosht   4.8954226081424875E-6 
kadahi  gosht   3.6519436564127923E-6
gosht   vindaloo        9.170976944374744E-7
gosht   curry   1.1258746686261764E-6
gosht   rogan   1.1009422653791137E-5
gosht   daal    7.092818418872206E-6
gosht   biryani 1.0583382743613388E-6

Being a big fan of Indian food myself, I would use tools like these to find new dishes. The results are relevant and useful, and they were obtained with just one default setting of the parameters:

input  = /coursera/data-mining/task3/;
cluster =4;
it = 10;
resultsFile=topics;

We can experiment with different values for iterations and clusters and compare the results.
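
Such an experiment could be organized as a small parameter sweep. The Python sketch below only prepares the parameter file contents and the command line for each combination; it is a sketch under stated assumptions, not a tested workflow. It assumes the parameter file layout shown in the Data Mining section (including the xmlfile key) and that the tool tolerates spaces around "=". The file naming scheme and function names are hypothetical.

```python
from itertools import product


def render_parameters(input_path, cluster, iterations, results_file):
    """Return the text of one parameters file."""
    # Assumption: key names and the trailing ';' follow the example
    # parameters file shown earlier in this report.
    return (
        f"xmlfile = {input_path};\n"
        f"cluster = {cluster};\n"
        f"it = {iterations};\n"
        f"resultsFile = {results_file};\n"
    )


def plan_runs(input_path, cluster_values, iteration_values):
    """Return (params_file_name, params_text, command) for every combination."""
    runs = []
    for cluster, iterations in product(cluster_values, iteration_values):
        tag = f"c{cluster}_it{iterations}"   # hypothetical naming scheme
        params_name = f"params_{tag}.txt"
        text = render_parameters(input_path, cluster, iterations, f"topics_{tag}")
        command = ["java", "-cp", "textmining.jar",
                   "topicmodels.CTMMultiRun", params_name]
        runs.append((params_name, text, command))
    return runs


# Possible usage: write each file, then launch the tool.
# import subprocess
# for name, text, command in plan_runs("reviews_dir", [4, 8], [10, 50]):
#     with open(name, "w", encoding="utf-8") as f:
#         f.write(text)
#     subprocess.run(command, check=True)
```

Giving each run its own results file name keeps the term distributions from different settings side by side for comparison.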