Introduction

As part of this final report, we look at all six tasks, summarizing the algorithms used for each task and the usefulness of the results obtained.

Task 1 (Topic Modeling)

In task 1 we looked at the Yelp restaurant review dataset to discover interesting and useful knowledge that can help people make dining decisions. In particular, we applied topic modeling to identify the topics people are most interested in when talking about dining out. We used Latent Dirichlet Allocation (LDA) to infer the hidden topic structure from the documents.
Before topic modeling could be applied, the dataset was cleaned up using the tm package: the text was converted to lowercase, profanity and stopwords were removed, and stemming was applied. Once the text was cleaned up, a DocumentTermMatrix was generated by applying unigram tokenization to the review dataset and removing all sparse terms.
We also partitioned the review dataset into positive and negative reviews for a particular (Chinese) cuisine, and built LDA topic models for both the positive and the negative reviews, using the same method as for the generic topic model. The positive and negative topic models helped us gain insight into what topics diners talk about when describing a positive dining experience versus a negative one. By looking at these topic models, restaurant owners can focus on the right things to give diners a better dining experience at their restaurants.
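The following R sketch illustrates this kind of pipeline. It is a minimal sketch, not the author's actual script: the reviews vector, the profanity list, and the value of k are assumptions for illustration.

library(tm)
library(topicmodels)

# reviews: a character vector, one element per review document (assumed)
corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
profanity <- readLines("profanity.txt")  # assumed custom profanity word list
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), profanity))
corpus <- tm_map(corpus, stemDocument)

# Unigram document-term matrix, with sparse terms removed
dtm <- removeSparseTerms(DocumentTermMatrix(corpus), 0.95)

# Fit LDA and inspect the top 10 terms of each topic
lda <- LDA(dtm, k = 10, method = "Gibbs")
terms(lda, 10)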

Generic Topic Model

Topic Model for Positive/Negative Reviews

Task 2 (Cuisine Similarity)

In task 2 we look at the Yelp restaurant review dataset to discover knowledge about different cuisines. We mine the dataset to visually understand the landscape of cuisine types and their similarities. The cuisine map can help users understand what cuisines are available and how they relate to one another, which allows for the discovery of new cuisines and thus facilitates exploration of unfamiliar ones.

The cuisine map is visualized in an attempt to answer the following questions:
1. What's the best way of representing a cuisine?
2. What's the best way of computing the similarity of two cuisines?
3. What's the best way of clustering cuisines?

For representing the cuisines, the author thinks a correlogram is the best way of presenting the cuisine map. The correlation coefficient of two variables in a data sample is their covariance divided by the product of their standard deviations; it is a normalized measure of how the two are linearly related.
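A minimal sketch of how such a correlogram can be produced in R, assuming cuisine.tf is a term-by-cuisine matrix of term frequencies (the name is an assumption):

library(corrplot)

# Pairwise correlation coefficients between cuisine term-frequency vectors
cuisine.cor <- cor(cuisine.tf)

# Render the correlogram, grouping similar cuisines together
corrplot(cuisine.cor, method = "circle", order = "hclust", tl.cex = 0.6)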

From the correlogram it can clearly be seen that Japanese and sushi are related. Chinese, Vietnamese, Korean, and Asian fusion are related, as are pubs, gastropubs, lounges, and sports bars. So it is very easy to see from this correlogram which cuisines are related. We can also see, for example, that vegetarian and Indian are related, which tells us that people look to Indian restaurants when seeking vegetarian cuisine.

The author tried to improve the cuisine map by varying the text representation. In the first attempt, the weighting of the terms was changed from term frequency to tf-idf. In the second attempt, the author used an LDA topic model with k = 10, and the top 5 topics were used to compute the similarity matrix of the documents. The cosine similarity measure was used to determine the similarity of two cuisines.
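A minimal sketch of the cosine similarity computation, assuming m is a cuisine-by-feature matrix (tf-idf weights or LDA topic proportions), one row per cuisine:

# Cosine similarity: dot products normalized by the vector norms
norms <- sqrt(rowSums(m^2))
sim <- (m %*% t(m)) / (norms %o% norms)  # pairwise cosine similarity matrix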

From the tf-idf model we can also clearly see that buffets and Indian go together. Dim sum, seafood, Japanese, and sushi are very similar. Chicken wings, nightlife, sports bars, bars, and lounges have high similarity. Italian cuisine and pizza go together.

The author used the tf vector of each cuisine from task 2.1 for clustering, and also explored clustering with the tf-idf vector from task 2.2. Two different clustering approaches, hierarchical agglomerative clustering and k-means clustering, were explored. For both approaches different values of k were tried, and k = 5 seemed empirically to be the best fit.
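A sketch of the two clustering approaches in R, again assuming m is the cuisine-by-feature matrix:

# Hierarchical agglomerative clustering on the cuisine vectors
hc <- hclust(dist(m), method = "ward.D2")
plot(hc)                         # dendrogram view of the cuisine map
clusters.hc <- cutree(hc, k = 5)

# k-means clustering with k = 5
set.seed(42)
clusters.km <- kmeans(m, centers = 5)$cluster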

After applying clustering to the data we can clearly see clusters for Asian cuisine, Mexican cuisine, American cuisine, and bar food.

Task 3 (Dish Discovery)

The author used textmining.jar, provided as part of https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task1Tools/task1JavaTools.zip, which, given a cuisine file and a list of phrases, computes the Mutual Information for the words in the cuisine file. The cuisine file in this case contains the reviews of Indian restaurants. The text mining algorithm implemented by textmining.jar is called Comparative Text Mining. The topic modeling is done by running

java -cp textmining.jar topicmodels.CTMMultiRun <parameters_file>

The parameters file contains the following parameters:

xmlfile = directory_or_file_name;
cluster = 4;          // number of topics
it = 10;              // number of iterations
resultsFile = topics; // result_file_name

The output consists of two files:

1. topics_docprobs, which contains the document distribution, and
2. topics_termprobs, which contains the term distribution for the common topics, as well as the sub-topics of each topic.

The author found some interesting dishes by looking at the term distribution in topics_termprobs. The tool can also be used for calculating word co-occurrence based on mutual information. More specifically, the mutual information algorithm takes three parameters (an example invocation follows the list):

1. The input folder or file
2. --sentences: optional splitting into sentences for local-context word associations. Without this optional parameter, the algorithm outputs document-level word co-occurrence.
3. The output filename
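For illustration, an invocation might look like the following; note that the class name topicmodels.MutualInfo is a placeholder assumption, since only the CTMMultiRun entry point is documented above:

java -cp textmining.jar topicmodels.MutualInfo indian_restaurant_reviews --sentences mi_output.txt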

The author ran the algorithm with the --sentences optional parameter and found some relevant dishes.

The author also explored the dishes using TopMine to refine the results. TopMine likewise mines the reviews of the Indian restaurants to extract dish names. The algorithm uses the following parameters:

1. minsup = minimum support (the minimum number of times a phrase candidate should appear in the corpus to be significant)
2. maxpattern = the maximum size you would like a phrase to be (useful if you don't want overly long phrases that occur only occasionally)
3. numTopics = number of topics, the same as in LDA
4. Gibbs sampling iterations = number of iterations used for inference (learning the parameters; usually 500 is good, and you may do more if you like)
5. thresh (significance) = the significance of a phrase, equivalent to a z-score; values of 3 to 5 are typical. The higher it is, the fewer phrases will be found, but they will be of very high quality.
6. Topic model: two variants of PhraseLDA are available (choose 1 or 2); 2 is the default.

The author ran the algorithm with default parameters, but set the Gibbs sampling iterations to 500. The output file topPhrases.txt was used to explore the dish names.
Here is an example output of the expert model from the topics_termprobs output of textmining.jar:

--- expert model (0)= indian_restaurant_reviews
naan 0.10899454898364651
masala 0.07706401521080385
tikka 0.06334130317233513
curry 0.053589882785072804
lamb 0.05251200851456515
paneer 0.044306346963164024
tandoori 0.032813962283013545
vindaloo 0.02184575875442527
vegetable 0.019508373538015343
korma 0.019357005051083124
samosas 0.0189751255667526
saag 0.01611815809731848
biryani 0.014922593020490909
goat 0.011961556394526004
aloo 0.011768176858126509
basmati 0.01172662576962215
palak 0.010316293136880574
spices 0.0101472137602951
curries 0.009940279559705872
pakora 0.00970749653781738
dal 0.00866491547334318
nan 0.008658236587192218
kofta 0.007945289043515921
peas 0.007932627859404111
makhani 0.00787733410484715
samosa 0.007568109450616106
spice 0.007547456956755993
spiced 0.0069399711019972
malai 0.006815904786392494
chana 0.006080383312996479
cauliflower 0.005652005208442147
pakoras 0.0056349196942948624
gobi 0.005256311140998758
tikki 0.005164346058865124
raita 0.005003063310014012
lentil 0.004975815539466382
daal 0.004812248402563703
lentils 0.004414069624920935
okra 0.0040624922300329675
josh 0.004057821124567791
rogan 0.003880496600104007
tamarind 0.0037852710284565784
karahi 0.003541589038347258
yogurt 0.003207123797866322
veg 0.0031486656079585972
chilli 0.0030720575796283622
tika 0.002556880089909672

The interesting thing about this output is that many dishes can be formed from different combinations of these terms. Some of them are existing dishes, but we can also be creative and compose interesting new ones. Some of the dishes are:

gobi pakoras
goat karahi
goat tikka
goat biryani
chilli tikka
malai kofta
okra
samosa
dal
dal makhani
gobi
aloo gobi
gobi masala
palak
aloo palak

We use the textmining.jar tool for calculating word co-occurrence based on mutual information, with the --sentences option to split the text into sentences for local-context word associations. These are some of the results for gosht-related dishes:

nalli   gosht   4.8954226081424875E-6 
kadahi  gosht   3.6519436564127923E-6
gosht   vindaloo        9.170976944374744E-7
gosht   curry   1.1258746686261764E-6
gosht   rogan   1.1009422653791137E-5
gosht   daal    7.092818418872206E-6
gosht   biryani 1.0583382743613388E-6

Task 5 (Restaurant Recommendation)

In task 4 we mined the dataset to discover popular dishes for Indian cuisine. In this task we mine the dataset to gather knowledge about the restaurants that serve those dishes. In this particular report we only mined the dataset to determine popular restaurants for the most popular Indian dish, chicken tikka, which was identified through the mining in task 4.

Once we have identified the most popular dish, we go back and look at all the positive and negative reviews for all the restaurants that serve that dish. With that data, we can plot a bar chart of these restaurants with their positive and negative review counts, which gives the user a good idea of which restaurant to pick when he/she wants that dish. Positive reviews are those whose star rating is greater than or equal to 3, and negative reviews are those whose star rating is less than or equal to 2.
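A minimal sketch of this aggregation and plot in R, assuming reviews is a data frame with restaurant and stars columns, already filtered to reviews mentioning the dish (all names are assumptions):

library(ggplot2)

# Label each review as positive (>= 3 stars) or negative (<= 2 stars)
reviews$sentiment <- ifelse(reviews$stars >= 3, "positive", "negative")

# Count positive and negative reviews per restaurant
counts <- as.data.frame(table(reviews$restaurant, reviews$sentiment))
names(counts) <- c("restaurant", "sentiment", "n")

# Side-by-side bar plot of review counts per restaurant
ggplot(counts, aes(x = restaurant, y = n, fill = sentiment)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip()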

Task 6 (Hygiene Prediction)

This task looks at the Yelp hygiene dataset to predict whether a restaurant will pass the hygiene inspection or not. The dataset is composed of a training subset of 546 restaurants used to train the classifier, and a testing subset of 12,753 restaurants used for evaluating the performance of the classifier. The dataset is spread across three files such that the first 546 lines in each file correspond to the training subset and the rest are part of the testing subset. Below is a description of each file:
- hygiene.dat: Each line contains the concatenated text reviews of one restaurant.
- hygiene.dat.labels: For the first 546 lines, a binary label (0 or 1) is used where a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test. The rest of the lines have “[None]” in their label field implying that they are part of the testing subset.
- hygiene.dat.additional: It is a CSV (Comma-Separated Values) file where the first value is a list containing the cuisines offered, the second value is the zip code, which gives an idea about the location, the third is the number of reviews, and the fourth is the average rating, which can vary between 0 and 5 (5 being the best).

As a first step, we read all the reviews from hygiene.dat into a data frame and create a term-document matrix for each restaurant. Since the training dataset consists of only 546 restaurants, we use only the first 546 restaurant reviews from hygiene.dat to train our model. We explore multiple models: first using unigrams from the term-document matrix as features, and again using bigrams. Before extracting the unigrams and bigrams as features, we clean up the reviews using the tm package: we convert all text to lowercase, remove profanity, remove stopwords, and apply stemming. We also remove all sparse terms, allowing only 10% sparseness in the term-document matrix.
We have some additional information about each restaurant in hygiene.dat.additional which we can use to enhance our feature set. We extract features from this additional data for each restaurant and add them to our existing feature set.
We can further refine the feature set by removing predictors whose variance is zero (or near zero), since such predictors carry no information for the classifier. A sketch of these feature-construction steps follows.
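The sketch below is illustrative, not the author's actual code; object names (hygiene.text, hygiene.additional and its columns) are assumptions:

library(tm)
library(caret)

# Clean the 546 training reviews
corpus <- VCorpus(VectorSource(hygiene.text[1:546]))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# Unigram term-document matrix, allowing at most 10% sparseness
unigram.tdm <- removeSparseTerms(TermDocumentMatrix(corpus), 0.1)

# Bigram variant via a custom tokenizer (the NLP package is attached by tm)
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
bigram.tdm <- removeSparseTerms(
  TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)), 0.1)

# Combine the text features with the additional restaurant attributes
features <- cbind(as.data.frame(t(as.matrix(unigram.tdm))),
                  hygiene.additional[1:546, c("zipcode", "n.reviews", "avg.rating")])

# Drop zero- and near-zero-variance predictors
nzv <- nearZeroVar(features)
if (length(nzv) > 0) features <- features[, -nzv]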

Now we partition the training sample into training and validation sets, so that we can validate our classifier before applying it to the test data. We split the training data 70:30, such that 70% of the actual training data is used for training the classifier and the remaining 30% for validating it.
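For example, with caret's createDataPartition (labels is assumed to hold the 546 training labels as a factor); note that 70% of 546 yields the 383 samples reported in the model summaries below:

library(caret)

set.seed(123)
train.idx <- createDataPartition(labels, p = 0.7, list = FALSE)
train.x <- features[train.idx, ];  train.y <- labels[train.idx]
valid.x <- features[-train.idx, ]; valid.y <- labels[-train.idx]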

We run multiple machine learning models and use the one that gives us the best results, with F1 as the metric to be maximized. We try k-nearest neighbors, random forest, and stochastic gradient boosting.
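caret does not report F1 out of the box, so one way to reproduce the F1 column in the outputs below is a custom summaryFunction. A minimal sketch follows; the method names knn, rf, and gbm are caret's standard ones, and everything else is illustrative:

library(caret)

# Custom resampling summary computing F1 for the first class level
f1Summary <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = lev[1])
  recall    <- sensitivity(data$pred, data$obs, positive = lev[1])
  c(F1 = 2 * precision * recall / (precision + recall))
}

ctrl <- trainControl(method = "cv", number = 10, summaryFunction = f1Summary)

unigram.knn.fit <- train(train.x, train.y, method = "knn", trControl = ctrl, metric = "F1")
unigram.rf.fit  <- train(train.x, train.y, method = "rf",  trControl = ctrl, metric = "F1")
unigram.gbm.fit <- train(train.x, train.y, method = "gbm", trControl = ctrl,
                         metric = "F1", verbose = FALSE)

# Validate on the held-out 30%
unigram.knn.confusionmatrix <- confusionMatrix(predict(unigram.knn.fit, valid.x), valid.y)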

unigram knn model

unigram.knn.fit
## k-Nearest Neighbors 
## 
## 383 samples
## 153 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 345, 344, 344, 345, 345, 345, ... 
## Resampling results across tuning parameters:
## 
##   k  F1         F1 SD     
##   5  0.5890546  0.06779387
##   7  0.6250590  0.07753171
##   9  0.6357103  0.08532342
## 
## F1 was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.

confusion matrix for unigram knn model

unigram.knn.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 46 43
##          1 27 47
##                                           
##                Accuracy : 0.5706          
##                  95% CI : (0.4908, 0.6477)
##     No Information Rate : 0.5521          
##     P-Value [Acc > NIR] : 0.3478          
##                                           
##                   Kappa : 0.1493          
##  Mcnemar's Test P-Value : 0.0730          
##                                           
##             Sensitivity : 0.6301          
##             Specificity : 0.5222          
##          Pos Pred Value : 0.5169          
##          Neg Pred Value : 0.6351          
##              Prevalence : 0.4479          
##          Detection Rate : 0.2822          
##    Detection Prevalence : 0.5460          
##       Balanced Accuracy : 0.5762          
##                                           
##        'Positive' Class : 0               
## 

bigram knn model

bigram.knn.fit
## k-Nearest Neighbors 
## 
## 383 samples
##  73 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 344, 346, 345, 344, 345, 344, ... 
## Resampling results across tuning parameters:
## 
##   k  F1         F1 SD     
##   5  0.5453647  0.11753720
##   7  0.5211313  0.11848540
##   9  0.5363748  0.09093464
## 
## F1 was used to select the optimal model using  the largest value.
## The final value used for the model was k = 5.

confusion matrix for bigram knn model

bigram.knn.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 51 26
##          1 50 36
##                                           
##                Accuracy : 0.5337          
##                  95% CI : (0.4541, 0.6121)
##     No Information Rate : 0.6196          
##     P-Value [Acc > NIR] : 0.989740        
##                                           
##                   Kappa : 0.0796          
##  Mcnemar's Test P-Value : 0.008333        
##                                           
##             Sensitivity : 0.5050          
##             Specificity : 0.5806          
##          Pos Pred Value : 0.6623          
##          Neg Pred Value : 0.4186          
##              Prevalence : 0.6196          
##          Detection Rate : 0.3129          
##    Detection Prevalence : 0.4724          
##       Balanced Accuracy : 0.5428          
##                                           
##        'Positive' Class : 0               
## 

unigram rf model

unigram.rf.fit
## Random Forest 
## 
## 383 samples
## 153 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 345, 344, 345, 345, 344, 345, ... 
## Resampling results across tuning parameters:
## 
##   mtry  F1         F1 SD     
##     10  0.6049382  0.07191913
##     50  0.5892803  0.06098618
##    100  0.6123985  0.07094478
##    500  0.5987488  0.06863563
##   1000  0.6070963  0.06390975
## 
## F1 was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 100.

confusion matrix for unigram rf model

unigram.rf.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 51 38
##          1 24 50
##                                           
##                Accuracy : 0.6196          
##                  95% CI : (0.5404, 0.6944)
##     No Information Rate : 0.5399          
##     P-Value [Acc > NIR] : 0.02422         
##                                           
##                   Kappa : 0.2448          
##  Mcnemar's Test P-Value : 0.09874         
##                                           
##             Sensitivity : 0.6800          
##             Specificity : 0.5682          
##          Pos Pred Value : 0.5730          
##          Neg Pred Value : 0.6757          
##              Prevalence : 0.4601          
##          Detection Rate : 0.3129          
##    Detection Prevalence : 0.5460          
##       Balanced Accuracy : 0.6241          
##                                           
##        'Positive' Class : 0               
## 

bigram rf model

bigram.rf.fit
## Random Forest 
## 
## 383 samples
##  73 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 344, 345, 345, 345, 344, 345, ... 
## Resampling results across tuning parameters:
## 
##   mtry  F1         F1 SD     
##     10  0.6537525  0.04721641
##     50  0.6615189  0.06886562
##    100  0.6377641  0.07695971
##    500  0.6206777  0.06910854
##   1000  0.6420762  0.06050119
## 
## F1 was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 50.

confusion matrix for bigram rf model

bigram.rf.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47 30
##          1 33 53
##                                           
##                Accuracy : 0.6135          
##                  95% CI : (0.5342, 0.6886)
##     No Information Rate : 0.5092          
##     P-Value [Acc > NIR] : 0.004721        
##                                           
##                   Kappa : 0.2262          
##  Mcnemar's Test P-Value : 0.801059        
##                                           
##             Sensitivity : 0.5875          
##             Specificity : 0.6386          
##          Pos Pred Value : 0.6104          
##          Neg Pred Value : 0.6163          
##              Prevalence : 0.4908          
##          Detection Rate : 0.2883          
##    Detection Prevalence : 0.4724          
##       Balanced Accuracy : 0.6130          
##                                           
##        'Positive' Class : 0               
## 

unigram gbm model

unigram.gbm.fit
## Stochastic Gradient Boosting 
## 
## 383 samples
## 153 predictors
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 344, 345, 346, 345, 345, 344, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  F1         F1 SD     
##   2                  100      0.6324806  0.06978027
##   2                  200      0.6293300  0.06769479
##   2                  300      0.6130057  0.06217683
##   2                  400      0.6063549  0.05824169
##   2                  500      0.5959974  0.07142145
##   3                  100      0.5896646  0.09144147
##   3                  200      0.5718785  0.07660035
##   3                  300      0.5584011  0.06444784
##   3                  400      0.5881368  0.07923596
##   3                  500      0.5655245  0.07952033
##   4                  100      0.5833909  0.05070745
##   4                  200      0.5728192  0.06939895
##   4                  300      0.5625158  0.06988523
##   4                  400      0.5693650  0.07504513
##   4                  500      0.5630118  0.07180698
##   5                  100      0.5996130  0.07325693
##   5                  200      0.5960680  0.03772509
##   5                  300      0.5791496  0.03712235
##   5                  400      0.5869561  0.04584553
##   5                  500      0.5673033  0.05597792
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## F1 was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

confusion matrix for unigram gbm model

unigram.gbm.confusionmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 57 32
##          1 22 52
##                                           
##                Accuracy : 0.6687          
##                  95% CI : (0.5908, 0.7404)
##     No Information Rate : 0.5153          
##     P-Value [Acc > NIR] : 5.237e-05       
##                                           
##                   Kappa : 0.3393          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.7215          
##             Specificity : 0.6190          
##          Pos Pred Value : 0.6404          
##          Neg Pred Value : 0.7027          
##              Prevalence : 0.4847          
##          Detection Rate : 0.3497          
##    Detection Prevalence : 0.5460          
##       Balanced Accuracy : 0.6703          
##                                           
##        'Positive' Class : 0               
## 

bigram gbm model

bigram.gbm.fit
## Stochastic Gradient Boosting 
## 
## 383 samples
##  73 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 344, 345, 344, 345, 345, 345, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  F1         F1 SD     
##   2                  100      0.5809324  0.12792297
##   2                  200      0.5883536  0.10505908
##   2                  300      0.5764462  0.11107246
##   2                  400      0.5588518  0.09939930
##   2                  500      0.5618966  0.09387904
##   3                  100      0.5901226  0.11674285
##   3                  200      0.6010791  0.09627211
##   3                  300      0.5910125  0.10108267
##   3                  400      0.5837870  0.07390445
##   3                  500      0.5754913  0.08684396
##   4                  100      0.6072137  0.10505361
##   4                  200      0.5651959  0.09119413
##   4                  300      0.5811336  0.06910575
##   4                  400      0.5423515  0.09052754
##   4                  500      0.5776810  0.09319436
##   5                  100      0.6200092  0.08823854
##   5                  200      0.5699468  0.08930689
##   5                  300      0.6131817  0.09525975
##   5                  400      0.6147405  0.08661185
##   5                  500      0.6163756  0.09030642
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## F1 was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.

The author selected the random forest model (the bigram variant with mtry = 50), as it gives the best cross-validated F1 measure (0.6615).
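As a final step, the selected model can be refit on the full 546-restaurant training set and applied to the remaining restaurants. A minimal sketch, assuming test.features is built with the same vocabulary as the training features (e.g. via the dictionary control of TermDocumentMatrix):

# Refit the chosen model on all 546 labeled restaurants
final.fit <- train(features, labels, method = "rf", trControl = ctrl, metric = "F1")

# Predict hygiene outcomes for restaurants 547 onwards and save them
predictions <- predict(final.fit, test.features)
writeLines(as.character(predictions), "hygiene_predictions.txt")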