For my project, I will be applying a statistical learning algorithm to predict a customer's star rating for a restaurant based on past users' reviews. Such a model could be used to recommend other restaurants to users and to better target advertising efforts. I will be conducting this sentiment analysis with a Naive Bayes algorithm, using reviews and star ratings extracted from the Zomato API. Below is an example of the format of a JSON review.
[
  {
    "rating": "5",
    "review_text": "The best latte I've ever had. It tasted a little sweet",
    "id": "24127336",
    "rating_color": "305D02",
    "review_time_friendly": "2 months ago",
    "rating_text": "Insane!",
    "timestamp": "1435507367",
    "likes": "0",
    "user": {
      "name": "John Doe",
      "zomato_handle": "John",
      "foodie_level": "Super Foodie",
      "foodie_level_num": "9",
      "foodie_color": "f58552",
      "profile_url": "https://www.zomato.com/john",
      "profile_deeplink": "zoma.to/u/1170245",
      "profile_image": "string"
    },
    "comments_count": "0"
  }
]
The data set pulled from the Zomato API contains 2100 restaurants, 2259 users, and 3506 reviews; of those reviews, 2314 have a user-assigned star rating. The data provides useful information such as the business profile, review text, user profile, and rating. It was extracted in JSON format and converted into a readable data frame in R.
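The exact conversion code depends on how the API responses were saved, but below is a minimal sketch of the JSON-to-data-frame step using the jsonlite package, assuming the pulled reviews were written to a local file (reviews.json is a hypothetical name).
#install.packages("jsonlite")
library(jsonlite)
# flatten = TRUE spreads the nested "user" object into user.name, user.foodie_level, ... columns
reviews <- fromJSON("reviews.json", flatten = TRUE)
str(reviews)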
Now that we have loaded the data, let's explore it.
library(ggmap)
# map.df holds the longitude/latitude of each restaurant pulled from the API
map <- get_map(location = 'USA', zoom = 4)
mapPoints <- ggmap(map) + geom_point(aes(x = long, y = lat), data = map.df)
mapPoints
From the map above, we can see the various cities and states where the restaurants are located. I used cities that Zagat referenced as foodie cities, and I was limited to extracting 100 restaurants from each city because of the restrictions on my API access.
The graph below shows the number of reviews by star rating. We see that a majority of the ratings are zero. Looking through the data, I noticed that most of the zero ratings were actually missing values rather than genuinely negative reviews, so I will remove them during the analysis. The graph also shows that the majority of people who did leave a rating gave between 4 and 5 stars, which indicates a strong bias for reviewers to leave positive reviews rather than negative ones.
# Reorder the factor levels so the ratings appear in numerical order
table_review_all$rating2 <- factor(table_review_all$rating, levels = c("0","1","1.5","2","2.5","3","3.5","4","4.5","5"))
# Plot the bar graph (plotting a factor produces a bar chart)
plot(table_review_all$rating2, xlab = "Star Rating", ylab = "Count", main = "Bar Graph of Reviews by Star Rating")
Now that we have explored the data, it is time to begin the cleaning process. We will create a data set that contains only the review text and its assigned rating. As noted above, zero-star ratings mix missing values with genuine reviews, so including them would hurt the model; I will therefore exclude reviews with a zero rating. We will also convert the star rating variable into a factor.
# Subset the data set to include only the ratings and review text
word <- table_review_all[ c(1,2) ]
# Exclude reviews with a 0 rating
library(sqldf)
word <- sqldf("Select * from word where rating <> 0 order by rating desc")
# Convert rating to a factor
word$rating <- factor(word$rating)
Next, I will build a volatile corpus from the review texts using functions from the tm package. Once the corpus is created, I will clean the text and build a document-term sparse matrix from it, which stores each document as its own row. During the cleaning process I will remove numbers and punctuation, and convert all characters to lowercase so that, for example, "Great" and "great" are treated as the same word. I will also apply word stemming, which strips suffixes so that each word is reduced to its root. Finally, stop words will be removed because they are extremely common in English and carry little meaning.
#build a corpus using the text mining package
#install.packages("tm")
library(tm)
word_corpus <- VCorpus(VectorSource(word$review_text))
#create a document-term sparse matrix directly from the corpus
word_dtm <- DocumentTermMatrix(word_corpus, control = list(
tolower = TRUE,
removeNumbers = TRUE,
stopwords = function(x) { removeWords(x, stopwords()) },
removePunctuation = function(x) { gsub("[[:punct:]]+", " ", x) },
stemming = TRUE
))
For feature extraction, we need to count the frequency of every word that appears in the users' review pool and remove those with low frequency. Since we have converted the corpus into a sparse matrix, we can get a count of each word directly from the matrix.
#Get count of each word
freq <- colSums(as.matrix(word_dtm))
#create sort order (descending)
ord <- order(freq,decreasing=TRUE)
#inspect most frequently occurring terms
freq[head(ord)]
food place good great servic restaur
1117 1038 974 824 637 598
#inspect least frequently occurring terms
freq[tail(ord)]
еда единственный коктейлей минус хорошая чаевых
1 1 1 1 1 1
From the summary above we notice that the least frequent terms appear to be in a foreign language and may become a problem as the analysis proceeds. It is best to eliminate these irregular terms to help the algorithm work more efficiently. To do this we will use the function findFreqTerms(), which produces a list of terms that occur at least 5 times in the entire corpus. After the terms have been identified, I will create a subset DTM that contains only those frequent terms.
# indicator features for frequent words
head(findFreqTerms(word_dtm, 5))
[1] "abl" "absolut" "accept" "accolad" "accommod" "accompani"
# save frequently-appearing terms to a character vector
word_freq_words <- findFreqTerms(word_dtm, 5)
#str(word_freq_words)
# create DTMs with only the frequent terms
word_dtm_freq <- word_dtm[ , word_freq_words]
Now let's take a look at our document-term matrix to see if there are any other ways to clean up the data.
word_dtm_freq
<<DocumentTermMatrix (documents: 2314, terms: 1787)>>
Non-/sparse entries: 56445/4078673
Sparsity : 99%
Maximal term length: 13
Weighting : term frequency (tf)
inspect(word_dtm_freq[1:10,1:10])
<<DocumentTermMatrix (documents: 10, terms: 10)>>
Non-/sparse entries: 0/100
Sparsity : 100%
Maximal term length: 9
Weighting : term frequency (tf)
Sample :
Terms
Docs abl absolut accept accolad accommod accompani accord across actual add
1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
##Remove Sparse Terms
word_dtm_freq_ <- removeSparseTerms(word_dtm_freq, .98)
##Check the difference between the two sets
dim(word_dtm_freq)
[1] 2314 1787
dim(word_dtm_freq_)
[1] 2314 284
The removeSparseTerms() function removes terms that are more sparse than the given threshold. With a threshold of 0.98, a term is kept only if it appears in more than 2% of the documents (i.e., is zero in no more than 98% of them); everything rarer is dropped, which is why the matrix shrinks from 1787 to 284 terms.
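As a sanity check, the same filter can be reproduced by hand from the document frequencies. This is only an illustration of what removeSparseTerms() does, assuming the word_dtm_freq object created above.
# Keep terms that appear in more than (1 - 0.98) = 2% of the documents
doc_freq <- colSums(as.matrix(word_dtm_freq) > 0)  # number of documents containing each term
keep <- doc_freq > nDocs(word_dtm_freq) * (1 - 0.98)
manual_dtm <- word_dtm_freq[, names(keep)[keep]]
dim(manual_dtm)  # should match (or very nearly match) dim(word_dtm_freq_)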
Now that we have a clean data set, we can proceed with running the data through the algorithm to create our predictive model. We first need to split the data into training and test subsets, and because we are using a Naive Bayes algorithm we need to ensure that the proportions of ratings in the training and test sets are similar.
# creating training and test datasets
word_dtm_train <- word_dtm_freq_[1:1851, ]
word_dtm_test <- word_dtm_freq_[1852:2314, ]
# also save the labels
word_train_labels <- word[1:1851, ]$rating
word_test_labels <- word[1852:2314, ]$rating
# check that the proportion of the ratings is similar
prop.table(table(word_train_labels))
word_train_labels
3 3.5 4.5 4 5 1 2.5 1.5
0.00000000 0.04159914 0.32739060 0.34846029 0.28254997 0.00000000 0.00000000 0.00000000
2
0.00000000
prop.table(table(word_test_labels))
word_test_labels
3 3.5 4.5 4 5 1 2.5 1.5
0.25701944 0.37365011 0.00000000 0.00000000 0.00000000 0.12311015 0.11663067 0.04103672
2
0.08855292
We see that the ratings are not randomly distributed between the two subsets, so we will need to randomly sample the data to fix this. We will also convert the word counts into categorical Yes/No indicators, since Naive Bayes works with categorical features.
set.seed(8008)
indx = sample(1:nrow(word_dtm_freq_), as.integer(0.80*nrow(word_dtm_freq_)))
word_dtm_train = word_dtm_freq_[indx,]
word_dtm_test = word_dtm_freq_[-indx,]
word_train_labels = word[indx,]$rating
word_test_labels = word[-indx,]$rating
prop.table(table(word_train_labels))
word_train_labels
3 3.5 4.5 4 5 1 2.5
0.049162615 0.109670448 0.258779038 0.287412210 0.222042139 0.025391680 0.024311183
1.5 2
0.006482982 0.016747704
prop.table(table(word_test_labels))
word_test_labels
3 3.5 4.5 4 5 1 2.5 1.5
0.06047516 0.10151188 0.27429806 0.24406048 0.24190065 0.02159827 0.01943844 0.01511879
2
0.02159827
# Convert word counts to categorical "Yes"/"No" indicators
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
# apply() convert_counts() to columns of train/test data
word_train <- apply(word_dtm_train, MARGIN = 2, convert_counts)
word_test <- apply(word_dtm_test, MARGIN = 2, convert_counts)
Now that we have transformed the raw restaurant reviews into a format that can be represented by a statistical model, it is time to apply the Naive Bayes algorithm. I will implement it through caret's train() function, which also applies repeated k-fold cross-validation at the same time.
#k fold cross validation
require(caret)
# define training control
train_control <- trainControl(method="repeatedcv", number=5, repeats=2)
set.seed(8008)
# train the model
model_word <- train(x = word_train, y = word_train_labels, trControl=train_control, method="nb")
##Use to make predictions
pred2 <- predict(model_word,word_test)
##evaluate the predictions against the true values
check <- confusionMatrix(pred2,word_test_labels)
check
Confusion Matrix and Statistics
Reference
Prediction 3 3.5 4.5 4 5 1 2.5 1.5 2
3 3 1 2 4 2 0 1 1 2
3.5 8 6 7 9 5 0 0 2 2
4.5 3 7 40 24 29 3 3 0 1
4 8 25 34 52 20 2 4 1 0
5 5 7 41 22 55 1 1 2 2
1 1 0 0 0 1 3 0 0 1
2.5 0 0 2 2 0 0 0 1 0
1.5 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 1 0 0 2
Overall Statistics
Accuracy : 0.3477
95% CI : (0.3044, 0.3931)
No Information Rate : 0.2743
P-Value [Acc > NIR] : 0.0003204
Kappa : 0.1589
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 3 Class: 3.5 Class: 4.5 Class: 4 Class: 5 Class: 1 Class: 2.5
Sensitivity 0.107143 0.12766 0.31496 0.4602 0.4911 0.300000 0.00000
Specificity 0.970115 0.92067 0.79167 0.7314 0.7692 0.993377 0.98899
Pos Pred Value 0.187500 0.15385 0.36364 0.3562 0.4044 0.500000 0.00000
Neg Pred Value 0.944072 0.90330 0.75354 0.8076 0.8257 0.984683 0.98035
Prevalence 0.060475 0.10151 0.27430 0.2441 0.2419 0.021598 0.01944
Detection Rate 0.006479 0.01296 0.08639 0.1123 0.1188 0.006479 0.00000
Detection Prevalence 0.034557 0.08423 0.23758 0.3153 0.2937 0.012959 0.01080
Balanced Accuracy 0.538629 0.52417 0.55331 0.5958 0.6302 0.646689 0.49449
Class: 1.5 Class: 2
Sensitivity 0.00000 0.200000
Specificity 0.99781 0.995585
Pos Pred Value 0.00000 0.500000
Neg Pred Value 0.98485 0.982571
Prevalence 0.01512 0.021598
Detection Rate 0.00000 0.004320
Detection Prevalence 0.00216 0.008639
Balanced Accuracy 0.49890 0.597792
From the R output, we can see that the accuracy of the model is 0.35 and the Kappa statistic is 0.16. This accuracy is poor: it is only slightly better than the no-information rate of 0.27, which is what we would get by always predicting the most common rating. The Kappa statistic is also well below 0.4, which indicates poor agreement between the model's predictions and the actual values. This is an obvious problem, so we need to improve the model.
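As a quick check on that baseline, the no-information rate reported by confusionMatrix() is simply the accuracy of always predicting the most common rating in the test set; it can be recomputed directly from the test labels (assuming word_test_labels and check from above).
# Baseline accuracy of always guessing the most common test-set rating
# (this is the "No Information Rate" in the confusionMatrix() output)
max(prop.table(table(word_test_labels)))
# Compare to the model's overall accuracy and Kappa
check$overall[c("Accuracy", "Kappa")]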
One way to improve the model is to reduce the number of star-rating classes. I decided to round the star ratings up to the nearest whole number, which should remove some of the error from the previous model. Below are the steps I took to round the ratings.
##improve the model
word_train_labels_r = as.factor(ceiling(as.numeric(paste(word[indx,]$rating))))
word_test_labels_r = as.factor(ceiling(as.numeric(paste(word[-indx,]$rating))))
head(word_train_labels_r)
[1] 5 5 5 4 2 4
Levels: 1 2 3 4 5
head(word_train_labels)
[1] 5 4.5 5 4 2 3.5
Levels: 3 3.5 4.5 4 5 1 2.5 1.5 2
Now that I have made the necessary adjustments to the data, I will apply the algorithm again and, hopefully, see that the model's accuracy has improved.
#k fold cross validation
require(caret)
# define training control
train_control <- trainControl(method="repeatedcv", number=5, repeats=2)
set.seed(8008)
# train the model
model_word_r <- train(x = word_train, y = word_train_labels_r, trControl=train_control, method="nb")
# summarize results
print(model_word_r)
Naive Bayes
1851 samples
284 predictor
5 classes: '1', '2', '3', '4', '5'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 2 times)
Summary of sample sizes: 1480, 1480, 1481, 1482, 1481, 1481, ...
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.5359483 0.2015875
TRUE 0.5359483 0.2015875
Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was
held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
pred2r <- predict(model_word_r,word_test)
check <- confusionMatrix(pred2r,word_test_labels_r)
check
Confusion Matrix and Statistics
Reference
Prediction 1 2 3 4 5
1 3 1 1 0 1
2 1 1 0 1 1
3 0 3 2 6 5
4 2 5 21 91 57
5 4 7 13 62 175
Overall Statistics
Accuracy : 0.5875
95% CI : (0.5411, 0.6327)
No Information Rate : 0.5162
P-Value [Acc > NIR] : 0.001219
Kappa : 0.2817
Mcnemar's Test P-Value : 0.002543
Statistics by Class:
Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity 0.300000 0.058824 0.05405 0.5687 0.7322
Specificity 0.993377 0.993274 0.96714 0.7195 0.6161
Pos Pred Value 0.500000 0.250000 0.12500 0.5170 0.6705
Neg Pred Value 0.984683 0.965142 0.92170 0.7596 0.6832
Prevalence 0.021598 0.036717 0.07991 0.3456 0.5162
Detection Rate 0.006479 0.002160 0.00432 0.1965 0.3780
Detection Prevalence 0.012959 0.008639 0.03456 0.3801 0.5637
Balanced Accuracy 0.646689 0.526049 0.51060 0.6441 0.6741
From the R output, we can see that the accuracy of the model is now 0.59 with a Kappa value of 0.28. This is a large improvement over the previous model: both the accuracy and Kappa values increased substantially, although they still fall short of what would make this an ideal model. We also see that the sensitivity and specificity for most classes have improved.
Now let's put the model to the test against the human eye. I will randomly pick 5 reviews from the test data set and make my own star-rating predictions for them, then compare my accuracy to the model's. Feel free to make your own predictions and see if you can beat the model. Note that I will be using the final model with the rounded star ratings, since it outperformed the original model.
set.seed(4545)
nxt <- sample(1:length(word$rating),5)
mytest <- word[nxt,]$review_text
mytest_labels <- as.factor(ceiling(as.numeric(paste(word[nxt,]$rating))))
mytest
[1] "Delicious Chicago style deep dish, great service, and old beer! The crust is the cornmeal style unlike some of the other Chicago style pizza joints"
[2] "To start off the night my wife and I decided to set up reservations here several weeks in advance prior to our trip to Manhattan. We arrive an hour early to be seated 2 hours later with how busy it was we weren't very concerned. This is a high end restaurant. The decor is absolutely breathtaking, tables are set beautifully, and the service was absolutely amazing. Food is extremely expensive; however that's to be expected here. I was blown away by the Red Snapper, beautiful presentation and the t..."
[3] "Of all the fancy, quaint, gaudy and renown seafood restaurants of the San Francisco Fisherman's Wharf area (most of which are beloved by generations of diners), I think the consensus of myself and my \"foodie\" family is that Scoma's has the best food and may be the least touristy. The vast menu, supplemented by the fresh specials are a real treat. Just outside the restaurant by the kitchen door is the restaurant's boat dock where seafood literally comes from the boats to your table. Service is ..."
[4] "The food was quite good and very rich. The lemon creme on the strawberry pancakes was to die for! The wait was about an hour for 3 people at 11 am which is a testament to how popular this place was. The inside wasn't air conditioned and the staff was not incredibly friendly but overall very enjoyable and if you can stand the wait then you should come here."
[5] "I visited Ayza this past weekend with a group of friends. We ordered a cheese platter which was delicious and also tried a sample of their house sangria which was great! For dinner we ordered the vegetarian panini and white mushroom truffle pizza. The panini was rather bland and needed some type of sauce to spruce it up. The pizza on the other hand was decent but too cheesy for my liking."
model_test <- word_dtm_freq_[nxt,]
model_test <- apply(model_test, MARGIN = 2, convert_counts)
My predictions for these reviews are:
1. 5
2. 5
3. 4
4. 4
5. 3
Now that I have made my predictions, lets compare my results against the model.
## My accuracy
results <- c(5,4,4,3,3)
answer <- as.numeric(paste(mytest_labels))
myacc <- sum(results == answer)/5
## Model's accuracy
set.seed(4545)
pred2r <- predict(model_word_r,model_test)
#check <- confusionMatrix(pred2r,mytest_labels)
library(gmodels)
chk <- CrossTable(pred2r, mytest_labels,
prop.chisq = F, prop.t = F, prop.r = F,
dnn = c('predicted', 'actual'))
Cell Contents
|-------------------------|
| N |
| N / Col Total |
|-------------------------|
Total Observations in Table: 5
| actual
predicted | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|
4 | 1 | 2 | 0 | 3 |
| 1.000 | 1.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|
5 | 0 | 0 | 2 | 2 |
| 0.000 | 0.000 | 1.000 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 1 | 2 | 2 | 5 |
| 0.200 | 0.400 | 0.400 | |
-------------|-----------|-----------|-----------|-----------|
Based on my predictions, I achieved an accuracy of 0.6, while the model achieved an accuracy of 0.8. On this small sample, the model is more accurate than I was.
The model showed a moderate ability to predict star ratings from users' reviews, with a clear boost in accuracy once the star ratings were rounded to whole numbers. From this report, I learned how to apply repeated k-fold cross-validation in conjunction with the Naive Bayes algorithm.
https://developers.zomato.com/documentation#!/common/collections
https://cran.r-project.org/web/packages/jsonlite/vignettes/json-apis.html
http://stackoverflow.com/questions/27430042/parse-nested-json-to-data-frame-in-r
http://www.washingtonpost.com/sf/style/2015/12/21/the-10-best-food-cities-in-america-ranked/?utm_term=.6c2db5dbfd57