Data Exploration:

Star ratings will be used here to indicate the sentiment label. For binary classification, we will convert the 1-5 scale ratings to {positive (1), negative (0)} values: ratings 4 and 5 map to positive, ratings 1 and 2 to negative, and neutral 3-star reviews are dropped before modeling.

(More details on the data are available from https://www.yelp.com/dataset)

The data on the above website is provided as multiple JSON files. The reviews file contains the review text along with reviewID, businessID, businessName, star rating, and other attributes. The business file contains the businessName, businessID, address, categories (restaurants, beauty and salon, food, fitness, local services, etc.), and various attributes of the business (free wifi, wheelchair access, parking, smoking allowed, operating hours, etc.).

We will work with a pre-processed data set that contains the business type, review text, star rating, and the number of users who found each review cool, funny, or useful. There are ~40K rows in the pre-processed file.
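As a minimal sketch, the pre-processed file might be loaded as follows. The file name and the column names (review_id, business_type, text, stars, funny, cool, useful) are assumptions used throughout the code sketches below, not the actual schema.

```r
library(dplyr)

# Load the pre-processed reviews (file and column names are assumed):
# review_id, business_type, text, stars, funny, cool, useful
reviews <- read.csv("yelp_reviews_preprocessed.csv", stringsAsFactors = FALSE)

glimpse(reviews)  # ~40K rows
```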

Distribution of star ratings

From the histogram, we can see that the number of reviews increases steadily from 1 star to 5 stars. Star ratings 4 and 5 together account for most of the reviews (64.56%), while ratings 1, 2, and 3 split the remainder in roughly equal proportions.

Most reviews come from Arizona and Nevada; South Carolina and Illinois contribute the fewest, with less than 500 reviews each. The histogram below shows the distribution of reviews across states.

Let's look at word usage across star ratings. Words like “love”, “delicious”, “amaze”, “nice”, “friendly”, and “pretty” are most common in 4- and 5-star reviews, reflecting positive emotion.

Top words in 5-star reviews
Top words in 4-star reviews
Top words in 3-star reviews
Top words in 2-star reviews
Top words in 1-star reviews
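A minimal tidytext sketch of how these top-word lists might be produced, assuming the reviews frame loaded above:

```r
library(tidytext)

top_words <- reviews %>%
  unnest_tokens(word, text) %>%            # one row per word occurrence
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  count(stars, word, sort = TRUE) %>%
  group_by(stars) %>%
  slice_max(n, n = 20) %>%                 # top 20 words per star rating
  ungroup()
```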

Here is the final data frame with lemmatized words and the term frequency (tf), inverse document frequency (idf), and tf-idf values calculated.
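A sketch of the lemmatization and tf-idf computation; textstem::lemmatize_words() is one possible lemmatizer, assumed here rather than confirmed by the source:

```r
library(textstem)

review_words <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(word = lemmatize_words(word)) %>%   # e.g. "amazed" -> "amaze"
  count(review_id, stars, word)

# Adds tf, idf and tf_idf columns, one row per word per review
review_tfidf <- review_words %>%
  bind_tf_idf(word, review_id, n)
```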

Users on Yelp can vote a review as funny, useful, or cool. Let's look at the distribution of votes across star ratings. The plot below shows the number of votes versus star rating for each voting category.

From the plot above, we can see that a few 3- and 4-star reviews have high funny, cool, and useful vote counts. Now let's look at the average votes per category across star ratings.
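The averages behind the line graphs discussed below reduce to a simple grouped summary; the vote column names are the assumed ones from the loading sketch:

```r
avg_votes <- reviews %>%
  group_by(stars) %>%
  summarise(
    avg_funny  = mean(funny),
    avg_cool   = mean(cool),
    avg_useful = mean(useful)
  )
```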

The first line graph shows the average funny votes across star ratings: low-star reviews attract more funny votes. This could be because users tend to write sarcastic reviews when they dislike a restaurant, so funny votes are associated with negative sentiment.

The second line graph shows the average cool votes across star ratings: cool votes increase with the number of stars but drop again at the 5-star rating.

The third line graph shows the average useful votes across star ratings: low-star reviews are voted useful more often than high-star ones. This is expected, since high-star reviews tend to contain generic praise (the food is good, served on time, remarks about maintenance and ambiance) that most users don't find useful.

Now let's see how the star ratings of individual reviews relate to the overall business star rating (starsBusiness) given in the dataset.

In the plot above we can see that for a starsBusiness rating of 1.5 most reviews are 1-star, and as the starsBusiness rating increases the share of positive reviews grows. Restaurants with a starsBusiness rating above 4 are doing well, with most reviews being 4- or 5-star.

Words like “food”, “service”, “time”, and “restaurant” are common across all reviews, so we remove them. After eliminating them we obtain the graph below, which depicts the proportion of top words in each star rating.

In the figure above we can see that for 5-star reviews the proportion of words like awesome, amazing, love, delicious, and pretty is high, while for 1-star reviews the proportion of words like bad and wait is very high. The table below shows the number of occurrences of each word in 5-star and 1-star reviews.
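A sketch of how the domain-specific stop words might be dropped and the per-rating word proportions computed, building on the review_words frame above; the word list mirrors the terms named earlier:

```r
domain_stop <- c("food", "service", "time", "restaurant")

word_props <- review_words %>%
  filter(!word %in% domain_stop) %>%   # remove words common to all reviews
  count(stars, word, wt = n) %>%       # total occurrences per rating
  group_by(stars) %>%
  mutate(prop = n / sum(n)) %>%        # proportion within each star rating
  slice_max(prop, n = 15) %>%          # top words per rating, as plotted above
  ungroup()
```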

Lexicon Dictionaries

We will consider three dictionaries, available through the tidytext package: the NRC dictionary, a combined collection of terms denoting different sentiments; the extended sentiment lexicon developed by Prof. Bing Liu (Bing); and the AFINN dictionary, which covers words commonly used in user-generated content on the web.

The first (NRC) provides lists of words denoting different sentiments (e.g., positive, negative, joy, fear, anticipation); the second (Bing) specifies lists of positive and negative words; the third (AFINN) gives a list of words, each associated with a positivity score from -5 to +5.

The number of matching terms for each dictionary is calculated and depicted using the graph below.
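The counts can be reproduced by joining the review tokens against each lexicon via tidytext's get_sentiments() (the AFINN and NRC lexicons are fetched through the textdata package on first use); a sketch:

```r
bing  <- get_sentiments("bing")
nrc   <- get_sentiments("nrc")
afinn <- get_sentiments("afinn")

# Matched rows per dictionary; NRC words can match several
# sentiment categories, which inflates its count
sapply(
  list(Bing = bing, NRC = nrc, AFINN = afinn),
  function(dict) nrow(inner_join(review_words, dict, by = "word"))
)
```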

Let's first use the dictionary-based positive and negative terms alone to predict sentiment (positive or negative, as defined by the star rating). One approach: using each dictionary, obtain an aggregated positive score and a negative score for each review; for the AFINN dictionary, a single aggregate positivity score can be obtained per review.

##                            Bing    NRC  AFINN
## Number of matching terms 225232 857343 179923

Using Bing Dictionary

Sentiment analysis with the Bing dictionary labels each matched word as positive or negative. We then summarized the sentiment words per review and calculated a sentiment score based on the proportions of positive and negative words. Summarizing the analysis by star rating gives the table below:

We kept only the words that match the Bing dictionary. The average positive score in the table above is simply the mean of the positive proportion, and the average negative score is the mean of the negative proportion. The sentiment score for a single review is the difference between its positive and negative proportions, and the average sentiment score is the mean of these scores grouped by star rating.
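A sketch of that per-review Bing scoring: positive and negative proportions per review, their difference as the sentiment score, then averages by star rating:

```r
bing_scores <- review_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(review_id, stars) %>%
  summarise(
    pos_prop   = sum(n[sentiment == "positive"]) / sum(n),
    neg_prop   = sum(n[sentiment == "negative"]) / sum(n),
    sent_score = pos_prop - neg_prop,
    .groups    = "drop"
  )

# Average scores by star rating, as in the table above
bing_scores %>%
  group_by(stars) %>%
  summarise(avg_pos  = mean(pos_prop),
            avg_neg  = mean(neg_prop),
            avg_sent = mean(sent_score))
```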

Using NRC Dictionary

Sentiment analysis using NRC assigns each word to one or more of several sentiment categories. Considering {anger, disgust, fear, sadness, negative} to denote ‘bad’ reviews and {positive, joy, anticipation, trust, surprise} to denote ‘good’ reviews, we computed a GoodBad score for each word.
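A sketch of that mapping, collapsing the ten NRC categories into a single GoodBad value per matched word:

```r
nrc_good <- c("positive", "joy", "anticipation", "trust", "surprise")
nrc_bad  <- c("anger", "disgust", "fear", "sadness", "negative")

nrc_words <- review_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  mutate(goodBad = case_when(
    sentiment %in% nrc_good ~  1,
    sentiment %in% nrc_bad  ~ -1
  ))
```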

Using AFINN Dictionary

Analysis of review sentiment using the AFINN dictionary gives the following aggregate sentiment score for each star rating:

1 star is the lowest rating and its average sentiment score is -2.39, whereas 5 stars is the highest rating with an average sentiment score of 7.28. It is worth noting that the average review length is approximately the same for all ratings, as we would expect, so review length is not what drives the sentiment score. The aggregated scores, however, help us predict review sentiment, as seen above: higher scores mean more positive sentiment.
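Since each AFINN-matched word already carries a numeric value, the per-review aggregate is a weighted sum; a sketch:

```r
afinn_scores <- review_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(review_id, stars) %>%
  summarise(afinn_score = sum(value * n), .groups = "drop")

# Average aggregate score per star rating
# (e.g. about -2.39 for 1 star and 7.28 for 5 stars, per the table above)
afinn_scores %>%
  group_by(stars) %>%
  summarise(avg_score = mean(afinn_score))
```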

The table below shows the prediction accuracies obtained using the different dictionaries alone:

Now let's try building prediction models using each dictionary.

We drew a random sample of 16,000 reviews to keep the run time manageable and split it 50-50 into training and test sets to limit the computational load. Before building the prediction models, we removed reviews with rating 3 (neutral sentiment): since we are building a binary classification model, we want every review labeled as either positive or negative sentiment.
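A sketch of the sampling and split; the seed and the sentiment labeling rule (4-5 stars positive, 1-2 stars negative) follow the setup described above:

```r
set.seed(123)  # arbitrary seed, for reproducibility

model_data <- reviews %>%
  filter(stars != 3) %>%                           # drop neutral reviews
  mutate(sentiment = ifelse(stars >= 4, 1, 0)) %>%
  sample_n(16000)                                  # keep run time manageable

# 50-50 train/test split
train_idx <- sample(nrow(model_data), nrow(model_data) / 2)
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]
```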

Bing Dictionary

After removing star rating 3 from the data set we have 33,597 rows. The table below shows the distribution of the Bing-based sentiment labels for the reviews in our data set.

8,363 reviews have a Bing sentiment label of -1, meaning they carry star rating 1 or 2, whereas 25,234 reviews have a Bing sentiment label of 1, meaning star rating 4 or 5.
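The progress messages below come from fitting a random forest with permutation variable importance; a sketch using the ranger package, where train_bing and test_bing are assumed per-review feature frames built from the Bing-matched terms plus the sentiment label:

```r
library(ranger)

rf_bing <- ranger(
  sentiment ~ ., data = train_bing,  # train_bing is an assumed feature frame
  num.trees      = 500,
  importance     = "permutation",    # source of the progress messages below
  classification = TRUE
)

# Test-set accuracy
pred <- predict(rf_bing, data = test_bing)$predictions
mean(pred == test_bing$sentiment)
```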

## Computing permutation importance.. (progress messages trimmed)

NRC Dictionary

After removing star rating 3 from the data set we have 34,263 rows. The table below shows the distribution of the NRC-based sentiment labels for the reviews in our data set.

8,597 reviews have an NRC sentiment label of -1, meaning star rating 1 or 2, whereas 25,666 reviews have an NRC sentiment label of 1, meaning star rating 4 or 5.

## Computing permutation importance.. (progress messages trimmed)

AFINN Dictionary

After removing star rating 3 from the data set we have 28,416 rows. The table below shows the distribution of the AFINN-based sentiment labels for the reviews in our data set.

8,197 reviews have an AFINN sentiment label of -1, meaning star rating 1 or 2, whereas 24,704 reviews have an AFINN sentiment label of 1, meaning star rating 4 or 5.

## Computing permutation importance.. (progress messages trimmed)

Let's merge the matched words from all three dictionaries into a single combined dictionary. In the combined matched data set, each word can carry sentiment labels from several dictionaries.
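A sketch of the merge, stacking the three lexicons so each word can carry labels from several dictionaries; thresholding AFINN values at zero to get a positive/negative label is an assumption:

```r
combined_dict <- bind_rows(
  get_sentiments("bing") %>% mutate(dictionary = "bing"),
  get_sentiments("nrc")  %>% mutate(dictionary = "nrc"),
  get_sentiments("afinn") %>%
    mutate(sentiment = ifelse(value >= 0, "positive", "negative"),
           dictionary = "afinn") %>%
    select(word, sentiment, dictionary)
)

combined_words <- review_words %>%
  inner_join(combined_dict, by = "word")
```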

The table below shows the distribution of sentiment labels for the reviews in our data set: 8,621 reviews have a label of -1, meaning star rating 1 or 2, whereas 25,798 reviews have a label of 1, meaning star rating 4 or 5.

## Computing permutation importance.. (progress messages trimmed)

Conclusion

From the table above, we can see that the random forest model with the combined dictionary performs best, with an accuracy of 89.42%, while among the dictionary-only aggregated sentiment scores the AFINN dictionary performed best, at approximately 84%.