Linda Kaw
Data Science Capstone
The question that I am interested in addressing with Yelp data is: “What are the most common words used in the reviews for each star rating?”
This question is interesting because the results could be useful for predicting a review's rating from its text alone. They would also show which words are common in positive reviews and which are common in negative ones. Business owners might also be interested in the answer, since it would let them assess their own performance from written feedback or conversations with their customers.
Since the Yelp dataset contains the star rating for each review, the question is definitely answerable using the dataset.
The review text data is first sampled to obtain 50,000 reviews for each star rating. The sampled text is processed using Natural Language Processing techniques, and the frequency of each word is tabulated for each star rating. The 25 most common words for each rating are then displayed as word clouds.
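The sampling step can be illustrated with a minimal sketch, which is not the code used for this project. It assumes each review is a record with a `stars` field and a `text` field; the function name `sample_per_rating`, the fixed seed and the toy data are illustrative assumptions.

```python
import random
from collections import defaultdict

def sample_per_rating(reviews, n_per_rating, seed=0):
    """Return up to n_per_rating randomly chosen reviews for each star rating."""
    # Group the reviews by their star rating.
    by_stars = defaultdict(list)
    for review in reviews:
        by_stars[review["stars"]].append(review)

    # Draw the same number of reviews from every group (without replacement).
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for stars, group in sorted(by_stars.items()):
        k = min(n_per_rating, len(group))
        sampled[stars] = rng.sample(group, k)
    return sampled

# Toy data standing in for the Yelp reviews (hypothetical content).
toy_reviews = [
    {"stars": s, "text": f"review {i}"} for i in range(20) for s in (1, 3, 5)
]
sampled = sample_per_rating(toy_reviews, n_per_rating=5)
```

Sampling the same number of reviews per rating keeps the word frequencies comparable across ratings, since no rating contributes more text than another simply because it has more reviews.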
The following steps are taken to preprocess the review text data:
Each review is tokenized to create a population of unigrams (single words). The unigrams are then sorted in decreasing order of frequency and stored in data frames.
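As a rough illustration of tokenizing and counting, the sketch below splits lowercased text into unigrams and ranks them by frequency. The regular expression, the function names and the toy sentences are assumptions made for this example. Any further cleaning, such as stop-word removal, is not shown, and plain lists of (word, count) pairs stand in for the data frames.

```python
import re
from collections import Counter

# Assumed tokenization rule: runs of letters and apostrophes, after lowercasing.
TOKEN_PATTERN = re.compile(r"[a-z']+")

def unigram_counts(texts):
    """Count how often each unigram occurs across a collection of review texts."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return counts

def top_unigrams(texts, k=25):
    """Return the k most frequent unigrams as (word, count) pairs, most frequent first."""
    return unigram_counts(texts).most_common(k)

# Toy texts standing in for the reviews of one star rating (hypothetical content).
toy_texts = ["Great food, great place!", "The food was great."]
top = top_unigrams(toy_texts, k=3)
```

Running the same function once per star rating would give one ranked table per rating, which is the input a word cloud needs.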
The word clouds on the right are generated for star ratings 1 (top), 3 (middle) and 5 (bottom). Due to space limitations, the word clouds for the 2-star and 4-star ratings are not shown.
The larger a word appears in the word cloud, the more frequently it occurs in the review text data. For instance, “great” is the most common word in reviews with a star rating of 5.
From the word clouds, we are able to answer the question of “What are the most common words used in the reviews for each star rating?”
It can be observed that some words appear in all the word clouds. For instance, “food” and “place” appear with high frequency in every word cloud, which may imply that most of the reviews are written about food places.
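The overlap between word clouds can also be checked programmatically rather than by eye. The sketch below intersects the top-k word sets of each rating. The function name `shared_top_words` and the toy counts are assumptions for illustration, not results from the Yelp data.

```python
from collections import Counter

def shared_top_words(counts_by_rating, k=25):
    """Return the words that are among the k most frequent for every star rating."""
    top_sets = [
        {word for word, _ in counts.most_common(k)}
        for counts in counts_by_rating.values()
    ]
    return set.intersection(*top_sets) if top_sets else set()

# Toy frequency tables (hypothetical numbers, not taken from the dataset).
toy_counts = {
    1: Counter({"food": 9, "place": 7, "bad": 5}),
    3: Counter({"food": 8, "place": 6, "ok": 4}),
    5: Counter({"great": 10, "food": 9, "place": 8}),
}
shared = shared_top_words(toy_counts, k=3)
```

With these toy tables, the shared set contains only “food” and “place”, mirroring the observation above.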
There does not seem to be much difference in the words used for star ratings 1-3, possibly due to the small size of the word clouds. This could also imply that the reviews for star ratings 1-3 are mostly similar.
There is a high number of positive words used in star ratings 4-5, e.g. “love” and “amazing”. This implies that it may be possible to differentiate between reviews with high (4-5) and low (1-3) star ratings based on the common words used in the reviews.
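One simple way to look for words that separate high from low ratings is to keep the top words of the high-rating group that are absent from the top words of the low-rating group. This is only a sketch of that comparison. The function name `distinctive_words` and the toy counts are assumptions, and it is not a rating predictor.

```python
from collections import Counter

def distinctive_words(high_counts, low_counts, k=25):
    """Return top-k words of the high-rating group missing from the low-rating top-k."""
    high_top = [word for word, _ in high_counts.most_common(k)]
    low_top = {word for word, _ in low_counts.most_common(k)}
    # Keep the frequency order of the high-rating list.
    return [word for word in high_top if word not in low_top]

# Toy frequency tables (hypothetical numbers, not taken from the dataset).
high = Counter({"great": 12, "food": 10, "love": 7, "amazing": 5})
low = Counter({"food": 11, "place": 9, "service": 6, "time": 4})
result = distinctive_words(high, low, k=4)
```

For these toy tables the distinctive words are “great”, “love” and “amazing”, while “food” is dropped because both groups share it.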