Analysis of the Correspondence Between Review Text and Star Ratings

Tetyana S.
November 20, 2015

Introduction

A description of the question/problem and the rationale for studying it:

Can we predict the sentiment of a textual review (positive or negative words) from a corpus of reviews of restaurant and bar businesses?
Using a subset of the data from the last 5 years and random samples of reviews, we want to predict the number of stars (from 1 to 5) from the review text.

Methods and Data

I use two files from the dataset provided by Yelp (review and business).
I join them on the business_id key, then select the subset of the data covering the past 5 years and draw a 30% random sample of records from it.
For different random samples I perform sentiment analysis (I find the frequency of positive words and negative words, and compute the mean, median and total number of the most commonly used words). This process is repeated for each star rating from 1 to 5.
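
The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code used for the analysis: the toy review and business tables, the column names other than business_id, the cutoff date, the tiny pos_words and neg_words lists and the helper count_sentiment are all hypothetical stand-ins.

```r
# Toy stand-ins for the Yelp business and review tables (hypothetical data).
business <- data.frame(
  business_id = c("b1", "b2", "b3"),
  categories  = c("Restaurants", "Bars", "Dentists"),
  stringsAsFactors = FALSE
)
review <- data.frame(
  business_id = c("b1", "b1", "b2", "b3", "b2"),
  stars       = c(5, 1, 4, 3, 2),
  date        = as.Date(c("2015-03-01", "2014-07-15", "2009-01-10",
                          "2015-05-05", "2013-11-30")),
  text        = c("Great food and friendly staff",
                  "Terrible service, awful food",
                  "Good drinks",
                  "Nice dentist",
                  "Bad cocktails but good music"),
  stringsAsFactors = FALSE
)

# Join the two tables on business_id and keep restaurants and bars only.
joined <- merge(review, business, by = "business_id")
joined <- joined[grepl("Restaurants|Bars", joined$categories), ]

# Keep reviews from the last 5 years (assumed window start).
window_start <- as.Date("2010-11-20")
recent <- joined[joined$date >= window_start, ]

# Draw a 30% random sample of the remaining records.
set.seed(1)
n_sample <- ceiling(0.3 * nrow(recent))
sampled <- recent[sample(nrow(recent), n_sample), ]

# Tiny illustrative word lists (a real analysis would use a full lexicon).
pos_words <- c("great", "good", "friendly", "nice")
neg_words <- c("terrible", "awful", "bad")

# Count positive and negative words in one review text.
count_sentiment <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  c(pos = sum(tokens %in% pos_words), neg = sum(tokens %in% neg_words))
}

# One row of counts per sampled review, then totals per star rating.
counts  <- t(sapply(sampled$text, count_sentiment, USE.NAMES = FALSE))
by_star <- aggregate(counts, by = list(stars = sampled$stars), FUN = sum)
```

Repeating the sampling step with different seeds gives the different random samples mentioned above; by_star then holds the positive and negative word totals for each star rating present in the sample.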

Results

For each of the five star ratings we show word clouds of the most commonly used words and build frequency diagrams.
We use the total sums of the proportions of positive and negative words as the criterion for determining the number of stars.

Discussion

After working with several groups of random samples, I propose the following result for discussion:

k = sum(colSums(Words[,extractPos]))/sum(colSums(Words[,extractNeg]))

if k <= 1.5 then we predict stars=1
if 1.5 < k <= 3 then we predict stars=2
if 3 < k <= 5.5 then we predict stars=3
if 5.5 < k <= 8.5 then we predict stars=4
if k > 8.5 then we predict stars=5
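
As a minimal sketch, the rule above can be written as an R function. It assumes that Words is a matrix of word counts with one column per word and that extractPos and extractNeg select the positive and negative word columns; the toy matrix and the function name predict_stars are hypothetical and only serve the illustration.

```r
# Toy word-count matrix: rows are reviews, columns are words (hypothetical data).
Words <- cbind(good  = c(3, 1, 2),
               great = c(2, 0, 1),
               bad   = c(1, 0, 0),
               awful = c(0, 1, 0))
extractPos <- c("good", "great")   # assumed positive-word columns
extractNeg <- c("bad", "awful")    # assumed negative-word columns

# Ratio of total positive to total negative word counts.
k <- sum(colSums(Words[, extractPos])) / sum(colSums(Words[, extractNeg]))

# Map k to a star rating using the thresholds 1.5, 3, 5.5 and 8.5.
# Intervals are closed on the right, matching "k <= 1.5", "1.5 < k <= 3", etc.
predict_stars <- function(k) {
  as.integer(cut(k,
                 breaks = c(-Inf, 1.5, 3, 5.5, 8.5, Inf),
                 labels = FALSE,
                 right  = TRUE))
}

predict_stars(k)                      # k is 9 / 2 = 4.5 here, which maps to 3
predict_stars(c(0.8, 2, 4, 7, 12))    # 1 2 3 4 5
```

With this formulation a sample containing no negative words gives k = Inf and is mapped to 5 stars, while a sample with neither positive nor negative words gives NaN and an NA prediction; whether that behaviour is desirable is a separate design choice.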

CONCLUSION: For any business in the restaurant and bar category, it is possible to determine the number of stars using only the feedback from visitors.