Li Jiming
18/11/2015
The goal of this project is to build a model that predicts a review's star rating from its text alone: “Write your tip, we rate for you”.
How well can a review's rating be guessed from its text alone? What is the ratio of positive to negative words used in customer reviews at each star level?
The dataset is part of the Yelp Dataset Challenge; this project uses the data released for Round 6 of the challenge. It consists of a set of JSON files that include business information, reviews, tips, user information, etc.
We selected a sample from the Yelp review file, keeping the review text, the star rating, and the review ID. The review text forms the basic corpus of this project. Data cleaning is critical for building the model, because word frequencies and sentiment can be distorted by non-unique elements such as extra spaces, punctuation, mixed lower and upper case, and articles.
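The cleaning steps named above (case, punctuation, extra spaces, articles) can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the project's R code; the function name `clean_review` and the article list are assumptions.

```python
import string

# Assumption: "text articles" means the English articles below.
ARTICLES = {"a", "an", "the"}

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TO_SPACE)
    # split() with no argument also collapses runs of whitespace.
    tokens = [tok for tok in text.split() if tok not in ARTICLES]
    return " ".join(tokens)


print(clean_review("The  food was GREAT!!  A must-try."))
```

After this step, variants such as "GREAT!!" and "great" count as the same term when frequencies are computed.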
In this project we applied natural language processing (NLP) techniques, Latent Dirichlet Allocation for topic modeling, and external resources to build a star rating prediction algorithm in the R environment. The algorithm will be trained on 1.6 million reviews and will be able to predict a review's star rating from its text alone.
The data were trained on star-specific word lists for each star group: 1-grams, 3-grams, and 4- to 6-grams.
To predict a review's star rating from its text alone, we implemented 1-gram, 3-gram, and 5- to 6-gram algorithms.
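As a rough illustration of per-star n-gram training and prediction, the sketch below counts n-grams separately for each star group and picks the star whose group covers the most n-grams of a new review. It is written in Python, not the project's R code, and the scoring rule (relative frequency summed over the review's n-grams) is an assumption made for this example; the names `ngrams`, `train`, and `predict` are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train(reviews, n=1):
    """Count n-grams separately for each star group.

    reviews: iterable of (cleaned_text, star) pairs.
    """
    counts = defaultdict(Counter)
    for text, star in reviews:
        counts[star].update(ngrams(text.split(), n))
    return counts


def predict(counts, text, n=1):
    """Pick the star whose n-gram counts best cover the review (assumed rule)."""
    grams = ngrams(text.split(), n)

    def score(star):
        total = sum(counts[star].values())
        if total == 0:
            return 0.0
        return sum(counts[star][g] for g in grams) / total

    # sorted() makes ties resolve to the lowest star deterministically.
    return max(sorted(counts), key=score)


toy = [("terrible cold food", 1), ("great tasty food", 5)]
model = train(toy, n=1)
print(predict(model, "great food"))
```

The same functions work for longer n-grams by changing `n`; how the 1-gram, 3-gram, and longer models are combined is not specified in the text and is left open here.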
Prediction accuracy on 100 test samples, counting approximate (not exact) matches between the predicted star and the real star, is 81%. To visualize the observations under Linear Discriminant Analysis classification, we used the function partimat{klaR}. An important feature of this function is that the classification borders are displayed and the apparent error rate is given in each plot title.
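The difference between exact and approximate accuracy can be made concrete with a short sketch. The text does not define "approximate", so the tolerance of one star used below is an assumption, as is the helper name `accuracy`; the numbers are toy values, not the project's results.

```python
def accuracy(predicted, actual, tolerance=0):
    """Share of predictions within `tolerance` stars of the real rating.

    tolerance=0 gives exact-match accuracy; tolerance=1 (an assumed
    definition of "approximate") also accepts off-by-one predictions.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(actual)


predicted = [5, 4, 1, 3]
actual = [5, 5, 3, 3]
print(accuracy(predicted, actual))               # exact matches only
print(accuracy(predicted, actual, tolerance=1))  # off-by-one accepted
```

Reporting both figures side by side makes it clear how much of the headline accuracy depends on the tolerance.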
Because the dataset is huge, it is necessary to work with a data sample. This may decrease accuracy for specific terms used in the review text, although the algorithm and functions are the same for any file size. The results give us a hint of the most likely next sentiment words in phrases for each star level and lead us to use the algorithm to build the model for star prediction from text alone.
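One way to draw such a sample without loading the whole file is reservoir sampling over a file with one JSON object per line. The sketch below is in Python, not the project's R code; the field names `review_id`, `stars`, and `text` and the function name `sample_reviews` are assumptions about the review file layout, and the sampling method itself is not stated in the text.

```python
import json
import random


def sample_reviews(lines, k, seed=0):
    """Uniformly sample up to k reviews from an iterable of JSON lines.

    Keeps only the three fields used in the project (names assumed).
    """
    rng = random.Random(seed)
    sample = []
    seen = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        row = {key: record[key] for key in ("review_id", "stars", "text")}
        if len(sample) < k:
            sample.append(row)
        else:
            # Replace an existing item with probability k / (seen + 1).
            j = rng.randint(0, seen)
            if j < k:
                sample[j] = row
        seen += 1
    return sample


toy_lines = [
    json.dumps({"review_id": f"r{i}", "stars": 1 + i % 5, "text": f"review {i}"})
    for i in range(10)
]
print(len(sample_reviews(toy_lines, k=3)))
```

In practice the `lines` argument could be an open file handle, so memory use stays proportional to the sample size rather than the file size.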
This topic remains interesting and challenging. The results apply not only to business but also to many other fields, such as healthcare, data mining, natural language processing (NLP), and neurobiology. The accuracy of our application for star prediction from text alone is not as high as we would like, since customers may not use words that match the star rating they give.