Yelp Project: Predicting Stars from Text

The purpose of this project is to answer the following question:

Initial exploration shows clear differences in most frequent word stems between 1-star and 5-star ratings (1-star on left, 5-star on right):

1 and 5 Stars

Treat as a Classification Problem
- Response variable = number of stars
- Text reviews stemmed and converted into document-term matrix with TF/IDF weighting
Model Stack for Classification
- Component classifiers used were gradient boosting, quadratic discriminant analysis, and random forest
- Stacking classifer used was random forest
Used Random Sample of 25,000 Reviews
- Balanced to include 5000 of each star category
- 60% used for training, 40% for testing

Prediction performance at a glance using test data:

Box Plot

For each star level, majority of predictions were correct
Deviations between actual and predicted were within +/- 1 star in majority of cases
Lowest error rates in predicting 1-star and 5-star reviews
Conclusion: Stars can be predicted with reasonable amounts of accuracy from text