Exploring Yelp's Business Reviews Dataset

P. Doynov
Sun Nov 22 12:03:04 2015

Data Sciences Capstone Project

Introduction

In daily life we rely on previous experience and tend to prefer and trust businesses with a high number of positive reviews. The “five-star system” is popular as a quick rating from poor to excellent, and the popularity Yelp has gained confirms this. In this project, an exploratory effort is made to review the types, frequencies, and star-rating distributions for different types of businesses in the Yelp academic dataset from January 22, 2015. Possible factors that correlate with a high or low number of reviews for certain businesses are explored. This leads to the following related questions:

  • Why do some businesses generate significantly more reviews than others?
  • Is it possible to model and predict the success of a business based on its correlation with businesses that have high review ratings?

Data Profile

Project Screenshot

Data Exploration by Categories

Project Screenshot

Modeling and Prediction

  • In the binomial model all factors appeared significant; nevertheless, its prediction performance was very poor.
  • Additional exploration with multiple selections and combinations of categories was not successful.
  • Finally, a Classification and Regression Tree (CART) model, built with the “rpart” library for R, produced a more promising confusion matrix:

Project Screenshot
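For readers unfamiliar with the confusion matrix used to compare the models, the following is a minimal sketch of how one is tabulated and summarized. It is written in plain Python for illustration only; the project itself was done in R, and the labels, the toy data, and the helper names `confusion_matrix` and `accuracy` are assumptions, not taken from the project.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual classes, columns are predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of observations on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical toy data: whether a business has a "high" or "low" review count.
labels = ["high", "low"]
actual = ["high", "high", "low", "low", "low", "high"]
predicted = ["high", "low", "low", "low", "high", "high"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", round(accuracy(matrix), 3))
```

A model can have statistically significant factors and still place many observations off the diagonal of this table, which is one way to read the contrast between the binomial and CART results.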

CART Modeling and Prediction

Project Screenshot