The dataset provided for the Capstone Project is part of the Yelp Dataset Challenge. The specific dataset used in this capstone corresponds to Round 6 of the challenge (the documentation mentions Round 5, but the datasets for Rounds 5 and 6 are identical).
Yelp is a business founded in 2004 to “help people find great local businesses like dentists, hair stylists and mechanics.” As of the second quarter of 2015, Yelp had a monthly average of 83 million unique visitors who visited Yelp via their mobile device, and users had written more than 83 million reviews.
The data can be downloaded from the site: Yelp Dataset Challenge Round 6 Data [575 MB]
The dataset is stored in 5 files in JSON format, where each file holds a single object type, with one JSON object per line. The respective data files provide information about:
- businesses and their attributes
- checkin times of customers (users) at the businesses
- reviews submitted by customers (users) about the businesses
- tips on the businesses
- users of the businesses

The dataset consists of:

- More than 1.5M reviews and 500K tips by 366K users for 61K businesses
- 481K business attributes, e.g., hours, parking availability, ambience
- Social network of 366K users for a total of 2.9M social edges
- Aggregated check-ins over time for each of the 61K businesses
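Since each file stores one JSON object per line, it can be read record by record rather than parsed as a single document. The following is a minimal Python sketch of that idea, not the code used for this report. The helper name `read_json_lines` and the two sample records are made up for illustration; only the `business_id` key is taken from the dataset description.

```python
import io
import json


def read_json_lines(handle):
    """Yield one dict per non-empty line of a one-object-per-line JSON stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for the contents of a business file.
sample = io.StringIO(
    '{"business_id": "b1", "name": "Example Cafe", "stars": 4.5}\n'
    '{"business_id": "b2", "name": "Example Garage", "stars": 3.0}\n'
)

businesses = list(read_json_lines(sample))
print(len(businesses), businesses[0]["name"])
```

With a real file, the same function can be given an open file handle in place of the in-memory sample, which avoids loading a large file such as the reviews into memory all at once.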
The goal of the milestone report is to identify a question or problem that I am interested in addressing with the Yelp Academic Dataset.
As a customer, one would like to know the quality of a business's service upfront, without visiting or calling the business. This is the problem I am interested in solving. The question is: “What is the expected overall quality of the service?”
The aim is to provide a data product that allows selection and overview of business attributes (e.g. name, location, stars, reviews, tips) while performing sentiment analysis of business reviews. By linking business, review and tip data via “business_id”, various features (e.g. star rating, review count, sentiment of the user reviews, quality of the tips submitted) can be correlated and analysed to form an inferred picture of the business’ quality of service as perceived by other customers.
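The linking step described above can be sketched as a group-by on “business_id”. The Python example below is only an illustration under stated assumptions: the record keys `text` and `stars`, the toy word lists, and the function names `sentiment_score` and `summarise` are hypothetical, and the word-count score stands in for whatever sentiment method is eventually chosen.

```python
from collections import defaultdict

# Toy word lists, purely illustrative; a real analysis would use a proper
# sentiment lexicon or a trained model.
POSITIVE = {"great", "friendly", "good", "excellent"}
NEGATIVE = {"slow", "rude", "bad", "dirty"}


def sentiment_score(text):
    """Return (positive words - negative words) / total words, 0.0 if empty."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def summarise(businesses, reviews, tips):
    """Group reviews and tips by business_id and build per-business summaries."""
    texts = defaultdict(list)
    stars = defaultdict(list)
    tip_counts = defaultdict(int)
    for r in reviews:
        texts[r["business_id"]].append(r["text"])
        stars[r["business_id"]].append(r["stars"])
    for t in tips:
        tip_counts[t["business_id"]] += 1

    summary = {}
    for b in businesses:
        bid = b["business_id"]
        scores = [sentiment_score(t) for t in texts[bid]]
        summary[bid] = {
            "name": b["name"],
            "review_count": len(scores),
            "mean_review_stars": sum(stars[bid]) / len(stars[bid]) if stars[bid] else None,
            "mean_sentiment": sum(scores) / len(scores) if scores else None,
            "tip_count": tip_counts[bid],
        }
    return summary


# Made-up records for illustration only.
businesses = [
    {"business_id": "b1", "name": "Example Cafe"},
    {"business_id": "b2", "name": "Example Garage"},
]
reviews = [
    {"business_id": "b1", "stars": 5, "text": "Great coffee, friendly staff!"},
    {"business_id": "b1", "stars": 2, "text": "Slow service."},
    {"business_id": "b2", "stars": 4, "text": "Good work."},
]
tips = [{"business_id": "b1", "text": "Try the espresso."}]

summary = summarise(businesses, reviews, tips)
print(summary["b1"])
```

Keeping the star ratings and the text-based score side by side in one summary per business makes it straightforward to check later how well the two agree.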
The product should also have a feature to compare two businesses in the same category.
Based on the initial exploratory data analysis currently under way, it may be worth building the model for one or two business categories first, given the large amount of data, and then expanding it to other categories.
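Restricting the work to one or two categories amounts to filtering the business records before any reviews are linked. A minimal Python sketch follows. It assumes each business record carries a `categories` list of strings; the helper name `in_categories`, the category names and the sample records are hypothetical.

```python
def in_categories(business, wanted):
    """Return True if the business lists at least one of the wanted categories."""
    listed = business.get("categories") or []  # tolerate a missing or null field
    return bool(set(listed) & set(wanted))


# Made-up records for illustration only.
sample_businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Cafes"]},
    {"business_id": "b2", "categories": ["Automotive"]},
    {"business_id": "b3", "categories": ["Restaurants"]},
    {"business_id": "b4", "categories": None},
]

selected = [b for b in sample_businesses if in_categories(b, {"Restaurants"})]
selected_ids = {b["business_id"] for b in selected}
print(sorted(selected_ids))
```

The resulting set of identifiers can then be used to keep only the reviews and tips whose “business_id” belongs to the chosen categories, which reduces the volume of text to analyse.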
The goal is to explore the data to identify the right features and then carry out the sentiment analysis to answer the question.