This is the point in the capstone project where I need to identify a question or a problem that I am interested in addressing with the Yelp Academic Dataset. This document serves as a milestone report on exploratory analysis of the data set.
Yelp is a business founded in 2004 to “help people find great local businesses like dentists, hair stylists and mechanics.” As of the second quarter of 2015, Yelp had a monthly average of 83 million unique visitors who visited Yelp via their mobile device, and its users had written more than 83 million reviews.
The dataset provided for the Capstone Project is part of the Yelp Dataset Challenge and the specific dataset used in this capstone corresponds to Round 6 of their challenge (the documentation mentions Round 5, but the datasets for Rounds 5 and 6 are identical). The dataset is approximately 575MB so you will need access to a good Internet connection to download it.
The data can be downloaded here: Yelp Dataset Challenge Round 6 Data [575 MB]
The dataset is stored in five JSON files, each containing a single object type with one JSON object per line. The respective data files provide information about:
The dataset consists of:
After performing exploratory analysis on the data, I tried to imagine myself as one of the customers of the listed services and businesses. As a customer, I would be very interested in a tool that gives me a quick sense and summary of the quality of service to expect from a business. Similar to a business directory, such a tool would help someone looking for a quick inferred overview of a business without scrutinizing the details.
The tool should allow selection and overview of the business attributes (e.g. name, location, stars, reviews, tips) while performing sentiment analysis of the business reviews. By linking the business, review and tip data via “business_id”, various features (e.g. star rating, review count, sentiment of the user reviews, quality of the tips submitted, etc.) can be correlated and analysed to form an inferred picture of the business's quality of service as perceived by other customers.
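As a rough illustration of the linking step, the sketch below joins business, review and tip records on `business_id` and attaches a crude sentiment score to each business. It assumes the records are already loaded as Python dictionaries with the keys `business_id`, `name`, `stars` and `text`; the word lists and the helper names `sentiment_score` and `summarize` are toy placeholders invented for this example, not the method the final tool will necessarily use.

```python
from collections import defaultdict

# Toy word lists, purely illustrative (an assumption for this sketch).
# A real analysis would use a proper sentiment lexicon or a trained model.
POSITIVE = {"great", "good", "friendly", "delicious", "excellent"}
NEGATIVE = {"bad", "slow", "rude", "terrible", "dirty"}


def sentiment_score(text):
    """Return (positive words - negative words) / total words, or 0.0 if empty."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def summarize(businesses, reviews, tips):
    """Link the three record lists on business_id, one summary per business."""
    review_scores = defaultdict(list)
    for review in reviews:
        review_scores[review["business_id"]].append(sentiment_score(review["text"]))

    tip_counts = defaultdict(int)
    for tip in tips:
        tip_counts[tip["business_id"]] += 1

    summary = {}
    for business in businesses:
        bid = business["business_id"]
        scores = review_scores.get(bid, [])
        summary[bid] = {
            "name": business["name"],
            "stars": business["stars"],
            "n_reviews": len(scores),
            # None marks a business with no linked reviews.
            "mean_sentiment": sum(scores) / len(scores) if scores else None,
            "tip_count": tip_counts.get(bid, 0),
        }
    return summary
```

With summaries of this shape, the star rating and the mean review sentiment of each business can be placed next to each other, which is the kind of correlation described above.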
One advanced feature would be to allow a customer to compare two related businesses (in a similar category) side by side using the above approach.
Since there are 61K businesses in over 783 categories and 1.6M reviews, loading and processing all of the data would be time-consuming and resource-intensive. For the purposes of demonstrating the capstone project, the scaled-down approach will be to focus on a particular category of business, e.g. Restaurants.
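One possible way to do this scaling down is to stream each file line by line, so that the full dataset never has to sit in memory, and keep only the records of interest. The sketch below is a minimal version of that idea. It assumes each business record has a `categories` key holding a list of category names and that review and tip records carry a `business_id` key; the function names are hypothetical helpers written for this example.

```python
import json


def iter_json_lines(path):
    """Yield one dict per non-empty line of a one-JSON-object-per-line file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def businesses_in_category(path, category="Restaurants"):
    """Return the business records whose `categories` list contains `category`."""
    return [
        record
        for record in iter_json_lines(path)
        # Treat a missing or null `categories` value as an empty list.
        if category in (record.get("categories") or [])
    ]


def records_for_businesses(path, business_ids):
    """Yield records (e.g. reviews or tips) whose business_id is in `business_ids`."""
    for record in iter_json_lines(path):
        if record.get("business_id") in business_ids:
            yield record
```

In use, the restaurant records would be selected first, their `business_id` values collected into a set, and that set passed to `records_for_businesses` for the review and tip files, so that only restaurant-related rows are kept for the analysis.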
Any feedback to help me improve the idea will be welcomed and appreciated. Thank you.