Data

The dataset for the Capstone Project comes from the Yelp Dataset Challenge. It corresponds to Round 6 of the challenge (the documentation mentions Round 5, but the datasets for Rounds 5 and 6 are identical). The download is approximately 575 MB, so you will need a good Internet connection.

The data can be downloaded from the following site: http://www.yelp.com/dataset_challenge

The dataset is stored in five JSON files. Each file holds a single object type, with one JSON object per line. The files provide information about:

- businesses and their attributes
- checkin times of customers at given businesses
- store reviews submitted by customers about the businesses
- tips on the businesses
- users of the businesses
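
Because each file stores one JSON object per line, it can be read line by line rather than parsed as a single document. The following is a minimal Python sketch under that assumption. The sample records and their field names (`business_id`, `stars`) are illustrative placeholders, not taken from the dataset documentation.

```python
import io
import json


def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Hypothetical sample standing in for one of the five data files.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 4.5}\n'
    '{"business_id": "b2", "stars": 3.0}\n'
)

records = list(read_json_lines(sample))
print(len(records), records[0]["stars"])
```

For a real file, an open file handle can be passed in place of the in-memory sample. Reading lazily in this way avoids holding a whole file in memory at once.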

Question/Problem to ask

As a customer, I would like to know the quality of the service, as well as succinct tips, before visiting a store. Moreover, I would love for Yelp to offer a comparison of similar stores in the same category as a recommendation. Thus, the essential question to ask is: “What is the quality of the service, and how is it related to tips and to sentiment analysis of business reviews?”

Approach

I would approach this problem in two steps. First, I would try to find the correlation between the stars attribute of a business and other attributes, such as review wording and tips. If I can find such a correlation, I can move on to the second step: comparing similar businesses. In this part, I can use the model built in the first step to suggest stores with similar service quality, along with succinct tips for customers.
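
The two steps above can be sketched as follows. This is a minimal Python illustration, not the model itself: the word lists, the sample records, and the field names (`business_id`, `category`, `stars`, `text`) are all hypothetical. A tiny lexicon-based score and a Pearson correlation stand in for whatever sentiment model and association measure are eventually used.

```python
import math

# Hypothetical sentiment lexicon; a real analysis would use a fuller one.
POSITIVE = {"great", "friendly", "delicious", "good"}
NEGATIVE = {"slow", "rude", "bad", "cold"}


def sentiment(text):
    """Score = number of positive words minus number of negative words."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def similar(businesses, target, max_gap=0.5):
    """Other businesses in the target's category with stars within max_gap."""
    return [
        b for b in businesses
        if b["business_id"] != target["business_id"]
        and b["category"] == target["category"]
        and abs(b["stars"] - target["stars"]) <= max_gap
    ]


# Made-up sample: one review text per business, for illustration only.
businesses = [
    {"business_id": "b1", "category": "cafe", "stars": 4.5, "text": "great coffee friendly staff"},
    {"business_id": "b2", "category": "cafe", "stars": 4.0, "text": "good pastries"},
    {"business_id": "b3", "category": "cafe", "stars": 2.0, "text": "slow service rude staff"},
    {"business_id": "b4", "category": "bar", "stars": 4.5, "text": "great drinks"},
]

# Step 1: correlate star ratings with the sentiment score of review text.
stars = [b["stars"] for b in businesses]
scores = [sentiment(b["text"]) for b in businesses]
r = pearson(stars, scores)

# Step 2: suggest businesses comparable to b1.
matches = [b["business_id"] for b in similar(businesses, businesses[0])]
print(round(r, 3), matches)
```

A strong positive `r` on real data would support using review sentiment as a proxy for service quality in the second step. The `max_gap` threshold is an arbitrary choice here and would need tuning.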
