Data Science Capstone - Final Report for Yelp Dataset Challenge

Title (Abstract)

The dataset provided for the Capstone Project is part of the Yelp Dataset Challenge. The specific dataset used in this capstone corresponds to Round 6 of the challenge (the documentation mentions Round 5, but the datasets for Rounds 5 and 6 are identical). The dataset is approximately 575MB and can be downloaded from the following link: http://www.yelp.com/dataset_challenge

The dataset is stored in 5 JSON files, where each file holds a single object type, with one JSON object per line. The respective data files provide information about:

businesses and their attributes
check-in times of customers at a given business
reviews submitted by customers about the businesses
tips on the businesses
users of the businesses
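
A file in the one-JSON-object-per-line layout described above can be read into a data frame in a few lines. The sketch below is only an illustration, not the script used for this report: it assumes the jsonlite package is installed (the report does not name it), and the two sample records and their fields are made up.

```r
# Assumption: jsonlite is installed; it is not one of the packages named in this report.
library(jsonlite)

# Hypothetical two-record sample standing in for one of the Yelp files.
sample_lines <- c(
  '{"business_id": "b1", "name": "Cafe A", "stars": 4.5}',
  '{"business_id": "b2", "name": "Diner B", "stars": 3.0}'
)
path <- tempfile(fileext = ".json")
writeLines(sample_lines, path)

# stream_in() parses one JSON object per line and returns a data frame.
business <- stream_in(file(path), verbose = FALSE)

# Inspect the column types; nested fields in the real files would still
# need flattening, for example with jsonlite::flatten().
str(business)
```

Reading line by line avoids having to wrap the whole file in a single JSON array first, which matters for the larger files such as the reviews.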

The pipeline of what I have done is as follows.

Clean the metadata in the 5 JSON files mentioned above with homemade R scripts, and store the clean datasets in data frame format for later data access and management.
Identify interesting questions that can be answered with the Yelp datasets: the correlation between the quality of service and tips, or sentiment analysis of business reviews.
Define the quality of service as the “stars” attribute in the business dataset.
Refine the business, tip and review datasets to records with a common business_id for data comparison and analysis.
Conduct exploratory data analysis on the star score from the refined business dataset and the average star score from the refined review dataset.

Introduction

After examining the 5 cleaned JSON files, and with the intention of helping both customers and business owners, I found that the most answerable and valuable question concerns the quality of service. This question can benefit both potential customers and business owners for at least the following reasons. (1) Potential customers can get a clear picture of the quality of service before their visit, which saves them time and money. Additional friendly tips, if available, would help even more. (2) For business owners, the analysis can serve as feedback on how to improve their service.

Thus, I narrowed down the problem I would like to solve with the following words in my Milestone report.

“As a customer, I would like to know the quality of the service as well as succinct tips before visiting a store; moreover, I would love Yelp to offer a comparison of similar stores in the same category as a recommendation. Thus, the essential question to ask is: ‘What is the quality of the service, and how is it related to tips and sentiment analysis of business reviews?’”

Methods and Data

I approach this problem in 2 steps. First, I try to find the correlation between the stars attribute in the business dataset and other attributes such as review wording and tips. If I can find such a correlation, I can move on to the 2nd step: comparison of similar businesses. In that part, I can use the model built in the 1st step to suggest stores with similar service quality, together with succinct tips for customers.

For the 1st step, I narrow down the datasets to the business, review, and tip datasets.
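
As a rough illustration of what the 1st step could look like, the sketch below measures the correlation between the business-level stars and a second per-business score. It is a sketch under stated assumptions, not the analysis behind this report: the data frame, its values, and the review_stars column (standing in for any review-derived or tip-derived measure) are hypothetical.

```r
# Hypothetical toy data: business-level stars and a review-derived score
# for the same businesses. Real values would come from the Yelp datasets.
stars_compare <- data.frame(
  business_id  = c("b1", "b2", "b3", "b4", "b5"),
  stars        = c(4.5, 3.0, 2.5, 4.0, 5.0),
  review_stars = c(4.4, 3.2, 2.1, 3.9, 4.8)
)

# Pearson correlation between the two star measures.
star_cor <- cor(stars_compare$stars, stars_compare$review_stars)

# cor.test() adds a confidence interval and p-value for the same estimate.
star_test <- cor.test(stars_compare$stars, stars_compare$review_stars)

print(star_cor)
print(star_test$conf.int)
```

Pearson correlation is used here only because it is the default of cor(); since star ratings are ordinal, passing method = "spearman" may be the more defensible choice.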

library(utils)
library(readr)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.2

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# load cleaned business, review, and tip datasets
businessmetadata <- readRDS("businessmetadata.rds")
reviewmetadata <- readRDS("reviewmetadata.rds")
tipmetadata <- readRDS("tipmetadata.rds")

To have comparable data, I further restrict the business, review and tip datasets to records that share a common business_id. The useful information in these three datasets is business_id and stars. To simplify plotting the star scores, I calculate the average star score in the review data and save the result in "reviewdatabusirefinestarsave.rds".
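
The refinement described above can be sketched with dplyr as follows. This is a minimal illustration, not the script used for this report: the three toy data frames, the avg_stars column, the per-business grouping, and the output file name are assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy stand-ins for the cleaned data frames.
business <- data.frame(business_id = c("b1", "b2", "b3"),
                       stars = c(4.5, 3.0, 2.5),
                       stringsAsFactors = FALSE)
review <- data.frame(business_id = c("b1", "b1", "b2", "b4"),
                     stars = c(5, 4, 3, 1),
                     stringsAsFactors = FALSE)
tip <- data.frame(business_id = c("b1", "b2", "b2", "b3"),
                  text = c("Try the pie", "Cash only", "Go early", "Skip it"),
                  stringsAsFactors = FALSE)

# business_id values present in all three datasets.
common_ids <- Reduce(base::intersect,
                     list(business$business_id,
                          review$business_id,
                          tip$business_id))

# Keep only the rows whose business_id appears everywhere.
business_refined <- filter(business, business_id %in% common_ids)
tip_refined <- filter(tip, business_id %in% common_ids)

# Average review stars per business, restricted to the common ids.
review_avg <- review %>%
  filter(business_id %in% common_ids) %>%
  group_by(business_id) %>%
  summarise(avg_stars = mean(stars))

# Save for later plotting; this file name is a placeholder.
out_path <- file.path(tempdir(), "review_avg_stars.rds")
saveRDS(review_avg, out_path)
```

Filtering on the intersection of ids before averaging keeps the business and review star scores aligned, so each business contributes exactly one point to any later comparison plot.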

Results

Describe what you found through your analysis of the data.

Discussion

Explain how you interpret the results of your analysis and what the implications are for your question/problem.