Data Science Specialization: Capstone Project

Data, Yelp Data Set
Format, JSON files
Files, {business, checkin, review, tip, user}

Introduction

Five datasets were imported into R and processed to derive useful information.

One primary question was asked of these data.

In this assignment we work with the business rating reported in the business dataset as the variable “stars”. Specifically, we try to build a model that predicts business ratings from the other datasets provided, such as “reviews”, “tips” and “users”. We use numerical and categorical data and also engineer new features from the text data to train the model. The model is trained and evaluated on the subset of businesses in Edinburgh, which scales the data down to a coherent dataset whose reviews share a common geographical location.

Methods and Data (1/2)

A number of data transformations were applied to the datasets to derive a final dataset with useful features.

The original data were filtered to businesses in Edinburgh.
Time data were parsed from string format.
Text data were merged per business of reference.
Features were extracted from text.
All data for Edinburgh were merged into a tidy data set.
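
The steps above can be sketched roughly as follows. This is a minimal Python illustration (the project itself used R) on a few made-up records. The field names `business_id`, `city`, `stars`, `date` and `text`, the helper `build_tidy_rows`, and the three text features are assumptions for illustration, not necessarily the features used in the project.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical toy records standing in for the business and review JSON files.
businesses = [
    {"business_id": "b1", "city": "Edinburgh", "stars": 4.0},
    {"business_id": "b2", "city": "Phoenix", "stars": 3.5},
    {"business_id": "b3", "city": "Edinburgh", "stars": 2.5},
]
reviews = [
    {"business_id": "b1", "date": "2014-03-01", "text": "Great food, great staff!"},
    {"business_id": "b1", "date": "2015-01-15", "text": "Good value."},
    {"business_id": "b2", "date": "2013-07-04", "text": "Too hot."},
    {"business_id": "b3", "date": "2012-11-30", "text": "Slow service. Cold food!"},
]

def build_tidy_rows(businesses, reviews, city="Edinburgh"):
    """Return one row per business in `city`, with simple engineered features."""
    # Step 1: filter the businesses to the chosen city.
    selected = {b["business_id"]: b for b in businesses if b["city"] == city}

    # Steps 2-3: parse review dates from strings and merge review text per business.
    texts = defaultdict(list)
    dates = defaultdict(list)
    for r in reviews:
        bid = r["business_id"]
        if bid in selected:
            texts[bid].append(r["text"])
            dates[bid].append(datetime.strptime(r["date"], "%Y-%m-%d"))

    # Steps 4-5: extract features from the merged text and join them to the rating.
    rows = []
    for bid, b in selected.items():
        merged = " ".join(texts[bid])
        rows.append({
            "business_id": bid,
            "stars": b["stars"],
            "n_reviews": len(texts[bid]),
            "n_words": len(merged.split()),
            "n_exclamations": merged.count("!"),
            "first_review_year": min(dates[bid]).year if dates[bid] else None,
        })
    return rows

tidy = build_tidy_rows(businesses, reviews)
for row in tidy:
    print(row)
```

Each resulting row holds the target (`stars`) next to the engineered features, which is the shape a tidy modelling dataset needs.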

Methods and Data (2/2)

[Figure: plot (image not included)]

From this plot we see that we have no strong predictors for classifying our data, so we do not expect very high prediction accuracy. This is expected, as we are dealing with user input and arbitrary feature extraction.

Results & Discussion

[1] "Overall accuracy = 0.614925"

[1] "Kappa = 0.200285"
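
For reference, overall accuracy is the share of correct predictions, while Cohen's kappa compares that observed agreement p_o with the agreement p_e expected by chance from the class frequencies: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal Python sketch of both metrics (the project itself used R). The function name `accuracy_and_kappa` and the toy labels are hypothetical and unrelated to the reported results.

```python
from collections import Counter

def accuracy_and_kappa(actual, predicted):
    """Return (overall accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(actual)
    # Observed agreement: fraction of cases where the prediction equals the truth.
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n
    # Chance agreement: sum over classes of the product of the marginal frequencies.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    p_e = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    # Note: undefined (division by zero) when p_e == 1, i.e. a single class throughout.
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa

# Toy example with made-up labels (not the project's data).
actual    = ["high", "high", "high", "low", "low", "mid", "mid", "mid", "mid", "low"]
predicted = ["high", "high", "mid",  "low", "mid", "mid", "mid", "mid", "high", "low"]
acc, kappa = accuracy_and_kappa(actual, predicted)
print(f"Overall accuracy = {acc:.3f}")
print(f"Kappa = {kappa:.3f}")
```

Read this way, a kappa of about 0.2 means the model agrees with the true ratings only modestly more often than chance would, which is consistent with the weak predictors noted earlier.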

Different predictive algorithms, including ensemble methods, gave at best an accuracy of around 65%. Generating more features, including through more advanced techniques for extracting features from text, could further improve accuracy and strengthen model stability.