1. Title: Data Science Capstone - Yelp Final Project

Yelp is a company that connects people with businesses through the reviews and recommendations that users write about those businesses, which other users can read before deciding whether to visit or purchase. In my personal experience, I decide to buy from or visit a business based on the number of reviews and the average stars granted by previous users. I only read the reviews when those first two criteria are met (a sufficient number of reviews and more than 3.5 stars). Of course there are other criteria (such as how close the business is to where I am, or its type of specialization), but these are decided by each user. I created this project to determine how strong the relationship is between previous reviews and subsequent reviews: is my experience of cause and effect supported by the data provided by Yelp?

2. Introduction

My primary question is: is it possible to predict the average and number of stars, tips and checkins a business will obtain by taking into account the values from the previous year or years? This is interesting for businesses because a high correlation of evaluations between years would imply consistency in the aggregated user activity over time. The rationale is that if a particular business experiences big increases or decreases, Yelp can help by highlighting problems to address or opportunities to explore. The warnings or tips that highlight these problems and opportunities will come from the text of the reviews and the tips.

3. Methods and data

Describe how you used the data and the type of analytic methods that you used; it’s okay to be a bit technical here but clarity is important

3.1 Exploratory data analysis (plots, summary tables) presented that interrogates the question of interest?

Explore the relationships between different features in each data file. Try linking data files together and explore the relationships between features across data files. Identify interesting outcomes that you may want to predict as part of a prediction question / problem. Characterize any missing data that may be present in each of the files. Many features incorporate free-text data that may need to be parsed, summarized, or quantified in some way. What is the best way to handle these data?

Yelp provided a set of 5 data files to support this analysis. The data was provided in JSON format. After converting and flattening the data, I found the following files and record counts: 1,569,264 reviews; 366,715 users; 61,184 businesses; 45,116 checkins; and 495,107 tips. After exploring the original data, I decided to keep only the fields relevant to proving or disproving my hypothesis. Furthermore, I had to link the data files together using their common fields. The following graphic shows the 5 tables, the fields that I selected (the number on the left is the positional order of the field in the table) and the common fields used to link them (joined by a blue line in the graphic): [Figure: the 5 tables, their selected fields and the common fields linking them]
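The conversion step can be illustrated with a minimal sketch. It assumes each file holds one JSON object per line and that nested objects are flattened into dotted column names such as votes.useful; the sample record and the `flatten` helper are hypothetical, not the code used in this project.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that {"votes": {"useful": 2}} becomes "votes.useful".
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical review record (assumed layout: one JSON object per line).
line = (
    '{"business_id": "b1", "stars": 4, "date": "2014-03-02", '
    '"votes": {"useful": 2, "funny": 0}}'
)
row = flatten(json.loads(line))
print(sorted(row))
```

After flattening, every record is a flat mapping from column name to value, which is the shape the later grouping and joining steps work on.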

For checkins, before joining the table to the rest, I collapsed the many values into a single total per business by adding up all the columns that record checkins at different times (columns 3 to 171).
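As a rough illustration of this collapse, the sketch below sums every time-slot column of one flattened checkin row. The column names and the `total_checkins` helper are assumptions made for the example; missing slots are treated as zero.

```python
def total_checkins(row, non_count_fields=("business_id", "type")):
    """Add up all time-slot columns of one flattened checkin row."""
    return sum(
        value
        for field, value in row.items()
        # Skip identifier columns and empty (missing) time slots.
        if field not in non_count_fields and value is not None
    )


# Hypothetical flattened row; real rows would carry many more slot columns.
checkin_row = {
    "business_id": "b1",
    "type": "checkin",
    "checkin_info.9-0": 3,
    "checkin_info.17-5": 2,
    "checkin_info.20-6": None,
}
total = total_checkins(checkin_row)
print(checkin_row["business_id"], total)
```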

Because my analysis tries to determine the impact of time on the number and average of stars, I had to transform the data. Given that the analysis is mostly an aggregation of individual reviews, I grouped the reviews by year and by business and computed the number, average, maximum and minimum of stars. I joined the resulting table with the number of tips and then with a grouping of the number of tips by business and year.
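The grouping described above might look like the following sketch. It assumes each review carries a business_id, a date in YYYY-MM-DD form and a stars value; the sample data and the function name are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_business_year(reviews):
    """Return {(business_id, year): {count, avg, max, min}} of stars."""
    groups = defaultdict(list)
    for review in reviews:
        year = int(review["date"][:4])  # assumes an ISO-style date string
        groups[(review["business_id"], year)].append(review["stars"])
    return {
        key: {
            "count": len(stars),
            "avg": mean(stars),
            "max": max(stars),
            "min": min(stars),
        }
        for key, stars in groups.items()
    }


# Hypothetical reviews for two businesses.
reviews = [
    {"business_id": "b1", "date": "2013-05-01", "stars": 4},
    {"business_id": "b1", "date": "2013-09-12", "stars": 3},
    {"business_id": "b1", "date": "2014-01-20", "stars": 5},
    {"business_id": "b2", "date": "2014-07-04", "stars": 2},
]
summary = summarize_by_business_year(reviews)
print(summary[("b1", 2013)])
```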

I merged the data using an inner join. This removed the records where a business has reviews for one year but not for the next, or vice versa. The rationale is that businesses without reviews in consecutive years do not qualify for this study and would introduce noise into the algorithm.
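The effect of the inner join can be sketched as follows: only businesses present in both yearly tables survive. The per-year dictionaries and the suffix naming are assumptions made for this example.

```python
def inner_join(left, right, left_suffix, right_suffix):
    """Keep only keys present in both tables and merge their columns."""
    joined = {}
    for key in left.keys() & right.keys():
        row = {f"{name}_{left_suffix}": v for name, v in left[key].items()}
        row.update({f"{name}_{right_suffix}": v for name, v in right[key].items()})
        joined[key] = row
    return joined


# Hypothetical yearly summaries keyed by business id.
stats_2013 = {"b1": {"count": 10, "avg": 3.5}, "b2": {"count": 4, "avg": 4.0}}
stats_2014 = {"b1": {"count": 12, "avg": 3.8}, "b3": {"count": 7, "avg": 2.5}}

joined = inner_join(stats_2013, stats_2014, "2013", "2014")
print(joined)
```

Here b2 (no 2014 reviews) and b3 (no 2013 reviews) are dropped, which is the filtering behaviour the study relies on.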

The following graphic shows the final format required: [Figure: final format of the merged table]

Another question I want to answer is whether the model becomes stronger when only influential users or influential reviews are considered. For reviews, I consider influential only those with votes.useful >= 1, which means the review was taken into account by at least one other user when deciding to purchase or visit. Users are more influential if they have more fans and more useful votes, so I only consider the reviews of users with fans >= 5 and votes.useful > 1. The disadvantage of this second view is that it reduces the number of records to consider; however, the remaining records are more relevant in influencing other users to buy or visit, and perhaps also in influencing the average of future reviews.
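The two influence filters can be expressed as a short sketch using the thresholds stated above. The field names follow the flattened column naming (votes.useful, fans); the user_id link between reviews and users, the sample data and the function names are assumptions.

```python
def influential_reviews(reviews, min_useful=1):
    """Reviews that received at least `min_useful` useful votes."""
    return [r for r in reviews if r["votes.useful"] >= min_useful]


def reviews_by_influential_users(reviews, users, min_fans=5, useful_above=1):
    """Reviews written by users with fans >= min_fans and votes.useful > useful_above."""
    influential_ids = {
        u["user_id"]
        for u in users
        if u["fans"] >= min_fans and u["votes.useful"] > useful_above
    }
    return [r for r in reviews if r["user_id"] in influential_ids]


# Hypothetical users and reviews.
users = [
    {"user_id": "u1", "fans": 10, "votes.useful": 40},
    {"user_id": "u2", "fans": 2, "votes.useful": 50},
    {"user_id": "u3", "fans": 7, "votes.useful": 1},
]
reviews = [
    {"review_id": "r1", "user_id": "u1", "votes.useful": 0},
    {"review_id": "r2", "user_id": "u2", "votes.useful": 3},
    {"review_id": "r3", "user_id": "u3", "votes.useful": 1},
]

by_review = [r["review_id"] for r in influential_reviews(reviews)]
by_user = [r["review_id"] for r in reviews_by_influential_users(reviews, users)]
print(by_review, by_user)
```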

Once the data was ready I explored the relationships between the different relevant features. I divided this analysis into two parts:

3.1.1 Quantitative correlation (number of reviews, tips and checkins by years)

I created a scatter plot to show graphically the correlation between the variables that, under my hypothesis, determine the number of reviews, tips or checkins.

[Figure: scatter plots of the number of reviews, tips and checkins by year]

We can see linear relationships between 2013 and 2014, both within the same year and across years, for the number of reviews (count) and tips, and a much weaker one for checkins. For 2015 the relationship is not linear; this may be because the year is not yet complete and is therefore not comparable with the other 2 years.
Hence, I left 2015 out of the first part of the study.
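The strength of a linear relationship like the one seen in the scatter plot can be quantified with the Pearson correlation coefficient. The sketch below computes it by hand for hypothetical per-business review counts; the numbers are made up for illustration and are not results from the Yelp data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical review counts for the same four businesses in two years.
counts_2013 = [10, 20, 30, 40]
counts_2014 = [12, 25, 31, 44]

r = pearson(counts_2013, counts_2014)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship; a value near 0 indicates none.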

3.1.2 Qualitative correlation (average/maximum and minimum stars by years)

I produced a similar graphic.

[Figure: scatter plots of the average, maximum and minimum stars by year]

Again, we can see linear relationships between the average stars granted to the different businesses in 2013 and 2014. For 2015 the relationship is not clear.
Hence, I left 2015 out of the first part of the study.

For both analyses, I also left out the years before 2013 because they reflect old behaviours.

3.2 Was the (or multiple) statistical model, prediction algorithm or statistical inference described in the methods section?

Given these considerations, I will try to create 2 predictive models. The first will predict, for each business, the number of reviews (counts) in 2014 given the number of reviews in the previous year (2013). I will also add the tips and the checkins to increase the accuracy of the model. The second will predict, for each business, the average number of stars granted by users in 2014 given the average, maximum and minimum stars of the previous year (2013) for the same business.
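A simplified version of the first model can be sketched as an ordinary least squares fit with a single predictor (the 2013 review count). This is only an illustration under that simplifying assumption: the models described above also use tips, checkins and other variables, and the data below is hypothetical.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical per-business review counts.
reviews_2013 = [10, 20, 30, 40]
reviews_2014 = [12, 25, 31, 44]

slope, intercept = fit_line(reviews_2013, reviews_2014)

# Predict the 2014 count for a business that had 50 reviews in 2013.
prediction = slope * 50 + intercept
print(slope, intercept, prediction)
```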

Three additional considerations apply to both cases. First, I will later remove any unnecessary variables using principal component analysis. Second, I will consider whether the conclusions change or become stronger when only influential users or tips are considered. Third, the following algorithms will be used to generate the models: linear model, random forest and classification tree. I will keep the one with the best performance.
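Keeping the model with the best performance can be sketched as scoring each candidate with R squared on the same held-out data and taking the maximum. The two candidates below are simple stand-ins (hypothetical prediction functions), not the linear, random forest or tree models themselves, and the data is made up.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical held-out data: 2013 counts and the observed 2014 counts.
test_2013 = [10, 20, 30, 40]
test_2014 = [12, 25, 31, 44]

# Stand-in candidates; each maps a 2013 count to a predicted 2014 count.
candidates = {
    "same_as_last_year": lambda x: x,
    "fitted_line": lambda x: 1.02 * x + 2.5,
}

scores = {
    name: r_squared(test_2014, [predict(x) for x in test_2013])
    for name, predict in candidates.items()
}
best_name = max(scores, key=scores.get)
print(best_name, scores)
```

Any model that exposes a prediction function can be added to `candidates` and compared in the same way.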

3.3 Are all of the methods presented in the results section introduced in the methods section?

Inferential or causal questions

What is your outcome? Do you have a key predictor that you want to correlate with your outcome? What factors might confound or cloud any associations that you try to estimate or explore? Are there any confounding factors for which you do not have measurements in your dataset? How can you deal with this? How robust are your findings to small changes to the model?

Prediction questions: I split the data into training and test sets. The number of records remaining after conversion is adequate for doing this.
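A minimal sketch of such a split is shown below. The 70/30 proportion, the fixed seed and the function name are assumptions made for the example; the source does not state the proportions used.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of `rows` reproducibly and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: ten business rows represented by their index.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed means the same businesses land in the test set on every run, which keeps model comparisons reproducible.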

How good is the error rate of your prediction model? Can it be improved by changing algorithms? What types of features appear to be most important to prediction skill? Are all types of prediction errors equally important in your problem?

4. Results

Results - Describe what you found through your analysis of the data.

Is the primary statistical model, statistical inference or prediction output in the results summarized and interpreted or is raw output given without description or interpretation?

| Statistical Model | R squared |
| --- | --- |
| Linear Model | Content Cell |
| Random Forest | Content Cell |

Is there a description of how the results relate to the primary questions of interest, or is it otherwise clear? The results indicate that the number of stars is closely related to the number of stars given in previous years. Also, the average stars granted in a particular year are closely related to those granted in previous years.

Was the primary question of interest answered / refuted or was there a description of why no clear answer could be obtained?

The primary question of interest was answered by the model. So we can conclude that

5. Discussion

Discussion - Explain how you interpret the results of your analysis and what the implications are for your question/problem.

NOTE: Reproducibility of work: please see code in the following URL: http://rmarkdown.rstudio.com.