Pier Luigi Olearo
Monday, November 16, 2015
According to the instructions, the dataset provided is part of the Yelp Dataset Challenge; the specific dataset used in this capstone corresponds to Round 6 of their challenge.
We analyzed the following hypotheses:
1. The length of a review is linked to the best/worst ratings
2. The use of the “useful” button is directly related to 5-star ratings
3. Users who give 5-star ratings tend to rate everything with 5 stars
Tokenization routines do not scale well to such large volumes of text, so we decided to perform statistical tests on numerical frequencies instead.
The data are provided in the Yelp dataset in JSON (JavaScript Object Notation) format.
We have used the following files:
[1] "yelp_academic_dataset_business.json"
[1] "yelp_academic_dataset_review.json"
[1] "yelp_academic_dataset_user.json"
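As an illustration of how such files can be read, here is a minimal sketch in Python (standard library only; the original analysis used R). It assumes that each file stores one JSON object per line, and the field names `stars`, `text`, and `votes.useful` are assumptions about the review records. The two sample records are made up.

```python
import io
import json


def read_json_lines(handle):
    """Yield one dict per non-empty line of a line-delimited JSON stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for yelp_academic_dataset_review.json (made-up records,
# assumed field names).
sample = io.StringIO(
    '{"stars": 5, "text": "Great food and friendly staff", "votes": {"useful": 2}}\n'
    '{"stars": 2, "text": "Slow service", "votes": {"useful": 0}}\n'
)

reviews = list(read_json_lines(sample))
print(len(reviews), reviews[0]["stars"])
```

With a real file, the same function would be called on `open("yelp_academic_dataset_review.json", encoding="utf-8")`; reading line by line avoids holding the raw text of the whole file in memory at once.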
The following statistical tools were used to analyze the data:
1. shapiro.test for normality
2. wilcox.test for the two-sample Wilcoxon rank-sum test (Mann-Whitney test)
3. binom.test for an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment
4. cor.test for association between paired samples using Spearman's rho
5. ks.test for two-sample Kolmogorov-Smirnov test
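To make one of these tools concrete, the sketch below computes Spearman's rho as the Pearson correlation of the ranks, with tied values sharing their average rank. It is written in Python with the standard library only and is not the R `cor.test` implementation; it returns the coefficient without a p-value. The helper names and the paired values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical paired samples (for example, review length and useful votes).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
rho = spearman_rho(x, y)
print(round(rho, 3))
```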
First hypothesis
The results seem to confirm the hypothesis that 5-star reviews differ significantly in length from the others. In particular, among reviews between 17 and 76 words long, the probability of finding a 5-star review is 10% higher than among reviews of all lengths.
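The comparison described above can be sketched as follows, in Python with the standard library only. The sketch assumes that review length is measured by splitting the text on whitespace, and the field names `stars` and `text` are assumptions; the six reviews are made up and do not reproduce the reported figures.

```python
def word_count(text):
    """Number of whitespace-separated words in a review."""
    return len(text.split())


def five_star_share(reviews, lo=None, hi=None):
    """Share of 5-star reviews among those with lo <= word count <= hi."""
    selected = [
        r for r in reviews
        if (lo is None or word_count(r["text"]) >= lo)
        and (hi is None or word_count(r["text"]) <= hi)
    ]
    if not selected:
        return float("nan")
    return sum(r["stars"] == 5 for r in selected) / len(selected)


# Made-up reviews; each text is simply a word repeated n times.
reviews = [
    {"stars": 5, "text": "word " * 20},
    {"stars": 5, "text": "word " * 40},
    {"stars": 3, "text": "word " * 50},
    {"stars": 1, "text": "word " * 200},
    {"stars": 5, "text": "word " * 5},
    {"stars": 2, "text": "word " * 300},
]

overall = five_star_share(reviews)          # all lengths
in_band = five_star_share(reviews, 17, 76)  # 17 to 76 words only
print(overall, in_band)
```

Comparing `in_band` with `overall` gives the difference in the probability of a 5-star review between the length band and the full set of reviews.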
Second hypothesis
The results are mixed and do not clearly confirm the proposed hypothesis. However, the high level of significance found in the Mann-Whitney test suggests that the relationship may be worth investigating further.
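For reference, a Mann-Whitney comparison of this kind can be sketched as below, in Python with the standard library only. This is not the R `wilcox.test` implementation: it uses the normal approximation with a tie correction and no continuity correction, so p-values can differ slightly from R's. The two samples of "useful" vote counts are hypothetical.

```python
import math
from collections import Counter


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for x, two-sided p-value) using the normal approximation."""
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction: each group of t tied values reduces the variance.
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0  # all values identical, no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


# Hypothetical "useful" vote counts for 5-star reviews and for all other reviews.
useful_five_star = [3, 4, 2, 6, 5]
useful_other = [0, 1, 0, 2, 1]
u, p = mann_whitney_u(useful_five_star, useful_other)
print(u, round(p, 4))
```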
Third hypothesis
The binomial tests confirm, with a high level of significance, that a user who gives a review 5 stars has about an 85% chance of giving 5 stars to a later review as well.
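An exact binomial test of this kind can be sketched as follows, in Python with the standard library only. The report does not state how successes and trials were counted, so the counts below are hypothetical, chosen only so that the estimate equals 0.85. The one-sided alternative and the null probability of 0.5 are also assumptions (R's `binom.test` is two-sided by default).

```python
import math


def binom_tail_greater(k, n, p0=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p0): a one-sided p-value."""
    return sum(
        math.comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
        for i in range(k, n + 1)
    )


# Hypothetical counts: of 20 follow-up reviews by users who had given
# 5 stars before, 17 were again 5-star.
successes, trials = 17, 20
estimate = successes / trials
p_value = binom_tail_greater(successes, trials, p0=0.5)
print(estimate, p_value)
```

A small p-value here means that a repeat rate this high would be unlikely if a follow-up 5-star rating were no more probable than the null value `p0`.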