The purpose of this project is to find out whether positive and negative reviews differ strongly in sentiment, and whether machine learning can detect the topic of a review.
We found a significant difference in sentiment between negative and positive reviews, and the machine distinguished 4 of the 5 topics correctly. Not bad!
The data come from two sources.
Kaggle: https://www.kaggle.com/snap/amazon-fine-food-reviews This dataset contains around 500,000 Amazon fine food reviews from Oct 1999 to Oct 2012.
UCSD: http://jmcauley.ucsd.edu/data/amazon/ This dataset contains Amazon product reviews and metadata from May 1996 to July 2014.
The food review overview shows the distribution of review scores and helpfulness scores. Most food reviews are positive, but their helpfulness scores are generally low.
The 20 longest reviews and their summaries. It is very surprising that so many of them have a length over 1000.
The 20 shortest reviews and their summaries from the Amazon food reviews. The number 61 comes from the review limitation.
Review length distribution by score
The higher the score, the longer the review could be.
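Does length relate to score in a statistical sense? We fit a linear model of score on review length and found a significant association between the two (p-value < 2.2e-16)! Note that in the output below the slope is negative (about -0.0012 per word) and R-squared is about 0.006, so on average longer reviews score slightly lower, and length explains less than 1% of the variance in score.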
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
full_word_count<-read_csv("derived_data/full_word_count.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## id = col_double(),
## summary = col_character(),
## score = col_double(),
## num_words = col_double()
## )
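# Regress score on review length. Despite the object name, lm() fits a linear regression here.
anova <- lm(score ~ num_words, data = full_word_count)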
summary(anova)
##
## Call:
## lm(formula = score ~ num_words, data = full_word_count)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2808 -0.2427 0.7524 0.7966 3.9419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.284e+00 2.453e-03 1746.3 <2e-16 ***
## num_words -1.228e-03 2.106e-05 -58.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.307 on 568452 degrees of freedom
## Multiple R-squared: 0.005944, Adjusted R-squared: 0.005942
## F-statistic: 3399 on 1 and 568452 DF, p-value: < 2.2e-16
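The model above treats score as a numeric response. For comparison, a one-way ANOVA in the usual sense would treat score as a grouping factor and ask whether mean review length differs across the five score levels. The sketch below shows that form on simulated data; the `toy` data frame and its Poisson word counts are assumptions made for illustration, not the project's data or results.

```r
# Simulated stand-in for full_word_count: 40 reviews per score level (hypothetical data)
set.seed(1)
toy <- data.frame(score = rep(1:5, each = 40))
toy$num_words <- rpois(nrow(toy), lambda = 90 - 5 * toy$score)

# One-way ANOVA: does mean review length differ between score groups?
fit <- aov(num_words ~ factor(score), data = toy)
summary(fit)

# Group means, to see the direction of any difference
aggregate(num_words ~ score, data = toy, FUN = mean)
```

Running the same two calls on `full_word_count` would give the ANOVA table for the real reviews.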
The 10 most popular words across all reviews are shown below. Many of them describe the food, such as flavor and taste!
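A word ranking like this can be produced by splitting each review into words, dropping stop words, and counting. The sketch below assumes the tidytext package is used for tokenisation; the three example reviews are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-corpus standing in for the food reviews
reviews <- tibble(
  id = 1:3,
  text = c(
    "The flavor is rich and the taste lingers",
    "Nice taste, but the flavor fades quickly",
    "I enjoy the flavor of this tea"
  )
)

word_counts <- reviews %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop common words such as "the"
  count(word, sort = TRUE)                # most frequent first

head(word_counts, 10)
```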
Here starts the fun stuff. The question is: can we tell the sentimental difference between negative and positive reviews?
Three sentiment lexicons are commonly used in R: Bing, NRC and AFINN. They differ in how they define sentiment. The Bing lexicon splits words into negative and positive categories. The NRC lexicon splits words into finer categories such as anger, fear and sadness. The AFINN lexicon assigns each word a numeric score from -5 (most negative) to 5 (most positive).
Here is the overlap between our data and the three lexicons mentioned above.
The results show that positive reviews contain more positive words than negative words, while negative reviews contain more negative words than positive ones.
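The comparison of positive and negative word counts can be sketched as follows, assuming the Bing lexicon is loaded through `tidytext::get_sentiments("bing")`. The four reviews and their scores are invented for illustration and are not taken from the dataset.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical reviews: two with score 5, two with score 1
reviews <- tibble(
  id = 1:4,
  score = c(5, 5, 1, 1),
  text = c(
    "Delicious coffee, I love the great taste",
    "Wonderful snack, amazing and fresh",
    "Terrible taste, awful and bad",
    "Horrible smell, I hate it and it was a waste"
  )
)

sentiment_counts <- reviews %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only lexicon words
  count(score, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)

sentiment_counts
```

Words missing from the lexicon are dropped by the join, so the counts cover only the part of each review that overlaps with Bing.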
From the NRC analysis, it is a little strange to see positive sentiment in the negative reviews. But many negative sentiments are more frequent in the lower-scored reviews than in reviews with score = 5.
This last part is a small exercise to see whether the computer can tell the topic of a review. In addition to the original food reviews, we use reviews from Beauty, Outdoor, movieTV and VideoGame. The methods we use are LDA and k-means, both widely used in natural language processing. LDA is short for latent Dirichlet allocation: if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics.
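As a rough illustration of the LDA step described above, the sketch below fits a two-topic model to six made-up documents using the topicmodels package and tidytext helpers. The document names, texts, `k = 2` and the seed are all assumptions for the example; the project's actual preprocessing and settings are not shown in this report.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical documents from two categories
docs <- tibble(
  document = c("food1", "food2", "food3", "game1", "game2", "game3"),
  text = c(
    "coffee flavor taste sweet coffee",
    "tea flavor taste bitter tea",
    "snack taste sweet flavor",
    "game level graphics controller game",
    "graphics level story game",
    "controller game story level"
  )
)

# Document-term matrix of word counts
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  cast_dtm(document, word, n)

lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# gamma = estimated share of each topic in each document; keep the largest
doc_topics <- tidy(lda_fit, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

doc_topics
```

Comparing the assigned topic with the known category of each document is one way to judge how many topics the model separates.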
With LDA and k-means topic modeling on the five review categories, we can distinguish 4 out of 5 topics!
k-means is a clustering method that partitions observations into k clusters. Because the clusters come without labels, it is hard for us to assess its accuracy.
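When the true category of each review is known, one hedged way to gauge a k-means result is to cross-tabulate clusters against categories and compute purity (the share of documents that fall in the majority category of their cluster). The sketch below uses base R only; the six documents, their labels and the choice of `centers = 2` are assumptions for illustration.

```r
# Hypothetical documents with known categories
docs <- c(
  "coffee flavor taste sweet", "tea flavor taste bitter", "coffee taste sweet",
  "movie plot actor scene", "movie actor film plot", "film scene plot"
)
labels <- c("food", "food", "food", "movie", "movie", "movie")

# Build a document-term count matrix by hand
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

set.seed(42)
km <- kmeans(dtm, centers = 2, nstart = 10)

# Rows are clusters, columns are true categories
tab <- table(cluster = km$cluster, label = labels)
purity <- sum(apply(tab, 1, max)) / length(labels)
tab
purity
```

Purity is only a rough guide: it rewards many small clusters, so it is best compared across runs with the same k.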
In the future, we are interested in trying other sentiment analysis dictionaries and adding a customized dictionary for food reviews. In addition, we are interested in combining the topic and sentiment analyses to develop a recommendation system based on each customer's preferred topics.