This paper examines the use of unigram-based sentiment analysis on textual restaurant reviews from the Yelp Academic Dataset. The analysis will be performed using an opinion lexicon, and sentiment scores will be recorded for each review. Using the scores as predictors, we will then predict whether a review gives a restaurant an above-average star rating or not. We will compare the overall accuracy of three models; the best classifier, selected via 10-fold cross-validation, will be used to predict the test set.
The underlying assumption is that the more positive a textual review is, the higher its star rating will be. Conversely, a neutral or negative textual review is assumed to carry a less favourable rating. The sentiment inclination of a review is then inferred from the number of positive and negative terms found in its text.
The number of stars could have been used as the classes for the prediction problem. However, given the small number of predictors, it would be difficult to predict the exact rating of a review and hence to evaluate the effectiveness of the sentiment analysis. Binary classification is therefore preferred in this case.
Textual reviews of any kind can contain plenty of information, but such unstructured data is often difficult to work with. Sentiment analysis is one way to extract information from text by scoring its ‘positivity’ and ‘negativity’. In addition, successful prediction of the star rating can serve as a check against inconsistency between the rating and the textual review, for example when a reviewer writes a highly positive review but rates the restaurant unfavourably, or vice versa.
Firstly, the review and business datasets are loaded using the ‘jsonlite’ package.
library(jsonlite)
review <- stream_in(file("./data/yelp_academic_dataset_review.json"))
business <- stream_in(file("./data/yelp_academic_dataset_business.json"))
Then, from the ‘business’ dataset, we extracted all the businesses whose categories contain the ‘Restaurants’ tag.
# Flag businesses whose list of categories includes "Restaurants"
restaurants <- sapply(business$categories, function(cats) "Restaurants" %in% cats)
business.restaurants <- business[restaurants, ]
Finally, since each restaurant has a unique business ID, we can use this ID to extract the reviews for these restaurants.
review_restaurants <- review[, "business_id"] %in% business.restaurants$business_id
review.restaurants <- review[review_restaurants, ]
We found 990,627 reviews on restaurants out of 1,569,264 total reviews.
As mentioned earlier, ratings of four stars or more will be classified as “above average”.
review.restaurants$above.avg <- 0
review.restaurants$above.avg[review.restaurants$stars>=4] <- 1
This gives us 652,216 reviews (66%) with above average ratings and 338,411 (34%) otherwise.
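These proportions can be verified with a quick tabulation:
# Class balance of the binary response
table(review.restaurants$above.avg)
prop.table(table(review.restaurants$above.avg))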
Punctuation, numbers and stopwords such as ‘is’ and ‘are’, which convey no meaningful sentiment, will be removed from each textual review. Then, each individual word will be extracted. The list of stopwords is obtained from the SMART information retrieval system via the ‘tm’ package.
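A minimal sketch of what such a word-extraction function could look like, assuming base R string handling and the SMART stopword list from the ‘tm’ package, is given below (an assumed implementation, not necessarily the exact code used).
library(tm)
# Assumed implementation of 'extract.words': lowercase the text, strip
# punctuation and numbers, split it into words, and drop SMART stopwords.
extract.words <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]", " ", text)   # remove punctuation
  text <- gsub("[[:digit:]]", " ", text)   # remove numbers
  words <- unlist(strsplit(text, "\\s+"))  # split on whitespace
  words <- words[words != ""]
  words[!words %in% stopwords("SMART")]    # remove SMART stopwords
}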
An example of word extraction follows.
extract.words("The best fish and chips you will ever enjoy and equally superb fried shrimp.")
## [1] "fish" "chips" "enjoy" "equally" "superb" "fried" "shrimp"
The lexicon contains two lists of single-term words, one of positive words and the other of negative words. We will use the lexicon provided by Hu and Liu, which categorizes nearly 6,800 words as positive or negative.
Source: http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
hu.liu.pos = scan('./data/positive-words.txt',
what='character', comment.char=';')
hu.liu.neg = scan('./data/negative-words.txt',
what='character', comment.char=';')
set.seed(1)
sample(hu.liu.pos, 5)
## [1] "endearing" "flourishing" "marvels" "top" "decisive"
sample(hu.liu.neg, 5)
## [1] "traitor" "unreasonable" "overdo" "naïve"
## [5] "belligerent"
The ‘score.sentiment’ function was written to take in the list of words to be analysed, a list of positive words and a list of negative words. It returns a list of four items: the numbers of positive and negative words identified, labelled pos.score and neg.score respectively, as well as the positive and negative words that match the lexicon.
score.sentiment <- function(words, pos.words, neg.words) {
# Compare our words to the lists of positive and negative words
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
# Remove the 'NA's
pos.matches = pos.matches[!is.na(pos.matches)]
neg.matches = neg.matches[!is.na(neg.matches)]
# Score the number of positive and negative words matched
pos.score = length(pos.matches)
neg.score = length(neg.matches)
return(list(pos.score = pos.score, neg.score = neg.score, pos.match = pos.words[pos.matches],
neg.match = neg.words[neg.matches]))
}
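As a quick check on the example sentence from earlier, we would expect ‘enjoy’ and ‘superb’ to match the positive list and nothing to match the negative list (assuming both words appear in the Hu and Liu positive lexicon):
# Illustrative call on the earlier example sentence
score.sentiment(extract.words("The best fish and chips you will ever enjoy and equally superb fried shrimp."),
    pos.words = hu.liu.pos, neg.words = hu.liu.neg)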
The sentiment scoring algorithm is applied to a sample of 5000 textual reviews.
require(plyr)
set.seed(1)
sample.review <- sample(nrow(review.restaurants), 5000)
sample.words <- llply(review.restaurants$text[sample.review], extract.words,
.progress = "text")
sample.sentiment = llply(sample.words, score.sentiment, pos.words = hu.liu.pos,
neg.words = hu.liu.neg)
Then, we checked whether any reviews could not be scored, i.e. reviews with no words that match the opinion lexicon.
sample.no.match <- which(sapply(sample.sentiment, "[[", "pos.score") == 0 &
sapply(sample.sentiment, "[[", "neg.score") == 0)
length(sample.no.match)
## [1] 99
Given that these make up only about 2% of the sample, we will not be too concerned about them for now.
We will now see if there is any relationship between the scores and the ‘above average’ classification. For the ratio of positive to negative score, each score was increased by one to prevent division by zero.
sample.data <- data.frame(above.avg = review.restaurants$above.avg[sample.review],
pos.score = sapply(sample.sentiment, "[[", "pos.score"), neg.score = sapply(sample.sentiment,
"[[", "neg.score"))
require(ggplot2)
require(gridExtra)
plot1 <- qplot(x = as.factor(above.avg), y = pos.score, data = sample.data,
geom = "boxplot", xlab = "Above Average", ylab = "Positive Score")
plot2 <- qplot(x = as.factor(above.avg), y = neg.score, data = sample.data,
geom = "boxplot", xlab = "Above Average", ylab = "Negative Score")
grid.arrange(plot1, plot2, ncol = 2)
From Figure 1, there appears to be some correlation between the ‘above average’ class and sentiment scores.
The dataset, i.e. the sentiment scores from all 990,627 textual reviews, will be split into a training set (80%) and a test set (20%). 10-fold cross-validation will then be performed on the training set with three different models. The model with the highest cross-validation accuracy will be fitted on the full training set and used to predict the test set.
Words extraction and sentiment scoring are performed on the full dataset.
words <- llply(review.restaurants$text, extract.words, .progress = "text")
sentiment <- llply(words, score.sentiment, pos.words = hu.liu.pos, neg.words = hu.liu.neg)
dat <- data.frame(above.avg = review.restaurants$above.avg, pos.score = sapply(sentiment,
"[[", "pos.score"), neg.score = sapply(sentiment, "[[", "neg.score"))
The dataset is then split into training and test.
set.seed(1)
train <- sample(nrow(dat), 0.8*nrow(dat))
training_set <- dat[train, ]
test_set <- dat[-train, ]
From the ‘caret’ package, we use the ‘createFolds’ function to split our training set into 10 folds for the purpose of cross-validation.
library(caret)
set.seed(1)
cv <- createFolds(training_set$above.avg, k=10, list=TRUE, returnTrain=TRUE)
The predictors used are the positive score, negative score and the score ratio.
require(MASS)
fold_accuracy <- NULL
for (fold in 1:10) {
lda.fit <- lda(above.avg ~ pos.score + neg.score + I((pos.score + 1)/(neg.score +
1)), data = training_set, subset = cv[[fold]])
pred <- predict(lda.fit, newdata = training_set[-cv[[fold]], ])
fold_accuracy <- c(fold_accuracy, mean(pred$class == training_set$above.avg[-cv[[fold]]]))
}
cv_accuracy <- mean(fold_accuracy)
The cross-validation accuracy for LDA is 0.748.
require(MASS)
fold_accuracy <- NULL
for (fold in 1:10) {
qda.fit <- qda(above.avg ~ pos.score + neg.score + I((pos.score + 1)/(neg.score +
1)), data = training_set, subset = cv[[fold]])
pred <- predict(qda.fit, newdata = training_set[-cv[[fold]], ])
fold_accuracy <- c(fold_accuracy, mean(pred$class == training_set$above.avg[-cv[[fold]]]))
}
cv_accuracy <- mean(fold_accuracy)
The cross-validation accuracy for QDA is 0.737.
fold_accuracy <- NULL
for (fold in 1:10) {
glm.fit <- glm(above.avg ~ pos.score + neg.score + I((pos.score + 1)/(neg.score +
1)), family = binomial(), data = training_set, subset = cv[[fold]])
prob <- predict(glm.fit, type = "response", newdata = training_set[-cv[[fold]],
])
fold_accuracy <- c(fold_accuracy, mean((prob > 0.5) == training_set$above.avg[-cv[[fold]]]))
}
cv_accuracy <- mean(fold_accuracy)
Lastly, the cross-validation accuracy for logistic regression is 0.754.
Since logistic regression gave the best cross-validation accuracy out of the three, we used it to predict the test set. The test accuracy obtained is 0.75.
glm.fit <- glm(above.avg ~ pos.score + neg.score + I((pos.score + 1)/(neg.score +
1)), family = binomial(), data = training_set)
prob <- predict(glm.fit, type = "response", newdata = test_set)
pred <- as.integer(prob > 0.5)
test_accuracy <- mean(pred == test_set$above.avg)
A test accuracy of 75%, given our simple model, has shown us some promise in the use of sentiment analysis to predict rating. However, the result has much room for improvement. We will briefly discuss some possible causes for inaccuracy and thoughts for future directions.
The confusion matrix is shown below.
require(caret)
confusionTable <- confusionMatrix(factor(pred), factor(test_set$above.avg), positive='1')$table
print(confusionTable)
## Reference
## Prediction 0 1
## 0 31860 12999
## 1 35803 117464
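From this table, the sensitivity and specificity discussed below can be computed directly (the ‘confusionMatrix’ object also reports them):
# Sensitivity: proportion of above-average reviews correctly identified
confusionTable["1", "1"] / sum(confusionTable[, "1"])   # ~0.90
# Specificity: proportion of non above-average reviews correctly identified
confusionTable["0", "0"] / sum(confusionTable[, "0"])   # ~0.47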
The sensitivity, i.e. the percentage of above-average ratings correctly identified, was 90%. On the other hand, the specificity, the percentage of non-above-average ratings correctly identified, was only 47%. This tells us that the classifier over-identifies reviews as having above-average ratings. Given that the predictors are all generated from the sentiment analysis, it makes sense to investigate the scoring process.
overall.score <- dat$pos.score - dat$neg.score
ratio.score <- (dat$pos.score + 1)/(dat$neg.score + 1)
require(ggplot2)
require(gridExtra)
plot1 <- qplot(x = as.factor(dat$above.avg), y = overall.score, geom = "boxplot",
xlab = "Above Average", ylab = "Positive Score - Negative Score")
plot2 <- qplot(x = as.factor(dat$above.avg), y = ratio.score, geom = "boxplot",
xlab = "Above Average", ylab = "log(Ratio of Positive to Negative Score)",
log = "y")
grid.arrange(plot1, plot2, ncol = 2)
The boxplots for sentiment score difference and score ratio for each class are shown in Figure 2.
Even though the sentiment in ‘above-average’ reviews is, on average, more positive than in their non-‘above-average’ counterparts, there is substantial overlap in sentiment between the two groups, and the range of sentiment within each group is very wide.
We now propose a few possible causes of the difficulty in accurately scoring sentiment. The examples are all taken from a single review with a high positive-to-negative score ratio (65:15) which is nonetheless a non-‘above-average’ review of 3 stars.
The lexicon has a limited vocabulary; common words like “yum” and “yummy” are not included. The lexicon also contains only unigrams, i.e. single terms; as mentioned earlier, words in combination can carry a different meaning, for instance under negation. Lastly, all the words in the lexicon carry the same weight, whereas words that express a greater degree of sentiment should contribute more to the score.
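One illustrative extension, sketched below and not part of this analysis, is to flip the polarity of a lexicon word when it is immediately preceded by a simple negator (this assumes negators such as ‘not’ are kept during word extraction rather than removed as stopwords).
# Hypothetical negation-aware scorer: a positive word preceded by a negator
# counts as negative, and vice versa. The negator list is an assumption.
negators <- c("not", "no", "never", "hardly")
score.sentiment.negation <- function(words, pos.words, neg.words) {
  pos <- 0
  neg <- 0
  for (i in seq_along(words)) {
    negated <- i > 1 && words[i - 1] %in% negators
    if (words[i] %in% pos.words) {
      if (negated) neg <- neg + 1 else pos <- pos + 1
    } else if (words[i] %in% neg.words) {
      if (negated) pos <- pos + 1 else neg <- neg + 1
    }
  }
  list(pos.score = pos, neg.score = neg)
}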
We have used simple models such as logistic regression, LDA and QDA due to limited computational resources. For prediction problems, more flexible models such as random forests and neural networks might yield better accuracy.
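As an illustration of this direction, a random forest could be fitted on the same three predictors. The sketch below is hypothetical and was not run as part of this analysis; it fits on a subsample of the training set to keep the computation manageable.
# Hypothetical sketch using the 'randomForest' package on a training subsample
library(randomForest)
set.seed(1)
sub <- sample(nrow(training_set), 50000)
rf.fit <- randomForest(as.factor(above.avg) ~ pos.score + neg.score +
    I((pos.score + 1)/(neg.score + 1)), data = training_set[sub, ], ntree = 100)
rf.pred <- predict(rf.fit, newdata = test_set)
mean(rf.pred == test_set$above.avg)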
In this paper, we have achieved a test accuracy of 75% using sentiment analysis and logistic regression for the binary classification of above-average ratings. While this provides some evidence for the effectiveness of sentiment analysis, the model is far from polished. We have discussed some of the difficulties in accurately scoring the sentiment of a review. Future work will aim to overcome these difficulties, as well as to improve lexicon and model selection.