In 2016, David Robinson wrote a blog post assessing the AFINN sentiment lexicon by looking at the distributions of sentiment scores in posts with different overall ratings. In theory, Yelp reviews with 1 star should be more negative than reviews with 3 stars. The analysis illustrated the effectiveness of the AFINN lexicon, but there are three other lexicons included in the tidytext package (nrc,bing, and loughran). This post will first replicate Robinson’s box plot then apply the same analysis to the other three sentiment lexicons included in tidytext.

Getting Started

The first step is to read a sample of the yelp_dataset_challenge_academic_dataset. As Robinson says, you can use the whole set, but for speedier processing it helps to use a subset of reviews (in this case I used 200,000 per Robinson’s example).

knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(dplyr)
infile <- "~/Downloads/yelp_dataset_challenge_academic_dataset/review.json"
review_lines <- read_lines(infile, n_max = 200000, progress = FALSE)
library(stringr)
library(jsonlite)
# Each line is a JSON object- the fastest way to process is to combine into a
# single JSON string and use fromJSON and flatten
reviews_combined <- str_c("[", str_c(review_lines, collapse = ", "), "]")
reviews <- fromJSON(reviews_combined) %>%
  flatten() %>%
  tbl_df()

Now, to produce sentiment scores, I’ll unnest the text field to create a dataframe with one row per word, then join that dataframe with each of the four sentiment lexticons. In cases where scores are positive or negative those values are converted to 1 and -1 respectively. In cases (like the nrc lexicon) where there are more options available, only positive and negative tags are retained.

library(tidytext)
# create df with one line per word. There should be ~8.1 million lines 
review_words <- reviews %>%
  select(review_id, business_id, stars, text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))
# set up each lexicon as it's own df
nrc <- sentiments%>%
  filter(sentiment %in% c('positive','negative')
         & lexicon == 'nrc')%>%
  mutate(nrc = ifelse(sentiment == 'positive',1,-1))%>%
  select(word, nrc)
bing <- sentiments%>%
  filter(lexicon == 'bing')%>%
  mutate(bing = ifelse(sentiment == 'positive',1,-1))%>%
  select(word, bing)
loughran <- sentiments%>%
  filter(sentiment %in% c('positive','negative') 
         & lexicon == 'loughran')%>%
  mutate(loughran = ifelse(sentiment == 'positive',1,-1))%>%
  select(word, loughran)
afinn <- sentiments%>%
  filter(lexicon == 'AFINN')%>%
  select(word, afinn = score)
# Join each lexicon to the review_words dataframe
reviews_scored <- review_words%>%
  left_join(nrc, by = 'word')%>%
  left_join(bing, by = 'word')%>%
  left_join(loughran, by = 'word')%>%
  left_join(afinn, by = 'word')

Now that we have a dataset with each word mapped to all four potential sentiment scores, we can calculate the average sentiment of each review with a simple aggregation function (group_by and summarise).

review_scores_summary <- reviews_scored%>%
  group_by(review_id, stars)%>%
  summarise(nrc_score = mean(nrc, na.rm = T),
            bing_score = mean(bing, na.rm = T),
            loughran_score = mean(loughran, na.rm = T),
            afinn_score = mean(afinn, na.rm = T))

Visualizing Score Distributions

First I want to replicate Robinson’s box plot for AFINN scores.

library(ggplot2)
afinn.box <- ggplot(review_scores_summary, aes(x = as.character(stars), y = afinn_score))+
  geom_boxplot()+
  labs(x = 'Yelp Review Score',
       y = 'AFINN Score')
afinn.box

This looks generally positive! As Robinson points out, there are a large number of outliers (strong reviews coded as negative and vice versa), but generally this is a good start. But now let’s see how the other three lexicons do in comparison.

nrc.box <- ggplot(review_scores_summary, aes(x = as.character(stars), y = nrc_score))+
  geom_boxplot()+
  labs(x = 'Yelp Review Score',
       y = 'NRC Score')
bing.box <- ggplot(review_scores_summary, aes(x = as.character(stars), y = bing_score))+
  geom_boxplot()+
  labs(x = 'Yelp Review Score',
       y = 'Bing Score')
loughran.box <- ggplot(review_scores_summary, aes(x = as.character(stars), y = loughran_score))+
  geom_boxplot()+
  labs(x = 'Yelp Review Score',
       y = 'Loughran Score')
library(gridExtra)
grid.arrange(afinn.box, nrc.box, bing.box, loughran.box, nrow = 2)

NRC performs OK, but errs on the side of being overly positive. The median score for 1 star reviews is a net positive! AFINN has a similar problem although it’s less egregious there. Loughran orders the median distributions correctly, but the boxes are very wide, particularly in the middle. The Bing lexicon appears to have fewer outliers but the score distributions are still skewed for five-star reviews. Bing also has another advantage not discussed in this post or Robinson’s in that it has the most comprehensive lexicon with over 6,000 words scored (NRC is next closets with over 5,000). Given those two advantages, I plan on using the Bing sentiment lexicon as much as possible moving forward.

LS0tCnRpdGxlOiAiRm9sbG93aW5nIFVwIE9uIFwiRG9lcyBTZW50aW1lbnQgQW5hbHlzaXMgV29yaz8gQSB0aWR5IEFuYWx5c2lzIG9mIFllbHAgUmV2aWV3c1wiIgphdXRob3I6ICJKb3NoIFlhem1hbiIKZGF0ZTogIjkvMTIvMjAxNyIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKSW4gMjAxNiwgRGF2aWQgUm9iaW5zb24gd3JvdGUgYSBbYmxvZyBwb3N0XShodHRwOi8vdmFyaWFuY2VleHBsYWluZWQub3JnL3IveWVscC1zZW50aW1lbnQvKSBhc3Nlc3NpbmcgdGhlIGBBRklOTmAgc2VudGltZW50IGxleGljb24gYnkgbG9va2luZyBhdCB0aGUgZGlzdHJpYnV0aW9ucyBvZiBzZW50aW1lbnQgc2NvcmVzIGluIHBvc3RzIHdpdGggZGlmZmVyZW50IG92ZXJhbGwgcmF0aW5ncy4gSW4gdGhlb3J5LCBZZWxwIHJldmlld3Mgd2l0aCAxIHN0YXIgc2hvdWxkIGJlIG1vcmUgbmVnYXRpdmUgdGhhbiByZXZpZXdzIHdpdGggMyBzdGFycy4gVGhlIGFuYWx5c2lzIGlsbHVzdHJhdGVkIHRoZSBlZmZlY3RpdmVuZXNzIG9mIHRoZSBgQUZJTk5gIGxleGljb24sIGJ1dCB0aGVyZSBhcmUgdGhyZWUgb3RoZXIgbGV4aWNvbnMgaW5jbHVkZWQgaW4gdGhlIGB0aWR5dGV4dGAgcGFja2FnZSAoYG5yY2AsYGJpbmdgLCBhbmQgYGxvdWdocmFuYCkuIFRoaXMgcG9zdCB3aWxsIGZpcnN0IHJlcGxpY2F0ZSBSb2JpbnNvbidzIGJveCBwbG90IHRoZW4gYXBwbHkgdGhlIHNhbWUgYW5hbHlzaXMgdG8gdGhlIG90aGVyIHRocmVlIHNlbnRpbWVudCBsZXhpY29ucyBpbmNsdWRlZCBpbiB0aWR5dGV4dC4gCgojIyBHZXR0aW5nIFN0YXJ0ZWQKVGhlIGZpcnN0IHN0ZXAgaXMgdG8gcmVhZCBhIHNhbXBsZSBvZiB0aGUgYHllbHBfZGF0YXNldF9jaGFsbGVuZ2VfYWNhZGVtaWNfZGF0YXNldGAuIEFzIFJvYmluc29uIHNheXMsIHlvdSBjYW4gdXNlIHRoZSB3aG9sZSBzZXQsIGJ1dCBmb3Igc3BlZWRpZXIgcHJvY2Vzc2luZyBpdCBoZWxwcyB0byB1c2UgYSBzdWJzZXQgb2YgcmV2aWV3cyAoaW4gdGhpcyBjYXNlIEkgdXNlZCAyMDAsMDAwIHBlciBSb2JpbnNvbidzIGV4YW1wbGUpLiAKYGBge3J9CmtuaXRyOjpvcHRzX2NodW5rJHNldChlY2hvID0gVFJVRSkKbGlicmFyeShyZWFkcikKbGlicmFyeShkcGx5cikKCmluZmlsZSA8LSAifi9Eb3dubG9hZHMveWVscF9kYXRhc2V0X2NoYWxsZW5nZV9hY2FkZW1pY19kYXRhc2V0L3Jldmlldy5qc29uIgpyZXZpZXdfbGluZXMgPC0gcmVhZF9saW5lcyhpbmZpbGUsIG5fbWF4ID0gMjAwMDAwLCBwcm9ncmVzcyA9IEZBTFNFKQoKbGlicmFyeShzdHJpbmdyKQpsaWJyYXJ5KGpzb25saXRlKQoKIyBFYWNoIGxpbmUgaXMgYSBKU09OIG9iamVjdC0gdGhlIGZhc3Rlc3Qgd2F5IHRvIHByb2Nlc3MgaXMgdG8gY29tYmluZSBpbnRvIGEKIyBzaW5nbGUgSlNPTiBzdHJpbmcgYW5kIHVzZSBmcm9tSlNPTiBhbmQgZmxhdHRlbgpyZXZpZXdzX2NvbWJpbmVkIDwtIHN0cl9jKCJbIiwgc3RyX2MocmV2aWV3X2xpbmVzLCBjb2xsYXBzZSA9ICIsICIpLCAiXSIpCgpyZXZpZXdzIDwtIGZyb21KU09OKHJldmlld3NfY29tYmluZWQpICU+JQogIGZsYXR0ZW4oKSAlPiUKICB0YmxfZGYoKQpgYGAKCk5vdywgdG8gcHJvZHVjZSBzZW50aW1lbnQgc2NvcmVzLCBJJ2xsIHVubmVzdCB0aGUgdGV4dCBmaWVsZCB0byBjcmVhdGUgYSBkYXRhZnJhbWUgd2l0aCBvbmUgcm93IHBlciB3b3JkLCB0aGVuIGpvaW4gdGhhdCBkYXRhZnJhbWUgd2l0aCBlYWNoIG9mIHRoZSBmb3VyIHNlbnRpbWVudCBsZXh0aWNvbnMuIEluIGNhc2VzIHdoZXJlIHNjb3JlcyBhcmUgYHBvc2l0aXZlYCBvciBgbmVnYXRpdmVgIHRob3NlIHZhbHVlcyBhcmUgY29udmVydGVkIHRvIGAxYCBhbmQgYC0xYCByZXNwZWN0aXZlbHkuIEluIGNhc2VzIChsaWtlIHRoZSBgbnJjYCBsZXhpY29uKSB3aGVyZSB0aGVyZSBhcmUgbW9yZSBvcHRpb25zIGF2YWlsYWJsZSwgb25seSBgcG9zaXRpdmVgIGFuZCBgbmVnYXRpdmVgIHRhZ3MgYXJlIHJldGFpbmVkLiAKCmBgYHtyfQpsaWJyYXJ5KHRpZHl0ZXh0KQoKIyBjcmVhdGUgZGYgd2l0aCBvbmUgbGluZSBwZXIgd29yZC4gVGhlcmUgc2hvdWxkIGJlIH44LjEgbWlsbGlvbiBsaW5lcyAKcmV2aWV3X3dvcmRzIDwtIHJldmlld3MgJT4lCiAgc2VsZWN0KHJldmlld19pZCwgYnVzaW5lc3NfaWQsIHN0YXJzLCB0ZXh0KSAlPiUKICB1bm5lc3RfdG9rZW5zKHdvcmQsIHRleHQpICU+JQogIGZpbHRlcighd29yZCAlaW4lIHN0b3Bfd29yZHMkd29yZCwKICAgICAgICAgc3RyX2RldGVjdCh3b3JkLCAiXlthLXonXSskIikpCgojIHNldCB1cCBlYWNoIGxleGljb24gYXMgaXQncyBvd24gZGYKbnJjIDwtIHNlbnRpbWVudHMlPiUKICBmaWx0ZXIoc2VudGltZW50ICVpbiUgYygncG9zaXRpdmUnLCduZWdhdGl2ZScpCiAgICAgICAgICYgbGV4aWNvbiA9PSAnbnJjJyklPiUKICBtdXRhdGUobnJjID0gaWZlbHNlKHNlbnRpbWVudCA9PSAncG9zaXRpdmUnLDEsLTEpKSU+JQogIHNlbGVjdCh3b3JkLCBucmMpCgpiaW5nIDwtIHNlbnRpbWVudHMlPiUKICBmaWx0ZXIobGV4aWNvbiA9PSAnYmluZycpJT4lCiAgbXV0YXRlKGJpbmcgPSBpZmVsc2Uoc2VudGltZW50ID09ICdwb3NpdGl2ZScsMSwtMSkpJT4lCiAgc2VsZWN0KHdvcmQsIGJpbmcpCgpsb3VnaHJhbiA8LSBzZW50aW1lbnRzJT4lCiAgZmlsdGVyKHNlbnRpbWVudCAlaW4lIGMoJ3Bvc2l0aXZlJywnbmVnYXRpdmUnKSAKICAgICAgICAgJiBsZXhpY29uID09ICdsb3VnaHJhbicpJT4lCiAgbXV0YXRlKGxvdWdocmFuID0gaWZlbHNlKHNlbnRpbWVudCA9PSAncG9zaXRpdmUnLDEsLTEpKSU+JQogIHNlbGVjdCh3b3JkLCBsb3VnaHJhbikKCmFmaW5uIDwtIHNlbnRpbWVudHMlPiUKICBmaWx0ZXIobGV4aWNvbiA9PSAnQUZJTk4nKSU+JQogIHNlbGVjdCh3b3JkLCBhZmlubiA9IHNjb3JlKQoKIyBKb2luIGVhY2ggbGV4aWNvbiB0byB0aGUgcmV2aWV3X3dvcmRzIGRhdGFmcmFtZQpyZXZpZXdzX3Njb3JlZCA8LSByZXZpZXdfd29yZHMlPiUKICBsZWZ0X2pvaW4obnJjLCBieSA9ICd3b3JkJyklPiUKICBsZWZ0X2pvaW4oYmluZywgYnkgPSAnd29yZCcpJT4lCiAgbGVmdF9qb2luKGxvdWdocmFuLCBieSA9ICd3b3JkJyklPiUKICBsZWZ0X2pvaW4oYWZpbm4sIGJ5ID0gJ3dvcmQnKQpgYGAKCk5vdyB0aGF0IHdlIGhhdmUgYSBkYXRhc2V0IHdpdGggZWFjaCB3b3JkIG1hcHBlZCB0byBhbGwgZm91ciBwb3RlbnRpYWwgc2VudGltZW50IHNjb3Jlcywgd2UgY2FuIGNhbGN1bGF0ZSB0aGUgYXZlcmFnZSBzZW50aW1lbnQgb2YgZWFjaCByZXZpZXcgd2l0aCBhIHNpbXBsZSBhZ2dyZWdhdGlvbiBmdW5jdGlvbiAoYGdyb3VwX2J5YCBhbmQgYHN1bW1hcmlzZWApLgoKYGBge3IsIGZpZy5hbGlnbj0nY2VudGVyJ30KcmV2aWV3X3Njb3Jlc19zdW1tYXJ5IDwtIHJldmlld3Nfc2NvcmVkJT4lCiAgZ3JvdXBfYnkocmV2aWV3X2lkLCBzdGFycyklPiUKICBzdW1tYXJpc2UobnJjX3Njb3JlID0gbWVhbihucmMsIG5hLnJtID0gVCksCiAgICAgICAgICAgIGJpbmdfc2NvcmUgPSBtZWFuKGJpbmcsIG5hLnJtID0gVCksCiAgICAgICAgICAgIGxvdWdocmFuX3Njb3JlID0gbWVhbihsb3VnaHJhbiwgbmEucm0gPSBUKSwKICAgICAgICAgICAgYWZpbm5fc2NvcmUgPSBtZWFuKGFmaW5uLCBuYS5ybSA9IFQpKQpgYGAKCiMjIFZpc3VhbGl6aW5nIFNjb3JlIERpc3RyaWJ1dGlvbnMKRmlyc3QgSSB3YW50IHRvIHJlcGxpY2F0ZSBSb2JpbnNvbidzIGJveCBwbG90IGZvciBgQUZJTk5gIHNjb3Jlcy4gCgpgYGB7ciwgZmlnLmFsaWduPSdjZW50ZXInfQpsaWJyYXJ5KGdncGxvdDIpCgphZmlubi5ib3ggPC0gZ2dwbG90KHJldmlld19zY29yZXNfc3VtbWFyeSwgYWVzKHggPSBhcy5jaGFyYWN0ZXIoc3RhcnMpLCB5ID0gYWZpbm5fc2NvcmUpKSsKICBnZW9tX2JveHBsb3QoKSsKICBsYWJzKHggPSAnWWVscCBSZXZpZXcgU2NvcmUnLAogICAgICAgeSA9ICdBRklOTiBTY29yZScpCgphZmlubi5ib3gKYGBgCgpUaGlzIGxvb2tzIGdlbmVyYWxseSBwb3NpdGl2ZSEgQXMgUm9iaW5zb24gcG9pbnRzIG91dCwgdGhlcmUgYXJlIGEgbGFyZ2UgbnVtYmVyIG9mIG91dGxpZXJzIChzdHJvbmcgcmV2aWV3cyBjb2RlZCBhcyBuZWdhdGl2ZSBhbmQgdmljZSB2ZXJzYSksIGJ1dCBnZW5lcmFsbHkgdGhpcyBpcyBhIGdvb2Qgc3RhcnQuIEJ1dCBub3cgbGV0J3Mgc2VlIGhvdyB0aGUgb3RoZXIgdGhyZWUgbGV4aWNvbnMgZG8gaW4gY29tcGFyaXNvbi4gCgpgYGB7ciwgbWVzc2FnZT1GQUxTRSwgd2FybmluZz1GQUxTRX0KbnJjLmJveCA8LSBnZ3Bsb3QocmV2aWV3X3Njb3Jlc19zdW1tYXJ5LCBhZXMoeCA9IGFzLmNoYXJhY3RlcihzdGFycyksIHkgPSBucmNfc2NvcmUpKSsKICBnZW9tX2JveHBsb3QoKSsKICBsYWJzKHggPSAnWWVscCBSZXZpZXcgU2NvcmUnLAogICAgICAgeSA9ICdOUkMgU2NvcmUnKQpiaW5nLmJveCA8LSBnZ3Bsb3QocmV2aWV3X3Njb3Jlc19zdW1tYXJ5LCBhZXMoeCA9IGFzLmNoYXJhY3RlcihzdGFycyksIHkgPSBiaW5nX3Njb3JlKSkrCiAgZ2VvbV9ib3hwbG90KCkrCiAgbGFicyh4ID0gJ1llbHAgUmV2aWV3IFNjb3JlJywKICAgICAgIHkgPSAnQmluZyBTY29yZScpCmxvdWdocmFuLmJveCA8LSBnZ3Bsb3QocmV2aWV3X3Njb3Jlc19zdW1tYXJ5LCBhZXMoeCA9IGFzLmNoYXJhY3RlcihzdGFycyksIHkgPSBsb3VnaHJhbl9zY29yZSkpKwogIGdlb21fYm94cGxvdCgpKwogIGxhYnMoeCA9ICdZZWxwIFJldmlldyBTY29yZScsCiAgICAgICB5ID0gJ0xvdWdocmFuIFNjb3JlJykKCmxpYnJhcnkoZ3JpZEV4dHJhKQoKZ3JpZC5hcnJhbmdlKGFmaW5uLmJveCwgbnJjLmJveCwgYmluZy5ib3gsIGxvdWdocmFuLmJveCwgbnJvdyA9IDIpCmBgYAoKTlJDIHBlcmZvcm1zIE9LLCBidXQgZXJycyBvbiB0aGUgc2lkZSBvZiBiZWluZyBvdmVybHkgcG9zaXRpdmUuIFRoZSBtZWRpYW4gc2NvcmUgZm9yIDEgc3RhciByZXZpZXdzIGlzIGEgbmV0IHBvc2l0aXZlISBBRklOTiBoYXMgYSBzaW1pbGFyIHByb2JsZW0gYWx0aG91Z2ggaXQncyBsZXNzIGVncmVnaW91cyB0aGVyZS4gTG91Z2hyYW4gb3JkZXJzIHRoZSBtZWRpYW4gZGlzdHJpYnV0aW9ucyBjb3JyZWN0bHksIGJ1dCB0aGUgYm94ZXMgYXJlIHZlcnkgd2lkZSwgcGFydGljdWxhcmx5IGluIHRoZSBtaWRkbGUuIFRoZSBCaW5nIGxleGljb24gYXBwZWFycyB0byBoYXZlIGZld2VyIG91dGxpZXJzIGJ1dCB0aGUgc2NvcmUgZGlzdHJpYnV0aW9ucyBhcmUgc3RpbGwgc2tld2VkIGZvciBmaXZlLXN0YXIgcmV2aWV3cy4gQmluZyBhbHNvIGhhcyBhbm90aGVyIGFkdmFudGFnZSBub3QgZGlzY3Vzc2VkIGluIHRoaXMgcG9zdCBvciBSb2JpbnNvbidzIGluIHRoYXQgaXQgaGFzIHRoZSBtb3N0IGNvbXByZWhlbnNpdmUgbGV4aWNvbiB3aXRoIG92ZXIgNiwwMDAgd29yZHMgc2NvcmVkIChOUkMgaXMgbmV4dCBjbG9zZXRzIHdpdGggb3ZlciA1LDAwMCkuIEdpdmVuIHRob3NlIHR3byBhZHZhbnRhZ2VzLCBJIHBsYW4gb24gdXNpbmcgdGhlIEJpbmcgc2VudGltZW50IGxleGljb24gYXMgbXVjaCBhcyBwb3NzaWJsZSBtb3ZpbmcgZm9yd2FyZC4=