---
title: "How Much Text Do We Really Need for Sentiment Analysis?"
author: "Josh Yazman"
date: "10/07/2017"
output: html_notebook
---

One question I've heard repeatedly since beginning to work with text data and sentiment analysis is how much text we really need to get a solid idea of aggregate sentiments. The law of large numbers suggests that more text is better, but if we only have a little bit of text, can sentiment analysis still work? 

In a previous [post](http://rpubs.com/joshyazman/sentiment-analysis-lexicon-comparison) I examined the accuracy of the sentiment lexicons in the `tidytext` package. Here I attempt to find the point at which we have enough text to measure sentiments precisely. 

## Getting Started
The first step is to read a sample of the [Yelp Academic Dataset](https://www.yelp.com/dataset/challenge). In this case, I'm starting with a subset of 200,000 reviews out of the 4.7 million provided.   

```{r}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(dplyr)

review_lines <- read_lines('review.json', n_max = 200000, progress = FALSE)

library(stringr)
library(jsonlite)

# Each line is a JSON object; the fastest way to process them is to combine all lines into a single JSON array string, then use fromJSON() and flatten()
make_dfs <- function(lines){
  lines_combined <- str_c("[", str_c(lines, collapse = ", "), "]")
  
  df <- fromJSON(lines_combined) %>%
    flatten() %>%
    tbl_df()
  return(df)
}

reviews <- make_dfs(review_lines)
reviews
```

Now, to produce sentiment scores, we split out the text field to create a dataframe with one row per word. Then we join that dataframe with two of the sentiment lexicons available from the `tidytext` R package. The existing `positive` and `negative` values are converted to `1` and `-1` respectively. For the `nrc` lexicon, where several additional sentiment categories are available, only the `positive` and `negative` tags are retained. 

```{r}
library(tidytext)

review_words <- reviews %>%
  select(review_id, business_id, stars, text) %>%
  unnest_tokens(word, text)%>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

# Set up each lexicon as its own dataframe of word-score pairs
nrc <- sentiments%>%
  filter(sentiment %in% c('positive','negative')
         & lexicon == 'nrc')%>%
  mutate(nrc = ifelse(sentiment == 'positive',1,-1))%>%
  select(word, nrc)

bing <- sentiments%>%
  filter(lexicon == 'bing')%>%
  mutate(bing = ifelse(sentiment == 'positive',1,-1))%>%
  select(word, bing)

# Join each lexicon to the review_words dataframe
reviews_scored <- review_words%>%
  left_join(nrc, by = 'word')%>%
  left_join(bing, by = 'word')
```

Evaluating the precision of sentiment lexicons requires two steps. For a range of sample sizes, we'll repeatedly sample our dataset (with replacement) and calculate the average sentiment of the selected words. Then we'll examine the distribution of the average scores to determine what sample size is needed to produce consistent average sentiment measurements. 

```{r, fig.align='center'}
dfs <- list()

# For each sample size from 25 to 5,000 words (in steps of 25), draw
# 250 bootstrap samples and record each lexicon's mean and sd of scores
for(i in seq(25, 5000, 25)){
  mean_nrc <- c()
  sd_nrc <- c()
  mean_bing <- c()
  sd_bing <- c()
  
  for(j in seq(1, 250)){
    words <- sample_n(tbl = reviews_scored, size = i, replace = TRUE)
    mean_nrc[j] <- mean(words$nrc, na.rm = TRUE)
    sd_nrc[j] <- sd(words$nrc, na.rm = TRUE)
    mean_bing[j] <- mean(words$bing, na.rm = TRUE)
    sd_bing[j] <- sd(words$bing, na.rm = TRUE)
  }
  # n_size is recycled across the 250 samples drawn at this size
  dfs[[i]] <- data.frame(n_size = i, mean_nrc, sd_nrc, mean_bing, sd_bing)
}

scores <- bind_rows(dfs)
```

## Analyzing the Distributions
First, sample mean values are plotted against sample word counts. As expected, the points form a funnel shape, with wider distributions at lower sample sizes that converge towards the overall mean as the sample size increases. The NRC lexicon tends to be more positive than the Bing lexicon, but it also appears to converge towards the global mean more quickly. 

```{r, fig.width=6, fig.height=3}
library(ggplot2)
library(yaztheme)
library(reshape2)

ggplot(scores%>%
         select(n_size, mean_nrc, mean_bing)%>%
         melt(id.vars = 'n_size'), aes(x = n_size, y = value))+
  geom_point(aes(color = variable), alpha = .025)+
  theme_yaz()+
  labs(title = 'Distribution of Sample Mean Sentiment Scores by Sample Size',
       y = 'Average Score', x = 'Sample Size',
       subtitle = 'Based on samples of words from Yelp Reviews and the NRC and Bing sentiment lexicons')+
  annotate('text', x = 4900, y = .7, label = 'bold("Bing")', color = yaz_cols[3], parse = TRUE)+
  annotate('text', x = 4900, y = .9, label = 'bold("NRC")', color = yaz_cols[1], parse = TRUE)+
  scale_color_manual(values = c(yaz_cols[1], yaz_cols[3]))+
  theme(legend.position = 'none')
```
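
For reference, we can compute the global means that each funnel converges to (a quick sketch using the `reviews_scored` dataframe from above, not part of the original write-up):

```{r}
# Overall mean sentiment per lexicon across all scored words;
# these are the values the sample means converge to as n grows
reviews_scored %>%
  summarise(global_nrc = mean(nrc, na.rm = TRUE),
            global_bing = mean(bing, na.rm = TRUE))
```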

This pattern bears out in the standard deviations of the sample means at each sample size. Because the standard error of a sample mean shrinks with the square root of the sample size, we should expect sharply diminishing returns, and that is what we see: both lexicons show substantial marginal improvements in precision up to about 200-300 words, but NRC improves slightly more rapidly than Bing.

```{r, message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
ggplot(scores%>%
         group_by(n_size)%>%
         summarise(sd_bing = sd(mean_bing, na.rm = T),
                   sd_nrc = sd(mean_nrc, na.rm = T))%>%
         melt(id.vars = 'n_size'),
       aes(x = n_size, y = value, color = variable))+
  geom_line(size = 1.5)+
  labs(title = 'Standard Deviation of Mean Sentiment Scores by Sample Size',
       x = 'Sample Size', y = 'Standard Deviation')+
  scale_color_manual(name = 'Lexicon', values = c(yaz_cols[c(3,1)]), labels = c('Bing','NRC'))+
  theme_yaz()
```
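
To put a rough number on where the curves flatten (a sketch reusing the `scores` dataframe; the 0.02 cutoff is an arbitrary precision threshold chosen for illustration, not part of the original analysis):

```{r}
# Smallest sample size at which the spread of sample means drops
# below an (arbitrary) precision threshold of 0.02 for each lexicon
scores %>%
  group_by(n_size) %>%
  summarise(sd_bing = sd(mean_bing, na.rm = TRUE),
            sd_nrc = sd(mean_nrc, na.rm = TRUE)) %>%
  summarise(first_n_nrc = min(n_size[sd_nrc < .02]),
            first_n_bing = min(n_size[sd_bing < .02]))
```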

## Conclusion
There are several potential explanations for the superior precision of NRC over Bing. NRC could simply be more precise for review data. Alternatively, the Bing lexicon could score fewer of the words that actually appear in reviews, and that lower coverage would leave fewer scored words per sample and noisier sample means. Coupled with the previous analysis demonstrating Bing's superior accuracy, the differences in precision are not enough to warrant discarding the Bing lexicon, especially if you have more than a few hundred words to score. 
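
One quick way to probe the coverage hypothesis (a minimal sketch using `reviews_scored` from above, not part of the original analysis) is to compare the share of review words each lexicon scores at all:

```{r}
# Share of review words matched by each lexicon; a lower match rate
# means fewer scored words per sample, and noisier sample means
reviews_scored %>%
  summarise(nrc_coverage = mean(!is.na(nrc)),
            bing_coverage = mean(!is.na(bing)))
```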

What do you do if you don't have enough data? Analyzing tweet text or headlines (as I often do) means you typically don't have enough words in a single utterance to produce stable sentiment estimates. One solution is to aggregate text by some additional factor before calculating summary scores. For example, in this [analysis of tweets during a VA Governor's debate](http://rpubs.com/joshyazman/vagov-debate-twitter-analysis), sentiments are calculated from the text of all tweets mentioning a given candidate rather than tweet by tweet. 
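
To illustrate the aggregation idea with the data at hand (a hypothetical example reusing `reviews_scored`; the debate analysis grouped by candidate, but any grouping factor works), we can pool words across all reviews that share a star rating before averaging:

```{r}
# Pool words by star rating before scoring; each pooled group easily
# clears the few-hundred-word threshold identified above
reviews_scored %>%
  group_by(stars) %>%
  summarise(n_words = n(),
            mean_nrc = mean(nrc, na.rm = TRUE),
            mean_bing = mean(bing, na.rm = TRUE))
```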