1. Run sentiment analysis

1-3. State-of-the-art model

In this section, we will employ a deep learning model; unfortunately, R is not the most conducive environment for training or testing such models. As expected, Python is the preferred option when it comes to deep learning due to its extensive libraries and community support. Similar to our approach in the computer vision module, we will utilize Google Colab (link). Upload (i.e., drag and drop) your Reddit data in csv format to your Colab session. You might as well use this data.

In NLP, the state-of-the-art model architecture is undoubtedly Transformer (Check out the benchmark here). The architecture was introduced in 2017 and has since become a foundation for many NLP models. The Transformer architecture has been highly influential in NLP due to its ability to handle long-range dependencies in text and its scalability, allowing the training of large models that capture intricate language patterns and semantics.

Timeline of popular Transformer model releases (source):

The script in Colab will demonstrate how to leverage a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model to predict the sentiment of a given string. Developed by Google, BERT is a powerful language processing AI model designed for a range of NLP tasks.

BERT reads texts and pays attention to the surrounding words to understand the context of each word. This is the “bidirectional” part, as it learns about the word based on the words that come before and after it, unlike older models which typically read text in one direction (left to right).

Thus, BERT helps computers better understand the meaning of words in a sentence. Imagine you’re learning a new language. When you hear or see a sentence, you don’t just look at each word in isolation—you look at the words around it to understand its meaning and context. If someone says, “It’s raining cats and dogs,” you understand that it’s an expression meaning it’s raining heavily, not that pets are falling from the sky.

Let’s proceed to the Colab and analyze the sentiment of Reddit threads using the BERT model.

2. Output sanity check

Import the data processed from the Colab to R.

reddit_sentiment <- read_csv(here('metal_reddit_bert.csv')) %>% 
  drop_na('bert_label')
## New names:
## Rows: 228 Columns: 11
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): Unnamed: 0, title, text, subreddit, url, bert_label dbl (4): ...1,
## timestamp, comments, bert_score date (1): date_utc
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Get sentiment scores using the dictionary method for comparison.

reddit_sentiment %<>% 
  mutate(title = replace_na(title, ""),
         text = replace_na(text, ""),
         title_text = str_c(title, text, sep = ". "))

reddit_sentiment_dictionary <- sentiment_by(reddit_sentiment$title_text)

reddit_sentiment$sentiment_dict <- reddit_sentiment_dictionary %>% pull(ave_sentiment)
reddit_sentiment$word_count <- reddit_sentiment_dictionary %>% pull(word_count)

Check the correlation between the sentiment values from two different methods.

reddit_sentiment %<>% mutate(bert_label_numeric = str_sub(bert_label, 1, 1) %>% as.numeric())

cor(reddit_sentiment$bert_label_numeric, reddit_sentiment$sentiment_dict)
## [1] 0.3124655

0.28 implies a mild positive correlation. However, in the scatter plot below, the two do not seem to be very correlated. The threads that got 4-5 stars from the BERT model are mostly below 0 (meaning negative) in the other method.

ggplot(data = reddit_sentiment, aes(x = bert_label_numeric, y = sentiment_dict)) +
  geom_jitter(width = 0.1, height = 0) +
  geom_line(aes(y = 0), color = '#FFD700', lwd = 1, linetype='dashed') + 
  dark_theme_grey()
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().

Let’s look at some example threads and the predicted sentiment, and see which method makes more sense.

bert_example <- reddit_sentiment %>%
  filter(bert_label_numeric %in% c(1,5)) %>% 
  group_by(bert_label) %>%
  arrange(desc(bert_score)) %>%
  slice_head(n = 3) %>%
  ungroup()

# 1 star
bert_example %>% filter(bert_label_numeric == 1) %>% pull(title_text) %>% print()
## [1] "Convicted heavy metal Christian singer admits being atheist, duped fans to sell music. "   
## [2] "\034God awful metal music\035. "                                                           
## [3] "Rap music is terrible and doesn't belong in Need for Speed! we only want Rock and Metal!. "
# 5 star
bert_example %>% filter(bert_label_numeric == 5) %>% pull(title_text) %>% print()
## [1] "i love evilly stimming to metal music!!!!!!!!. I flap my hands to the drum beat and I bounce around!!!! It\031s so lovely!!! I\031m stimming to disasterpiece - Live by Slipknot right now!!!!! So good!!!!!"
## [2] "Best Metal Music. \n\n[View Poll](https://www.reddit.com/poll/16z1nwx)"                                                                                                                                      
## [3] "What are the best hair metal music videos?. "
sentimentr_example <- reddit_sentiment %>%
  mutate(sentimentr_abs = abs(sentiment_dict),
         sentimentr_binary = case_when(sentiment_dict > 0 ~ 'positive',
                                       TRUE ~ 'negative')) %>% 
  group_by(sentimentr_binary) %>%
  arrange(desc(sentimentr_abs)) %>%
  slice_head(n = 3) %>%
  ungroup() %>% 
  arrange(sentiment_dict)

# negative
sentimentr_example %>% filter(sentimentr_binary == 'negative') %>% pull(title_text) %>% print()
## [1] "Suicidal Tendencies - Institutionalized - Official Music Video [Hardcore Punk/Thrash Metal]. "
## [2] "We\031re now at the \034metal music is evil\035 stage of the Satanic panic. "                 
## [3] "People who don't like metal music, why?. "
# positive
sentimentr_example %>% filter(sentimentr_binary == 'positive') %>% pull(title_text) %>% print()
## [1] "A cool guide I made to introduce classical music to rock and metal fans. "
## [2] "Heavy metal music isn\031t good. "                                        
## [3] "Liking heavy metal music linked to high intelligence. "

3. Visualization

Let’s visualize the sentiment analysis result from the BERT model.

3-1: Sentiment distribution

  • Number of threads by sentiment category.
reddit_sentiment %>% 
  ggplot(aes(x = bert_label)) + 
  geom_bar(fill = "white") +
  dark_theme_gray()

  • Word counts by sentiment category.
reddit_sentiment %>% 
  ggplot(aes(x = bert_label, y = word_count)) +
  geom_jitter(height = 0, width = 0.05) +
  stat_summary(fun = mean, geom = "crossbar", width = 0.4, color = "red") +
  dark_theme_gray()

  • Association between a thread’s sentiment and the number of comments on the thread.
reddit_sentiment_rm_outlier <- reddit_sentiment %>%
  group_by(bert_label) %>%
  filter(
    between(
      comments, 
      quantile(comments, 0.25) - 1.5 * IQR(comments), 
      quantile(comments, 0.75) + 1.5 * IQR(comments)))

cor.test(reddit_sentiment_rm_outlier$comments, reddit_sentiment_rm_outlier$bert_label_numeric)
## 
##  Pearson's product-moment correlation
## 
## data:  reddit_sentiment_rm_outlier$comments and reddit_sentiment_rm_outlier$bert_label_numeric
## t = -0.86207, df = 197, p-value = 0.3897
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.19870036  0.07845417
## sample estimates:
##         cor 
## -0.06130473
reddit_sentiment_rm_outlier %>% 
  ggplot(aes(x = bert_label_numeric, y = comments)) +
  geom_jitter(height = 0, width = 0.05) +
  geom_smooth(method = 'loess', span = 0.75) +
  dark_theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

3-2: Word clouds

Using word clouds, we can visualize words that are frequently seen in either positive or negative threads. Using the same code from the previous Rmd file, we tokenize the texts and remove stop words.

data("stop_words")
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

reddit_sentiment_clean <- reddit_sentiment %>% 
  mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
  unnest_tokens(word, title_text, token = "words") %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(str_detect(word, "[a-z]")) %>% 
  filter(!word %in% c('flu','shot','shots'))

We are not interested in words that are commonly seen in both positive and negative threads. We can identify words that are uniquely seen in either positive or negative threads using anti_join.

reddit_sentiment_clean_negative <- reddit_sentiment_clean %>% 
  filter(bert_label_numeric %in% c(1,2))
reddit_sentiment_clean_positive <- reddit_sentiment_clean %>% 
  filter(bert_label_numeric %in% c(4,5))

reddit_sentiment_clean_negative_unique <- reddit_sentiment_clean_negative %>% 
  anti_join(reddit_sentiment_clean_positive, by = 'word')
reddit_sentiment_clean_positive_unique <- reddit_sentiment_clean_positive %>%
  anti_join(reddit_sentiment_clean_negative, by = 'word')
  • Words appearing in negative threads
n <- 20
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark or bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))

reddit_sentiment_clean_negative_unique %>% 
  count(word, sort = TRUE) %>%
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)
  • Words appearing in positive threads
reddit_sentiment_clean_positive_unique %>% 
  count(word, sort = TRUE) %>%
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

3-3: Temporal analysis

# create new columns: date, year, day_of_week, is_weekend, time
reddit_sentiment %<>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date)) %>% 
  mutate(year = year(date),
         day_of_week = wday(date, label = TRUE),
         is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"),
         time = timestamp %>% 
           anytime(tz = anytime:::getTZ()) %>% 
           str_split('-| |:') %>% 
           sapply(function(x) as.numeric(x[4])))

Sentiment by year using a scatter plot.

reddit_sentiment %>% 
  ggplot(aes(x = date, y = bert_label_numeric)) +
  geom_jitter(width = 0, height = 0.05) +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(reddit_sentiment$date, na.rm = T), 
                                max(reddit_sentiment$date, na.rm = T), 
                                by = "1 year")) +
  dark_theme_grey()

Let’s change it to a stacked bar plot.

# sentiment by year
reddit_sentiment %>% 
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = 'stack') +
  scale_x_continuous(breaks = seq(min(reddit_sentiment$year), 
                                  max(reddit_sentiment$year), 
                                  by = 1)) +
  scale_fill_brewer(palette = 'PuRd', direction = -1) + 
  dark_theme_grey()

Set position = 'fill' to see the proportions.

# sentiment by year
reddit_sentiment %>% 
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = 'fill') +
  scale_x_continuous(breaks = seq(min(reddit_sentiment$year), 
                                  max(reddit_sentiment$year), 
                                  by = 1)) +
  scale_fill_brewer(palette = 'PuRd', direction = -1) + 
  dark_theme_grey()

Sentiment by day of week.

# sentiment by day
reddit_sentiment %>% 
  ggplot(aes(x = day_of_week, fill = bert_label)) +
  geom_bar(position = 'fill') +
  scale_fill_brewer(palette = 'PuRd', direction = -1) +
  dark_theme_grey()

Sentiment by time of day.

reddit_sentiment %>% 
  ggplot(aes(x = time, fill = bert_label)) +
  geom_histogram(bins = 24, position = 'fill', color = 'black', lwd = 0.2) +
  scale_x_continuous(breaks = seq(0, 24, by=1)) + 
  scale_fill_manual(values = c('#bc5090', '#bc5090', '#ff6361', '#ffa600', '#ffa600')) + 
  dark_theme_grey()