# I aim to examine changes in sentiment towards Donald Trump over the past 12 months using user-generated text data and sentiment analysis.
# Package names
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "sf", "tmap", "here", "sentimentr")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
1-1: Downloading Reddit threads
Let’s first find threads using a keyword. Choose your own keyword. Check what the output object looks like.
# using keyword
threads_1 <- find_thread_urls(keywords = 'Trump',
                              sort_by = 'relevance',
                              period = 'all') %>%
  drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads_1) <- NULL
colnames(threads_1)
## [1] "date_utc" "timestamp" "title" "text" "subreddit" "comments"
## [7] "url"
head(threads_1, 3) %>% knitr::kable()
date_utc | timestamp | title | text | subreddit | comments | url |
---|---|---|---|---|---|---|
2024-10-01 | 1727777200 | What dudes who vote for Trump think off | | MurderedByWords | 1649 | https://www.reddit.com/r/MurderedByWords/comments/1ftkhyg/what_dudes_who_vote_for_trump_think_off/ |
2024-11-06 | 1730880839 | What the polls were not reflecting: many Texas latinos voting for Trump | Even the RGV, a traditional Democrat stronghold, had a poor showing on the vote for Harris. Lesson learned: in the future, I will not trust pre-election polls conducted by folks who are not on the ground gathering real data from real people. | texas | 3377 | https://www.reddit.com/r/texas/comments/1gktmu7/what_the_polls_were_not_reflecting_many_texas/ |
2024-08-20 | 1724118549 | AOC Tears Into Donald Trump At the DNC | | interestingasfuck | 8231 | https://www.reddit.com/r/interestingasfuck/comments/1ewjke1/aoc_tears_into_donald_trump_at_the_dnc/ |
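As a side note, find_thread_urls() can also browse a specific subreddit instead of searching by keyword. The sketch below is only an illustration; the subreddit name and period are placeholders you would swap for your own.
# Hypothetical alternative: pull top threads from a single subreddit rather than a keyword search
threads_alt <- find_thread_urls(subreddit = 'politics',
                                sort_by = 'top',
                                period = 'month') %>%
  drop_na()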
2. Tokenization and stop words
2-1: Tokenization
Tokenization is the fundamental starting point in any Natural Language Processing (NLP) pipeline. “Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization” (direct quotes from here).
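As a quick illustration of these three levels, here is a minimal sketch on a toy string (the toy tibble and its column names are invented for illustration and are not part of the Reddit data):
# Word, character, and subword (character n-gram) tokenization on a toy string
toy <- tibble(text = "Trump rally")
toy %>% unnest_tokens(output = word, input = text, token = "words")        # word tokens: "trump", "rally"
toy %>% unnest_tokens(output = chr, input = text, token = "characters")    # character tokens: "t", "r", "u", ...
toy %>% unnest_tokens(output = shingle, input = text, token = "character_shingles", n = 3)  # subword tokens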
# Word tokenization
words <- threads_1 %>%
  unnest_tokens(output = word, input = text, token = "words")

words %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")
## Selecting by n
2-2: Stop words
Those words are so-called stop words – words that are commonly filtered out in the process of NLP because they are considered to be of little value in text analysis. These are typically common words that do not carry significant meaning on their own, such as articles, pronouns, conjunctions, and prepositions.
We will remove the stop words using a common stop words data set provided by the tidytext package.
# load list of stop words - from the tidytext package
data("stop_words")
# view 100 random words
print(stop_words$word[sample(1:nrow(stop_words), 100)])
## [1] "other" "can" "second" "different"
## [5] "our" "com" "weren't" "right"
## [9] "s" "we're" "here's" "everything"
## [13] "under" "seemed" "there" "for"
## [17] "won't" "where" "whole" "being"
## [21] "because" "if" "take" "regardless"
## [25] "above" "over" "same" "while"
## [29] "because" "ours" "anyhow" "ie"
## [33] "per" "needs" "things" "isn't"
## [37] "nothing" "what's" "across" "when"
## [41] "with" "down" "above" "maybe"
## [45] "newer" "l" "downs" "several"
## [49] "given" "thought" "rd" "out"
## [53] "noone" "asking" "he'd" "said"
## [57] "may" "herself" "unfortunately" "cannot"
## [61] "has" "eg" "large" "he"
## [65] "inward" "which" "either" "must"
## [69] "your" "longest" "they" "latterly"
## [73] "it" "they'd" "an" "very"
## [77] "somebody" "he's" "group" "general"
## [81] "on" "presumably" "changes" "further"
## [85] "always" "let's" "ask" "something"
## [89] "provides" "ltd" "too" "would"
## [93] "furthers" "opens" "much" "elsewhere"
## [97] "contains" "together" "instead" "himself"
We can use the anti_join() function for this filtering. This function is part of the dplyr package. It removes stop words from the text, and we save the result as the cleaned words. As with other join functions, the order of the data frames in the arguments matters: anti_join(A, B) gives you everything from List A except the rows that also appear in List B, while anti_join(B, A) gives you everything in List B except the rows that also appear in List A.
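For instance, a minimal sketch with hypothetical toy data (the tibbles A and B below are invented solely to illustrate the argument order):
# Toy illustration of anti_join(): keep rows of the first table whose key does not appear in the second
A <- tibble(word = c("trump", "the", "election", "and"))
B <- tibble(word = c("the", "and"))   # pretend this is a stop-word list
anti_join(A, B, by = "word")          # returns "trump" and "election"
anti_join(B, A, by = "word")          # returns nothing here, since both words in B also appear in A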
# Regex that matches URL-type strings
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&|<|>"

words_clean <- threads_1 %>%
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Tokenization (word tokens)
  unnest_tokens(word, text, token = "words") %>%
  # drop stop words
  anti_join(stop_words, by = "word") %>%
  # keep only tokens that contain at least one letter
  filter(str_detect(word, "[a-z]"))
# Check the number of rows after removal of the stop words. There should be fewer words now
print(
glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 5494, After: 1944
Let’s create the plot using the cleaned version.
words_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")
# Filtering out the keyword "Trump"
words_clean %>%
  count(word, sort = TRUE) %>%
  filter(word != "trump") %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")
3. Word cloud
n <- 20
h <- runif(n, 0, 1)     # any hue
s <- runif(n, 0.6, 1)   # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright
df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))

words_clean %>%
  filter(word != "trump") %>%
  count(word, sort = TRUE) %>%
  wordcloud2(color = pal,
             minRotation = 0,
             maxRotation = 0,
             ellipticity = 0.9)
4. N-grams
An n-gram is a sequence of n words appearing together. For example, 'basketball coach' and 'dinner time' are two words occurring together, so they are called bi-grams. Similarly, 'the three musketeers' is a tri-gram, and 'she was very hungry' is a 4-gram. We will learn how to extract n-grams from the Reddit text data, which will give further insights into the Reddit corpus. For advanced text analysis and machine-learning-based labeling, specific tokens and n-grams can be used for feature engineering of the text data.
N-grams are used to analyze words in context. When we say (1) "We need to check the details." and (2) "Can we pay it with a check?", the word 'check' is used as a verb in the first sentence and as a noun in the second. We know what 'check' means in a sentence based on the other words around it, particularly the words immediately before and after 'check.' For example, if the word 'check' appears after 'to', we can infer that it is used as a verb. You can test bi-grams (2 words), tri-grams (3 words), and so on.
For example, "The result of separating bigrams is helpful for exploratory analyses of the text." becomes:
paired_words
1 the result
2 result of
3 of separating
4 separating bigrams
5 bigrams is
6 is helpful
7 helpful for
8 for exploratory
9 exploratory analyses
10 analyses of
11 of the
12 the text
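The list above can be reproduced with unnest_tokens() using token = "ngrams" and n = 2; a minimal sketch follows (the one-row tibble holds the example sentence and is not part of threads_1):
# Bigram tokenization of the example sentence
tibble(text = "The result of separating bigrams is helpful for exploratory analyses of the text.") %>%
  unnest_tokens(output = paired_words, input = text, token = "ngrams", n = 2)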
# Get trigrams
words_ngram <- threads_1 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

# Show ngrams with sorted values
words_ngram %>%
  count(paired_words, sort = TRUE) %>%
  head(20) %>%
  knitr::kable()
paired_words | n |
---|---|
NA | 212 |
the white house | 6 |
voted for trump | 6 |
a lot of | 5 |
defeats kamala harris | 4 |
i don t | 4 |
trump defeats kamala | 4 |
be able to | 3 |
blah blah blah | 3 |
don t know | 3 |
donald trump defeats | 3 |
he will be | 3 |
i had a | 3 |
i show him | 3 |
the amount of | 3 |
the majority of | 3 |
they have been | 3 |
we are the | 3 |
we go to | 3 |
we had to | 3 |
Here we see the n-grams contain stop words such as *a*, *to*, etc. Next we will try to obtain n-grams that occur without stop words. We will use the separate() function from the tidyr package to split the paired words into three columns, i.e. word1, word2, and word3. Subsequently we filter out rows containing stop words using the filter() function.
# separate the paired words into three columns
words_ngram_trio <- words_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# filter out rows that contain a stop word in the word1, word2, or word3 column
words_ngram_filtered <- words_ngram_trio %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>%
  # keep only tokens that contain at least one letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))

# Filter out words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
library(stringi)
words_ngram_filtered %<>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# Sort the new tri-gram (n=3) counts:
words_counts <- words_ngram_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>%
  knitr::kable()
word1 | word2 | word3 | n |
---|---|---|---|
defeats | kamala | harris | 4 |
trump | defeats | kamala | 4 |
blah | blah | blah | 3 |
donald | trump | defeats | 3 |
city | eric | adams | 2 |
donald | trump | wins | 2 |
food | voucher | program | 2 |
immigrants | phone | call | 2 |
kamala | harris | wins | 2 |
phone | call | trump | 2 |
program | immigrants | phone | 2 |
trump | wins | election | 2 |
trump | wins | presidency | 2 |
voucher | program | immigrants | 2 |
white | house | trump | 2 |
white | women | voted | 2 |
women | voted | trump | 2 |
york | city | eric | 2 |
3rd | term | president | 1 |
47th | president | thetimes.com | 1 |
# The top trigram here is about Trump defeating Kamala Harris. Also noteworthy are the trigrams about a voucher program for immigrants and about women who voted for Trump.
By using the igraph and ggraph libraries, we can visualize the co-occurring words as a network.
# plot word network
words_counts %>%
  filter(n >= 2) %>%
  graph_from_data_frame() %>%  # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = n), edge_alpha = 0.6) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
5. Sentiment analysis
library(dplyr)
library(text)
## Warning: package 'text' was built under R version 4.3.3
## This is text (version 1.2.3).
## Text is new and still rapidly improving.
## 
## Newer versions may have improved functions and updated defaults to reflect current understandings of the state-of-the-art.
## Please send us feedback based on your experience.
## 
## For more information about the package see www.r-text.org.
# Join thread title and text.
trump_sentiment <- threads_1 %>%
  mutate(title = replace_na(title, ""),
         text = replace_na(text, ""),
         title_text = str_c(title, text, sep = ". "))

# sentiment_by() from the sentimentr package returns the average sentence-level sentiment and word count for each thread
trump_sentiment_dictionary <- sentiment_by(trump_sentiment$title_text)
trump_sentiment$sentiment_dict <- trump_sentiment_dictionary %>% pull(ave_sentiment)
trump_sentiment$word_count <- trump_sentiment_dictionary %>% pull(word_count)
# Define positive and negative sentiment
sentimentr_example <- trump_sentiment %>%
  mutate(sentimentr_abs = abs(sentiment_dict),
         sentimentr_binary = case_when(sentiment_dict > 0 ~ 'positive',
                                       TRUE ~ 'negative')) %>%
  group_by(sentimentr_binary) %>%
  arrange(desc(sentimentr_abs)) %>%
  ungroup() %>%
  arrange(sentiment_dict)

sentimentr_example_filtered <- sentimentr_example %>%
  select(title_text, sentiment_dict, sentimentr_binary) %>%
  mutate(title_text = str_trim(str_replace_all(title_text, "[^[:print:][:space:]]", ""))) %>% # delete non-printable special characters
  mutate(title_text = str_trim(str_replace_all(title_text, "[[:punct:]]+$", ""))) %>% # delete trailing punctuation marks
  distinct(title_text, .keep_all = TRUE) %>%
  head(10)
print(sentimentr_example_filtered)
## # A tibble: 10 × 3
## title_text sentiment_dict sentimentr_binary
## <chr> <dbl> <chr>
## 1 "Mexican Presidents Harsh Takedown of Trump… -0.794 negative
## 2 "Team Trump Panics as Hell Breaks Loose in … -0.707 negative
## 3 "Stock Market Tanks as Trump Unveils Nightm… -0.617 negative
## 4 "Donald Trump Jr. taunts Zelenskyy about lo… -0.617 negative
## 5 "\"This is really terrifying\": Trump cabin… -0.596 negative
## 6 "[Anti Trump post] Donald Trump has cancell… -0.452 negative
## 7 "Trump confirms plans to declare national e… -0.419 negative
## 8 "Anti-Trump billboards from around the US" -0.416 negative
## 9 "Tim Walz loses home county to Trump" -0.416 negative
## 10 "Harris defeats Trump in Virginia" -0.402 negative
# The results predominantly show negative sentiment in the threads, which aligns somewhat with public sentiment surrounding Trump's role in the government. However, the predominantly negative sentiment was not reflected in people's voting behavior, since he was reelected by a large margin.
# Intriguing insights derived from the sentiment analysis are reflected in the bar chart and histogram. The bar chart reveals that there are 149 negative threads (61% of the total) and 95 positive threads (39% of the total), which highlights the predominance of negative sentiment regarding Trump during the last 12 months; however, this was especially concentrated from September onward, when the majority of these posts were generated, leading up to the election in November. The histogram shows that the sentiment scores of the threads are generally skewed toward zero; this suggests that the overall sentiment of threads about Trump is somewhat negative, but not strongly negative.
# Calculate counts and proportions
sentimentr_example_summary <- sentimentr_example %>%
  group_by(sentimentr_binary) %>%
  summarise(count = n(),
            total = nrow(sentimentr_example),
            proportion = count / total)
# Bar chart: No. of threads by binary sentiment
bar_plot <- ggplot(sentimentr_example, aes(x = sentimentr_binary, fill = sentimentr_binary)) +
  geom_bar() +
  geom_text(data = sentimentr_example_summary,
            aes(x = sentimentr_binary,
                y = count + 5,
                label = paste0(count, " (", scales::percent(proportion), ")")),
            size = 4,
            color = "black",
            fontface = "bold") +
  labs(title = "Bar Chart: No. of Threads by Binary Sentiment",
       x = "Binary Sentiment", y = "No. of threads") +
  scale_fill_manual(values = c("red", "purple"), guide = "none") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 12),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "none")
print(bar_plot)
# Histogram: No. of threads by sentiment score
histogram_plot <- ggplot(sentimentr_example, aes(x = sentiment_dict, fill = sentimentr_binary)) +
  geom_histogram(binwidth = 0.03, alpha = 0.5, position = "identity", color = "black") +
  labs(title = "Histogram: No. of Threads by Sentiment Score",
       x = "Sentiment Score", y = "No. of Threads") +
  scale_fill_manual(values = c("red", "purple"), guide = "none") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 12),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "none")
print(histogram_plot)
6. Analysis of posting date/time
threads_1 %<>%
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date))

# number of threads by week
threads_1 %>%
  ggplot(aes(x = date)) +
  geom_histogram(color = "black", position = 'stack', binwidth = 7*24*60*60) +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(threads_1$date, na.rm = T),
                                max(threads_1$date, na.rm = T),
                                by = "2 month")) +
  theme_minimal()