Instructions

In this assignment, you will download, analyze, and visualize Reddit threads based on a keyword of your choice. Specifically, you will be performing the following steps:

1. Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis.

I want to assess how people are discussing the ongoing Sudanese civil war, what the public sentiment on this issue is, and whether public interest has died off since other major conflicts, such as the Israel-Gaza conflict, began.

2. Search Reddit threads using a keyword of your choice.

-Specifying a subreddit for your search is optional.
-It is okay to combine data obtained by searching the keyword across multiple subreddits.
-You can choose any period, but ensure you gather a sufficient amount of data so that you can get meaningful results.
# Load all packages (install any that are missing beforehand)
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "sf", "here", "stringi", "sentimentr")
invisible(lapply(packages, library, character.only = TRUE))
# Grab Reddit data based on the keyword "sudanese civil war"
threads <- find_thread_urls(keywords = 'sudanese civil war', 
                              sort_by = 'relevance', 
                              period = 'all') %>% 
                              drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads) <- NULL

# Find the three subreddits with the most results, then search those three subreddits
# Show the top 10 in case there is a tie among the subreddits
threads$subreddit %>% table() %>% sort(decreasing = TRUE) %>% head(10)
## .
##             Sudan     AutoNewspaper      MilitaryPorn        newsbotbot 
##                16                10                10                 8 
##      GlobalPowers UMukhasimAutoNews         worldnews            Africa 
##                 7                 6                 6                 4 
##          autotldr     AskHistorians 
##                 4                 3

The three subreddits are Sudan, AutoNewspaper, and MilitaryPorn.
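The selection of the top three can also be done programmatically so that ties at the cutoff are visible rather than resolved by eye. The following is a minimal base R sketch, assuming a character vector of subreddit labels like `threads$subreddit`; the vector below simply re-creates the five largest counts from the table above, and `cutoff` and `top_subreddits` are illustrative names.

```r
# Stand-in for threads$subreddit, rebuilt from the five largest counts above
subreddit <- c(rep("Sudan", 16), rep("AutoNewspaper", 10),
               rep("MilitaryPorn", 10), rep("newsbotbot", 8),
               rep("GlobalPowers", 7))

counts <- sort(table(subreddit), decreasing = TRUE)

# Keep every subreddit whose count is at least the third-largest count,
# so a tie at the cutoff is not silently dropped
cutoff <- sort(as.integer(counts), decreasing = TRUE)[3]
top_subreddits <- names(counts)[counts >= cutoff]
print(top_subreddits)
```

If a fourth subreddit had also reached 10 results, it would appear in `top_subreddits` as well, which makes the tie explicit.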

# Search from the three subreddits
threads1 <- find_thread_urls(keywords= 'sudanese civil war', 
                              subreddit = 'Sudan', 
                              sort_by = 'relevance', 
                              period = 'all') %>% 
                              drop_na()

threads2 <- find_thread_urls(keywords= 'sudanese civil war', 
                              subreddit = 'AutoNewspaper', 
                              sort_by = 'relevance', 
                              period = 'all') %>% 
                              drop_na()

threads3 <- find_thread_urls(keywords= 'sudanese civil war', 
                              subreddit = 'MilitaryPorn', 
                              sort_by = 'relevance', 
                              period = 'all') %>% 
                              drop_na()

rownames(threads1) <- NULL
rownames(threads2) <- NULL
rownames(threads3) <- NULL

# Stack all three dataframes into one dataframe
stacked <- rbind(threads1, threads2, threads3)
write.csv(stacked, "SCW_data.csv", row.names = FALSE)
stacked <- read.csv("SCW_data.csv")
# Out of interest, we check whether posts have been decreasing as time went by
stacked_plot <- stacked %>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date))

stacked_plot %>% 
  ggplot(aes(x = date)) +
  # binwidth is one week expressed in seconds (7 * 24 * 60 * 60)
  geom_histogram(color = "black", position = "stack", binwidth = 604800) +
  # Set the limits inside scale_x_datetime(); a separate xlim() call would
  # replace this scale and discard the monthly breaks and labels
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(from = as.POSIXct("2023-04-15"), # Start of conflict
                                to = max(stacked_plot$date, na.rm = TRUE),
                                by = "1 month"),
                   limits = c(as.POSIXct("2023-04-15"),
                              max(stacked_plot$date, na.rm = TRUE))) +
  theme_minimal()
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

# Interestingly, discussion has increased over time rather than died off
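The weekly binning behind the histogram can be sketched in base R. This is only an illustration: the `date_utc` values below are made up, not taken from the scraped data, and ggplot2 may align its bins slightly differently, so the counts are roughly (not exactly) what the bars show.

```r
# Hypothetical post dates standing in for stacked$date_utc
date_utc <- c("2023-04-16", "2023-04-18", "2023-04-25", "2023-05-03", "2023-05-04")
dates <- as.Date(date_utc)

# One week in seconds, the value passed to binwidth for a POSIXct axis
week_seconds <- 7 * 24 * 60 * 60

# Whole weeks elapsed since the start of the conflict (2023-04-15)
start <- as.Date("2023-04-15")
week_index <- as.integer(dates - start) %/% 7

# Posts per week
weekly_counts <- table(week_index)
print(weekly_counts)
```

Comparing the counts of early and late weeks is a quick numeric check on whether interest is rising or falling, independent of how the plot is drawn.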

3. Clean your text data and then tokenize it.

# Get stop words
data("stop_words")

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

words_clean <- stacked %>% 
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Tokenization (word tokens)
  unnest_tokens(word, text, token = "words") %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # drop tokens that contain no letters at all (e.g. pure numbers)
  filter(str_detect(word, "[a-z]"))
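To make the cleaning steps concrete, here is a minimal base R sketch of the same idea on a single made-up sentence. It reuses the `replace_reg` pattern from above; the example text, the URLs, and the tiny `stop_list` are hypothetical, and `strsplit()` is only a rough stand-in for `unnest_tokens()`.

```r
# Same pattern as replace_reg above: URLs plus a few HTML entities
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Hypothetical example text
text <- "RSF &amp; SAF clash in Khartoum https://example.com/news/123 since 2023"

# 1. Drop URLs and HTML entities (perl = TRUE so \\d works inside brackets)
no_urls <- gsub(replace_reg, "", text, perl = TRUE)

# 2. Lowercase and split into word tokens
tokens <- strsplit(tolower(no_urls), "[^a-z0-9']+")[[1]]
tokens <- tokens[tokens != ""]

# 3. Drop stop words (tiny hypothetical list instead of tidytext's stop_words)
stop_list <- c("in", "since", "the", "and")
tokens <- tokens[!tokens %in% stop_list]

# 4. Drop tokens that contain no letters
tokens <- tokens[grepl("[a-z]", tokens)]
print(tokens)

# The URL pattern stops at characters outside its class, such as "-" or "?"
leftover <- gsub(replace_reg, "", "https://example.com/a-b?id=1", perl = TRUE)
print(leftover)
```

The last line shows a limitation worth keeping in mind: URLs containing hyphens, underscores, or query strings are only partly removed, so fragments of them can survive into the token counts.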

# Let us see the 20 most frequent words
words_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Word counts (top 20)")

4. Generate a word cloud that illustrates the frequency of words except your keyword.

# Word cloud
n <- 20
h <- runif(n, 0, 1) 
s <- runif(n, 0.6, 1) 
v <- runif(n, 0.3, 0.7) 

# Delete keywords 
# I got rid of 'sudan' too
words_without_keyword <- words_clean %>% filter(!word %in% c('sudanese', 'civil', 'war', 'sudan'))

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))
words_without_keyword %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8) 

5. Conduct a tri-gram analysis.

-Extract tri-grams from your text data.
-Remove tri-grams containing stop words or non-alphabetic terms.
-Present the frequency of tri-grams in a table.
-Discuss any noteworthy tri-grams you come across.
-If no meaningful tri-grams are found, you may analyze bi-grams as well. However, you still need to show results of the tri-grams.
# Get tri-gram
trigram <- stacked %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = words,
                input = text,
                token = "ngrams",
                n = 3)

# Split each trigram into its three component words
trigram_words <- trigram %>%
  separate(words, c("word1", "word2", "word3"), sep = " ")

# Drop trigrams that contain a stop word or a token without any letters
trigram_words_filtered <- trigram_words %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))

# Filter out words that are not encoded in ASCII
trigram_words_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# Sort the trigram counts:
words_counts <- trigram_words_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% 
  knitr::kable()
|word1         |word2        |word3      |  n|
|:-------------|:------------|:----------|--:|
|rapid         |support      |forces     | 21|
|sudanese      |civil        |war        | 12|
|support       |forces       |rsf        | 10|
|1j            |h’d          |jj1        |  8|
|sudanese      |armed        |forces     |  8|
|de            |d3           |d93c1j     |  7|
|sudan         |civil        |war        |  7|
|abdel         |fattah       |al         |  6|
|armed         |forces       |saf        |  6|
|paramilitary  |rapid        |support    |  6|
|transitional  |military     |council    |  6|
|fattah        |al           |burhan     |  5|
|ordinary      |sudanese     |citizens   |  5|
|united        |arab         |emirates   |  5|
|african       |civil        |war        |  4|
|d3            |d93c1j       |d’f        |  4|
|d93c1j        |d’f          |b’dj       |  4|
|omar          |al           |bashir     |  4|
|addis         |ababa        |agreement  |  3|
|comprehensive |peace        |agreement  |  3|

Setting aside the gibberish entries (probably mis-encoded text) and the obvious “sudanese civil war,” several trigrams are informative. The Rapid Support Forces (rapid support forces, support forces rsf, paramilitary rapid support) and the Sudanese Armed Forces (sudanese armed forces, armed forces saf) are the two sides fighting each other, so it is no surprise that they produce the most frequent trigrams. Abdel Fattah al-Burhan is currently the de facto ruler of Sudan, so his name appears often, split across abdel fattah al and fattah al burhan. Omar al-Bashir is the former president of Sudan, who was replaced by the Transitional Military Council, so both omar al bashir and transitional military council are historical references. The two agreements, addis ababa agreement and comprehensive peace agreement, concluded the First and Second Sudanese Civil Wars, respectively. I suppose ordinary sudanese citizens and african civil war are general phrases used when discussing the conflict. The most interesting find is united arab emirates: what is the role of the UAE in the conflict?
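The overlap between entries such as rapid support forces and support forces rsf follows directly from how trigrams are built: a window of three words slides one word at a time. A minimal base R sketch of that sliding window and the stop-word filter is shown here; the token vector and `stop_list` are made up for illustration and this is not how `unnest_tokens()` is implemented internally.

```r
# Hypothetical token sequence and a tiny stop list
tokens <- c("the", "rapid", "support", "forces", "rsf", "and",
            "sudanese", "armed", "forces")
stop_list <- c("the", "and")

# Slide a window of n = 3 words over the tokens, one word at a time
n <- 3
starts <- seq_len(length(tokens) - n + 1)
trigrams <- t(sapply(starts, function(i) tokens[i:(i + n - 1)]))
colnames(trigrams) <- c("word1", "word2", "word3")

# Keep only trigrams in which none of the three words is a stop word
keep <- apply(trigrams, 1, function(w) !any(w %in% stop_list))
filtered <- trigrams[keep, , drop = FALSE]
print(filtered)
```

Because consecutive windows share two words, one long name (for example a four-word phrase) always contributes two trigrams, which is why a person's name or an organization can occupy several rows of the frequency table.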

6. Perform a sentiment analysis on your text data using a dictionary method that accommodates negations.

-You are welcome to apply a deep learning-based model to enrich your analysis, but employing the dictionary method is imperative.
# Create empty list for each entry
all_sentence_score <- list()

# Perform sentiment analysis using sentimentr
# Store the word count, sentiment score, and sentence in a dataframe
for (i in seq_along(stacked$text)) {
  # `scores` avoids shadowing the sentiment() function
  scores <- sentiment(stacked$text[i])
  scores$sentence <- get_sentences(stacked$text[i])
  all_sentence_score[[i]] <- scores
}

final_sentence_score <- bind_rows(all_sentence_score)
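To illustrate what "a dictionary method that accommodates negations" means, here is a deliberately simplified base R sketch. It is not sentimentr's actual algorithm, which also handles amplifiers, de-amplifiers, and adversative conjunctions within a context window; the `lexicon`, `negators`, and `score_sentence()` below are hypothetical and only show the core idea of flipping polarity after a negator.

```r
# Hypothetical mini-lexicon and negator list (real lexicons are far larger)
lexicon  <- c(good = 1, peace = 1, brutal = -1, war = -1)
negators <- c("not", "no", "never")

score_sentence <- function(sentence) {
  words <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words <- words[words != ""]
  if (length(words) == 0) return(NA_real_)
  total <- 0
  for (i in seq_along(words)) {
    polarity <- lexicon[words[i]]
    if (is.na(polarity)) next            # word is not in the lexicon
    # Flip the sign when the preceding word is a negator
    if (i > 1 && words[i - 1] %in% negators) polarity <- -polarity
    total <- total + polarity
  }
  # Normalise by the square root of the word count
  unname(total / sqrt(length(words)))
}

print(score_sentence("This is good"))
print(score_sentence("This is not good"))
```

A plain word-counting approach would score both sentences as positive; checking the preceding word is the smallest change that separates them.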

7. Display 10 sample texts alongside their sentiment scores and evaluate the credibility of the sentiment analysis outcomes.

# print the results of the top and bottom 10 sentences by sentiment score  
good_sentiment <- final_sentence_score %>% 
  arrange(desc(sentiment)) %>% 
  head(10) %>% 
  select(word_count, sentiment, sentence) 
  
print(good_sentiment)
##     word_count sentiment
##          <int>     <num>
##  1:         28 1.2378336
##  2:         24 1.0369507
##  3:         22 1.0340235
##  4:         22 0.9977794
##  5:         33 0.9400193
##  6:         26 0.9315516
##  7:         52 0.9221891
##  8:         11 0.9196096
##  9:          8 0.9015611
## 10:         41 0.8808200
##                                                                                                                                                                                                                                                                                                                                                 sentence
##                                                                                                                                                                                                                                                                                                                                                   <char>
##  1:                                                                                                                                             Yusuf also pledged and assured that the civil administration is committed to enhancing security, stability, and service delivery, while also focusing on humanitarian aid and fostering community peace.
##  2:                                                                                                                                                                            Strengthening democratic institutions and processes, such as free and fair elections and an independent judiciary, could help build a more stable and prosperous society.
##  3:                                                                                                                                                                                            Honorable people, because of the spirit of determination and determination to achieve change with civilized tools and methods of awareness and attention.
##  4:                                                                                                                                                                                          :**\n\nI've invested more than 80$ on whatever Sudanese arabic learning books were available (the beginner book and the concise dictionary made by Parson).
##  5:                                                                                                                           Apparently it's pretty close to MSA:**\n\n(yes, I know every Arabic speakers say their own dialect is, but it's seems that the general consensus shifts more towards Levantine, Sudanese and Peninsular variants, right?).
##  6:                                                                                                                                                    Improving financial literacy: Fintech can also help improve financial literacy by providing tools and resources that help individuals and businesses better understand and manage their finances.
##  7: \\[EDIT\\] Here is summay of the speech\n\nIn his historic 2005 speech during the signing of the Comprehensive Peace Agreement (CPA) in Sudan, John Garang, the leader of the Sudan People's Liberation Movement/Army (SPLM/A), delivered a powerful address emphasizing reconciliation, unity, and the importance of peace for the people of Sudan.
##  8:                                                                                                                                                                                                                                                                                               I tried to promote a fundraiser but none gives a fuck.
##  9:                                                                                                                                                                                                                                                                                                             Please keep it civil and respectful =O<ÿ
## 10:                                                                                                                                  That is because it is easy for parents to say things, and it is easy for us to believe them right up until your American boyfriend is looking your father in the eye explaining how much he respects and loves you.

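Some sentences score above 1. This is expected: sentimentr does not cap its scores at 1, because it divides the summed, weighted polarity of a sentence by the square root of its word count, so a sentence packed with positive terms can exceed 1. Most of the top-scoring sentences are genuinely positive, and several talk about peace, stability, and reconciliation. The results are not fully credible, though. Sentence 8 ("I tried to promote a fundraiser but none gives a fuck") is clearly negative yet scores 0.92, and sentences 4 to 6 are off-topic (language learning and fintech) rather than about the war.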

bad_sentiment <- final_sentence_score %>% 
  arrange(sentiment) %>% 
  head(10) %>% 
  select(word_count, sentiment, sentence) 
  
print(bad_sentiment)
##     word_count  sentiment
##          <int>      <num>
##  1:         44 -1.4452193
##  2:         49 -1.1171429
##  3:          1 -1.0000000
##  4:         15 -0.9295160
##  5:         37 -0.9020408
##  6:          5 -0.8944272
##  7:         29 -0.8913376
##  8:         27 -0.8803437
##  9:         10 -0.8696264
## 10:         22 -0.8607979
##                                                                                                                                                                                                                                                                                                                sentence
##                                                                                                                                                                                                                                                                                                                  <char>
##  1:                                                     ***Disclaimer,*** I am pro-palestinian but I find it quite disrespectful that on a post discussing a photo being circulated of a woman being brutally raped by two men in Sudan, a bulk of the comments are squabbling about the Israel and Paelstine conflict.
##  2: A brutal war, which by all accounts is the largest humanitarian crisis in the history of Sudan with some of the most bone chilling crimes against humanity being committed mainly against south Sudanese people but unfortunately most is unknown due to intentional covering of truths by the Sudanese government.
##  3:                                                                                                                                                                                                                                                                                                           Genocide?
##  4:                                                                                                                                                                                             Unfortunately this started a massive conflict between 20,000 RSF soldiers and 42 Sudanese soldiers at Khartoum airport.
##  5:                                                                                                                          I don't know how long this war will last and adoption is a lengthy process, I want to find a way to move forward but I've lost my contacts in Khartoum as everyone fled from the conflict.
##  6:                                                                                                                                                                                                                                                                                           Too many to count really.
##  7:                                                                                                                             In light of the worsening crises Sudan is currently suffering from, especially after the war of April 2023\024whose end remains unknown\024it is necessary to pause and reflect deeply.
##  8:                                                                                                                                                I studied Spanish, Italian and Swedish (reaching conversational fluency but ended up losing most of it due to lack of exposure since all 3 were mostly self-taught).
##  9:                                                                                                                                                                                                                                     Performed rape, torture and killings of civilians and destroyed infrastructure.
## 10:                                                                                                                                                                                                &gt;There is no accurate data on how many have been killed, but death toll estimates run into the tens of thousands.

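I would say the sentiment analysis is more accurate for the negative sentences. Scores below -1 are possible because the score is normalised by the square root of the word count rather than capped at -1. Most of these sentences are about war, rape, and deaths, so their negative scores make sense. Sentence 8, about studying languages, and sentence 6 ("Too many to count really.") are less convincing.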
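The arithmetic behind scores outside the range from -1 to 1 can be checked by hand. The sketch below ignores valence shifters and assumes every polarised word has weight -1, which is a simplification of what sentimentr does; `raw_score()` is a hypothetical helper, not a sentimentr function.

```r
# k negative words of weight -1 in a sentence of n words,
# normalised by the square root of the word count
raw_score <- function(k, n) -k / sqrt(n)

one_word   <- raw_score(1, 1)    # a one-word sentence such as "Genocide?"
five_words <- raw_score(2, 5)    # two negative terms in five words
long_dense <- raw_score(8, 44)   # hypothetical: eight negative terms in 44 words

print(c(one_word, five_words, long_dense))
```

The first two values are consistent with the -1.0000000 and -0.8944272 reported above for the one-word and five-word sentences. The third shows that the score drops below -1 as soon as the number of negative terms exceeds the square root of the word count.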

8. Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least two plots.

Does word count affect the sentiment?
# Drop rows with a missing word count
final_sentence_score <- final_sentence_score %>% drop_na(word_count)

ggplot(final_sentence_score, aes(x = sentiment, y = word_count)) +
  geom_point() +
  labs(title = "Does word count affect the sentiment?",
       x = "Sentiment",
       y = "Word Count") +
  theme_minimal()

Longer sentences seem to score closer to neutral. This makes sense: a longer sentence is more likely to contain both positive and negative words that cancel each other out, and the score is divided by the square root of the word count.

What are unique words in positive and negative threads?
# Tokenize positive and negative sentences for the word clouds
positive_threads <- final_sentence_score %>% 
  # Keep sentences with a positive score
  filter(sentiment > 0) %>% 
  # drop URLs
  mutate(text = str_replace_all(sentence, replace_reg, "")) %>%
  # Tokenization (word tokens) of the cleaned text, not the raw sentence
  unnest_tokens(word, text, token = "words") %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # drop tokens that contain no letters at all
  filter(str_detect(word, "[a-z]")) %>%
  # Clean keywords
  filter(!word %in% c('sudanese','civil','war', 'sudan'))

negative_threads <- final_sentence_score %>% 
  # Keep sentences with a negative score
  filter(sentiment < 0) %>% 
  # drop URLs
  mutate(text = str_replace_all(sentence, replace_reg, "")) %>%
  # Tokenization (word tokens) of the cleaned text, not the raw sentence
  unnest_tokens(word, text, token = "words") %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # drop tokens that contain no letters at all
  filter(str_detect(word, "[a-z]")) %>%
  # Clean keywords
  filter(!word %in% c('sudanese','civil','war', 'sudan'))

# Get unique positive and negative words
positive_unique <- positive_threads %>%
  anti_join(negative_threads, by = 'word')

negative_unique <- negative_threads %>%
  anti_join(positive_threads, by = 'word')
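What the two `anti_join()` calls do can be shown with plain vectors: a word is kept only if it never appears on the other side, and repeated occurrences are preserved so they can still be counted. This is a base R sketch with made-up word vectors, not the actual tokens from the data.

```r
# Hypothetical tokens from positive and negative sentences
positive_words <- c("peace", "aid", "peace", "forces", "stability")
negative_words <- c("killing", "forces", "corrupt", "killing")

# Keep words that never occur on the other side; duplicates are preserved
positive_only <- positive_words[!positive_words %in% negative_words]
negative_only <- negative_words[!negative_words %in% positive_words]

print(table(positive_only))
print(table(negative_only))
```

A word such as "forces" that appears even once on both sides is removed from both clouds, so the clouds show only vocabulary exclusive to one polarity, not vocabulary that is merely more common in it.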

# Print word cloud
positive_unique %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

In the words unique to positive sentences, we see terms that have little association with war.

negative_unique %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

In the negative threads, we do see some negative words like betrayed, corrupt, and killing.