Library setup

# Package names
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "here", "sentimentr","devtools")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

# helper to clean thread text fields
sanitize_threads <- function(df) {
  df %>%
    mutate(across(
      where(is.character),
      ~ .x %>%
        str_replace_all("\\|", "/") %>%
        str_replace_all("\\n", " ") %>%
        str_squish()
    ))
}

# main keyword and subreddit list used throughout
keywords_main <- c("stealing")
atl_subs <- c("Atlanta")

# create output directory for saved data/plots
out_dir <- here::here("outputs")
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)

DISCLAIMER: Due to the uncensored nature of online communities, some Reddit posts may contain images or language that are not suitable for work or school. You may also encounter controversial content. This is part of studying urban analytics, as online platforms often reflect real social dynamics. That said, please keep an open mind, and I hope no one feels offended or uncomfortable with what we might come across.

Introduction and Goals

In this analysis I examine how Atlanta Reddit users talk about theft-related experiences, combining user-generated posts with dictionary-based sentiment scoring and a BERT model to understand tone and themes. I use the keyword "stealing" to explore how Redditors discuss property crime in Atlanta.

Model Process

1. Collecting Reddit Data

Downloading Reddit threads
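At knit time I load a cached copy of the keyword search results, but that cache was originally produced with RedditExtractoR's find_thread_urls(). A minimal sketch of that original pull follows; the sort_by and period arguments are my assumptions.

# Sketch (not re-run at knit time): keyword search across Reddit that produced threads_1
threads_1 <- find_thread_urls(keywords = keywords_main, sort_by = "top", period = "year")
readr::write_csv(threads_1, file.path(out_dir, "threads_1.csv"))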

# knit-time: load cached keyword search results (threads_1)
threads_1 <- readr::read_csv("C:/Users/qduong7/OneDrive - Georgia Institute of Technology/Documents/Team project (urban analytics)/outputs/threads_1.csv", show_col_types = FALSE) %>%
  drop_na() %>%
  sanitize_threads()
colnames(threads_1)
## [1] "date_utc"  "timestamp" "title"     "text"      "subreddit" "comments" 
## [7] "url"
head(threads_1, 3) %>% knitr::kable()
date_utc timestamp title text subreddit comments url
2025-07-08 1752003037 [Omega Speedmaster] Got a Moonwatch for an absolute steal in Japan. I know this is kinda a bragging post but I wanted to tell somebody that can understand my excitement. I spent 3 weeks in Japan, about 2 weeks in a campervan and a couple days in Tokyo. During our travels we visited quite a few second hand stores, mainly looking for anime figures, old consoles and also checking out the watches they had in stock. In regard to watches, most stores were priced pretty much identical to online prices, if not slightly more expensive. On one of our last days in Tokyo we went to a Book Off location outside of the city centre which had a relatively large watch selection. They had an Omega Speedmaster 50th Anniversary, which seemed to be priced somewhat competitively (around 3,600¬). Only upon my second inspection did I notice there was actually a 40% off hangtag. Could this be? I translated the text and it literally said 40% off listed price. So yeah& I wasnt actually in the market for a Moonwatch, but at that price it was a no-brainer. I ended up paying only 2,160¬ for the watch, which was less than any other Speedmaster I could find online. Sadly, no box and papers (only a dinged up wooden Omega box), but recently polished and serviced. It runs great at +3 seconds/day. Its not the best polishing job in the world, but honestly not bad. I have no clue why it was sitting there at that price and why nobody bought it. Watches 346 https://www.reddit.com/r/Watches/comments/1luxwqh/omega_speedmaster_got_a_moonwatch_for_an_absolute/
2025-05-20 1747778662 Steal my seat, how about a crop dusting I (33F) just got off a 7 hour cross country flight and I wanted to share this story before I go pass out. I was flying Southwest so there are no assigned seats and I was in B boarding group. I went towards the back of the plane so I could get an aisle seat since Im fairly tall. It was a completely full flight so I knew there would be someone sitting next to be. Enter Entitled Karen (EK). She had waited to check in so she was at the back of the C boarding group. She was pissy because at first she couldnt find anywhere to put her massive roller bag and the flight attendants were trying to get her to gate check it. Then she came up to my row and I thought great, shes gonna take the middle seat. The following dialogue (to the best of my sleepy recollection) occurred: EK: Move! OP: what? EK: I paid for an aisle seat and you are sitting in my seat. OP: (more confused) what?? EK: (practically screaming) I SAID I PAID FOR AN AISLE SEAT AND YOU ARE SITTING IN MY SEAT OP: This is Southwest, there are no assigned seats EK: Are you stupid, I said I need the aisle seat. OP: I can get up to give you the middle seat if you want I stupidly thought that was the end of it when a flight attendant came over and asked her to take a seat. I got up to let her into the row and she immediately sat down in the aisle seat. I tried to protest to the flight attendant but EK kept loudly saying she paid for an aisle seat (I know Southwest is switching to assigned seating soon but they havent implemented it yet). Now our flight was already delayed 30 minutes and I could see all the other passengers looking at me as if trying to say please just let her win, we want to get there as fast as we can. I looked at the flight attendant and she gave me a pleading look as if to also say please just give her this, we want to get out of here. So, being the fairly passive person I am, I stepped over her (she wouldnt get up to let me in) and sat in the middle seat. Now, I may be passive, but I also am petty. I am also really gassy on airplanes. Normally I try to contain it and go to the restroom when it gets bad enough and just let it rip there. But I was pissed at EK and, frankly, pissed at everyone else on the plane for letting her get away with it. So I decided that I was just going to let er rip all 7 hours of the flight. I put on my headphones (over the ear) and started passing gas about 1 hour into the flight. And lucky for me (and unlucky for everyone else) they were particularly stinky. Every time I would toot she would sniff, smell it, then shoot dirty looks at me. I just ignored her and just kept listening to my music and reading my kindle. At about hour 5 she had enough and hit the call button to ask to move. The flight attendant told her we were fully booked and there is nowhere to move her to. EK looked even more pissed but I guess she had seen enough of the internet to know if she threw a fit she would get arrested when we landed. When we did land (since we were near the back) I gave it one more silent but deadly toot to seal the deal and let her bathe in my stink for a couple more minutes. Do I feel bad that I ruined her flight? No. Do I feel bad that I stank up my row? Now. And also, for the first time ever I didnt have gas pains nor did I need to rush to the bathroom immediately after getting off my flight. Next time, check in early like the rest of us plebeians and maybe youll get a less salty seat mate. 
Tl;dr seat stealing snoot sniffs stinky stuff while stuck in the sky pettyrevenge 270 https://www.reddit.com/r/pettyrevenge/comments/1krhl8v/steal_my_seat_how_about_a_crop_dusting/
2025-08-12 1755029040 To the guy who was stealing cherries at Costco today… So, tonight around 8 pm at Costco Watford, I witnessed one of the most shameless bits of theft Ive ever seen. A guy came in with what looked like his brother and his young son. Right in front of me, he opened a box of cherries from the shelf and just started eating them. He knew I had seen him do it, but didnt even flinch or attempt to correct the wrong.. His brother then joined in, grabbing handfuls from the same box and feeding them to the kid. Not only are you stealing, youre teaching your kid that its totally fine to walk into a store, open something you havent paid for, and help yourself. =D And top it all off, I saw them spitting the seeds into empty boxes in another aisle where other customers shop from. Absolutely disgusting behaviour. If youre that desperate for cherries, maybe& I dunno& pay for them like everyone else? mildlyinfuriating 410 https://www.reddit.com/r/mildlyinfuriating/comments/1mojcbz/to_the_guy_who_was_stealing_cherries_at_costco/

It looks like there are not many interesting Reddit threads from the keyword search alone. Let's try searching by subreddit instead. To do this, I will find a subreddit I am interested in by using find_subreddits() to get a list of subreddits related to my keyword of interest.
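The threads_3 object used in the rest of this analysis comes from that subreddit-scoped search. Below is a minimal sketch of the step, assuming RedditExtractoR's find_subreddits() and find_thread_urls(); the sort_by and period arguments are my assumptions.

# Sketch (not re-run at knit time): discover subreddits related to the keyword,
# then pull threads that mention it inside r/Atlanta
subs_found <- find_subreddits(keywords = keywords_main)
head(subs_found)

threads_3 <- find_thread_urls(keywords = keywords_main,
                              subreddit = atl_subs,   # "Atlanta"
                              sort_by = "top",        # assumption
                              period = "all") %>%     # assumption
  drop_na() %>%
  sanitize_threads()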

1-2. Downloading comments and additional information

# get individual comments
threads_content <- get_thread_content(threads_3$url[1:4])

The output object threads_content consists of two data frames: threads and comments.

The threads data frame contains additional information that was not provided by find_thread_urls, such as the username of the poster, upvotes, and downvotes.

names(threads_content)

# check upvotes and downvotes
print(threads_content$threads[,c('upvotes','downvotes','up_ratio')])

The comments data frame provides information on individual comments. Next, I clean the text and show the first few rows of the data frame.

# Sanitize text
threads_content$comments %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>% 
        str_replace_all("\\n", " ") %>% 
        str_squish() 
  ))

head(threads_content$comments, 3) %>% knitr::kable()

1-3. Analyzing Posting Date and Time

Using the date and timestamp information, I can analyze when posts are most popular on Reddit — by month, day, or hour.

First, let’s examine the number of threads per week over the last 12 months:

# create new column: date
threads_3 %<>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date))

# number of threads by week
threads_3 %>% 
  ggplot(aes(x = date)) +
  geom_histogram(color="black", position = 'stack', binwidth = 7*24*60*60) +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(threads_3$date, na.rm = T), 
                                max(threads_3$date, na.rm = T), 
                                by = "1 month")) +
  theme_minimal()

Next, let’s examine the day of the week. Do people tend to post more on weekdays or weekends?

# create new columns: day_of_week, is_weekend
threads_3 %<>%  
  mutate(day_of_week = wday(date, label = TRUE)) %>% 
  mutate(is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"))

# number of threads by time of day
threads_3 %>% 
  ggplot(aes(x = day_of_week, fill = is_weekend)) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = c("Weekday" = "gray", "Weekend" = "pink")) + 
  theme_minimal()

The day-of-week plot shows that most posts about stealing/property crime happen on weekdays, not weekends. Monday, Tuesday, and Friday have the highest bar heights, with Tuesday slightly in the lead, while Saturday and Sunday are noticeably lower.

My interpretation is that people tend to post about theft shortly after it happens during the work week, for example after discovering a broken-into car before work or when coming home, rather than creating many new theft posts on weekends. It also suggests that Reddit is being used as a place to debrief weekday incidents, ask for advice, or warn neighbors.

print(threads_3$timestamp[1])
## [1] 1550705171
print(threads_3$timestamp[1] %>% anytime(tz = anytime:::getTZ()))
## [1] "2019-02-20 18:26:11 EST"
threads_3 %<>%  
  mutate(time = timestamp %>% 
           anytime(tz = anytime:::getTZ()) %>% 
           str_split('-| |:') %>% 
           sapply(function(x) as.numeric(x[4])))

What about the time of day? Let’s visualize the number of threads by time of day using the time column we made from timestamp.

Note: the times are shown in my local time zone, so the posters and commenters may be in different time zones.

# number of threads by time of day
threads_3 %>% 
  ggplot(aes(x = time)) +
  geom_histogram(bins = 24, color = 'black') +
  scale_x_continuous(breaks = seq(0, 24, by=2)) + 
  theme_minimal()

These time-of-day and day-of-week views set the stage: when do people talk about thefts? It looks like the peaks hint at late-night or early morning chatter, which I keep in mind when interpreting sentiment spikes.

2. Tokenization and stop words

2-1. Tokenization

Let's tokenize the Reddit texts and plot the top 20 words to see which appear most frequently. You will notice an issue: the plot includes many common words like the, to, and, a, for, etc. While it makes sense that these words appear often, they are not useful for our analysis.

# Word tokenization
words <- threads_3 %>% 
  unnest_tokens(output = word, input = text, token = "words") # run `?tidytext::unnest_tokens` on the console

words %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")
## Selecting by n

This first pass shows raw vocabulary; unsurprisingly, common stop words swamp the view, underscoring why the next cleaning step matters.

2-2. Stop words

Those common words we just saw are known as stop words, words that are typically filtered out during NLP processing because they contribute little meaningful information to text analysis. These are usually very common words such as articles, pronouns, conjunctions, and prepositions. We can remove stop words using a built-in dataset from the tidytext package.

# load list of stop words - from the tidytext package
data("stop_words")
# view 100 random stop words
print(stop_words$word[sample(1:nrow(stop_words), 100)])
##   [1] "there"      "seems"      "without"    "everything" "she"       
##   [6] "able"       "thought"    "afterwards" "below"      "on"        
##  [11] "things"     "overall"    "few"        "useful"     "alone"     
##  [16] "self"       "qv"         "it'll"      "third"      "everyone"  
##  [21] "we"         "into"       "anyways"    "asked"      "forth"     
##  [26] "somehow"    "been"       "com"        "v"          "next"      
##  [31] "is"         "me"         "what"       "t"          "very"      
##  [36] "kept"       "go"         "down"       "couldn't"   "can't"     
##  [41] "few"        "are"        "areas"      "two"        "grouping"  
##  [46] "yourself"   "older"      "such"       "d"          "near"      
##  [51] "new"        "each"       "kind"       "parting"    "how's"     
##  [56] "about"      "was"        "still"      "unlikely"   "asking"    
##  [61] "try"        "all"        "not"        "fact"       "have"      
##  [66] "you're"     "opening"    "they'd"     "above"      "otherwise" 
##  [71] "latest"     "came"       "take"       "how"        "anywhere"  
##  [76] "through"    "those"      "number"     "you'd"      "theirs"    
##  [81] "ever"       "yet"        "shouldn't"  "under"      "i"         
##  [86] "has"        "yours"      "because"    "former"     "happens"   
##  [91] "particular" "others"     "anyone"     "whither"    "having"    
##  [96] "between"    "end"        "great"      "was"        "so"

We will use anti_join() function to remove stop words from the text and leave behind a cleaned set of words. Like other join functions, the order of the data frames matters.

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

words_clean <- threads_3 %>% 
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
# Tokenization (word tokens)
  unnest_tokens(word, text, token = "words") %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word, "[a-z]"))

# Check the number of rows after removal of the stop words. There should be fewer words now
print(
  glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 11048, After: 3725

Once I have removed the stop words, let’s create a plot using the cleaned text to see which meaningful words are used most frequently.

words_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")

After stripping stop words and URLs, the remaining terms surface place names and theft cues—evidence that the cleaning worked and the tokens are now meaningful for downstream sentiment.

3 N-grams

An n-gram is a sequence of n words that appear together; for example, a bigram (n = 2) is a pair of consecutive words and a tri-gram (n = 3) is a sequence of three.

In advanced text analysis and machine learning, specific tokens and n-grams are often used as features for modeling and classification tasks.

N-grams are particularly useful for analyzing words in context. Consider these two sentences: "I need to check the mailbox" and "She handed me a check."

The word "check" functions as a verb in the first sentence and a noun in the second. We can understand its meaning based on the surrounding words, especially those immediately before or after it. For example, when "check" follows "to", it's likely being used as a verb.

As an example of bigrams (n = 2), the sentence "The result of separating bigrams is helpful for exploratory analyses of the text." becomes the list of paired words below (a code sketch reproducing it follows the list):
1 the result
2 result of
3 of separating
4 separating bigrams
5 bigrams is
6 is helpful
7 helpful for
8 for exploratory
9 exploratory analyses
10 analyses of
11 of the
12 the text
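As a quick check, the paired-word list above should be reproducible with unnest_tokens() from tidytext by asking for ngrams of length 2:

# Reproduce the bigram list for the example sentence
example_sentence <- tibble(
  text = "The result of separating bigrams is helpful for exploratory analyses of the text."
)
example_sentence %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)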

# Tri-grams (n = 3)
words_trigram <- threads_3 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = trigram,
                input = text,
                token = "ngrams",
                n = 3)
# Show tri-grams with sorted values
words_trigram %>%
  count(trigram, sort = TRUE) %>% 
  head(20) %>% 
  knitr::kable()
trigram n
nk f9 tk 26
0 f 0flt 13
1 1 a 13
1 a nk 13
1 s 1 13
1 t f 13
a nk f9 13
c usd e 13
e 1 s 13
f 0 f 13
f9 tk nk 13
f9 tk sd 13
s 1 1 13
sd 1 t 13
tk nk f9 13
tk sd 1 13
usd e 1 13
f atlanta to 11
t f atlanta 11
22restrict_sr onsort newt 5

Here, we can see that the n-grams still contain stop words such as a, to, and so on. Next, we'll extract n-grams without stop words. We can use the separate() function from the tidyr package to split each tri-gram into three columns: word1, word2, and word3. Then, we filter out any rows where any of those columns contains a stop word using the filter() function.

# Separate tri-grams into three columns
words_trigram_sep <- words_trigram %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

# filter rows where there are stop words under any word column
words_trigram_filtered <- words_trigram_sep %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))

# Filter out words that are not encoded in ASCII
library(stringi)
words_trigram_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# Sort the new tri-gram (n=3) counts:
trigram_counts <- words_trigram_filtered %>%
  count(word1, word2, word3, sort = TRUE)

head(trigram_counts, 20) %>% 
  knitr::kable()
word1 word2 word3 n
nk f9 tk 26
f9 tk nk 13
f9 tk sd 13
tk nk f9 13
22restrict_sr onsort newt 5
chat settings wnuyl5a8flpnick 4
live chat settings 4
mind atlanta links 4
settings wnuyl5a8flpnick atlien_ 4
wnuyl5a8flpnick atlien_ atlanta 4
22events 22restrict_sr onsort 3
3a 22events 22restrict_sr 3
flair 3a 22events 3
weekly events thread 3
atlanta weekly events 2
atlien_ atlanta weekly 2
authorized repair person 2
calls detail morning 2
car break ins 2
daily 22restrict_sr onsort 2
top_tri_strings <- trigram_counts %>%
  head(3) %>%
  unite(trigram, word1:word3, sep = " ") %>%
  pull(trigram)
cat("Notable tri-grams: ", paste(top_tri_strings, collapse = "; "), ". These repeated phrases highlight the most common contexts in the pulled threads.")
## Notable tri-grams:  nk f9 tk; f9 tk nk; f9 tk sd . These repeated phrases highlight the most common contexts in the pulled threads.

These tri-grams capture the phrases repeated most often when Atlantans talk about theft, though the top entries here look like encoding or formatting noise (e.g., "nk f9 tk") rather than meaningful language. One noticeable trigram further down the list is "car break ins". If tri-grams are sparse after filtering, we can also look at bi-grams as a fallback while still reporting the tri-gram results.

replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
words_trigram <- threads_3 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(
    !word1 %in% stop_words$word,
    !word2 %in% stop_words$word,
    !word3 %in% stop_words$word,
    str_detect(word1, "[a-z]"),
    str_detect(word2, "[a-z]"),
    str_detect(word3, "[a-z]")
  ) %>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

trigram_counts <- words_trigram %>%
  count(word1, word2, word3, sort = TRUE)

# Show top trigrams
trigram_counts %>% head(20) %>% knitr::kable()
word1 word2 word3 n
nk f9 tk 26
f9 tk nk 13
f9 tk sd 13
tk nk f9 13
22restrict_sr onsort newt 5
chat settings wnuyl5a8flpnick 4
live chat settings 4
mind atlanta links 4
settings wnuyl5a8flpnick atlien_ 4
wnuyl5a8flpnick atlien_ atlanta 4
22events 22restrict_sr onsort 3
3a 22events 22restrict_sr 3
flair 3a 22events 3
weekly events thread 3
atlanta weekly events 2
atlien_ atlanta weekly 2
authorized repair person 2
calls detail morning 2
car break ins 2
daily 22restrict_sr onsort 2
# Notable trigrams
top_tri_strings <- trigram_counts %>%
  head(3) %>%
  unite(trigram, word1:word3, sep = " ") %>%
  pull(trigram)
cat("Notable tri-grams: ", paste(top_tri_strings, collapse = "; "), ". These repeated phrases highlight the most common contexts in the pulled threads.")
## Notable tri-grams:  nk f9 tk; f9 tk nk; f9 tk sd . These repeated phrases highlight the most common contexts in the pulled threads.
# Let's look at bi-grams for additional context
words_bigram <- threads_3 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = bigram,
                input = text,
                token = "ngrams",
                n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>%
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]")) %>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))

bigram_counts <- words_bigram %>%
  count(word1, word2, sort = TRUE)

if (nrow(trigram_counts) == 0) {
  message("No clean tri-grams found; showing bi-grams instead.")
  bigram_counts %>% head(20) %>% knitr::kable()
}

Finally, by using the igraph and ggraph packages, we can visualize words occurring in pairs, which allows us to see common relationships between words in the text.

# plot word network using Bigram
words_counts <- bigram_counts
words_counts %>%
  filter(n >= 3) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = .6, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")

What we can interpret from this network is that phrases linking "Atlanta – police – links", "apartment – complex", and "license – plate" suggest stories about car break-ins, stolen plates, and thefts in apartment parking lots. "North Druid" also appears; it could mark a location where incidents cluster, or simply a place of interest that people mention.

4 Sentiment Analysis

In this stage, I import the sentiment analysis results produced in Google Colab.
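Before this step, the cleaned threads were exported to CSV so the Colab notebook could score them with the BERT model. A minimal sketch of that export, assuming the scored file was built from threads_3 (the file name is an assumption):

# Sketch (not re-run at knit time): write the threads out for BERT scoring in Colab
readr::write_csv(threads_3, file.path(out_dir, "threads_for_bert.csv"))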

# import the data
reddit_sentiment <- readr::read_csv("C:/Users/qduong7/OneDrive - Georgia Institute of Technology/Documents/Team project (urban analytics)/RedditAnalysis/reddit_bert.csv", show_col_types = FALSE)
## New names:
## * `` -> `...1`
# drop NAs
reddit_sentiment %<>% drop_na('bert_label')

Comparison with the dictionary method

Get sentiment scores using the dictionary method for comparison.

# Join thread title and text.
reddit_sentiment %<>%
  mutate(title = replace_na(title, ""),
         text = replace_na(text, ""),
         title_text = str_c(title, text, sep = ". "))

# dictionary method
reddit_sentiment_dictionary <- sentiment_by(reddit_sentiment$title_text)

reddit_sentiment$sentiment_dict <- reddit_sentiment_dictionary %>% pull(ave_sentiment)
reddit_sentiment$word_count <- reddit_sentiment_dictionary %>% pull(word_count)

Check the correlation between the sentiment values from two different methods.

reddit_sentiment %<>% mutate(bert_label_numeric = str_sub(bert_label, 1, 1) %>% as.numeric())

cor(reddit_sentiment$bert_label_numeric, reddit_sentiment$sentiment_dict)
## [1] 0.3302182

A correlation of 0.33 implies a mild positive relationship. In the scatter plot below, the two measures do not appear strongly correlated; the threads that got 1 star from the BERT model mostly fall below 0 (meaning negative) under the dictionary method.

Let’s look at some example threads and the predicted sentiment, and see which method makes more sense.

  • BERT: 1 star (negative) vs. 5 stars (positive)
bert_example <- reddit_sentiment %>%
  filter(bert_label_numeric %in% c(1,5)) %>%
  group_by(bert_label) %>%
  arrange(desc(bert_score)) %>%
  slice_head(n = 3) %>%
  ungroup()

# 1 star
bert_example %>% filter(bert_label_numeric == 1) %>% pull(title_text) %>% print()
## [1] "Random act of kindness at MARTA station goes viral. "           
## [2] "1 injured in shooting outside of Lenox Square in Buckhead. "    
## [3] "Woman accused of stealing more than $93K from church arrested. "
# 5 star
bert_example %>% filter(bert_label_numeric == 5) %>% pull(title_text) %>% print()
## [1] "This is my favorite drone shot of Georgia Tech's Tech Tower at sunset. "
## [2] "I\031m so proud of this city!!. "                                       
## [3] "Finally! A Trump Executive Order I Can Get Behind. "
  • Dictionary method: negative vs. positive
sentimentr_example <- reddit_sentiment %>%
  mutate(sentimentr_abs = abs(sentiment_dict),
         sentimentr_binary = case_when(sentiment_dict > 0 ~ 'positive',
                                       TRUE ~ 'negative')) %>%
  group_by(sentimentr_binary) %>%
  arrange(desc(sentimentr_abs)) %>%
  slice_head(n = 3) %>%
  ungroup() %>%
  arrange(sentiment_dict)

# negative
sentimentr_example %>% filter(sentimentr_binary == 'negative') %>% pull(title_text) %>% print()
## [1] "Suspected would-be robber says he's the real victim. "                                     
## [2] "Atlanta Police officer accused of stealing $500 cash from deceased shooting victim fired. "
## [3] "Thieves steal from boy with cancer. "
# positive
sentimentr_example %>% filter(sentimentr_binary == 'positive') %>% pull(title_text) %>% print()
## [1] "Some friendly individual relieved me of a window and an iPod touch last night. "
## [2] "WSJ Editorial: Democracy Succeeds in Georgia. "                                 
## [3] "Police officer doesn't punish girl caught stealing, helps her instead. "

5. Insights

Okay, here is the fun part: let's discuss the intriguing insights I draw from the sentiment analysis and these visualizations:

People complain about theft mid-week and across the day, with only small weekend differences

Looking by day of week, every bar is again dominated by 1-star posts. There isn't a huge weekday vs weekend contrast, but we can see that Saturday and Sunday have a slightly thicker band of 4–5-star posts compared to the middle of the week. My interpretation is that on weekends people are a bit more likely to use "stealing" in jokes, memes, or lighter stories (for example, "my dog is stealing socks again"), whereas mid-week posts tilt more toward frustrated, crime-related reports.

reddit_sentiment %<>%
  mutate(
    date = suppressWarnings(anytime(date_utc, tz = anytime:::getTZ())),
    timestamp_parsed = suppressWarnings(anytime(timestamp, tz = anytime:::getTZ()))
  ) %>%
  filter(!is.na(date)) %>%
  mutate(
    year = year(date),
    day_of_week = wday(date, label = TRUE),
    is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"),
    time = coalesce(hour(timestamp_parsed), hour(date))
  )
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)
# sentiment by year (stacked counts)
g1 <- reddit_sentiment %>%
  filter(year >= 2018) %>%
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = "stack") +
  scale_x_continuous(breaks = seq(min(reddit_sentiment$year), max(reddit_sentiment$year), by = 1)) +
  scale_fill_brewer(palette = "PuRd", direction = -1) +
  theme_gray()
# sentiment by year (proportions)
g2 <- reddit_sentiment %>%
  filter(year >= 2018) %>%
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = "fill") +
  scale_x_continuous(breaks = seq(min(reddit_sentiment$year), max(reddit_sentiment$year), by = 1)) +
  scale_fill_brewer(palette = "PuRd", direction = -1) +
  theme_gray()

ggsave(file.path(out_dir, "sentiment_by_year_stack.png"), g1, width = 7, height = 4, dpi = 150)
ggsave(file.path(out_dir, "sentiment_by_year_fill.png"), g2, width = 7, height = 4, dpi = 150)

g1; g2;

The time-of-day plot tells a complementary story. Negative posts appear at almost every hour, which fits the idea that theft and venting can happen anytime; in particular, negative posts dominate the late-night hours (after 12 am). But there is a visible cluster of 4–5-star posts during the morning and early evening hours, the times when people are commuting or scrolling after work. Those more positive/neutral posts likely include advice threads, news links, or lighter commentary that reuses the word "stealing" in less literal ways.

Posts about "stealing" are overwhelmingly negative, especially after 2020

Over the full 2018–2025 window, almost every year is dominated by 1-star posts from the BERT model, meaning that most “stealing” threads are clearly negative in tone. The stacked bar chart by year shows that volume peaks around 2018–2020, then drops but never disappears in later years. People keep coming back to Reddit to talk about theft, just with fewer threads per year than at the pre-COVID peak.

When we switch to the proportional view, the pattern gets sharper: in early years there is still a noticeable band of 4–5-star posts (more neutral or joking uses of "stealing"), but by 2021–2022 the bars are almost entirely dark (1-star). In other words, the mix of conversations shifts toward more clearly negative posts after 2020. Only 2023 shows a small rebound of more positive/neutral threads, but even there, 1-star posts remain the largest slice. In 2024 and 2025, 1-star (negative) posts remain the majority:

if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)

# sentiment by day of week (proportions)
g3 <- reddit_sentiment %>%
  ggplot(aes(x = day_of_week, fill = bert_label)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "PuRd", direction = -1) +
  theme_gray()
# sentiment by time of day (proportions)
g4 <- reddit_sentiment %>%
  ggplot(aes(x = time, fill = bert_label)) +
  geom_histogram(bins = 24, position = "fill", color = "black", lwd = 0.2) +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  scale_fill_manual(values = c("#bc5090", "#bc5090", "#ff6361", "#ffa600", "#ffa600")) +
  theme_gray()

ggsave(file.path(out_dir, "sentiment_by_year_stack.png"), g1, width = 7, height = 4, dpi = 150)
ggsave(file.path(out_dir, "sentiment_by_year_fill.png"), g2, width = 7, height = 4, dpi = 150)
ggsave(file.path(out_dir, "sentiment_by_day.png"), g3, width = 7, height = 4, dpi = 150)
ggsave(file.path(out_dir, "sentiment_by_time.png"), g4, width = 7, height = 4, dpi = 150)

g3; g4

  • Number of threads by sentiment category.
reddit_sentiment %>%
  ggplot(aes(x = bert_label)) +
  geom_bar(fill = "grey") +
  theme_gray()

What we can learn from this plot is that most posts fall into the 1-star bucket: Atlanta social feeds lean heavily negative when theft is involved, with far fewer upbeat posts.

Let's say I want to see word counts versus sentiment category:

reddit_sentiment %>%
  ggplot(aes(x = bert_label, y = word_count)) +
  geom_jitter(height = 0, width = 0.05) +
  stat_summary(fun = mean, geom = "crossbar", width = 0.4, color = "red") +
  theme_gray()

My interpretation here is that the longest narratives sit at the extremes (1-star vents and 4–5-star updates), while posts in the middle tend to be shorter, suggesting that stronger feelings produce longer narratives.

Vocabulary splits neatly between crime complaints and light-hearted posts

The text patterns line up nicely with what we’d expect from property-crime vs. playful uses of “stealing.” In the word network and the negative-thread word cloud, the dominant words and pairs point straight at concrete crime situations:

Phrases linking “Atlanta – police – links”, “apartment – complex”, and “license – plate” suggest stories about car break-ins, stolen plates, and thefts in apartment parking lots.

Single words like thief, alarm, gun, cops, company, van, plate, north echo the language of actual incidents: alarms going off, police being called, damage claims with a company, or specific parts of town.

By contrast, the positive-thread word cloud is almost a different universe:

Words like puppies, cards, trivia, braves, auburn, jeep, drivers point to sports, games, pets, and fandom.

Here “stealing” is probably being used metaphorically (“puppies stealing my heart”) or in a fun context (“Braves stealing bases,” “trivia night prizes”). So even when we condition on the same keyword, there are two distinct discourses: one around real theft and safety, another around playful or metaphorical “stealing.”

# plot word network
words_counts <- bigram_counts
words_counts %>%
  filter(n >= 3) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = .6, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")

Using word clouds, we can visualize the words that appear most frequently in positive and negative threads (see the sketch after this list):

  • Words appearing in negative threads.
  • Words appearing in positive threads.
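A minimal sketch of how these clouds can be drawn with wordcloud2, assuming "negative" means 1–2 BERT stars and "positive" means 4–5 stars (wordcloud2() reads the first two columns of the data frame as word and frequency):

# Tokenize thread text, drop stop words, and keep alphabetic tokens
cloud_words <- reddit_sentiment %>%
  unnest_tokens(word, title_text, token = "words") %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]"))

# Word cloud for negative threads (1-2 stars)
cloud_words %>%
  filter(bert_label_numeric <= 2) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()

# Word cloud for positive threads (4-5 stars)
cloud_words %>%
  filter(bert_label_numeric >= 4) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()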

The two sentiment methods mostly agree on direction, but BERT is more nuanced

To check whether our sentiment labels are trustworthy, I compared a dictionary-based score (sentimentr) with the BERT 1–5 star labels. The correlation is about 0.33, which is modest but clearly positive: higher BERT stars tend to go with slightly more positive dictionary scores.

The scatterplot shows the shape of that relationship:

Most 1-star posts have dictionary scores below zero, while 4- and 5-star posts sit mostly at or above zero.

When we reduce the comparison to just “sign of the sentiment” (positive vs negative), the two methods agree about 57% of the time, and the typical dictionary magnitude is fairly small (~0.22 in absolute value). That tells us two things:

The dictionary method usually gets the direction right for clear, emotional posts (e.g., very angry theft complaints).

For subtler or sarcastic threads, it often hovers near zero or even misclassifies the tone, while BERT still assigns a confident star rating.

g_scatter <- ggplot(data = reddit_sentiment, aes(x = bert_label_numeric, y = sentiment_dict)) +
  geom_jitter(width = 0.1, height = 0) +
  geom_line(aes(y = 0), color = "#FFD700", lwd = 1, linetype = "dashed") +
  theme_gray()
ggsave(file.path(out_dir, "bert_vs_dict_scatter.png"), g_scatter, width = 7, height = 4, dpi = 150)

g_scatter

Display 10 sample texts to evaluate BERT vs Dictionary methods

# Sample texts with scores for credibility check
set.seed(123)
sample_10 <- reddit_sentiment %>%
  sample_n(10) %>%
  select(title_text, bert_label, sentiment_dict, comments, word_count)

# Spot-check 10 random posts to see if the two sentiment methods and engagement levels line up with intuition before trusting the metrics
sample_10
## # A tibble: 10 x 5
##    title_text                      bert_label sentiment_dict comments word_count
##    <chr>                           <chr>               <dbl>    <dbl>      <int>
##  1 "Georgia Tech won't require st~ 2 stars           -0.177       188         13
##  2 "1 injured in shooting outside~ 1 star            -0.45         94          9
##  3 "Marietta man accused of steal~ 1 star            -0.476         2         12
##  4 "Dekalb County dog stolen off ~ 1 star            -0.265         8          8
##  5 "2 men arrested for stealing $~ 1 star            -0.177         0          8
##  6 "Plant stealing vagrant ruinin~ 1 star            -0.0693       21        185
##  7 "PSA: Don't let people in your~ 1 star             0.217        49        148
##  8 "Carjacking at Piedmont &amp; ~ 4 stars            0            11          5
##  9 "Why the Thrashers left Atlant~ 4 stars            0            48         15
## 10 "Robbers in EAV. If anyone has~ 1 star             0.272       118         16
# Quick credibility check comparing dictionary sign vs BERT class (negative/neutral/positive)
credibility_stats <- reddit_sentiment %>%
  mutate(
    bert_sign = case_when(bert_label_numeric <= 2 ~ -1,
                          bert_label_numeric == 3 ~ 0,
                          TRUE ~ 1),
    dict_sign = case_when(sentiment_dict < 0 ~ -1,
                          sentiment_dict == 0 ~ 0,
                          TRUE ~ 1)
  ) %>%
  summarise(
    sign_agreement = mean(bert_sign == dict_sign),
    mean_abs_dict = mean(abs(sentiment_dict))
  )
credibility_stats
## # A tibble: 1 x 2
##   sign_agreement mean_abs_dict
##            <dbl>         <dbl>
## 1          0.567         0.222

EXTRA: Association between a thread's sentiment and the number of comments on the thread

# Remove outliers
reddit_sentiment_rm_outlier <- reddit_sentiment %>%
  group_by(bert_label) %>%
  filter(
    between(
      comments,
      quantile(comments, 0.25) - 1.5 * IQR(comments),
      quantile(comments, 0.75) + 1.5 * IQR(comments)))

# Correlation analysis
cor.test(reddit_sentiment_rm_outlier$comments, reddit_sentiment_rm_outlier$bert_label_numeric)
## 
##  Pearson's product-moment correlation
## 
## data:  reddit_sentiment_rm_outlier$comments and reddit_sentiment_rm_outlier$bert_label_numeric
## t = -1.2782, df = 193, p-value = 0.2027
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.22918361  0.04952823
## sample estimates:
##         cor 
## -0.09162176
# Scatterplot
reddit_sentiment_rm_outlier %>%
  ggplot(aes(x = bert_label_numeric, y = comments)) +
  geom_jitter(height = 0, width = 0.05) + 
  geom_smooth(method = 'loess', span = 0.75) +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

What we can interpret from this chart: comment volume doesn't swing dramatically with sentiment, so people engage with both complaint posts and positive resolutions, but there is no strong "more anger → more comments" effect.

Conclusion

Overall, Reddit threads that mention “stealing” are strongly skewed toward negative sentiment across all years, with especially dark tones in the COVID and post-COVID period. Complaints show up on every day of the week and at all hours, so they are more of a constant background worry than a weekend or late-night spike.

The text patterns split into two clear worlds: one cluster of posts uses words like thief, alarm, cops, apartment, plate, describing real property crimes and safety concerns, while a smaller cluster uses puppies, trivia, Braves, drivers in more playful or metaphorical ways.