Library setup

# Package names
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "here", "sentimentr","devtools")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

# helper to clean thread text fields
sanitize_threads <- function(df) {
  df %>%
    mutate(across(
      where(is.character),
      ~ .x %>%
        str_replace_all("\\|", "/") %>%
        str_replace_all("\\n", " ") %>%
        str_squish()
    ))
}

# main keyword and subreddit list used throughout
keywords_main <- c("stealing")
atl_subs <- c("Atlanta")

# create output directory for saved data/plots
out_dir <- here::here("outputs")
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)

DISCLAIMER: Due to the uncensored nature of online communities, some Reddit posts may contain images or language that are not suitable for work or school. You may also encounter controversial content. This is part of studying urban analytics, as online platforms often reflect real social dynamics. That said, please keep an open mind, and I hope no one feels offended or uncomfortable with what we might come across.

Introduction and Goals

In this analysis I examine how Atlanta Reddit users talk about theft-related experiences, combining user-generated posts with dictionary-based sentiment scoring and a BERT model to understand tone and themes. I use the keyword "stealing" to explore how Redditors discuss property crime in Atlanta.

Model Process

1. Collecting Reddit Data

Downloading Reddit threads
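At knit time I load a cached copy of the keyword search results, but that cache was originally produced with RedditExtractoR's find_thread_urls(). A minimal sketch of that original pull follows; the sort_by and period arguments are my assumptions.

# Sketch (not re-run at knit time): keyword search across Reddit that produced threads_1
threads_1 <- find_thread_urls(keywords = keywords_main, sort_by = "top", period = "year")
readr::write_csv(threads_1, file.path(out_dir, "threads_1.csv"))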

# knit-time: load cached keyword search results (threads_1)
threads_1 <- readr::read_csv("C:/Users/qduong7/OneDrive - Georgia Institute of Technology/Documents/Team project (urban analytics)/outputs/threads_1.csv", show_col_types = FALSE) %>%
  drop_na() %>%
  sanitize_threads()
colnames(threads_1)
## [1] "date_utc"  "timestamp" "title"     "text"      "subreddit" "comments" 
## [7] "url"
head(threads_1, 3) %>% knitr::kable()
date_utc timestamp title text subreddit comments url
2025-07-08 1752003037 [Omega Speedmaster] Got a Moonwatch for an absolute steal in Japan. I know this is kinda a bragging post but I wanted to tell somebody that can understand my excitement. I spent 3 weeks in Japan, about 2 weeks in a campervan and a couple days in Tokyo. During our travels we visited quite a few second hand stores, mainly looking for anime figures, old consoles and also checking out the watches they had in stock. In regard to watches, most stores were priced pretty much identical to online prices, if not slightly more expensive. On one of our last days in Tokyo we went to a Book Off location outside of the city centre which had a relatively large watch selection. They had an Omega Speedmaster 50th Anniversary, which seemed to be priced somewhat competitively (around 3,600¬). Only upon my second inspection did I notice there was actually a 40% off hangtag. Could this be? I translated the text and it literally said 40% off listed price. So yeah& I wasnt actually in the market for a Moonwatch, but at that price it was a no-brainer. I ended up paying only 2,160¬ for the watch, which was less than any other Speedmaster I could find online. Sadly, no box and papers (only a dinged up wooden Omega box), but recently polished and serviced. It runs great at +3 seconds/day. Its not the best polishing job in the world, but honestly not bad. I have no clue why it was sitting there at that price and why nobody bought it. Watches 346 https://www.reddit.com/r/Watches/comments/1luxwqh/omega_speedmaster_got_a_moonwatch_for_an_absolute/
2025-05-20 1747778662 Steal my seat, how about a crop dusting I (33F) just got off a 7 hour cross country flight and I wanted to share this story before I go pass out. I was flying Southwest so there are no assigned seats and I was in B boarding group. I went towards the back of the plane so I could get an aisle seat since Im fairly tall. It was a completely full flight so I knew there would be someone sitting next to be. Enter Entitled Karen (EK). She had waited to check in so she was at the back of the C boarding group. She was pissy because at first she couldnt find anywhere to put her massive roller bag and the flight attendants were trying to get her to gate check it. Then she came up to my row and I thought great, shes gonna take the middle seat. The following dialogue (to the best of my sleepy recollection) occurred: EK: Move! OP: what? EK: I paid for an aisle seat and you are sitting in my seat. OP: (more confused) what?? EK: (practically screaming) I SAID I PAID FOR AN AISLE SEAT AND YOU ARE SITTING IN MY SEAT OP: This is Southwest, there are no assigned seats EK: Are you stupid, I said I need the aisle seat. OP: I can get up to give you the middle seat if you want I stupidly thought that was the end of it when a flight attendant came over and asked her to take a seat. I got up to let her into the row and she immediately sat down in the aisle seat. I tried to protest to the flight attendant but EK kept loudly saying she paid for an aisle seat (I know Southwest is switching to assigned seating soon but they havent implemented it yet). Now our flight was already delayed 30 minutes and I could see all the other passengers looking at me as if trying to say please just let her win, we want to get there as fast as we can. I looked at the flight attendant and she gave me a pleading look as if to also say please just give her this, we want to get out of here. So, being the fairly passive person I am, I stepped over her (she wouldnt get up to let me in) and sat in the middle seat. Now, I may be passive, but I also am petty. I am also really gassy on airplanes. Normally I try to contain it and go to the restroom when it gets bad enough and just let it rip there. But I was pissed at EK and, frankly, pissed at everyone else on the plane for letting her get away with it. So I decided that I was just going to let er rip all 7 hours of the flight. I put on my headphones (over the ear) and started passing gas about 1 hour into the flight. And lucky for me (and unlucky for everyone else) they were particularly stinky. Every time I would toot she would sniff, smell it, then shoot dirty looks at me. I just ignored her and just kept listening to my music and reading my kindle. At about hour 5 she had enough and hit the call button to ask to move. The flight attendant told her we were fully booked and there is nowhere to move her to. EK looked even more pissed but I guess she had seen enough of the internet to know if she threw a fit she would get arrested when we landed. When we did land (since we were near the back) I gave it one more silent but deadly toot to seal the deal and let her bathe in my stink for a couple more minutes. Do I feel bad that I ruined her flight? No. Do I feel bad that I stank up my row? Now. And also, for the first time ever I didnt have gas pains nor did I need to rush to the bathroom immediately after getting off my flight. Next time, check in early like the rest of us plebeians and maybe youll get a less salty seat mate. 
Tl;dr seat stealing snoot sniffs stinky stuff while stuck in the sky pettyrevenge 270 https://www.reddit.com/r/pettyrevenge/comments/1krhl8v/steal_my_seat_how_about_a_crop_dusting/
2025-08-12 1755029040 To the guy who was stealing cherries at Costco today… So, tonight around 8 pm at Costco Watford, I witnessed one of the most shameless bits of theft Ive ever seen. A guy came in with what looked like his brother and his young son. Right in front of me, he opened a box of cherries from the shelf and just started eating them. He knew I had seen him do it, but didnt even flinch or attempt to correct the wrong.. His brother then joined in, grabbing handfuls from the same box and feeding them to the kid. Not only are you stealing, youre teaching your kid that its totally fine to walk into a store, open something you havent paid for, and help yourself. =D And top it all off, I saw them spitting the seeds into empty boxes in another aisle where other customers shop from. Absolutely disgusting behaviour. If youre that desperate for cherries, maybe& I dunno& pay for them like everyone else? mildlyinfuriating 410 https://www.reddit.com/r/mildlyinfuriating/comments/1mojcbz/to_the_guy_who_was_stealing_cherries_at_costco/

It looks like there are not many interesting Reddit threads from the keyword search alone. Let's try searching by subreddit instead. To do this, I will find a subreddit I am interested in by using find_subreddits() to get a list of subreddits related to my keyword of interest.
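The threads_3 object used in the rest of this analysis comes from that subreddit-scoped search. Below is a minimal sketch of the step, assuming RedditExtractoR's find_subreddits() and find_thread_urls(); the sort_by and period arguments are my assumptions.

# Sketch (not re-run at knit time): discover subreddits related to the keyword,
# then pull threads that mention it inside r/Atlanta
subs_found <- find_subreddits(keywords = keywords_main)
head(subs_found)

threads_3 <- find_thread_urls(keywords = keywords_main,
                              subreddit = atl_subs,   # "Atlanta"
                              sort_by = "top",        # assumption
                              period = "all") %>%     # assumption
  drop_na() %>%
  sanitize_threads()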

1-2. Downloading comments and additional information

# get individual comments
threads_content <- get_thread_content(threads_3$url[1:4])

The output object threads_content consists of two data frames: threads and comments.

The threads data frame contains additional information that was not provided by find_thread_urls, such as the username of the poster, upvotes, and downvotes.

names(threads_content)

# check upvotes and downvotes
print(threads_content$threads[,c('upvotes','downvotes','up_ratio')])

The comments data frame provides information on individual comments. Next, I clean the text and show the first few rows of the data frame.

# Sanitize text
threads_content$comments %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>% 
        str_replace_all("\\n", " ") %>% 
        str_squish() 
  ))

head(threads_content$comments, 3) %>% knitr::kable()

1-3. Analyzing Posting Date and Time

Using the date and timestamp information, I can analyze when posts are most popular on Reddit — by month, day, or hour.

First, let’s examine the number of threads per week over the last 12 months:

# create new column: date
threads_3 %<>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date))

# number of threads by week
threads_3 %>% 
  ggplot(aes(x = date)) +
  geom_histogram(color="black", position = 'stack', binwidth = 7*24*60*60) +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(threads_3$date, na.rm = T), 
                                max(threads_3$date, na.rm = T), 
                                by = "1 month")) +
  theme_minimal()

Next, let’s examine the day of the week. Do people tend to post more on weekdays or weekends?

# create new columns: day_of_week, is_weekend
threads_3 %<>%  
  mutate(day_of_week = wday(date, label = TRUE)) %>% 
  mutate(is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"))

# number of threads by time of day
threads_3 %>% 
  ggplot(aes(x = day_of_week, fill = is_weekend)) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = c("Weekday" = "gray", "Weekend" = "pink")) + 
  theme_minimal()

The day-of-week plot shows that most posts about stealing/property crime happen on weekdays, not weekends. Monday, Tuesday, and Friday have the highest bar heights, with Tuesday slightly in the lead, while Saturday and Sunday are noticeably lower.

My interpretation is that people tend to post about theft shortly after it happens during the work week, for example after discovering a broken-into car before work or when coming home, rather than creating many new theft posts on weekends. It also suggests that Reddit is being used as a place to debrief weekday incidents, ask for advice, or warn neighbors.

print(threads_3$timestamp[1])
## [1] 1550705171
print(threads_3$timestamp[1] %>% anytime(tz = anytime:::getTZ()))
## [1] "2019-02-20 18:26:11 EST"
threads_3 %<>%  
  mutate(time = timestamp %>% 
           anytime(tz = anytime:::getTZ()) %>% 
           str_split('-| |:') %>% 
           sapply(function(x) as.numeric(x[4])))

What about the time of day? Let’s visualize the number of threads by time of day using the time column we made from timestamp.

Note: the times are shown in my local time zone, so the posters and commenters may be in different time zones.

# number of threads by time of day
threads_3 %>% 
  ggplot(aes(x = time)) +
  geom_histogram(bins = 24, color = 'black') +
  scale_x_continuous(breaks = seq(0, 24, by=2)) + 
  theme_minimal()

These time-of-day and day-of-week views set the stage: when do people talk about thefts? It looks like the peaks hint at late-night or early morning chatter, which I keep in mind when interpreting sentiment spikes.

2. Tokenization and stop words

2-1. Tokenization

Let's tokenize the Reddit texts and plot the top 20 words to see which appear most frequently. You will notice an issue: the plot includes many common words like the, to, and, a, for, etc. While it makes sense that these words appear often, they are not useful for our analysis.

# Word tokenization
words <- threads_3 %>% 
  unnest_tokens(output = word, input = text, token = "words") # run `?tidytext::unnest_tokens` on the console

words %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")
## Selecting by n

This first pass shows raw vocabulary; unsurprisingly, common stop words swamp the view, underscoring why the next cleaning step matters.

2-2. Stop words

Those common words we just saw are known as stop words, words that are typically filtered out during NLP processing because they contribute little meaningful information to text analysis. These are usually very common words such as articles, pronouns, conjunctions, and prepositions. We can remove stop words using a built-in dataset from the tidytext package.

# load list of stop words - from the tidytext package
data("stop_words")
# view 100 random stop words
print(stop_words$word[sample(1:nrow(stop_words), 100)])
##   [1] "there"      "seems"      "without"    "everything" "she"       
##   [6] "able"       "thought"    "afterwards" "below"      "on"        
##  [11] "things"     "overall"    "few"        "useful"     "alone"     
##  [16] "self"       "qv"         "it'll"      "third"      "everyone"  
##  [21] "we"         "into"       "anyways"    "asked"      "forth"     
##  [26] "somehow"    "been"       "com"        "v"          "next"      
##  [31] "is"         "me"         "what"       "t"          "very"      
##  [36] "kept"       "go"         "down"       "couldn't"   "can't"     
##  [41] "few"        "are"        "areas"      "two"        "grouping"  
##  [46] "yourself"   "older"      "such"       "d"          "near"      
##  [51] "new"        "each"       "kind"       "parting"    "how's"     
##  [56] "about"      "was"        "still"      "unlikely"   "asking"    
##  [61] "try"        "all"        "not"        "fact"       "have"      
##  [66] "you're"     "opening"    "they'd"     "above"      "otherwise" 
##  [71] "latest"     "came"       "take"       "how"        "anywhere"  
##  [76] "through"    "those"      "number"     "you'd"      "theirs"    
##  [81] "ever"       "yet"        "shouldn't"  "under"      "i"         
##  [86] "has"        "yours"      "because"    "former"     "happens"   
##  [91] "particular" "others"     "anyone"     "whither"    "having"    
##  [96] "between"    "end"        "great"      "was"        "so"

We will use anti_join() function to remove stop words from the text and leave behind a cleaned set of words. Like other join functions, the order of the data frames matters.

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

words_clean <- threads_3 %>% 
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
# Tokenization (word tokens)
  unnest_tokens(word, text, token = "words") %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word, "[a-z]"))

# Check the number of rows after removal of the stop words. There should be fewer words now
print(
  glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 11048, After: 3725

Once I have removed the stop words, let’s create a plot using the cleaned text to see which meaningful words are used most frequently.

words_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")

After stripping stop words and URLs, the remaining terms surface place names and theft cues—evidence that the cleaning worked and the tokens are now meaningful for downstream sentiment.

3 N-grams

An n-gram is a sequence of n words that appear together; for example, a bigram (n = 2) is a pair of consecutive words and a tri-gram (n = 3) is a sequence of three.

In advanced text analysis and machine learning, specific tokens and n-grams are often used as features for modeling and classification tasks.

N-grams are particularly useful for analyzing words in context. Consider these two sentences: "I need to check the mailbox" and "She handed me a check."

The word "check" functions as a verb in the first sentence and a noun in the second. We can understand its meaning based on the surrounding words, especially those immediately before or after it. For example, when "check" follows "to", it's likely being used as a verb.

As an example of bigrams (n = 2), the sentence "The result of separating bigrams is helpful for exploratory analyses of the text." becomes the list of paired words below (a code sketch reproducing it follows the list):
1 the result
2 result of
3 of separating
4 separating bigrams
5 bigrams is
6 is helpful
7 helpful for
8 for exploratory
9 exploratory analyses
10 analyses of
11 of the
12 the text
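As a quick check, the paired-word list above should be reproducible with unnest_tokens() from tidytext by asking for ngrams of length 2:

# Reproduce the bigram list for the example sentence
example_sentence <- tibble(
  text = "The result of separating bigrams is helpful for exploratory analyses of the text."
)
example_sentence %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)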

# Tri-grams (n = 3)
words_trigram <- threads_3 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = trigram,
                input = text,
                token = "ngrams",
                n = 3)
# Show tri-grams with sorted values
words_trigram %>%
  count(trigram, sort = TRUE) %>% 
  head(20) %>% 
  knitr::kable()
trigram n
nk f9 tk 26
0 f 0flt 13
1 1 a 13
1 a nk 13
1 s 1 13
1 t f 13
a nk f9 13
c usd e 13
e 1 s 13
f 0 f 13
f9 tk nk 13
f9 tk sd 13
s 1 1 13
sd 1 t 13
tk nk f9 13
tk sd 1 13
usd e 1 13
f atlanta to 11
t f atlanta 11
22restrict_sr onsort newt 5

Here, we can see that the n-grams still contain stop words such as a, to, and so on. Next, we'll extract n-grams without stop words. We can use the separate() function from the tidyr package to split each tri-gram into three columns: word1, word2, and word3. Then, we filter out any rows where any of those columns contains a stop word using the filter() function.

# Separate tri-grams into three columns
words_trigram_sep <- words_trigram %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

# filter rows where there are stop words under any word column
words_trigram_filtered <- words_trigram_sep %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))

# Filter out words that are not encoded in ASCII
library(stringi)
words_trigram_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# Sort the new tri-gram (n=3) counts:
trigram_counts <- words_trigram_filtered %>%
  count(word1, word2, word3, sort = TRUE)

head(trigram_counts, 20) %>% 
  knitr::kable()
word1 word2 word3 n
nk f9 tk 26
f9 tk nk 13
f9 tk sd 13
tk nk f9 13
22restrict_sr onsort newt 5
chat settings wnuyl5a8flpnick 4
live chat settings 4
mind atlanta links 4
settings wnuyl5a8flpnick atlien_ 4
wnuyl5a8flpnick atlien_ atlanta 4
22events 22restrict_sr onsort 3
3a 22events 22restrict_sr 3
flair 3a 22events 3
weekly events thread 3
atlanta weekly events 2
atlien_ atlanta weekly 2
authorized repair person 2
calls detail morning 2
car break ins 2
daily 22restrict_sr onsort 2
top_tri_strings <- trigram_counts %>%
  head(3) %>%
  unite(trigram, word1:word3, sep = " ") %>%
  pull(trigram)
cat("Notable tri-grams: ", paste(top_tri_strings, collapse = "; "), ". These repeated phrases highlight the most common contexts in the pulled threads.")
## Notable tri-grams:  nk f9 tk; f9 tk nk; f9 tk sd . These repeated phrases highlight the most common contexts in the pulled threads.

These tri-grams capture the phrases repeated most often when Atlantans talk about theft, though the top entries here look like encoding or formatting noise (e.g., "nk f9 tk") rather than meaningful language. One noticeable trigram further down the list is "car break ins". If tri-grams are sparse after filtering, we can also look at bi-grams as a fallback while still reporting the tri-gram results.

replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
words_trigram <- threads_3 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(
    !word1 %in% stop_words$word,
    !word2 %in% stop_words$word,
    !word3 %in% stop_words$word,
    str_detect(word1, "[a-z]"),
    str_detect(word2, "[a-z]"),
    str_detect(word3, "[a-z]")
  ) %>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

trigram_counts <- words_trigram %>%
  count(word1, word2, word3, sort = TRUE)

# Show top trigrams
trigram_counts %>% head(20) %>% knitr::kable()
word1 word2 word3 n
nk f9 tk 26
f9 tk nk 13
f9 tk sd 13
tk nk f9 13
22restrict_sr onsort newt 5
chat settings wnuyl5a8flpnick 4
live chat settings 4
mind atlanta links 4
settings wnuyl5a8flpnick atlien_ 4
wnuyl5a8flpnick atlien_ atlanta 4
22events 22restrict_sr onsort 3
3a 22events 22restrict_sr 3
flair 3a 22events 3
weekly events thread 3
atlanta weekly events 2
atlien_ atlanta weekly 2
authorized repair person 2
calls detail morning 2
car break ins 2
daily 22restrict_sr onsort 2
# Notable trigrams
top_tri_strings <- trigram_counts %>%
  head(3) %>%
  unite(trigram, word1:word3, sep = " ") %>%
  pull(trigram)
cat("Notable tri-grams: ", paste(top_tri_strings, collapse = "; "), ". These repeated phrases highlight the most common contexts in the pulled threads.")
## Notable tri-grams:  nk f9 tk; f9 tk nk; f9 tk sd . These repeated phrases highlight the most common contexts in the pulled threads.
# Let's look at bi-grams for additional context
words_bigram <- threads_3 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = bigram,
                input = text,
                token = "ngrams",
                n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>%
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]")) %>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))

bigram_counts <- words_bigram %>%
  count(word1, word2, sort = TRUE)

if (nrow(trigram_counts) == 0) {
  message("No clean tri-grams found; showing bi-grams instead.")
  bigram_counts %>% head(20) %>% knitr::kable()
}

Finally, by using the igraph and ggraph packages, we can visualize words occurring in pairs, which allows us to see common relationships between words in the text.

# plot word network using Bigram
words_counts <- bigram_counts
words_counts %>%
  filter(n >= 3) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = .6, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")

What we can interpret from this network is that phrases linking "Atlanta – police – links", "apartment – complex", and "license – plate" suggest stories about car break-ins, stolen plates, and thefts in apartment parking lots. "North Druid" also appears; it could mark a location where incidents cluster, or simply a place of interest that people mention.

4 Sentiment Analysis

In this stage, I import the sentiment analysis results produced in Google Colab.
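Before this step, the cleaned threads were exported to CSV so the Colab notebook could score them with the BERT model. A minimal sketch of that export, assuming the scored file was built from threads_3 (the file name is an assumption):

# Sketch (not re-run at knit time): write the threads out for BERT scoring in Colab
readr::write_csv(threads_3, file.path(out_dir, "threads_for_bert.csv"))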

# import the data
reddit_sentiment <- readr::read_csv("C:/Users/qduong7/OneDrive - Georgia Institute of Technology/Documents/Team project (urban analytics)/RedditAnalysis/reddit_bert.csv", show_col_types = FALSE)
## New names:
## * `` -> `...1`
# drop NAs
reddit_sentiment %<>% drop_na('bert_label')

Comparison with the dictionary method

Get sentiment scores using the dictionary method for comparison.

# Join thread title and text.
reddit_sentiment %<>%
  mutate(title = replace_na(title, ""),
         text = replace_na(text, ""),
         title_text = str_c(title, text, sep = ". "))

# dictionary method
reddit_sentiment_dictionary <- sentiment_by(reddit_sentiment$title_text)

reddit_sentiment$sentiment_dict <- reddit_sentiment_dictionary %>% pull(ave_sentiment)
reddit_sentiment$word_count <- reddit_sentiment_dictionary %>% pull(word_count)

Check the correlation between the sentiment values from two different methods.

reddit_sentiment %<>% mutate(bert_label_numeric = str_sub(bert_label, 1, 1) %>% as.numeric())

cor(reddit_sentiment$bert_label_numeric, reddit_sentiment$sentiment_dict)
## [1] 0.3302182

A correlation of 0.33 implies a mild positive relationship. In the scatter plot below, the two measures do not appear strongly correlated; the threads that got 1 star from the BERT model mostly fall below 0 (meaning negative) under the dictionary method.

Let’s look at some example threads and the predicted sentiment, and see which method makes more sense.

  • BERT: 1 star (negative) vs. 5 stars (positive)
bert_example <- reddit_sentiment %>%
  filter(bert_label_numeric %in% c(1,5)) %>%
  group_by(bert_label) %>%
  arrange(desc(bert_score)) %>%
  slice_head(n = 3) %>%
  ungroup()

# 1 star
bert_example %>% filter(bert_label_numeric == 1) %>% pull(title_text) %>% print()
## [1] "Random act of kindness at MARTA station goes viral. "           
## [2] "1 injured in shooting outside of Lenox Square in Buckhead. "    
## [3] "Woman accused of stealing more than $93K from church arrested. "
# 5 star
bert_example %>% filter(bert_label_numeric == 5) %>% pull(title_text) %>% print()
## [1] "This is my favorite drone shot of Georgia Tech's Tech Tower at sunset. "
## [2] "I\031m so proud of this city!!. "                                       
## [3] "Finally! A Trump Executive Order I Can Get Behind. "
  • Dictionary method: negative vs. positive
sentimentr_example <- reddit_sentiment %>%
  mutate(sentimentr_abs = abs(sentiment_dict),
         sentimentr_binary = case_when(sentiment_dict > 0 ~ 'positive',
                                       TRUE ~ 'negative')) %>%
  group_by(sentimentr_binary) %>%
  arrange(desc(sentimentr_abs)) %>%
  slice_head(n = 3) %>%
  ungroup() %>%
  arrange(sentiment_dict)

# negative
sentimentr_example %>% filter(sentimentr_binary == 'negative') %>% pull(title_text) %>% print()
## [1] "Suspected would-be robber says he's the real victim. "                                     
## [2] "Atlanta Police officer accused of stealing $500 cash from deceased shooting victim fired. "
## [3] "Thieves steal from boy with cancer. "
# positive
sentimentr_example %>% filter(sentimentr_binary == 'positive') %>% pull(title_text) %>% print()
## [1] "Some friendly individual relieved me of a window and an iPod touch last night. "
## [2] "WSJ Editorial: Democracy Succeeds in Georgia. "                                 
## [3] "Police officer doesn't punish girl caught stealing, helps her instead. "

5. Insights

Okay, here is the fun part: let's discuss the intriguing insights I draw from the sentiment analysis and these visualizations:

People complain about theft mid-week and across the day, with only small weekend differences

Looking by day of week, every bar is again dominated by 1-star posts. There isn't a huge weekday vs weekend contrast, but we can see that Saturday and Sunday have a slightly thicker band of 4–5-star posts compared to the middle of the week. My interpretation is that on weekends people are a bit more likely to use "stealing" in jokes, memes, or lighter stories (for example, "my dog is stealing socks again"), whereas mid-week posts tilt more toward frustrated, crime-related reports.

reddit_sentiment %<>%
  mutate(
    date = suppressWarnings(anytime(date_utc, tz = anytime:::getTZ())),
    timestamp_parsed = suppressWarnings(anytime(timestamp, tz = anytime:::getTZ()))
  ) %>%
  filter(!is.na(date)) %>%
  mutate(
    year = year(date),
    day_of_week = wday(date, label = TRUE),
    is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"),
    time = coalesce(hour(timestamp_parsed), hour(date))
  )
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)
# sentiment by year (stacked counts)
g1 <- reddit_sentiment %>%
  filter(year >= 2018) %>%
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = "stack") +
  scale_x_continuous(breaks = seq(min(reddit_sentiment$year), max(reddit_sentiment$year), by = 1)) +
  scale_fill_brewer(palette = "PuRd", direction = -1) +
  theme_gray()
# sentiment by year (proportions)
g2 <- reddit_sentiment %>%
  filter(year >= 2018) %>%
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = "fill") +
  scale_x_continuous(breaks = seq(min(reddit_sentiment$year), max(reddit_sentiment$year), by = 1)) +
  scale_fill_brewer(palette = "PuRd", direction = -1) +
  theme_gray()

ggsave(file.path(out_dir, "sentiment_by_year_stack.png"), g1, width = 7, height = 4, dpi = 150)
ggsave(file.path(out_dir, "sentiment_by_year_fill.png"), g2, width = 7, height = 4, dpi = 150)

g1; g2;

The time-of-day plot tells a complementary story. Negative posts appear at almost every hour, which fits the idea that theft and venting can happen anytime; in particular, negative posts dominate the late-night hours (after 12 am). But there is a visible cluster of 4–5-star posts during the morning and early evening hours, the times when people are commuting or scrolling after work. Those more positive/neutral posts likely include advice threads, news links, or lighter commentary that reuses the word "stealing" in less literal ways.

Posts about "stealing" are overwhelmingly negative, especially after 2020

Over the full 2018–2025 window, almost every year is dominated by 1-star posts from the BERT model, meaning that most “stealing” threads are clearly negative in tone. The stacked bar chart by year shows that volume peaks around 2018–2020, then drops but never disappears in later years. People keep coming back to Reddit to talk about theft, just with fewer threads per year than at the pre-COVID peak.

When we switch to the proportional view, the pattern gets sharper: in early years there is still a noticeable band of 4–5-star posts (more neutral or joking uses of "stealing"), but by 2021–2022 the bars are almost entirely dark (1-star). In other words, the mix of conversations shifts toward more clearly negative posts after 2020. Only 2023 shows a small rebound of more positive/neutral threads, but even there, 1-star posts remain the largest slice. In 2024 and 2025, 1-star (negative) posts remain the majority:

if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)

# sentiment by day of week (proportions)
g3 <- reddit_sentiment %>%
  ggplot(aes(x = day_of_week, fill = bert_label)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "PuRd", direction = -1) +
  theme_gray()
# sentiment by time of day (proportions)
g4 <- reddit_sentiment %>%
  ggplot(aes(x = time, fill = bert_label)) +
  geom_histogram(bins = 24, position = "fill", color = "black", lwd = 0.2) +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  scale_fill_manual(values = c("#bc5090", "#bc5090", "#ff6361", "#ffa600", "#ffa600")) +
  theme_gray()

ggsave(file.path(out_dir, "sentiment_by_year_stack.png"), g1, width = 7, height = 4, dpi = 150)
ggsave(file.path(out_dir, "sentiment_by_year_fill.png"), g2, width = 7, height = 4, dpi = 150)
ggsave(file.path(out_dir, "sentiment_by_day.png"), g3, width = 7, height = 4, dpi = 150)
ggsave(file.path(out_dir, "sentiment_by_time.png"), g4, width = 7, height = 4, dpi = 150)

g3; g4

  • Number of threads by sentiment category.
reddit_sentiment %>%
  ggplot(aes(x = bert_label)) +
  geom_bar(fill = "grey") +
  theme_gray()

What we can learn from this plot is that most posts fall into the 1-star bucket: Atlanta social feeds lean heavily negative when theft is involved, with far fewer upbeat posts.

Let's say I want to see word counts versus sentiment category:

reddit_sentiment %>%
  ggplot(aes(x = bert_label, y = word_count)) +
  geom_jitter(height = 0, width = 0.05) +
  stat_summary(fun = mean, geom = "crossbar", width = 0.4, color = "red") +
  theme_gray()

My interpretation here is that the longest narratives sit at the extremes (1-star vents and 4–5-star updates), while posts in the middle tend to be shorter, suggesting that stronger feelings produce longer narratives.

Vocabulary splits neatly between crime complaints and light-hearted posts

The text patterns line up nicely with what we’d expect from property-crime vs. playful uses of “stealing.” In the word network and the negative-thread word cloud, the dominant words and pairs point straight at concrete crime situations:

Phrases linking “Atlanta – police – links”, “apartment – complex”, and “license – plate” suggest stories about car break-ins, stolen plates, and thefts in apartment parking lots.

Single words like thief, alarm, gun, cops, company, van, plate, north echo the language of actual incidents: alarms going off, police being called, damage claims with a company, or specific parts of town.

By contrast, the positive-thread word cloud is almost a different universe:

Words like puppies, cards, trivia, braves, auburn, jeep, drivers point to sports, games, pets, and fandom.

Here “stealing” is probably being used metaphorically (“puppies stealing my heart”) or in a fun context (“Braves stealing bases,” “trivia night prizes”). So even when we condition on the same keyword, there are two distinct discourses: one around real theft and safety, another around playful or metaphorical “stealing.”

# plot word network
words_counts <- bigram_counts
words_counts %>%
  filter(n >= 3) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = .6, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")

Using word clouds, we can visualize the words that appear most frequently in positive and negative threads (see the sketch after this list):

  • Words appearing in negative threads.
  • Words appearing in positive threads.
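A minimal sketch of how these clouds can be drawn with wordcloud2, assuming "negative" means 1–2 BERT stars and "positive" means 4–5 stars (wordcloud2() reads the first two columns of the data frame as word and frequency):

# Tokenize thread text, drop stop words, and keep alphabetic tokens
cloud_words <- reddit_sentiment %>%
  unnest_tokens(word, title_text, token = "words") %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]"))

# Word cloud for negative threads (1-2 stars)
cloud_words %>%
  filter(bert_label_numeric <= 2) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()

# Word cloud for positive threads (4-5 stars)
cloud_words %>%
  filter(bert_label_numeric >= 4) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()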

The two sentiment methods mostly agree on direction, but BERT is more nuanced

To check whether our sentiment labels are trustworthy, I compared a dictionary-based score (sentimentr) with the BERT 1–5 star labels. The correlation is about 0.33, which is modest but clearly positive: higher BERT stars tend to go with slightly more positive dictionary scores.

The scatterplot shows the shape of that relationship:

Most 1-star posts have dictionary scores below zero, while 4- and 5-star posts sit mostly at or above zero.

When we reduce the comparison to just “sign of the sentiment” (positive vs negative), the two methods agree about 57% of the time, and the typical dictionary magnitude is fairly small (~0.22 in absolute value). That tells us two things:

The dictionary method usually gets the direction right for clear, emotional posts (e.g., very angry theft complaints).

For subtler or sarcastic threads, it often hovers near zero or even misclassifies the tone, while BERT still assigns a confident star rating.

g_scatter <- ggplot(data = reddit_sentiment, aes(x = bert_label_numeric, y = sentiment_dict)) +
  geom_jitter(width = 0.1, height = 0) +
  geom_line(aes(y = 0), color = "#FFD700", lwd = 1, linetype = "dashed") +
  theme_gray()
ggsave(file.path(out_dir, "bert_vs_dict_scatter.png"), g_scatter, width = 7, height = 4, dpi = 150)

g_scatter

Display 10 sample texts to evaluate BERT vs Dictionary methods

# Sample texts with scores for credibility check
set.seed(123)
sample_10 <- reddit_sentiment %>%
  sample_n(10) %>%
  select(title_text, bert_label, sentiment_dict, comments, word_count)

# Spot-check 10 random posts to see if the two sentiment methods and engagement levels line up with intuition before trusting the metrics
sample_10
## # A tibble: 10 x 5
##    title_text                      bert_label sentiment_dict comments word_count
##    <chr>                           <chr>               <dbl>    <dbl>      <int>
##  1 "Georgia Tech won't require st~ 2 stars           -0.177       188         13
##  2 "1 injured in shooting outside~ 1 star            -0.45         94          9
##  3 "Marietta man accused of steal~ 1 star            -0.476         2         12
##  4 "Dekalb County dog stolen off ~ 1 star            -0.265         8          8
##  5 "2 men arrested for stealing $~ 1 star            -0.177         0          8
##  6 "Plant stealing vagrant ruinin~ 1 star            -0.0693       21        185
##  7 "PSA: Don't let people in your~ 1 star             0.217        49        148
##  8 "Carjacking at Piedmont &amp; ~ 4 stars            0            11          5
##  9 "Why the Thrashers left Atlant~ 4 stars            0            48         15
## 10 "Robbers in EAV. If anyone has~ 1 star             0.272       118         16
# Quick credibility check comparing dictionary sign vs BERT class (negative/neutral/positive)
credibility_stats <- reddit_sentiment %>%
  mutate(
    bert_sign = case_when(bert_label_numeric <= 2 ~ -1,
                          bert_label_numeric == 3 ~ 0,
                          TRUE ~ 1),
    dict_sign = case_when(sentiment_dict < 0 ~ -1,
                          sentiment_dict == 0 ~ 0,
                          TRUE ~ 1)
  ) %>%
  summarise(
    sign_agreement = mean(bert_sign == dict_sign),
    mean_abs_dict = mean(abs(sentiment_dict))
  )
credibility_stats
## # A tibble: 1 x 2
##   sign_agreement mean_abs_dict
##            <dbl>         <dbl>
## 1          0.567         0.222

EXTRA: Association between a thread's sentiment and the number of comments on the thread

# Remove outliers
reddit_sentiment_rm_outlier <- reddit_sentiment %>%
  group_by(bert_label) %>%
  filter(
    between(
      comments,
      quantile(comments, 0.25) - 1.5 * IQR(comments),
      quantile(comments, 0.75) + 1.5 * IQR(comments)))

# Correlation analysis
cor.test(reddit_sentiment_rm_outlier$comments, reddit_sentiment_rm_outlier$bert_label_numeric)
## 
##  Pearson's product-moment correlation
## 
## data:  reddit_sentiment_rm_outlier$comments and reddit_sentiment_rm_outlier$bert_label_numeric
## t = -1.2782, df = 193, p-value = 0.2027
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.22918361  0.04952823
## sample estimates:
##         cor 
## -0.09162176
# Scatterplot
reddit_sentiment_rm_outlier %>%
  ggplot(aes(x = bert_label_numeric, y = comments)) +
  geom_jitter(height = 0, width = 0.05) + 
  geom_smooth(method = 'loess', span = 0.75) +
  theme_gray()
## `geom_smooth()` using formula = 'y ~ x'

What we can interpret from this chart: comment volume doesn't swing dramatically with sentiment, so people engage with both complaint posts and positive resolutions, but there is no strong "more anger → more comments" effect.

Conclusion

Overall, Reddit threads that mention “stealing” are strongly skewed toward negative sentiment across all years, with especially dark tones in the COVID and post-COVID period. Complaints show up on every day of the week and at all hours, so they are more of a constant background worry than a weekend or late-night spike.

The text patterns split into two clear worlds: one cluster of posts uses words like thief, alarm, cops, apartment, plate, describing real property crimes and safety concerns, while a smaller cluster uses puppies, trivia, Braves, drivers in more playful or metaphorical ways.