# Package names
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "here")
devtools::install_github("lchiffon/wordcloud2")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
DISCLAIMER: Due to the uncensored nature of online communities, some Reddit posts may contain images or language that are not suitable for work or school. You may also encounter controversial content. This is part of studying urban analytics, as online platforms often reflect real social dynamics. That said, please keep an open mind, and I hope no one feels offended or uncomfortable with what we might come across.
I have used the RedditExtractoR package to collect Reddit data. The function RedditExtractoR::find_thread_urls() lets you search for Reddit threads by subreddit name, keyword, or a combination of both. This function takes four arguments:
- keywords
- subreddit (optional)
- sort_by: “top” (default), “relevance”, “comments”, “new”, “hot”, “rising”
- period: “month” (default), “hour”, “day”, “week”, “year”, “all”

Although the package description does not explicitly state this, it appears that the find_thread_urls() function returns at most 250 threads.
Let’s first search for threads using a keyword.
threads_1 <- find_thread_urls(keywords = "public transportation",
sort_by = 'relevance',
period = 'all') %>%
drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads_1) <- NULL
# Sanitize text
threads_1 %<>%
mutate(across(
where(is.character),
~ .x %>%
str_replace_all("\\|", "/") %>% # replace vertical bars
str_replace_all("\\n", " ") %>% # replace newlines
str_squish() # clean up extra spaces
))
colnames(threads_1)
## [1] "date_utc" "timestamp" "title" "text" "subreddit" "comments"
## [7] "url"
head(threads_1, 3) %>% knitr::kable()
| date_utc | timestamp | title | text | subreddit | comments | url |
|---|---|---|---|---|---|---|
| 2023-06-04 | 1685902681 | We need to appreciate the decent level of public transport our rural areas get. | | fuckcars | 243 | https://www.reddit.com/r/fuckcars/comments/140leil/we_need_to_appreciate_the_decent_level_of_public/ |
| 2018-11-19 | 1542636032 | I love public transport | | PublicFreakout | 1200 | https://www.reddit.com/r/PublicFreakout/comments/9ygzif/i_love_public_transport/ |
| 2022-11-20 | 1668909420 | this city has no public transportation. Impossible to find work without a license or vehicle. | | fuckcars | 356 | https://www.reddit.com/r/fuckcars/comments/yzs9ck/this_city_has_no_public_transportation_impossible/ |
The initial keyword search for “public transportation” returns a diverse set of Reddit threads across multiple subreddits. The sample of titles shows that users discuss a wide range of topics, including metro systems, buses, new lines, and general transit experiences. This confirms that the keyword successfully captures conversations directly related to public transportation rather than unrelated content.
Next, let’s try searching by subreddit instead.
# search for subreddits
subreddit_list <- RedditExtractoR::find_subreddits("public transportation")
## parsing URLs on page 1...
## parsing URLs on page 2...
subreddit_list %>%
arrange(desc(subscribers)) %>%
.[1:25,c('subreddit','title','subscribers')] %>%
knitr::kable()
| | subreddit | title | subscribers |
|---|---|---|---|
| 2qh33 | funny | funny | 66845602 |
| 2qh1i | AskReddit | Ask Reddit… | 57219950 |
| 2qh13 | worldnews | World News | 46891782 |
| 2qqjc | todayilearned | Today I Learned (TIL) | 41136859 |
| 2szyo | Showerthoughts | Showerthoughts | 34015613 |
| 2qh0u | pics | Reddit Pics | 33123445 |
| 2qh41 | travel | travel | 14079308 |
| 2ubgg | mildlyinfuriating | jukmifgguggh | 11567831 |
| 2s7tt | AdviceAnimals | Advice Animals | 9908321 |
| 2cneq | politics | Politics | 8955647 |
| 2qh61 | WTF | WTF?! | 7041105 |
| 2si92 | MapPorn | Map Porn, for interesting maps | 6121734 |
| 2tk0s | unpopularopinion | For your Opinions that are Unpopular | 4794635 |
| 2qjov | Philippines | Philippines - all about the Philippines | 3498130 |
| 2qhu8 | aviation | aviation | 2641102 |
| hcycg | povertyfinance | Personal Finance For The Financially Challenged | 2477212 |
| 2qh4r | conspiracy | conspiracy | 2265833 |
| 2uayg | AskEurope | Ask Europe | 1621515 |
| 2uah7 | AskAnAmerican | Ask Americans about their country! | 1096930 |
| 2qhu2 | nyc | nyc reddit | 953598 |
| 2qht0 | LosAngeles | Los Angeles news, meet-ups, events, and more! | 878584 |
| 2qjyy | bayarea | San Francisco Bay Area | 744418 |
| 2qhad | Seattle | Seattle | 744300 |
| 2qh3r | boston | Boston, MA | 730296 |
| 2qh2t | chicago | Chicago | 621344 |
The subreddit search returns the communities that Reddit associates with the query. Sorting by subscriber count puts large general-interest subreddits (r/funny, r/AskReddit, r/worldnews) at the top, followed by several city subreddits such as r/nyc, r/LosAngeles, and r/chicago, so subscriber count alone is a poor guide to topical relevance. A dedicated community such as r/transit is a more suitable focus for deeper analysis because its discussions are centered on public transportation.
Alternatively, you can check how many threads were found for that keyword within each subreddit.
threads_1$subreddit %>% table() %>% sort(decreasing = T) %>% head(20)
## .
## fuckcars worldnews PublicFreakout
## 21 11 9
## CasualUK todayilearned Damnthatsinteresting
## 7 6 4
## europe Futurology ImTheMainCharacter
## 4 4 4
## london mildlyinfuriating ShitAmericansSay
## 4 4 4
## brisbane CityPorn Coronavirus
## 3 3 3
## interestingasfuck nextfuckinglevel therewasanattempt
## 3 3 3
## AskReddit AskUK
## 2 2
Counting how often each subreddit appears in the keyword-based results shows where these threads come from. r/fuckcars contributes the most threads (21), followed by r/worldnews (11) and r/PublicFreakout (9), while most other subreddits contribute only a handful each. Much of the keyword-matched discussion therefore takes place in general-interest communities, which is one reason to narrow the search to a dedicated subreddit such as r/transit.
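To put a number on how concentrated the results are, one option is to compute the cumulative share of threads covered by the top subreddits. The sketch below uses a made-up vector `subs` in place of `threads_1$subreddit`; the values are illustrative only and are not taken from the data above.

```r
# Hypothetical stand-in for threads_1$subreddit (illustrative values only)
subs <- c(rep("fuckcars", 5), rep("worldnews", 3), "london", "europe")

# Count threads per subreddit, largest first
counts <- sort(table(subs), decreasing = TRUE)

# Cumulative share of all threads covered by the top k subreddits
cum_share <- cumsum(as.vector(counts)) / sum(counts)
names(cum_share) <- names(counts)

print(round(cum_share, 2))
```

Reading the result from left to right shows how many subreddits are needed to cover, say, half of the threads.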
After selecting a subreddit, let’s search threads within that subreddit.
# using subreddit
threads_2 <- find_thread_urls(subreddit = "transit",
sort_by = 'top',
period = 'year') %>%
drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads_2) <- NULL
# Sanitize text
threads_2 %<>%
mutate(across(
where(is.character),
~ .x %>%
str_replace_all("\\|", "/") %>%
str_replace_all("\\n", " ") %>%
str_squish()
))
head(threads_2, 3) %>% knitr::kable()
| date_utc | timestamp | title | text | subreddit | comments | url |
|---|---|---|---|---|---|---|
| 2025-08-11 | 1754878390 | The difference between CO2 emissions of city buses vs couch buses is staggering | Seems like every city bus should at least be hybrid at the least. Even better if they are trolley or battery electric. A Toyota RAV4 hybrid emits 1.55 times less CO2 per km than a Toyota gas model in city driving. If we assume the same for buses, a London local bus would emit only 51 grams per passenger km. Much closer to an electric car. If we also consider the CO2 emissions during production of the vehicles a hybrid electric bus would be better for environment than an electric car. You only need one bus per 500 people compared to one car per two people even one car one person. | transit | 204 | https://www.reddit.com/r/transit/comments/1mn0yhi/the_difference_between_co2_emissions_of_city/ |
| 2025-06-30 | 1751262244 | The largest high-speed railway station in Asia. | transit | 141 | https://www.reddit.com/r/transit/comments/1lnzrgt/the_largest_highspeed_railway_station_in_asia/ | |
| 2025-11-03 | 1762198113 | Yerevan, Armenia Metro has it all | The only single-car non-articulated trains in daily passenger service that I know of worldwide on the Charbakh spur line, a whopping 6 livery variations on the same series of Soviet-era trains (used either in 2 or 3 car trainsets on the main line), the coolest metro cat in the world, an abandoned car Metro 2033 style, and super interesting station designs. Bonus picture is the abandoned Yerevan cable car gondola that has been sitting defunct since the 1994 accident. | transit | 33 | https://www.reddit.com/r/transit/comments/1onldkb/yerevan_armenia_metro_has_it_all/ |
Restricting the search to r/transit gives a more focused dataset of posts specifically curated by users interested in public transportation. The sample of threads_2 shows titles that are almost entirely about transit systems, infrastructure, and policy. This subreddit-based filtering reduces noise and ensures that the subsequent analysis reflects conversations in a community explicitly centered on transit.
Lastly, let’s search by both the keyword and subreddit.
# using both subreddit and keyword
threads_3 <- find_thread_urls(keywords= "public transportation",
subreddit ="transit" ,
sort_by = 'relevance',
period = 'all') %>%
drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads_3) <- NULL
# Sanitize text
threads_3 %<>%
mutate(across(
where(is.character),
~ .x %>%
str_replace_all("\\|", "/") %>%
str_replace_all("\\n", " ") %>%
str_squish()
))
head(threads_3, 3) %>% knitr::kable()
| date_utc | timestamp | title | text | subreddit | comments | url |
|---|---|---|---|---|---|---|
| 2025-09-13 | 1757753530 | I love the Deutschlandticket | Context as a Student, I only pay 38€ to use public transport (not including IC/ICE Trains). | transit | 55 | https://www.reddit.com/r/transit/comments/1nfslns/i_love_the_deutschlandticket/ |
| 2025-01-25 | 1737766983 | Bangkok gets free public transport for a week on smog crisis | transit | 3 | https://www.reddit.com/r/transit/comments/1i9b8az/bangkok_gets_free_public_transport_for_a_week_on/ | |
| 2023-12-20 | 1703092281 | If you had to rank public transportation systems of Denmark, Netherlands and Austria | Hola chicos! First time poster, long time lurker here. I was curious, if you had to rank the public transportation system of Austria, Denmark, and the Netherlands, how would you rank them? The criteria would ignore ticket/ridership costs. ​ Instead, this would focus more on: 1. Connectivity: How accessible is it in rural areas? City to city? Within city? 2. Integration to other forms of transit: Ex: Trains to trams to bikes; the use 1 card for the whole system, etc 3. Reliability: Is it on time? Does it have ample hours? 4. Alternatives: For instance if you can’t take a train, can you bike there? Or can you take a water taxi? Or can you hop on a bus? In other words are you mainly stuck to one way of getting to a place, or do you have a plan B? 5. Build Quality & Service: Is your infrastructure outdated and falling apart or is it the best of the best? Is it comfortable? How is the service? Do you have cool amenities like fancy cafes in your trains? ​ Lastly if you have any complaints or if you see limitations or weaknesses in a transit system feel free to comment. Likewise if you have suggestions to how you could improve a country’s transit systems for instance maybe more train schedules that extend to later hours and so forth. | transit | 32 | https://www.reddit.com/r/transit/comments/18mzmj6/if_you_had_to_rank_public_transportation_systems/ |
Combining the keyword “public transportation” with the r/transit subreddit further narrows the dataset to posts that explicitly mention the term within a transit-focused community. The sample shows that these threads often address higher-level issues such as system design, funding, or comparative transit quality across cities. This subset is more specialized but smaller in size, illustrating the trade-off between breadth and strict topical focus.
Save threads_1, threads_2, and threads_3 as CSV files for the sentiment analysis.
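The export code is not shown here, so the following is only a minimal sketch of one way to do it. It uses a toy data frame (`threads_demo`, a hypothetical name) and writes to a temporary folder; in a real project you would pass `threads_1`, `threads_2`, or `threads_3` and a path inside your project folder, for example one built with here::here().

```r
# Toy stand-in for threads_1 (the real object comes from find_thread_urls())
threads_demo <- data.frame(
  title = c("I love public transport", "New metro line opens"),
  comments = c(1200L, 33L),
  stringsAsFactors = FALSE
)

# Write to a temporary folder; swap tempdir() for your project folder
out_path <- file.path(tempdir(), "threads_demo.csv")
write.csv(threads_demo, out_path, row.names = FALSE)

# Read the file back to confirm the round trip
threads_back <- read.csv(out_path, stringsAsFactors = FALSE)
print(dim(threads_back))
```

Setting `row.names = FALSE` keeps R's row numbers out of the file, so the CSV has the same columns as the data frame.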
The find_thread_urls function provides the title and
text of threads but does not include comments of each thread – it only
shows the number of comments. To retrieve the comments, you can use the
get_thread_content function, which takes the thread URLs
returned by find_thread_urls as input.
One caveat is that this process can take quite a long time to run. In
the example below, we use get_thread_content only for the
first four threads to keep things manageable.
# get individual comments
threads_2_content <- get_thread_content(threads_2$url[1:4])
The output object threads_2_content consists of two data
frames: threads and comments.
The threads data frame contains additional information
that was not provided by find_thread_urls, such as the
username of the poster, upvotes, and downvotes.
The downvotes column appears to be always zero, but the up_ratio column provides insight into how positively or negatively users reacted to the thread.
The get_thread_content() results show that each thread has additional metadata and comment-level data not available in the original find_thread_urls() output. The threads data frame includes upvotes and the upvote ratio, which provide a rough measure of how positively the community responded to each post. Even though downvotes are zero in the dataset, variation in up_ratio indicates differences in how controversial or well-received each thread was.
names(threads_2_content)
## [1] "threads" "comments"
# check upvotes and downvotes
print(threads_2_content$threads[,c('upvotes','downvotes','up_ratio')])
## upvotes downvotes up_ratio
## 1 716 0 0.96
## 2 715 0 0.97
## 3 720 0 1.00
## 4 716 0 0.99
The comments data frame provides information on
individual comments.
# Sanitize text
threads_2_content$comments %<>%
mutate(across(
where(is.character),
~ .x %>%
str_replace_all("\\|", "/") %>%
str_replace_all("\\n", " ") %>%
str_squish()
))
head(threads_2_content$comments, 3) %>% knitr::kable()
| url | author | date | timestamp | score | upvotes | downvotes | golds | comment | comment_id |
|---|---|---|---|---|---|---|---|---|---|
| https://www.reddit.com/r/transit/comments/1mn0yhi/the_difference_between_co2_emissions_of_city/ | Username7381 | 2025-08-11 | 1754879049 | 225 | 225 | 0 | 0 | Couch bus | 1 |
| https://www.reddit.com/r/transit/comments/1mn0yhi/the_difference_between_co2_emissions_of_city/ | LegoFootPain | 2025-08-11 | 1754879956 | 182 | 182 | 0 | 0 | JD Vance: breathes heavily | 1_1 |
| https://www.reddit.com/r/transit/comments/1mn0yhi/the_difference_between_co2_emissions_of_city/ | BigMatch_JohnCena | 2025-08-11 | 1754884507 | 10 | 10 | 0 | 0 | Underrated reply =- | 1_1_1 |
The comments table reveals how users interact with each thread at a finer level. Individual comments often contain more emotional language, reactions, and arguments than the original thread titles. Cleaning the text (removing line breaks and special characters) prepares these comments for later natural language processing and sentiment analysis, ensuring that the data is in a usable form.
Using the date and timestamp information, we can analyze when posts are most popular on Reddit — by month, day, or hour.
First, let’s examine the number of threads per week over the last 12
months. You can do this by setting the binwidth in
geom_histogram() to one week in
seconds.
# create new column: date
threads_2 %<>%
mutate(date = as.POSIXct(date_utc)) %>%
filter(!is.na(date))
# number of threads by week
threads_2 %>%
ggplot(aes(x = date)) +
geom_histogram(color="black", position = 'stack', binwidth = 604800) +
scale_x_datetime(date_labels = "%b %y",
breaks = seq(min(threads_2$date, na.rm = T),
max(threads_2$date, na.rm = T),
by = "1 month")) +
theme_minimal()
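The binwidth is given in seconds because the x-axis holds POSIXct date-times, which R stores as seconds since 1970-01-01. As a quick check of where 604800 comes from:

```r
# One week expressed in seconds: 7 days x 24 hours x 60 minutes x 60 seconds
week_secs <- 7 * 24 * 60 * 60

# Same value derived from a difftime object
week_secs_check <- as.numeric(as.difftime(1, units = "weeks"), units = "secs")

print(week_secs)
print(week_secs_check)
```

Both lines give 604800, so a daily histogram would use 86400 instead.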
The weekly histogram of r/transit threads shows how posting activity
fluctuates over time. Peaks in the number of posts may correspond to
notable events in public transportation—such as new line openings,
service disruptions, or policy changes—while quieter periods suggest
more routine discussion. Overall, the plot indicates that interest in
public transportation on Reddit is persistent but uneven, with
occasional surges of attention.
Do people tend to post more on weekdays or weekends?
# create new columns: day_of_week, is_weekend
threads_2 %<>%
mutate(day_of_week = wday(date, label = TRUE)) %>%
mutate(is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"))
# number of threads by day of week
threads_2 %>%
ggplot(aes(x = day_of_week, fill = is_weekend)) +
geom_bar(color = 'black') +
scale_fill_manual(values = c("Weekday" = "gray", "Weekend" = "pink")) +
theme_minimal()
The day-of-week plot shows that most posts in r/transit occur on
weekdays rather than weekends. This pattern is consistent with the idea
that transit-related concerns are closely tied to daily commuting and
work schedules. While weekends still see some activity, the higher
weekday posting volume suggests that users are more likely to discuss
transit when they are actively using it for work or school.
You can extract the time of day from the timestamp
column.
print(threads_2$timestamp[1])
## [1] 1754878390
print(threads_2$timestamp[1] %>% anytime(tz = anytime:::getTZ()))
## [1] "2025-08-10 22:13:10 EDT"
threads_2 %<>%
mutate(time = timestamp %>%
anytime(tz = anytime:::getTZ()) %>%
str_split('-| |:') %>%
sapply(function(x) as.numeric(x[4])))
Let’s visualize the number of threads by time of day using the
time column we made from timestamp.
Note: the times are shown in our local time zone, but the posters and commenters may be in different time zones.
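To see how much the choice of time zone matters, the sketch below converts the example timestamp printed above into an hour of day in two zones using base R only. The zone "America/New_York" is an assumption chosen to match the EDT output shown earlier; replace it with whichever zone suits your analysis.

```r
# Example Unix timestamp taken from the output above
ts <- 1754878390

# Convert seconds since 1970-01-01 UTC to a date-time
t_utc <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")

# Hour of day in UTC and in an explicitly chosen zone
hour_utc <- as.integer(format(t_utc, "%H"))
hour_ny  <- as.integer(format(t_utc, "%H", tz = "America/New_York"))

print(c(utc = hour_utc, new_york = hour_ny))
```

The same post falls at hour 2 in UTC and hour 22 in New York, so the shape of an hourly histogram depends on the zone you pick.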
# number of threads by time of day
threads_2 %>%
ggplot(aes(x = time)) +
geom_histogram(bins = 24, color = 'black') +
scale_x_continuous(breaks = seq(0, 24, by=2)) +
theme_minimal()
The time-of-day histogram reveals that posts are not uniformly
distributed across the 24-hour cycle. There tend to be more posts during
waking and commuting hours, with noticeable concentrations around
typical morning and evening periods. This further reinforces the
connection between transit use and online discussion: riders are more
likely to reflect on or complain about their experiences shortly after
they occur.
Tokenization is a fundamental first step in any Natural Language Processing (NLP) pipeline.
“Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization” (source).
For example, consider the sentence: “How are you?”
If we then encode each token as a number, we can represent text as numeric data. For instance, if we assign h = 1, e = 2, l = 3, o = 4, then hello becomes 1 2 3 3 4 and heel becomes 1 2 2 3.
You can already see the similarity between hello and heel. This kind of tokenization will be useful later when we perform more advanced NLP tasks.
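The ideas above can be tried out in a few lines of base R. This is only an illustrative sketch: the helper `encode` is a hypothetical name, and the mapping is the one given in the text.

```r
sentence <- "How are you?"

# Word tokens: lowercase, strip punctuation, split on whitespace
word_tokens <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]

# Character tokens of a single word
char_tokens <- strsplit("hello", "")[[1]]

# Encode characters with the mapping from the text: h = 1, e = 2, l = 3, o = 4
codes <- c(h = 1, e = 2, l = 3, o = 4)
encode <- function(word) unname(codes[strsplit(word, "")[[1]]])

print(word_tokens)
print(char_tokens)
print(encode("hello"))
print(encode("heel"))
```

Later sections use tidytext::unnest_tokens() for the same job, which also handles lowercasing and punctuation for you.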
For a more intuitive explanation, check out this video.
Next, let’s tokenize the Reddit texts and plot the top 20 words to see which words appear most frequently. You will notice an issue: the plot includes many common words like the, to, and, a, for, etc. While it makes sense that these words appear often, they are not useful for our analysis.
# Word tokenization
words <- threads_2 %>%
unnest_tokens(output = word, input = text, token = "words") # for details, run ?tidytext::unnest_tokens in the console
words %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "words",
y = "counts",
title = "Unique wordcounts")
## Selecting by n
The raw token frequency plot shows the most common words appearing in
r/transit post text. Many of the top words are function words such as
“the,” “to,” and “and,” which are linguistically necessary but not
analytically informative. This confirms the need to remove stop words
before drawing substantive conclusions. Nonetheless, even at this stage
we likely see frequent content words like “train,” “metro,” or “bus,”
reflecting the core focus of the subreddit.
Those common words we just saw are known as stop words – words that are typically filtered out during NLP processing because they contribute little meaningful information to text analysis. These are usually very common words such as articles, pronouns, conjunctions, and prepositions.
We can remove stop words using a built-in dataset from the
tidytext package.
# load list of stop words - from the tidytext package
data("stop_words")
# view 100 random words
print(stop_words$word[sample(1:nrow(stop_words), 100)])
## [1] "by" "you'd" "might" "although" "she"
## [6] "general" "comes" "both" "shall" "during"
## [11] "what's" "okay" "has" "lest" "everything"
## [16] "fact" "she'll" "non" "things" "throughout"
## [21] "nevertheless" "my" "still" "place" "whatever"
## [26] "when's" "him" "already" "need" "said"
## [31] "she's" "neither" "n" "there" "went"
## [36] "it" "though" "specify" "want" "over"
## [41] "why" "came" "was" "about" "probably"
## [46] "she" "could" "tends" "both" "by"
## [51] "clearly" "to" "respectively" "try" "six"
## [56] "up" "himself" "hadn't" "uses" "own"
## [61] "behind" "whence" "also" "opens" "couldn't"
## [66] "let" "now" "from" "doesn't" "less"
## [71] "there's" "nobody" "has" "using" "you're"
## [76] "reasonably" "have" "whose" "within" "cannot"
## [81] "the" "no" "i" "otherwise" "x"
## [86] "those" "hasn't" "parting" "members" "thank"
## [91] "like" "going" "get" "however" "somebody"
## [96] "may" "they" "below" "we're" "g"
After removing stop words and non-alphabetic strings, the top 20 word plot shifts toward more meaningful vocabulary. High-frequency terms now highlight key themes in transit discussions, such as specific transport modes (e.g., “metro,” “bus”), urban contexts (“city,” “line”), and user-facing issues (“service,” “station”). This cleaner distribution provides a clearer picture of what people actually talk about when they discuss public transportation on Reddit.
We will use the anti_join() function to remove stop words from the text and leave behind a cleaned set of words. Like other join functions, the order of the data frames matters. Here’s the logic:
- anti_join(A, B) returns everything in A except the elements that also appear in B.
- anti_join(B, A) returns everything in B except what’s in A.

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&|<|>"
words_clean <- threads_2 %>%
# drop URLs
mutate(text = str_replace_all(text, replace_reg, "")) %>%
# Tokenization (word tokens)
unnest_tokens(word, text, token = "words") %>%
# drop stop words
anti_join(stop_words, by = "word") %>%
# keep only tokens that contain at least one letter a-z (drops pure numbers)
filter(str_detect(word, "[a-z]"))
# Check the number of rows after removal of the stop words. There should be fewer words now
print(
glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 5935, After: 2465
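The order sensitivity of anti_join() described earlier is easy to check on a toy example. The tables `A` and `B` below are hypothetical and exist only for illustration.

```r
library(dplyr)

# Toy tables: A holds tokens, B holds words to exclude (both hypothetical)
A <- tibble(word = c("the", "bus", "is", "late"))
B <- tibble(word = c("the", "is", "of"))

# Rows of A with no match in B
a_not_b <- anti_join(A, B, by = "word")

# Rows of B with no match in A
b_not_a <- anti_join(B, A, by = "word")

print(a_not_b$word)
print(b_not_a$word)
```

The first call keeps "bus" and "late", which is the behavior we want for stop-word removal; the second keeps only "of".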
Once you have removed the stop words, let’s create a plot using the cleaned text to see which meaningful words are used most frequently.
words_clean %>%
count(word, sort = TRUE) %>%
top_n(20, n) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "words",
y = "counts",
title = "Unique wordcounts")
Let’s compare the frequency of words before and after removing stop words using a word cloud.
words %>%
count(word, sort = TRUE) %>%
wordcloud2()
words_clean %>%
count(word, sort = TRUE) %>%
wordcloud2()
The word clouds generated above look nice, but their color schemes can be a bit overwhelming.
The following block of code creates a custom color palette designed to highlight a selected number of words while graying out the rest. You can easily generate a collection of random colors using the HSV (Hue, Saturation, Value) color model.
n <- 20 # number of words with color
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright
df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))
words_clean %>%
count(word, sort = TRUE) %>%
wordcloud2(color = pal,
minRotation = 0,
maxRotation = 0,
ellipticity = 0.8)
The word cloud based on raw tokens emphasizes both common function words and frequently used content words, making it visually rich but somewhat noisy. In contrast, the word cloud using the cleaned words (with stop words removed and custom coloring) highlights substantive transit-related terms. This visualization makes it easy to spot recurring concepts—such as transit modes, city names, and quality descriptors—giving an intuitive overview of dominant topics in the dataset.
An n-gram is a sequence of n words that appear together. For example:
In advanced text analysis and machine learning, specific tokens and n-grams are often used as features for modeling and classification tasks.
N-grams are particularly useful for analyzing words in context. Consider these two sentences:
The word “check” functions as a verb in the first sentence and a noun in the second. We can understand its meaning based on the surrounding words, especially those immediately before or after it. For example, when “check” follows “to”, it’s likely being used as a verb.
As an example of bigrams, the sentence “The result of separating bigrams is helpful for exploratory analyses of the text.” becomes a list of paired words:
1 the result
2 result of
3 of separating
4 separating bigrams
5 bigrams is
6 is helpful
7 helpful for
8 for exploratory
9 exploratory analyses
10 analyses of
11 of the
12 the text
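The listing above can be reproduced with a short base R sketch that pairs each word with the one that follows it. This is an illustration of the idea rather than the tidytext implementation, which is used in the next step.

```r
sentence <- "The result of separating bigrams is helpful for exploratory analyses of the text."

# Lowercase, drop punctuation, and split into words
w <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]

# Pair each word with the one that follows it
bigrams <- paste(head(w, -1), tail(w, -1))

print(bigrams)
```

A sentence of 13 words yields 12 bigrams; in general, a text of k words yields k - n + 1 n-grams.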
# Get ngrams. You may try playing around with the value of n, n=3, n=4
words_ngram <- threads_2 %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
select(text) %>%
unnest_tokens(output = paired_words,
input = text,
token = "ngrams",
n = 3)
Extracting trigrams reveals short phrases that capture more context than single words, such as recurring expressions about transit systems or city comparisons. Many of these trigrams still contain stop words, but they begin to hint at typical sentence structures and repeated phrases in transit discussions.
# Show ngrams with sorted values
words_ngram %>%
count(paired_words, sort = TRUE) %>%
head(20) %>%
knitr::kable()
| paired_words | n |
|---|---|
| NA | 163 |
| in the world | 5 |
| the bay area | 5 |
| to get to | 4 |
| all of the | 3 |
| as well as | 3 |
| cobb or gwinnett | 3 |
| i m sick | 3 |
| in the city | 3 |
| it s a | 3 |
| m sick of | 3 |
| one of the | 3 |
| sick of traffic | 3 |
| the city and | 3 |
| 00 and arrived | 2 |
| a big sign | 2 |
| a braves game | 2 |
| about public transit | 2 |
| amtrak ridership in | 2 |
| an electric car | 2 |
After splitting the n-grams and filtering out stop words and non-alphabetic terms, the resulting word pairs (bigrams) show more meaningful associations between content words. High-frequency pairs highlight how users link concepts—for example, connecting transit modes with performance descriptors or linking specific cities with their transportation systems. These bigrams provide insight into how people conceptually group ideas when discussing public transportation.
Here, we can see that the n-grams still contain stop words such as a, to, and so on. Next, we’ll extract n-grams without stop words. We can use the separate function from the tidyr package to split each n-gram into two columns: word 1 and word 2. Because words_ngram was built with n = 3, separate keeps the first two words of each trigram and discards the third, which is why the warning below appears. Then, we filter out any rows where either column contains a stop word using the filter function.
#separate the paired words into two columns
words_ngram_pair <- words_ngram %>%
separate(paired_words, c("word1", "word2"), sep = " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 5651 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
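The warning comes from splitting three-word strings into two columns. A small sketch on hypothetical trigrams shows what separate() keeps, and how `extra = "drop"` makes the truncation explicit and silent:

```r
library(tidyr)

# Hypothetical trigrams, mimicking rows of words_ngram
toy <- data.frame(paired_words = c("in the world", "sick of traffic"),
                  stringsAsFactors = FALSE)

# extra = "drop" discards the third word silently instead of warning
toy_pairs <- separate(toy, paired_words, c("word1", "word2"),
                      sep = " ", extra = "drop")

print(toy_pairs)
```

If you want true bigrams instead of truncated trigrams, building the n-grams with n = 2 avoids the issue altogether.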
# filter rows where there are stop words under word 1 column and word 2 column
words_ngram_pair_filtered <- words_ngram_pair %>%
# drop stop words
filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>%
# keep only pairs where both words contain at least one letter a-z
filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))
# Filter out words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
library(stringi)
words_ngram_pair_filtered %<>%
filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))
# Sort the new bi-gram (n=2) counts:
words_counts <- words_ngram_pair_filtered %>%
count(word1, word2) %>%
arrange(desc(n))
head(words_counts, 20) %>%
knitr::kable()
| word1 | word2 | n |
|---|---|---|
| public | transportation | 9 |
| public | transit | 6 |
| light | rail | 5 |
| folding | doors | 4 |
| lines | picture | 3 |
| mexico | city | 3 |
| amtrak | ridership | 2 |
| braves | game | 2 |
| bus | route | 2 |
| car | centric | 2 |
| centennial | park | 2 |
| central | station | 2 |
| city | centre | 2 |
| dc | metro | 2 |
| deep | canvas | 2 |
| duffy | investigates | 2 |
| electric | car | 2 |
| elon | musk | 2 |
| expand | marta | 2 |
| expanding | marta | 2 |
Finally, by using the igraph and ggraph
packages, we can visualize words occurring in pairs,
which allows us to see common relationships between words in the
text.
# plot word network
words_counts %>%
filter(n >= 3) %>%
graph_from_data_frame() %>% # convert to graph
ggraph(layout = "fr") +
geom_edge_link(aes(edge_width = n), edge_alpha = 0.6) + # constant alpha belongs outside aes()
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label = name), vjust = 1.8) +
labs(title = "Word Networks",
x = "", y = "")
The word network plot visualizes frequent word pairs as a graph, where
nodes are words and edges represent co-occurrence in bigrams. Clusters
in the network suggest thematic groupings—for instance, one cluster may
center on “metro,” “station,” and “line,” while another may relate to
“bus,” “service,” and “delay.” This network view reveals how different
aspects of public transportation are interconnected in user discussions
and helps identify core conceptual structures within the text.
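To see what graph_from_data_frame() builds before it is plotted, here is a minimal sketch on a hypothetical three-row table shaped like words_counts. The counts are borrowed from the top of the bigram table above purely for illustration.

```r
library(igraph)

# Hypothetical bigram counts in the same shape as words_counts
toy_counts <- data.frame(
  word1 = c("public", "public", "light"),
  word2 = c("transportation", "transit", "rail"),
  n     = c(9, 6, 5)
)

# The first two columns become edge endpoints; other columns become edge attributes
g <- graph_from_data_frame(toy_counts)

print(vcount(g))  # number of distinct words (nodes)
print(ecount(g))  # number of word pairs (edges)
print(E(g)$n)     # counts carried along as an edge attribute
```

Because n travels with each edge, ggraph can map it to edge width, as the plot above does.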
Combining the title and body text into a single field (text_all) and applying sentimentr::sentiment_by() produces a numeric sentiment score for each thread, along with the text’s word count. The summary of threads_2$sentiment shows that the scores span both positive and negative values, but tend to cluster closer to zero. This indicates that Reddit posts about public transportation cover a range of emotional tones but, on average, lean toward moderately positive or neutral sentiment rather than extreme negativity.
# Package names
packages <- c(
"RedditExtractoR", "anytime", "magrittr", "httr",
"tidytext", "tidyverse", "igraph", "ggraph",
"wordcloud2", "textdata", "here", "sentimentr", "glue"
)
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
invisible(lapply(packages, library, character.only = TRUE))
# Combine title and text for sentiment analysis
threads_2 <- threads_2 %>%
mutate(
title = replace_na(title, ""),
text = replace_na(text, ""),
text_all = stringr::str_c(title, text, sep = ". ")
)
# Run sentiment analysis (dictionary + negation-aware)
sent_res <- sentiment_by(threads_2$text_all)
# Attach back to main data frame
threads_2$sentiment <- sent_res$ave_sentiment
threads_2$word_count <- sent_res$word_count
summary(threads_2$sentiment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00943 -0.00859 0.01695 0.03830 0.18872 0.67082
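To get a feel for what these scores mean, it can help to run sentiment_by() on a couple of made-up sentences. The sketch below assumes the sentimentr package is installed and uses its default lexicon; the two sentences are hypothetical and differ only by a negation.

```r
library(sentimentr)

# Two made-up sentences that differ only by a negation
demo_text <- c("I love this train.", "I do not love this train.")

# One row per input element, with word_count and ave_sentiment columns
demo_sent <- sentiment_by(get_sentences(demo_text))

print(demo_sent$word_count)
print(demo_sent$ave_sentiment)
```

The first score should come out positive and the second negative, which is the negation handling referred to in the code comment above.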
threads_2 %>%
filter(!is.na(sentiment), !is.na(date)) %>%
mutate(month = lubridate::floor_date(date, "month")) %>%
group_by(month) %>%
summarize(
avg_sentiment = mean(sentiment, na.rm = TRUE),
n_threads = n()
) %>%
ggplot(aes(x = month, y = avg_sentiment)) +
geom_line(color = "darkgreen") +
geom_point() +
labs(
title = "Average Sentiment About Public Transportation Over Time",
x = "Month",
y = "Average sentiment"
)
The monthly average sentiment plot shows how attitudes toward public
transportation in r/transit evolve over time. While there is some
fluctuation from month to month, the overall pattern suggests that
sentiment is generally stable and slightly positive, with occasional
peaks or dips that may correspond to specific news events,
infrastructure openings, or service crises. This temporal view
highlights that public perceptions of transit are dynamic but do not
swing wildly, reflecting a mix of everyday experiences and episodic
events.
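Monthly averages can look like peaks or dips simply because a month contains very few threads. The sketch below uses a toy stand-in for threads_2 with invented values. It keeps the thread count next to each monthly mean and flags months below a hypothetical cutoff (min_threads), which would need to be chosen to suit the real data.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for threads_2 (values invented for illustration)
toy_threads <- tibble(
  date = as.Date(c("2024-01-05", "2024-01-20", "2024-01-28",
                   "2024-02-03", "2024-03-11", "2024-03-15")),
  sentiment = c(0.10, -0.05, 0.20, -0.60, 0.15, 0.05)
)

min_threads <- 2  # hypothetical cutoff, not a recommended value

monthly <- toy_threads %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarize(
    avg_sentiment = mean(sentiment),
    n_threads = n(),
    .groups = "drop"
  ) %>%
  # Flag months whose average rests on too few threads
  mutate(low_support = n_threads < min_threads)

monthly
```

In this toy example the sharp February dip comes from a single thread, which is the kind of case the low_support flag is meant to surface.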
library(stringr)
library(knitr)
# Create a combined text column
threads_2 <- threads_2 %>%
mutate(
title = tidyr::replace_na(title, ""),
text = tidyr::replace_na(text, ""),
text_all = stringr::str_c(title, text, sep = ". ")
)
# Sample 10 texts with sentiment scores
set.seed(123) # for reproducibility
sample_sentiment <- threads_2 %>%
filter(!is.na(sentiment)) %>%
select(text_all, sentiment, word_count, comments) %>% # comments from threads_2 metadata
mutate(
text_short = str_trunc(text_all, width = 140)
) %>%
select(text_short, sentiment, word_count, comments) %>%
slice_sample(n = 10)
knitr::kable(
sample_sentiment,
col.names = c(
"Text (truncated)",
"Sentiment score",
"Word count",
"Number of comments"
),
caption = "Sample of 10 Reddit threads on public transportation with sentiment scores"
)
| Text (truncated) | Sentiment score | Word count | Number of comments |
|---|---|---|---|
| American cities be like. | 0.2500000 | 4 | 185 |
| Why does Cairo, a city of over 22 million people, have only 3 metro lines?. Id expect more lines sooner because its one of the biggest … | 0.1448446 | 76 | 243 |
| I owe you an apology, Miami. I didnt know your game. Despite going to Miami many times on the way to visit family in the Keys, I never s… | 0.1725367 | 227 | 64 |
| We all like trains d [OC]. Opening of new stations along the 2 Line / Redmond, WA | 0.2353954 | 15 | 41 |
| Harness the power of patriotism to build more automated light metros. | 0.1296499 | 11 | 60 |
| Cities comparison. In case you couldn’t see this is obviously satire, so stop typing your angry comment. I decided to put a twist on the … | -0.1686942 | 28 | 177 |
| The world’s oldest underground station, Baker Street, 157 years apart.. | 0.0000000 | 9 | 21 |
| My first time at the new LAX Metro station. Very nice, very shiny, lots of employees around even at 10pm answering questions and giving d… | 0.3910913 | 54 | 48 |
| Copenhagen metro is so slick and reliable. Took the metro around a lot in Copenhagen. It was one of the best transit experiences, it was … | 0.2755092 | 65 | 49 |
| Every type of public transport in Moscow. | 0.0000000 | 7 | 188 |
The sample of 10 Reddit threads suggests that the sentimentr scores generally match the intuitive tone of the text. Posts that clearly praise a system, such as the Copenhagen metro and LAX Metro station threads, receive positive scores, and purely descriptive titles (Baker Street, Moscow) score exactly zero. The only negative score in this sample belongs to the satirical “Cities comparison” post, so the sample offers limited evidence on how well critical posts about delays, crowding, or safety concerns are scored.
There are some borderline or mixed cases where a post contains both praise and criticism, and the resulting sentiment is close to zero. This reflects a limitation of dictionary-based sentiment methods, which can struggle with sarcasm, context, or nuanced opinions. Still, the sample indicates that the scores are credible enough for aggregate analysis: they capture whether a post is broadly positive, negative, or neutral toward public transportation.
The overall distribution:
threads_2 %>%
filter(!is.na(sentiment)) %>%
ggplot(aes(x = sentiment)) +
geom_histogram(bins = 30, color = "black", fill = "lightblue") +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
labs(
title = "Distribution of sentiment about public transportation",
x = "Sentiment score (sentimentr)",
y = "Number of threads"
) +
theme_minimal()
The histogram of sentiment scores shows that most posts fall on the
positive side of the scale, with fewer strongly negative values. This
suggests that, overall, Reddit discussions about public transportation
in this dataset are more favorable than a purely complaint-driven forum
would be. Users frequently share helpful information, neutral updates,
or positive experiences (such as smooth commutes, good connectivity, or
system improvements) rather than focusing exclusively on problems.
The relatively small number of strongly negative posts indicates that while frustrations do exist, they do not dominate the conversation. In other words, the general tone is cautiously positive or mildly supportive of public transportation, rather than overwhelmingly negative.
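The impression from the histogram can be backed with a simple tally. The sketch below uses invented scores and a hypothetical neutral band of ±0.05 around zero. It bins the scores into negative, neutral, and positive, then reports the share in each bin. The band width is an assumption and changes the shares noticeably.

```r
# Toy sentiment scores (invented for illustration); NA mimics a missing score
toy_sentiment <- c(-0.45, -0.02, 0, 0.01, 0.08, 0.15, 0.22, 0.31, NA)

# Hypothetical half-width of the band treated as "neutral"
neutral_band <- 0.05

# Bins: (-Inf, -0.05], (-0.05, 0.05], (0.05, Inf)
tone <- cut(
  toy_sentiment,
  breaks = c(-Inf, -neutral_band, neutral_band, Inf),
  labels = c("negative", "neutral", "positive")
)

# Counts and shares among non-missing scores (table() drops NA by default)
tone_counts <- table(tone)
tone_share <- prop.table(tone_counts)
tone_share
```

Applied to threads_2$sentiment, the same tally would show directly how many threads are clearly negative versus near zero.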
Is the distribution skewed negative (lots of complaints) or centered near zero?
threads_2 %>%
filter(!is.na(sentiment), !is.na(word_count)) %>%
ggplot(aes(x = sentiment, y = word_count)) +
geom_jitter(width = 0.02, height = 0, alpha = 0.4) +
geom_smooth(method = "loess", se = FALSE, color = "darkgreen") +
labs(
title = "Relationship between sentiment and text length",
x = "Sentiment score",
y = "Word count"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot of sentiment versus word count shows that posts of all
lengths can be positive, neutral, or negative, but there are some
patterns:
- Many short posts (few words) have sentiment scores near zero, which often corresponds to simple observations, titles, or quick comments without strong emotional language.
- Longer posts tend to show more extreme sentiment values, both positive and negative. These are likely detailed stories, rants, or enthusiastic reviews where users explain their experiences at length.
The smooth trend line suggests that there is only a weak overall relationship between text length and sentiment, but it does hint that when users feel strongly (either positively about a good transit system or negatively about persistent problems), they are more likely to write longer, more elaborate posts. This is consistent with the idea that stronger emotions encourage more expressive writing.
Do very negative or very positive posts tend to be longer (rants, detailed stories), while neutral posts are shorter?
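One way to put a number on this question is to correlate sentiment strength (the absolute value of the score) with word count. The sketch below uses invented values in place of threads_2. Spearman's rank correlation is an assumed choice here, picked because a few very long posts would otherwise dominate a Pearson correlation.

```r
# Toy stand-in for threads_2 (values invented for illustration)
toy_threads <- data.frame(
  sentiment  = c(0.00, 0.02, -0.05, 0.30, -0.40, 0.10, 0.55, -0.20),
  word_count = c(5, 8, 12, 150, 220, 160, 300, 90)
)

# Strength ignores direction: -0.4 and 0.4 count as equally strong
toy_threads$strength <- abs(toy_threads$sentiment)

# Rank-based correlation between strength and length
rho <- cor(toy_threads$strength, toy_threads$word_count, method = "spearman")
rho
```

A value near zero on the real data would suggest that the apparent pattern in the scatterplot is driven by a handful of long posts rather than a general tendency.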
library(lubridate)
# Make sure sentiment and date exist
threads_2 <- threads_2 %>%
filter(!is.na(sentiment), !is.na(date))
# Recreate day_of_week if needed
threads_2 <- threads_2 %>%
mutate(day_of_week = wday(date, label = TRUE))
# Average sentiment by day of week
threads_2 %>%
group_by(day_of_week) %>%
summarize(
avg_sentiment = mean(sentiment, na.rm = TRUE),
n_threads = n()
) %>%
ggplot(aes(x = day_of_week, y = avg_sentiment)) +
geom_col(fill = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
title = "Average sentiment about public transportation by day of week",
x = "Day of week",
y = "Average sentiment score"
) +
theme_minimal()
Interpretation: Why is Wednesday the Most Negative?
From the “Average Sentiment by Day of Week” plot, Wednesday shows the lowest average sentiment score, making it the most negative day for public transportation discussion on Reddit.
There are several possible explanations for this pattern:
- By Wednesday, people have already been commuting for several days, and frustrations with delays, service reliability, or overcrowding accumulate. Riders may be more likely to vent mid-week than at the beginning or end of the week.
- Weekend travel tends to be more flexible and less stressful. In contrast, weekday travel, especially on Wednesday, is dominated by work-related commutes, which are more prone to negative experiences.
- Transit agencies may perform maintenance early in the week (Monday–Wednesday), which could cause delays and dissatisfaction.
- Posting volume matters: if Wednesday has relatively few threads, a handful of strongly negative ones can pull its average down, so n_threads should be checked before reading much into the gap.
- Negative posts also tend to be longer and more expressive, which may give the sentiment lexicon more negative terms to pick up.
Summary:
Wednesday stands out as the most negative day, likely reflecting peak commuter frustration and mid-week transit stress, compared with more neutral or even slightly positive weekend discussions.
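Before reading much into one weekday, it helps to see how many threads each average rests on and how uncertain it is. The sketch below uses invented values in place of threads_2. It adds the thread count and the standard error of the mean to the per-day summary. It is a rough check rather than a formal test.

```r
library(dplyr)

# Toy stand-in for threads_2 (values invented for illustration)
toy_threads <- tibble(
  day_of_week = c("Mon", "Mon", "Mon", "Wed", "Wed", "Sat", "Sat", "Sat"),
  sentiment   = c(0.10, 0.05, 0.15, -0.30, 0.20, 0.12, 0.08, 0.10)
)

by_day <- toy_threads %>%
  group_by(day_of_week) %>%
  summarize(
    avg_sentiment = mean(sentiment),
    n_threads = n(),
    # Standard error of the mean: sd / sqrt(n)
    se = sd(sentiment) / sqrt(n()),
    .groups = "drop"
  ) %>%
  arrange(avg_sentiment)  # most negative day first

by_day
```

In the toy data, Wednesday has the lowest mean but also the fewest threads and by far the largest standard error, so its ranking could easily be noise.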
Plot 4 — Sentiment vs comments (engagement)
threads_2 %>%
filter(!is.na(sentiment), !is.na(comments)) %>%
ggplot(aes(x = sentiment, y = comments)) +
geom_jitter(width = 0.02, height = 0, alpha = 0.4) +
geom_smooth(method = "loess", se = FALSE, color = "purple") +
labs(
title = "Sentiment vs. engagement (number of comments)",
x = "Sentiment score",
y = "Number of comments"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot of sentiment versus the number of comments shows that both positive and negative posts can attract engagement, but there is no strong linear relationship between sentiment score and comment count. Many posts with moderate or slightly positive sentiment receive a handful of comments, while a smaller number of highly commented posts appear across a range of sentiment values.
A few posts with very high comment counts tend to cluster at more emotionally charged scores – either clearly positive or somewhat negative – suggesting that strongly opinionated or provocative content can trigger more discussion. However, the overall pattern remains very noisy, indicating that engagement on Reddit is driven by multiple factors, such as topic novelty, visual content, or controversy, not just the emotional tone captured by the sentiment score.
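Because comment counts are heavily skewed, a few viral threads can dominate the picture. The sketch below uses invented values in place of threads_2. It groups threads by tone using a hypothetical ±0.05 neutral band and compares median and mean comment counts per group.

```r
library(dplyr)

# Toy stand-in for threads_2 (values invented for illustration)
toy_threads <- tibble(
  sentiment = c(-0.40, -0.10, 0.00, 0.02, 0.15, 0.25, 0.30, 0.60),
  comments  = c(240, 35, 12, 20, 48, 15, 410, 60)
)

by_tone <- toy_threads %>%
  mutate(
    # The 0.05 band around zero is a hypothetical choice
    tone = case_when(
      sentiment < -0.05 ~ "negative",
      sentiment >  0.05 ~ "positive",
      TRUE              ~ "neutral"
    )
  ) %>%
  group_by(tone) %>%
  summarize(
    n_threads = n(),
    median_comments = median(comments),  # robust to viral outliers
    mean_comments = mean(comments),      # pulled up by viral outliers
    .groups = "drop"
  )

by_tone
```

A large gap between the mean and the median within a group, as in the toy "positive" group, is a sign that one or two heavily commented threads are driving the apparent engagement.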
Overall, the sentiment analysis reveals a nuanced picture of how people talk about public transportation on Reddit.
From the sentiment distribution plot (Plot 1), most posts fall on the positive or mildly positive side, with fewer strongly negative scores. This suggests that r/transit is not just a space for complaints; users also share positive experiences, appreciation for well-functioning systems, and neutral informational content. The sample of 10 posts with sentiment scores supports this pattern: descriptive posts score near zero, highly positive experiences (e.g., Copenhagen metro, new stations, LAX Metro) receive clearly positive sentiment, and only a few posts show clearly negative or sarcastic tones.
The relationship between sentiment and word count (Plot 2) indicates that strongly emotional posts—both positive and negative—are often longer, while short posts tend to be closer to neutral. This suggests that when users feel strongly about transit (either praising or criticizing it), they are more likely to write detailed narratives or rants, whereas brief comments are typically less emotionally loaded.
The average sentiment by day of week (Plot 3) adds a temporal dimension: sentiment is most negative on Wednesday, hinting at mid-week commuter fatigue and stress, while weekends tend to be more neutral or slightly positive. This aligns with the idea that daily commuting conditions shape how people feel and write about transit.
Finally, sentiment vs. number of comments (Plot 4) shows that emotional tone alone does not fully explain engagement. Although some highly emotional posts attract many comments, overall the relationship between sentiment and comment count is weak, suggesting that other factors—such as the topic’s relevance, novelty, or controversy—also play important roles in driving discussion.
Taken together, these plots show that online discourse about public transportation is generally neutral to mildly positive rather than complaint-driven, becomes more expressive when emotions are strong, and varies over the week with commuting patterns, while sentiment only partly explains how much attention a post receives.