Major-Assignment 3

# Package names
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "here")

devtools::install_github("lchiffon/wordcloud2")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

DISCLAIMER: Due to the uncensored nature of online communities, some Reddit posts may contain images or language that are not suitable for work or school. You may also encounter controversial content. This is part of studying urban analytics, as online platforms often reflect real social dynamics. That said, please keep an open mind, and I hope no one feels offended or uncomfortable with what we might come across.

1. Collecting Reddit Data

1-1. Downloading Reddit threads

I have used the RedditExtractoR package to collect Reddit data. The function RedditExtractoR::find_thread_urls() lets you search for Reddit threads by subreddit name, keyword, or a combination of both. This function takes four arguments:

  • keywords
  • subreddit (optional)
  • sort_by: “top” (default), “relevance”, “comments”, “new”, “hot”, “rising”
    • If searching by keywords: use “relevance”, “comments”, “new”, “hot”, “top”
    • If searching by subreddit only: use “hot”, “new”, “top”, “rising”
  • period: “month” (default), “hour”, “day”, “week”, “year”, “all”

Although the package description does not explicitly state this, it appears that the find_thread_urls() function returns a maximum of 250 threads.

Let’s first search for threads using a keyword.

threads_1 <- find_thread_urls(keywords = "public transportation", 
                              sort_by = 'relevance', 
                              period = 'all') %>% 
  drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads_1) <- NULL

# Sanitize text
threads_1 %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>%   # replace vertical bars
        str_replace_all("\\n", " ") %>%   # replace newlines
        str_squish()                      # clean up extra spaces
  ))

colnames(threads_1)
## [1] "date_utc"  "timestamp" "title"     "text"      "subreddit" "comments" 
## [7] "url"
head(threads_1, 3) %>% knitr::kable()
date_utc timestamp title text subreddit comments url
2023-06-04 1685902681 We need to appreciate the decent level of public transport our rural areas get. fuckcars 243 https://www.reddit.com/r/fuckcars/comments/140leil/we_need_to_appreciate_the_decent_level_of_public/
2018-11-19 1542636032 I love public transport PublicFreakout 1200 https://www.reddit.com/r/PublicFreakout/comments/9ygzif/i_love_public_transport/
2022-11-20 1668909420 this city has no public transportation. Impossible to find work without a license or vehicle. fuckcars 356 https://www.reddit.com/r/fuckcars/comments/yzs9ck/this_city_has_no_public_transportation_impossible/

The initial keyword search for “public transportation” returns a diverse set of Reddit threads across multiple subreddits. The three sample titles all mention public transport, which suggests that the keyword captures on-topic conversations, although they come from general-interest communities such as r/fuckcars and r/PublicFreakout rather than from transit-specific ones.

Next, let’s try searching by subreddit instead.

# search for subreddits
subreddit_list <- RedditExtractoR::find_subreddits("public transportation")
## parsing URLs on page 1...
## parsing URLs on page 2...
subreddit_list %>% 
  arrange(desc(subscribers)) %>% 
  .[1:25,c('subreddit','title','subscribers')] %>% 
  knitr::kable()
subreddit title subscribers
2qh33 funny funny 66845602
2qh1i AskReddit Ask Reddit… 57219950
2qh13 worldnews World News 46891782
2qqjc todayilearned Today I Learned (TIL) 41136859
2szyo Showerthoughts Showerthoughts 34015613
2qh0u pics Reddit Pics 33123445
2qh41 travel travel 14079308
2ubgg mildlyinfuriating jukmifgguggh 11567831
2s7tt AdviceAnimals Advice Animals 9908321
2cneq politics Politics 8955647
2qh61 WTF WTF?! 7041105
2si92 MapPorn Map Porn, for interesting maps 6121734
2tk0s unpopularopinion For your Opinions that are Unpopular 4794635
2qjov Philippines Philippines - all about the Philippines 3498130
2qhu8 aviation aviation 2641102
hcycg povertyfinance Personal Finance For The Financially Challenged 2477212
2qh4r conspiracy conspiracy 2265833
2uayg AskEurope Ask Europe 1621515
2uah7 AskAnAmerican Ask Americans about their country! 1096930
2qhu2 nyc nyc reddit 953598
2qht0 LosAngeles Los Angeles news, meet-ups, events, and more! 878584
2qjyy bayarea San Francisco Bay Area 744418
2qhad Seattle Seattle 744300
2qh3r boston Boston, MA 730296
2qh2t chicago Chicago 621344

Ranking the subreddit search results by subscriber count puts large general-interest communities (r/funny, r/AskReddit, r/worldnews) at the top, with city subreddits such as r/nyc, r/LosAngeles, and r/chicago further down; r/transit does not appear among the 25 largest. Subscriber count is therefore a measure of scale, not of topical relevance. For deeper analysis we focus on r/transit, a community dedicated to public transportation, because topical relevance matters more here than raw size.

Alternatively, you can check how many threads were found for that keyword within each subreddit.

threads_1$subreddit %>% table() %>% sort(decreasing = T) %>% head(20)
## .
##             fuckcars            worldnews       PublicFreakout 
##                   21                   11                    9 
##             CasualUK        todayilearned Damnthatsinteresting 
##                    7                    6                    4 
##               europe           Futurology   ImTheMainCharacter 
##                    4                    4                    4 
##               london    mildlyinfuriating     ShitAmericansSay 
##                    4                    4                    4 
##             brisbane             CityPorn          Coronavirus 
##                    3                    3                    3 
##    interestingasfuck     nextfuckinglevel    therewasanattempt 
##                    3                    3                    3 
##            AskReddit                AskUK 
##                    2                    2

Counting how often each subreddit appears in the keyword-based results shows that the threads are spread across many communities. r/fuckcars contributes the most (21 threads), followed by r/worldnews (11) and r/PublicFreakout (9), and most other subreddits contribute only a handful. r/transit does not appear in the top 20, which suggests that a site-wide keyword search mostly surfaces popular general-interest posts rather than threads from dedicated transit communities.

After selecting a subreddit, let’s search threads within that subreddit.

# using subreddit
threads_2 <- find_thread_urls(subreddit = "transit", 
                              sort_by = 'top', 
                              period = 'year') %>% 
  drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads_2) <- NULL

# Sanitize text
threads_2 %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>% 
        str_replace_all("\\n", " ") %>%
        str_squish()
  ))

head(threads_2, 3) %>% knitr::kable()
date_utc timestamp title text subreddit comments url
2025-08-11 1754878390 The difference between CO2 emissions of city buses vs couch buses is staggering Seems like every city bus should at least be hybrid at the least. Even better if they are trolley or battery electric. A Toyota RAV4 hybrid emits 1.55 times less CO2 per km than a Toyota gas model in city driving. If we assume the same for buses, a London local bus would emit only 51 grams per passenger km. Much closer to an electric car. If we also consider the CO2 emissions during production of the vehicles a hybrid electric bus would be better for environment than an electric car. You only need one bus per 500 people compared to one car per two people even one car one person. transit 204 https://www.reddit.com/r/transit/comments/1mn0yhi/the_difference_between_co2_emissions_of_city/
2025-06-30 1751262244 The largest high-speed railway station in Asia. transit 141 https://www.reddit.com/r/transit/comments/1lnzrgt/the_largest_highspeed_railway_station_in_asia/
2025-11-03 1762198113 Yerevan, Armenia Metro has it all The only single-car non-articulated trains in daily passenger service that I know of worldwide on the Charbakh spur line, a whopping 6 livery variations on the same series of Soviet-era trains (used either in 2 or 3 car trainsets on the main line), the coolest metro cat in the world, an abandoned car Metro 2033 style, and super interesting station designs. Bonus picture is the abandoned Yerevan cable car gondola that has been sitting defunct since the 1994 accident. transit 33 https://www.reddit.com/r/transit/comments/1onldkb/yerevan_armenia_metro_has_it_all/

Restricting the search to r/transit gives a more focused dataset of posts specifically curated by users interested in public transportation. The sample of threads_2 shows titles that are almost entirely about transit systems, infrastructure, and policy. This subreddit-based filtering reduces noise and ensures that the subsequent analysis reflects conversations in a community explicitly centered on transit.

Lastly, let’s search by both the keyword and subreddit.

# using both subreddit and keyword
threads_3 <- find_thread_urls(keywords= "public transportation", 
                              subreddit ="transit" , 
                              sort_by = 'relevance', 
                              period = 'all') %>% 
  drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads_3) <- NULL

# Sanitize text
threads_3 %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>% 
        str_replace_all("\\n", " ") %>% 
        str_squish() 
  ))

head(threads_3, 3) %>% knitr::kable()
date_utc timestamp title text subreddit comments url
2025-09-13 1757753530 I love the Deutschlandticket Context as a Student, I only pay 38€ to use public transport (not including IC/ICE Trains). transit 55 https://www.reddit.com/r/transit/comments/1nfslns/i_love_the_deutschlandticket/
2025-01-25 1737766983 Bangkok gets free public transport for a week on smog crisis transit 3 https://www.reddit.com/r/transit/comments/1i9b8az/bangkok_gets_free_public_transport_for_a_week_on/
2023-12-20 1703092281 If you had to rank public transportation systems of Denmark, Netherlands and Austria Hola chicos! First time poster, long time lurker here. I was curious, if you had to rank the public transportation system of Austria, Denmark, and the Netherlands, how would you rank them? The criteria would ignore ticket/ridership costs. &#x200B; Instead, this would focus more on: 1. Connectivity: How accessible is it in rural areas? City to city? Within city? 2. Integration to other forms of transit: Ex: Trains to trams to bikes; the use 1 card for the whole system, etc 3. Reliability: Is it on time? Does it have ample hours? 4. Alternatives: For instance if you can’t take a train, can you bike there? Or can you take a water taxi? Or can you hop on a bus? In other words are you mainly stuck to one way of getting to a place, or do you have a plan B? 5. Build Quality & Service: Is your infrastructure outdated and falling apart or is it the best of the best? Is it comfortable? How is the service? Do you have cool amenities like fancy cafes in your trains? &#x200B; Lastly if you have any complaints or if you see limitations or weaknesses in a transit system feel free to comment. Likewise if you have suggestions to how you could improve a country’s transit systems for instance maybe more train schedules that extend to later hours and so forth. transit 32 https://www.reddit.com/r/transit/comments/18mzmj6/if_you_had_to_rank_public_transportation_systems/

Combining the keyword “public transportation” with the r/transit subreddit further narrows the dataset to posts that explicitly mention the term within a transit-focused community. The sample shows that these threads often address higher-level issues such as system design, funding, or comparative transit quality across cities. This subset is more specialized but smaller in size, illustrating the trade-off between breadth and strict topical focus.

Save threads_1, threads_2, and threads_3 as CSV files for the sentiment analysis.
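A minimal sketch of that export step, assuming readr's write_csv() and a hypothetical helper save_threads() that is not part of the original assignment. A small stand-in data frame and a temporary directory keep the example self-contained; in practice you would pass threads_1, threads_2, and threads_3 and a project folder (for example, one built with here::here()).

```r
library(readr)

# Hypothetical helper: write one thread table to <dir>/<name>.csv
# and return the path that was written.
save_threads <- function(df, name, dir = ".") {
  path <- file.path(dir, paste0(name, ".csv"))
  write_csv(df, path)
  path
}

# Stand-in data so the sketch runs on its own (assumption: the real
# inputs are the threads_1, threads_2, and threads_3 data frames).
demo_threads <- data.frame(title = c("a", "b"), comments = c(3, 5))
out_path <- save_threads(demo_threads, "demo_threads", dir = tempdir())

# Read the file back to confirm the round trip
demo_back <- read_csv(out_path, show_col_types = FALSE)
```

Returning the path makes it easy to log or re-read the file later in the sentiment analysis step.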

1-2. Downloading comments and additional information

The find_thread_urls function provides the title and text of threads but does not include comments of each thread – it only shows the number of comments. To retrieve the comments, you can use the get_thread_content function, which takes the thread URLs returned by find_thread_urls as input.

One caveat is that this process can take quite a long time to run. In the example below, we use get_thread_content only for the first four threads to keep things manageable.

# get individual comments
threads_2_content <- get_thread_content(threads_2$url[1:4])

The output object threads_2_content consists of two data frames: threads and comments.

The threads data frame contains additional information that was not provided by find_thread_urls, such as the username of the poster, upvotes, and downvotes.

The downvotes column appears to be always zero, but the up_ratio column provides insight into how positively or negatively users reacted to the thread.

The get_thread_content() results show that each thread has additional metadata and comment-level data not available in the original find_thread_urls() output. The threads data frame includes upvotes and the upvote ratio, which provide a rough measure of how positively the community responded to each post. Even though downvotes are zero in the dataset, variation in up_ratio indicates differences in how controversial or well-received each thread was.

names(threads_2_content)
## [1] "threads"  "comments"
# check upvotes and downvotes
print(threads_2_content$threads[,c('upvotes','downvotes','up_ratio')])
##   upvotes downvotes up_ratio
## 1     716         0     0.96
## 2     715         0     0.97
## 3     720         0     1.00
## 4     716         0     0.99

The comments data frame provides information on individual comments.

# Sanitize text
threads_2_content$comments %<>% 
  mutate(across(
    where(is.character),
    ~ .x %>%
        str_replace_all("\\|", "/") %>% 
        str_replace_all("\\n", " ") %>% 
        str_squish() 
  ))

head(threads_2_content$comments, 3) %>% knitr::kable()
url author date timestamp score upvotes downvotes golds comment comment_id
https://www.reddit.com/r/transit/comments/1mn0yhi/the_difference_between_co2_emissions_of_city/ Username7381 2025-08-11 1754879049 225 225 0 0 Couch bus 1
https://www.reddit.com/r/transit/comments/1mn0yhi/the_difference_between_co2_emissions_of_city/ LegoFootPain 2025-08-11 1754879956 182 182 0 0 JD Vance: breathes heavily 1_1
https://www.reddit.com/r/transit/comments/1mn0yhi/the_difference_between_co2_emissions_of_city/ BigMatch_JohnCena 2025-08-11 1754884507 10 10 0 0 Underrated reply =- 1_1_1

The comments table reveals how users interact with each thread at a finer level. Individual comments often contain more emotional language, reactions, and arguments than the original thread titles. Cleaning the text (replacing vertical bars and line breaks with plain characters and collapsing extra whitespace) prepares these comments for later natural language processing and sentiment analysis.

1-3. Analyzing Posting Date and Time

Using the date and timestamp information, we can analyze when posts are most popular on Reddit — by month, day, or hour.

First, let’s examine the number of threads per week over the last 12 months. You can do this by setting the binwidth in geom_histogram() to one week in seconds.
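The bin width used below can be derived with simple arithmetic, because a POSIXct axis is measured in seconds. The variable name week_in_seconds is only illustrative.

```r
# One week expressed in seconds: 7 days x 24 hours x 60 minutes x 60 seconds
week_in_seconds <- 7 * 24 * 60 * 60
week_in_seconds  # 604800
```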

# create new column: date
threads_2 %<>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date))

# number of threads by week
threads_2 %>% 
  ggplot(aes(x = date)) +
  geom_histogram(color="black", position = 'stack', binwidth = 604800) +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(threads_2$date, na.rm = T), 
                                max(threads_2$date, na.rm = T), 
                                by = "1 month")) +
  theme_minimal()

The weekly histogram of r/transit threads shows how posting activity fluctuates over time. Peaks in the number of posts may correspond to notable events in public transportation—such as new line openings, service disruptions, or policy changes—while quieter periods suggest more routine discussion. Overall, the plot indicates that interest in public transportation on Reddit is persistent but uneven, with occasional surges of attention.

Do people tend to post more on weekdays or weekends?

# create new columns: day_of_week, is_weekend
threads_2 %<>%  
  mutate(day_of_week = wday(date, label = TRUE)) %>% 
  mutate(is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"))

# number of threads by day of week
threads_2 %>% 
  ggplot(aes(x = day_of_week, fill = is_weekend)) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = c("Weekday" = "gray", "Weekend" = "pink")) + 
  theme_minimal()

The day-of-week plot shows that most posts in r/transit occur on weekdays rather than weekends. This pattern is consistent with the idea that transit-related concerns are closely tied to daily commuting and work schedules. While weekends still see some activity, the higher weekday posting volume suggests that users are more likely to discuss transit when they are actively using it for work or school.

You can extract the time of day from the timestamp column.

print(threads_2$timestamp[1])
## [1] 1754878390
print(threads_2$timestamp[1] %>% anytime(tz = anytime:::getTZ()))
## [1] "2025-08-10 22:13:10 EDT"
# extract the hour of day (0-23) in the local time zone
threads_2 %<>%  
  mutate(time = timestamp %>% 
           anytime(tz = anytime:::getTZ()) %>% 
           lubridate::hour())

Let’s visualize the number of threads by time of day using the time column we made from timestamp.

Note: the times are shown in our local time zone, but the posters and commenters may be in different time zones.
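To make the time zone dependence concrete, here is a small base R sketch that converts the first timestamp shown above into UTC and into US Eastern time. The choice of "America/New_York" is an assumption made to match the EDT output above; substitute your own zone.

```r
ts <- 1754878390  # first timestamp from threads_2, in seconds since 1970-01-01 UTC

# Interpret the epoch seconds as a UTC date-time
utc <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
format(utc, "%Y-%m-%d %H:%M:%S")  # "2025-08-11 02:13:10"

# Hour of day in two different zones for the same instant
hour_utc     <- as.integer(format(utc, "%H"))
hour_eastern <- as.integer(format(utc, "%H", tz = "America/New_York"))
c(utc = hour_utc, eastern = hour_eastern)
```

The same post lands at 02:00 in UTC and 22:00 (the previous day) in Eastern time, so the shape of the hourly histogram depends on the zone you choose.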

# number of threads by time of day
threads_2 %>% 
  ggplot(aes(x = time)) +
  geom_histogram(bins = 24, color = 'black') +
  scale_x_continuous(breaks = seq(0, 24, by=2)) + 
  theme_minimal()

The time-of-day histogram reveals that posts are not uniformly distributed across the 24-hour cycle. There tend to be more posts during waking and commuting hours, with noticeable concentrations around typical morning and evening periods. This further reinforces the connection between transit use and online discussion: riders are more likely to reflect on or complain about their experiences shortly after they occur.

2. Tokenization and stop words

2-1. Tokenization

Tokenization is a fundamental first step in any Natural Language Processing (NLP) pipeline.

“Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization” (source).

For example, consider the sentence: “How are you?”

  • The most common approach is word tokenization, which breaks the sentence at spaces: how, are, you.
  • Character tokenization splits each character: h, o, w, …
  • Subword tokenization breaks words into meaningful subunits: e.g., smart-er.

If we then encode each token as a number, we can represent text as numeric data. For instance, if we assign h = 1, e = 2, l = 3, o = 4, then:

  • hello → 1 2 3 3 4
  • heel → 1 2 2 3

You can already see the similarity between hello and heel. This kind of tokenization will be useful later when we perform more advanced NLP tasks.
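The encoding above can be reproduced with a few lines of base R. The lookup table vocab and the helper encode() are hypothetical names introduced only for this illustration.

```r
# Hypothetical lookup table matching the mapping in the text
vocab <- c(h = 1, e = 2, l = 3, o = 4)

# Character tokenization followed by a numeric lookup
encode <- function(word) {
  chars <- strsplit(word, "")[[1]]  # split the word into single characters
  unname(vocab[chars])              # map each character to its number
}

encode("hello")  # 1 2 3 3 4
encode("heel")   # 1 2 2 3
```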

For a more intuitive explanation, check out this video.

Next, let’s tokenize the Reddit texts and plot the top 20 words to see which words appear most frequently. You will notice an issue: the plot includes many common words like the, to, and, a, for, etc. While it makes sense that these words appear often, they are not useful for our analysis.

# Word tokenization
words <- threads_2 %>% 
  unnest_tokens(output = word, input = text, token = "words") # run `?tidytext::unnest_tokens` on the console

words %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")
## Selecting by n

The raw token frequency plot shows the most common words appearing in r/transit post text. Many of the top words are function words such as “the,” “to,” and “and,” which are linguistically necessary but not analytically informative. This confirms the need to remove stop words before drawing substantive conclusions. Nonetheless, even at this stage we likely see frequent content words like “train,” “metro,” or “bus,” reflecting the core focus of the subreddit.

2-2. Stop words

Those common words we just saw are known as stop words – words that are typically filtered out during NLP processing because they contribute little meaningful information to text analysis. These are usually very common words such as articles, pronouns, conjunctions, and prepositions.

We can remove stop words using a built-in dataset from the tidytext package.

# load list of stop words - from the tidytext package
data("stop_words")
# view 100 random words
print(stop_words$word[sample(1:nrow(stop_words), 100)])
##   [1] "by"           "you'd"        "might"        "although"     "she"         
##   [6] "general"      "comes"        "both"         "shall"        "during"      
##  [11] "what's"       "okay"         "has"          "lest"         "everything"  
##  [16] "fact"         "she'll"       "non"          "things"       "throughout"  
##  [21] "nevertheless" "my"           "still"        "place"        "whatever"    
##  [26] "when's"       "him"          "already"      "need"         "said"        
##  [31] "she's"        "neither"      "n"            "there"        "went"        
##  [36] "it"           "though"       "specify"      "want"         "over"        
##  [41] "why"          "came"         "was"          "about"        "probably"    
##  [46] "she"          "could"        "tends"        "both"         "by"          
##  [51] "clearly"      "to"           "respectively" "try"          "six"         
##  [56] "up"           "himself"      "hadn't"       "uses"         "own"         
##  [61] "behind"       "whence"       "also"         "opens"        "couldn't"    
##  [66] "let"          "now"          "from"         "doesn't"      "less"        
##  [71] "there's"      "nobody"       "has"          "using"        "you're"      
##  [76] "reasonably"   "have"         "whose"        "within"       "cannot"      
##  [81] "the"          "no"           "i"            "otherwise"    "x"           
##  [86] "those"        "hasn't"       "parting"      "members"      "thank"       
##  [91] "like"         "going"        "get"          "however"      "somebody"    
##  [96] "may"          "they"         "below"        "we're"        "g"

After removing stop words and non-alphabetic strings, the top 20 word plot shifts toward more meaningful vocabulary. High-frequency terms now highlight key themes in transit discussions, such as specific transport modes (e.g., “metro,” “bus”), urban contexts (“city,” “line”), and user-facing issues (“service,” “station”). This cleaner distribution provides a clearer picture of what people actually talk about when they discuss public transportation on Reddit.

We will use the anti_join() function to remove stop words from the text and leave behind a cleaned set of words. Like other join functions, the order of the data frames matters. Here’s the logic:

  • anti_join(A, B) returns everything in A except the elements that also appear in B.
  • Conversely, anti_join(B, A) returns everything in B except what’s in A.
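
The two bullet points can be checked on a toy example. The tibbles A and B below are made up for illustration and are not part of the Reddit data.

```r
library(dplyr)

A <- tibble(word = c("the", "bus", "is", "late"))
B <- tibble(word = c("the", "is", "a"))

# Rows of A that have no match in B
a_not_b <- anti_join(A, B, by = "word")
a_not_b$word  # "bus" "late"

# Rows of B that have no match in A
b_not_a <- anti_join(B, A, by = "word")
b_not_a$word  # "a"
```

With the text tokens in the role of A and stop_words in the role of B, the first form is the one that removes stop words.
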
# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

words_clean <- threads_2 %>% 
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Tokenization (word tokens)
  unnest_tokens(word, text, token = "words") %>%
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # keep only tokens that contain at least one letter (drops purely numeric tokens)
  filter(str_detect(word, "[a-z]"))

# Check the number of rows after removal of the stop words. There should be fewer words now
print(
  glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 5935, After: 2465

Once you have removed the stop words, let’s create a plot using the cleaned text to see which meaningful words are used most frequently.

words_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")

3. Word cloud

Let’s compare the frequency of words before and after removing stop words using a word cloud.

words %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()
words_clean %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

The word clouds generated above look nice, but their color schemes can be a bit overwhelming.

The following block of code creates a custom color palette designed to highlight a selected number of words while graying out the rest. You can easily generate a collection of random colors using the HSV (Hue, Saturation, Value) color model.

n <- 20 # number of words with color
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright

# hsv() is vectorized, so it converts all n colors at once
pal <- hsv(h, s, v)
# every word beyond the first n falls back to grey
pal <- c(pal, rep("grey", 10000))
words_clean %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

The word cloud based on raw tokens emphasizes both common function words and frequently used content words, making it visually rich but somewhat noisy. In contrast, the word cloud using the cleaned words (with stop words removed and custom coloring) highlights substantive transit-related terms. This visualization makes it easy to spot recurring concepts—such as transit modes, city names, and quality descriptors—giving an intuitive overview of dominant topics in the dataset.

4. N-grams

An n-gram is a sequence of n words that appear together. For example:

  • “basketball coach” and “dinner time” are bigrams (two words),
  • “the three musketeers” is a trigram (three words), and
  • “she was very hungry” is a fourgram (four words).

In advanced text analysis and machine learning, specific tokens and n-grams are often used as features for modeling and classification tasks.

N-grams are particularly useful for analyzing words in context. Consider these two sentences:

  • “We need to check the details.”
  • “Can we pay it with a check?”

The word “check” functions as a verb in the first sentence and a noun in the second. We can understand its meaning based on the surrounding words, especially those immediately before or after it. For example, when “check” follows “to”, it’s likely being used as a verb.

As an example of bigrams, the sentence “The result of separating bigrams is helpful for exploratory analyses of the text.” becomes a list of paired words:
1 the result
2 result of
3 of separating
4 separating bigrams
5 bigrams is
6 is helpful
7 helpful for
8 for exploratory
9 exploratory analyses
10 analyses of
11 of the
12 the text
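
The list above can be reproduced with tidytext. This is a small sketch using the same sentence; the object names example_text and bigrams are illustrative only.

```r
library(dplyr)
library(tidytext)

example_text <- tibble(
  text = "The result of separating bigrams is helpful for exploratory analyses of the text."
)

# token = "ngrams" with n = 2 slides a two-word window over the sentence;
# unnest_tokens() lowercases and strips punctuation by default
bigrams <- example_text %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

bigrams
```

A sentence of 13 words yields 12 bigrams, because each bigram starts at one word and ends at the next.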

# Get ngrams. You may try playing around with the value of n, n=3, n=4
words_ngram <- threads_2 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

Extracting trigrams reveals short phrases that capture more context than single words, such as recurring expressions about transit systems or city comparisons. Many of these trigrams still contain stop words, but they begin to hint at typical sentence structures and repeated phrases in transit discussions.

# Show ngrams with sorted values
words_ngram %>%
  count(paired_words, sort = TRUE) %>% 
  head(20) %>% 
  knitr::kable()
paired_words n
NA 163
in the world 5
the bay area 5
to get to 4
all of the 3
as well as 3
cobb or gwinnett 3
i m sick 3
in the city 3
it s a 3
m sick of 3
one of the 3
sick of traffic 3
the city and 3
00 and arrived 2
a big sign 2
a braves game 2
about public transit 2
amtrak ridership in 2
an electric car 2

After splitting the n-grams and filtering out stop words and non-alphabetic terms, the resulting word pairs (bigrams) show more meaningful associations between content words. High-frequency pairs highlight how users link concepts—for example, connecting transit modes with performance descriptors or linking specific cities with their transportation systems. These bigrams provide insight into how people conceptually group ideas when discussing public transportation.

Here, we can see that the n-grams still contain stop words such as a, to, and so on. Next, we’ll extract word pairs without stop words. We use the separate() function from the tidyr package to split each n-gram into two columns, word1 and word2. Because the n-grams above were extracted with n = 3, only the first two words of each trigram are kept; setting n = 2 in unnest_tokens() would produce bigrams directly. Then, we filter out any rows where either column contains a stop word using the filter() function.

# separate the paired words into two columns
# (extra = "drop" keeps the first two words of each trigram and discards the third)
words_ngram_pair <- words_ngram %>%
  separate(paired_words, c("word1", "word2"), sep = " ", extra = "drop")
# filter rows where there are stop words under word 1 column and word 2 column
words_ngram_pair_filtered <- words_ngram_pair %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word) %>% 
  # keep only pairs in which both words contain at least one letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))

# Filter out words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
library(stringi)
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))

# Sort the new bi-gram (n=2) counts:
words_counts <- words_ngram_pair_filtered %>%
  count(word1, word2) %>%
  arrange(desc(n))

head(words_counts, 20) %>% 
  knitr::kable()
word1 word2 n
public transportation 9
public transit 6
light rail 5
folding doors 4
lines picture 3
mexico city 3
amtrak ridership 2
braves game 2
bus route 2
car centric 2
centennial park 2
central station 2
city centre 2
dc metro 2
deep canvas 2
duffy investigates 2
electric car 2
elon musk 2
expand marta 2
expanding marta 2

Finally, by using the igraph and ggraph packages, we can visualize words occurring in pairs, which allows us to see common relationships between words in the text.

# plot word network
words_counts %>%
  filter(n >= 3) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = n), edge_alpha = 0.6) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")

The word network plot visualizes frequent word pairs as a graph, where nodes are words and edges represent co-occurrence in bigrams. Clusters in the network suggest thematic groupings—for instance, one cluster may center on “metro,” “station,” and “line,” while another may relate to “bus,” “service,” and “delay.” This network view reveals how different aspects of public transportation are interconnected in user discussions and helps identify core conceptual structures within the text.

To score sentiment, the title and body text of each thread are combined into a single field (text_all) and passed to sentimentr::sentiment_by(), which returns an average sentiment score and a word count for each thread. The summary of threads_2$sentiment shows scores ranging from about -1.01 to 0.67, with a median of 0.017 and a mean of 0.038. Reddit posts about public transportation therefore cover a range of emotional tones but, on average, are neutral to slightly positive rather than strongly negative.
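
As a minimal illustration of what sentiment_by() returns, the sketch below scores three made-up posts (toy_posts is hypothetical and not taken from the Reddit data). It assumes the sentimentr package is installed and uses its default lexicon.

```r
library(sentimentr)

# Hypothetical example posts, written for illustration only
toy_posts <- c(
  "The new metro line is clean, fast, and reliable.",
  "The bus was late again and the station felt unsafe.",
  "Map of the tram network."
)

# One row per post: element_id, word_count, sd, ave_sentiment
toy_scores <- as.data.frame(sentiment_by(get_sentences(toy_posts)))
toy_scores[, c("element_id", "word_count", "ave_sentiment")]
```

The praising post should score higher than the complaining one, which is the same ordering logic that the thread-level scores rely on.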

# Package names
packages <- c(
  "RedditExtractoR", "anytime", "magrittr", "httr",
  "tidytext", "tidyverse", "igraph", "ggraph",
  "wordcloud2", "textdata", "here", "sentimentr", "glue"
)

installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

invisible(lapply(packages, library, character.only = TRUE))
# Combine title and text for sentiment analysis
threads_2 <- threads_2 %>%
  mutate(
    title  = replace_na(title, ""),
    text   = replace_na(text, ""),
    text_all = stringr::str_c(title, text, sep = ". ")
  )

# Run sentiment analysis (dictionary + negation-aware)
sent_res <- sentiment_by(threads_2$text_all)

# Attach back to main data frame
threads_2$sentiment  <- sent_res$ave_sentiment
threads_2$word_count <- sent_res$word_count

summary(threads_2$sentiment)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00943 -0.00859  0.01695  0.03830  0.18872  0.67082
threads_2 %>%
  filter(!is.na(sentiment), !is.na(date)) %>%
  mutate(month = lubridate::floor_date(date, "month")) %>%
  group_by(month) %>%
  summarize(
    avg_sentiment = mean(sentiment, na.rm = TRUE),
    n_threads     = n()
  ) %>%
  ggplot(aes(x = month, y = avg_sentiment)) +
  geom_line(color = "darkgreen") +
  geom_point() +
  labs(
    title = "Average Sentiment About Public Transportation Over Time",
    x = "Month",
    y = "Average sentiment"
  )

The monthly average sentiment plot shows how attitudes toward public transportation in r/transit evolve over time. While there is some fluctuation from month to month, the overall pattern suggests that sentiment is generally stable and slightly positive, with occasional peaks or dips that may correspond to specific news events, infrastructure openings, or service crises. This temporal view highlights that public perceptions of transit are dynamic but do not swing wildly, reflecting a mix of everyday experiences and episodic events.

library(stringr)
library(knitr)

# Create a combined text column
threads_2 <- threads_2 %>%
  mutate(
    title  = tidyr::replace_na(title, ""),
    text   = tidyr::replace_na(text, ""),
    text_all = stringr::str_c(title, text, sep = ". ")
  )

# Sample 10 texts with sentiment scores
set.seed(123)  # for reproducibility
sample_sentiment <- threads_2 %>%
  filter(!is.na(sentiment)) %>%
  select(text_all, sentiment, word_count, comments) %>% # comments from threads_2 metadata
  mutate(
    text_short = str_trunc(text_all, width = 140)
  ) %>%
  select(text_short, sentiment, word_count, comments) %>%
  slice_sample(n = 10)

knitr::kable(
  sample_sentiment,
  col.names = c(
    "Text (truncated)",
    "Sentiment score",
    "Word count",
    "Number of comments"
  ),
  caption = "Sample of 10 Reddit threads on public transportation with sentiment scores"
)
Sample of 10 Reddit threads on public transportation with sentiment scores

| Text (truncated) | Sentiment score | Word count | Number of comments |
|:-----------------|----------------:|-----------:|-------------------:|
| American cities be like. | 0.2500000 | 4 | 185 |
| Why does Cairo, a city of over 22 million people, have only 3 metro lines?. Id expect more lines sooner because its one of the biggest … | 0.1448446 | 76 | 243 |
| I owe you an apology, Miami. I didnt know your game. Despite going to Miami many times on the way to visit family in the Keys, I never s… | 0.1725367 | 227 | 64 |
| We all like trains d [OC]. Opening of new stations along the 2 Line / Redmond, WA | 0.2353954 | 15 | 41 |
| Harness the power of patriotism to build more automated light metros. | 0.1296499 | 11 | 60 |
| Cities comparison. In case you couldn’t see this is obviously satire, so stop typing your angry comment. I decided to put a twist on the … | -0.1686942 | 28 | 177 |
| The world’s oldest underground station, Baker Street, 157 years apart.. | 0.0000000 | 9 | 21 |
| My first time at the new LAX Metro station. Very nice, very shiny, lots of employees around even at 10pm answering questions and giving d… | 0.3910913 | 54 | 48 |
| Copenhagen metro is so slick and reliable. Took the metro around a lot in Copenhagen. It was one of the best transit experiences, it was … | 0.2755092 | 65 | 49 |
| Every type of public transport in Moscow. | 0.0000000 | 7 | 188 |

The sample of 10 Reddit threads suggests that the sentimentr scores generally match the intuitive tone of the text. Posts that clearly praise a system, such as the new LAX Metro station (0.39) or the Copenhagen metro (0.28), receive the highest scores. Purely descriptive titles, such as “Every type of public transport in Moscow,” score exactly zero. The only negative score in the sample (-0.17) belongs to the satirical “Cities comparison” post. The sample does not happen to include a post that mainly complains about delays, crowding, or safety, so it cannot show how such posts are scored.

There are also likely to be borderline or mixed cases where a post contains both praise and criticism, and the resulting sentiment is close to zero. This reflects a limitation of dictionary-based sentiment methods, which can struggle with sarcasm, context, or nuanced opinions. Still, the sample indicates that the scores are credible enough for aggregate analysis: they capture whether a post is broadly positive, negative, or neutral toward public transportation.
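
The cancelling effect in mixed posts can be seen by scoring sentences separately. The sketch below is an illustration on a made-up post (mixed_post is hypothetical), assuming sentimentr's default lexicon: sentiment() returns one score per sentence, while sentiment_by() averages them into a single value for the whole post.

```r
library(sentimentr)

# Hypothetical post containing both praise and criticism
mixed_post <- paste(
  "The trains are fast and comfortable.",
  "The stations are dirty and the delays are awful."
)

# Sentence-level scores: one row per sentence
sentence_scores <- as.data.frame(sentiment(get_sentences(mixed_post)))
sentence_scores[, c("sentence_id", "word_count", "sentiment")]

# Post-level score: the two sentences are averaged into one number
post_score <- as.data.frame(sentiment_by(get_sentences(mixed_post)))
post_score$ave_sentiment
```

Because the positive and negative sentences pull in opposite directions, the post-level score is less extreme than either sentence on its own, which is why mixed posts tend to look neutral in the thread-level results.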

Plot 1 — Sentiment distribution

Is the overall distribution skewed negative (lots of complaints) or centered near zero?

threads_2 %>%
  filter(!is.na(sentiment)) %>%
  ggplot(aes(x = sentiment)) +
  geom_histogram(bins = 30, color = "black", fill = "lightblue") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Distribution of sentiment about public transportation",
    x = "Sentiment score (sentimentr)",
    y = "Number of threads"
  ) +
  theme_minimal()

The histogram of sentiment scores shows that most posts sit at or slightly above zero (median 0.017, mean 0.038), with fewer strongly negative values, although the negative tail reaches further (minimum about -1.01) than the positive one (maximum about 0.67). This suggests that, overall, Reddit discussions about public transportation in this dataset are not dominated by complaints. Users frequently share helpful information, neutral updates, or positive experiences, such as smooth commutes, good connectivity, or system improvements, rather than focusing exclusively on problems.

The relatively small number of strongly negative posts indicates that while frustrations do exist, they do not dominate the conversation. In other words, the general tone is cautiously positive or mildly supportive of public transportation, rather than overwhelmingly negative.
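
One way to put numbers on "mostly positive" is to bin the scores and count them. The sketch below does this for the 10 scores in the sample table above only, not the full dataset. The cut-offs of -0.05 and 0.05 for "neutral" are an assumption chosen for illustration, not a sentimentr convention.

```r
# Sentiment scores copied from the 10-thread sample table above
sample_scores <- c(0.2500000, 0.1448446, 0.1725367, 0.2353954, 0.1296499,
                   -0.1686942, 0.0000000, 0.3910913, 0.2755092, 0.0000000)

# Assumed cut-offs: scores within +/- 0.05 of zero are treated as neutral
tone <- cut(
  sample_scores,
  breaks = c(-Inf, -0.05, 0.05, Inf),
  labels = c("negative", "neutral", "positive")
)

tone_counts <- table(tone)
tone_counts              # negative: 1, neutral: 2, positive: 7
prop.table(tone_counts)  # shares of the sample
```

The same two lines could be run on threads_2$sentiment to report the shares for the full dataset.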

Plot 2 — Sentiment vs word count

Do very negative or very positive posts tend to be longer (rants, detailed stories), while neutral posts are shorter?

threads_2 %>%
  filter(!is.na(sentiment), !is.na(word_count)) %>%
  ggplot(aes(x = sentiment, y = word_count)) +
  geom_jitter(width = 0.02, height = 0, alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE, color = "darkgreen") +
  labs(
    title = "Relationship between sentiment and text length",
    x = "Sentiment score",
    y = "Word count"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot of sentiment versus word count shows that posts of all lengths can be positive, neutral, or negative, but there are some patterns:

Many short posts (few words) have sentiment scores near zero, which often corresponds to simple observations, titles, or quick comments without strong emotional language.

Longer posts tend to show more extreme sentiment values, both positive and negative. These are likely detailed stories, rants, or enthusiastic reviews where users explain their experiences at length.

The smooth trend line suggests that there is only a weak overall relationship between text length and sentiment, but it does hint that when users feel strongly (either positively about a good transit system or negatively about persistent problems), they are more likely to write longer, more elaborated posts. This is consistent with the idea that stronger emotions encourage more expressive writing.
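
Because the question is about the strength of sentiment rather than its direction, a rank correlation between the absolute score and the word count is a simple numeric check. The sketch below uses only the 10 rows of the sample table above, so it is a demonstration of the calculation rather than evidence about the full dataset. Using Spearman's method is an assumption, made because word counts are strongly right-skewed.

```r
# Values copied from the 10-thread sample table above
sample_sentiment_scores <- c(0.2500000, 0.1448446, 0.1725367, 0.2353954, 0.1296499,
                             -0.1686942, 0.0000000, 0.3910913, 0.2755092, 0.0000000)
sample_word_counts <- c(4, 76, 227, 15, 11, 28, 9, 54, 65, 7)

# Strength of sentiment, ignoring whether it is positive or negative
sentiment_strength <- abs(sample_sentiment_scores)

# Rank correlation between sentiment strength and post length
rho <- cor(sentiment_strength, sample_word_counts, method = "spearman")
rho
```

On the full data the same call would be cor(abs(threads_2$sentiment), threads_2$word_count, method = "spearman").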

Plot 3 — Sentiment by day of week

Does average sentiment differ depending on the day of the week on which a thread was posted?

library(lubridate)

# Make sure sentiment and date exist
threads_2 <- threads_2 %>%
  filter(!is.na(sentiment), !is.na(date))

# Recreate day_of_week if needed
threads_2 <- threads_2 %>%
  mutate(day_of_week = wday(date, label = TRUE))

# Average sentiment by day of week
threads_2 %>%
  group_by(day_of_week) %>%
  summarize(
    avg_sentiment = mean(sentiment, na.rm = TRUE),
    n_threads     = n()
  ) %>%
  ggplot(aes(x = day_of_week, y = avg_sentiment)) +
  geom_col(fill = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Average sentiment about public transportation by day of week",
    x = "Day of week",
    y = "Average sentiment score"
  ) +
  theme_minimal()

Interpretation: Why is Wednesday the Most Negative?

From the “Average Sentiment by Day of Week” plot, Wednesday shows the lowest average sentiment score, making it the most negative day for public transportation discussion on Reddit.

There are several possible explanations for this pattern:

  1. Mid-week commuter fatigue

By Wednesday, people have already been commuting for several days, and frustrations with delays, service reliability, or overcrowding may have accumulated.

Riders may therefore be more likely to vent mid-week than at the beginning or end of the week.

  2. Fewer positive “weekend-style” posts

Weekend travel tends to be more flexible and less stressful.

In contrast, weekday travel, especially on Wednesday, is dominated by work-related commutes, which are more prone to negative experiences.

  3. Mid-week service disruptions

Planned maintenance or service changes that fall early in the week (Monday–Wednesday) could cause delays and dissatisfaction, although this dataset contains no service records that would confirm it.

  4. Small-sample and outlier effects

A daily average does not fall simply because a day has more posts. However, when a day has few posts, a single strongly negative thread (the minimum score is about -1.01) can pull its average down noticeably.

The n_threads value computed in the summary above should therefore be checked before reading much into the gap between days.

Summary:

Wednesday stands out as the most negative day, likely reflecting peak commuter frustration and mid-week transit stress, compared with more neutral or even slightly positive weekend discussions.
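
Before attributing the Wednesday dip to commuter behavior, it helps to look at how many threads each day's average is based on. The sketch below runs on simulated data (toy_threads is hypothetical and generated with rnorm(), not the Reddit sample) and adds a standard error next to each daily mean, so that days with few threads or wide spread are easy to spot.

```r
library(dplyr)
library(lubridate)

# Simulated stand-in for threads_2: 60 threads with random dates and scores
set.seed(42)
toy_threads <- tibble(
  date      = as.Date("2025-01-01") + sample(0:180, 60, replace = TRUE),
  sentiment = rnorm(60, mean = 0.04, sd = 0.2)
)

by_day <- toy_threads %>%
  mutate(day_of_week = wday(date, label = TRUE)) %>%
  group_by(day_of_week) %>%
  summarize(
    avg_sentiment = mean(sentiment),
    n_threads     = n(),
    # standard error of the mean: large values signal an unreliable average
    se            = sd(sentiment) / sqrt(n()),
    .groups       = "drop"
  )

by_day
```

Even though the simulated scores have no weekday effect at all, the daily averages differ, which shows how much variation can arise from chance alone when each day has only a handful of threads.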

Plot 4 — Sentiment vs comments (engagement)

threads_2 %>%
  filter(!is.na(sentiment), !is.na(comments)) %>%
  ggplot(aes(x = sentiment, y = comments)) +
  geom_jitter(width = 0.02, height = 0, alpha = 0.4) +
  geom_smooth(method = "loess", se = FALSE, color = "purple") +
  labs(
    title = "Sentiment vs. engagement (number of comments)",
    x = "Sentiment score",
    y = "Number of comments"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The scatterplot of sentiment versus the number of comments shows that both positive and negative posts can attract engagement, but there is no strong linear relationship between sentiment score and comment count. Many posts with moderate or slightly positive sentiment receive a handful of comments, while a smaller number of highly commented posts appear across a range of sentiment values.

A few posts with very high comment counts tend to cluster at more emotionally charged scores – either clearly positive or somewhat negative – suggesting that strongly opinionated or provocative content can trigger more discussion. However, the overall pattern remains very noisy, indicating that engagement on Reddit is driven by multiple factors, such as topic novelty, visual content, or controversy, not just the emotional tone captured by the sentiment score.
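
Comment counts are heavily skewed, so comparing medians across sentiment groups is often more informative than a trend line. The sketch below applies this to the 10 rows of the sample table above only. The cut-offs of -0.05 and 0.05 are an assumption for illustration, and with so few rows the result should not be generalized.

```r
# Values copied from the 10-thread sample table above
sample_scores_eng <- c(0.2500000, 0.1448446, 0.1725367, 0.2353954, 0.1296499,
                       -0.1686942, 0.0000000, 0.3910913, 0.2755092, 0.0000000)
sample_comments <- c(185, 243, 64, 41, 60, 177, 21, 48, 49, 188)

# Assumed cut-offs: scores within +/- 0.05 of zero are treated as neutral
tone_group <- cut(
  sample_scores_eng,
  breaks = c(-Inf, -0.05, 0.05, Inf),
  labels = c("negative", "neutral", "positive")
)

# Median number of comments per sentiment group
median_comments <- tapply(sample_comments, tone_group, median)
median_comments  # negative: 177, neutral: 104.5, positive: 60
```

In this tiny sample the positive group has the lowest median comment count, which is consistent with the point that emotional tone alone does not predict engagement.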

Insights from the Sentiment Analysis

Overall, the sentiment analysis reveals a nuanced picture of how people talk about public transportation on Reddit.

From the sentiment distribution plot (Plot 1), most posts fall on the positive or mildly positive side, with fewer strongly negative scores. This suggests that r/transit is not just a space for complaints; users also share positive experiences, appreciation for well-functioning systems, and neutral informational content. The sample of 10 posts with sentiment scores supports this pattern: descriptive posts score near zero, highly positive experiences (e.g., Copenhagen metro, new stations, LAX Metro) receive clearly positive sentiment, and only a few posts show clearly negative or sarcastic tones.

The relationship between sentiment and word count (Plot 2) indicates that strongly emotional posts—both positive and negative—are often longer, while short posts tend to be closer to neutral. This suggests that when users feel strongly about transit (either praising or criticizing it), they are more likely to write detailed narratives or rants, whereas brief comments are typically less emotionally loaded.

The average sentiment by day of week (Plot 3) adds a temporal dimension: sentiment is most negative on Wednesday, hinting at mid-week commuter fatigue and stress, while weekends tend to be more neutral or slightly positive. This aligns with the idea that daily commuting conditions shape how people feel and write about transit.

Finally, sentiment vs. number of comments (Plot 4) shows that emotional tone alone does not fully explain engagement. Although some highly emotional posts attract many comments, overall the relationship between sentiment and comment count is weak, suggesting that other factors—such as the topic’s relevance, novelty, or controversy—also play important roles in driving discussion.

Taken together, these plots show that online discourse about public transportation is generally more positive than purely complaint-driven, becomes more expressive when emotions are strong, and varies over the week with commuting patterns, while sentiment only partly explains how much attention a post receives.