Introduction

My goal is to examine people’s recent experiences traveling to London during the holiday season. To do this, I scrape Reddit threads (restricted to threads posted within the last month) and run a sentiment analysis on the collected text.

Step 1: Import Packages
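The package-loading chunk is not echoed in this report; based on the functions called in the later steps, the following is a reasonable reconstruction of what needs to be attached (ggdark, glue, purrr, and knitr are called with :: below, and stringi is loaded where it is first used).

# Packages used throughout the analysis
library(RedditExtractoR) # find_thread_urls(), find_subreddits()
library(tidyverse)       # dplyr, stringr, tidyr, ggplot2
library(tidytext)        # unnest_tokens(), stop_words, get_sentiments()
library(wordcloud2)      # wordcloud2()
library(igraph)          # graph_from_data_frame()
library(ggraph)          # ggraph(), geom_edge_link()
library(syuzhet)         # get_sentiment()
library(sentimentr)      # sentiment(), get_sentences(), sentiment_by()
library(lubridate)       # wday(), hour()
library(anytime)         # anytime()
library(magrittr)        # the %<>% assignment pipe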

Step 2: Search Subreddits for Relevant Data

# Searching threads using keyword
# I attempted other keywords such as "London travel holiday season" (returned almost no data) and just "London travel" (returned data that was too broad)
threads <- find_thread_urls(keywords = "London travel Christmas", 
                              sort_by = "relevance", 
                              period = "month") %>% 
  drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads) <- NULL

colnames(threads)
## [1] "date_utc"  "timestamp" "title"     "text"      "subreddit" "comments" 
## [7] "url"
head(threads, 3) %>% knitr::kable()
| date_utc | timestamp | title | text | subreddit | comments | url |
|---|---|---|---|---|---|---|
| 2024-11-15 | 1731691704 | Christmas in London ! | I am in London from 22 Dec to 29th Dec and then for 1 night from 1-2nd Jan. I am confused as to what is the best option and best things to do during this period , as per my understanding 25th will be completely closed. Any ideas or suggestions ? TIA | uktravel | 17 | https://www.reddit.com/r/uktravel/comments/1gs1o3m/christmas_in_london/ |
| 2024-11-02 | 1730528782 | Is transit visa needed to travel via London Heathrow? | I have expired US visa but I have a valid Canada tourist visa. I am traveling from the US to India. | h1b | 19 | https://www.reddit.com/r/h1b/comments/1ghqem5/is_transit_visa_needed_to_travel_via_london/ |
| 2024-11-13 | 1731490134 | Is Vietnam a good place to travel to during Christmas? | Weather-wise, tourists surge and getting around. Which area would be great to visit during this time? | VietNam | 17 | https://www.reddit.com/r/VietNam/comments/1gq945s/is_vietnam_a_good_place_to_travel_to_during/ |

# Number of threads found by the keyword per subreddit (top 20)
threads$subreddit %>% table() %>% sort(decreasing = T) %>% head(20)
## .
##              uktravel             DigitalPR                london 
##                    23                    17                     8 
##          Europetravel                travel                 AskUK 
##                     7                     7                     5 
## BestofRedditorUpdates            solotravel                 AITAH 
##                     5                     5                     3 
##         AmItheAsshole         femaletravels          LondonTravel 
##                     3                     3                     3 
##   relationship_advice      ChristmasCoupons                f1visa 
##                     3                     2                     2 
##                 Gifts                   h1b          irishtourism 
##                     2                     2                     2 
##      ParisTravelGuide     SouthwestAirlines 
##                     2                     2
# Second method for searching relevant subreddits
subreddit_list <- RedditExtractoR::find_subreddits("London travel Christmas")
## parsing URLs on page 1...
subreddit_list[1:25,c('subreddit','title','subscribers')] %>% knitr::kable()
| id | subreddit | title | subscribers |
|---|---|---|---|
| 318mc | LondonTravel | London Travel | 3151 |
| 2qmvs | Christmasparty | Christmas Party London | 52 |
| 2qhqb | unitedkingdom | United Kingdom | 4340336 |
| 35sd2 | londontravelguide | London Travel Guide: Attractions, Itineraries, Things To Do In London | 1694 |
| 3fcf7 | pokemongoLondon | Pokemon GO London | 7668 |
| 378fz | hyperjapan | HYPER JAPAN! | 189 |
| 2slwv | ukbike | UK Bike - Cycling in the UK | 35245 |
| 2qney | southafrica | South Africa | 250672 |
| 3z1iw2 | LondonCultureClub | LondonCultureClub | 112 |
| 2rdzk | silavox | The fracon cake handoff. | 13 |
| 317ag | Notebook_share | a place to share notebooks | 1321 |
| 8ar7k1 | velvetrose | velvetrose | 57 |
| 3fgb7 | PokemonGoMidlands | PokemonGo Midlands: Gotta Catch ’em All | 220 |
| 34d4u | smuttravel | Smut Travel Caravan! | 0 |
| 2zclk | FunForLouis | FunForLouis - peace out, enjoy life, and live the adventure! | 113 |
| 2sygb | timburton | Tim Burton | 18608 |
| 2tty5 | publictransit | Public Transit: Go like a local | 0 |
| 3fd3p | pokemongoEU | Pokemon GO Europe | 90 |
| 2u441 | britpics | British Pics! | 128507 |
| 3ayak | kristinChenoweth | Kristin Chenoweth | 686 |
| 2rnoj | ambientmusic | Ambient music | 86621 |
| 2y3cp | NuWhoDailyMarathon50 | NuWho Daily Marathon Leading Up to the 50th Anniversary Special | 63 |
| 3g9oz | sushideception | sushideception writes | 0 |
| 11sxha | boramiyu | Boramiyu | 627 |
| o50uo | pigeonforgetn | Pigeon Forge, Tennessee | 553 |

Step 3: Fetch Data From Decided Subreddits

# Downloading data from my 4 chosen subreddits using the same keyword
threads_1 <- find_thread_urls(keywords = "London travel Christmas",  
                                subreddit = "LondonTravel", 
                                sort_by = "relevance", 
                                period = "month") %>% 
                                drop_na()

rownames(threads_1) <- NULL

threads_2 <- find_thread_urls(keywords = "London travel Christmas", 
                                subreddit = "london",
                                sort_by = "relevance", 
                                period = "month") %>% 
                                drop_na()

rownames(threads_2) <- NULL

threads_3 <- find_thread_urls(keywords = "London travel Christmas", 
                                subreddit = "uktravel", 
                                sort_by = "relevance", 
                                period = "month") %>% 
                                drop_na()

rownames(threads_3) <- NULL

threads_4 <- find_thread_urls(keywords = "London travel Christmas", 
                                subreddit = "travel", 
                                sort_by = "relevance", 
                                period = "month") %>% 
                                drop_na()

rownames(threads_4) <- NULL

threads_all <- rbind(threads_1, threads_2, threads_3, threads_4)
save(threads_all, file = "London_threads.RData")
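The four find_thread_urls() calls above differ only in the subreddit, so they could be collapsed into a single pass. A minimal sketch, assuming purrr (attached with the tidyverse) is available:

# One fetch per chosen subreddit, row-bound into a single data frame
subreddits <- c("LondonTravel", "london", "uktravel", "travel")
threads_all <- purrr::map_dfr(subreddits, function(sub) {
  find_thread_urls(keywords = "London travel Christmas",
                   subreddit = sub,
                   sort_by = "relevance",
                   period = "month") %>%
    drop_na()
})
rownames(threads_all) <- NULL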

Step 4: Clean text data and tokenize it

load("London_threads.Rdata")

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
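As a quick sanity check (the example string is made up), the pattern strips bare URLs and the escaped HTML entities. Note that the character class stops at anything outside letters, digits, "/" and ".", so URLs containing hyphens, underscores, or query strings will only be partially removed:

str_replace_all("see https://example.com/page &amp; more", replace_reg, "")
## [1] "see   more"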

# Tokenize first so the pre-cleaning token count can be reported below
words <- threads_all %>% 
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Tokenization (word tokens)
  unnest_tokens(word, text, token = "words")

words_clean <- words %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # keep only tokens containing at least one letter (drops pure numbers/symbols)
  filter(str_detect(word, "[a-z]"))

print(
  glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)


words_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Top 20 word counts")

Step 5: Generate Word Cloud

set.seed(42) # fix the random palette across knits (any seed works)
n <- 20
h <- runif(n, 0, 1) # any hue
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
# pad with grey so only the 20 most frequent words get a vivid colour
pal <- c(pal, rep("grey", 10000))

no_keywords <- words_clean %>% filter(!word %in% c('london','travel','christmas'))

no_keywords %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

Step 6: Conduct a tri-gram analysis

# Get tri-grams
words_ngram <- threads_all %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

# Separate the tri-grams into three word columns
words_ngram_pair <- words_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# Drop rows containing stop words or non-letter tokens
words_ngram_pair_filtered <- words_ngram_pair %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  # keep only rows where every word contains at least one letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))

# Filter out words that are not encoded in ASCII
library(stringi)
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# Sort the tri-gram (n = 3) counts:
words_counts <- words_ngram_pair_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% 
  knitr::kable()
| word1 | word2 | word3 | n |
|---|---|---|---|
| utm_source | flyerutm_medium | flyerutm_campaign | 88 |
| date | location | nov | 35 |
| flyerutm_campaign | snow | forecasts | 29 |
| flyerutm_medium | flyerutm_campaign | snow | 29 |
| forecasts | santa | sightings | 29 |
| snow | forecasts | santa | 29 |
| date | location | oct | 28 |
| flyerutm_campaign | explore | london | 27 |
| flyerutm_medium | flyerutm_campaign | explore | 27 |
| age | family | price | 26 |
| date | loc | nov | 25 |
| family | price | free | 16 |
| price | free | info | 15 |
| location | oct | 31st | 14 |
| 348format | pjpgauto | webps | 12 |
| width | 348format | pjpgauto | 12 |
| utm_source | thelondondoozy.beehiiv.comutm_medium | newsletterutm_campaign | 9 |
| book | utm_source | flyerutm_medium | 8 |
| loc | nov | 16th | 7 |
| location | nov | 1st | 7 |
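The top tri-grams are dominated by newsletter boilerplate (utm_* tracking fragments and image-size residue) rather than traveller language. A further cleaning pass could drop these before plotting; a sketch (the junk patterns are my own guess, not part of the original analysis):

# Drop tri-grams where any word looks like URL-tracking or image residue
junk_pattern <- "utm_|flyerutm|pjpg|webp|format|width"
words_counts_clean <- words_counts %>%
  filter(!if_any(c(word1, word2, word3), ~ str_detect(.x, junk_pattern)))

head(words_counts_clean, 10)
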
words_counts %>%
  filter(n == 3) %>% # keep tri-grams occurring exactly 3 times, to thin the graph
  graph_from_data_frame() %>% 
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width = n), edge_alpha = 0.6) + # constant alpha belongs outside aes()
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Step 7: Perform Sentiment Analysis

# NRC lexicon method on the cleaned data
# NRC assigns several categories to many words (an emotion plus a polarity),
# so the word-to-lexicon join is legitimately many-to-many
sentiments <- words_clean %>%
  inner_join(get_sentiments("nrc"), by = "word",
             relationship = "many-to-many") %>%
  count(sentiment, sort = TRUE)
sentiments %>%
  ggplot(aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  ggdark::dark_theme_minimal() +
  labs(title = "Sentiment Analysis of Reddit Comments", x = "Sentiment", y = "Frequency")
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
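The many-to-many relationship handled above exists because NRC assigns several categories to many words, typically an emotion (such as "joy") plus a polarity (such as "positive"). A quick way to see this, assuming the lexicon is already downloaded:

# Words carrying the most NRC categories illustrate why the join is many-to-many
get_sentiments("nrc") %>%
  count(word, sort = TRUE) %>%
  head()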

# Text-level sentiment scoring using syuzhet (one NRC score per thread text)
sentiments <- threads_all$text %>%
  get_sentiment(method = "nrc")

sentiments_df <- data.frame(sentiment = sentiments)
sentiments_df %>%
  ggplot(aes(x = sentiment)) +
  # sentiment is a continuous score, so bin it rather than mapping it to fill
  geom_histogram(bins = 30) +
  ggdark::dark_theme_minimal() +
  labs(title = "Text-Level Sentiment Scores (syuzhet, NRC)", x = "Sentiment", y = "Frequency")

# Sentiment analysis using sentimentr (valence shifters such as negation are handled)
all_sentence_score <- list()

for (i in seq_along(threads_all$text)){
  sentiment <- sentiment(threads_all$text[i])
  # get_sentences() returns a list of character vectors; [[1]] extracts the
  # sentences of this single text so lengths line up with the score rows
  sentiment$sentence <- get_sentences(threads_all$text[i])[[1]]
  all_sentence_score[[i]] <- sentiment
}

final_sentence_score <- bind_rows(all_sentence_score)
final_sentence_score
##       element_id sentence_id word_count  sentiment
##    1:          1           1         28 0.04252100
##    2:          1           2          6 0.00000000
##    3:          1           3         19 0.00000000
##    4:          1           4         23 0.00000000
##    5:          1           5         10 0.15811388
##   ---                                             
## 3782:          1           3         41 0.00000000
## 3783:          1           4         37 0.39044759
## 3784:          1           5         33 0.22630095
## 3785:          1           6         17 0.02425356
## 3786:          1           7          2 0.35355339
##                                                                                                                                                                                                                          sentence
##    1:                                                                     Hello,\n\nI'm visiting family near Cambridge up to Christmas, but they are heading to see my BIL's family on boxing day and I'm flying back to the USA.
##    2:                                                                                                                                                                                                   My flight is on Dec 27th.
##    3:                                                                                                                                          They can get me to a station/bus route but can't get me all the way into the city.
##    4:                                                                                            I'm less stressed about navigating busses, tubes, cabbies, whatever once I'm in the city but I just gotta make sure I get there.
##    5:                                                                                                                                                       * It looks like the options are NationalExpress, Flixbus, or MegaBus.
##   ---                                                                                                                                                                                                                            
## 3782: We are wanting to get to the mountainous areas, specifically the Berner Oberland and see more of the towns and not so much the cities as a lot of our other travel involves the bigger cities in France, Italy and Germany.
## 3783: Our idea was to get into Switzerland from Venice, but reading online reports from multiple websites including RickSteeves, there seems to be mixed reviews as whether it\031s better to take the trains or fly into Zurich.
## 3784:                                             Just wondering if anyone here has taken a similar travel path and wondering about the best ways to get into that area, or if there\031s something else that\031s better to see.
## 3785:                                                                                                                            We plan on staying in Switzerland until after Christmas and then flying out of Zurich to Berlin.
## 3786:                                                                                                                                                                                                                   Thank you
mean(final_sentence_score$sentiment)
## [1] 0.1298845
max(final_sentence_score$sentiment)
## [1] 1.432103
min(final_sentence_score$sentiment)
## [1] -1.067111
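For a single aggregate score per thread rather than per sentence, sentimentr also provides sentiment_by(), which averages the valence-shifted sentence scores within each element:

# One average sentiment score per thread
# (columns: element_id, word_count, sd, ave_sentiment)
thread_scores <- sentiment_by(get_sentences(threads_all$text))
head(thread_scores)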

Step 8: Display 10 sample texts

set.seed(3)  
final_sentence_score[sample(1:nrow(final_sentence_score), 10), c('sentiment', 'sentence')]
##       sentiment
##  1:  0.17677670
##  2:  0.03535534
##  3:  0.25000000
##  4:  0.26943013
##  5:  0.38000000
##  6:  0.49135381
##  7: -0.40824829
##  8:  0.00000000
##  9: -0.20846377
## 10:  0.13093073
##                                                                                                                                                                     sentence
##  1:                                                                                                                       However, I've heard great things about going here.
##  2:                                                                                                                                   I NEED to see the flying Santa sleigh!
##  3:                                                                                         Or if anyone knows of any nice roasts where they do have nice veggie things too.
##  4: An educator will assist you in drawing different sounds through mark-making and help you create your own listening chart to interpret other paintings in the collection.
##  5:                                       Also would love to visit somewhere with much to do beyond drinking or clubbing and where I can seek out a lot of fun experiences .
##  6:                                                                                                                                        Hi guys I hope you're doing well.
##  7:                                                                                                                           * Anything else you recommend/don't recommend?
##  8:                                                                                                                Tickets will be transferred to first person who responds.
##  9:                                                                                             Any tips or advice to make the most of this would be immensely appreciated!!
## 10:                                        Covent Garden and Bloomsbury will follow on Nov 12th, with Marylebone Village and Old Spitalfields Market joining in on Nov 13th.

Credibility analysis: The sentiment analysis outcomes show mixed accuracy: some scores align well with the tone of the sentences, while others miss the mark. For example, row 5 has a negative sentiment score even though the sentence reads as fairly positive to me. Overall, the analysis has clear limitations. Individual words can have a disproportionately strong influence on the score, while the broader context or tone of the sentence is not always captured. This is particularly evident in questions, which often lack clear emotional content, making it hard for the model to assign an accurate score. In addition, many of the scores are low and skew toward neutrality, suggesting the analysis underestimates the emotional tone of some sentences.

Update: I believe the seed wasn’t set correctly when the analysis was first run, so the row 5 referenced above differs from the row 5 in the current output.

Step 9: Analysis of Data

# create new column: date
threads_all %<>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date))

# number of threads by day (binwidth = 86400 seconds = 1 day)
threads_all %>% 
  ggplot(aes(x = date)) +
  geom_histogram(color = "black", position = 'stack', binwidth = 86400) +
  # label the weekly breaks with month + day; the data spans only about a month
  scale_x_datetime(date_labels = "%b %d",
                   breaks = seq(min(threads_all$date, na.rm = T), 
                                max(threads_all$date, na.rm = T), 
                                by = "1 week")) +
  theme_minimal()

# create new columns: day_of_week, is_weekend
threads_all %<>%  
  mutate(day_of_week = wday(date, label = TRUE)) %>% 
  mutate(is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"))

# number of threads by day of week
threads_all %>% 
  ggplot(aes(x = day_of_week, fill = is_weekend)) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = c("Weekday" = "gray", "Weekend" = "indianred")) + 
  theme_minimal()

print(threads_all$timestamp[1])
## [1] 1731993068
print(threads_all$timestamp[1] %>% anytime(tz = Sys.timezone()))
## [1] "2024-11-19 00:11:08 EST"
threads_all %<>%  
  # convert the epoch timestamp to local time, then keep just the hour
  mutate(time = hour(anytime(timestamp, tz = Sys.timezone())))

# number of threads by time of day
threads_all %>% 
  ggplot(aes(x = time)) +
  geom_histogram(bins = 24, color = 'black') +
  scale_x_continuous(breaks = seq(0, 24, by=2)) + 
  theme_minimal()

Merging dataframes:

# Add a 'sentence_id' column to both data frames
# NOTE: this pairs rows by position only; word tokens and sentimentr sentences
# are not in one-to-one correspondence, so the merge below is a rough
# approximation rather than a true word-to-sentence mapping
words_clean$sentence_id <- 1:nrow(words_clean)
final_sentence_score$sentence_id <- 1:nrow(final_sentence_score)

# Merge the two data frames on 'sentence_id'
merged_data <- merge(words_clean, final_sentence_score, by = "sentence_id")
merged_data
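A more faithful alternative to the positional pairing above would carry a real sentence key through tokenization, so that each word row genuinely belongs to the sentence whose score it receives. A sketch using tidytext's sentence tokenizer (its sentence splits will not exactly match sentimentr's, so some reconciliation would still be needed):

# Tokenize to sentences first, keeping per-sentence keys, then split into words
sentences_keyed <- threads_all %>%
  mutate(thread_id = row_number()) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number())

words_keyed <- sentences_keyed %>%
  unnest_tokens(word, sentence) %>%
  anti_join(stop_words, by = "word")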

Top 10 Keywords Associated with Sentiments:

# Calculate average sentiment and frequency for each word
words_summary <- merged_data %>%
  group_by(word) %>%
  summarise(avg_sentiment = mean(sentiment), n = n()) %>%
  arrange(desc(n)) %>%
  filter(n > 5) # Filter words appearing more than 5 times for relevance

# Select the top 10 words by frequency
top_words_clean <- words_summary %>%
  slice_max(order_by = n, n = 10)

library(ggplot2)

ggplot(top_words_clean, aes(x = reorder(word, avg_sentiment), y = avg_sentiment, fill = avg_sentiment)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ggdark::dark_theme_minimal() +
  labs(title = "Top 10 Words by Frequency, Colored by Average Sentiment",
       x = "Word",
       y = "Average Sentiment")

Comparison between subreddits:

# Keep threads with non-empty text and drop texts containing URLs
sentiment_plots <- threads_all %>%
  filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
  drop_na()

# Get a sentiment score for each text
# NOTE: sentiment() returns one row per sentence; taking [1] scores only the
# first sentence of each text
sentiment_plots$sentiment_score <- sapply(sentiment_plots$text, function(text) {
  sentiment(text)$sentiment[1]
})

# Selecting relevant columns for sentiment plotting
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
sentiment_plots <- sentiment_plots[, columns] %>%
  arrange(desc(sentiment_score))

# Cleaning up titles and texts
sentiment_plots$title <- strtrim(sentiment_plots$title, 50)
sentiment_plots$text <- strtrim(sentiment_plots$text, 50)

# Adding Day of the Week (DoW) for analysis
sentiment_plots$DoW <- wday(sentiment_plots$date_utc, label = TRUE, abbr = FALSE)

# Rearranging columns
sentiment_plots <- sentiment_plots %>% select(date_utc, DoW, title, text, subreddit, sentiment_score)

ggplot(sentiment_plots, aes(x = subreddit, y = sentiment_score)) +
  geom_boxplot(fill = "indianred", alpha = 0.7) +
  labs(title = "Sentiment Scores by Subreddit", x = "Subreddit", y = "Sentiment Score") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(sentiment_plots, aes(x = sentiment_score)) +
  geom_density(fill = "indianred", alpha = 0.9) +
  facet_wrap(~ subreddit) +
  labs(title = "Sentiment Distribution by Subreddit", x = "Sentiment Score", y = "Density") +
  theme_light()
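The boxplots and densities above are descriptive. If between-subreddit differences are to be called significant, a quick nonparametric check could back that up; a sketch (this was not part of the original analysis):

# Kruskal-Wallis test: do sentiment score distributions differ by subreddit?
kruskal.test(sentiment_score ~ as.factor(subreddit), data = sentiment_plots)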

Conclusion

The sentiment analysis of the Reddit threads reveals a few key trends. First, the frequency of threads varies by day and time, with peaks on certain days of the week and hours of the day, indicating patterns in user engagement. The top 10 words associated with positive and negative sentiments are also interesting: they indicate that a positive tone is more common than a negative one. Comparison between subreddits reveals notable differences in sentiment scores, with certain subreddits displaying more polarized emotions. These results are visualized through boxplots and density plots showing the distribution and variation of sentiment across subreddits. Overall, this analysis provides insight into the emotional tone of Reddit discussions and identifies key themes contributing to user sentiment.