My goal is to examine people's recent experiences traveling to London, specifically during the holiday season. I do this by scraping Reddit threads (using the thread text only, restricted to threads from the last month) and running a sentiment analysis on the collected data.
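The analysis below assumes the following packages are attached. This setup sketch is inferred from the functions used in the code; the original session's exact library calls are not shown.
# Setup inferred from the functions used below -- not the original chunk.
library(RedditExtractoR) # find_thread_urls(), find_subreddits()
library(tidyverse)       # dplyr, tidyr, stringr, ggplot2, purrr
library(tidytext)        # unnest_tokens(), stop_words, get_sentiments()
library(magrittr)        # %<>%
library(lubridate)       # wday(), hour()
library(anytime)         # anytime()
library(syuzhet)         # get_sentiment()
library(sentimentr)      # sentiment(), get_sentences() (masks syuzhet's)
library(wordcloud2)      # wordcloud2()
library(igraph)          # graph_from_data_frame()
library(ggraph)          # ggraph(), geom_edge_link()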
# Searching threads using keyword
# I attempted other keywords such as "London travel holiday season" (which returned almost no data) and just "London travel" (which returned data that was too broad)
threads <- find_thread_urls(keywords = "London travel Christmas",
sort_by = "relevance",
period = "month") %>%
drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads) <- NULL
colnames(threads)
## [1] "date_utc" "timestamp" "title" "text" "subreddit" "comments"
## [7] "url"
head(threads, 3) %>% knitr::kable()
| date_utc | timestamp | title | text | subreddit | comments | url |
|---|---|---|---|---|---|---|
| 2024-11-15 | 1731691704 | Christmas in London ! | I am in London from 22 Dec to 29th Dec and then for 1 night from 1-2nd Jan. I am confused as to what is the best option and best things to do during this period , as per my understanding 25th will be completely closed. Any ideas or suggestions ? TIA | uktravel | 17 | https://www.reddit.com/r/uktravel/comments/1gs1o3m/christmas_in_london/ |
| 2024-11-02 | 1730528782 | Is transit visa needed to travel via London Heathrow? I have expired US visa but I have a valid Canada tourist visa. I am traveling from the US to India. | | h1b | 19 | https://www.reddit.com/r/h1b/comments/1ghqem5/is_transit_visa_needed_to_travel_via_london/ |
| 2024-11-13 | 1731490134 | Is Vietnam a good place to travel to during Christmas? | Weather-wise, tourists surge and getting around&. Which area would be great to visit during this time? | VietNam | 17 | https://www.reddit.com/r/VietNam/comments/1gq945s/is_vietnam_a_good_place_to_travel_to_during/ |
# Number of threads found by the keyword per subreddit (top 20)
threads$subreddit %>% table() %>% sort(decreasing = T) %>% head(20)
## .
## uktravel DigitalPR london
## 23 17 8
## Europetravel travel AskUK
## 7 7 5
## BestofRedditorUpdates solotravel AITAH
## 5 5 3
## AmItheAsshole femaletravels LondonTravel
## 3 3 3
## relationship_advice ChristmasCoupons f1visa
## 3 2 2
## Gifts h1b irishtourism
## 2 2 2
## ParisTravelGuide SouthwestAirlines
## 2 2
# Second method for searching relevant subreddits
subreddit_list <- RedditExtractoR::find_subreddits("London travel Christmas")
## parsing URLs on page 1...
subreddit_list[1:25,c('subreddit','title','subscribers')] %>% knitr::kable()
| | subreddit | title | subscribers |
|---|---|---|---|
| 318mc | LondonTravel | London Travel | 3151 |
| 2qmvs | Christmasparty | Christmas Party London | 52 |
| 2qhqb | unitedkingdom | United Kingdom | 4340336 |
| 35sd2 | londontravelguide | London Travel Guide: Attractions, Itineraries, Things To Do In London | 1694 |
| 3fcf7 | pokemongoLondon | Pokemon GO London | 7668 |
| 378fz | hyperjapan | HYPER JAPAN! | 189 |
| 2slwv | ukbike | UK Bike - Cycling in the UK | 35245 |
| 2qney | southafrica | South Africa | 250672 |
| 3z1iw2 | LondonCultureClub | LondonCultureClub | 112 |
| 2rdzk | silavox | The fracon cake handoff. | 13 |
| 317ag | Notebook_share | a place to share notebooks | 1321 |
| 8ar7k1 | velvetrose | velvetrose | 57 |
| 3fgb7 | PokemonGoMidlands | PokemonGo Midlands: Gotta Catch ’em All | 220 |
| 34d4u | smuttravel | Smut Travel Caravan! | 0 |
| 2zclk | FunForLouis | FunForLouis - peace out, enjoy life, and live the adventure! | 113 |
| 2sygb | timburton | Tim Burton | 18608 |
| 2tty5 | publictransit | Public Transit: Go like a local | 0 |
| 3fd3p | pokemongoEU | Pokemon GO Europe | 90 |
| 2u441 | britpics | British Pics! | 128507 |
| 3ayak | kristinChenoweth | Kristin Chenoweth | 686 |
| 2rnoj | ambientmusic | Ambient music | 86621 |
| 2y3cp | NuWhoDailyMarathon50 | NuWho Daily Marathon Leading Up to the 50th Anniversary Special | 63 |
| 3g9oz | sushideception | sushideception writes | 0 |
| 11sxha | boramiyu | Boramiyu (ô|ø ) | 627 |
| o50uo | pigeonforgetn | Pigeon Forge, Tennessee | 553 |
# Downloading data from my 4 chosen subreddits and keywords
threads_1 <- find_thread_urls(keywords = "London travel Christmas",
subreddit = "LondonTravel",
sort_by = "relevance",
period = "month") %>%
drop_na()
rownames(threads_1) <- NULL
threads_2 <- find_thread_urls(keywords = "London travel Christmas",
subreddit = "london",
sort_by = "relevance",
period = "month") %>%
drop_na()
rownames(threads_2) <- NULL
threads_3 <- find_thread_urls(keywords = "London travel Christmas",
subreddit = "uktravel",
sort_by = "relevance",
period = "month") %>%
drop_na()
rownames(threads_3) <- NULL
threads_4 <- find_thread_urls(keywords = "London travel Christmas",
subreddit = "travel",
sort_by = "relevance",
period = "month") %>%
drop_na()
rownames(threads_4) <- NULL
threads_all <- rbind(threads_1, threads_2, threads_3, threads_4)
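The four calls above differ only in the subreddit argument, so the same data frame could be built with a single loop. A sketch (using a hypothetical name, threads_all_loop, so it is not confused with the object actually used below):
# Equivalent loop over the four chosen subreddits (sketch only).
subs <- c("LondonTravel", "london", "uktravel", "travel")
threads_all_loop <- purrr::map_dfr(subs, function(s) {
  find_thread_urls(keywords = "London travel Christmas",
                   subreddit = s,
                   sort_by = "relevance",
                   period = "month") %>%
    drop_na()
})
rownames(threads_all_loop) <- NULL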
save(threads_all, file = "London_threads.RData")
load("London_threads.Rdata")
# Regex matching URLs and stray HTML-entity characters
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&|<|>"
words <- threads_all %>%
# drop URLs
mutate(text = str_replace_all(text, replace_reg, "")) %>%
# Tokenization (word tokens)
unnest_tokens(word, text, token = "words")
words_clean <- words %>%
# drop stop words
anti_join(stop_words, by = "word") %>%
# drop tokens that contain no letters
filter(str_detect(word, "[a-z]"))
print(
glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
words_clean %>%
count(word, sort = TRUE) %>%
top_n(20, n) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
coord_flip() +
labs(x = "words",
y = "counts",
title = "Top 20 word counts")
n <- 20
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright
df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))
no_keywords <- words_clean %>% filter(!word %in% c('london','travel','christmas'))
no_keywords %>%
count(word, sort = TRUE) %>%
wordcloud2(color = pal,
minRotation = 0,
maxRotation = 0,
ellipticity = 0.8)
# Get tri-grams
words_ngram <- threads_all %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
select(text) %>%
unnest_tokens(output = paired_words,
input = text,
token = "ngrams",
n = 3)
#separate the paired words into three columns
words_ngram_pair <- words_ngram %>%
separate(paired_words, c("word1", "word2", "word3"), sep = " ")
# filter rows where there are stop words
words_ngram_pair_filtered <- words_ngram_pair %>%
# drop stop words
filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>%
# drop non-alphabet-only strings
filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))
# Filter out words that are not encoded in ASCII
library(stringi)
words_ngram_pair_filtered %<>%
filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))
# Sort the tri-gram counts:
words_counts <- words_ngram_pair_filtered %>%
count(word1, word2, word3) %>%
arrange(desc(n))
head(words_counts, 20) %>%
knitr::kable()
| word1 | word2 | word3 | n |
|---|---|---|---|
| utm_source | flyerutm_medium | flyerutm_campaign | 88 |
| date | location | nov | 35 |
| flyerutm_campaign | snow | forecasts | 29 |
| flyerutm_medium | flyerutm_campaign | snow | 29 |
| forecasts | santa | sightings | 29 |
| snow | forecasts | santa | 29 |
| date | location | oct | 28 |
| flyerutm_campaign | explore | london | 27 |
| flyerutm_medium | flyerutm_campaign | explore | 27 |
| age | family | price | 26 |
| date | loc | nov | 25 |
| family | price | free | 16 |
| price | free | info | 15 |
| location | oct | 31st | 14 |
| 348format | pjpgauto | webps | 12 |
| width | 348format | pjpgauto | 12 |
| utm_source | thelondondoozy.beehiiv.comutm_medium | newsletterutm_campaign | 9 |
| book | utm_source | flyerutm_medium | 8 |
| loc | nov | 16th | 7 |
| location | nov | 1st | 7 |
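The most frequent trigrams are dominated by newsletter tracking tokens (utm_source, flyerutm_medium) and image-size residue (348format, pjpgauto, webps) rather than travel content. A rough sketch that drops such tokens before counting; the noise pattern is an illustrative guess, not a definitive list:
# Drop tokens that look like URL tracking parameters or image residue
# before counting trigrams (pattern is a rough, illustrative choice).
noise_pattern <- "utm_|pjpgauto|webps|\\d+format"
words_counts_denoised <- words_ngram_pair_filtered %>%
  filter(!str_detect(word1, noise_pattern),
         !str_detect(word2, noise_pattern),
         !str_detect(word3, noise_pattern)) %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))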
words_counts %>%
filter(n == 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_width = n), edge_alpha = 0.6) +
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label = name), vjust = 1.8) +
labs(title = "Word Networks",
x = "", y = "")
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# NRC lexicon method on the cleaned data
sentiments <- words_clean %>%
inner_join(get_sentiments("nrc"), by = "word") %>%
count(sentiment, sort = TRUE)
## Warning in inner_join(., get_sentiments("nrc"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 10 of `x` matches multiple rows in `y`.
## ℹ Row 9378 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
sentiments %>%
ggplot(aes(x = sentiment, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
ggdark::dark_theme_minimal() +
labs(title = "Sentiment Analysis of Reddit Comments", x = "Sentiment", y = "Frequency")
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
# Sentiment scoring of each thread's text using syuzhet's get_sentiment() (NRC method)
sentiments <- threads_all$text %>%
get_sentiment(method = "nrc")
sentiments_df <- data.frame(sentiment = sentiments)
sentiments_df %>%
ggplot(aes(x = sentiment, fill = sentiment)) +
geom_bar(show.legend = FALSE) +
ggdark::dark_theme_minimal() +
labs(title = "Sentence-Level Sentiment Analysis", x = "Sentiment", y = "Frequency")
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
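The dropped-aesthetic warning comes from mapping fill to the continuous score inside geom_bar(); a histogram with a fixed fill avoids it. A minimal sketch over the same sentiments_df:
# Histogram of the per-thread NRC scores; no continuous fill mapping,
# so the warning above does not occur.
sentiments_df %>%
  ggplot(aes(x = sentiment)) +
  geom_histogram(binwidth = 1, fill = "indianred", color = "black") +
  ggdark::dark_theme_minimal() +
  labs(title = "Thread-Level Sentiment Scores (NRC)",
       x = "Sentiment score", y = "Frequency")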
# Sentiment analysis using sentimentr (taking negations into account)
all_sentence_score <- list()
for (i in seq_along(threads_all$text)){
# score each thread's text sentence by sentence with sentimentr
sentiment <- sentiment(threads_all$text[i])
# attach the raw sentence text alongside each score
sentiment$sentence <- unlist(get_sentences(threads_all$text[i]))
all_sentence_score[[i]] <- sentiment
}
final_sentence_score <- bind_rows(all_sentence_score)
final_sentence_score
## element_id sentence_id word_count sentiment
## 1: 1 1 28 0.04252100
## 2: 1 2 6 0.00000000
## 3: 1 3 19 0.00000000
## 4: 1 4 23 0.00000000
## 5: 1 5 10 0.15811388
## ---
## 3782: 1 3 41 0.00000000
## 3783: 1 4 37 0.39044759
## 3784: 1 5 33 0.22630095
## 3785: 1 6 17 0.02425356
## 3786: 1 7 2 0.35355339
## sentence
## 1: Hello,\n\nI'm visiting family near Cambridge up to Christmas, but they are heading to see my BIL's family on boxing day and I'm flying back to the USA.
## 2: My flight is on Dec 27th.
## 3: They can get me to a station/bus route but can't get me all the way into the city.
## 4: I'm less stressed about navigating busses, tubes, cabbies, whatever once I'm in the city but I just gotta make sure I get there.
## 5: * It looks like the options are NationalExpress, Flixbus, or MegaBus.
## ---
## 3782: We are wanting to get to the mountainous areas, specifically the Berner Oberland and see more of the towns and not so much the cities as a lot of our other travel involves the bigger cities in France, Italy and Germany.
## 3783: Our idea was to get into Switzerland from Venice, but reading online reports from multiple websites including RickSteeves, there seems to be mixed reviews as whether it's better to take the trains or fly into Zurich.
## 3784: Just wondering if anyone here has taken a similar travel path and wondering about the best ways to get into that area, or if there's something else that's better to see.
## 3785: We plan on staying in Switzerland until after Christmas and then flying out of Zurich to Berlin.
## 3786: Thank you
mean(final_sentence_score$sentiment)
## [1] 0.1298845
max(final_sentence_score$sentiment)
## [1] 1.432103
min(final_sentence_score$sentiment)
## [1] -1.067111
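Because the loop scores each thread separately, element_id is always 1 in the combined table, so the thread a sentence came from is lost. sentimentr can score the whole vector in one call and keep that index; a sketch, assuming the same threads_all:
# Vectorized alternative to the loop: element_id indexes the thread and
# sentence_id the sentence within it.
sentences_all <- get_sentences(threads_all$text)
final_sentence_score_v2 <- sentiment(sentences_all)
# attach the raw sentence text for inspection (one row per sentence)
final_sentence_score_v2$sentence <- unlist(sentences_all)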
set.seed(3)
final_sentence_score[sample(1:nrow(final_sentence_score), 10), c('sentiment', 'sentence')]
## sentiment
## 1: 0.17677670
## 2: 0.03535534
## 3: 0.25000000
## 4: 0.26943013
## 5: 0.38000000
## 6: 0.49135381
## 7: -0.40824829
## 8: 0.00000000
## 9: -0.20846377
## 10: 0.13093073
## sentence
## 1: However, I've heard great things about going here.
## 2: I NEED to see the flying Santa sleigh!
## 3: Or if anyone knows of any nice roasts where they do have nice veggie things too.
## 4: An educator will assist you in drawing different sounds through mark-making and help you create your own listening chart to interpret other paintings in the collection.
## 5: Also would love to visit somewhere with much to do beyond drinking or clubbing and where I can seek out a lot of fun experiences .
## 6: Hi guys I hope you're doing well.
## 7: * Anything else you recommend/don't recommend?
## 8: Tickets will be transferred to first person who responds.
## 9: Any tips or advice to make the most of this would be immensely appreciated!!
## 10: Covent Garden and Bloomsbury will follow on Nov 12th, with Marylebone Village and Old Spitalfields Market joining in on Nov 13th.
Credibility analysis: The sentiment scores show mixed accuracy; some align well with the tone of the sentence, while others miss the mark. For example, row 7 ("Anything else you recommend/don't recommend?") receives a clearly negative score even though it reads as a neutral question. Overall, the analysis has limitations: individual words can have a disproportionately strong influence on the score, while the broader context or tone of the sentence is not always captured. This is particularly evident in questions, which often carry little explicit emotional content, making it hard for the model to assign an accurate score. In addition, many scores sit close to zero, suggesting the analysis underestimates the emotional tone of some sentences.
# create new column: date
threads_all %<>%
mutate(date = as.POSIXct(date_utc)) %>%
filter(!is.na(date))
# number of threads by day
threads_all %>%
ggplot(aes(x = date)) +
geom_histogram(color="black", position = 'stack', binwidth = 86400) +
scale_x_datetime(date_labels = "%b %y",
breaks = seq(min(threads_all$date, na.rm = T),
max(threads_all$date, na.rm = T),
by = "1 week")) +
theme_minimal()
# create new columns: day_of_week, is_weekend
threads_all %<>%
mutate(day_of_week = wday(date, label = TRUE)) %>%
mutate(is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"))
# number of threads by day of the week
threads_all %>%
ggplot(aes(x = day_of_week, fill = is_weekend)) +
geom_bar(color = 'black') +
scale_fill_manual(values = c("Weekday" = "gray", "Weekend" = "indianred")) +
theme_minimal()
print(threads_all$timestamp[1])
## [1] 1731993068
print(threads_all$timestamp[1] %>% anytime(tz = anytime:::getTZ()))
## [1] "2024-11-19 00:11:08 EST"
threads_all %<>%
mutate(time = timestamp %>%
anytime(tz = anytime:::getTZ()) %>%
str_split('-| |:') %>%
sapply(function(x) as.numeric(x[4])))
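The same hour-of-day value can be pulled more directly with lubridate (already attached for wday()); a sketch, with Sys.timezone() standing in for the internal anytime:::getTZ() helper:
# Equivalent hour extraction without splitting the formatted string.
threads_all <- threads_all %>%
  mutate(time = hour(anytime(timestamp, tz = Sys.timezone())))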
# number of threads by time of day
threads_all %>%
ggplot(aes(x = time)) +
geom_histogram(bins = 24, color = 'black') +
scale_x_continuous(breaks = seq(0, 24, by=2)) +
theme_minimal()
Merging dataframes:
#Add a 'sentence_id' column to both data frames
words_clean$sentence_id <- 1:nrow(words_clean)
final_sentence_score$sentence_id <- 1:nrow(final_sentence_score)
# Merge the two data frames on 'sentence_id'
merged_data <- merge(words_clean, final_sentence_score, by = "sentence_id")
merged_data
Top 10 Keywords Associated with Sentiments:
# Calculate average sentiment and frequency for each word
words_summary <- merged_data %>%
group_by(word) %>%
summarise(avg_sentiment = mean(sentiment), n = n()) %>%
arrange(desc(n)) %>%
filter(n > 5) # Filter words appearing more than 5 times for relevance
# Select the top 10 words by frequency
top_words_clean <- words_summary %>%
slice_max(order_by = n, n = 10)
library(ggplot2)
ggplot(top_words_clean, aes(x = reorder(word, avg_sentiment), y = avg_sentiment, fill = avg_sentiment)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ggdark::dark_theme_minimal() +
labs(title = "Top 10 Words by Sentiment (Filtered)",
x = "Word",
y = "Average Sentiment")
Comparison between subreddits:
# Keep threads with non-empty, non-link-only text for per-thread sentiment scoring
sentiment_plots <- threads_all %>%
filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
drop_na()
# Score each text with sentimentr; only the first sentence's score is kept
sentiment_plots$sentiment_score <- sapply(sentiment_plots$text, function(text) {
sentiment(text)$sentiment[1]
})
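The sapply above keeps only the first sentence's score for each thread. sentimentr::sentiment_by() averages across all sentences of a text, which may be a fairer per-thread summary; a sketch adding it as a hypothetical extra column (not used downstream):
# Average sentence-level sentiment per thread (alternative summary).
sentiment_plots$sentiment_score_avg <-
  sentiment_by(get_sentences(sentiment_plots$text))$ave_sentiment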
# Selecting relevant columns for sentiment plotting
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
sentiment_plots <- sentiment_plots[, columns] %>%
arrange(desc(sentiment_score))
# Truncate titles and texts to 50 characters for display
sentiment_plots$title <- strtrim(sentiment_plots$title, 50)
sentiment_plots$text <- strtrim(sentiment_plots$text, 50)
# Adding Day of the Week (DoW) for analysis
sentiment_plots$DoW <- wday(sentiment_plots$date_utc, label = TRUE, abbr = FALSE)
# Rearranging columns
sentiment_plots <- sentiment_plots %>% select(date_utc, DoW, title, text, subreddit, sentiment_score)
ggplot(sentiment_plots, aes(x = subreddit, y = sentiment_score)) +
geom_boxplot(fill = "indianred", alpha = 0.7) +
labs(title = "Sentiment Scores by Subreddit", x = "Subreddit", y = "Sentiment Score") +
theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(sentiment_plots, aes(x = sentiment_score)) +
geom_density(fill = "indianred", alpha = 0.9) +
facet_wrap(~ subreddit) +
labs(title = "Sentiment Distribution by Subreddit", x = "Sentiment Score", y = "Density") +
theme_light()
The sentiment analysis of the Reddit threads reveals a few key trends. First, the frequency of threads varies by day and time, with peaks on certain days of the week and at specific hours, indicating patterns in user engagement. The top 10 words associated with sentiment are interesting: they suggest that a positive tone is more common than a negative one. Comparison between subreddits reveals clear differences in sentiment scores, with some subreddits displaying more polarized emotions. These results are visualized with boxplots and density plots showing the distribution and variation of sentiment across subreddits. Overall, this analysis provides insight into the emotional tone of Reddit discussions about traveling to London over the holidays and identifies key themes contributing to user sentiment.