My goal is to examine people's recent experiences traveling to London, specifically during the holiday season. I do this by scraping Reddit threads (using the thread text only, restricted to threads from the last month) and running a sentiment analysis on the collected data.
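The analysis below assumes the following packages are attached. This setup sketch is inferred from the functions used in the code; the original session's exact library calls are not shown.
# Setup inferred from the functions used below -- not the original chunk.
library(RedditExtractoR) # find_thread_urls(), find_subreddits()
library(tidyverse)       # dplyr, tidyr, stringr, ggplot2, purrr
library(tidytext)        # unnest_tokens(), stop_words, get_sentiments()
library(magrittr)        # %<>%
library(lubridate)       # wday(), hour()
library(anytime)         # anytime()
library(syuzhet)         # get_sentiment()
library(sentimentr)      # sentiment(), get_sentences() (masks syuzhet's)
library(wordcloud2)      # wordcloud2()
library(igraph)          # graph_from_data_frame()
library(ggraph)          # ggraph(), geom_edge_link()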
# Searching threads using keyword
# I attempted other keywords such as "London travel holiday season" (which returned almost no data) and just "London travel" (which returned data that was too broad)
threads <- find_thread_urls(keywords = "London travel Christmas",
sort_by = "relevance",
period = "month") %>%
drop_na()
## parsing URLs on page 1...
## parsing URLs on page 2...
## parsing URLs on page 3...
rownames(threads) <- NULL
colnames(threads)
## [1] "date_utc" "timestamp" "title" "text" "subreddit" "comments"
## [7] "url"
head(threads, 3) %>% knitr::kable()
| date_utc | timestamp | title | text | subreddit | comments | url |
|---|---|---|---|---|---|---|
| 2024-11-15 | 1731691704 | Christmas in London ! | I am in London from 22 Dec to 29th Dec and then for 1 night from 1-2nd Jan. I am confused as to what is the best option and best things to do during this period , as per my understanding 25th will be completely closed. Any ideas or suggestions ? TIA | uktravel | 17 | https://www.reddit.com/r/uktravel/comments/1gs1o3m/christmas_in_london/ |
| 2024-11-02 | 1730528782 | Is transit visa needed to travel via London Heathrow? I have expired US visa but I have a valid Canada tourist visa. I am traveling from the US to India. | | h1b | 19 | https://www.reddit.com/r/h1b/comments/1ghqem5/is_transit_visa_needed_to_travel_via_london/ |
| 2024-11-13 | 1731490134 | Is Vietnam a good place to travel to during Christmas? | Weather-wise, tourists surge and getting around&. Which area would be great to visit during this time? | VietNam | 17 | https://www.reddit.com/r/VietNam/comments/1gq945s/is_vietnam_a_good_place_to_travel_to_during/ |
# Number of threads found by the keyword per subreddit (top 20)
threads$subreddit %>% table() %>% sort(decreasing = T) %>% head(20)
## .
## uktravel DigitalPR london
## 23 17 8
## Europetravel travel AskUK
## 7 7 5
## BestofRedditorUpdates solotravel AITAH
## 5 5 3
## AmItheAsshole femaletravels LondonTravel
## 3 3 3
## relationship_advice ChristmasCoupons f1visa
## 3 2 2
## Gifts h1b irishtourism
## 2 2 2
## ParisTravelGuide SouthwestAirlines
## 2 2
# Second method for searching relevant subreddits
subreddit_list <- RedditExtractoR::find_subreddits("London travel Christmas")
## parsing URLs on page 1...
subreddit_list[1:25,c('subreddit','title','subscribers')] %>% knitr::kable()
| | subreddit | title | subscribers |
|---|---|---|---|
| 318mc | LondonTravel | London Travel | 3151 |
| 2qmvs | Christmasparty | Christmas Party London | 52 |
| 2qhqb | unitedkingdom | United Kingdom | 4340336 |
| 35sd2 | londontravelguide | London Travel Guide: Attractions, Itineraries, Things To Do In London | 1694 |
| 3fcf7 | pokemongoLondon | Pokemon GO London | 7668 |
| 378fz | hyperjapan | HYPER JAPAN! | 189 |
| 2slwv | ukbike | UK Bike - Cycling in the UK | 35245 |
| 2qney | southafrica | South Africa | 250672 |
| 3z1iw2 | LondonCultureClub | LondonCultureClub | 112 |
| 2rdzk | silavox | The fracon cake handoff. | 13 |
| 317ag | Notebook_share | a place to share notebooks | 1321 |
| 8ar7k1 | velvetrose | velvetrose | 57 |
| 3fgb7 | PokemonGoMidlands | PokemonGo Midlands: Gotta Catch ’em All | 220 |
| 34d4u | smuttravel | Smut Travel Caravan! | 0 |
| 2zclk | FunForLouis | FunForLouis - peace out, enjoy life, and live the adventure! | 113 |
| 2sygb | timburton | Tim Burton | 18608 |
| 2tty5 | publictransit | Public Transit: Go like a local | 0 |
| 3fd3p | pokemongoEU | Pokemon GO Europe | 90 |
| 2u441 | britpics | British Pics! | 128507 |
| 3ayak | kristinChenoweth | Kristin Chenoweth | 686 |
| 2rnoj | ambientmusic | Ambient music | 86621 |
| 2y3cp | NuWhoDailyMarathon50 | NuWho Daily Marathon Leading Up to the 50th Anniversary Special | 63 |
| 3g9oz | sushideception | sushideception writes | 0 |
| 11sxha | boramiyu | Boramiyu (ô|ø ) | 627 |
| o50uo | pigeonforgetn | Pigeon Forge, Tennessee | 553 |
# Downloading data from my 4 chosen subreddits and keywords
threads_1 <- find_thread_urls(keywords = "London travel Christmas",
subreddit = "LondonTravel",
sort_by = "relevance",
period = "month") %>%
drop_na()
rownames(threads_1) <- NULL
threads_2 <- find_thread_urls(keywords = "London travel Christmas",
subreddit = "london",
sort_by = "relevance",
period = "month") %>%
drop_na()
rownames(threads_2) <- NULL
threads_3 <- find_thread_urls(keywords = "London travel Christmas",
subreddit = "uktravel",
sort_by = "relevance",
period = "month") %>%
drop_na()
rownames(threads_3) <- NULL
threads_4 <- find_thread_urls(keywords = "London travel Christmas",
subreddit = "travel",
sort_by = "relevance",
period = "month") %>%
drop_na()
rownames(threads_4) <- NULL
threads_all <- rbind(threads_1, threads_2, threads_3, threads_4)
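The four calls above differ only in the subreddit argument, so the same data frame could be built with a single loop. A sketch (using a hypothetical name, threads_all_loop, so it is not confused with the object actually used below):
# Equivalent loop over the four chosen subreddits (sketch only).
subs <- c("LondonTravel", "london", "uktravel", "travel")
threads_all_loop <- purrr::map_dfr(subs, function(s) {
  find_thread_urls(keywords = "London travel Christmas",
                   subreddit = s,
                   sort_by = "relevance",
                   period = "month") %>%
    drop_na()
})
rownames(threads_all_loop) <- NULL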
save(threads_all, file = "London_threads.RData")
load("London_threads.Rdata")
# Regex matching URLs and stray HTML-entity characters
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&|<|>"
words <- threads_all %>%
# drop URLs
mutate(text = str_replace_all(text, replace_reg, "")) %>%
# Tokenization (word tokens)
unnest_tokens(word, text, token = "words")
words_clean <- words %>%
# drop stop words
anti_join(stop_words, by = "word") %>%
# drop tokens that contain no letters
filter(str_detect(word, "[a-z]"))
print(
glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
words_clean %>%
count(word, sort = TRUE) %>%
top_n(20, n) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
coord_flip() +
labs(x = "words",
y = "counts",
title = "Top 20 word counts")
n <- 20
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark nor too bright
df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))
no_keywords <- words_clean %>% filter(!word %in% c('london','travel','christmas'))
no_keywords %>%
count(word, sort = TRUE) %>%
wordcloud2(color = pal,
minRotation = 0,
maxRotation = 0,
ellipticity = 0.8)
# Get tri-grams
words_ngram <- threads_all %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
select(text) %>%
unnest_tokens(output = paired_words,
input = text,
token = "ngrams",
n = 3)
#separate the paired words into three columns
words_ngram_pair <- words_ngram %>%
separate(paired_words, c("word1", "word2", "word3"), sep = " ")
# filter rows where there are stop words
words_ngram_pair_filtered <- words_ngram_pair %>%
# drop stop words
filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>%
# drop non-alphabet-only strings
filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))
# Filter out words that are not encoded in ASCII
library(stringi)
words_ngram_pair_filtered %<>%
filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))
# Sort the tri-gram counts:
words_counts <- words_ngram_pair_filtered %>%
count(word1, word2, word3) %>%
arrange(desc(n))
head(words_counts, 20) %>%
knitr::kable()
| word1 | word2 | word3 | n |
|---|---|---|---|
| utm_source | flyerutm_medium | flyerutm_campaign | 88 |
| date | location | nov | 35 |
| flyerutm_campaign | snow | forecasts | 29 |
| flyerutm_medium | flyerutm_campaign | snow | 29 |
| forecasts | santa | sightings | 29 |
| snow | forecasts | santa | 29 |
| date | location | oct | 28 |
| flyerutm_campaign | explore | london | 27 |
| flyerutm_medium | flyerutm_campaign | explore | 27 |
| age | family | price | 26 |
| date | loc | nov | 25 |
| family | price | free | 16 |
| price | free | info | 15 |
| location | oct | 31st | 14 |
| 348format | pjpgauto | webps | 12 |
| width | 348format | pjpgauto | 12 |
| utm_source | thelondondoozy.beehiiv.comutm_medium | newsletterutm_campaign | 9 |
| book | utm_source | flyerutm_medium | 8 |
| loc | nov | 16th | 7 |
| location | nov | 1st | 7 |
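The most frequent trigrams are dominated by newsletter tracking tokens (utm_source, flyerutm_medium) and image-size residue (348format, pjpgauto, webps) rather than travel content. A rough sketch that drops such tokens before counting; the noise pattern is an illustrative guess, not a definitive list:
# Drop tokens that look like URL tracking parameters or image residue
# before counting trigrams (pattern is a rough, illustrative choice).
noise_pattern <- "utm_|pjpgauto|webps|\\d+format"
words_counts_denoised <- words_ngram_pair_filtered %>%
  filter(!str_detect(word1, noise_pattern),
         !str_detect(word2, noise_pattern),
         !str_detect(word3, noise_pattern)) %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))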
words_counts %>%
filter(n == 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_width = n), edge_alpha = 0.6) +
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label = name), vjust = 1.8) +
labs(title = "Word Networks",
x = "", y = "")
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# NRC lexicon method on the cleaned data
sentiments <- words_clean %>%
inner_join(get_sentiments("nrc"), by = "word") %>%
count(sentiment, sort = TRUE)
## Warning in inner_join(., get_sentiments("nrc"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 10 of `x` matches multiple rows in `y`.
## ℹ Row 9378 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
sentiments %>%
ggplot(aes(x = sentiment, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
ggdark::dark_theme_minimal() +
labs(title = "Sentiment Analysis of Reddit Comments", x = "Sentiment", y = "Frequency")
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
# Sentiment scoring of each thread's text using syuzhet's get_sentiment() (NRC method)
sentiments <- threads_all$text %>%
get_sentiment(method = "nrc")
sentiments_df <- data.frame(sentiment = sentiments)
sentiments_df %>%
ggplot(aes(x = sentiment, fill = sentiment)) +
geom_bar(show.legend = FALSE) +
ggdark::dark_theme_minimal() +
labs(title = "Sentence-Level Sentiment Analysis", x = "Sentiment", y = "Frequency")
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
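The dropped-aesthetic warning comes from mapping fill to the continuous score inside geom_bar(); a histogram with a fixed fill avoids it. A minimal sketch over the same sentiments_df:
# Histogram of the per-thread NRC scores; no continuous fill mapping,
# so the warning above does not occur.
sentiments_df %>%
  ggplot(aes(x = sentiment)) +
  geom_histogram(binwidth = 1, fill = "indianred", color = "black") +
  ggdark::dark_theme_minimal() +
  labs(title = "Thread-Level Sentiment Scores (NRC)",
       x = "Sentiment score", y = "Frequency")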
# Sentiment analysis using sentimentr (taking negations into account)
all_sentence_score <- list()
for (i in seq_along(threads_all$text)){
# score each thread's text sentence by sentence with sentimentr
sentiment <- sentiment(threads_all$text[i])
# attach the raw sentence text alongside each score
sentiment$sentence <- unlist(get_sentences(threads_all$text[i]))
all_sentence_score[[i]] <- sentiment
}
final_sentence_score <- bind_rows(all_sentence_score)
final_sentence_score
## element_id sentence_id word_count sentiment
## 1: 1 1 28 0.04252100
## 2: 1 2 6 0.00000000
## 3: 1 3 19 0.00000000
## 4: 1 4 23 0.00000000
## 5: 1 5 10 0.15811388
## ---
## 3782: 1 3 41 0.00000000
## 3783: 1 4 37 0.39044759
## 3784: 1 5 33 0.22630095
## 3785: 1 6 17 0.02425356
## 3786: 1 7 2 0.35355339
## sentence
## 1: Hello,\n\nI'm visiting family near Cambridge up to Christmas, but they are heading to see my BIL's family on boxing day and I'm flying back to the USA.
## 2: My flight is on Dec 27th.
## 3: They can get me to a station/bus route but can't get me all the way into the city.
## 4: I'm less stressed about navigating busses, tubes, cabbies, whatever once I'm in the city but I just gotta make sure I get there.
## 5: * It looks like the options are NationalExpress, Flixbus, or MegaBus.
## ---
## 3782: We are wanting to get to the mountainous areas, specifically the Berner Oberland and see more of the towns and not so much the cities as a lot of our other travel involves the bigger cities in France, Italy and Germany.
## 3783: Our idea was to get into Switzerland from Venice, but reading online reports from multiple websites including RickSteeves, there seems to be mixed reviews as whether it's better to take the trains or fly into Zurich.
## 3784: Just wondering if anyone here has taken a similar travel path and wondering about the best ways to get into that area, or if there's something else that's better to see.
## 3785: We plan on staying in Switzerland until after Christmas and then flying out of Zurich to Berlin.
## 3786: Thank you
mean(final_sentence_score$sentiment)
## [1] 0.1298845
max(final_sentence_score$sentiment)
## [1] 1.432103
min(final_sentence_score$sentiment)
## [1] -1.067111
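Because the loop scores each thread separately, element_id is always 1 in the combined table, so the thread a sentence came from is lost. sentimentr can score the whole vector in one call and keep that index; a sketch, assuming the same threads_all:
# Vectorized alternative to the loop: element_id indexes the thread and
# sentence_id the sentence within it.
sentences_all <- get_sentences(threads_all$text)
final_sentence_score_v2 <- sentiment(sentences_all)
# attach the raw sentence text for inspection (one row per sentence)
final_sentence_score_v2$sentence <- unlist(sentences_all)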
set.seed(3)
final_sentence_score[sample(1:nrow(final_sentence_score), 10), c('sentiment', 'sentence')]
## sentiment
## 1: 0.17677670
## 2: 0.03535534
## 3: 0.25000000
## 4: 0.26943013
## 5: 0.38000000
## 6: 0.49135381
## 7: -0.40824829
## 8: 0.00000000
## 9: -0.20846377
## 10: 0.13093073
## sentence
## 1: However, I've heard great things about going here.
## 2: I NEED to see the flying Santa sleigh!
## 3: Or if anyone knows of any nice roasts where they do have nice veggie things too.
## 4: An educator will assist you in drawing different sounds through mark-making and help you create your own listening chart to interpret other paintings in the collection.
## 5: Also would love to visit somewhere with much to do beyond drinking or clubbing and where I can seek out a lot of fun experiences .
## 6: Hi guys I hope you're doing well.
## 7: * Anything else you recommend/don't recommend?
## 8: Tickets will be transferred to first person who responds.
## 9: Any tips or advice to make the most of this would be immensely appreciated!!
## 10: Covent Garden and Bloomsbury will follow on Nov 12th, with Marylebone Village and Old Spitalfields Market joining in on Nov 13th.
Credibility analysis: The sentiment scores show mixed accuracy; some align well with the tone of the sentence, while others miss the mark. For example, row 7 ("Anything else you recommend/don't recommend?") receives a clearly negative score even though it reads as a neutral question. Overall, the analysis has limitations: individual words can have a disproportionately strong influence on the score, while the broader context or tone of the sentence is not always captured. This is particularly evident in questions, which often carry little explicit emotional content, making it hard for the model to assign an accurate score. In addition, many scores sit close to zero, suggesting the analysis underestimates the emotional tone of some sentences.
# create new column: date
threads_all %<>%
mutate(date = as.POSIXct(date_utc)) %>%
filter(!is.na(date))
# number of threads by day
threads_all %>%
ggplot(aes(x = date)) +
geom_histogram(color="black", position = 'stack', binwidth = 86400) +
scale_x_datetime(date_labels = "%b %y",
breaks = seq(min(threads_all$date, na.rm = T),
max(threads_all$date, na.rm = T),
by = "1 week")) +
theme_minimal()
# create new columns: day_of_week, is_weekend
threads_all %<>%
mutate(day_of_week = wday(date, label = TRUE)) %>%
mutate(is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"))
# number of threads by day of the week
threads_all %>%
ggplot(aes(x = day_of_week, fill = is_weekend)) +
geom_bar(color = 'black') +
scale_fill_manual(values = c("Weekday" = "gray", "Weekend" = "indianred")) +
theme_minimal()
print(threads_all$timestamp[1])
## [1] 1731993068
print(threads_all$timestamp[1] %>% anytime(tz = anytime:::getTZ()))
## [1] "2024-11-19 00:11:08 EST"
threads_all %<>%
mutate(time = timestamp %>%
anytime(tz = anytime:::getTZ()) %>%
str_split('-| |:') %>%
sapply(function(x) as.numeric(x[4])))
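The same hour-of-day value can be pulled more directly with lubridate (already attached for wday()); a sketch, with Sys.timezone() standing in for the internal anytime:::getTZ() helper:
# Equivalent hour extraction without splitting the formatted string.
threads_all <- threads_all %>%
  mutate(time = hour(anytime(timestamp, tz = Sys.timezone())))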
# number of threads by time of day
threads_all %>%
ggplot(aes(x = time)) +
geom_histogram(bins = 24, color = 'black') +
scale_x_continuous(breaks = seq(0, 24, by=2)) +
theme_minimal()
Merging dataframes:
#Add a 'sentence_id' column to both data frames
words_clean$sentence_id <- 1:nrow(words_clean)
final_sentence_score$sentence_id <- 1:nrow(final_sentence_score)
# Merge the two data frames on 'sentence_id'
merged_data <- merge(words_clean, final_sentence_score, by = "sentence_id")
merged_data
Top 10 Keywords Associated with Sentiments:
# Calculate average sentiment and frequency for each word
words_summary <- merged_data %>%
group_by(word) %>%
summarise(avg_sentiment = mean(sentiment), n = n()) %>%
arrange(desc(n)) %>%
filter(n > 5) # Filter words appearing more than 5 times for relevance
# Select the top 10 words by frequency
top_words_clean <- words_summary %>%
slice_max(order_by = n, n = 10)
library(ggplot2)
ggplot(top_words_clean, aes(x = reorder(word, avg_sentiment), y = avg_sentiment, fill = avg_sentiment)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ggdark::dark_theme_minimal() +
labs(title = "Top 10 Words by Sentiment (Filtered)",
x = "Word",
y = "Average Sentiment")
Comparison between subreddits:
# Keep threads with non-empty, non-link-only text for per-thread sentiment scoring
sentiment_plots <- threads_all %>%
filter(nzchar(text) & !grepl("http[s]?://", text)) %>%
drop_na()
# Score each text with sentimentr; only the first sentence's score is kept
sentiment_plots$sentiment_score <- sapply(sentiment_plots$text, function(text) {
sentiment(text)$sentiment[1]
})
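The sapply above keeps only the first sentence's score for each thread. sentimentr::sentiment_by() averages across all sentences of a text, which may be a fairer per-thread summary; a sketch adding it as a hypothetical extra column (not used downstream):
# Average sentence-level sentiment per thread (alternative summary).
sentiment_plots$sentiment_score_avg <-
  sentiment_by(get_sentences(sentiment_plots$text))$ave_sentiment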
# Selecting relevant columns for sentiment plotting
columns <- c("date_utc", "title", "text", "subreddit", "sentiment_score")
sentiment_plots <- sentiment_plots[, columns] %>%
arrange(desc(sentiment_score))
# Truncate titles and texts to 50 characters for display
sentiment_plots$title <- strtrim(sentiment_plots$title, 50)
sentiment_plots$text <- strtrim(sentiment_plots$text, 50)
# Adding Day of the Week (DoW) for analysis
sentiment_plots$DoW <- wday(sentiment_plots$date_utc, label = TRUE, abbr = FALSE)
# Rearranging columns
sentiment_plots <- sentiment_plots %>% select(date_utc, DoW, title, text, subreddit, sentiment_score)
ggplot(sentiment_plots, aes(x = subreddit, y = sentiment_score)) +
geom_boxplot(fill = "indianred", alpha = 0.7) +
labs(title = "Sentiment Scores by Subreddit", x = "Subreddit", y = "Sentiment Score") +
theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(sentiment_plots, aes(x = sentiment_score)) +
geom_density(fill = "indianred", alpha = 0.9) +
facet_wrap(~ subreddit) +
labs(title = "Sentiment Distribution by Subreddit", x = "Sentiment Score", y = "Density") +
theme_light()
The sentiment analysis of the Reddit threads reveals a few key trends. First, the frequency of threads varies by day and time, with peaks on certain days of the week and at specific hours, indicating patterns in user engagement. The top 10 words associated with sentiment are interesting: they suggest that a positive tone is more common than a negative one. Comparison between subreddits reveals clear differences in sentiment scores, with some subreddits displaying more polarized emotions. These results are visualized with boxplots and density plots showing the distribution and variation of sentiment across subreddits. Overall, this analysis provides insight into the emotional tone of Reddit discussions about traveling to London over the holidays and identifies key themes contributing to user sentiment.