Introduction
For this project, I built a prompt that lets the user choose a search topic. Once a topic is entered, the project collects Bluesky posts about it, cleans the text, runs a word frequency analysis to identify the most common terms, and draws a word cloud.
Load Required Packages
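The package-loading chunk is not shown in this rendered output. The sketch below is one way it might look. The package list is an assumption, inferred from the functions called later in this report, and the install check is an added convenience rather than part of the original.

```r
# Packages inferred from the function calls in this report (an assumption;
# the original chunk is not shown).
required_packages <- c(
  "httr2",        # request(), req_headers(), req_perform(), resp_body_json()
  "dplyr",        # bind_rows(), select(), anti_join(), filter()
  "tibble",       # tibble()
  "stringr",      # str_to_lower(), str_replace_all(), str_length()
  "tidytext",     # unnest_tokens(), stop_words
  "ggplot2",      # ggplot(), geom_col()
  "wordcloud",    # wordcloud()
  "RColorBrewer"  # brewer.pal()
)

# Report anything that is not installed instead of failing on the first library() call.
is_installed <- vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]
if (length(missing_packages) > 0) {
  message("Install these first: ", paste(missing_packages, collapse = ", "))
}

# Attach the packages that are available.
for (pkg in required_packages[is_installed]) {
  library(pkg, character.only = TRUE)
}
```

rstudioapi is not attached here because the report calls it with the `rstudioapi::` prefix.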
Authenticate with Bluesky
This code logs in to Bluesky using the AT Protocol API.
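The login chunk itself is hidden in this output. A minimal sketch of how the session could be created with httr2 follows, assuming the `com.atproto.server.createSession` endpoint on `bsky.social` and an app password. The helper names and the environment variable names are hypothetical; only the `access_token` object is taken from the collection code later in this report.

```r
library(httr2)

# Build (but do not send) a createSession request.
# `handle` and `app_password` are placeholders supplied by the caller.
build_session_request <- function(handle, app_password) {
  request("https://bsky.social/xrpc/com.atproto.server.createSession") |>
    req_body_json(list(identifier = handle, password = app_password))
}

# Send the request and return the access token (`accessJwt`) from the JSON response.
get_access_token <- function(handle, app_password) {
  session <- build_session_request(handle, app_password) |>
    req_perform() |>
    resp_body_json()
  session$accessJwt
}

# Usage (hypothetical environment variable names; makes a network call):
# access_token <- get_access_token(
#   Sys.getenv("BLUESKY_HANDLE"),
#   Sys.getenv("BLUESKY_APP_PASSWORD")
# )
```

Reading the credentials from environment variables keeps them out of the knitted report.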
Collect Bluesky Posts
# Ask for the search topic. showPrompt() needs an interactive RStudio session.
query <- rstudioapi::showPrompt(
  title = "Bluesky Search Topic",
  message = "Enter your Bluesky search topic:"
  # default = "Heather Cox Richardson"
)
max_posts <- 100
query
## [1] "RMarkdown"
posts_list <- list()
cursor <- NULL
# Page through the search results until max_posts posts have been collected.
while (length(posts_list) < max_posts) {
  req <- request("https://bsky.social/xrpc/app.bsky.feed.searchPosts") |>
    req_headers(Authorization = paste("Bearer", access_token)) |>
    req_url_query(
      q = query,
      limit = 25,
      sort = "latest"
    )
  # After the first page, pass the cursor returned by the previous response.
  if (!is.null(cursor)) {
    req <- req |>
      req_url_query(cursor = cursor)
  }
  response <- req |>
    req_perform()
  data <- resp_body_json(response, simplifyVector = FALSE)
  # Stop when the search returns no more posts.
  if (length(data$posts) == 0) {
    break
  }
  for (post in data$posts) {
    post_text <- post$record$text
    author <- post$author$handle
    created_at <- post$record$createdAt
    # Engagement counts may be absent; treat missing values as 0.
    like_count <- ifelse(is.null(post$likeCount), 0, post$likeCount)
    repost_count <- ifelse(is.null(post$repostCount), 0, post$repostCount)
    reply_count <- ifelse(is.null(post$replyCount), 0, post$replyCount)
    posts_list[[length(posts_list) + 1]] <- data.frame(
      author = author,
      created_at = created_at,
      text = post_text,
      likes = like_count,
      reposts = repost_count,
      replies = reply_count,
      stringsAsFactors = FALSE
    )
    if (length(posts_list) >= max_posts) {
      break
    }
  }
  # No cursor means this was the last page.
  if (is.null(data$cursor)) {
    break
  }
  cursor <- data$cursor
  Sys.sleep(1)  # pause between requests
}
# Combine the one-row data frames into a single data frame.
bluesky_posts <- bind_rows(posts_list)
View the Collected Data
## author created_at
## 1 bioblogo.bsky.social 2026-07-04T19:00:19.905Z
## 2 enactedmind.bsky.social 2026-07-04T10:04:53.867Z
## 3 cbgoodman.co 2026-07-02T18:19:39.942Z
## 4 instats.bsky.social 2026-07-02T09:51:25.733Z
## 5 instats.bsky.social 2026-07-02T09:42:03.654Z
## 6 rtbecard.bsky.social 2026-06-27T16:24:32.057Z
## text
## 1 10 reglas para un análisis inicial de datos efectivo con R: projectTemplate, skimr, naniar, ggplot2, rmarkdown... (PLOS Comp Bio)\n#biotapas\nwww.ncbi.nlm.nih.gov...
## 2 On that note - rmarkdown provides citations using a bibtex database and outputs latex->pdf and word, so you can send it to colleagues that don’t use tex. all that on top of being able to do stats in text.
## 3 Hmm…`targets` appears to have lost some functionality with Quarto. It can't find standard packages with `tar_quarto()` like knitr and rmarkdown. Is this a Quarto problem or a Targets problem, @ropensci?
## 4 New Instats livestreaming seminar: Electoral Geography & Spatial Causal Inference\n\n#ComputationalSocialScience #DataScience #Geography #PoliticalScience #QuantitativeMethods #ResearchMethods #SocialSciences #SpatialData #Statistics #ggplot2 #GIS #PositCloud #R #Rmarkdown #RStudio #tidyverse #R
## 5 New Instats livestreaming seminar: Geocoding & Spatial Autocorrelation\n\n#ComputationalSocialScience #DataScience #Geography #PoliticalScience #QuantitativeMethods #ResearchMethods #SocialSciences #SpatialData #Statistics #ggplot2 #GIS #PositCloud #R #Rmarkdown #RStudio #tidyverse #Research #Re
## 6 I finally found a nice solution for scaling figures in rmarkdown: Setup a chunk options hook which dynamically sets the out.width, fig.width and fig.height options.\n\nfig.width=7, scale=0.8 now returns a figure 7 in wide, but the contents will be scaled down by a factor of 0.8.\n\n#Rstats
## likes reposts replies
## 1 0 0 0
## 2 5 0 1
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 4 2 1
Save Raw Data
safe_query <- query |>
  stringr::str_to_lower() |>
  stringr::str_replace_all("[^a-z0-9]+", "_") |>
  stringr::str_remove_all("^_|_$")
file_name <- paste0("bluesky_", safe_query, "_raw_posts.csv")
write.csv(bluesky_posts, file_name, row.names = FALSE)
paste("Saved the file as:", file_name)
## [1] "Saved the file as: bluesky_rmarkdown_raw_posts.csv"
Clean the Text Data
This section cleans the Bluesky post text by removing URLs, mentions, punctuation, numbers, extra spaces, and common stopwords.
I also removed the words entered by the user because those words are part of the search term and would likely dominate the results.
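The cleaning chunk is not shown in this output. The sketch below is one way the steps listed above could be written with stringr. The function name `clean_post_text()` and the exact patterns are assumptions; the `clean_posts` data frame and its `clean_text` column are the names the tokenizing code expects. Stopwords are not handled here because the tokenizing step removes them with `anti_join()`.

```r
library(stringr)

# Clean a character vector of post text. The order matters: URLs and
# mentions are removed before punctuation, while they are still recognisable.
clean_post_text <- function(text) {
  text |>
    str_to_lower() |>
    str_replace_all("https?://\\S+|www\\.\\S+", " ") |>  # URLs
    str_replace_all("@[a-z0-9._-]+", " ") |>             # @mentions
    str_remove_all("['\u2019]") |>                       # apostrophes: "i've" becomes "ive"
    str_replace_all("[[:punct:]]+", " ") |>              # remaining punctuation
    str_remove_all("[[:digit:]]+") |>                    # numbers: "ggplot2" becomes "ggplot"
    str_squish()                                         # extra spaces
}

# Toy data standing in for bluesky_posts
example_posts <- data.frame(
  text = "Check https://example.com @user.bsky.social I've used ggplot2, 10 times!",
  stringsAsFactors = FALSE
)
clean_posts <- example_posts
clean_posts$clean_text <- clean_post_text(example_posts$text)
clean_posts$clean_text
```

Dropping apostrophes and digits without inserting a space is a design choice that would explain why "ive" and "ggplot" appear in the frequency table.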
Tokenize the Text
Tokenizing means splitting the text into individual words.
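As a toy illustration of the idea, the sketch below splits two made-up strings in base R. The report itself uses `tidytext::unnest_tokens()`, which also lowercases and strips punctuation; the example text and the `words_example` name are hypothetical.

```r
# Toy illustration in base R; the report itself uses tidytext::unnest_tokens().
clean_text <- c("quarto renders code", "code runs in notebooks")

# Split each post on whitespace, then flatten to one word per element.
tokens <- unlist(strsplit(clean_text, "\\s+"))
tokens

# One row per word, the one-token-per-row layout that unnest_tokens() also produces.
words_example <- data.frame(word = tokens, stringsAsFactors = FALSE)
nrow(words_example)
```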
query_stopwords <- tibble(
  word = query |>
    stringr::str_to_lower() |>
    stringr::str_replace_all("[^a-z\\s]", " ") |>
    stringr::str_squish() |>
    stringr::str_split("\\s+") |>
    unlist()
)
custom_stopwords <- query_stopwords
words <- clean_posts |>
  select(clean_text) |>
  unnest_tokens(word, clean_text) |>
  anti_join(stop_words, by = "word") |>
  anti_join(custom_stopwords, by = "word") |>
  filter(str_length(word) > 2)
Word Frequency Analysis
This code counts how often each word appears.
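The counting chunk is hidden in this output. A minimal sketch with `dplyr::count()` on toy data follows. The assumption is that the report applies the same call to `words` to create the `word_freq` object saved in the next section; the `_example` names are hypothetical.

```r
library(dplyr)

# Toy token table standing in for `words`
words_example <- data.frame(
  word = c("quarto", "code", "quarto", "data", "code", "quarto"),
  stringsAsFactors = FALSE
)

# Count occurrences of each word, most frequent first.
word_freq_example <- words_example |>
  count(word, sort = TRUE)

word_freq_example
```

The result has one row per distinct word and a column `n` holding its count, the same two columns shown in the output table.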
## word n
## 1 quarto 34
## 2 code 18
## 3 rstats 17
## 4 data 12
## 5 jupyter 11
## 6 latex 9
## 7 notebooks 9
## 8 ggplot 8
## 9 package 8
## 10 rstudio 8
## 11 docs 7
## 12 run 7
## 13 analysis 6
## 14 ive 6
## 15 knitr 6
## 16 markdown 6
## 17 para 6
## 18 pdf 6
## 19 check 5
## 20 con 5
Save Word Frequency Table
file_name <- paste0("bluesky_", safe_query, "_word_frequencies.csv")
write.csv(word_freq, file_name, row.names = FALSE)
paste("Saved the file as:", file_name)
## [1] "Saved the file as: bluesky_rmarkdown_word_frequencies.csv"
Top 20 Most Common Words
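The chunk that creates `top_20_words` for the bar chart is not shown. Assuming the frequency table is already sorted from most to least frequent, taking its first 20 rows is enough. The sketch below does that on toy data with hypothetical `_example` names.

```r
# Toy frequency table, already sorted by n in descending order
word_freq_example <- data.frame(
  word = paste0("word", 1:25),
  n = 25:1,
  stringsAsFactors = FALSE
)

# Keep the first 20 rows, i.e. the 20 most frequent words.
top_20_example <- head(word_freq_example, 20)
nrow(top_20_example)
```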
## word n
## 1 quarto 34
## 2 code 18
## 3 rstats 17
## 4 data 12
## 5 jupyter 11
## 6 latex 9
## 7 notebooks 9
## 8 ggplot 8
## 9 package 8
## 10 rstudio 8
## 11 docs 7
## 12 run 7
## 13 analysis 6
## 14 ive 6
## 15 knitr 6
## 16 markdown 6
## 17 para 6
## 18 pdf 6
## 19 check 5
## 20 con 5
Bar Chart of Most Common Words
ggplot(top_20_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(
    title = paste("Top 20 Most Common Words in Bluesky Posts about", query),
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal()
Word Cloud
set.seed(123)
wordcloud(
  words = word_freq$word,
  freq = word_freq$n,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)
Findings
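After collecting and cleaning the Bluesky posts, I performed a word frequency analysis to identify the most common terms related to the search topic entered. For the topic "RMarkdown", the most frequent term was "quarto" (34 occurrences), followed by "code" (18) and "rstats" (17). The list also includes "para" and "con", Spanish function words that the English stop-word list does not remove, and "ive", which is "I've" with its apostrophe stripped.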
## word n
## 1 quarto 34
## 2 code 18
## 3 rstats 17
## 4 data 12
## 5 jupyter 11
## 6 latex 9
## 7 notebooks 9
## 8 ggplot 8
## 9 package 8
## 10 rstudio 8
## 11 docs 7
## 12 run 7
## 13 analysis 6
## 14 ive 6
## 15 knitr 6
## 16 markdown 6
## 17 para 6
## 18 pdf 6
## 19 check 5
## 20 con 5
Limitations
This analysis has several limitations. First, the dataset includes only the posts that Bluesky search returned at the time of collection, capped at 100 of the most recent. Second, the results depend on the search phrase the user provides, so variants of the same topic, such as an abbreviation, can produce different results. Third, word frequency analysis only counts how often words appear; it does not capture context, sarcasm, tone, or whether a post is supportive or critical.
Conclusion
This project used the Bluesky API to collect posts about a user-supplied search topic. After cleaning the text, I used word frequency analysis to identify the most common terms in the dataset. The results give a general overview of the themes and language associated with that topic on Bluesky.