Bluesky Word Frequency Analysis of Posts on a User-Selected Topic

Roci Barnes

2026-07-05

Introduction

For this project, I built a prompt that lets the user enter a search topic of their choice. Once the topic is entered, the script collects matching posts from Bluesky, cleans the text, performs a word frequency analysis, identifies the most common terms, and creates a bar chart and a word cloud.

Load Required Packages

#install.packages(c("httr2", "jsonlite", "dplyr", "stringr", "tidytext", "ggplot2", "wordcloud", "RColorBrewer","rstudioapi","prettydoc"))

library(httr2)
library(jsonlite)
library(dplyr)
library(stringr)
library(tidytext)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
bsky_handle <- Sys.getenv("BSKY_HANDLE")
bsky_app_password <- Sys.getenv("BSKY_APP_PASSWORD")

if (bsky_handle == "" || bsky_app_password == "") {
  stop("Missing Bluesky username or app password. Check your .Renviron file.")
}

Authenticate with Bluesky

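This code logs in to Bluesky through the AT Protocol API. It sends the handle and app password to the com.atproto.server.createSession endpoint and stores the access token (accessJwt) from the response for use in later requests.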

login_response <- request("https://bsky.social/xrpc/com.atproto.server.createSession") |>
  req_method("POST") |>
  req_body_json(list(
    identifier = bsky_handle,
    password = bsky_app_password
  )) |>
  req_perform()

login_data <- resp_body_json(login_response)

access_token <- login_data$accessJwt
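
The login step above assumes the response always contains a token. As a rough sketch of a guard, the hypothetical helper below (get_access_token is not part of httr2 or of the original script) checks the parsed response before the token is used. The fake_session list is a made-up stand-in for the output of resp_body_json(), so the example runs without a network connection.

```r
# Hypothetical helper: pull the access token out of a parsed session response
# and fail early if it is missing or empty.
get_access_token <- function(session) {
  token <- session$accessJwt
  if (is.null(token) || !is.character(token) ||
      length(token) != 1 || !nzchar(token)) {
    stop("Login response did not contain an access token.")
  }
  token
}

# Made-up stand-in for resp_body_json(login_response); not real credentials.
fake_session <- list(accessJwt = "abc.def.ghi", handle = "example.bsky.social")

token <- get_access_token(fake_session)

# Same header format the search request uses later in the script.
auth_header <- paste("Bearer", token)
auth_header
```

In the real script, the same check could be applied to login_data before assigning access_token.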

Collect Bluesky Posts

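query <- rstudioapi::showPrompt(
  title = "Bluesky Search Topic",
  message = "Enter your Bluesky search topic:"
  # Optionally add default = "Heather Cox Richardson" to pre-fill the prompt
)

# Stop early if the prompt was cancelled (NULL) or left blank
if (is.null(query) || !nzchar(trimws(query))) {
  stop("No search topic entered.")
}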

max_posts <- 100

query
## [1] "RMarkdown"
posts_list <- list()
cursor <- NULL

while (length(posts_list) < max_posts) {
  
  req <- request("https://bsky.social/xrpc/app.bsky.feed.searchPosts") |>
    req_headers(Authorization = paste("Bearer", access_token)) |>
    req_url_query(
      q = query,
      limit = 25,
      sort = "latest"
    )
  
  if (!is.null(cursor)) {
    req <- req |>
      req_url_query(cursor = cursor)
  }
  
  response <- req |>
    req_perform()
  
  data <- resp_body_json(response, simplifyVector = FALSE)
  
  if (length(data$posts) == 0) {
    break
  }
  
  for (post in data$posts) {
    
    post_text <- post$record$text
    author <- post$author$handle
    created_at <- post$record$createdAt
    
    like_count <- ifelse(is.null(post$likeCount), 0, post$likeCount)
    repost_count <- ifelse(is.null(post$repostCount), 0, post$repostCount)
    reply_count <- ifelse(is.null(post$replyCount), 0, post$replyCount)
    
    posts_list[[length(posts_list) + 1]] <- data.frame(
      author = author,
      created_at = created_at,
      text = post_text,
      likes = like_count,
      reposts = repost_count,
      replies = reply_count,
      stringsAsFactors = FALSE
    )
    
    if (length(posts_list) >= max_posts) {
      break
    }
  }
  
  if (is.null(data$cursor)) {
    break
  }
  
  cursor <- data$cursor
  
  Sys.sleep(1)
}

bluesky_posts <- bind_rows(posts_list)
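
The collection loop above follows a cursor-based pagination pattern: request a page, keep its posts, and pass the returned cursor to the next request until the cap is reached or no cursor comes back. The sketch below isolates that logic so it can be run offline. fake_search and collect_posts are hypothetical names, and fake_search only imitates the shape of a searchPosts response (a posts list plus an optional cursor); it does not call Bluesky.

```r
# Hypothetical stand-in for one searchPosts call: returns one page of posts
# and, while more pages remain, a cursor pointing at the next page.
fake_search <- function(cursor = NULL, limit = 25, total = 60) {
  start <- if (is.null(cursor)) 1 else as.integer(cursor)
  if (start > total) return(list(posts = list(), cursor = NULL))
  end <- min(start + limit - 1, total)
  posts <- lapply(start:end, function(i) list(text = paste("post", i)))
  next_cursor <- if (end < total) as.character(end + 1) else NULL
  list(posts = posts, cursor = next_cursor)
}

# Keep requesting pages until the cap is hit, a page is empty,
# or the response carries no cursor.
collect_posts <- function(fetch_page, max_posts = 100) {
  collected <- list()
  cursor <- NULL
  repeat {
    page <- fetch_page(cursor)
    if (length(page$posts) == 0) break
    room <- max_posts - length(collected)
    collected <- c(collected, head(page$posts, room))
    if (length(collected) >= max_posts || is.null(page$cursor)) break
    cursor <- page$cursor
  }
  collected
}

all_posts <- collect_posts(fake_search, max_posts = 100)    # source runs out at 60
capped_posts <- collect_posts(fake_search, max_posts = 40)  # cap reached mid-page

length(all_posts)     # 60
length(capped_posts)  # 40
```

The two calls cover both exit conditions: the first stops because the fake source has no more pages, the second because the cap is reached partway through a page.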

View the Collected Data

head(bluesky_posts)
##                    author               created_at
## 1    bioblogo.bsky.social 2026-07-04T19:00:19.905Z
## 2 enactedmind.bsky.social 2026-07-04T10:04:53.867Z
## 3            cbgoodman.co 2026-07-02T18:19:39.942Z
## 4     instats.bsky.social 2026-07-02T09:51:25.733Z
## 5     instats.bsky.social 2026-07-02T09:42:03.654Z
## 6    rtbecard.bsky.social 2026-06-27T16:24:32.057Z
##                                                                                                                                                                                                                                                                                                        text
## 1                                                                                                                                     10 reglas para un análisis inicial de datos efectivo con R: projectTemplate, skimr, naniar, ggplot2, rmarkdown... (PLOS Comp Bio)\n#biotapas\nwww.ncbi.nlm.nih.gov...
## 2                                                                                              On that note - rmarkdown provides citations using a bibtex database and outputs latex->pdf and word, so you can send it to colleagues that don’t use tex. all that on top of being able to do stats in text.
## 3                                                                                                Hmm…`targets` appears to have lost some functionality with Quarto. It can't find standard packages with `tar_quarto()` like knitr and rmarkdown. Is this a Quarto problem or a Targets problem, @ropensci?
## 4 New Instats livestreaming seminar: Electoral Geography & Spatial Causal Inference\n\n#ComputationalSocialScience #DataScience #Geography #PoliticalScience #QuantitativeMethods #ResearchMethods #SocialSciences #SpatialData #Statistics  #ggplot2 #GIS #PositCloud #R #Rmarkdown #RStudio #tidyverse #R
## 5 New Instats livestreaming seminar: Geocoding & Spatial Autocorrelation\n\n#ComputationalSocialScience #DataScience #Geography #PoliticalScience #QuantitativeMethods #ResearchMethods #SocialSciences #SpatialData #Statistics  #ggplot2 #GIS #PositCloud #R #Rmarkdown #RStudio #tidyverse #Research #Re
## 6       I finally found a nice solution for scaling figures in rmarkdown:  Setup a chunk options hook which dynamically sets the out.width, fig.width and fig.height options.\n\nfig.width=7, scale=0.8 now returns a figure 7 in wide, but the contents will be scaled down by a factor of 0.8.\n\n#Rstats
##   likes reposts replies
## 1     0       0       0
## 2     5       0       1
## 3     0       0       1
## 4     0       0       0
## 5     0       0       0
## 6     4       2       1

Number of Posts Collected

nrow(bluesky_posts)
## [1] 100

Save Raw Data

safe_query <- query |>
  stringr::str_to_lower() |>
  stringr::str_replace_all("[^a-z0-9]+", "_") |>
  stringr::str_remove_all("^_|_$")

file_name <- paste0("bluesky_", safe_query, "_raw_posts.csv")

write.csv(bluesky_posts, file_name, row.names = FALSE)

paste("Saved the file as:", file_name)
## [1] "Saved the file as: bluesky_rmarkdown_raw_posts.csv"
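
To see what the file name step does to a topic that contains spaces or symbols, here is a small sketch using base R equivalents of the stringr calls above. make_safe_name is a hypothetical helper name, and the sample topics are only illustrations.

```r
# Hypothetical helper mirroring the safe_query pipeline with base R:
# lowercase, collapse every run of non-alphanumeric characters to "_",
# then trim a leading or trailing underscore.
make_safe_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x, perl = TRUE)
  gsub("^_|_$", "", x, perl = TRUE)
}

name_a <- make_safe_name("Heather Cox Richardson")
name_b <- make_safe_name("#RStats & Quarto!")

name_a  # "heather_cox_richardson"
name_b  # "rstats_quarto"

paste0("bluesky_", name_a, "_raw_posts.csv")
```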

Clean the Text Data

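This section cleans the Bluesky post text by converting it to lowercase and removing URLs, mentions, the # symbol from hashtags, punctuation, numbers, and extra spaces. Because the final filter keeps only the letters a-z and whitespace, apostrophes and accented characters are removed as well. Common stopwords are removed in the tokenization step that follows.

The words entered by the user are also removed in that step, because they are part of the search term and would likely dominate the results.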

clean_posts <- bluesky_posts |>
  mutate(
    clean_text = text |>
      str_to_lower() |>
      str_remove_all("http\\S+|www\\S+") |>
      str_remove_all("@\\w+") |>
      str_remove_all("#") |>
      str_remove_all("[^a-z\\s]") |>
      str_squish()
  )
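
To make the effect of each cleaning step concrete, the sketch below applies the same sequence of patterns to one made-up post. It uses base R gsub() in place of the stringr functions so it runs without extra packages; results should match for plain ASCII text like this sample, but may differ from stringr on non-ASCII input. clean_text and sample_post are hypothetical names.

```r
# Base R approximation of the cleaning pipeline above (an illustration,
# not a drop-in replacement for the stringr version).
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+|www\\S+", "", x, perl = TRUE)  # URLs
  x <- gsub("@\\w+", "", x, perl = TRUE)              # mentions, up to the first non-word character
  x <- gsub("#", "", x, fixed = TRUE)                 # drop "#", keep the hashtag text
  x <- gsub("[^a-z\\s]", "", x, perl = TRUE)          # everything except a-z and whitespace
  trimws(gsub("\\s+", " ", x, perl = TRUE))           # collapse and trim spaces
}

# Made-up post for illustration.
sample_post <- "I've moved 3 reports to #Quarto, thanks @example.bsky.social! https://example.com/x"

cleaned <- clean_text(sample_post)
cleaned
# "ive moved reports to quarto thanks bskysocial"
```

Two side effects are visible here. "I've" becomes "ive" because the apostrophe is stripped, so it no longer matches an entry such as "i've" in a stopword list. The mention pattern stops at the first dot, so the rest of a full handle survives as "bskysocial".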

Tokenize the Text

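Tokenizing means splitting the text into individual words. After tokenizing, the code removes the standard English stopwords in tidytext's stop_words, the words from the search topic, and any word shorter than three characters.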

query_stopwords <- tibble(
  word = query |>
    stringr::str_to_lower() |>
    stringr::str_replace_all("[^a-z\\s]", " ") |>
    stringr::str_squish() |>
    stringr::str_split("\\s+") |>
    unlist()
)

custom_stopwords <- query_stopwords

words <- clean_posts |>
  select(clean_text) |>
  unnest_tokens(word, clean_text) |>
  anti_join(stop_words, by = "word") |>
  anti_join(custom_stopwords, by = "word") |>
  filter(str_length(word) > 2)
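
The filtering logic can be illustrated on a tiny example with base R. This is only a sketch: strsplit() on whitespace stands in for unnest_tokens(), and english_stopwords is a short made-up list rather than the full stop_words table from tidytext.

```r
# Two already-cleaned posts (made up for illustration).
cleaned <- c(
  "ive moved reports to quarto",
  "quarto and rmarkdown run the same code"
)

# Split on whitespace to get one token per word.
tokens <- unlist(strsplit(cleaned, "\\s+"))

# Illustrative stand-ins for stop_words and the query-derived stopwords.
english_stopwords <- c("to", "and", "the", "same")
query_words <- "rmarkdown"

# Keep tokens that are not stopwords and are longer than two characters.
keep <- !(tokens %in% c(english_stopwords, query_words)) & nchar(tokens) > 2
kept_tokens <- tokens[keep]

kept_tokens
# "ive" "moved" "reports" "quarto" "quarto" "run" "code"
```

Note that "ive" passes the filter: it is three characters long and does not match any stopword once the apostrophe is gone.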

Word Frequency Analysis

This code counts how often each word appears.

word_freq <- words |>
  count(word, sort = TRUE)

head(word_freq, 20)
##         word  n
## 1     quarto 34
## 2       code 18
## 3     rstats 17
## 4       data 12
## 5    jupyter 11
## 6      latex  9
## 7  notebooks  9
## 8     ggplot  8
## 9    package  8
## 10   rstudio  8
## 11      docs  7
## 12       run  7
## 13  analysis  6
## 14       ive  6
## 15     knitr  6
## 16  markdown  6
## 17      para  6
## 18       pdf  6
## 19     check  5
## 20       con  5
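
For readers without dplyr, the same kind of frequency table can be sketched in base R with table(). The token vector and the name demo_freq are made up for illustration; ties are broken alphabetically here, which is an assumption of this sketch rather than a statement about count().

```r
# Made-up tokens for illustration.
tokens <- c("quarto", "code", "quarto", "rstats", "code", "quarto")

# Count occurrences of each distinct word.
counts <- table(tokens)

demo_freq <- data.frame(
  word = names(counts),
  n = as.integer(counts),
  stringsAsFactors = FALSE
)

# Sort by descending count, then alphabetically within ties.
demo_freq <- demo_freq[order(-demo_freq$n, demo_freq$word), ]
rownames(demo_freq) <- NULL

demo_freq
```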

Save Word Frequency Table

file_name <- paste0("bluesky_", safe_query, "_word_frequencies.csv")

write.csv(word_freq, file_name, row.names = FALSE)

paste("Saved the file as:", file_name)
## [1] "Saved the file as: bluesky_rmarkdown_word_frequencies.csv"

Top 20 Most Common Words

top_20_words <- word_freq |>
  slice_max(n, n = 20, with_ties = FALSE)

top_20_words
##         word  n
## 1     quarto 34
## 2       code 18
## 3     rstats 17
## 4       data 12
## 5    jupyter 11
## 6      latex  9
## 7  notebooks  9
## 8     ggplot  8
## 9    package  8
## 10   rstudio  8
## 11      docs  7
## 12       run  7
## 13  analysis  6
## 14       ive  6
## 15     knitr  6
## 16  markdown  6
## 17      para  6
## 18       pdf  6
## 19     check  5
## 20       con  5

Bar Chart of Most Common Words

ggplot(top_20_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(
    title = paste("Top 20 Most Common Words in Bluesky Posts about", query),
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal()

Word Cloud

set.seed(123)

wordcloud(
  words = word_freq$word,
  freq = word_freq$n,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)

Findings

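After collecting and cleaning the Bluesky posts, I performed a word frequency analysis to identify the most common terms in posts matching the search topic. For the example topic "RMarkdown", the most frequent word was "quarto" (34 occurrences), followed by "code" (18) and "rstats" (17). Related tools such as "jupyter", "latex", "rstudio", and "knitr" also appear in the top 20. A few entries reflect the cleaning steps rather than the topic: "ive" is most likely "I've" with its apostrophe removed, and "para" and "con" are most likely Spanish stopwords from Spanish-language posts, which the English stopword list does not filter out.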

top_20_words
##         word  n
## 1     quarto 34
## 2       code 18
## 3     rstats 17
## 4       data 12
## 5    jupyter 11
## 6      latex  9
## 7  notebooks  9
## 8     ggplot  8
## 9    package  8
## 10   rstudio  8
## 11      docs  7
## 12       run  7
## 13  analysis  6
## 14       ive  6
## 15     knitr  6
## 16  markdown  6
## 17      para  6
## 18       pdf  6
## 19     check  5
## 20       con  5

Limitations

This analysis has several limitations. First, the dataset only includes posts available through Bluesky search at the time of collection, and the code caps it at the 100 most recent matches. Second, the results depend on the search phrase the user provides, so variations such as abbreviations of the term can produce different results. Third, word frequency analysis only counts how often words appear; it does not capture context, sarcasm, tone, or whether the posts are supportive or critical.

Conclusion

This project used the Bluesky API to collect posts related to a user-supplied search topic. After cleaning the text, I used word frequency analysis to identify the most common terms in the dataset. The results provide a general overview of the themes and language commonly associated with discussions of that topic on Bluesky.