Bluesky Word Frequency Analysis of Posts about Heather Cox Richardson

Introduction

For this project, I analyzed Bluesky posts that mention Heather Cox Richardson, a historian and political commentator whose work is often discussed in relation to American politics, democracy, history, and current events.

The goal of this project was to collect posts from Bluesky, clean the text data, perform a word frequency analysis, identify the most common terms, and create a word cloud visualization.

Load Required Packages

# Run once to install the packages; skip this line on later runs.
install.packages(c("httr2", "jsonlite", "dplyr", "stringr", "tidytext", "ggplot2", "wordcloud", "RColorBrewer", "prettydoc"))

library(httr2)
library(jsonlite)
library(dplyr)
library(stringr)
library(tidytext)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

Bluesky Login Information

bsky_handle <- Sys.getenv("BSKY_HANDLE")
bsky_app_password <- Sys.getenv("BSKY_APP_PASSWORD")

if (bsky_handle == "" || bsky_app_password == "") {
  stop("Missing Bluesky username or app password. Check your .Renviron file.")
}

Authenticate with Bluesky

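This code logs in to Bluesky through the AT Protocol API by calling com.atproto.server.createSession with the handle and app password, then stores the returned access token (accessJwt) for later requests.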

login_response <- request("https://bsky.social/xrpc/com.atproto.server.createSession") |>
  req_method("POST") |>
  req_body_json(list(
    identifier = bsky_handle,
    password = bsky_app_password
  )) |>
  req_perform()

login_data <- resp_body_json(login_response)

access_token <- login_data$accessJwt

Collect Bluesky Posts

The search topic for this project is:

query <- "Heather Cox Richardson"
max_posts <- 100

This code searches Bluesky for posts matching the query. It requests 25 posts per page in latest-first order and follows the returned cursor until 100 posts are collected or no more results are available.

posts_list <- list()
cursor <- NULL

while (length(posts_list) < max_posts) {
  
  req <- request("https://bsky.social/xrpc/app.bsky.feed.searchPosts") |>
    req_headers(Authorization = paste("Bearer", access_token)) |>
    req_url_query(
      q = query,
      limit = 25,
      sort = "latest"
    )
  
  if (!is.null(cursor)) {
    req <- req |>
      req_url_query(cursor = cursor)
  }
  
  response <- req |>
    req_perform()
  
  data <- resp_body_json(response, simplifyVector = FALSE)
  
  if (length(data$posts) == 0) {
    break
  }
  
  for (post in data$posts) {
    
    post_text <- post$record$text
    author <- post$author$handle
    created_at <- post$record$createdAt
    
    like_count <- ifelse(is.null(post$likeCount), 0, post$likeCount)
    repost_count <- ifelse(is.null(post$repostCount), 0, post$repostCount)
    reply_count <- ifelse(is.null(post$replyCount), 0, post$replyCount)
    
    posts_list[[length(posts_list) + 1]] <- data.frame(
      author = author,
      created_at = created_at,
      text = post_text,
      likes = like_count,
      reposts = repost_count,
      replies = reply_count,
      stringsAsFactors = FALSE
    )
    
    if (length(posts_list) >= max_posts) {
      break
    }
  }
  
  if (is.null(data$cursor)) {
    break
  }
  
  cursor <- data$cursor
  
  Sys.sleep(1)
}

bluesky_posts <- bind_rows(posts_list)
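
The field extraction inside the loop can also be written as a small function, which makes the handling of missing fields easier to check without calling the API. This is only a sketch: null_default() and post_to_row() are hypothetical helper names that do not appear in the original code, and mock_post is made-up data shaped like the fields the loop reads.

```r
# Hypothetical helper: return `default` when a field is missing (NULL)
null_default <- function(x, default) if (is.null(x)) default else x

# Hypothetical helper: turn one post (a nested list) into a one-row data frame
post_to_row <- function(post) {
  data.frame(
    author = null_default(post$author$handle, NA_character_),
    created_at = null_default(post$record$createdAt, NA_character_),
    text = null_default(post$record$text, NA_character_),
    likes = null_default(post$likeCount, 0),
    reposts = null_default(post$repostCount, 0),
    replies = null_default(post$replyCount, 0),
    stringsAsFactors = FALSE
  )
}

# Made-up post with no engagement counts, shaped like the fields used above
mock_post <- list(
  author = list(handle = "example.bsky.social"),
  record = list(text = "hello", createdAt = "2026-07-05T01:46:04.966Z")
)

mock_row <- post_to_row(mock_post)
mock_row
```

Missing counts fall back to 0 and missing text fields fall back to NA, so every post yields exactly one row with the same six columns.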

View the Collected Data

head(bluesky_posts)
##                       author               created_at
## 1   stefanejones.bsky.social 2026-07-05T01:46:04.966Z
## 2      mikeymomo.bsky.social 2026-07-05T00:43:10.896Z
## 3 bsargentnoble1.bsky.social 2026-07-05T00:31:11.664Z
## 4   onedandelion.bsky.social 2026-07-04T23:17:02.880Z
## 5    arcticfox87.bsky.social 2026-07-04T23:05:06.032Z
## 6    darleneryan.bsky.social 2026-07-04T21:53:45.830Z
##                                                                                                                                                                                                                                    text
## 1                                          Very Long Doomscrolling Break.\n\n"The Lincoln Portrait," narrated by Heather Cox Richardson.\n\n*"This is what he said. This is what Abe Lincoln said."*\n\nwww.youtube.com/watch?v=RL2z...
## 2                                                                                                                                                 Heather Cox Richardson\nJul 04, 2026\nJuly 3, 2026\nopen.substack.com/pub/heatherc...
## 3                                                                                                                                                                              www.facebook.com/share/v/1CmC...\nHeather Cox Richardson
## 4                                                                                                        7/03/26\n"I just duped you into something that you thought it was real but it really wasn't..."🤢\nyoutube.com/shorts/ruwmZ...
## 5 @Wajahat\n I agree with you, but have a listen, at least to the first 20 min of today's conversation between Heather Cox Richardson and Sarah Longwell, for a little bit of hope, which we desperately need\nyoutu.be/xBE68EKHq2c?...
## 6                                                                                                                                                      July 3, 2026\nHEATHER COX RICHARDSON\nJUL 4\n\nopen.substack.com/pub/heatherc...
##   likes reposts replies
## 1     2       2       1
## 2     1       0       0
## 3     0       0       0
## 4     1       3       0
## 5     1       0       0
## 6     0       0       0

Number of Posts Collected

nrow(bluesky_posts)
## [1] 100

Save Raw Data

write.csv(bluesky_posts, "bluesky_heather_cox_richardson_raw_posts.csv", row.names = FALSE)

Clean the Text Data

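This section cleans the Bluesky post text by lowercasing it and removing URLs that start with "http" or "www", mentions, hashtag symbols, punctuation, numbers, and extra spaces. Links posted without a scheme (for example open.substack.com/pub/heatherc...) and the domain part of full handles (such as .bsky.social) are not matched by these patterns, so fragments such as “opensubstackcompubheatherc” and “bskysocial” survive into the word counts.

Common stopwords and the words “Heather,” “Cox,” “Richardson,” and “HCR” are removed in the tokenizing step, because the search terms would otherwise dominate the results.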

clean_posts <- bluesky_posts |>
  mutate(
    clean_text = text |>
      str_to_lower() |>
      str_remove_all("http\\S+|www\\S+") |>
      str_remove_all("@\\w+") |>
      str_remove_all("#") |>
      str_remove_all("[^a-z\\s]") |>
      str_squish()
  )
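
A stricter cleaning step could remove full handles and scheme-less links before punctuation is stripped. The following is a minimal sketch under assumptions, not the method used for the results in this report: clean_post_text() is a hypothetical helper, and its link pattern treats any token containing "label.tld" as a link, which can also remove text where a period is not followed by a space.

```r
library(stringr)

# Hypothetical helper (not used for the results in this report)
clean_post_text <- function(text) {
  text |>
    str_to_lower() |>
    # Full handles such as @name.bsky.social, including dots and hyphens
    str_remove_all("@[a-z0-9._-]+") |>
    # Links with or without a scheme: any token containing "label.tld"
    str_remove_all("\\S*[a-z0-9-]+\\.[a-z]{2,}\\S*") |>
    # Replace other non-letters with a space so "america's" splits into
    # "america" and "s" instead of merging into "americas"
    str_replace_all("[^a-z\\s]", " ") |>
    str_squish()
}

sample_text <- "July 3, 2026\nHEATHER COX RICHARDSON\nopen.substack.com/pub/heatherc..."
cleaned <- clean_post_text(sample_text)
cleaned
```

Replacing punctuation with a space instead of deleting it leaves one-letter fragments such as "s", which the later filter on word length removes.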

Tokenize the Text

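Tokenizing means splitting the text into individual words. This step also removes common English stopwords, the search terms themselves, and words of two characters or fewer.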

custom_stopwords <- tibble(
  word = c("heather", "cox", "richardson", "hcr")
)

words <- clean_posts |>
  select(clean_text) |>
  unnest_tokens(word, clean_text) |>
  anti_join(stop_words, by = "word") |>
  anti_join(custom_stopwords, by = "word") |>
  filter(str_length(word) > 2)
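
To see what this pipeline does on a small input, here is a sketch on two made-up sentences. The toy tibble and the toy_words and toy_counts names are illustrative only and are not part of the collected data.

```r
library(dplyr)
library(tidytext)

# Two made-up, already-cleaned posts
toy <- tibble(
  clean_text = c("the people and the declaration", "history of the people")
)

# One row per word, minus stopwords and words of two characters or fewer
toy_words <- toy |>
  unnest_tokens(word, clean_text) |>
  anti_join(stop_words, by = "word") |>
  filter(nchar(word) > 2)

# Count each remaining word, most frequent first
toy_counts <- toy_words |>
  count(word, sort = TRUE)

toy_counts
```

Only "people", "declaration", and "history" remain, and "people" is counted twice because it appears in both sentences.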

Word Frequency Analysis

This code counts how often each word appears.

word_freq <- words |>
  count(word, sort = TRUE)

head(word_freq, 20)
##                          word  n
## 1                     america 27
## 2                        july 21
## 3                     history 20
## 4                    american 19
## 5                   historian 17
## 6                independence 13
## 7                 declaration 12
## 8                    americas 11
## 9  opensubstackcompubheatherc 11
## 10                     people 11
## 11                      awful 10
## 12                      tells 10
## 13                 bskysocial  9
## 14                   citizens  9
## 15                  democracy  9
## 16                      dream  9
## 17                     modern  9
## 18                        day  5
## 19                       hope  5
## 20                     nation  5

Save Word Frequency Table

write.csv(word_freq, "bluesky_heather_cox_richardson_word_frequencies.csv", row.names = FALSE)

Top 20 Most Common Words

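# slice_max() keeps ties by default (with_ties = TRUE), so more than 20 rows
# come back when several words share the count at the cutoff.
top_20_words <- word_freq |>
  slice_max(n, n = 20)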

top_20_words
##                          word  n
## 1                     america 27
## 2                        july 21
## 3                     history 20
## 4                    american 19
## 5                   historian 17
## 6                independence 13
## 7                 declaration 12
## 8                    americas 11
## 9  opensubstackcompubheatherc 11
## 10                     people 11
## 11                      awful 10
## 12                      tells 10
## 13                 bskysocial  9
## 14                   citizens  9
## 15                  democracy  9
## 16                      dream  9
## 17                     modern  9
## 18                        day  5
## 19                       hope  5
## 20                     nation  5
## 21                     rights  5
## 22                      sarah  5
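
The table has 22 rows because five words share the count of 5 at the cutoff. The sketch below shows the tie behavior on a made-up frequency table; toy_freq and its values are illustrative only. Setting with_ties = FALSE is one way to get exactly the requested number of rows, at the cost of dropping some tied words.

```r
library(dplyr)

# Made-up frequency table standing in for word_freq
toy_freq <- tibble(
  word = c("alpha", "beta", "gamma", "delta"),
  n = c(5, 3, 3, 3)
)

# Default: every row tied with the second-largest count is kept
with_ties_rows <- nrow(slice_max(toy_freq, n, n = 2))

# with_ties = FALSE returns exactly two rows
without_ties_rows <- nrow(slice_max(toy_freq, n, n = 2, with_ties = FALSE))

c(with_ties_rows, without_ties_rows)
```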

Bar Chart of Most Common Words

ggplot(top_20_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(
    title = "Top 20 Most Common Words in Bluesky Posts about Heather Cox Richardson",
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal()

Word Cloud

set.seed(123)

wordcloud(
  words = word_freq$word,
  freq = word_freq$n,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)

Findings

After collecting and cleaning the Bluesky posts, I performed a word frequency analysis to identify the most common terms related to Heather Cox Richardson.

top_20_words
##                          word  n
## 1                     america 27
## 2                        july 21
## 3                     history 20
## 4                    american 19
## 5                   historian 17
## 6                independence 13
## 7                 declaration 12
## 8                    americas 11
## 9  opensubstackcompubheatherc 11
## 10                     people 11
## 11                      awful 10
## 12                      tells 10
## 13                 bskysocial  9
## 14                   citizens  9
## 15                  democracy  9
## 16                      dream  9
## 17                     modern  9
## 18                        day  5
## 19                       hope  5
## 20                     nation  5
## 21                     rights  5
## 22                      sarah  5

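The most common words (“america,” “july,” “history,” “american,” “historian,” “independence,” and “declaration”) suggest that the collected posts center on American history and the Fourth of July, which fits the early-July timestamps in the rows shown earlier. Words such as “democracy,” “citizens,” and “people” point to civic themes. Two entries, “opensubstackcompubheatherc” and “bskysocial,” are residue from links and handles rather than real words, and “americas” is most likely “america's” with the apostrophe stripped. Because slice_max() keeps ties, the table and bar chart show 22 words rather than exactly 20, while the word cloud visually highlights the most repeated terms.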

Limitations

This analysis has several limitations. First, the dataset only includes the 100 most recent matching posts that Bluesky search returned at the time of collection, so it reflects a narrow time window rather than long-run discussion. Second, the results depend on the search phrase “Heather Cox Richardson,” so different search terms such as “HCR” might produce different results. Third, word frequency analysis only counts how often words appear. It does not explain context, sarcasm, tone, or whether the posts are supportive or critical.
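
The first limitation can be checked by measuring the time span the sample covers. The sketch below parses two created_at values copied from the head() output earlier in this report. In practice the full bluesky_posts$created_at column would be used, and the format string assumes every timestamp follows the same pattern as those two.

```r
# Two timestamps copied from the head() output (rows 1 and 6)
created_at <- c("2026-07-05T01:46:04.966Z", "2026-07-04T21:53:45.830Z")

# Parse the ISO 8601 strings as UTC; %OS reads the fractional seconds
parsed <- as.POSIXct(created_at, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# Earliest and latest post, and the span between them in hours
time_range <- range(parsed)
span_hours <- as.numeric(difftime(max(parsed), min(parsed), units = "hours"))

time_range
span_hours
```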

Conclusion

This project used the Bluesky API to collect posts related to Heather Cox Richardson. After cleaning the text, I used word frequency analysis to identify the most common terms in the dataset. The results provide a general overview of the themes and language commonly associated with discussions of Heather Cox Richardson on Bluesky.