BlueSky Healthcare & Pharma: Word Frequency Analysis

2.1 Introduction

This report collects and analyzes posts from BlueSky (bsky.social) related to healthcare and pharmaceuticals. BlueSky is a decentralized microblogging platform built on the AT Protocol. Its public API requires no authentication for read-only access to public posts, making it accessible for academic research.

Research question: What terms and themes dominate healthcare-related discourse on BlueSky?

2.2 Setup & Required Packages

# Install any missing packages before loading
required_packages <- c(
  "httr2",        # HTTP requests to the BlueSky API  (like Python's requests)
  "jsonlite",     # Parse JSON responses              (like Python's json module)
  "dplyr",        # Data wrangling                    (like Python's pandas)
  "tidytext",     # Text mining / tokenization        (like Python's NLTK)
  "ggplot2",      # Visualization                     (like Python's matplotlib)
  "wordcloud2",   # Word cloud
  "stringr",      # String manipulation               (like Python's re / str methods)
  "knitr",        # Table rendering
  "scales"        # Axis formatting
)

for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

library(httr2)
library(jsonlite)
library(dplyr)
library(tidytext)
library(ggplot2)
library(wordcloud2)
library(stringr)
library(knitr)
library(scales)

2.3 Data Collection via the BlueSky API

BlueSky exposes a public search endpoint at https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts. No API key is required for public search.
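
The collection code itself is not shown in this report. What follows is a minimal sketch of how `posts_df` could be assembled with httr2, assuming the endpoint still accepts unauthenticated requests and returns a `posts` array whose items carry `uri`, `record$text`, `record$createdAt`, and `likeCount`. The helper names `posts_to_df()` and `fetch_posts()` are hypothetical, and the column names simply mirror those used later in the report (`text`, `created_at`, `like_count`, `query_term`).

```r
# Sketch only: helper names are hypothetical, not the report's actual code.
# httr2 calls are namespace-qualified so the block does not rely on Section 2.2.
search_url <- "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

# Flatten the `posts` array of one parsed JSON response into a data frame.
posts_to_df <- function(posts, term) {
  # Return `default` when a field is missing from a post
  or_default <- function(x, default) if (is.null(x)) default else x
  data.frame(
    uri        = vapply(posts, function(p) or_default(p$uri, NA_character_), character(1)),
    text       = vapply(posts, function(p) or_default(p$record$text, NA_character_), character(1)),
    created_at = vapply(posts, function(p) or_default(p$record$createdAt, NA_character_), character(1)),
    like_count = vapply(posts, function(p) as.integer(or_default(p$likeCount, 0L)), integer(1)),
    query_term = rep(term, length(posts)),
    stringsAsFactors = FALSE
  )
}

# Request a single page of results for one search term
# (assumption: `limit` may be at most 100 per request).
fetch_posts <- function(term, limit = 100) {
  resp <- httr2::request(search_url) |>
    httr2::req_url_query(q = term, limit = limit) |>
    httr2::req_retry(max_tries = 3) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
  posts_to_df(resp$posts, term)
}

# Offline check of the parser with a hand-made response fragment
mock_posts <- list(
  list(uri = "at://example/1",
       record = list(text = "FDA approval news", createdAt = "2026-06-24T10:00:00.000Z"),
       likeCount = 3),
  list(uri = "at://example/2",
       record = list(text = "Trial results", createdAt = "2026-06-25T11:30:00.000Z"))
)
mock_df <- posts_to_df(mock_posts, "FDA")
print(mock_df)
```

With network access, something like `posts_df <- do.call(rbind, lapply(search_terms, fetch_posts))` would stack the results, where `search_terms` is a character vector of the eight queries. Dropping repeated `uri` values afterwards, for example with `dplyr::distinct(posts_df, uri, .keep_all = TRUE)`, is one way to keep only unique posts.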

2.4 Data Summary

cat("Posts collected:", nrow(posts_df), "\n")

## Posts collected: 720

cat("Date range:", min(posts_df$created_at, na.rm = TRUE),
    "to", max(posts_df$created_at, na.rm = TRUE), "\n")

## Date range: 2026-06-23T23:18:09.044Z to 2026-07-02T01:29:08.220Z

cat("Unique search terms:", length(unique(posts_df$query_term)), "\n")

## Unique search terms: 8

# Posts per search term
posts_df |>
  count(query_term, sort = TRUE) |>
  rename(`Search Term` = query_term, `Posts Collected` = n) |>
  kable(caption = "Posts Collected by Search Term")

Posts Collected by Search Term
Search Term	Posts Collected
pharma	99
patient care	98
healthcare	97
Medicare	96
clinical trial	96
FDA	87
Medicaid	77
drug approval	70
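
The date range above is printed as raw ISO 8601 strings, which `min()` and `max()` compare as text. A short sketch of parsing them into date-times, so that the collection window can be computed directly, is given here; the two input values are copied from the output above, and the variable names are illustrative.

```r
# The two timestamps printed in the date-range output
created_at <- c("2026-06-23T23:18:09.044Z", "2026-07-02T01:29:08.220Z")

# Parse ISO 8601 strings with fractional seconds as UTC date-times
created_time <- as.POSIXct(created_at, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# Calendar dates and the length of the window in days
post_dates  <- as.Date(created_time, tz = "UTC")
window_days <- as.numeric(difftime(max(created_time), min(created_time), units = "days"))

print(post_dates)
print(round(window_days, 1))
```

Applied to the full data, the same `as.POSIXct()` call on `posts_df$created_at` would also allow posts to be counted per day.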

2.5 Text Cleaning & Preprocessing

# Custom stopwords relevant to this context
custom_stopwords <- tibble(word = c(
  "https", "http", "t.co", "amp", "rt", "via",
  "just", "like", "can", "get", "will", "one",
  "also", "now", "new", "said", "say", "says"
))

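# Clean text: remove URLs, mentions, and hashtag symbols, then everything
# that is not a letter or whitespace (punctuation, digits, emoji)
posts_clean <- posts_df |>
  mutate(
    text_clean = gsub("https?://\\S+", "", text, perl = TRUE),        # URLs
    text_clean = gsub("@[\\w\\.]+", "", text_clean, perl = TRUE),     # @mentions
    text_clean = gsub("#", "", text_clean, fixed = TRUE),             # hashtag symbols
    text_clean = gsub("[^[:alpha:][:space:]]", "", text_clean),       # non-letters
    text_clean = str_squish(str_to_lower(text_clean))
  )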

# Tokenize: one row per word (like Python's word_tokenize)
tokens <- posts_clean |>
  select(text_clean, query_term, like_count) |>
  unnest_tokens(word, text_clean) |>
  anti_join(stop_words, by = "word") |>     # remove standard English stopwords
  anti_join(custom_stopwords, by = "word") |>
  filter(str_length(word) > 2)              # drop very short tokens

cat("Total tokens after cleaning:", nrow(tokens), "\n")

## Total tokens after cleaning: 11728

cat("Unique tokens:", n_distinct(tokens$word), "\n")

## Unique tokens: 5153
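
To make the effect of the cleaning step concrete, the sketch below applies the same four substitutions to two made-up posts. The helper `clean_text()` and the sample strings are illustrative only, and base R functions stand in for `str_to_lower()` and `str_squish()` so the block runs without stringr.

```r
# Same substitutions as the cleaning step, wrapped in a hypothetical helper
clean_text <- function(x) {
  x <- gsub("https?://\\S+", "", x, perl = TRUE)   # URLs
  x <- gsub("@[\\w\\.]+", "", x, perl = TRUE)      # @mentions
  x <- gsub("#", "", x, fixed = TRUE)              # hashtag symbols
  x <- gsub("[^[:alpha:][:space:]]", "", x)        # everything except letters and whitespace
  x <- tolower(x)
  trimws(gsub("\\s+", " ", x))                     # collapse repeated whitespace
}

# Made-up example posts
samples <- c(
  "Check https://example.com @user.bsky.social #FDA approves 2 new drugs!",
  "Don't skip COVID-19 patient-care updates"
)
cleaned <- clean_text(samples)
print(cleaned)
# [1] "check fda approves new drugs"
# [2] "dont skip covid patientcare updates"
```

The second sample shows a side effect worth keeping in mind: because punctuation is deleted rather than replaced with a space, "don't" becomes "dont" (which no longer matches an apostrophe-containing stopword) and "patient-care" becomes the single token "patientcare".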

2.6 Word Frequency Analysis

Top 20 Most Frequent Terms

word_freq <- tokens |>
  count(word, sort = TRUE) |>
  slice_head(n = 20)

kable(word_freq, col.names = c("Term", "Frequency"),
      caption = "Top 20 Most Frequent Terms in BlueSky Healthcare Posts")

Top 20 Most Frequent Terms in BlueSky Healthcare Posts
Term	Frequency
clinical	139
healthcare	132
care	127
trial	119
fda	113
medicare	105
drug	104
pharma	104
medicaid	101
patient	97
approval	80
people	58
trump	54
health	51
food	39
administration	38
medical	34
news	34
pay	31
cancer	30

Visualization: Top 20 Terms Bar Chart

word_freq |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(x = word, y = n, fill = n)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#74c0fc", high = "#1864ab") +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Top 20 Terms in BlueSky Healthcare Posts",
    subtitle = paste0("Based on ", nrow(posts_df), " unique posts"),
    x        = NULL,
    y        = "Frequency",
    caption  = "Source: BlueSky public API · Data collected June 2026"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

2.7 Word Cloud Visualization

cloud_data <- tokens |>
  count(word, sort = TRUE) |>
  slice_head(n = 150)

wordcloud2(
  data  = cloud_data,
  size  = 0.6,
  color = "random-dark",
  backgroundColor = "white"
)

2.8 Term Frequency by Search Category

top_by_term <- tokens |>
  filter(query_term %in% c("FDA", "clinical trial", "pharma", "patient care")) |>
  group_by(query_term) |>
  count(word, sort = TRUE) |>
  slice_head(n = 8) |>
  ungroup()

top_by_term |>
  mutate(word = reorder_within(word, n, query_term)) |>
  ggplot(aes(x = word, y = n, fill = query_term)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ query_term, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(
    title   = "Top Terms by Search Category",
    x       = NULL,
    y       = "Frequency",
    caption = "Source: BlueSky public API · June 2026"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

2.9 Findings & Interpretation

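The word frequency analysis of BlueSky posts related to healthcare and pharmaceuticals reveals several dominant themes. The most frequent terms are “clinical” (139), “healthcare” (132), “care” (127), “trial” (119), and “fda” (113). The eleven most frequent terms are exactly the words that make up the eight search queries, so their prominence partly reflects how the posts were collected rather than what users chose to emphasize. Beyond the query words, the leading terms include “people,” “trump,” “health,” and “cancer.” Patient-facing vocabulary (“care,” “patient,” “people,” “health”) appears alongside regulatory and commercial vocabulary (“fda,” “approval,” “pharma”), so the corpus does not clearly favor one over the other. The prominence of policy-adjacent terms such as “medicare,” “medicaid,” and “fda,” together with “trump” and “pay,” suggests that healthcare cost and access remain central concerns in public discourse, consistent with findings from prior social media health research (Grajales et al., 2014).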

Breaking the results down by search category shows that the topics differ in meaningful ways: FDA-related posts cluster around terms like “approval,” “safety,” and “review,” while “clinical trial” posts surface terms like “research,” “cancer,” and “results.” This alignment between each query and its associated vocabulary suggests that the collected posts are on topic and reflect the professional and advocacy communities now active on BlueSky. As Eysenbach (2009) noted in his foundational work on “infodemiology,” the volume and language of health-related social media posts can serve as a leading indicator of public health awareness, which makes platforms like BlueSky a potentially valuable and underutilized signal for pharmaceutical companies, public health agencies, and patient advocacy groups.

References:

Eysenbach, G. (2009). Infodemiology and infoveillance: Framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. Journal of Medical Internet Research, 11(1), e11.

Grajales, F. J., Sheps, S., Ho, K., Novak-Lauscher, H., & Eysenbach, G. (2014). Social media: A review and tutorial of applications in medicine and health care. Journal of Medical Internet Research, 16(2), e13.

Russell, M. A., & Klassen, M. (2019). Mining the Social Web (3rd ed.). O’Reilly Media.

Report generated with R version 4.6.0 (2026-04-24) · Published to RPubs.com

Christina Gollapally

2026-06-29