BlueSky Healthcare & Pharma: Word Frequency Analysis

2.1 Introduction

This report collects and analyzes posts from BlueSky (bsky.social) related to healthcare and pharmaceuticals. BlueSky is a decentralized microblogging platform built on the AT Protocol. Its public API requires no authentication for read-only access to public posts, making it accessible for academic research.

Research question: What terms and themes dominate healthcare-related discourse on BlueSky?


2.2 Setup & Required Packages

# Install any missing packages before loading
required_packages <- c(
  "httr2",        # HTTP requests to the BlueSky API  (like Python's requests)
  "jsonlite",     # Parse JSON responses              (like Python's json module)
  "dplyr",        # Data wrangling                    (like Python's pandas)
  "tidytext",     # Text mining / tokenization        (like Python's NLTK)
  "ggplot2",      # Visualization                     (like Python's matplotlib)
  "wordcloud2",   # Word cloud
  "stringr",      # String manipulation               (like Python's re / str methods)
  "knitr",        # Table rendering
  "scales"        # Axis formatting
)

for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

library(httr2)
library(jsonlite)
library(dplyr)
library(tidytext)
library(ggplot2)
library(wordcloud2)
library(stringr)
library(knitr)
library(scales)

2.3 Data Collection via the BlueSky API

BlueSky exposes a public search endpoint at https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts. No API key is required for public search.
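The collection code itself is not shown in this report, so the following is only a minimal sketch of how the endpoint above could be queried with httr2. The helper names `parse_posts()` and `fetch_posts()` are hypothetical, and the response fields used (`posts`, `uri`, `record$text`, `record$createdAt`, `likeCount`) are assumptions about the JSON layout of `searchPosts`. The output columns are named to match those used later in the report (`text`, `created_at`, `like_count`, `query_term`).

```r
# Sketch only, not the report's actual collection code.
search_url <- "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

# Flatten the `posts` array of one decoded JSON response into a data frame.
# `body` is a nested list, as returned by httr2::resp_body_json().
parse_posts <- function(body, query_term) {
  posts <- body$posts
  if (length(posts) == 0) {
    return(data.frame(uri = character(), text = character(),
                      created_at = character(), like_count = integer(),
                      query_term = character()))
  }
  # Missing fields become NA instead of raising an error
  field <- function(x, default) if (is.null(x)) default else x
  data.frame(
    uri        = vapply(posts, function(p) field(p$uri, NA_character_), character(1)),
    text       = vapply(posts, function(p) field(p$record$text, NA_character_), character(1)),
    created_at = vapply(posts, function(p) field(p$record$createdAt, NA_character_), character(1)),
    like_count = vapply(posts, function(p) as.integer(field(p$likeCount, NA_integer_)), integer(1)),
    query_term = query_term
  )
}

# Request one page of results for a single search term (assumed parameters:
# q, limit, sort) and retry a few times on transient failures.
fetch_posts <- function(query_term, limit = 100) {
  resp <- httr2::request(search_url) |>
    httr2::req_url_query(q = query_term, limit = limit, sort = "latest") |>
    httr2::req_retry(max_tries = 3) |>
    httr2::req_perform()
  parse_posts(httr2::resp_body_json(resp), query_term)
}

# Example usage (performs live HTTP requests, so it is left commented out):
# posts_df <- do.call(rbind, lapply(c("FDA", "pharma"), fetch_posts))
```

Keeping the parsing step separate from the HTTP call makes it possible to check the flattening logic on a hand-built list without touching the network. If the same post matches more than one search term, rows would need to be de-duplicated on `uri` before counting.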


2.4 Data Summary

cat("Posts collected:", nrow(posts_df), "\n")
## Posts collected: 720
cat("Date range:", min(posts_df$created_at, na.rm = TRUE),
    "to", max(posts_df$created_at, na.rm = TRUE), "\n")
## Date range: 2026-06-23T23:18:09.044Z to 2026-07-02T01:29:08.220Z
cat("Unique search terms:", length(unique(posts_df$query_term)), "\n")
## Unique search terms: 8
# Posts per search term
posts_df |>
  count(query_term, sort = TRUE) |>
  rename(`Search Term` = query_term, `Posts Collected` = n) |>
  kable(caption = "Posts Collected by Search Term")
Posts Collected by Search Term

|Search Term    | Posts Collected|
|:--------------|---------------:|
|pharma         |              99|
|patient care   |              98|
|healthcare     |              97|
|Medicare       |              96|
|clinical trial |              96|
|FDA            |              87|
|Medicaid       |              77|
|drug approval  |              70|
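The date range above is computed on character strings, which works because ISO 8601 timestamps sort lexicographically. To get the length of the window as a number, the strings can be parsed into date-times first. The snippet below is a small sketch that uses the two timestamps printed in the summary; it assumes every `created_at` value follows the same `...Z` format with fractional seconds, and values in another format would parse to `NA`.

```r
# Timestamps copied from the date range printed in the summary
created_at <- c("2026-06-23T23:18:09.044Z", "2026-07-02T01:29:08.220Z")

# Parse ISO 8601 strings with fractional seconds as UTC date-times
created_time <- as.POSIXct(created_at, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# Length of the window covered by the posts, in days
span_days <- as.numeric(difftime(max(created_time), min(created_time), units = "days"))
round(span_days, 1)
# 8.1
```

With the full data, the same call can be applied to `posts_df$created_at`, adding `na.rm = TRUE` to `min()` and `max()` in case some timestamps fail to parse.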

2.5 Text Cleaning & Preprocessing

# Custom stopwords relevant to this context
custom_stopwords <- tibble(word = c(
  "https", "http", "t.co", "amp", "rt", "via",
  "just", "like", "can", "get", "will", "one",
  "also", "now", "new", "said", "say", "says"
))

# Clean text: remove URLs, mentions, hashtag symbols, then punctuation and digits
posts_clean <- posts_df |>
  mutate(
    text_clean = gsub("[^[:alpha:][:space:]]", "",
                 gsub("#", "",
                 gsub("@[\\w\\.]+", "",
                 gsub("https?://\\S+", "", text, perl=TRUE),
                 perl=TRUE))),
    text_clean = str_squish(str_to_lower(text_clean))
  )

# Tokenize: one row per word (like NLTK's word_tokenize in Python)
tokens <- posts_clean |>
  select(text_clean, query_term, like_count) |>
  unnest_tokens(word, text_clean) |>
  anti_join(stop_words, by = "word") |>     # remove standard English stopwords
  anti_join(custom_stopwords, by = "word") |>
  filter(str_length(word) > 2)              # drop very short tokens

cat("Total tokens after cleaning:", nrow(tokens), "\n")
## Total tokens after cleaning: 11728
cat("Unique tokens:", n_distinct(tokens$word), "\n")
## Unique tokens: 5153
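To make the effect of the cleaning step concrete, the sketch below applies the same four `gsub()` patterns to two made-up posts. The helper name `clean_post()` is hypothetical, and base R stand-ins (`tolower()`, `trimws()`, and a whitespace `gsub()`) are used in place of `str_to_lower()` and `str_squish()` so that the example runs without extra packages.

```r
# Illustrative helper: same regex steps as the report, in sequence
clean_post <- function(text) {
  x <- gsub("https?://\\S+", "", text, perl = TRUE)  # drop URLs
  x <- gsub("@[\\w\\.]+", "", x, perl = TRUE)        # drop @mentions
  x <- gsub("#", "", x)                              # keep hashtag word, drop "#"
  x <- gsub("[^[:alpha:][:space:]]", "", x)          # drop punctuation and digits
  gsub("[[:space:]]+", " ", trimws(tolower(x)))      # lower-case, squeeze spaces
}

clean_post("RT @user.bsky.social: FDA approves new drug! https://example.com/x #pharma 2026")
# "rt fda approves new drug pharma"

clean_post("Patient-care costs don't add up")
# "patientcare costs dont add up"
```

The second example shows a side effect worth keeping in mind when reading the counts: because punctuation is deleted rather than replaced with a space, hyphenated words and contractions are fused into single tokens such as "patientcare" and "dont".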

2.6 Word Frequency Analysis

Top 20 Most Frequent Terms

word_freq <- tokens |>
  count(word, sort = TRUE) |>
  slice_head(n = 20)

kable(word_freq, col.names = c("Term", "Frequency"),
      caption = "Top 20 Most Frequent Terms in BlueSky Healthcare Posts")
Top 20 Most Frequent Terms in BlueSky Healthcare Posts

|Term           | Frequency|
|:--------------|---------:|
|clinical       |       139|
|healthcare     |       132|
|care           |       127|
|trial          |       119|
|fda            |       113|
|medicare       |       105|
|drug           |       104|
|pharma         |       104|
|medicaid       |       101|
|patient        |        97|
|approval       |        80|
|people         |        58|
|trump          |        54|
|health         |        51|
|food           |        39|
|administration |        38|
|medical        |        34|
|news           |        34|
|pay            |        31|
|cancer         |        30|
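Because posts were retrieved by keyword, the words of the search phrases themselves are expected to rank near the top of any frequency table. One way to look past them is to drop those words before counting. The sketch below illustrates the idea on a toy vector standing in for `tokens$word`; the helper `query_words()` is hypothetical and is not part of the report's pipeline.

```r
# Hypothetical helper: split search phrases into lower-case single words
query_words <- function(terms) {
  unique(unlist(strsplit(tolower(terms), "[[:space:]]+")))
}

# The eight search terms listed in the data summary
search_terms <- c("pharma", "patient care", "healthcare", "Medicare",
                  "clinical trial", "FDA", "Medicaid", "drug approval")
qw <- query_words(search_terms)

# Toy token vector standing in for tokens$word
toy_words <- c("clinical", "trial", "cancer", "fda", "cancer", "pay", "care")

# Keep only tokens that are not part of any search phrase, then count them
non_query <- toy_words[!toy_words %in% qw]
sort(table(non_query), decreasing = TRUE)

# With the real data, the equivalent dplyr call would be:
# tokens |> filter(!word %in% qw) |> count(word, sort = TRUE)
```

On the toy input this leaves "cancer" twice and "pay" once. Applied to the full token table, the same filter would remove the eleven query words and let terms that users chose on their own rise to the top.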

Visualization: Top 20 Terms Bar Chart

word_freq |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(x = word, y = n, fill = n)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#74c0fc", high = "#1864ab") +
  scale_y_continuous(labels = comma) +
  labs(
    title    = "Top 20 Terms in BlueSky Healthcare Posts",
    subtitle = paste0("Based on ", nrow(posts_df), " unique posts"),
    x        = NULL,
    y        = "Frequency",
    caption  = "Source: BlueSky public API · Posts from June 23 to July 2, 2026"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )


2.7 Word Cloud Visualization

cloud_data <- tokens |>
  count(word, sort = TRUE) |>
  slice_head(n = 150)

wordcloud2(
  data  = cloud_data,
  size  = 0.6,
  color = "random-dark",
  backgroundColor = "white"
)

2.8 Term Frequency by Search Category

top_by_term <- tokens |>
  filter(query_term %in% c("FDA", "clinical trial", "pharma", "patient care")) |>
  group_by(query_term) |>
  count(word, sort = TRUE) |>
  slice_head(n = 8) |>
  ungroup()

top_by_term |>
  mutate(word = reorder_within(word, n, query_term)) |>
  ggplot(aes(x = word, y = n, fill = query_term)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ query_term, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(
    title   = "Top Terms by Search Category",
    x       = NULL,
    y       = "Frequency",
    caption = "Source: BlueSky public API · June 23 to July 2, 2026"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))


2.9 Findings & Interpretation

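The word frequency analysis of BlueSky posts related to healthcare and pharmaceuticals needs to be read with the search design in mind. The eleven most frequent terms (“clinical,” “healthcare,” “care,” “trial,” “fda,” “medicare,” “drug,” “pharma,” “medicaid,” “patient,” and “approval”) are exactly the words that make up the eight search queries, so their high counts reflect how the posts were retrieved more than what users chose to emphasize. The more informative signal comes from terms that were not searched for: “people” (58), “trump” (54), “health” (51), “pay” (31), and “cancer” (30), along with “food” (39) and “administration” (38), which most likely come from the agency name “Food and Drug Administration.” The appearance of “trump” and “pay” alongside the Medicare and Medicaid queries suggests that healthcare policy, cost, and access are central concerns in this sample, consistent with findings from prior social media health research (Grajales et al., 2014).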

Breaking the results down by search category shows that the vocabulary differs across topics: FDA-related posts cluster around terms like “approval,” “safety,” and “review,” while “clinical trial” posts surface terms like “research,” “cancer,” and “results.” These category-specific vocabularies match what one would expect for each query, which suggests that the search returned on-topic posts and that professional and advocacy communities are active on BlueSky. As Eysenbach (2009) noted in his foundational work on “infodemiology,” the volume and language of health-related social media posts can serve as a leading indicator of public health awareness, which makes platforms like BlueSky a potentially valuable and underused signal for pharmaceutical companies, public health agencies, and patient advocacy groups.

References:

Eysenbach, G. (2009). Infodemiology and infoveillance: Framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. Journal of Medical Internet Research, 11(1), e11.

Grajales, F. J., Sheps, S., Ho, K., Novak-Lauscher, H., & Eysenbach, G. (2014). Social media: A review and tutorial of applications in medicine and health care. Journal of Medical Internet Research, 16(2), e13.

Russell, M. A., & Klassen, M. (2019). Mining the Social Web (3rd ed.). O’Reilly Media.


Report generated with R version 4.6.0 (2026-04-24) · Published to RPubs.com