Setup and Package Installation

NewsAPI access in R is provided through the news-r/newsapi package on GitHub (not on CRAN), so we install it with remotes. We also install newsanchor, a second NewsAPI client that install.packages() pulls from CRAN. In addition, we load the tidyverse for data manipulation, tidytext and textdata for text mining and sentiment lexicons, lubridate for date handling, and knitr/kableExtra for nicely formatted tables.

# Run once: uncomment the two lines below if these packages are not yet installed
# install.packages(c("remotes", "tidyverse", "tidytext", "textdata", "lubridate", "knitr", "kableExtra", "newsanchor"))
# remotes::install_github("news-r/newsapi")
library(newsapi)
library(tidyverse)
library(tidytext)
library(textdata)
library(lubridate)
library(knitr)
library(kableExtra)
library(ggplot2)
library(newsanchor)

Authentication

NewsAPI requires a free API key, available at newsapi.org. Never hard-code your API key in a script you plan to share or commit to version control. For class purposes, store your key as an environment variable (e.g., in a .Renviron file) and reference it with Sys.getenv().

# Read the key from the NEWSAPI_KEY environment variable (e.g., set in .Renviron) rather than pasting it here
api_key <- Sys.getenv("NEWSAPI_KEY")
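
If the environment variable is missing, Sys.getenv() silently returns an empty string and the first API call fails later with a less obvious error. The sketch below shows one way to fail early. The helper name get_newsapi_key() and the demo variable NEWSAPI_KEY_DEMO are hypothetical and are not part of the original script.

```r
# Hypothetical helper: return the key, or stop with a clear message if it is not set.
get_newsapi_key <- function(var = "NEWSAPI_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(
      "Environment variable ", var, " is not set. ",
      "Add a line like ", var, "=your_key to ~/.Renviron and restart R.",
      call. = FALSE
    )
  }
  key
}

# Demo with a throwaway variable so the example runs without a real key.
Sys.setenv(NEWSAPI_KEY_DEMO = "abc123")
demo_key <- get_newsapi_key("NEWSAPI_KEY_DEMO")
```

In the tutorial itself you would call api_key <- get_newsapi_key() in place of the bare Sys.getenv() call.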

Tip: A common extension of this tutorial is to pull headlines for multiple brands (e.g., a focal brand and 1-2 competitors) so that the sentiment comparisons later in this tutorial are more meaningful. The commented-out top_headlines("Anthropic") line in the original script is an example of adding a second topic for comparison.

Pulling and Cleaning Headlines

library(httr)
library(jsonlite)
library(tidyverse)

fetch_news <- function(query, api_key, page_size = 10) {
  response <- GET(
    url = "https://newsapi.org/v2/everything",
    query = list(
      q         = query,
      language  = "en",
      sortBy    = "publishedAt",
      pageSize  = page_size,
      apiKey    = api_key
    )
  )
  
  # Surface the actual error message from NewsAPI
  if (status_code(response) != 200) {
    msg <- content(response, as = "parsed")$message
    stop("NewsAPI error for '", query, "': ", msg)
  }
  
  parsed   <- content(response, as = "text", encoding = "UTF-8")
  articles <- fromJSON(parsed, flatten = TRUE)$articles
  
  as_tibble(articles) %>%
    rename_with(~ str_replace_all(.x, "\\.", "_")) %>%
    mutate(query = query)
}


news_raw <- bind_rows(
  fetch_news("Snapchat",    api_key),
  fetch_news("Youtube", api_key),
  fetch_news("Threads",    api_key),
  fetch_news("Twitch", api_key)
)

glimpse(news_raw)
## Rows: 40
## Columns: 10
## $ author      <chr> "BetaList", "Faheem Tahir", "ResearchBuzz", "finance.yahoo…
## $ title       <chr> "ViewSnapStories – View and download Snapchat stories, spo…
## $ description <chr> "View and download Snapchat stories, spotlight, videos, wi…
## $ url         <chr> "https://betalist.com/startups/viewsnapstories", "https://…
## $ urlToImage  <chr> "https://resize.imagekit.co/48nsRSzsS0CRAFb7JvAyRuQ63voWTO…
## $ publishedAt <chr> "2026-06-20T21:00:00Z", "2026-06-20T19:03:26Z", "2026-06-2…
## $ content     <chr> "ViewSnapStories is a web-based tool that allows users to …
## $ source_id   <chr> NA, NA, NA, NA, NA, "breitbart-news", NA, "techradar", NA,…
## $ source_name <chr> "Betalist.com", "Yahoo Entertainment", "Researchbuzz.me", …
## $ query       <chr> "Snapchat", "Snapchat", "Snapchat", "Snapchat", "Snapchat"…

Inspect for Duplicates

news_raw %>%
  filter(!is.na(title)) %>%
  mutate(
    pub_date    = ymd_hms(publishedAt, quiet = TRUE),
    pub_day     = as.Date(pub_date),
    title_clean = str_remove(title, "\\s*-\\s*[^-]+$"),
    title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
    title_clean = str_to_lower(title_clean)
  ) %>%
  group_by(title_clean) %>%
  filter(n() > 1) %>%
  arrange(title_clean)

Clean and Deduplicate

We now apply the same cleaning steps and keep only one copy of each unique (cleaned) headline:

  • Remove rows with missing titles.
  • Parse the publication timestamp into a date.
  • Strip the trailing “- Source Name” suffix many outlets append to titles.
  • Remove punctuation/special characters and collapse extra whitespace.
  • Convert to lowercase for consistent text analysis.
  • Drop duplicate cleaned titles.

news_clean <- news_raw %>%
  filter(!is.na(.data$title)) %>%
  mutate(
    pub_date    = ymd_hms(.data$publishedAt, quiet = TRUE),
    pub_day     = as.Date(pub_date),
    title_clean = str_remove(.data$title, "\\s*-\\s*[^-]+$"),
    title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
    title_clean = str_to_lower(title_clean)
  ) %>%
  distinct(title_clean, .keep_all = TRUE)

dim(news_clean)
## [1] 39 13
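
One caveat about the suffix-stripping pattern used above: because \\s* also matches zero spaces, a hyphen inside a compound word is treated as the start of a source suffix. That is why the headline "Introduces AI-Powered Advertising Suite to Streamline Campaign Workflow" ends up as "introduces ai" in title_clean (visible in the str() output in the next section). The sketch below compares the original pattern with a stricter variant that requires whitespace on both sides of the hyphen. The stricter pattern and the helper names strip_loose() and strip_strict() are illustrative assumptions, not part of the original script, and the example uses base R's sub() so it runs without extra packages.

```r
# Original pattern from the tutorial: whitespace around the hyphen is optional.
loose_pattern  <- "\\s*-\\s*[^-]+$"
# Assumed stricter variant: require whitespace on both sides of the hyphen.
strict_pattern <- "\\s+-\\s+[^-]+$"

strip_loose  <- function(x) sub(loose_pattern,  "", x, perl = TRUE)
strip_strict <- function(x) sub(strict_pattern, "", x, perl = TRUE)

titles <- c(
  # Real headline from the data: contains a hyphenated word but no source suffix.
  "Introduces AI-Powered Advertising Suite to Streamline Campaign Workflow",
  # Made-up headline with a trailing source suffix, for comparison.
  "Snap unveils new glasses - TechRadar"
)

# Side-by-side comparison of the two patterns.
comparison <- data.frame(
  loose  = strip_loose(titles),
  strict = strip_strict(titles),
  stringsAsFactors = FALSE
)
print(comparison)
```

The loose pattern cuts the first headline at "AI", while the strict pattern leaves it intact; both remove the "- TechRadar" suffix from the second. The stricter pattern would still truncate a headline that legitimately contains a spaced hyphen, so it is worth spot-checking title_clean either way.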

Combine into a Tracking Data Frame

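In this tutorial the four platform queries were already combined with bind_rows() when news_raw was built, so bind_rows(news_clean) leaves the data unchanged. The step is kept so that, if you pull and clean topics as separate data frames, there is a single place to combine them. It also standardizes the object name for the rest of the pipeline and saves a CSV snapshot, which is useful for reproducibility and for sharing data with teammates who don’t have API access.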

news_df <- bind_rows(news_clean) %>%
  filter(!is.na(title))  # remove any empty rows

str(news_df)
## tibble [39 × 13] (S3: tbl_df/tbl/data.frame)
##  $ author     : chr [1:39] "BetaList" "Faheem Tahir" "ResearchBuzz" "finance.yahoo.com" ...
##  $ title      : chr [1:39] "ViewSnapStories – View and download Snapchat stories, spotlight, videos, without logging in." "Rosenblatt Keeps Neutral Rating On Snap (SNAP) After $2,195 Specs AR Glasses Debut" "Old Courthouse Heritage Museum, Rave Preservation Project, Firefox, More: Saturday Afternoon ResearchBuzz, June 20, 2026" "Introduces AI-Powered Advertising Suite to Streamline Campaign Workflow" ...
##  $ description: chr [1:39] "View and download Snapchat stories, spotlight, videos, without logging in." "Snap Inc. (NYSE:SNAP) features on the list of tech stocks to sell according to billionaires. Billionaire stake "| __truncated__ "NEW RESOURCES EIN Presswire: Old Courthouse Heritage Museum Creates Free Digital Archive of Artifacts (PRESS RE"| __truncated__ "Snap Inc. (NYSE:SNAP) is one of the penny stocks with explosive growth potential. On June 18, Snapchat introduc"| __truncated__ ...
##  $ url        : chr [1:39] "https://betalist.com/startups/viewsnapstories" "https://finance.yahoo.com/markets/stocks/articles/rosenblatt-keeps-neutral-rating-snap-190326896.html" "https://researchbuzz.me/2026/06/20/old-courthouse-heritage-museum-rave-preservation-project-firefox-more-saturd"| __truncated__ "https://biztoc.com/x/cf1458dafeffe747" ...
##  $ urlToImage : chr [1:39] "https://resize.imagekit.co/48nsRSzsS0CRAFb7JvAyRuQ63voWTODJwv_-G3obL7M/plain/s3://betalist-production/7xt7s2krn"| __truncated__ "https://s.yimg.com/lo/mysterio/api/9EF4BA75120DFBC8970A27E17976460023AD003163F08273AA6C582D47C3C4BD/subgraphmys"| __truncated__ "https://s0.wp.com/_si/?t=eyJpbWciOiJodHRwczpcL1wvczAud3AuY29tXC9pXC9ibGFuay5qcGciLCJ0eHQiOiJSZXNlYXJjaEJ1enoiLC"| __truncated__ "https://biztoc.com/cdn/cf1458dafeffe747_s.webp" ...
##  $ publishedAt: chr [1:39] "2026-06-20T21:00:00Z" "2026-06-20T19:03:26Z" "2026-06-20T18:37:24Z" "2026-06-20T17:46:49Z" ...
##  $ content    : chr [1:39] "ViewSnapStories is a web-based tool that allows users to view and download Snapchat stories, spotlight videos, "| __truncated__ "Snap Inc. (NYSE:SNAP) features on the list of tech stocks to sell according to billionaires. Billionaire stake "| __truncated__ "NEW RESOURCES \r\nEIN Presswire: Old Courthouse Heritage Museum Creates Free Digital Archive of Artifacts (PRES"| __truncated__ "Snap Inc. (NYSE:SNAP) is one of the penny stocks with explosive growth potential. On June 18, Snapchat introduc"| __truncated__ ...
##  $ source_id  : chr [1:39] NA NA NA NA ...
##  $ source_name: chr [1:39] "Betalist.com" "Yahoo Entertainment" "Researchbuzz.me" "Biztoc.com" ...
##  $ query      : chr [1:39] "Snapchat" "Snapchat" "Snapchat" "Snapchat" ...
##  $ pub_date   : POSIXct[1:39], format: "2026-06-20 21:00:00" "2026-06-20 19:03:26" ...
##  $ pub_day    : Date[1:39], format: "2026-06-20" "2026-06-20" ...
##  $ title_clean: chr [1:39] "viewsnapstories view and download snapchat stories spotlight videos without logging in" "rosenblatt keeps neutral rating on snap snap after 2 195 specs ar glasses debut" "old courthouse heritage museum rave preservation project firefox more saturday afternoon researchbuzz june 20 2026" "introduces ai" ...

# Save a snapshot; row.names = FALSE avoids writing an extra index column
write.csv(news_df, "news_df.csv", row.names = FALSE)

4.5 Preview the Cleaned Data

news_df %>%
  select(source_name, title, pub_day) %>%
  head(10) %>%
  kable(caption = "Sample Cleaned Headlines") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Sample Cleaned Headlines

| source_name | title | pub_day |
|---|---|---|
| Betalist.com | ViewSnapStories – View and download Snapchat stories, spotlight, videos, without logging in. | 2026-06-20 |
| Yahoo Entertainment | Rosenblatt Keeps Neutral Rating On Snap (SNAP) After $2,195 Specs AR Glasses Debut | 2026-06-20 |
| Researchbuzz.me | Old Courthouse Heritage Museum, Rave Preservation Project, Firefox, More: Saturday Afternoon ResearchBuzz, June 20, 2026 | 2026-06-20 |
| Biztoc.com | Introduces AI-Powered Advertising Suite to Streamline Campaign Workflow | 2026-06-20 |
| Snapchat.com | Paint Party Ideas Videos | 2026-06-20 |
| Breitbart News | Federal Appeals Court Allows Ohio to Enforce Social Media Law Requiring Parental Consent for Minors | 2026-06-20 |
| Researchbuzz.me | In the Weights, Early Web Links, Snapchat, More: Saturday ResearchBuzz, June 20, 2026 | 2026-06-20 |
| TechRadar | ICYMI: the week’s 7 biggest tech news stories, from Commodore flip-phone nostalgia to Tim Cook’s Apple price-hike warning | 2026-06-20 |
| Dailymail.com | Ashley Cain’s axed documentary series is still available to watch on BBC iPlayer following his resurfaced misogynistic tweets as bosses say their vetting processes on the star ‘clearly failed’ | 2026-06-20 |
| CBC News | Is Canada’s teen social media ban constitutional? It’s complicated | 2026-06-20 |

Interpretation: At this stage you should have a tidy data frame where each row is a unique news headline, with a clean publication date and a query column indicating which platform search returned it. This is the foundation for everything that follows. If dim(news_clean) shows far fewer rows than dim(news_raw), a substantial share of the “results” were duplicate stories, which is a useful sanity check before drawing conclusions about coverage volume across Snapchat, YouTube, Threads, and Twitch. Here cleaning removed only one of the 40 rows (40 to 39), so duplication is minimal.

While the top 10 headlines previewed here were relevant to the four platforms, expanding the preview to 40 rows revealed articles with no clear connection to any of the four companies. Broad search terms like “Threads” and “Twitch” can pull in unrelated content, a limitation worth keeping in mind when interpreting the sentiment results downstream.

5. Tokenization

To analyze sentiment and word usage, we need to break each headline into individual words (“tokens”), remove common stop words (e.g., “the,” “and,” “of”) that carry little analytical meaning, and filter out pure numbers and very short tokens.

news_tokens <- news_df %>%
  select(source_name, title, query) %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"), nchar(word) > 2)

Interpretation: news_tokens is now a “one-token-per-row” data frame, the standard format for text mining with tidytext. Each row represents one meaningful word from one headline, tagged with the outlet that published it (source_name) and the search topic that returned it (query). This long format makes it easy to count words, join sentiment dictionaries, and compute summary statistics by group.

6. Sentiment Analysis

6.1 The Bing Lexicon (Binary Positive/Negative)

The Bing lexicon classifies each word as either "positive" or "negative" — a simple binary label with no magnitude.

sentiment_bing <- news_tokens %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many")

print(sentiment_bing)
## # A tibble: 15 × 4
##    source_name          query    word          sentiment
##    <chr>                <chr>    <chr>         <chr>    
##  1 TechRadar            Snapchat warning       negative 
##  2 Dailymail.com        Snapchat failed        negative 
##  3 CBC News             Snapchat complicated   negative 
##  4 Sportsnaut           Youtube  magnificently positive 
##  5 Help Net Security    Threads  stolen        negative 
##  6 Help Net Security    Threads  attack        negative 
##  7 The Irish Times      Threads  breaks        negative 
##  8 The Irish Times      Threads  imaginative   positive 
##  9 Freerepublic.com     Threads  fear          negative 
## 10 Snopes.com           Threads  smarter       positive 
## 11 Snopes.com           Threads  trump         positive 
## 12 Freerepublic.com     Threads  fear          negative 
## 13 Hogs Haven           Twitch   poised        positive 
## 14 Pro Football Network Twitch   mock          negative 
## 15 NESN                 Twitch   foolish       negative

Graph 1: Word Sentiment Contribution

This chart shows the top 10 words contributing to positive sentiment and the top 10 contributing to negative sentiment across all headlines.

sentiment_bing %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
  labs(
    title = "Social Media Platform Headlines",
    x = "Frequency (Word Count)",
    y = NULL
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(face = "bold", size = 12)
  )

Interpretation: Words on the left panel are pulling overall sentiment down; words on the right are pulling it up. A notable finding is that “trump” appears as a positive word. This is a known limitation of lexicon-based sentiment analysis: the Bing lexicon classifies “trump” by its dictionary meaning (“to surpass or outdo”), not as a reference to the political figure. The match therefore reflects word sense rather than the tone of the story, so it says nothing about how the headline, or the public, views President Trump. Here it inflates the positive count for the Threads query and should be treated as a false positive.
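
One way to handle this kind of false positive is to remove known problem words before interpreting the Bing matches. The sketch below is a minimal example under stated assumptions: the toy_matches tibble is typed in by hand from the Threads rows of the sentiment_bing printout above (it is not a fresh API pull), and custom_exclusions is a hypothetical list you would maintain yourself.

```r
library(dplyr)

# Threads rows copied by hand from the sentiment_bing printout (illustrative data).
toy_matches <- tibble(
  query     = "Threads",
  word      = c("stolen", "attack", "breaks", "imaginative",
                "fear", "smarter", "trump", "fear"),
  sentiment = c("negative", "negative", "negative", "positive",
                "negative", "positive", "positive", "negative")
)

# Hypothetical exclusion list: words the lexicon scores by the wrong sense.
custom_exclusions <- tibble(word = c("trump"))

# Drop excluded words, then recount matches per topic and sentiment.
filtered_matches <- toy_matches %>%
  anti_join(custom_exclusions, by = "word")

filtered_counts <- filtered_matches %>%
  count(query, sentiment)
print(filtered_counts)
```

With “trump” excluded, the Threads positive count falls from 3 to 2 while the negative count stays at 5. In the tutorial pipeline, the same anti_join() could be applied to news_tokens before the inner_join() with the lexicon.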

Graph 2: Sentiment Volume Comparison Across Topics

If you pulled multiple topics (brands), this chart compares how many positive vs. negative sentiment-words appear in each topic’s headlines.

sentiment_bing %>%
  count(query, sentiment) %>%
  ggplot(aes(x = query, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
  labs(
    title = "Volume of Sentiment Words",
    subtitle = "Total counts of matched emotional words in headlines",
    x = "Platform",
    y = "Number of Words Matched",
    fill = "Sentiment Class"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

Interpretation: A higher total bar (positive + negative) for a topic means that topic generated more emotionally charged language overall, which could reflect either higher news volume or more dramatic events. In our analysis, three of the four platforms skewed negative. Threads had the highest total sentiment word count (5 negative, 3 positive), indicating it generated the most emotionally charged coverage. Snapchat had zero positive matches, so every sentiment-matched word in its headlines was negative (3 in total). Twitch fell in between, with 2 negative matches and 1 positive. YouTube is the exception: its only matched word (“magnificently”) was positive. With just 15 matched words overall, these differences are a snapshot of this set of headlines rather than firm evidence of how the media frames each platform.

6.2 Top Words Overall

Independent of sentiment, it’s useful to see which words dominate the headlines overall.

news_tokens %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = n)) +
  geom_col(show.legend = FALSE) +
  scale_fill_gradient(low = "#a8d8ea", high = "#0077b6") +
  labs(
    title   = "Top 20 Words in News Headlines",
    x       = "Count", y = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 13)

Interpretation: While words like “snapchat,” “snap,” and “social” are clearly tied to the platforms we searched, many of the top 20 words (such as “catholic,” “xbox,” “june,” and “caucus”) have no obvious connection to Snapchat, YouTube, Threads, or Twitch. This reflects the limitation noted earlier, where broad search terms pulled in unrelated articles, introducing noise into the word frequency counts. These irrelevant words should be interpreted with caution and would ideally be filtered out in a more refined analysis using more specific search queries.
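
A lightweight way to reduce this noise without changing the API queries is to keep only articles whose title mentions the search term. The sketch below uses a small made-up data frame (toy_news) with the same title and query columns as news_df; the helper name keep_on_topic() is hypothetical. It is deliberately crude: it drops relevant articles that mention the platform only in the body text, and it cannot tell the Threads app from any other use of the word “threads.”

```r
# Hypothetical helper: keep rows whose title contains the query term (case-insensitive).
keep_on_topic <- function(df) {
  hit <- vapply(
    seq_len(nrow(df)),
    function(i) grepl(tolower(df$query[i]), tolower(df$title[i]), fixed = TRUE),
    logical(1)
  )
  df[hit, , drop = FALSE]
}

# Made-up headlines for illustration only.
toy_news <- data.frame(
  query = c("Twitch", "Twitch", "Threads"),
  title = c(
    "Twitch streamer breaks viewership record",
    "Mock draft: team poised for a big pick",
    "Meta adds new features to Threads"
  ),
  stringsAsFactors = FALSE
)

on_topic <- keep_on_topic(toy_news)
print(on_topic)
```

Only the first and third rows survive, because the second title never mentions “Twitch.” Applied to news_df before tokenization, a filter like this trades some recall for cleaner word counts.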

6.3 The AFINN Lexicon (Scored -5 to +5)

Unlike Bing’s binary labels, AFINN assigns each word a numeric score from -5 (very negative) to +5 (very positive), allowing us to compute average sentiment intensity per topic.

news_tokens %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(query) %>%
  summarise(
    words_matched  = n(),
    mean_sentiment = round(mean(value), 3),
    sum_sentiment  = sum(value),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_sentiment)) %>%
  kable(caption = "AFINN Sentiment Score by Topic") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

AFINN Sentiment Score by Topic

| query | words_matched | mean_sentiment | sum_sentiment |
|---|---|---|---|
| Threads | 7 | -0.429 | -3 |
| Snapchat | 5 | -1.200 | -6 |
| Twitch | 3 | -2.000 | -6 |

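Interpretation: mean_sentiment tells you the average emotional tone of matched words for each topic. A value near 0 suggests neutral or mixed coverage, while a clearly positive or negative mean suggests a dominant tone. sum_sentiment reflects total emotional “weight,” which is influenced by both tone and volume of coverage. The “Youtube” query is missing from the table because none of its headline tokens matched the AFINN lexicon, and inner_join() drops topics with no matches. With only 3 to 7 matched words per topic, the means are sensitive to individual words.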

6.4 Bing Sentiment Split (Positive vs. Negative Counts)

This table reshapes the Bing results into a wide format so you can directly compare positive counts, negative counts, and the net (positive − negative) score for each topic.

news_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(query, sentiment) %>%
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = list(n = 0)
  ) %>%
  mutate(
    positive = coalesce(positive, 0L),
    negative = coalesce(negative, 0L),
    net      = positive - negative
  ) %>%
  kable(caption = "Bing Sentiment Count by Topic") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Bing Sentiment Count by Topic

| query | negative | positive | net |
|---|---|---|---|
| Snapchat | 3 | 0 | -3 |
| Threads | 5 | 3 | -2 |
| Twitch | 2 | 1 | -1 |
| Youtube | 0 | 1 | 1 |

Interpretation: The net column is a simple, interpretable sentiment index. A positive net score suggests headlines skew favorable; a negative net score suggests the opposite. This kind of index is easy to track over time (e.g., weekly) to build a brand sentiment trendline.

7. TF-IDF: What Makes Each Topic’s Coverage Distinct?

TF-IDF (Term Frequency–Inverse Document Frequency) identifies words that are frequent within one topic’s headlines but rare across other topics’ headlines. This is especially useful for understanding what makes coverage of one brand distinctive compared to another.

news_tokens %>%
  count(query, word) %>%
  bind_tf_idf(word, query, n) %>%
  group_by(query) %>%
  slice_max(tf_idf, n = 6) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, query)) %>%
  ggplot(aes(x = tf_idf, y = word, fill = query)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ query, scales = "free_y", ncol = 2) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title   = "Top TF-IDF Terms by Topic",
    x       = "TF-IDF Score", y = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 12)
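
To make the TF-IDF weighting concrete, the sketch below computes it by hand on made-up counts for two topics. It assumes the standard definition (term frequency within a topic multiplied by the natural log of the number of topics divided by the number of topics containing the word), which, as far as we are aware, is the same definition bind_tf_idf() applies. The toy_counts values are invented for illustration.

```r
library(dplyr)

# Made-up word counts per topic (not taken from the real headlines).
toy_counts <- tibble(
  query = c("Snapchat", "Snapchat", "Twitch", "Twitch"),
  word  = c("snap", "social", "stream", "social"),
  n     = c(2, 1, 2, 1)
)

n_docs <- n_distinct(toy_counts$query)  # each topic is treated as one "document"

tfidf_toy <- toy_counts %>%
  group_by(query) %>%
  mutate(tf = n / sum(n)) %>%                          # share of the topic's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(query))) %>%    # rarer across topics = higher
  ungroup() %>%
  mutate(tf_idf = tf * idf)

print(tfidf_toy)
```

“social” appears in both topics, so its idf is log(2 / 2) = 0 and its TF-IDF is 0 everywhere. “snap” and “stream” each appear in only one topic, so they receive positive scores. This is why words shared by all four platforms never show up in the TF-IDF plot.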

7.1 Refining the TF-IDF Plot for Many Topics

When you have more than two topics, the basic plot above can get crowded. The refined version below dynamically adjusts the color palette to the number of topics, reduces the number of terms shown per topic, truncates long words, and arranges panels in a grid for readability.

# Count topics (query), not outlets: the fill aesthetic below is mapped to query
n_topics <- n_distinct(news_tokens$query)

news_tokens %>%
  count(query, word) %>%
  bind_tf_idf(word, query, n) %>%
  group_by(query) %>%
  slice_max(tf_idf, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(
    word = str_trunc(word, 20),
    word = reorder_within(word, tf_idf, query)
  ) %>%
  ggplot(aes(x = tf_idf, y = word, fill = query)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ query, scales = "free_y", ncol = 4) +
  scale_y_reordered() +
  scale_fill_manual(
    values = colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))(n_topics)
  ) +
  labs(
    title   = "Top TF-IDF Terms by Topic",
    x       = "TF-IDF Score",
    y       = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 10) +
  theme(
    strip.text    = element_text(size = 8, face = "bold"),
    axis.text.y   = element_text(size = 4),
    panel.spacing = unit(1.2, "lines")
  )

# Save at a large size (16 x 14 in) so labels don't crowd
ggsave("tfidf_plot.png", width = 16, height = 14, dpi = 150)

Interpretation: Words with high TF-IDF scores for a given platform are the terms that “define” that platform’s coverage relative to the others — these are often platform-specific features, events, or controversies unique to that company. For a marketing analyst, these words are strong candidates for keyword tracking, campaign hashtag ideas, or identifying emerging narratives unique to one platform versus another. It is worth noting however that due to the broad search terms used, some high TF-IDF words may reflect off-topic articles rather than genuine platform-specific coverage, which is a limitation of this analysis.

8. Putting It All Together

A complete brand-monitoring workflow using this tutorial would typically run on a schedule (e.g., daily or weekly):

  1. Pull fresh headlines for your brand and 1-2 key competitors.
  2. Clean and deduplicate.
  3. Tokenize and score sentiment (Bing for simple positive/negative counts, AFINN for an intensity-weighted index).
  4. Track the net sentiment score and mean_sentiment over time.
  5. Use TF-IDF periodically to check whether the themes of coverage are shifting (e.g., from “product launch” language to “regulatory” language).
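
Step 4 needs somewhere to accumulate scores between runs. The sketch below is a minimal example, assuming a simple CSV log. The function append_sentiment_snapshot(), the temporary file path, and the second run date are illustrative assumptions; the net_scores values are copied from the Bing table above, and a real scheduled job would compute them from a fresh pull each time.

```r
# Hypothetical helper: append one dated row per topic to a CSV log.
append_sentiment_snapshot <- function(net_scores, path, run_date = Sys.Date()) {
  snapshot <- data.frame(
    run_date = as.character(run_date),
    query    = net_scores$query,
    net      = net_scores$net,
    stringsAsFactors = FALSE
  )
  # Write the header only when the file is first created.
  write_header <- !file.exists(path)
  write.table(
    snapshot, file = path, sep = ",", row.names = FALSE,
    col.names = write_header, append = !write_header
  )
  invisible(snapshot)
}

# Net scores copied from the "Bing Sentiment Count by Topic" table.
net_scores <- data.frame(
  query = c("Snapchat", "Threads", "Twitch", "Youtube"),
  net   = c(-3L, -2L, -1L, 1L),
  stringsAsFactors = FALSE
)

log_path <- tempfile(fileext = ".csv")
append_sentiment_snapshot(net_scores, log_path, run_date = as.Date("2026-06-20"))
# The second call reuses the same numbers purely to demonstrate appending.
append_sentiment_snapshot(net_scores, log_path, run_date = as.Date("2026-06-27"))

history <- read.csv(log_path, stringsAsFactors = FALSE)
print(history)
```

Each run adds one row per topic, so history can be passed directly to ggplot() with run_date on the x-axis and net on the y-axis to draw the trendline described in step 4.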