NewsAPI access in R is provided through the
news-r/newsapi package on GitHub (not on CRAN), so we
install it with remotes. We also load the
tidyverse for data manipulation, tidytext and
textdata for text mining and sentiment lexicons,
lubridate for date handling, and
knitr/kableExtra for nicely formatted
tables.
# Run once - uncomment if these packages are not yet installed
# install.packages(c("remotes", "tidyverse", "tidytext", "textdata", "lubridate", "knitr", "kableExtra", "newsanchor"))
# remotes::install_github("news-r/newsapi")
library(newsapi)
library(tidyverse)
library(tidytext)
library(textdata)
library(lubridate)
library(knitr)
library(kableExtra)
library(ggplot2)
library(newsanchor)
NewsAPI requires a free API key, available at newsapi.org. Never hard-code your
API key in a script you plan to share or commit to version
control. For class purposes, store your key as an environment
variable (e.g., in a .Renviron file) and reference it with
Sys.getenv().
# Read the key from the NEWSAPI_KEY environment variable (set in .Renviron); do not paste the key itself here
api_key <- Sys.getenv("NEWSAPI_KEY")
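One way to fail early when the key is missing is a small helper that reads the variable and stops with a clear message. This is a minimal sketch: get_api_key() is a hypothetical helper name, not part of any package, and it assumes the key is stored under NEWSAPI_KEY as above.

```r
# Hypothetical helper: read an API key from an environment variable and
# stop with a readable message if the variable is missing or empty.
get_api_key <- function(var = "NEWSAPI_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(
      "Environment variable '", var, "' is not set. ",
      "Add a line such as ", var, "=your_key_here to ~/.Renviron and restart R.",
      call. = FALSE
    )
  }
  key
}
```

Calling api_key <- get_api_key() in place of the bare Sys.getenv() call would then turn an empty key into an immediate, readable error instead of a failed API request later on.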
Tip: A common extension of this tutorial is to pull headlines for multiple brands (e.g., a focal brand and 1-2 competitors) so that the sentiment comparisons later in this tutorial are more meaningful. The commented-out top_headlines("Anthropic") line in the original script is an example of adding a second topic for comparison.
library(httr)
library(jsonlite)
library(tidyverse)
# Fetch the most recent English-language articles matching `query`
# from the NewsAPI "everything" endpoint and return them as a tibble.
fetch_news <- function(query, api_key, page_size = 10) {
  response <- GET(
    url = "https://newsapi.org/v2/everything",
    query = list(
      q = query,
      language = "en",
      sortBy = "publishedAt",
      pageSize = page_size,
      apiKey = api_key
    )
  )
  # Surface the actual error message from NewsAPI
  if (status_code(response) != 200) {
    msg <- content(response, as = "parsed")$message
    stop("NewsAPI error for '", query, "': ", msg)
  }
  # Parse the JSON body; flatten = TRUE expands the nested `source` field
  parsed <- content(response, as = "text", encoding = "UTF-8")
  articles <- fromJSON(parsed, flatten = TRUE)$articles
  # Replace dots in column names (source.id -> source_id) and tag rows with the query
  as_tibble(articles) %>%
    rename_with(~ str_replace_all(.x, "\\.", "_")) %>%
    mutate(query = query)
}
news_raw <- bind_rows(
fetch_news("Snapchat", api_key),
fetch_news("Youtube", api_key),
fetch_news("Threads", api_key),
fetch_news("Twitch", api_key)
)
glimpse(news_raw)
## Rows: 40
## Columns: 10
## $ author <chr> "BetaList", "Faheem Tahir", "ResearchBuzz", "finance.yahoo…
## $ title <chr> "ViewSnapStories – View and download Snapchat stories, spo…
## $ description <chr> "View and download Snapchat stories, spotlight, videos, wi…
## $ url <chr> "https://betalist.com/startups/viewsnapstories", "https://…
## $ urlToImage <chr> "https://resize.imagekit.co/48nsRSzsS0CRAFb7JvAyRuQ63voWTO…
## $ publishedAt <chr> "2026-06-20T21:00:00Z", "2026-06-20T19:03:26Z", "2026-06-2…
## $ content <chr> "ViewSnapStories is a web-based tool that allows users to …
## $ source_id <chr> NA, NA, NA, NA, NA, "breitbart-news", NA, "techradar", NA,…
## $ source_name <chr> "Betalist.com", "Yahoo Entertainment", "Researchbuzz.me", …
## $ query <chr> "Snapchat", "Snapchat", "Snapchat", "Snapchat", "Snapchat"…
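To see why the nested source field ends up as source_id and source_name, the parsing step can be tried offline on a hand-written response. The JSON below is a made-up stand-in with the same general shape as the articles array, not real API output, and the object names are illustrative.

```r
library(jsonlite)
library(dplyr)
library(stringr)

# Made-up payload mimicking the shape of a NewsAPI response (assumption).
toy_json <- '{
  "status": "ok",
  "articles": [
    {"source": {"id": null, "name": "Example.com"},
     "title": "First toy headline", "publishedAt": "2026-06-20T21:00:00Z"},
    {"source": {"id": "example-news", "name": "Example News"},
     "title": "Second toy headline", "publishedAt": "2026-06-20T19:03:26Z"}
  ]
}'

# flatten = TRUE turns the nested source object into source.id / source.name
toy_articles <- fromJSON(toy_json, flatten = TRUE)$articles

toy_tbl <- as_tibble(toy_articles) %>%
  rename_with(~ str_replace_all(.x, "\\.", "_")) %>%  # dots -> underscores
  mutate(query = "toy")

names(toy_tbl)
```

A null id in the JSON becomes NA in source_id, which matches the many NA values in the glimpse() output above.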
news_raw %>%
filter(!is.na(title)) %>%
mutate(
pub_date = ymd_hms(publishedAt, quiet = TRUE),
pub_day = as.Date(pub_date),
title_clean = str_remove(title, "\\s*-\\s*[^-]+$"),
title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
title_clean = str_to_lower(title_clean)
) %>%
group_by(title_clean) %>%
filter(n() > 1) %>%
arrange(title_clean)
We now apply the same cleaning steps and keep only one copy of each unique (cleaned) headline:
news_clean <- news_raw %>%
filter(!is.na(.data$title)) %>%
mutate(
pub_date = ymd_hms(.data$publishedAt, quiet = TRUE),
pub_day = as.Date(pub_date),
title_clean = str_remove(.data$title, "\\s*-\\s*[^-]+$"),
title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
title_clean = str_to_lower(title_clean)
) %>%
distinct(title_clean, .keep_all = TRUE)
dim(news_clean)
## [1] 39 13
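The trailing-source pattern deserves a quick check, because \\s*-\\s* also matches a hyphen with no spaces around it. The sketch below compares it with a stricter variant that requires whitespace on both sides of the dash. The stricter pattern is a suggested alternative, not the one used to produce the output in this tutorial, and the second headline is invented for illustration.

```r
library(stringr)

headlines <- c(
  "Introduces AI-Powered Advertising Suite to Streamline Campaign Workflow",
  "Example platform adds new feature - Example News"  # invented headline
)

# Pattern used above: the dash may have zero spaces around it, so the last
# hyphenated word in a headline can trigger the removal.
loose <- str_remove(headlines, "\\s*-\\s*[^-]+$")

# Stricter variant (assumption): require whitespace on both sides of the dash.
strict <- str_remove(headlines, "\\s+-\\s+[^-]+$")

data.frame(loose = loose, strict = strict)
```

The loose pattern cuts the first headline down to "Introduces AI", which is where a title_clean value such as "introduces ai" comes from. Because tokenization later works on title rather than title_clean, this truncation affects only the duplicate check.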
If you pulled multiple topics (e.g., a focal brand and a competitor), bind them into a single data frame here. Because the fetch_news() results were already combined with bind_rows() above, this step simply standardizes the object for the rest of the pipeline and saves a CSV snapshot, which is useful for reproducibility and for sharing data with teammates who don’t have API access.
news_df <- bind_rows(news_clean) %>%
filter(!is.na(title)) # remove any empty rows
str(news_df)
## tibble [39 × 13] (S3: tbl_df/tbl/data.frame)
## $ author : chr [1:39] "BetaList" "Faheem Tahir" "ResearchBuzz" "finance.yahoo.com" ...
## $ title : chr [1:39] "ViewSnapStories – View and download Snapchat stories, spotlight, videos, without logging in." "Rosenblatt Keeps Neutral Rating On Snap (SNAP) After $2,195 Specs AR Glasses Debut" "Old Courthouse Heritage Museum, Rave Preservation Project, Firefox, More: Saturday Afternoon ResearchBuzz, June 20, 2026" "Introduces AI-Powered Advertising Suite to Streamline Campaign Workflow" ...
## $ description: chr [1:39] "View and download Snapchat stories, spotlight, videos, without logging in." "Snap Inc. (NYSE:SNAP) features on the list of tech stocks to sell according to billionaires. Billionaire stake "| __truncated__ "NEW RESOURCES EIN Presswire: Old Courthouse Heritage Museum Creates Free Digital Archive of Artifacts (PRESS RE"| __truncated__ "Snap Inc. (NYSE:SNAP) is one of the penny stocks with explosive growth potential. On June 18, Snapchat introduc"| __truncated__ ...
## $ url : chr [1:39] "https://betalist.com/startups/viewsnapstories" "https://finance.yahoo.com/markets/stocks/articles/rosenblatt-keeps-neutral-rating-snap-190326896.html" "https://researchbuzz.me/2026/06/20/old-courthouse-heritage-museum-rave-preservation-project-firefox-more-saturd"| __truncated__ "https://biztoc.com/x/cf1458dafeffe747" ...
## $ urlToImage : chr [1:39] "https://resize.imagekit.co/48nsRSzsS0CRAFb7JvAyRuQ63voWTODJwv_-G3obL7M/plain/s3://betalist-production/7xt7s2krn"| __truncated__ "https://s.yimg.com/lo/mysterio/api/9EF4BA75120DFBC8970A27E17976460023AD003163F08273AA6C582D47C3C4BD/subgraphmys"| __truncated__ "https://s0.wp.com/_si/?t=eyJpbWciOiJodHRwczpcL1wvczAud3AuY29tXC9pXC9ibGFuay5qcGciLCJ0eHQiOiJSZXNlYXJjaEJ1enoiLC"| __truncated__ "https://biztoc.com/cdn/cf1458dafeffe747_s.webp" ...
## $ publishedAt: chr [1:39] "2026-06-20T21:00:00Z" "2026-06-20T19:03:26Z" "2026-06-20T18:37:24Z" "2026-06-20T17:46:49Z" ...
## $ content : chr [1:39] "ViewSnapStories is a web-based tool that allows users to view and download Snapchat stories, spotlight videos, "| __truncated__ "Snap Inc. (NYSE:SNAP) features on the list of tech stocks to sell according to billionaires. Billionaire stake "| __truncated__ "NEW RESOURCES \r\nEIN Presswire: Old Courthouse Heritage Museum Creates Free Digital Archive of Artifacts (PRES"| __truncated__ "Snap Inc. (NYSE:SNAP) is one of the penny stocks with explosive growth potential. On June 18, Snapchat introduc"| __truncated__ ...
## $ source_id : chr [1:39] NA NA NA NA ...
## $ source_name: chr [1:39] "Betalist.com" "Yahoo Entertainment" "Researchbuzz.me" "Biztoc.com" ...
## $ query : chr [1:39] "Snapchat" "Snapchat" "Snapchat" "Snapchat" ...
## $ pub_date : POSIXct[1:39], format: "2026-06-20 21:00:00" "2026-06-20 19:03:26" ...
## $ pub_day : Date[1:39], format: "2026-06-20" "2026-06-20" ...
## $ title_clean: chr [1:39] "viewsnapstories view and download snapchat stories spotlight videos without logging in" "rosenblatt keeps neutral rating on snap snap after 2 195 specs ar glasses debut" "old courthouse heritage museum rave preservation project firefox more saturday afternoon researchbuzz june 20 2026" "introduces ai" ...
write.csv(news_df, "news_df.csv", row.names = FALSE) # omit the row-number column
news_df %>%
select(source_name, title, pub_day) %>%
head(10) %>%
kable(caption = "Sample Cleaned Headlines") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| source_name | title | pub_day |
|---|---|---|
| Betalist.com | ViewSnapStories – View and download Snapchat stories, spotlight, videos, without logging in. | 2026-06-20 |
| Yahoo Entertainment | Rosenblatt Keeps Neutral Rating On Snap (SNAP) After $2,195 Specs AR Glasses Debut | 2026-06-20 |
| Researchbuzz.me | Old Courthouse Heritage Museum, Rave Preservation Project, Firefox, More: Saturday Afternoon ResearchBuzz, June 20, 2026 | 2026-06-20 |
| Biztoc.com | Introduces AI-Powered Advertising Suite to Streamline Campaign Workflow | 2026-06-20 |
| Snapchat.com | Paint Party Ideas Videos | 2026-06-20 |
| Breitbart News | Federal Appeals Court Allows Ohio to Enforce Social Media Law Requiring Parental Consent for Minors | 2026-06-20 |
| Researchbuzz.me | In the Weights, Early Web Links, Snapchat, More: Saturday ResearchBuzz, June 20, 2026 | 2026-06-20 |
| TechRadar | ICYMI: the week’s 7 biggest tech news stories, from Commodore flip-phone nostalgia to Tim Cook’s Apple price-hike warning | 2026-06-20 |
| Dailymail.com | Ashley Cain’s axed documentary series is still available to watch on BBC iPlayer following his resurfaced misogynistic tweets as bosses say their vetting processes on the star ‘clearly failed’ | 2026-06-20 |
| CBC News | Is Canada’s teen social media ban constitutional? It’s complicated | 2026-06-20 |
Interpretation: At this stage you should have a tidy data frame where each row is a unique news headline, with a clean publication date and a query column indicating which platform search returned it. This is the foundation for everything that follows. If dim(news_clean) shows far fewer rows than dim(news_raw), a substantial share of “results” were duplicate stories, which is a useful sanity check before drawing conclusions about coverage volume across Snapchat, YouTube, Threads, and Twitch. Here the cleaning step removed only one row (40 raw rows versus 39 cleaned rows).
While the top 10 headlines previewed here were relevant to the four platforms, expanding the preview to 40 rows revealed articles with no clear connection to any of the four companies. Broad search terms like “Threads” and “Twitch” can pull in unrelated content, a limitation worth keeping in mind when interpreting the sentiment results downstream.
# 5. Tokenization
To analyze sentiment and word usage, we need to break each headline into individual words (“tokens”), remove common stop words (e.g., “the,” “and,” “of”) that carry little analytical meaning, and filter out pure numbers and very short tokens.
news_tokens <- news_df %>%
  select(source_name, title, query) %>%
  unnest_tokens(word, title) %>%                 # one lowercase word per row
  anti_join(stop_words, by = "word") %>%         # drop common stop words
  filter(!str_detect(word, "^\\d+$"), nchar(word) > 2)  # drop pure numbers and short tokens
Interpretation: news_tokens is now a
“one-token-per-row” data frame, the standard format for text mining
with tidytext. Each row represents one meaningful word from
one headline, tagged with the publisher (source_name) and the search
topic (query) it came from.
This long format makes it easy to count words, join sentiment
dictionaries, and compute summary statistics by group.
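A single invented headline makes the effect of each step easier to see. This is a toy sketch: the publisher and headline are made up, and the object names are illustrative.

```r
library(dplyr)
library(stringr)
library(tidytext)

toy <- tibble(
  source_name = "Example News",  # invented publisher
  query       = "Snapchat",
  title       = "Snapchat unveils 3 camera filters for the 2026 season"
)

toy_tokens <- toy %>%
  unnest_tokens(word, title) %>%                       # lowercase, one word per row
  anti_join(stop_words, by = "word") %>%               # drops words such as "for" and "the"
  filter(!str_detect(word, "^\\d+$"), nchar(word) > 2) # drops "3" and "2026"

toy_tokens$word
```

The source_name and query columns are repeated on every row, which is what allows grouping by topic in the sentiment steps that follow.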
The Bing lexicon classifies each word as either
"positive" or "negative" — a simple binary
label with no magnitude.
sentiment_bing <- news_tokens %>%
inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many")
print(sentiment_bing)
## # A tibble: 15 × 4
## source_name query word sentiment
## <chr> <chr> <chr> <chr>
## 1 TechRadar Snapchat warning negative
## 2 Dailymail.com Snapchat failed negative
## 3 CBC News Snapchat complicated negative
## 4 Sportsnaut Youtube magnificently positive
## 5 Help Net Security Threads stolen negative
## 6 Help Net Security Threads attack negative
## 7 The Irish Times Threads breaks negative
## 8 The Irish Times Threads imaginative positive
## 9 Freerepublic.com Threads fear negative
## 10 Snopes.com Threads smarter positive
## 11 Snopes.com Threads trump positive
## 12 Freerepublic.com Threads fear negative
## 13 Hogs Haven Twitch poised positive
## 14 Pro Football Network Twitch mock negative
## 15 NESN Twitch foolish negative
This chart shows the top 10 words contributing to positive sentiment and the top 10 contributing to negative sentiment across all headlines.
sentiment_bing %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
slice_max(n, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = n, y = word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
labs(
title = "Social Media Platform Headlines",
x = "Frequency (Word Count)",
y = NULL
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
strip.text = element_text(face = "bold", size = 12)
)
Interpretation: Words in the left panel (negative) are pulling overall sentiment down; words in the right panel (positive) are pulling it up. A notable finding is that “trump” appears as a positive word. This is a known limitation of lexicon-based sentiment analysis: the Bing lexicon classifies “trump” by its dictionary meaning (“to surpass or outdo”) rather than as a reference to the political figure. In news headlines the word usually refers to the person, so the match is best read as a false positive, and the sentiment score here is likely misleading rather than reflective of actual public opinion.
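One way to handle a word like “trump” is to remove it from the lexicon before joining. The sketch below assumes the analyst decides which words to exclude; lexicon_exclusions, bing_custom, and the toy tokens are illustrative names and data, not part of the original pipeline.

```r
library(dplyr)
library(tidytext)

# Words to drop from the lexicon in a news context (analyst's choice).
lexicon_exclusions <- tibble(word = c("trump"))

bing_custom <- get_sentiments("bing") %>%
  anti_join(lexicon_exclusions, by = "word")

# Toy tokens standing in for news_tokens (invented for illustration).
toy_tokens <- tibble(
  query = c("Threads", "Threads", "Threads"),
  word  = c("trump", "fear", "smarter")
)

toy_scored <- toy_tokens %>%
  inner_join(bing_custom, by = "word")

toy_scored
```

Keeping the exclusions in their own small table makes the adjustment visible and easy to report alongside the results.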
If you pulled multiple topics (brands), this chart compares how many positive vs. negative sentiment-words appear in each topic’s headlines.
sentiment_bing %>%
count(query, sentiment) %>%
ggplot(aes(x = query, y = n, fill = sentiment)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
labs(
title = "Volume of Sentiment Words",
subtitle = "Total counts of matched emotional words in headlines",
x = "Platform",
y = "Number of Words Matched",
fill = "Sentiment Class"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
Interpretation: A higher total bar (positive + negative) for a topic means that topic generated more emotionally charged language overall, which could reflect either higher news volume or more dramatic events. In our analysis, three of the four platforms (Snapchat, Threads, and Twitch) skewed negative. Threads had the highest total sentiment word count (5 negative, 3 positive), indicating it generated the most emotionally charged coverage. Snapchat had zero positive matches, so all 3 of its sentiment-matched words were negative. Twitch fell in between, with 2 negative matches and 1 positive. YouTube had a single match, and it was positive (“magnificently”), which is far too little evidence to call its coverage favorable. With only 15 matched words in total, this snapshot suggests a negative lean in media framing but cannot support firm conclusions about any single platform.
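To separate tone from sheer volume, matched words can be divided by the number of headlines per topic. The counts below are invented; in the real pipeline they would come from something like count(news_df, query) and count(sentiment_bing, query), which is an assumption about how you would wire this in.

```r
library(dplyr)

# Toy counts (invented) standing in for the real objects.
headlines_per_topic <- tibble(query = c("A", "B"), headlines = c(10L, 5L))
matches_per_topic   <- tibble(query = c("A", "B"), matched   = c(4L, 4L))

toy_rate <- headlines_per_topic %>%
  left_join(matches_per_topic, by = "query") %>%
  mutate(
    matched = coalesce(matched, 0L),            # topics with no matches count as 0
    matches_per_headline = matched / headlines  # sentiment words per headline
  )

toy_rate
```

Both toy topics have the same number of matched words, but topic B reaches it with half as many headlines, so its coverage is denser in sentiment words.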
Independent of sentiment, it’s useful to see which words dominate the headlines overall.
news_tokens %>%
count(word, sort = TRUE) %>%
slice_head(n = 20) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(x = n, y = word, fill = n)) +
geom_col(show.legend = FALSE) +
scale_fill_gradient(low = "#a8d8ea", high = "#0077b6") +
labs(
title = "Top 20 Words in News Headlines",
x = "Count", y = NULL,
caption = "Source: NewsAPI"
) +
theme_minimal(base_size = 13)
Interpretation: While words like “snapchat,” “snap,” and “social” are clearly tied to the platforms we searched, many of the top 20 words — such as “catholic,” “xbox,” “june,” and “caucus” — have no obvious connection to Snapchat, Youtube, Threads, or Twitch. This reflects the limitation noted earlier, where broad search terms pulled in unrelated articles, introducing noise into the word frequency counts. These irrelevant words should be interpreted with caution and would ideally be filtered out in a more refined analysis using more specific search queries.
Unlike Bing’s binary labels, AFINN assigns each word a numeric score from -5 (very negative) to +5 (very positive), allowing us to compute average sentiment intensity per topic.
news_tokens %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(query) %>%
summarise(
words_matched = n(),
mean_sentiment = round(mean(value), 3),
sum_sentiment = sum(value),
.groups = "drop"
) %>%
arrange(desc(mean_sentiment)) %>%
kable(caption = "AFINN Sentiment Score by Topic") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| query | words_matched | mean_sentiment | sum_sentiment |
|---|---|---|---|
| Threads | 7 | -0.429 | -3 |
| Snapchat | 5 | -1.200 | -6 |
| Twitch | 3 | -2.000 | -6 |
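Interpretation: mean_sentiment tells
you the average emotional tone of matched words for each topic:
a value near 0 suggests neutral/mixed coverage, while a clearly
positive or negative mean suggests a dominant tone.
sum_sentiment reflects total emotional “weight,” which is
influenced by both tone and volume of coverage.
Youtube is missing from the table because none of its headline words
matched the AFINN lexicon, so the inner join dropped it. With only 3 to 7
matched words per topic, each mean can be shifted by a single word.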
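The difference between the two summaries is easiest to see on a tiny hand-made example. The lexicon below only mimics the AFINN layout (word, value); its scores are invented for illustration and are not taken from AFINN.

```r
library(dplyr)

# Tiny hand-made lexicon in the AFINN layout; scores are invented.
toy_lexicon <- tibble(
  word  = c("win", "great", "crash", "scandal"),
  value = c(2, 3, -2, -3)
)

toy_tokens <- tibble(
  query = c("A", "A", "A", "B", "B"),
  word  = c("win", "great", "crash", "scandal", "crash")
)

toy_scores <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(query) %>%
  summarise(
    words_matched  = n(),
    mean_sentiment = mean(value),  # average tone per matched word
    sum_sentiment  = sum(value),   # total weight, grows with volume
    .groups = "drop"
  )

toy_scores
```

Topic A has a mean of 1 and a sum of 3, while topic B has a mean of -2.5 and a sum of -5. A topic with many mildly scored words can reach a large sum while its mean stays close to 0.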
This table reshapes the Bing results into a wide format so you can directly compare positive counts, negative counts, and the net (positive − negative) score for each topic.
news_tokens %>%
inner_join(get_sentiments("bing"), by = "word") %>%
count(query, sentiment) %>%
pivot_wider(
names_from = sentiment,
values_from = n,
values_fill = list(n = 0)
) %>%
mutate(
positive = coalesce(positive, 0L),
negative = coalesce(negative, 0L),
net = positive - negative
) %>%
kable(caption = "Bing Sentiment Count by Topic") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| query | negative | positive | net |
|---|---|---|---|
| Snapchat | 3 | 0 | -3 |
| Threads | 5 | 3 | -2 |
| Twitch | 2 | 1 | -1 |
| Youtube | 0 | 1 | 1 |
Interpretation: The net column is a
simple, interpretable sentiment index. A positive net score
suggests headlines skew favorable; a negative net score
suggests the opposite. This kind of index is easy to track over time
(e.g., weekly) to build a brand sentiment trendline.
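A weekly version of the net score could be sketched as follows. The input is invented, and it assumes pub_day has been carried through to the token level (the news_tokens object above keeps only source_name, title, and query, so pub_day would have to be added to that select() call).

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy input (invented): one row per matched sentiment word.
toy_matches <- tibble(
  query     = "A",
  pub_day   = as.Date(c("2026-06-08", "2026-06-09", "2026-06-16",
                        "2026-06-17", "2026-06-18")),
  sentiment = c("positive", "negative", "negative", "negative", "positive")
)

weekly_net <- toy_matches %>%
  mutate(week = floor_date(pub_day, unit = "week", week_start = 1)) %>%  # Monday-based weeks
  count(query, week, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

weekly_net
```

Each row of weekly_net is one topic-week, so the result can be passed to ggplot() with week on the x-axis and net on the y-axis.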
TF-IDF (Term Frequency–Inverse Document Frequency) identifies words that are frequent within one topic’s headlines but rare across other topics’ headlines. This is especially useful for understanding what makes coverage of one brand distinctive compared to another.
news_tokens %>%
count(query, word) %>%
bind_tf_idf(word, query, n) %>%
group_by(query) %>%
slice_max(tf_idf, n = 6) %>%
ungroup() %>%
mutate(word = reorder_within(word, tf_idf, query)) %>%
ggplot(aes(x = tf_idf, y = word, fill = query)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ query, scales = "free_y", ncol = 2) +
scale_y_reordered() +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Top TF-IDF Terms by Topic",
x = "TF-IDF Score", y = NULL,
caption = "Source: NewsAPI"
) +
theme_minimal(base_size = 12)
When you have more than two topics, the basic plot above can get crowded. The refined version below dynamically adjusts the color palette to the number of topics, reduces the number of terms shown per topic, truncates long words, and arranges panels in a grid for readability.
# Size the palette by the number of topics (the fill variable), not publishers
n_topics <- n_distinct(news_tokens$query)
news_tokens %>%
  count(query, word) %>%
  bind_tf_idf(word, query, n) %>%
  group_by(query) %>%
  slice_max(tf_idf, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(
    word = str_trunc(word, 20),
    word = reorder_within(word, tf_idf, query)
  ) %>%
  ggplot(aes(x = tf_idf, y = word, fill = query)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ query, scales = "free_y", ncol = 4) +
  scale_y_reordered() +
  scale_fill_manual(
    values = colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))(n_topics)
  ) +
  labs(
    title = "Top TF-IDF Terms by Topic",
    x = "TF-IDF Score",
    y = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 10) +
  theme(
    strip.text = element_text(size = 8, face = "bold"),
    axis.text.y = element_text(size = 4),
    panel.spacing = unit(1.2, "lines")
  )
# Save with a tall aspect ratio so labels don't crowd
ggsave("tfidf_plot.png", width = 16, height = 14, dpi = 150)
Interpretation: Words with high TF-IDF scores for a given platform are the terms that “define” that platform’s coverage relative to the others — these are often platform-specific features, events, or controversies unique to that company. For a marketing analyst, these words are strong candidates for keyword tracking, campaign hashtag ideas, or identifying emerging narratives unique to one platform versus another. It is worth noting however that due to the broad search terms used, some high TF-IDF words may reflect off-topic articles rather than genuine platform-specific coverage, which is a limitation of this analysis.
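A small worked example shows how the score behaves. The counts are invented; bind_tf_idf() computes tf as a word's share of its topic's total count and idf as the natural log of the number of topics divided by the number of topics containing the word.

```r
library(dplyr)
library(tidytext)

# Toy counts (invented): two topics and their word counts.
toy_counts <- tibble(
  query = c("A", "A", "B", "B"),
  word  = c("social", "lens", "social", "stream"),
  n     = c(2L, 2L, 1L, 3L)
)

toy_tfidf <- toy_counts %>%
  bind_tf_idf(word, query, n)

toy_tfidf
```

“social” appears in both topics, so its idf is log(2/2) = 0 and its tf_idf is 0 no matter how often it occurs. “lens” appears only in topic A, so it scores 0.5 * log(2), roughly 0.35. This is why words shared across platforms drop out of the charts above even when they are frequent.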
A complete brand-monitoring workflow using this tutorial would typically run on a schedule (e.g., daily or weekly):
net sentiment score and
mean_sentiment over time.