I chose to scrape news headlines about Tom Steyer, a candidate for governor of California.
Summary: In the context of AFINN and scraped headlines, sentiment analysis measures whether the tone of the words a publication uses is positive, negative, or neutral. It does not measure the publication's actual stance toward the subject (e.g., its political leaning). This explains why the graphs and charts for right-leaning news outlets like Fox News showed more positive sentiment in their language about Steyer, a Democrat: the top words in the news headlines were “tax,” “california,” “billionaire,” and “ballot,” all of which appear neutral or positive by AFINN’s sentiment measures.
NewsAPI access in R is provided through the
news-r/newsapi package on GitHub (not on CRAN), so we
install it with remotes. We also load the
tidyverse for data manipulation, tidytext and
textdata for text mining and sentiment lexicons,
lubridate for date handling, and
knitr/kableExtra for nicely formatted
tables.
# Run once - uncomment if these packages are not yet installed
# install.packages(c("remotes", "tidyverse", "tidytext", "textdata", "lubridate", "knitr", "kableExtra", "newsanchor"))
# remotes::install_github("news-r/newsapi")
library(newsanchor)
library(tidyverse)
library(tidytext)
library(textdata)
library(lubridate)
library(knitr)
library(kableExtra)
library(ggplot2)
NewsAPI requires a free API key, available at newsapi.org. Never hard-code your
API key in a script you plan to share or commit to version
control. For class purposes, store your key as an environment
variable (e.g., in a .Renviron file) and reference it with
Sys.getenv().
api_key <- Sys.getenv("NEWS_API_KEY")
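One way to make a missing key fail loudly is a small guard around Sys.getenv(). The sketch below is an assumption about how you might structure it, not part of NewsAPI or any package: require_api_key() is a hypothetical helper, and DEMO_NEWS_KEY is a placeholder variable set only so the example runs anywhere. In practice the key would live in a .Renviron line such as NEWS_API_KEY=your_key_here.

```r
# Hypothetical helper (not from any package): read a key from an
# environment variable and stop with a clear message if it is missing.
require_api_key <- function(var = "NEWS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable '", var, "' is not set. ",
         "Add it to your .Renviron file and restart R.")
  }
  key
}

# Demo with a placeholder variable so the example runs without a real key
Sys.setenv(DEMO_NEWS_KEY = "not-a-real-key")
demo_key <- require_api_key("DEMO_NEWS_KEY")
nchar(demo_key)  # inspect the length only; avoid printing real keys
```

Failing early like this keeps a blank key from surfacing later as a less obvious NewsAPI error.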
Tip: A common extension of this tutorial is to pull headlines for multiple brands (e.g., a focal brand and 1-2 competitors) so that the sentiment comparisons later in this tutorial are more meaningful. The commented-out
top_headlines("Anthropic") line in the original script is an example of adding a second topic for comparison.
library(httr)
library(jsonlite)
library(tidyverse)
fetch_news <- function(query, api_key, page_size = 10) {
  response <- GET(
    url = "https://newsapi.org/v2/everything",
    query = list(
      q = query,
      language = "en",
      sortBy = "publishedAt",
      pageSize = page_size,
      apiKey = api_key
    )
  )
  # Surface the actual error message from NewsAPI
  if (status_code(response) != 200) {
    msg <- content(response, as = "parsed")$message
    stop("NewsAPI error for '", query, "': ", msg)
  }
  parsed <- content(response, as = "text", encoding = "UTF-8")
  articles <- fromJSON(parsed, flatten = TRUE)$articles
  as_tibble(articles) %>%
    rename_with(~ str_replace_all(.x, "\\.", "_")) %>%
    mutate(query = query)
}
news_raw <- bind_rows(
fetch_news("Tom Steyer", api_key)
)
# Original Companies:
# news_raw <- bind_rows(
# fetch_news("AI", api_key),
# fetch_news("Anthropic", api_key),
# fetch_news("SpaceX", api_key),
# fetch_news("CoreWeave", api_key)
#)
glimpse(news_raw)
## Rows: 10
## Columns: 10
## $ author <chr> "Bjorn Lomborg", NA, "June 19, 2026", "Ed Kilgore", "Steve…
## $ title <chr> "Here’s what can come next with climate-change fever final…
## $ description <chr> "Fewer and fewer people are panicking about the climate \"…
## $ url <chr> "https://nypost.com/2026/06/19/opinion/heres-what-can-come…
## $ urlToImage <chr> "https://nypost.com/wp-content/uploads/sites/2/2026/06/cro…
## $ publishedAt <chr> "2026-06-19T22:00:00Z", "2026-06-19T21:35:40Z", "2026-06-1…
## $ content <chr> "Something was conspicuously missing from Californias prim…
## $ source_id <chr> NA, "fox-news", NA, "new-york-magazine", NA, NA, NA, NA, "…
## $ source_name <chr> "New York Post", "Fox News", "Hoover.org", "New York Magaz…
## $ query <chr> "Tom Steyer", "Tom Steyer", "Tom Steyer", "Tom Steyer", "T…
news_raw %>%
filter(!is.na(title)) %>%
mutate(
pub_date = ymd_hms(publishedAt, quiet = TRUE),
pub_day = as.Date(pub_date),
title_clean = str_remove(title, "\\s*-\\s*[^-]+$"),
title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
title_clean = str_to_lower(title_clean)
) %>%
group_by(title_clean) %>%
filter(n() > 1) %>%
arrange(title_clean)
We now apply the same cleaning steps and keep only one copy of each unique (cleaned) headline:
news_clean <- news_raw %>%
filter(!is.na(.data$title)) %>%
mutate(
pub_date = ymd_hms(.data$publishedAt, quiet = TRUE),
pub_day = as.Date(pub_date),
title_clean = str_remove(.data$title, "\\s*-\\s*[^-]+$"),
title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
title_clean = str_to_lower(title_clean)
) %>%
distinct(title_clean, .keep_all = TRUE)
dim(news_clean)
## [1] 10 13
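The title_clean values printed by str(news_df) suggest one side effect of the suffix pattern: because \\s* also matches zero spaces, a hyphen inside a word (as in “climate-change”) is treated like the start of a source suffix, and the rest of the headline is dropped. The base-R sketch below reproduces that on two sample titles and compares it with a stricter pattern that requires whitespace on both sides of the hyphen. The stricter pattern and the strip_suffix() helper are assumptions for illustration, not part of the original pipeline.

```r
titles <- c(
  "Here's what can come next with climate-change fever finally breaking",
  "SpaceX Launches Rocket - Reuters"
)

# Hypothetical helper: drop a trailing "- Source" suffix using a given pattern
strip_suffix <- function(x, pattern) sub(pattern, "", x, perl = TRUE)

loose  <- strip_suffix(titles, "\\s*-\\s*[^-]+$")  # pattern used in this tutorial
strict <- strip_suffix(titles, "\\s+-\\s+[^-]+$")  # assumed stricter alternative

# Compare the two results side by side
data.frame(loose = loose, strict = strict)
```

With the loose pattern the first title is cut off after “climate”, while the strict pattern leaves it intact; both remove the “- Reuters” suffix from the second title. A truncated title_clean can make unrelated headlines look like duplicates, so it is worth checking before relying on distinct().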
If you pulled multiple topics (e.g., SpaceX and a competitor), bind them into a single data frame here. With one topic, this step simply standardizes the object for the rest of the pipeline and saves a CSV snapshot — useful for reproducibility and for sharing data with teammates who don’t have API access.
news_df <- bind_rows(news_clean) %>%
filter(!is.na(title)) # remove any empty rows
str(news_df)
## tibble [10 × 13] (S3: tbl_df/tbl/data.frame)
## $ author : chr [1:10] "Bjorn Lomborg" NA "June 19, 2026" "Ed Kilgore" ...
## $ title : chr [1:10] "Here’s what can come next with climate-change fever finally breaking" "Double endorsement drama: Trump backs second candidate in red state’s GOP gubernatorial runoff" "California Update: First Couple Under Investigation; Wealth-Tax Deal Underway?" "California Billionaire Tax Faces Last Hurdle Before Ballot" ...
## $ description: chr [1:10] "Fewer and fewer people are panicking about the climate \"catastrophe.\" Gallup's latest survey of the world's m"| __truncated__ "Trump endorses both Wilson and Evette in South Carolina's GOP gubernatorial runoff, hedging his bets ahead of t"| __truncated__ "Monthly update on the implications of California’s First Couple under federal investigation for tax and financi"| __truncated__ "California billionaire tax faces last hurdle before ballot. The wealth tax has qualified for the November ballo"| __truncated__ ...
## $ url : chr [1:10] "https://nypost.com/2026/06/19/opinion/heres-what-can-come-next-with-climate-change-fever-finally-breaking/" "https://www.foxnews.com/politics/double-endorsement-drama-trump-backs-second-candidate-red-states-gop-gubernatorial-runoff" "https://www.hoover.org/research/california-update-first-couple-under-investigation-wealth-tax-deal-underway" "http://nymag.com/intelligencer/article/california-billionaire-tax-faces-last-hurdle-before-ballot.html" ...
## $ urlToImage : chr [1:10] "https://nypost.com/wp-content/uploads/sites/2/2026/06/crop-39724023_ba0ec2.jpg?quality=75&strip=all&w=1200" "https://static.foxnews.com/foxnews.com/content/uploads/2026/06/trump-usa-poll-1.jpg" "https://hoover-s3-website.s3.us-west-2.amazonaws.com/s3fs-public/styles/facebook/public/2024-06/Matters-of-Poli"| __truncated__ "https://pyxis.nymag.com/v1/imgs/d88/398/3d475e9049aed791c583be96d7728e57a1-ca-billionairetax.1x.rsocial.w1200.jpg" ...
## $ publishedAt: chr [1:10] "2026-06-19T22:00:00Z" "2026-06-19T21:35:40Z" "2026-06-19T00:00:00Z" "2026-06-18T22:25:27Z" ...
## $ content : chr [1:10] "Something was conspicuously missing from Californias primary this month. In the state that built its political "| __truncated__ "President Donald Trump is making an 11th-hour endorsement in the final stretch ahead of Tuesday's high-profile "| __truncated__ "<ul><li>State & Local</li><li>California</li><li>Economics</li><li>Law & Policy</li><li>Regulation &"| __truncated__ "Wealth taxes aimed at the very rich are a perennial favorite on the progressive end of the ideological spectrum"| __truncated__ ...
## $ source_id : chr [1:10] NA "fox-news" NA "new-york-magazine" ...
## $ source_name: chr [1:10] "New York Post" "Fox News" "Hoover.org" "New York Magazine" ...
## $ query : chr [1:10] "Tom Steyer" "Tom Steyer" "Tom Steyer" "Tom Steyer" ...
## $ pub_date : POSIXct[1:10], format: "2026-06-19 22:00:00" "2026-06-19 21:35:40" ...
## $ pub_day : Date[1:10], format: "2026-06-19" "2026-06-19" ...
## $ title_clean: chr [1:10] "here s what can come next with climate" "double endorsement drama trump backs second candidate in red state s gop gubernatorial runoff" "california update first couple under investigation wealth" "california billionaire tax faces last hurdle before ballot" ...
write.csv(news_df, "news_df.csv", row.names = FALSE) # skip the extra row-number column
news_df %>%
select(source_name, title, pub_day) %>%
head(10) %>%
kable(caption = "Sample Cleaned Headlines") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| source_name | title | pub_day |
|---|---|---|
| New York Post | Here’s what can come next with climate-change fever finally breaking | 2026-06-19 |
| Fox News | Double endorsement drama: Trump backs second candidate in red state’s GOP gubernatorial runoff | 2026-06-19 |
| Hoover.org | California Update: First Couple Under Investigation; Wealth-Tax Deal Underway? | 2026-06-19 |
| New York Magazine | California Billionaire Tax Faces Last Hurdle Before Ballot | 2026-06-18 |
| Reason | Did California’s Gubernatorial Race Reveal the Limits of ‘Abundance’ Politics on the Left? | 2026-06-18 |
| KPBS | A tax on billionaires qualified for the November ballot. 5 things to know about the measure | 2026-06-18 |
| KQED | 5 Things to Know About California’s New Billionaire Tax Measure | 2026-06-18 |
| CALmatters | California billionaire tax qualifies for November ballot | 2026-06-18 |
| Financial Post | Pimco Targets Out-of-Date Assets in New Real Estate Strategy | 2026-06-18 |
| NBC News | California billionaire tax proposal qualifies for the November ballot | 2026-06-18 |
Interpretation: At this stage you should have a tidy
data frame where each row is a unique news headline, with a clean
publication date, a source_name column identifying the outlet, and a
query column indicating which topic/brand search returned it. This is
the foundation for everything that follows. If dim(news_clean) shows
far fewer rows than dim(news_raw), that tells you a substantial share of
“results” were duplicate stories, which is a useful sanity check before
drawing conclusions about coverage volume. Here both have 10 rows, so no
duplicates were removed.
To analyze sentiment and word usage, we need to break each headline into individual words (“tokens”), remove common stop words (e.g., “the,” “and,” “of”) that carry little analytical meaning, and filter out pure numbers and very short tokens.
news_tokens <- news_df %>%
select(source_name, title) %>%
unnest_tokens(word, title) %>%
anti_join(stop_words, by = "word") %>%
filter(!str_detect(word, "^\\d+$"), nchar(word) > 2)
Interpretation: news_tokens is now a
“one-token-per-row” data frame, the standard format for text mining
with tidytext. Each row represents one meaningful word from
one headline, tagged with the news outlet (source_name) that published it.
This long format makes it easy to count words, join sentiment
dictionaries, and compute summary statistics by group.
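To make the one-token-per-row idea concrete, here is a rough base-R approximation of the steps above applied to a single headline from the sample table. It is only a sketch: unnest_tokens() uses its own tokenizer, and the four-word stop list below is a stand-in assumption for tidytext's much larger stop_words table, so results on real data will differ.

```r
headline <- "California billionaire tax qualifies for November ballot"

# Split on anything that is not a letter or digit, then lowercase
tokens <- tolower(unlist(strsplit(headline, "[^[:alnum:]]+")))

# Tiny stand-in stop list (assumption; tidytext's stop_words is far larger)
toy_stop_words <- c("the", "and", "of", "for")

keep <- !(tokens %in% toy_stop_words) &   # drop stop words
  !grepl("^[0-9]+$", tokens) &            # drop pure numbers
  nchar(tokens) > 2                       # drop very short tokens

# One row per remaining word, tagged with the outlet
toy_tokens <- data.frame(source_name = "CALmatters", word = tokens[keep])
toy_tokens
```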
The Bing lexicon classifies each word as either
"positive" or "negative" — a simple binary
label with no magnitude.
sentiment_bing <- news_tokens %>%
inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many")
print(sentiment_bing)
## # A tibble: 7 × 3
## source_name word sentiment
## <chr> <chr> <chr>
## 1 New York Post fever negative
## 2 New York Post breaking negative
## 3 Fox News endorsement positive
## 4 Fox News trump positive
## 5 Reason limits negative
## 6 Reason abundance positive
## 7 KPBS qualified positive
This chart shows the top 10 words contributing to positive sentiment and the top 10 contributing to negative sentiment across all headlines.
sentiment_bing %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
slice_max(n, n = 10, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = n, y = word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
labs(
title = "Top Words Driving Sentiment in News Headlines",
x = "Frequency (Word Count)",
y = NULL
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
strip.text = element_text(face = "bold", size = 12)
)
Interpretation: Words on the left panel are pulling overall sentiment down; words on the right are pulling it up. A marketing analyst would scan this chart for words tied to specific events (e.g., “crash,” “delay,” “explosion” vs. “record,” “success,” “win”) to understand what kind of news is driving the tone — not just whether the tone is positive or negative.
If you pulled multiple topics (brands), this chart compares how many positive vs. negative sentiment words appear in each topic’s headlines. In this run there is a single topic, so the bars are grouped by news outlet (source_name) instead.
sentiment_bing %>%
count(source_name, sentiment) %>%
ggplot(aes(x = source_name, y = n, fill = sentiment)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
labs(
title = "Volume of Sentiment Words",
subtitle = "Total counts of matched emotional words in headlines",
x = "News Source",
y = "Number of Words Matched",
fill = "Sentiment Class"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
Interpretation: A higher total bar (positive + negative) for a topic means that topic generated more emotionally-charged language overall — which could reflect either higher news volume or more dramatic events. Comparing the ratio of green to red bars across topics tells you which brand is currently enjoying more favorable framing in the media.
Independent of sentiment, it’s useful to see which words dominate the headlines overall.
news_tokens %>%
count(word, sort = TRUE) %>%
slice_head(n = 20) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(x = n, y = word, fill = n)) +
geom_col(show.legend = FALSE) +
scale_fill_gradient(low = "#a8d8ea", high = "#0077b6") +
labs(
title = "Top 20 Words in News Headlines",
x = "Count", y = NULL,
caption = "Source: NewsAPI"
) +
theme_minimal(base_size = 13)
Interpretation: This is your “what is everyone talking about” chart. Look for names of products, executives, partners, or events that recur — these are candidates for deeper investigation (e.g., is “Starship” appearing a lot because of a successful launch or a setback?).
Unlike Bing’s binary labels, AFINN assigns each word a numeric score from -5 (very negative) to +5 (very positive), allowing us to compute average sentiment intensity per topic.
news_tokens %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(source_name) %>%
summarise(
words_matched = n(),
mean_sentiment = round(mean(value), 3),
sum_sentiment = sum(value),
.groups = "drop"
) %>%
arrange(desc(mean_sentiment)) %>%
kable(caption = "AFINN Sentiment Score by Source") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| source_name | words_matched | mean_sentiment | sum_sentiment |
|---|---|---|---|
| Hoover.org | 1 | 3 | 3 |
| Financial Post | 1 | 2 | 2 |
| Fox News | 1 | 2 | 2 |
| Reason | 1 | -1 | -1 |
Interpretation: mean_sentiment tells
you the average emotional tone of matched words for each topic
— a value near 0 suggests neutral/mixed coverage, while a clearly
positive or negative mean suggests a dominant tone.
sum_sentiment reflects total emotional “weight,” which is
influenced by both tone and volume of coverage.
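Because mean_sentiment is an average over matched words only, it is worth reading it alongside words_matched. The sketch below rebuilds the table above by hand and flags rows whose mean rests on very few words. The min_words threshold of 3 is an arbitrary assumption chosen for illustration, not a recommendation from tidytext or AFINN.

```r
# Values copied from the AFINN table above
afinn_by_source <- data.frame(
  source_name    = c("Hoover.org", "Financial Post", "Fox News", "Reason"),
  words_matched  = c(1, 1, 1, 1),
  mean_sentiment = c(3, 2, 2, -1)
)

min_words <- 3  # assumed cutoff for a "stable enough" average

# TRUE when the mean is based on fewer matched words than the cutoff
afinn_by_source$low_evidence <- afinn_by_source$words_matched < min_words

afinn_by_source
```

In this run every source is flagged, since each mean is a single word's score. That supports reading these figures as word-level tone rather than as a publication's stance.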
This table reshapes the Bing results into a wide format so you can directly compare positive counts, negative counts, and the net (positive − negative) score for each topic.
news_tokens %>%
inner_join(get_sentiments("bing"), by = "word") %>%
count(source_name, sentiment) %>%
pivot_wider(
names_from = sentiment,
values_from = n,
values_fill = list(n = 0)
) %>%
mutate(
positive = coalesce(positive, 0L),
negative = coalesce(negative, 0L),
net = positive - negative
) %>%
kable(caption = "Bing Sentiment Count by Source") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| source_name | positive | negative | net |
|---|---|---|---|
| Fox News | 2 | 0 | 2 |
| KPBS | 1 | 0 | 1 |
| New York Post | 0 | 2 | -2 |
| Reason | 1 | 1 | 0 |
Interpretation: The net column is a
simple, interpretable sentiment index. A positive net score
suggests headlines skew favorable; a negative net score
suggests the opposite. This kind of index is easy to track over time
(e.g., weekly) to build a brand sentiment trendline.
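The net score can be reproduced by hand from the seven Bing matches printed earlier. This base-R sketch uses table() in place of count() and pivot_wider(); it is meant as a check on the arithmetic, not a replacement for the tidyverse pipeline.

```r
# The seven matched words from the Bing join output
bing_matches <- data.frame(
  source_name = c("New York Post", "New York Post", "Fox News", "Fox News",
                  "Reason", "Reason", "KPBS"),
  sentiment   = c("negative", "negative", "positive", "positive",
                  "negative", "positive", "positive")
)

# Cross-tabulate outlet by sentiment, then subtract negative from positive
bing_counts <- table(bing_matches$source_name, bing_matches$sentiment)
net <- bing_counts[, "positive"] - bing_counts[, "negative"]
net
```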
TF-IDF (Term Frequency–Inverse Document Frequency) identifies words that are frequent within one topic’s headlines but rare across other topics’ headlines. This is especially useful for understanding what makes coverage of one brand distinctive compared to another.
news_tokens %>%
count(source_name, word) %>%
bind_tf_idf(word, source_name, n) %>%
group_by(source_name) %>%
slice_max(tf_idf, n = 6) %>%
ungroup() %>%
mutate(word = reorder_within(word, tf_idf, source_name)) %>%
ggplot(aes(x = tf_idf, y = word, fill = source_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ source_name, scales = "free_y", ncol = 2) +
scale_y_reordered() +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Top TF-IDF Terms by Source",
x = "TF-IDF Score", y = NULL,
caption = "Source: NewsAPI"
) +
theme_minimal(base_size = 12)
When you have more than two topics, the basic plot above can get crowded. The refined version below dynamically adjusts the color palette to the number of topics, reduces the number of terms shown per topic, truncates long words, and arranges panels in a grid for readability.
n_sources <- n_distinct(news_tokens$source_name)
news_tokens %>%
count(source_name, word) %>%
bind_tf_idf(word, source_name, n) %>%
group_by(source_name) %>%
slice_max(tf_idf, n = 5, with_ties = FALSE) %>%
ungroup() %>%
mutate(
word = str_trunc(word, 20),
word = reorder_within(word, tf_idf, source_name)
) %>%
ggplot(aes(x = tf_idf, y = word, fill = source_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ source_name, scales = "free_y", ncol = 4) +
scale_y_reordered() +
scale_fill_manual(
values = colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))(n_sources)
) +
labs(
title = "Top TF-IDF Terms by Source",
x = "TF-IDF Score",
y = NULL,
caption = "Source: NewsAPI"
) +
theme_minimal(base_size = 10) +
theme(
strip.text = element_text(size = 8, face = "bold"),
axis.text.y = element_text(size = 4),
panel.spacing = unit(1.2, "lines")
)
# Save with a tall aspect ratio so labels don't crowd
ggsave("tfidf_plot.png", width = 16, height = 14, dpi = 150)
Interpretation: Words with high TF-IDF scores for a given topic are the terms that “define” that topic’s coverage relative to others — these are often product names, executives, locations, or event-specific terms. For a marketing analyst, these words are strong candidates for keyword tracking, campaign hashtag ideas, or identifying emerging narratives unique to your brand versus competitors.
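For intuition about what bind_tf_idf() computes, the following sketch works a toy example by hand: term frequency is a word's count divided by the total words in its document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The two “documents” (A and B) and their counts are invented for illustration.

```r
# Invented counts: one row per (document, word) pair
tfidf_toy <- data.frame(
  source_name = c("A", "A", "B", "B"),
  word        = c("tax", "ballot", "tax", "climate"),
  n           = c(2, 1, 1, 1)
)

# Term frequency: share of a document's words taken up by this word
doc_totals <- tapply(tfidf_toy$n, tfidf_toy$source_name, sum)
tfidf_toy$tf <- tfidf_toy$n / as.numeric(doc_totals[tfidf_toy$source_name])

# Inverse document frequency: ln(total documents / documents containing the word)
n_docs <- length(unique(tfidf_toy$source_name))
docs_with_word <- table(tfidf_toy$word)
tfidf_toy$idf <- log(n_docs / as.numeric(docs_with_word[tfidf_toy$word]))

tfidf_toy$tf_idf <- tfidf_toy$tf * tfidf_toy$idf
tfidf_toy
```

“tax” appears in both documents, so its idf is ln(2/2) = 0 and it drops out, while “ballot” and “climate” each appear in only one document and keep a positive score.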
A complete brand-monitoring workflow using this tutorial would typically run on a schedule (e.g., daily or weekly):
net sentiment score and mean_sentiment over time.
Which of the following best describes the difference between the Bing and AFINN sentiment lexicons used in this tutorial?
A. Bing scores words from -5 to +5; AFINN classifies words as positive or negative only.
B. Bing classifies words as positive or negative only; AFINN assigns a numeric intensity score from -5 to +5.
C. Both lexicons produce identical results because they are built from the same word list.
D. AFINN can only be used with French-language text.
Answer: B. Bing is a binary classifier (each word is
labeled "positive" or "negative"), while AFINN
assigns a numeric score between -5 and +5, allowing computation of an
average sentiment intensity rather than just counts.
In Section 4.3, why do we create a title_clean column by
removing the trailing “- Source Name” portion of each headline and
converting text to lowercase, before checking for
duplicates?
Answer: News aggregators often return the same underlying story from multiple outlets, where each version appends a different source name to the end of the title (e.g., “SpaceX Launches Rocket - Reuters” vs. “SpaceX Launches Rocket - AP”). If we compared raw titles, these would look like distinct headlines and inflate our count of “unique” stories. Removing the source suffix, standardizing punctuation/whitespace, and lowercasing the text ensures that headlines describing the same story are recognized as duplicates and only counted once — giving a more accurate picture of true coverage volume.
A marketing manager looks at Graph 2 (Sentiment Volume Comparison) and sees that Brand A has 30 positive and 10 negative sentiment-word matches, while Brand B has 15 positive and 5 negative matches. Which statement is most accurate?
A. Brand A definitely has better brand sentiment than Brand B because it has more positive words.
B. Brand B might have comparable or better relative sentiment, since both brands have the same 3:1 positive-to-negative ratio, but Brand A simply has more total coverage.
C. The two brands cannot be compared in any way.
D. Brand B’s coverage is more negative because it has fewer total matches.
Answer: B. Raw counts conflate sentiment tone with sentiment volume. Both brands have an identical 3:1 ratio of positive to negative words, so their relative tone is similar — Brand A simply has more total news coverage (more matched words overall). A good analyst should look at both the absolute volume (which signals visibility/buzz) and the ratio or net score (which signals tone) separately.
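The distinction between volume and tone is easy to check numerically. The counts below are hypothetical, chosen so that both brands share a 3:1 ratio while differing in volume.

```r
# Hypothetical sentiment-word counts for two brands
brands <- data.frame(
  brand    = c("Brand A", "Brand B"),
  positive = c(30, 15),
  negative = c(10, 5)
)

brands$total     <- brands$positive + brands$negative   # volume (visibility)
brands$ratio     <- brands$positive / brands$negative   # tone as a positive:negative ratio
brands$net_share <- (brands$positive - brands$negative) / brands$total  # net score scaled by volume

brands
```

Both rows show the same ratio and net share, so the only difference between the brands is the total number of matched words.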
The TF-IDF analysis in Section 7 highlights words that are distinctive to one brand’s coverage versus another’s. Suppose you run this analysis for your company and a competitor, and your company’s top TF-IDF terms include words like “lawsuit,” “investigation,” and “delay,” while the competitor’s top terms include “launch,” “partnership,” and “award.” What might this tell you, and what would you investigate next?
Answer (sample discussion points): This pattern suggests that, during the analyzed period, your company’s distinctive media narrative is dominated by negative or risk-related events (legal/regulatory issues, delays), while the competitor’s distinctive narrative centers on positive business momentum (product launches, partnerships, recognition). This is a signal — not a verdict — and should prompt further investigation: (1) read the actual headlines behind these TF-IDF terms to understand the underlying stories, (2) check whether this is a recent shift or a longer-term pattern by re-running the analysis over different time windows, (3) consider how this narrative gap might affect brand perception, investor confidence, or campaign timing, and (4) coordinate with PR/communications teams if a response or proactive positive-story pipeline is warranted.
Why does the tutorial recommend storing your NewsAPI key using
Sys.getenv() and an .Renviron file rather than
typing it directly into the script (e.g.,
newsapi_key("abc123"))?
A. Sys.getenv() makes the API calls run faster.
B. Hard-coded keys are required by NewsAPI’s terms of service.
C. It prevents the key from being accidentally shared, committed to version control, or exposed if the script/notebook is distributed.
D. .Renviron files automatically refresh the API key every 24 hours.
Answer: C. API keys are credentials tied to your account and usage limits. Hard-coding them into a script means anyone who receives that script (e.g., classmates, a shared GitHub repo, a knitted HTML report) also receives your credentials. Storing keys as environment variables keeps secrets out of shared code while still allowing the script to authenticate.
This tutorial uses lexicon-based sentiment analysis (Bing and AFINN), where sentiment is determined by looking up individual words in a pre-built dictionary. What is one limitation of this approach when applied to news headlines specifically, and how might it lead to a misleading sentiment score?
Answer (sample discussion points): Lexicon-based methods score words independently, ignoring context, negation, and sarcasm. For example, a headline like “SpaceX avoids major setback after engine issue” contains the negative word “setback” and possibly “issue,” which a lexicon would score negatively — even though the headline is reporting good news (the setback was avoided). Similarly, headlines are often short and may contain domain-specific or proper-noun “words” (e.g., company names, ticker symbols) that aren’t in general-purpose lexicons like Bing or AFINN, so meaningful content can be missed entirely. More advanced approaches (e.g., sentence-level models, transformer-based sentiment classifiers) can better capture context, negation, and domain-specific tone, at the cost of greater computational complexity.
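The negation problem can be seen with a toy lookup. The three-word lexicon below is a hypothetical stand-in, not the actual Bing or AFINN entries, and the scoring mimics only the word-by-word join used in this tutorial.

```r
headline <- "SpaceX avoids major setback after engine issue"

# Hypothetical mini-lexicon (not real Bing/AFINN entries)
toy_lexicon <- c(setback = -2, issue = -1, success = 2)

# Tokenize, then keep only words that appear in the lexicon
words   <- tolower(unlist(strsplit(headline, "[^[:alnum:]]+")))
matched <- words[words %in% names(toy_lexicon)]
scores  <- toy_lexicon[matched]

scores        # each matched word is scored on its own
sum(scores)   # negative total, although the headline reports good news
```

Nothing in the lookup sees that “avoids” reverses the meaning of “setback”, which is the kind of context a sentence-level model would be expected to handle better.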