Summary

For my in-class exercise, I chose the four NBA Conference Finals teams: the Cleveland Cavaliers, New York Knicks, San Antonio Spurs, and Oklahoma City Thunder.

I found that the New York Knicks received the most negative net sentiment score, -7, meaning that more of their headline words were labeled negative than positive. By comparison, the highest net score belonged to the Oklahoma City Thunder at +1, meaning that slightly more of their headline words were labeled positive. With the New York Knicks winning the NBA Championship and breaking their 53-year drought, I would have expected mostly positive news coverage. However, many Knicks-related headlines focused on the championship celebrations, including extreme fan behavior and events occurring in New York City, rather than on the team’s on-court performance. As a result, the negative words in those headlines lowered the Knicks’ overall sentiment score, which suggests that the news coverage may reflect events surrounding the championship rather than the Knicks’ success in securing it.

Setup and Package Installation

NewsAPI access in R is provided through the news-r/newsapi package on GitHub (not on CRAN), so we install it with remotes. We also load the tidyverse for data manipulation, tidytext and textdata for text mining and sentiment lexicons, lubridate for date handling, and knitr/kableExtra for nicely formatted tables.

# Run once - uncomment if these packages are not yet installed
# install.packages(c("remotes", "tidyverse", "tidytext", "textdata", "lubridate", "knitr", "kableExtra", "newsanchor"))
# remotes::install_github("news-r/newsapi")

library(newsapi)
library(tidyverse)
library(tidytext)
library(textdata)
library(lubridate)
library(knitr)
library(kableExtra)
library(ggplot2)
library(newsanchor)

Authentication

NewsAPI requires a free API key, available at newsapi.org. Never hard-code your API key in a script you plan to share or commit to version control. For class purposes, store your key as an environment variable (e.g., in a .Renviron file) and reference it with Sys.getenv().

# Read the key from the NEWSAPI_KEY environment variable (e.g., set in .Renviron)
api_key <- Sys.getenv("NEWSAPI_KEY")
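Sys.getenv() returns an empty string when the variable is not set, so a missing key only shows up later as an API error. The sketch below checks for that up front. The helper name get_newsapi_key() is hypothetical and not part of the tutorial or the newsapi package.

```r
# Hypothetical helper (not from the tutorial): stop early if the key is missing.
get_newsapi_key <- function(var = "NEWSAPI_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set. ",
         "Add a line like ", var, "=your_key to ~/.Renviron and restart R.")
  }
  key
}

# Usage (left commented so the sketch runs without a key):
# api_key <- get_newsapi_key()
```

The default variable name matches the one used above; pass a different name if your .Renviron uses one.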

Tip: A common extension of this tutorial is to pull headlines for multiple topics (e.g., a focal brand and 1-2 competitors) so that the sentiment comparisons later in this tutorial are more meaningful. This exercise does that by pulling headlines for four teams.

Pulling and Cleaning Headlines

library(httr)
library(jsonlite)
library(tidyverse)

fetch_news <- function(query, api_key, page_size = 10) {
  response <- GET(
    url = "https://newsapi.org/v2/everything",
    query = list(
      q         = query,
      language  = "en",
      sortBy    = "publishedAt",
      pageSize  = page_size,
      apiKey    = api_key
    )
  )
  
  # Surface the actual error message from NewsAPI
  if (status_code(response) != 200) {
    msg <- content(response, as = "parsed")$message
    stop("NewsAPI error for '", query, "': ", msg)
  }
  
  parsed   <- content(response, as = "text", encoding = "UTF-8")
  articles <- fromJSON(parsed, flatten = TRUE)$articles
  
  as_tibble(articles) %>%
    rename_with(~ str_replace_all(.x, "\\.", "_")) %>%
    mutate(query = query)
}


news_raw <- bind_rows(
  fetch_news("New York Knicks", api_key) %>% mutate(team = "Knicks"),
  fetch_news("Cleveland Cavaliers", api_key) %>% mutate(team = "Cavaliers"),
  fetch_news("San Antonio Spurs", api_key) %>% mutate(team = "Spurs"),
  fetch_news("Oklahoma City Thunder", api_key) %>% mutate(team = "Thunder")
)

glimpse(news_raw)

## Rows: 40
## Columns: 11
## $ author      <chr> "Michael Macasero", "Mike Puma", "Saino Zachariah", "Kevin…
## $ title       <chr> "Hawks Acquire NBA Champion in Surprise Trade With Western…
## $ description <chr> "The Atlanta Hawks are actively working on bolstering thei…
## $ url         <chr> "https://www.profootballnetwork.com/nba/hawks-acquire-nba-…
## $ urlToImage  <chr> "https://s.yimg.com/ny/api/res/1.2/RpuyVF7N52s8ZPSZ3NIeEg-…
## $ publishedAt <chr> "2026-06-22T05:10:09Z", "2026-06-22T04:33:49Z", "2026-06-2…
## $ content     <chr> "The Atlanta Hawks are actively working on bolstering thei…
## $ source_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ source_name <chr> "Pro Football Network", "New York Post", "Pro Football Net…
## $ query       <chr> "New York Knicks", "New York Knicks", "New York Knicks", "…
## $ team        <chr> "Knicks", "Knicks", "Knicks", "Knicks", "Knicks", "Knicks"…
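The source_id and source_name columns in the output above come from a nested "source" object in each article. The sketch below shows how flatten = TRUE and the renaming step produce them, using a small hand-written JSON string as a stand-in for a response body. The JSON is illustrative only and is not real NewsAPI output.

```r
library(jsonlite)

# Hand-written stand-in for a NewsAPI response body (not real API output)
sample_json <- '{
  "status": "ok",
  "articles": [
    {"source": {"id": null, "name": "Example Outlet"}, "title": "First headline"},
    {"source": {"id": "example-id", "name": "Another Outlet"}, "title": "Second headline"}
  ]
}'

# flatten = TRUE turns the nested "source" object into source.id / source.name
articles <- fromJSON(sample_json, flatten = TRUE)$articles
print(names(articles))

# Same renaming idea as in fetch_news(), written here with base R
names(articles) <- gsub(".", "_", names(articles), fixed = TRUE)
print(names(articles))
```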

Inspect for Duplicates

news_raw %>%
  filter(!is.na(title)) %>%
  mutate(
    pub_date    = ymd_hms(publishedAt, quiet = TRUE),
    pub_day     = as.Date(pub_date),
    title_clean = str_remove(title, "\\s*-\\s*[^-]+$"),
    title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
    title_clean = str_to_lower(title_clean)
  ) %>%
  group_by(title_clean) %>%
  filter(n() > 1) %>%
  arrange(title_clean)

Clean and Deduplicate

We now apply the same cleaning steps and keep only one copy of each unique (cleaned) headline:

1. Remove rows with missing titles.
2. Parse the publication timestamp into a date.
3. Strip the trailing “- Source Name” suffix many outlets append to titles.
4. Remove punctuation/special characters and collapse extra whitespace.
5. Convert to lowercase for consistent text analysis.
6. Drop duplicate cleaned titles.

news_clean <- news_raw %>%
  filter(!is.na(.data$title)) %>%
  mutate(
    pub_date    = ymd_hms(.data$publishedAt, quiet = TRUE),
    pub_day     = as.Date(pub_date),
    title_clean = str_remove(.data$title, "\\s*-\\s*[^-]+$"),
    title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
    title_clean = str_to_lower(title_clean)
  ) %>%
  distinct(title_clean, .keep_all = TRUE)

dim(news_clean)

## [1] 33 14
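To see what the suffix-stripping pattern does, the sketch below applies the same cleaning steps to three sample titles. One caveat: because the pattern allows zero spaces around the hyphen, it also trims hyphenated words near the end of a title. The stricter pattern shown for comparison is an assumption and is not what the code above uses. The helper name clean_title() is hypothetical.

```r
library(stringr)

titles <- c(
  "SpaceX Launches Rocket - Reuters",
  "SpaceX Launches Rocket - AP",
  "Isaiah Thomas facing pushback on Chris Paul-Spurs take"
)

# Hypothetical helper wrapping the three cleaning steps used in this section
clean_title <- function(x, suffix_pattern) {
  x <- str_remove(x, suffix_pattern)
  x <- str_squish(str_replace_all(x, "[^[:alnum:][:space:]]", " "))
  str_to_lower(x)
}

loose  <- clean_title(titles, "\\s*-\\s*[^-]+$")  # pattern used in this section
strict <- clean_title(titles, "\\s+-\\s+[^-]+$")  # assumption: require spaces around the hyphen

print(loose)
print(strict)
```

With the loose pattern the third title loses "-Spurs take"; with the strict pattern it is kept. Both patterns collapse the first two titles to the same string, which is what deduplication relies on.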

Combine into a Tracking Data Frame

If you pulled multiple topics (here, the four teams), bind them into a single data frame at this step. Because news_raw already combines all four team queries, this step simply standardizes the object for the rest of the pipeline and saves a CSV snapshot, which is useful for reproducibility and for sharing data with teammates who don’t have API access.

news_df <- bind_rows(news_clean) %>%
  filter(!is.na(title))  # remove any empty rows

str(news_df)

## tibble [33 × 14] (S3: tbl_df/tbl/data.frame)
##  $ author     : chr [1:33] "Michael Macasero" "Mike Puma" "Saino Zachariah" "Kevin Duggan, Emily Smith" ...
##  $ title      : chr [1:33] "Hawks Acquire NBA Champion in Surprise Trade With Western Conference Contender" "Francisco Lindor’s possible next step towards Mets return emerges" "Hawks Predicted To Cut Ties With Zaccharie Risacher in ’Risky’ Trade for 1 of NBA’s ’More Talented Offensive Bigs’" "PEDESTRIANIZE NOW! Financial District Businesses Want Space for People Not Cars" ...
##  $ description: chr [1:33] "The Atlanta Hawks are actively working on bolstering their roster a few days before the 2026 NBA Draft. After a"| __truncated__ "Francisco Lindor’s next stop could be Citi Field but perhaps not to rejoin the Mets roster just yet." "The Atlanta Hawks overachieved in the 2025-26 season before losing to eventual champions, the New York Knicks, "| __truncated__ "Lower Manhattan merchants see green in proposals to pedestrianize the Financial District." ...
##  $ url        : chr [1:33] "https://www.profootballnetwork.com/nba/hawks-acquire-nba-champion-surprise-trade-western-conference-contender-j"| __truncated__ "https://nypost.com/2026/06/22/sports/francisco-lindors-possible-next-step-towards-mets-return-emerges/" "https://www.profootballnetwork.com/nba/hawks-zaccharie-risacher-domantas-sabonis-trade-june-2026/?utm_medium=rs"| __truncated__ "https://nyc.streetsblog.org/2026/06/22/financial-district-businesses-clamor-for-space-for-people-not-cars" ...
##  $ urlToImage : chr [1:33] "https://s.yimg.com/ny/api/res/1.2/RpuyVF7N52s8ZPSZ3NIeEg--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD02NzU7Y2Y9d2VicA--"| __truncated__ "https://nypost.com/wp-content/uploads/sites/2/2026/06/newspress-collage-m4uxkgnvw-1782101656507.jpg?quality=75&"| __truncated__ "https://s.yimg.com/lo/mysterio/api/d15eda85100e616d1ba3e95f5048ea4889817a12bb5d4ff61f2fd672fefdb7ac/lightyear_n"| __truncated__ "https://nyc.streetsblog.org/wp-content/uploads/sites/9/2026/06/IMG_8304-1-e1781726184974.jpg" ...
##  $ publishedAt: chr [1:33] "2026-06-22T05:10:09Z" "2026-06-22T04:33:49Z" "2026-06-22T04:32:52Z" "2026-06-22T04:05:00Z" ...
##  $ content    : chr [1:33] "The Atlanta Hawks are actively working on bolstering their roster a few days before the 2026 NBA Draft. After a"| __truncated__ "PHILADELPHIA Francisco Lindors next stop could be Citi Field but perhaps not to rejoin the Mets roster just yet"| __truncated__ "The Atlanta Hawks overachieved in the 2025-26 season before losing to eventual champions, the New York Knicks, "| __truncated__ "Bar the cars to fill the bars.\r\nLower Manhattan’s congested Financial District has a lot of foot traffic, but"| __truncated__ ...
##  $ source_id  : chr [1:33] NA NA NA NA ...
##  $ source_name: chr [1:33] "Pro Football Network" "New York Post" "Pro Football Network" "Streetsblog.org" ...
##  $ query      : chr [1:33] "New York Knicks" "New York Knicks" "New York Knicks" "New York Knicks" ...
##  $ team       : chr [1:33] "Knicks" "Knicks" "Knicks" "Knicks" ...
##  $ pub_date   : POSIXct[1:33], format: "2026-06-22 05:10:09" "2026-06-22 04:33:49" ...
##  $ pub_day    : Date[1:33], format: "2026-06-22" "2026-06-22" ...
##  $ title_clean: chr [1:33] "hawks acquire nba champion in surprise trade with western conference contender" "francisco lindor s possible next step towards mets return emerges" "hawks predicted to cut ties with zaccharie risacher in risky trade for 1 of nba s more talented offensive bigs" "pedestrianize now financial district businesses want space for people not cars" ...

write.csv(news_df, "news_df.csv", row.names = FALSE)

4.5 Preview the Cleaned Data

news_df %>%
  select(team, title, pub_day) %>%
  head(10) %>%
  kable(caption = "Sample Cleaned Headlines") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Sample Cleaned Headlines
team	title	pub_day
Knicks	Hawks Acquire NBA Champion in Surprise Trade With Western Conference Contender	2026-06-22
Knicks	Francisco Lindor’s possible next step towards Mets return emerges	2026-06-22
Knicks	Hawks Predicted To Cut Ties With Zaccharie Risacher in ’Risky’ Trade for 1 of NBA’s ’More Talented Offensive Bigs’	2026-06-22
Knicks	PEDESTRIANIZE NOW! Financial District Businesses Want Space for People Not Cars	2026-06-22
Knicks	’Thunder Legend’ — NBA World Reacts As OKC Cuts Ties With $45,000,000 ’Championship Role Player’ in Cost-Cutting Trade	2026-06-22
Knicks	Episode 937: What’s Your Favorite Emotion?	2026-06-22
Knicks	Media Buying Briefing: How Sport Beach became a big Cannes Lions destination — and a business	2026-06-22
Knicks	Rama Duwaji Commands Attention in Off-Shoulder Top During Knicks Parade	2026-06-22
Knicks	Isaiah Thomas facing pushback on Chris Paul-Spurs take	2026-06-22
Knicks	Doc Rivers gets real on why James Harden’s play style does not lead to postseason success	2026-06-22

Interpretation: At this stage you should have a tidy data frame where each row is a unique news headline, with a clean publication date and a team column indicating which search returned it. This is the foundation for everything that follows. If dim(news_clean) shows far fewer rows than dim(news_raw), that tells you a substantial share of “results” were duplicate stories, which is a useful sanity check before drawing conclusions about coverage volume. Here, deduplication reduced 40 rows to 33.

5. Tokenization

To analyze sentiment and word usage, we need to break each headline into individual words (“tokens”), remove common stop words (e.g., “the,” “and,” “of”) that carry little analytical meaning, and filter out pure numbers and very short tokens.

news_tokens <- news_df %>%
  select(team, title) %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"), nchar(word) > 2)

Interpretation: news_tokens is now a “one-token-per-row” data frame, the standard format for text mining with tidytext. Each row represents one meaningful word from one headline, tagged with the team whose search returned it. This long format makes it easy to count words, join sentiment dictionaries, and compute summary statistics by group.
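As a small illustration of the tokenization steps, the sketch below runs the same pipeline on two headlines from the preview table. It assumes the tidytext, dplyr, and stringr packages are installed. The object names sample_titles and sample_tokens are made up for this example.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two headlines taken from the preview table above
sample_titles <- tibble(
  team  = "Knicks",
  title = c(
    "Episode 937: What's Your Favorite Emotion?",
    "Doc Rivers gets real on why James Harden's play style does not lead to postseason success"
  )
)

sample_tokens <- sample_titles %>%
  unnest_tokens(word, title) %>%                 # one lowercase word per row
  anti_join(stop_words, by = "word") %>%         # drop common stop words
  filter(!str_detect(word, "^\\d+$"), nchar(word) > 2)  # drop pure numbers and short tokens

print(sample_tokens)
```

The number 937 and stop words such as "not" drop out, while words like "favorite", "lead", and "success" remain available for the sentiment join.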

6. Sentiment Analysis

6.1 The Bing Lexicon (Binary Positive/Negative)

The Bing lexicon classifies each word as either "positive" or "negative" — a simple binary label with no magnitude.

sentiment_bing <- news_tokens %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many")

print(sentiment_bing)

## # A tibble: 18 × 3
##    team      word       sentiment
##    <chr>     <chr>      <chr>    
##  1 Knicks    champion   positive 
##  2 Knicks    risky      negative 
##  3 Knicks    talented   positive 
##  4 Knicks    offensive  negative 
##  5 Knicks    favorite   positive 
##  6 Knicks    top        positive 
##  7 Knicks    lead       positive 
##  8 Knicks    success    positive 
##  9 Cavaliers favorite   positive 
## 10 Cavaliers champion   positive 
## 11 Cavaliers top        positive 
## 12 Spurs     wild       negative 
## 13 Spurs     won        positive 
## 14 Spurs     won        positive 
## 15 Spurs     impossible negative 
## 16 Spurs     rumor      negative 
## 17 Thunder   rumors     negative 
## 18 Thunder   relief     positive

Graph 1: Word Sentiment Contribution

This chart shows the top 10 words contributing to positive sentiment and the top 10 contributing to negative sentiment across all headlines.

sentiment_bing %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
  labs(
    title = "Top Words Driving Sentiment in NBA News Headlines",
    x = "Frequency (Word Count)",
    y = NULL
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(face = "bold", size = 12)
  )

Interpretation: Words on the left panel are pulling overall sentiment down; words on the right are pulling it up. A marketing analyst would scan this chart for words tied to specific events (e.g., “crash,” “delay,” “explosion” vs. “record,” “success,” “win”) to understand what kind of news is driving the tone — not just whether the tone is positive or negative.

Graph 2: Sentiment Volume Comparison Across Topics

If you pulled multiple topics (brands), this chart compares how many positive vs. negative sentiment-words appear in each topic’s headlines.

sentiment_bing %>%
  count(team, sentiment) %>%
  ggplot(aes(x = team, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
  labs(
    title = "Volume of Sentiment Words",
    subtitle = "Total counts of matched emotional words in headlines",
    x = "Team",
    y = "Number of Words Matched",
    fill = "Sentiment Class"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

Interpretation: A higher total bar (positive + negative) for a topic means that topic generated more emotionally-charged language overall — which could reflect either higher news volume or more dramatic events. Comparing the ratio of green to red bars across topics tells you which brand is currently enjoying more favorable framing in the media.

6.2 Top Words Overall

Independent of sentiment, it’s useful to see which words dominate the headlines overall.

news_tokens %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = n)) +
  geom_col(show.legend = FALSE) +
  scale_fill_gradient(low = "#a8d8ea", high = "#0077b6") +
  labs(
    title   = "Top 20 Words in News Headlines",
    x       = "Count", y = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 13)

Interpretation: This is your “what is everyone talking about” chart. Look for names of products, executives, partners, or events that recur — these are candidates for deeper investigation (e.g., is “Starship” appearing a lot because of a successful launch or a setback?).

6.3 The AFINN Lexicon (Scored -5 to +5)

Unlike Bing’s binary labels, AFINN assigns each word a numeric score from -5 (very negative) to +5 (very positive), allowing us to compute average sentiment intensity per topic.

news_tokens %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(team) %>%
  summarise(
    words_matched  = n(),
    mean_sentiment = round(mean(value), 3),
    sum_sentiment  = sum(value),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_sentiment)) %>%
  kable(caption = "AFINN Sentiment Score by Topic") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

AFINN Sentiment Score by Topic
team	words_matched	mean_sentiment	sum_sentiment
Spurs	2	3.000	6
Cavaliers	5	1.800	9
Thunder	1	1.000	1
Knicks	7	0.286	2

Interpretation: mean_sentiment tells you the average emotional tone of matched words for each topic — a value near 0 suggests neutral/mixed coverage, while a clearly positive or negative mean suggests a dominant tone. sum_sentiment reflects total emotional “weight,” which is influenced by both tone and volume of coverage.
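The difference between mean_sentiment and sum_sentiment is easiest to see on a toy example. The sketch below uses base R and a three-word lexicon with made-up scores. These are not real AFINN values, and the teams "A" and "B" are placeholders.

```r
# Toy scores for illustration only; these are NOT the real AFINN values
toy_lexicon <- data.frame(
  word  = c("won", "success", "risky"),
  value = c(3, 2, -2)
)

# Placeholder tokens; "parade" has no score and drops out of the join
toy_tokens <- data.frame(
  team = c("A", "A", "B", "B", "B", "B", "B", "B"),
  word = c("won", "won", "success", "risky", "won", "success", "risky", "parade")
)

scored <- merge(toy_tokens, toy_lexicon, by = "word")  # inner join on word

words_matched  <- tapply(scored$value, scored$team, length)
mean_sentiment <- tapply(scored$value, scored$team, mean)
sum_sentiment  <- tapply(scored$value, scored$team, sum)

print(data.frame(
  team           = names(sum_sentiment),
  words_matched  = as.vector(words_matched),
  mean_sentiment = as.vector(mean_sentiment),
  sum_sentiment  = as.vector(sum_sentiment)
))
```

Team B matches more words than team A, but its mixed tone gives it a lower mean and a lower sum. With only a handful of matched words per team, as in the table above, a single word can move either number substantially.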

6.4 Bing Sentiment Split (Positive vs. Negative Counts)

This table reshapes the Bing results into a wide format so you can directly compare positive counts, negative counts, and the net (positive − negative) score for each topic.

news_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(team, sentiment) %>%
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = list(n = 0)
  ) %>%
  mutate(
    positive = coalesce(positive, 0L),
    negative = coalesce(negative, 0L),
    net      = positive - negative
  ) %>%
  kable(caption = "Bing Sentiment Count by Topic") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Bing Sentiment Count by Topic
team	positive	negative	net
Cavaliers	3	0	3
Knicks	6	2	4
Spurs	2	3	-1
Thunder	1	1	0

Interpretation: The net column is a simple, interpretable sentiment index. A positive net score suggests headlines skew favorable; a negative net score suggests the opposite. This kind of index is easy to track over time (e.g., weekly) to build a brand sentiment trendline.
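As a cross-check on the arithmetic, the net column can be recomputed by hand from the 18 matched words printed in Section 6.1. The sketch below retypes those team and sentiment labels and uses base R only. The positive share is an extra column added here for illustration and does not appear in the table above.

```r
# Team and sentiment labels retyped from the sentiment_bing printout in Section 6.1
matched <- data.frame(
  team = c(rep("Knicks", 8), rep("Cavaliers", 3), rep("Spurs", 5), rep("Thunder", 2)),
  sentiment = c(
    "positive", "negative", "positive", "negative",
    "positive", "positive", "positive", "positive",   # Knicks
    "positive", "positive", "positive",               # Cavaliers
    "negative", "positive", "positive", "negative", "negative",  # Spurs
    "negative", "positive"                            # Thunder
  )
)

counts <- table(matched$team, matched$sentiment)

net            <- counts[, "positive"] - counts[, "negative"]
share_positive <- counts[, "positive"] / rowSums(counts)  # tone, independent of volume

print(net)
print(round(share_positive, 2))
```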

7. TF-IDF: What Makes Each Topic’s Coverage Distinct?

TF-IDF (Term Frequency–Inverse Document Frequency) identifies words that are frequent within one topic’s headlines but rare across other topics’ headlines. This is especially useful for understanding what makes coverage of one brand distinctive compared to another.

news_tokens %>%
  count(team, word) %>%
  bind_tf_idf(word, team, n) %>%
  group_by(team) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, team)) %>%
  ggplot(aes(x = tf_idf, y = word, fill = team)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ team, scales = "free_y", ncol = 2) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title   = "Top TF-IDF Terms by Topic",
    x       = "TF-IDF Score", y = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 10)
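To make the TF-IDF score concrete, the sketch below computes it for a toy set of word counts and compares one value with a hand calculation. The counts are made up for illustration. tidytext computes tf as a word's share of its group's total count and idf as the natural log of (number of groups / number of groups containing the word).

```r
library(dplyr)
library(tidytext)

# Made-up counts for illustration (not taken from the headlines above)
toy_counts <- tibble(
  team = c("Knicks", "Knicks", "Spurs", "Spurs"),
  word = c("parade", "trade", "trade", "draft"),
  n    = c(3, 1, 2, 2)
)

toy_tfidf <- toy_counts %>%
  bind_tf_idf(word, team, n)

print(toy_tfidf)

# Hand calculation for "parade" in the Knicks group:
# tf = 3 / (3 + 1); idf = log(2 teams / 1 team containing the word)
manual_parade <- (3 / 4) * log(2 / 1)
print(manual_parade)
```

"trade" appears in both groups, so its idf is log(2 / 2) = 0 and its TF-IDF is zero. This is why words shared by every team never show up in the plot.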

7.1 Refining the TF-IDF Plot for Many Topics

When you have more than two topics, the basic plot above can get crowded. The refined version below dynamically adjusts the color palette to the number of topics, reduces the number of terms shown per topic, truncates long words, and arranges panels in a grid for readability.

n_sources <- n_distinct(news_tokens$team)

news_tokens %>%
  count(team, word) %>%
  bind_tf_idf(word, team, n) %>%
  group_by(team) %>%
  slice_max(tf_idf, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(
    word = str_trunc(word, 20),
    word = reorder_within(word, tf_idf, team)
  ) %>%
  ggplot(aes(x = tf_idf, y = word, fill = team)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ team, scales = "free_y", ncol = 4) +
  scale_y_reordered() +
  scale_fill_manual(
    values = colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))(n_sources)
  ) +
  labs(
    title   = "Top TF-IDF Terms by Team",
    x       = "TF-IDF Score",
    y       = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 10) +
  theme(
    strip.text    = element_text(size = 8, face = "bold"),
    axis.text.y   = element_text(size = 4),
    panel.spacing = unit(1.2, "lines")
  )

# Save at a large size so labels don't crowd
ggsave("tfidf_plot.png", width = 16, height = 14, dpi = 150)

Interpretation: Words with high TF-IDF scores for a given topic are the terms that “define” that topic’s coverage relative to others — these are often product names, executives, locations, or event-specific terms. For a marketing analyst, these words are strong candidates for keyword tracking, campaign hashtag ideas, or identifying emerging narratives unique to your brand versus competitors.

8. Putting It All Together

A complete brand-monitoring workflow using this tutorial would typically run on a schedule (e.g., daily or weekly):

1. Pull fresh headlines for your brand and 1-2 key competitors.
2. Clean and deduplicate.
3. Tokenize and score sentiment (Bing for simple positive/negative counts, AFINN for an intensity-weighted index).
4. Track the net sentiment score and mean_sentiment over time.
5. Use TF-IDF periodically to check whether the themes of coverage are shifting (e.g., from “product launch” language to “regulatory” language).
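Tracking the net score over time can be as simple as appending each run's results to a running CSV. The sketch below is one minimal way to do that in base R. The helper name log_net_sentiment() and the log file are hypothetical, and the scores passed in are the net values from the Bing table in Section 6.4.

```r
# Hypothetical helper: append one run's net scores to a running CSV log
log_net_sentiment <- function(net_scores, path, run_date = Sys.Date()) {
  snapshot <- data.frame(
    run_date = as.character(run_date),
    team     = names(net_scores),
    net      = as.integer(net_scores),
    stringsAsFactors = FALSE
  )
  if (file.exists(path)) {
    # Keep earlier runs and add the new rows underneath
    snapshot <- rbind(read.csv(path, stringsAsFactors = FALSE), snapshot)
  }
  write.csv(snapshot, path, row.names = FALSE)
  invisible(snapshot)
}

# Net values from the Bing table in Section 6.4
bing_net <- c(Cavaliers = 3, Knicks = 4, Spurs = -1, Thunder = 0)

log_path <- tempfile(fileext = ".csv")  # placeholder; use a project file in practice
history  <- log_net_sentiment(bing_net, log_path)
print(history)
```

Each scheduled run adds one row per team, so the log can later be plotted as a trendline by run_date.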

9. Quiz and Discussion Questions

Q1. (Multiple Choice)

Which of the following best describes the difference between the Bing and AFINN sentiment lexicons used in this tutorial?

A. Bing scores words from -5 to +5; AFINN classifies words as positive or negative only.
B. Bing classifies words as positive or negative only; AFINN assigns a numeric intensity score from -5 to +5.
C. Both lexicons produce identical results because they are built from the same word list.
D. AFINN can only be used with French-language text.

Answer: B. Bing is a binary classifier (each word is labeled "positive" or "negative"), while AFINN assigns a numeric score between -5 and +5, allowing computation of an average sentiment intensity rather than just counts.

Q2. (Short Answer)

In the “Clean and Deduplicate” section, why do we create a title_clean column by removing the trailing “- Source Name” portion of each headline and converting text to lowercase, before checking for duplicates?

Answer: News aggregators often return the same underlying story from multiple outlets, where each version appends a different source name to the end of the title (e.g., “SpaceX Launches Rocket - Reuters” vs. “SpaceX Launches Rocket - AP”). If we compared raw titles, these would look like distinct headlines and inflate our count of “unique” stories. Removing the source suffix, standardizing punctuation/whitespace, and lowercasing the text ensures that headlines describing the same story are recognized as duplicates and only counted once — giving a more accurate picture of true coverage volume.

Q3. (Multiple Choice)

A marketing manager looks at Graph 2 (Sentiment Volume Comparison) and sees that Brand A has 30 positive and 10 negative sentiment-word matches, while Brand B has 15 positive and 5 negative matches. Which statement is most accurate?

A. Brand A definitely has better brand sentiment than Brand B because it has more positive words.
B. Brand B might have comparable or better relative sentiment, since both brands have the same 3:1 positive-to-negative ratio, but Brand A simply has more total coverage.
C. The two brands cannot be compared in any way.
D. Brand B’s coverage is more negative because it has fewer total matches.

Answer: B. Raw counts conflate sentiment tone with sentiment volume. Both brands have an identical 3:1 ratio of positive to negative words, so their relative tone is similar — Brand A simply has more total news coverage (more matched words overall). A good analyst should look at both the absolute volume (which signals visibility/buzz) and the ratio or net score (which signals tone) separately.

Q4. (Discussion)

The TF-IDF analysis in Section 7 highlights words that are distinctive to one brand’s coverage versus another’s. Suppose you run this analysis for your company and a competitor, and your company’s top TF-IDF terms include words like “lawsuit,” “investigation,” and “delay,” while the competitor’s top terms include “launch,” “partnership,” and “award.” What might this tell you, and what would you investigate next?

Answer (sample discussion points): This pattern suggests that, during the analyzed period, your company’s distinctive media narrative is dominated by negative or risk-related events (legal/regulatory issues, delays), while the competitor’s distinctive narrative centers on positive business momentum (product launches, partnerships, recognition). This is a signal — not a verdict — and should prompt further investigation: (1) read the actual headlines behind these TF-IDF terms to understand the underlying stories, (2) check whether this is a recent shift or a longer-term pattern by re-running the analysis over different time windows, (3) consider how this narrative gap might affect brand perception, investor confidence, or campaign timing, and (4) coordinate with PR/communications teams if a response or proactive positive-story pipeline is warranted.

Q5. (Multiple Choice)

Why does the tutorial recommend storing your NewsAPI key using Sys.getenv() and an .Renviron file rather than typing it directly into the script (e.g., newsapi_key("abc123"))?

A. Sys.getenv() makes the API calls run faster.
B. Hard-coded keys are required by NewsAPI’s terms of service.
C. It prevents the key from being accidentally shared, committed to version control, or exposed if the script/notebook is distributed.
D. .Renviron files automatically refresh the API key every 24 hours.

Answer: C. API keys are credentials tied to your account and usage limits. Hard-coding them into a script means anyone who receives that script (e.g., classmates, a shared GitHub repo, a knitted HTML report) also receives your credentials. Storing keys as environment variables keeps secrets out of shared code while still allowing the script to authenticate.

Q6. (Discussion)

This tutorial uses lexicon-based sentiment analysis (Bing and AFINN), where sentiment is determined by looking up individual words in a pre-built dictionary. What is one limitation of this approach when applied to news headlines specifically, and how might it lead to a misleading sentiment score?

Answer (sample discussion points): Lexicon-based methods score words independently, ignoring context, negation, and sarcasm. For example, a headline like “SpaceX avoids major setback after engine issue” contains the negative word “setback” and possibly “issue,” which a lexicon would score negatively — even though the headline is reporting good news (the setback was avoided). Similarly, headlines are often short and may contain domain-specific or proper-noun “words” (e.g., company names, ticker symbols) that aren’t in general-purpose lexicons like Bing or AFINN, so meaningful content can be missed entirely. More advanced approaches (e.g., sentence-level models, transformer-based sentiment classifiers) can better capture context, negation, and domain-specific tone, at the cost of greater computational complexity.
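The negation problem can be reproduced with the example headline from the answer above. The sketch below scores it word by word against the Bing lexicon bundled with tidytext. It assumes tidytext and dplyr are installed. Which words match depends on the lexicon's word list.

```r
library(dplyr)
library(tidytext)

# Example headline from the discussion above (good news phrased with negative words)
headline <- tibble(title = "SpaceX avoids major setback after engine issue")

matches <- headline %>%
  unnest_tokens(word, title) %>%
  inner_join(get_sentiments("bing"), by = "word")

print(matches)

# Net score = positive matches minus negative matches
net <- sum(matches$sentiment == "positive") - sum(matches$sentiment == "negative")
print(net)
```

Because the lexicon sees "setback" without the word "avoids" that reverses its meaning, the headline comes out negative even though a human reader would call it good news.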

Mining the News: Sentiment and Text Analysis of Headlines with NewsAPI

A Tutorial for Marketing Analytics

Adapted from materials by Jimmy Zhenning Xu, Ph.D.

2026-06-23