My Scraping and 2-Sentence Summary of Findings

I chose to scrape the California political candidate running for governor, Tom Steyer.

Summary: In the context of AFINN and data scraping, sentiment analysis measures whether the words a publication uses are positive, negative, or neutral in tone, not the publication’s actual stance toward its subject (e.g., its political leaning). This explains why the graphs and charts for right-leaning outlets such as Fox News showed more positive sentiment in their language about the Democrat Steyer: the top words in the news headlines were “tax,” “california,” “billionaire,” and “ballot,” all of which register as neutral or positive by AFINN’s sentiment measures.

Setup and Package Installation

NewsAPI access in R is provided through the news-r/newsapi package on GitHub (not on CRAN), so we install it with remotes. We also load the tidyverse for data manipulation, tidytext and textdata for text mining and sentiment lexicons, lubridate for date handling, and knitr/kableExtra for nicely formatted tables. The headline pull later in this tutorial calls the NewsAPI REST endpoint directly with httr and jsonlite.

# Run once - uncomment if these packages are not yet installed
# install.packages(c("remotes", "tidyverse", "tidytext", "textdata", "lubridate", "knitr", "kableExtra", "newsanchor"))
# remotes::install_github("news-r/newsapi")

library(newsanchor)
library(tidyverse)   # attaches dplyr, tidyr, stringr, forcats, ggplot2, ...
library(tidytext)
library(textdata)
library(lubridate)
library(knitr)
library(kableExtra)
library(ggplot2)     # already attached by tidyverse; kept for clarity

Authentication

NewsAPI requires a free API key, available at newsapi.org. Never hard-code your API key in a script you plan to share or commit to version control. For class purposes, store your key as an environment variable (e.g., in a .Renviron file) and reference it with Sys.getenv().

api_key <- Sys.getenv("NEWS_API_KEY")
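
Sys.getenv() returns an empty string when the variable is missing, so a forgotten .Renviron entry only shows up later as an authentication error from the API. A minimal sketch of an early check follows; the helper name get_api_key() is hypothetical and not part of any package.

```r
# Hypothetical helper (not from any package): read an API key from an
# environment variable and stop with a readable message if it is empty.
get_api_key <- function(var = "NEWS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(
      "Environment variable '", var, "' is not set. ",
      "Add it to .Renviron and restart R.",
      call. = FALSE
    )
  }
  key
}

# Usage (assumes NEWS_API_KEY has been set):
# api_key <- get_api_key()
```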

Tip: A common extension of this tutorial is to pull headlines for multiple topics (e.g., a focal candidate or brand and 1-2 competitors) so that the sentiment comparisons later in this tutorial are more meaningful. The commented-out block under “Original Companies” in the next section, with calls such as fetch_news("Anthropic", api_key), is an example of pulling additional topics for comparison.

Pulling and Cleaning Headlines

library(httr)
library(jsonlite)
library(tidyverse)

fetch_news <- function(query, api_key, page_size = 10) {
  response <- GET(
    url = "https://newsapi.org/v2/everything",
    query = list(
      q         = query,
      language  = "en",
      sortBy    = "publishedAt",
      pageSize  = page_size,
      apiKey    = api_key
    )
  )
  
  # Surface the actual error message from NewsAPI
  if (status_code(response) != 200) {
    msg <- content(response, as = "parsed")$message
    stop("NewsAPI error for '", query, "': ", msg)
  }
  
  parsed   <- content(response, as = "text", encoding = "UTF-8")
  articles <- fromJSON(parsed, flatten = TRUE)$articles
  
  as_tibble(articles) %>%
    rename_with(~ str_replace_all(.x, "\\.", "_")) %>%
    mutate(query = query)
}

news_raw <- bind_rows(
  fetch_news("Tom Steyer", api_key)
)

# Original companies:
# news_raw <- bind_rows(
#   fetch_news("AI",        api_key),
#   fetch_news("Anthropic", api_key),
#   fetch_news("SpaceX",    api_key),
#   fetch_news("CoreWeave", api_key)
# )

glimpse(news_raw)
## Rows: 10
## Columns: 10
## $ author      <chr> "Bjorn Lomborg", NA, "June 19, 2026", "Ed Kilgore", "Steve…
## $ title       <chr> "Here’s what can come next with climate-change fever final…
## $ description <chr> "Fewer and fewer people are panicking about the climate \"…
## $ url         <chr> "https://nypost.com/2026/06/19/opinion/heres-what-can-come…
## $ urlToImage  <chr> "https://nypost.com/wp-content/uploads/sites/2/2026/06/cro…
## $ publishedAt <chr> "2026-06-19T22:00:00Z", "2026-06-19T21:35:40Z", "2026-06-1…
## $ content     <chr> "Something was conspicuously missing from Californias prim…
## $ source_id   <chr> NA, "fox-news", NA, "new-york-magazine", NA, NA, NA, NA, "…
## $ source_name <chr> "New York Post", "Fox News", "Hoover.org", "New York Magaz…
## $ query       <chr> "Tom Steyer", "Tom Steyer", "Tom Steyer", "Tom Steyer", "T…
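
A keyword search can return articles that mention the query only in passing, so it may help to check relevance before scoring sentiment. The sketch below keeps only rows whose title or description matches a pattern. The helper keep_relevant() and the toy rows are hypothetical and only illustrate the idea; the column names title and description match the glimpse() output above.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: keep rows whose title or description mentions `pattern`.
# Missing values are treated as "no match" rather than dropped by accident.
keep_relevant <- function(df, pattern) {
  rx <- regex(pattern, ignore_case = TRUE)
  df %>%
    filter(
      coalesce(str_detect(title, rx), FALSE) |
        coalesce(str_detect(description, rx), FALSE)
    )
}

# Toy rows, made up for illustration (not taken from the API pull)
demo <- tibble(
  title       = c("Steyer unveils housing plan", "Fund targets real estate", NA),
  description = c("The candidate spoke Tuesday.", NA, "Tom Steyer responds to critics.")
)

demo_relevant <- keep_relevant(demo, "steyer")
demo_relevant
```

Applied to the real data, this would be something like keep_relevant(news_raw, "steyer"); whether to filter on the title alone or also on the description is a judgment call.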

4.2 Inspect for Duplicates

news_raw %>%
  filter(!is.na(title)) %>%
  mutate(
    pub_date    = ymd_hms(publishedAt, quiet = TRUE),
    pub_day     = as.Date(pub_date),
    title_clean = str_remove(title, "\\s*-\\s*[^-]+$"),
    title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
    title_clean = str_to_lower(title_clean)
  ) %>%
  group_by(title_clean) %>%
  filter(n() > 1) %>%
  arrange(title_clean)

4.3 Clean and Deduplicate

We now apply the same cleaning steps and keep only one copy of each unique (cleaned) headline:

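  • Remove rows with missing titles.
  • Parse the publication timestamp into a date.
  • Strip the trailing “- Source Name” suffix many outlets append to titles. As written, the pattern removes everything after the last hyphen, so headlines containing a hyphenated word (e.g., “climate-change”) are truncated in title_clean; the original title column is left untouched.
  • Remove punctuation/special characters and collapse extra whitespace.
  • Convert to lowercase for consistent text analysis.
  • Drop duplicate cleaned titles.
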
news_clean <- news_raw %>%
  filter(!is.na(.data$title)) %>%
  mutate(
    pub_date    = ymd_hms(.data$publishedAt, quiet = TRUE),
    pub_day     = as.Date(pub_date),
    title_clean = str_remove(.data$title, "\\s*-\\s*[^-]+$"),
    title_clean = str_squish(str_replace_all(title_clean, "[^[:alnum:][:space:]]", " ")),
    title_clean = str_to_lower(title_clean)
  ) %>%
  distinct(title_clean, .keep_all = TRUE)

dim(news_clean)
## [1] 10 13
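
The suffix pattern used above matches any hyphen, with or without surrounding spaces, so it also cuts headlines that merely contain a hyphenated word. The sketch below compares it with a stricter variant that only treats a hyphen with whitespace on both sides as a separator. The stricter pattern is an assumption offered for comparison, not the pattern used in this tutorial, and the outlet suffix in the first title is made up.

```r
library(stringr)

titles <- c(
  "California Billionaire Tax Qualifies for Ballot - Example Outlet",  # made-up suffix
  "Here's what can come next with climate-change fever finally breaking"
)

# Pattern used in the tutorial: removes everything after the last hyphen
loose <- str_remove(titles, "\\s*-\\s*[^-]+$")

# Stricter variant (assumption): require whitespace on both sides of the hyphen
strict <- str_remove(titles, "\\s+-\\s+[^-]+$")

data.frame(loose = loose, strict = strict)
```

Both patterns strip the outlet suffix from the first title, but only the loose one shortens the second title to end at “climate”. The stricter variant would still miss suffixes separated by other characters (such as “|”), so it is worth checking against real titles.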

4.4 Combine into a Tracking Data Frame

If you pulled multiple topics (e.g., SpaceX and a competitor), bind them into a single data frame here. With one topic, this step simply standardizes the object for the rest of the pipeline and saves a CSV snapshot — useful for reproducibility and for sharing data with teammates who don’t have API access.

news_df <- bind_rows(news_clean) %>%
  filter(!is.na(title))  # remove any empty rows

str(news_df)
## tibble [10 × 13] (S3: tbl_df/tbl/data.frame)
##  $ author     : chr [1:10] "Bjorn Lomborg" NA "June 19, 2026" "Ed Kilgore" ...
##  $ title      : chr [1:10] "Here’s what can come next with climate-change fever finally breaking" "Double endorsement drama: Trump backs second candidate in red state’s GOP gubernatorial runoff" "California Update: First Couple Under Investigation; Wealth-Tax Deal Underway?" "California Billionaire Tax Faces Last Hurdle Before Ballot" ...
##  $ description: chr [1:10] "Fewer and fewer people are panicking about the climate \"catastrophe.\" Gallup's latest survey of the world's m"| __truncated__ "Trump endorses both Wilson and Evette in South Carolina's GOP gubernatorial runoff, hedging his bets ahead of t"| __truncated__ "Monthly update on the implications of California’s First Couple under federal investigation for tax and financi"| __truncated__ "California billionaire tax faces last hurdle before ballot. The wealth tax has qualified for the November ballo"| __truncated__ ...
##  $ url        : chr [1:10] "https://nypost.com/2026/06/19/opinion/heres-what-can-come-next-with-climate-change-fever-finally-breaking/" "https://www.foxnews.com/politics/double-endorsement-drama-trump-backs-second-candidate-red-states-gop-gubernatorial-runoff" "https://www.hoover.org/research/california-update-first-couple-under-investigation-wealth-tax-deal-underway" "http://nymag.com/intelligencer/article/california-billionaire-tax-faces-last-hurdle-before-ballot.html" ...
##  $ urlToImage : chr [1:10] "https://nypost.com/wp-content/uploads/sites/2/2026/06/crop-39724023_ba0ec2.jpg?quality=75&strip=all&w=1200" "https://static.foxnews.com/foxnews.com/content/uploads/2026/06/trump-usa-poll-1.jpg" "https://hoover-s3-website.s3.us-west-2.amazonaws.com/s3fs-public/styles/facebook/public/2024-06/Matters-of-Poli"| __truncated__ "https://pyxis.nymag.com/v1/imgs/d88/398/3d475e9049aed791c583be96d7728e57a1-ca-billionairetax.1x.rsocial.w1200.jpg" ...
##  $ publishedAt: chr [1:10] "2026-06-19T22:00:00Z" "2026-06-19T21:35:40Z" "2026-06-19T00:00:00Z" "2026-06-18T22:25:27Z" ...
##  $ content    : chr [1:10] "Something was conspicuously missing from Californias primary this month. In the state that built its political "| __truncated__ "President Donald Trump is making an 11th-hour endorsement in the final stretch ahead of Tuesday's high-profile "| __truncated__ "<ul><li>State &amp; Local</li><li>California</li><li>Economics</li><li>Law &amp; Policy</li><li>Regulation &amp"| __truncated__ "Wealth taxes aimed at the very rich are a perennial favorite on the progressive end of the ideological spectrum"| __truncated__ ...
##  $ source_id  : chr [1:10] NA "fox-news" NA "new-york-magazine" ...
##  $ source_name: chr [1:10] "New York Post" "Fox News" "Hoover.org" "New York Magazine" ...
##  $ query      : chr [1:10] "Tom Steyer" "Tom Steyer" "Tom Steyer" "Tom Steyer" ...
##  $ pub_date   : POSIXct[1:10], format: "2026-06-19 22:00:00" "2026-06-19 21:35:40" ...
##  $ pub_day    : Date[1:10], format: "2026-06-19" "2026-06-19" ...
##  $ title_clean: chr [1:10] "here s what can come next with climate" "double endorsement drama trump backs second candidate in red state s gop gubernatorial runoff" "california update first couple under investigation wealth" "california billionaire tax faces last hurdle before ballot" ...
# Save a CSV snapshot; row.names = FALSE avoids writing an extra index column
write.csv(news_df, "news_df.csv", row.names = FALSE)

4.5 Preview the Cleaned Data

news_df %>%
  select(source_name, title, pub_day) %>%
  head(10) %>%
  kable(caption = "Sample Cleaned Headlines") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Sample Cleaned Headlines

| source_name | title | pub_day |
|:--|:--|:--|
| New York Post | Here’s what can come next with climate-change fever finally breaking | 2026-06-19 |
| Fox News | Double endorsement drama: Trump backs second candidate in red state’s GOP gubernatorial runoff | 2026-06-19 |
| Hoover.org | California Update: First Couple Under Investigation; Wealth-Tax Deal Underway? | 2026-06-19 |
| New York Magazine | California Billionaire Tax Faces Last Hurdle Before Ballot | 2026-06-18 |
| Reason | Did California’s Gubernatorial Race Reveal the Limits of ‘Abundance’ Politics on the Left? | 2026-06-18 |
| KPBS | A tax on billionaires qualified for the November ballot. 5 things to know about the measure | 2026-06-18 |
| KQED | 5 Things to Know About California’s New Billionaire Tax Measure | 2026-06-18 |
| CALmatters | California billionaire tax qualifies for November ballot | 2026-06-18 |
| Financial Post | Pimco Targets Out-of-Date Assets in New Real Estate Strategy | 2026-06-18 |
| NBC News | California billionaire tax proposal qualifies for the November ballot | 2026-06-18 |

Interpretation: At this stage you should have a tidy data frame where each row is a unique news headline, with a clean publication date, a source_name column identifying the outlet, and a query column indicating which topic search returned it. This is the foundation for everything that follows. If dim(news_clean) shows far fewer rows than dim(news_raw), a substantial share of the “results” were duplicate stories, which is a useful sanity check before drawing conclusions about coverage volume. Here both have 10 rows, so no duplicates were dropped.

5. Tokenization

To analyze sentiment and word usage, we need to break each headline into individual words (“tokens”), remove common stop words (e.g., “the,” “and,” “of”) that carry little analytical meaning, and filter out pure numbers and very short tokens.

news_tokens <- news_df %>%
  select(source_name, title) %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^\\d+$"), nchar(word) > 2)

Interpretation: news_tokens is now a “one-token-per-row” data frame, the standard format for text mining with tidytext. Each row represents one meaningful word from one headline, tagged with the news outlet (source_name) it came from. This long format makes it easy to count words, join sentiment dictionaries, and compute summary statistics by group.
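
To see what the tokenization steps do to a single headline, the sketch below runs the same pipeline on a one-row data frame. The title is lightly adapted from one of the headlines above (the possessive is dropped to keep the example ASCII-only), and the object names demo and demo_tokens are illustrative.

```r
library(dplyr)
library(stringr)
library(tidytext)

demo <- tibble(
  source_name = "KQED",
  title       = "5 Things to Know About the New Billionaire Tax Measure"
)

demo_tokens <- demo %>%
  unnest_tokens(word, title) %>%               # lowercase, one word per row
  anti_join(stop_words, by = "word") %>%       # drop common stop words
  filter(!str_detect(word, "^\\d+$"),          # drop pure numbers
         nchar(word) > 2)                      # drop very short tokens

demo_tokens
```

Content words such as “billionaire” and “tax” survive, while “to”, “about”, and the number “5” are removed.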

6. Sentiment Analysis

6.1 The Bing Lexicon (Binary Positive/Negative)

The Bing lexicon classifies each word as either "positive" or "negative" — a simple binary label with no magnitude.

sentiment_bing <- news_tokens %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many")

print(sentiment_bing)
## # A tibble: 7 × 3
##   source_name   word        sentiment
##   <chr>         <chr>       <chr>    
## 1 New York Post fever       negative 
## 2 New York Post breaking    negative 
## 3 Fox News      endorsement positive 
## 4 Fox News      trump       positive 
## 5 Reason        limits      negative 
## 6 Reason        abundance   positive 
## 7 KPBS          qualified   positive
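
Note that “trump” is matched as a positive word: the lexicon lists the ordinary English word, and tokenization lowercases the surname so the two become indistinguishable. One possible workaround, sketched below, is to remove such names with a small custom stop-word list before joining the lexicon. The list custom_stop and the toy tokens are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical custom stop-word list; extend with other names as needed
custom_stop <- tibble(word = c("trump"))

# Toy tokens mirroring a few rows of the output above
demo_tokens <- tibble(
  source_name = c("Fox News", "Fox News", "New York Post"),
  word        = c("endorsement", "trump", "fever")
)

demo_bing <- demo_tokens %>%
  anti_join(custom_stop, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word")

demo_bing
```

With the name removed, only “endorsement” and “fever” are scored.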

Graph 1: Word Sentiment Contribution

This chart shows the top 10 words contributing to positive sentiment and the top 10 contributing to negative sentiment across all headlines.

sentiment_bing %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
  labs(
    title = "Top Words Driving Sentiment in News Headlines",
    x = "Frequency (Word Count)",
    y = NULL
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(face = "bold", size = 12)
  )

Interpretation: Words on the left panel are pulling overall sentiment down; words on the right are pulling it up. A marketing analyst would scan this chart for words tied to specific events (e.g., “crash,” “delay,” “explosion” vs. “record,” “success,” “win”) to understand what kind of news is driving the tone — not just whether the tone is positive or negative.

Graph 2: Sentiment Volume Comparison Across Sources

This chart compares how many positive vs. negative sentiment-words appear in each news source’s headlines. If you pulled multiple topics (brands), map query to the x-axis instead to compare topics.

sentiment_bing %>%
  count(source_name, sentiment) %>%
  ggplot(aes(x = source_name, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("positive" = "#2ecc71", "negative" = "#e74c3c")) +
  labs(
    title = "Volume of Sentiment Words",
    subtitle = "Total counts of matched emotional words in headlines",
    x = "News Source",
    y = "Number of Words Matched",
    fill = "Sentiment Class"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )

Interpretation: A higher total bar (positive + negative) for a source means its headlines contained more emotionally charged language overall, which could reflect either more headlines or more dramatic events. Comparing the ratio of green to red bars across sources shows which outlets use more favorable wording; as the summary notes, this measures word tone, not the outlet’s stance toward the subject.
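
Because raw counts mix tone with volume, it can help to compute the share of matched words that are positive alongside the totals. The sketch below does this for the Bing counts reported in Section 6.4; the column names total and pos_share are illustrative.

```r
library(dplyr)

# Positive/negative counts per source, copied from the Bing table in Section 6.4
bing_counts <- tibble(
  source_name = c("Fox News", "KPBS", "New York Post", "Reason"),
  positive    = c(2L, 1L, 0L, 1L),
  negative    = c(0L, 0L, 2L, 1L)
)

bing_share <- bing_counts %>%
  mutate(
    total     = positive + negative,   # volume of matched words
    pos_share = positive / total       # tone, independent of volume
  )

bing_share
```

With only one or two matched words per source, these shares are illustrative rather than reliable.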

6.2 Top Words Overall

Independent of sentiment, it’s useful to see which words dominate the headlines overall.

news_tokens %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = n)) +
  geom_col(show.legend = FALSE) +
  scale_fill_gradient(low = "#a8d8ea", high = "#0077b6") +
  labs(
    title   = "Top 20 Words in News Headlines",
    x       = "Count", y = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 13)

Interpretation: This is your “what is everyone talking about” chart. Look for names of products, executives, partners, or events that recur — these are candidates for deeper investigation (e.g., is “Starship” appearing a lot because of a successful launch or a setback?).

6.3 The AFINN Lexicon (Scored -5 to +5)

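Unlike Bing’s binary labels, AFINN assigns each word an integer score from -5 (very negative) to +5 (very positive), allowing us to compute average sentiment intensity per news source (the table below groups by source_name). The first call to get_sentiments("afinn") asks for permission to download the lexicon through textdata.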

news_tokens %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(source_name) %>%
  summarise(
    words_matched  = n(),
    mean_sentiment = round(mean(value), 3),
    sum_sentiment  = sum(value),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_sentiment)) %>%
  kable(caption = "AFINN Sentiment Score by Source") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

AFINN Sentiment Score by Source

| source_name | words_matched | mean_sentiment | sum_sentiment |
|:--|--:|--:|--:|
| Hoover.org | 1 | 3 | 3 |
| Financial Post | 1 | 2 | 2 |
| Fox News | 1 | 2 | 2 |
| Reason | 1 | -1 | -1 |

Interpretation: mean_sentiment is the average AFINN score of the matched words for each source; a value near 0 suggests neutral or mixed wording, while a clearly positive or negative mean suggests a dominant tone. sum_sentiment reflects total emotional “weight,” which depends on both tone and volume of coverage. In this pull every source has words_matched = 1, so each score rests on a single word and says little about the outlet’s overall framing.
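
One way to avoid over-reading means built on very few words is to flag groups below a minimum match count. The sketch below uses made-up scores and a hypothetical threshold min_words; both are assumptions for illustration, and a sensible threshold depends on how many headlines you have.

```r
library(dplyr)

# Made-up word scores for two hypothetical sources
scored <- tibble(
  source_name = c("A", "A", "A", "B"),
  value       = c(2, -1, 3, -2)
)

min_words <- 3  # hypothetical threshold, not a standard value

afinn_summary <- scored %>%
  group_by(source_name) %>%
  summarise(
    words_matched  = n(),
    mean_sentiment = mean(value),
    .groups = "drop"
  ) %>%
  mutate(enough_data = words_matched >= min_words)

afinn_summary
```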

6.4 Bing Sentiment Split (Positive vs. Negative Counts)

This table reshapes the Bing results into a wide format so you can directly compare positive counts, negative counts, and the net (positive − negative) score for each news source.

news_tokens %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
  count(source_name, sentiment) %>%
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = list(n = 0)
  ) %>%
  mutate(
    positive = coalesce(positive, 0L),
    negative = coalesce(negative, 0L),
    net      = positive - negative
  ) %>%
  kable(caption = "Bing Sentiment Count by Source") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Bing Sentiment Count by Source

| source_name | positive | negative | net |
|:--|--:|--:|--:|
| Fox News | 2 | 0 | 2 |
| KPBS | 1 | 0 | 1 |
| New York Post | 0 | 2 | -2 |
| Reason | 1 | 1 | 0 |

Interpretation: The net column is a simple, interpretable sentiment index. A positive net score suggests headlines skew favorable; a negative net score suggests the opposite. This kind of index is easy to track over time (e.g., weekly) to build a brand sentiment trendline.

7. TF-IDF: What Makes Each Topic’s Coverage Distinct?

TF-IDF (Term Frequency–Inverse Document Frequency) identifies words that are frequent within one group’s headlines but rare across the other groups’ headlines. In the code below, each news source (source_name) is treated as one document, so the scores show what makes an outlet’s coverage distinctive; group by query instead to compare topics or brands.

news_tokens %>%
  count(source_name, word) %>%
  bind_tf_idf(word, source_name, n) %>%
  group_by(source_name) %>%
  slice_max(tf_idf, n = 6) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, source_name)) %>%
  ggplot(aes(x = tf_idf, y = word, fill = source_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ source_name, scales = "free_y", ncol = 2) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title   = "Top TF-IDF Terms by Source",
    x       = "TF-IDF Score", y = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 12)
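
To make the scores less of a black box, the sketch below recomputes TF-IDF by hand on a toy count table and compares it with bind_tf_idf(). Term frequency is a word’s count divided by the total count in its document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The toy documents “A” and “B” are made up.

```r
library(dplyr)
library(tidytext)

# Toy counts: "tax" appears in both documents, the other words in one each
counts <- tibble(
  doc  = c("A", "A", "B", "B"),
  word = c("tax", "steyer", "tax", "ballot"),
  n    = c(1L, 1L, 1L, 1L)
)

by_pkg <- counts %>% bind_tf_idf(word, doc, n)

n_docs <- n_distinct(counts$doc)

by_hand <- counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                       # share of the document
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc))) %>%   # rarity across documents
  ungroup() %>%
  mutate(tf_idf = tf * idf)

all.equal(by_pkg$tf_idf, by_hand$tf_idf)
```

A word that appears in every document, like “tax” here, gets an IDF of zero and so a TF-IDF of zero, no matter how frequent it is.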

7.1 Refining the TF-IDF Plot for Many Topics

When you have more than a few sources, the basic plot above gets crowded, and the Set1 palette it uses offers only nine colors. The refined version below scales the color palette to the number of sources, reduces the number of terms shown per source, breaks ties so each panel shows at most five terms, truncates long words, and arranges panels in a grid for readability.

n_sources <- n_distinct(news_tokens$source_name)

news_tokens %>%
  count(source_name, word) %>%
  bind_tf_idf(word, source_name, n) %>%
  group_by(source_name) %>%
  slice_max(tf_idf, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(
    word = str_trunc(word, 20),
    word = reorder_within(word, tf_idf, source_name)
  ) %>%
  ggplot(aes(x = tf_idf, y = word, fill = source_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ source_name, scales = "free_y", ncol = 4) +
  scale_y_reordered() +
  scale_fill_manual(
    values = colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))(n_sources)
  ) +
  labs(
    title   = "Top TF-IDF Terms by Source",
    x       = "TF-IDF Score",
    y       = NULL,
    caption = "Source: NewsAPI"
  ) +
  theme_minimal(base_size = 10) +
  theme(
    strip.text    = element_text(size = 8, face = "bold"),
    axis.text.y   = element_text(size = 4),
    panel.spacing = unit(1.2, "lines")
  )

# Save at a large size (16 x 14 in) so the facet labels don't crowd
ggsave("tfidf_plot.png", width = 16, height = 14, dpi = 150)

Interpretation: Words with high TF-IDF scores for a given topic are the terms that “define” that topic’s coverage relative to others — these are often product names, executives, locations, or event-specific terms. For a marketing analyst, these words are strong candidates for keyword tracking, campaign hashtag ideas, or identifying emerging narratives unique to your brand versus competitors.

8. Putting It All Together

A complete brand-monitoring workflow using this tutorial would typically run on a schedule (e.g., daily or weekly):

  1. Pull fresh headlines for your brand and 1-2 key competitors.
  2. Clean and deduplicate.
  3. Tokenize and score sentiment (Bing for simple positive/negative counts, AFINN for an intensity-weighted index).
  4. Track the net sentiment score and mean_sentiment over time.
  5. Use TF-IDF periodically to check whether the themes of coverage are shifting (e.g., from “product launch” language to “regulatory” language).
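
Step 4 can be sketched as a weekly roll-up of Bing labels. The example below assumes a data frame with one row per matched word, a pub_day date column, and a sentiment column; the toy rows are made up, and the name weekly_net is illustrative.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy input: one row per matched sentiment word (made up for illustration)
scored <- tibble(
  pub_day   = as.Date(c("2026-06-15", "2026-06-15", "2026-06-16", "2026-06-23")),
  sentiment = c("positive", "positive", "negative", "positive")
)

weekly_net <- scored %>%
  mutate(week = floor_date(pub_day, unit = "week")) %>%   # start of each week
  count(week, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative) %>%
  arrange(week)

weekly_net
```

Appending each scheduled run’s result to a saved file would build the trendline over time. If a run matches no positive or no negative words at all, the corresponding column will be missing and needs to be added before computing net.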

9. Quiz and Discussion Questions

Q1. (Multiple Choice)

Which of the following best describes the difference between the Bing and AFINN sentiment lexicons used in this tutorial?

A. Bing scores words from -5 to +5; AFINN classifies words as positive or negative only.
B. Bing classifies words as positive or negative only; AFINN assigns a numeric intensity score from -5 to +5.
C. Both lexicons produce identical results because they are built from the same word list.
D. AFINN can only be used with French-language text.

Answer: B. Bing is a binary classifier (each word is labeled "positive" or "negative"), while AFINN assigns a numeric score between -5 and +5, allowing computation of an average sentiment intensity rather than just counts.


Q2. (Short Answer)

In Section 4.3 (“Clean and Deduplicate”), why do we create a title_clean column by removing the trailing “- Source Name” portion of each headline and converting text to lowercase, before checking for duplicates?

Answer: News aggregators often return the same underlying story from multiple outlets, where each version appends a different source name to the end of the title (e.g., “SpaceX Launches Rocket - Reuters” vs. “SpaceX Launches Rocket - AP”). If we compared raw titles, these would look like distinct headlines and inflate our count of “unique” stories. Removing the source suffix, standardizing punctuation/whitespace, and lowercasing the text ensures that headlines describing the same story are recognized as duplicates and only counted once — giving a more accurate picture of true coverage volume.


Q3. (Multiple Choice)

A marketing manager looks at Graph 2 (Sentiment Volume Comparison) and sees that Brand A has 40 positive and 10 negative sentiment-word matches, while Brand B has 15 positive and 5 negative matches. Which statement is most accurate?

A. Brand A definitely has better brand sentiment than Brand B because it has more positive words.
B. Brand B might have comparable or better relative sentiment, since both brands have the same 3:1 positive-to-negative ratio, but Brand A simply has more total coverage.
C. The two brands cannot be compared in any way.
D. Brand B’s coverage is more negative because it has fewer total matches.

Answer: B. Raw counts conflate sentiment tone with sentiment volume. Both brands have an identical 3:1 ratio of positive to negative words, so their relative tone is similar — Brand A simply has more total news coverage (more matched words overall). A good analyst should look at both the absolute volume (which signals visibility/buzz) and the ratio or net score (which signals tone) separately.


Q4. (Discussion)

The TF-IDF analysis in Section 7 highlights words that are distinctive to one brand’s coverage versus another’s. Suppose you run this analysis for your company and a competitor, and your company’s top TF-IDF terms include words like “lawsuit,” “investigation,” and “delay,” while the competitor’s top terms include “launch,” “partnership,” and “award.” What might this tell you, and what would you investigate next?

Answer (sample discussion points): This pattern suggests that, during the analyzed period, your company’s distinctive media narrative is dominated by negative or risk-related events (legal/regulatory issues, delays), while the competitor’s distinctive narrative centers on positive business momentum (product launches, partnerships, recognition). This is a signal — not a verdict — and should prompt further investigation: (1) read the actual headlines behind these TF-IDF terms to understand the underlying stories, (2) check whether this is a recent shift or a longer-term pattern by re-running the analysis over different time windows, (3) consider how this narrative gap might affect brand perception, investor confidence, or campaign timing, and (4) coordinate with PR/communications teams if a response or proactive positive-story pipeline is warranted.


Q5. (Multiple Choice)

Why does the tutorial recommend storing your NewsAPI key using Sys.getenv() and an .Renviron file rather than typing it directly into the script (e.g., newsapi_key("abc123"))?

A. Sys.getenv() makes the API calls run faster.
B. Hard-coded keys are required by NewsAPI’s terms of service.
C. It prevents the key from being accidentally shared, committed to version control, or exposed if the script/notebook is distributed.
D. .Renviron files automatically refresh the API key every 24 hours.

Answer: C. API keys are credentials tied to your account and usage limits. Hard-coding them into a script means anyone who receives that script (e.g., classmates, a shared GitHub repo, a knitted HTML report) also receives your credentials. Storing keys as environment variables keeps secrets out of shared code while still allowing the script to authenticate.


Q6. (Discussion)

This tutorial uses lexicon-based sentiment analysis (Bing and AFINN), where sentiment is determined by looking up individual words in a pre-built dictionary. What is one limitation of this approach when applied to news headlines specifically, and how might it lead to a misleading sentiment score?

Answer (sample discussion points): Lexicon-based methods score words independently, ignoring context, negation, and sarcasm. For example, a headline like “SpaceX avoids major setback after engine issue” contains the negative word “setback” and possibly “issue,” which a lexicon would score negatively — even though the headline is reporting good news (the setback was avoided). Similarly, headlines are often short and may contain domain-specific or proper-noun “words” (e.g., company names, ticker symbols) that aren’t in general-purpose lexicons like Bing or AFINN, so meaningful content can be missed entirely. More advanced approaches (e.g., sentence-level models, transformer-based sentiment classifiers) can better capture context, negation, and domain-specific tone, at the cost of greater computational complexity.
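
The limitation can be seen directly by scoring the example headline from the answer above with the Bing lexicon. This is a minimal sketch; the object names headline and scored are illustrative.

```r
library(dplyr)
library(tidytext)

headline <- tibble(title = "SpaceX avoids major setback after engine issue")

scored <- headline %>%
  unnest_tokens(word, title) %>%
  inner_join(get_sentiments("bing"), by = "word")

scored
```

The lexicon labels “setback” as negative even though the headline reports that the setback was avoided, so a word-count index would score this good-news headline as negative.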