SPS_Data607_Week9_DC

Assignment: Using a New York Times API in R

The New York Times provides multiple public APIs via its developer portal (NYT Developer Network). To use any NYT API, you must first create an account and obtain an API key.

Your task:

  1. Choose one New York Times API endpoint (e.g., Article Search, Most Popular, Books). Use your data analysis to ask, and attempt to answer, an interesting question of your choosing. Examples: “What are the top five best-selling hardcover books?”, “Which newspaper sections have produced the most popular articles, and how has this changed compared to X years ago?”

  2. In R, write code to:

-    Authenticate using your API key.
Never hard-code an API key.
There are several good ways to avoid this.
For example, store the key in an environment variable
and read it with `Sys.getenv("NYT_API_KEY")`.

-    Make a request to the endpoint

-    Parse the **JSON** response

-    Transform the result into a clean **R data frame** (tibble is fine)
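
For the authentication step in the list above, a small helper can make a missing key fail early with a readable message. This is only a sketch: `get_nyt_key()` is a hypothetical name, and it assumes the key is stored in the `NYT_API_KEY` environment variable (for example, in `~/.Renviron`).

```r
# Hypothetical helper (not part of httr or the NYT API):
# read the key from an environment variable and stop early if it is missing.
get_nyt_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var,
         " is not set; add it to ~/.Renviron and restart R.")
  }
  key
}
```

A call such as `api_key <- get_nyt_key()` then either returns the key or stops before any request is sent.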

Deliverable

-    The API you selected (endpoint + brief description)

-    The request you made (parameters used)

-    Code that returns a tidy data frame

-    A short note describing any data-cleaning decisions (e.g., nested fields, missing values)

Approach

  • Talk to the New York Times API

  • Handle JSON data (API responses)

  • Clean and analyze data

  • Make charts

Setup

httr → sends a request to NYT (like opening a webpage automatically)

jsonlite → parses the JSON response into R objects (lists and data frames)

dplyr → lets you filter/select columns easily

ggplot2 → creates graphs

#install.packages(c("httr", "jsonlite", "dplyr", "ggplot2"))

library(httr)
library(jsonlite)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)

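Set API key (the string "API_KEY" is a placeholder; substitute your own key, ideally in ~/.Renviron rather than in the script itself):

# Placeholder value only: the API will not accept it as a valid key.
Sys.setenv(NYT_API_KEY = "API_KEY")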

API Function

Create a reusable function that:

  1. asks the New York Times Article Search API for data

    • query = what you search (e.g., “China”)

    • begin_date / end_date = time range

    • page = which batch of results

  2. returns articles as a table

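nyt_search <- function(query, begin_date, end_date, page = 0) {
  api_key <- Sys.getenv("NYT_API_KEY")
  if (!nzchar(api_key)) stop("NYT_API_KEY is not set")
  
  url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
  
  res <- GET(url, query = list(
    q = query,
    begin_date = begin_date,
    end_date = end_date,
    page = page,
    `api-key` = api_key
  ))
  
  # Fail loudly on HTTP errors (e.g., an invalid key or a rate limit)
  # instead of silently returning NULL.
  stop_for_status(res)
  
  # Use a name other than `content` so httr::content() is not shadowed
  body <- content(res, as = "text", encoding = "UTF-8")
  json <- fromJSON(body, flatten = TRUE)
  
  return(json$response$docs)
}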
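
To see what `flatten = TRUE` does to nested fields, the sketch below parses a hand-written JSON string that imitates the shape the function relies on (`response$docs`, with a nested `headline` object). The sample values are invented for illustration; a real response contains many more fields.

```r
library(jsonlite)

# Hand-written sample imitating the shape of an Article Search response.
# All values are invented for illustration only.
sample_json <- '{
  "response": {
    "docs": [
      {"headline": {"main": "First headline"},
       "pub_date": "1925-03-01T00:00:00+0000"},
      {"headline": {"main": "Second headline"},
       "pub_date": "1927-08-15T00:00:00+0000"}
    ]
  }
}'

parsed <- fromJSON(sample_json, flatten = TRUE)
docs <- parsed$response$docs

# The nested "headline" object becomes the flat column "headline.main"
names(docs)
```

This is why the cleaning step later refers to `headline.main` rather than `headline$main`.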

Data Collection

The API does not return all results at once.

Each request returns one page of 10 articles, so a loop is needed to collect more.

Create a list to hold the results, loop over the pages, and store each page as it arrives. Pause between requests so the NYT API does not rate-limit or block the API key.
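
The paging pattern just described can be tried offline before spending real API calls. The sketch below uses hypothetical names (`collect_pages`, `fake_fetch`) and a stand-in function instead of the NYT API; it stops at the first empty page and stacks the rest into one data frame.

```r
# Generic paging loop: call fetch_page(p) for p = 0, 1, ... until a page is empty.
# `collect_pages` and `fake_fetch` are hypothetical names used only in this sketch.
collect_pages <- function(fetch_page, max_pages = 20, delay = 0) {
  pages <- list()
  for (p in 0:max_pages) {
    page <- fetch_page(p)
    if (is.null(page) || NROW(page) == 0) break  # no more results
    pages[[length(pages) + 1]] <- page
    Sys.sleep(delay)                             # pause before the next request
  }
  do.call(rbind, pages)                          # stack all pages into one data frame
}

# Stand-in for the API: three pages of 10 rows each, then nothing.
fake_fetch <- function(p) {
  if (p > 2) return(NULL)
  data.frame(id = p * 10 + 1:10, page = p)
}

result <- collect_pages(fake_fetch)
nrow(result)  # 30
```

With a real API function, `delay` would be set to a few seconds rather than 0.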

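all_articles <- list()

for (p in 0:20) {
  cat("Fetching page:", p, "\n")
  # Dates are passed as "YYYYMMDD" strings so they are never reformatted as numbers
  tmp <- nyt_search("China OR Chinese OR Beijing OR Shanghai", "19200101", "19291231", p)
  
  # Stop when a page comes back empty (no more results)
  if (length(tmp) == 0) break
  all_articles[[p + 1]] <- tmp
  
  # Pause between requests; the NYT developer FAQ suggests about
  # 12 seconds between calls (5 requests per minute)
  Sys.sleep(12)
}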
Fetching page: 0 
df <- bind_rows(all_articles)

Data Cleaning

Raw API data is untidy

  • Keep only useful columns

  • Rename them

  • Fix date format
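
Before selecting and renaming columns, it can help to confirm that they exist, since a failed or empty request produces a data frame without them. The guard below is a sketch under that assumption; `check_columns()` is a hypothetical helper, not a dplyr function.

```r
# Hypothetical guard: confirm the expected columns exist before selecting them.
check_columns <- function(df, required) {
  absent <- setdiff(required, names(df))
  if (length(absent) > 0) {
    stop("Missing column(s): ", paste(absent, collapse = ", "),
         ". Was the API request successful?")
  }
  invisible(TRUE)
}

required <- c("headline.main", "pub_date", "snippet", "section_name", "web_url")

# An empty result (as produced by a failed request) is caught with a clear message.
empty_df <- data.frame()
msg <- tryCatch(check_columns(empty_df, required),
                error = function(e) conditionMessage(e))
msg
```

Calling `check_columns(df, required)` right after `bind_rows()` would surface the problem at its source rather than inside `select()`.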

df_clean <- df %>%
  select(
    headline = headline.main,
    pub_date,
    snippet,
    section_name,
    web_url
  ) %>%
  mutate(pub_date = as.Date(pub_date))
Error in `select()`:
! Can't select columns that don't exist.
✖ Column `headline.main` doesn't exist.
head(df_clean)
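Error:
! object 'df_clean' not found

Note: the errors above, and those in the Analysis section, share one cause: `df` is empty. The log shows only "Fetching page: 0", so the first request returned no `docs` and the loop stopped immediately. The most likely reason is that the request was rejected, since "API_KEY" is a placeholder rather than a valid key. With no rows in `df`, the flattened column `headline.main` never exists and `df_clean` is never created.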

Analysis

Articles Over Time

df_clean %>%
  mutate(year = format(pub_date, "%Y")) %>%
  count(year) %>%
  ggplot(aes(x = year, y = n, group = 1)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = "NYT Articles About China (1920s)",
       x = "Year",
       y = "Number of Articles")
Error:
! object 'df_clean' not found

Keyword Exploration

# \\b restricts the match to the whole word "war" (not "toward" or "award")
war_articles <- df_clean %>%
  filter(grepl("\\bwar\\b", snippet, ignore.case = TRUE))
Error:
! object 'df_clean' not found
head(war_articles)
Error:
! object 'war_articles' not found
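
The pattern given to `grepl()` affects which snippets count as matches. The sketch below uses three invented snippets (not NYT data) to compare a plain substring match with a whole-word match using `\\b` word boundaries.

```r
# Invented example snippets, for illustration only
snippets <- c("The war in the north continues",
              "Trade moves toward recovery",
              "Award given to the envoy")

# Substring match: also hits "toward" and "Award"
loose <- grepl("war", snippets, ignore.case = TRUE)

# Whole-word match: only the standalone word "war"
strict <- grepl("\\bwar\\b", snippets, ignore.case = TRUE)

loose   # TRUE TRUE TRUE
strict  # TRUE FALSE FALSE
```

The whole-word version excludes related words such as "wars" or "warfare", so the choice depends on the question being asked.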