data607API

Author

XiaoFei Mei

Approach

The task is to make a request to the New York Times API, build a tidy data frame from the response, and use the data to answer a research question.

The first question seems interesting: “What are the top five best-selling hardcover books?” The code below answers it for the Hardcover Fiction list.

To answer it, I create an API key on the NYT developer site, specifically for the Books API. The key is stored in an environment variable and retrieved with Sys.getenv(), so it never appears in the code itself. Next, I define a function that sends GET requests to the NYT API for best-selling books. Finally, I present the result in a tidy data frame and export it as a CSV file.

Authentication: Loading the API Key

library(httr)
library(jsonlite)
library(dplyr)
library(purrr)
library(lubridate)
library(tibble)
nyt_api_key <- Sys.getenv("NYT_API_KEY")
base_url <- "https://api.nytimes.com/svc/books/v3"

current_date <- Sys.Date()
start_date <- current_date - years(5)

dates_to_query <- seq.Date(
  from = start_date,
  to = current_date,
  by = "quarter"
)

formatted_dates <- format(dates_to_query, "%Y-%m-%d")

# Show how many dates we'll query
cat("Querying", length(formatted_dates), "dates from", 
    min(formatted_dates), "to", max(formatted_dates))
Querying 21 dates from 2021-03-29 to 2026-03-29
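
Sys.getenv() returns an empty string when the variable is not set, so a missing key would only surface later as a failed request. A minimal sketch of failing early is shown here; the helper name get_api_key() is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (an assumption, not from the original code):
# read the key from an environment variable and stop early if it is missing.
get_api_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var)  # returns "" when the variable is unset
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set. ",
         "Add it to ~/.Renviron and restart R.", call. = FALSE)
  }
  key
}

# Usage (requires NYT_API_KEY to be set):
# nyt_api_key <- get_api_key()
```

With a check like this, a misconfigured session stops with a clear message before any request is sent.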

Making API request with error handling

# The first few attempts returned errors, so error handling and rate-limit retries were added. The search period was also reduced.
fetch_best_sellers <- function(published_date, api_key, max_retries = 5) {
  
  endpoint <- paste0(base_url, "/lists/overview.json")
  attempt <- 1
  
  while (attempt <= max_retries) {
    
    response <- GET(
      url = endpoint,
      query = list(
        published_date = published_date,
        `api-key` = api_key
      )
    )
    
    status <- status_code(response)
    
    if (status == 200) {
      content_text <- content(response, "text", encoding = "UTF-8")
      parsed_json <- fromJSON(content_text, flatten = TRUE)
      
      lists_data <- parsed_json$results$lists
      hc_fiction <- lists_data[lists_data$list_name == "Hardcover Fiction", ]
      
      if (nrow(hc_fiction) == 0) return(NULL)
      
      books_df <- hc_fiction$books[[1]]
      # books_df holds one row per ranked book; drop unneeded columns here if desired
      return(books_df)
    }
    
    if (status == 429) {
      wait_time <- 2^attempt
      message(paste("Rate limited. Waiting", wait_time, "seconds..."))
      Sys.sleep(wait_time)
      attempt <- attempt + 1
    } else {
      warning(paste("Failed:", published_date, "Status:", status))
      return(NULL)
    }
  }
  return(NULL)
}
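
The retry loop in fetch_best_sellers() waits 2^attempt seconds after each 429 response, so the waits grow as 2, 4, 8, 16, and 32 seconds over five attempts. The sketch below isolates that backoff pattern so it can be tried without calling the API. The names with_backoff() and fake_request() are hypothetical, and the sleep argument is an assumption added only so the waits can be recorded instead of actually paused.

```r
# Hypothetical helper: call `f` until it stops reporting a rate limit (429),
# waiting base^attempt seconds between tries.
with_backoff <- function(f, max_retries = 5, base = 2, sleep = Sys.sleep) {
  for (attempt in seq_len(max_retries)) {
    result <- f()
    if (!isTRUE(result$status == 429)) return(result)
    sleep(base^attempt)
  }
  NULL  # every attempt was rate limited
}

# Stand-in for an API call: rate limited twice, then succeeds.
calls <- 0
fake_request <- function() {
  calls <<- calls + 1
  list(status = if (calls < 3) 429L else 200L)
}

# Record the waits instead of sleeping.
waits <- c()
result <- with_backoff(fake_request, sleep = function(s) waits <<- c(waits, s))

print(result$status)  # 200
print(waits)          # 2 4
```

Doubling the wait after each failure gives the server progressively more time to lift the limit, while max_retries keeps the loop from running indefinitely.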


formatted_dates <- c("2026-02-01")  # Override the quarterly sequence above: query only February 2026

Fetch Data

all_books_list <- map(formatted_dates, function(date) {
  Sys.sleep(3)   # pause between requests to avoid the rate limit
  fetch_best_sellers(date, nyt_api_key)
})

all_books_list <- compact(all_books_list)
all_books_df <- bind_rows(all_books_list)

# Create a tidy data frame with only the columns of interest
clean_books_df <- all_books_df %>%
  select(rank, title, author) %>%
  as_tibble()

glimpse(clean_books_df)
Rows: 15
Columns: 3
$ rank   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
$ title  <chr> "THE CORRESPONDENT", "ANATOMY OF AN ALIBI", "THE DEVIL'S DAUGHT…
$ author <chr> "Virginia Evans", "Ashley Elston", "Danielle Steel", "Celina My…
cat("\nSAMPLE DATA:\n")

SAMPLE DATA:
print(head(clean_books_df))
# A tibble: 6 × 3
   rank title                    author        
  <int> <chr>                    <chr>         
1     1 THE CORRESPONDENT        Virginia Evans
2     2 ANATOMY OF AN ALIBI      Ashley Elston 
3     3 THE DEVIL'S DAUGHTER     Danielle Steel
4     4 HOLLOW                   Celina Myers  
5     5 THE FIRST TIME I SAW HIM Laura Dave    
6     6 THE WIDOW                John Grisham  
# Aggregate Top 5 books
top_books <- clean_books_df %>%
  group_by(title, author) %>%
  summarise(
    avg_rank = mean(rank),
    appearances = n(),
    best_rank = min(rank),
    .groups = "drop"
  ) %>%
  arrange(avg_rank, desc(appearances)) %>%
  slice_head(n = 5)


#  Final Result
final_result <- top_books %>%
  arrange(avg_rank) %>%
  as_tibble()

cat("\nFINAL TOP 5 RESULT\n")

FINAL TOP 5 RESULT
print(final_result)
# A tibble: 5 × 5
  title                    author         avg_rank appearances best_rank
  <chr>                    <chr>             <dbl>       <int>     <int>
1 THE CORRESPONDENT        Virginia Evans        1           1         1
2 ANATOMY OF AN ALIBI      Ashley Elston         2           1         2
3 THE DEVIL'S DAUGHTER     Danielle Steel        3           1         3
4 HOLLOW                   Celina Myers          4           1         4
5 THE FIRST TIME I SAW HIM Laura Dave            5           1         5
# Export to CSV
write.csv(final_result, "nyt_top_books_feb2026.csv", row.names = FALSE)
cat("\nCSV file saved as: nyt_top_books_feb2026.csv\n")

CSV file saved as: nyt_top_books_feb2026.csv

Conclusion

This project was practice in connecting to the NYT API, extracting data, and cleaning and summarizing the results. The few problems I encountered were related to how the NYT API works. For example, best-seller data is not released monthly; instead, lists are published on set dates. There was also a request rate limit that made my first few attempts unsuccessful. These problems were fixed by narrowing the search to February 2026 and adding rate-limit error handling.

As for data cleaning, nested fields in the API response, such as the list of books within each category, were flattened to extract only the relevant book-level information: rank, title, and author. Missing or empty lists for certain dates were removed using purrr::compact() so that only valid results were included.
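
The flattening step can be illustrated on a small, made-up JSON string. The sample below is an assumption: it only mirrors the fields the code above accesses (results, lists, list_name, books) and is not the full API response. Its titles and authors are placeholders. The base R Filter() call is shown as a rough stand-in for what purrr::compact() does with NULL results.

```r
library(jsonlite)

# Made-up sample with the same nesting the analysis relies on.
sample_json <- '{
  "results": {
    "lists": [
      {"list_name": "Hardcover Fiction",
       "books": [
         {"rank": 1, "title": "BOOK A", "author": "Author A"},
         {"rank": 2, "title": "BOOK B", "author": "Author B"}
       ]},
      {"list_name": "Hardcover Nonfiction",
       "books": [
         {"rank": 1, "title": "BOOK C", "author": "Author C"}
       ]}
    ]
  }
}'

parsed <- fromJSON(sample_json, flatten = TRUE)

# `lists` becomes a data frame; `books` is a list column of data frames.
lists_data <- parsed$results$lists
hc_fiction <- lists_data[lists_data$list_name == "Hardcover Fiction", ]

# Pull the nested data frame and keep only the book-level fields.
books_df <- hc_fiction$books[[1]][, c("rank", "title", "author")]
print(books_df)

# Dropping NULL results (dates with no matching list), base R equivalent.
results <- list(books_df, NULL, books_df)
kept <- Filter(Negate(is.null), results)
length(kept)
```

Each element of the books column is itself a data frame, which is why the original code indexes it with [[1]] before selecting columns.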