Overview

This project demonstrates how to retrieve and analyze data using web APIs, which allow developers to programmatically access structured data directly from various platforms. Specifically, we will use the New York Times API to examine data from two of its popular endpoints: the “Most Viewed” and “Most Shared” articles.

Setting Up Authentication

The first step was to register on the New York Times Developer Network and obtain an API key. I placed the key in a code chunk with echo = FALSE so that it does not appear in the rendered report; note that this only hides the chunk from the output, and the key itself remains in the source file. For this analysis, I chose the “Most Viewed” and “Most Shared” endpoints of the Most Popular API. The “Most Viewed” endpoint returns the most-viewed articles over a specified period; I set it to the last 30 days. The “Most Shared” endpoint returns the articles shared most often on a selected social platform; I chose Facebook because it gave consistent results during testing, again over the last 30 days. In both cases I converted the JSON response into a data frame, which allows a side-by-side comparison of the articles that are most viewed and most shared on the New York Times site.
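
A common way to keep an API key out of the source file altogether is to read it from an environment variable. The following is a minimal sketch of that idea, not the setup used in this report; the helper name get_api_key and the variable name NYT_API_KEY are assumptions chosen for illustration.

```r
# Hypothetical helper: read the API key from an environment variable.
# The name NYT_API_KEY is an assumption, not part of the original report.
get_api_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Usage, after adding a line such as NYT_API_KEY=your-key-here to ~/.Renviron
# and restarting R:
# api_key <- get_api_key()
```

With this approach the key never appears in the .Rmd file, so the document can be shared or committed to version control without exposing it.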

Retrieving Data from the NYT API

Most Viewed Articles

To start, we use the “Most Viewed” endpoint, which lists the articles that received the most views. This helps identify popular topics and high-interest content on the New York Times platform.

The endpoint takes a time window as part of the URL. I chose 30 days, which returns the top articles from the past month, and converted the JSON response into a data frame for easier analysis.

The “Most Shared” endpoint works the same way but lists the articles shared most often on a specified platform. I selected Facebook, as it was the most stable option during testing, and again requested the last 30 days. As with the “Most Viewed” data, the JSON response was converted into a data frame.

library(httr)
library(jsonlite)
library(tidyverse)
library(ggplot2)
library(knitr)


# Retrieving data from the Most Viewed API (NYT's most viewed articles in the last 30 days)
url_view <- "https://api.nytimes.com/svc/mostpopular/v2/viewed/30.json"  # API endpoint for Most Viewed articles

# Sending a GET request to the Most Viewed API with the API key
# (api_key is defined in the hidden chunk described above)
response_view <- GET(url_view, query = list("api-key" = api_key))

# Parsing the JSON response into a list format
parsed_data_view <- fromJSON(content(response_view, as = "text", encoding = "UTF-8"))

# Extracting the results (most viewed articles) from the parsed data
articles_view <- parsed_data_view$results


# Retrieving data from the Most Shared API (NYT's most shared articles on Facebook in the last 30 days)
url_shared <- "https://api.nytimes.com/svc/mostpopular/v2/shared/30/facebook.json"  # API endpoint for Most Shared articles on Facebook

# Sending a GET request to the Most Shared API with the API key
response_shared <- GET(url_shared, query = list("api-key" = api_key))

# Parsing the JSON response into a list format
parsed_data_shared <- fromJSON(content(response_shared, as = "text", encoding = "UTF-8"))

# Extracting the results (most shared articles on Facebook) from the parsed data
articles_shared <- parsed_data_shared$results
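
The two requests above follow the same pattern, so they could be wrapped in small helpers. The sketch below is one possible arrangement, not the code used in this report: build_nyt_url and fetch_results are hypothetical names, the URL layout is copied from the two URLs above, and the restriction of the period to 1, 7, or 30 days is my assumption about what the API accepts. The fetch helper also stops with an error when the server returns an HTTP error status, which the original requests do not check.

```r
# Hypothetical helper: build a Most Popular API URL.
# Path layout is copied from the two URLs used in this report.
build_nyt_url <- function(type = c("viewed", "shared"), period = 30, platform = NULL) {
  type <- match.arg(type)
  # Assumption: the API accepts periods of 1, 7, or 30 days
  stopifnot(period %in% c(1, 7, 30))
  base <- "https://api.nytimes.com/svc/mostpopular/v2"
  if (type == "shared" && !is.null(platform)) {
    sprintf("%s/%s/%d/%s.json", base, type, as.integer(period), platform)
  } else {
    sprintf("%s/%s/%d.json", base, type, as.integer(period))
  }
}

# Hypothetical helper: request a URL and return the parsed results.
# Requires the httr and jsonlite packages and a valid API key.
fetch_results <- function(url, api_key) {
  response <- httr::GET(url, query = list("api-key" = api_key))
  httr::stop_for_status(response)  # raise an R error on 4xx/5xx responses
  txt <- httr::content(response, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt)$results
}

# Usage (needs network access and a key):
# articles_view   <- fetch_results(build_nyt_url("viewed", 30), api_key)
# articles_shared <- fetch_results(build_nyt_url("shared", 30, "facebook"), api_key)
```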

Data Cleaning Process

In this step, I use the Tidyverse suite to clean and organize the data gathered from the New York Times API. This preparation is essential for creating a structured dataset that can address several questions I formulated to analyze reader engagement trends.
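
The code below works with cleaned data frames named articles_view_dat and articles_shared_dat, but the cleaning code itself is not shown. As a rough sketch of what such a step could look like, the example keeps only the columns used later in this report (id, title, section, published_date, updated) and drops nested columns so that rows can be combined and compared. The sample data and the helper clean_articles are made up for illustration and are not the author's code.

```r
library(dplyr)

# Small stand-in for the parsed API results (all values are made up)
articles_view <- data.frame(
  id = c(1, 2),
  title = c("Article A", "Article B"),
  section = c("Opinion", "U.S."),
  published_date = c("2024-10-01", "2024-10-05"),
  updated = c("2024-10-02 09:15:00", "2024-10-05 18:30:00"),
  stringsAsFactors = FALSE
)
# A nested (list) column, similar to what a parsed JSON response may contain
articles_view$media <- list(list(), list())

# Columns referenced later in this report
keep_cols <- c("id", "title", "section", "published_date", "updated")

# Hypothetical helper: keep only the plain columns needed for the analysis
clean_articles <- function(df) {
  df %>% select(all_of(keep_cols))
}

articles_view_dat <- clean_articles(articles_view)
```

The same helper would be applied to articles_shared to produce articles_shared_dat. Keeping identical columns in both data frames is what lets bind_rows() and distinct() treat an article that appears in both groups as a single row.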

Article Overlap Between Groups

How many articles appear in both the “Most Viewed” and “Most Shared on Facebook” groups?

# Determining the number of articles that are the same between the two groups

# Joining both datasets to find matching articles based on the "id" column
matching_id <- inner_join(articles_view_dat, articles_shared_dat, by = "id")

# Displaying the count of common articles and their details
cat("There are", nrow(matching_id), "articles in common between the 'Most Viewed' in the past 30 days and the 'Most Shared on Facebook' in the past 30 days.\nThey are:\n")
## There are 4 articles in common between the 'Most Viewed' in the past 30 days and the 'Most Shared on Facebook' in the past 30 days.
## They are:
# Looping through each matching article to display the title, section, and published date
for (i in seq_len(nrow(matching_id))) {
  cat(matching_id$title.x[i], "published in the", matching_id$section.x[i], "section on", matching_id$published_date.x[i], ".\n")
}
## Kris Kristofferson, Country Singer, Songwriter and Actor, Dies at 88 published in the Obituaries section on 2024-09-29 .
## James Carville: Three Reasons I’m Certain Kamala Harris Will Win published in the Opinion section on 2024-10-23 .
## The Only Patriotic Choice for President published in the Opinion section on 2024-09-30 .
## At a Pennsylvania Rally, Trump Descends to New Levels of Vulgarity published in the U.S. section on 2024-10-19 .
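
Because only the overlapping articles are needed here, a filtering join is a possible alternative to inner_join(). The sketch below uses made-up data to show semi_join(), which keeps the rows of the first data frame that have a match in the second and does not add the .x and .y column suffixes seen above.

```r
library(dplyr)

# Made-up stand-ins for the two cleaned data frames
viewed <- data.frame(id = c(101, 102, 103), title = c("A", "B", "C"))
shared <- data.frame(id = c(102, 103, 104), title = c("B", "C", "D"))

# Keep rows of `viewed` whose id also appears in `shared`
in_both <- semi_join(viewed, shared, by = "id")

# seq_len() yields an empty sequence when there are no matches,
# so the loop body is skipped instead of running on indices 1 and 0
for (i in seq_len(nrow(in_both))) {
  cat(sprintf("%s (id %s)\n", in_both$title[i], in_both$id[i]))
}
```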

Time Difference Analysis

Which articles, across both groups, had the longest gap between their publication date and their last update?

# Identifying articles with the longest duration between publication and last update

# Merging both datasets and removing duplicates to get unique articles
merged_df <- bind_rows(articles_view_dat, articles_shared_dat) %>%
  distinct()

# Converting publication and update timestamps to calendar dates, then
# calculating the time difference in days between publication and last update
merged_df <- merged_df %>%
  mutate(
    published_date = as.Date(published_date),
    updated = as.Date(updated),
    t_diff = as.numeric(difftime(updated, published_date, units = "days"))
  )

# Sorting articles by the longest time difference
time_df <- merged_df %>%
  arrange(desc(t_diff)) %>%
  select(t_diff, title)

# Displaying a table of articles sorted by the time passed between publication and last update
kable(time_df, caption = "Time Passed in Days Between First Published and Last Updated (Descending Order)")
Time Passed in Days Between First Published and Last Updated (Descending Order)

t_diff  title
    72  Tracking the Swing States for Harris and Trump
    27  The Only Patriotic Choice for President
     9  Pete Rose, Baseball Star Who Earned Glory and Shame, Dies at 83
     9  The Dangers of Donald Trump, From Those Who Know Him
     8  When Trump Rants, This Is What I Hear
     7  Billy Joel Is Selling the Mansion He First Saw While Dredging Oysters
     6  Milton Tracker: Latest on Storm’s Path, Power Outages and Winds
     6  What We Know About Liam Payne’s Death and the Drugs Found in His System
     6  The Secretive Dynasty That Controls the Boar’s Head Brand
     4  What’s Wrong With Donald Trump?
     4  Ann Patchett: The Decision I Made 30 Years Ago That I Still Regret
     3  When a Television Meteorologist Breaks Down on Air and Admits Fear
     3  Her Face Was Unrecognizable After an Explosion. A Placenta Restored It.
     2  The Truth About Tuna
     2  Phil Lesh, Bassist Who Anchored the Grateful Dead, Dies at 84
     2  Why Trump Is Lying About Disaster Relief
     2  As Election Nears, Kelly Warns Trump Would Rule Like a Dictator
     2  Trump Is Telling Us What He Would Do. Believe Him.
     2  How the North Carolina Legislature Left Homes Vulnerable to Helene
     1  Kamala Harris Arrived for a Fox Interview. She Got a Debate.
     1  6 Takeaways From Donald Trump’s 3-Hour Podcast With Joe Rogan
     1  Hurricane Helene: Mapping More Than 600 Miles of Devastation
     1  Harris and Trump Deadlocked to the End, Final Times/Siena National Poll Finds
     1  Book Revives Questions About Trump’s Ties to Putin
     1  Kris Kristofferson, Country Singer, Songwriter and Actor, Dies at 88
     1  James Carville: Three Reasons I’m Certain Kamala Harris Will Win
     1  Nate Silver: Here’s What My Gut Says About the Election, but Don’t Trust Anyone’s Gut, Even Mine
     1  A Frustrated Trump Lashes Out Behind Closed Doors Over Money
     1  At a Pennsylvania Rally, Trump Descends to New Levels of Vulgarity
     1  Judge Orders Giuliani to Forfeit Millions in Assets to Election Workers He Defamed
     1  An Open Letter to Jimmy Carter, on His 100th Birthday
     1  The Many Links Between Project 2025 and Trump’s World
     1  Is It Fascism? A Leading Historian Changes His Mind.
     1  Frank Fritz, a Host of the Antiques Show ‘American Pickers,’ Dies at 60
     1  American Business Cannot Afford to Risk Another Trump Presidency
     1  Washington Post Says It Will Stop Endorsing Presidential Candidates

# Outputting the top three articles with the longest duration between first published and last updated
cat("The top three articles from the 'Most Viewed' and 'Most Shared on Facebook' groups with the longest time passed from first published to last updated are:\n\n",
    
    time_df$title[1], ":\n with", time_df$t_diff[1], "days passing since the article was originally published.\n\n",
    
    time_df$title[2], ":\n with", time_df$t_diff[2], "days passing since the article was originally published.\n\n",
    
    time_df$title[3], ":\n with", time_df$t_diff[3], "days passing since the article was originally published.\n\n")
## The top three articles from the 'Most Viewed' and 'Most Shared on Facebook' groups with the longest time passed from first published to last updated are:
## 
##  Tracking the Swing States for Harris and Trump :
##  with 72 days passing since the article was originally published.
## 
##  The Only Patriotic Choice for President :
##  with 27 days passing since the article was originally published.
## 
##  Pete Rose, Baseball Star Who Earned Glory and Shame, Dies at 83 :
##  with 9 days passing since the article was originally published.
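
The top three rows above are picked by position after sorting. One alternative, sketched here with made-up data, is dplyr's slice_max(), which selects the top n rows directly and makes the handling of ties explicit. This matters for the table above, where two articles tie at 9 days, so which of them lands in third place depends on row order.

```r
library(dplyr)

# Made-up stand-in for time_df; two rows tie at 9 days
time_df <- data.frame(
  t_diff = c(2, 72, 9, 27, 9),
  title = c("Article A", "Article B", "Article C", "Article D", "Article E")
)

# with_ties = FALSE returns exactly n rows; the default (TRUE) would
# return every row tied for the last place as well
top_n_df <- time_df %>%
  slice_max(t_diff, n = 3, with_ties = FALSE)

# sprintf() is vectorised, so one call formats all rows
cat(sprintf("%s: %d days between publication and last update\n",
            top_n_df$title, as.integer(top_n_df$t_diff)), sep = "")
```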

Conclusion

The New York Times offers a range of APIs that support many kinds of analysis. For this project, I used the Most Viewed and Most Shared endpoints to build a sample report on the New York Times’ top articles by views and by Facebook shares. A report like this is a starting point for understanding which topics resonate most with readers and why certain articles attract more attention than others.

Looking ahead, a potential next step would be to analyze how the most shared articles vary across different social media platforms like X, Reddit, and Facebook. Such an analysis could reveal content preferences unique to each platform’s user base. Another interesting extension would be to automate the Most Viewed report daily over a month, tracking the longevity of articles in the top-viewed list. This could provide insights into content relevance and engagement over time.
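
To make the daily-tracking idea more concrete, here is a rough sketch of how longevity could be summarised once snapshots have been collected. It assumes a hypothetical data frame with one row per article per day on which that article appeared in the Most Viewed list; the column name snapshot_date and all values are invented for illustration.

```r
library(dplyr)

# Hypothetical daily snapshots: one row per article per day on the list
snapshots <- data.frame(
  snapshot_date = as.Date(c("2024-10-01", "2024-10-01", "2024-10-02",
                            "2024-10-02", "2024-10-03")),
  id = c(1, 2, 1, 3, 1),
  title = c("Article A", "Article B", "Article A", "Article C", "Article A")
)

# Count the number of distinct days each article stayed on the list
longevity <- snapshots %>%
  group_by(id, title) %>%
  summarise(
    days_on_list = n_distinct(snapshot_date),
    first_seen = min(snapshot_date),
    last_seen = max(snapshot_date),
    .groups = "drop"
  ) %>%
  arrange(desc(days_on_list))
```

In practice each day's results would be appended to the snapshot table with the date of the request, for example by a scheduled script.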

APIs are essential tools for data access and integration, allowing seamless data transfer and analysis. However, changes to an API’s structure or data organization (e.g., adjustments to the Most Viewed API by the NYT) can disrupt processes that rely on them. This highlights the importance of monitoring API updates to maintain data workflows effectively.
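
One lightweight safeguard against such changes is to check that a response still contains the columns the analysis depends on before using it. The helper below is a minimal sketch of that idea; check_columns is a hypothetical name, and the column list is taken from the fields used in this report.

```r
# Hypothetical helper: stop early if expected columns are missing
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("API response is missing expected columns: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  invisible(df)
}

# Columns this report relies on
required_cols <- c("id", "title", "section", "published_date", "updated")

# Usage:
# check_columns(articles_view, required_cols)
```

A failure at this point produces a clear message naming the missing fields, rather than a less obvious error later in the join or date calculations.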