Week 9 Assignment

Overview

This project demonstrates how to retrieve and analyze data using web APIs, which allow developers to programmatically access structured data directly from various platforms. Specifically, we will use the New York Times Most Popular API to examine data from two of its endpoints: “Most Viewed” and “Most Shared.”

Setting Up Authentication

The first step was to register on the New York Times Developer Network to obtain a unique API key. I stored the key in a code chunk with echo = FALSE so that it does not appear in the rendered output.

For this analysis, I chose two endpoints of the Most Popular API: “Most Viewed” and “Most Shared.” The “Most Viewed” endpoint returns the most-viewed articles for a specified period; I set the period to the last 30 days. The “Most Shared” endpoint returns the articles shared most often on a selected social platform. I chose Facebook because it returned consistent results during testing, and I used the same 30-day period. In both cases I converted the JSON response into a data frame. This configuration enables a side-by-side comparison of which articles are most viewed versus most shared on the New York Times site.
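As an alternative to keeping the key in a hidden chunk, the key could be read from an environment variable so that it never appears in the source file at all. The sketch below is only one way to do this; the function name get_api_key and the variable name NYT_API_KEY are assumptions made for illustration, not part of the original report.

```r
# Hypothetical helper: read the API key from an environment variable.
# NYT_API_KEY is an assumed name; it could be set in ~/.Renviron as
#   NYT_API_KEY=your-key-here
get_api_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    # Fail loudly rather than sending a request with an empty key
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}
```

With the variable set, `api_key <- get_api_key()` would define the same api_key object that the requests in this report use.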

Retrieving Data from the NYT API

Most Viewed Articles

To start, we use the “Most Viewed” endpoint, which returns the articles that received the most views over a user-specified timeframe. This helps identify popular topics and high-interest content on the New York Times platform. I chose a 30-day window, which covers the top articles from the past month.

The “Most Shared” endpoint works the same way but returns the articles shared most often on a specified platform. I selected Facebook, as it was the most stable option during testing, and configured the request to retrieve the most shared articles on Facebook over the last 30 days. Both responses were parsed from JSON into data frames for analysis.

library(httr)
library(jsonlite)
library(tidyverse)  # attaches dplyr, tidyr, ggplot2, and other core packages
library(knitr)


# Retrieving data from the Most Viewed API (NYT's most viewed articles in the last 30 days)
# Note: api_key is defined in an earlier chunk that is hidden from the output (echo = FALSE)
url_view <- "https://api.nytimes.com/svc/mostpopular/v2/viewed/30.json"  # API endpoint for Most Viewed articles

# Sending a GET request to the Most Viewed API with the API key
response_view <- GET(url_view, query = list("api-key" = api_key))
stop_for_status(response_view)  # Stop early if the request failed (e.g., invalid key or rate limit)

# Parsing the JSON response into a list format
parsed_data_view <- fromJSON(content(response_view, as = "text", encoding = "UTF-8"))

# Extracting the results (most viewed articles) from the parsed data
articles_view <- parsed_data_view$results


# Retrieving data from the Most Shared API (NYT's most shared articles on Facebook in the last 30 days)
url_shared <- "https://api.nytimes.com/svc/mostpopular/v2/shared/30/facebook.json"  # API endpoint for Most Shared articles on Facebook

# Sending a GET request to the Most Shared API with the API key
response_shared <- GET(url_shared, query = list("api-key" = api_key))
stop_for_status(response_shared)  # Stop early if the request failed

# Parsing the JSON response into a list format
parsed_data_shared <- fromJSON(content(response_shared, as = "text", encoding = "UTF-8"))

# Extracting the results (most shared articles on Facebook) from the parsed data
articles_shared <- parsed_data_shared$results
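
Because the two requests above differ only in their URL, the shared steps could be wrapped in small helpers. The following is a sketch rather than the code used for this report: the function names most_popular_url and fetch_most_popular are hypothetical, and the restriction of the period to 1, 7, or 30 days is an assumption based on the public API documentation.

```r
# Hypothetical helper: build a Most Popular API URL.
# share_type (e.g. "facebook") is only used for the "shared" endpoint.
most_popular_url <- function(endpoint = c("viewed", "shared"), period = 30,
                             share_type = NULL) {
  endpoint <- match.arg(endpoint)
  stopifnot(period %in% c(1, 7, 30))  # assumed set of valid periods
  path <- paste0("https://api.nytimes.com/svc/mostpopular/v2/",
                 endpoint, "/", period)
  if (endpoint == "shared" && !is.null(share_type)) {
    path <- paste0(path, "/", share_type)
  }
  paste0(path, ".json")
}

# Hypothetical helper: send the request, check the status, parse the results.
# Requires the httr and jsonlite packages to be installed.
fetch_most_popular <- function(url, api_key) {
  response <- httr::GET(url, query = list("api-key" = api_key))
  httr::stop_for_status(response)  # raise an error on a 4xx/5xx response
  parsed <- jsonlite::fromJSON(
    httr::content(response, as = "text", encoding = "UTF-8")
  )
  parsed$results
}
```

Under these assumptions, `fetch_most_popular(most_popular_url("shared", 30, "facebook"), api_key)` would play the role of the second request above.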

Data Cleaning Process

In this step, I use the Tidyverse suite to clean and organize the data gathered from the New York Times API. This preparation is essential for creating a structured dataset that can address several questions I formulated to analyze reader engagement trends.

Examining the Most Popular Sections

One of my key questions focuses on which New York Times sections featured the highest number of “Most Viewed” articles over the past month, as well as the sections that had the most articles shared on Facebook during the same period. By analyzing this information, we can gain insights into the sections that have resonated most with readers both in terms of views and social media shares. This provides a comparative look at high-traffic content versus the types of stories readers are most likely to share on Facebook.

# Creating a data frame with selected columns for the top 20 most viewed articles in the past month
articles_view_dat <- articles_view %>%
  select(id, title, section, subsection, type, published_date, updated, geo_facet, url)

# Creating a data frame with selected columns for the top 20 most shared articles on Facebook in the past month
articles_shared_dat <- articles_shared %>%
  select(id, title, section, subsection, type, published_date, updated, geo_facet, url)

# Identifying the most popular sections in the "Most Viewed" articles
most_viewed_sect <- articles_view_dat %>%
  arrange(section) %>%  # Sort articles by section for clarity
  select(section, everything())  # Place "section" as the first column for emphasis

# Counting the occurrences of each section in the "Most Viewed" articles
viewed_sect_count <- most_viewed_sect %>%
  count(section) %>%
  arrange(desc(n))  # Arrange sections by descending count to see the most popular ones

# Extracting the top three sections in the "Most Viewed" articles
top_3_sect <- viewed_sect_count %>%
  head(3)

# Calculating the total number of "Most Viewed" articles
total_articles_view <- sum(viewed_sect_count$n)

# Displaying the top three sections with their article counts and percentage of total
cat("The top three sections represented in the NY Times 'Most Viewed' articles in the past 30 days are: \n",
    top_3_sect$section[1], "with", top_3_sect$n[1], "articles, representing", ((top_3_sect$n[1]) / total_articles_view) * 100, "% of all top viewed articles.\n",
    top_3_sect$section[2], "with", top_3_sect$n[2], "articles, representing", ((top_3_sect$n[2]) / total_articles_view) * 100, "% of all top viewed articles.\n",
    top_3_sect$section[3], "with", top_3_sect$n[3], "articles, representing", ((top_3_sect$n[3]) / total_articles_view) * 100, "% of all top viewed articles.\n")

## The top three sections represented in the NY Times 'Most Viewed' articles in the past 30 days are: 
##  U.S. with 7 articles, representing 35 % of all top viewed articles.
##  Opinion with 5 articles, representing 25 % of all top viewed articles.
##  Business with 2 articles, representing 10 % of all top viewed articles.

# Now comparing with the "Most Shared" sections on Facebook over the past 30 days
most_shared_sect <- articles_shared_dat %>%
  arrange(section) %>%
  select(section, everything())  # Place "section" as the first column

# Counting occurrences of each section in the "Most Shared" articles on Facebook
shared_sect_count <- most_shared_sect %>%
  count(section) %>%
  arrange(desc(n))  # Arrange sections by descending count

# Extracting the top three sections in the "Most Shared" articles
top_3_sect_fb <- shared_sect_count %>%
  head(3)

# Calculating the total number of "Most Shared" articles on Facebook
total_articles_fb <- sum(shared_sect_count$n)

# Displaying the top three sections with their article counts and percentage of total
cat("The top three sections represented in the NY Times 'Most Shared' articles on Facebook in the past 30 days are: \n",
    top_3_sect_fb$section[1], "with", top_3_sect_fb$n[1], "articles, representing", ((top_3_sect_fb$n[1]) / total_articles_fb) * 100, "% of all top shared articles.\n",
    top_3_sect_fb$section[2], "with", top_3_sect_fb$n[2], "articles, representing", ((top_3_sect_fb$n[2]) / total_articles_fb) * 100, "% of all top shared articles.\n",
    top_3_sect_fb$section[3], "with", top_3_sect_fb$n[3], "articles, representing", ((top_3_sect_fb$n[3]) / total_articles_fb) * 100, "% of all top shared articles.\n")

## The top three sections represented in the NY Times 'Most Shared' articles on Facebook in the past 30 days are: 
##  Opinion with 8 articles, representing 40 % of all top shared articles.
##  U.S. with 4 articles, representing 20 % of all top shared articles.
##  Arts with 2 articles, representing 10 % of all top shared articles.
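
The two section rankings above are printed separately. One possible way to view them side by side is to join the two count tables on section, as in the sketch below. The data frames viewed and shared hold invented toy values standing in for articles_view_dat and articles_shared_dat, and the column names n_viewed and n_shared are chosen for illustration.

```r
library(dplyr)

# Toy stand-ins for articles_view_dat and articles_shared_dat (invented values)
viewed <- data.frame(section = c("U.S.", "U.S.", "Opinion", "Business"))
shared <- data.frame(section = c("Opinion", "Opinion", "U.S.", "Arts"))

# Count articles per section in each group, then join the counts on section
section_compare <- full_join(
  count(viewed, section, name = "n_viewed"),
  count(shared, section, name = "n_shared"),
  by = "section"
) %>%
  # A section absent from one group gets NA after the join; treat it as zero
  mutate(across(c(n_viewed, n_shared), ~ coalesce(.x, 0L))) %>%
  arrange(desc(n_viewed + n_shared), section)

print(section_compare)
```

A full join is used so that sections appearing in only one group (Business and Arts in the toy data) are kept rather than dropped.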

Analyzing Popular Page Types

Which Page Types Were the Most Popular in Both the “Most Viewed” and “Most Shared on Facebook” Groups?

To analyze this, we’ll identify the most common page types within both the “Most Viewed” and “Most Shared on Facebook” groups. This will reveal the types of pages that not only draw the most views but are also frequently shared, offering insight into the kinds of content that resonate the most with readers. By comparing these results, we can better understand reader engagement patterns and the specific types of articles that readers find compelling enough to share on social media.

# Analyzing the most popular article types in the "Most Viewed" group
most_viewed_type <- articles_view_dat %>%
  arrange(type) %>%  # Sort by article type for clarity
  select(type, everything())  # Place "type" as the first column for emphasis

# Counting occurrences of each article type in the "Most Viewed" group
viewed_type_count <- most_viewed_type %>%
  count(type) %>%
  arrange(desc(n))  # Arrange by descending count to identify the most popular types

# Displaying the two most common article types in the "Most Viewed" group
cat("For the top 20 most viewed NYT pages in the past 30 days:\n",
    viewed_type_count$type[1], "is the most common type with", viewed_type_count$n[1], "instances, making up", ((viewed_type_count$n[1]) / total_articles_view) * 100, "% of the top 20 viewed articles.\n",
    viewed_type_count$type[2], "is the next most common type with", viewed_type_count$n[2], "instances, making up", ((viewed_type_count$n[2]) / total_articles_view) * 100, "% of the top 20 viewed articles.\n")

## For the top 20 most viewed NYT pages in the past 30 days:
##  Article is the most common type with 17 instances, making up 85 % of the top 20 viewed articles.
##  Interactive is the next most common type with 3 instances, making up 15 % of the top 20 viewed articles.

# Analyzing the most popular article types in the "Most Shared on Facebook" group
most_shared_type <- articles_shared_dat %>%
  arrange(type) %>%
  select(type, everything())  # Place "type" as the first column

# Counting occurrences of each article type in the "Most Shared" group
shared_type_count <- most_shared_type %>%
  count(type) %>%
  arrange(desc(n))

# Displaying the two most common article types in the "Most Shared on Facebook" group
cat("For the top 20 NYT pages shared on Facebook in the past 30 days:\n",
    shared_type_count$type[1], "is the most common type with", shared_type_count$n[1], "instances, making up", ((shared_type_count$n[1]) / total_articles_fb) * 100, "% of the top 20 shared articles.\n",
    shared_type_count$type[2], "is the next most common type with", shared_type_count$n[2], "instances, making up", ((shared_type_count$n[2]) / total_articles_fb) * 100, "% of the top 20 shared articles.\n")

## For the top 20 NYT pages shared on Facebook in the past 30 days:
##  Article is the most common type with 17 instances, making up 85 % of the top 20 shared articles.
##  Interactive is the next most common type with 3 instances, making up 15 % of the top 20 shared articles.

Most Viewed Articles:
- The code counts the occurrences of each page type in the top 20 most viewed articles and reports the two most common types with their counts and percentages: Article (17 of 20, 85%) and Interactive (3 of 20, 15%).
Most Shared on Facebook Articles:
- The same process is repeated for the top 20 articles shared on Facebook. The distribution is identical: Article (17 of 20, 85%) and Interactive (3 of 20, 15%).
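
The section and page-type summaries all follow the same pattern: count a column, take the top entries, and print each count with its percentage. A small base R helper could express that pattern once. The sketch below is illustrative only; describe_top is a hypothetical name, and the input vector contains invented values.

```r
# Hypothetical helper: one formatted line per top category.
# values: a character vector such as articles_view_dat$type
describe_top <- function(values, n = 3, label = "articles") {
  counts <- sort(table(values), decreasing = TRUE)
  top <- counts[seq_len(min(n, length(counts)))]
  sprintf("%s: %d of %d %s (%.1f%%)",
          names(top), as.integer(top), length(values), label,
          100 * as.integer(top) / length(values))
}

# Invented example values
page_types <- c("Article", "Article", "Article", "Interactive")
writeLines(describe_top(page_types, n = 2, label = "pages"))
```

Formatting the percentage with sprintf keeps the output to one decimal place instead of printing the raw result of the division.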

Article Overlap Between Groups

What is the count of articles that appear in both the “Most Viewed” and “Most Shared on Facebook” groups?

# Determining the number of articles that are the same between the two groups

# Joining both datasets to find matching articles based on the "id" column
matching_id <- inner_join(articles_view_dat, articles_shared_dat, by = "id")

# Displaying the count of common articles and their details
cat("There are", nrow(matching_id), "articles in common between the 'Most Viewed' in the past 30 days and the 'Most Shared on Facebook' in the past 30 days.\nThey are:\n")

## There are 4 articles in common between the 'Most Viewed' in the past 30 days and the 'Most Shared on Facebook' in the past 30 days.
## They are:

# Looping through each matching article to display the title, section, and published date
# seq_len() avoids iterating over 1:0 if there are no matching articles
for (i in seq_len(nrow(matching_id))) {
  cat(matching_id$title.x[i], "published in the", matching_id$section.x[i], "section on", matching_id$published_date.x[i], ".\n")
}

## Kris Kristofferson, Country Singer, Songwriter and Actor, Dies at 88 published in the Obituaries section on 2024-09-29 .
## James Carville: Three Reasons I’m Certain Kamala Harris Will Win published in the Opinion section on 2024-10-23 .
## The Only Patriotic Choice for President published in the Opinion section on 2024-09-30 .
## At a Pennsylvania Rally, Trump Descends to New Levels of Vulgarity published in the U.S. section on 2024-10-19 .
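
The inner join above duplicates every non-key column with .x and .y suffixes. If only the overlapping rows are needed, a filtering join is one alternative, since it keeps the columns of the first table unchanged. The sketch below uses invented toy data in place of articles_view_dat and articles_shared_dat.

```r
library(dplyr)

# Toy stand-ins for the two article tables (invented ids and titles)
viewed <- data.frame(id = c(1, 2, 3), title = c("A", "B", "C"))
shared <- data.frame(id = c(2, 3, 4), title = c("B", "C", "D"))

# Rows of `viewed` whose id also appears in `shared`; no .x/.y suffixes
in_both <- semi_join(viewed, shared, by = "id")

# Rows of `viewed` whose id does not appear in `shared`
viewed_only <- anti_join(viewed, shared, by = "id")

print(in_both)
print(viewed_only)
```

With this approach the loop could refer to title, section, and published_date directly rather than to the suffixed names.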

Time Difference Analysis

Across both groups combined, which articles had the longest gap between their publication date and their last update?

# Identifying articles with the longest duration between publication and last update

# Merging both datasets and removing duplicates to get unique articles
merged_df <- bind_rows(articles_view_dat, articles_shared_dat) %>%
  distinct()

# Converting publication and update dates to POSIXct format for time calculations
merged_df$published_date <- as.POSIXct(as.Date(merged_df$published_date))
merged_df$updated <- as.POSIXct(as.Date(merged_df$updated))

# Calculating the time difference in days between publication and last update
merged_df <- merged_df %>%
  mutate(t_diff = as.numeric(difftime(updated, published_date, units = "days")))

# Sorting articles by the longest time difference
time_df <- merged_df %>%
  arrange(desc(t_diff)) %>%
  select(t_diff, title)

# Displaying a table of articles sorted by the time passed between publication and last update
kable(time_df, caption = "Time Passed in Days Between First Published and Last Updated (Descending Order)")

Time Passed in Days Between First Published and Last Updated (Descending Order)
t_diff	title
72	Tracking the Swing States for Harris and Trump
27	The Only Patriotic Choice for President
9	Pete Rose, Baseball Star Who Earned Glory and Shame, Dies at 83
9	The Dangers of Donald Trump, From Those Who Know Him
8	When Trump Rants, This Is What I Hear
7	Billy Joel Is Selling the Mansion He First Saw While Dredging Oysters
6	Milton Tracker: Latest on Storm’s Path, Power Outages and Winds
6	What We Know About Liam Payne’s Death and the Drugs Found in His System
6	The Secretive Dynasty That Controls the Boar’s Head Brand
4	What’s Wrong With Donald Trump?
4	Ann Patchett: The Decision I Made 30 Years Ago That I Still Regret
3	When a Television Meteorologist Breaks Down on Air and Admits Fear
3	Her Face Was Unrecognizable After an Explosion. A Placenta Restored It.
2	The Truth About Tuna
2	Phil Lesh, Bassist Who Anchored the Grateful Dead, Dies at 84
2	Why Trump Is Lying About Disaster Relief
2	As Election Nears, Kelly Warns Trump Would Rule Like a Dictator
2	Trump Is Telling Us What He Would Do. Believe Him.
2	How the North Carolina Legislature Left Homes Vulnerable to Helene
1	Kamala Harris Arrived for a Fox Interview. She Got a Debate.
1	6 Takeaways From Donald Trump’s 3-Hour Podcast With Joe Rogan
1	Hurricane Helene: Mapping More Than 600 Miles of Devastation
1	Harris and Trump Deadlocked to the End, Final Times/Siena National Poll Finds
1	Book Revives Questions About Trump’s Ties to Putin
1	Kris Kristofferson, Country Singer, Songwriter and Actor, Dies at 88
1	James Carville: Three Reasons I’m Certain Kamala Harris Will Win
1	Nate Silver: Here’s What My Gut Says About the Election, but Don’t Trust Anyone’s Gut, Even Mine
1	A Frustrated Trump Lashes Out Behind Closed Doors Over Money
1	At a Pennsylvania Rally, Trump Descends to New Levels of Vulgarity
1	Judge Orders Giuliani to Forfeit Millions in Assets to Election Workers He Defamed
1	An Open Letter to Jimmy Carter, on His 100th Birthday
1	The Many Links Between Project 2025 and Trump’s World
1	Is It Fascism? A Leading Historian Changes His Mind.
1	Frank Fritz, a Host of the Antiques Show ‘American Pickers,’ Dies at 60
1	American Business Cannot Afford to Risk Another Trump Presidency
1	Washington Post Says It Will Stop Endorsing Presidential Candidates

# Outputting the top three articles with the longest duration between first published and last updated
cat("The top three articles from the 'Most Viewed' and 'Most Shared on Facebook' groups with the longest time passed from first published to last updated are:\n\n",
    
    time_df$title[1], ":\n with", time_df$t_diff[1], "days between original publication and the last update.\n\n",
    
    time_df$title[2], ":\n with", time_df$t_diff[2], "days between original publication and the last update.\n\n",
    
    time_df$title[3], ":\n with", time_df$t_diff[3], "days between original publication and the last update.\n\n")

## The top three articles from the 'Most Viewed' and 'Most Shared on Facebook' groups with the longest time passed from first published to last updated are:
## 
##  Tracking the Swing States for Harris and Trump :
##  with 72 days between original publication and the last update.
## 
##  The Only Patriotic Choice for President :
##  with 27 days between original publication and the last update.
## 
##  Pete Rose, Baseball Star Who Earned Glory and Shame, Dies at 83 :
##  with 9 days between original publication and the last update.
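
The day difference can also be computed on Date values directly, without the round trip through POSIXct. The sketch below assumes that published_date arrives as a "YYYY-MM-DD" string and updated as a "YYYY-MM-DD HH:MM:SS" string; the rows are invented for illustration and are not taken from the API response.

```r
# Invented example rows; the string formats are an assumption about the API fields
articles <- data.frame(
  title = c("Tracker", "Essay"),
  published_date = c("2024-08-14", "2024-09-30"),
  updated = c("2024-10-25 10:15:00", "2024-10-27 08:00:00"),
  stringsAsFactors = FALSE
)

# as.Date() keeps only the calendar date, so the time of day is ignored
articles$days_to_update <- as.integer(
  as.Date(articles$updated) - as.Date(articles$published_date)
)

# Sort with the longest gap first
articles <- articles[order(-articles$days_to_update), ]
print(articles[, c("days_to_update", "title")])
```

Subtracting two Date values yields a difference in whole days, so no units argument is needed.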

Conclusion

The New York Times offers a wide array of APIs that can be used for diverse analytical tasks. For this project, I focused on the Most Viewed and Most Shared endpoints to build a sample report highlighting the New York Times’ top articles in terms of views and social media shares. This kind of report shows which articles and topics attract the most attention and resonate most with audiences, which is a useful starting point for asking why.

Looking ahead, a potential next step would be to analyze how the most shared articles vary across different social media platforms like X, Reddit, and Facebook. Such an analysis could reveal content preferences unique to each platform’s user base. Another interesting extension would be to automate the Most Viewed report daily over a month, tracking the longevity of articles in the top-viewed list. This could provide insights into content relevance and engagement over time.
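
As a rough illustration of the longevity idea, suppose each day's Most Viewed result were saved with the date it was retrieved. The number of days each article stayed in the list could then be summarized as below. This is a sketch only: the snapshots table, its column names, and its values are all invented.

```r
library(dplyr)

# Invented daily snapshots: one row per article per day it appeared in the list
snapshots <- data.frame(
  snapshot_date = as.Date(c("2024-10-01", "2024-10-01", "2024-10-02",
                            "2024-10-02", "2024-10-03")),
  id = c(101, 102, 101, 103, 101)
)

# For each article: how many distinct days it appeared, and its first and last day
longevity <- snapshots %>%
  group_by(id) %>%
  summarise(
    days_in_list = n_distinct(snapshot_date),
    first_seen = min(snapshot_date),
    last_seen = max(snapshot_date),
    .groups = "drop"
  ) %>%
  arrange(desc(days_in_list), id)

print(longevity)
```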

APIs are essential tools for data access and integration, allowing seamless data transfer and analysis. However, changes to an API’s structure or data organization (e.g., adjustments to the Most Viewed API by the NYT) can disrupt processes that rely on them. This highlights the importance of monitoring API updates to maintain data workflows effectively.
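
One lightweight way to notice such structural changes early is to check that the response still contains the fields the analysis selects before using them. The sketch below is an assumption about how this could be done; check_columns is a hypothetical helper, and the expected field names are the ones selected earlier in this report.

```r
# Fields this report selects from each response
expected_cols <- c("id", "title", "section", "subsection", "type",
                   "published_date", "updated", "geo_facet", "url")

# Hypothetical helper: stop with a clear message if any expected field is missing
check_columns <- function(df, expected) {
  missing <- setdiff(expected, names(df))
  if (length(missing) > 0) {
    stop("API response is missing expected fields: ",
         paste(missing, collapse = ", "), call. = FALSE)
  }
  invisible(df)
}
```

Calling `check_columns(articles_view, expected_cols)` right after parsing would turn a silent schema change into an explicit error.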

Week 9 Assignment – Web APIs

Shri Tripathi

October 27, 2024