APIs are a common source of data, especially when the data can change at any time, as with the most recent news. The New York Times provides free access to certain APIs but requires each developer to use their own API key. Today we will go over the process of obtaining an API key and using it to retrieve data that we can work with as a data frame. Then we will attempt to answer the question: which New York Times news sections contained the articles with the most words in January 2024?
Retrieve the API key from the environment variables, append it to the Archive API URL for January 2024, and call the NY Times API with the combined URL.
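library(httr)   # GET(), content()
library(dplyr)  # %>%, filter(), group_by(), summarise(), count()
# Base URL for the Top Stories API (not used in this analysis)
nyt_api_base <- 'https://api.nytimes.com/svc/topstories/v2/'
# Load nyt_api_key from a local .env file so the key stays out of the script
readRenviron('.env')
nyt_api_key <- Sys.getenv('nyt_api_key')
# Archive API: every piece published in January 2024
archive_url <- paste0('https://api.nytimes.com/svc/archive/v1/2024/1.json?api-key=', nyt_api_key)
archive_data <- GET(archive_url)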
archive_data
## Response [https://api.nytimes.com/svc/archive/v1/2024/1.json?api-key=<redacted>]
## Date: 2025-03-31 00:44
## Status: 200
## Content-Type: application/json; charset=UTF-8
## Size: 12.9 MB
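Printing the response object shows the full request URL, including the API key. As a minimal sketch, the URL can be built by a small helper and masked before it is printed or logged. The names `build_archive_url` and `mask_api_key` are hypothetical helpers written for this example, not part of httr or the NYT API, and `EXAMPLE_KEY` is a placeholder.

```r
# Hypothetical helpers for illustration; not part of httr or the NYT API.
build_archive_url <- function(year, month, api_key) {
  stopifnot(nzchar(api_key))  # fail early if the key was not loaded
  sprintf("https://api.nytimes.com/svc/archive/v1/%d/%d.json?api-key=%s",
          as.integer(year), as.integer(month), api_key)
}

# Replace the key in a URL before printing or logging it.
mask_api_key <- function(url) {
  sub("api-key=[^&]+", "api-key=<hidden>", url)
}

demo_url <- build_archive_url(2024, 1, "EXAMPLE_KEY")  # placeholder key
print(mask_api_key(demo_url))
```

Before parsing a real response, it is also worth confirming the request succeeded, for example with `httr::stop_for_status(archive_data)`.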
The response's Content-Type header confirms that the raw data from the NY Times is JSON. However, it requires some data wrangling before we can use it. For today’s question, we want the word count of each piece. We will also extract the abstract and URL to help verify what we find. The keys “news_desk” and “section_name” both appear to be categories, so we’ll keep both to check which one provides what we want.
parsed <- content(archive_data, 'parsed')
response_body <- parsed$response$docs
abstracts <- c()
word_counts <- c()
news_desks <- c()
web_urls <- c()
document_types <- c()
sections <- c()
for (entry in response_body) {
  abstract <- entry['abstract']
  abstracts <- append(abstracts, abstract)
  count <- entry['word_count']
  word_counts <- append(word_counts, count)
  news_desk <- entry['news_desk']
  news_desks <- append(news_desks, news_desk)
  web_url <- entry['web_url']
  web_urls <- append(web_urls, web_url)
  document_type <- entry['document_type']
  document_types <- append(document_types, document_type)
  section <- entry['section_name']
  sections <- append(sections, section)
}
nyt_df <- data.frame(
  abstract = unlist(abstracts, use.names = FALSE),
  word_count = unlist(word_counts, use.names = FALSE),
  news_desk = unlist(news_desks, use.names = FALSE),
  web_url = unlist(web_urls, use.names = FALSE),
  document_type = unlist(document_types, use.names = FALSE),
  section = unlist(sections, use.names = FALSE)
)
head(nyt_df)
## abstract
## 1 The tentative deal for the men’s golf circuits to join forces had a Dec. 31 deadline, but significant questions remained.
## 2 Harry Zheng makes his New York Times debut.
## 3 Iranian-backed Houthi gunmen from Yemen had fired on American helicopters responding to an attack on a commercial ship, American officials said.
## 4 New Year’s celebrations took place as protesters against the Israel-Hamas war staged demonstrations in Midtown Manhattan.
## 5 Quotation of the Day for Monday, January 1, 2024.
## 6 More than 90 percent of Palestinians in the territory say they have regularly gone without food for a whole day, according to the United Nations.
## word_count news_desk
## 1 558 Business
## 2 855 Games
## 3 1218 Foreign
## 4 608 Metro
## 5 36 Summary
## 6 1513 Foreign
## web_url
## 1 https://www.nytimes.com/2023/12/31/business/dealbook/pga-tour-saudi-deal-deadline.html
## 2 https://www.nytimes.com/2023/12/31/crosswords/daily-puzzle-2024-01-01.html
## 3 https://www.nytimes.com/2023/12/31/world/middleeast/us-houthi-clash.html
## 4 https://www.nytimes.com/2023/12/31/nyregion/times-square-new-years-eve.html
## 5 https://www.nytimes.com/2023/12/31/pageoneplus/quotation-of-the-day-in-a-jewish-arab-school-an-oasis-from-division-but-not-from-deep-fears.html
## 6 https://www.nytimes.com/2024/01/01/world/middleeast/gaza-israel-hunger.html
## document_type section
## 1 article Business Day
## 2 article Crosswords & Games
## 3 article World
## 4 article New York
## 5 article Corrections
## 6 article World
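The loop above works for this month's data, but `unlist()` drops `NULL` entries, so a record missing one of the fields would leave the columns with different lengths and `data.frame()` would fail. The following is a sketch of a more defensive extraction, assuming each entry is a named list. The helper `pluck_chr` and the toy `docs` list are made up for illustration and stand in for `parsed$response$docs`.

```r
# Toy stand-in for parsed$response$docs; the second entry lacks word_count.
docs <- list(
  list(abstract = "First piece.",  word_count = 558L, news_desk = "Business"),
  list(abstract = "Second piece.", news_desk = "Games")
)

# Hypothetical helper: pull one field from every entry, using NA when absent.
pluck_chr <- function(entries, field) {
  vapply(entries, function(e) {
    value <- e[[field]]
    if (is.null(value)) NA_character_ else as.character(value)
  }, character(1))
}

toy_df <- data.frame(
  abstract   = pluck_chr(docs, "abstract"),
  word_count = as.integer(pluck_chr(docs, "word_count")),
  news_desk  = pluck_chr(docs, "news_desk"),
  stringsAsFactors = FALSE
)
print(toy_df)
```

Because every column comes from one `vapply()` call over the same list, all columns have one value per entry, and missing fields show up as `NA` instead of shifting later rows.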
Summarize the full data frame to see the distribution of word counts, then compute the average word count per article for both news desks and sections.
summary(nyt_df)
## abstract word_count news_desk web_url
## Length:3785 Min. : 0.0 Length:3785 Length:3785
## Class :character 1st Qu.: 492.0 Class :character Class :character
## Mode :character Median : 859.0 Mode :character Mode :character
## Mean : 921.8
## 3rd Qu.: 1229.0
## Max. :12932.0
## document_type section
## Length:3785 Length:3785
## Class :character Class :character
## Mode :character Mode :character
##
##
##
nyt_df %>%
  filter(news_desk != '') %>%
  group_by(news_desk) %>%
  summarise(across(word_count, \(x) mean(x, na.rm = TRUE))) %>%
  arrange(desc(word_count))
## # A tibble: 59 × 2
## news_desk word_count
## <chr> <dbl>
## 1 Magazine 2399.
## 2 SundayBusiness 1953.
## 3 Headway 1745.
## 4 SpecialSections 1671
## 5 Arts&Leisure 1524.
## 6 TStyle 1511
## 7 OpEd 1491.
## 8 Editorial 1460
## 9 NYTNow 1351.
## 10 Projects and Initiatives 1281
## # ℹ 49 more rows
nyt_df %>%
  group_by(section) %>%
  summarise(across(word_count, \(x) mean(x, na.rm = TRUE))) %>%
  arrange(desc(word_count))
## # A tibble: 38 × 2
## section word_count
## <chr> <dbl>
## 1 Magazine 2399.
## 2 Headway 1745.
## 3 T Magazine 1511
## 4 Briefing 1342.
## 5 Obituaries 1305.
## 6 Opinion 1290.
## 7 Health 1129.
## 8 Lens 1113
## 9 Your Money 1113.
## 10 Education 1032
## # ℹ 28 more rows
According to the summary, the average word count in the full dataset is about 922. Magazine is the wordiest news desk and the wordiest section, at around 2399 words per article. However, the minimum word count is 0, so some pieces have no word count at all. Let’s see if there’s a pattern behind those.
Continuing with the summaries, we can also group by document type.
nyt_df %>%
  group_by(document_type) %>%
  summarise(across(word_count, \(x) mean(x, na.rm = TRUE))) %>%
  arrange(desc(word_count))
## # A tibble: 2 × 2
## document_type word_count
## <chr> <dbl>
## 1 article 972.
## 2 multimedia 0
According to document_type, some of the results are articles and others are multimedia. Multimedia results average a word count of 0, and since word counts cannot be negative, every multimedia result has a word count of 0.
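Zero-count rows matter because they pull down any average that includes them: the overall mean was 921.8, while articles alone averaged about 972. A small sketch with made-up numbers (the `toy` data frame is invented for illustration) shows the effect:

```r
# Made-up data: two articles and two multimedia pieces with no word count.
toy <- data.frame(
  document_type = c("article", "article", "multimedia", "multimedia"),
  word_count    = c(800, 1200, 0, 0)
)

mean_all      <- mean(toy$word_count)                      # zeros included
mean_articles <- mean(toy$word_count[toy$word_count > 0])  # zeros excluded

print(c(all = mean_all, articles_only = mean_articles))
```

Here the two zero rows cut the mean in half, which is why filtering them out before comparing groups gives a fairer picture of article length.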
Next, let’s compute the mean and median word count for each (news desk, section) combination. Pieces with a word count of 0 contain no text to measure, so we drop them first; from what we’ve seen, this should not change which groups rank highest. We expect (Magazine, Magazine) to be the wordiest (news desk, section) combination.
combo_df <- nyt_df %>%
  filter(word_count > 0) %>%
  group_by(news_desk, section) %>%
  summarise(mean = mean(word_count, na.rm = TRUE),
            median = median(word_count, na.rm = TRUE)) %>%
  arrange(desc(mean))
combo_df
## `summarise()` has grouped output by 'news_desk'. You can override using the
## `.groups` argument.
## # A tibble: 104 × 4
## # Groups: news_desk [44]
## news_desk section mean median
## <chr> <chr> <dbl> <dbl>
## 1 OpEd Podcasts 10720. 10783
## 2 SundayBusiness Technology 2581 2581
## 3 Magazine Magazine 2556. 1512.
## 4 SundayBusiness Business Day 1916. 1560
## 5 Headway Headway 1745. 1187
## 6 SpecialSections Business Day 1671 1671
## 7 Arts&Leisure Arts 1630. 1627
## 8 Arts&Leisure Theater 1591. 1717
## 9 Obits Science 1527 1527
## 10 TStyle T Magazine 1511 1526
## # ℹ 94 more rows
This is a surprising result given what we had previously established. The combination (Magazine, Magazine) has a mean of about 2556, comparable to the Magazine averages from the earlier summaries. However, its median of about 1512 is far below the mean, which suggests that a few very long pieces pull the average up. Furthermore, (OpEd, Podcasts) has both the highest mean and the highest median by a wide margin. The combination (SundayBusiness, Technology) is another unexpected entry near the top.
We can check how often each (news desk, section) combination occurs for OpEd, SundayBusiness, and Magazine to see whether there is any irregularity in these results.
nyt_df %>%
  filter(news_desk %in% c('OpEd', 'SundayBusiness', 'Magazine')) %>%
  count(news_desk, section) %>%
  arrange(desc(n))
## news_desk section n
## 1 OpEd Opinion 250
## 2 Magazine Magazine 49
## 3 SundayBusiness Business Day 17
## 4 OpEd Podcasts 4
## 5 SundayBusiness Technology 1
(OpEd, Opinion) is by far the more frequent OpEd combination, with 250 pieces against only 4 for (OpEd, Podcasts). While this does not change the fact that (OpEd, Podcasts) pieces were on average the wordiest in this dataset, (Magazine, Magazine) articles are far more common, with 49 pieces, and are usually long. With only a single occurrence, (SundayBusiness, Technology) appears to be an outlier in our data.
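Since a group with one or a handful of pieces can top a ranking by mean, it helps to report the group size next to the mean and median. The sketch below does this in base R on made-up data; the `toy` data frame and the `min_n` threshold of 2 are arbitrary choices for illustration, not values from the analysis.

```r
# Made-up data: one large group and two single-piece groups.
toy <- data.frame(
  news_desk  = c("OpEd", "OpEd", "OpEd", "OpEd", "SundayBusiness"),
  section    = c("Opinion", "Opinion", "Opinion", "Podcasts", "Technology"),
  word_count = c(1000, 1200, 1400, 10000, 2500),
  stringsAsFactors = FALSE
)

# Split by the (news_desk, section) pair, then summarise each group.
groups <- split(toy, list(toy$news_desk, toy$section), drop = TRUE)
summary_df <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    news_desk = g$news_desk[1],
    section   = g$section[1],
    n         = nrow(g),
    mean      = mean(g$word_count),
    median    = median(g$word_count)
  )
}))

min_n <- 2  # arbitrary illustrative threshold
reliable <- summary_df[summary_df$n >= min_n, ]
print(summary_df[order(-summary_df$mean), ])
```

In a dplyr pipeline, the same idea amounts to adding `n = n()` inside `summarise()` and filtering on it before ranking.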
One more check we can perform is to try to figure out why (OpEd, Podcasts) pieces were so long. We do not have access to the articles themselves, but the abstracts we collected earlier give us a gist of what these pieces were about.
## abstract
## 1 Ezra Klein interviews Gloria Mark.
## 2 The Jan. 9, 2023, episode of “The Ezra Klein Show.”
## 3 The Jan. 19, 2023, episode of “The Ezra Klein Show.”
## 4 The Jan. 25, 2024, episode of “The Ezra Klein Show.”
All four results come from “The Ezra Klein Show.” These pieces appear to be podcast episodes, and their word counts may reflect a transcript of everything said. While they are the longest pieces found, they are not articles in the sense we would expect from the New York Times.
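The code that produced the abstract listing is not shown. One plausible way to get it, sketched here in base R, is to filter on both columns and keep only the abstract. The `toy` data frame is a made-up stand-in for `nyt_df`, and only its first abstract is taken from the real data.

```r
# Made-up stand-in for nyt_df; only the first abstract is from the real data.
toy <- data.frame(
  news_desk = c("OpEd", "OpEd", "Magazine"),
  section   = c("Podcasts", "Opinion", "Magazine"),
  abstract  = c("Ezra Klein interviews Gloria Mark.",
                "A hypothetical opinion abstract.",
                "A hypothetical magazine abstract."),
  stringsAsFactors = FALSE
)

# Keep rows for the (OpEd, Podcasts) combination and only the abstract column.
podcast_abstracts <- toy[toy$news_desk == "OpEd" & toy$section == "Podcasts",
                         "abstract", drop = FALSE]
print(podcast_abstracts)
```

Passing `drop = FALSE` keeps the result as a one-column data frame, so it prints with the `abstract` header like the listing above.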
We sought to answer the question of which New York Times sections had the longest articles in January 2024. The API, which requires an API key, did not provide the articles themselves, but it did provide a word count for every piece the paper published. However, some pieces turned out to be multimedia, which had word counts of 0.
At first glance, news desk and section individually showed that magazine articles were the most verbose, at about 2399 words on average. After further analysis, we found that pieces under (OpEd, Podcasts) averaged more than four times the word count of magazine articles. Depending on interpretation, either result can be seen as the answer to our question of the wordiest section. (OpEd, Podcasts) was the longest combination of news desk and section in our data, although its four pieces appear to be podcast episodes and not conventional articles. An honorable mention goes to (SundayBusiness, Technology), whose lone article gave it the second-highest mean word count. Finally, magazine articles within the Magazine news desk were consistently among the longest articles, without the caveats of these other sections.