Introduction

This assignment explores using The New York Times API directly from inside R.

Method

1. Created an account with The New York Times and requested an API key

link

2. Loaded libraries and used the keyring package to save my API key into my environment

library(jsonlite)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()  masks stats::filter()
## x purrr::flatten() masks jsonlite::flatten()
## x dplyr::lag()     masks stats::lag()
library(keyring)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
first_time <- FALSE


#If this is the very first time you are running this script, you need to save your API key using keyring
if(first_time){
key_set_with_value(service = "NYT api",password = "YOUR_API_KEY_GOES_HERE")
}
api_key <- key_get("NYT api")
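The lookup above fails if the key was never stored. A minimal sketch of a more forgiving lookup follows, assuming a fallback environment variable named NYT_API_KEY; both the helper name and the variable name are hypothetical and not part of the original workflow.

```r
# Hypothetical helper: try keyring first, then fall back to an environment
# variable (the name NYT_API_KEY is an assumption for illustration).
get_api_key <- function(service = "NYT api", env_var = "NYT_API_KEY") {
  # key_get() raises an error when no entry exists; treat that as "not found"
  key <- tryCatch(keyring::key_get(service), error = function(e) "")
  if (!nzchar(key)) key <- Sys.getenv(env_var)
  if (!nzchar(key)) {
    stop("No API key found in keyring service '", service,
         "' or environment variable ", env_var)
  }
  key
}
```

With a helper like this, `api_key <- get_api_key()` could replace the direct `key_get()` call.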

3. I used an article by Jonathan D Fitzgerald on storybench.org to get started with the jsonlite package

link

  • I used the paste0 function to concatenate my API key into my query to the NYT API

  • The fromJSON and data.frame functions do the heavy lifting of converting the JSON response into a data frame

  • Queried articles published since the beginning of 2021 that relate to ‘molecular fossils’: organic compounds in the fossil record that are derived from once-living organisms.

  • I used the NYT Article Search API (the articlesearch.json endpoint)

#Let's connect and search the Article Search API for articles related to molecular fossils
results <- fromJSON(paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=molecular fossil&begin_date=20210101&api-key=",api_key), flatten = TRUE) %>% data.frame()
glimpse(results)
## Rows: 6
## Columns: 32
## $ status                                <chr> "OK", "OK", "OK", "OK", "OK", "O…
## $ copyright                             <chr> "Copyright (c) 2021 The New York…
## $ response.docs.abstract                <chr> "A laborer discovered the fossil…
## $ response.docs.web_url                 <chr> "https://www.nytimes.com/2021/06…
## $ response.docs.snippet                 <chr> "A laborer discovered the fossil…
## $ response.docs.lead_paragraph          <chr> "Scientists on Friday announced …
## $ response.docs.print_section           <chr> "A", "A", NA, "MM", "D", NA
## $ response.docs.print_page              <chr> "1", "27", NA, "49", "2", NA
## $ response.docs.source                  <chr> "The New York Times", "The New Y…
## $ response.docs.multimedia              <list> [<data.frame[73 x 19]>], [<data.…
## $ response.docs.keywords                <list> [<data.frame[10 x 4]>], [<data.f…
## $ response.docs.pub_date                <chr> "2021-06-25T15:00:12+0000", "20…
## $ response.docs.document_type           <chr> "article", "article", "article"…
## $ response.docs.news_desk               <chr> "Science", "OpEd", "NYTNow", "Ma…
## $ response.docs.section_name            <chr> "Science", "Opinion", "Briefing"…
## $ response.docs.type_of_material        <chr> "News", "Op-Ed", "briefing", "In…
## $ response.docs._id                     <chr> "nyt://article/56530668-e4d8-5ca…
## $ response.docs.word_count              <int> 1520, 1081, 1062, 0, 6520, 13540
## $ response.docs.uri                     <chr> "nyt://article/56530668-e4d8-5ca…
## $ response.docs.headline.main           <chr> "Discovery of ‘Dragon Man’ Skull…
## $ response.docs.headline.kicker         <chr> "Matter", NA, NA, "The Health Is…
## $ response.docs.headline.content_kicker <lgl> NA, NA, NA, NA, NA, NA
## $ response.docs.headline.print_headline <chr> "Skull May Point to New Kind of …
## $ response.docs.headline.name           <lgl> NA, NA, NA, NA, NA, NA
## $ response.docs.headline.seo            <lgl> NA, NA, NA, NA, NA, NA
## $ response.docs.headline.sub            <lgl> NA, NA, NA, NA, NA, NA
## $ response.docs.byline.original         <chr> "By Carl Zimmer", "By Sarah Stew…
## $ response.docs.byline.person           <list> [<data.frame[1 x 8]>], [<data.fr…
## $ response.docs.byline.organization     <lgl> NA, NA, NA, NA, NA, NA
## $ response.meta.hits                    <int> 6, 6, 6, 6, 6, 6
## $ response.meta.offset                  <int> 0, 0, 0, 0, 0, 0
## $ response.meta.time                    <int> 24, 24, 24, 24, 24, 24
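The query string above contains a literal space ("molecular fossil"). A small sketch of building the same URL with the search term percent-encoded follows; the helper name build_search_url is hypothetical, and the endpoint is copied from the call above.

```r
# Hypothetical helper: assemble an Article Search URL with an encoded query.
build_search_url <- function(query, begin_date, api_key) {
  paste0(
    "http://api.nytimes.com/svc/search/v2/articlesearch.json",
    "?q=", utils::URLencode(query, reserved = TRUE),  # "molecular fossil" -> "molecular%20fossil"
    "&begin_date=", begin_date,
    "&api-key=", api_key
  )
}

# Placeholder key for illustration only
url <- build_search_url("molecular fossil", "20210101", "YOUR_API_KEY_GOES_HERE")
```

The resulting string can be passed to fromJSON in place of the hand-pasted URL.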

4. I selected the columns that I’m interested in after inspecting the data frame with ‘glimpse’ (I also really like ‘names()’)

  • I kept the headline, abstract and section columns in the data frame
reduced_results <- results %>% select(headline = response.docs.headline.main, abstract = response.docs.abstract, section = response.docs.section_name)

reduced_results %>% kbl() %>% kable_styling()
headline | abstract | section
Discovery of ‘Dragon Man’ Skull in China May Add Species to Human Family Tree | A laborer discovered the fossil and hid it in a well for 85 years. Scientists say it could help sort out the human family tree and how our species emerged. | Science
Why Frigid Mars Is the Perfect Place to Look for Ancient Life | Our early days on Earth have almost entirely disappeared, but on Mars, the past is entombed. | Opinion
Infrastructure, Surfside, Giuliani: Your Thursday Evening Briefing | Here’s what you need to know at the end of the day. | Briefing
Can We Live to 200? Here’s a Roadmap | 43 advances that could radically extend life spans over the next 100 years. | Magazine
The Science of Climate Change Explained: Facts, Evidence and Proof | Definitive answers to the big questions. | Climate
Transcript: Ezra Klein Interviews Adam Tooze | Every Tuesday and Friday, Ezra Klein invites you into a conversation about something that matters, like today’s episode with Adam Tooze. Listen wherever you get your podcasts. | Podcasts

5. The NYT API only returns 10 results per request. With the jsonlite package it is possible to iteratively pull all results, 10 at a time

  • Saved the query as a string
  • Calculate how many times you need to iterate by dividing the number of hits by 10 (this is Jonathan D Fitzgerald’s method)
baseurl <- paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=molecular fossil&begin_date=20180101&api-key=",api_key)


initialQuery <- fromJSON(baseurl)
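# Zero-based index of the last page; ceiling() keeps a partial final page that round() could drop
maxPages <- ceiling(initialQuery$response$meta$hits[1] / 10) - 1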
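The page arithmetic is easy to check on a few hit counts. This is a sketch, assuming 10 results per page and zero-based page numbers; the helper name last_page_index is hypothetical. Rounding to the nearest integer can skip a partial final page, so the helper rounds up instead.

```r
# Hypothetical helper: zero-based index of the last page of results.
last_page_index <- function(hits, per_page = 10) {
  # ceiling() counts a partial final page; max() guards against zero hits
  max(ceiling(hits / per_page) - 1, 0)
}

last_page_index(28)  # 2: pages 0, 1 and 2
last_page_index(23)  # 2: the third page holds the remaining 3 results
round(23 / 10 - 1)   # 1: rounding to nearest would skip that third page
```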

6. Save all the ‘pages’ in a list that is built up inside a for loop (this is Jonathan D Fitzgerald’s method)

  • Use the rbind_pages function to combine the pages into one data frame
  • I reduced my data frame down to the columns of interest
pages <- list()
for(i in 0:maxPages){
  nytSearch <- fromJSON(paste0(baseurl, "&page=", i), flatten = TRUE) %>% data.frame() 
  message("Retrieving page ", i)
  pages[[i+1]] <- nytSearch 
  Sys.sleep(2) 
}
## Retrieving page 0
## Retrieving page 1
## Retrieving page 2
all_results <- rbind_pages(pages)

all_reduced_results <- all_results %>% select(headline = response.docs.headline.main, abstract = response.docs.abstract, section = response.docs.section_name)
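In the loop above, a single failed request stops the whole run, because fromJSON raises an error when a request fails. The sketch below shows one way to keep going, under stated assumptions: the helper name fetch_pages, the injectable fetch argument and the default pause length are all hypothetical, and the right pause depends on the rate limits in the NYT developer documentation.

```r
# Hypothetical helper: fetch pages 0..max_page, skipping pages that fail.
# `fetch` defaults to jsonlite::fromJSON but can be swapped out for testing.
fetch_pages <- function(baseurl, max_page,
                        fetch = function(u) jsonlite::fromJSON(u, flatten = TRUE),
                        pause = 6) {
  pages <- list()
  for (i in seq_len(max_page + 1) - 1) {  # 0, 1, ..., max_page
    url <- paste0(baseurl, "&page=", i)
    page <- tryCatch(
      fetch(url),
      error = function(e) {
        message("Page ", i, " failed: ", conditionMessage(e))
        NULL
      }
    )
    # Keep only the pages that were retrieved successfully
    if (!is.null(page)) pages[[length(pages) + 1]] <- page
    Sys.sleep(pause)  # wait between requests (assumed value)
  }
  pages
}
```

For the real API, each returned element would still need the data.frame() conversion used above before calling rbind_pages.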

Results

Grouped by section and counted the number of articles about ‘molecular fossils’ that came from each section since 2018.

all_reduced_results %>% group_by(section) %>% count() %>% kbl() %>% kable_styling()
section | n
Briefing | 2
Climate | 2
Crosswords & Games | 1
Magazine | 3
Opinion | 3
Podcasts | 1
Science | 11
Style | 1
T Brand | 3
The Learning Network | 1

Conclusion

It’s easy to get great out-of-the-box performance with the jsonlite package when querying APIs. I also tried a couple of things that I haven’t documented here. I tried the R nytimes package and found it limiting. I also explored querying subjects with more results and found that this for loop approach runs into trouble when a query returns more than 150 articles.