The task for week 9 was to get a NYT API key, build an interface to the API in R, grab some data, and return a data frame. I chose to work with the “Archive_api” because I have some future projects in mind that involve old newspapers.

This file is available on RPubs here and in my GitHub here.

Load the Libraries

library(jsonlite)
library(httr)
library(knitr)

Build a function to Get JSON from NYT Archives API

I created a function to query the NYT Archive API and return a data frame containing the data retrieved. During development I noticed that my requests failed frequently, so the function retries them using httr’s RETRY() function.

#params
api.key <- "36a5b43cb0e04a1dad5e23a9810f2cc1"
yyyy <- "1929"
mm <- "09"

#return a data frame parsed from the NYT Archive API JSON
get.NytArchives <- function(api.key,yyyy,mm){
  #as.integer() drops any leading zero, so "09" and "9" both request 9.json
  base.url <- paste0("https://api.nytimes.com/svc/archive/v1/",yyyy,"/",as.integer(mm),".json")
  print(paste("Collecting NYT archvies data for: ",toString(yyyy),"-",toString(mm)))
  
  #GET seems to fail sometimes, so keep on tryin' (up to 100 attempts)
  #use base.url here so the requested year and month are the ones passed in
  query <- RETRY("GET",base.url,
                 query = list(api_key=api.key),
                 times = 100, 
                 pause_base = 2)
  query <- content(query,as="text",encoding="UTF-8")
  
  df <-  as.data.frame(fromJSON(query))
  
  #clean up the column names
  colnames(df) <- gsub("^.*\\.","", colnames(df))
  
  return(df)
}
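
To make the retry idea concrete, here is a minimal sketch of retrying with exponential backoff in base R. It is not how httr implements RETRY() (which also adds random jitter, caps the pause, and retries on HTTP error statuses); retry.with.backoff and flaky are hypothetical names made up for this illustration, and no network access is involved.

```r
# Hypothetical helper illustrating retry with exponential backoff.
# This is NOT httr's implementation, just a sketch of the idea.
retry.with.backoff <- function(f, times = 5, pause_base = 0.01){
  for (attempt in seq_len(times)){
    # capture an error as a value instead of letting it abort the loop
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    # wait longer after each failed attempt (no wait after the last one)
    if (attempt < times) Sys.sleep(pause_base * 2^attempt)
  }
  stop("all ", times, " attempts failed: ", conditionMessage(result))
}

# A made-up flaky operation: fails twice, then succeeds
calls <- 0
flaky <- function(){
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}

out <- retry.with.backoff(flaky)
print(out)
print(calls)
```

In the function above, RETRY() plays this role for the GET request, with times = 100 as the attempt limit and pause_base = 2 as the base of the waiting time.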

Test the Function

We’ll make a call to grab a single month and see what we get back:

result <- get.NytArchives(api.key,"1929","9")
## [1] "Collecting NYT archvies data for:  1929 - 9"

Check the Output

And now we’ll take a look at what we got. First I’ll print the column names:

kable(colnames(result),col.names = "Column Names")
Column Names
copyright
hits
web_url
snippet
lead_paragraph
abstract
print_page
blog
source
multimedia
headline
keywords
pub_date
document_type
news_desk
section_name
subsection_name
byline
type_of_material
_id
word_count
slideshow_credits
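
The short names above come from the gsub() clean-up inside the function. Here is a small offline sketch of what that pattern does. The prefixed names are an assumption about how as.data.frame() flattens the nested JSON (something like response.docs.web_url); the actual prefixes may differ.

```r
# Hypothetical column names, assumed to resemble what as.data.frame() produces
# for the nested Archive API response; the real prefixes may differ.
raw.names <- c("copyright", "response.meta.hits", "response.docs.web_url",
               "response.docs.snippet")

# "^.*\\." greedily matches everything up to and including the last dot,
# so only the final component of each name is kept.
# Names without a dot are left unchanged.
clean.names <- gsub("^.*\\.", "", raw.names)
print(clean.names)
```

One thing to watch for with this approach: if two columns shared the same final component, they would end up with identical names after the clean-up.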

For output purposes, I’ll select only a few of the columns listed above in order to keep things legible:

kable(head(result[c("web_url","snippet")],5))
web_url snippet
https://query.nytimes.com/gst/abstract.html?res=9D04E0D6163FE731A25752C0A96F9C946895D6CF With the ending of the vacation season business and industry swing into the Autumn season under exceptionally favorable circumstances. The Summer has been characterized generally by exceptional vigor in most lines, with high ratios of ac tivity, …
https://query.nytimes.com/gst/abstract.html?res=9503E3D6163FE731A25752C0A96F9C946895D6CF WASHINGTON, Aug. 31.–The next Congress will not consent to change the postal rates unless the postoffice bookkeeping system is changed to show accurately the receipts from different classifications, Senator George H. Moses of New Hampshire, forme…
https://query.nytimes.com/gst/abstract.html?res=9B02E1D6163FE731A25752C0A96F9C946895D6CF Mrs. Lillian Bloch Schwarz of Long Beach, L.I., prominent Zionist, died suddenly in Berlin, according to word received here by Hadassah, the Women’s Zionist Organization of America. Her age was 39. The body will arrive here on the Leviathan tomorr…
https://query.nytimes.com/gst/abstract.html?res=9906E3D6163FE731A25752C0A96F9C946895D6CF MINNEAPOLIS, Aug. 31. (AP).–Dedication of Minneapolis’s tallest building, the Foshay Tower, as a Washington memorial, today drew a notable gathering of Federal and State Government officials from most States of the Union….
https://query.nytimes.com/gst/abstract.html?res=9403E4D6163FE731A25752C0A96F9C946895D6CF LONDON, Aug. 31.–London is anxiously awaiting the result of the impending clash between the large force of Arabs which, it is officially announced, crossed the Syrian frontier into Palestine early yesterday morning, and the strong detachment of a…

The data looks good!

Now we’re going to try and grab a bunch of data all at once:

df <- data.frame(matrix(ncol = 2, nrow=0))

colnames(df) <- c("web_url","snippet")

for (i in 1:5){
  data <- get.NytArchives(api.key,1929,i)
  
  df<- rbind(df,data[c("web_url","snippet")])
}
## [1] "Collecting NYT archvies data for:  1929 - 1"
## [1] "Collecting NYT archvies data for:  1929 - 2"
## [1] "Collecting NYT archvies data for:  1929 - 3"
## [1] "Collecting NYT archvies data for:  1929 - 4"
## [1] "Collecting NYT archvies data for:  1929 - 5"

We’ve just collected 70620 articles from the NYT archives.
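
Growing a data frame with rbind() inside a loop copies it on every pass, which can get slow with tens of thousands of rows. One possible alternative, sketched below, is to collect each month in a list and bind everything once at the end. collect.months and fake.fetch are hypothetical names; fake.fetch is a stand-in for get.NytArchives with made-up rows, so the sketch runs without an API key.

```r
# Hypothetical helper: fetch several months and bind them in one step.
# `fetch` is any function(yyyy, mm) that returns a data frame containing
# web_url and snippet columns.
collect.months <- function(fetch, yyyy, months){
  pieces <- lapply(months, function(mm){
    data <- fetch(yyyy, mm)
    data[c("web_url","snippet")]   # keep only the two columns of interest
  })
  do.call(rbind, pieces)           # a single rbind over all the months
}

# Stand-in for get.NytArchives (made-up rows, no network access)
fake.fetch <- function(yyyy, mm){
  data.frame(web_url = paste0("https://example.com/", yyyy, "/", mm),
             snippet = paste("snippet for month", mm),
             pub_date = NA,
             stringsAsFactors = FALSE)
}

df <- collect.months(fake.fetch, 1929, 1:5)
print(nrow(df))
```

With the real function, the call would look something like collect.months(function(y, m) get.NytArchives(api.key, y, m), 1929, 1:5).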