library(tidyverse)
library(jsonlite)
library(httr)

Process

I am using the NY Times Article Search API. I want to search for articles about the New York Jets from the sports desk over the last year. I will generate the API request programmatically.

Creating the request URL for the GET() function in httr:

Load the API key

location<-"c:\\password-files-for-r\\nytimes_keys.csv"
nytimes_keys<-read.csv(location)
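
If you'd rather not hard-code a local file path, a common alternative is to keep the key in an environment variable. A minimal sketch, assuming a NYTIMES_KEY entry has been added to .Renviron (the variable name is my own choice):

#alternative: read the key from an environment variable instead of a CSV
api_key<-Sys.getenv("NYTIMES_KEY")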

base_url<-"https://api.nytimes.com/svc/search/v2/articlesearch.json?"
#main query
q<-"q=New+York+Jets"
#additional facets I could append (news desk filter and date range), not used in this pull:
#&news_desk=sports&begin_date=20200101&end_date=20201023
key<-paste0("api-key=",nytimes_keys$api_key)
tag<-paste(q,key,sep="&")

url<-paste0(base_url,tag)
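
Pasting the string together works, but GET() (introduced in the next section) can also assemble and URL-encode the query for you. A sketch of the same request with the commented-out facets switched on, assuming the v2 Article Search parameters (the news desk filter belongs in fq, and the dates are YYYYMMDD):

#equivalent request with the facets enabled; GET() encodes the query list
jets_faceted<-GET("https://api.nytimes.com/svc/search/v2/articlesearch.json",
                  query=list(q="New York Jets",
                             fq='news_desk:("Sports")',
                             begin_date="20200101",
                             end_date="20201023",
                             `api-key`=nytimes_keys$api_key))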

Requesting the Data

I will use the GET() function from the httr package and check the status to confirm the request was successful.

jets_pull<-GET(url)

http_status(jets_pull)
## $category
## [1] "Success"
## 
## $reason
## [1] "OK"
## 
## $message
## [1] "Success: (200) OK"

Inspecting the Data

The response from the API is structured as a nested named list. I need to find out where the content I am interested in is located.

#look at the names
names(jets_pull)
##  [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
##  [6] "content"     "date"        "times"       "request"     "handle"
#I want the content element, but it holds raw bytes
glimpse(jets_pull$content)
##  raw [1:223798] 7b 22 73 74 ...
#the data is raw bytes, so convert it to text and parse the JSON
jets_content<-fromJSON(rawToChar(jets_pull$content))

#after some checks I found where the data I am interested in is located
names(jets_content$response$docs)
##  [1] "abstract"         "web_url"          "snippet"          "lead_paragraph"  
##  [5] "print_section"    "print_page"       "source"           "multimedia"      
##  [9] "headline"         "keywords"         "pub_date"         "document_type"   
## [13] "news_desk"        "section_name"     "subsection_name"  "byline"          
## [17] "type_of_material" "_id"              "word_count"       "uri"
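
As an aside, the rawToChar() conversion above can also be delegated to httr's content() helper, which decodes the response body for you; parsing is then the same fromJSON() call:

#equivalent conversion using httr's content() helper
jets_text<-content(jets_pull,as="text",encoding="UTF-8")
jets_content<-fromJSON(jets_text)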

Convert to a Data Frame

Since the data is structured as a list, I will convert it to a data frame.

df_jets<-data.frame(jets_content$response$docs)
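
This data frame holds only one page of results (the API returns them in pages), while the response metadata reports the size of the full result set; in the v2 API the total count should sit under response$meta$hits:

#total number of matching articles; df_jets holds just the first page
jets_content$response$meta$hits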

Tidy the Data

The headline column of this data frame is itself a nested data frame. I will need to unnest it in order to select the main headline.

Then I will create a new data frame suitable for reviewing what my API request returned.

#unnest headline and put it in its own data frame
df_headline<-unnest(df_jets$headline)
## Warning: `cols` is now required when using unnest().
## Please use `cols = c()`
output<-data.frame("main_headline"=df_headline$main,"abstract"=df_jets$abstract,"date"=df_jets$pub_date)

output%>%
  mutate(ymd=as.Date(date))%>%
  select(-date)
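
The deprecation warning above is tidyr nudging toward its newer syntax. An alternative route that sidesteps the manual unnesting altogether is jsonlite's flatten(), which expands nested data frame columns into regular columns with dotted names; a sketch of the same tidy step built on it:

#flatten() turns the nested headline$main into a headline.main column
jsonlite::flatten(jets_content$response$docs)%>%
  transmute(main_headline=headline.main,
            abstract,
            ymd=as.Date(pub_date))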

Conclusions

My pull returned only the first page of results for Jets articles (the most recent). I could write a function that adds a pagination facet, letting me cycle through the result pages and pull more data. I could also have added more facets to the query, like restricting it to the sports desk.
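
A sketch of what that pagination function might look like, assuming the API's zero-indexed page parameter and pausing between calls to stay under the per-minute rate limit (get_jets_page() is a name I am inventing here):

#hypothetical helper: fetch one page of results (page is zero-indexed)
get_jets_page<-function(page){
  resp<-GET("https://api.nytimes.com/svc/search/v2/articlesearch.json",
            query=list(q="New York Jets",page=page,`api-key`=nytimes_keys$api_key))
  fromJSON(rawToChar(resp$content))$response$docs
}

#pull the first three pages, pausing between requests
jets_pages<-lapply(0:2,function(p){Sys.sleep(6);get_jets_page(p)})

Each element of jets_pages is a docs data frame like the one above, so the same tidying steps apply page by page before combining.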