library(tidyverse)
library(jsonlite)
library(httr)
I am using the NY Times Article Search API. I want to find articles about the New York Jets from the sports desk over the last year, and I will generate the API request programmatically.
location<-"c:\\password-files-for-r\\nytimes_keys.csv"
nytimes_keys<-read.csv(location)
base_url<-"https://api.nytimes.com/svc/search/v2/articlesearch.json?"
#main query
q<-"q=New+York+Jets"
#&news_desk=sports&begin_date=20200101&end_date=20201023
#elements of fq
key<-paste0("api-key=",nytimes_keys$api_key)
tag<-paste(q,key,sep="&")
url<-paste0(base_url,tag)
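The filters left commented out above could be added without assembling the string by hand. This is a minimal sketch, assuming the Article Search API accepts `fq`, `begin_date`, and `end_date` parameters and the `news_desk:("Sports")` filter syntax; `build_search_url()` is a hypothetical helper and `"YOUR_KEY"` is a placeholder, not a real key.

```r
library(httr)

# Hypothetical helper: builds an Article Search URL from named parameters.
# modify_url() URL-encodes each value, so spaces and quotes are handled.
build_search_url <- function(query, api_key, news_desk = NULL,
                             begin_date = NULL, end_date = NULL) {
  params <- list(q = query, `api-key` = api_key)
  if (!is.null(news_desk)) {
    # Assumed filter syntax: fq=news_desk:("Sports")
    params$fq <- paste0('news_desk:("', news_desk, '")')
  }
  # Dates are YYYYMMDD strings; assigning NULL simply leaves the parameter out
  params$begin_date <- begin_date
  params$end_date <- end_date
  modify_url("https://api.nytimes.com/svc/search/v2/articlesearch.json",
             query = params)
}

url_filtered <- build_search_url("New York Jets", "YOUR_KEY",
                                 news_desk = "Sports",
                                 begin_date = "20200101",
                                 end_date = "20201023")
```

Passing the parameters as a named list keeps each filter separate, so one can be dropped or changed without editing a long string.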
I will use the GET() function from the httr package and check the status to see whether the request succeeded.
jets_pull<-GET(url)
http_status(jets_pull)
## $category
## [1] "Success"
##
## $reason
## [1] "OK"
##
## $message
## [1] "Success: (200) OK"
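In a script that runs unattended, the status check could stop the run instead of relying on reading the printed output. This is a small sketch using httr's `stop_for_status()`; `safe_get()` is a hypothetical wrapper and the task text is only illustrative.

```r
library(httr)

# Hypothetical wrapper: perform the request and stop with a readable
# error if the API answers with a 4xx or 5xx status.
safe_get <- function(url) {
  resp <- GET(url)
  stop_for_status(resp, task = "query the NYT Article Search API")
  resp
}

# Usage (needs a valid key, so it is left commented out):
# jets_pull <- safe_get(url)
```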
The response from the API is structured as a nested named list. I need to find out where the content I am interested in is located.
#look at the names
names(jets_pull)
## [1] "url" "status_code" "headers" "all_headers" "cookies"
## [6] "content" "date" "times" "request" "handle"
#I want the content, but it is stored as raw bytes
glimpse(jets_pull$content)
## raw [1:223798] 7b 22 73 74 ...
#data is in bytes, so convert to text
jets_content<-fromJSON(rawToChar(jets_pull$content))
#after some checks I found where the data I am interested in is located
names(jets_content$response$docs)
## [1] "abstract" "web_url" "snippet" "lead_paragraph"
## [5] "print_section" "print_page" "source" "multimedia"
## [9] "headline" "keywords" "pub_date" "document_type"
## [13] "news_desk" "section_name" "subsection_name" "byline"
## [17] "type_of_material" "_id" "word_count" "uri"
Since the data is structured as a list, I will convert it to a data frame.
df_jets<-data.frame(jets_content$response$docs)
#Tidy the Data
The headline column of this data frame is itself a nested data frame. I will need to pull it out in order to select the main headline.
Then I will create a new data frame suitable for looking at what my API request returned.
#pull headline out into its own data frame (it is already a data frame, so unnest() is not required)
df_headline<-df_jets$headline
output<-data.frame("main_headline"=df_headline$main,"abstract"=df_jets$abstract,"date"=df_jets$pub_date)
output%>%
  mutate(ymd=as.Date(date))%>%
  select(-date)
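The nested headline column can also be avoided at parse time. The sketch below assumes jsonlite's `flatten = TRUE` option and uses a made-up two-article JSON string (not real API output) so that it runs without a key; `toy_json`, `toy_docs`, and `toy_output` are illustrative names.

```r
library(jsonlite)

# Toy JSON shaped like the API response (values are invented for illustration)
toy_json <- '{"response":{"docs":[
  {"abstract":"a1","pub_date":"2020-10-20T12:00:00+0000","headline":{"main":"h1"}},
  {"abstract":"a2","pub_date":"2020-10-21T12:00:00+0000","headline":{"main":"h2"}}]}}'

# charToRaw() mimics the raw vector stored in a response's $content
toy_raw <- charToRaw(toy_json)

# flatten = TRUE turns the nested headline data frame into a headline.main column
toy_docs <- fromJSON(rawToChar(toy_raw), flatten = TRUE)$response$docs

# Same three fields as the output data frame above, with the date already converted
toy_output <- data.frame(
  main_headline = toy_docs$headline.main,
  abstract = toy_docs$abstract,
  ymd = as.Date(toy_docs$pub_date),
  stringsAsFactors = FALSE
)
```

With the flattened names, the main headline is an ordinary column and no separate headline data frame is needed.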
My pull returned only the first page of results for Jets articles (the most recent). I could write a function that adds a page parameter to the request, allowing me to cycle through the results pages and pull more data. I could also have added filters to my request, such as only pulling from the sports desk.
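A pagination function along those lines might look like the sketch below. It assumes the API takes a zero-based `page` parameter, and the `pause` default is a guess at a polite delay that should be checked against the current rate limits; `pull_pages()` is a hypothetical helper.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: fetch several result pages and stack them into one data frame.
pull_pages <- function(query, api_key, pages = 0:2, pause = 6) {
  results <- lapply(pages, function(p) {
    resp <- GET("https://api.nytimes.com/svc/search/v2/articlesearch.json",
                query = list(q = query, page = p, `api-key` = api_key))
    stop_for_status(resp)
    Sys.sleep(pause)  # wait between calls (assumed delay)
    fromJSON(content(resp, as = "text", encoding = "UTF-8"),
             flatten = TRUE)$response$docs
  })
  # Drop empty pages, then stack; rbind_pages() tolerates differing columns
  rbind_pages(Filter(is.data.frame, results))
}

# Usage (needs a valid key, so it is left commented out):
# jets_all <- pull_pages("New York Jets", nytimes_keys$api_key)
```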