library(httr)
library(jsonlite)
library(dplyr)
library(ggplot2)
This code will not run without first obtaining an API Key from NYT.
Once acquired, save your API key in your R session using Sys.setenv(NYTIMES_KEY="INSERT YOUR KEY HERE")
You can then retrieve the key without exposing it in your code by calling Sys.getenv("NYTIMES_KEY")
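A small guard can make a missing key fail early with a readable message rather than an opaque HTTP error later on. The helper name getNytKey below is hypothetical and not part of any package; it is a minimal sketch that assumes the key is stored in the NYTIMES_KEY environment variable as described above.

```r
# Hypothetical helper: read the NYT key from the environment, or stop early.
getNytKey <- function(var = "NYTIMES_KEY") {
  # Sys.getenv() returns "" when the variable has not been set
  key <- Sys.getenv(var)
  if (!nzchar(key)) {
    stop("No API key found; set it with Sys.setenv(", var, ' = "...")')
  }
  key
}
```

Calling getNytKey() in place of Sys.getenv("NYTIMES_KEY") would leave the rest of the code unchanged while surfacing a missing key immediately.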
This API works by taking the base URL “http://api.nytimes.com/svc/search/v2/articlesearch.json?q=” and appending search parameters, which include:
q (query term)
begin_date (YYYYMMDD format for earliest date to search)
end_date (YYYYMMDD format for latest date to search)
fl (list of fields to return)
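The parameters listed above can be assembled into a request URL roughly as follows. This is a sketch, not the API's official client: buildSearchUrl is a hypothetical helper, the fl value in the usage note is only an illustrative field list, and it assumes the API accepts standard percent-encoded values (spaces as %20) in addition to the + convention.

```r
# Hypothetical helper: assemble an Article Search URL from named parameters.
buildSearchUrl <- function(q, begin_date, end_date, fl = NULL,
                           api_key = Sys.getenv("NYTIMES_KEY")) {
  params <- list(q = q, begin_date = begin_date, end_date = end_date,
                 fl = fl, "api-key" = api_key)
  # Drop optional parameters that were not supplied
  params <- params[!vapply(params, is.null, logical(1))]
  # Percent-encode each value so spaces and punctuation are URL-safe
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json?",
         paste0(names(encoded), "=", encoded, collapse = "&"))
}
```

For example, buildSearchUrl("ice cream", "20180101", "20181027", fl = "headline,snippet") would request only those two fields for the given date range, reading the key from the environment.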
To search for specific fields for a specific query term or time, we must set parameters and then write a short piece of code which pastes those into a new URL.
# Need to use + to string together separate words
q_term <- "dragons"
begin_date <- "20170101"
end_date <- "20180101"
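Because the dates are plain strings, a stray extra digit is easy to type and hard to spot. The helpers below are hypothetical names introduced for illustration: a minimal sketch that joins multi-word queries with + and checks that a date string really is in YYYYMMDD form before any request is sent.

```r
# Hypothetical helpers for preparing Article Search parameters.

# Replace runs of whitespace with "+" so a multi-word query forms one term
toQueryTerm <- function(text) {
  gsub("\\s+", "+", trimws(text))
}

# TRUE only for 8-digit strings that also parse as real calendar dates
isValidApiDate <- function(x) {
  grepl("^[0-9]{8}$", x) & !is.na(as.Date(x, format = "%Y%m%d"))
}
```

A call such as stopifnot(isValidApiDate(begin_date), isValidApiDate(end_date)) placed before the URL is built would catch a malformed date early.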
First, we paste our parameters together with the API key to develop the URI which will return a JSON response.
baseurl <- paste0(
"http://api.nytimes.com/svc/search/v2/articlesearch.json?q=",
q_term,
"&begin_date=",
begin_date,
"&end_date=",
end_date,
"&facet_filter=true&api-key=",
Sys.getenv("NYTIMES_KEY"))
initialQuery <- fromJSON(baseurl)
As we can see, this search returned 31 hits, but the API returns information about only 10 articles at a time.
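The arithmetic for turning a hit count into page numbers can be sketched as follows. The mockQuery list is a stand-in that mimics only the response$meta$hits path of the parsed JSON, so the example runs without a network call; the page size of 10 comes from the text above.

```r
# Stand-in for the parsed JSON; a real response has many more fields
mockQuery <- list(response = list(meta = list(hits = 31)))

hits <- mockQuery$response$meta$hits
perPage <- 10

# Round up so the final, partly filled page is counted too
totalPages <- ceiling(hits / perPage)

# The API numbers pages from 0, so the last page index is one less
lastPage <- totalPages - 1
pageIndices <- 0:lastPage
```

With 31 hits this gives four pages, indexed 0 through 3. Rounding to the nearest integer instead of rounding up would stop at three pages and miss the final article.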
To make this repeatable over different parameters, we can write a function that accepts different search terms and also will:
Convert results to a data frame
Manipulate the search query to automatically download all pages of matching responses
Merge the different responses into a single data frame
For relative simplicity, we will choose just three parameters: search terms, earliest search date, and latest search date, and will limit fields returned to just a few.
Note: This function still requires an API key. To set yours, enter your key in the following code:
Sys.setenv(NYTIMES_KEY="INSERT-YOUR-KEY-HERE")
articleSearch <- function(search_terms, search_from_date, search_to_date) {
  # Build the query URL shared by every request
  baseurl <- paste0(
    "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=",
    search_terms,
    "&begin_date=",
    search_from_date,
    "&end_date=",
    search_to_date)
  # The first request is only used to find out how many articles match
  initialQuery <- fromJSON(paste0(
    baseurl,
    "&facet_filter=true&api-key=",
    Sys.getenv("NYTIMES_KEY")))
  hits <- initialQuery$response$meta$hits
  # Pages hold 10 articles each and are numbered from 0, so round up
  numPages <- max(ceiling(hits / 10) - 1, 0)
  pages <- list()
  # This Sys.sleep is very important--otherwise you will get a 429 error
  Sys.sleep(1)
  for (i in 0:numPages) {
    nytSearch <- fromJSON(
      paste0(
        baseurl,
        "&page=",
        i,
        "&api-key=",
        Sys.getenv("NYTIMES_KEY")),
      flatten = TRUE) %>%
      data.frame()
    message("Retrieving page ", i)
    message(Sys.time())
    pages[[i + 1]] <- nytSearch
    # A second Sys.sleep is needed in the for loop
    Sys.sleep(2)
  }
  # Combine all pages into one data frame and return it
  allNYTSearch <- rbind_pages(pages)
  allNYTSearch
}
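The fixed pauses in the function guard against 429 (too many requests) errors, but a request can still fail occasionally. One possible complement is to retry a failed call after a growing pause. The helper below, withRetry, is a hypothetical name and a minimal sketch; it assumes a failed request surfaces as an ordinary R error.

```r
# Hypothetical helper: call fn(), retrying after a pause if it signals an error.
withRetry <- function(fn, max_tries = 3, wait_seconds = 5) {
  for (attempt in seq_len(max_tries)) {
    # Capture any error as a value instead of letting it abort the loop
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      # Wait longer after each failure before trying again
      Sys.sleep(wait_seconds * attempt)
    }
  }
  stop("All ", max_tries, " attempts failed")
}
```

Inside the loop, a page request could then be wrapped as withRetry(function() fromJSON(url, flatten = TRUE)), where url stands for the page URL assembled with paste0.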
Let’s try running the function on a few different dessert trends and checking out some snippets.
Sys.sleep(2)
cronuts2018 <- articleSearch("cronuts","20180101","20181027")
## Retrieving page 0
## 2018-10-27 22:44:25
cronuts2018$response.docs.snippet[1:3]
## [1] "New Yorkers wait in line. Lots of lines. For anything. "
## [2] "For years, no chef dared try to improve the most iconic French pastry. Now, though, a new generation of bakers is trying."
## [3] "Two generations of bakers have made Alfonso’s Pastry Shoppe a reliable go-to for all kinds of cakes, and recent innovations have inspired foodie pilgrimages."
Sys.sleep(2)
cupcakes2018 <- articleSearch("cupcakes","20180101","20181027")
## Retrieving page 0
## 2018-10-27 22:44:30
## Retrieving page 1
## 2018-10-27 22:44:32
## Retrieving page 2
## 2018-10-27 22:44:34
## Retrieving page 3
## 2018-10-27 22:44:37
## Retrieving page 4
## 2018-10-27 22:44:39
## Retrieving page 5
## 2018-10-27 22:44:41
cupcakes2018$response.docs.snippet[1:3]
## [1] "Dogs, cops, cupcakes."
## [2] "The plot involves a team of covert agents devoted to a level of violence we’re told most people find unpalatable. Peter Berg directs."
## [3] "Contemporary art and vintage cars were on display at the September Art Fair in Bridgehampton, N.Y., as were plenty of early-fall looks. "
To visualize the data, we will make a function that takes the search response data frame as an input.
# Create a function that counts articles by type and plots the counts
sumBarChart <- function(dataframe) {
  dataframe %>%
    group_by(response.docs.type_of_material) %>%
    summarize(count = n()) %>%
    ggplot() +
    geom_bar(aes(y = count,
                 x = response.docs.type_of_material,
                 fill = response.docs.type_of_material),
             stat = "identity") +
    coord_flip() +
    labs(y = "Number of Articles", x = "Type of Coverage") +
    theme(legend.position = "none")
}
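To see what the chart is built from, the tally computed before plotting can be reproduced on a small made-up data frame. The toyResults data below is invented for illustration, and base R's table() is used here as a stand-in for the group_by() and summarize() step so the sketch runs without extra packages.

```r
# Toy stand-in for a search result; real results have many more columns
toyResults <- data.frame(
  response.docs.type_of_material = c("News", "Review", "News", "Op-Ed", "News"),
  stringsAsFactors = FALSE
)

# Count rows per category, one row of output per type of material
counts <- as.data.frame(
  table(toyResults$response.docs.type_of_material),
  stringsAsFactors = FALSE
)
names(counts) <- c("response.docs.type_of_material", "count")
counts
```

Each row of counts corresponds to one bar in the chart, with count giving the bar's length.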
sumBarChart(cronuts2018)+ggtitle("NY Times Cronut Coverage, 2018")
sumBarChart(cupcakes2018)+ggtitle("NY Times Cupcake Coverage, 2018")
Looks like cupcakes were a lot more popular in 2018 than cronuts!
The above code is heavily indebted to the work of Jonathan D. Fitzgerald, “Working with The New York Times API in R,” published on Storybench on January 25, 2018.