This week’s assignment required students to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform and store the data in an R data frame. I chose the Article Search API, which allows users to search articles published from September 18, 1851 to today and to retrieve headlines, bylines, abstracts, lead paragraphs, links to multimedia, and additional data.
library(RCurl)    # getURL() for raw HTTP requests
library(dplyr)    # data frame manipulation
library(stringr)  # string helpers
library(httr)     # GET() and response header access
library(jsonlite) # JSON parsing, validation, and rbind.pages()
library(DT)       # interactive datatable() output
library(rvest)    # parsing HTML error pages
I used a simple query, without filtering or faceting, for articles published between July 1, 2016 and today’s date with the search term “rat” in their headlines, bylines, or bodies.
# query string
q <- "rat"
# beginning and end dates in "YYYYMMDD" format
begin_date <- "20160701"
end_date <- str_replace_all(Sys.Date(), "-", "")
base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json?"
# `key` holds the Article Search API key (assigned separately; not shown)
resp <- GET(paste0(base_url, "api-key=", key))
# tabulate the response headers for display
resp.df <- data.frame(names(headers(resp)), unlist(headers(resp)))
colnames(resp.df) <- c("HTTP response header", "value")
rownames(resp.df) <- NULL
knitr::kable(resp.df, row.names = FALSE)
| HTTP response header | value |
|---|---|
| access-control-allow-credentials | true |
| content-type | application/json; charset=UTF-8 |
| date | Sun, 30 Oct 2016 18:22:06 GMT |
| server | nginx/1.10.1 |
| via | kong/0.8.3 |
| x-kong-proxy-latency | 3 |
| x-kong-upstream-latency | 227 |
| x-powered-by | PHP/5.3.3 |
| x-ratelimit-limit-day | 1000 |
| x-ratelimit-limit-second | 1 |
| x-ratelimit-remaining-day | 52 |
| x-ratelimit-remaining-second | 0 |
| content-length | 12097 |
| connection | keep-alive |
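The x-ratelimit-* headers above are what the pacing logic below has to respect; as a quick aside, they can be read straight off the httr response object, for example:
# read the daily quota headers from the response above
remaining_day <- as.integer(headers(resp)[["x-ratelimit-remaining-day"]])
limit_day <- as.integer(headers(resp)[["x-ratelimit-limit-day"]])
message("Requests remaining today: ", remaining_day, " of ", limit_day)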
Iterated requests to the Article Search API produced HTTP 403 Forbidden errors for certain pages of results; the errors were reproducible by pasting the same URLs into a web browser. My code prints an error message to flag these instances, and because those requests return no records, the affected pages are simply absent from the R data frame and from the somewhat simplified output displayed below. I also found that calling Sys.sleep() between requests prevented HTTP 429 errors from the API’s one-request-per-second limit.
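As an aside, httr also provides RETRY(), which re-issues a failed request with increasing pauses; a minimal sketch of guarding a single request that way (the paging loop below keeps the simpler fixed pause instead):
retry_resp <- RETRY("GET", paste0(base_url, "q=", URLencode(q, reserved = TRUE),
                                  "&api-key=", key),
                    times = 3, pause_base = 2)
status_code(retry_resp)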
Sys.sleep(2)
# initial request, used to get the total hit count
init_data <- fromJSON(paste0(base_url, "q=", URLencode(q, reserved = TRUE),
                             "&begin_date=", begin_date, "&end_date=", end_date,
                             "&sort=oldest", "&api-key=", key))
num_hits <- init_data$response$meta$hits
# the API returns 10 docs per page, zero-indexed: e.g. 113 hits -> pages 0 through 11
num_pages <- ceiling(num_hits / 10) - 1
search_data <- vector(mode = "list", length = (num_pages + 1))
response_time <- 0
for (i in 0:num_pages) {
  Sys.sleep(2)  # stay under the one-request-per-second rate limit
  message(paste0("page: ", i))
  from_api <- getURL(paste0(base_url, "q=", URLencode(q, reserved = TRUE), "&begin_date=",
                            begin_date, "&end_date=", end_date, "&sort=oldest", "&page=", i,
                            "&api-key=", key))
  if (validate(from_api)[1]) {
    data <- fromJSON(from_api, flatten = TRUE)
    response_time <- response_time + data$response$meta$time
    cat("page: ", format(i, width = 6),
        " response time: ", format(response_time, width = 15), "\n")
    search_data[[i + 1]] <- data$response$docs
  } else {
    # a failed request returns an HTML error page; scrape its <title> for the status
    err_msg <- read_html(from_api) %>% html_node("head > title") %>% html_text()
    cat("page: ", format(i, width = 6),
        " HTTP error: ", format(err_msg, width = 18, justify = "right"), "\n")
    # leave search_data[[i + 1]] as NULL; assigning NULL with `[<-` would
    # delete the element and shift later pages out of position
  }
}
## page: 0 response time: 15
## page: 1 HTTP error: 403 Forbidden
## page: 2 response time: 32
## page: 3 response time: 42
## page: 4 response time: 83
## page: 5 response time: 92
## page: 6 response time: 109
## page: 7 response time: 135
## page: 8 response time: 148
## page: 9 response time: 159
## page: 10 response time: 273
## page: 11 response time: 282
# drop the NULL placeholders left by failed requests, then stack the pages
json_data <- search_data[sapply(search_data, length) > 0]
search_data.df <- rbind.pages(json_data)  # rbind_pages() in newer jsonlite releases
Two functions from the jsonlite package, rbind.pages() and flatten(), were very helpful for quickly combining and flattening the nested data frames retrieved through the API into a single two-dimensional table. After combining and flattening, I examined the data frame for columns containing lists, which were often empty, so that unwanted columns could be omitted from my output. I then extracted the keyword values so they could be displayed without the other elements of the keywords lists produced from the parsed JSON data.
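As a toy illustration of those two functions (made-up two-page data, not the API results), rbind.pages() stacks the page-wise data frames and flatten() promotes nested columns such as headline.main to the top level:
page1 <- fromJSON('[{"headline": {"main": "A"}, "word_count": 100}]')
page2 <- fromJSON('[{"headline": {"main": "B"}, "word_count": 250}]')
combined <- rbind.pages(list(page1, page2))
names(combined)           # "headline" "word_count"; headline is still a nested data frame
names(flatten(combined))  # "headline.main" "word_count"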
# check data types of column vectors
#str(search_data.df)
#lapply(search_data.df, typeof)
output.df <- search_data.df
# collapse each article's keyword values into one comma-separated string
output.df$keywords <- lapply(search_data.df$keywords,
                             function(x) str_c(x$value, collapse = ", "))
output.df <- output.df %>% select(headline.main, byline.original, pub_date,
                                  web_url:print_page, source, keywords,
                                  document_type:subsection_name, type_of_material,
                                  word_count) %>%
  # wrap each URL in an anchor tag so it renders as a clickable link in the datatable
  mutate(web_url = str_c("<a href='", web_url, "' target='_blank'> ", web_url, " </a>")) %>%
  rename(headline_main = headline.main, byline = byline.original) %>%
  mutate(byline = str_replace_all(byline, "By ", "")) %>%
  # tidy the ISO 8601 timestamps, e.g. "2016-07-01T00:00:00Z" -> "2016-07-01 00:00:00"
  mutate(pub_date = str_trim(str_replace_all(pub_date, "T|Z", " "), side = "both"))
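The run summary below was printed from the metadata gathered above; a minimal sketch of one way to produce it (the copyright line presumably comes from the API response’s top-level copyright field, a field name assumed from the JSON rather than shown in the code above):
cat("API query string:", q, "\n")
cat("Publication start and end date bounds (YYYYMMDD):", begin_date, "-", end_date, "\n")
cat("Number of hits:", num_hits, "\n")
cat("API response time:", response_time, "\n")
cat(init_data$copyright, "\n")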
API query string: rat
Publication start and end date bounds (YYYYMMDD): 20160701 - 20161030
Number of hits: 113
API response time: 282
Copyright (c) 2013 The New York Times Company. All Rights Reserved.
# escape = FALSE lets the <a> tags built above render as clickable links
datatable(output.df, options = list(scrollX = TRUE), escape = FALSE)