This week’s assignment required students to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform and store the data in an R data frame. I chose the Article Search API, which allows users to search articles published from September 18, 1851 to today and to retrieve headlines, bylines, abstracts, lead paragraphs, links to multimedia, and additional data.
library(RCurl)    # getURL() for raw HTTP requests
library(dplyr)    # data frame manipulation
library(stringr)  # string helpers
library(httr)     # GET() and response header access
library(jsonlite) # JSON parsing, validation, and rbind.pages()
library(DT)       # interactive datatable() output
library(rvest)    # parsing HTML error pages
I used a simple query, without filtering or faceting, for articles published between July 1, 2016 and today’s date with the search term “rat” in their headlines, bylines, or bodies.
# query string
q <- "rat"
# beginning and end dates in "YYYYMMDD" format
begin_date <- "20160701"
end_date <- str_replace_all(Sys.Date(), "-", "")
base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json?"
# `key` holds the Article Search API key (assigned separately; not shown)
resp <- GET(paste0(base_url, "api-key=", key))
# tabulate the response headers for display
resp.df <- data.frame(names(headers(resp)), unlist(headers(resp)))
colnames(resp.df) <- c("HTTP response header", "value")
rownames(resp.df) <- NULL
knitr::kable(resp.df, row.names = FALSE)
| HTTP response header | value |
|---|---|
| access-control-allow-credentials | true |
| content-type | application/json; charset=UTF-8 |
| date | Sun, 30 Oct 2016 18:22:06 GMT |
| server | nginx/1.10.1 |
| via | kong/0.8.3 |
| x-kong-proxy-latency | 3 |
| x-kong-upstream-latency | 227 |
| x-powered-by | PHP/5.3.3 |
| x-ratelimit-limit-day | 1000 |
| x-ratelimit-limit-second | 1 |
| x-ratelimit-remaining-day | 52 |
| x-ratelimit-remaining-second | 0 |
| content-length | 12097 |
| connection | keep-alive |
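The x-ratelimit-* headers above are what the pacing logic below has to respect; as a quick aside, they can be read straight off the httr response object, for example:
# read the daily quota headers from the response above
remaining_day <- as.integer(headers(resp)[["x-ratelimit-remaining-day"]])
limit_day <- as.integer(headers(resp)[["x-ratelimit-limit-day"]])
message("Requests remaining today: ", remaining_day, " of ", limit_day)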
Iterated requests to the Article Search API produced HTTP 403 Forbidden errors for certain pages of results; the errors were reproducible by pasting the same URLs into a web browser. My code prints an error message to flag these instances, and because those requests return no records, the affected pages are simply absent from the R data frame and from the somewhat simplified output displayed below. I also found that calling Sys.sleep() between requests prevented HTTP 429 errors from the API’s one-request-per-second limit.
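As an aside, httr also provides RETRY(), which re-issues a failed request with increasing pauses; a minimal sketch of guarding a single request that way (the paging loop below keeps the simpler fixed pause instead):
retry_resp <- RETRY("GET", paste0(base_url, "q=", URLencode(q, reserved = TRUE),
                                  "&api-key=", key),
                    times = 3, pause_base = 2)
status_code(retry_resp)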
Sys.sleep(2)
# initial request, used to get the total hit count
init_data <- fromJSON(paste0(base_url, "q=", URLencode(q, reserved = TRUE),
                             "&begin_date=", begin_date, "&end_date=", end_date,
                             "&sort=oldest", "&api-key=", key))
num_hits <- init_data$response$meta$hits
# the API returns 10 docs per page, zero-indexed: e.g. 113 hits -> pages 0 through 11
num_pages <- ceiling(num_hits / 10) - 1
search_data <- vector(mode = "list", length = (num_pages + 1))
response_time <- 0
for (i in 0:num_pages) {
  Sys.sleep(2)  # stay under the one-request-per-second rate limit
  message(paste0("page: ", i))
  from_api <- getURL(paste0(base_url, "q=", URLencode(q, reserved = TRUE), "&begin_date=",
                            begin_date, "&end_date=", end_date, "&sort=oldest", "&page=", i,
                            "&api-key=", key))
  if (validate(from_api)[1]) {
    data <- fromJSON(from_api, flatten = TRUE)
    response_time <- response_time + data$response$meta$time
    cat("page: ", format(i, width = 6),
        " response time: ", format(response_time, width = 15), "\n")
    search_data[[i + 1]] <- data$response$docs
  } else {
    # a failed request returns an HTML error page; scrape its <title> for the status
    err_msg <- read_html(from_api) %>% html_node("head > title") %>% html_text()
    cat("page: ", format(i, width = 6),
        " HTTP error: ", format(err_msg, width = 18, justify = "right"), "\n")
    # leave search_data[[i + 1]] as NULL; assigning NULL with `[<-` would
    # delete the element and shift later pages out of position
  }
}
## page: 0 response time: 15
## page: 1 HTTP error: 403 Forbidden
## page: 2 response time: 32
## page: 3 response time: 42
## page: 4 response time: 83
## page: 5 response time: 92
## page: 6 response time: 109
## page: 7 response time: 135
## page: 8 response time: 148
## page: 9 response time: 159
## page: 10 response time: 273
## page: 11 response time: 282
# drop the NULL placeholders left by failed requests, then stack the pages
json_data <- search_data[sapply(search_data, length) > 0]
search_data.df <- rbind.pages(json_data)  # rbind_pages() in newer jsonlite releases
Two functions from the jsonlite package, rbind.pages() and flatten(), were very helpful for quickly combining and flattening the nested data frames retrieved through the API into a single two-dimensional table. After combining and flattening, I examined the data frame for columns containing lists, which were often empty, so that unwanted columns could be omitted from my output. I then extracted the keyword values so they could be displayed without the other elements of the keywords lists produced from the parsed JSON data.
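As a toy illustration of those two functions (made-up two-page data, not the API results), rbind.pages() stacks the page-wise data frames and flatten() promotes nested columns such as headline.main to the top level:
page1 <- fromJSON('[{"headline": {"main": "A"}, "word_count": 100}]')
page2 <- fromJSON('[{"headline": {"main": "B"}, "word_count": 250}]')
combined <- rbind.pages(list(page1, page2))
names(combined)           # "headline" "word_count"; headline is still a nested data frame
names(flatten(combined))  # "headline.main" "word_count"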
# check data types of column vectors
#str(search_data.df)
#lapply(search_data.df, typeof)
output.df <- search_data.df
# collapse each article's keyword values into one comma-separated string
output.df$keywords <- lapply(search_data.df$keywords,
                             function(x) str_c(x$value, collapse = ", "))
output.df <- output.df %>% select(headline.main, byline.original, pub_date,
                                  web_url:print_page, source, keywords,
                                  document_type:subsection_name, type_of_material,
                                  word_count) %>%
  # wrap each URL in an anchor tag so it renders as a clickable link in the datatable
  mutate(web_url = str_c("<a href='", web_url, "' target='_blank'> ", web_url, " </a>")) %>%
  rename(headline_main = headline.main, byline = byline.original) %>%
  mutate(byline = str_replace_all(byline, "By ", "")) %>%
  # tidy the ISO 8601 timestamps, e.g. "2016-07-01T00:00:00Z" -> "2016-07-01 00:00:00"
  mutate(pub_date = str_trim(str_replace_all(pub_date, "T|Z", " "), side = "both"))
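The run summary below was printed from the metadata gathered above; a minimal sketch of one way to produce it (the copyright line presumably comes from the API response’s top-level copyright field, a field name assumed from the JSON rather than shown in the code above):
cat("API query string:", q, "\n")
cat("Publication start and end date bounds (YYYYMMDD):", begin_date, "-", end_date, "\n")
cat("Number of hits:", num_hits, "\n")
cat("API response time:", response_time, "\n")
cat(init_data$copyright, "\n")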
API query string: rat
Publication start and end date bounds (YYYYMMDD): 20160701 - 20161030
Number of hits: 113
API response time: 282
Copyright (c) 2013 The New York Times Company. All Rights Reserved.
# escape = FALSE lets the <a> tags built above render as clickable links
datatable(output.df, options = list(scrollX = TRUE), escape = FALSE)