Assignment

The New York Times web site provides a rich set of APIs, as described here: https://developer.nytimes.com/apis. You’ll need to start by signing up for an API key.

Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R data frame.

Solution

Introduction

The JSON returned by the Archive API is deeply nested. Lists can be nested up to five levels deep; for example: response $\rightarrow$ docs $\rightarrow$ article $\rightarrow$ byline $\rightarrow$ person $\rightarrow$ firstname.

For the purposes of this homework, I am not going to try to shoehorn the full JSON into a data frame. Not only is that very difficult practically, it is unwise conceptually. The data about these articles is hierarchical in nature and does not respond well to being “flattened” into the single-layer format demanded by data frames and tables. Therefore I will extract a subset of the data, including:

main headline
publication date
byline
lead paragraph
keywords
word count
print page
web url
id

and populate the data frame with that information. Lastly, a couple of examples of the interface usage will be shown.

Knowledge Gained

JSON parsing

One nice feature of the httr package is that its content() function automatically recognizes a JSON response and calls the jsonlite package behind the scenes to parse it. The jsonlite package must therefore be installed (it normally arrives as a dependency of httr), but it does not need to be loaded via library().
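To make the shape of the parsed result concrete, here is a minimal sketch that calls jsonlite directly on a made-up, heavily trimmed payload. The sample JSON is hypothetical and only ordered to match the positions the code in this report assumes; the use of simplifyVector = FALSE reflects my understanding of what httr does for JSON and should be treated as an assumption.

```r
# Requires the jsonlite package to be installed.
library(jsonlite)

# Hypothetical, trimmed payload; the real response carries many more fields.
sample_json <- '{
  "copyright": "placeholder",
  "response": {
    "meta": {"hits": 2},
    "docs": [
      {"headline": {"main": "First headline"}, "word_count": 120},
      {"headline": {"main": "Second headline"}}
    ]
  }
}'

# simplifyVector = FALSE keeps every level as a nested list
parsed <- fromJSON(sample_json, simplifyVector = FALSE)

# The same value reached by name and by position
hits_by_name <- parsed$response$meta$hits
hits_by_position <- parsed[[2]][[1]][[1]]

# A field that is absent from the JSON comes back as NULL
missing_count <- parsed$response$docs[[2]]$word_count
```

Here hits_by_name and hits_by_position hold the same value, and missing_count is NULL because the second document has no word_count field.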

Subsetting

There are what may appear to be strange choices in the function GetHeadlines below. For example, there is a lot of integer-based subsetting of lists. This is for two reasons. First, some of the lists have to be traversed programmatically. Second, benchmarking showed that using list-element extraction via [[]] was noticeably faster than using named element extraction using $. So where I was certain that the data elements were consistent over the years 1851–2019, I used numeric extraction. Unfortunately, the elements included in the JSON changed over time so some fields must be extracted by name.
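The two extraction styles can be compared on a toy list. This is only a sketch: the doc object is a hypothetical stand-in for one parsed article, and the timing lines are included to show how such a comparison might be run, with no claim about the result on any particular machine or R version.

```r
# Hypothetical stand-in for one parsed article
doc <- list(headline = list(main = "Example headline"), word_count = 250L)

# Positional and named extraction reach the same element
by_position <- doc[[1]][[1]]
by_name <- doc$headline$main
same_value <- identical(by_position, by_name)

# `$` partially matches list names, while `[[` matches exactly by default
partial_hit <- doc$word        # matches word_count
exact_miss <- doc[["word"]]    # NULL, no element is named exactly "word"

# Rough timing comparison; results vary by machine and R version
n <- 1e5
time_position <- system.time(for (k in seq_len(n)) doc[[1]][[1]])
time_name <- system.time(for (k in seq_len(n)) doc$headline$main)
```

The partial matching of `$` is one more reason to be deliberate about which style is used where field names changed over the years.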

Enforcing integrity

It appears that if the API does not have a value for an item, the parsed result omits that field entirely rather than leaving it empty. This caused significant headaches when capturing an entire month’s worth of headlines by adding one headline element extraction at a time. To get around this problem, I have taken advantage of the fact that max ignores NULL arguments, so the maximum of NULL and any single value is that value.

max(NULL, -Inf)

## [1] -Inf

max(NULL, "")

## [1] ""

Therefore, I wrapped the extracts in a max call, ensuring that missing data is at least coded as empty and does not disappear. This ensures that every extract has the necessary nine data elements and that the rows can be combined.
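The effect of the wrapper can be seen on a toy article that lacks a print_page field. This is a sketch under stated assumptions: doc is a made-up list, and or_default is a hypothetical helper shown only as a more explicit alternative to max, not something used in this report's code.

```r
# Hypothetical parsed article with no print_page field
doc <- list(headline = list(main = "Example headline"), word_count = 250L)

# Without a default, the NULL silently vanishes and the record is one field short
unguarded <- c(headline = doc$headline$main, page = doc$print_page)

# With max, the missing field is kept as an empty string
guarded <- c(headline = max(doc$headline$main, ""),
             page = max(doc$print_page, ""))

# Hypothetical, more explicit alternative with the same effect for NULL
or_default <- function(x, default) if (is.null(x)) default else x
explicit <- c(headline = or_default(doc$headline$main, ""),
              page = or_default(doc$print_page, ""))
```

The unguarded vector has length 1, while guarded and explicit both have length 2 and are identical. One caveat of the max approach is that it collapses a multi-element value to a single element, so it suits only fields that hold one value.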

For the date, R stores dates as the number of days since the epoch, where 0 is January 1, 1970. Earlier dates are negative numbers, hence the (overkill) use of negative one billion as the default in the max call.
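A quick check of the day-count representation, using base R's Date class (data.table's IDate stores the same count as an integer). The only outside fact assumed here is that the first issue of the paper is dated September 18, 1851.

```r
# Days since the epoch for a few dates
epoch_day <- as.numeric(as.Date("1970-01-01"))
day_before <- as.numeric(as.Date("1969-12-31"))

# First issue of the paper (assumed date: 1851-09-18)
first_issue <- as.numeric(as.Date("1851-09-18"))

# Any archive date is negative before 1970 yet far above the sentinel
sentinel <- -1e9
is_negative <- first_issue < 0
above_sentinel <- first_issue > sentinel
```

Because every real publication date is far above the sentinel, the default can only win the max comparison when the date is missing.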

Code

library(httr)
library(data.table)
setDTthreads(0L)

GetHeadlines <- function(year, month) {
  if (!(is.numeric(year) && is.numeric(month))) {
    stop("Year and Month must be integers.")
  }
  if (year < 1851L || (year == 1851L && month < 9L)) {
    stop("Archives only begin in 9/1851")
  }
  if ((year > year(Sys.Date())) ||
      (year == year(Sys.Date()) && month > month(Sys.Date()))) {
    stop("Archives are not prophecies. Please contact Nostradamus.")
  }
  if (month < 1L || month > 12L) {
    stop("Please pick a valid month.")
  }
  
  URI <- paste0("https://api.nytimes.com/svc/archive/v1/", as.integer(year),
                "/", as.integer(month), ".json?api-key=", Sys.getenv("NYTAPI"))
  
  Reply <- GET(url = URI)
  stop_for_status(Reply)  # fail early on HTTP errors such as a rejected API key
  Response <- content(Reply, as = "parsed")
  
  hits <- Response[[2]][[1]][[1]]
  
  DT <- data.table()
  for (i in seq_len(hits)) {
    headline <- max(Response[[2]][[2]][[i]]$headline[[1]], "")
    date <- max(as.IDate(Response[[2]][[2]][[i]]$pub_date), -1e9)
    byline <- max(Response[[2]][[2]][[i]]$byline$original, "")
    lede <- max(Response[[2]][[2]][[i]]$lead_paragraph, "")
    keywords <- character()
    for (j in seq_along(Response[[2]][[2]][[i]]$keywords)) {
      keywords <- c(keywords, Response[[2]][[2]][[i]]$keywords[[j]]$value)
    }
    keywords <- max(paste(keywords, sep = "", collapse = ", "), "")
    count <- max(Response[[2]][[2]][[i]]$word_count, 0)
    page <- max(Response[[2]][[2]][[i]]$print_page, "")
    url <- max(Response[[2]][[2]][[i]]$web_url, "")
    id <- max(Response[[2]][[2]][[i]]$`_id`, "")
    DT <- rbind(DT, data.table(headline = headline, date = date,
                               byline = byline, lede = lede,
                               keywords = keywords, count = count,
                               page = page, url = url, id = id),
                fill = FALSE)
  }
  setkey(DT, date)
  # Sanity check: one row per hit and nine columns
  if (nrow(DT) != hits || ncol(DT) != 9L) {
    message("Data frame dimensions do not conform to data pull extract. ",
            "Please check code for errors.")
  }
  return(DT)
}
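Growing a table with rbind inside a loop copies the accumulated rows on every pass. As an alternative sketch, not the approach used above, the rows can be collected in a list and bound once with data.table's rbindlist. The docs list here is a hypothetical stand-in for the parsed articles, and only three of the nine fields are shown.

```r
library(data.table)

# Hypothetical stand-in for the parsed list of articles
docs <- list(
  list(headline = list(main = "A"), word_count = 10L,
       web_url = "https://example.com/a"),
  list(headline = list(main = "B"),
       web_url = "https://example.com/b")   # no word_count
)

# Build one row per article, defaulting missing fields via max
rows <- lapply(docs, function(d) {
  list(headline = max(d$headline$main, ""),
       count = max(d$word_count, 0),
       url = max(d$web_url, ""))
})

# Bind all rows in a single step
DT_sketch <- rbindlist(rows)
```

The result has one row per article, with the missing word count coded as 0.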

Examples

Terrorism

Let's compare the number of times “terrorism” was a keyword on September 10, 2001 with September 12, 2001.

NE <- GetHeadlines(2001, 9)

On September 9, there were 561 headlines, of which only 1 contained the word “terrorism”. On September 12, there were 201 headlines of which 108 contained the word “terrorism”.
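The counting code is not shown in this report; one way such per-day counts might be computed is sketched here on made-up rows. The toy table only mimics the date and keywords columns returned by GetHeadlines and contains no real archive data.

```r
library(data.table)

# Made-up rows in the shape returned by GetHeadlines; not real archive data
toy <- data.table(
  date = as.IDate(c("2001-09-10", "2001-09-10", "2001-09-12", "2001-09-12")),
  keywords = c("BASEBALL", "TERRORISM, AIRLINES",
               "TERRORISM", "WORLD TRADE CENTER (NYC), TERRORISM")
)

# Per day: number of headlines and how many list "terrorism" as a keyword
counts <- toy[, .(headlines = .N,
                  terrorism = sum(keywords %ilike% "terrorism")),
              by = date]
```

With the real table, NE would take the place of toy and the result could be filtered to the two days of interest.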

Winston Churchill: 1945

WC <- data.table()
for (q in seq_len(12)) {
  WC <- rbind(WC, GetHeadlines(1945, q))
}
ExampleK <- WC[id == WC[keywords %ilike% "churchill" &
                          !(headline %ilike% "churchill"), id][1]]
ExampleH <- WC[id == WC[!keywords %ilike% "churchill" &
                          (headline %ilike% "churchill"), id][1]]

In 1945, there were 126,247 headlines printed by the New York Times. Of these, 362 contained the name “Churchill” in the headline. Interestingly, as a keyword, “churchill” appeared only 259 times. In fact, there are only 129 articles where “churchill” appears both in the headline and as a keyword.
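The headline, keyword, and overlap counts above can be read off a two-way table. The sketch uses base R on four made-up rows, one for each combination; with the real data, WC$headline and WC$keywords would replace the toy vectors.

```r
# Made-up rows covering every combination; not real archive data
headline <- c("CHURCHILL SPEAKS", "CHURCHILL VISITS", "SUPPLY REVIEW", "WEATHER")
keywords <- c("CHURCHILL, WINSTON SPENCER", "GREAT BRITAIN",
              "CHURCHILL, WINSTON SPENCER", "WEATHER")

# Case-insensitive matches in each field
in_headline <- grepl("churchill", headline, ignore.case = TRUE)
in_keywords <- grepl("churchill", keywords, ignore.case = TRUE)

# Cross-tabulate the two indicators
overlap <- table(headline = in_headline, keywords = in_keywords)

n_headline <- sum(in_headline)
n_keywords <- sum(in_keywords)
n_both <- overlap["TRUE", "TRUE"]
```

The off-diagonal cells of overlap correspond to the two kinds of articles illustrated next.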

An example where “churchill” appears in just the keywords is article ID 4fc2352e45c1498b0d678868, dated 1945-01-05, with byline “By JAMES B. RESTON Special to THE NEW YORK TIMES”, headline “DECIDE TO REVIEW OVERSEAS SUPPLIES; Officials of United States and Britain Will Go Over Shipping Schedules as War Lasts European Turmoil a Factor Raw Materials Lacking”, and keywords “STETTINIUS, EDWARD R JR, HOPKINS, HARRY L, CHURCHILL, WINSTON SPENCER, EUROPE, ARMAMENT, GENERAL, WORLD WAR II, MILITARY ACTION AND STRATEGY, SHIPS AND SHIPPING, GENERAL”.

Conversely, an example where “churchill” appears in just the headline is article ID 4fc2352d45c1498b0d6787f2, dated 1945-01-08, with byline “By SYDNEY GRUSON By Cable to THE NEW YORK TIMES”, headline “GREEK POLICY RIFT REVIVED IN BRITAIN; Labor Chiefs Again Threaten to Quit Churchill–British Press Elas Outside Athens British and Retreating Elas Clash”, and keywords “AMANDOS, C, MACROPOULOS, JOHN, STRABOLGI, LORD, SCOBIE, RONALD MACKENZIE, GREENWOOD, ARTHUR, GREENWOOD, ARTHUR, COCKS, FRANK SEYMOUR, GREAT BRITAIN, GREECE, POLITICS AND GOVERNMENT”.

DT 607—Fall 2019

HW 9

Avraham Adler

10/24/2019