Introduction to Web APIs

The New York Times web site provides a rich set of APIs, as described here: http://developer.nytimes.com/docs

You’ll need to start by signing up for an API key. Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it to an R dataframe.

I decided to work with the Article Search API and to build a function modeled on the example in chapter 9 of our text, which uses Yahoo weather data.

Set Up for NYT Web APIs

With the Article Search API (v2), you can search New York Times articles from Sept. 18, 1851 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata.

The Article Search API at a Glance

Base URI: http://api.nytimes.com/svc/search/v2/articlesearch

HTTP method: GET

Response formats: JSON (.json), JSONP (.jsonp)

There are 12 possible search parameters, not counting the API key, and at least one of them accepts 17 possible values. For version 1, I decide to keep my R wrapper function very simple and expose just 2 parameters: topic (the search phrase) and response (new = most recent hits, old = oldest hits, blank = highest relevance).

Set Up the Data to Return

Our test call returns a list of 3 elements. The second, test[2], is a simple status (“OK”), which may be useful for more advanced error checking, but I am ignoring it for now. The third, test[3], is a copyright statement from the New York Times, which I will also ignore. The search responses we want are in the first element, test[1].
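
The difference between single and double brackets matters when drilling into a list like this one. The sketch below uses a small hand-built list in place of the real API reply. The element names status and copyright, and all of the values, are assumptions made for illustration; response and docs follow the names used in the function further down.

```r
# Mock reply: a stand-in for the parsed JSON, not real API output.
# The names "status" and "copyright" are assumed for illustration.
test <- list(
  response  = list(docs = data.frame(
    web_url = c("http://example.com/a", "http://example.com/b"),
    stringsAsFactors = FALSE
  )),
  status    = "OK",
  copyright = "Copyright notice (placeholder text)"
)

length(test)        # number of top-level elements
class(test[2])      # single brackets keep the list wrapper
class(test[[2]])    # double brackets extract the element itself
test[[2]] == "OK"   # a simple status check

# The search results live one level down, inside the first element
docs <- test[[1]]$docs
nrow(docs)
```

Single brackets always return a sub-list, so test[1]$docs would be NULL here; test[[1]]$docs is the form that reaches the data frame.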

With a little more investigation we see that the part of the list we want is test[[1]]$docs. We can examine its 20 variables, listed in the output below, and decide which ones we want.

I decide that headline, pub_date, web_url, and maybe snippet would make a good response to a search. After putting those 4 variables into a data frame, I find that headline is really 3 values (main, kicker, and print_headline), so the result has 6 columns. After examining my example, I decide main is the one I want.

test.df <- data.frame(test[[1]]$docs)
variable.names(test.df)
##  [1] "web_url"           "snippet"           "lead_paragraph"   
##  [4] "abstract"          "print_page"        "blog"             
##  [7] "source"            "multimedia"        "headline"         
## [10] "keywords"          "pub_date"          "document_type"    
## [13] "news_desk"         "section_name"      "subsection_name"  
## [16] "byline"            "type_of_material"  "X_id"             
## [19] "word_count"        "slideshow_credits"
# Structure a dataframe to return
response <- data.frame(test.df$headline, test.df$pub_date, test.df$web_url, test.df$snippet)
str(response)
## 'data.frame':    10 obs. of  6 variables:
##  $ main            : chr  "Supreme Court Declines to Hear Appeal in Google-Oracle Copyright Fight" "Google Asks Supreme Court to Decide Oracle Copyright Fight" "Student-Built Apps Teach Colleges a Thing or Two" "Samsung Stakes Claim on Wearable Tech That Monitors Health" ...
##  $ kicker          : chr  "Bits" NA NA "Bits" ...
##  $ print_headline  : chr  "Supreme Court Declines to Hear Google-Oracle Appeal" "Google Asks Supreme Court to Decide Oracle Copyright Fight" "Students’ Apps Teach Colleges a Thing or Two" NA ...
##  $ test.df.pub_date: Factor w/ 10 levels "2013-01-11T13:00:37Z",..: 10 9 8 7 6 5 4 3 2 1
##  $ test.df.web_url : Factor w/ 10 levels "http://bits.blogs.nytimes.com/2013/01/11/c-e-s-2013-the-voice-controlled-home-with-a-catch/",..: 3 10 8 2 9 7 6 5 4 1
##  $ test.df.snippet : Factor w/ 10 levels "A years-long copyright fight between Google and Oracle will continue. At issue are the bits of code that help pieces of softwar"| __truncated__,..: 1 4 10 8 7 5 6 9 3 2
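
The 6-column result above comes from headline being a data frame nested inside the docs data frame. The following is a minimal sketch of that behaviour using a hand-built mock; the docs object and its values are made up for illustration and are not API output.

```r
# Mock of the nested structure: "headline" is itself a data frame column.
# All values are placeholders, not real search results.
docs <- data.frame(
  pub_date = c("2013-01-11T13:00:37Z", "2013-01-12T09:30:00Z"),
  stringsAsFactors = FALSE
)
docs$headline <- data.frame(
  main           = c("First headline", "Second headline"),
  kicker         = c("Bits", NA),
  print_headline = c("First, in print", NA),
  stringsAsFactors = FALSE
)

# Passing the nested data frame to data.frame() spreads it into 3 columns
wide <- data.frame(docs$headline, docs$pub_date)
names(wide)

# Picking headline$main directly keeps one column per field,
# and stringsAsFactors = FALSE keeps the text as character
flat <- data.frame(
  Headline = docs$headline$main,
  Date     = substr(docs$pub_date, 1, 10),  # keep only YYYY-MM-DD
  stringsAsFactors = FALSE
)
str(flat)
```

Selecting the sub-column explicitly and naming the columns at construction avoids both the extra columns and the test.df. prefixes seen in the output above.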

Build the Function

In the code below I take the ideas from above and write a function that does a simple search using the New York Times Article Search API. The function has two parameters: the search topic and which responses to return. After the function, I run it on a few examples.

Function Junction

I call this function getNYTsearch, and its parameters are topic and response. It uses simple error handling and returns brief instructions if no topic is entered.

library(jsonlite)  # provides fromJSON(), used below

getNYTsearch <- function(topic = "", response = "") {
## Give some instructions if no topic is entered
        if (topic == "") {
                stop('The getNYTsearch function requires a search topic to work. Please enter getNYTsearch(topic = "titanic", response = "new"). response can be o, old, n, or new. If no value for response is entered, the 10 most relevant hits are returned.')
        }
        
## Put the topic in search API format
        topic <- gsub(" ", "+", topic)

## Check response and set appropriate searchurl if valid
        if (!response %in% c("o", "n", "old", "new", "", "O", "N", "Old", "New")) {
                stop("Wrong response parameter. Choose either 'o' for oldest or 'n' for newest, or leave blank '' for most relevant.")
        }
        if (response == "o" || response == "O" || response == "old" || response == "Old") {
                searchurl <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&sort=oldest&api-key=91d71cc6a4bdd339cb45f37747025836:18:62602330"
        }
        if (response == "n" || response == "N" || response == "new" || response == "New") {
                searchurl <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&sort=newest&api-key=91d71cc6a4bdd339cb45f37747025836:18:62602330"
        }
        if (response == "") {
                searchurl <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&api-key=91d71cc6a4bdd339cb45f37747025836:18:62602330"
        }

## Get search results
        searchurl <- sprintf(searchurl, topic)
        fullreply <- fromJSON(searchurl)

## Construct Dataframe to Return
        reply <- data.frame(fullreply$response$docs$headline$main, fullreply$response$docs$pub_date, fullreply$response$docs$web_url, fullreply$response$docs$snippet, stringsAsFactors = FALSE)
        colnames(reply) <- c("Headline", "Date", "URL", "Snippet")
        reply$Date <- substr(reply$Date, 1, 10)

        return(reply)
}
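
One possible refinement, sketched here rather than taken from the function above, is to build the request URL in a single place and read the API key from the environment instead of writing it into the source. The helper name build_nyt_url and the environment variable NYT_API_KEY are hypothetical. This sketch also percent-encodes the topic with URLencode instead of replacing spaces with "+", and it accepts any capitalisation of the response value, which is slightly more permissive than getNYTsearch.

```r
# Hypothetical helper: builds the request URL without embedding the key in the source.
# NYT_API_KEY is an assumed environment variable name.
build_nyt_url <- function(topic, response = "", key = Sys.getenv("NYT_API_KEY")) {
  base <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"

  # Map every accepted spelling onto the API's sort values
  sort_map <- c(o = "oldest", old = "oldest", n = "newest", new = "newest")
  choice <- tolower(response)
  if (choice != "" && !choice %in% names(sort_map)) {
    stop("response must be 'o', 'old', 'n', 'new', or ''.")
  }

  # Percent-encode the search phrase so spaces and punctuation are safe in a URL
  query <- paste0("q=", utils::URLencode(topic, reserved = TRUE))
  if (choice != "") {
    query <- paste0(query, "&sort=", sort_map[[choice]])
  }
  paste0(base, "?", query, "&api-key=", key)
}

# Example with a placeholder key
build_nyt_url("big data", "Old", key = "YOUR_KEY")
```

The resulting string could be passed to fromJSON() in place of the three hard-coded URL templates.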

Using the Function

We run through this list of tests.

  1. No parameters = instructions # Could NOT knit to show in Markdown, but worked
  2. Bad response parameter # Could NOT knit, but worked as expected at prompt
  3. Topic with no response
  4. Topic with “old” response
  5. Topic with “new” response

I wrap the results in tbl_df from dplyr for ease of reading. All of the tests appear to work.

# getNYTsearch()
# getNYTsearch("big data", "K")
tbl_df(getNYTsearch("big data"))
## Source: local data frame [10 x 4]
## 
##                                                             Headline
##                                                                (chr)
## 1               If Algorithms Know All, How Much Should Humans Help?
## 2                                 A Changing West Village Landscape 
## 3                                                                Big
## 4                                                                Big
## 5                               Faltering TV Show Hits Stride on Web
## 6         For New York’s Pools, It’s Not the Heat, It’s the Politics
## 7  Looking at the Promise and Perils of the Emerging Big Data Sector
## 8                   First Data Seeks to Raise $3.2 Billion in I.P.O.
## 9                                                          Geography
## 10                                Big Data Gets Its Own Photo Album 
## Variables not shown: Date (chr), URL (chr), Snippet (chr)
tbl_df(getNYTsearch("big data", "o"))
## Source: local data frame [10 x 4]
## 
##                                                                       Headline
##                                                                          (chr)
## 1  Twelve Days Later from California; ARRIVAL OF THE ILLINOIS. NEARLY TWO MILL
## 2                          The Broadway Railway-Reply to Anti-Monopoly, No. 3.
## 3  LATEST INTELLIGENCE; By Telegraph to the New-York Daily Times XXXIId CONGRE
## 4                                                                   FINANCIAL.
## 5                                                                   FINANCIAL.
## 6  FIFTEEN DAYS LATER FROM CALIFORNIA.; Arrival of the Northern Light, via Nic
## 7  NEW-YORK CITY.; THE STRIKE. The Employing Painters and Journeymen in the Fi
## 8  NEW-YORK CITY.; THE CATHOLIC HISTORY OF AMERICA, LEOTURS THREE. The Cathoti
## 9  REV. HENRY GILES' LECTURES--No. II.; FALSE AND EXAGGERATED EULOGY IN POPULA
## 10                      FRANCE.; Political, Personal, and Miscelaneous Gossip.
## Variables not shown: Date (chr), URL (chr), Snippet (chr)
tbl_df(getNYTsearch("big data", "n"))
## Source: local data frame [10 x 4]
## 
##                                                                       Headline
##                                                                          (chr)
## 1                      In Arbitration, a ‘Privatization of the Justice System’
## 2            Corked? Fine Wines Languish in China Warehouses as Consumers Cool
## 3                    Tax Breaks Produce Surge for Film Industry in Los Angeles
## 4                    'Burnt,' 'Crisis,' Add to a Pileup of Flops at Box Office
## 5                            Candidates Follow Informal Rules in New Hampshire
## 6                   Prices, Politics Challenge Health Law's 3rd Sign-Up Season
## 7  SAP Chief Bill McDermott Embarks on Health Care Mission After Losing His Ey
## 8        China&apos;s Factory and Service Activity Show Economy Still Unsteady
## 9                   AP: Hundreds of Officers Lose Licenses Over Sex Misconduct
## 10                                           Silicon Valley’s New Philanthropy
## Variables not shown: Date (chr), URL (chr), Snippet (chr)
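
The first two tests stop with an error, which is why they are commented out above. One way to show such messages without halting, sketched here under assumptions, is to catch the error and return its message. The check_args function is a hypothetical stand-in that mirrors only the argument checks of getNYTsearch, with shortened messages, so the example runs without a network call.

```r
# Hypothetical stand-in for the argument checks in getNYTsearch (no API call)
check_args <- function(topic = "", response = "") {
  if (topic == "") {
    stop("A search topic is required.")
  }
  if (!tolower(response) %in% c("o", "n", "old", "new", "")) {
    stop("Wrong response parameter.")
  }
  invisible(TRUE)
}

# Evaluate an expression; return the error message instead of stopping
show_error <- function(expr) {
  tryCatch({
    expr          # forced here, so any error is raised inside tryCatch
    "no error"
  }, error = function(e) conditionMessage(e))
}

show_error(check_args())                 # test 1: no parameters
show_error(check_args("big data", "K"))  # test 2: bad response parameter
show_error(check_args("big data", "n"))  # valid call
```

Setting the knitr chunk option error = TRUE is another route, since it lets a document keep knitting and prints the error in the output.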