Working New York Time’s Web APIs

In this assignment, we’re tasked with choosing one of the New York Times APIs, constructing an interface to read the JSON data (NYT’s default extension type), and transforming it into an R data frame. I chose to first look at the API for Book Reviews, thinking it would help me pick out my next book (Disclaimer: my book queue is already massive). As you can see from the results below, the API didn’t help much. I then decided to use their Article Search API to view articles based on the book I’m currently reading (‘The Three-Body Problem’ by Cixin Liu – a must read for Sci-Fi fans).

As always, our library set up comes first:

library('rjson')
library('knitr')
library('jsonlite')
library('dplyr')
library('ggplot2')
library('stringr')

Book Review API

Gathering the Data

After requesting a key for the book review API, I queried the API to find all book reviews for those written by Cixin Liu. The results came up empty. Moving on, I used the API to find book reviews on a much reviewed author: James Patterson. Here is the query to do so, with the results displayed below. Since the API data includes two columns (the first column is copyright), we have to extract the second column to get all of the results (req$results).

book_key <- "&api-key=8bc754f9b9602ba52ffd5d40ea63e9da:8:74836743"
book_url <- "http://api.nytimes.com/svc/books/v3/reviews.json?author=James%20Patterson"
book_req <- fromJSON(paste0(book_url, book_key))

book_reviews <- book_req$results
kable(book_reviews)
url publication_dt byline book_title book_author summary isbn13
http://www.nytimes.com/2006/12/03/books/review/03tbr.html 2006-12-03 DWIGHT GARNER Thriller James Patterson 9781439552285
http://www.nytimes.com/2001/07/24/books/books-of-the-times-love-story-or-is-that-death-story.html 2001-07-24 JANET MASLIN Suzanne’s Diary for Nicholas James Patterson 9780316969444
http://www.nytimes.com/2001/11/29/books/books-of-the-times-bodies-hang-in-california-and-bullets-fly-in-florida.html 2001-11-29 JANET MASLIN Violets Are Blue ~ Detective Alex Cross Series James Patterson 9780316693233
http://www.nytimes.com/2002/11/21/books/books-of-the-times-extending-franchises-alive-dead.html 2002-11-21 JANET MASLIN Four Blind Mice (Alex Cross) James Patterson 9780316693004
http://www.nytimes.com/2003/12/01/books/books-of-the-times-tender-family-moments-cut-quickly-to-violence.html 2003-12-01 JANET MASLIN (First Edition) the Big Bad Wolf Hardcover by James Patterson 2003 james patterson 9780316602907

Analyzing the Results

With the results of the pull in hand, we could analyze the (limited) amount of data in a number of ways. Here are some simple examples:

#1. Count of Books Reviewed by Reviewer
book_reviews %>%
  group_by(byline) %>%
  count(byline)
## Source: local data frame [2 x 2]
## 
##          byline     n
##           (chr) (int)
## 1 DWIGHT GARNER     1
## 2  JANET MASLIN     4
#2. Timeline of dates of review
#re-ordering the data frame from oldest to latest
book_reviews$publication_dt <- as.Date(book_reviews$publication_dt, format="%Y-%m-%d")
book_reviews$start <- book_reviews$publication_dt
book_reviews$end <- book_reviews$publication_dt
book_reviews <- book_reviews[order(as.Date(book_reviews$publication_dt, format="%d/%m/%Y")),]

#adding the start/end date and calculating the number of days between publications
book_reviews$start <- book_reviews$publication_dt
book_reviews$end <- book_reviews$publication_dt
for (i in 2:length(book_reviews$publication_dt)) {
  book_reviews$start[i] <- book_reviews$end[i - 1]
  i <- i+1
}
book_reviews$start <- as.Date(book_reviews$start, format="%Y-%m-%d")
book_reviews$end <- as.Date(book_reviews$end, format="%Y-%m-%d")
book_reviews$days_to_write <- as.numeric(book_reviews$end - book_reviews$start)

#graphing the data
ggplot(book_reviews, aes(x=book_title, y=days_to_write, fill=book_title)) + geom_bar(stat="identity") + xlab("Book Title") + ylab("Days to Publish") + guides(fill=FALSE) + theme(axis.text=element_text(size=10), axis.title=element_text(size=14, face="bold")) + scale_x_discrete(labels=function(book_title) str_wrap(book_title, width = 10))


Article Search API

Since the Book Reviews API was limited, and likely required additional pulls off of the URL of the results to get more information and deeper analysis (such as sentiment analysis on the reviews themselves), I decided to try out the Article Search API.

Gathering the Data

In gathering the data, we’re able to query for articles based on keywords. The keyword we’ll use is the name of the author Liu Cixin (in Chinese, the family name is displayed first). After querying for the author and using our API key for access, the JSON file returns our data but also includes file status and copyrights. We don’t want those so we filter them out. Next we realize that the query also has too many unnecessary columns, so we filter those out.

#search for articles
article_key <- "&api-key=a5a5ad65b30b53b4bc8b8cc5f74f8418:19:74836743"
article_url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Liu+Cixin"
article_req <- fromJSON(paste0(article_url, article_key))
articles_all <- article_req$response$docs
articles <- articles_all[,c(2,7,11,12,13,14,18,20)]
kable(head(articles,2))
snippet source pub_date document_type news_desk section_name _id slideshow_credits
Liu Cixin’s “The Three-Body Problem,” a science-fiction trilogy whose first book comes out Tuesday in the United States, has attracted a diverse Chinese audience. The New York Times 2014-11-11T00:00:00Z article Culture Books 5460f63b79881034af6c5981 NA
Even as the Chinese leadership offers praise for science fiction writers, the police have been reminding people not to use social media to flex their imaginations. The New York Times 2015-09-17T06:42:22Z blogpost Foreign World 55fa999f7988102b19028883 NA

Analyzing the Results

With our results in hand we can use the data frame to analyze the data in a variety of ways.

#1. Count of Article Types
articles %>%
  group_by(document_type) %>%
  count(document_type)
## Source: local data frame [2 x 2]
## 
##   document_type     n
##           (chr) (int)
## 1       article     4
## 2      blogpost     2
#2. Count of Sections the Article can be found in
articles %>%
  group_by(section_name) %>%
  count(section_name)
## Source: local data frame [4 x 2]
## 
##   section_name     n
##          (chr) (int)
## 1        Books     1
## 2      NYT Now     1
## 3         U.S.     1
## 4        World     3
#2. Count of where each Article came through from (which News Desk)
articles %>%
  group_by(news_desk) %>%
  count(news_desk)
## Source: local data frame [4 x 2]
## 
##   news_desk     n
##       (chr) (int)
## 1   Culture     1
## 2   Foreign     3
## 3      None     1
## 4    NYTNow     1