Step:1 - Create an account at http://developer.nytimes.com/docs. You may sign in with your Google or Facebook account.
Step:2 - Create an API key for the API you are interested in. I chose the Books API and obtained the following API key: ae3ebf8d2c14b1623769762cea332b83:0:71863562
The main objective of this project is to obtain the following information about books from the nytimes.com website (via the API):
The list of books that are popular in various categories
The author of each book
The Amazon URL for the book
The book’s rank
The published date
We will use the following R packages in this project: RCurl, jsonlite, RJSONIO and data.table.
NOTE: I found jsonlite easier to work with than RJSONIO for parsing JSON documents. Both packages provide a function named fromJSON(), but the implementations and the returned objects differ: RJSONIO’s fromJSON() returns a list of lists (R list objects), while jsonlite’s fromJSON() converts arrays of JSON records into data frames where it can. Data frames are easier to process than nested lists, so I use jsonlite wherever possible. The data.table package provides rbindlist(), which binds a list of lists or data frames into a single table. I also had to use RJSONIO, because jsonlite’s fromJSON() raised an error on some of the Books API responses (the problem is described later in this document).
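The difference between the two fromJSON() functions can be seen on a small example. This is only a sketch: the JSON string below is a made-up toy document (not real API output), and each call is skipped if the corresponding package is not installed.

```r
# A tiny JSON document shaped like an API response (toy values, not real data).
txt <- '{"status":"OK","results":[{"title":"A","rank":1},{"title":"B","rank":2}]}'

if (requireNamespace("jsonlite", quietly = TRUE)) {
  with_jsonlite <- jsonlite::fromJSON(txt)
  # jsonlite simplifies the array of records into a data frame.
  print(class(with_jsonlite$results))
}

if (requireNamespace("RJSONIO", quietly = TRUE)) {
  with_rjsonio <- RJSONIO::fromJSON(txt)
  # RJSONIO keeps the array as a plain list with one element per record.
  print(class(with_rjsonio$results))
  print(length(with_rjsonio$results))
}
```

Calling the functions with the package prefix (jsonlite::fromJSON, RJSONIO::fromJSON) avoids any doubt about which implementation is used when both packages are loaded.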
If any of these packages are not installed, install them with the following command, replacing package_name with the name of the package:
install.packages("package_name")
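As a convenience, the check and the installation can be combined. The sketch below is one possible way to do it; the helper names missing_packages() and install_missing() are hypothetical and not part of the original project, and the actual installation call is left commented out.

```r
required <- c("RCurl", "jsonlite", "RJSONIO", "data.table")

# Return the packages from `required` that are not in `installed`.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

# Install whatever is missing; does nothing when everything is present.
install_missing <- function(required) {
  to_install <- missing_packages(required)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)
}

# install_missing(required)   # uncomment to run the installation
```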
According to the nytimes.com website, books are grouped into several categories, and we need the category names in order to request the books in each category. Once the categories have been obtained from nytimes.com, we will loop over them and fetch the book list for each one with a second call to the API.
The URL to obtain books categories is given below:
http://api.nytimes.com/svc/books/v3/lists/names.json?api-key=xxxxxxxxxx
The xxxxxxxxxx in the URL must be replaced with your API key. The resulting URL returns a JSON document that lists the book categories available on the nytimes.com website. The R code to get this JSON document is given below:
library(RCurl)
## Loading required package: bitops
library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
url <- "http://api.nytimes.com/svc/books/v3/lists/names.json?api-key=ae3ebf8d2c14b1623769762cea332b83:0:71863562"
web_data <- getURL(url)
categories_df <- fromJSON(web_data)
lapply(categories_df,head)
## $status
## [1] "OK"
##
## $copyright
## [1] "Copyright (c) 2015 The New York Times Company. All Rights Reserved."
##
## $num_results
## [1] 47
##
## $results
## list_name display_name
## 1 Combined Print and E-Book Fiction Combined Print & E-Book Fiction
## 2 Combined Print and E-Book Nonfiction Combined Print & E-Book Nonfiction
## 3 Hardcover Fiction Hardcover Fiction
## 4 Hardcover Nonfiction Hardcover Nonfiction
## 5 Trade Fiction Paperback Paperback Trade Fiction
## 6 Mass Market Paperback Paperback Mass-Market Fiction
## list_name_encoded oldest_published_date
## 1 combined-print-and-e-book-fiction 2011-02-13
## 2 combined-print-and-e-book-nonfiction 2011-02-13
## 3 hardcover-fiction 2008-06-08
## 4 hardcover-nonfiction 2008-06-08
## 5 trade-fiction-paperback 2008-06-08
## 6 mass-market-paperback 2008-06-08
## newest_published_date updated
## 1 2015-04-26 WEEKLY
## 2 2015-04-26 WEEKLY
## 3 2015-04-26 WEEKLY
## 4 2015-04-26 WEEKLY
## 5 2015-04-26 WEEKLY
## 6 2015-04-26 WEEKLY
The above display shows that we obtained a list of four elements (status, copyright, num_results and results). Of these, only the “results” data frame is of interest to us.
The “results” data frame has the following columns:
list_name: The name of the book category
display_name: Almost the same as list_name, formatted for display
list_name_encoded: The category name in the form to be used in the URL when retrieving the books in that category
oldest_published_date: The oldest published date available for the category
newest_published_date: The latest published date available for the category
updated: How often the list is updated (for example, WEEKLY)
The following R code replaces “categories_df” with just the results data frame:
categories_df <- categories_df$results
head(categories_df)
## list_name display_name
## 1 Combined Print and E-Book Fiction Combined Print & E-Book Fiction
## 2 Combined Print and E-Book Nonfiction Combined Print & E-Book Nonfiction
## 3 Hardcover Fiction Hardcover Fiction
## 4 Hardcover Nonfiction Hardcover Nonfiction
## 5 Trade Fiction Paperback Paperback Trade Fiction
## 6 Mass Market Paperback Paperback Mass-Market Fiction
## list_name_encoded oldest_published_date
## 1 combined-print-and-e-book-fiction 2011-02-13
## 2 combined-print-and-e-book-nonfiction 2011-02-13
## 3 hardcover-fiction 2008-06-08
## 4 hardcover-nonfiction 2008-06-08
## 5 trade-fiction-paperback 2008-06-08
## 6 mass-market-paperback 2008-06-08
## newest_published_date updated
## 1 2015-04-26 WEEKLY
## 2 2015-04-26 WEEKLY
## 3 2015-04-26 WEEKLY
## 4 2015-04-26 WEEKLY
## 5 2015-04-26 WEEKLY
## 6 2015-04-26 WEEKLY
Our main processing logic depends on the list_name_encoded column of the categories_df data frame. We will fetch the results for each category by substituting its list_name_encoded value into the following URL:
http://api.nytimes.com/svc/books/v3/lists.json?list-name=xxxxxxxxx&api-key=ae3ebf8d2c14b1623769762cea332b83:0:71863562
where xxxxxxxxx is replaced with the elements of categories_df$list_name_encoded, one at a time.
Each call returns a JSON document containing the list of all books in the specific category.
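The two URLs used in this report can be built by small helper functions, which keeps the API key out of the string literals. This is a sketch only: the function names and the NYT_API_KEY environment variable are assumptions made for illustration, not something the API or the original code defines.

```r
base_url <- "http://api.nytimes.com/svc/books/v3"

# URL that returns the list of book categories.
categories_url <- function(api_key) {
  paste0(base_url, "/lists/names.json?api-key=", api_key)
}

# URL that returns the books of one category; vectorised over the category.
category_books_url <- function(list_name_encoded, api_key) {
  paste0(base_url, "/lists.json?list-name=", list_name_encoded,
         "&api-key=", api_key)
}

# Hypothetical: read the key from an environment variable, falling back
# to the placeholder used in the text above.
api_key <- Sys.getenv("NYT_API_KEY", unset = "xxxxxxxxxx")

categories_url(api_key)
category_books_url(c("hardcover-fiction", "hardcover-nonfiction"), api_key)
```

Because paste0() is vectorised, passing the whole list_name_encoded column yields one URL per category.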
Just to test the waters, we will first fetch the books of a single category. Then we will select the required data and generalize the logic to all the categories:
url <- paste("http://api.nytimes.com/svc/books/v3/lists.json?list-name=",categories_df$list_name_encoded[1],"&api-key=ae3ebf8d2c14b1623769762cea332b83:0:71863562",sep="")
web_data <- getURL(url)
books_df <- fromJSON(web_data)$results
lapply(books_df,head)
## $list_name
## [1] "Combined Print and E-Book Fiction" "Combined Print and E-Book Fiction"
## [3] "Combined Print and E-Book Fiction" "Combined Print and E-Book Fiction"
## [5] "Combined Print and E-Book Fiction" "Combined Print and E-Book Fiction"
##
## $display_name
## [1] "Combined Print & E-Book Fiction" "Combined Print & E-Book Fiction"
## [3] "Combined Print & E-Book Fiction" "Combined Print & E-Book Fiction"
## [5] "Combined Print & E-Book Fiction" "Combined Print & E-Book Fiction"
##
## $bestsellers_date
## [1] "2015-04-11" "2015-04-11" "2015-04-11" "2015-04-11" "2015-04-11"
## [6] "2015-04-11"
##
## $published_date
## [1] "2015-04-26" "2015-04-26" "2015-04-26" "2015-04-26" "2015-04-26"
## [6] "2015-04-26"
##
## $rank
## [1] 1 2 3 4 5 6
##
## $rank_last_week
## [1] 1 6 0 3 7 10
##
## $weeks_on_list
## [1] 13 22 1 3 30 10
##
## $asterisk
## [1] 0 0 0 0 0 0
##
## $dagger
## [1] 0 0 0 0 0 0
##
## $amazon_product_url
## [1] "http://www.amazon.com/The-Girl-Train-A-Novel-ebook/dp/B00L9B7IKE?tag=thenewyorktim-20"
## [2] "http://www.amazon.com/The-Longest-Ride-Nicholas-Sparks/dp/1455520640?tag=thenewyorktim-20"
## [3] "http://www.amazon.com/Hot-Pursuit-Stone-Barrington-Book-ebook/dp/B00KRPKUQW?tag=thenewyorktim-20"
## [4] "http://www.amazon.com/The-Stranger-Harlan-Coben/dp/0525953507?tag=thenewyorktim-20"
## [5] "http://www.amazon.com/All-Light-We-Cannot-See-ebook/dp/B00DPM7TIG?tag=thenewyorktim-20"
## [6] "http://www.amazon.com/The-Nightingale-Kristin-Hannah/dp/0312577222?tag=thenewyorktim-20"
##
## $isbns
## $isbns[[1]]
## isbn10 isbn13
## 1 1594633665 9781594633669
## 2 0698185390 9780698185395
## 3 1410477762 9781410477767
## 4 0857522329 9780857522320
## 5 1448171687 9781448171682
##
## $isbns[[2]]
## isbn10 isbn13
## 1 1455520659 9781455520657
## 2 1455576018 9781455576012
## 3 1455520667 9781455520664
## 4 1455520640 9781455520640
## 5 1455520632 9781455520633
## 6 1455584738 9781455584734
## 7 145558472X 9781455584727
##
## $isbns[[3]]
## isbn10 isbn13
## 1 0399169164 9780399169168
##
## $isbns[[4]]
## isbn10 isbn13
## 1 0525953507 9780525953500
## 2 0698186206 9780698186200
## 3 1410476235 9781410476234
##
## $isbns[[5]]
## isbn10 isbn13
## 1 1476746583 9781476746586
## 2 1410470229 9781410470225
## 3 1476746591 9781476746593
##
## $isbns[[6]]
## isbn10 isbn13
## 1 0312577222 9780312577223
## 2 1466850604 9781466850606
##
##
## $book_details
## $book_details[[1]]
## title
## 1 THE GIRL ON THE TRAIN
## description
## 1 A psychological thriller set in London is full of complications and betrayals.
## contributor author contributor_note price age_group
## 1 by Paula Hawkins Paula Hawkins 0
## publisher primary_isbn13 primary_isbn10
## 1 Riverhead 9780698185395 0698185390
##
## $book_details[[2]]
## title description
## 1 THE LONGEST RIDE The lives of two couples converge unexpectedly.
## contributor author contributor_note price age_group
## 1 by Nicholas Sparks Nicholas Sparks 0
## publisher primary_isbn13 primary_isbn10
## 1 Grand Central 9781455520664 1455520667
##
## $book_details[[3]]
## title
## 1 HOT PURSUIT
## description
## 1 In the 33rd Stone Barrington novel, the New York lawyer pursues an attractive pilot and must deal with her stalker ex-boyfriend as well as intrigue in the Middle East.
## contributor author contributor_note price age_group publisher
## 1 by Stuart Woods Stuart Woods 0 Putnam
## primary_isbn13 primary_isbn10
## 1 9780698154162
##
## $book_details[[4]]
## title
## 1 THE STRANGER
## description
## 1 Characters’ lives begin to fall apart as a mysterious stranger discloses secrets to them; a stand-alone thriller.
## contributor author contributor_note price age_group publisher
## 1 by Harlan Coben Harlan Coben 0 Dutton
## primary_isbn13 primary_isbn10
## 1 9780698186200 0698186206
##
## $book_details[[5]]
## title
## 1 ALL THE LIGHT WE CANNOT SEE
## description
## 1 The lives of a blind French girl and a gadget-obsessed German boy before and during World War II.
## contributor author contributor_note price age_group
## 1 by Anthony Doerr Anthony Doerr 0
## publisher primary_isbn13 primary_isbn10
## 1 Scribner 9781476746609
##
## $book_details[[6]]
## title
## 1 THE NIGHTINGALE
## description
## 1 Two sisters are separated in World War II France: one struggling to survive in the countryside, the other joining the Resistance in Paris.
## contributor author contributor_note price age_group
## 1 by Kristin Hannah Kristin Hannah 0
## publisher primary_isbn13 primary_isbn10
## 1 St. Martin's 9781466850606 1466850604
##
##
## $reviews
## $reviews[[1]]
## book_review_link
## 1 http://www.nytimes.com/2015/01/05/books/the-girl-on-the-train-by-paula-hawkins.html
## first_chapter_link
## 1
## sunday_review_link
## 1 http://www.nytimes.com/2015/02/01/books/review/the-girl-on-the-train-by-paula-hawkins.html
## article_chapter_link
## 1
##
## $reviews[[2]]
## book_review_link first_chapter_link sunday_review_link
## 1
## article_chapter_link
## 1
##
## $reviews[[3]]
## book_review_link first_chapter_link sunday_review_link
## 1
## article_chapter_link
## 1
##
## $reviews[[4]]
## book_review_link first_chapter_link sunday_review_link
## 1
## article_chapter_link
## 1
##
## $reviews[[5]]
## book_review_link first_chapter_link
## 1
## sunday_review_link
## 1 http://www.nytimes.com/2014/05/11/books/review/all-the-light-we-cannot-see-by-anthony-doerr.html
## article_chapter_link
## 1
##
## $reviews[[6]]
## book_review_link first_chapter_link sunday_review_link
## 1
## article_chapter_link
## 1
The above output displays all the elements obtained for the “Combined Print and E-Book Fiction” category. Most of the required details are in the “book_details” list, which the following code converts to a data frame. The other details, namely “list_name” (the category), “bestsellers_date” (the ranking date), “published_date” (the published date), “rank” (the book’s rank within the category) and “amazon_product_url” (the URL of the book’s Amazon listing), are available as other elements of the list (see the display above).
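The idea of combining the nested “book_details” entries with the top-level elements can be illustrated on a small hand-built structure. The sketch below uses base R only and a toy list that imitates the shape of the parsed result (two books taken from the display above); it is an illustration under those assumptions, not the code used in this report.

```r
# Toy stand-in for the parsed result: top-level vectors plus a list of
# one-row data frames, as in the display above.
books <- list(
  list_name = rep("Combined Print and E-Book Fiction", 2),
  rank = c(1L, 2L),
  book_details = list(
    data.frame(title = "THE GIRL ON THE TRAIN", author = "Paula Hawkins",
               stringsAsFactors = FALSE),
    data.frame(title = "THE LONGEST RIDE", author = "Nicholas Sparks",
               stringsAsFactors = FALSE)
  )
)

# Stack the one-row data frames into a single data frame.
details <- do.call(rbind, books$book_details)

# Combine with the top-level elements; data.frame() keeps each column's type.
flat <- data.frame(
  Category = books$list_name,
  Rank     = books$rank,
  Title    = details$title,
  Author   = details$author,
  stringsAsFactors = FALSE
)
print(flat)
```

Building the result with data.frame() rather than cbind() keeps Rank as an integer column, whereas cbind() on mixed vectors produces a character matrix.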
Extracting all the required details for a single category and converting the data to an R data frame:
#books_df <- data.frame(books_df$book_details$title, (books_df$book_details)$author)
library(data.table)
book_details_df <- data.frame()
books_df_temp <- rbindlist(lapply(books_df$book_details,as.list))
book_details_df <- rbind(book_details_df,cbind(books_df$list_name, books_df$bestsellers_date, books_df$published_date, books_df_temp$title, books_df_temp$author, books_df$rank, books_df$amazon_product_url, books_df_temp$primary_isbn10,books_df_temp$primary_isbn13))
names(book_details_df) <- c("Category","Best_Sellers_Date","Published_Date","Title","Author", "Rank","Amazon_URL","Primary_ISBN10", "Primary_ISBN13")
head(book_details_df)
## Category Best_Sellers_Date Published_Date
## 1 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 2 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 3 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 4 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 5 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 6 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## Title Author Rank
## 1 THE GIRL ON THE TRAIN Paula Hawkins 1
## 2 THE LONGEST RIDE Nicholas Sparks 2
## 3 HOT PURSUIT Stuart Woods 3
## 4 THE STRANGER Harlan Coben 4
## 5 ALL THE LIGHT WE CANNOT SEE Anthony Doerr 5
## 6 THE NIGHTINGALE Kristin Hannah 6
## Amazon_URL
## 1 http://www.amazon.com/The-Girl-Train-A-Novel-ebook/dp/B00L9B7IKE?tag=thenewyorktim-20
## 2 http://www.amazon.com/The-Longest-Ride-Nicholas-Sparks/dp/1455520640?tag=thenewyorktim-20
## 3 http://www.amazon.com/Hot-Pursuit-Stone-Barrington-Book-ebook/dp/B00KRPKUQW?tag=thenewyorktim-20
## 4 http://www.amazon.com/The-Stranger-Harlan-Coben/dp/0525953507?tag=thenewyorktim-20
## 5 http://www.amazon.com/All-Light-We-Cannot-See-ebook/dp/B00DPM7TIG?tag=thenewyorktim-20
## 6 http://www.amazon.com/The-Nightingale-Kristin-Hannah/dp/0312577222?tag=thenewyorktim-20
## Primary_ISBN10 Primary_ISBN13
## 1 0698185390 9780698185395
## 2 1455520667 9781455520664
## 3 9780698154162
## 4 0698186206 9780698186200
## 5 9781476746609
## 6 1466850604 9781466850606
The above display confirms that we obtained the needed details for “Combined Print and E-Book Fiction”. Now we have to repeat this for all the categories. However, when I called jsonlite’s fromJSON() in a loop over the categories, I got the following error:
Quitting from lines 130-159 (MSDA_607_Week_11_Asignment.Rmd)
Error in parseJSON(txt) : lexical error: invalid character inside string.
to maximize business success. ","contributor":"by Peter H.
(right here) ------^
Calls:
The following R code threw the above error, so it is commented out. The error did not occur with the fromJSON() function from the RJSONIO package. The code that uses RJSONIO’s fromJSON() is shown after the commented-out code.
#library(data.table)
#a <- nrow(categories_df)
#book_details_df <- data.frame()
#for (i in 1:a)
#{
# Sys.sleep(1)
#
# url <- paste("http://api.nytimes.com/svc/books/v3/lists.json?list-name=",categories_df$list_name_encoded[i],"&api-key=ae3ebf8d2c14b1623769762cea332b83:0:71863562",sep="")
# web_data <- getURL(url)
#
#
# books_df <- fromJSON(web_data)$results
#
#books_df_temp <- rbindlist(lapply(books_df$book_details,as.list))
#
#book_details_df <- rbind(book_details_df,cbind(books_df$list_name, books_df$bestsellers_date, books_df$published_date, books_df_temp$title, books_df_temp$author, books_df$rank, books_df$amazon_product_url, books_df_temp$primary_isbn10,books_df_temp$primary_isbn13))
#}
#names(book_details_df) <- c("Category","Best_Sellers_Date","Published_Date","Title","Author", "Rank","Amazon_URL","Primary_ISBN10", "Primary_ISBN13")
#head(book_details_df)
#tail(book_details_df)
#book_details_df
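One way to keep a loop like the one above running, and to find out which category triggers the parse error, is to wrap each parse in tryCatch(). The sketch below shows the pattern with a toy parser standing in for fromJSON(); the helper safe_parse(), the toy parser and the category labels are all hypothetical and only illustrate the idea.

```r
# Run `parser` on `txt`; on failure, report the label and return NULL.
safe_parse <- function(txt, parser, label = "") {
  tryCatch(
    parser(txt),
    error = function(e) {
      message("Could not parse ", label, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Toy parser standing in for fromJSON(): it fails on one specific input.
toy_parser <- function(txt) {
  if (identical(txt, "bad")) stop("lexical error: invalid character inside string")
  paste("parsed:", txt)
}

# Toy inputs keyed by made-up category labels.
inputs <- c(`category-a` = "ok", `category-b` = "bad")

parsed <- lapply(names(inputs), function(n) {
  safe_parse(inputs[[n]], toy_parser, label = n)
})
names(parsed) <- names(inputs)

# Labels of the inputs that could not be parsed.
failed <- names(parsed)[vapply(parsed, is.null, logical(1))]
print(failed)
```

With the list of failed categories in hand, the raw text of just those responses can be inspected for the offending characters.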
Here is the R code that fetches the data for all books in the various categories and produces the final data frame. Note that this code uses RJSONIO instead of the jsonlite package.
library(data.table)
library(RJSONIO)
##
## Attaching package: 'RJSONIO'
##
## The following objects are masked from 'package:jsonlite':
##
## fromJSON, toJSON
df <- categories_df
books_df <- data.frame()
# Loop over every category and request its best-seller list.
for (b in seq_len(nrow(df)))
{
  url <- paste("http://api.nytimes.com/svc/books/v3/lists.json?list-name=",df$list_name_encoded[b],"&api-key=ae3ebf8d2c14b1623769762cea332b83:0:71863562",sep="")
  web_data <- getURL(url)
  # RJSONIO's fromJSON() returns nested lists, one element per ranked book.
  ls <- RJSONIO::fromJSON(web_data)$results
  books <- list()
  # Copy the required leaf nodes of each book into parallel vectors.
  for (i in seq_along(ls))
  {
    books$category[i] <- ls[[i]]$list_name
    books$bestsellers_date[i] <- ls[[i]]$bestsellers_date
    books$published_date[i] <- ls[[i]]$published_date
    books$title[i] <- ls[[i]]$book_details[[1]]$title
    books$author[i] <- ls[[i]]$book_details[[1]]$author
    books$rank[i] <- ls[[i]]$rank
    # Some books have no Amazon link; store NA in that case.
    books$amazon_link[i] <- ifelse(is.null(ls[[i]]$amazon_product_url),NA,ls[[i]]$amazon_product_url)
    books$primary_isbn10[i] <- ls[[i]]$book_details[[1]]$primary_isbn10
    books$primary_isbn13[i] <- ls[[i]]$book_details[[1]]$primary_isbn13
  }
  Sys.sleep(1)  # pause one second between requests
  books_df <- rbind(books_df,data.frame(books))
}
names(books_df) <- c("Category","Best_Sellers_Date","Published_Date","Title","Author", "Rank","Amazon_URL","Primary_ISBN10", "Primary_ISBN13")
head(books_df)
## Category Best_Sellers_Date Published_Date
## 1 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 2 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 3 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 4 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 5 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## 6 Combined Print and E-Book Fiction 2015-04-11 2015-04-26
## Title Author Rank
## 1 THE GIRL ON THE TRAIN Paula Hawkins 1
## 2 THE LONGEST RIDE Nicholas Sparks 2
## 3 HOT PURSUIT Stuart Woods 3
## 4 THE STRANGER Harlan Coben 4
## 5 ALL THE LIGHT WE CANNOT SEE Anthony Doerr 5
## 6 THE NIGHTINGALE Kristin Hannah 6
## Amazon_URL
## 1 http://www.amazon.com/The-Girl-Train-A-Novel-ebook/dp/B00L9B7IKE?tag=thenewyorktim-20
## 2 http://www.amazon.com/The-Longest-Ride-Nicholas-Sparks/dp/1455520640?tag=thenewyorktim-20
## 3 http://www.amazon.com/Hot-Pursuit-Stone-Barrington-Book-ebook/dp/B00KRPKUQW?tag=thenewyorktim-20
## 4 http://www.amazon.com/The-Stranger-Harlan-Coben/dp/0525953507?tag=thenewyorktim-20
## 5 http://www.amazon.com/All-Light-We-Cannot-See-ebook/dp/B00DPM7TIG?tag=thenewyorktim-20
## 6 http://www.amazon.com/The-Nightingale-Kristin-Hannah/dp/0312577222?tag=thenewyorktim-20
## Primary_ISBN10 Primary_ISBN13
## 1 0698185390 9780698185395
## 2 1455520667 9781455520664
## 3 9780698154162
## 4 0698186206 9780698186200
## 5 9781476746609
## 6 1466850604 9781466850606
tail(books_df)
## Category Best_Sellers_Date Published_Date
## 721 Travel 2015-03-28 2015-04-12
## 722 Travel 2015-03-28 2015-04-12
## 723 Travel 2015-03-28 2015-04-12
## 724 Travel 2015-03-28 2015-04-12
## 725 Travel 2015-03-28 2015-04-12
## 726 Travel 2015-03-28 2015-04-12
## Title
## 721 1,000 PLACES TO SEE BEFORE YOU DIE
## 722 TRACKS
## 723 UNTAMED
## 724 KOOK
## 725 HOW TO TRAVEL THE WORLD ON $50 A DAY
## 726 HOW TO BE PARISIAN WHEREVER YOU ARE
## Author Rank
## 721 Patricia Schultz 10
## 722 Ron Davidson 11
## 723 Will Harlan 12
## 724 Peter Heller 13
## 725 Matt Kepnes 14
## 726 Anne Berest, Audrey Diwan, Caroline de Maigret and Sophie Mas 15
## Amazon_URL
## 721 http://www.amazon.com/000-Places-See-Before-second/dp/0761156860?tag=thenewyorktim-20
## 722 <NA>
## 723 http://www.amazon.com/Untamed-Wildest-America-Cumberland-Island/dp/0802122582?tag=thenewyorktim-20
## 724 http://www.amazon.com/Kook-Surfing-Taught-Catching-Perfect/dp/0743294203?tag=thenewyorktim-20
## 725 http://www.amazon.com/How-Travel-World-50-Day/dp/0399173285?tag=thenewyorktim-20
## 726 http://www.amazon.com/How-Parisian-Wherever-You-Are/dp/0385538650?tag=thenewyorktim-20
## Primary_ISBN10 Primary_ISBN13
## 721 0761156860 9780761156864
## 722 148045267X 9781480452671
## 723 0802123856 9780802123855
## 724 0743294203 9780743294201
## 725 0399173285 9780399173288
## 726 0385538650 9780385538657
The above code segments successfully parse the details of all the top-ranked books in the various categories from the nytimes.com website. The jsonlite package is excellent for parsing JSON data, but its fromJSON() function failed to parse some of the data returned by the Books API. So I had to use RJSONIO to process the Books JSON data. RJSONIO’s fromJSON() returns a list of lists, so I had to use a “for” loop to walk through the required JSON leaf nodes. The final data frame “books_df” contains the details of all the books, along with a link to each book’s Amazon page. This program can be enhanced further to retrieve price information and book ratings from amazon.com.
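The walk over the nested list can also be written without an explicit “for” loop, using vapply() and a small helper that returns NA when a field is absent. This is a sketch on a hand-built toy list that imitates the list-of-lists shape described above (values taken from the tail of the final data frame); the helper name field_or_na() is hypothetical.

```r
# Return the named field as a string, or NA when the field is absent.
field_or_na <- function(x, name) {
  value <- x[[name]]
  if (is.null(value)) NA_character_ else as.character(value)
}

# Toy stand-in for the parsed results: the first book has no Amazon link.
results <- list(
  list(rank = 11, book_details = list(list(title = "TRACKS"))),
  list(rank = 12,
       amazon_product_url = "http://www.amazon.com/Untamed-Wildest-America-Cumberland-Island/dp/0802122582?tag=thenewyorktim-20",
       book_details = list(list(title = "UNTAMED")))
)

# One vapply() call per output column; each returns one value per book.
books <- data.frame(
  Title = vapply(results,
                 function(r) field_or_na(r$book_details[[1]], "title"),
                 character(1)),
  Rank = vapply(results, function(r) as.integer(r$rank), integer(1)),
  Amazon_URL = vapply(results, field_or_na, character(1),
                      name = "amazon_product_url"),
  stringsAsFactors = FALSE
)
print(books)
```

vapply() checks that every element yields exactly one value of the declared type, so a malformed record fails loudly instead of silently misaligning the columns.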
-~-End of Project Report-~-