The New York Times website provides a rich set of APIs, as described here: https://developer.nytimes.com/apis. You'll need to start by signing up for an API key. Your task is to choose one of the New York Times APIs, construct an interface in Python to read in the JSON data, and transform it into a pandas DataFrame.
Important Note:
This assignment was done using R instead of Python.
I decided to pick the Books API to see which books are the best sellers according to the New York Times ranking. After reading the API documentation, I picked the sample call endpoint and issued a GET request against it.
First, I signed up for an NYT API account and created an app, which generated the API key shown below. Second, I passed this key to GET() from the httr package to retrieve this week's best-selling books from the site. After that, I stored the parsed JSON in a variable called response.
## [1] "BRnDpUYaARf4t6wEHdpdRQQQw7s48bMf"
library(httr)

# GET request to the Books API overview endpoint (all best-seller lists for the week)
nyt_book_list <- GET("https://api.nytimes.com/svc/books/v3/lists/overview.json?api-key=BRnDpUYaARf4t6wEHdpdRQQQw7s48bMf")
nyt_book_list
## Response [https://api.nytimes.com/svc/books/v3/lists/overview.json?api-key=BRnDpUYaARf4t6wEHdpdRQQQw7s48bMf]
## Date: 2019-10-23 00:59
## Status: 200
## Content-Type: application/json; charset=UTF-8
## Size: 130 kB
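The status summary and the "list" class printed below come from checking and parsing the response; a minimal sketch of that intermediate step, assuming httr's http_status() and content() helpers:
# Summarize the HTTP status, then parse the JSON body into an R list
http_status(nyt_book_list)
response <- content(nyt_book_list) # parsed list used in the loop below
class(response)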
## $category
## [1] "Success"
##
## $reason
## [1] "OK"
##
## $message
## [1] "Success: (200) OK"
## [1] "list"
The resulting output is a list of list names (book genres) with the book details nested inside the books attribute. To pull out the book details needed for this assignment, I constructed a for-loop to extract each book's title, author, publisher, and rank. Finally, I stored the extracted data in a new data frame ready for further analysis.
author <- vector()
publisher <- vector()
title <- vector()
rank <- vector()
list_name <- vector()

# Loop through the 18 genre lists and the top 5 books on each
for (i in 1:18) {
  for (j in 1:5) {
    list_name <- c(response$results$lists[[i]]$list_name, list_name)
    title <- c(response$results$lists[[i]]$books[[j]]$title, title)
    author <- c(response$results$lists[[i]]$books[[j]]$author, author)
    publisher <- c(response$results$lists[[i]]$books[[j]]$publisher, publisher)
    rank <- c(as.numeric(response$results$lists[[i]]$books[[j]]$rank), rank)
  }
}
# Create a data frame from the extracted vectors
books_df <- data.frame(
  list_name = list_name,
  title = title,
  author = author,
  publisher = publisher,
  rank = rank
)
books_df
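As an aside, the same flattening can be done without an explicit for-loop; here is a minimal base-R sketch (assuming the same parsed response list, with books_df2 as an illustrative name):
# Alternative: flatten the nested list into a data frame without a loop
rows <- lapply(response$results$lists, function(lst) {
  do.call(rbind, lapply(lst$books, function(b) {
    data.frame(
      list_name = lst$list_name,
      title = b$title,
      author = b$author,
      publisher = b$publisher,
      rank = as.numeric(b$rank),
      stringsAsFactors = FALSE
    )
  }))
})
books_df2 <- do.call(rbind, rows)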
Please be sure to set the working directory in RStudio to the project directory before writing the file out.
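The data frame was evidently written to a CSV and pushed to GitHub before being read back below; a minimal sketch of that step, assuming write.csv() and the file name that appears in the GitHub URL:
# Save the best-seller table as a CSV so it can be committed to GitHub
write.csv(books_df, "bestNYTbooks.csv", row.names = FALSE)
file.exists("bestNYTbooks.csv") # sanity check that the file was created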
We can then read the file back from GitHub:
url <- 'https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/week_9/bestNYTbooks.csv'

# Read the CSV at the URL into a data frame
bestNYT_books_df <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
bestNYT_books_df
I was curious to find out which publisher appears most frequently among the NYT best-selling books.
library(dplyr)

by_publisher <- books_df %>%
  group_by(publisher) %>%
  summarize(total = n()) # how many books per publisher
by_publisher
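To make the most frequent publishers easy to spot, the counts can also be sorted in descending order; a small sketch using dplyr's arrange():
# Show publishers with the most best-seller entries first
by_publisher %>%
  arrange(desc(total))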
library(ggplot2)

# Keep only publishers with more than two books on the lists
freq_publisher <- by_publisher %>%
  filter(total > 2)

ggplot(data = freq_publisher, aes(x = publisher, y = total)) +
  geom_bar(stat = "identity") +
  xlab("Publisher Name") +
  ylab("Number of top books per publisher") +
  ggtitle("NY Times Best Seller book publisher") +
  theme(legend.position = "bottom")
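With several long publisher names on the x axis, the labels can collide; one optional tweak (an assumption, not in the original chunk) is to rotate them with a theme setting:
# Rotate x-axis labels 45 degrees so long publisher names stay readable
last_plot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))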
It seems that the top publisher is Little Brown with 8 books this week, followed by Scholastic with 6 books. Putnam and Random House have the same count of 5 books each, while Random House Audio has only 4.