Background

In this assignment, I read in raw data using the Books API from the New York Times – specifically, the list of best-sellers – and then performed some basic cleaning on the dataset.

I lost some data in the parsing and cleaning stages – I was not able to get the target age group or price of each book. However, the process of reading in data using an API – and cleaning the results – was certainly valuable.


Libraries

library(httr)
library(jsonlite)
library(stringr)
library(tidyr)
library(dplyr)
library(knitr)       # To create responsive HTML tables
library(kableExtra)  # To create responsive HTML tables



1. Get Data

I requested the data from the NY Times API with the httr package's GET function. I then parsed the response, converted it to JSON, read that JSON into an R data frame, and flattened the nested fields to remove layers.

# NYTimes API link to Books
url <- "https://api.nytimes.com/svc/books/v3/lists/best-sellers/history.json"

# Get raw data using an API key read from the environment
# (NYT_API_KEY is an assumed variable name; set it before running)
h <- GET(url, query = list(api_key = Sys.getenv("NYT_API_KEY")))

# Parse the contents of the data
j <- content(h, "parse")

# Convert the contents into a JSON format
k <- toJSON(j, pretty = TRUE)

# Convert from a JSON into a dataframe
m <- data.frame(fromJSON(k, flatten = TRUE), stringsAsFactors = FALSE)

head(m, 1) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
status: OK
copyright: Copyright (c) 2018 The New York Times Company. All Rights Reserved.
num_results: 31192
results.title: “I GIVE YOU MY BODY …”
results.description: The author of the Outlander novels gives tips on writing sex scenes, drawing on examples from the books.
results.contributor: by Diana Gabaldon
results.author: Diana Gabaldon
results.contributor_note: (empty)
results.price: 0
results.age_group: (empty)
results.publisher: Dell
results.isbns: list(isbn10 = list("0399178570"), isbn13 = list("9780399178573"))
results.ranks_history: list(primary_isbn10 = list("0399178570"), primary_isbn13 = list("9780399178573"), rank = list(8), list_name = list("Advice How-To and Miscellaneous"), display_name = list("Advice, How-To & Miscellaneous"), published_date = list("2016-09-04"), bestsellers_date = list("2016-08-20"), weeks_on_list = list(1), asterisk = list(0), dagger = list(0))
results.reviews: list(book_review_link = list(""), first_chapter_link = list(""), sunday_review_link = list(""), article_chapter_link = list(""))
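
The round trip through toJSON and back is not strictly required to reach a data frame. As a minimal sketch, assuming a small hand-written JSON string (sample_json is hypothetical and far smaller than the real response), fromJSON with flatten = TRUE can parse the text directly. With a real response, the text could come from httr's content(h, as = "text", encoding = "UTF-8").

```r
library(jsonlite)

# Hypothetical stand-in for the API response body (not real API output)
sample_json <- '{
  "status": "OK",
  "num_results": 2,
  "results": [
    {"title": "BOOK A", "publisher": "Dell",
     "isbns": [{"isbn10": "0399178570", "isbn13": "9780399178573"}]},
    {"title": "BOOK B", "publisher": "Liveright",
     "isbns": [{"isbn10": "0871404427", "isbn13": "9780871404428"}]}
  ]
}'

# Parse the JSON text; the "results" array becomes a data frame
parsed <- fromJSON(sample_json, flatten = TRUE)
books <- parsed$results

# Inspect the column types without printing every nested value
str(books, max.level = 1)
```

Nested arrays such as isbns stay as list columns here, because flatten = TRUE only spreads nested data frames into ordinary columns.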

2. Coerce Data as Character

The previous step read in most of the columns as lists rather than plain vectors, so I first removed the unnecessary columns by position and then coerced the title, description, contributor, and publisher columns to character. I left the ISBN column for a later step.

# Remove columns with empty or redundant data
n <- m[, -c(1,2,3,8,9,10,13,14)]

# Coerce columns into a character

n$results.title <- as.character(n$results.title)

n$results.description <- as.character(n$results.description)

n$results.contributor <- as.character(n$results.contributor)

n$results.publisher <- as.character(n$results.publisher)

head(n, 2) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| results.title | results.description | results.contributor | results.author | results.publisher | results.isbns |
| --- | --- | --- | --- | --- | --- |
| “I GIVE YOU MY BODY …” | The author of the Outlander novels gives tips on writing sex scenes, drawing on examples from the books. | by Diana Gabaldon | Diana Gabaldon | Dell | list(isbn10 = list("0399178570"), isbn13 = list("9780399178573")) |
| “MOST BLESSED OF THE PATRIARCHS” | A character study that attempts to make sense of Jefferson’s contradictions. | by Annette Gordon-Reed and Peter S. Onuf | Annette Gordon-Reed and Peter S Onuf | Liveright | list(isbn10 = list("0871404427"), isbn13 = list("9780871404428")) |
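
The repeated as.character calls could also be written as one step over a vector of column names. This is only a sketch on a made-up data frame (toy and its values are hypothetical), but it also shows where the literal text list() comes from when an empty list is coerced.

```r
# Toy data frame with list columns, standing in for the API data
toy <- data.frame(id = 1:2)
toy$title <- list("BOOK A", "BOOK B")
toy$publisher <- list("Dell", list())  # second publisher is an empty list

# Coerce several list columns to character in one step
cols <- c("title", "publisher")
toy[cols] <- lapply(toy[cols], as.character)

# The empty list is deparsed into the string "list()"
print(toy$publisher)
```

Listing the columns by name keeps the coercion in one place, so adding another column means changing only the cols vector.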



3. Remove Empty Lists

The data had many cells that were either empty lists, shown as list(), or values wrapped in a list, like the ISBN column. I removed these list wrappers using the str_replace_all function from stringr.

n$results.title <- str_replace_all(n$results.title, "list\\(\\)", "")

n$results.description <- str_replace_all(n$results.description, "list\\(\\)", "")

n$results.contributor <- str_replace_all(n$results.contributor, "list\\(\\)", "")

n$results.publisher <- str_replace_all(n$results.publisher, "list\\(\\)", "")

n$results.isbns <- str_replace_all(n$results.isbns, "list\\(", "")

n$results.isbns <- str_replace_all(n$results.isbns, "\\)", "")

head(n, 2) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| results.title | results.description | results.contributor | results.author | results.publisher | results.isbns |
| --- | --- | --- | --- | --- | --- |
| “I GIVE YOU MY BODY …” | The author of the Outlander novels gives tips on writing sex scenes, drawing on examples from the books. | by Diana Gabaldon | Diana Gabaldon | Dell | isbn10 = "0399178570", isbn13 = "9780399178573" |
| “MOST BLESSED OF THE PATRIARCHS” | A character study that attempts to make sense of Jefferson’s contradictions. | by Annette Gordon-Reed and Peter S. Onuf | Annette Gordon-Reed and Peter S Onuf | Liveright | isbn10 = "0871404427", isbn13 = "9780871404428" |
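
To make the two ISBN replacements easier to follow, here is a small sketch that applies the same patterns to a single string copied from the earlier output. The variable names (raw, no_open, cleaned) are my own for illustration.

```r
library(stringr)

# One ISBN cell as it looks after coercion to character
raw <- 'list(isbn10 = list("0399178570"), isbn13 = list("9780399178573"))'

# Step 1: drop every opening "list(" wrapper
no_open <- str_replace_all(raw, "list\\(", "")

# Step 2: drop every remaining closing parenthesis
cleaned <- str_replace_all(no_open, "\\)", "")

print(cleaned)

# fixed() matches the pattern literally, so no escaping is needed
emptied <- str_replace_all("list()", fixed("list()"), "")
```

The second pattern removes every closing parenthesis in the cell, which is safe for ISBNs but would also strip parentheses from free text such as a description.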



4. Clean ISBN Columns

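The last step was cleaning up the ISBN columns and coercing them to numeric. In retrospect, coercing the ISBNs was probably not helpful, since ISBNs can start with a zero, and it would have been better to leave them as strings. The output below bears this out: 0399178570 is displayed as 399178570 once its leading zero is dropped, and the 13-digit values are displayed in scientific notation.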

p <- n %>% separate(results.isbns, c("isbn10", "isbn13"), sep = ",")

p$isbn10 <- str_replace_all(p$isbn10, "isbn10[[:space:]]\\=[[:space:]]", "")

p$isbn13 <- str_replace_all(p$isbn13, "isbn13[[:space:]]\\=[[:space:]]", "")

p$isbn10 <- str_replace_all(p$isbn10, "[[:punct:]]", "")

p$isbn13 <- str_replace_all(p$isbn13, "[[:punct:]]", "")

p$isbn10 <- as.numeric(p$isbn10)

p$isbn13 <- as.numeric(p$isbn13)

names(p) <- c("title", "description", "contributor", "author", "publisher", "isbn10", "isbn13")


head(p, 3) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| title | description | contributor | author | publisher | isbn10 | isbn13 |
| --- | --- | --- | --- | --- | --- | --- |
| “I GIVE YOU MY BODY …” | The author of the Outlander novels gives tips on writing sex scenes, drawing on examples from the books. | by Diana Gabaldon | Diana Gabaldon | Dell | 399178570 | 9.780399e+12 |
| “MOST BLESSED OF THE PATRIARCHS” | A character study that attempts to make sense of Jefferson’s contradictions. | by Annette Gordon-Reed and Peter S. Onuf | Annette Gordon-Reed and Peter S Onuf | Liveright | 871404427 | 9.780871e+12 |
| #ASKGARYVEE | The entrepreneur expands on subjects addressed on his Internet show, like marketing, management and social media. | by Gary Vaynerchuk | Gary Vaynerchuk | HarperCollins | 62273124 | 6.227313e+07 |
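
The effect of numeric coercion on a leading zero can be seen on a single value. The sketch below, using base R only and variable names of my own choosing, pulls the quoted digits out of one cell and compares the string with its numeric form.

```r
# One ISBN-10 cell as it appears after the list wrappers are removed
cell <- 'isbn10 = "0399178570"'

# Keep only the quoted identifier (digits, or a trailing X check character)
isbn10_chr <- sub('.*"([0-9Xx]+)".*', "\\1", cell)

# Coercing to numeric silently drops the leading zero
isbn10_num <- as.numeric(isbn10_chr)

print(nchar(isbn10_chr))                 # length of the string form
print(nchar(as.character(isbn10_num)))   # length after numeric coercion
```

Keeping the identifier as a string also avoids a second problem: an ISBN-10 may end in the check character X, which as.numeric would turn into NA.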


I ended up with a list of 20 best-selling books, along with their descriptions, contributors, authors, publishers, and ISBNs.

As a next step, I would attempt to edit the script so that I could keep the data on target age group and price, which would be interesting to analyze.
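
One possible starting point, sketched here on a made-up data frame (toy and all of its values are hypothetical, not API output), is to select columns by name rather than by position, so that price and age group are kept deliberately and blank age groups are marked as missing.

```r
# Toy stand-in for the flattened API data frame (values are invented)
toy <- data.frame(
  status = c("OK", "OK"),
  results.title = c("BOOK A", "BOOK B"),
  results.price = c(0, 17.99),
  results.age_group = c("", "Ages 9 to 12"),
  results.reviews = c("x", "y"),
  stringsAsFactors = FALSE
)

# Name the columns to keep instead of dropping others by index
keep <- c("results.title", "results.price", "results.age_group")
kept <- toy[, keep]

# Treat blank age groups as missing so they are easy to count or filter
kept$results.age_group[kept$results.age_group == ""] <- NA

print(kept)
```

Whether a price of 0 should also count as missing is a separate judgment call that this sketch leaves open.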