In this assignment, I read in raw data using the Books API from the New York Times – specifically, the list of best-sellers – and then performed some basic cleaning on the dataset.
I lost some data in the parsing and cleaning stages – I was not able to get the target age group or price of each book. However, the process of reading in data using an API – and cleaning the results – was certainly valuable.
library(httr)
library(jsonlite)
library(stringr)
library(tidyr)
library(dplyr)
library(knitr) # To create responsive HTML tables
library(kableExtra) # To create responsive HTML tables
I got the data from the NY Times API using the httr package’s GET function. I parsed the response, converted it to JSON, and then read that JSON into an R data frame, flattening the nested fields along the way.
# NYTimes API link to Books
url <- "https://api.nytimes.com/svc/books/v3/lists/best-sellers/history.json"
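# Get raw data; the key is read from an environment variable rather than hard-coded
# (the variable name NYT_API_KEY is an assumption; set it before running)
h <- GET(url, query = list(api_key = Sys.getenv("NYT_API_KEY")))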
# Parse the response body into nested R lists
j <- content(h, as = "parsed")
# Convert the contents into a JSON format
k <- toJSON(j, pretty = TRUE)
# Convert from a JSON into a dataframe
m <- data.frame(fromJSON(k, flatten = TRUE), stringsAsFactors = FALSE)
head(m, 1) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
status | copyright | num_results | results.title | results.description | results.contributor | results.author | results.contributor_note | results.price | results.age_group | results.publisher | results.isbns | results.ranks_history | results.reviews |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OK | Copyright (c) 2018 The New York Times Company. All Rights Reserved. | 31192 | “I GIVE YOU MY BODY …” | The author of the Outlander novels gives tips on writing sex scenes, drawing on examples from the books. | by Diana Gabaldon | Diana Gabaldon |  | 0 |  | Dell | list(isbn10 = list(“0399178570”), isbn13 = list(“9780399178573”)) | list(primary_isbn10 = list(“0399178570”), primary_isbn13 = list(“9780399178573”), rank = list(8), list_name = list(“Advice How-To and Miscellaneous”), display_name = list(“Advice, How-To & Miscellaneous”), published_date = list(“2016-09-04”), bestsellers_date = list(“2016-08-20”), weeks_on_list = list(1), asterisk = list(0), dagger = list(0)) | list(book_review_link = list(“”), first_chapter_link = list(“”), sunday_review_link = list(“”), article_chapter_link = list(“”)) |
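The list(...) wrappers in the table above most likely come from the round trip through toJSON: the parsed response is a set of nested R lists, and re-serializing them turns every single value into a one-element array. A minimal sketch of an alternative, assuming the response body is available as text, is to hand the JSON text straight to fromJSON. The sample_json string below is a hypothetical, heavily trimmed stand-in for the real response, not actual API output.

```r
library(jsonlite)

# Hypothetical, trimmed stand-in for the API response body; the real
# response has more fields per book and more results per page.
sample_json <- '{
  "status": "OK",
  "num_results": 2,
  "results": [
    {"title": "BOOK A", "author": "Author One", "price": 0, "age_group": "",
     "publisher": "Dell",
     "isbns": [{"isbn10": "0399178570", "isbn13": "9780399178573"}]},
    {"title": "BOOK B", "author": "Author Two", "price": 0, "age_group": "",
     "publisher": "Liveright",
     "isbns": [{"isbn10": "0871404427", "isbn13": "9780871404428"}]}
  ]
}'

# With a live response, this text would come from something like
# content(h, as = "text", encoding = "UTF-8").
parsed <- fromJSON(sample_json, flatten = TRUE)

# The "results" array is simplified to a data frame with plain character
# columns; the nested "isbns" array stays as a list column of data frames.
books <- parsed$results
str(books[, c("title", "author", "publisher")])
```

Under this approach the text columns would not need the later coercion to character, although the isbns column would still need unpacking.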
The previous step read in most of the columns with an “Unknown” class, so I first removed the unnecessary columns and then coerced the title, description, contributor, and publisher columns to character. The ISBN column is handled separately in a later step.
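# Remove columns with empty or redundant data: status, copyright, num_results,
# contributor_note, price, age_group, ranks_history, and reviews
n <- m[, -c(1, 2, 3, 8, 9, 10, 13, 14)]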
# Coerce columns into a character
n$results.title <- as.character(n$results.title)
n$results.description <- as.character(n$results.description)
n$results.contributor <- as.character(n$results.contributor)
n$results.publisher <- as.character(n$results.publisher)
head(n, 2) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
results.title | results.description | results.contributor | results.author | results.publisher | results.isbns |
---|---|---|---|---|---|
“I GIVE YOU MY BODY …” | The author of the Outlander novels gives tips on writing sex scenes, drawing on examples from the books. | by Diana Gabaldon | Diana Gabaldon | Dell | list(isbn10 = list(“0399178570”), isbn13 = list(“9780399178573”)) |
“MOST BLESSED OF THE PATRIARCHS” | A character study that attempts to make sense of Jefferson’s contradictions. | by Annette Gordon-Reed and Peter S. Onuf | Annette Gordon-Reed and Peter S Onuf | Liveright | list(isbn10 = list(“0871404427”), isbn13 = list(“9780871404428”)) |
The data had many cells that were either empty lists, shown as list(), or values wrapped in a list, like the ISBN column. I removed this list syntax using the str_replace_all function.
n$results.title <- str_replace_all(n$results.title, "list\\(\\)", "")
n$results.description <- str_replace_all(n$results.description, "list\\(\\)", "")
n$results.contributor <- str_replace_all(n$results.contributor, "list\\(\\)", "")
n$results.publisher <- str_replace_all(n$results.publisher, "list\\(\\)", "")
n$results.isbns <- str_replace_all(n$results.isbns, "list\\(", "")
n$results.isbns <- str_replace_all(n$results.isbns, "\\)", "")
head(n, 2) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
results.title | results.description | results.contributor | results.author | results.publisher | results.isbns |
---|---|---|---|---|---|
“I GIVE YOU MY BODY …” | The author of the Outlander novels gives tips on writing sex scenes, drawing on examples from the books. | by Diana Gabaldon | Diana Gabaldon | Dell | isbn10 = “0399178570”, isbn13 = “9780399178573” |
“MOST BLESSED OF THE PATRIARCHS” | A character study that attempts to make sense of Jefferson’s contradictions. | by Annette Gordon-Reed and Peter S. Onuf | Annette Gordon-Reed and Peter S Onuf | Liveright | isbn10 = “0871404427”, isbn13 = “9780871404428” |
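The four near-identical str_replace_all calls for the text columns could also be written as one pass over a vector of column names. The following is a minimal sketch under that assumption; n_demo is a hypothetical stand-in for the real data frame, and fixed() is used so the literal text list() needs no regex escaping.

```r
library(stringr)

# Hypothetical two-row stand-in for the data frame n after coercion to character.
n_demo <- data.frame(
  results.title = c("BOOK A", "list()"),
  results.description = c("list()", "A description."),
  results.publisher = c("Dell", "list()"),
  stringsAsFactors = FALSE
)

# Columns that should have the empty-list marker stripped.
text_cols <- c("results.title", "results.description", "results.publisher")

# Apply the same literal replacement to every listed column.
n_demo[text_cols] <- lapply(
  n_demo[text_cols],
  function(col) str_replace_all(col, fixed("list()"), "")
)
```

Adding another column to the cleanup then only means adding its name to text_cols.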
Finally, I cleaned up the ISBN columns and coerced them to numeric. In retrospect, coercing the ISBNs was probably not helpful: ISBNs can start with a zero, and the output below shows the leading zero dropped from the ISBN-10 values and the ISBN-13 values printed in scientific notation. It would have been better to leave them as strings.
p <- n %>% separate(results.isbns, c("isbn10", "isbn13"), sep = ",")
p$isbn10 <- str_replace_all(p$isbn10, "isbn10[[:space:]]\\=[[:space:]]", "")
p$isbn13 <- str_replace_all(p$isbn13, "isbn13[[:space:]]\\=[[:space:]]", "")
p$isbn10 <- str_replace_all(p$isbn10, "[[:punct:]]", "")
p$isbn13 <- str_replace_all(p$isbn13, "[[:punct:]]", "")
p$isbn10 <- as.numeric(p$isbn10)
p$isbn13 <- as.numeric(p$isbn13)
names(p) <- c("title", "description", "contributor", "author", "publisher", "isbn10", "isbn13")
head(p, 3) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
title | description | contributor | author | publisher | isbn10 | isbn13 |
---|---|---|---|---|---|---|
“I GIVE YOU MY BODY …” | The author of the Outlander novels gives tips on writing sex scenes, drawing on examples from the books. | by Diana Gabaldon | Diana Gabaldon | Dell | 399178570 | 9.780399e+12 |
“MOST BLESSED OF THE PATRIARCHS” | A character study that attempts to make sense of Jefferson’s contradictions. | by Annette Gordon-Reed and Peter S. Onuf | Annette Gordon-Reed and Peter S Onuf | Liveright | 871404427 | 9.780871e+12 |
#ASKGARYVEE | The entrepreneur expands on subjects addressed on his Internet show, like marketing, management and social media. | by Gary Vaynerchuk | Gary Vaynerchuk | HarperCollins | 62273124 | 6.227313e+07 |
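As a rough sketch of the string-based alternative, the ISBNs can be pulled out with a capture group and left as character. The isbn_raw vector is a hypothetical stand-in shaped like the cleaned ISBN column shown earlier, and base R's sub is used here in place of the separate and str_replace_all steps. It assumes exactly one ISBN-10 and one ISBN-13 per book; the third row of the table above has an isbn13 value far too short for a 13-digit ISBN, which suggests some records have a different shape that neither this sketch nor the comma split handles.

```r
# Hypothetical ISBN strings in the shape produced by the list() cleanup.
isbn_raw <- c(
  'isbn10 = "0399178570", isbn13 = "9780399178573"',
  'isbn10 = "0871404427", isbn13 = "9780871404428"'
)

# Capture the quoted value after each label and keep it as character.
isbn10 <- sub('.*isbn10 = "([^"]*)".*', "\\1", isbn_raw)
isbn13 <- sub('.*isbn13 = "([^"]*)".*', "\\1", isbn_raw)

# The strings keep all ten characters, while a numeric round trip
# loses the leading zero and comes back one character shorter.
nchar(isbn10)
nchar(format(as.numeric(isbn10), scientific = FALSE))
```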
I ended up with a list of 20 best-selling books, along with their descriptions, contributors, authors, publishers, and ISBNs.
As a next step, I would attempt to edit the script so that I could keep the data on target age group and price, which would be interesting to analyze.
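One possible starting point, sketched under assumptions: the price and age group columns sit at positions 9 and 10 of the flattened data frame, which the column-removal step drops by index. Selecting columns by name instead would keep them. The m_demo data frame below is a hypothetical one-row stand-in that borrows only its column names from the output shown earlier. In that output the price appears to be 0 and the age group empty, so these fields may turn out to be sparsely populated.

```r
# Hypothetical one-row stand-in for the flattened data frame m.
m_demo <- data.frame(
  status = "OK",
  results.title = "BOOK A",
  results.author = "Author One",
  results.contributor_note = "",
  results.price = 0,
  results.age_group = "",
  results.publisher = "Dell",
  stringsAsFactors = FALSE
)

# Select by name instead of negative position so price and age group survive.
keep <- c("results.title", "results.author", "results.publisher",
          "results.price", "results.age_group")
n_demo <- m_demo[, keep]

# Strip the "results." prefix for shorter column names.
names(n_demo) <- sub("^results\\.", "", names(n_demo))
```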