Data Sources

The first dataset, in CSV format posted on Kaggle, was originally scraped from Goodreads’ lists of the Top 100 books for each year from 1980 to 2023. This CSV includes genres, Goodreads readership numbers and user-provided ratings.

The second source of data is the NYTimes Books API. The key information here is the number of weeks that successful books appeared on bestseller lists and their rankings, in addition to the ISBNs and author names, which will be used to join the data to the Goodreads information.

Abstract

This analysis will explore whether certain attributes of books or authors correlate significantly with higher ratings or greater numbers of reviews on Goodreads, or higher rankings and longer runs on the New York Times weekly bestseller lists. After cleaning, transforming and tidying the datasets and joining on author, title and/or ISBN, the data will be used for EDA and visualization. The end-goal is to create dataframes that can provide insight into book and author popularity for recommendation and prediction.

Acquisition & Cleaning

The Goodreads CSV is imported and subset for variables of interest.

raw_gr <- read.csv("https://raw.githubusercontent.com/stephbc/DATA607_Fall2024/refs/heads/final_proj/goodreads_top100.csv")

# select relevant columns
gr <- raw_gr |>
  select(
    "isbn",
     "title",
     "authors",
     "language",
     "format", 
     "genres",
     "publication_date",
     "rating_score",
     "num_ratings")

Since the ISBN column will be used as the foreign key for each book, any observations with blank values or duplicate ISBNs must be removed. Also, the focus in this study will be on books published in the 10 year span from 2014-2023. Furthermore, since the NY Times API will only return bestselling books in the US and defaults to hardcover fiction, the following filtering can be applied to the Goodreads data to preliminarily screen for relevant results:

gr <- gr |>
  mutate(isbn = na_if(isbn, "")) |>
  drop_na(isbn) |>
  distinct(isbn, .keep_all = TRUE) |>
  filter(language == "English", format == "Hardcover") |>
  select(!c(language, format)) |>
  filter(!grepl("Nonfiction", genres))

gr$publication_date <- year(mdy(gr$publication_date))
gr <- gr |> filter(publication_date > "2013")

knitr::kable(head(gr))
isbn title authors genres publication_date rating_score num_ratings
9780425256220 Magic Breaks Ilona Andrews [‘Urban Fantasy’, ‘Fantasy’, ‘Paranormal’, ‘Romance’, ‘Magic’, ‘Vampires’, ‘Paranormal Romance’] 2014 4.49 50284
9781442468351 Lady Midnight Cassandra Clare [‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Paranormal’, ‘Urban Fantasy’, ‘Young Adult Fantasy’, ‘Fiction’] 2016 4.36 226041
9780765376473 A Gathering of Shadows V.E. Schwab [‘Fantasy’, ‘Fiction’, ‘Young Adult’, ‘Magic’, ‘Adult’, ‘Audiobook’, ‘Romance’] 2016 4.22 191290
9781451686630 The Passenger Lisa Lutz [‘Mystery’, ‘Fiction’, ‘Thriller’, ‘Mystery Thriller’, ‘Suspense’, ‘Audiobook’, ‘Crime’] 2016 3.72 42321
9780062457790 They Both Die at the End Adam Silvera [‘Young Adult’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Contemporary’, ‘Queer’, ‘Audiobook’] 2017 3.77 722678
9780062348708 The Upside of Unrequited Becky Albertalli [‘Young Adult’, ‘Contemporary’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Audiobook’, ‘Queer’] 2017 3.90 74803

The New York Times API paginates results to 20 titles per request, with the total number of available historical results at 36528. This would mean 1826 calls total for the full list. Since the API is also rate limited to 10 requests per minute, the number of calls must reduced as much as possible. The API requests can then be made sequentially and throttled to 10 per minute or fewer.

So before the requests are formulated, the gr dataframe will be grouped by author name to count their books and rank them by number of appearances on Goodreads’ Top 100 for the selected years, and filtered for authors with 3 or more books on the list.

top_authors <- gr |>
  group_by(authors) |>
  reframe(
    gr_titles = n(),
    best_rating = max(rating_score),
    mean_rating = round(mean(rating_score), digits = 2),
    total_ratings = sum(num_ratings, na.rm = TRUE),
  ) |>
  filter(gr_titles > 2) |>
  arrange(desc(mean_rating))

length(top_authors$authors)
## [1] 33

This results in 33 authors for whom requests will be made to the API for their NY Times Bestseller Lists’ historical appearances.

base_url <- "https://api.nytimes.com/svc/books/v3/lists/best-sellers/history.json"

build_requests <- function(names) {
  lapply(names, \(name) {
    request(base_url) |>
      req_url_query(
        "api-key" = NYT_API_KEY,
        "author" = name) |>
      req_retry(backoff = ~10) |>
      req_throttle(rate = 6 / 60, realm = "https://api.nytimes.com/svc/books")})}

requests <- build_requests(top_authors$authors)
responses <- req_perform_sequential(requests, on_error = "continue")

The list of successful responses is formatted from a series of JSON objects into a list, into a dataframe and then transposed.

responses <- responses |> resps_successes() 
raw_nyt <- responses |> resps_data(\(resp) resp_body_json(resp)$results)

raw_nyt <- as.data.frame(do.call(cbind, raw_nyt))
raw_nyt <- as.data.frame(t(raw_nyt))
# select relevant columns
nyt <- raw_nyt |>
  select(
    "title",
    "author",
    "isbns",
    "ranks_history")

knitr::kable(head(nyt))
title author isbns ranks_history
V1 A MEMORY OF LIGHT Robert Jordan and Brandon Sanderson 0765325950 , 9780765325952, 1429997176 , 9781429997171, 0765364883 , 9780765364883 NULL
V2 ALCATRAZ VERSUS THE EVIL LIBRARIANS Brandon Sanderson 1250811066 , 9781250811066, 1250811104 , 9781250811103, 0765378949 , 9780765378941, 0765381419 , 9780765381415, 0765378965 , 9780765378965 1250811066 , 9781250811066 , 9 , Series Books , Children’s & Young Adult Series, 2022-10-09 , 2022-09-24 , 1 , 0 , 0 , 0
V3 DAWNSHARD Brandon Sanderson 125085055X , 9781250850553 , 9781705076446 , 12 , Audio Fiction , Audio Fiction , 2022-09-18 , 2022-09-03 , 0 , 0 , 0 , 0 , 125085055X , 9781250850553 , 15 , Hardcover Fiction, Hardcover Fiction, 2022-03-06 , 2022-02-19 , 1 , 0 , 0 , 0
V4 FIREFIGHT Brandon Sanderson 0385743580 , 9780385743587, 0375991220 , 9780375991226 NULL
V5 HERO OF AGES Brandon Sanderson NULL NULL
V6 OATHBRINGER Brandon Sanderson 076532637X , 9780765326379, 0765399830 , 9780765399830, 1250297141 , 9781250297143 NULL

Tidying & Transformations

Since two of the columns in nyt are still nested, the following functions are applied to un-nest, widen and hoist the appropriate values from the list-columns.

nyt <- nyt |>
  unnest(cols = isbns) |>
  unnest_wider(isbns)

nyt <- nyt |>
  unnest(cols = ranks_history)

nyt <- nyt |>
  hoist(ranks_history,
    rank = "rank",
    weeks_on_list = "weeks_on_list",
  )

nyt <- nyt |>
  subset(select = -c(isbn10, ranks_history)) |>
  rename(isbn = isbn13)

knitr::kable(head(nyt))
title author isbn rank weeks_on_list
ALCATRAZ VERSUS THE EVIL LIBRARIANS Brandon Sanderson 9781250811066 9 1
ALCATRAZ VERSUS THE EVIL LIBRARIANS Brandon Sanderson 9781250811103 9 1
ALCATRAZ VERSUS THE EVIL LIBRARIANS Brandon Sanderson 9780765378941 9 1
ALCATRAZ VERSUS THE EVIL LIBRARIANS Brandon Sanderson 9780765381415 9 1
ALCATRAZ VERSUS THE EVIL LIBRARIANS Brandon Sanderson 9780765378965 9 1
DAWNSHARD Brandon Sanderson 9781250850553 12 0

At this point, the NY Times Bestseller information can (optionally) be saved to disk as a CSV to backup the data (since the API calls took several minutes to complete and this can avoid having to repeat these costly steps).

saveable <- apply(nyt, 2, as.character)
write.csv(saveable, file = "nyt.csv")

Now, a subset for the bestseller information can be joined to the gr table. In this table, ISBNs are allowed to be duplicated, as each observation is actually an appearance on a bestseller list for a week. Many books appear for many weeks and achieve various ranks. For the fullest, tidiest portrait of book success, a mutating right join can be used to keep all rows of the NY Times data:

nyt_small <- nyt |>
  select(
    "isbn",
    "rank",
    "weeks_on_list")

books <- right_join(gr, nyt_small)

knitr::kable(head(books))
isbn title authors genres publication_date rating_score num_ratings rank weeks_on_list
9780440000211 The Book of Cold Cases Simone St. James [‘Mystery’, ‘Thriller’, ‘Mystery Thriller’, ‘Fiction’, ‘Horror’, ‘Paranormal’, ‘Audiobook’] 2022 3.79 107904 9 1
9780316134071 Dreams of Gods & Monsters Laini Taylor [‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Angels’, ‘Paranormal’, ‘Urban Fantasy’, ‘Fiction’] 2014 4.12 108346 7 2
9780316134071 Dreams of Gods & Monsters Laini Taylor [‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Angels’, ‘Paranormal’, ‘Urban Fantasy’, ‘Fiction’] 2014 4.12 108346 6 1
9781595143211 Silver Shadows Richelle Mead [‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’] 2014 4.34 60679 10 14
9781595143211 Silver Shadows Richelle Mead [‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’] 2014 4.34 60679 3 13
9781595143211 Silver Shadows Richelle Mead [‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’] 2014 4.34 60679 4 12

For a complete dataset with an observation for each ISBN including the highest rank achieved and number of weeks, gr and nyt are combined.

# select the best rank for each ISBN
nyt_isbn_grp1 <- nyt |>
  select(!weeks_on_list) |>
  group_by(isbn) |>
  slice(which.min(rank))

# select the most weeks for each ISBN
nyt_isbn_grp2 <- nyt |>
  select(!rank) |>
  group_by(isbn) |>
  slice(which.max(weeks_on_list))

# join into 1 table
nyt_group_join <- full_join(nyt_isbn_grp1, nyt_isbn_grp2, by = join_by(title, author, isbn))

# reformat columns to match
nyt_group_join <- nyt_group_join |>
  unnest(cols = title) |>
  unnest(cols = author) |>
  rename(authors = author)

# join onto the `gr` dataframe
books_best <- full_join(gr, nyt_group_join, by = join_by(title, authors, isbn))

# cleanup author strings
books_best$authors <- gsub("\\.$", "", books_best$authors)

knitr::kable(head(books_best))
isbn title authors genres publication_date rating_score num_ratings rank weeks_on_list
9780425256220 Magic Breaks Ilona Andrews [‘Urban Fantasy’, ‘Fantasy’, ‘Paranormal’, ‘Romance’, ‘Magic’, ‘Vampires’, ‘Paranormal Romance’] 2014 4.49 50284 NA NA
9781442468351 Lady Midnight Cassandra Clare [‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Paranormal’, ‘Urban Fantasy’, ‘Young Adult Fantasy’, ‘Fiction’] 2016 4.36 226041 NA NA
9780765376473 A Gathering of Shadows V.E. Schwab [‘Fantasy’, ‘Fiction’, ‘Young Adult’, ‘Magic’, ‘Adult’, ‘Audiobook’, ‘Romance’] 2016 4.22 191290 NA NA
9781451686630 The Passenger Lisa Lutz [‘Mystery’, ‘Fiction’, ‘Thriller’, ‘Mystery Thriller’, ‘Suspense’, ‘Audiobook’, ‘Crime’] 2016 3.72 42321 NA NA
9780062457790 They Both Die at the End Adam Silvera [‘Young Adult’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Contemporary’, ‘Queer’, ‘Audiobook’] 2017 3.77 722678 NA NA
9780062348708 The Upside of Unrequited Becky Albertalli [‘Young Adult’, ‘Contemporary’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Audiobook’, ‘Queer’] 2017 3.90 74803 NA NA

Grouping

For the author success analysis, the data is grouped on author name and joined onto the top_authors data.

author_hist <- books_best |>
  group_by(authors) |>
  summarize(
    num_titles = n_distinct(title, na.rm = TRUE),
    best_rank = min(rank, na.rm = TRUE),
    mean_rank = round(mean(rank, na.rm = TRUE)),
    most_weeks = max(weeks_on_list, na.rm = TRUE),
    mean_weeks = round(mean(weeks_on_list, na.rm = TRUE)))

top_auth_hist <- left_join(top_authors, author_hist, by = join_by(authors))

top_auth_hist <- top_auth_hist |>
  select(!c(gr_titles, )) |>
  mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., Inf))) |>
  mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., -Inf))) |>
  mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., NaN)))

knitr::kable(head(top_auth_hist))
authors best_rating mean_rating total_ratings num_titles best_rank mean_rank most_weeks mean_weeks
Brandon Sanderson 4.76 4.48 458821 9 1 4 4 2
Patricia Briggs 4.51 4.41 248646 14 1 5 2 1
Sarah J. Maas 4.68 4.39 7599481 13 2 7 35 25
Martha Wells 4.48 4.35 167983 6 4 7 1 1
Rick Riordan 4.43 4.30 961261 8 1 1 754 754
Taylor Jenkins Reid 4.43 4.29 4774601 3 NA NA NA NA

If data for unique title is required, ISBNs must be dropped, and the columns aggregated.

books_best$title <- str_trim(tolower(books_best$title))

unique_titles <- books_best |>
  select(!c(isbn, genres, publication_date)) |>
  group_by(title) |>
  summarize(
    u_rating = round(mean(rating_score, na.rm = TRUE), digits = 2),
    u_num_ratings = round(mean(num_ratings, na.rm = TRUE)),
    u_rank = min(rank),
    u_weeks = max(weeks_on_list)) |>
  mutate(across(c(u_rating, u_num_ratings), ~na_if(., NaN)))

knitr::kable(head(unique_titles))
title u_rating u_num_ratings u_rank u_weeks
a conjuring of light 4.32 156208 NA NA
a court of frost and starlight 3.78 1007290 NA NA
a court of mist and fury NA NA 9 33
a court of thorns and roses 4.20 2252404 NA NA
a court of wings and ruin NA NA 10 0
a curse for true love 4.13 116599 NA NA

Analysis

Using the top_auth_hist dataframe, the relationships between author popularity and ratings on Goodreads Top 100 lists or rankings on the bestseller lists can be visualized.

The first graphic below plots num_titles, which is the total number of unique titles that appear on either list (and therefore an analog for author popularity or fame), against the author’s highest rated book on Goodreads.

There appears to be a roughly linear, positive relationship - so, a popular author may expect higher ratings by fans on Goodreads.

ggplot(data = top_auth_hist, aes(x = num_titles, y = best_rating)) +
  geom_point() +
  scale_x_binned() +
  geom_smooth(method = "lm") +
  labs(x = "Number of successful titles",
       y = "Top Mean User Rating", 
       title = "Highest Rating on Goodreads for Top Authors")

This second plot shows the relationship between author popularity and their average rank when on the Bestseller list.

In contrast, there appears to be a negative linear relationship, which could possibly indicate that the more famous an author is, the less likely their books are to achieve high ranks on the bestseller list.

ggplot(data = top_auth_hist, aes(x = num_titles, y = mean_rank)) +
  geom_point() +
  scale_y_reverse() +
  scale_x_binned() +
  geom_smooth(method = "lm") +
  labs(x = "Number of successful titles",
       y = "Rank",
       title = "Average Rank When Appearing on the NY Times Bestsellers List")

m_author <- lm(mean_rating ~ total_ratings + mean_weeks, data = top_auth_hist)
summary(m_author)
## 
## Call:
## lm(formula = mean_rating ~ total_ratings + mean_weeks, data = top_auth_hist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40600 -0.11275 -0.01809  0.12048  0.42370 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.036e+00  4.893e-02  82.480   <2e-16 ***
## total_ratings 4.381e-08  2.916e-08   1.503    0.146    
## mean_weeks    2.772e-04  2.831e-04   0.979    0.337    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2078 on 24 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.1218, Adjusted R-squared:  0.04864 
## F-statistic: 1.665 on 2 and 24 DF,  p-value: 0.2104

The multiple linear model above calculates if the y-value mean_rating on Goodreads can be predicted by the total_ratings (another measure of popularity for an author) in combination with mean_weeks, the average number of weeks that the author’s books will appear on the NY Times lists.

With the small multiple R-squared and elevated P-values for both predictors, the effect of author popularity on mean rating on Goodreads is not statistically significant.

The residuals are checked below to confirm if the conditions for regression are reasonable. On the residual plot, the fitted values are generally clustered without a pattern around the 0 line, even though there are outliers on the right. On the normal probability plot, the residuals appear nearly normal as well.

ggplot(data = m_author, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = m_author, aes(sample = .resid)) +
  stat_qq()

Conclusion

In conclusion, the analysis of the combined datasets revealed that there is little to no significant relationship between author popularity and higher ratings. Despite validating the linear model, the lack of a strong correlation suggests that other factors, such as content quality, genre preferences, or marketing efforts, may play a more prominent role in influencing ratings. This insight highlights the complexity of rating dynamics and underscores the need for further investigation into additional variables that may better explain these patterns.