THE SUCCESS & POPULARITY OF BOOKS & AUTHORS

Data Sources

The first dataset, in CSV format posted on Kaggle, was originally scraped from Goodreads’ lists of the Top 100 books for each year from 1980 to 2023. This CSV includes genres, Goodreads readership numbers and user-provided ratings.

The second source of data is the NYTimes Books API. The key information here is the number of weeks that successful books appeared on bestseller lists and their rankings, in addition to the ISBNs and author names, which will be used to join the data to the Goodreads information.

Abstract

This analysis will explore whether certain attributes of books or authors correlate significantly with higher ratings or greater numbers of reviews on Goodreads, or higher rankings and longer runs on the New York Times weekly bestseller lists. After cleaning, transforming and tidying the datasets and joining on author, title and/or ISBN, the data will be used for EDA and visualization. The end-goal is to create dataframes that can provide insight into book and author popularity for recommendation and prediction.

Acquisition & Cleaning

The Goodreads CSV is imported and subset for variables of interest.

raw_gr <- read.csv("https://raw.githubusercontent.com/stephbc/DATA607_Fall2024/refs/heads/final_proj/goodreads_top100.csv")

# select relevant columns
gr <- raw_gr |>
  select(
    "isbn",
     "title",
     "authors",
     "language",
     "format", 
     "genres",
     "publication_date",
     "rating_score",
     "num_ratings")

Since the ISBN column will be used as the foreign key for each book, any observations with blank values or duplicate ISBNs must be removed. Also, the focus in this study will be on books published in the 10 year span from 2014-2023. Furthermore, since the NY Times API will only return bestselling books in the US and defaults to hardcover fiction, the following filtering can be applied to the Goodreads data to preliminarily screen for relevant results:

drop books with any language other than “English”
drop books with any format other than “Hardcover”
drop books with “Nonfiction” in the genres list
format the publication_date into a Date object of only the year, then drop any book published before 2014

gr <- gr |>
  mutate(isbn = na_if(isbn, "")) |>
  drop_na(isbn) |>
  distinct(isbn, .keep_all = TRUE) |>
  filter(language == "English", format == "Hardcover") |>
  select(!c(language, format)) |>
  filter(!grepl("Nonfiction", genres))

gr$publication_date <- year(mdy(gr$publication_date))
gr <- gr |> filter(publication_date > "2013")

knitr::kable(head(gr))

isbn	title	authors	genres	publication_date	rating_score	num_ratings
9780425256220	Magic Breaks	Ilona Andrews	[‘Urban Fantasy’, ‘Fantasy’, ‘Paranormal’, ‘Romance’, ‘Magic’, ‘Vampires’, ‘Paranormal Romance’]	2014	4.49	50284
9781442468351	Lady Midnight	Cassandra Clare	[‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Paranormal’, ‘Urban Fantasy’, ‘Young Adult Fantasy’, ‘Fiction’]	2016	4.36	226041
9780765376473	A Gathering of Shadows	V.E. Schwab	[‘Fantasy’, ‘Fiction’, ‘Young Adult’, ‘Magic’, ‘Adult’, ‘Audiobook’, ‘Romance’]	2016	4.22	191290
9781451686630	The Passenger	Lisa Lutz	[‘Mystery’, ‘Fiction’, ‘Thriller’, ‘Mystery Thriller’, ‘Suspense’, ‘Audiobook’, ‘Crime’]	2016	3.72	42321
9780062457790	They Both Die at the End	Adam Silvera	[‘Young Adult’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Contemporary’, ‘Queer’, ‘Audiobook’]	2017	3.77	722678
9780062348708	The Upside of Unrequited	Becky Albertalli	[‘Young Adult’, ‘Contemporary’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Audiobook’, ‘Queer’]	2017	3.90	74803

The New York Times API paginates results to 20 titles per request, with the total number of available historical results at 36528. This would mean 1826 calls total for the full list. Since the API is also rate limited to 10 requests per minute, the number of calls must reduced as much as possible. The API requests can then be made sequentially and throttled to 10 per minute or fewer.

So before the requests are formulated, the gr dataframe will be grouped by author name to count their books and rank them by number of appearances on Goodreads’ Top 100 for the selected years, and filtered for authors with 3 or more books on the list.

top_authors <- gr |>
  group_by(authors) |>
  reframe(
    gr_titles = n(),
    best_rating = max(rating_score),
    mean_rating = round(mean(rating_score), digits = 2),
    total_ratings = sum(num_ratings, na.rm = TRUE),
  ) |>
  filter(gr_titles > 2) |>
  arrange(desc(mean_rating))

length(top_authors$authors)

## [1] 33

This results in 33 authors for whom requests will be made to the API for their NY Times Bestseller Lists’ historical appearances.

base_url <- "https://api.nytimes.com/svc/books/v3/lists/best-sellers/history.json"

build_requests <- function(names) {
  lapply(names, \(name) {
    request(base_url) |>
      req_url_query(
        "api-key" = NYT_API_KEY,
        "author" = name) |>
      req_retry(backoff = ~10) |>
      req_throttle(rate = 6 / 60, realm = "https://api.nytimes.com/svc/books")})}

requests <- build_requests(top_authors$authors)
responses <- req_perform_sequential(requests, on_error = "continue")

The list of successful responses is formatted from a series of JSON objects into a list, into a dataframe and then transposed.

responses <- responses |> resps_successes() 
raw_nyt <- responses |> resps_data(\(resp) resp_body_json(resp)$results)

raw_nyt <- as.data.frame(do.call(cbind, raw_nyt))
raw_nyt <- as.data.frame(t(raw_nyt))

# select relevant columns
nyt <- raw_nyt |>
  select(
    "title",
    "author",
    "isbns",
    "ranks_history")

knitr::kable(head(nyt))

	title	author	isbns	ranks_history
V1	A MEMORY OF LIGHT	Robert Jordan and Brandon Sanderson	0765325950 , 9780765325952, 1429997176 , 9781429997171, 0765364883 , 9780765364883	NULL
V2	ALCATRAZ VERSUS THE EVIL LIBRARIANS	Brandon Sanderson	1250811066 , 9781250811066, 1250811104 , 9781250811103, 0765378949 , 9780765378941, 0765381419 , 9780765381415, 0765378965 , 9780765378965	1250811066 , 9781250811066 , 9 , Series Books , Children’s & Young Adult Series, 2022-10-09 , 2022-09-24 , 1 , 0 , 0 , 0
V3	DAWNSHARD	Brandon Sanderson	125085055X , 9781250850553	, 9781705076446 , 12 , Audio Fiction , Audio Fiction , 2022-09-18 , 2022-09-03 , 0 , 0 , 0 , 0 , 125085055X , 9781250850553 , 15 , Hardcover Fiction, Hardcover Fiction, 2022-03-06 , 2022-02-19 , 1 , 0 , 0 , 0
V4	FIREFIGHT	Brandon Sanderson	0385743580 , 9780385743587, 0375991220 , 9780375991226	NULL
V5	HERO OF AGES	Brandon Sanderson	NULL	NULL
V6	OATHBRINGER	Brandon Sanderson	076532637X , 9780765326379, 0765399830 , 9780765399830, 1250297141 , 9781250297143	NULL

Tidying & Transformations

Since two of the columns in nyt are still nested, the following functions are applied to un-nest, widen and hoist the appropriate values from the list-columns.

nyt <- nyt |>
  unnest(cols = isbns) |>
  unnest_wider(isbns)

nyt <- nyt |>
  unnest(cols = ranks_history)

nyt <- nyt |>
  hoist(ranks_history,
    rank = "rank",
    weeks_on_list = "weeks_on_list",
  )

nyt <- nyt |>
  subset(select = -c(isbn10, ranks_history)) |>
  rename(isbn = isbn13)

knitr::kable(head(nyt))

title	author	isbn	rank	weeks_on_list
ALCATRAZ VERSUS THE EVIL LIBRARIANS	Brandon Sanderson	9781250811066	9	1
ALCATRAZ VERSUS THE EVIL LIBRARIANS	Brandon Sanderson	9781250811103	9	1
ALCATRAZ VERSUS THE EVIL LIBRARIANS	Brandon Sanderson	9780765378941	9	1
ALCATRAZ VERSUS THE EVIL LIBRARIANS	Brandon Sanderson	9780765381415	9	1
ALCATRAZ VERSUS THE EVIL LIBRARIANS	Brandon Sanderson	9780765378965	9	1
DAWNSHARD	Brandon Sanderson	9781250850553	12	0

At this point, the NY Times Bestseller information can (optionally) be saved to disk as a CSV to backup the data (since the API calls took several minutes to complete and this can avoid having to repeat these costly steps).

saveable <- apply(nyt, 2, as.character)
write.csv(saveable, file = "nyt.csv")

Now, a subset for the bestseller information can be joined to the gr table. In this table, ISBNs are allowed to be duplicated, as each observation is actually an appearance on a bestseller list for a week. Many books appear for many weeks and achieve various ranks. For the fullest, tidiest portrait of book success, a mutating right join can be used to keep all rows of the NY Times data:

nyt_small <- nyt |>
  select(
    "isbn",
    "rank",
    "weeks_on_list")

books <- right_join(gr, nyt_small)

knitr::kable(head(books))

isbn	title	authors	genres	publication_date	rating_score	num_ratings	rank	weeks_on_list
9780440000211	The Book of Cold Cases	Simone St. James	[‘Mystery’, ‘Thriller’, ‘Mystery Thriller’, ‘Fiction’, ‘Horror’, ‘Paranormal’, ‘Audiobook’]	2022	3.79	107904	9	1
9780316134071	Dreams of Gods & Monsters	Laini Taylor	[‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Angels’, ‘Paranormal’, ‘Urban Fantasy’, ‘Fiction’]	2014	4.12	108346	7	2
9780316134071	Dreams of Gods & Monsters	Laini Taylor	[‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Angels’, ‘Paranormal’, ‘Urban Fantasy’, ‘Fiction’]	2014	4.12	108346	6	1
9781595143211	Silver Shadows	Richelle Mead	[‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’]	2014	4.34	60679	10	14
9781595143211	Silver Shadows	Richelle Mead	[‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’]	2014	4.34	60679	3	13
9781595143211	Silver Shadows	Richelle Mead	[‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’]	2014	4.34	60679	4	12

For a complete dataset with an observation for each ISBN including the highest rank achieved and number of weeks, gr and nyt are combined.

# select the best rank for each ISBN
nyt_isbn_grp1 <- nyt |>
  select(!weeks_on_list) |>
  group_by(isbn) |>
  slice(which.min(rank))

# select the most weeks for each ISBN
nyt_isbn_grp2 <- nyt |>
  select(!rank) |>
  group_by(isbn) |>
  slice(which.max(weeks_on_list))

# join into 1 table
nyt_group_join <- full_join(nyt_isbn_grp1, nyt_isbn_grp2, by = join_by(title, author, isbn))

# reformat columns to match
nyt_group_join <- nyt_group_join |>
  unnest(cols = title) |>
  unnest(cols = author) |>
  rename(authors = author)

# join onto the `gr` dataframe
books_best <- full_join(gr, nyt_group_join, by = join_by(title, authors, isbn))

# cleanup author strings
books_best$authors <- gsub("\\.$", "", books_best$authors)

knitr::kable(head(books_best))

isbn	title	authors	genres	publication_date	rating_score	num_ratings	rank	weeks_on_list
9780425256220	Magic Breaks	Ilona Andrews	[‘Urban Fantasy’, ‘Fantasy’, ‘Paranormal’, ‘Romance’, ‘Magic’, ‘Vampires’, ‘Paranormal Romance’]	2014	4.49	50284	NA	NA
9781442468351	Lady Midnight	Cassandra Clare	[‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Paranormal’, ‘Urban Fantasy’, ‘Young Adult Fantasy’, ‘Fiction’]	2016	4.36	226041	NA	NA
9780765376473	A Gathering of Shadows	V.E. Schwab	[‘Fantasy’, ‘Fiction’, ‘Young Adult’, ‘Magic’, ‘Adult’, ‘Audiobook’, ‘Romance’]	2016	4.22	191290	NA	NA
9781451686630	The Passenger	Lisa Lutz	[‘Mystery’, ‘Fiction’, ‘Thriller’, ‘Mystery Thriller’, ‘Suspense’, ‘Audiobook’, ‘Crime’]	2016	3.72	42321	NA	NA
9780062457790	They Both Die at the End	Adam Silvera	[‘Young Adult’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Contemporary’, ‘Queer’, ‘Audiobook’]	2017	3.77	722678	NA	NA
9780062348708	The Upside of Unrequited	Becky Albertalli	[‘Young Adult’, ‘Contemporary’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Audiobook’, ‘Queer’]	2017	3.90	74803	NA	NA

Grouping

For the author success analysis, the data is grouped on author name and joined onto the top_authors data.

author_hist <- books_best |>
  group_by(authors) |>
  summarize(
    num_titles = n_distinct(title, na.rm = TRUE),
    best_rank = min(rank, na.rm = TRUE),
    mean_rank = round(mean(rank, na.rm = TRUE)),
    most_weeks = max(weeks_on_list, na.rm = TRUE),
    mean_weeks = round(mean(weeks_on_list, na.rm = TRUE)))

top_auth_hist <- left_join(top_authors, author_hist, by = join_by(authors))

top_auth_hist <- top_auth_hist |>
  select(!c(gr_titles, )) |>
  mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., Inf))) |>
  mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., -Inf))) |>
  mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., NaN)))

knitr::kable(head(top_auth_hist))

authors	best_rating	mean_rating	total_ratings	num_titles	best_rank	mean_rank	most_weeks	mean_weeks
Brandon Sanderson	4.76	4.48	458821	9	1	4	4	2
Patricia Briggs	4.51	4.41	248646	14	1	5	2	1
Sarah J. Maas	4.68	4.39	7599481	13	2	7	35	25
Martha Wells	4.48	4.35	167983	6	4	7	1	1
Rick Riordan	4.43	4.30	961261	8	1	1	754	754
Taylor Jenkins Reid	4.43	4.29	4774601	3	NA	NA	NA	NA

If data for unique title is required, ISBNs must be dropped, and the columns aggregated.

books_best$title <- str_trim(tolower(books_best$title))

unique_titles <- books_best |>
  select(!c(isbn, genres, publication_date)) |>
  group_by(title) |>
  summarize(
    u_rating = round(mean(rating_score, na.rm = TRUE), digits = 2),
    u_num_ratings = round(mean(num_ratings, na.rm = TRUE)),
    u_rank = min(rank),
    u_weeks = max(weeks_on_list)) |>
  mutate(across(c(u_rating, u_num_ratings), ~na_if(., NaN)))

knitr::kable(head(unique_titles))

title	u_rating	u_num_ratings	u_rank	u_weeks
a conjuring of light	4.32	156208	NA	NA
a court of frost and starlight	3.78	1007290	NA	NA
a court of mist and fury	NA	NA	9	33
a court of thorns and roses	4.20	2252404	NA	NA
a court of wings and ruin	NA	NA	10	0
a curse for true love	4.13	116599	NA	NA

Analysis

Using the top_auth_hist dataframe, the relationships between author popularity and ratings on Goodreads Top 100 lists or rankings on the bestseller lists can be visualized.

The first graphic below plots num_titles, which is the total number of unique titles that appear on either list (and therefore an analog for author popularity or fame), against the author’s highest rated book on Goodreads.

There appears to be a roughly linear, positive relationship - so, a popular author may expect higher ratings by fans on Goodreads.

ggplot(data = top_auth_hist, aes(x = num_titles, y = best_rating)) +
  geom_point() +
  scale_x_binned() +
  geom_smooth(method = "lm") +
  labs(x = "Number of successful titles",
       y = "Top Mean User Rating", 
       title = "Highest Rating on Goodreads for Top Authors")

This second plot shows the relationship between author popularity and their average rank when on the Bestseller list.

In contrast, there appears to be a negative linear relationship, which could possibly indicate that the more famous an author is, the less likely their books are to achieve high ranks on the bestseller list.

ggplot(data = top_auth_hist, aes(x = num_titles, y = mean_rank)) +
  geom_point() +
  scale_y_reverse() +
  scale_x_binned() +
  geom_smooth(method = "lm") +
  labs(x = "Number of successful titles",
       y = "Rank",
       title = "Average Rank When Appearing on the NY Times Bestsellers List")

m_author <- lm(mean_rating ~ total_ratings + mean_weeks, data = top_auth_hist)
summary(m_author)

## 
## Call:
## lm(formula = mean_rating ~ total_ratings + mean_weeks, data = top_auth_hist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40600 -0.11275 -0.01809  0.12048  0.42370 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.036e+00  4.893e-02  82.480   <2e-16 ***
## total_ratings 4.381e-08  2.916e-08   1.503    0.146    
## mean_weeks    2.772e-04  2.831e-04   0.979    0.337    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2078 on 24 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.1218, Adjusted R-squared:  0.04864 
## F-statistic: 1.665 on 2 and 24 DF,  p-value: 0.2104

The multiple linear model above calculates if the y-value mean_rating on Goodreads can be predicted by the total_ratings (another measure of popularity for an author) in combination with mean_weeks, the average number of weeks that the author’s books will appear on the NY Times lists.

With the small multiple R-squared and elevated P-values for both predictors, the effect of author popularity on mean rating on Goodreads is not statistically significant.

The residuals are checked below to confirm if the conditions for regression are reasonable. On the residual plot, the fitted values are generally clustered without a pattern around the 0 line, even though there are outliers on the right. On the normal probability plot, the residuals appear nearly normal as well.

ggplot(data = m_author, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = m_author, aes(sample = .resid)) +
  stat_qq()

Conclusion

In conclusion, the analysis of the combined datasets revealed that there is little to no significant relationship between author popularity and higher ratings. Despite validating the linear model, the lack of a strong correlation suggests that other factors, such as content quality, genre preferences, or marketing efforts, may play a more prominent role in influencing ratings. This insight highlights the complexity of rating dynamics and underscores the need for further investigation into additional variables that may better explain these patterns.