The first dataset, in CSV format posted on Kaggle, was originally scraped from Goodreads’ lists of the Top 100 books for each year from 1980 to 2023. This CSV includes genres, Goodreads readership numbers and user-provided ratings.
The second source of data is the NYTimes Books API. The key information here is the number of weeks that successful books appeared on bestseller lists and their rankings, in addition to the ISBNs and author names, which will be used to join the data to the Goodreads information.
This analysis will explore whether certain attributes of books or authors correlate significantly with higher ratings or greater numbers of reviews on Goodreads, or higher rankings and longer runs on the New York Times weekly bestseller lists. After cleaning, transforming and tidying the datasets and joining on author, title and/or ISBN, the data will be used for EDA and visualization. The end-goal is to create dataframes that can provide insight into book and author popularity for recommendation and prediction.
The Goodreads CSV is imported and subset for variables of interest.
raw_gr <- read.csv("https://raw.githubusercontent.com/stephbc/DATA607_Fall2024/refs/heads/final_proj/goodreads_top100.csv")
# select relevant columns
gr <- raw_gr |>
select(
"isbn",
"title",
"authors",
"language",
"format",
"genres",
"publication_date",
"rating_score",
"num_ratings")
Since the ISBN column will be used as the foreign key for each book, any observations with blank values or duplicate ISBNs must be removed. Also, the focus in this study will be on books published in the 10 year span from 2014-2023. Furthermore, since the NY Times API will only return bestselling books in the US and defaults to hardcover fiction, the following filtering can be applied to the Goodreads data to preliminarily screen for relevant results:
language
other than “English”format
other than “Hardcover”genres
listpublication_date
into a Date object of only
the year, then drop any book published before 2014gr <- gr |>
mutate(isbn = na_if(isbn, "")) |>
drop_na(isbn) |>
distinct(isbn, .keep_all = TRUE) |>
filter(language == "English", format == "Hardcover") |>
select(!c(language, format)) |>
filter(!grepl("Nonfiction", genres))
gr$publication_date <- year(mdy(gr$publication_date))
gr <- gr |> filter(publication_date > "2013")
knitr::kable(head(gr))
isbn | title | authors | genres | publication_date | rating_score | num_ratings |
---|---|---|---|---|---|---|
9780425256220 | Magic Breaks | Ilona Andrews | [‘Urban Fantasy’, ‘Fantasy’, ‘Paranormal’, ‘Romance’, ‘Magic’, ‘Vampires’, ‘Paranormal Romance’] | 2014 | 4.49 | 50284 |
9781442468351 | Lady Midnight | Cassandra Clare | [‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Paranormal’, ‘Urban Fantasy’, ‘Young Adult Fantasy’, ‘Fiction’] | 2016 | 4.36 | 226041 |
9780765376473 | A Gathering of Shadows | V.E. Schwab | [‘Fantasy’, ‘Fiction’, ‘Young Adult’, ‘Magic’, ‘Adult’, ‘Audiobook’, ‘Romance’] | 2016 | 4.22 | 191290 |
9781451686630 | The Passenger | Lisa Lutz | [‘Mystery’, ‘Fiction’, ‘Thriller’, ‘Mystery Thriller’, ‘Suspense’, ‘Audiobook’, ‘Crime’] | 2016 | 3.72 | 42321 |
9780062457790 | They Both Die at the End | Adam Silvera | [‘Young Adult’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Contemporary’, ‘Queer’, ‘Audiobook’] | 2017 | 3.77 | 722678 |
9780062348708 | The Upside of Unrequited | Becky Albertalli | [‘Young Adult’, ‘Contemporary’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Audiobook’, ‘Queer’] | 2017 | 3.90 | 74803 |
The New York Times API paginates results to 20 titles per request, with the total number of available historical results at 36528. This would mean 1826 calls total for the full list. Since the API is also rate limited to 10 requests per minute, the number of calls must reduced as much as possible. The API requests can then be made sequentially and throttled to 10 per minute or fewer.
So before the requests are formulated, the gr
dataframe
will be grouped by author name to count their books and rank them by
number of appearances on Goodreads’ Top 100 for the selected years, and
filtered for authors with 3 or more books on the list.
top_authors <- gr |>
group_by(authors) |>
reframe(
gr_titles = n(),
best_rating = max(rating_score),
mean_rating = round(mean(rating_score), digits = 2),
total_ratings = sum(num_ratings, na.rm = TRUE),
) |>
filter(gr_titles > 2) |>
arrange(desc(mean_rating))
length(top_authors$authors)
## [1] 33
This results in 33 authors for whom requests will be made to the API for their NY Times Bestseller Lists’ historical appearances.
base_url <- "https://api.nytimes.com/svc/books/v3/lists/best-sellers/history.json"
build_requests <- function(names) {
lapply(names, \(name) {
request(base_url) |>
req_url_query(
"api-key" = NYT_API_KEY,
"author" = name) |>
req_retry(backoff = ~10) |>
req_throttle(rate = 6 / 60, realm = "https://api.nytimes.com/svc/books")})}
requests <- build_requests(top_authors$authors)
responses <- req_perform_sequential(requests, on_error = "continue")
The list of successful responses is formatted from a series of JSON objects into a list, into a dataframe and then transposed.
responses <- responses |> resps_successes()
raw_nyt <- responses |> resps_data(\(resp) resp_body_json(resp)$results)
raw_nyt <- as.data.frame(do.call(cbind, raw_nyt))
raw_nyt <- as.data.frame(t(raw_nyt))
# select relevant columns
nyt <- raw_nyt |>
select(
"title",
"author",
"isbns",
"ranks_history")
knitr::kable(head(nyt))
title | author | isbns | ranks_history | |
---|---|---|---|---|
V1 | A MEMORY OF LIGHT | Robert Jordan and Brandon Sanderson | 0765325950 , 9780765325952, 1429997176 , 9781429997171, 0765364883 , 9780765364883 | NULL |
V2 | ALCATRAZ VERSUS THE EVIL LIBRARIANS | Brandon Sanderson | 1250811066 , 9781250811066, 1250811104 , 9781250811103, 0765378949 , 9780765378941, 0765381419 , 9780765381415, 0765378965 , 9780765378965 | 1250811066 , 9781250811066 , 9 , Series Books , Children’s & Young Adult Series, 2022-10-09 , 2022-09-24 , 1 , 0 , 0 , 0 |
V3 | DAWNSHARD | Brandon Sanderson | 125085055X , 9781250850553 | , 9781705076446 , 12 , Audio Fiction , Audio Fiction , 2022-09-18 , 2022-09-03 , 0 , 0 , 0 , 0 , 125085055X , 9781250850553 , 15 , Hardcover Fiction, Hardcover Fiction, 2022-03-06 , 2022-02-19 , 1 , 0 , 0 , 0 |
V4 | FIREFIGHT | Brandon Sanderson | 0385743580 , 9780385743587, 0375991220 , 9780375991226 | NULL |
V5 | HERO OF AGES | Brandon Sanderson | NULL | NULL |
V6 | OATHBRINGER | Brandon Sanderson | 076532637X , 9780765326379, 0765399830 , 9780765399830, 1250297141 , 9781250297143 | NULL |
Since two of the columns in nyt
are still nested, the
following functions are applied to un-nest, widen and hoist the
appropriate values from the list-columns.
nyt <- nyt |>
unnest(cols = isbns) |>
unnest_wider(isbns)
nyt <- nyt |>
unnest(cols = ranks_history)
nyt <- nyt |>
hoist(ranks_history,
rank = "rank",
weeks_on_list = "weeks_on_list",
)
nyt <- nyt |>
subset(select = -c(isbn10, ranks_history)) |>
rename(isbn = isbn13)
knitr::kable(head(nyt))
title | author | isbn | rank | weeks_on_list |
---|---|---|---|---|
ALCATRAZ VERSUS THE EVIL LIBRARIANS | Brandon Sanderson | 9781250811066 | 9 | 1 |
ALCATRAZ VERSUS THE EVIL LIBRARIANS | Brandon Sanderson | 9781250811103 | 9 | 1 |
ALCATRAZ VERSUS THE EVIL LIBRARIANS | Brandon Sanderson | 9780765378941 | 9 | 1 |
ALCATRAZ VERSUS THE EVIL LIBRARIANS | Brandon Sanderson | 9780765381415 | 9 | 1 |
ALCATRAZ VERSUS THE EVIL LIBRARIANS | Brandon Sanderson | 9780765378965 | 9 | 1 |
DAWNSHARD | Brandon Sanderson | 9781250850553 | 12 | 0 |
At this point, the NY Times Bestseller information can (optionally) be saved to disk as a CSV to backup the data (since the API calls took several minutes to complete and this can avoid having to repeat these costly steps).
saveable <- apply(nyt, 2, as.character)
write.csv(saveable, file = "nyt.csv")
Now, a subset for the bestseller information can be joined to the
gr
table. In this table, ISBNs are allowed to be
duplicated, as each observation is actually an appearance on a
bestseller list for a week. Many books appear for many weeks and achieve
various ranks. For the fullest, tidiest portrait of book success, a
mutating right join can be used to keep all rows of the NY Times
data:
nyt_small <- nyt |>
select(
"isbn",
"rank",
"weeks_on_list")
books <- right_join(gr, nyt_small)
knitr::kable(head(books))
isbn | title | authors | genres | publication_date | rating_score | num_ratings | rank | weeks_on_list |
---|---|---|---|---|---|---|---|---|
9780440000211 | The Book of Cold Cases | Simone St. James | [‘Mystery’, ‘Thriller’, ‘Mystery Thriller’, ‘Fiction’, ‘Horror’, ‘Paranormal’, ‘Audiobook’] | 2022 | 3.79 | 107904 | 9 | 1 |
9780316134071 | Dreams of Gods & Monsters | Laini Taylor | [‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Angels’, ‘Paranormal’, ‘Urban Fantasy’, ‘Fiction’] | 2014 | 4.12 | 108346 | 7 | 2 |
9780316134071 | Dreams of Gods & Monsters | Laini Taylor | [‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Angels’, ‘Paranormal’, ‘Urban Fantasy’, ‘Fiction’] | 2014 | 4.12 | 108346 | 6 | 1 |
9781595143211 | Silver Shadows | Richelle Mead | [‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’] | 2014 | 4.34 | 60679 | 10 | 14 |
9781595143211 | Silver Shadows | Richelle Mead | [‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’] | 2014 | 4.34 | 60679 | 3 | 13 |
9781595143211 | Silver Shadows | Richelle Mead | [‘Young Adult’, ‘Fantasy’, ‘Vampires’, ‘Paranormal’, ‘Romance’, ‘Urban Fantasy’, ‘Magic’] | 2014 | 4.34 | 60679 | 4 | 12 |
For a complete dataset with an observation for each ISBN including
the highest rank achieved and number of weeks, gr
and
nyt
are combined.
# select the best rank for each ISBN
nyt_isbn_grp1 <- nyt |>
select(!weeks_on_list) |>
group_by(isbn) |>
slice(which.min(rank))
# select the most weeks for each ISBN
nyt_isbn_grp2 <- nyt |>
select(!rank) |>
group_by(isbn) |>
slice(which.max(weeks_on_list))
# join into 1 table
nyt_group_join <- full_join(nyt_isbn_grp1, nyt_isbn_grp2, by = join_by(title, author, isbn))
# reformat columns to match
nyt_group_join <- nyt_group_join |>
unnest(cols = title) |>
unnest(cols = author) |>
rename(authors = author)
# join onto the `gr` dataframe
books_best <- full_join(gr, nyt_group_join, by = join_by(title, authors, isbn))
# cleanup author strings
books_best$authors <- gsub("\\.$", "", books_best$authors)
knitr::kable(head(books_best))
isbn | title | authors | genres | publication_date | rating_score | num_ratings | rank | weeks_on_list |
---|---|---|---|---|---|---|---|---|
9780425256220 | Magic Breaks | Ilona Andrews | [‘Urban Fantasy’, ‘Fantasy’, ‘Paranormal’, ‘Romance’, ‘Magic’, ‘Vampires’, ‘Paranormal Romance’] | 2014 | 4.49 | 50284 | NA | NA |
9781442468351 | Lady Midnight | Cassandra Clare | [‘Fantasy’, ‘Young Adult’, ‘Romance’, ‘Paranormal’, ‘Urban Fantasy’, ‘Young Adult Fantasy’, ‘Fiction’] | 2016 | 4.36 | 226041 | NA | NA |
9780765376473 | A Gathering of Shadows | V.E. Schwab | [‘Fantasy’, ‘Fiction’, ‘Young Adult’, ‘Magic’, ‘Adult’, ‘Audiobook’, ‘Romance’] | 2016 | 4.22 | 191290 | NA | NA |
9781451686630 | The Passenger | Lisa Lutz | [‘Mystery’, ‘Fiction’, ‘Thriller’, ‘Mystery Thriller’, ‘Suspense’, ‘Audiobook’, ‘Crime’] | 2016 | 3.72 | 42321 | NA | NA |
9780062457790 | They Both Die at the End | Adam Silvera | [‘Young Adult’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Contemporary’, ‘Queer’, ‘Audiobook’] | 2017 | 3.77 | 722678 | NA | NA |
9780062348708 | The Upside of Unrequited | Becky Albertalli | [‘Young Adult’, ‘Contemporary’, ‘Romance’, ‘LGBT’, ‘Fiction’, ‘Audiobook’, ‘Queer’] | 2017 | 3.90 | 74803 | NA | NA |
For the author success analysis, the data is grouped on author name
and joined onto the top_authors
data.
author_hist <- books_best |>
group_by(authors) |>
summarize(
num_titles = n_distinct(title, na.rm = TRUE),
best_rank = min(rank, na.rm = TRUE),
mean_rank = round(mean(rank, na.rm = TRUE)),
most_weeks = max(weeks_on_list, na.rm = TRUE),
mean_weeks = round(mean(weeks_on_list, na.rm = TRUE)))
top_auth_hist <- left_join(top_authors, author_hist, by = join_by(authors))
top_auth_hist <- top_auth_hist |>
select(!c(gr_titles, )) |>
mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., Inf))) |>
mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., -Inf))) |>
mutate(across(c(best_rank, mean_rank, most_weeks, mean_weeks), ~na_if(., NaN)))
knitr::kable(head(top_auth_hist))
authors | best_rating | mean_rating | total_ratings | num_titles | best_rank | mean_rank | most_weeks | mean_weeks |
---|---|---|---|---|---|---|---|---|
Brandon Sanderson | 4.76 | 4.48 | 458821 | 9 | 1 | 4 | 4 | 2 |
Patricia Briggs | 4.51 | 4.41 | 248646 | 14 | 1 | 5 | 2 | 1 |
Sarah J. Maas | 4.68 | 4.39 | 7599481 | 13 | 2 | 7 | 35 | 25 |
Martha Wells | 4.48 | 4.35 | 167983 | 6 | 4 | 7 | 1 | 1 |
Rick Riordan | 4.43 | 4.30 | 961261 | 8 | 1 | 1 | 754 | 754 |
Taylor Jenkins Reid | 4.43 | 4.29 | 4774601 | 3 | NA | NA | NA | NA |
If data for unique title is required, ISBNs must be dropped, and the columns aggregated.
books_best$title <- str_trim(tolower(books_best$title))
unique_titles <- books_best |>
select(!c(isbn, genres, publication_date)) |>
group_by(title) |>
summarize(
u_rating = round(mean(rating_score, na.rm = TRUE), digits = 2),
u_num_ratings = round(mean(num_ratings, na.rm = TRUE)),
u_rank = min(rank),
u_weeks = max(weeks_on_list)) |>
mutate(across(c(u_rating, u_num_ratings), ~na_if(., NaN)))
knitr::kable(head(unique_titles))
title | u_rating | u_num_ratings | u_rank | u_weeks |
---|---|---|---|---|
a conjuring of light | 4.32 | 156208 | NA | NA |
a court of frost and starlight | 3.78 | 1007290 | NA | NA |
a court of mist and fury | NA | NA | 9 | 33 |
a court of thorns and roses | 4.20 | 2252404 | NA | NA |
a court of wings and ruin | NA | NA | 10 | 0 |
a curse for true love | 4.13 | 116599 | NA | NA |
Using the top_auth_hist
dataframe, the relationships
between author popularity and ratings on Goodreads Top 100 lists or
rankings on the bestseller lists can be visualized.
The first graphic below plots num_titles
, which is the
total number of unique titles that appear on either list (and therefore
an analog for author popularity or fame), against the author’s highest
rated book on Goodreads.
There appears to be a roughly linear, positive relationship - so, a popular author may expect higher ratings by fans on Goodreads.
ggplot(data = top_auth_hist, aes(x = num_titles, y = best_rating)) +
geom_point() +
scale_x_binned() +
geom_smooth(method = "lm") +
labs(x = "Number of successful titles",
y = "Top Mean User Rating",
title = "Highest Rating on Goodreads for Top Authors")
This second plot shows the relationship between author popularity and their average rank when on the Bestseller list.
In contrast, there appears to be a negative linear relationship, which could possibly indicate that the more famous an author is, the less likely their books are to achieve high ranks on the bestseller list.
ggplot(data = top_auth_hist, aes(x = num_titles, y = mean_rank)) +
geom_point() +
scale_y_reverse() +
scale_x_binned() +
geom_smooth(method = "lm") +
labs(x = "Number of successful titles",
y = "Rank",
title = "Average Rank When Appearing on the NY Times Bestsellers List")
m_author <- lm(mean_rating ~ total_ratings + mean_weeks, data = top_auth_hist)
summary(m_author)
##
## Call:
## lm(formula = mean_rating ~ total_ratings + mean_weeks, data = top_auth_hist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40600 -0.11275 -0.01809 0.12048 0.42370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.036e+00 4.893e-02 82.480 <2e-16 ***
## total_ratings 4.381e-08 2.916e-08 1.503 0.146
## mean_weeks 2.772e-04 2.831e-04 0.979 0.337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2078 on 24 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.1218, Adjusted R-squared: 0.04864
## F-statistic: 1.665 on 2 and 24 DF, p-value: 0.2104
The multiple linear model above calculates if the y-value
mean_rating
on Goodreads can be predicted by the
total_ratings
(another measure of popularity for an author)
in combination with mean_weeks
, the average number of weeks
that the author’s books will appear on the NY Times lists.
With the small multiple R-squared and elevated P-values for both predictors, the effect of author popularity on mean rating on Goodreads is not statistically significant.
The residuals are checked below to confirm if the conditions for regression are reasonable. On the residual plot, the fitted values are generally clustered without a pattern around the 0 line, even though there are outliers on the right. On the normal probability plot, the residuals appear nearly normal as well.
ggplot(data = m_author, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = m_author, aes(sample = .resid)) +
stat_qq()
In conclusion, the analysis of the combined datasets revealed that there is little to no significant relationship between author popularity and higher ratings. Despite validating the linear model, the lack of a strong correlation suggests that other factors, such as content quality, genre preferences, or marketing efforts, may play a more prominent role in influencing ratings. This insight highlights the complexity of rating dynamics and underscores the need for further investigation into additional variables that may better explain these patterns.