Assignment 7: New York Times Best Sellers List

Author

Stephen Lynch

Published

November 25, 2025

Introduction

This data is part of a larger project in which I am using Goodreads to do sentiment analysis and pair readers with books they will find interesting. I wanted to take the time early on to familiarize myself with how Goodreads ranks books, and to draw some incidental insight into how review counts, ratings, and authors relate to where a book lands on the Goodreads New York Times Bestsellers list.

The data consists of 5 columns and 231 observations, each observation being an individual book. The columns are rank, title, author, average rating, and number of reviews.

Process of Acquiring the Data

This data all comes from pages 1-3 of the Goodreads list at “https://www.goodreads.com/list/show/9103.New_York_Times_Bestsellers”. I acquired it from a pre-existing table on the site and kept the five columns that interested me the most. I also did a little data cleaning and parsing to ensure that it was in a workable format once I had obtained it. In all code that follows I refer to the data as books_df, the name of the data frame.
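The cleaning and parsing step is not shown in this report, so the following is only a minimal sketch of what it might look like. It assumes the raw table stores the rating and review count together in one text field; the `raw_text` strings and their layout are hypothetical, built from the values of two books shown later in the report.

```r
# Hypothetical raw strings; the real page text may be formatted differently.
raw_text <- c(
  "4.26 avg rating - 6,800,770 ratings",
  "4.07 avg rating - 96,899 ratings"
)

# Pull every number-like token (digits, commas, decimal points) from each string.
tokens <- regmatches(raw_text, gregexpr("[0-9][0-9,.]*", raw_text))

# The first token is taken as the average rating, the second as the review count.
avg_rating  <- as.numeric(vapply(tokens, `[`, character(1), 1))
num_reviews <- as.numeric(gsub(",", "", vapply(tokens, `[`, character(1), 2)))

parsed <- data.frame(avg_rating, num_reviews)
print(parsed)
```

Stripping the thousands separators before calling `as.numeric()` matters here, because a string such as "6,800,770" would otherwise become NA.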

Visualization

To start off my visualizations, I wanted a quick and dirty scatter plot showing the relationship between average rating and the number of reviews.

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

books_df <- read_csv("https://www.dropbox.com/scl/fi/w8vitrvctg8yc7j6jaa3y/books_data.csv?rlkey=imjfmqf2my6hz6f9kozyezhri&st=r3b7alyz&dl=1")

New names:
• `` -> `...1`
Rows: 231 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): title, author
dbl (4): ...1, rank, avg_rating, num_reviews

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

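# Drop the unnamed index column (`...1`) that read_csv() picked up from the file
books_df <- books_df[,-1]
# Scatter plot of average rating against number of reviews
books_df %>% 
  ggplot(aes(y = avg_rating, x = num_reviews))+
  geom_point()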

Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).

As much as I would love this graph to be the end-all, be-all, the one extreme outlier and the several less extreme outliers make it hard to tell what is going on. Since there are only 30 books above 200,000 reviews, I am going to exclude those from the next graph.
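Before dropping rows with a cutoff like this, it can help to count how many rows fall on each side of it. The sketch below is illustrative only: it uses the six rows that `head(books_df)` prints later in this report (numeric columns only) rather than the full data, and the names `top6`, `cutoff`, and `kept` are made up for the example.

```r
# The six rows shown by head(books_df) in this report (numeric columns only).
top6 <- data.frame(
  rank        = 1:6,
  avg_rating  = c(4.26, 4.21, 4.18, 4.35, 4.07, 4.24),
  num_reviews = c(6800770, 2025461, 3419845, 9772489, 96899, 772385)
)

cutoff <- 200000

# TRUE  = row would be kept by filter(num_reviews < cutoff)
# FALSE = row would be excluded
# NA    = missing review count (filter() drops these as well)
kept <- top6$num_reviews < cutoff
table(kept, useNA = "ifany")
```

Running the same `table()` call on `books_df$num_reviews < 200000` should give the corresponding counts for all 231 books.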

books_df %>%
  filter(num_reviews < 200000) %>% 
  ggplot(aes(y = avg_rating, x = num_reviews))+
  geom_point()+
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_point()`).

Here we can see a slight trend: books with more reviews tend to have a slightly higher average rating. I will dive deeper into that momentarily, but I also want to point out how far some of these points are from the fitted line, which shows that number of reviews alone is not adequate for predicting average rating.
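One way to put numbers on both the trend and the scatter around the line is to compute the correlation and the residuals of the same kind of fit that `geom_smooth(method = "lm")` draws. This is a sketch under assumptions: it runs on the six rows shown by `head(books_df)` later in this report, so the values it produces say nothing about the full data set, and `top6`, `r`, and `fit` are names chosen for the example.

```r
# The six rows shown by head(books_df) in this report (numeric columns only).
top6 <- data.frame(
  rank        = 1:6,
  avg_rating  = c(4.26, 4.21, 4.18, 4.35, 4.07, 4.24),
  num_reviews = c(6800770, 2025461, 3419845, 9772489, 96899, 772385)
)

# Pearson correlation between review count and average rating.
r <- cor(top6$num_reviews, top6$avg_rating, use = "complete.obs")

# Simple linear fit of the same form geom_smooth(method = "lm") uses: y ~ x.
fit <- lm(avg_rating ~ num_reviews, data = top6)

# Residuals show how far each book sits from the fitted line.
print(round(resid(fit), 3))

# Share of the variation in rating explained by review count alone.
print(summary(fit)$r.squared)
```

For the plot above, the same calls would be applied to `books_df` after the `filter(num_reviews < 200000)` step.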

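# Rank against number of reviews (books with 200,000 or more reviews excluded)
books_df %>%
  filter(num_reviews < 200000) %>% 
  ggplot(aes(y = as.numeric(rank), x = num_reviews))+
  geom_point()

# Rank against average rating (same filter)
books_df %>%
  filter(num_reviews < 200000) %>% 
  ggplot(aes(y = as.numeric(rank), x = avg_rating))+
  geom_point()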

Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_point()`).

Modeling

# Fit on all rows of books_df, including the high-review books excluded from the plots
lm1 <- lm(data = books_df, as.numeric(rank) ~ avg_rating + num_reviews)

summary(lm1)


Call:
lm(formula = as.numeric(rank) ~ avg_rating + num_reviews, data = books_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-111.527  -56.966   -2.414   56.942  115.630 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.540e+02  7.421e+01   3.423 0.000739 ***
avg_rating  -3.340e+01  1.844e+01  -1.812 0.071413 .  
num_reviews -1.591e-05  4.473e-06  -3.557 0.000460 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 64.18 on 219 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.08414,   Adjusted R-squared:  0.07577 
F-statistic: 10.06 on 2 and 219 DF,  p-value: 6.613e-05

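Here I fit a small linear model to test how well average rating and number of reviews explain the rank of a given book. There is definitely some relationship: number of reviews is significant (p = 0.00046), and the overall F-test is significant as well. However, average rating has a p-value of about 0.071, which is not significant at the 0.05 level, so it does not warrant keeping in the model. The R-squared of about 0.084 also shows that the two predictors together explain only about 8% of the variation in rank.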
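If average rating were dropped, the reduced model could be compared against the full one with a partial F-test. The following is a rough sketch of that comparison, not a step taken in this report. It runs on the six rows shown by `head(books_df)` below, so its output is not meaningful for the full data; `top6`, `full`, and `reduced` are names chosen for the example.

```r
# The six rows shown by head(books_df) in this report (numeric columns only).
top6 <- data.frame(
  rank        = 1:6,
  avg_rating  = c(4.26, 4.21, 4.18, 4.35, 4.07, 4.24),
  num_reviews = c(6800770, 2025461, 3419845, 9772489, 96899, 772385)
)

# Full model (both predictors) and reduced model (number of reviews only).
full    <- lm(rank ~ avg_rating + num_reviews, data = top6)
reduced <- lm(rank ~ num_reviews, data = top6)

# Partial F-test: does adding avg_rating improve the fit?
comparison <- anova(reduced, full)
print(comparison)

# Adjusted R-squared for each model, side by side.
print(c(full    = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))
```

On `books_df` itself, rows with missing values would need to be dropped first (for example with `na.omit()`), because `anova()` requires both models to be fitted to the same observations.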

head(books_df)

# A tibble: 6 × 5
   rank title                                      author avg_rating num_reviews
  <dbl> <chr>                                      <chr>       <dbl>       <dbl>
1     1 To Kill a Mockingbird                      Harpe…       4.26     6800770
2     2 Charlotte's Web                            E.B. …       4.21     2025461
3     3 The Girl With the Dragon Tattoo (Millenni… Stieg…       4.18     3419845
4     4 The Hunger Games (The Hunger Games, #1)    Suzan…       4.35     9772489
5     5 The Chosen (Reuven Malter, #1)             Chaim…       4.07       96899
6     6 The Girl Who Kicked the Hornet’s Nest (Mi… Stieg…       4.24      772385

Using this little snippet of the top 6 books, number 5 has a lower average rating (4.07) and far fewer reviews (96,899) than number 6 (4.24 and 772,385), yet it sits higher in the ranking. This has to do with the Goodreads algorithm in some way, or at least with an incompleteness in the data by which these books have been judged.

Conclusion: Because the data does not include all of the metrics that determine rank, we have failed to reach a reasonable conclusion as to whether rank is a function of average rating and number of reviews. However, given the limited information we do have, rank is related much more strongly to number of reviews than to average rating.

Citation:

“New York Times Bestsellers (231 Books).” Goodreads, Listopia, www.goodreads.com/list/show/9103.New_York_Times_Bestsellers. Accessed 25 Nov. 2025.