Analysis of Book Genres, Ratings & Prices

Author

Daniel Neugebauer

1. Introduction

Do different book genres show meaningful differences in average rating, market price, and reader engagement?

I collected data on 120 books from 6 of the site's 50 catalog pages (scraping all 50 pages was beyond my processing capacity) for these variables:

title
book_url
genre
avg_rating
num_ratings
price
award_status
award_name

Using BooksToScrape.com, I investigated the relationship between book genre and several performance indicators. Since I'm not much of a reader, I had no idea how popular any genre was or whether genre and price were noticeably correlated, which made this an interesting, completely blind analysis.

2. Data

In this section, I import the dataset that I previously scraped. BooksToScrape is a publicly available site designed for scraping practice. I wrote a script that looped through multiple catalog pages and accessed each book's product page to collect the variables listed above.
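The scraping script itself is not shown in this report. As a rough sketch of the loop described above (not the original script), the following assumes the rvest, xml2, tibble, and dplyr packages are installed. The URL pattern, the CSS selectors, and the helper names (rating_to_number, price_to_number, scrape_book, scrape_catalog) are assumptions for illustration and may not match the site's markup exactly.

```r
# Sketch only: the URL pattern and CSS selectors are assumptions about the
# site's markup, and all function names here are hypothetical.
# Requires rvest, xml2, tibble, and dplyr (called with :: below).

base_url <- "https://books.toscrape.com/catalogue/"

# Convert a class string such as "star-rating Three" to the number 3.
rating_to_number <- function(class_attr) {
  words <- c(One = 1, Two = 2, Three = 3, Four = 4, Five = 5)
  word <- sub("^star-rating\\s+", "", class_attr)
  unname(words[word])
}

# Strip the currency symbol from a price string and return a number.
price_to_number <- function(price_text) {
  as.numeric(gsub("[^0-9.]", "", price_text))
}

# Read one product page and return a one-row tibble.
scrape_book <- function(book_url) {
  page <- rvest::read_html(book_url)
  tibble::tibble(
    title = rvest::html_text2(rvest::html_element(page, "div.product_main h1")),
    book_url = book_url,
    genre = rvest::html_text2(
      rvest::html_element(page, "ul.breadcrumb li:nth-child(3) a")
    ),
    avg_rating = rating_to_number(
      rvest::html_attr(rvest::html_element(page, "p.star-rating"), "class")
    ),
    price = price_to_number(
      rvest::html_text2(rvest::html_element(page, "p.price_color"))
    )
  )
}

# Loop over catalog pages, follow each book link, and stack the rows.
scrape_catalog <- function(pages = 1:6, delay = 1) {
  results <- list()
  for (p in pages) {
    page_url <- paste0(base_url, "page-", p, ".html")
    links <- rvest::html_attr(
      rvest::html_elements(rvest::read_html(page_url), "article.product_pod h3 a"),
      "href"
    )
    for (link in links) {
      results[[length(results) + 1]] <- scrape_book(xml2::url_absolute(link, page_url))
      Sys.sleep(delay)  # pause between requests to be polite to the server
    }
  }
  dplyr::bind_rows(results)
}
```

Calling scrape_catalog(pages = 1:6) would then return one row per book. The pause between requests is a design choice to keep the load on the site low.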

library(tidyverse)

Warning: package 'readr' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

books <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/neugebauerd1_xavier_edu/IQDfHNzInL_RRY4MWTdCnM8aARXbDXaHw_XPa-ThdFZZFY4?download=1")

Rows: 120 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): title, book_url, genre, award_status, award_name
dbl (3): avg_rating, num_ratings, price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(books)

Rows: 120
Columns: 8
$ title        <chr> "A Light in the Attic", "Tipping the Velvet", "Soumission…
$ book_url     <chr> "https://books.toscrape.com/catalogue/a-light-in-the-atti…
$ avg_rating   <dbl> 3, 1, 1, 4, 5, 1, 4, 3, 4, 1, 2, 4, 5, 5, 5, 3, 1, 1, 2, …
$ num_ratings  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ genre        <chr> "Poetry", "Historical Fiction", "Fiction", "Mystery", "Hi…
$ price        <dbl> 51.77, 53.74, 50.10, 47.82, 54.23, 22.65, 33.34, 17.93, 2…
$ award_status <chr> "non_award", "non_award", "non_award", "award", "award", …
$ award_name   <chr> NA, NA, NA, "High Rating (4–5 stars)", "High Rating (4–5 …

3. Visual Analysis

Plot 1 — Number of Books per Genre

books %>%
  count(genre, sort = TRUE) %>%
  ggplot(aes(x = reorder(genre, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Number of Books per Genre",
       x = "Genre", y = "Count")

Analysis:

I grouped the data by genre and counted the number of books in each category. This allows comparison of how well each genre is represented in the sample.

Interpretation:
Some genres appear much more frequently than others (Sequential Art, Nonfiction, Default), so later comparisons may reflect differences in sample size rather than real differences between genres.
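To make the sample-size concern concrete, the counts can be tabulated with a flag for thinly represented genres. This is a minimal sketch: the function name flag_small_genres and the threshold min_n = 5 are arbitrary choices for illustration, not part of the original analysis.

```r
library(dplyr)

# Count books per genre and flag genres with fewer than min_n books.
# min_n = 5 is an arbitrary illustrative threshold.
flag_small_genres <- function(data, min_n = 5) {
  data %>%
    count(genre, name = "n_books", sort = TRUE) %>%
    mutate(
      share = n_books / sum(n_books),   # proportion of the whole sample
      small_sample = n_books < min_n    # TRUE when the genre is thinly covered
    )
}
```

Running flag_small_genres(books) would list each genre with its count, its share of the 120 books, and whether it falls below the chosen threshold.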

Plot 2 — Average Rating by Genre

books %>%
  filter(!is.na(genre)) %>%  # ensure missing genres don't break plot
  mutate(genre = factor(genre)) %>%
  ggplot(aes(x = genre, y = avg_rating, fill = genre)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Average Rating by Genre",
       x = "Genre", y = "Average Rating (1–5 Stars)")

Analysis:
I plotted rating distributions using a grouped boxplot by genre to compare variation in average ratings across categories.

Interpretation:
Genres such as Philosophy and Science are very well reviewed, while many others have either too wide a range of ratings or too few books to give meaningful insight. This is most likely because the data cover only 6 of the site's 50 catalog pages. The plot could still be a useful framework for a larger sample.
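A numeric summary alongside the boxplot makes it easier to see which genres rest on only a handful of books. The sketch below assumes the column names shown in the glimpse() output. The function name summarise_ratings is hypothetical.

```r
library(dplyr)

# Per-genre count, median, and interquartile range of avg_rating.
summarise_ratings <- function(data) {
  data %>%
    filter(!is.na(genre), !is.na(avg_rating)) %>%
    group_by(genre) %>%
    summarise(
      n_books = n(),
      median_rating = median(avg_rating),
      iqr_rating = IQR(avg_rating),
      .groups = "drop"
    ) %>%
    arrange(desc(n_books))
}
```

With summarise_ratings(books), a high median based on one or two books is immediately visible as such, which the boxplot alone does not show.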

Plot 3 — Price by Genre

books %>%
  ggplot(aes(x = genre, y = price, fill = genre)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Price Distribution by Genre",
       x = "Genre", y = "Price (£)")

Analysis:
I used another grouped boxplot to examine how book prices vary by genre.

Interpretation:
Some genres appear consistently more expensive (Science, Childrens, Fiction). These are all plausibly popular genres, so this could be a real sign that genre affects price. However, the limited sample should be kept in mind.
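One way to check whether the apparent price differences exceed what chance would produce is a Kruskal-Wallis rank test, restricted to genres with enough books. This test was not part of the original analysis. The function name test_price_by_genre and the cutoff min_n = 5 are assumptions chosen for illustration.

```r
# Kruskal-Wallis test of price across genres, keeping only genres that have
# at least min_n books. min_n = 5 is an arbitrary illustrative cutoff.
test_price_by_genre <- function(data, min_n = 5) {
  counts <- table(data$genre)
  keep <- names(counts)[counts >= min_n]
  if (length(keep) < 2) {
    stop("Need at least two genres with min_n or more books.")
  }
  subset_data <- data[data$genre %in% keep, , drop = FALSE]
  subset_data$genre <- factor(subset_data$genre)
  kruskal.test(price ~ genre, data = subset_data)
}
```

A small p-value from test_price_by_genre(books) would suggest that at least one genre's price distribution differs from the others. With groups this small, a large p-value would not rule out a real difference.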

Plot 4 — Rating vs Price

books %>%
  ggplot(aes(x = avg_rating, y = price)) +
  geom_jitter(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Relationship Between Rating and Price",
       x = "Average Rating", y = "Price (£)")

`geom_smooth()` using formula = 'y ~ x'

Analysis:
I visualized the relationship between rating and price using a jittered scatterplot with a line of best fit to check for a trend.

Interpretation:
There is no clear linear relationship between price and rating: the fitted line stays almost completely horizontal across the 120 books in the data set.
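The visual impression can be backed by a correlation coefficient. Because the ratings shown in the data are whole stars, a rank-based (Spearman) correlation is a reasonable choice here. This is a sketch rather than part of the original analysis, and the function name rating_price_cor is hypothetical.

```r
# Spearman rank correlation between rating and price.
# exact = FALSE avoids the warning that ties (repeated star ratings) trigger.
rating_price_cor <- function(data) {
  cor.test(data$avg_rating, data$price, method = "spearman", exact = FALSE)
}
```

In rating_price_cor(books), an estimate close to 0 would agree with the nearly flat fitted line.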

Plot 5 — Rating vs Number of Ratings

books %>%
  ggplot(aes(x = avg_rating, y = num_ratings)) +
  geom_jitter(alpha = 0.6, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Rating vs Number of Ratings (Engagement)",
       x = "Rating", y = "Number of Ratings")

`geom_smooth()` using formula = 'y ~ x'

Analysis:
I plotted rating against number of ratings to evaluate whether higher ratings correlate with greater reader engagement.

Interpretation:
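As in the previous plot, the fitted line shows no correlation, here between a book's rating and its engagement. Note that every num_ratings value visible in the glimpse() output above is 0, so this variable may contain little or no variation, in which case the plot cannot say anything about engagement.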
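Before reading anything into this plot, it is worth checking whether num_ratings varies at all. The helper below is a minimal sketch. The name check_variation is hypothetical, and nothing is assumed about the values beyond the rows shown by glimpse().

```r
# Summarise how much a numeric column varies, ignoring missing values.
check_variation <- function(x) {
  x <- x[!is.na(x)]
  list(
    n_distinct = length(unique(x)),             # number of different values
    all_zero = length(x) > 0 && all(x == 0),    # TRUE if every value is 0
    sd = stats::sd(x)                           # 0 when the column is constant
  )
}
```

If check_variation(books$num_ratings) reports a single distinct value, the column is constant and the engagement question cannot be answered with this data set.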

4. Conclusion

Based on the analysis, there is no strong evidence that genre consistently predicts rating, price, or reader engagement. While some boxplots suggest minor variation across categories, the differences are not substantial enough to indicate a reliable trend. It is also likely that the results are influenced by the limited and uneven sample of available books, which can distort patterns that might appear in a larger dataset. In short, this analysis suggests that genre alone does not meaningfully determine how books are priced or received.