Final Project
What Makes a Book Beloved? A Dive Into Goodreads Reviews
Books have always been a part of my life. I have had countless experiences where a story transported me, challenged me, and made me feel understood - and I think most readers would agree with me. Books shape our imaginations and help us make sense of the world around us. That’s why understanding what makes specific books more popular than others is so interesting to me. For this project, I want to explore two main questions to identify key factors influencing book popularity and reader engagement:
Which book characteristics are most strongly associated with high popularity on Goodreads?
What factors drive deeper reader engagement beyond just ratings?
If we can identify the key factors that drive popularity - whether it’s genre, author reputation, or something else - we can better understand what resonates with readers. By examining metrics such as the number of reviews, we can identify which books create community and lasting conversations and which fade into the background.
Overview of the Data
For this project, my primary data source is called “Goodreads-books”, and it is from Kaggle (Goodreads-books). The dataset contains information on individual books and their attributes, including title, author(s), average rating, and more.
Data Dictionary
Book ID
Title
Author
Average Rating
ISBN
ISBN (13)
Language
Number of Pages
Number of Ratings
Number of Text Reviews
Publication Date
Publisher
Full data dictionary: books - Data Dictionary.xlsx
Summary Stats
Before looking deeper, I examined the characteristics of the Goodreads dataset to understand its structure. With over 11,000 books, the dataset covers a wide range of genres, authors, and publication years, making it suitable for analyzing book trends and reader behavior. The average book rating is around 3.93 out of 5, and the most common language is English. Notably, some books have received hundreds of thousands of ratings and reviews, providing a valuable opportunity to examine the factors that drive popularity and reader engagement.
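As a rough illustration of how summary figures like these could be computed, here is a minimal sketch using only the Python standard library. The field names (average_rating, language_code, ratings_count) are assumptions about how the columns are labeled, and the four rows are made up for the example rather than taken from the dataset.

```python
from collections import Counter
from statistics import mean

# Made-up rows for illustration only; they are NOT taken from the dataset.
# The keys are assumed column names and may differ from the real file.
books = [
    {"title": "Book A", "average_rating": 4.2, "language_code": "eng", "ratings_count": 1200},
    {"title": "Book B", "average_rating": 3.8, "language_code": "eng", "ratings_count": 250000},
    {"title": "Book C", "average_rating": 4.0, "language_code": "spa", "ratings_count": 90},
    {"title": "Book D", "average_rating": 3.6, "language_code": "eng", "ratings_count": 15},
]


def summarize(rows):
    """Return a few headline statistics for a list of book records."""
    ratings = [row["average_rating"] for row in rows]
    languages = Counter(row["language_code"] for row in rows)
    return {
        "n_books": len(rows),
        "mean_rating": round(mean(ratings), 2),
        # most_common(1) returns a list with one (value, count) pair
        "top_language": languages.most_common(1)[0][0],
        "max_ratings_count": max(row["ratings_count"] for row in rows),
    }


summary = summarize(books)
print(summary)
```

With the real file, the rows could instead be read with csv.DictReader, converting the numeric fields from text before calling summarize.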
Descriptive Analytics
To explore what makes a book popular and engaging on Goodreads, I created several visualizations. These helped me understand the distribution of certain variables, spot outliers, and identify patterns.
I first looked at how average ratings are distributed across the dataset. The histogram reveals a clear trend: most books on Goodreads receive relatively high average ratings, which may make readers feel that their positive experiences are recognized and valued. The distribution peaks around four stars, and very few books are rated below two stars. This suggests that Goodreads users tend to rate books they enjoy and may skip rating books they did not finish (DNF) or did not like. Readers may be more likely to share positive feelings about a book than criticism.
This scatterplot examines the relationship between a book’s number of ratings and its number of text reviews. There is a general upward trend: books with more ratings tend to have more text reviews. However, the relationship is not perfectly linear. Some books receive thousands of ratings but relatively few written reviews, whereas others have a high number of written reviews relative to their rating count.
Written reviews often reflect stronger emotional reactions, whether positive or negative. Books that spark conversation may generate more text reviews even if their overall rating count is modest. This helps us distinguish popularity from reader investment.
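One way to put a number on the difference between popularity and reader investment is the share of ratings that come with a written review. The sketch below is only an illustration of that idea: the function name reviews_per_rating is hypothetical, and the example counts are made up rather than taken from the dataset.

```python
def reviews_per_rating(ratings_count, text_reviews_count):
    """Share of ratings accompanied by a written review.

    Returns None when a book has no ratings, since the ratio is undefined.
    """
    if ratings_count <= 0:
        return None
    return text_reviews_count / ratings_count


# Made-up counts for illustration only.
popular = reviews_per_rating(200000, 4000)  # many ratings, few written reviews
niche = reviews_per_rating(500, 60)         # few ratings, many written reviews

print(f"popular book: {popular:.2%} of ratings include a review")
print(f"niche book:   {niche:.2%} of ratings include a review")
```

Under this measure, a book with a modest rating count can still rank high on engagement if a larger share of its raters took the time to write something.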
This histogram illustrates the distribution of book lengths, with most books falling between 200 and 400 pages. This suggests that mid-length books dominate in the Goodreads catalog, perhaps reflecting readers’ preference for manageable reading commitments. Understanding this helps interpret other metrics and raises questions about how book length influences reader satisfaction and popularity.
This scatterplot examines whether the number of pages in a book is related to its average rating. While most books are below 1,000 pages, the trend line suggests a subtle pattern: longer books tend to maintain slightly higher average ratings, especially beyond the 1,000-page mark. This may reflect that readers who commit to longer books are more invested and selective, leading to more favorable ratings. Longer books might also create a sense of deeper connection or immersion, resonating more with dedicated readers. The wide spread of ratings among shorter books shows that length alone does not guarantee quality.
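A trend line of this kind is typically an ordinary least-squares fit. The sketch below shows how the slope and intercept of such a line could be computed by hand. It is not necessarily how the line in the plot was produced, the function name fit_line is hypothetical, and the page counts and ratings are made-up values for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x; zero means all x values are the same
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, slope is undefined")
    # Sum of products of deviations of x and y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values for illustration only.
pages = [100, 300, 500, 900, 1200]
ratings = [3.7, 3.9, 3.9, 4.0, 4.2]

slope, intercept = fit_line(pages, ratings)
print(f"estimated change in rating per 100 pages: {slope * 100:.3f}")
```

A positive slope would be consistent with longer books earning slightly higher ratings, although a fitted line describes an association and does not show that length causes higher ratings.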
Secondary Data Source
One of the most interesting aspects of the Goodreads dataset is its ability to capture both the quantitative and qualitative aspects of reader engagement through written reviews. While the primary dataset provides a clear picture of how the books are rated, it doesn’t tell us much about the effort and depth behind the ratings. I wanted to see if reader enthusiasm is associated with more detailed feedback.
If highly rated books consistently have longer reviews, this suggests that positive reception is tied not only to star ratings but also to deeper engagement, and vice versa. Either way, examining review length alongside average ratings provides a better understanding of how readers interact with books beyond simply clicking a star.
I chose to focus on the romance genre and selected five books from the primary dataset. This sample doesn’t represent all genres, so I recommend examining another genre, such as self-help, where reviews may vary more widely depending on how strongly the book resonated with the reader.
Books like Paradise not only earn high ratings but also inspire longer, more detailed reviews, suggesting that emotional resonance in this genre often translates into richer written feedback. Titles with lower ratings tend to attract shorter, less invested reviews. While this analysis is limited to romance, it highlights how review length can serve as a proxy for reader enthusiasm.
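Review length can be measured in several ways. The sketch below uses the average number of words per review, which is one simple option and only an assumption about how length might be measured. The function name and the two example reviews are hypothetical.

```python
from statistics import mean


def mean_review_length(reviews):
    """Average number of whitespace-separated words per review."""
    if not reviews:
        return 0.0
    return mean([len(text.split()) for text in reviews])


# Made-up reviews for illustration only.
example_reviews = [
    "Loved every page of this story",
    "Not for me",
]

print(mean_review_length(example_reviews))
```

Computing this value for each book and placing it next to the book's average rating gives a simple table for comparing enthusiasm with effort.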
Conclusion
My exploration of Goodreads data reveals that reader engagement is multifaceted. It is shaped not only by how many stars a book earns but also by how deeply readers interact with it. Most books receive high ratings, yet written reviews are less common. While mid-length books dominate the catalog, longer books tend to earn slightly higher ratings. Interestingly, books with many ratings do not always attract a proportionate number of written reviews, which suggests that emotional investment and reflection do not always align with popularity. Looking at ratings and reviews together gives a better understanding of how readers respond to books.
The data suggest that popularity is associated with higher average ratings, more ratings, and longer book length. Review length also appears to be a meaningful indicator: in the romance sample, books that resonate emotionally often attract more detailed written feedback. While these descriptive insights do not provide a complete picture, they highlight clear associations and open the door to further analyses that can explain why certain books capture both widespread attention and deeper reader investment.
Thanks for reading,
Samantha Krah