Goodreads Ethical Web Scraping

Author

Cullen Ryle

Introduction

I plan to analyze the top 100 highest-rated books on Goodreads.com to investigate whether any common characteristics link these titles together. The central research question is: What shared features, if any, contribute to a book being among the top-rated on Goodreads? To explore this, I will examine several variables, including each book’s position on the list, genre, individual rating, average rating, total number of reviews, and the number of users who have rated the book. By analyzing these variables, I aim to uncover patterns or trends that may explain why these books are consistently recognized as top performers by the Goodreads community. I chose this topic because I love reading; one of my favorite books from the past year was Fiddler's Green by Richard McKenna.

Analysis type breakdown

For this analysis I will use the R package rvest to scrape the Goodreads list page for the variables listed in the introduction. After collecting the data with an R script, I will combine it into a single table and then use other R packages to plot pairs of variables in different graph types and look for correlations between them.

For reference, the data is scraped from the following webpage:

https://www.goodreads.com/list/show/153860.Goodreads_Top_100_Highest_Rated_Books_on_Goodreads_with_at_least_10_000_Ratings
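
As a rough illustration of the scraping step described above, the sketch below parses a small inline HTML fragment with rvest and collects the fields into one table. The markup, the CSS selectors (tr.book, td.number, a.bookTitle, span.minirating, a.score), and every value in the fragment are assumptions made up for this example, not the real structure or data of the Goodreads page. The live page should be inspected before choosing selectors, and its robots.txt and terms of use checked before sending any requests.

```r
library(rvest)

# Hypothetical stand-in for the list page. The markup and all values are
# invented for illustration and are NOT real Goodreads data.
page <- minimal_html('
  <table>
    <tr class="book">
      <td class="number">1</td>
      <td>
        <a class="bookTitle"><span>Example Book A</span></a>
        <span class="minirating">4.80 avg rating - 12,345 ratings</span>
        <a class="score">score: 9,876</a>
      </td>
    </tr>
    <tr class="book">
      <td class="number">2</td>
      <td>
        <a class="bookTitle"><span>Example Book B</span></a>
        <span class="minirating">4.65 avg rating - 10,200 ratings</span>
        <a class="score">score: 4,321</a>
      </td>
    </tr>
  </table>
')
# For a live request, the page would instead come from read_html() on the list URL.

# One row per book; html_element() returns exactly one match per row.
rows <- html_elements(page, "tr.book")
field <- function(css) html_text2(html_element(rows, css))
minirating <- field("span.minirating")

# Combine the scraped fields into a single table, converting text to numbers.
books <- data.frame(
  position      = as.integer(field("td.number")),
  title         = field("a.bookTitle"),
  avg_rating    = as.numeric(sub(" avg rating.*", "", minirating)),
  ratings_count = as.numeric(gsub(",", "", sub(".* ([0-9,]+) ratings.*", "\\1", minirating))),
  score         = as.numeric(gsub("[^0-9]", "", field("a.score")))
)
print(books)
```

Parsing the text into numeric columns at this stage keeps the later plotting code free of string handling.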

Analysis type breakdown

Book position by Score

The graph Bp by S shows the top 10 books on the list by overall score. Goodreads calculates the score from the number of people who voted for the book and how highly those voters ranked it. The graph shows that the first-ranked book's score is far beyond the others. Because the score depends on how many people voted, this comparison could be skewed by one book simply having more voters than another.

Rating Count by Rating

The graph RC by R plots each book's total number of ratings against its average rating. It shows that ratings counts vary widely across the top 100, with only a few books having significantly more ratings than the others. On this list, position is indicated by score, not by average rating. From this we glean that books with high average ratings sometimes have a lower number of ratings and are not listed in first position.

Book Rank by Avg. Rating

The graph BR by Avg. R shows each book's list rank and its average rating. This gives us insight into the difference between a book's rank and its average rating. The main purpose of this graph is not to gain significant insight but simply to visualize, to scale, how the ranked books compare against each other.

Distribution by Average Ratings

The graph D by Avg. R shows that the books' average ratings cluster around 4.6. As the rating increases, there are fewer and fewer books. Given how quickly the count falls as ratings rise, we could assume a near-exponential increase in difficulty for books closer to the highest rating. This shows that even among books considered the best, it is harder still to rise above comparable titles.
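
One way to probe the "near-exponential" reading is to count books per rating bin and compare each bin with the one before it. The sketch below does this in base R on a handful of made-up average ratings; the values and the 0.1-wide bins are assumptions for illustration only, not the scraped data.

```r
# Hypothetical average ratings, invented for illustration only.
avg_rating <- c(4.52, 4.54, 4.55, 4.57, 4.58, 4.61, 4.63, 4.66, 4.72, 4.81)

# Count books in 0.1-wide bins; right = FALSE makes each bin [lower, upper).
bins <- cut(avg_rating, breaks = c(4.5, 4.6, 4.7, 4.8, 4.9), right = FALSE)
counts <- as.vector(table(bins))
names(counts) <- levels(bins)
print(counts)

# Ratio of each bin's count to the previous bin's count.
# Roughly constant ratios below 1 would be consistent with exponential decay.
ratios <- counts[-1] / counts[-length(counts)]
print(round(ratios, 2))
```

With only 100 books the upper bins hold very few titles, so these ratios will be noisy and should be read as a rough check rather than a fitted model.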

Position by Rating

The P by R graph shows that books with a higher ratings count also tend to hold a higher position on the list. We could assume that the positions of some books are skewed by their higher ratings counts, which would throw off the final ranking.
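
The relationship between position and ratings count can also be summarized with a rank correlation. The sketch below uses base R's cor() with Spearman's method on made-up numbers; the data frame and its values are assumptions for illustration, not the scraped results.

```r
# Hypothetical values, invented for illustration only.
books <- data.frame(
  position      = 1:6,
  ratings_count = c(90000, 60000, 75000, 20000, 30000, 12000)
)

# Spearman's rho compares rank orders, so a few very large counts
# do not dominate the result the way they would with Pearson's r.
rho <- cor(books$position, books$ratings_count, method = "spearman")
print(rho)
```

Because position 1 is the top of the list, a negative rho here means that books nearer the top tend to have more ratings.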

Final Analysis

To investigate what characteristics define the top 100 highest-rated books on Goodreads, several visualizations were created using key variables including book position, average rating, review count, and overall score. The graph comparing book position to score reveals that the top-ranked book stands out significantly, suggesting that total score is heavily influenced by the volume of ratings a book receives, rather than just its quality. Similarly, when examining review count versus average rating, we see that books can achieve high ratings even with fewer reviews, indicating that popularity does not always correlate with perceived quality.

A distribution analysis of average ratings shows that most top books cluster around an average rating of 4.6, with very few reaching higher scores. This sharp decline suggests that achieving near-perfect ratings is rare, even among the highest-rated books, highlighting the competitive nature of standing out at the very top. Finally, plotting book position against ratings count shows a trend where books with more reviews often occupy higher-ranked positions, hinting that broader readership may play a role in boosting visibility and score. However, this also raises the concern that rating counts could skew final rankings, favoring well-known titles over lesser-known but equally well-reviewed ones.

Together, these findings suggest that while average rating is important, the number of reviews and total score also play significant roles in determining which books make it to the top of the list—indicating that visibility and community engagement may be just as influential as quality alone.