Amazon 100 Bestsellers
Introduction
For this project, I chose to analyze a data set containing the top 100 trending books in 2024. The data set includes information such as rankings, ratings, prices, genres, authors, and publication years. Books and entertainment media are interesting to analyze because popularity and success are influenced by many different factors, including genre popularity, pricing, author credibility, and audience trends. Since books are a major part of the entertainment industry, analyzing this data can reveal patterns in what types of books perform well and what characteristics are associated with highly ranked titles.
The main questions I want to explore are whether certain genres tend to receive higher ratings, whether newer books outperform older ones, and whether book price has any relationship with popularity or reader ratings. I am also interested in examining which authors or genres appear most frequently among the top-ranked books. My hypothesis is that recently published books and popular genres such as fantasy and fiction will dominate the trending rankings. Through data transformation, visualizations, and analysis, I hope to uncover meaningful trends within the book market and better understand what drives book popularity and success.
Amazon Books Data Set
The data set I will use in this project contains information on the top 100 books on Amazon in 2024 and includes several variables that describe different characteristics of each book. These variables help provide insight into popularity, reader reception, pricing, and overall trends within the book market.
book title – The name of the book.
author – The writer of the book. This can be used to analyze which authors appear most frequently or have the highest-rated books.
genre – The category or type of book, such as fiction, fantasy, romance, or mystery. We will analyze this to identify which genres are most popular or highly rated.
rating – The average reader rating for the book, typically measured on a 5-star scale. We predict that books ranked higher on the list will also have higher ratings.
book price – The listed price of the book. This can be analyzed to determine whether more expensive books are associated with higher ratings or rankings.
rank – The position of the book on the trending list.
year of publication – The year the book was published. This allows analysis of whether newer or older books are higher on the trending list.
Together, these variables provide a strong foundation for exploring trends in book popularity, reader preferences, and genre performance through both data analysis and visualization.
Price Analysis
Above we are looking at the distribution of book prices listed on Amazon. Most books were priced between $4 and $20, with a few outliers reaching upwards of $45.
This next visualization shows book prices against their ratings on Amazon. It should tell us whether more expensive books, which we might expect to be higher quality, provide a more enjoyable experience to readers. As we can see from the graph, the two highest-priced books are rated around 4.75 to 4.85, and a large majority of lower-priced books also sit in this range. This may indicate that higher-priced books hit this mark more consistently, but lower-priced books are not limited to low ratings.
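One rough way to check the consistency idea numerically is to compare the spread of ratings for cheaper and more expensive books. The sketch below uses base R on a small made-up data frame; the column names `price` and `rating`, the median split, and every value are assumptions for illustration, not the project's actual data or code.

```r
# Toy data: values are invented for illustration, not taken from the Amazon data set
books <- data.frame(
  price  = c(4.99, 9.99, 12.50, 18.00, 45.00),
  rating = c(4.6, 4.8, 4.7, 4.8, 4.8)
)

# Spearman rank correlation is less sensitive to a few very expensive outliers
price_rating_cor <- cor(books$price, books$rating, method = "spearman")

# Split books at the median price (an arbitrary cutoff chosen for this sketch)
books$price_band <- ifelse(books$price > median(books$price), "higher", "lower")

# Standard deviation of ratings within each price band:
# a smaller value means more consistent ratings
rating_spread <- tapply(books$rating, books$price_band, sd)
```

With only a handful of expensive books in the real list, any difference in spread should be read cautiously.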
Year Analysis
Our next analysis looks at the average rating for books on the Top 100 list by the year they were published. Earlier, we hypothesized that newer books would surpass older books in terms of popularity and ratings, but the graph above disagrees. Before the year 2000, the yearly average rating reached 4.9 six separate times, most recently in 1999. After 2000, the yearly average rating has been higher than 4.8 only once.
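A yearly average like the one plotted above can be computed with a grouped summary. This is a minimal sketch assuming the dplyr package and hypothetical column names `year` and `rating`; the toy values are invented. It also counts the books per year, since an average based on a single book is much noisier than one based on several.

```r
library(dplyr)

# Toy data: values are invented for illustration only
books <- tibble(
  year   = c(1995, 1995, 1999, 2005, 2021, 2021),
  rating = c(4.9, 4.7, 4.9, 4.6, 4.8, 4.6)
)

# Average rating and number of books for each publication year
avg_by_year <- books %>%
  group_by(year) %>%
  summarise(
    avg_rating = mean(rating),
    n_books    = n(),
    .groups    = "drop"
  ) %>%
  arrange(year)
```

Keeping `n_books` next to `avg_rating` makes it easy to see which yearly peaks rest on only one title.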
Genre Analysis
The slice_head function also allows us to create a visualization of the top 10 authors with the highest average rating. Although we tried to predict the most popular genre combination from our earlier analysis, we were wrong: the graph above shows that Fiction, Mystery, Crime, Legal Thriller is the highest-rated genre combination.
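The slice_head step described above can be sketched as follows: compute the average rating per group, sort from highest to lowest, and keep the first 10 rows. The example assumes dplyr, hypothetical column names `genre` and `rating`, and invented values; grouping by an author column instead would follow the same pattern.

```r
library(dplyr)

# Toy data: values are invented for illustration only
books <- tibble(
  genre  = c("Fantasy", "Fantasy", "Mystery", "Romance", "Mystery"),
  rating = c(4.7, 4.5, 4.9, 4.6, 4.8)
)

# Average rating per genre, sorted high to low, keeping at most the top 10
top_genres <- books %>%
  group_by(genre) %>%
  summarise(avg_rating = mean(rating), .groups = "drop") %>%
  arrange(desc(avg_rating)) %>%
  slice_head(n = 10)
```

When there are fewer than 10 groups, as in this toy data, slice_head simply returns all of them.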
Implications and Real-world Impact
This data set provides useful information on how books perform across genre, price, authorship, and publication year. When analyzed together, these variables reveal patterns that can be applied to real publishing and consumer behavior.
This data could give authors concrete guidance for developing writing strategies, such as which genres they write, how much they charge for their books, and when they decide to release them.
Secondary Data
head(book_second_df)
                        Title               Author
1  \n Pride and Prejudice\n          Jane Austen
2                 \n 1984\n        George Orwell
3     \n The Great Gatsby\n  F. Scott Fitzgerald
4            \n Jane Eyre\n     Charlotte Brontë
5 \n Crime and Punishment\n    Fyodor Dostoevsky
6               \n Lolita\n     Vladimir Nabokov
The secondary data we pulled came from Goodreads, a popular book rating site. We pulled its Top 100 novels of all time, including each book's title and author.
Because Amazon hides information from web scrapers, which we saw when using the browser's inspector tool, Goodreads was the only source we were able to scrape. Unfortunately, the Goodreads list provides only the author and title of each book, which is not enough data to build a graph or draw insights from.
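The Title column in the head() output above still carries the newline and space padding from the scraped page. A minimal cleanup sketch in base R is shown below; the two-row data frame is a stand-in that reuses the name `book_second_df`, and it is an assumption about how the scraped values are stored rather than the project's actual code.

```r
# Stand-in for the scraped data: two rows copied from the output above
book_second_df <- data.frame(
  Title  = c("\n Pride and Prejudice\n", "\n 1984\n"),
  Author = c("Jane Austen", "George Orwell")
)

# trimws() removes leading and trailing spaces, tabs, and newlines
book_second_df$Title <- trimws(book_second_df$Title)
```

Cleaned titles would be needed before the Goodreads list could be matched by title against any other source.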
Conclusion
This project analyzed a book data set using variables such as genre, price, rating, and author to explore patterns in literary popularity and reader enjoyment. Through data cleaning, web scraping, and visualization in R, the analysis highlighted how different factors contribute to a book’s success and reception.
Additionally, comparing ratings across publication years showed that newer books did not outperform older books, the authors who appeared most frequently did not write in the most popular genre, and book price affected the consistency of ratings rather than the maximum possible rating.