Introduction

In this final project, I analyze the top 50 bestselling books on Amazon for each year from 2009 through 2022. The data set was built in R, using code that scrapes book data from Amazon. I obtained it from Kaggle (https://www.kaggle.com/datasets); the details of the specific data set I analyzed are here (https://www.kaggle.com/datasets/chriskachmar/amazon-top-50-bestselling-books-2009-2022). As of 05/12/2022, over 1,800 people have downloaded the data set and it has around 9,000 views.


What I Used

I focused on mastering Tableau in order to complete this project. Tableau allowed me to create a variety of line graphs, bar graphs, scatter plots and tables. I then used R Markdown (RMD) and RPubs to write and publish this webpage.


An Exploration

This project is classified as an exploration. This means that there is no set prediction or hypothesis. My goal is to explore and observe trends over the years.

I am analyzing numerous variables from this data set. I will be looking at name, author, year, genre, user rating, reviews and price.

Name - Name of Book

Author - Author of Book

Year - Year Released on Amazon (2009 - 2022). The last update was March 26th, 2022.

Genre - Fiction or Non-Fiction

User Rating - What rating Amazon users/readers have given the books (1-5 Scale)

Reviews - Number of Reviews for Each Book

Price - Price of Book in US Dollars

Let's get started.


All Books in Data Set

This first table shows all books within the data set. The table shows the book name, author, year, genre, price, number of reviews and user rating. This table allows you to use the filter on the side to examine the data by year.


Analyzing Book Price

This bar graph shows the average price of all non-fiction and fiction books from 2009 to 2022. Non-fiction books have a higher average price than fiction books: the average price is $14.338 for non-fiction and $10.663 for fiction. It is important to note that there are more non-fiction books (388) than fiction books (312). Because the genres differ in size, they do not contribute equally to the overall average price across all books.
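As a quick arithmetic check, the two genre averages and counts reported above can be combined into one count-weighted average for all books. This is only a sketch in Python (the analysis itself was done in Tableau), and the variable names are my own.

```python
# Genre counts and average prices as reported in the text above.
counts = {"Non-Fiction": 388, "Fiction": 312}
avg_price = {"Non-Fiction": 14.338, "Fiction": 10.663}

# 388 + 312 books in total, i.e. 50 books for each of the 14 years.
total_books = sum(counts.values())

# Weight each genre's average by the number of books in that genre.
overall_avg = sum(counts[g] * avg_price[g] for g in counts) / total_books

print(total_books)
print(round(overall_avg, 2))
```

The total comes to 700 books, and the weighted average lands at roughly $12.70, closer to the non-fiction average because that genre has more books.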

Now, let's dive deeper and look at the average price of books over the years.

The green line is for non-fiction books, and the pink line is for fiction books. I labeled the lowest and highest average prices.

This line chart of average price over the years shows several notable patterns.

Right off the bat, it is interesting to see how both curves rise after 2009, hit a peak, then slope slightly downward until 2019. One likely reason is that 2009 was just two years after Amazon started selling Kindles. Kindles took the world by storm. By 2010, Amazon was selling more e-books than paper books (https://www.history.com/this-day-in-history/amazon-opens-for-business). This may have contributed to a spike in price, as Amazon could raise prices to maximize profit.

Non-fiction book prices peaked in 2014. This seems to be a direct result of the psychology manual “Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5.” In 2014, this book cost $105. This book is the “backbone” of all psychology.

Fiction book prices peaked in 2016. This is due to the release of five special edition Harry Potter books: 5 of the top 50 books in 2016 were written by J.K. Rowling. Harry Potter was hugely popular, and Amazon made it convenient for readers to obtain these page-turners.

Average prices in both genres decreased around 2017 but have more recently started to increase slightly. The lowest average prices were in 2019 for non-fiction and 2018 for fiction.
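The yearly averages behind a chart like this can be sketched in a few lines of Python. This is an illustration only, not how the Tableau chart was built: the rows below are made-up values (not taken from the data set), the dictionary keys simply mirror the variable names listed earlier, and the helper names average_price_by_year and peak_year are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up illustrative rows, NOT values from the real data set.
rows = [
    {"Year": 2013, "Genre": "Fiction", "Price": 10.0},
    {"Year": 2013, "Genre": "Fiction", "Price": 14.0},
    {"Year": 2014, "Genre": "Fiction", "Price": 9.0},
    {"Year": 2014, "Genre": "Fiction", "Price": 11.0},
    {"Year": 2013, "Genre": "Non-Fiction", "Price": 12.0},
    {"Year": 2014, "Genre": "Non-Fiction", "Price": 105.0},
    {"Year": 2014, "Genre": "Non-Fiction", "Price": 15.0},
]

def average_price_by_year(rows):
    """Return {genre: {year: mean price}} for the given rows."""
    prices = defaultdict(lambda: defaultdict(list))
    for row in rows:
        prices[row["Genre"]][row["Year"]].append(row["Price"])
    return {
        genre: {year: mean(values) for year, values in by_year.items()}
        for genre, by_year in prices.items()
    }

def peak_year(averages_for_genre):
    """Return the year with the highest average price."""
    return max(averages_for_genre, key=averages_for_genre.get)

averages = average_price_by_year(rows)
for genre, by_year in averages.items():
    print(genre, by_year, "peak:", peak_year(by_year))
```

In the toy rows, a single $105 book is enough to move the non-fiction peak to 2014, which is the same kind of effect one very expensive title can have on a year of only 50 books.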

To create an engaging way to learn more about the prices of books each year, I created an interactive bar graph.

This bar graph allows you to see the price of each book, who wrote it, and in what year. To see the most expensive books from a specific year, you can use the filter on the right-hand side to select the year you would like to see.


Analyzing Book Reviews

Amazon users are able to leave reviews online for books available on Amazon. This data set shows how many reviews each individual book received. Unfortunately, this data set does not provide what the reviews say, just the count of reviews given for each book.

This line chart shows the number of reviews for the top 50 books each year from the years 2009 to 2022.

The number of reviews peaked in 2020 for both non-fiction and fiction books.

One plausible explanation is that the COVID-19 lockdown beginning in March of 2020 gave people more free time to read and review books, which would directly increase the number of reviews recorded.

This bar graph analyzes the number of reviews for books released on Amazon in 2020 - the year COVID took over.

“Where the Crawdads Sing” by Delia Owens received by far the most reviews. An article published by the New York Times at the end of 2019 explains, “‘Crawdads’ has sold more print copies than any other adult title this year — fiction or nonfiction — according to NPD BookScan, blowing away the combined print sales of new novels by John Grisham, Margaret Atwood and Stephen King. Putnam has returned to the printers nearly 40 times to feed a seemingly bottomless demand for the book. Foreign rights have sold in 41 countries” (https://www.nytimes.com/2019/12/21/books/where-the-crawdads-sing-delia-owens.html).

This book was enormously popular, and at a time when people were locked in their houses due to a global pandemic, reading it brought people joy and something to talk about.


Analyzing Book Count

This line chart shows the number of non-fiction and fiction books each year. Each year, the two genres together total 50 books. To see the specific count of each genre, you can hover your mouse over the chart.

The number of non-fiction books peaked in 2015. A few of the most reviewed non-fiction books in 2015 were “The 5 Love Languages” by Gary Chapman, “How to Win Friends and Influence People” by Dale Carnegie and “The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics” by Daniel James Brown.

To analyze this spike in non-fiction books, I created a bar graph showing the top non-fiction books on Amazon in 2015. This bar graph also shows the price, number of reviews and average user rating for each book.


Analyzing Book User Ratings

Amazon users have the ability to rate a book on a scale of 1-5. 1 means the reader did not like the book and 5 means they loved the book. Because this data set is the “top” 50 books from each year, the user rating is “good” for all the books.

This first bar graph shows the average user rating for all books in each genre across all years combined. Fiction is rated slightly higher at 4.66, while non-fiction has an overall user rating of 4.62. This is interesting because there are more non-fiction books, yet fiction has the higher average rating.

The next line graph shows the average user ratings over the years. As you can see, the line stays relatively stable at a high average user rating.

The lowest average user rating in the entire period was in 2012 for fiction books, which had an average user rating of 4.4952. I wanted to figure out why this number was so low, so I created a bar graph that shows the user ratings for fiction books in 2012.

It turns out this dip was due to a book written by J.K. Rowling called “The Casual Vacancy.” This book most likely had a low user rating because it was being compared to J.K. Rowling’s earlier, well-renowned Harry Potter books.

The journalist Lev Grossman wrote in Time magazine, “it’s not really possible to open The Casual Vacancy without a lot of expectations both high and low crashing around in your brain and distorting your vision. There’s no point pretending they’re not there. I know I had a lot of, let’s call them feelings when I opened the book. I have spent many, many hours reading J.K. Rowling’s work. I am a known Harry Potter fan” (https://time.com/4132710/j-k-rowlings-the-casual-vacancy-weve-read-it-heres-what-we-think/). I think this quote sums up how a lot of people felt about this book when it was released in 2012.

Furthermore, this interactive bar chart allows you to see the highest rated books from each year. It is interesting to click through this bar chart, as you will notice that every year there is at least one author who has multiple highly rated books. Because the user ratings are summed per author, an author with several books on the list rises to the top of the bar graph. If you click “all,” you can see all the authors and books ranked by user rating.
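To make the effect of summing concrete, here is a small Python sketch that ranks authors two ways: by the sum of their ratings (as the chart does) and by their average rating. The authors and ratings are made up for illustration and are not from the data set.

```python
from collections import defaultdict

# Made-up (author, user rating) pairs for one year; not real data.
books = [
    ("Author A", 4.8),
    ("Author A", 4.8),
    ("Author B", 4.9),
    ("Author C", 4.7),
]

# Collect every rating under its author.
ratings = defaultdict(list)
for author, rating in books:
    ratings[author].append(rating)

# Summing rewards authors with several books on the list...
by_sum = sorted(ratings, key=lambda a: sum(ratings[a]), reverse=True)

# ...while averaging ranks authors by how well each book was rated.
by_mean = sorted(
    ratings, key=lambda a: sum(ratings[a]) / len(ratings[a]), reverse=True
)

print(by_sum)
print(by_mean)
```

With these toy values, Author A leads the summed ranking on the strength of two books, while Author B leads the averaged ranking with a single higher-rated book. Which aggregation is appropriate depends on whether the question is about volume or about per-book quality.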

Overall, Jeff Kinney has the most highly rated books from 2009 to 2022. This is a direct result of his Diary of a Wimpy Kid series. Jeff Kinney has had a book on the Amazon top 50 list every year except 2022.

Looking specifically at 2022, Colleen Hoover is at the top of the list for highest user-rated books. This sparked my curiosity, and I wanted to examine Colleen Hoover’s success over the years covered by the data set. The results I found were interesting.

Below is the table created to analyze the data for Colleen Hoover’s books.

Colleen Hoover’s books have only been on Amazon’s top 50 in 2021 and 2022. Her books are all fiction, very reasonably priced and all have perfect user ratings.


Comparing User Rating and Reviews

After analyzing price, genre, count, reviews and user rating, I wanted to create scatterplots to see if there is any relationship between the quantitative variables (user rating, reviews and price).

This first scatterplot shows the relationship between user rating and the number of reviews each year.

The scatterplots for each year all look relatively similar and show no correlation. Most data points sit toward the bottom right side of the graph, although each year does have a few outliers. The outliers in 2022 are “Where the Crawdads Sing” and “Midnight Library.” These outliers have a significantly larger number of reviews than the other books of 2022.


Comparing Variables - User Rating and Price

User rating and price seem to have no correlation. The data points tend to sit on the right side of the graph with no positive or negative relationship. Similar to the first scatterplots, these plots also have outliers. The outlier in 2022 with a high price is “The Complete Maus: A Survivor’s Tale.”


Comparing Variables - Price and Reviews

This scatterplot for 2022 seems to show a very weak negative linear relationship. We can look at the R-squared value, which tells us the proportion of the variance in a dependent variable that is explained by an independent variable (to see the R-squared value, click on the line of best fit).

I interpreted the R-squared values for 2022 below.

Fiction - 5% of the variance in reviews (dependent variable) can be explained by price (independent variable).

Non-Fiction - 50% of the variance in reviews can be explained by price.
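For readers who want to see where an R-squared value comes from, here is a minimal Python sketch for a straight-line fit with one independent variable, where R-squared equals the squared Pearson correlation. It is not how Tableau computes its trend line internally; the function name r_squared is my own, and the price and review numbers are made up rather than taken from the 2022 data.

```python
def r_squared(x, y):
    """Squared Pearson correlation of x and y.

    For a linear fit with a single independent variable, this equals
    the proportion of variance in y explained by x.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Made-up prices (independent) and review counts (dependent).
price = [5.0, 8.0, 10.0, 12.0, 15.0]
reviews = [90000, 70000, 65000, 40000, 35000]

print(round(r_squared(price, reviews), 3))
```

A value near 1 means price accounts for most of the variation in reviews, and a value near 0 means it accounts for almost none. R-squared by itself does not show the direction of the relationship; that comes from the slope of the line of best fit.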


Conclusion

I enjoyed completing this analysis for many reasons. First, I was able to recognize a significant number of the books and authors that I was seeing. For example, Harry Potter was one of my favorite series growing up and clearly, I was not the only one who loved it! I also enjoyed analyzing trends and making observations about the spikes or dips shown in the data, specifically the number of book reviews in 2020. Furthermore, it was interesting to see that many of the top books on Amazon are not books just for pleasure reading. For example, the DSM-5 was on the top 50 list for a few years, as were coloring books. These are just a few takeaways I obtained from this data set.

After finishing this long project, I feel proficient in Tableau and I am excited to use it in my future classes and internship.