1 Background



Amazon is a pioneer of marketplace websites. It was founded on 5 July 1994 by Jeff Bezos and quickly became number one in The USA before taking over the other country. At its first conception, Amazon used to only sell books. So it’s only fitting that we do a simple analysis of Books sold on Amazon. We’ll explore several aspects of the data like book titles, authors, genres, prices, ratings, and the number of ratings received. This analysis aims to uncover trends and preferences within Amazon’s book marketplace.

The dataset was taken from Kaggle. Please note that the data extracted is from India, so there may be differences from our local reality.

2 Data Preparation

The amazon books dataset has three csv files: Books_df.csv, Genre_df.csv, and Sub_Genre_df.csv In consideration of the scope of this assignment, we will only use the Books_df.csv file as a data source.

First, read Books_df.csv and assign it to books:

books <- read.csv("data_input/Books_df.csv")

2.1 Data Inspection

Check the head of books:

head(books)

Check the tail of books:

tail(books)

Let’s see the structure of books dataframe:

str(books)
## 'data.frame':    7928 obs. of  10 variables:
##  $ X                  : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Title              : chr  "The Complete Novel of Sherlock Holmes" "Black Holes (L) : The Reith Lectures [Paperback] Hawking, Stephen" "The Kite Runner" "Greenlights: Raucous stories and outlaw wisdom from the Academy Award-winning actor" ...
##  $ Author             : chr  "Arthur Conan Doyle" "Stephen Hawking" "Khaled Hosseini" "Matthew McConaughey" ...
##  $ Main.Genre         : chr  "Arts, Film & Photography" "Arts, Film & Photography" "Arts, Film & Photography" "Arts, Film & Photography" ...
##  $ Sub.Genre          : chr  "Cinema & Broadcast" "Cinema & Broadcast" "Cinema & Broadcast" "Cinema & Broadcast" ...
##  $ Type               : chr  "Paperback" "Paperback" "Kindle Edition" "Paperback" ...
##  $ Price              : chr  "₹169.00" "₹99.00" "₹175.75" "₹389.00" ...
##  $ Rating             : num  4.4 4.5 4.6 4.6 4.5 4.5 4.6 4.5 4.5 4.5 ...
##  $ No..of.People.rated: num  19923 7686 50016 32040 1707 ...
##  $ URLs               : chr  "https://www.amazon.in/Complete-Novels-Sherlock-Holmes/dp/8175994312/ref=zg_bs_g_1318054031_d_sccl_1/000-0000000-0000000?psc=1" "https://www.amazon.in/Black-Holes-Lectures-Stephen-Hawking/dp/085750357X/ref=zg_bs_g_1318054031_d_sccl_2/000-00"| __truncated__ "https://www.amazon.in/Kite-Runner-Khaled-Hosseini-ebook/dp/B00B0CR0O6/ref=zg_bs_g_1318054031_d_sccl_3/000-00000"| __truncated__ "https://www.amazon.in/Greenlights-Raucous-stories-Academy-Award-winning/dp/1472280873/ref=zg_bs_g_1318054031_d_"| __truncated__ ...

The explanation of each column is:

  • Title: The title of the book

  • Author: Name of the author or publication house

  • Main Genre: The main genre the book belongs to

  • Sub Genre: The specific sub-genre of the book

  • Type: Indicates the format of the book, such as paperback, Kindle, audiobook, or hardcover

  • Price: The price of the book

  • Rating: The average rating of the book given by users

  • No. of People Rated: Indicates the count of users who have rated the book

  • URLs: Provides the link to the book’s page on Amazon for further details and purchase options

We also want to know how many rows and columns there are in books:

dim(books)
## [1] 7928   10

There are 7928 rows and 10 columns.

2.2 Data Cleansing & Coertions

We have to remove unused columns such as X and URLs, and make the count of columns down to 8 columns:

books <- subset(books, select = c(-X, -URLs))
dim(books)
## [1] 7928    8

The next step is to make sure that all the data types are correct. After taking a careful look, we noticed that the data type in the Price column is wrong. It’s supposed to be a double data type but instead, it’s a character data type. We first will remove the currency symbol of the Price column before changing it to double data type:

books$Price <- sub("₹", "", books$Price)
books$Price <- as.double(books$Price)
## Warning: NAs introduced by coercion

The warning of NAs introduced by coercion means that some output values are non-numeric. We will get back to this later. We also will check for unique values to determine whether it’s a factor data type. Nevertheless, because this is books dataset, it must have many unique values. Check for how many unique values for Main.Genre and Sub.Genre

length(unique(books$Main.Genre))
## [1] 30
length(unique(books$Sub.Genre))
## [1] 151
length(unique(books$Type))
## [1] 33

There are a lot of values for Main.Genre, Sub.Genre, and Type. However because the data above have more than 10 unique values, we wouldn’t change it to factor data type. Let’s recheck for newly changed data types in books:

str(books)
## 'data.frame':    7928 obs. of  8 variables:
##  $ Title              : chr  "The Complete Novel of Sherlock Holmes" "Black Holes (L) : The Reith Lectures [Paperback] Hawking, Stephen" "The Kite Runner" "Greenlights: Raucous stories and outlaw wisdom from the Academy Award-winning actor" ...
##  $ Author             : chr  "Arthur Conan Doyle" "Stephen Hawking" "Khaled Hosseini" "Matthew McConaughey" ...
##  $ Main.Genre         : chr  "Arts, Film & Photography" "Arts, Film & Photography" "Arts, Film & Photography" "Arts, Film & Photography" ...
##  $ Sub.Genre          : chr  "Cinema & Broadcast" "Cinema & Broadcast" "Cinema & Broadcast" "Cinema & Broadcast" ...
##  $ Type               : chr  "Paperback" "Paperback" "Kindle Edition" "Paperback" ...
##  $ Price              : num  169 99 176 389 348 ...
##  $ Rating             : num  4.4 4.5 4.6 4.6 4.5 4.5 4.6 4.5 4.5 4.5 ...
##  $ No..of.People.rated: num  19923 7686 50016 32040 1707 ...

2.3 Missing Data

Now, we need to make sure that there are no missing values. Let’s check for missing values:

colSums(is.na(books))
##               Title              Author          Main.Genre           Sub.Genre 
##                   0                  20                   0                   0 
##                Type               Price              Rating No..of.People.rated 
##                   0                 657                   0                   0

There are 177 empty values in the Age column and 20 empty values in the Author column. For the sake of data clarity, we will assign the missing data in the Age column as the median age based on sex. Search for the median value in Price:

empty_books <- books[is.na(books$Price),]
empty_books_median <- median(books$Price, na.rm = TRUE)
empty_books_median
## [1] 294

Set all NA in column Price as 294:

books$Price <- ifelse(is.na(books$Price), empty_books_median, books$Price)

And replace NA value in Author column as Unknown:

books$Author[is.na(books$Author)] <- "Unknown"

Recheck for any missing values:

colSums(is.na(books))
##               Title              Author          Main.Genre           Sub.Genre 
##                   0                   0                   0                   0 
##                Type               Price              Rating No..of.People.rated 
##                   0                   0                   0                   0

Nice! All the values are finally not empty.

2.4 Duplicates Data

Let’s move on to the next step which is to check for duplicated data. We want to check for books with same Title, Author, and Type:

duplicates_books <- duplicated(books$Title, books$Author, books$Type, fromLast = TRUE)
## Warning in duplicated.default(books$Title, books$Author, books$Type, fromLast =
## TRUE): NAs introduced by coercion
sum(duplicates_books)
## [1] 2188

That’s a lot of data. Let’s check for the duplicated data:

books[duplicates_books, ]

Let’s check for one of the duplicates data, for example “The Anatomy of Story”:

books[books$Title == 'The Anatomy of Story', ]

There are two books with title “The Anatomy of Story” and author John Truby. What qualifies it as different book is because it is in different Main.Genre. We will remove the duplicates books and assign it to books_clean. In case of duplicates books were found, we will remove all but last one:

books_clean <- books[!duplicated(books[c("Title", "Author", "Type")], fromLast = TRUE), ]

Great. Let’s recheck in case there are still duplicates data in the books_clean:

duplicates_books_clean <- duplicated(books_clean$Title, books_clean$Author, books_clean$Type, fromLast = TRUE)
## Warning in duplicated.default(books_clean$Title, books_clean$Author,
## books_clean$Type, : NAs introduced by coercion
sum(duplicates_books_clean)
## [1] 103

The sum of duplicates books_clean has reduced significantly, but there’s still 14 duplicates. Check for remaining duplicates books:

books_clean[duplicates_books_clean == T, ]

Let’s check for one of the duplicates data, for example “Reminiscences of a Stock Operator”:

books[books$Title == 'Reminiscences of a Stock Operator', ]

There are two books with title “Reminiscences of a Stock Operator”, but the Author values are different: Edwin Lefèvre and Edwin Lefevre. There is a non-latin character (è) there that makes it counts as different authors. We will replace non-latin characters so that it will be counted as the same books.

What about the others? Let’s check for one more, this time “Man’s Search For Meaning: The classic tribute to hope from the Holocaust”:

books[books$Title == 'Man\'s Search For Meaning: The classic tribute to hope from the Holocaust', ]

What makes them counted as different books is because the Author names have a period symbol (.). So, aside from replace every non-latin characters, we will also eliminate every period symbol in Author:

library(stringi)
books_clean$Author <- tolower(gsub("\\.", "", books_clean$Author))
books_clean$Author <- stri_trans_general(books_clean$Author, "Latin-ASCII")

books_clean <- books_clean[!duplicated(books_clean[c("Title", "Author", "Main.Genre")], fromLast = TRUE), ]

Now check if there’s still duplicated books_clean. To make things simple, let’s create combined and see whether there are still same books with the combination of Title, Author, and Type:

combined <- paste(books_clean$Title, books_clean$Author, books_clean$Type, sep = "_")
sum(duplicated(combined))
## [1] 0

Finally, there is no more duplicates data in books_clean. Now we’re good! Let’s move on to the next point which is Data Explanation.

3 Data Explanation

3.1 Summary

Check for the statistic summary of the books_clean dataframe:

summary(books_clean)
##     Title              Author           Main.Genre         Sub.Genre        
##  Length:5769        Length:5769        Length:5769        Length:5769       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Type               Price            Rating      No..of.People.rated
##  Length:5769        Min.   :  0.01   Min.   :0.000   Min.   :     0     
##  Class :character   1st Qu.:199.00   1st Qu.:4.300   1st Qu.:    54     
##  Mode  :character   Median :294.00   Median :4.500   Median :   373     
##                     Mean   :334.67   Mean   :4.265   Mean   :  5227     
##                     3rd Qu.:415.00   3rd Qu.:4.600   3rd Qu.:  2123     
##                     Max.   :999.00   Max.   :5.000   Max.   :500119

Because the majority of columns in books_clean use characters data type, the explanation step can only be done in Price, Rating, and No..of.People.rated:

  • The average price is ₹334.67

  • The minimum price is ₹0.01! So basically there’s no free book you can get on Amazon!

  • The maximum price is ₹999.00

  • The average rating is 4.265

  • The minimum rating is 0

  • The maximum rating is 5

  • The average number of people rated is 5227

  • The minimum number of people rated is 0

  • The maximum number of people rated is 500119

3.2 Data Distribution

Now we want to check the data distribution of Price, Rating, and No..of.People.rated:

3.2.1 Price Value

hist(books_clean$Price, 
     breaks = seq(min(books_clean$Price), max(books_clean$Price) + 10, by = 100),
     main = "Distribution of Book's Prices (in Rupees)", 
     xlab = "Price", 
     ylab = "Frequency", 
     col = "#b8564b")

Based on the histogram plot above, most book prices are in the range of ₹200 to ₹300, while it is really rare for book prices to be in the range of ₹900 to ₹1000.

3.2.2 Rating Value

hist(books_clean$Rating, 
     breaks = seq(min(books_clean$Rating), max(books_clean$Rating), by = 0.5),
     main = "Distribution of Book's Rating", 
     xlab = "Rating", 
     ylab = "Frequency", 
     col = "#b8564b")

Based on the histogram plot above, most book ratings are in the range of 4 to 4.5, while it is really rare for book ratings to be in the range of 2.5 to 3.

3.2.3 No..of.People.rated Value

hist(books_clean$No..of.People.rated, 
     breaks = seq(min(books_clean$No..of.People.rated), max(books_clean$No..of.People.rated)+ 100000, by = 100000),
     main = "Distribution of Number of People who Rated Book", 
     xlab = "Number of People", 
     ylab = "Frequency", 
     col = "#b8564b")

Based on the histogram plot above, most books are rated by the range of 0 to 100000 people, although if we see the summary above, there is an outlier of more than 500000 people rated a book.

4 Data Analysis

In this section, we will analyze couple of questions regarding books:

4.1 What Are the Top 10 Main Genres?

Answer:

library(ggplot2)

main_genre <- table(books_clean$Main.Genre)
main_genre <- sort(main_genre, decreasing = TRUE)
top_main_genre <- data.frame(head(main_genre, 10))

ggplot(top_main_genre, aes(x = Freq, y = reorder(Var1, Freq), fill = Freq)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top 10 Book Main Genre",
    x = "Frequency",
    y = "Main Genre"
  ) +
  scale_fill_gradient(low = "#ff9d6c", high = "#a24752") +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5))

Based on the plot above, we can see that Romance is the top main genre for books that are sold on Amazon. In the second spot are Children’s Books followed by Sports in the third spot.

4.2 What Are the Top 10 Most Expensive Books by Main Genre?

Answer:

avg_price_by_genre <- aggregate(Price ~ Main.Genre, data = books_clean, FUN = mean)
top_10_pricier <- avg_price_by_genre[order(-avg_price_by_genre$Price), ]
top_10_pricier <- head(top_10_pricier, 10)
top_10_pricier

This is relevant to real-life situations where usually textbooks sold at higher prices. We can see that all of the Main Genre above are related to specific knowledge books.

4.3 What Are the Top 10 Most Affordable Books by Main Genre?

Answer:

avg_price_by_genre <- aggregate(Price ~ Main.Genre, data = books_clean, FUN = mean)
top_10_cheaper <- avg_price_by_genre[order(avg_price_by_genre$Price), ]
top_10_cheaper <- head(top_10_cheaper, 10)
top_10_cheaper

This is relevant to real-life situations where usually non-textbooks are sold at lower prices. We can see that all of the Main Genre above are not related to specific knowledge and more leaning towards leisure or hobbies books.

4.4 What Are the Top 10 Sub Genres?

Answer:

sub_genre <- table(books_clean$Sub.Genre)
sub_genre <- sort(sub_genre, decreasing = TRUE)
top_sub_genre <- data.frame(head(sub_genre, 10))

ggplot(top_sub_genre, aes(x = Freq, y = reorder(Var1, Freq), fill = Freq)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top 10 Book Sub Genre",
    x = "Frequency",
    y = "Sub Genre"
  ) +
  scale_fill_gradient(low = "#c16c32", high = "#9f4951") +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5))

Based on the plot above, we can see that Biology & Life Sciences is the most popular subgenre. On the second spot is Humour followed by Reference in the third spot.

4.5 What Are the Top 10 Most Expensive Books by Sub Genres?

Answer:

avg_price_by_genre <- aggregate(Price ~ Sub.Genre, data = books_clean, FUN = mean)
top_10_pricier <- avg_price_by_genre[order(-avg_price_by_genre$Price), ]
top_10_pricier <- head(top_10_pricier, 10)
top_10_pricier

This is relevant to real-life situations where usually textbooks sold at higher prices. We can see that the majority of Sub Genre above are related to specific knowledge books.

4.6 What Are the Top 10 Most Affordable Books by Sub Genres?

Answer:

avg_price_by_genre <- aggregate(Price ~ Sub.Genre, data = books_clean, FUN = mean)
top_10_cheaper <- avg_price_by_genre[order(avg_price_by_genre$Price), ]
top_10_cheaper <- head(top_10_cheaper, 10)
top_10_cheaper

This is relevant to real-life situations where usually non-textbooks are sold at lower prices. We can see that all of Sub Genre above are not related to specific knowledge and more leaning towards leisure or hobbies books.

4.7 What Are the Top 10 Book Types?

Answer:

type <- table(books_clean$Type)
type <- sort(type, decreasing = TRUE)
top_type <- data.frame(head(type, 10))

ggplot(top_type, aes(x = Freq, y = reorder(Var1, Freq), fill = Freq)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top 10 Book Type",
    x = "Frequency",
    y = "Book Type"
  ) +
  scale_fill_gradient(low = "#c16c32", high = "#9f4951") +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5))

Based on the plot above, we can see that Paperback is the most popular book format sold on Amazon. Following closely is Kindle Edition, with Hardcover ranking third in popularity.

4.8 What Are the Top 10 Most Expensive Books by Book Type?

Answer:

avg_price_by_type <- aggregate(Price ~ Type, data = books_clean, FUN = mean)
top_10_pricier <- avg_price_by_type[order(-avg_price_by_type$Price), ]
top_10_pricier <- head(top_10_pricier, 10)
top_10_pricier

Paperback Bunko is the priciest book type, followed by sheet music and textbook binding.

4.9 What Are the Top 10 Most Affordable Books by Book Type?

Answer:

avg_price_by_type <- aggregate(Price ~ Type, data = books_clean, FUN = mean)
top_10_cheaper <- avg_price_by_type[order(avg_price_by_type$Price), ]
top_10_cheaper <- head(top_10_cheaper, 10)
top_10_cheaper

Gift is the cheapest book type, followed by poster, and map.

4.10 What Are the Top 10 Authors by Numbers of Rating?

Answer:

avg_rated <- aggregate(No..of.People.rated ~ Author, data = books_clean, FUN = mean)
avg_rated <- avg_rated[order(-avg_rated$No..of.People.rated), ]
top_authors <- head(avg_rated, 10)

ggplot(top_authors, aes(x = No..of.People.rated, y = reorder(Author, No..of.People.rated), fill = No..of.People.rated)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Top 10 Authors by Number of People Who Rated Books",
    x = "Number of People Who Rated",
    y = "Author"
  ) +
  scale_fill_gradient(low = "#ff9d6c", high = "#a24752") +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5))

Delia Owens has the highest number of ratings for her books. Followed by Paula Hawkins and Garmus Bonnie. The title of the book is as follows:

avg_rated <- aggregate(No..of.People.rated ~ Title + Author, data = books_clean, FUN = mean)
avg_rated <- avg_rated[order(-avg_rated$No..of.People.rated), ]
top_authors <- head(avg_rated, 10)
top_authors

5 Conclusion

We have done the analysis of Amazon book sales data. In doing so, we get these conclusions:

  • Romance books are the best-selling genre on Amazon

  • Textbooks rank as the most expensive main genre on Amazon

  • Hobby books rank as the most affordable main genre on Amazon

  • Paperback Bunko rank as the most expensive book type on Amazon

  • Gift rank as the most affordable book type on Amazon

  • The most rated book on Amazon is “Where the Crawdads Sing” by Delia Owens.

Keep in mind that this is a simple analysis based on a snippet of data. And also note that the data extracted is from India, so there may be differences from our local realities.

6 Reference

  1. Wikipedia