Amazon is a pioneer of marketplace websites. It was founded
on 5 July 1994 by Jeff Bezos and quickly became number one in The USA
before taking over the other country. At its first conception, Amazon
used to only sell books. So it’s only fitting that we do a simple
analysis of Books sold on Amazon. We’ll explore several aspects of the
data like book titles, authors, genres, prices, ratings, and the number
of ratings received. This analysis aims to uncover trends and
preferences within Amazon’s book marketplace.
The dataset was taken from Kaggle. Please note that the data extracted is from India, so there may be differences from our local reality.
The amazon books dataset has three csv files:
Books_df.csv, Genre_df.csv, and
Sub_Genre_df.csv In consideration of the scope of this
assignment, we will only use the Books_df.csv file as a
data source.
First, read Books_df.csv and assign it to
books:
Check the head of books:
Check the tail of books:
Let’s see the structure of books dataframe:
## 'data.frame': 7928 obs. of 10 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Title : chr "The Complete Novel of Sherlock Holmes" "Black Holes (L) : The Reith Lectures [Paperback] Hawking, Stephen" "The Kite Runner" "Greenlights: Raucous stories and outlaw wisdom from the Academy Award-winning actor" ...
## $ Author : chr "Arthur Conan Doyle" "Stephen Hawking" "Khaled Hosseini" "Matthew McConaughey" ...
## $ Main.Genre : chr "Arts, Film & Photography" "Arts, Film & Photography" "Arts, Film & Photography" "Arts, Film & Photography" ...
## $ Sub.Genre : chr "Cinema & Broadcast" "Cinema & Broadcast" "Cinema & Broadcast" "Cinema & Broadcast" ...
## $ Type : chr "Paperback" "Paperback" "Kindle Edition" "Paperback" ...
## $ Price : chr "₹169.00" "₹99.00" "₹175.75" "₹389.00" ...
## $ Rating : num 4.4 4.5 4.6 4.6 4.5 4.5 4.6 4.5 4.5 4.5 ...
## $ No..of.People.rated: num 19923 7686 50016 32040 1707 ...
## $ URLs : chr "https://www.amazon.in/Complete-Novels-Sherlock-Holmes/dp/8175994312/ref=zg_bs_g_1318054031_d_sccl_1/000-0000000-0000000?psc=1" "https://www.amazon.in/Black-Holes-Lectures-Stephen-Hawking/dp/085750357X/ref=zg_bs_g_1318054031_d_sccl_2/000-00"| __truncated__ "https://www.amazon.in/Kite-Runner-Khaled-Hosseini-ebook/dp/B00B0CR0O6/ref=zg_bs_g_1318054031_d_sccl_3/000-00000"| __truncated__ "https://www.amazon.in/Greenlights-Raucous-stories-Academy-Award-winning/dp/1472280873/ref=zg_bs_g_1318054031_d_"| __truncated__ ...
The explanation of each column is:
Title: The title of the book
Author: Name of the author or publication house
Main Genre: The main genre the book belongs to
Sub Genre: The specific sub-genre of the book
Type: Indicates the format of the book, such as paperback, Kindle, audiobook, or hardcover
Price: The price of the book
Rating: The average rating of the book given by users
No. of People Rated: Indicates the count of users who have rated the book
URLs: Provides the link to the book’s page on Amazon for further details and purchase options
We also want to know how many rows and columns there are in
books:
## [1] 7928 10
There are 7928 rows and 10 columns.
We have to remove unused columns such as X and
URLs, and make the count of columns down to 8 columns:
## [1] 7928 8
The next step is to make sure that all the data types are correct.
After taking a careful look, we noticed that the data type in the
Price column is wrong. It’s supposed to be a double data
type but instead, it’s a character data type. We first will remove the
currency symbol of the Price column before changing it to
double data type:
## Warning: NAs introduced by coercion
The warning of NAs introduced by coercion means that
some output values are non-numeric. We will get back to this later. We
also will check for unique values to determine whether it’s a factor
data type. Nevertheless, because this is books dataset, it
must have many unique values. Check for how many unique values for
Main.Genre and Sub.Genre
## [1] 30
## [1] 151
## [1] 33
There are a lot of values for Main.Genre,
Sub.Genre, and Type. However because the data
above have more than 10 unique values, we wouldn’t change it to factor
data type. Let’s recheck for newly changed data types in
books:
## 'data.frame': 7928 obs. of 8 variables:
## $ Title : chr "The Complete Novel of Sherlock Holmes" "Black Holes (L) : The Reith Lectures [Paperback] Hawking, Stephen" "The Kite Runner" "Greenlights: Raucous stories and outlaw wisdom from the Academy Award-winning actor" ...
## $ Author : chr "Arthur Conan Doyle" "Stephen Hawking" "Khaled Hosseini" "Matthew McConaughey" ...
## $ Main.Genre : chr "Arts, Film & Photography" "Arts, Film & Photography" "Arts, Film & Photography" "Arts, Film & Photography" ...
## $ Sub.Genre : chr "Cinema & Broadcast" "Cinema & Broadcast" "Cinema & Broadcast" "Cinema & Broadcast" ...
## $ Type : chr "Paperback" "Paperback" "Kindle Edition" "Paperback" ...
## $ Price : num 169 99 176 389 348 ...
## $ Rating : num 4.4 4.5 4.6 4.6 4.5 4.5 4.6 4.5 4.5 4.5 ...
## $ No..of.People.rated: num 19923 7686 50016 32040 1707 ...
Now, we need to make sure that there are no missing values. Let’s check for missing values:
## Title Author Main.Genre Sub.Genre
## 0 20 0 0
## Type Price Rating No..of.People.rated
## 0 657 0 0
There are 177 empty values in the Age column and 20
empty values in the Author column. For the sake of data
clarity, we will assign the missing data in the Age column
as the median age based on sex. Search for the median value in
Price:
empty_books <- books[is.na(books$Price),]
empty_books_median <- median(books$Price, na.rm = TRUE)
empty_books_median## [1] 294
Set all NA in column Price as 294:
And replace NA value in Author column as
Unknown:
Recheck for any missing values:
## Title Author Main.Genre Sub.Genre
## 0 0 0 0
## Type Price Rating No..of.People.rated
## 0 0 0 0
Nice! All the values are finally not empty.
Let’s move on to the next step which is to check for duplicated data.
We want to check for books with same Title,
Author, and Type:
## Warning in duplicated.default(books$Title, books$Author, books$Type, fromLast =
## TRUE): NAs introduced by coercion
## [1] 2188
That’s a lot of data. Let’s check for the duplicated data:
Let’s check for one of the duplicates data, for example “The Anatomy of Story”:
There are two books with title “The Anatomy of Story”
and author John Truby. What qualifies it as different book is because it
is in different Main.Genre. We will remove the duplicates
books and assign it to books_clean. In case of
duplicates books were found, we will remove all but last
one:
Great. Let’s recheck in case there are still duplicates data in the
books_clean:
duplicates_books_clean <- duplicated(books_clean$Title, books_clean$Author, books_clean$Type, fromLast = TRUE)## Warning in duplicated.default(books_clean$Title, books_clean$Author,
## books_clean$Type, : NAs introduced by coercion
## [1] 103
The sum of duplicates books_clean has reduced
significantly, but there’s still 14 duplicates. Check for remaining
duplicates books:
Let’s check for one of the duplicates data, for example “Reminiscences of a Stock Operator”:
There are two books with title “Reminiscences of a Stock
Operator”, but the Author values are different: Edwin
Lefèvre and Edwin Lefevre. There is a non-latin character (è) there that
makes it counts as different authors. We will replace non-latin
characters so that it will be counted as the same books.
What about the others? Let’s check for one more, this time “Man’s Search For Meaning: The classic tribute to hope from the Holocaust”:
What makes them counted as different books is because the
Author names have a period symbol (.). So, aside from
replace every non-latin characters, we will also eliminate every period
symbol in Author:
library(stringi)
books_clean$Author <- tolower(gsub("\\.", "", books_clean$Author))
books_clean$Author <- stri_trans_general(books_clean$Author, "Latin-ASCII")
books_clean <- books_clean[!duplicated(books_clean[c("Title", "Author", "Main.Genre")], fromLast = TRUE), ]Now check if there’s still duplicated books_clean. To
make things simple, let’s create combined and see whether
there are still same books with the combination of
Title, Author, and Type:
combined <- paste(books_clean$Title, books_clean$Author, books_clean$Type, sep = "_")
sum(duplicated(combined))## [1] 0
Finally, there is no more duplicates data in
books_clean. Now we’re good! Let’s move on to the next
point which is Data Explanation.
Check for the statistic summary of the books_clean
dataframe:
## Title Author Main.Genre Sub.Genre
## Length:5769 Length:5769 Length:5769 Length:5769
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Type Price Rating No..of.People.rated
## Length:5769 Min. : 0.01 Min. :0.000 Min. : 0
## Class :character 1st Qu.:199.00 1st Qu.:4.300 1st Qu.: 54
## Mode :character Median :294.00 Median :4.500 Median : 373
## Mean :334.67 Mean :4.265 Mean : 5227
## 3rd Qu.:415.00 3rd Qu.:4.600 3rd Qu.: 2123
## Max. :999.00 Max. :5.000 Max. :500119
Because the majority of columns in books_clean use
characters data type, the explanation step can only be done in
Price, Rating, and
No..of.People.rated:
The average price is ₹334.67
The minimum price is ₹0.01! So basically there’s no free book you can get on Amazon!
The maximum price is ₹999.00
The average rating is 4.265
The minimum rating is 0
The maximum rating is 5
The average number of people rated is 5227
The minimum number of people rated is 0
The maximum number of people rated is 500119
Now we want to check the data distribution of Price,
Rating, and No..of.People.rated:
hist(books_clean$Price,
breaks = seq(min(books_clean$Price), max(books_clean$Price) + 10, by = 100),
main = "Distribution of Book's Prices (in Rupees)",
xlab = "Price",
ylab = "Frequency",
col = "#b8564b")
Based on the histogram plot above, most book prices are in the range of
₹200 to ₹300, while it is really rare for book prices to be in the range
of ₹900 to ₹1000.
hist(books_clean$Rating,
breaks = seq(min(books_clean$Rating), max(books_clean$Rating), by = 0.5),
main = "Distribution of Book's Rating",
xlab = "Rating",
ylab = "Frequency",
col = "#b8564b")Based on the histogram plot above, most book ratings are in the range of 4 to 4.5, while it is really rare for book ratings to be in the range of 2.5 to 3.
hist(books_clean$No..of.People.rated,
breaks = seq(min(books_clean$No..of.People.rated), max(books_clean$No..of.People.rated)+ 100000, by = 100000),
main = "Distribution of Number of People who Rated Book",
xlab = "Number of People",
ylab = "Frequency",
col = "#b8564b")Based on the histogram plot above, most books are rated by the range of 0 to 100000 people, although if we see the summary above, there is an outlier of more than 500000 people rated a book.
In this section, we will analyze couple of questions regarding
books:
Answer:
library(ggplot2)
main_genre <- table(books_clean$Main.Genre)
main_genre <- sort(main_genre, decreasing = TRUE)
top_main_genre <- data.frame(head(main_genre, 10))
ggplot(top_main_genre, aes(x = Freq, y = reorder(Var1, Freq), fill = Freq)) +
geom_bar(stat = "identity") +
labs(
title = "Top 10 Book Main Genre",
x = "Frequency",
y = "Main Genre"
) +
scale_fill_gradient(low = "#ff9d6c", high = "#a24752") +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5))Based on the plot above, we can see that Romance is the top main genre for books that are sold on Amazon. In the second spot are Children’s Books followed by Sports in the third spot.
Answer:
avg_price_by_genre <- aggregate(Price ~ Main.Genre, data = books_clean, FUN = mean)
top_10_pricier <- avg_price_by_genre[order(-avg_price_by_genre$Price), ]
top_10_pricier <- head(top_10_pricier, 10)
top_10_pricierThis is relevant to real-life situations where usually textbooks sold at higher prices. We can see that all of the Main Genre above are related to specific knowledge books.
Answer:
avg_price_by_genre <- aggregate(Price ~ Main.Genre, data = books_clean, FUN = mean)
top_10_cheaper <- avg_price_by_genre[order(avg_price_by_genre$Price), ]
top_10_cheaper <- head(top_10_cheaper, 10)
top_10_cheaperThis is relevant to real-life situations where usually non-textbooks are sold at lower prices. We can see that all of the Main Genre above are not related to specific knowledge and more leaning towards leisure or hobbies books.
Answer:
sub_genre <- table(books_clean$Sub.Genre)
sub_genre <- sort(sub_genre, decreasing = TRUE)
top_sub_genre <- data.frame(head(sub_genre, 10))
ggplot(top_sub_genre, aes(x = Freq, y = reorder(Var1, Freq), fill = Freq)) +
geom_bar(stat = "identity") +
labs(
title = "Top 10 Book Sub Genre",
x = "Frequency",
y = "Sub Genre"
) +
scale_fill_gradient(low = "#c16c32", high = "#9f4951") +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5))Based on the plot above, we can see that Biology & Life Sciences is the most popular subgenre. On the second spot is Humour followed by Reference in the third spot.
Answer:
avg_price_by_genre <- aggregate(Price ~ Sub.Genre, data = books_clean, FUN = mean)
top_10_pricier <- avg_price_by_genre[order(-avg_price_by_genre$Price), ]
top_10_pricier <- head(top_10_pricier, 10)
top_10_pricierThis is relevant to real-life situations where usually textbooks sold at higher prices. We can see that the majority of Sub Genre above are related to specific knowledge books.
Answer:
avg_price_by_genre <- aggregate(Price ~ Sub.Genre, data = books_clean, FUN = mean)
top_10_cheaper <- avg_price_by_genre[order(avg_price_by_genre$Price), ]
top_10_cheaper <- head(top_10_cheaper, 10)
top_10_cheaperThis is relevant to real-life situations where usually non-textbooks are sold at lower prices. We can see that all of Sub Genre above are not related to specific knowledge and more leaning towards leisure or hobbies books.
Answer:
type <- table(books_clean$Type)
type <- sort(type, decreasing = TRUE)
top_type <- data.frame(head(type, 10))
ggplot(top_type, aes(x = Freq, y = reorder(Var1, Freq), fill = Freq)) +
geom_bar(stat = "identity") +
labs(
title = "Top 10 Book Type",
x = "Frequency",
y = "Book Type"
) +
scale_fill_gradient(low = "#c16c32", high = "#9f4951") +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5))Based on the plot above, we can see that Paperback is the most popular book format sold on Amazon. Following closely is Kindle Edition, with Hardcover ranking third in popularity.
Answer:
avg_price_by_type <- aggregate(Price ~ Type, data = books_clean, FUN = mean)
top_10_pricier <- avg_price_by_type[order(-avg_price_by_type$Price), ]
top_10_pricier <- head(top_10_pricier, 10)
top_10_pricierPaperback Bunko is the priciest book type, followed by sheet music and textbook binding.
Answer:
avg_price_by_type <- aggregate(Price ~ Type, data = books_clean, FUN = mean)
top_10_cheaper <- avg_price_by_type[order(avg_price_by_type$Price), ]
top_10_cheaper <- head(top_10_cheaper, 10)
top_10_cheaperGift is the cheapest book type, followed by poster, and map.
We have done the analysis of Amazon book sales data. In doing so, we get these conclusions:
Romance books are the best-selling genre on Amazon
Textbooks rank as the most expensive main genre on Amazon
Hobby books rank as the most affordable main genre on Amazon
Paperback Bunko rank as the most expensive book type on Amazon
Gift rank as the most affordable book type on Amazon
The most rated book on Amazon is “Where the Crawdads Sing” by Delia Owens.
Keep in mind that this is a simple analysis based on a snippet of data. And also note that the data extracted is from India, so there may be differences from our local realities.