This dataset covers Amazon's bestselling books from 2009 to 2019. It contains 550 entries, each already categorized as fiction or non-fiction. The data is available on Kaggle (https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019). I hope you enjoy it!
library(ggplot2)
book <- read.csv("bestsellers with categories.csv")
head(book)
## Name
## 1 10-Day Green Smoothie Cleanse
## 2 11/22/63: A Novel
## 3 12 Rules for Life: An Antidote to Chaos
## 4 1984 (Signet Classics)
## 5 5,000 Awesome Facts (About Everything!) (National Geographic Kids)
## 6 A Dance with Dragons (A Song of Ice and Fire)
## Author User.Rating Reviews Price Year Genre
## 1 JJ Smith 4.7 17350 8 2016 Non Fiction
## 2 Stephen King 4.6 2052 22 2011 Fiction
## 3 Jordan B. Peterson 4.7 18979 15 2018 Non Fiction
## 4 George Orwell 4.7 21424 6 2017 Fiction
## 5 National Geographic Kids 4.8 7665 12 2019 Non Fiction
## 6 George R. R. Martin 4.4 12643 11 2011 Fiction
tail(book)
## Name
## 545 Wonder
## 546 Wrecking Ball (Diary of a Wimpy Kid Book 14)
## 547 You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life
## 548 You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life
## 549 You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life
## 550 You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life
## Author User.Rating Reviews Price Year Genre
## 545 R. J. Palacio 4.8 21625 9 2017 Fiction
## 546 Jeff Kinney 4.9 9413 8 2019 Fiction
## 547 Jen Sincero 4.7 14331 8 2016 Non Fiction
## 548 Jen Sincero 4.7 14331 8 2017 Non Fiction
## 549 Jen Sincero 4.7 14331 8 2018 Non Fiction
## 550 Jen Sincero 4.7 14331 8 2019 Non Fiction
dim(book)
## [1] 550 7
The data contains 550 rows and 7 columns.
names(book)
## [1] "Name" "Author" "User.Rating" "Reviews" "Price"
## [6] "Year" "Genre"
Check the data structure
str(book)
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : chr "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
Genre is stored as character, but it only takes two values (Fiction and Non Fiction), so we convert it to a factor.
# change the data type of `Genre`
book$Genre <- as.factor(book$Genre)
str(book)
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : Factor w/ 2 levels "Fiction","Non Fiction": 2 1 2 1 2 1 1 1 2 1 ...
Check for missing values
colSums(is.na(book))
## Name Author User.Rating Reviews Price Year
## 0 0 0 0 0 0
## Genre
## 0
Great! There are no missing values in this dataset.
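Missing values are not the only thing worth checking. The tail() output above lists the same title once per year, so counting repeated titles helps when reading any later totals. A minimal sketch, using titles copied from that output (sample_names is a stand-in introduced here, and the long title is shortened; with the full data, book$Name would take its place):

```r
# Titles copied from the tail(book) output; the long title is shortened here
sample_names <- c("Wonder", "Wrecking Ball",
                  "You Are a Badass", "You Are a Badass",
                  "You Are a Badass", "You Are a Badass")

# duplicated() flags every repeat after the first occurrence
repeated_rows <- sum(duplicated(sample_names))
unique_titles <- length(unique(sample_names))

repeated_rows   # rows that repeat an earlier title
unique_titles   # distinct titles
```

A large gap between the number of rows and the number of distinct titles would mean that sums over rows count some books several times.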
Summary
summary(book)
## Name Author User.Rating Reviews
## Length:550 Length:550 Min. :3.300 Min. : 37
## Class :character Class :character 1st Qu.:4.500 1st Qu.: 4058
## Mode :character Mode :character Median :4.700 Median : 8580
## Mean :4.618 Mean :11953
## 3rd Qu.:4.800 3rd Qu.:17253
## Max. :4.900 Max. :87841
## Price Year Genre
## Min. : 0.0 Min. :2009 Fiction :240
## 1st Qu.: 7.0 1st Qu.:2011 Non Fiction:310
## Median : 11.0 Median :2014
## Mean : 13.1 Mean :2014
## 3rd Qu.: 16.0 3rd Qu.:2017
## Max. :105.0 Max. :2019
From our summary we can conclude that:
The maximum user rating is 4.9
The maximum number of reviews is 87841
The books fall into two genres, fiction and non-fiction
The data covers 11 years, 2009 through 2019
The most expensive book costs $105
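The length of the period can be checked from the values already printed. The sketch below uses only the minimum and maximum year from summary() and the row count from dim(). The per-year figure assumes every year contributes the same number of rows, which the "top 50" in the dataset name suggests but the output above does not show directly.

```r
# Values taken from the summary() and dim() output above
first_year <- 2009
last_year  <- 2019
n_rows     <- 550

n_years <- last_year - first_year + 1   # both endpoints are included
books_per_year <- n_rows / n_years      # assumes equal rows per year

n_years
books_per_year
```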
1. Which genre appears most often?
genre <- as.data.frame(table(book$Genre))
ggplot(genre, mapping = aes(x = reorder(Var1, Freq), Freq))+
geom_col(width = 0.5, fill = "green")+
theme_minimal()+
labs(title = "Number of bestselling books by genre")+
xlab("Genre")+
ylab("Count")+
theme(axis.text.x = element_text(face = "bold"))
Interpretation:
Across the 11 years of data, non-fiction books appear on the bestseller list more often than fiction books (310 entries versus 240).
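The same comparison can be expressed as shares. A small sketch, using the genre counts printed by summary(book) above; genre_counts is a name introduced here, and with the full data table(book$Genre) could be passed to prop.table() instead:

```r
# Counts copied from the Genre column of the summary() output
genre_counts <- c("Fiction" = 240, "Non Fiction" = 310)

# prop.table() divides each count by the total
genre_share <- round(100 * prop.table(genre_counts), 1)
genre_share
```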
2. What are the most popular books by the number of reviews?
top_books <- aggregate(Reviews~Name, book, sum)
ggplot(top_books[1:10,], mapping = aes(y = reorder(Name,Reviews), x = Reviews))+
geom_col(width = 0.5,
fill = "orange")+
ylab("Title of Book")+
labs(title = "Top 10 Most Popular Books",
subtitle = "By number of reviews")+
theme_minimal()+
theme(axis.text.y = element_text(size = 10))
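Insights:
Among the ten titles plotted, 1984 (Signet Classics) has the most reviews
11/22/63: A Novel has comparatively few reviews
Note that top_books is not sorted by Reviews before the first ten rows are taken, so the chart shows the first ten titles in alphabetical order rather than the ten most-reviewed books in the whole dataset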
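aggregate() returns its result ordered by the grouping column, so taking rows 1:10 picks titles alphabetically. To rank by reviews, the totals need to be sorted before slicing. A minimal sketch on rows copied from the head() and tail() output (sample_rows and the shortened title are stand-ins introduced here; the full book data frame can be substituted):

```r
# Rows copied from the head()/tail() output above (one title shortened)
sample_rows <- data.frame(
  Name    = c("11/22/63: A Novel", "1984 (Signet Classics)", "Wonder",
              "You Are a Badass", "You Are a Badass"),
  Reviews = c(2052, 21424, 21625, 14331, 14331)
)

# Sum reviews per title, then sort by the total before taking the top rows
totals <- aggregate(Reviews ~ Name, sample_rows, sum)
totals <- totals[order(totals$Reviews, decreasing = TRUE), ]
top_n_books <- head(totals, 3)
top_n_books
```

Note that summing adds the same review count once per year a title appears on the list, which is why the twice-listed title ends up first in this sample.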
3. Who are the most popular authors by number of reviews?
# prepare the data
author <- aggregate(Reviews~Author, book, sum)
# plotting
ggplot(author[1:10,], mapping = aes(y = reorder(Author,Reviews), x = Reviews))+
geom_bar(stat = "identity",
width = 0.5,
fill = "palevioletred")+
theme_minimal()+
theme(axis.text.y = element_text(face = "bold"))+
labs(title = "Top 10 Most Popular Author",
subtitle = "By Number of Reviews",
y = "Author")
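Insights:
Among the ten authors plotted, Alex Michaelides has the most reviews
Alice Schertle has comparatively few reviews
Note that author is not sorted by Reviews before the first ten rows are taken, so the chart shows the first ten authors in alphabetical order rather than the ten most-reviewed authors in the whole dataset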
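For a ranking across all authors, one option is to keep a single row per title, sum per author, and then sort. This is a sketch under the assumption that a title's review count is the same in every year it appears, as it is for the repeated title in the tail() output. sample_rows and the shortened title are stand-ins introduced here.

```r
# Rows copied from the tail() output above (one title shortened)
sample_rows <- data.frame(
  Name    = c("Wonder", "Wrecking Ball",
              "You Are a Badass", "You Are a Badass", "You Are a Badass"),
  Author  = c("R. J. Palacio", "Jeff Kinney",
              "Jen Sincero", "Jen Sincero", "Jen Sincero"),
  Reviews = c(21625, 9413, 14331, 14331, 14331)
)

# Keep one row per title so a book listed in several years is counted once
unique_titles <- sample_rows[!duplicated(sample_rows$Name), ]

# Sum per author and sort so the most-reviewed author comes first
author_totals <- aggregate(Reviews ~ Author, unique_titles, sum)
author_totals <- author_totals[order(author_totals$Reviews, decreasing = TRUE), ]
author_totals
```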
4. How are user ratings distributed (in percent)?
ggplot(data=book, aes(x= User.Rating))+
geom_histogram(aes(y = after_stat(count / sum(count))),
fill="orangered", col="black", binwidth=0.05)+
scale_y_continuous(labels = scales::percent)+
ylab("Count (%)")+
xlab("User Rating")+
labs(title = "Number of User Rating in Percentage")+
theme_minimal()
Interpretation:
Across the 11 years of Amazon bestsellers, most books have an average user rating between 4.5 and 4.9, and only a few are rated below 4.0 (the minimum is 3.3). This suggests that the bestselling books are generally well received by readers.
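The share of books at or above a given rating can also be computed directly, without reading it off the histogram. A small sketch using the first ten User.Rating values shown in the str() output (sample_ratings and the 4.5 cut-off are choices made here for illustration; book$User.Rating would be used for the full data):

```r
# First ten ratings copied from the str(book) output
sample_ratings <- c(4.7, 4.6, 4.7, 4.7, 4.8, 4.4, 4.7, 4.7, 4.7, 4.6)

# The mean of a logical vector is the proportion of TRUE values
share_high <- mean(sample_ratings >= 4.5)
share_high
```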
5. How many books of each genre are there per year?
# prepare the data
year_price <- as.data.frame(table(book$Genre, book$Year))
names(year_price) <- c("Genre", "Year", "Count")
year_price <- year_price[order(year_price$Count, decreasing = T),]
# plotting (Year is kept in chronological order on the x axis)
ggplot(year_price, mapping = aes(x = Year,
y = Count,
fill = Genre))+
geom_bar(stat="identity", position=position_dodge(0.8))+
theme_minimal()+
xlab("Year")+
ylab("Number of Books")+
labs(title = "Number of books by genres per year")+
scale_fill_brewer(palette = "Set2")
Insights:
Fiction has its highest count in 2014
Non-fiction has its highest count in 2015
Overall, non-fiction books outnumber fiction books (310 versus 240 across all years)
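Whether one genre leads in every single year can be checked from the cross-tabulation itself. The sketch below uses only the twelve rows shown in the head() and tail() output, so its result describes that sample and not the full dataset. sample_rows, counts and fiction_ahead are names introduced here.

```r
# Year and Genre of the twelve rows shown by head() and tail() above
sample_rows <- data.frame(
  Year  = c(2016, 2011, 2018, 2017, 2019, 2011,
            2017, 2019, 2016, 2017, 2018, 2019),
  Genre = c("Non Fiction", "Fiction", "Non Fiction", "Fiction",
            "Non Fiction", "Fiction", "Fiction", "Fiction",
            "Non Fiction", "Non Fiction", "Non Fiction", "Non Fiction")
)

# Rows are genres, columns are years
counts <- table(sample_rows$Genre, sample_rows$Year)

# Years in which fiction has more entries than non-fiction
fiction_ahead <- colnames(counts)[counts["Fiction", ] > counts["Non Fiction", ]]
fiction_ahead
```

Running the same comparison on table(book$Genre, book$Year) would list the years, if any, in which fiction leads.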
6. Which books are the most expensive?
# prepare the data
exp_books <- book[,c("Name","Price")]
exp_books <- exp_books[order(exp_books$Price, decreasing = T),]
exp_books <- exp_books[!duplicated(exp_books$Name),]
# plotting
ggplot(exp_books[1:10,], mapping = aes(y = reorder(Name, Price),
x = Price))+
geom_bar(stat = "identity", width = 0.5, fill = "red")+
# theme_minimal() is a complete theme, so theme() tweaks must come after it
theme_minimal()+
theme(axis.text.y = element_text(face = "bold", size = 20))+
ylab("Title of Book")+
xlab("Price ($)")+
labs(title = "Top 10 Most Expensive Books",
subtitle = "from 2009 to 2019")
Insights:
The most expensive book is Diagnostic and Statistical Manual of Mental Disorders, priced at more than $100
The least expensive of these ten books is The Official SAT Study Guide, 2016 Edition, at about $30
7. Which authors have the most expensive books?
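# prepare the data
author_price <- book[,c("Author","Price")]
# keeps the first row found for each author, which is not necessarily that author's most expensive book
author_price <- author_price[!duplicated(author_price$Author),]
author_price <- author_price[order(author_price$Price, decreasing = TRUE),]
# plotting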
ggplot(author_price[1:10,], mapping = aes(y = reorder(Author, Price),
x = Price))+
geom_bar(stat = "identity", width = 0.3, fill = "skyblue")+
# theme_minimal() is a complete theme, so theme() tweaks must come after it
theme_minimal()+
theme(axis.text.y = element_text(face = "bold"))+
ylab("Author")+
xlab("Price ($)")+
labs(title = "Top 10 Most Expensive Book by Author")
Insights:
Across the 11 years of data, the most expensive book is by the American Psychiatric Association, priced at more than $100
The least expensive of these ten is by Gary Chapman, at about $30
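Removing duplicate authors before sorting keeps whichever row comes first for each author. If the goal is each author's most expensive book, taking the maximum price per author is one alternative. In the sketch below the first three rows come from the head() output, while "Hypothetical Author" and its two prices are invented purely to show the difference between the two approaches.

```r
sample_rows <- data.frame(
  Author = c("Stephen King", "Jordan B. Peterson", "George Orwell",
             "Hypothetical Author", "Hypothetical Author"),  # last two rows are made up
  Price  = c(22, 15, 6, 5, 40)
)

# First-row approach: keeps the price of whichever row appears first
first_row <- sample_rows[!duplicated(sample_rows$Author), ]

# Maximum approach: keeps each author's highest price, then sorts
max_price <- aggregate(Price ~ Author, sample_rows, max)
max_price <- max_price[order(max_price$Price, decreasing = TRUE), ]

first_row
max_price
```

For the invented author, the first-row approach reports 5 while the maximum approach reports 40, so the two can produce different top-ten lists.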
From our analysis, we can draw the following conclusions:
Across the 11 years of data, non-fiction books appear on the bestseller list more often than fiction books (310 entries versus 240)
The most expensive book is Diagnostic and Statistical Manual of Mental Disorders, by the American Psychiatric Association
Among the ten titles plotted by number of reviews, 1984 (Signet Classics) has the most
Among the ten authors plotted by number of reviews, Alex Michaelides has the most
Based on the distribution of user ratings, the bestselling books are generally well received by readers