Intro

This dataset covers Amazon's top 50 bestselling books for each year from 2009 to 2019. It contains 550 rows (50 books for each of the 11 years), and every book is already categorized as fiction or non-fiction. The data can be obtained from https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019 . I hope you enjoy it!

Load Library

library(ggplot2)

1. Data Explanation

1.1 Input Data

book <- read.csv("bestsellers with categories.csv")
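The column shows up later as `User.Rating`, with a dot. A minimal sketch of why, assuming the CSV header is written as `User Rating` with a space (the two data rows below are made up for illustration and are not the real file):

```r
# Hypothetical two-row extract that mimics the CSV header; not the real file.
csv_text <- 'Name,Author,User Rating,Reviews,Price,Year,Genre
"Book A",Author One,4.7,100,8,2016,Non Fiction
"Book B",Author Two,4.6,200,22,2011,Fiction'

# Default behaviour: check.names = TRUE rewrites headers into syntactic
# R names, so the space becomes a dot.
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

Keeping the default is usually more convenient, because `book$User.Rating` needs no backticks.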

1.2 Data Inspection

head(book)
##                                                                 Name
## 1                                      10-Day Green Smoothie Cleanse
## 2                                                  11/22/63: A Novel
## 3                            12 Rules for Life: An Antidote to Chaos
## 4                                             1984 (Signet Classics)
## 5 5,000 Awesome Facts (About Everything!) (National Geographic Kids)
## 6                      A Dance with Dragons (A Song of Ice and Fire)
##                     Author User.Rating Reviews Price Year       Genre
## 1                 JJ Smith         4.7   17350     8 2016 Non Fiction
## 2             Stephen King         4.6    2052    22 2011     Fiction
## 3       Jordan B. Peterson         4.7   18979    15 2018 Non Fiction
## 4            George Orwell         4.7   21424     6 2017     Fiction
## 5 National Geographic Kids         4.8    7665    12 2019 Non Fiction
## 6      George R. R. Martin         4.4   12643    11 2011     Fiction
tail(book)
##                                                                                       Name
## 545                                                                                 Wonder
## 546                                           Wrecking Ball (Diary of a Wimpy Kid Book 14)
## 547 You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life
## 548 You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life
## 549 You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life
## 550 You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life
##            Author User.Rating Reviews Price Year       Genre
## 545 R. J. Palacio         4.8   21625     9 2017     Fiction
## 546   Jeff Kinney         4.9    9413     8 2019     Fiction
## 547   Jen Sincero         4.7   14331     8 2016 Non Fiction
## 548   Jen Sincero         4.7   14331     8 2017 Non Fiction
## 549   Jen Sincero         4.7   14331     8 2018 Non Fiction
## 550   Jen Sincero         4.7   14331     8 2019 Non Fiction
dim(book)
## [1] 550   7

This data contains 550 rows and 7 columns.

names(book)
## [1] "Name"        "Author"      "User.Rating" "Reviews"     "Price"      
## [6] "Year"        "Genre"

Check the data structure

str(book)
## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...

Genre is read as character, but it only takes two values (Fiction and Non Fiction), so we convert it to a factor.

1.3 Data Wrangling

# convert `Genre` from character to factor

book$Genre <- as.factor(book$Genre)


str(book)
## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : Factor w/ 2 levels "Fiction","Non Fiction": 2 1 2 1 2 1 1 1 2 1 ...
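To see what the conversion changes, here is a small sketch on a made-up vector (the variable names `genre_chr` and `genre_fct` are only for illustration). It shows that the levels are sorted alphabetically and that each value is stored as an integer code, which matches the `2 1 2 1 2` pattern in the `str()` output.

```r
# First five Genre values as they appear in head(book)
genre_chr <- c("Non Fiction", "Fiction", "Non Fiction", "Fiction", "Non Fiction")

genre_fct <- as.factor(genre_chr)

# Levels are sorted alphabetically: "Fiction" comes before "Non Fiction"
genre_levels <- levels(genre_fct)

# Each value is stored as the integer position of its level
genre_codes <- as.integer(genre_fct)

# summary() on a factor counts each level, which a character vector cannot do
genre_counts <- summary(genre_fct)

print(genre_levels)
print(genre_codes)
print(genre_counts)
```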

Check for missing values

colSums(is.na(book))
##        Name      Author User.Rating     Reviews       Price        Year 
##           0           0           0           0           0           0 
##       Genre 
##           0

Great! There are no missing values in this dataset.
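For comparison, this is roughly what the same check looks like when values are missing. The data frame `toy` is made up for illustration; the sketch also counts empty strings, which `is.na()` does not treat as missing.

```r
# Made-up data with one missing price, one missing genre and one empty title
toy <- data.frame(
  Name  = c("A", "", "C"),
  Price = c(8, NA, 15),
  Genre = c("Fiction", NA, "Fiction")
)

# Missing values per column, as in the check above
na_per_column <- colSums(is.na(toy))

# TRUE if there is at least one NA anywhere
any_missing <- any(is.na(toy))

# Rows without any NA
complete_rows <- toy[complete.cases(toy), ]

# Empty strings are not NA, so they need a separate check
empty_names <- sum(toy$Name == "")

print(na_per_column)
print(any_missing)
print(nrow(complete_rows))
print(empty_names)
```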

Summary

summary(book)
##      Name              Author           User.Rating       Reviews     
##  Length:550         Length:550         Min.   :3.300   Min.   :   37  
##  Class :character   Class :character   1st Qu.:4.500   1st Qu.: 4058  
##  Mode  :character   Mode  :character   Median :4.700   Median : 8580  
##                                        Mean   :4.618   Mean   :11953  
##                                        3rd Qu.:4.800   3rd Qu.:17253  
##                                        Max.   :4.900   Max.   :87841  
##      Price            Year              Genre    
##  Min.   :  0.0   Min.   :2009   Fiction    :240  
##  1st Qu.:  7.0   1st Qu.:2011   Non Fiction:310  
##  Median : 11.0   Median :2014                    
##  Mean   : 13.1   Mean   :2014                    
##  3rd Qu.: 16.0   3rd Qu.:2017                    
##  Max.   :105.0   Max.   :2019

From our summary we can conclude that:

  1. The maximum user rating is 4.9 and the minimum is 3.3

  2. The maximum number of reviews is 87841

  3. The bestseller list contains two genres, fiction and non-fiction

  4. The data cover 11 years, from 2009 to 2019

  5. The most expensive book costs $105, and the cheapest is listed at $0
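A quick check of the time span: the range 2009 to 2019 includes both end years, so it covers 11 years, not 10. The sketch below only uses the year range and the row count from `dim(book)`.

```r
# Years covered by the data, both ends included
years <- 2009:2019
n_years <- length(years)

# 550 rows spread evenly over the years gives the "top 50" per year
books_per_year <- 550 / n_years

print(n_years)
print(books_per_year)
```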

2. Exploratory Data Analysis (Case Study)

1. Which genre is the most common?

genre <- as.data.frame(table(book$Genre))
ggplot(genre, mapping = aes(x = reorder(Var1, Freq), Freq))+
  geom_col(width = 0.5, fill = "green")+
  theme_minimal()+
  labs(title = "Number of bestsellers by genre")+
  xlab("Genre")+
  ylab("Count")+
  theme(axis.text.x = element_text(face = "bold"))

Interpretation:

Across the 11 years of bestseller lists, non-fiction books appear more often than fiction books (310 versus 240 entries).
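The same comparison can be expressed as shares. This is a small sketch that reuses the genre counts printed by `summary(book)`; the variable names are only for illustration.

```r
# Genre counts taken from the summary() output above
genre_counts <- c("Fiction" = 240, "Non Fiction" = 310)

# Share of each genre in percent, rounded to one decimal
genre_share <- round(100 * genre_counts / sum(genre_counts), 1)

print(genre_share)
```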

2. What are the most popular books by the number of reviews?

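# one row per title: a book listed in several years repeats the same review count,
# so take the maximum instead of the sum
top_books <- aggregate(Reviews~Name, book, max)
# aggregate() returns titles in alphabetical order, so sort by reviews before slicing
top_books <- top_books[order(top_books$Reviews, decreasing = TRUE),]

ggplot(top_books[1:10,], mapping = aes(y = reorder(Name,Reviews), x = Reviews))+
  geom_col(width = 0.5,
           fill = "orange")+
  ylab("Title of Book")+
  labs(title = "Top 10 Most Popular Books",
       subtitle = "By number of reviews")+
  theme_minimal()+
  theme(axis.text.y = element_text(size = 10))

Insights:

  1. The most popular book by number of reviews is the one with 87841 reviews, the maximum reported by summary(book). In this dataset that is Where the Crawdads Sing

  2. 1984 (Signet Classics) has 21424 reviews and 11/22/63: A Novel has only 2052, so both sit well below the leader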
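One detail matters a lot for a "top 10" chart: `aggregate()` returns its groups in alphabetical order, so the first rows are not the largest ones unless we sort first. A minimal sketch on made-up data (the titles and counts in `toy` are hypothetical):

```r
# Made-up data: "Zebra Tales" is listed in two years with the same review count
toy <- data.frame(
  Name    = c("Zebra Tales", "Apple Pie", "Zebra Tales", "Moon Base"),
  Reviews = c(900, 100, 900, 500),
  Year    = c(2018, 2018, 2019, 2019)
)

# max keeps one review count per title; sum would give 1800 for "Zebra Tales"
per_book <- aggregate(Reviews ~ Name, toy, max)

# Without sorting, the first rows simply follow the alphabet
first_two_unsorted <- per_book$Name[1:2]

# Sort by reviews, then take the top rows
ranked <- per_book[order(per_book$Reviews, decreasing = TRUE), ]
top_two <- head(ranked$Name, 2)

print(first_two_unsorted)
print(top_two)
```

The unsorted slice returns "Apple Pie" and "Moon Base", while the sorted one returns "Zebra Tales" and "Moon Base".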

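3. Who are the most popular authors by number of reviews?

# prepare the data
# keep one row per title so a book listed in several years is counted once
author <- book[!duplicated(book$Name), c("Author","Reviews")]
author <- aggregate(Reviews~Author, author, sum)
# aggregate() returns authors in alphabetical order, so sort by total reviews
author <- author[order(author$Reviews, decreasing = TRUE),]


# plotting
ggplot(author[1:10,], mapping = aes(y = reorder(Author,Reviews), x = Reviews))+
  geom_bar(stat = "identity",
           width = 0.5,
           fill = "palevioletred")+
  theme_minimal()+
  theme(axis.text.y = element_text(face = "bold"))+
  labs(title = "Top 10 Most Popular Authors",
       subtitle = "By Number of Reviews",
       y = "Author")

Insights:

  1. The most popular author by number of reviews is the top bar of the chart, with each title counted once

  2. Alex Michaelides and Alice Schertle both come early in the alphabet, so an unsorted slice of the author table would show them first. Only the sorted totals tell us whether they belong in the top ten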

4. How are user ratings distributed (in percent)?

ggplot(data=book, aes(x= User.Rating))+
  # after_stat() replaces the deprecated ..count.. notation
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 fill="orangered", col="black", binwidth=0.05)+
  scale_y_continuous(labels = scales::percent)+
  ylab("Share of books (%)")+
  xlab("User Rating")+
  labs(title = "Distribution of User Ratings (%)")+
  theme_minimal()

Interpretation:

Across the 11 years of Amazon bestseller lists, most books have an average rating between 4.5 and 4.9. Very few books are rated below 4.0, and the lowest rating is 3.3. Based on this, we can assume that readers generally liked the bestselling books, so they are probably worth reading.
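The percentages behind the histogram can also be computed directly. This sketch only uses the first ten ratings shown in the `str()` output, so the numbers describe that small sample and not the full dataset.

```r
# First ten User.Rating values as printed by str(book)
ratings <- c(4.7, 4.6, 4.7, 4.7, 4.8, 4.4, 4.7, 4.7, 4.7, 4.6)

# Share of books rated 4.5 or higher: the mean of a logical vector is a proportion
share_high <- mean(ratings >= 4.5)

# Share of each distinct rating value
rating_share <- prop.table(table(ratings))

print(share_high)
print(rating_share)
```

Running the same two lines on `book$User.Rating` would give the shares for all 550 rows.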

5. How many books of each genre are listed per year?

# prepare the data
year_price <- as.data.frame(table(book$Genre, book$Year))

names(year_price) <- c("Genre", "Year", "Count")

# plotting
# Year comes out of table() as a factor with its levels in chronological order,
# so it is used directly on the x axis instead of being reordered by Count
ggplot(year_price, mapping = aes(x = Year, 
                                 y = Count,
                                 fill = Genre))+
  geom_bar(stat="identity", position=position_dodge(0.8))+
  theme_minimal()+
  xlab("Year")+
  ylab("Number of Books")+
  labs(title = "Number of books by genre per year")+
  scale_fill_brewer(palette = "Set2")

Insights:

  1. Fiction has its highest number of books in 2014

  2. Non-fiction has its highest number of books in 2015

  3. Non-fiction outnumbers fiction in most years, but not necessarily in every year: compare the two bars for 2014, where fiction peaks
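To check claims like these without reading bar heights, we can work with the cross-table itself. A minimal sketch on made-up data (`toy` is hypothetical and much smaller than the real table):

```r
# Made-up listings: fiction leads in 2014, non-fiction leads in 2015
toy <- data.frame(
  Year  = c(2014, 2014, 2014, 2015, 2015, 2015),
  Genre = c("Fiction", "Fiction", "Non Fiction",
            "Non Fiction", "Non Fiction", "Fiction")
)

# Rows are years, columns are genres
counts <- table(toy$Year, toy$Genre)

# Share of each genre within a year (each row sums to 1)
shares <- prop.table(counts, margin = 1)

# Genre with the highest count in each year
leader <- apply(counts, 1, function(row) names(which.max(row)))

print(counts)
print(round(shares, 2))
print(leader)
```

With the real data, `table(book$Year, book$Genre)` gives the counts used in the chart above.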

6. Which books are the most expensive?

# prepare the data
exp_books <- book[,c("Name","Price")]

exp_books <- exp_books[order(exp_books$Price, decreasing = TRUE),]
exp_books <- exp_books[!duplicated(exp_books$Name),]


 # plotting 
ggplot(exp_books[1:10,], mapping = aes(y = reorder(Name, Price),
                                x = Price))+
  geom_bar(stat = "identity", width = 0.5, fill = "red")+
  # theme_minimal() goes first, otherwise it resets the axis text settings
  theme_minimal()+
  theme(axis.text.y = element_text(face = "bold", size = 10))+
  ylab("Title of Book")+
  xlab("Price ($)")+
  labs(title = "Top 10 Most Expensive Books",
       subtitle = "from 2009 to 2019")

Insights:

  1. The most expensive book is Diagnostic and Statistical Manual of Mental Disorders, priced at more than $100.

  2. The lowest-priced book among these ten is The Official SAT Study Guide, 2016 Edition, at about $30. It is the cheapest of the top ten, not of the whole dataset.

7. Which authors have the most expensive books?

# prepare the data
author_price <- book[,c("Author","Price")]

# sort by price first so that !duplicated() keeps each author's most expensive book
author_price <- author_price[order(author_price$Price, decreasing = TRUE),]
author_price <- author_price[!duplicated(author_price$Author),]

 # plotting 
ggplot(author_price[1:10,], mapping = aes(y = reorder(Author, Price),
                                x = Price))+
  geom_bar(stat = "identity", width = 0.3, fill = "skyblue")+
  # theme_minimal() goes first, otherwise it resets the axis text settings
  theme_minimal()+
  theme(axis.text.y = element_text(face = "bold"))+
  ylab("Author")+
  xlab("Price ($)")+
  labs(title = "Top 10 Authors by Most Expensive Book")

Insights:

  1. Across the 11 years of bestseller lists, the most expensive book is by the American Psychiatric Association, priced at more than $100.

  2. The chart only ranks the ten highest-priced authors, so its shortest bar is not the cheapest author overall: the summary shows books listed at $0.
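The order of sorting and de-duplicating decides which price is kept for each author. A small sketch on made-up data (the authors and prices in `toy` are hypothetical) compares three ways of getting one price per author:

```r
# Made-up data: "Ann" has a cheap and an expensive book
toy <- data.frame(
  Author = c("Ann", "Ann", "Bob", "Cy"),
  Price  = c(5, 40, 105, 12)
)

# Option 1: aggregate with max, then sort
max_price <- aggregate(Price ~ Author, toy, max)
ranked <- max_price[order(max_price$Price, decreasing = TRUE), ]

# Option 2: sort by price, then drop repeated authors (keeps the highest price)
sorted <- toy[order(toy$Price, decreasing = TRUE), ]
dedup_after_sort <- sorted[!duplicated(sorted$Author), ]

# Dropping repeated authors before sorting keeps whichever row comes first,
# here Ann's cheap book
dedup_first <- toy[!duplicated(toy$Author), ]

print(ranked)
print(dedup_after_sort)
print(dedup_first)
```

Options 1 and 2 agree; de-duplicating first reports Ann at 5 instead of 40.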

3. Final Conclusion

From our analysis, we can draw some conclusions:

  1. Across the 11 years covered (2009 to 2019), non-fiction books appear on the bestseller list more often than fiction books (310 versus 240 entries)

  2. The most expensive book is Diagnostic and Statistical Manual of Mental Disorders, at $105

  3. The most popular book by number of reviews has 87841 reviews (Where the Crawdads Sing in this dataset). 1984 (Signet Classics), with 21424 reviews, is well below that

  4. The most popular author by number of reviews has to be read from author totals sorted by reviews. An unsorted slice only returns the first ten authors in alphabetical order, such as Alex Michaelides

  5. Based on the distribution of user ratings, most bestsellers are rated between 4.5 and 4.9, so readers generally liked them

  6. Non-fiction outnumbers fiction in most years

  7. The author of the most expensive book is the American Psychiatric Association