Amazon Books

This document analyzes and visualizes the Amazon Books dataset, which I obtained from Kaggle. The dataset has 7 columns and 550 rows.

Data Exploration

The 7 columns are shown below:

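library(dplyr)    # %>%, mutate, group_by, summarize, count
library(forcats)  # fct_recode
library(glue)     # glue
library(ggplot2)  # plotting
df <- read.csv("bestsellers with categories.csv")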
str(df)
## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User.Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : int  17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
##  $ Price      : int  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : int  2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...

Check whether the data contains any missing values. The result shows that there are none.

anyNA(df)
## [1] FALSE
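
If anyNA() ever returned TRUE, a per-column count would show where the gaps are. The following is a minimal sketch on a made-up toy data frame (the `toy` values are assumptions for illustration, not rows from the Kaggle file):

```r
# Toy stand-in for the real data frame; the values are made up for illustration.
toy <- data.frame(
  Name = c("Book A", "Book B", "Book C"),
  User.Rating = c(4.7, NA, 4.5),
  Price = c(8L, 22L, NA)
)

# anyNA() gives a single yes/no answer for the whole table
has_missing <- anyNA(toy)

# colSums(is.na()) counts the missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

On the real data the same call would be colSums(is.na(df)).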

Preprocessing: create a new column, rating_range, that bins each rating into a one-star range, then count the books listed in each year.

# Bin each rating into one-star intervals and give the bins readable labels
df <- df %>%
  mutate(rating_range = cut(User.Rating, c(0, 1, 2, 3, 4, 5))) %>%
  mutate(rating_range = fct_recode(rating_range,
                                   "0-1" = "(0,1]",
                                   "1-2" = "(1,2]", "2-3" = "(2,3]",
                                   "3-4" = "(3,4]", "4-5" = "(4,5]"))
# Count the books listed in each year
df_item_year <- df %>%
  group_by(Year) %>%
  summarize(total_items = n()) %>%
  mutate(label = glue("Year: {Year} 
                    Count of Items: {total_items}"))
df_item_year <- df_item_year[order(df_item_year$Year), ]
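
As a possible alternative, the binning and relabeling can be done in one step with the labels argument of base R's cut(). This is only a sketch on made-up ratings, not the code used for the plots in this document:

```r
# Made-up example ratings, for illustration only
ratings <- c(3.3, 4.0, 4.1, 4.9)

# cut() assigns each rating to an interval; labels names the intervals directly
rating_range <- cut(ratings,
                    breaks = c(0, 1, 2, 3, 4, 5),
                    labels = c("0-1", "1-2", "2-3", "3-4", "4-5"))
print(table(rating_range))
```

By default the intervals are closed on the right, so a rating of exactly 4.0 falls in the "3-4" range, not "4-5".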

Visualization

Total Books by Year

The visualization below shows that every year has 50 books. This is because the dataset covers Amazon's 50 best-selling books for each year.

ggplot(data=df_item_year, aes(x=Year, y=total_items)) +
  geom_line(color="maroon", size=1.25)+
  geom_point()+
  labs(title = "Book By Year",
             x = "Year",
             y = "Total Books") +
  theme(plot.title = element_text(hjust = 0.5))
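
The flat line can also be checked numerically. The sketch below uses a made-up toy table with 2 rows per year; `expected_per_year` is a hypothetical name, and for the real data it would be 50:

```r
# Toy stand-in: 2 rows per year instead of the 50 in the real data
toy <- data.frame(Year = c(2009L, 2009L, 2010L, 2010L, 2011L, 2011L))
expected_per_year <- 2L  # assumption for the toy data; use 50 for the real data

# table() counts the rows for each year
books_per_year <- table(toy$Year)

# TRUE only if every year has exactly the expected number of rows
all_years_match <- all(books_per_year == expected_per_year)
print(all_years_match)
```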

Unique Books by Year

The visualization below shows that the number of unique books is not the same every year. This means some books were best sellers in more than one year.

# Keep only the first row for each title
uni_book=df[!duplicated(df$Name),]

# The y axis shows density, not a raw count
ggplot(data=uni_book,aes(x=Year))+
  geom_histogram(aes(y=..density..),fill="maroon",col="navy",binwidth = 0.2)+
  labs(title = "Unique Books By Year",
             x = "Year",
             y = "Density") +
  theme(plot.title = element_text(hjust = 0.5))
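
To see which titles repeat, one option is to count the rows per title. This is a minimal sketch on made-up data, and it assumes each title appears at most once per year:

```r
# Toy stand-in for the Name and Year columns; values are made up
toy <- data.frame(
  Name = c("Book A", "Book A", "Book B", "Book C", "Book C", "Book C"),
  Year = c(2016L, 2017L, 2016L, 2017L, 2018L, 2019L)
)

# Number of rows (yearly appearances) per title
years_listed <- table(toy$Name)

# Keep only the titles that appear more than once
repeat_books <- years_listed[years_listed > 1]
print(sort(repeat_books, decreasing = TRUE))
```

On the real data the same idea would be table(df$Name).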

User Rating Distribution

The visualization below shows that no book has an average rating below 3. The most common rating is 4.8, and no book has a rating of 5.0.

ggplot(data=df,aes(x=User.Rating))+
  geom_histogram(aes(y=(..count..)/sum(..count..)),fill="maroon",col="navy",binwidth=0.05)+
  labs(title = "Rating Distribution",
             x = "Rating",
             y = "Percent of Books") +
  scale_y_continuous(labels = scales::percent)+
  theme(plot.title = element_text(hjust = 0.5))
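
The lowest, highest, and most common rating can also be read off directly instead of from the plot. A minimal sketch on made-up ratings (not the real column):

```r
# Made-up example ratings, for illustration only
ratings <- c(4.7, 4.8, 4.8, 3.3, 4.9, 4.8, 4.6)

# Smallest and largest rating
lowest <- min(ratings)
highest <- max(ratings)

# table() counts each distinct rating; which.max() picks the most frequent one
rating_counts <- table(ratings)
most_common <- names(rating_counts)[which.max(rating_counts)]
print(c(lowest = lowest, highest = highest))
print(most_common)
```

On the real data, `ratings` would be df$User.Rating.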

User Rating Range Distribution

The visualization shows that the dominant rating range for Amazon best sellers is 4-5 stars.

df_rating <- df %>%group_by(rating_range) %>% 
        summarize(total_items=n()) %>% 
        mutate(label = glue("Rating Range: {rating_range} 
                    Count of Items: {total_items}"))
ggplot(data=df_rating, aes(x=rating_range, y=total_items))+
  geom_bar(stat='identity',fill = "maroon") +
  labs(title = "Rating Range Distribution",
             x = "Rating Range",
             y = "Total Books") +
  theme(plot.title = element_text(hjust = 0.5))

Top 10 Authors of Best-Selling Books

Among the top 10 authors, the minimum number of best-selling books is 7 and the maximum is 12 (each yearly appearance on the list counts once). The top author is Jeff Kinney with 12 best-selling books.

# Count rows per author, keep the 10 largest counts, and sort ascending
top_author <- df %>%
  count(Author, name = "Total") %>%
  top_n(10, Total) %>%
  head(10) %>%
  arrange(Total)

top_author=data.frame(top_author)
ggplot(data=top_author,aes(x=Author,y=Total,fill=Total))+
  geom_bar(stat = "identity")+
  scale_fill_gradient(low = "yellow", high ="maroon") +
  labs(title = "Top 10 Author Published Book being Best Seller",
             x = "Author",
             y = "Total Books") +
  theme(axis.text=element_text(size = 6.5), plot.title = element_text(hjust = 0.5))
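
Note that ggplot2 orders a character x axis alphabetically, so sorting the rows by Total does not change the bar order. If bars sorted by count are wanted, one option is to turn Author into a factor whose levels follow Total, using base R's reorder(). A minimal sketch with made-up authors and counts:

```r
# Toy stand-in for top_author; names and counts are made up
top_author <- data.frame(
  Author = c("Author X", "Author Y", "Author Z"),
  Total = c(9L, 12L, 7L)
)

# reorder() returns a factor whose levels are sorted by Total (ascending)
top_author$Author <- reorder(top_author$Author, top_author$Total)
print(levels(top_author$Author))
```

With Author stored this way, aes(x=Author) would draw the bars from the smallest to the largest count.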