This document analyzes and visualizes the Amazon bestselling books dataset, which I obtained from Kaggle. The dataset has 7 columns and 550 rows.
The 7 columns are shown below:
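library(dplyr)    # %>%, mutate, group_by, summarize
library(forcats)  # fct_recode
library(glue)     # glue
library(ggplot2)  # ggplot and geoms
df <- read.csv("bestsellers with categories.csv")
str(df)
## 'data.frame': 550 obs. of 7 variables: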
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : chr "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
Check whether the dataset has any missing values. The result shows there are none.
anyNA(df)
## [1] FALSE
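`anyNA()` only answers yes or no for the whole data frame. As a minimal sketch of a per-column check, the example below counts missing values column by column. The `toy` data frame and its values are made up for illustration and are not part of the Kaggle data.

```r
# Hypothetical toy data with two missing values (not the real dataset)
toy <- data.frame(
  Name = c("Book A", "Book B", "Book C"),
  User.Rating = c(4.7, NA, 4.5),
  Price = c(8L, 22L, NA)
)

# Whole-frame check, as used on df above
has_missing <- anyNA(toy)

# Per-column count of missing values
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

If `anyNA(df)` had returned TRUE, the same `colSums(is.na(df))` call would show which columns need attention.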
Preprocessing: create a new column, rating_range, that bins each rating into a one-star range.
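Before applying the binning to the data, it may help to see how `cut()` behaves on a few values. The sketch below uses a made-up `ratings` vector. It also shows the base R `labels` argument as one possible alternative to recoding the levels afterwards; that is an illustration, not the method used in this document.

```r
# Hypothetical ratings, chosen to include a value exactly on a break
ratings <- c(3.3, 4.0, 4.1, 4.9)
breaks <- c(0, 1, 2, 3, 4, 5)

# Default intervals are right-closed, so 4.0 falls in (3,4], not (4,5]
binned <- cut(ratings, breaks)
print(binned)

# Alternative: assign readable labels directly in cut()
labelled <- cut(ratings, breaks, labels = c("0-1", "1-2", "2-3", "3-4", "4-5"))
print(table(labelled))
```

Because the intervals are right-closed, a book rated exactly 4.0 is counted in the 3-4 range.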
df<- df %>%
mutate(rating_range = cut(User.Rating, c(0, 1, 2, 3, 4, 5))) %>%
mutate(rating_range = fct_recode(rating_range,
"0-1" = "(0,1]",
"1-2" = "(1,2]", "2-3" = "(2,3]",
"3-4" = "(3,4]", "4-5" = "(4,5]"))
df_item_year <- df %>%
group_by(Year) %>%
summarize(total_items=n()) %>%
mutate(label = glue("Year: {Year}
Count of Items: {total_items}"))
df_item_year <- df_item_year[order(df_item_year$Year),]
The plot below shows that every year has exactly 50 books, because the dataset lists Amazon's 50 bestselling books for each year.
ggplot(data=df_item_year, aes(x=Year, y=total_items)) +
geom_line(color="maroon", size=1.25)+
geom_point()+
labs(title = "Book By Year",
x = "Year",
y = "Total Books") +
theme(plot.title = element_text(hjust = 0.5))
The plot below counts each book only once and shows that the number of unique books is not the same every year. This means some books were bestsellers in more than one year.
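The repeat-bestseller point can also be checked numerically by counting how many rows each title has. This is a minimal sketch on a made-up `toy` data frame, assuming each title appears at most once per year.

```r
# Hypothetical toy data: Book A is listed in 2 years, Book C in 3
toy <- data.frame(
  Name = c("Book A", "Book A", "Book B", "Book C", "Book C", "Book C"),
  Year = c(2016, 2017, 2016, 2017, 2018, 2019)
)

# Number of rows (years) per title
years_listed <- table(toy$Name)

# Titles that appear in more than one year
repeat_titles <- names(years_listed[years_listed > 1])
print(repeat_titles)
```

Running `table(df$Name)` on the real data in the same way would list the titles that were bestsellers in several years.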
uni_book=df[!duplicated(df$Name),]
ggplot(data=uni_book,aes(x=Year))+
geom_histogram(fill="maroon",col="navy",binwidth = 0.2)+ # plot counts so the y axis matches the "Total Books" label
labs(title = "Book By Year",
x = "Year",
y = "Total Books") +
theme(plot.title = element_text(hjust = 0.5))
The plot below shows that no book has an average rating below 3. The most common rating is 4.8, and no book is rated 5.0.
ggplot(data=df,aes(x=User.Rating))+
geom_histogram(aes(y=(..count..)/sum(..count..)),fill="maroon",col="navy",binwidth=0.05)+
labs(title = "Rating Distribution",
x = "Rating",
y = "Percent of Books") +
scale_y_continuous(labels = scales::percent)+
theme(plot.title = element_text(hjust = 0.5))
The plot below shows that the dominant rating range for Amazon bestsellers is 4-5 stars.
df_rating <- df %>%
group_by(rating_range) %>%
summarize(total_items=n()) %>%
mutate(label = glue("Rating Range: {rating_range}
Count of Items: {total_items}"))
ggplot(data=df_rating, aes(x=rating_range, y=total_items))+
geom_bar(stat='identity',fill = "maroon") +
labs(title = "Rating Range Distribution",
x = "Rating Range",
y = "Total Books") +
theme(plot.title = element_text(hjust = 0.5))
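The group_by() and summarize(n()) pattern used for the yearly and rating-range totals can also be written with dplyr's `count()`. The sketch below compares the two forms on a made-up `toy` data frame. It is shown only as an equivalent shorthand, not as a change to the analysis.

```r
library(dplyr)

# Hypothetical toy data with an already-binned rating_range column
toy <- data.frame(
  rating_range = factor(c("4-5", "4-5", "3-4"), levels = c("3-4", "4-5"))
)

# Pattern used in this document
by_summarize <- toy %>%
  group_by(rating_range) %>%
  summarize(total_items = n())

# Shorthand: count() groups and tallies in one step
by_count <- toy %>%
  count(rating_range, name = "total_items")

print(by_count)
```

Both forms drop ranges with no books by default, which is why the bar chart above only shows ranges that actually occur in the data.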