This is a simple exploratory data analysis of Amazon's top 50 bestselling books from 2009 to 2019. The data and the idea come from a Kaggle data set.
Description of the data set on Kaggle:
“Dataset on Amazon’s Top 50 bestselling books from 2009 to 2019. Contains 550 books, data has been categorized into fiction and non-fiction using Goodreads”
The data set contains 550 rows (books) and 7 columns (Name, Author, User Rating, Reviews, Price, Year, Genre), which makes it a small data set. It can be downloaded from the link below, although the copy on Kaggle may change over time.
https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019
To guard against the data changing on Kaggle, the data set has been uploaded to a GitHub repository. The R script and the resulting plots can also be found there:
https://github.com/LYT102/Exploratory-Analysis-Amazon-Top-50-Selling-Book
This analysis was run on Windows 10 with R version 4.0.3 (2020-10-10), platform x86_64-w64-mingw32/x64 (64-bit).
The R packages used in this analysis are data.table, dplyr, ggplot2, and cowplot, loaded as below.
library(data.table)
library(dplyr)
library(ggplot2)
library(cowplot)
The data was read from a local project file and then passed through functions that generate the subset data and summary data used for the exploratory data analysis.
RawData<-fread("./bestsellers with categories.csv",header =TRUE)
RawData<-as_tibble(RawData)
RawData$Genre<-as.factor(RawData$Genre)
head(RawData,5)
## # A tibble: 5 x 7
## Name Author `User Rating` Reviews Price Year Genre
## <chr> <chr> <dbl> <int> <int> <int> <fct>
## 1 10-Day Green Smoothie C~ JJ Smith 4.7 17350 8 2016 Non F~
## 2 11/22/63: A Novel Stephen King 4.6 2052 22 2011 Ficti~
## 3 12 Rules for Life: An A~ Jordan B. P~ 4.7 18979 15 2018 Non F~
## 4 1984 (Signet Classics) George Orwe~ 4.7 21424 6 2017 Ficti~
## 5 5,000 Awesome Facts (Ab~ National Ge~ 4.8 7665 12 2019 Non F~
The raw data above goes through two functions, which shorten the work of generating the subset data and the summary data. The first function generates a subset of the raw data, as below.
getData<-function(i){
  # Keep the rows for year i and sort them by book name
  RawData%>%filter(Year == i)%>%arrange(Name)
}
The function selects the rows for the specified year and sorts them by book name in ascending order. The code below generates the subset data for each year from 2009 to 2019 and shows the first 5 rows of the 2009 subset.
Df2009<-getData(2009)
Df2010<-getData(2010)
Df2011<-getData(2011)
Df2012<-getData(2012)
Df2013<-getData(2013)
Df2014<-getData(2014)
Df2015<-getData(2015)
Df2016<-getData(2016)
Df2017<-getData(2017)
Df2018<-getData(2018)
Df2019<-getData(2019)
head(Df2009,5)
## # A tibble: 5 x 7
## Name Author `User Rating` Reviews Price Year Genre
## <chr> <chr> <dbl> <int> <int> <int> <fct>
## 1 Act Like a Lady, Think Lik~ Steve Ha~ 4.6 5013 17 2009 Non F~
## 2 Arguing with Idiots: How t~ Glenn Be~ 4.6 798 5 2009 Non F~
## 3 Breaking Dawn (The Twiligh~ Stepheni~ 4.6 9769 13 2009 Ficti~
## 4 Crazy Love: Overwhelmed by~ Francis ~ 4.7 1542 14 2009 Non F~
## 5 Dead And Gone: A Sookie St~ Charlain~ 4.6 1541 4 2009 Ficti~
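The eleven calls above can also be written as one loop over the years. The following is a minimal sketch, not the code used in this analysis: it assumes a small toy table in place of RawData (the rows are made up), and the names `years` and `yearly` are hypothetical.

```r
library(dplyr)

# Toy stand-in for RawData (hypothetical rows, not the Kaggle data)
RawData <- tibble(
  Name = c("B", "A", "D", "C"),
  Year = c(2009L, 2009L, 2010L, 2010L),
  Genre = factor(c("Fiction", "Non Fiction", "Fiction", "Fiction"))
)

# Same logic as getData in the text: one year, sorted by book name
getData <- function(i) {
  RawData %>% filter(Year == i) %>% arrange(Name)
}

# Every year present in the data, in ascending order
years <- sort(unique(RawData$Year))

# Named list with one sorted subset per year, e.g. yearly[["2009"]]
yearly <- setNames(lapply(years, getData), years)
```

With this layout `yearly[["2009"]]` plays the role of Df2009, and a new year in the data needs no extra line of code.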
With the subset data generated above, each subset can be summarised. The second function generates the summary data for a subset, as below.
SummaryGern<-function(DataFrameName){
  # Number of books per genre
  Count<-DataFrameName%>%group_by(Genre)%>%summarise(count = n())
  # Mean of each numeric column per genre (Year is constant within a subset)
  DataMean<-DataFrameName%>%select(Year,Genre,`User Rating`,Reviews,Price)%>%group_by(Genre)%>%
    summarise(across(everything(),mean))%>%relocate(Year,.before = Genre)%>%
    rename(User.Rating.Mean=`User Rating`,
           Reviews.Mean=Reviews,
           Price.Mean=Price)
  # Median of each numeric column per genre
  DataMedian<-DataFrameName%>%select(Year,Genre,`User Rating`,Reviews,Price)%>%group_by(Genre)%>%
    summarise(across(everything(),median))%>%relocate(Year,.before = Genre)%>%
    rename(User.Rating.Median=`User Rating`,
           Reviews.Median=Reviews,
           Price.Median=Price)
  # Join mean, median and count into one table
  Data<-inner_join(DataMean,DataMedian,by=c("Year","Genre"))
  Data<-inner_join(Data,Count,by="Genre")
  return(Data)
}
The function counts the books in each genre (Fiction, Non Fiction) and computes the mean and median of User Rating, Reviews, and Price per genre for the year. Inner joins then combine these results into one summary table. The code below generates the summary data for each year from 2009 to 2019 and shows the summary for 2009.
a<-SummaryGern(Df2009)
b<-SummaryGern(Df2010)
c<-SummaryGern(Df2011)
d<-SummaryGern(Df2012)
e<-SummaryGern(Df2013)
f<-SummaryGern(Df2014)
g<-SummaryGern(Df2015)
h<-SummaryGern(Df2016)
i<-SummaryGern(Df2017)
j<-SummaryGern(Df2018)
k<-SummaryGern(Df2019)
a
## # A tibble: 2 x 9
## Year Genre User.Rating.Mean Reviews.Mean Price.Mean User.Rating.Med~
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 2009 Fict~ 4.59 6534. 15.6 4.7
## 2 2009 Non ~ 4.58 3026. 15.2 4.6
## # ... with 3 more variables: Reviews.Median <dbl>, Price.Median <dbl>,
## # count <int>
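The same count, mean, and median columns can also be produced for all years at once by grouping on both Year and Genre. The following is a sketch under stated assumptions rather than the method used in this analysis: `toy` is a made-up stand-in for RawData, and `summary_all` is a hypothetical name.

```r
library(dplyr)

# Toy stand-in for RawData (hypothetical values, not the Kaggle data)
toy <- tibble(
  Year = c(2009L, 2009L, 2009L, 2010L, 2010L),
  Genre = factor(c("Fiction", "Fiction", "Non Fiction", "Fiction", "Non Fiction")),
  `User Rating` = c(4.6, 4.8, 4.5, 4.7, 4.4),
  Reviews = c(100L, 300L, 50L, 200L, 80L),
  Price = c(10L, 20L, 15L, 8L, 12L)
)

# One row per Year and Genre, with the same column names as SummaryGern
summary_all <- toy %>%
  group_by(Year, Genre) %>%
  summarise(
    User.Rating.Mean = mean(`User Rating`),
    Reviews.Mean = mean(Reviews),
    Price.Mean = mean(Price),
    User.Rating.Median = median(`User Rating`),
    Reviews.Median = median(Reviews),
    Price.Median = median(Price),
    count = n(),
    .groups = "drop"
  )
```

Grouping on two columns removes the need for the per-year subsets and the joins, at the cost of moving away from the step-by-step structure used in this write-up.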
The yearly summaries generated above are combined with the code below. The Year variable, which is numeric in the summary data, is also converted to a date variable so that it can be plotted on a date axis.
summarydata<-rbind(a,b,c,d,e,f,g,h,i,j,k)
summarydata<-summarydata[order(summarydata$Year),]
summarydata$Year<-as.Date(as.character(summarydata$Year),format= "%Y")
rm(a,b,c,d,e,f,g,h,i,j,k)
summarydata
## # A tibble: 22 x 9
## Year Genre User.Rating.Mean Reviews.Mean Price.Mean User.Rating.Med~
## <date> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 2009-01-21 Fict~ 4.59 6534. 15.6 4.7
## 2 2009-01-21 Non ~ 4.58 3026. 15.2 4.6
## 3 2010-01-21 Fict~ 4.62 8409. 9.7 4.7
## 4 2010-01-21 Non ~ 4.52 3527. 16 4.55
## 5 2011-01-21 Fict~ 4.62 10335. 11.6 4.7
## 6 2011-01-21 Non ~ 4.51 6483. 17.6 4.6
## 7 2012-01-21 Fict~ 4.50 19896. 12.3 4.6
## 8 2012-01-21 Non ~ 4.56 8163. 17.5 4.6
## 9 2013-01-21 Fict~ 4.55 19987. 10.7 4.65
## 10 2013-01-21 Non ~ 4.56 6739. 18.2 4.6
## # ... with 12 more rows, and 3 more variables: Reviews.Median <dbl>,
## # Price.Median <dbl>, count <int>
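In the summary data the Year column is now in year-month-day format. The month and day come from the date on which the code was run, because as.Date() fills in the current month and day when the format only specifies a year. This does not affect the plots, as the analysis only uses the year.
With the summary data above, four figures were drawn, as below.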
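If a fixed anchor date is preferred, the year can be pasted together with an explicit month and day before conversion. This is a minimal sketch, with a hypothetical `years` vector standing in for the Year column:

```r
# Hypothetical stand-in for summarydata$Year before conversion
years <- c(2009, 2010, 2011)

# Append a fixed month and day so the result does not depend on the run date
year_dates <- as.Date(paste0(years, "-01-01"))

# Only the year is used on the plot axis
format(year_dates, "%Y")
```

Every value then falls on 1 January of its year, whenever the script is run.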
# Number of books per genre for each year
graph1<-ggplot(summarydata,aes(x=Year,y=count,color=Genre))
graph1<-graph1+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph1<-graph1+geom_point()+ylim(c(15,35))
graph1<-graph1+ggtitle("Count of Amazon Top 50 Best Selling Book Based on Genre from 2009 to 2019")
graph1
The graph above shows that in most years non-fiction books outnumber fiction books in the Amazon Top 50 bestsellers. One year in which fiction books outnumber non-fiction books is 2014.
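A reading like this can also be checked in code instead of by eye. The sketch below assumes a table with the same Year, Genre, and count columns as summarydata. The numbers in `toy_counts` are made up for illustration, and `fiction_lead` is a hypothetical name.

```r
library(dplyr)

# Toy counts in the same shape as summarydata (hypothetical numbers)
toy_counts <- tibble(
  Year = c(2013, 2013, 2014, 2014),
  Genre = c("Fiction", "Non Fiction", "Fiction", "Non Fiction"),
  count = c(24L, 26L, 29L, 21L)
)

# Put the two genre counts side by side, then keep years where fiction leads
fiction_lead <- toy_counts %>%
  group_by(Year) %>%
  summarise(
    fiction = count[Genre == "Fiction"],
    non_fiction = count[Genre == "Non Fiction"]
  ) %>%
  filter(fiction > non_fiction)
```

Run on the real summary table, `fiction_lead$Year` would list every year in which fiction titles outnumber non-fiction titles.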
# Left panel: mean user rating (legend hidden, shared with the right panel)
graph2<-ggplot(summarydata,aes(x=Year,y=User.Rating.Mean,color=Genre))
graph2<-graph2+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph2<-graph2+geom_point()+theme(legend.position = "none")+ylim(c(4.4,5.0))
# Right panel: median user rating
graph3<-ggplot(summarydata,aes(x=Year,y=User.Rating.Median,color=Genre))
graph3<-graph3+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph3<-graph3+geom_point()+ylim(c(4.4,5.0))
# Place the panels side by side under a common title
plot_row1<-plot_grid(graph2,graph3,rel_widths = c(0.8,1))
titlebindplot1<-ggdraw()+draw_label("Average and Median User Rating of Amazon Top 50 Best Selling Book Based on Genre from 2009 to 2019")
BindGraph1<-plot_grid(titlebindplot1,plot_row1,ncol=1,rel_heights = c(0.1,1))
BindGraph1
The figure above shows that in most years both the average and the median user rating of fiction books are higher than those of non-fiction books in the Amazon Top 50 bestsellers. The average user rating plot (left) shows that in 2012 and 2013 the rating of fiction books was lower than that of non-fiction books. The median user rating plot (right) shows that in 2012 fiction and non-fiction books had the same median rating of 4.6.
# Left panel: mean number of reviews (legend hidden, shared with the right panel)
graph4<-ggplot(summarydata,aes(x=Year,y=Reviews.Mean,color=Genre))
graph4<-graph4+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph4<-graph4+geom_point()+theme(legend.position = "none")+ylim(c(1500,25000))
# Right panel: median number of reviews
graph5<-ggplot(summarydata,aes(x=Year,y=Reviews.Median,color=Genre))
graph5<-graph5+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph5<-graph5+geom_point()+ylim(c(1500,25000))
# Place the panels side by side under a common title
plot_row2<-plot_grid(graph4,graph5,rel_widths = c(0.8,1))
titlebindplot2<-ggdraw()+draw_label("Average and Median User Reviews of Amazon Top 50 Best Selling Book Based on Genre from 2009 to 2019")
BindGraph2<-plot_grid(titlebindplot2,plot_row2,ncol=1,rel_heights = c(0.1,1))
BindGraph2
The figure above shows that in most years both the average and the median number of user reviews are higher for fiction books than for non-fiction books in the Amazon Top 50 bestsellers. Both the average and the median plots clearly show that in 2018 fiction books had fewer user reviews than non-fiction books.
# Left panel: mean price (legend hidden, shared with the right panel)
graph6<-ggplot(summarydata,aes(x=Year,y=Price.Mean,color=Genre))
graph6<-graph6+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph6<-graph6+geom_point()+theme(legend.position = "none")+ylim(c(6,25))
# Right panel: median price
graph7<-ggplot(summarydata,aes(x=Year,y=Price.Median,color=Genre))
graph7<-graph7+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph7<-graph7+geom_point()+ylim(c(6,25))
# Place the panels side by side under a common title
plot_row3<-plot_grid(graph6,graph7,rel_widths = c(0.8,1))
titlebindplot3<-ggdraw()+draw_label("Average and Median Price of Amazon Top 50 Best Selling Book Based on Genre from 2009 to 2019")
BindGraph3<-plot_grid(titlebindplot3,plot_row3,ncol=1,rel_heights = c(0.1,1))
BindGraph3
The figure above shows that in most years both the average and the median price of non-fiction books are higher than those of fiction books in the Amazon Top 50 bestsellers. The average price plot (left) shows that in 2009 the price of non-fiction books was lower than that of fiction books. The median price plot (right) shows that in 2015 the median prices of non-fiction and fiction books were about the same, at 9.
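The rating, reviews, and price figures all follow the same recipe: a mean panel without a legend, a median panel with one, and a title on top. That recipe could be wrapped in a helper. The following is only a sketch of the idea: `metric_panel`, `paired_figure`, and `toy_summary` are hypothetical names, and the toy values are made up rather than taken from the full summary table.

```r
library(ggplot2)
library(cowplot)

# Toy stand-in for summarydata (hypothetical values)
toy_summary <- data.frame(
  Year = as.Date(rep(c("2009-01-01", "2010-01-01"), each = 2)),
  Genre = rep(c("Fiction", "Non Fiction"), times = 2),
  Price.Mean = c(15.6, 15.2, 9.7, 16.0),
  Price.Median = c(12, 13, 9, 14)
)

# Hypothetical helper: one line-and-point panel for a single summary column
metric_panel <- function(data, column, limits, show_legend = TRUE) {
  p <- ggplot(data, aes(x = Year, y = .data[[column]], color = Genre)) +
    geom_line() +
    geom_point() +
    scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
    ylim(limits) +
    ylab(column)
  if (!show_legend) p <- p + theme(legend.position = "none")
  p
}

# Hypothetical helper: mean panel left, median panel right, title on top
paired_figure <- function(data, metric, limits, title) {
  panels <- plot_grid(
    metric_panel(data, paste0(metric, ".Mean"), limits, show_legend = FALSE),
    metric_panel(data, paste0(metric, ".Median"), limits),
    rel_widths = c(0.8, 1)
  )
  plot_grid(ggdraw() + draw_label(title), panels, ncol = 1, rel_heights = c(0.1, 1))
}

price_figure <- paired_figure(toy_summary, "Price", c(6, 25),
                              "Mean and Median Price by Genre")
```

Calling `paired_figure` with "User.Rating" or "Reviews" and suitable limits would produce the other two figures from the same code.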
Lastly, the results above give an overview of the Amazon Top 50 bestselling books from 2009 to 2019. In most years non-fiction books outnumber fiction books in the list, and non-fiction books tend to be more expensive than fiction books. On the other hand, fiction books in the list mostly have higher user ratings than non-fiction books, and they tend to receive more user reviews.