Introduction

This is a simple exploratory data analysis of Amazon's top 50 bestselling books from 2009 to 2019. The data and the idea come from a Kaggle data set.

Kaggle description of the data set:

“Dataset on Amazon’s Top 50 bestselling books from 2009 to 2019. Contains 550 books, data has been categorized into fiction and non-fiction using Goodreads”

The data set contains 550 rows (books) and 7 columns (Name, Author, User Rating, Reviews, Price, Year, Genre), which makes it a small data set. It can be downloaded from the page below, although the copy there may change over time.

https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019

To guard against the Kaggle data changing over time, the data set has been uploaded to a GitHub repository. The R script and the resulting plots can also be found there:

https://github.com/LYT102/Exploratory-Analysis-Amazon-Top-50-Selling-Book

R Packages and Version

This analysis was run on Windows 10 with R version 4.0.3 (2020-10-10), platform x86_64-w64-mingw32/x64 (64-bit).

The R packages used in this analysis are data.table, dplyr, ggplot2, and cowplot:

library(data.table)
library(dplyr)
library(ggplot2)
library(cowplot)

Data Processing

Read data

The data was read from a local file in the project and then passed through two functions to generate the subset data and the summary data used for the exploratory data analysis.

RawData<-fread("./bestsellers with categories.csv",header =TRUE)
RawData<-as_tibble(RawData)  # tbl_df() is deprecated since dplyr 1.0.0
RawData$Genre<-as.factor(RawData$Genre)

head(RawData,5)
## # A tibble: 5 x 7
##   Name                     Author       `User Rating` Reviews Price  Year Genre 
##   <chr>                    <chr>                <dbl>   <int> <int> <int> <fct> 
## 1 10-Day Green Smoothie C~ JJ Smith               4.7   17350     8  2016 Non F~
## 2 11/22/63: A Novel        Stephen King           4.6    2052    22  2011 Ficti~
## 3 12 Rules for Life: An A~ Jordan B. P~           4.7   18979    15  2018 Non F~
## 4 1984 (Signet Classics)   George Orwe~           4.7   21424     6  2017 Ficti~
## 5 5,000 Awesome Facts (Ab~ National Ge~           4.8    7665    12  2019 Non F~
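Before subsetting, it can help to confirm that the file matches its description (550 rows, 50 books per year, no missing values). The following is a minimal base R sketch of such a check; the `toy` data frame and its values are hypothetical stand-ins for RawData, so the expected counts differ from the real file.

```r
# Toy stand-in for RawData: 2 years x 3 books (hypothetical values)
toy <- data.frame(
  Name  = paste("Book", 1:6),
  Year  = rep(c(2009L, 2010L), each = 3),
  Genre = factor(c("Fiction", "Non Fiction", "Fiction",
                   "Non Fiction", "Non Fiction", "Fiction")),
  stringsAsFactors = FALSE
)

dim(toy)                           # rows and columns (550 and 7 expected for the real file)
books_per_year <- table(toy$Year)  # should be the same count for every year
missing_cells  <- sum(is.na(toy))  # should be 0 if the file is complete
levels(toy$Genre)                  # the genre labels that were read
```

On the real data the same three checks would be run on RawData, with 50 as the expected count per year.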

Subset Data

The raw data is passed through two functions to shorten the work of generating the subset data and the summary data. The first function generates a subset of the raw data:

getData<-function(i){
  # Keep only the rows for year i, sorted by book name
  RawData%>%filter(Year == i)%>%arrange(Name)
}

The function selects the rows for the specified year and sorts them by book name in ascending order. The code below generates the subsets for 2009 to 2019 and shows the first 5 rows of the 2009 subset.

Df2009<-getData(2009)
Df2010<-getData(2010)
Df2011<-getData(2011)
Df2012<-getData(2012)
Df2013<-getData(2013)
Df2014<-getData(2014)
Df2015<-getData(2015)
Df2016<-getData(2016)
Df2017<-getData(2017)
Df2018<-getData(2018)
Df2019<-getData(2019)

head(Df2009,5)
## # A tibble: 5 x 7
##   Name                        Author    `User Rating` Reviews Price  Year Genre 
##   <chr>                       <chr>             <dbl>   <int> <int> <int> <fct> 
## 1 Act Like a Lady, Think Lik~ Steve Ha~           4.6    5013    17  2009 Non F~
## 2 Arguing with Idiots: How t~ Glenn Be~           4.6     798     5  2009 Non F~
## 3 Breaking Dawn (The Twiligh~ Stepheni~           4.6    9769    13  2009 Ficti~
## 4 Crazy Love: Overwhelmed by~ Francis ~           4.7    1542    14  2009 Non F~
## 5 Dead And Gone: A Sookie St~ Charlain~           4.6    1541     4  2009 Ficti~
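The eleven per-year calls above could also be produced in one step. The following is a minimal sketch using base R's `split()`, not the author's approach; the `toy` data frame is a hypothetical stand-in for RawData.

```r
# Toy stand-in for RawData (hypothetical values, not taken from the data set)
toy <- data.frame(
  Name = c("B book", "A book", "D book", "C book"),
  Year = c(2009L, 2009L, 2010L, 2010L),
  stringsAsFactors = FALSE
)

# Sort by name once, then split into a named list with one data frame per year
toy_sorted <- toy[order(toy$Name), ]
by_year <- split(toy_sorted, toy_sorted$Year)

names(by_year)          # "2009" "2010"
by_year[["2009"]]$Name  # "A book" "B book"
```

A list keyed by year keeps the workspace small and lets later steps loop over the subsets with `lapply()` instead of naming each one.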

Summarise Data

With the subsets generated, each one can be summarised. The second function generates the summary data for a subset:

SummaryGern<-function(DataFrameName){
    # Number of books per genre
    Count<-DataFrameName%>%group_by(Genre)%>%summarise(count = n())
    
    # Mean of each numeric column per genre (Year is constant within a subset)
    DataMean<-DataFrameName%>%select(Year,Genre,`User Rating`,Reviews,Price)%>%group_by(Genre)%>%
              summarise(across(everything(),mean))%>%relocate(Year,.before = Genre)%>%
              rename(User.Rating.Mean=`User Rating`,
                     Reviews.Mean=Reviews,
                     Price.Mean=Price)
    
    # Median of each numeric column per genre
    DataMedian<-DataFrameName%>%select(Year,Genre,`User Rating`,Reviews,Price)%>%group_by(Genre)%>%
                summarise(across(everything(),median))%>%relocate(Year,.before = Genre)%>%
                rename(User.Rating.Median=`User Rating`,
                       Reviews.Median=Reviews,
                       Price.Median=Price)

    # Join the three tables on explicit keys
    Data<-inner_join(DataMean,DataMedian,by = c("Year","Genre"))
    Data<-inner_join(Data,Count,by = "Genre")
    
    return(Data)
}

For each genre (Fiction, Non Fiction) in the given year, the function counts the books and computes the mean and median of User Rating, Reviews, and Price. Inner joins then combine the three intermediate tables into one summary. The code below generates the summaries for 2009 to 2019 and shows the summary for 2009.

a<-SummaryGern(Df2009)
b<-SummaryGern(Df2010)
c<-SummaryGern(Df2011)
d<-SummaryGern(Df2012)
e<-SummaryGern(Df2013)
f<-SummaryGern(Df2014)
g<-SummaryGern(Df2015)
h<-SummaryGern(Df2016)
i<-SummaryGern(Df2017)
j<-SummaryGern(Df2018)
k<-SummaryGern(Df2019)

a
## # A tibble: 2 x 9
##    Year Genre User.Rating.Mean Reviews.Mean Price.Mean User.Rating.Med~
##   <dbl> <fct>            <dbl>        <dbl>      <dbl>            <dbl>
## 1  2009 Fict~             4.59        6534.       15.6              4.7
## 2  2009 Non ~             4.58        3026.       15.2              4.6
## # ... with 3 more variables: Reviews.Median <dbl>, Price.Median <dbl>,
## #   count <int>
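The same per-genre statistics could also be computed for all years in a single pass by grouping on both Year and Genre. The following is a minimal base R sketch of that idea, limited to Price for brevity; the `toy` data frame and its values are hypothetical, and this is an alternative to the per-year function rather than the method used here.

```r
# Toy stand-in for RawData (hypothetical values)
toy <- data.frame(
  Year  = c(2009, 2009, 2009, 2010, 2010, 2010),
  Genre = c("Fiction", "Fiction", "Non Fiction",
            "Fiction", "Non Fiction", "Non Fiction"),
  Price = c(10, 20, 8, 12, 6, 10),
  stringsAsFactors = FALSE
)

# One aggregate() call per statistic, grouped by Year and Genre together
price_mean   <- aggregate(Price ~ Year + Genre, data = toy, FUN = mean)
price_median <- aggregate(Price ~ Year + Genre, data = toy, FUN = median)
book_count   <- aggregate(Price ~ Year + Genre, data = toy, FUN = length)

names(price_mean)[3]   <- "Price.Mean"
names(price_median)[3] <- "Price.Median"
names(book_count)[3]   <- "count"

# Join the three tables on the shared keys and sort by year, then genre
summary_all <- Reduce(function(x, y) merge(x, y, by = c("Year", "Genre")),
                      list(price_mean, price_median, book_count))
summary_all <- summary_all[order(summary_all$Year, summary_all$Genre), ]
summary_all
```

Grouping on both keys yields one row per year and genre directly, so no per-year subsets or row binding are needed.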

The yearly summaries are then combined with the code below. Year is a numeric column in the summary data, so it is converted to a Date to allow plotting on a date axis.

summarydata<-rbind(a,b,c,d,e,f,g,h,i,j,k)
summarydata<-summarydata[order(summarydata$Year),]
summarydata$Year<-as.Date(as.character(summarydata$Year),format= "%Y")

rm(a,b,c,d,e,f,g,h,i,j,k)

summarydata
## # A tibble: 22 x 9
##    Year       Genre User.Rating.Mean Reviews.Mean Price.Mean User.Rating.Med~
##    <date>     <fct>            <dbl>        <dbl>      <dbl>            <dbl>
##  1 2009-01-21 Fict~             4.59        6534.       15.6             4.7 
##  2 2009-01-21 Non ~             4.58        3026.       15.2             4.6 
##  3 2010-01-21 Fict~             4.62        8409.        9.7             4.7 
##  4 2010-01-21 Non ~             4.52        3527.       16               4.55
##  5 2011-01-21 Fict~             4.62       10335.       11.6             4.7 
##  6 2011-01-21 Non ~             4.51        6483.       17.6             4.6 
##  7 2012-01-21 Fict~             4.50       19896.       12.3             4.6 
##  8 2012-01-21 Non ~             4.56        8163.       17.5             4.6 
##  9 2013-01-21 Fict~             4.55       19987.       10.7             4.65
## 10 2013-01-21 Non ~             4.56        6739.       18.2             4.6 
## # ... with 12 more rows, and 3 more variables: Reviews.Median <dbl>,
## #   Price.Median <dbl>, count <int>

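In the summary data the Year column is now in year-month-day format. Because only the year was supplied, as.Date() filled in the month and day from the date on which the script was run, which is why every row shows 01-21. This does not affect the plots, which are labelled by year only.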
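A minimal sketch of an alternative conversion, assuming it is acceptable to pin every year to 1 January so that the result does not depend on the day the script is run. The `years` vector is a hypothetical stand-in for the numeric Year column.

```r
# Hypothetical stand-in for the numeric Year column
years <- c(2009, 2010, 2011)

# Spelling out month and day makes the conversion independent of the run date
year_start <- as.Date(paste0(years, "-01-01"))

format(year_start)        # "2009-01-01" "2010-01-01" "2011-01-01"
format(year_start, "%Y")  # "2009" "2010" "2011"
```

With date_labels = "%Y" on the axis, the plots would show the same year labels either way.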

Results

From the summary data above, 4 figures were drawn, as shown below.

Count of Amazon Top 50 Bestselling Books from 2009 to 2019

graph1<-ggplot(summarydata,aes(x=Year,y=count,color=Genre))
graph1<-graph1+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph1<-graph1+geom_point()+ylim(c(15,35))
graph1<-graph1+ggtitle("Count of Amazon Top 50 Best Selling Books Based on Genre from 2009 to 2019")
graph1

The graph above shows that in most years non fiction books outnumber fiction books in the Amazon Top 50. One year in which fiction books outnumber non fiction books is 2014.
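The reading of the graph can be cross-checked numerically by comparing the two genre counts within each year. The following is a minimal base R sketch; `toy_counts` is a hypothetical stand-in for the Year, Genre, and count columns of summarydata, and its counts are invented for illustration.

```r
# Toy stand-in for the Year, Genre and count columns (hypothetical counts)
toy_counts <- data.frame(
  Year  = rep(c(2013, 2014, 2015), each = 2),
  Genre = rep(c("Fiction", "Non Fiction"), times = 3),
  count = c(24L, 26L, 29L, 21L, 17L, 33L),
  stringsAsFactors = FALSE
)

# Put the two genres side by side, one row per year
fiction     <- toy_counts[toy_counts$Genre == "Fiction", ]
non_fiction <- toy_counts[toy_counts$Genre == "Non Fiction", ]
side_by_side <- merge(fiction, non_fiction, by = "Year",
                      suffixes = c(".fiction", ".nonfiction"))

# Years in which fiction has the larger count
fiction_lead_years <- side_by_side$Year[
  side_by_side$count.fiction > side_by_side$count.nonfiction
]
fiction_lead_years  # 2014 for these toy counts
```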

Average and Median User Rating of Amazon Top 50 Bestselling Books from 2009 to 2019

graph2<-ggplot(summarydata,aes(x=Year,y=User.Rating.Mean,color=Genre))
graph2<-graph2+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph2<-graph2+geom_point()+theme(legend.position = "none")+ylim(c(4.4,5.0))

graph3<-ggplot(summarydata,aes(x=Year,y=User.Rating.Median,color=Genre))
graph3<-graph3+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph3<-graph3+geom_point()+ylim(c(4.4,5.0))

plot_row1<-plot_grid(graph2,graph3,rel_widths = c(0.8,1))

titlebindplot1<-ggdraw()+draw_label("Average and Median User Rating of Amazon Top 50 Best Selling Books Based on Genre from 2009 to 2019")
BindGraph1<-plot_grid(titlebindplot1,plot_row1,ncol=1,rel_heights = c(0.1,1))

BindGraph1

The figure above shows that in most years both the average and the median user rating are higher for fiction than for non fiction books in the Amazon Top 50. The average user rating plot (left) shows that in 2012 and 2013 fiction was rated lower than non fiction. The median user rating plot (right) shows that in 2012 fiction and non fiction had the same median rating of 4.6.

Average and Median User Reviews of Amazon Top 50 Bestselling Books from 2009 to 2019

graph4<-ggplot(summarydata,aes(x=Year,y=Reviews.Mean,color=Genre))
graph4<-graph4+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph4<-graph4+geom_point()+theme(legend.position = "none")+ylim(c(1500,25000))

graph5<-ggplot(summarydata,aes(x=Year,y=Reviews.Median,color=Genre))
graph5<-graph5+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph5<-graph5+geom_point()+ylim(c(1500,25000))

plot_row2<-plot_grid(graph4,graph5,rel_widths = c(0.8,1))
titlebindplot2<-ggdraw()+draw_label("Average and Median User Reviews of Amazon Top 50 Best Selling Books Based on Genre from 2009 to 2019")
BindGraph2<-plot_grid(titlebindplot2,plot_row2,ncol=1,rel_heights = c(0.1,1))

BindGraph2

The figure above shows that in most years both the average and the median number of user reviews are higher for fiction than for non fiction books in the Amazon Top 50. Both plots clearly show that in 2018 fiction books had fewer user reviews than non fiction books.

Average and Median Price of Amazon Top 50 Bestselling Books from 2009 to 2019

graph6<-ggplot(summarydata,aes(x=Year,y=Price.Mean,color=Genre))
graph6<-graph6+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph6<-graph6+geom_point()+theme(legend.position = "none")+ylim(c(6,25))

graph7<-ggplot(summarydata,aes(x=Year,y=Price.Median,color=Genre))
graph7<-graph7+geom_line()+scale_x_date(date_breaks= "1 year",date_labels = "%Y")
graph7<-graph7+geom_point()+ylim(c(6,25))

plot_row3<-plot_grid(graph6,graph7,rel_widths = c(0.8,1))
titlebindplot3<-ggdraw()+draw_label("Average and Median Price of Amazon Top 50 Best Selling Books Based on Genre from 2009 to 2019")
BindGraph3<-plot_grid(titlebindplot3,plot_row3,ncol=1,rel_heights = c(0.1,1))

BindGraph3

The figure above shows that in most years both the average and the median price are higher for non fiction than for fiction books in the Amazon Top 50. The average price plot (left) shows that in 2009 non fiction was cheaper than fiction. The median price plot (right) shows that in 2015 the median prices of non fiction and fiction were similar, at a price of 9.

Conclusion

In summary, the results above give an overview of the Amazon Top 50 bestselling books from 2009 to 2019. In most years, non fiction books outnumber fiction books in the list, and non fiction books tend to be more expensive than fiction books. On the other hand, fiction books mostly have higher user ratings than non fiction books and tend to receive more user reviews.
