Amazon Data Science Books Analysis 2012-2022

1. About Dataset

The dataset contains 946 books obtained from scraping Amazon books related to data science, statistics, data analysis, Python, deep learning, and machine learning. There are 18 columns:

title: title of the book
author: author (or authors) of the book
price: price (in dollars)
pages: number of pages
avg_reviews: average review score (out of 5)
n_reviews: number of reviews for the book
star5: percentage of 5 star reviews
star4: percentage of 4 star reviews
star3: percentage of 3 star reviews
star2: percentage of 2 star reviews
star1: percentage of 1 star reviews
dimensions: size of the book (in inches)
weight: weight (in pounds or ounces)
language: language of the book
publisher: publisher, edition, and publication date
ISBN_13: ISBN-13 code
link: link of the Amazon book
complete_link: complete link of the Amazon book (including the domain https://amazon.com)

1.1 Input Data

Make sure the data file is in the same folder as the R project.

book <- read.csv("final_book_dataset_kaggle.csv")

1.2 Review Data

We preview the first rows to check that the data loaded correctly and can be analyzed.

head(book)

##                                                                                                                   title
## 1                  Becoming a Data Head: How to Think Speak and Understand Data Science Statistics and Machine Learning
## 2                  Ace the Data Science Interview: 201 Real Interview Questions Asked FAANG Tech Startups & Wall Street
## 3                                                  Fundamentals of Data Engineering: Plan and Build Robust Data Systems
## 4 Essential Math for Data Science: Take Control of Your Data with Fundamental Linear Algebra Probability and Statistics
## 5                         Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 6                                                               Data Science from Scratch: First Principles with Python
##                              author price pages avg_reviews n_reviews star5
## 1 [Alex J. Gutman,Jordan Goldmeier] 24.49   272         4.6       184  0.74
## 2            [Nick Singh,Kevin Huo] 26.00   301         4.5       599  0.77
## 3           [Joe Reis,Matt Housley] 50.76   446         5.0        33  0.96
## 4                    [Thomas Nield] 44.80   347         4.5        27  0.79
## 5      [Foster Provost,Tom Fawcett] 36.99   413         4.5       970  0.71
## 6                       [Joel Grus] 45.22   406         4.4       594  0.65
##   star4 star3 star2 star1             dimensions      weight language
## 1  0.18  0.05  0.02  0.01    6 x 0.62 x 9 inches 12.5 ounces  English
## 2  0.10  0.06  0.03  0.04   7 x 0.68 x 10 inches 1.28 pounds  English
## 3  0.04  0.00  0.00  0.00    7 x 1 x 9.25 inches 1.57 pounds  English
## 4  0.05  0.05  0.05  0.05    7 x 0.75 x 9 inches 1.23 pounds  English
## 5  0.15  0.08  0.03  0.03  7 x 0.9 x 9.19 inches  1.5 pounds  English
## 6  0.19  0.08  0.04  0.04 6.9 x 0.9 x 9.1 inches  1.4 pounds  English
##                                         publisher        ISBN_13
## 1              Wiley; 1st edition (April 23 2021) 978-1119741749
## 2 Ace the Data Science Interview (August 16 2021) 978-0578973838
## 3       OReilly Media; 1st edition (July 26 2022) 978-1098108304
## 4        OReilly Media; 1st edition (July 5 2022) 978-1098102937
## 5  OReilly Media; 1st edition (September 17 2013) 978-1449361327
## 6        OReilly Media; 2nd edition (May 16 2019) 978-1492041139
##                                                                                                                                                                       link
## 1       /Becoming-Data-Head-Understand-Statistics/dp/1119741742/ref=sr_1_7?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-7
## 2           /Ace-Data-Science-Interview-Questions/dp/0578973839/ref=sr_1_5?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-5
## 3 /Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/ref=sr_1_11?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-11
## 4        /Essential-Math-Data-Science-Fundamental/dp/1098102932/ref=sr_1_6?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-6
## 5   /Data-Science-Business-Data-Analytic-Thinking/dp/1449361323/ref=sr_1_8?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-8
## 6         /Data-Science-Scratch-Principles-Python/dp/1492041130/ref=sr_1_9?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-9
##                                                                                                                                                                                    complete_link
## 1       https://www.amazon.com/Becoming-Data-Head-Understand-Statistics/dp/1119741742/ref=sr_1_7?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-7
## 2           https://www.amazon.com/Ace-Data-Science-Interview-Questions/dp/0578973839/ref=sr_1_5?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-5
## 3 https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/ref=sr_1_11?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-11
## 4        https://www.amazon.com/Essential-Math-Data-Science-Fundamental/dp/1098102932/ref=sr_1_6?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-6
## 5   https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking/dp/1449361323/ref=sr_1_8?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-8
## 6         https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/1492041130/ref=sr_1_9?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-9

2 Data Cleaning

For further analysis we need to group the books by publication year. There is no dedicated year column, so we extract the year from the publisher column. With a year column in place we can group the data by publication year or use it for other purposes.

2.1 Library Input

The libraries we load help us clean the data and draw the plots.

library(tidyverse) #for data manipulation

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2) #for plotting data
library(stringr) #for data cleaning
library(gsubfn) #for data cleaning

## Loading required package: proto

2.2 Review Data Structure

We inspect the data structure so we know which columns and column types we are working with.

glimpse(book)

## Rows: 946
## Columns: 18
## $ title         <chr> "Becoming a Data Head: How to Think Speak and Understand…
## $ author        <chr> "[Alex J. Gutman,Jordan Goldmeier]", "[Nick Singh,Kevin …
## $ price         <dbl> 24.49, 26.00, 50.76, 44.80, 36.99, 45.22, 24.88, 28.49, …
## $ pages         <dbl> 272, 301, 446, 347, 413, 406, 368, 240, 328, 280, 432, 5…
## $ avg_reviews   <dbl> 4.6, 4.5, 5.0, 4.5, 4.5, 4.4, 4.6, 4.3, NA, 4.5, 4.6, 4.…
## $ n_reviews     <int> 184, 599, 33, 27, 970, 594, 655, 6, 0, 383, 35, 16, 44, …
## $ star5         <dbl> 0.74, 0.77, 0.96, 0.79, 0.71, 0.65, 0.76, 0.78, 0.00, 0.…
## $ star4         <dbl> 0.18, 0.10, 0.04, 0.05, 0.15, 0.19, 0.14, 0.22, 0.00, 0.…
## $ star3         <dbl> 0.05, 0.06, 0.00, 0.05, 0.08, 0.08, 0.06, 0.00, 0.00, 0.…
## $ star2         <dbl> 0.02, 0.03, 0.00, 0.05, 0.03, 0.04, 0.02, 0.00, 0.00, 0.…
## $ star1         <dbl> 0.01, 0.04, 0.00, 0.05, 0.03, 0.04, 0.02, 0.00, 0.00, 0.…
## $ dimensions    <chr> "6 x 0.62 x 9 inches", "7 x 0.68 x 10 inches", "7 x 1 x …
## $ weight        <chr> "12.5 ounces", "1.28 pounds", "1.57 pounds", "1.23 pound…
## $ language      <chr> "English", "English", "English", "English", "English", "…
## $ publisher     <chr> "Wiley; 1st edition (April 23 2021)", "Ace the Data Scie…
## $ ISBN_13       <chr> "978-1119741749", "978-0578973838", "978-1098108304", "9…
## $ link          <chr> "/Becoming-Data-Head-Understand-Statistics/dp/1119741742…
## $ complete_link <chr> "https://www.amazon.com/Becoming-Data-Head-Understand-St…

The column types do not need any coercion, so we can go straight to creating a new column: a year column that records when each book was published.

2.3 Make Year Column

We extract the year from the publisher column. The publisher string ends with the publication date in parentheses, so the year is the four characters just before the closing parenthesis (positions -5 to -2 counted from the end). Because we want to group books by year, we coerce the year column to a factor.

book <- 
  book %>% 
  mutate(year = str_sub(publisher,-5,-2)) 

book$year <- as.factor(book$year)
head(book$year)

## [1] 2021 2021 2022 2022 2013 2019
## 30 Levels:  1972 1990 1995 1998 1999 2000 2002 2003 2005 2006 2008 ... wth.
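The positional extraction above assumes every publisher string ends with "(Month Day Year)". As a hedged alternative, the sketch below compares it with a pattern-based extraction using stringr's str_extract. The first two strings are copied from the head(book) output; the last two are made-up stand-ins for rows without a trailing date, and the names by_position and by_pattern are only illustrative.

```r
library(stringr)

# First two strings come from the head(book) output above.
# The last two are hypothetical examples of rows without a trailing date.
publisher <- c(
  "Wiley; 1st edition (April 23 2021)",
  "OReilly Media; 2nd edition (May 16 2019)",
  "Some Publisher",
  ""
)

# Positional approach used in this analysis: characters -5 to -2.
by_position <- str_sub(publisher, -5, -2)

# Pattern approach: four digits directly before a closing parenthesis at the end.
by_pattern <- str_extract(publisher, "\\d{4}(?=\\)\\s*$)")

data.frame(publisher, by_position, by_pattern)
```

With the pattern approach, strings that do not end in a date yield NA instead of an arbitrary four-character fragment, which makes invalid rows easier to spot.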

2.4 Check Levels of Year Published

After extracting the year, we check whether the column contains any levels that are not valid years.

levels(book$year)

##  [1] ""     "1972" "1990" "1995" "1998" "1999" "2000" "2002" "2003" "2005"
## [11] "2006" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## [21] "2017" "2018" "2019" "2020" "2021" "2022" "2023" "evie" "rk.Â" "wth."

2.5 Eliminate Unnecessary Year Levels

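We found four levels that are not valid years: the empty string, "evie", "rk.Â", and "wth.". We create a new data set, book_y, that contains only rows with a valid year. Note that "2020" appears among the levels above but not in the list below, so books published in 2020 are also dropped and are absent from every table and plot that follows.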

book_y <- book[book$year %in% c("1972","1990","1995","1998","1999","2000","2002","2003","2005","2006","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2021","2022","2023"),]
book_y$year <- droplevels(book_y$year)
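Typing every valid year by hand makes it easy to leave one out. A minimal sketch of a pattern-based check, assuming a valid year is exactly four digits, is shown below. The level values are copied from the levels(book$year) output; year_levels and is_year are illustrative names.

```r
# Level values copied from the levels(book$year) output above.
# "rk.\u00c2" is the "rk.Â" level written with an escape for portability.
year_levels <- c(
  "", "1972", "1990", "1995", "1998", "1999", "2000", "2002", "2003", "2005",
  "2006", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016",
  "2017", "2018", "2019", "2020", "2021", "2022", "2023", "evie", "rk.\u00c2", "wth."
)

# TRUE only for values made of exactly four digits.
is_year <- grepl("^[0-9]{4}$", year_levels)

year_levels[!is_year]  # values that would be dropped
year_levels[is_year]   # values that would be kept
```

Applied to the data, the same idea would read book[grepl("^[0-9]{4}$", book$year), ]. That variant would also keep the 2020 rows, so the row indices and results shown in the following sections would differ.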

3 Data Manipulation and Plotting

After cleaning the data, we can manipulate it to better understand the distribution of data science books on Amazon. We use three plots for this.

3.1 Plot 1, Amazon Data Science Book Rating Distribution

In plot 1, we look at the distribution of ratings for data science books on Amazon. The ratings are stored in the star5, star4, star3, star2, and star1 columns. Each column holds the share of a book's reviews with that star rating, stored as a proportion between 0 and 1. Comparing the medians of these columns shows how data science books on Amazon are typically rated.

3.1.1 Selecting Star Column

For this analysis we need a data set of book ratings. First we select the rating columns.

book_box <- book_y %>% select(c(star5,star4,star3,star2,star1))
head(book_box)

##   star5 star4 star3 star2 star1
## 1  0.74  0.18  0.05  0.02  0.01
## 2  0.77  0.10  0.06  0.03  0.04
## 3  0.96  0.04  0.00  0.00  0.00
## 4  0.79  0.05  0.05  0.05  0.05
## 5  0.71  0.15  0.08  0.03  0.03
## 6  0.65  0.19  0.08  0.04  0.04

3.1.2 Pivoting Data for More Analysis

ggplot2 needs one column for the x axis and one for the y axis, so we pivot the five rating columns into two columns: name (the star rating, used for the x axis) and value (the share of reviews, used for the y axis).

book_boxplot <- pivot_longer(data = book_box, cols = c(star5,star4,star3,star2,star1))
head(book_boxplot)

## # A tibble: 6 × 2
##   name  value
##   <chr> <dbl>
## 1 star5  0.74
## 2 star4  0.18
## 3 star3  0.05
## 4 star2  0.02
## 5 star1  0.01
## 6 star5  0.77
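By default pivot_longer names the new columns name and value. As a small sketch, the call below uses the names_to and values_to arguments to give them more descriptive names. The two rows are copied from the head(book_box) output, and the column names rating and share are hypothetical choices, not names used elsewhere in this analysis.

```r
library(tidyr)

# First two rows of book_box, copied from the output above.
book_box_demo <- data.frame(
  star5 = c(0.74, 0.77),
  star4 = c(0.18, 0.10),
  star3 = c(0.05, 0.06),
  star2 = c(0.02, 0.03),
  star1 = c(0.01, 0.04)
)

long_demo <- pivot_longer(
  data = book_box_demo,
  cols = star5:star1,
  names_to = "rating",  # hypothetical name, replaces the default "name"
  values_to = "share"   # hypothetical name, replaces the default "value"
)
long_demo
```

Each input row becomes five output rows, one per star rating, so the two demo rows produce ten rows.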

3.1.3 Making Boxplot with ggplot2

With the x and y columns in place we can draw the boxplot. The boxplot shows the median share for each star rating, so we can see how ratings of data science books on Amazon are distributed.

ggplot(data = book_boxplot, mapping = aes(x = name, y = value)) +
  geom_jitter(col = "green", alpha = 0.5) +
  geom_boxplot(outlier.shape = NA, aes(fill = name), col = "black", show.legend = FALSE) +
  labs(title = "Amazon Data Science Book Rating Distribution",
       subtitle = "All Published Book until 2023",
       x = "Star Rating",
       y = "Percentage Rating",
       fill = "Star Rating") +
  theme_light()

Interpretation

Five-star ratings dominate: the median share of 5-star reviews per book is 70%.
The 1-star rating has the lowest median share, so 1-star reviews are rare for data science books on Amazon.
Some books have only 5-star reviews. No other star rating reaches a 100% share for any book.
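The medians read off the boxplot can also be computed directly. The sketch below does this with group_by and summarise on the six rows shown by head(book_box), so its numbers describe only those six books, not the full data set. The names rating_summary, median_share, and max_share are illustrative.

```r
library(dplyr)
library(tidyr)

# The six rows shown by head(book_box) above; the full data set is much larger.
book_box_demo <- data.frame(
  star5 = c(0.74, 0.77, 0.96, 0.79, 0.71, 0.65),
  star4 = c(0.18, 0.10, 0.04, 0.05, 0.15, 0.19),
  star3 = c(0.05, 0.06, 0.00, 0.05, 0.08, 0.08),
  star2 = c(0.02, 0.03, 0.00, 0.05, 0.03, 0.04),
  star1 = c(0.01, 0.04, 0.00, 0.05, 0.03, 0.04)
)

# One row per star rating with its median and maximum share.
rating_summary <- book_box_demo %>%
  pivot_longer(cols = everything()) %>%
  group_by(name) %>%
  summarise(median_share = median(value), max_share = max(value))
rating_summary
```

Running the same pipeline on book_box would give the exact medians and maxima behind the three statements above.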

3.2 Plot 2, Amount Data Science Book in Amazon

After looking at the rating distribution, we want to know how the books are distributed by publication year. Plot 2 shows the number of data science books listed on Amazon for each publication year.

3.2.1 Making Data Frame for 2012-2022 Books

We want to know how many of the books listed on Amazon were published in each year from 2012 to 2022. We store the number of books per publication year in a new data frame.

book_p <- as.data.frame(table(book_y$year))
tail(book_p,11)

##    Var1 Freq
## 15 2012   14
## 16 2013   17
## 17 2014   20
## 18 2015   12
## 19 2016   38
## 20 2017   52
## 21 2018   62
## 22 2019  112
## 23 2021  186
## 24 2022  213
## 25 2023   16
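The table() and as.data.frame() combination produces the generic column names Var1 and Freq. A possible alternative, sketched here on a made-up vector of years, is dplyr's count, which keeps the original column name and lets the count column be named. The data frame demo and the column name n_books are hypothetical.

```r
library(dplyr)

# Made-up publication years standing in for book_y$year.
demo <- data.frame(
  year = factor(c("2019", "2021", "2021", "2022", "2022", "2022"))
)

# One row per year with the number of books in that year.
books_per_year <- demo %>% count(year, name = "n_books")
books_per_year
```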

3.2.2 Making Barplot to Show Data Science Book Published in 2012-2022

We plot the number of data science books per publication year to see how they are distributed over time.

ggplot(data = book_p[15:24,], 
       mapping = aes(x = Freq, 
                     y = Var1)) + 
  geom_col(fill="purple") +
  geom_label(aes(label = Freq)) +
  labs(
    title = "Amount Data Science Book in Amazon",
    subtitle = "from 2012-2022",
    y = "Year Published",
    x = "Total Book"
  ) +
  theme_light()

Interpretation

The largest number of books was published in 2022, which suggests that data science is currently a popular topic on Amazon.
The number of books has increased in every listed year since 2015. The first large jump comes in 2016 (from 12 to 38 books), and the count peaks in 2022.
The growing number of books may reflect growing research on data science and growing demand for the topic on Amazon.
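The year-over-year trend can be checked with a few lines of base R. The counts below are copied from the tail(book_p, 11) output, leaving out 2023. Because 2020 is not in the table, the 2021 value is compared with 2019. The names counts and change are illustrative.

```r
# Counts copied from the tail(book_p, 11) output above (no 2020 row, 2023 left out).
counts <- c(
  "2012" = 14, "2013" = 17, "2014" = 20, "2015" = 12, "2016" = 38,
  "2017" = 52, "2018" = 62, "2019" = 112, "2021" = 186, "2022" = 213
)

# Difference from the previous listed year; names refer to the later year.
change <- diff(counts)
data.frame(year = names(change), change = as.vector(change))

# Years in which the count went down compared with the previous listed year.
names(change)[change < 0]
```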

3.3 Average Price of High Review Data Science Book in Amazon

Knowing the counts and the ratings, we now focus on the highest-rated books. We want to understand how these books are priced, so we look at the average price for each publication year.

3.3.1 Finding Median Value Highest Review Book

We want to keep only books with strong reviews. The boxplot shows that the 5-star rating has the highest median share of all star ratings, so we use it as the criterion: a book counts as highly reviewed if its 5-star share is at or above the median. First we need that median.

median(book_y$star5)

## [1] 0.7

3.3.2 Filtering Data with Median Value

We keep only books whose 5-star share is at least 70%, the median found above. Only these highly reviewed books are analyzed in the rest of this section.

book_q <- 
  book_y %>%
  filter(star5>=0.7)

3.3.3 Making Data Frame of List of Highest Review Books

ggplot needs an x and a y variable, so we build a new data set with the average book price for each publication year.

# The formula is passed unnamed so the call does not depend on the argument name
book_pr <- aggregate(price ~ year,
          data = book_q,
          FUN = mean)
tail(book_pr,10)

##    year    price
## 13 2012 45.63857
## 14 2013 81.25500
## 15 2014 63.12917
## 16 2015 39.30286
## 17 2016 42.42700
## 18 2017 70.04484
## 19 2018 52.67862
## 20 2019 60.85281
## 21 2021 46.27973
## 22 2022 45.25315
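A yearly mean says nothing about how many books stand behind it. The sketch below shows a dplyr version of the aggregation that also reports the number of priced books per year. The prices are made up for illustration, and book_q_demo, n_priced, and mean_price are hypothetical names.

```r
library(dplyr)

# Made-up prices for illustration; in the analysis this would be book_q.
book_q_demo <- data.frame(
  year  = factor(c("2021", "2021", "2022", "2022", "2022")),
  price = c(40, 50, 30, NA, 60)
)

price_by_year <- book_q_demo %>%
  group_by(year) %>%
  summarise(
    n_priced   = sum(!is.na(price)),        # books that contribute to the mean
    mean_price = mean(price, na.rm = TRUE)  # skip missing prices, as aggregate() does by default
  )
price_by_year
```

The formula method of aggregate drops rows with missing values by default (na.action = na.omit), which is why na.rm = TRUE is used here to get comparable results.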

3.3.4 Making Plot of List of Highest Review Books

With the x and y variables ready, we can plot the average price of highly reviewed data science books per publication year. A horizontal line marks the overall average, computed as the mean of the yearly averages.

price_2012_2022 <- book_pr[13:22,]
avg_price <- mean(price_2012_2022$price) # mean of the yearly averages

ggplot(data = price_2012_2022, 
       mapping = aes(x = year, 
                     y = price)) + 
  geom_col(fill="green") +
  labs(
    title = "Average Price of High Review Data Science Book in Amazon",
    subtitle = "from 2012-2022",
    y = "Average Price",
    x = "Year Published"
  ) +
  theme_gray() +
  geom_hline(yintercept = avg_price, 
             col = "red") +
  # annotate() draws the label once instead of once per data row
  annotate("text", x = 0, y = avg_price,
           label = round(avg_price, 2),
           vjust = -1, hjust = -0.2,
           color = "red")

Interpretation

The overall average price of highly reviewed data science books is about $54.69 (the mean of the ten yearly averages).
Yearly averages range from about $39 (2015) to about $81 (2013) with no steady upward or downward trend. Apart from 2013, every year is within about $16 of the overall average.
The newest highly reviewed books (2021 and 2022) are priced below the overall average, so readers can consider buying newer data science books instead of older ones.
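The overall figure and the comparison with individual years can be reproduced from the printed table. The values below are copied from the tail(book_pr, 10) output. Note that this is an unweighted mean of yearly averages, so a year with few books counts as much as a year with many. The names yearly_avg and overall are illustrative.

```r
# Yearly averages copied from the tail(book_pr, 10) output above.
yearly_avg <- c(
  "2012" = 45.63857, "2013" = 81.25500, "2014" = 63.12917, "2015" = 39.30286,
  "2016" = 42.42700, "2017" = 70.04484, "2018" = 52.67862, "2019" = 60.85281,
  "2021" = 46.27973, "2022" = 45.25315
)

# Unweighted mean of the ten yearly averages.
overall <- mean(yearly_avg)
round(overall, 2)

# Distance of each year from the overall mean.
round(yearly_avg - overall, 2)

# Years whose average price is below the overall mean.
names(yearly_avg)[yearly_avg < overall]
```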

4. Final Conclusion

From the graphs above, we can draw some tentative conclusions:

The number of Data Science books keeps increasing year over year.
Recent highly rated Data Science books are priced below the overall average, which makes them more affordable.
Data Science books are easy to find because the supply on Amazon is growing.
Data Science books are well received by readers, as shown by their high ratings.
Readers who want up-to-date material can buy newer books instead of older ones, and the newer books may also be cheaper.

Adli Rikanda Saputra

2022-11-07
