The dataset contains 946 books scraped from Amazon listings related to data science, statistics, data analysis, Python, deep learning, and machine learning. There are 18 columns:
title: title of the book
author: author (or authors) of the book
price: price (in dollars)
pages: number of pages
avg_reviews: average review score (out of 5)
n_reviews: number of reviews for the book
star5: share of 5 star reviews (stored as a fraction between 0 and 1)
star4: share of 4 star reviews
star3: share of 3 star reviews
star2: share of 2 star reviews
star1: share of 1 star reviews
dimensions: size of the book (in inches)
weight: weight (in pounds or ounces)
language: language of the book
publisher: publisher, edition, and publication date
ISBN_13: ISBN-13 code
link: link of the Amazon book (without the domain)
complete_link: complete link of the Amazon book (including the domain https://amazon.com)
Make sure the data file is in the same folder as the R project.
book <- read.csv("final_book_dataset_kaggle.csv")
We look at the first rows to check that the data loaded correctly and can be analyzed.
head(book)
## title
## 1 Becoming a Data Head: How to Think Speak and Understand Data Science Statistics and Machine Learning
## 2 Ace the Data Science Interview: 201 Real Interview Questions Asked FAANG Tech Startups & Wall Street
## 3 Fundamentals of Data Engineering: Plan and Build Robust Data Systems
## 4 Essential Math for Data Science: Take Control of Your Data with Fundamental Linear Algebra Probability and Statistics
## 5 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 6 Data Science from Scratch: First Principles with Python
## author price pages avg_reviews n_reviews star5
## 1 [Alex J. Gutman,Jordan Goldmeier] 24.49 272 4.6 184 0.74
## 2 [Nick Singh,Kevin Huo] 26.00 301 4.5 599 0.77
## 3 [Joe Reis,Matt Housley] 50.76 446 5.0 33 0.96
## 4 [Thomas Nield] 44.80 347 4.5 27 0.79
## 5 [Foster Provost,Tom Fawcett] 36.99 413 4.5 970 0.71
## 6 [Joel Grus] 45.22 406 4.4 594 0.65
## star4 star3 star2 star1 dimensions weight language
## 1 0.18 0.05 0.02 0.01 6 x 0.62 x 9 inches 12.5 ounces English
## 2 0.10 0.06 0.03 0.04 7 x 0.68 x 10 inches 1.28 pounds English
## 3 0.04 0.00 0.00 0.00 7 x 1 x 9.25 inches 1.57 pounds English
## 4 0.05 0.05 0.05 0.05 7 x 0.75 x 9 inches 1.23 pounds English
## 5 0.15 0.08 0.03 0.03 7 x 0.9 x 9.19 inches 1.5 pounds English
## 6 0.19 0.08 0.04 0.04 6.9 x 0.9 x 9.1 inches 1.4 pounds English
## publisher ISBN_13
## 1 Wiley; 1st edition (April 23 2021) 978-1119741749
## 2 Ace the Data Science Interview (August 16 2021) 978-0578973838
## 3 OReilly Media; 1st edition (July 26 2022) 978-1098108304
## 4 OReilly Media; 1st edition (July 5 2022) 978-1098102937
## 5 OReilly Media; 1st edition (September 17 2013) 978-1449361327
## 6 OReilly Media; 2nd edition (May 16 2019) 978-1492041139
## link
## 1 /Becoming-Data-Head-Understand-Statistics/dp/1119741742/ref=sr_1_7?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-7
## 2 /Ace-Data-Science-Interview-Questions/dp/0578973839/ref=sr_1_5?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-5
## 3 /Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/ref=sr_1_11?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-11
## 4 /Essential-Math-Data-Science-Fundamental/dp/1098102932/ref=sr_1_6?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-6
## 5 /Data-Science-Business-Data-Analytic-Thinking/dp/1449361323/ref=sr_1_8?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-8
## 6 /Data-Science-Scratch-Principles-Python/dp/1492041130/ref=sr_1_9?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-9
## complete_link
## 1 https://www.amazon.com/Becoming-Data-Head-Understand-Statistics/dp/1119741742/ref=sr_1_7?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-7
## 2 https://www.amazon.com/Ace-Data-Science-Interview-Questions/dp/0578973839/ref=sr_1_5?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-5
## 3 https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/ref=sr_1_11?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-11
## 4 https://www.amazon.com/Essential-Math-Data-Science-Fundamental/dp/1098102932/ref=sr_1_6?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-6
## 5 https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking/dp/1449361323/ref=sr_1_8?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-8
## 6 https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/1492041130/ref=sr_1_9?crid=1IWIG31DNPO6P&keywords=data+science&qid=1663447969&sprefix=data+science%2Caps%2C586&sr=8-9
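Beyond looking at the first rows, a quick check of the dimensions and of missing values per column can show whether the data is ready to use. The sketch below runs on a small made-up data frame (`book_sample` and its values are hypothetical stand-ins, not rows from the dataset); the same two calls could be applied to `book`.

```r
# Hypothetical stand-in for `book`; titles and values are made up for illustration
book_sample <- data.frame(
  title = c("Book A", "Book B", "Book C"),
  price = c(24.49, 26.00, NA),
  avg_reviews = c(4.6, 4.5, NA),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dim(book_sample)

# Number of missing values in each column
na_per_column <- colSums(is.na(book_sample))
na_per_column
```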
For the later analysis we need to group the books by publication year. The dataset has no year column, so we extract the year from the publisher column. The new column can then be used for grouping and for other purposes.
The libraries below help us clean the data and draw the plots.
library(tidyverse) #for data manipulation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2) #for plotting data
library(stringr) #for data cleaning
library(gsubfn) #for data cleaning
## Loading required package: proto
We inspect the structure of the data before working with it.
glimpse(book)
## Rows: 946
## Columns: 18
## $ title <chr> "Becoming a Data Head: How to Think Speak and Understand…
## $ author <chr> "[Alex J. Gutman,Jordan Goldmeier]", "[Nick Singh,Kevin …
## $ price <dbl> 24.49, 26.00, 50.76, 44.80, 36.99, 45.22, 24.88, 28.49, …
## $ pages <dbl> 272, 301, 446, 347, 413, 406, 368, 240, 328, 280, 432, 5…
## $ avg_reviews <dbl> 4.6, 4.5, 5.0, 4.5, 4.5, 4.4, 4.6, 4.3, NA, 4.5, 4.6, 4.…
## $ n_reviews <int> 184, 599, 33, 27, 970, 594, 655, 6, 0, 383, 35, 16, 44, …
## $ star5 <dbl> 0.74, 0.77, 0.96, 0.79, 0.71, 0.65, 0.76, 0.78, 0.00, 0.…
## $ star4 <dbl> 0.18, 0.10, 0.04, 0.05, 0.15, 0.19, 0.14, 0.22, 0.00, 0.…
## $ star3 <dbl> 0.05, 0.06, 0.00, 0.05, 0.08, 0.08, 0.06, 0.00, 0.00, 0.…
## $ star2 <dbl> 0.02, 0.03, 0.00, 0.05, 0.03, 0.04, 0.02, 0.00, 0.00, 0.…
## $ star1 <dbl> 0.01, 0.04, 0.00, 0.05, 0.03, 0.04, 0.02, 0.00, 0.00, 0.…
## $ dimensions <chr> "6 x 0.62 x 9 inches", "7 x 0.68 x 10 inches", "7 x 1 x …
## $ weight <chr> "12.5 ounces", "1.28 pounds", "1.57 pounds", "1.23 pound…
## $ language <chr> "English", "English", "English", "English", "English", "…
## $ publisher <chr> "Wiley; 1st edition (April 23 2021)", "Ace the Data Scie…
## $ ISBN_13 <chr> "978-1119741749", "978-0578973838", "978-1098108304", "9…
## $ link <chr> "/Becoming-Data-Head-Understand-Statistics/dp/1119741742…
## $ complete_link <chr> "https://www.amazon.com/Becoming-Data-Head-Understand-St…
No column needs to be coerced to a different type, so we can go straight to creating a new column, year, which records when each book was published.
Most publisher entries end with the publication date in parentheses, for example "(April 23 2021)", so the year is the four characters just before the closing parenthesis. str_sub(publisher, -5, -2) extracts exactly those characters. Because we will group books by year, we then coerce the year column to a factor.
book <-
  book %>%
  mutate(year = str_sub(publisher, -5, -2))
book$year <- as.factor(book$year)
head(book$year)
## [1] 2021 2021 2022 2022 2013 2019
## 30 Levels: 1972 1990 1995 1998 1999 2000 2002 2003 2005 2006 2008 ... wth.
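To see what the negative positions in `str_sub()` select, here is a minimal sketch on two strings. The first is copied from the data; the second is a made-up publisher entry without a date, included only to show how entries that do not end in a date produce non-year values such as the "wth." level in the output.

```r
library(stringr)

publisher <- c(
  "Wiley; 1st edition (April 23 2021)",  # entry from the dataset
  "Independently published"              # hypothetical entry with no date
)

# Characters from the fifth-to-last through the second-to-last position
year_chr <- str_sub(publisher, -5, -2)
year_chr
# "2021" "ishe"
```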
After extracting the year, we check whether the column contains levels that are not valid years.
levels(book$year)
## [1] "" "1972" "1990" "1995" "1998" "1999" "2000" "2002" "2003" "2005"
## [11] "2006" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016"
## [21] "2017" "2018" "2019" "2020" "2021" "2022" "2023" "evie" "rk.Â" "wth."
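Four levels are not valid years: the empty string, "evie", "rk.Â", and "wth.". We create a new data set, book_y, that keeps only the rows with a valid year. Note that the list of years below does not include 2020, so books published in 2020 are also left out of book_y and of every result that follows.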
book_y <- book[book$year %in% c("1972","1990","1995","1998","1999","2000","2002","2003","2005","2006","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2021","2022","2023"),]
book_y$year <- droplevels(book_y$year)
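As an alternative to typing every year by hand, the valid levels can be selected with a pattern. This is a sketch on a small made-up factor (`year` below is hypothetical and only mimics `book$year`), not the method used in this analysis. Applied to the real data it would also keep 2020, so the counts and row positions shown later would differ.

```r
# Hypothetical factor that mimics book$year, including invalid levels
year <- factor(c("2019", "2020", "2021", "", "evie", "wth.", "2022"))

# TRUE only for values made of exactly four digits
is_year <- grepl("^[0-9]{4}$", as.character(year))

# Keep the valid values and drop the now-unused levels
year_clean <- droplevels(year[is_year])
levels(year_clean)
# "2019" "2020" "2021" "2022"

# On the real data this would read (assumption, not run here):
# book_y <- book[grepl("^[0-9]{4}$", as.character(book$year)), ]
```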
With the data cleaned, we can explore the data science books listed on Amazon. We use three plots for this.
In plot 1 we look at the distribution of ratings. The columns star5, star4, star3, star2, and star1 hold, for each book, the share of its ratings at each star level. Comparing the medians of these columns shows how data science books on Amazon are typically rated.
We first select the rating columns.
book_box <- book_y %>% select(c(star5,star4,star3,star2,star1))
head(book_box)
## star5 star4 star3 star2 star1
## 1 0.74 0.18 0.05 0.02 0.01
## 2 0.77 0.10 0.06 0.03 0.04
## 3 0.96 0.04 0.00 0.00 0.00
## 4 0.79 0.05 0.05 0.05 0.05
## 5 0.71 0.15 0.08 0.03 0.03
## 6 0.65 0.19 0.08 0.04 0.04
ggplot needs one column for the x axis and one for the y axis, so we pivot the data into two columns: name (the star level, used for x) and value (the share of ratings, used for y).
book_boxplot <- pivot_longer(data = book_box, cols = c(star5,star4,star3,star2,star1))
head(book_boxplot)
## # A tibble: 6 × 2
## name value
## <chr> <dbl>
## 1 star5 0.74
## 2 star4 0.18
## 3 star3 0.05
## 4 star2 0.02
## 5 star1 0.01
## 6 star5 0.77
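The default column names `name` and `value` can be replaced with more descriptive ones through the `names_to` and `values_to` arguments of `pivot_longer()`. The sketch below uses the first two rows of the rating table; the names `ratings`, `star`, and `share` are illustrative choices, not names used elsewhere in this analysis.

```r
library(tidyr)

# First two rows of the rating columns shown above
ratings <- data.frame(
  star5 = c(0.74, 0.77),
  star4 = c(0.18, 0.10),
  star3 = c(0.05, 0.06),
  star2 = c(0.02, 0.03),
  star1 = c(0.01, 0.04)
)

# One row per book and star level, with explicit column names
ratings_long <- pivot_longer(
  ratings,
  cols = c(star5, star4, star3, star2, star1),
  names_to = "star",
  values_to = "share"
)
ratings_long
```

Each book contributes five rows, so two books give ten rows in the long table.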
With the x and y columns in place, we draw a boxplot. The boxplot shows the median share for each star level, which summarizes how ratings are distributed across data science books on Amazon.
ggplot(data = book_boxplot, mapping = aes(x = name, y = value)) +
  # raw points, drawn first so the boxes sit on top of them
  geom_jitter(col = "green", alpha = 0.5) +
  # outliers are hidden because the jittered points already show every book
  geom_boxplot(outlier.shape = NA, aes(fill = name), col = "black", show.legend = FALSE) +
  labs(title = "Amazon Data Science Book Rating Distribution",
       subtitle = "All Published Books until 2023",
       x = "Star Rating",
       y = "Percentage Rating",
       fill = "Star Rating") +
  theme_light()
Interpretation
The 5-star level has the highest median, about 70%: for a typical data science book, roughly 70% of its ratings are 5 stars.
The 1-star level has the lowest median, so 1-star ratings are rare for data science books on Amazon.
Some books have only 5-star ratings. Five stars is the only level that reaches a share of 100% for any book.
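The medians that the boxplot displays can also be computed directly. The sketch below uses only the six rows printed by head(book_box), so its medians differ from those of the full data set; `book_box6` is a name introduced here for illustration.

```r
# The six rows shown in the head(book_box) output
book_box6 <- data.frame(
  star5 = c(0.74, 0.77, 0.96, 0.79, 0.71, 0.65),
  star4 = c(0.18, 0.10, 0.04, 0.05, 0.15, 0.19),
  star3 = c(0.05, 0.06, 0.00, 0.05, 0.08, 0.08),
  star2 = c(0.02, 0.03, 0.00, 0.05, 0.03, 0.04),
  star1 = c(0.01, 0.04, 0.00, 0.05, 0.03, 0.04)
)

# Median share per star level
medians <- sapply(book_box6, median)
medians

# Star level with the highest median
names(which.max(medians))
# "star5"
```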
Having looked at the ratings, we now look at how the books are distributed over publication years. Plot 2 shows the number of data science books available on Amazon for each publication year.
We focus on books published from 2012 to 2022. We first build a data frame with the number of books per publication year.
book_p <- as.data.frame(table(book_y$year))
tail(book_p,11)
## Var1 Freq
## 15 2012 14
## 16 2013 17
## 17 2014 20
## 18 2015 12
## 19 2016 38
## 20 2017 52
## 21 2018 62
## 22 2019 112
## 23 2021 186
## 24 2022 213
## 25 2023 16
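Rows of this table can be selected by year value rather than by row position, which keeps working if the set of years changes. The sketch below rebuilds the rows printed above; `book_p_sel` is a name introduced here for illustration.

```r
# The rows printed by tail(book_p, 11)
book_p <- data.frame(
  Var1 = factor(c("2012", "2013", "2014", "2015", "2016", "2017",
                  "2018", "2019", "2021", "2022", "2023")),
  Freq = c(14, 17, 20, 12, 38, 52, 62, 112, 186, 213, 16)
)

# Convert the factor labels to numbers; as.integer() on the factor itself
# would return the internal level codes instead of the years
yr <- as.integer(as.character(book_p$Var1))

# Keep publication years 2012 to 2022
book_p_sel <- book_p[yr >= 2012 & yr <= 2022, ]
book_p_sel
```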
We plot the number of data science books per publication year to see how it changes over time.
ggplot(data = book_p[15:24, ],
       mapping = aes(x = Freq,
                     y = Var1)) +
  geom_col(fill = "purple") +
  geom_label(aes(label = Freq)) +
  labs(
    title = "Number of Data Science Books on Amazon",
    subtitle = "from 2012-2022",
    y = "Year Published",
    x = "Total Books"
  ) +
  theme_light()
Interpretation
The largest number of books, 213, was published in 2022. This suggests that data science has recently become a more popular topic on Amazon.
The number of books has increased with every publication year since 2015, when it dipped to 12, and it peaks in 2022 (books from 2020 are not in the cleaned data).
The growing number of books may indicate that research and writing on data science topics is increasing, and it suggests growing demand for the topic on Amazon.
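The trend described above can be checked against the counts in the table with `diff()`, which returns the change from one listed year to the next. The numbers are copied from the table; since 2020 is absent, the step from 2019 to 2021 spans two years.

```r
# Books per publication year, copied from the table above
counts <- c("2012" = 14, "2013" = 17, "2014" = 20, "2015" = 12, "2016" = 38,
            "2017" = 52, "2018" = 62, "2019" = 112, "2021" = 186, "2022" = 213)

# Change relative to the previous listed year
yearly_change <- diff(counts)
yearly_change

# Elements 4 to 9 are the changes after 2015 (2016 through 2022)
all(yearly_change[4:9] > 0)
# TRUE
```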
Knowing the counts and the ratings, we now focus on the highest-rated books and look at their prices, using the average price for each publication year.
We want to keep only books with strong reviews. The boxplot shows that the 5-star share has the highest median of all star levels, so we use it as the criterion: we keep books whose 5-star share is at or above the median. First we need that median.
median(book_y$star5)
## [1] 0.7
We keep only the books whose 5-star share is at least 70% (the filter uses >=, so books at exactly 70% are included). The rest of the analysis uses these highly rated books.
book_q <-
book_y %>%
filter(star5>=0.7)
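Instead of hard-coding 0.7, the threshold can be computed and reused in the filter, so the two stay in sync if the data changes. The sketch below uses the six star5 values printed earlier; the titles and the names `ratings`, `threshold`, and `top_half` are hypothetical. On this small sample the median is 0.755, not 0.7.

```r
library(dplyr)

# Six star5 values from the head() output, with placeholder titles
ratings <- data.frame(
  title = paste("Book", 1:6),
  star5 = c(0.74, 0.77, 0.96, 0.79, 0.71, 0.65)
)

# Median 5-star share; na.rm guards against missing values
threshold <- median(ratings$star5, na.rm = TRUE)

# Keep books at or above the median
top_half <- ratings %>%
  filter(star5 >= threshold)
top_half
```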
For the plot we need the year on the x axis and the price on the y axis, so we build a new data frame with the average price of these books per publication year.
# the formula is passed by position so the call does not depend on the argument name
book_pr <- aggregate(price ~ year,
                     data = book_q,
                     FUN = mean)
tail(book_pr,10)
## year price
## 13 2012 45.63857
## 14 2013 81.25500
## 15 2014 63.12917
## 16 2015 39.30286
## 17 2016 42.42700
## 18 2017 70.04484
## 19 2018 52.67862
## 20 2019 60.85281
## 21 2021 46.27973
## 22 2022 45.25315
With the yearly averages computed, we plot them for 2012 to 2022 and add a horizontal line at the overall average of those yearly values.
ggplot(data = book_pr[13:22, ],
       mapping = aes(x = year,
                     y = price)) +
  geom_col(fill = "green") +
  labs(
    title = "Average Price of High Review Data Science Book in Amazon",
    subtitle = "from 2012-2022",
    y = "Average Price",
    x = "Year Published"
  ) +
  theme_gray() +
  geom_hline(aes(yintercept = mean(price)),
             col = "red") +
  geom_text(aes(0, mean(price),
                label = round(mean(price), 2),
                vjust = -1, hjust = -0.2),
            color = "red")
Interpretation
The average price of highly rated data science books is about $54.69. This figure is the mean of the ten yearly averages shown in the plot, not the mean over individual books.
Most yearly averages lie fairly close to the overall average, although 2013 ($81.26) and 2017 ($70.04) are clearly above it.
The average prices for the most recent years, 2021 and 2022, are below the overall average, so readers may want to consider buying newer data science books rather than older ones.
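The $54.69 figure can be reproduced from the yearly averages printed in the table above. The sketch below takes the plain mean of those ten values, which weights every year equally regardless of how many books it contains.

```r
# Yearly average prices for 2012 to 2022, copied from the table above
yearly_avg <- c(45.63857, 81.25500, 63.12917, 39.30286, 42.42700,
                70.04484, 52.67862, 60.85281, 46.27973, 45.25315)

# Mean of the yearly averages, as drawn by the red line in the plot
overall <- mean(yearly_avg)
round(overall, 2)
# 54.69

# Years whose average is below the overall average
below <- yearly_avg < overall
sum(below)
```

A mean over the individual books in book_q would give more weight to years with many books, so it would generally differ from this value.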
From the plots above, we can draw some tentative conclusions:
The number of data science books has kept increasing with each publication year.
The prices of data science books appear to have become more affordable, since the most recent yearly averages are below the overall average.
People can access data science books easily because the supply of these books on Amazon is increasing.
Data science books are well regarded by readers, as shown by their high ratings.
Readers who want to keep up with current trends can buy newer data science books instead of older ones, and the price may also be lower.