This is my first R Markdown document, which I will use as the project for the Google Data Analytics Professional Certificate.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We will use the tidyverse library, which loads nine core packages: dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, and tidyr.
Of those nine packages, we will be using ggplot2, dplyr, tidyr, and readr.
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(scales)
## Warning: package 'scales' was built under R version 4.2.3
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.3
The data used in this project comes from Kaggle.com and covers the 50 bestselling books on Amazon.com for each year from 2009 to 2019. We will load the data, explore it, and visualize the results.
book <- read.csv("bestsellers with categories.csv")
glimpse(book)
## Rows: 550
## Columns: 7
## $ Name <chr> "10-Day Green Smoothie Cleanse", "11/22/63: A Novel", "12 …
## $ Author <chr> "JJ Smith", "Stephen King", "Jordan B. Peterson", "George …
## $ User.Rating <dbl> 4.7, 4.6, 4.7, 4.7, 4.8, 4.4, 4.7, 4.7, 4.7, 4.6, 4.6, 4.6…
## $ Reviews <int> 17350, 2052, 18979, 21424, 7665, 12643, 19735, 19699, 5983…
## $ Price <int> 8, 22, 15, 6, 12, 11, 30, 15, 3, 8, 8, 2, 32, 5, 17, 4, 6,…
## $ Year <int> 2016, 2011, 2018, 2017, 2019, 2011, 2014, 2017, 2018, 2016…
## $ Genre <chr> "Non Fiction", "Fiction", "Non Fiction", "Fiction", "Non F…
# Check each column for missing values
colSums(is.na(book))
## Name Author User.Rating Reviews Price Year
## 0 0 0 0 0 0
## Genre
## 0
book <- book %>%
  mutate(Genre = as.factor(Genre)) %>%
  arrange(Year)
Here is an explanation of the columns in the dataset:
library(ggplot2)
book %>%
  select(Name, Genre) %>%
  group_by(Genre) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(Percentage = prop.table(Count) * 100) %>%
  # Visualize the data with pie chart using "ggplot2" library
  ggplot(aes(x = "", y = Percentage, fill = Genre)) +
  geom_bar(stat = "identity", width = 1.12) +
  scale_fill_manual(values = c("#FF90BC", "#FFC0D9")) +
  coord_polar(theta = "y", start = pi / 3) +
  theme_minimal() +
  geom_label(aes(label = paste0(round(Percentage, 2), "%")),
             position = position_stack(vjust = 0.5)) +
  labs(title = "Percentage of Genre",
       y = NULL,
       x = NULL) +
  theme(plot.title = element_text(hjust = 0.5))
library(ggplot2)
book %>%
  select(Name, Genre) %>%
  group_by(Genre) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(Percentage = prop.table(Count) * 100) %>%
  # Visualize the data with bar chart using "ggplot2" library
  ggplot(aes(x = Genre, y = Count, fill = Genre)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = Count, label = Count),
            vjust = 1.6, color = "black", size = 5) +
  scale_fill_manual(values = c("#FF90BC", "#FFC0D9")) +
  theme_pander()
book %>%
  select(Year, Genre) %>%
  group_by(Genre, Year) %>%
  # Drop the grouping after counting so summarise() does not emit a grouping message
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = Genre,
              values_from = count) %>%
  # Negate the Fiction counts so their bars extend to the other side of zero
  mutate(Fiction = -Fiction,
         Year = as.factor(Year)) %>%
  arrange(Year) %>%
  # Visualize the data with pyramid chart using "ggplot2" library
  ggplot(aes(x = Year)) +
  geom_bar(stat = "identity",
           width = 0.8,
           fill = "#FF90BC",
           aes(y = Fiction)) +
  geom_text(aes(x = Year,
                y = Fiction + 2,
                label = abs(Fiction)),
            colour = "white") +
  geom_bar(stat = "identity",
           width = 0.8,
           fill = "#FFC0D9",
           aes(y = `Non Fiction`)) +
  geom_text(aes(x = Year,
                y = `Non Fiction` - 2,
                label = `Non Fiction`),
            colour = "black") +
  ylim(-35, 35) +
  coord_flip() +
  annotate("text", x = 0.1, y = -5, hjust = 0.3, vjust = -0.3,
           label = "Fiction", colour = "#FF90BC", fontface = 2) +
  annotate("text", x = 0.1, y = 5, hjust = 0.4, vjust = -0.3,
           label = "Non Fiction", colour = "#FFC0D9", fontface = 2) +
  labs(y = "Genre",
       x = "Year") +
  theme(axis.text.x = element_blank(),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(linetype = "dashed", colour = "grey"))
The first two charts above summarize the overall genre split: the pie chart shows each genre's share of the list, and the bar chart shows the number of books in each genre. Fiction accounts for 43.64% of the list with 240 books (shown in dark pink), while Non-Fiction accounts for 56.36% with 310 books (shown in light pink).
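As a quick check, the percentages can be reproduced from the two counts alone. This is a minimal sketch that uses only the totals quoted above; the variable names `genre_counts` and `genre_pct` are ours and do not appear in the analysis code.

```r
# Counts per genre, as reported in the bar chart
genre_counts <- c(Fiction = 240, "Non Fiction" = 310)

# prop.table() divides each count by the overall total (550)
genre_pct <- round(prop.table(genre_counts) * 100, 2)
print(genre_pct)
```

This is the same `prop.table()` step used in the charting code, applied to the counts directly instead of to the summarised data frame.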
The third chart is a population pyramid chart, which shows the number of books in each genre (Fiction and Non-Fiction) for each year. Pyramid charts are normally used to visualize population data, but the same layout works here because it places the two genres side by side for every year, which makes the yearly details easier to compare.
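The core idea behind the pyramid layout is to negate one group's counts so that its bars grow in the opposite direction, while still labelling them with absolute values. Here is a minimal sketch of that idea. The data frame `toy` holds made-up counts for illustration only (they are not taken from the dataset), and the column name `FictionMirrored` is our own.

```r
library(ggplot2)

# Hypothetical yearly counts, for illustration only
toy <- data.frame(
  Year = factor(c(2009, 2010, 2011)),
  Fiction = c(20, 25, 22),
  NonFiction = c(30, 25, 28)
)

# Negate one group so its bars extend to the other side of zero
toy$FictionMirrored <- -toy$Fiction

pyramid <- ggplot(toy, aes(x = Year)) +
  geom_col(aes(y = FictionMirrored), fill = "#FF90BC", width = 0.8) +
  geom_col(aes(y = NonFiction), fill = "#FFC0D9", width = 0.8) +
  # Label with absolute values so the mirrored side still reads as a positive count
  geom_text(aes(y = FictionMirrored, label = abs(FictionMirrored)), hjust = 1.2) +
  geom_text(aes(y = NonFiction, label = NonFiction), hjust = -0.2) +
  # Show axis ticks as positive numbers on both sides
  scale_y_continuous(labels = abs, limits = c(-35, 35)) +
  coord_flip()

print(pyramid)
```

Keeping the original `Fiction` column and mirroring into a new one is a design choice: it leaves the true counts available for any later calculation.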
Here are some additional details about the charts:
Overall, the charts show that Non-Fiction books appear on the bestseller list more often than Fiction books. This could be due to a number of factors, such as growing demand for self-help and educational books.
book %>%
  select(User.Rating) %>%
  group_by(User.Rating) %>%
  summarise(count = n()) %>%
  # Sort while User.Rating is still numeric; negating a factor is not meaningful
  arrange(desc(User.Rating)) %>%
  mutate(User.Rating = as.factor(User.Rating)) %>%
  ggplot(aes(x = User.Rating, y = count, fill = User.Rating)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = count, label = count),
            vjust = 0.1, size = 3) +
  theme(legend.position = "none")
Based on the bar chart above, several conclusions can be drawn:
Here are some other key points:
In conclusion, a user rating of 4.8 has the most books, while user ratings of 3.3 and 3.6 have the fewest.
library(ggplot2)
book %>%
  group_by(Genre) %>%
  summarise(Total_Reviews = sum(Reviews), .groups = "drop") %>%
  ggplot(aes(x = Genre, y = Total_Reviews, fill = Genre)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Total_Reviews),
            vjust = -0.5, color = "black", size = 4) +
  scale_fill_manual(values = c("#FF90BC", "#FFC0D9")) +
  theme_minimal() +
  labs(x = "Genre", y = "Total Reviews") +
  # "center" is not a valid legend position; the x axis already names each genre
  theme(legend.position = "none")
Based on the graph above, Fiction is the most-reviewed category, with a total of 3,764,110 reviews. The Non-Fiction category received 2,810,195 reviews.
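Because Fiction has fewer books on the list (240 versus 310), it can help to look at reviews per listed book as well as the totals. The sketch below uses only the figures already quoted in this report; the names `total_reviews`, `book_counts`, `reviews_per_book`, and `ratio` are ours.

```r
# Totals quoted in the text above
total_reviews <- c(Fiction = 3764110, "Non Fiction" = 2810195)
book_counts <- c(Fiction = 240, "Non Fiction" = 310)

# Average number of reviews per listed book in each genre
reviews_per_book <- round(total_reviews / book_counts)
print(reviews_per_book)

# How many times more reviews an average Fiction entry has
ratio <- (total_reviews[["Fiction"]] / book_counts[["Fiction"]]) /
  (total_reviews[["Non Fiction"]] / book_counts[["Non Fiction"]])
print(round(ratio, 2))
```

On these figures, a Fiction entry has roughly 15,684 reviews on average against roughly 9,065 for Non-Fiction, so the gap per book is wider than the gap in totals.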
p1 <- book %>%
  filter(Genre == "Fiction") %>%
  arrange(-Price) %>%
  select(Name, Author, Price) %>%
  distinct(Name, Author, Price) %>%
  # slice_max() names the ranking column explicitly and keeps ties, like top_n()
  slice_max(Price, n = 5) %>%
  ggplot(aes(Price, reorder(Name, Price), fill = Price)) +
  geom_col() +
  scale_fill_gradient(low = "#FF90BC", high = "#FFC0D9") +
  scale_y_discrete(labels = wrap_format(45)) +
  geom_text(aes(label = Price),
            hjust = 1.5) +
  labs(title = "Fiction Books",
       y = "Book Name") +
  theme(legend.position = "none")
p2 <- book %>%
  filter(Genre == "Non Fiction") %>%
  arrange(-Price) %>%
  select(Name, Author, Price) %>%
  distinct(Name, Author, Price) %>%
  # slice_max() names the ranking column explicitly and keeps ties, like top_n()
  slice_max(Price, n = 5) %>%
  ggplot(aes(Price, reorder(Name, Price), fill = Price)) +
  geom_col() +
  scale_fill_gradient(low = "#FF90BC", high = "#FFC0D9") +
  scale_y_discrete(labels = wrap_format(45)) +
  geom_text(aes(label = Price),
            hjust = 1.5) +
  labs(title = "Non Fiction Books",
       y = "Book Name") +
  theme(legend.position = "none")
ggarrange(p1, p2,
          ncol = 1, nrow = 2)
From the sorted bar chart above, we can see that:
In conclusion, among the five most expensive titles in each genre, Non-Fiction books are more expensive than Fiction books.
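The charts above compare only the most expensive titles. To compare prices across a whole genre, one option is to summarise the median and mean price per genre. The sketch below shows the pattern on the first four rows visible in the `glimpse()` output, so its numbers are not representative of the full dataset; `sample_rows` and `price_summary` are names we introduce here.

```r
library(dplyr)

# First four rows as shown by glimpse(); a tiny sample, for illustration only
sample_rows <- data.frame(
  Price = c(8, 22, 15, 6),
  Genre = c("Non Fiction", "Fiction", "Non Fiction", "Fiction")
)

# One row per genre with the number of books and typical prices
price_summary <- sample_rows %>%
  group_by(Genre) %>%
  summarise(
    Books = n(),
    Median_Price = median(Price),
    Mean_Price = mean(Price),
    .groups = "drop"
  )

print(price_summary)
```

Replacing `sample_rows` with `book` would apply the same summary to all 550 rows.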
library(treemapify)
## Warning: package 'treemapify' was built under R version 4.2.3
book %>%
  filter(Genre == "Fiction") %>%
  arrange(-Reviews) %>%
  select(Name, Author, Reviews, User.Rating) %>%
  distinct(Name, Author, Reviews, User.Rating) %>%
  head(5) %>%
  ggplot(aes(area = Reviews, label = Name, fill = Name,
             subgroup = Author, subgroup2 = Reviews, subgroup3 = User.Rating)) +
  geom_treemap() +
  geom_treemap_subgroup3_border(colour = "black", size = 3) +
  # Author name in the top-left corner of each tile
  geom_treemap_subgroup_text(
    place = "topleft",
    colour = "black",
    reflow = TRUE,
    size = 14,
    alpha = 0.8
  ) +
  # Review count
  geom_treemap_subgroup2_text(
    colour = "white",
    alpha = 1,
    size = 17,
    fontface = "italic"
  ) +
  # User rating in the top-right corner
  geom_treemap_subgroup3_text(
    place = "topright",
    colour = "black",
    alpha = 0.6,
    size = 14
  ) +
  # Book name in the middle of the tile
  geom_treemap_text(
    colour = "white",
    place = "middle",
    size = 17,
    fontface = "bold",
    reflow = TRUE) +
  theme(legend.position = "none")
library(ggplot2)
book %>%
  filter(Genre == "Fiction") %>%
  arrange(-Reviews) %>%
  select(Name, Author, Reviews, User.Rating) %>%
  distinct(Name, Author, Reviews, User.Rating) %>%
  head(5) %>%
  ggplot(aes(x = reorder(Name, -Reviews), y = Reviews, fill = Name)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#EF9595", "#FF8080", "#FF90BC", "#FFC0D9", "#FF9B9B")) +
  geom_text(aes(label = Reviews), vjust = -0.2, color = "black", size = 4) +
  theme_minimal() +
  labs(x = "Book Name", y = "Reviews") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")
Based on the graph above, we have some conclusions:
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(book[1:7], label = T)
## Warning in ggcorr(book[1:7], label = T): data in column(s) 'Name', 'Author',
## 'Genre' are not numeric and were ignored
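The warning above appears because correlation is only defined for numeric columns. A minimal sketch of selecting the numeric columns first and computing the correlation matrix with base R follows. It uses the first four rows visible in the `glimpse()` output, so the resulting values are illustrative only; `sample_rows`, `numeric_cols`, and `num_cor` are names we introduce here.

```r
# First four rows as shown by glimpse(); a tiny sample, for illustration only
sample_rows <- data.frame(
  User.Rating = c(4.7, 4.6, 4.7, 4.7),
  Reviews = c(17350, 2052, 18979, 21424),
  Price = c(8, 22, 15, 6),
  Year = c(2016, 2011, 2018, 2017),
  Genre = c("Non Fiction", "Fiction", "Non Fiction", "Fiction")
)

# Keep only the numeric columns; Genre is dropped here
numeric_cols <- Filter(is.numeric, sample_rows)

# Pearson correlation matrix of the remaining columns
num_cor <- cor(numeric_cols)
print(round(num_cor, 2))
```

Passing a numeric-only data frame such as `Filter(is.numeric, book)` to `ggcorr()` should give the same plot without the warning.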
book %>%
  ggplot(aes(Price, Year, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))
book %>%
  ggplot(aes(Price, Reviews, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))
book %>%
  ggplot(aes(Year, Reviews, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))
book %>%
  ggplot(aes(User.Rating, Reviews, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))
book %>%
  ggplot(aes(User.Rating, Price, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))
book %>%
  ggplot(aes(Year, User.Rating, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))
Conclusion:
Notes:
Marketing Team:
Data Team: