Amazon Top 50 Bestselling Books 2009 - 2019

This is my first R Markdown document, which I will use as my project for the Google Data Analytics Professional Certificate.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We will use the tidyverse library, which loads 9 core packages:

ggplot2: For data visualization
dplyr: For data manipulation
tidyr: For tidying data
readr: For importing rectangular data such as CSV files
purrr: For functional programming
tibble: For a modern take on the data frame
stringr: For string operations
forcats: For handling categorical variables (factors)
lubridate: For working with dates and times

Out of those 9 packages, we will mainly be using ggplot2, dplyr, and tidyr. The data itself is read with base R's read.csv().

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.2.3

library(scales)

## Warning: package 'scales' was built under R version 4.2.3

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.2.3

1. Background

The data used in this project is a Kaggle.com dataset of the 50 bestselling books on Amazon.com for each year from 2009 to 2019. We load the data, then explore and visualize it.

book <- read.csv("bestsellers with categories.csv")
glimpse(book)

## Rows: 550
## Columns: 7
## $ Name        <chr> "10-Day Green Smoothie Cleanse", "11/22/63: A Novel", "12 …
## $ Author      <chr> "JJ Smith", "Stephen King", "Jordan B. Peterson", "George …
## $ User.Rating <dbl> 4.7, 4.6, 4.7, 4.7, 4.8, 4.4, 4.7, 4.7, 4.7, 4.6, 4.6, 4.6…
## $ Reviews     <int> 17350, 2052, 18979, 21424, 7665, 12643, 19735, 19699, 5983…
## $ Price       <int> 8, 22, 15, 6, 12, 11, 30, 15, 3, 8, 8, 2, 32, 5, 17, 4, 6,…
## $ Year        <int> 2016, 2011, 2018, 2017, 2019, 2011, 2014, 2017, 2018, 2016…
## $ Genre       <chr> "Non Fiction", "Fiction", "Non Fiction", "Fiction", "Non F…

2. Data Exploration

#checking the missing values
colSums(is.na(book))

##        Name      Author User.Rating     Reviews       Price        Year 
##           0           0           0           0           0           0 
##       Genre 
##           0

book <- book %>% 
  mutate(Genre = as.factor(Genre)) %>%
  arrange(Year)

Here is an explanation of the columns in the dataset:

Name: Title of the book
Author: Author of the book
User.Rating: Average rating given by Amazon users (on a 1-5 scale)
Reviews: Number of reviews written by users
Price: Price (in US Dollars)
Year: Year in which the book appeared on the bestseller list
Genre: Genre of the book (Fiction / Non Fiction)
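The rating column shows up as User.Rating in glimpse() and in the code. One likely reason, assuming the CSV header is "User Rating" with a space (the raw header is not shown here), is that read.csv() rewrites headers into syntactically valid R names by default. A minimal sketch with a made-up two-row CSV:

```r
# Toy CSV text standing in for the real file; the header "User Rating"
# is an assumption about how the column is spelled in the source file.
csv_text <- "Name,User Rating,Genre\nBook A,4.7,Fiction\nBook B,4.5,Non Fiction\n"

# Default behaviour: check.names = TRUE turns the space into a dot.
toy_default <- read.csv(text = csv_text)
names(toy_default)

# check.names = FALSE keeps the header exactly as written in the file.
toy_raw <- read.csv(text = csv_text, check.names = FALSE)
names(toy_raw)
```

Keeping the default is convenient here because User.Rating can be used in dplyr and ggplot2 calls without backticks.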

Checking the author values

book %>% 
  count(Author) %>% 
  arrange(Author)

##                                    Author  n
## 1                        Abraham Verghese  2
## 2                          Adam Gasiewski  1
## 3                           Adam Mansbach  1
## 4                               Adir Levy  1
## 5              Admiral William H. McRaven  1
## 6             Adult Coloring Book Designs  1
## 7                              Alan Moore  1
## 8                        Alex Michaelides  1
## 9                          Alice Schertle  1
## 10                            Allie Brosh  1
## 11       American Psychiatric Association  2
## 12     American Psychological Association 10
## 13                            Amor Towles  1
## 14                              Amy Ramos  2
## 15                            Amy Shields  1
## 16                              Andy Weir  1
## 17                            Angie Grace  1
## 18                           Angie Thomas  1
## 19                            Ann Voskamp  2
## 20                      Ann Whitford Paul  2
## 21                       Anthony Bourdain  1
## 22                          Anthony Doerr  2
## 23                           Atul Gawande  1
## 24                     Audrey Niffenegger  1
## 25                            B. J. Novak  2
## 26               Bessel van der Kolk M.D.  1
## 27                        Bill Martin Jr.  2
## 28                          Bill O'Reilly  7
## 29                           Bill Simmons  1
## 30                     Blue Star Coloring  2
## 31                           Bob Woodward  1
## 32                        Brandon Stanton  3
## 33                            Brené Brown  1
## 34                         Brian Kilmeade  1
## 35                      Bruce Springsteen  1
## 36                         Carol S. Dweck  3
## 37                             Celeste Ng  1
## 38                       Charlaine Harris  4
## 39                         Charles Duhigg  1
## 40                    Charles Krauthammer  1
## 41                         Cheryl Strayed  1
## 42                            Chip Gaines  1
## 43                             Chip Heath  1
## 44                           Chris Cleave  1
## 45                             Chris Kyle  1
## 46                         Chrissy Teigen  1
## 47                  Christina Baker Kline  1
## 48                    Christopher Paolini  1
## 49              Coloring Books for Adults  1
## 50                            Craig Smith  2
## 51                          Crispin Boyer  1
## 52                                     DK  2
## 53                          Dale Carnegie  5
## 54                              Dan Brown  3
## 55                         Daniel H. Pink  1
## 56                     Daniel James Brown  2
## 57                        Daniel Kahneman  2
## 58                       Daniel Lipkowitz  2
## 59                             Dav Pilkey  7
## 60                            Dave Ramsey  1
## 61                          David Goggins  1
## 62                            David Grann  1
## 63                       David McCullough  1
## 64                    David Perlmutter MD  1
## 65                            David Platt  2
## 66                        David Zinczenko  2
## 67                         Deborah Diesen  2
## 68       Delegates of the Constitutional…  1
## 69                            Delia Owens  1
## 70                          Dinah Bucholz  1
## 71                        Don Miguel Ruiz  6
## 72                            Donna Tartt  2
## 73                             Doug Lemov  2
## 74                              Dr. Seuss  9
## 75                 Dr. Steven R Gundry MD  2
## 76                           Drew Daywalt  3
## 77                              E L James  6
## 78                         Eben Alexander  2
## 79                           Edward Klein  1
## 80                      Edward M. Kennedy  1
## 81                            Elie Wiesel  1
## 82                       Elizabeth Strout  1
## 83                  Emily Winfield Martin  4
## 84                             Eric Carle  7
## 85                            Eric Larson  1
## 86                           Ernest Cline  2
## 87                            F. A. Hayek  1
## 88                    F. Scott Fitzgerald  3
## 89                           Francis Chan  3
## 90                        Fredrik Backman  2
## 91                                 Gallup  9
## 92                            Garth Stein  2
## 93                           Gary Chapman 11
## 94                           Gayle Forman  1
## 95                            Geneen Roth  1
## 96                          George Orwell  1
## 97                    George R. R. Martin  2
## 98                     George R.R. Martin  3
## 99                         George W. Bush  1
## 100                         Giles Andreae  5
## 101                         Gillian Flynn  3
## 102                            Glenn Beck  3
## 103                          Golden Books  1
## 104                        Greg Mortenson  2
## 105                            Harper Lee  6
## 106                         Heidi Murkoff  1
## 107                Hillary Rodham Clinton  1
## 108                       Hopscotch Girls  1
## 109                          Howard Stern  1
## 110                     Ian K. Smith M.D.  1
## 111                            Ina Garten  3
## 112                           J. D. Vance  2
## 113                         J. K. Rowling  2
## 114                          J.K. Rowling  6
## 115                              JJ Smith  1
## 116                           James Comey  1
## 117                         James Dashner  1
## 118                       James Patterson  2
## 119                             Jay Asher  1
## 120                         Jaycee Dugard  1
## 121                           Jeff Kinney 12
## 122                           Jen Sincero  4
## 123                        Jennifer Smith  2
## 124                            Jill Twiss  1
## 125                           Jim Collins  4
## 126                         Joanna Gaines  2
## 127                       Joel Fuhrman MD  2
## 128                       Johanna Basford  2
## 129                            John Green  5
## 130                          John Grisham  5
## 131                        John Heilemann  1
## 132                           Jon Meacham  1
## 133                           Jon Stewart  1
## 134                         Jonathan Cahn  1
## 135                    Jordan B. Peterson  1
## 136                           Julia Child  1
## 137                        Justin Halpern  1
## 138                      Kathryn Stockett  4
## 139                        Keith Richards  1
## 140                           Ken Follett  1
## 141                            Kevin Kwan  1
## 142                       Khaled Hosseini  1
## 143                        Kristin Hannah  2
## 144                      Larry Schweikart  1
## 145                     Laura Hillenbrand  5
## 146                       Laurel Randolph  2
## 147                    Lin-Manuel Miranda  1
## 148                        Lysa TerKeurst  2
## 149                         M Prefontaine  1
## 150                     Madeleine L'Engle  1
## 151                      Malcolm Gladwell  4
## 152                       Margaret Atwood  1
## 153                   Margaret Wise Brown  3
## 154                           Marie Kondō  4
## 155                       Marjorie Sarnat  2
## 156                       Mark Hyman M.D.  1
## 157                           Mark Manson  3
## 158                             Mark Owen  1
## 159                         Mark R. Levin  2
## 160                            Mark Twain  1
## 161                          Markus Zusak  2
## 162                           Marty Noble  1
## 163                      Mary Ann Shaffer  1
## 164                        Maurice Sendak  1
## 165                 Melissa Hartwig Urban  3
## 166                         Michael Lewis  1
## 167                        Michael Pollan  1
## 168                         Michael Wolff  1
## 169                        Michelle Obama  2
## 170                           Mike Moreno  1
## 171                           Mitch Albom  1
## 172                        Muriel Barbery  1
## 173                       Naomi Kleinberg  2
## 174                        Nathan W. Pyle  1
## 175              National Geographic Kids  1
## 176                   Neil deGrasse Tyson  1
## 177                     Paper Peony Press  1
## 178                      Patrick Lencioni  5
## 179                        Patrick Thorpe  1
## 180                        Paul Kalanithi  1
## 181                         Paula Hawkins  2
## 182                          Paula McLain  1
## 183                          Paulo Coelho  1
## 184                            Pete Souza  1
## 185                     Peter A. Lillback  1
## 186                        Phil Robertson  1
## 187                          Pierre Dukan  1
## 188                   Pretty Simple Press  1
## 189                         R. J. Palacio  5
## 190                             RH Disney  2
## 191                         Rachel Hollis  3
## 192                      Raina Telgemeier  1
## 193                        Randall Munroe  1
## 194                          Randy Pausch  1
## 195                          Ray Bradbury  2
## 196                        Rebecca Skloot  3
## 197                          Ree Drummond  3
## 198                          Rick Riordan 11
## 199                              Rob Bell  1
## 200                           Rob Elliott  8
## 201                         Robert Jordan  1
## 202                         Robert Munsch  2
## 203                          Rod Campbell  4
## 204                          Roger Priddy  5
## 205                           Ron Chernow  1
## 206                             Rupi Kaur  4
## 207                         Rush Limbaugh  2
## 208                          Samin Nosrat  2
## 209                        Sandra Boynton  2
## 210                            Sara Gruen  1
## 211                           Sarah Palin  1
## 212                           Sarah Young  6
## 213                          Sasha O'Hara  1
## 214                            Scholastic  2
## 215                           School Zone  2
## 216                  Sherri Duskey Rinker  2
## 217                       Sheryl Sandberg  2
## 218                            Silly Bear  1
## 219                      Stephen Kendrick  1
## 220                          Stephen King  4
## 221                      Stephen R. Covey  7
## 222                       Stephenie Meyer  7
## 223                          Steve Harvey  1
## 224                      Steven D. Levitt  1
## 225                         Stieg Larsson  6
## 226                            Susan Cain  2
## 227                       Suzanne Collins 11
## 228                      Ta-Nehisi Coates  2
## 229                         Tara Westover  2
## 230                     Tatiana de Rosnay  1
## 231                     The College Board  6
## 232      The Staff of The Late Show with…  1
## 233                   The Washington Post  1
## 234                       Thomas Campbell  1
## 235                        Thomas Piketty  1
## 236                          Thug Kitchen  4
## 237                       Timothy Ferriss  2
## 238                              Tina Fey  1
## 239                            Todd Burpo  2
## 240                            Tony Hsieh  1
## 241                        Tucker Carlson  1
## 242                         Veronica Roth  4
## 243                      W. Cleon Skousen  1
## 244                       Walter Isaacson  3
## 245                         William Davis  2
## 246                      William P. Young  2
## 247                      Wizards RPG Team  3
## 248                          Zhi Gang Sha  2

Looking closely at the list, J. K. Rowling and George R. R. Martin each appear under two spellings (with and without a space between the initials) that refer to the same author. We therefore clean the Author column so that each name is spelled consistently.

book %>% 
  filter(Author %in% c("J.K. Rowling", "J. K. Rowling", "George R. R. Martin", "George R.R. Martin")) %>% 
  select(Author) %>% 
  count(Author)

##                Author n
## 1 George R. R. Martin 2
## 2  George R.R. Martin 3
## 3       J. K. Rowling 2
## 4        J.K. Rowling 6

book$Author[book$Author == "J.K. Rowling"] <- "J. K. Rowling"
book$Author[book$Author == "George R.R. Martin"] <- "George R. R. Martin"

book %>% 
  filter(Author %in% c("J.K. Rowling", "J. K. Rowling", "George R. R. Martin", "George R.R. Martin")) %>% 
  select(Author) %>% 
  count(Author)

##                Author n
## 1 George R. R. Martin 5
## 2       J. K. Rowling 8
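The two names above were fixed one at a time. As a rough sketch of a more general approach, a regular expression can insert the missing space after initials. The helper name normalize_initials is hypothetical, and the rule is an assumption that would also split suffixes such as "M.D." into "M. D.", so the result should be inspected before it is applied to the whole column.

```r
# Hypothetical helper: put a space after any period that is directly
# followed by a capital letter, so "J.K." becomes "J. K.".
normalize_initials <- function(x) {
  gsub("\\.(?=[A-Z])", ". ", x, perl = TRUE)
}

authors <- c("J.K. Rowling", "J. K. Rowling", "George R.R. Martin", "Stephen King")

# After normalization the two spellings of each name collapse into one.
cleaned <- unique(normalize_initials(authors))
cleaned
```

Names that are already spelled consistently, such as "Stephen King", pass through unchanged.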

3. Data Visualization

a. Percentage of Books Based on Genre (Pie Chart)

library(ggplot2)

book %>%
  select(Name, Genre) %>%
  group_by(Genre) %>%
  summarise(Count = n(), .groups = "drop") %>% 
  mutate(Percentage = prop.table(Count)*100) %>%
  
  # Visualize the data with pie chart using "ggplot2" library
  ggplot(aes(x = "", y = Percentage, fill = Genre)) +
  geom_bar(stat = "identity", width = 1.12) +
  scale_fill_manual(values = c("#FF90BC", "#FFC0D9")) +
  coord_polar(theta = "y", start = pi / 3) +
  theme_minimal() +
  geom_label(aes(label = paste0(round(Percentage,2), "%")),
             position = position_stack(vjust = 0.5)) +
  labs(title = "Percentage of Genre",
       y = NULL,
       x = NULL) +
  theme(plot.title = element_text(hjust = 0.5))

b. Total Books Based on Genre (Bar Chart)

library(ggplot2)

book %>% 
  select(Name, Genre) %>%
  group_by(Genre) %>%
  summarise(Count = n(), .groups = "drop") %>% 
  mutate(Percentage = prop.table(Count)*100) %>% 
  
  # Visualize the data with bar chart using "ggplot2" library
  ggplot(aes(x = Genre, y = Count, fill = Genre)) + 
  geom_bar(stat = "identity") +
  geom_text(aes(y = Count, label = Count),
            vjust = 1.6, color = "black", size = 5) +
  scale_fill_manual(values = c("#FF90BC", "#FFC0D9")) +
  theme_pander()

c. Total Books Based on Year and Genre (Pyramid Chart)

book %>% 
  select(Year, Genre) %>% 
  group_by(Genre, Year) %>% 
  summarise(count = n()) %>% 
  pivot_wider(names_from = Genre,
              values_from = count) %>% 
  mutate(Fiction = -Fiction,
         Year = as.factor(Year)) %>% 
  arrange(Year) %>% 
  
  # Visualize the data with pyramid chart using "ggplot2" library
  ggplot(aes(x = Year)) +
  geom_bar(stat = "identity",
           width = 0.8,
           fill = "#FF90BC",
           aes(y = Fiction)) +
  geom_text(aes(x = Year,
                y = Fiction + 2,
                label = abs(Fiction)),
            colour = "white") +
  geom_bar(stat = "identity",
           width = 0.8,
           fill = "#FFC0D9",
           aes(y = `Non Fiction`)) +
  geom_text(aes(x = Year,
                y = `Non Fiction` - 2,
                label = `Non Fiction`),
            colour = "black") +
  ylim(-35, 35) +
  coord_flip() +
  annotate("text", x = 0.1, y = -5, hjust = 0.3, vjust = -0.3,
           label="Fiction", colour = "#FF90BC", fontface = 2) +
  annotate("text", x = 0.1, y = 5, hjust = 0.4, vjust = -0.3,
           label="Non Fiction", colour = "#FFC0D9", fontface = 2) +
  labs(y = "Genre",
       x = "Year") +
  theme(axis.text.x = element_blank(),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(linetype = "dashed", colour = "grey"))

## `summarise()` has grouped output by 'Genre'. You can override using the
## `.groups` argument.

The first two charts describe the overall split between genres: the pie chart shows the percentage of each genre, and the bar chart shows the number of books in each genre. Fiction accounts for 43.64% of the entries with 240 books (dark pink), while Non-Fiction accounts for 56.36% with 310 books (light pink).

The third chart is a population pyramid, which shows the number of books in each genre (Fiction and Non-Fiction) grouped by year. Population pyramids are normally used for demographic data, but the same layout works here: it places the two genres back to back, which makes the yearly counts easier to compare.
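The back-to-back layout comes from plotting one genre with negated counts, so its bars extend to the left of zero. A small base R sketch of that step, using a made-up five-row data frame rather than the real dataset:

```r
# Toy data: one row per bestseller entry (values are illustrative only).
toy <- data.frame(
  Year  = c(2009, 2009, 2009, 2010, 2010),
  Genre = c("Fiction", "Non Fiction", "Non Fiction", "Fiction", "Fiction")
)

# Count entries per Year and Genre.
counts <- as.data.frame(table(Year = toy$Year, Genre = toy$Genre),
                        responseName = "count")

# Negate the Fiction counts so they are drawn on the left side of the axis.
counts$signed <- ifelse(counts$Genre == "Fiction", -counts$count, counts$count)
counts
```

When labelling the bars, abs() recovers the original positive count, which is the same idea as the abs(Fiction) label in the chart code.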

Here are some additional details about the charts:

The pie chart shows that the majority of books are non-fiction (56.36%).
The bar chart shows that there are more non-fiction books than fiction books overall (310 versus 240); it does not break the counts down by year.
The population pyramid is where year-by-year differences appear; the reading that non-fiction has been increasing over time while fiction has been decreasing should be checked against the yearly counts it displays.

Overall, the charts show that non-fiction books appear on the bestseller list more often than fiction books. This could be due to a number of factors, such as demand for self-help and educational books.
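As a quick check of the percentages quoted above, the genre shares can be recomputed from the two counts reported in this section (240 Fiction and 310 Non-Fiction entries out of 550):

```r
# Counts taken from the bar chart discussion above.
genre_counts <- c(Fiction = 240, `Non Fiction` = 310)

# prop.table() divides each count by the total; multiply by 100 for percent.
genre_pct <- round(prop.table(genre_counts) * 100, 2)
genre_pct
```

The result matches the 43.64% and 56.36% shown on the pie chart.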

d. Total Books by User Rating (Bar Chart)

book %>% 
  select(User.Rating) %>% 
  group_by(User.Rating) %>% 
  summarise(count = n()) %>%
  # Sort while User.Rating is still numeric; negating a factor is not meaningful
  arrange(-User.Rating) %>% 
  mutate(User.Rating = as.factor(User.Rating)) %>% 
  
  ggplot(aes(x = User.Rating, y = count, fill = User.Rating)) + 
  geom_bar(stat = "identity") +
  geom_text(aes(y = count, label = count),
            vjust = 0.1, size = 3)  +
  theme(legend.position = "none")

Based on the bar chart above, several conclusions can be drawn:

The highest user rating (4.9) has 52 books.
The 4.8 rating has the most books, with 127.
The 3.3 and 3.6 ratings have the fewest books, with 1 each.

Here are some other key points:

Ratings above 4.2 generally have more books.
Ratings at the low end of the scale account for relatively few books.
The distribution of user ratings is uneven, with a concentration at 4.8.

In short, the ratings are concentrated at the high end of the scale, peaking at 4.8.
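The most and least common ratings can also be read off programmatically instead of from the bar labels. A minimal sketch with a made-up vector of ratings (not the real column):

```r
# Toy ratings; the real analysis would use book$User.Rating.
toy_ratings <- c(4.8, 4.7, 4.8, 4.9, 3.3, 4.8, 4.7)

# Frequency of each distinct rating.
rating_counts <- table(toy_ratings)

# Rating with the highest count, and all ratings tied for the lowest count.
most_common <- names(rating_counts)[which.max(rating_counts)]
rarest <- names(rating_counts)[rating_counts == min(rating_counts)]

most_common
rarest
```

Using == min(...) rather than which.min() keeps ties, which matters when two ratings share the lowest count.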

e. Total Reviews by Genre (Bar Chart)

library(ggplot2)

book %>%
  group_by(Genre) %>%
  summarise(Total_Reviews = sum(Reviews), .groups = "drop") %>%
  ggplot(aes(x = Genre, y = Total_Reviews, fill = Genre)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Total_Reviews),
            vjust = -0.5, color = "black", size = 4) +
  scale_fill_manual(values = c("#FF90BC", "#FFC0D9")) +
  theme_minimal() +
  labs(x = "Genre", y = "Total Reviews") +
  theme(legend.position = "none")

Based on the chart above, Fiction books received the most reviews, with a total of 3,764,110. Non-Fiction books received 2,810,195 reviews.
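Because the two genres contain different numbers of entries, totals alone can be hard to compare. A rough follow-up calculation divides each total by the number of entries per genre reported earlier (240 and 310). Note that these are list entries, so a book that appears in several years is counted once per year.

```r
# Totals from the chart above and entry counts from the genre bar chart.
total_reviews <- c(Fiction = 3764110, `Non Fiction` = 2810195)
entry_counts  <- c(Fiction = 240, `Non Fiction` = 310)

# Average number of reviews per list entry in each genre.
reviews_per_entry <- round(total_reviews / entry_counts)
reviews_per_entry
```

On this measure the gap widens: Fiction averages roughly 15,684 reviews per entry against roughly 9,065 for Non-Fiction.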

f. Top 5 Books by Price (Sorted Bar Chart)

p1 <- book %>% 
  filter(Genre == "Fiction") %>% 
  arrange(-Price) %>% 
  select(Name, Author, Price) %>% 
  distinct(Name, Author, Price) %>% 
  top_n(5) %>% 
  
  ggplot(aes(Price, reorder(Name, Price), fill = Price)) +
  geom_col() +
  scale_fill_gradient(low = "#FF90BC", high = "#FFC0D9") +
  scale_y_discrete(labels = wrap_format(45)) +
  geom_text(aes(label = Price),
            hjust = 1.5) +
  labs(title = "Fiction Books",
       y = "Book Name") +
  theme(legend.position = "none")

## Selecting by Price

p2 <- book %>% 
  filter(Genre == "Non Fiction") %>% 
  arrange(-Price) %>% 
  select(Name, Author, Price) %>% 
  distinct(Name, Author, Price) %>% 
  top_n(5) %>% 
  
  ggplot(aes(Price, reorder(Name, Price), fill = Price)) +
  geom_col() +
  scale_fill_gradient(low = "#FF90BC", high = "#FFC0D9") +
  scale_y_discrete(labels = wrap_format(45)) +
  geom_text(aes(label = Price),
            hjust = 1.5) +
  labs(title = "Non Fiction Books",
       y = "Book Name") +
  theme(legend.position = "none")

## Selecting by Price

ggarrange(p1, p2,
          ncol = 1, nrow = 2)

From the sorted bar charts above, we can see that:

For the Fiction category:

The Twilight Saga Collection has the highest price at $82.
Harry Potter Paperback Box Set (Books 1-7) is second at $52.
Watchmen is third at $42.

Meanwhile, for the Non-Fiction category:

Diagnostic and Statistical Manual of Mental Disorders has the highest price at $105.
Hamilton: The Revolution is second at $54.
The Book of Basketball: The NBA According to The Sports Guy is third at $53.

In conclusion, the most expensive Non-Fiction books cost more than the most expensive Fiction books.
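The top-5 charts compare only the most expensive titles. To compare typical prices across genres, a summary such as the median per genre is one option. The sketch below uses a made-up six-row data frame, so its numbers say nothing about the real dataset:

```r
# Toy prices (illustrative values only).
toy_prices <- data.frame(
  Genre = c("Fiction", "Fiction", "Fiction",
            "Non Fiction", "Non Fiction", "Non Fiction"),
  Price = c(8, 12, 82, 10, 15, 105)
)

# Median price per genre; the median is less affected by a single
# very expensive title than the mean would be.
price_summary <- aggregate(Price ~ Genre, data = toy_prices, FUN = median)
price_summary
```

Running the same aggregate() call on the book data frame would show whether the genre difference holds for typical books and not only for the priciest ones.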

g. Top 5 Books by Total Reviews (Treemap)

library(treemapify)

## Warning: package 'treemapify' was built under R version 4.2.3

book %>% 
  filter(Genre == "Fiction") %>% 
  arrange(-Reviews) %>% 
  select(Name, Author, Reviews, User.Rating) %>% 
  distinct(Name, Author, Reviews, User.Rating) %>% 
  head(5) %>% 
  
  ggplot(aes(area = Reviews, label = Name, fill = Name, subgroup = Author, subgroup2 = Reviews, subgroup3 = User.Rating)) +
  geom_treemap() + 
  geom_treemap_subgroup3_border(colour = "black", size = 3) +
  geom_treemap_subgroup_text(
    place = "topleft",
    colour = "black",
    reflow = T,
    size = 14,
    alpha = 0.8,
  ) +
  geom_treemap_subgroup2_text(
    colour = "white",
    alpha = 1,
    size = 17,
    fontface = "italic"
  ) +
  geom_treemap_subgroup3_text(
    place = "topright",
    colour = "black",
    alpha = 0.6,
    size = 14
  ) +
  geom_treemap_text(
    colour = "white", 
    place = "middle",
    size = 17,
    fontface = "bold",
    reflow = T) +
  theme(legend.position = "none")

Sorted Bar Chart Version

library(ggplot2)

book %>% 
  filter(Genre == "Fiction") %>% 
  arrange(-Reviews) %>% 
  select(Name, Author, Reviews, User.Rating) %>% 
  distinct(Name, Author, Reviews, User.Rating) %>% 
  head(5) %>% 
  
  ggplot(aes(x = reorder(Name, -Reviews), y = Reviews, fill = Name)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#EF9595", "#FF8080", "#FF90BC", "#FFC0D9", "#FF9B9B")) +
  geom_text(aes(label = Reviews), vjust = -0.2, color = "black", size = 4) +
  theme_minimal() +
  labs(x = "Book Name", y = "Reviews") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Based on the charts above, among Fiction books:

Where the Crawdads Sing by Delia Owens has the highest number of reviews, with 87,841 and a user rating of 4.8.
The Girl on the Train by Paula Hawkins is second, with 79,446 reviews and a user rating of 4.1.
Gone Girl by Gillian Flynn is third, with 57,271 reviews and a user rating of 4.0.

h. Top 5 Authors by Total Books (Bar Chart)

library(ggplot2)

book %>%
  group_by(Author) %>%
  summarise(Num_Books = n_distinct(Name)) %>%
  arrange(desc(Num_Books)) %>%
  head(5) %>%
  
  ggplot(aes(x = reorder(Author, Num_Books), y = Num_Books, fill = Author)) +
  geom_bar(stat = "identity", fill = "#FF90BC") +
  geom_text(aes(label = Num_Books), vjust = -0.2, color = "black", size = 4) + 
  theme_minimal() +
  labs(x = "Author", y = "Number of Books") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none")

Based on the bar chart above, Jeff Kinney is the author with the most distinct titles on the list, at 12 books. Rick Riordan is second with 10 books, and J. K. Rowling is third with 8.
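The chart counts distinct titles with n_distinct(Name), which can differ from the row counts in the earlier author table because the same book can appear on the list in several years. A small base R sketch of the difference, using made-up authors and titles:

```r
# Toy data: "Title 1" appears on the list in two different years.
toy_books <- data.frame(
  Author = c("Author A", "Author A", "Author A", "Author B", "Author B"),
  Name   = c("Title 1", "Title 1", "Title 2", "Title 3", "Title 4"),
  Year   = c(2009, 2010, 2011, 2009, 2010)
)

# Number of list appearances per author (what count(Author) reports).
appearances <- table(toy_books$Author)

# Number of distinct titles per author (what n_distinct(Name) reports).
distinct_titles <- tapply(toy_books$Name, toy_books$Author,
                          function(x) length(unique(x)))

appearances
distinct_titles
```

Here Author A has three appearances but only two distinct titles, which is the same kind of gap as an author with 11 rows in the table and 10 books in the chart.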

4. Correlation Analysis

a. Heatmap Chart

library(GGally)

## Warning: package 'GGally' was built under R version 4.2.3

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

# Keep only the numeric columns so ggcorr() has nothing to ignore
book %>% 
  select(where(is.numeric)) %>% 
  ggcorr(label = TRUE)

1) Price and Year

book %>% 
  ggplot(aes(Price, Year, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))

A correlation value of -0.2 indicates a weak negative relationship between Price and Year.
This means that as Year increases, Price tends to decrease.
However, the relationship is weak, so a later year does not reliably imply a lower price.

2) Price and Reviews

book %>% 
  ggplot(aes(Price, Reviews, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))

A correlation value of -0.1 indicates a very weak negative relationship between Price and Reviews.
This means that books with more reviews tend to be slightly cheaper.
The relationship is too weak to predict one variable from the other.

3) Year and Reviews

book %>% 
  ggplot(aes(Year, Reviews, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))

A correlation value of 0.3 indicates a weak positive relationship between Year and Reviews.
This means that books from later years tend to have more reviews.
Again, the relationship is weak and does not hold for every book.

4) User Rating and Reviews

book %>% 
  ggplot(aes(User.Rating, Reviews, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))

A correlation value of 0 (as rounded on the heatmap) indicates no linear relationship between User Rating and Reviews.
This means that a higher User Rating is not associated with more or fewer Reviews, and vice versa.
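A correlation near 0 rules out a linear trend only; two variables can still be related in a non-linear way. A tiny illustration with made-up numbers, where y is completely determined by x yet the correlation is zero:

```r
# y depends on x exactly, but the dependence is symmetric around zero.
x <- -2:2
y <- x^2

# Pearson correlation measures linear association only.
r <- cor(x, y)
r
```

This is why looking at the scatter plot alongside the coefficient is useful before concluding that two variables are unrelated.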

5) User Rating and Price

book %>% 
  ggplot(aes(User.Rating, Price, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))

A correlation value of -0.1 indicates a very weak negative relationship between User Rating and Price.
This means that as User Rating increases, Price tends to decrease slightly.
The relationship is too weak to predict one variable from the other.

6) Year and User Rating

book %>% 
  ggplot(aes(Year, User.Rating, col = Genre)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#FF90BC", "#FFC0D9"))

A correlation value of 0.2 indicates a weak positive relationship between Year and User Rating.
This means that as Year increases, User Rating tends to increase.
The relationship is weak and does not hold for every book.

Conclusion:

There are several weak positive and negative relationships between the variables analyzed.
It is important to remember that correlation does not imply causation.
Further analysis is needed to understand the relationships between variables in more detail.

Notes:

Correlation values range from -1 to 1.
A value of 0 indicates no linear relationship.
A value close to -1 indicates a strong negative relationship.
A value close to 1 indicates a strong positive relationship.
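The heatmap labels are correlation coefficients rounded to one decimal place. As a sketch of what sits behind them, base R's cor() returns the full matrix (Pearson by default). The data frame below holds only the first six values of each column as printed by glimpse() earlier, so its coefficients will not match the heatmap, which uses all 550 rows.

```r
# First six values of three numeric columns, copied from the glimpse() output.
toy <- data.frame(
  Price   = c(8, 22, 15, 6, 12, 11),
  Reviews = c(17350, 2052, 18979, 21424, 7665, 12643),
  Year    = c(2016, 2011, 2018, 2017, 2019, 2011)
)

# Pairwise Pearson correlations, rounded the same way as the heatmap labels.
cor_matrix <- round(cor(toy), 1)
cor_matrix
```

The matrix is symmetric with 1 on the diagonal, and every entry lies between -1 and 1, which matches the notes above.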

5. Recommendations

Marketing Team:

Non-fiction is more popular than fiction, with 56.36% compared to 43.64%. Consider promoting non-fiction books more heavily.
Rating 4.8 has the most books, while 3.3 and 3.6 have the fewest. Consider reviewing low-rated books to improve quality.
Fiction receives more reviews than non-fiction (3,764,110 versus 2,810,195). Consider encouraging reviews for non-fiction books to increase visibility.
The most expensive non-fiction books cost more than the most expensive fiction books. Consider offering discounts or promotions for non-fiction books to make them more affordable.
Jeff Kinney, Rick Riordan, and J. K. Rowling are the authors with the most books on the list. Consider promoting the works of these authors.
Non-fiction makes up the larger share of the list, and reviews tend to increase in later years (correlation of 0.3 between Year and Reviews). Consider investing in non-fiction content and marketing strategies.

Data Team:

Conduct further analysis based on genre, publication year, and others to gain more specific insights.
Use statistical tests to confirm or refute hypotheses suggested by the visualizations.
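As one possible example of such a test, a chi-squared goodness-of-fit test can check whether the split of 240 Fiction and 310 Non-Fiction entries differs from an even 50/50 split. This is a sketch under the assumption that entries are independent, which is questionable here because the same book can appear in several years.

```r
# Genre counts reported in the visualization section.
genre_counts <- c(Fiction = 240, `Non Fiction` = 310)

# With a plain vector, chisq.test() tests against equal expected proportions.
test_result <- chisq.test(genre_counts)

test_result$statistic
test_result$p.value
```

A small p-value would suggest the imbalance between genres is unlikely to be due to chance alone, subject to the independence caveat above.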

Amazon Top 50 Bestselling Books 2009 - 2019

Syifa Azzahirah

2024-02-20