knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggthemes)
books <- read.csv("bestsellers.csv")
str(books)
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : chr "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
head(books)
## Name
## 1 10-Day Green Smoothie Cleanse
## 2 11/22/63: A Novel
## 3 12 Rules for Life: An Antidote to Chaos
## 4 1984 (Signet Classics)
## 5 5,000 Awesome Facts (About Everything!) (National Geographic Kids)
## 6 A Dance with Dragons (A Song of Ice and Fire)
## Author User.Rating Reviews Price Year Genre
## 1 JJ Smith 4.7 17350 8 2016 Non Fiction
## 2 Stephen King 4.6 2052 22 2011 Fiction
## 3 Jordan B. Peterson 4.7 18979 15 2018 Non Fiction
## 4 George Orwell 4.7 21424 6 2017 Fiction
## 5 National Geographic Kids 4.8 7665 12 2019 Non Fiction
## 6 George R. R. Martin 4.4 12643 11 2011 Fiction
summary(books$User.Rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.300 4.500 4.700 4.618 4.800 4.900
summary(books$Reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37 4058 8580 11953 17253 87841
table(books$Genre)
##
## Fiction Non Fiction
## 240 310
table(books$Year)
##
## 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
## 50 50 50 50 50 50 50 50 50 50 50
summary(books$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 7.0 11.0 13.1 16.0 105.0
Novel Questions to Investigate
What is the average user rating for fiction versus non-fiction books?
Is there a relationship between the number of reviews and the price of the book?
How has the average user rating changed over the years for both fiction and non-fiction books?
genre_avg_rating <- books %>% group_by(Genre) %>% summarise(avg_rating = mean(User.Rating, na.rm = TRUE))
genre_avg_rating
## # A tibble: 2 × 2
## Genre avg_rating
## <chr> <dbl>
## 1 Fiction 4.65
## 2 Non Fiction 4.60
ggplot(books, aes(x = User.Rating)) + geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") + labs(title = "Distribution of User Rating", x = "User Rating", y = "Frequency") + theme_minimal()
From the graph, the distribution of user ratings appears to be relatively symmetrical, centered around a user rating of 4.5. This suggests that there are just as many ratings above 4.5 as there are below 4.5.
It is also interesting to note that the frequency of ratings tails off towards the ends of the graph, which means that there are fewer ratings of 3.5 and 5.0 than there are ratings of 4.0 and 4.5.
Overall, this distribution suggests that users are generally satisfied with the product or service being rated, but there is a slight bias towards positive ratings.
ggplot(books, aes(x = Reviews, fill = Genre)) + geom_density(alpha = 0.5) + labs(title = "Distribution of Reviews by Genre", x = "Number of Reviews", y = "Density") + scale_fill_manual(values = c("skyblue", "salmon")) + theme_minimal()
Fiction appears to be the most popular genre, with the highest density of reviews. Non-Fiction follows closely behind Fiction in terms of review density.
ggplot(books, aes(x = Price, y = Reviews)) + geom_point(alpha = 0.5, color = "darkblue") + geom_smooth(method = "lm", se = FALSE, color = "red") + labs(title = "Relationship between Price and Number of Reviews", x = "Price", y = "Number of Reviews") + theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
There is a positive correlation between price and number of reviews. This means that as the number of reviews increases, the price also tends to increase. The data points are somewhat scattered, which means that there is not a perfect linear relationship between price and number of of reviews. There are some books with high numbers of reviews that are relatively inexpensive, and there are some books with low numbers of reviews that are very expensive. It is important to note that correlation does not necessarily equal causation. Just because there is a positive correlation between price and number of reviews does not mean that a higher price causes a product to get more reviews.
ggplot(books, aes(x = Year, y = User.Rating, color = Genre)) + geom_line() + geom_point() + labs(title = "Average User Rating Over Years", x = "Year", y = "Average User Rating") + theme_minimal()
The average user rating for both Fiction and Non-Fiction books appears to be relatively stable over the years that are shown in the graph. The average user rating for Fiction books is slightly higher than the average user rating for Non-Fiction books throughout the entire time period. In 2010, the average user rating for Fiction books was around 4.25 and the average user rating for Non-Fiction books was around 4.1. In 2017.5, the average user rating for Fiction books was around 4.3 and the average user rating for Non-Fiction books was around 4.0. It is difficult to say for sure why the average user rating for Fiction books is higher than the average user rating for Non-Fiction books. However, there are a few possible explanations: Fiction books may be more entertaining than Non-Fiction books, which could lead to higher ratings. People may be more likely to rate Fiction books than Non-Fiction books. The types of people who read Fiction books may be more likely to give high ratings than the types of people who read Non-Fiction books.