The objective of this report is to conduct a comprehensive statistical analysis of factors influencing book prices and user ratings. Specifically, we aim to investigate the relationships between user ratings, the number of reviews, and book prices, as well as the differences in user ratings between fiction and non-fiction genres. By examining these relationships, we seek to provide actionable insights for publishers and stakeholders in the book industry to optimize pricing strategies and enhance reader satisfaction.
This report is tailored for publishers, marketers, and stakeholders in the book industry who are interested in understanding the factors influencing book prices and user ratings. It is also relevant for researchers and analysts seeking insights into consumer behavior and preferences in the literary market. Additionally, educators and students in statistics and market analysis may find value in the methodologies and findings presented in this report for academic purposes.
# View the first few rows of the data
head(book_data)
## Name
## 1 10-Day Green Smoothie Cleanse
## 2 11/22/63: A Novel
## 3 12 Rules for Life: An Antidote to Chaos
## 4 1984 (Signet Classics)
## 5 5,000 Awesome Facts (About Everything!) (National Geographic Kids)
## 6 A Dance with Dragons (A Song of Ice and Fire)
## Author User.Rating Reviews Price Year Genre
## 1 JJ Smith 4.7 17350 8 2016 Non Fiction
## 2 Stephen King 4.6 2052 22 2011 Fiction
## 3 Jordan B. Peterson 4.7 18979 15 2018 Non Fiction
## 4 George Orwell 4.7 21424 6 2017 Fiction
## 5 National Geographic Kids 4.8 7665 12 2019 Non Fiction
## 6 George R. R. Martin 4.4 12643 11 2011 Fiction
# Structure of the dataset
str(book_data)
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User.Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : int 17350 2052 18979 21424 7665 12643 19735 19699 5983 23848 ...
## $ Price : int 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : int 2016 2011 2018 2017 2019 2011 2014 2017 2018 2016 ...
## $ Genre : chr "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
# Summary statistics
summary(book_data)
## Name Author User.Rating Reviews
## Length:550 Length:550 Min. :3.300 Min. : 37
## Class :character Class :character 1st Qu.:4.500 1st Qu.: 4058
## Mode :character Mode :character Median :4.700 Median : 8580
## Mean :4.618 Mean :11953
## 3rd Qu.:4.800 3rd Qu.:17253
## Max. :4.900 Max. :87841
## Price Year Genre
## Min. : 0.0 Min. :2009 Length:550
## 1st Qu.: 7.0 1st Qu.:2011 Class :character
## Median : 11.0 Median :2014 Mode :character
## Mean : 13.1 Mean :2014
## 3rd Qu.: 16.0 3rd Qu.:2017
## Max. :105.0 Max. :2019
# Check for missing values
colSums(is.na(book_data))
## Name Author User.Rating Reviews Price Year
## 0 0 0 0 0 0
## Genre
## 0
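The `is.na()` check only catches missing entries. Since the summary above reports a minimum price of 0, it may also be worth flagging rows whose price is present but not positive before interpreting prices. A minimal sketch, assuming a hypothetical helper `flag_nonpositive_prices()` and a made-up toy data frame `toy_books` (neither is part of the original analysis):

```r
# Hypothetical toy data: the titles and the zero-priced row are illustrative,
# not taken from book_data.
toy_books <- data.frame(
  Name  = c("Book A", "Book B", "Book C", "Book D"),
  Price = c(8, 0, 15, NA),
  stringsAsFactors = FALSE
)

# Return rows where the price is present but not positive.
# is.na() alone would not flag these rows.
flag_nonpositive_prices <- function(df) {
  df[!is.na(df$Price) & df$Price <= 0, , drop = FALSE]
}

suspicious <- flag_nonpositive_prices(toy_books)
print(suspicious)
```

On the real data the same helper would be called as `flag_nonpositive_prices(book_data)`.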
# Histogram of user ratings
hist(book_data$User.Rating,
main = "Distribution of User Ratings",
xlab = "User Rating",
col = "cyan")
# Box plot of prices
boxplot(book_data$Price,
main = "Distribution of Book Prices",
ylab = "Price",
col = "magenta")
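The user rating histogram is left-skewed (negatively skewed): at least three quarters of the books are rated 4.5 or higher, while a thin tail of lower ratings extends down to 3.3. Consistent with this, the mean rating (4.618) lies below the median (4.7).
The outliers in the price box plot show that a small number of books are priced far above the rest: the median price is 11, while the maximum is 105.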
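To make "outlier" concrete, the usual box plot convention of 1.5 times the interquartile range can be worked through with the quartiles reported by `summary(book_data)`. This is a rough sketch: `boxplot()` builds its box from Tukey's hinges, which can differ slightly from the quartiles printed by `summary()`, so the fences below are approximate.

```r
# Quartiles of Price as reported by summary(book_data)
q1 <- 7
q3 <- 16

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # points above this are drawn as outliers
lower_fence <- q1 - 1.5 * iqr   # points below this are drawn as outliers

cat("IQR:", iqr, "\n")
cat("Upper fence:", upper_fence, "\n")
cat("Lower fence:", lower_fence, "\n")
```

Under this rule, books priced above roughly 29.5 are plotted as individual outlier points. The lower fence is negative, so no low-priced book is flagged.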
plot(book_data$Reviews, book_data$Price,
main = "Reviews vs Price",
xlab = "Number of Reviews", ylab = "Price",
col = "Purple")
The correlation of -0.1091819 between the number of reviews and price indicates a weak negative linear relationship.
Correlation alone does not establish a cause, but a weak negative correlation between the number of reviews and price could mean that:
Books with lower prices tend to receive slightly more reviews, possibly due to being more accessible to a wider audience.
Books with higher prices, which may be perceived as more niche or specialized, tend to receive slightly fewer reviews.
Other factors, such as genre, author popularity, or marketing efforts, may have a stronger influence on the number of reviews than the price alone.
# Correlation between number of reviews and price
cor(book_data$Reviews, book_data$Price, use = "complete.obs")
## [1] -0.1091819
plot(book_data$Price, book_data$User.Rating,
main = "User Rating vs Price",
xlab = "Price", ylab = "User Rating",
col = "Orange")
# Correlation between user rating and price
cor(book_data$User.Rating, book_data$Price, use = "complete.obs")
## [1] -0.1330863
The weak negative correlation of -0.1330863 between user rating and price shows that higher-rated best sellers are not priced higher; if anything, they are slightly cheaper.
One possible explanation is that publishers and retailers prioritize factors other than user ratings when setting prices for best-selling books.
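One way to gauge how weak these two correlations are is to square them: the squared Pearson correlation is the share of variance that a straight-line fit between the two variables would account for. A small sketch using the coefficients reported above (the variable names are illustrative):

```r
# Correlation coefficients as reported by cor() above
r_reviews_price <- -0.1091819
r_rating_price  <- -0.1330863

# Squared correlation: share of variance in one variable that a
# straight-line fit on the other would account for
r2_reviews_price <- r_reviews_price^2
r2_rating_price  <- r_rating_price^2

cat(sprintf("Reviews vs Price:     r^2 = %.4f\n", r2_reviews_price))
cat(sprintf("User Rating vs Price: r^2 = %.4f\n", r2_rating_price))
```

Both values are below 2%, so neither the review count nor the rating accounts for much of the variation in price.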
The exploratory look at the best-selling books' user ratings, review counts, and prices suggests weak negative relationships between price and both user ratings and the number of reviews. This raises three questions, which the hypothesis tests below address: Are user ratings related to the price of a book? Is the number of reviews related to the price of a book? Do user ratings differ between fiction and non-fiction books?
Null Hypothesis (H0): There is no significant relationship between user ratings and the prices of books.
Alternative Hypothesis (H1): There is a significant relationship between user ratings and the prices of books.
This hypothesis aims to explore the relationship between user ratings and the prices of books.
# Hypothesis 1: User Ratings affect the prices of the book
correlation_test <- cor.test(book_data$User.Rating, book_data$Price)
print(correlation_test)
##
## Pearson's product-moment correlation
##
## data: book_data$User.Rating and book_data$Price
## t = -3.1434, df = 548, p-value = 0.00176
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.21430809 -0.05003665
## sample estimates:
## cor
## -0.1330863
The Pearson’s correlation test yielded a statistically significant negative correlation (correlation coefficient = -0.133, p-value = 0.00176) between user ratings and prices. This suggests that there is a weak negative linear relationship between user ratings and book prices. In other words, as user ratings increase, the prices of books tend to slightly decrease. However, the correlation coefficient indicates that the strength of this relationship is modest.
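The test statistic in the output can be reproduced by hand, which shows how a small correlation becomes statistically significant with 550 observations. The sketch below uses the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2) and takes `r` and `df` from the output above.

```r
# Values reported by cor.test(book_data$User.Rating, book_data$Price)
r  <- -0.1330863
df <- 548          # n - 2, with n = 550 books

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

cat(sprintf("t = %.4f, p = %.5f\n", t_stat, p_value))
```

Because the statistic scales with sqrt(n - 2), the same coefficient would be much less significant in a smaller sample. The p-value therefore reflects sample size as well as the strength of the relationship.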
# Scatter plot of User Ratings vs Price
plot(book_data$User.Rating, book_data$Price,
xlab = "User Rating", ylab = "Price",
main = "User Ratings vs Price of Books",
col = "blue", pch = 16)
The scatter plot illustrates the distribution of books based on their user ratings and prices. While there is a slight downward trend, indicating a negative correlation, the spread of data points suggests that other factors may also influence book prices.
The analysis indicates a statistically significant negative correlation between user ratings and book prices (correlation coefficient = -0.133, p-value = 0.00176). We therefore reject the null hypothesis: higher user ratings are associated with slightly lower book prices, although the association is weak and does not by itself show that ratings drive prices.
Null Hypothesis (H0): There is no significant relationship between the number of reviews and the prices of books.
Alternative Hypothesis (H1): There is a significant relationship between the number of reviews and the prices of books.
This hypothesis examines whether the number of reviews influences the prices of books.
# Hypothesis 2: Number of reviews affects the price of the book
correlation_test_reviews <- cor.test(book_data$Reviews, book_data$Price)
print(correlation_test_reviews)
##
## Pearson's product-moment correlation
##
## data: book_data$Reviews and book_data$Price
## t = -2.5713, df = 548, p-value = 0.0104
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.19104436 -0.02581111
## sample estimates:
## cor
## -0.1091819
The Pearson’s correlation test revealed a statistically significant negative correlation (correlation coefficient = -0.109, p-value = 0.0104) between the number of reviews and book prices. This indicates a weak negative linear relationship between the two variables, suggesting that books with higher numbers of reviews tend to have slightly lower prices. However, the correlation coefficient suggests that this relationship is modest.
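Pearson's correlation can be sensitive to extreme values, and the review counts are strongly skewed (median 8,580, maximum 87,841). A rank-based check such as Spearman's correlation is one possible robustness test. It is not part of the original analysis. The sketch below runs on only the six rows shown in the `head()` output, in a data frame named `sample_books` for illustration, so its result says nothing about the full dataset.

```r
# First six rows of book_data, copied from the head() output above.
# Far too few rows to draw conclusions; used only to keep the sketch runnable.
sample_books <- data.frame(
  Reviews = c(17350, 2052, 18979, 21424, 7665, 12643),
  Price   = c(8, 22, 15, 6, 12, 11)
)

# Spearman's rank correlation works on ranks, so extreme review counts
# carry no more weight than their position in the ordering.
spearman_test <- cor.test(sample_books$Reviews, sample_books$Price,
                          method = "spearman")
print(spearman_test)
```

On the full data the call would be `cor.test(book_data$Reviews, book_data$Price, method = "spearman", exact = FALSE)`. Setting `exact = FALSE` requests the approximate p-value, which is the appropriate choice when the data contain ties.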
# Scatter plot of Reviews vs Price
plot(book_data$Reviews, book_data$Price,
xlab = "Reviews", ylab = "Price",
main = "Number of Reviews vs Price of Books",
col = "red", pch = 16)
The scatter plot displays the distribution of books based on the number of reviews and prices. While there is a slight downward trend, indicating a negative correlation, the data points are widely spread, indicating that other factors may also influence book prices.
A statistically significant negative correlation is observed between the number of reviews and book prices (correlation coefficient = -0.109, p-value = 0.0104). We therefore reject the null hypothesis: books with more reviews tend to have slightly lower prices, although the association is weak.
Null Hypothesis (H0): There is no difference in mean user ratings between fiction and non-fiction books.
Alternative Hypothesis (H1): There is a difference in mean user ratings between fiction and non-fiction books.
This hypothesis investigates whether there is a difference in user ratings between fiction and non-fiction books.
# Hypothesis 3: Fiction vs Non-Fiction user ratings
fiction_ratings <- book_data$User.Rating[book_data$Genre == "Fiction"]
non_fiction_ratings <- book_data$User.Rating[book_data$Genre == "Non Fiction"]
t_test_rating_genre <- t.test(fiction_ratings, non_fiction_ratings)
print(t_test_rating_genre)
##
## Welch Two Sample t-test
##
## data: fiction_ratings and non_fiction_ratings
## t = 2.6299, df = 415.29, p-value = 0.008859
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01342894 0.09291515
## sample estimates:
## mean of x mean of y
## 4.648333 4.595161
The Welch’s t-test revealed a statistically significant difference in mean user ratings between fiction and non-fiction books (t = 2.63, p-value = 0.00886). The mean user rating for fiction books (mean = 4.65) is slightly higher than that for non-fiction books (mean = 4.60), with a 95% confidence interval for the difference in means of 0.013 to 0.093.
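The reported confidence interval can be reconstructed from the other numbers in the output, which makes the size of the effect easier to see. The sketch below backs out the standard error from t = difference / SE and then applies the usual interval formula. All inputs come from the output above.

```r
# Values reported by the Welch t-test output above
mean_fiction     <- 4.648333
mean_non_fiction <- 4.595161
t_stat <- 2.6299
df     <- 415.29

diff_means <- mean_fiction - mean_non_fiction  # observed difference in means
std_error  <- diff_means / t_stat              # since t = difference / SE

# 95% confidence interval: difference +/- critical t value * SE
t_crit <- qt(0.975, df)
ci <- diff_means + c(-1, 1) * t_crit * std_error

cat(sprintf("difference = %.4f, SE = %.4f\n", diff_means, std_error))
cat(sprintf("95%% CI: [%.4f, %.4f]\n", ci[1], ci[2]))
```

The difference in means is about 0.05 rating points on a scale where most books score between 4.5 and 4.9, so the effect is statistically detectable but small.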
# Boxplot of User Ratings by Genre
boxplot(User.Rating ~ Genre, data = book_data,
xlab = "Genre", ylab = "User Rating",
main = "User Ratings by Genre",
col = c("red", "blue"))
The boxplot visually compares the distribution of user ratings between fiction and non-fiction books. The slightly higher median and upper quartile for fiction books suggest that, on average, fiction books tend to have slightly higher user ratings compared to non-fiction books.
The Welch’s t-test reveals a significant difference in mean user ratings between fiction and non-fiction books (t = 2.63, p-value = 0.00886), so we reject the null hypothesis. On average, fiction books are rated slightly higher than non-fiction books, by about 0.05 rating points.
The statistical analysis provides insights into the factors associated with book prices and user ratings. User ratings and the number of reviews are each weakly, though statistically significantly, negatively correlated with price (r = -0.133 and r = -0.109, respectively), and fiction books receive slightly higher user ratings than non-fiction books (means of 4.65 and 4.60). These are observational associations with small effect sizes, so they can inform, but not by themselves determine, pricing decisions in the publishing industry. Further research may explore additional variables and their impact on book sales and consumer behavior.