This dataset contains information about best-selling books, including their names, authors, user ratings, reviews, prices, years of publication, and genres.

library(readxl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(ggplot2)
Bestsellers <- read_excel("Bestsellers.xlsx")
# View(Bestsellers)  # interactive viewer; uncomment when exploring in RStudio
bestsellers <- Bestsellers

Data Cleaning

Data cleaning ensures that the dataset is free of errors and missing values that could distort the analysis. Here, I count the missing values and remove any rows that contain them. The count is very large because the imported sheet includes more than a million empty rows beyond the data; after removal, 360 books remain. I also convert the categorical variables (Genre and Author) to factors to facilitate analysis.

sum(is.na(bestsellers))

## [1] 7337497

bestsellers <- na.omit(bestsellers)

bestsellers$Genre <- as.factor(bestsellers$Genre)
bestsellers$Author <- as.factor(bestsellers$Author)

str(bestsellers)

## tibble [360 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Name      : chr [1:360] "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author    : Factor w/ 248 levels "Abraham Verghese",..: 125 220 135 96 175 97 97 13 115 90 ...
##  $ UserRating: num [1:360] 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews   : num [1:360] 17350 2052 18979 21424 7665 ...
##  $ Price     : num [1:360] 8 22 15 6 12 11 30 15 3 8 ...
##  $ Year      : num [1:360] 2016 2011 2018 2017 2019 ...
##  $ Genre     : Factor w/ 2 levels "Fiction","Non Fiction": 2 1 2 1 2 1 1 1 2 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:1048214] 361 362 363 364 365 366 367 368 369 370 ...
##   ..- attr(*, "names")= chr [1:1048214] "361" "362" "363" "364" ...
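Before dropping rows, it can help to check whether the missing values come from empty padding rows or from real records with gaps. The following is a minimal base R sketch on a toy data frame; the `toy` values and the variable names are made up for illustration and are not taken from the spreadsheet.

```r
# Toy stand-in for the imported sheet (hypothetical values, not the real data)
toy <- data.frame(
  Name  = c("Book A", "Book B", NA, NA, "Book C"),
  Price = c(8, NA, NA, NA, 15),
  Genre = c("Fiction", "Non Fiction", NA, NA, "Fiction"),
  stringsAsFactors = FALSE
)

# Rows where every column is NA are padding, not real records
all_na <- rowSums(is.na(toy)) == ncol(toy)

# Rows with some, but not all, values missing are genuinely incomplete records
some_na <- rowSums(is.na(toy)) > 0 & !all_na

n_empty   <- sum(all_na)
n_partial <- sum(some_na)

# Drop only the padding rows so incomplete records can still be inspected
toy_trimmed <- toy[!all_na, ]
```

If `n_partial` is zero on the real data, `na.omit()` removes only padding; if it is not, the incomplete books deserve a look before they are discarded.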

Exploratory Data Analysis (EDA)

Exploratory Data Analysis helps us understand the basic structure of our dataset and identify any patterns, trends, or anomalies. This step provides a foundation for our predictive modeling.

Summary statistics

Summary statistics give us an overview of our dataset: the minimum, quartiles, median, mean, and maximum for numerical variables, and counts for categorical variables.

summary(bestsellers)

##      Name                       Author      UserRating      Reviews     
##  Length:360         Jeff Kinney    : 12   Min.   :3.30   Min.   :   37  
##  Class :character   Rick Riordan   : 10   1st Qu.:4.50   1st Qu.: 3487  
##  Mode  :character   Stephenie Meyer:  7   Median :4.70   Median : 6634  
##                     Bill O'Reilly  :  6   Mean   :4.61   Mean   :10153  
##                     Dav Pilkey     :  6   3rd Qu.:4.80   3rd Qu.:11830  
##                     J.K. Rowling   :  6   Max.   :4.90   Max.   :87841  
##                     (Other)        :313                                 
##      Price             Year              Genre    
##  Min.   :  0.00   Min.   :2009   Fiction    :165  
##  1st Qu.:  8.00   1st Qu.:2011   Non Fiction:195  
##  Median : 11.00   Median :2013                    
##  Mean   : 12.98   Mean   :2014                    
##  3rd Qu.: 16.00   3rd Qu.:2016                    
##  Max.   :105.00   Max.   :2019                    
##
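Measures of spread such as the standard deviation can be computed separately for the numeric columns. The sketch below uses only the first five rows visible in the `str()` output, so the numbers it produces describe those five books and not the full dataset; `toy` and `spread` are illustrative names.

```r
# First five rows of the numeric columns, as printed by str() above
toy <- data.frame(
  UserRating = c(4.7, 4.6, 4.7, 4.7, 4.8),
  Reviews    = c(17350, 2052, 18979, 21424, 7665),
  Price      = c(8, 22, 15, 6, 12)
)

# Keep numeric columns only, then compute mean and standard deviation for each
num_cols <- sapply(toy, is.numeric)
spread <- sapply(toy[num_cols], function(x) c(mean = mean(x), sd = sd(x)))
```

On the full data, the same `sapply()` call could be applied to `bestsellers` in place of `toy`.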

Distribution of User Ratings

Visualizing the distribution of user ratings helps us understand how ratings are spread across the books in our dataset. This can provide insights into common rating patterns and their potential impact on book popularity.

ggplot(bestsellers, aes(x = UserRating)) +
  geom_histogram(binwidth = 0.1, fill = "darkblue", color = "black") +
  labs(title = "Distribution of User Ratings", x = "User Rating", y = "Count")

User Ratings vs. Reviews

Exploring the relationship between user ratings and the number of reviews shows whether highly rated books tend to receive more reviews. Because the regression later in this report uses rating as a predictor of review counts, the direction and strength of this relationship matter.

ggplot(bestsellers, aes(x = UserRating, y = Reviews)) +
  geom_point(color = "pink") +
  geom_smooth(method = "lm", color = "darkblue") +
  labs(title = "User Rating vs Reviews", x = "User Rating", y = "Number of Reviews")

## `geom_smooth()` using formula = 'y ~ x'
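A correlation coefficient puts a number on the trend line in the plot. The sketch below computes Pearson and Spearman correlations on the first five rows shown in the `str()` output, which is far too few for any conclusion; it only illustrates the calls. Spearman is included because the review counts are right-skewed (the mean of 10,153 sits well above the median of 6,634), and a rank-based measure is less sensitive to a few very large values.

```r
# First five rows from the str() output above; illustrative only
rating  <- c(4.7, 4.6, 4.7, 4.7, 4.8)
reviews <- c(17350, 2052, 18979, 21424, 7665)

# Linear association
pearson <- cor(rating, reviews, method = "pearson")

# Rank-based association, more robust to skewed review counts
spearman <- cor(rating, reviews, method = "spearman")
```

On the full data, the two vectors would be `bestsellers$UserRating` and `bestsellers$Reviews`.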

Genre Distribution

Understanding the genre distribution shows which of the two genres, Fiction or Non Fiction, is more common among bestsellers. Here, I use both a bar graph and a pie chart to visualize this distribution.

ggplot(bestsellers, aes(x = Genre)) +
  geom_bar(fill = "lightblue", color = "black") +
  labs(title = "Genre Distribution", x = "Genre", y = "Count")

genre_count <- bestsellers %>% 
  count(Genre) %>% 
  mutate(percentage = n / sum(n) * 100)

ggplot(genre_count, aes(x = "", y = percentage, fill = Genre)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Genre Distribution", x = "", y = "")
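The percentages behind the pie chart can be checked by hand from the genre counts in the summary table (165 Fiction, 195 Non Fiction). This is a small base R sketch of the same arithmetic as the `mutate()` step; `genre_counts` and `genre_pct` are illustrative names.

```r
# Genre counts as reported by summary(bestsellers)
genre_counts <- c("Fiction" = 165, "Non Fiction" = 195)

# Share of each genre as a percentage of all 360 books, rounded to one decimal
genre_pct <- round(genre_counts / sum(genre_counts) * 100, 1)
```

This gives roughly 45.8% Fiction and 54.2% Non Fiction.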

Author Distribution

Examining the distribution of authors helps identify which authors have the most bestsellers. I focused on the top 10 authors to keep the analysis concise.

# Bar graph for author distribution (top 10 authors)
top_authors <- bestsellers %>%
  count(Author) %>%
  slice_max(n, n = 10) %>%   # top_n() is superseded; ties at the cutoff are kept
  arrange(desc(n))

ggplot(top_authors, aes(x = reorder(Author, n), y = n)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  coord_flip() +
  labs(title = "Top 10 Authors", x = "Author", y = "Count")
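A top-N selection can return more than N rows when several authors are tied at the cutoff, which the summary table suggests is the case here (three authors have 6 books each). The base R sketch below shows the difference between keeping ties and cutting strictly at N, using the six author counts visible in the summary output and a hypothetical cutoff `k` of 4.

```r
# Author counts as shown in summary(bestsellers)
counts <- c("Jeff Kinney" = 12, "Rick Riordan" = 10, "Stephenie Meyer" = 7,
            "Bill O'Reilly" = 6, "Dav Pilkey" = 6, "J.K. Rowling" = 6)

k <- 4  # hypothetical cutoff, chosen to land inside the tie

sorted <- sort(counts, decreasing = TRUE)

# Keep everyone whose count is at least the k-th largest count (ties included)
with_ties <- sorted[sorted >= sorted[k]]

# Keep exactly k authors, breaking the tie arbitrarily by position
strict <- head(sorted, k)
```

Here `with_ties` has six entries while `strict` has four, so a chart titled "Top 10" may show more than ten bars unless ties are dropped deliberately.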

Predictive Modeling: Linear Regression

Predictive modeling allows us to make forecasts based on our data. Here, I use linear regression to predict the number of reviews a book might receive based on its user rating, price, year of publication, and genre.

To fit the regression, I split the dataset into a training set (80%) and a testing set (20%), fitting the model on the first and measuring its performance on the second. Evaluating on held-out data does not prevent overfitting by itself, but it shows how well the model generalizes to books it has not seen.

set.seed(123)
trainIndex <- createDataPartition(bestsellers$Reviews, p = 0.8, list = FALSE)
trainData <- bestsellers[trainIndex,]
testData <- bestsellers[-trainIndex,]
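For comparison, an 80/20 split can also be sketched in base R with `sample()`. This is a plain random split and not a reproduction of `createDataPartition()`, which, as I understand it, samples within quantile groups of a numeric outcome so that both sets have a similar distribution of review counts. The variable names are illustrative.

```r
set.seed(123)

n <- 360  # number of books after cleaning

# Randomly choose 80% of the row indices for training
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

# The remaining indices form the test set
test_idx <- setdiff(seq_len(n), train_idx)
```

This yields 288 training rows and 72 test rows, the same sizes as the split used in this report (the model summary reports 283 residual degrees of freedom with 5 coefficients).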

I built a linear regression model to predict the number of reviews based on user rating, price, year of publication, and genre. The model summary reports each coefficient's estimate, standard error, and significance, along with the overall fit of the model.

model <- lm(Reviews ~ UserRating + Price + Year + Genre, data = trainData)

summary(model)

## 
## Call:
## lm(formula = Reviews ~ UserRating + Price + Year + Genre, data = trainData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14113  -5925  -2629   2261  69092 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.074e+06  3.985e+05  -5.204 3.74e-07 ***
## UserRating       -5.858e+03  2.878e+03  -2.035   0.0428 *  
## Price             5.321e+01  6.198e+01   0.858   0.3914    
## Year              1.050e+03  1.992e+02   5.271 2.69e-07 ***
## GenreNon Fiction -6.974e+03  1.276e+03  -5.464 1.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10730 on 283 degrees of freedom
## Multiple R-squared:  0.1589, Adjusted R-squared:  0.147 
## F-statistic: 13.37 on 4 and 283 DF,  p-value: 5.435e-10
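To see how the coefficients combine into a prediction, the fitted equation can be evaluated by hand. The sketch below uses the coefficients as rounded in the summary output and a hypothetical Fiction title rated 4.6, priced at 22, and published in 2011; `coefs`, `book`, and `predicted_reviews` are illustrative names.

```r
# Coefficients as rounded in the summary output above
coefs <- c(intercept = -2.074e6, UserRating = -5.858e3,
           Price = 5.321e1, Year = 1.050e3, GenreNonFiction = -6.974e3)

# Hypothetical input: Fiction (dummy = 0), rating 4.6, price 22, year 2011
book <- c(intercept = 1, UserRating = 4.6,
          Price = 22, Year = 2011, GenreNonFiction = 0)

# Linear prediction: sum of coefficient times predictor value
predicted_reviews <- sum(coefs * book)
```

The result is about 11,774 reviews. Because the Year coefficient is multiplied by a value near 2,000, small rounding in the printed coefficients shifts the result noticeably, so `predict()` on the fitted model will give a somewhat different figure for the same inputs.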

predictions <- predict(model, testData)

results <- data.frame(Actual = testData$Reviews, Predicted = predictions)
head(results)

##   Actual Predicted
## 1   2052 11892.874
## 2   4149  5450.976
## 3   9198  8518.690
## 4  12159 15271.393
## 5   1296  4903.827
## 6    615  3229.920

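I evaluated the model on the test set using RMSE (Root Mean Squared Error), R-squared, and MAE (Mean Absolute Error). RMSE and MAE measure the typical size of the prediction error in number of reviews, while R-squared measures how much of the variation in review counts the predictions capture.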

postResample(predictions, testData$Reviews)

##         RMSE     Rsquared          MAE 
## 9278.3000239    0.1402753 6408.1699702
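The three metrics can be computed by hand to make their definitions concrete. The sketch below uses only the six rows printed by `head(results)`, so it will not reproduce the full test-set values above. It assumes, as I understand `postResample()` to work, that R-squared is the squared correlation between predicted and actual values.

```r
# Six rows from head(results) above; not the full test set
actual    <- c(2052, 4149, 9198, 12159, 1296, 615)
predicted <- c(11892.874, 5450.976, 8518.690, 15271.393, 4903.827, 3229.920)

errors <- actual - predicted

rmse <- sqrt(mean(errors^2))       # penalizes large misses more heavily
mae  <- mean(abs(errors))          # average absolute miss, in reviews
r2   <- cor(actual, predicted)^2   # squared correlation of predictions and actuals
```

RMSE is always at least as large as MAE; a wide gap between the two indicates that a few large errors dominate.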

Actual vs. Predicted Reviews

Visualizing the actual vs. predicted reviews allows us to see how well the model’s predictions align with the actual data. A scatter plot with a diagonal line (indicating perfect predictions) helps in this comparison.

ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = 'blue', linetype = 'dashed') +
  labs(title = "Actual vs Predicted Reviews", x = "Actual Reviews", y = "Predicted Reviews")

Conclusion

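The insights gained from this analysis can help a large bookstore retailer make informed decisions about inventory and marketing. The regression explains only about 16% of the variance in review counts (R-squared of 0.159 on the training data and 0.140 on the test data), so the points below are directional rather than precise. With that caveat, understanding which factors are associated with review counts can help the retailer: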

Optimize Inventory: Stock more books from genres that are highly rated and have a higher likelihood of receiving more reviews.
Targeted Marketing: Focus marketing efforts on books with higher predicted reviews, as these are likely to generate more customer interest and sales.
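Pricing Strategies: Consider price alongside other evidence when setting prices for different genres; in this model the price coefficient was not statistically significant (p = 0.39), so its effect on review counts appears weak.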

Future Work

While this analysis provides valuable insights, there are several areas for future work:

Incorporate Additional Features: Including more variables, such as promotional activities or author popularity, could improve the model’s accuracy.
Advanced Modeling Techniques: Exploring more flexible predictive models, such as decision trees or other machine learning algorithms, may yield better predictions.
Longitudinal Analysis: Examining trends over time could provide deeper insights into how book ratings and reviews evolve.

Bestselling Books

Brandy Nichols, PhD

2024-08-03