This dataset contains information about best-selling books, including their names, authors, user ratings, reviews, prices, years of publication, and genres.
library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ggplot2)
Bestsellers <- read_excel("Bestsellers.xlsx")
View(Bestsellers)
bestsellers <- Bestsellers
Data cleaning is essential to ensure that the dataset is free of errors and missing values, which could negatively impact the analysis. Here, I check for missing values and remove any rows that contain them. I also convert categorical variables (Genre and Author) to factors to facilitate analysis.
sum(is.na(bestsellers))
## [1] 7337497
bestsellers <- na.omit(bestsellers)
bestsellers$Genre <- as.factor(bestsellers$Genre)
bestsellers$Author <- as.factor(bestsellers$Author)
str(bestsellers)
## tibble [360 × 7] (S3: tbl_df/tbl/data.frame)
## $ Name : chr [1:360] "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : Factor w/ 248 levels "Abraham Verghese",..: 125 220 135 96 175 97 97 13 115 90 ...
## $ UserRating: num [1:360] 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : num [1:360] 17350 2052 18979 21424 7665 ...
## $ Price : num [1:360] 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : num [1:360] 2016 2011 2018 2017 2019 ...
## $ Genre : Factor w/ 2 levels "Fiction","Non Fiction": 2 1 2 1 2 1 1 1 2 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:1048214] 361 362 363 364 365 366 367 368 369 370 ...
## ..- attr(*, "names")= chr [1:1048214] "361" "362" "363" "364" ...
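The NA total above is far larger than a 360-row table could produce. The `na.action` attribute shows that `na.omit()` dropped 1,048,214 rows, which suggests the imported sheet range included a very long run of empty rows. The following is a minimal sketch, on a made-up data frame (`toy` and its values are hypothetical, not taken from the dataset), of counting missing values per column and dropping only the rows that are entirely empty:

```r
# Hypothetical toy data standing in for a sheet with trailing empty rows
toy <- data.frame(
  Name    = c("Book A", "Book B", NA, NA),
  Reviews = c(120, NA, NA, NA),
  Price   = c(8, 15, NA, NA),
  stringsAsFactors = FALSE
)

# Missing values per column, rather than one grand total
na_per_column <- colSums(is.na(toy))

# TRUE for rows in which every column is NA (typically spreadsheet padding)
empty_row <- rowSums(is.na(toy)) == ncol(toy)

# Drop only the fully empty rows; partially missing rows stay for inspection
toy_trimmed <- toy[!empty_row, ]

print(na_per_column)
print(nrow(toy_trimmed))
```

Unlike `na.omit()`, this keeps rows that are only partially missing, so they can be reviewed before deciding whether to drop them.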
Exploratory Data Analysis helps us understand the basic structure of our dataset and identify any patterns, trends, or anomalies. This step provides a foundation for our predictive modeling.
Summary statistics give us an overview of our dataset: the minimum, quartiles, median, mean, and maximum for numerical variables, and counts for categorical variables. Note that summary() does not report the standard deviation.
summary(bestsellers)
## Name Author UserRating Reviews
## Length:360 Jeff Kinney : 12 Min. :3.30 Min. : 37
## Class :character Rick Riordan : 10 1st Qu.:4.50 1st Qu.: 3487
## Mode :character Stephenie Meyer: 7 Median :4.70 Median : 6634
## Bill O'Reilly : 6 Mean :4.61 Mean :10153
## Dav Pilkey : 6 3rd Qu.:4.80 3rd Qu.:11830
## J.K. Rowling : 6 Max. :4.90 Max. :87841
## (Other) :313
## Price Year Genre
## Min. : 0.00 Min. :2009 Fiction :165
## 1st Qu.: 8.00 1st Qu.:2011 Non Fiction:195
## Median : 11.00 Median :2013
## Mean : 12.98 Mean :2014
## 3rd Qu.: 16.00 3rd Qu.:2016
## Max. :105.00 Max. :2019
##
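Because the `summary()` output contains no measure of spread beyond the quartiles, the standard deviation has to be computed separately. A minimal sketch, assuming a small made-up data frame whose column names mirror the dataset (the values in `toy` are hypothetical):

```r
# Hypothetical toy data; column names mirror the dataset, values are made up
toy <- data.frame(
  UserRating = c(4.7, 4.6, 4.8, 4.4),
  Reviews    = c(17350, 2052, 7665, 12643),
  Price      = c(8, 22, 12, 11)
)

# Select numeric columns, then compute two spread measures for each
numeric_cols <- sapply(toy, is.numeric)
spread <- sapply(toy[numeric_cols], function(x) c(sd = sd(x), IQR = IQR(x)))

# Result is a matrix: one row per statistic, one column per variable
print(round(spread, 2))
```

The same call should work on `bestsellers` in place of `toy`, since non-numeric columns are filtered out first.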
Visualizing the distribution of user ratings helps us understand how ratings are spread across the books in our dataset. This can provide insights into common rating patterns and their potential impact on book popularity.
ggplot(bestsellers, aes(x = UserRating)) +
  geom_histogram(binwidth = 0.1, fill = "darkblue", color = "black") +
  labs(title = "Distribution of User Ratings", x = "User Rating", y = "Count")
Exploring the relationship between user ratings and the number of reviews helps us understand whether highly rated books tend to receive more reviews. This matters for the predictive model, which uses user rating as one of its predictors of review count.
ggplot(bestsellers, aes(x = UserRating, y = Reviews)) +
  geom_point(color = "pink") +
  geom_smooth(method = "lm", color = "darkblue") +
  labs(title = "User Rating vs Reviews", x = "User Rating", y = "Number of Reviews")
## `geom_smooth()` using formula = 'y ~ x'
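The scatter plot can be complemented with a numeric measure of association. A minimal sketch using base R's `cor()`, on made-up values (`toy` is hypothetical and only mirrors the column names of the dataset):

```r
# Hypothetical toy data; values are illustrative, not the real dataset
toy <- data.frame(
  UserRating = c(4.7, 4.6, 4.7, 4.7, 4.8, 4.4),
  Reviews    = c(17350, 2052, 18979, 21424, 7665, 12643)
)

# Pearson correlation measures linear association
pearson <- cor(toy$UserRating, toy$Reviews)

# Spearman correlation is rank-based, so it is less sensitive to the
# heavy right skew that review counts usually show
spearman <- cor(toy$UserRating, toy$Reviews, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Reporting both is a judgment call: when review counts have a long right tail, the two coefficients can differ noticeably.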
## Genre Distribution
Understanding the genre distribution helps determine which genres are most popular among bestsellers. Here, I use both a bar graph and a pie chart to visualize this distribution.
ggplot(bestsellers, aes(x = Genre)) +
  geom_bar(fill = "lightblue", color = "black") +
  labs(title = "Genre Distribution", x = "Genre", y = "Count")
genre_count <- bestsellers %>%
  count(Genre) %>%
  mutate(percentage = n / sum(n) * 100)
ggplot(genre_count, aes(x = "", y = percentage, fill = Genre)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Genre Distribution", x = "", y = "")
Predictive modeling allows us to make forecasts based on our data. Here, we use linear regression to predict the number of reviews a book might receive based on its user rating, price, year of publication, and genre.
To conduct the regression, I split the dataset into training and testing sets so that the model is trained on one portion of the data and evaluated on another. Evaluating on held-out data helps detect overfitting and gives a more honest estimate of how well the model generalizes to new data.
set.seed(123)
trainIndex <- createDataPartition(bestsellers$Reviews, p = 0.8, list = FALSE)
trainData <- bestsellers[trainIndex,]
testData <- bestsellers[-trainIndex,]
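For a numeric outcome, `createDataPartition()` samples within groups based on percentiles of the outcome, so the training and test sets have similar distributions of `Reviews`. The sketch below is a simpler technique, a plain random 80/20 split in base R, shown only to illustrate the mechanics of the indexing. The data frame `toy` and its values are hypothetical.

```r
set.seed(123)

# Hypothetical toy data with an id column to make the split easy to verify
toy <- data.frame(
  id      = 1:10,
  Reviews = c(120, 450, 980, 1500, 2300, 4100, 6600, 9800, 15000, 40000)
)

# Plain random split: 80% of row indices go to training
n_train   <- floor(0.8 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Sanity checks: the two sets cover every row and do not overlap
stopifnot(nrow(train) + nrow(test) == nrow(toy))
stopifnot(length(intersect(train$id, test$id)) == 0)

print(c(train = nrow(train), test = nrow(test)))
```

The same two sanity checks can be applied to `trainData` and `testData`; the 283 residual degrees of freedom and five coefficients in the model summary imply 288 training rows, which leaves 72 of the 360 for testing.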
I built a linear regression model to predict the number of reviews based on user rating, price, year of publication, and genre. The summary of the model provides information about the relationship between these variables.
model <- lm(Reviews ~ UserRating + Price + Year + Genre, data = trainData)
summary(model)
##
## Call:
## lm(formula = Reviews ~ UserRating + Price + Year + Genre, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14113 -5925 -2629 2261 69092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.074e+06 3.985e+05 -5.204 3.74e-07 ***
## UserRating -5.858e+03 2.878e+03 -2.035 0.0428 *
## Price 5.321e+01 6.198e+01 0.858 0.3914
## Year 1.050e+03 1.992e+02 5.271 2.69e-07 ***
## GenreNon Fiction -6.974e+03 1.276e+03 -5.464 1.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10730 on 283 degrees of freedom
## Multiple R-squared: 0.1589, Adjusted R-squared: 0.147
## F-statistic: 13.37 on 4 and 283 DF, p-value: 5.435e-10
predictions <- predict(model, testData)
results <- data.frame(Actual = testData$Reviews, Predicted = predictions)
head(results)
## Actual Predicted
## 1 2052 11892.874
## 2 4149 5450.976
## 3 9198 8518.690
## 4 12159 15271.393
## 5 1296 4903.827
## 6 615 3229.920
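To see what `predict()` does with the fitted model, a prediction can be reproduced by hand from the printed coefficients. The sketch below uses the rounded estimates from the summary and the predictor values of what appears to be the first test row ("11/22/63: A Novel": rating 4.6, price 22, year 2011, Fiction, 2052 actual reviews). The variable names `coefs` and `book` are illustrative.

```r
# Coefficients copied from the printed summary (rounded to 4 significant figures)
coefs <- c(
  intercept       = -2.074e6,
  UserRating      = -5.858e3,
  Price           =  5.321e1,
  Year            =  1.050e3,
  GenreNonFiction = -6.974e3
)

# Predictor values for one book; GenreNonFiction is a 0/1 dummy
# (0 = Fiction, the reference level; 1 = Non Fiction)
book <- c(UserRating = 4.6, Price = 22, Year = 2011, GenreNonFiction = 0)

# Linear prediction: intercept plus the sum of coefficient * value
pred <- coefs[["intercept"]] + sum(coefs[names(book)] * book)

print(pred)
```

With these rounded coefficients the result is about 11,774, close to the 11,892.87 reported above. The gap comes from rounding: the `Year` coefficient is multiplied by a value near 2,000, so even a small rounding error in it shifts the prediction by a hundred or more.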
I evaluated the model on the test set using RMSE (root mean squared error), R-squared, and MAE (mean absolute error), which indicate how closely the predictions match the held-out data.
postResample(predictions, testData$Reviews)
## RMSE Rsquared MAE
## 9278.3000239 0.1402753 6408.1699702
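The three metrics can also be computed directly, which makes their definitions explicit. The sketch below uses only the six rows printed by `head(results)`, so its numbers will not match the full test-set values above. The R-squared line follows caret's default of squaring the correlation between predictions and observations.

```r
# The six (Actual, Predicted) pairs shown by head(results); not the full test set
actual    <- c(2052, 4149, 9198, 12159, 1296, 615)
predicted <- c(11892.874, 5450.976, 8518.690, 15271.393, 4903.827, 3229.920)

errors <- predicted - actual

# Root mean squared error: penalizes large misses more heavily
rmse <- sqrt(mean(errors^2))

# Mean absolute error: average size of a miss, in number of reviews
mae <- mean(abs(errors))

# R-squared as the squared Pearson correlation of predictions and observations
r_squared <- cor(predicted, actual)^2

print(c(RMSE = rmse, Rsquared = r_squared, MAE = mae))
```

Both RMSE and MAE are in the same units as `Reviews`, so a test-set RMSE of roughly 9,278 is large relative to the median review count of 6,634, and an R-squared of about 0.14 means the predictors explain only a small share of the variation.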
Visualizing the actual vs. predicted reviews allows us to see how well the model’s predictions align with the actual data. A scatter plot with a diagonal line (indicating perfect predictions) helps in this comparison.
ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = 'blue', linetype = 'dashed') +
  labs(title = "Actual vs Predicted Reviews", x = "Actual Reviews", y = "Predicted Reviews")
The insights gained from this analysis can help a large bookstore retailer make informed decisions regarding their inventory and marketing strategies. By understanding which factors most significantly impact book reviews and user ratings, the retailer can:
While this analysis provides valuable insights, there are several areas for future work: