The primary goal of this analysis is to understand customer behavior in book purchases, focusing on review ratings, geographic location, and pricing. The dataset comprises 2,000 transactions, each representing a unique purchase. The main question is: which books generate the highest total sales, and how do customer reviews and geographic location relate to those sales?
# Load necessary packages
library(tidyverse)
library(knitr)
# Load data (replace with your data file)
d <- read.csv("./book_reviews.csv")
# Check dataset size and column names/types
glimpse(d)
## Rows: 2,000
## Columns: 4
## $ book <chr> "R Made Easy", "R For Dummies", "R Made Easy", "R Made Easy", "…
## $ review <chr> "Excellent", "Fair", "Excellent", "Poor", "Great", NA, "Great",…
## $ state <chr> "TX", "NY", "NY", "FL", "Texas", "California", "Florida", "CA",…
## $ price <dbl> 19.99, 15.99, 19.99, 19.99, 50.00, 19.99, 19.99, 19.99, 29.99, …
Before the analysis, we prepare the dataset in three steps: drop rows with missing values, standardize state names to two-letter abbreviations, and convert the review ratings to a numeric 1 to 5 scale.
# Handle missing data: drop rows with a missing value in any column
d_clean <- d %>%
  filter(complete.cases(.))
# Standardizing state names
states_replace <- c("Texas" = "TX", "California" = "CA", "New York" = "NY", "Florida" = "FL")
d_clean$state <- plyr::revalue(d_clean$state, states_replace)
# Transforming review ratings into numerical values
review_replace <- c("Poor" = 1, "Fair" = 2, "Good" = 3, "Great" = 4, "Excellent" = 5)
d_clean$review <- as.numeric(plyr::revalue(d_clean$review, review_replace))
# Flag high reviews (a rating of 4 or 5); the comparison already returns TRUE/FALSE
d_clean$is_high_review <- d_clean$review >= 4
# Check the cleaned data
glimpse(d_clean)
## Rows: 1,794
## Columns: 5
## $ book <chr> "R Made Easy", "R For Dummies", "R Made Easy", "R Made …
## $ review <dbl> 5, 2, 5, 1, 4, 4, 1, 2, 2, 4, 2, 3, 5, 3, 2, 4, 1, 5, 2…
## $ state <chr> "TX", "NY", "NY", "FL", "TX", "FL", "CA", "CA", "TX", "…
## $ price <dbl> 19.99, 15.99, 19.99, 19.99, 50.00, 19.99, 19.99, 29.99,…
## $ is_high_review <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FAL…
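The recoding steps above rely on every label appearing in the lookup vectors; a label that is missing from the map would become `NA` once the column is converted to numeric. The following is a minimal sketch of the same three steps written with dplyr alone, followed by explicit checks for unmapped values. The `toy` data frame and the names `state_map` and `review_map` are hypothetical stand-ins for illustration, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy rows with the same columns as book_reviews.csv
toy <- data.frame(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy", "R Made Easy"),
  review = c("Excellent", "Fair", NA, "Great"),
  state  = c("TX", "New York", "California", "Florida"),
  price  = c(19.99, 15.99, 19.99, 19.99)
)

# Lookup vectors mirroring the ones used in the analysis
state_map  <- c("Texas" = "TX", "California" = "CA", "New York" = "NY", "Florida" = "FL")
review_map <- c("Poor" = 1, "Fair" = 2, "Good" = 3, "Great" = 4, "Excellent" = 5)

toy_clean <- toy %>%
  filter(complete.cases(.)) %>%                            # drop rows with any NA
  mutate(
    state = coalesce(unname(state_map[state]), state),     # map full names, keep abbreviations as-is
    review = unname(review_map[review]),                   # label -> numeric score
    is_high_review = review >= 4                           # TRUE for "Great" and "Excellent"
  )

# Checks: fail loudly if a label was not covered by the lookup vectors
stopifnot(!anyNA(toy_clean$review))
stopifnot(all(toy_clean$state %in% state_map))

toy_clean
```

On real data, a failing check would point to a spelling variant (for example a lowercase state name) that needs an extra entry in the lookup vector.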
With the prepared data, we’ll calculate total sales by book and evaluate the average review rating per book. We’ll also explore which books are most frequently purchased in each state.
# Total sales and average ratings by book
sales_by_book <- d_clean %>%
  group_by(book) %>%
  summarize(total_sales = sum(price), mean_rating = mean(review)) %>%
  arrange(desc(total_sales))
# Number of purchases per state and book
sales_by_state <- d_clean %>%
  group_by(state, book) %>%
  summarize(total_books_purchased = n(), .groups = "drop") %>%  # drop grouping to avoid the regrouping message
  arrange(desc(state))
# Print the summary tables
kable(sales_by_book, caption = "Sales and ratings by book")
| book | total_sales | mean_rating |
|---|---|---|
| Secrets Of R For Advanced Students | 18000.00 | 2.963889 |
| Fundamentals of R For Beginners | 14636.34 | 3.010929 |
| Top 10 Mistakes R Beginners Make | 10646.45 | 3.047887 |
| R Made Easy | 7036.48 | 2.965909 |
| R For Dummies | 5772.39 | 2.828255 |
kable(sales_by_state, caption = "Sales by state and book")
| state | book | total_books_purchased |
|---|---|---|
| TX | Fundamentals of R For Beginners | 96 |
| TX | R For Dummies | 83 |
| TX | R Made Easy | 87 |
| TX | Secrets Of R For Advanced Students | 82 |
| TX | Top 10 Mistakes R Beginners Make | 92 |
| NY | Fundamentals of R For Beginners | 97 |
| NY | R For Dummies | 81 |
| NY | R Made Easy | 96 |
| NY | Secrets Of R For Advanced Students | 108 |
| NY | Top 10 Mistakes R Beginners Make | 102 |
| FL | Fundamentals of R For Beginners | 74 |
| FL | R For Dummies | 77 |
| FL | R Made Easy | 85 |
| FL | Secrets Of R For Advanced Students | 86 |
| FL | Top 10 Mistakes R Beginners Make | 84 |
| CA | Fundamentals of R For Beginners | 99 |
| CA | R For Dummies | 120 |
| CA | R Made Easy | 84 |
| CA | Secrets Of R For Advanced Students | 84 |
| CA | Top 10 Mistakes R Beginners Make | 77 |
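The table lists every state and book combination but does not single out the most purchased book per state or show how purchase counts relate to revenue. The sketch below does both, using counts and totals copied from the two tables above so that it runs on its own. The object names (`counts`, `revenue`, `top_by_state`, `units_by_book`, `implied_unit_price`) are illustrative assumptions, and `slice_max()` is only one of several ways to pick the top row per group.

```r
library(dplyr)

books <- c("Fundamentals of R For Beginners", "R For Dummies", "R Made Easy",
           "Secrets Of R For Advanced Students", "Top 10 Mistakes R Beginners Make")

# Counts copied from the "Sales by state and book" table
counts <- data.frame(
  state = rep(c("TX", "NY", "FL", "CA"), each = 5),
  book  = rep(books, times = 4),
  total_books_purchased = c(96, 83, 87, 82, 92,
                            97, 81, 96, 108, 102,
                            74, 77, 85, 86, 84,
                            99, 120, 84, 84, 77)
)

# Revenue copied from the "Sales and ratings by book" table
revenue <- data.frame(
  book = books,
  total_sales = c(14636.34, 5772.39, 7036.48, 18000.00, 10646.45)
)

# Most frequently purchased book in each state
top_by_state <- counts %>%
  group_by(state) %>%
  slice_max(total_books_purchased, n = 1, with_ties = FALSE) %>%
  ungroup()

# Units sold per book and the unit price implied by revenue / units
units_by_book <- counts %>%
  group_by(book) %>%
  summarize(units = sum(total_books_purchased)) %>%
  inner_join(revenue, by = "book") %>%
  mutate(implied_unit_price = round(total_sales / units, 2)) %>%
  arrange(desc(total_sales))

top_by_state
units_by_book
```

With these counts, the top book differs by state (Fundamentals of R For Beginners in TX, Secrets Of R For Advanced Students in NY and FL, R For Dummies in CA), while units per book stay within a narrow band of 352 to 366. The implied unit prices run from 15.99 to 50.00 in the same order as total sales, so the revenue ranking is driven largely by price.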
Let’s visualize our data to better understand the purchase patterns.
# Create a grouped bar plot of books purchased by state with ggplot2
# geom_col() is equivalent to geom_bar(stat = "identity");
# linewidth replaces size for bar outlines in ggplot2 >= 3.4.0
ggplot(sales_by_state, aes(x = state, y = total_books_purchased, fill = book)) +
  geom_col(position = "dodge", colour = "black", linewidth = 0.25) +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "State",
       y = "Total Books Purchased",
       fill = "Book",
       title = "Books Purchased by State") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
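From the analysis, we draw two tentative insights. First, total sales do not track review ratings: mean ratings cluster near 3 for every book (2.83 to 3.05), while the highest-revenue title, Secrets Of R For Advanced Students, works out to $50.00 per copy (18,000.00 in sales over 360 purchases), which points to price as a stronger driver of revenue. Second, purchase counts vary by state (for example, R For Dummies was purchased 120 times in CA but only 77 to 83 times in each of the other states), hinting at regional preferences or needs, although this analysis does not establish whether such differences exceed chance variation. Understanding these patterns more fully would require further analysis that accounts for factors such as demographic information or marketing efforts.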
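One way to probe the geographic claim is a chi-squared test of independence between state and book. This is a sketch of a possible follow-up check, not a step from the original analysis. It uses the counts from the "Sales by state and book" table, and the object names `counts` and `test` are illustrative.

```r
# Purchase counts copied from the "Sales by state and book" table
# (rows: state, columns: book)
counts <- matrix(
  c(96,  83, 87,  82,  92,
    97,  81, 96, 108, 102,
    74,  77, 85,  86,  84,
    99, 120, 84,  84,  77),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    state = c("TX", "NY", "FL", "CA"),
    book  = c("Fundamentals of R For Beginners", "R For Dummies", "R Made Easy",
              "Secrets Of R For Advanced Students", "Top 10 Mistakes R Beginners Make")
  )
)

# Null hypothesis: the mix of books purchased is the same in every state
test <- chisq.test(counts)

test$statistic            # chi-squared statistic
test$parameter            # degrees of freedom: (4 - 1) * (5 - 1) = 12
test$p.value              # a small value suggests the book mix differs by state

# Pearson residuals show which state/book cells deviate most from independence
round(test$residuals, 2)
```

A small p-value would support the idea that the mix of books differs by state, and the residuals indicate which cells contribute most. It would not explain why the mix differs, and any significance threshold applied to it is an analyst's choice.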
Note: This analysis is based on a relatively simple dataset, and real-world scenarios might require considering factors not recorded here, such as publication date or genre. These initial findings should therefore be interpreted with caution, given the limited scope of the data.