Introduction

The primary goal of this analysis is to understand customer behavior in book purchases, focusing on review ratings, geographic location, and pricing. The dataset comprises 2,000 transactions, each representing a single purchase. The main question is: which books generate the highest total sales, and how do customer reviews and geographic location relate to those sales?

# Load necessary packages
library(tidyverse)
library(knitr)

# Load data (replace with your data file)
d <- read.csv("./book_reviews.csv")

# Check dataset size and column names/types
glimpse(d)
## Rows: 2,000
## Columns: 4
## $ book   <chr> "R Made Easy", "R For Dummies", "R Made Easy", "R Made Easy", "…
## $ review <chr> "Excellent", "Fair", "Excellent", "Poor", "Great", NA, "Great",…
## $ state  <chr> "TX", "NY", "NY", "FL", "Texas", "California", "Florida", "CA",…
## $ price  <dbl> 19.99, 15.99, 19.99, 19.99, 50.00, 19.99, 19.99, 19.99, 29.99, …

Data Preparation

Before the analysis, we prepare the dataset by removing rows with missing values, standardizing the state names, and converting the review ratings to numeric values.

# Handling missing data
d_clean <- d %>%
  filter(complete.cases(.))
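
The filter above removes 206 of the 2,000 rows (2,000 minus the 1,794 reported further down). Before dropping rows, it can help to see which columns the missing values come from. A minimal sketch in base R, using a small made-up data frame (`toy` and its values are illustrative, not taken from the dataset):

```r
# Toy data frame standing in for `d` (values are illustrative only)
toy <- data.frame(
  book   = c("R Made Easy", "R For Dummies", "R Made Easy", NA),
  review = c("Excellent", NA, "Great", "Poor"),
  state  = c("TX", "NY", "Texas", "FL"),
  price  = c(19.99, 15.99, 19.99, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that would survive the complete-case filter
rows_kept <- sum(complete.cases(toy))

print(na_per_column)
print(rows_kept)
```

Running the same two calls on `d` would show whether the dropped rows are concentrated in one column, such as `review`.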

# Standardizing state names
states_replace <- c("Texas" = "TX", "California" = "CA", "New York" = "NY", "Florida" = "FL")
d_clean$state <- plyr::revalue(d_clean$state, states_replace)

# Transforming review ratings into numerical values
review_replace <- c("Poor" = 1, "Fair" = 2, "Good" = 3, "Great" = 4, "Excellent" = 5)
d_clean$review <- as.numeric(plyr::revalue(d_clean$review, review_replace))
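
With `plyr::revalue()`, a label that is missing from the lookup is left unchanged, and `as.numeric()` then turns it into `NA` with a coercion warning. A minimal base R sketch of the same mapping that lists unmapped labels explicitly (the vector `reviews` and the label "Amazing" are made up for illustration):

```r
# Lookup from review label to numeric score (same mapping as above)
review_levels <- c("Poor" = 1, "Fair" = 2, "Good" = 3, "Great" = 4, "Excellent" = 5)

# Example input; "Amazing" is a hypothetical label that is not in the lookup
reviews <- c("Excellent", "Fair", "Great", "Amazing")

# Index the named vector by label; unknown labels become NA
review_num <- unname(review_levels[reviews])

# Labels that could not be mapped, for inspection before they are dropped
unmapped <- unique(reviews[is.na(review_num)])

print(review_num)
print(unmapped)
```

An empty `unmapped` vector confirms that every label in the data was covered by the lookup.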

# Creating binary variable for high reviews (rating of 4 or 5)
d_clean$is_high_review <- d_clean$review >= 4

# Check the cleaned data
glimpse(d_clean)
## Rows: 1,794
## Columns: 5
## $ book           <chr> "R Made Easy", "R For Dummies", "R Made Easy", "R Made …
## $ review         <dbl> 5, 2, 5, 1, 4, 4, 1, 2, 2, 4, 2, 3, 5, 3, 2, 4, 1, 5, 2…
## $ state          <chr> "TX", "NY", "NY", "FL", "TX", "FL", "CA", "CA", "TX", "…
## $ price          <dbl> 19.99, 15.99, 19.99, 19.99, 50.00, 19.99, 19.99, 29.99,…
## $ is_high_review <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FAL…
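
One possible use of the `is_high_review` flag is the share of high reviews per book. The sketch below shows the idea in base R on a made-up data frame (`toy`, its titles, and its ratings are illustrative only, not values from the dataset):

```r
# Toy ratings for two hypothetical titles
toy <- data.frame(
  book   = c("A", "A", "B", "B", "B"),
  review = c(5, 2, 4, 4, 1)
)

# Same flag definition as above: a rating of 4 or 5 counts as high
toy$is_high_review <- toy$review >= 4

# The mean of a logical vector is the proportion of TRUE values
high_share <- tapply(toy$is_high_review, toy$book, mean)

print(high_share)
```

Applied to `d_clean`, the same `tapply()` call would give one proportion per book title.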

Data Analysis

With the prepared data, we’ll calculate total sales by book and evaluate the average review rating per book. We’ll also explore which books are most frequently purchased in each state.

# Total sales and average ratings by book
sales_by_book <- d_clean %>%
  group_by(book) %>%
  summarize(total_sales = sum(price), mean_rating = mean(review)) %>%
  arrange(desc(total_sales))

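# Copies purchased per state and book
# (.groups = "drop" returns an ungrouped result and silences the regrouping message)
sales_by_state <- d_clean %>%
  group_by(state, book) %>%
  summarize(total_books_purchased = n(), .groups = "drop") %>%
  arrange(desc(state))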

# Print the summary tables
kable(sales_by_book, caption = "Sales and ratings by book")

Sales and ratings by book

| book                               | total_sales | mean_rating |
|:-----------------------------------|------------:|------------:|
| Secrets Of R For Advanced Students |    18000.00 |    2.963889 |
| Fundamentals of R For Beginners    |    14636.34 |    3.010929 |
| Top 10 Mistakes R Beginners Make   |    10646.45 |    3.047887 |
| R Made Easy                        |     7036.48 |    2.965909 |
| R For Dummies                      |     5772.39 |    2.828255 |

kable(sales_by_state, caption = "Sales by state and book")

Sales by state and book

| state | book                               | total_books_purchased |
|:------|:-----------------------------------|----------------------:|
| TX    | Fundamentals of R For Beginners    |                    96 |
| TX    | R For Dummies                      |                    83 |
| TX    | R Made Easy                        |                    87 |
| TX    | Secrets Of R For Advanced Students |                    82 |
| TX    | Top 10 Mistakes R Beginners Make   |                    92 |
| NY    | Fundamentals of R For Beginners    |                    97 |
| NY    | R For Dummies                      |                    81 |
| NY    | R Made Easy                        |                    96 |
| NY    | Secrets Of R For Advanced Students |                   108 |
| NY    | Top 10 Mistakes R Beginners Make   |                   102 |
| FL    | Fundamentals of R For Beginners    |                    74 |
| FL    | R For Dummies                      |                    77 |
| FL    | R Made Easy                        |                    85 |
| FL    | Secrets Of R For Advanced Students |                    86 |
| FL    | Top 10 Mistakes R Beginners Make   |                    84 |
| CA    | Fundamentals of R For Beginners    |                    99 |
| CA    | R For Dummies                      |                   120 |
| CA    | R Made Easy                        |                    84 |
| CA    | Secrets Of R For Advanced Students |                    84 |
| CA    | Top 10 Mistakes R Beginners Make   |                    77 |
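
The two summary tables can be combined to see what drives the revenue ranking. The sketch below copies the printed figures into base R objects, sums the state counts into copies sold per title, and divides total sales by that count. The object names (`total_sales`, `copies_by_state`, `copies_sold`, `implied_price`) are illustrative and not part of the original analysis.

```r
# Total sales per book, copied from the "Sales and ratings by book" table
total_sales <- c(
  "Secrets Of R For Advanced Students" = 18000.00,
  "Fundamentals of R For Beginners"    = 14636.34,
  "Top 10 Mistakes R Beginners Make"   = 10646.45,
  "R Made Easy"                        = 7036.48,
  "R For Dummies"                      = 5772.39
)

# Copies purchased per book and state, copied from the "Sales by state and book" table
copies_by_state <- rbind(
  "Secrets Of R For Advanced Students" = c(82, 108, 86, 84),
  "Fundamentals of R For Beginners"    = c(96, 97, 74, 99),
  "Top 10 Mistakes R Beginners Make"   = c(92, 102, 84, 77),
  "R Made Easy"                        = c(87, 96, 85, 84),
  "R For Dummies"                      = c(83, 81, 77, 120)
)
colnames(copies_by_state) <- c("TX", "NY", "FL", "CA")

# Copies sold per title across all four states
copies_sold <- rowSums(copies_by_state)   # 360, 366, 355, 352, 361

# Average revenue per copy, rounded to cents
implied_price <- round(total_sales / copies_sold, 2)   # 50.00, 39.99, 29.99, 19.99, 15.99

print(data.frame(copies_sold, implied_price))
```

Each title sold between 352 and 366 copies, so the ranking by total sales follows the per-copy price rather than the number of copies sold.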

Visualization

Let’s visualize our data to better understand the purchase patterns.

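# Create a dodged bar plot of books purchased by state with ggplot2
# (geom_col() is shorthand for geom_bar(stat = "identity");
#  linewidth replaces size for outlines as of ggplot2 3.4.0)
ggplot(sales_by_state, aes(x = state, y = total_books_purchased, fill = book)) +
  geom_col(position = "dodge", colour = "black", linewidth = 0.25) +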
  scale_fill_brewer(palette = "Set3") +
  labs(x = "State", 
       y = "Total Books Purchased", 
       fill = "Book",
       title = "Books Purchased by State") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions

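From the analysis, we draw two insights. First, total sales are not determined by review ratings: mean ratings are nearly identical across the five titles (about 2.8 to 3.05), and the top-selling book, Secrets Of R For Advanced Students, has a mean rating of 2.96, below that of the third-ranked title (3.05). Because each title sold a similar number of copies (352 to 366), differences in total sales mainly reflect differences in price. Second, purchase counts vary by state; for example, R For Dummies sold 120 copies in CA against 77 to 83 in each of the other states, which hints at possible regional preferences. This analysis does not test whether such differences exceed chance variation, and a fuller understanding would require further analyses that consider factors such as demographic information or marketing efforts.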
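
One way to check whether the state-by-book differences are larger than chance would be a chi-squared test of independence on the counts from the "Sales by state and book" table. This test is not part of the original analysis; the sketch below is an assumption about how such a check could look in base R, and the object names `counts` and `test` are illustrative.

```r
# Copies purchased by state (rows) and book (columns), copied from the summary table
counts <- matrix(
  c(96,  83, 87,  82,  92,
    97,  81, 96, 108, 102,
    74,  77, 85,  86,  84,
    99, 120, 84,  84,  77),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    state = c("TX", "NY", "FL", "CA"),
    book  = c("Fundamentals of R For Beginners", "R For Dummies", "R Made Easy",
              "Secrets Of R For Advanced Students", "Top 10 Mistakes R Beginners Make")
  )
)

# Pearson chi-squared test of independence between state and book
test <- chisq.test(counts)

print(test$statistic)  # test statistic
print(test$parameter)  # degrees of freedom: (4 - 1) * (5 - 1) = 12
print(test$p.value)    # p-value under the null hypothesis of independence
```

A small p-value would support the claim that book preferences differ by state; a large one would suggest the observed differences are compatible with chance.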

Note: This analysis is based on a relatively simple dataset, and real-world scenarios might require considering more factors, such as a closer look at pricing, publication date, or genre. These initial findings should therefore be interpreted with caution, given the limited scope of the dataset.