Data Analysis Project: R Programming Book Purchase Behavior

Introduction

The primary goal of this analysis is to understand customer behavior in book purchases, focusing on review ratings, geographic location, and pricing. The dataset comprises 2,000 transactions, each representing a unique purchase. The main question is: Which books generate the highest total sales, and how do customer reviews and geographic location relate to those sales?

# Load necessary packages
library(tidyverse)
library(knitr)

# Load data (replace with your data file)
d <- read.csv("./book_reviews.csv")

# Check dataset size and column names/types
glimpse(d)

## Rows: 2,000
## Columns: 4
## $ book   <chr> "R Made Easy", "R For Dummies", "R Made Easy", "R Made Easy", "…
## $ review <chr> "Excellent", "Fair", "Excellent", "Poor", "Great", NA, "Great",…
## $ state  <chr> "TX", "NY", "NY", "FL", "Texas", "California", "Florida", "CA",…
## $ price  <dbl> 19.99, 15.99, 19.99, 19.99, 50.00, 19.99, 19.99, 19.99, 29.99, …

Data Preparation

Before the analysis, we prepare the dataset in three steps: drop rows with missing values, standardize state names to two-letter abbreviations, and convert the review labels to numeric scores from 1 (Poor) to 5 (Excellent).

# Handling missing data
d_clean <- d %>%
  filter(complete.cases(.))

# Standardizing state names
states_replace <- c("Texas" = "TX", "California" = "CA", "New York" = "NY", "Florida" = "FL")
d_clean$state <- plyr::revalue(d_clean$state, states_replace)

# Transforming review ratings into numerical values
review_replace <- c("Poor" = 1, "Fair" = 2, "Good" = 3, "Great" = 4, "Excellent" = 5)
d_clean$review <- as.numeric(plyr::revalue(d_clean$review, review_replace))

# Flag high reviews (a rating of 4 or 5); the comparison already returns TRUE/FALSE
d_clean$is_high_review <- d_clean$review >= 4

# Check the cleaned data
glimpse(d_clean)

## Rows: 1,794
## Columns: 5
## $ book           <chr> "R Made Easy", "R For Dummies", "R Made Easy", "R Made …
## $ review         <dbl> 5, 2, 5, 1, 4, 4, 1, 2, 2, 4, 2, 3, 5, 3, 2, 4, 1, 5, 2…
## $ state          <chr> "TX", "NY", "NY", "FL", "TX", "FL", "CA", "CA", "TX", "…
## $ price          <dbl> 19.99, 15.99, 19.99, 19.99, 50.00, 19.99, 19.99, 29.99,…
## $ is_high_review <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FAL…
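
The recoding steps above rely on plyr::revalue. As a minimal sketch of an alternative, assuming plyr is not available, the same mapping can be written in base R with named vectors used as lookup tables. The toy data frame and the names state_lookup and review_lookup are illustrative and not part of the original analysis.

```r
# Toy rows standing in for the raw file; the values are illustrative only.
toy <- data.frame(
  review = c("Excellent", "Fair", NA, "Great", "Poor"),
  state  = c("TX", "Texas", "California", "NY", "Florida"),
  stringsAsFactors = FALSE
)

# Drop incomplete rows first, as in the report.
toy <- toy[complete.cases(toy), ]

# Named vectors act as lookup tables: names are raw values, elements are targets.
state_lookup  <- c(Texas = "TX", California = "CA", "New York" = "NY", Florida = "FL")
review_lookup <- c(Poor = 1, Fair = 2, Good = 3, Great = 4, Excellent = 5)

# Replace full state names; values that are already abbreviations stay untouched.
is_full_name <- toy$state %in% names(state_lookup)
toy$state[is_full_name] <- unname(state_lookup[toy$state[is_full_name]])

# Map review labels to scores; a label missing from the lookup would become NA.
toy$review <- unname(review_lookup[toy$review])
toy$is_high_review <- toy$review >= 4

toy
```

One difference worth noting: an unexpected review label becomes NA here, which makes typos in the raw data easy to spot with a follow-up check such as anyNA(toy$review).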

Data Analysis

With the prepared data, we’ll calculate total sales by book and evaluate the average review rating per book. We’ll also explore which books are most frequently purchased in each state.

# Total sales and average ratings by book
sales_by_book <- d_clean %>%
  group_by(book) %>%
  summarize(total_sales = sum(price), mean_rating = mean(review)) %>%
  arrange(desc(total_sales))

# Number of purchases per state and book
sales_by_state <- d_clean %>%
  group_by(state, book) %>%
  summarize(total_books_purchased = n(), .groups = "drop") %>%  # drop grouping to avoid the regrouping message
  arrange(desc(state))

# Print the summary tables
kable(sales_by_book, caption = "Sales and ratings by book")

Sales and ratings by book
book	total_sales	mean_rating
Secrets Of R For Advanced Students	18000.00	2.963889
Fundamentals of R For Beginners	14636.34	3.010929
Top 10 Mistakes R Beginners Make	10646.45	3.047887
R Made Easy	7036.48	2.965909
R For Dummies	5772.39	2.828255
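
As a sanity check on the totals above, dividing total_sales by the number of copies sold should return one price per title, assuming each title sells at a single price. The sketch below hard-codes the totals from this table and takes the unit counts as the per-book sums of the by-state counts printed in this report; the column names units and implied_price are introduced here for illustration.

```r
# Totals copied from the "Sales and ratings by book" table;
# units are the per-book sums of the by-state purchase counts.
by_book <- data.frame(
  book = c("Secrets Of R For Advanced Students",
           "Fundamentals of R For Beginners",
           "Top 10 Mistakes R Beginners Make",
           "R Made Easy",
           "R For Dummies"),
  total_sales = c(18000.00, 14636.34, 10646.45, 7036.48, 5772.39),
  units = c(360, 366, 355, 352, 361),
  stringsAsFactors = FALSE
)

# Implied price per copy, rounded to cents.
by_book$implied_price <- round(by_book$total_sales / by_book$units, 2)

by_book[, c("book", "units", "implied_price")]
# implied_price: 50.00, 39.99, 29.99, 19.99, 15.99
```

The implied prices come out as clean price points, and the unit counts are nearly equal across titles (352 to 366), so the ranking by total sales follows the ranking by price.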

kable(sales_by_state, caption = "Sales by state and book")

Sales by state and book
state	book	total_books_purchased
TX	Fundamentals of R For Beginners	96
TX	R For Dummies	83
TX	R Made Easy	87
TX	Secrets Of R For Advanced Students	82
TX	Top 10 Mistakes R Beginners Make	92
NY	Fundamentals of R For Beginners	97
NY	R For Dummies	81
NY	R Made Easy	96
NY	Secrets Of R For Advanced Students	108
NY	Top 10 Mistakes R Beginners Make	102
FL	Fundamentals of R For Beginners	74
FL	R For Dummies	77
FL	R Made Easy	85
FL	Secrets Of R For Advanced Students	86
FL	Top 10 Mistakes R Beginners Make	84
CA	Fundamentals of R For Beginners	99
CA	R For Dummies	120
CA	R Made Easy	84
CA	Secrets Of R For Advanced Students	84
CA	Top 10 Mistakes R Beginners Make	77
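
The table lists every state and book combination, but it does not single out the most purchased title per state. A minimal base R sketch, using the counts printed above and a hypothetical result name top_by_state, could pick it out like this:

```r
# Counts copied from the "Sales by state and book" table.
books <- c("Fundamentals of R For Beginners", "R For Dummies", "R Made Easy",
           "Secrets Of R For Advanced Students", "Top 10 Mistakes R Beginners Make")

counts <- data.frame(
  state = rep(c("TX", "NY", "FL", "CA"), each = 5),
  book  = rep(books, times = 4),
  total_books_purchased = c(96, 83, 87, 82, 92,     # TX
                            97, 81, 96, 108, 102,   # NY
                            74, 77, 85, 86, 84,     # FL
                            99, 120, 84, 84, 77),   # CA
  stringsAsFactors = FALSE
)

# For each state, keep the row with the largest count.
# which.max() returns the first maximum, so a tie would be broken by row order.
top_by_state <- do.call(rbind, lapply(split(counts, counts$state), function(g) {
  g[which.max(g$total_books_purchased), ]
}))
rownames(top_by_state) <- NULL

top_by_state
# CA: R For Dummies (120)
# FL: Secrets Of R For Advanced Students (86)
# NY: Secrets Of R For Advanced Students (108)
# TX: Fundamentals of R For Beginners (96)
```

Within the tidyverse pipeline used in this report, grouping by state and calling dplyr::slice_max(total_books_purchased, n = 1) would likely give the same rows, with ties kept by default.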

Visualization

Let’s visualize our data to better understand the purchase patterns.

# Create a grouped bar plot of books purchased by state with ggplot2
# (geom_col() plots the counts as given; linewidth replaces size for outlines in ggplot2 >= 3.4.0)
ggplot(sales_by_state, aes(x = state, y = total_books_purchased, fill = book)) +
  geom_col(position = "dodge", colour = "black", linewidth = 0.25) +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "State", 
       y = "Total Books Purchased", 
       fill = "Book",
       title = "Books Purchased by State") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions

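The analysis points to two findings. First, review ratings do not explain total sales: mean ratings for all five titles fall in a narrow band (2.83 to 3.05), yet total sales range from 5,772.39 to 18,000.00, and the top seller, Secrets Of R For Advanced Students, has a mean rating of only 2.96. Other factors, price among them, are likely at work. Second, purchase counts differ by state (for example, R For Dummies sold 120 copies in CA against 77 to 83 in each of the other states), which hints at regional preferences or needs. No significance test was run, so these differences remain descriptive. Understanding the patterns fully would require further analysis that accounts for factors such as demographics or marketing efforts.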

Note: This analysis uses a simplified dataset. Real-world scenarios might require more factors, such as publication date or genre, and a closer look at how price affects demand. These initial findings should therefore be interpreted with caution.
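
The geographic finding in the conclusions could be checked more formally. One option, sketched here as an assumption rather than as part of the original analysis, is a chi-squared test of independence between state and book, using the purchase counts from the by-state table. A small p-value would suggest that the mix of titles differs across states; a large one would mean the observed differences are compatible with chance.

```r
# Purchase counts from the "Sales by state and book" table, one row per state.
counts <- matrix(
  c(96,  83, 87,  82,  92,
    97,  81, 96, 108, 102,
    74,  77, 85,  86,  84,
    99, 120, 84,  84,  77),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    state = c("TX", "NY", "FL", "CA"),
    book  = c("Fundamentals of R For Beginners", "R For Dummies", "R Made Easy",
              "Secrets Of R For Advanced Students", "Top 10 Mistakes R Beginners Make")
  )
)

# Pearson chi-squared test of independence on the 4 x 5 contingency table.
result <- chisq.test(counts)

result$parameter  # degrees of freedom: (4 - 1) * (5 - 1) = 12
result$statistic  # chi-squared statistic
result$p.value    # compare against the chosen significance level
```

The test treats each purchase as an independent observation, which may not hold if the same customer appears in several transactions.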