Introduction

In this project, we delve into the 2019 book sales data from a company which launched a new sales program mid-year (July 1st, 2019) aimed at encouraging customers to purchase more books. The primary objective of this analysis is to ascertain the effectiveness of this program. We’ll approach this by comparing book sales and customer reviews before and after the implementation of the program.

The dataset includes the following variables:

The dataset can be accessed here.

Our analytical workflow will include the following steps:

Step 1: Data Loading and Exploration

# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)

# Load the dataset
data <- read.csv("sales2019.csv")

# Print a summary of the data
glimpse(data)
## Rows: 5,000
## Columns: 5
## $ date                  <chr> "5/22/19", "11/16/19", "6/27/19", "11/6/19", "7/…
## $ user_submitted_review <chr> "it was okay", "Awesome!", "Awesome!", "Awesome!…
## $ title                 <chr> "Secrets Of R For Advanced Students", "R For Dum…
## $ total_purchased       <int> 7, 3, 1, 3, NA, 1, 5, NA, 7, 1, 7, NA, 3, 2, 0, …
## $ customer_type         <chr> "Business", "Business", "Individual", "Individua…

Step 2: Data Cleaning

Here we handle missing data to ensure quality and consistency in the subsequent analysis. Given our objectives, we will focus on the “total_purchased” and “user_submitted_review” columns.

# Drop rows with missing reviews
data_clean <- data %>% filter(!is.na(user_submitted_review))  

# Calculate average total_purchased by book title and replace missing values
complete_sale_by_book <- data_clean %>% 
  group_by(title) %>%  
  mutate(avg_purchased = mean(total_purchased, na.rm = TRUE),
         total_purchased = ifelse(is.na(total_purchased), round(avg_purchased), total_purchased)) %>%  
  ungroup()  

Step 3: Review Processing

We convert the reviews into two classes: positive or negative. We first take a look at the reviews to identify unique entries.

unique(complete_sale_by_book$user_submitted_review)
##  [1] "it was okay"                         
##  [2] "Awesome!"                            
##  [3] "Hated it"                            
##  [4] "Never read a better book"            
##  [5] "OK"                                  
##  [6] "The author's other books were better"
##  [7] "A lot of material was not needed"    
##  [8] ""                                    
##  [9] "Would not recommend"                 
## [10] "I learned a lot"

Based on the unique entries, we identify certain keywords that signify a positive or negative sentiment. From here, we can label the reviews accordingly. Here, we have chosen to identify positive keywords, and then create a new column (review) based on the sentiment keywords.

words_pos <- c("okay", "Awesome", "better book", "OK", "learned") 

# Create a new column 'review' based on sentiment keywords
complete_sale_by_book <- complete_sale_by_book %>% 
  mutate(review = case_when(
    str_detect(user_submitted_review, paste(words_pos, collapse = "|")) ~ "Positive",
    TRUE ~ "Negative"
  ))

kable(head(complete_sale_by_book))
date user_submitted_review title total_purchased customer_type avg_purchased review
5/22/19 it was okay Secrets Of R For Advanced Students 7 Business 4.055441 Positive
11/16/19 Awesome! R For Dummies 3 Business 4.045526 Positive
6/27/19 Awesome! R For Dummies 1 Individual 4.045526 Positive
11/6/19 Awesome! Fundamentals of R For Beginners 3 Individual 3.955032 Positive
7/18/19 Hated it Fundamentals of R For Beginners 4 Business 3.955032 Negative
1/28/19 Never read a better book Secrets Of R For Advanced Students 1 Business 4.055441 Positive

Step 4: Analysis of Sales

We now compare the total book sales before and after the new sales program. This analysis will give us an indication of the program’s impact on overall sales.

# Create a new column to separate sales into before and after the program
data_pre_post <- complete_sale_by_book %>% mutate(pre_or_post = if_else(mdy(date) < ymd("2019/07/01"), "Pre", "Post"))

# Compare book sales before and after the program
data_sales <- data_pre_post %>% 
  group_by(pre_or_post) %>%
  summarize(total_books_purchased = sum(total_purchased)) 

# Output a table with summary
kable(data_sales, caption = "Total books purchased before and after new sales program")
Total books purchased before and after new sales program
pre_or_post total_books_purchased
Post 9072
Pre 9114

Step 5: Subgroup Analysis by Customer Type

We now break down the sales data by customer type to see if the sales program had a different effect on businesses versus individual customers.

sales_by_customer <- data_pre_post %>% 
  group_by(pre_or_post, customer_type) %>%
  summarize(total_books_purchased = sum(total_purchased)) 
## `summarise()` has grouped output by 'pre_or_post'. You can override using the
## `.groups` argument.
# Output a table with summary
kable(sales_by_customer, caption="Book Sales Before and After the Program by Customer Type")
Book Sales Before and After the Program by Customer Type
pre_or_post customer_type total_books_purchased
Post Business 6309
Post Individual 2763
Pre Business 6222
Pre Individual 2892

Step 6: Analysis of Review Sentiment

Lastly, we evaluate if the sales program had an impact on review sentiment by comparing the counts of positive and negative reviews before and after the program.

pos_rv_pre_post <- data_pre_post %>%
  group_by(pre_or_post) %>% 
  summarize(number_reviews = sum(str_count(review,"Positive")))
kable(pos_rv_pre_post, caption="Positive reviews pre- vs post-new program")
Positive reviews pre- vs post-new program
pre_or_post number_reviews
Post 1128
Pre 1134
neg_rv_pre_post <- data_pre_post %>%
  group_by(pre_or_post) %>% 
  summarize(number_reviews = sum(str_count(review,"Negative")))
kable(neg_rv_pre_post, caption="Negative reviews pre- vs post-new program")
Negative reviews pre- vs post-new program
pre_or_post number_reviews
Post 1156
Pre 1126

Conclusions & Recommendations

Based on the analysis, the new program appears to have influenced book sales and customer reviews in several ways. Here are our main findings:

  1. Book Sales: The program had a differential impact on book sales. Sales to businesses increased following the introduction of the program, suggesting that the program was appealing to this customer segment. Conversely, individual customer sales declined post-program.
  2. Review Sentiment: The sentiment analysis of customer reviews revealed that the overall sentiment worsened after the program was launched. There was a slight decline in the number of positive reviews and a noticeable increase in the number of negative reviews post-program.

Given these findings, here are some recommendations for the company:

More data or additional context might be needed to fully understand these findings and their implications for the company’s future strategies.