In this project, we delve into the 2019 book sales data from a company which launched a new sales program mid-year (July 1st, 2019) aimed at encouraging customers to purchase more books. The primary objective of this analysis is to ascertain the effectiveness of this program. We’ll approach this by comparing book sales and customer reviews before and after the implementation of the program.
The dataset includes the following variables:
The dataset can be accessed here.
Our analytical workflow will include the following steps:
# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
# Load the dataset
data <- read.csv("sales2019.csv")
# Print a summary of the data
glimpse(data)
## Rows: 5,000
## Columns: 5
## $ date <chr> "5/22/19", "11/16/19", "6/27/19", "11/6/19", "7/…
## $ user_submitted_review <chr> "it was okay", "Awesome!", "Awesome!", "Awesome!…
## $ title <chr> "Secrets Of R For Advanced Students", "R For Dum…
## $ total_purchased <int> 7, 3, 1, 3, NA, 1, 5, NA, 7, 1, 7, NA, 3, 2, 0, …
## $ customer_type <chr> "Business", "Business", "Individual", "Individua…
Here we handle missing data to ensure quality and consistency in the subsequent analysis. Given our objectives, we will focus on the “total_purchased” and “user_submitted_review” columns.
# Drop rows with missing reviews
data_clean <- data %>% filter(!is.na(user_submitted_review))
# Calculate average total_purchased by book title and replace missing values
complete_sale_by_book <- data_clean %>%
group_by(title) %>%
mutate(avg_purchased = mean(total_purchased, na.rm = TRUE),
total_purchased = ifelse(is.na(total_purchased), round(avg_purchased), total_purchased)) %>%
ungroup()
We convert the reviews into two classes: positive or negative. We first take a look at the reviews to identify unique entries.
unique(complete_sale_by_book$user_submitted_review)
## [1] "it was okay"
## [2] "Awesome!"
## [3] "Hated it"
## [4] "Never read a better book"
## [5] "OK"
## [6] "The author's other books were better"
## [7] "A lot of material was not needed"
## [8] ""
## [9] "Would not recommend"
## [10] "I learned a lot"
Based on the unique entries, we identify certain keywords that signify a positive or negative sentiment. From here, we can label the reviews accordingly. Here, we have chosen to identify positive keywords, and then create a new column (review) based on the sentiment keywords.
words_pos <- c("okay", "Awesome", "better book", "OK", "learned")
# Create a new column 'review' based on sentiment keywords
complete_sale_by_book <- complete_sale_by_book %>%
mutate(review = case_when(
str_detect(user_submitted_review, paste(words_pos, collapse = "|")) ~ "Positive",
TRUE ~ "Negative"
))
kable(head(complete_sale_by_book))
| date | user_submitted_review | title | total_purchased | customer_type | avg_purchased | review |
|---|---|---|---|---|---|---|
| 5/22/19 | it was okay | Secrets Of R For Advanced Students | 7 | Business | 4.055441 | Positive |
| 11/16/19 | Awesome! | R For Dummies | 3 | Business | 4.045526 | Positive |
| 6/27/19 | Awesome! | R For Dummies | 1 | Individual | 4.045526 | Positive |
| 11/6/19 | Awesome! | Fundamentals of R For Beginners | 3 | Individual | 3.955032 | Positive |
| 7/18/19 | Hated it | Fundamentals of R For Beginners | 4 | Business | 3.955032 | Negative |
| 1/28/19 | Never read a better book | Secrets Of R For Advanced Students | 1 | Business | 4.055441 | Positive |
We now compare the total book sales before and after the new sales program. This analysis will give us an indication of the program’s impact on overall sales.
# Create a new column to separate sales into before and after the program
data_pre_post <- complete_sale_by_book %>% mutate(pre_or_post = if_else(mdy(date) < ymd("2019/07/01"), "Pre", "Post"))
# Compare book sales before and after the program
data_sales <- data_pre_post %>%
group_by(pre_or_post) %>%
summarize(total_books_purchased = sum(total_purchased))
# Output a table with summary
kable(data_sales, caption = "Total books purchased before and after new sales program")
| pre_or_post | total_books_purchased |
|---|---|
| Post | 9072 |
| Pre | 9114 |
We now break down the sales data by customer type to see if the sales program had a different effect on businesses versus individual customers.
sales_by_customer <- data_pre_post %>%
group_by(pre_or_post, customer_type) %>%
summarize(total_books_purchased = sum(total_purchased))
## `summarise()` has grouped output by 'pre_or_post'. You can override using the
## `.groups` argument.
# Output a table with summary
kable(sales_by_customer, caption="Book Sales Before and After the Program by Customer Type")
| pre_or_post | customer_type | total_books_purchased |
|---|---|---|
| Post | Business | 6309 |
| Post | Individual | 2763 |
| Pre | Business | 6222 |
| Pre | Individual | 2892 |
Lastly, we evaluate if the sales program had an impact on review sentiment by comparing the counts of positive and negative reviews before and after the program.
pos_rv_pre_post <- data_pre_post %>%
group_by(pre_or_post) %>%
summarize(number_reviews = sum(str_count(review,"Positive")))
kable(pos_rv_pre_post, caption="Positive reviews pre- vs post-new program")
| pre_or_post | number_reviews |
|---|---|
| Post | 1128 |
| Pre | 1134 |
neg_rv_pre_post <- data_pre_post %>%
group_by(pre_or_post) %>%
summarize(number_reviews = sum(str_count(review,"Negative")))
kable(neg_rv_pre_post, caption="Negative reviews pre- vs post-new program")
| pre_or_post | number_reviews |
|---|---|
| Post | 1156 |
| Pre | 1126 |
Based on the analysis, the new program appears to have influenced book sales and customer reviews in several ways. Here are our main findings:
Given these findings, here are some recommendations for the company:
More data or additional context might be needed to fully understand these findings and their implications for the company’s future strategies.