Final Project

Introduction

Outline

The goal of this project is to use data analytics to identify and improve potential areas of growth for the company Regork, a national grocery chain. As a team of data scientists, we analyzed data in the Complete Journey Study provided by 84.51°. It contains data on 2,500 households who frequently shop at Regork. More information about the dataset can be found here

In this project, we plan to explore, analyze and visualize the data to ultimately answer the below two questions:

Business Question 1: What are the spending habits of our customers during the holiday season and can we improve our marketing using new inferences or observations from existing data?

Business Question 2: What types of products can we observe seasonal trends in and develop new marketing campaigns around?

This project will help Regork to better understand what factors impact spending during one of the most crucial times of the year for retail, and will analyze seasonal product trends that can complement our spending analysis or stand on their own.

Limitations

We were limited in the scope of this presentation by other responsibilities and were unable to commit to analysis on a more granular level. Additionally, our trend-based analysis was fairly limited and were we given more time we would have loved to build this out to provide a more complete picture.

Additionally, our analysis did end up focusing in on the phenomena of spending to support holiday parties. The data we used in our analysis was collected in 2017 and due to many factors including the global Covid-19 pandemic people’s habits when it comes to hosting such events may have changed in the past 5 years.

Roles

Sean McGill - Editing and some additional analysis after the outline had been created, primary presenter.

Tejaswini Avvaru - R Markdown creation and formatting

Zach Taylor - Analysis of seasonal patterns of alcohol sales and related observations

Shilpa Rani - General analysis and business case proposals

Packages Required

The below libraries are used in this project. Most users shall need to install these packages prior to executing this project.

## Load Required Packages 

library("completejourney") #grocery store shopping transactions data from group of 2,469 households
library("stringr") # text manipulation
library("dplyr") # manipulating and transforming data (i.e., filtering, joining, etc.)
library("ggplot2") # data visualization plotting system using "Grammar of Graphics"
library("lubridate") # functions used for working with dates and times
library("gt") #used to produce nice-looking display tables
library("glue") #offers interpreted string literals that are small, fast, and dependency-free
library("reshape2") #Flexibly restructure and aggregate data

Data preparation

Data set details

There are eight built-in data sets available in this package and we are focusing mainly on:

  • demographics: household demographic data (age, income, family size, etc.)
  • products: product metadata (brand, description, etc.)
  • get_transactions(): to get the full transactions data set
  • campaigns: campaigns received by each household

Data set details

Loading data

Loading from datasource:

transactions <- get_transactions()
promotions <- get_promotions()
demo <- demographics
products_data <- products
coupons_data <- coupon_redemptions
promotions_data <- get_promotions()

Checking for NULL values in all the dataframes.

is.null(transactions)
## [1] FALSE
is.null(demo)
## [1] FALSE
is.null(promotions)
## [1] FALSE

Exploratory Data Analysis

Top Spenders during Holiday Season

We began by running an analysis of campaign and product sales during this time of year. Product sales analysis will be covered in our other analysis tab.

#sales during holiday months based on campaign

transactions %>% 
  inner_join(campaigns) %>%
  inner_join(campaign_descriptions) %>% 
  filter(week > 44) %>% 
  group_by(campaign_type) %>% 
  summarise(total_sales = sum(sales_value)) %>% 
  ggplot(aes(x = campaign_type, y = total_sales)) + 
  geom_col() +
  ggtitle("Total Sales During Holiday Months based on Campaign Types",
          subtitle = "Campaign A has highest sales in December") +
  scale_color_brewer()

We then evaluated sales by household ID so we could see where the top spenders stood in relation to the rest.

transactions %>%
  inner_join(products) %>%
  filter(week > 50) %>% 
  group_by(household_id) %>%
  summarize(Total.sales_value = sum(sales_value)) %>%
  arrange(desc(Total.sales_value)) %>%
  ggplot(aes(x = household_id, y = Total.sales_value  , fill = Total.sales_value, group = 1)) +
  geom_histogram(stat = 'identity') +
  labs(title = "Sales Values Per Household During Holiday Season",
       subtitle = "The plot shows all households for informational purposes, not to analyze each id individually",
       x = "Household_id",
       y = "Total sales") 

Households with similar demographics to those with the highest sales value could be given a loyalty program in order to promote spending at Regork rather than our competitors. We next focused in on the top 20 households to see if we could learn anything about them.

#sales by households during holiday months
transactions %>%
  inner_join(products)%>%
  filter(week > 50) %>% 
  group_by(household_id) %>%
  summarize(Total.sales_value = sum(sales_value)) %>%
  arrange(desc(Total.sales_value)) %>%
  top_n(20) %>%
  ggplot(aes(x = household_id, y = Total.sales_value  , fill = Total.sales_value, group = 1)) +
  geom_histogram(stat = 'identity') +
  labs(title = "Sales Values by the households during holiday month",
       subtitle = "The plot shows top 20 Household with highest sales values during holiday month",
       x = "Household_id",
       y = "Total sales") 

We observed that even within the top 20 there are 2 outliers and wanted to examine them closer.

Pulling demographics for the 2 highest spenders:

outlier1 <- demographics %>% 
  filter(household_id == "1453") %>%
  group_by(household_id)

outlier1
## # A tibble: 1 × 8
## # Groups:   household_id [1]
##   household_id age   income   home_ownership marital_s…¹ house…² house…³ kids_…⁴
##   <chr>        <ord> <ord>    <ord>          <ord>       <ord>   <ord>   <ord>  
## 1 1453         45-54 125-149K Homeowner      Married     3       2 Adul… 1      
## # … with abbreviated variable names ¹​marital_status, ²​household_size,
## #   ³​household_comp, ⁴​kids_count
outlier2 <- demographics %>% 
  filter(household_id == "1023") %>%
  group_by(household_id)

outlier2
## # A tibble: 0 × 8
## # Groups:   household_id [0]
## # … with 8 variables: household_id <chr>, age <ord>, income <ord>,
## #   home_ownership <ord>, marital_status <ord>, household_size <ord>,
## #   household_comp <ord>, kids_count <ord>

As we can see, household #1023 has no data, and should be disregarded going forward

Pulling campaign participation for household #1453

campaigns %>% 
  filter(household_id == "1453") %>% 
  left_join(., campaign_descriptions, by="campaign_id") %>% 
  arrange(start_date)
## # A tibble: 11 × 5
##    campaign_id household_id campaign_type start_date end_date  
##    <chr>       <chr>        <ord>         <date>     <date>    
##  1 26          1453         Type B        2016-12-28 2017-02-19
##  2 6           1453         Type C        2017-04-19 2017-05-21
##  3 8           1453         Type A        2017-05-08 2017-06-25
##  4 12          1453         Type B        2017-07-12 2017-08-13
##  5 13          1453         Type A        2017-08-08 2017-09-24
##  6 14          1453         Type C        2017-09-04 2017-11-08
##  7 17          1453         Type B        2017-10-18 2017-11-19
##  8 18          1453         Type A        2017-10-30 2017-12-24
##  9 20          1453         Type C        2017-11-27 2018-02-05
## 10 22          1453         Type B        2017-12-06 2018-01-07
## 11 23          1453         Type B        2017-12-28 2018-02-04

It can be observed that campaign is unlikely to have been the influencing factor on this household’s high spending, so we inferred that they were likely hosting one or more holiday parties.

Product Analysis During Holiday Season

The top product categories by sales during the last 3 weeks of December. We observe that soft drinks are the highest category and look for a complementary product, building off our previous party hosting inference.

#sales by product_category during holiday months

transactions %>%
  inner_join(products, by = 'product_id') %>%
  group_by(product_category, week) %>%
  filter(week > 50) %>%
  summarise(sales = sum(sales_value)) %>%
  arrange(desc(sales)) %>%
  head(10) %>%
  ggplot(aes(x = product_category, y = sales)) + 
  geom_col() +
  ggtitle("Sales by Product Category during Holiday Weeks",
          subtitle = "coupon/misc has highest sales, but can be disregarded")

We began looking for products that have strong seasonal sales correlated to holidays, and Zach brought up alcohol which is commonly observed in economics (his field of study). When we ran an analysis of alcohol sales we observed:

The Liquor product category showed substantial seasonal variation that correlates with the holidays of:
  • Super Bowl
  • St. Patrick’s Day
  • Memorial Day
  • Independence Day
  • Labor Day
  • Thanksgiving
  • Christmas
  • New Year

We investigated the hypothesis that different liquors are associated with different holidays. There may be an opportunity to promote less popular liquors during off-peak seasons.

Below is some code that shows seasonal dependence of different types of liquors on different holidays. Whiskey has a very different pattern than Margaritas. The below code will generate individual line plots, and also a table showing monthly spending on whisky and margarita which illustrates the seasonal differences.

# 1 Seasonal dependence of liquor sales
transactions %>%
  left_join(., products, by = "product_id") %>%
  filter(str_detect(product_category, 'LIQUOR')) %>%
  left_join(., coupons, by = "product_id") %>%
  left_join(., promotions, by = "product_id") %>%
  left_join(., campaign_descriptions, by = "campaign_id") %>%
  mutate(date = date(transaction_timestamp)) %>%
  group_by(date) %>%
  summarize(total_sales_value = sum(sales_value, na.rm = TRUE)) %>%
  ggplot() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  geom_line(mapping = aes(x = date, y = total_sales_value)) +
  labs(
    title = "Seasonal dependence of Liquor sales",
    subtitle = "Holidays drive higher sales",
    caption = "Data from completejourney package"
  ) +
  scale_y_continuous(name = "Total Dollar Value of Sales", labels = scales::dollar) +
  scale_x_date(date_breaks = "1 week", date_labels =  "%U")

A few of the more interesting product types with noticeable variation around holidays are:

  • Margaritas
  • Whiskey

Spending on margaritas is heavily concentrated compared to Whiskey. 31% of annual spending on margaritas comes in December, and the top 5 months of spending on margaritas represent 88% of annual spending.

#Create data for each liquor_type
margarita <- transactions %>%
  left_join(., products, by = "product_id") %>%
  filter(str_detect(product_category, 'LIQUOR')) %>%
  filter(str_detect(product_type, 'MARGARITA')) %>%
  left_join(., coupons, by = "product_id") %>%
  left_join(., promotions, by = "product_id") %>%
  left_join(., campaign_descriptions, by = "campaign_id") %>%
  mutate(date = date(transaction_timestamp)) %>%
  group_by(date) %>%
  summarize(total_sales_value = sum(sales_value, na.rm = TRUE))

whisky <- transactions %>%
  left_join(., products, by = "product_id") %>%
  filter(str_detect(product_category, 'LIQUOR')) %>%
  filter(str_detect(product_type, 'WHISK')) %>%
  left_join(., coupons, by = "product_id") %>%
  left_join(., promotions, by = "product_id") %>%
  left_join(., campaign_descriptions, by = "campaign_id") %>%
  mutate(date = date(transaction_timestamp)) %>%
  group_by(date) %>%
  summarize(total_sales_value = sum(sales_value, na.rm = TRUE))

margarita_month <- margarita %>%
  group_by(month = lubridate::floor_date(date, "month")) %>%
  summarize(monthly_sales_margarita = sum(total_sales_value))

whisky_month <- whisky %>%
  group_by(month = lubridate::floor_date(date, "month")) %>%
  summarize(monthly_sales_whisky = sum(total_sales_value))

month_total <- whisky_month %>%
  left_join(.,margarita_month, by = "month")


# Define the start and end dates for the data range
start_date <- "2017-01-01"
end_date <- "2017-12-31"

# Create a gt table to compare whisky and margarita sales by month
month_gt <- month_total %>%
  gt() %>%
  tab_header(
    title = "Margarita and Whisky Sales by Month",
    subtitle = glue("{start_date} to {end_date}")
  ) %>%
  fmt_date(
    columns = month,
    date_style = 11
  ) %>%
  fmt_currency(
    columns = c(monthly_sales_whisky),
    currency = "USD"
  ) %>%
  fmt_currency(
    columns = c(monthly_sales_margarita),
    currency = "USD"
  ) %>%
  cols_label(
    month = "Month",
    monthly_sales_whisky = "Whisky",
    monthly_sales_margarita = "Margarita",
  ) %>%
  opt_stylize(style = 6, color = "gray")
month_gt
Margarita and Whisky Sales by Month
2017-01-01 to 2017-12-31
Month Whisky Margarita
January $2,764.89 $45.88
February $2,800.07 $542.74
March $584.08 $45.88
April $245.92 $45.88
May $1,308.40 $330.79
June $5,022.83 $284.91
July $2,366.38 $89.44
August $2,959.60 $309.01
September $247.46 $43.56
October $1,133.21 $21.78
November $1,542.12 NA
December $1,670.31 $783.12

Whiskey sales spike in June with 22% of spending and the top 5 months of spending on Whisky only account for 66% of annual spending.

Perhaps Regork could benefit from trying to increase demand for margaritas (through promotions or improved display locations) in March, April, September, October, or November as these 5 months only account for 6% of annual spending. Potential promotions could be:
  • tie ins to football/tailgate parties
  • early spring/extending summer
# 3 Seasonal dependence of liquor sales by product_type - Whiskey; line graph
transactions %>%
  left_join(., products, by = "product_id") %>%
  filter(str_detect(product_category, 'LIQUOR')) %>%
  filter(str_detect(product_type, 'WHISK')) %>%
  left_join(., coupons, by = "product_id") %>%
  left_join(., promotions, by = "product_id") %>%
  left_join(., campaign_descriptions, by = "campaign_id") %>%
  mutate(date = date(transaction_timestamp)) %>%
  group_by(date) %>%
  summarize(total_sales_value = sum(sales_value, na.rm = TRUE)) %>%
  ggplot() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  geom_line(mapping = aes(x = date, y = total_sales_value)) +
  labs(
    title = "Seasonal dependence of Whiskey sales",
    subtitle = "Holidays drive higher sales",
    caption = "Data from completejourney package"
  ) +
  scale_y_continuous(name = "Total Dolalr Value of Sales", labels = scales::dollar) +
  scale_x_date(date_breaks = "1 week", date_labels =  "%U")

#4

#4  Seasonal dependence of liquor sales by product_type - MARGARITA; line graph
transactions %>%
  left_join(., products, by="product_id") %>%
  filter(str_detect(product_category, 'LIQUOR')) %>%
  filter(str_detect(product_type, 'MARGARITA')) %>%
  left_join(., coupons, by="product_id") %>%
  left_join(., promotions, by="product_id") %>%
  left_join(., campaign_descriptions, by="campaign_id") %>%
  mutate(date = date(transaction_timestamp)) %>%
  group_by(date) %>%
  summarize(total_sales_value = sum(sales_value, na.rm = TRUE)) %>%
  ggplot() +
  theme(axis.text.x=element_text(angle=60, hjust=1))+
  geom_line(mapping = aes(x = date, y = total_sales_value)) +
  labs(
    title = "Seasonal dependence of Margarita sales",
    subtitle = "Holidays drive higher sales",
    caption = "Data from completejourney package"
  ) +
  scale_y_continuous(name = "Total Dollar Value of Sales", labels = scales::dollar)+
  scale_x_date(date_breaks = "1 week", date_labels =  "%U")

#5

# 5 Seasonal dependence of liquor sales by product_type - SCHNAPP; line graph
df1 <- transactions %>%
  left_join(., products, by = "product_id") %>%
  filter(str_detect(product_category, 'LIQUOR')) %>%
  left_join(., coupons, by = "product_id") %>%
  left_join(., promotions, by = "product_id") %>%
  left_join(., campaign_descriptions, by = "campaign_id") %>%
  mutate(date = date(transaction_timestamp),
         liquor_type = case_when(
           str_detect(product_type, "MARGARITA") ~ "Margarita",
           str_detect(product_type, "WHISK") ~ "Whiskey",
           str_detect(product_type, "SCHNAPP") ~ "Schnapps",
           str_detect(product_type, "RUM") ~ "Rum",
           str_detect(product_type, "VODKA") ~ "Vodka",
           str_detect(product_type, "SPECIALTIES") ~ "Specialties",
           TRUE ~ as.character(product_type))) %>%
  group_by(date) %>%
  summarize(total_sales_value = sum(sales_value, na.rm = TRUE)) %>%
  ggplot() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
  geom_line(mapping = aes(x = date, y = total_sales_value, group = liquor_type, color = liquor_type)) +
  labs(
    title = "Seasonal dependence of Liquor sales",
    subtitle = "Holidays drive higher sales",
    caption = "Data from completejourney package"
  ) +
  scale_y_continuous(name = "Total Dollar Value of Sales", labels = scales::dollar) +
  scale_x_date(date_breaks = "1 week", date_labels =  "%U")

Based on these observations, we believe that new advertising campaigns for the holiday season could be created based on complementary products to margaritas, including salty snacks and various dips as well as the top selling soft drinks. They could use the highest ranking promotional displays from below to promote impulse buying from customers.

Display Location Effect on Sales

df1 <- transactions %>%
  inner_join(products) %>%
  inner_join(promotions)
ad_labs <- c('Not on Ad', 'Interior Page Feature', 'Interior Page Line Item', 'Front Page Feature', 'Back Page Feature', 'Wrap Front Feature', 'Wrap Interior Coupon', 'Wrap Back Feature', 'Interior Page Coupon', 'Free On Interior Page', 'Free on Front Page, Back Page or Wrap')

df1 %>%
  ggplot(aes(x = display_location, y = quantity, color = display_location)) +
  geom_histogram(stat = 'identity') +
  labs(title = "Sales Per Display Location",
       x = "Display Location",
       y = "Number of Sales",) +
  scale_x_discrete(labels = c("Display", "Store Front", "Store Rear", "Front End Cap", "Mid-aisle End Cap", "Rear End Cap", "Side Aisle End Cap", "In-aisle", "Secondary Location Display"), guide = guide_axis(angle = 90)) +
  scale_color_discrete() +
  theme(legend.position = "remove")
The most useful displays would be:
  • In-Aisle
  • Front End Cap
  • Rear End Cap
  • Store Rear

Summary

Our Findings:

We wanted to investigate ways that we could take advantage of trends during the holiday season, and we took two approaches to find usable data.

Section 1: Trends among top spenders

The first business question we wanted to look into was whether there were any remarkable spending patterns during the holiday season that could be looked at closer, and when we looked into the data we found some interesting specific observations. We identified 2 households who were spending at much higher levels than the rest of the data set and wanted to specifically dig into them to identify whether there were any factors we could use to bring up the spending from similar customers.

When digging into the demographics for the two outlier households, we discovered some strange things about one of them. Household #1023 appears to be a missing entry, or one that has no demographic data available. Given this situation we elected to disregard this household moving forward. Household #1453 on the other hand does have demographic data available: they are a married couple in the 45-54 age bracket with one child, annual income of $125-149K, and they own their home. This demographic profile is generally unremarkable and has significant representation in the overall household sample.

Additionally, we looked into their campaigns and in the time period between November and December of 2017 they were targeted by 5 campaigns. 2 of these were type B campaigns, 2 were type C and 1 was type A. Based on this, we can observe the campaigns they were targeted by are unlikely to produce total purchases with value this deviant from the mean.

Based on this analysis of an outlier we would recommend a closer look at the sort of investment made in campaigns of type C, given that they had lower overall sales in this time period compared to campaign types A and B. Additionally, we would suggest the implementation of a loyalty program for customers similar to this one in demographic profile, as they might have similar spending potential but be spreading it over other retailers during this time of year.

Based on the unremarkable demographic profile of the outlier household in question, we made an inference that their spending might have been based on hosting holiday parties, and think there could be additional opportunities for successful campaigns targeting similar demographic profiles during future holiday seasons.

Section 2: Alcohol Sales Trends

When looking into the product side of things, we noted that the highest selling product types are soft drinks, beef, cheese and milk products. Soft drinks were the highest selling category, which added a bit more evidence to our theory that spending for holiday parties may be a good target for an advertising campaign, as soft drinks are generally the non-alcoholic beverage served at these events.

We also found that sales for various types of alcohol are seasonal and that certain ones peaked during our targeted timeframe around the holidays and end of year. One example we wanted to draw attention to specifically was the elevated sales of margaritas during the holiday season and the Super Bowl. We believe this is likely due to holiday parties around Christmas and New Year’s Eve, and can be taken advantage of in a couple of ways, outlined below.

A campaign involving coupons to margarita mix targeting households with age ranges that would typically be hosting holiday parties (likely mid 30s into 50s, also must be over 21 years of age for obvious reasons), this could also include coupons to complementary products such as salty snacks, soft drinks and various dip ingredients which are frequently served at parties.

Positioning the above mentioned products in displays that generally demonstrate higher sales performance, such as front end cap displays, can result in additional sales as customers see the items and think “That would be great for our party this weekend, let’s pick it up” when they might have had alternate plans before. In this way we can promote impulse buying and compound the effectiveness of our campaign by attracting customers who wouldn’t have ended up in the target demographics.

In Conclusion

We have outlined a new marketing campaign which may both help maximize sales to customers in a target demographic and boost sales of products by taking advantage of peak purchasing times. Combined with a targeted loyalty program based on the demographics of our top spenders, this may improve market share and boost revenue during one of our most important times of the year. Additionally, the principles can be applied all year to provide a continuous benefit and the promotional campaign could be used to help boost flagging sales for some products during the summer (specifically margaritas, which we believe may be purchased at restaurants during that time more than in-store).

We hope you will consider investing in this new campaign and we thank you for your time and interest!