Data Wrangling Final Project

Did Regork Campaigns Work?

Introduction

Regork started 27 promotional campaigns that lasted for the duration of 1 to 8 months at different points of time in the year 2017. These campaigns were divided into 3 types: Type A, Type B and Type C. In this project, we ranked these campaigns based on their effectiveness in terms of gross revenue generated during their run. We also try to answer whether these campaigns were able to bring about a permanent and positive impact in the customer spending behavior or not. This helps the business to select the best campaigns and employ similar promotional strategies in future.

Packages

To reproduce the code and results throughout this project following packages need to be loaded:

library(completejourney) #Import completejourney datasets
library(tidyverse)       #visualize, transform and tidy the data
library(plotly)          #showing plots and data at the same time
library(ggrepel)         #avoid label overlapping
library(plotly)          #plot interactive charts
library(ggplot2)         #plot charts
library(dplyr)           #data manipulation
library(DT)              #display the data in scrollable format
library(lubridate)       #perform date-time operations
library(ggalt)           #alter coordinate system for dumbbell chart
library(RColorBrewer)    #select color palette

Data Preparation

completejourney provides access to data sets characterizing household level transactions over the year 2017 from a group of 2,469 households who are frequent shoppers at a grocery store. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are captured.

Accessing the data

transactions <- get_transactions()
promotions <- get_promotions()

data(package = "completejourney")

There are eight built-in data sets available in this package (see data(package = ‘completejourney’). The data sets include:

campaigns: campaigns received by each household

campaign_descriptions: campaign metadata (length of time active)

coupons: coupon metadata (UPC code, campaign, etc.)

coupon_redemptions: coupon redemptions (household, day, UPC code, campaign)

demographics: household demographic data (age, income, family size, etc.)

products: product metadata (brand, description, etc.) promotions_sample: a sampling of the product placement in mailers and in stores corresponding to advertising campaigns

transactions_sample: a sampling of the products purchased by households. Full transactions data can be accessed using get_transaction() function provided with the package.

promotions: Full promotions data can be accessed using get_promotions() function provided with the package.

Schema Details

Relational Schema diagram for completejourney dataset

Exploratory Data Analysis

Let us start by calculating what fraction of targeted consumers that actually participated in the campaigns. This can help us to determine the efficiency of these campaigns.

Redemption rate:

List of households that were targeted in campaigns:

targeted_households <- campaigns %>% distinct(household_id)

List of households that redeemed their coupons:

redeemed_households <- coupon_redemptions %>%
distinct(household_id)

Redemption rate can be calculated as ratio of households that participated in the campaign by how many households were targeted.

round(count(redeemed_households)/count(targeted_households)*100, digits = 1)

##      n
## 1 26.3

Redemption sales

# Total sales in the year 2017
Net_sales <- transactions %>% 
  summarize(total_sales = sum(sales_value, na.rm = T))

# Sales by redeeming households
redeemed_household_sales <- transactions %>% 
  inner_join(redeemed_households, by = "household_id") %>% 
  summarize(total_sales = sum(sales_value, na.rm = T))

# Ratio of redeemed sales to net sales
round(redeemed_household_sales/Net_sales*100, digits = 1)

##   total_sales
## 1        34.7

# Ratio of redeeming households to total households
round(count(redeemed_households)/count(distinct(transactions, household_id))*100, digits = 1)

##      n
## 1 16.6

Hence, it implies that the 34.7% of total sales come from 16.6% (410) households that were targeted and participated in all the campaigns in the year 2017. A redemption rate of 26.3% signifies that the customers show satisfactory response to the campaigns. Now, Let us look at the duration of all the campaigns that Regork ran. Compiling the transactions and campaigns data by performing date operations.

transactions <- transactions %>%
  mutate(transaction_date = lubridate::as_date(transaction_timestamp))
campaign_graph <- campaign_descriptions %>%
  mutate(duration = end_date - start_date)
campaign_graph$campaign_id <- factor(campaign_graph$campaign_id,
                                     levels = as.integer(campaign_graph$campaign_id))

Plotting the campaign data:

campaign_graph %>%
  ggplot(aes(x = start_date, xend = end_date, y = campaign_id, group = campaign_id)) +
  geom_dumbbell(color = "slateblue1", size = 1.5) +
  geom_point(aes(x = start_date), color = "slateblue4", size = 2.5) +
  geom_point(aes(x = end_date), color = "slateblue4", size = 2.5) +
  geom_text(aes(label = duration), hjust = 2, vjust = 0.5,  size = 3) +
  labs(x = NULL, y = NULL,
       title = "Campaign Durations for all 27 campaigns",
       subtitle = "Start and End Dates for each campaign (along with duration)",
       caption = "Source: completejourney") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.background = element_rect(fill = "#FFFFFF"),
        panel.background = element_rect(fill = "#FFFFE0"),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(),
        axis.ticks = element_blank(),
        legend.position = "top",
        panel.border = element_blank())

We can see that the campaigns last for the duration of 1 to 7 months throughout the year. Also, it should be noted that some campaigns are run in the first half of the year while others in that later half. This will impact the types of products and their prices that we encounter in the transactions data. To gauge performance of campaigns, let us look at the sales revenues generated by different campaigns grouped by campaign types.

colour.pallet <- brewer.pal(3, "Paired")

transactions %>%
  inner_join(demographics, by = "household_id") %>%
  inner_join(coupon_redemptions, by = "household_id") %>%
  filter(redemption_date == transaction_date) %>%
  inner_join(campaigns, by = c("household_id", "campaign_id")) %>%
  inner_join(campaign_descriptions, by = "campaign_id") %>%
  group_by(campaign_type, campaign_id) %>%
  summarize(total_sales = sum(sales_value)) %>%
  mutate(Group = gsub('\\d+','',campaign_type)) %>%
  ggplot(aes(x = reorder(campaign_id, -total_sales), y = total_sales, fill = campaign_type)) +
  geom_col() +
  geom_text(aes(label = scales::comma(round(total_sales,0))),hjust = 0.5, vjust = -0.25,  size = 2) +
  scale_fill_manual(name = "Campaign Type", values = colour.pallet) +
  facet_wrap(.~Group,scales = 'free', strip.position = "bottom") +
  scale_y_continuous(name = "Total Sales (in $)", labels = scales::dollar) +
  labs(
    title = "Total spend at Regrok by All Households Targeted under each Campaign",
    subtitle = "Each campaign belongs to one of the three types - A, B & C",
    x = "Campaign IDs by their Type") +
  theme(strip.placement  = "outside",
        panel.spacing    = unit(0, "points"),
        strip.background = element_blank(),
        strip.text       = element_text(face = "bold", size = 12))

Analyzing total spend by campaigns show that campaign type A with only 4 campaigns generated the highest sales revenues per targeted households. Type B campaigns comprises of largest set of campaigns i.e. 18. The highest campaign sale of Type B is comparable to Least campaign sale in Type A. However, the lesser average sale per campaign is compensated by the overall volume of campaigns. Type C, with only 5 campaigns in it’s kitty is the under-performer. To determine which campaigns were successful and by what degree, let us look at the sales at different times throughout the year 2017.

campaigns_phase_sales <- campaigns %>% 
  inner_join(campaign_descriptions,by="campaign_id") %>% 
  inner_join(transactions, by = "household_id") %>% 
  mutate(days = difftime(start_date, end_date, Days)) %>% 
  mutate(phase = dplyr::case_when(
    as_date(transaction_timestamp) >= start_date & as_date(transaction_timestamp) <=end_date ~ "2. DURING",
    as_date(transaction_timestamp) < start_date ~ "1. BEFORE",
    as_date(transaction_timestamp) > end_date  ~ "3. AFTER" )) %>% 
  group_by(campaign_id,phase) %>% 
  summarize(sales_each_phase = sum(sales_value, na.rm = T))

campaign_days <- campaign_descriptions %>% 
  mutate(
    start_date = dplyr::case_when(
      year(start_date)< 2017 ~  as_date('2017-01-01'),
      TRUE ~ start_date),
    end_date = dplyr::case_when(
      year(end_date)> 2017 ~  as_date('2017-12-31'),
      TRUE ~ end_date),
    during_days=as.numeric(difftime(end_date, start_date, Days))+1,
    before_days=as.numeric(difftime(start_date, as_date('2017-01-01'), Days))/86400+1,
    after_days=as.numeric(difftime(as_date('2017-12-31'),end_date, Days))/86400 - 1)

We dissect the year 2017 into 3 sprints relative to the campaign duration.

Phase 1: Before Campaign
Phase 2: During Campaign
Phase 3: After Campaign

This helps to determine whether a campaign was able to bring about a permanent change in consumer behavior or not. We look at top 6 highest revenue generating campaigns, irrespective of their types.

# Graph
campaigns_phase_sales %>% 
  inner_join(campaign_days, by = "campaign_id") %>% 
  filter(campaign_id %in% c(18,13,8,16,14,27)) %>% 
  mutate(per_day_sales = dplyr::case_when(
    phase == '1. BEFORE' ~  round(sales_each_phase/before_days,0) ,
    phase == '2. DURING' ~  round(sales_each_phase/during_days,0)  ,
    phase == '3. AFTER' ~   round(sales_each_phase/after_days,0)
  )) %>% 
  select(campaign_id,phase,per_day_sales)  %>% 
  ggplot(aes(x = phase, y = per_day_sales, fill = phase)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_y_continuous( breaks = seq(0, 10000, by = 1000), labels = scales::dollar)+
  geom_text(aes(label = scales::dollar(per_day_sales)), hjust = 0.5, vjust = -0.25,  size = 2.5) +
  scale_fill_brewer(palette = "Blues") +
  facet_wrap(~ campaign_id, nrow = 2)

Results from Campaign Analysis:

In some cases, we see that the sales of phase DURING is lesser as compared to AFTER and BEFORE, this is because while the households redeem the coupons during their purchases, their average sales value is reduced because of coupon discount.
In campaigns 13 and 14, we see that per day sales generated by targeted households comprising these campaigns in phase AFTER is greater than phase DURING. These sales however are lesser as compared to their BEFORE phase. This is is because the campaigns were executed in the later half of the year, hence the daily average is lesser after the campaign was concluded. We do not have enough data to ascertain whether these campaigns were successful or not.
Campaign 16, although could not increase the total sales, but keeps the average sales still consistent before and after. Also, it should be noted that it was executed after during October, 2017. Hence the number of days after campaign are lesser as compared to before. So, consistent daily average indicates that the campaign was successful.
Campaign 27 was executed in first half of 2017. The after campaign sales of these households decreased as compared to before and during campaign sales. Hence the campaign was not successful.
In campaign 18, not only the households increased their spending during the campaign, the average daily spending also increased after the campaign was concluded. Hence campaign 18 is successful.

After concluding that Campaign 18 is the best out of top sales grossing campaigns, let us look at the products that comprised these sales figure. Calculating the average sales per day for these products before and after campaign.

prod_before <- coupon_redemptions %>%
  inner_join(campaign_descriptions, by = "campaign_id") %>%
  filter(campaign_id == 18) %>%
  inner_join(coupons, by = "coupon_upc") %>%
  inner_join(products, by = "product_id") %>%
  inner_join(transactions, by = "product_id") %>%
  filter(transaction_date < start_date) %>%
  group_by(product_id) %>%
  summarize(tot_sales_before = sum(sales_value)/302) %>%
  arrange(desc(tot_sales_before))

prod_after <- coupon_redemptions %>%
  inner_join(campaign_descriptions, by = "campaign_id") %>%
  filter(campaign_id == 18) %>%
  inner_join(coupons, by = "coupon_upc") %>%
  inner_join(products, by = "product_id") %>%
  inner_join(transactions, by = "product_id") %>%
  filter(transaction_date >= start_date) %>%
  group_by(product_id) %>%
  summarize(tot_sales_after = sum(sales_value)/62) %>%
  arrange(desc(tot_sales_after))

Plotting the products for which the daily sales difference was maximum.

prod_before %>%
  inner_join(prod_after, by = "product_id") %>%
  mutate(difference_after = tot_sales_after - tot_sales_before) %>%
  inner_join(products, by = "product_id") %>%
  group_by(product_category) %>%
  summarize(difference_final = sum(difference_after)) %>%
  top_n(10) %>%
  ggplot(aes(x = product_category, y = difference_final)) +
  geom_segment(aes(x = reorder(product_category, -difference_final), y = 0, xend = product_category, 
                   yend = difference_final), color = "DarkOrchid4", 
               size = 1) +
  geom_point(size = 3, color = "DarkOrchid") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", color = "Maroon"),
        plot.subtitle = element_text(size = 10, face = "italic"),
        axis.text.x=element_text(size = 10, angle = 90,hjust = 0.95,vjust=0.2)) +
  scale_y_continuous(name = "Increase in Avg Per Day Sales", labels = scales::dollar) +
  labs(
    title = "10 Best-selling Products that were included in Campaign ID 18",
    subtitle = "This graph represents the top 10 products that were sold using coupons from Campaign ID 18, Regrok's most successful campaign",
    x = "Product Category")

## Selecting by difference_final

We have identified that the product categories like Smoked Meat, Turkey, and Citrus had the most positive impact as part of Campaign 18. Regork could include these and similar products as part of other campaign types as well to increase the efficiency of those campaigns.

Conclusion

Our study was focused on Regork’s marketing campaigns – were the strategies successful or not and what affects did they have on products. As a first step, we found which were our longest running campaigns during 2017. The longest running campaign in this dataset was identified to be Campaign 15 whereas the usual duration for most of the campaigns was 32-50 days.

After evaluating the duration of all campaigns, we wanted to find out which campaigns were the most rewarding in term of Revenue. Once we plotted the data, it was apparent that Type A campaigns had higher revenues even though they were run for a similar duration. In fact, our longest running campaign (campaign 15) wasn’t close to our best performing campaigns in terms of revenue.

So, if duration isn’t a recipe for success, then what is? We studied six of the top performing campaigns to see if there was any difference in average per day sales during the three phases for each campaign i.e. before, during, and after. Only one campaign clearly demonstrated substantial difference in revenue after the campaign was executed. We saw a $450 per day increase in sales after this campaign was executed for the same households which participated in the campaign.

As a final step, we wanted to see what products saw a demand increase as a result of this successful campaign. We identified that product categories like Smoked Meat, Turkey, and Citrus had the most positive impact as part of Campaign 18. In conclusion, we analyzed that Type A campaigns are generally most effective whereas duration of campaign length does not seem to be a decisive factor. Regork successfully implemented Campaign 18 and saw increased in demand form the targeted households. Some product categories also seem to be most sensitive to campaigns like Smoked Meat, Turkey, and Citrus. Therefore we would recommend Regork to implement similar strategies in their future campaigns as they did in Campaign 18.