Synopsis

Analysis Briefing

In the year 2017, there were 27 different campaigns for Regork. There were 39,841 total transactions from 743 unique households that participated in campaigns that year. There are three different types of campaigns:

  • Type A

  • Type B

  • Type C

Purpose of Analysis

The purpose of this analysis is to understand which campaigns had the largest success in bringing in revenue, as well as what demographic participated in these successful campaigns. With this data, we will be able to make proper business decisions on:

  1. What campaigns we would like to bring back.
  2. When to bring them back.
  3. Who to market these campaigns to in the future.

I will be using the Complete Journey data set to analyze this problem. We only have demographic data for 801 households. This data analysis will only describe households with demographic information.

What is Being Analyzed

In this exploratory data analysis, I will be conducting 3 different analyses:

  1. Campaign Analysis
  2. Demographic Analysis
  3. Time-Series Analysis

Packages Required

Below are the perquisite libraries that need to be imported to conduct my exploratory data analysis:

  • tidyverse - Tidy the data

  • dplyr - Manipulate the data

  • ggrepel - Label the data

  • lubridate - Conduct data manipulation

  • completejourney - Where the data is coming from

  • DT - Used to display the data on the screen in a scroll able format

library(tidyverse)
library(dplyr)
library(ggrepel)
library(lubridate)
library(completejourney)
library(DT)

Data Preparation

This section contains all the procedures followed in getting the data ready for analysis.

Data Import

Complete Journey Data Set in R

Data Import Code:

library(completejourney) 

transactions <- get_transactions()

Data Manipulation

In order to conduct this analysis, we must join together the following data sets:

  • Campaigns
  • Demographics
  • Transactions
  • Products

Along with that, dropping variables that won’t be of use initially will make the data a bit more clean.

df <- campaigns %>%
  inner_join(demographics, by = 'household_id') %>%
  inner_join(transactions, by = 'household_id') %>%
  inner_join(products, by = 'product_id') %>%
  inner_join(campaign_descriptions, by = 'campaign_id') %>%
  filter(
    as.Date(transaction_timestamp, format = "%Y-%m-%d") > start_date,
    as.Date(transaction_timestamp, format = "%Y-%m-%d") < end_date
    )

class(df)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
colnames(df)
##  [1] "campaign_id"           "household_id"          "age"                  
##  [4] "income"                "home_ownership"        "marital_status"       
##  [7] "household_size"        "household_comp"        "kids_count"           
## [10] "store_id"              "basket_id"             "product_id"           
## [13] "quantity"              "sales_value"           "retail_disc"          
## [16] "coupon_disc"           "coupon_match_disc"     "week"                 
## [19] "transaction_timestamp" "manufacturer_id"       "department"           
## [22] "brand"                 "product_category"      "product_type"         
## [25] "package_size"          "campaign_type"         "start_date"           
## [28] "end_date"

Along with that, dropping variables that won’t be of use initially will make the data a bit more clean:

df <- df %>%
  select(-c(manufacturer_id, package_size, retail_disc))

Data Preview

datatable(head(df,50))

Meta Data

class_meta <- sapply(df, class)
null_meta <- sapply(df, function(df) sum(is.na(df)))
var_name <- colnames(df)

df_meta <- cbind(var_name, class_meta, null_meta)
rownames(df_meta) <- NULL
colnames(df_meta) <- c('Variable Name', 'Class', 'NULL Vals')
datatable(df_meta)

Exploratory Data Analysis

This section dives into the analysis that solves the questions in the Synopsis.

Campaign Analysis

As mentioned before, there are 3 different Campaign Types: Type A, Type B, Type C. Below shows the amount of campaigns in each Campaign Type, as well has how many transactions users in these campains completed in 2017.

transacs <- df %>%
  distinct(basket_id, .keep_all = TRUE) %>%
  group_by(campaign_type) %>%
  summarise(
    Transactions = scales::comma(n())
    ) 


ids <- campaign_descriptions %>%
  group_by(campaign_type) %>%
  summarise(campaigns = n())


datatable(cbind(ids, transacs)[-3])

Type B had the most campaigns, while Type A had the least. While Type A had the least, Type A still brought in the most total customers that completed a transaction, showing Type A had very efficient campaigns. Now we want to at this data per campaign. Below shows the average amount of households who completed a transaction for each campaign, grouped by each type of campaign:

df_1 <- df %>%
  distinct(basket_id, .keep_all = TRUE) %>%
  group_by(campaign_id, campaign_type) %>%
  summarise(trans = n())
  
df_1 %>%
  group_by(campaign_type) %>%
  summarise(avg_hh = round(mean(trans))) %>%
  ggplot(aes(x = campaign_type, y = avg_hh)) + 
  geom_col(aes(fill = campaign_type)) +
  labs(
    title = 'Average Households Per Campaign by Campaign Type',
    
    x = 'Campaign Type',
    y = 'Average Households'
  ) +
  theme(
    legend.position = 'none',
    plot.title = element_text(size = 15, face = 'bold')
  ) +
  geom_text(aes(label = scales::comma(avg_hh)), vjust = -0.3) +
  scale_y_continuous(labels = scales::comma)

Below shows the amount of revenue brought in by customers who participated in campaings by campaign ID:

df %>%
  distinct(basket_id, .keep_all = TRUE) %>%
  group_by(campaign_id, campaign_type) %>%
  summarise(sales = sum(sales_value)) %>%
  arrange(desc(sales)) %>%
  ggplot(aes(x = fct_reorder(campaign_id, sales), y = sales, fill = campaign_type)) +
    geom_col() +
    coord_flip() +
    labs(
      title = "Sales by Campaign",

      x = "Campaign ID",
      y = "Sales"
    ) +
  scale_fill_manual(name = 'Campaign Type',
                    values = c("#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
                    ) +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 300000)) +
  geom_text(aes(label = scales::dollar(sales)), hjust = -0.10, size = 3) +
  theme(plot.title = element_text(size = 15, face = 'bold')) 

It can be seen that campaign IDs 18, 13, and 8 performed very well and explains the strong performance for Type A campaigns. I will dive deeper into these campaign in the next section to see what demographic contributed to these transactions.

Demographic Analysis

To further dive into these these campaigns, we would like to see what demographic best makes up these campaigns. Below is a Heat Map showing the distribution of total sales by income and age from customers who participated these campaigns:

df %>%
  distinct(basket_id, .keep_all = TRUE) %>%
  filter(campaign_id %in% c(18, 13, 8)) %>%
  group_by(income, age) %>%
  summarise(
    sales = sum(sales_value)
  ) %>%
  ggplot(aes(x = income, y = age, fill = sales)) +
    geom_tile(color = 'black') +
    geom_text(aes(label = scales::dollar(sales)), color = 'black', size = 2) +
    scale_fill_gradient2(low = "#075AFF",
                        mid = "#FFFFCC",
                         high = "#FF0000") +
    coord_flip() +
    labs(
      title = 'Total Sales by Income and Age',
      subtitle = 'The data shows that 45-54 year olds making 50-74K a year bring in the 
largest amounts of revenue for all income brackets and ages.',
      x = 'Income Bracket',
      y = 'Age',
      fill = 'Total Sales ($)'
    ) +
    theme(plot.title = element_text(face = 'bold', size = 15))
## `summarise()` has grouped output by 'income'. You can override using the
## `.groups` argument.

Below is the top selling product categories for campaigns 18, 13, 8 filtered for 45-54 year olds making 50-74K a year:

df %>%
  filter(
    campaign_id %in% c(18, 13, 8), 
    str_detect(income, regex('^50-74K')),
    str_detect(age, regex('45-'))
         ) %>%
  group_by(product_category) %>%   
  summarise(sales = sum(sales_value)) %>% 
  arrange(desc(sales)) %>%
  head(10) %>%
  ggplot(aes(x = fct_reorder(product_category, sales), y = sales)) + 
  geom_col(color = 'black', width = 0.95, fill = '#FF9F47') +
  coord_flip() + 
  labs(
    title = 'Top Selling Product Categories',
    y = 'Sales',
    x = 'Product Category'
  ) +
  theme(
    legend.position = 'none',
    plot.title = element_text(size = 15, face = 'bold')
  ) + 
  scale_y_continuous(labels = scales::dollar_format())

Time-Series Analysis

Now we want to look at the Monthly break down of sales among the main demographic for the best performing campaigns. Below shows the by month time-series for 45-54 year olds making 50-74K a year for campaign 13:

df %>%
  distinct(basket_id, .keep_all = TRUE) %>%
  filter(
    campaign_id %in% c(18, 13, 8), 
    str_detect(income, regex('^50-74K')),
    str_detect(age, regex('45-'))
    ) %>%
  mutate(month = month(transaction_timestamp)) %>%
  group_by(month, kids_count, campaign_id) %>%
  summarise(sales = sum(sales_value)) %>%  
  arrange(desc(sales)) %>%
  ggplot(aes(x = month, y = sales, fill = campaign_id)) +
    geom_area(position = 'stack') +
    scale_x_continuous(breaks = 1:12, labels = month.abb) +
    scale_y_continuous(labels = scales::dollar_format()) +
    labs(
      title = 'Total Sales by Month for the Top Consumers',
      subtitle = 'The data below shows that campaigns 8 and 13 performed very well from
      when it was deployed to end. Campaign 18 started slow but became high in revenue by November.',
      x = 'Month',
      y = 'Total Sales',
      fill = 'Campaign'
    ) +
    theme(
      plot.title = element_text(size = 15, face = 'bold')
    )

Summary

Now we have enough data to make insights about the questions we set out to find. The following is a summary of insights from the analysis:

  • Type A Campaigns had the least amount of campaigns, but brought in the most net revenue out of all the types.

  • Type A campaigns had the most efficient transactions/campaigns ratio out of any other type.

  • Type A Campaign ids 18, 13, and 8 were an outlier for bringing in high revenue in 2017. These make up the bin that will be called BPC (Best Performing Campaigns).

  • 45-54 year olds making 50-74K a year made up a majority of the BPC. This demographic makes up the bin that will be called SCD (Strongest Consuming Demographic).

  • Campaigns 8 and 13 performed very well from when it was deployed to end when looking at the data for the SCD.

  • Campaign 18 started slow but became high in revenue by November when looking at the data for the SCD.

We can use these insights to make the following recommendations:

  1. Bring back Type A campaigns 18, 13, and 8.
  2. Bring back Campaigns 13 and 8 at the same times as last year. Improve marketing when first deploying campaign 18 in the winter time.
  3. Market these campaigns to the 45-54 year old age group making 50-74K a year.