The Complete Journey Dataset with Marketing Campaigns

The dataset we have chosen contains the household level transactions of 2500 households who are frequent shoppers at a retailer.The retailer has conducted 30 campaigns, which were broadly categorized to three types, over the span of two years.

We have applied several data wrangling and modeling techniques to breakdown the datset and understand consumer behavior. Please navigate through the below tabs to view our project in detailed.

Introduction

Introduction

Through the dataset, we are trying to figure out how revenue of the retailer is affected through their marketing campaigns. While this is our main goal, we also tried to figure out how each type of marketing campaign affects the sales value and infer the impact of these marketing campaigns on the buying behaviour of the customers.

To understand the revenues of the retailer associated with each type of campaign, we can find the average amount of revenue earned by the retailer based on the household purchases for each type of campaign and see how it impacted the sales. One more approach is understanding how the length of the campaign has encouraged the households to use the coupons.

We started with cleaning and prepaing the data through proper formatting, so that the data is ready and actionable for complicated analyses. The current document provides details on how we have imported data into R and combined multiple datasets to be more accomodating and easy to access.

This analysis would help understand what type of marketing approaches attract the customer and could also provide more direction on how to design and modify the marketing strategies of the retailer.

Packages

Packages required for this project

We have chosed to use the following packages, from thousands of packages available for R.

library(knitr)
library(readr)
library(dplyr)
library(kableExtra)
library(purrr)
library(ggplot2)
library(gridExtra)

Reasons for using these packages in our project are detailed below.

  1. knitr - To use R Markdown more efficiently with flexibility and speed. Knitr enables easy way to set working directory, so that it would be easy for anyone to share and run the R code.

  2. readr - To read the files into R. This package offers optimized functionality, compared to the basic utils package, which is much slower.

  3. dplyr - Allows to easily manipulate data in R, using simple and efficient syntax. More importantly, allows function chaining, preventing any potential cluttering in the code, which in turn makes code reading and writing easier.

  4. kableExtra - Function ‘kable()’ is a light weight table generator coming from ‘knitr’. This package simplifies the way to manipulate the HTML or ‘LaTeX’ codes generated by ‘kable()’ and allows users to construct complex tables and customize styles using a readable syntax.

  5. purrr - Improves readabaility of code and makes reproducability of functions much easier to implement. map functions in this package are extremely easy to use and save time for programmer.
  6. ggplot2 - Used as a replacement for plot functions available in Base R packages. This package provides advanced graphics and offers better visualizations, that are more programmer friendly.

  7. gridExtra - To show the plots in a grid format.

Importing Data

Loading from datasource

For the purposes of our academic project, we have chosen to work on the dataset “Complete Journey customer transactions with marketing campaigns”. Required files for the same can be downloaded from this link.

After downloading the files to a secure location on your computer, set the working directory to the respective folder.The read_csv function automatically imports the data as tibbles, which display data when needed more optimally as compared to data frames. The class of all the below tables is tibble and it may take several minutes for data to load, given the bulk datasets used for this project.

file_names <- c( "campaign_desc","campaign_table","causal_data","coupon",
                 "coupon_redempt","hh_demographic","product","transaction_data")

for (i in seq_along(file_names)) { 
  file <- paste0("data/",file_names[i],".csv")
  assign(file_names[i],read_csv(file))
}

The original purpose of this data is to provide information about the purchases of each household. Collectively, all the tables imported above, shed light on the products purchased by each household, number of transactions made, product descrition, brand of the product, coupon applied, coupon discount, sales value, campaign type, campaign length etc., The variables of the dataset are clearly explained in the following section.

Variables

The original datasource has 34 unique variables stored separately in 8 different tables. The 30 campaigns were conducted on 1584 households out of which, 434 households redeemed 556 coupons. We also have details of total 2500 households, who shopped for almost a million products. The relational schema of the tables along with the respective variables is displayed below

Not all the data provided in its raw form can be useful. Data cleaning is done in the next section for each table. Details of the same are mentioned in the following section.

Data Cleaning

Is Data cleaning required ?

It is essential to clean the data before analyzing it. In this step, we make sure that all the variables are assigned appropriate data types.

For all the tibbles in the dataset, we have changed the identifying columns (IDs) to strings from integers, as IDs should be used as identifiers and to not perform any arithmetic or numerical operations.

Campaign Description

The following code converts the data type of campaign description to factor. A new variable called duration is created in the campaign_desc table that provides information on the number of days the campaign was run.

str(campaign_desc)
campaign_desc$description <- as.factor(campaign_desc$description)
campaign_desc$campaign    <- as.character(campaign_desc$campaign)
campaign_desc             <-  mutate(campaign_desc, duration = 
                                       end_day - start_day)

str(campaign_table)
campaign_table$description <- as.factor(campaign_table$description)
campaign_table[,2:3]       <- map(campaign_table[,2:3],as.character)

Coupon and Coupon Redemption

Below datasets contain the details of coupons. Unique coupon IDs and its mapping to the household that used the respective coupon on a particular day.

str(coupon)
coupon[,1:3]     <-   map(coupon[,1:3],as.character)

str(coupon_redempt)
coupon_redempt[,c(1,3,4)] <-  map(coupon_redempt[,c(1,3,4)],as.character)

Household Demographics

The hh_demographic table contains details of each household, like the size of household, the income group they fall into, Marital Status, Age Group and other demographic information. All the categorical variables are converted to factors

str(hh_demographic)
hh_demographic$household_key  <-  as.character(hh_demographic$household_key)
hh_demographic[,c(2:6,8)]     <-  map(hh_demographic[,c(2:6,8)],as.factor)
hh_demographic$household_size_desc  <-  as.integer(
                                          hh_demographic$household_size_desc)

Product

The product table has description of the products sold by the retailer, that were purchased by the current household group. This table has description of commodities, which have data type strings. product table also has a column called size that has the measurement unit of the respective product in ounces or pounds or in quantity. As this column cannot be generalized is left to be a character.

str(product)
product[,1:2]  <-  map(product[,1:2],as.character)
product[,3:4]  <-  map(product[,3:4],as.factor)

Causal Data

The below table has details of the products sold at a store. This table also provides information on how the particular product was displayed in store and also how the products were advertised in the mailer. The variables, Display and Mailer are categorical, and indicate the location where each product is displayed and advertised respectively.

str(causal_data)
causal_data[,1:2]  <-  map(causal_data[,1:2],as.character)
causal_data[,4:5]  <-  map(causal_data[,4:5],as.factor)

Transaction

Transaction table has details of all the transactions made by the households and can be considered central to the schema. The column Basket ID, refers to each transaction made by the customer and has all the product details. Transaction detail table also provides information on the sales value of each product purchased by the customer, along with the discounts applied through the coupons.

str(transaction_data)
transaction_data[,c(1,2,4,7)] <- map(transaction_data[,c(1,2,4,7)],as.character)
transaction_data$trans_time   <- as.character(transaction_data$trans_time)

Moving on from this step, we started performing analysis on the data, trying to connect data using commonalities amongst these data frames.

Plotting

Functions using ggplot

Visuals and graphs are a great way to view patterns and these help us to get a picture of overalla data without performing extensive analysis. For the purpose of this project, we have defined two functions that we used to create bar plots and box plots.

#Barplots
gg_bar <- function(df, x, colour, xlab = "", ylab = "", main = "", order = F) {
  ifelse(order == T, xr <- with(df, reorder(x, x, function(a) -length(a))), xr <- x)
  ggplot(data = df, aes(x = xr), position = "dodge") +
    geom_bar(  fill = colour ) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = main, x = xlab, y = ylab) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank()) 
}

#Boxplots
gg_box <- function(df, y, colour, xlab = "", ylab = "", main = "") {
 ggplot(data = df, aes(y = y)) +
 geom_boxplot() +
 labs(title = main, x = xlab, y = ylab) +
 theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank())
}

These functions make it easier to maintain the same them for all plots while making it easier for the programmer to implement changes or modify themes across the project.

Product promotion by Display and Mailers

We view the mode of promotion for all products through the below bar graph. Note that from both the bar plots it can be infered, products that are not on display and those that are not in the ad are suppressed.

Form the above plots, it is clear that most number of products are displayed through ‘Secondary Location’ and the most products that are promoted through mail are featured on the ‘Interior’ page.

Viewing the Demographics of Households

Below are the plots summarising the demographic information of the households who purhase products from our retailer. Four plots show details about the Age, Income categor, composition of household and number of kids in each household.

Sales Value and Discounts

Below boxplots show how well distributed the sales value and retail discounts are. Under the sales value of 5$ for each transaction there are no outliers. However, there seem to be many outliers, for each product, when it comes to retail discount.

A closer look into the data reveals that the most of the outliers in total retail discount is due to the quantities of each product purchased. For example, if the discount availed on purchasing an apple is 1$, then the discount for 10 apples is 10 dollars. In this case, the outliers are visible due to the quantity of product but not because of the huge discount applied to a single product.

Data Analysis

Average duration of each campaign

We first inspect how long each type of campaign was run. There are 30 campaigns, belonging to three types as A,B and C.

summarize(group_by(campaign_desc,campaign_desc$description),
          avg_Time = mean(duration))
## # A tibble: 3 x 2
##   `campaign_desc$description` avg_Time
##   <fct>                          <dbl>
## 1 TypeA                           47.2
## 2 TypeB                           37.6
## 3 TypeC                           74.5

Type C campaign has the highest average duration of 74 days.

Sales value of Transactions

Interesting find: 87% of transactions have sales value below 5 dollars.

100*((transaction_data %>% 
    filter(sales_value < 5) %>%
                      nrow())/nrow(transaction_data))
## [1] 87.09925

Retail discounts are always indicated as a negative value or zero. But there are 36 records that have positive discount. These abnormal values could be a mistake in data entry and hence are fixed by multiplying with -1.

Additionally, of all the retail discount values, 97% of them are between -3 and 0. Only 3% purchases have received a discount of more than 3 dollars, per product.

100*((transaction_data %>% 
        filter(retail_disc > (-3) & retail_disc <= 0) %>%
                      nrow())/nrow(transaction_data))
## [1] 97.0955

Also, summary of the data revealed that of all the variables, two have missing values, which wouldn’t even be factoring in final analysis. As there are only 80 missing values of around 2.5 million observations, we can confidently ignore the missing values in the data of interest.

Data Manipulation

Merging Campaign and transaction data

Finalizing the data and preparing it for the further analysis is done in this section. Now that we have cleaned all the tables and formatted them properly, to address the next scope of analysis ,datasets of interest are merged and new comprehensive datasets are created.

To get all the records related to transactions and campaigns, the entire transaction_data with campaign_table and campaign_desc are merged together. But, merging all the campaigns with transaction_data resulted in more than 15 million records, and takes a long time to return the results. Considering that it would be very cumbersome to perform complex analysis and build models with such as huge dataset, we merged the transaction_data with respect to type of campaign.

After merging the datasets and separating them as shown below, we take the sum and average of sales value for all the three types of campaigns.

Type A Campaign

description campaign start_day end_day duration household_key basket_id day product_id quantity sales_value store_id retail_disc trans_time week_no coupon_disc coupon_match_disc
TypeA 18 587 642 55 1 27601281299 51 825123 1 3.99 436 0.00 1456 8 0 0
TypeA 18 587 642 55 1 27601281299 51 831447 1 2.99 436 0.00 1456 8 0 0
TypeA 18 587 642 55 1 27601281299 51 840361 1 1.09 436 -0.30 1456 8 0 0
TypeA 18 587 642 55 1 27601281299 51 845307 1 3.71 436 -0.62 1456 8 0 0

Potential Revenue by Campaign A

## [1] 20583817

Type B Campaign

description campaign start_day end_day duration household_key basket_id day product_id quantity sales_value store_id retail_disc trans_time week_no coupon_disc coupon_match_disc
TypeB 24 659 719 60 31 28167945894 99 824396 1 9.29 324 0.00 1243 15 0.0 0.0
TypeB 24 659 719 60 31 28167945894 99 833919 1 5.99 324 0.00 1243 15 0.0 0.0
TypeB 24 659 719 60 31 28167945894 99 870426 1 1.70 324 -1.09 1243 15 -0.5 -0.5
TypeB 24 659 719 60 31 28167945894 99 882190 1 2.49 324 -0.40 1243 15 -0.5 -0.5

Potential Revenue by Campaign B

## [1] 18795388

Type C Campaign

description campaign start_day end_day duration household_key basket_id day product_id quantity sales_value store_id retail_disc trans_time week_no coupon_disc coupon_match_disc
TypeC 15 547 708 161 143 28154518085 98 844165 1 1.00 446 -0.59 2308 15 0 0
TypeC 15 547 708 161 143 28154518085 98 935546 1 0.99 446 0.00 2308 15 0 0
TypeC 15 547 708 161 143 28154518085 98 992621 2 1.38 446 0.00 2308 15 0 0
TypeC 15 547 708 161 143 28154518085 98 1139216 1 7.99 446 -1.50 2308 15 0 0

Potential Revenue by Campaign C

## [1] 4456140

Interesting Find: Campaign A has the highest potential sales value amongst all the campaigns, totalling up to 20.5 million dollars, which is close to the poetential sales value of Type C and Type B campaigns combined.However, the average sales value for all the three types of campaigns is around 3 dollars.

Master Tables

In addition to the above tables, we create another master table combining the details of coupon, coupon redemption and product id, as shown below.
household_key day coupon_upc campaign product_id manufacturer department brand commodity_desc sub_commodity_desc curr_size_of_product
1 421 10000085364 8 100512 1765 GROCERY National FLUID MILK PRODUCTS FLUID MILK WHITE ONLY 1/2 GAL
1 421 10000085364 8 527731 1194 GROCERY National REFRGRATD JUICES/DRNKS DAIRY CASE 100% PURE JUICE - O 64 OZ
1 421 10000085364 8 1054539 1194 GROCERY National REFRGRATD JUICES/DRNKS DAIRY CASE 100% PURE JUICE - O 64 OZ
1 421 10000085364 8 802268 1487 GROCERY National YOGURT YOGURT NOT MULTI-PACKS 6 OZ

From the above analysis, it is clear that Type A campaign yielded highest revenue, and is also the campaign that ran for longest duration. In our next steps, we would like to find out whether duration of campaign has significant impact on the revenue generated.

Coupon Discounts

How did our coupons work ?

We only dealt with the revenues and coupons discounts overall until now. But, there are three types of Coupons discount provided to the customer as detailed below,

  1. Normal Coupon Discount - Discount on product through coupons reedemed by the customer. These are the coupons distributed to customer through various campaigns.

  2. Coupon Match Discount - This does not affect the sales value of the product. The discount is provided to customer, based on promotions by the manufacturer and the discounted amount is paid back to the retailer by the manufacturer.

  3. Loyalty Card Discount - This discount is applied to the customers who are enrolled in a loyalty card program. The assumption is that this particular discount does not have anything to do with the coupon IDs or campaigns.

Based on these parameters, we created two more columns that identify whether a transaction has discounts based on coupons or loyalty card.

Transactions with Loyalty Card

The total number of transactions that used loyalty card were,

## [1] 1303062

From the overall 2.5 Million million transactions that occured, about 50% transactions have obtained loyalty card discount. This is a very good rate, implying that relevant number of our customers would like to maintain long term relation with the retailer.

Transactions with Coupons

With a similar approach, we tried to find the number of transations that utilised coupon discounts. The coupon discounts include both the coupons provided by retailer and manufacturer of each product in each transaction.

## [1] 36422

Surprisingly, only 1.4% of the transactions have a valid coupon redeeemed. This means that the marketing campaigns are definitely not tapping enough customers.

Transactions with Coupons, but not loyalty Cards

To explore further, we look at customers who were using coupons even though they were not loyal customers of the retailer.

## [1] 16373

About 45% of the customers who redeemed coupons, were not a part of loyalty program. Since, the campaigns seem to have a positive impact on those who were not regular customers, we tried to focus on the demographics of this category. This approach allows us to understand what category of customers need to be targeted the most and which aspects of the marketing created most impact to attract unfamiliar customers.

Extracting demographic information

We have used join operations to extract the informations of customers that used coupons, without loyalty card. It is important to note that the demographic informations is not fully available for all customers. Below are tables have the information of our potential target customers.

household_only_coupon <- transaction_data %>% 
                          filter(loyalty == "No" & coupon_used == "Yes") %>% 
                          select(household_key) %>% unique()
                          
demographic_only_coupon <- left_join(household_only_coupon,
                                     hh_demographic,by = "household_key")
m <- demographic_only_coupon
demographic_only_coupon  <-  m[rowSums(is.na(m[,-1])) != ncol(m[,-1]),]

Below are the details of customers who used coupons at least once, without loyalty cards

Unfortunately, these plots do not tell much about this segment of customers, as they match the demographic plots of overall household transactions. However, having demographic information of all the households in the dataset would improve the quality of our findings.

Wrapping up

To sum it all

Comparing the average spend of households would provide us information about how frequent our transactions are with the customers. Also, grouping by each of the househlod as shown in below code would provide us household wise information.

Average spend of households

avg_spend <- transaction_data %>% 
              group_by(household_key) %>% 
              summarize(Total_spend = sum(sales_value),
                        no_vistis = length(unique(basket_id)),
                        avg_sales_purchase = 
                          sum(sales_value)/length(unique(basket_id))) 

avg_spend <- avg_spend[order(-avg_spend$Total_spend),]

Below boxplot shows the average spending of the households. Spend of customers ranged from 2 to 1664$. Our most loyal customer has visited the retailer 1300 times in the span for 600 times amounting to a highest total spend of 11000$

One more interesting fact is that our highest spent customer has made purchases worth ‘38000’ in just 600 visits, with an average spend of 63$

Running total of Sales value

It is also important to note that only 20% of the total households are contributing to about 50% of the total revenue.

To find the same, we used the cumulative sum of average amount spent by each household.

cumsum <- 100*cumsum(avg_spend$Total_spend)/sum(avg_spend$Total_spend)
running_totals <- cbind(index = c(1:2500),as.data.frame(cumsum))

We can infer that same through our plot as shown below.

The graph shows that about 1000 households are contributing to three fouths of sales, while the remaining 1500 are only adding to final quarter of sales. Through this, we can can clearly suggest that concentrating on regular loyal customers would be much more profitable to the retailer.

Revenue generated with respect to Day

Though we did not have the details of dates provided in the dataset, we assumed Day 1 to be the beginning of the period when campaigning has begun. To find more information on the same, we grouped all our transactions by eah day in the dataset.

day_revenue <- transaction_data %>% 
                group_by(day) %>% 
                summarize(Revenue = sum(sales_value)) 
                          
day_revenue$day <- as.integer(day_revenue$day)

After we have this data ready, a plot of Revnue with respect to day, shows that the Revenue has increased continuously. The graph has been increasing through out the 1st quarter and became more stable towards the end.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Futurescope

Although we have performed extensive analysis on the dataset, there is a lot more scope to this project. To understand the consumer behavior and improve marketing techniques, we would need more demographic data and details regarding the coupons.

We tried to identify the number of coupons or unique coupons that are distributed as a part of each campaign, but it seemed to be beyond the scope of this project. Due to the same, we were unable to figure out the Average Number of coupons redeemed per each campaign and for each household or the distinct number of coupons issued for each campaign.

Efficiency of the campaigns is calculated on basis of revenue here, but it can be built upon to identify the timeframes of coupon redemption, using the week and day details given for each transaction.

One more interesting direction to look at this would be to perform Market basket analysis of the products and interpret what kind of products and bands are most purchased, with ot without discounts.