Synopsis

  • Problem Statement

XYZ is an American retailing company with a distributed network of retail offline stores and strong loyal customer database. Marketing team of XYZ runs multiple campaigns across year to promote sales for targeted product category or brand for specific time period. XYZ also send weekly Mailers to customers to increase impressions and eventually purchase for certain product lines. XYZ has run thirty targeted campaigns in last two years and wants to see the performance of campaigns in terms of customers’ response rate, increased number of transactions and overall revenue generated. Marketing team will also like to know impact of customer demographics and psychographic characteristics on campaigns’ performance along with campaign attributes

  • Methodology

Problem statement is divided into five smaller problems which are combined together to achieve the desired output. First part of the puzzle talks about XYZ’s customers profile which will help marketing team to understand who our customers are. As a second step, effect of each demographic and psychographic attribute on probability of customer responding to a particular campaign has been studied. Similarly, campaign attributes and their impact on overall response rate has been analyzed in section three and four of the analysis. In the end, results from the above section are collated to see hybrid effect of customer and campaign attributes on model performance parameters

  • Analytical Approach

Descriptive Analysis has been performed to study customer and campaign profiles in section one and three respectively. Intuitive m visuals has been used to showcase various characteristics of both the entities. Inquisitive analysis have been carried out in section two and four to study impact of the attributes on campaigns performance. Hypothesis testing is used to see statistical significance of the impact. Combined effect has been studied with the help of random forest keeping campaign response rate as dependent variable for each customer X campaign pair.

  • Consumption

Campaign performance analysis can be a great asset to design future campaigns in combination with customer and product segments. Targeted marketing campaigns engineered by taking informed decision will lead to higher response rate and incremental revenues.

Requirements

Following packages has been used to perform the analysis in this report

library(magrittr)     # Sequential piping operations
library(ggplot2)      # Data visualization
library(tidyr)        # Data cleaning
library(dplyr)        # Data trandformation
library(data.table)   # Data importing
library(randomForest) # Tree based regressor model building
library(kableExtra)   # For printing tables
library(knitr)        # For printing tables

Data Preparation

  • Source

Data is loaded from Complete Journey customer transactions (with marketing campaigns) which contains customer transaction information for retailer XYZ

  • Understanding

    • Household Demographic data contains demographic characteristics i.e. age, marital status, income etc. for portion of households with no missing or abnormal values
    • Transactions data captures who(household) bought which product at what time by paying how much sales value and did he/she get any discount during that transaction. Dataset does not have any missing values. Data is available from 1st Jan 2015 to 30th December 2016. However, certain abnormalities have been found in the following variables
      • Quantity contains zeros as well as outlier values
      • Sales Value contains outliers
      • Retail Discount contains outliers as well as positive values(entries in discount columns should always be negative)
    • Campaign table lists the campaigns received by each household in the study. Data contains no missing or abnormal values
    • Campaign Description table contains campaign duration. Campaign coupons can only be active in that time window
    • Product table contains information product grouping at department, commodity and sub commodity level as well as brand, manufacturer and size of the product
    • Coupon table has records of each coupon delivered to each household with corresponding campaign and product information
    • Coupon Redemption table captures each coupon redeemed by each household on what day under which campaign
    • Causal Data provides information about product placement in store display sections and weekly mailers at product, store and week level
  • Assumption

Day 1 corresponds to 1st Jan 2015 and hence following are the time period of the available datasets

Table Start End
Transaction 2015-01-01 2016-12-11
Campaign Description 2015-08-12 2016-12-19
Coupon Redemption 2015-08-13 2016-12-04
Causal 2015-02-26 2016-12-01
  • Importation

Datafiles has been stored in dropbox location with public access. Hence, this r markdown can be run on any computer to generate same HTML report.

fread command has been used with colCLasses argument for defining required class for variables wherever required.

 dir.create( "dropbox" )

# Reading Causal Data
download.file("https://www.dropbox.com/s/v50iehqoogit35l/causal_data.csv?dl=1", 
               "./dropbox/causal_data.csv" )
causal_data <- fread( "./dropbox/causal_data.csv" )
# deletes downloaded file
file.remove( "./dropbox/causal_data.csv" )

# Reading Transaction Data
download.file("https://www.dropbox.com/s/htthq8w75925js6/transaction_data.csv?dl=1", 
             "./dropbox/transaction_data.csv" )
transaction_data <- fread( "./dropbox/transaction_data.csv" )
# deletes downloaded file
file.remove( "./dropbox/transaction_data.csv" )

# Reading Product Data
download.file("https://www.dropbox.com/s/ni08w355d5b6ugm/product.csv?dl=1", 
               "./dropbox/product.csv" )
product <- read.csv( "./dropbox/product.csv" )
# deletes downloaded file
file.remove( "./dropbox/product.csv" )

# Reading Coupon Data
download.file("https://www.dropbox.com/s/2fhaeokucddj5k5/coupon.csv?dl=1", 
               "./dropbox/coupon.csv" )
coupon <- read.csv( "./dropbox/coupon.csv",colClasses = c("character","integer","integer"))
# deletes downloaded file
file.remove( "./dropbox/coupon.csv" )

# Reading Campaign Table Data
download.file("https://www.dropbox.com/s/129ctlo79ug21e7/campaign_table.csv?dl=1", 
               "./dropbox/campaign_table.csv" )
campaign_table <- read.csv( "./dropbox/campaign_table.csv", stringsAsFactors = T)
# deletes downloaded file
file.remove( "./dropbox/campaign_table.csv" )

# Reading Coupon Redemption Data
download.file("https://www.dropbox.com/s/uycgy42otmjvlqv/coupon_redempt.csv?dl=1", 
              "./dropbox/coupon_redempt.csv" )
coupon_redempt <- read.csv( "./dropbox/coupon_redempt.csv",colClasses= c("integer", "integer","character","integer"))
# deletes downloaded file
file.remove( "./dropbox/coupon_redempt.csv" )

# Reading HH Demographic Data
download.file("https://www.dropbox.com/s/ovl9irr3hlq1krk/hh_demographic.csv?dl=1", 
              "./dropbox/hh_demographic.csv" )
hh_demographic <- read.csv( "./dropbox/hh_demographic.csv", stringsAsFactors = TRUE )
# deletes downloaded file
file.remove( "./dropbox/hh_demographic.csv" )

# Reading Campaign Description Data
download.file("https://www.dropbox.com/s/8rn7jin0ghymrji/campaign_desc.csv?dl=1", 
              "./dropbox/campaign_desc.csv" )
campaign_desc <- read.csv( "./dropbox/campaign_desc.csv", stringsAsFactors = T)
# deletes downloaded file

file.remove( "./dropbox/campaign_desc.csv" )

# deletes entire directory
unlink( "dropbox", recursive = TRUE )
  • Cleaning

day variable has been converted into yyyy-mm-dd format for further analysis. Also, Cleaning has been performed on Transaction data to treat missing, abnormal and outlier values in quantity and retail discount

# Converting day variable to yyyy-mm-dd fromat
# Assumption day = 1 corresponds to "2015-01-01"

campaign_desc$start_day <- as.Date(campaign_desc$start_day, origin = "2014-12-31")
campaign_desc$end_day <- as.Date(campaign_desc$end_day, origin = "2014-12-31")
coupon_redempt$day <- as.Date(coupon_redempt$day, origin = "2014-12-31")
transaction_data$day <- as.Date(transaction_data$day, origin = "2014-12-31")

# Treating Quantity and Retail Discount from Transaction Data
# Assumption : Quantity should be greater than zero
# Outlier Treatment : Quantity has been capped at 99.107th percentile i.e. 40 as quantity values are increasing drastically after 40
# Assumption : Retail discount should always be negative as per definition
transaction_data_clean <- transaction_data %>% 
  subset(quantity > 0  & 
           quantity < quantile(transaction_data$quantity,0.99107)) %>%
  subset(retail_disc <= 0)

library(ggplot2)
breaks <- c(seq(990,999,by = 2)/1000, 0.99107)
ggplot(
  data.frame(Quantiles = c(99000:99999)/100000,Quantity = quantile(transaction_data$quantity,c(99000:99999)/100000)) 
       , aes(x = Quantiles, y = Quantity)) + 
  geom_point() +
  ggtitle("Transaction Quantity : Quantile Plot for Outlier Elimination") +
  geom_vline(xintercept = 99107/100000, linetype = "dotdash") +
  scale_x_continuous(breaks = c(seq(990,999,by = 2)/1000, 0.99107))

  • Summary

Following table captures summary of the cleaned datasets used for the analysis

Table Records Missing Abnormal Variables Level
Transaction 2595732 0 23175 12 Household X Basket X Product
Demographics 801 0 0 8 Household
Causal 36786524 0 0 5 Product X Store X Week
Coupon 124548 0 0 3 Campaign X Coupon X Product
Coupon Redemption 2318 0 0 4 Household X Campaign X Coupon
Campaign Description 30 0 0 4 Campaign
Campaign Table 7208 0 0 3 Household X Campaign
Product 92353 0 0 7 Product

Proposed Exploratory Data Analysis

  • Customers’ Profile

    It is important to know our customer before jumping into deep dive analysis of campaign response and hence customer demographics and psychographic variables are studied in detail through descriptive analytics in this section

    • Datasets Required : Household Demographics, Transactions

    • Proposed Descriptive Analysis
      • Bar chart showcasing number of households across Marital status

      • Vertical bar chart showcasing number of households across age buckets

      • Horizontal bar chart showcasing number of households across household size
      • Histogram for number of transactions per month
      • Histogram for spend per month
      • Pie chart for weekday Vs weekend shopper
      • Pie chart for first week of the month shopper Vs Others

    • Proposed Analytical Dataset Creation : Following dataset will be created by joining demographics and transaction dataset

household_id monthly_trans monthly_spend wknd_shopper first_day_shopper age marital_status household_size
1 7 23 0 1 45-54 Single 2
2 12 45 1 0 25-34 Married 3
3 3 12 0 1 65+ Unknown 2
  • Effect of Customers’ attributes on Campaign Responce

    Campaign Response at customer level has been calculated by taking ratio of number of campaigns responded and number of campaigns targeted by at household level. Household level campaign response is then joined with Household attributes calculated from section one

    • Datasets Required : Household Demographics, Campaign Table, Coupon Redemption

    • Proposed Inquisitive Analysis & Hypothesis Testing
      • Vertical bar graph showing response rate varying across Marital Status with ANOVA
      • Vertical bar graph showing response rate varying across Age buckets with ANOVA
      • Vertical bar chart showcasing response rate across household size with ANOVA
      • Scatter plot showcasing response rate across transactions per month with ANOVA
      • Scatter plot showcasing response rate across spend per month with ANOVA
      • Vertical bar chart showcasing response rate across weekday and weekend shopper with ANOVA
      • Vertical bar chart showcasing response rate across first week of month shopper and others with ANOVA
    • Proposed analytical Dataset Creation
      household_id campaigns_sent campaigns_responded campaign_response
      1 8 4 0.50
      2 6 6 1.00
      3 3 2 0.33
  • Campaign Overview

    Campaign Response at customer level has been calculated by taking ratio of number of campaigns responded and number of campaigns targeted by at household level. Household level campaign response is then joined with Household attributes calculated from section one

    • Datasets Required : Campaign Description, Coupon, Product

    • Proposed Descriptive Analysis
      • Histogram for duration of the campaigns
      • Vertical bar graph showcasing number of campaigns across Type A, B and C
      • Horizontal bar graph showcasing number of campaigns across major commodities
      • Histogram for number of coupons distributed across campaigns
    • Proposed Analytical Dataset Creation
      campaign_id duration Type major_commodity coupons
      1 23 A Grocery 30
      2 34 B Dairy 15
      3 27 C Cosmetics 15
  • Effect of Campaigns’ attributes on Campaign Performance

    Performance at campaign level has been calculated by taking ratio of number of households responded and number of households targeted at campaign level. Campaign level household response is then joined with Campaign attributes calculated from section three

    • Datasets Required : Campaign Table, Coupon Redemption, Coupon

    • Proposed Inquisitive Analysis & Hypothesis Testing
      • Vertical bar graph showing response rate varying across Type with ANOVA
      • Vertical bar graph showing response rate varying across major commodity with ANOVA
      • Scatter plot showcasing response rate across campaign duration with ANOVA
      • Scatter plot showcasing response rate across number of coupons per campaign with ANOVA
    • Proposed analytical Dataset Creation
      campaign_id households_targeted households_responded households_response coupons_issued coupons_redeemed coupons_response
      1 300 150 0.50 800 400 0.5
      2 600 200 0.33 700 350 0.5
      3 500 100 0.20 1200 1200 1.0
  • Combined Effect of Customers’ and Campaigns’ attributes on Campaign Performance

    Performance at campaign X Customer need to be calculated in ones or zeros based on whether customer responded to the specific campaign or not

    • Datasets Required : Campaign Table, Coupon Redemption, Coupon, Household Demographics, Campaign Table, Coupon Redemption

    • Proposed predictive analysis : Logistic regression model will be built to predict whether a customer will respond to particular campaign or not based on customer and campaign attributes. Following steps will be carried out for model building
      • Bivariate Analysis
      • Hypothesis Testing
      • Variable Reduction
      • Model Building
    • Proposed analytical Dataset Creation
      campaign_id household_id response
      1 1 1
      2 1 0
      3 2 1