Campaign Performance Analysis

Synopsis

Problem Statement

XYZ is an American retailing company with a distribution network of retail offline stores and strong loyal customer database. Marketing team of XYZ runs multiple campaigns across year to promote sales for targeted product category or brand for specific time period. XYZ also send weekly Mailers to customers to increase impressions and eventually purchase for certain product lines. XYZ has run thirty targeted campaigns in last two years and wants to see the performance of campaigns in terms of customers’ response rate, increased number of transactions and overall revenue generated. Marketing team will like to know impact of customer demographics and psychographic characteristics on customer’s response to campaign to fine tune the target audience for future campaigns

Methodology

Problem statement is divided into four smaller problems which are combined together to achieve the desired output. First part of the puzzle talks about XYZ’s customers profile which will help marketing team to understand who our customers are. As a second step, effect of each demographic attribute on probability of customer responding to a particular campaign has been studied. Similarly, behavioral attributes and their impact on overall response rate has been analyzed in third section of the analysis. Behavioral attributes have been engineered from transaction and campaign response datasets. In the end, demographics and behavioral attributes are leveraged to train model for predicting customer’s response to campaign with the help of Random Forest machine learning algorithm.

Analytical Approach

Descriptive Analysis has been performed to study customer profiles in the first section section. Intuitive visuals has been used to showcase various characteristics of customers. Inquisitive analysis have been carried out in second and third section of the analysis to study impact of those attributes on campaign response. Combined effect has been studied with the help of random forest keeping campaign response rate as dependent variable for each customer.

Consumption

Campaign performance analysis can be a great asset to design future campaigns in combination with customer and product segments. Targeted marketing campaigns engineered by taking informed decision will lead to higher response rate and incremental revenues.

Requirements

Following packages has been used to perform the analysis in this report

library(magrittr)     # Sequential piping operations
library(ggplot2)      # Data visualization
library(tidyr)        # Data cleaning
library(dplyr)        # Data trandformation
library(data.table)   # Data importing
library(randomForest) # Tree based regressor model building
library(kableExtra)   # For printing tables
library(knitr)        # For printing tables
library(gridExtra)    # For plot grid 
library(grid)         # For creating and drawing text       
library(stringr)      # For String Manipulation
library(Hmisc)        # For String Manipulation

Data Preparation

Source

Data is loaded from Complete Journey customer transactions (with marketing campaigns) which contains customer transaction information for retailer XYZ

Understanding
- Household Demographic data contains demographic characteristics i.e. age, marital status, income etc. for portion of households with no missing or abnormal values
- Transactions data captures who(household) bought which product at what time by paying how much sales value and did he/she get any discount during that transaction. Dataset does not have any missing values. Data is available from 1st Jan 2015 to 30th December 2016. However, certain abnormalities have been found in the following variables
  - Quantity contains zeros as well as outlier values
  - Sales Value contains outliers
  - Retail Discount contains outliers as well as positive values(entries in discount columns should always be negative)
- Campaign table lists the campaigns received by each household in the study. Data contains no missing or abnormal values
- Campaign Description table contains campaign duration. Campaign coupons can only be active in that time window
- Product table contains information product grouping at department, commodity and sub commodity level as well as brand, manufacturer and size of the product
- Coupon table has records of each coupon delivered to each household with corresponding campaign and product information
- Coupon Redemption table captures each coupon redeemed by each household on what day under which campaign
- Causal Data provides information about product placement in store display sections and weekly mailers at product, store and week level
Assumption

Day 1 corresponds to 1st Jan 2015 and hence following are the time period of the available datasets

Table	Start	End
Transaction	2015-01-01	2016-12-11
Campaign Description	2015-08-12	2016-12-19
Coupon Redemption	2015-08-13	2016-12-04
Causal	2015-02-26	2016-12-01

Importation

Datafiles has been stored in dropbox location with public access. Hence, this r markdown can be run on any computer to generate same HTML report.

fread command has been used with colCLasses argument for defining required class for variables wherever required.

# dir.create( "dropbox" )

# Reading Causal Data
#download.file("https://www.dropbox.com/s/v50iehqoogit35l/causal_da#ta.csv?dl=1", 
#               "./dropbox/causal_data.csv" )
causal_data <- fread( "./dropbox/causal_data.csv")
# deletes downloaded file
#file.remove( "./dropbox/causal_data.csv" )

# Reading Transaction Data
#download.file("https://www.dropbox.com/s/htthq8w75925js6/transaction_data.csv?dl=1", 
#             "./dropbox/transaction_data.csv" )
transaction_data <- fread( "./dropbox/transaction_data.csv")
# deletes downloaded file
#file.remove( "./dropbox/transaction_data.csv" )

# Reading Product Data
#download.file("https://www.dropbox.com/s/ni08w355d5b6ugm/product.csv?dl=1", 
#               "./dropbox/product.csv" )
product <- fread( "./dropbox/product.csv" )
# deletes downloaded file
#file.remove( "./dropbox/product.csv" )

# Reading Coupon Data
#download.file("https://www.dropbox.com/s/2fhaeokucddj5k5/coupon.csv?dl=1", 
#               "./dropbox/coupon.csv" )
coupon <- fread( "./dropbox/coupon.csv",colClasses = c("character","integer","integer"))
# deletes downloaded file
#file.remove( "./dropbox/coupon.csv" )

# Reading Campaign Table Data
#download.file("https://www.dropbox.com/s/129ctlo79ug21e7/campaign_table.csv?dl=1", 
#               "./dropbox/campaign_table.csv" )
campaign_table <- fread( "./dropbox/campaign_table.csv", stringsAsFactors = T)
# deletes downloaded file
#file.remove( "./dropbox/campaign_table.csv" )

# Reading Coupon Redemption Data
#download.file("https://www.dropbox.com/s/uycgy42otmjvlqv/coupon_redempt.csv?dl=1", 
#              "./dropbox/coupon_redempt.csv" )
coupon_redempt <- fread( "./dropbox/coupon_redempt.csv",colClasses= c("integer", "integer","character","integer"))
# deletes downloaded file
#file.remove( "./dropbox/coupon_redempt.csv" )

# Reading HH Demographic Data
#download.file("https://www.dropbox.com/s/ovl9irr3hlq1krk/hh_demographic.csv?dl=1", 
#              "./dropbox/hh_demographic.csv" )
hh_demographic <- fread( "./dropbox/hh_demographic.csv", stringsAsFactors = TRUE )
# deletes downloaded file
#file.remove( "./dropbox/hh_demographic.csv" )

# Reading Campaign Description Data
#download.file("https://www.dropbox.com/s/8rn7jin0ghymrji/campaign_desc.csv?dl=1", 
#              "./dropbox/campaign_desc.csv" )
campaign_desc <- fread( "./dropbox/campaign_desc.csv", stringsAsFactors = T)
# deletes downloaded file

#file.remove( "./dropbox/campaign_desc.csv" )

# deletes entire directory
#unlink( "dropbox", recursive = TRUE )

Cleaning

day variable has been converted into yyyy-mm-dd format for further analysis. Also, Cleaning has been performed on Transaction data to treat missing, abnormal and outlier values in quantity and retail discount

# Converting day variable to yyyy-mm-dd fromat
# Assumption day = 1 corresponds to "2015-01-01"

campaign_desc$start_day <- as.Date(campaign_desc$start_day, origin = "2014-12-31")
campaign_desc$end_day <- as.Date(campaign_desc$end_day, origin = "2014-12-31")
coupon_redempt$day <- as.Date(coupon_redempt$day, origin = "2014-12-31")
transaction_data$day <- as.Date(transaction_data$day, origin = "2014-12-31")

# Treating Quantity and Retail Discount from Transaction Data
# Assumption : Quantity should be greater than zero
# Outlier Treatment : Quantity has been capped at 99.107th percentile i.e. 40 as quantity values are increasing drastically after 40
# Assumption : Retail discount should always be negative as per definition
transaction_data_clean <- transaction_data %>% 
  subset(quantity > 0  & 
           quantity < quantile(transaction_data$quantity,0.99107)) %>%
  subset(retail_disc <= 0)

library(ggplot2)
breaks <- c(seq(990,999,by = 2)/1000, 0.99107)
ggplot(
  data.frame(Quantiles = c(99000:99999)/100000,Quantity = quantile(transaction_data$quantity,c(99000:99999)/100000)) 
       , aes(x = Quantiles, y = Quantity)) + 
  geom_point() +
  ggtitle("Transaction Quantity : Quantile Plot for Outlier Elimination") +
  geom_vline(xintercept = 99107/100000, linetype = "dotdash") +
  scale_x_continuous(breaks = c(seq(990,999,by = 2)/1000, 0.99107))

Summary

Following table captures summary of the cleaned datasets used for the analysis

Table	Records	Abnormal	Variables	Level
Transaction	2595732	23175	12	Household X Basket X Product
Demographics	801	0	8	Household
Causal	36786524	0	5	Product X Store X Week
Coupon	124548	0	3	Campaign X Coupon X Product
Coupon Redemption	2318	0	4	Household X Campaign X Coupon
Campaign Description	30	0	4	Campaign
Campaign Table	7208	0	3	Household X Campaign
Product	92353	0	7	Product

Analytical Dataset Creation
- Raw Datasets Required : Household Demographics, Transactions
- Operations Performed : Analytical Dataset has been created by combining demographic and behavioral attributes of the customer. Demographic attributes such as Age, Marital Status, Household Size has been directly leveraged from household dataset. Whereas behavioral attributes has been engineered from transactional dataset. Response, Demographic and Behavioral attributes has been highlighted in the following table with sample records from analytical dataset

household_key	1	7	8	13	16	17
campaign_response_rate	0.25	0.00	0.10	0.70	0.00	0.00
campaign_response_flag	TRUE	FALSE	TRUE	TRUE	FALSE	FALSE
age_bucket	65+	45-54	25-34	25-34	45-54	65+
marital_status_code	A	A	U	U	B	B
income_bucket	35-49K	50-74K	25-34K	75-99K	50-74K	Under 15K
homeowner_descrption	Homeowner	Homeowner	Unknown	Homeowner	Homeowner	Homeowner
household_composition	2 Adults No Kids	2 Adults No Kids	2 Adults Kids	2 Adults Kids	Single Female	2 Adults No Kids
household_size	2	2	3	4	1	2
kid_category_description	None/Unknown	None/Unknown	1	2	None/Unknown	None/Unknown
number_of_visits_per_month	3.74	2.68	5.14	13.10	5.44	5.39
spend_per_month	188.27	154.55	251.59	628.14	84.00	223.46
retail_discount_per_month	-30.31	-21.93	-49.01	-77.49	-9.52	-48.24
coupon_discount_per_month	-3.50	-1.38	-1.29	-14.68	-0.06	-3.33
coupon_match_discount_per_month	-1.14	-0.23	-0.08	-3.70	0.00	0.00
number_of_active_months	23	22	22	21	18	23
number_of_stores_visited	2	2	3	1	2	4
weekend_spend_proportion	0.35	0.07	0.29	0.30	0.37	0.34
after_office_spend_proportion	0.11	0.42	0.52	0.40	0.17	0.22
first_week_month_spend_proportion	0.21	0.18	0.25	0.19	0.21	0.22
number_of_campaigns_sent	8	4	10	10	2	5

Exploratory Data Analysis

Who is our Customer?(Univariate Analysis)

It is important to know our customer before jumping into deep dive analysis of campaign response and hence customer demographics and psychographic variables are studied in detail through descriptive analytics in this section
- From the customer database, almost 73% customers are married and 27% are unmarried
- 5% of customers prefer to do more than 50% of their transaction during weekend. (This variable has been created to note the customer’s activity during weekend)
- Maximum proportion of customers belong to age 45 -54 followed by age 35-44. Customers with minimum proportion are from age bucket 19-24
- Most of our customers (25%) earns between 50-74K , followed by 35-49K group(22%) (Gives insight of which customers we should focus on).Least proportion of customers earns between 200-249 k
- 35% of customers mainly transact(more than 50% of transaction) after office hours.(This variable has been created to note the activity of customers after office hours)
- 28% of customers have kids in their family ( need to check why this proportion is so small) 32% of customer have no kids in their family
- Most of the customers prefer to have monthly expenditure around ($100 - $300)
- Almost 72% of customers visits more than 3 stores .This is quite interesting fact. Question - the range of products customers wants are not in one store ?That’s why customer is visiting so many stores

Effect of Customers’ attributes on Campaign Responce (Bivariate Analysis)

Campaign Response at customer level has been calculated by taking ratio of number of campaigns responded and number of campaigns targeted by at household level. Household level campaign response is then joined with Household attributes calculated from section one
- Bivariate Analysis for Categorical Variables
- Bivariate Analysis for Continuos Variables
- Random Forest Model Building

Campaign Performance Analysis

Supriya Sawant(M12931363) & Harsharaj Narsale(M12824891)

November 10, 2018

Synopsis

Requirements

Data Preparation

Exploratory Data Analysis

Summary