XYZ is an American retail company with a distributed network of offline retail stores and a strong, loyal customer base. The marketing team at XYZ runs multiple campaigns across the year to promote sales for a targeted product category or brand over a specific time period. XYZ also sends weekly mailers to customers to increase impressions and, eventually, purchases for certain product lines. XYZ has run thirty targeted campaigns in the last two years and wants to see the performance of these campaigns in terms of customers' response rate, increase in number of transactions, and overall revenue generated. The marketing team would also like to know the impact of customer demographic and psychographic characteristics on campaign performance, along with campaign attributes.
The problem statement is divided into five smaller problems which are combined to achieve the desired output. The first part of the puzzle describes XYZ's customer profile, which will help the marketing team understand who our customers are. As a second step, the effect of each demographic and psychographic attribute on the probability of a customer responding to a particular campaign has been studied. Similarly, campaign attributes and their impact on overall response rate have been analyzed in sections three and four of the analysis. In the end, results from the above sections are collated to see the hybrid effect of customer and campaign attributes on model performance parameters.
Descriptive analysis has been performed to study customer and campaign profiles in sections one and three respectively. Intuitive visuals have been used to showcase various characteristics of both entities. Inquisitive analysis has been carried out in sections two and four to study the impact of the attributes on campaign performance. Hypothesis testing is used to assess the statistical significance of the impact. The combined effect has been studied with the help of a random forest, keeping campaign response rate as the dependent variable for each customer X campaign pair.
Campaign performance analysis can be a great asset in designing future campaigns in combination with customer and product segments. Targeted marketing campaigns engineered through informed decisions will lead to higher response rates and incremental revenues.
The following packages have been used to perform the analysis in this report:
library(magrittr) # Sequential piping operations
library(ggplot2) # Data visualization
library(tidyr) # Data cleaning
library(dplyr) # Data transformation
library(data.table) # Data importing
library(randomForest) # Tree based regressor model building
library(kableExtra) # For printing tables
library(knitr) # For printing tables
library(gridExtra) # For plot grid
library(grid) # Low-level grid graphics used by gridExtra
library(stringr) # For String Manipulation
library(Hmisc) # Miscellaneous data analysis utilities
Data is loaded from the Complete Journey customer transactions dataset (with marketing campaigns), which contains customer transaction information for retailer XYZ.
Understanding
Assumption
Day 1 corresponds to 1st Jan 2015, and hence the following are the time periods of the available datasets:
| Table | Start | End |
|---|---|---|
| Transaction | 2015-01-01 | 2016-12-11 |
| Campaign Description | 2015-08-12 | 2016-12-19 |
| Coupon Redemption | 2015-08-13 | 2016-12-04 |
| Causal | 2015-02-26 | 2016-12-01 |
Data files have been stored in a Dropbox location with public access. Hence, this R Markdown can be run on any computer to generate the same HTML report.
The fread command has been used with the colClasses argument to define the required class for each variable wherever required.
# dir.create( "dropbox" )
# Reading Causal Data
#download.file("https://www.dropbox.com/s/v50iehqoogit35l/causal_da#ta.csv?dl=1",
# "./dropbox/causal_data.csv" )
causal_data <- fread( "./dropbox/causal_data.csv" )
# deletes downloaded file
#file.remove( "./dropbox/causal_data.csv" )
# Reading Transaction Data
#download.file("https://www.dropbox.com/s/htthq8w75925js6/transaction_data.csv?dl=1",
# "./dropbox/transaction_data.csv" )
transaction_data <- fread( "./dropbox/transaction_data.csv" )
# deletes downloaded file
#file.remove( "./dropbox/transaction_data.csv" )
# Reading Product Data
#download.file("https://www.dropbox.com/s/ni08w355d5b6ugm/product.csv?dl=1",
# "./dropbox/product.csv" )
product <- read.csv( "./dropbox/product.csv" )
# deletes downloaded file
#file.remove( "./dropbox/product.csv" )
# Reading Coupon Data
#download.file("https://www.dropbox.com/s/2fhaeokucddj5k5/coupon.csv?dl=1",
# "./dropbox/coupon.csv" )
coupon <- read.csv( "./dropbox/coupon.csv",colClasses = c("character","integer","integer"))
# deletes downloaded file
#file.remove( "./dropbox/coupon.csv" )
# Reading Campaign Table Data
#download.file("https://www.dropbox.com/s/129ctlo79ug21e7/campaign_table.csv?dl=1",
# "./dropbox/campaign_table.csv" )
campaign_table <- read.csv( "./dropbox/campaign_table.csv", stringsAsFactors = T)
# deletes downloaded file
#file.remove( "./dropbox/campaign_table.csv" )
# Reading Coupon Redemption Data
#download.file("https://www.dropbox.com/s/uycgy42otmjvlqv/coupon_redempt.csv?dl=1",
# "./dropbox/coupon_redempt.csv" )
coupon_redempt <- read.csv( "./dropbox/coupon_redempt.csv",colClasses= c("integer", "integer","character","integer"))
# deletes downloaded file
#file.remove( "./dropbox/coupon_redempt.csv" )
# Reading HH Demographic Data
#download.file("https://www.dropbox.com/s/ovl9irr3hlq1krk/hh_demographic.csv?dl=1",
# "./dropbox/hh_demographic.csv" )
hh_demographic <- read.csv( "./dropbox/hh_demographic.csv", stringsAsFactors = TRUE )
# deletes downloaded file
#file.remove( "./dropbox/hh_demographic.csv" )
# Reading Campaign Description Data
#download.file("https://www.dropbox.com/s/8rn7jin0ghymrji/campaign_desc.csv?dl=1",
# "./dropbox/campaign_desc.csv" )
campaign_desc <- read.csv( "./dropbox/campaign_desc.csv", stringsAsFactors = T)
# deletes downloaded file
#file.remove( "./dropbox/campaign_desc.csv" )
# deletes entire directory
#unlink( "dropbox", recursive = TRUE )
The day variable has been converted into yyyy-mm-dd format for further analysis. Also, cleaning has been performed on the transaction data to treat missing, abnormal, and outlier values in quantity and retail discount.
# Converting day variable to yyyy-mm-dd format
# Assumption day = 1 corresponds to "2015-01-01"
campaign_desc$start_day <- as.Date(campaign_desc$start_day, origin = "2014-12-31")
campaign_desc$end_day <- as.Date(campaign_desc$end_day, origin = "2014-12-31")
coupon_redempt$day <- as.Date(coupon_redempt$day, origin = "2014-12-31")
transaction_data$day <- as.Date(transaction_data$day, origin = "2014-12-31")
# Treating Quantity and Retail Discount from Transaction Data
# Assumption : Quantity should be greater than zero
# Outlier Treatment : Quantity has been capped at the 99.107th percentile (i.e. 40) as quantity values increase drastically beyond 40
# Assumption : Retail discount should always be negative as per definition
transaction_data_clean <- transaction_data %>%
subset(quantity > 0 &
quantity < quantile(transaction_data$quantity,0.99107)) %>%
subset(retail_disc <= 0)
library(ggplot2)
breaks <- c(seq(990,999,by = 2)/1000, 0.99107)
ggplot(
data.frame(Quantiles = c(99000:99999)/100000,Quantity = quantile(transaction_data$quantity,c(99000:99999)/100000))
, aes(x = Quantiles, y = Quantity)) +
geom_point() +
ggtitle("Transaction Quantity : Quantile Plot for Outlier Elimination") +
geom_vline(xintercept = 99107/100000, linetype = "dotdash") +
scale_x_continuous(breaks = breaks)
The following table captures a summary of the cleaned datasets used for the analysis:
| Table | Records | Missing | Abnormal | Variables | Level |
|---|---|---|---|---|---|
| Transaction | 2595732 | 0 | 23175 | 12 | Household X Basket X Product |
| Demographics | 801 | 0 | 0 | 8 | Household |
| Causal | 36786524 | 0 | 0 | 5 | Product X Store X Week |
| Coupon | 124548 | 0 | 0 | 3 | Campaign X Coupon X Product |
| Coupon Redemption | 2318 | 0 | 0 | 4 | Household X Campaign X Coupon |
| Campaign Description | 30 | 0 | 0 | 4 | Campaign |
| Campaign Table | 7208 | 0 | 0 | 3 | Household X Campaign |
| Product | 92353 | 0 | 0 | 7 | Product |
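A minimal sketch of how the record and missing-value counts in the above table could be derived from the loaded objects is shown below; the treatment of the "Abnormal" column for the transaction table (rows dropped by the cleaning step) is an assumption for illustration, not necessarily the exact logic used in the report.
# Sketch: record and missing-value counts per loaded table (assumption-based illustration)
dataset_summary <- data.frame(
  Table = c("Transaction", "Demographics", "Causal", "Coupon",
            "Coupon Redemption", "Campaign Description",
            "Campaign Table", "Product"),
  Records = c(nrow(transaction_data), nrow(hh_demographic), nrow(causal_data),
              nrow(coupon), nrow(coupon_redempt), nrow(campaign_desc),
              nrow(campaign_table), nrow(product)),
  Missing = c(sum(is.na(transaction_data)), sum(is.na(hh_demographic)),
              sum(is.na(causal_data)), sum(is.na(coupon)),
              sum(is.na(coupon_redempt)), sum(is.na(campaign_desc)),
              sum(is.na(campaign_table)), sum(is.na(product)))
)
# Assumed definition: abnormal transaction records = rows removed during cleaning
dataset_summary$Abnormal <- 0
dataset_summary$Abnormal[dataset_summary$Table == "Transaction"] <-
  nrow(transaction_data) - nrow(transaction_data_clean)
kable(dataset_summary) %>% kable_styling()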
Analytical Dataset Creation
| household_key | 1 | 7 | 8 | 13 | 16 | 17 |
|---|---|---|---|---|---|---|
| campaign_response_rate | 0.25 | 0.00 | 0.10 | 0.70 | 0.00 | 0.00 |
| campaign_response_flag | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE |
| age_bucket | 65+ | 45-54 | 25-34 | 25-34 | 45-54 | 65+ |
| marital_status_code | A | A | U | U | B | B |
| income_bucket | 35-49K | 50-74K | 25-34K | 75-99K | 50-74K | Under 15K |
| homeowner_descrption | Homeowner | Homeowner | Unknown | Homeowner | Homeowner | Homeowner |
| household_composition | 2 Adults No Kids | 2 Adults No Kids | 2 Adults Kids | 2 Adults Kids | Single Female | 2 Adults No Kids |
| household_size | 2 | 2 | 3 | 4 | 1 | 2 |
| kid_category_description | None/Unknown | None/Unknown | 1 | 2 | None/Unknown | None/Unknown |
| number_of_visits_per_month | 3.74 | 2.68 | 5.14 | 13.10 | 5.44 | 5.39 |
| spend_per_month | 188.27 | 154.55 | 251.59 | 628.14 | 84.00 | 223.46 |
| retail_discount_per_month | -30.31 | -21.93 | -49.01 | -77.49 | -9.52 | -48.24 |
| coupon_discount_per_month | -3.50 | -1.38 | -1.29 | -14.68 | -0.06 | -3.33 |
| coupon_match_discount_per_month | -1.14 | -0.23 | -0.08 | -3.70 | 0.00 | 0.00 |
| number_of_active_months | 23 | 22 | 22 | 21 | 18 | 23 |
| number_of_stores_visited | 2 | 2 | 3 | 1 | 2 | 4 |
| weekend_spend_proportion | 0.35 | 0.07 | 0.29 | 0.30 | 0.37 | 0.34 |
| after_office_spend_proportion | 0.11 | 0.42 | 0.52 | 0.40 | 0.17 | 0.22 |
| first_week_month_spend_proportion | 0.21 | 0.18 | 0.25 | 0.19 | 0.21 | 0.22 |
| number_of_campaigns_sent | 8 | 4 | 10 | 10 | 2 | 5 |
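As an illustration, a minimal sketch of how a few of the household-level attributes in the above table could be derived from the cleaned transactions is given below. The column names sales_value, basket_id and store_id in transaction_data_clean, and household_key as the join key with hh_demographic, are assumptions; the exact feature-engineering code used for the full analytical dataset may differ.
# Sketch: deriving a few household-level attributes (assumed column names)
household_features <- transaction_data_clean %>%
  mutate(month = format(day, "%Y-%m")) %>%        # calendar month of each transaction
  group_by(household_key) %>%
  summarise(
    number_of_active_months = n_distinct(month),
    spend_per_month = sum(sales_value) / n_distinct(month),
    number_of_visits_per_month = n_distinct(basket_id) / n_distinct(month),
    retail_discount_per_month = sum(retail_disc) / n_distinct(month),
    number_of_stores_visited = n_distinct(store_id)
  ) %>%
  left_join(hh_demographic, by = "household_key") # attach demographic attributes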
Who is our Customer? (Univariate Analysis)
It is important to know our customer before jumping into a deep-dive analysis of campaign response; hence, customer demographic and psychographic variables are studied in detail through descriptive analytics in this section.
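As an illustration of the descriptive visuals used in this section, a minimal sketch of one univariate view is shown below; the column name income_desc in hh_demographic is an assumption based on the raw demographics file.
# Sketch: households by income bucket (income_desc is an assumed column name)
ggplot(hh_demographic, aes(x = income_desc)) +
  geom_bar(fill = "steelblue") +
  ggtitle("Customer Profile : Households by Income Bucket") +
  xlab("Income Bucket") + ylab("Number of Households") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))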
Effect of Customers' Attributes on Campaign Response (Bivariate Analysis)
Campaign response at the customer level has been calculated as the ratio of the number of campaigns responded to and the number of campaigns targeted, at the household level. The household-level campaign response is then joined with the household attributes calculated in section one.
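A minimal sketch of this calculation is shown below. It assumes that a household counts as having responded to a campaign if it redeemed at least one coupon against that campaign (coupon_redempt), that campaign_table lists the campaigns each household was targeted with, and that both tables carry household_key and campaign columns; the exact response definition used in the report may differ.
# Sketch: household-level campaign response rate (assumed response definition)
campaigns_sent <- campaign_table %>%
  group_by(household_key) %>%
  summarise(number_of_campaigns_sent = n_distinct(campaign))

campaigns_responded <- coupon_redempt %>%
  group_by(household_key) %>%
  summarise(number_of_campaigns_responded = n_distinct(campaign))

campaign_response <- campaigns_sent %>%
  left_join(campaigns_responded, by = "household_key") %>%
  mutate(
    number_of_campaigns_responded =
      ifelse(is.na(number_of_campaigns_responded), 0, number_of_campaigns_responded),
    campaign_response_rate =
      number_of_campaigns_responded / number_of_campaigns_sent,
    campaign_response_flag = campaign_response_rate > 0
  ) %>%
  left_join(hh_demographic, by = "household_key") # join household attributes from section one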