XYZ is an American retailing company with a distribution network of retail offline stores and strong loyal customer database. Marketing team of XYZ runs multiple campaigns across year to promote sales for targeted product category or brand for specific time period. XYZ also send weekly Mailers to customers to increase impressions and eventually purchase for certain product lines. XYZ has run thirty targeted campaigns in last two years and wants to see the performance of campaigns in terms of customers’ response rate, increased number of transactions and overall revenue generated. Marketing team will like to know impact of customer demographics and psychographic characteristics on customer’s response to campaign to fine tune the target audience for future campaigns
Problem statement is divided into four smaller problems which are combined together to achieve the desired output. First part of the puzzle talks about XYZ’s customers profile which will help marketing team to understand who our customers are. As a second step, effect of each demographic attribute on probability of customer responding to a particular campaign has been studied. Similarly, behavioral attributes and their impact on overall response rate has been analyzed in third section of the analysis. Behavioral attributes have been engineered from transaction and campaign response datasets. In the end, demographics and behavioral attributes are leveraged to train model for predicting customer’s response to campaign with the help of Random Forest machine learning algorithm.
Descriptive Analysis has been performed to study customer profiles in the first section section. Intuitive visuals has been used to showcase various characteristics of customers. Inquisitive analysis have been carried out in second and third section of the analysis to study impact of those attributes on campaign response. Combined effect has been studied with the help of random forest keeping campaign response rate as dependent variable for each customer.
Campaign performance analysis can be a great asset to design future campaigns in combination with customer and product segments. Targeted marketing campaigns engineered by taking informed decision will lead to higher response rate and incremental revenues.
Following packages has been used to perform the analysis in this report
library(magrittr) # Sequential piping operations
library(ggplot2) # Data visualization
library(tidyr) # Data cleaning
library(dplyr) # Data trandformation
library(data.table) # Data importing
library(randomForest) # Tree based regressor model building
library(kableExtra) # For printing tables
library(knitr) # For printing tables
library(gridExtra) # For plot grid
library(grid) # For creating and drawing text
library(stringr) # For String Manipulation
library(Hmisc) # For String Manipulation
Data is loaded from Complete Journey customer transactions (with marketing campaigns) which contains customer transaction information for retailer XYZ
Understanding
Assumption
Day 1 corresponds to 1st Jan 2015 and hence following are the time period of the available datasets
| Table | Start | End |
|---|---|---|
| Transaction | 2015-01-01 | 2016-12-11 |
| Campaign Description | 2015-08-12 | 2016-12-19 |
| Coupon Redemption | 2015-08-13 | 2016-12-04 |
| Causal | 2015-02-26 | 2016-12-01 |
Datafiles has been stored in dropbox location with public access. Hence, this r markdown can be run on any computer to generate same HTML report.
fread command has been used with colCLasses argument for defining required class for variables wherever required.
# Defining function to read datasets
read.dataset <- function(urls, names){
# dir.create( "dropbox" )
for (i in 1:length(urls)) {
path <- paste("./dropbox/",names[i],".csv", sep = "")
# Downloading Data
# download.file(urls[i], path)
# reading data into data frame
assign(names[i], fread(path), envir = parent.frame())
# deleting downloaded file
# file.remove(path)
}
# unlink( "dropbox", recursive = TRUE )
}
# dropbox urls for downloading datasets
causal_url <- "https://www.dropbox.com/s/v50iehqoogit35l/causal_da#ta.csv?dl=1"
trans_url <- "https://www.dropbox.com/s/htthq8w75925js6/transaction_data.csv?dl=1"
product_url <- "https://www.dropbox.com/s/ni08w355d5b6ugm/product.csv?dl=1"
coupon_url <- "https://www.dropbox.com/s/2fhaeokucddj5k5/coupon.csv?dl=1"
campaign_url <- "https://www.dropbox.com/s/129ctlo79ug21e7/campaign_table.csv?dl=1"
coupon_redempt_url <- "https://www.dropbox.com/s/uycgy42otmjvlqv/coupon_redempt.csv?dl=1"
hh_demographic_url <- "https://www.dropbox.com/s/ovl9irr3hlq1krk/hh_demographic.csv?dl=1"
campaign_desc_url <- "https://www.dropbox.com/s/8rn7jin0ghymrji/campaign_desc.csv?dl=1"
dataset_urls <- c(
causal_url
, trans_url
, product_url
, coupon_url
, campaign_url
, coupon_redempt_url
, hh_demographic_url
, campaign_desc_url
)
dataset_names <- c(
"causal_data"
, "transaction_data"
, "product"
, "coupon"
, "campaign_table"
, "coupon_redempt"
, "hh_demographic"
, "campaign_desc"
)
read.dataset(dataset_urls, dataset_names)
campaign_table <- campaign_table %>% mutate(description = factor(description))
hh_demographic[,2:8] <- lapply(hh_demographic[,2:8], factor)
campaign_desc <- campaign_desc %>% mutate(description = factor(description))
day variable has been converted into yyyy-mm-dd format for further analysis. Also, Cleaning has been performed on Transaction data to treat missing, abnormal and outlier values in quantity and retail discount
# Converting day variable to yyyy-mm-dd fromat
# Assumption day = 1 corresponds to "2015-01-01"
campaign_desc$start_day <- as.Date(campaign_desc$start_day, origin = "2014-12-31")
campaign_desc$end_day <- as.Date(campaign_desc$end_day, origin = "2014-12-31")
coupon_redempt$day <- as.Date(coupon_redempt$day, origin = "2014-12-31")
transaction_data$day <- as.Date(transaction_data$day, origin = "2014-12-31")
# Treating Quantity and Retail Discount from Transaction Data
# Assumption : Quantity should be greater than zero
# Outlier Treatment : Quantity has been capped at 99.107th percentile i.e. 40 as quantity values are increasing drastically after 40
# Assumption : Retail discount should always be negative as per definition
transaction_data_clean <- transaction_data %>%
subset(quantity > 0 &
quantity < quantile(transaction_data$quantity,0.99107)) %>%
subset(retail_disc <= 0)
library(ggplot2)
breaks <- c(seq(990,999,by = 2)/1000, 0.99107)
ggplot(
data.frame(Quantiles = c(99000:99999)/100000,Quantity = quantile(transaction_data$quantity,c(99000:99999)/100000))
, aes(x = Quantiles, y = Quantity)) +
geom_point() +
ggtitle("Transaction Quantity : Quantile Plot for Outlier Elimination") +
geom_vline(xintercept = 99107/100000, linetype = "dotdash") +
scale_x_continuous(breaks = c(seq(990,999,by = 2)/1000, 0.99107))
Following table captures summary of the cleaned datasets used for the analysis
| Table | Records | Missing | Abnormal | Variables | Level |
|---|---|---|---|---|---|
| Transaction | 2595732 | 0 | 23175 | 12 | Household X Basket X Product |
| Demographics | 801 | 0 | 0 | 8 | Household |
| Causal | 13999156 | 0 | 0 | 5 | Product X Store X Week |
| Coupon | 124548 | 0 | 0 | 3 | Campaign X Coupon X Product |
| Coupon Redemption | 2318 | 0 | 0 | 4 | Household X Campaign X Coupon |
| Campaign Description | 30 | 0 | 0 | 4 | Campaign |
| Campaign Table | 7208 | 0 | 0 | 3 | Household X Campaign |
| Product | 92353 | 0 | 0 | 7 | Product |
Analytical Dataset Creation
| household_key | 1 | 7 | 8 | 13 | 16 | 17 |
| campaign_response_rate | 0.25 | 0.00 | 0.10 | 0.70 | 0.00 | 0.00 |
| campaign_response_flag | TRUE | FALSE | TRUE | TRUE | FALSE | FALSE |
| age_bucket | 65+ | 45-54 | 25-34 | 25-34 | 45-54 | 65+ |
| marital_status_code | A | A | U | U | B | B |
| income_bucket | 35-49K | 50-74K | 25-34K | 75-99K | 50-74K | Under 15K |
| homeowner_descrption | Homeowner | Homeowner | Unknown | Homeowner | Homeowner | Homeowner |
| household_composition | 2 Adults No Kids | 2 Adults No Kids | 2 Adults Kids | 2 Adults Kids | Single Female | 2 Adults No Kids |
| household_size | 2 | 2 | 3 | 4 | 1 | 2 |
| kid_category_description | None/Unknown | None/Unknown | 1 | 2 | None/Unknown | None/Unknown |
| number_of_visits_per_month | 3.74 | 2.68 | 5.14 | 13.10 | 5.44 | 5.39 |
| spend_per_month | 188.27 | 154.55 | 251.59 | 628.14 | 84.00 | 223.46 |
| retail_discount_per_month | -30.31 | -21.93 | -49.01 | -77.49 | -9.52 | -48.24 |
| coupon_discount_per_month | -3.50 | -1.38 | -1.29 | -14.68 | -0.06 | -3.33 |
| coupon_match_discount_per_month | -1.14 | -0.23 | -0.08 | -3.70 | 0.00 | 0.00 |
| number_of_active_months | 23 | 22 | 22 | 21 | 18 | 23 |
| number_of_stores_visited | 2 | 2 | 3 | 1 | 2 | 4 |
| weekend_spend_proportion | 0.35 | 0.07 | 0.29 | 0.30 | 0.37 | 0.34 |
| after_office_spend_proportion | 0.11 | 0.42 | 0.52 | 0.40 | 0.17 | 0.22 |
| first_week_month_spend_proportion | 0.21 | 0.18 | 0.25 | 0.19 | 0.21 | 0.22 |
| number_of_campaigns_sent | 8 | 4 | 10 | 10 | 2 | 5 |
Who is our Customer?(Univariate Analysis)
It is important to know our customer before jumping into deep dive analysis of campaign response and hence customer demographics and psychographic variables are studied in detail through descriptive analytics in this section
Effect of Customers’ demographic attributes on Campaign Responce (Bivariate Analysis)
Effect of Customers’ behavioral attributes on Campaign Responce (Bivariate Analysis)
New variable campaign response flag has been creacted based on campaign response rate. Campaign response flag is 1 when customer has responded to any of the campaigns sent and 0 when customer responds to none of the campaign. Random forest has been built to predict campaign response flag using newly created behavioral variables and categorical attributes. Top 10 important predictors have been printed below with their Mean Decrease Gini. Also, confusion matrix has been printed which shows model to be 76% accurate in predicting whether a customer will response to campaign or not.
| FALSE. | TRUE. | class.error | |
|---|---|---|---|
| FALSE | 377 | 72 | 0.1603563 |
| TRUE | 106 | 205 | 0.3408360 |
| Predictor | Importance |
|---|---|
| coupon_discount_per_month | 74.49133 |
| coupon_match_discount_per_month | 51.01040 |
| retail_discount_per_month | 35.69834 |
| spend_per_month | 28.40812 |
| income_bucket | 23.37571 |
| after_office_spend_proportion | 21.82322 |
| number_of_visits_per_month | 18.13705 |
| weekend_spend_proportion | 16.94440 |
| first_week_month_spend_proportion | 16.09286 |
| number_of_campaigns_sent | 15.27727 |
The customer behaviour of XYZ retailing company has been analysed with help of demographic information provided. Response of customers to the campaign is analysed with new findings, and on the basis of new information derived from existing data sets
Age bucket, income range, household size, marital status seem to be important factor to define customer’s behaviour for purchasing. Derived new information ‘after office hour shopping’, ‘weekend shopping’, ‘spend per month’ are important factor for contribution in customers’ shopping
Consequently from customers’ data we found out that most of our customers are married and belong to age group 45-54 having income range around 35K -74K with monthly expenditure at store around $100 - $300.However, results we get for response rate is different from the expected one. Customers of age 55 - 64, having income 150-174k and having household size more than 5 are more responsive to the coupon distributed. Also, people tend to use less coupon after office hour and prefer more on weekends. Customers with large family size have mainly taken advantage of this campaign. They prefer to shop more if we offer them coupon discounts.
For the next campaign we should offer coupon of products higher price range to the customers who gave good response to these campaigns. We should probably change type of campaign for the customers from whom we didn’t receive good response. Number of stores visited by most of the customers are more than 3. It is necessary to find out if all stores has all products that customers require. It should be advised that our store should be fully accommodated with required products before weekends. For the people who do most of the transaction after office hours but not using coupon, it is required to check if there is any minimum threshold value for purchase to use that coupon
Because of the data sampling, we were able to get only 800 full observations after joining different datasets, and from that we have derived the analysis. We can predict even more accurate results if we get full 2500 observations. That will give us idea in depth about customer attributes and their response to the campaign