Synopsis

  • Problem Statement

XYZ is an American retailing company with a distribution network of retail offline stores and strong loyal customer database. Marketing team of XYZ runs multiple campaigns across year to promote sales for targeted product category or brand for specific time period. XYZ also send weekly Mailers to customers to increase impressions and eventually purchase for certain product lines. XYZ has run thirty targeted campaigns in last two years and wants to see the performance of campaigns in terms of customers’ response rate, increased number of transactions and overall revenue generated. Marketing team will like to know impact of customer demographics and psychographic characteristics on customer’s response to campaign to fine tune the target audience for future campaigns

  • Methodology

Problem statement is divided into four smaller problems which are combined together to achieve the desired output. First part of the puzzle talks about XYZ’s customers profile which will help marketing team to understand who our customers are. As a second step, effect of each demographic attribute on probability of customer responding to a particular campaign has been studied. Similarly, behavioral attributes and their impact on overall response rate has been analyzed in third section of the analysis. Behavioral attributes have been engineered from transaction and campaign response datasets. In the end, demographics and behavioral attributes are leveraged to train model for predicting customer’s response to campaign with the help of Random Forest machine learning algorithm.

  • Analytical Approach

Descriptive Analysis has been performed to study customer profiles in the first section section. Intuitive visuals has been used to showcase various characteristics of customers. Inquisitive analysis have been carried out in second and third section of the analysis to study impact of those attributes on campaign response. Combined effect has been studied with the help of random forest keeping campaign response rate as dependent variable for each customer.

  • Consumption

Campaign performance analysis can be a great asset to design future campaigns in combination with customer and product segments. Targeted marketing campaigns engineered by taking informed decision will lead to higher response rate and incremental revenues.

Requirements

Following packages has been used to perform the analysis in this report

library(magrittr)     # Sequential piping operations
library(ggplot2)      # Data visualization
library(tidyr)        # Data cleaning
library(dplyr)        # Data trandformation
library(data.table)   # Data importing
library(randomForest) # Tree based regressor model building
library(kableExtra)   # For printing tables
library(knitr)        # For printing tables
library(gridExtra)    # For plot grid 
library(grid)         # For creating and drawing text       
library(stringr)      # For String Manipulation
library(Hmisc)        # For String Manipulation

Data Preparation

  • Source

Data is loaded from Complete Journey customer transactions (with marketing campaigns) which contains customer transaction information for retailer XYZ

  • Understanding

    • Household Demographic data contains demographic characteristics i.e. age, marital status, income etc. for portion of households with no missing or abnormal values
    • Transactions data captures who(household) bought which product at what time by paying how much sales value and did he/she get any discount during that transaction. Dataset does not have any missing values. Data is available from 1st Jan 2015 to 30th December 2016. However, certain abnormalities have been found in the following variables
      • Quantity contains zeros as well as outlier values
      • Sales Value contains outliers
      • Retail Discount contains outliers as well as positive values(entries in discount columns should always be negative)
    • Campaign table lists the campaigns received by each household in the study. Data contains no missing or abnormal values
    • Campaign Description table contains campaign duration. Campaign coupons can only be active in that time window
    • Product table contains information product grouping at department, commodity and sub commodity level as well as brand, manufacturer and size of the product
    • Coupon table has records of each coupon delivered to each household with corresponding campaign and product information
    • Coupon Redemption table captures each coupon redeemed by each household on what day under which campaign
    • Causal Data provides information about product placement in store display sections and weekly mailers at product, store and week level
  • Assumption

Day 1 corresponds to 1st Jan 2015 and hence following are the time period of the available datasets

Table Start End
Transaction 2015-01-01 2016-12-11
Campaign Description 2015-08-12 2016-12-19
Coupon Redemption 2015-08-13 2016-12-04
Causal 2015-02-26 2016-12-01
  • Importation

Datafiles has been stored in dropbox location with public access. Hence, this r markdown can be run on any computer to generate same HTML report.

fread command has been used with colCLasses argument for defining required class for variables wherever required.

# Defining function to read datasets

read.dataset <- function(urls, names){
  
  # dir.create( "dropbox" )
  
  for (i in 1:length(urls)) {
    path <- paste("./dropbox/",names[i],".csv", sep = "")
    # Downloading Data
    # download.file(urls[i], path)
    # reading data into data frame
    assign(names[i], fread(path), envir = parent.frame())
    # deleting downloaded file
    # file.remove(path)  
  }
  # unlink( "dropbox", recursive = TRUE )
  
}

# dropbox urls for downloading datasets
causal_url <- "https://www.dropbox.com/s/v50iehqoogit35l/causal_da#ta.csv?dl=1"
trans_url <- "https://www.dropbox.com/s/htthq8w75925js6/transaction_data.csv?dl=1"
product_url <- "https://www.dropbox.com/s/ni08w355d5b6ugm/product.csv?dl=1"
coupon_url <- "https://www.dropbox.com/s/2fhaeokucddj5k5/coupon.csv?dl=1"
campaign_url <- "https://www.dropbox.com/s/129ctlo79ug21e7/campaign_table.csv?dl=1"
coupon_redempt_url <- "https://www.dropbox.com/s/uycgy42otmjvlqv/coupon_redempt.csv?dl=1"
hh_demographic_url <- "https://www.dropbox.com/s/ovl9irr3hlq1krk/hh_demographic.csv?dl=1"
campaign_desc_url <- "https://www.dropbox.com/s/8rn7jin0ghymrji/campaign_desc.csv?dl=1"

dataset_urls <- c(
  causal_url
  , trans_url
  , product_url
  , coupon_url
  , campaign_url
  , coupon_redempt_url
  , hh_demographic_url
  , campaign_desc_url
  )

dataset_names <- c(
  "causal_data"
  , "transaction_data"
  , "product"
  , "coupon"
  , "campaign_table"
  , "coupon_redempt"
  , "hh_demographic"
  , "campaign_desc"
)

read.dataset(dataset_urls, dataset_names)

campaign_table <- campaign_table %>% mutate(description = factor(description))
hh_demographic[,2:8] <- lapply(hh_demographic[,2:8], factor)
campaign_desc <- campaign_desc %>% mutate(description = factor(description))
  • Cleaning

day variable has been converted into yyyy-mm-dd format for further analysis. Also, Cleaning has been performed on Transaction data to treat missing, abnormal and outlier values in quantity and retail discount

# Converting day variable to yyyy-mm-dd fromat
# Assumption day = 1 corresponds to "2015-01-01"

campaign_desc$start_day <- as.Date(campaign_desc$start_day, origin = "2014-12-31")
campaign_desc$end_day <- as.Date(campaign_desc$end_day, origin = "2014-12-31")
coupon_redempt$day <- as.Date(coupon_redempt$day, origin = "2014-12-31")
transaction_data$day <- as.Date(transaction_data$day, origin = "2014-12-31")

# Treating Quantity and Retail Discount from Transaction Data
# Assumption : Quantity should be greater than zero
# Outlier Treatment : Quantity has been capped at 99.107th percentile i.e. 40 as quantity values are increasing drastically after 40
# Assumption : Retail discount should always be negative as per definition
transaction_data_clean <- transaction_data %>% 
  subset(quantity > 0  & 
           quantity < quantile(transaction_data$quantity,0.99107)) %>%
  subset(retail_disc <= 0)

library(ggplot2)
breaks <- c(seq(990,999,by = 2)/1000, 0.99107)
ggplot(
  data.frame(Quantiles = c(99000:99999)/100000,Quantity = quantile(transaction_data$quantity,c(99000:99999)/100000)) 
       , aes(x = Quantiles, y = Quantity)) + 
  geom_point() +
  ggtitle("Transaction Quantity : Quantile Plot for Outlier Elimination") +
  geom_vline(xintercept = 99107/100000, linetype = "dotdash") +
  scale_x_continuous(breaks = c(seq(990,999,by = 2)/1000, 0.99107))

  • Summary

Following table captures summary of the cleaned datasets used for the analysis

Table Records Missing Abnormal Variables Level
Transaction 2595732 0 23175 12 Household X Basket X Product
Demographics 801 0 0 8 Household
Causal 13999156 0 0 5 Product X Store X Week
Coupon 124548 0 0 3 Campaign X Coupon X Product
Coupon Redemption 2318 0 0 4 Household X Campaign X Coupon
Campaign Description 30 0 0 4 Campaign
Campaign Table 7208 0 0 3 Household X Campaign
Product 92353 0 0 7 Product
  • Analytical Dataset Creation

    • Raw Datasets Required : Household Demographics, Transactions
    • Operations Performed : Analytical Dataset has been created by combining demographic and behavioral attributes of the customer. Demographic attributes such as Age, Marital Status, Household Size has been directly leveraged from household dataset. Whereas behavioral attributes has been engineered from transactional dataset. Response, Demographic and Behavioral attributes has been highlighted in the following table with sample records from analytical dataset. Newly created variables have been highlighted in bold in the table below
household_key 1 7 8 13 16 17
campaign_response_rate 0.25 0.00 0.10 0.70 0.00 0.00
campaign_response_flag TRUE FALSE TRUE TRUE FALSE FALSE
age_bucket 65+ 45-54 25-34 25-34 45-54 65+
marital_status_code A A U U B B
income_bucket 35-49K 50-74K 25-34K 75-99K 50-74K Under 15K
homeowner_descrption Homeowner Homeowner Unknown Homeowner Homeowner Homeowner
household_composition 2 Adults No Kids 2 Adults No Kids 2 Adults Kids 2 Adults Kids Single Female 2 Adults No Kids
household_size 2 2 3 4 1 2
kid_category_description None/Unknown None/Unknown 1 2 None/Unknown None/Unknown
number_of_visits_per_month 3.74 2.68 5.14 13.10 5.44 5.39
spend_per_month 188.27 154.55 251.59 628.14 84.00 223.46
retail_discount_per_month -30.31 -21.93 -49.01 -77.49 -9.52 -48.24
coupon_discount_per_month -3.50 -1.38 -1.29 -14.68 -0.06 -3.33
coupon_match_discount_per_month -1.14 -0.23 -0.08 -3.70 0.00 0.00
number_of_active_months 23 22 22 21 18 23
number_of_stores_visited 2 2 3 1 2 4
weekend_spend_proportion 0.35 0.07 0.29 0.30 0.37 0.34
after_office_spend_proportion 0.11 0.42 0.52 0.40 0.17 0.22
first_week_month_spend_proportion 0.21 0.18 0.25 0.19 0.21 0.22
number_of_campaigns_sent 8 4 10 10 2 5

Exploratory Data Analysis

  • Who is our Customer?(Univariate Analysis)

    It is important to know our customer before jumping into deep dive analysis of campaign response and hence customer demographics and psychographic variables are studied in detail through descriptive analytics in this section

    • From the customer database, almost 73% customers are married and 27% are unmarried
    • 5% of customers prefer to do more than 50% of their transaction during weekend. This variable has been created to note the customer’s activity during weekend
    • Maximum proportion of customers belong to age 45 -54 followed by age 35-44. Customers with minimum proportion are from age bucket 19-24
    • Most of our customers (25%) earns between 50-74K , followed by 35-49K group(22%). This variable gives insight of which customers we should focus on).Least proportion of customers earns between 200-249 k
    • 35% of customers mainly transact(more than 50% of transaction) after office hours. This variable has been created to note the activity of customers after office hours)
    • 28% of customers have kids in their family. 32% of customer have no kids in their family
    • Most of the customers prefer to have monthly expenditure around ($100 - $300)
    • Almost 72% of customers visits more than 3 stores .This is quite interesting fact

  • Effect of Customers’ demographic attributes on Campaign Responce (Bivariate Analysis)

    • Age Bucket: People having age 55-64 have given maximum response to campaign ,however people with age 19-24 are less responsive
    • Marital Status code: From the known marital status of customers, Customers who are married have used more coupons
    • Income Bucket: Response rate is fluctuating in this category. It is very high for customer with income range 150-174k and minimum for customer with income range 250k+ followed by income range 15-24k
    • Homeowner Description: Out of known homeowner Description, customers who used more coupons are probable owner
    • Household Size: Customers have more than 5 family members have been more active in using coupons
    • Kid Category Description: Family with more than 3 kids have given maximum response to the campaign

  • Effect of Customers’ behavioral attributes on Campaign Responce (Bivariate Analysis)

    • Number of visits per month: Response rate increase till no of visits 5 per month and then remain constant
    • Spend per month: Response rate increases drastically expenditure for monthly expenditure upto 250 and then increases slowly further
    • Retail discount per month: For the retail discount range (-50 to -125) response rate is almost 25%
    • Number of Active months: Response rate is more for the customer who are active each month however for the customers who are less active have less data points. Hence response rate is almost zero for number of active months equal to 10 to 12 and steadily increases as number of active months increases
    • Number of stores: Considering the area where data points are significant in no, i.e. (number of stores between 0 to 15), Response rate appears to be increasing first (till number of stores =5) and then decreases slowly further
    • Weekend spend Proportion: Response rate is quite good (25%) for customers who transact (40 to 60 %) of shopping on weekend. This mean most of the coupon get redeemed during weekend shopping
    • After office spend proportion: Response rate appears to be less for people who mainly transact after office hour (after office hour transaction > 0.5)
    • First week month spend proportion: Customers who spends around 0.2 - 0.25 of their shopping on first week of month have more response rate as compared to other customers
    • Number of Campaign Sent: Response rate increases as coupon sent to each household key increase

  • Predictive Modelling for Campaign Response (Random Forest)

New variable campaign response flag has been creacted based on campaign response rate. Campaign response flag is 1 when customer has responded to any of the campaigns sent and 0 when customer responds to none of the campaign. Random forest has been built to predict campaign response flag using newly created behavioral variables and categorical attributes. Top 10 important predictors have been printed below with their Mean Decrease Gini. Also, confusion matrix has been printed which shows model to be 76% accurate in predicting whether a customer will response to campaign or not.

FALSE. TRUE. class.error
FALSE 377 72 0.1603563
TRUE 106 205 0.3408360
Predictor Importance
coupon_discount_per_month 74.49133
coupon_match_discount_per_month 51.01040
retail_discount_per_month 35.69834
spend_per_month 28.40812
income_bucket 23.37571
after_office_spend_proportion 21.82322
number_of_visits_per_month 18.13705
weekend_spend_proportion 16.94440
first_week_month_spend_proportion 16.09286
number_of_campaigns_sent 15.27727

Summary

  • Problem Statement

The customer behaviour of XYZ retailing company has been analysed with help of demographic information provided. Response of customers to the campaign is analysed with new findings, and on the basis of new information derived from existing data sets

  • Approach

Age bucket, income range, household size, marital status seem to be important factor to define customer’s behaviour for purchasing. Derived new information ‘after office hour shopping’, ‘weekend shopping’, ‘spend per month’ are important factor for contribution in customers’ shopping

  • Insights

Consequently from customers’ data we found out that most of our customers are married and belong to age group 45-54 having income range around 35K -74K with monthly expenditure at store around $100 - $300.However, results we get for response rate is different from the expected one. Customers of age 55 - 64, having income 150-174k and having household size more than 5 are more responsive to the coupon distributed. Also, people tend to use less coupon after office hour and prefer more on weekends. Customers with large family size have mainly taken advantage of this campaign. They prefer to shop more if we offer them coupon discounts.

  • Recomendation

For the next campaign we should offer coupon of products higher price range to the customers who gave good response to these campaigns. We should probably change type of campaign for the customers from whom we didn’t receive good response. Number of stores visited by most of the customers are more than 3. It is necessary to find out if all stores has all products that customers require. It should be advised that our store should be fully accommodated with required products before weekends. For the people who do most of the transaction after office hours but not using coupon, it is required to check if there is any minimum threshold value for purchase to use that coupon

  • Limitation

Because of the data sampling, we were able to get only 800 full observations after joining different datasets, and from that we have derived the analysis. We can predict even more accurate results if we get full 2500 observations. That will give us idea in depth about customer attributes and their response to the campaign