The Complete Journey

1. Introduction

1.1 Problem Statement:

Coupon recommendation: What kind of coupons should be recommended to the Customer?
We want to be able better predict the mapping of a right coupon to be given to the right customer in order to increase coupon redemption and in turn sales. The client should be interested in the analysis for the following reasons-

This approach helps maximize coupon redemption and increase customer loyalty as he/she gets the coupon that they are most likely to use.
It helps the client formulate marketing strategies like - sending mailers to the right customers about the right coupons, placement and advertisement of the right products thus optimizing cost incurred in marketing

1.2 Addressing the problem

We plan to address the problem by adopting the following steps:

We want to group the customers, appropriately, based on their purchase behavior and demographics; to be able to better understand the purchase preferences and how do they vary with different demographic features. Data Used: Transaction data, Product data, Demographic Data, Coupon Data
Having established purchase conducts of the different “groups” of customer, we want to give them suggestions of products to purchase that are most frequently bought together along with the products purchased by him/her; that the customer is likely to purchase along with an already purchased product. Data Used: Product Data, Transaction Data

Treatment based on analysis: Using the results from all the above analysis we will obtain the likelihood of products to be purchased by the customer in the future or in his/her current visit based on the results from market-basket analysis, we want to give appropriate campaigns and appropriate coupons to the customer to maximize the usage of coupons and thus increase overall customer satisfaction and customer-engagement.

1.3 Current proposed approach/analytic technique

Clustering: We want to use decision trees to arrive at a rule to be able to cluster the customer into groups.
Market Basket Analysis: We want to recommend the right products that the customer might be interested in buying along with the already purchased products

1.4 How will this analysis help the consumer?

The analysis can be used by the consumer in various ways to formulate various marketing strategies, specific to the group of customers; sending the right mailers to the intended customers
It will help the consumer decide the placement of products to increase the products purchase
The analysis will help increase the sales and thus profit
It will increase customer loyalty as the customer will receive campaigns and coupons appropriate to them

2. Packages Required

readr - to read the csv files
stringr - for string manipulations like str_replace etc
dplyr - for a flexibility in data manipulation
lubridate - for easy time and date manipulation

library(readr)
library(stringr)
library(dplyr)
library(lubridate)

3. Data Preparation

3.1 Original source of data

84.51, Customer Journey study, http://www.8451.com/area51/

3.2 More about source data

This dataset contains household level transactions over two years from a group of 2,500 households who shop at a retailer. It contains all of the purchases from each household’s purchases, not just those form a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

3.3 Data importing and cleaning

##------------------------- Product Table -------------------------------
prod_data <- read_csv("data/product.csv")
dim(prod_data)
str(prod_data)
head(prod_data)
 
#Missing value check
sum(is.na(prod_data$product_id))
sum(is.na(prod_data$manufacturer))
miss_dept <- which(is.na(prod_data$department))
prod_data[miss_dept,]
#Observation:- Since all values in other columns are missing we can safely remove these entries if none of these product_id are present in transaction data.
 
#check if any duplicate entry for Product ID
dim(prod_data[duplicated(prod_data$product_id),])[1]
 
 
#cleaning the size of Product Column
prod_data$curr_size_of_product <- str_trim(prod_data$curr_size_of_product)
str_replace(prod_data$curr_size_of_product, "-","")
 
##------------------------- HH Demographic Table -------------------------------
demo_data <- read_csv("data/hh_demographic.csv")
dim(demo_data)
str(demo_data)
head(demo_data)
 
#Missing Value Check
sum(is.na(demo_data)) #0
 
 
#check for duplicate entry in demographic table
dim(demo_data[duplicated(demo_data$household_key),])[1]
 
#cleaning the hh_comp_desc column
#creating levels for age
age_desc <- factor(demo_data$age_desc)
age_desc
#creating levels for income
income_desc <- factor(demo_data$income_desc, levels = c("Under 15K", "15-24K", "25-34K", "35-49K","50-74K", "75-99K" , "100-124K", "125-149K" ,"150-174K","175-199K", "200-249K" ,"250K+"))
 
#kid-category_des - making None/Unknown to 0 and 3+ to 3
demo_data$kid_category_desc <- str_replace(demo_data$kid_category_desc, "None/Unknown", "0")
demo_data$kid_category_desc <- str_replace(demo_data$kid_category_desc, "3\\+", "3")
demo_data$kid_category_desc <- as.numeric(demo_data$kid_category_desc)
 
#household_size_desc - converting to numbers and replacing 5+ to 5
demo_data$household_size_desc <- str_replace(demo_data$household_size_desc, "5\\+", "5")
demo_data$household_size_desc <- as.numeric(demo_data$household_size_desc)
 
#creating new column Adult_num (Number of Adults)
demo_data2 <- mutate(demo_data, adult_num = household_size_desc - kid_category_desc)
head(demo_data2)
 
##------------------------- Coupon Data Table ----------------------------------
coupon <- read_csv("data/coupon.csv")
dim(coupon)
str(coupon)
head(coupon)
 
#Missing Value Check
sum(is.na(coupon)) #0
 
#check if all product_id from the coupon table exist in the prod table
dim(anti_join(coupon, prod_data, by = "product_id"))[1]
 
#check if all campaign from the coupon table exist in the campaign table
camp_table <- read_csv("data/campaign_table.csv")
dim(anti_join(coupon, camp_table, by = "campaign"))[1]
 
 
##------------------------- Coupon Redempt Table -------------------------------
coup_red <- read_csv("data/coupon_redempt.csv")
dim(coup_red)
str(coup_red)
head(coup_red)
 
#Missing Value Check
sum(is.na(coup_red)) #0
 
#check if all household key from coupon_redempt table exist in the hh_demographic table
dim(anti_join(coup_red, demo_data, by = "household_key"))[1] #462 household keys that are not present in the demographic table
miss_hhkey <- anti_join(coup_red, demo_data, by = "household_key")[,1]
miss_hhkey1 <- as.vector(miss_hhkey)
coup_red2 <- coup_red[!coup_red$household_key %in% miss_hhkey1, ]
dim(coup_red2)
 
#check for sanity of days column
max(coup_red$day) < 365*2 #True
min(coup_red$day) > 0     #True
 
#check if all coupon upc from coupon redemption table exist in coupon table
dim(anti_join(coup_red, coupon, by = "coupon_upc"))[1] #0 no such coupon Upc
 
#check if campaign from coupon redemption table exist in campaign table
dim(anti_join(coup_red, camp_table, by = "campaign"))[1] #0 No such campaign
 
##----------------------Transaction data--------------------------------
 
transaction_data <- read_csv("data/transaction_data.csv")
dim(transaction_data)
head(transaction_data)
str(transaction_data)
 
#Check if transaction_data has products outside the product table
anti_join(transaction_data, prod_data, by = "product_id")#0*12
 
##Sanity : check for negative values
sum(transaction_data$day < 0)
sum(transaction_data$week_no < 0)
sum(transaction_data$quantity < 0)
sum(transaction_data$sales_value < 0)
 
#if a positive discount exists, we are imputing it to the same value, with a negative sign
transaction_data[transaction_data$retail_disc > 0,"retail_disc"] <- -(transaction_data[transaction_data$retail_disc > 0,"retail_disc"])
 
#Missing Value Check
sum(is.na(transaction_data))#0
 
#Converting trans_time to HMS format
transaction_data$trans_time <- as.numeric(transaction_data$trans_time)
h <- floor(transaction_data$trans_time/100)
m <- transaction_data$trans_time %% 100
input_hms <- paste(h,m,0,sep = ":")
tr_time <- hms(input_hms)
 
 
##--------------------------Campaign data-------------------------------
campaign <- read_csv("data/campaign_table.csv")
head(campaign)
dim(campaign)
campaign_desc <- read_csv("data/campaign_desc.csv")
head(campaign_desc)
dim(campaign_desc)
 
#Missing Value Check
sum(is.na(campaign))#0
sum(is.na(campaign_desc))#0
 
#Combining campaign table and campaign desc into one final campaign table.
campaign <- full_join(campaign, campaign_desc, by = "campaign")
 
#Sanity check : end_day > start_day
sum(campaign$end_day < campaign$start_day)#0
 
##---------------------------Causal Data--------------------------------
causal_data <- read_csv("data/causal_data.csv")
str(causal_data)
 
#Convert variables mailer and display to factor datatype
causal_data$mailer <- factor(causal_data$mailer)
causal_data$display <- factor(causal_data$display)
 
#Check if causal_data has products outside the product table
anti_join(causal_data, prod_data, by = "product_id")#0*5
 
#Missing Value Check
sum(is.na(causal_data))#0

3.4 Final transaction data

head(transaction_data)

## # A tibble: 6 x 12
##   household_key basket_id   day product_id quantity sales_value store_id
##           <int>     <dbl> <int>      <int>    <int>       <dbl>    <int>
## 1          2375   2.70e10     1    1004906        1        1.39      364
## 2          2375   2.70e10     1    1033142        1        0.82      364
## 3          2375   2.70e10     1    1036325        1        0.99      364
## 4          2375   2.70e10     1    1082185        1        1.21      364
## 5          2375   2.70e10     1    8160430        1        1.5       364
## 6          2375   2.70e10     1     826249        2        1.98      364
## # ... with 5 more variables: retail_disc <dbl>, trans_time <dbl>,
## #   week_no <int>, coupon_disc <dbl>, coupon_match_disc <dbl>

3.5 Table of Summary

Product Table
Variables of concern:
- product_id - to map the product department and other product related information to the transaction table
- department - used in identifying the customer’s purchase behaviour
- commodity_desc - used in identofying the customer’s purchase behaviour
Demographic Table
Variables of Concern:
- household_key - to map the related household information of the household key in teh transaction table
- age_desc - used for clustering the customers together
- marital status - used for clustering the customers together
- income_desc - used for clustering customers together
- homeowner_desc - fed to decision trees as categorical variables, used for clustering the customer in groups
- household_size_desc - used to estimate the total capacity of teh household and thus clustering customers together
- kid_category_desc - used to estimate the total number of kids in the house
- adult_num - new column created for indicating total nummber of adults in the house
Coupon Table:
Variables of Concern:
- coupon_upc - used to map the coupon related information from the coupon redemption table
- product_id - used to map the coupon applied on a particular product in teh transaction table
- campaign - to map the campaign used for a particular coupon and product
Coupon Redemption Table:
Variables of Concern:
- household_key - to map the household key from transation data to the coupons redeemed by them
- day - identifies when was the coupon redeemed by the customer
Transaction data:
Variables of Concern:
- household_key - to map household to transaction
- day - day of the 2 year span the transaction took place on
- product_id - to map product to transaction
- quantity - quantity bought in each transaction
- sales_value - net sales for the retailer
- retail_disc - discount due to loyalty program of the retailer
- trans_time - time of day when the transaction occurred
- week_no - week number corresponding to the day number
- coupon_disc - discount by manufacturer to customer
- coupon_match_disc- discount by manufacturer to retailer
Causal Data:
Variables of Concern:
- product_id - to map product to the type of display/mailer
- display - to identify different types of displays
- mailer - to identify different types of mailers
- week_no - the week the causal data is valid for
Campaign Table:
Variables of Concern:
- Household key - to map household to coupon supplied to them
- Campaign - identifier for each coupon
- Description - description of coupon
- Start day - Day the coupon is activated
- End day - Day the coupon is deactivated

4. Proposed Exploratory Data Analysis

4.1 Uncovering new information in the data

Before proceeding to explore the patterns in our data, we plan to consolidate all our separate data frames to one master transaction data frame, which will essentially contain all the values and variables we will need for our analysis and can summarise the entire dataset in the scope of the project. Each record of the transaction data should describe exactly one transaction.
We will achieve this by joining the transaction data with the dimension data based on the corresponding unique keys, and then selecting only the variables which are of interest to us. For example, we will join the transaction data with the demographics data using the hh_key to fetch the characteristics of the consumer such as age, marital status etc, so that the data needed for customer segmentation is present in one data frame.
We will also need to group our data by different variables, like product department, or age of the consumer, to proceed with further analysis based on these specific groups.

4.2 Plots and tables

The following plots could help us support the answers to our questions.

Plot of the top 20 most frequently purchased products - to help us choose the products we want to target.
Plot of sales over the age, or the income of the consumer - to decide the segment of the consumers we might want to specifically target.

4.3 What we do not know how to do

We need to understand the rules and the binding algorithm behind clustering and market basket analysis, to be able to implement them using R.

4.4 Future Machine Learning techniques

We plan to use techniques like - Association Rule Mining, Market Basket Analysis, Cluster Analysis and Decision Trees to answer our questions.

Association Mining will help us find an association between different products, and find patterns in our transaction data.
Market Basket Analysis will help us understand what items customers buy frequently. Clients can use this information to enhance their marketing strategies to effectively target the right consumers with the right products.
Decision Trees will provide us the rules to be able to cluster the customers based on the products purchased by them.
Clustering will help us in customer segmentation, based on demographics and purchase behavior. This knowledge will further help in choosing the right marketing strategies for each customer segment.