Coupon recommendation: What kind of coupons should be recommended to the Customer? We want to be able better predict the mapping of a right coupon to be given to the right customer in order to increase coupon redemption and in turn sales. The client should be interested in the analysis for the following reasons-
This approach helps maximize coupon redemption and increase customer loyalty as he/she gets the coupon that they are most likely to use.
It helps the client formulate marketing strategies like - sending mailers to the right customers about the right coupons, placement and advertisement of the right products thus optimizing cost incurred in marketing
We plan to address the problem by adopting the following steps:
We want to group the customers, appropriately, based on their purchase behavior and demographics; to be able to better understand the purchase preferences and how do they vary with different demographic features. Data Used: Transaction data, Product data, Demographic Data, Coupon Data
Having established purchase conducts of the different “groups” of customer, we want to give them suggestions of products to purchase that are most frequently bought together along with the products purchased by him/her; that the customer is likely to purchase along with an already purchased product. Data Used: Product Data, Transaction Data
Treatment based on analysis: Using the results from all the above analysis we will obtain the likelihood of products to be purchased by the customer in the future or in his/her current visit based on the results from market-basket analysis, we want to give appropriate campaigns and appropriate coupons to the customer to maximize the usage of coupons and thus increase overall customer satisfaction and customer-engagement.
Clustering: We want to use decision trees to arrive at a rule to be able to cluster the customer into groups.
Market Basket Analysis: We want to recommend the right products that the customer might be interested in buying along with the already purchased products
The analysis can be used by the consumer in various ways to formulate various marketing strategies, specific to the group of customers; sending the right mailers to the intended customers
It will increase customer loyalty as the customer will receive campaigns and coupons appropriate to them
readr - to read the csv files
stringr - for string manipulations like str_replace etc
dplyr - for a flexibility in data manipulation
lubridate - for easy time and date manipulation
library(readr)
library(stringr)
library(dplyr)
library(lubridate)
84.51, Customer Journey study, http://www.8451.com/area51/
This dataset contains household level transactions over two years from a group of 2,500 households who shop at a retailer. It contains all of the purchases from each household’s purchases, not just those form a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.
##------------------------- Product Table -------------------------------
prod_data <- read_csv("data/product.csv")
dim(prod_data)
str(prod_data)
head(prod_data)
#Missing value check
sum(is.na(prod_data$product_id))
sum(is.na(prod_data$manufacturer))
miss_dept <- which(is.na(prod_data$department))
prod_data[miss_dept,]
#Observation:- Since all values in other columns are missing we can safely remove these entries if none of these product_id are present in transaction data.
#check if any duplicate entry for Product ID
dim(prod_data[duplicated(prod_data$product_id),])[1]
#cleaning the size of Product Column
prod_data$curr_size_of_product <- str_trim(prod_data$curr_size_of_product)
str_replace(prod_data$curr_size_of_product, "-","")
##------------------------- HH Demographic Table -------------------------------
demo_data <- read_csv("data/hh_demographic.csv")
dim(demo_data)
str(demo_data)
head(demo_data)
#Missing Value Check
sum(is.na(demo_data)) #0
#check for duplicate entry in demographic table
dim(demo_data[duplicated(demo_data$household_key),])[1]
#cleaning the hh_comp_desc column
#creating levels for age
age_desc <- factor(demo_data$age_desc)
age_desc
#creating levels for income
income_desc <- factor(demo_data$income_desc, levels = c("Under 15K", "15-24K", "25-34K", "35-49K","50-74K", "75-99K" , "100-124K", "125-149K" ,"150-174K","175-199K", "200-249K" ,"250K+"))
#kid-category_des - making None/Unknown to 0 and 3+ to 3
demo_data$kid_category_desc <- str_replace(demo_data$kid_category_desc, "None/Unknown", "0")
demo_data$kid_category_desc <- str_replace(demo_data$kid_category_desc, "3\\+", "3")
demo_data$kid_category_desc <- as.numeric(demo_data$kid_category_desc)
#household_size_desc - converting to numbers and replacing 5+ to 5
demo_data$household_size_desc <- str_replace(demo_data$household_size_desc, "5\\+", "5")
demo_data$household_size_desc <- as.numeric(demo_data$household_size_desc)
#creating new column Adult_num (Number of Adults)
demo_data2 <- mutate(demo_data, adult_num = household_size_desc - kid_category_desc)
head(demo_data2)
##------------------------- Coupon Data Table ----------------------------------
coupon <- read_csv("data/coupon.csv")
dim(coupon)
str(coupon)
head(coupon)
#Missing Value Check
sum(is.na(coupon)) #0
#check if all product_id from the coupon table exist in the prod table
dim(anti_join(coupon, prod_data, by = "product_id"))[1]
#check if all campaign from the coupon table exist in the campaign table
camp_table <- read_csv("data/campaign_table.csv")
dim(anti_join(coupon, camp_table, by = "campaign"))[1]
##------------------------- Coupon Redempt Table -------------------------------
coup_red <- read_csv("data/coupon_redempt.csv")
dim(coup_red)
str(coup_red)
head(coup_red)
#Missing Value Check
sum(is.na(coup_red)) #0
#check if all household key from coupon_redempt table exist in the hh_demographic table
dim(anti_join(coup_red, demo_data, by = "household_key"))[1] #462 household keys that are not present in the demographic table
miss_hhkey <- anti_join(coup_red, demo_data, by = "household_key")[,1]
miss_hhkey1 <- as.vector(miss_hhkey)
coup_red2 <- coup_red[!coup_red$household_key %in% miss_hhkey1, ]
dim(coup_red2)
#check for sanity of days column
max(coup_red$day) < 365*2 #True
min(coup_red$day) > 0 #True
#check if all coupon upc from coupon redemption table exist in coupon table
dim(anti_join(coup_red, coupon, by = "coupon_upc"))[1] #0 no such coupon Upc
#check if campaign from coupon redemption table exist in campaign table
dim(anti_join(coup_red, camp_table, by = "campaign"))[1] #0 No such campaign
##----------------------Transaction data--------------------------------
transaction_data <- read_csv("data/transaction_data.csv")
dim(transaction_data)
head(transaction_data)
str(transaction_data)
#Check if transaction_data has products outside the product table
anti_join(transaction_data, prod_data, by = "product_id")#0*12
##Sanity : check for negative values
sum(transaction_data$day < 0)
sum(transaction_data$week_no < 0)
sum(transaction_data$quantity < 0)
sum(transaction_data$sales_value < 0)
#if a positive discount exists, we are imputing it to the same value, with a negative sign
transaction_data[transaction_data$retail_disc > 0,"retail_disc"] <- -(transaction_data[transaction_data$retail_disc > 0,"retail_disc"])
#Missing Value Check
sum(is.na(transaction_data))#0
#Converting trans_time to HMS format
transaction_data$trans_time <- as.numeric(transaction_data$trans_time)
h <- floor(transaction_data$trans_time/100)
m <- transaction_data$trans_time %% 100
input_hms <- paste(h,m,0,sep = ":")
tr_time <- hms(input_hms)
##--------------------------Campaign data-------------------------------
campaign <- read_csv("data/campaign_table.csv")
head(campaign)
dim(campaign)
campaign_desc <- read_csv("data/campaign_desc.csv")
head(campaign_desc)
dim(campaign_desc)
#Missing Value Check
sum(is.na(campaign))#0
sum(is.na(campaign_desc))#0
#Combining campaign table and campaign desc into one final campaign table.
campaign <- full_join(campaign, campaign_desc, by = "campaign")
#Sanity check : end_day > start_day
sum(campaign$end_day < campaign$start_day)#0
##---------------------------Causal Data--------------------------------
causal_data <- read_csv("data/causal_data.csv")
str(causal_data)
#Convert variables mailer and display to factor datatype
causal_data$mailer <- factor(causal_data$mailer)
causal_data$display <- factor(causal_data$display)
#Check if causal_data has products outside the product table
anti_join(causal_data, prod_data, by = "product_id")#0*5
#Missing Value Check
sum(is.na(causal_data))#0
head(transaction_data)
## # A tibble: 6 x 12
## household_key basket_id day product_id quantity sales_value store_id
## <int> <dbl> <int> <int> <int> <dbl> <int>
## 1 2375 2.70e10 1 1004906 1 1.39 364
## 2 2375 2.70e10 1 1033142 1 0.82 364
## 3 2375 2.70e10 1 1036325 1 0.99 364
## 4 2375 2.70e10 1 1082185 1 1.21 364
## 5 2375 2.70e10 1 8160430 1 1.5 364
## 6 2375 2.70e10 1 826249 2 1.98 364
## # ... with 5 more variables: retail_disc <dbl>, trans_time <dbl>,
## # week_no <int>, coupon_disc <dbl>, coupon_match_disc <dbl>
Before proceeding to explore the patterns in our data, we plan to consolidate all our separate data frames to one master transaction data frame, which will essentially contain all the values and variables we will need for our analysis and can summarise the entire dataset in the scope of the project. Each record of the transaction data should describe exactly one transaction.
We will achieve this by joining the transaction data with the dimension data based on the corresponding unique keys, and then selecting only the variables which are of interest to us. For example, we will join the transaction data with the demographics data using the hh_key to fetch the characteristics of the consumer such as age, marital status etc, so that the data needed for customer segmentation is present in one data frame.
We will also need to group our data by different variables, like product department, or age of the consumer, to proceed with further analysis based on these specific groups.
The following plots could help us support the answers to our questions.
We need to understand the rules and the binding algorithm behind clustering and market basket analysis, to be able to implement them using R.
We plan to use techniques like - Association Rule Mining, Market Basket Analysis, Cluster Analysis and Decision Trees to answer our questions.