Customer Retention - An exploratory data analysis

knitr::include_graphics('customer_retention.jpg')

PROBLEM STATEMENT

We aim to study the transaction pattern of households/ customers (data for which has been provided for a period of 2 years) to find whether a new customer who visits the store, will fall under the ‘high order transaction’ category or not. The store will develop different strategy of rewards accordingly in the coming future which is discussed below.

The store aims to device a strategy under which it will reward the ‘high order transactions’ customer with ‘loyalty coupons’ and the ‘low order transactions’ customer with ‘discount on next purchase’ coupons. The new customer prospected to fall under ‘low order transaction’ category, will be provided with ‘discount on first time purchase’ coupon so that he/she tends to become a frequent shopper.

Data & Methodology - Transaction data and Customer demographics data will be mainly used in the analysis. Campaign and coupon redemption data will also be used to supplement our analysis.
This analysis will help the retailer in designing customized promotional strategy for the consumer and retaining them in the process.

PACKAGES REQUIRED

LIST OF PACKAGES TO BE REQUIRED

####################
# Loading Packages #
####################

library(tidyverse)
library(tidyr)
library(data.table)
library(randomForest)
library(e1071)     
library(knitr)
library(rsample)
library(kableExtra)

PURPOSE OF PACKAGES

tidyverse - to have necessary core packages
- ggplot2 - data visualisation
- dplyr - data manipulation
- tidyr - data tidying
data.table - to read .csv files
knitr - to generate dynamic report
rsample - Classes and functions to create and summarize different types of resampling objects (e.g. bootstrap, cross-validation)

DATA PREPARATION

DATA SOURCE

Household level transactions of frequent shoppers at a retailer has been provided at http://uc-r.github.io/data_wrangling/mid-term

ORIGINAL PURPOSE OF THE DATA

The dataset contains household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

DATA IMPORTING

Two tables that are relevant to the problem statement - transactions and demographic- are imported, read and operations to understand the number of observations and columns, variables present(that will play a role throughtout data wrangling) and data type is performed. Missing values, if any, is searched for in each table and then a summary statistics associated with each variable is generated.

DATA CLEANING

Transaction table - The ‘productid’ column is type ‘integer64’ so it might print as strange lookiing floating point data. Thus, its data type has been changed to that of the ‘character’ since ‘productid’ as integer doesn’t play any arithmetic role in the data analysis.

#Reading two tables only as per pur problem statement 
#Reading transactions and demographic table 

#################
# Loading Data #
#################

trans <- fread( "transaction_data.csv" 
                , colClasses = c(NA, "character", NA, NA, 
                                                      NA, NA, NA,NA, NA, 
                                                      NA, NA, NA ))

demog <- fread( "hh_demographic.csv" , colClasses = c(NA, "factor",
                                                    "factor","factor", 
                                                    "factor","factor","factor","factor"))

Summary of relevant variables- quantity, sales_value and retail_disc have been generated

###################
# Data Preparation#
###################

# summary(trans$quantity) #high quantity is observed which looks abnormal

# summary(trans$sales_value) #high sales_value are observed which look abnormal

# summary(trans$retail_disc)#retail_disc is positive if we look at the max value in the summary.A discount can't be positive.Thus cleaning is undertaken.

quantity had max value as 89638.0
sales_value had max value as 840.0
retail_disc is positive if we look at the max value in the summary.A discount can’t be positive

Thus, the above mentioned abnormalities were noted and planned to be cleaned in the data imputation step.

Product table - ‘manufacturer’, ‘department’ and ‘brand’ have been converted into factors as data can be later analysed across levels under these heads.

Demographic table - ‘age_desc’, ‘marital_status_code’, ‘income_desc’, ‘homeowner_desc’, ‘hh_comp_desc’, ‘household_size_desc’, ‘kid_category_desc’ have similary been converted into factor data type as data can be later analysed across levels under these heads.

Coupon_redempt table - ‘campaign’ has been converted into factor data type as data can be later analysed across campaigns.

Coupon table - Here too, ‘campaign’ has been converted into factor data type as data can be later analysed across campaigns.

Causal table - ‘mailer’ has been converted into factor data type as data can be later analysed across different mailers that were sent and thus their effect.

Campaign_table and Campaign_desc table - ‘description’ has been converted into factor data type as data can be later analysed across different campaign types and thus their effect.

The dataset has following variable and they mean as explain as follows-

summary_table <-

data.frame(

  Data_Source = c("trans", "demog"),

  Data_Description = c('Data on transactions',

                  'Demographic data for households'),

  Observations = c(dim(trans)[1], dim(demog)[1]),

  Columns = c(dim(trans)[2], dim(demog)[2]))

 

# Creating a table of data summary ---------------------------------------------

kable(summary_table, caption = "Summary of transactions and demographic datasets")

Summary of transactions and demographic datasets
Data_Source	Data_Description	Observations	Columns
trans	Data on transactions	2595732	12
demog	Demographic data for households	801	8

Missing values is found to be Zero in all the tables.

DATA IMPUTATION

Summary is looked into and it is seen that in the transactions table the maximum value of ‘retail_disc’ denoting discount(loyalty members) is positive which is an abnormality as discount can’t be a positive number. Such observations are visualised by a boxplot and positive values have been assigned 0. This is done because even if the discounts are positive, sales have been realised against the products. Thus, the observations can be kept(by assigning them to 0) instead of removing them and losing data on sales.

#retail_disc cleaning
boxplot(trans$retail_disc, main = "Discount(loyalty members)- before data cleaning", xlab = "Retail Discount", ylab = "Value in USD")

#boxplot for retail discounts. Outliers to be removed
#as this number cannot be greater than 0

##print(outlier_retail_disc <- trans[(trans$retail_disc > 0)]) #Checking using boxplot

#Reassigned positive values of retail discount to 0 since sales 
#has been realised 
trans$retail_disc[trans$retail_disc > 0] <- 0
boxplot(trans$retail_disc, main = "Discount(loyalty members)- after data cleaning",xlab = "Retail Discount", ylab = "Value in USD")

#summary(trans$retail_disc)

‘quantity’ and ‘sales_value’ have max value of 85055.00 and 505.00 which produce a possibility of outlier(s) being present. Thus, histograms and boxplots are plotted to visualise the outliers and the same are removed by observing the quantile at 0.991(99.1%). We looked at values at different quantile levels to identify outliers for significant variables

#quantity cleaning
boxplot(trans$quantity, main = "Quanity of items purchased(before data cleaning)")

Quantity values are observed which look abnormal. considering the summary statistics, quantile at 0.991 is considered and trans$quantity has been set to 24 as that seems a reasonable quantity.

#considering the summary statistics

#quantile(trans$quantity, c(0.25, 0.5,0.75,0.9,0.95,0.97,0.98,0.99, 0.991, 0.992, 0.995, 0.999))
#quantile at 0.991 is considered and trans$quantity has been set to 24 as that seems a reasonable quantity
trans <- trans %>% subset(trans$quantity <= 12)

#Checking cleaned quantity data
boxplot(trans$quantity, main = "Quanity of items purchased(after data cleaning)")

#summary(trans$quantity) #post cleaning in table

#sales_value cleaning
 #considering the summary statistics
boxplot(trans$sales_value, main = "Sales(before data cleaning)")

#quantile(trans$sales_value, c(0.25, 0.5,0.75,0.9,0.95,0.97,0.98,0.99, 0.999, 0.9999,.99995))

#Checking quantity for max values

#print(outlier_sales <- trans[(trans$sales_value > 400)])
#print(trans[trans$product_id == '12484608']) 
# The transaction for this product with $840 sales need to be removed as the sales price is quite high compared to other transactions for the same product

#print(trans[trans$basket_id == '32006114302'])
#nrow(trans)
#trans <- trans %>% subset(trans$sales_value != 840)
#nrow(trans)

boxplot(trans$sales_value, main = "Sales(after data cleaning)")

#summary(trans$sales_value) #convert into table

INFERENCES AND INSIGHTS AFTER DATA PREPARATION

INFERENCES

The transactions table shows that the retail discount was available since day 1, but campaign began on the 224th day. This puts a question mark on what new did this campaign came up with if already discounts were running on products.

There may be a possibility that the store provide discounts which could be availed on several products. Then the store looked at the pattern of purchases and then began a new campaign to target only those products that were purchased through discounts. This needs further clarification from the client.

The min value of ‘week no’ is 9 in the coupons table as per the summary statistics which raises a question about what happened in the first eight weeks which have not been mentioned. There may be a possibility that mailers began being sent only week 9 onwards or the display location of products began being considered only week 9 onwards.

INSIGHTS

Summary of relevant demographic variables have been generated. The plots give keen customer insights that will be helpful in drawing out keen CUSTOMER related insights.

Under column Highly_Active
1 denotes if the customer visited the store more than 20 months
0 denotes if the customer didn’t visit the store more than 20 months

#################
# Data Insights#
#################
# summary(demog$age_desc)
# summary(demog$hh_comp_desc)
# summary(demog$homeowner_desc)
# summary(demog$marital_status_code)
# summary(demog$income_desc)
# summary(demog$household_size_desc)
# summary(demog$kid_category_desc)

h1 <- 
  trans %>%
  group_by(
    household_key, 
    mth = month(as.Date(day - 1, origin = '2016-01-01')), 
    yr = year(as.Date(day - 1, origin = '2016-01-01'))
    ) %>%
  summarise(No_Of_visits = n_distinct(basket_id))

 library(tidyr)
#Calculating the number of months in which transactions happened across households
 h2 <- h1 %>% 
   group_by(household_key) %>% 
   summarise(active_months = n())
 
 
 #h3 <- h2 %>% 
  # mutate(Highly_Active = ifelse(active_months > 20,1,0))
 
 #Setting limit of 20 as the months to classify customers into high or lo order transaction customer
 #Creating a new variable 'active_months'
 h3 <- h2
 h3$Highly_Active <- ifelse(h3$active_months > 20,1,0)
 
 
 #joining transaction and demographic table 
 h4 <- inner_join(trans, demog, by = "household_key")
 
 #selecting only relevant variables as per problem statement
 h5 <- h4 %>%
   group_by(household_key) %>%
   select(household_key, age_desc, marital_status_code, income_desc,
          homeowner_desc, hh_comp_desc, household_size_desc, kid_category_desc)

#removing duplicates 
h6 <- h5[!duplicated(h5),]
 
 #ordering the rows
 h7 <- h6[order(h6$household_key),]
 
 #left join performed to get common observations to perform analysis and modelling on
 h8 <- left_join(h7, h3, by = "household_key")

# showing first few rows of the final table on which next steps are to be performed
  head(h8) %>% kable() %>%
  kable_styling() %>%
    scroll_box(width = "500px", height = "200px")

household_key	age_desc	marital_status_code	income_desc	homeowner_desc	hh_comp_desc	household_size_desc	kid_category_desc	active_months	Highly_Active
1	65+	A	35-49K	Homeowner	2 Adults No Kids	2	None/Unknown	23	1
7	45-54	A	50-74K	Homeowner	2 Adults No Kids	2	None/Unknown	22	1
8	25-34	U	25-34K	Unknown	2 Adults Kids	3	1	22	1
13	25-34	U	75-99K	Homeowner	2 Adults Kids	4	2	21	1
16	45-54	B	50-74K	Homeowner	Single Female	1	None/Unknown	18	0
17	65+	B	Under 15K	Homeowner	2 Adults No Kids	2	None/Unknown	22	1

p1 <- ggplot(demog, aes(x = marital_status_code)) + 
  geom_bar(fill = "firebrick1" ) + 
  xlab("Marital Status") + ylab("Total Count") +
  ggtitle("Marital Status wise customer distribution")
p1

For the available customer universe, number of married customers is three times compared to customers who are single. Marital status of around 40% customers are not known.

p2 <- ggplot(demog, aes(x = age_desc)) + 
  geom_bar(fill = "coral1" ) + 
  xlab("Age") + ylab("Total Count") +
  ggtitle("Age wise customer distribution")
p2

Largest numbers of customers are in the age range of 45-54.

p3 <- ggplot(demog, aes(x = household_size_desc)) + 
  geom_bar(fill = "darkgoldenrod1" ) + 
  xlab("Household size") + ylab("Total Count") +
  ggtitle("Household size wise customer distribution")
p3

About 55% of customers belong to household size of two or less.

p4 <- ggplot(demog, aes(x = income_desc)) + 
  geom_bar(fill = "cornflowerblue" ) + 
  xlab("Income brackets") + ylab("Total Count") +
  ggtitle("Income wise customer distribution")
p4

Most customers are in the mid-income bucket range of 40K-99K.

### EXPLORATORY DATA ANALYSIS

The data was explored and new datasets created to first observe the number of visits made by the customer in respective months. Keeping in mind, practicality and business acumen, a limit of 20 was chosen as the number of months, in which respective customers made visit to the retail store. The ones who made more than 20 visits were termed ‘customers with high order transactions’ and the others were termed ‘customers with low order transactions’. The same has been shown in the upcoming analyses.

It was then chosen to plot the transaction activity of customers as per factors-

Marital Status
Age
Having kids or not
Income
Type of resident
Household size
Number of kids

The graphs below show that all the factors mentioned above, play a role in deciding whether a new customer will fall under the ‘high order transaction’ cateogy or not.

Under column Highly_Active
1 denotes if the customer visited the store more than 20 months
0 denotes if the customer didn’t visit the store more than 20 months

#############################
# Exploratory Data Analysis #
#############################
plot_marital <- ggplot(h8, aes(marital_status_code, fill = factor(Highly_Active))) + 
geom_bar() + 
xlab("Marital status") + ylab("Total Count") +
ggtitle("Transaction activity of customers as per marital status") +
scale_fill_discrete(name = "Transaction Activity Category") 
  
plot_marital

We can see from the graph that transaction activity is high irrespective of marital status.

plot_kid <- ggplot(h8, aes(kid_category_desc, fill = factor(Highly_Active))) + 
geom_bar() + 
xlab("Number of kids") + ylab("Total Count") +
ggtitle("Transaction activity of customers depending upon having kids or not") +
scale_fill_discrete(name = "Transaction Activity Category") 

plot_kid

Transaction activity is slightly higher for households with 2 or more kids.

plot_age <- ggplot(h8, aes(age_desc, fill = factor(Highly_Active))) + geom_bar() + 
  xlab("Age") + ylab("Total Count") + 
  ggtitle("Transaction activity of customers as per age") +
  scale_fill_discrete(name = "Transaction Activity Category") 

plot_age

Activity is quite high in the age range of 55-64.

plot_income <- ggplot(h8, aes(income_desc, fill = factor(Highly_Active))) + 
geom_bar() + 
xlab("Income") + ylab("Total Count") +
ggtitle("Transaction activity of customers as per income") + scale_fill_discrete(name = "Transaction Activity Category") 

plot_income

Households with mid-income categories are showing highest activity.

plot_home <- ggplot(h8, aes(homeowner_desc, fill = factor(Highly_Active))) + 
geom_bar() + xlab("Category of homeowner") + ylab("Total Count") +
ggtitle("Transaction activity of customers depending upon being homeowner or not") +
scale_fill_discrete(name = "Transaction Activity Category") 

plot_home

Homeowners constitute the largest majority among the customers. The activity is least for them among all categories.

plot_house <- ggplot(h8, aes(household_size_desc, fill = factor(Highly_Active))) + 
geom_bar() + xlab("Household size") + ylab("Total Count") +
ggtitle("Transaction activity of customers as per household size") +
scale_fill_discrete(name = "Transaction Activity Category") 

plot_house

Transaction activity id higher for larger households.

p5 <- ggplot(data = h8) + geom_bar(mapping = aes(x = h8$age_desc, 
                                           fill = factor(h8$marital_status_code))) +
xlab("Age buckets") + ylab("Total Count") +
ggtitle("Split of customers by age and marital status") +
scale_fill_discrete(name = "Marital Status Code") 
p5

Married customers have higher activity across all age groups compared to customers who are single.

p6 <- ggplot(data = h8) + geom_bar(mapping = aes(x = h8$income_desc, fill = factor(h8$age_desc))) +
xlab("Income buckets") + ylab("Total Count") +
ggtitle("Split of customers by age within different income groups") +
scale_fill_discrete(name = "Age Range") 
p6

Middle income groups account for largest portion of customer activity.

p7 <- ggplot(h8, aes(x = marital_status_code,y = kid_category_desc,fill = active_months)) +
geom_tile() + xlab("Marital Status") + ylab("Kid Category") +
ggtitle("Customers transaction activity distribution across marital status & # of kids") + 
scale_fill_continuous(low = "white",high = "aquamarine4")

p7

High activity is prevalent among married customers with two or more kids.

SUMMARY

Problem Statement
Our problem statement was to explore data to profile customers and identify whether the customer is a high order/high frequency customer based on number of visits.

Approach - Data used and methodology employed
We merged transaction data with customer household demographics data to understand the relation of household demographic factors on customer visit frequency. Knowing your customer is the most important aspect of solving a prediction problem such as this. We added a new variable ‘Highly_Active’ in our data to identify frequent shoppers.

Interesting Insights
Customers in the age bucket 45-54 are most active (i.e. in terms of number of visits). The same customers constitute the highest percentage in income buckets of 35K-99K.
Activity is quite high in the age range of 55-64. This can be attributed to financial independence of people in this category and their time availability.

Implications
We can build a model to predict the category of any new customer based on their household demographics. We can employ different targeted approches to these different category of customers. For loyal customer who are highly active, we can offer them loyalty cards. For customers with low activity, we will offer discounts on first and next purchase.