We aim to study the transaction pattern of households/ customers (data for which has been provided for a period of 2 years) to find whether a new customer who visits the store, will fall under the ‘high order transaction’ category or not. The store will develop different strategy of rewards accordingly in the coming future which is discussed below.
The store aims to device a strategy under which it will reward the ‘high order transactions’ customer with ‘loyalty coupons’ and the ‘low order transactions’ customer with ‘discount on next purchase’ coupons. The new customer prospected to fall under ‘low order transaction’ category, will be provided with ‘discount on first time purchase’ coupon so that he/she tends to become a frequent shopper.
Data & Methodology - Transaction data and Customer demographics data will be mainly used in the analysis. Campaign and coupon redemption data will also be used to supplement our analysis. We are planning to use random forest/logistic regression for the prediction model.
Approach/Analytical Technique - We will look at a 6 month time frame of transaction data at household level to train our model and the next 6 month data to check the validity of the model and conclude about customer retention.
This analysis will help the retailer in designing customized promotional strategy for the consumer and retaining them in the process.
LIST OF PACKAGES TO BE REQUIRED
####################
# Loading Packages #
####################
library(tidyverse)
library(tidyr)
library(data.table)
library(randomForest)
library(e1071)
library(knitr)
library(rsample)
library(kableExtra)
PURPOSE OF PACKAGES
tidyverse - to have necessary core packages
rsample - Classes and functions to create and summarize different types of resampling objects (e.g. bootstrap, cross-validation)
DATA SOURCE
Household level transactions of frequent shoppers at a retailer has been provided at http://uc-r.github.io/data_wrangling/mid-term
ORIGINAL PURPOSE OF THE DATA
The dataset contains household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.
DATA IMPORTING
Two tables that are relevant to the problem statement - transactions and demographic- are imported, read and operations to understand the number of observations and columns, variables present(that will play a role throughtout data wrangling) and data type is performed. Missing values, if any, is searched for in each table and then a summary statistics associated with each variable is generated.
DATA CLEANING
Transaction table - The ‘productid’ column is type ‘integer64’ so it might print as strange lookiing floating point data. Thus, its data type has been changed to that of the ‘character’ since ‘productid’ as integer doesn’t play any arithmetic role in the data analysis.
Summary of relevant variables- quantity, sales_value and retail_disc have been generated
quantity had max value as 89638.0
sales_value had max value as 840.0
retail_disc is positive if we look at the max value in the summary.A discount can’t be positive
Thus, the above mentioned abnormalities were noted and planned to be cleaned in the data imputation step.
Product table - ‘manufacturer’, ‘department’ and ‘brand’ have been converted into factors as data can be later analysed across levels under these heads.
Demographic table - ‘age_desc’, ‘marital_status_code’, ‘income_desc’, ‘homeowner_desc’, ‘hh_comp_desc’, ‘household_size_desc’, ‘kid_category_desc’ have similary been converted into factor data type as data can be later analysed across levels under these heads.
Coupon_redempt table - ‘campaign’ has been converted into factor data type as data can be later analysed across campaigns.
Coupon table - Here too, ‘campaign’ has been converted into factor data type as data can be later analysed across campaigns.
Causal table - ‘mailer’ has been converted into factor data type as data can be later analysed across different mailers that were sent and thus their effect.
Campaign_table and Campaign_desc table - ‘description’ has been converted into factor data type as data can be later analysed across different campaign types and thus their effect.
The dataset has following variable and they mean as explain as follows-
| Data_Source | Data_Description | Observations | Columns |
|---|---|---|---|
| trans | Data on transactions | 2595732 | 12 |
| demog | Demographic data for households | 801 | 8 |
Missing values is found to be Zero in all the tables.
DATA IMPUTATION
Summary is looked into and it is seen that in the transactions table the maximum value of ‘retail_disc’ denoting discount(loyalty members) is positive which is an abnormality as discount can’t be a positive number. Such observations are visualised by a boxplot and positive values have been assigned 0. This is done because even if the discounts are positive, sales have been realised against the products. Thus, the observations can be kept(by assigning them to 0) instead of removing them and losing data on sales.
‘quantity’ and ‘sales_value’ have max value of 85055.00 and 505.00 which produce a possibility of outlier(s) being present. Thus, histograms and boxplots are plotted to visualise the outliers and the same are removed by observing the quantile at 0.991(99.1%). We looked at values at different quantile levels to identify outliers for significant variables
Quantity values are observed which look abnormal. considering the summary statistics, quantile at 0.991 is considered and trans$quantity has been set to 24 as that seems a reasonable quantity.
INFERENCES
The transactions table shows that the retail discount was available since day 1, but campaign began on the 224th day. This puts a question mark on what new did this campaign came up with if already discounts were running on products.
There may be a possibility that the store provide discounts which could be availed on several products. Then the store looked at the pattern of purchases and then began a new campaign to target only those products that were purchased through discounts. This needs further clarification from the client.
The min value of ‘week no’ is 9 in the coupons table as per the summary statistics which raises a question about what happened in the first eight weeks which have not been mentioned. There may be a possibility that mailers began being sent only week 9 onwards or the display location of products began being considered only week 9 onwards.
INSIGHTS
Summary of relevant demographic variables have been generated. The plots give keen customer insights that will be helpful in drawing out keen CUSTOMER related insights.
Under column Highly_Active
1 denotes if the customer visited the store more than 20 months
0 denotes if the customer didn’t visit the store more than 20 months
| household_key | age_desc | marital_status_code | income_desc | homeowner_desc | hh_comp_desc | household_size_desc | kid_category_desc | active_months | Highly_Active |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 65+ | A | 35-49K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | 23 | 1 |
| 7 | 45-54 | A | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | 22 | 1 |
| 8 | 25-34 | U | 25-34K | Unknown | 2 Adults Kids | 3 | 1 | 22 | 1 |
| 13 | 25-34 | U | 75-99K | Homeowner | 2 Adults Kids | 4 | 2 | 21 | 1 |
| 16 | 45-54 | B | 50-74K | Homeowner | Single Female | 1 | None/Unknown | 18 | 0 |
| 17 | 65+ | B | Under 15K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | 22 | 1 |
Graph 1 - For the available customer universe, number of married customers is three times compared to customers who are single. Marital status of around 40% customers are not known.
Graph 2 - Largest numbers of customers are in the age range of 45-54.
Graph 3 - About 55% of customers belong to household size of two or less.
Graph 4 - Most customers are in the mid-income bucket range of 40K-99K.
The data was explored and new datasets created to first observe the number of visits made by the customer in respective months. Keeping in mind, practicality and business acumen, a limit of 20 was chosen as the number of months, in which respective customers made visit to the retail store. The ones who made more than 20 visits were termed ‘customers with high order transactions’ and the others were termed ‘customers with low order transactions’. The same has been shown in the upcoming analyses.
It was then chosen to plot the transaction activity of customers as per factors-
The graphs below show that all the factors mentioned above, play a role in deciding whether a new customer will fall under the ‘high order transaction’ cateogy or not.
Under column Highly_Active
1 denotes if the customer visited the store more than 20 months
0 denotes if the customer didn’t visit the store more than 20 months
Graph 1 - We can see from the graph that transaction activity is high irrespective of marital status.
Graph 2 - Transaction activity is slightly higher for households with 2 or more kids
Graph 3 - Activity is quite high in the age range of 55-64.
Graph 4 - Households with mid-income categories are showing highest activity.
Graph 5 - Homeowners constitute the largest majority among the customers. The activity is least for them among all categories.
Graph 6 - Transaction activity id higher for larger households.
Graph 7 - Married customers have higher activity across all age groups compared to customers who are single.
Graph 8 - Middle income groups account for largest portion of customer activity.
Graph 9 - High activity is prevalent among married customers with two or more kids.
All the factors, being significant are taken into consideration and ‘random forest’ algorithm is applied to conclude which cateogry a new customer is the most probable to fall under.
RANDOM FOREST RESULTS
Confusion Matrix
| 0 | 1 | class.error | |
|---|---|---|---|
| 0 | 5 | 110 | 0.9565217 |
| 1 | 12 | 434 | 0.0269058 |
Problem Statement Our problem statement was to predict whether any customer is a high order/high frequency customer based on number of visits.
Approach - Data used and methodology employed We merged transaction data with customer household demographics data to understand the relation of household demographic factors on customer visit frequency. Knowing your customer is the most important aspect of solving a prediction problem such as this. We added a new variable ‘Highly_Active’ in our data to identify frequent shoppers. We used Random Forest to predict if the new customers in the test set belong to high order or low order frequency category of customers. Based on R-squared value and confusion matrix, we can say that we have a decent idea about the factors affecting out dependent variable but the model is not yet robust.
Interesting Insights
Customers in the age bucket 45-54 are most active (i.e. in terms of number of visits). The same customers constitute the highest percentage in income buckets of 35K-99K.
Activity is quite high in the age range of 55-64. This can be attributed to financial independence of people in this category and their time availability.
Implications We can use this model to predict the category of any new customer based on their household demographics. We can employ different targeted approches to these different category of customers. For loyal customer who are highly active, we can offer them loyalty cards. For customers with low activity, we will offer discounts on first and next purchase.
Limitations We need to tune the model and may be look at other variables as well other than demographics so that the prediction is more accurate.