PROBLEM STATEMENT

We aim to study the transaction pattern of households/ customers (data for which has been provided for a period of 2 years) to find whether a new customer who visits the store, will fall under the ‘high order transaction’ category or not. The store will develop different strategy of rewards accordingly in the coming future which is discussed below.

The store aims to device a strategy under which it will reward the ‘high order transactions’ customer with ‘loyalty coupons’ and the ‘low order transactions’ customer with ‘discount on next purchase’ coupons. The new customer prospected to fall under ‘low order transaction’ category, will be provided with ‘discount on first time purchase’ coupon so that he/she tends to become a frequent shopper.

  • Data & Methodology - Transaction data and Customer demographics data will be mainly used in the analysis. Campaign and coupon redemption data will also be used to supplement our analysis. We are planning to use random forest/logistic regression for the prediction model.

  • Approach/Analytical Technique - We will look at a 6 month time frame of transaction data at household level to train our model and the next 6 month data to check the validity of the model and conclude about customer retention.

  • This analysis will help the retailer in designing customized promotional strategy for the consumer and retaining them in the process.

PACKAGES REQUIRED

LIST OF PACKAGES TO BE REQUIRED

####################
# Loading Packages #
####################

library(tidyverse)
library(tidyr)
library(data.table)
library(randomForest)
library(e1071)     
library(knitr)
library(rsample)
library(kableExtra)


PURPOSE OF PACKAGES

  1. tidyverse - to have necessary core packages

    • ggplot2 - data visualisation
    • dplyr - data manipulation
    • tidyr - data tidying
  2. data.table - to read .csv files
  3. randomForest - to apply random forest on the data
  4. e1071 - to execute Support Vector Machines
  5. knitr - to generate dynamic report
  6. rsample - Classes and functions to create and summarize different types of resampling objects (e.g. bootstrap, cross-validation)

DATA PREPARATION

DATA SOURCE

Household level transactions of frequent shoppers at a retailer has been provided at http://uc-r.github.io/data_wrangling/mid-term

ORIGINAL PURPOSE OF THE DATA

The dataset contains household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

DATA IMPORTING

Two tables that are relevant to the problem statement - transactions and demographic- are imported, read and operations to understand the number of observations and columns, variables present(that will play a role throughtout data wrangling) and data type is performed. Missing values, if any, is searched for in each table and then a summary statistics associated with each variable is generated.

DATA CLEANING

Transaction table - The ‘productid’ column is type ‘integer64’ so it might print as strange lookiing floating point data. Thus, its data type has been changed to that of the ‘character’ since ‘productid’ as integer doesn’t play any arithmetic role in the data analysis.


Summary of relevant variables- quantity, sales_value and retail_disc have been generated

quantity had max value as 89638.0
sales_value had max value as 840.0
retail_disc is positive if we look at the max value in the summary.A discount can’t be positive

Thus, the above mentioned abnormalities were noted and planned to be cleaned in the data imputation step.

Product table - ‘manufacturer’, ‘department’ and ‘brand’ have been converted into factors as data can be later analysed across levels under these heads.

Demographic table - ‘age_desc’, ‘marital_status_code’, ‘income_desc’, ‘homeowner_desc’, ‘hh_comp_desc’, ‘household_size_desc’, ‘kid_category_desc’ have similary been converted into factor data type as data can be later analysed across levels under these heads.

Coupon_redempt table - ‘campaign’ has been converted into factor data type as data can be later analysed across campaigns.

Coupon table - Here too, ‘campaign’ has been converted into factor data type as data can be later analysed across campaigns.

Causal table - ‘mailer’ has been converted into factor data type as data can be later analysed across different mailers that were sent and thus their effect.

Campaign_table and Campaign_desc table - ‘description’ has been converted into factor data type as data can be later analysed across different campaign types and thus their effect.

The dataset has following variable and they mean as explain as follows-  

Summary of transactions and demographic datasets
Data_Source Data_Description Observations Columns
trans Data on transactions 2595732 12
demog Demographic data for households 801 8


Missing values is found to be Zero in all the tables.


DATA IMPUTATION

Summary is looked into and it is seen that in the transactions table the maximum value of ‘retail_disc’ denoting discount(loyalty members) is positive which is an abnormality as discount can’t be a positive number. Such observations are visualised by a boxplot and positive values have been assigned 0. This is done because even if the discounts are positive, sales have been realised against the products. Thus, the observations can be kept(by assigning them to 0) instead of removing them and losing data on sales.

‘quantity’ and ‘sales_value’ have max value of 85055.00 and 505.00 which produce a possibility of outlier(s) being present. Thus, histograms and boxplots are plotted to visualise the outliers and the same are removed by observing the quantile at 0.991(99.1%). We looked at values at different quantile levels to identify outliers for significant variables


Quantity values are observed which look abnormal. considering the summary statistics, quantile at 0.991 is considered and trans$quantity has been set to 24 as that seems a reasonable quantity.

INFERENCES AND INSIGHTS AFTER DATA PREPARATION

INFERENCES

The transactions table shows that the retail discount was available since day 1, but campaign began on the 224th day. This puts a question mark on what new did this campaign came up with if already discounts were running on products.

There may be a possibility that the store provide discounts which could be availed on several products. Then the store looked at the pattern of purchases and then began a new campaign to target only those products that were purchased through discounts. This needs further clarification from the client.

The min value of ‘week no’ is 9 in the coupons table as per the summary statistics which raises a question about what happened in the first eight weeks which have not been mentioned. There may be a possibility that mailers began being sent only week 9 onwards or the display location of products began being considered only week 9 onwards.

INSIGHTS

Summary of relevant demographic variables have been generated. The plots give keen customer insights that will be helpful in drawing out keen CUSTOMER related insights.

Under column Highly_Active
1 denotes if the customer visited the store more than 20 months
0 denotes if the customer didn’t visit the store more than 20 months

household_key age_desc marital_status_code income_desc homeowner_desc hh_comp_desc household_size_desc kid_category_desc active_months Highly_Active
1 65+ A 35-49K Homeowner 2 Adults No Kids 2 None/Unknown 23 1
7 45-54 A 50-74K Homeowner 2 Adults No Kids 2 None/Unknown 22 1
8 25-34 U 25-34K Unknown 2 Adults Kids 3 1 22 1
13 25-34 U 75-99K Homeowner 2 Adults Kids 4 2 21 1
16 45-54 B 50-74K Homeowner Single Female 1 None/Unknown 18 0
17 65+ B Under 15K Homeowner 2 Adults No Kids 2 None/Unknown 22 1







Graph 1 - For the available customer universe, number of married customers is three times compared to customers who are single. Marital status of around 40% customers are not known.

Graph 2 - Largest numbers of customers are in the age range of 45-54.

Graph 3 - About 55% of customers belong to household size of two or less.

Graph 4 - Most customers are in the mid-income bucket range of 40K-99K.

EXPLORATORY DATA ANALYSIS

The data was explored and new datasets created to first observe the number of visits made by the customer in respective months. Keeping in mind, practicality and business acumen, a limit of 20 was chosen as the number of months, in which respective customers made visit to the retail store. The ones who made more than 20 visits were termed ‘customers with high order transactions’ and the others were termed ‘customers with low order transactions’. The same has been shown in the upcoming analyses.


It was then chosen to plot the transaction activity of customers as per factors-

  • Marital Status
  • Age
  • Having kids or not
  • Income
  • Type of resident
  • Household size
  • Number of kids

The graphs below show that all the factors mentioned above, play a role in deciding whether a new customer will fall under the ‘high order transaction’ cateogy or not.

Under column Highly_Active
1 denotes if the customer visited the store more than 20 months
0 denotes if the customer didn’t visit the store more than 20 months


Graph 1 - We can see from the graph that transaction activity is high irrespective of marital status.

Graph 2 - Transaction activity is slightly higher for households with 2 or more kids

Graph 3 - Activity is quite high in the age range of 55-64.

Graph 4 - Households with mid-income categories are showing highest activity.

Graph 5 - Homeowners constitute the largest majority among the customers. The activity is least for them among all categories.

Graph 6 - Transaction activity id higher for larger households.

Graph 7 - Married customers have higher activity across all age groups compared to customers who are single.

Graph 8 - Middle income groups account for largest portion of customer activity.

Graph 9 - High activity is prevalent among married customers with two or more kids.


All the factors, being significant are taken into consideration and ‘random forest’ algorithm is applied to conclude which cateogry a new customer is the most probable to fall under.

RANDOM FOREST RESULTS

Confusion Matrix

0 1 class.error
0 5 110 0.9565217
1 12 434 0.0269058

SUMMARY

Problem Statement Our problem statement was to predict whether any customer is a high order/high frequency customer based on number of visits.

Approach - Data used and methodology employed We merged transaction data with customer household demographics data to understand the relation of household demographic factors on customer visit frequency. Knowing your customer is the most important aspect of solving a prediction problem such as this. We added a new variable ‘Highly_Active’ in our data to identify frequent shoppers. We used Random Forest to predict if the new customers in the test set belong to high order or low order frequency category of customers. Based on R-squared value and confusion matrix, we can say that we have a decent idea about the factors affecting out dependent variable but the model is not yet robust.

Interesting Insights

Customers in the age bucket 45-54 are most active (i.e. in terms of number of visits). The same customers constitute the highest percentage in income buckets of 35K-99K.

Activity is quite high in the age range of 55-64. This can be attributed to financial independence of people in this category and their time availability.

Implications We can use this model to predict the category of any new customer based on their household demographics. We can employ different targeted approches to these different category of customers. For loyal customer who are highly active, we can offer them loyalty cards. For customers with low activity, we will offer discounts on first and next purchase.

Limitations We need to tune the model and may be look at other variables as well other than demographics so that the prediction is more accurate.