An Analysis of Customer Spend and Product Category-level Engagement

Introduction

Our research is focused on helping a major retailer know their customers better. The goal of the study is to derive insights from the available ‘Customer Full Journey’ datasets relating to customers, transactions, and more. As part of our methodology we plan to clean, tidy, explore, and analyze the datasets we received to finally answer the following questions of interest -

Problem Statement #1: What household demographic factors contribute most to customer spend in the stores? Learn about the most impactful factors affecting customer spending.
Problem Statement #2: What category is each customer most engaged with? Profile customers and their engagement levels with product categories.

Our methodology can be broadly classified into three steps -

Data cleaning and preparation: clean and tidy the tables/datasets in R, and join tables as required for the analysis (more in detail in the section on data preparation)
Data Exploration: explore the nature and source of data, explore missing values and reasons ; identify abnormal values and outliers, and visualize variable distributions and correlation as required
Data Analysis: we propose a regression analysis (linear modeling) to determine the demographic factors that affect consumer spending the most, to be followed by a clustering technique to segment and profile customers according to their levels of engagement with various categories inside the store.

This research will help the client in two major ways. Firstly, understand better as to what factors impact the amount of purchase customers make inside stores. Secondly, design better targeted marketing campaigns powered by the knowledge of customer engagement (and spending) levels with various product categories.

Packages Required

Packages to be installed

Following Packages are needed for the project-

dplyr - To do data manipulation in R ; Functions- inner_join,glimpse,group_by,summarise,select
magrittr -To use %>% function in the code
tidyr -Included for data tidying ; Functions-To use gather and spread function
stats -Included for statistical functions ; Functions- To use linear regression(ln) and k-means function
ggplot2 - To create Data Visualisations ; Functions- Plot
factoextra - To create Data Visualisations for K-means ; Functions- fviz_cluster
knitr - To do formatting in R markdown ; Functions- kable
kableExtra - To do formatting in R markdown

suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(magrittr))) 
suppressWarnings(suppressMessages(library(tidyr))) 
suppressWarnings(suppressMessages(library(ggplot2))) 
suppressWarnings(suppressMessages(library(stats))) 
suppressWarnings(suppressMessages(library(factoextra)))
suppressWarnings(suppressMessages(library(kableExtra)))
suppressWarnings(suppressMessages(library(knitr)))

Data Preparation - I

3.1. Data Source The original data on the retail client was downloaded from the Data Wrangling course folders on Github

3.2.1 Introduction of source data The dataset primarily tracks the transaction level data from 2,500 households who frequently shop at the client’s stores. Moreover, the data captures purchases made across the entire gamut of product categories available at the retailer’s stores. Among other important information, we have customer demographics data and direct marketing campaigns data available for certain households. (Please Note: The terms ‘Customer’ and ‘Household’ have been assumed synonymous and used interchangeably in this report)

Timeline: The data tracks customer transactions over a period of two years (104 weeks).

3.2.2 Explanation of source data (tables and variables) In total, this comprehensive total customer journey had 8 different data frames. Each of these are described in brief below. Also a small glimpse into each of these data is given below

Transaction data: Variables: This data frame captures all the product purchases done by all households in two years. This data frame has some variables which are of interest to our research question - the sales value, the retail discounts and coupon discounts which will help us arrive at final customer spend for each customer.

household_key	basket_id	day	product_id	quantity	sales_value	store_id	retail_disc	trans_time	week_no
2375	26984851472	1	1004906	1	1.39	364	-0.6	1631	1
2375	26984851472	1	1033142	1	0.82	364	0.0	1631	1
2375	26984851472	1	1036325	1	0.99	364	-0.3	1631	1

Demographics data Shows demographics information on a limited number of households. This will be crucial in solving our problem statement # 1, where we will build a linear model to see what demographic factors affect customer spend.

household_key	age_desc	marital_status_code	income_desc	homeowner_desc	hh_comp_desc	household_size_desc	kid_category_desc
1	65+	A	35-49K	Homeowner	2 Adults No Kids	2	None/Unknown
7	45-54	A	50-74K	Homeowner	2 Adults No Kids	2	None/Unknown
8	25-34	U	25-34K	Unknown	2 Adults Kids	3	1

Campaign_table Lists the campaign received by each household.

description	household_key	campaign
TypeA	17	26
TypeA	27	26
TypeA	212	26

Campaign description Describes the length of campaigns. Length affects coupon validity period.

description	campaign	start_day	end_day
TypeB	24	659	719
TypeC	15	547	708
TypeB	25	659	691

Product Table Describes the length of campaigns. Length affects coupon validity period. The variable departments represents the fine categories of products. This will be of special interest to our analysis on problem statement # 2, where we will build a customer profile to map customers and their engagement levels with categories.

product_id	manufacturer	department	brand	commodity_desc	sub_commodity_desc	curr_size_of_product
25671	2	GROCERY	National	FRZN ICE	ICE - CRUSHED/CUBED	22 LB
26081	2	MISC. TRANS.	National	NO COMMODITY DESCRIPTION	NO SUBCOMMODITY DESCRIPTION
26093	69	PASTRY	Private	BREAD	BREAD:ITALIAN/FRENCH

Coupon Table Lists all kinds of coupons given as part of various campaigns to customers. Also mentions the kind of products against which the coupons are redeemable.

coupon_upc	product_id	campaign
10000089061	27160	4
10000089064	27754	9
10000089073	28897	12

Coupon Redemption Table Lists the coupons that each household/customer redeemed.

household_key	day	coupon_upc	campaign
1	421	10000085364	8
1	421	51700010076	8
1	427	54200000033	8

Causal Data Table Signifies whether each product was featured/advertised in a weekly mailer or in some in-store display.

product_id	store_id	week_no	mailer
26190	286	70	A
26190	288	70	A
26190	289	70	A

3.2.3 Peculiarities in the data sets

* Missing Values -

any(is.na(campaign_d)) & any(is.na(campaign_t)) & any(is.na(causal)) & any(is.na(coupon)) & any(is.na(coupon_r)) & any(is.na(demo)) & any(is.na(prod)) & any(is.na(trans))

## [1] FALSE

Note:

We see that there are no missing values in the datasets
We will again do a missing values analysis and an outlier or abnormal values analysis on our variables of concern, on the datasets that we prepare for our problem statements #1 and #2 (in the next tab)

SUMMARY:

We showed above a glimpse and summary of how all the data from this comprehensive and complex collection look and what they mean. For solving our two research questions the tables that contain variables ofinterest would be -

Transaction data
Product data
Demographics data

Refer to tab Data Prep II for more on this and next steps…

Data Preparation - II

3.3. Data Preparation Steps

Here we focus on two areas, data preparation (slicing and dicing) for: * Problem Statement #1 * Problem Statement #2

Problem Statement #1

Tables of concern for Problem #1: * Transaction data * Demographic data

Variables of concern

sales_value: signifies the total customer spend, as per current understanding.
household_key: signifies a unique customer
age_desc: estimated age range of customer
marital_status_code: marital status (A - married, B- single, C - unknown)
income_desc: household income
homeowner_desc: homeowner description (homeowner, renter, etc.)
hh_comp_desc: household composition description (no. of adults, kids, etc.)
household_size_desc: size of household up to 5+
kid_category_desc: no. of children up to 3+

3.4. Methodology and Output dataset

group the sales_value for each unique household in the transaction dataset
match those row entries against the unique customers in the demographic dataset
inner join between the two datasets to have the customer spend and demographic variables for 850 unique customers.

#Dataset 1-Demographic factors driving customer spend
library(dplyr)
Demographic_factors<-group_by(trans,household_key) %>%
 summarise(Total_Sales=sum(sales_value))%>%
 inner_join(demo,by="household_key")

household_key	Total_Sales	age_desc	marital_status_code	income_desc	homeowner_desc	hh_comp_desc	household_size_desc	kid_category_desc
1	4330.16	65+	A	35-49K	Homeowner	2 Adults No Kids	2	None/Unknown
7	3400.05	45-54	A	50-74K	Homeowner	2 Adults No Kids	2	None/Unknown
8	5534.97	25-34	U	25-34K	Unknown	2 Adults Kids	3	1
13	13190.92	25-34	U	75-99K	Homeowner	2 Adults Kids	4	2
16	1512.02	45-54	B	50-74K	Homeowner	Single Female	1	None/Unknown

##  household_key   Total_Sales       age_desc   marital_status_code
##  Min.   :   1   Min.   :  646.4   19-24: 46   A:340              
##  1st Qu.: 596   1st Qu.: 2970.1   25-34:142   B:117              
##  Median :1218   Median : 4676.4   35-44:194   U:344              
##  Mean   :1235   Mean   : 5615.1   45-54:288                      
##  3rd Qu.:1914   3rd Qu.: 7165.0   55-64: 59                      
##  Max.   :2499   Max.   :27859.7   65+  : 72                      
##                                                                  
##     income_desc          homeowner_desc           hh_comp_desc
##  50-74K   :192   Homeowner      :504    1 Adult Kids    : 47  
##  35-49K   :172   Probable Owner : 11    2 Adults Kids   :187  
##  75-99K   : 96   Probable Renter: 11    2 Adults No Kids:255  
##  25-34K   : 77   Renter         : 42    Single Female   :144  
##  15-24K   : 74   Unknown        :233    Single Male     : 95  
##  Under 15K: 61                          Unknown         : 73  
##  (Other)  :129                                                
##  household_size_desc    kid_category_desc
##  1 :255              1           :114    
##  2 :318              2           : 60    
##  3 :109              3+          : 69    
##  4 : 53              None/Unknown:558    
##  5+: 66                                  
##                                          
##

Problem Statement #2

Tables of concern for Problem #2: * Transaction data * Product data

Variables of concern

Total_Sales: signifies the total customer spend, as per current understanding.
household_key: signifies a unique customer
product categories: the categories classified into

#Dataset 2-Profiling of Customers based on Product Category Engagement
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("magrittr")))
suppressWarnings(suppressMessages(library("tidyr")))

product_dpt_mapping <- read.csv("product_dpt_mapping.csv")

Customer_profile <- inner_join(trans, prod, by = "product_id")%>%
  inner_join(product_dpt_mapping, by = "department")%>%
  group_by(household_key, Super_category) %>%
  summarise(Total_Sales = sum(sales_value))%>%
  select(household_key, Super_category, Total_Sales)%>%
  spread(Super_category, Total_Sales)

## Warning: Column `department` joining factors with different levels,
## coercing to character vector

household_key	AUTOMOTIVE	COSMETICS	COUPONS	DELI	EGGS & MILK PRODUCTS	FLORAL	GARDEN SUPPLIES	GROCERY	GROCERY-HEALTHY	INFANT PRODUCTS	LIQUIOR	MEAT	MISC ITEMS	PHARMACY	SERVICES	STATIONARY	STORE SUPPLY/MISC	TOYS	VIDEO
1	NA	NA	20.00	212.17	NA	7.99	NA	3021.10	48.33	NA	NA	349.00	530.89	NA	140.68	NA	NA	NA	NA
2	NA	22.07	4.26	46.73	NA	21.99	NA	1264.77	7.47	NA	NA	232.19	330.04	NA	24.82	NA	NA	NA	NA
3	NA	1.25	NA	36.47	NA	NA	NA	1852.78	NA	NA	NA	578.97	179.09	NA	4.65	NA	NA	NA	NA
4	NA	2.12	NA	14.35	NA	NA	NA	703.60	3.29	NA	NA	157.77	318.98	NA	NA	NA	NA	NA	NA
5	NA	9.48	NA	80.03	NA	NA	NA	361.32	2.00	NA	NA	120.77	203.08	NA	2.38	NA	NA	NA	NA

##  household_key      AUTOMOTIVE       COSMETICS         COUPONS        
##  Min.   :   1.0   Min.   : 0.000   Min.   :  0.00   Min.   :    0.00  
##  1st Qu.: 625.8   1st Qu.: 2.950   1st Qu.:  4.01   1st Qu.:    3.49  
##  Median :1250.5   Median : 5.950   Median : 10.80   Median :   10.50  
##  Mean   :1250.5   Mean   : 7.421   Mean   : 25.16   Mean   :   80.34  
##  3rd Qu.:1875.2   3rd Qu.: 9.990   3rd Qu.: 29.12   3rd Qu.:   37.97  
##  Max.   :2500.0   Max.   :33.660   Max.   :529.07   Max.   :11161.12  
##                   NA's   :2439     NA's   :1214     NA's   :994       
##       DELI         EGGS & MILK PRODUCTS     FLORAL       GARDEN SUPPLIES 
##  Min.   :   0.75   Min.   :0.5000       Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  21.76   1st Qu.:0.8975       1st Qu.:  8.99   1st Qu.:  5.00  
##  Median :  56.66   Median :0.9500       Median : 18.98   Median : 13.34  
##  Mean   : 116.06   Mean   :1.4646       Mean   : 33.43   Mean   : 24.90  
##  3rd Qu.: 137.72   3rd Qu.:1.7600       3rd Qu.: 38.95   3rd Qu.: 29.86  
##  Max.   :1808.36   Max.   :5.7800       Max.   :877.12   Max.   :936.67  
##  NA's   :252       NA's   :2450         NA's   :1314     NA's   :2192    
##     GROCERY         GROCERY-HEALTHY   INFANT PRODUCTS    LIQUIOR        
##  Min.   :    1.74   Min.   :   0.00   Min.   :2.020   Min.   :   0.000  
##  1st Qu.:  558.17   1st Qu.:   5.00   1st Qu.:2.020   1st Qu.:   8.735  
##  Median : 1267.53   Median :  13.78   Median :2.355   Median :  17.480  
##  Mean   : 1862.72   Mean   :  56.82   Mean   :2.355   Mean   :  38.492  
##  3rd Qu.: 2575.04   3rd Qu.:  43.01   3rd Qu.:2.690   3rd Qu.:  34.020  
##  Max.   :15047.67   Max.   :5535.49   Max.   :2.690   Max.   :2642.020  
##                     NA's   :781       NA's   :2496    NA's   :1937      
##       MEAT           MISC ITEMS        PHARMACY         SERVICES     
##  Min.   :   1.71   Min.   :   0.0   Min.   : 1.190   Min.   :  0.00  
##  1st Qu.: 102.86   1st Qu.: 120.2   1st Qu.: 5.910   1st Qu.: 14.99  
##  Median : 274.07   Median : 310.3   Median : 6.990   Median : 39.44  
##  Mean   : 427.87   Mean   : 646.0   Mean   : 9.804   Mean   : 66.60  
##  3rd Qu.: 571.49   3rd Qu.: 800.3   3rd Qu.:11.748   3rd Qu.: 83.27  
##  Max.   :3708.78   Max.   :7939.3   Max.   :20.890   Max.   :885.24  
##  NA's   :37        NA's   :7        NA's   :2482     NA's   :177     
##    STATIONARY    STORE SUPPLY/MISC      TOYS          VIDEO       
##  Min.   :1.000   Min.   : 0.000    Min.   :0.99   Min.   : 5.000  
##  1st Qu.:2.435   1st Qu.: 1.000    1st Qu.:0.99   1st Qu.: 6.482  
##  Median :3.870   Median : 1.000    Median :1.24   Median : 9.235  
##  Mean   :3.870   Mean   : 2.453    Mean   :2.46   Mean   : 9.372  
##  3rd Qu.:5.305   3rd Qu.: 1.000    3rd Qu.:2.71   3rd Qu.:12.213  
##  Max.   :6.740   Max.   :20.000    Max.   :6.37   Max.   :13.990  
##  NA's   :2498    NA's   :2480      NA's   :2496   NA's   :2494

3.5. Variables Summary:

Dataset-1 : Demographic_factors
Total rows and variables = 801, 9 , respectively
Variables: household_key - it has zero missing values, and all unique values signifying distinct customers, 801 in number
Total_Sales - This will be our response variable. This column has no missing values, and data adheres to the possible ranges in the data dsictionary
age_desc - ranges for the variable age of customer ; no missing values ; data as per possible values in data dictionary
marital_status - shows marital status of customers showing three levels. No missing values and no abnormal values either
income_desc - gives possible ranges for the income levels, a factor. No missing values ; no abnormal values either
homeowner_desc - specifies whether homeowner is owner, a renter, etc. A factor variable, with no missing or abnormal values
household_size_desc - a factor variable, specifying the size of a household. No missing values, no abnormal values either
kid_category_desc - no missing values, no abnormal values either
Dataset-2 : Customer_profile
Total rows and variables = 2500, 20 , respectively
household_key: signifies unique household
other 19 variables: these are the ‘super categories’ that we created by grouping the 44 product categories / departments

Missing Values : There are no missing values. However the NA values might give the reader the wrong idea. To clarify, for the purposes of our analysis, an ‘NA’ would mean zero spending for the particular customer in that particular category. So, the NA values would be imputed with zeroes before starting analysis

Proposed Exploratory Data Analysis

4.1 Problem Statement 1-Demographic factors driving customer spend

4.1.1 Data Sources used-

HH Demographic
Transaction Data

4.1.2 Business Problem- Identify Demographic factors (Age Description, Martial Status Code, Income Desc, Household Size Description) driving overall dollar sales and quantify their sales impact

4.1.3 Business Implications-This will help quantify the impact of demographic factors driving sales. It will also help business understand the buying potential of new customers based on their demographic profile.

4.1.4 Methodology

4.1.4.a Preliminary Analysis - Summaries (Mean, Median) across various demographic variables should be calculated to understand how spend various across different levels. Following hypothesis will be tested-

Bigger size of household(Household Size Description) results in more spend
People with higher income(Income Desc) spend more

Once we have some idea about how various variables affect sales, linear regression model will be created to quantify the impact

4.1.4.b Linear Regression Model - Build Linear regression model using the dataset with all demographic variables and overall sales. Following steps will be followed-

Step 1-Data was collated with all the demographic variables and overall sales over a certain time period for all customers
Step 2-After checking relationships of variables with overall sales(Using plots of predictors and sales), take appropriate transformation to establish a linear relationship
Step 3-Once the linear relationship has been established, run the linear regression to calculate estimates for each of the variables
Step 4-Compare the model iteration using model-fit statistics(R-square, error) to arrive at the best model
Step 5-Calculate the contribution values to understand what percentage of sales is driven by which demographic factor
Step 6-Also, calculate values for test cases to see whether model is giving accurate predictions

4.2 Problem Statement 2-Profiling of Customers based on Product Category Engagement

4.2.1 Data Sources used-

Transaction Data
Product Data

4.2.2 Business Problem- Segmenting customers based on spend across various product categories

4.2.3 Business Implications - This will allow business to get better understanding of customer base based on product category preferences. It will allow business to get better visibility of the market. Over time, this analysis will allow business to keep track of how customers are moving across defined segments and how their preferences change with time

4.2.4 Methodology

4.2.4.a Preliminary Analysis- Summarize(mean,median and mode) the datasets and plot using box-plot graph to identify outliers in spend data. Calculated the range of spend that should be considered after removing outliers. Using plots thus created, identify categories which are more important that others.

4.2.4.b K-Means Segmentation - Using sales value across many product departments, K-means segmentation will be done to segment customers based on their product needs. Following steps will be followed-