Introduction

Our research is focused on helping a major retailer know their customers better. The goal of the study is to derive insights from the available ‘Customer Full Journey’ datasets relating to customers, transactions, and more. As part of our methodology we plan to clean, tidy, explore, and analyze the datasets we received to finally answer the following questions of interest -

  • Problem Statement #1: What household demographic factors contribute most to customer spend in the stores? Learn about the most impactful factors affecting customer spending.

  • Problem Statement #2: What category is each customer most engaged with? Profile customers and their engagement levels with product categories.


Our methodology can be broadly classified into three steps -

  • Data cleaning and preparation: clean and tidy the tables/datasets in R, and join tables as required for the analysis (more in detail in the section on data preparation)

  • Data Exploration: explore the nature and source of data, explore missing values and reasons ; identify abnormal values and outliers, and visualize variable distributions and correlation as required

  • Data Analysis: we propose a regression analysis (linear modeling) to determine the demographic factors that affect consumer spending the most, to be followed by a clustering technique to segment and profile customers according to their levels of engagement with various categories inside the store.


This research will help the client in two major ways. Firstly, understand better as to what factors impact the amount of purchase customers make inside stores. Secondly, design better targeted marketing campaigns powered by the knowledge of customer engagement (and spending) levels with various product categories.

Packages Required

Packages to be installed

Following Packages are needed for the project-

  • dplyr - To do data manipulation in R ; Functions- inner_join,glimpse,group_by,summarise,select
  • magrittr -To use %>% function in the code
  • tidyr -Included for data tidying ; Functions-To use gather and spread function
  • stats -Included for statistical functions ; Functions- To use linear regression(ln) and k-means function
  • ggplot2 - To create Data Visualisations ; Functions- Plot
  • factoextra - To create Data Visualisations for K-means ; Functions- fviz_cluster
  • knitr - To do formatting in R markdown ; Functions- kable
  • kableExtra - To do formatting in R markdown
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(magrittr))) 
suppressWarnings(suppressMessages(library(tidyr))) 
suppressWarnings(suppressMessages(library(ggplot2))) 
suppressWarnings(suppressMessages(library(stats))) 
suppressWarnings(suppressMessages(library(factoextra)))
suppressWarnings(suppressMessages(library(kableExtra)))
suppressWarnings(suppressMessages(library(knitr)))

Data Preparation - I

3.1. Data Source The original data on the retail client was downloaded from the Data Wrangling course folders on Github


3.2.1 Introduction of source data The dataset primarily tracks the transaction level data from 2,500 households who frequently shop at the client’s stores. Moreover, the data captures purchases made across the entire gamut of product categories available at the retailer’s stores. Among other important information, we have customer demographics data and direct marketing campaigns data available for certain households. (Please Note: The terms ‘Customer’ and ‘Household’ have been assumed synonymous and used interchangeably in this report)

Timeline: The data tracks customer transactions over a period of two years (104 weeks).


3.2.2 Explanation of source data (tables and variables) In total, this comprehensive total customer journey had 8 different data frames. Each of these are described in brief below. Also a small glimpse into each of these data is given below

  • Transaction data: Variables: This data frame captures all the product purchases done by all households in two years. This data frame has some variables which are of interest to our research question - the sales value, the retail discounts and coupon discounts which will help us arrive at final customer spend for each customer.
household_key basket_id day product_id quantity sales_value store_id retail_disc trans_time week_no coupon_disc coupon_match_disc
2375 26984851472 1 1004906 1 1.39 364 -0.6 1631 1 0 0
2375 26984851472 1 1033142 1 0.82 364 0.0 1631 1 0 0
2375 26984851472 1 1036325 1 0.99 364 -0.3 1631 1 0 0


  • Demographics data Shows demographics information on a limited number of households. This will be crucial in solving our problem statement # 1, where we will build a linear model to see what demographic factors affect customer spend.
household_key age_desc marital_status_code income_desc homeowner_desc hh_comp_desc household_size_desc kid_category_desc
1 65+ A 35-49K Homeowner 2 Adults No Kids 2 None/Unknown
7 45-54 A 50-74K Homeowner 2 Adults No Kids 2 None/Unknown
8 25-34 U 25-34K Unknown 2 Adults Kids 3 1


  • Campaign_table Lists the campaign received by each household.
description household_key campaign
TypeA 17 26
TypeA 27 26
TypeA 212 26


  • Campaign description Describes the length of campaigns. Length affects coupon validity period.
description campaign start_day end_day
TypeB 24 659 719
TypeC 15 547 708
TypeB 25 659 691


  • Product Table Describes the length of campaigns. Length affects coupon validity period. The variable departments represents the fine categories of products. This will be of special interest to our analysis on problem statement # 2, where we will build a customer profile to map customers and their engagement levels with categories.
product_id manufacturer department brand commodity_desc sub_commodity_desc curr_size_of_product
25671 2 GROCERY National FRZN ICE ICE - CRUSHED/CUBED 22 LB
26081 2 MISC. TRANS. National NO COMMODITY DESCRIPTION NO SUBCOMMODITY DESCRIPTION
26093 69 PASTRY Private BREAD BREAD:ITALIAN/FRENCH


  • Coupon Table Lists all kinds of coupons given as part of various campaigns to customers. Also mentions the kind of products against which the coupons are redeemable.
coupon_upc product_id campaign
10000089061 27160 4
10000089064 27754 9
10000089073 28897 12


  • Coupon Redemption Table Lists the coupons that each household/customer redeemed.
household_key day coupon_upc campaign
1 421 10000085364 8
1 421 51700010076 8
1 427 54200000033 8


  • Causal Data Table Signifies whether each product was featured/advertised in a weekly mailer or in some in-store display.
product_id store_id week_no display mailer
26190 286 70 0 A
26190 288 70 0 A
26190 289 70 0 A


3.2.3 Peculiarities in the data sets


* Missing Values -

any(is.na(campaign_d)) & any(is.na(campaign_t)) & any(is.na(causal)) & any(is.na(coupon)) & any(is.na(coupon_r)) & any(is.na(demo)) & any(is.na(prod)) & any(is.na(trans))
## [1] FALSE

Note:

  • We see that there are no missing values in the datasets
  • We will again do a missing values analysis and an outlier or abnormal values analysis on our variables of concern, on the datasets that we prepare for our problem statements #1 and #2 (in the next tab)


SUMMARY:

We showed above a glimpse and summary of how all the data from this comprehensive and complex collection look and what they mean. For solving our two research questions the tables that contain variables ofinterest would be -

  • Transaction data
  • Product data
  • Demographics data

Refer to tab Data Prep II for more on this and next steps…

Data Preparation - II

3.3. Data Preparation Steps

Here we focus on two areas, data preparation (slicing and dicing) for: * Problem Statement #1 * Problem Statement #2

Problem Statement #1

Tables of concern for Problem #1: * Transaction data * Demographic data

Variables of concern

  • sales_value: signifies the total customer spend, as per current understanding.
  • household_key: signifies a unique customer
  • age_desc: estimated age range of customer
  • marital_status_code: marital status (A - married, B- single, C - unknown)
  • income_desc: household income
  • homeowner_desc: homeowner description (homeowner, renter, etc.)
  • hh_comp_desc: household composition description (no. of adults, kids, etc.)
  • household_size_desc: size of household up to 5+
  • kid_category_desc: no. of children up to 3+

3.4. Methodology and Output dataset

  • group the sales_value for each unique household in the transaction dataset
  • match those row entries against the unique customers in the demographic dataset
  • inner join between the two datasets to have the customer spend and demographic variables for 850 unique customers.
#Dataset 1-Demographic factors driving customer spend
library(dplyr)
Demographic_factors<-group_by(trans,household_key) %>%
 summarise(Total_Sales=sum(sales_value))%>%
 inner_join(demo,by="household_key")
household_key Total_Sales age_desc marital_status_code income_desc homeowner_desc hh_comp_desc household_size_desc kid_category_desc
1 4330.16 65+ A 35-49K Homeowner 2 Adults No Kids 2 None/Unknown
7 3400.05 45-54 A 50-74K Homeowner 2 Adults No Kids 2 None/Unknown
8 5534.97 25-34 U 25-34K Unknown 2 Adults Kids 3 1
13 13190.92 25-34 U 75-99K Homeowner 2 Adults Kids 4 2
16 1512.02 45-54 B 50-74K Homeowner Single Female 1 None/Unknown
##  household_key   Total_Sales       age_desc   marital_status_code
##  Min.   :   1   Min.   :  646.4   19-24: 46   A:340              
##  1st Qu.: 596   1st Qu.: 2970.1   25-34:142   B:117              
##  Median :1218   Median : 4676.4   35-44:194   U:344              
##  Mean   :1235   Mean   : 5615.1   45-54:288                      
##  3rd Qu.:1914   3rd Qu.: 7165.0   55-64: 59                      
##  Max.   :2499   Max.   :27859.7   65+  : 72                      
##                                                                  
##     income_desc          homeowner_desc           hh_comp_desc
##  50-74K   :192   Homeowner      :504    1 Adult Kids    : 47  
##  35-49K   :172   Probable Owner : 11    2 Adults Kids   :187  
##  75-99K   : 96   Probable Renter: 11    2 Adults No Kids:255  
##  25-34K   : 77   Renter         : 42    Single Female   :144  
##  15-24K   : 74   Unknown        :233    Single Male     : 95  
##  Under 15K: 61                          Unknown         : 73  
##  (Other)  :129                                                
##  household_size_desc    kid_category_desc
##  1 :255              1           :114    
##  2 :318              2           : 60    
##  3 :109              3+          : 69    
##  4 : 53              None/Unknown:558    
##  5+: 66                                  
##                                          
## 



Problem Statement #2

Tables of concern for Problem #2: * Transaction data * Product data

Variables of concern

  • Total_Sales: signifies the total customer spend, as per current understanding.
  • household_key: signifies a unique customer
  • product categories: the categories classified into
#Dataset 2-Profiling of Customers based on Product Category Engagement
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("magrittr")))
suppressWarnings(suppressMessages(library("tidyr")))

product_dpt_mapping <- read.csv("product_dpt_mapping.csv")

Customer_profile <- inner_join(trans, prod, by = "product_id")%>%
  inner_join(product_dpt_mapping, by = "department")%>%
  group_by(household_key, Super_category) %>%
  summarise(Total_Sales = sum(sales_value))%>%
  select(household_key, Super_category, Total_Sales)%>%
  spread(Super_category, Total_Sales)
## Warning: Column `department` joining factors with different levels,
## coercing to character vector
household_key AUTOMOTIVE COSMETICS COUPONS DELI EGGS & MILK PRODUCTS FLORAL GARDEN SUPPLIES GROCERY GROCERY-HEALTHY INFANT PRODUCTS LIQUIOR MEAT MISC ITEMS PHARMACY SERVICES STATIONARY STORE SUPPLY/MISC TOYS VIDEO
1 NA NA 20.00 212.17 NA 7.99 NA 3021.10 48.33 NA NA 349.00 530.89 NA 140.68 NA NA NA NA
2 NA 22.07 4.26 46.73 NA 21.99 NA 1264.77 7.47 NA NA 232.19 330.04 NA 24.82 NA NA NA NA
3 NA 1.25 NA 36.47 NA NA NA 1852.78 NA NA NA 578.97 179.09 NA 4.65 NA NA NA NA
4 NA 2.12 NA 14.35 NA NA NA 703.60 3.29 NA NA 157.77 318.98 NA NA NA NA NA NA
5 NA 9.48 NA 80.03 NA NA NA 361.32 2.00 NA NA 120.77 203.08 NA 2.38 NA NA NA NA
##  household_key      AUTOMOTIVE       COSMETICS         COUPONS        
##  Min.   :   1.0   Min.   : 0.000   Min.   :  0.00   Min.   :    0.00  
##  1st Qu.: 625.8   1st Qu.: 2.950   1st Qu.:  4.01   1st Qu.:    3.49  
##  Median :1250.5   Median : 5.950   Median : 10.80   Median :   10.50  
##  Mean   :1250.5   Mean   : 7.421   Mean   : 25.16   Mean   :   80.34  
##  3rd Qu.:1875.2   3rd Qu.: 9.990   3rd Qu.: 29.12   3rd Qu.:   37.97  
##  Max.   :2500.0   Max.   :33.660   Max.   :529.07   Max.   :11161.12  
##                   NA's   :2439     NA's   :1214     NA's   :994       
##       DELI         EGGS & MILK PRODUCTS     FLORAL       GARDEN SUPPLIES 
##  Min.   :   0.75   Min.   :0.5000       Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  21.76   1st Qu.:0.8975       1st Qu.:  8.99   1st Qu.:  5.00  
##  Median :  56.66   Median :0.9500       Median : 18.98   Median : 13.34  
##  Mean   : 116.06   Mean   :1.4646       Mean   : 33.43   Mean   : 24.90  
##  3rd Qu.: 137.72   3rd Qu.:1.7600       3rd Qu.: 38.95   3rd Qu.: 29.86  
##  Max.   :1808.36   Max.   :5.7800       Max.   :877.12   Max.   :936.67  
##  NA's   :252       NA's   :2450         NA's   :1314     NA's   :2192    
##     GROCERY         GROCERY-HEALTHY   INFANT PRODUCTS    LIQUIOR        
##  Min.   :    1.74   Min.   :   0.00   Min.   :2.020   Min.   :   0.000  
##  1st Qu.:  558.17   1st Qu.:   5.00   1st Qu.:2.020   1st Qu.:   8.735  
##  Median : 1267.53   Median :  13.78   Median :2.355   Median :  17.480  
##  Mean   : 1862.72   Mean   :  56.82   Mean   :2.355   Mean   :  38.492  
##  3rd Qu.: 2575.04   3rd Qu.:  43.01   3rd Qu.:2.690   3rd Qu.:  34.020  
##  Max.   :15047.67   Max.   :5535.49   Max.   :2.690   Max.   :2642.020  
##                     NA's   :781       NA's   :2496    NA's   :1937      
##       MEAT           MISC ITEMS        PHARMACY         SERVICES     
##  Min.   :   1.71   Min.   :   0.0   Min.   : 1.190   Min.   :  0.00  
##  1st Qu.: 102.86   1st Qu.: 120.2   1st Qu.: 5.910   1st Qu.: 14.99  
##  Median : 274.07   Median : 310.3   Median : 6.990   Median : 39.44  
##  Mean   : 427.87   Mean   : 646.0   Mean   : 9.804   Mean   : 66.60  
##  3rd Qu.: 571.49   3rd Qu.: 800.3   3rd Qu.:11.748   3rd Qu.: 83.27  
##  Max.   :3708.78   Max.   :7939.3   Max.   :20.890   Max.   :885.24  
##  NA's   :37        NA's   :7        NA's   :2482     NA's   :177     
##    STATIONARY    STORE SUPPLY/MISC      TOYS          VIDEO       
##  Min.   :1.000   Min.   : 0.000    Min.   :0.99   Min.   : 5.000  
##  1st Qu.:2.435   1st Qu.: 1.000    1st Qu.:0.99   1st Qu.: 6.482  
##  Median :3.870   Median : 1.000    Median :1.24   Median : 9.235  
##  Mean   :3.870   Mean   : 2.453    Mean   :2.46   Mean   : 9.372  
##  3rd Qu.:5.305   3rd Qu.: 1.000    3rd Qu.:2.71   3rd Qu.:12.213  
##  Max.   :6.740   Max.   :20.000    Max.   :6.37   Max.   :13.990  
##  NA's   :2498    NA's   :2480      NA's   :2496   NA's   :2494


3.5. Variables Summary:

  • Dataset-1 : Demographic_factors

  • Total rows and variables = 801, 9 , respectively
  • Variables: household_key - it has zero missing values, and all unique values signifying distinct customers, 801 in number
  • Total_Sales - This will be our response variable. This column has no missing values, and data adheres to the possible ranges in the data dsictionary
  • age_desc - ranges for the variable age of customer ; no missing values ; data as per possible values in data dictionary
  • marital_status - shows marital status of customers showing three levels. No missing values and no abnormal values either
  • income_desc - gives possible ranges for the income levels, a factor. No missing values ; no abnormal values either
  • homeowner_desc - specifies whether homeowner is owner, a renter, etc. A factor variable, with no missing or abnormal values
  • household_size_desc - a factor variable, specifying the size of a household. No missing values, no abnormal values either
  • kid_category_desc - no missing values, no abnormal values either

  • Dataset-2 : Customer_profile

  • Total rows and variables = 2500, 20 , respectively
  • household_key: signifies unique household
  • other 19 variables: these are the ‘super categories’ that we created by grouping the 44 product categories / departments

Missing Values : There are no missing values. However the NA values might give the reader the wrong idea. To clarify, for the purposes of our analysis, an ‘NA’ would mean zero spending for the particular customer in that particular category. So, the NA values would be imputed with zeroes before starting analysis

Proposed Exploratory Data Analysis

4.1 Problem Statement 1-Demographic factors driving customer spend

4.1.1 Data Sources used-

  • HH Demographic
  • Transaction Data

4.1.2 Business Problem- Identify Demographic factors (Age Description, Martial Status Code, Income Desc, Household Size Description) driving overall dollar sales and quantify their sales impact

4.1.3 Business Implications-This will help quantify the impact of demographic factors driving sales. It will also help business understand the buying potential of new customers based on their demographic profile.


4.1.4 Methodology

4.1.4.a Preliminary Analysis - Summaries (Mean, Median) across various demographic variables should be calculated to understand how spend various across different levels. Following hypothesis will be tested-

  • Bigger size of household(Household Size Description) results in more spend
  • People with higher income(Income Desc) spend more

Once we have some idea about how various variables affect sales, linear regression model will be created to quantify the impact

4.1.4.b Linear Regression Model - Build Linear regression model using the dataset with all demographic variables and overall sales. Following steps will be followed-

  • Step 1-Data was collated with all the demographic variables and overall sales over a certain time period for all customers
  • Step 2-After checking relationships of variables with overall sales(Using plots of predictors and sales), take appropriate transformation to establish a linear relationship
  • Step 3-Once the linear relationship has been established, run the linear regression to calculate estimates for each of the variables
  • Step 4-Compare the model iteration using model-fit statistics(R-square, error) to arrive at the best model
  • Step 5-Calculate the contribution values to understand what percentage of sales is driven by which demographic factor
  • Step 6-Also, calculate values for test cases to see whether model is giving accurate predictions



4.2 Problem Statement 2-Profiling of Customers based on Product Category Engagement

4.2.1 Data Sources used-

  • Transaction Data
  • Product Data

4.2.2 Business Problem- Segmenting customers based on spend across various product categories

4.2.3 Business Implications - This will allow business to get better understanding of customer base based on product category preferences. It will allow business to get better visibility of the market. Over time, this analysis will allow business to keep track of how customers are moving across defined segments and how their preferences change with time


4.2.4 Methodology

4.2.4.a Preliminary Analysis- Summarize(mean,median and mode) the datasets and plot using box-plot graph to identify outliers in spend data. Calculated the range of spend that should be considered after removing outliers. Using plots thus created, identify categories which are more important that others.

4.2.4.b K-Means Segmentation - Using sales value across many product departments, K-means segmentation will be done to segment customers based on their product needs. Following steps will be followed-

  • Step 1-Required columns(Household ids, Total Sales, Product Categories/Departments) were pulled from Transaction and Product data after merging them using appropriate columns
  • Step 2-Data was transformed from tall to wide format where columns have total sales for the product categories
  • Step 3-Total Sales values were standardized to give similar weightage to all product categories
  • Step 4-Decide cluster count such that the total intra-cluster variation(within cluster sum of squares) is minimized
  • Step 5-K-means segmentation was run on the standardized columns to make clusters
  • Step 6-Visualize the cluster in R to understand how the customers have been clustered