1.Introduction

Our research is focused on helping a major retailer know their customers better. The goal of the study is to derive insights from the available ‘Customer Full Journey’ datasets relating to customers, transactions, and more. As part of our methodology we plan to clean, tidy, explore, and analyze the datasets we received to finally answer the following questions of interest -

  • Problem Statement #1: What household demographic factors contribute most to customer spend in the stores? Learn about the most impactful factors affecting customer spending.

  • Problem Statement #2: What category is each customer most engaged with? Profile customers and their engagement levels with product categories.


Our methodology can be broadly classified into three steps -

  • Data cleaning and preparation: Clean and tidy the tables/datasets in R, and join tables as required for the analysis (more in detail in the section on data preparation)

  • Data Exploration: Explore the nature and source of data, explore missing values and reasons ; identify abnormal values and outliers, and visualize variable distributions and correlation as required

  • Data Analysis: We propose a regression analysis (linear modeling) to determine the demographic factors that affect consumer spending the most. This will be followed by a clustering technique to segment and profile customers according to their engagement levels with various product categories inside the store


This research will help the client in two major ways. Firstly, understand better as to what factors impact the amount of purchase customers make inside stores. Secondly, design better targeted marketing campaigns powered by the knowledge of customer engagement (and spending) levels with various product categories.

2.Packages

Packages to be installed

Following Packages are needed for the project-

  • dplyr - To do data manipulation in R ; Functions- inner_join,glimpse,group_by,summarise,select
  • magrittr -To use %>% function in the code
  • tidyr -Included for data tidying ; Functions-To use gather and spread function
  • stats -Included for statistical functions ; Functions- To use linear regression(ln) and k-means function
  • ggplot2 - To create Data Visualisations ; Functions- Plot
  • factoextra - To create Data Visualisations for K-means ; Functions- fviz_cluster
  • knitr - To do formatting in R markdown ; Functions- kable
  • kableExtra - To do formatting in R markdown
  • Purrr - To make the codes more efficient using functions like map_*
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(magrittr))) 
suppressWarnings(suppressMessages(library(tidyr))) 
suppressWarnings(suppressMessages(library(ggplot2))) 
suppressWarnings(suppressMessages(library(stats))) 
suppressWarnings(suppressMessages(library(factoextra)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(kableExtra)))
suppressWarnings(suppressMessages(library(purrr)))

3.1.Data Preparation - I

3.1. Data Source The original data on the retail client was downloaded from the Data Wrangling course folders on Github


3.2.1 Introduction of source data The dataset primarily tracks the transaction level data from 2,500 households who frequently shop at the client’s stores. Moreover, the data captures purchases made across the entire gamut of product categories available at the retailer’s stores. Among other important information, we have customer demographics data and direct marketing campaigns data available for certain households. (Please Note: The terms ‘Customer’ and ‘Household’ have been assumed synonymous and used interchangeably in this report)

Timeline: The data tracks customer transactions over a period of two years (104 weeks).


3.2.2 Explanation of source data (tables and variables) In total, this comprehensive total customer journey had 8 different data frames. Each of these are described in brief below. Also a small glimpse into each of these data is given below

Code for Loading Tables

datasets<-c("campaign_desc","campaign_table","causal_data","coupon","coupon_redempt","hh_demographic","product","transaction_data","product_dpt_mapping")
for (i in seq_along(datasets))
{ 
  
  file_path=paste0("data/",datasets[i],".csv")
  if(file.exists(file_path))
  {
    df<-read.csv(file_path)
    assign(datasets[i],df)
  } else
    print("No such file exits")
}
  • Transaction data: Variables: This data frame captures all the product purchases done by all households in two years. This data frame has some variables which are of interest to our research question - the sales value, the retail discounts and coupon discounts which will help us arrive at final customer spend for each customer.
household_key basket_id day product_id quantity sales_value store_id retail_disc trans_time week_no coupon_disc coupon_match_disc
2375 26984851472 1 1004906 1 1.39 364 -0.6 1631 1 0 0
2375 26984851472 1 1033142 1 0.82 364 0.0 1631 1 0 0
2375 26984851472 1 1036325 1 0.99 364 -0.3 1631 1 0 0


  • Demographics data Shows demographics information on a limited number of households. This will be crucial in solving our problem statement # 1, where we will build a linear model to see what demographic factors affect customer spend.
household_key age_desc marital_status_code income_desc homeowner_desc hh_comp_desc household_size_desc kid_category_desc
1 65+ A 35-49K Homeowner 2 Adults No Kids 2 None/Unknown
7 45-54 A 50-74K Homeowner 2 Adults No Kids 2 None/Unknown
8 25-34 U 25-34K Unknown 2 Adults Kids 3 1


  • Campaign_table Lists the campaign received by each household.
description household_key campaign
TypeA 17 26
TypeA 27 26
TypeA 212 26


  • Campaign description Describes the length of campaigns. Length affects coupon validity period.
description campaign start_day end_day
TypeB 24 659 719
TypeC 15 547 708
TypeB 25 659 691


  • Product Table Describes the length of campaigns. Length affects coupon validity period. The variable departments represents the fine categories of products. This will be of special interest to our analysis on problem statement # 2, where we will build a customer profile to map customers and their engagement levels with categories.
product_id manufacturer department brand commodity_desc sub_commodity_desc curr_size_of_product
25671 2 GROCERY National FRZN ICE ICE - CRUSHED/CUBED 22 LB
26081 2 MISC. TRANS. National NO COMMODITY DESCRIPTION NO SUBCOMMODITY DESCRIPTION
26093 69 PASTRY Private BREAD BREAD:ITALIAN/FRENCH


  • Coupon Table Lists all kinds of coupons given as part of various campaigns to customers. Also mentions the kind of products against which the coupons are redeemable.
coupon_upc product_id campaign
10000089061 27160 4
10000089064 27754 9
10000089073 28897 12


  • Coupon Redemption Table Lists the coupons that each household/customer redeemed.
household_key day coupon_upc campaign
1 421 10000085364 8
1 421 51700010076 8
1 427 54200000033 8


  • Causal Data Table Signifies whether each product was featured/advertised in a weekly mailer or in some in-store display.
product_id store_id week_no display mailer
26190 286 70 0 A
26190 288 70 0 A
26190 289 70 0 A


3.2.3 Peculiarities in the data sets


Missing Values -

for (i in seq_along(datasets))
{ 
missing_values<-any(is.na(get(datasets[i])))
if (missing_values)
{
  
  print(paste(datasets[i],"has Missing values"))
} else
  print(paste(datasets[i],"has no Missing values"))
}
## [1] "campaign_desc has no Missing values"
## [1] "campaign_table has no Missing values"
## [1] "causal_data has no Missing values"
## [1] "coupon has no Missing values"
## [1] "coupon_redempt has no Missing values"
## [1] "hh_demographic has no Missing values"
## [1] "product has no Missing values"
## [1] "transaction_data has no Missing values"
## [1] "product_dpt_mapping has no Missing values"

Note:

  • We see that there are no missing values in the datasets
  • We will again do a missing values analysis and an outlier or abnormal values analysis on our variables of concern, on the datasets that we prepare for our problem statements #1 and #2 (in the next tab)


SUMMARY:

We showed above a glimpse and summary of how all the data from this comprehensive and complex collection look and what they mean. For solving our two research questions the tables that contain variables ofinterest would be -

  • Transaction data
  • Product data
  • Demographics data

Refer to tab Data preparation II for more on this and next steps…

3.3.Data Preparation - II

3.3. Data Preparation Steps

Here we focus on two areas, data preparation (slicing and dicing) for:

  • Problem Statement #1
  • Problem Statement #2

Problem Statement #1

Tables of concern for Problem #1:

  • Transaction data
  • Demographic data

Variables of concern

  • sales_value: signifies the total customer spend, as per current understanding.
  • household_key: signifies a unique customer
  • age_desc: estimated age range of customer
  • marital_status_code: marital status (A - married, B- single, C - unknown)
  • income_desc: household income
  • homeowner_desc: homeowner description (homeowner, renter, etc.)
  • hh_comp_desc: household composition description (no. of adults, kids, etc.)
  • household_size_desc: size of household up to 5+
  • kid_category_desc: no. of children up to 3+

3.4. Methodology and Output dataset

  • group the sales_value for each unique household in the transaction dataset
  • match those row entries against the unique customers in the demographic dataset
  • inner join between the two datasets to have the customer spend and demographic variables for 850 unique customers.
#Dataset 1-Demographic factors driving customer spend
Demographic_factors<-group_by(transaction_data,household_key) %>%
 summarise(Total_Sales=sum(sales_value))%>%
 inner_join(hh_demographic,by="household_key")
household_key Total_Sales age_desc marital_status_code income_desc homeowner_desc hh_comp_desc household_size_desc kid_category_desc
1 4330.16 65+ A 35-49K Homeowner 2 Adults No Kids 2 None/Unknown
7 3400.05 45-54 A 50-74K Homeowner 2 Adults No Kids 2 None/Unknown
8 5534.97 25-34 U 25-34K Unknown 2 Adults Kids 3 1
13 13190.92 25-34 U 75-99K Homeowner 2 Adults Kids 4 2
16 1512.02 45-54 B 50-74K Homeowner Single Female 1 None/Unknown
##  household_key   Total_Sales       age_desc   marital_status_code
##  Min.   :   1   Min.   :  646.4   19-24: 46   A:340              
##  1st Qu.: 596   1st Qu.: 2970.1   25-34:142   B:117              
##  Median :1218   Median : 4676.4   35-44:194   U:344              
##  Mean   :1235   Mean   : 5615.1   45-54:288                      
##  3rd Qu.:1914   3rd Qu.: 7165.0   55-64: 59                      
##  Max.   :2499   Max.   :27859.7   65+  : 72                      
##                                                                  
##     income_desc          homeowner_desc           hh_comp_desc
##  50-74K   :192   Homeowner      :504    1 Adult Kids    : 47  
##  35-49K   :172   Probable Owner : 11    2 Adults Kids   :187  
##  75-99K   : 96   Probable Renter: 11    2 Adults No Kids:255  
##  25-34K   : 77   Renter         : 42    Single Female   :144  
##  15-24K   : 74   Unknown        :233    Single Male     : 95  
##  Under 15K: 61                          Unknown         : 73  
##  (Other)  :129                                                
##  household_size_desc    kid_category_desc
##  1 :255              1           :114    
##  2 :318              2           : 60    
##  3 :109              3+          : 69    
##  4 : 53              None/Unknown:558    
##  5+: 66                                  
##                                          
## 



Problem Statement #2

Tables of concern for Problem #2:

  • Transaction data
  • Product data

Variables of concern

  • Total_Sales: signifies the total customer spend, as per current understanding.
  • household_key: signifies a unique customer
  • product categories: the categories classified into
#Dataset 2-Profiling of Customers based on Product Category Engagement
Customer_profile <- inner_join(transaction_data, product, by = "product_id")%>%
  inner_join(product_dpt_mapping, by = "department")%>%
  group_by(household_key, Super_category) %>%
  summarise(Total_Sales = sum(sales_value))%>%
  select(household_key, Super_category, Total_Sales)%>%
  spread(Super_category, Total_Sales)
household_key AUTOMOTIVE COSMETICS COUPONS DELI EGGS & MILK PRODUCTS FLORAL GARDEN SUPPLIES GROCERY GROCERY-HEALTHY INFANT PRODUCTS LIQUIOR MEAT MISC ITEMS PHARMACY SERVICES STATIONARY STORE SUPPLY/MISC TOYS VIDEO
1 NA NA 20.00 212.17 NA 7.99 NA 3021.10 48.33 NA NA 349.00 530.89 NA 140.68 NA NA NA NA
2 NA 22.07 4.26 46.73 NA 21.99 NA 1264.77 7.47 NA NA 232.19 330.04 NA 24.82 NA NA NA NA
3 NA 1.25 NA 36.47 NA NA NA 1852.78 NA NA NA 578.97 179.09 NA 4.65 NA NA NA NA
4 NA 2.12 NA 14.35 NA NA NA 703.60 3.29 NA NA 157.77 318.98 NA NA NA NA NA NA
5 NA 9.48 NA 80.03 NA NA NA 361.32 2.00 NA NA 120.77 203.08 NA 2.38 NA NA NA NA
##  household_key      AUTOMOTIVE       COSMETICS         COUPONS        
##  Min.   :   1.0   Min.   : 0.000   Min.   :  0.00   Min.   :    0.00  
##  1st Qu.: 625.8   1st Qu.: 2.950   1st Qu.:  4.01   1st Qu.:    3.49  
##  Median :1250.5   Median : 5.950   Median : 10.80   Median :   10.50  
##  Mean   :1250.5   Mean   : 7.421   Mean   : 25.16   Mean   :   80.34  
##  3rd Qu.:1875.2   3rd Qu.: 9.990   3rd Qu.: 29.12   3rd Qu.:   37.97  
##  Max.   :2500.0   Max.   :33.660   Max.   :529.07   Max.   :11161.12  
##                   NA's   :2439     NA's   :1214     NA's   :994       
##       DELI         EGGS & MILK PRODUCTS     FLORAL       GARDEN SUPPLIES 
##  Min.   :   0.75   Min.   :0.5000       Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  21.76   1st Qu.:0.8975       1st Qu.:  8.99   1st Qu.:  5.00  
##  Median :  56.66   Median :0.9500       Median : 18.98   Median : 13.34  
##  Mean   : 116.06   Mean   :1.4646       Mean   : 33.43   Mean   : 24.90  
##  3rd Qu.: 137.72   3rd Qu.:1.7600       3rd Qu.: 38.95   3rd Qu.: 29.86  
##  Max.   :1808.36   Max.   :5.7800       Max.   :877.12   Max.   :936.67  
##  NA's   :252       NA's   :2450         NA's   :1314     NA's   :2192    
##     GROCERY         GROCERY-HEALTHY   INFANT PRODUCTS    LIQUIOR        
##  Min.   :    1.74   Min.   :   0.00   Min.   :2.020   Min.   :   0.000  
##  1st Qu.:  558.17   1st Qu.:   5.00   1st Qu.:2.020   1st Qu.:   8.735  
##  Median : 1267.53   Median :  13.78   Median :2.355   Median :  17.480  
##  Mean   : 1862.72   Mean   :  56.82   Mean   :2.355   Mean   :  38.492  
##  3rd Qu.: 2575.04   3rd Qu.:  43.01   3rd Qu.:2.690   3rd Qu.:  34.020  
##  Max.   :15047.67   Max.   :5535.49   Max.   :2.690   Max.   :2642.020  
##                     NA's   :781       NA's   :2496    NA's   :1937      
##       MEAT           MISC ITEMS        PHARMACY         SERVICES     
##  Min.   :   1.71   Min.   :   0.0   Min.   : 1.190   Min.   :  0.00  
##  1st Qu.: 102.86   1st Qu.: 120.2   1st Qu.: 5.910   1st Qu.: 14.99  
##  Median : 274.07   Median : 310.3   Median : 6.990   Median : 39.44  
##  Mean   : 427.87   Mean   : 646.0   Mean   : 9.804   Mean   : 66.60  
##  3rd Qu.: 571.49   3rd Qu.: 800.3   3rd Qu.:11.748   3rd Qu.: 83.27  
##  Max.   :3708.78   Max.   :7939.3   Max.   :20.890   Max.   :885.24  
##  NA's   :37        NA's   :7        NA's   :2482     NA's   :177     
##    STATIONARY    STORE SUPPLY/MISC      TOYS          VIDEO       
##  Min.   :1.000   Min.   : 0.000    Min.   :0.99   Min.   : 5.000  
##  1st Qu.:2.435   1st Qu.: 1.000    1st Qu.:0.99   1st Qu.: 6.482  
##  Median :3.870   Median : 1.000    Median :1.24   Median : 9.235  
##  Mean   :3.870   Mean   : 2.453    Mean   :2.46   Mean   : 9.372  
##  3rd Qu.:5.305   3rd Qu.: 1.000    3rd Qu.:2.71   3rd Qu.:12.213  
##  Max.   :6.740   Max.   :20.000    Max.   :6.37   Max.   :13.990  
##  NA's   :2498    NA's   :2480      NA's   :2496   NA's   :2494


3.5. Variables Summary:

  • Dataset-1 : Demographic_factors

  • Total rows and variables = 801, 9 , respectively
  • Variables: household_key - it has zero missing values, and all unique values signifying distinct customers, 801 in number
  • Total_Sales - This will be our response variable. This column has no missing values, and data adheres to the possible ranges in the data dsictionary
  • age_desc - ranges for the variable age of customer ; no missing values ; data as per possible values in data dictionary
  • marital_status - shows marital status of customers showing three levels. No missing values and no abnormal values either
  • income_desc - gives possible ranges for the income levels, a factor. No missing values ; no abnormal values either
  • homeowner_desc - specifies whether homeowner is owner, a renter, etc. A factor variable, with no missing or abnormal values
  • household_size_desc - a factor variable, specifying the size of a household. No missing values, no abnormal values either
  • kid_category_desc - no missing values, no abnormal values either

  • Dataset-2 : Customer_profile

  • Total rows and variables = 2500, 20 , respectively
  • household_key: signifies unique household
  • other 19 variables: these are the ‘super categories’ that we created by grouping the 44 product categories / departments

Missing Values : There are no missing values. However the NA values might give the reader the wrong idea. To clarify, for the purposes of our analysis, an ‘NA’ would mean zero spending for the particular customer in that particular category. So, the NA values would be imputed with zeroes before starting analysis

4.1.Data Analysis: Problem-1

4.1 Problem Statement 1-Demographic factors driving customer spend

  • 4.1.1 Analysis

Here is the main analysis of our first problem statement - What factors primarily affect the total customer spending for our retail client. The analysis is being done on the dataset - ‘Demographic_factors’. The reader might remember that this dataset in the one we prepared (using wrangling procedures) in our data preparation sections I and II. The predictor variables being investigated are income_desc, household_size_desc, marital_status_code, and age_desc and all of these are ‘factors’


To begin with, the analysis “graphically” investigates the possible effects of these variables on the customer spend (Total_Sales). Along with the plots, an interpretation of plots is provided. This is followed by a linear model on the Total_Sales against the four predictor variables, along with interpretations. At the very last of this section, we present a paragraph on the summary/explanation of results.

Demographic_factors %>%
  ggplot(aes(x = Total_Sales, y = income_desc)) +
  geom_point() +
  labs(title = "Spend vs Income", x = "Spend", y = "Income")

Observation: We see that the average consumer spend looks to be the highest in the age group of $50-74k followed by the age groups of $35-49k and $75-99k.

Demographic_factors %>%
ggplot(aes(x = Total_Sales, y = age_desc)) +
  geom_point() +
  labs(title = "Spend vs Age", x = "Spend", y = "Age")

Observation: It appears that the age bracket of 35-55 years spends the highest compared to all other age groups.

Demographic_factors %>%
ggplot(aes(x = Total_Sales, y = household_size_desc,
                                color = marital_status_code)) +
  geom_point() +
  labs(title = "Spend vs Household Size, and Marital Status", x = "Spend", 
       y = "Household Size")

Observation: It appears that households with sizes of 2 and 3 would be the highest average spenders, followed by households with sizes 1 and 5. Moreover, it also appears that on average, married customers do spend more as compared to unmarried and single people.


From the above, we conjecture that a consumer profile who is married, has an household size of 2 to 3, is in the age bracket of 35 to 55, and earns $50-74k per annum, will likely be the highest average spender for our client. We explore it a little further though a linear regression model as follows.


Since all our predictor variables are factors, we take the following steps before running a regression-

  • Dummy encoding before running our regression
  • Check to make sure our response variable can be reasonably assumed to be normally distributed
# Dummy encoding 

head(model.matrix( ~  age_desc + marital_status_code + income_desc + 
                     household_size_desc, data = Demographic_factors))
# Regression

model_1 <- lm(Total_Sales ~ age_desc + marital_status_code + 
                           income_desc + household_size_desc, 
                         data = Demographic_factors)
summary(model_1)
## 
## Call:
## lm(formula = Total_Sales ~ age_desc + marital_status_code + income_desc + 
##     household_size_desc, data = Demographic_factors)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7400.3 -2450.4  -639.5  1497.9 19669.4 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4994.1      688.4   7.255 1.06e-12 ***
## age_desc25-34            708.0      683.6   1.036  0.30072    
## age_desc35-44           1400.9      670.2   2.090  0.03695 *  
## age_desc45-54            782.6      650.4   1.203  0.22931    
## age_desc55-64            223.6      783.8   0.285  0.77555    
## age_desc65+             -334.1      755.6  -0.442  0.65852    
## marital_status_code.L   -228.9      253.7  -0.902  0.36729    
## marital_status_code.Q    216.0      324.2   0.666  0.50543    
## income_desc.L           3102.2      954.3   3.251  0.00121 ** 
## income_desc.Q          -1245.2      926.4  -1.344  0.17933    
## income_desc.C          -1936.3      840.4  -2.304  0.02151 *  
## income_desc^4          -1989.2      798.6  -2.491  0.01297 *  
## income_desc^5           -615.0      733.4  -0.839  0.40202    
## income_desc^6            -82.5      647.3  -0.127  0.89862    
## income_desc^7            301.8      554.4   0.544  0.58638    
## income_desc^8           -614.2      483.7  -1.270  0.20464    
## income_desc^9           -179.3      487.4  -0.368  0.71315    
## household_size_desc2     233.2      369.4   0.631  0.52806    
## household_size_desc3     678.9      478.5   1.419  0.15641    
## household_size_desc4    1138.8      614.4   1.854  0.06421 .  
## household_size_desc5+    735.5      599.1   1.228  0.21995    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3517 on 708 degrees of freedom
##   (72 observations deleted due to missingness)
## Multiple R-squared:  0.1209, Adjusted R-squared:  0.09608 
## F-statistic: 4.869 on 20 and 708 DF,  p-value: 2.833e-11

Interpretation: Our linear model results seem to match our initial conjecture (from the graphical investigation) to a great extent. Some of those observations are as follows.

  • Age group 35-44 comes out to be statistically significant. Which means that a positive indicator in this age group could have a significant effect (1400 units) on the consumer spend.

  • Marital status ‘Married’ shows statistical significance which has the value of the intercept 5244. Which shows that a positive indicator in this category has a significant effect on consumer spend.

  • Income categories “15-24k”, “35-49k” and “50-74k” show statistical significance. With all else constant, a positive indicator in these categories will have a positive effect (combined with the intercept) on the consumer spend.

  • Household sizes 3 and 4 have a statistical significance in the model. Which means these have a significant effect on the consumer spend.


Summary: Overall we find that of the four predictor variables studied there are a few categories in each that can largely influence the consumer spend. So we can reasonably conclude that customers are likely to spend more if they fall into the categories under each predictor as mentioned above. Since we get a lesser R-square from the study, we can conjecture that there could be other variables that can possibly help further explain consumer demand variations to different factors


4.1.2 Methodology

4.1.2.a Preliminary Analysis - Summaries (Mean, Median) across various demographic variables were calculated to understand how spend varies across different levels. Also, we did a graphical investigation to see patterns, and correlations.

Once we had some idea about how various variables affect consumer spend, a linear regression model was built to ascertain which variables categories affect spending and how they affect spending.

4.1.2.b Linear Regression Model - We built a Linear Regression model using the dataset with all demographic variables and overall sales. Following steps were followed-

  • Step 1-Data was collated with all the demographic variables and overall sales over a certain time period for all customers

  • Step 2-After checking relationships of variables with overall sales(Using plots of predictors and sales), we tested appropriateness for normality of the response variable. And we followed a dummy encoding for all the factor predictor variables.

  • Step 3-Once these validations were complete, we ran the linear regression to calculate estimates for each of the variables and categories

  • Step 4-Compare the model iteration using model-fit statistics(R-square, error) to arrive at the best model

  • Step 5-Calculate the contribution values to understand to what magnitude is sales affected by each of those demographic factors.


4.2.Data Analysis: Problem-2

4.2 Problem Statement 2-Profiling of Customers based on Product Category Engagement

4.2.1 Data Sources used-

  • Transaction Data
  • Product Data

4.2.2 Business Problem- Segmenting customers based on spend across various product categories

4.2.3 Data Analysis -

4.2.3.a Preliminary Analysis- Summarize(mean,median and mode) the datasets and plot using box-plot graph to identify outliers in spend data. Calculated the range of spend that should be considered after removing outliers. Using plots thus created, identify categories which are more important that others.

4.2.3.b K-Means Segmentation - Using sales value across many product departments, K-means segmentation was done to segment customers based on their product needs. Following steps were followed-

  • Step 1 - Required columns(Household ids, Total Sales, Product Categories/Departments) were pulled from Transaction and Product data after merging them using appropriate columns

  • Step 2 - Using the product description information, similar product categories(department) were combined to form broad product categories. e.g. ‘Frozen grocery’, ‘Produce’ and ‘Grocery’ were all combined under Grocery

  • Step 3 - Sales values were rolled up at broad product category level to get total sales for the households

  • Step 4 -Data was transformed from tall to wide format where columns have total sales for the product categories

household_key AUTOMOTIVE COSMETICS COUPONS DELI EGGS & MILK PRODUCTS FLORAL GARDEN SUPPLIES GROCERY GROCERY-HEALTHY INFANT PRODUCTS LIQUIOR MEAT MISC ITEMS PHARMACY SERVICES STATIONARY STORE SUPPLY/MISC TOYS VIDEO
1 NA NA 20.00 212.17 NA 7.99 NA 3021.10 48.33 NA NA 349.00 530.89 NA 140.68 NA NA NA NA
2 NA 22.07 4.26 46.73 NA 21.99 NA 1264.77 7.47 NA NA 232.19 330.04 NA 24.82 NA NA NA NA
3 NA 1.25 NA 36.47 NA NA NA 1852.78 NA NA NA 578.97 179.09 NA 4.65 NA NA NA NA
4 NA 2.12 NA 14.35 NA NA NA 703.60 3.29 NA NA 157.77 318.98 NA NA NA NA NA NA
5 NA 9.48 NA 80.03 NA NA NA 361.32 2.00 NA NA 120.77 203.08 NA 2.38 NA NA NA NA
6 NA 9.96 17.99 125.23 NA 23.98 8.58 3288.33 46.19 NA NA 1213.28 1097.23 NA 165.39 NA NA NA NA

AUTOMOTIVE COSMETICS
COUPONS </th>
  DELI </th>
EGGS & MILK PRODUCTS
 FLORAL </th>
GARDEN SUPPLIES
GROCERY </th>
GROCERY-HEALTHY INFANT PRODUCTS
LIQUIOR </th>
  MEAT </th>
MISC ITEMS
PHARMACY </th>
SERVICES </th>
STATIONARY STORE SUPPLY/MISC
  TOYS </th>
 VIDEO </th>
Min. : 0.0000 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. :0.00000 Min. : 0.00 Min. : 0.000 Min. : 1.74 Min. : 0.00 Min. :0.000000 Min. : 0.000 Min. : 0.00 Min. : 0.0 Min. : 0.00000 Min. : 0.00 Min. :0.000000 Min. : 0.00000 Min. :0.000000 Min. : 0.00000
1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 12.23 1st Qu.:0.00000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 558.17 1st Qu.: 0.00 1st Qu.:0.000000 1st Qu.: 0.000 1st Qu.: 98.14 1st Qu.: 119.0 1st Qu.: 0.00000 1st Qu.: 10.98 1st Qu.:0.000000 1st Qu.: 0.00000 1st Qu.:0.000000 1st Qu.: 0.00000
Median : 0.0000 Median : 0.99 Median : 2.50 Median : 46.13 Median :0.00000 Median : 0.00 Median : 0.000 Median : 1267.53 Median : 5.47 Median :0.000000 Median : 0.000 Median : 269.70 Median : 308.7 Median : 0.00000 Median : 34.91 Median :0.000000 Median : 0.00000 Median :0.000000 Median : 0.00000
Mean : 0.1811 Mean : 12.94 Mean : 48.40 Mean : 104.36 Mean :0.02929 Mean : 15.86 Mean : 3.067 Mean : 1862.72 Mean : 39.07 Mean :0.003768 Mean : 8.668 Mean : 421.54 Mean : 644.1 Mean : 0.07059 Mean : 61.89 Mean :0.003096 Mean : 0.01962 Mean :0.003936 Mean : 0.02249
3rd Qu.: 0.0000 3rd Qu.: 11.40 3rd Qu.: 16.01 3rd Qu.: 124.94 3rd Qu.:0.00000 3rd Qu.: 17.07 3rd Qu.: 0.000 3rd Qu.: 2575.04 3rd Qu.: 26.33 3rd Qu.:0.000000 3rd Qu.: 0.000 3rd Qu.: 563.54 3rd Qu.: 799.2 3rd Qu.: 0.00000 3rd Qu.: 77.99 3rd Qu.:0.000000 3rd Qu.: 0.00000 3rd Qu.:0.000000 3rd Qu.: 0.00000
Max. :33.6600 Max. :529.07 Max. :11161.12 Max. :1808.36 Max. :5.78000 Max. :877.12 Max. :936.670 Max. :15047.67 Max. :5535.49 Max. :2.690000 Max. :2642.020 Max. :3708.78 Max. :7939.3 Max. :20.89000 Max. :885.24 Max. :6.740000 Max. :20.00000 Max. :6.370000 Max. :13.99000

Total sales across product category suggests that biggest product category is Grocery followed by Misc Items and Meat. These 19 variables were taken as input for K-means clustering
product Sum_cols
GROCERY 4656794.40
MISC ITEMS 1610363.09
MEAT 1053836.82
DELI 260902.59
SERVICES 154722.32
COUPONS 120987.76
GROCERY-HEALTHY 97669.04
FLORAL 39653.34
COSMETICS 32360.37
LIQUIOR 21670.88
GARDEN SUPPLIES 7667.82
AUTOMOTIVE 452.66
PHARMACY 176.48
EGGS & MILK PRODUCTS 73.23
VIDEO 56.23
STORE SUPPLY/MISC 49.05
TOYS 9.84
INFANT PRODUCTS 9.42
STATIONARY 7.74


Histogram of all 2500 households and their overall spend

Checking quantile values to understand what should be the cut-off for analysis.Customers who spent less than $390 in 2 years time were removed from the analysis as we dont have sufficient history for those cases

##      10%      50%      90% 
##  389.878 2157.750 7448.216


* Step 5 -Total Sales values were standardized to give similar weightage to all product categories

GROCERY MISC ITEMS MEAT DELI SERVICES COUPONS GROCERY-HEALTHY FLORAL COSMETICS LIQUIOR GARDEN SUPPLIES AUTOMOTIVE PHARMACY EGGS & MILK PRODUCTS VIDEO STORE SUPPLY/MISC TOYS INFANT PRODUCTS STATIONARY
0.5112524 -0.1974663 -0.2465622 0.5824087 0.8392364 -0.1025096 0.0317165 -0.2309761 -0.4367516 -0.1416004 -0.1467752 -0.1227636 -0.0746371 -0.1192238 -0.0448826 -0.044699 -0.0310159 -0.0417706 -0.0239491
-0.4182378 -0.4174619 -0.4940023 -0.4104241 -0.4943611 -0.1505816 -0.2160162 0.1119995 0.2405165 -0.1416004 -0.1467752 -0.1227636 -0.0746371 -0.1192238 -0.0448826 -0.044699 -0.0310159 -0.0417706 -0.0239491
-0.1070494 -0.5828009 0.2405860 -0.4719961 -0.7265263 -0.1635922 -0.2613066 -0.4267171 -0.3983925 -0.1416004 -0.1467752 -0.1227636 -0.0746371 -0.1192238 -0.0448826 -0.044699 -0.0310159 -0.0417706 -0.0239491
-0.7152218 -0.4295762 -0.6516471 -0.6047419 -0.7800498 -0.1635922 -0.2413594 -0.4267171 -0.3716946 -0.1416004 -0.1467752 -0.1227636 -0.0746371 -0.1192238 -0.0448826 -0.044699 -0.0310159 -0.0417706 -0.0239491
-0.8963643 -0.5565241 -0.7300246 -0.2105853 -0.7526549 -0.1635922 -0.2491806 -0.4267171 -0.1458363 -0.1416004 -0.1467752 -0.1227636 -0.0746371 -0.1192238 -0.0448826 -0.044699 -0.0310159 -0.0417706 -0.0239491
0.6526767 0.4228588 1.5842527 0.0606674 1.1236590 -0.1086484 0.0187418 0.1607510 -0.1311064 -0.1416004 0.2242823 -0.1227636 -0.0746371 -0.1192238 -0.0448826 -0.044699 -0.0310159 -0.0417706 -0.0239491


* Step 6 -Decide cluster count such that the total intra-cluster variation(within cluster sum of squares) is minimized. 15 was taken as the optimal cluster count from the graph

# function to compute total within-cluster sum of square 
wss <- function(k) {
  kmeans(df, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 40
k.values <- 1:40

# extract wss for 1-40 clusters
wss_values <- suppressWarnings(map_dbl(k.values, wss))
par(mfrow=c(1,1))
suppressWarnings(plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares"))


* Step 7 -K-means segmentation was run on the standardized columns to make clusters

#running k-means for 15 cluster count
k2 <- suppressWarnings(kmeans(df, centers = 15, nstart = 25))
str(k2)
## List of 9
##  $ cluster     : int [1:2250] 5 9 9 9 9 5 5 5 9 15 ...
##  $ centers     : num [1:15, 1:19] 1.818 1.083 0.634 6.07 0.255 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:15] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:19] "GROCERY" "MISC ITEMS" "MEAT" "DELI" ...
##  $ totss       : num 42731
##  $ withinss    : num [1:15] 135 660 115 0 3044 ...
##  $ tot.withinss: num 13725
##  $ betweenss   : num 29006
##  $ size        : int [1:15] 4 9 4 1 639 14 1 10 1238 233 ...
##  $ iter        : int 11
##  $ ifault      : int 2
##  - attr(*, "class")= chr "kmeans"

After getting cluster mappind for all 2250 households, data looks like this

##   k2$cluster MISC ITEMS GROCERY
## 1          5     530.89 3021.10
## 2          9     330.04 1264.77
## 3          9     179.09 1852.78
## 4          9     318.98  703.60
## 5          9     203.08  361.32
## 6          5    1097.23 3288.33


* Step 8 -Visualize the cluster in R to understand how the customers have been clustered


* Step 9 - Profile and name the clusters based on the average spend across categories. Segments have been named as ‘H-Floral-Cosmetics-Garden Supplies buyers’ where H represents ‘high’ total spend and Product categories names suggests the kind of product a customer/household buys(Check Excel)

cluster number avg_GROCERY avg_MISC_ITEMS avg_MEAT avg_DELI avg_SERVICES avg_COUPONS avg_GROCERY_HEALTHY avg_FLORAL avg_COSMETICS avg_LIQUIOR avg_GARDEN_SUPPLIES avg_AUTOMOTIVE avg_PHARMACY avg_EGGS___MILK_PRODUCTS avg_VIDEO avg_STORE_SUPPLY_MISC avg_TOYS avg_INFANT_PRODUCTS avg_STATIONARY
1 4 5491.0200 1600.3275 1306.0325 88.01500 130.56750 32.81000 37.66250 7.060000 45.995000 26.560000 0.0000000 0.0000000 0.0000000 0.4400000 0.0000000 0.0000000 0.0000000 2.355 0.0000000
2 9 4101.6133 2158.5800 605.4089 366.78667 270.55222 371.75111 237.92000 69.255556 61.822222 22.824444 19.4611111 0.0000000 14.9455556 0.0000000 0.0000000 0.0000000 0.0000000 0.000 0.0000000
3 4 3253.7375 1724.6425 779.4725 170.92500 77.58750 12.93500 29.20750 7.382500 7.045000 44.657500 0.0000000 0.0000000 0.0000000 0.0000000 11.3125000 0.0000000 0.0000000 0.000 0.0000000
4 1 13523.8300 6447.2700 2263.5500 1808.36000 476.46000 11161.12000 563.31000 877.120000 212.120000 49.980000 936.6700000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000 0.0000000
5 639 2536.0856 824.4299 587.5830 130.32310 81.38161 58.04653 42.63070 19.318060 15.181847 9.443083 2.7926604 0.0455243 0.0077934 0.0182942 0.0000000 0.0236307 0.0015493 0.000 0.0000000
6 14 2055.8636 638.4293 609.9521 95.69143 66.16429 17.66143 18.25500 17.275714 7.216429 1.612143 6.3478571 0.0000000 0.0000000 2.7821429 0.0000000 0.0000000 0.0000000 0.000 0.0000000
7 1 6432.8600 4248.3500 888.7000 66.23000 67.85000 2.99000 21.24000 0.000000 46.210000 2642.020000 0.0000000 4.4500000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000 0.0000000
8 10 4385.0750 1563.1170 586.2970 257.82900 91.98100 35.48300 1694.22900 39.557000 46.844000 12.888000 5.8440000 0.3990000 0.5970000 0.0000000 0.0000000 0.0000000 0.0000000 0.000 0.0000000
9 1238 904.8305 259.3305 209.6189 42.64703 30.48683 12.75612 16.43466 6.333207 4.501615 5.148215 0.7878837 0.0296123 0.0106381 0.0067932 0.0040388 0.0016155 0.0000000 0.000 0.0000000
10 233 4865.4105 1895.4209 1132.1258 293.27983 146.18455 116.42378 78.99468 32.853176 34.432060 15.077296 5.8499142 0.1380258 0.0680687 0.0454506 0.0000000 0.0085837 0.0042489 0.000 0.0042918
11 67 7388.5609 2969.6737 1487.0336 582.53433 297.20836 314.28716 130.21716 121.321940 84.197164 24.645373 30.8532836 0.1035821 0.0297015 0.0141791 0.0000000 0.0000000 0.0222388 0.000 0.0000000
12 1 1110.0400 308.1300 326.7800 45.86000 56.45000 11.50000 0.00000 9.990000 2.990000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000 6.7400000
13 1 1964.4100 1052.5700 635.8200 13.08000 128.90000 0.00000 4.48000 52.970000 1.590000 59.180000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 6.3700000 0.000 0.0000000
14 2 8289.2800 3683.4000 1747.4500 736.39000 529.16500 486.54000 274.97500 188.185000 101.695000 18.580000 60.7600000 0.0000000 0.0000000 0.0000000 0.0000000 14.9750000 0.0000000 0.000 0.0000000
15 26 2704.9212 1344.3535 497.9615 150.88654 71.50192 122.01115 58.04923 22.616923 49.244231 10.457692 2.5196154 13.0526923 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000 0.0000000


5.Summary

We studied two different problem statements. The summary of both are mentioned below, one after the other.


Problem Statement - 1

5.1.1 Problem Statement The purpose of this study was to understand how the demographic factors like marital status, age, income and household size affect consumer spending. Knowing what demographics make consumers send more during store visits will enable the client’s business to be more customer-centric and to ensure better customer lifetime value

5.1.2 Methodology

  • Step 1 Summaries (Mean, Median) across various demographic variables were calculated to understand how spend varies across different levels. Also, we did a graphical investigation to see patterns and correlations. Once we had some idea about how various variables affect consumer spend, a linear regression model was built to ascertain which variables categories affect spending and how they affect spending.

  • Step 2 Data was collated with all the demographic variables and overall sales over a certain time period for all customers

  • Step 3 After checking relationships of variables with overall sales(Using plots of predictors and sales), we tested appropriateness for normality of the response variable. And we followed a dummy encoding for all the factor predictor variables.

  • Step 4 Once these validations were complete, we ran the linear regression to calculate estimates for each of the variables and categories

  • Step 5 Compare the model iteration using model-fit statistics(R-square, error) to arrive at the best model

  • Step 6 Calculate the contribution values to understand to what magnitude is sales affected by each of those demographic factors.

5.1.3 Interesting Insights!

A key insight we found was regarding the demographic factors that would likely make the consumer the highest spender at the stores.-

  • Consumer spending seems to be highest if consumers are married, have a household size of 2 to 3, earn somewhere between $30-75k, and are between 35 to 45 years of age.

5.1.4 Project Implications

We feel that this insight on what demographics make consumers spend more will be key to any decision-making for our retail client. This can be a big enabler in any promotion/coupon targeting programs to various consumer segments or to attract those demographic segments of consumers who currently are not high spenders. Overall, we believe this knowledge is key to ensuring a high average customer lifetime value.

Limitations of the analysis * This study was limited to the demographic factors we saw in the data. when we ran the regression we found it gave a lower R-square value. We think that there could be other predictor variables that we are not seeing in the data, which might help explain the correlation even better. Variables such as, whether the consumer is a loyal to another retail chain, or whether distance and convenience is a factor how people choose to walk into stores.



Problem Statement - 2

5.2.1 Problem Statement The purpose of this study was to get better understanding of our retail client’s customers based on product category preferences. It will allow business to get better visibility of the market and customize promotions based on their buying preference. Further, these promotions can help improve the customer interaction level with the store. Over time, it will also allow business to keep track of how customers are moving across defined segments and how their preferences change with time

5.2.2 Methodology

Using sales value across many product departments, K-means segmentation was done to segment customers based on their product needs. Following steps were followed-

  • Step 1 - Required columns(Household ids, Total Sales, Product Categories/Departments) were pulled from Transaction and Product data after merging them using appropriate columns

  • Step 2 - Using the product description information, similar product categories(department) were combined to form broad product categories. e.g. ‘Frozen grocery’, ‘Produce’ and ‘Grocery’ were all combined under Grocery

  • Step 3 - Sales values were rolled up at broad product category level to get total sales for the households

  • Step 4 - Data was transformed from tall to wide format where columns have total sales for the product categories

  • Step 5 - Total Sales values were standardized to give similar weightage to all product categories

  • Step 6 - Decide cluster count such that the total intra-cluster variation(within cluster sum of squares) is minimized

  • Step 7 - K-means segmentation was run on the standardized columns to make clusters

  • Step 8 - Visualize the cluster in R to understand how the customers have been clustered

  • Step 9 - Profile and name the clusters based on the average spend across categories. Segments have been named as ‘H-Floral-Cosmetics-Garden Supplies buyers’ where H represents ‘high’ total spend and Product categories names suggests the kind of product a customer/household buys

5.2.3 Intersting Insights! Following interesting insights were found in the analysis-

  • Among the high spenders, there is a segment ‘H-Floral-Cosmetics-Garden Suplies buyers’ with 67 households and it seems to be buying products from floral, cosmetics and garden supplies. We can target this segment using cross-category promotions across these product categories to increase their engagement level. Also, in case sales of any of these product categories is dropping, sales across this customer segment should be checked for any potential reasons.

  • Majority of the customers(~1230) belong to the “VL-Grocery-Basic Necessities” segment who visit stores for the basic necessities (groceries/basic necessities).Majority of the money is spent on buying on grocery(~61%) followed by Misc Items(17%) and meat(14%)

  • Another important segment is “L-Broad category Spenders” with 639 customers. These customers spend on broad categories ranging from Grocery, Deli, Meat , Services and coupons to less common

5.2.4 Project Implications Earlier promotions were sent out to household without much consideration of customer/household profile and the buying potential. So, the promotions were not customized for the audience and were sent to all irrespective of their preferences New segmentation exercise would allow business to send targeted messages and enable them to design promotional strategies in a better way. For eg. If the store wants to promote new automotive product in the store, it should be sending the promotional e-mail to all households part of segment 15 “L-Automotive Products” as these are the households who have bought products from this category. Also, knowing how your sales is distributed across categories will allow stores to come up with cross-category promotions aimed at increasing customer spend.

Limitations of the analysis Following are the few limitations and few factors which could be included to get improve the quality of segments in the future-

  • We can club some of the segments which have very few households present in them like segment 4,14,7,12 and 13

  • Broad product categories created after combining similar departments could be clubbed further to get fewer dimensions relevant for our business problem. Also, current definition can be further validated to ensure that mapping used is correct as per business requirement

  • Other variables like frequency of visits should be included in the analysis which would give us better idea about the buying pattern and would make household segments more useful for this kind of study