An Analysis of Customer Spend and Product Category-level Engagement

1.Introduction

Our research is focused on helping a major retailer know their customers better. The goal of the study is to derive insights from the available ‘Customer Full Journey’ datasets relating to customers, transactions, and more. As part of our methodology we plan to clean, tidy, explore, and analyze the datasets we received to finally answer the following questions of interest -

Problem Statement #1: What household demographic factors contribute most to customer spend in the stores? Learn about the most impactful factors affecting customer spending.
Problem Statement #2: What category is each customer most engaged with? Profile customers and their engagement levels with product categories.

Our methodology can be broadly classified into three steps -

Data cleaning and preparation: Clean and tidy the tables/datasets in R, and join tables as required for the analysis (more in detail in the section on data preparation)
Data Exploration: Explore the nature and source of data, explore missing values and reasons ; identify abnormal values and outliers, and visualize variable distributions and correlation as required
Data Analysis: We propose a regression analysis (linear modeling) to determine the demographic factors that affect consumer spending the most. This will be followed by a clustering technique to segment and profile customers according to their engagement levels with various product categories inside the store

This research will help the client in two major ways. Firstly, understand better as to what factors impact the amount of purchase customers make inside stores. Secondly, design better targeted marketing campaigns powered by the knowledge of customer engagement (and spending) levels with various product categories.

2.Packages

Packages to be installed

Following Packages are needed for the project-

dplyr - To do data manipulation in R ; Functions- inner_join,glimpse,group_by,summarise,select
magrittr -To use %>% function in the code
tidyr -Included for data tidying ; Functions-To use gather and spread function
stats -Included for statistical functions ; Functions- To use linear regression(ln) and k-means function
ggplot2 - To create Data Visualisations ; Functions- Plot
factoextra - To create Data Visualisations for K-means ; Functions- fviz_cluster
knitr - To do formatting in R markdown ; Functions- kable
kableExtra - To do formatting in R markdown
Purrr - To make the codes more efficient using functions like map_*

suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(magrittr))) 
suppressWarnings(suppressMessages(library(tidyr))) 
suppressWarnings(suppressMessages(library(ggplot2))) 
suppressWarnings(suppressMessages(library(stats))) 
suppressWarnings(suppressMessages(library(factoextra)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(kableExtra)))
suppressWarnings(suppressMessages(library(purrr)))

3.1.Data Preparation - I

3.1. Data Source The original data on the retail client was downloaded from the Data Wrangling course folders on Github

3.2.1 Introduction of source data The dataset primarily tracks the transaction level data from 2,500 households who frequently shop at the client’s stores. Moreover, the data captures purchases made across the entire gamut of product categories available at the retailer’s stores. Among other important information, we have customer demographics data and direct marketing campaigns data available for certain households. (Please Note: The terms ‘Customer’ and ‘Household’ have been assumed synonymous and used interchangeably in this report)

Timeline: The data tracks customer transactions over a period of two years (104 weeks).

3.2.2 Explanation of source data (tables and variables) In total, this comprehensive total customer journey had 8 different data frames. Each of these are described in brief below. Also a small glimpse into each of these data is given below

Code for Loading Tables

datasets<-c("campaign_desc","campaign_table","causal_data","coupon","coupon_redempt","hh_demographic","product","transaction_data","product_dpt_mapping")
for (i in seq_along(datasets))
{ 
  
  file_path=paste0("data/",datasets[i],".csv")
  if(file.exists(file_path))
  {
    df<-read.csv(file_path)
    assign(datasets[i],df)
  } else
    print("No such file exits")
}

Transaction data: Variables: This data frame captures all the product purchases done by all households in two years. This data frame has some variables which are of interest to our research question - the sales value, the retail discounts and coupon discounts which will help us arrive at final customer spend for each customer.

household_key	basket_id	day	product_id	quantity	sales_value	store_id	retail_disc	trans_time	week_no
2375	26984851472	1	1004906	1	1.39	364	-0.6	1631	1
2375	26984851472	1	1033142	1	0.82	364	0.0	1631	1
2375	26984851472	1	1036325	1	0.99	364	-0.3	1631	1

Demographics data Shows demographics information on a limited number of households. This will be crucial in solving our problem statement # 1, where we will build a linear model to see what demographic factors affect customer spend.

household_key	age_desc	marital_status_code	income_desc	homeowner_desc	hh_comp_desc	household_size_desc	kid_category_desc
1	65+	A	35-49K	Homeowner	2 Adults No Kids	2	None/Unknown
7	45-54	A	50-74K	Homeowner	2 Adults No Kids	2	None/Unknown
8	25-34	U	25-34K	Unknown	2 Adults Kids	3	1

Campaign_table Lists the campaign received by each household.

description	household_key	campaign
TypeA	17	26
TypeA	27	26
TypeA	212	26

Campaign description Describes the length of campaigns. Length affects coupon validity period.

description	campaign	start_day	end_day
TypeB	24	659	719
TypeC	15	547	708
TypeB	25	659	691

Product Table Describes the length of campaigns. Length affects coupon validity period. The variable departments represents the fine categories of products. This will be of special interest to our analysis on problem statement # 2, where we will build a customer profile to map customers and their engagement levels with categories.

product_id	manufacturer	department	brand	commodity_desc	sub_commodity_desc	curr_size_of_product
25671	2	GROCERY	National	FRZN ICE	ICE - CRUSHED/CUBED	22 LB
26081	2	MISC. TRANS.	National	NO COMMODITY DESCRIPTION	NO SUBCOMMODITY DESCRIPTION
26093	69	PASTRY	Private	BREAD	BREAD:ITALIAN/FRENCH

Coupon Table Lists all kinds of coupons given as part of various campaigns to customers. Also mentions the kind of products against which the coupons are redeemable.

coupon_upc	product_id	campaign
10000089061	27160	4
10000089064	27754	9
10000089073	28897	12

Coupon Redemption Table Lists the coupons that each household/customer redeemed.

household_key	day	coupon_upc	campaign
1	421	10000085364	8
1	421	51700010076	8
1	427	54200000033	8

Causal Data Table Signifies whether each product was featured/advertised in a weekly mailer or in some in-store display.

product_id	store_id	week_no	mailer
26190	286	70	A
26190	288	70	A
26190	289	70	A

3.2.3 Peculiarities in the data sets

Missing Values -

for (i in seq_along(datasets))
{ 
missing_values<-any(is.na(get(datasets[i])))
if (missing_values)
{
  
  print(paste(datasets[i],"has Missing values"))
} else
  print(paste(datasets[i],"has no Missing values"))
}

## [1] "campaign_desc has no Missing values"
## [1] "campaign_table has no Missing values"
## [1] "causal_data has no Missing values"
## [1] "coupon has no Missing values"
## [1] "coupon_redempt has no Missing values"
## [1] "hh_demographic has no Missing values"
## [1] "product has no Missing values"
## [1] "transaction_data has no Missing values"
## [1] "product_dpt_mapping has no Missing values"

Note:

We see that there are no missing values in the datasets
We will again do a missing values analysis and an outlier or abnormal values analysis on our variables of concern, on the datasets that we prepare for our problem statements #1 and #2 (in the next tab)

SUMMARY:

We showed above a glimpse and summary of how all the data from this comprehensive and complex collection look and what they mean. For solving our two research questions the tables that contain variables ofinterest would be -

Transaction data
Product data
Demographics data

Refer to tab Data preparation II for more on this and next steps…

3.3.Data Preparation - II

3.3. Data Preparation Steps

Here we focus on two areas, data preparation (slicing and dicing) for:

Problem Statement #1
Problem Statement #2

Problem Statement #1

Tables of concern for Problem #1:

Transaction data
Demographic data

Variables of concern

sales_value: signifies the total customer spend, as per current understanding.
household_key: signifies a unique customer
age_desc: estimated age range of customer
marital_status_code: marital status (A - married, B- single, C - unknown)
income_desc: household income
homeowner_desc: homeowner description (homeowner, renter, etc.)
hh_comp_desc: household composition description (no. of adults, kids, etc.)
household_size_desc: size of household up to 5+
kid_category_desc: no. of children up to 3+

3.4. Methodology and Output dataset

group the sales_value for each unique household in the transaction dataset
match those row entries against the unique customers in the demographic dataset
inner join between the two datasets to have the customer spend and demographic variables for 850 unique customers.

#Dataset 1-Demographic factors driving customer spend
Demographic_factors<-group_by(transaction_data,household_key) %>%
 summarise(Total_Sales=sum(sales_value))%>%
 inner_join(hh_demographic,by="household_key")

household_key	Total_Sales	age_desc	marital_status_code	income_desc	homeowner_desc	hh_comp_desc	household_size_desc	kid_category_desc
1	4330.16	65+	A	35-49K	Homeowner	2 Adults No Kids	2	None/Unknown
7	3400.05	45-54	A	50-74K	Homeowner	2 Adults No Kids	2	None/Unknown
8	5534.97	25-34	U	25-34K	Unknown	2 Adults Kids	3	1
13	13190.92	25-34	U	75-99K	Homeowner	2 Adults Kids	4	2
16	1512.02	45-54	B	50-74K	Homeowner	Single Female	1	None/Unknown

##  household_key   Total_Sales       age_desc   marital_status_code
##  Min.   :   1   Min.   :  646.4   19-24: 46   A:340              
##  1st Qu.: 596   1st Qu.: 2970.1   25-34:142   B:117              
##  Median :1218   Median : 4676.4   35-44:194   U:344              
##  Mean   :1235   Mean   : 5615.1   45-54:288                      
##  3rd Qu.:1914   3rd Qu.: 7165.0   55-64: 59                      
##  Max.   :2499   Max.   :27859.7   65+  : 72                      
##                                                                  
##     income_desc          homeowner_desc           hh_comp_desc
##  50-74K   :192   Homeowner      :504    1 Adult Kids    : 47  
##  35-49K   :172   Probable Owner : 11    2 Adults Kids   :187  
##  75-99K   : 96   Probable Renter: 11    2 Adults No Kids:255  
##  25-34K   : 77   Renter         : 42    Single Female   :144  
##  15-24K   : 74   Unknown        :233    Single Male     : 95  
##  Under 15K: 61                          Unknown         : 73  
##  (Other)  :129                                                
##  household_size_desc    kid_category_desc
##  1 :255              1           :114    
##  2 :318              2           : 60    
##  3 :109              3+          : 69    
##  4 : 53              None/Unknown:558    
##  5+: 66                                  
##                                          
##

Problem Statement #2

Tables of concern for Problem #2:

Transaction data
Product data

Variables of concern

Total_Sales: signifies the total customer spend, as per current understanding.
household_key: signifies a unique customer
product categories: the categories classified into

#Dataset 2-Profiling of Customers based on Product Category Engagement
Customer_profile <- inner_join(transaction_data, product, by = "product_id")%>%
  inner_join(product_dpt_mapping, by = "department")%>%
  group_by(household_key, Super_category) %>%
  summarise(Total_Sales = sum(sales_value))%>%
  select(household_key, Super_category, Total_Sales)%>%
  spread(Super_category, Total_Sales)

household_key	AUTOMOTIVE	COSMETICS	COUPONS	DELI	EGGS & MILK PRODUCTS	FLORAL	GARDEN SUPPLIES	GROCERY	GROCERY-HEALTHY	INFANT PRODUCTS	LIQUIOR	MEAT	MISC ITEMS	PHARMACY	SERVICES	STATIONARY	STORE SUPPLY/MISC	TOYS	VIDEO
1	NA	NA	20.00	212.17	NA	7.99	NA	3021.10	48.33	NA	NA	349.00	530.89	NA	140.68	NA	NA	NA	NA
2	NA	22.07	4.26	46.73	NA	21.99	NA	1264.77	7.47	NA	NA	232.19	330.04	NA	24.82	NA	NA	NA	NA
3	NA	1.25	NA	36.47	NA	NA	NA	1852.78	NA	NA	NA	578.97	179.09	NA	4.65	NA	NA	NA	NA
4	NA	2.12	NA	14.35	NA	NA	NA	703.60	3.29	NA	NA	157.77	318.98	NA	NA	NA	NA	NA	NA
5	NA	9.48	NA	80.03	NA	NA	NA	361.32	2.00	NA	NA	120.77	203.08	NA	2.38	NA	NA	NA	NA

##  household_key      AUTOMOTIVE       COSMETICS         COUPONS        
##  Min.   :   1.0   Min.   : 0.000   Min.   :  0.00   Min.   :    0.00  
##  1st Qu.: 625.8   1st Qu.: 2.950   1st Qu.:  4.01   1st Qu.:    3.49  
##  Median :1250.5   Median : 5.950   Median : 10.80   Median :   10.50  
##  Mean   :1250.5   Mean   : 7.421   Mean   : 25.16   Mean   :   80.34  
##  3rd Qu.:1875.2   3rd Qu.: 9.990   3rd Qu.: 29.12   3rd Qu.:   37.97  
##  Max.   :2500.0   Max.   :33.660   Max.   :529.07   Max.   :11161.12  
##                   NA's   :2439     NA's   :1214     NA's   :994       
##       DELI         EGGS & MILK PRODUCTS     FLORAL       GARDEN SUPPLIES 
##  Min.   :   0.75   Min.   :0.5000       Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  21.76   1st Qu.:0.8975       1st Qu.:  8.99   1st Qu.:  5.00  
##  Median :  56.66   Median :0.9500       Median : 18.98   Median : 13.34  
##  Mean   : 116.06   Mean   :1.4646       Mean   : 33.43   Mean   : 24.90  
##  3rd Qu.: 137.72   3rd Qu.:1.7600       3rd Qu.: 38.95   3rd Qu.: 29.86  
##  Max.   :1808.36   Max.   :5.7800       Max.   :877.12   Max.   :936.67  
##  NA's   :252       NA's   :2450         NA's   :1314     NA's   :2192    
##     GROCERY         GROCERY-HEALTHY   INFANT PRODUCTS    LIQUIOR        
##  Min.   :    1.74   Min.   :   0.00   Min.   :2.020   Min.   :   0.000  
##  1st Qu.:  558.17   1st Qu.:   5.00   1st Qu.:2.020   1st Qu.:   8.735  
##  Median : 1267.53   Median :  13.78   Median :2.355   Median :  17.480  
##  Mean   : 1862.72   Mean   :  56.82   Mean   :2.355   Mean   :  38.492  
##  3rd Qu.: 2575.04   3rd Qu.:  43.01   3rd Qu.:2.690   3rd Qu.:  34.020  
##  Max.   :15047.67   Max.   :5535.49   Max.   :2.690   Max.   :2642.020  
##                     NA's   :781       NA's   :2496    NA's   :1937      
##       MEAT           MISC ITEMS        PHARMACY         SERVICES     
##  Min.   :   1.71   Min.   :   0.0   Min.   : 1.190   Min.   :  0.00  
##  1st Qu.: 102.86   1st Qu.: 120.2   1st Qu.: 5.910   1st Qu.: 14.99  
##  Median : 274.07   Median : 310.3   Median : 6.990   Median : 39.44  
##  Mean   : 427.87   Mean   : 646.0   Mean   : 9.804   Mean   : 66.60  
##  3rd Qu.: 571.49   3rd Qu.: 800.3   3rd Qu.:11.748   3rd Qu.: 83.27  
##  Max.   :3708.78   Max.   :7939.3   Max.   :20.890   Max.   :885.24  
##  NA's   :37        NA's   :7        NA's   :2482     NA's   :177     
##    STATIONARY    STORE SUPPLY/MISC      TOYS          VIDEO       
##  Min.   :1.000   Min.   : 0.000    Min.   :0.99   Min.   : 5.000  
##  1st Qu.:2.435   1st Qu.: 1.000    1st Qu.:0.99   1st Qu.: 6.482  
##  Median :3.870   Median : 1.000    Median :1.24   Median : 9.235  
##  Mean   :3.870   Mean   : 2.453    Mean   :2.46   Mean   : 9.372  
##  3rd Qu.:5.305   3rd Qu.: 1.000    3rd Qu.:2.71   3rd Qu.:12.213  
##  Max.   :6.740   Max.   :20.000    Max.   :6.37   Max.   :13.990  
##  NA's   :2498    NA's   :2480      NA's   :2496   NA's   :2494

3.5. Variables Summary:

Dataset-1 : Demographic_factors
Total rows and variables = 801, 9 , respectively
Variables: household_key - it has zero missing values, and all unique values signifying distinct customers, 801 in number
Total_Sales - This will be our response variable. This column has no missing values, and data adheres to the possible ranges in the data dsictionary
age_desc - ranges for the variable age of customer ; no missing values ; data as per possible values in data dictionary
marital_status - shows marital status of customers showing three levels. No missing values and no abnormal values either
income_desc - gives possible ranges for the income levels, a factor. No missing values ; no abnormal values either
homeowner_desc - specifies whether homeowner is owner, a renter, etc. A factor variable, with no missing or abnormal values
household_size_desc - a factor variable, specifying the size of a household. No missing values, no abnormal values either
kid_category_desc - no missing values, no abnormal values either
Dataset-2 : Customer_profile
Total rows and variables = 2500, 20 , respectively
household_key: signifies unique household
other 19 variables: these are the ‘super categories’ that we created by grouping the 44 product categories / departments

Missing Values : There are no missing values. However the NA values might give the reader the wrong idea. To clarify, for the purposes of our analysis, an ‘NA’ would mean zero spending for the particular customer in that particular category. So, the NA values would be imputed with zeroes before starting analysis

4.1.Data Analysis: Problem-1

4.1 Problem Statement 1-Demographic factors driving customer spend

4.1.1 Analysis

Here is the main analysis of our first problem statement - What factors primarily affect the total customer spending for our retail client. The analysis is being done on the dataset - ‘Demographic_factors’. The reader might remember that this dataset in the one we prepared (using wrangling procedures) in our data preparation sections I and II. The predictor variables being investigated are income_desc, household_size_desc, marital_status_code, and age_desc and all of these are ‘factors’

To begin with, the analysis “graphically” investigates the possible effects of these variables on the customer spend (Total_Sales). Along with the plots, an interpretation of plots is provided. This is followed by a linear model on the Total_Sales against the four predictor variables, along with interpretations. At the very last of this section, we present a paragraph on the summary/explanation of results.

Demographic_factors %>%
  ggplot(aes(x = Total_Sales, y = income_desc)) +
  geom_point() +
  labs(title = "Spend vs Income", x = "Spend", y = "Income")

Observation: We see that the average consumer spend looks to be the highest in the age group of $50-74k followed by the age groups of $35-49k and $75-99k.

Demographic_factors %>%
ggplot(aes(x = Total_Sales, y = age_desc)) +
  geom_point() +
  labs(title = "Spend vs Age", x = "Spend", y = "Age")

Observation: It appears that the age bracket of 35-55 years spends the highest compared to all other age groups.

Demographic_factors %>%
ggplot(aes(x = Total_Sales, y = household_size_desc,
                                color = marital_status_code)) +
  geom_point() +
  labs(title = "Spend vs Household Size, and Marital Status", x = "Spend", 
       y = "Household Size")

Observation: It appears that households with sizes of 2 and 3 would be the highest average spenders, followed by households with sizes 1 and 5. Moreover, it also appears that on average, married customers do spend more as compared to unmarried and single people.

From the above, we conjecture that a consumer profile who is married, has an household size of 2 to 3, is in the age bracket of 35 to 55, and earns $50-74k per annum, will likely be the highest average spender for our client. We explore it a little further though a linear regression model as follows.

Since all our predictor variables are factors, we take the following steps before running a regression-

Dummy encoding before running our regression
Check to make sure our response variable can be reasonably assumed to be normally distributed

# Dummy encoding 

head(model.matrix( ~  age_desc + marital_status_code + income_desc + 
                     household_size_desc, data = Demographic_factors))

# Regression

model_1 <- lm(Total_Sales ~ age_desc + marital_status_code + 
                           income_desc + household_size_desc, 
                         data = Demographic_factors)
summary(model_1)

## 
## Call:
## lm(formula = Total_Sales ~ age_desc + marital_status_code + income_desc + 
##     household_size_desc, data = Demographic_factors)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7400.3 -2450.4  -639.5  1497.9 19669.4 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4994.1      688.4   7.255 1.06e-12 ***
## age_desc25-34            708.0      683.6   1.036  0.30072    
## age_desc35-44           1400.9      670.2   2.090  0.03695 *  
## age_desc45-54            782.6      650.4   1.203  0.22931    
## age_desc55-64            223.6      783.8   0.285  0.77555    
## age_desc65+             -334.1      755.6  -0.442  0.65852    
## marital_status_code.L   -228.9      253.7  -0.902  0.36729    
## marital_status_code.Q    216.0      324.2   0.666  0.50543    
## income_desc.L           3102.2      954.3   3.251  0.00121 ** 
## income_desc.Q          -1245.2      926.4  -1.344  0.17933    
## income_desc.C          -1936.3      840.4  -2.304  0.02151 *  
## income_desc^4          -1989.2      798.6  -2.491  0.01297 *  
## income_desc^5           -615.0      733.4  -0.839  0.40202    
## income_desc^6            -82.5      647.3  -0.127  0.89862    
## income_desc^7            301.8      554.4   0.544  0.58638    
## income_desc^8           -614.2      483.7  -1.270  0.20464    
## income_desc^9           -179.3      487.4  -0.368  0.71315    
## household_size_desc2     233.2      369.4   0.631  0.52806    
## household_size_desc3     678.9      478.5   1.419  0.15641    
## household_size_desc4    1138.8      614.4   1.854  0.06421 .  
## household_size_desc5+    735.5      599.1   1.228  0.21995    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3517 on 708 degrees of freedom
##   (72 observations deleted due to missingness)
## Multiple R-squared:  0.1209, Adjusted R-squared:  0.09608 
## F-statistic: 4.869 on 20 and 708 DF,  p-value: 2.833e-11

Interpretation: Our linear model results seem to match our initial conjecture (from the graphical investigation) to a great extent. Some of those observations are as follows.

Age group 35-44 comes out to be statistically significant. Which means that a positive indicator in this age group could have a significant effect (1400 units) on the consumer spend.
Marital status ‘Married’ shows statistical significance which has the value of the intercept 5244. Which shows that a positive indicator in this category has a significant effect on consumer spend.
Income categories “15-24k”, “35-49k” and “50-74k” show statistical significance. With all else constant, a positive indicator in these categories will have a positive effect (combined with the intercept) on the consumer spend.
Household sizes 3 and 4 have a statistical significance in the model. Which means these have a significant effect on the consumer spend.

Summary: Overall we find that of the four predictor variables studied there are a few categories in each that can largely influence the consumer spend. So we can reasonably conclude that customers are likely to spend more if they fall into the categories under each predictor as mentioned above. Since we get a lesser R-square from the study, we can conjecture that there could be other variables that can possibly help further explain consumer demand variations to different factors

4.1.2 Methodology

4.1.2.a Preliminary Analysis - Summaries (Mean, Median) across various demographic variables were calculated to understand how spend varies across different levels. Also, we did a graphical investigation to see patterns, and correlations.

Once we had some idea about how various variables affect consumer spend, a linear regression model was built to ascertain which variables categories affect spending and how they affect spending.

4.1.2.b Linear Regression Model - We built a Linear Regression model using the dataset with all demographic variables and overall sales. Following steps were followed-

Step 1-Data was collated with all the demographic variables and overall sales over a certain time period for all customers
Step 2-After checking relationships of variables with overall sales(Using plots of predictors and sales), we tested appropriateness for normality of the response variable. And we followed a dummy encoding for all the factor predictor variables.
Step 3-Once these validations were complete, we ran the linear regression to calculate estimates for each of the variables and categories
Step 4-Compare the model iteration using model-fit statistics(R-square, error) to arrive at the best model
Step 5-Calculate the contribution values to understand to what magnitude is sales affected by each of those demographic factors.

4.2.Data Analysis: Problem-2

4.2 Problem Statement 2-Profiling of Customers based on Product Category Engagement

4.2.1 Data Sources used-

Transaction Data
Product Data

4.2.2 Business Problem- Segmenting customers based on spend across various product categories

4.2.3 Data Analysis -

4.2.3.a Preliminary Analysis- Summarize(mean,median and mode) the datasets and plot using box-plot graph to identify outliers in spend data. Calculated the range of spend that should be considered after removing outliers. Using plots thus created, identify categories which are more important that others.

4.2.3.b K-Means Segmentation - Using sales value across many product departments, K-means segmentation was done to segment customers based on their product needs. Following steps were followed-

Step 1 - Required columns(Household ids, Total Sales, Product Categories/Departments) were pulled from Transaction and Product data after merging them using appropriate columns
Step 2 - Using the product description information, similar product categories(department) were combined to form broad product categories. e.g. ‘Frozen grocery’, ‘Produce’ and ‘Grocery’ were all combined under Grocery
Step 3 - Sales values were rolled up at broad product category level to get total sales for the households
Step 4 -Data was transformed from tall to wide format where columns have total sales for the product categories

household_key	AUTOMOTIVE	COSMETICS	COUPONS	DELI	EGGS & MILK PRODUCTS	FLORAL	GARDEN SUPPLIES	GROCERY	GROCERY-HEALTHY	INFANT PRODUCTS	LIQUIOR	MEAT	MISC ITEMS	PHARMACY	SERVICES	STATIONARY	STORE SUPPLY/MISC	TOYS	VIDEO
1	NA	NA	20.00	212.17	NA	7.99	NA	3021.10	48.33	NA	NA	349.00	530.89	NA	140.68	NA	NA	NA	NA
2	NA	22.07	4.26	46.73	NA	21.99	NA	1264.77	7.47	NA	NA	232.19	330.04	NA	24.82	NA	NA	NA	NA
3	NA	1.25	NA	36.47	NA	NA	NA	1852.78	NA	NA	NA	578.97	179.09	NA	4.65	NA	NA	NA	NA
4	NA	2.12	NA	14.35	NA	NA	NA	703.60	3.29	NA	NA	157.77	318.98	NA	NA	NA	NA	NA	NA
5	NA	9.48	NA	80.03	NA	NA	NA	361.32	2.00	NA	NA	120.77	203.08	NA	2.38	NA	NA	NA	NA
6	NA	9.96	17.99	125.23	NA	23.98	8.58	3288.33	46.19	NA	NA	1213.28	1097.23	NA	165.39	NA	NA	NA	NA

AUTOMOTIVE	COSMETICS	`COUPONS </th>`	`DELI </th>`	EGGS & MILK PRODUCTS	`FLORAL </th>`	GARDEN SUPPLIES	`GROCERY </th>`	GROCERY-HEALTHY	INFANT PRODUCTS	`LIQUIOR </th>`	`MEAT </th>`	MISC ITEMS	`PHARMACY </th>`	`SERVICES </th>`	STATIONARY	STORE SUPPLY/MISC	`TOYS </th>`	`VIDEO </th>`
Min. : 0.0000	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. :0.00000	Min. : 0.00	Min. : 0.000	Min. : 1.74	Min. : 0.00	Min. :0.000000	Min. : 0.000	Min. : 0.00	Min. : 0.0	Min. : 0.00000	Min. : 0.00	Min. :0.000000	Min. : 0.00000	Min. :0.000000	Min. : 0.00000
1st Qu.: 0.0000	1st Qu.: 0.00	1st Qu.: 0.00	1st Qu.: 12.23	1st Qu.:0.00000	1st Qu.: 0.00	1st Qu.: 0.000	1st Qu.: 558.17	1st Qu.: 0.00	1st Qu.:0.000000	1st Qu.: 0.000	1st Qu.: 98.14	1st Qu.: 119.0	1st Qu.: 0.00000	1st Qu.: 10.98	1st Qu.:0.000000	1st Qu.: 0.00000	1st Qu.:0.000000	1st Qu.: 0.00000
Median : 0.0000	Median : 0.99	Median : 2.50	Median : 46.13	Median :0.00000	Median : 0.00	Median : 0.000	Median : 1267.53	Median : 5.47	Median :0.000000	Median : 0.000	Median : 269.70	Median : 308.7	Median : 0.00000	Median : 34.91	Median :0.000000	Median : 0.00000	Median :0.000000	Median : 0.00000
Mean : 0.1811	Mean : 12.94	Mean : 48.40	Mean : 104.36	Mean :0.02929	Mean : 15.86	Mean : 3.067	Mean : 1862.72	Mean : 39.07	Mean :0.003768	Mean : 8.668	Mean : 421.54	Mean : 644.1	Mean : 0.07059	Mean : 61.89	Mean :0.003096	Mean : 0.01962	Mean :0.003936	Mean : 0.02249
3rd Qu.: 0.0000	3rd Qu.: 11.40	3rd Qu.: 16.01	3rd Qu.: 124.94	3rd Qu.:0.00000	3rd Qu.: 17.07	3rd Qu.: 0.000	3rd Qu.: 2575.04	3rd Qu.: 26.33	3rd Qu.:0.000000	3rd Qu.: 0.000	3rd Qu.: 563.54	3rd Qu.: 799.2	3rd Qu.: 0.00000	3rd Qu.: 77.99	3rd Qu.:0.000000	3rd Qu.: 0.00000	3rd Qu.:0.000000	3rd Qu.: 0.00000
Max. :33.6600	Max. :529.07	Max. :11161.12	Max. :1808.36	Max. :5.78000	Max. :877.12	Max. :936.670	Max. :15047.67	Max. :5535.49	Max. :2.690000	Max. :2642.020	Max. :3708.78	Max. :7939.3	Max. :20.89000	Max. :885.24	Max. :6.740000	Max. :20.00000	Max. :6.370000	Max. :13.99000

Total sales across product category suggests that biggest product category is Grocery followed by Misc Items and Meat. These 19 variables were taken as input for K-means clustering

product	Sum_cols
GROCERY	4656794.40
MISC ITEMS	1610363.09
MEAT	1053836.82
DELI	260902.59
SERVICES	154722.32
COUPONS	120987.76
GROCERY-HEALTHY	97669.04
FLORAL	39653.34
COSMETICS	32360.37
LIQUIOR	21670.88
GARDEN SUPPLIES	7667.82
AUTOMOTIVE	452.66
PHARMACY	176.48
EGGS & MILK PRODUCTS	73.23
VIDEO	56.23
STORE SUPPLY/MISC	49.05
TOYS	9.84
INFANT PRODUCTS	9.42
STATIONARY	7.74

Histogram of all 2500 households and their overall spend

Checking quantile values to understand what should be the cut-off for analysis.Customers who spent less than $390 in 2 years time were removed from the analysis as we dont have sufficient history for those cases

##      10%      50%      90% 
##  389.878 2157.750 7448.216

* Step 5 -Total Sales values were standardized to give similar weightage to all product categories

GROCERY	MISC ITEMS	MEAT	DELI	SERVICES	COUPONS	GROCERY-HEALTHY	FLORAL	COSMETICS	LIQUIOR	GARDEN SUPPLIES	AUTOMOTIVE	PHARMACY	EGGS & MILK PRODUCTS	VIDEO	STORE SUPPLY/MISC	TOYS	INFANT PRODUCTS	STATIONARY
0.5112524	-0.1974663	-0.2465622	0.5824087	0.8392364	-0.1025096	0.0317165	-0.2309761	-0.4367516	-0.1416004	-0.1467752	-0.1227636	-0.0746371	-0.1192238	-0.0448826	-0.044699	-0.0310159	-0.0417706	-0.0239491
-0.4182378	-0.4174619	-0.4940023	-0.4104241	-0.4943611	-0.1505816	-0.2160162	0.1119995	0.2405165	-0.1416004	-0.1467752	-0.1227636	-0.0746371	-0.1192238	-0.0448826	-0.044699	-0.0310159	-0.0417706	-0.0239491
-0.1070494	-0.5828009	0.2405860	-0.4719961	-0.7265263	-0.1635922	-0.2613066	-0.4267171	-0.3983925	-0.1416004	-0.1467752	-0.1227636	-0.0746371	-0.1192238	-0.0448826	-0.044699	-0.0310159	-0.0417706	-0.0239491
-0.7152218	-0.4295762	-0.6516471	-0.6047419	-0.7800498	-0.1635922	-0.2413594	-0.4267171	-0.3716946	-0.1416004	-0.1467752	-0.1227636	-0.0746371	-0.1192238	-0.0448826	-0.044699	-0.0310159	-0.0417706	-0.0239491
-0.8963643	-0.5565241	-0.7300246	-0.2105853	-0.7526549	-0.1635922	-0.2491806	-0.4267171	-0.1458363	-0.1416004	-0.1467752	-0.1227636	-0.0746371	-0.1192238	-0.0448826	-0.044699	-0.0310159	-0.0417706	-0.0239491
0.6526767	0.4228588	1.5842527	0.0606674	1.1236590	-0.1086484	0.0187418	0.1607510	-0.1311064	-0.1416004	0.2242823	-0.1227636	-0.0746371	-0.1192238	-0.0448826	-0.044699	-0.0310159	-0.0417706	-0.0239491

* Step 6 -Decide cluster count such that the total intra-cluster variation(within cluster sum of squares) is minimized. 15 was taken as the optimal cluster count from the graph

# function to compute total within-cluster sum of square 
wss <- function(k) {
  kmeans(df, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 40
k.values <- 1:40

# extract wss for 1-40 clusters
wss_values <- suppressWarnings(map_dbl(k.values, wss))
par(mfrow=c(1,1))
suppressWarnings(plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares"))

* Step 7 -K-means segmentation was run on the standardized columns to make clusters

#running k-means for 15 cluster count
k2 <- suppressWarnings(kmeans(df, centers = 15, nstart = 25))
str(k2)

## List of 9
##  $ cluster     : int [1:2250] 5 9 9 9 9 5 5 5 9 15 ...
##  $ centers     : num [1:15, 1:19] 1.818 1.083 0.634 6.07 0.255 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:15] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:19] "GROCERY" "MISC ITEMS" "MEAT" "DELI" ...
##  $ totss       : num 42731
##  $ withinss    : num [1:15] 135 660 115 0 3044 ...
##  $ tot.withinss: num 13725
##  $ betweenss   : num 29006
##  $ size        : int [1:15] 4 9 4 1 639 14 1 10 1238 233 ...
##  $ iter        : int 11
##  $ ifault      : int 2
##  - attr(*, "class")= chr "kmeans"

After getting cluster mappind for all 2250 households, data looks like this

##   k2$cluster MISC ITEMS GROCERY
## 1          5     530.89 3021.10
## 2          9     330.04 1264.77
## 3          9     179.09 1852.78
## 4          9     318.98  703.60
## 5          9     203.08  361.32
## 6          5    1097.23 3288.33

* Step 8 -Visualize the cluster in R to understand how the customers have been clustered

* Step 9 - Profile and name the clusters based on the average spend across categories. Segments have been named as ‘H-Floral-Cosmetics-Garden Supplies buyers’ where H represents ‘high’ total spend and Product categories names suggests the kind of product a customer/household buys(Check Excel)

cluster	number	avg_GROCERY	avg_MISC_ITEMS	avg_MEAT	avg_DELI	avg_SERVICES	avg_COUPONS	avg_GROCERY_HEALTHY	avg_FLORAL	avg_COSMETICS	avg_LIQUIOR	avg_GARDEN_SUPPLIES	avg_AUTOMOTIVE	avg_PHARMACY	avg_EGGS___MILK_PRODUCTS	avg_VIDEO	avg_STORE_SUPPLY_MISC	avg_TOYS	avg_INFANT_PRODUCTS	avg_STATIONARY
1	4	5491.0200	1600.3275	1306.0325	88.01500	130.56750	32.81000	37.66250	7.060000	45.995000	26.560000	0.0000000	0.0000000	0.0000000	0.4400000	0.0000000	0.0000000	0.0000000	2.355	0.0000000
2	9	4101.6133	2158.5800	605.4089	366.78667	270.55222	371.75111	237.92000	69.255556	61.822222	22.824444	19.4611111	0.0000000	14.9455556	0.0000000	0.0000000	0.0000000	0.0000000	0.000	0.0000000
3	4	3253.7375	1724.6425	779.4725	170.92500	77.58750	12.93500	29.20750	7.382500	7.045000	44.657500	0.0000000	0.0000000	0.0000000	0.0000000	11.3125000	0.0000000	0.0000000	0.000	0.0000000
4	1	13523.8300	6447.2700	2263.5500	1808.36000	476.46000	11161.12000	563.31000	877.120000	212.120000	49.980000	936.6700000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000	0.0000000
5	639	2536.0856	824.4299	587.5830	130.32310	81.38161	58.04653	42.63070	19.318060	15.181847	9.443083	2.7926604	0.0455243	0.0077934	0.0182942	0.0000000	0.0236307	0.0015493	0.000	0.0000000
6	14	2055.8636	638.4293	609.9521	95.69143	66.16429	17.66143	18.25500	17.275714	7.216429	1.612143	6.3478571	0.0000000	0.0000000	2.7821429	0.0000000	0.0000000	0.0000000	0.000	0.0000000
7	1	6432.8600	4248.3500	888.7000	66.23000	67.85000	2.99000	21.24000	0.000000	46.210000	2642.020000	0.0000000	4.4500000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000	0.0000000
8	10	4385.0750	1563.1170	586.2970	257.82900	91.98100	35.48300	1694.22900	39.557000	46.844000	12.888000	5.8440000	0.3990000	0.5970000	0.0000000	0.0000000	0.0000000	0.0000000	0.000	0.0000000
9	1238	904.8305	259.3305	209.6189	42.64703	30.48683	12.75612	16.43466	6.333207	4.501615	5.148215	0.7878837	0.0296123	0.0106381	0.0067932	0.0040388	0.0016155	0.0000000	0.000	0.0000000
10	233	4865.4105	1895.4209	1132.1258	293.27983	146.18455	116.42378	78.99468	32.853176	34.432060	15.077296	5.8499142	0.1380258	0.0680687	0.0454506	0.0000000	0.0085837	0.0042489	0.000	0.0042918
11	67	7388.5609	2969.6737	1487.0336	582.53433	297.20836	314.28716	130.21716	121.321940	84.197164	24.645373	30.8532836	0.1035821	0.0297015	0.0141791	0.0000000	0.0000000	0.0222388	0.000	0.0000000
12	1	1110.0400	308.1300	326.7800	45.86000	56.45000	11.50000	0.00000	9.990000	2.990000	0.000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000	6.7400000
13	1	1964.4100	1052.5700	635.8200	13.08000	128.90000	0.00000	4.48000	52.970000	1.590000	59.180000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	6.3700000	0.000	0.0000000
14	2	8289.2800	3683.4000	1747.4500	736.39000	529.16500	486.54000	274.97500	188.185000	101.695000	18.580000	60.7600000	0.0000000	0.0000000	0.0000000	0.0000000	14.9750000	0.0000000	0.000	0.0000000
15	26	2704.9212	1344.3535	497.9615	150.88654	71.50192	122.01115	58.04923	22.616923	49.244231	10.457692	2.5196154	13.0526923	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000	0.0000000

5.Summary

We studied two different problem statements. The summary of both are mentioned below, one after the other.

Problem Statement - 1

5.1.1 Problem Statement The purpose of this study was to understand how the demographic factors like marital status, age, income and household size affect consumer spending. Knowing what demographics make consumers send more during store visits will enable the client’s business to be more customer-centric and to ensure better customer lifetime value

5.1.2 Methodology

Step 1 Summaries (Mean, Median) across various demographic variables were calculated to understand how spend varies across different levels. Also, we did a graphical investigation to see patterns and correlations. Once we had some idea about how various variables affect consumer spend, a linear regression model was built to ascertain which variables categories affect spending and how they affect spending.
Step 2 Data was collated with all the demographic variables and overall sales over a certain time period for all customers
Step 3 After checking relationships of variables with overall sales(Using plots of predictors and sales), we tested appropriateness for normality of the response variable. And we followed a dummy encoding for all the factor predictor variables.
Step 4 Once these validations were complete, we ran the linear regression to calculate estimates for each of the variables and categories
Step 5 Compare the model iteration using model-fit statistics(R-square, error) to arrive at the best model
Step 6 Calculate the contribution values to understand to what magnitude is sales affected by each of those demographic factors.

5.1.3 Interesting Insights!

A key insight we found was regarding the demographic factors that would likely make the consumer the highest spender at the stores.-

Consumer spending seems to be highest if consumers are married, have a household size of 2 to 3, earn somewhere between $30-75k, and are between 35 to 45 years of age.

5.1.4 Project Implications

We feel that this insight on what demographics make consumers spend more will be key to any decision-making for our retail client. This can be a big enabler in any promotion/coupon targeting programs to various consumer segments or to attract those demographic segments of consumers who currently are not high spenders. Overall, we believe this knowledge is key to ensuring a high average customer lifetime value.

Limitations of the analysis * This study was limited to the demographic factors we saw in the data. when we ran the regression we found it gave a lower R-square value. We think that there could be other predictor variables that we are not seeing in the data, which might help explain the correlation even better. Variables such as, whether the consumer is a loyal to another retail chain, or whether distance and convenience is a factor how people choose to walk into stores.

Problem Statement - 2

5.2.1 Problem Statement The purpose of this study was to get better understanding of our retail client’s customers based on product category preferences. It will allow business to get better visibility of the market and customize promotions based on their buying preference. Further, these promotions can help improve the customer interaction level with the store. Over time, it will also allow business to keep track of how customers are moving across defined segments and how their preferences change with time

5.2.2 Methodology

Using sales value across many product departments, K-means segmentation was done to segment customers based on their product needs. Following steps were followed-

Step 1 - Required columns(Household ids, Total Sales, Product Categories/Departments) were pulled from Transaction and Product data after merging them using appropriate columns
Step 2 - Using the product description information, similar product categories(department) were combined to form broad product categories. e.g. ‘Frozen grocery’, ‘Produce’ and ‘Grocery’ were all combined under Grocery
Step 3 - Sales values were rolled up at broad product category level to get total sales for the households
Step 4 - Data was transformed from tall to wide format where columns have total sales for the product categories
Step 5 - Total Sales values were standardized to give similar weightage to all product categories
Step 6 - Decide cluster count such that the total intra-cluster variation(within cluster sum of squares) is minimized
Step 7 - K-means segmentation was run on the standardized columns to make clusters
Step 8 - Visualize the cluster in R to understand how the customers have been clustered
Step 9 - Profile and name the clusters based on the average spend across categories. Segments have been named as ‘H-Floral-Cosmetics-Garden Supplies buyers’ where H represents ‘high’ total spend and Product categories names suggests the kind of product a customer/household buys

5.2.3 Intersting Insights! Following interesting insights were found in the analysis-

Among the high spenders, there is a segment ‘H-Floral-Cosmetics-Garden Suplies buyers’ with 67 households and it seems to be buying products from floral, cosmetics and garden supplies. We can target this segment using cross-category promotions across these product categories to increase their engagement level. Also, in case sales of any of these product categories is dropping, sales across this customer segment should be checked for any potential reasons.
Majority of the customers(~1230) belong to the “VL-Grocery-Basic Necessities” segment who visit stores for the basic necessities (groceries/basic necessities).Majority of the money is spent on buying on grocery(~61%) followed by Misc Items(17%) and meat(14%)
Another important segment is “L-Broad category Spenders” with 639 customers. These customers spend on broad categories ranging from Grocery, Deli, Meat , Services and coupons to less common

5.2.4 Project Implications Earlier promotions were sent out to household without much consideration of customer/household profile and the buying potential. So, the promotions were not customized for the audience and were sent to all irrespective of their preferences New segmentation exercise would allow business to send targeted messages and enable them to design promotional strategies in a better way. For eg. If the store wants to promote new automotive product in the store, it should be sending the promotional e-mail to all households part of segment 15 “L-Automotive Products” as these are the households who have bought products from this category. Also, knowing how your sales is distributed across categories will allow stores to come up with cross-category promotions aimed at increasing customer spend.

Limitations of the analysis Following are the few limitations and few factors which could be included to get improve the quality of segments in the future-

We can club some of the segments which have very few households present in them like segment 4,14,7,12 and 13
Broad product categories created after combining similar departments could be clubbed further to get fewer dimensions relevant for our business problem. Also, current definition can be further validated to ensure that mapping used is correct as per business requirement
Other variables like frequency of visits should be included in the analysis which would give us better idea about the buying pattern and would make household segments more useful for this kind of study