Our research is focused on helping a major retailer know their customers better. The goal of the study is to derive insights from the available ‘Customer Full Journey’ datasets relating to customers, transactions, and more. As part of our methodology we plan to clean, tidy, explore, and analyze the datasets we received to finally answer the following questions of interest -
Problem Statement #1: What household demographic factors contribute most to customer spend in the stores? Learn about the most impactful factors affecting customer spending.
Problem Statement #2: What category is each customer most engaged with? Profile customers and their engagement levels with product categories.
Our methodology can be broadly classified into three steps -
Data cleaning and preparation: Clean and tidy the tables/datasets in R, and join tables as required for the analysis (more in detail in the section on data preparation)
Data Exploration: Explore the nature and source of data, explore missing values and reasons ; identify abnormal values and outliers, and visualize variable distributions and correlation as required
Data Analysis: We propose a regression analysis (linear modeling) to determine the demographic factors that affect consumer spending the most. This will be followed by a clustering technique to segment and profile customers according to their engagement levels with various product categories inside the store
This research will help the client in two major ways. Firstly, understand better as to what factors impact the amount of purchase customers make inside stores. Secondly, design better targeted marketing campaigns powered by the knowledge of customer engagement (and spending) levels with various product categories.
Packages to be installed
Following Packages are needed for the project-
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(magrittr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(stats)))
suppressWarnings(suppressMessages(library(factoextra)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(kableExtra)))
suppressWarnings(suppressMessages(library(purrr)))
3.1. Data Source The original data on the retail client was downloaded from the Data Wrangling course folders on Github
3.2.1 Introduction of source data The dataset primarily tracks the transaction level data from 2,500 households who frequently shop at the client’s stores. Moreover, the data captures purchases made across the entire gamut of product categories available at the retailer’s stores. Among other important information, we have customer demographics data and direct marketing campaigns data available for certain households. (Please Note: The terms ‘Customer’ and ‘Household’ have been assumed synonymous and used interchangeably in this report)
Timeline: The data tracks customer transactions over a period of two years (104 weeks).
3.2.2 Explanation of source data (tables and variables) In total, this comprehensive total customer journey had 8 different data frames. Each of these are described in brief below. Also a small glimpse into each of these data is given below
Code for Loading Tables
datasets<-c("campaign_desc","campaign_table","causal_data","coupon","coupon_redempt","hh_demographic","product","transaction_data","product_dpt_mapping")
for (i in seq_along(datasets))
{
file_path=paste0("data/",datasets[i],".csv")
if(file.exists(file_path))
{
df<-read.csv(file_path)
assign(datasets[i],df)
} else
print("No such file exits")
}
| household_key | basket_id | day | product_id | quantity | sales_value | store_id | retail_disc | trans_time | week_no | coupon_disc | coupon_match_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2375 | 26984851472 | 1 | 1004906 | 1 | 1.39 | 364 | -0.6 | 1631 | 1 | 0 | 0 |
| 2375 | 26984851472 | 1 | 1033142 | 1 | 0.82 | 364 | 0.0 | 1631 | 1 | 0 | 0 |
| 2375 | 26984851472 | 1 | 1036325 | 1 | 0.99 | 364 | -0.3 | 1631 | 1 | 0 | 0 |
| household_key | age_desc | marital_status_code | income_desc | homeowner_desc | hh_comp_desc | household_size_desc | kid_category_desc |
|---|---|---|---|---|---|---|---|
| 1 | 65+ | A | 35-49K | Homeowner | 2 Adults No Kids | 2 | None/Unknown |
| 7 | 45-54 | A | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown |
| 8 | 25-34 | U | 25-34K | Unknown | 2 Adults Kids | 3 | 1 |
| description | household_key | campaign |
|---|---|---|
| TypeA | 17 | 26 |
| TypeA | 27 | 26 |
| TypeA | 212 | 26 |
| description | campaign | start_day | end_day |
|---|---|---|---|
| TypeB | 24 | 659 | 719 |
| TypeC | 15 | 547 | 708 |
| TypeB | 25 | 659 | 691 |
| product_id | manufacturer | department | brand | commodity_desc | sub_commodity_desc | curr_size_of_product |
|---|---|---|---|---|---|---|
| 25671 | 2 | GROCERY | National | FRZN ICE | ICE - CRUSHED/CUBED | 22 LB |
| 26081 | 2 | MISC. TRANS. | National | NO COMMODITY DESCRIPTION | NO SUBCOMMODITY DESCRIPTION | |
| 26093 | 69 | PASTRY | Private | BREAD | BREAD:ITALIAN/FRENCH |
| coupon_upc | product_id | campaign |
|---|---|---|
| 10000089061 | 27160 | 4 |
| 10000089064 | 27754 | 9 |
| 10000089073 | 28897 | 12 |
| household_key | day | coupon_upc | campaign |
|---|---|---|---|
| 1 | 421 | 10000085364 | 8 |
| 1 | 421 | 51700010076 | 8 |
| 1 | 427 | 54200000033 | 8 |
| product_id | store_id | week_no | display | mailer |
|---|---|---|---|---|
| 26190 | 286 | 70 | 0 | A |
| 26190 | 288 | 70 | 0 | A |
| 26190 | 289 | 70 | 0 | A |
3.2.3 Peculiarities in the data sets
Missing Values -
for (i in seq_along(datasets))
{
missing_values<-any(is.na(get(datasets[i])))
if (missing_values)
{
print(paste(datasets[i],"has Missing values"))
} else
print(paste(datasets[i],"has no Missing values"))
}
## [1] "campaign_desc has no Missing values"
## [1] "campaign_table has no Missing values"
## [1] "causal_data has no Missing values"
## [1] "coupon has no Missing values"
## [1] "coupon_redempt has no Missing values"
## [1] "hh_demographic has no Missing values"
## [1] "product has no Missing values"
## [1] "transaction_data has no Missing values"
## [1] "product_dpt_mapping has no Missing values"
Note:
SUMMARY:
We showed above a glimpse and summary of how all the data from this comprehensive and complex collection look and what they mean. For solving our two research questions the tables that contain variables ofinterest would be -
Refer to tab Data preparation II for more on this and next steps…
3.3. Data Preparation Steps
Here we focus on two areas, data preparation (slicing and dicing) for:
Problem Statement #1
Tables of concern for Problem #1:
Variables of concern
3.4. Methodology and Output dataset
#Dataset 1-Demographic factors driving customer spend
Demographic_factors<-group_by(transaction_data,household_key) %>%
summarise(Total_Sales=sum(sales_value))%>%
inner_join(hh_demographic,by="household_key")
| household_key | Total_Sales | age_desc | marital_status_code | income_desc | homeowner_desc | hh_comp_desc | household_size_desc | kid_category_desc |
|---|---|---|---|---|---|---|---|---|
| 1 | 4330.16 | 65+ | A | 35-49K | Homeowner | 2 Adults No Kids | 2 | None/Unknown |
| 7 | 3400.05 | 45-54 | A | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown |
| 8 | 5534.97 | 25-34 | U | 25-34K | Unknown | 2 Adults Kids | 3 | 1 |
| 13 | 13190.92 | 25-34 | U | 75-99K | Homeowner | 2 Adults Kids | 4 | 2 |
| 16 | 1512.02 | 45-54 | B | 50-74K | Homeowner | Single Female | 1 | None/Unknown |
## household_key Total_Sales age_desc marital_status_code
## Min. : 1 Min. : 646.4 19-24: 46 A:340
## 1st Qu.: 596 1st Qu.: 2970.1 25-34:142 B:117
## Median :1218 Median : 4676.4 35-44:194 U:344
## Mean :1235 Mean : 5615.1 45-54:288
## 3rd Qu.:1914 3rd Qu.: 7165.0 55-64: 59
## Max. :2499 Max. :27859.7 65+ : 72
##
## income_desc homeowner_desc hh_comp_desc
## 50-74K :192 Homeowner :504 1 Adult Kids : 47
## 35-49K :172 Probable Owner : 11 2 Adults Kids :187
## 75-99K : 96 Probable Renter: 11 2 Adults No Kids:255
## 25-34K : 77 Renter : 42 Single Female :144
## 15-24K : 74 Unknown :233 Single Male : 95
## Under 15K: 61 Unknown : 73
## (Other) :129
## household_size_desc kid_category_desc
## 1 :255 1 :114
## 2 :318 2 : 60
## 3 :109 3+ : 69
## 4 : 53 None/Unknown:558
## 5+: 66
##
##
Problem Statement #2
Tables of concern for Problem #2:
Variables of concern
#Dataset 2-Profiling of Customers based on Product Category Engagement
Customer_profile <- inner_join(transaction_data, product, by = "product_id")%>%
inner_join(product_dpt_mapping, by = "department")%>%
group_by(household_key, Super_category) %>%
summarise(Total_Sales = sum(sales_value))%>%
select(household_key, Super_category, Total_Sales)%>%
spread(Super_category, Total_Sales)
| household_key | AUTOMOTIVE | COSMETICS | COUPONS | DELI | EGGS & MILK PRODUCTS | FLORAL | GARDEN SUPPLIES | GROCERY | GROCERY-HEALTHY | INFANT PRODUCTS | LIQUIOR | MEAT | MISC ITEMS | PHARMACY | SERVICES | STATIONARY | STORE SUPPLY/MISC | TOYS | VIDEO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NA | NA | 20.00 | 212.17 | NA | 7.99 | NA | 3021.10 | 48.33 | NA | NA | 349.00 | 530.89 | NA | 140.68 | NA | NA | NA | NA |
| 2 | NA | 22.07 | 4.26 | 46.73 | NA | 21.99 | NA | 1264.77 | 7.47 | NA | NA | 232.19 | 330.04 | NA | 24.82 | NA | NA | NA | NA |
| 3 | NA | 1.25 | NA | 36.47 | NA | NA | NA | 1852.78 | NA | NA | NA | 578.97 | 179.09 | NA | 4.65 | NA | NA | NA | NA |
| 4 | NA | 2.12 | NA | 14.35 | NA | NA | NA | 703.60 | 3.29 | NA | NA | 157.77 | 318.98 | NA | NA | NA | NA | NA | NA |
| 5 | NA | 9.48 | NA | 80.03 | NA | NA | NA | 361.32 | 2.00 | NA | NA | 120.77 | 203.08 | NA | 2.38 | NA | NA | NA | NA |
## household_key AUTOMOTIVE COSMETICS COUPONS
## Min. : 1.0 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 625.8 1st Qu.: 2.950 1st Qu.: 4.01 1st Qu.: 3.49
## Median :1250.5 Median : 5.950 Median : 10.80 Median : 10.50
## Mean :1250.5 Mean : 7.421 Mean : 25.16 Mean : 80.34
## 3rd Qu.:1875.2 3rd Qu.: 9.990 3rd Qu.: 29.12 3rd Qu.: 37.97
## Max. :2500.0 Max. :33.660 Max. :529.07 Max. :11161.12
## NA's :2439 NA's :1214 NA's :994
## DELI EGGS & MILK PRODUCTS FLORAL GARDEN SUPPLIES
## Min. : 0.75 Min. :0.5000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 21.76 1st Qu.:0.8975 1st Qu.: 8.99 1st Qu.: 5.00
## Median : 56.66 Median :0.9500 Median : 18.98 Median : 13.34
## Mean : 116.06 Mean :1.4646 Mean : 33.43 Mean : 24.90
## 3rd Qu.: 137.72 3rd Qu.:1.7600 3rd Qu.: 38.95 3rd Qu.: 29.86
## Max. :1808.36 Max. :5.7800 Max. :877.12 Max. :936.67
## NA's :252 NA's :2450 NA's :1314 NA's :2192
## GROCERY GROCERY-HEALTHY INFANT PRODUCTS LIQUIOR
## Min. : 1.74 Min. : 0.00 Min. :2.020 Min. : 0.000
## 1st Qu.: 558.17 1st Qu.: 5.00 1st Qu.:2.020 1st Qu.: 8.735
## Median : 1267.53 Median : 13.78 Median :2.355 Median : 17.480
## Mean : 1862.72 Mean : 56.82 Mean :2.355 Mean : 38.492
## 3rd Qu.: 2575.04 3rd Qu.: 43.01 3rd Qu.:2.690 3rd Qu.: 34.020
## Max. :15047.67 Max. :5535.49 Max. :2.690 Max. :2642.020
## NA's :781 NA's :2496 NA's :1937
## MEAT MISC ITEMS PHARMACY SERVICES
## Min. : 1.71 Min. : 0.0 Min. : 1.190 Min. : 0.00
## 1st Qu.: 102.86 1st Qu.: 120.2 1st Qu.: 5.910 1st Qu.: 14.99
## Median : 274.07 Median : 310.3 Median : 6.990 Median : 39.44
## Mean : 427.87 Mean : 646.0 Mean : 9.804 Mean : 66.60
## 3rd Qu.: 571.49 3rd Qu.: 800.3 3rd Qu.:11.748 3rd Qu.: 83.27
## Max. :3708.78 Max. :7939.3 Max. :20.890 Max. :885.24
## NA's :37 NA's :7 NA's :2482 NA's :177
## STATIONARY STORE SUPPLY/MISC TOYS VIDEO
## Min. :1.000 Min. : 0.000 Min. :0.99 Min. : 5.000
## 1st Qu.:2.435 1st Qu.: 1.000 1st Qu.:0.99 1st Qu.: 6.482
## Median :3.870 Median : 1.000 Median :1.24 Median : 9.235
## Mean :3.870 Mean : 2.453 Mean :2.46 Mean : 9.372
## 3rd Qu.:5.305 3rd Qu.: 1.000 3rd Qu.:2.71 3rd Qu.:12.213
## Max. :6.740 Max. :20.000 Max. :6.37 Max. :13.990
## NA's :2498 NA's :2480 NA's :2496 NA's :2494
3.5. Variables Summary:
Dataset-1 : Demographic_factors
kid_category_desc - no missing values, no abnormal values either
Dataset-2 : Customer_profile
other 19 variables: these are the ‘super categories’ that we created by grouping the 44 product categories / departments
Missing Values : There are no missing values. However the NA values might give the reader the wrong idea. To clarify, for the purposes of our analysis, an ‘NA’ would mean zero spending for the particular customer in that particular category. So, the NA values would be imputed with zeroes before starting analysis
4.1 Problem Statement 1-Demographic factors driving customer spend
Here is the main analysis of our first problem statement - What factors primarily affect the total customer spending for our retail client. The analysis is being done on the dataset - ‘Demographic_factors’. The reader might remember that this dataset in the one we prepared (using wrangling procedures) in our data preparation sections I and II. The predictor variables being investigated are income_desc, household_size_desc, marital_status_code, and age_desc and all of these are ‘factors’
To begin with, the analysis “graphically” investigates the possible effects of these variables on the customer spend (Total_Sales). Along with the plots, an interpretation of plots is provided. This is followed by a linear model on the Total_Sales against the four predictor variables, along with interpretations. At the very last of this section, we present a paragraph on the summary/explanation of results.
Demographic_factors %>%
ggplot(aes(x = Total_Sales, y = income_desc)) +
geom_point() +
labs(title = "Spend vs Income", x = "Spend", y = "Income")
Observation: We see that the average consumer spend looks to be the highest in the age group of $50-74k followed by the age groups of $35-49k and $75-99k.
Demographic_factors %>%
ggplot(aes(x = Total_Sales, y = age_desc)) +
geom_point() +
labs(title = "Spend vs Age", x = "Spend", y = "Age")
Observation: It appears that the age bracket of 35-55 years spends the highest compared to all other age groups.
Demographic_factors %>%
ggplot(aes(x = Total_Sales, y = household_size_desc,
color = marital_status_code)) +
geom_point() +
labs(title = "Spend vs Household Size, and Marital Status", x = "Spend",
y = "Household Size")
Observation: It appears that households with sizes of 2 and 3 would be the highest average spenders, followed by households with sizes 1 and 5. Moreover, it also appears that on average, married customers do spend more as compared to unmarried and single people.
From the above, we conjecture that a consumer profile who is married, has an household size of 2 to 3, is in the age bracket of 35 to 55, and earns $50-74k per annum, will likely be the highest average spender for our client. We explore it a little further though a linear regression model as follows.
Since all our predictor variables are factors, we take the following steps before running a regression-
# Dummy encoding
head(model.matrix( ~ age_desc + marital_status_code + income_desc +
household_size_desc, data = Demographic_factors))
# Regression
model_1 <- lm(Total_Sales ~ age_desc + marital_status_code +
income_desc + household_size_desc,
data = Demographic_factors)
summary(model_1)
##
## Call:
## lm(formula = Total_Sales ~ age_desc + marital_status_code + income_desc +
## household_size_desc, data = Demographic_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7400.3 -2450.4 -639.5 1497.9 19669.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4994.1 688.4 7.255 1.06e-12 ***
## age_desc25-34 708.0 683.6 1.036 0.30072
## age_desc35-44 1400.9 670.2 2.090 0.03695 *
## age_desc45-54 782.6 650.4 1.203 0.22931
## age_desc55-64 223.6 783.8 0.285 0.77555
## age_desc65+ -334.1 755.6 -0.442 0.65852
## marital_status_code.L -228.9 253.7 -0.902 0.36729
## marital_status_code.Q 216.0 324.2 0.666 0.50543
## income_desc.L 3102.2 954.3 3.251 0.00121 **
## income_desc.Q -1245.2 926.4 -1.344 0.17933
## income_desc.C -1936.3 840.4 -2.304 0.02151 *
## income_desc^4 -1989.2 798.6 -2.491 0.01297 *
## income_desc^5 -615.0 733.4 -0.839 0.40202
## income_desc^6 -82.5 647.3 -0.127 0.89862
## income_desc^7 301.8 554.4 0.544 0.58638
## income_desc^8 -614.2 483.7 -1.270 0.20464
## income_desc^9 -179.3 487.4 -0.368 0.71315
## household_size_desc2 233.2 369.4 0.631 0.52806
## household_size_desc3 678.9 478.5 1.419 0.15641
## household_size_desc4 1138.8 614.4 1.854 0.06421 .
## household_size_desc5+ 735.5 599.1 1.228 0.21995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3517 on 708 degrees of freedom
## (72 observations deleted due to missingness)
## Multiple R-squared: 0.1209, Adjusted R-squared: 0.09608
## F-statistic: 4.869 on 20 and 708 DF, p-value: 2.833e-11
Interpretation: Our linear model results seem to match our initial conjecture (from the graphical investigation) to a great extent. Some of those observations are as follows.
Age group 35-44 comes out to be statistically significant. Which means that a positive indicator in this age group could have a significant effect (1400 units) on the consumer spend.
Marital status ‘Married’ shows statistical significance which has the value of the intercept 5244. Which shows that a positive indicator in this category has a significant effect on consumer spend.
Income categories “15-24k”, “35-49k” and “50-74k” show statistical significance. With all else constant, a positive indicator in these categories will have a positive effect (combined with the intercept) on the consumer spend.
Household sizes 3 and 4 have a statistical significance in the model. Which means these have a significant effect on the consumer spend.
Summary: Overall we find that of the four predictor variables studied there are a few categories in each that can largely influence the consumer spend. So we can reasonably conclude that customers are likely to spend more if they fall into the categories under each predictor as mentioned above. Since we get a lesser R-square from the study, we can conjecture that there could be other variables that can possibly help further explain consumer demand variations to different factors
4.1.2 Methodology
4.1.2.a Preliminary Analysis - Summaries (Mean, Median) across various demographic variables were calculated to understand how spend varies across different levels. Also, we did a graphical investigation to see patterns, and correlations.
Once we had some idea about how various variables affect consumer spend, a linear regression model was built to ascertain which variables categories affect spending and how they affect spending.
4.1.2.b Linear Regression Model - We built a Linear Regression model using the dataset with all demographic variables and overall sales. Following steps were followed-
Step 1-Data was collated with all the demographic variables and overall sales over a certain time period for all customers
Step 2-After checking relationships of variables with overall sales(Using plots of predictors and sales), we tested appropriateness for normality of the response variable. And we followed a dummy encoding for all the factor predictor variables.
Step 3-Once these validations were complete, we ran the linear regression to calculate estimates for each of the variables and categories
Step 4-Compare the model iteration using model-fit statistics(R-square, error) to arrive at the best model
Step 5-Calculate the contribution values to understand to what magnitude is sales affected by each of those demographic factors.
4.2 Problem Statement 2-Profiling of Customers based on Product Category Engagement
4.2.1 Data Sources used-
4.2.2 Business Problem- Segmenting customers based on spend across various product categories
4.2.3 Data Analysis -
4.2.3.a Preliminary Analysis- Summarize(mean,median and mode) the datasets and plot using box-plot graph to identify outliers in spend data. Calculated the range of spend that should be considered after removing outliers. Using plots thus created, identify categories which are more important that others.
4.2.3.b K-Means Segmentation - Using sales value across many product departments, K-means segmentation was done to segment customers based on their product needs. Following steps were followed-
Step 1 - Required columns(Household ids, Total Sales, Product Categories/Departments) were pulled from Transaction and Product data after merging them using appropriate columns
Step 2 - Using the product description information, similar product categories(department) were combined to form broad product categories. e.g. ‘Frozen grocery’, ‘Produce’ and ‘Grocery’ were all combined under Grocery
Step 3 - Sales values were rolled up at broad product category level to get total sales for the households
Step 4 -Data was transformed from tall to wide format where columns have total sales for the product categories
| household_key | AUTOMOTIVE | COSMETICS | COUPONS | DELI | EGGS & MILK PRODUCTS | FLORAL | GARDEN SUPPLIES | GROCERY | GROCERY-HEALTHY | INFANT PRODUCTS | LIQUIOR | MEAT | MISC ITEMS | PHARMACY | SERVICES | STATIONARY | STORE SUPPLY/MISC | TOYS | VIDEO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NA | NA | 20.00 | 212.17 | NA | 7.99 | NA | 3021.10 | 48.33 | NA | NA | 349.00 | 530.89 | NA | 140.68 | NA | NA | NA | NA |
| 2 | NA | 22.07 | 4.26 | 46.73 | NA | 21.99 | NA | 1264.77 | 7.47 | NA | NA | 232.19 | 330.04 | NA | 24.82 | NA | NA | NA | NA |
| 3 | NA | 1.25 | NA | 36.47 | NA | NA | NA | 1852.78 | NA | NA | NA | 578.97 | 179.09 | NA | 4.65 | NA | NA | NA | NA |
| 4 | NA | 2.12 | NA | 14.35 | NA | NA | NA | 703.60 | 3.29 | NA | NA | 157.77 | 318.98 | NA | NA | NA | NA | NA | NA |
| 5 | NA | 9.48 | NA | 80.03 | NA | NA | NA | 361.32 | 2.00 | NA | NA | 120.77 | 203.08 | NA | 2.38 | NA | NA | NA | NA |
| 6 | NA | 9.96 | 17.99 | 125.23 | NA | 23.98 | 8.58 | 3288.33 | 46.19 | NA | NA | 1213.28 | 1097.23 | NA | 165.39 | NA | NA | NA | NA |
| AUTOMOTIVE | COSMETICS |
|
| EGGS & MILK PRODUCTS |
| GARDEN SUPPLIES |
| GROCERY-HEALTHY | INFANT PRODUCTS |
|
| MISC ITEMS |
|
| STATIONARY | STORE SUPPLY/MISC |
|
| |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.0000 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. :0.00000 | Min. : 0.00 | Min. : 0.000 | Min. : 1.74 | Min. : 0.00 | Min. :0.000000 | Min. : 0.000 | Min. : 0.00 | Min. : 0.0 | Min. : 0.00000 | Min. : 0.00 | Min. :0.000000 | Min. : 0.00000 | Min. :0.000000 | Min. : 0.00000 | |
| 1st Qu.: 0.0000 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 12.23 | 1st Qu.:0.00000 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 558.17 | 1st Qu.: 0.00 | 1st Qu.:0.000000 | 1st Qu.: 0.000 | 1st Qu.: 98.14 | 1st Qu.: 119.0 | 1st Qu.: 0.00000 | 1st Qu.: 10.98 | 1st Qu.:0.000000 | 1st Qu.: 0.00000 | 1st Qu.:0.000000 | 1st Qu.: 0.00000 | |
| Median : 0.0000 | Median : 0.99 | Median : 2.50 | Median : 46.13 | Median :0.00000 | Median : 0.00 | Median : 0.000 | Median : 1267.53 | Median : 5.47 | Median :0.000000 | Median : 0.000 | Median : 269.70 | Median : 308.7 | Median : 0.00000 | Median : 34.91 | Median :0.000000 | Median : 0.00000 | Median :0.000000 | Median : 0.00000 | |
| Mean : 0.1811 | Mean : 12.94 | Mean : 48.40 | Mean : 104.36 | Mean :0.02929 | Mean : 15.86 | Mean : 3.067 | Mean : 1862.72 | Mean : 39.07 | Mean :0.003768 | Mean : 8.668 | Mean : 421.54 | Mean : 644.1 | Mean : 0.07059 | Mean : 61.89 | Mean :0.003096 | Mean : 0.01962 | Mean :0.003936 | Mean : 0.02249 | |
| 3rd Qu.: 0.0000 | 3rd Qu.: 11.40 | 3rd Qu.: 16.01 | 3rd Qu.: 124.94 | 3rd Qu.:0.00000 | 3rd Qu.: 17.07 | 3rd Qu.: 0.000 | 3rd Qu.: 2575.04 | 3rd Qu.: 26.33 | 3rd Qu.:0.000000 | 3rd Qu.: 0.000 | 3rd Qu.: 563.54 | 3rd Qu.: 799.2 | 3rd Qu.: 0.00000 | 3rd Qu.: 77.99 | 3rd Qu.:0.000000 | 3rd Qu.: 0.00000 | 3rd Qu.:0.000000 | 3rd Qu.: 0.00000 | |
| Max. :33.6600 | Max. :529.07 | Max. :11161.12 | Max. :1808.36 | Max. :5.78000 | Max. :877.12 | Max. :936.670 | Max. :15047.67 | Max. :5535.49 | Max. :2.690000 | Max. :2642.020 | Max. :3708.78 | Max. :7939.3 | Max. :20.89000 | Max. :885.24 | Max. :6.740000 | Max. :20.00000 | Max. :6.370000 | Max. :13.99000 |
| product | Sum_cols |
|---|---|
| GROCERY | 4656794.40 |
| MISC ITEMS | 1610363.09 |
| MEAT | 1053836.82 |
| DELI | 260902.59 |
| SERVICES | 154722.32 |
| COUPONS | 120987.76 |
| GROCERY-HEALTHY | 97669.04 |
| FLORAL | 39653.34 |
| COSMETICS | 32360.37 |
| LIQUIOR | 21670.88 |
| GARDEN SUPPLIES | 7667.82 |
| AUTOMOTIVE | 452.66 |
| PHARMACY | 176.48 |
| EGGS & MILK PRODUCTS | 73.23 |
| VIDEO | 56.23 |
| STORE SUPPLY/MISC | 49.05 |
| TOYS | 9.84 |
| INFANT PRODUCTS | 9.42 |
| STATIONARY | 7.74 |
Histogram of all 2500 households and their overall spend
Checking quantile values to understand what should be the cut-off for analysis.Customers who spent less than $390 in 2 years time were removed from the analysis as we dont have sufficient history for those cases
## 10% 50% 90%
## 389.878 2157.750 7448.216
* Step 5 -Total Sales values were standardized to give similar weightage to all product categories
| GROCERY | MISC ITEMS | MEAT | DELI | SERVICES | COUPONS | GROCERY-HEALTHY | FLORAL | COSMETICS | LIQUIOR | GARDEN SUPPLIES | AUTOMOTIVE | PHARMACY | EGGS & MILK PRODUCTS | VIDEO | STORE SUPPLY/MISC | TOYS | INFANT PRODUCTS | STATIONARY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5112524 | -0.1974663 | -0.2465622 | 0.5824087 | 0.8392364 | -0.1025096 | 0.0317165 | -0.2309761 | -0.4367516 | -0.1416004 | -0.1467752 | -0.1227636 | -0.0746371 | -0.1192238 | -0.0448826 | -0.044699 | -0.0310159 | -0.0417706 | -0.0239491 |
| -0.4182378 | -0.4174619 | -0.4940023 | -0.4104241 | -0.4943611 | -0.1505816 | -0.2160162 | 0.1119995 | 0.2405165 | -0.1416004 | -0.1467752 | -0.1227636 | -0.0746371 | -0.1192238 | -0.0448826 | -0.044699 | -0.0310159 | -0.0417706 | -0.0239491 |
| -0.1070494 | -0.5828009 | 0.2405860 | -0.4719961 | -0.7265263 | -0.1635922 | -0.2613066 | -0.4267171 | -0.3983925 | -0.1416004 | -0.1467752 | -0.1227636 | -0.0746371 | -0.1192238 | -0.0448826 | -0.044699 | -0.0310159 | -0.0417706 | -0.0239491 |
| -0.7152218 | -0.4295762 | -0.6516471 | -0.6047419 | -0.7800498 | -0.1635922 | -0.2413594 | -0.4267171 | -0.3716946 | -0.1416004 | -0.1467752 | -0.1227636 | -0.0746371 | -0.1192238 | -0.0448826 | -0.044699 | -0.0310159 | -0.0417706 | -0.0239491 |
| -0.8963643 | -0.5565241 | -0.7300246 | -0.2105853 | -0.7526549 | -0.1635922 | -0.2491806 | -0.4267171 | -0.1458363 | -0.1416004 | -0.1467752 | -0.1227636 | -0.0746371 | -0.1192238 | -0.0448826 | -0.044699 | -0.0310159 | -0.0417706 | -0.0239491 |
| 0.6526767 | 0.4228588 | 1.5842527 | 0.0606674 | 1.1236590 | -0.1086484 | 0.0187418 | 0.1607510 | -0.1311064 | -0.1416004 | 0.2242823 | -0.1227636 | -0.0746371 | -0.1192238 | -0.0448826 | -0.044699 | -0.0310159 | -0.0417706 | -0.0239491 |
* Step 6 -Decide cluster count such that the total intra-cluster variation(within cluster sum of squares) is minimized. 15 was taken as the optimal cluster count from the graph
# function to compute total within-cluster sum of square
wss <- function(k) {
kmeans(df, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 40
k.values <- 1:40
# extract wss for 1-40 clusters
wss_values <- suppressWarnings(map_dbl(k.values, wss))
par(mfrow=c(1,1))
suppressWarnings(plot(k.values, wss_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Total within-clusters sum of squares"))
* Step 7 -K-means segmentation was run on the standardized columns to make clusters
#running k-means for 15 cluster count
k2 <- suppressWarnings(kmeans(df, centers = 15, nstart = 25))
str(k2)
## List of 9
## $ cluster : int [1:2250] 5 9 9 9 9 5 5 5 9 15 ...
## $ centers : num [1:15, 1:19] 1.818 1.083 0.634 6.07 0.255 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:15] "1" "2" "3" "4" ...
## .. ..$ : chr [1:19] "GROCERY" "MISC ITEMS" "MEAT" "DELI" ...
## $ totss : num 42731
## $ withinss : num [1:15] 135 660 115 0 3044 ...
## $ tot.withinss: num 13725
## $ betweenss : num 29006
## $ size : int [1:15] 4 9 4 1 639 14 1 10 1238 233 ...
## $ iter : int 11
## $ ifault : int 2
## - attr(*, "class")= chr "kmeans"
After getting cluster mappind for all 2250 households, data looks like this
## k2$cluster MISC ITEMS GROCERY
## 1 5 530.89 3021.10
## 2 9 330.04 1264.77
## 3 9 179.09 1852.78
## 4 9 318.98 703.60
## 5 9 203.08 361.32
## 6 5 1097.23 3288.33
* Step 8 -Visualize the cluster in R to understand how the customers have been clustered
* Step 9 - Profile and name the clusters based on the average spend across categories. Segments have been named as ‘H-Floral-Cosmetics-Garden Supplies buyers’ where H represents ‘high’ total spend and Product categories names suggests the kind of product a customer/household buys(Check Excel)
| cluster | number | avg_GROCERY | avg_MISC_ITEMS | avg_MEAT | avg_DELI | avg_SERVICES | avg_COUPONS | avg_GROCERY_HEALTHY | avg_FLORAL | avg_COSMETICS | avg_LIQUIOR | avg_GARDEN_SUPPLIES | avg_AUTOMOTIVE | avg_PHARMACY | avg_EGGS___MILK_PRODUCTS | avg_VIDEO | avg_STORE_SUPPLY_MISC | avg_TOYS | avg_INFANT_PRODUCTS | avg_STATIONARY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 5491.0200 | 1600.3275 | 1306.0325 | 88.01500 | 130.56750 | 32.81000 | 37.66250 | 7.060000 | 45.995000 | 26.560000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.4400000 | 0.0000000 | 0.0000000 | 0.0000000 | 2.355 | 0.0000000 |
| 2 | 9 | 4101.6133 | 2158.5800 | 605.4089 | 366.78667 | 270.55222 | 371.75111 | 237.92000 | 69.255556 | 61.822222 | 22.824444 | 19.4611111 | 0.0000000 | 14.9455556 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000 | 0.0000000 |
| 3 | 4 | 3253.7375 | 1724.6425 | 779.4725 | 170.92500 | 77.58750 | 12.93500 | 29.20750 | 7.382500 | 7.045000 | 44.657500 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 11.3125000 | 0.0000000 | 0.0000000 | 0.000 | 0.0000000 |
| 4 | 1 | 13523.8300 | 6447.2700 | 2263.5500 | 1808.36000 | 476.46000 | 11161.12000 | 563.31000 | 877.120000 | 212.120000 | 49.980000 | 936.6700000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000 | 0.0000000 |
| 5 | 639 | 2536.0856 | 824.4299 | 587.5830 | 130.32310 | 81.38161 | 58.04653 | 42.63070 | 19.318060 | 15.181847 | 9.443083 | 2.7926604 | 0.0455243 | 0.0077934 | 0.0182942 | 0.0000000 | 0.0236307 | 0.0015493 | 0.000 | 0.0000000 |
| 6 | 14 | 2055.8636 | 638.4293 | 609.9521 | 95.69143 | 66.16429 | 17.66143 | 18.25500 | 17.275714 | 7.216429 | 1.612143 | 6.3478571 | 0.0000000 | 0.0000000 | 2.7821429 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000 | 0.0000000 |
| 7 | 1 | 6432.8600 | 4248.3500 | 888.7000 | 66.23000 | 67.85000 | 2.99000 | 21.24000 | 0.000000 | 46.210000 | 2642.020000 | 0.0000000 | 4.4500000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000 | 0.0000000 |
| 8 | 10 | 4385.0750 | 1563.1170 | 586.2970 | 257.82900 | 91.98100 | 35.48300 | 1694.22900 | 39.557000 | 46.844000 | 12.888000 | 5.8440000 | 0.3990000 | 0.5970000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000 | 0.0000000 |
| 9 | 1238 | 904.8305 | 259.3305 | 209.6189 | 42.64703 | 30.48683 | 12.75612 | 16.43466 | 6.333207 | 4.501615 | 5.148215 | 0.7878837 | 0.0296123 | 0.0106381 | 0.0067932 | 0.0040388 | 0.0016155 | 0.0000000 | 0.000 | 0.0000000 |
| 10 | 233 | 4865.4105 | 1895.4209 | 1132.1258 | 293.27983 | 146.18455 | 116.42378 | 78.99468 | 32.853176 | 34.432060 | 15.077296 | 5.8499142 | 0.1380258 | 0.0680687 | 0.0454506 | 0.0000000 | 0.0085837 | 0.0042489 | 0.000 | 0.0042918 |
| 11 | 67 | 7388.5609 | 2969.6737 | 1487.0336 | 582.53433 | 297.20836 | 314.28716 | 130.21716 | 121.321940 | 84.197164 | 24.645373 | 30.8532836 | 0.1035821 | 0.0297015 | 0.0141791 | 0.0000000 | 0.0000000 | 0.0222388 | 0.000 | 0.0000000 |
| 12 | 1 | 1110.0400 | 308.1300 | 326.7800 | 45.86000 | 56.45000 | 11.50000 | 0.00000 | 9.990000 | 2.990000 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000 | 6.7400000 |
| 13 | 1 | 1964.4100 | 1052.5700 | 635.8200 | 13.08000 | 128.90000 | 0.00000 | 4.48000 | 52.970000 | 1.590000 | 59.180000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 6.3700000 | 0.000 | 0.0000000 |
| 14 | 2 | 8289.2800 | 3683.4000 | 1747.4500 | 736.39000 | 529.16500 | 486.54000 | 274.97500 | 188.185000 | 101.695000 | 18.580000 | 60.7600000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 14.9750000 | 0.0000000 | 0.000 | 0.0000000 |
| 15 | 26 | 2704.9212 | 1344.3535 | 497.9615 | 150.88654 | 71.50192 | 122.01115 | 58.04923 | 22.616923 | 49.244231 | 10.457692 | 2.5196154 | 13.0526923 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000 | 0.0000000 |
We studied two different problem statements. The summary of both are mentioned below, one after the other.
Problem Statement - 1
5.1.1 Problem Statement The purpose of this study was to understand how the demographic factors like marital status, age, income and household size affect consumer spending. Knowing what demographics make consumers send more during store visits will enable the client’s business to be more customer-centric and to ensure better customer lifetime value
5.1.2 Methodology
Step 1 Summaries (Mean, Median) across various demographic variables were calculated to understand how spend varies across different levels. Also, we did a graphical investigation to see patterns and correlations. Once we had some idea about how various variables affect consumer spend, a linear regression model was built to ascertain which variables categories affect spending and how they affect spending.
Step 2 Data was collated with all the demographic variables and overall sales over a certain time period for all customers
Step 3 After checking relationships of variables with overall sales(Using plots of predictors and sales), we tested appropriateness for normality of the response variable. And we followed a dummy encoding for all the factor predictor variables.
Step 4 Once these validations were complete, we ran the linear regression to calculate estimates for each of the variables and categories
Step 5 Compare the model iteration using model-fit statistics(R-square, error) to arrive at the best model
Step 6 Calculate the contribution values to understand to what magnitude is sales affected by each of those demographic factors.
5.1.3 Interesting Insights!
A key insight we found was regarding the demographic factors that would likely make the consumer the highest spender at the stores.-
5.1.4 Project Implications
We feel that this insight on what demographics make consumers spend more will be key to any decision-making for our retail client. This can be a big enabler in any promotion/coupon targeting programs to various consumer segments or to attract those demographic segments of consumers who currently are not high spenders. Overall, we believe this knowledge is key to ensuring a high average customer lifetime value.
Limitations of the analysis * This study was limited to the demographic factors we saw in the data. when we ran the regression we found it gave a lower R-square value. We think that there could be other predictor variables that we are not seeing in the data, which might help explain the correlation even better. Variables such as, whether the consumer is a loyal to another retail chain, or whether distance and convenience is a factor how people choose to walk into stores.
Problem Statement - 2
5.2.1 Problem Statement The purpose of this study was to get better understanding of our retail client’s customers based on product category preferences. It will allow business to get better visibility of the market and customize promotions based on their buying preference. Further, these promotions can help improve the customer interaction level with the store. Over time, it will also allow business to keep track of how customers are moving across defined segments and how their preferences change with time
5.2.2 Methodology
Using sales value across many product departments, K-means segmentation was done to segment customers based on their product needs. Following steps were followed-
Step 1 - Required columns(Household ids, Total Sales, Product Categories/Departments) were pulled from Transaction and Product data after merging them using appropriate columns
Step 2 - Using the product description information, similar product categories(department) were combined to form broad product categories. e.g. ‘Frozen grocery’, ‘Produce’ and ‘Grocery’ were all combined under Grocery
Step 3 - Sales values were rolled up at broad product category level to get total sales for the households
Step 4 - Data was transformed from tall to wide format where columns have total sales for the product categories
Step 5 - Total Sales values were standardized to give similar weightage to all product categories
Step 6 - Decide cluster count such that the total intra-cluster variation(within cluster sum of squares) is minimized
Step 7 - K-means segmentation was run on the standardized columns to make clusters
Step 8 - Visualize the cluster in R to understand how the customers have been clustered
Step 9 - Profile and name the clusters based on the average spend across categories. Segments have been named as ‘H-Floral-Cosmetics-Garden Supplies buyers’ where H represents ‘high’ total spend and Product categories names suggests the kind of product a customer/household buys
5.2.3 Intersting Insights! Following interesting insights were found in the analysis-
Among the high spenders, there is a segment ‘H-Floral-Cosmetics-Garden Suplies buyers’ with 67 households and it seems to be buying products from floral, cosmetics and garden supplies. We can target this segment using cross-category promotions across these product categories to increase their engagement level. Also, in case sales of any of these product categories is dropping, sales across this customer segment should be checked for any potential reasons.
Majority of the customers(~1230) belong to the “VL-Grocery-Basic Necessities” segment who visit stores for the basic necessities (groceries/basic necessities).Majority of the money is spent on buying on grocery(~61%) followed by Misc Items(17%) and meat(14%)
Another important segment is “L-Broad category Spenders” with 639 customers. These customers spend on broad categories ranging from Grocery, Deli, Meat , Services and coupons to less common
5.2.4 Project Implications Earlier promotions were sent out to household without much consideration of customer/household profile and the buying potential. So, the promotions were not customized for the audience and were sent to all irrespective of their preferences New segmentation exercise would allow business to send targeted messages and enable them to design promotional strategies in a better way. For eg. If the store wants to promote new automotive product in the store, it should be sending the promotional e-mail to all households part of segment 15 “L-Automotive Products” as these are the households who have bought products from this category. Also, knowing how your sales is distributed across categories will allow stores to come up with cross-category promotions aimed at increasing customer spend.
Limitations of the analysis Following are the few limitations and few factors which could be included to get improve the quality of segments in the future-
We can club some of the segments which have very few households present in them like segment 4,14,7,12 and 13
Broad product categories created after combining similar departments could be clubbed further to get fewer dimensions relevant for our business problem. Also, current definition can be further validated to ensure that mapping used is correct as per business requirement
Other variables like frequency of visits should be included in the analysis which would give us better idea about the buying pattern and would make household segments more useful for this kind of study