Our research is focused on helping a major retailer know their customers better. The goal of the study is to derive insights from the available ‘Customer Full Journey’ datasets relating to customers, transactions, and more. As part of our methodology we plan to clean, tidy, explore, and analyze the datasets we received to finally answer the following questions of interest -
Problem Statement #1: What household demographic factors contribute most to customer spend in the stores? Learn about the most impactful factors affecting customer spending.
Problem Statement #2: What category is each customer most engaged with? Profile customers and their engagement levels with product categories.
Our methodology can be broadly classified into three steps -
Data cleaning and preparation: clean and tidy the tables/datasets in R, and join tables as required for the analysis (more in detail in the section on data preparation)
Data Exploration: explore the nature and source of data, explore missing values and reasons ; identify abnormal values and outliers, and visualize variable distributions and correlation as required
Data Analysis: we propose a regression analysis (linear modeling) to determine the demographic factors that affect consumer spending the most, to be followed by a clustering technique to segment and profile customers according to their levels of engagement with various categories inside the store.
This research will help the client in two major ways. Firstly, understand better as to what factors impact the amount of purchase customers make inside stores. Secondly, design better targeted marketing campaigns powered by the knowledge of customer engagement (and spending) levels with various product categories.
Packages to be installed
Following Packages are needed for the project-
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(magrittr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(stats)))
suppressWarnings(suppressMessages(library(factoextra)))
suppressWarnings(suppressMessages(library(kableExtra)))
suppressWarnings(suppressMessages(library(knitr)))
3.1. Data Source The original data on the retail client was downloaded from the Data Wrangling course folders on Github
3.2.1 Introduction of source data The dataset primarily tracks the transaction level data from 2,500 households who frequently shop at the client’s stores. Moreover, the data captures purchases made across the entire gamut of product categories available at the retailer’s stores. Among other important information, we have customer demographics data and direct marketing campaigns data available for certain households. (Please Note: The terms ‘Customer’ and ‘Household’ have been assumed synonymous and used interchangeably in this report)
Timeline: The data tracks customer transactions over a period of two years (104 weeks).
3.2.2 Explanation of source data (tables and variables) In total, this comprehensive total customer journey had 8 different data frames. Each of these are described in brief below. Also a small glimpse into each of these data is given below
| household_key | basket_id | day | product_id | quantity | sales_value | store_id | retail_disc | trans_time | week_no | coupon_disc | coupon_match_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2375 | 26984851472 | 1 | 1004906 | 1 | 1.39 | 364 | -0.6 | 1631 | 1 | 0 | 0 |
| 2375 | 26984851472 | 1 | 1033142 | 1 | 0.82 | 364 | 0.0 | 1631 | 1 | 0 | 0 |
| 2375 | 26984851472 | 1 | 1036325 | 1 | 0.99 | 364 | -0.3 | 1631 | 1 | 0 | 0 |
| household_key | age_desc | marital_status_code | income_desc | homeowner_desc | hh_comp_desc | household_size_desc | kid_category_desc |
|---|---|---|---|---|---|---|---|
| 1 | 65+ | A | 35-49K | Homeowner | 2 Adults No Kids | 2 | None/Unknown |
| 7 | 45-54 | A | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown |
| 8 | 25-34 | U | 25-34K | Unknown | 2 Adults Kids | 3 | 1 |
| description | household_key | campaign |
|---|---|---|
| TypeA | 17 | 26 |
| TypeA | 27 | 26 |
| TypeA | 212 | 26 |
| description | campaign | start_day | end_day |
|---|---|---|---|
| TypeB | 24 | 659 | 719 |
| TypeC | 15 | 547 | 708 |
| TypeB | 25 | 659 | 691 |
| product_id | manufacturer | department | brand | commodity_desc | sub_commodity_desc | curr_size_of_product |
|---|---|---|---|---|---|---|
| 25671 | 2 | GROCERY | National | FRZN ICE | ICE - CRUSHED/CUBED | 22 LB |
| 26081 | 2 | MISC. TRANS. | National | NO COMMODITY DESCRIPTION | NO SUBCOMMODITY DESCRIPTION | |
| 26093 | 69 | PASTRY | Private | BREAD | BREAD:ITALIAN/FRENCH |
| coupon_upc | product_id | campaign |
|---|---|---|
| 10000089061 | 27160 | 4 |
| 10000089064 | 27754 | 9 |
| 10000089073 | 28897 | 12 |
| household_key | day | coupon_upc | campaign |
|---|---|---|---|
| 1 | 421 | 10000085364 | 8 |
| 1 | 421 | 51700010076 | 8 |
| 1 | 427 | 54200000033 | 8 |
| product_id | store_id | week_no | display | mailer |
|---|---|---|---|---|
| 26190 | 286 | 70 | 0 | A |
| 26190 | 288 | 70 | 0 | A |
| 26190 | 289 | 70 | 0 | A |
3.2.3 Peculiarities in the data sets
* Missing Values -
any(is.na(campaign_d)) & any(is.na(campaign_t)) & any(is.na(causal)) & any(is.na(coupon)) & any(is.na(coupon_r)) & any(is.na(demo)) & any(is.na(prod)) & any(is.na(trans))
## [1] FALSE
Note:
SUMMARY:
We showed above a glimpse and summary of how all the data from this comprehensive and complex collection look and what they mean. For solving our two research questions the tables that contain variables ofinterest would be -
Refer to tab Data Prep II for more on this and next steps…
3.3. Data Preparation Steps
Here we focus on two areas, data preparation (slicing and dicing) for: * Problem Statement #1 * Problem Statement #2
Problem Statement #1
Tables of concern for Problem #1: * Transaction data * Demographic data
Variables of concern
3.4. Methodology and Output dataset
#Dataset 1-Demographic factors driving customer spend
library(dplyr)
Demographic_factors<-group_by(trans,household_key) %>%
summarise(Total_Sales=sum(sales_value))%>%
inner_join(demo,by="household_key")
| household_key | Total_Sales | age_desc | marital_status_code | income_desc | homeowner_desc | hh_comp_desc | household_size_desc | kid_category_desc |
|---|---|---|---|---|---|---|---|---|
| 1 | 4330.16 | 65+ | A | 35-49K | Homeowner | 2 Adults No Kids | 2 | None/Unknown |
| 7 | 3400.05 | 45-54 | A | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown |
| 8 | 5534.97 | 25-34 | U | 25-34K | Unknown | 2 Adults Kids | 3 | 1 |
| 13 | 13190.92 | 25-34 | U | 75-99K | Homeowner | 2 Adults Kids | 4 | 2 |
| 16 | 1512.02 | 45-54 | B | 50-74K | Homeowner | Single Female | 1 | None/Unknown |
## household_key Total_Sales age_desc marital_status_code
## Min. : 1 Min. : 646.4 19-24: 46 A:340
## 1st Qu.: 596 1st Qu.: 2970.1 25-34:142 B:117
## Median :1218 Median : 4676.4 35-44:194 U:344
## Mean :1235 Mean : 5615.1 45-54:288
## 3rd Qu.:1914 3rd Qu.: 7165.0 55-64: 59
## Max. :2499 Max. :27859.7 65+ : 72
##
## income_desc homeowner_desc hh_comp_desc
## 50-74K :192 Homeowner :504 1 Adult Kids : 47
## 35-49K :172 Probable Owner : 11 2 Adults Kids :187
## 75-99K : 96 Probable Renter: 11 2 Adults No Kids:255
## 25-34K : 77 Renter : 42 Single Female :144
## 15-24K : 74 Unknown :233 Single Male : 95
## Under 15K: 61 Unknown : 73
## (Other) :129
## household_size_desc kid_category_desc
## 1 :255 1 :114
## 2 :318 2 : 60
## 3 :109 3+ : 69
## 4 : 53 None/Unknown:558
## 5+: 66
##
##
Problem Statement #2
Tables of concern for Problem #2: * Transaction data * Product data
Variables of concern
#Dataset 2-Profiling of Customers based on Product Category Engagement
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("magrittr")))
suppressWarnings(suppressMessages(library("tidyr")))
product_dpt_mapping <- read.csv("product_dpt_mapping.csv")
Customer_profile <- inner_join(trans, prod, by = "product_id")%>%
inner_join(product_dpt_mapping, by = "department")%>%
group_by(household_key, Super_category) %>%
summarise(Total_Sales = sum(sales_value))%>%
select(household_key, Super_category, Total_Sales)%>%
spread(Super_category, Total_Sales)
## Warning: Column `department` joining factors with different levels,
## coercing to character vector
| household_key | AUTOMOTIVE | COSMETICS | COUPONS | DELI | EGGS & MILK PRODUCTS | FLORAL | GARDEN SUPPLIES | GROCERY | GROCERY-HEALTHY | INFANT PRODUCTS | LIQUIOR | MEAT | MISC ITEMS | PHARMACY | SERVICES | STATIONARY | STORE SUPPLY/MISC | TOYS | VIDEO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NA | NA | 20.00 | 212.17 | NA | 7.99 | NA | 3021.10 | 48.33 | NA | NA | 349.00 | 530.89 | NA | 140.68 | NA | NA | NA | NA |
| 2 | NA | 22.07 | 4.26 | 46.73 | NA | 21.99 | NA | 1264.77 | 7.47 | NA | NA | 232.19 | 330.04 | NA | 24.82 | NA | NA | NA | NA |
| 3 | NA | 1.25 | NA | 36.47 | NA | NA | NA | 1852.78 | NA | NA | NA | 578.97 | 179.09 | NA | 4.65 | NA | NA | NA | NA |
| 4 | NA | 2.12 | NA | 14.35 | NA | NA | NA | 703.60 | 3.29 | NA | NA | 157.77 | 318.98 | NA | NA | NA | NA | NA | NA |
| 5 | NA | 9.48 | NA | 80.03 | NA | NA | NA | 361.32 | 2.00 | NA | NA | 120.77 | 203.08 | NA | 2.38 | NA | NA | NA | NA |
## household_key AUTOMOTIVE COSMETICS COUPONS
## Min. : 1.0 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 625.8 1st Qu.: 2.950 1st Qu.: 4.01 1st Qu.: 3.49
## Median :1250.5 Median : 5.950 Median : 10.80 Median : 10.50
## Mean :1250.5 Mean : 7.421 Mean : 25.16 Mean : 80.34
## 3rd Qu.:1875.2 3rd Qu.: 9.990 3rd Qu.: 29.12 3rd Qu.: 37.97
## Max. :2500.0 Max. :33.660 Max. :529.07 Max. :11161.12
## NA's :2439 NA's :1214 NA's :994
## DELI EGGS & MILK PRODUCTS FLORAL GARDEN SUPPLIES
## Min. : 0.75 Min. :0.5000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 21.76 1st Qu.:0.8975 1st Qu.: 8.99 1st Qu.: 5.00
## Median : 56.66 Median :0.9500 Median : 18.98 Median : 13.34
## Mean : 116.06 Mean :1.4646 Mean : 33.43 Mean : 24.90
## 3rd Qu.: 137.72 3rd Qu.:1.7600 3rd Qu.: 38.95 3rd Qu.: 29.86
## Max. :1808.36 Max. :5.7800 Max. :877.12 Max. :936.67
## NA's :252 NA's :2450 NA's :1314 NA's :2192
## GROCERY GROCERY-HEALTHY INFANT PRODUCTS LIQUIOR
## Min. : 1.74 Min. : 0.00 Min. :2.020 Min. : 0.000
## 1st Qu.: 558.17 1st Qu.: 5.00 1st Qu.:2.020 1st Qu.: 8.735
## Median : 1267.53 Median : 13.78 Median :2.355 Median : 17.480
## Mean : 1862.72 Mean : 56.82 Mean :2.355 Mean : 38.492
## 3rd Qu.: 2575.04 3rd Qu.: 43.01 3rd Qu.:2.690 3rd Qu.: 34.020
## Max. :15047.67 Max. :5535.49 Max. :2.690 Max. :2642.020
## NA's :781 NA's :2496 NA's :1937
## MEAT MISC ITEMS PHARMACY SERVICES
## Min. : 1.71 Min. : 0.0 Min. : 1.190 Min. : 0.00
## 1st Qu.: 102.86 1st Qu.: 120.2 1st Qu.: 5.910 1st Qu.: 14.99
## Median : 274.07 Median : 310.3 Median : 6.990 Median : 39.44
## Mean : 427.87 Mean : 646.0 Mean : 9.804 Mean : 66.60
## 3rd Qu.: 571.49 3rd Qu.: 800.3 3rd Qu.:11.748 3rd Qu.: 83.27
## Max. :3708.78 Max. :7939.3 Max. :20.890 Max. :885.24
## NA's :37 NA's :7 NA's :2482 NA's :177
## STATIONARY STORE SUPPLY/MISC TOYS VIDEO
## Min. :1.000 Min. : 0.000 Min. :0.99 Min. : 5.000
## 1st Qu.:2.435 1st Qu.: 1.000 1st Qu.:0.99 1st Qu.: 6.482
## Median :3.870 Median : 1.000 Median :1.24 Median : 9.235
## Mean :3.870 Mean : 2.453 Mean :2.46 Mean : 9.372
## 3rd Qu.:5.305 3rd Qu.: 1.000 3rd Qu.:2.71 3rd Qu.:12.213
## Max. :6.740 Max. :20.000 Max. :6.37 Max. :13.990
## NA's :2498 NA's :2480 NA's :2496 NA's :2494
3.5. Variables Summary:
Dataset-1 : Demographic_factors
kid_category_desc - no missing values, no abnormal values either
Dataset-2 : Customer_profile
other 19 variables: these are the ‘super categories’ that we created by grouping the 44 product categories / departments
Missing Values : There are no missing values. However the NA values might give the reader the wrong idea. To clarify, for the purposes of our analysis, an ‘NA’ would mean zero spending for the particular customer in that particular category. So, the NA values would be imputed with zeroes before starting analysis
4.1 Problem Statement 1-Demographic factors driving customer spend
4.1.1 Data Sources used-
4.1.2 Business Problem- Identify Demographic factors (Age Description, Martial Status Code, Income Desc, Household Size Description) driving overall dollar sales and quantify their sales impact
4.1.3 Business Implications-This will help quantify the impact of demographic factors driving sales. It will also help business understand the buying potential of new customers based on their demographic profile.
4.1.4 Methodology
4.1.4.a Preliminary Analysis - Summaries (Mean, Median) across various demographic variables should be calculated to understand how spend various across different levels. Following hypothesis will be tested-
Once we have some idea about how various variables affect sales, linear regression model will be created to quantify the impact
4.1.4.b Linear Regression Model - Build Linear regression model using the dataset with all demographic variables and overall sales. Following steps will be followed-
4.2 Problem Statement 2-Profiling of Customers based on Product Category Engagement
4.2.1 Data Sources used-
4.2.2 Business Problem- Segmenting customers based on spend across various product categories
4.2.3 Business Implications - This will allow business to get better understanding of customer base based on product category preferences. It will allow business to get better visibility of the market. Over time, this analysis will allow business to keep track of how customers are moving across defined segments and how their preferences change with time
4.2.4 Methodology
4.2.4.a Preliminary Analysis- Summarize(mean,median and mode) the datasets and plot using box-plot graph to identify outliers in spend data. Calculated the range of spend that should be considered after removing outliers. Using plots thus created, identify categories which are more important that others.
4.2.4.b K-Means Segmentation - Using sales value across many product departments, K-means segmentation will be done to segment customers based on their product needs. Following steps will be followed-