Who are my customers?

Business Understanding

Introduction:

As one of the world’s largest retailers operating in 2,769 grocery retail stores, Kroger is committed to meet the demands of an ever-evolving customer base. As a part of it’s next level growth strategy, Kroger plans to analyze it’s transactional data from April 2016 to October 2017 on the basis of what its customers looks forward to buy and factors influencing their spend capacity.

Problem Statement:

  • What constitutes majority of Kroger’s customer base? How is it changing over time?
  • Who are the benefactors of loyalty program and which are the demographics where loyalty program is most effective?

Solution Overview:

  • Identify proportion of customers with various demographic traits using household data
  • Observe increases in loyalty program participation for various customer demographics
  • Observe any shifts in these proportions between years 2016 and 2017
  • Study shopping behavior of these high shift customers

Business Application

  • Knowing the composition of customer base would help Kroger understand its untapped potential
  • Understanding customers who benefit from the loyalty card program

Required Packages / Functions Used

Required Packages

Prior installing and loading the following package would result in smooth execution of codes to reproduce the project.

List of Required Packages
Package Purpose
tidyverse Creating Tidy and data and ease of installation of other package
data.table Importing Huge Data sets; Function fread
ggplot2 Creating Visuals
lubridate Rectify Data format across tables
DT Creating functional tables in HTML
knitr Useful in generating dynamic reports
rmarkdown Converting R Markdown files into HTML/required format
magrittr For Pipe operator
sqldf To ease for writing SQL queries in R

Defined Functions

Defined several functions based on the use cases, where a chunk of codes were repeated several times. Defining function helped to overcome repeatedly writing the same set of codes and avoiding mistakes at the same time.

  • 1. Importing Data sets
read_fun <- function(file){
  fread(file, header = TRUE, 
  strip.white =  TRUE, stringsAsFactors = FALSE, 
  na.strings = c("null", "NA"))
}
  • 2. Removing Outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

Data Preparation

Importing Data Set

Using R’s data.table::fread() function to import the comma separated files.

data_table <- c("5000_households.csv","5000_products.csv","5000_transactions.csv")
output_table <- c("households","products","transactions")

for(d in seq_along(data_table))
{
  data <- fread(paste0("data/",data_table[d]))
  assign(output_table[d],data)
}

Cleaning

  • Step 1. White space removal from data set while importing dataset.

  • Step 2. Null value treatment. Identifying blank,NA,“null” as Null values.

colSums(is.na(transactions))
colSums(is.na(households))
colSums(is.na(products))

households_clean <- na.omit(households)
  • Step 3. Renaming Columns names as per the data dictionary provided to avoid any confusion.
colnames(households) <-c("hshd_num","loyalty_flag","age_range","martial_status",
                         "income_range","homeowner_desc","hshd_composition",
                          "hh_size","children")

colnames(products) <- c("product_num","department","commodity","brand_type",
                        "natural_organic_flag")

colnames(transactions) <-c("hshd_num","basket_num","date","product_num","spend",
                            "units","store_region","week_num","year")
  • Step 4. Outlier Treatment. Recreating columns for Spends and Units in Transaction tables.
transactions$spend_clean <- remove_outliers(transactions$spend)

transactions$unit_clean <- remove_outliers(transactions$unit)

households_clean <- na.omit(households)

Data Validation

  • Step 1. Checking for summary() of measure variables in Transactional table
summary(transactions)
  • Step 2. Formating Date Column in Transactions table
transactions$date <- dmy(as.character(transactions$date))
  • Step 3. Removing negative transactional units and spends
transactions <- subset(transactions, (spend>=0 & spend>=0))
  • Step 4. Outlier treatment after observing boxplots for measure variables
boxplot(transactions$spend)

boxplot(transactions$units)

Overview of Data

A look at the sample of 10 observations from each table, gives the user a chance to have a glimpse at the dataset.

  • Households
Sample from the Households Table
hshd_num loyalty_flag age_range martial_status income_range homeowner_desc hshd_composition hh_size children
688 Y 75+ Unknown 35-49K Homeowner NA null NOT AVAILABLE
2590 N 75+ Unknown 50-74K Homeowner NA null NOT AVAILABLE
1171 Y 75+ Unknown 75-99K Homeowner NA null NOT AVAILABLE
1531 Y 75+ Unknown 75-99K Homeowner NA null NOT AVAILABLE
403 N 75+ Unknown UNDER 35K Renter NA null NOT AVAILABLE
283 Y 25-34 Unknown 50-74K Homeowner NA null NOT AVAILABLE
3864 Y 25-34 Unknown 50-74K Homeowner NA null NOT AVAILABLE
4899 Y 25-34 Unknown 50-74K Homeowner NA null NOT AVAILABLE
1443 N 35-44 Unknown 150K+ Homeowner NA null NOT AVAILABLE
3598 Y 35-44 Unknown 35-49K Unknown NA null NOT AVAILABLE
  • Products
Sample from the Products Table
product_num department commodity brand_type natural_organic_flag
92993 NON-FOOD PET PRIVATE N
93924 NON-FOOD PET PRIVATE N
94272 NON-FOOD PET PRIVATE N
94299 NON-FOOD PET PRIVATE N
94594 NON-FOOD PET PRIVATE N
94606 NON-FOOD PET PRIVATE N
94613 NON-FOOD PET PRIVATE N
95625 NON-FOOD PET PRIVATE N
96152 NON-FOOD PET PRIVATE N
96153 NON-FOOD PET PRIVATE N
  • Transactions
Sample from the Transactions Table
hshd_num basket_num date product_num spend units store_region week_num year spend_clean unit_clean
34 1253 2016-01-03 539501 2.19 1 EAST 1 2016 2.19 1
60 1595 2016-01-03 5260099 0.99 1 WEST 1 2016 0.99 1
60 1595 2016-01-03 4535660 2.50 2 WEST 1 2016 2.50 NA
168 3393 2016-01-03 5602916 4.50 1 SOUTH 1 2016 4.50 1
199 4558 2016-01-03 5683532 3.49 1 SOUTH 1 2016 3.49 1
252 3491 2016-01-03 980450 2.79 1 SOUTH 1 2016 2.79 1
355 4733 2016-01-03 735993 1.00 1 EAST 1 2016 1.00 1
366 4369 2016-01-03 138157 9.98 2 EAST 1 2016 NA NA
379 226 2016-01-03 83880 1.29 1 EAST 1 2016 1.29 1
413 2668 2016-01-03 89361 1.79 1 WEST 1 2016 1.79 1

Exploratory Data Analysis

Data Dictionary

Variable Description of Households Table
Variable Description
hshd_num Uniquely identifies each household
loyalty_flag Estimated age range
age_range If the household owns a loyalty card
martial_status Maritial Status (A - Married, B - Single, U - Unknown)
income_range Total Household income
homeowner_desc Type of Residency(Homeowner,renter,etc)
hshd_composition Composition of household
hh_size Size of household up to 5+
children Number of children present up tp 3+
Variable Description of Products Table
Variable Description
product_num Uniquely identifies each product
department Groups products at Food, Non-Food, Pharma
commodity Groups products together at lower level than Department
brand_type Identifies the product as Private or National
natural_organic_flag Identifies the product as Natural/Organic or Inorganic
Variable Description of Transactions Table
Variable Description
product_num Uniquely identifies each product
department Groups products at Food, Non-Food, Pharma
commodity Groups products together at lower level than Department
brand_type Identifies the product as Private or National
natural_organic_flag Identifies the product as Natural/Organic or Inorganic

Analysis

Merging Transactions, Products and Households into one single table
data_joined <- sqldf("Select t.*,p.*,h.* 
                     from transactions t
                     Left Join products p on p.product_num=t.product_num
                     Left Join households_clean h on h.hshd_num=t.hshd_num
                     ")
data_joined$month <- month(data_joined$date)
Sample from the Merged Table
hshd_num basket_num date product_num spend units store_region week_num year spend_clean unit_clean product_num..12 department commodity brand_type natural_organic_flag hshd_num..17 loyalty_flag age_range martial_status income_range homeowner_desc hshd_composition hh_size children month
34 1253 2016-01-03 539501 2.19 1 EAST 1 2016 2.19 1 539501 FOOD CANNED GOODS NATIONAL N 34 Y 55-64 Married 50-74K Homeowner 2 Adults and Kids 3 1 1
60 1595 2016-01-03 5260099 0.99 1 WEST 1 2016 0.99 1 5260099 FOOD PRODUCE NATIONAL N 60 Y 45-54 Married 150K+ Homeowner 2 Adults and Kids 5+ 3+ 1
60 1595 2016-01-03 4535660 2.50 2 WEST 1 2016 2.50 NA 4535660 FOOD PRODUCE NATIONAL N 60 Y 45-54 Married 150K+ Homeowner 2 Adults and Kids 5+ 3+ 1
168 3393 2016-01-03 5602916 4.50 1 SOUTH 1 2016 4.50 1 5602916 FOOD DELI PRIVATE N 168 Y 55-64 Married 50-74K Homeowner 2 Adults 2 NOT AVAILABLE 1
199 4558 2016-01-03 5683532 3.49 1 SOUTH 1 2016 3.49 1 5683532 FOOD DELI NATIONAL N 199 Y 45-54 Married 50-74K Homeowner 2 Adults and Kids 3 1 1
252 3491 2016-01-03 980450 2.79 1 SOUTH 1 2016 2.79 1 980450 FOOD BAKERY NATIONAL N 252 Y 45-54 Single 75-99K Unknown 1 Adult and Kids 2 1 1
355 4733 2016-01-03 735993 1.00 1 EAST 1 2016 1.00 1 735993 FOOD GROCERY STAPLE NATIONAL N 355 Y null null null null null null null 1
366 4369 2016-01-03 138157 9.98 2 EAST 1 2016 NA NA 138157 FOOD GROCERY STAPLE NATIONAL N 366 Y null null null null null null null 1
379 226 2016-01-03 83880 1.29 1 EAST 1 2016 1.29 1 83880 NON-FOOD HOUSEHOLD PRIVATE N 379 Y 55-64 Married 100-150K Homeowner 2 Adults 2 NOT AVAILABLE 1
413 2668 2016-01-03 89361 1.79 1 WEST 1 2016 1.79 1 89361 FOOD GROCERY STAPLE PRIVATE N 413 Y 35-44 Single 35-49K Renter Single Male 1 NOT AVAILABLE 1
Objective

The objective of this analysis is to use Kroger data to find patterns and behaviors indicating health of the business and potential opportunities.

Sales around the year:

In order to understand the overall picture, we begin by looking at the average sales by month.

monthly_sales <-  transactions %>%
                    group_by(Month = month(date), Year = year(date)) %>%
                    summarize(spend=sum(spend)) %>%
                    group_by(Month) %>%
                    summarize(spend = mean(spend)) %>%
                    arrange(Month) %>% 
                    data.frame %>% 
                    mutate(Month = as.factor(Month))

Observation:

  • As can be seen, the sales remain about same throughout the year except in December presumably due to holiday season.

  • Total sales increased by ~1% between year 2016 and 2017

Is this spike in sales caused by a larger volume of customers or a larger basket size? Let’s find out next.

Average basket spends by month:
avg_basket_sales <-  transactions %>%
                      group_by(hshd_num, basket_num, Month = month(date)) %>%
                      summarize(basket_spend=sum(spend))

avg_basket_sales2 <- avg_basket_sales %>%
                      group_by(Month) %>%
                      summarize(monthly_avg_basket_spend = mean(basket_spend)) %>%
                      arrange(Month) %>% 
                      data.frame %>% 
                      mutate(Month = as.factor(Month))

Observation:

  • Average basket spend by month indicates that high sales is primarily driven by existing customers buying more volume of products in December.

Now, we want to understand what type of customers purchase most from Krogers. It would improve our understanding of typical Kroger customer and highlight potential customer groups Kroger can tap into

Who are my customers?

Best Performing Demographics Profile

Total sales by household:

hshd_sales <- transactions %>% 
                group_by(hshd_num) %>% 
                summarize(sales = sum(spend))

merged_transactions <- merge(hshd_sales, households, by = "hshd_num")
nrow(hshd_sales)
## [1] 996726
nrow(merged_transactions)
## [1] 4982
(1 - nrow(merged_transactions)/nrow(hshd_sales))
## [1] 0.9950016

Challenges:

  • 99.9% households information is not available in households dataset

Average Spends and Households:

spend_profile <-  na.omit(merged_transactions) %>% 
  group_by(loyalty_flag, martial_status, income_range, hh_size, age_range) %>% 
  summarize(avg_spend = mean(sales), hh_count = n()) %>% 
  filter(hh_count >= 10) %>% #removing cases where sample size is < 10 households
  arrange(desc(avg_spend))
Sample from the spend_profile Table
loyalty_flag martial_status income_range hh_size age_range avg_spend hh_count
N Married 100-150K 2 55-64 71.42273 11
Y Single 75-99K 1 55-64 60.79696 23
Y Married UNDER 35K 2 45-54 58.38462 13
N Single 35-49K 1 45-54 56.52300 10
Y Married 50-74K 4 35-44 55.66357 14
Y Married UNDER 35K 5+ 45-54 55.44583 12
Y Married 50-74K 4 45-54 53.47000 19
Y Single UNDER 35K 1 55-64 50.58133 15
Y Married 50-74K 3 55-64 49.29250 44
N Single 50-74K 1 45-54 49.04900 10
Y Single 75-99K 1 25-34 47.92765 17
Y Married 100-150K 4 55-64 47.54600 10
Y Married UNDER 35K 3 75+ 46.95750 20
Y Married 150K+ 5+ 45-54 46.77933 30
Y Single 50-74K 1 45-54 45.31155 58
Y Married 100-150K 2 65-74 44.63308 13
Y Single 50-74K 1 65-74 44.17900 10
Y Married UNDER 35K 2 55-64 43.92407 27
Y Single 100-150K 1 45-54 43.80886 35
Y Single 75-99K 1 35-44 43.50957 23

Observation:

  • Looking at recurring demographic traits of top 5 groups, we observe that Married, 50-74 Years Old, 4 people size customers are best spending.

Next, we also want to look at the fastest-growing customer demographics of customers to identify potential future customers.

Fastest growing demographics profile

YoY change in Average Spends by Customer Demographics

hshd_sales_16 <- transactions %>% 
                  filter(year == 2016) %>% 
                  group_by(hshd_num) %>% 
                  summarize(sales_16 = sum(spend),hh_16 = n())

hshd_sales_17 <- transactions %>% 
                  filter(year == 2017) %>% 
                  group_by(hshd_num) %>% 
                  summarize(sales_17 = sum(spend),hh_17 = n())

merged_transactions_16_17 <- merge(hshd_sales_16, hshd_sales_17, by = "hshd_num")
yearly_demographics <- merge(merged_transactions_16_17, households, by = "hshd_num")

Observation:

  • We looked at the shopping trends of customers who shopped repeatedly in both years 2016 and 2017 but could not do so due to limitation in demographics data.

Challenges:

  • On merging the data for customers who have repeatedly shopped with the customer demographics table; it has returned zero information indicating that households table have no information regarding these customers.

  • This could be a major setback for the company, as not having information or demographics related data about regular customers can led loss of sales or not understanding their needs.

Retaining existing customers is equally important as acquiring new ones for any large scale business like Kroger. Which is why we want to understand customer retention rates.

length(unique(merged_transactions_16_17$hshd_num))
## [1] 3520
length(unique(hshd_sales_16$hshd_num))
## [1] 516720

Takeaway:

  • There were 3520 customers who shopped again in 2017 after 2016. Out of total 516718 HHIDs in 2016. There is a poor retention that Kroger can potentially work on.
Retention of Customers
  • Year Over Year: Out of 2016 shoppers how many shopped again in 2017

  • Month over Month:

Retain rate: How many customer shop in consecutive months?

Reactivation rate: How many customers shop again after having shopped from Kroger in past?

hshd_month <- transactions %>%
  filter(year == 2016) %>% 
  mutate( Month = month(date)) %>% 
  select(hshd_num, Month) %>% 
  unique %>% 
  mutate(next_month = Month + 1)
Sample from the Customer M-o-M Table
hshd_num Month next_month
34 1 2
60 1 2
168 1 2
199 1 2
252 1 2
355 1 2
366 1 2
379 1 2
413 1 2
462 1 2
hshd_month %>% 
  group_by(hshd_num) %>% 
  summarise(cnt = n()) %>% arrange(desc(cnt))

Observation:

  • Due to limitations with transaction data, we do not find significant number of households visiting Kroger more than once.

Challenges:

  • Since none of the Household Ids shop more than one day, New, repeating and reactivated customers cannot be assessed

Identifying M-o-M, New, Retained and Reactivated customers for Kroger could help them understand their customer in-depth. It could also help in assessing on what sort of customers, they need to focus on during their marketing campaign. Retention rate of customers also is very important metric, which due to randomness or missing information cannot be evaluated here.

After analyzing sales and demographics traits, we want to sales by product types.

Sales by Region
region_wise_sale <- data_joined %>% 
  group_by(year,month,store_region) %>% 
  summarise(gmv = sum(spend_clean,na.rm=T),
  sales = sum(unit_clean,na.rm=T)) %>% 
  arrange(year,month,store_region)

Observations:

  • East region stores have been the best performing in terms of sales volume in comparison to the other stores.

  • While there was Month-Over-Month growth in sales volume during 2016 for West and Central regions, sales have been stagnant for West and declining for Central in 2017.

Next, we want to observe sales by product type. One important product categorization is Private Vs National brands. Selling private brands promotes brand loyalty towards Kroger while also providing higher profit margins than national brands.

Sales by Brand Type

Private labels are items manufactured and market under the Kroger brand, which have higher margin in comparison to National label.

brand_type_wise_sale <- data_joined %>% 
  group_by(year,month,brand_type) %>% 
  summarise(gmv = sum(spend_clean,na.rm=T),
  sales = sum(unit_clean,na.rm=T)) %>% 
  arrange(year,month,brand_type)

Observations:

  • The private brands for Kroger contributes ~ 32 % of their total sales, which have seen a growth of 0.5 % over 2016 to 2017.

Also,

  • Food Department contributes the maximum towards the total sales. Usually these are fast moving items, which calls the need to predict the demand before hand to plan the inventory well in advance. There hasn’t been in major change between year 2016 and 2017.

  • During 2016, percentage contribution from inorganic products were increasing Month-Over-Month, contributing on an average 96% of total sales, while Year 2017 saw an inclination towards Natural/Organic product with their contribution increasing from 5% to 6%. This is an early indicator that customers in future are more likely to shift towards natural products.

Summary

The objective of this analysis is to use Kroger data to find patterns and behaviors indicating health of the business and potential opportunities. Below is a summary of key findings.

Overall performance by year:

  • Total sales of Kroger increased by ~1% between year 2016 and 2017.

Best performing months:

  • Sales are highest in December presumably due to holiday season.

  • The high sales is driven primarily by existing customers buying a high volume of products.

Best performing customer demographics:

  • Married, 50-74 Years Old, 4 people size customers are spend the most at Kroger

Private brands sales performance:

Selling private brands promotes brand loyalty towards Kroger while also providing higher profit margins than National brands. We found:

  • The private brands for Kroger contribute ~ 32 % of their total sales, which have seen a growth of 0.5 % over 2016 to 2017.

Best performing regions:

  • East region stores have been the best performing in terms of sales volume.

  • While there was Month-Over-Month growth in sales volume during 2016 for West and Central regions, sales has been stagnant for West and declining for Central in 2017.

Next Steps / Recommendations

  • Looking at the data, there seems to be huge gap in how the data is collated for every household. The company to boosts sales and enchance customer experience and loyalty should be putting into use the customer data already available with them but the process of collecting customer information needs to be streamlined.

  • Understanding on a monthly level, Retained vs Reactivated vs New customers coming to the stores. Further, understanding of these three groups could help marketing campaigns in an effective manner.

  • Given, the iformation on customer, KNN Clustering could be performed to identify groups of customers with similar shopping patterns and hence sales and inventories could be planned well ahead of an event.

  • A predictive model could be built to assit in understanding the monthly sales for any region. This information could be help in evaluating the prospectives of opening a new store in the region given the demographics of that region is known in advance.