Complete Journey Customer Transactions

Who are my customers?

Business Understanding

Introduction:

As one of the world’s largest retailers operating in 2,769 grocery retail stores, Kroger is committed to meet the demands of an ever-evolving customer base. As a part of it’s next level growth strategy, Kroger plans to analyze it’s transactional data from April 2016 to October 2017 on the basis of what its customers looks forward to buy and factors influencing their spend capacity.

Problem Statement:

What constitutes majority of Kroger’s customer base? How is it changing over time?
Who are the benefactors of loyalty program and which are the demographics where loyalty program is most effective?

Solution Overview:

Identify proportion of customers with various demographic traits using household data
Observe increases in loyalty program participation for various customer demographics
Observe any shifts in these proportions between years 2016 and 2017
Study shopping behavior of these high shift customers

Business Application

Knowing the composition of customer base would help Kroger understand its untapped potential
Understanding customers who benefit from the loyalty card program

Required Packages / Functions Used

Required Packages

Prior installing and loading the following package would result in smooth execution of codes to reproduce the project.

List of Required Packages
Package	Purpose
tidyverse	Creating Tidy and data and ease of installation of other package
data.table	Importing Huge Data sets; Function fread
ggplot2	Creating Visuals
lubridate	Rectify Data format across tables
DT	Creating functional tables in HTML
knitr	Useful in generating dynamic reports
rmarkdown	Converting R Markdown files into HTML/required format
magrittr	For Pipe operator
sqldf	To ease for writing SQL queries in R

Defined Functions

Defined several functions based on the use cases, where a chunk of codes were repeated several times. Defining function helped to overcome repeatedly writing the same set of codes and avoiding mistakes at the same time.

1. Importing Data sets

read_fun <- function(file){
  fread(file, header = TRUE, 
  strip.white =  TRUE, stringsAsFactors = FALSE, 
  na.strings = c("null", "NA"))
}

2. Removing Outliers

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

Data Preparation

Importing Data Set

Using R’s data.table::fread() function to import the comma separated files.

data_table <- c("5000_households.csv","5000_products.csv","5000_transactions.csv")
output_table <- c("households","products","transactions")

for(d in seq_along(data_table))
{
  data <- fread(paste0("data/",data_table[d]))
  assign(output_table[d],data)
}

Cleaning

Step 1. White space removal from data set while importing dataset.
Step 2. Null value treatment. Identifying blank,NA,“null” as Null values.

colSums(is.na(transactions))
colSums(is.na(households))
colSums(is.na(products))

households_clean <- na.omit(households)

Step 3. Renaming Columns names as per the data dictionary provided to avoid any confusion.

colnames(households) <-c("hshd_num","loyalty_flag","age_range","martial_status",
                         "income_range","homeowner_desc","hshd_composition",
                          "hh_size","children")

colnames(products) <- c("product_num","department","commodity","brand_type",
                        "natural_organic_flag")

colnames(transactions) <-c("hshd_num","basket_num","date","product_num","spend",
                            "units","store_region","week_num","year")

Step 4. Outlier Treatment. Recreating columns for Spends and Units in Transaction tables.

transactions$spend_clean <- remove_outliers(transactions$spend)

transactions$unit_clean <- remove_outliers(transactions$unit)

households_clean <- na.omit(households)

Data Validation

Step 1. Checking for summary() of measure variables in Transactional table

summary(transactions)

Step 2. Formating Date Column in Transactions table

transactions$date <- dmy(as.character(transactions$date))

Step 3. Removing negative transactional units and spends

transactions <- subset(transactions, (spend>=0 & spend>=0))

Step 4. Outlier treatment after observing boxplots for measure variables

boxplot(transactions$spend)

boxplot(transactions$units)

Overview of Data

A look at the sample of 10 observations from each table, gives the user a chance to have a glimpse at the dataset.

Households

Sample from the Households Table
hshd_num	loyalty_flag	age_range	martial_status	income_range	homeowner_desc	hshd_composition	hh_size	children
688	Y	75+	Unknown	35-49K	Homeowner	NA	null	NOT AVAILABLE
2590	N	75+	Unknown	50-74K	Homeowner	NA	null	NOT AVAILABLE
1171	Y	75+	Unknown	75-99K	Homeowner	NA	null	NOT AVAILABLE
1531	Y	75+	Unknown	75-99K	Homeowner	NA	null	NOT AVAILABLE
403	N	75+	Unknown	UNDER 35K	Renter	NA	null	NOT AVAILABLE
283	Y	25-34	Unknown	50-74K	Homeowner	NA	null	NOT AVAILABLE
3864	Y	25-34	Unknown	50-74K	Homeowner	NA	null	NOT AVAILABLE
4899	Y	25-34	Unknown	50-74K	Homeowner	NA	null	NOT AVAILABLE
1443	N	35-44	Unknown	150K+	Homeowner	NA	null	NOT AVAILABLE
3598	Y	35-44	Unknown	35-49K	Unknown	NA	null	NOT AVAILABLE

Products

Sample from the Products Table
product_num	department	commodity	brand_type	natural_organic_flag
92993	NON-FOOD	PET	PRIVATE	N
93924	NON-FOOD	PET	PRIVATE	N
94272	NON-FOOD	PET	PRIVATE	N
94299	NON-FOOD	PET	PRIVATE	N
94594	NON-FOOD	PET	PRIVATE	N
94606	NON-FOOD	PET	PRIVATE	N
94613	NON-FOOD	PET	PRIVATE	N
95625	NON-FOOD	PET	PRIVATE	N
96152	NON-FOOD	PET	PRIVATE	N
96153	NON-FOOD	PET	PRIVATE	N

Transactions

Sample from the Transactions Table
hshd_num	basket_num	date	product_num	spend	units	store_region	week_num	year	spend_clean	unit_clean
34	1253	2016-01-03	539501	2.19	1	EAST	1	2016	2.19	1
60	1595	2016-01-03	5260099	0.99	1	WEST	1	2016	0.99	1
60	1595	2016-01-03	4535660	2.50	2	WEST	1	2016	2.50	NA
168	3393	2016-01-03	5602916	4.50	1	SOUTH	1	2016	4.50	1
199	4558	2016-01-03	5683532	3.49	1	SOUTH	1	2016	3.49	1
252	3491	2016-01-03	980450	2.79	1	SOUTH	1	2016	2.79	1
355	4733	2016-01-03	735993	1.00	1	EAST	1	2016	1.00	1
366	4369	2016-01-03	138157	9.98	2	EAST	1	2016	NA	NA
379	226	2016-01-03	83880	1.29	1	EAST	1	2016	1.29	1
413	2668	2016-01-03	89361	1.79	1	WEST	1	2016	1.79	1

Exploratory Data Analysis

Data Dictionary

Variable Description of **Households** Table
Variable	Description
hshd_num	Uniquely identifies each household
loyalty_flag	Estimated age range
age_range	If the household owns a loyalty card
martial_status	Maritial Status (A - Married, B - Single, U - Unknown)
income_range	Total Household income
homeowner_desc	Type of Residency(Homeowner,renter,etc)
hshd_composition	Composition of household
hh_size	Size of household up to 5+
children	Number of children present up tp 3+

Variable Description of **Products** Table
Variable	Description
product_num	Uniquely identifies each product
department	Groups products at Food, Non-Food, Pharma
commodity	Groups products together at lower level than Department
brand_type	Identifies the product as Private or National
natural_organic_flag	Identifies the product as Natural/Organic or Inorganic

Variable Description of **Transactions** Table
Variable	Description
product_num	Uniquely identifies each product
department	Groups products at Food, Non-Food, Pharma
commodity	Groups products together at lower level than Department
brand_type	Identifies the product as Private or National
natural_organic_flag	Identifies the product as Natural/Organic or Inorganic

Analysis

Merging Transactions, Products and Households into one single table

data_joined <- sqldf("Select t.*,p.*,h.* 
                     from transactions t
                     Left Join products p on p.product_num=t.product_num
                     Left Join households_clean h on h.hshd_num=t.hshd_num
                     ")
data_joined$month <- month(data_joined$date)

Sample from the Merged Table
hshd_num	basket_num	date	product_num	spend	units	store_region	week_num	year	spend_clean	unit_clean	product_num..12	department	commodity	brand_type	natural_organic_flag	hshd_num..17	loyalty_flag	age_range	martial_status	income_range	homeowner_desc	hshd_composition	hh_size	children	month
34	1253	2016-01-03	539501	2.19	1	EAST	1	2016	2.19	1	539501	FOOD	CANNED GOODS	NATIONAL	N	34	Y	55-64	Married	50-74K	Homeowner	2 Adults and Kids	3	1	1
60	1595	2016-01-03	5260099	0.99	1	WEST	1	2016	0.99	1	5260099	FOOD	PRODUCE	NATIONAL	N	60	Y	45-54	Married	150K+	Homeowner	2 Adults and Kids	5+	3+	1
60	1595	2016-01-03	4535660	2.50	2	WEST	1	2016	2.50	NA	4535660	FOOD	PRODUCE	NATIONAL	N	60	Y	45-54	Married	150K+	Homeowner	2 Adults and Kids	5+	3+	1
168	3393	2016-01-03	5602916	4.50	1	SOUTH	1	2016	4.50	1	5602916	FOOD	DELI	PRIVATE	N	168	Y	55-64	Married	50-74K	Homeowner	2 Adults	2	NOT AVAILABLE	1
199	4558	2016-01-03	5683532	3.49	1	SOUTH	1	2016	3.49	1	5683532	FOOD	DELI	NATIONAL	N	199	Y	45-54	Married	50-74K	Homeowner	2 Adults and Kids	3	1	1
252	3491	2016-01-03	980450	2.79	1	SOUTH	1	2016	2.79	1	980450	FOOD	BAKERY	NATIONAL	N	252	Y	45-54	Single	75-99K	Unknown	1 Adult and Kids	2	1	1
355	4733	2016-01-03	735993	1.00	1	EAST	1	2016	1.00	1	735993	FOOD	GROCERY STAPLE	NATIONAL	N	355	Y	null	null	null	null	null	null	null	1
366	4369	2016-01-03	138157	9.98	2	EAST	1	2016	NA	NA	138157	FOOD	GROCERY STAPLE	NATIONAL	N	366	Y	null	null	null	null	null	null	null	1
379	226	2016-01-03	83880	1.29	1	EAST	1	2016	1.29	1	83880	NON-FOOD	HOUSEHOLD	PRIVATE	N	379	Y	55-64	Married	100-150K	Homeowner	2 Adults	2	NOT AVAILABLE	1
413	2668	2016-01-03	89361	1.79	1	WEST	1	2016	1.79	1	89361	FOOD	GROCERY STAPLE	PRIVATE	N	413	Y	35-44	Single	35-49K	Renter	Single Male	1	NOT AVAILABLE	1

Objective

The objective of this analysis is to use Kroger data to find patterns and behaviors indicating health of the business and potential opportunities.

Sales around the year:

In order to understand the overall picture, we begin by looking at the average sales by month.

monthly_sales <-  transactions %>%
                    group_by(Month = month(date), Year = year(date)) %>%
                    summarize(spend=sum(spend)) %>%
                    group_by(Month) %>%
                    summarize(spend = mean(spend)) %>%
                    arrange(Month) %>% 
                    data.frame %>% 
                    mutate(Month = as.factor(Month))

Observation:

As can be seen, the sales remain about same throughout the year except in December presumably due to holiday season.
Total sales increased by ~1% between year 2016 and 2017

Is this spike in sales caused by a larger volume of customers or a larger basket size? Letâs find out next.

Average basket spends by month:

avg_basket_sales <-  transactions %>%
                      group_by(hshd_num, basket_num, Month = month(date)) %>%
                      summarize(basket_spend=sum(spend))

avg_basket_sales2 <- avg_basket_sales %>%
                      group_by(Month) %>%
                      summarize(monthly_avg_basket_spend = mean(basket_spend)) %>%
                      arrange(Month) %>% 
                      data.frame %>% 
                      mutate(Month = as.factor(Month))

Observation:

Average basket spend by month indicates that high sales is primarily driven by existing customers buying more volume of products in December.

Now, we want to understand what type of customers purchase most from Krogers. It would improve our understanding of typical Kroger customer and highlight potential customer groups Kroger can tap into

Who are my customers?

Best Performing Demographics Profile

Total sales by household:

hshd_sales <- transactions %>% 
                group_by(hshd_num) %>% 
                summarize(sales = sum(spend))

merged_transactions <- merge(hshd_sales, households, by = "hshd_num")

nrow(hshd_sales)

## [1] 996726

nrow(merged_transactions)

## [1] 4982

(1 - nrow(merged_transactions)/nrow(hshd_sales))

## [1] 0.9950016

Challenges:

99.9% households information is not available in households dataset

Average Spends and Households:

spend_profile <-  na.omit(merged_transactions) %>% 
  group_by(loyalty_flag, martial_status, income_range, hh_size, age_range) %>% 
  summarize(avg_spend = mean(sales), hh_count = n()) %>% 
  filter(hh_count >= 10) %>% #removing cases where sample size is < 10 households
  arrange(desc(avg_spend))

Sample from the spend_profile Table
loyalty_flag	martial_status	income_range	hh_size	age_range	avg_spend	hh_count
N	Married	100-150K	2	55-64	71.42273	11
Y	Single	75-99K	1	55-64	60.79696	23
Y	Married	UNDER 35K	2	45-54	58.38462	13
N	Single	35-49K	1	45-54	56.52300	10
Y	Married	50-74K	4	35-44	55.66357	14
Y	Married	UNDER 35K	5+	45-54	55.44583	12
Y	Married	50-74K	4	45-54	53.47000	19
Y	Single	UNDER 35K	1	55-64	50.58133	15
Y	Married	50-74K	3	55-64	49.29250	44
N	Single	50-74K	1	45-54	49.04900	10
Y	Single	75-99K	1	25-34	47.92765	17
Y	Married	100-150K	4	55-64	47.54600	10
Y	Married	UNDER 35K	3	75+	46.95750	20
Y	Married	150K+	5+	45-54	46.77933	30
Y	Single	50-74K	1	45-54	45.31155	58
Y	Married	100-150K	2	65-74	44.63308	13
Y	Single	50-74K	1	65-74	44.17900	10
Y	Married	UNDER 35K	2	55-64	43.92407	27
Y	Single	100-150K	1	45-54	43.80886	35
Y	Single	75-99K	1	35-44	43.50957	23

Observation:

Looking at recurring demographic traits of top 5 groups, we observe that Married, 50-74 Years Old, 4 people size customers are best spending.

Next, we also want to look at the fastest-growing customer demographics of customers to identify potential future customers.

Fastest growing demographics profile

YoY change in Average Spends by Customer Demographics

hshd_sales_16 <- transactions %>% 
                  filter(year == 2016) %>% 
                  group_by(hshd_num) %>% 
                  summarize(sales_16 = sum(spend),hh_16 = n())

hshd_sales_17 <- transactions %>% 
                  filter(year == 2017) %>% 
                  group_by(hshd_num) %>% 
                  summarize(sales_17 = sum(spend),hh_17 = n())

merged_transactions_16_17 <- merge(hshd_sales_16, hshd_sales_17, by = "hshd_num")

yearly_demographics <- merge(merged_transactions_16_17, households, by = "hshd_num")

Observation:

We looked at the shopping trends of customers who shopped repeatedly in both years 2016 and 2017 but could not do so due to limitation in demographics data.

Challenges:

On merging the data for customers who have repeatedly shopped with the customer demographics table; it has returned zero information indicating that households table have no information regarding these customers.
This could be a major setback for the company, as not having information or demographics related data about regular customers can led loss of sales or not understanding their needs.

Retaining existing customers is equally important as acquiring new ones for any large scale business like Kroger. Which is why we want to understand customer retention rates.

length(unique(merged_transactions_16_17$hshd_num))

## [1] 3520

length(unique(hshd_sales_16$hshd_num))

## [1] 516720

Takeaway:

There were 3520 customers who shopped again in 2017 after 2016. Out of total 516718 HHIDs in 2016. There is a poor retention that Kroger can potentially work on.

Retention of Customers

Year Over Year: Out of 2016 shoppers how many shopped again in 2017
Month over Month:

Retain rate: How many customer shop in consecutive months?

Reactivation rate: How many customers shop again after having shopped from Kroger in past?

hshd_month <- transactions %>%
  filter(year == 2016) %>% 
  mutate( Month = month(date)) %>% 
  select(hshd_num, Month) %>% 
  unique %>% 
  mutate(next_month = Month + 1)

Sample from the Customer M-o-M Table
hshd_num	Month	next_month
34	1	2
60	1	2
168	1	2
199	1	2
252	1	2
355	1	2
366	1	2
379	1	2
413	1	2
462	1	2

hshd_month %>% 
  group_by(hshd_num) %>% 
  summarise(cnt = n()) %>% arrange(desc(cnt))

Observation:

Due to limitations with transaction data, we do not find significant number of households visiting Kroger more than once.

Challenges:

Since none of the Household Ids shop more than one day, New, repeating and reactivated customers cannot be assessed

Identifying M-o-M, New, Retained and Reactivated customers for Kroger could help them understand their customer in-depth. It could also help in assessing on what sort of customers, they need to focus on during their marketing campaign. Retention rate of customers also is very important metric, which due to randomness or missing information cannot be evaluated here.

After analyzing sales and demographics traits, we want to sales by product types.

Sales by Region

region_wise_sale <- data_joined %>% 
  group_by(year,month,store_region) %>% 
  summarise(gmv = sum(spend_clean,na.rm=T),
  sales = sum(unit_clean,na.rm=T)) %>% 
  arrange(year,month,store_region)

Observations:

East region stores have been the best performing in terms of sales volume in comparison to the other stores.
While there was Month-Over-Month growth in sales volume during 2016 for West and Central regions, sales have been stagnant for West and declining for Central in 2017.

Next, we want to observe sales by product type. One important product categorization is Private Vs National brands. Selling private brands promotes brand loyalty towards Kroger while also providing higher profit margins than national brands.

Sales by Brand Type

Private labels are items manufactured and market under the Kroger brand, which have higher margin in comparison to National label.

brand_type_wise_sale <- data_joined %>% 
  group_by(year,month,brand_type) %>% 
  summarise(gmv = sum(spend_clean,na.rm=T),
  sales = sum(unit_clean,na.rm=T)) %>% 
  arrange(year,month,brand_type)

Observations:

The private brands for Kroger contributes ~ 32 % of their total sales, which have seen a growth of 0.5 % over 2016 to 2017.

Also,

Food Department contributes the maximum towards the total sales. Usually these are fast moving items, which calls the need to predict the demand before hand to plan the inventory well in advance. There hasn’t been in major change between year 2016 and 2017.
During 2016, percentage contribution from inorganic products were increasing Month-Over-Month, contributing on an average 96% of total sales, while Year 2017 saw an inclination towards Natural/Organic product with their contribution increasing from 5% to 6%. This is an early indicator that customers in future are more likely to shift towards natural products.

Summary

The objective of this analysis is to use Kroger data to find patterns and behaviors indicating health of the business and potential opportunities. Below is a summary of key findings.

Overall performance by year:

Total sales of Kroger increased by ~1% between year 2016 and 2017.

Best performing months:

Sales are highest in December presumably due to holiday season.
The high sales is driven primarily by existing customers buying a high volume of products.

Best performing customer demographics:

Married, 50-74 Years Old, 4 people size customers are spend the most at Kroger

Private brands sales performance:

Selling private brands promotes brand loyalty towards Kroger while also providing higher profit margins than National brands. We found:

The private brands for Kroger contribute ~ 32 % of their total sales, which have seen a growth of 0.5 % over 2016 to 2017.

Best performing regions:

East region stores have been the best performing in terms of sales volume.
While there was Month-Over-Month growth in sales volume during 2016 for West and Central regions, sales has been stagnant for West and declining for Central in 2017.

Next Steps / Recommendations

Looking at the data, there seems to be huge gap in how the data is collated for every household. The company to boosts sales and enchance customer experience and loyalty should be putting into use the customer data already available with them but the process of collecting customer information needs to be streamlined.
Understanding on a monthly level, Retained vs Reactivated vs New customers coming to the stores. Further, understanding of these three groups could help marketing campaigns in an effective manner.
Given, the iformation on customer, KNN Clustering could be performed to identify groups of customers with similar shopping patterns and hence sales and inventories could be planned well ahead of an event.
A predictive model could be built to assit in understanding the monthly sales for any region. This information could be help in evaluating the prospectives of opening a new store in the region given the demographics of that region is known in advance.

Complete Journey Customer Transactions

Anupreet Gupta , Palash Arora

Who are my customers?

Business Understanding

Introduction:

Problem Statement:

Solution Overview:

Business Application

Required Packages / Functions Used

Required Packages

Defined Functions

Data Preparation

Importing Data Set

Cleaning

Data Validation

Overview of Data

Exploratory Data Analysis

Data Dictionary

Analysis

Merging Transactions, Products and Households into one single table

Objective

Sales around the year:

Average basket spends by month:

Who are my customers?

Fastest growing demographics profile

Retention of Customers

Sales by Region

Sales by Brand Type

Summary

Next Steps / Recommendations