Who are my customers?

Business Understanding

Introduction:

As one of the world’s largest retailers operating in 2,769 grocery retail stores, Kroger is committed to meet the demands of an ever-evolving customer base. As a part of it’s next level growth strategy, Kroger plans to analyze it’s transactional data from April 2016 to October 2017 on the basis of what its customers looks forward to buy and factors influencing their spend capacity.

Problem Statement:

  • What constitutes majority of Kroger’s customer base? How is it changing over time?
  • Who are the benefactors of loyalty program and which are the demographics where loyalty program is most effective?

Solution Overview:

  • Identify proportion of customers with various demographic traits using household data
  • Observe increases in loyalty program participation for various customer demographics
  • Observe any shifts in these proportions between years 2016 and 2017
  • Study shopping behavior of these high shift customers

Business Application

  • Knowing the composition of customer base would help Kroger understand its untapped potential
  • Understanding customers who benefit from the loyalty card program

Required Packages

Prior installing and loading the following package would result in smooth execution of codes to reproduce the project.

List of Required Packages
Package Purpose
tidyverse Creating Tidy and data and ease of installation of other package
ggplot2 Creating Visuals
lubridate Rectify Data format across tables
readr Importing CSV files
DT Creating functional tables in HTML
knitr Useful in generating dynamic reports
rmarkdown Converting R Markdown files into HTML/required format
magrittr For Pipe operator

Data Preparation

Importing Data Set

Using R’s read.csv() function to import the comma separated files.

households <- read.csv("data/5000_households.csv", header = TRUE, 
                       strip.white =  TRUE, stringsAsFactors = FALSE, 
                       na.strings = c("null", "NA"))

products <- read.csv("data/5000_products.csv", header = TRUE,
                       strip.white =  TRUE, stringsAsFactors = FALSE,
                       na.strings = c("null", "NA"))

transactions <- read.csv("data/5000_transactions.csv", header = TRUE,
                       strip.white =  TRUE, stringsAsFactors = FALSE,
                       na.strings = c("null", "NA"))

Cleaning

  • Step 1. White space removal from data set while importing dataset.

  • Step 2. Null value treatment. Identifying blank,NA,“null” as Null values.

  • Step 3. Renaming Columns names as per the data dictionary provided to avoid any confusion.

colnames(households) <-c("hshd_num","loyalty_flag","age_range","martial_status",
                         "income_range","homeowner_desc","hshd_composition",
                          "hh_size","children")

colnames(products) <- c("product_num","department","commodity","brand_type",
                        "natural_organic_flag")

colnames(transactions) <-c("hshd_num","basket_num","date","product_num","spend",
                            "units","store_region","week_num","year")

Data Validation

  • Step 1. Checking for summary() of measure variables in Transactional table
summary(transactions)
  • Step 2. Formating Date Column in Transactions table
transactions$date <- dmy(as.character(transactions$date))
  • Step 3. Removing negative transactional units and spends
transactions <- subset(transactions, (spend>=0 & spend>=0))
  • Step 4. Outlier treatment after observing boxplots for measure variables
boxplot(transactions$spend)

boxplot(transactions$units)

Overview of Data

A look at the sample of 10 observations from each table, gives the user a chance to have a glimpse at the dataset.

  • Households
Sample from the Households Table
hshd_num loyalty_flag age_range martial_status income_range homeowner_desc hshd_composition hh_size children
688 Y 75+ Unknown 35-49K Homeowner NA NA NOT AVAILABLE
2590 N 75+ Unknown 50-74K Homeowner NA NA NOT AVAILABLE
1171 Y 75+ Unknown 75-99K Homeowner NA NA NOT AVAILABLE
1531 Y 75+ Unknown 75-99K Homeowner NA NA NOT AVAILABLE
403 N 75+ Unknown UNDER 35K Renter NA NA NOT AVAILABLE
283 Y 25-34 Unknown 50-74K Homeowner NA NA NOT AVAILABLE
3864 Y 25-34 Unknown 50-74K Homeowner NA NA NOT AVAILABLE
4899 Y 25-34 Unknown 50-74K Homeowner NA NA NOT AVAILABLE
1443 N 35-44 Unknown 150K+ Homeowner NA NA NOT AVAILABLE
3598 Y 35-44 Unknown 35-49K Unknown NA NA NOT AVAILABLE
  • Products
Sample from the Products Table
product_num department commodity brand_type natural_organic_flag
92993 NON-FOOD PET PRIVATE N
93924 NON-FOOD PET PRIVATE N
94272 NON-FOOD PET PRIVATE N
94299 NON-FOOD PET PRIVATE N
94594 NON-FOOD PET PRIVATE N
94606 NON-FOOD PET PRIVATE N
94613 NON-FOOD PET PRIVATE N
95625 NON-FOOD PET PRIVATE N
96152 NON-FOOD PET PRIVATE N
96153 NON-FOOD PET PRIVATE N
  • Transactions
Sample from the Transactions Table
hshd_num basket_num date product_num spend units store_region week_num year
3 34 1253 2016-01-03 539501 2.19 1 EAST 1 2016
4 60 1595 2016-01-03 5260099 0.99 1 WEST 1 2016
5 60 1595 2016-01-03 4535660 2.50 2 WEST 1 2016
6 168 3393 2016-01-03 5602916 4.50 1 SOUTH 1 2016
7 199 4558 2016-01-03 5683532 3.49 1 SOUTH 1 2016
8 252 3491 2016-01-03 980450 2.79 1 SOUTH 1 2016
9 355 4733 2016-01-03 735993 1.00 1 EAST 1 2016
10 366 4369 2016-01-03 138157 9.98 2 EAST 1 2016
11 379 226 2016-01-03 83880 1.29 1 EAST 1 2016
12 413 2668 2016-01-03 89361 1.79 1 WEST 1 2016

Exploratory Data Analysis

Data Dictionary

Variable Description of Households Table
Variable Description
hshd_num Uniquely identifies each household
loyalty_flag Estimated age range
age_range If the household owns a loyalty card
martial_status Maritial Status (A - Married, B - Single, U - Unknown)
income_range Total Household income
homeowner_desc Type of Residency(Homeowner,renter,etc)
hshd_composition Composition of household
hh_size Size of household up to 5+
children Number of children present up tp 3+
Variable Description of Products Table
Variable Description
product_num Uniquely identifies each product
department Groups products at Food, Non-Food, Pharma
commodity Groups products together at lower level than Department
brand_type Identifies the product as Private or National
natural_organic_flag Identifies the product as Natural/Organic or Inorganic
Variable Description of Transactions Table
Variable Description
product_num Uniquely identifies each product
department Groups products at Food, Non-Food, Pharma
commodity Groups products together at lower level than Department
brand_type Identifies the product as Private or National
natural_organic_flag Identifies the product as Natural/Organic or Inorganic

Analysis

monthly_sales <-  transactions %>%
                    group_by(Month = month(date), Year = year(date)) %>%
                      summarize(spend=sum(spend)) %>%
                        group_by(Month) %>%
                          summarize(spend = mean(spend)) %>%
                            arrange(Month) %>% data.frame

monthly_sales$Month <- as.factor(monthly_sales$Month)

Sales are highest in December presumably due to Christmas holidays

avg_basket_sales <-  transactions %>%
                      group_by(hshd_num, basket_num, Month = month(date)) %>%
                        summarize(basket_spend=sum(spend))

avg_basket_sales2 <- avg_basket_sales %>%
                      group_by(Month) %>%
                        summarize(monthly_avg_basket_spend = mean(basket_spend)) %>%
                          arrange(Month) %>% data.frame

avg_basket_sales2$Month <- as.factor(avg_basket_sales2$Month)

The average basket spend is approximately constant throughout year except December when average basket amount is highest

Next Steps

  • Transform data to get sales for year 2016 and 2017 side by side
  • Plot year over year trend of shopping and loyalty program subscription for various customer demographics groups
  • Observe groups with significant shift in sales and analyze product category preference
  • Identify customer demographics with significant increase/decrease in loyalty program subscription

Visualizations

Summary

Testing