As one of the world’s largest retailers operating in 2,769 grocery retail stores, Kroger is committed to meet the demands of an ever-evolving customer base. As a part of it’s next level growth strategy, Kroger plans to analyze it’s transactional data from April 2016 to October 2017 on the basis of what its customers looks forward to buy and factors influencing their spend capacity.
Prior installing and loading the following package would result in smooth execution of codes to reproduce the project.
| Package | Purpose |
|---|---|
| tidyverse | Creating Tidy and data and ease of installation of other package |
| data.table | Importing Huge Data sets; Function fread |
| ggplot2 | Creating Visuals |
| lubridate | Rectify Data format across tables |
| DT | Creating functional tables in HTML |
| knitr | Useful in generating dynamic reports |
| rmarkdown | Converting R Markdown files into HTML/required format |
| magrittr | For Pipe operator |
| sqldf | To ease for writing SQL queries in R |
Defined several functions based on the use cases, where a chunk of codes were repeated several times. Defining function helped to overcome repeatedly writing the same set of codes and avoiding mistakes at the same time.
read_fun <- function(file){
fread(file, header = TRUE,
strip.white = TRUE, stringsAsFactors = FALSE,
na.strings = c("null", "NA"))
}
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
Using R’s data.table::fread() function to import the comma separated files.
data_table <- c("5000_households.csv","5000_products.csv","5000_transactions.csv")
output_table <- c("households","products","transactions")
for(d in seq_along(data_table))
{
data <- fread(paste0("data/",data_table[d]))
assign(output_table[d],data)
}
Step 1. White space removal from data set while importing dataset.
Step 2. Null value treatment. Identifying blank,NA,“null” as Null values.
colSums(is.na(transactions))
colSums(is.na(households))
colSums(is.na(products))
households_clean <- na.omit(households)
colnames(households) <-c("hshd_num","loyalty_flag","age_range","martial_status",
"income_range","homeowner_desc","hshd_composition",
"hh_size","children")
colnames(products) <- c("product_num","department","commodity","brand_type",
"natural_organic_flag")
colnames(transactions) <-c("hshd_num","basket_num","date","product_num","spend",
"units","store_region","week_num","year")
transactions$spend_clean <- remove_outliers(transactions$spend)
transactions$unit_clean <- remove_outliers(transactions$unit)
households_clean <- na.omit(households)
summary(transactions)
transactions$date <- dmy(as.character(transactions$date))
transactions <- subset(transactions, (spend>=0 & spend>=0))
boxplot(transactions$spend)
boxplot(transactions$units)
A look at the sample of 10 observations from each table, gives the user a chance to have a glimpse at the dataset.
| hshd_num | loyalty_flag | age_range | martial_status | income_range | homeowner_desc | hshd_composition | hh_size | children |
|---|---|---|---|---|---|---|---|---|
| 688 | Y | 75+ | Unknown | 35-49K | Homeowner | NA | null | NOT AVAILABLE |
| 2590 | N | 75+ | Unknown | 50-74K | Homeowner | NA | null | NOT AVAILABLE |
| 1171 | Y | 75+ | Unknown | 75-99K | Homeowner | NA | null | NOT AVAILABLE |
| 1531 | Y | 75+ | Unknown | 75-99K | Homeowner | NA | null | NOT AVAILABLE |
| 403 | N | 75+ | Unknown | UNDER 35K | Renter | NA | null | NOT AVAILABLE |
| 283 | Y | 25-34 | Unknown | 50-74K | Homeowner | NA | null | NOT AVAILABLE |
| 3864 | Y | 25-34 | Unknown | 50-74K | Homeowner | NA | null | NOT AVAILABLE |
| 4899 | Y | 25-34 | Unknown | 50-74K | Homeowner | NA | null | NOT AVAILABLE |
| 1443 | N | 35-44 | Unknown | 150K+ | Homeowner | NA | null | NOT AVAILABLE |
| 3598 | Y | 35-44 | Unknown | 35-49K | Unknown | NA | null | NOT AVAILABLE |
| product_num | department | commodity | brand_type | natural_organic_flag |
|---|---|---|---|---|
| 92993 | NON-FOOD | PET | PRIVATE | N |
| 93924 | NON-FOOD | PET | PRIVATE | N |
| 94272 | NON-FOOD | PET | PRIVATE | N |
| 94299 | NON-FOOD | PET | PRIVATE | N |
| 94594 | NON-FOOD | PET | PRIVATE | N |
| 94606 | NON-FOOD | PET | PRIVATE | N |
| 94613 | NON-FOOD | PET | PRIVATE | N |
| 95625 | NON-FOOD | PET | PRIVATE | N |
| 96152 | NON-FOOD | PET | PRIVATE | N |
| 96153 | NON-FOOD | PET | PRIVATE | N |
| hshd_num | basket_num | date | product_num | spend | units | store_region | week_num | year | spend_clean | unit_clean |
|---|---|---|---|---|---|---|---|---|---|---|
| 34 | 1253 | 2016-01-03 | 539501 | 2.19 | 1 | EAST | 1 | 2016 | 2.19 | 1 |
| 60 | 1595 | 2016-01-03 | 5260099 | 0.99 | 1 | WEST | 1 | 2016 | 0.99 | 1 |
| 60 | 1595 | 2016-01-03 | 4535660 | 2.50 | 2 | WEST | 1 | 2016 | 2.50 | NA |
| 168 | 3393 | 2016-01-03 | 5602916 | 4.50 | 1 | SOUTH | 1 | 2016 | 4.50 | 1 |
| 199 | 4558 | 2016-01-03 | 5683532 | 3.49 | 1 | SOUTH | 1 | 2016 | 3.49 | 1 |
| 252 | 3491 | 2016-01-03 | 980450 | 2.79 | 1 | SOUTH | 1 | 2016 | 2.79 | 1 |
| 355 | 4733 | 2016-01-03 | 735993 | 1.00 | 1 | EAST | 1 | 2016 | 1.00 | 1 |
| 366 | 4369 | 2016-01-03 | 138157 | 9.98 | 2 | EAST | 1 | 2016 | NA | NA |
| 379 | 226 | 2016-01-03 | 83880 | 1.29 | 1 | EAST | 1 | 2016 | 1.29 | 1 |
| 413 | 2668 | 2016-01-03 | 89361 | 1.79 | 1 | WEST | 1 | 2016 | 1.79 | 1 |
| Variable | Description |
|---|---|
| hshd_num | Uniquely identifies each household |
| loyalty_flag | Estimated age range |
| age_range | If the household owns a loyalty card |
| martial_status | Maritial Status (A - Married, B - Single, U - Unknown) |
| income_range | Total Household income |
| homeowner_desc | Type of Residency(Homeowner,renter,etc) |
| hshd_composition | Composition of household |
| hh_size | Size of household up to 5+ |
| children | Number of children present up tp 3+ |
| Variable | Description |
|---|---|
| product_num | Uniquely identifies each product |
| department | Groups products at Food, Non-Food, Pharma |
| commodity | Groups products together at lower level than Department |
| brand_type | Identifies the product as Private or National |
| natural_organic_flag | Identifies the product as Natural/Organic or Inorganic |
| Variable | Description |
|---|---|
| product_num | Uniquely identifies each product |
| department | Groups products at Food, Non-Food, Pharma |
| commodity | Groups products together at lower level than Department |
| brand_type | Identifies the product as Private or National |
| natural_organic_flag | Identifies the product as Natural/Organic or Inorganic |
data_joined <- sqldf("Select t.*,p.*,h.*
from transactions t
Left Join products p on p.product_num=t.product_num
Left Join households_clean h on h.hshd_num=t.hshd_num
")
data_joined$month <- month(data_joined$date)
| hshd_num | basket_num | date | product_num | spend | units | store_region | week_num | year | spend_clean | unit_clean | product_num..12 | department | commodity | brand_type | natural_organic_flag | hshd_num..17 | loyalty_flag | age_range | martial_status | income_range | homeowner_desc | hshd_composition | hh_size | children | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | 1253 | 2016-01-03 | 539501 | 2.19 | 1 | EAST | 1 | 2016 | 2.19 | 1 | 539501 | FOOD | CANNED GOODS | NATIONAL | N | 34 | Y | 55-64 | Married | 50-74K | Homeowner | 2 Adults and Kids | 3 | 1 | 1 |
| 60 | 1595 | 2016-01-03 | 5260099 | 0.99 | 1 | WEST | 1 | 2016 | 0.99 | 1 | 5260099 | FOOD | PRODUCE | NATIONAL | N | 60 | Y | 45-54 | Married | 150K+ | Homeowner | 2 Adults and Kids | 5+ | 3+ | 1 |
| 60 | 1595 | 2016-01-03 | 4535660 | 2.50 | 2 | WEST | 1 | 2016 | 2.50 | NA | 4535660 | FOOD | PRODUCE | NATIONAL | N | 60 | Y | 45-54 | Married | 150K+ | Homeowner | 2 Adults and Kids | 5+ | 3+ | 1 |
| 168 | 3393 | 2016-01-03 | 5602916 | 4.50 | 1 | SOUTH | 1 | 2016 | 4.50 | 1 | 5602916 | FOOD | DELI | PRIVATE | N | 168 | Y | 55-64 | Married | 50-74K | Homeowner | 2 Adults | 2 | NOT AVAILABLE | 1 |
| 199 | 4558 | 2016-01-03 | 5683532 | 3.49 | 1 | SOUTH | 1 | 2016 | 3.49 | 1 | 5683532 | FOOD | DELI | NATIONAL | N | 199 | Y | 45-54 | Married | 50-74K | Homeowner | 2 Adults and Kids | 3 | 1 | 1 |
| 252 | 3491 | 2016-01-03 | 980450 | 2.79 | 1 | SOUTH | 1 | 2016 | 2.79 | 1 | 980450 | FOOD | BAKERY | NATIONAL | N | 252 | Y | 45-54 | Single | 75-99K | Unknown | 1 Adult and Kids | 2 | 1 | 1 |
| 355 | 4733 | 2016-01-03 | 735993 | 1.00 | 1 | EAST | 1 | 2016 | 1.00 | 1 | 735993 | FOOD | GROCERY STAPLE | NATIONAL | N | 355 | Y | null | null | null | null | null | null | null | 1 |
| 366 | 4369 | 2016-01-03 | 138157 | 9.98 | 2 | EAST | 1 | 2016 | NA | NA | 138157 | FOOD | GROCERY STAPLE | NATIONAL | N | 366 | Y | null | null | null | null | null | null | null | 1 |
| 379 | 226 | 2016-01-03 | 83880 | 1.29 | 1 | EAST | 1 | 2016 | 1.29 | 1 | 83880 | NON-FOOD | HOUSEHOLD | PRIVATE | N | 379 | Y | 55-64 | Married | 100-150K | Homeowner | 2 Adults | 2 | NOT AVAILABLE | 1 |
| 413 | 2668 | 2016-01-03 | 89361 | 1.79 | 1 | WEST | 1 | 2016 | 1.79 | 1 | 89361 | FOOD | GROCERY STAPLE | PRIVATE | N | 413 | Y | 35-44 | Single | 35-49K | Renter | Single Male | 1 | NOT AVAILABLE | 1 |
The objective of this analysis is to use Kroger data to find patterns and behaviors indicating health of the business and potential opportunities.
In order to understand the overall picture, we begin by looking at the average sales by month.
monthly_sales <- transactions %>%
group_by(Month = month(date), Year = year(date)) %>%
summarize(spend=sum(spend)) %>%
group_by(Month) %>%
summarize(spend = mean(spend)) %>%
arrange(Month) %>%
data.frame %>%
mutate(Month = as.factor(Month))
Observation:
As can be seen, the sales remain about same throughout the year except in December presumably due to holiday season.
Total sales increased by ~1% between year 2016 and 2017
Is this spike in sales caused by a larger volume of customers or a larger basket size? Let’s find out next.
avg_basket_sales <- transactions %>%
group_by(hshd_num, basket_num, Month = month(date)) %>%
summarize(basket_spend=sum(spend))
avg_basket_sales2 <- avg_basket_sales %>%
group_by(Month) %>%
summarize(monthly_avg_basket_spend = mean(basket_spend)) %>%
arrange(Month) %>%
data.frame %>%
mutate(Month = as.factor(Month))
Observation:
Now, we want to understand what type of customers purchase most from Krogers. It would improve our understanding of typical Kroger customer and highlight potential customer groups Kroger can tap into
Best Performing Demographics Profile
Total sales by household:
hshd_sales <- transactions %>%
group_by(hshd_num) %>%
summarize(sales = sum(spend))
merged_transactions <- merge(hshd_sales, households, by = "hshd_num")
nrow(hshd_sales)
## [1] 996726
nrow(merged_transactions)
## [1] 4982
(1 - nrow(merged_transactions)/nrow(hshd_sales))
## [1] 0.9950016
Challenges:
Average Spends and Households:
spend_profile <- na.omit(merged_transactions) %>%
group_by(loyalty_flag, martial_status, income_range, hh_size, age_range) %>%
summarize(avg_spend = mean(sales), hh_count = n()) %>%
filter(hh_count >= 10) %>% #removing cases where sample size is < 10 households
arrange(desc(avg_spend))
| loyalty_flag | martial_status | income_range | hh_size | age_range | avg_spend | hh_count |
|---|---|---|---|---|---|---|
| N | Married | 100-150K | 2 | 55-64 | 71.42273 | 11 |
| Y | Single | 75-99K | 1 | 55-64 | 60.79696 | 23 |
| Y | Married | UNDER 35K | 2 | 45-54 | 58.38462 | 13 |
| N | Single | 35-49K | 1 | 45-54 | 56.52300 | 10 |
| Y | Married | 50-74K | 4 | 35-44 | 55.66357 | 14 |
| Y | Married | UNDER 35K | 5+ | 45-54 | 55.44583 | 12 |
| Y | Married | 50-74K | 4 | 45-54 | 53.47000 | 19 |
| Y | Single | UNDER 35K | 1 | 55-64 | 50.58133 | 15 |
| Y | Married | 50-74K | 3 | 55-64 | 49.29250 | 44 |
| N | Single | 50-74K | 1 | 45-54 | 49.04900 | 10 |
| Y | Single | 75-99K | 1 | 25-34 | 47.92765 | 17 |
| Y | Married | 100-150K | 4 | 55-64 | 47.54600 | 10 |
| Y | Married | UNDER 35K | 3 | 75+ | 46.95750 | 20 |
| Y | Married | 150K+ | 5+ | 45-54 | 46.77933 | 30 |
| Y | Single | 50-74K | 1 | 45-54 | 45.31155 | 58 |
| Y | Married | 100-150K | 2 | 65-74 | 44.63308 | 13 |
| Y | Single | 50-74K | 1 | 65-74 | 44.17900 | 10 |
| Y | Married | UNDER 35K | 2 | 55-64 | 43.92407 | 27 |
| Y | Single | 100-150K | 1 | 45-54 | 43.80886 | 35 |
| Y | Single | 75-99K | 1 | 35-44 | 43.50957 | 23 |
Observation:
Next, we also want to look at the fastest-growing customer demographics of customers to identify potential future customers.
YoY change in Average Spends by Customer Demographics
hshd_sales_16 <- transactions %>%
filter(year == 2016) %>%
group_by(hshd_num) %>%
summarize(sales_16 = sum(spend),hh_16 = n())
hshd_sales_17 <- transactions %>%
filter(year == 2017) %>%
group_by(hshd_num) %>%
summarize(sales_17 = sum(spend),hh_17 = n())
merged_transactions_16_17 <- merge(hshd_sales_16, hshd_sales_17, by = "hshd_num")
yearly_demographics <- merge(merged_transactions_16_17, households, by = "hshd_num")
Observation:
Challenges:
On merging the data for customers who have repeatedly shopped with the customer demographics table; it has returned zero information indicating that households table have no information regarding these customers.
This could be a major setback for the company, as not having information or demographics related data about regular customers can led loss of sales or not understanding their needs.
Retaining existing customers is equally important as acquiring new ones for any large scale business like Kroger. Which is why we want to understand customer retention rates.
length(unique(merged_transactions_16_17$hshd_num))
## [1] 3520
length(unique(hshd_sales_16$hshd_num))
## [1] 516720
Takeaway:
Year Over Year: Out of 2016 shoppers how many shopped again in 2017
Month over Month:
Retain rate: How many customer shop in consecutive months?
Reactivation rate: How many customers shop again after having shopped from Kroger in past?
hshd_month <- transactions %>%
filter(year == 2016) %>%
mutate( Month = month(date)) %>%
select(hshd_num, Month) %>%
unique %>%
mutate(next_month = Month + 1)
| hshd_num | Month | next_month |
|---|---|---|
| 34 | 1 | 2 |
| 60 | 1 | 2 |
| 168 | 1 | 2 |
| 199 | 1 | 2 |
| 252 | 1 | 2 |
| 355 | 1 | 2 |
| 366 | 1 | 2 |
| 379 | 1 | 2 |
| 413 | 1 | 2 |
| 462 | 1 | 2 |
hshd_month %>%
group_by(hshd_num) %>%
summarise(cnt = n()) %>% arrange(desc(cnt))
Observation:
Challenges:
Identifying M-o-M, New, Retained and Reactivated customers for Kroger could help them understand their customer in-depth. It could also help in assessing on what sort of customers, they need to focus on during their marketing campaign. Retention rate of customers also is very important metric, which due to randomness or missing information cannot be evaluated here.
After analyzing sales and demographics traits, we want to sales by product types.
region_wise_sale <- data_joined %>%
group_by(year,month,store_region) %>%
summarise(gmv = sum(spend_clean,na.rm=T),
sales = sum(unit_clean,na.rm=T)) %>%
arrange(year,month,store_region)
Observations:
East region stores have been the best performing in terms of sales volume in comparison to the other stores.
While there was Month-Over-Month growth in sales volume during 2016 for West and Central regions, sales have been stagnant for West and declining for Central in 2017.
Next, we want to observe sales by product type. One important product categorization is Private Vs National brands. Selling private brands promotes brand loyalty towards Kroger while also providing higher profit margins than national brands.
Private labels are items manufactured and market under the Kroger brand, which have higher margin in comparison to National label.
brand_type_wise_sale <- data_joined %>%
group_by(year,month,brand_type) %>%
summarise(gmv = sum(spend_clean,na.rm=T),
sales = sum(unit_clean,na.rm=T)) %>%
arrange(year,month,brand_type)
Observations:
Also,
Food Department contributes the maximum towards the total sales. Usually these are fast moving items, which calls the need to predict the demand before hand to plan the inventory well in advance. There hasn’t been in major change between year 2016 and 2017.
During 2016, percentage contribution from inorganic products were increasing Month-Over-Month, contributing on an average 96% of total sales, while Year 2017 saw an inclination towards Natural/Organic product with their contribution increasing from 5% to 6%. This is an early indicator that customers in future are more likely to shift towards natural products.
The objective of this analysis is to use Kroger data to find patterns and behaviors indicating health of the business and potential opportunities. Below is a summary of key findings.
Overall performance by year:
Best performing months:
Sales are highest in December presumably due to holiday season.
The high sales is driven primarily by existing customers buying a high volume of products.
Best performing customer demographics:
Private brands sales performance:
Selling private brands promotes brand loyalty towards Kroger while also providing higher profit margins than National brands. We found:
Best performing regions:
East region stores have been the best performing in terms of sales volume.
While there was Month-Over-Month growth in sales volume during 2016 for West and Central regions, sales has been stagnant for West and declining for Central in 2017.
Looking at the data, there seems to be huge gap in how the data is collated for every household. The company to boosts sales and enchance customer experience and loyalty should be putting into use the customer data already available with them but the process of collecting customer information needs to be streamlined.
Understanding on a monthly level, Retained vs Reactivated vs New customers coming to the stores. Further, understanding of these three groups could help marketing campaigns in an effective manner.
Given, the iformation on customer, KNN Clustering could be performed to identify groups of customers with similar shopping patterns and hence sales and inventories could be planned well ahead of an event.
A predictive model could be built to assit in understanding the monthly sales for any region. This information could be help in evaluating the prospectives of opening a new store in the region given the demographics of that region is known in advance.