Our research aims to help a large retailer better understand its customers. The goal of the study is to gain knowledge from the available database, which includes information on customers, transactions, and other topics. As part of our methodology, we intend to organize, examine, and analyze the dataset we obtained in order to finally respond to the following important questions.
What household demographic factors influence the most store spending and Learn about the most influential factors influencing customer spending and gain insights into what should be done to increase sales volume
Our methodology can be divided into three steps. -
Organizing and preparing data: In R, organize the tables/datasets, and join tables as needed for the analysis (more in detail in the section on data preparation).
Data Exploration: Investigate the nature and source of data, and as needed visualize variable distributions and correlations.
Data Analysis: We propose a co-relational analysis to identify the demographic factors that have the greatest influence on consumer spending. This will be followed by a technique for delving deeper into the obtained observations in order to create a customer profile based on their engagement levels with various departments within the store.
‘Exploratory Data Analysis’ is the approach we took to address the business objective. Using summary statistics and graphical representations, we conduct data investigations to discover patterns, detect anomalies and validate assumptions. Our methodology can be divided into three steps.
• Customer segmentation based on demographic information - To better understand how people use coupons today, we compiled data on age and income levels and divided our sales into those that qualified for coupons and those that did not. We computed the average and total values for both sets of data.
• Product-category grouping - Because different product categories at Regork are managed by different managers, we’re looking at how sales penetration varies for different product categories with and without coupon usage.
• Organizing stores according to average coupon discounts - To make the most of our coupons, we’d like to know which stores have the highest average coupon discounted sales.
Based on the results of the aforementioned analysis, we can choose which client segment to target in order to increase overall sales through the use of coupons based on factors such as age, income range, product purchasing patterns, geography, and so on.
The project requires the following packages.
completejourney - Provides the data set for analysis.
ggplot2 - Provides us with the different graphs that we can use to plot our results.
plotly - Used to make our graphs more interactive.
dplyr - For flexibility in data manipulation.
tidyr - To clean up messy data.
lubridate - To easily format timestamps to date for transaction data.
scales - To map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.
library(completejourney)
library(ggplot2)
library(plotly)
library(dplyr)
library(tidyr)
library(lubridate)
library(scales)
The original data set is contained in the package called “completejourney.” The dataset focuses on transaction level data from 2,469 households that frequented the client’s stores over a period of one year. Furthermore, the data captures purchases made across all product categories available at the retailer’s stores. We have customer demographics data and direct marketing campaign data available for certain households, among other things.
This package includes eight built-in data sets. We used three tables from this package. They are products, demographics, and transactions.
Firstly, we are removing all the data point which have missing values from the demographics data frame
na_marital_status<- which(!is.na(demographics$marital_status))
demographics<- demographics[na_marital_status,]
na_home_ownership <- which(!is.na(demographics$home_ownership))
demographics <- demographics[na_home_ownership,]
Firstly, We created a new data frame called ‘demography_df’ that contains total sales across various categories such as Kids, Household Income, and Age Groups.
demography_df <- get_transactions() %>%
inner_join(demographics,by = "household_id") %>%
mutate(kids_flag = if_else(kids_count == '0', 'No Kids', 'With Kids')) %>%
group_by(age,income,kids_flag) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales))
Then we looked at the top ten departments in terms of sales and created a dataframe with sales and department information.
# Looking at Coupons usage across Top 10 Departments ----------------------
# Creating Top 10 Departments
top_10_departments <- get_transactions() %>%
inner_join(products, by = "product_id") %>%
group_by(department) %>%
summarize(dep_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(dep_sales)) %>%
#select(department) %>%
head(10)
Furthermore, we wanted to compare the sales value of coupons used by households across departments.
coupons_df <- get_transactions() %>%
inner_join(demographics) %>%
inner_join(products, by = 'product_id') %>%
filter(department %in% top_10_departments$department & household_size %in% c('4','5+')) %>%
mutate(tot_discount = coupon_disc + coupon_match_disc) %>%
group_by(basket_id) %>%
mutate(tot_disc_basket = sum(tot_discount)) %>%
ungroup() %>%
mutate(coupon_flag = if_else(tot_disc_basket > 0, 1,0))
To begin, we are considering co-relational analysis between total sales and households with kids category to determine whether total sales are in any way related to the objective parameters of people with and without children, as well as whether spending has anything to do with income range brackets available in the given data set.
theme_set(theme_classic())
ggplot(demography_df, aes(total_sales, age,fill=age)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Age Groups") +
xlab("Total Sales") +
scale_x_continuous(labels = scales::dollar_format()) +
ggtitle("How Sales Gets Affected w.r.t Age and Kids?") +
facet_wrap(~ kids_flag) +
labs(title = "How Sales Gets Affected w.r.t Age and Kids?",
subtitle = "This graph looks at total sales across households(All age range) with and without kids",
caption = "Source:CompleteJourney")+
theme(panel.background = element_rect(fill = "grey"))+
geom_smooth()
Based on the bar graphs above, we can see that people of all ages who do not have children spend significantly more money in our grocery stores than people who do.
From the first visualization we looked at people from different age brackets (with or without kids) and how they contribute to sales volume. Let’s dig a little deeper
get_transactions() %>%
inner_join(demographics,by = "household_id") %>%
inner_join(products, by = "product_id") %>%
group_by(household_size,income) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales)) %>%
ggplot(aes(income, total_sales, color = household_size)) +
geom_point(aes(color = household_size)) +
geom_line(aes(group = household_size)) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Income", y = "Sales Revenue", color = "Household Size",
title = "How Sales Gets Affected w.r.t Income and Household Size?",
subtitle = "This graph looks at total sales per income and hosehold size",
caption = "Source:CompleteJourney") +
theme(axis.text.x = element_text(size = 8, angle = 45))
To identify a trend in the spending of four and five-person households that are contributing less than expected. To see where these households’ spending is concentrated, we must first identify the top ten departments that account for the majority of total sales.
ggplot(top_10_departments, aes(department, dep_sales, fill = department)) +
geom_bar(stat = 'identity') +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Department", y = "Sales") +
theme(axis.text.x = element_text(size = 8, angle = 45)) +
ggtitle("Sales for Top 10 Performing Departments")
As can be seen, the departments mentioned above in the graph are the major contributors to driving sales volume. We can now filter these departments for household sizes 4 and 5 to see if there is a pattern to notice.
The anomalies in spending by these household sizes prompted us to consider how their spending is distributed across the overall Top 10 departments in order to identify any anomalies.
get_transactions() %>%
inner_join(demographics,by = "household_id") %>%
filter(household_size == '4' | household_size == '5+') %>%
inner_join(products, by = "product_id") %>%
filter(department %in% top_10_departments$department) %>%
select(department, household_size, sales_value) %>%
group_by(department, household_size) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE),
) %>%
arrange(desc(total_sales)) %>%
ggplot(aes(total_sales, department))+
geom_point(aes(color = department))+
geom_segment(aes(x = 0, y = department, xend = total_sales, yend = department), size = 0.5) +
labs(x = "Sales Revenue", y = "Department") +
ggtitle("Revenue Sales for Top 10 performing departments for household sizes 4 plus") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~household_size,ncol = 3)
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
The lion’s share of their spending is in the grocery department, indicating that they are hesitant to spend in other departments due to the size of their household. This necessitates a unique perspective on the campaign and its analysis.
Households with four or more members spend more on groceries than on other categories. Coupons and campaigns should be directed towards other departments and products where they can purchase them with the embedded discount to attract them to other departments.
As we saw that the majority of these households’ contributions are in the grocery department, we were reminded of the campaign that distributes coupons to them and their use of coupons.
coupon_graph_df <- coupons_df %>%
group_by(department, household_size, basket_id, coupon_flag) %>%
summarise(Total_Sales = sum(sales_value)) %>%
group_by(household_size, department) %>%
summarise(sales_w_coupon = sum(if_else(coupon_flag == 1, Total_Sales, 0)),
sales_wo_coupon = sum(if_else(coupon_flag == 1, 0,Total_Sales)))
plot_coupon <- coupon_graph_df %>%
select(department, household_size, sales_w_coupon, sales_wo_coupon) %>%
gather("Criterion", "Total_Sales", -c(department,household_size)) %>%
mutate(x_axis = paste(household_size))
ggplot(plot_coupon, aes(Total_Sales, department, fill = Criterion)) +
geom_bar(stat = "identity") +
labs(title = " Total Sales with and without coupons across Top 10 Departments and Household Size",
x = "Total Spend",
y = "Departments",
color = "Total with and without coupons per Household Size") +
theme(plot.title = element_text(hjust = 0.5)) + scale_x_continuous(labels = scales::dollar_format()) +
facet_wrap(. ~ household_size) +
scale_fill_brewer(name = "Criterion", labels = c("Total Sales With Coupon", "Total Sales Without Coupon"), palette = "Set1")
From a product department standpoint, we can see that there is a good chunk of product departments that perform better without coupons being provided for them in these filtered household sizes. As a result, coupon marketing for these categories can be reduced overall or optimized in such a way that marketing costs are reduced.
We see a significant difference in average sales with and without coupons at the store level. In contrast to the overall trend, households with four or more members spend more without coupons than with coupons.
Based on the bar graphs above, we can see that people of all ages who do not have children spend significantly more in our grocery stores than people who do. People who have children are concerned about their spending. They are on a tight budget. Run special family-focused campaigns and offer discounts on products with a high profit margin, giving the family the impression that they are saving money, causing them to buy more products and spend more money overall.
The children accompany them to the store and do not allow them to shop alone. Create a play area for children in the store where parents can drop off their children before beginning to shop.
We can see that as family income rises, their spending decreases at the grocery store. That means the store is unable to meet the premium needs of these higher-income customers. The store’s inventory should be reworked to meet the needs of these customers.
Households with sizes of 4 and 5+ spend more on groceries than on other departments. Coupons and campaigns should be directed towards other departments and products where they can purchase them with the embedded discount to attract them to other departments.
The primary limitation of this dataset is the limited time period for which it is valid, namely 2017,
There are only a few data points in the demographics table. The demographics table is only mapped to 32% of the transaction data set.
There is no detailed information on the current promotions, such as the percentage of discount offered, etc.