Analysis of year long customer shopping data reveals business opportunities, growth factors and growth inhibitors. Customer shopping behavior is inferred through exploratory data analysis over their demographics, transactions, coupons, income range and other factors. To better spend Regork’s resources, Analysis is granulated to a brief level on :
Analysed the demographics based on quantity and order value to briefly understand the most loyal customer base,this is to understand where to focus ‘Regork’s’ customer behavior based promotions should be and to recognize parameters that increase loyalty as they contribute the most to revenue.
Identified the relationship between retail discount and products that go together based on the best performing departments;performed predictive analytics on one such pair of products to arrive at the correlation.
Analysed time constrained sales on a quarterly basis to determine which departments generate the least amount of revenue. Further, A deeper look at the customer transactions combined with coupons establishes a peculiar customer shopping behavior.
Based on the cohort analysis we’ve identified the most loyal bunch of customers, they are in the income range of $50-75k in the age range of 45-54. Once we have identified the demographics, we can conduct various experiment by dividing these groups into stratified sample ,to reduce the retail discount offered to this group, hence improving the overall revenue for the company.It is imperative to retain these customers and further work in the aspects of constant improvement in customer experience, quality offered and ease of shopping to increase brand value and Word of mouth marketing.These were identified based on various parameters like no. of visits and various calculated metrics.Later on,new products and promotions launched can be targeted to this cohort to understand purchase patterns which can be applied to other cohorts to amplify growth opportunities.
In produce we can eliminate the retail discount offer. As the quantity sold of produce is constant throughout the year. If not eliminate we reduce the retail discount specially during the months of April, May, June, July.
Least performing department’s sales are majorly dependent on coupon based transactions. This indicates a serious growth internally in these departments in all the current working aspects such as quality, experience and product placement, as these sales are already very low and are further dominated only with coupons. Externally, increasing such coupons and tapping into potential untouched market by increasing marketing, combining with best sellers and further discounting will increase the chances of growth.
Packages used in our data analysis :
Complete Journey - Main data set containing all the necessary data.
Tidyverse - Opinionated collection of R packages designed for data science
Lubridate - for easy time and date manipulation
ggplot - used for visualizing data
gridExtra - provides useful extensions to the grid system
Patchwork - package designed to make plot composition simple
visdat - provides visualizations for the data frame
Stats - for statistical calculations and random number generation
hrbrthemes - provides typography centric themes
treemapify - for plotting tree maps
library(tidyverse)
library(lubridate)
library(ggplot2)
library(completejourney)
library(visdat)
library(gridExtra)
library(patchwork)
library(hrbrthemes)
library(stats)
library(treemapify)
Bottom_categories <- get_transactions() %>%# Storing our transaction data in a variable
inner_join(products) %>% # joining it by products to departmentalize our transactions
filter(department != 'COUPON') %>% # filtering our data to include only valid departments
group_by(department) %>% summarise(Total_sale_value = sum(sales_value, na.rm = TRUE)) %>% # Calculating the total sales value to understand revenue generated
arrange(desc(Total_sale_value)) %>% tail(10)#Arranging them to obtain the least revenue generating departments
Each demographic has representation that varies in number, calculating the total sales to compare each demographics would not work in this case. So average sales for demographics was looked at in our Analysis.
And we’ve calculated factors for demographics, which will be used to calculate the respective Average sales.
Categorizing data as based on income ranges
This has been done to suffice our wide range of data into categorizable customer bases for feasibility in terms of understanding customer purchase patterns :
get_transactions() %>%
inner_join(demographics) %>% filter(quantity < 1000) %>%
select(household_id,basket_id,quantity,sales_value,household_size,income) %>%
mutate(income_char = trimws(as.character(income))) %>%
mutate(income_range = case_when( income_char %in% c("Under 15K","25-34K","35-49K") ~"Low",
( income_char %in% c("50-74K","75k-99K") ~ "Middle"),
( income_char %in% c("100-124K","125k-149K") ~ "Upper-Middle"),
( income_char %in% c("150k-174K","175-199K","200-249K","250K+")~ "High"))) %>%
group_by(household_size,income_range) %>%
summarise(hsld_count = n_distinct(household_id),
distinct_basket = n_distinct(basket_id),
total_sales = sum(sales_value)) %>% filter(!is.na(income_range)) %>%
mutate(sales_per_basket = round(total_sales/distinct_basket),
income_range = factor(income_range, levels = c("Low", "Middle", "Upper-Middle", "High")))
## # A tibble: 19 × 6
## # Groups: household_size [5]
## household_size income_range hsld_count distinct_basket total_sales sales_pe…¹
## <ord> <fct> <int> <int> <dbl> <dbl>
## 1 1 High 6 851 31606. 37
## 2 1 Low 105 10181 253168. 25
## 3 1 Middle 74 6923 211358. 31
## 4 1 Upper-Middle 10 1159 28677. 25
## 5 2 High 12 1072 53176. 50
## 6 2 Low 123 12060 331982. 28
## 7 2 Middle 73 6054 222742. 37
## 8 2 Upper-Middle 11 698 27116. 39
## 9 3 High 6 496 26659. 54
## 10 3 Low 45 5019 136255. 27
## 11 3 Middle 19 1400 52175. 37
## 12 3 Upper-Middle 5 510 15174. 30
## 13 4 Low 15 1317 51671. 39
## 14 4 Middle 13 1731 49858. 29
## 15 4 Upper-Middle 5 306 12230. 40
## 16 5+ High 3 272 14405. 53
## 17 5+ Low 22 1970 55502. 28
## 18 5+ Middle 13 834 40934. 49
## 19 5+ Upper-Middle 3 421 17291. 41
## # … with abbreviated variable name ¹sales_per_basket
At first we looked at different cohorts of data by combining the demographic data with transactions. To understand where our majority of the population lies. So, we plotted the tree maps for “income” and “age” to know more about our customers.
The next analysis we did was by defining our own custom metrics i.e Average Quantity Per Basket and Average Revenue Per Basket, which is plotted across household sizes. In this we’ve also grouped our income range.
The next analysis we did was to identify the customer loyalty and for that we came up with a custom metric called “Visit Number” essentially meaning how many times a customer has visited the website and placed an order. This was bucketised into different groups and from the graph we see the most number of customer belong to “50+” cohort a. For us to further drill down the customer graphs, we included income and age range in those to see the exact cohort of our loyal customer
Then we did the coupon analysis for the least performing departments to see the coupons usage and check if there’s any improvement that can be done for that and we draw insights from that
Another analysis we did was a seasonality analysis for a few of our important departments and we arrived at a conclusion regarding the retail discounts offered to them.
The primary purpose of this analysis is to understand which customer base is loyal to ‘Regork’. To identify that demographic,we are initially filtering data to plot the households by their income range.
So, the first thing we did was the plot the graphs for “Income Range” then we see where the majority of the household’s belong, based on the graphs given below we see that the mass household belong to 50-74k range, followed by 35-49k bracket.
We concur from the above graph where does our customer base lie based on the household_id.
To dive deeper into our initial insight , the transaction data is then analysed by the age groups that are a part of our dataset.This will help us in our comparitive analysis further in our data exploration. Again, the agenda here is to check the distribution based on age group, we find majority of our households belong to the age group of 45-54 followed up by 35-44.
We concur from the above graph where does our customer base lie based on the household_id.
The above two graphs plotted for income range and age gives us an idea about where our majority of population lies. Looking at our graph it is apparent that income ranges of “50-74K” and “75K-99K” are the income ranges that generate more revenue for ‘Regork’ - as in ‘Middle’ income range is the most loyal customer base.
Follow up by that we dissect the data by using a calculated metric of Average number of Quantity ordered by Household size and Average Revenue per order for household size, the hypothesis was the lower level income houses won’t order more items into their basket apart from the one which is absolutely necessary and based on the graphs we plotted below we can see that. Hence, concluding the hypothesis was true
Exploring the data further we are establishing what the average order value of different customer bases is considering different income ranges as the primary parameter.
## Demographic Analysis based on Customer Loyalty
# Avg order value -3
get_transactions() %>%
inner_join(demographics) %>% filter(quantity < 1000) %>%
select(household_id,basket_id,quantity,sales_value,household_size,income) %>%
mutate(income_char = trimws(as.character(income))) %>%
mutate(income_range = case_when( income_char %in% c("Under 15K","25-34K","35-49K") ~"Low",
( income_char %in% c("50-74K","75k-99K") ~ "Middle"),
( income_char %in% c("100-124K","125k-149K") ~ "Upper-Middle"),
( income_char %in% c("150k-174K","175-199K","200-249K","250K+")~ "High"))) %>%
group_by(household_size,income_range) %>%
summarise(hsld_count = n_distinct(household_id),
distinct_basket = n_distinct(basket_id),
total_sales = sum(sales_value)) %>% filter(!is.na(income_range)) %>%
mutate(sales_per_basket = round(total_sales/distinct_basket),
income_range = factor(income_range, levels = c("Low", "Middle", "Upper-Middle", "High"))) %>%
ggplot(aes(x = household_size,y = sales_per_basket, fill = household_size)) +
geom_point(size = 5,color = "grey",alpha = 1.5,shape = 21,fill = "red",stroke = 2) +
geom_segment(aes(x = household_size, xend = household_size, y = 0, yend = sales_per_basket),
color = "blue",size = 0.75,linetype = "solid") +
facet_grid(~income_range) +
ggtitle("Average Revenue in a Single Order by Household Size") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "# of People in House",y = "Avg Revenue per Basket") +
#theme(legend.position="bottom")+
theme(panel.background = element_rect(fill = "white"))
The hypothesis was for the lower income range customers usually don’t spend more on unnecessary items and based on the graphs we see that this is followed the lower income household the avg revenue per basket is in the range of 20-30 and for the higher income household is higher than 40, indicating that we can target these household for the cross sell of the other items or items which are not frequently sold cause they’ve the purchasing power
Exploring the data further we are establishing what the average quantity per basket of different customer bases is considering their income ranges available .
The hypothesis was that for the lower income household we though people will order only what’s necessary for them and as the household size increases the avg items per basket will also increase but we only see that happening for higher income households not for the lower income households
Inference: On combining the above two graphs we can say that for the lower income household who are purchasing from us, we cannot cross-sell any items and they not only buy items which are bare minimum but also the items which are cheaper else their average spend would be higher, we see that’s not happening.
Diving deeper into our established insights from our data analysis: The next analysis we went on was to identify the most loyal customer and to which cohorts do they belong. We calculated that by identifying the “Visit Number”. This means that customers who frequently visit us are considered loyal and any customer whose visit number is more than 50 essentially translates to the fact that these customers are ordering once every week from us on an average and they can be further experimented with.
So based on the graph plotted we see that most of the customers belong to the category of 50+ visits and our next step of the analysis is to find out the demographic information for the same.
From the above graph we see the age and income range of our customers. These are the bunch of our customers who are the most loyal and we can experiment with them by dividing them into various test and control groups. A few of the experiments are as follows a. Reduce the discount/coupon usage on this and see the impact b. If we ever launch a new category of product line we can experiment the sales of it to this group by giving them additional discounts on these newly implemented categories of products c. A brave option would be to eliminate the discount for them in the categories across which they order on a regular basis
Analyzing the data to determine the least performing departments - departments that generate the least amount of revenue, we then infer the percentage of transactions in this data that utilise the coupons that ‘Regork’ provides through varies campaigns.
The inference obtained states that bout 85 % of the transactions that have been made in the least performing departments are through coupons based on which we can understand the customer purchase trends .
## # A tibble: 1,025 × 5
## household_id product_id department coupon_upc redemption_date
## <chr> <chr> <chr> <chr> <date>
## 1 1 5564790 RESTAURANT 10000085364 2017-05-17
## 2 1 5564790 RESTAURANT 51700010076 2017-05-17
## 3 1 5564790 RESTAURANT 54200000033 2017-05-23
## 4 1 5564790 RESTAURANT 10000085476 2017-11-09
## 5 1 5564790 RESTAURANT 54200029176 2017-11-09
## 6 1 5564790 RESTAURANT 10000085364 2017-05-17
## 7 1 5564790 RESTAURANT 51700010076 2017-05-17
## 8 1 5564790 RESTAURANT 54200000033 2017-05-23
## 9 1 5564790 RESTAURANT 10000085476 2017-11-09
## 10 1 5564790 RESTAURANT 54200029176 2017-11-09
## # … with 1,015 more rows
## # A tibble: 874 × 6
## household_id product_id department coupon_upc redemption_date coupon_disc
## <chr> <chr> <chr> <chr> <date> <dbl>
## 1 1 5564790 RESTAURANT 10000085364 2017-05-17 0
## 2 1 5564790 RESTAURANT 51700010076 2017-05-17 0
## 3 1 5564790 RESTAURANT 54200000033 2017-05-23 0
## 4 1 5564790 RESTAURANT 10000085476 2017-11-09 0
## 5 1 5564790 RESTAURANT 54200029176 2017-11-09 0
## 6 1 5564790 RESTAURANT 10000085364 2017-05-17 0
## 7 1 5564790 RESTAURANT 51700010076 2017-05-17 0
## 8 1 5564790 RESTAURANT 54200000033 2017-05-23 0
## 9 1 5564790 RESTAURANT 10000085476 2017-11-09 0
## 10 1 5564790 RESTAURANT 54200029176 2017-11-09 0
## # … with 864 more rows
Our hypothesis was to look into seasonality trends of the various major categories in the entire catalogue and see if they follow any trend and for that our usual suspect is “produce”, this is because most of the items are perishable and these are seasonal. So, they tend to sell more in summer and not so much in winter.
So, for that we are using a custom metric avg quantity per order, this is only for two categories “Produce” and “Seafood”. I.e the number of items of produce in a single order. Alongside the trend of this at month level we also plotted the “Retail_discount” trend for produce and seafood. (Plotted in dotted Lines)
Produce and seafood with retail discount
get_transactions() %>%
inner_join(products) %>%
mutate(month1=month(transaction_timestamp,label = T))%>%
group_by(department,month1)%>%
summarise(quantity_total=sum(quantity),
average_basket=n_distinct(basket_id),
prop=quantity_total/average_basket,
retail_disc1=sum(retail_disc),
retail_disc1=retail_disc1/1000) %>%
filter(trimws(department) %in% c("PRODUCE","SEAFOOD")) %>%
ggplot(aes(x=month1, y=prop, group=department)) +
geom_line(aes(color=department))+
geom_line(aes(y=retail_disc1, color=department), linetype = "dashed")+
scale_y_continuous(
name = "Average quatity per Basket of produce and Seafood",
sec.axis = sec_axis( trans=~.*1, name="Retail_disc(scaled by a factor of 1000")
) +
ylab("Year 2017")+
labs(
title = ' Produce along the month(solid line) & retail discount(dashed line) ',
subtitle = ' Retail discount has little to no imapct on how produce is sold',
caption = 'Completejourney data')
## Joining, by = "product_id"
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
theme(legend.position="bottom")+
theme(panel.background = element_rect(fill = "white"))
## List of 2
## $ legend.position : chr "bottom"
## $ panel.background:List of 5
## ..$ fill : chr "white"
## ..$ colour : NULL
## ..$ size : NULL
## ..$ linetype : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_rect" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
*** To see the forecast of impact of retail discount on prdouce. We used linear regression and scaled the retail discount value by 1000 ***
Linear regression
df<-get_transactions()%>%
inner_join(products) %>%
mutate(month1=month(transaction_timestamp,label = T))%>%
group_by(department,month1)%>%
summarise(quantity_total=sum(quantity),
average_basket=n_distinct(basket_id),
prop=quantity_total/average_basket,
retail_disc=sum(retail_disc),
retail_disc_mod=retail_disc/1000) %>%
filter(trimws(department) %in% c("PRODUCE")) %>%
ungroup() %>%
select(prop,retail_disc_mod)
## Joining, by = "product_id"
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
x <- lm(df$prop~df$retail_disc_mod,df)
summary(x)
##
## Call:
## lm(formula = df$prop ~ df$retail_disc_mod, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19512 -0.15499 0.02548 0.12565 0.19253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.98535 0.14988 19.92 2.23e-09 ***
## df$retail_disc_mod 0.18859 0.04091 4.61 0.000965 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1542 on 10 degrees of freedom
## Multiple R-squared: 0.68, Adjusted R-squared: 0.648
## F-statistic: 21.25 on 1 and 10 DF, p-value: 0.0009652
Based on the above graphs we see that the when the retail discount was higher for produce during the months of May,June, July and the avg quantity per basket size follows that pattern and during the winter even thought the retail discount is lower we still see that the avg qty per basket is increasing indicating that even if we remove the retail discount it might work. And the trend of seafood follows the retail discount pattern.
The Primary Goal of this analysis is to establish a framework that is diverse to be able to look at internal and external factors driving and inhibiting growth.This analysis focuses on : Who are the Regork customers? Customer Retention(Retain customers that are driving growth) Our goal was to identity the demographics of loyal customers. Based on the graph below, we can see that the income ranges of $35-49k and $50-75k are loyal bunches, and in those age groups, 45-54
For the seasonality trend of produce and seafood. We saw that the average quantity is only 68% related with the produce. This implies that retail discount have little impact on the quantity of produce sold throughout the year
How inferences were addressed : Data set has been granulated to a brief level over transactions, products, demographics and other tables by joining these. Visual graphs have been plotted to find patterns,trends and actionable insights. A detailed approach has been addressed in the Approach section as well as in the flow of the report.
Interesting insights of the summary :