.

Introduction

Analysis of year long customer shopping data reveals business opportunities, growth factors and growth inhibitors. Customer shopping behavior is inferred through exploratory data analysis over their demographics, transactions, coupons, income range and other factors. To better spend Regork’s resources, Analysis is granulated to a brief level on :

  1. Who/What is driving the growth ?
  2. Who/What is inhibiting the growth?
  3. Where is potential for business opportunities?

Approach

  1. Analysed the demographics based on quantity and order value to briefly understand the most loyal customer base,this is to understand where to focus ‘Regork’s’ customer behavior based promotions should be and to recognize parameters that increase loyalty as they contribute the most to revenue.

  2. Identified the relationship between retail discount and products that go together based on the best performing departments;performed predictive analytics on one such pair of products to arrive at the correlation.

  3. Analysed time constrained sales on a quarterly basis to determine which departments generate the least amount of revenue. Further, A deeper look at the customer transactions combined with coupons establishes a peculiar customer shopping behavior.

Proposed Solution

  1. Based on the cohort analysis we’ve identified the most loyal bunch of customers, they are in the income range of $50-75k in the age range of 45-54. Once we have identified the demographics, we can conduct various experiment by dividing these groups into stratified sample ,to reduce the retail discount offered to this group, hence improving the overall revenue for the company.It is imperative to retain these customers and further work in the aspects of constant improvement in customer experience, quality offered and ease of shopping to increase brand value and Word of mouth marketing.These were identified based on various parameters like no. of visits and various calculated metrics.Later on,new products and promotions launched can be targeted to this cohort to understand purchase patterns which can be applied to other cohorts to amplify growth opportunities.

  2. In produce we can eliminate the retail discount offer. As the quantity sold of produce is constant throughout the year. If not eliminate we reduce the retail discount specially during the months of April, May, June, July.

  3. Least performing department’s sales are majorly dependent on coupon based transactions. This indicates a serious growth internally in these departments in all the current working aspects such as quality, experience and product placement, as these sales are already very low and are further dominated only with coupons. Externally, increasing such coupons and tapping into potential untouched market by increasing marketing, combining with best sellers and further discounting will increase the chances of growth.

Packages Required

Packages used in our data analysis :

  1. Complete Journey - Main data set containing all the necessary data.

  2. Tidyverse - Opinionated collection of R packages designed for data science

  3. Lubridate - for easy time and date manipulation

  4. ggplot - used for visualizing data

  5. gridExtra - provides useful extensions to the grid system

  6. Patchwork - package designed to make plot composition simple

  7. visdat - provides visualizations for the data frame

  8. Stats - for statistical calculations and random number generation

  9. hrbrthemes - provides typography centric themes

  10. treemapify - for plotting tree maps

library(tidyverse)
library(lubridate)
library(ggplot2)
library(completejourney)
library(visdat)
library(gridExtra)
library(patchwork)
library(hrbrthemes)
library(stats)
library(treemapify)

Data Preparation

Clean Data

Determining the least performing departments

Bottom_categories <- get_transactions() %>%# Storing our transaction data in a variable 
  inner_join(products) %>% # joining it by products to departmentalize our transactions
  filter(department != 'COUPON') %>% # filtering our data to include only valid departments
  group_by(department) %>% summarise(Total_sale_value = sum(sales_value, na.rm = TRUE)) %>% # Calculating the total sales value to understand revenue generated
  arrange(desc(Total_sale_value)) %>% tail(10)#Arranging them to obtain the least revenue generating departments

Determining the best performing departments

Each demographic has representation that varies in number, calculating the total sales to compare each demographics would not work in this case. So average sales for demographics was looked at in our Analysis.

And we’ve calculated factors for demographics, which will be used to calculate the respective Average sales.

Categorizing data as based on income ranges

  1. Low income range category :“Under 15K”,“25-34K”,“35-49K” income ranges
  2. Middle income range category :“50-74K”,“75k-99K” income ranges
  3. Upper middle range category : “100-124K”,“125k-149K” income ranges
  4. High income range category : “150k-174K”,“175-199K”,“200-249K”,“250K+” income ranges

This has been done to suffice our wide range of data into categorizable customer bases for feasibility in terms of understanding customer purchase patterns :

get_transactions() %>% 
  inner_join(demographics) %>% filter(quantity < 1000) %>% 
  select(household_id,basket_id,quantity,sales_value,household_size,income) %>% 
  mutate(income_char = trimws(as.character(income))) %>% 
  mutate(income_range = case_when( income_char %in% c("Under 15K","25-34K","35-49K") ~"Low",
                                   ( income_char %in% c("50-74K","75k-99K") ~ "Middle"),
                                   ( income_char %in% c("100-124K","125k-149K") ~ "Upper-Middle"),
                                   ( income_char %in% c("150k-174K","175-199K","200-249K","250K+")~ "High"))) %>%  
  group_by(household_size,income_range) %>% 
  summarise(hsld_count = n_distinct(household_id),
            distinct_basket = n_distinct(basket_id),
            total_sales = sum(sales_value)) %>% filter(!is.na(income_range)) %>%  
  mutate(sales_per_basket = round(total_sales/distinct_basket), 
         income_range = factor(income_range, levels = c("Low", "Middle", "Upper-Middle", "High"))) 
## # A tibble: 19 × 6
## # Groups:   household_size [5]
##    household_size income_range hsld_count distinct_basket total_sales sales_pe…¹
##    <ord>          <fct>             <int>           <int>       <dbl>      <dbl>
##  1 1              High                  6             851      31606.         37
##  2 1              Low                 105           10181     253168.         25
##  3 1              Middle               74            6923     211358.         31
##  4 1              Upper-Middle         10            1159      28677.         25
##  5 2              High                 12            1072      53176.         50
##  6 2              Low                 123           12060     331982.         28
##  7 2              Middle               73            6054     222742.         37
##  8 2              Upper-Middle         11             698      27116.         39
##  9 3              High                  6             496      26659.         54
## 10 3              Low                  45            5019     136255.         27
## 11 3              Middle               19            1400      52175.         37
## 12 3              Upper-Middle          5             510      15174.         30
## 13 4              Low                  15            1317      51671.         39
## 14 4              Middle               13            1731      49858.         29
## 15 4              Upper-Middle          5             306      12230.         40
## 16 5+             High                  3             272      14405.         53
## 17 5+             Low                  22            1970      55502.         28
## 18 5+             Middle               13             834      40934.         49
## 19 5+             Upper-Middle          3             421      17291.         41
## # … with abbreviated variable name ¹​sales_per_basket

Exploratory Data Analysis

At first we looked at different cohorts of data by combining the demographic data with transactions. To understand where our majority of the population lies. So, we plotted the tree maps for “income” and “age” to know more about our customers.

The next analysis we did was by defining our own custom metrics i.e Average Quantity Per Basket and Average Revenue Per Basket, which is plotted across household sizes. In this we’ve also grouped our income range.

The next analysis we did was to identify the customer loyalty and for that we came up with a custom metric called “Visit Number” essentially meaning how many times a customer has visited the website and placed an order. This was bucketised into different groups and from the graph we see the most number of customer belong to “50+” cohort a. For us to further drill down the customer graphs, we included income and age range in those to see the exact cohort of our loyal customer

Then we did the coupon analysis for the least performing departments to see the coupons usage and check if there’s any improvement that can be done for that and we draw insights from that

Another analysis we did was a seasonality analysis for a few of our important departments and we arrived at a conclusion regarding the retail discounts offered to them.

Demographic Analysis based on Customer Loyalty

The primary purpose of this analysis is to understand which customer base is loyal to ‘Regork’. To identify that demographic,we are initially filtering data to plot the households by their income range.

So, the first thing we did was the plot the graphs for “Income Range” then we see where the majority of the household’s belong, based on the graphs given below we see that the mass household belong to 50-74k range, followed by 35-49k bracket.

We concur from the above graph where does our customer base lie based on the household_id.

To dive deeper into our initial insight , the transaction data is then analysed by the age groups that are a part of our dataset.This will help us in our comparitive analysis further in our data exploration. Again, the agenda here is to check the distribution based on age group, we find majority of our households belong to the age group of 45-54 followed up by 35-44.

We concur from the above graph where does our customer base lie based on the household_id.

The above two graphs plotted for income range and age gives us an idea about where our majority of population lies. Looking at our graph it is apparent that income ranges of “50-74K” and “75K-99K” are the income ranges that generate more revenue for ‘Regork’ - as in ‘Middle’ income range is the most loyal customer base.

Follow up by that we dissect the data by using a calculated metric of Average number of Quantity ordered by Household size and Average Revenue per order for household size, the hypothesis was the lower level income houses won’t order more items into their basket apart from the one which is absolutely necessary and based on the graphs we plotted below we can see that. Hence, concluding the hypothesis was true

Exploring the data further we are establishing what the average order value of different customer bases is considering different income ranges as the primary parameter.

## Demographic Analysis based on Customer Loyalty 
# Avg order value -3
get_transactions() %>% 
  inner_join(demographics) %>% filter(quantity < 1000) %>% 
  select(household_id,basket_id,quantity,sales_value,household_size,income) %>% 
  mutate(income_char = trimws(as.character(income))) %>% 
  mutate(income_range = case_when( income_char %in% c("Under 15K","25-34K","35-49K") ~"Low",
                                   ( income_char %in% c("50-74K","75k-99K") ~ "Middle"),
                                   ( income_char %in% c("100-124K","125k-149K") ~ "Upper-Middle"),
                                   ( income_char %in% c("150k-174K","175-199K","200-249K","250K+")~ "High"))) %>%  
  group_by(household_size,income_range) %>% 
  summarise(hsld_count = n_distinct(household_id),
            distinct_basket = n_distinct(basket_id),
            total_sales = sum(sales_value)) %>% filter(!is.na(income_range)) %>%  
  mutate(sales_per_basket = round(total_sales/distinct_basket), 
         income_range = factor(income_range, levels = c("Low", "Middle", "Upper-Middle", "High"))) %>% 
  ggplot(aes(x = household_size,y = sales_per_basket, fill = household_size)) +
  geom_point(size = 5,color = "grey",alpha = 1.5,shape = 21,fill = "red",stroke = 2) +
  geom_segment(aes(x = household_size, xend = household_size, y = 0, yend = sales_per_basket),
               color = "blue",size = 0.75,linetype = "solid") +
  facet_grid(~income_range) +
  ggtitle("Average Revenue in a Single Order by Household Size") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(x = "# of People in House",y = "Avg Revenue per Basket") +
  #theme(legend.position="bottom")+
  theme(panel.background = element_rect(fill = "white"))

The hypothesis was for the lower income range customers usually don’t spend more on unnecessary items and based on the graphs we see that this is followed the lower income household the avg revenue per basket is in the range of 20-30 and for the higher income household is higher than 40, indicating that we can target these household for the cross sell of the other items or items which are not frequently sold cause they’ve the purchasing power

Exploring the data further we are establishing what the average quantity per basket of different customer bases is considering their income ranges available .

The hypothesis was that for the lower income household we though people will order only what’s necessary for them and as the household size increases the avg items per basket will also increase but we only see that happening for higher income households not for the lower income households

Inference: On combining the above two graphs we can say that for the lower income household who are purchasing from us, we cannot cross-sell any items and they not only buy items which are bare minimum but also the items which are cheaper else their average spend would be higher, we see that’s not happening.

Diving deeper into our established insights from our data analysis: The next analysis we went on was to identify the most loyal customer and to which cohorts do they belong. We calculated that by identifying the “Visit Number”. This means that customers who frequently visit us are considered loyal and any customer whose visit number is more than 50 essentially translates to the fact that these customers are ordering once every week from us on an average and they can be further experimented with.

So based on the graph plotted we see that most of the customers belong to the category of 50+ visits and our next step of the analysis is to find out the demographic information for the same.

From the above graph we see the age and income range of our customers. These are the bunch of our customers who are the most loyal and we can experiment with them by dividing them into various test and control groups. A few of the experiments are as follows a. Reduce the discount/coupon usage on this and see the impact b. If we ever launch a new category of product line we can experiment the sales of it to this group by giving them additional discounts on these newly implemented categories of products c. A brave option would be to eliminate the discount for them in the categories across which they order on a regular basis

Coupon Analysis

Analyzing the data to determine the least performing departments - departments that generate the least amount of revenue, we then infer the percentage of transactions in this data that utilise the coupons that ‘Regork’ provides through varies campaigns.

The inference obtained states that bout 85 % of the transactions that have been made in the least performing departments are through coupons based on which we can understand the customer purchase trends .

## # A tibble: 1,025 × 5
##    household_id product_id department coupon_upc  redemption_date
##    <chr>        <chr>      <chr>      <chr>       <date>         
##  1 1            5564790    RESTAURANT 10000085364 2017-05-17     
##  2 1            5564790    RESTAURANT 51700010076 2017-05-17     
##  3 1            5564790    RESTAURANT 54200000033 2017-05-23     
##  4 1            5564790    RESTAURANT 10000085476 2017-11-09     
##  5 1            5564790    RESTAURANT 54200029176 2017-11-09     
##  6 1            5564790    RESTAURANT 10000085364 2017-05-17     
##  7 1            5564790    RESTAURANT 51700010076 2017-05-17     
##  8 1            5564790    RESTAURANT 54200000033 2017-05-23     
##  9 1            5564790    RESTAURANT 10000085476 2017-11-09     
## 10 1            5564790    RESTAURANT 54200029176 2017-11-09     
## # … with 1,015 more rows
## # A tibble: 874 × 6
##    household_id product_id department coupon_upc  redemption_date coupon_disc
##    <chr>        <chr>      <chr>      <chr>       <date>                <dbl>
##  1 1            5564790    RESTAURANT 10000085364 2017-05-17                0
##  2 1            5564790    RESTAURANT 51700010076 2017-05-17                0
##  3 1            5564790    RESTAURANT 54200000033 2017-05-23                0
##  4 1            5564790    RESTAURANT 10000085476 2017-11-09                0
##  5 1            5564790    RESTAURANT 54200029176 2017-11-09                0
##  6 1            5564790    RESTAURANT 10000085364 2017-05-17                0
##  7 1            5564790    RESTAURANT 51700010076 2017-05-17                0
##  8 1            5564790    RESTAURANT 54200000033 2017-05-23                0
##  9 1            5564790    RESTAURANT 10000085476 2017-11-09                0
## 10 1            5564790    RESTAURANT 54200029176 2017-11-09                0
## # … with 864 more rows

Seasonality Effect for a few of the top categories

Our hypothesis was to look into seasonality trends of the various major categories in the entire catalogue and see if they follow any trend and for that our usual suspect is “produce”, this is because most of the items are perishable and these are seasonal. So, they tend to sell more in summer and not so much in winter.

So, for that we are using a custom metric avg quantity per order, this is only for two categories “Produce” and “Seafood”. I.e the number of items of produce in a single order. Alongside the trend of this at month level we also plotted the “Retail_discount” trend for produce and seafood. (Plotted in dotted Lines)

Produce and seafood with retail discount

get_transactions() %>% 
  inner_join(products) %>% 
  mutate(month1=month(transaction_timestamp,label = T))%>%
  group_by(department,month1)%>%
  summarise(quantity_total=sum(quantity),
            average_basket=n_distinct(basket_id),
            prop=quantity_total/average_basket,
            retail_disc1=sum(retail_disc),
             retail_disc1=retail_disc1/1000) %>% 
  filter(trimws(department) %in% c("PRODUCE","SEAFOOD")) %>% 
ggplot(aes(x=month1, y=prop, group=department)) +
  geom_line(aes(color=department))+
  geom_line(aes(y=retail_disc1, color=department), linetype = "dashed")+
  scale_y_continuous(
    name = "Average quatity per Basket of produce and Seafood",
    sec.axis = sec_axis( trans=~.*1, name="Retail_disc(scaled by a factor of 1000")
  ) +
  ylab("Year 2017")+
  labs(
    title = '             Produce along the month(solid line) & retail discount(dashed line) ',
    subtitle = '               Retail discount has little to no imapct on how produce is sold',
    caption = 'Completejourney data')
## Joining, by = "product_id"
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.

  theme(legend.position="bottom")+
  theme(panel.background = element_rect(fill = "white"))
## List of 2
##  $ legend.position : chr "bottom"
##  $ panel.background:List of 5
##   ..$ fill         : chr "white"
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ linetype     : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_rect" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

*** To see the forecast of impact of retail discount on prdouce. We used linear regression and scaled the retail discount value by 1000 ***

Linear regression

df<-get_transactions()%>%
  inner_join(products) %>% 
  mutate(month1=month(transaction_timestamp,label = T))%>%
  group_by(department,month1)%>%
  summarise(quantity_total=sum(quantity),
            average_basket=n_distinct(basket_id),
            prop=quantity_total/average_basket,
            retail_disc=sum(retail_disc),
            retail_disc_mod=retail_disc/1000) %>% 
  filter(trimws(department) %in% c("PRODUCE")) %>%
  ungroup() %>% 
  select(prop,retail_disc_mod)
## Joining, by = "product_id"
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
x <- lm(df$prop~df$retail_disc_mod,df)
summary(x)
## 
## Call:
## lm(formula = df$prop ~ df$retail_disc_mod, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19512 -0.15499  0.02548  0.12565  0.19253 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.98535    0.14988   19.92 2.23e-09 ***
## df$retail_disc_mod  0.18859    0.04091    4.61 0.000965 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1542 on 10 degrees of freedom
## Multiple R-squared:   0.68,  Adjusted R-squared:  0.648 
## F-statistic: 21.25 on 1 and 10 DF,  p-value: 0.0009652

Based on the above graphs we see that the when the retail discount was higher for produce during the months of May,June, July and the avg quantity per basket size follows that pattern and during the winter even thought the retail discount is lower we still see that the avg qty per basket is increasing indicating that even if we remove the retail discount it might work. And the trend of seafood follows the retail discount pattern.

Summary

The Primary Goal of this analysis is to establish a framework that is diverse to be able to look at internal and external factors driving and inhibiting growth.This analysis focuses on : Who are the Regork customers? Customer Retention(Retain customers that are driving growth) Our goal was to identity the demographics of loyal customers. Based on the graph below, we can see that the income ranges of $35-49k and $50-75k are loyal bunches, and in those age groups, 45-54

For the seasonality trend of produce and seafood. We saw that the average quantity is only 68% related with the produce. This implies that retail discount have little impact on the quantity of produce sold throughout the year

How inferences were addressed : Data set has been granulated to a brief level over transactions, products, demographics and other tables by joining these. Visual graphs have been plotted to find patterns,trends and actionable insights. A detailed approach has been addressed in the Approach section as well as in the flow of the report.

Interesting insights of the summary :

  1. Households are buying produce throughout the year, irrespective of whether retail discount is available or not.
  2. If you want to increase the base of loyal customers, the target customers should be of age group other than “45-54”
  3. Least performing departments are only dependent on coupons which shows the dire state and huge growth of opportunities.