2022_BANA7025_Anurag_Thimmareddy

Store wise analysis

1. Introduction

Introduction to business problem:- Comparing high Revenue Generating stores with Low revenue generating stores on product categories that have low sales volume with respect to age-group demographics data , we were looking to identify ,which Product categories can be better marketed to the age-specific groups to boost the overall sales in Low and High Revenue generating stores.

Explanation of the problem solution:- By taking mean of the total sales value store wise, we considered stores with SV greater than mean as High performing stores and less than mean as low performing stores. Adding onto it, we find the product categories that generate most and least sales combining with the age group demographics data to identify which age group is responsible for the sales

Proposed Solution:- From the above analysis we could come up with varied suggestions on least selling product categories in the stores by targeting specific age groups by running age-specific campaigns / discounts to boost the overall sales.

2. Packages Required

# install.packages("plotly")
# install.packages("ggthemes")
# install.packages("usethis")
# install.packages("ggplot2")
# install.packages("completejourney")
# install.packages("lubridate")
# install.packages("gghighlight")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(completejourney)

## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(ggthemes)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(gghighlight)

library(tidyverse) : to transform and better represent data .
library(dplyr) : for transformation and manipulation of the data.
library(ggplot2) : for data visualization to improve quality and aesthetics of graphics.
library(completejourney) : Dataset to answer the business problem.
library(plotly) : to represent plots and data simultaneously (dynamic plotting).
library(ggthemes) : to improve and implement the visual aesthetics across the report.
library(magrittr) : for Pipe operators (mutation , joins ).
library(gghighlight) : to add direct labels for plots.
library(ggrepel) : labelling the data.

3.Data Cleaning

transactions <- get_transactions()
promotions <- get_promotions()
transaction <- transactions %>%
  mutate(New_product_id = as.numeric(product_id)) %>%
  mutate(New_basket_id = as.numeric(basket_id)) %>%
  mutate(New_store_id = as.numeric(store_id)) %>%
  mutate(New_household_id = as.numeric(household_id)) %>%
  mutate(Date = as.Date(as.character(as.POSIXct(transaction_timestamp)))) %>%
  na.omit()
transaction

## # A tibble: 1,469,307 x 16
##    house~1 store~2 baske~3 produ~4 quant~5 sales~6 retai~7 coupo~8 coupo~9  week
##    <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <int>
##  1 900     330     311985~ 1095275       1    0.5     0          0       0     1
##  2 900     330     311985~ 9878513       1    0.99    0.1        0       0     1
##  3 1228    406     311986~ 1041453       1    1.43    0.15       0       0     1
##  4 906     319     311987~ 1020156       1    1.5     0.29       0       0     1
##  5 906     319     311987~ 1053875       2    2.78    0.8        0       0     1
##  6 906     319     311987~ 1060312       1    5.49    0.5        0       0     1
##  7 906     319     311987~ 1075313       1    1.5     0.29       0       0     1
##  8 1058    381     311986~ 985893        1    1.88    0.21       0       0     1
##  9 1058    381     311986~ 988791        1    1.5     1.29       0       0     1
## 10 1058    381     311986~ 9297106       1    2.69    0          0       0     1
## # ... with 1,469,297 more rows, 6 more variables: transaction_timestamp <dttm>,
## #   New_product_id <dbl>, New_basket_id <dbl>, New_store_id <dbl>,
## #   New_household_id <dbl>, Date <date>, and abbreviated variable names
## #   1: household_id, 2: store_id, 3: basket_id, 4: product_id, 5: quantity,
## #   6: sales_value, 7: retail_disc, 8: coupon_disc, 9: coupon_match_disc

promotion <- promotions %>%
  mutate(New_product_id = as.numeric(product_id)) %>%
  mutate(New_store_id = as.numeric(store_id)) %>%
  na.omit()
promotion

## # A tibble: 20,940,529 x 7
##    product_id store_id display_location mailer_location  week New_prod~1 New_s~2
##    <chr>      <chr>    <fct>            <fct>           <int>      <dbl>   <dbl>
##  1 1000050    316      9                0                   1    1000050     316
##  2 1000050    337      3                0                   1    1000050     337
##  3 1000050    441      5                0                   1    1000050     441
##  4 1000092    292      0                A                   1    1000092     292
##  5 1000092    293      0                A                   1    1000092     293
##  6 1000092    295      0                A                   1    1000092     295
##  7 1000092    298      0                A                   1    1000092     298
##  8 1000092    299      0                A                   1    1000092     299
##  9 1000092    304      0                A                   1    1000092     304
## 10 1000092    306      0                A                   1    1000092     306
## # ... with 20,940,519 more rows, and abbreviated variable names
## #   1: New_product_id, 2: New_store_id

product <- products %>%
  mutate(New_product_id = as.numeric(product_id)) %>%
  mutate(New_manufacture_id = as.numeric(manufacturer_id)) %>%
  na.omit()
product

## # A tibble: 61,677 x 9
##    product_id manufactur~1 depar~2 brand produ~3 produ~4 packa~5 New_p~6 New_m~7
##    <chr>      <chr>        <chr>   <fct> <chr>   <chr>   <chr>     <dbl>   <dbl>
##  1 25671      2            GROCERY Nati~ FRZN I~ ICE - ~ 22 LB     25671       2
##  2 26190      69           GROCERY Priv~ FRUIT ~ APPLE ~ 50 OZ     26190      69
##  3 26355      69           GROCERY Priv~ COOKIE~ SPECIA~ 14 OZ     26355      69
##  4 26426      69           GROCERY Priv~ SPICES~ SPICES~ 2.5 OZ    26426      69
##  5 26540      69           GROCERY Priv~ COOKIE~ TRAY P~ 16 OZ     26540      69
##  6 26601      69           DRUG GM Priv~ VITAMI~ VITAMI~ 300 CT~   26601      69
##  7 26691      16           GROCERY Priv~ PNT BT~ HONEY   12 OZ     26691      16
##  8 26738      69           GROCERY Priv~ ICE CR~ TRADIT~ 56 OZ     26738      69
##  9 26941      69           GROCERY Priv~ ICE CR~ TRADIT~ 56 OZ     26941      69
## 10 27030      69           GROCERY Priv~ ICE CR~ TRADIT~ 56 OZ     27030      69
## # ... with 61,667 more rows, and abbreviated variable names 1: manufacturer_id,
## #   2: department, 3: product_category, 4: product_type, 5: package_size,
## #   6: New_product_id, 7: New_manufacture_id

Once we have the complete journey data, for the ease of plotting and analysis we have converted product_id , store_id , basket_id, household_id into New_product_id, New_Store_id, New_basket_id, New_household_id from Character datatype to Numeric datatype using “as.numeric function” and stored them with different names.
We have discarded/omitted all the rows having ‘NA’ values to have a cleaner data for the sake of analysis.

4.Exploratory Data Analysis(Part 1)

# plotiing scatter plot for sales value and store id ----------------------
Store_SV <- transaction %>%
  select(New_store_id, sales_value ) %>%
  #filter(New_store_id <= 500 & New_store_id >= 250) %>%
  group_by(New_store_id) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  ggplot(aes(x=New_store_id, y = totalsales)) +
  geom_point(color = 'blue', size = 0.7)+
  geom_smooth(colour = "red",size = 0.5, se = FALSE)+
  gghighlight(New_store_id < 1000, ) +
  theme( plot.title = element_text(size = rel(1.5)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black"),
  ) +
  labs(title = "Store ID VS Total Sales" )+
  xlab("Store ID") +
  ylab("Total Sales")
ggplotly(Store_SV)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Data is spread across the access in the range of 0 to 30000. but there are no store id’s between 500 to 30000. so we are considering them as outliers.
logic behind deciding the outliers is, In general the store id’s are always sequential
for example : if the store1 has 1 as its ID then, Store 2 has store id - 2 not 30000.
Highlighted region is the region we are considering and shaded region are considered as outliers

#top 20 & bottom 20 performing stores-----------------------
Store_SV2 <- transaction %>%
  select(New_store_id, sales_value ) %>%
  group_by(New_store_id) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales))
head(Store_SV2, 20)

## # A tibble: 20 x 2
##    New_store_id totalsales
##           <dbl>      <dbl>
##  1          367    148170.
##  2          406    120904.
##  3          429     86884.
##  4          361     81196.
##  5          343     80701.
##  6          356     78874.
##  7          292     76775.
##  8          381     72037.
##  9        31782     71189.
## 10          323     69799.
## 11          321     69693.
## 12          369     67718.
## 13          375     66715.
## 14        32004     63774.
## 15          319     63164.
## 16          427     61297.
## 17          327     61095.
## 18          432     60966.
## 19        31862     60580.
## 20          384     60524.

tail(Store_SV2, 20)

## # A tibble: 20 x 2
##    New_store_id totalsales
##           <dbl>      <dbl>
##  1         2186       3.79
##  2         1067       3.57
##  3          681       3.49
##  4          227       3.29
##  5          250       3.19
##  6         3932       3.13
##  7          472       2.96
##  8         2697       2.69
##  9          659       2.61
## 10         2773       2.58
## 11          724       2.47
## 12         3267       2   
## 13         4010       1.78
## 14           26       1.75
## 15          511       1.5 
## 16          551       1   
## 17          765       1   
## 18         2760       1   
## 19          639       0.85
## 20          610       0.5

# Mean of the sales values   --------
Mean_TSV <- mean(Store_SV2$totalsales,na.rm = TRUE)
Mean_TSV

## [1] 10056.98

#categorize stores into high performing stores ()---------------------------
# top 20 performing stores ------------------------------------------------


High_PS_Top20 <- transaction %>%
  select(New_store_id, sales_value ) %>%
  filter(New_store_id <= 1000 & New_store_id >= 250) %>%
  group_by(New_store_id) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  filter(totalsales >= Mean_TSV) %>%
  arrange(desc(totalsales)) %>%
  head(20) %>%
  ggplot( aes(x= New_store_id , y= totalsales)) +
  labs(title = "Top 20 High Performing Stores") +
  xlab("Store ID") +
  ylab("Total Sales") +
  geom_line(color = "orange")+
  geom_point(color = "black") +
  theme( plot.title = element_text(size = rel(1.2)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black"))

ggplotly(High_PS_Top20)

#categorize stores into high performing stores ()---------------------------
# top 20 performing stores ------------------------------------------------
#storing as table for further use-------------------

High_PS <- transaction %>%
  select(New_store_id, sales_value ) %>%
  filter(New_store_id <= 1000 & New_store_id >= 250) %>%
  group_by(New_store_id) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  filter(totalsales >= Mean_TSV) %>%
  arrange(desc(totalsales))

#Least 20 selling product categories in High performing stores---------------------------
High_PS_LPC <- High_PS %>%
  inner_join(transaction , by = 'New_store_id') %>%
  inner_join(product, by = "New_product_id") %>%
  group_by(product_category) %>%
  summarize(total_sales = sum(sales_value))%>%
  arrange(desc(total_sales)) %>%
  tail(20) %>%
  ggplot(aes(x = total_sales, y = product_category)) +
  geom_point(color = "green", size = 2) +
  geom_segment(aes(x = 0, xend = total_sales, y = product_category, yend = product_category), color = "violet") +
  labs(title = "Least 20 selling product categories") +
  xlab("Total Sales") +
  ylab("Product Category") +
  theme( plot.title = element_text(size = rel(1.2)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black"))
ggplotly(High_PS_LPC)

In least selling products:

Prepaid wireless & accessories has the least sales value
Bouquet has the highest sales value

# Age group stats of Top 10 selling product categories in High performing stores ----------------------------
High_PS_PC_H_AGE <- transaction %>%
  inner_join(demographics , by = "household_id") %>%
  inner_join(product, by = "New_product_id") %>%
  select(New_store_id, sales_value, age, product_category) %>%
  filter(New_store_id <= 1000 & New_store_id >= 250) %>%
  group_by(age, product_category) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  filter(totalsales >= Mean_TSV) %>%
  arrange(desc(totalsales)) %>%
  head(10) %>%
  ggplot(aes(x = age, y = totalsales)) +
  geom_bar(stat = "identity",aes(fill = totalsales)) +
  labs(title = " Age group stats of top 10 selling product categories in High performing stores") +
  xlab("AGE") +
  ylab("Total Sales") +
  theme( plot.title = element_text(size = rel(1.1)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black")) +
  facet_wrap("product_category")

## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.

ggplotly(High_PS_PC_H_AGE)

Age group stats of top 10 selling product categories in High performing stores:

We can infer the age groups of 25-34 , 35-44 are generating nil sales in Beers/Ales , Fluid Milk Products , Frozen meat / Meat Dinners.
Soft drinks are purchased by all age groups. product categories like bag snacks, baked bread, Beers, Fluid milk products, frozen meat are only purchased by “45 - 54” age group only

# Age group stats of least 10 selling product categories in High performing stores ----------------------------
High_PS_PC_L_AGE <- transaction %>%
  inner_join(demographics , by = "household_id") %>%
  inner_join(product, by = "New_product_id") %>%
  select(New_store_id, sales_value, age, product_category) %>%
  filter(New_store_id <= 1000 & New_store_id >= 250) %>%
  group_by(age, product_category) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  filter(totalsales >= Mean_TSV) %>%
  arrange(totalsales) %>%
  head(20) %>%
  ggplot(aes(x = age, y = totalsales)) +
  geom_bar(stat = "identity",aes(fill = totalsales)) +
  labs(title = " Age group stats of least 10 selling product categories in High performing stores") +
  xlab("AGE") +
  ylab("Total Sales") +
  theme( plot.title = element_text(size = rel(1.1)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black")) +
  facet_wrap("product_category")

## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.

High_PS_PC_L_AGE

Age group stats of least 10 selling product categories in High performing stores:

To boost the overall sales in these stores we could observe that 25-34 , 35-44 age group is not accounting for Sales in Frozen Pizza category , likewise only “45-54” age group is maximizing the revenue from softdrinks.
Cigarettes, Cold cereal, frozen pizza, beers are only purchased by “45-54” age group.
similar insights can be made for the other displayed product categories

5.Exploratory Data Analysis(Part 2)

#categorize stores into Low performing stores ()---------------------------
# top 20 performing stores ------------------------------------------------



Low_PS_least20 <- transaction %>%
  select(New_store_id, sales_value ) %>%
  filter(New_store_id <= 1000 & New_store_id >= 250) %>%
  group_by(New_store_id) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  filter(totalsales >= Mean_TSV) %>%
  arrange(desc(totalsales)) %>%
  tail(20) %>%
  ggplot( aes(x= New_store_id , y= totalsales)) +
  labs(title = "Top 20 Stores in Low Performing Stores") +
  xlab("Store ID") +
  ylab("Total Sales") +
  geom_line(color = "orange")+
  geom_point(color = "black") +
  theme( plot.title = element_text(size = rel(1.2)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black"))

ggplotly(Low_PS_least20)

#categorize stores into Low performing stores ()---------------------------
# top 20 performing stores ------------------------------------------------
#storing as table for further use

Low_PS <- transaction %>%
  select(New_store_id, sales_value ) %>%
  filter(New_store_id <= 1000 & New_store_id >= 250) %>%
  group_by(New_store_id) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  filter(totalsales >= Mean_TSV) %>%
  arrange(desc(totalsales))

#Least 20 selling product categories in Low performing stores---------------------------

Low_PS_LPC <- Low_PS %>%
  inner_join(transaction , by = 'New_store_id') %>%
  inner_join(product, by = "New_product_id") %>%
  group_by(product_category) %>%
  summarize(total_sales = sum(sales_value))%>%
  arrange(desc(total_sales)) %>%
  head(20) %>%
  ggplot(aes(x = total_sales, y = product_category)) +
  geom_point(color = "#008080", size = 2) +
  geom_segment(aes(x = 0, xend = total_sales, y = product_category, yend = product_category), color = "#ffad60") +
  labs(title = "Least 20 selling product categories in Low Performing Stores") +
  xlab("Total Sales") +
  ylab("Product Categories") +
  theme( plot.title = element_text(size = rel(1.2)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black"))
ggplotly(Low_PS_LPC)

In least 20 selling product categories in Low performing stores:

Soft drinks has highest sales value
Candy-packaged has least sales value

# Age group stats of top 10 selling product categories in low performing stores ----------------------------
Low_PS_PC_H_AGE <- transaction %>%
  inner_join(demographics , by = "household_id") %>%
  inner_join(product, by = "New_product_id") %>%
  select(New_store_id, sales_value, age, product_category) %>%
  filter(New_store_id <= 1000 & New_store_id >= 250) %>%
  group_by(age, product_category) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  filter(totalsales < Mean_TSV) %>%
  arrange(desc(totalsales)) %>%
  head(10) %>%
  ggplot(aes(x = age, y = totalsales)) +
  geom_bar(stat = "identity",aes(fill = totalsales)) +
  theme_bw() +
  labs(title = " Age group stats of top 10 selling product categories in low performing stores") +
  xlab("AGE") +
  ylab("Total Sales") +
  theme( plot.title = element_text(size = rel(1.1)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black")) +
  facet_wrap("product_category")

## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.

ggplotly(Low_PS_PC_H_AGE)

Age group stats of top 10 selling product categories in low performing stores:

25-34 , 35-44 age group have nil contribution to the Icecream/Milk/Sherbts , Lunchmeat , soup categories while Bagged Snacks are only brought by the 25-34 age group.
Product Categories like cigarettes, frozen pizza are only purchased by “35 - 44” age group.
Beers are purchased by “25 - 34”, “35 - 44”

# Age group stats of least 10 selling product categories in low performing stores ----------------------------
Low_PS_PC_L_AGE <- transaction %>%
  inner_join(demographics , by = "household_id") %>%
  inner_join(product, by = "New_product_id") %>%
  select(New_store_id, sales_value, age, product_category) %>%
  filter(New_store_id <= 1000 & New_store_id >= 250) %>%
  group_by(age, product_category) %>%
  summarise(totalsales = sum(sales_value)) %>%
  arrange(desc(totalsales)) %>%
  filter(totalsales < Mean_TSV) %>%
  arrange(totalsales) %>%
  head(10) %>%
  ggplot(aes(x = age, y = totalsales)) +
  geom_bar(stat = "identity",aes(fill = totalsales)) +
  theme_bw() +
  labs(title = " Age group stats of least 10 selling product categories in low performing stores") +
  xlab("AGE") +
  ylab("Total Sales") +
  theme( plot.title = element_text(size = rel(1.1)),
         panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.line = element_line(colour = "black")) +
  facet_wrap("product_category")

## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.

ggplotly(Low_PS_PC_L_AGE)

Age group stats of least 10 selling product categories in low performing stores:

Inferring from the data, these products are not generating the revenue and most of them are a bad investment which give running loss to the stores.
Products like Brocolli, Floral-accessories are not being sold at all.
Coffee is purchased by “19-24”, “25-34” age groups.
Hence we could propose to not stack these specific items in these stores to avoid Running Loss

6.Summary

Problem Statement: This analysis is intended to the business to answer the question which product category and which specific age group they should be targeting to increase the overall revenue in the Stores and how to minimize the deadstock loss in particular stores.

Methodology: From the Complete Journey dataset , we have used following data frames – Products , transactions, demographics for our analysis. Mainly I have used visualization implementing with Scatter plots Lollipop graphs , bar graphs to observe the findings and draw the outcomes.

Insights found:

On the Store wise analysis of Top and Least selling product categories based on the age groups we were able to button down on which particular group could be targeted to increase the overall sales of the particular store/stores.
For Example: In high sales in high performing stores- we can boost the total sales by offering discounts on Frozen Meat/meat Dinners , Milk Products.

Proposal to the Business Problem:

In Low Revenue Generating stores we could actually not stock fresh veggies like the Broccoli , Cauliflower Floral Accessories as they generate 0 to Nil revenue hence we can minimize the deadstock loss.
In High Revenue and top selling products we can maximize the revenue by targeting specific age-groups or discount schemes for products like bag snacks , ,Baked breads, beers/ales , Fluid milk products , Frozen meat dinners , which would enable the customers who don’t purchase them to considering to purchase and those who purchase generally increase their quantity of purchase which anyway is a win-win for the Business.

Limitations and Improvements : As the store id’s are not sequential they range from 2 to 34280 where in there are no stores for some numbers , hence we based our analysis only on stores with id’s from 250 to 1000. However this could be scaled to all the stores provided the Store id’s are sequential , we would have a better visualization in the plots. The above findings when combined with other demographic columns like the Income group or Marital status , we can even find more intriguing results like the middle income groups purchasing trends, married couples purchasing trends.

2022_BANA7025_Anurag_Thimmareddy_finalproject

Anurag Thimmareddy

2022-10-06