Introduction to business problem:- Comparing high Revenue Generating stores with Low revenue generating stores on product categories that have low sales volume with respect to age-group demographics data , we were looking to identify ,which Product categories can be better marketed to the age-specific groups to boost the overall sales in Low and High Revenue generating stores.
Explanation of the problem solution:- By taking mean of the total sales value store wise, we considered stores with SV greater than mean as High performing stores and less than mean as low performing stores. Adding onto it, we find the product categories that generate most and least sales combining with the age group demographics data to identify which age group is responsible for the sales
Proposed Solution:- From the above analysis we could come up with varied suggestions on least selling product categories in the stores by targeting specific age groups by running age-specific campaigns / discounts to boost the overall sales.
# install.packages("plotly")
# install.packages("ggthemes")
# install.packages("usethis")
# install.packages("ggplot2")
# install.packages("completejourney")
# install.packages("lubridate")
# install.packages("gghighlight")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(completejourney)
## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggthemes)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(gghighlight)
transactions <- get_transactions()
promotions <- get_promotions()
transaction <- transactions %>%
mutate(New_product_id = as.numeric(product_id)) %>%
mutate(New_basket_id = as.numeric(basket_id)) %>%
mutate(New_store_id = as.numeric(store_id)) %>%
mutate(New_household_id = as.numeric(household_id)) %>%
mutate(Date = as.Date(as.character(as.POSIXct(transaction_timestamp)))) %>%
na.omit()
transaction
## # A tibble: 1,469,307 x 16
## house~1 store~2 baske~3 produ~4 quant~5 sales~6 retai~7 coupo~8 coupo~9 week
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 900 330 311985~ 1095275 1 0.5 0 0 0 1
## 2 900 330 311985~ 9878513 1 0.99 0.1 0 0 1
## 3 1228 406 311986~ 1041453 1 1.43 0.15 0 0 1
## 4 906 319 311987~ 1020156 1 1.5 0.29 0 0 1
## 5 906 319 311987~ 1053875 2 2.78 0.8 0 0 1
## 6 906 319 311987~ 1060312 1 5.49 0.5 0 0 1
## 7 906 319 311987~ 1075313 1 1.5 0.29 0 0 1
## 8 1058 381 311986~ 985893 1 1.88 0.21 0 0 1
## 9 1058 381 311986~ 988791 1 1.5 1.29 0 0 1
## 10 1058 381 311986~ 9297106 1 2.69 0 0 0 1
## # ... with 1,469,297 more rows, 6 more variables: transaction_timestamp <dttm>,
## # New_product_id <dbl>, New_basket_id <dbl>, New_store_id <dbl>,
## # New_household_id <dbl>, Date <date>, and abbreviated variable names
## # 1: household_id, 2: store_id, 3: basket_id, 4: product_id, 5: quantity,
## # 6: sales_value, 7: retail_disc, 8: coupon_disc, 9: coupon_match_disc
promotion <- promotions %>%
mutate(New_product_id = as.numeric(product_id)) %>%
mutate(New_store_id = as.numeric(store_id)) %>%
na.omit()
promotion
## # A tibble: 20,940,529 x 7
## product_id store_id display_location mailer_location week New_prod~1 New_s~2
## <chr> <chr> <fct> <fct> <int> <dbl> <dbl>
## 1 1000050 316 9 0 1 1000050 316
## 2 1000050 337 3 0 1 1000050 337
## 3 1000050 441 5 0 1 1000050 441
## 4 1000092 292 0 A 1 1000092 292
## 5 1000092 293 0 A 1 1000092 293
## 6 1000092 295 0 A 1 1000092 295
## 7 1000092 298 0 A 1 1000092 298
## 8 1000092 299 0 A 1 1000092 299
## 9 1000092 304 0 A 1 1000092 304
## 10 1000092 306 0 A 1 1000092 306
## # ... with 20,940,519 more rows, and abbreviated variable names
## # 1: New_product_id, 2: New_store_id
product <- products %>%
mutate(New_product_id = as.numeric(product_id)) %>%
mutate(New_manufacture_id = as.numeric(manufacturer_id)) %>%
na.omit()
product
## # A tibble: 61,677 x 9
## product_id manufactur~1 depar~2 brand produ~3 produ~4 packa~5 New_p~6 New_m~7
## <chr> <chr> <chr> <fct> <chr> <chr> <chr> <dbl> <dbl>
## 1 25671 2 GROCERY Nati~ FRZN I~ ICE - ~ 22 LB 25671 2
## 2 26190 69 GROCERY Priv~ FRUIT ~ APPLE ~ 50 OZ 26190 69
## 3 26355 69 GROCERY Priv~ COOKIE~ SPECIA~ 14 OZ 26355 69
## 4 26426 69 GROCERY Priv~ SPICES~ SPICES~ 2.5 OZ 26426 69
## 5 26540 69 GROCERY Priv~ COOKIE~ TRAY P~ 16 OZ 26540 69
## 6 26601 69 DRUG GM Priv~ VITAMI~ VITAMI~ 300 CT~ 26601 69
## 7 26691 16 GROCERY Priv~ PNT BT~ HONEY 12 OZ 26691 16
## 8 26738 69 GROCERY Priv~ ICE CR~ TRADIT~ 56 OZ 26738 69
## 9 26941 69 GROCERY Priv~ ICE CR~ TRADIT~ 56 OZ 26941 69
## 10 27030 69 GROCERY Priv~ ICE CR~ TRADIT~ 56 OZ 27030 69
## # ... with 61,667 more rows, and abbreviated variable names 1: manufacturer_id,
## # 2: department, 3: product_category, 4: product_type, 5: package_size,
## # 6: New_product_id, 7: New_manufacture_id
Once we have the complete journey data, for the ease of plotting and analysis we have converted product_id , store_id , basket_id, household_id into New_product_id, New_Store_id, New_basket_id, New_household_id from Character datatype to Numeric datatype using “as.numeric function” and stored them with different names.
We have discarded/omitted all the rows having ‘NA’ values to have a cleaner data for the sake of analysis.
# plotiing scatter plot for sales value and store id ----------------------
Store_SV <- transaction %>%
select(New_store_id, sales_value ) %>%
#filter(New_store_id <= 500 & New_store_id >= 250) %>%
group_by(New_store_id) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
ggplot(aes(x=New_store_id, y = totalsales)) +
geom_point(color = 'blue', size = 0.7)+
geom_smooth(colour = "red",size = 0.5, se = FALSE)+
gghighlight(New_store_id < 1000, ) +
theme( plot.title = element_text(size = rel(1.5)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
) +
labs(title = "Store ID VS Total Sales" )+
xlab("Store ID") +
ylab("Total Sales")
ggplotly(Store_SV)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Data is spread across the access in the range of 0 to 30000. but there are no store id’s between 500 to 30000. so we are considering them as outliers.
logic behind deciding the outliers is, In general the store id’s are always sequential
for example : if the store1 has 1 as its ID then, Store 2 has store id - 2 not 30000.
Highlighted region is the region we are considering and shaded region are considered as outliers
#top 20 & bottom 20 performing stores-----------------------
Store_SV2 <- transaction %>%
select(New_store_id, sales_value ) %>%
group_by(New_store_id) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales))
head(Store_SV2, 20)
## # A tibble: 20 x 2
## New_store_id totalsales
## <dbl> <dbl>
## 1 367 148170.
## 2 406 120904.
## 3 429 86884.
## 4 361 81196.
## 5 343 80701.
## 6 356 78874.
## 7 292 76775.
## 8 381 72037.
## 9 31782 71189.
## 10 323 69799.
## 11 321 69693.
## 12 369 67718.
## 13 375 66715.
## 14 32004 63774.
## 15 319 63164.
## 16 427 61297.
## 17 327 61095.
## 18 432 60966.
## 19 31862 60580.
## 20 384 60524.
tail(Store_SV2, 20)
## # A tibble: 20 x 2
## New_store_id totalsales
## <dbl> <dbl>
## 1 2186 3.79
## 2 1067 3.57
## 3 681 3.49
## 4 227 3.29
## 5 250 3.19
## 6 3932 3.13
## 7 472 2.96
## 8 2697 2.69
## 9 659 2.61
## 10 2773 2.58
## 11 724 2.47
## 12 3267 2
## 13 4010 1.78
## 14 26 1.75
## 15 511 1.5
## 16 551 1
## 17 765 1
## 18 2760 1
## 19 639 0.85
## 20 610 0.5
# Mean of the sales values --------
Mean_TSV <- mean(Store_SV2$totalsales,na.rm = TRUE)
Mean_TSV
## [1] 10056.98
#categorize stores into high performing stores ()---------------------------
# top 20 performing stores ------------------------------------------------
High_PS_Top20 <- transaction %>%
select(New_store_id, sales_value ) %>%
filter(New_store_id <= 1000 & New_store_id >= 250) %>%
group_by(New_store_id) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
filter(totalsales >= Mean_TSV) %>%
arrange(desc(totalsales)) %>%
head(20) %>%
ggplot( aes(x= New_store_id , y= totalsales)) +
labs(title = "Top 20 High Performing Stores") +
xlab("Store ID") +
ylab("Total Sales") +
geom_line(color = "orange")+
geom_point(color = "black") +
theme( plot.title = element_text(size = rel(1.2)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
ggplotly(High_PS_Top20)
#categorize stores into high performing stores ()---------------------------
# top 20 performing stores ------------------------------------------------
#storing as table for further use-------------------
High_PS <- transaction %>%
select(New_store_id, sales_value ) %>%
filter(New_store_id <= 1000 & New_store_id >= 250) %>%
group_by(New_store_id) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
filter(totalsales >= Mean_TSV) %>%
arrange(desc(totalsales))
#Least 20 selling product categories in High performing stores---------------------------
High_PS_LPC <- High_PS %>%
inner_join(transaction , by = 'New_store_id') %>%
inner_join(product, by = "New_product_id") %>%
group_by(product_category) %>%
summarize(total_sales = sum(sales_value))%>%
arrange(desc(total_sales)) %>%
tail(20) %>%
ggplot(aes(x = total_sales, y = product_category)) +
geom_point(color = "green", size = 2) +
geom_segment(aes(x = 0, xend = total_sales, y = product_category, yend = product_category), color = "violet") +
labs(title = "Least 20 selling product categories") +
xlab("Total Sales") +
ylab("Product Category") +
theme( plot.title = element_text(size = rel(1.2)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
ggplotly(High_PS_LPC)
In least selling products:
Prepaid wireless & accessories has the least sales value
Bouquet has the highest sales value
# Age group stats of Top 10 selling product categories in High performing stores ----------------------------
High_PS_PC_H_AGE <- transaction %>%
inner_join(demographics , by = "household_id") %>%
inner_join(product, by = "New_product_id") %>%
select(New_store_id, sales_value, age, product_category) %>%
filter(New_store_id <= 1000 & New_store_id >= 250) %>%
group_by(age, product_category) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
filter(totalsales >= Mean_TSV) %>%
arrange(desc(totalsales)) %>%
head(10) %>%
ggplot(aes(x = age, y = totalsales)) +
geom_bar(stat = "identity",aes(fill = totalsales)) +
labs(title = " Age group stats of top 10 selling product categories in High performing stores") +
xlab("AGE") +
ylab("Total Sales") +
theme( plot.title = element_text(size = rel(1.1)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
facet_wrap("product_category")
## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.
ggplotly(High_PS_PC_H_AGE)
Age group stats of top 10 selling product categories in High performing stores:
We can infer the age groups of 25-34 , 35-44 are generating nil sales in Beers/Ales , Fluid Milk Products , Frozen meat / Meat Dinners.
Soft drinks are purchased by all age groups. product categories like bag snacks, baked bread, Beers, Fluid milk products, frozen meat are only purchased by “45 - 54” age group only
# Age group stats of least 10 selling product categories in High performing stores ----------------------------
High_PS_PC_L_AGE <- transaction %>%
inner_join(demographics , by = "household_id") %>%
inner_join(product, by = "New_product_id") %>%
select(New_store_id, sales_value, age, product_category) %>%
filter(New_store_id <= 1000 & New_store_id >= 250) %>%
group_by(age, product_category) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
filter(totalsales >= Mean_TSV) %>%
arrange(totalsales) %>%
head(20) %>%
ggplot(aes(x = age, y = totalsales)) +
geom_bar(stat = "identity",aes(fill = totalsales)) +
labs(title = " Age group stats of least 10 selling product categories in High performing stores") +
xlab("AGE") +
ylab("Total Sales") +
theme( plot.title = element_text(size = rel(1.1)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
facet_wrap("product_category")
## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.
High_PS_PC_L_AGE
Age group stats of least 10 selling product categories in High performing stores:
To boost the overall sales in these stores we could observe that 25-34 , 35-44 age group is not accounting for Sales in Frozen Pizza category , likewise only “45-54” age group is maximizing the revenue from softdrinks.
Cigarettes, Cold cereal, frozen pizza, beers are only purchased by “45-54” age group.
similar insights can be made for the other displayed product categories
#categorize stores into Low performing stores ()---------------------------
# top 20 performing stores ------------------------------------------------
Low_PS_least20 <- transaction %>%
select(New_store_id, sales_value ) %>%
filter(New_store_id <= 1000 & New_store_id >= 250) %>%
group_by(New_store_id) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
filter(totalsales >= Mean_TSV) %>%
arrange(desc(totalsales)) %>%
tail(20) %>%
ggplot( aes(x= New_store_id , y= totalsales)) +
labs(title = "Top 20 Stores in Low Performing Stores") +
xlab("Store ID") +
ylab("Total Sales") +
geom_line(color = "orange")+
geom_point(color = "black") +
theme( plot.title = element_text(size = rel(1.2)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
ggplotly(Low_PS_least20)
#categorize stores into Low performing stores ()---------------------------
# top 20 performing stores ------------------------------------------------
#storing as table for further use
Low_PS <- transaction %>%
select(New_store_id, sales_value ) %>%
filter(New_store_id <= 1000 & New_store_id >= 250) %>%
group_by(New_store_id) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
filter(totalsales >= Mean_TSV) %>%
arrange(desc(totalsales))
#Least 20 selling product categories in Low performing stores---------------------------
Low_PS_LPC <- Low_PS %>%
inner_join(transaction , by = 'New_store_id') %>%
inner_join(product, by = "New_product_id") %>%
group_by(product_category) %>%
summarize(total_sales = sum(sales_value))%>%
arrange(desc(total_sales)) %>%
head(20) %>%
ggplot(aes(x = total_sales, y = product_category)) +
geom_point(color = "#008080", size = 2) +
geom_segment(aes(x = 0, xend = total_sales, y = product_category, yend = product_category), color = "#ffad60") +
labs(title = "Least 20 selling product categories in Low Performing Stores") +
xlab("Total Sales") +
ylab("Product Categories") +
theme( plot.title = element_text(size = rel(1.2)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
ggplotly(Low_PS_LPC)
In least 20 selling product categories in Low performing stores:
Soft drinks has highest sales value
Candy-packaged has least sales value
# Age group stats of top 10 selling product categories in low performing stores ----------------------------
Low_PS_PC_H_AGE <- transaction %>%
inner_join(demographics , by = "household_id") %>%
inner_join(product, by = "New_product_id") %>%
select(New_store_id, sales_value, age, product_category) %>%
filter(New_store_id <= 1000 & New_store_id >= 250) %>%
group_by(age, product_category) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
filter(totalsales < Mean_TSV) %>%
arrange(desc(totalsales)) %>%
head(10) %>%
ggplot(aes(x = age, y = totalsales)) +
geom_bar(stat = "identity",aes(fill = totalsales)) +
theme_bw() +
labs(title = " Age group stats of top 10 selling product categories in low performing stores") +
xlab("AGE") +
ylab("Total Sales") +
theme( plot.title = element_text(size = rel(1.1)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
facet_wrap("product_category")
## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.
ggplotly(Low_PS_PC_H_AGE)
Age group stats of top 10 selling product categories in low performing stores:
25-34 , 35-44 age group have nil contribution to the Icecream/Milk/Sherbts , Lunchmeat , soup categories while Bagged Snacks are only brought by the 25-34 age group.
Product Categories like cigarettes, frozen pizza are only purchased by “35 - 44” age group.
Beers are purchased by “25 - 34”, “35 - 44”
# Age group stats of least 10 selling product categories in low performing stores ----------------------------
Low_PS_PC_L_AGE <- transaction %>%
inner_join(demographics , by = "household_id") %>%
inner_join(product, by = "New_product_id") %>%
select(New_store_id, sales_value, age, product_category) %>%
filter(New_store_id <= 1000 & New_store_id >= 250) %>%
group_by(age, product_category) %>%
summarise(totalsales = sum(sales_value)) %>%
arrange(desc(totalsales)) %>%
filter(totalsales < Mean_TSV) %>%
arrange(totalsales) %>%
head(10) %>%
ggplot(aes(x = age, y = totalsales)) +
geom_bar(stat = "identity",aes(fill = totalsales)) +
theme_bw() +
labs(title = " Age group stats of least 10 selling product categories in low performing stores") +
xlab("AGE") +
ylab("Total Sales") +
theme( plot.title = element_text(size = rel(1.1)),
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
facet_wrap("product_category")
## `summarise()` has grouped output by 'age'. You can override using the `.groups`
## argument.
ggplotly(Low_PS_PC_L_AGE)
Age group stats of least 10 selling product categories in low performing stores:
Inferring from the data, these products are not generating the revenue and most of them are a bad investment which give running loss to the stores.
Products like Brocolli, Floral-accessories are not being sold at all.
Coffee is purchased by “19-24”, “25-34” age groups.
Hence we could propose to not stack these specific items in these stores to avoid Running Loss
Problem Statement: This analysis is intended to the business to answer the question which product category and which specific age group they should be targeting to increase the overall revenue in the Stores and how to minimize the deadstock loss in particular stores.
Methodology: From the Complete Journey dataset , we have used following data frames – Products , transactions, demographics for our analysis. Mainly I have used visualization implementing with Scatter plots Lollipop graphs , bar graphs to observe the findings and draw the outcomes.
Insights found:
On the Store wise analysis of Top and Least selling product categories based on the age groups we were able to button down on which particular group could be targeted to increase the overall sales of the particular store/stores.
For Example: In high sales in high performing stores- we can boost the total sales by offering discounts on Frozen Meat/meat Dinners , Milk Products.
Proposal to the Business Problem:
In Low Revenue Generating stores we could actually not stock fresh veggies like the Broccoli , Cauliflower Floral Accessories as they generate 0 to Nil revenue hence we can minimize the deadstock loss.
In High Revenue and top selling products we can maximize the revenue by targeting specific age-groups or discount schemes for products like bag snacks , ,Baked breads, beers/ales , Fluid milk products , Frozen meat dinners , which would enable the customers who don’t purchase them to considering to purchase and those who purchase generally increase their quantity of purchase which anyway is a win-win for the Business.
Limitations and Improvements : As the store id’s are not sequential they range from 2 to 34280 where in there are no stores for some numbers , hence we based our analysis only on stores with id’s from 250 to 1000. However this could be scaled to all the stores provided the Store id’s are sequential , we would have a better visualization in the plots. The above findings when combined with other demographic columns like the Income group or Marital status , we can even find more intriguing results like the middle income groups purchasing trends, married couples purchasing trends.