Our primary goals in our research and analysis are to boost Regork’s sales and increase operational effectiveness. We have selected two problem statements, one relating to the optimal package sizes to position in our business and the other to the best way to issue coupons to consumers in order to maximize sales.
We worked on the problem statements and analysed the existing ‘CompleteJourney’ datasets relating to transactions, products, and demographics. we cleaned, explored, and analyzed the datasets to address the problem statements and derive meaningful insights from the data sets like Transactions,Demographics and Product.
Business Problem 1: Is the retail discount distribution in relation to the household income ideal?Are discounts yielding higher sales?
Business Problem 2: For top 5 products in grocery which kind of package sizes are generating highest sales. So that business could focus on those package sizes while procuring from manufacturer
Our methodology can be broadly classified into three steps -
Data cleaning and preparation: In R, tidy up the datasets and tables and connect them as necessary for the analysis (more in detail in the section on data preparation)
Data Exploration: Identify anomalous values and outliers, investigate the nature and source of the data, investigate missing values and their causes, and show variable distributions and correlation as necessary.
Data Analysis: Exploratory Data Analysis to perform initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
How will this Analysis help?
Outome 1 It will help Regork to distribute the discount properly so that it would result in higher sales.
Outcome 2 It will help the Regork to focus on package sizes which are yielding higher sales and reduce the quantum of other packages which inturn would help in better procurement and inventory management and higher sales.
Proposed Solution
Increase the quantity of packages which are yielding higher sales and decrease the other quantity also increase the distribution of discounts to the income group which is generating more sales so that it would yield in overall financial improvement of the company. Our analysis will say to which household income group more discount coupons should be distributed.
The following packages are needed for this project
library(completejourney)
library(tidyverse)
library(reactable)
library(ggplot2)
library(ggforce)
library(dplyr)
library(magrittr)
library(here)
library(plotly)
library(RColorBrewer)
library(treemapify)
completejourney provides access to data sets characterizing household level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are captured.
There are eight built-in data sets available in this package (see data(package = ‘completejourney’). The data sets include:
Loading data for datasets the below datasets
transactions <- get_transactions()
Contains transaction-level product purchases by households (what you would normally see on a receipt).
Contains metatdata regarding the products purchased (brand, description, etc.).
This table contains demographic information for a portion of households. Due to nature of the data, the demographic information is not available for all households.
reactable(head(transactions, 10), defaultPageSize = 5)
reactable(head(products, 10), defaultPageSize = 5)
reactable(head(demographics, 10), defaultPageSize = 5)
print(paste('Number of missing values in transactions is ',sum(is.na(transactions))))
## [1] "Number of missing values in transactions is 0"
print(paste('Number of missing values in products is ',sum(is.na(products))))
## [1] "Number of missing values in products is 31654"
print(paste('Number of missing values in demographics is ',sum(is.na(demographics))))
## [1] "Number of missing values in demographics is 370"
Observations There are no missing values in transactions and promotions. There are some missing values in products and demographics.
We have not filtered out the missing values of the package_size of groceries department for our analysis but have filtered only ounce and pounds for our analysis
Checking classes of data to check, if all categorical variables and factors are in the required classes
transactionsClasses <- map_df(transactions,class)
reactable(head(transactionsClasses, 10), defaultPageSize = 5)
productsClasses <- map_df(products,class)
reactable(head(productsClasses, 10), defaultPageSize = 5)
demographicsClasses <- map_df(demographics,class)
reactable(head(demographicsClasses, 10), defaultPageSize = 5)
All the variables are in the required classes and no modification is needed
The following new variables have been calculated from existing variables to enable better understanding of what they represent
Grouping the income based on the income range into categories of “Low”, “Medium”, “High”
Please find below the exploratory Data Analysis for both of the business problems
Is the retail discount distribution in relation to the household income ideal?Are discounts yielding higher sales?
For this after cleaning the data we had divided the income categories into 3 groups High,Medium and Low and formed a new data frame called “complete_data” Then filtering the sales data based up on the retail discount and Plotting the required graphs for generating insights addressing the business problem
We found the below insights after peforming the analysis from the graphs:
Below is the code and plots for the same
demographics
## # A tibble: 801 × 8
## household_id age income home_ownership marital…¹ house…² house…³ kids_…⁴
## <chr> <ord> <ord> <ord> <ord> <ord> <ord> <ord>
## 1 1 65+ 35-49K Homeowner Married 2 2 Adul… 0
## 2 1001 45-54 50-74K Homeowner Unmarried 1 1 Adul… 0
## 3 1003 35-44 25-34K <NA> Unmarried 1 1 Adul… 0
## 4 1004 25-34 15-24K <NA> Unmarried 1 1 Adul… 0
## 5 101 45-54 Under 15K Homeowner Married 4 2 Adul… 2
## 6 1012 35-44 35-49K <NA> Married 5+ 2 Adul… 3+
## 7 1014 45-54 15-24K <NA> Married 4 2 Adul… 2
## 8 1015 45-54 50-74K Homeowner Unmarried 1 1 Adul… 0
## 9 1018 45-54 35-49K Homeowner Married 5+ 2 Adul… 3+
## 10 1020 45-54 25-34K Homeowner Married 2 2 Adul… 0
## # … with 791 more rows, and abbreviated variable names ¹marital_status,
## # ²household_size, ³household_comp, ⁴kids_count
transactions <- get_transactions()
# complete_data is a dataframe formed after inner joining the tables demographics and transactions
complete_data <- demographics %>%
inner_join(transactions,by="household_id")
#null_values <- colSums(is.na(demographics))
#null_values
#unique(transactions$retail_disc)
y <- Filter(Negate(is.null), demographics$income)
unique(y)
## [1] 35-49K 50-74K 25-34K 15-24K Under 15K 75-99K 100-124K
## [8] 125-149K 150-174K 250K+ 175-199K 200-249K
## 12 Levels: Under 15K < 15-24K < 25-34K < 35-49K < 50-74K < ... < 250K+
# Discarding all the NA values in the table
complete_data <- na.omit(complete_data)
# Discarding any warnings
options(warn = -1)
# Selecting the colour palatte
palette <- brewer.pal(3, "Paired")
# Graph for < $1
complete_data_1 <- complete_data %>%
# Creating a column income group, be categorizing income group into Low, Medium ,and High
mutate(
income_group = case_when(
income =="Under 15K" | income=="15-24K" | income=="25-34K" ~ "1. Low",
income =="35-49K" | income=="50-74K" | income=="75-99K" ~ "2. Medium",
income =="100-124K" | income=="125-149K" | income=="150-174K" | income=="175-199K" | income=="200-249K" | income=="250K+" ~ "3. High",
TRUE ~ "other"))%>%
# Grouping the complete_data dataframe by income_group
group_by(income_group) %>%
# Filtering row data based on the discount required to analyse
filter(retail_disc > 0 & retail_disc < 1) %>%
# Calculating the total sum of sales_value and total sum of retail_disc
summarize(sum_sales = sum(sales_value), sum_discount=sum(retail_disc)) %>%
# Creating a new column showing the percentage of sum_discount based on sum_sales
mutate(percentage = sum_discount*100/sum_sales)
## Plotting
# Used bar graph to plot
plot_1<-ggplot(complete_data_1,aes(x = income_group)) +
geom_bar(aes(y = sum_sales, fill = "Sales Generated"), stat = 'identity') +
geom_text(aes(y = sum_sales, label = scales::dollar(sum_sales)), vjust = -0.25) +
geom_bar(aes(x = income_group, y = sum_discount, fill = "Discount Amount"), stat = "identity") +
geom_point(aes(x = income_group, y = percentage/0.00005, group = 1), shape = 17, size = 4, stat = "identity", size =1, color = "red") +
geom_text(aes(y = percentage, label = scales::percent(percentage/100)), vjust = -29, size = 3) +
scale_y_continuous(name = "Total Sales Value", label = scales::dollar, sec.axis = sec_axis(~.*0.00005, name = "Total Percentage of Discount Used", label = scales::number)) +
scale_fill_manual(name = "Total Value of:", values = palette) +
labs(title = "Retail Discount (for less than $1) v/s Total Sales Revenue (by Income Group)",
subtitle = "Low income: Less than 35K, Medium income: 35K-99K, High income: 100K and above",
x = "Income Group of Shoppers")
plot_1
# Graph for $1-5
complete_data_2 <- complete_data %>%
mutate(
income_group = case_when(
income =="Under 15K" | income=="15-24K" | income=="25-34K" ~ "1. Low",
income =="35-49K" | income=="50-74K" | income=="75-99K" ~ "2. Medium",
income =="100-124K" | income=="125-149K" | income=="150-174K" | income=="175-199K" | income=="200-249K" | income=="250K+" ~ "3. High",
TRUE ~ "other"))%>%
group_by(income_group) %>%
filter(retail_disc > 1 & retail_disc < 5) %>%
summarize(sum_sales = sum(sales_value), sum_discount=sum(retail_disc)) %>%
mutate(percentage = sum_discount*100/sum_sales)
# Used bar graph to plot
plot_2 <- ggplot(complete_data_2,aes(x = income_group)) +
geom_bar(aes(y = sum_sales, fill = "Sales Generated"), stat = 'identity') +
geom_text(aes(y = sum_sales, label = scales::dollar(sum_sales)), vjust = -0.25) +
geom_bar(aes(x = income_group, y = sum_discount, fill = "Discount Amount"), stat = "identity") +
geom_point(aes(x = income_group, y = percentage/0.0002, group = 1), shape = 17, size = 4, stat = "identity", size =1, color = "red") +
geom_text(aes(y = percentage, label = scales::percent(percentage/100)), vjust = -19, size = 3) +
scale_y_continuous(name = "Total Sales Value", label = scales::dollar, sec.axis = sec_axis(~.*0.0002, name = "Total Percentage of Discount Used", label = scales::number)) +
scale_fill_manual(name = "Total Value of:", values = palette) +
labs(title = "Retail Discount (between $1 and $5) v/s Total Sales Revenue (by Income Group)",
subtitle = "Low income: Less than 35K, Medium income: 35K-99K, High income: 100K and above",
x = "Income Group of Shoppers")
options(warn = -1)
plot_2
For top 5 products in grocery which kind of package sizes are generating highest sales. So that business could focus on those package sizes while procuring from manufacturer
For this after cleaning the data and first identified various product categories in Grocery and we finalised on Top 5 products by sales to generate insights Then we have further analysed those products by sales with their package sizes to draw an analysis which package size are generating maximum sales
We found the below insights after peforming the analysis from the graphs:
*0.6 ounce is the best size to sell Sweeteners which is yielding higher sales
Below is the code and plots for the same
## `summarise()` has grouped output by 'product_id', 'product_category'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'product_id', 'product_category'. You can
## override using the `.groups` argument.
#Bar plot
unidff_lb %>%
ggplot(aes(x = product_category,y=max_sales,fill=package_size)) +
geom_bar(stat = 'identity') + geom_text(aes(y = max_sales, label = scales::dollar(max_sales)), vjust = -0.25) +
labs(
title = "Top 5 product categories having maximum sales with their respective package sizes.",
subtitle = "This data has been collected over the year 2017."
)
options(warn = -1)
#tree map of maximum sold grocery items
unidff%>%
ggplot(aes(area = max_sales, fill = product_category, label = paste(product_category, max_sales, sep = "\n"))) +
geom_treemap() +
geom_treemap_text(colour = "white",
place = "centre",
size = 15) +
theme(legend.position = "none")+
labs(
title = "Top 5 product categories of Groceries having maximum sales.",
subtitle = "This data has been collected over the year 2017."
)
The aim of this project was to find out how which how the discount coupons should be distributed to aid highr sales and also to propose which type of package sizes need to be kept for aiding higher sales. We had observed the Following insights