Introduction

The primary goals of our research and analysis are to boost Regork's sales and increase operational effectiveness. We selected two problem statements: one relating to the optimal package sizes to stock, and the other to the best way to issue coupons to consumers in order to maximize sales.

We addressed these problem statements using the existing ‘CompleteJourney’ data sets. We cleaned, explored, and analyzed the Transactions, Demographics, and Products data sets to answer the problem statements and derive meaningful insights.

Business Problem 1: Is the distribution of retail discounts across household income groups ideal? Are discounts yielding higher sales?

Business Problem 2: For the top 5 grocery product categories, which package sizes generate the highest sales? Knowing this would let the business focus on those package sizes when procuring from manufacturers.

Our methodology can be broadly classified into three steps:

Data cleaning and preparation: In R, tidy up the data sets and tables and join them as necessary for the analysis (described in more detail in the Data Preparation section)

Data Exploration: Identify anomalous values and outliers, investigate the nature and source of the data, investigate missing values and their causes, and show variable distributions and correlation as necessary.

Data Analysis: Exploratory Data Analysis to perform initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

How will this Analysis help?

Outcome 1: It will help Regork distribute discounts appropriately so that they result in higher sales.

Outcome 2: It will help Regork focus on the package sizes that yield higher sales and reduce the quantities of other packages, which in turn supports better procurement, better inventory management, and higher sales.

Proposed Solution

Increase the quantities of the package sizes that yield higher sales and decrease the quantities of the others. At the same time, direct more discounts to the income group that generates the most sales, improving the company's overall financial position. Our analysis identifies which household income group should receive more discount coupons.

The following packages are needed for this project

  • completejourney
    • The completejourney package provides access to data sets characterizing household-level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store.
  • tidyverse
    • The tidyverse is an opinionated collection of R packages designed for data science.
  • reactable
    • The reactable package provides interactive data tables for R, based on the React Table library.
  • ggplot2
    • ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.
  • ggforce
    • ggforce is a package aimed at providing missing functionality to ggplot2 through the extension system introduced with ggplot2 v2.0.0
  • dplyr
    • dplyr is a grammar of data manipulation, providing a consistent set of verbs (such as select, filter, mutate, and summarize) that help solve the most common data manipulation challenges.
  • magrittr
    • magrittr provides the forward-pipe operator, %>%, a mechanism for chaining commands. This operator forwards a value, or the result of an expression, into the next function call/expression.
  • here
    • The goal of the here package is to enable easy file referencing in project-oriented workflows
  • RColorBrewer
    • The RColorBrewer package provides ready-made color palettes for use in R graphics.
  • treemapify
    • The treemapify package allows creating treemaps in ggplot2
library(completejourney)
library(tidyverse)
library(reactable)
library(ggplot2)
library(ggforce)
library(dplyr)
library(magrittr)
library(here)
library(plotly)
library(RColorBrewer)
library(treemapify)

Data Preparation

Data source:

completejourney provides access to data sets characterizing household level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are captured.

There are eight built-in data sets available in this package (see data(package = "completejourney")). The data sets include:

  • campaigns: campaigns received by each household
  • campaign_descriptions: campaign metadata (length of time active)
  • coupons: coupon metadata (UPC code, campaign, etc.)
  • coupon_redemptions: coupon redemptions (household, day, UPC code, campaign)
  • demographics: household demographic data (age, income, family size, etc.)
  • products: product metadata (brand, description, etc.)
  • promotions_sample: a sampling of the product placement in mailers and in stores corresponding to advertising campaigns
  • transactions_sample: a sampling of the products purchased by households

Loading the data sets used in the analysis. The full transactions table is retrieved with get_transactions(); the other tables used here (products and demographics) are among the package's built-in data sets.

transactions <- get_transactions()

Transaction Data

Contains transaction-level product purchases by households (what you would normally see on a receipt).

Product Data

Contains metadata regarding the products purchased (brand, description, etc.).

Demographics

This table contains demographic information for a portion of households. Due to the nature of the data, demographic information is not available for all households.
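Only a portion of the transacting households appear in the demographics table, which can be verified with a quick check (a small illustrative addition, not part of the original preparation code):

# Comparing the number of transacting households with the number of households
# that have demographic information
n_distinct(transactions$household_id)   # households present in transactions
nrow(demographics)                      # households with demographic data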

A Glimpse of Original Data

Transactions

reactable(head(transactions, 10), defaultPageSize = 5)

Products

reactable(head(products, 10), defaultPageSize = 5)

Demographics

reactable(head(demographics, 10), defaultPageSize = 5)

Cleaning Data

Checking for missing values

print(paste('Number of missing values in transactions is ',sum(is.na(transactions))))
## [1] "Number of missing values in transactions is  0"
print(paste('Number of missing values in products is ',sum(is.na(products))))
## [1] "Number of missing values in products is  31654"
print(paste('Number of missing values in demographics is ',sum(is.na(demographics))))
## [1] "Number of missing values in demographics is  370"

Observations: There are no missing values in transactions. There are missing values in products and demographics.
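To see which columns account for these missing values, a per-column count can be inspected (an illustrative addition; the original cleaning code does not include this step):

# Per-column counts of missing values in the tables that contain NAs
colSums(is.na(products))
colSums(is.na(demographics))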

Analyzing missing values

For the package-size analysis we did not explicitly drop missing package_size values in the grocery department; instead, we restricted the analysis to sizes expressed in ounces and pounds.

Checking classes

Checking the classes of the data to verify that all categorical variables and factors are in the required classes.

Classes for transactions

transactionsClasses <- map_df(transactions,class)
reactable(head(transactionsClasses, 10), defaultPageSize = 5)

Classes for products

productsClasses <- map_df(products,class)
reactable(head(productsClasses, 10), defaultPageSize = 5)

Classes for demographics

demographicsClasses <- map_df(demographics,class)
reactable(head(demographicsClasses, 10), defaultPageSize = 5)

All the variables are in the required classes and no modification is needed

Data Preparation

New variables

The following new variable has been calculated from existing variables to make the analysis easier to interpret.

Income Group

Grouping household income, based on its range, into the categories "Low", "Medium", and "High".
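The grouping mirrors the case_when() used later in the analysis code; a condensed sketch is shown here (demographics_grouped is an illustrative name, not an object used elsewhere in the report):

# Sketch of the income grouping; mirrors the case_when() in the analysis code below
demographics_grouped <- demographics %>%
  mutate(income_group = case_when(
    income %in% c("Under 15K", "15-24K", "25-34K")   ~ "1. Low",
    income %in% c("35-49K", "50-74K", "75-99K")      ~ "2. Medium",
    income %in% c("100-124K", "125-149K", "150-174K",
                  "175-199K", "200-249K", "250K+")   ~ "3. High",
    TRUE                                             ~ "other"))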

Exploratory Data Analysis

The exploratory data analysis for both business problems is presented below.

Business Problem 1

Is the distribution of retail discounts across household income groups ideal? Are discounts yielding higher sales?

After cleaning the data, we divided the income categories into three groups (High, Medium, and Low) and formed a new data frame called "complete_data". We then filtered the sales data by retail discount and plotted the graphs needed to generate insights addressing the business problem.

We found the following insights after performing the analysis and examining the graphs:

  • We observed that the revenue generated by issuing discounts to the middle income group was considerably higher than for the high and low income groups
  • Discount coupons of less than $1 generated the highest sales compared to $1-5 discounts
  • Discounts of more than $5 yielded minuscule sales, so those discounts can be reduced (a sketch of this check follows the plots below)
  • The low income group receives the highest percentage of discounts compared to the other income groups

The code and plots are given below.

demographics
## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital…¹ house…² house…³ kids_…⁴
##    <chr>        <ord> <ord>     <ord>          <ord>     <ord>   <ord>   <ord>  
##  1 1            65+   35-49K    Homeowner      Married   2       2 Adul… 0      
##  2 1001         45-54 50-74K    Homeowner      Unmarried 1       1 Adul… 0      
##  3 1003         35-44 25-34K    <NA>           Unmarried 1       1 Adul… 0      
##  4 1004         25-34 15-24K    <NA>           Unmarried 1       1 Adul… 0      
##  5 101          45-54 Under 15K Homeowner      Married   4       2 Adul… 2      
##  6 1012         35-44 35-49K    <NA>           Married   5+      2 Adul… 3+     
##  7 1014         45-54 15-24K    <NA>           Married   4       2 Adul… 2      
##  8 1015         45-54 50-74K    Homeowner      Unmarried 1       1 Adul… 0      
##  9 1018         45-54 35-49K    Homeowner      Married   5+      2 Adul… 3+     
## 10 1020         45-54 25-34K    Homeowner      Married   2       2 Adul… 0      
## # … with 791 more rows, and abbreviated variable names ¹​marital_status,
## #   ²​household_size, ³​household_comp, ⁴​kids_count
transactions <- get_transactions()

# complete_data is a dataframe formed after inner joining the tables demographics and transactions
complete_data <- demographics %>%
  inner_join(transactions,by="household_id")



#null_values <- colSums(is.na(demographics))
#null_values
#unique(transactions$retail_disc)


# Listing the unique income levels present in the demographics data
y <- Filter(Negate(is.null), demographics$income)
unique(y)
##  [1] 35-49K    50-74K    25-34K    15-24K    Under 15K 75-99K    100-124K 
##  [8] 125-149K  150-174K  250K+     175-199K  200-249K 
## 12 Levels: Under 15K < 15-24K < 25-34K < 35-49K < 50-74K < ... < 250K+
# Discarding all the NA values in the table
complete_data <- na.omit(complete_data)
# Suppressing warnings
options(warn = -1)

# Selecting the colour palette
palette <- brewer.pal(3, "Paired")

# Graph for < $1

complete_data_1 <- complete_data %>%
  # Creating a column income_group by categorizing income into Low, Medium, and High
  mutate(
    income_group = case_when(
      income =="Under 15K" | income=="15-24K" | income=="25-34K" ~ "1. Low",
      income =="35-49K" | income=="50-74K" | income=="75-99K"  ~ "2. Medium",  
      income =="100-124K" | income=="125-149K" | income=="150-174K" | income=="175-199K" | income=="200-249K" | income=="250K+" ~ "3. High",
    TRUE ~ "other"))%>%
  # Grouping the complete_data dataframe by income_group
  group_by(income_group) %>%
  # Filtering row data based on the discount required to analyse
  filter(retail_disc > 0 & retail_disc < 1) %>%
  # Calculating the total sum of sales_value and total sum of retail_disc
  summarize(sum_sales = sum(sales_value), sum_discount=sum(retail_disc)) %>%
  # Creating a new column showing the percentage of sum_discount based on sum_sales
  mutate(percentage = sum_discount*100/sum_sales)

## Plotting 
# Used bar graph to plot
plot_1<-ggplot(complete_data_1,aes(x = income_group)) +
  geom_bar(aes(y = sum_sales, fill = "Sales Generated"), stat = 'identity') +
  geom_text(aes(y = sum_sales, label = scales::dollar(sum_sales)), vjust = -0.25) +
  geom_bar(aes(x = income_group, y = sum_discount, fill = "Discount Amount"), stat = "identity") +
  geom_point(aes(x = income_group, y = percentage/0.00005, group = 1), shape = 17, size = 4, color = "red") +
  geom_text(aes(y = percentage, label = scales::percent(percentage/100)), vjust = -29, size = 3) +
  scale_y_continuous(name = "Total Sales Value", labels = scales::dollar, sec.axis = sec_axis(~.*0.00005, name = "Total Percentage of Discount Used", labels = scales::number)) +
  scale_fill_manual(name = "Total Value of:", values = palette) +
  labs(title = "Retail Discount (for less than $1) v/s Total Sales Revenue (by Income Group)",
       subtitle = "Low income: Less than 35K, Medium income: 35K-99K, High income: 100K and above",
       x = "Income Group of Shoppers")
plot_1

# Graph for $1-5

complete_data_2 <- complete_data %>%
  mutate(
    income_group = case_when(
      income =="Under 15K" | income=="15-24K" | income=="25-34K" ~ "1. Low",
      income =="35-49K" | income=="50-74K" | income=="75-99K"  ~ "2. Medium",  
      income =="100-124K" | income=="125-149K" | income=="150-174K" | income=="175-199K" | income=="200-249K" | income=="250K+" ~ "3. High",
      TRUE ~ "other"))%>%
  group_by(income_group) %>%
  filter(retail_disc > 1 & retail_disc < 5) %>%
  summarize(sum_sales = sum(sales_value), sum_discount=sum(retail_disc)) %>%
  mutate(percentage = sum_discount*100/sum_sales)
  
# Used bar graph to plot
  plot_2 <- ggplot(complete_data_2,aes(x = income_group)) +
  geom_bar(aes(y = sum_sales, fill = "Sales Generated"), stat = 'identity') +
  geom_text(aes(y = sum_sales, label = scales::dollar(sum_sales)), vjust = -0.25) +
  geom_bar(aes(x = income_group, y = sum_discount, fill = "Discount Amount"), stat = "identity") +
  geom_point(aes(x = income_group, y = percentage/0.0002, group = 1), shape = 17, size = 4, color = "red") +
  geom_text(aes(y = percentage, label = scales::percent(percentage/100)), vjust = -19, size = 3) +
  scale_y_continuous(name = "Total Sales Value", labels = scales::dollar, sec.axis = sec_axis(~.*0.0002, name = "Total Percentage of Discount Used", labels = scales::number)) +
  scale_fill_manual(name = "Total Value of:", values = palette) +
  labs(title = "Retail Discount (between $1 and $5) v/s Total Sales Revenue (by Income Group)",
       subtitle = "Low income: Less than 35K, Medium income: 35K-99K, High income: 100K and above",
       x = "Income Group of Shoppers")
 options(warn = -1)
 plot_2
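The insight about discounts above $5 reported earlier is not plotted; it follows from the same pipeline with a different filter. A sketch of that check is given below (complete_data_5 is an illustrative name; this step is not echoed in the original report):

# Hypothetical check for discounts of $5 or more, using the same pipeline as above
complete_data_5 <- complete_data %>%
  mutate(
    income_group = case_when(
      income %in% c("Under 15K", "15-24K", "25-34K")   ~ "1. Low",
      income %in% c("35-49K", "50-74K", "75-99K")      ~ "2. Medium",
      income %in% c("100-124K", "125-149K", "150-174K",
                    "175-199K", "200-249K", "250K+")   ~ "3. High",
      TRUE ~ "other")) %>%
  group_by(income_group) %>%
  filter(retail_disc >= 5) %>%
  summarize(sum_sales = sum(sales_value), sum_discount = sum(retail_disc))

complete_data_5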

Business Problem 2

For the top 5 grocery product categories, which package sizes generate the highest sales? Knowing this would let the business focus on those package sizes when procuring from manufacturers.

After cleaning the data, we first identified the various product categories in Grocery and selected the top 5 categories by sales. We then analysed the sales of those categories by package size to determine which package sizes generate the maximum sales.

We found the following insights after performing the analysis and examining the graphs:

  • We observed that 12 Ounce is the best package size for selling packaged drinking water/soft drinks, generating the highest sales
  • 0.6 Ounce is the best package size for selling sweeteners, yielding the highest sales

The code and plots are given below.

## `summarise()` has grouped output by 'product_id', 'product_category'. You can
## override using the `.groups` argument.
## `summarise()` has grouped output by 'product_id', 'product_category'. You can
## override using the `.groups` argument.
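The data frames unidff and unidff_lb used below are built in a preparation step that is not echoed in this report (the summarise() messages above come from that step). A minimal sketch of how unidff_lb could be derived is given here; the column names follow the plotting code, but the exact grouping and filtering used by the team may differ. unidff, used for the tree map, can be built analogously without the per-size step.

# Hypothetical sketch of the un-echoed preparation step behind unidff_lb
unidff_lb <- transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(department == "GROCERY",
         str_detect(package_size, "OZ|LB")) %>%   # restrict to ounce/pound sizes
  group_by(product_category, package_size) %>%
  summarize(max_sales = sum(sales_value), .groups = "drop") %>%
  group_by(product_category) %>%
  slice_max(max_sales, n = 1) %>%                 # best-selling package size per category
  ungroup() %>%
  slice_max(max_sales, n = 5)                     # keep the top 5 categories overall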

#Bar plot

unidff_lb %>%
  ggplot(aes(x = product_category,y=max_sales,fill=package_size)) +
  geom_bar(stat = 'identity') + geom_text(aes(y = max_sales, label = scales::dollar(max_sales)), vjust = -0.25)  +
  labs(
    title = "Top 5 product categories having maximum sales with their respective package sizes.",
    subtitle = "This data has been collected over the year 2017."
  )

options(warn = -1)

# Tree map of the top-selling grocery categories

unidff%>%
  ggplot(aes(area = max_sales, fill = product_category, label = paste(product_category, max_sales, sep = "\n"))) +
  geom_treemap() +
  geom_treemap_text(colour = "white",
                    place = "centre",
                    size = 15) +
  theme(legend.position = "none")+
  labs(
    title = "Top 5 product categories of Groceries having maximum sales.",
    subtitle = "This data has been collected over the year 2017."
  )

Summary

The aim of this project was to determine how discount coupons should be distributed to drive higher sales, and to propose which package sizes should be stocked to support higher sales. We observed the following insights.

Insights

  • We observed that the revenue generated by issuing discounts to the middle income group was considerably higher than for the high and low income groups
  • Discount coupons of less than $1 generated the highest sales compared to $1-5 discounts
  • Discounts of more than $5 yielded minuscule sales, so those discounts can be reduced
  • The low income group receives the highest percentage of discounts compared to the other income groups
  • More discounts could be offered to the low income group to increase its sales, but that would have an impact on profitability.
  • Hence, a good strategy would be to give discounts in the $0-1 range to the medium income group to drive higher sales
  • 12 Ounce is the best package size for selling packaged drinking water, generating the highest sales
  • 0.6 Ounce is the best package size for selling sweeteners, yielding the highest sales

Implications and Recommendations

  • We suggest targeting the middle income group when giving discounts, since it yields higher sales.
  • We suggest that Regork focus more on the 12 Ounce package size when selling water.
  • We suggest that Regork focus on 0.6 Ounce sweeteners, as they generate higher sales.

Limitations

  • In our analysis we considered discount versus income group only.
  • We observed that there was very little data for discounts greater than $5.
  • With more data, we could further analyse product versus discount relationships to understand why this is happening.
  • There were too many different package-size units, so we limited our analysis to pounds and ounces.

Future Scope

  • We could perform a product versus discount analysis in the future, which could generate further insights.
  • We could analyse all the products in the Grocery category, not only the top 5.

Division of work

  • Data Cleaning on Business problem #1 - Lokesh
  • Data Analysis on Business problem #1 - Nitish
  • Data Cleaning on Business problem #2 - Arjun
  • Data Analysis on Business problem #2 - Vaishnavi
  • PPT and Report preparation done by entire team.