Business (Data Scientists) Team: Nageswara Rao Channaboina, Himalaya Batra, Venkata Chaitanya Ammisetti, Rudranshi Singh, Tejaswini Sapkota

Introduction

The objective of this project is to provide business insights to Regork, a national grocery chain to assist them in growing their business. Using our analysis we aim to find growth opportunities for them by answering the below business questions:

  1. Is there a certain demographic which has future potential but is not currently captured in the market share?
  2. Do certain products tend to be purchased with other products on a regular basis?

We answered these questions by joining combinations of transactions, products, demographics and coupons data to find important and useful insights which can help the business grow and improve its strategies.

By the end of this analysis, Regork will be able to:

  1. Understand the potential of an untapped demographic market to capture market share
  2. Find products that could be grouped to boost sales revenue

Packages Required

Following packages were used:

library(tidyverse)
library(completejourney)
library(dplyr)
library(lubridate)
library(treemap)
library(DT)
library(RColorBrewer)
  • Knitr: Used to display an aligned table on the screen
  • Tidyverse: Used to tidy data
  • Lubridate: Used to manipulate date-time
  • Treemap: Visualize hierarchical data using nested rectangles
  • Dplyr: Used for data manipulation
  • Ggplot2: Used to plot charts
  • Plotly: Used to plot interactive charts

Data Preparation

Data Import

We are using the Complete Journey Study dataset for this analysis. The data represents grocery store shopping transactions over one year from a group of 2,469 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

First of all, Import data in below way:

library(completejourney)
transactions <- as.data.frame(get_transactions())

Have a look about the tables which are being used:

colnames(transactions)
##  [1] "household_id"          "store_id"              "basket_id"            
##  [4] "product_id"            "quantity"              "sales_value"          
##  [7] "retail_disc"           "coupon_disc"           "coupon_match_disc"    
## [10] "week"                  "transaction_timestamp"
colnames(products)
## [1] "product_id"       "manufacturer_id"  "department"       "brand"           
## [5] "product_category" "product_type"     "package_size"
colnames(demographics)
## [1] "household_id"   "age"            "income"         "home_ownership"
## [5] "marital_status" "household_size" "household_comp" "kids_count"
colnames(coupons)
## [1] "coupon_upc"  "product_id"  "campaign_id"
colnames(coupon_redemptions)
## [1] "household_id"    "coupon_upc"      "campaign_id"     "redemption_date"

Data Preview: Get a glimpse about data one by one

glimpse(transactions)
## Rows: 1,469,307
## Columns: 11
## $ household_id          <chr> "900", "900", "1228", "906", "906", "906", "906"…
## $ store_id              <chr> "330", "330", "406", "319", "319", "319", "319",…
## $ basket_id             <chr> "31198570044", "31198570047", "31198655051", "31…
## $ product_id            <chr> "1095275", "9878513", "1041453", "1020156", "105…
## $ quantity              <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sales_value           <dbl> 0.50, 0.99, 1.43, 1.50, 2.78, 5.49, 1.50, 1.88, …
## $ retail_disc           <dbl> 0.00, 0.10, 0.15, 0.29, 0.80, 0.50, 0.29, 0.21, …
## $ coupon_disc           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ transaction_timestamp <dttm> 2017-01-01 06:53:26, 2017-01-01 07:10:28, 2017-…
glimpse(products)
## Rows: 92,331
## Columns: 7
## $ product_id       <chr> "25671", "26081", "26093", "26190", "26355", "26426",…
## $ manufacturer_id  <chr> "2", "2", "69", "69", "69", "69", "69", "69", "69", "…
## $ department       <chr> "GROCERY", "MISCELLANEOUS", "PASTRY", "GROCERY", "GRO…
## $ brand            <fct> National, National, Private, Private, Private, Privat…
## $ product_category <chr> "FRZN ICE", NA, "BREAD", "FRUIT - SHELF STABLE", "COO…
## $ product_type     <chr> "ICE - CRUSHED/CUBED", NA, "BREAD:ITALIAN/FRENCH", "A…
## $ package_size     <chr> "22 LB", NA, NA, "50 OZ", "14 OZ", "2.5 OZ", "16 OZ",…
glimpse(demographics)
## Rows: 801
## Columns: 8
## $ household_id   <chr> "1", "1001", "1003", "1004", "101", "1012", "1014", "10…
## $ age            <ord> 65+, 45-54, 35-44, 25-34, 45-54, 35-44, 45-54, 45-54, 4…
## $ income         <ord> 35-49K, 50-74K, 25-34K, 15-24K, Under 15K, 35-49K, 15-2…
## $ home_ownership <ord> Homeowner, Homeowner, NA, NA, Homeowner, NA, NA, Homeow…
## $ marital_status <ord> Married, Unmarried, Unmarried, Unmarried, Married, Marr…
## $ household_size <ord> 2, 1, 1, 1, 4, 5+, 4, 1, 5+, 2, 5+, 4, 2, 1, 5+, 1, 1, …
## $ household_comp <ord> 2 Adults No Kids, 1 Adult No Kids, 1 Adult No Kids, 1 A…
## $ kids_count     <ord> 0, 0, 0, 0, 2, 3+, 2, 0, 3+, 0, 3+, 2, 0, 0, 3+, 0, 0, …
glimpse(coupons)
## Rows: 116,204
## Columns: 3
## $ coupon_upc  <chr> "10000085207", "10000085207", "10000085207", "10000085207"…
## $ product_id  <chr> "9676830", "9676943", "9676944", "9676947", "9677008", "96…
## $ campaign_id <chr> "26", "26", "26", "26", "26", "26", "26", "26", "26", "26"…
glimpse(coupon_redemptions)
## Rows: 2,102
## Columns: 4
## $ household_id    <chr> "1029", "1029", "165", "712", "712", "2488", "2488", "…
## $ coupon_upc      <chr> "51380041013", "51380041313", "53377610033", "51380041…
## $ campaign_id     <chr> "26", "26", "26", "26", "26", "26", "26", "26", "26", …
## $ redemption_date <date> 2017-01-01, 2017-01-01, 2017-01-03, 2017-01-07, 2017-…

Data Visualisation & Analysis

Sales Analysis

The below graph uses transactions & demographics data from the complete journey data set. We have tried to showcase a quantitative relation between different age-groups & their total sales value. To display the quantitative relationship we used a bar plot. From the bar plot it can be analyzed that the age-group of 19-24 is contributing lowest in total sales and on the other hand the age-group 45-54 is contributing highest.

Based on our observations in the first graph, we decided to deep-dive more into age-groups category to check why the total sales from a particular age-group is less compared to other age-groups. First, we extracted the number of households count for each age-group. In that we observed that three age-groups: 19-24, 55-64 & 65+ is having nearly same household counts.

demographics %>% 
  group_by(age) %>%
  summarise(total_households = n())
## # A tibble: 6 × 2
##   age   total_households
##   <ord>            <int>
## 1 19-24               46
## 2 25-34              142
## 3 35-44              194
## 4 45-54              288
## 5 55-64               59
## 6 65+                 72

Our current no. of households for 19-24 age group is pretty low compared to other groups. So, the business problem we’re solving is to get more 19-24 age group households in the store.

We found three departments: NUTRITION, DELI & PASTRY where 19-24 age group is spending money equally with respect to other age groups.

## # A tibble: 23 × 3
## # Groups:   age [1]
##    age   department         ttsles
##    <ord> <chr>               <dbl>
##  1 19-24 CNTRL/STORE SUP      1   
##  2 19-24 RESTAURANT           1.39
##  3 19-24 PHOTO & VIDEO        4.98
##  4 19-24 COUPON              12.4 
##  5 19-24 TRAVEL & LEISURE    29.7 
##  6 19-24 CHEF SHOPPE         41.4 
##  7 19-24 GARDEN CENTER       46.9 
##  8 19-24 SEAFOOD            169.  
##  9 19-24 COSMETICS          503.  
## 10 19-24 FLORAL             520.  
## 11 19-24 NUTRITION          806.  
## 12 19-24 SALAD BAR          956.  
## 13 19-24 SEAFOOD-PCKGD     1088.  
## 14 19-24 SPIRITS           1616.  
## 15 19-24 PASTRY            1665.  
## 16 19-24 MISCELLANEOUS     1771.  
## 17 19-24 DELI              2902.  
## 18 19-24 FUEL              6651.  
## 19 19-24 MEAT-PCKGD        7057.  
## 20 19-24 PRODUCE           7071.  
## 21 19-24 MEAT              7556.  
## 22 19-24 DRUG GM          17024.  
## 23 19-24 GROCERY          68181.
## # A tibble: 27 × 3
## # Groups:   age [1]
##    age   department          ttsles
##    <ord> <chr>                <dbl>
##  1 45-54 POSTAL CENTER         4.57
##  2 45-54 PHOTO & VIDEO         4.98
##  3 45-54 CNTRL/STORE SUP       9.95
##  4 45-54 GM MERCH EXP         19.3 
##  5 45-54 AUTOMOTIVE           55.1 
##  6 45-54 FROZEN GROCERY       80.7 
##  7 45-54 COUPON              143.  
##  8 45-54 TRAVEL & LEISURE    315.  
##  9 45-54 RESTAURANT          357.  
## 10 45-54 CHEF SHOPPE         401.  
## 11 45-54 GARDEN CENTER      1360.  
## 12 45-54 SPIRITS            2004.  
## 13 45-54 SALAD BAR          3398.  
## 14 45-54 COSMETICS          3947.  
## 15 45-54 SEAFOOD            4638.  
## 16 45-54 FLORAL             4970.  
## 17 45-54 SEAFOOD-PCKGD      7600.  
## 18 45-54 NUTRITION         11817.  
## 19 45-54 PASTRY            14662.  
## 20 45-54 MISCELLANEOUS     19367.  
## 21 45-54 DELI              32841.  
## 22 45-54 MEAT-PCKGD        42920.  
## 23 45-54 MEAT              63829.  
## 24 45-54 PRODUCE           72134.  
## 25 45-54 FUEL              72694.  
## 26 45-54 DRUG GM          129664.  
## 27 45-54 GROCERY          482531.
## # A tibble: 25 × 3
## # Groups:   age [1]
##    age   department        ttsles
##    <ord> <chr>              <dbl>
##  1 55-64 POSTAL CENTER        1  
##  2 55-64 CNTRL/STORE SUP      2  
##  3 55-64 GM MERCH EXP        12.0
##  4 55-64 RESTAURANT          15  
##  5 55-64 FROZEN GROCERY      30.6
##  6 55-64 COUPON              37.4
##  7 55-64 TRAVEL & LEISURE    75.6
##  8 55-64 CHEF SHOPPE         82.4
##  9 55-64 SPIRITS            130. 
## 10 55-64 GARDEN CENTER      242. 
## 11 55-64 SALAD BAR          506. 
## 12 55-64 FLORAL             734. 
## 13 55-64 COSMETICS          756. 
## 14 55-64 SEAFOOD            806. 
## 15 55-64 NUTRITION         1652. 
## 16 55-64 SEAFOOD-PCKGD     1908. 
## 17 55-64 PASTRY            2417. 
## 18 55-64 MISCELLANEOUS     5623. 
## 19 55-64 DELI              6610. 
## 20 55-64 MEAT-PCKGD        7699. 
## 21 55-64 MEAT             13051. 
## 22 55-64 FUEL             13354. 
## 23 55-64 PRODUCE          14416. 
## 24 55-64 DRUG GM          18815. 
## 25 55-64 GROCERY          84164.
## # A tibble: 25 × 3
## # Groups:   age [1]
##    age   department         ttsles
##    <ord> <chr>               <dbl>
##  1 65+   PHOTO & VIDEO        1.99
##  2 65+   GM MERCH EXP         2.95
##  3 65+   AUTOMOTIVE          11.0 
##  4 65+   FROZEN GROCERY      13.1 
##  5 65+   COUPON              18.5 
##  6 65+   RESTAURANT          22.4 
##  7 65+   SPIRITS             77.8 
##  8 65+   TRAVEL & LEISURE    90.5 
##  9 65+   CHEF SHOPPE        130.  
## 10 65+   GARDEN CENTER      388.  
## 11 65+   COSMETICS          474.  
## 12 65+   FLORAL             681.  
## 13 65+   SEAFOOD            741.  
## 14 65+   SALAD BAR         1212.  
## 15 65+   SEAFOOD-PCKGD     1323.  
## 16 65+   MISCELLANEOUS     1911.  
## 17 65+   NUTRITION         2115.  
## 18 65+   PASTRY            3525.  
## 19 65+   DELI              6607.  
## 20 65+   MEAT-PCKGD        7632.  
## 21 65+   MEAT             11781.  
## 22 65+   FUEL             12554.  
## 23 65+   PRODUCE          16341.  
## 24 65+   DRUG GM          21689.  
## 25 65+   GROCERY          87255.
## # A tibble: 27 × 3
## # Groups:   age [1]
##    age   department          ttsles
##    <ord> <chr>                <dbl>
##  1 35-44 PROD-WHS SALES        2.52
##  2 35-44 CNTRL/STORE SUP       3   
##  3 35-44 PHOTO & VIDEO         8.6 
##  4 35-44 GM MERCH EXP         14.9 
##  5 35-44 AUTOMOTIVE           71.7 
##  6 35-44 FROZEN GROCERY       98.6 
##  7 35-44 COUPON              123.  
##  8 35-44 RESTAURANT          130.  
##  9 35-44 TRAVEL & LEISURE    211.  
## 10 35-44 CHEF SHOPPE         338.  
## 11 35-44 GARDEN CENTER       762.  
## 12 35-44 SPIRITS            1060.  
## 13 35-44 SEAFOOD            2165.  
## 14 35-44 SALAD BAR          2358.  
## 15 35-44 FLORAL             3606.  
## 16 35-44 COSMETICS          3705.  
## 17 35-44 SEAFOOD-PCKGD      4223.  
## 18 35-44 MISCELLANEOUS     10013.  
## 19 35-44 PASTRY            10079.  
## 20 35-44 NUTRITION         11668.  
## 21 35-44 DELI              24195.  
## 22 35-44 MEAT-PCKGD        33912.  
## 23 35-44 MEAT              46814.  
## 24 35-44 PRODUCE           51782.  
## 25 35-44 FUEL              67669.  
## 26 35-44 DRUG GM           92438.  
## 27 35-44 GROCERY          356905.

Using this above observations we plotted column bar plot for three departments show casing total sales value per age-group

Coupons Redeemed

To further analyze the data, we wanted to confirm coupons redeemed based on their age groups for those three departments.

The below graph uses coupons, coupon redemptions, product, and demographic tables to populate the data for Nutrition, Pastry, and Deli departments. The graph shows the number of coupons redeemed based on the age group.For each age group, we use the top five coupons upc which are being redeemed.

The graph further confirmed our analysis on how coupons redeemed are lowest in the 19-24 demographic due to low number of households.

Basket Analysis

This analysis aims to answer the business question: ‘Do certain products tend to be purchased with other products on a regular basis?’ Product pairing is an important marketing tool and this analysis aims to find the top products which can be paired together to fuel revenues further.

We started with analyzing products, transactions and demographics to get general insights from the data. To understand which products were bought together the most, we grouped the products for each basket based on two major divisions (Grocery and Other departments) and further dug in to make associations between each product pair based on their quantity.

Once the product pairs and their quantities were associated, we sliced the top 50 pairs which can be sold together.

The graph below shows the product pairs being sold together the most in Grocery department. We see that Soft Drinks and Potato Chips/Nachos have a high correlation and can be paired. Other such grouping can be done with Soup & Shredded cheese.

The graph below shows the product pairs being sold together the most in Other departments. We see that Baby Food and Baby Juices have a high correlation and can be grouped together. Other such pairs we see are: Turkey & Ham and Candy Bars & Chewing gums.

Summary

Our analysis aimed to suggest solutions to some crucial questions for Regork. We proposed solutions to inform the company about the potential of an untapped demographic and pattern between products being bought together.

To summarise our solutions and its implications:

  • The sales for demographic (age-group) 19-24 is currently the least due to low number of households. Increasing the footfall of the store by capturing this demographic by giving them coupons can greatly increase our market share in the area.
  • Products can be paired based on our analysis and could either be shelved together or provided in multi-pack to increase sales revenue.

Limitation: Our current basket analysis is more of an indicative analysis. To create more confidence, we need to run Apriori algorithm to find confidence of the rule sets we defined.