Introduction

The goal of our project is to use data analytics to improve retailer (national grocery chain named Regork) efficiency in purchasing and marketing by means of better understanding customer habits and establishing a long-term relationships with them. We are analyzing the data from the Complete Journey Dataset. With this dataset, the question that we want to find an answer to is which customer is buying what types of products and from which department, as well as the demographics (age-group) the customer belong, to better understand their shopping trends and improve sales.

The objective of this project is to explore the Complete Journey dataset and answers some interesting questions related to demographics and sales as per department. Our Business problem here is to analyze the sales for demographic (age-group) with respect to department. We are focussing on age groups that are having fewer sales and then deep dive into why their sales are less and which departments are less consumed by that particular age group. We also tried to analyze the coupons redeemed by different age groups and tried to tie them down to our main goal of enhancing the Regork business with the help of marketing.

We can use transaction and customer data to find important and useful insights which can help the business grow and improve it’s strategies. It can be used to answer questions like:

  • What departments are bringing the most and least sales?
  • What is the general trend of age groups spending ?
  • Which demographic group is spending more on which department products?
  • How many coupons are redeemed by customers based on the department?

Packages Required

Following packages were used:

  • Knitr: Used to display an aligned table on the screen
  • Tidyverse: Used to tidy data
  • Lubridate: Used to manipulate date-time
  • Treemap: Visualize hierarchical data using nested rectangles
  • Dplyr: Used for data manipulation
  • Ggplot2: Used to plot charts
  • Plotly: Used to plot interactive charts
library(tidyverse)
library(completejourney)
library(dplyr)
library(lubridate)
library(treemap)
library(DT)

Data Preparation

This sections contains all the procedures followed in getting the data analysis ready. Each step has been explained and the codes have been given.

Data Import

We are using the Complete Journey Study dataset for this analysis. The data represents grocery store shopping transactions over one year from a group of 2,469 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.

Data Import Code:

library(completejourney)
transactions <- as.data.frame(get_transactions())

Details about the tables used:

colnames(transactions)
##  [1] "household_id"          "store_id"              "basket_id"            
##  [4] "product_id"            "quantity"              "sales_value"          
##  [7] "retail_disc"           "coupon_disc"           "coupon_match_disc"    
## [10] "week"                  "transaction_timestamp"
colnames(products)
## [1] "product_id"       "manufacturer_id"  "department"       "brand"           
## [5] "product_category" "product_type"     "package_size"
colnames(demographics)
## [1] "household_id"   "age"            "income"         "home_ownership"
## [5] "marital_status" "household_size" "household_comp" "kids_count"
colnames(coupons)
## [1] "coupon_upc"  "product_id"  "campaign_id"
colnames(coupon_redemptions)
## [1] "household_id"    "coupon_upc"      "campaign_id"     "redemption_date"
Data Preview
glimpse(transactions)
## Rows: 1,469,307
## Columns: 11
## $ household_id          <chr> "900", "900", "1228", "906", "906", "906", "906"…
## $ store_id              <chr> "330", "330", "406", "319", "319", "319", "319",…
## $ basket_id             <chr> "31198570044", "31198570047", "31198655051", "31…
## $ product_id            <chr> "1095275", "9878513", "1041453", "1020156", "105…
## $ quantity              <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sales_value           <dbl> 0.50, 0.99, 1.43, 1.50, 2.78, 5.49, 1.50, 1.88, …
## $ retail_disc           <dbl> 0.00, 0.10, 0.15, 0.29, 0.80, 0.50, 0.29, 0.21, …
## $ coupon_disc           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ transaction_timestamp <dttm> 2017-01-01 06:53:26, 2017-01-01 07:10:28, 2017-…
glimpse(products)
## Rows: 92,331
## Columns: 7
## $ product_id       <chr> "25671", "26081", "26093", "26190", "26355", "26426",…
## $ manufacturer_id  <chr> "2", "2", "69", "69", "69", "69", "69", "69", "69", "…
## $ department       <chr> "GROCERY", "MISCELLANEOUS", "PASTRY", "GROCERY", "GRO…
## $ brand            <fct> National, National, Private, Private, Private, Privat…
## $ product_category <chr> "FRZN ICE", NA, "BREAD", "FRUIT - SHELF STABLE", "COO…
## $ product_type     <chr> "ICE - CRUSHED/CUBED", NA, "BREAD:ITALIAN/FRENCH", "A…
## $ package_size     <chr> "22 LB", NA, NA, "50 OZ", "14 OZ", "2.5 OZ", "16 OZ",…
glimpse(demographics)
## Rows: 801
## Columns: 8
## $ household_id   <chr> "1", "1001", "1003", "1004", "101", "1012", "1014", "10…
## $ age            <ord> 65+, 45-54, 35-44, 25-34, 45-54, 35-44, 45-54, 45-54, 4…
## $ income         <ord> 35-49K, 50-74K, 25-34K, 15-24K, Under 15K, 35-49K, 15-2…
## $ home_ownership <ord> Homeowner, Homeowner, NA, NA, Homeowner, NA, NA, Homeow…
## $ marital_status <ord> Married, Unmarried, Unmarried, Unmarried, Married, Marr…
## $ household_size <ord> 2, 1, 1, 1, 4, 5+, 4, 1, 5+, 2, 5+, 4, 2, 1, 5+, 1, 1, …
## $ household_comp <ord> 2 Adults No Kids, 1 Adult No Kids, 1 Adult No Kids, 1 A…
## $ kids_count     <ord> 0, 0, 0, 0, 2, 3+, 2, 0, 3+, 0, 3+, 2, 0, 0, 3+, 0, 0, …
glimpse(coupons)
## Rows: 116,204
## Columns: 3
## $ coupon_upc  <chr> "10000085207", "10000085207", "10000085207", "10000085207"…
## $ product_id  <chr> "9676830", "9676943", "9676944", "9676947", "9677008", "96…
## $ campaign_id <chr> "26", "26", "26", "26", "26", "26", "26", "26", "26", "26"…
glimpse(coupon_redemptions)
## Rows: 2,102
## Columns: 4
## $ household_id    <chr> "1029", "1029", "165", "712", "712", "2488", "2488", "…
## $ coupon_upc      <chr> "51380041013", "51380041313", "53377610033", "51380041…
## $ campaign_id     <chr> "26", "26", "26", "26", "26", "26", "26", "26", "26", …
## $ redemption_date <date> 2017-01-01, 2017-01-01, 2017-01-03, 2017-01-07, 2017-…
Data Description
demographics
Variable Description
household_id Estimated age range
income Household income
home_ownership Homeowner, renter, etc.
marital_status Marital status (Married, Single, Unknown)
household_size Size of household up to 5+
household_comp Household composition
kid_count Number of children present up to 3+
transactions
Variable Description
household_id Estimated age range
basket_id Uniquely identifies each purchase occasion
product_id Uniquely identifies each product
quantity Number of the product purchased during the trip
retail_disc Discount applied due to the retailer’s loyalty card program
coupon_disc Discount applied due to a manufacturer coupon
coupon_match_disc Discount applied due to retailer’s match of manufacturer coupon
week Week of the transaction; Ranges 1-53
transaction_timestamp Date and time of day when the transaction occurred
products
Variable Description
product_id Uniquely identifies each product
manufacturer_id Uniquely identifies each manufacturer
department Groups similar products together
brand Indicates private or national label brand
product_category Groups similar products together at lower level
product_type Groups similar products together at lowest level
package_size Indicates package size (not available for all products)
coupons
Variable Description
household_id Estimated age range
coupon_upc Uniquely identifies each coupon (unique to household and campaign)
campaign_id Uniquely identifies each campaign
redemption_date Date when the coupon was redeemed
coupon redemptions
Variable Description
household_id Estimated age range
coupon_upc Uniquely identifies each coupon (unique to household and campaign)
campaign_id Uniquely identifies each campaign
redemption_date Date when the coupon was redeemed

Exploratory Data Analysis

Sales Analysis

The above graph uses transactions & demographics data from the complete journey data set. We have tried to showcase a quantitative relation between different age-groups & their total sales value. To display the quantitative relationship we used a tree map plot. From the tree map plot it can be analyzed that age-group with more total sales value has been applied with a darker palette. Hence, according to our analysis, the age-group of 19-24 is having lighter palette scheme means their contribution to total sales value is very less compared to other age-groups

Based on our observations in the first graph, we decided to deep-dive more into age-groups category to check why the total sales from a particular age-group is less compared to other age-groups. First, we extracted the number of households count for each age-group. In that we observed that three age-groups: 19-24, 55-64 & 65+ is having nearly same household counts.

demographics %>% 
  group_by(age) %>%
  summarise(total_households = n())
## # A tibble: 6 × 2
##   age   total_households
##   <ord>            <int>
## 1 19-24               46
## 2 25-34              142
## 3 35-44              194
## 4 45-54              288
## 5 55-64               59
## 6 65+                 72

We identified three departments: NUTRITION, DELI & PASTRY where the contribution to total sales value from each age-group was having comparable quantities.

## # A tibble: 23 × 3
## # Groups:   age [1]
##    age   department         ttsles
##    <ord> <chr>               <dbl>
##  1 19-24 CNTRL/STORE SUP      1   
##  2 19-24 RESTAURANT           1.39
##  3 19-24 PHOTO & VIDEO        4.98
##  4 19-24 COUPON              12.4 
##  5 19-24 TRAVEL & LEISURE    29.7 
##  6 19-24 CHEF SHOPPE         41.4 
##  7 19-24 GARDEN CENTER       46.9 
##  8 19-24 SEAFOOD            169.  
##  9 19-24 COSMETICS          503.  
## 10 19-24 FLORAL             520.  
## 11 19-24 NUTRITION          806.  
## 12 19-24 SALAD BAR          956.  
## 13 19-24 SEAFOOD-PCKGD     1088.  
## 14 19-24 SPIRITS           1616.  
## 15 19-24 PASTRY            1665.  
## 16 19-24 MISCELLANEOUS     1771.  
## 17 19-24 DELI              2902.  
## 18 19-24 FUEL              6651.  
## 19 19-24 MEAT-PCKGD        7057.  
## 20 19-24 PRODUCE           7071.  
## 21 19-24 MEAT              7556.  
## 22 19-24 DRUG GM          17024.  
## 23 19-24 GROCERY          68181.
## # A tibble: 27 × 3
## # Groups:   age [1]
##    age   department          ttsles
##    <ord> <chr>                <dbl>
##  1 45-54 POSTAL CENTER         4.57
##  2 45-54 PHOTO & VIDEO         4.98
##  3 45-54 CNTRL/STORE SUP       9.95
##  4 45-54 GM MERCH EXP         19.3 
##  5 45-54 AUTOMOTIVE           55.1 
##  6 45-54 FROZEN GROCERY       80.7 
##  7 45-54 COUPON              143.  
##  8 45-54 TRAVEL & LEISURE    315.  
##  9 45-54 RESTAURANT          357.  
## 10 45-54 CHEF SHOPPE         401.  
## 11 45-54 GARDEN CENTER      1360.  
## 12 45-54 SPIRITS            2004.  
## 13 45-54 SALAD BAR          3398.  
## 14 45-54 COSMETICS          3947.  
## 15 45-54 SEAFOOD            4638.  
## 16 45-54 FLORAL             4970.  
## 17 45-54 SEAFOOD-PCKGD      7600.  
## 18 45-54 NUTRITION         11817.  
## 19 45-54 PASTRY            14662.  
## 20 45-54 MISCELLANEOUS     19367.  
## 21 45-54 DELI              32841.  
## 22 45-54 MEAT-PCKGD        42920.  
## 23 45-54 MEAT              63829.  
## 24 45-54 PRODUCE           72134.  
## 25 45-54 FUEL              72694.  
## 26 45-54 DRUG GM          129664.  
## 27 45-54 GROCERY          482531.
## # A tibble: 25 × 3
## # Groups:   age [1]
##    age   department        ttsles
##    <ord> <chr>              <dbl>
##  1 55-64 POSTAL CENTER        1  
##  2 55-64 CNTRL/STORE SUP      2  
##  3 55-64 GM MERCH EXP        12.0
##  4 55-64 RESTAURANT          15  
##  5 55-64 FROZEN GROCERY      30.6
##  6 55-64 COUPON              37.4
##  7 55-64 TRAVEL & LEISURE    75.6
##  8 55-64 CHEF SHOPPE         82.4
##  9 55-64 SPIRITS            130. 
## 10 55-64 GARDEN CENTER      242. 
## 11 55-64 SALAD BAR          506. 
## 12 55-64 FLORAL             734. 
## 13 55-64 COSMETICS          756. 
## 14 55-64 SEAFOOD            806. 
## 15 55-64 NUTRITION         1652. 
## 16 55-64 SEAFOOD-PCKGD     1908. 
## 17 55-64 PASTRY            2417. 
## 18 55-64 MISCELLANEOUS     5623. 
## 19 55-64 DELI              6610. 
## 20 55-64 MEAT-PCKGD        7699. 
## 21 55-64 MEAT             13051. 
## 22 55-64 FUEL             13354. 
## 23 55-64 PRODUCE          14416. 
## 24 55-64 DRUG GM          18815. 
## 25 55-64 GROCERY          84164.
## # A tibble: 25 × 3
## # Groups:   age [1]
##    age   department         ttsles
##    <ord> <chr>               <dbl>
##  1 65+   PHOTO & VIDEO        1.99
##  2 65+   GM MERCH EXP         2.95
##  3 65+   AUTOMOTIVE          11.0 
##  4 65+   FROZEN GROCERY      13.1 
##  5 65+   COUPON              18.5 
##  6 65+   RESTAURANT          22.4 
##  7 65+   SPIRITS             77.8 
##  8 65+   TRAVEL & LEISURE    90.5 
##  9 65+   CHEF SHOPPE        130.  
## 10 65+   GARDEN CENTER      388.  
## 11 65+   COSMETICS          474.  
## 12 65+   FLORAL             681.  
## 13 65+   SEAFOOD            741.  
## 14 65+   SALAD BAR         1212.  
## 15 65+   SEAFOOD-PCKGD     1323.  
## 16 65+   MISCELLANEOUS     1911.  
## 17 65+   NUTRITION         2115.  
## 18 65+   PASTRY            3525.  
## 19 65+   DELI              6607.  
## 20 65+   MEAT-PCKGD        7632.  
## 21 65+   MEAT             11781.  
## 22 65+   FUEL             12554.  
## 23 65+   PRODUCE          16341.  
## 24 65+   DRUG GM          21689.  
## 25 65+   GROCERY          87255.
## # A tibble: 27 × 3
## # Groups:   age [1]
##    age   department          ttsles
##    <ord> <chr>                <dbl>
##  1 35-44 PROD-WHS SALES        2.52
##  2 35-44 CNTRL/STORE SUP       3   
##  3 35-44 PHOTO & VIDEO         8.6 
##  4 35-44 GM MERCH EXP         14.9 
##  5 35-44 AUTOMOTIVE           71.7 
##  6 35-44 FROZEN GROCERY       98.6 
##  7 35-44 COUPON              123.  
##  8 35-44 RESTAURANT          130.  
##  9 35-44 TRAVEL & LEISURE    211.  
## 10 35-44 CHEF SHOPPE         338.  
## 11 35-44 GARDEN CENTER       762.  
## 12 35-44 SPIRITS            1060.  
## 13 35-44 SEAFOOD            2165.  
## 14 35-44 SALAD BAR          2358.  
## 15 35-44 FLORAL             3606.  
## 16 35-44 COSMETICS          3705.  
## 17 35-44 SEAFOOD-PCKGD      4223.  
## 18 35-44 MISCELLANEOUS     10013.  
## 19 35-44 PASTRY            10079.  
## 20 35-44 NUTRITION         11668.  
## 21 35-44 DELI              24195.  
## 22 35-44 MEAT-PCKGD        33912.  
## 23 35-44 MEAT              46814.  
## 24 35-44 PRODUCE           51782.  
## 25 35-44 FUEL              67669.  
## 26 35-44 DRUG GM           92438.  
## 27 35-44 GROCERY          356905.

Using this above observations we plotted column bar plot for three departments show casing total sales value per age-group

Department Analysis

Using the observations in Graph 1 - Plot 2, we deep-dived & filtered only the three age-groups: 19-24, 55-64 & 65+ as these have nearly equal household counts. Hence, we plotted column bar plot to represent the difference and it is observed that even when household counts for three age-groups is nearly same still 19-24 age-group contribution in three departments is less compared to other two. Thus, it can be concluded that age-group 19-24 is purchasing less from the three departments.

Coupons Redeemed

The below graph uses coupons, coupon redemptions, product, and demographic tables to populate the data for Nutrition, Pastry, and Deli departments. The graph shows the number of coupons redeemed based on the age group.For each age group, we use the top five coupons upc which are being redeemed.

Summary

The above exercise helped us understand the complete journey dataset better. Now we have answers to some inetersting questions and are in a position to do follow-up and deep dive analysis. The following is the summary of the analysis:

  1. The sales for demographic (age-group) 19-24 has the least sales and 45-54 has the most sales.
  2. The age-groups 19-24, 55-64 & 65+were having nearly same household counts and hence when we compared their sales for few departments we got to know that 19-24 demographic group consume less nutrition.
  3. The departments Nutrition, Deli and Pastry where the contribution to total sales value from each age-group was having comparable quantities and when compared Nutrition was consumed less by age group 19-24.
  4. The coupons distributed within various departments gave us an insight that GROCERY department has redeemed most coupons whereas PHOTO & VIDEO was the least.
  5. The Top 5 coupons redeemed for departments Nutrition, Deli and Pastry in each age group signifies that 19-24 demographic group also redeem less coupons for those department.

In short, our goal was to analyze the sales for different age groups and extract insights to help Regork business to grow by providing more coupons for departments which are having less sales in this case NUTRITION and create awareness about how nutrition is more important in our daily life with the help of marketing team, this will allow Regork to increase sales for those age groups and generate more revenue.