The objective of this project is to provide business insights to Regork, a national grocery chain to assist them in growing their business. Using our analysis we aim to find growth opportunities for them by answering the below business questions:
We answered these questions by joining combinations of transactions, products, demographics and coupons data to find important and useful insights which can help the business grow and improve its strategies.
By the end of this analysis, Regork will be able to:
Following packages were used:
library(tidyverse)
library(completejourney)
library(dplyr)
library(lubridate)
library(treemap)
library(DT)
library(RColorBrewer)
We are using the Complete Journey Study dataset for this analysis. The data represents grocery store shopping transactions over one year from a group of 2,469 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.
First of all, Import data in below way:
library(completejourney)
transactions <- as.data.frame(get_transactions())
Have a look about the tables which are being used:
colnames(transactions)
## [1] "household_id" "store_id" "basket_id"
## [4] "product_id" "quantity" "sales_value"
## [7] "retail_disc" "coupon_disc" "coupon_match_disc"
## [10] "week" "transaction_timestamp"
colnames(products)
## [1] "product_id" "manufacturer_id" "department" "brand"
## [5] "product_category" "product_type" "package_size"
colnames(demographics)
## [1] "household_id" "age" "income" "home_ownership"
## [5] "marital_status" "household_size" "household_comp" "kids_count"
colnames(coupons)
## [1] "coupon_upc" "product_id" "campaign_id"
colnames(coupon_redemptions)
## [1] "household_id" "coupon_upc" "campaign_id" "redemption_date"
glimpse(transactions)
## Rows: 1,469,307
## Columns: 11
## $ household_id <chr> "900", "900", "1228", "906", "906", "906", "906"…
## $ store_id <chr> "330", "330", "406", "319", "319", "319", "319",…
## $ basket_id <chr> "31198570044", "31198570047", "31198655051", "31…
## $ product_id <chr> "1095275", "9878513", "1041453", "1020156", "105…
## $ quantity <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sales_value <dbl> 0.50, 0.99, 1.43, 1.50, 2.78, 5.49, 1.50, 1.88, …
## $ retail_disc <dbl> 0.00, 0.10, 0.15, 0.29, 0.80, 0.50, 0.29, 0.21, …
## $ coupon_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ transaction_timestamp <dttm> 2017-01-01 06:53:26, 2017-01-01 07:10:28, 2017-…
glimpse(products)
## Rows: 92,331
## Columns: 7
## $ product_id <chr> "25671", "26081", "26093", "26190", "26355", "26426",…
## $ manufacturer_id <chr> "2", "2", "69", "69", "69", "69", "69", "69", "69", "…
## $ department <chr> "GROCERY", "MISCELLANEOUS", "PASTRY", "GROCERY", "GRO…
## $ brand <fct> National, National, Private, Private, Private, Privat…
## $ product_category <chr> "FRZN ICE", NA, "BREAD", "FRUIT - SHELF STABLE", "COO…
## $ product_type <chr> "ICE - CRUSHED/CUBED", NA, "BREAD:ITALIAN/FRENCH", "A…
## $ package_size <chr> "22 LB", NA, NA, "50 OZ", "14 OZ", "2.5 OZ", "16 OZ",…
glimpse(demographics)
## Rows: 801
## Columns: 8
## $ household_id <chr> "1", "1001", "1003", "1004", "101", "1012", "1014", "10…
## $ age <ord> 65+, 45-54, 35-44, 25-34, 45-54, 35-44, 45-54, 45-54, 4…
## $ income <ord> 35-49K, 50-74K, 25-34K, 15-24K, Under 15K, 35-49K, 15-2…
## $ home_ownership <ord> Homeowner, Homeowner, NA, NA, Homeowner, NA, NA, Homeow…
## $ marital_status <ord> Married, Unmarried, Unmarried, Unmarried, Married, Marr…
## $ household_size <ord> 2, 1, 1, 1, 4, 5+, 4, 1, 5+, 2, 5+, 4, 2, 1, 5+, 1, 1, …
## $ household_comp <ord> 2 Adults No Kids, 1 Adult No Kids, 1 Adult No Kids, 1 A…
## $ kids_count <ord> 0, 0, 0, 0, 2, 3+, 2, 0, 3+, 0, 3+, 2, 0, 0, 3+, 0, 0, …
glimpse(coupons)
## Rows: 116,204
## Columns: 3
## $ coupon_upc <chr> "10000085207", "10000085207", "10000085207", "10000085207"…
## $ product_id <chr> "9676830", "9676943", "9676944", "9676947", "9677008", "96…
## $ campaign_id <chr> "26", "26", "26", "26", "26", "26", "26", "26", "26", "26"…
glimpse(coupon_redemptions)
## Rows: 2,102
## Columns: 4
## $ household_id <chr> "1029", "1029", "165", "712", "712", "2488", "2488", "…
## $ coupon_upc <chr> "51380041013", "51380041313", "53377610033", "51380041…
## $ campaign_id <chr> "26", "26", "26", "26", "26", "26", "26", "26", "26", …
## $ redemption_date <date> 2017-01-01, 2017-01-01, 2017-01-03, 2017-01-07, 2017-…
The below graph uses transactions & demographics data from the complete journey data set. We have tried to showcase a quantitative relation between different age-groups & their total sales value. To display the quantitative relationship we used a bar plot. From the bar plot it can be analyzed that the age-group of 19-24 is contributing lowest in total sales and on the other hand the age-group 45-54 is contributing highest.
Based on our observations in the first graph, we decided to deep-dive more into age-groups category to check why the total sales from a particular age-group is less compared to other age-groups. First, we extracted the number of households count for each age-group. In that we observed that three age-groups: 19-24, 55-64 & 65+ is having nearly same household counts.
demographics %>%
group_by(age) %>%
summarise(total_households = n())
## # A tibble: 6 × 2
## age total_households
## <ord> <int>
## 1 19-24 46
## 2 25-34 142
## 3 35-44 194
## 4 45-54 288
## 5 55-64 59
## 6 65+ 72
Our current no. of households for 19-24 age group is pretty low compared to other groups. So, the business problem we’re solving is to get more 19-24 age group households in the store.
We found three departments: NUTRITION, DELI & PASTRY where 19-24 age group is spending money equally with respect to other age groups.
## # A tibble: 23 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 19-24 CNTRL/STORE SUP 1
## 2 19-24 RESTAURANT 1.39
## 3 19-24 PHOTO & VIDEO 4.98
## 4 19-24 COUPON 12.4
## 5 19-24 TRAVEL & LEISURE 29.7
## 6 19-24 CHEF SHOPPE 41.4
## 7 19-24 GARDEN CENTER 46.9
## 8 19-24 SEAFOOD 169.
## 9 19-24 COSMETICS 503.
## 10 19-24 FLORAL 520.
## 11 19-24 NUTRITION 806.
## 12 19-24 SALAD BAR 956.
## 13 19-24 SEAFOOD-PCKGD 1088.
## 14 19-24 SPIRITS 1616.
## 15 19-24 PASTRY 1665.
## 16 19-24 MISCELLANEOUS 1771.
## 17 19-24 DELI 2902.
## 18 19-24 FUEL 6651.
## 19 19-24 MEAT-PCKGD 7057.
## 20 19-24 PRODUCE 7071.
## 21 19-24 MEAT 7556.
## 22 19-24 DRUG GM 17024.
## 23 19-24 GROCERY 68181.
## # A tibble: 27 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 45-54 POSTAL CENTER 4.57
## 2 45-54 PHOTO & VIDEO 4.98
## 3 45-54 CNTRL/STORE SUP 9.95
## 4 45-54 GM MERCH EXP 19.3
## 5 45-54 AUTOMOTIVE 55.1
## 6 45-54 FROZEN GROCERY 80.7
## 7 45-54 COUPON 143.
## 8 45-54 TRAVEL & LEISURE 315.
## 9 45-54 RESTAURANT 357.
## 10 45-54 CHEF SHOPPE 401.
## 11 45-54 GARDEN CENTER 1360.
## 12 45-54 SPIRITS 2004.
## 13 45-54 SALAD BAR 3398.
## 14 45-54 COSMETICS 3947.
## 15 45-54 SEAFOOD 4638.
## 16 45-54 FLORAL 4970.
## 17 45-54 SEAFOOD-PCKGD 7600.
## 18 45-54 NUTRITION 11817.
## 19 45-54 PASTRY 14662.
## 20 45-54 MISCELLANEOUS 19367.
## 21 45-54 DELI 32841.
## 22 45-54 MEAT-PCKGD 42920.
## 23 45-54 MEAT 63829.
## 24 45-54 PRODUCE 72134.
## 25 45-54 FUEL 72694.
## 26 45-54 DRUG GM 129664.
## 27 45-54 GROCERY 482531.
## # A tibble: 25 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 55-64 POSTAL CENTER 1
## 2 55-64 CNTRL/STORE SUP 2
## 3 55-64 GM MERCH EXP 12.0
## 4 55-64 RESTAURANT 15
## 5 55-64 FROZEN GROCERY 30.6
## 6 55-64 COUPON 37.4
## 7 55-64 TRAVEL & LEISURE 75.6
## 8 55-64 CHEF SHOPPE 82.4
## 9 55-64 SPIRITS 130.
## 10 55-64 GARDEN CENTER 242.
## 11 55-64 SALAD BAR 506.
## 12 55-64 FLORAL 734.
## 13 55-64 COSMETICS 756.
## 14 55-64 SEAFOOD 806.
## 15 55-64 NUTRITION 1652.
## 16 55-64 SEAFOOD-PCKGD 1908.
## 17 55-64 PASTRY 2417.
## 18 55-64 MISCELLANEOUS 5623.
## 19 55-64 DELI 6610.
## 20 55-64 MEAT-PCKGD 7699.
## 21 55-64 MEAT 13051.
## 22 55-64 FUEL 13354.
## 23 55-64 PRODUCE 14416.
## 24 55-64 DRUG GM 18815.
## 25 55-64 GROCERY 84164.
## # A tibble: 25 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 65+ PHOTO & VIDEO 1.99
## 2 65+ GM MERCH EXP 2.95
## 3 65+ AUTOMOTIVE 11.0
## 4 65+ FROZEN GROCERY 13.1
## 5 65+ COUPON 18.5
## 6 65+ RESTAURANT 22.4
## 7 65+ SPIRITS 77.8
## 8 65+ TRAVEL & LEISURE 90.5
## 9 65+ CHEF SHOPPE 130.
## 10 65+ GARDEN CENTER 388.
## 11 65+ COSMETICS 474.
## 12 65+ FLORAL 681.
## 13 65+ SEAFOOD 741.
## 14 65+ SALAD BAR 1212.
## 15 65+ SEAFOOD-PCKGD 1323.
## 16 65+ MISCELLANEOUS 1911.
## 17 65+ NUTRITION 2115.
## 18 65+ PASTRY 3525.
## 19 65+ DELI 6607.
## 20 65+ MEAT-PCKGD 7632.
## 21 65+ MEAT 11781.
## 22 65+ FUEL 12554.
## 23 65+ PRODUCE 16341.
## 24 65+ DRUG GM 21689.
## 25 65+ GROCERY 87255.
## # A tibble: 27 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 35-44 PROD-WHS SALES 2.52
## 2 35-44 CNTRL/STORE SUP 3
## 3 35-44 PHOTO & VIDEO 8.6
## 4 35-44 GM MERCH EXP 14.9
## 5 35-44 AUTOMOTIVE 71.7
## 6 35-44 FROZEN GROCERY 98.6
## 7 35-44 COUPON 123.
## 8 35-44 RESTAURANT 130.
## 9 35-44 TRAVEL & LEISURE 211.
## 10 35-44 CHEF SHOPPE 338.
## 11 35-44 GARDEN CENTER 762.
## 12 35-44 SPIRITS 1060.
## 13 35-44 SEAFOOD 2165.
## 14 35-44 SALAD BAR 2358.
## 15 35-44 FLORAL 3606.
## 16 35-44 COSMETICS 3705.
## 17 35-44 SEAFOOD-PCKGD 4223.
## 18 35-44 MISCELLANEOUS 10013.
## 19 35-44 PASTRY 10079.
## 20 35-44 NUTRITION 11668.
## 21 35-44 DELI 24195.
## 22 35-44 MEAT-PCKGD 33912.
## 23 35-44 MEAT 46814.
## 24 35-44 PRODUCE 51782.
## 25 35-44 FUEL 67669.
## 26 35-44 DRUG GM 92438.
## 27 35-44 GROCERY 356905.
Using this above observations we plotted column bar plot for three departments show casing total sales value per age-group
To further analyze the data, we wanted to confirm coupons redeemed based on their age groups for those three departments.
The below graph uses coupons, coupon redemptions, product, and demographic tables to populate the data for Nutrition, Pastry, and Deli departments. The graph shows the number of coupons redeemed based on the age group.For each age group, we use the top five coupons upc which are being redeemed.
The graph further confirmed our analysis on how coupons redeemed are lowest in the 19-24 demographic due to low number of households.
This analysis aims to answer the business question: ‘Do certain products tend to be purchased with other products on a regular basis?’ Product pairing is an important marketing tool and this analysis aims to find the top products which can be paired together to fuel revenues further.
We started with analyzing products, transactions and demographics to get general insights from the data. To understand which products were bought together the most, we grouped the products for each basket based on two major divisions (Grocery and Other departments) and further dug in to make associations between each product pair based on their quantity.
Once the product pairs and their quantities were associated, we sliced the top 50 pairs which can be sold together.
The graph below shows the product pairs being sold together the most in Grocery department. We see that Soft Drinks and Potato Chips/Nachos have a high correlation and can be paired. Other such grouping can be done with Soup & Shredded cheese.
The graph below shows the product pairs being sold together the most in Other departments. We see that Baby Food and Baby Juices have a high correlation and can be grouped together. Other such pairs we see are: Turkey & Ham and Candy Bars & Chewing gums.
Our analysis aimed to suggest solutions to some crucial questions for Regork. We proposed solutions to inform the company about the potential of an untapped demographic and pattern between products being bought together.
To summarise our solutions and its implications:
Limitation: Our current basket analysis is more of an indicative analysis. To create more confidence, we need to run Apriori algorithm to find confidence of the rule sets we defined.