The goal of our project is to use data analytics to improve retailer (national grocery chain named Regork) efficiency in purchasing and marketing by means of better understanding customer habits and establishing a long-term relationships with them. We are analyzing the data from the Complete Journey Dataset. With this dataset, the question that we want to find an answer to is which customer is buying what types of products and from which department, as well as the demographics (age-group) the customer belong, to better understand their shopping trends and improve sales.
The objective of this project is to explore the Complete Journey dataset and answers some interesting questions related to demographics and sales as per department. Our Business problem here is to analyze the sales for demographic (age-group) with respect to department. We are focussing on age groups that are having fewer sales and then deep dive into why their sales are less and which departments are less consumed by that particular age group. We also tried to analyze the coupons redeemed by different age groups and tried to tie them down to our main goal of enhancing the Regork business with the help of marketing.
We can use transaction and customer data to find important and useful insights which can help the business grow and improve it’s strategies. It can be used to answer questions like:
Following packages were used:
library(tidyverse)
library(completejourney)
library(dplyr)
library(lubridate)
library(treemap)
library(DT)
This sections contains all the procedures followed in getting the data analysis ready. Each step has been explained and the codes have been given.
We are using the Complete Journey Study dataset for this analysis. The data represents grocery store shopping transactions over one year from a group of 2,469 households who are frequent shoppers at a retailer. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are included.
Data Import Code:
library(completejourney)
transactions <- as.data.frame(get_transactions())
Details about the tables used:
colnames(transactions)
## [1] "household_id" "store_id" "basket_id"
## [4] "product_id" "quantity" "sales_value"
## [7] "retail_disc" "coupon_disc" "coupon_match_disc"
## [10] "week" "transaction_timestamp"
colnames(products)
## [1] "product_id" "manufacturer_id" "department" "brand"
## [5] "product_category" "product_type" "package_size"
colnames(demographics)
## [1] "household_id" "age" "income" "home_ownership"
## [5] "marital_status" "household_size" "household_comp" "kids_count"
colnames(coupons)
## [1] "coupon_upc" "product_id" "campaign_id"
colnames(coupon_redemptions)
## [1] "household_id" "coupon_upc" "campaign_id" "redemption_date"
glimpse(transactions)
## Rows: 1,469,307
## Columns: 11
## $ household_id <chr> "900", "900", "1228", "906", "906", "906", "906"…
## $ store_id <chr> "330", "330", "406", "319", "319", "319", "319",…
## $ basket_id <chr> "31198570044", "31198570047", "31198655051", "31…
## $ product_id <chr> "1095275", "9878513", "1041453", "1020156", "105…
## $ quantity <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sales_value <dbl> 0.50, 0.99, 1.43, 1.50, 2.78, 5.49, 1.50, 1.88, …
## $ retail_disc <dbl> 0.00, 0.10, 0.15, 0.29, 0.80, 0.50, 0.29, 0.21, …
## $ coupon_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ transaction_timestamp <dttm> 2017-01-01 06:53:26, 2017-01-01 07:10:28, 2017-…
glimpse(products)
## Rows: 92,331
## Columns: 7
## $ product_id <chr> "25671", "26081", "26093", "26190", "26355", "26426",…
## $ manufacturer_id <chr> "2", "2", "69", "69", "69", "69", "69", "69", "69", "…
## $ department <chr> "GROCERY", "MISCELLANEOUS", "PASTRY", "GROCERY", "GRO…
## $ brand <fct> National, National, Private, Private, Private, Privat…
## $ product_category <chr> "FRZN ICE", NA, "BREAD", "FRUIT - SHELF STABLE", "COO…
## $ product_type <chr> "ICE - CRUSHED/CUBED", NA, "BREAD:ITALIAN/FRENCH", "A…
## $ package_size <chr> "22 LB", NA, NA, "50 OZ", "14 OZ", "2.5 OZ", "16 OZ",…
glimpse(demographics)
## Rows: 801
## Columns: 8
## $ household_id <chr> "1", "1001", "1003", "1004", "101", "1012", "1014", "10…
## $ age <ord> 65+, 45-54, 35-44, 25-34, 45-54, 35-44, 45-54, 45-54, 4…
## $ income <ord> 35-49K, 50-74K, 25-34K, 15-24K, Under 15K, 35-49K, 15-2…
## $ home_ownership <ord> Homeowner, Homeowner, NA, NA, Homeowner, NA, NA, Homeow…
## $ marital_status <ord> Married, Unmarried, Unmarried, Unmarried, Married, Marr…
## $ household_size <ord> 2, 1, 1, 1, 4, 5+, 4, 1, 5+, 2, 5+, 4, 2, 1, 5+, 1, 1, …
## $ household_comp <ord> 2 Adults No Kids, 1 Adult No Kids, 1 Adult No Kids, 1 A…
## $ kids_count <ord> 0, 0, 0, 0, 2, 3+, 2, 0, 3+, 0, 3+, 2, 0, 0, 3+, 0, 0, …
glimpse(coupons)
## Rows: 116,204
## Columns: 3
## $ coupon_upc <chr> "10000085207", "10000085207", "10000085207", "10000085207"…
## $ product_id <chr> "9676830", "9676943", "9676944", "9676947", "9677008", "96…
## $ campaign_id <chr> "26", "26", "26", "26", "26", "26", "26", "26", "26", "26"…
glimpse(coupon_redemptions)
## Rows: 2,102
## Columns: 4
## $ household_id <chr> "1029", "1029", "165", "712", "712", "2488", "2488", "…
## $ coupon_upc <chr> "51380041013", "51380041313", "53377610033", "51380041…
## $ campaign_id <chr> "26", "26", "26", "26", "26", "26", "26", "26", "26", …
## $ redemption_date <date> 2017-01-01, 2017-01-01, 2017-01-03, 2017-01-07, 2017-…
| Variable | Description |
|---|---|
| household_id | Estimated age range |
| income | Household income |
| home_ownership | Homeowner, renter, etc. |
| marital_status | Marital status (Married, Single, Unknown) |
| household_size | Size of household up to 5+ |
| household_comp | Household composition |
| kid_count | Number of children present up to 3+ |
| Variable | Description |
|---|---|
| household_id | Estimated age range |
| basket_id | Uniquely identifies each purchase occasion |
| product_id | Uniquely identifies each product |
| quantity | Number of the product purchased during the trip |
| retail_disc | Discount applied due to the retailer’s loyalty card program |
| coupon_disc | Discount applied due to a manufacturer coupon |
| coupon_match_disc | Discount applied due to retailer’s match of manufacturer coupon |
| week | Week of the transaction; Ranges 1-53 |
| transaction_timestamp | Date and time of day when the transaction occurred |
| Variable | Description |
|---|---|
| product_id | Uniquely identifies each product |
| manufacturer_id | Uniquely identifies each manufacturer |
| department | Groups similar products together |
| brand | Indicates private or national label brand |
| product_category | Groups similar products together at lower level |
| product_type | Groups similar products together at lowest level |
| package_size | Indicates package size (not available for all products) |
| Variable | Description |
|---|---|
| household_id | Estimated age range |
| coupon_upc | Uniquely identifies each coupon (unique to household and campaign) |
| campaign_id | Uniquely identifies each campaign |
| redemption_date | Date when the coupon was redeemed |
| Variable | Description |
|---|---|
| household_id | Estimated age range |
| coupon_upc | Uniquely identifies each coupon (unique to household and campaign) |
| campaign_id | Uniquely identifies each campaign |
| redemption_date | Date when the coupon was redeemed |
The above graph uses transactions & demographics data from the complete journey data set. We have tried to showcase a quantitative relation between different age-groups & their total sales value. To display the quantitative relationship we used a tree map plot. From the tree map plot it can be analyzed that age-group with more total sales value has been applied with a darker palette. Hence, according to our analysis, the age-group of 19-24 is having lighter palette scheme means their contribution to total sales value is very less compared to other age-groups
Based on our observations in the first graph, we decided to deep-dive more into age-groups category to check why the total sales from a particular age-group is less compared to other age-groups. First, we extracted the number of households count for each age-group. In that we observed that three age-groups: 19-24, 55-64 & 65+ is having nearly same household counts.
demographics %>%
group_by(age) %>%
summarise(total_households = n())
## # A tibble: 6 × 2
## age total_households
## <ord> <int>
## 1 19-24 46
## 2 25-34 142
## 3 35-44 194
## 4 45-54 288
## 5 55-64 59
## 6 65+ 72
We identified three departments: NUTRITION, DELI & PASTRY where the contribution to total sales value from each age-group was having comparable quantities.
## # A tibble: 23 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 19-24 CNTRL/STORE SUP 1
## 2 19-24 RESTAURANT 1.39
## 3 19-24 PHOTO & VIDEO 4.98
## 4 19-24 COUPON 12.4
## 5 19-24 TRAVEL & LEISURE 29.7
## 6 19-24 CHEF SHOPPE 41.4
## 7 19-24 GARDEN CENTER 46.9
## 8 19-24 SEAFOOD 169.
## 9 19-24 COSMETICS 503.
## 10 19-24 FLORAL 520.
## 11 19-24 NUTRITION 806.
## 12 19-24 SALAD BAR 956.
## 13 19-24 SEAFOOD-PCKGD 1088.
## 14 19-24 SPIRITS 1616.
## 15 19-24 PASTRY 1665.
## 16 19-24 MISCELLANEOUS 1771.
## 17 19-24 DELI 2902.
## 18 19-24 FUEL 6651.
## 19 19-24 MEAT-PCKGD 7057.
## 20 19-24 PRODUCE 7071.
## 21 19-24 MEAT 7556.
## 22 19-24 DRUG GM 17024.
## 23 19-24 GROCERY 68181.
## # A tibble: 27 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 45-54 POSTAL CENTER 4.57
## 2 45-54 PHOTO & VIDEO 4.98
## 3 45-54 CNTRL/STORE SUP 9.95
## 4 45-54 GM MERCH EXP 19.3
## 5 45-54 AUTOMOTIVE 55.1
## 6 45-54 FROZEN GROCERY 80.7
## 7 45-54 COUPON 143.
## 8 45-54 TRAVEL & LEISURE 315.
## 9 45-54 RESTAURANT 357.
## 10 45-54 CHEF SHOPPE 401.
## 11 45-54 GARDEN CENTER 1360.
## 12 45-54 SPIRITS 2004.
## 13 45-54 SALAD BAR 3398.
## 14 45-54 COSMETICS 3947.
## 15 45-54 SEAFOOD 4638.
## 16 45-54 FLORAL 4970.
## 17 45-54 SEAFOOD-PCKGD 7600.
## 18 45-54 NUTRITION 11817.
## 19 45-54 PASTRY 14662.
## 20 45-54 MISCELLANEOUS 19367.
## 21 45-54 DELI 32841.
## 22 45-54 MEAT-PCKGD 42920.
## 23 45-54 MEAT 63829.
## 24 45-54 PRODUCE 72134.
## 25 45-54 FUEL 72694.
## 26 45-54 DRUG GM 129664.
## 27 45-54 GROCERY 482531.
## # A tibble: 25 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 55-64 POSTAL CENTER 1
## 2 55-64 CNTRL/STORE SUP 2
## 3 55-64 GM MERCH EXP 12.0
## 4 55-64 RESTAURANT 15
## 5 55-64 FROZEN GROCERY 30.6
## 6 55-64 COUPON 37.4
## 7 55-64 TRAVEL & LEISURE 75.6
## 8 55-64 CHEF SHOPPE 82.4
## 9 55-64 SPIRITS 130.
## 10 55-64 GARDEN CENTER 242.
## 11 55-64 SALAD BAR 506.
## 12 55-64 FLORAL 734.
## 13 55-64 COSMETICS 756.
## 14 55-64 SEAFOOD 806.
## 15 55-64 NUTRITION 1652.
## 16 55-64 SEAFOOD-PCKGD 1908.
## 17 55-64 PASTRY 2417.
## 18 55-64 MISCELLANEOUS 5623.
## 19 55-64 DELI 6610.
## 20 55-64 MEAT-PCKGD 7699.
## 21 55-64 MEAT 13051.
## 22 55-64 FUEL 13354.
## 23 55-64 PRODUCE 14416.
## 24 55-64 DRUG GM 18815.
## 25 55-64 GROCERY 84164.
## # A tibble: 25 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 65+ PHOTO & VIDEO 1.99
## 2 65+ GM MERCH EXP 2.95
## 3 65+ AUTOMOTIVE 11.0
## 4 65+ FROZEN GROCERY 13.1
## 5 65+ COUPON 18.5
## 6 65+ RESTAURANT 22.4
## 7 65+ SPIRITS 77.8
## 8 65+ TRAVEL & LEISURE 90.5
## 9 65+ CHEF SHOPPE 130.
## 10 65+ GARDEN CENTER 388.
## 11 65+ COSMETICS 474.
## 12 65+ FLORAL 681.
## 13 65+ SEAFOOD 741.
## 14 65+ SALAD BAR 1212.
## 15 65+ SEAFOOD-PCKGD 1323.
## 16 65+ MISCELLANEOUS 1911.
## 17 65+ NUTRITION 2115.
## 18 65+ PASTRY 3525.
## 19 65+ DELI 6607.
## 20 65+ MEAT-PCKGD 7632.
## 21 65+ MEAT 11781.
## 22 65+ FUEL 12554.
## 23 65+ PRODUCE 16341.
## 24 65+ DRUG GM 21689.
## 25 65+ GROCERY 87255.
## # A tibble: 27 × 3
## # Groups: age [1]
## age department ttsles
## <ord> <chr> <dbl>
## 1 35-44 PROD-WHS SALES 2.52
## 2 35-44 CNTRL/STORE SUP 3
## 3 35-44 PHOTO & VIDEO 8.6
## 4 35-44 GM MERCH EXP 14.9
## 5 35-44 AUTOMOTIVE 71.7
## 6 35-44 FROZEN GROCERY 98.6
## 7 35-44 COUPON 123.
## 8 35-44 RESTAURANT 130.
## 9 35-44 TRAVEL & LEISURE 211.
## 10 35-44 CHEF SHOPPE 338.
## 11 35-44 GARDEN CENTER 762.
## 12 35-44 SPIRITS 1060.
## 13 35-44 SEAFOOD 2165.
## 14 35-44 SALAD BAR 2358.
## 15 35-44 FLORAL 3606.
## 16 35-44 COSMETICS 3705.
## 17 35-44 SEAFOOD-PCKGD 4223.
## 18 35-44 MISCELLANEOUS 10013.
## 19 35-44 PASTRY 10079.
## 20 35-44 NUTRITION 11668.
## 21 35-44 DELI 24195.
## 22 35-44 MEAT-PCKGD 33912.
## 23 35-44 MEAT 46814.
## 24 35-44 PRODUCE 51782.
## 25 35-44 FUEL 67669.
## 26 35-44 DRUG GM 92438.
## 27 35-44 GROCERY 356905.
Using this above observations we plotted column bar plot for three departments show casing total sales value per age-group
Using the observations in Graph 1 - Plot 2, we deep-dived & filtered only the three age-groups: 19-24, 55-64 & 65+ as these have nearly equal household counts. Hence, we plotted column bar plot to represent the difference and it is observed that even when household counts for three age-groups is nearly same still 19-24 age-group contribution in three departments is less compared to other two. Thus, it can be concluded that age-group 19-24 is purchasing less from the three departments.
The below graph uses coupons, coupon redemptions, product, and demographic tables to populate the data for Nutrition, Pastry, and Deli departments. The graph shows the number of coupons redeemed based on the age group.For each age group, we use the top five coupons upc which are being redeemed.
The above exercise helped us understand the complete journey dataset better. Now we have answers to some inetersting questions and are in a position to do follow-up and deep dive analysis. The following is the summary of the analysis:
In short, our goal was to analyze the sales for different age groups and extract insights to help Regork business to grow by providing more coupons for departments which are having less sales in this case NUTRITION and create awareness about how nutrition is more important in our daily life with the help of marketing team, this will allow Regork to increase sales for those age groups and generate more revenue.