## Introduction

The goal of this analysis is to identify key demographic groups among retail shoppers and understand their shopping behavior. We use the completejourney package to analyze household transaction data together with household demographics. The primary focus is on which demographic groups shop most frequently and how they respond to promotions, measured here by coupon discounts recorded on transactions.

## Packages Required

install.packages("completejourney", repos = "https://cran.rstudio.com/")
library(completejourney)
library(dplyr)
library(ggplot2)


transactions <- get_transactions()
demographics <- completejourney::demographics


head(transactions)
## # A tibble: 6 × 11
##   household_id store_id basket_id   product_id quantity sales_value retail_disc
##   <chr>        <chr>    <chr>       <chr>         <dbl>       <dbl>       <dbl>
## 1 900          330      31198570044 1095275           1        0.5         0   
## 2 900          330      31198570047 9878513           1        0.99        0.1 
## 3 1228         406      31198655051 1041453           1        1.43        0.15
## 4 906          319      31198705046 1020156           1        1.5         0.29
## 5 906          319      31198705046 1053875           2        2.78        0.8 
## 6 906          319      31198705046 1060312           1        5.49        0.5 
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## #   transaction_timestamp <dttm>
head(demographics)
## # A tibble: 6 × 8
##   household_id age   income    home_ownership marital_status household_size
##   <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
## 1 1            65+   35-49K    Homeowner      Married        2             
## 2 1001         45-54 50-74K    Homeowner      Unmarried      1             
## 3 1003         35-44 25-34K    <NA>           Unmarried      1             
## 4 1004         25-34 15-24K    <NA>           Unmarried      1             
## 5 101          45-54 Under 15K Homeowner      Married        4             
## 6 1012         35-44 35-49K    <NA>           Married        5+            
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
colnames(demographics)
## [1] "household_id"   "age"            "income"         "home_ownership"
## [5] "marital_status" "household_size" "household_comp" "kids_count"
transactions <- transactions %>% filter(!is.na(sales_value))

transactions <- transactions %>% distinct()
demographics <- demographics %>% distinct()
summary(transactions)
##  household_id         store_id          basket_id          product_id       
##  Length:1469307     Length:1469307     Length:1469307     Length:1469307    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     quantity        sales_value       retail_disc        coupon_disc      
##  Min.   :    0.0   Min.   :  0.000   Min.   :  0.0000   Min.   : 0.00000  
##  1st Qu.:    1.0   1st Qu.:  1.290   1st Qu.:  0.0000   1st Qu.: 0.00000  
##  Median :    1.0   Median :  2.000   Median :  0.0100   Median : 0.00000  
##  Mean   :  104.1   Mean   :  3.128   Mean   :  0.5388   Mean   : 0.01794  
##  3rd Qu.:    1.0   3rd Qu.:  3.490   3rd Qu.:  0.6800   3rd Qu.: 0.00000  
##  Max.   :89638.0   Max.   :840.000   Max.   :130.0200   Max.   :55.93000  
##  coupon_match_disc       week       transaction_timestamp           
##  Min.   :0.000000   Min.   : 1.00   Min.   :2017-01-01 06:53:26.00  
##  1st Qu.:0.000000   1st Qu.:14.00   1st Qu.:2017-04-01 19:31:00.00  
##  Median :0.000000   Median :27.00   Median :2017-07-02 11:17:58.00  
##  Mean   :0.003092   Mean   :27.37   Mean   :2017-07-02 10:58:07.54  
##  3rd Qu.:0.000000   3rd Qu.:41.00   3rd Qu.:2017-10-02 12:28:34.00  
##  Max.   :7.700000   Max.   :53.00   Max.   :2017-12-31 23:01:20.00
summary(demographics)
##  household_id          age            income               home_ownership
##  Length:801         19-24: 46   50-74K   :192   Renter            : 42   
##  Class :character   25-34:142   35-49K   :172   Probable Renter   : 11   
##  Mode  :character   35-44:194   75-99K   : 96   Homeowner         :504   
##                     45-54:288   25-34K   : 77   Probable Homeowner: 11   
##                     55-64: 59   15-24K   : 74   Unknown           :  0   
##                     65+  : 72   Under 15K: 61   NA's              :233   
##                                 (Other)  :129                            
##    marital_status household_size          household_comp   kids_count 
##  Married  :340    1 :255         1 Adult Kids    : 93    0      :513  
##  Unmarried:324    2 :318         1 Adult No Kids :255    1      :159  
##  Unknown  :  0    3 :109         2 Adults Kids   :195    2      : 60  
##  NA's     :137    4 : 53         2 Adults No Kids:258    3+     : 69  
##                   5+: 66         Unknown         :  0    Unknown:  0  
##                                                                       
## 
ggplot(transactions, aes(x = sales_value)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black", alpha = 0.7) +
  coord_cartesian(xlim = c(0, 100)) + # Zoom to the lower range without dropping rows above 100
  labs(title = "Distribution of Sales Values",
       subtitle = "Focusing on the lower range of retail transaction sales values",
       x = "Sales Value",
       y = "Frequency") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "gray80"),
    panel.grid.minor = element_blank()
  )

merged_data <- transactions %>%
  inner_join(demographics, by = "household_id")
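
The inner join keeps only transactions from households that appear in the demographics table (801 households in the summary above), so every later result describes that subset. A quick coverage check makes the size of the subset explicit. The sketch below uses small made-up tables, `toy_transactions` and `toy_demographics`, that only mimic the `household_id` column; the names and values are illustrative, not taken from the data.

```r
library(dplyr)

# Hypothetical toy tables that only mimic the household_id column
toy_transactions <- tibble(
  household_id = c("1", "1", "2", "3"),
  sales_value  = c(1, 2, 3, 4)
)
toy_demographics <- tibble(
  household_id = c("1", "3"),
  age          = c("25-34", "65+")
)

# Share of households and transaction rows that survive an inner join
coverage <- toy_transactions %>%
  mutate(has_demographics = household_id %in% toy_demographics$household_id) %>%
  summarize(
    households         = n_distinct(household_id),
    households_matched = n_distinct(household_id[has_demographics]),
    rows_kept_pct      = mean(has_demographics) * 100
  )

print(coverage)
```

Running the same summary on `transactions` and `demographics` would show how much of the full data the demographic comparisons actually cover.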

age_spending <- merged_data %>%
  group_by(age) %>%
  summarize(average_spending = mean(sales_value))

ggplot(age_spending, aes(x = age, y = average_spending)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Average Spending by Age",
       x = "Age",
       y = "Average Spending")

income_spending <- merged_data %>%
  group_by(income) %>%
  summarize(average_spending = mean(sales_value))

ggplot(income_spending, aes(x = income, y = average_spending)) +
  geom_bar(stat = "identity", fill = "lightcoral") +
  labs(title = "Average Spending by Income Level",
       x = "Income Level",
       y = "Average Spending")

household_size_spending <- merged_data %>%
  group_by(household_size) %>%
  summarize(average_spending = mean(sales_value))

ggplot(household_size_spending, aes(x = household_size, y = average_spending)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  labs(title = "Average Spending by Household Size",
       x = "Household Size",
       y = "Average Spending")

purchase_frequency_age <- merged_data %>%
  group_by(age) %>%
  summarize(frequency = n())


ggplot(purchase_frequency_age, aes(x = age, y = frequency)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  labs(title = "Purchase Frequency by Age",
       x = "Age",
       y = "Frequency") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "gray80"),
    panel.grid.minor = element_blank()
  )

purchase_frequency_income <- merged_data %>%
  group_by(income) %>%
  summarize(frequency = n())

ggplot(purchase_frequency_income, aes(x = income, y = frequency)) +
  geom_bar(stat = "identity", fill = "darkorange") +
  labs(title = "Purchase Frequency by Income Level",
       x = "Income Level",
       y = "Frequency") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "gray80"),
    panel.grid.minor = element_blank()
  )
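
The frequency counts above are row counts, and each row is one product within a basket, so a group can look "frequent" simply because it contains more households or buys more items per trip. One possible refinement, sketched here on made-up data, counts distinct baskets per household and then averages within each group. The `toy` table and its values are hypothetical; only the column names follow `merged_data`.

```r
library(dplyr)

# Hypothetical toy data; column names follow merged_data
toy <- tibble(
  household_id = c("1", "1", "1", "2", "3", "3"),
  basket_id    = c("a", "a", "b", "c", "d", "e"),
  age          = c("25-34", "25-34", "25-34", "45-54", "45-54", "45-54")
)

# Raw row counts per age group (what n() measures)
row_counts <- toy %>%
  count(age, name = "frequency")

# Shopping trips per household, averaged within each age group
trips <- toy %>%
  group_by(age, household_id) %>%
  summarize(trips = n_distinct(basket_id), .groups = "drop") %>%
  group_by(age) %>%
  summarize(
    households          = n(),
    trips_per_household = mean(trips),
    .groups = "drop"
  )

print(row_counts)
print(trips)
```

In the toy data both age groups have the same number of rows, yet trips per household differ, which is the distinction the normalization is meant to expose.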

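promotion_response <- merged_data %>%
  # Flag line items where a coupon discount was applied
  mutate(promotion_used = if_else(coupon_disc > 0, "Yes", "No")) %>%
  group_by(age, promotion_used) %>%
  # Keep the age grouping so the percentages below sum to 100 within each age group
  summarize(count = n(), .groups = "drop_last") %>%
  mutate(percentage = count / sum(count) * 100)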

ggplot(promotion_response, aes(x = age, y = percentage, fill = promotion_used)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Promotion Response Rate by Age",
       x = "Age",
       y = "Percentage",
       fill = "Promotion Used")

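basket_size <- merged_data %>%
  # Total the items in each basket first; each row is one product within a basket
  group_by(income, basket_id) %>%
  summarize(items = sum(quantity, na.rm = TRUE), .groups = "drop") %>%
  group_by(income) %>%
  summarize(average_basket_size = mean(items))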


ggplot(basket_size, aes(x = income, y = average_basket_size)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Average Basket Size by Income Level",
       x = "Income Level",
       y = "Average Basket Size") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "gray80"),
    panel.grid.minor = element_blank()
  )

## Interpretation

The analysis shows that shopping behavior differs across demographic groups. Average spending varies across age groups, with younger shoppers tending to spend more. Because each row of the transaction data is one product within a basket, the spending averages above are per line item rather than per shopping trip, and the differences have not been tested for statistical significance. Higher-income households show different spending patterns from lower-income households, and households with more members tend to have higher average spending.
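
To compare spending per shopping trip rather than per line item, one option is to total each basket first and then average the basket totals within each group. The following is a minimal sketch on made-up data; the `toy` table and its values are hypothetical, and only the column names follow `merged_data`.

```r
library(dplyr)

# Hypothetical toy data; column names follow merged_data
toy <- tibble(
  household_id = c("1", "1", "1", "2", "2", "3"),
  basket_id    = c("a", "a", "b", "c", "c", "d"),
  age          = c("25-34", "25-34", "25-34", "45-54", "45-54", "45-54"),
  sales_value  = c(2, 3, 5, 1, 1, 10)
)

# Average per line item: the mean of the raw rows
per_item <- toy %>%
  group_by(age) %>%
  summarize(average_spending = mean(sales_value), .groups = "drop")

# Average per basket: total each basket, then average the totals
per_basket <- toy %>%
  group_by(age, basket_id) %>%
  summarize(basket_total = sum(sales_value), .groups = "drop") %>%
  group_by(age) %>%
  summarize(average_basket_spend = mean(basket_total), .groups = "drop")

print(per_item)
print(per_basket)
```

The two measures can differ in size, and in principle in ranking, whenever groups put different numbers of items in a basket, so it is worth stating which one a chart reports.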

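Promotion response, measured as the share of purchased line items that carry a coupon discount, varies across age groups, indicating that some age groups are more inclined to use promotions than others. Average basket size also varies by income level, suggesting that higher-income shoppers tend to purchase more items per transaction. Very large quantity values (the maximum in the summary above is 89,638) can dominate these means, so the basket-size comparison should be read with that caveat.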
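
Promotion response can also be expressed per basket, as the share of shopping trips in which at least one coupon was redeemed. The sketch below shows that calculation on made-up data; the `toy` table and its values are hypothetical, and treating `coupon_disc > 0` as "promotion used" follows the definition used earlier in this report.

```r
library(dplyr)

# Hypothetical toy data; column names follow merged_data
toy <- tibble(
  basket_id   = c("a", "a", "b", "c", "c", "d"),
  age         = c("25-34", "25-34", "25-34", "45-54", "45-54", "45-54"),
  coupon_disc = c(0, 0.5, 0, 0, 0, 0)
)

# A basket counts as a coupon basket if any of its items had a coupon discount
coupon_rate <- toy %>%
  group_by(age, basket_id) %>%
  summarize(used_coupon = any(coupon_disc > 0), .groups = "drop") %>%
  group_by(age) %>%
  summarize(
    baskets         = n(),
    pct_with_coupon = mean(used_coupon) * 100,
    .groups = "drop"
  )

print(coupon_rate)
```

A per-basket rate is usually higher than a per-item rate because a single couponed item marks the whole trip, so the two should not be compared directly.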

## Recommendations

  1. Targeted Promotions: Develop targeted marketing campaigns for higher-income households to leverage their distinct spending patterns. Focus on age groups that are more responsive to promotions to maximize the effectiveness of promotional campaigns.

  2. Age-Specific Marketing: Create age-specific marketing strategies to cater to the spending habits of different age groups. Younger shoppers tend to spend more per transaction, so offering them special deals and promotions might increase their shopping frequency.

  3. Enhanced Loyalty Programs: Implement loyalty programs that offer personalized rewards based on demographic insights to drive customer engagement and retention. Focus on households with more members by offering family-oriented promotions and rewards, as they tend to spend more.

  4. Inventory and Stock Management: Plan inventory based on the average basket size by income level to ensure that popular items are always in stock. Monitor purchase patterns over time to identify seasonal trends and adjust inventory accordingly.

  5. Promotion Strategies: Design promotions that appeal to specific age groups that are more likely to use them. Consider offering bundled products or discounts on frequently purchased items to increase the average basket size.