Introduction

Potential Growth in Demographic Areas

I am trying to solve the business problem of not marketing correctly to specific demographics. This analysis uncovers areas to increase revenue through targeted marketing based on existing spending habits and where there is over/under-indexing for different demographics.

Path to Solution

To arrive at a solution to where we can earn more revenue from specific demographics, it was important to index whether customers were below, above, or average in spending. This provided insight for which demographics were over or under-indexing in categories and where we could adjust our marketing investment. With this information, we can alter our advertising strategies or coupons to encourage more spending from those customers. The data required is the transactions, products, and demographics data.

How This Analysis Helps

This analysis provides easily intpretable plots that show where spending is low or high and provides an index score to compare to all other demographics. Data from plots shows areas of improvement for marketing and promotions.

Required Packages

Complete Journey - Provides required data for analysis

Tidyverse - Provides tools to tidy and visualize data

Knitr - Provides tools to create tables

# Load Packages
library(completejourney)
library(tidyverse)
library(knitr)
# Load Transactions Data
transactions <- get_transactions()
# Join Data
demo_purc <- transactions %>%
  left_join(products, by = "product_id") %>%
  left_join(demographics, by = "household_id")
# Share of Total Spend by Group
category_demo_summary <- demo_purc %>%
  group_by(income, product_category) %>%
  summarize(
    total_revenue = sum(sales_value),
    total_units = sum(quantity),
    unique_households = n_distinct(household_id),
    .groups = "drop"
  )

Now we have a simplified set of data to review spending habits

# Overall share by demographic group
overall_share <- demo_purc %>%
  group_by(income) %>%
  summarize(group_spend = sum(sales_value), .groups = "drop") %>%
  mutate(overall_share = group_spend / sum(group_spend))
# Share by category
category_share <- category_demo_summary %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup()
# Create index by combining demographic share with category share
index_scores <- category_share %>%
  left_join(overall_share, by = "income") %>%
  mutate(index = (category_share / overall_share) * 100)
# Find Groups above and below average spend
## Below Index
index_scores %>%
  filter(index < 75) %>%
  arrange(index) %>%
  select(income, product_category, index)
## # A tibble: 913 × 3
##    income    product_category               index
##    <ord>     <chr>                          <dbl>
##  1 15-24K    PROD SUPPLIES                   0   
##  2 35-49K    MEAT SUPPLIES                   0   
##  3 50-74K    BULK FOODS                      0   
##  4 50-74K    PROD SUPPLIES                   0   
##  5 75-99K    COFFEE SHOP SWEET GOODS&RETAIL  2.27
##  6 175-199K  PET CARE SUPPLIES               2.53
##  7 35-49K    BOTTLE DEPOSITS                 2.75
##  8 250K+     FRZN JCE CONC/DRNKS             2.80
##  9 Under 15K BIRD SEED                       3.30
## 10 50-74K    SERVICE BEVERAGE                3.54
## # ℹ 903 more rows
## Above Index
index_scores %>%
  filter(index > 125) %>%
  arrange(desc(index)) %>%
  select(income, product_category, index)
## # A tibble: 798 × 3
##    income   product_category       index
##    <ord>    <chr>                  <dbl>
##  1 200-249K RESTRICTED DIET        4607.
##  2 175-199K HOME HEALTH CARE       2485.
##  3 200-249K COUPONS/STORE & MFG    2088.
##  4 25-34K   MISCELLANEOUS CROUTONS 2077.
##  5 250K+    PROD SUPPLIES          1496.
##  6 250K+    MISCELLANEOUS HBC      1334.
##  7 200-249K VEAL                   1304.
##  8 250K+    FIREWORKS              1131.
##  9 125-149K PHARMACY               1096.
## 10 35-49K   NDAIRY/TEAS/JUICE/SOD   947.
## # ℹ 788 more rows
# Create plot to visualize results

# Narrow top categories to prevent over plotting
top_categories <- category_demo_summary %>%
  group_by(product_category) %>%
  summarize(total_households = sum(unique_households)) %>%
  slice_max(total_households, n = 10) %>%
  pull(product_category)
# Plot index scores on heat map to compare
index_scores %>%
  filter(product_category %in% top_categories) %>%  
  ggplot(aes(x = income, y = product_category, fill = index)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 100) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase Index by Income and Product Category",
       fill = "Index\n(100 = avg)")

ggsave("income_heatmap.png", width = 10, height = 6, dpi = 300)

Heat map shows high spending in soup and crackers categories for high income group and low spending in soft drinks and fluid milk.

Next we will look into age instead of income to check spending patterns by age

# Create age index to see spending habits based on age
age_index <- demo_purc %>%
  filter(!is.na(age)) %>%
  group_by(age, product_category) %>%
  summarize(total_revenue = sum(sales_value), .groups = "drop") |>
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(age)) %>%
      group_by(age) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = "age"
  ) %>%
  mutate(index = (category_share / overall_share) * 100)
# Plot to compare spending by age
age_index %>%
  filter(product_category %in% top_categories) %>%
  ggplot(aes(x = age, y = index, fill = index)) +
  geom_col() +
  geom_hline(yintercept = 100, linetype = "dashed", color = "black") +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 100) +
  facet_wrap(~ product_category, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase Index by Age and Product Category",
       x = "Age Range", y = "Index (100 = avg)", fill = "Index")

ggsave("age_bar.png", width = 10, height = 6, dpi = 300)

This bar chart shows lower spending on bag snacks and cheese in the oldest age group. There is high spending on either extreme for soup and high spending on eggs in the oldest age groups and high spending on soft drinks in the youngest group.

The next demographic grouping will look into household size

# Create household size index
size_index <- demo_purc %>%
  filter(!is.na(household_size)) %>%
  group_by(household_size, product_category) %>%
  summarize(total_revenue = sum(sales_value), .groups = "drop") %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(household_size)) %>%
      group_by(household_size) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = "household_size"
  ) %>%
  mutate(index = (category_share / overall_share) * 100)
# Plot household index to compare spending habits
size_index %>%
  filter(product_category %in% top_categories) %>%
  ggplot(aes(x = product_category, y = index, fill = household_size)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 100, linetype = "dashed", color = "black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase Index by Household Size and Product Category",
       x = "Product Category", y = "Index (100 = avg)", fill = "Household Size")

ggsave("household_size_bar.png", width = 10, height = 6, dpi = 300)

Household size shows large households are the biggest spenders when it comes to crackers, bread, and fluid milk products. On the other side, one person households spend more on eggs, soup, and soft drinks.

Next we review how marital status impacts spending

# Create marital status index
marital_index <- demo_purc %>%
  filter(!is.na(marital_status)) %>%
  mutate(marital_status = case_when(
    marital_status == "Married" ~ "Married",
    TRUE ~ "Unmarried"
  )) %>%
  group_by(marital_status, product_category) %>%
  summarize(total_revenue = sum(sales_value), .groups = "drop") %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(marital_status)) %>%
      mutate(marital_status = case_when(
        marital_status == "Married" ~ "Married",
        TRUE ~ "Unmarried"
      )) %>%
      group_by(marital_status) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = "marital_status"
  ) %>%
  mutate(
    index = (category_share / overall_share) * 100,
    deviation = index - 100
  )
# Plot marital status index to compare spending habits
marital_index %>%
  filter(product_category %in% top_categories) %>%
  ggplot(aes(x = reorder(product_category, deviation), y = deviation, fill = marital_status)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip() +
  scale_fill_manual(values = c("Married" = "blue", "Unmarried" = "red")) +
  labs(title = "Purchase Index Deviation by Marital Status and Product Category",
       x = "Product Category", y = "Deviation from Average (0 = avg)", fill = "Marital Status")

marital_table <- marital_index %>%
  filter(product_category %in% top_categories) %>%
  select(marital_status, product_category, index) %>%
  mutate(index = round(index, 1)) %>%
  pivot_wider(names_from = marital_status, values_from = index) %>%
  arrange(desc(Married)) %>%
  rename("Product Category" = product_category)

ggsave("marital_status_bar.png", width = 10, height = 6, dpi = 300)

The marital status chart shows us that married people spend more on average on crackers and bread and unmarried spend more on soft drinks, beef, and eggs.

Now we will combine age and income to check for new spending patterns

# Create combined index of age and income
combined_index <- demo_purc %>%
  filter(!is.na(income), !is.na(age)) %>%
  group_by(income, age, product_category) %>%
  summarize(total_revenue = sum(sales_value), .groups = "drop") %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(income), !is.na(age)) %>%
      group_by(income, age) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = c("income", "age")
  ) %>%
  mutate(index = (category_share / overall_share) * 100)
# Chart was too busy with 10 categories, so we will reduce to top 5
top_5_categories <- category_demo_summary %>%
  group_by(product_category) %>%
  summarize(total = sum(total_revenue)) %>%
  slice_max(total, n = 5) %>%
  pull(product_category)
# Plot combined index
combined_index %>%
  filter(product_category %in% top_5_categories) %>%
  ggplot(aes(x = income, y = index, fill = age)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 100, linetype = "dashed", color = "black") +
  facet_wrap(~ product_category, ncol = 2) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase Index by Income and Age Group",
       x = "Income", y = "Index (100 = avg)", fill = "Age Group")

ggsave("age_income_combined.png", width = 10, height = 6, dpi = 300)
age_income_table <- combined_index %>%
  filter(product_category %in% top_5_categories) %>%
  mutate(
    index = round(index, 1),
    age_income = paste(age, "|", income)
  ) %>%
  select(product_category, age_income, index) %>%
  pivot_wider(names_from = product_category, values_from = index) %>%
  rename("Age | Income" = age_income)

kable(age_income_table)
Age | Income BEEF CHEESE COUPON/MISC ITEMS FLUID MILK PRODUCTS SOFT DRINKS
19-24 | Under 15K 147.8 168.2 47.3 125.5 154.0
25-34 | Under 15K 201.6 100.4 84.4 78.3 157.1
35-44 | Under 15K 140.8 98.3 94.1 80.9 120.1
45-54 | Under 15K 108.6 118.1 105.5 92.5 129.9
55-64 | Under 15K 114.2 96.6 106.4 56.2 74.7
65+ | Under 15K 144.9 49.8 66.6 62.6 109.0
19-24 | 15-24K 118.2 77.9 12.6 61.1 209.4
25-34 | 15-24K 113.8 124.7 63.9 87.5 98.4
35-44 | 15-24K 115.5 80.9 98.8 105.2 145.4
45-54 | 15-24K 136.7 79.5 84.4 102.8 222.3
55-64 | 15-24K 127.1 115.3 28.4 85.9 110.6
65+ | 15-24K 69.0 61.6 51.3 129.4 68.1
19-24 | 25-34K 131.6 114.2 68.9 74.9 156.3
25-34 | 25-34K 115.6 119.1 45.7 115.4 92.1
35-44 | 25-34K 137.3 94.4 99.0 86.8 113.3
45-54 | 25-34K 121.3 112.2 92.4 115.4 91.6
55-64 | 25-34K 80.4 112.0 57.0 87.8 144.8
65+ | 25-34K 166.0 79.3 78.2 78.3 118.6
19-24 | 35-49K 71.9 119.9 119.3 97.1 93.8
25-34 | 35-49K 89.6 95.7 70.8 110.5 88.2
35-44 | 35-49K 120.9 117.0 111.6 92.9 85.9
45-54 | 35-49K 110.3 107.0 81.9 125.7 103.8
55-64 | 35-49K 164.7 102.1 86.4 105.8 99.5
65+ | 35-49K 80.3 70.2 75.3 103.4 127.5
19-24 | 50-74K 70.0 110.8 48.6 71.9 122.3
25-34 | 50-74K 84.9 130.0 124.4 107.8 84.2
35-44 | 50-74K 93.8 90.9 128.8 94.3 103.3
45-54 | 50-74K 101.5 103.8 96.2 102.3 90.1
55-64 | 50-74K 55.3 74.5 110.9 95.6 90.7
65+ | 50-74K 101.6 83.0 65.2 82.0 33.2
19-24 | 75-99K 137.2 168.5 12.2 53.8 273.8
25-34 | 75-99K 56.3 101.5 162.1 126.9 99.0
35-44 | 75-99K 89.6 97.7 91.2 93.6 97.9
45-54 | 75-99K 93.8 95.1 102.2 107.5 115.1
55-64 | 75-99K 122.0 130.7 101.7 66.5 77.2
65+ | 75-99K 59.7 88.5 157.2 141.7 71.2
25-34 | 100-124K 97.9 100.1 164.9 100.9 63.6
35-44 | 100-124K 41.1 90.5 81.7 107.2 113.4
45-54 | 100-124K 111.4 80.2 181.9 125.4 107.3
55-64 | 100-124K 101.3 60.0 130.8 93.3 94.6
65+ | 100-124K 61.2 63.0 150.0 90.4 91.9
25-34 | 125-149K 40.9 108.7 103.8 135.4 58.0
35-44 | 125-149K 73.5 102.4 97.0 96.1 59.0
45-54 | 125-149K 62.1 95.1 158.5 90.4 72.5
55-64 | 125-149K 45.5 177.3 201.6 130.8 95.6
65+ | 125-149K 96.0 95.8 110.7 109.6 138.2
25-34 | 150-174K 56.8 119.2 105.6 80.8 76.8
35-44 | 150-174K 87.3 95.7 129.4 106.0 103.0
45-54 | 150-174K 113.5 66.3 58.0 81.6 149.6
55-64 | 150-174K 81.2 54.3 186.2 50.0 37.2
65+ | 150-174K 93.4 49.8 NA 194.6 46.2
25-34 | 175-199K 2.2 43.1 114.1 7.4 72.1
35-44 | 175-199K 32.6 122.6 176.9 174.7 40.3
45-54 | 175-199K 79.2 74.2 83.8 68.7 20.1
55-64 | 175-199K 410.5 66.4 1.4 65.9 61.2
19-24 | 200-249K 5.2 158.9 142.9 20.8 66.4
25-34 | 200-249K 45.8 234.6 4.1 49.8 61.1
35-44 | 200-249K 134.3 62.1 12.8 42.3 119.4
45-54 | 200-249K 43.7 38.3 NA 102.0 47.8
19-24 | 250K+ 22.7 97.0 90.6 36.7 97.9
35-44 | 250K+ 98.3 67.3 77.1 93.6 66.3
45-54 | 250K+ 96.6 84.4 22.9 129.4 58.8

This combined chart shows that 55-64 upper income individuals spend significantly more on beef. Younger, high income individuals spend more on cheese. It also shows potential for growth for high income individuals with soft drinks and fluid milk products.

Next will be income and household size to see more patterns

# Create combined index of income and household size
combined_income_size <- demo_purc %>%
  filter(!is.na(income), !is.na(household_size)) %>%
  mutate(income_group = case_when(
    income %in% c("Under 15K", "15-24K", "25-34K")         ~ "Low (Under 35K)",
    income %in% c("35-49K", "50-74K")                       ~ "Middle (35-74K)",
    income %in% c("75-99K", "100-124K")                     ~ "Upper Middle (75-124K)",
    income %in% c("125-149K", "150-174K", "175-199K", "200K+") ~ "High (125K+)",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(income_group)) %>%
  group_by(income_group, household_size, product_category) %>%
  summarize(total_revenue = sum(sales_value),
            total_transactions = n(),
            .groups = "drop") %>%
  filter(total_transactions >= 30) %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(income), !is.na(household_size)) %>%
      mutate(income_group = case_when(
        income %in% c("Under 15K", "15-24K", "25-34K")         ~ "Low (Under 35K)",
        income %in% c("35-49K", "50-74K")                       ~ "Middle (35-74K)",
        income %in% c("75-99K", "100-124K")                     ~ "Upper Middle (75-124K)",
        income %in% c("125-149K", "150-174K", "175-199K", "200K+") ~ "High (125K+)",
        TRUE ~ NA_character_
      )) %>%
      filter(!is.na(income_group)) %>%
      group_by(income_group, household_size) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = c("income_group", "household_size")
  ) %>%
  mutate(
    index = (category_share / overall_share) * 100,
    income_group = factor(income_group, levels = c("Low (Under 35K)", "Middle (35-74K)", 
                                                    "Upper Middle (75-124K)", "High (125K+)"))
  )
# Plot combined index
combined_income_size %>%
  filter(product_category %in% top_5_categories) %>%
  ggplot(aes(x = index, y = income_group, color = household_size)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_vline(xintercept = 100, linetype = "dashed", color = "black") +
  facet_wrap(~ product_category, ncol = 2) +
  labs(title = "Purchase Index by Income Group and Household Size",
       x = "Index (100 = avg)", y = "Income Group", color = "Household Size")

income_size_table <- combined_income_size %>%
  filter(product_category %in% top_5_categories) %>%
  mutate(index = round(index, 1)) %>%
  select(income_group, household_size, product_category, index) %>%
  pivot_wider(names_from = product_category, values_from = index) %>%
  arrange(income_group, household_size) %>%
  rename("Income Group" = income_group, "Household Size" = household_size)

kable(income_size_table)
Income Group Household Size BEEF CHEESE COUPON/MISC ITEMS FLUID MILK PRODUCTS SOFT DRINKS
Low (Under 35K) 1 121.8 113.4 85.6 93.3 108.2
Low (Under 35K) 2 129.7 85.6 85.7 95.7 131.3
Low (Under 35K) 3 141.9 110.9 51.6 109.0 114.6
Low (Under 35K) 4 128.4 122.2 51.5 104.5 259.4
Low (Under 35K) 5+ 87.2 90.2 114.6 101.0 119.4
Middle (35-74K) 1 108.4 106.0 89.2 105.0 103.1
Middle (35-74K) 2 95.9 101.2 103.9 96.6 89.9
Middle (35-74K) 3 85.1 100.9 84.8 95.7 82.6
Middle (35-74K) 4 97.2 101.4 140.0 114.0 83.8
Middle (35-74K) 5+ 100.7 114.8 80.7 117.3 84.8
Upper Middle (75-124K) 1 58.2 89.9 136.5 103.3 126.7
Upper Middle (75-124K) 2 98.1 92.9 109.8 106.2 92.8
Upper Middle (75-124K) 3 85.1 104.3 101.4 95.0 126.7
Upper Middle (75-124K) 4 63.3 93.2 154.8 121.9 90.5
Upper Middle (75-124K) 5+ 139.7 101.7 89.3 103.5 60.6
High (125K+) 1 66.5 60.6 108.8 73.6 89.9
High (125K+) 2 102.3 87.0 98.2 76.1 69.4
High (125K+) 3 64.7 110.3 81.6 100.6 103.0
High (125K+) 4 48.2 103.2 132.8 120.2 37.2
High (125K+) 5+ 54.8 103.3 176.6 113.8 69.4
ggsave("income_size_dotplot.png", width = 10, height = 6, dpi = 300)

High income households with 1-2 people spend less on cheese. 4 person households are complete opposites for soft drink spending, so there is room for growth in high income to spend more on soft drinks.

Summary

This analysis uncovered areas to increase revenue through targeted marketing based on existing spending habits and where there is over/under-indexing for different demographics. These areas include soft drinks and fluid milk among high-income customers as well as beef for 55-64 year olds, and cheese for young, high-income customers.

Recommendations

I recommend a premium soft drink marketing campaign to reach the high-income demographic based on their current low index. This can include sparkling water and premium sodas. I also recommend organic or locally sourced milk promotions for high-income customers. For upper-income, 55-64 year olds, there should be a focus on premium beef/quality cuts since they spend significantly more in that category. To target the young, high-income individuals, there should be cheese promotions focused on quality cheese and pairing guides. To reach single person households in the bag snacks and cheese categories, there should be a promotion for convenient, single serve options. For large households of 4+ people, soft drinks should be the foucs with promotions on large quantity value packs to grow sales from high-income households.

Limitations

Some limitations in this analysis include the data being older and not covering a long time period. Using an index could shows relative comparison, but does not show full purchasing power. Price sensitivity is also a variable not considered in this analysis.