Introduction

Potential Growth in Demographic Areas

I am trying to solve the business problem of not marketing correctly to specific demographics. This analysis uncovers areas to increase revenue through targeted marketing based on existing spending habits and where there is over/under-indexing for different demographics.

Path to Solution

To arrive at a solution to where we can earn more revenue from specific demographics, it was important to index whether customers were below, above, or average in spending. This provided insight for which demographics were over or under-indexing in categories and where we could adjust our marketing investment. With this information, we can alter our advertising strategies or coupons to encourage more spending from those customers. The data required is the transactions, products, and demographics data.

How This Analysis Helps

This analysis provides easily intpretable plots that show where spending is low or high and provides an index score to compare to all other demographics. Data from plots shows areas of improvement for marketing and promotions.

Required Packages

Complete Journey - Provides required data for analysis

Tidyverse - Provides tools to tidy and visualize data

Knitr - Provides tools to create tables

# Load Packages
library(completejourney)
library(tidyverse)
library(knitr)

# Load Transactions Data
transactions <- get_transactions()

# Join Data
demo_purc <- transactions %>%
  left_join(products, by = "product_id") %>%
  left_join(demographics, by = "household_id")

# Share of Total Spend by Group
category_demo_summary <- demo_purc %>%
  group_by(income, product_category) %>%
  summarize(
    total_revenue = sum(sales_value),
    total_units = sum(quantity),
    unique_households = n_distinct(household_id),
    .groups = "drop"
  )

Now we have a simplified set of data to review spending habits

# Overall share by demographic group
overall_share <- demo_purc %>%
  group_by(income) %>%
  summarize(group_spend = sum(sales_value), .groups = "drop") %>%
  mutate(overall_share = group_spend / sum(group_spend))

# Share by category
category_share <- category_demo_summary %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup()

# Create index by combining demographic share with category share
index_scores <- category_share %>%
  left_join(overall_share, by = "income") %>%
  mutate(index = (category_share / overall_share) * 100)

# Find Groups above and below average spend
## Below Index
index_scores %>%
  filter(index < 75) %>%
  arrange(index) %>%
  select(income, product_category, index)

## # A tibble: 913 × 3
##    income    product_category               index
##    <ord>     <chr>                          <dbl>
##  1 15-24K    PROD SUPPLIES                   0   
##  2 35-49K    MEAT SUPPLIES                   0   
##  3 50-74K    BULK FOODS                      0   
##  4 50-74K    PROD SUPPLIES                   0   
##  5 75-99K    COFFEE SHOP SWEET GOODS&RETAIL  2.27
##  6 175-199K  PET CARE SUPPLIES               2.53
##  7 35-49K    BOTTLE DEPOSITS                 2.75
##  8 250K+     FRZN JCE CONC/DRNKS             2.80
##  9 Under 15K BIRD SEED                       3.30
## 10 50-74K    SERVICE BEVERAGE                3.54
## # ℹ 903 more rows

## Above Index
index_scores %>%
  filter(index > 125) %>%
  arrange(desc(index)) %>%
  select(income, product_category, index)

## # A tibble: 798 × 3
##    income   product_category       index
##    <ord>    <chr>                  <dbl>
##  1 200-249K RESTRICTED DIET        4607.
##  2 175-199K HOME HEALTH CARE       2485.
##  3 200-249K COUPONS/STORE & MFG    2088.
##  4 25-34K   MISCELLANEOUS CROUTONS 2077.
##  5 250K+    PROD SUPPLIES          1496.
##  6 250K+    MISCELLANEOUS HBC      1334.
##  7 200-249K VEAL                   1304.
##  8 250K+    FIREWORKS              1131.
##  9 125-149K PHARMACY               1096.
## 10 35-49K   NDAIRY/TEAS/JUICE/SOD   947.
## # ℹ 788 more rows

# Create plot to visualize results

# Narrow top categories to prevent over plotting
top_categories <- category_demo_summary %>%
  group_by(product_category) %>%
  summarize(total_households = sum(unique_households)) %>%
  slice_max(total_households, n = 10) %>%
  pull(product_category)
# Plot index scores on heat map to compare
index_scores %>%
  filter(product_category %in% top_categories) %>%  
  ggplot(aes(x = income, y = product_category, fill = index)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 100) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase Index by Income and Product Category",
       fill = "Index\n(100 = avg)")

ggsave("income_heatmap.png", width = 10, height = 6, dpi = 300)

Heat map shows high spending in soup and crackers categories for high income group and low spending in soft drinks and fluid milk.

Next we will look into age instead of income to check spending patterns by age

# Create age index to see spending habits based on age
age_index <- demo_purc %>%
  filter(!is.na(age)) %>%
  group_by(age, product_category) %>%
  summarize(total_revenue = sum(sales_value), .groups = "drop") |>
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(age)) %>%
      group_by(age) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = "age"
  ) %>%
  mutate(index = (category_share / overall_share) * 100)
# Plot to compare spending by age
age_index %>%
  filter(product_category %in% top_categories) %>%
  ggplot(aes(x = age, y = index, fill = index)) +
  geom_col() +
  geom_hline(yintercept = 100, linetype = "dashed", color = "black") +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 100) +
  facet_wrap(~ product_category, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase Index by Age and Product Category",
       x = "Age Range", y = "Index (100 = avg)", fill = "Index")

ggsave("age_bar.png", width = 10, height = 6, dpi = 300)

This bar chart shows lower spending on bag snacks and cheese in the oldest age group. There is high spending on either extreme for soup and high spending on eggs in the oldest age groups and high spending on soft drinks in the youngest group.

The next demographic grouping will look into household size

# Create household size index
size_index <- demo_purc %>%
  filter(!is.na(household_size)) %>%
  group_by(household_size, product_category) %>%
  summarize(total_revenue = sum(sales_value), .groups = "drop") %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(household_size)) %>%
      group_by(household_size) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = "household_size"
  ) %>%
  mutate(index = (category_share / overall_share) * 100)
# Plot household index to compare spending habits
size_index %>%
  filter(product_category %in% top_categories) %>%
  ggplot(aes(x = product_category, y = index, fill = household_size)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 100, linetype = "dashed", color = "black") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase Index by Household Size and Product Category",
       x = "Product Category", y = "Index (100 = avg)", fill = "Household Size")

ggsave("household_size_bar.png", width = 10, height = 6, dpi = 300)

Household size shows large households are the biggest spenders when it comes to crackers, bread, and fluid milk products. On the other side, one person households spend more on eggs, soup, and soft drinks.

Next we review how marital status impacts spending

# Create marital status index
marital_index <- demo_purc %>%
  filter(!is.na(marital_status)) %>%
  mutate(marital_status = case_when(
    marital_status == "Married" ~ "Married",
    TRUE ~ "Unmarried"
  )) %>%
  group_by(marital_status, product_category) %>%
  summarize(total_revenue = sum(sales_value), .groups = "drop") %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(marital_status)) %>%
      mutate(marital_status = case_when(
        marital_status == "Married" ~ "Married",
        TRUE ~ "Unmarried"
      )) %>%
      group_by(marital_status) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = "marital_status"
  ) %>%
  mutate(
    index = (category_share / overall_share) * 100,
    deviation = index - 100
  )
# Plot marital status index to compare spending habits
marital_index %>%
  filter(product_category %in% top_categories) %>%
  ggplot(aes(x = reorder(product_category, deviation), y = deviation, fill = marital_status)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip() +
  scale_fill_manual(values = c("Married" = "blue", "Unmarried" = "red")) +
  labs(title = "Purchase Index Deviation by Marital Status and Product Category",
       x = "Product Category", y = "Deviation from Average (0 = avg)", fill = "Marital Status")

marital_table <- marital_index %>%
  filter(product_category %in% top_categories) %>%
  select(marital_status, product_category, index) %>%
  mutate(index = round(index, 1)) %>%
  pivot_wider(names_from = marital_status, values_from = index) %>%
  arrange(desc(Married)) %>%
  rename("Product Category" = product_category)

ggsave("marital_status_bar.png", width = 10, height = 6, dpi = 300)

The marital status chart shows us that married people spend more on average on crackers and bread and unmarried spend more on soft drinks, beef, and eggs.

Now we will combine age and income to check for new spending patterns

# Create combined index of age and income
combined_index <- demo_purc %>%
  filter(!is.na(income), !is.na(age)) %>%
  group_by(income, age, product_category) %>%
  summarize(total_revenue = sum(sales_value), .groups = "drop") %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(income), !is.na(age)) %>%
      group_by(income, age) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = c("income", "age")
  ) %>%
  mutate(index = (category_share / overall_share) * 100)
# Chart was too busy with 10 categories, so we will reduce to top 5
top_5_categories <- category_demo_summary %>%
  group_by(product_category) %>%
  summarize(total = sum(total_revenue)) %>%
  slice_max(total, n = 5) %>%
  pull(product_category)
# Plot combined index
combined_index %>%
  filter(product_category %in% top_5_categories) %>%
  ggplot(aes(x = income, y = index, fill = age)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 100, linetype = "dashed", color = "black") +
  facet_wrap(~ product_category, ncol = 2) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Purchase Index by Income and Age Group",
       x = "Income", y = "Index (100 = avg)", fill = "Age Group")

ggsave("age_income_combined.png", width = 10, height = 6, dpi = 300)

age_income_table <- combined_index %>%
  filter(product_category %in% top_5_categories) %>%
  mutate(
    index = round(index, 1),
    age_income = paste(age, "|", income)
  ) %>%
  select(product_category, age_income, index) %>%
  pivot_wider(names_from = product_category, values_from = index) %>%
  rename("Age | Income" = age_income)

kable(age_income_table)

Age \| Income	BEEF	CHEESE	COUPON/MISC ITEMS	FLUID MILK PRODUCTS	SOFT DRINKS
19-24 \| Under 15K	147.8	168.2	47.3	125.5	154.0
25-34 \| Under 15K	201.6	100.4	84.4	78.3	157.1
35-44 \| Under 15K	140.8	98.3	94.1	80.9	120.1
45-54 \| Under 15K	108.6	118.1	105.5	92.5	129.9
55-64 \| Under 15K	114.2	96.6	106.4	56.2	74.7
65+ \| Under 15K	144.9	49.8	66.6	62.6	109.0
19-24 \| 15-24K	118.2	77.9	12.6	61.1	209.4
25-34 \| 15-24K	113.8	124.7	63.9	87.5	98.4
35-44 \| 15-24K	115.5	80.9	98.8	105.2	145.4
45-54 \| 15-24K	136.7	79.5	84.4	102.8	222.3
55-64 \| 15-24K	127.1	115.3	28.4	85.9	110.6
65+ \| 15-24K	69.0	61.6	51.3	129.4	68.1
19-24 \| 25-34K	131.6	114.2	68.9	74.9	156.3
25-34 \| 25-34K	115.6	119.1	45.7	115.4	92.1
35-44 \| 25-34K	137.3	94.4	99.0	86.8	113.3
45-54 \| 25-34K	121.3	112.2	92.4	115.4	91.6
55-64 \| 25-34K	80.4	112.0	57.0	87.8	144.8
65+ \| 25-34K	166.0	79.3	78.2	78.3	118.6
19-24 \| 35-49K	71.9	119.9	119.3	97.1	93.8
25-34 \| 35-49K	89.6	95.7	70.8	110.5	88.2
35-44 \| 35-49K	120.9	117.0	111.6	92.9	85.9
45-54 \| 35-49K	110.3	107.0	81.9	125.7	103.8
55-64 \| 35-49K	164.7	102.1	86.4	105.8	99.5
65+ \| 35-49K	80.3	70.2	75.3	103.4	127.5
19-24 \| 50-74K	70.0	110.8	48.6	71.9	122.3
25-34 \| 50-74K	84.9	130.0	124.4	107.8	84.2
35-44 \| 50-74K	93.8	90.9	128.8	94.3	103.3
45-54 \| 50-74K	101.5	103.8	96.2	102.3	90.1
55-64 \| 50-74K	55.3	74.5	110.9	95.6	90.7
65+ \| 50-74K	101.6	83.0	65.2	82.0	33.2
19-24 \| 75-99K	137.2	168.5	12.2	53.8	273.8
25-34 \| 75-99K	56.3	101.5	162.1	126.9	99.0
35-44 \| 75-99K	89.6	97.7	91.2	93.6	97.9
45-54 \| 75-99K	93.8	95.1	102.2	107.5	115.1
55-64 \| 75-99K	122.0	130.7	101.7	66.5	77.2
65+ \| 75-99K	59.7	88.5	157.2	141.7	71.2
25-34 \| 100-124K	97.9	100.1	164.9	100.9	63.6
35-44 \| 100-124K	41.1	90.5	81.7	107.2	113.4
45-54 \| 100-124K	111.4	80.2	181.9	125.4	107.3
55-64 \| 100-124K	101.3	60.0	130.8	93.3	94.6
65+ \| 100-124K	61.2	63.0	150.0	90.4	91.9
25-34 \| 125-149K	40.9	108.7	103.8	135.4	58.0
35-44 \| 125-149K	73.5	102.4	97.0	96.1	59.0
45-54 \| 125-149K	62.1	95.1	158.5	90.4	72.5
55-64 \| 125-149K	45.5	177.3	201.6	130.8	95.6
65+ \| 125-149K	96.0	95.8	110.7	109.6	138.2
25-34 \| 150-174K	56.8	119.2	105.6	80.8	76.8
35-44 \| 150-174K	87.3	95.7	129.4	106.0	103.0
45-54 \| 150-174K	113.5	66.3	58.0	81.6	149.6
55-64 \| 150-174K	81.2	54.3	186.2	50.0	37.2
65+ \| 150-174K	93.4	49.8	NA	194.6	46.2
25-34 \| 175-199K	2.2	43.1	114.1	7.4	72.1
35-44 \| 175-199K	32.6	122.6	176.9	174.7	40.3
45-54 \| 175-199K	79.2	74.2	83.8	68.7	20.1
55-64 \| 175-199K	410.5	66.4	1.4	65.9	61.2
19-24 \| 200-249K	5.2	158.9	142.9	20.8	66.4
25-34 \| 200-249K	45.8	234.6	4.1	49.8	61.1
35-44 \| 200-249K	134.3	62.1	12.8	42.3	119.4
45-54 \| 200-249K	43.7	38.3	NA	102.0	47.8
19-24 \| 250K+	22.7	97.0	90.6	36.7	97.9
35-44 \| 250K+	98.3	67.3	77.1	93.6	66.3
45-54 \| 250K+	96.6	84.4	22.9	129.4	58.8

This combined chart shows that 55-64 upper income individuals spend significantly more on beef. Younger, high income individuals spend more on cheese. It also shows potential for growth for high income individuals with soft drinks and fluid milk products.

Next will be income and household size to see more patterns

# Create combined index of income and household size
combined_income_size <- demo_purc %>%
  filter(!is.na(income), !is.na(household_size)) %>%
  mutate(income_group = case_when(
    income %in% c("Under 15K", "15-24K", "25-34K")         ~ "Low (Under 35K)",
    income %in% c("35-49K", "50-74K")                       ~ "Middle (35-74K)",
    income %in% c("75-99K", "100-124K")                     ~ "Upper Middle (75-124K)",
    income %in% c("125-149K", "150-174K", "175-199K", "200K+") ~ "High (125K+)",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(income_group)) %>%
  group_by(income_group, household_size, product_category) %>%
  summarize(total_revenue = sum(sales_value),
            total_transactions = n(),
            .groups = "drop") %>%
  filter(total_transactions >= 30) %>%
  group_by(product_category) %>%
  mutate(category_share = total_revenue / sum(total_revenue)) %>%
  ungroup() %>%
  left_join(
    demo_purc %>%
      filter(!is.na(income), !is.na(household_size)) %>%
      mutate(income_group = case_when(
        income %in% c("Under 15K", "15-24K", "25-34K")         ~ "Low (Under 35K)",
        income %in% c("35-49K", "50-74K")                       ~ "Middle (35-74K)",
        income %in% c("75-99K", "100-124K")                     ~ "Upper Middle (75-124K)",
        income %in% c("125-149K", "150-174K", "175-199K", "200K+") ~ "High (125K+)",
        TRUE ~ NA_character_
      )) %>%
      filter(!is.na(income_group)) %>%
      group_by(income_group, household_size) %>%
      summarize(group_spend = sum(sales_value), .groups = "drop") %>%
      mutate(overall_share = group_spend / sum(group_spend)),
    by = c("income_group", "household_size")
  ) %>%
  mutate(
    index = (category_share / overall_share) * 100,
    income_group = factor(income_group, levels = c("Low (Under 35K)", "Middle (35-74K)", 
                                                    "Upper Middle (75-124K)", "High (125K+)"))
  )
# Plot combined index
combined_income_size %>%
  filter(product_category %in% top_5_categories) %>%
  ggplot(aes(x = index, y = income_group, color = household_size)) +
  geom_point(size = 4, alpha = 0.8) +
  geom_vline(xintercept = 100, linetype = "dashed", color = "black") +
  facet_wrap(~ product_category, ncol = 2) +
  labs(title = "Purchase Index by Income Group and Household Size",
       x = "Index (100 = avg)", y = "Income Group", color = "Household Size")

income_size_table <- combined_income_size %>%
  filter(product_category %in% top_5_categories) %>%
  mutate(index = round(index, 1)) %>%
  select(income_group, household_size, product_category, index) %>%
  pivot_wider(names_from = product_category, values_from = index) %>%
  arrange(income_group, household_size) %>%
  rename("Income Group" = income_group, "Household Size" = household_size)

kable(income_size_table)

Income Group	Household Size	BEEF	CHEESE	COUPON/MISC ITEMS	FLUID MILK PRODUCTS	SOFT DRINKS
Low (Under 35K)	1	121.8	113.4	85.6	93.3	108.2
Low (Under 35K)	2	129.7	85.6	85.7	95.7	131.3
Low (Under 35K)	3	141.9	110.9	51.6	109.0	114.6
Low (Under 35K)	4	128.4	122.2	51.5	104.5	259.4
Low (Under 35K)	5+	87.2	90.2	114.6	101.0	119.4
Middle (35-74K)	1	108.4	106.0	89.2	105.0	103.1
Middle (35-74K)	2	95.9	101.2	103.9	96.6	89.9
Middle (35-74K)	3	85.1	100.9	84.8	95.7	82.6
Middle (35-74K)	4	97.2	101.4	140.0	114.0	83.8
Middle (35-74K)	5+	100.7	114.8	80.7	117.3	84.8
Upper Middle (75-124K)	1	58.2	89.9	136.5	103.3	126.7
Upper Middle (75-124K)	2	98.1	92.9	109.8	106.2	92.8
Upper Middle (75-124K)	3	85.1	104.3	101.4	95.0	126.7
Upper Middle (75-124K)	4	63.3	93.2	154.8	121.9	90.5
Upper Middle (75-124K)	5+	139.7	101.7	89.3	103.5	60.6
High (125K+)	1	66.5	60.6	108.8	73.6	89.9
High (125K+)	2	102.3	87.0	98.2	76.1	69.4
High (125K+)	3	64.7	110.3	81.6	100.6	103.0
High (125K+)	4	48.2	103.2	132.8	120.2	37.2
High (125K+)	5+	54.8	103.3	176.6	113.8	69.4

ggsave("income_size_dotplot.png", width = 10, height = 6, dpi = 300)

High income households with 1-2 people spend less on cheese. 4 person households are complete opposites for soft drink spending, so there is room for growth in high income to spend more on soft drinks.

Regork Demographics Analysis

Evan Wilson

2026-02-26