Introduction

Being “data driven” has become a ubiquitous descriptor for any serious firm in the digital age. One of the most tangible and obvious benefits of being data driven is the scourge of targeted advertising: knowing who the customer is, and what they want. In this preliminary analysis and visualization, we break down spending trends by demographic combination to give a fuller picture of the activity of the firm’s customer base as well as a broader understanding of the niche the firm inhabits in the economic environment. These insight have implications from stocking decisions, to shortage anticipations, to coupon deliver and management, to advertising campaigns, and so on.

Required Packages

  • library(“tidyverse”)
  • library(“lubridate”)
  • library(“completejourney”)
  • library(“ggplot2”)
  • library(“tidyr”)
  • library(“printr”)

The Data

This data is from The Complete Journey Study, representing grocery store shopping transactions over the course of a full year. This is a micro-level panel data set that persistently matches transactions and demographic qualities to individual households and I joined, cleaned, and recoded the data for easier use in my analysis. Several variables which were factor levels were converted into binary variables, and all missing values were dropped.

transactions <- get_transactions()
demographics <- demographics
products <- products

df<- transactions %>%
  full_join(demographics, by = "household_id") %>%
  full_join(products, by = "product_id")
  
df_cleaned <- df %>%
  drop_na(household_size, kids_count, income, age, home_ownership, marital_status) %>%
  mutate(kids_binary = if_else(as.numeric(kids_count) == 1, 0, 1)) %>%
  mutate(kids_binary = factor(kids_binary, levels = c(0, 1), labels = c("No Kids", "Kids"))) %>%
  mutate(home_ownership_binary = if_else(home_ownership == "Homeowner", 1, 0)) %>%
  mutate(home_ownership_binary = factor(home_ownership_binary, levels = c(0, 1), labels = c("Not Homeowner", "Homeowner")))

After our changes, here is a list of the salient variables of interest used in the analysis, the kind of variable they are, and what those variables denote:

  • Income (Factor) - A list twelve different income brackets in ascending order.

  • Sales Value (Numeric) - cost of a purchased good

  • Quantity (Numeric) - quantity of a good purchased during a single checkout

  • Product Type (Character) - verbal description of a good’s category

  • Household Size (Factor) - List of household sizes, ranging from one to five

  • Kids (Binary) - Records whether or not a household has any children at all

  • Homeowner (Binary) - Records whether or not the household has a homeowner

  • Marriage (Binary) - Records whether or not members of the household are married

Preliminary Analysis

A basic question any firm needs to ask: Who are our target audience? Is this grocery store a Whole Foods, or a Safeway? To begin to apprehend this question, we investigate the relationship between income and spending in our data set.

df_income_sales_summary <- df %>%
  group_by(income) %>%
  summarize(
    total_sales_value = sum(sales_value, na.rm = TRUE),
    avg_sales_value = mean(sales_value, na.rm = TRUE)
  ) %>%
  ungroup()

ggplot(df_income_sales_summary, aes(x = income, y = total_sales_value)) +
  geom_point(aes(color = income), size = 3) +
  labs(
    x = "Income", 
    y = "Total Sales Value", 
    title = "Total Sales Value vs. Income"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  


ggplot(df_income_sales_summary, aes(x = income, y = avg_sales_value)) +
  geom_point(aes(color = income), size = 3) +
  labs(
    x = "Income", 
    y = "Average Sales Value", 
    title = "Average Sales Value vs. Income"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
Average Sales Value Vs Income
Average Sales Value Vs Income

As we can see, the average spending per household is as we might expect, with progressively higher levels of spending as we jump income brackets, with a few curious exceptions. The lowest income bracket has higher average spending than the second, third, and fourth lowest.

Total Sales Value Vs Income
Total Sales Value Vs Income

However, terms of total revenue coming into the store, the plurality seems to come from the middle class, in the 35k - 99k range with the mode of total sales value coming from the 50k-74k income bracket. This seems to characterize our firm as a inhabiting a middle of the road Krogeresque place in the commercial landscape.

Investigating by Demographics

We make attempt to characterize spending activity by demographic, which may allow us to better target potential customers with advertising or coupon campaigns, et cetera. We begin by asking:

Which demographics spend the most?

df_summary<- df_cleaned %>%
  group_by(household_size, kids_binary, income, age, home_ownership_binary, marital_status) %>%
  summarize(total_sales_value = sum(sales_value, na.rm = TRUE)) %>%
  arrange(desc(total_sales_value))

print(df_summary)

df_average<- df_cleaned %>%
  group_by(household_size, kids_binary, income, age, home_ownership_binary, marital_status) %>%
  summarize(average_sales_value = mean(sales_value, na.rm = TRUE)) %>%
  arrange(desc(average_sales_value))

print(df_average)

ggplot(df_summary, aes(x = income, y = total_sales_value, fill = income)) +
  geom_bar(stat = "identity") +
  facet_grid(kids_binary ~ home_ownership_binary + marital_status) + 
  labs(x = "Income", y = "Total Sales Value", title = "Total Sales Value by Income and Demographic Combinations") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(df_average, aes(x = income, y = average_sales_value, fill = income)) +
  geom_bar(stat = "identity") +
  facet_grid(kids_binary ~ home_ownership_binary + marital_status) + 
  labs(x = "Income", y = "Total Sales Value", title = "Total Sales Value by Income and Demographic Combinations") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Average Salves Value by Income and Demographic Combinations It’s no surprise: The plurality contributions to average sales come from Married Homeowners with Kids, by far. Non-homeowner contributions are paltry compared to the contributions of homeowners.

Total Salves Value by Income and Demographic Combinations Those contributions are mirrored in the total sales data. There is obviously endogenity here – married homeowners with kids tend to have more disposable income, so those individuals are clustered in the middle and upper end of the income data rather than on the lower end of it, which itself entails more sales. But we’re not doing regression analysis here!

What products have the highest sales value per demographic group?

df_product_summary<- df_cleaned %>%
  group_by(household_size, kids_binary, income, age, home_ownership_binary, marital_status, product_type) %>%
  summarize(total_sales_value = sum(sales_value, na.rm = TRUE)) %>%
  ungroup()

df_top_products <- df_product_summary %>%
  group_by(kids_binary, home_ownership_binary, marital_status) %>%
  arrange(desc(total_sales_value)) %>%
  slice_head(n = 5) %>%
  ungroup()

print(df_top_products, n=40)

df_product_average<- df_cleaned %>%
  group_by(household_size, kids_binary, income, age, home_ownership_binary, marital_status, product_type) %>%
  summarize(average_sales_value = mean(sales_value, na.rm = TRUE)) %>%
  ungroup()

df_top_products_average <- df_product_average %>%
  group_by(kids_binary, home_ownership_binary, marital_status) %>%
  arrange(desc(average_sales_value)) %>%
  slice_head(n = 5) %>%
  ungroup()

print(df_top_products_average, n=40)

Looking at the aggregate sales numbers turns out to be rather uninformative. All demographic combinations have their top 5 products by income bracket as some combination of cigarettes and gasoline with some very minor exceptions. We already know there is inelastic demand for these products. The table displaying Average sales numbers is more illuminating:

Average Sales by Product, Differentiated by Demographic Characteristics
Average Sales by Product, Differentiated by Demographic Characteristics

Recall that this is also differentiating by income bracket and household size, which helps to explain the reapted categories. But in any case, it’s clear what the real cash cows for the grocery business seem to be: Gasoline, Cigarettes, Gift Cards, Beef, Patio Furniture, and Infant Formula. Homeowners buy a great deal of patio furniture compared to non-home owners, which accords with our priors. Infant formula has a high average sales value for those unmarried with kids, but not for those whom are married with kids, which is interesting.

Which products sell the most volume per demographic group?

We can analyze this data in a different way: by looking at quantity rather than salves value. This may give us useful information about where mark-ups can be placed, where profit margins are high, and where they are low. This seems reasonable, but when conducting preliminary analysis, every single product was gasoline. These result filter out gasoline, but remember: For every income level and every combination of demographic information, gasoline is the real highest value in terms of both top sales value, top quantity, and average quantity.

df_summary_quantity <- df_cleaned %>%
  group_by(household_size, income, age, kids_binary, home_ownership_binary, marital_status, product_type) %>%
  summarize(total_quantity_sold = sum(quantity, na.rm = TRUE)) %>%
  ungroup()

df_top_quantity <- df_summary_quantity %>%
  filter(product_type != "GASOLINE-REG UNLEADED") %>%
  group_by(kids_binary, home_ownership_binary, marital_status) %>%
  arrange(desc(total_quantity_sold)) %>%
  slice_head(n = 5) %>%
  ungroup()

print(df_top_quantity, n=40)

df_average_quantity <- df_cleaned %>%
  filter(product_type != "GASOLINE-REG UNLEADED") %>%
  group_by(household_size, income, age, kids_binary, home_ownership_binary, marital_status, product_type) %>%
  summarize(average_quantity_sold = mean(quantity, na.rm = TRUE)) %>%
  ungroup()

df_top_quantity_average <- df_average_quantity %>%
  group_by(kids_binary, home_ownership_binary, marital_status) %>%
  arrange(desc(average_quantity_sold)) %>%
  slice_head(n = 5) %>%
  ungroup()

print(df_top_quantity_average, n=40)

write.csv(df_top_quantity, file='top quantity.csv')
write.csv(df_top_quantity_average, file='top quantity avg.csv')
Average Quantity by Product, Differentiated by Demographic Characteristics
Average Quantity by Product, Differentiated by Demographic Characteristics

Middle America eats a lot of corn - and that’s valuable information for any firm. Corn is nowhere near our highest sales either in total or on average, but dominates the list of highest quantities. Where these charts fail to corroborate each other is likely where our margins are thin.

Conclusions

The evidence seems compelling: appealing to married homeowners with kids appears crucial. Send out coupons for diapers and school supplies; have two-for-one deals on caulk, wire, and any other household repair equipment. These could even be loss-leaders: Just get married homeowners with kids in the door. But beyond that: the bulk of your sales come from the middle class. Expanding access to gasoline and cigarettes seems critical – high in quantity and sales, they’re very reliable sources of revenue. If you can identify which demographic combinations that customers belong to, targeted advertisements and coupon campaigns can get them in the door fast – Homeowners spend a lot on patio furniture, the unmarried with children spend a lot on infant formula, and everyone spends a lot on gift cards. If internal metrics indicate that there is any degree of price inelasticity in corn, or apples, or any of the other high-quantity goods, small price changes can really add up. This information conveys who your customers are, and what they want, and that seems valuable for any business to comprehend.